Studies that use measures of cerebro-acoustic coherence have shown that theta oscillations (3–10 Hz) entrain to syllable-size modulations in the energy envelope of speech. This entrainment creates sensory windows in processing acoustic cues. Recent reports suggest that delta oscillations (<3 Hz) can be entrained by nonsensory content units like phrases and serve to process meaning, though such views face fundamental problems. Other studies suggest that delta underlies a sensory chunking linked to the processing of sequential attributes of speech sounds. This chunking, associated with the “focus of attention,” is commonly manifested in the temporal grouping of items in sequence recall. Similar grouping in speech may entrain delta. We investigate this view by examining how low-frequency oscillations entrain to three types of stimuli (tones, nonsense syllables, and utterances) having similar timing, pitch, and energy contours. Entrainment was indexed by “intertrial phase coherence” in the EEGs of 18 listeners. The results show that theta oscillations at central sites entrain to syllable-size elements in speech and tones. However, delta oscillations at frontotemporal sites specifically entrain to temporal groups in both meaningful utterances and meaningless syllables, indicating that delta entrainment may support, but does not directly reflect, a processing of content. Overall, the findings suggest that, whereas theta entrainment relates to a processing of acoustic attributes, delta entrainment links to a sensory chunking that relates to a processing of properties of articulated sounds. The results also show that measures of intertrial phase coherence can be better suited than measures of cerebro-acoustic coherence for revealing delta entrainment.
It is widely acknowledged that endogenous neural oscillations in EEG and MEG selectively entrain to sensory signals. In seminal articles on these effects, Lakatos et al. (2005) and Schroeder et al. (Schroeder & Lakatos, 2009; Schroeder, Lakatos, Kajikawa, Partan, & Puce, 2008) noted that neural responses to speech are particularly revealing in that delta (<3 Hz), theta (3–10 Hz), and gamma (>40 Hz) oscillations can phase-lock to different structural patterns. Specifically, it was suggested that theta oscillations can align with modulations in the energy envelope corresponding to “syllables” and could entrain gamma oscillations, which relate to short-time acoustic indices, such as formant transitions. But delta-band oscillations presented a puzzle because they do not appear to be entrained by energy contours. As the authors noted, “Given the influence of delta phase on theta and gamma amplitudes, it is at first paradoxical that little of the energy in vocalizations is found in the delta range, but prosody (intonation and rhythm) is important in speech perception and it is conveyed at rates of 1–3 Hz, which corresponds to the lower delta oscillation band” (Schroeder et al., 2008, p. 108). In studies that followed, various measures of cerebro-acoustic coherence were applied, which largely focused on the entrainment of theta by modulations in the amplitude envelope (e.g., Luo & Poeppel, 2012; Giraud et al., 2007). However, it is only recently that studies have examined the entrainment of delta and not always by reference to modulations in amplitude.
One early study by Gross et al. (2013; see also Park, Ince, Schyns, Thut, & Gross, 2015) reported that delta oscillations align with various “slow speech envelope variations” (Gross et al., 2013, p. 6). In fact, the particular patterns or structures that entrain delta waves remain unclear, being variably attributed to “intonation, prosody, and phrases” (Park et al., 2015; Giraud & Poeppel, 2012; Cogan & Poeppel, 2011), “accent phrases” (Martin, 2014), “prosodic phrasing” (Molinaro, Lizarazu, Lallier, Bourguignon, & Carreiras, 2016; Bourguignon et al., 2013), “metrical stress and syllable structure” (Peelle, Gross, & Davis, 2013), “long syllables” (Ghitza, 2017; Doelling, Arnal, Ghitza, & Poeppel, 2014), “words” (Kovelman et al., 2012), “phrases” (Meyer, Henry, Gaston, Schmuck, & Friederici, 2017), and “sentences” (Peelle et al., 2013). Yet others contend that delta may not entrain strictly to energy modulations but also to comodulations in frequency (e.g., Henry, Herrmann, & Obleser, 2014; Obleser, Herrmann, & Henry, 2012), somewhat like “melody” in music (Patel & Balaban, 2000), though it is uncertain how such views apply to speech. This is not a marginal issue or a problem of terminology. Inasmuch as delta, theta, and gamma frequencies bear on the processing of different types of information in speech (Cogan & Poeppel, 2011), defining entraining structures in utterances is central to any account of the role of neural oscillations in speech processing.
For example, one often-cited account (Giraud & Poeppel, 2012) suggests that oscillations in the auditory cortex shape theta- and gamma-size sensory windows, organizing the spiking activity of cortical neurons in response to speech. As for the role of these windows, it is recognized that theta oscillations are involved in processing distinctive acoustic features, and this is reflected in correlations between theta entrainment and judgments of speech intelligibility (Peelle et al., 2013; Peelle & Davis, 2012; Howard & Poeppel, 2010). To illustrate how this operates, one can refer to temporal indices in speech acoustics like VOTs and formant transitions, which underlie heard distinctions such as /ta-da/ and /da-ga/, respectively. These temporal indices vary with speaking rate. For instance, the VOT of /da/ produced at slow rates of speech can become similar to the VOT of /ta/ at fast rates (Boucher, 2002). Yet listeners' categorization of /ta-da/ is unaffected by changing speech rates because the indices vary proportionally to other events within syllable-size sensory frames. In fact, removing parts of the “vowel” within these perceptual frames can bias the categorization of intact VOTs and transitions (Toscano & McMurray, 2015; Boucher, 2002). It is this integration of acoustic indices within syllable-size sensory windows that links to theta oscillations. Such a principle has recently been confirmed, in the case of formant transitions, by Ten Oever and Sack (2015), who showed that the entrainment of theta oscillations at different speech rates can bias the categorization of ambiguous syllables perceived as /da/ or /ga/ (see also Ten Oever et al., 2016; for scaling effects of rate on low-frequency oscillations, see Borges, Giraud, Mansvelder, & Linkenkaer-Hansen, 2018). But if theta oscillations integrate feature information over syllable-size windows, what type of information is integrated by delta oscillations?
On the Role of Delta-size Windows: Reviewing Claims of a Nonsensory Entrainment to Content Units
On the latter question, research by Gross et al. (2013) and Park et al. (2015) using MEG and measures of cerebro-acoustic coherence has shown that delta oscillations are more strongly entrained by forward- than backward-presented speech despite similar energy patterns (but see Zoefel & VanRullen, 2016). Because only forward speech has interpretable content, the authors concluded that low-frequency oscillations are not only influenced by sensory (acoustic) aspects but also by knowledge of linguistic content (Peelle et al., 2013). The reports, however, did not point to any particular elements or units that could entrain delta. On this point, a recent study by Meyer et al. (2017) has attributed the content effects to “phrases.” The study used German utterances where the meaning of the initial parts could influence listeners' parsing of a following syntactic phrase with ambiguous prosodic marks. Measures of EEG coherence suggested that listeners' inferred phrase boundaries biased delta-band coherence, and this top–down effect was taken to show that delta is influenced by content units independent of sensory prosody. In Meyer's (2018) view, “because syntactic phrases do not have a physical counterpart in speech, they must be projected in the course of speech decoding by an internal cognitive process that requires the generative application of syntactic knowledge, rather than the tracking of information present in speech” (p. 4). Another investigation by Ding, Melloni, Zhang, Tian, and Poeppel (2016) has extended this viewpoint and attributed content effects to various content units, including words, phrases, and sentences. Using MEG and electrocorticographic techniques, the experiments of Ding et al. involved the presentation of synthesized Chinese utterances where prosody was removed, except for equally timed syllables of 250 msec.
Each syllable was said to constitute “words” arranged in two-syllable “phrases” and four-syllable “sentences” (other stimuli were also used). Measures of spectral power showed peaks corresponding to the separate sets of units, and this occurred for Chinese-speaking listeners, whereas only syllable-related peaks appeared for English listeners. From these results, it was concluded that knowledge of content units affects low-frequency neural oscillations independently of sensory marks. Though the authors noted that their study was not designed to address the entrainment of neural oscillations (Ding et al., 2016, p. 8), the observations of changes in spectral power were interpreted as evidence that cortical activity entrains to knowledge of linguistic content units (p. 7). The issue, however, is that the preceding reports involving speech stimuli with unusual or absent prosody do not support views of a “nonsensory” entrainment of delta by content units. In fact, such interpretations conflict with available observations and can be rejected a priori on fundamental grounds.
Specifically, reports of the differential entrainment of delta to speech presented in forward and backward order (as in Park et al., 2015; Gross et al., 2013) do not support the conclusion that this effect owes to knowledge of linguistic content. These results show that delta responds to order information in speech sounds, but they do not address the issue of an entrainment to elements or units of content. Similarly, observations of a delta-band phase coherence (or band-power coherence) in the absence of acoustic marks do not force the conclusion that delta entrains to content units like phrases independent of sensory prosody. Indeed, neural responses to “learned” prosodic patterns need not involve sensory marks. For instance, it has been established that, in silent reading, EEG responses of readers align with prosodic structures, as if texts are covertly spoken (Magrassi, Aromataris, Cabrini, Annovazzi-Lodi, & Moro, 2015; Steinhauer, 2003; Steinhauer & Friederici, 2001). In other words, reading tasks show top–down effects of prosody on neural responses. This suggests that listeners are processing verbal material via a knowledge of prosody that is generally acquired from lifelong sensory experiences in speaking a language. Thus, when test participants hear syllables in their native language even with unfamiliar synthetic timing and tones (as in Ding et al., 2016) or when they read text, their knowledge of prosody may well impact neural oscillations in the absence of acoustic marks. But one may not conclude from this that the entrainment of neural oscillations is independent of a sensory prosody and relates instead to units of content. However, more fundamental problems undermine attempts to relate neural oscillations to notional units of linguistic content.
Though assumed forms like words, phrases, and sentences are readily identified in text, they are notoriously difficult to isolate in continuous speech. There is currently no general definition or mark by which to divide these units in utterances and, most often, divisions are assumed by reference to such aspects as spaces and punctuation in writing (Haspelmath, 2011; Miller & Weinert, 1998; Pergnier, 1986). Studies show that concepts of word units only arise with the learning of an alphabet (Havron & Arnon, 2017; Onderdelinden, van de Craats, & Kurvers, 2009), and it is known that not all cultures divide such forms (e.g., Hoosain, 1991, 1992). This indeterminacy of conventional linguistic units and their ethnocultural basis question the idea of a biological, cortical tracking of such elements by endogenous neural oscillations. It also needs to be acknowledged that the length of words, phrases, and sentences can be so short as to make any assumption of a match between these types of units and low-frequency oscillations untenable. For example, in countless short “sentences” such as I'm done, Take'm, Sue's gone, We'll see (etc.), it is difficult to ascertain whether parts of syllables like 'm, 's, 'll constitute “words” or if I'm…, Sue's…, We'll… (etc.) are subject–verb “phrases.” But whatever these units are, it should be recognized that elements with durations in the order of 50–200 msec are not likely to be tracked by endogenous oscillations in the delta range. This is not to say that syntactic–semantic elements have no effect. However, it appears that the kind of information that is processed in delta-size windows may not be commensurate with conventional linguistic units, which are conceptually linked to alphabet writing.
Delta-size Windows and the Sensory Chunking of Speech: The Present Hypothesis and Measures
As noted, the finding that delta oscillations are differentially entrained by speech presented in forward and reverse order may not as such reflect effects of linguistic content. Instead, the observations more directly suggest that delta entrainment is influenced by order information in articulated sounds. In developing this view, it is essential to consider that, compared with general sound perception, speech perception bears on articulated sounds, which reflect ordered sensorimotor events. Moreover, though both speech and sound perception imply a processing of signals that unfold in time, speech communication critically rests on a memory of the sequential order of uttered sounds. Yet sequence memory has a limited span, and this can inherently impose a chunking of signals. This principle was first noted by G. A. Miller (1962), who saw that speech communication does not operate by interpreting signals one segment at a time, but by chunks. More recent accounts refer to a domain-general “sensory chunking” commonly observed in sequence learning (Terrace, 2001). For instance, in reciting novel lists of letters or nonsense syllables, chunking marks generally appear in terms of characteristic inter-item delays that create temporal groupings (for illustrations, see Gilbert, Boucher, & Jemel, 2015; Terrace, 2001). This chunking arises spontaneously even when repeating verbal stimuli without prosody (Boucher, 2006). Some theorists of working memory view such chunking as reflecting the span of a “focus of attention” (Cowan, 2000). Several studies also link top–down effects of attention and perception of temporal patterns in speech or music to delta oscillations (for a review of this work, see Ding & Simon, 2014). However, this research has not addressed the issue of how the entrainment of delta oscillations relates to a sensory chunking of speech (see, however, Ghitza, 2017).
In previous work, we used the signature marks of temporal grouping in verbal recall and showed that such marks in speech evoke EEG responses reflecting an online perceptual chunking of speech sounds (Gilbert et al., 2015; Gilbert, Boucher, & Jemel, 2014). It is worth noting that inter-item delays that mark temporal groups in speech can often correspond to end points of “intonational phrases.” However, the results of Gilbert et al. showed separate positive shifts for temporal groups within tonal phrases, indicating a specific processing of signature marks of chunking. In subsequent tests where sentences and series of nonsense syllables were presented with timing and tonal patterns, the size of temporal groups (but not tonal phrases) was found to affect listeners' recognition memory of heard items, and this was also reflected in differential N400 responses to items within different-sized chunks (Gilbert et al., 2014). On this effect, a recent report has confirmed that presentations of digits in delta-sized chunks influence recall (Ghitza, 2017). Hence, the findings suggest a processing of speech in terms of marked sensory chunks, which may well link to delta oscillations via a principle of entrainment. But it is important to note that, because sensory chunks are associated with “timing” marks and grouping, their effect on delta entrainment may not be revealed by measures that focus on a cerebro-acoustic coherence between brain oscillations and “energy” contours. For this reason, the present report refers to intertrial phase coherence (ITPC; Delorme & Makeig, 2004), which serves to explore how neural entrainment may be evoked not only by energy modulations but also by other temporal or tonal patterns. Finally, it should be emphasized that a sensory chunking of speech has been shown to operate in processing both utterances and strings of nonsense syllables (Gilbert et al., 2014, 2015). 
Thus, it appears that, if delta entrainment underlies a sensory chunking, then its function would not bear directly on linguistic information but may instead link to a processing of the sequential sensorimotor properties of speech sounds.
Considering these findings, this study examines the view that delta oscillations entrain to signature marks of temporal grouping, which can create sensory chunks in processing sequential attributes of speech. Whether such entrainment bears specifically on the processing of sensorimotor aspects of articulated sounds (as opposed to a processing of acoustic or content information) is determined by observing ITPC for three types of stimuli: series of pure tones, meaningless syllable strings, and utterances. These contexts are elaborated with similar patterns of timing, pitch, and energy. The predicted effects serve to evaluate the above view of delta entrainment to sensorimotor attributes of speech. Specifically, because pure tones do not involve articulated sounds, it is predicted that delta oscillations will entrain differently to timing groups in tones as compared with groups in speech, irrespective of the presence of content units. Thus, in submitting that delta entrainment can create sensory windows involved in processing articulated sounds, it is expected that timing marks in speech, in contrast to marks in tones, entrain delta at similar sites whether the speech stimuli are utterances or nonsense syllables with no linguistic content.
Nineteen university-level students, all native speakers of French aged 20–30 years (mean = 24.7 years; 12 women), completed the experiment. Data of one participant were removed owing to excessive artifacts in the EEG recordings. The participants were dominant right-handers (Oldfield, 1971); all showed normal hearing levels on an audiometric screening test, and none reported a history of substance abuse, neurological deficits, or psychiatric disorders. Participation was voluntary, and the test protocol followed the guidelines of the Comité d'éthique de la recherche de l'Hôpital Rivière-des-Prairies (Montréal, Canada).
Stimuli and Procedure
The speech stimuli were elaborated using a pacing technique (as in Gilbert et al., 2014, 2015). With this method, a speaker listens to series of pure tones presenting syllable-like modulations, pitch contours, and temporal groups and reproduces these patterns while uttering given series of nonsense syllables and sentences. The technique avoids the potential effects of using synthetic speech and serves to create natural-sounding stimuli while allowing control of timing and pitch attributes. In the present case, the speech stimuli were produced with long and short syllables having duration ratios of about 1.8:1, which is typical of casual French speech (Fant, Kruckenberg, & Nord, 1991), and all utterances presented a similar contour present in habitual speech (Martin, 2015; Delattre, 1965). The main contexts comprised three types of stimuli: 48 series of tones, 48 series of nonsense syllables, and 48 sentences. Additional sets of “fillers” (48 tokens per type) were elaborated with irregular timing and pitch so as to add variety to the presentations. All the stimuli were aligned so that the “onset” of acoustic files was timed to the “P-center” of the initial syllables or sounds, given that the “perceptual” onsets of the acoustic contexts do not correspond to 0 msec where there is no signal (Marcus, 1981; Morton, Marcus, & Frankish, 1976).
To illustrate the similar acoustic properties of the main stimuli, Figure 1 presents an overlay of the analyzed pitch (F0) and amplitude contours (given by the rectified and smoothed waveforms) of the signals constituting the three types of stimuli. One can see that the pacing technique served to create speech contexts with patterns similar to the tone series, despite the inherent jitter of spoken sequences. In particular, the short and long tones (top panel) are similar to syllable-size modulations in the energy contours of speech, with periods of about 185 and 315 msec (∼5.4 and 3.2 Hz). It is important to note that, in these stimuli, alternations of two short tones and one long tone correspond to three-syllable temporal groups of approximately 685 msec (∼1.4 Hz). In terms of the linguistic content of the presented utterances, the main contexts were sentences containing monosyllabic “words” (as divided in text). These were arranged in similar syntactic sets with the subject and verb expressed in the first three syllables, followed by three complement “phrases.” As for the series of nonsense syllables, all comprised different consonant–vowel items drawn from a repertoire of meaningless units. These syllables, all nonwords (with a lexical frequency of zero), were randomly distributed across the 12-syllable sequences, with the restriction that no combination formed a recognizable expression in usual speech (as evaluated by two external judges) and that no consecutive syllables had repeated place features or vowels (so as to avoid effects of alliteration and rhyme).
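As a quick arithmetic check, the periods reported above convert to modulation frequencies as follows (a minimal sketch using the approximate durations given in the text; small rounding differences from the cited ∼5.4, 3.2, and 1.4 Hz values reflect only the approximate periods):

```python
# Convert the stimulus periods described above into modulation frequencies.
# Durations (in seconds) are the approximate values reported in the text.
periods_s = {"short element": 0.185, "long element": 0.315, "temporal group": 0.685}
freqs_hz = {name: round(1.0 / dur, 2) for name, dur in periods_s.items()}
print(freqs_hz)  # {'short element': 5.41, 'long element': 3.17, 'temporal group': 1.46}
```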
The recorded stimuli were played back using E-Prime (1.0; Psychology Software Tools) and delivered via insert earphones (Eartone 3A, EAR Auditory Systems). The stimuli were played at a fixed intensity with peak sound levels of 68 dBA at the inserts. Blocks of randomly ordered stimuli were presented by type, beginning with tones, then nonsense syllables, and ending with utterances. This provided a gradual increase in the amount of presented information. Participants were instructed to listen to the stimuli while looking at a fixation point on a screen, and their task (unrelated to this study) was to indicate via a key press whether a target tone, syllable, or word presented after the stimulus was part of the heard stimulus or not.
EEG Recording and Analysis
EEGs were recorded over 64 Ag/AgCl electrodes arrayed in an elastic cap (Wavegard Cap) according to the enhanced 10–20 system. EEG gain was 20,000, and signals from DC to 100 Hz were A/D-converted using the ASA EEG/ERP system (ANT Neuro, Enschede) at a 16-bit resolution and 1-kHz sampling rate. During EEG recording, all electrodes were referenced online to a common average, and electrode impedances were kept below 5 kΩ. Bipolar electrodes placed vertically above and below the dominant eye (VEOG) and horizontally at the outer canthus of each eye (HEOG) served to record eye movements. Offline analyses used EEProbe (3.3, ANT Software), Fieldtrip (Oostenveld, Fries, Maris, & Schoffelen, 2011), and custom scripts in MATLAB (version R2015b; The MathWorks). Using the ASA system software, continuous EEG signals were corrected for ocular artifacts using a PCA-based method (Ille, Berg, & Scherg, 2002). The continuous EEG files for each subject were then epoched according to stimulus type into 8000-msec segments, representing trial intervals of −4000 to +4000 msec relative to stimulus onsets. EEG segments with muscular and other artifacts were automatically rejected if the standard deviation of any scalp electrode within a 200-msec sliding window exceeded 20 μV. On this basis, 10.3% of trials were rejected, leaving, on average, 42.7, 42.8, and 43.5 artifact-free trials per participant for each stimulus type. For these trials, current source densities were computed using a spherical spline function (Perrin, Pernier, Bertrand, & Echallier, 1989) implemented in the CSD toolbox (1.1; Kayser & Tenke, 2006) with the spline order set to m = 4 and a smoothing constant of λ = 2.5 × 10−5. Finally, time–frequency analyses were applied using customized routines of EEGlab (Delorme & Makeig, 2004). To extract ITPC, single-trial CSD epochs were convolved with Morlet wavelets whose width increased linearly from 1 to 15 cycles across the frequency range of 0.25–50 Hz, in 0.5-Hz steps.
The resulting ITPC matrix (phase-locking factor) is a channel-based measure of the time course of phase synchronization within a frequency band relative to the experimental event.
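In essence, ITPC at a given frequency and time point is the length of the mean of unit-length phase vectors across trials: 1 when every trial shows the same phase, near 0 when phases are random. The following NumPy sketch illustrates the measure only; it is not the EEGlab/Fieldtrip pipeline used in the study, and the wavelet construction, frequency, and sampling values are illustrative assumptions:

```python
import numpy as np

def morlet_wavelet(freq, n_cycles, fs):
    # Complex Morlet wavelet: a complex exponential under a Gaussian window
    # whose width is set by the number of cycles at the target frequency.
    sigma_t = n_cycles / (2 * np.pi * freq)
    t = np.arange(-3 * sigma_t, 3 * sigma_t, 1 / fs)
    return np.exp(2j * np.pi * freq * t) * np.exp(-t**2 / (2 * sigma_t**2))

def itpc(trials, freq, n_cycles, fs):
    """Intertrial phase coherence at one frequency for one channel.

    trials: array of shape (n_trials, n_times). Returns the ITPC time
    course, with values in [0, 1]."""
    w = morlet_wavelet(freq, n_cycles, fs)
    # Convolve each trial with the wavelet to obtain a complex analytic signal.
    analytic = np.array([np.convolve(tr, w, mode="same") for tr in trials])
    phases = analytic / np.abs(analytic)   # unit-length phase vectors
    return np.abs(phases.mean(axis=0))     # length of the mean vector
```

Note that, with n trials of random phase, ITPC has an expected value of roughly 1/√n rather than 0, which is one reason comparisons are made across conditions with matched trial counts.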
The selection of electrode sites and time intervals of interest was based on visual inspections of topographic plots and nonparametric cluster-based permutation tests. The latter served to confirm the time intervals of delta- and theta-band ITPC at which significant differences between tone and speech stimuli appeared for specific sites. The cluster-based tests used the procedure “ft_statistics_montecarlo” implemented in Fieldtrip (Maris & Oostenveld, 2007). In this procedure, dependent t tests are performed on each contrast (e.g., tone vs. utterance stimuli) for each electrode pair at every time point in a specified interval. A Monte Carlo probability value is estimated by randomly permuting conditions within participants and calculating the maximum cluster-level test statistic over 1000 iterations. The electrodes that pass a selected significance level are clustered by direction and spatial proximity, and then, these clusters are averaged across the time interval where contrasts are significant. In a final step following the selection of sites and intervals of interest, repeated-measures ANOVAs and post hoc tests were applied to averaged ITPC modulus values using SPSS (25.0).
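The logic of the cluster-based test can be illustrated in a simplified one-dimensional form (time points only, without the spatial clustering over electrode pairs; the function name, data shapes, and the sign-flip null are illustrative assumptions, and the actual analyses used Fieldtrip's ft_statistics_montecarlo):

```python
import numpy as np

def cluster_perm_test(cond_a, cond_b, t_crit=2.201, n_perm=1000, seed=0):
    """Simplified 1-D cluster-based permutation test (after Maris & Oostenveld, 2007).

    cond_a, cond_b: paired arrays of shape (n_subjects, n_times).
    t_crit: two-tailed cluster-forming threshold (2.201 for df = 11, alpha = .05).
    Returns a list of (start, end, p) tuples, one per observed cluster.
    """
    rng = np.random.default_rng(seed)
    diff = cond_a - cond_b
    n_sub = diff.shape[0]

    def clusters_and_masses(d):
        # Pointwise dependent t values, thresholded and grouped by temporal adjacency.
        t_vals = d.mean(0) / (d.std(0, ddof=1) / np.sqrt(n_sub))
        mask = np.abs(t_vals) > t_crit
        spans, masses, start = [], [], None
        for i, above in enumerate(mask):
            if above and start is None:
                start = i
            elif not above and start is not None:
                spans.append((start, i))
                masses.append(np.abs(t_vals[start:i]).sum())
                start = None
        if start is not None:
            spans.append((start, mask.size))
            masses.append(np.abs(t_vals[start:]).sum())
        return spans, masses

    spans, masses = clusters_and_masses(diff)
    # Null distribution of the maximum cluster mass: randomly flip the sign
    # of each subject's condition difference and re-cluster.
    null_max = np.empty(n_perm)
    for i in range(n_perm):
        flips = rng.choice([-1.0, 1.0], size=(n_sub, 1))
        _, null_masses = clusters_and_masses(diff * flips)
        null_max[i] = max(null_masses, default=0.0)
    return [(s, e, float((null_max >= m).mean())) for (s, e), m in zip(spans, masses)]
```

Testing the summed statistic of each cluster against the permutation null controls the family-wise error rate without requiring a correction at every individual time point.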
Figure 2 provides an overview of the distribution of ITPCs by way of time–frequency plots across sites for the three types of contexts. One can discern differences in the distributions of increases in ITPC: The tone stimuli created rises in ITPC mostly at mid-central sites, whereas the speech stimuli (both nonsense syllables and utterances) created rises laterally at fronto- and centrotemporal sites.
To further illustrate the differential effects of speech and nonspeech stimuli, Figure 3 presents the time–frequency ITPC at C5, Cz, and C6 for the three types of stimuli. The shaded areas in each plot reflect the bands 3–6 and 1–2 Hz, which encompass, respectively, the range of frequencies of syllable-size elements (∼3.2 and 5.4 Hz) and temporal groups (∼1.4 Hz) in the stimuli. As is clear in the figure, syllable-size elements and groups in speech and tone contexts entrain neural oscillations quite differently. One sees that, for the selected sites, the tone stimuli create negligible entrainment of delta oscillations in the range of 1–2 Hz. The speech stimuli, however, elicit a salient lateral entrainment of these delta waves at C5 and C6, and this appears for utterances as well as syllable sequences that have no linguistic content. Theta oscillations of 3–6 Hz do not manifest this differential entrainment for speech stimuli. As seen in Figure 3, rises in ITPC for these oscillations appear for both speech and tones especially at Cz.
As for the general topography of entrainment effects, Figure 4 maps the mean ITPCs for the frequency bands 1–2 and 3–6 Hz. ITPC in this case was calculated for an interval of 300–700 msec poststimulus, which, from a visual inspection of plots (Figures 2 and 3), presents the time frame where ITPC is most pronounced across conditions. As Figure 4 shows, ITPC within the ranges of delta (1–2 Hz) and theta (3–6 Hz) presents distinct regional patterns for the speech and nonspeech stimuli. Specifically, delta phase coherence rises bilaterally at fronto- and centrotemporal sites for the speech contexts (at FC6, C6, FT8, T8 and FC5, C5, FT7, T7). Such bilateral activation does not appear for the tone stimuli. As for theta-band oscillations, beyond the strong effects of presented tones, phase coherence arises at similar mid-central sites for speech and nonspeech contexts (mostly at C1, Cz, and C2). These comparative observations essentially support the predictions. Thus, oscillations in the theta band are entrained at similar central sites for speech and tones, suggesting that these oscillations may preferentially relate to a processing of acoustic attributes across different types of stimuli. The entrainment of delta oscillations, on the other hand, appears at lateral sites specifically with speech. Contrary to series of tones, speech involves articulated sounds bearing sensorimotor order information, and this may underlie delta entrainment at fronto- and centrotemporal regions. Against the speculation that such entrainment might also reflect top–down effects of linguistic information, the results show similar entrainment for utterances and meaningless syllables. Consequently, delta entrainment may bear on a processing of articulated sounds and not on a processing of meaning units like words and phrases.
As a further verification of the effects of stimulus type on delta- and theta-band ITPCs, time intervals of interest and relevant ROIs were determined using a clustering and a nonparametric Monte Carlo randomization test (1000 random draws; Maris & Oostenveld, 2007). This procedure showed that, for the previously identified ROIs comprising electrodes FC5, FT7, C5, T7, FC6, FT8, C6, and T8, significant differences in delta-band (1.3–1.6 Hz) ITPC appeared at 300–700 msec. For the ROIs of C1, Cz, and C2, significant differences in theta-band (3–6 Hz) ITPC appeared at a shorter interval of 370–450 msec. By reference to these intervals, repeated-measures ANOVAs were applied to grand means of ITPC for the sites of interest. It should be recalled that the relatively narrow bands of delta (1.3–1.6 Hz) and theta (3–6 Hz) roughly match the frequencies of temporal chunks and syllable-size elements in the stimuli. In this way, any entrainment effect of chunks and elements would be revealed by significant rises in ITPC in the specified delta and theta bands. Also, because delta- and theta-band ITPCs are distinctly distributed over different sets of electrodes (Figures 2–4), separate ANOVAs were applied to delta and theta with the aim of evaluating the specific effects of temporal chunks and syllable-size elements across the three types of contexts.
For the delta oscillations in the range of 1.3–1.6 Hz, a 3 × 3 ANOVA compared ITPCs across Types of stimuli (tones vs. nonsense syllables vs. utterances) and Sites of interest (grouped as left frontotemporal FC5, C5, FT7, T7 vs. right frontotemporal FC6, C6, FT8, T8 vs. central C1, Cz, C2). A Mauchly test showed that the sphericity of covariance errors could not be assumed for interaction effects. Consequently, a Greenhouse–Geisser correction was applied in calculating the significance of interactions. The analysis revealed a main effect of Stimuli type, F(2, 34) = 4.627, MSE = 0.011, η2 = .214, p = .017, but no significant effect across Sites, F(2, 34) = 1.427, MSE = 0.011, η2 = .077, p = .255, though these results are qualified by significant interaction effects, F(2.669, 45.371) = 7.509, MSE = 0.006, η2 = .306, p = .001. To clarify the interaction effects, Figure 5 serves to visualize the pooled ITPCs across sites and types of stimuli used in the ANOVA. One can see that, for delta oscillations in the range of 1.3–1.6 Hz, rises in ITPC occur at the left and right frontotemporal sites (but not at central sites). No significant differences between the utterance stimuli and nonsense syllables appeared at either the left, t(17) = 0.236, SEM = 0.032, ns, or the right temporal sites, t(17) = 1.348, SEM = 0.024, ns. In further comparing the values, only the rises for the right frontotemporal sites are significantly different from those for tones (using hypothesis-driven post hoc tests, for the utterance stimuli, t(17) = 2.537, SEM = 0.031, p = .02; for nonsense syllables, t(17) = 3.989, SEM = 0.028, p = .001; alpha levels need not be adjusted in this case; Ruxton & Beauchamp, 2008). These results essentially suggest that temporal chunks in speech stimuli created a comparatively stronger rightward entrainment of delta oscillations at frontotemporal locations.
As for theta oscillations of 3–6 Hz, because ITPC in this band dominated at central sites, a one-way ANOVA served to compare grand means of an ROI comprising C1, Cz, and C2 across types of stimuli. Because a Mauchly test did not support the assumption of error sphericity, a Greenhouse–Geisser correction was applied. The analysis revealed a significant effect of Type, F(1.539, 26.161) = 62.444, MSE = 0.006, η2 = .786, p < .001. Further comparisons via multiple t tests used a Bonferroni–Šidák correction of alpha (i.e., .01 corrected to .003; Olejnik, Li, Supattathum, & Huberty, 1997). On this basis, the tests revealed significant differences in ITPC between utterances and tones, t(17) = 8.889, SEM = 0.024, p < .001, and between nonsense syllables and tones, t(17) = 8.049, SEM = 0.022, p < .001, but no significant difference between utterances and nonsense syllables, t(17) = 2.860, SEM = 0.014, p = .011. These results, reflected in Figure 5, basically indicate that an entrainment of theta waves of 3–6 Hz occurred across contexts and was particularly pronounced at central sites and for tone stimuli. However, in contrast to delta oscillations at frontotemporal sites, theta oscillations at central sites are not specifically entrained by speech and, instead, appear to be entrained by more general periodic aspects of sounds.
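The Šidák adjustment used here sets the per-test alpha to 1 - (1 - α)^(1/m) for m comparisons, which reproduces the .003 threshold cited above:

```python
# Sidak-corrected per-test alpha: 1 - (1 - alpha)**(1/m) for m comparisons.
def sidak_alpha(alpha, m):
    return 1.0 - (1.0 - alpha) ** (1.0 / m)

# Three pairwise comparisons at a familywise alpha of .01:
print(round(sidak_alpha(0.01, 3), 4))  # → 0.0033
```

Against this corrected threshold, the utterances-versus-nonsense-syllables comparison (p = .011) indeed falls short of significance.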
The preceding results serve to clarify the role of delta oscillations in speech processing and shed critical light on recent reports in which there is some confusion about the particular speech attributes that entrain these oscillations. Research on neural entrainment to heard utterances has largely focused on how modulations in the energy envelope affect low-frequency oscillations, especially in the theta band. This has led to various measures of "cerebro-acoustic coherence," which assume that entrainment is revealed by the coherence between neural oscillations and energy patterns. But extending this assumption to delta waves has led to difficulties in defining both the patterns that entrain delta and the function of this entrainment.
For instance, reports that delta entrains differently to backward- and forward-presented speech despite similar amplitude patterns have led some to conclude that delta is influenced by listeners' processing of interpretable content in forward speech. Further to these reports, studies have explicitly related the entrainment of low-frequency oscillations to nonsensory units of linguistic content (e.g., Ding et al., 2016; Meyer et al., 2017). In this view, neural oscillations are said to entrain to forms like words and phrases independent of sensory prosody. These interpretations are inherently puzzling in that one is led to wonder how units with no sensory marks could come to be represented in the brain (or what "words" and "phrases" would look like without prosodic elements like syllables, for instance). More fundamentally, the fact that notions of units like words are culture-specific and arise from a knowledge of alphabetic writing undercuts the assumption that such forms are biologically tracked by brain oscillations. This study adopted another hypothesis whereby delta oscillations entrain to characteristic timing marks, which create sensory chunks in processing speech. According to this view, delta entrainment would not reflect a tracking of assumed content units like words or phrases but a general perceptual chunking that arises in hearing sequential attributes of articulated sounds. The results serve to contrast these two viewpoints.
In particular, it can be seen in the above figures that peak ITPC in the delta band occurs at centro- and frontotemporal electrodes for the speech stimuli but that such coherence is relatively absent at these sites for the tone stimuli. This indicates that delta entrainment specifically arises at lateral frontotemporal sites when listeners hear speech. It does not arise for sounds like tones that do not contain attributes of articulated sounds. By contrast, the results show that theta-band oscillations at central sites are entrained by both the speech and tone stimuli, suggesting that theta entrainment can relate to periodic patterns in different acoustic signals. It is also important to note that peaks of ITPC within the delta band narrowly match the periodicities of temporal groups in the speech stimuli (∼1.4 Hz; see especially Figure 3). Thus, the entrainment of delta-band oscillations when listeners attend to speech sounds can create sensory chunks that span a number of theta-size modulations in the energy envelope. Contrary to the assumption that this delta entrainment might also reflect top-down effects of linguistic content, the present results show similar entrainment for both meaningful utterances and nonsense syllables, which bear no linguistic content. Consequently, delta entrainment appears to link to a processing of sensorimotor aspects of articulated sounds and not to content units like words and phrases. On the other hand, one could argue that listeners may have induced content units like words or attempted lexical access when hearing nonsense syllables with prosody similar to the patterns of utterances. It should be recalled, however, that the stimuli were presented in blocks, where sequences of nonsense syllables followed a block of tone sequences, and each type of stimulus was introduced by practice contexts.
With this design, listeners could predict that upcoming stimuli were meaningless series, making it very unlikely that they would attempt lexical access across 96 sequences of nonsense syllables.
Overall, the above results suggesting that delta entrains to articulated sounds whereas theta-band oscillations entrain to acoustic attributes partly accord with a recent MEG study by Molinaro and Lizarazu (2018). That study involved two experiments comparing neural entrainment to speech and nonspeech stimuli. In one experiment, neural entrainment to presented speech was compared with entrainment to continuous white noise with energy modulations in the theta or delta band. In a second experiment, entrainment to speech was compared with entrainment to spectrally inverted speech. Measures of cerebro-acoustic coherence (on amplitude modulations) showed larger delta-band coherence for the speech stimuli compared with that obtained with the white noise or inverted speech. Yet, this differential effect did not occur for theta-band oscillations. These results led the authors to conclude that delta entrainment can be speech-specific whereas theta entrainment mainly reflects a processing of auditory signals. However, the continuous speech stimuli used in these experiments did not make it possible to determine the particular attributes of speech that entrain delta.
It should be remarked that the present finding that neural oscillations entrain to temporal marks rather than to energy or frequency modulations suggests certain advantages of indirect measures such as ITPC over direct tracking measures of cerebro-acoustic coherence. The latter indices rest on the idea that entrainment is revealed by analogous modulations in sensory and neural signals, whereas the present results indicate that this may not always apply. As for the question of how neural oscillations can physically entrain to temporal groups rather than to energy or frequency patterns, one can speculate that the storing of incoming sound sequences in a limited-capacity buffer may constitute a mechanism of segmentation commensurate with a delta-size sensory chunking (see also the concept of focus of attention; Cowan, 2000). Such a principle could well explain a neural entrainment to characteristic timing marks of perceptual chunking (Terrace, 2001).
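The contrast between the two families of measures can be made concrete: cerebro-acoustic coherence asks whether neural and acoustic signals share analogous modulations at a given frequency, whereas ITPC asks only whether neural phase is reproducible across trials. A minimal sketch of the former, using magnitude-squared coherence on simulated signals (all parameters here are illustrative, not those of the studies discussed):

```python
import numpy as np
from scipy.signal import coherence

# Cerebro-acoustic coherence in the direct-tracking sense: magnitude-squared
# coherence between a stimulus envelope and a "neural" recording. A response
# locked to timing marks but sharing no amplitude modulation with the envelope
# would show little coherence, which is the limitation noted in the text.
fs = 250
t = np.arange(0, 60, 1 / fs)
rng = np.random.default_rng(1)
envelope = np.sin(2 * np.pi * 4 * t)                 # 4 Hz "syllable" modulation
tracking = envelope + rng.standard_normal(t.size)    # response tracks the envelope
unrelated = rng.standard_normal(t.size)              # response with no tracking
f, c_track = coherence(envelope, tracking, fs=fs, nperseg=1000)
f, c_none = coherence(envelope, unrelated, fs=fs, nperseg=1000)
k = np.argmin(np.abs(f - 4))                         # bin nearest 4 Hz
print(c_track[k] > c_none[k])  # → True
```

Because this index depends on shared spectral content, it can miss entrainment to sparse temporal marks, whereas ITPC remains sensitive to any reproducible phase alignment.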
Finally, it may be objected that the above findings reflect the case of French speech and may not extrapolate across languages. It should be remembered, however, that sensory chunking is a domain-general process. Such chunking is typically observed in sequence learning, where inter-item delays arise, creating groups of elements (for cross-species observations of such temporal chunking, see Terrace, 2001). These signature marks of inter-item delays or lengthening appear across speakers despite language-specific prosody. For instance, some languages have lexical stress (e.g., English, German) whereas others do not (e.g., French, Japanese), and yet chunking patterns commonly appear in list recall across languages (Gilbert et al., 2015; on the independence of inter-item lengthening and lexical stress, see Oller, 1973). Hence, one would expect to find similar results using stimuli from different language typologies.
This research was supported by a team grant from the Fonds Québécois de la Recherche sur la Nature et la Technologie (FQRNT no. 175811).
Reprint requests should be sent to Victor J. Boucher, Université de Montréal, Département de linguistique, C.P. 6128, Montréal, QC, Canada H3C 3J7, or via e-mail: email@example.com.