During natural speech perception, listeners must track the global speaking rate, that is, the overall rate of incoming linguistic information, as well as transient, local speaking rate variations occurring within the global speaking rate. Here, we address the hypothesis that this tracking mechanism is achieved through coupling of cortical signals to the amplitude envelope of the perceived acoustic speech signals. Cortical signals were recorded with magnetoencephalography (MEG) while participants perceived spontaneously produced speech stimuli at three global speaking rates (slow, normal/habitual, and fast). As is inherent to spontaneously produced speech, these stimuli also featured local variations in speaking rate. The coupling between cortical and acoustic speech signals was evaluated using audio–MEG coherence. Modulations in audio–MEG coherence spatially differentiated between tracking of global speaking rate, highlighting the temporal cortex bilaterally and the right parietal cortex, and sensitivity to local speaking rate variations, emphasizing the left parietal cortex. Cortical tuning to the temporal structure of natural connected speech thus seems to require the joint contribution of both auditory and parietal regions. These findings suggest that cortical tuning to speech rhythm operates on two functionally distinct levels: one encoding the global rhythmic structure of speech and the other associated with online, rapidly evolving temporal predictions. Thus, it may be proposed that speech perception is shaped by evolutionary tuning, a preference for certain speaking rates, and predictive tuning, associated with cortical tracking of the constantly changing rate of linguistic information in a speech stream.
A large part of human auditory perception consists of listening to natural connected speech in everyday communicative situations. In these situations, speech perception is guided by the temporal regularities in natural connected speech, also referred to as speech rhythm. Speech rhythm is tied to speaking rate, that is, habitual word (2–3 Hz) and syllable (4–5 Hz) production frequencies (Alexandrou, Saarinen, Kujala, & Salmelin, 2016). Especially in natural speech, the speaking rate is remarkably flexible and demonstrates two types of intraindividual variation (Grosjean & Lane, 1976). First, the global speaking rate can be increased or decreased on-demand, as a speaker can voluntarily modulate speech production speed (Alexandrou et al., 2016; Grosjean & Lane, 1976); this induces changes in the global rhythmic structure of an utterance (Smith, Goffman, Zelaznik, Ying, & McGillem, 1995). Second, within a given global speaking rate, local, involuntary variations in speaking rate occur during the course of a single utterance (Miller, Grosjean, & Lomanto, 1984). Local variations in speaking rate underlie the quasi-rhythmic (as opposed to perfectly rhythmic) pattern observed in natural speech (Tilsen & Arvaniti, 2013).
Behavioral work has shown that, during natural speech perception, listeners track both the global rhythmic structure of the input, that is, the overall rate at which the linguistic information is transmitted over time, as well as the local rate of change in linguistic information (Reinisch, 2016; Baese-Berk et al., 2014; Dilley & Pitt, 2010). This behavioral evidence suggests that tracking both these aspects of speaking rate is crucial for successful speech comprehension. Specifically, it has been proposed that tracking of the global speaking rate allows the listener to extract global information in the speech stream, including segmental, prosodic, and nonlinguistic information. These, in turn, can influence perception of local events in speech. On the other hand, tracking the local speaking rate has been suggested to contribute to forming expectations about upcoming events and to help resolve spectrally ambiguous cues in the speech signal. This tracking has been shown to significantly influence the recognition of both function and content words (e.g., Miller, Green, & Schermer, 1984; Summerfield, 1981; Lasky, Weidner, & Johnson, 1976; Liberman, Delattre, Gerstman, & Cooper, 1956). However, the neural correlates of the tracking mechanism during the perception of natural, connected speech remain unknown.
Neuroimaging studies suggest that the cortex tracks incoming acoustic speech signals, which manifests as coupling between cortical signals and the amplitude envelope of the acoustic speech signal (acoustic amplitude envelope; Gross et al., 2013; Peelle, Gross, & Davis, 2013). Nevertheless, existing evidence that cortical signals track the global speaking rate stems from studies focusing on the perception of single sentences with artificially manipulated global speaking rate (Hertrich, Dietrich, & Ackermann, 2013; Ahissar et al., 2001). Moreover, the cortical tracking of local variations in speaking rate in spontaneously produced speech has not been examined to date (but see Kayser, Ince, Gross, & Kayser, 2015, for a study of phase-locking dynamics when artificially inducing local speaking rate variations in a read-aloud text). These studies have presented evidence that the coupling between brain signals and the acoustic amplitude envelope is modulated as a function of global speaking rate and local variations of speaking rate. However, these studies have employed non-naturalistic linguistic stimuli, and the variations in speaking rate were generated artificially. We suggest that stimuli of this kind, with their quite regular speaking rate and thus salient rhythmic patterning, do not represent instances of real-life language use. Moreover, such stimuli are presented in a highly controlled experimental setting, where the speech segments appear from complete silence and at regular intervals. This is at odds with the markedly quasi-rhythmic pattern of spontaneously produced speech encountered in everyday life. Finally, artificial increases in speaking rate have been shown to result in linear changes in the duration of linguistic units, whereas in spontaneously produced speech increases in speaking rate are carried out in a nonlinear manner; this, in turn, affects speech comprehension (Janse, 2004; Janse, Nooteboom, & Quené, 2003).
This study aims to provide a cortical-level characterization of the behaviorally observed tracking of speech rhythm during the perception of continuous, spontaneously produced speech, as we encounter it in real-life communicative situations. We assume that this tracking would be neurally implemented through the coupling of cortical signals with acoustic signals, and in line with behavioral evidence, it would be sensitive to both global speaking rate and local variations in speaking rate. Motivated by our previous work focusing on both behavioral and spectral analysis of the acoustic amplitude envelope (Alexandrou et al., 2016), as well as by empirical evidence (Park, Ince, Schyns, Thut, & Gross, 2015; Doelling, Arnal, Ghitza, & Poeppel, 2014; Peelle et al., 2013) and theoretical models (Ghitza, 2011, 2012; Giraud & Poeppel, 2012), we hypothesized that this tracking would occur at frequency ranges corresponding to the word (2–4 Hz) and syllable (4–7 Hz) production frequencies. Although the lower delta band (0.5–2 Hz) would have been potentially of interest (e.g., Molinaro, Lizarazu, Lallier, Bourguignon, & Carreiras, 2016; Vander Ghinst et al., 2016; Bourguignon et al., 2013), we chose to focus on the 2–4 Hz and 4–7 Hz bands for two reasons: First, in the acoustic amplitude, the frequency range of the lower delta band (0.5–1 Hz) corresponds to fluctuations in timescales that are mainly associated with prosodic features in a speech stream (Bourguignon et al., 2013; Poeppel, Idsardi, & van Wassenhove, 2008); these features are presumably not modulated as a function of speaking rate (Ramus, Nespor, & Mehler, 1999). Second, the power spectrum of cortical activity in the lower delta band is characterized by a notable 1/f trend (pink noise; Voss & Clarke, 1975), which would potentially confound subsequent data analyses (e.g., Demanuele, James, & Sonuga-Barke, 2007).
In this study, cortical signals were recorded with magnetoencephalography (MEG) while participants listened to natural connected speech stimuli at three global speaking rates: slow, normal (habitual), and fast. Audio–MEG coherence was used to quantify how neural signals from across the cortex track the acoustic amplitude envelope of the perceived speech stimuli. Reaching beyond single sentences, read-aloud text passages, or rehearsed narratives used in earlier related studies, the present stimuli consisted of unrehearsed, spontaneously produced connected speech, featuring natural variations in the local speaking rate. To our knowledge, there are no reports on the coupling between cortical and acoustic signals during perception of spontaneously produced speech. Spontaneously produced speech differs from the isolated sentences and the read-aloud and rehearsed speech stimuli used by previous studies with regard to, for instance, articulation rate (Jacewicz, Fox, O'Neill, & Salmons, 2009), articulatory patterns (Finke & Rogina, 1997), and prosodic features (Nakajima & Allen, 1993). These features influence, in turn, the structure of the acoustic amplitude envelope, which is a main factor that affects the coupling between acoustic and cortical signals. In addition, behavioral evidence suggests that listeners are able to distinguish between spontaneously produced and rehearsed or read-aloud speech stimuli (Chawla & Krauss, 1994), and the type of stimulus (rehearsed or spontaneously produced) shapes subsequent speech comprehension (Brennan & Schober, 2001). Therefore, this study aims to elucidate, for the first time, the coupling patterns that emerge during perception of real-life speech and how this coupling is modulated by one of the most important features of running speech: speaking rate.
Importantly, the modulations in global speaking rate employed in this study were carried out naturally, without a metronome, and the syllable production frequency for stimuli at all three rates fell within the natural range of syllable production frequencies (see, e.g., Tsao & Weismer, 1997).
We propose that coherence between audio and MEG signals is a neural mechanism that contributes to natural speech comprehension by helping listeners track both the global speaking rate and local speaking rate variations. We anticipate a spatial differentiation between the audio–MEG coherence patterns associated with tracking the global speaking rate and the local variations in speaking rate. Based on previous neuroimaging evidence (Hertrich et al., 2013; Ahissar et al., 2001), we hypothesize that the tracking of global speaking rate engages the auditory cortex bilaterally. Regarding local speaking rate variations, behavioral evidence suggests that such tracking is inherently predictive in nature (Brown, Dilley, & Tanenhaus, 2012; Koreman, 2006). We thus hypothesize that local speaking rate variations would highlight the auditory regions, which have been suggested to be sensitive to temporal regularities in speech stimuli (ten Oever et al., 2017), as well as additional regions, predominantly located in the parietal lobe, that have been suggested to contribute to shaping and updating internal predictive models (Bekinschtein et al., 2009; Andersen & Buneo, 2002).
The participants were 20 healthy, right-handed, native Finnish-speaking adults (11 women, 9 men; mean age = 24.5, range = 19–35 years) with normal hearing. Sample size was determined based on the previous observation that, for more experimentally controlled continuous tasks, a sample size of 10 can be sufficient to examine coherent cortical coupling (Saarinen, Jalava, Kujala, Stevenson, & Salmelin, 2015). In this study, we opted to double that sample size, estimating that it would be sufficient to account for the possibly reduced effect size in more naturalistic tasks. All participants gave their informed written consent before taking part in the experiment, in agreement with a prior approval of the Aalto University ethics committee.
The participants listened to six 40-sec segments (4 min) of connected speech at each of three speaking rates (normal, slow, and fast), spontaneously produced by an untrained male speaker unfamiliar to them. Speech production was prompted by questions (in Finnish) derived from the following thematic categories: own life, preferences, people, culture/traditions, society/politics, and general knowledge (see Alexandrou et al., 2016). The prompts were quite general (e.g., What kind of hobbies do you have or have had during your life? Describe a traditional Christmas holiday. What kind of foods do you like?). The speaker responded to 18 unique thematic questions, six at each speaking rate. The function of the thematic questions was to help the speaker verbalize his own thoughts; he was not required to provide a specific response to each question. Instead, the primary aim was to help the speaker produce fluent, uninterrupted speech at each speaking rate. For the slow-rate, the speaker was asked to reduce his normal speaking rate by 50%, preferably by increasing articulation time rather than the length of pauses. For the fast-rate, he was instructed to produce fluent, continuous speech at the highest speaking rate possible while preserving speech intelligibility and minimizing the occurrence of articulatory errors. During speech production, the speaker varied his speaking rate without the aid of any external pacing device. The speech stimuli were recorded in a soundproof room. Raw acoustic signals were collected with a dual diaphragm condenser microphone (B-2 PRO, Behringer) at a 44.1-kHz sampling frequency using Cool Edit 2000 (Syntrillium; see Figure 1 for excerpts of acoustic speech signals at each speaking rate). All stimuli were normalized to the same average intensity using Praat software (Institute of Phonetic Sciences, University of Amsterdam).
The acoustic signals recorded at each global speaking rate were transmitted to a transcribing company (Tutkimustie Oy, Tampere, Finland) for strict verbatim transcription (i.e., the acoustic signals were transcribed without editions or modifications). The stimuli can be made available upon request.
The mean syllable production frequency at each speaking rate was obtained by averaging the syllable production frequencies across the six responses at each rate. Mean (±SD) word production frequencies were 1.8 ± 0.2 Hz for the normal-rate, 1.0 ± 0.1 Hz for the slow-rate (58% of normal), and 2.8 ± 0.2 Hz for the fast-rate (157% of normal). Mean (±SD) syllable production frequencies were 4.7 ± 0.5 Hz for the normal-rate, 2.6 ± 0.3 Hz for the slow-rate (55% of normal), and 6.8 ± 0.5 Hz for the fast-rate (148% of normal). Henceforth, the term speaking rate will refer to syllable production frequency. Short pauses in speech were indicated with commas in the transcriptions, allowing a quantitative assessment of the pauses made by the male speaker at each global speaking rate. The mean pause frequency at each global speaking rate was first computed by averaging across the six responses; subsequently, the normalized pause frequency was obtained by dividing the resulting value by the mean syllable production frequency. Mean (±SD) normalized pause frequency was 0.07 ± 0.2 for the normal-rate, 0.1 ± 0.2 for the slow-rate, and 0.07 ± 0.1 for the fast-rate. ANOVA for nonparametric data (Friedman test) revealed no significant differences in mean normalized pause frequency across speaking rates, χ2(2) = 4.3, p = .12. A qualitative evaluation of the transcriptions further confirmed that the speaker was able to produce fluent, connected speech with similar pronunciation and without repetitions or excessive use of filler words at all three global speaking rates.
The transcriptions included time stamps every 5 sec, allowing the estimation of syllable production frequencies in 48 separate time segments per global speaking rate. According to these segment-wise production frequencies, the local speaking rate varied as a function of time at each global speaking rate (Figure 2). The segment-wise production frequencies enabled the identification of the time segments of relatively constant speaking rate and changing speaking rate in the spontaneously produced speech stimuli by calculating the first derivative of the local speaking rate. First, the absolute difference in syllable production frequency between consecutive 5-sec segments was computed within each of the six blocks in each global speaking rate. This yielded seven local speaking rate variation values per block. From these seven local speaking rate variation values per block, we identified the two smallest values (indicating the two 5-sec segments in which the speaking rate had remained the most constant relative to the preceding segment; assigned to the constant-rate category) and the two largest values (indicating the two 5-sec segments in which the speaking rate had changed the most relative to the preceding segment; assigned to the changing-rate category). Subsequently, the 5-sec segments assigned to each category based on this behavioral analysis were matched to the corresponding time points in the acoustic and MEG signals. Data from all three global speaking rates were pooled together, yielding a total of thirty-six 5-sec segments of MEG data (180 sec in total) per participant for each category (except for one participant for whom data from one block of fast-rate speech perception were missing due to a technical issue; for this participant, only thirty-four 5-sec segments of MEG data were available in each category).
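The segment selection described above can be sketched in a few lines of Python. This is a minimal illustration rather than the analysis code used in the study: the function name `classify_segments` and the example rates are hypothetical, and ties between equal changes are broken by segment order, a detail the text does not specify.

```python
def classify_segments(block_rates, n_pick=2):
    """Return (constant, changing) segment indices for one 40-s block.

    block_rates: syllable rates (Hz) of the consecutive 5-s segments.
    A returned index i (0-based) refers to segment i, compared with
    the preceding segment i - 1.
    """
    # Absolute change between consecutive segments: 8 segments -> 7 values.
    diffs = [abs(b - a) for a, b in zip(block_rates, block_rates[1:])]
    # Order the transitions from smallest to largest change (stable sort).
    order = sorted(range(len(diffs)), key=lambda i: diffs[i])
    constant = sorted(i + 1 for i in order[:n_pick])   # smallest changes
    changing = sorted(i + 1 for i in order[-n_pick:])  # largest changes
    return constant, changing


# Hypothetical segment-wise rates for one block (not data from the study).
rates = [4.6, 4.8, 3.2, 5.0, 5.1, 6.4, 4.4, 4.5]
constant, changing = classify_segments(rates)
print(constant, changing)  # [4, 7] [3, 6]
```

Applied to six blocks at each of the three global rates, two segments per block and category give the 36 segments per category stated above.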
The local speaking rate varied between 1.2 and 3.8 Hz for slow-rate speech (mean ± SD, 2.5 ± 0.6 Hz), between 2.4 and 6.4 Hz for normal-rate speech (4.6 ± 1.0 Hz), and between 4.0 and 9.4 Hz for fast-rate speech (6.7 ± 1.3 Hz; also see Figure 2). The local variation in speaking rate, from one segment to the next, ranged from 0 to 1.8 Hz (absolute values) for slow-rate speech, from 0 to 3.4 Hz for normal-rate speech, and from 0.4 to 4.6 Hz for fast-rate speech. The mean syllable production frequency did not differ between the constant-rate category (4.8 ± 2.1 Hz) and the changing-rate category (4.6 ± 2.3 Hz; W = 0.001, p = .46, Wilcoxon signed-rank test), nor did the distributions of normalized variations differ between the three global speaking rates (normal-rate speech vs. slow-rate speech, D(42, 42) = 0.10, p = .99; normal-rate speech vs. fast-rate speech, D(42, 42) = 0.19, p = .39; slow-rate speech vs. fast-rate speech, D(42, 42) = 0.19, p = .39; two-sample Kolmogorov–Smirnov test).
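For readers who wish to reproduce the distribution comparison, a two-sample Kolmogorov–Smirnov test can be sketched as follows. The helper names `ks_statistic` and `ks_pvalue` are hypothetical, and the p value uses the standard asymptotic series with a small-sample correction of the effective sample size; the routine actually used in the study is not specified and may differ.

```python
import math


def ks_statistic(x, y):
    """Largest distance between the empirical CDFs of two samples."""
    n, m = len(x), len(y)
    d = 0.0
    # The empirical CDFs only change at observed values, so it is
    # enough to compare them there.
    for v in list(x) + list(y):
        cdf_x = sum(1 for a in x if a <= v) / n
        cdf_y = sum(1 for b in y if b <= v) / m
        d = max(d, abs(cdf_x - cdf_y))
    return d


def ks_pvalue(d, n, m, terms=100):
    """Asymptotic two-sided p value for the two-sample statistic d."""
    n_eff = n * m / (n + m)
    lam = (math.sqrt(n_eff) + 0.12 + 0.11 / math.sqrt(n_eff)) * d
    if lam < 0.3:
        return 1.0  # the series converges poorly here; p is ~1 for such small lam
    series = sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
                 for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * series))
```

With 42 variation values per rate (6 blocks × 7 transitions), a statistic of 8/42 ≈ 0.19 yields p ≈ .39 under this approximation.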
A single speech perception block consisted of a recorded thematic question spoken by a male voice (duration = 3–9 sec; mean ± SD, 5.6 ± 1.3 sec), a 1-sec delay before response onset, a 40-sec speech stimulus (i.e., the unknown male speaker's response to the thematic question), and a 2.5-sec rest period between blocks. A signal tone (50-msec, 1-kHz tone) indicated the beginning of a block, and another signal tone (50-msec, 75-Hz tone) signified the beginning and end of the response. The speech stimuli were grouped into three experimental conditions according to speaking rate (normal, slow, and fast). There were six blocks per experimental condition. Each 40-sec speech stimulus was presented only once to avoid learning effects. The order of the experimental conditions was randomized across participants.
During the experiment, participants were instructed to keep their gaze on a fixation point projected on a screen at ∼1 m from their sitting position. Head position was assessed throughout the experiment by observing the participants' head position through a video connection with the MEG room, by examining the measured MEG data for motion-related artifacts in real time and by reminding the participants between conditions, via intercom, to keep their head position constant and avoid, for example, slouching. Stimuli were presented binaurally at an individually adjusted comfortable listening level through plastic tubes and intracanal earpieces. The participants' task was to listen attentively to each stimulus. Task compliance was evaluated by inserting a 2-sec repetitive auditory segment (repeated four times, thus sounding like a broken record) in one of the six stimuli of each experimental condition. The repetitive segment occurred at a random time point during the 40-sec stimulus. The participant was instructed to indicate occurrence of a repetitive segment with an index finger lift, using an optical response panel. To further verify that the participants attended to the speech stimuli, at the end of the experiment they were asked to fill in a surprise multiple-choice questionnaire regarding the content of the stimuli they had heard. After the last block of each condition, participants were asked to rate the overall intelligibility of the six stimuli they had just heard on a scale of 1–10 (1 = completely unintelligible, 10 = completely intelligible). The effect of global speaking rate on the mean stimulus intelligibility scores (obtained by averaging the individual intelligibility scores across the 20 participants) was tested by ANOVA for nonparametric data (Friedman test).
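The Friedman test on the intelligibility ratings can be illustrated with a short, self-contained sketch. The function name `friedman_test` is hypothetical, tied ratings receive average ranks without the tie correction that statistical packages usually apply, and the closed-form p value holds only for three conditions (df = 2).

```python
import math


def friedman_test(scores):
    """Friedman chi-square for a participants x conditions table.

    scores: one row per participant, one rating per condition.
    Assumption: average ranks for ties, no tie correction.
    """
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        for j, value in enumerate(row):
            smaller = sum(1 for v in row if v < value)
            equal = sum(1 for v in row if v == value)
            # Average rank of this rating within the participant's row.
            rank_sums[j] += smaller + (equal + 1) / 2.0
    chi2 = (12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums)
            - 3.0 * n * (k + 1))
    # For df = 2 the chi-square tail probability is exactly exp(-x / 2).
    p = math.exp(-chi2 / 2.0) if k == 3 else None
    return chi2, p
```

Under the df = 2 formula, a statistic of 4.3 corresponds to p ≈ .12, as reported for the pause frequencies.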
MEG signals were recorded with a 306-sensor (204 gradiometers, 102 magnetometers) Neuromag Vectorview whole-head device (Elekta Oy) in a magnetically shielded room at Aalto NeuroImaging MEG Core. Data were filtered at 0.03–500 Hz and sampled at 1500 Hz. The participants were seated, with the head covered by the MEG helmet. Each participant's head position with respect to the MEG sensor array was determined by attaching five head position indicator coils to the scalp and briefly energizing them before the measurement. The coil locations were determined in reference to anatomical landmarks (nasion and right/left preauricular points) using a 3-D digitizer (Isotrak 3S1002, Polhemus Navigation Science). Blinks and eye movements (saccades) during the MEG measurement were monitored using EOG. Structural MRIs (3T Siemens MAGNETOM Skyra, Siemens Medical Systems) at Aalto NeuroImaging Advanced Magnetic Imaging Center were obtained for each participant after the MEG measurement using a high-resolution T1-weighted 3-D MPRAGE scan (32-channel head coil, 176 slices with 1-mm slice thickness, voxel size = 1 mm × 1 mm × 1 mm, 7° flip angle, 1100-msec inversion time, repetition time/echo time = 2530/3.3 msec). During the analysis process, the MEG coordinate system was aligned with individual MRIs based on head position coils and anatomical landmarks using MRI lab software (Elekta Oy).
Acoustic Signal Analysis
The amplitude envelope of the 4-min-long acoustic signal recorded at each global speaking rate was computed by full-wave rectifying and low-pass filtering (<10 Hz, fourth-order Butterworth filter, forward and backward) the band-pass filtered acoustic signal (80–2500 Hz, fourth-order Butterworth filter, forward and backward). The signal was filtered in this frequency range to emphasize the voiced signal portions that encode the rhythmic features of speech (Alexandrou et al., 2016; Hertrich et al., 2013). The spectrum of the downsampled (by a factor of 10), Tukey-windowed (r = .2), and zero-padded envelope was calculated by taking the squared magnitude of the fast Fourier transform using an 8192-point window (for more details, see Alexandrou et al., 2016). The mean rate around which the acoustic amplitude envelope fluctuated as a function of time reflected the global speaking rate while the quasi-regular pattern of the amplitude envelope fluctuations captured the local variations in speaking rate (5-sec long example; Figure 3, left). The acoustic power spectra estimated across the whole 4-min of data featured salient spectral peaks that approximately aligned with word and syllable production frequencies (Figure 3, right). It is noteworthy that power spectral peaks associated with prosodic frequencies (i.e., <1 Hz) did not demonstrate any shifts as a function of global speaking rate, possibly indicating similar prosodic patterning across stimuli at all three speaking rates.
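The envelope extraction chain (band-pass, full-wave rectification, low-pass, spectrum) can be sketched with NumPy and SciPy. This is an approximation under stated assumptions, not the original pipeline: the function names are hypothetical, "fourth-order" is passed directly as the design order to `butter`, the mean is removed before the FFT, and the transform is zero-padded to the next power of two instead of reproducing the 8192-point window.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, windows


def amplitude_envelope(audio, fs):
    """Band-pass (80-2500 Hz), full-wave rectify, then low-pass (<10 Hz)."""
    band = butter(4, [80.0, 2500.0], btype="bandpass", fs=fs, output="sos")
    low = butter(4, 10.0, btype="lowpass", fs=fs, output="sos")
    voiced = sosfiltfilt(band, audio)         # forward and backward filtering
    return sosfiltfilt(low, np.abs(voiced))   # rectify, then smooth


def envelope_spectrum(envelope, fs, factor=10):
    """Power spectrum of the downsampled, Tukey-windowed, zero-padded envelope."""
    env = envelope[::factor]                  # envelope is already limited to <10 Hz
    env = env - env.mean()                    # assumption: remove DC before the FFT
    env = env * windows.tukey(env.size, alpha=0.2)
    nfft = 1 << (env.size - 1).bit_length()   # zero-pad to the next power of two
    power = np.abs(np.fft.rfft(env, n=nfft)) ** 2
    freqs = np.fft.rfftfreq(nfft, d=factor / fs)
    return freqs, power


# Synthetic check signal (hypothetical): a 200-Hz carrier modulated at 4 Hz.
fs = 8000
t = np.arange(10 * fs) / fs
audio = (1 + 0.8 * np.cos(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 200 * t)
freqs, power = envelope_spectrum(amplitude_envelope(audio, fs), fs)
mask = freqs > 0.5
print(freqs[mask][np.argmax(power[mask])])    # should lie close to 4 Hz
```

For real speech, the strongest spectral peaks would be expected near the word and syllable production frequencies rather than at a single modulation rate.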
MEG Data Analysis
Only gradiometers were included in MEG data analysis as they have a narrow spatial sensitivity pattern and are optimal for recording data from superficial sources; in contrast, magnetometers more readily pick up signals from distant sources, including external artifacts. First, MaxFilter software (Elekta Oy) was used to remove external disturbances from the MEG data with spatiotemporal signal space separation (Taulu & Simola, 2006). Subsequently, blink artifacts were removed from MEG signals using a PCA-based routine (Uusitalo & Ilmoniemi, 1997) implemented in Graph software (Elekta Oy).
The cortical areas showing coherent activity with the amplitude envelope of the acoustic signals were estimated with dynamic imaging of coherent sources (DICS; Gross et al., 2001) using a spherical head model. In DICS, spatial filtering is employed to estimate oscillatory power and coherence in the brain based on a cross-spectral density (CSD) matrix, which represents the oscillatory components and their linear dependencies. DICS is well suited for performing source modeling of continuous MEG data recorded during complex cognitive tasks (Alexandrou, Saarinen, Mäkelä, Kujala, & Salmelin, 2017; Saarinen et al., 2015; Kujala et al., 2007); in addition, it has proven its efficiency for examining coherence between MEG and acoustic speech signals (Peelle et al., 2013).
DICS analysis and the subsequent computation of audio–MEG coherence were carried out in MATLAB software (The MathWorks, Inc.) using custom-made scripts. The covariance computation was done in the form of CSD matrices, which were computed between all planar gradiometer MEG signals and the amplitude envelope of the acoustic signal using Welch's averaged periodogram method (4096-point Hanning windowing, 50% window overlap, 4096-point fast Fourier transform, 0.4 Hz resolution) separately for 10 frequency bins (starting from 1 Hz up to 10 Hz, 1 Hz spacing, 2 Hz spectral width). CSD matrices were calculated for the MEG data recorded during perception of speech at each of the three global speaking rates, as well as for the MEG data in the constant-rate and changing-rate speech categories. Based on these CSD matrices, the cortical-level DICS-based estimates of audio–MEG coherence were computed separately for each frequency bin in a spatially equivalent search grid across participants. The grid sampled the gray matter surface, excluding the cerebellum (20482 points, atlas brain, Freesurfer 5.3; Fischl, 2012). This common grid was transformed to each participant's anatomy via a surface-based transformation (Fischl, Sereno, Tootell, & Dale, 1999). The DICS estimation used a regularization of 0.01% of the maximum eigenvalue of each frequency and condition-specific CSD matrix.
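The Welch parameters above translate directly into a sensor-level coherence estimate, sketched here with SciPy. The sketch omits the DICS source reconstruction entirely and only shows how a cross-spectral density and a magnitude-squared coherence could be obtained for one MEG channel and one frequency bin; `band_coherence` and its arguments are hypothetical names.

```python
import numpy as np
from scipy.signal import csd, welch

FS = 1500.0      # MEG sampling rate stated in the text (Hz)
NPERSEG = 4096   # 4096-point Hann window -> FS / NPERSEG ~ 0.37 Hz resolution


def band_coherence(meg, envelope, f_center, half_width=1.0, fs=FS):
    """Magnitude-squared coherence of one channel with the envelope in one bin."""
    kw = dict(fs=fs, window="hann", nperseg=NPERSEG, noverlap=NPERSEG // 2)
    freqs, s_xy = csd(meg, envelope, **kw)   # cross-spectral density
    _, s_xx = welch(meg, **kw)               # auto-spectra
    _, s_yy = welch(envelope, **kw)
    sel = (freqs >= f_center - half_width) & (freqs <= f_center + half_width)
    # Average the spectral densities over the 2-Hz-wide bin, then normalise.
    return np.abs(s_xy[sel].mean()) ** 2 / (s_xx[sel].mean() * s_yy[sel].mean())
```

Looping `f_center` over 1 to 10 Hz in 1-Hz steps mirrors the ten frequency bins; the envelope would first need to be resampled to the MEG sampling rate, a step not shown here.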
Subsequent statistical tests were carried out using IBM SPSS statistics (IBM) and MATLAB software (The MathWorks, Inc.). The effect of the global speaking rate on the relationship between cortical signals and the acoustic amplitude envelope of the perceived speech signals was evaluated by examining differences in audio–MEG coherence between the main experimental conditions (normal-rate speech vs. slow-rate speech, normal-rate speech vs. fast-rate speech). The effect of local variations in speaking rate on the relationship between cortical signals and the acoustic amplitude envelope of the perceived speech signals was evaluated by examining differences in audio–MEG coherence between the constant-rate speech and changing-rate speech. For group-level statistics, the audio–MEG coherence maps were averaged across the 2–4 Hz and 4–7 Hz frequency bins. The behaviorally estimated syllable production frequencies, spanning the 2–7 Hz range across the three global speaking rates, prompted us to focus subsequent statistical analysis on delta-band (2–4 Hz) and theta-band (4–7 Hz) cortical activity. Moreover, these frequencies have been extensively linked with cortical tracking of speech stimuli (e.g., Kayser et al., 2015; Peelle et al., 2013; Luo & Poeppel, 2007).
Statistical significance was determined using group-level cluster-based statistics controlling for multiple comparisons (cluster-based permutation procedure performed on the statistically significant results obtained from a Student's two-tailed t test for paired samples, 10,000 permutations, statistical significance threshold p < .05, family-wise error corrected, cluster threshold p < .05, weighted distance algorithm for linking adjacent grid points, 15-mm cutoff threshold for cluster size; Maris & Oostenveld, 2007). In accordance with the procedure described in Maris and Oostenveld (2007), the t values were summed within spatially contiguous clusters (for adjacent voxels with p < .05). For each round of the permutation testing, the labels of the two conditions being compared were randomized across participants, and new t statistics were computed in all grid points. For each permutation, the largest cluster t value was collected, yielding a distribution of 10,000 cluster-level t values. Subsequently, the original cluster-level t statistics were compared with this distribution. An effect was considered significant if the observed cluster-level t value exceeded the 95th percentile of the permuted maximum cluster t values (i.e., cluster p < .05). For each cortical region demonstrating statistically significant differences in audio–MEG coherence, we obtained the Talairach coordinates and Brodmann's area numbers of the center of the region using the Talairach Daemon (Lancaster et al., 2000).
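The logic of the cluster-based permutation test can be illustrated with a toy example in plain Python. Several simplifications are assumed: grid points lie on a one-dimensional chain, so adjacency means neighbouring indices instead of the weighted cortical distance used in the study; label randomization is implemented as a random sign flip of each participant's difference map; and all function names are hypothetical. The critical t value is supplied by the caller (2.093 for 19 df, two-tailed p < .05).

```python
import math
import random


def paired_t(diffs):
    """t statistic of paired differences (df = n - 1)."""
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n) if var > 0 else 0.0


def max_cluster_mass(t_values, t_crit):
    """Largest |sum of t| over runs of adjacent supra-threshold, same-sign points."""
    best, run, prev_sign = 0.0, 0.0, 0
    for t in t_values:
        sign = (t > t_crit) - (t < -t_crit)   # +1, -1, or 0 (sub-threshold)
        if sign == 0:
            run = 0.0
        elif sign == prev_sign:
            run += t                          # extend the current cluster
        else:
            run = t                           # start a new cluster
        prev_sign = sign
        best = max(best, abs(run))
    return best


def cluster_permutation_p(diff_maps, t_crit, n_perm=10000, seed=0):
    """diff_maps[p][g]: condition difference of participant p at grid point g."""
    rng = random.Random(seed)
    n_points = len(diff_maps[0])

    def mass(maps):
        t_vals = [paired_t([m[g] for m in maps]) for g in range(n_points)]
        return max_cluster_mass(t_vals, t_crit)

    observed = mass(diff_maps)
    exceed = 0
    for _ in range(n_perm):
        # Swapping the two condition labels of a participant flips the
        # sign of that participant's whole difference map.
        flipped = []
        for m in diff_maps:
            s = rng.choice((-1.0, 1.0))
            flipped.append([s * v for v in m])
        exceed += mass(flipped) >= observed
    return observed, (exceed + 1) / (n_perm + 1)
```

Only the largest cluster is tested in this sketch. A full analysis would compare every observed cluster with the permutation distribution of maximum cluster values, and would use vectorized code to handle a grid of 20482 points.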
The spatial and spectral specificity of the observed effects was further explored by plotting the frequency spectra in the 1–10 Hz frequency range for each contrast (normal-rate speech vs. slow-rate speech, normal-rate speech vs. fast-rate speech, constant-rate speech vs. changing-rate speech). Audio–MEG coherence spectra in the contrasted conditions were compared in each frequency bin using Student's two-tailed t test (statistical significance threshold p < .05, 19 df, in spatially combined clusters in the case of close-by significant clusters).
To further examine the possibility that the modulation in audio–MEG coherence for the constant-rate speech versus changing-rate speech reflects dynamic tracking of local speaking rate, we computed the across-subject standard deviation (SD) of the cosine of the instantaneous phases of cortical activity for the constant-rate and changing-rate speech categories. This was done to describe the origin and nature of the observed modulations in audio–MEG coherence and further explore the hypothesis that tracking of the local variations in speaking rate is predictive in nature. Because this kind of analysis is statistically dependent on the results of the contrast between constant-rate speech and changing-rate speech, it was made for illustrative purposes and not intended to yield novel findings. The number of 5-sec segments considered in this analysis was 34, determined by the minimum number of such segments identified in individual participants. Across-subject SD was the measure of choice because, in contrast to audio–MEG coherence, it provides a reliable estimate of the phase alignment of cortical activity across subjects even for MEG data sections as short as 5 sec. The segment-wise mean SD values, obtained by averaging SD values across the time bins in each 5-sec segment, were computed for cortical activity in 10 frequency bins in the 1–10 Hz frequency range (1 Hz spacing, bin width ±0.5 Hz). We compared SD values between constant-rate speech and changing-rate speech in each frequency bin using a Student's two-tailed t test for independent samples (statistical significance threshold p < .05, 66 df). The range of phase variability in the constant and changing-rate speech category was computed as the difference between the maximum and the minimum phase variability. 
The mean difference in phase variability between constant-rate speech and changing-rate speech was quantified by averaging the difference in mean SD across the thirty-four 5-sec MEG data segments included in the analysis.
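The phase-variability measure described above can be illustrated with a short sketch. Several details are assumptions made for the example and are not stated in the text: the FFT-based band-limited analytic signal used to obtain the instantaneous phase, the sample SD (`ddof=1`), the 100-Hz sampling rate, and the synthetic data. The segment count (34 per category), the 5-sec segment length, the ±0.5 Hz bin width, and the independent-samples t test with 66 df follow the text.

```python
import numpy as np
from scipy import stats


def band_phase(x, fs, f_center, half_width=0.5):
    """Instantaneous phase of x within f_center +/- half_width (Hz).

    Keeps only positive-frequency FFT bins inside the band, which yields the
    analytic signal of the band-limited data (an assumed implementation).
    """
    n = x.shape[-1]
    spec = np.fft.fft(x, axis=-1)
    freqs = np.fft.fftfreq(n, d=1.0 / fs)
    keep = (freqs >= f_center - half_width) & (freqs <= f_center + half_width)
    analytic = np.fft.ifft(np.where(keep, 2.0 * spec, 0.0), axis=-1)
    return np.angle(analytic)


def segment_mean_sd(segment, fs, f_center):
    """segment: (n_subjects, n_times). Across-subject SD of cos(phase) per
    time bin, averaged over the time bins of the segment."""
    phase = band_phase(segment, fs, f_center)
    return np.std(np.cos(phase), axis=0, ddof=1).mean()


# Synthetic demonstration data (NOT the study's data).
rng = np.random.default_rng(1)
fs, dur, n_sub, n_seg, f0 = 100.0, 5.0, 20, 34, 6.0
t = np.arange(int(fs * dur)) / fs


def fake_segment(phase_jitter):
    """Hypothetical helper: subjects share a 6-Hz rhythm with phase jitter."""
    offsets = rng.normal(0.0, phase_jitter, size=(n_sub, 1))
    noise = rng.normal(0.0, 0.5, size=(n_sub, t.size))
    return np.cos(2 * np.pi * f0 * t + offsets) + noise


sd_constant = np.array([segment_mean_sd(fake_segment(0.3), fs, f0) for _ in range(n_seg)])
sd_changing = np.array([segment_mean_sd(fake_segment(0.9), fs, f0) for _ in range(n_seg)])

# Student's two-tailed t test for independent samples (pooled variance).
t_val, p_val = stats.ttest_ind(sd_changing, sd_constant)
df = sd_constant.size + sd_changing.size - 2
sd_range_constant = sd_constant.max() - sd_constant.min()
mean_difference = np.mean(sd_changing - sd_constant)
print(f"t({df}) = {t_val:.2f}, p = {p_val:.3g}")
```

In this toy setting, a larger across-subject phase jitter produces a larger mean SD, which is the direction of the effect the measure is meant to capture.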
Attentional Control Tasks
For all three experimental conditions, all participants (20 of 20) were able to detect the repetitive segments embedded in one of the stimuli. The answers to the surprise questionnaire presented at the end of the experiment regarding the content of the speech stimuli were 100% correct for all 20 participants.
The mean (±SD) intelligibility scores were 9.9 ± 0.5 for the normal-rate speech stimuli, 9.9 ± 0.3 for the slow-rate speech stimuli, and 9.5 ± 0.7 for the fast-rate speech stimuli (a score of 10 representing perfect intelligibility). There were no significant differences in intelligibility scores between the speech stimuli at different rates, χ²(2) = 2, p = .37 (Friedman test).
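As a quick consistency check on the reported statistic, the p value can be recomputed from the test statistic. The sketch assumes that the Friedman test was evaluated with its usual chi-square approximation with k − 1 degrees of freedom (k = 3 speaking rates, hence 2 df); for 2 df the chi-square survival function has the closed form exp(−x/2). The function name is hypothetical.

```python
import math


def chi2_sf_2df(stat):
    """P(X > stat) for a chi-square variable with 2 degrees of freedom."""
    return math.exp(-stat / 2.0)


p_friedman = chi2_sf_2df(2.0)
print(f"p = {p_friedman:.2f}")  # prints "p = 0.37"
```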
Effect of Global Speaking Rate on Audio–MEG Coherence Patterns
Global speaking rate was found to be associated with modulations in audio–MEG coherence in the temporal regions bilaterally, as well as in the right paracentral lobule and the right parietal region (Figure 4A). Specifically, there was significantly stronger audio–MEG coherence for perceiving normal-rate speech than fast-rate speech in the superior temporal areas bilaterally (left: −61, −15, 1; BA 22, and right: 61, −9, 1; BA 22) in the 2–4 Hz frequency range (left: cluster p value = .02, cluster size = 140; right: cluster p value = .01, cluster size = 135; cluster threshold p < .05, 10,000 permutations) as well as in the right paracentral lobule (7, −33, 67; BA 6) in the 4–7 Hz frequency range (cluster p value = .03, cluster size = 128, cluster threshold p < .05, 10,000 permutations; Figure 4A, left). Furthermore, significantly increased audio–MEG coherence for perception of slow-rate than normal-rate speech was observed in the right parietal region (50, −38, 40; BA 40) in the 4–7 Hz frequency range (Cluster 1: cluster p value = .03, cluster size = 134; Cluster 2: cluster p value = .01, cluster size = 144; cluster threshold p < .05, 10,000 permutations; Figure 4A, left).
Effect of Local Variations in Speaking Rate on Audio–MEG Coherence Patterns
Local variations in speaking rate were associated with modulation of audio–MEG coherence in the left hemisphere, where significantly stronger audio–MEG coherence was observed for constant-rate speech than changing-rate speech in the 4–7 Hz frequency range of interest (Figure 4B). Specifically, this effect encompassed the left parietal region, particularly highlighting the left postcentral gyrus (−44, −24, 37; BA 2) as well as the left inferior parietal lobule (−33, −20, 41; BA 3; Cluster 1: cluster p value = .03, cluster size = 132; Cluster 2: cluster p value = .04, cluster size = 131; cluster threshold p < .05, 10,000 permutations; Figure 4B, left).
Spatial and Spectral Specificity of the Observed Modulations in Audio–MEG Coherence
Next, we examined the spatial and spectral specificity of the observed modulations in audio–MEG coherence. The audio–MEG coherence spectra were plotted in the 1–10 Hz frequency range for each contrasted condition in each of the cortical regions in which modulations of audio–MEG coherence were observed (Figure 4A and B, right). Paired t tests on the coherence spectra in these ROIs confirmed that the modulations of audio–MEG coherence determined in the 2–4 Hz and 4–7 Hz bands were spatially and spectrally specific. In the left and right superior temporal region, in the right paracentral lobule, as well as in the right and left parietal region, significant differences in audio–MEG coherence (based on the paired t tests; shown as vertical lines in the spectral plots) were observed primarily for the contrast for which the statistically corrected effects had been first discovered (cf. Figure 4A and B, left), and only rather sporadic (i.e., mostly observed in a few isolated frequencies) differences in audio–MEG coherence occurred for the other two contrasts. Significant modulations in audio–MEG coherence were also limited to the frequency ranges in which these effects had been first identified. Specifically, in the left superior temporal region, the paired t tests revealed differences in audio–MEG coherence only for the normal-rate speech versus fast-rate speech contrast (2 Hz: p = .03, 3 Hz: p = .01, 4 Hz: p = .02). In the right superior temporal region, differences were found for the normal-rate speech versus slow-rate speech contrast (1 Hz: p = .03, 8 Hz: p = .003, 9 Hz: p = .02) and for the normal-rate speech versus fast-rate speech contrast (1 Hz: p = .003, 2 Hz: p = 5 × 10⁻⁴, 3 Hz: p = .003, 4 Hz: p = .03). In the right paracentral lobule, differences were observed for all three contrasts (normal-rate speech vs. slow-rate speech, 7 Hz: p = .02; normal-rate speech vs. fast-rate speech, 4 Hz: p = .01; constant-rate speech vs. changing-rate speech, 2 Hz: p = .01).
In the right parietal region, differences were observed only for the normal-rate speech versus slow-rate speech contrast (3 Hz: p = .01, 4 Hz: p = .01, 5 Hz: p = .02, 6 Hz: p = .01, 7 Hz: p = .01). Finally, in the left parietal region, differences were found only for the constant-rate speech versus changing-rate speech contrast (4 Hz: p = .02, 5 Hz: p = .01, 6 Hz: p = .02; Figure 4A and B, right).
Dynamic Readjustment of Cortical Signal Phase in Response to Local Variations in Speaking Rate
The origin and potential dynamic nature of modulations in audio–MEG coherence in response to local variations in speaking rate were further described by computing the variability of the instantaneous phase of cortical signals originating from the left parietal region (Figure 4B, left) for each 5-sec segment included in the constant-rate and changing-rate speech categories (Figure 5). The mean SD was significantly larger (p = .02), indicating greater across-subject variability of the instantaneous phase of MEG signals, for perception of changing-rate speech (gray) than constant-rate speech (black) in the left parietal region (Figure 5). This effect was found for cortical activity in the 6-Hz frequency bin, thus aligning with the 4–7 Hz frequency range in which modulations in audio–MEG coherence were initially observed (Figure 4B, left). Plotting the mean SD for the unsorted 5-sec segments (Figure 5, left) demonstrated that the difference in phase variability between constant-rate and changing-rate speech is quite subtle; this is also illustrated by the mean SD range (0.04 for constant-rate speech; 0.03 for changing-rate speech) and by the absolute average difference in mean SD between constant-rate speech and changing-rate speech (0.005). Yet, the data sorted in ascending order by the mean SD magnitude reveal that the statistically significant difference between constant-rate speech and changing-rate speech reflects a systematically higher phase variability for the changing-rate speech (Figure 5, right), consistent with a dynamic readjustment of cortical signal phase in response to local variations in speaking rate.
The present results suggest that cortical signals track multiple features of speech rhythm. The global speaking rate (normal-rate speech vs. fast-rate speech, normal-rate speech vs. slow-rate speech) and local variations in speaking rate (constant-rate speech vs. changing-rate speech) were associated with modulations in audio–MEG coherence in the 2–4 Hz (delta) and 4–7 Hz (theta) frequency ranges. These modulations encompassed cortical regions in both hemispheres. Global speaking rate was associated with modulations in audio–MEG coherence in the temporal areas bilaterally, as well as in the right parietal and paracentral regions. Local variations in speaking rate, in contrast, were accompanied by modulations of audio–MEG coherence only in the left parietal region. These spatially and functionally distinct cortical tracking patterns suggest a dual nature of the cortical tracking mechanism of speaking rate.
This study found evidence of two spatially and functionally distinct components of cortical tracking of speech rhythm. The first component is associated with modulations in audio–MEG coherence as a function of global speaking rate. In line with previous studies, there was an emphasis on the 2–4 Hz and 4–7 Hz frequency ranges, and the middle superior temporal regions bilaterally were highlighted (Park et al., 2015; Peelle et al., 2013; Hertrich, Dietrich, Trouvain, Moos, & Ackermann, 2012; Aiken & Picton, 2008; Luo & Poeppel, 2007). We propose that this effect may be conceptualized as evolutionary tuning, which signifies an innate preference in the motor system for certain frequencies of output (∼5 Hz). This preference is thought to be reflected in the remarkable constancy of habitual speaking rates across languages (Liberman & Whalen, 2000) and has been suggested to have shaped a similar preference in the auditory system (Assaneo et al., 2016). The preference for the 5-Hz frequency has been proposed to have an evolutionary origin, stemming from primate oromotor vocalizations (Ghazanfar, Morrill, & Kayser, 2013; Morrill, Paukner, Ferrari, & Ghazanfar, 2012), and has been shown to be already present in infants (Telkemeyer et al., 2009). Notably, the global rate of the normal-rate speech stimuli employed here was ∼5 Hz, aligning with this proposed innate preference of the motor system and hence with the habitual speaking rates across languages (Alexandrou et al., 2016; Ruspantini et al., 2012; Levelt, Roelofs, & Meyer, 1999). The present findings importantly also point to a preference for the habitual rate of input in the auditory modality: The audio–MEG coherence between auditory cortical activation and the external audio signal was enhanced for normal-rate speech compared with fast-rate speech, an effect thought to facilitate subsequent processing (Schroeder, Wilson, Radman, Scharfman, & Lakatos, 2010; Lakatos, Karmos, Mehta, Ulbert, & Schroeder, 2008). 
This interpretation is in line with our earlier analysis of MEG signal power in this same data set, which found decreased power for global speaking rates deviating from the habitual rate (Alexandrou et al., 2017), as well as with the observation that normal-rate speech is processed even in the absence of attention (Wild et al., 2012).
Beyond our original hypothesis, cortical tracking of global speaking rate was additionally associated with modulations of audio–MEG coherence in the right paracentral lobule and right parietal cortex. We propose that this finding is linked to the reliance of comprehension of connected speech on the progressive integration and analysis of information (Bourguignon et al., 2013; Lieberman, 1963) related to semantic context (Federmeier, Wlotko, & Meyer, 2008; St. George, Kutas, Martinez, & Sereno, 1999) and prosody (e.g., Hari & Kujala, 2009; Belin, Fecteau, & Bedard, 2004; Kriegstein & Giraud, 2004). The observed effects in the right paracentral lobule and right parietal area may be seen as evidence of the general organization of integration of linguistic information in postcentral parietal regions (Sepulcre, Sabuncu, Yeo, Liu, & Johnson, 2012). Indeed, the right paracentral lobule has been associated with linking information in memory and gradually building comprehension when listening to a continuous story (Maguire, Frith, & Morris, 1999). Enhanced audio–MEG coherence in the right parietal region for the more slowly unfolding slow-rate speech compared with normal-rate speech may reflect the temporal flexibility of linguistic information processing and integration; this region has been shown to scale its activity to match global speaking rate (Lerner, Honey, Katkov, & Hasson, 2014; Small, Andersen, & Kempler, 1997).
The second component of cortical tracking of speech rhythm was associated with local variations in speaking rate. Although we initially hypothesized contributions of both the auditory cortices and the parietal regions, our empirical results exclusively emphasized the left parietal region. The left parietal region has been linked to predictive timing and the encoding of “when” something happens, a process that also interacts with attention (Arnal & Giraud, 2012; Schwartze, Tavano, Schröger, & Kotz, 2012; Nobre, Correa, & Coull, 2007). The emphasis on the 4–7 Hz frequency range is in line with reports suggesting that low-frequency cortical oscillations play a role in predicting future, behaviorally relevant cues (Calderone, Lakatos, Butler, & Castellanos, 2014; Arnal & Giraud, 2012; Saleh, Reimer, Penn, Ojakangas, & Hatsopoulos, 2010). Indeed, the presently observed modulation in audio–MEG coherence with respect to local variations in speaking rate was found to reflect a dynamic re-adjustment of cortical signal phase.
Based on this spatiospectral pattern, we propose that tracking local speaking rate is a process linked to the inherent predisposition of the brain to continuously seek and extract patterns of temporal regularity from the surrounding, ever-changing sensory environment. This predisposition may be conceptualized as predictive timing and predictive coding, a cortical operation mode that governs perceptual processes (Friston, 2012; Schroeder, Lakatos, Kajikawa, Partan, & Puce, 2008; Bar, 2007; Bonte, Mitterer, Zellagui, Poelmans, & Blomert, 2005) and supports subsequent sensory and cognitive processing (Sohoglu, Peelle, Carlyon, & Davis, 2012). This tracking has been suggested to operate in an anticipatory manner, in which the phase of low-frequency cortical activity is reset before the next relevant sensory event (Arnal & Giraud, 2012). Indeed, based on the present evidence suggesting a dynamic tracking of local variations in speaking rate, it appears that the brain actively makes predictions: the tracking of a quasi-regular stimulus (such as the acoustic amplitude envelope) could be highly predictive and quickly adjustable in nature. In this study, the emphasis on the left parietal region could presumably be associated with a predominant top–down nature of this predictive tuning during spontaneous speech perception (e.g., Andersen & Buneo, 2002; Engel, Fries, & Singer, 2001); future studies could further explore this topic. Cortical signals have been reported to track temporally regular visual (Cravo, Rohenkohl, Wyart, & Nobre, 2013) and auditory (ten Oever et al., 2017; Kayser et al., 2015) stimuli more accurately. The presently observed enhanced audio–MEG coherence for the temporally regular constant-rate speech may thus be interpreted as a marker of successful temporal predictions, leading to more efficient subsequent processing (ten Oever et al., 2017; Rohenkohl, Cravo, Wyart, & Nobre, 2012; Schroeder et al., 2010; Lakatos et al., 2008).
Considering the present spatiospectral patterns as a whole, we observe dissociations of both spatial and spectral nature which could potentially reflect differences in cortical processing of natural connected speech. It was found that cortical signals from the right parietal region are sensitive to global speaking rate whereas cortical signals from the left parietal regions track local variations in speaking rate. This functional differentiation might be linked to different cortical processing modes in the right (post hoc, integrative) and left hemisphere (predictive, anticipatory) during natural, connected speech perception (Federmeier et al., 2008). Furthermore, this study extends the theoretical framework presented in Giraud and Poeppel (2012), as well as previous evidence gained through experimental paradigms based on the perception of isolated sentences, proposing that cortical signals originating mainly from the auditory regions track speech signals (Peelle & Davis, 2012). Thus, the findings of this study are of special interest: As the first report of entrainment patterns in the context of natural connected speech perception, this study extends previous knowledge by suggesting that signals from both auditory and parietal regions track a perceived speech signal. Frontal and parietal regions have been suggested to exert a modulatory influence on the relationship between cortical signals and acoustical speech signals (Keitel, Ince, Gross, & Kayser, 2017; Kayser et al., 2015; Park et al., 2015; Gross et al., 2013). This study further extends the role of these regions by showing that the bilateral parietal regions do not merely exert a modulatory influence on auditory regions but also directly contribute to tracking an incoming speech signal, aligning with recent findings (Puschmann et al., 2017).
Regarding the spectral aspect of the observed tracking of speaking rate, theoretical and empirical work both highlight the delta and theta frequency bands (e.g., Ghitza, 2012; Peelle & Davis, 2012), which are also observed in this study. Here, the emphasis on this frequency range may be related to exogenous parameters, namely, the frequency content of the stimuli and specifically the timescales of linguistic elements in speech: In the present stimuli, the word and syllable production frequencies spanned the range 2–7 Hz. In line with previous studies, we could potentially interpret the modulations in audio–MEG coherence in the delta band (2–4 Hz) as reflecting the processing of words, whereas modulations in coherence in the theta band (4–7 Hz) could be related to the processing of syllables. Providing support for this view, we found that local variations in speaking rate, which are usually assessed at the syllabic level due to the quick timescale on which they occur (Quené, 2007), were tracked by theta-band cortical activity. An alternative explanation may be related to an intrinsic cortical preference for certain timescales of processing, irrespective of the input. It is especially noteworthy that, in the auditory regions, modulations in audio–MEG coherence were observed in the delta band, whereas in the parietal regions and the right paracentral lobule such modulations were observed in the theta band. This observed dissociation regarding cortical signaling frequencies might provide evidence in favor of the existence of a hierarchy of timescales in the cortex during complex perceptual tasks (cf. Hasson, Yang, Vallines, Heeger, & Rubin, 2008; Kiebel, Daunizeau, & Friston, 2008).
Coherence describes the frequency–domain correlation of two time series. Increased coherence signifies enhanced synchronization between the two signals and, with regard to neural signaling, has been interpreted as more efficient information transfer between two neural populations (Fries, 2005). Here, we computed coherence between cortical signals and the acoustic amplitude envelope. In the context of this study, modulations in coherence signify changes in synchronization and thus in the tracking of a speech signal. Because we experimentally manipulated global speaking rate and also examined the inherently occurring local variations in speaking rate, modulations in coherence were considered as a direct consequence of these two kinds of variations in speaking rate. As reflected in the power spectra of the acoustic amplitude envelope of our stimuli, variations in speaking rate affect the temporal structure of the acoustic amplitude envelope which, from a computational perspective, is the main factor affecting coherence values. However, we also acknowledge that global speaking rate does not only affect the amount of linguistic content per time unit but it may also be accompanied by changes in the syntactic and semantic structure of an utterance (e.g., Cohen Priva, 2017). Variations in global speaking rate may also be associated with altered prosodic features (e.g., Fougeron & Jun, 1998), although this was likely not the case in this study as the low-frequency components of the acoustic amplitude envelope remained essentially constant. Local variations in speaking rate may, in turn, be associated with higher-level linguistic events in speech: For instance, speaking rate has been suggested to change locally at the phrasal level and co-occur with specific lexical events in speech (e.g., Byrd, Kaun, Narayanan, & Saltzman, 2000). 
Furthermore, it has been shown that coherence is not only dependent on the characteristics of an incoming stimulus but receives top–down modulations as a function of, for instance, expected syntactic structure (Meyer, Henry, Gaston, Schmuck, & Friederici, 2017). Therefore, even though we here base our interpretations primarily on the rate of speech, which was the variable specifically controlled in the present parametric experimental design, we cannot exclude the possibility that discourse-level features might also contribute to the presently observed coherence patterns. Future studies should further examine the origins of the modulation of audio–MEG coherence during perception of spontaneous speech at different rates.
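To make the coherence measure discussed above concrete, the following sketch computes magnitude-squared coherence, |Sxy(f)|² / (Sxx(f) Syy(f)), between the amplitude envelope of a toy audio signal and a toy "cortical" time series. It is an illustration under stated assumptions, not the pipeline of this study: the Hilbert-based envelope, the Welch-type estimate from `scipy.signal.coherence`, the 200-Hz sampling rate, the 4-sec analysis window, and all signals are hypothetical choices made for the example.

```python
import numpy as np
from scipy import signal

fs = 200.0                                  # hypothetical sampling rate (Hz)
t = np.arange(int(60 * fs)) / fs            # 60 s of toy data
rng = np.random.default_rng(2)

# Toy "speech": a noise carrier whose amplitude is modulated at 5 Hz,
# loosely mimicking a syllable-rate rhythm.
modulator = 1.0 + 0.8 * np.sin(2 * np.pi * 5.0 * t)
audio = modulator * rng.normal(size=t.size)

# Amplitude envelope as the magnitude of the analytic signal.
envelope = np.abs(signal.hilbert(audio))

# Toy "cortical" signal: follows the 5-Hz rhythm with a lag, plus noise.
cortical = np.sin(2 * np.pi * 5.0 * (t - 0.1)) + rng.normal(0, 2.0, size=t.size)

# Welch-type magnitude-squared coherence with 4-s windows (0.25 Hz resolution).
freqs, coh = signal.coherence(envelope, cortical, fs=fs, nperseg=int(4 * fs))

band = (freqs >= 1) & (freqs <= 10)
peak_freq = freqs[band][np.argmax(coh[band])]
print(f"Coherence peak in the 1-10 Hz range at {peak_freq:.2f} Hz")
```

Because coherence is normalized by the power of both signals, it reflects the consistency of their phase and amplitude relationship at each frequency rather than their absolute amplitudes, which is why it lends itself to comparisons across conditions.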
Spontaneous speech, as opposed to single sentences, read-aloud text passages, or rehearsed narratives used in previous related studies, is the kind of speech we are mostly exposed to in real-life social situations. The presently employed experimental paradigm was designed to approximate a real-life listening context. Unlike previous studies examining the effect of global speaking rate on the relationship between cortical and acoustic signals (Hertrich et al., 2013; Ahissar et al., 2001), this study did not involve artificially time-compressed stimuli or impose external pacing on global speaking rate modulations. Because of the noncontrolled nature of the stimuli, the absolute audio–MEG coherence values were lower than those reported in earlier studies; nonetheless, the wide-band audio–MEG coherence spectra demonstrate that the observed effects are salient and occur at clearly delimited frequency ranges. Both theoretical models (Ghitza, 2011) and studies examining artificially time-compressed stimuli (Hertrich et al., 2013; Ahissar et al., 2001) propose that coherence peaks should shift as a function of global speaking rate. In this study, we did not observe such clear-cut effects in the wide-band audio–MEG coherence spectra. This may be due to natural speech showing some overlap in syllable production frequencies between global speaking rates as shown by a previous study from our research group that characterized speech rhythm through acoustic and EMG signals (Alexandrou et al., 2016). The study addressed spontaneously produced speech at three global speaking rates (normal, slow, fast) in the same 20 individuals considered in this study. However, the fairly stable frequency of coherence peaks across the different global speaking rates may also suggest that, when tracking complex, spontaneously produced speech, the brain employs its own, internal signaling, instead of being passively driven by the frequency content of the input. 
Finally, stimulus intelligibility did not significantly vary across experimental conditions nor did the attention level of the participants, as assessed through the local attentional task of repetitive segment detection and the surprise multiple-choice questionnaire that probed the level of global stimulus comprehension. In addition, no significant differences were found either in the mean speaking rate between the constant and changing-rate categories or in the distributions of local speaking rate variations among the three global rates. Although it is not possible to completely rule out that attention to aspects of the speech stimulus that are not related to meaning per se may have affected the coherence between speech and cortical signals, we nevertheless suggest that the present findings reliably assess the neural correlates of the dual tracking mechanism of speaking rate during natural speech perception.
The observed audio–MEG coherence patterns offer new insights into the existing framework of cortical tracking of incoming speech signals as a mechanism supporting speech perception. In accordance with an emerging, more integrative view of speech processing (Alexandrou et al., 2017; Poeppel, Emmorey, Hickok, & Pylkkänen, 2012; Federmeier et al., 2008), cortical tracking of natural, connected speech extended beyond the previously reported emphasis on the temporal regions (Peelle et al., 2013; Ahissar et al., 2001) and was found to also engage the parietal regions bilaterally and the right paracentral lobule (see Giordano et al., 2017, for similar findings during perception of continuous, but not spontaneously produced speech). On a broader level, this finding supports the notion that perception is not a passive reaction to incoming stimuli but a highly constructive process that undergoes optimization in the auditory regions as well as in the parietal and paracentral regions (Morillon & Schroeder, 2015; Zion-Golumbic et al., 2013; Engel et al., 2001). Indeed, perception of real-life speech entails a continuous perceptual adjustment. First, we encounter a multitude of speakers on an everyday basis, with global speaking rates that demonstrate some individual variation around the habitual speaking rate of ∼5 Hz (Alexandrou et al., 2016; Tsao & Weismer, 1997). This natural interindividual variability is analogous to the experimentally generated intraindividual variations in global speaking rate examined here. Tuning in to faster or slower speakers requires an active adaptation on the listener's part (Dupoux & Green, 1997). It could be suggested that tracking the global speaking rate (characteristic of a given speaker) would afford a solid sensory basis onto which the more fine-grained predictions related to local variations in speaking rate are embedded.
In the auditory cortex, this presumed sensory basis has been thought to be represented by the frequency of neuronal spiking, which would be proportional to the speaking rate. For instance, a speaker demonstrating a high global speaking rate would induce faster neuronal spiking compared with a speaker who demonstrates a lower global speaking rate (Gütig & Sompolinsky, 2009). Second, building up-to-date predictions of future events based on previous input is paramount for adjusting to the dynamically changing, real-life communicative contexts, in which speech is typically heard only once (Schwartze et al., 2012). Thus, in this dynamic sensory mode, the robust evolutionary tuning is proposed to form a hard-wired perceptual basis, whereas predictive tuning continuously updates this basis as a function of expectation, attention, and sensory input.
This study is the first to examine the neural basis of tracking the speaking rate in perception of spontaneously produced natural connected speech. The present results indicate that cortical tracking of incoming speech is a multidimensional phenomenon. The tracking mechanism of speaking rate is dual in nature, manifesting two spatially and functionally distinct components of cortical tuning in speech perception: evolutionary tuning that is associated with global rhythmic structure and predictive tuning that is driven by local changes in speaking rate. These findings complement previous evidence and suggest that, during the perception of spontaneously produced natural connected speech, the functional role of cortical tracking of the acoustic amplitude envelope is not merely confined to achieving syllabic segmentation of the input but extends to temporal predictions and expectations.
This work was financially supported by the Academy of Finland (Grants 255349, 256459, and 283071 to R. S. and Grant 257576 to J. K.), the Alfred Kordelin Foundation (Grant 160143 to A. A.), the Emil Aaltonen Foundation (Grant 170011 N1 to A. A.), the Finnish Cultural Foundation (Grant 00170944 to T. S.), and the Sigrid Jusélius Foundation (grant to R. S.). MEG and MRI data were recorded at the Aalto NeuroImaging research infrastructure.
Reprint requests should be sent to Anna Maria Alexandrou, Department of Neuroscience and Biomedical Engineering, Aalto-yliopisto Perustieteiden korkeakoulu, Rakentajanaukio 2C, Aalto, 00076, Finland, or via e-mail: firstname.lastname@example.org.