Abstract

The temporal envelope of speech is important for speech intelligibility. Entrainment of cortical oscillations to the speech temporal envelope is a putative mechanism underlying speech intelligibility. Here we used magnetoencephalography (MEG) to test the hypothesis that phase-locking to the speech temporal envelope is enhanced for intelligible compared with unintelligible speech sentences. Perceptual “pop-out” was used to change the percept of physically identical tone-vocoded speech sentences from unintelligible to intelligible. The use of pop-out dissociates changes in phase-locking to the speech temporal envelope arising from acoustical differences between un/intelligible speech from changes in speech intelligibility itself. Novel and bespoke whole-head beamforming analyses, based on significant cross-correlation between the temporal envelopes of the speech stimuli and phase-locked neural activity, were used to localize neural sources that track the speech temporal envelope of both intelligible and unintelligible speech. Location-of-interest analyses were carried out in a priori defined locations to measure the representation of the speech temporal envelope for both un/intelligible speech in both the time domain (cross-correlation) and frequency domain (coherence). Whole-brain beamforming analyses identified neural sources phase-locked to the temporal envelopes of both unintelligible and intelligible speech sentences. Crucially, there was no difference in phase-locking to the temporal envelope of speech in the pop-out condition in either the whole-brain or location-of-interest analyses, demonstrating that phase-locking to the speech temporal envelope is not enhanced by linguistic information.

INTRODUCTION

The temporal envelopes of sounds contain important cues for auditory perception. Behavioral evidence suggests that the temporal envelope of speech is critical for speech intelligibility (e.g., Shannon, Zeng, Kamath, Wygonski, & Ekelid, 1995; Drullman, Festen, & Plomp, 1994). However, in the context of speech comprehension, the role of neural activity phase-locked to the speech temporal envelope remains controversial (for a review, see Peelle & Davis, 2012). Some researchers (e.g., Howard & Poeppel, 2010; Nourski et al., 2009) have argued that the role of phase-locking to the temporal envelope of speech is restricted to encoding acoustic information contained in the speech signal. Tracking the speech temporal envelope is undoubtedly a critical step in the process of speech comprehension but it remains unclear whether this mechanism is affected by linguistic information (e.g., Peelle & Davis, 2012). Recent work also claims that tracking the speech envelope in the phase of low-frequency oscillations plays a role in gating and constraining the transfer of information from sensory to higher-order regions (e.g., Zion Golumbic et al., 2013; Giraud & Poeppel, 2012; Ghitza, 2011). Others (e.g., Doelling, Arnal, Ghitza, & Poeppel, 2014; Giraud & Poeppel, 2012; Luo & Poeppel, 2007; Ahissar et al., 2001) have argued that the entrainment of the phase of theta-based neural oscillations is necessary for speech comprehension itself. Therefore, it is a matter of continuing debate whether phase-locking to the speech temporal envelope reflects the ability of the brain to process both the physical properties of speech sounds and linguistic information.

One of the difficulties in disentangling the contribution of phase-locking to the temporal envelope of speech to auditory and/or speech perception lies in the spectrotemporal complexity of speech and the use of appropriate control stimuli. The generation of control stimuli for speech intelligibility experiments that are adequately matched, in terms of their physical properties, is problematic. An alternative approach is to present physically identical speech items twice: first when the speech item is perceived as unintelligible, and then again following perceptual learning of the same speech item. This type of perceptual learning, which occurs rapidly and reliably, has been coined “pop-out” by Davis and colleagues (e.g., Davis, Johnsrude, Hervais-Adelman, Taylor, & McGettigan, 2005). In previous neuroimaging experiments perceptual pop-out has been achieved using sine-wave speech (e.g., Dehaene-Lambertz et al., 2005; Liebenthal, Binder, Piorkowski, & Remez, 2004) or noise-vocoded speech (e.g., Sohoglu, Peelle, Carlyon, & Davis, 2012; Giraud et al., 2004). Speech vocoders can be used to manipulate the spectrotemporal information present in speech signals. To generate pop-out using vocoded speech, a given speech item is processed through a vocoder containing a low number of vocoder channels. In the first instance, such a vocoded stimulus is unintelligible. Once participants have heard (or read) the original version of the same speech item, the previously unintelligible speech sentence becomes intelligible, that is, pop-out occurs.

The pop-out approach allows the comparison of two presentations of identical stimuli, where only the second presentation yields speech comprehension. The advantage of the pop-out approach is that it presents an opportunity to identify changes in phase-locking to the speech envelope that result from speech intelligibility, rather than physical differences between un/intelligible speech stimuli. However, as the perceptual change in speech perception always occurs following the second presentation, the pop-out approach is confounded by temporal order effects (e.g., Giraud et al., 2004), which should be controlled for through appropriate contrasts at the stage of data analysis.

One of the aims of this study was to determine whether phase-locking to the speech temporal envelope is enhanced when speech is intelligible (Peelle, Gross, & Davis, 2013; Ahissar et al., 2001). In contradistinction to previous studies using noise-vocoded speech to examine neural speech envelope tracking (Doelling et al., 2014; Peelle et al., 2013), here tone-vocoding was used to generate perceptual pop-out. Tone-vocoded and noise-vocoded speech share some properties, but there are behavioral differences in the perception of tone-vocoded and noise-vocoded speech (e.g., Stone, Füllgrabe, & Moore, 2008; Whitmal, Poissant, Freyman, & Helfer, 2007). For example, the intelligibility of tone-vocoded sentences is 13% better than noise-vocoded sentences under quiet listening conditions (Whitmal et al., 2007). The superior performance obtained with tone-vocoded speech has been attributed to the low-frequency inherent fluctuations in the noise bands used to generate noise-vocoded speech. It has been suggested that these low-frequency modulations interfere with or mask the modulations in the speech envelope (Stone et al., 2008; Whitmal et al., 2007) and that tone-vocoders provide more faithful representations of the temporal envelope of speech (Whitmal et al., 2007). When amplitude modulation (i.e., a temporal envelope) is imposed on noise carriers, the intrinsic fluctuations present in a stochastic carrier interact with the modulation frequencies present in the imposed temporal envelope (e.g., Dau, Verhey, & Kohlrausch, 1999). The interaction between the temporal envelope of the signal of interest and these intrinsic fluctuations is particularly relevant for narrowband noise carriers. Modulation detection thresholds for modulation frequencies less than half the bandwidth of a narrowband noise carrier are most affected by these intrinsic fluctuations (e.g., Dau et al., 1999).

The potential influence of the intrinsic fluctuations present in noise carriers on noise-vocoded speech is evident in the results presented in Whitmal et al. (2007), for example, see their Figure 6. Whitmal et al. (2007) showed that the spectra of AC-coupled temporal envelope waveforms of both wideband (901-Hz bandwidth) and narrowband (100-Hz bandwidth) Gaussian noise carriers contain inherent fluctuations that are not present for either tonal carriers or low-noise noise carriers. Whitmal et al. (2007) also demonstrated how these fluctuations affect temporal processing through simulated modulation detection thresholds (Dau et al., 1999). The analyses presented by Whitmal et al. (2007) show that when a narrowband noise was used as the carrier, the carrier temporal envelope power was concentrated at low modulation frequencies. Consequently, the predicted modulation detection thresholds for low modulation frequencies were increased (Whitmal et al., 2007). Low modulation frequencies are most important for speech intelligibility (Drullman et al., 1994), and therefore, the speech temporal envelope will be most affected by spurious fluctuations in noise-vocoder bands with narrow bandwidths, that is, vocoder bands with low center frequencies. The inherent low-frequency fluctuations in noise-vocoded speech are of particular concern in neuroimaging experiments designed to explicitly examine the relationship between the speech temporal envelope and neural activity recorded in response to noise-vocoded speech (Doelling et al., 2014; Peelle et al., 2013).

In this study, neural activity was recorded using MEG while participants listened to un/intelligible tone-vocoded speech sentences. The data were analyzed using a bespoke beamformer-based approach to maximize the potential of identifying a difference in the entrainment of theta oscillations to both unintelligible and intelligible speech sentences. To this end, multiple metrics of temporal envelope tracking were applied to the MEG data and both whole-head and location-of-interest (LOI; Millman, Prendergast, Hymers, & Green, 2013) analyses were carried out.

First, whole-head MEG beamformer analyses, tailored to calculate time domain cross-correlation between the measured MEG responses and the temporal envelopes of un/intelligible speech sentences, were used to determine where reliable phase-locking to the temporal envelope of speech occurs in the brain. This information can be used to inform and constrain the functionality of models of speech perception. Previous work suggests that areas in both primary and nonprimary auditory cortex (Nourski et al., 2009, 2013; Zion Golumbic et al., 2013) phase-lock to the speech temporal envelope. In addition, higher-order brain regions (Zion Golumbic et al., 2013), including bilateral middle temporal gyri (MTG), inferior frontal gyri (IFG), and motor cortex (Peelle et al., 2013), are also capable of phase-locking to the speech temporal envelope. The analyses described here, including time domain cross-correlation analyses, whole-head beamforming, statistical thresholding, and optimized orientation of beamformer spatial filters, represent a first step toward comprehensive inverse-modeling specifically focused on investigating mechanisms underlying auditory temporal processing.

Second, we wanted to test the hypothesis that enhanced phase-locking to the speech temporal envelope occurs during speech comprehension (e.g., Peelle et al., 2013; Ahissar et al., 2001). To this end, we used the pop-out effect to determine whether phase-locking to the temporal envelope of speech is enhanced for intelligible speech relative to physically identical unintelligible speech. Here intelligible speech is defined as “speech that can be understood and repeated” (e.g., Scott, 2012). Whole-head MEG beamformer analyses, based on cross-correlation between the speech temporal envelope and the phase-locked responses, were used to localize differences in phase-locking to the temporal envelope of identical un/intelligible speech sentences, with the unintelligible version serving as its own perfect internal control. When the role of tracking the speech temporal envelope is focused on speech intelligibility, that is, a contrast of the responses to intelligible speech and unintelligible nonspeech analogues, previous work suggests there is enhanced phase-locking of theta oscillations in areas outside of auditory cortex in the left MTG and left ventral inferior frontal cortex (Peelle et al., 2013).

Third, in addition to the whole-brain beamformer analyses, MEG virtual electrode analyses were used to reconstruct the time series in LOIs within the speech perception network (left Heschl's gyrus [HG], right HG, left MTG, left anterior superior temporal gyrus [STG], left IFG). Two of these LOIs were based on brain regions (left HG and left MTG) where Peelle et al. (2013) found a difference in theta-based coherence for unintelligible and intelligible speech in their “ROI” analyses. For each LOI, the representation of the temporal envelopes of unintelligible and intelligible speech in phase-locked neural activity was assessed using both cross-correlation (time domain; Millman et al., 2013; Nourski et al., 2009; Abrams, Nicol, Zecker, & Kraus, 2008; Ahissar et al., 2001) and coherence (frequency domain; Doelling et al., 2014; Peelle et al., 2013).

METHODS

Participants

Sixteen right-handed participants (10 men) who were native English speakers took part in this experiment. The mean age of the participants was 29.2 years (SD = 7.8 years, range = 20–48 years). The participants reported normal hearing and no history of neurological disorders.

Speech Stimuli

Simple, short-duration sentences (BKB/IHR corpus; e.g., Foster et al., 1993; MacLeod & Summerfield, 1987) spoken by an adult British English man were used as the speech stimuli. The sentences were “The kettle boiled quickly” (always Unintelligible), “The floor was quite slippery” (the Pop-out sentence), and “She ironed her skirt” (always Intelligible). The duration of each speech sentence was approximately 1.5 sec. The duration of each epoch was increased to 2.5 sec through the addition of approximately 1 sec of silence to the end of each sentence. Stimuli were delivered diotically to participants via Etymotic insert earphones (Etymotic Research ER30, Elk Grove Village, IL) at a comfortable sound level.

Noise-vocoders versus Tone-vocoders

Two types of vocoder processing are commonly used in speech perception experiments: noise-vocoders and tone-vocoders. Noise-vocoded and tone-vocoded speech are generated using similar signal processing methodology. The speech signal is first filtered into a given number of channels or bands (typically between 1 and 16 bands). The temporal envelope of each band is extracted, and this extracted envelope is then used to modulate a sinusoidal carrier with a frequency equal to the center frequency of the vocoder channel (tone-vocoding) or modulate a noise carrier with a center frequency and bandwidth equal to the vocoder channel (noise-vocoding). The signals within each channel are filtered again and then combined to create the vocoded speech.

Creation of Tone-vocoded Speech

Tone-vocoded speech sentences were created using custom programs in MATLAB (The MathWorks, Natick, MA). A vocoder with a low number of channels was used so that the vocoded stimuli remained unintelligible until after training. Vocoding was carried out using a three-band tone-vocoder with band cutoff frequencies logarithmically spaced between 80 and 8000 Hz and carrier frequencies of 225, 1047, and 4861 Hz. The temporal envelopes at the output of each band were extracted using half-wave rectification and smoothing. The cutoff frequency of the low-pass filter used to smooth the extracted temporal envelope varied depending on the center frequency of the band; the cutoff frequency was half the equivalent rectangular bandwidth (e.g., Moore & Glasberg, 1983) of each band (24, 68, and 274 Hz for each of the carrier frequencies, respectively). The temporal envelopes extracted from each band were combined to form the broadband speech temporal envelope for each sentence.
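
The vocoding steps described above can be sketched in Python. This is a minimal illustration rather than the authors' MATLAB code: the sampling rate, the fourth-order Butterworth filters, and the zero-phase filtering are assumptions; the band edges, carrier frequencies, and envelope-smoothing cutoffs follow the text.

```python
# Minimal three-band tone-vocoder sketch. Filter design choices (4th-order
# Butterworth, zero-phase sosfiltfilt) and the 32-kHz sampling rate are
# assumptions; band edges, carriers, and smoothing cutoffs follow the text.
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 32000                                                  # sampling rate (Hz), assumed
BAND_EDGES = np.logspace(np.log10(80), np.log10(8000), 4)   # 3 log-spaced bands
CARRIERS = [225.0, 1047.0, 4861.0]                          # tonal carrier frequencies (Hz)
SMOOTH_CUTOFFS = [24.0, 68.0, 274.0]                        # half-ERB low-pass cutoffs (Hz)

def tone_vocode(signal, fs=FS):
    """Return (tone-vocoded signal, broadband temporal envelope)."""
    out = np.zeros_like(signal, dtype=float)
    broadband_env = np.zeros_like(signal, dtype=float)
    t = np.arange(len(signal)) / fs
    for (lo, hi), fc, cut in zip(zip(BAND_EDGES[:-1], BAND_EDGES[1:]),
                                 CARRIERS, SMOOTH_CUTOFFS):
        # 1. filter the speech into this band
        sos = butter(4, [lo, hi], btype='band', fs=fs, output='sos')
        band = sosfiltfilt(sos, signal)
        # 2. extract the envelope: half-wave rectification + low-pass smoothing
        rect = np.maximum(band, 0.0)
        sos_lp = butter(4, cut, btype='low', fs=fs, output='sos')
        env = np.maximum(sosfiltfilt(sos_lp, rect), 0.0)
        # 3. modulate a tonal carrier at the band's carrier frequency
        out += env * np.sin(2 * np.pi * fc * t)
        broadband_env += env
    return out, broadband_env
```

Summing the three band envelopes, as in the last line of the loop, yields the broadband speech temporal envelope used in the cross-correlation analyses.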

Experimental Design

Without training, tone-vocoded speech generated with a small number of bands (i.e., three bands) is unintelligible. Pop-out occurs after participants are exposed to the original speech, when the previously unintelligible tone-vocoded version becomes intelligible. The critical advantage of the pop-out condition is that the unintelligible version of the tone-vocoded sentence serves as its own perfect acoustical control for the intelligible version of the same sentence. As acknowledged in Peelle and Davis (2012), the specific methods used to vary the intelligibility of speech will impact the interpretation of experiments on speech intelligibility. Unfortunately, all nonspeech analogues used as control stimuli in speech intelligibility experiments are open to criticism. Therefore a pop-out paradigm is the best way to ensure that low-level acoustical differences between speech and nonspeech analogues do not influence results.

Three sentences (see Speech Stimuli section) were used in the MEG experiment. The MEG scanning took place over two data acquisition “blocks,” and throughout the course of the experiment, the intelligibility of the “Intelligible” and “Pop-out” sentences was manipulated through exposure of the participant to the original unprocessed sentence (see Figure 1): One sentence was always unintelligible (Unintelligiblepre/Unintelligiblepost), as participants were not exposed to the original version of this sentence; one sentence was the Pop-out sentence. This sentence was unintelligible during the first experimental block (Pop-outpre). During the “training” phase of the experiment, participants listened to the original unprocessed version and the three-band tone-vocoded version of the Pop-out sentence in succession until they were confident they could understand the tone-vocoded version. Perceptual learning of the tone-vocoded sentences was usually accomplished after hearing two to three repeats of the original and tone-vocoded version. The Pop-out sentence was therefore intelligible for the second half of the experiment (Pop-outpost); the “Intelligible” sentence was “revealed” before the first MEG block and was therefore always intelligible. In addition to the test sentence material, there were also some silent epochs (2.5-sec duration) and catch trials (2.5-sec duration) presented during each block. During a catch trial, participants were played an auditory cue, which prompted them to respond, using a button box, and indicate whether the last sound they heard was either an unintelligible or intelligible sentence. These intelligibility ratings were used to determine whether the manipulation of the Pop-out condition resulted in the desired effect, that is, Pop-outpre (before training) were unintelligible but Pop-outpost (after training) were intelligible. A technical fault resulted in an inadequate number of responses being recorded for one participant during the second “block.”

Figure 1. 

The design of the MEG experiment. Two speech conditions are illustrated here: Pop-out and Unintelligible speech. Perceptual pop-out was used to change the percept of the Pop-out condition from unintelligible to intelligible speech. Two runs were carried out within a single MEG recording session. During the first block (“Before”) both the Pop-out (Pop-outpre) and Unintelligible (Unintelligiblepre) speech stimuli were perceived as unintelligible speech. Between the first and second blocks, participants were trained (“Training”) on the Pop-out stimulus so that the Pop-out stimulus (Pop-outpost) was perceived as intelligible speech only during the second MEG block (“After”). During the “Training” phase, the participants were not exposed to the unprocessed version of the Unintelligible condition, and therefore, following the training phase, the Unintelligible speech (Unintelligiblepost) remained unintelligible throughout the experiment.


The number of stimuli presented during each MEG “block” was as follows: The Unintelligible sentence was presented 100 times during each block; the Pop-out sentence was presented 100 times during each block; the sentence that was always Intelligible was presented 50 times during each block; the silent epochs were presented 50 times during each block; the catch trials were presented 25 times per block.

MEG Recording

Data were collected using a Magnes 3600 whole-head 248-channel magnetometer (4-D Neuroimaging, Inc., San Diego, CA). The data were recorded at a sample rate of 678.17 Hz and were low-pass filtered online with a cutoff frequency of 200 Hz.

Before recording, individual facial and scalp landmarks (left and right preauricular points, Cz, nasion, and inion) were spatially coregistered using a Polhemus Fastrak System. The landmark locations in relation to the sensor positions were derived on the basis of a precise localization signal provided by five spatially distributed head coils with a fixed spatial relation to the landmarks. These head coils provided a measurement of a participant's head movement at the beginning and end of each data acquisition block.

To carry out artifact rejection, the raw data from each epoch were inspected visually. Epochs contaminated with either physiological or nonphysiological artifacts were manually removed.

Coregistration

For the source-space analyses, the landmark locations were matched with the individual participants' anatomical magnetic resonance (MR) scans using a surface-matching technique adapted from Kozinska, Carducci, and Nowinski (2001). T1-weighted MR images were acquired with a GE 3.0-T Signa Excite HDx system (General Electric, Milwaukee, WI) using an eight-channel head coil and a 3-D Fast Spoiled Gradient Recall sequence: repetition time/echo time/flip angle = 8.03 msec/3.07 msec/20°, spatial resolution of 1.13 mm × 1.13 mm × 1.0 mm, with an acquisition matrix of 256 × 256 and 176 contiguous slices. For the group beamformer analyses, the individuals' data were spatially normalized to the Montreal Neurological Institute (MNI) standard brain, based on the average of 152 individual T1-weighted structural images (Evans, Collins, Mills, Brown, & Kelly, 1993). The beamformer grid for each participant was initially defined in MNI space and linearly transformed back to individual MRIs. Beamforming analyses were carried out on these transformed grids, and the subsequent t maps were transformed back into MNI space before group statistics were calculated.

Beamformer-based Analyses of Phase-locking to the Speech Temporal Envelope

An MEG beamformer estimates the contribution of a given grid point in the brain to the signal measured at the MEG sensors. Independent beamformers (spatial filters) are constructed for each grid point. Each beamformer is an optimal spatial filter dedicated to a given grid point. The outputs of these spatial filters are often termed “virtual electrodes.”

Conventional beamformer localizations are based on differences in either power or coherence across experimental conditions. One of the aims of this study was to determine which brain regions phase-lock to the temporal envelope of speech using bespoke beamforming analyses tailored to address this specific issue. In previous work (Millman et al., 2013), we developed methods to measure statistically significant phase-locking to speech temporal envelopes at given beamformer grid points, that is, for the time series of individual virtual electrodes. In the present work, one of the aims was to produce volumetric whole-brain beamformer images to show (1) which parts of the brain respond to speech by phase-locking (as measured by cross-correlation) to the temporal envelope of speech and (2) which brain regions show enhanced phase-locking (as measured by cross-correlation) to intelligible speech (Pop-outpost) cf. unintelligible (Pop-outpre, Unintelligiblepre, Unintelligiblepost) speech (Peelle et al., 2013). Another aim of the present work was to conduct LOI analyses (based on both cross-correlation and coherence) in several brain locations (left HG, right HG, left MTG, left anterior STG, and left IFG) in the speech perception network (e.g., Hickok & Poeppel, 2007).

In this study a vectorized linearly constrained minimum-variance beamformer (Huang et al., 2004; Van Veen, van Drongelen, Yuchtman, & Suzuki, 1997) was used to obtain the spatial filters with a multiple-sphere head model (Huang, Mosher, & Leahy, 1999). The beamformer grid size was 5 mm. The three orthogonal spatial filters were implemented as a single 3-D system (see Johnson, Prendergast, Hymers, & Green, 2011).

When constructing a spatial filter, the orientation of the signals used to generate the forward model are of critical importance (Johnson et al., 2011), and yet in the majority of beamforming analyses, this issue is overlooked. In the present work, the orientation of the spatial filters was optimized to yield the greatest cross-correlation coefficient (used for both whole-brain and LOI analyses; see Methods, Beamformer Localizations Based on Cross-correlation between the Temporal Envelope of Speech (Unintelligible/Pop-out Conditions) and Phase-locked Neural Activity, Beamformer Localizations Based on Significant Differences in the Cross-correlations between Activity Phase-locked to the Temporal Envelopes of Unintelligible and Pop-out Speech Conditions, and LOI Analyses to Measure Differences in Phase-locking to the Speech Temporal Envelope in Given Brain Locations Based on Cross-correlation (Time Domain) and Coherence (Frequency Domain)) or coherence value (used for LOI analyses only; see Methods, LOI Analyses to Measure Differences in Phase-locking to the Speech Temporal Envelope in Given Brain Locations Based on Cross-correlation (Time Domain) and Coherence (Frequency Domain)). The advantage of this approach is that the inverse model used to characterize the response is specifically tuned to the research question posed. The orientation of the spatial filter was sampled from a spherical shell to find the optimal direction, that is, the direction with the greatest cross-correlation coefficient or coherence value, for each condition independently. The orientation in which the cross-correlation coefficient or coherence value was the greatest was used in subsequent analyses.
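
The orientation search described above can be sketched as follows, assuming access to the three orthogonal virtual-electrode time series at a grid point. The Fibonacci sampling of the spherical shell, the sampling density, and the simplified full-lag cross-correlation are all assumptions made for illustration; the paper does not specify these details.

```python
# Sketch of the spatial-filter orientation search: project the 3-D virtual-
# electrode output onto candidate orientations sampled from a spherical shell
# and keep the orientation that maximizes the cross-correlation with the
# stimulus envelope. Fibonacci sampling and n_orient=200 are assumptions.
import numpy as np

def fibonacci_sphere(n=200):
    """Return n roughly uniform unit vectors on the sphere."""
    i = np.arange(n)
    phi = np.pi * (3.0 - np.sqrt(5.0)) * i          # golden-angle increment
    z = 1.0 - 2.0 * (i + 0.5) / n
    r = np.sqrt(1.0 - z * z)
    return np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=1)

def best_orientation(ve_xyz, envelope, n_orient=200):
    """ve_xyz: (3, n_samples) virtual-electrode time series.
    Returns (orientation, peak |cross-correlation|)."""
    best_u, best_r = None, -np.inf
    env = (envelope - envelope.mean()) / envelope.std()
    for u in fibonacci_sphere(n_orient):
        ts = u @ ve_xyz                              # project onto this orientation
        ts = (ts - ts.mean()) / ts.std()
        # peak |cross-correlation|, normalized by series length (a
        # simplification; the paper restricts response lags to 0-150 msec)
        r = np.abs(np.correlate(ts, env, mode='full')).max() / len(env)
        if r > best_r:
            best_u, best_r = u, r
    return best_u, best_r
```

In the actual analyses this search was run independently for each condition, and the winning orientation was carried forward to the subsequent cross-correlation or coherence computation.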

Quantification of Phase-locking to the Speech Temporal Envelope

There are two published methods that directly measure (cf. the “phase-dissimilarity index,” e.g., Luo & Poeppel, 2007) the relationship between the speech temporal envelope and noninvasive/invasive recordings of phase-locked activity: cross-correlation in the time domain (e.g., Abrams et al., 2008; Ahissar et al., 2001) and coherence in the frequency domain (Doelling et al., 2014; Peelle et al., 2013). Experiments using measures based on cross-correlation typically involve repeated presentation of the same speech item (e.g., Millman et al., 2013; Abrams et al., 2008) or a small number of speech items with similar temporal envelopes (Nourski et al., 2009; Ahissar et al., 2001). On the other hand, theta-based coherence (4–7 Hz) can be used to measure a more general phase-locked representation of different speech sentences (Peelle et al., 2013), if it is assumed that the speech envelope is best represented by phase-locked theta activity throughout the brain. One of the aims of the current study was to conduct whole-head beamforming analyses to establish which brain areas respond to speech by phase-locking to the speech temporal envelope. Therefore, we chose to use cross-correlation in the whole-brain beamformer analyses (see Methods, Beamformer Localizations Based on Cross-correlation between the Temporal Envelope of Speech (Unintelligible/Pop-out Conditions) and Phase-locked Neural Activity and Beamformer Localizations Based on Significant Differences in the Cross-correlations between Activity Phase-locked to the Temporal Envelopes of Unintelligible and Pop-out Speech Conditions) and both cross-correlation and coherence in the LOI analyses (see Methods, LOI Analyses to Measure Differences in Phase-locking to the Speech Temporal Envelope in Given Brain Locations Based on Cross-correlation (Time Domain) and Coherence (Frequency Domain)).
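
The frequency-domain measure can be illustrated with magnitude-squared coherence averaged over the theta band. This is a generic sketch using `scipy.signal.coherence`, not the authors' pipeline; the window length and overlap are assumptions.

```python
# Illustrative frequency-domain metric: magnitude-squared coherence between
# the speech envelope and a neural time series, averaged over the theta band
# (4-7 Hz). The 1-s Welch windows are an assumption, not taken from the paper.
import numpy as np
from scipy.signal import coherence

def theta_coherence(envelope, neural, fs, band=(4.0, 7.0)):
    """Mean magnitude-squared coherence within `band` (Hz)."""
    f, cxy = coherence(envelope, neural, fs=fs, nperseg=int(fs))  # 1-s windows
    mask = (f >= band[0]) & (f <= band[1])
    return float(cxy[mask].mean())
```

A neural series that tracks the envelope at theta rates yields a markedly higher value than an unrelated series, which is the contrast exploited in the LOI analyses.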

Beamformer Localizations Based on Cross-correlation between the Temporal Envelope of Speech (Unintelligible/Pop-out Conditions) and Phase-locked Neural Activity

For the whole-brain analyses, the representation of the stimulus temporal envelopes in phase-locked activity was quantified in the time domain using cross-correlation analysis (e.g., Nourski et al., 2009; Abrams et al., 2008). The speech temporal envelope was downsampled to match the sampling rate of the MEG data. Beamformer covariance matrices were generated with a temporal window commensurate with the mean duration of the speech sentences (0–1500 msec) and bandpass filtered (1–50 Hz) to retain the speech temporal envelope (Rosen, 1992). Cross-correlations between the temporal envelope of the speech sentences and phase-locked activity were performed using customized scripts in Python (www.python.org/) during the “envelope-following period” (Millman et al., 2013; Abrams et al., 2008). The “envelope-following period” was defined as 250–1500 msec poststimulus presentation. The absolute value of the peak in the cross-correlation function was found for lags of the response between 0 and 150 msec (Millman et al., 2013; Nourski et al., 2009).
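
The time-domain metric can be sketched as follows, assuming both inputs have already been restricted to the envelope-following period and sampled at the same rate. The normalization scheme is an illustrative choice, not a detail given in the text.

```python
# Sketch of the time-domain phase-locking metric: peak |cross-correlation|
# between the stimulus envelope and the phase-locked (trial-averaged)
# response, restricted to response lags of 0-150 msec.
import numpy as np

def peak_xcorr(envelope, response, fs, max_lag_s=0.150):
    """Peak |normalized cross-correlation| for lags 0..max_lag_s (response
    lagging the envelope). Returns (peak value, peak lag in seconds)."""
    env = (envelope - envelope.mean()) / envelope.std()
    resp = (response - response.mean()) / response.std()
    n = len(env)
    max_lag = int(round(max_lag_s * fs))
    # correlate the envelope with progressively delayed copies of the response
    r = [np.dot(env[:n - lag], resp[lag:]) / (n - lag)
         for lag in range(max_lag + 1)]
    r = np.abs(np.array(r))
    peak_lag = int(np.argmax(r))
    return r[peak_lag], peak_lag / fs
```

The returned peak values would then be Fisher-transformed before the statistical analyses described below.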

Statistical analyses of representations of the speech temporal envelopes in the recorded MEG responses were carried out in whole-brain beamformer analyses to determine where significant phase-locking to the temporal envelope of speech occurs in the brain. Cross-correlation values were Fisher-transformed before statistical analysis.

Nonparametric permutation tests (e.g., Nichols & Holmes, 2002), based on 1000 permutations, were used to determine whether the phase-locked representations of the speech temporal envelopes were statistically significant based on peak cross-correlation values. The null hypothesis stated that there was no representation of the speech temporal envelopes of the stimuli in the recorded phase-locked activity.

For the whole-head beamforming analyses, the permutation scheme was based on previous work (Millman et al., 2013; Nourski et al., 2013) where phase inversion was used to create the nulls. Nulls were created by phase-inverting the phase-locked component of the response in 50% (selected at random) of the total number of trials. Such an approach has been shown to be an appropriate method for thresholding phase-locked responses (Prendergast, Johnson, Hymers, & Green, 2011). If no genuine phase-locked response to the temporal envelope of the target speech sentence exists, then the cross-correlation values for the nulls should be similar to the observed cross-correlation value. If a significant phase-locked response to the target speech sentence were present, the observed cross-correlation value should fall in the tail of the data-driven null distribution (greater than or equal to the 95th percentile).
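
The phase-inversion permutation scheme can be sketched in Python. This is a simplified single-sensor illustration, with the test statistic passed in as a function; the trial counts and statistic here are placeholders, not the paper's values.

```python
# Sketch of the permutation test: build a null distribution by phase-inverting
# (sign-flipping) a random 50% of trials before averaging, then recomputing
# the envelope-tracking statistic. The observed value is deemed significant
# if it reaches the 95th percentile of the null distribution.
import numpy as np

def permutation_pvalue(trials, envelope, stat, n_perm=1000, seed=0):
    """trials: (n_trials, n_samples) array; stat(avg, envelope) -> scalar."""
    rng = np.random.default_rng(seed)
    observed = stat(trials.mean(axis=0), envelope)
    n_trials = trials.shape[0]
    nulls = np.empty(n_perm)
    for i in range(n_perm):
        signs = np.ones(n_trials)
        flip = rng.choice(n_trials, size=n_trials // 2, replace=False)
        signs[flip] = -1.0                       # invert half the trials
        nulls[i] = stat((signs[:, None] * trials).mean(axis=0), envelope)
    # one-tailed p value: proportion of nulls at or above the observed value
    p = (np.sum(nulls >= observed) + 1) / (n_perm + 1)
    return observed, nulls, p
```

Sign-flipping half the trials cancels any genuine phase-locked component in the average while preserving the non-phase-locked signal, which is why the nulls provide an appropriate threshold for phase-locked responses.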

Beamformer Localizations Based on Significant Differences in the Cross-correlations between Activity Phase-locked to the Temporal Envelopes of Unintelligible and Pop-out Speech Conditions

The use of the Pop-out condition allows for perfect acoustical control between the intelligible speech condition (Pop-outpost) and its unintelligible counterparts (Pop-outpre, Unintelligiblepre, Unintelligiblepost). However, the pop-out paradigm introduces other experimental confounds, such as temporal order and other nonspecific effects, which must be considered in the analysis of such an experimental design. Here, a contrast based on the “response change index” (Hegdé & Kersten, 2010) was used to determine whether there is an enhancement of phase-locking to the temporal envelope of speech for intelligible (Pop-outpost) cf. unintelligible speech (Pop-outpre, Unintelligiblepre, Unintelligiblepost; Peelle et al., 2013). The response change for the main contrast of interest (Pop-outpost − Pop-outpre) was adjusted by the corresponding response change in the control condition (Unintelligiblepost − Unintelligiblepre), that is, the phase-locking change index = [(Pop-outpost − Pop-outpre) − (Unintelligiblepost − Unintelligiblepre)].
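The adjustment above is a simple arithmetic contrast; a minimal sketch (the example values are hypothetical, not data from the study):

```python
def phase_locking_change_index(popout_post, popout_pre, unint_post, unint_pre):
    # Change in the Pop-out condition adjusted by the change in the
    # Unintelligible control condition. Inputs are assumed to be
    # Fisher-transformed peak cross-correlation coefficients.
    return (popout_post - popout_pre) - (unint_post - unint_pre)
```

A nonspecific training effect that raises phase-locking equally in both conditions cancels out: for instance, equal pre-to-post increases of 0.10 in both conditions give an index of 0, whereas an extra intelligibility-specific gain in Pop-outpost would leave a positive index.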

A custom script, written in Python, was used to calculate the phase-locking change index. Cross-correlation coefficients were calculated for each condition and then Fisher-transformed before statistical analysis. Nonparametric permutation tests (e.g., Nichols & Holmes, 2002), based on 1000 permutations, were used to determine whether the phase-locked representations of the speech temporal envelopes, based on peak cross-correlation coefficients, were significantly different for intelligible (Pop-outpost) versus unintelligible (Pop-outpre, Unintelligiblepre, Unintelligiblepost) speech conditions. Phase inversion was used to generate the nulls. The null hypothesis stated that there was no significant difference in phase-locking to the speech temporal envelopes for intelligible (Pop-outpost) versus unintelligible (Pop-outpre, Unintelligiblepre, Unintelligiblepost) speech sentences. If no genuine difference in phase-locked response to the temporal envelope of the intelligible speech (Pop-outpost) exists, then the cross-correlation coefficient for the unintelligible speech (Pop-outpre, Unintelligiblepre, Unintelligiblepost) should be similar to the observed cross-correlation coefficient for intelligible speech (Pop-outpost). If a significant phase-locked response to the intelligible speech (Pop-outpost) were present, the observed cross-correlation coefficient should fall in the tail of the data-driven null distribution (greater than or equal to the 95th percentile).

LOI Analyses to Measure Differences in Phase-locking to the Speech Temporal Envelope in Given Brain Locations Based on Cross-correlation (Time Domain) and Coherence (Frequency Domain)

One of the aims of this study was to analyze LOIs within several key locations in the speech perception network (e.g., Hickok & Poeppel, 2007): We used LOIs in left posteromedial HG, right posteromedial HG, left anterior STG (BA 22), left MTG, and left IFG (pars opercularis, BA 44). Three of the LOIs (left HG, right HG, and left MTG) were chosen to be similar in location to those examined in Peelle et al. (2013). As in Millman et al. (2013), the LOIs were manually seeded in left HG and right HG based on each individual participant's anatomy, because the anatomy of HG varies considerably among individuals (e.g., Rademacher et al., 2001). MNI coordinates were obtained from the Harvard–Oxford probabilistic cortical atlas for the LOI analyses in left anterior STG [−58, 0, −8] and left IFG [−56, 14, 12]. The MNI coordinates reported by Peelle et al. (2013) [−60, −16, 8] were used for the LOI analysis in left MTG. MNI coordinates were transformed back into individuals' anatomical space, and virtual electrodes were generated from these locations.

Virtual electrodes (VEs) were used to estimate the source activity at given LOIs, using methods described in Millman, Prendergast, Kitterick, Woods, and Green (2010). The LOIs were the outputs of the spatial filters at given single beamformer grid points (Millman et al., 2013). VEs were reconstructed using a bandwidth of 1–50 Hz, encompassing the range of modulation frequencies in the speech temporal envelope as defined by Rosen (1992). The VE time window for the speech conditions was 0–1500 msec poststimulus presentation, that is, immediately following stimulus onset. However, measures of temporal envelope representation in phase-locked activity were restricted to the “envelope-following period” (Abrams et al., 2008) from 250 to 1500 msec following stimulus presentation.

For the LOI analyses, the cross-correlation coefficients and coherence values obtained with these orientation-optimized VEs were compared with those obtained using the standard practice of selecting the VE orientation from the primary PCA component. LOI analyses were carried out in MATLAB using customized scripts. Phase-locked representations of the temporal envelope of intelligible (Pop-outpost) and unintelligible (Pop-outpre, Unintelligiblepre, Unintelligiblepost) speech sentences were assessed using both cross-correlation in the time domain and coherence in the frequency domain for the “envelope-following period” only (Abrams et al., 2008). For the cross-correlation analyses, following Abrams et al. (2008), the MATLAB “xcov” function was used to find the peak cross-correlation value for lags between 0 and 150 msec (Millman et al., 2013; Nourski et al., 2009). For the coherence analyses, the MATLAB function “mscohere” was used to measure the mean coherence for frequencies from 4 to 7 Hz (Peelle et al., 2013). Both the cross-correlation coefficients and coherence values were first Fisher-transformed before they were independently subjected to a four-way repeated-measures ANOVA, including VE orientation selection (optimized/PCA), condition (Pop-out/Unintelligible), location (left HG/right HG/left MTG/left anterior STG/left IFG), and experimental block (pre/post training) as factors.
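A Python analogue of the MATLAB coherence analysis, with scipy.signal.coherence standing in for “mscohere,” might look like the sketch below; the segment length is an illustrative choice, not a parameter reported in the study.

```python
import numpy as np
from scipy import signal

def theta_band_coherence(envelope, ve_response, fs, band=(4.0, 7.0)):
    """Mean magnitude-squared coherence between the speech temporal
    envelope and a virtual-electrode response across the theta band
    (4-7 Hz), analogous to the MATLAB `mscohere` analysis in the text.
    """
    # ~1-s segments give 1-Hz frequency resolution, so the 4-7 Hz band
    # contains several coherence bins to average over (illustrative choice)
    nperseg = min(len(envelope), int(fs))
    f, cxy = signal.coherence(envelope, ve_response, fs=fs, nperseg=nperseg)
    mask = (f >= band[0]) & (f <= band[1])
    return float(cxy[mask].mean())
```

As with the cross-correlation coefficients, the resulting coherence values would be Fisher-transformed before entering the repeated-measures ANOVA.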

RESULTS

Behavioral Intelligibility Ratings

The responses made during the catch trials were analyzed to confirm that the intelligibility manipulation of the Pop-out condition had the intended effect. The intelligibility ratings for the Pop-outpre (mean = 15.7%, SD = 34%), Unintelligiblepre (mean = 11.1%, SD = 17.3%), and Unintelligiblepost (mean = 17.8%, SD = 28.9%) were low. As expected, the intelligibility ratings for Pop-outpost (mean = 93.5%, SD = 15.2%) were significantly greater (p < .05, paired t test) than the intelligibility ratings for Pop-outpre.

Whole-brain Analyses: Spatiotemporal Correlates of Phase-locking to the Temporal Envelope of Speech

Whole-brain beamforming analyses, based on significant cross-correlations between the speech temporal envelope and phase-locked neural activity, were conducted for both unintelligible (Pop-outpre, Unintelligiblepre, Unintelligiblepost) and intelligible (Pop-outpost) speech conditions. The results of these analyses were used to determine where the speech temporal envelope is significantly (p < .01, whole-brain corrected) represented in the brain by phase-locked activity. The results for these whole-brain beamforming analyses are shown in Figure 2.

Figure 2. 

Beamformer source localizations showing significant (p < .01, whole-brain corrected) phase-locking to the speech temporal envelope for each speech condition (Pop-outpre, Unintelligiblepre, Pop-outpost and Unintelligiblepost) relative to the baseline condition (phase inversion, see Methods, Beamformer Localizations Based on Cross-correlation between the Temporal Envelope of Speech (Unintelligible/Pop-out Conditions) and Phase-locked Neural Activity).


For both the Pop-outpre and Unintelligiblepre conditions, there is evidence of bilateral representations of speech envelope tracking, including the left and right temporal lobes and parietal and motor cortices. For the speech conditions following “training” (Pop-outpost, Unintelligiblepost) between MEG runs, there was less evidence of bilateral speech temporal envelope representations: Beamformer localizations of neural sources phase-locked to the speech temporal envelope were qualitatively lateralized to the right temporal lobe and motor and parietal cortices for both intelligible (Pop-outpost) and unintelligible (Unintelligiblepost) speech conditions. Therefore, although there was overlap in the capability of brain regions in the right hemisphere to track the speech temporal envelope for intelligible (Pop-outpost) and unintelligible (Pop-outpre, Unintelligiblepre, Unintelligiblepost) speech conditions, there were differences in the left hemisphere before and after “training,” regardless of whether the speech stimuli became intelligible (Pop-outpre cf. Pop-outpost) or remained unintelligible (Unintelligiblepre cf. Unintelligiblepost).

Whole-brain Analyses: No Enhancement of Phase-locking to the Speech Temporal Envelope for Intelligible (Pop-outpost) Compared with Unintelligible (Pop-outpre, Unintelligiblepre, Unintelligiblepost) Speech Sentences

Whole-brain beamforming analyses were used to determine whether there was a significant difference in phase-locking, measured using cross-correlation, to the temporal envelope of intelligible (Pop-outpost) compared with unintelligible (Pop-outpre, Unintelligiblepre, Unintelligiblepost) speech. This higher-level contrast was based on the phase-locking change index (see Methods, Beamformer Localizations Based on Significant Differences in the Cross-correlations between Activity Phase-locked to the Temporal Envelopes of Unintelligible and Pop-out Speech Conditions). There was no significant difference in the phase-locking change index, demonstrating that there was no difference in phase-locking to the speech temporal envelope when the percept of an identical speech sentence was changed from unintelligible (Pop-outpre, Unintelligiblepre, Unintelligiblepost) to intelligible (Pop-outpost).

LOIs: Phase-locking to the Temporal Envelope of Unintelligible and Pop-out Speech Conditions in Specific Brain Locations

LOI analyses (Millman et al., 2013) are shown for all the LOIs (left HG, right HG, left MTG, left STG, and left IFG) in Figure 3. In each subpanel of Figure 3, the Pop-out speech conditions are shown on the left-hand side and the Unintelligible speech conditions on the right-hand side. Subpanels on the left-hand side of Figure 3 (Figure 3A and C) show the cross-correlation coefficients and coherence values obtained when the spatial filter orientations were optimized based on the metric of interest (cross-correlation/coherence). Subpanels on the right-hand side of Figure 3 (Figure 3B and D) show measures of temporal envelope tracking obtained from PCA-selected spatial filter orientations.

Figure 3. 

No effect of speech intelligibility on representations of the speech temporal envelope was reflected in either cross-correlation (A and B) or coherence (C and D) analyses with either orientation-optimization (“optimized”) or PCA applied to the spatial filter orientation. Gray shading, going from light to dark, indicates the LOI in which each analysis was carried out. In each subpanel, the Pop-out conditions are shown on the left-hand side and the Unintelligible conditions are shown on the right-hand side. Filled bars (“pre”) represent cross-correlation coefficients or coherence values for speech conditions before “training” (i.e., all speech perceived as unintelligible). Open bars (“post”) represent cross-correlation coefficients or coherence values for speech conditions following “training” where Pop-outpost becomes intelligible but there is no change in the perception of the Unintelligible condition. Error bars represent the SEM.


Filled bars in Figure 3 represent phase-locking measures before “training” on the tone-vocoded version of the target sentence and open bars represent cross-correlation (top) and theta-based coherence (bottom) after “training” (see Figure 1 for details). Repeated-measures ANOVAs (see Methods, LOI Analyses to Measure Differences in Phase-locking to the Speech Temporal Envelope in Given Brain Locations Based on Cross-correlation (Time Domain) and Coherence (Frequency Domain)) were conducted to analyze speech temporal envelope tracking based on cross-correlation and coherence. For the LOI analyses based on cross-correlation, the ANOVA showed a main effect of the Method used to select the spatial filter orientation, F(1, 15) = 258.6, p < .05, and a main effect of the Location, F(4, 60) = 2.6, p < .05. The results of the cross-correlation analyses also showed an interaction between the effects of spatial filter orientation Method and Location, F(4, 60) = 5.6, p < .05. Based on the data presented in the top panel of Figure 3, the main effect of spatial filter orientation selection holds true for the experimental conditions (Pop-out/Unintelligible) and experimental blocks (pre/post), but the interaction suggests that the relative benefit of the optimized spatial filter orientation depended on the location of the LOI. The underlying cause of the interaction is likely to arise from inconsistencies in the differences between the PCA and “optimized” spatial filter orientations. For the repeated-measures ANOVA on the coherence metric of temporal envelope tracking, there was a main effect of the Method used to select the spatial filter orientation, F(1, 15) = 176.1, p < .05. There was also an interaction between the Location and the spatial filter orientation selection Method, F(4, 60) = 3.4, p < .05.
The data presented in the bottom panel of Figure 3 also suggest that the interaction is most likely to result from the variable but significant effect of the method used for spatial filter orientation selection across different LOIs. Therefore, irrespective of the process used for selection of spatial filter orientation, that is, “optimized” or PCA, there was no significant difference in phase-locking to the speech temporal envelope, as measured by either cross-correlation or coherence, for unintelligible (Pop-outpre, Unintelligiblepre, Unintelligiblepost) and intelligible (Pop-outpost) speech conditions.

DISCUSSION

It is a matter of continuing debate whether phase-locking to the speech temporal envelope reflects the ability of the brain to process acoustic information and linguistic information per se. The aim of this study was to test the hypothesis that phase-locking to the speech temporal envelope is enhanced for intelligible speech (e.g., Peelle et al., 2013; Luo & Poeppel, 2007; Ahissar et al., 2001). Here we found that there was no difference in phase-locking to the temporal envelope of a physically identical speech sentence that was perceived as either intelligible or unintelligible.

Critically this study differed from all previous studies on the role of speech temporal envelope tracking in speech perception because perceptual pop-out was used to change the percept of physically identical tone-vocoded speech sentences from unintelligible to intelligible. Using this pop-out paradigm, we were able to dissociate changes in the spectrotemporal properties of unintelligible (Pop-outpre, Unintelligiblepre, Unintelligiblepost) and intelligible (Pop-outpost) speech from changes in speech intelligibility itself. Tone-vocoders were used to process the speech sentences because they provide a more faithful representation of the speech temporal envelope than noise-vocoded speech (Stone et al., 2008; Whitmal et al., 2007). Spurious low-frequency modulations present in noise-vocoded speech (Whitmal et al., 2007) present an issue in neuroimaging experiments where the aim is to explicitly examine the relationship between the temporal envelope of noise-vocoded speech and low-frequency neural oscillations (cf. Doelling et al., 2014; Peelle et al., 2013).

One potential drawback in using cross-correlation as the metric of temporal envelope tracking is that cross-correlation analyses require repeated presentation of the same speech item (e.g., Millman et al., 2013; Abrams et al., 2008) or a small number of speech items with similar temporal envelopes (Nourski et al., 2009; Ahissar et al., 2001). Peelle et al. (2013) argued that “the salience and influence of linguistic content are markedly different during full attention to trial-unique sentences” and that the use of limited speech material in previous studies could have led to the interpretation that the role of speech envelope tracking is restricted to acoustic analysis.

A disadvantage of using a limited number of speech items is the possible influence of repetition priming effects on responses to repeated speech material (e.g., Dhond, Buckner, Dale, Marinkovic, & Halgren, 2001). Repetition priming effects are complex and vary depending on the brain location and timing of the response relative to the stimulus onset (e.g., McDonald et al., 2010). The results presented here did not show significant differences in phase-locked responses before or after “training.” Therefore, in our view, repetition priming effects cannot explain the lack of enhanced phase-locking to intelligible (Pop-outpost) speech found in this study.

Beamformer Localizations of Phase-locking to the Speech Temporal Envelope

Previous work suggests that theta-based speech envelope tracking occurs not only within auditory cortex and associative areas (Nourski et al., 2009, 2013; Peelle et al., 2013; Zion Golumbic et al., 2013) but also in regions involved in higher-level language processing and supramodal processing (Peelle et al., 2013; Zion Golumbic et al., 2013). In the present work, beamformer analyses of neural activity phase-locked to the temporal envelope of speech, based on cross-correlation, localized reliable speech envelope tracking for both unintelligible (Pop-outpre, Unintelligiblepre, Unintelligiblepost) and intelligible (Pop-outpost) speech. During the first MEG data acquisition block, when speech was unintelligible (Pop-outpre, Unintelligiblepre), brain regions identified by these beamformer analyses included bilateral temporal lobes and supramodal regions, including motor and parietal cortices. During the second MEG block, that is, following “training,” temporal envelope tracking of both the intelligible (Pop-outpost) and unintelligible (Unintelligiblepost) speech conditions was qualitatively lateralized to the right hemisphere. This right-hemisphere lateralization is interesting, given the rightward lateralization of theta activity predicted by the Asymmetric Sampling in Time model (e.g., Poeppel, 2003). However, as this qualitative lateralization occurred for both unintelligible (Unintelligiblepost) and intelligible (Pop-outpost) speech conditions, this would not reflect a change in processing because of a change in speech intelligibility. Instead, this rightward-lateralization presumably reflects temporal order effects (e.g., Giraud et al., 2004) or other nonspecific effects and serves to emphasize the importance of incorporating changes in a control condition in the analysis of an experimental design based on perceptual pop-out, that is, here we used the “phase-locking change index.”

The Role of Theta-based Phase-locking in Speech Perception

In this study, the “phase-locking change index” yielded no significant difference in phase-locking to the temporal envelope of unintelligible (Pop-outpre, Unintelligiblepre, Unintelligiblepost) and intelligible (Pop-outpost) speech, demonstrating that theta-based phase-locking to the temporal envelope of speech is not enhanced by linguistic information. There are at least two, not mutually exclusive, explanations as to why tracking the speech temporal envelope through low-frequency phase-locking is not critical for speech intelligibility. Here we explore these explanations and relate them back to previous work on temporal envelope tracking and speech comprehension.

Ahissar et al. (2001) were the first to propose that cortical tracking of the speech temporal envelope is a prerequisite for speech comprehension. There are two aspects of the Ahissar et al. (2001) study that require further consideration. First, Ahissar et al. (2001) concluded that phase-locking to the speech temporal envelope is required for speech comprehension because they measured a decrease in tracking the speech envelope concomitant with a decrease in the intelligibility of time-compressed speech. The use of time-compressed speech to vary speech intelligibility creates problems with the interpretation of such a result because time-compressed speech confounds intelligibility and stimulus duration. This confound is particularly important in the context of theta-based phase-locking to the temporal envelope of speech. As was noted by Nourski et al. (2009; see, e.g., Nourski et al., 2009, Figures 3B and 5B), when speech is highly time-compressed and therefore the stimulus duration is short, responses to speech are dominated by the neural onset to the speech stimulus. The reduction in phase-locking to highly time-compressed speech is not because auditory cortex is incapable of tracking the temporal envelope of faster modulation frequencies: Human auditory cortex is capable of phase-locking to relatively high modulation frequencies of at least 100 Hz (e.g., Brugge et al., 2009). Rather, the diminished phase-locking to highly time-compressed speech shown in Ahissar et al. (2001) can be explained by the short stimulus duration (<500 msec). We predict that if speech items were of sufficient duration before time compression, then phase-locking to the temporal envelope of highly time-compressed speech could be established following the onset response.

Ahissar et al. (2001) also reported that phase-locking to the speech envelope correlated with correct versus incorrect responses. This finding has been taken as evidence that tracking the speech envelope is a prerequisite for speech comprehension (e.g., Peelle & Davis, 2012). The curious aspect of the correlation between phase-locking and trial success reported by Ahissar et al. (2001) is that they found a difference in phase-locking to the speech envelope even when the speech sentences were easy to understand (compression ratio = 0.75). Presumably there were not many instances where participants responded incorrectly or “don't know” to speech that was highly intelligible.

A recent study by Peelle et al. (2013) reported enhanced entrainment of theta oscillations for intelligible speech. In the present study, both cross-correlation and coherence measures of temporal envelope tracking were used in LOI analyses to maximize the potential of identifying differences in representations of the temporal envelopes of both unintelligible (Pop-outpre, Unintelligiblepre, Unintelligiblepost) and intelligible (Pop-outpost) speech. Here we did not identify a comparable increase in temporal envelope tracking for intelligible (Pop-outpost) speech relative to physically identical unintelligible (Pop-outpre) speech. The coherence values obtained in our source-space LOI analyses are similar to (Figure 3D, PCA-based orientation selection) or better than (Figure 3C, optimized orientation selection) the theta-based coherence values shown by Peelle et al. (2013) in their sensor-space analyses. However, the mean theta-based coherence values (∼0.01–0.02) reported in the source-space analyses presented by Peelle et al. (2013) are counterintuitively smaller, by a factor of 5–10, than those reported in their own sensor-space analyses (∼0.1). Moreover, the coherence values reported by Peelle et al. (2013) in their source-space analyses are much smaller than the mean coherence values (0.15–0.2 from PCA-selected spatial filter orientations) obtained in the LOI analyses in this study. This discrepancy between the coherence values reported by Peelle et al. (2013) in their sensor-space and source-space analyses perhaps calls into question the utility of theta-based coherence as a measure of temporal envelope representation across the whole head.

Questioning the Critical Role of Theta-based Phase-locking in Speech Perception

When considering mechanisms that underlie speech perception and/or speech comprehension, it should be noted that the representation of the speech temporal envelope is characterized not only by phase-locked activity but also by changes in power within the high gamma (>70 Hz) frequency band (Zion Golumbic et al., 2013; Pasley et al., 2012; Nourski et al., 2009). Moreover, Nourski et al. (2009) found that, within auditory cortex, tracking of the speech temporal envelope by high gamma power changes was more reliable than tracking by phase-locked activity over a range of time compression rates. Zion Golumbic et al. (2013) also suggested that tracking the speech envelope through low-frequency phase and through high gamma power reflects distinct mechanisms for speech perception.

Previous work supporting the AST model (e.g., Doelling et al., 2014; Zion Golumbic et al., 2013; Giraud & Poeppel, 2012; Ghitza, 2011) suggests that continuous speech is parsed and decoded on the basis of certain essential sublexical units (phonemic, syllabic, phrasal) through neuronal oscillations operating on various timescales (gamma, theta, delta) in a hierarchical manner. According to AST theory, the theta oscillator parses the syllabic rhythm in continuous speech (e.g., Ghitza, 2012; Giraud & Poeppel, 2012; Poeppel, 2003) because the frequency of the theta rhythm (∼4 Hz) is commensurate with the average syllabic rate in speech (∼4 Hz). However, it is important to note that the linear relationship between energy in the modulation spectrum of speech and cortical oscillations only holds for the syllabic rate/theta activity (4 Hz) and not for phonemic/gamma activity (∼25–40 Hz). Moreover, a restricted focus based on the assumption of a linear relationship between the speech temporal envelope and neural responses ignores the importance of nonlinear sound representations, for example, within high gamma (>70 Hz) power (Zion Golumbic et al., 2013; Pasley et al., 2012; Nourski et al., 2009). Pasley et al. (2012) reported that a nonlinear coding scheme yielded best reconstruction accuracy for modulation rates ≥4 Hz, that is, the syllabic rate, providing evidence of a dual linear and nonlinear coding scheme for relatively low modulation frequencies. If a gating mechanism is manifest in low-level auditory areas that parses continuous input into syllables (e.g., Zion Golumbic et al., 2013; Giraud & Poeppel, 2012), we would argue that high gamma (>70 Hz) power changes (e.g., Nourski et al., 2009) are more suited to this task than theta oscillations. 
Speech envelope tracking through changes in high gamma power is more reliable than theta-based entrainment because high gamma power changes do not seem to be affected by stimulus onset in the same way as theta-based envelope tracking (Nourski et al., 2009).

It is well established that the phase of theta oscillations is coupled with the power of oscillations in the gamma band (e.g., Canolty et al., 2006; Lakatos et al., 2005). In the context of speech perception, coupling between theta and gamma oscillations has been proposed as a critical mechanism for speech perception (e.g., Ghitza, 2012; Giraud & Poeppel, 2012; Morillon, Liégeois-Chauvel, Arnal, Bénar, & Giraud, 2012). Following the hierarchical nature of the AST model, it is assumed that the slower oscillations (delta, theta) are the “masters” of higher oscillations (e.g., Ghitza, 2012; Giraud & Poeppel, 2012; Morillon et al., 2012). Taking into account the delay in the ability of theta oscillations to track the speech temporal envelope, we would argue that high gamma (>70 Hz) power changes (Nourski et al., 2009) would better serve the role of “master” in the context of oscillatory coupling in decoding the syllables within continuous speech, at least in low-level auditory areas.

Improvements to Beamformer-based Analyses of Speech Envelope Tracking

In the present work, we describe novel beamforming methods tailored to investigating the hypothesized role of phase-locking to the speech temporal envelope in speech perception. Here we extended our previous methods (Millman et al., 2013) for calculating the representation of the speech temporal envelope in given brain locations to whole-brain beamformer analyses. We also described a new method to optimize the orientation of beamformer spatial filters to improve beamformer analyses designed to address the specific experimental hypotheses of the current work. In the present work, the orientation of beamformer spatial filters was optimized to focus on the signal of interest. In the literature, the standard method for reducing the dimensionality of data used in beamformer analyses is PCA, which selects the spatial filter orientation with the greatest total power. However, in this study, the signal of interest, that is, activity phase-locked to the speech envelope, is not necessarily best represented by total power. Here the spatial filter orientations based on the best representation of the speech temporal envelope (cross-correlation or coherence measures) resulted in significantly greater envelope tracking than the PCA-selected spatial filter orientations. When total power is not the metric of interest, as was the case in this study, the results presented here suggest that there is inherent value in optimizing spatial filter orientations based on the beamformer metric of interest. This novel methodology could easily be generalized to other beamformer-based analyses where total power is not the stimulus representation of interest.

Conclusion

In conclusion, the use of a perceptual pop-out design and tone-vocoded sentences in this study avoided the introduction of confounds based on low-level acoustical differences between unintelligible (Pop-outpre, Unintelligiblepre, Unintelligiblepost) and intelligible (Pop-outpost) speech stimuli. Our results do not support previous work suggesting that linguistic information enhances theta-based phase-locking to the speech temporal envelope (Peelle et al., 2013; Luo & Poeppel, 2007; Ahissar et al., 2001). The results presented here are consistent with previous studies (Howard & Poeppel, 2010; Nourski et al., 2009), suggesting that any relationship between phase-locked activity entrained to the speech temporal envelope and speech perception is based only on tracking the acoustical properties of the speech temporal envelope. Although tracking the temporal envelope of speech is undoubtedly a critical process in speech perception, other mechanisms must underlie speech intelligibility. The results of the present work add to our understanding of how the brain responds to unintelligible and intelligible speech sentences and may be used to inform cognitive models of speech perception.

Acknowledgments

We are grateful to three anonymous reviewers for their helpful comments on previous versions of this manuscript.

Reprint requests should be sent to Rebecca E. Millman, York Neuroimaging Centre, The Biocentre, York Science Park, Heslington, YO10 5NY, UK, or via e-mail: rem@ynic.york.ac.uk.

REFERENCES

Abrams, D. A., Nicol, T., Zecker, S., & Kraus, N. (2008). Right-hemisphere auditory cortex is dominant for coding syllable patterns in speech. Journal of Neuroscience, 28, 3958–3965.
Ahissar, E., Nagarajan, S., Ahissar, M., Protopapas, A., Mahncke, H., & Merzenich, M. M. (2001). Speech comprehension is correlated with temporal response patterns recorded from auditory cortex. Proceedings of the National Academy of Sciences, U.S.A., 98, 13367–13372.
Brugge, J. F., Nourski, K. V., Oya, H., Reale, R. A., Kawasaki, H., Steinschneider, M., et al. (2009). Coding of repetitive transients by auditory cortex on Heschl's gyrus. Journal of Neurophysiology, 102, 2358–2374.
Canolty, R. T., Edwards, E., Dalal, S. S., Soltani, M., Nagarajan, S. S., Kirsch, H. E., et al. (2006). High gamma power is phase-locked to theta oscillations in human neocortex. Science, 313, 1626–1628.
Dau, T., Verhey, J., & Kohlrausch, A. (1999). Intrinsic envelope fluctuations and modulation-detection thresholds for narrow-band noise carriers. Journal of the Acoustical Society of America, 106, 2752–2760.
Davis, M. H., Johnsrude, I. S., Hervais-Adelman, A., Taylor, K., & McGettigan, C. (2005). Lexical information drives perceptual learning of distorted speech: Evidence from the comprehension of noise-vocoded sentences. Journal of Experimental Psychology, 134, 222–241.
Dehaene-Lambertz, G., Pallier, C., Serniclaes, W., Sprenger-Charolles, L., Jobert, A., & Dehaene, S. (2005). Neural correlates of switching from auditory to speech perception. Neuroimage, 24, 21–33.
Dhond, R. P., Buckner, R. L., Dale, A. M., Marinkovic, K., & Halgren, E. (2001). Spatiotemporal maps of brain activity underlying word generation and their modification during repetition priming. Journal of Neuroscience, 21, 3564–3571.
Doelling, K. B., Arnal, L. H., Ghitza, O., & Poeppel, D. (2014). The role of slow oscillations in parsing speech into syllables for decoding. Neuroimage, 85, 761–768.
Drullman, R., Festen, J. M., & Plomp, R. (1994). Effect of reducing slow temporal modulations on speech reception. Journal of the Acoustical Society of America, 95, 2670–2680.
Evans, A. C., Collins, D. L., Mills, S. R., Brown, E. D., & Kelly, R. L. (1993). 3D statistical neuroanatomical models from 305 MRI volumes. Proceedings of the Institute of Electrical and Electronics Engineers—Nuclear Science Symposium and Medical Imaging Conference, 95, 1813–1817.
Foster, J. R., Summerfield, A. Q., Marshall, D. H., Palmer, L., Ball, V., & Rosen, S. (1993). Lip-reading the BKB sentence lists—Corrections for list and practice effects. British Journal of Audiology, 27, 233–246.
Ghitza, O. (2011). Linking speech perception and neurophysiology: Speech decoding guided by cascaded oscillators locked to the input rhythm. Frontiers in Psychology, 2, 130.
Ghitza, O. (2012). On the role of theta-driven syllabic parsing in decoding speech: Intelligibility of speech with a manipulated modulation spectrum. Frontiers in Psychology, 3, 238.
Giraud, A.-L., Kell, C., Thierfelder, C., Sterzer, P., Russ, M. O., Preibisch, C., et al. (2004). Contributions of sensory input, auditory search and verbal comprehension to cortical activity during speech processing. Cerebral Cortex, 14, 247–255.
Giraud, A.-L., & Poeppel, D. (2012). Cortical oscillations and speech processing: Emerging computational principles and operations. Nature Neuroscience, 15, 511–517.
Hegdé, J., & Kersten, D. (2010). A link between visual disambiguation and visual memory. Journal of Neuroscience, 30, 15124–15133.
Hickok, G., & Poeppel, D. (2007). The cortical organization of speech processing. Nature Reviews Neuroscience, 8, 393–402.
Howard, M. F., & Poeppel, D. (2010). Discrimination of speech stimuli based on neuronal response phase patterns depends on acoustics but not comprehension. Journal of Neurophysiology, 104, 2500–2511.
Huang, M.-X., Mosher, J. C., & Leahy, R. M. (1999). A sensor-weighted overlapping-sphere head model and exhaustive head model comparison for MEG. Physics in Medicine and Biology, 44, 423–440.
Huang, M.-X., Shih, J. J., Lee, R. R., Harrington, D. L., Thoma, R. J., Weisend, M. P., et al. (2004). Commonalities and differences among vectorised beamformers in electromagnetic source imaging. Brain Topography, 16, 139–158.
Johnson, S., Prendergast, G., Hymers, M., & Green, G. G. R. (2011). Examining the effects of one- and three-dimensional spatial filtering analyses in magnetoencephalography. PLoS One, 6, e22251.
Kozinska, D., Carducci, F., & Nowinski, K. (2001). Automatic alignment of EEG/MEG and MRI data sets. Clinical Neurophysiology, 112, 1553–1561.
Lakatos, P., Shah, A. S., Knuth, K. H., Ulbert, I., Karmos, G., & Schroeder, C. E. (2005). An oscillatory hierarchy controlling neuronal excitability and stimulus processing in the auditory cortex. Journal of Neurophysiology, 94, 1904–1911.
Liebenthal, E., Binder, J. R., Piorkowski, R. L., & Remez, R. E. (2004). Short-term reorganisation of auditory analysis induced by phonetic experience. Journal of Cognitive Neuroscience, 15, 549–558.
Luo, H., & Poeppel, D. (2007). Phase patterns of neuronal responses reliably discriminate speech in human auditory cortex. Neuron, 54, 1001–1010.
MacLeod, A., & Summerfield, A. Q. (1987). Quantifying the contribution of vision to speech perception in noise. British Journal of Audiology, 21, 131–141.
McDonald, C. R., Thesen, T., Carlson, C., Blumberg, M., Girard, H. M., Trongnetrpunya, A., et al. (2010). Multimodal imaging of repetition priming: Using fMRI, MEG, and intracranial EEG to reveal spatiotemporal profiles of word processing. Neuroimage, 53, 707–717.
Millman, R. E., Prendergast, G., Hymers, M., & Green, G. G. R. (2013). Representations of the temporal envelope of sounds in human auditory cortex: Can the results from invasive intracortical "depth" electrode recordings be replicated using non-invasive MEG "virtual electrodes"? Neuroimage, 64, 185–196.
Millman, R. E., Prendergast, G., Kitterick, P. T., Woods, W. P., & Green, G. G. R. (2010). Spatiotemporal reconstruction of the auditory steady-state response to frequency modulation using magnetoencephalography. Neuroimage, 49, 745–758.
Moore, B. C. J., & Glasberg, B. R. (1983). Suggested formulae for calculating auditory-filter bandwidths and excitation patterns. Journal of the Acoustical Society of America, 74, 750–753.
Morillon, B., Liégeois-Chauvel, C., Arnal, L. H., Bénar, C.-G., & Giraud, A.-L. (2012). Asymmetric function of theta and gamma activity in syllable processing: An intra-cortical study. Frontiers in Psychology, 3, 1.
Nichols, T. E., & Holmes, A. P. (2002). Nonparametric permutation tests for functional neuroimaging: A primer with examples. Human Brain Mapping, 15, 1–25.
Nourski, K. V., Etler, C. P., Brugge, J. F., Oya, H., Kawasaki, H., Reale, R. A., et al. (2013). Direct recordings from the auditory cortex in a cochlear implant user. Journal of the Association for Research in Otolaryngology, 14, 435–450.
Nourski, K. V., Reale, R. A., Oya, H., Kawasaki, H., Kovach, C. K., Chen, H., et al. (2009). Temporal envelope of time-compressed speech represented in the human auditory cortex. Journal of Neuroscience, 29, 15564–15574.
Pasley, B. N., David, S. V., Mesgarani, N., Flinker, A., Shamma, S. A., Crone, N. E., et al. (2012). Reconstructing speech from human auditory cortex. PLoS Biology, 10, e1001251.
Peelle, J. E., & Davis, M. H. (2012). Neural oscillations carry speech rhythm through to comprehension. Frontiers in Psychology, 3, 320.
Peelle, J. E., Gross, J., & Davis, M. H. (2013). Phase-locked responses to speech in human auditory cortex are enhanced during comprehension. Cerebral Cortex, 23, 1378–1387.
Poeppel, D. (2003). The analysis of speech in different temporal integration windows: Cerebral lateralization as "asymmetric sampling in time". Speech Communication, 41, 245–255.
Prendergast, G., Johnson, S. R., Hymers, M., & Green, G. G. R. (2011). Non-parametric statistical thresholding of baseline free MEG beamformer images. Neuroimage, 54, 906–918.
Rademacher, J., Morosan, P., Schormann, T., Schleicher, A., Werner, C., Freund, H. J., et al. (2001). Probabilistic mapping and volume measurement of human primary auditory cortex. Neuroimage, 13, 669–683.
Rosen, S. (1992). Temporal information in speech: Acoustic, auditory and linguistic aspects. Philosophical Transactions of the Royal Society of London, Series B, Biological Sciences, 336, 367–373.
Scott, S. K. (2012). The neurobiology of speech perception and production—Can functional imaging tell us anything we did not already know? Journal of Communication Disorders, 45, 419–425.
Shannon, R. V., Zeng, F.-G., Kamath, V., Wygonski, J., & Ekelid, M. (1995). Speech recognition with primarily temporal cues. Science, 270, 303–304.
Sohoglu, E., Peelle, J. E., Carlyon, R. P., & Davis, M. H. (2012). Predictive top–down integration of prior knowledge during speech perception. Journal of Neuroscience, 32, 8443–8453.
Stone, M. A., Füllgrabe, C., & Moore, B. C. J. (2008). Benefit of high-rate envelope cues in vocoder processing: Effect of number of channels and spectral region. Journal of the Acoustical Society of America, 124, 2272–2282.
Van Veen, B. D., van Drongelen, W., Yuchtman, M., & Suzuki, A. (1997). Localization of brain electrical activity via linearly constrained minimum variance spatial filtering. Institute of Electrical and Electronics Engineers Transactions on Biomedical Engineering, 44, 867–880.
Whitmal, N. A., III, Poissant, S. F., Freyman, R. L., & Helfer, K. S. (2007). Speech intelligibility in cochlear implant simulations: Effects of carrier type, interfering noise, and subject experience. Journal of the Acoustical Society of America, 122, 2376–2388.
Zion Golumbic, E. M., Ding, N., Bickel, S., Lakatos, P., Schevon, C. A., McKhann, G. M., et al. (2013). Mechanisms underlying selective neuronal tracking of attended speech at a "cocktail party". Neuron, 77, 980–991.