Abstract

We examined how attention modulates the neural encoding of continuous speech under different types of interference. In an EEG experiment, participants attended to a narrative in English while ignoring a competing stream in the other ear. Four types of interference were presented to the unattended ear: a different English narrative, a narrative in a language unknown to the listener (Spanish), a well-matched nonlinguistic acoustic interference (Musical Rain), and no interference. Neural encoding of attended and unattended signals was assessed by calculating cross-correlations between their respective envelopes and the EEG recordings. Findings revealed more robust neural encoding for the attended envelopes than for the ignored ones. Critically, however, the type of interfering stream significantly modulated this process: the fully intelligible distractor (English) caused the strongest encoding of both attended and unattended streams and the latest dissociation between them, whereas nonintelligible distractors caused weaker encoding and earlier dissociation between attended and unattended streams. The results were consistent over the time course of the spoken narrative. These findings suggest that attended and unattended information can be differentiated at different depths of processing analysis, with the locus of selective attention determined by the nature of the competing stream. They provide strong support for flexible accounts of auditory selective attention.

INTRODUCTION

Directing attention to a single speaker in a multitalker environment is an everyday occurrence that we manage with relative ease. This phenomenon is commonly termed the "cocktail party" effect (Cherry, 1953). A large body of research has sought to assess how the underlying attentional mechanisms operate and how much of the nonattended signal is perceived in such situations, producing mixed results. Here, we aim to address these questions by investigating the neural encoding of continuous attended speech under different types of linguistic and nonlinguistic interference.

Selective Attention

Selective attention is the ability to sustain focus on task-relevant stimuli in the presence of distractors. It has long been recognized as an essential cognitive capacity (e.g., James, 1890) because our brains are continuously flooded with information but limited in what they can process. Nevertheless, listeners are also often distracted by irrelevant stimuli, prompting questions about the locus and mechanisms of attentional allocation in the presence of competing streams of information. Historically, two major views guiding research on auditory selective attention were the "early selection" and the "late selection" approaches. The early selection theory (Broadbent, 1958) argued that, because of our limited processing capacity, attended and unattended information is differentiated early in perceptual processing. More specifically, sensory features can guide attentional selection early on, thus determining what will be subsequently processed for meaning. The late selection approach (Duncan, 1980; Deutsch & Deutsch, 1963) proposed that selective attention cannot affect the perceptual analysis of the stimuli and that both attended and unattended inputs are processed equivalently by the perceptual system. In this view, selective attention acts only later in the process, after the input has undergone semantic encoding and analysis. Subsequent theories argued that unattended information might be attenuated rather than completely filtered out, allowing unattended information with low identification thresholds (as determined by its semantic features) to reach awareness (Treisman, 1969).

Johnston and Heinz (1978) suggested that selective attention is a multimode flexible system, in which attended and unattended information can be differentiated at different depths of processing analysis. They also argued that selective attention itself requires processing capacity (cf. Kahneman, 1973), with later selection requiring more processing capacity and effort. On this account, efficient selection can be achieved early on the basis of sensory differences between attended and unattended streams; in the absence of effective sensory cues, however, semantic features drive the differentiation later in the process, using more capacity. A more recent account of attention allocation in speech comprehension (Bronkhorst, 2015) also argues that attentional selection can be triggered at different processing depths. Attention triggered early on is based on basic signal properties (sound level, fundamental frequency) and enables fast selection, whereas attention at later processing stages is based on complex information, such as syntactic and semantic structure, and is used for slow selection.

Experimental Evidence

A substantial body of research has used dichotic listening to assess whether auditory attention selects information early on, based on the physical characteristics of the stimulus, or only after the input has been processed up to a semantic level. The results are mixed, with some studies showing that both attended and unattended information can be processed up to the semantic level (Bentin, Kutas, & Hillyard, 1995; Wood & Cowan, 1995; Eich, 1984) and others finding no evidence for semantic processing of the unattended stream (Wood, Stadler, & Cowan, 1997; Newstead & Dennis, 1979). The inconsistency has been attributed to inadequate control of attentional shifts to the unattended ear (Dupoux, Kouider, & Mehler, 2003; Holender, 1986), prompting the claim that listeners cannot semantically process information that is genuinely unattended. Yet further studies demonstrated that unattended information can be processed in the absence of attention shifts to the irrelevant channel (Rivenez, Guillaume, Bourgeon, & Darwin, 2008) and that it can, under certain conditions, be processed up to the semantic and syntactic levels (Aydelott, Jamaluddin, & Nixon Pearce, 2015; Pulvermüller, Shtyrov, Hasting, & Carlyon, 2008). This conclusion is also consistent with the argument that the auditory system, although able to selectively focus processing on the relevant stream, has surplus capacity to process auditory information from other streams, regardless of the perceptual load in the attended stream (Murphy, Fraenkel, & Dalton, 2013). However, it has also been argued that the nature of the measurement can determine whether processing of the unattended message is observed (Rivenez et al., 2008), with studies using explicit measures (e.g., word recall) more likely to find that the unattended message was not processed.

Thus, although the existing evidence seems to suggest that unattended auditory information can be processed at different depths of analysis, this view is still controversial and associated with the effects of task-dependent variables. It is also unclear how the nature of the competing streams interacts with this process. The current study addresses these questions in a task-free natural listening paradigm by tracking the neural encoding of continuous attended and nonattended speech under different types of linguistic and nonlinguistic interference.

Neural Encoding of Attended and Unattended Streams

The temporal envelope of speech is strongly represented in the brain, with several studies showing a significant correlation between speech envelopes and cortical activity (Lalor & Foxe, 2010; Abrams, Nicol, Zecker, & Kraus, 2008; Aiken & Picton, 2008). These correlations appear to be a result of phase locking or synchronization between neural activity and the slow amplitude modulations of the speech envelope, which are mainly present in the theta frequency band (3–7 Hz) and correspond to the syllabic rate of speech (Doelling, Arnal, Ghitza, & Poeppel, 2014; Giraud & Poeppel, 2012; Drullman, Festen, & Plomp, 1994). Phase locking has also been observed for noise-vocoded speech (i.e., stimuli in which slow amplitude fluctuations are preserved but spectral details are reduced) but is stronger for intelligible stimuli (Ding, Chatterjee, & Simon, 2013; Peelle, Gross, & Davis, 2013).

Selective attention has been shown to have a robust influence on these synchronizations. In "cocktail party" paradigms, the auditory system preferentially tracks the temporal envelope of the attended talker and appears to be out of phase with the ignored speech stream (Rimmele, Zion Golumbic, Schröger, & Poeppel, 2015; Hambrook & Tata, 2014; Horton, Srinivasan, & D'Zmura, 2014; Horton, D'Zmura, & Srinivasan, 2013; Ding & Simon, 2012a; Kerlin, Shahin, & Miller, 2010; Zion Golumbic, Poeppel, & Schroeder, 2012). This phenomenon has been captured by the "selective entrainment hypothesis" (Zion Golumbic et al., 2013; Giraud & Poeppel, 2012; Schroeder & Lakatos, 2010; Lakatos, Karmos, Mehta, Ulbert, & Schroeder, 2008), which suggests that attention causes low-frequency neural oscillations to entrain to the temporal envelope of the attended speech stream. For instance, Rimmele et al. (2015) used magnetoencephalography and a cocktail party paradigm to reveal stronger attentional encoding for natural speech compared with noise-vocoded speech, suggesting that attentional enhancement of speech tracking depends on the presence of fine structure in the stimulus. In another study, Hambrook and Tata (2014) presented two simultaneous audiobook clips while EEG was being recorded; attentional selection increased the EEG signals that were synchronized with the attended stream, but not the ignored one. Similarly, Horton et al. (2013) asked participants to attend to one of two competing speech streams, each consisting of random sentences concatenated together for 22 sec. By calculating the cross-correlations between the speech envelopes and the EEG channels, they found evidence of entrainment to the attended stream's low-frequency modulations and much weaker phase locking to the unattended envelope. Their results further suggested that the system selects the attended speech stream by amplifying the neural activity synchronized to its envelope, a mechanism reflecting enhancement of the attended stream and possibly also entrainment-based suppression of the unattended one.

Current Study

The current study uses the well-established phenomenon of encoding of the speech envelope as an index of attentional processing. To test whether the locus of selective attention interacts with the nature of the competing streams, we created a cocktail party paradigm in which participants were instructed to attend to one speaker while ignoring a competing stream. We then manipulated the type of the competing stream to create interference at perceptual or linguistic levels and recorded EEG signals in four different conditions. In the first condition, listeners attended to a narrative in English presented in either the left or the right ear while actively ignoring a distracting English story presented in the unattended ear (English–English condition). In the second condition, listeners attended to a narrative in English while ignoring a narrative in Spanish, a language unknown to the participants (English–Spanish condition). In the third condition, the interfering stream was Musical Rain (MuR), a nonlinguistic baseline that is closely matched to the acoustic properties of speech but does not trigger a speech percept (English–MuR condition). Finally, the fourth condition was the "Single Talker" condition, in which participants were instructed to attend to narratives presented in either the left or right ear, with no interference presented in the other ear.

We hypothesized that attention would increase speech tracking in all conditions compared with the nonattended signal in the other ear, as shown in previous studies (Horton et al., 2013; Ding & Simon, 2012a, 2012b). However, we also predicted that the nature of the competing stream would modulate this process, in line with accounts of a flexible locus of selective attention (Bronkhorst, 2015; Johnston & Heinz, 1978). On this view, the nonlinguistic acoustic noise (MuR) should be dissociated from the attended narrative early on, based on their low-level differences (speech vs. nonspeech), producing the least interference and requiring the least processing capacity. On the other hand, the fully intelligible and meaningful distractor in the English–English condition should produce the greatest interference, requiring higher-level semantic and syntactic features to dissociate the two competing streams. This would trigger late selection and engage more processing capacity (Johnston & Heinz, 1978). Attentional amplification of the entrained neural activity (cf. Horton et al., 2013) should, therefore, be observed more strongly in the late selection condition (English–English) than in the early selection condition (English–MuR), with the English–Spanish condition positioned in between. The final condition, Single Talker, provides a test case for cortical entrainment of speech that is not modulated by the processing demands of divided attention.

The final goal of our study was to track how attention affects neural encoding over time. Although a substantial body of literature has investigated the role of various neural systems in sustaining attention and its intensity over time (e.g., Malhotra, Coulthard, & Husain, 2009; Manly et al., 2003), there have been few attempts to integrate this work with the literature on neural encoding of attended speech. We addressed this issue by comparing the strength of neural encoding of the attended speech envelopes at the beginning, the middle, and the end of the narratives, across the four interference conditions.

METHODS

Participants

Twenty-five healthy volunteers were recruited from the University of Cambridge. They were right-handed, monolingual native speakers of British English with no history of hearing problems. Three participants were excluded from data analyses because of technical problems and excess noise; thus, 22 participants contributed to the present study (10 men, mean age = 21.5 years). All participants were provided with detailed information regarding the purpose of the study and gave written consent. The study was approved by the Cambridge Psychology research ethics committee.

Stimuli and Procedure

The experiment consisted of three conditions in which the attended speech was paired with interference (English–English, English–Spanish, and English–MuR) and one condition in which participants attended to an English narrative without any interference (Single Talker; Table 1). The stimuli were 10 simple children's narratives, such as "The Happy Prince" (eight in English and two in Spanish), and two matched MuR sets that acted as a nonlinguistic acoustic baseline. The stories were obtained from YouTube channels and websites and transcribed into 120 sentences each. Two native British English female speakers recorded four stories each, and one native Spanish female speaker recorded the Spanish narratives. Speaker gender was kept constant to reduce segregation strategies based on talker gender in the dichotic listening paradigm (Brungart & Simpson, 2007). Sentences ranged from 2.5 to 3.1 sec in length, and some sentences were slightly modified by a narrator (e.g., by adding adverbs or adjectives: "The swallow was very sad") to meet the 3-sec-per-sentence criterion. All sentences were normalized to have equivalent root mean square sound amplitude. To produce MuR (following the procedure introduced by Uppenkamp, Johnsrude, Norris, Marslen-Wilson, & Patterson, 2006), we extracted temporal envelopes from the recorded English stimuli and filled them with jittered fragments of synthesized speech. As such, MuR segments preserve the duration, the temporal envelope, and the energy levels of the original speech stimuli; despite these similarities, however, the absence of continuous formants means that MuR does not elicit a speech percept (Bozic, Tyler, Ives, Randall, & Marslen-Wilson, 2010; Uppenkamp et al., 2006). MuR was generated using custom-made scripts in MATLAB (The MathWorks Inc., 2010).
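The MuR logic (keep the speech envelope, destroy the fine structure that carries the speech percept) can be sketched as follows. This is a minimal illustration, not the exact Uppenkamp et al. (2006) algorithm: the envelope is approximated by a low-pass-filtered Hilbert magnitude, "jittering" is approximated by shuffling short fragments of the carrier, and the function names and parameters (`frag_ms`, `cutoff`) are our own.

```python
import numpy as np
from scipy.signal import hilbert, butter, filtfilt

def broadband_envelope(signal, fs, cutoff=30.0):
    """One common temporal-envelope estimate: low-pass-filtered Hilbert magnitude."""
    env = np.abs(hilbert(signal))
    b, a = butter(4, cutoff / (fs / 2), btype="low")
    return filtfilt(b, a, env)

def mur_like(signal, fs, frag_ms=50, rng=None):
    """Shuffle short fragments of the signal (destroying formant continuity),
    then re-impose the original temporal envelope on the shuffled carrier."""
    rng = np.random.default_rng() if rng is None else rng
    env = broadband_envelope(signal, fs)
    frag = int(fs * frag_ms / 1000)
    n_frag = len(signal) // frag
    pieces = [signal[i * frag:(i + 1) * frag].copy() for i in range(n_frag)]
    rng.shuffle(pieces)  # temporal jitter of the fragments
    carrier = np.concatenate(pieces)
    carrier = np.pad(carrier, (0, len(signal) - len(carrier)))
    carrier_env = np.maximum(broadband_envelope(carrier, fs), 1e-6)
    # Output has the duration and envelope of the input but scrambled fine structure.
    return carrier / carrier_env * env
```

The output matches the input in duration and slow amplitude modulation, which is the property the cross-correlation analyses below rely on.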

Table 1. 

Experimental Conditions

Condition | Attended | Unattended
English–English | English | A different, competing English narrative
English–Spanish | English | A language unknown to participants (Spanish)
English–MuR | English | Nonlinguistic acoustic set (MuR)
Single Talker | English | N/A (no interference); single narrative presented in English

Type of attended and unattended streams per condition.

From each story, the first 60 sentences (first half) were strung together and the second 60 sentences (second half) were strung together (with a 300-msec silence gap between sentences), creating two blocks of approximately 3.2 min (192 sec) each. In each condition, participants attended to two stories (i.e., four blocks of 60 sentences each; 240 sentences in total), swapped between their left and right ears (Figure 1A). While they were actively attending to one channel, a competing stream was simultaneously presented in the other ear. Participants always attended to an English story, presented either to the left or the right ear, and ignored the other channel. None of the attended stimuli were repeated for the duration of the experiment (i.e., each sentence was attended to only once), but to keep the properties of the attended and the interfering speech equal, the same stories appeared in both capacities, once as attended and once as unattended (with presentation order, story segment, and attention demands counterbalanced across left and right ears; Figure 1A). Participants always heard the Single Talker condition first, to familiarize themselves with the experimental setup and the demands of attending to the left/right. The order of the remaining three conditions, and of the stories within each condition, was randomized across participants. Overall, participants attended to 960 sentences across the four conditions. Because the Single Talker condition had no interference, the total number of unattended trials was 720.

Figure 1. 

(A) Structure of an example condition (English–English). Participants attended to the “Happy Prince” story in the first two blocks. Part 1 of the story was presented in the left ear and Part 2 in the right. The story “Five Peas” was the distractor stream in the first two blocks. In Blocks 3 and 4, participants attended to the “Five Peas” story, and the distractor stream was the “Happy Prince” story that had previously been attended to. Presentation order (e.g., attend to the left/right) was randomized across blocks. Presentation side was counterbalanced (i.e., if Part 1 was presented to the left channel, then Part 2 was presented to the right channel). (B) Sequence of a block. Participants were instructed to attend to a channel before the start of the block. They were asked to fixate on a cross placed 150 cm in front of them. The stimuli were presented between 3 and 10 sec after the verbal instruction. After the stimuli finished, participants were asked to complete 10 true/false questions about the story they had just attended to.

During the experiment, participants sat in a comfortable chair in a sound-attenuated room and were asked to fixate on a printout of a cross placed at eye level 150 cm in front of them while the narratives were presented (Figure 1B). The stimuli were delivered through insert earphones (E-A-RTONE 3A) at a mean intensity of 65 dB SPL and were presented using MATLAB and functions from the Psychophysics Toolbox extensions (Brainard, 1997; Pelli, 1997).

Behavioral Measures

To ensure that participants were paying attention to the desired channel, they were informed that they would be asked questions about the attended story. They completed 10 true/false questions after each block, resulting in a total of 160 responses per participant.

Data Collection and Preprocessing

EEG was recorded using a 128-channel Ag/AgCl electrode net (Electrical Geodesics, Inc.), with data recorded from 92 of the 128 channels. The 36 excluded channels are located in the outer layers of the net (the neck area); they pick up substantially more muscle noise and were therefore not of interest in this study. Voltages were recorded at a sampling rate of 500 Hz, and all net impedances were kept below 100 kΩ. Data were band-pass filtered between 1 and 100 Hz and down-sampled to 250 Hz. All data were preprocessed and analyzed in MATLAB using the EEGLAB toolbox (Delorme & Makeig, 2004). Data were epoched at the sentence level (2 sec), with a −200 msec prestimulus window, resulting in 960 attended and 720 unattended epochs per participant. Artifact rejection was carried out per epoch, with bad epochs removed and bad channels interpolated. The Infomax independent component analysis algorithm implemented in EEGLAB was used to isolate independent components for artifact correction. The resulting components were visually inspected to detect artifacts such as eye blinks and other nonbrain activity; these were rejected according to their topography, time course, and spectral characteristics, yielding clean, artifact-free data. Finally, data were re-referenced to the average of all channels.
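The sentence-level epoching step amounts to cutting fixed-length windows around onset samples (−200 msec prestimulus to 2 sec post-onset). A minimal numpy sketch is shown below for illustration only; the actual pipeline used EEGLAB's epoching routines, and the function name is ours.

```python
import numpy as np

def epoch(data, onsets, fs, tmin=-0.2, tmax=2.0):
    """Cut continuous EEG (channels x samples) into onset-locked epochs.

    Returns an array of shape (n_epochs, n_channels, n_times);
    onsets whose window falls outside the recording are skipped.
    """
    start = int(round(tmin * fs))  # e.g., -50 samples at 250 Hz
    stop = int(round(tmax * fs))   # e.g., +500 samples at 250 Hz
    epochs = []
    for on in onsets:
        a, b = on + start, on + stop
        if a >= 0 and b <= data.shape[1]:
            epochs.append(data[:, a:b])
    return np.stack(epochs)
```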

Speech Envelopes

The temporal envelope was calculated for all attended and unattended stories and for the MuR sets. Speech envelopes were derived from Mel-frequency cepstral coefficients, and the EEG data were down-sampled to 100 Hz to match the sampling rate of the envelopes. The acoustic properties of the envelopes (their mean frequency components and the distribution of autocorrelation peaks) were closely matched across the three types of interference, ensuring the validity of comparisons between them using the cross-correlation approach described below.
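The key alignment requirement is that envelope and EEG share one sampling rate (100 Hz, i.e., one sample per 10 msec). As a simple stand-in for the MFCC-based envelope used here, a short-window RMS envelope computed at a 100-Hz frame rate illustrates the idea; the frame-rate value comes from the text, but the RMS choice and function name are ours.

```python
import numpy as np

def frame_envelope(signal, fs, frame_rate=100):
    """Short-window RMS envelope sampled at frame_rate Hz (10-msec frames
    at 100 Hz), aligning sample-for-sample with 100-Hz EEG."""
    hop = fs // frame_rate                      # samples per 10-msec frame
    n = len(signal) // hop
    frames = signal[: n * hop].reshape(n, hop)  # one row per frame
    return np.sqrt((frames ** 2).mean(axis=1))  # RMS per frame
```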

Data Analysis

We characterized the relationship between the acoustic envelopes and the EEG channels by calculating Pearson's correlation r between the two measures as a function of lag. As in previous work (e.g., Horton et al., 2013; Aiken & Picton, 2008), this approach reveals EEG activity that represents the acoustic envelopes: If an EEG channel is synchronized with an envelope at a certain latency, it will show a nonzero cross-correlation at a lag equal to that latency. The cross-correlation function (Bendat & Piersol, 1986) assumes a linear relationship between neural activity and the speech envelope and, for discrete functions f and g, is defined as
\[
(f \ast g)[n] = \frac{\sum_{m=-\infty}^{\infty} f[m]\, g[n+m]}{\sigma_f\, \sigma_g}
\]
where σf and σg are the standard deviations of f and g.

This correlation was calculated for each 10-msec lag in the range from −200 msec before sentence onset to 600 msec after sentence onset, which covers the range of effects observed in the literature (e.g., Baltzell et al., 2016). All EEG channels were cross-correlated with the attended, unattended, and control speech envelopes for every sentence. Following Horton et al. (2013), control envelopes were those of different (i.e., nonmatching) sentences, so the correlation values in the control cross-correlations (Figure 2) are due to chance. Control cross-correlation functions were then collapsed across channels and time to form an estimated Gaussian distribution, which was used to determine the 95% confidence interval. The null hypothesis was that there is no correlation between the EEG channels and the control envelope at a particular latency. Attended and unattended cross-correlation values that fell below the 2.5th percentile or exceeded the 97.5th percentile of this distribution were therefore considered significantly different from zero (p < .05, before correction for multiple comparisons).
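The two steps above (lagged Pearson correlation, then percentile thresholds from the collapsed control correlations) can be sketched for a single EEG channel as follows. This is an illustrative simplification of the analysis, with our own function names; the full analysis runs over all channels and sentences.

```python
import numpy as np

def lagged_xcorr(eeg, env, fs=100, lag_min=-0.2, lag_max=0.6):
    """Pearson r between one EEG channel and a speech envelope at each lag.
    At 100 Hz, one sample = 10 msec. Positive lag means the EEG lags the
    stimulus (eeg[n] is compared against env[n - lag])."""
    lags = np.arange(int(lag_min * fs), int(lag_max * fs) + 1)
    rs = []
    for lag in lags:
        if lag >= 0:
            x, y = env[: len(env) - lag], eeg[lag:]
        else:
            x, y = env[-lag:], eeg[: len(eeg) + lag]
        rs.append(np.corrcoef(x, y)[0, 1])
    return lags / fs, np.array(rs)

def null_thresholds(control_rs, alpha=0.05):
    """Significance thresholds (2.5th and 97.5th percentiles) from
    control correlations collapsed over channels and time."""
    return np.percentile(control_rs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
```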

Figure 2. 

(A) Control distribution. Control cross-correlations were collapsed over channels and time to form a null distribution. The significance thresholds were set at the 95% confidence interval (thresholds at 97.5th and 2.5th percentiles). (B) Control cross-correlations. Cross-correlations at different lags between EEG channels and control envelopes (Mel frequency cepstral coefficient). The black lines represent the threshold for significance set at the 97.5th and 2.5th percentiles (uncorrected for multiple comparisons), as obtained from the null distribution shown in A.

We first computed average cross-correlation functions for all attended and all nonattended trials separately by collapsing the correlation coefficients across conditions and subjects at each time lag. This was followed by computation of the average attended and nonattended cross-correlation functions for each condition separately. The cross-correlation functions for all attended and all nonattended trials were not directly compared because of the difference in the overall numbers of attended and unattended trials (960 vs. 720). Differences between attended and unattended cross-correlations in each condition were evaluated using paired t tests, with control for multiple comparisons achieved using nonparametric cluster-based permutation tests (Maris & Oostenveld, 2007) as implemented in the ft_timelockstatistics function in FieldTrip. To this end, pairs of experimental conditions were compared in 10-msec steps for each electrode in the −200 to 600 msec time window. All samples with t values exceeding the threshold corresponding to p < .05 (two-tailed) were selected and clustered on the basis of temporal and spatial adjacency. To correct for multiple comparisons, this calculation used a Monte Carlo randomization test in which trials from the combined set of two conditions were randomly partitioned into two subsets. This procedure was repeated 1000 times to create a histogram of t values and to calculate the proportion of random partitions that yielded values greater than the observed t values. Two conditions were considered significantly different if the probability of such a proportion (the p value) was less than .05. We also report T values for the obtained clusters of significant differences (representing summed t values across all contributing electrodes) and Cohen's d at the peak.
To assess whether there were any reliable differences between the attended cross-correlation functions across conditions, the attended functions for each electrode in the −200 to 600 msec time window were compared in 10-msec steps in a one-way repeated-measures ANOVA, using a nonparametric permutation approach (1000 permutations) as implemented in the statcond function in EEGLAB (Delorme, 2006). Control for multiple comparisons was achieved using the false discovery rate (FDR, p < .05; Benjamini & Yekutieli, 2001) implemented in the fdr_bh function, allowing us to determine the time points at which the attended time series differed reliably from each other. These ANOVAs were followed by post hoc paired t tests as described above. The same approach was used to assess differences between the unattended cross-correlation functions across conditions.
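The Monte Carlo logic underlying these permutation tests can be illustrated with a stripped-down paired permutation test on a single data point. This sketch omits the spatiotemporal clustering that the full FieldTrip procedure adds on top, and the function name and sign-flipping scheme are our simplification (sign-flipping the within-subject differences is the standard exchangeability argument for paired designs).

```python
import numpy as np

def paired_permutation_t(a, b, n_perm=1000, rng=None):
    """Monte Carlo permutation test for paired samples (one value per
    subject per condition). Builds the null distribution of the t statistic
    by randomly sign-flipping within-subject differences; returns the
    observed t and a two-tailed p value."""
    rng = np.random.default_rng() if rng is None else rng
    d = np.asarray(a) - np.asarray(b)
    n = len(d)
    t_obs = d.mean() / (d.std(ddof=1) / np.sqrt(n))
    null = np.empty(n_perm)
    for i in range(n_perm):
        dp = d * rng.choice([-1.0, 1.0], size=n)  # random condition-label flip
        null[i] = dp.mean() / (dp.std(ddof=1) / np.sqrt(n))
    # Two-tailed p: proportion of permuted t at least as extreme as observed.
    return t_obs, (np.abs(null) >= abs(t_obs)).mean()
```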

Attention over Time

To evaluate whether encoding of the attended and unattended speech envelopes changed as the narrative progressed over time, we assessed the neural encoding of sentences corresponding to the first, middle, and the final third of the narrative (labeled “Beginning,” “Middle,” and “End”). We divided each block (60 sentences) into three equal parts of 20 sentences (Beginning: 1–20; Middle: 21–40; End = 41–60) and then summed across all “Beginning,” “Middle,” and “End” items per condition. This way, we ended up with 80 sentences per group in each condition (e.g., Condition 1 = 1a, 1b, 1c; where a = beginning, b = middle, c = end), which were compared for attended and unattended cross-correlation functions using the same approach as above.
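The grouping described above is a simple partition and aggregation; a sketch (list slicing, with a hypothetical function name) makes the bookkeeping explicit:

```python
def split_into_thirds(blocks):
    """blocks: list of 60-sentence blocks for one condition (4 blocks here).
    Returns Beginning/Middle/End groups of 80 sentences each (4 x 20)."""
    groups = {"Beginning": [], "Middle": [], "End": []}
    for block in blocks:
        assert len(block) == 60
        groups["Beginning"] += block[:20]   # sentences 1-20
        groups["Middle"] += block[20:40]    # sentences 21-40
        groups["End"] += block[40:60]       # sentences 41-60
    return groups
```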

RESULTS

Behavior

Participants completed the comprehension task with a mean accuracy of 94.3% (SD = 3.8%), indicating that the target speaker was attended to as instructed. A one-way repeated-measures ANOVA showed that accuracy differed across conditions, F(3, 63) = 7.750, p < .001. Post hoc tests indicated that participants performed reliably better in the English–English (96.5%, p < .001) and English–Spanish (95.1%, p = .016) conditions than in the English–MuR condition (92.2%). There was no reliable difference between the English–English and English–Spanish conditions, which also did not differ from performance in the Single Talker condition (93.3%).

Average Attended and Unattended Cross-correlations and Their Topographies

Continuous EEG data were recorded while participants listened to narrated stories in English in four different listening conditions (English, Spanish, or MuR as interference, or Single Talker). Average cross-correlations for attended and unattended speech envelopes (averaged across participants and conditions) are depicted in Figure 3. The attended cross-correlation functions (Figure 3A) show robust neural encoding of the attended speech envelope across conditions, with major clusters of peaks at approximate lags of 120 and 320 msec and a less prominent one at around 500 msec post-onset. Note also that we observed some correlations between the EEG signal and the attended envelopes before sentence onset (0 msec). Similar results have been reported previously (Thwaites et al., 2015; Horton et al., 2014) and likely reflect the periodic nature of the speech signal. Matching the acoustic properties of the envelopes across conditions, as described above, ensured that this does not affect the validity of our planned comparisons between them. The averaged cross-correlation function for unattended speech (Figure 3B) shows that few EEG channels cross the significance threshold, indicating that attention had a major impact on encoding of the speech envelopes. Moreover, the shape of the unattended cross-correlation function differs from the attended one, suggesting that it is not simply an attenuated or suppressed version of the same pattern.

Figure 3. 

(A) Average cross-correlations for all attended sentences from −200 to +600 msec post-onset. (B) Average cross-correlations for all unattended sentences from −200 to +600 msec post-onset, clearly showing attenuated encoding compared with the attended sentences. (C) Cross-channel average of the absolute values of the attended cross-correlation function. Prominent peaks are seen at latencies of around 120 and 320 msec, and a less prominent one was seen at around 500 msec post-onset. (D) Scalp topographies for cross-correlation values at all electrode positions averaged over latency ranges 90–150 msec, 290–350 msec, and 490–550 msec, corresponding to the peaks observed in C. Warm colors represent positive correlations, whereas cool colors represent negative correlations. The topography of the earlier effects is more central, whereas the topography of later effects has a more frontal distribution.


Scalp topographies for average attended cross-correlations (Figure 3D) are plotted for latency ranges of 90–150 msec, 290–350 msec, and 490–550 msec based on the observed concentration of peaks at those time points. As Figure 3D shows, the topography of the earlier effects is more central, whereas the topography of later effects has a more frontal distribution.

Attended and Unattended Cross-correlations across Interference Conditions

Auditory cortical responses showed entrainment to the attended speech envelope in each condition individually, consistent with previous studies (Ding & Simon, 2012a; Lalor & Foxe, 2010; Aiken & Picton, 2008). Attended cross-correlation functions were also significantly greater than unattended ones in all conditions, with cluster-based permutation t tests showing both positive and negative differences in each of the three interference conditions (Table 2 and Figure 4). As discussed in the literature (e.g., Kong, Mullangi, & Ding, 2014), the polarity of the cross-correlation can reflect either the direction of the neural current or whether the neural source responds to a power increase or decrease in the envelope. Thus, a negative cross-correlation (or cross-correlation difference) may indicate sources that produce a negative voltage on the scalp following a power increase in the envelope, or sources that produce a positive voltage but track a power decrease in the envelope.
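The logic of a cluster-based permutation t test can be sketched with a minimal one-channel, one-sample example (sign-flip permutations over a subjects × lags matrix of attended-minus-unattended values; the threshold, permutation count, and cluster definition here are illustrative, not the exact parameters used in the study):

```python
import numpy as np
from scipy import stats

def _max_cluster_mass(tvals, t_crit):
    """Largest summed t over a contiguous run of supra-threshold lags."""
    best = run = 0.0
    for t in tvals:
        run = run + t if t > t_crit else 0.0
        best = max(best, run)
    return best

def cluster_perm_test(diff, thresh_p=0.05, n_perm=1000, seed=0):
    """One-sample cluster permutation test over time (positive clusters).
    diff: (n_subjects, n_lags) attended-minus-unattended values."""
    rng = np.random.default_rng(seed)
    n_sub = diff.shape[0]
    t_crit = stats.t.ppf(1 - thresh_p / 2, n_sub - 1)
    obs = _max_cluster_mass(stats.ttest_1samp(diff, 0).statistic, t_crit)
    null = np.empty(n_perm)
    for i in range(n_perm):
        # flip the sign of each subject's difference at random
        signs = rng.choice([-1.0, 1.0], size=(n_sub, 1))
        null[i] = _max_cluster_mass(
            stats.ttest_1samp(diff * signs, 0).statistic, t_crit)
    # Monte Carlo p value for the observed maximum cluster mass
    return obs, (np.sum(null >= obs) + 1) / (n_perm + 1)
```

Negative clusters follow by running the same test on `-diff`; the cluster mass corresponds to the summed-t values (T) reported in Table 2.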

Table 2. 

Pairwise t Tests between Attended and Unattended Cross-correlations in Each Condition

Positive cluster

Attended vs. Unattended | Onset (msec) | Peak (msec) | p | T | Cohen's d
English–English | 130 | 190 | .001 | 2444.1 | 1.3
English–Spanish | N/A | N/A | ns | N/A | N/A
English–MuR | 0 | 120 | .001 | 3573.4 | 2.2

Negative cluster

Attended vs. Unattended | Onset (msec) | Peak (msec) | p | T | Cohen's d
English–English | 150 | 510 | .001 | −2675.8 | 0.6
English–Spanish | 0 | 200 | .009 | −1395.5 | 1.4
English–MuR | 0 | 100 | .001 | −6061.8 | 1.6

T = sum of all t values within the cluster; Cohen's d = effect size at cluster peak.

Figure 4. 

(A) Attended cross-correlation functions in each condition from −200 to +600 msec post-onset. Topographies represent significant electrodes at three peak latencies (100, 300, and 500 msec). (B) Unattended cross-correlation functions in each condition and topographies of significant electrodes at three peak latencies (100, 300, and 500 msec). (C) Results for cluster-based permutation t tests between attended and unattended cross-correlation functions in each condition, representing the topographies of the differences and the timing and maxima of the significant clusters. Horizontal blue line represents the timing of significant differences between conditions. (D) Plots of cross-correlation values per participant in the peak electrode for the attended versus unattended comparison in each condition. Means are shown in black horizontal lines.


For the English–English condition, the difference between attended and unattended cross-correlations emerged only from 130 msec onwards, with the cluster of negative differences (i.e., the attended stream triggering stronger negative cross-correlations than the unattended stream) peaking as late as 510 msec over right frontal regions. In contrast, for both the English–Spanish and English–MuR conditions, the encoding of attended and unattended envelopes differed significantly from the onset, with peaks at 160 and 200 msec over posterior central and right frontal regions in the English–Spanish condition, and at 100 and 120 msec in the English–MuR condition, for positive and negative effects, respectively. This pattern suggests that the type of interference affected how early listeners could differentiate the attended from the unattended stream.

To directly test these apparent latency distinctions between conditions, we extracted cross-correlation difference values for each condition across the 90–150 msec, 290–350 msec, and 490–550 msec time windows (corresponding to the timings observed in Figure 3C) over the posterior central and right frontal areas that consistently emerged as relevant for attentional encoding (Figures 3D and 4C) and submitted them to a repeated-measures ANOVA. Results showed that the three conditions triggered significantly different effects over time across posterior central electrodes (Time × Condition interaction, F(4, 138) = 18.29, p < .001, ηp² = .35), with the English–MuR condition triggering the strongest differentiation between the attended and unattended streams early on, but showing effects comparable to the other two conditions by the latest time window (Figure 5A). In the right frontal areas, conditions showed significantly different effects only in the 490–550 msec time window, F(2, 48) = 5.52, p = .007, ηp² = .20, with the English–English condition triggering the strongest differentiation between attended and unattended streams; however, this effect was not modulated by time (Time × Condition interaction, F < 1; Figure 5B). These results confirm that the type of distractor significantly modulates attentional encoding, with nonlinguistic interference (MuR) dissociated from the attended signal earlier than linguistic interference, and the fully intelligible distractor (English) triggering strong dissociation between the two streams only at later time points.
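Extracting these window averages is straightforward; a minimal sketch (the window bounds come from the text, while `xcorr_diff` and `lags_ms` are assumed inputs: the electrode-averaged attended-minus-unattended difference function and its lag axis in milliseconds):

```python
import numpy as np

# a priori latency windows, in msec, matching the peaks in Figure 3C
WINDOWS_MS = [(90, 150), (290, 350), (490, 550)]

def window_means(xcorr_diff, lags_ms):
    """Mean cross-correlation difference within each latency window;
    one value per window, which then enters the repeated-measures ANOVA."""
    lags_ms = np.asarray(lags_ms)
    return np.array([
        xcorr_diff[..., (lags_ms >= lo) & (lags_ms <= hi)].mean(axis=-1)
        for lo, hi in WINDOWS_MS
    ])
```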

Figure 5. 

(A) Pattern of attended versus unattended cross-correlation differences over time across posterior central electrodes (top left insert), showing strong early dissociation triggered by nonlinguistic interference (MuR). (B) Pattern of attended versus unattended cross-correlation differences over time across right frontal electrodes (top right insert), showing stronger late dissociation triggered by intelligible linguistic interference (English).


Comparisons of Attended Cross-correlations across Conditions

To directly assess differences between encoding of the attended stream under different types of interference, we submitted all attended cross-correlations (including the no-interference Single Talker condition) to a one-way repeated-measures ANOVA. Consistent with the analyses reported above, the results (FDR corrected for multiple comparisons) showed robust differences across conditions, emerging both early (0–300 msec) and at later time points (around 500 msec post-onset). To reveal specific patterns of differences between conditions, this ANOVA was followed up with post hoc t tests. These showed that conditions with linguistic interference (English–English and English–Spanish) triggered significantly greater encoding of the attended stream than the English–MuR condition (Table 3). These differences were significant from the very onset, with positive clusters over the central regions peaking at 210 msec for the English–English versus English–MuR comparison and 260 msec for the English–Spanish versus English–MuR comparison. Comparisons between English–English and English–Spanish showed that they were encoded equivalently up to 320 msec; from 320 to 600 msec, the encoding of attended speech in the English–English condition was significantly greater than in the English–Spanish condition. These results extend the findings reported above, revealing that increasing intelligibility of the interfering stream (English > Spanish > MuR) triggers stronger encoding of the attended speech. Combined with the timing of these effects reported earlier, they suggest that nonintelligible competitors cause earlier interference but weaker encoding of the attended stream, whereas a fully intelligible competitor causes later interference and the strongest encoding of the attended stream.
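The FDR step can be sketched with the classic Benjamini–Hochberg step-up rule (the references cite Benjamini & Yekutieli's correction for dependent tests, which differs only in additionally dividing q by the harmonic sum Σ 1/i; this simpler variant illustrates the logic):

```python
import numpy as np

def fdr_bh(pvals, q=0.05):
    """Benjamini-Hochberg step-up FDR control.
    Returns a boolean mask of rejected (significant) hypotheses."""
    p = np.asarray(pvals, float)
    m = p.size
    order = np.argsort(p)
    # compare sorted p values against the step-up thresholds q*i/m
    below = p[order] <= q * np.arange(1, m + 1) / m
    mask = np.zeros(m, bool)
    if below.any():
        k = np.nonzero(below)[0].max() + 1  # largest i with p_(i) <= q*i/m
        mask[order[:k]] = True              # reject the k smallest p values
    return mask
```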

Table 3. 

Pairwise t Tests between Attended Cross-correlation Functions across Conditions

Attended vs. Attended in the Three Interference Conditions

Positive cluster

Comparison | Onset (msec) | Peak (msec) | p | T | Cohen's d
English–English vs. English–Spanish | N/A | N/A | ns | N/A | N/A
English–English vs. English–MuR | 0 | 210 | .002 | 2551.74 | 1.0
English–Spanish vs. English–MuR | 0 | 260 | .001 | 2266.61 | 0.6

Negative cluster

Comparison | Onset (msec) | Peak (msec) | p | T | Cohen's d
English–English vs. English–Spanish | 320 | 400 | .011 | −1118.03 | 0.9
English–English vs. English–MuR | N/A | N/A | ns | N/A | N/A
English–Spanish vs. English–MuR | N/A | N/A | ns | N/A | N/A

Single Talker Comparisons

Positive cluster

Comparison | Onset (msec) | Peak (msec) | p | T | Cohen's d
Single Talker vs. English–English | — | 260 | .004 | 1449.25 | 1.1
Single Talker vs. English–Spanish | — | 350 | .001 | 2643.88 | 0.8
Single Talker vs. English–MuR | 100 | 200 | .024 | 1114.18 | 0.9

Negative cluster

Comparison | Onset (msec) | Peak (msec) | p | T | Cohen's d
Single Talker vs. English–English | — | 240 | .001 | −2365.86 | 1.1
Single Talker vs. English–Spanish | — | 280 | .001 | −3478.33 | 1.4
Single Talker vs. English–MuR | 60 | 160 | .012 | −1493.71 | 1.1

T = sum of all t values within the cluster; Cohen's d = effect size at cluster peak.

Finally, the Single Talker (no interference) condition showed stronger envelope encoding of the attended speech than any of the interference conditions (Table 3). Compared with the linguistic interference conditions (English–English and English–Spanish), these differences peaked between 250 and 350 msec; in the comparison with the English–MuR condition, they peaked earlier (160–200 msec). These results further emphasize the differential effects of linguistic versus nonlinguistic distractors. More informatively, however, they also lend support to the hypothesis that selective attention itself requires processing capacity (Johnston & Heinz, 1978; Kahneman, 1973), such that the presence of any interference reduces the capacity available for encoding the attended stream, compared with the no-interference condition.

Comparisons between Unattended Cross-correlations across Conditions

We next compared cross-correlation functions between the EEG data and the unattended envelopes in the English–English, English–Spanish, and English–MuR conditions using the same procedure as above. A repeated-measures ANOVA revealed no significant differences across the three conditions. Similarly, post hoc pairwise comparisons showed no significant differences between the unattended English–English and English–Spanish conditions, indicating comparable encoding of both types of linguistic interference. However, both unattended linguistic streams were encoded significantly more strongly than the unattended envelope in the English–MuR condition (Table 4), suggesting that linguistic interference was analyzed to a greater extent than nonlinguistic interference.

Table 4. 

Pairwise t Tests between Unattended Cross-correlation Functions across Conditions

Positive cluster

Unattended vs. Unattended | Onset (msec) | Peak (msec) | p | T | Cohen's d
English–English vs. English–Spanish | N/A | N/A | ns | N/A | N/A
English–English vs. English–MuR | 40 | 130 | .006 | 1459.71 | 0.7
English–Spanish vs. English–MuR | 40 | 390 | .003 | 2417.06 | 1.0

Negative cluster

Unattended vs. Unattended | Onset (msec) | Peak (msec) | p | T | Cohen's d
English–English vs. English–Spanish | N/A | N/A | ns | N/A | N/A
English–English vs. English–MuR | 30 | 390 | .001 | −2965.37 | 0.6
English–Spanish vs. English–MuR | 30 | 330 | .002 | −2145.87 | 0.6

T = sum of all t values within the cluster; Cohen's d = effect size at cluster peak.

Attention over Time

The continuous nature of our stimuli allowed us to also test whether effects of attention on neural encoding remain constant over time. To this end, in each condition we divided each block (60 sentences) into three equal parts (Beginning: Sentences 1–20; Middle: Sentences 21–40; End: Sentences 41–60) and then summed across all Beginning, Middle, and End items per condition. We then assessed the differences between the beginning, middle, and end of each narrative across subjects. There were no significant differences in the strength of neural encoding over time in any condition (all ps > .05) for either attended or unattended streams, indicating that the encoding of speech remained constant throughout the entire narrative for both attended and unattended cross-correlation functions.
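The split itself is simple; a minimal sketch, assuming per-sentence cross-correlation functions are stacked as a (60, n_lags) array per block:

```python
import numpy as np

def narrative_thirds(xcorrs):
    """Average per-sentence cross-correlations over the Beginning,
    Middle, and End thirds of a 60-sentence block."""
    assert xcorrs.shape[0] == 60
    # sentences 1-20, 21-40, 41-60 (0-indexed slices)
    return [xcorrs[i:i + 20].mean(axis=0) for i in (0, 20, 40)]
```

The three resulting functions per condition and participant are then compared across subjects.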

DISCUSSION

This study aimed to understand how attention modulates the neural encoding of speech in the presence of different types of interference. We created a cocktail party paradigm in which participants attended to one speaker while ignoring a competing stream in the other ear. In all conditions, participants attended to a narrative in English. The competing streams varied from fully intelligible English narratives to linguistic interference in a language unknown to the listeners (Spanish) and nonlinguistic noise (MuR). The results showed that attention affected the neural tracking of speech, with attended streams consistently more encoded than the unattended ones. Critically, however, the characteristics of the interfering stream significantly modulated this process, with increasing intelligibility of the distractor causing stronger encoding of both attended and unattended streams and later dissociation between them.

Theoretical accounts of selective attention have put forward a range of views on the locus and the mechanisms of dissociation between interfering auditory streams (Bronkhorst, 2015; Duncan, 1980; Johnston & Heinz, 1978; Broadbent, 1958). Experimental evidence has been mixed, with some authors emphasizing the influence of task-dependent variables on the results obtained (Rivenez et al., 2008). Our study used a natural listening paradigm and the well-established phenomenon of neural encoding of the speech envelope as an index of processing of both attended and unattended streams. The speech envelope is known to offer key acoustic information concerning the syllabic rate of speech and is critical for speech perception (Greenberg, Carvey, Hitchcock, & Chang, 2003; Rosen, 1992).

Consistent with the existing literature (Rimmele et al., 2015; Horton et al., 2013; Zion Golumbic et al., 2013; Ding & Simon, 2012a, 2012b; Horton, D'Zmura, & Srinivasan, 2011), our results demonstrated that attention strongly modulated the neural encoding of the spoken signal: Robust neural tracking of attended speech was observed across all conditions, with significantly weaker encoding for all types of unattended streams. Furthermore, the most prominent cross-correlation peaks in the attended signal appeared at around 100 msec post-onset, followed by peaks around 300 and 500 msec post-onset. The first two peaks have comparable topographies and are clearly emerging from one large cluster of significant correlations; their latencies and topographies (which are prominently more central and bilateral than those of the 500 msec effects) suggest that they might reflect the N1/P2 and N2 components identified in the auditory evoked potential (AEP) literature (Folstein & Van Petten, 2008; Picton & Hillyard, 1974), which have been linked to aspects of sensory encoding of the stimulus. However, it is important to note that the latencies derived from the cross-correlation functions are not necessarily equivalent to those reported in the AEP literature (AEPs represent voltage potentials, whereas our data reflect correlation values between EEG channels and the speech envelope at different latencies); hence, any interpretation in this context is necessarily tentative. Both early and late cross-correlation effects for attended speech observed here were previously reported in the literature and were also shown to be enhanced by attention (e.g., Kong et al., 2014; Power, Foxe, Forde, Reilly, & Lalor, 2012). Also comparable to the previous literature are our findings about significantly reduced or absent clusters of cross-correlation peaks for unattended streams (Ding & Simon, 2012a; Power et al., 2012). 
Yet, our main question was about the possible influences of the type of interfering signal on the processes of selective attention, which is what we turn to next.

Intelligibility of Interfering Speech Modulates the Encoding of the Attended Stream

To test how the mechanisms for dissociating between the competing streams interact with the nature of the interfering signal, we manipulated the type of interfering stream, ranging from fully intelligible English narratives to linguistic interference in a language unknown to the listeners (Spanish) and nonlinguistic noise (MuR). We predicted that the fully intelligible and meaningful distractor in the English–English condition should produce the greatest interference, requiring the use of higher-level lexicosemantic features to dissociate between the two competing streams. The results confirmed that the type of interference significantly modulates how attended speech is encoded in the brain. Although all three interference conditions showed robust differences between the encoding of attended and unattended streams, the onset of these differences was markedly dissimilar across conditions. In the two conditions where the interfering stream was not meaningful to our listeners, either because it was nonlinguistic (MuR) or because it was in a language they did not understand (Spanish), the difference in the encoding of the attended and unattended streams emerged right from the onset. In contrast, intelligible interfering speech (English) was encoded comparably to the attended stream for up to 130 msec after the onset of the competing sentences, with differences between them only emerging after that latency and peaking at 510 msec in the right frontal areas. Direct comparisons of differences in encoding across conditions (Figure 5) further supported this conclusion, showing that nonlinguistic interference (MuR) triggered strong early dissociation between the attended and unattended streams across posterior central areas, whereas intelligible linguistic interference (English) triggered stronger late dissociation across more frontal areas.
This clearly indicates that nonlinguistic interference, which can be distinguished from the attended stream based on lower-level features (speech vs. nonspeech), can be dissociated easily from the onset, resulting in an immediate enhancement of the attended signal. In contrast, attended and intelligible interfering streams are encoded comparably early on, such that the encoding of the attended stream is enhanced only after both streams have been processed beyond their sensory properties. These results can be clearly interpreted within the framework of flexible accounts of selective attention (Bronkhorst, 2015; Johnston & Heinz, 1978), whereby selection between the two streams can be achieved earlier when the distractor is nonintelligible and does not necessitate the use of lexical information to dissociate it from the attended speech. The absence of such cues in the English–English condition requires the use of more complex lexicosemantic information, causing delayed selection and later enhancement in the encoding of the attended stream.

Another hypothesis of the flexible selective attention accounts is that the use of higher-level semantic and syntactic information to dissociate between the two streams requires more processing capacity. If correct, this would predict that the strength of encoding of the attended stream would vary as a function of the intelligibility of the distractor, with fully intelligible distractors triggering the strongest encoding. In line with proposals about attentional enhancement (e.g., Horton et al., 2013), this would imply that, to maintain full speech comprehension in conditions where processing capacity is divided between two streams, the neurocognitive system might amplify the neural activity synchronized to the attended envelope most strongly in the presence of fully intelligible distractors (e.g., English–English), compared with "easier" conditions (MuR/Spanish). We explored this hypothesis by directly comparing the strength of encoding of the attended streams across the three interference conditions. Results showed that, as the intelligibility of the distractor increased, the strength of encoding of the attended stream also increased, such that attended cross-correlation functions were strongest in the English–English condition and weakest in the English–MuR condition. Notably, however, the strength of encoding in the English–English condition only differed from the English–Spanish condition from 320 msec onwards, when lexicosemantic information would have become available to dissociate between competing linguistic streams (cf. Marslen-Wilson, 1973, 1978).
In contrast, both linguistic interference conditions triggered greater attentional encoding than the English–MuR condition from the very onset, arguably reflecting earlier access to the lower-level phonological information (i.e., spectral and temporal properties of formant frequencies needed to dissociate speech from non-speech; Uppenkamp et al., 2006) and easier dissociation of MuR from the attended linguistic stream. This distribution of effects over time fits well with results from the literature on spoken word recognition (e.g., Davis & Johnsrude, 2003; Moss, McCormick, & Tyler, 1997; Frauenfelder & Tyler, 1987; Marslen-Wilson, 1978), which show hierarchical processing of spoken words from the initial acoustic analyses to deriving their lexical and semantic properties at the later processing stages. In the context of attentional effects, these results directly follow from the hypothesis that attention consumes processing capacity (Johnston & Heinz, 1978) such that, to avoid overloading it, stimuli are processed at a minimum level required to carry out a task.

Complementary results also emerged from the Single Talker condition, where participants attended to speech without interference. As discussed, this condition provides a test case for cortical entrainment to speech that is not modulated by the processing demands of divided attention, where all attentional resources can be fully allocated to the instructed task (cf. Kahneman, 1973). The data revealed that attended speech in the Single Talker condition was encoded significantly more strongly than in any of the interference conditions, with differences peaking between 250 and 350 msec compared with the linguistic interference conditions (English–English and English–Spanish) and between 160 and 200 msec compared with nonlinguistic interference. The topographies of these comparisons also show a clear distinction between linguistic and nonlinguistic interference, further supporting the hypothesis that attentional mechanisms flexibly adapt to the differing demands of linguistic and nonlinguistic distractors.

We next turn to the unattended cross-correlations across conditions and comparisons between them. If auditory selective attention is a flexible mechanism, whereby attended and unattended information can be differentiated at different processing depths depending on the type of distractor, then we could expect some differences in encoding across the three interference conditions. Specifically, the distractor that can be dissociated from the attended speech stream earlier and more easily on the basis of lower-level differences (MuR) would be expected to be encoded less strongly than the linguistic distractors (English and Spanish). This is exactly the pattern we observed, with both unattended linguistic streams encoded significantly more strongly than the unattended envelope in the English–MuR condition. The differences are, however, subtle and only emerge in post hoc pairwise comparisons between conditions, possibly reflecting entrainment-based suppression of all unattended streams (Horton et al., 2013).

Encoding of Attended Speech Remains Constant over Time

To our knowledge, this is the first study that has used continuous natural speech to test whether effects of attention on neural encoding remain constant over time. To do this, we compared the encoding in the beginning, the middle, and the end of the narrative (with each narrative being 3 min long), across all conditions. No significant differences emerged in any condition for both attended and unattended cross-correlation functions, indicating that the neural encoding of speech remained constant over time.

Conclusion

Our results demonstrate that top–down attention significantly modulates the neural encoding of attended speech in the presence of interference. Characteristics of the interfering stream significantly modulate this process, with increasing intelligibility of the distractor causing stronger encoding of both attended and unattended streams and later dissociation between them. These effects remain constant over the course of a narrative. The results offer strong support to flexible accounts of selective attention.

Reprint requests should be sent to Andrea Olguin, Department of Psychology, University of Cambridge, CB2 3EB, Cambridge, United Kingdom, or via e-mail: ako26@cam.ac.uk.

REFERENCES

Abrams, D. A., Nicol, T., Zecker, S., & Kraus, N. (2008). Right-hemisphere auditory cortex is dominant for coding syllable patterns in speech. Journal of Neuroscience, 28, 3958–3965.

Aiken, S. J., & Picton, T. W. (2008). Human cortical responses to the speech envelope. Ear and Hearing, 29, 139–157.

Aydelott, J., Jamaluddin, Z., & Nixon Pearce, S. (2015). Semantic processing of unattended speech in dichotic listening. Journal of the Acoustical Society of America, 138, 964–975.

Baltzell, L. S., Horton, C., Shen, Y., Richards, V. M., D'Zmura, M., & Srinivasan, R. (2016). Attention selectively modulates cortical entrainment in different regions of the speech spectrum. Brain Research, 1644, 203–212.

Bendat, J. S., & Piersol, A. G. (1986). Random data: Analysis and measurement procedures. New York: Wiley.

Benjamini, Y., & Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics, 29, 1165–1188.

Bentin, S., Kutas, M., & Hillyard, S. A. (1995). Semantic processing and memory for attended and unattended words in dichotic listening: Behavioral and electrophysiological evidence. Journal of Experimental Psychology: Human Perception and Performance, 21, 54–67.

Bozic, M., Tyler, L. K., Ives, D. T., Randall, B., & Marslen-Wilson, W. D. (2010). Bihemispheric foundations for human speech comprehension. Proceedings of the National Academy of Sciences, U.S.A., 107, 17439–17444.

Brainard, D. H. (1997). The psychophysics toolbox. Spatial Vision, 10, 433–436.

Broadbent, D. (1958). Perception and communication. London: Pergamon Press.

Bronkhorst, A. W. (2015). The cocktail-party problem revisited: Early processing and selection of multi-talker speech. Attention, Perception, & Psychophysics, 77, 1465–1487.

Brungart, D. S., & Simpson, B. D. (2007). Effect of target-masker similarity on across-ear interference in a dichotic cocktail party listening task. Journal of the Acoustical Society of America, 122, 1724–1734.

Cherry, E. C. (1953). Some experiments on the recognition of speech with one and two ears. Journal of the Acoustical Society of America, 25, 975–979.

Davis, M. H., & Johnsrude, I. S. (2003). Hierarchical processing in spoken language comprehension. Journal of Neuroscience, 23, 3423–3431.

Delorme, A. (2006). Statistical methods. In J. Webster (Ed.), Encyclopedia of medical devices and instrumentation (pp. 240–264). Hoboken: Wiley Interscience.

Delorme, A., & Makeig, S. (2004). EEGLAB: An open source toolbox for analysis of single-trial EEG dynamics including independent component analysis. Journal of Neuroscience Methods, 134, 9–21.

Deutsch, J., & Deutsch, D. (1963). Attention: Some theoretical considerations. Psychological Review, 70, 80–90.

Ding, N., Chatterjee, M., & Simon, J. Z. (2013). Robust cortical entrainment to the speech envelope relies on the spectro-temporal fine structure. Neuroimage, 88C, 41–46.

Ding, N., & Simon, J. Z. (2012a). Emergence of neural encoding of auditory objects while listening to competing speakers. Proceedings of the National Academy of Sciences, U.S.A., 109, 11854–11859.

Ding, N., & Simon, J. Z. (2012b). Neural coding of continuous speech in auditory cortex during monaural and dichotic listening. Journal of Neurophysiology, 107, 78–89.

Doelling, K. B., Arnal, L. H., Ghitza, O., & Poeppel, D. (2014). Acoustic landmarks drive delta–theta oscillations to enable speech comprehension by facilitating perceptual parsing. Neuroimage, 85, 761–768.

Drullman, R., Festen, J. M., & Plomp, R. (1994). Effect of reducing slow temporal modulations on speech reception. Journal of the Acoustical Society of America, 95, 1053.

Duncan, J. (1980). The locus of interference in the perception of simultaneous stimuli. Psychological Review, 87, 272–300.

Dupoux, E., Kouider, S., & Mehler, J. (2003). Lexical access without attention? Explorations using dichotic priming. Journal of Experimental Psychology: Human Perception and Performance, 29, 172–184.

Eich, E. (1984). Memory for unattended events: Remembering with and without awareness. Memory & Cognition, 12, 105–111.

Folstein, J. R., & Van Petten, C. (2008). Influence of cognitive control and mismatch on the N2 component of the ERP: A review. Psychophysiology, 45, 152–170.

Frauenfelder, U., & Tyler, L. K. (1987). The process of spoken word recognition: An introduction. Cognition, 25, 1–20.

Giraud, A., & Poeppel, D. (2012). Speech perception from a neurophysiological perspective. In The human auditory cortex. Springer Handbook of Auditory Research (Vol. 43, pp. 225–260). New York: Springer.

Greenberg, S., Carvey, H., Hitchcock, L., & Chang, S. (2003). Temporal properties of spontaneous speech—A syllable-centric perspective. Journal of Phonetics, 31, 465–485.

Hambrook, D. A., & Tata, M. S. (2014). Theta-band phase tracking in the two-talker problem. Brain and Language, 135, 52–56.

Holender, D. (1986). Semantic activation without conscious identification in dichotic listening, parafoveal vision, and visual masking: A survey and appraisal
.
Behavioral and Brain Sciences
,
9
,
1
23
.
Horton
,
C.
,
D'Zmura
,
M.
, &
Srinivasan
,
R.
(
2011
).
EEG reveals divergent paths for speech envelopes during selective attention
.
International Journal of Bioelectromagnetism
,
13
,
217
222
.
Horton
,
C.
,
D'Zmura
,
M.
, &
Srinivasan
,
R.
(
2013
).
Suppression of competing speech through entrainment of cortical oscillations
.
Journal of Neurophysiology
,
109
,
3082
3093
.
Horton
,
C.
,
Srinivasan
,
R.
, &
D'Zmura
,
M.
(
2014
).
Envelope responses in single-trial EEG indicate attended speaker in a cocktail party
.
Journal of Neural Engineering
,
141
,
520
529
.
James
,
W.
(
1890
).
The principles of psychology
.
New York
:
Henry Holt and Company
.
Johnston
,
W. A.
, &
Heinz
,
S. P.
(
1978
).
Flexibility and capacity demands of attention
.
Journal of Experimental Psychology: General
,
107
,
420
435
.
Kahneman
,
D.
(
1973
).
Attention and effort
.
Englewood Cliffs, NJ
:
Prentice-Hall, Inc.
Kerlin
,
J. R.
,
Shahin
,
A. J.
, &
Miller
,
L. M.
(
2010
).
Attentional gain control of ongoing cortical speech representations in a “cocktail party.”
Journal of Neuroscience
,
30
,
620
628
.
Kong
,
Y. Y.
,
Mullangi
,
A.
, &
Ding
,
N.
(
2014
).
Differential modulation of auditory responses to attended and unattended speech in different listening conditions
.
Hearing Research
,
316
,
73
81
.
Lakatos
,
P.
,
Karmos
,
G.
,
Mehta
,
A. D.
,
Ulbert
,
I.
, &
Schroeder
,
C. E.
(
2008
).
Entrainment of neuronal attentional selection
.
Science
,
320
,
23
25
.
Lalor
,
E. C.
, &
Foxe
,
J. J.
(
2010
).
Neural responses to uninterrupted natural speech can be extracted with precise temporal resolution
.
European Journal of Neuroscience
,
31
,
189
193
.
Malhotra
,
P.
,
Coulthard
,
E. J.
, &
Husain
,
M.
(
2009
).
Role of right posterior parietal cortex in maintaining attention to spatial locations over time
.
Brain
,
132
,
645
660
.
Manly
,
T.
,
Owen
,
A. M.
,
McAvinue
,
L.
,
Datta
,
A.
,
Lewis
,
G. H.
,
Scott
,
S. K.
, et al
(
2003
).
Enhancing the sensitivity of a sustained attention task to frontal damage: Convergent clinical and functional imaging evidence
.
Neuroscase
,
9
,
340
349
.
Maris
,
E.
, &
Oostenveld
,
R.
(
2007
).
Nonparametric statistical testing of EEG- and MEG-data
.
Journal of Neuroscience Methods
,
164
,
177
190
.
Marslen-Wilson
,
W. D.
(
1973
).
Linguistic structure and speech shadowing at very short latencies
.
Nature
,
244
,
522
.
Marslen-Wilson
,
W. D.
(
1978
).
Processing interactions and lexical access during word recognition in continuous speech
.
Cognitive Psychology
,
10
,
29
63
.
Moss
,
H. E.
,
McCormick
,
S.
, &
Tyler
,
L. K.
(
1997
).
The time course of activation of semantic information during spoken word recognition
.
Language and Cognitive Processes
,
12
,
695
731
.
Murphy
,
S.
,
Fraenkel
,
N.
, &
Dalton
,
P.
(
2013
).
Perceptual load does not modulate auditory distractor processing
.
Cognition
,
129
,
345
355
.
Newstead
,
S.
, &
Dennis
,
I.
(
1979
).
Lexical and grammatical processing of unshadowed messages: A re-examination of the Mackay effect
.
Quarterly Journal of Experimental Psychology
,
31
,
477
488
.
Peelle
,
J. E.
,
Gross
,
J.
, &
Davis
,
M. H.
(
2013
).
Phase-locked responses to speech in human auditory cortex are enhanced during comprehension
.
Cerebral Cortex
,
23
,
1378
1387
.
Pelli
,
D. G.
(
1997
).
The VideoToolbox software for visual psychophysics: Transforming numbers into movies
.
Spatial Vision
,
10
,
437
442
.
Picton
,
T. W.
, &
Hillyard
,
S. A.
(
1974
).
Human auditory evoked potentials. II. Effects of attention
.
Electroencephalography and Clinical Neurophysiology
,
36
,
191
200
.
Power
,
A. J.
,
Foxe
,
J. J.
,
Forde
,
E. J.
,
Reilly
,
R. B.
, &
Lalor
,
E. C.
(
2012
).
At what time is the cocktail party? A late locus of selective attention to natural speech
.
European Journal of Neuroscience
,
35
,
1497
1503
.
Pulvermüller
,
F.
,
Shtyrov
,
Y.
,
Hasting
,
A.
, &
Carlyon
,
R. P.
(
2008
).
Syntax as a reflex: Neurophysiological evidence for early automaticity of grammatical processing
.
Brain and Language
,
104
,
244
253
.
Rimmele
,
J. M.
,
Zion Golumbic
,
E.
,
Schröger
,
E.
, &
Poeppel
,
D.
(
2015
).
The effects of selective attention and speech acoustics on neural speech-tracking in a multi-talker scene
.
Cortex
,
68
,
144
154
.
Rivenez
,
M.
,
Guillaume
,
A.
,
Bourgeon
,
L.
, &
Darwin
,
C. J.
(
2008
).
Effect of voice characteristics on the attended and unattended processing of two concurrent messages
.
European Journal of Cognitive Psychology
,
20
,
967
993
.
Rosen
,
S.
(
1992
).
Temporal information in speech: Acoustic, auditory and linguistic aspects
.
Philosophical Transactions of the Royal Society of London, Series B, Biological Sciences
,
336
,
367
373
.
Schroeder
,
C. E.
, &
Lakatos
,
P.
(
2010
).
Low-frequency neuronal oscillations as instruments of sensory selection
.
Trends in Neuroscience
,
32
,
9
18
.
Thwaites
,
A.
,
Nimmo-Smith
,
I.
,
Fonteneau
,
E.
,
Patterson
,
R. D.
,
Buttery
,
P.
, &
Marslen-Wilson
,
W. D.
(
2015
).
Tracking cortical entrainment in neural activity: Auditory processes in human temporal cortex
.
Frontiers in Computational Neuroscience
,
9
,
1
13
.
Treisman
,
A. M.
(
1969
).
Strategies and models of selective attention
.
Psychological Review
,
76
,
282
299
.
Uppenkamp
,
S.
,
Johnsrude
,
I. S.
,
Norris
,
D.
,
Marslen-Wilson
,
W.
, &
Patterson
,
R. D.
(
2006
).
Locating the initial stages of speech-sound processing in human temporal cortex
.
Neuroimage
,
31
,
1284
1296
.
Wood
,
N.
, &
Cowan
,
N.
(
1995
).
The cocktail party phenomenon revisited: How frequent are attention shifts to one's name in an irrelevant auditory channel?
Learning, Memory, and Cognition
,
21
,
255
260
.
Wood
,
N. L.
,
Stadler
,
M. A.
, &
Cowan
,
N.
(
1997
).
Is there implicit memory without attention? A reexamination of task demands in Eich's (1984) procedure
.
Memory & Cognition
,
25
,
772
779
.
Zion Golumbic
,
E. M.
,
Ding
,
N.
,
Bickel
,
S.
,
Lakatos
,
P.
,
Schevon
,
C. A.
,
McKhann
,
G. M.
, et al
(
2013
).
Mechanisms underlying selective neuronal tracking of attended speech at a ‘cocktail party’
.
Neuron
,
77
,
980
991
.
Zion Golumbic
,
E. M.
,
Poeppel
,
D.
, &
Schroeder
,
C. E.
(
2012
).
Temporal context in speech processing and attentional stream selection: A behavioral and neural perspective
.
Brain and Language
,
122
,
151
161
.