Categorical judgments of otherwise identical phonemes are biased toward hearing words (i.e., the “Ganong effect”), suggesting lexical context influences perception of even basic speech primitives. Lexical biasing could manifest via late stage postperceptual mechanisms related to decision or, alternatively, top–down linguistic inference that acts on early perceptual coding. Here, we exploited the temporal sensitivity of EEG to resolve the spatiotemporal dynamics of these context-related influences on speech categorization. Listeners rapidly classified sounds from a /gɪ/-/kɪ/ gradient presented in opposing word–nonword contexts (GIFT–kift vs. giss–KISS), designed to bias perception toward lexical items. Phonetic perception shifted toward the direction of words, establishing a robust Ganong effect behaviorally. ERPs revealed a neural analog of lexical biasing emerging within ∼200 msec. Source analyses uncovered a distributed neural network supporting the Ganong, including middle temporal gyrus, inferior parietal lobe, and middle frontal cortex. Yet, among Ganong-sensitive regions, only left middle temporal gyrus and inferior parietal lobe predicted behavioral susceptibility to lexical influence. Our findings confirm lexical status rapidly constrains sublexical categorical representations for speech within several hundred milliseconds but likely does so outside the purview of canonical auditory-sensory brain areas.
An important building block for language is the ability to transform sensory information into abstract linguistic representations (Goldstone & Hendrickson, 2010). Speech sounds vary continuously across time, environments, speaker identities, and stimulus contexts, and yet, listeners easily parse the speech stream into discrete phonemes (Lotto & Holt, 2016; Phillips, 2001; Pisoni & Luce, 1987). The categorical perception (CP) of speech maps infinitely variable acoustic signals into discrete phonetic–linguistic representations on which the speech-language system can operate (Pisoni & Luce, 1987; Pisoni, 1973; Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967). CP is indicated when gradually morphed speech sounds along a continuum are heard as belonging to one of a few discrete phonetic classes. Tokens labeled with different identities are said to cross the categorical boundary, a psychological border where listeners' responses abruptly flip because of a perceptual warping of the stimulus space (i.e., compression of within-category sounds; Best & Goldstone, 2019; Goldstone, Steyvers, Spencer-Smith, & Kersten, 2000; Livingston, Andrews, & Harnad, 1998).
One nebulous issue in speech perception concerns whether higher-level activation of lexical representations directly affects sublexical components (e.g., phoneme categories). On one extreme is the rigid view that, once established, internalized speech prototypes (i.e., equivalence classes or category members) are invariant to superficial stimulus manipulation or lexical context (Liberman, Harris, Hoffman, & Griffith, 1957). Under this model, categories are impervious to influences from surrounding information and sound elements that precede or follow an isolated stimulus cannot influence its categorization or location of the perceptual boundary. On the contrary, acoustic–phonetic categories—traditionally considered early or lower-level constructs of the speech signal—are in fact highly malleable to contextual variations (Holt & Lotto, 2010; Myers & Blumstein, 2008; Francis & Ciocca, 2003; Norris, McQueen, & Cutler, 2003; Elman & McClelland, 1988; Ganong, 1980; Pisoni, 1975). Moreover, the degree to which context influences the category identity of speech varies with language experience (Bidelman & Lee, 2015; Lively, Logan, & Pisoni, 1993; Kuhl, Williams, Lacerda, Stevens, & Lindblom, 1992). Consequently, it is now well-established that phonetic categories are flexible and perception of even individual speech features depends critically on the surrounding signal (Repp & Liberman, 1987).
Context-dependent effects in CP are best illustrated by the so-called “Ganong effect” (Ganong, 1980). The Ganong phenomenon occurs when listeners' perceived category boundary of a word–nonword continuum of phonemes shifts (is biased) toward the lexical item. When perceiving a “da-ta” continuum, for example, English-speaking listeners show a stark shift in their perceptual category boundary toward lexical items when one of the gradient's endpoints contains a real word (e.g., “DASH-tash”; Ganong, 1980; Ganong & Zatorre, 1980). Similar interpretive biasing can be induced via learning when listeners are exposed to new contexts that shape their perception of otherwise isolated sounds (Norris et al., 2003). Collectively, behavioral studies suggest that stimulus context expands the mental category for expected or behaviorally relevant stimuli (McMurray, Dennhardt, & Struck-Marcell, 2008).
One interpretation of lexical effects is that they reflect direct linguistic influence on perceptual processes. Alternatively, another school of thought argues lexical context effects are postperceptual and are therefore related to executive mechanisms (i.e., response selection, decision). Fox (1984) tested the interaction between lexical knowledge and phonetic categorization during speech perception using Ganong-like stimuli. Lexical status did not influence phonetic categorization at shorter response latencies or when participants were given a response deadline, suggesting lexical context influences later stimulus selection rather than perceptual encoding, per se. This notion is supported by results from Pitt and Samuel (1993), who found the strength of lexical influences on perception of ambiguous sound tokens depended on their position in a word; lexical effects were weaker when tokens occurred toward the beginning compared to the end of words. These data support “late stage” or “selection-based” models whereby the very formation of categories themselves only emerges at a late decision stage of the processing hierarchy (e.g., MERGE model; Norris, McQueen, & Cutler, 2000).
Rather than acting at late stages, lexical biasing could instead manifest via top–down (and perhaps bi-directional) modulations of early perceptual processing with the lexical interface. Indeed, growing evidence from neuroimaging studies (Noe & Fischer-Baum, 2020; Gow, Segawa, Ahlfors, & Lin, 2008; Myers & Blumstein, 2008; van Linden, Stekelenburg, Tuomainen, & Vroomen, 2007) reaffirms such interactive, connectionist views of categorization (e.g., TRACE; McClelland & Elman, 1986). Employing fMRI with a Ganong task, Myers and Blumstein (2008) found that the placement of the phonetic boundary modulated activity both in perceptual (e.g., superior temporal gyrus [STG], inferior parietal lobe [IPL]) and frontal executive brain areas (inferior frontal gyrus, ACC), with greater activity for ambiguous items near the boundary. The mere involvement of the STG strongly suggests that lexical shifts are not solely due to executive decision processes but, at minimum, include a perceptual component that either itself has direct access to lexical properties or is interactively reactivated to integrate phonetic and extraphonetic factors in placing the phonetic boundary (Noe & Fischer-Baum, 2020; Gow et al., 2008; Myers & Blumstein, 2008). Although fMRI offers excellent spatial characterization of potential lexical effects, it lacks the temporal precision necessary to resolve the underlying brain dynamics of category formation (Bidelman, Moreno, & Alain, 2013) and related lexical influences (Gow et al., 2008), both of which unfold within a few hundred milliseconds after speech onset (e.g., Mahmud, Yeasin, & Bidelman, 2020).
Extending prior neuroimaging work (Gow et al., 2008; Myers & Blumstein, 2008), the aim of this study was to characterize the spatiotemporal dynamics of context-dependent lexical influences on CP with the goal of establishing where and when speech categories are prone to Ganong-like biasing. We used EEG coupled with source reconstruction to assess the underlying neural bases of phoneme categorization and its lexical modulation. Our task included word–nonword (GIFT–kift) and nonword-to-word (giss–KISS) acoustic gradients of an otherwise identical /gɪ/-/kɪ/ acoustic–phonetic continuum designed to bias listeners' perception toward the lexical item and shift their perceptual category boundary (Myers & Blumstein, 2008; Ganong, 1980). Our findings confirm that lexical status rapidly (∼200–300 msec) constrains sublexical category speech representations but further suggest this interactivity occurs outside canonical auditory-linguistic brain structures. Instead, among Ganong-sensitive brain regions, we find engagement of a temporoparietal circuit (i.e., inferior parietal, middle temporal gyrus [MTG]) is critical to describing listeners' susceptibility to contextual biasing during category judgments.
Sixteen young adults (3 men, 13 women; age: M = 24.5, SD = 12.9 years) were recruited from the University of Memphis student body.1 Sample size was based on several previous neuroimaging studies on context effects in CP (e.g., Gow et al., 2008; Myers & Blumstein, 2008). All exhibited normal hearing sensitivity confirmed via audiometric screening (i.e., < 25 dB HL, octave frequencies 250–8000 Hz). Each participant was strongly right-handed (74.8 ± 27.0% laterality index; Oldfield, 1971), had obtained a collegiate level of education (18.8 ± 2.7 years formal schooling), and was a native speaker of American English. Participants were considered nonmusicians (e.g., Mankel & Bidelman, 2018), having, on average, 3.25 ± 3.3 years of music training. All were paid for their time and gave informed consent in compliance with a protocol approved by the institutional review board at the University of Memphis.
Speech Stimulus Continua
Stimuli were adapted from Myers and Blumstein (2008). Speech tokens consisted of a /gɪ/ to /kɪ/ (i.e., “gih” to “kih”) stop-consonant continuum presented in two word/nonword contexts.2 Each continuum was constructed using eight equally spaced VOTs incrementing from 18 msec (/g/ percept) to 70 msec (/k/ percept; Figure 1). This otherwise identical VOT continuum was used to create word-to-nonword (GIFT–kift) and nonword-to-word (giss–KISS) gradients designed to bias listeners' phonemic perception toward the lexical item (Figure 1B). This was achieved by splicing the appropriate aspiration (i.e., “-ft” for GIFT–kift; “-ss” for giss–KISS) to the end of the otherwise identical /gɪ/-/kɪ/ sounds (for details, see Myers & Blumstein, 2008). All tokens were 500 msec in duration and root-mean-square amplitude normalized.
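The spacing of the continuum can be made concrete with a short sketch. It assumes both endpoints (18 and 70 msec) are included among the eight steps, which gives a step size of 52/7 ≈ 7.4 msec; the function name `vot_continuum` is hypothetical and is not taken from the original stimulus-generation procedure.

```python
def vot_continuum(start_ms=18.0, end_ms=70.0, n_steps=8):
    """Return n_steps equally spaced VOT values (msec), endpoints included.

    Assumption: the first and last tokens sit exactly on start_ms and end_ms.
    """
    step = (end_ms - start_ms) / (n_steps - 1)
    return [start_ms + i * step for i in range(n_steps)]


vots = vot_continuum()
step_ms = vots[1] - vots[0]
for token, vot in enumerate(vots, start=1):
    print(f"Tk{token}: VOT = {vot:.1f} msec")
```

Under this assumption, the midpoint token (Tk4) carries a VOT of roughly 40 msec.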
During EEG recording, listeners heard 120 trials of each individual token (per context) in which they labeled the sound with a binary response (“g” or “k”) as quickly and accurately as possible. Following each response, the ISI was jittered randomly between 800 and 1000 msec (20-msec steps, uniform distribution) to avoid rhythmic entrainment of the EEG and anticipation of subsequent stimuli. Block order for the GIFT–kift versus giss–KISS continua was randomized within and between participants. The auditory stimuli were delivered binaurally at 79 dB SPL through shielded insert earphones (ER-2; Etymotic Research) controlled by a TDT RP2 signal processor (Tucker Davis Technologies).
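As an illustration of the jittered ISI, the following sketch draws intervals uniformly from the discrete set 800, 820, ..., 1000 msec. It is not the presentation code used in the experiment; the helper name `draw_isi` and the fixed seed are assumptions made only so the example is reproducible.

```python
import random


def draw_isi(rng, low_ms=800, high_ms=1000, step_ms=20):
    """Draw one ISI (msec) uniformly from {low, low + step, ..., high}."""
    # randrange excludes its stop value, so add 1 to keep high_ms reachable
    return rng.randrange(low_ms, high_ms + 1, step_ms)


rng = random.Random(0)  # seeded only for reproducibility of this example
isis = [draw_isi(rng) for _ in range(120)]  # e.g., one ISI per trial of a token
```

Drawing from a discrete uniform set keeps every permitted interval equally likely, so no single stimulus rate dominates the trial sequence.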
EEGs were recorded from 64 sintered Ag/AgCl electrodes at standard 10–10 scalp locations (Oostenveld & Praamstra, 2001). Continuous data were digitized at 500 Hz (SynAmps RT amplifiers; Compumedics Neuroscan) using an online passband of DC-200 Hz. Electrodes placed on the outer canthi of the eyes and the superior and inferior orbit monitored ocular movements. Contact impedances were maintained < 10 kΩ. During acquisition, electrodes were referenced to an additional sensor placed ∼ 1 cm posterior to Cz. Data were rereferenced off-line to the common average for analysis. Preprocessing was performed in BESA Research (v7.1; BESA, GmbH). Ocular artifacts (saccades and blinks) were corrected in the continuous EEG using PCA (Picton et al., 2000). Cleaned EEGs were then filtered (1–20 Hz), epoched (−200 to 800 msec), baselined to the prestimulus interval, and ensemble averaged resulting in 16 ERP waveforms per participant (8 tokens × 2 contexts).
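The epoching, baseline correction, and ensemble averaging steps can be sketched for a single channel as follows. This is a simplified illustration rather than the BESA pipeline: filtering, rereferencing, and artifact correction are omitted, and the function name and synthetic signal are assumptions.

```python
def epoch_and_average(signal, onsets, fs=500, tmin=-0.2, tmax=0.8):
    """Cut epochs around stimulus onsets, baseline-correct, and average.

    signal : list of floats, one continuous EEG channel
    onsets : list of sample indices marking stimulus onsets
    """
    pre = int(round(-tmin * fs))   # prestimulus samples (100 at 500 Hz)
    post = int(round(tmax * fs))   # poststimulus samples (400 at 500 Hz)
    epochs = []
    for onset in onsets:
        if onset - pre < 0 or onset + post > len(signal):
            continue  # skip epochs that run off the recording
        segment = signal[onset - pre:onset + post]
        baseline = sum(segment[:pre]) / pre  # mean of the prestimulus interval
        epochs.append([v - baseline for v in segment])
    if not epochs:
        raise ValueError("no complete epochs found")
    n = len(epochs)
    return [sum(ep[i] for ep in epochs) / n for i in range(pre + post)]


# Synthetic check: a DC offset of 5 with a 2-unit deflection 200-300 msec
# after each onset (samples onset+100 .. onset+149 at 500 Hz).
signal = [5.0] * 3000
onsets = [500, 1500, 2500]
for onset in onsets:
    for i in range(onset + 100, onset + 150):
        signal[i] += 2.0
erp = epoch_and_average(signal, onsets)
```

In the synthetic example, baseline correction removes the DC offset, so the averaged waveform is zero before onset and retains only the poststimulus deflection.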
Behavioral Data Analysis
Identification scores were fit with a sigmoid function $P = 1/[1 + e^{-\beta_1(x - \beta_0)}]$, where P is the proportion of trials identified as a given phoneme, x is the step number along the stimulus continuum, and $\beta_0$ and $\beta_1$ the location and slope of the logistic fit estimated using nonlinear least-squares regression. Comparing parameters between speech contexts revealed possible differences in the “steepness” (i.e., rate of change) and, more critically, the location of the categorical boundary as a function of speech context. A lexical bias (i.e., Ganong effect) is indicated when the location of the perceptual boundary ($\beta_0$) in phoneme identification shifts dependent on the anchoring speech context (Myers & Blumstein, 2008; Ganong, 1980). Behavioral labeling speeds (i.e., RTs) were computed as listeners' median response latency across trials for a given condition. RTs outside 250–2500 msec were deemed outliers (e.g., fast guesses, lapses of attention) and were excluded from the analysis (Bidelman et al., 2013; Bidelman & Walker, 2017).
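The behavioral measures above can be sketched as follows. The sigmoid and the RT exclusion window follow the text; everything else is an assumption for illustration. In particular, a brute-force grid refinement stands in for whatever nonlinear least-squares solver was actually used, the function names are hypothetical, and the identification data are synthetic (taking P as the proportion of “k” responses).

```python
import math
import statistics


def logistic(x, b0, b1):
    """Two-parameter sigmoid: b0 = boundary location, b1 = slope."""
    return 1.0 / (1.0 + math.exp(-b1 * (x - b0)))


def sse(params, xs, ps):
    """Sum of squared errors between the sigmoid and observed proportions."""
    b0, b1 = params
    return sum((logistic(x, b0, b1) - p) ** 2 for x, p in zip(xs, ps))


def fit_logistic(xs, ps, n_rounds=6, n_grid=21):
    """Least-squares fit by repeated grid refinement (stand-in for an NLS solver)."""
    b0_lo, b0_hi = float(min(xs)), float(max(xs))
    b1_lo, b1_hi = -10.0, 10.0  # assumed plausible slope range
    best = None
    for _ in range(n_rounds):
        b0_step = (b0_hi - b0_lo) / (n_grid - 1)
        b1_step = (b1_hi - b1_lo) / (n_grid - 1)
        candidates = [(b0_lo + i * b0_step, b1_lo + j * b1_step)
                      for i in range(n_grid) for j in range(n_grid)]
        best = min(candidates, key=lambda c: sse(c, xs, ps))
        # shrink the search box to +/- 2 grid steps around the current best
        b0_lo, b0_hi = best[0] - 2 * b0_step, best[0] + 2 * b0_step
        b1_lo, b1_hi = best[1] - 2 * b1_step, best[1] + 2 * b1_step
    return best


def median_rt(rts_ms, low=250, high=2500):
    """Median RT after discarding trials outside the 250-2500 msec window."""
    kept = [rt for rt in rts_ms if low <= rt <= high]
    return statistics.median(kept)


# Synthetic identification curves for the two contexts (not real data)
steps = [1, 2, 3, 4, 5, 6, 7, 8]
p_gift = [logistic(x, 5.0, 1.5) for x in steps]  # boundary placed at step 5
p_kiss = [logistic(x, 4.0, 1.5) for x in steps]  # boundary placed at step 4
b0_gift, b1_gift = fit_logistic(steps, p_gift)
b0_kiss, b1_kiss = fit_logistic(steps, p_kiss)
ganong_shift = b0_gift - b0_kiss  # boundary shift between contexts
```

The difference in fitted boundary location between contexts (`ganong_shift`) is the quantity that serves as the behavioral Ganong effect in this sketch.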
EEG Data Analysis
ERP Sensor Responses
From channel-level waveforms, we measured lexical bias effects in the speech ERPs by comparing scalp topographies at the ambiguous midpoint token (Tk4) evoked in the two different speech contexts (i.e., GIFT4 vs. KISS4). This token step is where lexical bias effects were most prominent behaviorally (see Figure 2). Topographic t tests were conducted in EEGLAB (Delorme & Makeig, 2004).
To estimate the underlying sources contributing to the lexical effect, we used Classical Low Resolution Electromagnetic Tomography Analysis Recursively Applied (CLARA; BESA (v7); Iordanov, Hoechstetter, Berg, Paul-Jordanov, & Scherg, 2014) to estimate the neuronal current density underlying the scalp ERPs (e.g., Bidelman, 2018; Alain, Arsenault, Garami, Bidelman, & Snyder, 2017). CLARA models the inverse solution as a large collection of elementary dipoles distributed over nodes on a mesh of the cortical volume. The algorithm estimates the total variance of the scalp data and applies a smoothness constraint to ensure current changes minimally between adjacent brain regions (Michel et al., 2004; Picton et al., 1999). CLARA renders more focal source images by iteratively reducing the source space during repeated estimations. On each iteration (× 2), a spatially smoothed LORETA solution (Pascual-Marqui, Esslen, Kochi, & Lehmann, 2002) was recomputed and voxels below a 1% max amplitude threshold were removed. This provided a spatial weighting term for each voxel on the subsequent step. Two iterations were used with a voxel size of 7 mm in Talairach space and regularization (parameter accounting for noise) set at 0.01% singular value decomposition. Source activations were visualized on BESA's adult brain template (Richards, Sanchez, Phillips-Meek, & Xie, 2016).
To quantify the time course of source activations, we seeded discrete dipoles within the activation centroids identified in the CLARA volume images at a latency of 286 msec, where scalp data showed maximal lexical effects (see Figure 4A). CLARA localized activity to five major foci including MTG, inferior parietal lobe (IPL), and middle frontal gyrus (MFG) in left hemisphere, and precentral gyrus (PrCG) and insular cortex (IC) of right hemisphere (see Figure 4D). Dipole time courses represent the estimated current within each regional source. We then used this 5-dipole model to create a virtual source montage to transform each participant's scalp potentials (sensor-level recordings) into source space (Scherg, Berg, Nakasato, & Beniczky, 2019; Scherg, Ille, Bornfleth, & Berg, 2002). This digital remontaging applied a spatial filter to all electrodes (defined by the foci of our dipole configuration) to transform the electrode recordings to a reduced set of source signals reflecting the neuronal current (in units nAm) as seen within each anatomical ROI (Bidelman, 2018; Bidelman, Davis, & Pridgen, 2018). Critically, we fit individual dipole orientations to each participant's own data (anatomical locations remained fixed) to maximize the explained variance of the model at the individual subject level. The model provided a good fit to the grand averaged scalp data (goodness of fit, entire epoch window = 75%), confirming the ERPs could be described by a restricted number of sources.
From the source waveform time courses, we measured peak amplitudes within the 200- to 300-msec time window, where lexical effects were prominent in raw EEG data (see Figure 4A, B). We then regressed source amplitudes (for each ROI) with listeners' behavioral Ganong effect, computed as the magnitude of shift in their perceptual boundary between speech contexts (i.e., data in Figure 2C). This allowed us to assess the behavioral relevance of each brain ROI and how context-dependent changes in neural activity (i.e., “neural Ganong” effect) relate to lexical biases in CP measured behaviorally.
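A minimal sketch of the peak measurement is given below. It assumes the epoch begins at −200 msec with a 500-Hz sampling rate (as in the preprocessing description) and takes the "peak" to be the largest-magnitude sample in the 200- to 300-msec window; whether signed or absolute maxima were used in the actual analysis is not stated, so this choice and the function name are assumptions.

```python
def peak_in_window(waveform, fs=500, tmin=-0.2, win=(0.2, 0.3)):
    """Return (amplitude, latency_s) of the largest-magnitude sample in win.

    waveform : list of floats, one source time course starting at tmin (sec)
    """
    start = int(round((win[0] - tmin) * fs))
    stop = int(round((win[1] - tmin) * fs)) + 1  # include the window's last sample
    segment = waveform[start:stop]
    idx = max(range(len(segment)), key=lambda i: abs(segment[i]))
    latency_s = (start + idx) / fs + tmin
    return segment[idx], latency_s


# Synthetic source waveform: a large early deflection outside the window
# and a smaller one at 250 msec inside it.
source = [0.0] * 500
source[50] = 9.0    # -100 msec, should be ignored
source[225] = 3.0   # 250 msec, inside the analysis window
amp, lat = peak_in_window(source)
```

Restricting the search to the analysis window keeps larger deflections at other latencies from being mistaken for the lexical effect.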
We analyzed the data using mixed-model ANOVAs in R (R Core Team, 2018; lme4 package) with fixed effects of token (eight levels) and speech context (two levels). Participants served as a random effect. Multiple comparisons were corrected using Tukey–Kramer adjustments. Brain–behavior relations were assessed using robust regression (bisquare weighting) performed using the fitlm function in MATLAB 2020a (The MathWorks, Inc.). Effect sizes are reported for omnibus ANOVAs using Cohen's d (Cohen, 1988), for paired t tests using the formula described in Dunlap, Cortina, Vaslow, and Burke (1996), and as Pearson's r for correlations.
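To illustrate what bisquare weighting does in the brain–behavior regressions, the sketch below implements a simplified iteratively reweighted least-squares fit of a straight line. It is not a reimplementation of MATLAB's fitlm: leverage adjustments are omitted, and the tuning constant (4.685) and the MAD/0.6745 scale estimate are the conventional choices for bisquare weights, assumed here. The data are synthetic.

```python
import statistics


def weighted_line(xs, ys, ws):
    """Weighted least-squares line; returns (intercept, slope)."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def bisquare_fit(xs, ys, tune=4.685, n_iter=50):
    """Robust line fit via IRLS with Tukey bisquare weights (simplified)."""
    ws = [1.0] * len(xs)  # first pass is ordinary least squares
    intercept, slope = weighted_line(xs, ys, ws)
    for _ in range(n_iter):
        resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
        scale = statistics.median(abs(r) for r in resid) / 0.6745
        if scale == 0:
            break  # perfect fit; nothing left to reweight
        cutoff = tune * scale
        # points beyond the cutoff get zero weight; others are smoothly downweighted
        ws = [(1 - (r / cutoff) ** 2) ** 2 if abs(r) < cutoff else 0.0
              for r in resid]
        intercept, slope = weighted_line(xs, ys, ws)
    return intercept, slope


# Synthetic data: y = 2 + 0.5x plus small noise, with one gross outlier
xs = list(range(10))
noise = [0.1, -0.2, 0.15, -0.1, 0.05, -0.15, 0.2, -0.05, 0.1, -0.1]
ys = [2 + 0.5 * x + e for x, e in zip(xs, noise)]
ys[9] += 15.0  # outlier at a high-leverage position
ols_intercept, ols_slope = weighted_line(xs, ys, [1.0] * len(xs))
rob_intercept, rob_slope = bisquare_fit(xs, ys)
```

In this example the ordinary least-squares slope is pulled strongly toward the outlier, whereas the bisquare fit recovers a slope close to the generating value, which is the motivation for using robust regression with a small sample.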
Behavioral identification functions are shown for the two speech contexts in Figure 2A. Listeners more frequently reported /g/ responses in the GIFT–kift continuum and more /k/ responses for the giss–KISS context, confirming that perception for otherwise identical stop consonants is biased toward hearing words. The perceptual boundary location depended strongly on context, t(15) = 4.82, p < .0001; d = 0.961 (Figure 2C and 2E). Consistent with prior studies (Noe & Fischer-Baum, 2020; Myers & Blumstein, 2008; Ganong, 1980), context-dependent effects in CP were most evident near the ambiguous midpoint of the continuum (Tk 4), where listeners' identification abruptly shifted phoneme categories, t(15) = 6.00, p < .0001; d = 2.19 (Figure 2D). Ganong shifts also varied across individuals (e.g., Lam, Xie, Tessmer, & Chandrasekaran, 2017), with some listeners showing strong susceptibility to lexical bias and others showing little to no change in perception with speech context (Figure 3).
Speech labeling speeds were modulated by context, F(1, 225) = 5.15, p = .024; d = 0.270, and token, F(7, 225) = 2.14, p = .0408; d = 0.370 (Figure 2B). Identification was faster overall when categorizing tokens in the giss–KISS context (p = .024). The main effect of token was attributable to a slowing of RTs near the midpoint of each continuum (i.e., mean vs. mean contrast: t(225) = 3.14, p = .0019). Such an inverted V shape in labeling speeds, although not prominent in these data, has been attributed to more ambiguity in decision nearer the perceptual boundary (Bidelman & Walker, 2017; Pisoni & Tash, 1974). Collectively, these behavioral results suggest that lexical information (words) biases listeners' categorization of otherwise identical phonetic features; even basic phoneme perception is shaped by the surrounding lexical context of the speech signal.
Scalp ERPs are shown at electrode Cz in Figure 4. To quantify the “neural Ganong” effect, we contrasted ERPs to tokens at the perceptual boundary (i.e., Tk 4; e.g., Myers & Blumstein, 2008), where lexical bias was strongest behaviorally (see Figure 2). Difference waves computed between midpoint tokens evoked during giss–KISS versus GIFT–kift continua revealed context-dependent modulations in the time window between 200–300 msec, t(14) = 3.03, p = .009; d = 1.15 (Figure 4A and 4B).3 That is, despite identical acoustic information, phonemes were processed differentially depending on the word context they carried. The topography of the neural Ganong was broadly distributed over the scalp, spanning frontal, temporal, and parietal electrodes (Figure 4C).
ERP differences between ki(ss) and gi(ft) could be because of lexical biasing of the initial phoneme or the fact that boundary tokens carry different word endings. That is, for stimuli near the category boundary, one token is a real word whereas the other is equivocal in lexical status. To rule out this possibility, after Myers and Blumstein (2008), we compared continuum endpoints that were unequivocally perceived as real words (endpoints perceived as “gift” and “kiss”) with continuum endpoints that were unequivocally perceived as nonwords (endpoints perceived as “giss” and “kift”). These control analyses revealed no significant channel clusters suggesting Ganong differences were not because of the “word status” of the stimuli, per se (Myers & Blumstein, 2008). Similarly, ERP amplitudes for word versus nonword difference waves did not differ from 0 in the same time window that showed Ganong lexical biasing in the experimental conditions, t(14) = 1.36, p = .20.
Source analysis of the ERPs exposed neural activations coding lexical bias in CP within five major foci among the auditory-linguistic-motor loop (e.g., Rauschecker & Scott, 2009; Hickok & Poeppel, 2007), including MTG, IPL (proximal to supramarginal gyrus [SMG]), and MFG in left hemisphere, and PrCG and IC in right hemisphere (Figure 4D). For each participant, we extracted the time course of source activity from dipoles seeded at the centroids of these ROIs. We then measured the peak activation within each ROI (200- to 300-msec analysis window; see Figure 4B), reflecting the magnitude of the “neural Ganong,” and regressed it against listeners' behavioral Ganong (i.e., magnitude of perceptual boundary shift; Figure 2C). These brain–behavior correlations revealed strong associations between left MTG and left IPL activity and behavioral bias. The negative association suggests that a larger (more positive) change in the ERP was associated with smaller magnitude shifts in identification functions. These findings suggest that context-dependent modulations within a restricted temporo-parietal circuit were most conducive to listeners' susceptibility to lexical influences.
By measuring neuroelectric brain activity during rapid speech categorization tasks, our data reveal strong lexical bias in phonetic processing; perception for otherwise identical speech phonemes is attracted toward the direction of words, shifting listeners' categorical boundary dependent on surrounding speech context. We show a neural analog of lexical biasing emerging within ∼200 msec from brain activity localized to a distributed, bilateral temporoparietal network including MTG and IPL. Our findings confirm that when perceiving speech, lexical status rapidly constrains sublexical representations to their category membership within several hundred milliseconds, establishing a direct linguistic influence on early speech processing.
Decoding speech and lexical biasing could be realized via phonetic “feature detectors” (Eimas & Corbit, 1973) that occupy and are differentially sensitive to various segments of the acoustic-linguistic space. Indeed, Ganong-like displacements in perception we observe could occur if linguistic status moves the category boundary toward the most likely lexical candidate. Similarly, nonlinear dynamical models of perception posit that lexical items more strongly activate perceptual “attractor states,” which pull auditory percepts toward word items (Tuller, Case, Ding, & Kelso, 1994). Under this interpretation, the brain might differentially warp the perceptual space such that even the early acoustic–phonetic analysis of speech is continually anchored to a lexical representation (Liberman, Isenberg, & Rakerd, 1981; Remez, Rubin, Pisoni, & Carrell, 1981).
Considerable debate persists as to whether lexical effects in spoken word recognition result from feedback or feedforward processes (Gow et al., 2008; Myers & Blumstein, 2008; Samuel & Pitt, 2003; Norris et al., 2000; Pitt, 1995). Ganong shifts could occur if lexical knowledge exerts top–down influences to directly affect perceptual states. Under these frameworks, lexical-based modulation of auditory-sensory brain areas (i.e., STG; Myers & Blumstein, 2008; van Linden et al., 2007) could result from top–down input from higher levels associated with word forms (e.g., SMG, MTG). Alternatively, a purely feedforward architecture (Norris et al., 2000) posits that lexical and phonetic outputs combine and interact at later postperceptual stages of processing that are intrinsic to overt perceptual tasks (for illustration of these diametric models, see Figure 1 of Gow et al., 2008; Figure 7: Myers & Blumstein, 2008). In attempts to resolve these conflicting models, Gow et al. (2008) used functional connectivity analyses applied to magnetoencephalography (MEG) data and showed that causal neural signaling directed from left SMG to “lower-level” areas (e.g., STG) modulates sensory representations for speech within a latency of 280–480 msec (Gow et al., 2008). The top–down nature of their effects strongly favored a feedback, perceptual account of the Ganong whereby lexical representations influence the earlier encoding of sublexical speech features (e.g., Noe & Fischer-Baum, 2020; Myers & Blumstein, 2008; van Linden et al., 2007).
Our EEG findings closely agree with MEG data by demonstrating a neural analog of Ganong biasing that unfolds early in the chronometry of speech perception. We observed lexical modulation of speech ERPs beginning ∼200 msec after sound onset and no later than 300 msec. The early time window of these effects aligns roughly with the P2 wave of the auditory ERPs, a component that is highly sensitive to perceptual object formation, category structure (Bidelman et al., 2013, 2020; Bidelman & Walker, 2017; Liebenthal et al., 2010), and context effects in speech identification (Bidelman & Lee, 2015).4 Two recent EEG studies using Ganong (Noe & Fischer-Baum, 2020) and cross-modal priming (Getz & Toscano, 2019) paradigms suggest even earlier lexical effects in the time-frame of the N1 (75–175 msec; Noe & Fischer-Baum, 2020; Getz & Toscano, 2019). Noe and Fischer-Baum (2020), for example, concluded the early nature of their lexical response at N1 is unlikely to be modulated by top–down effects. The source of discrepancies between studies as to the time course of lexical effects is unclear but might be attributable to methodological differences.5 Categorical effects at N1 have been equivocal in the literature (cf. Noe & Fischer-Baum, 2020; Getz & Toscano, 2019; Bidelman et al., 2013; Toscano, McMurray, Dennhardt, & Luck, 2010; Sharma & Dorman, 1999). Moreover, previous studies have not adjudicated the underlying sources that contribute to apparent scalp N1 effects. This is important as the N1 wave is composed of sources beyond the supratemporal plane including frontal lobes and IPL (Picton et al., 1999; Woods, 1995; Knight, Hillyard, Woods, & Neville, 1980), areas highly sensitive to lexical influences. Although our data support notions for an early time course of lexical effects (Noe & Fischer-Baum, 2020; Toscano, Anderson, Fabiani, Gratton, & Garnsey, 2018), they also suggest more parallel/iterative influences on perception.
Our data are more consistent with previous source-level MEG findings that demonstrate Ganong-related modulations around 220 msec (Gow et al., 2008). Our source analysis uncovered a Ganong neural circuit spanning five nodes including MTG, IPL, and MFG in the left hemisphere and PrCG, IC in the right hemisphere. The engagement of frontal brain areas (MFG, IC) is consistent with the notion that lexical effects partly evoke postperceptual, executive processes (Norris et al., 2000). The involvement of IC is perhaps also expected in light of prior imaging work; bilateral inferior frontal activation is particularly evident for speech contrasts that are acoustically ambiguous (Feng, Gan, Wan, Wong, & Chandrasekaran, 2018; Bidelman & Dexter, 2015; Guediche, Salvata, & Blumstein, 2013) and under conditions of increased lexical uncertainty (Bidelman & Walker, 2019; Luthra, Guediche, Blumstein, & Myers, 2019) that place higher demands on attention (Bouton et al., 2018). Indeed, resolving phoneme ambiguity (as in the Ganong) may be one of the first processes to come on-line before the decoding of specific lexical features (Gwilliams, Linzen, Poeppel, & Marantz, 2018). This may account for the early time course of our neural effects.
Notable among the Ganong circuit were nodes in left SMG and MTG. Critically, these regions were the only two areas associated with behavior illustrating their important role in the lexical effect. MTG forms a major component of the ventral speech-language pathway that performs sound-to-meaning inference and acts as a lexical interface linking phonological and semantic information (Hickok & Poeppel, 2004, 2007). MTG is also associated with accessing word meaning (Acheson & Hagoort, 2013), a likely operation in our Ganong task when ambiguous phonemes are perceptually (re)interpreted as words. Relatedly, left IPL and adjacent SMG are strongly recruited during auditory phoneme sound categorization (Luthra, Correia, Kleinschmidt, Mesite, & Myers, in press; Desai, Liebenthal, Waldron, & Binder, 2008; Gow et al., 2008), suggesting their role in phonological coding (Sliwinska, Khadilkar, Campbell-Ratcliffe, Quevenco, & Devlin, 2012). Parietal engagement is especially prominent when speech items are more perceptually confusable (Feng et al., 2018) or require added lexical readout as in Ganong paradigms (Oberfeld & Klöckner-Nowotny, 2016) and may serve as the sensory-motor interface for speech (Hickok, Okada, & Serences, 2009; Hickok & Poeppel, 2000).6 Moreover, using machine learning to decode full brain EEG, we have recently shown that left SMG and related outputs from parietal cortex are among the most salient brain areas that code for category decisions (Al-Fahad, Yeasin, & Bidelman, 2020; Mahmud et al., 2020). Similar results were obtained in a multivariate pattern decoding analysis of Luthra et al. (in press), who showed left parietal (SMG) and right temporal (MTG) regions were among the most informative for describing moment-to-moment variability in categorization. 
In addition, the link between MTG and PrCG implied in our data points to a pathway between the neural substrates that map sounds to meaning and sensorimotor regions that execute motor commands (Al-Fahad et al., 2020; Du, Buchsbaum, Grady, & Alain, 2014). Still, the early time course of these neural effects (∼250 msec) occurs well before listeners' behavioral RTs (cf. Figure 2B vs. Figure 4), suggesting these mechanisms operate at an early (pre)perceptual level. These findings lead us to infer that rapid (200–300 msec) context-dependent modulations within a restricted temporo-parietal circuit are most conducive to describing the degree to which listeners are susceptible to lexical influences during speech labeling.
Notably absent from our Ganong circuit (identified via difference waves) were canonical auditory-linguistic brain regions (e.g., STG). Although somewhat unexpected, these data agree with previous fMRI results using a nearly identical Giss–Kiss continuum (Myers & Blumstein, 2008). Indeed, Myers and Blumstein (2008) reported that, for stimulus comparisons at the boundary of a Giss–Kiss gradient (Tk4, as used here), there were strong IPL differences but no Ganong-related differences in several brain areas previously shown to be sensitive to phonetic category structure including STG and inferior frontal gyrus; STG activation was, however, observed for the boundary condition in a Gift–Kift continuum, suggesting the extent of cortex sensitive to lexical effects depends on the direction and where along the continuum the effect is quantified.7 STG activity is greater when stimuli are maximally shifted from their VOT-matched counterparts (Myers & Blumstein, 2008). Although we observe a measurable Ganong effect, it is possible that stronger STG differentiation would have been observed in our EEG data with more salient lexical biasing stimuli (e.g., vowel sounds which are inherently more category ambiguous; Ganong, 1980). Still, the fact that correlations between neural and behavioral Ganong occurred in areas beyond canonical auditory-sensory cortex (e.g., STG) suggests that high-order, top–down mechanisms drive or at least dominate lexical biasing (Gow et al., 2008) rather than auditory temporal cortex, per se. Although they do so rapidly, the engagement of a temporo-parietal circuit outside canonical auditory areas (and negative brain–behavior correlations) further implies our lexical effects might be related to decision, attention, or executive control processes. Indeed, IPL is heavily involved in choice decision making, especially during uncertainty (Vickery & Jiang, 2009). This could explain the strong involvement of this region when classifying ambiguous speech in our task.
While we cannot rule out such explanations, the early latency of neural effects (200–300 msec), which occur several hundred milliseconds before listeners' RT decisions, perhaps argues against a straightforward response-selection account of the data. Alternatively, rather than a binary feedforward or feedback model of the lexical effect (Gow et al., 2008), it is possible the formation of speech categories operates in near parallel within lower-order (sensory) and higher-order (cognitive-control) brain structures (Mahmud et al., 2020; Toscano et al., 2018). Our data are broadly consistent with such notions. Category representations also need not be isomorphic across the brain. Category formation might reflect a cascade of events where speech units are reinforced and further discretized by a recontact of acoustic–phonetic with lexical representations (Mahmud et al., 2020; Myers & Blumstein, 2008).
Our data are best cast in terms of interactive rather than serial frameworks of speech perception, as in the TRACE model of spoken word recognition (McClelland & Elman, 1986). As confirmed empirically (Noe & Fischer-Baum, 2020; Lam et al., 2017; Gow et al., 2008; Myers & Blumstein, 2008; Ganong, 1980), these models predict stronger lexical biasing when speech sounds carry ambiguity. Indeed, neural correlates of the Ganong effect were most evident at the midpoint of our speech continua, where word influences exert their strongest effect. The very nature of TRACE is that activation traverses from one level to the next before computations at any one stage are complete (McClelland & Elman, 1986). Indeed, available evidence coupled with present results suggests that word recognition could involve simultaneous activation of both continuous acoustic cues and phonological categories (Toscano et al., 2018). It is also possible that the acoustic–phonetic conversion and postperceptual phonetic decision both localize to the same brain areas (Gow et al., 2008, p. 621). Nevertheless, our data show that the acoustic–phonetic encoding of speech is rapidly subject to linguistic influences within several hundred milliseconds. While the early time course implies a stage of perceptual processing, we find that lexical effects are strongest outside the purview of canonical auditory-linguistic brain areas, within a restricted temporo-parietal circuit.
We thank Dr. Emily Myers for sharing stimulus materials.
Reprint requests should be sent to Gavin M. Bidelman, School of Communication Sciences & Disorders, University of Memphis, 4055 North Park Loop, Memphis, TN, 38152, or via e-mail: firstname.lastname@example.org.
This work was supported by the National Institute on Deafness and Other Communication Disorders of the National Institutes of Health under award number R01DC016267 (G. M. B.).
Diversity in Citation Practices
A retrospective analysis of the citations in every article published in this journal from 2010 to 2020 has revealed a persistent pattern of gender imbalance: Although the proportions of authorship teams (categorized by estimated gender identification of first author/last author) publishing in the Journal of Cognitive Neuroscience (JoCN) during this period were M(an)/M = .408, W(oman)/M = .335, M/W = .108, and W/W = .149, the comparable proportions for the articles that these authorship teams cited were M/M = .579, W/M = .243, M/W = .102, and W/W = .076 (Fulvio et al., JoCN, 33:1, pp. 3–7). Consequently, JoCN encourages all authors to consider gender balance explicitly when selecting which articles to cite and gives them the opportunity to report their article's gender citation balance.
EEG was not recorded from one participant due to a technical error, resulting in a final sample size of n = 15 for the neural data (behavioral data were unaffected).
Our task blocks stimuli by continuum. One concern is that blocking might set up an expectation such that, upon hearing the initial stop consonant, listeners already have a category structure in mind, biasing their response toward the word end of the continuum. Thus, listeners might “preload” their response, waiting to hear the completion of the word before interpreting the onset consonant as a “g” vs. “k.” Our use of token randomization within each continuum helps prevent such expectancies. Lexical effects are also still observable when stimuli are fully randomized within and across contexts (Ganong, 1980). Moreover, the early time course of our neural Ganong effects (see Figure 4) suggests the brain is already making predictions on lexical status prior to the completion of word endings. Similarly, if listeners know they are in a “gift-kift” block, for example, they may shift their phonetic category boundary more globally such that processing the end of the word (–ift) is no longer necessary. However, one piece of evidence that such global biasing did not occur is that RT speeds were similar for word versus nonword Tk1/Tk8 endpoint tokens (see Figure 2B). Global biasing would be expected to improve decision speeds for tokens heard in a word context.
The sign of the difference waveform crucially depends on the order of subtraction (much like an MMN), rendering the direction of the wave somewhat arbitrary. Consequently, we favor an interpretation impartial to direction that implies that because the change/difference in response magnitude varies across continua, it is differential neural activity that codes the lexical effect.
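The dependence of sign on subtraction order can be illustrated with a minimal sketch. The amplitude values and the function name `difference_wave` are hypothetical and chosen only for illustration; they are not data or code from this study.

```python
import numpy as np

def difference_wave(erp_a, erp_b):
    """Pointwise difference between two ERP waveforms (a minus b)."""
    return np.asarray(erp_a, dtype=float) - np.asarray(erp_b, dtype=float)

# Hypothetical ERP amplitudes (microvolts) at four time samples for the
# same token heard in two lexical contexts. Illustrative values only.
erp_context_a = np.array([0.0, 1.2, 2.5, 1.0])
erp_context_b = np.array([0.0, 0.8, 1.5, 1.1])

dw_ab = difference_wave(erp_context_a, erp_context_b)
dw_ba = difference_wave(erp_context_b, erp_context_a)

# Reversing the subtraction flips the sign at every sample ...
sign_flips = np.allclose(dw_ab, -dw_ba)
# ... but leaves the magnitude of the difference unchanged.
same_magnitude = np.allclose(np.abs(dw_ab), np.abs(dw_ba))
```

Under these assumptions, any measure based on the magnitude of the difference is unaffected by which context is subtracted from which, which is why a direction-impartial reading is preferred here.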
Whether the 200- to 300-msec modulation functionally reflects a late P2 or early P3 response is unknown. P3 would be more expected in oddball-type paradigms (not used here) but is observable in speech identification tasks, although later in time (> 300–400 msec; Bidelman & Alain, 2015; Toscano et al., 2010). A similar “post-P2” wave (180–320 msec) has been reported during speech categorization (Bidelman & Alain, 2015; Bidelman et al., 2013), which varied with perceptual rather than acoustic classification. This response could represent an integration or reconciliation of the input with a phonetic memory template (Bidelman, Bush, & Boudreaux, 2020; Bidelman & Alain, 2015) and/or attentional reorienting during stimulus evaluation (Knight, Scabini, Woods, & Clayworth, 1989). In support of this functional interpretation, the modulation is observed when classifying speech under higher levels of uncertainty, for example, when identifying speech in noise (Bidelman et al., 2020).
Both N1 studies (Noe & Fischer-Baum, 2020; Getz & Toscano, 2019) used average mastoid reference recordings, which can inflate and bias neural effects to frontal electrodes (Yao et al., 2005) where their ERPs were quantified. Here, we used average reference data (and source imaging), which provides a less biased and unmixed view of neural activity. Another notable difference in Noe and Fischer-Baum (2020) is their use of single trials (n = 38,491 observations) in the statistical analysis to detect lexical effects at N1. Although independence assumptions of using such large quantities of correlated trial-wise EEG might be debatable, such analyses might be more sensitive to detecting earlier lexical effects than the subject-wise approach used here.
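The two referencing schemes contrasted above can be sketched generically as follows. This is an illustration under stated assumptions, not the processing pipeline of this or the cited studies: the array layout (channels by samples), the function names, the toy values, and the choice of rows 2 and 3 as "mastoid" channels are all hypothetical.

```python
import numpy as np

def rereference_average(data):
    """Common average reference: subtract the mean over all channels
    at each time sample. data has shape (n_channels, n_samples)."""
    data = np.asarray(data, dtype=float)
    return data - data.mean(axis=0, keepdims=True)

def rereference_to_channels(data, ref_idx):
    """Re-reference to the mean of selected channels (e.g., two mastoids)."""
    data = np.asarray(data, dtype=float)
    return data - data[list(ref_idx)].mean(axis=0, keepdims=True)

# Toy recording: 4 channels x 3 samples (illustrative values only).
eeg = np.array([
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [0.0, 1.0, 2.0],  # hypothetical left mastoid
    [2.0, 1.0, 0.0],  # hypothetical right mastoid
])

avg_ref = rereference_average(eeg)
mastoid_ref = rereference_to_channels(eeg, ref_idx=(2, 3))
```

After average referencing, the channels sum to zero at every sample; after mastoid referencing, the mean of the two reference channels is zero instead, so the same underlying activity yields different scalp amplitudes under the two schemes.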
The basis of the negative correlation between “neural” and “behavioral” Ganong is not entirely clear; positive associations are more easily hypothesized. Speculatively, the negative relation could be related to a lexical ambiguity interpretation. Thus, the negative correlation we find between IPL/MTG activity and behavioral Ganong shifts (Figure 5) might occur if larger degrees of ambiguity between speech sounds (evoking larger ERP difference waves) reduce lexical certainty. This would tend to reduce the magnitude of the perceptual lexical effect as seen behaviorally.
Myers and Blumstein (2008, p. 283) reported strong clusters of lexically sensitive cortex in canonical auditory areas including STG for boundary stimulus comparisons in their Gift–Kift continuum. In that study, the boundary condition was defined based on the perceptual location within each continuum (i.e., Tk5 for Gift–Kift; Tk4 for Giss–Kiss). Here, we compared activation patterns solely at the physically identical Tk4 stimulus (where perceptual lexical effects were maximal; Figure 2).
These authors contributed equally to this work.