Evidence for a Spoken Word Lexicon in the Auditory Ventral Stream

Abstract The existence of a neural representation for whole words (i.e., a lexicon) is a common feature of many models of speech processing. Prior studies have provided evidence for a visual lexicon containing representations of whole written words in an area of the ventral visual stream known as the visual word form area. Similar experimental support for an auditory lexicon containing representations of spoken words has yet to be shown. Using functional magnetic resonance imaging rapid adaptation techniques, we provide evidence for an auditory lexicon in the auditory word form area in the human left anterior superior temporal gyrus that contains representations highly selective for individual spoken words. Furthermore, we show that familiarization with novel auditory words sharpens the selectivity of their representations in the auditory word form area. These findings reveal strong parallels in how the brain represents written and spoken words, showing convergent processing strategies across modalities in the visual and auditory ventral streams.


INTRODUCTION
Speech perception is perhaps the most remarkable achievement of the human auditory system and one that likely is critically dependent on its overall cortical architecture. It is generally accepted that the functional architecture of auditory cortex in human and nonhuman primates comprises two processing streams (Hickok & Poeppel, 2007;Rauschecker & Scott, 2009;Rauschecker & Tian, 2000). There is an auditory dorsal stream that is involved in the processing of auditory space and motion (van der Heijden et al., 2019) as well as in sensorimotor transformations such as those required for speech production (Archakov et al., 2020;Hickok et al., 2011;Rauschecker, 2011Rauschecker, , 2018. There is also an auditory ventral stream specialized for recognizing auditory objects such as spoken words. This stream is organized along a simpleto-complex feature hierarchy (Rauschecker & Scott, 2009), akin to the organization of the visual ventral stream (Kravitz et al., 2013).
Visual object recognition studies support a simple-to-complex model of cortical visual processing in which neuronal populations in the visual ventral stream are selective for increasingly complex features and ultimately visual objects along a posterior-to-anterior gradient extending from lower-to-higher-order visual areas (Felleman & Van Essen, 1991;Hubel & Wiesel, 1977). For the special case of recognizing written words, this simple-to-complex model predicts that progressively more anterior neuronal populations are selective for a n o p e n a c c e s s j o u r n a l increasingly complex orthographic patterns (Dehaene et al., 2005;Vinckier et al., 2007). Thus, analogous to general visual processing, orthographic word representations are predicted to culminate in representations of whole visual words-an orthographic lexicon. Evidence suggests that these lexical representations are subsequently linked to concept representations in downstream areas like the anterior temporal lobe (Damera et al., 2020;Lambon Ralph et al., 2017;Liuzzi et al., 2019;Malone et al., 2016). The existence of this orthographic lexicon in the brain is predicted by neuropsychological studies of reading (Coltheart, 2004). Indeed, functional magnetic resonance imaging (fMRI; Glezer et al., 2009Glezer et al., , 2015 and, more recently, electrocorticographic data (Hirshorn et al., 2016;Woolnough et al., 2021) have confirmed the existence of such a lexicon in a region of the posterior fusiform cortex known as the visual word form area ( VWFA; Dehaene & Cohen, 2011;Dehaene et al., 2005).
It has been proposed that an analogous simple-to-complex hierarchy exists in the auditory ventral stream as well (Kell et al., 2018;Rauschecker, 1998) extending anteriorly from Heschl's gyrus along the superior temporal cortex (STC; DeWitt & Rauschecker, 2012;Rauschecker & Scott, 2009). Yet, the existence and location of a presumed auditory lexicon, that is, a neural representation for the recognition (and storage) of real words, has been quite controversial (Bogen & Bogen, 1976): The traditional posterior view is that the auditory lexicon should be found in posterior STC (pSTC;Geschwind, 1970). In contrast, a notable metaanalysis (DeWitt & Rauschecker, 2012) provided strong evidence for the existence of wordselective auditory representations in anterior STC (aSTC), consistent with imaging studies of speech intelligibility (Binder et al., 2000;Scott et al., 2000) and proposals for an auditory word form area (AWFA) in the human left anterior temporal cortex (Cohen et al., 2004;DeWitt & Rauschecker, 2012). Such a role of the aSTC is compatible with nonhuman primate studies that show selectivity for complex communication calls in aSTC (Ortiz-Rios et al., 2015;Rauschecker et al., 1995;Tian et al., 2001) and demonstrate, in humans and nonhuman primates, that progressively anterior neuron populations in the STC pool over longer timescales (Hamilton et al., 2018;Hullett et al., 2016;Jasmin et al., 2019;Kajikawa et al., 2015). In this anterior account of lexical processing, the pSTC and speech-responsive regions in the IPL are posited to be involved in "inner speech" and phonological reading (covert articulation), but not auditory comprehension (DeWitt & Rauschecker, 2013;Rauschecker, 2011). Yet, despite this compelling alternative to traditional theories, there is still little direct evidence for an auditory lexicon in the aSTC.
Investigating the existence and location of auditory lexica is critical for understanding the neural bases of speech processing and, consequently, the neural underpinnings of speech processing disorders. However, finely probing the selectivity of neural representations in the human brain with fMRI is challenging, in part because it is difficult to assess the selectivity of these populations. Many studies have identified speech processing areas by contrasting speech stimuli with various nonspeech controls (Evans et al., 2014;Okada et al., 2010;Scott et al., 2000). However, these coarse contrasts cannot reveal what neurons in a particular auditory word-responsive region of interest (ROI) are selective for, for example, phonemes, syllables, or whole words. More sensitive techniques such as fMRI rapid adaptation (fMRI-RA; Grill-Spector & Malach, 2001;Krekelberg et al., 2006) are needed to probe the selectivity of speech representations in the brain and resolve the question of the existence of auditory lexica. In the current study, we used fMRI-RA to test the existence of lexical representations in the auditory ventral stream. Paralleling previous work in the visual system that used fMRI-RA to provide evidence for the existence of an orthographic lexicon in the VWFA (Glezer et al., 2009(Glezer et al., , 2015, we first performed an independent auditory localizer scan that we used to identify the AWFA in individual subjects, and then conducted three fMRI-RA scans that probed the representation in the AWFA and its plasticity. The first two scans consisted of real words (RWs) and pseudowords (PWs; i.e., pronounceable nonwords), respectively. These scans revealed an adaptation profile consistent with lexical selectivity in the putative AWFA for RWs, but not novel PWs, directly replicating results for written words in the VWFA. We then tested the lexicon hypothesis by predicting that training subjects to recognize novel PWs would add them to their auditory lexica, leading them to exhibit lexical-like selectivity in the AWFA following training, as previously shown for written words in the VWFA (Glezer et al., 2015). To do so, we conducted a third fMRI-RA scan after PW training. Results from this scan showed RW-like lexical selectivity to the now-familiar PWs following training, supporting the role of the AWFA as an auditory lexicon shaped by experience with auditory words.

Overall Procedure
In this study, participants completed an auditory localizer scan and three RA experiments over the course of three scanning sessions. In the first session, subjects completed the auditory localizer and RW scans. The auditory localizer scan was used to identify a candidate AWFA region in the anterior auditory ventral stream (Rauschecker & Scott, 2009), adopting the approach used in the visual/orthographic case for defining the VWFA (Glezer et al., 2009(Glezer et al., , 2015. The RW RA scan was used to test if the candidate AWFA exhibited lexical selectivity for RWs. In the remaining two scanning sessions subjects completed pre-and post-training PW scans separated by six behavioral training sessions outside of the scanner. During the pretraining scan, subjects were presented with the then untrained pseudowords (UTPW). Next, subjects completed six behavioral training sessions consisting of a 2-back and a novel/familiar task designed to familiarize and assess subject familiarity with the UTPW, respectively. Subjects completed a maximum of one behavioral session each day and those participants that achieved at least 80% accuracy on the novel/familiar task by the final session qualified for the post-training scan. During the post-training scan, subjects were presented with the same set of PWs, now called trained pseudowords (TPW). The pre-and post-training scans were used to test whether the candidate AWFA exhibited lexical selectivity for TPW (i.e., after training) but not to the UTPW (i.e., before training).

Participants
We recruited a total of 28 right-handed healthy native English speakers for this study (ages 18-34, 12 females). Georgetown University's Institutional Review Board approved all experimental procedures, and written informed consent was obtained from all subjects before the experiment. Two subjects were excluded from further analyses for performing 2 standard deviations below the average on the in-scanner task. Furthermore, two subjects dropped out of the study after completing the RW RA scan. In total, we analyzed 26 subjects for the RW RA scan and 24 of those 26 for the pre-training UTPW scan. Due to subject dropout (n = 4) or failure to achieve 80% accuracy on the novel/familiar task (n = 4), 16 out of the 24 subjects were analyzed for the post-training TPW scan.

Stimuli
Real word stimuli, for the RW RA experiment, were chosen using the English Lexicon Project (Balota et al., 2007). Analogous to studies of the neural representation of written words (Glezer et al., 2009(Glezer et al., , 2015, three sets of 50 high-frequency (>50 per million), monosyllabic RWs that were 3-5 phonemes in length were created. One set of words (target words) served as the reference for the other two lists. The second set was created by altering each target word by a single phoneme to create another RW. The third set was created by selecting for each target word another RW that had the same number of (but no shared) phonemes. All three of these lists were matched on the number of phonemes, orthographic neighborhood, and phonological neighborhood. To create the UTPW/TPW we used MCWord (Medler & Binder, 2005) to generate four sets of 50 target PWs, 3-5 phonemes in length. One set to be trained and three sets to remain untrained and serve as foils in the training task. As with the RW stimuli, we then used the target PW as the reference to generate a set of PWs each differing by one phoneme from the target word, and another set of PWs matched to the target PWs by number of phonemes but not sharing any phonemes. RW and PW sets were matched for length, bigram and trigram frequency, and phonological neighborhood. All stimuli were recorded using a 44.1-kHz sampling rate in a sound-isolated booth by a trained female native speaker of American English.

Auditory Localizer Scan
The auditory localizer scan ( Figure 1A) was used to independently identify auditory speechselective areas. In this scan, subjects were randomly presented with trials from one of five conditions: Real Words, Pseudowords, Scrambled Real Words, Scrambled Pseudowords, and The slow clustered acquisition paradigm used in the auditory localizer scan. Each trial was 9 s long with 1.5 s of volume acquisition and 7.5 s of silence. During the silent period, the subject heard four sounds from one of five stimulus classes and performed a 1-back task. (B) The rapid clustered acquisition paradigm used for the RA scans (Chevillet et al., 2013;Jiang et al., 2018). Each trial was 3.36 s long with 1.68 s of volume acquisition. During the silent period, two spoken words were played to the subject with a 50 ms interstimulus interval. The first word acted as a prime and the second word the target. The experimental paradigms for (C) real words and (D) pseudowords. The prime was followed by a target word that was either the same word (SAME), a word that differed from the target by one phoneme (1PH), or a word that shared no phonemes with the target (DIFF). Furthermore, subjects were presented with silence trials that served as an explicit baseline. During the task, subjects were asked to attend to all the words and respond when they heard the oddball stimulus (RW or PW containing the rhyme "-ox," e.g., "socks") in either the prime or target position.
Silence. Real Words and Pseudowords were one syllable long; the lists were matched for length and orthographic and phonologic neighborhood. Scrambled Real Words and Scrambled Pseudowords were generated by randomly rearranging 200 ms by 1-octave tiles of the constant Q spectrogram (Brown, 1991) for each RW or PW, respectively, and then reconstructing a timedomain waveform with an inverse transform (Ortiz-Rios et al., 2015). Each trial in the auditory localizer scan was 9 s long and began with 2.5 s of silence (during which 1.68 s of scanner acquisition occurred), followed by a ∼3-4 s stimulus presentation period, and concluding with silence. During the stimulus presentation period, subjects heard four stimuli from a given condition and responded with a left-handed button press if any of the four stimuli was a repeat within that block. In total, there were 145 trials per run with 25 trials of each of the five conditions and five 1-back trials for each of the four non-Silence conditions. An additional 18 s of fixation were added to the start and end of each run. Subjects completed five runs of the task.

Rapid Adaptation Scans
There were three RA scans ( Figure 1B-D; i.e., RW, UTPW, TPW) performed on different days. In the RA scans, subjects heard a pair of words (prime/target) on each trial ( Figure 1C-D). The words in each pair were either identical (SAME), differed by a single phoneme (1PH), or shared no phonemes at all (DIFF). These pairs were generated by using the three matched word lists described above. To engage subjects' attention, we asked subjects to perform an oddball rhyme detection task in the scanner. To do so, we created an additional condition (Oddball) in which a word or pseudoword containing the oddball rhyme "-ox" (e.g., socks, grox) was presented in lieu of either the prime or target word. Participants were asked to attentively listen to all stimuli and respond with a left-handed button press when an oddball stimulus was heard. In all three scans, the number of repetitions of each word was counterbalanced across all conditions to control for long-lag priming effects (Henson et al., 2000). Trial order and timing were adjusted using M-sequences (Buračas & Boynton, 2002). Each trial was 3.36 s long and consisted of a 1.68 s silent period followed by 1.68 s stimulus presentation period during which the word pairs were presented. Following prior auditory RA studies (Chevillet et al., 2013;Jiang et al., 2018), we presented stimuli with a 50 ms ISI. In total, there were 25 trials of each condition (SAME, 1PH, DIFF, oddball, and silence) for a total of 125 trials per run. An additional 10.08 s of fixation was added to the start and end of each run. Subjects completed four runs for each scan.

fMRI Data Preprocessing
Image preprocessing was performed using SPM12 (Ashburner et al., 2021; https://www.fil.ion .ucl.ac.uk/spm/software/spm12/). The first three volumes of each run were discarded to allow for T1 stabilization, and the remaining EPI images were spatially realigned to the mean blood oxygen level-dependent (BOLD) reference image. No slice-time correction was done given the presence of temporal discontinuities between successive volumes in clustered acquisition paradigms (Perrachione & Ghosh, 2013). EPI images for each subject were co-registered to the anatomical image. The anatomical image was then segmented and the resulting deformation fields for spatial normalization were used to normalize the functional data to the standard MNI (Montreal Neurological Institute) space. Next, we smoothed the normalized functional images with a 4-mm full-width at half maximum Gaussian kernel. Finally, a first-level model containing regressors for each condition and the six motion parameters from realignment was fit. In the auditory localizer scan, the regressors were Real Words, Pseudowords, Scrambled Real Words, Scrambled Pseudowords, Button Press, and Silence. In the RA scans, the regressors were SAME, 1PH, DIFF, Button Press, and Silence.

Defining the Auditory Word Form Area
The AWFA was determined at the individual-subject level using the auditory localizer scan. Analogous to studies of the visual word form area (Glezer et al., 2009(Glezer et al., , 2015, for each subject we first defined an RW vs. Silence contrast and an RW vs. Scrambled Words contrast. Next, at the group level, we masked the RW vs. Silence t statistic map with the RW vs. Scrambled Words t statistic map thresholded at p < 0.05 before applying a p < 0.001 voxel-wise threshold. The resulting map had a peak at MNI: −62, −14, 2. These coordinates are consistent with the hypothesized locus (MNI: −61, −15, −5) of the AWFA from prior studies (DeWitt & Rauschecker, 2012). We then created the same RW vs. Silence masked by RW vs. Scrambled Words maps at the individual-subject level using the same thresholds as above. Then, for each subject we identified the local peak in the resulting maps closest to the group peak (MNI: −62, −14, 2). Finally, to create each individual subject's AWFA, we created an ROI consisting of the 50 closest voxels to each individual's local peak.

Behavioral Training
Subjects were trained to recognize 150 auditory PWs (TPW, see above). A 2-back training task, in which subjects had to detect repeats of a PW separated by another PW, was used to familiarize subjects with the PWs. Each session of the 2-back task consisted of 15 blocks of 75 trials each with self-paced breaks between each block. Each trial lasted 1.5 s during which subjects heard a single PW and had to respond if they heard a 2-back repeat (i.e., if the current PW was the same as the PW before the last). Each block lasted 112.5 s, and each session lasted for a total task length of 28.125 min excluding breaks. Following the 2-back task, subjects' familiarity with the trained PWs was assessed using a novel/familiar task. For this task, we developed three sets of foils for each of the 150 PWs. Each foil differed from its base PW by a single phoneme. Each session of the novel/familiar task consisted of three blocks of 100 trials with self-paced breaks between each block. Each trial lasted 1.5 s during which subjects heard a single PW and had to respond with either the left or right arrow key to indicate a novel (i.e., a foil) or familiar (i.e., trained) PW. In total, each block lasted 150 s for a total task length of 7.5 min excluding breaks. Over the course of the novel/familiar task sessions each foil list was paired with the trained PW list only twice with at least two days since the last pairing. Six sessions of both tasks were performed over the course of about eight days. To proceed to the post-training RA scan, subjects had to achieve at least 80% accuracy on the novel/familiar task by their sixth session.

Task-Based Functional Connectivity
After preprocessing, we used the CONN toolbox Version 21.a (Whitfield-Gabrieli & Nieto-Castanon, 2012) to calculate seed-to-voxel stimulus-driven functional connectivity from the AWFA during auditory word processing in the auditory localizer scans. Word processing was included as the primary task condition by combining the onset times and durations for RW and PW into one condition. The individual subject (n = 26) AWFA seeds were used for the analysis. We then performed denoising with confounds, which included the subject-specific motion regressors computed during preprocessing in SPM, cerebrospinal fluid and white-matter signals, and their derivatives, following the CompCor strategy (Behzadi et al., 2007) as implemented in CONN. Data were then band-pass filtered (0.008-0.09 Hz) with linear detrending. Seed-to-voxel functional connectivity was performed using a weighted general linear model computing the bivariate correlation between the AWFA seed with the whole brain. Due to the sparse acquisition type of this analysis, no hemodynamic response function weighting was performed. Group-level significance was determined with a voxel threshold of p < 0.001 and a cluster threshold of p < 0.05, false discoverey rate (FDR) corrected.

Auditory Localizer Scan Identifies Bilateral Speech-Selective ROI in the Auditory Ventral Stream
An independent auditory localizer scan was used to identify a putative AWFA near previous literature coordinates (Figure 2A). To do so, analogous to prior studies of written word representations in the VWFA (Glezer et al., 2009(Glezer et al., , 2015, we first identified group-level auditory speech-selective areas (see Materials and Methods) by examining the (Real Words) vs. Silence contrast thresholded at p < 0.001 masked by the RW vs. Scrambled Real Words contrast thresholded at p < 0.05 at the group-level cluster-corrected at the FDR p < 0.05. This approach mimicked definition of the VWFA in prior studies (Glezer et al., 2009(Glezer et al., , 2015 and gave us the greatest flexibility for defining ROIs while ensuring responsiveness to sound and intelligible speech. This revealed several clusters of activation in the superior temporal, frontal,  Figure 2D) were created by building a 50-voxel ROI (400 mm 3 ) around the local peak of each subject (mean ± SD: −62 ± 2, −14.9 ± 3, 2.6 ± 2.6) closest to the group peak (see Materials and Methods). This is similar to the size of the VWFA identified in prior studies (Glezer et al., 2009(Glezer et al., , 2015.

Lexical Selectivity for Real Words but Not Pseudowords in the AWFA
The first two fMRI-RA scans were performed with RWs and UTPWs, respectively. In these experiments, we predicted the lowest signal for the SAME condition since the two identical words presented in that condition would repeatedly activate the same neural populations, thereby causing maximal adaptation. Likewise, we predicted the least amount of adaptation for the DIFF condition because two words that share no phonemes should activate disjoint groups of neurons, irrespective of whether responsiveness to auditory words in the localizer scan in that ROI was due to neurons selective for phonemes, syllables, or whole words. Finally, we tested specific predictions regarding responses in the 1PH condition. Specifically, if neurons in the AWFA ROI are tightly tuned to whole RWs (i.e., if the AWFA contains an auditory lexicon), the two similar but nonidentical RWs in the 1PH condition should have minimal neural overlap, and therefore no adaptation should occur and response levels in the 1PH condition should be comparable to that of the DIFF condition (as found for written real words in the VWFA; Glezer et al., 2009). In contrast, if neurons in the AWFA were tuned to sublexical phoneme combinations, then there should be a gradual release from adaptation from SAME to 1PH to DIFF, with 1PH < DIFF, as sublexical overlap would continue to increase from 1PH to DIFF. For PWs, we predicted that there would be a gradual increase in the BOLD signal paralleling the increasing dissimilarity (i.e., SAME to 1PH to DIFF). This is thought to reflect low-level activation of RW-tuned neurons to phonologically similar PWs. These predictions mirror findings in the VWFA for written words (Glezer et al., 2009(Glezer et al., , 2015 and are compatible with an experience-driven increase in selectivity of neurons in the AWFA to real words because of extensive auditory experience with and the need to discriminate among real words but not pseudowords. To test our hypotheses, we ran a 2-way repeated measures analysis of variance (ANOVA) on AWFA responses to investigate the relationship between lexicality (RW and UTPW) and word similarity (SAME, 1PH, and DIFF). The ANOVA was run on subjects that successfully completed both the RW and UTPW scans (n = 24). This revealed a significant main effect of similarity (F 2,46 = 42.597; p = 3.4E−11) but no significant main effect of lexicality (F 1,23 = 2.547; p = 0.124). Critically, however, the analysis revealed a significant interaction between lexicality and word similarity (F 2,46 = 4.092; p = 0.023). Planned paired t tests revealed an adaptation profile in the AWFA that was consistent with tight neural tuning for individual RWs ( Figure 3A): There was a significant difference in the mean percent signal between the DIFF vs. SAME conditions (t(23) = 7.86; p < 0.001) and the 1PH vs. SAME (t(23) = 5.71; p < 0.001). However, there was no significant difference between the DIFF and 1PH conditions (t(23) = 1.95; p = 0.189). Similar results were obtained using all 26 subjects who completed the RW scan ( Figure S1 in the Supporting Information). In contrast, the adaptation profile for UTPWs was not consistent with lexical selectivity ( Figure 3B): There was a significant response difference between the DIFF vs. SAME conditions (t(23) = 7.03; p < 0.001), the 1PH vs. SAME (t(23) = 3.28; p = 0.010), and, critically, also for the DIFF vs. 1PH conditions (t(23) = 3.89; p = 0.002). Finally, we ran a whole-brain analysis to test whether lexical representations were localized to the anterior STG or more distributed in nature. Specifically, for RW the conjunction of DIFF vs. SAME and 1PH vs. SAME after excluding voxels where DIFF vs. PH was p < 0.05 produced a cluster in the left anterior STG (family-wise error corrected p < 0.05; MNI: −62, −14, 6) within 3.5 mm from the average coordinate (MNI: −62, −14.9, 2.6) of our individual ROIs. Importantly, this analysis for UTPWs produced no significant clusters. Thus, the whole-brain analysis also supports the special status of the AWFA as the location of a lexicon for spoken real words.
Planned paired t tests revealed an adaptation profile that was consistent with tight neural tuning for individual RWs. See Figure S1 in the Supporting Information available at https://doi .org/10.1162/nol_a_00108. There was a significant difference in the mean percent signal between the DIFF vs. SAME conditions (t(25) = 6.94; p < 0.001) and the 1PH vs. SAME (t(25) = 5.36; p < 0.001). However, there was no significant difference between the DIFF and 1PH conditions (t(25) = 1.70; p = 0.3039).
Adaptation Patterns to Pseudowords in the AWFA Exhibit Lexical Selectivity After but Not

Before Familiarization
Next, in the pre-and post-training scans (UTPW and TPW, respectively), we tested the hypothesis that familiarization with previously novel PWs drives the formation of lexical selectivity in the AWFA. To do so, we examined the adaptation profiles for RWs, UTPWs, and TPWs in subjects who had completed all three scans (n = 16). We ran a 2-way repeated measures ANOVA to investigate the relationship between lexicality (RW, UTPW, and TPW) and similarity (SAME, 1PH, and DIFF). This revealed a significant main effect of similarity (F 2,30 = 43.023; p = 2.95E-9) but not a significant main effect of lexicality (F 2,30 = 2.398; p = 0.116). Critically, there was again a significant interaction between lexicality and similarity (F 4,60 = 4.144; p = 0.012). Consistent with the full data set (Figure 3), planned paired t tests revealed an adaptation profile that was consistent with tight neural tuning for individual RWs (Figure 4). There was a significant difference in the mean percent signal change between the DIFF vs. SAME conditions (t(15) = 7.28; p < 0.001) and 1PH vs. SAME (t(15) = 5.85; p < 0.001). However, there was no significant difference between the DIFF and 1PH conditions (t(15) = 0.750; p = 1). In contrast, the adaptation profile for UTPW was not consistent with lexical selectivity (Figure 4): There was a significant difference between the DIFF vs. SAME conditions (t(15) = 7.84; p < 0.001) and the DIFF vs. 1PH conditions (t(15) = 4.57; p = 0.001), but not 1PH vs. SAME (t(15) = 2.16; p = 0.141). Crucially, the adaptation profile for PW after training (i.e., TPW) was consistent with lexical selectivity (Figure 4): There was a significant difference in the mean percent signal between the DIFF vs. SAME conditions (t(15) = 5.55; p < 0.001) and the 1PH vs. SAME (t(15) = 3.17; p = 0.019). However, there was no significant difference between the DIFF and 1PH conditions (t(15) = 1.71; p = 0.327).

The AWFA Connects to the Language Network
We next calculated task-based functional connectivity in the auditory localizer data set (n = 26) to examine the connections between the AWFA and the rest of the brain. This seed-tovoxel analysis ( Figure 5) showed that the AWFA is highly connected with brain regions previously shown (Wilson et al., 2004) to be involved in language processing such as the inferior frontal gyrus (local peak at MNI −46, 12, 26) and premotor cortex (local peak at MNI −50, −8, 48). Especially noteworthy is the connectivity between the AWFA and a cluster in the left posterior fusiform cortex (circled in red; MNI: −44, −46, −14) that encompases the reported location of the VWFA (MNI: −45, −54, −20; Dehaene et al., 2005;Kronbichler et al., 2004). Crucially, subjects never saw any words in print during the auditory localizer scan, thereby precluding the possibility of spurious correlations between these regions. . Auditory lexical representations emerge in the auditory word form area (AWFA) for pseudowords after familiarization training. Within-subject (n = 16) adaptation profile for auditory real words (RW), untrained pseudowords (UTPW), and trained pseudowords (TPW). RW adaptation profile shows tuning to individual RWs consistent with an auditory lexicon. UTPWs show a graded release from adaptation as a function of phonological similarity. Importantly, following familiarization training, adaptation patterns in the AWFA to the same pseudowords (now TPW) reveal tight lexical tuning, similar to RW. Horizontal black line in violin plots indicates the median. ***, **, *, and N.S. mark p < 0.001, <0.01, <0.05, and not significant (>0.1), all Bonferroni-corrected.

DISCUSSION
Cognitive models of speech comprehension (Marslen- Wilson & Welsh, 1978;McClelland & Elman, 1986;Morton, 1969) have proposed the existence of an auditory lexicon. More recent models of speech comprehension in the brain (Hickok & Poeppel, 2007;Rauschecker & Scott, 2009) have posited the existence of such an auditory lexicon in an auditory ventral what stream. Yet, significant disagreement exists between these models about the location of such an auditory lexicon, and no studies to date have directly tested the existence of an auditory lexicon for speech comprehension in the brain. Prior work (Glezer et al., 2009(Glezer et al., , 2015, has used fMRI RA techniques to establish the existence of an orthographic lexicon in a region of the visual ventral stream-the VWFA (subsequently confirmed by human neurophysiological recordings; Hirshorn et al., 2016;Lochy et al., 2018;Woolnough et al., 2021). In the present study, we leveraged these techniques to test the existence of an auditory lexicon in the auditory ventral stream, particularly in the AWFA of the anterior STG (Cohen et al., 2004;DeWitt & Rauschecker, 2012). We first defined the ROI for an individual AWFA through an independent auditory localizer scan. We then showed in RW and UTPW RA scans that, consistent with an auditory lexicon, spoken RWs engaged distinct neural populations in the AWFA, with even a single phoneme change causing full release from adaptation, whereas PWs did not exhibit this lexical adaptation profile but instead showed a graded release from adaptation as a function of phoneme overlap. This graded release from adaptation suggests that neurons in the AWFA, like the VWFA (Glezer et al., 2009), exhibit broader tuning to PWs due to experience-driven refinement of tuning of neurons to RWs but not PWs. Thus, these results directly replicated analogous findings of highly selective lexical tuning to RWs Figure 5. Functional connectivity of the auditory word form area (AWFA). Whole-brain functional connectivity of the AWFA during the auditory localizer task (n = 26). Results are thresholded at a voxel-wise p < 0.001 and cluster-level p < 0.05, family-wise error corrected. Cluster corresponding to the literature coordinates of the visual word form area is circled in red. Color bar represents t statistic.
but not PWs in the VWFA (Glezer et al., 2009). Then, in the TPW RA scan, we showed that training subjects to recognize a set of PWs led to the development of lexical selectivity for these TPWs, again replicating previous results for written words in the VWFA (Glezer et al., 2015). Finally, the AWFA was connected to an area in the left posterior fusiform cortex coincident with the reported location of the VWFA, further supporting analogous roles of the AWFA and VWFA in the processing of spoken and written words, respectively.
Our novel evidence of auditory lexical representations in the brain informs current cognitive models of speech comprehension. While all these models map acoustic input to meaning, they disagree on whether auditory lexica exist (Coltheart, 2004;Woollams, 2015). Our data not only present compelling evidence for the existence of an auditory lexicon-they also place its location in the anterior STG where it is ideally suited to interface with semantic representations located further anteriorly in the temporal lobe (Lambon Ralph et al., 2017;Ueno et al., 2011), thereby completing the mapping of speech sounds to meaning. Furthermore, this anterior STG location is consistent with prior studies demonstrating other familiar auditory objects (Griffiths & Warren, 2004), such as human voices (Belin et al., 2000;Bodin et al., 2021;Staib & Frühholz, 2023) or musical instruments (Leaver & Rauschecker, 2010). Moreover, such a simple-tocomplex progression in selectivity from simple perceptual features over lexical representations to semantic representations in the anteroventral auditory processing stream along the STC is a direct counterpart of the ventral visual stream in the inferior temporal cortex (Damera et al., 2020;Kravitz et al., 2013), revealing convergent processing strategies across speech modalities.