Human listeners are bombarded by acoustic information that the brain rapidly organizes into coherent percepts of objects and events in the environment, which aids speech and music perception. The efficiency of auditory object recognition belies the critical constraint that acoustic stimuli necessarily require time to unfold. Using magnetoencephalography, we studied the time course of the neural processes that transform dynamic acoustic information into auditory object representations. Participants listened to a diverse set of 36 tokens comprising everyday sounds from a typical human environment. Multivariate pattern analysis was used to decode the sound tokens from the magnetoencephalographic recordings. We show that sound tokens can be decoded from brain activity beginning 90 msec after stimulus onset with peak decoding performance occurring at 155 msec poststimulus onset. Decoding performance was primarily driven by differences between category representations (e.g., environmental vs. instrument sounds), although within-category decoding was better than chance. Representational similarity analysis revealed that these emerging neural representations were related to harmonic and spectrotemporal differences among the stimuli, which correspond to canonical acoustic features processed by the auditory pathway. Our findings begin to link the processing of physical sound properties with the perception of auditory objects and events in cortex.
Successful navigation of the auditory environment requires quickly recognizing the stimuli or objects one encounters. This is true for the assessment of potential threats in the auditory scene as well as for the perception of speech and music (Ogg & Slevc, 2019b; Bizley & Cohen, 2013; Griffiths & Warren, 2004). However, the recognition of auditory objects is complicated by the fact that acoustic stimuli require time to develop. Thus, a listener's auditory system must rapidly transform acoustic information that is still evolving into a mental representation of an object or event to maintain behavioral relevance (e.g., when hearing an approaching vehicle that poses a threat to the observer). Despite this complexity, listeners can make accurate judgments about the identity of a sound even when the sound's temporal development (and thus potential to relay useful acoustic detail) is limited in duration to just tens of milliseconds (Ogg, Slevc, & Idsardi, 2017) or less (Suied, Agus, Thorpe, Mesgarani, & Pressnitzer, 2014; Robinson & Patterson, 1995a, 1995b). In the current study, we asked how the auditory system uses incoming acoustic information from natural sound stimuli to help construct representations of auditory objects and events given these temporal constraints. Specifically, we investigated whether certain acoustic events or features are temporally prioritized in a listener's processing of natural sounds and how such prioritization relates to particular acoustic properties of the sounds.
One possibility is that the auditory system hierarchically processes certain kinds of behaviorally relevant information. Recordings in the ventral auditory pathway of nonhuman primates support this view. These studies find more selective responses for increasingly complex stimuli in anterior regions of the temporal lobes (Perrodin, Kayser, Logothetis, & Petkov, 2011; Rauschecker & Scott, 2009; Tian, Reser, Durham, Kustov, & Rauschecker, 2001; Rauschecker & Tian, 2000) that correspond to different processing latencies (Kikuchi, Horwitz, & Mishkin, 2010). These findings have inspired human EEG work that suggests neural responses differ as a function of the sound source, with man-made sounds and human vocal sounds exhibiting stronger responses around 70 and 170 msec, respectively (De Lucia, Clarke, & Murray, 2010; Charest et al., 2009; Murray, Camen, Gonzalez Andino, Bovet, & Clarke, 2006). However, other results suggest stronger responses to instrument sounds compared with voices around 100 msec (Rigoulot, Pell, & Armony, 2015) or vice versa later around 320 msec (Levy, Granot, & Bentin, 2001, 2003).
Previous work has examined broad classes of sounds relative to one another. However, there are many inherent (and important) acoustic differences that exist between different categories of sound (or even among sound sources within the same category) that could drive these results (Ogg & Slevc, 2019a), which have not been explicitly analyzed in previous MEG/EEG studies of natural sounds. Additionally, work on classic auditory evoked responses (using simple synthesized stimuli) such as the MMN (Caclin et al., 2006; Rosburg, 2003; Giard et al., 1995), N1 (Näätänen & Picton, 1987), and M100 response (Roberts, Ferrari, Stufflebeam, & Poeppel, 2000) indicates that the auditory system is sensitive to specific acoustic qualities (e.g., spectral and temporal envelopes, noisiness, spectral variability). The manner by which acoustic features might influence early, time-varying neural representations of natural sounds remains largely unexplored despite extensive examination via other methods such as fMRI (spectral centroid: Allen, Burton, Olman, & Oxenham, 2017; Alluri et al., 2012; Giordano, McAdams, Zatorre, Kriegeskorte, & Belin, 2012; spectral flatness: Alluri et al., 2012; Lewis, Talkington, Tallaksen, & Frum, 2012; spectral envelope: Ogg, Moraczewski, Kuchinsky, & Slevc, 2019; Warren, Jennings, & Griffiths, 2005; noisiness: Giordano et al., 2012; Lewis et al., 2009; fundamental frequency: Allen et al., 2017; Giordano et al., 2012; Patterson, Uppenkamp, Johnsrude, & Griffiths, 2002; attack time: Menon et al., 2002; loudness: Giordano et al., 2012; Langers, van Dijk, Schoenmaker, & Backes, 2007; spectrotemporal modulation: Norman-Haignere, Kanwisher, & McDermott, 2015; Schönwiesner & Zatorre, 2009) and ECoG (Hullett, Hamilton, Mesgarani, Schreiner, & Chang, 2016).
The key to understanding the neural and auditory processing involved in the formation of object and event representations is not just in differentiating the magnitude of different neural responses. Rather, the goal is to understand how information pertaining to an object or stimulus feature is represented in the brain (Hebart & Baker, 2018). Machine learning approaches have yielded such insights for vision researchers by using classifiers to “decode” (via multivariate pattern analysis; Tong & Pratte, 2012) the stimuli participants perceived based on time-varying patterns of neural activity measured using magnetoencephalography (MEG; Grootswagers, Wardle, & Carlson, 2017; Cichy, Pantazis, & Oliva, 2014; Carlson, Tovar, Alink, & Kriegeskorte, 2013; Carlson, Hogendoorn, Kanai, Mesik, & Turret, 2011; Figure 1A). Such an approach is often more sensitive than examining response magnitude or latency (Hebart & Baker, 2018; Grootswagers et al., 2017; Staeren, Renvall, De Martino, Goebel, & Formisano, 2009; Formisano, De Martino, Bonte, & Goebel, 2008; Haynes & Rees, 2006).
Better or worse classification accuracy indicates that the brain's response patterns for two stimuli are more or less distinct and thus provides a difference metric for the neural representations of categories and object exemplars. These neural classification accuracy/dissimilarity rates can be contained in a matrix that is organized by item pairs (called a representational dissimilarity matrix [RDM]; see Figure 1A). Such classification results can then be used to understand how physical properties of the stimuli relate to participants' neural responses by correlating the observed patterns of neural classification accuracy with other matrices that pertain to differences among the stimuli (e.g., RDMs of differences among the same item pairs along various acoustic feature dimensions; Figure 1B). This approach, called representational similarity analysis (Kriegeskorte & Kievit, 2013), provides a powerful tool that can reveal which feature dimensions the classifier uses to decode these neural representations as they emerge in time.
Relatively few studies have used decoding techniques to study neural responses to natural auditory stimuli over time in MEG/EEG (although see Sankaran, Thompson, Carlile, & Carlson, 2018; Khalighinejad, Cruzatto da Silva, & Mesgarani, 2017; Teng, Sommer, Pantazis, & Oliva, 2017). So far, the small body of work using these methods has focused only on specific subsets of stimuli (e.g., speech or musical pitches). This precludes a broader understanding of the acoustic dimensions that contribute to how human listeners and the auditory system distinguish auditory objects and events.
In the current study, we employed a wide array of natural sounds relevant to everyday functioning in human listeners (speech from different speakers, sounds from musical instruments, and sounds from everyday objects and materials) to probe the acoustic dimensions that influence early neural responses and the formation of auditory object and event representations. As participants listened to these stimuli, we recorded their brain activity using MEG. Our approach leverages both the acoustic qualities that naturally differ among individual auditory stimuli and the temporal nature of sound. This allows us to glean an understanding of the acoustic qualities that subserve the emerging neural representations of auditory objects as both the stimuli and corresponding neural responses develop over time.
We recruited 20 participants (nine women) who were fluent English speakers, were right-handed, and self-reported normal hearing and no neurological disorders. Participants were not selected for musical experience, but, as is typical of a university population (Schellenberg, 2006), most had some musical training (mean = 6.6 years, SD = 3.9 years). This study was conducted with the approval of the University of Maryland institutional review board, and all participants gave their informed consent to participate.
Participants heard thirty-six 300-msec natural sound tokens evenly divided among three categories (see Figure 2). Human environmental sounds were a selection of 12 everyday object sounds from a variety of media or events (e.g., fingers typing on a keyboard, toilet flushing, engine ignition) sampled from the BBC Sound Effects Library (1997), the Sound Events and Real World Events Databases (2008), and Vettel (2010). Musical instrument tokens were 12 individual notes played by six instruments (two notes each) that spanned the range of orchestral timbres: bass clarinet, cello (arco), marimba, tenor trombone, acoustic guitar, and piano (from The University of Iowa Musical Instrument Samples Database, 1997). Speech tokens were 12 excerpts of the utterances “bead,” “bod,” “heed,” and “hod,” as well as an excerpted /α/ and /i/. These were spoken by a man and a woman raised in the Mid-Atlantic United States and recorded in-house (under conditions described in Ogg & Slevc, 2019a; Ogg et al., 2017). These consonants were selected to match the variability of the other sound categories in terms of their noisiness and temporal envelope dynamics. A cat vocalization was also obtained (from the BBC Sound Effects Library) for use as a catch trial stimulus.
Notes for each musical instrument were selected to match one of the male and female speakers' utterances (specifically, in terms of their median fundamental frequency calculated using the YIN algorithm; de Cheveigné & Kawahara, 2002) to control for large fundamental frequency differences between these categories. This required the use of lower register instruments (e.g., cello, trombone) to match the male speaker's fundamental frequencies. Because the /h/ portions of the speech utterances were especially noisy, fundamental frequency matches for these stimuli were based on the last third of the token.
Stimuli were edited to begin at their onsets, defined here as 5 msec before where the absolute value of the stimulus' amplitude reached 10% of its maximum. All stimuli were then truncated to a 300-msec duration, which is sufficient to facilitate sound source identification following onset (Ogg & Slevc, 2019a; Ogg et al., 2017; Suied et al., 2014; Robinson & Patterson, 1995a, 1995b). Cosine onset and offset ramps (both 5 msec) were then applied. Stimulus playback in the MEG scanner was equalized (Ultracurve Pro DEQ2496, Behringer) to be approximately flat (±4 dB) between 40 Hz and 5.5 kHz above which the sound level diminished. Thus, we low-pass filtered our stimuli (zero-phase fourth-order Butterworth filter) at 5.5 kHz before root-mean-square-level normalization. This achieved an average level of 72.5 dB-A as measured after playback by Presentation (NeuroBehavioral Systems), equalization and then delivery via a mixer (1202-VLZ Pro, LOUD Technologies, Inc.), amplifier (SLA1, ART ProAudio), insert earphone tubing (E-A-RTONE Gold 3A Insert Earphone, 3M), and insert earphones (Etymotic Research).
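The first three editing steps described above (onset detection, truncation, and ramping) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function names, the sample-based rounding, and the raised-cosine ramp shape are assumptions, and the subsequent low-pass filtering and level normalization are not shown.

```python
import math

def find_onset(samples, sample_rate, threshold=0.10, lead_ms=5.0):
    """Index 5 ms before |amplitude| first reaches 10% of its maximum."""
    peak = max(abs(s) for s in samples)
    crossing = next(i for i, s in enumerate(samples)
                    if abs(s) >= threshold * peak)
    lead = int(round(lead_ms * sample_rate / 1000.0))
    return max(0, crossing - lead)  # never step back past the file start

def trim_and_ramp(samples, sample_rate, duration_ms=300.0, ramp_ms=5.0):
    """Cut to 300 ms from the onset, then apply 5-ms cosine on/off ramps."""
    start = find_onset(samples, sample_rate)
    n_keep = int(round(duration_ms * sample_rate / 1000.0))
    out = list(samples[start:start + n_keep])
    n_ramp = min(int(round(ramp_ms * sample_rate / 1000.0)), len(out) // 2)
    for i in range(n_ramp):
        # Raised-cosine gain rising from 0 toward 1 (assumed ramp shape).
        gain = 0.5 * (1.0 - math.cos(math.pi * i / n_ramp))
        out[i] *= gain        # onset ramp
        out[-1 - i] *= gain   # mirrored offset ramp
    return out
```

For a recording sampled at 44.1 kHz, `trim_and_ramp(samples, 44100)` would return 13,230 samples; the sampling rate here is only an example, as the source does not state one.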
Recording of Stimuli for Presentation and Analysis
To match our acoustic analyses to what participants heard in the MEG as closely as possible, we derived acoustic features from high-fidelity recordings of the stimuli from the scanner presentation system using a 2-cc coupler and 0.5-in. condenser microphone amplified by an 824S/PRM902 sound-level meter system (Larson Davis) into an Apogee Duet (Apogee Electronics). Recordings of the stimuli for behavioral screening were made using an approximately 2-cc plastic coupler, a Sony ECM-144 Lavalier microphone (Sony Corporation), and a Zoom H1 (Zoom Corporation) portable recorder.
Screening and Familiarization
Participants attended an approximately half-hour screening session no more than 11 days before scanning (mean = 5.1 days) to ensure they could accurately identify the stimuli and to make sure any acoustic changes imposed by the scanner's playback apparatus did not impair the identifiability of the tokens. Participants were first allowed to freely listen to each labeled sound file as many times as they liked. Screening then involved six blocks of trials run in PsychoPy (Version 1.83.4; Peirce, 2007). On each trial, a stimulus was repeated three times with an ISI of 700 msec (one trial per stimulus in each of six blocks for a total of 18 presentations of each stimulus during screening). In the first block, participants simply listened to the stimuli and read each sound's label on screen (no response required). In the second block, participants pushed a button to indicate the sound they heard on each trial. In the response key, each of the 12 speech and 12 human environmental sounds was assigned a unique response button (as was the cat stimulus), and each of the six instruments was assigned one button for use regardless of the note the instrument played (thus, 31 response buttons total). In this second block, participants were given feedback if they were right or corrective feedback (sound source and response button) if they were wrong. The third block (the test block) was identical to the second, but without feedback. The last three blocks of trials were the same as the first three except that these trials presented a recording of each stimulus that was made from the MEG sound playback system. Screening performance was assessed in terms of accuracy on the sixth block of trials (testing block with scanner playback recordings). This ensured that participants could accurately identify the stimuli as they would hear them in the scanner. Participants were required to score at least 89% (33 of 37 correct) to move on to the scan (mean performance = 95%, SD = 3%).
Data Acquisition and Preprocessing
MEG responses were recorded using 157 axial gradiometers in the scanner (Kanazawa Institute of Technology), as participants listened to 100 repetitions of each stimulus presented in random order at a jittered rate of 1 per second (±150 msec drawn from a uniform distribution). MEG data were filtered online with a notch filter at 60 Hz and a low-pass filter at 200 Hz and were recorded at a sampling rate of 1000 Hz. Before scanning, head digitization was performed (POLHEMUS 3 Space Fast Track system), and electrodes were attached to two preauricular sites and three prefrontal sites to spatially coregister participants' head shapes with the MEG. To ensure participants paid attention to the stimuli, their task during the scan was to press a button upon hearing the cat vocalization (presented at an average rate of one catch trial for every 12 stimuli) while they watched a self-selected film on mute. Participants were very accurate on this task (mean accuracy = 93%, SD = 7%), indicating that they remained attentive to the sounds throughout the scan.
Each participant's MEG data were first cleaned with a time-shift PCA algorithm (de Cheveigné & Simon, 2007). The data were then epoched around the presentation of each stimulus from −200 to 600 msec and down-sampled to 200 Hz (5-msec temporal resolution). The extra 100 msec on each end of the −100- to 500-msec epoch was eliminated to remove edge artifacts. Each SQUID sensor was then baseline-corrected to the average of its −100-msec baseline.
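The epoch trimming and baseline correction just described can be illustrated with a short Python sketch. The function name, the list-of-lists layout (one row per sensor), and the choice to treat samples with t < 0 as the baseline are assumptions made for illustration only.

```python
def trim_and_baseline(epoch, times_ms, keep=(-100, 500)):
    """Trim one epoch to the kept window and subtract each sensor's
    mean prestimulus value.

    epoch    : list of sensor rows, each a list of samples
    times_ms : sample times in msec, aligned with each row
    """
    keep_idx = [i for i, t in enumerate(times_ms) if keep[0] <= t <= keep[1]]
    # Baseline samples: kept samples that precede stimulus onset (assumed t < 0).
    base_idx = [i for i in keep_idx if times_ms[i] < 0]
    kept_times = [times_ms[i] for i in keep_idx]
    corrected = []
    for row in epoch:
        baseline = sum(row[i] for i in base_idx) / len(base_idx)
        corrected.append([row[i] - baseline for i in keep_idx])
    return corrected, kept_times
```

With 5-msec samples from −200 to 600 msec, the trimmed epoch holds 121 time points per sensor, 20 of which fall in the baseline under the assumption above.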
Decoding was performed to distinguish the neural responses to each pair of sound tokens at the individual subject level using CoSMoMVPA (Oosterhof, Connolly, & Haxby, 2016; see Figure 1). At each 5-msec time point throughout the epoch, a given pair of sounds was selected and a linear discriminant classifier was trained (using 90% of the data) to associate each of the two item labels with their corresponding MEG responses using the amplitudes of the SQUID sensors (all 157 sensors for that time point) as feature vectors for classification. The classifier was then tested on its ability to correctly assign item labels to the responses in the left-out trials (the remaining 10% of the data). This was repeated such that each cross-validation fold served as both a training and test case and the analysis iterated through each unique item pair (630 pairs of tokens, i.e., one half of the symmetrical dissimilarity matrix excluding the diagonal) and through each 5-msec time point in the epoch (see Figure 1A). Note that essentially the same results were obtained based on a PCA of the MEG data before decoding (thus eliminating any dependence among feature vectors in training and testing; see Grootswagers et al., 2017, for a discussion).
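The structure of this analysis (every unique item pair, 10-fold cross-validation, accuracy averaged over folds) can be sketched as follows for a single time point. This is not the CoSMoMVPA pipeline: to keep the example dependency-free, a nearest-class-mean classifier stands in for the linear discriminant classifier, and the interleaved fold assignment and all function names are assumptions.

```python
from itertools import combinations

def fold_indices(n_trials, n_folds=10):
    """Interleaved folds: fold k holds trials k, k + n_folds, ..."""
    return [list(range(k, n_trials, n_folds)) for k in range(n_folds)]

def class_mean(trials):
    """Mean sensor vector across trials."""
    return [sum(col) / len(trials) for col in zip(*trials)]

def sq_dist(a, b):
    """Squared Euclidean distance between two sensor vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def pairwise_accuracy(trials_a, trials_b, n_folds=10):
    """Cross-validated accuracy for one item pair at one time point.
    trials_a, trials_b: lists of sensor vectors, one per trial."""
    correct = total = 0
    folds = zip(fold_indices(len(trials_a), n_folds),
                fold_indices(len(trials_b), n_folds))
    for test_a, test_b in folds:
        held_a, held_b = set(test_a), set(test_b)
        # Train on the ~90% of trials outside the current fold.
        mean_a = class_mean([t for i, t in enumerate(trials_a) if i not in held_a])
        mean_b = class_mean([t for i, t in enumerate(trials_b) if i not in held_b])
        # Test on the held-out ~10%: assign each trial to the nearer class mean.
        for i in test_a:
            correct += sq_dist(trials_a[i], mean_a) < sq_dist(trials_a[i], mean_b)
            total += 1
        for i in test_b:
            correct += sq_dist(trials_b[i], mean_b) < sq_dist(trials_b[i], mean_a)
            total += 1
    return correct / total

def decode_all_pairs(data):
    """data: dict item -> list of trial vectors.
    Returns dict (item_i, item_j) -> accuracy, one entry per unique pair."""
    return {(i, j): pairwise_accuracy(data[i], data[j])
            for i, j in combinations(sorted(data), 2)}
```

For 36 items, `decode_all_pairs` yields 36 × 35 / 2 = 630 entries, matching the lower triangle of the dissimilarity matrix; running it once per time point gives the time-resolved matrices.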
The decoding analyses yielded a matrix of classification accuracies for each item pair and subject at each time point (averaged over cross-validation folds). Classification results within the RDMs were averaged for each subject to assess when (1) overall decoding accuracy across all item pairs exceeded chance (i.e., averaged across the entire RDM; Figure 2B, top), (2) decoding accuracy for items within or between different categories exceeded chance or differed from one another (i.e., averaged among quadrants on or off the diagonal in the RDM; quadrants in Figure 2C vs. Figure 2D resulting in Figure 2B, bottom), or (3) decoding accuracy for individual within- or between-category comparisons exceeded chance or differed from one another (i.e., averaged within quadrants in the RDM; Figure 2C or D). Because we always decoded individual pairs of stimuli from one another, chance performance for all classification analyses (i.e., all item- and category-level comparisons) was 50%. Statistical significance with multiple comparison corrections was assessed via nonparametric (Maris & Oostenveld, 2007) bootstrapped (10,000 iterations) sign permutation tests (vs. chance), t tests (between vs. within category), or ANOVAs (comparing among the three within-category or between-category decoding accuracies in Figure 2C and D, respectively) followed by threshold-free cluster enhancement (Smith & Nichols, 2009) as implemented by CoSMoMVPA (Oosterhof et al., 2016).
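The three levels of averaging can be made concrete with a small sketch that pools pairwise accuracies by whether the two items share a category. The dictionary-based layout and the function name are assumptions; the permutation statistics and threshold-free cluster enhancement are not reproduced here.

```python
def _mean(values):
    return sum(values) / len(values)

def category_averages(accuracies, category_of):
    """Average pairwise decoding accuracy overall, within, and between categories.

    accuracies  : dict (item_i, item_j) -> accuracy for each unique pair
    category_of : dict item -> category label
    """
    within, between = [], []
    for (i, j), acc in accuracies.items():
        if category_of[i] == category_of[j]:
            within.append(acc)    # pair falls in an on-diagonal quadrant
        else:
            between.append(acc)   # pair falls in an off-diagonal quadrant
    return {
        "overall": _mean(within + between),
        "within": _mean(within),
        "between": _mean(between),
        "n_within": len(within),
        "n_between": len(between),
    }
```

With 12 items in each of three categories, 3 × 66 = 198 of the 630 pairs are within-category and the remaining 432 are between-category, so the overall average is weighted toward between-category comparisons.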
Acoustic Feature Extraction and Analyses
To ensure the acoustic analyses were the most accurate representation of the stimuli heard in the scanner, acoustic features were derived from high-fidelity recordings of each stimulus from the scanner's playback system. From these recordings, we extracted acoustic features known to influence the neural processing of sound in humans (Ogg & Slevc, 2019a, 2019b; Allen et al., 2017; Norman-Haignere et al., 2015; Alluri et al., 2012; Giordano et al., 2012; Lewis et al., 2009, 2012; Schönwiesner & Zatorre, 2009; Langers et al., 2007; Warren et al., 2005; Menon et al., 2002; Patterson et al., 2002), via widely used algorithms (YIN: de Cheveigné & Kawahara, 2002; Timbre Toolbox: Kazazis, Esterer, Depalle, & McAdams, 2017; Peeters, Giordano, Susini, Misdariis, & McAdams, 2011; Modulation Power Spectra: Elliott & Theunissen, 2009). Temporal dynamics of the stimuli (regarding the speed of sound onset) were characterized via the log-attack time and temporal centroid of the energy envelope (Kazazis et al., 2017; Peeters et al., 2011). Spectral qualities were assessed via spectral centroid (similar to overall brightness), spectral flatness (similar to overall noisiness of the spectrum), and spectral variability (index of spectral change over time) extracted from an ERB (equivalent rectangular bandwidth) gammatone representation (i.e., a cochleagram) of the sounds (Kazazis et al., 2017; Peeters et al., 2011) in 5-msec windows/increments. We included a proxy for loudness (the “frame energy” feature from Peeters et al., 2011, which we denote as “ERB energy”) to examine the influence of local changes in dynamics in the stimuli and also to account for any differences imparted by the playback system (because perceived loudness is difficult to characterize even after standard root-mean-square normalization, especially for complex natural stimuli that vary in their temporal and frequency characteristics; Giordano et al., 2012; Moore, 2012; Langers et al., 2007). 
We also included the raw ERB cochleagram that these features were based on (Kazazis et al., 2017; Peeters et al., 2011).
Modulation power spectra were used to capture different spectral and temporal rates of change in the stimuli (Thoret, Depalle, & McAdams, 2017; Elliott & Theunissen, 2009; Gaussian windowed spectrogram, 50 dB dynamic range, 16 Hz frequency, and 10-msec temporal resolution yielding maximum modulation of 31.22 cyc/kHz and 48.39 Hz, respectively). Finally, we extracted aperiodicity and fundamental frequency estimates (de Cheveigné & Kawahara, 2002) from each stimulus in 5-msec increments (minimum of the fundamental frequency search range set to 100 Hz). Missing values were returned for the incomplete analysis windows in the very first and last increments, which were replaced with the nearest nonmissing values.
Acoustic analyses examined both global (representing the stimulus as a whole) and local (each 5-msec increment) stimulus qualities. Global stimulus features comprised any attribute that aggregated over the entire duration of the stimulus (log-attack time, temporal centroid, modulation power spectrum), as well as the median across windowed features. Local stimulus features comprised the 5-msec increments of the windowed measures: spectral (i.e., ERB representation-based) features, aperiodicity, and fundamental frequency features. Acoustic RDMs were created by taking the absolute value of the difference of each feature between stimulus pairs (single-valued features) or by calculating Euclidean distances between stimulus pairs (for features composed of multiple values: modulation power spectra and ERB cochleagram).
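The two ways of building an acoustic RDM described above can be sketched as follows. The function names and the dictionary layout (item label mapped to a feature value or vector) are illustrative assumptions; multi-valued features such as a modulation power spectrum are assumed to be flattened into a single list per stimulus.

```python
import math
from itertools import combinations

def scalar_rdm(values):
    """RDM for a single-valued feature: absolute pairwise difference.
    values: dict item -> float (e.g., log-attack time)."""
    return {(i, j): abs(values[i] - values[j])
            for i, j in combinations(sorted(values), 2)}

def vector_rdm(vectors):
    """RDM for a multi-valued feature: pairwise Euclidean distance.
    vectors: dict item -> list of floats (e.g., a flattened cochleagram)."""
    return {(i, j): math.dist(vectors[i], vectors[j])
            for i, j in combinations(sorted(vectors), 2)}
```

Because both functions iterate over the same sorted item pairs, their outputs line up entry for entry with a neural RDM keyed the same way, which is what the subsequent correlation analysis requires.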
Acoustic correlation analyses were first carried out for each subject individually via a Kendall's tau correlation between the neural RDMs for that subject at each time point and either the global acoustic RDMs (Figure 3A) or the local acoustic RDMs at each time point (Figure 3B). These individual subject correlation statistics were assessed for each feature at the group level at each time point using a Wilcoxon signed-rank test against 0. Group-level statistics for each acoustic feature were then corrected across time points at a threshold of false discovery rate (FDR)-corrected p < .001 (Benjamini & Hochberg, 1995).
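Two of the ingredients of this analysis, the rank correlation between RDMs and the Benjamini-Hochberg correction across time points, are simple enough to sketch directly. The source does not say which variant of Kendall's tau was used; tau-a is shown here as an assumption. The Wilcoxon signed-rank test that produces the group-level p values is not implemented.

```python
def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.
    x, y: equal-length sequences, e.g., the 630 entries of two RDMs."""
    n = len(x)
    score = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            score += (prod > 0) - (prod < 0)  # +1 concordant, -1 discordant, 0 tie
    return score / (n * (n - 1) / 2)

def benjamini_hochberg(p_values, alpha=0.001):
    """Flag the tests that survive Benjamini-Hochberg FDR control at alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda k: p_values[k])
    cutoff_rank = 0
    for rank, k in enumerate(order, start=1):
        # Step-up rule: remember the largest rank whose p value meets its threshold.
        if p_values[k] <= alpha * rank / m:
            cutoff_rank = rank
    keep = [False] * m
    for rank, k in enumerate(order, start=1):
        keep[k] = rank <= cutoff_rank
    return keep
```

In use, `kendall_tau_a` would be applied per subject and time point to a neural and an acoustic RDM, and `benjamini_hochberg` to the vector of group-level p values across time points for one feature.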
We first determined how accurately our sound tokens could be decoded from one another based on participants' MEG responses throughout the time series. For this, we analyzed the average pairwise decoding accuracy rates for all possible pairs of sounds (Figure 2A) for each subject throughout the time series. As seen in Figure 2B (top), these grand-averaged decoding accuracy rates exceeded chance in a cluster of time points beginning 90 msec after sound onset. This cluster remained statistically significant throughout the epoch (500 msec), notably even after sound offset at 300 msec (assessed against chance using bootstrapped sign permutation testing and threshold-free cluster enhancement across time points, all ps < .05 corrected). Peak grand-averaged decoding accuracy was 58.4%, which occurred 155 msec after sound onset. Thus, information related to the neural representations of these auditory objects and events can be reliably distinguished very early on in the listeners' MEG response.
Decoding Categories of Sound
An examination of the neural RDMs (Figure 2A) suggests that neural responses to pairs of stimuli from different categories (i.e., instrument vs. human environmental) were better decoded than pairs of stimuli from the same category. To quantify this, we averaged accuracy rates for the regions of the RDM corresponding to between- and within-category comparisons (Figure 2B, bottom). Between- and within-category decoding followed a very similar time course: both exceeded chance in clusters of time points beginning 90 msec after onset and continuing to the end of the epoch (all ps < .05 corrected). Decoding accuracy peaked for both comparisons at similar times as well (155 msec for between and 160 msec for within). However, between-category classification was more accurate (peak = 60.0%) than within-category decoding (peak = 54.8%) throughout a cluster of time points starting at 95 msec and lasting through the end of the epoch (assessed using nonparametric bootstrapped t tests and threshold-free cluster enhancement across time points comparing between- vs. within-category accuracy, all ps < .05 corrected).
We next examined the decoding accuracy for specific within- or between-category comparisons of individual sound categories (Figure 2C, D insets). Within-category decoding accuracy rates were lower overall (compare Figure 2C and Figure 2D; see also Figure 2B), but sounds within each category could nevertheless be decoded from one another better than chance among clusters of time points spanning most of the epoch (Figure 2C; cluster start times for within-speech: 90 msec; within-human environment: 95 msec; within-instrument: 90 msec, all lasting through 500 msec, all ps < .05 corrected). However, the time courses and peaks of the within-category decoding varied within a cluster of time points beginning at 115 msec, lasting until 470 msec (assessed using nonparametric bootstrapped ANOVA and threshold-free cluster enhancement over time points comparing within-category accuracy rates, all ps < .05 corrected). Peak within-instrument decoding (54.1%) occurred at 145 msec, and peak within-human environmental sound decoding (54.7%) occurred later at 215 msec. Interestingly, within-speech decoding accuracy was the highest among the within-category comparisons (56.7%) and peaked at 160 msec.
Significant clusters of time points for all combinations of between-category decoding began at 90 msec and lasted throughout the epoch (Figure 2D; all ps < .05 corrected). Accuracy rates for these comparisons also differed significantly from one another in a cluster of time points throughout the epoch beginning at 120 msec (all ps < .05 corrected). Decoding between human environmental and instrument sounds was the highest overall (peak: 62.5% at 155 msec) followed by human environmental versus speech decoding (peak: 59.3% at 165 msec) and instrument versus speech token decoding (peak: 58.9% at 150 msec).
In summary, shortly after overall decoding first exceeded chance at 90 msec and before the peak in overall decoding at 155 msec, neural representations already began to differ across sound categories. Note that neither the subjects' level of musical training nor their behavioral performance (at screening or during the scan) was correlated with subject-level averages of decoding accuracy at any time point (overall, in Figure 2B, top, or for any combination of categories in Figure 2C or D, all FDR-corrected ps > .05 across time points for each measure).
Processing of Global Acoustic Features
Speech, instrument and human environmental sounds contain different acoustic regularities that likely relate to the neural representations we decoded. To investigate this possibility, we computed Kendall's tau correlations between the pairwise neural classification accuracy rates (each subject's neural RDMs throughout the epoch) and the pairwise acoustic differences among our stimuli along a variety of dimensions (acoustic RDMs; Figure 1B, center; see Methods for details). Correlations between neural decoding rates and global acoustic differences among the stimuli (i.e., for acoustic measures that operated over the whole sound token or the median across time-varying features) are shown in Figure 3A. These features were strongly associated with decoding rates before and around the peak in decoding accuracy that was observed around 155 msec (all FDR-corrected ps < .001). Differences among stimuli in terms of their modulation power spectra exhibited the earliest significant correlations with the neural RDM (onset rτ = .03 at 100 msec) followed a few milliseconds later by overall spectral variability (onset rτ = .03 at 115 msec) and aperiodicity (onset rτ = .06 at 120 msec). Other significant correlations emerged later for temporal centroid (onset rτ = .05 at 135 msec), overall spectral envelope (assessed by an ERB cochleagram; onset rτ = .03 at 135 msec), spectral centroid (onset rτ = .05 at 145 msec), and the sum of the squared ERB amplitudes across cochlear filter channels (which we used as a proxy for loudness to account for any changes from the sound playback apparatus in the scanner, 125–135 msec, peak rτ = .04). These features mostly remained correlated with the neural RDMs and increased in their correlation strength around the decoding peak at 155 msec except for ERB energy and temporal centroid, which were nonsignificant after 135 msec. 
The strongest correlations with the neural RDMs around the decoding peak were observed for differences in aperiodicity (rτ = .18 at 165 msec) followed by modulation power spectra (rτ = .14 at 140 msec) and spectral variability (rτ = .13 at 165 msec). Interestingly, many of the features that exhibited a significant association with decoding rates earliest in the epoch are complex acoustic representations known to be associated with some of the earliest processes in auditory cortex: aperiodicity, spectral envelope, and spectrotemporal variability (Norman-Haignere et al., 2015; Theunissen & Elie, 2014; Lewis et al., 2009; Chi, Ru, & Shamma, 2005). Later in the epoch, particularly after sound offset, significant correlations pertained to spectral centroid (peak rτ = .09 at 445 msec), spectral flatness (peak rτ = .06 at 460 msec), and temporal centroid (peak rτ = .11 at 250 msec). Overall, spectral variability (peak rτ = .16 at 435 msec) and aperiodicity (peak rτ = .19 at 350 msec) exhibited the strongest correlations throughout the epoch.
We conducted an additional analysis to examine the influence of fundamental frequency. This was possible because we allowed our speech stimuli to vary naturally in their fundamental frequencies and matched the notes of the instrument tokens along this dimension. However, a global acoustic analysis of fundamental frequency among just the instrument and speech stimulus pairs (excluding the pairs with /h/ consonant tokens, because these were predominantly noise) did not reveal any significant correlation with the neural decoding results.
Processing of Local Acoustic Features
Sound stimuli necessarily unfold over time. Thus, it is possible that local changes in the acoustic signal could have consequences for participants' unfolding neural responses. To understand how local acoustic developments precipitated corresponding changes in neural representations of these sounds, we expanded our global acoustic analysis in a way that leveraged the dynamic information in our stimuli. In this local acoustic analysis (Figure 1B, right), the pairwise acoustic feature differences among the sounds at every 5-msec time point in the stimuli (local acoustic RDMs) were correlated with every subsequent 5-msec RDM throughout the MEG epoch (neural RDMs; see Methods for details).
The local acoustic analysis results are depicted in Figure 3B. The earliest significant correlations between a local acoustic RDM and the neural RDM were for spectral flatness (onset rτ = .05 at 100 msec in the neural epoch correlated with the 5-msec local acoustic RDM) and spectral variability (onset rτ = .03 at 100 msec in the neural epoch correlated with the 40-msec local acoustic RDM). Again, aperiodicity, spectral centroid/envelope, and spectral variability (across many stimulus time points) were strongly and significantly correlated with the decoding results around the decoding peak (155 msec). We also observed a small influence of differences in local ERB energy from approximately the first 100 msec of the stimuli on subsequent neural representations. These local acoustic features were more strongly associated with neural decoding accuracy rates than the global features, with aperiodicity (peak rτ = .21 at 360 msec in the neural epoch correlated with the 55-msec local acoustic RDM and rτ = .20 at 165 msec in the neural epoch correlated with the 105-msec local acoustic RDM) and spectral variability (peak rτ = .20 at 160 msec in the neural epoch correlated with the 110-msec local acoustic RDM) exhibiting some of the strongest correlations. Aperiodicity and spectral variability were also significantly associated with decoding across neural and acoustic time points. Meanwhile, spectral differences (including spectral centroid, flatness, and ERB energy) from the earlier portions of the stimuli appear to exert the most influence.
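The structure of the local acoustic analysis can be sketched as a time-by-time comparison. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names and toy three-stimulus RDMs are hypothetical, both time axes are assumed to start at stimulus onset and share a 5-msec step, "subsequent" is taken to mean the same or a later neural time point, and the tau-a variant of Kendall's tau is assumed.

```python
from itertools import combinations


def upper_triangle(rdm):
    """Flatten the off-diagonal upper triangle of a square RDM into a list."""
    n = len(rdm)
    return [rdm[i][j] for i in range(n) for j in range(i + 1, n)]


def kendall_tau_a(x, y):
    """Kendall's tau-a (assumed variant): (concordant - discordant) / pairs."""
    def sign(value):
        return (value > 0) - (value < 0)

    score = sum(
        sign(x[i] - x[j]) * sign(y[i] - y[j])
        for i, j in combinations(range(len(x)), 2)
    )
    return score / (len(x) * (len(x) - 1) // 2)


def local_acoustic_analysis(acoustic_rdms, neural_rdms, step_ms=5):
    """Correlate each local acoustic RDM with every same-or-later neural RDM.

    Returns a dict keyed by (acoustic_time_ms, neural_time_ms). Neural time
    points that precede the acoustic time point are skipped, because the
    neural response cannot reflect acoustic input that has not yet occurred.
    """
    neural_vecs = [upper_triangle(rdm) for rdm in neural_rdms]
    results = {}
    for a_idx, acoustic_rdm in enumerate(acoustic_rdms):
        acoustic_vec = upper_triangle(acoustic_rdm)
        for n_idx in range(a_idx, len(neural_vecs)):
            key = (a_idx * step_ms, n_idx * step_ms)
            results[key] = kendall_tau_a(acoustic_vec, neural_vecs[n_idx])
    return results


def make_rdm(d01, d02, d12):
    """Build a symmetric 3 x 3 RDM from its three pairwise dissimilarities."""
    return [[0, d01, d02], [d01, 0, d12], [d02, d12, 0]]


# Hypothetical local acoustic RDMs at 0 and 5 msec into the stimuli.
acoustic_rdms = [make_rdm(1, 2, 3), make_rdm(3, 2, 1)]

# Hypothetical neural RDMs (pairwise decoding accuracies) at 0, 5, and 10 msec.
neural_rdms = [
    make_rdm(0.5, 0.6, 0.7),
    make_rdm(0.7, 0.6, 0.5),
    make_rdm(0.5, 0.7, 0.6),
]

results = local_acoustic_analysis(acoustic_rdms, neural_rdms)
for (acoustic_ms, neural_ms), tau in sorted(results.items()):
    print(f"acoustic {acoustic_ms:>2} msec vs. neural {neural_ms:>2} msec: tau = {tau:+.2f}")
```

The resulting mapping corresponds to one triangular acoustic-time by neural-time panel for a single feature and subject. Restricting the comparison to same-or-later neural time points is a design choice of this sketch that keeps only the causally plausible half of the matrix.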
The questions of “when” and “how” a listener recognizes what they hear are fundamentally linked due to the temporal nature of sound. We examined these questions using multivariate pattern analyses of MEG time series data in conjunction with representational similarity analyses of acoustic differences among a set of natural sound stimuli that are behaviorally relevant to human listeners. Using this approach, we gained a better understanding of the acoustic qualities associated with the formation of auditory object and event representations as they unfold in time.
We found that neural activity measured using MEG contained information sufficient to distinguish individual sounds beginning 90 msec after sound onset, reaching a peak at 155 msec, and continuing after sound offset at 300 msec until the end of the neural epoch at 500 msec. This pattern of results was driven by especially accurate decoding of stimuli from different sound categories, although individual instruments, speech utterances, and everyday object sounds could all be decoded from one another better than chance. Acoustic analyses indicated that these emergent representations were associated with differences among the stimuli pertaining to spectrotemporal variability, aperiodicity (or harmonic content), and the sounds' spectral and temporal envelopes. The importance of these features fluctuated over time with spectral envelope and modulation spectra exerting their greatest influence soon after onset, whereas spectral centroid and temporal centroid had a greater influence later in the epoch. Spectral variability and aperiodicity were significantly associated with decoding throughout much of the epoch. These are fundamental acoustic qualities that have been shown to influence neural responses and perception, but our results go further to suggest that these cues are specifically associated with the dynamic formation of mental representations of what a listener hears early after sound onset.
Our analyses needed to manage dynamic information within our stimuli due to the physical (temporal) constraints of sound. By leveraging this information, we obtained a picture of how the brain rapidly integrates and organizes sensory information from natural sounds. However, processing dynamic stimulus information is also an important part of everyday object and event perception beyond the auditory domain. Thus, the approach we describe here could provide valuable insight into other areas of cognitive neuroscience and object perception as well. For example, this approach could be applied to dynamic visual stimuli or multisensory integration paradigms to better understand how the brain integrates and represents specific stimulus information from both modalities to achieve percepts of complex objects and events.
Relative to the decoding of visual objects from MEG, the decoding onsets and latencies that we observed were slower than some findings (Cichy et al., 2014; Isik, Meyers, Leibo, & Poggio, 2014) but comparable to others (Carlson et al., 2011, 2013). This difference might be due to the inherent temporal nature of our sound stimuli relative to the pictures used in previous work on visual objects. Unlike sounds, picture stimuli present all the information necessary for object recognition at once. Auditory stimuli, on the other hand, require time to unfold, thus incrementally relaying the information that supports object recognition. Also, although overall decoding accuracy differed for within- and between-category comparisons among our auditory stimuli, the emergence of category and individual object representations followed a roughly similar time course. In most cases, these exceeded chance around 90 msec and peaked around 155 msec. Studies of visual object decoding suggest somewhat wider variability in the latency of decoding individual objects or categories (Cichy et al., 2014; Carlson et al., 2011, 2013).
A related study (Teng et al., 2017) decoded neural representations of different impact sounds and reverberant spaces from MEG, finding earlier onsets and latencies in their sound source decoding than what we observe here. The slower latencies in our results are likely a function of the more varied stimulus onsets (amplitude envelopes) we employed in our set of sounds. This follows from the diverse set of natural sound sources and events that we examined, compared with the constrained set of impact sounds used in previous work (Teng et al., 2017). Thus, our results provide a useful expansion upon this prior work given that the heterogeneous stimulus set examined here is more representative of the diversity of sounds humans encounter in everyday life (which can involve different amplitude envelopes).
Exploring the mechanisms of sound source decoding further, our results suggest that the auditory system prioritizes cues related to a sound's spectrotemporal variability and aperiodicity, as well as its spectral and temporal envelopes (at least among this set of sounds). These features can effectively distinguish instrument and human environmental sounds (especially early after onset), which occupy opposing extremes on these dimensions (Ogg & Slevc, 2019a; Ogg et al., 2017). Speech utterances, on the other hand (an especially important sound category for humans), exploit all of these acoustic cues and their extremes from moment to moment to relay linguistic information (Stilp & Kluender, 2010; Smith & Lewicki, 2006). Speech sounds also exhibited the highest within-category decoding, indicating that individual speech sounds are more distinguishable from one another in listeners' neural responses than are the instrument and human environmental sounds. Thus, the acoustic fluctuation among speech tokens might account for this increased within-category decoding.
Although these acoustic qualities have previously been shown to influence the strength of neural responses on different timescales (Allen et al., 2017; Hullett et al., 2016; Norman-Haignere et al., 2015; Alluri et al., 2012; Giordano et al., 2012; Lewis et al., 2009, 2012; Schönwiesner & Zatorre, 2009; Langers et al., 2007; Caclin et al., 2006; Warren et al., 2005; Rosburg, 2003; Menon et al., 2002; Patterson et al., 2002; Roberts et al., 2000; Giard et al., 1995; Näätänen & Picton, 1987), our results go further by beginning to link these features to representations of auditory objects and events early in perception. Additionally, we show that these features, which have typically been examined in fMRI (Ogg et al., 2019; Allen et al., 2017; Norman-Haignere et al., 2015; Alluri et al., 2012; Giordano et al., 2012; Lewis et al., 2009, 2012; Schönwiesner & Zatorre, 2009; Langers et al., 2007; Warren et al., 2005; Menon et al., 2002; Patterson et al., 2002), vary in their influence over time, with more primary acoustic processes in cortex associated with decoding accuracy early in the response such as spectral envelope and modulation power spectra (Norman-Haignere et al., 2015; Theunissen & Elie, 2014; Chi et al., 2005). Later in the response, however, neural representations were more related to spectral and temporal centroid features (with spectral variability and aperiodicity correlating strongly throughout). Finally, the time course of these neural and acoustic processes aligns with behavioral identification and categorization results for duration-gated stimuli (e.g., Ogg et al., 2017; Suied et al., 2014). Similar acoustic qualities (from the first tens of milliseconds in the stimuli) appear to be associated with the differentiation of sounds in early MEG decoding time windows (starting around 100–200 msec) and in subsequent behavioral responses (later around 300–600 msec; Ogg et al., 2017; Suied et al., 2014; Agus, Suied, Thorpe, & Pressnitzer, 2012).
One important caveat to these findings and a potential avenue for future research is that it is not clear how (or when) these decoded representations correspond to more abstract qualities or semantic labels associated with these sounds compared with their acoustic features. Separating semantic labels from acoustic processing in the neural responses obtained in this study is difficult because these were not orthogonal in this natural and diverse stimulus set. Indeed, many of our decoding results could follow from differences in acoustic or stimulus-level processing rather than object-level representations per se. This potentially aligns with the timing of the decoding peak we observed, which also coincides with components of the auditory evoked response (particularly N1–P2, between 100 and 200 msec). Evoked responses in this time window are known to be influenced by acoustic differences among stimuli (Caclin et al., 2006; Rosburg, 2003; Roberts et al., 2000; Giard et al., 1995; Näätänen & Picton, 1987) and appear to support representations of speech (Bidelman & Walker, 2017; Chang et al., 2010; Roberts et al., 2000; Poeppel et al., 1997; Näätänen & Picton, 1987). Note, however, that this decoding peak could also be related to the overall improved signal-to-noise ratio in the neural response that follows from these prominent evoked components. Note also that decoding remained significantly above chance after these time points as well. In either case, we go beyond much of the previous work examining natural sounds in MEG/EEG by describing the acoustic features that might underpin processing differences between sound categories.
The ostensibly modest absolute decoding accuracy rates we observed (peaking at around 8.4% above chance) might raise some concerns regarding the reliability of these results. However, as Hebart and Baker (2018) point out, although high decoding rates are clearly desirable for brain–computer–interface applications, for understanding brain function (which is our present goal), above chance decoding is all that is necessary to inform how the brain processes information (especially given the resolution we have access to with current imaging technologies). Moreover, they make the important point that decoding accuracy is not equivalent to an effect size and that, again, the key is to understand where (or, in our case, “when”) information is contained or processed in the system (i.e., when accuracy exceeds chance). Analogous points can be made regarding the strength of correlations using these decoding accuracy rates (such as in our acoustic analyses) because decoding accuracy places an upper limit on the strength of the correlations that can be observed. Finally, it is worth keeping in mind that essentially the same decoding results were obtained based on a principal components analysis of the MEG data, which indicates that our results are not due to any deleterious effects of interdependence among feature vectors (MEG data) in training and testing (Grootswagers et al., 2017).
It is interesting that we did not observe an influence of fundamental frequency or musical training in these neural decoding results. However, this might be due to aspects of our design or task. For example, the noninfluence of fundamental frequency could reflect our use of isolated sound tokens (rather than longer concurrent streams) or a range restriction in the fundamental frequencies of the speaker's utterances (which spanned approximately one octave). Similarly, it is possible that fundamental frequency requires larger timescales on which to exert an influence (Walker, Bizley, King, & Schnupp, 2011; Robinson & Patterson, 1995a, 1995b), beyond the time windows examined here. And although musical training has sometimes been found to influence auditory cortical responses (Bidelman, Weiss, Moreno, & Alain, 2014), these effects might arise more robustly during demanding behavioral tasks (see also Bidelman & Walker, 2017; Alho et al., 2016). Indeed, other findings based on passive tasks with diverse, readily distinguishable stimuli have not found an influence of musical training on neural representations (Ogg et al., 2019).
More generally, attention is known to strengthen neural responses and neural representations of auditory stimuli (Ding & Simon, 2012), particularly during categorization tasks (Bidelman & Walker, 2017; Alho et al., 2016). However, the current study used a mostly passive paradigm (nontarget oddball detection task). Thus, it is likely that stronger attentional demands to specific target stimuli (or categories) could boost their decoding accuracy and could potentially modify the influence of certain target-specific acoustic features. Some findings also indicate that attention plays a particularly acute role in auditory scene analysis or streaming tasks around 200 msec (O'Sullivan et al., 2015; Snyder, Alain, & Picton, 2006). Viewed in conjunction with our results, this suggests that, after object identification occurs most acutely around 150 msec, attention might act to tune responses to a given stimulus and track a specific sound source through an auditory scene. This is thought to occur via feedback mechanisms from frontal areas (Fritz, David, Radtke-Schuller, Yin, & Shamma, 2010). Viewed alongside our significant acoustic correlations with aperiodicity, modulation spectra, and the spectral envelope, this suggests that these features could be avenues through which auditory cortex is tuned to a given sound via the modulation of attention (Fritz, Elhilali, David, & Shamma, 2007; Petkov et al., 2004). Future work examining how semantic or acoustic cues factor into attentional modulation will also help further illuminate this issue. These exciting new directions, along with our results, will help paint a fuller picture of how sounds are transformed from acoustic phenomena into the events and objects we identify and subsequently track in the environment in service of more complex behaviors and individual goals.
We are grateful for the discussions, feedback, and assistance we received in the course of this work from Natalia Lapinskaya, Christopher Neufeld, Jonathan Simon, Christian Brodbeck, Ed Smith, Tijl Grootswagers, Amanda Robinson, and Stefanie Kuchinsky. A portion of the human environmental sounds were obtained courtesy of the Sound Events Database (www.auditorylab.org/; Copyright 2008, Laurie M. Heller; funding provided by NSF award 0446955). Thanks also to Carnegie Mellon's Center for the Neural Basis of Cognition regarding the human environmental sounds and to Lawrence Fritts for managing The University of Iowa Musical Instrument Samples database.
Reprint requests should be sent to Mattson Ogg, Neuroscience and Cognitive Science Program, Department of Psychology, University of Maryland, Biology-Psychology Building, Room 3150, 4094 Campus Drive, College Park, MD 20742, or via e-mail: firstname.lastname@example.org.