Abstract

Human listeners are bombarded by acoustic information that the brain rapidly organizes into coherent percepts of objects and events in the environment, which aids speech and music perception. The efficiency of auditory object recognition belies the critical constraint that acoustic stimuli necessarily require time to unfold. Using magnetoencephalography, we studied the time course of the neural processes that transform dynamic acoustic information into auditory object representations. Participants listened to a diverse set of 36 tokens comprising everyday sounds from a typical human environment. Multivariate pattern analysis was used to decode the sound tokens from the magnetoencephalographic recordings. We show that sound tokens can be decoded from brain activity beginning 90 msec after stimulus onset with peak decoding performance occurring at 155 msec poststimulus onset. Decoding performance was primarily driven by differences between category representations (e.g., environmental vs. instrument sounds), although within-category decoding was better than chance. Representational similarity analysis revealed that these emerging neural representations were related to harmonic and spectrotemporal differences among the stimuli, which correspond to canonical acoustic features processed by the auditory pathway. Our findings begin to link the processing of physical sound properties with the perception of auditory objects and events in cortex.

INTRODUCTION

Successful navigation of the auditory environment requires quickly recognizing the stimuli or objects one encounters. This is true for the assessment of potential threats in the auditory scene as well as for the perception of speech and music (Ogg & Slevc, 2019b; Bizley & Cohen, 2013; Griffiths & Warren, 2004). However, the recognition of auditory objects is complicated by the fact that acoustic stimuli require time to develop. Thus, a listener's auditory system must rapidly transform acoustic information that is still evolving into a mental representation of an object or event to maintain behavioral relevance (e.g., when hearing an approaching vehicle that poses a threat to the observer). Despite this complexity, listeners can make accurate judgments about the identity of a sound even when the sound's temporal development (and thus potential to relay useful acoustic detail) is limited in duration to just tens of milliseconds (Ogg, Slevc, & Idsardi, 2017) or less (Suied, Agus, Thorpe, Mesgarani, & Pressnitzer, 2014; Robinson & Patterson, 1995a, 1995b). In the current study, we asked how the auditory system uses incoming acoustic information from natural sound stimuli to help construct representations of auditory objects and events given these temporal constraints. Specifically, we investigated whether certain acoustic events or features are temporally prioritized in a listener's processing of natural sounds and how such prioritization relates to particular acoustic properties of the sounds.

One possibility is that the auditory system hierarchically processes certain kinds of behaviorally relevant information. Recordings in the ventral auditory pathway of nonhuman primates support this view. These studies find more selective responses for increasingly complex stimuli in anterior regions of the temporal lobes (Perrodin, Kayser, Logothetis, & Petkov, 2011; Rauschecker & Scott, 2009; Tian, Reser, Durham, Kustov, & Rauschecker, 2001; Rauschecker & Tian, 2000) that correspond to different processing latencies (Kikuchi, Horwitz, & Mishkin, 2010). These findings have inspired human EEG work that suggests neural responses differ as a function of the sound source, with man-made sounds and human vocal sounds exhibiting stronger responses around 70 and 170 msec, respectively (De Lucia, Clarke, & Murray, 2010; Charest et al., 2009; Murray, Camen, Gonzalez Andino, Bovet, & Clarke, 2006). However, other results suggest stronger responses to instrument sounds compared with voices around 100 msec (Rigoulot, Pell, & Armony, 2015) or vice versa later around 320 msec (Levy, Granot, & Bentin, 2001, 2003).

Previous work has examined broad classes of sounds relative to one another. However, there are many inherent (and important) acoustic differences that exist between different categories of sound (or even among sound sources within the same category) that could drive these results (Ogg & Slevc, 2019a), which have not been explicitly analyzed in previous MEG/EEG studies of natural sounds. Additionally, work on classic auditory evoked responses (using simple synthesized stimuli) such as the MMN (Caclin et al., 2006; Rosburg, 2003; Giard et al., 1995), N1 (Näätänen & Picton, 1987), and M100 response (Roberts, Ferrari, Stufflebeam, & Poeppel, 2000) indicate that the auditory system is sensitive to specific acoustic qualities (e.g., spectral and temporal envelopes, noisiness, spectral variability). The manner by which acoustic features might influence early, time-varying neural representations of natural sounds remains largely unexplored despite extensive examination via other methods such as fMRI (spectral centroid: Allen, Burton, Olman, & Oxenham, 2017; Alluri et al., 2012; Giordano, McAdams, Zatorre, Kriegeskorte, & Belin, 2012; spectral flatness: Alluri et al., 2012; Lewis, Talkington, Tallaksen, & Frum, 2012; spectral envelope: Ogg, Moraczewski, Kuchinsky, & Slevc, 2019; Warren, Jennings, & Griffiths, 2005; noisiness: Giordano et al., 2012; Lewis et al., 2009; fundamental frequency: Allen et al., 2017; Giordano et al., 2012; Patterson, Uppenkamp, Johnsrude, & Griffiths, 2002; attack time: Menon et al., 2002; loudness: Giordano et al., 2012; Langers, van Dijk, Schoenmaker, & Backes, 2007; spectrotemporal modulation: Norman-Haignere, Kanwisher, & McDermott, 2015; Schönwiesner & Zatorre, 2009) and ECoG (Hullett, Hamilton, Mesgarani, Schreiner, & Chang, 2016).

The key to understanding the neural and auditory processing involved in the formation of object and event representations is not just in differentiating the magnitude of different neural responses. Rather, the goal is to understand how information pertaining to an object or stimulus feature is represented in the brain (Hebart & Baker, 2018). Machine learning approaches have yielded such insights for vision researchers by using classifiers to “decode” (via multivariate pattern analysis; Tong & Pratte, 2012) the stimuli participants perceived based on time-varying patterns of neural activity measured using magnetoencephalography (MEG; Grootswagers, Wardle, & Carlson, 2017; Cichy, Pantazis, & Oliva, 2014; Carlson, Tovar, Alink, & Kriegeskorte, 2013; Carlson, Hogendoorn, Kanai, Mesik, & Turret, 2011; Figure 1A). Such an approach is often more sensitive than examining response magnitude or latency (Hebart & Baker, 2018; Grootswagers et al., 2017; Staeren, Renvall, De Martino, Goebel, & Formisano, 2009; Formisano, De Martino, Bonte, & Goebel, 2008; Haynes & Rees, 2006).

Figure 1. Schematic of the decoding and acoustic analyses. (A) Decoding algorithm for item pairs used to construct a neural RDM. (B) Acoustic correlation analysis between the neural RDMs throughout the epoch and the global or local acoustic feature RDMs. The red box in B around the acoustic waveform (middle) indicates that the data correspond to a global attribute (aggregated across all time points in the stimulus). Red vertical bars on the MEG and acoustic waveforms indicate that the data relate to an individual 5-msec time point. Note that only one instance of each of the unique 630 stimulus pairs was decoded and the data from this half of the RDM (e.g., Item 1 vs. Item 2) was copied to the other half (e.g., Item 2 vs. Item 1) only for illustrative purposes here and in Figure 2. LDA = linear discriminant classifier.

Better or worse classification accuracy indicates that the brain's response patterns for two stimuli are more or less distinct and thus provides a difference metric for the neural representations of categories and object exemplars. These neural classification accuracy/dissimilarity rates can be contained in a matrix that is organized by item pairs (called a representational dissimilarity matrix [RDM]; see Figure 1A). Such classification results can then be used to understand how physical properties of the stimuli relate to participants' neural responses by correlating the observed patterns of neural classification accuracy with other matrices that pertain to differences among the stimuli (e.g., RDMs of differences among the same item pairs along various acoustic feature dimensions; Figure 1B). This approach, called representational similarity analysis (Kriegeskorte & Kievit, 2013), provides a powerful tool that can reveal which feature dimensions the classifier uses to decode these neural representations as they emerge in time.
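
For concreteness, the sketch below illustrates this RDM bookkeeping in Python; it is not the study's implementation, and `pairwise_accuracy` is an assumed lookup from item pairs to decoding accuracies at one time point.

```python
import numpy as np
from itertools import combinations

# Illustrative sketch only: arrange pairwise decoding accuracies into a symmetric
# item-by-item RDM. `pairwise_accuracy` is an assumed dict mapping (i, j) -> accuracy.
def build_rdm(pairwise_accuracy, n_items=36):
    rdm = np.zeros((n_items, n_items))
    for i, j in combinations(range(n_items), 2):  # 630 unique pairs for 36 items
        acc = pairwise_accuracy[(i, j)]
        rdm[i, j] = rdm[j, i] = acc               # mirrored for display; diagonal stays 0
    return rdm
```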

Relatively few studies have used decoding techniques to study neural responses to natural auditory stimuli over time in MEG/EEG (although see Sankaran, Thompson, Carlile, & Carlson, 2018; Khalighinejad, Cruzatto da Silva, & Mesgarani, 2017; Teng, Sommer, Pantazis, & Oliva, 2017). So far, the small body of work using these methods has focused only on specific subsets of stimuli (e.g., speech or musical pitches). This precludes a broader understanding of the acoustic dimensions that contribute to how human listeners and the auditory system distinguish auditory objects and events.

In the current study, we employed a wide array of natural sounds relevant to everyday functioning in human listeners (speech from different speakers, sounds from musical instruments, and sounds from everyday objects and materials) to probe the acoustic dimensions that influence early neural responses and the formation of auditory object and event representations. As participants listened to these stimuli, we recorded their brain activity using MEG. Our approach leverages both the acoustic qualities that naturally differ among individual auditory stimuli and the temporal nature of sound. This allows us to glean an understanding of the acoustic qualities that subserve the emerging neural representations of auditory objects as both the stimuli and corresponding neural responses develop over time.

METHODS

Participants

We recruited 20 participants (nine women) who were fluent English speakers, right-handed, and self-reported normal hearing and no neurological disorders. Participants were not selected for musical experience, but typical of a university population (Schellenberg, 2006), most had some musical training (mean = 6.6 years, SD = 3.9 years). This study was conducted with the approval of the University of Maryland institutional review board, and all participants gave their informed consent to participate.

Stimuli

Participants heard thirty-six 300-msec natural sound tokens evenly divided among three categories (see Figure 2). Human environmental sounds were a selection of 12 everyday object sounds from a variety of media or events (e.g., fingers typing on a keyboard, toilet flushing, engine ignition) sampled from the BBC Sound Effects Library (1997), the Sound Events and Real World Events Databases (2008), and Vettel (2010). Musical instrument tokens were 12 individual notes played by six instruments (two notes each) that spanned the range of orchestral timbres: bass clarinet, cello (arco), marimba, tenor trombone, acoustic guitar, and piano (from The University of Iowa Musical Instrument Samples Database, 1997). Speech tokens were 12 excerpts of the utterances “bead,” “bod,” “heed,” and “hod,” as well as an excerpted /α/ and /i/. These were spoken by a man and a woman raised in the Mid-Atlantic United States and recorded in-house (under conditions described in Ogg & Slevc, 2019a; Ogg et al., 2017). These consonants were selected to match the variability of the other sound categories in terms of their noisiness and temporal envelope dynamics. A cat vocalization was also obtained (from the BBC sound effects library) for use as a catch trial stimulus.

Figure 2. Decoding accuracy. (A) Group-averaged neural RDMs at 0, 150, and 300 msec. (B) Grand-averaged decoding accuracy (top; *p < .05 corrected vs. chance) and average between- versus within-category decoding accuracy (bottom; *p < .05 corrected between vs. within). (C) Within-category decoding accuracy for each sound category (*p < .05 corrected vs. chance). (D) Between-category decoding accuracy for each pair of sound categories (*p < .05 corrected vs. chance). Square insets depict the region of the RDM that was averaged for each comparison. The top of B encompasses all the insets in C and D, and the bottom of B represents the insets of C versus D. Error bands represent SEM calculated over subject-level average decoding accuracies for each comparison.

Notes for each musical instrument were selected to match one of the male and female speakers' utterances (specifically, in terms of their median fundamental frequency calculated using the YIN algorithm; de Cheveigné & Kawahara, 2002) to control for large fundamental frequency differences between these categories. This required the use of lower register instruments (e.g., cello, trombone) to match the male speaker's fundamental frequencies. Because the /h/ portions of the speech utterances were especially noisy, fundamental frequency matches for these stimuli were based on the last third of the token.
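
As a rough illustration of this matching step, the sketch below estimates each token's median fundamental frequency with librosa's YIN implementation (the study used the YIN algorithm of de Cheveigné & Kawahara, 2002, directly); the file names and search range are hypothetical.

```python
import numpy as np
import librosa

# Hypothetical file names and F0 search range; librosa's YIN stands in for the
# original YIN implementation used in the study.
def median_f0(path, fmin=60.0, fmax=600.0):
    y, sr = librosa.load(path, sr=None)
    f0 = librosa.yin(y, fmin=fmin, fmax=fmax, sr=sr)
    return float(np.median(f0))

# Choose the instrument note whose median F0 best matches a given speech token.
speech_f0 = median_f0("speech_male_bead.wav")
candidate_notes = {name: median_f0(name) for name in ("cello_C3.wav", "trombone_D3.wav")}
best_match = min(candidate_notes, key=lambda name: abs(candidate_notes[name] - speech_f0))
```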

Stimuli were edited to begin at their onsets, defined here as 5 msec before where the absolute value of the stimulus' amplitude reached 10% of its maximum. All stimuli were then truncated to a 300-msec duration, which is sufficient to facilitate sound source identification following onset (Ogg & Slevc, 2019a; Ogg et al., 2017; Suied et al., 2014; Robinson & Patterson, 1995a, 1995b). Cosine onset and offset ramps (both 5 msec) were then applied. Stimulus playback in the MEG scanner was equalized (Ultracurve Pro DEQ2496, Behringer) to be approximately flat (±4 dB) between 40 Hz and 5.5 kHz above which the sound level diminished. Thus, we low-pass filtered our stimuli (zero-phase fourth-order Butterworth filter) at 5.5 kHz before root-mean-square-level normalization. This achieved an average level of 72.5 dB-A as measured after playback by Presentation (NeuroBehavioral Systems), equalization and then delivery via a mixer (1202-VLZ Pro, LOUD Technologies, Inc.), amplifier (SLA1, ART ProAudio), insert earphone tubing (E-A-RTONE Gold 3A Insert Earphone, 3M), and insert earphones (Etymotic Research).
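
A hedged sketch of this preparation pipeline (onset trim, 300-msec truncation, 5-msec cosine ramps, 5.5-kHz low-pass, RMS normalization) is given below; the target RMS value and the exact filter-order handling are illustrative assumptions, not the authors' code.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

# Illustrative stimulus preparation; parameter values other than those reported
# in the text (10% onset criterion, 300 msec, 5-msec ramps, 5.5 kHz) are assumptions.
def prepare_token(x, sr, dur_s=0.300, ramp_s=0.005, lp_hz=5500.0, target_rms=0.05):
    onset = np.argmax(np.abs(x) >= 0.10 * np.abs(x).max())   # 10%-of-maximum criterion
    start = max(onset - int(0.005 * sr), 0)                   # back up 5 msec before onset
    y = x[start:start + int(dur_s * sr)].copy()               # truncate to 300 msec

    n_ramp = int(ramp_s * sr)                                 # cosine onset/offset ramps
    ramp = 0.5 * (1.0 - np.cos(np.pi * np.arange(n_ramp) / n_ramp))
    y[:n_ramp] *= ramp
    y[-n_ramp:] *= ramp[::-1]

    sos = butter(4, lp_hz, btype="low", fs=sr, output="sos")  # Butterworth low-pass
    y = sosfiltfilt(sos, y)                                   # applied forward-backward (zero phase)

    return y * (target_rms / np.sqrt(np.mean(y ** 2)))        # RMS-level normalization
```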

Recording of Stimuli for Presentation and Analysis

To match our acoustic analyses to what participants heard in the MEG as closely as possible, we derived acoustic features from high-fidelity recordings of the stimuli from the scanner presentation system using a 2-cc coupler and 0.5-in. condenser microphone amplified by an 824S/PRM902 sound-level meter system (Larson Davis) into an Apogee Duet (Apogee Electronics). Stimuli for the behavioral screening session were recorded using an approximately 2-cc plastic coupler, a Sony ECM-144 Lavalier microphone (Sony Corporation), and a Zoom H1 (Zoom Video Communications) portable recorder.

Procedure

Screening and Familiarization

Participants attended an approximately half-hour screening session no more than 11 days before scanning (mean = 5.1 days) to ensure that they could accurately identify the stimuli and that any acoustic changes imposed by the scanner's playback apparatus did not impair the identifiability of the tokens. Participants were first allowed to freely listen to each labeled sound file as many times as they liked. Screening then involved six blocks of trials run in PsychoPy (Version 1.83.4; Peirce, 2007). On each trial, a stimulus was repeated three times with an ISI of 700 msec (one trial per stimulus in each of the six blocks, for a total of 18 presentations of each stimulus during screening). In the first block, participants simply listened to the stimuli and read each sound's label on screen (no response required). In the second block, participants pushed a button to indicate the sound they heard on each trial. In the response key, each of the 12 speech and human environmental sounds was assigned a unique response button (as was the cat stimulus), and each of the six instruments was assigned one button for use regardless of the note the instrument played (thus, 21 response buttons total). In this second block, participants were told whether they were correct or, if not, were given corrective feedback (the sound source and its response button). The third block (the test block) was identical to the second, but without feedback. The last three blocks of trials were the same as the first three except that these trials presented a recording of each stimulus that was made from the MEG sound playback system. Screening performance was assessed in terms of accuracy on the sixth block of trials (testing block with scanner playback recordings). This ensured that participants could accurately identify the stimuli as they would hear them in the scanner. Participants were required to score at least 89% (33 of 37 correct) to move on to the scan (mean performance = 95%, SD = 3%).

Data Acquisition and Preprocessing

MEG responses were recorded using 157 axial gradiometers in the scanner (Kanazawa Institute of Technology), as participants listened to 100 repetitions of each stimulus presented in random order at a jittered rate of 1 per second (±150 msec drawn from a uniform distribution). MEG data were filtered online with a notch filter at 60 Hz and a low-pass filter at 200 Hz and were recorded at a sampling rate of 1000 Hz. Before scanning, head digitization was performed (POLHEMUS 3 Space Fast Track system), and electrodes were attached to two preauricular sites and three prefrontal sites to spatially coregister participants' head shapes with the MEG. To ensure participants paid attention to the stimuli, their task during the scan was to press a button upon hearing the cat vocalization (presented at an average rate of one catch trial for every 12 stimuli) while they watched a self-selected film on mute. Participants were very accurate on this task (mean accuracy = 93%, SD = 7%), indicating that they remained attentive to the sounds throughout the scan.

Each participant's MEG data were first cleaned with a time-shift PCA algorithm (de Cheveigné & Simon, 2007). The data were then epoched around the presentation of each stimulus from −200 to 600 msec and down-sampled to 200 Hz (5-msec temporal resolution). The extra 100 msec on each end of the −100- to 500-msec epoch was discarded to remove edge artifacts. Each SQUID sensor was then baseline-corrected by subtracting the average of its −100- to 0-msec prestimulus window.
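
The sketch below illustrates these epoching, down-sampling, and baseline-correction steps; `raw` and `triggers` are assumed inputs (a sensors-by-samples array at 1000 Hz after denoising, and stimulus-onset sample indices), and the down-sampling routine is an illustrative stand-in for the original pipeline.

```python
import numpy as np
from scipy.signal import decimate

# Minimal sketch under the assumptions described in the lead-in paragraph.
def epoch_and_baseline(raw, triggers, sr=1000, new_sr=200):
    q = sr // new_sr                                        # 1000 Hz -> 200 Hz (5-msec bins)
    epochs = []
    for t in triggers:
        ep = raw[:, t - 200:t + 600]                        # -200 to +600 msec
        ep = decimate(ep, q, axis=1, zero_phase=True)       # down-sample to 200 Hz
        ep = ep[:, 20:-20]                                  # drop 100-msec edges -> -100 to 500 msec
        ep = ep - ep[:, :20].mean(axis=1, keepdims=True)    # subtract the -100 to 0 msec baseline
        epochs.append(ep)
    return np.stack(epochs)                                 # trials x sensors x time points
```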

Decoding Analysis

Decoding was performed to distinguish the neural responses to each pair of sound tokens at the individual subject level using CoSMoMVPA (Oosterhof, Connolly, & Haxby, 2016; see Figure 1). At each 5-msec time point throughout the epoch, a given pair of sounds was selected and a linear discriminant classifier was trained (using 90% of the data) to associate each of the two item labels with their corresponding MEG responses using the amplitudes of the SQUID sensors (all 157 sensors for that time point) as feature vectors for classification. The classifier was then tested on its ability to correctly assign item labels to the responses in the left-out trials (the last 10% of the data). This procedure was repeated such that each cross-validation fold served as both a training and a test case, and the analysis iterated through each unique item pair (630 pairs of tokens, i.e., one half of the symmetrical dissimilarity matrix excluding the diagonal) and through each 5-msec time point in the epoch (see Figure 1A). Note that essentially the same results were obtained based on a PCA of the MEG data before decoding (thus eliminating any dependence among feature vectors in training and testing; see Grootswagers et al., 2017, for a discussion).
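
A hedged scikit-learn sketch of this pairwise decoding scheme is shown below; the study itself used CoSMoMVPA in MATLAB, so this is an illustrative stand-in with assumed input shapes (`epochs` as trials x sensors x time, `labels` as token indices per trial).

```python
import numpy as np
from itertools import combinations
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Illustrative stand-in for the CoSMoMVPA pipeline described in the text.
def pairwise_decoding(epochs, labels, n_items=36, n_folds=10):
    n_time = epochs.shape[2]
    rdm = np.zeros((n_items, n_items, n_time))
    cv = StratifiedKFold(n_splits=n_folds)
    for i, j in combinations(range(n_items), 2):              # 630 unique token pairs
        mask = np.isin(labels, [i, j])
        y = labels[mask]
        for t in range(n_time):
            X = epochs[mask, :, t]                            # all sensors at one time point
            acc = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=cv).mean()
            rdm[i, j, t] = rdm[j, i, t] = acc                 # 10-fold mean accuracy
    return rdm
```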

The decoding analyses yielded a matrix of classification accuracies for each item pair and subject at each time point (averaged over cross-validation folds). Classification results within the RDMs were averaged for each subject to assess when (1) overall decoding accuracy across all item pairs exceeded chance (i.e., averaged across the entire RDM; Figure 2B, top), (2) decoding accuracy for items within or between different categories exceeded chance or differed from one another (i.e., averaged among quadrants on or off the diagonal in the RDM; quadrants in Figure 2C vs. Figure 2D resulting in Figure 2B, bottom), or (3) decoding accuracy for individual within- or between-category comparisons exceeded chance or differed from one another (i.e., averaged within quadrants in the RDM; Figure 2C or D). Because we always decoded individual pairs of stimuli from one another, chance performance for all classification analyses (i.e., all item- and category-level comparisons) was 50%. Statistical significance with multiple comparison corrections was assessed via nonparametric (Maris & Oostenveld, 2007) bootstrapped (10,000 iterations) sign permutation tests (vs. chance), t tests (between vs. within category), or ANOVAs (comparing among the three within-category or between-category decoding accuracies in Figure 2C and D, respectively) followed by threshold-free cluster enhancement (Smith & Nichols, 2009) as implemented by CoSMoMVPA (Oosterhof et al., 2016).
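
The sketch below shows how the within- and between-category summaries could be read off such an RDM (the permutation and threshold-free cluster enhancement statistics are not shown); the category coding of the 36 items is an illustrative assumption.

```python
import numpy as np

# `rdm` is items x items x time; `labels` is a length-36 array of category codes
# (e.g., 0 = speech, 1 = instrument, 2 = human environmental) -- an assumed coding.
def within_between_accuracy(rdm, labels):
    iu = np.triu_indices(rdm.shape[0], k=1)                  # the 630 unique pairs
    pair_acc = rdm[iu[0], iu[1], :]                          # pairs x time
    same_category = labels[iu[0]] == labels[iu[1]]
    within = pair_acc[same_category].mean(axis=0)            # within-category time course
    between = pair_acc[~same_category].mean(axis=0)          # between-category time course
    return within, between
```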

Acoustic Feature Extraction and Analyses

To ensure the acoustic analyses were the most accurate representation of the stimuli heard in the scanner, acoustic features were derived from high-fidelity recordings of each stimulus from the scanner's playback system. From these recordings, we extracted acoustic features known to influence the neural processing of sound in humans (Ogg & Slevc, 2019a, 2019b; Allen et al., 2017; Norman-Haignere et al., 2015; Alluri et al., 2012; Giordano et al., 2012; Lewis et al., 2009, 2012; Schönwiesner & Zatorre, 2009; Langers et al., 2007; Warren et al., 2005; Menon et al., 2002; Patterson et al., 2002), via widely used algorithms (YIN: de Cheveigné & Kawahara, 2002; Timbre Toolbox: Kazazis, Esterer, Depalle, & McAdams, 2017; Peeters, Giordano, Susini, Misdariis, & McAdams, 2011; Modulation Power Spectra: Elliott & Theunissen, 2009). Temporal dynamics of the stimuli (regarding the speed of sound onset) were characterized via the log-attack time and temporal centroid of the energy envelope (Kazazis et al., 2017; Peeters et al., 2011). Spectral qualities were assessed via spectral centroid (similar to overall brightness), spectral flatness (similar to overall noisiness of the spectrum), and spectral variability (index of spectral change over time) extracted from an ERB (equivalent rectangular bandwidth) gammatone representation (i.e., a cochleagram) of the sounds (Kazazis et al., 2017; Peeters et al., 2011) in 5-msec windows/increments. We included a proxy for loudness (the “frame energy” feature from Peeters et al., 2011, which we denote as “ERB energy”) to examine the influence of local changes in dynamics in the stimuli and also to account for any differences imparted by the playback system (because perceived loudness is difficult to characterize even after standard root-mean-square normalization, especially for complex natural stimuli that vary in their temporal and frequency characteristics; Giordano et al., 2012; Moore, 2012; Langers et al., 2007). We also included the raw ERB cochleagram that these features were based on (Kazazis et al., 2017; Peeters et al., 2011).
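
As a rough stand-in for these windowed spectral measures, the sketch below computes spectral centroid, flatness, variability, and frame energy from an STFT magnitude spectrogram with 5-msec hops; the study instead used the Timbre Toolbox on an ERB gammatone (cochleagram) representation, so the exact values would differ.

```python
import numpy as np
from scipy.signal import stft

# Approximate sketch: an STFT spectrogram stands in for the ERB gammatone front end.
def windowed_spectral_features(x, sr, hop_s=0.005):
    f, t, Z = stft(x, fs=sr, nperseg=int(2 * hop_s * sr), noverlap=int(hop_s * sr))
    S = np.abs(Z) + 1e-12
    centroid = (f[:, None] * S).sum(axis=0) / S.sum(axis=0)          # brightness
    flatness = np.exp(np.mean(np.log(S), axis=0)) / S.mean(axis=0)   # noisiness
    variability = np.r_[0.0, [1.0 - np.corrcoef(S[:, k], S[:, k + 1])[0, 1]
                              for k in range(S.shape[1] - 1)]]        # frame-to-frame change
    energy = (S ** 2).sum(axis=0)                                     # loudness proxy ("frame energy")
    return centroid, flatness, variability, energy
```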

Modulation power spectra were used to capture different spectral and temporal rates of change in the stimuli (Thoret, Depalle, & McAdams, 2017; Elliott & Theunissen, 2009; Gaussian windowed spectrogram, 50 dB dynamic range, 16 Hz frequency, and 10-msec temporal resolution yielding maximum modulation of 31.22 cyc/kHz and 48.39 Hz, respectively). Finally, we extracted aperiodicity and fundamental frequency estimates (de Cheveigné & Kawahara, 2002) from each stimulus in 5-msec increments (minimum of the fundamental frequency search range set to 100 Hz). Missing values were returned for the incomplete analysis windows in the very first and last increments, which were replaced with the nearest nonmissing values.
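
A hedged sketch of a modulation power spectrum follows: the two-dimensional Fourier transform of a log spectrogram, giving power as a function of spectral (cyc/kHz) and temporal (Hz) modulation. The plain 10-msec frames used here approximate, but do not reproduce, the Gaussian-windowed settings reported above.

```python
import numpy as np
from scipy.signal import spectrogram

# Illustrative modulation power spectrum; window settings are assumptions.
def modulation_power_spectrum(x, sr):
    f, t, S = spectrogram(x, fs=sr, nperseg=int(0.010 * sr), noverlap=0)
    log_s = np.log(S + 1e-12)
    mps = np.abs(np.fft.fftshift(np.fft.fft2(log_s - log_s.mean()))) ** 2
    spectral_mod = np.fft.fftshift(np.fft.fftfreq(len(f), d=f[1] - f[0])) * 1000.0  # cyc/kHz
    temporal_mod = np.fft.fftshift(np.fft.fftfreq(len(t), d=t[1] - t[0]))           # Hz
    return mps, spectral_mod, temporal_mod
```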

Acoustic analyses examined both global (representing the stimulus as a whole) and local (each 5-msec increment) stimulus qualities. Global stimulus features comprised any attribute that aggregated over the entire duration of the stimulus (log-attack time, temporal centroid, modulation power spectrum), as well as the median across windowed features. Local stimulus features comprised the 5-msec increments of the windowed measures: spectral (i.e., ERB representation-based) features, aperiodicity, and fundamental frequency features. Acoustic RDMs were created by taking the absolute value of the difference of each feature between stimulus pairs (single-valued features) or by calculating Euclidean distances between stimulus pairs (for features composed of multiple values: modulation power spectra and ERB cochleagram).
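
The acoustic RDM construction described above can be sketched as follows (absolute differences for single-valued features, Euclidean distances for multi-valued ones); the function and argument names are illustrative.

```python
import numpy as np
from itertools import combinations
from scipy.spatial.distance import euclidean

# Sketch of acoustic RDM construction; `features` holds one scalar or 1-D array per stimulus.
def acoustic_rdm(features):
    n_items = len(features)
    rdm = np.zeros((n_items, n_items))
    for i, j in combinations(range(n_items), 2):
        fi, fj = np.atleast_1d(features[i]), np.atleast_1d(features[j])
        d = abs(fi.item() - fj.item()) if fi.size == 1 else euclidean(fi, fj)
        rdm[i, j] = rdm[j, i] = d
    return rdm
```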

Acoustic correlation analyses were first carried out for each subject individually via a Kendall's tau correlation between the neural RDMs for that subject at each time point and either the global acoustic RDMs (Figure 3A) or the local acoustic RDMs at each time point (Figure 3B). These individual subject correlation statistics were assessed for each feature at the group level at each time point using a Wilcoxon signed-rank test against 0. Group-level statistics for each acoustic feature were then corrected across time points at a threshold of false discovery rate (FDR)-corrected p < .001 (Benjamini & Hochberg, 1995).
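
The sketch below illustrates this statistical pipeline for one acoustic feature: per-subject Kendall's tau between each neural RDM time point and the acoustic RDM, a group-level Wilcoxon signed-rank test against 0, and Benjamini-Hochberg FDR correction across time points. Input shapes and the hand-rolled FDR step are assumptions for illustration.

```python
import numpy as np
from scipy.stats import kendalltau, wilcoxon

# `neural_rdms`: subjects x items x items x time; `acoustic_rdm`: items x items (assumed shapes).
def rsa_statistics(neural_rdms, acoustic_rdm, alpha=0.001):
    iu = np.triu_indices(acoustic_rdm.shape[0], k=1)
    acoustic_vec = acoustic_rdm[iu]
    n_subj, _, _, n_time = neural_rdms.shape
    taus = np.array([[kendalltau(neural_rdms[s, :, :, t][iu], acoustic_vec)[0]
                      for t in range(n_time)] for s in range(n_subj)])
    pvals = np.array([wilcoxon(taus[:, t])[1] for t in range(n_time)])

    order = np.argsort(pvals)                                # Benjamini-Hochberg step-up
    passed = pvals[order] <= alpha * np.arange(1, n_time + 1) / n_time
    n_reject = passed.nonzero()[0].max() + 1 if passed.any() else 0
    significant = np.zeros(n_time, dtype=bool)
    significant[order[:n_reject]] = True
    return taus.mean(axis=0), significant                    # group-mean tau, FDR mask
```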

Figure 3. The correlation between neural decoding accuracy (neural RDMs) and acoustic feature differences (acoustic RDMs) throughout the epoch plotted as averages of subject-level correlations. (A) Global acoustic analysis results. Mean correlation statistics and standard errors across participants are plotted for each feature throughout the neural epoch (*FDR-corrected p < .001; error bands represent SEM calculated over subject-level correlations). (B) Local acoustic analysis results. Mean correlation statistics across participants at the neural and acoustic time points that were significant at the group level (FDR-corrected p < .001) are shown.

RESULTS

We first determined how accurately our sound tokens could be decoded from one another based on participants' MEG responses throughout the time series. For this, we analyzed the average pairwise decoding accuracy rates for all possible pairs of sounds (Figure 2A) for each subject throughout the time series. As seen in Figure 2B (top), these grand-averaged decoding accuracy rates exceeded chance in a cluster of time points beginning 90 msec after sound onset. This cluster remained statistically significant throughout the epoch (500 msec), notably even after sound offset at 300 msec (assessed against chance using bootstrapped sign permutation testing and threshold-free cluster enhancement across time points, all ps < .05 corrected). Peak grand-averaged decoding accuracy was 58.4%, which occurred 155 msec after sound onset. Thus, information related to the neural representations of these auditory objects and events can be reliably distinguished very early on in the listeners' MEG response.

Decoding Categories of Sound

An examination of the neural RDMs (Figure 2A) suggests that neural responses to pairs of stimuli from different categories (i.e., instrument vs. human environmental) were better decoded than pairs of stimuli from the same category. To quantify this, we averaged accuracy rates for the regions of the RDM corresponding to between- and within-category comparisons (Figure 2B, bottom). Between- and within-category decoding followed a very similar time course: both exceeded chance in clusters of time points beginning 90 msec after onset and continuing to the end of the epoch (all ps < .05 corrected). Decoding accuracy peaked for both comparisons at similar times as well (155 msec for between and 160 msec for within). However, between-category classification was more accurate (peak = 60.0%) than within-category decoding (peak = 54.8%) throughout a cluster of time points starting at 95 msec and lasting through the end of the epoch (assessed using nonparametric bootstrapped t tests and threshold-free cluster enhancement across time points comparing between- vs. within-category accuracy, all ps < .05 corrected).

We next examined the decoding accuracy for specific within- or between-category comparisons of individual sound categories (Figure 2C, D insets). Within-category decoding accuracy rates were lower overall (compare Figure 2C and Figure 2D; see also Figure 2B), but sounds within each category could nevertheless be decoded from one another better than chance among clusters of time points spanning most of the epoch (Figure 2C; cluster start times for within-speech: 90 msec; within-human environment: 95 msec; within-instrument: 90 msec, all lasting through 500 msec, all ps < .05 corrected). However, the time courses and peaks of the within-category decoding varied within a cluster of time points beginning at 115 msec, lasting until 470 msec (assessed using nonparametric bootstrapped ANOVA and threshold-free cluster enhancement over time points comparing within-category accuracy rates, all ps < .05 corrected). Peak within-instrument decoding (54.1%) occurred at 145 msec, and peak within-human environmental sound decoding (54.7%) occurred later at 215 msec. Interestingly, within-speech decoding accuracy was the highest among the within-category comparisons (56.7%) and peaked at 160 msec.

Significant clusters of time points for all combinations of between-category decoding began at 90 msec and lasted throughout the epoch (Figure 2D; all ps < .05 corrected). Accuracy rates for these comparisons also differed significantly from one another in a cluster of time points throughout the epoch beginning at 120 msec (all ps < .05 corrected). Decoding between human environmental and instrument sounds was the highest overall (peak: 62.5% at 155 msec) followed by human environmental versus speech decoding (peak: 59.3% at 165 msec) and instrument versus speech token decoding (peak: 58.9% at 150 msec).

In summary, shortly after overall decoding first exceeded chance at 90 msec and before the peak in overall decoding at 155 msec, neural representations already began to differ across sound categories. Note that neither the subjects' level of musical training nor their behavioral performance (at screening or during the scan) was correlated with subject-level averages of decoding accuracy at any time point (overall, in Figure 2B, top, or for any combination of categories in Figure 2C or D, all FDR-corrected ps > .05 across time points for each measure).

Processing of Global Acoustic Features

Speech, instrument and human environmental sounds contain different acoustic regularities that likely relate to the neural representations we decoded. To investigate this possibility, we computed Kendall's tau correlations between the pairwise neural classification accuracy rates (each subject's neural RDMs throughout the epoch) and the pairwise acoustic differences among our stimuli along a variety of dimensions (acoustic RDMs; Figure 1B, center; see Methods for details). Correlations between neural decoding rates and global acoustic differences among the stimuli (i.e., for acoustic measures that operated over the whole sound token or the median across time-varying features) are shown in Figure 3A. These features were strongly associated with decoding rates before and around the peak in decoding accuracy that was observed around 155 msec (all FDR-corrected ps < .001). Differences among stimuli in terms of their modulation power spectra exhibited the earliest significant correlations with the neural RDM (onset rτ = .03 at 100 msec) followed a few milliseconds later by overall spectral variability (onset rτ = .03 at 115 msec) and aperiodicity (onset rτ = .06 at 120 msec). Other significant correlations emerged later for temporal centroid (onset rτ = .05 at 135 msec), overall spectral envelope (assessed by an ERB cochleagram; onset rτ = .03 at 135 msec), spectral centroid (onset rτ = .05 at 145 msec), and the sum of the squared ERB amplitudes across cochlear filter channels (which we used as a proxy for loudness to account for any changes from the sound playback apparatus in the scanner, 125–135 msec, peak rτ = .04). These features mostly remained correlated with the neural RDMs and increased in their correlation strength around the decoding peak at 155 msec except for ERB energy and temporal centroid, which were nonsignificant after 135 msec. The strongest correlations with the neural RDMs around the decoding peak were observed for differences in aperiodicity (rτ = .18 at 165 msec) followed by modulation power spectra (rτ = .14 at 140 msec) and spectral variability (rτ = .13 at 165 msec). Interestingly, many of the features that exhibited a significant association with decoding rates earliest in the epoch are complex acoustic representations known to be associated with some of the earliest processes in auditory cortex: aperiodicity, spectral envelope, and spectrotemporal variability (Norman-Haignere et al., 2015; Theunissen & Elie, 2014; Lewis et al., 2009; Chi, Ru, & Shamma, 2005). Later in the epoch, particularly after sound offset, significant correlations pertained to spectral centroid (peak rτ = .09 at 445 msec), spectral flatness (peak rτ = .06 at 460 msec), and temporal centroid (peak rτ = .11 at 250 msec). Overall, spectral variability (peak rτ = .16 at 435 msec) and aperiodicity (peak rτ = .19 at 350 msec) exhibited the strongest correlations throughout the epoch.

We conducted an additional analysis to examine the influence of fundamental frequency. This was possible because we allowed our speech stimuli to vary naturally in their fundamental frequencies and matched the notes of the instrument tokens along this dimension. However, a global acoustic analysis of fundamental frequency among just the instrument and speech stimulus pairs (excluding the pairs with /h/ consonant tokens, because these were predominantly noise) did not reveal any significant correlation with the neural decoding results.

Processing of Local Acoustic Features

Sound stimuli necessarily unfold over time. Thus, it is possible that local changes in the acoustic signal could have consequences for participants' unfolding neural responses. To understand how local acoustic developments precipitated corresponding changes in neural representations of these sounds, we expanded our global acoustic analysis in a way that leveraged the dynamic information in our stimuli. In this local acoustic analysis (Figure 1B, right), the pairwise acoustic feature differences among the sounds at every 5-msec time point in the stimuli (local acoustic RDMs) were correlated with every subsequent 5-msec RDM throughout the MEG epoch (neural RDMs; see Methods for details).
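
This local analysis can be sketched for one subject and one feature as a stimulus-time by neural-time map of correlations; the function and argument names below are illustrative.

```python
import numpy as np
from scipy.stats import kendalltau

# `neural_rdms`: items x items x neural_time; `acoustic_rdms`: items x items x stimulus_time.
def local_rsa_map(neural_rdms, acoustic_rdms):
    iu = np.triu_indices(neural_rdms.shape[0], k=1)
    n_time, s_time = neural_rdms.shape[2], acoustic_rdms.shape[2]
    corr = np.zeros((s_time, n_time))
    for a in range(s_time):                                  # each 5-msec acoustic RDM...
        acoustic_vec = acoustic_rdms[:, :, a][iu]
        for n in range(n_time):                              # ...against each 5-msec neural RDM
            corr[a, n] = kendalltau(neural_rdms[:, :, n][iu], acoustic_vec)[0]
    return corr
```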

The local acoustic analysis results are depicted in Figure 3B. The earliest significant correlation between a local acoustic RDM and the neural RDM was for spectral flatness (onset rτ = .05 at 100 msec in the neural epoch correlated with the 5-msec local acoustic RDM) and spectral variability (onset rτ = .03 at 100 msec in the neural epoch correlated with the 40-msec local acoustic RDM). Again, aperiodicity, spectral centroid/envelope, and spectral variability (across many stimulus time points) were strongly and significantly correlated with the decoding results around the decoding peak (155 msec). We also see a small influence of differences in local ERB energy from approximately the first 100 msec of the stimuli on subsequent neural representations. These local acoustic features were more strongly associated with neural decoding accuracy rates than the global features, with aperiodicity (peak rτ = .21 at 360 msec in the neural epoch correlated with the 55-msec local acoustic RDM and rτ = .20 at 165 msec in the neural epoch correlated with the 105-msec local acoustic RDM) and spectral variability (peak rτ = .20 at 160 msec in the neural epoch correlated with the 110-msec local acoustic RDM) exhibiting some of the strongest correlations. Aperiodicity and spectral variability were also significantly associated with decoding across neural and acoustic time points. Meanwhile, spectral differences (including spectral centroid, flatness, and ERB energy) from the earlier portions of the stimuli appear to exert the most influence.

DISCUSSION

The questions of “when” and “how” a listener recognizes what they hear are fundamentally linked due to the temporal nature of sound. We examined these questions using multivariate pattern analyses of MEG time series data in conjunction with representational similarity analyses of acoustic differences among a set of natural sound stimuli that are behaviorally relevant to human listeners. Using this approach, we gained a better understanding of the acoustic qualities associated with the formation of auditory object and event representations as they unfold in time.

We found that neural activity measured using MEG contained information sufficient to distinguish individual sounds beginning 90 msec after sound onset, reaching a peak at 155 msec, and continuing after sound offset at 300 msec until the end of the neural epoch at 500 msec. This pattern of results was driven by especially accurate decoding of stimuli from different sound categories, although individual instruments, speech utterances, and everyday object sounds could all be decoded from one another better than chance. Acoustic analyses indicated that these emergent representations were associated with differences among the stimuli pertaining to spectrotemporal variability, aperiodicity (or harmonic content), and the sounds' spectral and temporal envelopes. The importance of these features fluctuated over time with spectral envelope and modulation spectra exerting their greatest influence soon after onset, whereas spectral centroid and temporal centroid had a greater influence later in the epoch. Spectral variability and aperiodicity were significantly associated with decoding throughout much of the epoch. These are fundamental acoustic qualities that have been shown to influence neural responses and perception, but our results go further to suggest that these cues are specifically associated with the dynamic formation of mental representations of what a listener hears early after sound onset.

Our analyses needed to manage dynamic information within our stimuli due to the physical (temporal) constraints of sound. By leveraging this information, we obtained a picture of how the brain rapidly integrates and organizes sensory information from natural sounds. However, processing dynamic stimulus information is also an important part of everyday object and event perception beyond the auditory domain. Thus, the approach we describe here could provide valuable insight into other areas of cognitive neuroscience and object perception as well. For example, this approach could be applied to dynamic visual stimuli or multisensory integration paradigms to better understand how the brain integrates and represents specific stimulus information from both modalities to achieve percepts of complex objects and events.

Relative to the decoding of visual objects from MEG, the decoding onsets and latencies that we observed were slower than some findings (Cichy et al., 2014; Isik, Meyers, Leibo, & Poggio, 2014) but comparable to others (Carlson et al., 2011, 2013). This difference might be due to the inherent temporal nature of our sound stimuli relative to the pictures used in previous work on visual objects. Unlike sounds, picture stimuli present all the information necessary for object recognition at once. Auditory stimuli, on the other hand, require time to unfold, thus incrementally relaying the information that supports object recognition. Also, although overall decoding accuracy differed for within- and between-category comparisons among our auditory stimuli, the emergence of category and individual object representations followed a roughly similar time course. In most cases, these exceeded chance around 90 msec and peaked around 155 msec. Studies of visual object decoding suggest somewhat wider variability in the latency of decoding individual objects or categories (Cichy et al., 2014; Carlson et al., 2011, 2013).

A related study (Teng et al., 2017) decoded neural representations of different impact sounds and reverberant spaces from MEG, finding earlier onsets and latencies in their sound source decoding than what we observe here. The slower latencies in our results are likely a function of the more varied stimulus onsets (amplitude envelopes) we employed in our set of sounds. This follows from the diverse set of natural sound sources and events that we examined, compared with the constrained set of impact sounds used in previous work (Teng et al., 2017). Thus, our results provide a useful expansion upon this prior work given that the heterogeneous stimulus set examined here is more representative of the diversity of sounds humans encounter in everyday life (which can involve different amplitude envelopes).

Exploring the mechanisms of sound source decoding further, our results suggest that the auditory system prioritizes cues related to a sound's spectrotemporal variability and aperiodicity, as well as its spectral and temporal envelopes (at least among this set of sounds). These features can effectively distinguish instrument and human environmental sounds (especially early after onset), which occupy opposing extremes on these dimensions (Ogg & Slevc, 2019a; Ogg et al., 2017). Speech utterances, on the other hand (an especially important sound category for humans), exploit all of these acoustic cues and their extremes from moment to moment to relay linguistic information (Stilp & Kluender, 2010; Smith & Lewicki, 2006). Speech sounds also exhibited the highest within-category decoding indicating that individual speech sounds are more distinguishable from one another in listeners' neural responses relative to the instrument and human environmental sounds. Thus, the acoustic fluctuation among speech tokens might account for this increased within-category decoding.

Although these acoustic qualities have previously been shown to influence the strength of neural responses on different timescales (Allen et al., 2017; Hullett et al., 2016; Norman-Haignere et al., 2015; Alluri et al., 2012; Giordano et al., 2012; Lewis et al., 2009, 2012; Schönwiesner & Zatorre, 2009; Langers et al., 2007; Caclin et al., 2006; Warren et al., 2005; Rosburg, 2003; Menon et al., 2002; Patterson et al., 2002; Roberts et al., 2000; Giard et al., 1995; Näätänen & Picton, 1987), our results go further by beginning to link these features to representations of auditory objects and events early in perception. Additionally, we show that these features, which have typically been examined in fMRI (Ogg et al., 2019; Allen et al., 2017; Norman-Haignere et al., 2015; Alluri et al., 2012; Giordano et al., 2012; Lewis et al., 2009, 2012; Schönwiesner & Zatorre, 2009; Langers et al., 2007; Warren et al., 2005; Menon et al., 2002; Patterson et al., 2002), vary in their influence over time, with more primary acoustic processes in cortex associated with decoding accuracy early in the response such as spectral envelope and modulation power spectra (Norman-Haignere et al., 2015; Theunissen & Elie, 2014; Chi et al., 2005). Later in the response, however, neural representations were more related to spectral and temporal centroid features (with spectral variability and aperiodicity correlating strongly throughout). Finally, the time course of these neural and acoustic processes aligns with behavioral identification and categorization results for duration-gated stimuli (e.g., Ogg et al., 2017; Suied et al., 2014). Similar acoustic qualities (from the first tens of milliseconds in the stimuli) appear to be associated with the differentiation of sounds in early MEG decoding time windows (starting around 100–200 msec) and in subsequent behavioral responses (later around 300–600 msec; Ogg et al., 2017; Suied et al., 2014; Agus, Suied, Thorpe, & Pressnitzer, 2012).

One important caveat to these findings and a potential avenue for future research is that it is not clear how (or when) these decoded representations correspond to more abstract qualities or semantic labels associated with these sounds compared with their acoustic features. Separating semantic labels from acoustic processing in the neural responses obtained in this study is difficult because these were not orthogonal in this natural and diverse stimulus set. Indeed, much of our decoding performance could follow from differences in acoustic or stimulus-level processing rather than object-level representations per se. This potentially aligns with the timing of the decoding peak we observed, which also coincides with components of the auditory evoked response (particularly N1–P2, between 100 and 200 msec). Evoked responses in this time window are known to be influenced by acoustic differences among stimuli (Caclin et al., 2006; Rosburg, 2003; Roberts et al., 2000; Giard et al., 1995; Näätänen & Picton, 1987) and appear to support representations of speech (Bidelman & Walker, 2017; Chang et al., 2010; Roberts et al., 2000; Poeppel et al., 1997; Näätänen & Picton, 1987). Note, however, that this decoding peak could also be related to the overall improved signal-to-noise ratio in the neural response that follows from these prominent evoked components. Note also that decoding remained significantly above chance after these time points. In either case, we go beyond much of the previous work examining natural sounds in MEG/EEG by describing the acoustic features that might underpin processing differences between sound categories.

The ostensibly modest absolute decoding accuracy rates we observed (peaking at around 8.4% above chance) might raise some concerns regarding the reliability of these results. However, as Hebart and Baker (2018) point out, although high decoding rates are clearly desirable for brain–computer–interface applications, for understanding brain function (which is our present goal), above chance decoding is all that is necessary to inform how the brain processes information (especially given the resolution we have access to with current imaging technologies). Moreover, they make the important point that decoding accuracy is not equivalent to an effect size and that, again, the key is to understand where (or, in our case, “when”) information is contained or processed in the system (i.e., when accuracy exceeds chance). Analogous points can be made regarding the strength of correlations using these decoding accuracy rates (such as in our acoustic analyses) because decoding accuracy places an upper limit on the strength of the correlations that can be observed. Finally, it is worth keeping in mind that essentially the same decoding results were obtained based on a principal components analysis of the MEG data, which indicates that our results are not due to any deleterious effects of interdependence among feature vectors (MEG data) in training and testing (Grootswagers et al., 2017).

It is interesting that we did not observe an influence of fundamental frequency or musical training in these neural decoding results. However, this might be due to aspects of our design or task. For example, the absence of a fundamental frequency effect could reflect our use of isolated sound tokens (rather than longer concurrent streams) or a range restriction in the fundamental frequencies of the speakers' utterances (which spanned approximately one octave). Similarly, it is possible that fundamental frequency requires larger timescales on which to exert an influence (Walker, Bizley, King, & Schnupp, 2011; Robinson & Patterson, 1995a, 1995b), beyond the time windows examined here. And although musical training has sometimes been found to influence auditory cortical responses (Bidelman, Weiss, Moreno, & Alain, 2014), these effects might arise more robustly during demanding behavioral tasks (see also Bidelman & Walker, 2017; Alho et al., 2016). Indeed, other findings based on passive tasks with diverse, readily distinguishable stimuli have not found an influence of musical training on neural representations (Ogg et al., 2019).

More generally, attention is known to strengthen neural responses and neural representations of auditory stimuli (Ding & Simon, 2012), particularly during categorization tasks (Bidelman & Walker, 2017; Alho et al., 2016). However, the current study used a mostly passive paradigm (nontarget oddball detection task). Thus, it is likely that stronger attentional demands to specific target stimuli (or categories) could boost their decoding accuracy and could potentially modify the influence of certain target-specific acoustic features. Some findings also indicate that attention plays a particularly acute role in auditory scene analysis or streaming tasks around 200 msec (O'Sullivan et al., 2015; Snyder, Alain, & Picton, 2006). Viewed in conjunction with our results, this suggests that, after object identification occurs most acutely around 150 msec, attention might act to tune responses to a given stimulus and track a specific sound source through an auditory scene. This is thought to occur via feedback mechanisms from frontal areas (Fritz, David, Radtke-Schuller, Yin, & Shamma, 2010). Viewed alongside our significant acoustic correlations with aperiodicity, modulation spectra, and the spectral envelope, this suggests that these features could be avenues through which auditory cortex is tuned to a given sound via the modulation of attention (Fritz, Elhilali, David, & Shamma, 2007; Petkov et al., 2004). Future work examining how semantic or acoustic cues factor into attentional modulation will also help further illuminate this issue. These exciting new directions, along with our results, will help paint a fuller picture of how sounds are transformed from acoustic phenomena into the events and objects we identify and subsequently track in the environment in service of more complex behaviors and individual goals.

Acknowledgments

We are grateful for the discussions, feedback, and assistance we received in the course of this work from Natalia Lapinskaya, Christopher Neufeld, Jonathan Simon, Christian Brodbeck, Ed Smith, Tijl Grootswagers, Amanda Robinson, and Stefanie Kuchinsky. A portion of the human environmental sounds was obtained courtesy of the Sound Events Database (www.auditorylab.org/; Copyright 2008, Laurie M. Heller; funding provided by NSF award 0446955). Thanks also to Carnegie Mellon's Center for the Neural Basis of Cognition regarding the human environmental sounds and to Lawrence Fritts for managing The University of Iowa Musical Instrument Samples database.

Reprint requests should be sent to Mattson Ogg, Neuroscience and Cognitive Science Program, Department of Psychology, University of Maryland, Biology-Psychology Building, Room 3150, 4094 Campus Drive, College Park, MD 20742, or via e-mail: mogg@umd.edu.

REFERENCES

Agus, T. R., Suied, C., Thorpe, S. J., & Pressnitzer, D. (2012). Fast recognition of musical sounds based on timbre. Journal of the Acoustical Society of America, 131, 4124–4133.
Alho, J., Green, B. M., May, P. J. C., Sams, M., Tiitinen, H., Rauschecker, J. P., et al. (2016). Early-latency categorical speech sound representations in the left inferior frontal gyrus. Neuroimage, 129, 214–223.
Allen, E. J., Burton, P. C., Olman, C. A., & Oxenham, A. J. (2017). Representations of pitch and timbre variation in human auditory cortex. Journal of Neuroscience, 37, 1284–1293.
Alluri, V., Toiviainen, P., Jääskeläinen, I. P., Glerean, E., Sams, M., & Brattico, E. (2012). Large-scale brain networks emerge from dynamic processing of musical timbre, key and rhythm. Neuroimage, 59, 3677–3689.
BBC Sound Effects Library. (1997). London, United Kingdom: BBC Worldwide.
Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B: Methodological, 57, 289–300.
Bidelman, G. M., & Walker, B. S. (2017). Attentional modulation and domain-specificity underlying the neural organization of auditory categorical perception. European Journal of Neuroscience, 45, 690–699.
Bidelman, G. M., Weiss, M. W., Moreno, S., & Alain, C. (2014). Coordinated plasticity in brainstem and auditory cortex contributes to enhanced categorical speech perception in musicians. European Journal of Neuroscience, 40, 2662–2673.
Bizley, J. K., & Cohen, Y. E. (2013). The what, where and how of auditory-object perception. Nature Reviews Neuroscience, 14, 693–707.
Caclin, A., Brattico, E., Tervaniemi, M., Näätänen, R., Morlet, D., Giard, M. H., et al. (2006). Separate neural processing of timbre dimensions in auditory sensory memory. Journal of Cognitive Neuroscience, 18, 1959–1972.
Carlson, T. A., Hogendoorn, H., Kanai, R., Mesik, J., & Turret, J. (2011). High temporal resolution decoding of object position and category. Journal of Vision, 11, 1–17.
Carlson, T. A., Tovar, D. A., Alink, A., & Kriegeskorte, N. (2013). Representational dynamics of object vision: The first 1000 ms. Journal of Vision, 13, 1–19.
Chang, E. F., Rieger, J. W., Johnson, K., Berger, M. S., Barbaro, N. M., & Knight, R. T. (2010). Categorical speech representation in human superior temporal gyrus. Nature Neuroscience, 13, 1428–1432.
Charest, I., Pernet, C. R., Rousselet, G. A., Quiñones, I., Latinus, M., Fillion-Bilodeau, S., et al. (2009). Electrophysiological evidence for an early processing of human voices. BMC Neuroscience, 10, 127.
Chi, T., Ru, P., & Shamma, S. A. (2005). Multiresolution spectrotemporal analysis of complex sounds. Journal of the Acoustical Society of America, 118, 887–906.
Cichy, R. M., Pantazis, D., & Oliva, A. (2014). Resolving human object recognition in space and time. Nature Neuroscience, 17, 455–462.
de Cheveigné, A., & Kawahara, H. (2002). YIN, a fundamental frequency estimator for speech and music. Journal of the Acoustical Society of America, 111, 1917–1930.
de Cheveigné, A., & Simon, J. Z. (2007). Denoising based on time-shift PCA. Journal of Neuroscience Methods, 165, 297–305.
De Lucia, M., Clarke, S., & Murray, M. M. (2010). A temporal hierarchy for conspecific vocalization discrimination in humans. Journal of Neuroscience, 30, 11210–11221.
Ding, N., & Simon, J. Z. (2012). Emergence of neural encoding of auditory objects while listening to competing speakers. Proceedings of the National Academy of Sciences, U.S.A., 109, 11854–11859.
Elliott, T. M., & Theunissen, F. E. (2009). The modulation transfer function for speech intelligibility. PLOS Computational Biology, 5, e1000302.
Formisano, E., De Martino, F., Bonte, M., & Goebel, R. (2008). “Who” is saying “what”? Brain-based decoding of human voice and speech. Science, 322, 970–973.
Fritz, J. B., David, S. V., Radtke-Schuller, S., Yin, P., & Shamma, S. A. (2010). Adaptive, behaviorally gated, persistent encoding of task-relevant auditory information in ferret frontal cortex. Nature Neuroscience, 13, 1011–1019.
Fritz, J. B., Elhilali, M., David, S. V., & Shamma, S. A. (2007). Auditory attention—Focusing the searchlight on sound. Current Opinion in Neurobiology, 17, 437–455.
Giard, M. H., Lavikahen, J., Reinikainen, K., Perrin, F., Bertrand, O., Pernier, J., et al. (1995). Separate representation of stimulus frequency, intensity, and duration in auditory sensory memory: An event-related potential and dipole-model analysis. Journal of Cognitive Neuroscience, 7, 133–143.
Giordano, B. L., McAdams, S., Zatorre, R. J., Kriegeskorte, N., & Belin, P. (2012). Abstract encoding of auditory objects in cortical activity patterns. Cerebral Cortex, 23, 2025–2037.
Griffiths, T. D., & Warren, J. D. (2004). What is an auditory object? Nature Reviews Neuroscience, 5, 887–892.
Grootswagers, T., Wardle, S. G., & Carlson, T. A. (2017). Decoding dynamic brain patterns from evoked responses: A tutorial on multivariate pattern analysis applied to time series neuroimaging data. Journal of Cognitive Neuroscience, 29, 677–697.
Haynes, J. D., & Rees, G. (2006). Neuroimaging: Decoding mental states from brain activity in humans. Nature Reviews Neuroscience, 7, 523–534.
Hebart, M. N., & Baker, C. I. (2018). Deconstructing multivariate decoding for the study of brain function. Neuroimage, 180, 4–18.
Hullett, P. W., Hamilton, L. S., Mesgarani, N., Schreiner, C. E., & Chang, E. F. (2016). Human superior temporal gyrus organization of spectrotemporal modulation tuning derived from speech stimuli. Journal of Neuroscience, 36, 2014–2026.
Isik, L., Meyers, E. M., Leibo, J. Z., & Poggio, T. (2014). The dynamics of invariant object recognition in the human visual system. Journal of Neurophysiology, 111, 91–102.
Kazazis, S., Esterer, N., Depalle, P., & McAdams, S. (2017). A performance evaluation of the timbre toolbox and the MIRtoolbox on calibrated test sounds. In Proceedings of the 2017 International Symposium on Musical Acoustics (pp. 144–147).
Khalighinejad, B., Cruzatto da Silva, G., & Mesgarani, N. (2017). Dynamic encoding of acoustic features in neural responses to continuous speech. Journal of Neuroscience, 37, 2176–2185.
Kikuchi, Y., Horwitz,
,
B.
, &
Mishkin
,
M.
(
2010
).
Hierarchical auditory processing directed rostrally along the monkey's supratemporal plane
.
Journal of Neuroscience
,
30
,
13021
13030
.
Kriegeskorte
,
N.
, &
Kievit
,
R. A.
(
2013
).
Representational geometry: Integrating cognition, computation, and the brain
.
Trends in Cognitive Sciences
,
17
,
401
412
.
Langers
,
D. R.
,
van Dijk
,
P.
,
Schoenmaker
,
E. S.
, &
Backes
,
W. H.
(
2007
).
fMRI activation in relation to sound intensity and loudness
.
Neuroimage
,
35
,
709
718
.
Levy
,
D. A.
,
Granot
,
R.
, &
Bentin
,
S.
(
2001
).
Processing specificity for human voice stimuli: Electrophysiological evidence
.
NeuroReport
,
12
,
2653
2657
.
Levy
,
D. A.
,
Granot
,
R.
, &
Bentin
,
S.
(
2003
).
Neural sensitivity to human voices: ERP evidence of task and attentional influences
.
Psychophysiology
,
40
,
291
305
.
Lewis
,
J. W.
,
Talkington
,
W. J.
,
Tallaksen
,
K. C.
, &
Frum
,
C. A.
(
2012
).
Auditory object salience: Human cortical processing of non-biological action sounds and their acoustic signal attributes
.
Frontiers in Systems Neuroscience
,
6
,
27
.
Lewis
,
J. W.
,
Talkington
,
W. J.
,
Walker
,
N. A.
,
Spirou
,
G. A.
,
Jajosky
,
A.
,
Frum
,
C.
, et al
(
2009
).
Human cortical organization for processing vocalizations indicates representation of harmonic structure as a signal attribute
.
Journal of Neuroscience
,
29
,
2283
2296
.
Maris
,
E.
, &
Oostenveld
,
R.
(
2007
).
Nonparametric statistical testing of EEG- and MEG-data
.
Journal of Neuroscience Methods
,
164
,
177
190
.
Menon
,
V.
,
Levitin
,
D. J.
,
Smith
,
B. K.
,
Lembke
,
A.
,
Krasnow
,
B. D.
,
Glazer
,
D.
, et al
(
2002
).
Neural correlates of timbre change in harmonic sounds
.
Neuroimage
,
17
,
1742
1754
.
Moore
,
B. C. J.
(
2012
).
An introduction to the psychology of hearing
(6th ed.).
Bingley, UK
:
Emerald
.
Murray
,
M. M.
,
Camen
,
C.
,
Gonzalez Andino
,
S. L.
,
Bovet
,
P.
, &
Clarke
,
S.
(
2006
).
Rapid brain discrimination of sounds of objects
.
Journal of Neuroscience
,
26
,
1293
1302
.
Näätänen
,
R.
, &
Picton
,
T.
(
1987
).
The N1 wave of the human electric and magnetic response to sound: A review and an analysis of the component structure
.
Psychophysiology
,
24
,
375
425
.
Norman-Haignere
,
S.
,
Kanwisher
,
N. G.
, &
McDermott
,
J. H.
(
2015
).
Distinct cortical pathways for music and speech revealed by hypothesis-free voxel decomposition
.
Neuron
,
88
,
1281
1296
.
Ogg
,
M.
,
Moraczewski
,
D.
,
Kuchinsky
,
S. E.
, &
Slevc
,
L. R.
(
2019
).
Separable neural representations of sound sources: Speaker identity and musical timbre
.
Neuroimage
,
191
,
116
126
.
Ogg
,
M.
, &
Slevc
,
L. R.
(
2019a
).
Acoustic correlates of auditory object and event perception: speakers, musical timbres, and environmental sounds
.
Frontiers in Psychology
,
10
,
1594
.
Ogg
,
M.
, &
Slevc
,
L. R.
(
2019b
).
Neural mechanisms of music and language
. In
G.
Zubicaray
&
N.
Schiller
(Eds.),
Oxford handbook of neurolinguistics
(pp.
907
952
).
New York
:
Oxford University Press
.
Ogg
,
M.
,
Slevc
,
L. R.
, &
Idsardi
,
W. J.
(
2017
).
The time course of sound category identification: Insights from acoustic features
.
Journal of the Acoustical Society of America
,
142
,
3459
3473
.
Oosterhof
,
N. N.
,
Connolly
,
A. C.
, &
Haxby
,
J. V.
(
2016
).
CoSMoMVPA: Multi-modal multivariate pattern analysis of neuroimaging data in Matlab/GNU Octave
.
Frontiers in Neuroinformatics
,
10
,
27
.
O'Sullivan
,
J. A.
,
Power
,
A. J.
,
Mesgarani
,
N.
,
Rajaram
,
S.
,
Foxe
,
J. J.
,
Shinn-Cunningham
,
B. G.
, et al
(
2015
).
Attentional selection in a cocktail party environment can be decoded from single-trial EEG
.
Cerebral Cortex
,
25
,
1697
1706
.
Patterson
,
R. D.
,
Uppenkamp
,
S.
,
Johnsrude
,
I. S.
, &
Griffiths
,
T. D.
(
2002
).
The processing of temporal pitch and melody information in auditory cortex
.
Neuron
,
36
,
767
776
.
Peeters
,
G.
,
Giordano
,
B. L.
,
Susini
,
P.
,
Misdariis
,
N.
, &
McAdams
,
S.
(
2011
).
The timbre toolbox: Extracting audio descriptors from musical signals
.
Journal of the Acoustical Society of America
,
130
,
2902
2916
.
Peirce
,
J. W.
(
2007
).
PsychoPy—Psychophysics software in Python
.
Journal of Neuroscience Methods
,
162
,
8
13
.
Perrodin
,
C.
,
Kayser
,
C.
,
Logothetis
,
N. K.
, &
Petkov
,
C. I.
(
2011
).
Voice cells in the primate temporal lobe
.
Current Biology
,
21
,
1408
1415
.
Petkov
,
C. I.
,
Kang
,
X.
,
Alho
,
K.
,
Bertrand
,
O.
,
Yund
,
E. W.
, &
Woods
,
D. L.
(
2004
).
Attentional modulation of human auditory cortex
.
Nature Neuroscience
,
7
,
658
663
.
Poeppel
,
D.
,
Phillips
,
C.
,
Yellin
,
E.
,
Rowley
,
H. A.
,
Roberts
,
T. P.
, &
Marantz
,
A.
(
1997
).
Processing of vowels in supratemporal auditory cortex
.
Neuroscience Letters
,
221
,
145
148
.
Rauschecker
,
J. P.
, &
Scott
,
S. K.
(
2009
).
Maps and streams in the auditory cortex: Nonhuman primates illuminate human speech processing
.
Nature Neuroscience
,
12
,
718
724
.
Rauschecker
,
J. P.
, &
Tian
,
B.
(
2000
).
Mechanisms and streams for processing of “what” and “where” in auditory cortex
.
Proceedings of the National Academy of Sciences, U.S.A.
,
97
,
11800
11806
.
Rigoulot
,
S.
,
Pell
,
M. D.
, &
Armony
,
J. L.
(
2015
).
Time course of the influence of musical expertise on the processing of vocal and musical sounds
.
Neuroscience
,
290
,
175
184
.
Roberts
,
T. P.
,
Ferrari
,
P.
,
Stufflebeam
,
S. M.
, &
Poeppel
,
D.
(
2000
).
Latency of the auditory evoked neuromagnetic field components: Stimulus dependence and insights toward perception
.
Journal of Clinical Neurophysiology
,
17
,
114
129
.
Robinson
,
K.
, &
Patterson
,
R. D.
(
1995a
).
The duration required to identify the instrument, the octave, or the pitch chroma of a musical note
.
Music Perception
,
13
,
1
15
.
Robinson
,
K.
, &
Patterson
,
R. D.
(
1995b
).
The stimulus duration required to identify vowels, their octave, and their pitch chroma
.
Journal of the Acoustical Society of America
,
98
,
1858
1865
.
Rosburg
,
T.
(
2003
).
Left hemispheric dipole locations of the neuromagnetic mismatch negativity to frequency, intensity and duration deviants
.
Cognitive Brain Research
,
16
,
83
90
.
Sankaran
,
N.
,
Thompson
,
W. F.
,
Carlile
,
S.
, &
Carlson
,
T. A.
(
2018
).
Decoding the dynamic representation of musical pitch from human brain activity
.
Scientific Reports
,
8
,
839
.
Schellenberg
,
E. G.
(
2006
).
Long-term positive associations between music lessons and IQ
.
Journal of Educational Psychology
,
98
,
457
468
.
Schönwiesner
,
M.
, &
Zatorre
,
R. J.
(
2009
).
Spectro-temporal modulation transfer function of single voxels in the human auditory cortex measured with high-resolution fMRI
.
Proceedings of the National Academy of Sciences, U.S.A.
,
106
,
14611
14616
.
Smith
,
E. C.
, &
Lewicki
,
M. S.
(
2006
).
Efficient auditory coding
.
Nature
,
439
,
978
982
.
Smith
,
S. M.
, &
Nichols
,
T. E.
(
2009
).
Threshold-free cluster enhancement: Addressing problems of smoothing, threshold dependence and localisation in cluster inference
.
Neuroimage
,
44
,
83
98
.
Snyder
,
J. S.
,
Alain
,
C.
, &
Picton
,
T. W.
(
2006
).
Effects of attention on neuroelectric correlates of auditory stream segregation
.
Journal of Cognitive Neuroscience
,
18
,
1
13
.
Sound Events and Real World Events Databases
. (
2008
).
Pittsburgh, PA
:
Carnegie Mellon University
.
Staeren
,
N.
,
Renvall
,
H.
,
De Martino
,
F.
,
Goebel
,
R.
, &
Formisano
,
E.
(
2009
).
Sound categories are represented as distributed patterns in the human auditory cortex
.
Current Biology
,
19
,
498
502
.
Stilp
,
C. E.
, &
Kluender
,
K. R.
(
2010
).
Cochlea-scaled entropy, not consonants, vowels, or time, best predicts speech intelligibility
.
Proceedings of the National Academy of Sciences, U.S.A.
,
107
,
12387
12392
.
Suied
,
C.
,
Agus
,
T. R.
,
Thorpe
,
S. J.
,
Mesgarani
,
N.
, &
Pressnitzer
,
D.
(
2014
).
Auditory gist: Recognition of very short sounds from timbre cues
.
Journal of the Acoustical Society of America
,
135
,
1380
1391
.
Teng
,
S.
,
Sommer
,
V. R.
,
Pantazis
,
D.
, &
Oliva
,
A.
(
2017
).
Hearing scenes: A neuromagnetic signature of auditory source and reverberant space separation
.
eNeuro
,
4
,
ENEURO.0007-17.2017
.
The University of Iowa
. (
1997
).
Musical instrument samples database
. http://theremin.music.uiowa.edu/MIS.html.
Theunissen
,
F. E.
, &
Elie
,
J. E.
(
2014
).
Neural processing of natural sounds
.
Nature Reviews Neuroscience
,
15
,
355
366
.
Thoret
,
E.
,
Depalle
,
P.
, &
McAdams
,
S.
(
2017
).
Perceptually salient regions of the modulation power spectrum for musical instrument identification
.
Frontiers in Psychology
,
8
,
587
.
Tian
,
B.
,
Reser
,
D.
,
Durham
,
A.
,
Kustov
,
A.
, &
Rauschecker
,
J. P.
(
2001
).
Functional specialization in rhesus monkey auditory cortex
.
Science
,
292
,
290
293
.
Tong
,
F.
, &
Pratte
,
M. S.
(
2012
).
Decoding patterns of human brain activity
.
Annual Review of Psychology
,
63
,
483
509
.
Vettel
,
J. M.
(
2010
).
Neural integration of multimodal events
(Doctoral dissertation)
.
Brown University
,
Providence, RI
.
Walker
,
K. M.
,
Bizley
,
J. K.
,
King
,
A. J.
, &
Schnupp
,
J. W.
(
2011
).
Multiplexed and robust representations of sound features in auditory cortex
.
Journal of Neuroscience
,
31
,
14565
14576
.
Warren
,
J. D.
,
Jennings
,
A. R.
, &
Griffiths
,
T. D.
(
2005
).
Analysis of the spectral envelope of sounds by the human brain
.
Neuroimage
,
24
,
1052
1057
.