Interaction with the world requires an organism to transform sensory signals into representations in which behaviorally meaningful properties of the environment are made explicit. These representations are derived through cascades of neuronal processing stages in which neurons at each stage recode the output of preceding stages. Explanations of sensory coding may thus involve understanding how low-level patterns are combined into more complex structures. To gain insight into such midlevel representations for sound, we designed a hierarchical generative model of natural sounds that learns combinations of spectrotemporal features from natural stimulus statistics. In the first layer, the model forms a sparse convolutional code of spectrograms using a dictionary of learned spectrotemporal kernels. To generalize from specific kernel activation patterns, the second layer encodes patterns of time-varying magnitude of multiple first-layer coefficients. When trained on corpora of speech and environmental sounds, some second-layer units learned to group similar spectrotemporal features. Others instantiate opponency between distinct sets of features. Such groupings might be instantiated by neurons in the auditory cortex, providing a hypothesis for midlevel neuronal computation.
The ability to interact with the environment requires an organism to infer characteristics of the world from sensory signals. One challenge is that the environmental properties an organism must recognize are usually not explicit in the sensory input. A primary function of sensory systems is to transform raw sensory signals into representations in which behaviorally important features are more easily recovered. In doing so, the brain must generalize across irrelevant stimulus variation while maintaining selectivity to the variables that matter for behavior. The nature of sensory codes and the mechanisms by which they achieve appropriate selectivity and invariance are thus a primary target of sensory system research.
The auditory system is believed to instantiate such representations through a sequence of processing stages extending from the cochlea into the auditory cortex. Functional evidence suggests that neurons in progressively higher stages of the auditory pathway respond to increasingly complex and abstract properties of sound (Chechik & Nelken, 2012; Atencio, Sharpee, & Schreiner, 2012; Carruthers et al., 2015; Bizley, Nodal, Nelken, & King, 2005; Elie & Theunissen, 2015; Russ, Ackelson, Baker, & Cohen, 2008; Mesgarani, David, Fritz, & Shamma, 2008; Obleser, Leaver, VanMeter, & Rauschecker, 2010; Bendor & Wang, 2005; Overath, McDermott, Zarate, & Poeppel, 2015; Norman-Haignere, Kanwisher, & McDermott, 2015). Yet our understanding of the underlying transformations remains limited, particularly when compared to the visual system.
Feature selectivity throughout the auditory system has traditionally been described using linear receptive fields (Aertsen & Johannesma, 1981; Theunissen, Sen, & Doupe, 2000; Depireux, Simon, Klein, & Shamma, 2001). The most common instantiation is the spectrotemporal receptive field (STRF), which typically characterizes neural activity with a one-dimensional linear projection of the sound spectrogram transformed with a nonlinearity (Sharpee, Atencio, & Schreiner, 2011). As a neural data analysis technique, STRFs are widespread in auditory neuroscience and have generated insight in domains ranging from plasticity to speech coding (e.g., Fritz, Shamma, Elhilali, & Klein, 2003; Woolley, Fremouw, Hsu, & Theunissen, 2005).
Despite their utility, it is clear that STRFs are at best an incomplete description of auditory codes, especially in the cortex (Machens, Wehr, & Zador, 2004; Sahani & Linden, 2003; Williamson, Ahrens, Linden, & Sahani, 2016). Experimental evidence suggests that auditory neural responses are strongly nonlinear. As a consequence, auditory receptive fields estimated with natural sounds differ substantially from estimates obtained with artificial stimuli (Theunissen et al., 2000). STRF descriptions also fail to capture the dimensionality expansion of higher representational stages. In contrast to the brainstem, neurons in the auditory cortex seem to be sensitive to multiple stimulus features at the same time (Atencio et al., 2012; Kozlov & Gentner, 2016; Harper et al., 2016). The presence of strongly nonlinear behavior and multiplexing necessitates signal models more sophisticated than one-dimensional, linear features of the spectrogram such as STRFs.
An additional challenge to characterizing midlevel features of sound is that humans lack strong intuitions about abstract auditory structure. In specific signal domains such as speech, progress has been made by cataloging phonemes and other frequently occurring structures, but it is not obvious how to generalize this approach to broader corpora of natural sounds.
An alternative approach to understanding sensory representations that is less reliant on domain-specific intuition is that of efficient coding (Barlow, 1961; Attneave, 1954). The efficient coding hypothesis holds that neural codes should exploit the statistical structure of natural signals, allowing such signals to be represented with a minimum of resources. Numerous studies have demonstrated that tuning properties of neurons in early stages of the visual and auditory systems are predicted by statistical models of natural images or sounds (Srinivasan, Laughlin, & Dubs, 1982; Olshausen & Field, 1997; Bell & Sejnowski, 1997; van Hateren & van der Schaaf, 1998; Lewicki, 2002; Klein, König, & Körding, 2003; Smith & Lewicki, 2006; Carlson, Ming, & DeWeese, 2012; Terashima & Okada, 2012; Młynarski, 2014, 2015; Carlin & Elhilali, 2013; Lee, Ekanadham, & Ng, 2008; Hosoya & Hyvärinen, 2015). Although the early successes of this approach engendered optimism, applications have largely been limited to learning a single stage of representation, and extensions to multiple levels of sensory processing have proven difficult. The underlying challenge is that there are many possible forms of high-order statistical dependencies in signals, and the particular dependencies that occur in natural stimuli are typically not obvious. The formulation of models capable of capturing these dependencies requires careful analysis and design (Hyvärinen & Hoyer, 2000; Karklin & Lewicki, 2005, 2009; Cadieu & Olshausen, 2008), and perhaps good fortune, and is additionally constrained by what is tractable to implement. In the auditory system in particular, it remains to be seen whether modeling statistical signal regularities can reveal the complex acoustic structures and invariances that are believed to be represented in higher stages of the auditory system.
The primary goal of the work we describe in this letter was to discover such high-order structure in natural sounds and generate hypotheses about not-yet-observed intermediate-level neural representations. To this end, we developed a probabilistic generative model of natural sounds designed to learn a novel stimulus representation—a population code of naturally occurring combinations of basic spectrotemporal patterns, analogous to STRFs.
The representations learned by the model from corpora of natural sounds suggest grouping principles in the auditory system. Model units learned to pool sets of similar spectrotemporal features, presumably because they occur together in natural sounds. In addition, some model units encode opponency between different sets of features. Although not yet described in the auditory system, such tuning patterns may be analogous to phenomena such as end-stopping or cross-orientation suppression in the visual system. The representations learned by our model also resemble some recently reported properties of auditory cortical neurons, providing further evidence that natural scene statistics can predict neural representations in higher sensory areas.
2 Methods and Models
2.1 Overview of the Hierarchical Model
To learn midlevel auditory representations, we constructed a hierarchical, statistical model of natural sounds. The model structure is depicted in Figure 1. The model consisted of a stimulus layer and two latent layers that were adapted to efficiently represent a corpus of audio signals. Because our goal was to learn midlevel auditory codes, we did not model the raw sound waveform. Instead, we assumed an initial stage of frequency analysis, modeled after that of the mammalian cochlea. This frequency analysis results in a spectrogram-like input representation of sound, which we term a cochleagram, that provides a coarse model of the auditory nerve input to the brain (see Figure 1A, bottom row). This input representation is an matrix, where is the number of frequency channels and is the number of time points. Our aim was to capture statistical dependencies in natural sounds represented in this way.
The first layer of the model, the middle row in Figure 1, was intended to learn basic acoustic features, analogous to STRFs of early auditory neurons. Through the remainder of the letter, we refer to the features learned by the first layer as spectrotemporal kernels (STKs) in order to differentiate them from neurally derived STRFs. The second layer, depicted in the top row of Figure 1A, was intended to learn patterns of STK coactivations that frequently occur in natural sounds.
The model specifies a probability distribution over the space of natural sounds, and its parameters can be understood as random variables whose dependency structure is depicted in Figure 1B. The spectrogram is represented with a set of spectrotemporal features convolved with the latent activation time courses . The second latent layer encodes the magnitudes of with the basis functions convolved with their activation time courses . In the following sections, we present the details of each layer.
2.2 First Layer of Model: Convolutional, Nonnegative Sparse Coding of Spectrograms
The first layer of the model was designed to learn basic spectrotemporal features. One previous attempt to learn sparse spectrotemporal representations of natural sounds (Carlson et al., 2012) produced structures reminiscent of receptive fields of neurons in the auditory midbrain, thalamus, and cortex. Here, we extended this approach by learning a spectrogram representation that is sparse and convolutional.
2.2.1 Dependencies between Coefficients, a Signature of Midlevel Structure
Although the sparse coding strategy we have outlined learns features that are approximately independent across the training set, residual dependencies nonetheless remain. In part, this is because not all dependencies can be modeled with a single layer of convolutional sparse coding. However, due to the nonstationary nature of natural audio, coefficient dependencies can also be present locally despite not being evident across a large corpus as a whole. Empirically, the learned features exhibit dependencies specific to particular sounds (Karklin & Lewicki, 2009) and thus deviate locally from their (approximately independent) marginal distribution. For example, a spoken vowel with a fluctuating pitch contour would require many harmonic STKs to become activated, and their activations would become strongly correlated on a local timescale. Such local correlations reflect higher-order structure of particular natural sounds. Statistically speaking, this is an example of marginally independent random variables exhibiting conditional dependence (in this case, conditioned on a particular point in time or a type of sound).
Figure 3 depicts such dependencies via cross-correlation functions of selected STK activations for a white noise sample and three different natural sounds. Coefficient correlations (black curves in each subplot) vary from sound to sound, but in all cases, they deviate from those obtained with noise (gray bars within subplots), revealing dependence. These dependencies are indicative of midlevel auditory features, perhaps analogous to the correlations between oriented Gabor filters induced by an elongated edge. In this work, we exploited the fact that intermediate-level representations can be learned by modeling dependencies among first-layer features (Karklin & Lewicki, 2005, 2009; Cadieu & Olshausen, 2008).
2.2.2 Variability of STK Activations
A second phenomenon evident in STK activations is that particular patterns of coactivation occur with some variability. This is visible in Figure 4, which depicts activations of selected STKs when encoding multiple exemplars of the same sound—the word one spoken twice by the same speaker (see Figure 4A) and two exemplars of water being poured into a cup (see Figure 4B).
It is apparent that the STK coefficient trajectories for the two exemplars in each case (black and gray lines) reflect the same global pattern even though they differ somewhat from exemplar to exemplar. The similarity suggests that the trajectories could be modeled as different samples from a single time-varying distribution parameterized by a nonstationary coefficient magnitude. When the magnitude increases, the probability of a strong STK activation increases. Retaining the (inferred) time-varying magnitude instead of precise values of STK coefficients would yield a representation more invariant to low-level signal variation, potentially enabling the representation of abstract regularities in the data. Such a representation bears an abstract similarity to the magnitude operation used to compute a spectrogram, in which each frequency channel retains the time-varying energy in different parts of the spectrum. Here we are instead estimating a scale parameter of the underlying distribution, but the process similarly discards aspects of the fine detail of the signal.
2.3 Second Layer of Model: Encoding of STK Combinations
The second layer of the model was intended to exploit the two statistical phenomena detailed in the previous section: conditional dependencies between STKs and their variation across exemplars. Similar to the first layer, the second layer representation is formed by a population of sparsely activated basis functions. These basis functions capture local dependencies among STK magnitudes by encoding the joint distribution of STK activations rather than exact values of STK coefficients. The resulting representation is thus more specific than the first-layer code; instead of encoding single features independently, it signals the presence of particular STK combinations. It is also more invariant, generalizing over specific coefficient values.
Because the proposed representation is a population code of a distribution parameter, it bears conceptual similarity to previously proposed hierarchical models of natural stimuli that encoded patterns of variance (Karklin & Lewicki, 2005; Bumbacher & Ming, 2012), covariance (Karklin & Lewicki, 2009), or complex amplitude (Cadieu & Olshausen, 2008; Hyvärinen & Hoyer, 2000). The novelty of our model structure lies in being convolutional (i.e., it can encode stimuli of arbitrary length using the same representation) and in parameterizing distributions of nonnegative STK coefficients, increasing the interpretability of the learned spectrogram features. The novelty of the model's application is to learn hierarchical representations of sound (previous such efforts have largely been restricted to modeling images, though see Lee, Pham, Largman, & Ng, 2009).
When is high, the distribution of becomes broader (see Figure 5C, black line in the top row, red line in the bottom). This allows the coefficient to attain large values. For small values of the scale parameter, the probability density is concentrated close to 0 (see Figure 5D; red line in the top row, black line in the bottom), and coefficients become small. To model the magnitudes (which are nonnegative, akin to variances), we took their logarithm, mapping their values onto the entire real line so that they could be represented by a sum of real-valued basis functions.
2.4 Learning Procedure
The two layers of the model were trained separately; the training of the second layer occurred after the first-layer training was completed. In the first layer, training was performed with an EM-like procedure that iteratively alternated between inferring STK coefficients and updating STK features (Olshausen & Field, 1997; Cadieu & Olshausen, 2008). Spectrotemporal features were initialized with gaussian white noise. For each excerpt in the training set, coefficients were inferred via gradient descent on the energy function, equation 2.5. Because inference of all coefficients is computationally expensive, we adopted an approximate inference scheme (Blumensath & Davies, 2006). Instead of inferring values of all coefficients for each excerpt, we selected only a subset of them to be minimized. This was done by computing the cross-correlation between a sound excerpt and features and selecting a fixed number of the largest coefficients . The inference step adjusted only this subset of coefficients while setting the rest to 0. Given the inferred coefficients, a gradient step on the spectrotemporal features was performed.
Each learning iteration therefore consisted of the following steps:
Draw a random sound excerpt from the training data set. Excerpts were 323 ms in length (129 time samples of the spectrogram, sampled at 400 Hz).
Compute the cross-correlation of all basis functions with the sound excerpt. Select the 1024 pairs of coefficient indices and time points that yield the highest correlation values.
Infer the values of the selected coefficients by minimizing equation 2.5 with respect to via gradient descent. Set the rest of coefficients to 0.
Compute the gradient step on the basis functions as the derivative of equation 2.5 with respect to basis functions using inferred coefficient values . Update basis functions according to the gradient step.
Normalize all basis functions to unit norm.
This procedure was terminated after 200,000 iterations.
The second layer was then learned via the same procedure used for the first layer. In each iteration, a 423 ms (169 samples at 400 Hz) randomly drawn sound excerpt was encoded by the first layer, and the resulting matrix of coefficients served as an input to the second layer. A subset of coefficients was selected for approximate inference by computing the cross-correlation between features and the logarithm of the first-layer coefficients (analogous to step 2 in the procedure described for the first layer). The energy function was first minimized with respect to coefficients followed by a gradient update to the basis functions (analogous to steps 3 and 4 for the first layer). Entries in the bias vector corresponding to each coefficient were set to the expectation of the coefficient across the entire training set: (i.e., the estimate of the marginal scale parameter for the corresponding STK). Learning was again terminated after 200,000 iterations. To validate the learning algorithm, we ran it on a toy data set generated using known second-layer units. The algorithm successfully recovered the parameters of the generating model, as desired (see appendix C for these results).
2.5 Training Data and Spectrogram Parameters
We trained the model on two different sound corpora. The first corpus was the TIMIT speech database (Garofolo et al., 1993). The second corpus combined a set of environmental sounds (the Pitt sound database; Lewicki, 2002) and a number of animal vocalizations downloaded from freesound.org. The environmental sounds included both transient (e.g., breaking twigs, steps) and ambient (e.g., flowing water, wind) sounds; the animal vocalizations were mostly harmonic.
We computed cochleagrams by filtering sounds with a set of 65 bandpass filters intended to mimic cochlear frequency analysis. Filters were equally spaced on an equivalent rectangular bandwidth (ERB) scale (Glasberg & Moore, 1990), with parameters similar to that from (McDermott & Simoncelli, 2011). Center frequencies ranged from 200 Hz to 8 kHz. We computed the Hilbert envelope of the output of each filter and raised it to the power of 0.3, emulating cochlear amplitude compression (Robles & Ruggero, 2001). To reduce dimensionality, each envelope was downsampled to 400 Hz.
We set the number of features in the first layer to 128 and in the second layer to 100. Pilot experiments yielded qualitatively similar results for alternative feature dimensionalities. Each first-layer feature encoded a 162 ms interval (65 time samples of the spectrogram).
3.1 First Layer: Basic Spectrotemporal Features of Natural Sounds
The first-layer features learned from each of the two sound corpora are shown in Figures 6A and 6B. These features could be considered as the model analogs of neural STRFs. The vast majority of features are well localized within the time-frequency plane, encoding relatively brief acoustic events. The STKs learned from speech included single harmonics and harmonic stacks (see Figure 6A, features numbered 1 and 2), frequency sweeps (feature 3), and broadband clicks (feature 4). The features learned from environmental sounds also included single harmonics and clicks (see Figure 6B, features 1, 2, and 4). In contrast to the results obtained with speech, however, harmonic stacks were absent, and a number of high-frequency hisses and noise-like features were present instead (feature 3).
The properties of the learned dictionaries are also reflected in distributions of feature locations in the time frequency and spectrotemporal modulation planes, as visible in Figures 6C and 6D. Each dot position denotes the center of mass of a single STK, while its color signals the feature's average coefficient value over the stimulus set. For both speech and environmental sounds, the learned STKs uniformly tile the audio frequency spectrum (see Figures 6C and 6D, left panels). Due to the convolutional nature of the code, the energy of each feature is concentrated near the middle of the time axis. The modulation spectra of the learned STKs (see Figures 6C and 6D, right panels) are somewhat specific to the sound corpus. Features trained on a speech corpus were more strongly modulated in frequency, while environmental sounds yielded STKs with faster temporal modulations. STKs learned from both data sets exhibit a spectrotemporal modulation trade-off: if an STK is strongly temporally modulated, its spectral modulation tends to be weaker. This trade-off is an inevitable consequence of time-frequency conjugacy (Singh & Theunissen, 2003) and is also found in the STRFs of the mammalian and avian auditory systems (Miller, Escabí, Read, & Schreiner, 2002; Woolley et al., 2005).
3.2 Second Layer: Combinations of Spectrotemporal Features
STKs captured by the first layer of the model reflect elementary features of natural sounds. By contrast, the features learned by the second layer capture how activations of different STKs cluster together in natural sounds and thus reflect more complex acoustic regularities. We first present several ways of visualizing the multidimensional nature of the second-layer representation, next make some connections to existing neurophysiological data, and then derive some neurophysiological predictions from the model.
3.2.1 Visualizing Second-Layer Features
The second-layer features encode temporal combinations of STK log-magnitudes. Each feature can be represented as an -dimensional matrix, where rows correspond to STKs and columns to time points. An example second-layer unit is shown in Figure 7A (left panel). A positive value in the th row and th column of feature encodes a local increase in the magnitude of the th STK. A negative value encodes a decrease in magnitude.
An alternative visualization is to examine the spectrotemporal structure of the STKs that have large weights in the second-layer feature. The same feature is depicted in this way in the center panel of Figure 7A, which displays the four STKs with the highest average absolute weights for this particular feature. To the right of each STK are its weights (i.e., the corresponding row of the matrix). It is apparent that the weights increase and decrease in a coordinated fashion, and thus they likely encode particular dependencies between the STKs.
To summarize the full distribution of STKs contributing to a second-layer unit, we adopted the visualization scheme illustrated in the right panel of Figure 7A. We plot the center of mass of each STK in the modulation (top row) and time-frequency (middle row) planes, as in Figures 6C and 6D. The dot for a first-layer STK is colored red or blue depending on the sign of their time-averaged weight, with the average absolute value of the weight signaled by the intensity of the color. The bottom row of the panel depicts the temporal pattern of STK magnitudes; line colors correspond to dots in the top and middle rows of the panel. Although the weights of most STKs maintain the same sign over the temporal support of the second-layer unit, they were not directly constrained in this regard, and in some cases the weight trajectories cross zero.
Representative examples of second-layer basis functions are depicted in this way in Figure 7B. We separated them into two broad classes: excitatory units (columns 1 and 2 in Figure 7), which pool STKs using weights of the same sign and therefore encode a pattern of coordinated increase in their magnitudes, and excitatory-inhibitory units (columns 3 and 4 in Figure 7), which pool some STKs with positive average weights and others with negative average weights. We note that excitatory-only and inhibitory-only units are functionally interchangeable in the model because the encoding is unaffected if the signs of both the STK weights and coefficients are reversed. From inspection of Figure 7B, it is evident that some second-layer units pool only a few STKs (e.g., A1, C1, D1), while others are more global and influence activations of many first-layer units (e.g., A2, A3, C3).
3.2.2 Second-Layer Features Encode Patterns of STK Dependencies
To better understand the structure captured by the second-layer units, we considered their relationship to the two kinds of dependencies in STK activations that initially motivated the model. In section 2, we observed that STK activations to natural sounds exhibit strong local cross-correlations as well as variations of particular temporal activation patterns. The second layer of the model was formulated in order to capture and encode these redundancies.
To first test whether the second layer captures the sorts of residual correlations evident in the first-layer output, we measured cross-correlations between STK activations conditioned on the activation of particular second-layer units. Figures 8A to 8D depict four example second-layer units (top row, left column) along with an example stimulus that produced a strong positive response in the unit (selected from the TIMIT corpus). The bottom section of each panel depicts cross-correlation functions of activations of four STKs, averaged over 25 stimulus epochs that produced a strong positive response of the second-layer unit. The cross-correlation functions deviate substantially from 0, as they do when conditioned on excepts of natural sounds (see Figure 3). These correlations reflect the temporal pattern of STK coefficients that the second-layer unit responds to.
We next examined whether the second-layer units respond to STK activations fluctuating around particular global patterns (as depicted in Figure 4). Because the second layer of the model represents the magnitude rather than the precise values of first-layer coefficients, it should be capable of generalizing over minor STK coefficient variation. Figure 9 plots STK coefficient trajectories for stimuli eliciting a strong response in the second-layer features from Figure 8. The STK activation traces reveal variability in each case, but nonetheless exhibit a degree of global consistency, as we saw earlier for natural sound exemplars (see Figure 4). These results provide evidence that the second layer is capturing the dependencies it was intended to model.
3.3 Experimental Predictions
3.3.1 Pooling of Similiar Spectrotemporal Patterns
What do the model results suggest about midlevel auditory structure? Our visualizations of second-layer features in Figure 7B revealed that many represent the concurrent activation of many STKs. Examination of Figure 7B (columns 1 and 2) suggests that the first-layer STKs that such units pool are typically similar in either their spectrotemporal or modulation properties. For example, the unit depicted in panel A2 encodes joint increases in the magnitude of high-frequency STKs, while that in panel C2 encodes joint increases in low-frequency STKs.
In order to quantitatively substantiate these observations, we measured the spread of the center of mass of each of the STKs pooled by excitatory-only second-layer units (those whose STK weights exceeding of the maximum absolute weight were all of the same sign). Specifically, we computed the standard deviation of the center of mass on each dimension of the time-frequency and modulation planes for all STKs with weights exceeding of the maximum weight for the unit, and we compared this to the standard deviation of centers of mass for random samples of STKs (matched in size to the mean number of STKs pooled by the excitatory-only units). These standard deviations measure the spread of STKs in spectrotemporal and modulation domains; if the pooled STKs are similar, the spread should be small.
As shown in Figure 10, the STKs pooled with the same sign were more similar in each of the four analyzed dimensions than random groups of STKs for both sound corpora (red bar versus gray bar, -tests yielded in all cases). This result suggests the hypothesis that STRFs that are pooled by downstream auditory cortical neurons with excitatory weights (Atencio et al., 2012; Kozlov & Gentner, 2016) should also tend to be more similar than expected by chance.
3.3.2 Opponency Patterns in Midlevel Audition
Figure 7B also shows examples of a distinct set of second-layer units in which increased activation of one group of STKs (again, typically similar to each other) is associated with a decrease in activity of another group of STKs. For example, the unit in panel B3 encodes an increase in magnitude of temporally modulated features (clicks), together with a simultaneous decrease of activation of a strongly spectrally modulated (harmonic) STK. Such opponency was evident in a large subset of second-layer units and represents the main novel phenomenon evident in our model.
To our knowledge, no such opponent tuning has been identified in auditory neuroscience, but qualitatively similar opponent behavior is evident in visual neurons exhibiting end-stopping or cross-orientation inhibition. The results raise the possibility that coordinated excitation and inhibition could be a feature of central auditory processing, and we thus examined this model property in detail.
To quantitatively substantiate the observation of opponency, we again examined the variation in STKs pooled by the second-layer units—in this case, those that pooled STKs with both positive and negative weights. We analyzed all units that had at least one STK weight of each sign whose absolute value exceeded of the maximum weight. For each such second-layer unit, we identified centers of mass of the STKs pooled with weights exceeding of the maximum and again computed their standard deviation along time-frequency and modulation coordinates. We compared the standard deviation for STKs pooled with the same sign (separately computed for positive and negative signs) to that of STKs pooled with both signs. If opponent units contrast different types of features, we expect the standard deviation of STKs pooled with both signs to be higher than that of STKs pooled with a single sign.
As shown in Figure 10, even for STKs pooled with the same sign, the spread was typically higher for opponent units than for excitatory-only units for both speech and environmental sounds (blue versus red bars in Figures 10A and 10B, respectively). However, the average spread of STKs pooled with the both signs by opponent units (see Figure 10, dark blue bars) was significantly higher than the spread of STKs pooled with the same sign (see Figure 10, light blue bars). This finding held for all dimensions and both corpora ( in all cases, via t-test). Moreover, the spread of STKs pooled with the same sign by opponent units was also significantly lower than the spread of random groups of STKs (light blue bars versus gray bars; in all cases, via -test). These results provide quantitative evidence that second-layer units pool features in a structured way and that opponent units assign opposite signs to groups of STKs that are similar within groups but differ across groups.
One might imagine that examination of the STKs pooled by each unit would be sufficient to determine the sort of stimuli eliciting strong positive or negative responses. But because the activation of each unit is the result of a nonlinear inference process in which units compete to explain the stimulus pattern (Olshausen & Field, 1997), it is often not obvious what a set of STKs will capture. Thus, to understand which stimuli excite or inhibit second-level features, we inferred coefficients by encoding the entire training data set and selected two sets of 25 sound epochs that elicited the strongest positive and strongest negative responses, respectively, in each unit. Examples of opponent stimulus patterns encoded by second-layer features are visualized in Figure 11. These positive and negative stimuli are depicted in the second and third columns from the left, respectively. In the last two columns, the center of mass of each of these stimuli is plotted on the time-frequency plane (fourth column) and modulation plane (fifth column). Although the center of mass of each stimulus is admittedly a crude summary, the simplicity of the representation facilitates visualization and analysis of the clustering of positive and negative stimuli. We note also that because second-layer coefficients and weights can be sign-reversed without changing the representation, the designation of stimuli (and STK weights) as excitatory or inhibitory is arbitrary.
In some cases, some natural function can be ascribed to the unit, particularly upon listening to stimuli eliciting positive and negative responses. For instance, positive activations of the unit shown in Figure 11B appear to encode onsets of voiced speech, while its negative activations encode voicing offsets. By contrast, the unit shown in Figure 11H appears to code speaker gender. For this unit, representations of positive and negative stimuli were mixed on the time-frequency and modulation planes, but the excitatory stimuli exhibited somewhat coarser spectral modulation (red circles in the right column are more concentrated in the lower part of the plane than blue circles). Listening to the stimuli revealed a clear difference in pitch/gender (23 out of 25 of the negative stimuli were vowels uttered by a female speaker, while all of the positive stimuli were vowels uttered by a male speaker, as confirmed by the first author upon listening to the stimuli). This observation underscores the fact that the time-frequency and modulation planes are but two representational spaces in which to assess stimuli and are used here primarily because they are standard. Moreover, the centroids in these planes are but one way to summarize the STK. Although opponent sets of STKs and positive or negative stimuli often exhibited separation in one of the two planes when analyzed in terms of centroids, they need not do so in order to be distinct.
Taken together, our results demonstrate opponent units that encode STK activation patterns that are mutually exclusive and presumably do not co-occur in natural signals. The phenomenon is a natural one to investigate in the auditory cortex.
3.4 Comparison with Neurophysiological Data
Although our primary goal was to generate predictions of not-yet-observed neural representations of sound, we also sought to test whether our model would reproduce known findings from auditory neuroscience. The first-layer STKs replicated some fairly standard findings in the STRF literature, as discussed earlier. To compare the results from the second layer to experimental data, we examined their receptive field structure and the specificity of their responses and compared each to published neurophysiology data.
3.4.1 Differences in Modulation Tuning between First and Second Model Layers
First, we estimated spectrotemporal receptive fields for units in both layers of the model. The receptive fields of the first-layer units are simply the STK of the unit. To estimate receptive fields of a second-layer unit, we drew inspiration from the spike-triggered average, generating a number of cochleagram samples from each basis function and averaging them. To generate samples from the th basis function, we set a coefficient to 1 with all other set to zero. We then sampled STK activation trajectories from the distribution dictated by the second-layer feature's coefficient and weights, convolved them with the corresponding STKs, and summed the results. We then averaged multiple such samples together. Although we could have computed something more directly analogous to a spike-triggered average, the average sample (which we can compute only because we have the underlying generative model, unlike when conducting a neurophysiology experiment) has the advantage of alleviating the influence of stimulus correlations on the signature of the receptive field. (See appendix B for an illustration of this analysis.)
“Receptive fields” obtained in this way are depicted in Figure 12A. To compare the model units to neurophysiological data, we generated histograms of average spectral and temporal modulation frequency (center of mass in the modulation plane) of first- and second-layer receptive fields and plot them next to distributions of preferred modulation frequencies of neurons in the auditory thalamus and cortex of the cat (Miller et al., 2002; see Figure 12B). The same trend is evident in the model and the auditory system: the second layer prefers features with slower/coarser spectral and temporal modulations relative to the first layer, mirroring the difference seen between the cortex and thalamus. This analysis used features trained on speech, but environmental sounds yielded qualitatively similar results. In the model as well as the brain, lower modulation frequencies may result in part from combining multiple distinct STKs in downstream units.
3.4.2 Model Units Reflect Neural Hierarchical Trends of Response Specificity
Neuronal tuning in early and late stages of the auditory system also tends to differ in specificity (Chechik & Nelken, 2012; Kozlov & Gentner, 2016). Compared to the auditory brainstem, cortical neurons are less selective and respond to multiple features of sound (Kozlov & Gentner, 2016; Atencio et al., 2012), consistent with an increase in abstraction of the representation (Chechik & Nelken, 2012). Suggestions of similar behavior in our model are apparent in the activations of first- and second-layer units to sound, an example of which is shown in Figure 12C. The first-layer feature (middle row) becomes activated only when it is strongly correlated with the stimulus. In contrast, activations of a typical second-layer feature (bottom row) deviate from zero during many seemingly different parts of the stimulus.
We quantified the specificity of tuning with the feature selectivity index (FSI), a measure introduced previously to quantify how correlated a stimulus has to be with a neuron's spike-triggered average to evoke a response (Miller, Escabí, & Schreiner, 2001; see appendix B for definition). An FSI equal to 1 implies that a neuron spikes only when a stimulus is precisely aligned with its STRF (defined as the spike-triggered average), whereas an FSI equal to 0 means that the stimuli triggering neural firing are uncorrelated with the STRF. We computed the FSI using the 25 cochleagram excerpts that most strongly activated each of the first- and second-layer units. The average FSI of each model layer is plotted in Figure 12D. Second-layer features are substantially less specific than first-layer features. A similar effect occurs across different cortical layers (Atencio, Sharpee, & Schreiner, 2009; see Figure 12D, right). Analogous differences seem likely to occur between thalamus and cortex as well, although we are not aware of an explicit prior comparison. The decrease in response specificity in our model can be explained by the fact that second-layer units can become activated when any of the pooled first-layer features (or their combination) appear in the stimulus.
Natural sounds are highly structured. The details of this structure and the mechanisms by which it is encoded by the nervous system remain poorly understood. Progress on both fronts is arguably limited by the shortage of signal models capable of explicitly representing natural acoustic structure. We have proposed a novel statistical model that captures an unexplored type of high-order dependency in natural sounds: correlations between the activations of basic spectrotemporal features. Our model consists of two layers. The first layer learns a set of elementary spectrotemporal kernels and uses them to encode sound cochleagrams. The second layer learns a representation of co-occurrence patterns of the first layer features.
We adopted a generative modeling approach, inspired by its previous successes in the domain of natural image statistics. Previous hierarchical, probabilistic models of natural images were able to learn high-order statistical regularities in natural images (Karklin & Lewicki, 2005, 2009; Cadieu & Olshausen, 2008; Hyvärinen, Hurri, & Hoyer, 2009; Hoyer & Hyvärinen, 2002; Lee et al., 2008; Hosoya & Hyvärinen, 2015; Garrigues & Olshausen, 2008; Berkes, Turner, & Sahani, 2009). In many cases, the representations learned by these models exhibit similarity to empirically observed neural codes in the visual system.
The hierarchical representations learned by our model provide predictions about neural representations of midlevel sound structure. Moreover, the model reproduces certain aspects of the representational transformations found through the thalamus and cortex. The results suggest that principles of efficient coding could shape midlevel processing in the auditory cortex in addition to the auditory periphery (Lewicki, 2002).
4.1 Statistical Dependencies in Natural Sounds
Our model exploits statistical dependencies between first-layer spectrotemporal kernels. These dependencies have two likely causes. First, dependencies almost surely remain from limitations of the convolutional sparse coding model, in that first-layer coefficients are never completely marginally independent even after learning (due to insufficient expressive power of the code). Second, even if the first-layer coefficients were fully marginally independent, they could exhibit local dependencies when conditioned on particular sound excerpts. Observations of such conditional dependence have been made previously, mostly in the context of modeling natural image statistics (Karklin & Lewicki, 2005, 2009; Hyvärinen et al., 2009; Cadieu & Olshausen, 2008; Karklin, Ekanadham, & Simoncelli, 2012). We similarly observed strong nonzero cross-correlations between STKs for particular natural sound excerpts and found that the second-layer units captured these sorts of dependencies (see Figures 3 and 8).
We also observed that different instances of the same acoustic event (such as the same word uttered twice by the same speaker) yield similiar STK coefficient trajectories (see Figure 4), which can be thought of as samples from a nonstationary distribution with a time-varying magnitude parameter. By modeling this magnitude, the model learned representations with some degree of invariance (see Figure 9).
4.2 Model Results: Convolutional Spectrotemporal Features
The model first learned a sparse convolutional decomposition of cochleagrams. Convolutional representations are less redundant than patch-based codes and can represent signals of arbitrary length; they have been successfully applied to spectrogram analysis in engineering (e.g., Blumensath & Davies, 2006; Grosse, Raina, Kwong, & Ng, 2012). The features learned with convolutional sparse coding improved sound classification accuracy, musical note extraction, and source separation in prior work (Blumensath & Davies, 2006; Grosse et al., 2012). By contrast, the model presented here is, to our knowledge, the first use of convolutional sparse coding of audio in a neuroscience context. Accordingly, we provide an in-depth analysis of feature shapes learned by the first layer and relate them to known spectrotemporal tuning properties of auditory neurons.
The learned spectrotemporal kernels span clicks, harmonics, combinations of harmonics, bandpass noise, frequency sweeps, onsets, and offsets. Some of these structures are present in previous learned codes of spectrograms, but due to the convolutional nature of the learned code, the model learns only a single version of each feature rather than replicating it at different time points. Although it is difficult to quantify, our impression is that the code is more diverse than that obtained with previous patch-based approaches (Carlson et al., 2012; Klein et al., 2003). As in previous neurophysiological measurements of cortical STRFs, the model STKs tile the spectrotemporal modulation plane, subject to the contraints of the trade-off between time and frequency (Singh & Theunissen, 2003).
As with features learned from sound waveforms (Lewicki, 2002; Smith & Lewicki, 2006), we found the first-layer spectrogram features to depend on the training corpus (speech or a set of environmental sounds). Although certain spectrotemporal patterns (e.g., single harmonics or clicks) appeared in both features sets, others were corpus specific. Corpus-specific structure was also evident in the distributions of features in the modulation plane. This corpus dependence contrasts with the relatively consistent occurrence of Gabor-like features in sparse codes of images. These results raise the possibility that auditory cortical neuronal tuning might exhibit considerable heterogeneity, given that the full range of natural audio must be encoded by cortical neurons.
4.3 Model Results: Second-Layer Features
The second layer of the model encodes a probability distribution of spectrotemporal feature activations. By jointly modeling the magnitudes of the first-layer responses, the second-layer basis functions capture patterns of spectrotemporal kernel covariation. Magnitude modeling was essential to learning additional structure; we found in pilot experiments that simply applying a second layer of convolutional sparse coding did not produce comparable results. This observation is consistent with previous findings that nonlinear transformations of sparse codes often help to learn residual dependencies (Karklin & Lewicki, 2009; Shan, Zhang, & Cottrell, 2007; Hyvärinen et al., 2009).
The second layer of our model learned a representation of spectrotemporal feature coactivations that frequently occur in natural sounds. Typically, the spectrotemporal features pooled by a second-layer unit shared some property, often evident in at least one of the time-frequency or modulation planes (e.g., frequency content, temporal pattern, or modulation characteristics; see Figure 7), quantified by the similarity in these planes. Our model also identified opponent patterns—sets of spectrotemporal features that are pooled with the opposite sign, presumably because they are rarely active simultaneously in natural sounds (see Figure 11). The opponent features for a second-layer unit were often at least partially separated in at least one of the spectrotemporal modulation or time-frequency planes (see Figure 10).
4.4 Relation to Auditory Neuroscience
The structure learned by the model from natural sounds replicates some known properties of the auditory system. The modulation frequencies preferred by units dropped from the first to the second layer (see Figure 12B), as has been observed between the thalamus and cortex. Unit tuning specificity (i.e., the similarity between stimuli eliciting a strong response) also decreased from the first to the second layer (see Figures 12C and 12D). A similar specificity decrease has been observed between granular and supra- and infra-granular layers of the cortex (Atencio et al., 2009). Although we do not suggest a detailed correspondence between layers of our model and particular anatomical structures, these similarities indicate that some of the principles underlying hierarchical organization of the auditory pathways may derive from the natural sound statistics and architectural choices that constrain our model.
The combinations of spectrotemporal features learned by our model provide hypotheses for neural tuning that might be present in the auditory system. Some units combined only STKs with similar acoustic properties (see Figures 7 and 10). Others encode activations of opponent sets of first-layer kernels. Such opponent kernel sets are presumably those that do not typically become active simultaneously in the training corpus. The results are somewhat analogous to excitation-inhibition phenomena in visual neurophysiology (such as end-stopping, length, and width suppression or cross-orientation inhibition) emerging in models of natural images (Karklin & Lewicki, 2009; Coen-Cagli, Dayan, & Schwartz, 2012; Zhu & Rozell, 2013; Hosoya & Hyvärinen, 2015). The second-layer units of our model provide candidates of potentially analogous phenomena in the auditory cortex.
Our results are also relevant to recent evidence that central auditory neurons in some cases are driven by more than one stimulus feature (Atencio et al., 2009; Sharpee et al., 2011; Kozlov & Gentner, 2016). Our model reproduces that trend, pooling up to dozens of single-layer features with strong weights, and predicts that distinct dimensions may be combined in opponent fashion.
4.5 Relation to Other Modeling Approaches
The model described here represents one of several approaches to the construction of hierarchical signal representations. The representations in our model are learned from natural sounds and thus contrast with hand-engineered models that seek to replicate known or hypothesized features of sensory coding (Chi, Ru, & Shamma, 2005; McDermott & Simoncelli, 2011). One advantage of learning models from natural signals is that any similarities with known neural phenomena provide candidate normative explanations for these phenomena, for example, that they arise from the demands of efficient coding or some other optimality constraint imposed by the model (Olshausen & Field, 1997; Lewicki, 2002). Another potential advantage is that one might hope to learn structures that have not yet been observed neurally but could provide hypotheses for future experiments. The latter was the main motivation for our modeling approach.
The model is also unsupervised and generative, specifying a joint probability distribution over the data and coefficients in both latent layers. An alternative approach to learning hierarchical representations of data is to use discriminative models, which optimize a representation for performance of a particular task. Prominent recent examples of such discriminative learning come from the field of deep neural networks (Yamins et al., 2014). Models of this class are comparatively straightforward to optimize and can yield high performance on classification tasks, but typically require large numbers of labeled training examples. The requirement of labeled training data is sometimes a practical limitation and raises questions about the extent to which the learning procedure could relate to the brain. Generative models currently have the disadvantage of requiring custom optimization procedures. However, their generative nature facilitates certain applications (e.g., denoising) and allows samples to be straightforwardly generated, potentially for use in experiments. Moreover, they do not require labeled data and learn representations that are independent of any particular task, much like the unsupervised learning believed to occur in sensory systems.
We have presented a hierarchical model of natural sounds. When trained on natural sound corpora, the model learned a representation of spectrotemporal feature combinations. The properties of the model layers resemble aspects of hierarchical transformations previously observed in the brain, suggesting that efficient coding could shape such transformations throughout the auditory system. The learned midlevel features provide hypotheses for auditory cortical tuning as well as a means to parameterize stimuli with which to probe midlevel audition.
Appendix A: Gradients for Learning and Inference
A.1 Gradients for Learning and Inference in the First Layer
Learning and inference in the first layer were achieved via gradient descent on negative log posterior (see equation 2.5). Below, we provide expressions of gradient with respect to and respectively
A.2 Gradients for Learning and Inference in the Second Layer
Appendix B: Analysis of Model's Receptive Fields
In this appendix, we describe the methods used to estimate and analyze receptive fields of model units.
B.1 Receptive Field Estimation
Figure 13A illustrates the spectrotemporal receptive field estimation of model units through averaging of strongly activating stimuli (akin to computing spike-triggered average), and the biases that result.
The “true” spectrotemporal pattern encoded by an STK is simply the STK itself. To identify the average spectrotemporal pattern encoded by a second-layer unit, we take advantage of the fact that our model is generative (see Figure 13). For each unit, we generated a number of STK coefficient trajectories (second panel from the left in Figure 13B) from the distribution encoded by that particular unit (e.g., leftmost panel Figure 13B). We then generated spectrotemporal patterns by convolving the trajectories with the corresponding STKs and summing the results together (see Figure 2). Due to the stochasticity of the generated STK trajectories, some variability is visible among the generated spectrotemporal patterns (third panel from the left in Figure 13B). We then averaged these patterns to form an estimate of a spectrotemporal receptive field of a second-layer unit (right panel in Figure 13B). Because the estimation process does not involve averaging stimuli, we avoid biasing the estimate of the average preferred spectrotemporal pattern with stimulus correlations.
It is apparent that spike-triggered averages (third column in Figures 13A and 13B) differ from the true stimulus representation in the model for both layers (as embodied by the STK or the average spectrotemporal pattern generated by a second-layer unit). The discrepancy is due to the presence of strong correlations in natural stimuli.
B.2 Feature Selectivity Index
For comparison with neural data, we computed the feature selectivity index (FSI) proposed in Miller et al. (2001). The FSI is a number lying in the interval. FSI values close to 1 imply that stimuli eliciting a response of a neuron (or, in our case, a model unit) are similar to each other (specifically, the stimuli are close to the mean stimulus eliciting a response). When the FSI value is close to 0, the corresponding neuron spikes at random—stimuli preceding spikes are uncorrelated with the spike-triggered average. The relevance of the FSI for our purposes is that a neuron or model unit that exhibits invariance to some type of stimulus variation should have an FSI less than 1. By comparing the FSI across layers, we hoped to quantify differences in the degree of representational abstraction.
The FSI computation procedure are described in detail in Miller et al. (2001). The only discrepancy between the use of the FSI here and its prior use in neurophysiological studies is that units in our model are continuously active and do not discretely spike. We emulated the selection of stimuli eliciting spikes by selecting the stimulus excerpts yielding the highest activation of model layer units. In the first layer, we selected the 25 stimuli yielding the highest positive activation. In the second layer, we separately computed FSI indices for stimuli eliciting positive and negative responses, using 25 stimuli per unit in each case. We then averaged the indices obtained for the two sets of stimuli for each unit.
Computing FSI for an th unit consists of the following steps:
Compute the average of strongly activating stimuli (analogous to spike-triggered average, STA) separately for positive and negative stimuli in the case of second-layer units.
Compute the correlations between each strongly activating stimulus and its respective STA.
Compute the correlations between the STA and a randomly selected subset of stimuli .
Compute the area under the empirical cumulative distribution function of correlations ,
Compute the area under the empirical cumulative distribution function of
The FSI of each unit is defined as
Appendix C: Test of the Model's Convergence
In order to verify the correctness of the learning algorithm, we generated a toy data set using nine second-layer kernels, out of which three were “opponent” (see Figure 14A, left panel). We generated sample STK activations by randomly superimposing the second-layer kernels to create a “variance map” (see Figure 13B), from which we sampled STK trajectories (see Figure 14C). Each iteration of the learning procedure made a gradient step on the model parameters using a single sample of STK trajectories (from a single sampled variance map). After 1500 iterations (see Figure 14D), the model converged on the solution visualized in the right panel of Figure 14A. The generative kernels were recovered up to permutation, sign changes, and, in some cases, a temporal shift. These latter differences are expected due to the inherent arbitrariness of positive and negative signs in the model and the shift invariance inherent to a convolutional code. These results demonstrate that the learning algorithm is capable of converging to the correct solution when the data are well described by the model.
This material is based on work supported by the Center for Brains, Minds and Machines, funded by NSF STC award CCF-1231216 and by a McDonnell Scholar Award to J.H.M. We thank the members of the McDermott lab for helpful comments on earlier drafts of the manuscript.