Abstract
The sparse coding model posits that the visual system has evolved to efficiently code natural stimuli using a sparse set of features from an overcomplete dictionary. The original sparse coding model suffered, however, from two key limitations: (1) computing the neural response to an image patch required minimizing a nonlinear objective function via recurrent dynamics, and (2) fitting relied on approximate inference methods that ignored uncertainty. Although subsequent work has developed several methods to overcome these obstacles, we propose a novel solution inspired by the variational autoencoder (VAE) framework. We introduce the sparse coding variational autoencoder (SVAE), which augments the sparse coding model with a probabilistic recognition model parameterized by a deep neural network. This recognition model provides a neurally plausible feedforward implementation for the mapping from image patches to neural activities and enables a principled method for fitting the sparse coding model to data via maximization of the evidence lower bound (ELBO). The SVAE differs from standard VAEs in three key respects: the latent representation is overcomplete (there are more latent dimensions than image pixels), the prior is sparse or heavy-tailed instead of gaussian, and the decoder network is a linear projection instead of a deep network. We fit the SVAE to natural image data under different assumed prior distributions and show that it obtains higher test performance than previous fitting methods. Finally, we examine the response properties of the recognition network and show that it captures important nonlinear properties of neurons in the early visual pathway.
1 Introduction
Generative models have played an important role in computational neuroscience by offering normative explanations of observed neural response properties (Olshausen & Field, 1996a, 1996b, 1997; Lewicki & Olshausen, 1999; Dayan et al., 2003; Berkes & Wiskott, 2005; Coen-Cagli et al., 2012). These models seek to model the distribution of stimuli in the world in terms of a conditional probability distribution $p(\mathbf{x} \mid \mathbf{z})$, the probability of a stimulus $\mathbf{x}$ given a set of latent variables $\mathbf{z}$, and a prior over the latent variables $p(\mathbf{z})$. The advantage of this approach is that it mimics the causal structure of the world: an image falling on the retina is generated by “sources” in the world (e.g., the identity and pose of a face and the light source illuminating it), which are typically latent or hidden from the observer. Perception is naturally formulated as the statistical inference problem of identifying the latent sources that generated a particular sensory stimulus (Knill & Richards, 1996; Weiss et al., 2002; Knill & Pouget, 2004; Moreno-Bote et al., 2011). Mathematically, this corresponds to applying Bayes’ rule to obtain the posterior over latent sources given sensory data: $p(\mathbf{z} \mid \mathbf{x}) \propto p(\mathbf{x} \mid \mathbf{z})\, p(\mathbf{z})$, where the terms on the right-hand side are the likelihood $p(\mathbf{x} \mid \mathbf{z})$ and prior $p(\mathbf{z})$, which come from the generative model.
Perhaps the most famous generative model in neuroscience is the sparse coding model, introduced by Olshausen and Field (1996a, 1996b) to account for the response properties of neurons in visual cortex. The sparse coding model posits that neural activity represents an estimate of the latent features underlying a natural image patch under a linear generative model (see Figure 1). The model’s key feature is sparsity: a heavy-tailed prior over the latent variables ensures that neurons are rarely active, so each image patch must be explained by a small number of active features. Remarkably, the feature vectors obtained by fitting this model to natural images resemble the localized, oriented receptive fields found in the early visual cortex (Olshausen & Field, 1996a). Subsequent work showed the model could account for a variety of properties of neural activity in the visual pathway (e.g., classical and nonclassical receptive field effects: Rozell et al., 2008; Karklin & Lewicki, 2009; Lee et al., 2007).
Although the sparse coding model is a linear generative model, simultaneous recognition (inferring the latent variables from an image) and learning (optimizing the dictionary of features) can be computationally intensive. Instead, optimization has thus far relied on either variational optimization with a Dirac delta approximate posterior (Olshausen & Field, 1996a; Dayan et al., 2003; Seeger, 2008), which does not include uncertainty information, or sampling-based approaches (Berkes et al., 2008; Theis et al., 2012), which lack a neurally plausible implementation. In fact, finding neurally plausible architectures for both recognition and learning can be challenging in general. For variational methods, such architectures exist for both recurrent (Rozell et al., 2008; Charles et al., 2012; Zylberberg et al., 2011; Zhu & Rozell, 2013) and feedforward (Gregor & LeCun, 2010) implementations; however, these architectures rely on posterior approximations that do not include uncertainty.
In this article, we propose a unified solution to these two important problems using ideas from the variational autoencoder (VAE; Kingma & Welling, 2014; Rezende et al., 2014). The VAE is a framework for training a complex generative model by coupling it to a recognition model parameterized by a deep neural network. This deep network offers tractable inference for latent variables from data and allows for gradient-based learning of the generative model parameters using a variational objective. Here we adapt the VAE methodology to the sparse coding model by adjusting its structure and prior assumptions. We compare the resulting sparse-coding VAE (SVAE) to fits using the original methodology and show that our model achieves higher log-likelihood on test data. Furthermore, we show that the recognition model of the trained SVAE performs accurate inference under the sparse coding model and captures important response properties of neurons in visual cortex, including orientation tuning, surround suppression, and frequency tuning.
2 Background
2.1 The Sparse Coding Model
2.2 Fitting the Sparse Coding Model
To circumvent this intractable integral, Olshausen and Field (1996a) employed an approximate iterative method for optimizing the dictionary $\Phi$. After initializing the dictionary randomly, they iterate between the following two steps (a minimal code sketch follows the list):
- Take a group of training images $\{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ and compute the MAP estimate of the latent variables for each image using the current dictionary $\Phi$:
  $$\hat{\mathbf{z}}_i = \arg\max_{\mathbf{z}} \big[\log p(\mathbf{x}_i \mid \mathbf{z}, \Phi) + \log p(\mathbf{z})\big]. \tag{2.6}$$
- Update the dictionary using the gradient of the log-likelihood, conditioned on the estimates $\hat{\mathbf{z}}_i$:
  $$\Phi \leftarrow \Phi + \eta \sum_{i} \nabla_\Phi \log p(\mathbf{x}_i \mid \hat{\mathbf{z}}_i, \Phi), \tag{2.7}$$
  where $\eta$ is a learning rate.
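A minimal sketch of this alternation is given below, assuming a Laplace prior, a gaussian likelihood, and gradient-based MAP inference; the step sizes, iteration counts, and column normalization are illustrative rather than the original implementation.

```python
import numpy as np

def map_estimate(x, Phi, lam=1.0, sigma2=1.0, n_steps=200, lr=0.01):
    """Approximate the MAP estimate of eq. 2.6 by gradient ascent, assuming a Laplace prior."""
    z = np.zeros(Phi.shape[1])
    for _ in range(n_steps):
        grad = Phi.T @ (x - Phi @ z) / sigma2 - lam * np.sign(z)  # likelihood + prior gradients
        z += lr * grad
    return z

def dictionary_update(X, Phi, eta=0.1, sigma2=1.0):
    """One pass of eq. 2.7 over a batch of image patches X (one patch per row)."""
    grad = np.zeros_like(Phi)
    for x in X:
        z_hat = map_estimate(x, Phi, sigma2=sigma2)
        grad += np.outer(x - Phi @ z_hat, z_hat) / sigma2  # gradient of the gaussian log-likelihood
    Phi = Phi + eta * grad / len(X)
    return Phi / np.linalg.norm(Phi, axis=0, keepdims=True)  # renormalize columns for stability
```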
2.3 Fitting as Variational EM
2.4 Variational Autoencoders
It is worth noting that the first term in equation 2.26 relies on a stochastic pass through the encoder, given by a noisy sample of the latent $\mathbf{z}$ from $q_\phi(\mathbf{z} \mid \mathbf{x})$ (which is really an approximation to $p(\mathbf{z} \mid \mathbf{x})$, the conditional distribution of the latent given the data). This sample is then passed deterministically through the decoder network. The generative model noise variance $\sigma_x^2$ serves as an inverse weight that determines how much to penalize reconstruction error relative to the second term in the ELBO. The second term, in turn, can be seen as a Monte Carlo estimate of $-D_{\mathrm{KL}}\big(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z})\big)$, the negative KL divergence between the variational posterior and the prior over $\mathbf{z}$. Because both distributions are gaussian, the standard approach is to replace the Monte Carlo evaluation of this term with its true expectation, using the fact that the KL divergence between two gaussians can be computed analytically (Kingma & Welling, 2014; Rezende et al., 2014).
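As an illustration, a per-datum ELBO estimate of this form might be computed as follows (a hedged sketch in PyTorch; `enc_mean`, `enc_logvar`, and `decoder` are assumed network modules, not the authors' code):

```python
import torch

def vae_elbo(x, enc_mean, enc_logvar, decoder, sigma2_x):
    """Single-sample ELBO estimate: stochastic reconstruction term plus analytic gaussian KL."""
    mu, logvar = enc_mean(x), enc_logvar(x)
    std = torch.exp(0.5 * logvar)
    z = mu + std * torch.randn_like(std)        # reparameterized sample from q(z | x)
    x_hat = decoder(z)                          # deterministic pass through the decoder
    recon = -0.5 * torch.sum((x - x_hat) ** 2, dim=-1) / sigma2_x  # gaussian log-likelihood (up to a constant)
    # Analytic KL between N(mu, diag(exp(logvar))) and the standard normal prior N(0, I)
    kl = -0.5 * torch.sum(1 + logvar - mu ** 2 - logvar.exp(), dim=-1)
    return recon - kl
```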
In contrast to the iterative variational EM algorithm used for the classic sparse coding model, optimization of the VAE is carried out by simultaneous gradient ascent of the ELBO with respect to the generative network parameters $\theta$ and variational parameters $\phi$. During training, the per-datum ELBO in equation 2.26 is summed over a minibatch of data for each stochastic gradient ascent step.
3 Sparse Coding VAEs
In this article, we adapt the VAE framework to the sparse coding model, a sparse generative model motivated by theoretical coding principles. This involves three changes to the standard VAE: (1) we replace the deep neural network from the VAE generative model with a linear feedforward network; (2) we change the undercomplete latent variable representation to an overcomplete one, so that the dimension of the latent variables is larger than the dimension of the data; and (3) we replace the standard normal prior over the latent variables with heavy-tailed, sparsity-promoting priors (e.g., Laplace and Cauchy).
We leave the remainder of the VAE framework intact, including the conditionally gaussian variational distribution $q_\phi(\mathbf{z} \mid \mathbf{x})$, parameterized by a pair of neural networks that output the mean $\mu_\phi(\mathbf{x})$ and covariance $\Sigma_\phi(\mathbf{x})$ as a function of $\mathbf{x}$ (see section A.3 for an investigation of a Laplace variational posterior). The parameters of the SVAE are therefore given by the generative parameters $\{\Phi, \sigma_x^2\}$ and a prior $p(\mathbf{z})$, where $p(\mathbf{z})$, $\Phi$, and $\sigma_x^2$ specify the elements of a sparse coding model (a sparse prior, generative weight matrix, and gaussian noise variance, respectively), and the variational parameters $\phi$ are the weights of the recognition networks $\mu_\phi(\mathbf{x})$ and $\Sigma_\phi(\mathbf{x})$ governing the variational distribution $q_\phi(\mathbf{z} \mid \mathbf{x})$. Figure 2 shows a schematic comparing the two models.
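To make these three modifications concrete, here is a hedged sketch of the SVAE generative side and its ELBO (linear decoder, overcomplete latent, heavy-tailed prior). The dimensions, distribution choices, and function names below are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.distributions as dist

n_pix, n_latent = 144, 288   # overcomplete: more latent dimensions than pixels (illustrative sizes)
Phi = torch.nn.Parameter(0.01 * torch.randn(n_pix, n_latent))   # linear generative dictionary

def log_prior(z, kind="cauchy"):
    """Heavy-tailed, factorized prior over the latents (Laplace or Cauchy instead of gaussian)."""
    if kind == "laplace":
        return dist.Laplace(0.0, 1.0).log_prob(z).sum(-1)
    return dist.Cauchy(0.0, 1.0).log_prob(z).sum(-1)

def svae_elbo(x, mu, logvar, Phi, sigma2_x, prior="cauchy"):
    """Single-sample ELBO estimate; the prior term is Monte Carlo estimated because
    no closed-form KL is available against a Laplace or Cauchy prior."""
    std = torch.exp(0.5 * logvar)
    z = mu + std * torch.randn_like(std)          # reparameterized sample from q(z | x)
    x_hat = z @ Phi.T                             # linear decoder in place of a deep network
    recon = -0.5 * ((x - x_hat) ** 2).sum(-1) / sigma2_x
    log_q = dist.Normal(mu, std).log_prob(z).sum(-1)
    return recon + log_prior(z, prior) - log_q
```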
4 Methods
4.1 Data Preprocessing
We fit the SVAE to image patches sampled from the BSDS300 data set (Martin et al., 2001). Before fitting, we preprocessed the images and split the data into train, test, and validation sets. In the original sparse coding model, the images were whitened in the frequency domain, followed by a low-pass filtering stage. In Olshausen and Field (1996a), the whitening step was taken to expedite learning by accentuating high-frequency features, which would otherwise be far less prominent in natural image data dominated by low-frequency content. This is because the Fourier (frequency) components are approximately the principal components of natural images, but the overall variance of each component scales with the inverse frequency squared (the well-known spectral property of natural images), producing large differences in variance between the high- and low-frequency features. This poses a problem for gradient-based approaches to fitting the sparse coding model since low-variance directions dominate the gradients. The low-pass filtering stage then served to reduce noise and artifacts from the rectangular sampling grid.
We perform a slight variation of these original preprocessing steps, with the same overall effect. We whiten the data by performing PCA on the natural images and normalizing each component by its associated eigenvalue. Our low-pass filtering is achieved by retaining only the most significant components, which correspond to the lowest-frequency modes; this roughly corresponds to keeping a circumscribed circle in the square Fourier space of the image, removing the noisy, high-frequency corners of Fourier space.
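A minimal sketch of this preprocessing, assuming patches are stacked as rows of a matrix; the number of retained components, the variance normalization (here, division by each component's standard deviation), and the variable names are illustrative:

```python
import numpy as np

def pca_whiten(patches, n_keep=None):
    """PCA-whiten image patches and low-pass filter by keeping only the top components."""
    X = patches - patches.mean(axis=0)            # center each pixel
    cov = X.T @ X / len(X)
    evals, evecs = np.linalg.eigh(cov)            # eigenvalues in ascending order
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order], evecs[:, order]
    if n_keep is not None:                        # retain the highest-variance (lowest-frequency) modes
        evals, evecs = evals[:n_keep], evecs[:, :n_keep]
    return (X @ evecs) / np.sqrt(evals + 1e-8)    # project and equalize component variances
```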
4.2 SVAE Parameters
We parameterized the recognition models $\mu_\phi(\mathbf{x})$ and $\Sigma_\phi(\mathbf{x})$ using feedforward deep neural networks (see Figure 2). The two networks took as input a vector of preprocessed data and had a shared initial hidden layer of 128 rectified linear units. Each network had two additional hidden layers of 256 and 512 rectified linear units, respectively, which were not shared. These networks, parameterizing $\mu_\phi(\mathbf{x})$ and $\Sigma_\phi(\mathbf{x})$, each output a vector that encodes, respectively, the mean and main diagonal of the covariance of the posterior. We set the off-diagonal elements of the posterior covariance matrix to zero (this assumption, and its implications for the dependencies between latent coefficients, is discussed further in section 6). The final hidden layer in each network is densely connected to the output layer, with no nonlinearity for the mean output layer and a sigmoid nonlinearity for the variance output layer. In principle, the variance values should be encoded by a nonsaturating positive-definite nonlinearity; however, we found that this led to instability during the fitting process, and the sigmoid nonlinearity resulted in more stable behavior. Intuitively, given that our priors have scales of one, the posteriors will generally have variances less than one and can be expressed sufficiently well with the sigmoid nonlinearity.
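A hedged PyTorch sketch of this recognition architecture follows; the input and latent dimensions are placeholders rather than the original sizes:

```python
import torch.nn as nn

class RecognitionNet(nn.Module):
    """Shared first layer, then separate branches for the posterior mean and variance."""
    def __init__(self, n_in=144, n_latent=288):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(n_in, 128), nn.ReLU())

        def branch(out_act):
            layers = [nn.Linear(128, 256), nn.ReLU(),
                      nn.Linear(256, 512), nn.ReLU(),
                      nn.Linear(512, n_latent)]
            if out_act is not None:
                layers.append(out_act)
            return nn.Sequential(*layers)

        self.mean_net = branch(None)            # no output nonlinearity for the mean
        self.var_net = branch(nn.Sigmoid())     # sigmoid keeps the variances in (0, 1)

    def forward(self, x):
        h = self.shared(x)
        return self.mean_net(h), self.var_net(h)
```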
4.3 Optimization
We optimized the SVAE using the PyTorch (Paszke et al., 2017) machine learning framework. Gradient descent was performed for 128 epochs with the Adam optimizer (Kingma & Ba, 2014) with default parameters, a batch size of 32, and a fixed learning rate. The networks always converged well within this number of gradient descent steps. We tested a range of values for the number of Monte Carlo integration samples and found that this parameter did not influence the results. We used the same learning hyperparameters for all three priors.
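For concreteness, a training loop of this form might look as follows. This is a sketch under assumed hyperparameters: the learning rate and data loader are placeholders, and `RecognitionNet` and `svae_elbo` refer to the illustrative sketches above rather than the authors' code.

```python
import torch

def train_svae(loader, recog, Phi, sigma2_x=1.0, epochs=128, lr=1e-4):
    """Jointly ascend the ELBO in the recognition weights and the linear dictionary."""
    opt = torch.optim.Adam(list(recog.parameters()) + [Phi], lr=lr)
    for _ in range(epochs):
        for x in loader:                                   # minibatches of whitened patches
            mu, var = recog(x)
            loss = -svae_elbo(x, mu, torch.log(var), Phi, sigma2_x).mean()  # negative ELBO
            opt.zero_grad()
            loss.backward()
            opt.step()
    return recog, Phi
```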
4.4 Evaluating Goodness of Fit with Annealed Importance Sampling
To evaluate the goodness of fit, we used annealed importance sampling (AIS; Neal, 2001). We refer to appendix A.1 for a primer on the topic and on the use of the method for evaluating log-likelihoods. Our estimates used 1000 samples with 16 independent chains, a linear annealing procedure over 200 intermediate distributions, and a transition operator consisting of one HMC trajectory with 10 leapfrog steps. Furthermore, we tuned the HMC step size to achieve (within absolute tolerance) the optimal acceptance rate of 0.65 (Neal, 2011; Wu et al., 2016). We did so using a simple gradient-based scheme: iterating over single batches of input samples, we computed the average HMC acceptance rate obtained when running AIS on those samples and updated the step size in proportion to the difference between the measured and target acceptance rates, repeating until convergence (see the sketch below). Finally, we note that AIS gives only a lower bound on the log-likelihood (Wu et al., 2016). Even though an exact expression is available for the log likelihood in the gaussian case, we use AIS for the gaussian prior as well so that the resulting values are directly comparable.
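The step-size adaptation could be sketched as follows. This is a hedged illustration: `run_ais_batch` is an assumed helper returning the mean HMC acceptance rate for a batch, and the exact update rule and constants used by the authors are not reproduced here.

```python
def tune_hmc_step_size(batches, run_ais_batch, eps=0.05, target=0.65,
                       lr=0.02, tol=0.05, max_iters=200):
    """Adjust the HMC step size until the average acceptance rate is near the target."""
    for batch in batches[:max_iters]:
        accept = run_ais_batch(batch, eps)      # average acceptance rate over this batch
        if abs(accept - target) < tol:
            break
        eps += lr * (accept - target)           # raise eps if accepting too often, lower it otherwise
        eps = max(eps, 1e-4)                    # keep the step size positive
    return eps
```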
5 Results
5.1 Quality of Fit
We first assess our model by calculating goodness-of-fit measures. We compare the various choices of prior distribution using both the SVAE, optimized with VAE-based inference, and the original sparse coding model fit with the method of Olshausen and Field (1996a). To perform this assessment, we computed the log-likelihood of test data under the fit parameters, using AIS. Figure 3A shows that the log-likelihood is monotonically increasing over training for all priors. We observe that, of the priors tested, sparser distributions result in higher log-likelihoods, with the Cauchy prior providing the best fit. To explore the utility of the sparse coding VAE over the approximate EM method, we similarly calculated the log-likelihoods for the method in Olshausen and Field (1996a), again for each of the three prior distributions. We observe that the log-likelihood goodness of fit for the VAE is higher than the equivalent fits obtained with the method of Olshausen and Field (1996a; see Table 1). This is because the sparse coding VAE uses a more robust approximation of the log likelihood, and the variational posteriors of VAEs are more informative than the posterior mode alone. Indeed, we report in Figure 3B the inferred latent values for this variational posterior in comparison to the posterior mode. The closer the approximation is to the true posterior, the closer this histogram of latent values would be to the prior. We see that the inferred latent values of the sparse coding VAEs provide a better approximation of the prior than those obtained with the Dirac delta approximation.
| Implementation (nats) | Cauchy prior | Laplace prior | Gaussian prior |
|---|---|---|---|
| VAE | 57 | 50 | 27 |
| Traditional method | 221 | 117 | 127 |
Note: Values were calculated using AIS with HMC transitions and show improvement across all prior choices.
Finally, we compare in Figure 4 how the learned basis functions in the sparse coding VAE framework (see appendix, Figure 6, for example features under each prior distribution) differ from those learned with standard sparse coding (Olshausen & Field, 1996a). We analyzed the learned basis functions by fitting a Gabor filter to each feature and report the estimated frequency and orientation distributions (see appendix A.2 for more details). We observe a persistent increase in frequency for the SVAE relative to standard sparse coding across all priors, along with a heavier-tailed frequency distribution. For orientation, the trend is less clear, but we generally observe a more variable orientation distribution for the SVAE basis functions, as opposed to the flatter distribution obtained with standard sparse coding.
5.2 Feedforward Inference Model
As a consequence of training a VAE, we obtain a neural network that performs approximate Bayesian inference. Previous mechanistic implementations of sparse coding use the MAP estimates under the true posterior to model trial-averaged neural responses (Rozell et al., 2008; Boerlin & Denève, 2011; Gregor & LeCun, 2010; Martins et al., 2011). In our case, the recognition model performs approximate inference using a more expressive approximation, suggesting that it may serve as an effective model of visual cortical responses. To study the response properties of the feedforward inference network, we simulated neural activity as the mean of the recognition distribution, with stimuli taken to be sinusoidal gratings of various sizes, contrasts, angles, frequencies, and phases. For each set of grating parameters, we measured responses to both a cosine- and a sine-phase grating. To enforce nonnegative responses and approximate phase invariance, the responses shown in Figure 5 are the root sum-of-squares of the responses to both phases (see the sketch below).
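As an illustration, these phase-invariant responses could be computed as follows; `recognition_mean` is an assumed function returning the mean of the recognition distribution for a (suitably preprocessed) patch, and the patch size is a placeholder:

```python
import numpy as np

def grating(size, freq, theta, phase, contrast=1.0):
    """Sinusoidal grating patch: spatial frequency in cycles/pixel, orientation theta in radians."""
    y, x = np.mgrid[:size, :size] - size / 2
    u = x * np.cos(theta) + y * np.sin(theta)
    return contrast * np.cos(2 * np.pi * freq * u + phase)

def phase_invariant_response(recognition_mean, size, freq, theta, contrast=1.0):
    """Root sum-of-squares of the responses to cosine- and sine-phase gratings."""
    r_cos = recognition_mean(grating(size, freq, theta, 0.0, contrast))
    r_sin = recognition_mean(grating(size, freq, theta, np.pi / 2, contrast))
    return np.sqrt(r_cos ** 2 + r_sin ** 2)
```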
Figure 5 shows the performance of the recognition network in response to sinusoidal gratings of various sizes, contrasts, angles, and phases. We found that the responses of the recognition model exhibited frequency and orientation tuning (see Figures 5B and 5C), reproducing important characteristics of cortical neurons (Hubel & Wiesel, 1962) and reflecting the Gabor-like structure of the dictionary elements. Additionally, Figure 5D demonstrates that the orientation tuning of these responses is invariant to grating contrast, as observed in cortical neurons (Troyer et al., 1998). Figure 5A shows that the recognition model exhibits surround suppression, in which the response to an optimally tuned grating first increases with grating size and then decreases as the scale of the grating increases beyond the classical receptive field (Sceniak et al., 1999). Finally, Figure 5E shows the receptive fields of recognition model neurons, as measured by a reverse-correlation experiment (Ringach & Shapley, 2004), verifying that the linear receptive fields exhibit the same Gabor-like properties as the dictionary elements.
6 Discussion
We have introduced a variational inference framework for the sparse coding model based on the VAE. The resulting SVAE model offers a more principled and accurate method for fitting the sparse coding model and comes equipped with a neural implementation of feedforward inference under the model. We showed first that the classic fitting method of Olshausen and Field is equivalent to variational inference under a delta function variational posterior. We then extended the VAE framework to incorporate the sparse coding model as a generative model. In particular, we replaced the standard deep network of the VAE with an overcomplete latent variable governed by a sparse prior and showed that variational inference using a conditionally gaussian recognition distribution provided accurate, neurally plausible feedforward inference of latent variables from images. Additionally, the SVAE provided improved fitting of the sparse coding model to natural images, as measured by the test log likelihood. Moreover, we showed that the associated recognition model recapitulates important response properties of neurons in the early mammalian visual pathway.
Given this demonstration of VAEs for fitting tailor-made generative models, it is important to ask whether VAEs have additional applications in theoretical neuroscience. Specifically, many models are constrained by their ability to be fit. Our technique may allow more powerful yet still highly structured generative models to be practically applied by making learning tractable. Because VAEs provide an explicit model of the posterior (as opposed to the Dirac delta approximation), connections can now be drawn between generative models (e.g., sparse coding) and models that depend explicitly on neural variability, a property often tied to confidence levels in encoding. Current models of overdispersion (e.g., Goris et al., 2014; Charles et al., 2018) shy away from proposing mechanistic explanations to accompany their statistical descriptions. Another important line of future work, then, is to explore whether the posterior predictions made by SVAEs, or VAEs tuned to other neuroscience models, can account for observed spiking behaviors.
6.1 Relationship to Previous Work
The success of the sparse coding model as an unsupervised learning method for the statistics of the natural world has prompted an entire field of study into models for sparse representation learning and implementations of such models in artificial and biological neural systems. In terms of the basic mathematical model, many refinements and expansions have been proposed to better capture the statistics of natural images, including methods to more strictly induce sparsity (Garrigues & Olshausen, 2008; Girolami, 2001; Olshausen & Millman, 2000; Lewicki & Olshausen, 1999), hierarchical models to capture higher-order statistics (Karklin & Lewicki, 2003, 2005, 2009; Garrigues & Olshausen, 2010), sampling-based learning techniques (Berkes et al., 2008; Theis et al., 2012), constrained dictionary learning for nonnegative data (Charles et al., 2011), and applications to other modalities, such as depth (Tǒsic et al., 2011), motion (Cadieu & Olshausen, 2009), and auditory coding (Smith & Lewicki, 2006).
Given the success of such models in statistically describing visual responses, the mechanistic question of how a neural substrate could implement sparse coding also became an important research area. Neural implementations of sparse coding have branched into two main directions: recurrent (Rozell et al., 2008; Boerlin & Denève, 2011) and feedforward neural networks (Gregor & LeCun, 2010; Martins et al., 2011). The recurrent network models have been shown to provably solve the sparse-coding problem (Rozell et al., 2008; Shapero et al., 2014; Schwemmer et al., 2015). Furthermore, recurrent models can implement hierarchical extensions (Charles et al., 2012) as well as replicate key properties of visual cortical processing, such as nonclassical receptive fields (Zhu & Rozell, 2013). The feedforward models are typically based either on mimicking the iterative processing of recursive algorithms, such as the iterative soft-thresholding algorithm (ISTA; Daubechies et al., 2004), or on leveraging unsupervised techniques for learning deep neural networks, such as optimizing autoencoders (Makhzani & Frey, 2013). The resulting methods, such as learned ISTA (LISTA; Gregor & LeCun, 2010; Borgerding et al., 2017), provide faster feedforward inference than their RNN counterparts (see note 2), at the cost of losing theoretical guarantees on the estimates.
Although the details of these implementations vary, all essentially retain the Dirac delta posterior approximation and are thus constructed to calculate MAP estimates of the coefficients for use in gradient-based feedback to update the dictionary. None of these methods reassess this basic assumption, and so they are limited in the overall accuracy of the marginal log-likelihood estimation of the model, as well as in their ability to generalize beyond inference-based networks to other theories of neural processing, such as probabilistic coding (Fiser et al., 2010; Orbán et al., 2016). In this work, we have taken advantage of the refinement of VAEs in the machine learning literature to revisit this initial assumption from the influential early work and create just such a posterior-seeking neural network. Specifically, VAEs can provide a nontrivial approximation of the posterior via a fully Bayesian learning procedure in a feedforward neural network model of inference under the sparse coding model. Additional benefits of VAEs over nonvariational autoencoders with similar goals (e.g., LISTA) are emergent properties such as robustness to outliers and local minima, as observed in recent analysis (Dai et al., 2018). Recent work by Velychko et al. (2023) has also leveraged the formalism offered by the variational framework, focusing on analytical and entropy-based derivations of the ELBO rather than on the parallels with the original sparse coding work presented here. As VAEs have advanced, so have their abilities to account for more complex statistics in the latent representation layer. Discrete-type distributions, such as the Concrete or Gumbel-Softmax distributions, enable categorical modeling that more closely resembles a form of sparsity (Van Den Oord et al., 2017; Maddison et al., 2016a; Jang et al., 2016). Nonlinear ICA (Khemakhem et al., 2020) and other disentangling methods (Chen et al., 2018) seek distributions that maximize independence between the latent representation variables; however, they rely on highly nonlinear decoding networks, which removes the interpretability of the latents with respect to the data.
Obtaining the posterior distribution is especially important given that neural variability and spike-rate overdispersion can be related to the uncertainty in the posterior over the generative coefficients (e.g., via probabilistic coding; Fiser et al., 2010; Orbán et al., 2016). Finding a neural implementation of sparse coding that also estimates the full posterior would be an important step toward bridging the efficient and probabilistic coding theories. Other variational sparse coding models have tended to sacrifice a neurally plausible implementation (Berkes et al., 2008; Theis et al., 2012; Seeger, 2008). Our work complements recent efforts to connect the sparse coding model with tractable variational inference networks. These works have focused on either the nonlinear sparsity model (Salimans, 2016) or the linear generative model of sparse coding (Aitchison et al., 2018). Our work can also be thought of as generalizing models that use, for example, expectation propagation (EP) to approximate the posterior distribution (Seeger, 2008). Other related work uses traditional VAEs and then performs sparse coding in the latent space (Sun et al., 2018). To date, however, no neurally plausible variational method has been designed to capture the three fundamental characteristics of sparse coding: overcomplete codes, sparse priors, and a linear generative model.
6.2 Limitations and Future Directions
The state-of-the-art results of this work are primarily the result of orienting sparse coding in a variational framework in which more expressive variational posterior distributions can be used for model fitting. Nevertheless, this work represents only a first step in this direction. One area for improvement is our selection of a gaussian variational posterior with a diagonal covariance matrix. This choice was meant to expand the Dirac delta posterior approximation of traditional sparse coding to include uncertainty. This model, however, still restricts the latent variables to be uncorrelated under the posterior distribution, which can limit the variational inference method's performance (Mnih & Gregor, 2014; Turner & Sahani, 2011).
For example, the SVAE posterior employed here cannot directly account for the “explaining-away” effect that occurs between the activations of overlapping dictionary elements (Pearl, 1988; Yu et al., 2022). Instead, explaining away must be learned in the recognition model parameters, much as MAP estimation in traditional sparse coding allows for interplay between estimates despite its factorial Dirac delta posterior. This form of explaining away can indeed be seen in the simulated encoding responses in Figure 5, which capture “extraclassical” receptive field effects such as end-stopping and contrast-invariant orientation tuning (Zhu & Rozell, 2013). One important difference is that traditional sparse coding infers all posterior parameters anew for each small batch, allowing more flexible explaining away when certain parameters are set to zero, whereas relearning the recognition network from scratch is much more computationally intensive. An important next step, then, is to build nontrivial correlations directly into the variational posterior.
In addition to improved learning, a more complex posterior would give the recognition model the potential to exhibit interesting phenomena associated with correlations between dictionary element activations. While some computational models aim to account for such correlations (Cadieu & Olshausen, 2009; Karklin & Lewicki, 2009; Averbeck et al., 2006), the SVAE framework would allow for the systematic analysis of the many assumptions possible in the population coding layer within the sparse coding framework. These various assumptions can thus be validated against the population correlations observed in biological networks (Ecker et al., 2011; Cohen & Kohn, 2011). For example, if V1 responses are interpreted as arising via the sampling hypothesis (Fiser et al., 2010), then explaining away may account for correlations in neural variability observed in supra- and infragranular layers of V1 (Hansen et al., 2012). Additionally, such correlations can be related to probabilistic population coding (Fiser et al., 2010; Orbán et al., 2016) where correlated variability represents correlated uncertainty in the neural code.
In this work we restricted the sparse coding model by choosing the magnitude of the output noise variance a priori. This was done in order to make this work comparable to the original sparse coding implementations of Olshausen and Field (1996a). Nevertheless, this parameter can be fit in a data-driven way as well, providing additional performance beyond the current work. In future explorations, this constraint may be relaxed.
One of the favorable properties of sparse coding using MAP inference with a Laplace prior is that truly sparse representations are produced, in the sense that a finite fraction of latent variables is inferred to be exactly zero. As a consequence of the more robust variational inference we perform here, the SVAE no longer has this property. Sparse representations could be regained within the SVAE framework if truly sparse priors were used, that is, priors that place a finite fraction of their probability mass at exactly zero, such as “spike and slab” priors (Garrigues & Olshausen, 2008; Ziniel & Schniter, 2013), or with more structured parameterizations of the latent space (Keller & Welling, 2021). These sparse priors are not differentiable and thus cannot be directly incorporated into the framework presented here; however, continuous approximations of sparse priors do exist, and during this work we implemented an approximate spike-and-slab prior using a sum of two gaussians with a small and a large variance, respectively. We were unable in our setting, however, to replicate the superior performance of such priors seen elsewhere and recommend additional explorations into incorporating hard-sparse priors such as the spike-and-slab or concrete (Maddison et al., 2016b) distributions. Furthermore, we also investigated using the Laplace distribution as a more heavy-tailed alternative for the variational posterior in our VAE methodology. As reported in Table 2, we did not find a performance benefit from such a variational posterior, which performed worse across all three prior families considered.
| Posterior (nats) | Cauchy prior | Laplace prior | Gaussian prior |
|---|---|---|---|
| Gaussian | 53.7 | 43.1 | 27.9 |
| Laplace | 51.7 | 28.7 | 9.1 |
Note: Values were calculated using AIS with HMC transitions.
We have restricted ourselves in this work to a relatively simple generative model of natural images. It has been noted, however, that the sparse coding model does not account entirely for the statistics of natural images (Simoncelli & Olshausen, 2001). Hierarchical variants of the sparse coding model (Karklin & Lewicki, 2003, 2005, 2009; Garrigues & Olshausen, 2010) provide superior generative models of natural images. These more complex generative models can be implemented and fit using the same methods we present here by explicitly constructing the VAE generative model in their image.
VAEs have an inherent tendency to prune unused features in the generative networks (Dai et al., 2018). Previous work has noted this effect in the case of standard gaussian priors. We note that similar pruning occurs in the SVAE when the priors are exponential or Cauchy as well. In these cases, the feature representation remains overcomplete, to a level of approximately 1.3, which is similar to the inferred optimal overcompleteness seen in previous work (Berkes et al., 2008). Therefore, the SVAE implicitly infers the overcompleteness level, a feature that previous models had to account for explicitly (Karklin & Lewicki, 2005).
The biological plausibility of our method relies on a feedforward architecture that quickly turns input stimuli into approximate posterior distributions over the latent coefficients. While inference under this model is completely local and feedforward, learning through backpropagation can potentially result in more complex interactions that are not obviously tractable in a local neural setting. An interesting branch of work, however, aims to place backpropagation, as used here and in LISTA, in a biologically plausible framework (Durbin & Rumelhart, 1989; Bengio et al., 2015). Additionally, on the topic of biological plausibility, we observe that the differences in the properties of the learned basis functions relative to standard sparse coding (see Figure 4 and section 5.1) show some consistency with the frequency (Foster et al., 1985) and orientation (Ringach, 2002) statistics of macaque primary visual cortex (V1) receptive fields. Future work could further investigate the properties of the learned features and formalize this potential relationship with biological features.
One final note is that the SVAE breaks the typical symmetry between encoder and decoder complexity: the encoder is very high-dimensional, while the decoder is a simple linear model. Despite this asymmetry, the SVAE retains the ability to direct the encoder's complexity toward the underlying statistics, extracting both the representation and the generative dictionary in an unsupervised fashion. In this optimization, however, burdening the encoder with added complexity is not a concern: it is the details of the linear decoder that matter, and many local minima of the deep recognition network likely achieve similar performance by that measure. In other statistical regimes and desired tasks, the opposite may be true (i.e., only the encoder details matter in the cost), and so flipping this complexity asymmetry would be an interesting path toward extending this philosophy further.
7 Conclusion
In summary, we have cast the sparse coding model in the framework of variational inference and demonstrated how modern tools such as the VAE can be used to develop neurally plausible algorithms for inference under generative models. We feel that this work strengthens the connection between machine learning methods for unsupervised learning of natural image statistics and efforts to understand neural computation in the brain and hope it will inspire future studies along these lines.
Appendix A
A.1 A Primer on Annealed Importance Sampling
In particular, the likelihood $p(\mathbf{x}) = \int p(\mathbf{x} \mid \mathbf{z})\, p(\mathbf{z})\, d\mathbf{z}$ is the factor that normalizes the posterior in the expression $p(\mathbf{z} \mid \mathbf{x}) = p(\mathbf{x} \mid \mathbf{z})\, p(\mathbf{z}) / p(\mathbf{x})$.
The importance weight $w$ gives an unbiased estimate of the ratio of normalizing constants, which, in the case that the initial distribution is normalized and the target is the unnormalized posterior above, is equal to the likelihood. Working with $w$ directly generally leads to numerical overflow problems, so we instead calculate $\log w$, which in general gives a biased estimate of (a lower bound on) the log likelihood. Averaging over many independent samples generated from AIS gives a lower bound on the log likelihood. The AIS method has been applied previously to evaluating goodness of fit for VAEs and other deep generative models (Wu et al., 2016).
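For readers who prefer code, here is a hedged sketch of an AIS estimate of the log likelihood under the sparse coding model. A simple random-walk Metropolis transition stands in for the HMC transitions used in the article, and `log_prior` and `sample_prior` are assumed helpers (e.g., for a factorized Cauchy prior); all settings are illustrative.

```python
import numpy as np

def ais_log_likelihood(x, Phi, sigma2_x, log_prior, sample_prior,
                       n_dists=200, n_chains=16, mh_step=0.05):
    """AIS estimate of log p(x) for the linear-gaussian likelihood, annealing from the
    (normalized) prior at beta = 0 to prior * likelihood at beta = 1."""
    const = -0.5 * x.size * np.log(2 * np.pi * sigma2_x)

    def log_lik(z):
        resid = x - z @ Phi.T
        return const - 0.5 * np.sum(resid ** 2, axis=-1) / sigma2_x

    betas = np.linspace(0.0, 1.0, n_dists + 1)
    z = sample_prior(n_chains, Phi.shape[1])      # one starting point per independent chain
    log_w = np.zeros(n_chains)
    for b_prev, b in zip(betas[:-1], betas[1:]):
        log_w += (b - b_prev) * log_lik(z)        # accumulate the log importance weights
        # One random-walk Metropolis step targeting p(z) * p(x|z)^b (stand-in for HMC)
        prop = z + mh_step * np.random.randn(*z.shape)
        log_alpha = (log_prior(prop) + b * log_lik(prop)) - (log_prior(z) + b * log_lik(z))
        accept = np.log(np.random.rand(n_chains)) < log_alpha
        z[accept] = prop[accept]
    # Average the (unbiased) weights across chains; the log of this average lower-bounds log p(x)
    return np.logaddexp.reduce(log_w) - np.log(n_chains)
```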
A.2 Methodological Details
A.2.1 Analysis of Learned Basis Functions
Here we provide more details on the analysis, presented in the main text, of the basis functions that emerge from the SVAE fitting procedure compared with standard sparse coding. We analyzed the learned basis functions by fitting Gabor filters to them. Gabor filters in pixel space are gaussian densities in Fourier space, and the orientation and frequency statistics we seek can be read off directly from this density. The density peak lies at the spatial frequency of the filter (Gabor, 1946; Movellan, 2002), and the frequency and orientation metrics we report are its polar coordinates: the frequency is the magnitude of the peak location, and the orientation is the angle to the peak.
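A minimal sketch of this extraction is given below; it reads the peak of the 2D Fourier amplitude spectrum directly rather than performing the full Gabor fit described above, and assumes a square basis function.

```python
import numpy as np

def frequency_and_orientation(basis_fn):
    """Return (spatial frequency in cycles/pixel, orientation in radians) of a basis function,
    taken from the peak of its 2D Fourier amplitude spectrum."""
    F = np.fft.fftshift(np.abs(np.fft.fft2(basis_fn)))
    n = basis_fn.shape[0]
    fy, fx = np.unravel_index(np.argmax(F), F.shape)
    # Convert array indices to signed frequency coordinates (cycles per pixel)
    u, v = (fx - n // 2) / n, (fy - n // 2) / n
    return np.hypot(u, v), np.arctan2(v, u)
```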
A.3 Heavy-Tailed Variational Posterior
We implemented a Laplace posterior as a heavier-tailed alternative to the gaussian posterior employed in this work. We used the Laplace reparameterization trick to achieve this, analogous to the gaussian case. We observe in Table 2 that across all three priors, the Laplace posterior performed worse.
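One standard way to reparameterize a Laplace sample is inverse-CDF sampling from uniform noise; the sketch below illustrates this and is not the authors' exact implementation.

```python
import torch

def laplace_rsample(mu, b):
    """Reparameterized Laplace sample: z = mu - b * sign(u) * log(1 - 2|u|), u ~ Uniform(-1/2, 1/2)."""
    u = torch.rand_like(mu) - 0.5
    return mu - b * torch.sign(u) * torch.log1p(-2 * torch.abs(u))
```

Equivalently, `torch.distributions.Laplace(mu, b).rsample()` provides the same reparameterized sampling.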
Acknowledgments
We acknowledge Ryan Pyle for involvement in early stages of this work. V.G. was supported by awards from the Natural Sciences and Engineering Research Council of Canada (PGSD3-557875-2021) and the Fonds de Recherche du Québec Nature et technologies (B2X 297667). G.B. acknowledges the Marine Biological Laboratory in Woods Hole, NIMH funding for the Methods in Computational Neuroscience course (R25MH062204), and support from the Simons Foundation. J.W.P. was supported by grants from the Simons Foundation (SCGB AWD543027), the NIH (R01EY017366), the NIH BRAIN initiative (NS104899 and 9R01DA056404-04), and the CAREER award (IIS-1150186).
Note that the ELBO isn’t actually increasing because, as noted above, it is always formally . We could, however, justify this approach with a careful appeal to a finite-variance that approaches a delta function in the limit.
2. We note that this is true for digital processing only and that analog recurrent systems can be far faster (Shapero et al., 2014).
References
Author notes
Victor Geadah and Gabriel Barello contributed equally.