The sparse coding hypothesis has generated significant interest in the computational and theoretical neuroscience communities, but there remain open questions about the exact quantitative form of the sparsity penalty and the implementation of such a coding rule in neurally plausible architectures. The main contribution of this work is to show that a wide variety of sparsity-based probabilistic inference problems proposed in the signal processing and statistics literatures can be implemented exactly in the common network architecture known as the locally competitive algorithm (LCA). Among the cost functions we examine are approximate ℓ_p norms, modified ℓ_1 norms, block-ℓ_1 norms, and reweighted algorithms. Of particular interest, we show significantly increased performance in reweighted algorithms by inferring all parameters jointly in a dynamical system rather than using the iterative approach native to digital computational architectures.
New experimental approaches over the past decades have provided a closer look at how sensory nervous systems such as the visual cortex process information about their environment. Over this time, it has become increasingly evident that the canonical linear-nonlinear model, where cells encode visual information via linear filtering followed by a nonlinearity (e.g., thresholding and saturation), is inadequate to describe the complex processing performed by the sensory cortex. For example, this type of linear-nonlinear model does not capture the rich variety of nonlinear response properties and contextual modulations observed in V1 (Seriès, Lorenceau, & Frégnac, 2003).
Many theoretical neuroscientists have postulated high-level coding and computational principles for sensory cortex to extend our understanding of these systems. In many cases, these proposals are based generally around probabilistic Bayesian inference (Doya, 2007) due to the natural fit with ecological goals and evidence from perceptual tasks in humans (Battaglia, Jacobs, & Aslin, 2003; Hürlimann, Kiper, & Carandini, 2002). Many other researchers have postulated complementary models based on ideas of efficient coding, where information is encoded by removing redundant aspects of the stimulus. A wide variety of interesting models have appeared related to this broad principle of efficient coding, with selected examples including models using predictive coding (Rao & Ballard, 1999; Spratling, 2011), divisive normalization (Schwartz & Simoncelli, 2001), and directly encoding statistical variations (Karklin & Lewicki, 2008; Coen-Cagli, Dayan, & Schwartz, 2012).
The sparse coding hypothesis is one interpretation of efficient coding that has generated significant interest in the computational and theoretical neuroscience communities. In this model, a population of cells performs Bayesian inference to determine the environmental causes of a stimulus, with a goal of using as few simultaneously active units in the encoding as possible. Distributed sparse neural codes have several potential benefits over dense linear codes, including explicit information representation and easy decodability at higher processing stages (Olshausen & Field, 2004), metabolic efficiency (due to the significant cost of producing and transmitting action potentials; Lennie, 2003), and increased capacity of associative and sequence memory models (Baum, Moody, & Wilczek, 1988; Charles, Yap, & Rozell, 2012). The interest in the sparse coding model was originally generated when it was shown that this simple principle (combined with the statistics of natural images) is sufficient to explain the emergence of V1 receptive field shapes both qualitatively (Olshausen & Field, 1996) and quantitatively (Rehn & Sommer, 2007). More recently, electrophysiology experiments report results consistent with sparse coding (Haider et al., 2010; Vinje & Gallant, 2002), and simulation results have demonstrated that the sparse coding model can account for a wide variety of nonlinear response properties (called nonclassical receptive field effects) reported in single cells and population studies of V1 (Zhu & Rozell, 2012).
Despite this interest, there are many open fundamental questions related to the sparse coding model. First, what exactly is the proper notion of sparsity to use during inference? The original work in the computational neuroscience literature proposed several potential sparsity-inducing cost functions (Olshausen & Field, 1996), and recent work (motivated by strong theoretical results in the signal processing and applied mathematics communities) has seen people gravitate toward the ℓ_1 norm. While the main qualitative results appear to be relatively robust to the detailed choice of the sparsity-inducing cost function, the broader signal processing and statistics communities have proposed several alternative cost functions that have appealing computational or statistical properties and may be valuable alternatives. Second, how would such a coding principle be implemented in biologically plausible computational architectures? The computation necessary to implement an inference process with a sparsity penalty amounts to solving a nonsmooth optimization problem that is notoriously challenging (e.g., many gradient-based methods are wildly inefficient due to the nonsmooth nature of the objective). Recent theoretical work has demonstrated several network architectures that can efficiently compute sparse coefficients (Rehn & Sommer, 2007; Rozell, Johnson, Baraniuk, & Olshausen, 2010; Zylberberg, Murphy, & DeWeese, 2011; Perrinet, Samuelides, & Thorpe, 2004). Interestingly, the sparse coding problem has become very prominent in modern signal processing, for example, for use in inverse problems (Elad, Figueiredo, & Ma, 2010) and computer vision (Wright et al., 2010). There is also increasing interest in leveraging the computational benefits of analog neuromorphic architectures for these problems (Shapero, Charles, Rozell, & Hasler, in press; Shapero, Rozell, & Hasler, 2012).
The main contribution of this work is to show that a wide variety of sparsity-based probabilistic inference problems can be implemented exactly in the common network architecture known as the locally competitive algorithm (LCA) (Rozell et al., 2010). The LCA is a type of Hopfield network specifically designed to incorporate nonlinear thresholding elements that make it particularly efficient for solving the nonsmooth optimization problems necessary for sparse coding. In particular, we examine sparsity-based approaches described in the recent statistics and signal processing literature to show that many proposed signal models based on sparsity principles can be implemented efficiently in this common neural architecture. Among the cost functions we examine are approximate ℓ_p norms, modified ℓ_1 norms that combine the desirable properties of different statistical models, block-ℓ_1 norms for use in hierarchical models that impose correlations among the active variables, and reweighted algorithms that use a hierarchical probabilistic model to achieve more efficient encodings. Of particular interest, we show significantly increased performance in reweighted algorithms by inferring all parameters jointly in a dynamical system rather than using the iterative approach native to digital computational architectures. Preliminary results related to this work were reported in Rozell and Garrigues (2010).
2. Background and Related Work
2.1. Sparse Coding.
2.2. Dynamical Systems for Minimization
Recent work in computational neuroscience has shown that the LCA dynamical system provably solves the optimization programs in equation 2.2 and is efficient for solving the nonsmooth problems of interest in sparse approximation. The LCA (Rozell et al., 2010) comprises a network of analog nodes driven by the signal to be approximated. Each node competes with its neighbors for the chance to represent the signal, and the steady-state response represents the solution to the optimization problem.
The LCA is a specific instance of the Hopfield neural network architecture that has a long history of being used to solve optimization problems (Hopfield, 1982). It is a neurally plausible architecture, consisting of a network of parallel nodes that use computational primitives well matched to individual neuron models. In particular, each node consists of a leaky integrator and a nonlinear thresholding function, and it is driven by both feedforward and lateral (inhibitory and excitatory) recurrent connections. This architecture has been implemented in neuromorphic hardware as a purely analog system (Shapero, Charles et al., in press) and by using integrate-and-fire spiking neurons for each node (Shapero, Rozell, et al., 2012). We also note that other network structures have recently been proposed to approximately solve specific versions of the sparse approximation problem (Rehn & Sommer, 2007; Perrinet et al., 2004; Zylberberg et al., 2011; Hu, Genkin, & Chklovskii, 2012).
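As a concrete illustration of these dynamics, the following is a minimal numerical sketch of an LCA solving the BPDN (ℓ_1) case: each node is a leaky integrator driven by a feedforward term Φᵀs and inhibited through lateral connections ΦᵀΦ − I, with a soft-threshold activation. The Euler integration and all parameter values are illustrative assumptions, not the letter's implementation.

```python
import numpy as np

def soft_threshold(u, lam):
    """Two-sided soft-threshold activation (the BPDN / l1 special case)."""
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def lca(s, Phi, lam=0.1, tau=1.0, dt=0.1, n_steps=2000):
    """Euler-integrate the LCA dynamics
        tau * du/dt = -u + Phi^T s - (Phi^T Phi - I) @ a,   a = T_lam(u),
    and return the steady-state coefficients a."""
    n = Phi.shape[1]
    G = Phi.T @ Phi - np.eye(n)   # lateral (inhibitory/excitatory) connections
    b = Phi.T @ s                 # feedforward drive from the input signal
    u = np.zeros(n)               # internal (membrane-like) state of each node
    for _ in range(n_steps):
        a = soft_threshold(u, lam)
        u += (dt / tau) * (-u + b - G @ a)
    return soft_threshold(u, lam)
```

At steady state, the thresholded outputs balance the feedforward drive against the lateral competition, which is why the network's fixed points coincide with minima of the sparse coding objective.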
3. Alternate Inference Problems in the LCA Architecture
Using the basic relationships described in equations 2.5 and 2.6, we can optimize a variety of cost functions in the same basic LCA structure by analytically determining the corresponding activation function.1 These optimization programs include approximate ℓ_p norms, modified ℓ_1 norms that attempt to achieve better statistical properties than BPDN, the group/block-ℓ_1 norm that induces coactivation structure on the nonzero coefficients, reweighted ℓ_1 and ℓ_2 algorithms that represent hierarchical statistical models on the coefficients, and classic Tikhonov regularization.
Before exploring specific alternate cost functions in the remainder of this section, it is worthwhile to offer a technical note regarding the optimization programs that can be implemented in the LCA architecture. The strong theoretical convergence guarantees established for the LCA (Balavoine et al., 2012) apply to a wide variety of possible systems, but they do impose some conditions on the permissible activation functions. We will rely on these same conditions to analytically determine the relationship between the cost and activation functions for the examples in this section. Translated into conditions on the cost functions, the convergence results for the LCA (Balavoine et al., 2012) require that the cost functions be positive and symmetric and that the matrix of second derivatives of the cost be positive definite (a scalar condition in the case of separable cost functions). This last condition can intuitively be viewed as requiring that the activation function resulting from equation 2.6 has only a single output for a given input.
Some of the cost functions considered here have nonzero derivatives at the origin, leading to a range of values around the origin where the activation function is not defined according to the relationship in equation 2.5. In these cases, the smallest input value for which the threshold function is defined yields a zero-valued output. Since the second-derivative condition on the cost function constrains the activation function to be nondecreasing, we can infer that the activation function must be zero throughout the region that is not well characterized by the relationship in equation 2.5. Finally, we note that in most cases, we consider the behavior of the activation function only for u_k ≥ 0 because the behavior for u_k < 0 is implied by the symmetry condition.
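The recipe just described, inverting the state-to-output relationship where it is defined and clamping the activation to zero elsewhere, can also be carried out numerically when no closed form exists. The following sketch assumes the separable reading u = a + λC′(a) of equation 2.5; the grid resolution is an arbitrary choice.

```python
import numpy as np

def activation_from_cost(cost_grad, lam, u_grid):
    """Numerically invert u = a + lam * C'(a) (a separable reading of
    eq. 2.5) to obtain the activation a = T_lam(u) on a grid of inputs.
    Inputs below lam * C'(0+) fall in the zero-output (dead) region."""
    a = np.linspace(0.0, u_grid.max(), 100_000)
    u_of_a = a + lam * cost_grad(a)          # nondecreasing by assumption
    return np.where(u_grid < u_of_a[0], 0.0, np.interp(u_grid, u_of_a, a))
```

As a sanity check, the cost C(a) = |a| (so C′(a) = 1 for a > 0) recovers the familiar soft threshold max(u − λ, 0).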
3.1. Approximate ℓ_p Norms
Perhaps the most widely used family of cost functions is the ℓ_p norms. These separable cost functions include ideal sparse approximation (i.e., counting nonzeros), BPDN, and Tikhonov regularization (Tikhonov, 1963) as special cases (p = 0, 1, and 2, respectively) and are convex for p ≥ 1. Furthermore, recent research has shown some benefits of using nonconvex ℓ_p norms (p < 1) for inverse problems with sparse signal models (Saab, Chartrand, & Yilmaz, 2008; Elad, Matalon, & Zibulevsky, 2007). While the ideal activation functions can be determined exactly for the three special cases mentioned above (p = 0, 1, and 2), it is not possible to analytically determine the activation function for arbitrary values of p. Elad et al. (2007) introduced several parameterized approximations to the ℓ_p cost functions that are more amenable to analysis. In this section, we use these same approximations to determine activation functions for minimizing approximate ℓ_p norms for other values of p.
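For reference, the three analytically solvable special cases yield the following elementwise activation functions. This is a standard derivation written out as a sketch, not code from the letter; in particular, the hard-threshold level √(2λ) follows from comparing the fixed cost of a nonzero coefficient against the quadratic state mismatch, and the p = 2 case takes the cost as a²/2.

```python
import numpy as np

def activation_l0(u, lam):
    """p = 0 (counting cost): hard threshold. A nonzero output pays a fixed
    cost lam, which is worthwhile only when u**2 / 2 > lam."""
    return np.where(np.abs(u) > np.sqrt(2.0 * lam), u, 0.0)

def activation_l1(u, lam):
    """p = 1 (BPDN): two-sided soft threshold, shrinking magnitudes by lam."""
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def activation_l2(u, lam):
    """p = 2 (Tikhonov, taking the cost as a**2 / 2): pure linear shrinkage,
    with no zero region and hence no sparsity."""
    return u / (1.0 + lam)
```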
3.1.1. Approximate ℓ_p for p < 1
3.1.2. Approximate ℓ_p for 1 < p < 2
3.2. Modified ℓ_1 Norms
While the general ℓ_p norms have historically been very popular cost functions, many people have noted that this approach can have undesirable statistical properties in some instances (e.g., BPDN can result in biased estimates of large coefficients; Zou, 2006). To address these issues, many researchers in signal processing and statistics have proposed modified cost functions that attempt to alleviate these statistical concerns. For example, hybrid norms smoothly morph between different ℓ_p norms to capture the most desirable characteristics over different regions. In this section, we demonstrate that many of these modified norms can also be implemented in the basic LCA architecture.
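As one concrete member of this family, consider the Huber cost (quadratic for |a| ≤ δ, linear beyond). Inverting u = a + λC′(a), as in the other examples, gives an activation that is ℓ_2-like (linear shrinkage) for small inputs and ℓ_1-like (constant shift) for large ones. The sketch below is our own illustration of that inversion, with δ a free parameter.

```python
import numpy as np

def huber_activation(u, lam, delta):
    """Activation for the Huber cost C(a) = a**2/2 for |a| <= delta and
    delta*|a| - delta**2/2 beyond, obtained by inverting u = a + lam*C'(a):
      |u| <= delta*(1 + lam): linear shrinkage (the l2-like region),
      |u| >  delta*(1 + lam): constant shift   (the l1-like region).
    The two branches meet continuously at the knee."""
    knee = delta * (1.0 + lam)
    return np.where(np.abs(u) <= knee,
                    u / (1.0 + lam),
                    u - lam * delta * np.sign(u))
```

Note that because the Huber cost is quadratic at the origin, this activation has no dead zone and so does not by itself produce exactly sparse outputs; its appeal is the reduced bias on large coefficients.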
3.2.1. Smoothly Clipped Absolute Deviations.
3.2.3. Huber Function.
3.2.4. Amplitude Scale-Invariant Bayes Estimation.
3.4. Reweighted ℓ_1 and ℓ_2
Such schemes have shown many empirical benefits over ℓ_1 norm minimization, and recent work on reweighted ℓ_1 has established theoretical performance guarantees (Khajehnejad, Xu, Avestimehr, & Hassibi, 2010) and an interpretation as Bayesian inference in a probabilistic model (Garrigues & Olshausen, 2010).
One of the main drawbacks of reweighted algorithms in digital architectures is the time required to solve the weighted ℓ_1 program multiple times. Of course, it is also not clear that a discrete iterative approach such as this could be mapped to an asynchronous analog computational architecture. Because we have established that the LCA architecture can solve the ℓ_1 norm optimizations (and weighted ℓ_1 norms are a straightforward extension of those results), it immediately follows that a dynamical system could be used to perform the optimization necessary for each iteration of the algorithm. While this would be a viable strategy, we show here that even more advantages can be gained by performing the entire reweighted algorithm in the context of a dynamical system. Specifically, we consider a modified version of the LCA in which an additional set of dynamics is placed on the coefficient weights in order to simultaneously optimize the coefficients and weights in an analog system. While the ideas here extend to the general reweighted case, we focus on results involving the reweighted ℓ_1 algorithm as presented in Garrigues and Olshausen (2010).
At steady state, the weight dynamics reach a fixed point, showing that the weights abide by equation 3.2 up to a proportionality constant. While the complete analysis of this expanded analog system is beyond the scope of this letter, we show in Figure 4a simulations demonstrating that this system reaches a solution of comparable quality to digital iterative methods. Figure 4a plots the relative MSE for a compressed sensing recovery problem with length-1000 vectors recovered from 500 noisy measurements at varying levels of sparsity. We sweep the sparsity parameter from zero to one and set the noise variance to 10^-4, with each plot representing the relative MSE averaged over 15 randomly chosen signals. Figure 4a plots the recovery quality for three systems: iterative reweighted ℓ_1 (using GPSR to solve the iterations; Figueiredo, Nowak, & Wright, 2007), iterative reweighted ℓ_1 (using the LCA to solve the iterations), and dynamic reweighted ℓ_1, which uses the modified LCA described above. It is clear that the three systems achieve nearly the same quality in their signal recovery. Figure 4b plots the convergence of the recovery as a function of time (in terms of system time constants τ) for the iterative and dynamic reweighted approaches using the LCA. The dynamically reweighted system clearly converges more quickly, achieving its final solution in approximately the time it takes to perform two iterations of the traditional reweighting scheme using the standard LCA.
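To make the joint dynamics concrete, here is a minimal sketch of the idea: the LCA states and per-coefficient thresholds evolve together, with the thresholds relaxing on a slower time scale toward a reweighting target. The 1/(|a| + ε) target below is the familiar reweighted-ℓ_1 choice and stands in for the hierarchical update of Garrigues and Olshausen (2010); the specific weight dynamics and all parameter values are our own illustrative assumptions.

```python
import numpy as np

def soft(u, lam):
    """Soft threshold with a (possibly per-coefficient) threshold lam."""
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def dynamic_reweighted_lca(s, Phi, lam0=0.02, eps=0.05, tau=1.0, tau_w=10.0,
                           dt=0.1, n_steps=4000):
    """Jointly evolve the LCA states u and a per-coefficient threshold lam.
    The thresholds relax (time constant tau_w >> tau) toward the
    reweighted-l1-style target lam0 / (|a| + eps): coefficients that stay
    near zero see their threshold grow, while active coefficients are
    penalized less, reducing the usual l1 shrinkage bias."""
    n = Phi.shape[1]
    G = Phi.T @ Phi - np.eye(n)   # lateral connections
    b = Phi.T @ s                 # feedforward drive
    u = np.zeros(n)
    lam = lam0 * np.ones(n)
    for _ in range(n_steps):
        a = soft(u, lam)
        u += (dt / tau) * (-u + b - G @ a)
        lam += (dt / tau_w) * (-lam + lam0 / (np.abs(a) + eps))
    return soft(u, lam)
```

Because the coefficient and weight dynamics run concurrently rather than in alternating batches, there is no need to wait for one inference to converge before updating the other, which is the source of the speedup seen in Figure 4b.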
Sparsity-based signal models have played a significant role in many theories of neural coding across multiple sensory modalities. Despite the interest in the sparse coding hypothesis from the computational and theoretical neuroscience communities, the qualitative nature of much of the supporting evidence leaves significant ambiguity about the ideal form for a sparsity-inducing cost function. While recent trends favor the ℓ_1 norm due to the emergence of strong performance guarantees in the signal processing literature, there are many sparsity-inducing signal models that may have benefits for neural computation and should be candidate models for neural coding. We have shown here that many of the sparsity-inducing cost functions proposed in the signal processing and statistics literatures can be implemented in a single unified dynamical system.
From the results presented here, we conclude that neurally plausible computational architectures can support a wide variety of sparsity-based signal models, and it is therefore reasonable to consider this broad family of models as candidates for theories of sensory neural coding. Furthermore, we have shown that even a relatively complex hierarchical probabilistic model resulting in a reweighted inference scheme can be implemented efficiently in a purely analog system. This observation is particularly interesting because it illustrates a fundamental potential advantage of analog computation over digital systems. Specifically, the analog approach to this problem is able to continuously infer two sets of variables jointly, rather than taking an iterative approach that must wait for the computation for one variable to fully converge in each iteration before updating the other.
Beyond the applicability of these results to theories of neural computation, the recent shift toward optimization as a fundamental tool in the modern signal processing toolbox has made many state-of-the-art algorithms difficult to implement in applications with significant power constraints or real-time processing requirements. The results of this letter broaden the scope of problems that could potentially be approached through efficient neuromorphic architectures. The design and implementation of analog circuits have traditionally been difficult, but recent advances in reconfigurable analog circuits (Twigg & Hasler, 2009) have alleviated many of the difficulties in designing these systems. In fact, the reconfigurable platform described in Twigg and Hasler (2009) has been used to implement a small version of the LCA for solving BPDN (Shapero, Charles et al., in press; Shapero, Rozell et al., 2012), and preliminary tests of this implementation are consistent with simulations of the idealized LCA. These results lend encouragement to the idea that efficient analog circuits could be implemented for the variety of cost functions described in this letter.
Appendix: Soft-Threshold Activation for BPDN Using the Log-Barrier Relaxation
To illustrate the convergence of this relaxation to the desired cost function and the corresponding soft-threshold activation function, Figure 5 plots the relaxed cost function and the resulting activation function for several values of the relaxation parameter. Note that in the extended formulation of BPDN given in equation A.1, the variables occur in pairs in which only one member can be nonzero at a time. Because the activation function is zero for all state values with magnitude less than threshold, it is possible to represent each pair of variables with one LCA node that can take on positive and negative values and whose activation function is a two-sided soft-thresholding function (thereby reducing the number of nodes back down to N).
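The collapsing of each nonnegative pair into a single signed node can be checked directly: applying the one-sided threshold to u and to −u and taking the difference reproduces the two-sided soft threshold. A minimal numerical check (our own illustration):

```python
import numpy as np

def one_sided(u, lam):
    """Activation of a nonnegative node in the extended (a+, a-) formulation."""
    return np.maximum(u - lam, 0.0)

def two_sided(u, lam):
    """Equivalent single signed node: because at most one member of each
    (a+, a-) pair is nonzero, the pair collapses to a two-sided threshold."""
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)
```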
We are grateful to B. Olshausen and J. Romberg for valuable discussions related to this work. This work was partially supported by NSF grant CCF-0905346.
We also note that a cost function might be easily implementable even in the absence of an analytical formula for the activation function simply by using numerical integration to find a solution and fitting the resulting curve.