Abstract

The sparse coding hypothesis has generated significant interest in the computational and theoretical neuroscience communities, but there remain open questions about the exact quantitative form of the sparsity penalty and the implementation of such a coding rule in neurally plausible architectures. The main contribution of this work is to show that a wide variety of sparsity-based probabilistic inference problems proposed in the signal processing and statistics literatures can be implemented exactly in the common network architecture known as the locally competitive algorithm (LCA). Among the cost functions we examine are approximate ℓp norms (p ≠ 0, 1, 2), modified ℓ1 norms, block-ℓ1 norms, and reweighted algorithms. Of particular interest is that we show significantly increased performance in reweighted algorithms by inferring all parameters jointly in a dynamical system rather than using an iterative approach native to digital computational architectures.

1.  Introduction

New experimental approaches over the past decades have provided a closer look at how sensory nervous systems such as the visual cortex process information about their environment. Over this time, it has become increasingly evident that the canonical linear-nonlinear model, where cells encode visual information via linear filtering followed by a nonlinearity (e.g., thresholding and saturation), is inadequate to describe the complex processing performed by the sensory cortex. For example, this type of linear-nonlinear model does not capture the rich variety of nonlinear response properties and contextual modulations observed in V1 (Seriès, Lorenceau, & Frégnac, 2003).

Many theoretical neuroscientists have postulated high-level coding and computational principles for sensory cortex to extend our understanding of these systems. In many cases, these proposals are based generally around probabilistic Bayesian inference (Doya, 2007) due to the natural fit with ecological goals and evidence from perceptual tasks in humans (Battaglia, Jacobs, & Aslin, 2003; Hürlimann, Kiper, & Carandini, 2002). Many other researchers have postulated complementary models based on ideas of efficient coding, where information is encoded by removing redundant aspects of the stimulus. A wide variety of interesting models have appeared related to this broad principle of efficient coding, with selected examples including models using predictive coding (Rao & Ballard, 1999; Spratling, 2011), divisive normalization (Schwartz & Simoncelli, 2001), and directly encoding statistical variations (Karklin & Lewicki, 2008; Coen-Cagli, Dayan, & Schwartz, 2012).

The sparse coding hypothesis is one interpretation of efficient coding that has generated significant interest in the computational and theoretical neuroscience communities. In this model, a population of cells performs Bayesian inference to determine the environmental causes of a stimulus, with a goal of using as few simultaneously active units in the encoding as possible. Distributed sparse neural codes have several potential benefits over dense linear codes, including explicit information representation and easy decodability at higher processing stages (Olshausen & Field, 2004), metabolic efficiency (due to the significant cost of producing and transmitting action potentials; Lennie, 2003), and increased capacity of associative and sequence memory models (Baum, Moody, & Wilczek, 1988; Charles, Yap, & Rozell, 2012). The interest in the sparse coding model was originally generated when it was shown that this simple principle (combined with the statistics of natural images) is sufficient to explain the emergence of V1 receptive field shapes both qualitatively (Olshausen & Field, 1996) and quantitatively (Rehn & Sommer, 2007). More recently, electrophysiology experiments report results consistent with sparse coding (Haider et al., 2010; Vinje & Gallant, 2002), and simulation results have demonstrated that the sparse coding model can account for a wide variety of nonlinear response properties (called nonclassical receptive field effects) reported in single cells and population studies of V1 (Zhu & Rozell, 2012).

Despite this interest, there are many open fundamental questions related to the sparse coding model. First, what exactly is the proper notion of sparsity to use during inference? The original work in the computational neuroscience literature proposed several potential sparsity-inducing cost functions (Olshausen & Field, 1996), and recent work (motivated by strong theoretical results in the signal processing and applied mathematics communities) has seen people gravitate toward the ℓ1 norm. While the main qualitative results appear to be relatively robust to the detailed choice of the sparsity-inducing cost function, the broader signal processing and statistics communities have proposed several alternative cost functions that have appealing computational or statistical properties and may be valuable alternatives. Second, how would such a coding principle be implemented in biologically plausible computational architectures? The computation necessary to implement an inference process with a sparsity penalty amounts to solving a nonsmooth optimization problem that is notoriously challenging to solve (e.g., many gradient-based methods are wildly inefficient due to the nonsmooth nature of the objective). Recent theoretical work has demonstrated several network architectures that can efficiently compute sparse coefficients (Rehn & Sommer, 2007; Rozell, Johnson, Baraniuk, & Olshausen, 2010; Zylberberg, Murphy, & DeWeese, 2011; Perrinet, Samuelides, & Thorpe, 2004). Interestingly, the sparse coding problem has become very prominent in modern signal processing, for example, for use in inverse problems (Elad, Figueiredo, & Ma, 2010) and computer vision (Wright et al., 2010). There is also increasing interest in leveraging the computational benefits of analog neuromorphic architectures for these problems (Shapero, Charles, Rozell, & Hasler, in press; Shapero, Rozell, & Hasler, 2012).

The main contribution of this work is to show that a wide variety of sparsity-based probabilistic inference problems can be implemented exactly in the common network architecture known as the locally competitive algorithm (LCA) (Rozell et al., 2010). The LCA is a type of Hopfield network specifically designed to incorporate nonlinear thresholding elements that make it particularly efficient for solving the nonsmooth optimization problems necessary for sparse coding. In particular, we examine sparsity-based approaches described in the recent statistics and signal processing literature to show that many proposed signal models based on sparsity principles can be implemented efficiently in this common neural architecture. Among the cost functions we examine are approximate ℓp norms (p ≠ 0, 1, 2), modified ℓ1 norms that combine the desirable properties of different statistical models, block-ℓ1 norms for use in hierarchical models that impose correlations among the active variables, and reweighted ℓ1 and ℓ2 algorithms that use a hierarchical probabilistic model to achieve more efficient encodings. Of particular interest is that we show significantly increased performance in reweighted algorithms by inferring all parameters jointly in a dynamical system rather than using an iterative approach native to digital computational architectures. Preliminary results related to this work were reported in Rozell and Garrigues (2010).

2.  Background and Related Work

2.1.  Sparse Coding.

In the sparse coding problem, we use probabilistic inference to find the smallest number of causes for an observed signal under a linear generative model,
$$y = \Phi a + \epsilon, \tag{2.1}$$
where y is the observed signal, a is the length-N coefficient vector, Φ is the dictionary of causes, and ε is gaussian noise. The coefficient vector is said to be sparse as we seek a solution with relatively few nonzero entries. The coefficients a are generally inferred via MAP estimation, which results in solving a nonlinear optimization problem,
$$\hat{a} = \arg\min_a \; \frac{1}{2}\left\|y - \Phi a\right\|_2^2 + \lambda\, C(a), \tag{2.2}$$
where C(·) is a cost function penalizing a based on its fit with the signal model and λ is a parameter denoting the relative trade-off between the data fidelity term (i.e., the MSE, which arises from the log likelihood of the gaussian noise) and the cost function. The cost function is the nonlinear sparsity-inducing regularization term, corresponding to the log prior of the data model. More details about the formulation of this problem in the Bayesian inference framework can be found in Olshausen and Field (1997). Basic signal models frequently assume independence among the elements of a, resulting in a cost function that separates into a sum of individual costs, C(a) = Σ_k C(a_k). One common example is the ℓ1 norm, defined as ‖a‖_1 = Σ_k |a_k|.
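To make the optimization in equation 2.2 concrete, the following minimal sketch (our own illustration; the function name and the choice of the ℓ1 cost are assumptions made only for this example) evaluates the energy that the inference process must minimize. Any of the alternative cost functions discussed later can be substituted for the ℓ1 term.

```python
import numpy as np

def bpdn_energy(y, Phi, a, lam):
    """Energy from equation 2.2 with the l1 cost: 0.5*||y - Phi a||_2^2 + lam*||a||_1."""
    residual = y - Phi @ a                 # data fidelity (gaussian log likelihood)
    return 0.5 * np.sum(residual ** 2) + lam * np.sum(np.abs(a))
```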

2.2.  Dynamical Systems for Minimization

Recent work in computational neuroscience has shown that the LCA dynamical system provably solves the optimization programs in equation 2.2 and is efficient for solving the nonsmooth problems of interest in sparse approximation. The LCA (Rozell et al., 2010) architecture comprises a network of analog nodes being driven by the signal to be approximated. Each node competes with neighboring nodes for a chance to represent the signal, and the steady-state response represents the solution to the optimization problem.

The LCA is a specific instance of the Hopfield neural network architecture that has a long history of being used to solve optimization problems (Hopfield, 1982). It is a neurally plausible architecture, consisting of a network of parallel nodes that use computational primitives well matched to individual neuron models. In particular, each node consists of a leaky integrator and a nonlinear thresholding function, and it is driven by both feedforward and lateral (inhibitory and excitatory) recurrent connections. This architecture has been implemented in neuromorphic hardware as a purely analog system (Shapero, Charles et al., in press) and by using integrate-and-fire spiking neurons for each node (Shapero, Rozell, et al., 2012). We also note that other types of network structures have been proposed recently to approximately solve specific versions of the sparse approximation problem (Rehn & Sommer, 2007; Perrinet et al., 2004; Zylberberg et al., 2011; Hu, Genkin, & Chklovskii, 2012).

Specifically, the kth node of the LCA is associated with the kth column of Φ, denoted φ_k. Without loss of generality, we assume that each column has unit norm. This node is described at a given time t by an internal state variable u_k(t). The coefficients a are related to the internal states u via an activation (thresholding) function, a(t) = T_λ(u(t)), that is parameterized by λ. In the important special case when the cost function is separable, the output of each node k can be calculated independently of all other nodes by a pointwise activation function, a_k(t) = T_λ(u_k(t)). Individual nodes are leaky integrators driven by an input proportional to ⟨φ_k, y⟩, and competition between nodes occurs by lateral connections that allow highly active nodes to suppress nodes with less activity. The dynamics for node k are given by
$$\tau\,\dot{u}_k(t) = -u_k(t) + \langle \phi_k, y \rangle - \sum_{j \neq k} \langle \phi_k, \phi_j \rangle\, a_j(t), \tag{2.3}$$
where τ is the system time constant. In vector form, the dynamics for the whole network are given by
$$\tau\,\dot{u}(t) = -u(t) + \Phi^T y - \left(\Phi^T \Phi - I\right) a(t). \tag{2.4}$$
Rozell et al. (2010) showed that for the energy surface E given in equation 2.2 with a separable, continuous, and piecewise differentiable cost function, the path induced by the LCA (using the outputs a_k(t) as the optimization variable) ensures that E is nonincreasing when the cost function satisfies
$$\lambda\,\frac{dC(a_k)}{da_k} = u_k - a_k = u_k - T_\lambda(u_k), \tag{2.5}$$
where T_λ(·) is nondecreasing. We use the notation u_k = T_λ^{-1}(a_k) for convenience when the activation function is invertible, but this invertibility is not strictly required (i.e., the relationship in equation 2.5 involving just T_λ(·) is sufficient). The same arguments also extend to the more general case of nonseparable cost functions, ensuring that E is nonincreasing when
$$\lambda\,\nabla_a C(a) = u - a = u - T_\lambda(u). \tag{2.6}$$
Recent follow-up work (Balavoine, Romberg, & Rozell, 2012) establishes stronger guarantees on the LCA, specifically showing that this system is globally convergent to a minimum of E (which may be a local minimum if E is not convex) and proving that the system converges exponentially fast with an analytically bounded convergence rate.
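As an illustration of these dynamics, the sketch below integrates equation 2.4 with a simple forward Euler scheme for an arbitrary pointwise activation function. The function and parameter names are our own, and the step size and duration are arbitrary choices for the example rather than values used in the original work.

```python
import numpy as np

def run_lca(y, Phi, threshold, lam, tau=0.01, dt=0.001, n_steps=2000):
    """Forward-Euler integration of the LCA dynamics (equation 2.4):
    tau * du/dt = -u + Phi^T y - (Phi^T Phi - I) a,  with a = T_lambda(u)."""
    N = Phi.shape[1]
    u = np.zeros(N)                       # internal state variables
    drive = Phi.T @ y                     # feedforward input
    inhibition = Phi.T @ Phi - np.eye(N)  # lateral connection weights
    for _ in range(n_steps):
        a = threshold(u, lam)             # outputs through the activation function
        u += (dt / tau) * (-u + drive - inhibition @ a)
    return threshold(u, lam)              # approximate steady-state coefficients
```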
The relationship in equation 2.5 requires cost functions that are differentiable and activation functions that are invertible. However, the cost function for BPDN (the ℓ1 norm) is nonsmooth at the origin, and the most effective sparsity-promoting activation functions will likely have noninvertible thresholding properties. In these cases, one can start with a smooth cost function that is a relaxed version of the desired cost and calculate the corresponding activation function. Taking the limit of the relaxation parameter in the activation function yields a formula for T_λ(·) that can be used to solve the desired problem. Specifically, in the appendix, we use the log-barrier relaxation (Boyd & Vandenberghe, 2004) to show that the LCA solves BPDN when the activation function is the well-known soft-thresholding function:
$$T_\lambda(u_k) = \begin{cases} 0, & |u_k| \le \lambda, \\ u_k - \lambda\,\mathrm{sign}(u_k), & |u_k| > \lambda. \end{cases}$$
Similarly, the LCA can find a local minimum of the nonconvex optimization program that minimizes the ℓ0 “norm” of the coefficients (i.e., the number of nonzeros) by using the hard-thresholding activation function (Rozell et al., 2010):
$$T_\lambda(u_k) = u_k\, \mathbb{1}\!\left[\,|u_k| > \sqrt{2\lambda}\,\right],$$
where $\mathbb{1}[\cdot]$ is the standard indicator function.
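For reference, both activation functions can be written as simple pointwise operations and passed to a simulation such as the sketch above; the names are ours, and the √(2λ) threshold follows the energy normalization of equation 2.2 as reconstructed here.

```python
import numpy as np

def soft_threshold(u, lam):
    """Soft threshold for BPDN (l1 cost): zero inside [-lam, lam], shrink by lam outside."""
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def hard_threshold(u, lam):
    """Hard threshold for the l0 'norm': pass u through unchanged when |u| > sqrt(2*lam)."""
    return u * (np.abs(u) > np.sqrt(2.0 * lam))
```

For example, `run_lca(y, Phi, soft_threshold, lam)` would approximate the BPDN solution.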

3.  Alternate Inference Problems in the LCA Architecture

Using the basic relationships described in equations 2.5 and 2.6, we can optimize a variety of cost functions in the same basic LCA structure by analytically determining the corresponding activation function.1 These optimization programs include approximate ℓp norms, modified ℓ1 norms that attempt to achieve better statistical properties than BPDN, the group/block ℓ1 norm that induces coactivation structure on the nonzero coefficients, reweighted ℓ1 and ℓ2 algorithms that represent hierarchical statistical models on the coefficients, and classic Tikhonov regularization.

Before exploring specific alternate cost functions in the remainder of this section, it is worthwhile to offer a technical note regarding the optimization programs that are possible to implement in the LCA architecture. The strong theoretical convergence guarantees established for the LCA (Balavoine et al., 2012) apply to a wide variety of possible systems, but they do impose some conditions on the permissible activation functions. We will rely on these same conditions to analytically determine the relationship between the cost and activation functions for the examples in this section. Translated to conditions on the cost functions, the convergence results for the LCA (Balavoine et al., 2012) require that the cost functions be positive (C(a_k) ≥ 0) and symmetric (C(a_k) = C(−a_k)) and satisfy the condition that the matrix I + λ∇²C(a) is positive definite (i.e., 1 + λ d²C(a_k)/da_k² > 0 for separable cost functions). This last condition can intuitively be viewed as requiring that the activation function resulting from equation 2.6 have only a single output for a given input.

Some of the cost functions considered here have nonzero derivatives at the origin, leading to a range of values of u_k around the origin where the output is not defined according to the relationship in equation 2.5. In these cases, the smallest value of u_k for which the threshold function is defined results in a zero-valued output. Since the second-derivative condition on the cost function constrains the activation function to be nondecreasing, we can infer that the only allowable value of the activation function must be zero for the regions that are not well characterized by the relationship in equation 2.5. Finally, we note that in most cases, we will consider the behavior of the activation function only for u_k ≥ 0 because the behavior for u_k < 0 is implied by the symmetry condition.
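When a cost function does not admit an analytical inverse (see note 1), the activation function can instead be tabulated numerically by solving equation 2.5 for a_k over a grid of inputs. A minimal sketch, assuming a smooth, symmetric cost whose derivative is supplied by the user and satisfying the monotonicity condition above (all names are ours):

```python
import numpy as np
from scipy.optimize import brentq

def numeric_activation(u, cost_deriv, lam, a_max=1e3):
    """Numerically invert u = a + lam * C'(a) (equation 2.5) for a single input u.
    cost_deriv(a) is dC/da for a >= 0; symmetry supplies the negative branch."""
    s, mag = np.sign(u), abs(u)
    g = lambda a: a + lam * cost_deriv(a) - mag
    if g(0.0) >= 0.0:                 # below threshold: only a zero output is consistent
        return 0.0
    return s * brentq(g, 0.0, a_max)  # g increasing under the stated condition, so the root is unique
```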

3.1.  Approximate ℓp Norms

Perhaps the most widely used family of cost functions is the ℓp norms, C(a) = ‖a‖_p^p = Σ_k |a_k|^p. These separable cost functions include ideal sparse approximation (i.e., counting nonzeros), BPDN, and Tikhonov regularization (Tikhonov, 1963) as special cases (p = 0, 1, and 2, respectively) and are convex for p ≥ 1. Furthermore, recent research has shown some benefits of using nonconvex ℓp norms (p < 1) for inverse problems with sparse signal models (Saab, Chartrand, & Yilmaz, 2008; Elad, Matalon, & Zibulevsky, 2007). While the ideal activation functions can be determined exactly for the three special cases mentioned above (p = 0, 1, and 2), it is not possible to analytically determine the activation function for arbitrary values of p. Elad et al. (2007) introduced several parameterized approximations to the ℓp cost functions that are more amenable to analysis. In this section, we use these same approximations to determine activation functions for minimizing approximate ℓp norms for 0 < p < 2.

3.1.1.  Approximate ℓp for 1 < p < 2

For 1 < p < 2, Elad et al. (2007) propose the approximate cost function
$$C(a_k) = c\left(|a_k| - s\,\log\!\left(1 + \frac{|a_k|}{s}\right)\right)$$
as a good match for the true ℓp norm for some value of the parameters s and c. In the limiting cases, c = 1 with s → 0 yields the ℓ1 norm and c = 2s with s → ∞ yields the ℓ2 norm. Three intermediate examples for p = 1.25, 1.5, and 1.75 are shown in Figure 1. For any specific value of p, we find the best values of c and s by using standard numerical optimization techniques to minimize the squared error to the true cost function over the interval [0, 2]. From this cost function, we can differentiate to obtain the relationship between each u_k and a_k as
$$u_k = a_k + \lambda c\,\frac{a_k}{s + a_k}, \qquad a_k \ge 0.$$
Figure 1:

Approximate cost functions and their corresponding thresholding functions. (Left) The cost functions are approximated over the parameters c, s for values of p ranging from 0 to 1 (top) and 1 to 2 (bottom). The true costs are shown as dotted lines in the same shades. Using these values of c and s, a nonlinear activation function that can be used in the LCA to solve the optimization is plotted (right) using the thresholding equations for 0<p<1 (top) and 1<p<2 (bottom). The thresholding functions clearly span the ranges between soft and hard thresholding for the lower range of p and between soft thresholding and linear amplification for the upper range of p.


We see from this relationship that with c = 1 and s → 0, we obtain a_k = u_k − λ for u_k ≥ λ (i.e., the soft-thresholding function for BPDN), while with c = 2s and s → ∞, we obtain a_k = u_k/(1 + 2λ) (i.e., a linear amplifier for Tikhonov regularization). Solving for a_k in terms of u_k (restricting the solution to be positive and increasing) yields a general relationship for the activation function
$$T_{(c,s,\lambda)}(u_k) = \frac{1}{2}\left[(u_k - s - \lambda c) + \sqrt{(u_k - s - \lambda c)^2 + 4 s u_k}\,\right], \qquad u_k \ge 0.$$
This solution is shown in Figure 1 for p = 1.25, 1.5, and 1.75.
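A sketch of this activation function, using the quadratic-root expression reconstructed above (the formula, function name, and parameters are part of our reconstruction, so treat this as illustrative rather than definitive):

```python
import numpy as np

def approx_lp_threshold(u, lam, c, s):
    """Activation for the smoothed lp cost (1 < p < 2): the positive, increasing
    root of u = a + lam*c*a/(s + a), applied with odd symmetry."""
    mag = np.abs(u)
    b = mag - s - lam * c
    return np.sign(u) * 0.5 * (b + np.sqrt(b ** 2 + 4.0 * s * mag))
```

With c = 1 and small s this behaves approximately like the soft threshold, while with c = 2s and large s it approaches the linear amplifier u/(1 + 2λ).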

3.1.2.  Approximate ℓp for 0 < p < 1

For 0 < p < 1, Elad et al. (2007) also propose the following approximate cost function as a good match for the true ℓp norm for some value of the parameters s and c:
$$C(a_k) = c\,\log\left(1 + s\,|a_k|\right),$$
where the parameters c>0 and s>0 can be optimized as above to approximate different values of p. Three approximations for p = 0.5, 0.75, and 0.9 are shown in Figure 1. To determine the activation function, we again differentiate and find the appropriate relationship to be
$$u_k = a_k + \frac{\lambda c s}{1 + s a_k}, \qquad a_k \ge 0.$$
Solving for ak reduces to solving a quadratic equation, which leads to two possible solutions. As above, we restrict the activation function to include only the solution that is positive and increasing, resulting in the activation function
$$T_{(c,s,\lambda)}(u_k) = \frac{1}{2}\left[\left(u_k - \frac{1}{s}\right) + \sqrt{\left(u_k + \frac{1}{s}\right)^2 - 4\lambda c}\,\right], \qquad u_k \ge 0.$$
This activation function is valid only over the range of u_k where the output is a positive real number; for smaller values of u_k, no viable nonzero solution exists and the output is zero.

3.2.  Modified ℓ1 Norms

While the general ℓp norms have historically been very popular cost functions, many people have noted that this approach can have undesirable statistical properties in some instances (e.g., BPDN can result in biased estimates of large coefficients; Zou, 2006). To address these issues, many researchers in signal processing and statistics have proposed modified cost functions that attempt to alleviate these statistical concerns. For example, hybrid norms smoothly morph between different ℓp norms to capture the most desirable characteristics over different regions. In this section, we demonstrate that many of these modified norms can also be implemented in the basic LCA architecture.

3.2.1.  Smoothly Clipped Absolute Deviations.

A common goal for modified ℓ1 norms is to retain the continuity of the cost function near the origin demonstrated by the ℓ1 norm, while using a constant cost for larger coefficients (similar to the ℓ0 norm) to avoid statistical biases. One approach to achieving these competing goals is the smoothly clipped absolute deviations (SCAD) penalty (Fan, 1997; Antoniadis & Fan, 2001). The SCAD approach directly concatenates the ℓ1 and ℓ0 norms with a quadratic transition region, resulting in the cost function given by
$$\lambda\,C(a_k) = \begin{cases} \lambda\,|a_k|, & |a_k| \le \lambda, \\[4pt] \dfrac{2\gamma\lambda\,|a_k| - a_k^2 - \lambda^2}{2(\gamma - 1)}, & \lambda < |a_k| \le \gamma\lambda, \\[4pt] \dfrac{(\gamma + 1)\lambda^2}{2}, & |a_k| > \gamma\lambda, \end{cases}$$
for γ > 2 (γ defines the width of the transition region). An example of this cost function is shown in Figure 2.
Figure 2:

Cost functions and their corresponding thresholding functions. (Left) The cost functions are compared for the ℓ1 norm, the scale-invariant Bayes cost, and the Huber cost (top) and for the ℓ0 norm, SCAD, and the transformed ℓ1 cost (bottom), each for representative parameter values. (Right) The corresponding nonlinear activation functions, which can be used in the LCA to solve the regularized optimization program for each cost function.


To obtain the activation function, we again solve for a_k as a function of u_k. For SCAD (and all of the piecewise cost functions we consider), the activation function can be determined individually for each region, paying careful attention to the ranges of the inputs u_k and outputs a_k to ensure consistency. For |a_k| ≤ λ, we have λ dC(a_k)/da_k = λ sign(a_k), implying that a_k = 0 for |u_k| ≤ λ and a_k = u_k − λ sign(u_k) over the interval λ < |u_k| ≤ 2λ. For λ < |a_k| ≤ γλ, we have
$$a_k = \frac{(\gamma - 1)\,u_k - \gamma\lambda\,\mathrm{sign}(u_k)}{\gamma - 2}$$
over the interval 2λ < |u_k| ≤ γλ. Finally, for |u_k| > γλ, we have a_k = u_k, giving the full activation function
$$T_{(\gamma,\lambda)}(u_k) = \begin{cases} 0, & |u_k| \le \lambda, \\[2pt] u_k - \lambda\,\mathrm{sign}(u_k), & \lambda < |u_k| \le 2\lambda, \\[2pt] \dfrac{(\gamma - 1)\,u_k - \gamma\lambda\,\mathrm{sign}(u_k)}{\gamma - 2}, & 2\lambda < |u_k| \le \gamma\lambda, \\[2pt] u_k, & |u_k| > \gamma\lambda, \end{cases}$$
which is shown in Figure 2. Note that this activation function requires γ > 2 (Antoniadis & Fan, 2001, recommend a value of γ = 3.7). While this is apparent from consistency arguments once the thresholding function has been derived, this restriction on γ can also be deduced from the positive-definiteness condition on the cost function described at the beginning of this section.
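The piecewise SCAD activation derived above can be written compactly as follows (a sketch with our own naming; γ = 3.7 is the Antoniadis and Fan recommendation):

```python
import numpy as np

def scad_threshold(u, lam, gamma=3.7):
    """SCAD activation (gamma > 2): soft thresholding for small inputs, a linear
    transition region for intermediate inputs, and the identity for large inputs."""
    u = np.asarray(u, dtype=float)
    mag, s = np.abs(u), np.sign(u)
    soft = s * np.maximum(mag - lam, 0.0)
    middle = ((gamma - 1.0) * u - s * gamma * lam) / (gamma - 2.0)
    return np.where(mag <= 2.0 * lam, soft,
                    np.where(mag <= gamma * lam, middle, u))
```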

3.2.2.  Transformed ℓ1

Similar to the SCAD cost function, the transformed ℓ1 cost (Antoniadis & Fan, 2001; Nikolova, 2000) attempts to capture something close to the ℓ1 norm for small coefficients while reducing the penalty on larger coefficients. Specifically, transformed ℓ1 uses the fractional cost function given by
$$C(a_k) = \frac{(\gamma + 1)\,|a_k|}{\gamma + |a_k|}$$
for some γ > 0. An example of this cost is shown in Figure 2. After calculating the derivative of the cost function, the activation function can be found by solving
$$u_k = a_k + \lambda\,\frac{\gamma(\gamma + 1)}{(\gamma + a_k)^2}, \qquad a_k \ge 0,$$
for ak. Inverting this equation reduces to solving a cubic equation in ak. The three roots can be calculated analytically, but only one root generates a viable thresholding function by being both positive and increasing for positive uk. That root is given by
formula
This solution is viable only when a_k is real valued, which occurs only for sufficiently large |u_k|. Outside this range, no viable nonzero solution exists, and so a_k = 0. The full thresholding function is shown in Figure 2.

3.2.3.  Huber Function.

The Huber cost function (Huber, 1973) aims to modify standard least squares (ℓ2) optimization to improve the robustness to outliers. This cost function consists of a quadratic cost on smaller values and a smooth transition to an ℓ1 cost on larger values, given by
$$C(a_k) = \begin{cases} \dfrac{a_k^2}{2\delta}, & |a_k| \le \delta, \\[4pt] |a_k| - \dfrac{\delta}{2}, & |a_k| > \delta. \end{cases}$$
An example of the Huber cost is shown in Figure 2. As in the case of other piecewise cost functions, we calculate the activation function separately over each interval of interest by calculating the derivative of the cost function in each region. For the first interval, the relationship is given by u_k = a_k(1 + λ/δ), which gives the activation function a_k = (δ/(δ + λ)) u_k for |u_k| ≤ δ + λ. For the second interval, we have u_k = a_k + λ sign(a_k), which yields the activation function a_k = u_k − λ sign(u_k) for |u_k| > δ + λ. Putting the pieces together, the full activation function (as expected) is a mixture of the linear amplification used for Tikhonov regularization and the soft thresholding used for ℓ1 optimization, given by
$$T_{(\delta,\lambda)}(u_k) = \begin{cases} \dfrac{\delta}{\delta + \lambda}\, u_k, & |u_k| \le \delta + \lambda, \\[4pt] u_k - \lambda\,\mathrm{sign}(u_k), & |u_k| > \delta + \lambda, \end{cases}$$
which is shown in Figure 2. We can see that as δ → 0, the cost function converges to the ℓ1 norm and the thresholding function correctly converges back to the soft-threshold function derived earlier using the log-barrier method.
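A sketch of the Huber activation combining the two regions derived above (the names and the vectorized form are ours):

```python
import numpy as np

def huber_threshold(u, lam, delta):
    """Huber activation: Tikhonov-like attenuation for |u| <= delta + lam and
    soft thresholding beyond that crossover point."""
    u = np.asarray(u, dtype=float)
    mag, s = np.abs(u), np.sign(u)
    return np.where(mag <= delta + lam,
                    (delta / (delta + lam)) * u,
                    u - lam * s)
```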

3.2.4.  Amplitude Scale-Invariant Bayes Estimation.

A known problem with using the ℓ1 norm as a cost function is that it is not scale invariant, meaning that the results can be poor if the amplitude of the input signal changes significantly (assuming a constant value of λ). Many cost functions (including the ones presented above) are heuristically motivated, drawing on intuition and trade-offs between the behavior of various norms. In contrast, Figueiredo and Nowak (2001) approach the problem from the perspective of Bayesian inference with a Jeffreys prior to determine a cost function with more invariance to amplitude scaling, similar to the nonnegative Garrote (Gao, 2001). We consider here the cost function
formula
which is proportional to the one given by Figueiredo and Nowak (2001) and is shown in Figure 2.
Taking the derivative of this cost function, we end up with the relationship between uk and ak:
formula
Solving for ak as a function of uk yields the following activation function,
formula
matching the results from Figueiredo and Nowak (2001). This activation function is shown in Figure 2.

3.3.  Block ℓ1

While all cost functions discussed earlier in this section have been separable, there is increasing interest in nonseparable cost functions that capture structure (i.e., statistical dependencies) between the nonzero coefficients. For example, this structure would be important in performing inference in a complex cell energy model where the energies (i.e., magnitudes) are sparse in a complex-valued signal decomposition (e.g., Cadieu & Olshausen, 2012). Perhaps the most widely cited cost function discussed in this regard is the block ℓ1 norm (also called the group ℓ1 norm), which assumes that the coefficients representing x are active in known groups. In this framework, the coefficients are divided into L blocks, a = [a_1^T, …, a_L^T]^T, and each block of coefficients is represented as a vector a_l. For our purposes, we assume the blocks are non-overlapping but may have different cardinalities. The block ℓ1 norm (Eldar, Kuppinger, & Bolcskei, 2010) is defined as the ℓ1 norm over the ℓ2 norms of the groups,
$$C(a) = \sum_{l=1}^{L} \left\|a_l\right\|_2,$$
essentially encouraging sparsity between the blocks (i.e., requiring only a few groups to be active) with no individual penalty on the coefficient values within a block. Because this cost is not separable, the activation function will no longer be a pointwise nonlinearity and will instead have multiple inputs and outputs.
Following the same general approach as above, we calculate the gradient of the cost function for each block,
$$\nabla_{a_l} C(a) = \frac{a_l}{\left\|a_l\right\|_2},$$
yielding the following relationship between the activation function inputs and outputs
$$u_l = a_l + \lambda\,\frac{a_l}{\left\|a_l\right\|_2}. \tag{3.1}$$
While directly solving this relationship for a_l appears difficult, we note that we can simplify the equation by expressing ‖a_l‖_2 in terms of ‖u_l‖_2. To see this, take the ℓ2 norm of both sides of equation 3.1 to get ‖u_l‖_2 = ‖a_l‖_2 + λ. Substituting back into equation 3.1, the relationship simplifies to
$$u_l = a_l\,\frac{\left\|u_l\right\|_2}{\left\|u_l\right\|_2 - \lambda}$$
over the range ‖u_l‖_2 > λ, implying that a_l = 0 whenever ‖u_l‖_2 ≤ λ.
This relationship yields the block-wise thresholding function
$$T_\lambda(u_l) = \begin{cases} u_l\left(1 - \dfrac{\lambda}{\left\|u_l\right\|_2}\right), & \left\|u_l\right\|_2 > \lambda, \\[4pt] 0, & \text{otherwise}. \end{cases}$$
This activation function can be thought of as a type of shrinkage operation applied to an entire group of coefficients, with a threshold that depends on the ℓ2 norm of the group inputs. For the case of groups of two elements, Figure 3 shows the nonlinearities for each of the two states as a function of the value of the other state.
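A sketch of this group-wise shrinkage (the function name is ours, and `blocks` is a hypothetical list of index arrays defining the non-overlapping groups):

```python
import numpy as np

def block_l1_threshold(u, lam, blocks):
    """Block soft thresholding: scale each group by (1 - lam/||u_l||_2) when
    ||u_l||_2 > lam, and set the group to zero otherwise."""
    a = np.zeros_like(u, dtype=float)
    for idx in blocks:
        group_norm = np.linalg.norm(u[idx])
        if group_norm > lam:
            a[idx] = u[idx] * (1.0 - lam / group_norm)
    return a
```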
Figure 3:

The nonlinear activation function used in the LCA to optimize the nonoverlapping group LASSO cost function has multiple inputs and multiple outputs. The plot shows an example thresholding function for both elements in a group of size 2, with each line illustrating the nonlinear effect on a_1 while u_2 is held constant.


3.4.  Reweighted ℓ1 and ℓ2

Recent work has also demonstrated that reweighted ℓ1 and ℓ2 algorithms can achieve better sparsity by iteratively solving a series of tractable convex programs (Wipf & Nagarajan, 2010; Chartrand & Yin, 2008; Candès, Wakin, & Boyd, 2008; Garrigues & Olshausen, 2010). For example, reweighted ℓ1 (Candès et al., 2008) is an iterative algorithm where a single iteration consists of solving a weighted ℓ1 minimization (i.e., equation 2.2 with cost Σ_k λ_k |a_k|), followed by a weight update according to the rule
$$\lambda_k = \frac{\beta}{\eta + |a_k|}, \tag{3.2}$$
where η is a small parameter and β is a proportionality constant. By having λ_k approximately equal to the inverse of the magnitude of the coefficient from the previous iteration, this algorithm is more aggressive than BPDN at driving small coefficients to zero and increasing sparsity in the solutions. Similarly, reweighted ℓ2 algorithms (Wipf & Nagarajan, 2010) have also been used to approximate different p-norms with weights updated as
$$\lambda_k = \beta\left(a_k^2 + \eta\right)^{\frac{p}{2} - 1}.$$

Such schemes have shown many empirical benefits over ℓ1 norm minimization, and recent work on reweighted ℓ1 has established theoretical performance guarantees (Khajehnejad, Xu, Avestimehr, & Hassibi, 2010) and interpretations as Bayesian inference in a probabilistic model (Garrigues & Olshausen, 2010).

One of the main drawbacks to reweighted algorithms in digital architectures is the time required to solve the weighted program multiple times. Of course, it is also not clear that a discrete iterative approach such as this could be mapped to an asynchronous analog computational architecture. Because we have established that the LCA architecture can solve the ℓ1 norm optimizations (and weighted ℓ1 norms are a straightforward extension of those results), it would immediately follow that a dynamical system could be used to perform the optimization necessary for each iteration of the algorithm. While this would be a viable strategy, we show here that even more advantages can be gained by performing the entire reweighted algorithm in the context of a dynamical system. Specifically, we consider a modified version of the LCA where an additional set of dynamics is placed on the weights λ_k in order to simultaneously optimize the coefficients and coefficient weights in an analog system. While the ideas here are expandable to the general reweighted ℓ2 case, we focus on results involving the reweighted ℓ1 as presented in Garrigues and Olshausen (2010).

The modified LCA is given by the system equations
$$\tau\,\dot{u}(t) = -u(t) + \Phi^T y - \left(\Phi^T\Phi - I\right) a(t), \qquad a_k(t) = T_{\lambda_k(t)}(u_k(t)),$$
$$\tau_\lambda\,\dot{\lambda}_k(t) = \beta - \lambda_k(t)\left(\eta + |a_k(t)|\right).$$

At steady state, λ̇_k = 0, which shows that λ_k abides by equation 3.2 with β representing the proportionality constant. While the complete analysis of this expanded analog system is beyond the scope of this letter, we show in Figure 4a simulations demonstrating that this system reaches a solution of comparable quality to digital iterative methods. Figure 4a plots the relative MSE from a compressed sensing recovery problem recovering length-1000 vectors from 500 noisy measurements with varying levels of sparsity. We sweep the sparsity level from zero to one and set the noise variance to 10⁻⁴, with each point representing the relative MSE averaged over 15 randomly chosen signals. Figure 4a plots the recovery quality for three systems: iterative reweighted ℓ1 (using GPSR to solve the iterations; Figueiredo, Nowak, & Wright, 2007), iterative reweighted ℓ1 (using the LCA to solve the iterations), and dynamic reweighted ℓ1, which uses the modified LCA described above. It is clear that the three systems achieve nearly the same quality in their signal recovery. Figure 4b plots the convergence of the recovery as a function of time (in terms of system time constants τ) for the iterative and dynamic reweighted approaches using the LCA. The dynamically reweighted system clearly converges more quickly, achieving its final solution in approximately the time it takes to perform two iterations of the traditional reweighting scheme using the standard LCA.
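To illustrate the joint dynamics, the sketch below augments the earlier LCA loop with first-order dynamics on the weights whose equilibrium satisfies equation 3.2. The specific form of the λ dynamics, the time constants, and all names are our assumptions for the example rather than the exact system simulated here.

```python
import numpy as np

def run_dynamic_reweighted_l1(y, Phi, beta, eta, tau=0.01, tau_lam=0.1,
                              dt=0.001, n_steps=5000):
    """Jointly evolve coefficients a and per-coefficient weights lambda_k.
    The weight dynamics below are one simple choice whose steady state gives
    lambda_k = beta / (eta + |a_k|), consistent with equation 3.2 (an assumption)."""
    N = Phi.shape[1]
    u = np.zeros(N)
    lam = np.full(N, beta / eta)            # weights consistent with an all-zero initial solution
    drive = Phi.T @ y
    inhibition = Phi.T @ Phi - np.eye(N)
    for _ in range(n_steps):
        a = np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)   # weighted soft threshold
        u += (dt / tau) * (-u + drive - inhibition @ a)
        lam += (dt / tau_lam) * (beta - lam * (eta + np.abs(a)))
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)
```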

Figure 4:

Reweighted ℓ1 optimization in digital algorithms and in a modified LCA. (a) Reweighted ℓ1 optimization for a signal with N = 1000 and M = 500, with the sparsity level swept from 0 to 1. The traditional iterative reweighting scheme is performed with both a standard digital algorithm (GPSR) and the LCA. For comparison, a dynamic reweighting scheme where the LCA is modified to have continuous dynamics on the regularization parameters (rather than discrete iterations) is also shown. Each method clearly achieves similar solutions. (b) The temporal evolution of the recovery relative MSE for a representative problem with N = 1000 (remaining problem parameters fixed at 0.6 and 0.45). Solutions are shown as a function of the amount of simulated time (in terms of the number of time constants τ). The dynamically reweighted system converges in approximately the time it takes to use the LCA to solve two iterations of the traditional reweighted ℓ1 algorithm.


4.  Conclusion

Sparsity-based signal models have played a significant role in many theories of neural coding across multiple sensory modalities. Despite the interest in the sparse coding hypothesis from the computational and theoretical neuroscience communities, the qualitative nature of much of the supporting evidence leaves significant ambiguity about the ideal form for a sparsity-inducing cost function. While recent trends favor the ℓ1 norm due to the emergence of strong theoretical guarantees in the signal processing literature, there are many sparsity-inducing signal models that may have benefits for neural computation and should be candidate models for neural coding. We have shown here that many of the sparsity-inducing cost functions proposed in the signal processing and statistics literatures can be implemented in a single unified dynamical system.

From the results presented here, we conclude that neurally plausible computational architectures can support a wide variety of sparsity-based signal models, and it is therefore reasonable to consider this broad family of models as candidates for theories of sensory neural coding. Furthermore, we have shown that even a relatively complex hierarchical probabilistic model resulting in a reweighted inference scheme can be implemented efficiently in a purely analog system. This observation is particularly interesting because it illustrates a fundamental potential advantage of analog computation over digital systems. Specifically, the analog approach to this problem is able to continuously infer two sets of variables jointly rather than take an iterative approach that must wait for the computations in each iteration for one variable to fully converge before updating the other variable.

Beyond the applicability of these results to theories of neural computation, the recent shift toward optimization as a fundamental computational tool in the modern signal processing toolbox has made it difficult to implement many of these algorithms in applications with significant power constraints or real-time processing requirements. The results of this letter broaden the scope of problems that could potentially be approached through efficient neuromorphic architectures. The design and implementation of analog circuits have traditionally been difficult, but recent advances in reconfigurable analog circuits (Twigg & Hasler, 2009) have alleviated many of the issues related to the design of these systems. In fact, the reconfigurable platform described in Twigg and Hasler (2009) has been used to implement a small version of the LCA for solving BPDN (Shapero, Charles et al., in press; Shapero, Rozell et al., 2012), and preliminary tests of this implementation are consistent with simulations of the idealized LCA. These results lend encouragement to the idea that efficient analog circuits could be implemented for the variety of cost functions described in this letter.

Appendix:  Soft-Threshold Activation for BPDN Using the Log-Barrier Relaxation

We first rewrite the desired BPDN problem, equation 2.2 with the ℓ1 cost function, in an extended formulation to make the variables nonnegative. Define a new matrix through the concatenation operation Φ̃ = [Φ, −Φ]. Similarly, define a vector z = [z_+^T, z_−^T]^T of length 2N such that z_+, z_− ≥ 0 and a = z_+ − z_−. Essentially, z represents the original variables a by separating them into two subvectors depending on their sign. We can then write a constrained optimization program that is equivalent to BPDN:
$$\min_{z}\; \frac{1}{2}\left\|y - \tilde{\Phi} z\right\|_2^2 + \lambda \sum_{k=1}^{2N} z_k \quad \text{subject to } z_k \ge 0. \tag{A.1}$$
This reformulation is a standard way to show that ℓ1 cost penalties are equivalent to a linear penalty in a constrained optimization program. One can then apply the standard log-barrier relaxation to convert the program in equation A.1 to an approximately equivalent unconstrained program:
$$\min_{z}\; \frac{1}{2}\left\|y - \tilde{\Phi} z\right\|_2^2 + \lambda \sum_{k=1}^{2N} z_k - \frac{1}{t}\sum_{k=1}^{2N} \log\left(z_k\right). \tag{A.2}$$
As t → ∞, this program approaches the desired program, equation A.1. This relaxation strategy underlies an interior point algorithm (called the barrier method) for solving convex optimization programs, where equation A.2 is repeatedly solved with increasing values of t (Boyd & Vandenberghe, 2004).
Note that the relaxed problem in equation A.2 fits the form of the general optimization program stated in equation 2.2 with the differentiable cost function C(z_k) = z_k − (1/(λt)) log(z_k). For a fixed value of t, this cost function can be differentiated and used in the relationship given in equation 2.5 to solve for z_k in terms of u_k to find the corresponding invertible activation function:
$$T_{(\lambda,t)}(u_k) = \frac{(u_k - \lambda) + \sqrt{(u_k - \lambda)^2 + 4/t}}{2}.$$
Finally, it is straightforward to show that in the relaxation limit (t → ∞), where the program in equation A.2 approaches BPDN, the desired activation function becomes the (one-sided) soft-thresholding function:
$$T_\lambda(u_k) = \max\left(u_k - \lambda,\, 0\right).$$

To illustrate the convergence of this relaxation to the desired cost function and the corresponding soft-threshold activation function, Figure 5 plots the relaxed cost and activation functions for several values of t. Note that in the extended formulation of BPDN given in equation A.1, the variables occur in pairs where only one of them can be nonzero at a time. Because the activation function is zero for all state values with a magnitude less than the threshold, it is possible to represent each of these pairs of variables with one LCA node that can take on positive and negative values and whose activation function is a two-sided soft-thresholding function (thereby reducing the number of nodes back down to N).
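A small numerical check of this limit, using the relaxed activation function reconstructed above (the expression and names are part of our reconstruction):

```python
import numpy as np

def log_barrier_activation(u, lam, t):
    """Activation for the log-barrier relaxation of BPDN (nonnegative variables):
    the positive root of z**2 - (u - lam)*z - 1/t = 0."""
    b = u - lam
    return 0.5 * (b + np.sqrt(b ** 2 + 4.0 / t))

# As t grows, the output approaches the one-sided soft threshold max(u - lam, 0).
for t in (1e0, 1e2, 1e4):
    print(t, log_barrier_activation(0.8, 1.0, t), log_barrier_activation(1.5, 1.0, t))
```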

Figure 5:

Log-barrier relaxations of BPDN. (a) The cost function approaches the ideal ℓ1 norm as the relaxation parameter is increased. (b) In a similar way, the nonlinear activation function derived for the LCA approaches the ideal soft-thresholding operator as the relaxation parameter is increased.


Acknowledgments

We are grateful to B. Olshausen and J. Romberg for valuable discussions related to this work. This work was partially supported by NSF grant CCF-0905346.

References

Antoniadis, A., & Fan, J. (2001). Regularization of wavelet approximations. Journal of the American Statistical Association, 96, 939–967.
Balavoine, A., Romberg, J., & Rozell, C. (2012). Convergence and rate analysis of neural networks for sparse approximation. IEEE Transactions on Neural Networks and Learning Systems, 23, 1377–1389.
Battaglia, P., Jacobs, R., & Aslin, R. (2003). Bayesian integration of visual and auditory signals for spatial localization. JOSA A, 20, 1391–1397.
Baum, E., Moody, J., & Wilczek, F. (1988). Internal representations for associative memory. Biological Cybernetics, 59, 217–228.
Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge University Press.
Cadieu, C. F., & Olshausen, B. A. (2012). Learning intermediate-level representations of form and motion from natural movies. Neural Computation, 24, 827–866.
Candès, E., Wakin, M., & Boyd, S. (2008). Enhancing sparsity by reweighted ℓ1 minimization. Journal of Fourier Analysis and Applications, 14, 877–905.
Charles, A. S., Yap, H. L., & Rozell, C. J. (2012). Short-term memory capacity in recurrent networks via compressed sensing. Cosyne Abstracts 2012, Salt Lake City, UT.
Chartrand, R., & Yin, W. (2008). Iteratively reweighted algorithms for compressive sensing. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 3869–3872). Piscataway, NJ: IEEE.
Coen-Cagli, R., Dayan, P., & Schwartz, O. (2012). Cortical surround interactions and perceptual salience via natural scene statistics. PLoS Comput. Biol., 8, e1002405.
Doya, K. (2007). Bayesian brain: Probabilistic approaches to neural coding. Cambridge, MA: MIT Press.
Elad, M., Figueiredo, M., & Ma, Y. (2010). On the role of sparse and redundant representations in image processing. Proceedings of the IEEE, 98, 972–982.
Elad, M., Matalon, B., & Zibulevsky, M. (2007). Coordinate and subspace optimization methods for linear least squares with non-quadratic regularization. Applied and Computational Harmonic Analysis, 23, 346–367.
Eldar, Y. C., Kuppinger, P., & Bolcskei, H. (2010). Block-sparse signals: Uncertainty relationships and efficient recovery. IEEE Transactions on Signal Processing, 58, 3042–3054.
Fan, J. (1997). Comments on "Wavelets in statistics: A review" by A. Antoniadis. Statistical Methods and Applications, 6, 131–138.
Figueiredo, M.A.T., & Nowak, R. D. (2001). Wavelet-based image estimation: An empirical Bayes approach using Jeffreys' noninformative prior. IEEE Transactions on Image Processing, 10, 1322–1331.
Figueiredo, M.A.T., Nowak, R. D., & Wright, S. J. (2007). Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems. IEEE Journal of Selected Topics in Signal Processing, 1, 586–597.
Gao, H. (2001). Wavelet shrinkage denoising using the non-negative Garrote. Journal of Computational and Graphical Statistics, 7, 469–488.
Garrigues, P., & Olshausen, B. (2010). Group sparse coding with a Laplacian scale mixture prior. In J. Lafferty, C.K.I. Williams, J. Shawe-Taylor, R. S. Zemel, & A. Culotta (Eds.), Advances in neural information processing systems, 23 (pp. 1–9). Red Hook, NY: Curran.
Haider, B., Krause, M., Duque, A., Yu, Y., Touryan, J., Mazer, J., et al. (2010). Synaptic and network mechanisms of sparse and reliable visual cortical activity during nonclassical receptive field stimulation. Neuron, 65, 107–121.
Hopfield, J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79, 2554–2558.
Hu, T., Genkin, A., & Chklovskii, D. B. (2012). A network of spiking neurons for computing sparse representations in an energy efficient way. Neural Computation, 24, 2852–2872.
Huber, P. J. (1973). Robust regression: Asymptotics, conjectures and Monte Carlo. Annals of Statistics, 1, 799–821.
Hürlimann, F., Kiper, D., & Carandini, M. (2002). Testing the Bayesian model of perceived speed. Vision Research, 42, 2253–2257.
Karklin, Y., & Lewicki, M. (2008). Emergence of complex cell properties by learning to generalize in natural scenes. Nature, 457, 83–86.
Khajehnejad, M., Xu, W., Avestimehr, S., & Hassibi, B. (2010). Improved sparse recovery thresholds with two-step reweighted ℓ1 minimization. arXiv:1004.0402.
Lennie, P. (2003). The cost of cortical computation. Current Biology, 13, 493–497.
Nikolova, M. (2000). Local strong homogeneity of a regularized estimator. SIAM Journal on Applied Mathematics, 61, 633–658.
Olshausen, B., & Field, D. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.
Olshausen, B. A., & Field, D. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37, 3311–3325.
Olshausen, B., & Field, D. (2004). Sparse coding of sensory inputs. Current Opinion in Neurobiology, 14, 481–487.
Perrinet, L., Samuelides, M., & Thorpe, S. (2004). Sparse spike coding in an asynchronous feed-forward multi-layer neural network using matching pursuit. Neurocomputing, 57, 125–134.
Rao, R., & Ballard, D. (1999). Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2, 79–87.
Rehn, M., & Sommer, F. T. (2007). A network that uses few active neurones to code visual input predicts the diverse shapes of cortical receptive fields. Journal of Computational Neuroscience, 22, 135–146.
Rozell, C., & Garrigues, P. (2010). Analog sparse approximation for compressed sensing recovery. In Proceedings of the 2010 ASILOMAR Conference on Signals, Systems and Computers (pp. 822–826). Piscataway, NJ: IEEE.
Rozell, C. J., Johnson, D. H., Baraniuk, R. G., & Olshausen, B. A. (2010). Sparse coding via thresholding and local competition in neural circuits. Neural Computation, 20, 2526–2563.
Saab, R., Chartrand, R., & Yilmaz, O. (2008). Stable sparse approximations via nonconvex optimization. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 3885–3888). Piscataway, NJ: IEEE.
Schwartz, O., & Simoncelli, E. (2001). Natural signal statistics and sensory gain control. Nature Neuroscience, 4, 819–825.
Seriès, P., Lorenceau, J., & Frégnac, Y. (2003). The silent surround of V1 receptive fields: Theory and experiments. Journal of Physiology-Paris, 97, 453–474.
Shapero, S., Rozell, C. J., & Hasler, P. (2012). Configurable hardware integrate and fire neurons for sparse approximation. Manuscript submitted for publication.
Shapero, S., Charles, A., Rozell, C. J., & Hasler, P. (in press). Low power sparse approximation on reconfigurable analog hardware. IEEE Journal on Emerging and Selected Topics in Circuits and Systems.
Spratling, M. (2011). A single functional model accounts for the distinct properties of suppression in cortical area V1. Vision Research, 51, 563–576.
Tikhonov, A. (1963). Regularization of incorrectly posed problems. Soviet Math. Dokl., 4, 1624–1627.
Twigg, C., & Hasler, P. (2009). Configurable analog signal processing. Digital Signal Processing, 19, 904–922.
Vinje, W., & Gallant, J. (2002). Natural stimulation of the nonclassical receptive field increases information transmission efficiency in V1. Journal of Neuroscience, 22, 2904–2915.
Wipf, D., & Nagarajan, S. (2010). Iterative reweighted ℓ1 and ℓ2 methods for finding sparse solutions. IEEE Journal of Selected Topics in Signal Processing, 4, 317–329.
Wright, J., Ma, Y., Mairal, J., Sapiro, G., Huang, T., & Yan, S. (2010). Sparse representation for computer vision and pattern recognition. Proceedings of the IEEE, 98, 1031–1044.
Zhu, M., & Rozell, C. J. (2012). Population characteristics and interpretations of NCRF effects emerging from sparse coding. Cosyne Abstracts 2012, Salt Lake City, UT.
Zou, H. (2006). The adaptive LASSO and its oracle properties. Journal of the American Statistical Association, 101, 1418–1429.
Zylberberg, J., Murphy, J. T., & DeWeese, M. R. (2011). A sparse coding model with synaptically local plasticity and spiking neurons can account for the diverse shapes of V1 simple cell receptive fields. PLoS Comput. Biol., 7, e1002250.

Note

1. We also note that a cost function might be easily implementable even in the absence of an analytical formula for the activation function simply by using numerical integration to find a solution and fitting the resulting curve.