## Abstract

The sparse coding hypothesis has generated significant interest in the computational and theoretical neuroscience communities, but there remain open questions about the exact quantitative form of the sparsity penalty and the implementation of such a coding rule in neurally plausible architectures. The main contribution of this work is to show that a wide variety of sparsity-based probabilistic inference problems proposed in the signal processing and statistics literatures can be implemented exactly in the common network architecture known as the locally competitive algorithm (LCA). Among the cost functions we examine are approximate $\ell_p$ norms, modified $\ell_1$ norms, block-$\ell_1$ norms, and reweighted algorithms. Of particular interest, we show significantly increased performance in reweighted algorithms by inferring all parameters jointly in a dynamical system rather than using the iterative approach native to digital computational architectures.

## 1. Introduction

New experimental approaches over the past decades have provided a closer look at how sensory nervous systems such as the visual cortex process information about their environment. Over this time, it has become increasingly evident that the canonical linear-nonlinear model, where cells encode visual information via linear filtering followed by a nonlinearity (e.g., thresholding and saturation), is inadequate to describe the complex processing performed by the sensory cortex. For example, this type of linear-nonlinear model does not capture the rich variety of nonlinear response properties and contextual modulations observed in V1 (Seriès, Lorenceau, & Frégnac, 2003).

Many theoretical neuroscientists have postulated high-level coding and computational principles for sensory cortex to extend our understanding of these systems. In many cases, these proposals are based generally around probabilistic Bayesian inference (Doya, 2007) due to the natural fit with ecological goals and evidence from perceptual tasks in humans (Battaglia, Jacobs, & Aslin, 2003; Hürlimann, Kiper, & Carandini, 2002). Many other researchers have postulated complementary models based on ideas of efficient coding, where information is encoded by removing redundant aspects of the stimulus. A wide variety of interesting models have appeared related to this broad principle of efficient coding, with selected examples including models using predictive coding (Rao & Ballard, 1999; Spratling, 2011), divisive normalization (Schwartz & Simoncelli, 2001), and directly encoding statistical variations (Karklin & Lewicki, 2008; Coen-Cagli, Dayan, & Schwartz, 2012).

The sparse coding hypothesis is one interpretation of efficient coding that has generated significant interest in the computational and theoretical neuroscience communities. In this model, a population of cells performs Bayesian inference to determine the environmental causes of a stimulus, with a goal of using as few simultaneously active units in the encoding as possible. Distributed sparse neural codes have several potential benefits over dense linear codes, including explicit information representation and easy decodability at higher processing stages (Olshausen & Field, 2004), metabolic efficiency (due to the significant cost of producing and transmitting action potentials; Lennie, 2003), and increased capacity of associative and sequence memory models (Baum, Moody, & Wilczek, 1988; Charles, Yap, & Rozell, 2012). The interest in the sparse coding model was originally generated when it was shown that this simple principle (combined with the statistics of natural images) is sufficient to explain the emergence of V1 receptive field shapes both qualitatively (Olshausen & Field, 1996) and quantitatively (Rehn & Sommer, 2007). More recently, electrophysiology experiments report results consistent with sparse coding (Haider et al., 2010; Vinje & Gallant, 2002), and simulation results have demonstrated that the sparse coding model can account for a wide variety of nonlinear response properties (called nonclassical receptive field effects) reported in single cells and population studies of V1 (Zhu & Rozell, 2012).

Despite this interest, there are many open fundamental questions related to the sparse coding model. First, what exactly is the proper notion of sparsity to use during inference? The original work in the computational neuroscience literature proposed several potential sparsity-inducing cost functions (Olshausen & Field, 1996), and recent work (motivated by strong theoretical results in the signal processing and applied mathematics communities) has seen people gravitate toward the $\ell_1$ norm. While the main qualitative results appear to be relatively robust to the detailed choice of the sparsity-inducing cost function, the broader signal processing and statistics communities have proposed several alternative cost functions that have appealing computational or statistical properties and may be valuable alternatives. Second, how would such a coding principle be implemented in biologically plausible computational architectures? The computation necessary to implement an inference process with a sparsity penalty amounts to solving a nonsmooth optimization problem that is notoriously challenging to solve (e.g., many gradient-based methods are wildly inefficient due to the nonsmooth nature of the objective). Recent theoretical work has demonstrated several network architectures that can efficiently compute sparse coefficients (Rehn & Sommer, 2007; Rozell, Johnson, Baraniuk, & Olshausen, 2010; Zylberberg, Murphy, & DeWeese, 2011; Perrinet, Samuelides, & Thorpe, 2004). Interestingly, the sparse coding problem has become very prominent in modern signal processing, for example, for use in inverse problems (Elad, Figueiredo, & Ma, 2010) and computer vision (Wright et al., 2010). There is also increasing interest in leveraging the computational benefits of analog neuromorphic architectures for these problems (Shapero, Charles, Rozell, & Hasler, in press; Shapero, Rozell, & Hasler, 2012).

The main contribution of this work is to show that a wide variety of sparsity-based probabilistic inference problems can be implemented exactly in the common network architecture known as the locally competitive algorithm (LCA) (Rozell et al., 2010). The LCA is a type of Hopfield network specifically designed to incorporate nonlinear thresholding elements that make it particularly efficient for solving the nonsmooth optimization problems necessary for sparse coding. In particular, we examine sparsity-based approaches described in the recent statistics and signal processing literature to show that many proposed signal models based on sparsity principles can be implemented efficiently in this common neural architecture. Among the cost functions we examine are approximate $\ell_p$ norms, modified $\ell_1$ norms that combine the desirable properties of different statistical models, block-$\ell_1$ norms for use in hierarchical models that impose correlations among the active variables, and reweighted algorithms that use a hierarchical probabilistic model to achieve more efficient encodings. Of particular interest, we show significantly increased performance in reweighted algorithms by inferring all parameters jointly in a dynamical system rather than using an iterative approach native to digital computational architectures. Preliminary results related to this work were reported in Rozell and Garrigues (2010).

## 2. Background and Related Work

### 2.1. Sparse Coding.

In the sparse coding model, a stimulus $x$ is approximated as a linear combination of dictionary elements (the columns of a matrix $\Phi$) weighted by coefficients $a$ and corrupted by additive gaussian noise. The coefficients are generally inferred via MAP estimation, which results in solving the nonlinear optimization problem
$$\hat{a} = \arg\min_a E, \qquad E = \tfrac{1}{2}\,\| x - \Phi a \|_2^2 + \lambda\, C(a), \tag{2.2}$$
where $C(\cdot)$ is a cost function penalizing $a$ based on its fit with the signal model and $\lambda$ is a parameter denoting the relative trade-off between the data fidelity term (i.e., MSE, which arises from the log likelihood of the gaussian noise) and the cost function. The cost function is the nonlinear sparsity-inducing regularization term, corresponding to the log prior of the data model. More details about the formulation of this problem in the Bayesian inference framework can be found in Olshausen and Field (1997). Basic signal models frequently assume independence among the elements of $a$, resulting in a cost function that separates into a sum of individual costs, $C(a) = \sum_k C(a_k)$. One common example is the $\ell_1$ norm, defined as $\|a\|_1 = \sum_k |a_k|$, which yields the program known as basis pursuit denoising (BPDN).

### 2.2. Dynamical Systems for Minimization

Recent work in computational neuroscience has shown that the LCA dynamical system provably solves the optimization programs in equation 2.2 and is efficient for solving the nonsmooth problems of interest in sparse approximation. The LCA architecture (Rozell et al., 2010) comprises a network of analog nodes driven by the signal to be approximated. Each node competes with neighboring nodes for a chance to represent the signal, and the steady-state response represents the solution to the optimization problem.

The LCA is a specific instance of the Hopfield neural network architecture that has a long history of being used to solve optimization problems (Hopfield, 1982). It is a neurally plausible architecture, consisting of a network of parallel nodes that use computational primitives well matched to individual neuron models. In particular, each node consists of a leaky integrator and a nonlinear thresholding function, and it is driven by both feedforward and lateral (inhibitory and excitatory) recurrent connections. This architecture has been implemented in neuromorphic hardware as a purely analog system (Shapero, Charles et al., in press) and by using integrate-and-fire spiking neurons for each node (Shapero, Rozell, et al., 2012). We also note that other types of network structures have also been proposed recently to approximately solve specific versions of the sparse approximation problem (Rehn & Sommer, 2007; Perrinet et al., 2004; Zylberberg et al., 2011; Hu, Genkin, & Chklovskii, 2012).

The $k$th node of the LCA is associated with the $k$th column $\phi_k$ of $\Phi$. Without loss of generality, we assume that each column has unit norm. This node is described at a given time $t$ by an internal state variable $u_k(t)$. The coefficients $a$ are related to the internal states $u$ via an activation (thresholding) function $a = T_\lambda(u)$ that is parameterized by $\lambda$. In the important special case when the cost function is separable, the output of each node can be calculated independently of all other nodes by a pointwise activation function $a_k = T_\lambda(u_k)$. Individual nodes are leaky integrators driven by an input proportional to $\langle \phi_k, x \rangle$, and competition between nodes occurs by lateral connections that allow highly active nodes to suppress nodes with less activity. The dynamics for node $k$ are given by
$$\tau \dot{u}_k(t) = \langle \phi_k, x \rangle - u_k(t) - \sum_{j \neq k} \langle \phi_k, \phi_j \rangle\, a_j(t),$$
where $\tau$ is the system time constant. In vector form, the dynamics for the whole network are given by
$$\tau \dot{u}(t) = \Phi^T x - u(t) - (\Phi^T \Phi - I)\, a(t).$$

For the energy $E$ given in equation 2.2 with a separable, continuous, and piecewise differentiable cost function, the path induced by the LCA (using the outputs $a_k(t)$ as the optimization variable) ensures $\dot{E} \leq 0$ when the cost function satisfies
$$\lambda\, \frac{dC(a_k)}{da_k} = u_k - a_k = T_\lambda^{-1}(a_k) - a_k, \tag{2.5}$$
where $T_\lambda$ is nondecreasing. We use the notation $T_\lambda^{-1}$ for convenience when the activation function is invertible, but this invertibility is not strictly required (i.e., the relationship in equation 2.5 involving just $u_k - a_k$ is sufficient). The same arguments also extend to the more general case of nonseparable cost functions, ensuring $\dot{E} \leq 0$ when
$$\lambda\, \nabla C(a) = u - a. \tag{2.6}$$
Recent follow-up work (Balavoine, Romberg, & Rozell, 2012) establishes stronger guarantees on the LCA, specifically showing that this system is globally convergent to the minimum of $E$ (which may be a local minimum if $E$ is not convex) and proving that the system converges exponentially fast with an analytically bounded convergence rate.
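As a concrete illustration of these dynamics, the network can be simulated directly. The following is a minimal sketch (not the authors' implementation) that Euler-integrates the vector LCA dynamics $\tau \dot{u}(t) = \Phi^T x - u(t) - (\Phi^T \Phi - I)\,a(t)$ from Rozell et al. (2010) with the soft-threshold activation corresponding to the $\ell_1$ (BPDN) cost; the problem sizes, step size, and constants are arbitrary demo choices.

```python
import numpy as np

def soft_threshold(u, lam):
    """Soft-thresholding activation T_lam(u), the LCA nonlinearity for the l1 cost."""
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def lca_bpdn(Phi, x, lam=0.1, tau=0.01, dt=0.001, n_steps=2000):
    """Euler integration of tau * du/dt = Phi^T x - u - (Phi^T Phi - I) a."""
    N = Phi.shape[1]
    G = Phi.T @ Phi - np.eye(N)   # lateral inhibition/excitation strengths
    b = Phi.T @ x                 # feedforward drive
    u = np.zeros(N)
    for _ in range(n_steps):
        a = soft_threshold(u, lam)
        u += (dt / tau) * (b - u - G @ a)
    return soft_threshold(u, lam)

# Demo: recover a 3-sparse code from noisy projections.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((64, 128))
Phi /= np.linalg.norm(Phi, axis=0)          # unit-norm columns, as assumed in the text
a_true = np.zeros(128)
a_true[[3, 40, 90]] = [1.0, -0.8, 1.2]
x = Phi @ a_true + 0.01 * rng.standard_normal(64)
a_hat = lca_bpdn(Phi, x)                    # steady-state outputs approximate BPDN
```

At steady state, the thresholded outputs approximate the BPDN solution: the three planted coefficients dominate the recovered code, shrunk slightly toward zero by the $\ell_1$ penalty.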

## 3. Alternate Inference Problems in the LCA Architecture

Using the basic relationships described in equations 2.5 and 2.6, we can optimize a variety of cost functions in the same basic LCA structure by analytically determining the corresponding activation function.^{1} These optimization programs include approximate $\ell_p$ norms, modified $\ell_1$ norms that attempt to achieve better statistical properties than BPDN, the group/block $\ell_1$ norm that induces coactivation structure on the nonzero coefficients, reweighted $\ell_1$ and $\ell_2$ algorithms that represent hierarchical statistical models on the coefficients, and classic Tikhonov regularization.

Before exploring specific alternate cost functions in the remainder of this section, it is worthwhile to offer a technical note regarding the optimization programs that are possible to implement in the LCA architecture. The strong theoretical convergence guarantees established for the LCA (Balavoine et al., 2012) apply to a wide variety of possible systems, but they do impose some conditions on the permissible activation functions. We will rely on these same conditions to analytically determine the relationship between the cost and activation functions for the examples in this section. Translated to conditions on the cost functions, the convergence results for the LCA (Balavoine et al., 2012) require that the cost functions be positive ($C(a) \geq 0$) and symmetric ($C(a) = C(-a)$) and satisfy the condition that the matrix $I + \lambda \nabla^2 C(a)$ is positive definite (i.e., $1 + \lambda\, d^2C(a_k)/da_k^2 > 0$ for separable cost functions). This last condition can intuitively be viewed as requiring that the activation function resulting from equation 2.6 has only a single output for a given input.

Some of the cost functions considered here have nonzero derivatives at the origin, leading to a range of values around the origin where the activation function is not defined according to the relationship in equation 2.5. In these cases, the smallest input value for which the threshold function is defined results in a zero-valued output. Since the second derivative condition on the cost function constrains the activation function to be nondecreasing, we can infer that the only allowable value of the activation function must be zero for the regions that are not well characterized by the relationship in equation 2.5. Finally, we note that in most cases, we will consider the behavior of the activation function only for $u_k > 0$ because the behavior for $u_k < 0$ is implied by the symmetry condition.

### 3.1. Approximate $\ell_p$ Norms

Perhaps the most widely used family of cost functions are the $\ell_p$ norms, $C(a) = \sum_k |a_k|^p$. These separable cost functions include ideal sparse approximation (i.e., counting nonzeros), BPDN, and Tikhonov regularization (Tikhonov, 1963) as special cases ($p = 0$, $1$, and $2$, respectively) and are convex for $p \geq 1$. Furthermore, recent research has shown some benefits of using nonconvex $\ell_p$ norms ($p < 1$) for inverse problems with sparse signal models (Saab, Chartrand, & Yilmaz, 2008; Elad, Matalon, & Zibulevsky, 2007). While the ideal activation functions can be determined exactly for the three special cases mentioned above, it is not possible to analytically determine the activation function for arbitrary values of $p$. Elad et al. (2007) introduced several parameterized approximations to the $\ell_p$ cost functions that are more amenable to analysis. In this section, we use these same approximations to determine activation functions for minimizing approximate $\ell_p$ norms.

#### 3.1.1. Approximate $\ell_p$ for $1 < p < 2$.

The first approximate cost function (Elad et al., 2007) is parameterized by two constants, $s$ and $c$. In the limiting cases, $c = 1$ yields the $\ell_1$ norm and $c = 2s$ yields the $\ell_2$ norm. Three intermediate examples for $p = 1.25$, $1.5$, and $1.75$ are shown in Figure 1. For any specific value of $p$, we find the best values of $c$ and $s$ by using standard numerical optimization techniques to minimize the squared error to the true $\ell_p$ cost function over the interval $[0, 2]$. From this cost function, we can differentiate to obtain the relationship between each $u_k$ and $a_k$ via equation 2.5. In the limiting case with $c = 1$, we obtain the soft-thresholding function for BPDN, while with $c = 2s$, we obtain a linear amplifier (i.e., the activation function for Tikhonov regularization). Solving for $a_k$ in terms of $u_k$ (restricting the solution to be positive and increasing) yields a general relationship for the activation function, shown in Figure 1 for $p = 1.25$, $1.5$, and $1.75$.

#### 3.1.2. Approximate $\ell_p$ for $0 < p < 1$.

The second approximate cost function (Elad et al., 2007) is likewise parameterized by constants $s$ and $c$, where the parameters $c > 0$ and $s > 0$ can be optimized as above to approximate different values of $p$. Three approximations for $p = 0.5$, $0.75$, and $0.9$ are shown in Figure 1. To determine the activation function, we again differentiate and apply the relationship in equation 2.5. Solving for $a_k$ reduces to solving a quadratic equation, which leads to two possible solutions. As above, we restrict the activation function to include only the solution that is positive and increasing. The resulting activation function is valid only over the range of inputs where the output is a positive real number; below that range, the output is zero.
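When no closed form for the activation is available, the numerical route mentioned in the note accompanying this section applies: tabulate the forward relationship of equation 2.5, $u_k = a_k + \lambda\, dC(a_k)/da_k$, on a grid and invert it by interpolation. Below is a sketch assuming the smooth case $C(a) = |a|^p$ with $p = 1.5$ (our example choice); for $p < 1$, the forward map is nonmonotone near the origin, which is one reason the approximations above are needed.

```python
import numpy as np

lam, p = 0.5, 1.5

# Forward map from equation 2.5 (separable case): u = a + lam * dC/da for C(a) = |a|^p.
a_grid = np.linspace(0.0, 5.0, 10001)
u_grid = a_grid + lam * p * a_grid ** (p - 1)   # monotone increasing for p > 1

def activation(u):
    """Numerically inverted activation a = T_lam(u); odd symmetry handles u < 0."""
    return np.interp(np.abs(u), u_grid, a_grid) * np.sign(u)

# Sanity check: inverting and then reapplying the forward map recovers the input.
u_test = np.array([0.2, 1.0, 3.0])
a_test = activation(u_test)
```

The grid-based inverse is valid because the forward map is strictly increasing here; for nonmonotone maps, the consistency restrictions discussed in the text must be imposed first.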

### 3.2. Modified $\ell_1$ Norms

While the general $\ell_p$ norms have historically been very popular cost functions, many have noted that this approach can exhibit undesirable statistical properties in some instances (e.g., BPDN can result in biased estimates of large coefficients; Zou, 2006). To address these issues, many researchers in signal processing and statistics have proposed modified cost functions that attempt to alleviate these statistical concerns. For example, hybrid cost functions smoothly morph between different norms to capture the most desirable characteristics over different regions. In this section, we demonstrate that many of these modified norms can also be implemented in the basic LCA architecture.

#### 3.2.1. Smoothly Clipped Absolute Deviations.

The smoothly clipped absolute deviation (SCAD) cost function behaves like the $\ell_1$ norm near the origin and flattens out for large coefficient values, so large coefficients are not penalized (and therefore not biased). Differentiating the cost and applying equation 2.5 yields $a_k$ as a function of $u_k$. For SCAD (and all of the piecewise cost functions we consider), the activation function can be determined individually for each region, paying careful attention to the ranges of the inputs $u_k$ and outputs $a_k$ to ensure consistency. In the first region, the cost behaves like the $\ell_1$ norm, implying that $a_k = 0$ below threshold and soft thresholding just above it. In the intermediate region, the activation is linear in $u_k$ with a steeper slope. Finally, in the third region, we have $a_k = u_k$, giving the full activation function shown in Figure 2. Note that this activation function requires $\gamma > 2$ (Antoniadis & Fan, 2001, recommend a value of $\gamma = 3.7$). While this is apparent from consistency arguments once the thresholding function has been derived, this restriction on $\gamma$ can also be deduced from the positive-definiteness condition on the cost function described at the start of this section.
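The piecewise activation described above matches the standard SCAD thresholding estimator (Fan & Li, 2001). A sketch in our notation, writing $\lambda$ for the threshold and $\gamma$ for the second SCAD parameter (with the recommended value $\gamma = 3.7$); this is an illustration of the rule, not the authors' code:

```python
import numpy as np

def scad_threshold(u, lam=1.0, gamma=3.7):
    """SCAD activation: soft threshold for |u| <= 2*lam, a steeper linear
    transition for 2*lam < |u| <= gamma*lam, and the identity beyond
    (so large coefficients pass through unbiased)."""
    u = np.asarray(u, dtype=float)
    small = np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)
    mid = ((gamma - 1.0) * u - np.sign(u) * gamma * lam) / (gamma - 2.0)
    return np.where(np.abs(u) <= 2 * lam, small,
                    np.where(np.abs(u) <= gamma * lam, mid, u))

u = np.linspace(-6.0, 6.0, 1201)
a = scad_threshold(u)   # continuous, nondecreasing; requires gamma > 2
```

The three pieces meet continuously at $|u| = 2\lambda$ (value $\lambda$) and $|u| = \gamma\lambda$ (value $\gamma\lambda$), and the transition slope $(\gamma - 1)/(\gamma - 2)$ is positive only for $\gamma > 2$, matching the consistency restriction noted above.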

#### 3.2.2. Transformed $\ell_1$.

The transformed $\ell_1$ cost function yields, through equation 2.5, a relationship between $u_k$ and $a_k$. Inverting this equation reduces to solving a cubic equation in $a_k$. The three roots can be calculated analytically, but only one root generates a viable thresholding function by being both positive and increasing for positive $u_k$. This root is viable only over the range of inputs for which it is real valued; outside this range, no viable nonzero solution exists, and so $a_k = 0$. The full thresholding function is shown in Figure 2.

#### 3.2.3. Huber Function.

#### 3.2.4. Amplitude Scale-Invariant Bayes Estimation.

### 3.3. Block $\ell_1$

The block $\ell_1$ framework applies when the coefficients are active in known groups. In this framework, the coefficients are divided into blocks, and each block of coefficients is represented as a vector $a^l$. For our purposes, we assume the blocks are nonoverlapping but may have different cardinalities. The block $\ell_1$ norm (Eldar, Kuppinger, & Bolcskei, 2010) is defined as the $\ell_1$ norm over the $\ell_2$ norms of the groups, $C(a) = \sum_l \|a^l\|_2$, essentially encouraging sparsity between the blocks (i.e., requiring only a few groups to be active) with no individual penalty on the coefficient values within a block. Because this cost is not separable, the activation function will no longer be a pointwise nonlinearity and will instead have multiple inputs and outputs.

Although inverting the resulting relationship for $a^l$ appears difficult, we note that we can simplify the equation by expressing $\|a^l\|_2$ in terms of $\|u^l\|_2$. To see this, take the $\ell_2$ norm of both sides of equation 3.1 to get $\|a^l\|_2 = \|u^l\|_2 - \lambda$. Substituting this back into equation 3.1, the relationship simplifies over the range $\|u^l\|_2 \geq \lambda$, implying that each block of states is rescaled by the factor $(1 - \lambda/\|u^l\|_2)$ and that blocks with $\|u^l\|_2 < \lambda$ are set to zero.
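The block thresholding just described is a group soft-threshold: each block is shrunk toward zero by $\lambda$ in norm, or zeroed entirely. A minimal sketch (representing blocks as lists of index arrays is our convention):

```python
import numpy as np

def block_soft_threshold(u, blocks, lam=1.0):
    """Block-l1 activation: shrink each block's l2 norm by lam, zeroing blocks
    whose norm falls below threshold (no sparsity within an active block)."""
    a = np.zeros_like(u, dtype=float)
    for idx in blocks:                      # blocks: list of index arrays
        norm = np.linalg.norm(u[idx])
        if norm > lam:
            a[idx] = u[idx] * (1.0 - lam / norm)
    return a

u = np.array([3.0, 4.0, 0.3, 0.4])
blocks = [np.array([0, 1]), np.array([2, 3])]
a = block_soft_threshold(u, blocks, lam=1.0)   # first block shrunk, second zeroed
```

Here the first block has norm 5, so it is rescaled by $1 - 1/5$; the second block has norm 0.5, below threshold, so both of its coefficients are set to zero regardless of their individual values.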

### 3.4. Reweighted $\ell_1$ and $\ell_2$

Reweighted algorithms iteratively solve a sequence of weighted $\ell_1$ or $\ell_2$ programs, with the weights updated between iterations according to equation 3.2 (in essence, coefficients that were small in the previous solution receive larger penalties, and coefficients that were large receive smaller ones). Such schemes have shown many empirical benefits over plain $\ell_1$ norm minimization, and recent work on reweighted $\ell_1$ has established theoretical performance guarantees (Khajehnejad, Xu, Avestimehr, & Hassibi, 2010) and interpretations as Bayesian inference in a probabilistic model (Garrigues & Olshausen, 2010).

One of the main drawbacks to reweighted algorithms in digital architectures is the time required for solving the weighted program multiple times. Of course, it is also not clear that a discrete iterative approach such as this could be mapped to an asynchronous analog computational architecture. Because we have established that the LCA architecture can solve the norm optimizations above (and weighted norms are a straightforward extension of those results), it immediately follows that a dynamical system could be used to perform the optimization necessary for each iteration of the algorithm. While this would be a viable strategy, we show here that even more advantages can be gained by performing the entire reweighted algorithm in the context of a dynamical system. Specifically, we consider a modified version of the LCA where an additional set of dynamics is placed on the weights in order to simultaneously optimize the coefficients and coefficient weights in an analog system. While the ideas here are expandable to the general reweighted case, we focus on results involving the reweighted $\ell_1$ algorithm as presented in Garrigues and Olshausen (2010).
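For comparison with the dynamic approach, the conventional digital scheme can be sketched as an outer loop of weighted $\ell_1$ solves. The sketch below assumes the common update $w_k = 1/(|a_k| + \epsilon)$ from the reweighted-$\ell_1$ literature (the hierarchical model of Garrigues & Olshausen, 2010, leads to a similar form) and uses ISTA as the inner weighted solver; all constants are demo choices.

```python
import numpy as np

def ista_weighted_l1(Phi, x, w, lam=0.05, n_iter=300):
    """ISTA for min_a 0.5*||x - Phi a||^2 + lam * sum_k w_k |a_k|."""
    L = np.linalg.norm(Phi, 2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        g = a - Phi.T @ (Phi @ a - x) / L    # gradient step on the data term
        a = np.sign(g) * np.maximum(np.abs(g) - lam * w / L, 0.0)
    return a

def reweighted_l1(Phi, x, n_outer=5, eps=0.1):
    """Outer loop: re-solve with weights w_k = 1/(|a_k| + eps) after each pass."""
    w = np.ones(Phi.shape[1])
    for _ in range(n_outer):
        a = ista_weighted_l1(Phi, x, w)
        w = 1.0 / (np.abs(a) + eps)          # penalize small coefficients more
    return a

rng = np.random.default_rng(1)
Phi = rng.standard_normal((50, 100))
Phi /= np.linalg.norm(Phi, axis=0)
a_true = np.zeros(100)
a_true[[7, 33, 71]] = [1.2, -1.0, 0.9]
x = Phi @ a_true
a_hat = reweighted_l1(Phi, x)
```

Each outer pass must wait for the inner solve to converge before the weights can be updated, which is exactly the serial bottleneck that the dynamically reweighted LCA avoids by evolving coefficients and weights simultaneously.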

At steady state, the weight dynamics reach a fixed point, showing that the weights abide by equation 3.2 with the steady-state gain acting as the proportionality constant. While the complete analysis of this expanded analog system is beyond the scope of this letter, we show in Figure 4a simulations demonstrating that this system reaches a solution of comparable quality to digital iterative methods. Figure 4a plots the relative MSE from a compressed sensing recovery problem with length-1000 vectors recovered from 500 noisy measurements with varying levels of sparsity. We sweep the sparsity parameter from zero to one and set the noise variance to 10^{−4}, with each plot representing the relative MSE averaged over 15 randomly chosen signals. Figure 4a plots the recovery quality for three systems: iterative reweighted $\ell_1$ (using GPSR to solve the $\ell_1$ iterations; Figueiredo, Nowak, & Wright, 2007), iterative reweighted $\ell_1$ (using the LCA to solve the iterations), and dynamic reweighted $\ell_1$, which uses the modified LCA described above. It is clear that the three systems achieve nearly the same quality in their signal recovery. Figure 4b plots the convergence of the recovery as a function of time (in terms of system time constants $\tau$) for the iterative and dynamic reweighted approaches using the LCA. The dynamically reweighted system clearly converges more quickly, achieving its final solution in approximately the time it takes to perform two iterations of the traditional reweighting scheme using the standard LCA.

## 4. Conclusion

Sparsity-based signal models have played a significant role in many theories of neural coding across multiple sensory modalities. Despite the interest in the sparse coding hypothesis from the computational and theoretical neuroscience communities, the qualitative nature of much of the supporting evidence leaves significant ambiguity about the ideal form for a sparsity-inducing cost function. While recent trends favor the $\ell_1$ norm due to the emergence of guarantees in the signal processing literature, there are many sparsity-inducing signal models that may have benefits for neural computation and should be candidate models for neural coding. We have shown here that many of the sparsity-inducing cost functions proposed in the signal processing and statistics literatures can be implemented in a single unified dynamical system.

From the results presented here, we conclude that neurally plausible computational architectures can support a wide variety of sparsity-based signal models, and it is therefore reasonable to consider this broad family of models as candidates for theories of sensory neural coding. Furthermore, we have shown that even a relatively complex hierarchical probabilistic model resulting in a reweighted inference scheme can be implemented efficiently in a purely analog system. This observation is particularly interesting because it illustrates a fundamental potential advantage of analog computation over digital systems. Specifically, the analog approach to this problem is able to continuously infer two sets of variables jointly, rather than taking an iterative approach that fundamentally must wait for the computation on one variable to fully converge in each iteration before updating the other variable.

Beyond the applicability of these results to theories of neural computation, the recent shift toward optimization as a fundamental computational tool in the modern signal processing toolbox has made it difficult to implement many of these algorithms in applications with significant power constraints or real-time processing requirements. The results of this letter broaden the scope of problems that could potentially be approached through efficient neuromorphic architectures. The design and implementation of analog circuits have traditionally been difficult, but recent advances in reconfigurable analog circuits (Twigg & Hasler, 2009) have alleviated many of the issues related to the design of these systems. In fact, the reconfigurable platform described in Twigg and Hasler has been used to implement a small version of the LCA for solving BPDN (Shapero, Charles et al., in press; Shapero, Rozell et al., 2012), and preliminary tests of this implementation are consistent with simulations of the idealized LCA. These results lend encouragement to the idea that efficient analog circuits could be implemented for the variety of cost functions described in this letter.

## Appendix: Soft-Threshold Activation for BPDN Using the Log-Barrier Relaxation

To derive the activation function for BPDN, we introduce a nonnegative variable $z = [z_+^T, z_-^T]^T$ of length $2N$ such that $z \geq 0$ and $a = z_+ - z_-$. Essentially, $z$ represents the original variables $a$ by separating them into two subvectors depending on their sign. We can then write a constrained optimization program that is equivalent to BPDN:
$$\min_{z \geq 0} \; \tfrac{1}{2}\, \| x - \Phi (z_+ - z_-) \|_2^2 + \lambda\, \mathbf{1}^T z. \tag{A.1}$$
This reformulation is a standard way to show that $\ell_1$ cost penalties are equivalent to a linear function in a constrained optimization program. One can then apply the standard log-barrier relaxation to convert the program in equation A.1 to an approximately equivalent unconstrained program:
$$\min_{z} \; \tfrac{1}{2}\, \| x - \Phi (z_+ - z_-) \|_2^2 + \lambda\, \mathbf{1}^T z - \frac{1}{c} \sum_k \log z_k. \tag{A.2}$$
As $c \rightarrow \infty$, this program approaches the desired program, equation A.1. This relaxation strategy underlies an interior point algorithm (called the barrier method) for solving convex optimization programs, where equation A.2 is repeatedly solved with increasing values of $c$ (Boyd & Vandenberghe, 2004).

Applying the relationship in equation 2.5 to the relaxed cost and solving for $z_k$ in terms of $u_k$ yields the corresponding invertible activation function. Finally, it is straightforward to show that in the relaxation limit ($c \rightarrow \infty$), where the program in equation A.2 approaches BPDN, the desired activation function becomes the (one-sided) soft-thresholding function $T_\lambda(u_k) = \max(u_k - \lambda, 0)$.

To illustrate the convergence of this relaxation to the desired cost function and the corresponding soft-threshold activation function, Figure 5 plots the relaxed cost function and its activation function for several values of $c$. Note that in the extended formulation of BPDN given in equation A.1, the variables occur in pairs where only one of them can be nonzero at a time. Because the activation function is zero for all state values with a magnitude less than threshold, it is possible to represent each of these pairs of variables in one LCA node that can take on positive and negative values and where the activation function is a two-sided soft-thresholding function (thereby reducing the number of nodes back down to $N$).
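The relaxed activation can also be written in closed form and checked numerically. Assuming the barrier enters as $-(1/c)\log z_k$ (our convention), the per-coordinate stationarity condition from equation 2.5 is $u = z + \lambda - 1/(cz)$, whose positive root converges to the one-sided soft threshold as $c \rightarrow \infty$:

```python
import numpy as np

def relaxed_activation(u, lam=1.0, c=10.0):
    """Positive root of z**2 + (lam - u) * z - 1/c = 0, i.e., the stationarity
    condition u = z + lam - 1/(c*z) of the log-barrier-relaxed program."""
    d = np.asarray(u, dtype=float) - lam
    return 0.5 * (d + np.sqrt(d ** 2 + 4.0 / c))

u = np.linspace(-3.0, 5.0, 9)
soft = np.maximum(u - 1.0, 0.0)                 # target: one-sided soft threshold
approx = relaxed_activation(u, lam=1.0, c=1e6)  # nearly indistinguishable
```

For any finite $c$, the output stays strictly positive (the barrier keeps $z_k$ in the feasible set), and the sharp threshold emerges only in the limit, mirroring the barrier-method interpretation above.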

## Acknowledgments

We are grateful to B. Olshausen and J. Romberg for valuable discussions related to this work. This work was partially supported by NSF grant CCF-0905346.

## References


## Note

^{1} We also note that a cost function might be easily implementable even in the absence of an analytical formula for the activation function, simply by using numerical integration to find a solution and fitting the resulting curve.