Abstract

Blind source separation—the extraction of independent sources from a mixture—is an important problem for both artificial and natural signal processing. Here, we address a special case of this problem when sources (but not the mixing matrix) are known to be nonnegative—for example, due to the physical nature of the sources. We search for the solution to this problem that can be implemented using biologically plausible neural networks. Specifically, we consider the online setting where the data set is streamed to a neural network. The novelty of our approach is that we formulate blind nonnegative source separation as a similarity matching problem and derive neural networks from the similarity matching objective. Importantly, synaptic weights in our networks are updated according to biologically plausible local learning rules.

1  Introduction

Extraction of latent causes, or sources, from complex stimuli is essential for making sense of the world. Such stimuli could be mixtures of sounds, mixtures of odors, or natural images. If supervision, or ground truth, about the causes is lacking, the problem is known as blind source separation.

The blind source separation problem can be solved by assuming a generative model, wherein the observed stimuli are linear combinations of independent sources, an approach known as independent component analysis (ICA) (Jutten & Hérault, 1991; Comon, 1994; Bell & Sejnowski, 1995; Hyvärinen & Oja, 2000). Formally, the stimulus at time t, x_t, is expressed as an n-component vector,
x_t = A s_t,    (1.1)
where A is an unknown but time-independent n × d mixing matrix and s_t represents the signals of d sources at time t. In this letter, we assume that n ≥ d.
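As a concrete illustration, the generative model of equation 1.1 can be simulated in a few lines. This is a sketch; the dimensions and the sparse-uniform source distribution are our illustrative choices, echoing the data set of section 4.1:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, T = 3, 5, 1000  # number of sources, mixture channels (n >= d), samples

# Nonnegative, well-grounded sources: each sample is zero with
# probability 0.5 and otherwise drawn uniformly from an interval.
S = rng.uniform(0.0, 1.0, size=(d, T)) * (rng.random((d, T)) < 0.5)

# The mixing matrix is NOT required to be nonnegative.
A = rng.standard_normal((n, d))

# Observed stimuli: each column is x_t = A s_t.
X = A @ S
```

Note that the mixture X generally has negative entries even though the sources do not, which is what makes the problem nontrivial.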

The goal of ICA is to infer the source signals, s_t, from the stimuli x_t. Whereas many ICA algorithms have been developed by the signal processing community (Comon & Jutten, 2010), most of them cannot be implemented by biologically plausible neural networks. Yet our brains can solve the blind source separation problem effortlessly (Bronkhorst, 2000; Asari, Pearlmutter, & Zador, 2006; Narayan et al., 2007; Bee & Micheyl, 2008; McDermott, 2009; Mesgarani & Chang, 2012; Golumbic et al., 2013; Isomura, Kotani, & Jimbo, 2015). Therefore, discovering a biologically plausible ICA algorithm is an important problem.

For an algorithm to be implementable by biological neural networks, it must satisfy (at least) the following requirements. First, it must operate in the online (or streaming) setting. In other words, the input data set is not available as a whole but is streamed one data vector at a time, and the corresponding output must be computed before the next data vector arrives. Second, the output of most neurons in the brain (either a firing rate or the synaptic vesicle release rate) is nonnegative. Third, the weights of synapses in a neural network must be updated using local learning rules; that is, each update may depend only on the activity of the corresponding pre- and postsynaptic neurons.

Given the nonnegative nature of neuronal output, we consider a special case of ICA where sources are assumed to be nonnegative, termed nonnegative independent component analysis (NICA; Plumbley, 2001, 2002). Of course, to recover the sources, one can use standard ICA algorithms that do not rely on the nonnegativity of sources, such as fastICA (Hyvärinen & Oja, 1997, 2000; Hyvarinen, 1999). Neural learning rules have been proposed for ICA (e.g., Linsker, 1997; Eagleman et al., 2001; Isomura & Toyoizumi, 2016). However, taking into account nonnegativity may lead to simpler and more efficient algorithms (Plumbley, 2001, 2003; Plumbley & Oja, 2004; Oja & Plumbley, 2004; Yuan & Oja, 2004; Zheng, Huang, Sun, Lyu, & Lok, 2006; Ouedraogo, Souloumiac, & Jutten, 2010; Li, Liang, & Risteski, 2016).

While most of the existing NICA algorithms have not met the biological plausibility requirements, in terms of online setting and local learning rules, there are two notable exceptions. First, Plumbley (2001) successfully simulated a neural network on a small data set, yet no theoretical analysis was given. Second, Plumbley (2003) and Plumbley and Oja (2004) proposed a nonnegative PCA algorithm for a streaming setting; however, its neural implementation requires nonlocal learning rules. Further, this algorithm requires prewhitened data, yet no streaming whitening algorithm was given.

Here, we propose a biologically plausible NICA algorithm. The novelty of our approach is that the algorithm is derived from the similarity matching principle, which postulates that neural circuits map more similar inputs to more similar outputs (Pehlevan, Hu, & Chklovskii, 2015). Previous work proposed various objective functions to find similarity matching neural representations and solved these optimization problems with biologically plausible neural networks (Pehlevan et al., 2015; Pehlevan & Chklovskii, 2014, 2015a, 2015b; Hu, Pehlevan, & Chklovskii, 2014). Here we apply these networks to NICA.

The rest of the letter is organized as follows. In section 2, we show that blind source separation, after a generalized prewhitening step, can be posed as a nonnegative similarity matching (NSM) problem (Pehlevan & Chklovskii, 2014). In section 3, using results from Pehlevan and Chklovskii (2014, 2015a), we show that both the generalized prewhitening step and the NSM step can be solved online by neural networks with local learning rules. Stacking these two networks leads to the two-layer NICA network. In section 4, we compare the performance of our algorithm to other ICA and NICA algorithms for various data sets.

2  Offline NICA via NSM

In this section, we first review Plumbley's analysis of NICA and then reformulate NICA as an NSM problem.

2.1  Review of Plumbley's Analysis

When source signals are nonnegative, the source separation problem simplifies. It can be solved in two straightforward steps: noncentered prewhitening and orthonormal rotation (Plumbley, 2002).

Noncentered prewhitening transforms x_t to h_t = F x_t, where F is a whitening matrix,1 such that ⟨(h − ⟨h⟩)(h − ⟨h⟩)ᵀ⟩ = I_d, where angled brackets denote an average over the source distribution and I_d is the identity matrix. Note that the mean of h_t is not removed in the transformation; otherwise, one would not be able to use the constraint that the sources are nonnegative (Plumbley, 2003).

Assuming that sources have unit variance,2

⟨(s − ⟨s⟩)(s − ⟨s⟩)ᵀ⟩ = I_d,    (2.1)
the combined effect of source mixing and prewhitening (Q = FA) is an orthonormal rotation. To see this, note that, by definition, h − ⟨h⟩ = Q(s − ⟨s⟩) and I_d = ⟨(h − ⟨h⟩)(h − ⟨h⟩)ᵀ⟩ = Q⟨(s − ⟨s⟩)(s − ⟨s⟩)ᵀ⟩Qᵀ = QQᵀ.

The second step of NICA relies on the following observation (Plumbley, 2002):

Theorem 1.

Suppose sources s are independent, nonnegative, and well grounded, that is, Prob(s_i < δ) > 0 for any δ > 0. Consider an orthonormal transformation y = Q s. Then Q is a permutation matrix with probability 1 if and only if y is nonnegative.

In the second step, we look for an orthonormal Q' such that y = Q'h is nonnegative. When found, Plumbley's theorem guarantees that y is a permutation of the sources. Several algorithms have been developed based on this observation (Plumbley, 2003; Plumbley & Oja, 2004; Oja & Plumbley, 2004; Yuan & Oja, 2004).

Note that only the sources but not necessarily the mixing matrix must be nonnegative. Therefore, NICA allows generative models, where features not only add up but also cancel each other, as in the presence of a shadow in an image (Plumbley, 2002). In this respect, NICA is more general than nonnegative matrix factorization (NMF; Lee & Seung, 1999; Paatero & Tapper, 1994) where both the sources and the mixing matrix are required to be nonnegative.

2.2  NICA as NSM

Next we reformulate NICA as an NSM problem. This reformulation will allow us to derive an online neural network for NICA in section 3. Our main departure from Plumbley's analysis is to work with similarity matrices rather than covariance matrices and a finite number of samples rather than the full probability distribution of the sources.

First, let us switch to matrix notation, where data matrices are formed by concatenating data column vectors: for example, X = [x_1, …, x_T], so that X = AS, where S = [s_1, …, s_T]. In this notation, we introduce a time-centering operation such that, for example, the time-centered stimuli are X^c = X (I_T − (1/T) 1_T 1_Tᵀ), where I_T is the T × T identity matrix and 1_T is a T-dimensional column vector whose elements are all 1's.

Our goal is to recover S from X, where A is unknown. We make two assumptions. First, the sources are nonnegative and decorrelated, (1/T) S^c S^cᵀ = I_d. Note that while general ICA and NICA problems are stated with the independence assumption on the sources, it is sufficient for our purposes that they are only decorrelated. Second, the mixing matrix, A, is full rank.

We propose that the source matrix, S, can be recovered from X in two steps, illustrated in Figure 1. The first step is generalized prewhitening: transform X to H = FX, where F is m × n with m ≥ d, so that (1/T) H^c H^cᵀ has d unit eigenvalues and m − d zero eigenvalues. When m = d, H is whitened; otherwise, channels of H are correlated. Such prewhitening is useful because it implies HᵀH = SᵀS, according to the following theorem:

Figure 1:

Illustration of the NICA algorithm. Two source channels (left) are linearly transformed to a two-dimensional mixture (middle). Prewhitening (right) gives an orthogonal transformation of the sources. Sources are then recovered by solving the NSM problem, equation 2.6. Green and red plus signs track two source vectors across mixing and prewhitening stages.


Theorem 2.
If F is such that H = FX obeys
(1/T) H^c H^cᵀ = V_d V_dᵀ,    (2.2)
an eigenvalue decomposition with V_dᵀ V_d = I_d, then
HᵀH = SᵀS.    (2.3)
Proof.
To see why equation 2.2 is sufficient, first note that (1/T) H^c H^cᵀ = (FA)(FA)ᵀ. This follows from the definition of H^c and equation 2.1. If equation 2.2 holds, then
(FA)(FA)ᵀ = V_d V_dᵀ.    (2.4)
In turn, this is sufficient to prove that (FA)ᵀ(FA) = I_d. To see that, assume an SVD decomposition FA = U Σ Wᵀ. Equation 2.4 implies that ΣᵀΣ = I_d; that is, the diagonal elements of ΣᵀΣ are all 1's. Then,
(FA)ᵀ(FA) = W ΣᵀΣ Wᵀ = I_d.    (2.5)
This gives us the desired result, HᵀH = Sᵀ(FA)ᵀ(FA)S = SᵀS.
Remark 1.

If m ≥ d, the channels of H are correlated, except in the special case m = d. The whitening used in Plumbley's analysis (Plumbley, 2002) requires m = d.
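For concreteness, a prewhitening map of this kind can be computed offline from an eigendecomposition of the sample covariance. The following is a sketch under our naming conventions, with m = d chosen for simplicity; it is not the online algorithm derived later in the letter:

```python
import numpy as np

def prewhiten(X, d):
    """Return H = F X and F, where the time-centered output H^c = F X^c
    has sample covariance (1/T) H^c H^c^T equal to the identity.
    The mean is deliberately NOT removed from H, as NICA requires."""
    T = X.shape[1]
    Xc = X - X.mean(axis=1, keepdims=True)        # time centering
    lam, U = np.linalg.eigh(Xc @ Xc.T / T)        # sample covariance eigendecomposition
    top = np.argsort(lam)[::-1][:d]               # top-d eigenvalues
    F = U[:, top].T / np.sqrt(lam[top])[:, None]  # rows scaled by 1/sqrt(lambda)
    return F @ X, F
```

By construction, the centered output has identity covariance, which is the property the subsequent NSM step relies on.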

The second step is NSM: solving the following optimization problem,

min_Y ‖ HᵀH − YᵀY ‖_F²,    (2.6)

where the optimization is performed over nonnegative d × T matrices Y, that is, Y ≥ 0 elementwise. We call equation 2.6 the NSM cost function (Pehlevan & Chklovskii, 2014). Because inner products quantify similarities, we call HᵀH and YᵀY input and output similarity matrices; their elements hold the pairwise similarities between input vectors and the pairwise similarities between output vectors, respectively. Then the cost function, equation 2.6, preserves the input similarity structure as much as possible under the nonnegativity constraint. Variants of equation 2.6 have been considered in the applied math literature under the name “nonnegative symmetric matrix factorization” for clustering applications (e.g., Kuang, Park, & Ding, 2012; Kuang, Yun, & Park, 2015).
Now we make our key observation. Using theorem 2, we can rewrite equation 2.6 as

min_{Y ≥ 0} ‖ SᵀS − YᵀY ‖_F².    (2.7)

Since both S and Y are nonnegative, rank-d matrices, Y = PS, where P is a permutation matrix, is a solution to this optimization problem, and the sources are successfully recovered.

Uniqueness of the solutions (up to permutations) is hard to establish. While both sufficient conditions and necessary and sufficient conditions for uniqueness exist, these are nontrivial to verify, and usually the verification is NP-complete (Donoho & Stodden, 2003; Laurberg, Christensen, Plumbley, Hansen, & Jensen, 2008; Huang, Sidiropoulos, & Swami, 2014). A review of related uniqueness results can be found in Huang et al. (2014). A necessary condition for uniqueness given in Huang et al. (2014) states that if the factorization SᵀS = YᵀY with nonnegative Y is unique (up to permutations), then each row of S contains at least one element that is equal to 0. This necessary condition is similar to Plumbley's well-groundedness requirement used in proving theorem 1.

The NSM problem can be solved by projected gradient descent,

Y ← max( Y + η Y ( HᵀH − YᵀY ), 0 ),    (2.8)

where the max(·, 0) operation is applied elementwise and η sets the size of the gradient step. Other algorithms can be found in Kuang et al. (2012, 2015) and Huang et al. (2014).
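A minimal offline sketch of this iteration follows; the function name, step-size choice, and initialization are our illustrative assumptions:

```python
import numpy as np

def nsm_projected_gradient(H, d, steps=500, seed=0):
    """Projected gradient descent on the NSM cost ||H^T H - Y^T Y||_F^2
    over Y >= 0: follow the negative gradient, then rectify elementwise."""
    rng = np.random.default_rng(seed)
    T = H.shape[1]
    G = H.T @ H                                  # input similarity matrix
    Y = np.maximum(rng.standard_normal((d, T)), 0.0)
    eta = 0.02 / np.linalg.norm(G, 2)            # conservative step size
    for _ in range(steps):
        grad = 4.0 * Y @ (Y.T @ Y - G)           # gradient of the Frobenius cost
        Y = np.maximum(Y - eta * grad, 0.0)      # project onto Y >= 0
    return Y
```

As noted in the text, this gradient descent can get stuck in a local minimum of the nonconvex cost, so the returned Y is not guaranteed to be a permutation of the sources.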

3  Derivation of NICA Neural Networks from Similarity Matching Objectives

Our analysis in the previous section revealed that the NICA problem can be solved in two steps: generalized prewhitening and nonnegative similarity matching. Here, we derive neural networks for each of these steps and stack them to give a biologically plausible two-layer neural network that operates in a streaming setting.

In a departure from the previous section, the number of output channels is reduced to the number of sources at the prewhitening stage rather than the later NSM stage (m = d). This assumption simplifies our analysis significantly. The full problem is addressed in appendix B.

3.1  Noncentered Prewhitening in a Streaming Input Setting

To derive a neurally plausible online algorithm for prewhitening, we pose generalized prewhitening, equation 2.2, as an optimization problem. Online minimization of this optimization problem gives an algorithm that can be mapped to the operation of a biologically plausible neural network.

Generalized prewhitening solves a constrained similarity alignment problem,

max_{H^c} Tr( X^cᵀ X^c H^cᵀ H^c )  subject to  H^cᵀ H^c ⪯ T I_T,    (3.1)

where X^c is the time-centered mixture of independent sources and H^c is an m × T matrix, constrained such that T I_T − H^cᵀ H^c is positive semidefinite. The solution of this objective aligns similarity matrices X^cᵀ X^c and H^cᵀ H^c so that their right singular vectors are the same (Pehlevan & Chklovskii, 2015a). Then the objective under the trace diagonalizes and its value is the sum of eigenvalue pair products. Since the eigenvalues of H^cᵀ H^c are upper-bounded by T, the objective, equation 3.1, is maximized by setting the eigenvalues of H^cᵀ H^c that pair with the top d eigenvalues of X^cᵀ X^c to T and the rest to zero. Hence, the optimal H satisfies the generalized prewhitening condition, equation 2.2 (Pehlevan & Chklovskii, 2015a). More formally:
Theorem 3 (modified from Pehlevan & Chklovskii, 2015a).
Suppose an eigendecomposition of X^cᵀ X^c is X^cᵀ X^c = V Λ Vᵀ, where eigenvalues are sorted in decreasing order. Then all optimal H^c of equation 3.1 have an SVD decomposition of the form

H^c = √T U Λ^H Vᵀ,    (3.2)

where Λ^H is m × T with d ones on top of the diagonal and zeros on the rest of the diagonal.

The theorem implies that, first, (1/T) H^c H^cᵀ has d unit eigenvalues, and hence H satisfies the generalized prewhitening condition, equation 2.2. Second, F, the linear mapping between X and H, can be constructed using an SVD decomposition of X^c and equation 3.2.

The constraint in equation 3.1 can be introduced into the objective function using as a Lagrange multiplier the Gramian of an auxiliary matrix G^c:

max_{H^c} min_{G^c} Tr( X^cᵀ X^c H^cᵀ H^c ) − Tr( G^cᵀ G^c ( H^cᵀ H^c − T I_T ) ).    (3.3)
This optimization problem (Pehlevan & Chklovskii, 2015a) will be used to derive an online neural algorithm.
Whereas the optimization problem, equation 3.3, is formulated in the offline setting (i.e., outputs are computed only after receiving all inputs), to derive a biologically plausible algorithm we need to formulate the optimization problem in the online setting: the algorithm receives inputs sequentially, one at a time, and computes an output before the next input arrives. Therefore, we optimize equation 3.3 only for the data already received and only with respect to the current output:
formula
3.4
By keeping only those terms that depend on the current outputs, h_t or g_t, we get the following objective:
formula
3.5
where
formula
3.6
In the large-t limit, the first three terms dominate over the last term, which we ignore. The remaining objective is strictly concave in h_t and convex in g_t. We assume that the relevant weight matrix is full rank. Then the objective has a unique saddle point,
formula
3.7
where
formula
3.8
Hence, the feedforward weight matrix can be interpreted as the current estimate of the prewhitening matrix, F.
We solve equation 3.5 with a gradient descent-ascent,
formula
3.9
where γ measures “time” within a single time step t. Biologically, this is justified if the activity dynamics converges faster than the correlation time of the input data. The dynamics, equation 3.9, can be proved to converge to the saddle point of the objective, equation 3.5 (see appendix A).

Equation 3.9 describes the dynamics of a single-layer neural network with two populations (see Figure 2): one matrix represents the weights of feedforward synaptic connections, and the other two represent the weights of synaptic connections between the two populations. Remarkably, synaptic weights appear in the online algorithm despite their absence in the optimization problem formulations 3.3 and 3.4. Furthermore, the h neurons can be associated with the principal neurons of a biological circuit and the g neurons with interneurons.

Figure 2:

Biologically plausible network for blind source separation. The prewhitening stage is composed of two populations of linear neurons. The NSM stage has a single population of rectifying neurons.


Finally, using the definition of synaptic weight matrices, equation 3.8, we can formulate recursive update rules:
formula
3.10

Equations 3.9 and 3.10 define a neural algorithm that proceeds in two phases. After each stimulus presentation, equation 3.9 is iterated until convergence by the dynamics of neuronal activities. Then synaptic weights are updated according to local, anti-Hebbian (for synapses from interneurons), and Hebbian (for all other synapses) rules, equation 3.10. Biologically, synaptic weights are updated on a slower timescale than neuronal activity dynamics.

Our algorithm can be viewed as a special case of the algorithm proposed by Plumbley (1994, 1996). Plumbley (1994) analyzed the convergence of synaptic weights in a stochastic setting by a linear stability analysis of the stationary point of synaptic weight updates. His results are directly applicable to our algorithm and show that if the synaptic weights of our algorithm converge to a stationary state, they whiten the input.

Importantly, unlike Plumbley (1994, 1996), who proposed the algorithm heuristically, we derived it by posing and solving an optimization problem.

3.1.1  Computing

The optimization problem, equation 3.3, and the corresponding neural algorithm, equations 3.9 and 3.10, almost achieve what is needed for noncentered prewhitening, but one ingredient is still missing: the NSM step requires the noncentered whitened output, H, rather than its centered version. We now discuss how it can be learned along with F using the same online algorithm.

Our online algorithm for centered whitening is of the following form. First, a neural dynamics stage outputs a linear transformation of the input:
formula
3.11
Second, synaptic weights and, hence, the current estimate of the prewhitening matrix are updated:
formula
3.12
We can supplement this algorithm with a running estimate of the stimulus mean. Let the running estimate of average stimulus activity be

x̄_t = x̄_{t−1} + ( x_t − x̄_{t−1} ) / t.    (3.13)
Then,
formula
3.14
Alternatively, equations 3.11 and 3.14 can be combined into a single step:
formula
3.15
where the network receives uncentered stimuli and prewhitens them. Note that the assignment, equation 3.15, can still be done by iterating equation 3.9, except now the input is x_t rather than its centered version. However, synaptic weights are still updated using the centered variables. Therefore, we keep recursive estimates of the means. Substituting equation 3.13 into 3.14 and using 3.12,
formula
3.16
One of the resulting terms requires nonlocal calculations. Assuming that, in the large-t limit, updates to the weights are small, we can ignore this term and obtain a recursion:
formula
3.17
Finally, a similar argument can be given for . We keep a recursive estimate of :
formula
3.18
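The recursive estimates above all share the same form as the running mean of equation 3.13. The update below is a minimal sketch of that form (names are ours); each neuron or synapse needs to store only its previous estimate:

```python
import numpy as np

def running_mean(prev_mean, x, t):
    """Recursive mean update: m_t = m_{t-1} + (x_t - m_{t-1}) / t.
    After processing x_1 ... x_t, m_t equals their arithmetic mean,
    so the estimate can be maintained with purely local information."""
    return prev_mean + (x - prev_mean) / t
```

This is why the mean-tracking step is biologically plausible: the update for each unit depends only on its own previous estimate and its current input.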
To summarize, when a new stimulus is observed, the algorithm operates in two steps. In the first step, the following two-population neural dynamics runs until convergence to a fixed point:
formula
3.19
The convergence proof for the neural dynamics, equation 3.9, given in appendix A also applies here. Besides its synaptic weights, each neuron remembers its own average activity, and each synapse remembers the average incoming activity. In the second step of the algorithm, the average activities are updated by
formula
3.20
Synaptic weight matrices are updated recursively by
formula
3.21
Once the synaptic updates are done, the next stimulus is processed. We note again that all the synaptic update rules are local and hence are biologically plausible.

3.2  Online NSM

Next, we derive the second-layer network, which solves the NSM optimization problem 2.6 in an online setting (Pehlevan & Chklovskii, 2014).

The online optimization problem is
formula
3.22
Proceeding as before, we rewrite equation 3.22, keeping only terms that depend on the current output, y_t:
formula
3.23
In the large-t limit, the last two terms can be ignored, and the remainder is a quadratic form in y_t. We minimize it using coordinate descent (Wright, 2015), which is both fast and neurally plausible. In this approach, neurons are updated one by one by performing an exact minimization of equation 3.23 with respect to y_{t,i} until convergence:

y_{t,i} ← max( Σ_j W_{t,ij} h_{t,j} − Σ_{k≠i} M_{t,ik} y_{t,k}, 0 ),    (3.24)

where

W_{t,ij} = ( Σ_{t'=1}^{t−1} y_{t',i} h_{t',j} ) / Ŷ_{t,i},   M_{t,ik} = ( Σ_{t'=1}^{t−1} y_{t',i} y_{t',k} ) / Ŷ_{t,i},   Ŷ_{t,i} = Σ_{t'=1}^{t−1} y_{t',i}².    (3.25)

For the next time step (t + 1), we can update the synaptic weights recursively, giving us the following local learning rules:

Ŷ_{t+1,i} = Ŷ_{t,i} + y_{t,i}²,
W_{t+1,ij} = W_{t,ij} + ( y_{t,i} h_{t,j} − y_{t,i}² W_{t,ij} ) / Ŷ_{t+1,i},
M_{t+1,ik} = M_{t,ik} + ( y_{t,i} y_{t,k} − y_{t,i}² M_{t,ik} ) / Ŷ_{t+1,i}   (i ≠ k),   M_{t+1,ii} = 0.    (3.26)
Interestingly, these update rules are local and are identical to the single-neuron Oja rule (Oja, 1982), except that the learning rate is given explicitly in terms of cumulative activity and the lateral connections are anti-Hebbian.

After the arrival of each data vector, the operation of the complete two-layer network algorithm, Figure 2, is as follows. First, the dynamics of the prewhitening network runs until convergence. Then the output of the prewhitening network is fed to the NSM network, and the NSM network dynamics runs until convergence to a fixed point. Synaptic weights are updated in both networks for processing the next data vector.
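One time step of the second-layer network can be sketched as follows. This is a schematic implementation under our reading of the coordinate-descent dynamics and Oja-like updates; the variable names, the fixed iteration count, and the initialization conventions are our choices:

```python
import numpy as np

def nsm_online_step(h, W, M, y_cum, n_iter=50):
    """Process one whitened input h: run the rectified neural dynamics to
    a fixed point, then apply local updates whose learning rates are set
    by the cumulative squared activity of each output neuron."""
    d = W.shape[0]
    y = np.zeros(d)
    c = W @ h                                  # feedforward drive (fixed during dynamics)
    for _ in range(n_iter):                    # neurons updated one by one
        for i in range(d):
            y[i] = max(c[i] - M[i] @ y, 0.0)   # lateral inhibition; assumes M[i, i] = 0
    y_cum = y_cum + y ** 2                     # cumulative squared activity
    eta = (1.0 / np.maximum(y_cum, 1e-12))[:, None]
    W = W + eta * (np.outer(y, h) - (y ** 2)[:, None] * W)  # Hebbian feedforward
    M = M + eta * (np.outer(y, y) - (y ** 2)[:, None] * M)  # anti-Hebbian lateral
    np.fill_diagonal(M, 0.0)                   # no self-coupling
    return y, W, M, y_cum
```

Stacking this step after the prewhitening network, as in Figure 2, yields the full two-layer pipeline: each arriving stimulus is first whitened, then passed through `nsm_online_step`.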

3.2.1  NICA Is a Stationary State of Online NSM

Here we show that the solution to the NICA problem is a stationary synaptic weights state of the online NSM algorithm. In the stationary state, the expected updates to synaptic weights are zero:
⟨ y_i h_j − y_i² W_ij ⟩ = 0,   ⟨ y_i y_k − y_i² M_ik ⟩ = 0   (i ≠ k),    (3.27)
where we dropped the time index t and the brackets denote averages over the source distribution.

Suppose the stimuli obey the NICA generative model, equation 1.1, and the observed mixture, x_t, is whitened with the exact (generalized) prewhitening matrix described in theorem 2. Then the input to the network at time t is h_t = FAs_t. Our claim is that there exist synaptic weight configurations for which (1) for any mixed input, h_t, the output of the network is the source vector, that is, y_t = P s_t, where P is a permutation matrix, and (2) this synaptic configuration is a stationary state.

We prove our claim by constructing these synaptic weights. For each permutation matrix, we first relabel the outputs such that the ith output recovers the ith source and hence P becomes the identity matrix. Then the weights are
W = D^{−1} ( I_d + s̄ s̄ᵀ ) (FA)ᵀ,   M_{ik} = s̄_i s̄_k / ( 1 + s̄_i² )   (i ≠ k),   M_{ii} = 0,    (3.28)

where s̄ = ⟨s⟩ and D = diag( 1 + s̄_1², …, 1 + s̄_d² ).
Given mixture h, the NSM neural dynamics with these weights converge to y = s, which is the unique fixed point.3 Furthermore, these weights define a stationary state as defined in equation 3.27, assuming a fixed learning rate. To see this, substitute the weights from equation 3.28 into the last two equations of 3.26 and average over the source distribution. The fixed learning rate assumption is valid in the large-t limit, when changes to the cumulative activities become small (see Pehlevan et al., 2015).

4  Numerical Simulations

Here we present numerical simulations of our two-layer neural network using various data sets and compare the results to those of other algorithms.

In all our simulations, , except in Figure 5B, where . Our networks were initialized as follows:

  1. In the prewhitening network, and were chosen to be random orthonormal matrices. is initialized as because of its definition in equation 3.8 and the fact that this choice guarantees the convergence of the neural dynamics, equation 3.19 (see appendix A).

  2. In the NSM network, was initialized to a random orthonormal matrix and was set to zero.

The learning rates were chosen as follows:

  1. For the prewhitening network, we generalized the time-dependent learning rate, equation 3.21, to
    formula
    and performed a grid search over and to find the combination with the best performance. Our performance measures will be introduced below.
  2. For the NSM network, we generalized the activity-dependent learning rate, equation 3.26, to
    formula
    and performed a grid search over several values of and to find the combination with the best performance. The parameter introduces “forgetting” to the system (Pehlevan et al., 2015). We hypothesized that forgetting will be beneficial in the two-layer setting because the prewhitening layer output changes over time and the NSM layer has to adapt. Further, for comparison purposes, we also implemented this algorithm with a time-dependent learning rate of the form 4.1 and performed a grid search with and to find the combination with best performance.

For the NSM network, to speed up our simulations, we implemented a procedure from Plumbley & Oja (2004). At each iteration, we checked whether there is any output neuron that has not fired up until that iteration. If so, we flipped the sign of its feedforward inputs. In practice, the flipping occurred only within the first 10 or 50 iterations.

For comparison, we implemented five other algorithms. The first is the offline algorithm, equation 2.8; the other four were chosen to represent major online algorithm classes:

  1. Offline projected gradient descent: We simulated the projected gradient descent algorithm, 2.8. We used variable step sizes of the form 4.1 and performed a grid search with and to find the combination with the best performance. We initialized elements of the matrix, , by drawing a gaussian random variable with zero mean and unit variance and rectifying it. The input data set was whitened offline before passing to projected gradient descent.

  2. fastICA: fastICA (Hyvärinen & Oja, 1997, 2000; Hyvarinen, 1999) is a popular ICA algorithm that does not assume nonnegativity of sources. We implemented an online version of fastICA (Hyvärinen & Oja, 1998) using the same parameters except for feedforward weights. We used the time-dependent learning rate, equation 4.1, and performed a grid search with and to find the combination with the best performance. fastICA requires whitened and centered input (Hyvärinen & Oja, 1998) and computes a decoding matrix that maps mixtures back to sources. We ran the algorithm with whitened and centered input. To recover nonnegative sources, we applied the decoding matrix to noncentered but whitened input.

  3. Infomax ICA: Bell and Sejnowski (1995) proposed a blind source separation algorithm that maximizes the mutual information between inputs and outputs—the Infomax principle (Linsker, 1988). We simulated an online version due to Amari, Cichocki, and Yang (1996). We chose cubic neural nonlinearities compatible with subgaussian input sources. This differs from our fastICA implementation where the nonlinearity is also learned online. Infomax ICA computes a decoding matrix using centered, but not whitened, data. To recover nonnegative sources, we applied the decoding matrix to noncentered inputs. Finally, we rescaled the sources so that their variance is 1. We experimented with several learning rate parameters for finding optimal performance.

  4. Linsker's network: Linsker (1997) proposed a neural network with local learning rules for Infomax ICA. We simulated this algorithm with cubic neural nonlinearities and preprocessing and decoding done as in our Infomax ICA implementation.

  5. Nonnegative PCA: A nonnegative PCA algorithm (Plumbley & Oja, 2004) solves the NICA task and makes explicit use of the nonnegativity of sources. We use the online version given in Plumbley and Oja (2004). To speed up our simulations, we implemented a procedure from Plumbley and Oja (2004). At each iteration, we checked for any output neuron that had not fired up until that iteration. If so, we flipped the sign of its feedforward inputs. For this algorithm, we again used the time-dependent learning rate of equation 4.1 and performed a grid search with and to find the combination with the best performance. Nonnegative PCA assumes whitened but not centered input (Plumbley & Oja, 2004).

Next, we present the results of our simulations on three data sets.

4.1  Mixture of Random Uniform Sources

The source independent and identically distributed samples were set to zero with probability 0.5 and sampled uniformly from an interval with probability 0.5. The dimensions of source vectors were . The mixing matrices are given in appendix C. For each run, source vectors were generated. (For a sample of the original and mixed signals, see Figure 3A.)

Figure 3:

Performance of our algorithms when presented with a mixture of random uniform sources. (A) Sample source, mixture, whitened, and recovered signals for a task, performed by our two-layer algorithm. The whitened signal is the output of the first layer, while the recovered signal is the output of second layer. (B) Performance comparison of the online algorithms presented in this letter with projected gradient descent, Online fastICA, Infomax ICA, Linsker's network, and Nonnegative PCA. Curves show averages over 10 simulations. Error bars are not shown for visual clarity. Learning rate parameters are given in appendix D.


The inputs to the fastICA and Nonnegative PCA algorithms were prewhitened offline, and in the case of fastICA, they were also centered. We ran our NSM network both as a single-layer algorithm with prewhitening done offline and as part of our two-layer algorithm with whitening done online.

To quantify the performance of the tested algorithms, we used the mean-squared error,

MSE_t = (1/d) ‖ s_t − P y_t ‖²,    (4.3)

where P is a permutation matrix chosen to minimize the mean-squared error at the end of the run. The learning rate parameters of all networks were optimized by a grid search using this metric.
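For small d, the permutation in this metric can be found by brute force, as in the sketch below (our function names; for larger d, the optimal channel assignment could instead be found with the Hungarian algorithm):

```python
import itertools
import numpy as np

def permuted_mse(Y, S):
    """Mean-squared error between recovered signals Y and sources S,
    minimized over relabelings (permutations) of the output channels."""
    d = S.shape[0]
    return min(
        np.mean((Y[list(p)] - S) ** 2)
        for p in itertools.permutations(range(d))
    )
```

A perfect recovery up to channel relabeling therefore scores zero, which is the behavior the comparison in Figure 3B relies on.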

In Figure 3B, we show the performances of all algorithms we implemented. Our algorithms perform as well as or better than the others, especially as the dimensionality of the input increases. Offline whitening is better than online whitening; however, as dimensionality increases, online whitening becomes competitive with offline whitening. In fact, our two-layer and single-layer networks perform better than online fastICA and nonnegative PCA, for which whitening was done offline.

We also simulated a fully offline algorithm by taking projected gradient descent steps until the residual error plateaued (see Figure 3B). The performance of the offline algorithm quantifies two important metrics. First, it establishes the loss in performance due to online (as opposed to offline) processing. Second, it establishes the lowest error that could be achieved by the NSM method for the given data set. The lowest error is not necessarily zero due to the finite size of the data set. This method is not perfect because the projected gradient descent may get stuck in a local minimum of equation 2.6.

We also tested whether the learned synaptic weights of our network match our theoretical predictions. In Figure 4A, we show examples of learned feedforward and recurrent synaptic weights at and what is expected from our theory, equation 3.28. We observed an almost perfect match between the two. In Figure 4B, we quantify the convergence of simulated synaptic weights to the theoretical prediction by plotting a normalized error metric defined by .

Figure 4:

Theoretical predictions of learned synaptic weights match simulations. (A) Example synaptic weight matrices predicted from theory, equation 3.28, compared to results from an example simulation (). (B) Convergence of simulated network synaptic weights to theoretical predictions. For this figure, inputs were whitened offline and the NSM network was run with time-dependent learning rates. Shaded bars show standard error over 10 simulations.

4.2  Mixture of Random Uniform and Exponential Sources

Our algorithm can demix sources sampled from different statistical distributions. To illustrate this point, we generated a data set with two uniform and three exponential source channels. The uniform sources were sampled as before. The exponential sources were either zero (with probability 0.5) or sampled from an exponential distribution, scaled so that the variance of each channel is 1. In Figure 5A, we show that the algorithm successfully recovers the sources.
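A sketch of how such a data set can be generated follows. The exponential channels are as described above; because the uniform channels are "sampled as before" in an earlier section not reproduced here, a plain rescaled uniform channel stands in for them, and the mixing matrix is likewise a random stand-in.

```python
# Illustrative generation of the mixed-distribution data set: two uniform
# and three exponential source channels, each scaled to unit variance.
# The uniform scheme and the mixing matrix are stand-ins (assumptions).
import numpy as np

rng = np.random.default_rng(0)
T = 100_000  # number of samples

def exponential_channel():
    x = rng.exponential(size=T) * (rng.random(T) < 0.5)  # zero w.p. 0.5
    return x / x.std()                                   # unit channel variance

def uniform_channel():  # stand-in for the paper's uniform sources
    x = rng.random(T)
    return x / x.std()

S = np.vstack([uniform_channel(), uniform_channel(),
               exponential_channel(), exponential_channel(),
               exponential_channel()])   # sources, shape (5, T)
A = rng.random((5, 5))                   # random mixing matrix (stand-in)
X = A @ S                                # observed mixtures, shape (5, T)
```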

Figure 5:

Performance of our two-layer algorithm when presented with a mixture of random uniform and exponential sources. (A) Recovery of a mixture of three exponential and two uniform sources. (B) Recovery of three exponential sources corrupted by two background noise channels. Learning rate parameters are given in appendix D.

To test the denoising capability of our algorithm, we created a data set in which source signals are accompanied by background noise. The sources to be recovered were three exponential channels, sampled as before. The background noise consisted of two uniform channels, sampled as before but scaled to have variance 0.1. To denoise the resulting five-dimensional mixture, the prewhitening layer reduced its five input dimensions to three. The NSM layer then successfully recovered the sources (see Figure 5B). Hence, the prewhitening layer can act as a denoising stage.
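The mechanism behind this denoising can be sketched offline: a dimensionality-reducing whitening stage keeps only the top principal subspace, and the low-variance background directions fall outside it. The batch PCA-based whitening below is a stand-in (an assumption) for the network's online generalized whitening.

```python
# Why a dimensionality-reducing whitening stage denoises: projecting onto
# the top-3 principal subspace retains the three high-variance source
# directions and discards the two low-variance background directions.
# Batch PCA here stands in for the network's online whitening.
import numpy as np

rng = np.random.default_rng(1)
T = 50_000

sources = rng.exponential(size=(3, T))                   # variance ~1 per channel
noise = rng.random((2, T))
noise = np.sqrt(0.1) * noise / noise.std(axis=1, keepdims=True)  # variance 0.1
A = rng.random((5, 5))                                   # random mixing (stand-in)
X = A @ np.vstack([sources, noise])                      # five-dimensional mixture
Xc = X - X.mean(axis=1, keepdims=True)

C = Xc @ Xc.T / T                                        # input covariance
w, V = np.linalg.eigh(C)                                 # eigenvalues, ascending
H = V[:, -3:].T @ Xc / np.sqrt(w[-3:])[:, None]          # whitened top-3 output
```

The retained three directions carry nearly all of the input variance, and the three-dimensional output `H` has identity covariance, as required of a whitening stage.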

4.3  Mixture of Natural Scenes

Next, we consider recovering images from their mixtures (see Figure 6A), where each image is treated as one source. Four image patches of size pixels were chosen from a set of images of natural scenes previously used in Hyvärinen and Hoyer (2000) and Plumbley and Oja (2004). The preprocessing was as in Plumbley and Oja (2004): (1) images were downsampled by a factor of 4 to obtain patches, (2) pixel intensities were shifted to have a minimum of zero, and (3) pixel intensities were scaled to have unit variance. Hence, in this data set, there are sources, corresponding image patches, and a total of samples. These samples were presented to the algorithm 5000 times with randomly permuted order in each presentation. The mixing matrix, which was generated randomly, is given in appendix C.
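The preprocessing steps (1) to (3) can be sketched as follows. The use of simple strided subsampling for the factor-4 downsampling, the patch size, and the random array standing in for a real image patch are all illustrative assumptions.

```python
# Sketch of preprocessing steps (1)-(3) for the natural-scene data set.
# Strided subsampling and the stand-in patch are assumptions; the paper
# does not specify the downsampling method.
import numpy as np

def preprocess_patch(patch):
    small = patch[::4, ::4]          # (1) downsample by a factor of 4
    small = small - small.min()      # (2) shift intensities to a minimum of zero
    return small / small.std()       # (3) scale intensities to unit variance

rng = np.random.default_rng(2)
patch = rng.random((252, 252))            # stand-in for one natural-image patch
source = preprocess_patch(patch).ravel()  # one source: the flattened pixels
```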

In Figure 6B, we show the performances of all algorithms we implemented on this task. Our algorithms perform much better than fastICA and nonnegative PCA.

Figure 6:

Performance of our algorithm when presented with a mixture of natural images. (A) Sample source, mixture, and recovered images, performed by our two-layer algorithm. (B) Performance comparison of our online algorithms with online fastICA and nonnegative PCA. Shaded bars show standard error over 10 simulations. Learning rate parameters are listed in appendix D.

5  Discussion

In this letter, we presented a new neural algorithm for blind nonnegative source separation. We started by assuming the nonnegative ICA generative model (Plumbley, 2001, 2002) where inputs are linear mixtures of independent and nonnegative sources. We showed that the sources can be recovered from inputs by two sequential steps: (1) generalized whitening and (2) NSM. In fact, our argument requires sources to be only uncorrelated, and not necessarily independent. Each of the two steps can be performed online with single-layer neural networks with local learning rules (Pehlevan & Chklovskii, 2014, 2015a). Stacking these two networks yields a two-layer neural network for blind nonnegative source separation (see Figure 2). Numerical simulations show that our neural network algorithm performs well.

Because our network is derived from optimization principles, its biologically realistic features can be given meaning. The network is multilayered because each layer performs a different optimization. Lateral connections create competition between principal neurons, forcing them to differentiate their outputs. Interneurons clamp the activity dimensions of principal neurons (Pehlevan & Chklovskii, 2015a). Rectifying neural nonlinearity is related to nonnegativity of sources. Synaptic plasticity (Malenka & Bear, 2004), implemented by local Hebbian and anti-Hebbian learning rules, achieves online learning. While Hebbian learning is famously observed in neural circuits (Bliss & Lømo, 1973; Bliss & Gardner-Medwin, 1973), our network also makes heavy use of anti-Hebbian learning, which can be interpreted as the long-term potentiation of inhibitory postsynaptic potentials. Experiments show that such long-term potentiation can arise from pairing action potentials in inhibitory neurons with subthreshold depolarization of postsynaptic pyramidal neurons (Komatsu, 1994; Maffei, Nataraj, Nelson, & Turrigiano, 2006). However, plasticity in inhibitory synapses does not have to be Hebbian, that is, require correlation between pre- and postsynaptic activity (Kullmann, Moreau, Bakiri, & Nicholson, 2012).

For improved biological realism, the network should respond to a continuous stimulus stream by continuous and simultaneous changes to its outputs and synaptic weights. Presumably this requires neural timescales to be faster and synaptic timescales to be slower than that of changes in stimuli. To explore this possibility, we simulated some of our data sets with a limited number of neural activity updates (not shown) and found that about 10 updates per neuron are sufficient for successful recovery of sources without significant loss in performance. With a neural time scale of 10 ms, this should take about 100 ms, which is sufficiently fast given that, for example, the temporal autocorrelation timescale of natural image sequences is about 500 ms (David, Vinje, & Gallant, 2004; Bull, 2014).

It is interesting to compare the two-layered architecture we present to the multilayer neural networks of deep learning approaches (LeCun, Bengio, & Hinton, 2015):

  1. For each data presentation, our network performs recurrent dynamics to produce an output, while the deep networks have feedforward architecture.

  2. The first layer of our network has multiple neuron types, principal and interneurons, and only principal neurons project to the next layer. In deep learning, all neurons in a layer project to the next layer.

  3. Our network operates with local learning rules, while deep learning uses backpropagation, which is not local.

  4. We derived the architecture, the dynamics, and the learning rules of our network from a principled cost function. In deep learning, the architecture and the dynamics of a neural network are designed by hand; only the learning rule is derived from a cost function.

  5. In building a neural algorithm, we started with a generative model of inputs, from which we inferred algorithmic steps to recover latent sources. These algorithmic steps guided us in deciding which single-layer networks to stack. In deep learning, no such generative model is assumed, and network architecture design is more of an art. We believe that starting from a generative model might lead to a more systematic way of network design. In fact, the question of which generative model is appropriate for deep networks is already being asked (Patel, Nguyen, & Baraniuk, 2016).

Notes

1

In his analysis, Plumbley (2002) assumed that the number of mixture channels is the same as the number of source channels, but this assumption can be relaxed, as we show.

2

Without loss of generality, a scalar factor that multiplies a source can always be absorbed into the corresponding column of the mixing matrix.

3

Proof: The net input to neuron at the claimed fixed point is . Plugging in equation 3.26 and , and using equation 2.5, one gets that the net input is , which is also the output since sources are nonnegative. This fixed point is unique and globally stable because the NSM neural dynamics is a coordinate descent on a strictly convex cost given by .

4
The fixed point is globally convergent if and only if the eigenvalues of the matrix
formula
B.7
have negative real parts. One can show that eigenvalues are , eigenvalues are , and for each positive eigenvalue, of , one gets a pair . All eigenvalues have negative real parts.

Appendix A:  Convergence of the Gradient Descent-Ascent Dynamics

Here we prove that the neural dynamics, equation 3.9, converges to the saddle point of the objective function, equation 3.6. We assume that is full rank and . First, note that the optimum of equation 3.6 is also the fixed point of equation 3.9. Since the neural dynamics, equation 3.9, is linear, the fixed point is globally convergent if and only if the eigenvalues of the matrix
formula
A.1
have negative real parts.
The eigenvalue equation is
formula
A.2
which implies
formula
A.3
Using these relations, we can solve for all the eigenvalues. There are two cases:
  1. . This implies that and . is in the null space of . Since is with , the null space is -dimensional, and one has degenerate eigenvalues.

  2. . Substituting for in the first equation of equation A.3, this implies that . Hence, is an eigenvector of . For each eigenvalue of , there are two corresponding eigenvectors . can be solved uniquely from the first equation in equation A.3.

Hence, there are degenerate eigenvalues and pairs of conjugate eigenvalues , one pair for each eigenvalue of . Since are real and positive (we assume is full rank and by definition ), real parts of all are negative and hence the neural dynamics, equation 3.9, is globally convergent.

Appendix B:  Modified Objective Function and Neural Network for Generalized Prewhitening

While deriving our online neural algorithm, we assumed that the number of output channels is reduced to the number of sources at the prewhitening stage (). However, our offline analysis did not need such reduction; one could keep for generalized prewhitening. Here we provide an online neural algorithm that allows .

First, we point out why the prewhitening algorithm given in the main text is not adequate for . In appendix A, we proved that the neural dynamics described by equation 3.9 converges to the saddle point of the objective function, equation 3.6. This proof assumes that is full rank. However, if , this assumption breaks down as the network learns because perfectly prewhitened has rank (low rank) and a perfectly prewhitening network would have , which would also be low rank. We simulated this network with and observed that the condition number of increased with and the neural dynamics took longer to converge. Although the algorithm was still functioning well for practical purposes, we present a modification that fully resolves the problem.

We propose a modified offline objective function (Pehlevan & Chklovskii, 2015a) and a corresponding neural network. Consider the following:
formula
B.1
where is a centered mixture of independent sources, is now an matrix with , is an matrix with , and is a positive parameter. Notice the additional -dependent term compared to equation 3.3. If is less than the lowest eigenvalue of , optimal is a linear transform of and satisfies the generalized prewhitening condition, equation 2.2 (Pehlevan & Chklovskii, 2015a). More precisely,
Theorem 4 (modified from Pehlevan & Chklovskii, 2015a).
Suppose an eigendecomposition of is , where eigenvalues are sorted in order of magnitude. If is less than the lowest eigenvalue of , all optimal of equation 3.13 have an SVD decomposition of the form
formula
B.2
where is with ones at top diagonals and zeros at rest.

Using this cost function, we will derive a neural algorithm that does not suffer from the described convergence issues even if . We now need to choose the parameter , and for that we need to know the spectral properties of .

To derive an online algorithm, we repeat the steps taken before:
formula
B.3
where
formula
B.4
In the large- limit, the first four terms dominate over the last term, which we ignore. The remaining objective is strictly concave in and strictly convex in . Note that equation 3.6 was only convex in but not strictly convex. The objective has a unique saddle point, even if is not full rank:
formula
B.5
where matrices are defined as before and is the identity matrix.
We solve equation B.3 with gradient descent-ascent,
formula
B.6
where is time measured within a single time step of . The dynamics, equation B.6, can be proved to converge to the saddle point, equation B.5, by modifying the proof in appendix A (see note 4). Synaptic weight updates are the same as before, equation 3.10. Finally, this network can be modified to also compute following the steps before.

Appendix C:  Mixing Matrices for Numerical Simulations

For the random source data set, the mixing matrix was
formula
C.1
To save space, we do not list the mixing matrices for the remaining cases; they are available from us on request.
For the natural scene data set, the mixing matrix was
formula
C.2

Appendix D:  Learning Rate Parameters for Numerical Simulations

For Figures 3 to 6, our grid search found the following parameters to perform best:

 
fastICANPCANSM (activity)NSM (time)
 (10, 0.01) (10, 0.1) (10, 0.8) (10, 0.1) 
 (100, 0.01) (10, 0.01) (10, 0.9) (10, 0.01) 
 (100, 0.01) (100, 0.01) (10, 0.9) (10, 0.1) 
 (100, 0.01) (1000, 0.01) (10, 0.9) (10, 0.01) 
Images (, 0.01) (1000, 0.01) (100, 0.9) NA 
 Two-Layer Offline Infomax ICA Linsker's Algorithm 
 (100, 1, 10, 0.8) (, 0.001) (1000, 0.2) (1000, 0.2) 
 (100, 1, 10, 0.9) (, 0.001) (1000, 0.2) (1000, 0.2) 
 (100, 1, 10, 0.9) (, 0.01) (1000, 0.2) (1000, 0.2) 
 (100, 1, 10, 0.9) (, 0.01) (1000, 0.2) (1000, 0.2) 
Images (100, 1, 100, 0.9) NA NA NA 

Acknowledgments

We thank Andrea Giovannucci, Eftychios Pnevmatikakis, Anirvan Sengupta, and Sebastian Seung for useful discussions. D.C. is grateful to the IARPA MICRONS program for support.

References

Amari, S., Cichocki, A., & Yang, H. (1996). A new learning algorithm for blind signal separation. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 757–763). Cambridge, MA: MIT Press.

Asari, H., Pearlmutter, B. A., & Zador, A. M. (2006). Sparse representations for the cocktail party problem. Journal of Neuroscience, 26(28), 7477–7490.

Bee, M. A., & Micheyl, C. (2008). The cocktail party problem: What is it? How can it be solved? And why should animal behaviorists study it? Journal of Comparative Psychology, 122(3), 235.

Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1129–1159.

Bliss, T. V., & Gardner-Medwin, A. (1973). Long-lasting potentiation of synaptic transmission in the dentate area of the unanaesthetized rabbit following stimulation of the perforant path. Journal of Physiology, 232(2), 357.

Bliss, T. V., & Lømo, T. (1973). Long-lasting potentiation of synaptic transmission in the dentate area of the anaesthetized rabbit following stimulation of the perforant path. Journal of Physiology, 232(2), 331–356.

Bronkhorst, A. W. (2000). The cocktail party phenomenon: A review of research on speech intelligibility in multiple-talker conditions. Acta Acustica United with Acustica, 86(1), 117–128.

Bull, D. R. (2014). Communicating pictures: A course in image and video coding. Orlando, FL: Academic Press.

Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36(3), 287–314.

Comon, P., & Jutten, C. (2010). Handbook of blind source separation: Independent component analysis and applications. Orlando, FL: Academic Press.

David, S. V., Vinje, W. E., & Gallant, J. L. (2004). Natural stimulus statistics alter the receptive field structure of V1 neurons. Journal of Neuroscience, 24(31), 6991–7006.

Donoho, D., & Stodden, V. (2003). When does non-negative matrix factorization give a correct decomposition into parts? In S. Thrun, K. Saul, & P. B. Schölkopf (Eds.), Advances in neural information processing systems, 16. Cambridge, MA: MIT Press.

Eagleman, D. M., Coenen, O. J.-M. D., Mitsner, V., Bartol, T. M., Bell, A. J., & Sejnowski, T. J. (2001). Cerebellar glomeruli: Does limited extracellular calcium implement a sparse encoding strategy? In Proceedings of the 8th Annual Joint Symposium on Neural Computation.

Golumbic, E. M. Z., Ding, N., Bickel, S., Lakatos, P., Schevon, C. A., McKhann, G. M., … Poeppel, D. (2013). Mechanisms underlying selective neuronal tracking of attended speech at a "cocktail party." Neuron, 77(5), 980–991.

Hu, T., Pehlevan, C., & Chklovskii, D. B. (2014). A Hebbian/anti-Hebbian network for online sparse dictionary learning derived from symmetric matrix factorization. In Proceedings of the Asilomar Conference on Signals, Systems and Computers (pp. 613–619). Piscataway, NJ: IEEE.

Huang, K., Sidiropoulos, N., & Swami, A. (2014). Non-negative matrix factorization revisited: Uniqueness and algorithm for symmetric decomposition. IEEE Transactions on Signal Processing, 62(1), 211–224.

Hyvärinen, A. (1999). Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3), 626–634.

Hyvärinen, A., & Hoyer, P. (2000). Emergence of phase- and shift-invariant features by decomposition of natural images into independent feature subspaces. Neural Computation, 12(7), 1705–1720.

Hyvärinen, A., & Oja, E. (1997). A fast fixed-point algorithm for independent component analysis. Neural Computation, 9(7), 1483–1492.

Hyvärinen, A., & Oja, E. (1998). Independent component analysis by general nonlinear Hebbian-like learning rules. Signal Processing, 64(3), 301–313.

Hyvärinen, A., & Oja, E. (2000). Independent component analysis: Algorithms and applications. Neural Networks, 13(4), 411–430.

Isomura, T., Kotani, K., & Jimbo, Y. (2015). Cultured cortical neurons can perform blind source separation according to the free-energy principle. PLoS Computational Biology, 11(12), e1004643.

Isomura, T., & Toyoizumi, T. (2016). A local learning rule for independent component analysis. Scientific Reports, 6.

Jutten, C., & Hérault, J. (1991). Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24(1), 1–10.

Komatsu, Y. (1994). Age-dependent long-term potentiation of inhibitory synaptic transmission in rat visual cortex. Journal of Neuroscience, 14(11), 6488–6499.

Kuang, D., Park, H., & Ding, C. H. (2012). Symmetric nonnegative matrix factorization for graph clustering. In Proceedings of the 2012 SIAM International Conference on Data Mining (vol. 12, pp. 106–117). Philadelphia: SIAM.

Kuang, D., Yun, S., & Park, H. (2015). SymNMF: Nonnegative low-rank approximation of a similarity matrix for graph clustering. Journal of Global Optimization, 62(3), 545–574.

Kullmann, D. M., Moreau, A. W., Bakiri, Y., & Nicholson, E. (2012). Plasticity of inhibition. Neuron, 75(6), 951–962.

Laurberg, H., Christensen, M. G., Plumbley, M. D., Hansen, L. K., & Jensen, S. H. (2008). Theorems on positive data: On the uniqueness of NMF. Computational Intelligence and Neuroscience, 2008, 764206.

LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444.

Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755), 788–791.

Li, Y., Liang, Y., & Risteski, A. (2016). Recovery guarantee of non-negative matrix factorization via alternating updates. arXiv preprint arXiv:1611.03819.

Linsker, R. (1988). Self-organization in a perceptual network. Computer, 21(3), 105–117.

Linsker, R. (1997). A local learning rule that enables information maximization for arbitrary input distributions. Neural Computation, 9(8), 1661–1665.

Maffei, A., Nataraj, K., Nelson, S. B., & Turrigiano, G. G. (2006). Potentiation of cortical inhibition by visual deprivation. Nature, 443(7107), 81–84.

Malenka, R. C., & Bear, M. F. (2004). LTP and LTD: An embarrassment of riches. Neuron, 44(1), 5–21.

McDermott, J. H. (2009). The cocktail party problem. Current Biology, 19(22), R1024–R1027.

Mesgarani, N., & Chang, E. F. (2012). Selective cortical representation of attended speaker in multi-talker speech perception. Nature, 485(7397), 233–236.

Narayan, R., Best, V., Ozmeral, E., McClaine, E., Dent, M., Shinn-Cunningham, B., & Sen, K. (2007). Cortical interference effects in the cocktail party problem. Nature Neuroscience, 10(12), 1601–1607.

Oja, E. (1982). Simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15(3), 267–273.

Oja, E., & Plumbley, M. (2004). Blind separation of positive sources by globally convergent gradient search. Neural Computation, 16(9), 1811–1825.

Ouedraogo, W. S. B., Souloumiac, A., & Jutten, C. (2010). Non-negative independent component analysis algorithm based on 2D Givens rotations and a Newton optimization. In V. Vigneron, V. Zarzoso, E. Moreau, R. Gribonval, & E. Vincent (Eds.), Latent variable analysis and signal separation (pp. 522–529). New York: Springer.

Paatero, P., & Tapper, U. (1994). Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5(2), 111–126.

Patel, A. B., Nguyen, M. T., & Baraniuk, R. (2016). A probabilistic framework for deep learning. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in neural information processing systems (pp. 2550–2558). Red Hook, NY: Curran.

Pehlevan, C., & Chklovskii, D. (2014). A Hebbian/anti-Hebbian network derived from online non-negative matrix factorization can cluster and discover sparse features. In Proceedings of the Asilomar Conference on Signals, Systems and Computers (pp. 769–775). Piscataway, NJ: IEEE.

Pehlevan, C., & Chklovskii, D. (2015a). A normative theory of adaptive dimensionality reduction in neural networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in neural information processing systems (pp. 2260–2268). Red Hook, NY: Curran.

Pehlevan, C., & Chklovskii, D. (2015b). Optimization theory of Hebbian/anti-Hebbian networks for PCA and whitening. In Proceedings of the 2015 53rd Annual Allerton Conference on Communication, Control, and Computing (pp. 1458–1465). Piscataway, NJ: IEEE.

Pehlevan, C., Hu, T., & Chklovskii, D. (2015). A Hebbian/anti-Hebbian neural network for linear subspace learning: A derivation from multidimensional scaling of streaming data. Neural Computation, 27, 1461–1495.

Plumbley, M. (1994). A subspace network that determines its own output dimension (Tech. Report). London: King's College London.

Plumbley, M. (1996). Information processing in negative feedback neural networks. Network-Comp. Neural, 7(2), 301–305.

Plumbley, M. D. (2001). Adaptive lateral inhibition for non-negative ICA. In Proceedings of the International Conference on Independent Component Analysis and Blind Signal Separation (pp. 516–521). New York: Springer.

Plumbley, M. (2002). Conditions for nonnegative independent component analysis. IEEE Signal Processing Letters, 9(6), 177–180.

Plumbley, M. D. (2003). Algorithms for nonnegative independent component analysis. IEEE Transactions on Neural Networks, 14(3), 534–543.

Plumbley, M. D., & Oja, E. (2004). A "nonnegative PCA" algorithm for independent component analysis. IEEE Transactions on Neural Networks, 15(1), 66–76.

Wright, S. J. (2015). Coordinate descent algorithms. Mathematical Programming, 151(1), 3–34.

Yuan, Z., & Oja, E. (2004). A FastICA algorithm for non-negative independent component analysis. In C. G. Puntonet & A. Prieto (Eds.), Independent component analysis and blind signal separation (pp. 1–8). New York: Springer.

Zheng, C.-H., Huang, D.-S., Sun, Z.-L., Lyu, M. R., & Lok, T.-M. (2006). Nonnegative independent component analysis based on minimizing mutual information technique. Neurocomputing, 69(7), 878–883.