## Abstract

Neural network models of early sensory processing typically reduce the dimensionality of streaming input data. Such networks learn the principal subspace, in the sense of principal component analysis, by adjusting synaptic weights according to activity-dependent learning rules. When derived from a principled cost function, these rules are nonlocal and hence biologically implausible. At the same time, biologically plausible local rules have been postulated rather than derived from a principled cost function. Here, to bridge this gap, we derive a biologically plausible network for subspace learning on streaming data by minimizing a principled cost function. In a departure from previous work, where cost was quantified by the representation, or reconstruction, error, we adopt a multidimensional scaling cost function for streaming data. The resulting algorithm relies only on biologically plausible Hebbian and anti-Hebbian local learning rules. In a stochastic setting, synaptic weights converge to a stationary state, which projects the input data onto the principal subspace. If the data are generated by a nonstationary distribution, the network can track the principal subspace. Thus, our result makes a step toward an algorithmic theory of neural computation.

## 1  Introduction

Early sensory processing reduces the dimensionality of streamed inputs (Hyvärinen, Hurri, & Hoyer, 2009), as evidenced by a high ratio of input to output nerve fiber counts (Shepherd, 2003). For example, in the human retina, information gathered by approximately 125 million photoreceptors is conveyed to the lateral geniculate nucleus through 1 million or so ganglion cells (Hubel, 1995). By learning a lower-dimensional subspace and projecting the streamed data onto that subspace, the nervous system denoises and compresses the data simplifying further processing. Therefore, a biologically plausible implementation of dimensionality reduction may offer a model of early sensory processing.

For a single neuron, a biologically plausible implementation of dimensionality reduction in the streaming, or online, setting has been proposed in the seminal work of Oja (1982; see Figure 1A). At each time point, t, an input vector, , is presented to the neuron, and, in response, it computes a scalar output, , where w is a row-vector of input synaptic weights. Furthermore, synaptic weights w are updated according to a version of Hebbian learning called Oja’s rule,
1.1
where is a learning rate and designates a transpose. Then the neuron’s synaptic weight vector converges to the principal eigenvector of the covariance matrix of the streamed data (Oja, 1982). Importantly, Oja’s learning rule is local, meaning that synaptic weight updates depend on the activities of only pre- and postsynaptic neurons accessible to each synapse and therefore biologically plausible.
Figure 1:

An Oja neuron and our neural network. (A) A single Oja neuron computes the principal component, y, of the input data, , if its synaptic weights follow Hebbian updates. (B) A multineuron network computes the principal subspace of the input if the feedforward connection weight updates follow a Hebbian and the lateral connection weight updates follow an anti-Hebbian rule.

Figure 1:

An Oja neuron and our neural network. (A) A single Oja neuron computes the principal component, y, of the input data, , if its synaptic weights follow Hebbian updates. (B) A multineuron network computes the principal subspace of the input if the feedforward connection weight updates follow a Hebbian and the lateral connection weight updates follow an anti-Hebbian rule.

Oja’s rule can be derived by an approximate gradient descent of the mean squared representation error (Cichocki & Amari, 2002; Yang, 1995), a so-called synthesis view of principal component analysis (PCA) (Pearson, 1901; Preisendorfer & Mobley, 1988):
1.2

Computing principal components beyond the first requires more than one output neuron and motivated numerous neural networks. Some well-known examples are the generalized Hebbian algorithm (GHA) (Sanger, 1989), Földiak’s network (Földiak, 1989), the subspace network (Karhunen & Oja, 1982), Rubner’s network (Rubner & Tavan, 1989; Rubner & Schulten, 1990), Leen’s minimal coupling and full coupling networks (Leen, 1990, 1991), and the APEX network (Kung & Diamantaras, 1990; Kung, Diamantaras, & Taur, 1994). We refer to Becker and Plumbley (1996), Diamantaras and Kung (1996), and Diamantaras (2002) for a detailed review of these and further developments.

However, none of the previous contributions was able to derive a multineuronal single-layer network with local learning rules by minimizing a principled cost function, in a way that Oja’s rule, equation 1.1, was derived for a single neuron. The GHA and the subspace rules rely on nonlocal learning rules: feedforward synaptic updates depend on other neurons’ synaptic weights and activities. Leen’s minimal network is also nonlocal: feedforward synaptic updates of a neuron depend on its lateral synaptic weights. While Földiak’s, Rubner’s, and Leen’s full coupling networks use local Hebbian and anti-Hebbian rules, they were postulated rather than derived from a principled cost function. APEX network perhaps comes closest to our criterion: the rule for each neuron can be related separately to a cost function that includes contributions from other neurons. But no cost function describes all the neurons combined.

At the same time, numerous dimensionality-reduction algorithms have been developed for data analysis needs, disregarding the biological plausibility requirement. Perhaps the most common approach is again principal component analysis (PCA), which was originally developed for batch processing (Pearson, 1901) but later adapted to streaming data (Yang, 1995; Crammer, 2006; Arora, Cotter, Livescu, & Srebro, 2012; Goes, Zhang, Arora, & Lerman, 2014). (For a more detailed collection of references, see, e.g., Balzano, 2012.) These algorithms typically minimize the representation error cost function:
1.3
where is a data matrix and is a wide matrix (for detailed notation, see below). The minimum of equation 1.3 is when rows of are orthonormal and span the m-dimensional principal subspace, and therefore is the projection matrix to the subspace (Yang, 1995).1

A gradient descent minimization of such cost function can be approximately implemented by the subspace network (Yang, 1995), which, as pointed out above, requires nonlocal learning rules. While this algorithm can be implemented in a neural network using local learning rules, it requires a second layer of neurons (Oja, 1992), making it less appealing.

In this letter, we derive a single-layer network with local Hebbian and anti-Hebbian learning rules, similar in architecture to Földiak's (1989) (see Figure 1B), from a principled cost function and demonstrate that it recovers a principal subspace from streaming data. The novelty of our approach is that rather than starting with the representation error cost function traditionally used for dimensionality reduction, such as PCA, we use the cost function of classical multidimensional scaling (CMDS), a member of the family of multidimensional scaling (MDS) methods (Cox & Cox, 2000; Mardia, Kent, & Bibby, 1980). Whereas the connection between CMDS and PCA has been pointed out previously (Williams, 2001; Cox & Cox, 2000; Mardia et al., 1980), CMDS is typically performed in the batch setting. Instead, we developed a neural network implementation of CMDS for streaming data.

The rest of the letter is organized as follows. In section 2, by minimizing the CMDS cost function, we derive two online algorithms implementable by a single-layer network, with synchronous and asynchronous synaptic weight updates. In section 3, we demonstrate analytically that synaptic weights define a principal subspace whose dimension m is determined by the number of output neurons and that the stability of the solution requires that this subspace corresponds to top m principal components. In section 4, we show numerically that our algorithm recovers the principal subspace of a synthetic data set and does it faster than the existing algorithms. Finally, in section 5, we consider the case when data are generated by a nonstationary distribution and present a generalization of our algorithm to principal subspace tracking.

## 2  Derivation of Online Algorithms from the CMDS Cost Function

CMDS represents high-dimensional input data in a lower-dimensional output space while preserving pairwise similarities between samples (Young & Householder, 1938; Torgerson, 1952).2 Let T centered input data samples in be represented by column vectors concatenated into an matrix . The corresponding output representations in , , are column vectors, , concatenated into an -dimensional matrix . Similarities between vectors in Euclidean spaces are captured by their inner products. For the input (output) data, such inner products are assembled into a Gram matrix ().3 For a given , CMDS finds by minimizing the so-called strain cost function (Carroll & Chang, 1972):
2.1

For discovering a low-dimensional subspace, the CMDS cost function, equation 2.1, is a viable alternative to the representation error cost function, equation 1.3, because its solution is related to PCA (Williams, 2001; Cox & Cox, 2000; Mardia et al., 1980). Specifically, Y is the linear projection of X onto the (principal sub-)space spanned by m principal eigenvectors of the sample covariance matrix . The CMDS cost function defines a subspace rather than individual eigenvectors because left orthogonal rotations of an optimal Y stay in the subspace and are also optimal, as is evident from the symmetry of the cost function.

In order to reduce the dimensionality of streaming data, we minimize the CMDS cost function, equation 2.1, in the stochastic online setting. At time T, a data sample, , drawn independently from a zero-mean distribution is presented to the algorithm, which computes a corresponding output, , prior to the presentation of the next data sample. Whereas in the batch setting, each data sample affects all outputs, in the online setting, past outputs cannot be altered. Thus, at time T, the algorithm minimizes the cost depending on all inputs and ouputs up to time T with respect to while keeping all the previous outputs fixed:
2.2
where the last equality follows from the definition of the Frobenius norm. By keeping only the terms that depend on current output , we get
2.3
In the large-T limit, expression 2.3 simplifies further because the first two terms grow linearly with T and therefore dominate over the last two. After dropping the last two terms, we arrive at
2.4

We term the cost in expression 2.4 the online CMDS cost. Because the online CMDS cost is a positive semidefinite quadratic form in , this optimization problem is convex. While it admits a closed-form analytical solution via matrix inversion, we are interested in biologically plausible algorithms. Next, we consider two algorithms that can be mapped onto single-layer neural networks with local learning rules: coordinate descent leading to asynchronous updates and Jacobi iteration leading to synchronous updates.

### 2.1  A Neural Network with Asynchronous Updates

The online CMDS cost function, equation 2.4, can be minimized by coordinate descent, which at every step finds the optimal value of one component of while keeping the rest fixed. The components can be cycled through in any order until the iteration converges to a fixed point. Such iteration is guaranteed to converge under very mild assumptions: diagonals of have to be positive (Luo & Tseng, 1991), meaning that each output coordinate has produced at least one nonzero output before current time step T. This condition is almost always satisfied in practice.

The cost to be minimized at each coordinate descent step with respect to the ith channel’s activity is
Keeping only those terms that depend on yields
By taking a derivative with respect to and setting it to zero, we arrive at the following closed-form solution:
2.5
To implement this algorithm in a neural network, we denote normalized input-output and output-output covariances,
2.6
allowing us to rewrite the solution, equation 2.5, in a form suggestive of a linear neural network,
2.7
where and represent the synaptic weights of feedforward and lateral connections respectively (see Figure 1B).
Finally, to formulate a fully online algorithm, we rewrite equation 2.6 in a recursive form. This requires introducing a scalar variable representing the cumulative squared activity of a neuron i up to time ,
2.8
Then at each time point, T, after the output is computed by the network, the following updates are performed:
2.9
Equations 2.7 and 2.9 define a neural network algorithm that minimizes the online CMDS cost function, equation 2.4, for streaming data by alternating between two phases: neural activity dynamics and synaptic updates. After a data sample is presented at time T, in the neuronal activity phase, neuron activities are updated one by one (i.e., asynchronously; see equation 2.7) until the dynamics converges to a fixed point defined by the following equation:
2.10
where is the m-dimensional identity matrix.

In the second phase of the algorithm, synaptic weights are updated, according to a local Hebbian rule, equation 2.9, for feedforward connections and, according to a local anti-Hebbian rule (due to the minus sign in equation 2.7), for lateral connections. Interestingly, these updates have the same form as the single-neuron Oja’s rule, equation 1.1 (Oja, 1982), except that the learning rate is not a free parameter but is determined by the cumulative neuronal activity .4 To the best of our knowledge, such a single-neuron rule (Hu, Towfic, Pehlevan, Genkin, & Chklovskii, 2013) has not been derived in the multineuron case. An alternative derivation of this algorithm is presented in section A.1 in the appendix.

Unlike the representation error cost function, equation 1.3, the CMDS cost function, equation 2.1, is formulated only in terms of input and output activity. Yet the minimization with respect to recovers feedforward and lateral synaptic weights.

### 2.2  A Neural Network with Synchronous Updates

Here, we present an alternative way to derive a neural network algorithm from the large-T limit of the online CMDS cost function, equation 2.4. By taking a derivative with respect to and setting it to zero, we arrive at the following linear matrix equation:
2.11
We solve this system of equations using Jacobi iteration (Strang, 2009) by first splitting the output covariance matrix that appears on the left side of equation 2.11 into its diagonal component and the remainder ,
where the ith diagonal element of , , as defined in equation 2.8. Then equation 2.11 is equivalent to
Interestingly, the matrices obtained on the right side are algebraically equivalent to the feedforward and lateral synaptic weight matrices defined in equation 2.6:
2.12
Hence, the Jacobi iteration for solving equation 2.11,
2.13
converges to the same fixed point as the coordinate descent, equation 2.10.

Iteration 2.13 is naturally implemented by the same single-layer linear neural network as for the asynchronous update, Figure 1B. For each stimulus presentation the network goes through two phases. In the first phase, iteration 2.13 is repeated until convergence. Unlike the coordinate descent algorithm, which updated the activity of neurons one after another, here, activities of all neurons are updated synchronously. In the second phase, synaptic weight matrices are updated according to the same rules as in the asynchronous update algorithm, equation 2.9.

Unlike the asynchronous update, equation 2.7, for which convergence is almost always guaranteed (Luo & Tseng, 1991), convergence of iteration 2.13 is guaranteed only when the spectral radius of is less than 1 (Strang, 2009). Whereas we cannot prove that this condition is always met, the synchronous algorithm works well in practice. While in the rest of the letter, we consider only the asynchronous updates algorithm, our results hold for the synchronous updates algorithm provided it converges.

## 3  Stationary Synaptic Weights Define a Principal Subspace

What is the nature of the lower-dimensional representation found by our algorithm? In CMDS, outputs are the Euclidean coordinates in the principal subspace of the input vector (Cox & Cox, 2000; Mardia et al., 1980). While our algorithm uses the same cost function as CMDS, the minimization is performed in the streaming, or online, setting. Therefore, we cannot take for granted that our algorithm will find the principal subspace of the input. In this section, we provide analytical evidence, by a stability analysis in a stochastic setting, that our algorithm extracts the principal subspace of the input data and projects onto that subspace. We start by previewing our results and method.

Our algorithm performs a linear dimensionality reduction since the transformation between the input and the output is linear. This can be seen from the neural activity fixed point, equation 2.10, which we rewrite as
3.1
where is a matrix defined in terms of the synaptic weight matrices and :
3.2
Relation 3.1 shows that the linear filter of a neuron, which we term a neural filter, is the corresponding row of . The space that neural filters span, the row space of , is termed a filter space.

First, we prove that in the stationary state of our algorithm, neural filters are indeed orthonormal vectors (see section 3.2, theorem 5). Second, we demonstrate that the orthonormal filters form the basis of a space spanned by some m eigenvectors of the covariance of the inputs (see section 3.3, theorem 8). Third, by analyzing linear perturbations around the stationary state, we find that stability requires these m eigenvectors to be the principal eigenvectors and therefore the filter space to be the principal subspace (see section 3.4, theorem 11).

These results show that even though our algorithm was derived starting from the CMDS cost function, equation 2.1, converges to the optimal solution of the representation error cost function, equation 1.3. This correspondence suggests that is the algorithm’s current estimate of the projection matrix to the principal subspace. Further, in equation 1.3, columns of are interpreted as data features. Then columns of , or neural filters, are the algorithm’s estimate of such features.

Rigorous stability analyses of PCA neural networks (Oja, 1982, 1992; Oja & Karhunen, 1985; Sanger, 1989; Hornik & Kuan, 1992; Plumbley, 1995) typically use the ODE method (Kushner & Clark, 1978). Using a theorem of stochastic approximation theory (Kushner & Clark, 1978), the convergence properties of the algorithm are determined using a corresponding deterministic differential equation.5

Unfortunately the ODE method cannot be used for our network. While the method requires learning rates that depend only on time, in our network, learning rates () are activity dependent. Therefore we take a different approach. We directly work with the discrete-time system, assume convergence to a stationary state, to be defined below, and study the stability of the stationary state.

### 3.1  Preliminaries

We adopt a stochastic setting where the input to the network at each time point, , is an n-dimensional independent and identically distributed random vector with zero mean, , where brackets denote an average over the input distribution, and covariance .

Our analysis is performed for the stationary state of synaptic weight updates; that is, when averaged over the distribution of input values, the updates on and average to zero. This is the point of convergence of our algorithm. For the rest of the section, we drop the time index T to denote stationary state variables.

The remaining dynamical variables, learning rates , keep decreasing at each time step due to neural activity. We assume that the algorithm has run for a sufficiently long time such that the change in learning rate is small and it can be treated as a constant for a single update. Moreover, we assume that the algorithm converges to a stationary point sufficiently fast such that the following approximation is valid at large T,
where is calculated with stationary state weight matrices.

We collect these assumptions into a definition:

Definition 1
(Stationary State).  In the stationary state,
and
with T large.

The stationary state assumption leads us to define various relations between synaptic weight matrices, summarized in the following corollary:

Corollary 1.
In the stationary state,
3.3
and
3.4
where is the Kronecker delta.
Proof.
The stationarity assumption when applied to the update rule on , equation 2.9, leads immediately to equation 3.3. The stationarity assumption applied to the update rule on , equation 2.9, gives
The last equality does not hold for since diagonal elements of are zero. To cover the case , we add an identity matrix to , and hence one recovers equation 3.4.
Remark.

Note that equation 3.4 implies —that lateral connection weights are not symmetrical.

### 3.2  Orthonormality of Neural Filters

Here we prove the orthonormality of neural filters in the stationary state. First, we need the following lemma:

Lemma 1.
In the stationary state, the following equality holds:
3.5
Proof.

By equation 3.4, . Using , we substitute for yk on the right-hand side: . Next, the stationarity condition, equation 3.3, yields . Canceling on both sides proves the lemma.

Now we can prove our theorem:

Theorem 1.
In the stationary state, neural filters are orthonormal:
3.6
Proof.

First, we substitute for (but not for ) its definition (see equation 3.2): . Next, using lemma 4, we substitute by . The right-hand side becomes .

Remark.

Theorem 5 implies that .

### 3.3  Neural Filters and Their Relationship to the Eigenspace of the Covariance Matrix

How is the filter space related to the input? We partially answer this question in theorem 8, using the following lemma:

Lemma 2.
In the stationary state, and commute:
3.7
Proof.

See section A.2.

Now we can state our second theorem.

Theorem 2.

At the stationary state state, the filter space is an m-dimensional subspace in that is spanned by some m eigenvectors of the covariance matrix.

Proof.

Because and commute (see lemma 7), they must share the same eigenvectors. Equation 3.6 of theorem 5 implies that m eigenvalues of are unity and the rest are zero. Eigenvectors associated with unit eigenvalues span the row space of and are identical to some m eigenvectors of C.6

Which m eigenvectors of C span the filter space? To show that these are the eigenvectors corresponding to the largest eigenvalues of , we perform a linear stability analysis around the stationary point and show that any other combination would be unstable.

### 3.4  Linear Stability Requires Neural Filters to Span a Principal Subspace

The strategy here is to perturb from its equilibrium value and show that the perturbation is linearly stable only if the row space of is the space spanned by the eigenvectors corresponding to the m highest eigenvalues of . To prove this result, we need two more lemmas.

Lemma 3.
Let be an real matrix with orthonormal rows and an real matrix with orthonormal rows, whose rows are chosen to be orthogonal to the rows of . Any real matrix can be decomposed as
where is an skew-symmetric matrix, is an symmetric matrix, and is an matrix.
Proof.

Define , and . Then .

We denote an arbitrary perturbation of as , where a small parameter is implied. We can use lemma 9 to decompose as
3.8
where the rows of are orthogonal to the rows of . Skew-symmetric corresponds to rotations of filters within the filter space; it keeps neural filters orthonormal. Symmetric keeps the filter space invariant but destroys orthonormality. is a perturbation that takes the neural filters outside the filter space.

Next, we calculate how evolves under the learning rule, .

Lemma 4.
A perturbation to the stationary state has the following evolution under the learning rule to linear order in perturbation and linear order in :
3.9
Proof.

The proof is provided in section A.3.

Now we can state our main result in the following theorem:

Theorem 3.

The stationary state of neuronal filters is stable, in large-T limit, only if the m-dimensional filter space is spanned by the eigenvectors of the covariance matrix corresponding to the m highest eigenvectors.

Proof.

The Full proof is given in section A.4. Here we sketch the proof.

To simplify our analysis, we choose a specific in lemma 9 without losing generality. Let be eigenvectors of and be corresponding eigenvalues, labeled so that the first m eigenvectors span the row space of (or filter space). We choose rows of to be the remaining eigenvectors: .

By extracting the evolution of components of from equation 3.9 using equation 3.8, we are ready to state the conditions under which perturbations of are stable. Multiplying equation 3.9 on the right by gives the evolution of :
Here we changed our notation to to make it explicit that for each j, we have one matrix equation. These equations are stable when all eigenvalues of all are negative, which requires, as shown in section A.4,
This result proves that the perturbation is stable only if the filter space is identical to the space spanned by eigenvectors corresponding to the m highest eigenvalues of .
It remains to analyze the stability of and perturbations. Multiplying equation 3.9 on the right by gives
perturbation, which rotates neural filters, does not decay. This behavior is inherently related to the discussed symmetry of the strain cost function, equation 2.1, with respect to left rotations of the Y matrix. Rotated y vectors are obtained from the input by rotated neural filters, and hence perturbation does not affect the cost. But destroys orthonormality, and these perturbations do decay, making the orthonormal solution stable.

To summarize our analysis, if the dynamics converges to a stationary state, neural filters form an orthonormal basis of the principal subspace.

## 4  Numerical Simulations of the Asynchronous Network

Here, we simulate the performance of the network with asynchronous updates, equations 2.7 and 2.9, on synthetic data. The data were generated by a colored gaussian process with an arbitrarily chosen “actual” covariance matrix. We choose the number of input channels, , and the number of output channels, . In the input data, the ratio of the power in the first four principal components to the power in the remaining 60 components was 0.54. and were initialized randomly, and the step size of synaptic updates was initialized to . The coordinate descent step is cycled over neurons until the magnitude of change in in one cycle is less than times the magnitude of .

We compared the performance of the asynchronous updates network, equation 2.7 and 2.9, with two previously proposed networks, APEX (Kung & Diamantaras, 1990; Kung et al., 1994) and Földiak’s (1989), on the same data set (see Figure 2). The APEX network uses the same Hebbian and anti-Hebbian learning rules for synaptic weights, but the architecture is slightly different in that the lateral connection matrix, , is lower triangular. Földiak’s network has the same architecture as ours (see Figure 1B) and the same learning rules for feedforward connections. However, the learning rule for lateral connections is , unlike equation 2.9. For the sake of fairness, we applied the same adaptive step-size procedure for all networks. As in equation 2.9, the step size for each neuron i at time T was , with . In fact, such a learning rate has been recommended and argued to be optimal for the APEX network (Diamantaras & Kung, 1996; Diamantaras, 2002; see also note 4).

Figure 2:

Performance of the asynchronous neural network compared with existing algorithms. Each algorithm was applied to 40 different random data sets drawn from the same gaussian statistics, described in text. Weight initializations were random. Solid lines indicate means, and shades indicate standard deviations across 40 runs. All errors are in decibels (dB). For formal metric definitions, see the text. (A) Strain error as a function of data presentations. The dotted line is the best error in batch setting, calculated using eigenvalues of the actual covariance matrix. (B) Subspace error as a function of data presentations. (C) Nonorthonormality error as a function of data presentations.

Figure 2:

Performance of the asynchronous neural network compared with existing algorithms. Each algorithm was applied to 40 different random data sets drawn from the same gaussian statistics, described in text. Weight initializations were random. Solid lines indicate means, and shades indicate standard deviations across 40 runs. All errors are in decibels (dB). For formal metric definitions, see the text. (A) Strain error as a function of data presentations. The dotted line is the best error in batch setting, calculated using eigenvalues of the actual covariance matrix. (B) Subspace error as a function of data presentations. (C) Nonorthonormality error as a function of data presentations.

To quantify the performance of these algorithms, we used three different metrics. First is the strain cost function, equation 2.1, normalized by T2 (see Figure 2A). Such a normalization is chosen because the minimum value of offline strain cost is equal to the power contained in the eigenmodes beyond the top m: , where are eigenvalues of sample covariance matrix (Cox & Cox, 2000; Mardia et al., 1980). For each of the three networks, as expected, the strain cost rapidly drops toward its lower bound. As our network was derived from the minimization of the strain cost function, it is not surprising that the cost drops faster than in the other two.

The second metric quantifies the deviation of the learned subspace from the actual principal subspace. At each T, the deviation is , where is an matrix whose rows are the principal eigenvectors, is the projection matrix to the principal subspace, is defined the same way for APEX and Földiak networks as ours, and is the learned estimate of the projection matrix to the principal subspace. Such a deviation rapidly falls for each network, confirming that all three algorithms learn the principal subspace (see Figure 2B). Again, our algorithm extracts the principal subspace faster than the other two networks.

The third metric measures the degree of nonorthonormality among the computed neural filters. At each T, . The nonorthonormality error quickly drops for all networks, confirming that neural filters converge to orthonormal vectors (see Figure 2C). Yet again our network orthonormalizes neural filters much faster than the other two networks.

## 5  Subspace Tracking Using a Neural Network with Local Learning Rules

We have demonstrated that our network learns a linear subspace of streaming data generated by a stationary distribution. But what if the data are generated by an evolving distribution and we need to track the corresponding linear subspace? Using algorithm 2.9 would be suboptimal because the learning rate is adjusted to effectively “remember” the contribution of all the past data points.

A natural way to track an evolving subspace is to “forget” the contribution of older data points (Yang, 1995). In this section, we derive an algorithm with “forgetting” from a principled cost function where errors in the similarity of old data points are discounted:
5.1
where is a discounting factor with corresponding to our original algorithm, equation 2.2. The effective timescale of forgetting is
5.2
By introducing a -dimensional diagonal matrix with diagonal elements we can rewrite equation 5.1 in a matrix notation:
5.3
Yang (1995) used a similar discounting to derive subspace tracking algorithms from the representation error cost function, equation 1.3.
To derive an online algorithm to solve equation 5.3, we follow the same steps as before. By keeping only the terms that depend on current output we get
5.4
In equation 5.4, provided that past input-input and input-output outer products are not forgotten for a sufficiently long time (i.e., ), the first two terms dominate over the last two for large T. After dropping the last two terms, we arrive at
5.5
As in the nondiscounted case, minimization of the discounted online CMDS cost function by coordinate descent, equation 5.5, leads to a neural network with asynchronous updates,
5.6
and by a Jacobi iteration to a neural network with synchronous updates,
5.7
with synaptic weight matrices in both cases given by
5.8
Finally, we rewrite equation 5.8 in a recursive form. As before, we introduce a scalar variable representing the discounted cumulative activity of a neuron i up to time ,
5.9
5.10
These updates are local and almost identical to the original updates, equation 2.9, except the update, where the past cumulative activity is discounted by . For suitably chosen , the learning rate, , stays sufficiently large even at large T, allowing the algorithm to react to changes in data statistics.

As before, we have a two-phase algorithm for minimizing the discounted online CMDS cost function, equation 5.5. For each data presentation, first the neural network dynamics is run using equation 5.6 or 5.7, until the dynamics converges to a fixed point. In the second step, synaptic weights are updated using equation 5.10.

In Figure 3, we present the results of a numerical simulation of our subspace tracking algorithm with asynchronous updates similar to that in section 4 but for nonstationary synthetic data. The data are drawn from two different gaussian distributions: from to , with covariance , and from to , with covariance . We ran our algorithm with four different factors: ().

Figure 3:

Performance of the subspace tracking asynchronous neural network with nonstationary data. The algorithm with different factors was applied to 40 different random data sets drawn from the same nonstationary statistics, described in the text. Weight initializations were random. Solid lines indicate means, and shades indicate standard deviations. All errors are in decibels (dB). For formal metric definitions, see the text. (A) Subspace error as a function of data presentations. (B) Nonorthonormality error as a function of data presentations.

Figure 3:

Performance of the subspace tracking asynchronous neural network with nonstationary data. The algorithm with different factors was applied to 40 different random data sets drawn from the same nonstationary statistics, described in the text. Weight initializations were random. Solid lines indicate means, and shades indicate standard deviations. All errors are in decibels (dB). For formal metric definitions, see the text. (A) Subspace error as a function of data presentations. (B) Nonorthonormality error as a function of data presentations.

We evaluate the subspace tracking performance of the algorithm using a modification of the subspace error metric introduced in section 4. From to , the error is , where is an matrix whose rows are the principal eigenvectors of . From to , the error is , where is an matrix whose rows are the principal eigenvectors of . Figure 3A plots this modified subspace error. Initially the subspace error decreases, reaching lower values with higher . Higher allows for smaller learning rates, allowing a fine-tuning of the neural filters and hence lower error. At , a sudden jump is observed corresponding to the change in principal subspace. The network rapidly corrects its neural filters to project to the new principal subspace, and the error falls to before jump values. It is interesting to note that higher now leads to a slower decay due to extended memory in the past.

We also quantify the degree of nonorthonormality of neural filters using the nonorthonormality error defined in section 4. Initially the nonorthonormality error decreases, reaching lower values with higher . Again, higher allows for smaller learning rates, allowing a fine-tuning of the neural filters. At , an increase in orthonormality error is observed as the network adjusts its neural filters. Then the error falls to before change values, with higher leading to a slower decay due to extended memory in the past.

## 6  Discussion

In this letter, we made a step toward a mathematically rigorous model of neuronal dimensionality reduction satisfying more biological constraints than was previously possible. By starting with the CMDS cost function, equation 2.1, we derived a single-layer neural network of linear units using only local learning rules. Using a local stability analysis, we showed that our algorithm finds a set of orthonormal neural filters and projects the input data stream to its principal subspace. We showed that with a small modification in learning rate updates, the same algorithm performs subspace tracking.

Our algorithm finds the principal subspace but not necessarily the principal components themselves. This is not a weakness since both the representation error cost, equation 1.3, and CMDS cost, equation 2.1, are minimized by projections to principal subspace and finding the principal components is not necessary.

Our network is most similar to Földiak’s (1989) network, which learns feedforward weights by a Hebbian Oja rule and the all-to-all lateral weights by an anti-Hebbian rule. Yet the functional form of the anti-Hebbian learning rule in Földiak’s network, , is different from ours, equation 2.9, resulting in three interesting differences. First, because the synaptic weight update rules in Földiak’s network are symmetric, if the weights are initialized symmetric (i.e., ), and learning rates are identical for lateral weights, they will stay symmetric. As mentioned above, such symmetry does not exist in our network (see equations 2.9 and 3.4). Second, while in Földiak’s network neural filters need not be orthonormal (Földiak, 1989; Leen, 1991), in our network they will be (see theorem 5). Third, in Földiak’s (1989) network, output units are decorrelated, since in its stationary state, . This need not be true in our network. Yet correlations among output units do not necessarily mean that information in the output about the input is reduced.7

Our network is similar to the APEX network (Kung & Diamantaras, 1990) in the functional form of both the feedforward and the lateral weights. However the network architecture is different because the APEX network has a lower-triangular lateral connectivity matrix. Such difference in architecture leads to two interesting differences in the APEX network operation (Diamantaras & Kung, 1996): (1) the outputs converge to the principal components, and (2) lateral weights decay to zero and neural filters are the feedforward weights. In our network, lateral weights do not have to decay to zero and neural filters depend on both the feedforward and lateral weights (see equation 3.2).

In numerical simulations, we observed that our network is faster than Földiak’s and APEX networks in minimizing the strain error, finding the principal subspace and orthonormalizing neural filters. This result demonstrates the advantage of our principled approach compared to heuristic learning rules.

Our choice of coordinate descent to minimize the cost function in the activity dynamics phase allowed us to circumvent problems associated with matrix inversion: . Matrix inversion causes problems for neural network implementations because it is a nonlocal operation. In the absence of a cost function, Földiak (1989) suggested implementing matrix inversion by iterating until convergence. We derived a similar algorithm using Jacobi iteration. However, in general, such iterative schemes are not guaranteed to converge (Hornik & Kuan, 1992). Our coordinate descent algorithm is almost always guaranteed to converge because the cost function in the activity dynamics phase, equation 2.4, meets the criteria in Luo and Tseng (1991).

Unfortunately, our treatment still suffers from the problem common to most other biologically plausible neural networks (Hornik & Kuan, 1992): a complete global convergence analysis of synaptic weights is not yet available. Our stability analysis is local in the sense that it starts by assuming that the synaptic weight dynamics has reached a stationary state and then proves that perturbations around the stationary state are stable. We have not made a theoretical statement on whether this state can ever be reached or how fast such a state can be reached. Global convergence results using stochastic approximation theory are available for the single-neuron Oja rule (Oja & Karhunen, 1985), its nonlocal generalizations (Plumbley, 1995), and the APEX rule (Diamantaras & Kung, 1996); however, applicability of stochastic approximation theory was questioned recently (Zufiria, 2002). Although a neural network implementation is unknown, Warmuth and Kuzmin’s (2008) online PCA algorithm stands out as the only algorithm for which a regret bound has been proved. An asymptotic dependence of regret on time can also be interpreted as convergence speed.

This letter also contributes to the MDS literature by applying the CMDS method to streaming data. However, our method has limitations in that to derive neural algorithms, we used the strain cost, equation 2.1, of CMDS. Such cost is formulated in terms of similarities, inner products to be exact, between pairs of data vectors and allowed us to consider a streaming setting where a data vector is revealed at a time. In the most general formulation of MDS, pairwise dissimilarities between data instances are given rather than data vectors themselves or similarities between them (Cox & Cox, 2000; Mardia et al., 1980). This generates two immediate problems for a generalization of our approach. First, a mapping to the strain cost function, equation 2.1, is possible only if the dissimilarites are Euclidean distances (see note 3). In general, dissimilarities do not need to be Euclidean or even metric distances (Cox & Cox, 2000; Mardia et al., 1980) and one cannot start from the strain cost, equation 2.1, for derivation of a neural algorithm. Second, in the streaming version of the general MDS setting, at each step, dissimilarities between the current and all past data instances are revealed, unlike our approach where the data vector itself is revealed. It is a challenging problem for future studies to find neural implementations in such generalized setting.

The online CMDS cost functions, equations 2.4 and 5.5, should also be valuable for subspace learning and tracking applications where biological plausibility is not a necessity. Minimization of such cost functions could be performed much more efficiently in the absence of constraints imposed by biology.8 It remains to be seen how the algorithms presented in this letter and their generalizations compare to state-of-the-art online subspace tracking algorithms from machine learning literature (Cichocki & Amari, 2002).

Finally, we believe that formulating the cost function in terms of similarities supports the possibility of representation-invariant computations in neural networks.

### Appendix:  Derivations and Proofs

#### A.1  Alternative Derivation of an Asynchronous Network

Here, we solve the system of equations 2.11 iteratively (Strang, 2009). First, we split the output covariance matrix that appears on the left-hand side of equation 2.11 into its diagonal component , a strictly upper triangular matrix , and a strictly lower triangular matrix :
A.1
Substituting this into equation 2.11, we get
A.2
where is a parameter. We solve equation 2.11 by iterating
A.3
until convergence. If symmetric is positive definite, the convergence is guaranteed for by the Ostrowski-Reich theorem (Reich, 1949; Ostrowski, 1954). When , the iteration, equation A.3, corresponds to the Gauss-Seidel method, and, when , to the succesive overrelaxation method. The choice of for fastest convergence depends on the problem, and we will not explore this question here. However, values around 1.9 are generally recommended (Strang, 2009).
Because in equation A.2, the matrix multiplying on the left is lower triangular and on the right is upper triangular, iteration A.3 can be performed component-by-component (Strang, 2009):
A.4
Note that is replaced with its new value before moving to the next component.
This algorithm can be implemented in a neural network,
A.5
where and , as defined in equation 2.6, represent the synaptic weights of feedforward and lateral connections, respectively. The case of can be implemented by a leaky integrator neuron. The case corresponds to our original asynchronous algorithm, except that now updates are performed in a particular order. For the case, which may converge faster, we do not see a biologically plausible implementation since it requires self-inhibition.

Finally, to express the algorithm in a fully online form, we rewrite equation 2.6 via recursive updates, resulting in equation 2.9.

#### A.2  Proof of Lemma 7

Proof of Lemma 2.
In our derivation below, we use results from equations 3.2, 3.3, and 3.4 of the main text.

#### A.3  Proof of Lemma 10

Here we calculate how evolves under the learning rule, , and derive equation 3.9.

First, we introduce some new notation to simplify our expressions. We define lateral synaptic weight matrix with diagonals set to 1 as
A.6
We use to denote perturbed matrices
A.7
Note that when the network is run with these perturbed synaptic matrices, for input , the network dynamics will settle to the fixed point,
A.8
which is different from the fixed point of the stationary network, .

Now we can prove lemma 10.

Proof of lemma 4.

The proof has the following steps.

1. Since our update rules are formulated in terms of and , it will be helpful to express in terms of and . The definition of , equation 3.2, gives us the desired relation:
2. We show that in the stationary state,
Proof. Average changes due to synaptic updates on both sides of equation A.9 are equal: . Noting that the unperturbed matrices are stationary, that is, , one gets , from which equation A.10 follows.
3. We calculate and using the learning rule, in terms of matrices , , , , and , and plug the result into equation A.10. This manipulation is going to give us the evolution of equation, 3.9.First, :
Next we calculate :
Plugging these in equation A.10, we get
Mkr and terms can be eliminated using the previously derived relations, equations 3.2 and A.9. This leads to a cancellation of some of the terms given above, and finally we have
To proceed further, we note that
which allows us to simplify the last term. Then we get our final result:

#### A.4  Proof of Theorem 11

For ease of reference, we remind that in general can be written as in equation 3.8:
Here, is an skew symmetric matrix, is an symmetric matrix, and is an matrix. is an matrix with orthonormal rows. These rows are chosen to be orthogonal to the rows of . Let be the eigenvectors and be the corresponding eigenvalues. We label them such that spans the same space as the space spanned by the first m eigenvectors. We choose rows of to be the remaining eigenvectors: . Then, for future reference,
A.12
We also refer to the definition, equation A.6:

Proof of Theorem 11. Below, we discuss the conditions under which perturbations of are stable. We work to linear order in as stated in theorem 11. We treat separately the evolution of , , and under a general perturbation .

1. Stability of

1. Evolution of is given by
Proof. Starting from equation 3.8 and using equation A.12,
Here the last line results from equation A.12 applied to equation 3.9. We look at the first term again using equations A.12 and then 3.8:
Combining these gives equation A.13.
2. When is equation A.13 stable? Next, we show that stability requires
For ease of manipulation, we express equation A.13 as a matrix equation for each column of . For convenience, we change our notation to ,
We have one matrix equation for each j. These equations are stable if all eigenvalues of all are negative;
3. If one could calculate eigenvalues of , the stability condition can be articulated. We start this calculation by noting that
Therefore,
Then we need to calculate the eigenvalues of . They are
Multiply both sides by :
Next, we use the commutation of and , equation 3.7, and the orthogonality of neural filters, , equation 3.6, to simplify the left-hand side:
This implies that
Note that by the orthogonality of neural filters, the following is also true:
All the relations above would hold true if and , but this would require , which is a contradiction. Then equations A.16 and A.17 imply that is a shared eigenvector between and . and were shown to commute before, and they share a complete set of eigenvectors. However, some nm eigenvectors of have zero eigenvalues in . We had labeled shared eigenvectors with unit eigenvalue in to be . The eigenvalue of with respect to is 1; therefore, is one of . This proves that and
4. From equation A.15, it follows that for stability

2. Stability of and . Next, we check the stabilities of and :
In deriving the last line, we used equations 3.8 and A.12. The k summation was calculated before equation A.14. Plugging this in equation A.18, one gets
perturbation, which rotates neural filters to other orthonormal basis within the principal subspace, does not decay. On the other hand, destroys orthonormality and these perturbations do decay, making the orthonormal solution stable.

Collectively, the results above prove theorem 11.

#### A.5  Perturbation of the Stationary State due to Data Presentation

Our discussion of the linear stability of the stationary point assumed general perturbations. Perturbations that arise from data presentation,
A.19
form a restricted class of the most general case and have special consequences. Focusing on this case, we show that data presentations do not rotate the basis for extracted subspace in the stationary state.
We calculate perturbations within the extracted subspace. Using equations 3.8 and A.12,
A.20
We look at term more closely:
Plugging this back into equation A.20 gives
A.21
Therefore, perturbations that arise from data presentation do not rotate neural filter basis within the extracted subspace. This property should increase the stability of the neural filter basis within the extracted subspace.

## Acknowledgments

We are grateful to L. Greengard, S. Seung, and M. Warmuth for helpful discussions.

## References

Arora
,
R.
,
Cotter
,
A.
,
Livescu
,
K.
, &
Srebro
,
N.
(
2012
).
Stochastic optimization for PCA and PLS
. In
Proceedings of the Allerton Conference on Communication, Control, and Computing
(pp.
861
868
).
Piscataway, NJ
:
IEEE
.
Balzano
,
L. K.
(
2012
).
Handling missing data in high-dimensional subspace modeling
.
Becker
,
S.
, &
Plumbley
,
M.
(
1996
).
Unsupervised neural network learning procedures for feature extraction and classification
.
Appl. Intell.
,
6
(
3
),
185
203
.
Carroll
,
J.
, &
Chang
,
J.
(
1972
).
IDIOSCAL (individual differences in orientation scaling): A generalization of INDSCAL allowing idiosyncratic reference systems as well as an analytic approximation to INDSCAL
.
Paper presented at the Psychometric Meeting
,
Princeton, NJ
.
Cichocki
,
A.
, &
Amari
,
S.-I.
(
2002
).
Adaptive blind signal and image processing
.
Hoboken, NJ
:
Wiley
.
Cox
,
T.
, &
Cox
,
M.
(
2000
).
Multidimensional scaling
.
Boca Raton, FL
:
CRC Press
.
Crammer
,
K.
(
2006
).
Online tracking of linear subspaces
. In
G.
Lugois
&
H. U.
Simon
(Eds.),
Learning theory
(pp.
438
452
).
New York
:
Springer
.
Diamantaras
,
K.
(
2002
).
Neural networks and principal component analysis
. In
Y.-H.
Hu
&
J.-N.
Hwang
(Eds.),
Handbook of neural networks in signal processing
.
Boca Raton, FL
:
CRC Press
.
Diamantaras
,
K.
, &
Kung
,
S.
(
1996
).
Principal component neural networks: Theory and applications
.
Hoboken, NJ
:
Wiley
.
Földiak
,
P.
(
1989
).
Adaptive network for optimal linear feature extraction
. In
Proceedings of the International Joint Conference on Neural Networks
(pp.
401
405
).
Piscataway, NJ
:
IEEE
.
Goes
,
J.
,
Zhang
,
T.
,
Arora
,
R.
, &
Lerman
,
G.
(
2014
).
Robust stochastic principal component analysis
. In
Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics
(pp.
266
274
). http://jmlr.org/proceedings/papers/v33/
Hornik
,
K.
, &
Kuan
,
C.-M.
(
1992
).
Convergence analysis of local feature extraction algorithms
.
Neural Networks
,
5
,
229
240
.
Hu
,
T.
,
Towfic
,
Z.
,
Pehlevan
,
C.
,
Genkin
,
A.
, &
Chklovskii
,
D.
(
2013
).
A neuron as a signal processing device
. In
Proceedings of the Asilomar Conference on Signals, Systems and Computers
(pp.
362
366
).
Piscataway, NJ
:
IEEE
.
Hubel
,
D. H.
(
1995
).
Eye, brain, and vision.
New York
:
Scientific American Library/Scientific American Books
.
Hyvärinen
,
A.
,
Hurri
,
J.
, &
Hoyer
,
P. O.
(
2009
).
Natural image statistics: A probabilistic approach to early computational vision.
New York
:
Springer
.
Karhunen
,
J.
, &
Oja
,
E.
(
1982
).
New methods for stochastic approximation of truncated Karhunen-Loeve expansions
. In
Proc. 6th Int. Conf. on Pattern Recognition
(pp.
550
553
).
New York
:
Springer-Verlag
.
Kung
,
S.-Y.
(
2014
).
Kernel methods and machine learning
.
Cambridge
:
Cambridge University Press
.
Kung
,
S.
, &
Diamantaras
,
K.
(
1990
).
A neural network learning algorithm for adaptive principal component extraction (apex)
. In
Proceedings of the IEEE Conference on Acoustics, Speech, and Signal Processing
(pp.
861
864
).
Piscataway, NJ
:
IEEE
.
Kung
,
S.-Y.
,
Diamantaras
,
K.
, &
Taur
,
J.-S.
(
1994
).
Adaptive principal component extraction (APEX) and applications
.
IEEE T Signal Proces.
,
42
,
1202
1217
.
Kushner
H. J.
, &
Clark
D. S.
(
1978
).
Stochastic approximation methods for constrained and unconstrained systems
.
New York
:
Springer
.
Leen
,
T. K.
(
1990
).
Dynamics of learning in recurrent feature-discovery networks
. In
D.
Touretzky
&
R.
Lippmann
(Eds.),
Advances in neural information processing systems, 3
(pp.
70
76
).
San Mateo, CA
:
Morgan Kaufmann
.
Leen
,
T. K.
(
1991
).
Dynamics of learning in linear feature-discovery networks
.
Network
,
2
(
1
),
85
105
.
Linsker
,
R.
(
1988
).
Self-organization in a perceptual network
.
IEEE Computer
,
21
,
105
117
.
Luo
,
Z. Q.
, &
Tseng
,
P.
(
1991
).
On the convergence of a matrix splitting algorithm for the symmetric monotone linear complementarity problem
.
SIAM J. Control Optim.
,
29
,
1037
1060
.
Mardia
,
K.
,
Kent
,
J.
, &
Bibby
,
J.
(
1980
).
Multivariate analysis
.
Orlando, FL
:
.
Oja
,
E.
(
1982
).
Simplified neuron model as a principal component analyzer
.
J. Math. Biol.
,
15
,
267
273
.
Oja
,
E.
(
1992
).
Principal components, minor components, and linear neural networks
.
Neural Networks
,
5
,
927
935
.
Oja
,
E.
, &
Karhunen
,
J.
(
1985
).
On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix
.
J. Math. Anal. Appl.
,
106
(
1
),
69
84
.
Ostrowksi
A. M.
, (
1954
).
On the linear iteration procedures for symmetric matrices
.
Rend. Mat. Appl.
,
14
,
140
163
.
Pearson
,
K.
(
1901
).
On lines and planes of closest fit to systems of points in space
.
Philos Mag.
,
2
,
559
572
.
Plumbley
,
M. D.
(
1993
).
A Hebbian/anti-Hebbian network which optimizes information capacity by orthonormalizing the principal subspace
. In
Proceedings of the International Conference on Artificial Neural Networks
(pp.
86
90
).
Piscataway, NJ
:
IEEE
.
Plumbley
,
M. D.
(
1995
).
Lyapunov functions for convergence of principal component algorithms
.
Neural Networks
,
8
(
1
),
11
23
.
Preisendorfer
,
R.
, &
Mobley
,
C.
(
1988
).
Principal component analysis in meteorology and oceanography
.
New York
:
Elsevier Science
.
Reich
,
E.
(
1949
).
On the convergence of the classical iterative procedures for symmetric matrices
.
Ann. Math. Statistics
,
20
,
448
451
.
Rubner
,
J.
, &
Schulten
,
K.
(
1990
).
Development of feature detectors by self-organization
.
Biol. Cybern.
,
62
,
193
199
.
Rubner
,
J.
, &
Tavan
,
P.
(
1989
).
A self-organizing network for principal-component analysis
.
Europhysics Letters
,
10
(
7
),
693
.
Sanger
,
T.
(
1989
).
Optimal unsupervised learning in a single-layer linear feedforward neural network
.
Neural Networks
,
2
(
6
),
459
473
.
Shepherd
,
G.
(
2003
).
The synaptic organization of the brain
.
New York
:
Oxford University Press
.
Strang
,
G.
(
2009
).
Introduction to linear algebra
.
Wellesley, MA
:
Wellesley-Cambridge Press
Torgerson
,
W.
(
1952
).
Multidimensional scaling: I. Theory and method
.
Psychometrika
,
17
,
401
419
.
Warmuth
,
M.
, &
Kuzmin
,
D.
(
2008
).
Randomized online PCA algorithms with regret bounds that are logarithmic in the dimension
.
J. Mach. Learn. Res.
,
9
(
10
),
2287
2320
.
Williams
,
C.
(
2001
).
On a connection between kernel PCA and metric multidimensional scaling
. In
T. K.
Leen
,
T. G.
Dietterich
, &
V.
Tresp
(Eds.),
Advances in neural information processing systems, 13
(pp.
675
681
).
Cambridge, MA
:
MIT Press
.
Yang
,
B.
(
1995
).
Projection approximation subspace tracking
.
IEEE T Signal Proces.
,
43
,
95
107
.
Young
,
G.
, &
Householder
,
A.
(
1938
).
Discussion of a set of points in terms of their mutual distances
.
Psychometrika
,
3
(
1
),
19
22
.
Zufiria
,
P. J.
(
2002
).
On the discrete-time dynamics of the basic Hebbian neural network node
.
IEEE Trans. Neural Netw.
,
13
,
1342
1352
.

## Notes

1

Recall that in general, the projection matrix to the row space of a matrix P is given by , provided is full rank (Plumbley, 1995). If the rows of are orthonormal, this reduces to .

2

Whereas MDS in general starts with dissimilarities between samples that may not live in Euclidean geometry, in CMDS, data are assumed to have a Euclidean representation.

3

When input data are pairwise Euclidean distances, assembled into a matrix , the Gram matrix, , can be constructed from by , where , is the centering matrix, is the vector of n unitary components, and is the n-dimensional identity matrix (Cox & Cox, 2000; Mardia et al., 1980).

4

The single-neuron Oja’s rule derived from the minimization of a least squares optimization cost function ends up with the identical learning rate (Diamantaras, 2002; Hu et al., 2013). Motivated by this fact, such learning rate has been argued to be optimal for the APEX network (Diamantaras & Kung, 1996; Diamantaras, 2002) and used by others (Yang, 1995).

5

Application of stochastic approximation theory to PCA neural networks depends on a set of mathematical assumptions. See Zufiria (2002) for a critique of the validity of these assumptions and an alternative approach to stability analysis.

6

If this fact is not familiar, we recommend Strang’s (2009) discussion of singular value decomposition.

7

As pointed out before (Linsker, 1988; Plumbley, 1993, 1995; Kung, 2014), PCA maximizes mutual information between a gaussian input, , and an output, , such that rows of have unit norms. When rows of are principal eigenvectors, outputs are principal components and are uncorrelated. However, the output can be multiplied by a rotation matrix, , and mutual information is unchanged, . is now a correlated gaussian, and still has rows with unit norms. Therefore, one can have correlated outputs with maximal mutual information between input and output as long as rows of span the principal subspace.

8

For example, matrix equation 2.11 could be solved by a conjugate gradient descent method instead of iterative methods. Matrices that keep input-input and output-output correlations in equation 2.11 can be calculated recursively, leading to a truly online method.