Why Do Similarity Matching Objectives Lead to Hebbian/Anti-Hebbian Networks?

Modeling self-organization of neural networks for unsupervised learning using Hebbian and anti-Hebbian plasticity has a long history in neuroscience. Yet derivations of single-layer networks with such local learning rules from principled optimization objectives became possible only recently, with the introduction of similarity matching objectives. What explains the success of similarity matching objectives in deriving neural networks with local learning rules? Here, using dimensionality reduction as an example, we introduce several variable substitutions that illuminate the success of similarity matching. We show that the full network objective may be optimized separately for each synapse using local learning rules in both the offline and online settings. We formalize the long-standing intuition of the rivalry between Hebbian and anti-Hebbian rules by formulating a min-max optimization problem. We introduce a novel dimensionality reduction objective using fractional matrix exponents. To illustrate the generality of our approach, we apply it to a novel formulation of dimensionality reduction combined with whitening. We confirm numerically that the networks with learning rules derived from principled objectives perform better than those with heuristic learning rules.


Introduction
The human brain generates complex behaviors via the dynamics of electrical activity in a network of ∼ 10 11 neurons each making ∼ 10 4 synaptic connections.As there is no known centralized authority determining which specific connections a neuron makes or specifying the weights of individual synapses, synaptic connections must be established based on local rules.Therefore, a major challenge in neuroscience is to determine local synaptic learning rules that would ensure that the network acts coherently, i.e. guarantee robust network self-organization.
Much work has been devoted to the self-organization of neural networks for solving unsupervised computational tasks using Hebbian and anti-Hebbian learning rules (Földiak, 1990;Földiak, 1989;Rubner and Tavan, 1989;Rubner and Schulten, 1990;Carlson, 1990;Plumbley, 1993b;Leen, 1991;Plumbley, 1993a;Linsker, 1997).Unsupervised setting is natural in biology because large-scale labeled datasets are typically unavailable.Hebbian and anti-Hebbian learning rules are biologically plausible because they are local: The weight of an (anti-)Hebbian synapse is proportional to the (minus) correlation in activity between the two neurons the synapse connects.
In networks for dimensionality reduction, for example, feedforward connections use Hebbian rules and lateral -anti-Hebbian, Figure 1.Hebbian rules attempt to align each neuronal feature vector, whose components are the weights of synapses impinging onto the neuron, with the input space direction of greatest variance.Anti-Hebbian rules mediate competition among neurons which prevents their feature vectors from aligning in the same direction.A rivalry between the two kinds of rules results in the equilibrium where synaptic weight vectors span the principal subspace of the input covariance matrix, i. e. the subspace spanned by the eigenvectors corresponding to the largest eigenvalues.
However, in most existing single-layer networks, Figure 1, Hebbian and anti-Hebbian learning rules were postulated rather than derived from a principled objective.Having such derivation should yield better performing rules and deeper understanding than has been achived using heuristic rules.But, until recently, all derivations of single-layer networks from principled objectives led to biologically implausible non-local learning rules, where the weight of a synapse depends on the activities of neurons other than the two the synapse connects.
Recently, single-layer networks with local learning rules have been derived from similarity matching objective functions (Pehlevan et al., 2015;Pehlevan and Chklovskii, 2014;Hu et al., 2014).But why do similarity matching objectives lead to neural networks with local, Hebbian and anti-Hebbian learning rules?A clear answer to this question has been lacking.
Here, we answer this question by performing several illuminating variable transformations.Specifically, we reduce the full network optimization problem to a set of trivial optimization problems for each synapse which can be solved locally.Eliminating neural activity variables leads to a min-max objective in terms of feedforward and lateral synaptic weight matrices.This finally formalizes the long-held intuition about the adversarial relationship of Hebbian and anti-Hebbian learning rules.
In this paper, we make the following contributions.In Section 2, we present a more transparent derivation of the previously proposed online similarity matching algorithm for Principal Subspace Projection (PSP).In Section 3, we propose a novel objective for PSP combined with spherizing, or whitening, the data, which we name Principal Subspace Whitening (PSW), and derive from it a biologically plausible online algorithm.Also, in Sections 2 and 3, we demonstrate that stability in the offline setting guarantees projection onto the principal subspace and give principled learning rate recommendations.In Section 4, by eliminating activity variables from the objectives, we derive min-max formulations of PSP and PSW which yield themselves to game-theoretical interpretations.In Section 5, by expressing the optimization objectives in terms of feedforward synaptic weights only, we arrive at novel formulations of dimensionality reduction in terms of fractional powers of matrices.In Section 6, we demonstrate numerically that the performance of our online algorithms is superior to the heuristic ones.
2 From similarity matching to Hebbian/anti-Hebbian networks for PSP

Derivation of a mixed PSP from similarity matching
The PSP problem is formulated as follows.Given T centered input data samples, x t ∈ R n , find T projections, y t ∈ R k , onto the principal subspace (k ≤ n), i.e. the subspace spanned by eigenvectors corresponding to the k top eigenvalues of the input covariance matrix: where we resort to a matrix notation by concatenating input column vectors into X = [x 1 , . . ., x T ].Similarly, outputs are Y = [y 1 , . . ., y T ].
Our goal is to derive a biologically plausible single-layer neural network implementing PSP by optimizing a principled objective.Biological plausibility requires that the learning rules are local, i.e. synaptic weight update depends on the activity of only the two neurons the synapse connects.The only PSP objective known to yield a single-layer neural network with local learning rules is based on similarity matching (Pehlevan et al., 2015).This objective, borrowed from Multi-Dimensional Scaling (MDS), minimizes the mismatch between the similarity of inputs and outputs (Mardia et al., 1980;Williams, 2001;Cox and Cox, 2000): Here, similarity is quantified by the inner products between all pairs of inputs (outputs) comprising the Grammians X ⊤ X (Y ⊤ Y).
One can understand intuitively that the objective (2) is optimized by the projection onto the principal subspace by considering the following (for a rigorous proof see (Pehlevan and Chklovskii, 2015;Mardia et al., 1980;Cox and Cox, 2000)).First, substitute a Singular Value Decomposition (SVD) for matrices X and Y and note that the mismatch is minimized by matching right singular vectors of Y to that of X.Then, rotating the Grammians to the diagonal basis reduces the minimization problem to minimizing the mismatch between the corresponding singular values squared.Therefore, Y is given by the top k right singular vectors of X scaled by corresponding singular values.As the objective (2) is invariant to the left-multiplication of Y by an orthogonal matrix, it has infinitely many degenerate solutions.One such solution corresponds to the Principal Component Analysis (PCA).
Unlike non-neural-network formulations of PSP or PCA, similarity matching outputs principal components (scores) rather than principal eigenvectors of the input covariance (loadings).Such difference in formulation is motivated by our interest in PSP or PCA neural networks (Diamantaras and Kung, 1996) that output principal components, y t , rather than principal eigenvectors.Principal eigenvectors are not transmitted downstream of the network but can be recovered computationally from the synaptic weight matrices.Although synaptic weights do not enter the objective (2), in previous work (Pehlevan et al., 2015), they arose naturally in the derivation of the online algorithm (see below) and stored correlations between input and output neural activities.
Next, we derive the min-max PSP objective from Eq. ( 2), starting with expanding the square of the Frobenius norm: We can rewrite Eq. ( 3) by introducing two new dynamical variable matrices in place of covariance matrices 1 T XY ⊤ and 1 T YY ⊤ : min (5) To see that Eq. ( 5) is equivalent to Eq. (3) find optimal W * =1 T YX ⊤ and M * = 1 T YY ⊤ by setting the corresponding derivatives of objective (5) to zero.Then, substitute W * and M * into Eq.( 5) to obtain (3).
Finally, we exchange the order of minimization with respect to Y and W as well as the order of minimization with respect to Y and maximization with respect to M in Eq. ( 5).The last exchange is justified by the saddle point property (see Proposition 1 in Appendix A).Then, we arrive at the following min-max optimization problem: min where L P SP (W, M, Y) is defined in Eq. ( 5).We call this a mixed objective because it includes both output variables, Y, and covariances, W and M.

Offline PSP algorithm
In this section, we present an offline optimization algorithm to solve the PSP problem and analyze fixed points of the corresponding dynamics.These results will be used in the next Section for the biologically plausible online algorithm implemented by neural networks.
In the offline setting, we can solve Eq. ( 6) by the alternating optimization approach used commonly in neural networks literature (Olshausen et al., 1996;Olshausen and Field, 1997;Arora et al., 2015).We, first, minimize with respect to Y while keeping W and M fixed, and, second, make a gradient descent-ascent step with respect to W and M while keeping Y fixed: where η is the W learning rate and τ > 0 is a ratio of learning rates for W and M. In Appendix C, we analyze how τ affects linear stability of the fixed point dynamics.These two phases are iterated until convergence (Algorithm 1) 1 .
Algorithm 1 Offline min-max PSP 1: Initialize W. Initalize M as a positive definite matrix.
2: Iterate until convergence: 3: Minimize Eq. ( 5) with respect to Y, keeping W and M fixed: 4: Perform a gradient descent-ascent step with respect to W and M for a fixed Y: where the step size, 0 < η < 1, may depend on the iteration.
Optimal Y in Eq. ( 9) exists because M stays positive definite if initialized as such.

Linearly stable fixed points of Algorithm 1 correspond to the PSP
Here we demonstrate that convergence of Algorithm 1 to fixed W and M implies that Y is a PSP of X.To this end, we approximate the gradient descent-ascent dynamics in the limit of small learning rate with the system of differential equations: where t is now the time index for gradient descent-ascent dynamics.
To state our main result in Theorem 1, we define the "filter matrix" F(t) whose rows are "neural filters" so that, according to Eq. ( 9), Theorem 1.Fixed points of the dynamical system (11) have the following properties: 1.The neural filters, F, are orthonormal, i.e.FF ⊤ = I.
2. The neural filters span a k-dimensional subspace in R n spanned by some k eigenvectors of the input covariance matrix.
3. Stability of a fixed point requires that the neural filters span the principal subspace of X.
4. Suppose the neural filters span the principal subspace.Define where i = 1, . . ., k, j = 1, . . ., k and {σ 1 , . . ., σ k } are the top k principal eigenvalues of C. We assume σ k = σ k+1 .This fixed point is linearly stable if and only if: for all (i, j) pairs.By linearly stable we mean that linear perturbations of W and M converge to a configuration in which the new neural filters are merely rotations within the principal subspace of the original neural filters.
Proof.See Appendix C.
Based on Theorem 1 we claim that, provided the dynamics converges to a fixed point, Algorithm 1 has found a PSP of input data.Note that the orthonormality of the neural filters is desired and consistent with PSP since, in this approach, outputs, Y, are interpreted as coordinates with respect to a basis spanning the principal subspace.
Theorem 1 yields a practical recommendation for choosing learning rate parameters in simulations.In a typical situation, one will not know the eigenvalues of the covariance matrix a priori but can rely on the fact, γ ij ≥ 2.Then, Eq. (15) implies that for τ ≤ 1/2 the principal subspace is linearly stable leading to numerical convergence and stability.

Online neural min-max optimization algorithms
Unlike the offline setting considered so far, where all the input data are available from the outset, in the online setting, input data are streamed to the algorithm anti-Hebbian plasticity Hebbian x 1 x n . . .sequentially, one at a time.The algorithm must compute the corresponding output before the next input arrives and transmit it downstream.Once transmitted, the output cannot be altered.Moreover, the algorithm cannot store in memory any sizable fraction of past inputs or outputs but only a few, O(nk), state variables.
Whereas developing algorithms for the online setting is more challenging than that for the offline, it is necessary both for data analysis and for modeling biological neural networks.The size of modern datasets may exceed that of available RAM and/or the output must be computed before the dataset is fully streamed.Biological neural networks operating on the data streamed by the sensory organs are incapable of storing any significant fraction of it and compute the output on the fly.Pehlevan et al. (2015) gave a derivation of a neural online algorithm for PSP, starting from the original similarity matching cost function (2).Here, instead, we start from the min-max form of similarity matching (6) and end up with a class of algorithms that reduce to the algorithm of Pehlevan et al. (2015) for special choices of learning rates.Our main contribution, however, is that the current derivation is much more intuitive and simpler, with insights to why similarity matching leads to local learning rules.
We start by rewriting the min-max PSP objective (6) as a sum of timeseparable terms that can be optimized independently: where and This separation in time is a benefit of the min-max PSP objective (6), and leads to a natural way to derive an online algorithm that was not available for the original similarity matching cost function (2).
To solve the optimization problem, Eq. ( 16), in the online setting, we optimize sequentially each l P SP,t .For each t, first, minimize Eq.( 18) with respect to y t while keeping W t and M t fixed.Second, make a gradient descent-ascent step with respect to W t and M t for fixed Y: where 0 < η t < 1 is the W learning rate and τ > 0 is the ratio of W and M learning rates.As before, Proposition 2 (Appendix B) ensures that the online gradient descent-ascent updates, Eq. ( 19), follow from alternating optimization (Olshausen et al., 1996;Olshausen and Field, 1997;Arora et al., 2015) of l P SP,t .

3:
Receive input x t

4:
Neural activity: Run until convergence 5: Plasticity: Update synaptic weight matrices, Algorithm 2 can be implemented by a biologically plausible neural network.The dynamics (20) corresponds to neural activity in a recurrent circuit, where W t is the feedforward synaptic weight matrix and −M t is the lateral synaptic weight matrix, Fig. 1A.Since M t is always positive definite, Eq. ( 18) is a Lyapunov function for neural activity.Hence the dynamics is guaranteed to converge to a unique fixed point, y t = M −1 t W t x t , where matrix inversion is computed iteratively in a distributed manner.
Updates of covariance matrices, Eq. ( 21), can be interpreted as synaptic learning rules: Hebbian for feedforward and anti-Hebbian (due to the "−" sign in ( 20)) for lateral synaptic weights.Importantly, these rules are local -the weight of each synapse depends only on the activity of the pair of neurons that synapse connects -and therefore biologically plausible.
Even requiring full optimization with respect to y t vs. a gradient step with respect to W t and M t may have a biological justification.As neural activity dynamics is typically faster than synaptic plasticity, it may settle before the arrival of the next input.
To see why similarity matching leads to local learning rules let us consider Eqs. ( 6) and ( 16).Aside from separating in time, useful for derivation of online learning rules, L P SP (W, M, Y) also separates in synaptic weights and their preand postsynaptic neural activities, Therefore, a derivative with respect to a synaptic weight depends only on the quantities accessible to the synapse.
Finally, we address two potential criticisms of the neural PSP algorithm.First is the existence of autapses, i.e. self-coupling of neurons, in our network manifested in nonzero diagonals of the lateral connectivity matrix, M, Fig 1A .Whereas autapses are encountered in the brain, they are rarely seen in principal neurons (Ikeda and Bekkers, 2006).Second is the symmetry of lateral synaptic weights in our network which is not observed experimentally.We derive an autapse-free network architecture (zeros on the diagonal of the lateral synaptic weight matrix M t ) with asymmetric lateral connectivity, Fig 1B, by using coordinate descent (Pehlevan et al., 2015) in place of gradient descent in the neural dynamics stage (20) (see Appendix F).The resulting algorithm produces the same outputs as the current algorithm and for the special case τ = 1/2 and η t = η/2, reduces to the algorithm with "forgetting" of Pehlevan et al. (2015).
3 From constrained similarity matching to Hebbian/anti-Hebbian networks for PSW The variable substitution method we introduced in the previous section can be applied to other computational objectives in order to derive neural networks with local learning rules.To give an example, we derive a neural network for PSW, which can be formulated as a constrained similarity matching problem.This example also illustrates how an optimization constraint can be implemented by biological mechanisms.

Derivation of PSW from constrained similarity matching
The PSW problem is closely related to PSP: project centered input data samples onto the principal subspace (k ≤ n), and "spherize" the data in the subspace so that the variances in all directions are 1.To derive a neural PSW algorithm, we use the similarity matching objective with an additional constraint: PSW : min We rewrite Eq. ( 23) by expanding the Frobenius norm squared and dropping the Tr Y ⊤ YY ⊤ Y term, which is constant under the constraint, thus reducing (23) to a constrained similarity alignment problem: To see that objective ( 24) is optimized by the PSW, first, substitute a Singular Value Decomposition (SVD) for matrices X and Y and note that the alignment is maximized by matching right singular vectors of Y to X and rotating to the diagonal basis (for a rigorous proof see (Pehlevan and Chklovskii, 2015)).Since the squared singular values of Y equal unity, the objective ( 24) is reduced to a summation of k squared singular values of X and is optimized by choosing the top k.Then, Y is given by the top k right singular vectors of X scaled by √ T .As before, objective ( 24) is invariant to the left-multiplication of Y by an orthogonal matrix and, therefore, has infinitely many degenerate solutions.
Next, we derive a mixed PSW objective from Eq. ( 24) by introducing two new dynamical variable matrices: the input-output correlation matrix, W = 1 T XY ⊤ , and the Lagrange multiplier matrix, M, for the whitening constraint: min where To see that Eq. ( 26) is equivalent to Eq. ( 24), find optimal W * = 1 T YX ⊤ by setting the corresponding derivatives of the objective (26) to zero.Then, substitute W * into Eq.( 26) to obtain the Lagrangian of Eq. ( 24).
Finally, we exchange the order of minimization with respect to Y and W as well as the order of minimization with respect to Y and maximization with respect to M in Eq. ( 26) (see Proposition 5 in Appendix D for a proof).Then, we arrive at the following min-max optimization problem with a mixed objective: min where L P SW (W, M, Y) is defined in Eq. ( 26).

Offline PSW algorithm
Next, we give an offline algorithm for the PSW problem, using the alternating optimization procedure as before.We solve Eq. ( 27) by, first, optimizing with respect to Y for fixed W and M and, second, making a gradient descent-ascent step with respect to W and M while keeping Y fixed2 .We arrive at the following algorithm: Algorithm 3 Offline min-max PSW 1: Initialize W. Initialize M as a positive definite matrix.
2: Iterate until convergence: 3: Minimize Eq. ( 26) with respect to Y, keeping W and M fixed: 4: Perform a gradient descent-ascent step with respect to W and M for a fixed Y: where the step size, 0 < η < 1, may depend on the iteration.
Convergence of Algorithm 3 requires the input covariance matrix, C, to have at least k non-zero eigenvalues.Otherwise, a consistent solution cannot be found because update (29) forces Y to be full-rank while Eq. ( 28) lowers its rank.

Linearly stable fixed points of Algorithm 3 correspond to PSW
Here we claim that convergence of Algorithm 3 to fixed W and M implies PSW of X.In the limit of small learning rate, the gradient descent-ascent dynamics can be approximated with the system of differential equations: where t is now the time index for gradient descent-ascent dynamics.We again define the neural filter matrix F = M −1 W.
Theorem 2. Fixed points of the dynamical system (30) have the following properties: 1.The outputs are whitened, i.e. 1 T YY ⊤ = I. 2. The neural filters span a k-dimensional subspace in R n which is spanned by some k eigenvectors of the input covariance matrix.
3. Stability of the fixed point requires that the neural filters span the principal subspace of X.
4. Suppose the neural filters span the principal subspace.This fixed point is linearly stable if and only if for all (i, j) pairs, i = j.By linear stability we mean that linear perturbations of W and M converge to a rotation of the original neural filters within the principal subspace.
Proof.See Appendix E.
Based on Theorem 2 we claim that, provided Algorithm 3 converges, this fixed point corresponds to a PSW of input data.Unlike the PSP case, the neural filters are not orthonormal.

Online algorithm for PSW
As before, we start by rewriting the min-max PSW objective (27) as a sum of time-separable terms that can be optimized independently: where and l t (W, M, y t ) is defined in Eq. ( 18).In the online setting, Eq. ( 32) can be optimized by sequentially minimizing each l P SW,t .For each t, first, minimize (18) with respect to y t for fixed W t and M t , second, update W t and M t according to a gradient descent-ascent step for fixed y t : where 0 < η t < 1 is the W learning rate and τ > 0 is the ratio of W and M learning rates.

3:
Receive input x t

4:
Neural activity: Run until convergence 5: Plasticity: Update synaptic weight matrices, Algorithm 4 can be implemented by a biologically plausible single-layer neural network with lateral connections as in Algorithm 2, Fig. 1A.Updates to synaptic weights, Eq. ( 36), are local, Hebbian/anti-Hebbian plasticity rules.An autapse-free network architecture, Fig 1B, may be obtained using coordinate descent (Pehlevan et al., 2015) in place of gradient descent in the neural dynamics stage (35) (see Appendix G).
The lateral connections here are the Lagrange multipliers introduced in the offline problem, Eq. ( 26).In the PSP network, they resulted from a variable transformation of the output covariance matrix.This difference caries over to the learning rules, where in Algorithm 4, the lateral learning rule is enforcing the whitening of the output, but in Algorithm 2, the lateral learning rule sets the lateral weight matrix to the output covariance matrix.

Game theoretical interpretation of Hebbian/anti-Hebbian learning
In the original similarity matching objective, Eq. ( 2), the only variables are neuronal activities which, at the optimum, represent principal components.In Section 2, we rewrote these objectives by introducing matrices W and M corresponding to synaptic connection weights, Eq. ( 5).Here, we eliminate neural activity variables altogether and arrive at a min-max formulation in terms of feedforward, W, and lateral, M, connection weight matrices only.This formulation lends itself to a game-theoretical interpretation.
Since in the offline PSP setting, optimal M * in Eq. ( 6) is an invertible matrix (because M * = 1 T Y * Y * ⊤ , see also Appendix A), we can restrict our optimization to invertible matrices, M, only.Then, we can optimize objective (5) with respect to Y and substitute its optimal value Y * = M −1 WX into ( 5) and ( 6) to obtain: This min-max objective admits a game-theoretical interpretation where feedforward, W, and lateral, M, synaptic weight matrices oppose each other.To reduce the objective, feedforward synaptic weight vectors of each output neuron attempt to align with the direction of maximum variance of input data.However, if this was the only driving force then all output neurons would learn the same synaptic weight vectors and represent the same top principal component.At the same time, linear dependency between different feedforward synaptic weight vectors can be exploited by the lateral synaptic weights to increase the objective by cancelling the contributions of different components.To avoid this, the feedforward synaptic weight vectors become linearly independent and span the principal subspace.
A similar interpretation can be given for PSW, where feedforward, W, and lateral, M, synaptic weight matrices oppose each other adversarially.

Novel formulations of dimensionality reduction using fractional exponents
In this section, we point to a new class of dimensionality reduction objective functions that naturally follow from the min-max objectives ( 5) and ( 6).Eliminating both the neural activity variables, Y, and the lateral connection weight matrix, M, we arrive at optimization problems in terms of the feedforward weight matrix, W, only.The rows of optimal W form a non-orthogonal basis of the principal subspace.Such formulations of principal subspace problems involve fractional exponents of matrices and, to the best of our knowledge, have not been proposed previously.
By replacing max M min Y optimization in the min-max PSP objective, Eq. ( 6), by its saddle point value (see Proposition 1 in Appendix A) we find the following objective expressed solely in terms of W: The rows of the optimal W are not principal eigenvectors, rather the rowspace of W spans the principal subspace.
By replacing max M min Y optimization in the min-max PSW objective, Eq. ( 27), by its optimal value (see Proposition 5 in Appendix D): min As before, the rows of the optimal W are not principal eigenvectors, rather the rowspace of W spans the principal eigenspace.
We observe that the only material difference between Eqs. ( 38) and ( 39) is in the value of the fractional exponent.Based on this, we conjecture that any objective function of such form with a fractional exponent from a continuous range is optimized by W spanning the principal subspace.Such solutions would differ in the eigenvalues associated with the corresponding components.
A supporting argument for our conjecture comes from the work of Miao and Hua (1998), which studied the cost min Eq. 40 can be seen as a limiting case of our conjecture, where the fractional exponent goes to zero.Indeed, Miao and Hua (1998) proved that the rows of optimal W are an orthonormal basis for the principal eigenspace.

Numerical experiments
Next, we test our findings using a simple artificial dataset.We generated an n = 10 dimensional dataset and we simulated our offline and online algorithms to reduce this dataset to k = 3 dimensions, using different values of the parameter τ .
The results are plotted in Figs. 2, 3, 4 and 5 along with details of the simulations in the figures' caption.
Consistent with Theorems 1 and 2, small perturbations to PSP and PSW fixed points decayed (solid lines) or grew (dashed lines) depending on the value of τ , Fig. 2A.Offline simulations that start from random initial conditions converged to the PSP (or the PSW) solution if the fixed point was linearly stable, Fig. 2B.Interestingly, the online algorithms' performance were very close to that of the offline, Fig. 2C.Learning rates were η t = 1/ (10 3 + t).Errors were defined using deviation of the neural filters from their optimal values (Pehlevan et al., 2015).Let U be the 10 × 3 matrix whose columns are the top 3 left singular vectors of X. PSP error: ) in MATLAB notation.Solid (dashed) lines indicate linearly stable (unstable) choices of τ .A) Small perturbations to the fixed point.W and M matrices were initialized by adding a random Gaussian variable, N (0, 10 −6 ), elementwise to their fixed point values.B) Offline algorithm, initialized with random W and M matrices.C) Online algorithm, initialized with the same initial condition as in B).A random column of X is processed at each time.The error for linearly unstable simulations in Fig. 2 saturates rather than blowing up.This may seem at odds with Theorems 1 and 2, which stated that if there is a stable fixed point of the dynamics, it should be the PSP/PSW solution.A closer look resolves this dilemma.In Fig. 3, we plot the evolution of an element of the M matrix in the offline algorithms for stable and unstable choices of τ .When the principal subspace is linearly unstable, the synaptic weights exhibit undamped oscillations.The dynamics seems to be confined to a manifold with a fixed distance (in terms of the error metric) from the principal subspace.That the error does not grow to infinity is a result of the stabilizing effect of min-max antagonism of the synaptic weights.Online algorithms behave similarly.
Next, we studied in detail the effect of τ parameter on the convergence.In the offline algorithm, we plot the error after a fixed number of gradient steps, as a function of τ .For PSP, there is an optimal τ .Decreasing τ beyond the optimal value doesn't lead to a degradation in performance, however increasing it leads to a rapid increase in the error.For PSW, there is a plateau of low error for low values of τ but a rapid increase as one approaches the linear instability threshold.Online algorithms behave similarly.
Finally, we compared the performance of our online PSP algorithm to neural PSP algorithms with heuristic learning rules such as the Subspace Network (Oja, 1989) and the Generalized Hebbian Algorithm (GHA) (Sanger, 1989), on the same dataset.We found that our algorithm converges much faster (Fig. 5).Previously,  the original similarity matching network (Pehlevan et al., 2015), which is a special case of the online PSP algorithm of this paper, was shown to converge faster than the APEX (Kung et al., 1994) and Földiak's (Földiak, 1989) networks.

Discussion
In this paper, through transparent variable substitutions, we demonstrated why biologically plausible neural networks can be derived from similarity matching objectives, mathematically formalized the adversarial relationship between Hebbian feedforward and anti-Hebbian lateral connections using min-max optimization lending itself to a game-theoretical interpretation, and formulated dimensionality reduction tasks as optimizations of fractional powers of matrices.The formalism we developed should generalize to unsupervised tasks other than dimensionality reduction and could provide a theoretical foundation for both natural and artificial neural networks.
In comparing our networks with biological ones, most importantly, our networks rely only on local learning rules that can be implemented by synaptic plasticity.While Hebbian learning is famously observed in neural circuits (Bliss and Lømo,  (Oja, 1989) and the GHA (Sanger, 1989).The dataset and the error metric is as in Fig. 2. For fairness of comparison, the learning rates in all networks were set to η = 10 −3 .τ = 1/2 for the online PSP algorithm .Feedforward connectivity matrices were initialized randomly.For the online PSP algorithm, lateral connectivity matrix was initialized to the identity matrix.Curves show averages over 10 trials.
1973; Bliss and Gardner-Medwin, 1973), our networks also require anti-Hebbian learning, which can be interpreted as the long-term potentiation of inhibitory postsynaptic potentials.Experimentally, such long-term potentiation can arise from pairing action potentials in inhibitory neurons with subthreshold depolarization of postsynaptic pyramidal neurons (Komatsu, 1994;Maffei et al., 2006).However, plasticity in inhibitory synapses does not have to be Hebbian, i.e. depend on the correlation between pre-and postsynaptic activity (Kullmann et al., 2012).
To make progress, we had to make several simplifications sacrificing biological realism.In particular, we assumed that neuronal activity is a continuous variable which would correspond to membrane depolarization (in graded potential neurons) or firing rate (in spiking neurons).We ignored the nonlinearity of the neuronal input-output function.Such linear regime could be implemented via a resting state bias (in graded potential neurons) or resting firing rate (in spiking neurons).
The applicability of our networks as models of biological networks can be judged by experimentally testing the following predictions.First, we predict a relationship between the feedforward and lateral synaptic weight matrices which could be tested using modern connectomics datasets.Second, we suggest that similarity of output activity matches that of the input which could be tested by neuronal population activity measurements using calcium imaging.
Often the choice of a learning rate is crucial to the learning performance of neural networks.Here, we encountered a nuanced case where the ratio of feedforward and lateral weights, τ , affects the learning performance significantly.First, there is a maximum value of such ratio, beyond which the principal subspace solution is linearly unstable.The maximum value depends on the principal eigenvalues, but for PSP, τ ≤ 1/2 is always linearly stable.For PSW there isn't an always safe choice.Having the same learning rates for feedforward and lateral weights, τ = 1, may actually be unstable.Second, linear stability is not the only thing that affects performance.In simulations, for PSP, we observed that there is an optimal value of τ .For PSW, decreasing τ seems to increase performance until a plateau is reached.This difference between PSP and PSW may be attributed to the difference of origins of lateral connectivity.In PSW algorithms, lateral weights originate from Lagrange multipliers enforcing an optimization constraint.Low τ , meaning higher lateral learning rates, force the network to satisfy the constraint during the whole evolution of the algorithm.
Based on these observation, we can make practical suggestions for the τ parameter.For PSP, τ = 1/2 seems to be a good choice, which is also preferred from another derivation of an online similarity matching algorithm (Pehlevan et al., 2015).For PSW, the smaller the τ , the better it is, although one should make sure that the lateral weight learning rate η/τ is still sufficiently small.

A Proof of strong min-max property for PSP objective
Here we show that minimization with respect to Y and maximization with respect to M can be exchanged in Eq. ( 5).We will make use of the following min-max theorem (Boyd and Vandenberghe, 2004), for which we give a proof for completeness: Then, Since max b min a f (a, b) ≤ min a max b f (a, b) is always true, we get an equality.Now, we present the main result of this section.
where Y, M and A are arbitrary sized, real-valued matrices.f obeys a strong min-max property: Proof.We will show that the saddle-point property holds for Eq. ( 44).Then the result follows from Theorem 1.
If the saddle point exists, it is when ∇f = 0, Note that M * is symmetric and positive semidefinite.Multiplying the first equation by M * on the left and the right, and using the the second equation, we arrive at Solutions to Eq. ( 46) are not unique, because M * may not be invertible depending on A. However, all solutions give the same value of f : Now, we check if the saddle-point property, Eq. ( 41), holds.The first inequality is satisfied: The second inequality is also satisfied: where the last line follows form M * being positive semidefinite.Eq.s ( 49) and ( 50) show that the saddle-point property (41) holds, and therefore max and min can be exchanged and the value of f at the saddle-point is B Taking a derivative using a chain rule Proposition 2. Suppose a differentiable, scalar function H(a 1 , . . ., a m ), where a i ∈ R d i with arbitrary d i .Assume a finite minimum with respect to a m exists for a given set of {a 1 , . . ., a m−1 }: and the optimal a * m = arg min am H(a 1 , . . ., a m ) is a stationary point Then, for i = 1, . . ., m − 1 Proof.The result follows from application of the chain rule and the stationarity of the minimum: where the second term is zero due to Eq. ( 52).

C Proof of Theorem 1
Here we prove Theorem 1 using methodology from (Pehlevan et al., 2015).The fixed points of Eq. ( 11) are ( using¯for fixed point): where C is the input covariance matrix defined as in Eq. (1).

C.1 Proof of item 1
The result follows from Eq.s ( 12) and ( 55): C.2 Proof of item 2 First note that at fixed points, F⊤ F and C commute: Proof.The result follows from Eq.s ( 12) and ( 55): F⊤ F and C share the same eigenvectors, because they commute.Orthonormality of neural filters, Eq. ( 56), implies that the k rows of F are degenerate eigenvectors of F⊤ F with unit eigenvalue.To see this: F⊤ F F⊤ = F⊤ .Because the filters are degenerate, the corresponding k shared eigenvectors of C may not be the filters themselves but linear combinations of them.Nevertheless, the shared eigenvectors composed of filters span the same space as the filters.
Since we are interested in PSP, it is desirable that it is the top k eigenvectors of C that spans the filter space.A linear stability analysis around the fixed point reveals that any other combination is unstable, and that the PS is stable if τ is chosen appropriately.

C.3 Proof of item 3 Preliminaries
In order to perform a linear stability analysis, we linearize the system of equations (11) around the fixed point.Even though Eq. ( 11) depends on W and M, we will find it convenient to change variables and work with F and M instead.
Using the relation F = M −1 W, one can express linear perturbations of F around its fixed point, δF, in terms of perturbations of W and M: Linearization of Eq. ( 11) gives: and Using these, we arrive at: Eq.s ( 61) and ( 62) define a closed system of equations.
It will be useful to decompose δF into components3 : where δA is an k × k anti-symmetric matrix, δS is an k × k symmetric matrix and δB is an k × (n − k) matrix.Ḡ is an (n − k) × n matrix with orthonormal rows, which are orthogonal to the rows of F. δA and δS are perturbations that keep the neural filters within the filter space.Anti-symmetric δA corresponds to rotations of filters within the filter space, preserving orthonormality.Symmetric δS destroys orthonormality.δB is a perturbation that takes the neural filters outside of the filter space.Let v 1,...,n be the eigenvectors C and σ 1,...,n be the corresponding eigenvalues.We label them such that F spans the same space as the space spanned by the first k eigenvectors.We choose rows of Ḡ to be the remaining eigenvectors, i.e.Ḡ⊤ :

Proof
The proof of item 3 in Theorem 1 follows from studying the stability of δB component.
Multiplying Eq. ( 62) on the right by Ḡ⊤ , one arrives at a decoupled equation for δB: where for convenience we changed our notation to δB kj = δB j k .For each j, the dynamics is linearly stable if all eigenvalues of all P j are negative.In turn, this implies that for stability, eigenvalues of M should be greater than σ k+1,...,n .
Eigenvalues of M are: Proof.The eigenvalue equation implies that which can be seen by multiplying Eq. ( 67) on the left by F⊤ , using the commutation of F⊤ F and C, and the orthonormality of neural filters.Further, orthonormality of neural filters implies: Then, F⊤ λ is a shared eigenvector4 between C and F⊤ F. Shared eigenvectors of C with unit eigenvalue in F⊤ F are v 1 , . . ., v k .Since the eigenvalue of F⊤ λ with respect to F⊤ F is 1, F⊤ λ must be one of v 1 , . . ., v k .Then Eq. ( 68) implies that λ = {σ 1 , . . ., σ k } and Then, it follows that linear stability requires This proves our claim that if at the fixed point, the neural filters span a subspace other than the principal subspace, the fixed point is linearly unstable.

C.4 Proof of item 4
We now assume that the fixed point is the principal subspace.From item 3, we know that the δB perturbations are stable.The proof of item 4 in Theorem 1, follows from the linear stabilities of δA and δS.
Multiplying Eq. ( 62) on the right by F⊤ , Unlike the case of δB, this equation is coupled to δM, whose dynamics, Eq. ( 61), reduces to: We will only consider symmetric δM perturbations, although if antisymmetric perturbations were allowed, they would stably decay to zero, because the only antisymmetric term on the RHS of (73) would come from δM.From Eq.s ( 72) and ( 73), it follows that The RHS is symmetric.Therefore, the antisymmetric part of the LHS must equal zero.This gives us an integral of the dynamics where Ω is a constant, skew symmetric matrix.This reveals an interesting point, after the perturbation δA and δM will not decay to 0, even if the fixed point is stable.In hindsight, this is expected because due to the symmetry of the problem: there is a manifold of stable fixed points (bases in principal subspace), and perturbations within this manifold should not decay.A similar situation was observed in (Pehlevan et al., 2015).The symmetric part of Eq. ( 74) gives, which, using (73), implies To summarize, we analyze the linear stability of the system of equations, defined by Eq.s ( 73), ( 75), ( 77).
Next, we change to a basis where M is diagonal.M is symmetric, its eigenvalues are the principal eigenvectors {σ 1 , . . ., σ k } as shown in Appendix C.3 and it has an orthonormal set of eigenvectors.Let U be the matrix that contains the eigenvectors of M in its columns.Define Expressing Eq.s ( 73), ( 75), (77) in this new basis, in component form, and eliminating δA U ij : where This is a closed system of equations for each (i, j) pair!The fixed point of this system of equations is at Hence, if the linear perturbations are stable, the perturbations that destroy the orthonormality of neural filters will decay to zero, and orthonormality will be restored.
The stability of the fixed point is governed by the trace and the determinant of the matrix H ij .The trace is and the determinant is The system ( 79) is linearly stable if both the trace is negative and the determinant is positive.
Defining the following function of covariance eigenvalues: the trace is negative if and only if The determinant is positive if and only if Since γ ij > 0, Eq. ( 86) implies Eq. ( 85).For stability, Eq. ( 86) has to be satisfied for all (i, j) pairs.When i = j, γ ii = 2, Eq. ( 86) is satisfied because RHS is infinity.When i = j, Eq. ( 86) is nontrivial, and depends on relations between covariance eigenvalues.Since γ ij ≥ 2, τ ≤ 1/2 is always stable.Collectively, our results prove item 4 of Theorem 1.

D Proof of strong min-max property for PSW objective
Here we show that minimization with respect to Y and maximization with respect to M can be exchanged in Eq. ( 26).We do this by explicitly calculating the value of with respect to min-max and max-min optimizations, and showing that the value does not change.
Proof.Left side of Eq. ( 88) is a constrained optimization problem: Suppose an SVD of Then the optimization problem reduces to: Note that k j=1 u ⊤ A,i u Y,j v ⊤ A,i v Y,j ≤ 1 5 and therefore the cost is lower bounded by The sum in question is k i=1 α i β i , which is an inner product of a unit vector and a vector with magnitude less than or equal to 1. Hence, the maximal inner product can be 1.
Proof.Note that we only need to consider the symmetric part of M, because its antisymmetric component does not contribute to the cost.Below, we use M to mean its symmetric part.We will evaluate the value of the objective considering the following cases: 1.A = 0.In this case the first term in Eq. ( 92) drops.Minimization of the second term with respect to Y gives −∞ if M has a negative eigenvalue, or a 0 if M is positive semidefinite.Hence, the max-min objective is zero, and the proposition holds.
2. A = 0 and A is full-rank.
(a) M has at least one negative eigenvalue.Then, minimization of Eq. ( 92) with respect to Y gives −∞.
(b) M is positive semidefinite and has at least one zero eigenvalue.Then, minimization of Eq. ( 92) with respect to Y gives −∞.To achieve this solution, one chooses all columns of Y to be one of the zero eigenvectors.The sign of the eigenvector is chosen such that Tr A ⊤ Y is positive.Multiplying Y by a positive scalar, one can reduce the objective indefinitely.
(c) M is positive definite.Then, Y * = M −1 A minimizes Eq. ( 92) with respect to Y. Plugging this back to (92), we get the objective The positive definite M that maximizes Eq. ( 93) can be found by setting its derivative to zero Plugging this back in Eq. ( 93), one gets the objective which is maximal with respect to all possible M. Therefore the proposition holds.
3. A = 0 and A has rank r < k.
(a) M has at least one negative eigenvalue.Then, minimization of Eq. ( 92) with respect to Y gives −∞, as before.
(b) M is positive semidefinite and has at least one zero eigenvalue.
i.If at least one of the zero-eigenvectors of M is not a left zerosingular vector of A, then, minimization of Eq. ( 92) with respect to Y gives −∞.To achieve this solution, one chooses all columns of Y to be the zero-eigenvector of M that is not a left zero-singular vector of A. The sign of the eigenvector is chosen such that Tr A ⊤ Y is positive.Multiplying Y by a positive scalar, one can reduce the objective indefinitely.ii.If all of the zero-eigenvectors of M are also left zero-singular vectors of A, then Eq. ( 92) can be reformulated in the subspace spanned by top r eigenvectors of M. Suppose a SVD for A where columns of Y ⊥ are perdendicular to the space spanned by {u A,1 , . . ., u A,r }.Then value of the objective Eq. ( 92) only depends on Y A .Defining new matrices Ãi,: , where i, j = 1, . . ., r, we can rewrite Eq. ( 92) as Now Ã is full-rank and M is positive definite.As in 2.(c), the objective which is maximal with respect to positive definite M matrices is (c) M is positive definite.As in 2.(c), the objective which is maximal with respect to positive definite M matrices is This is also maximal with respect to all possible M. Therefore the proposition holds.
Propositions (3) and ( 4) imply the strong min-max property for the PSW cost.
Proposition 5.The strong min-max property for the PSW cost: linearization of (30): Even though Eq. ( 30) depends on W and M, we will find it convenient to change variables and work with R, as defined in Eq. ( 101), and M instead.Since R, W and M are interdependent, we express the perturbations of R in terms of W and M perturbations: which implies that Plugging these in and eliminating δW, we arrive at a linearized equation for δR: To asses the stability of δR, we expand it as in Appendix C.3: where δA is an k ×k skew-symmetric matrix, δS is an k ×k symmetric matrix and δB is an k×(n−k) matrix.Ḡ is an (n−k)×n matrix with orthonormal rows.These rows are chosen to be orthogonal to the rows of R. As before, skew-symmetric δA corresponds to rotations of filters within the normalized filter space, symmetric δS keeps the normalized filter space invariant but destroys orthonormality and δB is a perturbation that takes the normalized neural filters outside of the filter space.
Let v 1,...,n be the eigenvectors C and σ 1,...,n be the corresponding eigenvalues.We label them such that R spans the same space as the space spanned by the first k eigenvectors.We choose rows of Ḡ to be the remaining eigenvectors, i.e.Ḡ⊤ := [v k+1 , . . ., v n ].
This proves that if at the fixed point, normalized neural filters span a subspace other than the principal subspace, the fixed point is linearly unstable.Since the span of normalized neural filters is that of the neural filters, item 3 follows.

E.4 Proof of item 4
Proof of item 4 follows from the linear stabilities of δA and δS.Multiplying Eq. ( 108) on the right by R⊤ , and separating the resulting equation in to into its symmetric and anti-symmetric parts, we arrive at: To obtain a closed set of equations, we complement these equations with δM evolution, which we obtain by plugging the expansion (109) into Eq.( 105): We only consider symmetric δM below, since our algorithm preserves the symmetry of M in runtime.
We now change to a basis where M is diagonal.M is symmetric and has an orthonormal set of eigenvectors.Its eigenvalues are the principal eigenvalues {σ 1 , . . ., σ k } (from Appendix C.3).Let U be the matrix that contains the eigenvectors of M in its columns.Define δA U := U ⊤ δAU, δS U := U ⊤ δSU, δM U := U ⊤ δMU. (115) In this new basis, the linearized equations, in component form, become: where Linear stability is governed by the three eigenvalues of H ij .One of the eigenvalues is 0, due to the existence of the rotational symmetry in the problem.The corresponding eigenvector is [σ j − σ i , 1, 0].Note that the third element of the eigenvector is zero, showing that the orthogonality of the normalized neural filters are not spoiled even in this mode.
For stability of the principal subspace, the other two eigenvalues must be negative, which means their sum should be negative, and their multiplication should be positive.It is easy to show that both the negativity of the summation and the positivity of the multiplication holds if and only if for all (i, j) pairs with i = j: τ < σ i + σ j 2 (σ i − σ j ) 2 . (118) Hence we have showed that linear perturbations of fixed point weights decay to a configuration in which normalized neural filters are rotations of the original normalized neural filters within the subspace.It follows from Eq. ( 101), that the same holds for neural filters.
G Autapse-free constrained similarity matching network with asymmetric lateral connectivity Following similar steps to Appendix F, we derive an autapse-free PSW neural algorithm with asymmetric lateral connections.We replace the gradient descent neural dynamics defined by Eq. ( 35) by a coordinate descent dynamics, where at every step, one finds the optimal value of one component of y t , while keeping the rest fixed: The components can be cycled through in any order until the iteration converges to a fixed point.The coordinate descent iteration, Eq. ( 123), can be interpreted as the dynamics of an asynchronous autapse-free neural network, Fig. 1B, with synaptic weights: As in Appendix F, the new lateral weights are asymmetric.
Updates for these synaptic weights can be derived from the updates for W t and M t , Eq. (36).Defining another scalar state variable for each ith neuron Dt,i := τ M t,ii /η As in Appendix F, in addition to synaptic weights, the neurons need to keep track of a postsynaptic activity depended variable Dt,i and gradient descent-ascent learning rate parameters η W,t , η M,t and η M,t−1 .

Figure 2 :
Figure 2: Demonstration of the stability of the PSP (top row) and PSW (bottom row) algorithms.We constructed an n = 10 by T = 2000 data matrix X from its SVD, where the left and right singular vectors are chosen randomly, the top three singular values are set to { √ 3T , √ 2T , √ T } and the rest of the singular values are chosen uniformly in [0, 0.1 √ T ].Learning rates were η t = 1/ (10 3 + t).Errors were defined using deviation of the neural filters from their optimal values(Pehlevan et al., 2015).Let U be the 10 × 3 matrix whose columns are the top 3 left singular vectors of X. PSP error:F(t) ⊤ F(t) − UU ⊤ F , PSW error: F(t) ⊤ F(t) − USU ⊤ F , with S = diag ([1/3, 1/2, 1]) in MATLAB notation.Solid (dashed) lines indicate linearly stable (unstable) choices of τ .A) Small perturbations to the fixed point.W and M matrices were initialized by adding a random Gaussian variable, N (0, 10 −6 ), elementwise to their fixed point values.B) Offline algorithm, initialized with random W and M matrices.C) Online algorithm, initialized with the same initial condition as in B).A random column of X is processed at each time.

Figure 4 :
Figure4: Effect of τ of performance.Error after 2 × 10 4 gradient steps are plotted as a function of different choices of τ .Same dataset was used as in Fig.2with same network initalization and learning rates.Both curves start from τ = 0.01 and go to the maximum value allowed for linear stability.

Figure 5 :
Figure5: Comparison of the online PSP algorithm with the Subspace Network(Oja, 1989) and the GHA(Sanger, 1989).The dataset and the error metric is as in Fig.2.For fairness of comparison, the learning rates in all networks were set to η = 10 −3 .τ = 1/2 for the online PSP algorithm .Feedforward connectivity matrices were initialized randomly.For the online PSP algorithm, lateral connectivity matrix was initialized to the identity matrix.Curves show averages over 10 trials.