Cortical pyramidal neurons receive inputs from multiple distinct neural populations and integrate these inputs in separate dendritic compartments. We explore the possibility that cortical microcircuits implement canonical correlation analysis (CCA), an unsupervised learning method that projects the inputs onto a common subspace so as to maximize the correlations between the projections. To this end, we seek a multichannel CCA algorithm that can be implemented in a biologically plausible neural network. For biological plausibility, we require that the network operates in the online setting and its synaptic update rules are local. Starting from a novel CCA objective function, we derive an online optimization algorithm whose optimization steps can be implemented in a single-layer neural network with multicompartmental neurons and local non-Hebbian learning rules. We also derive an extension of our online CCA algorithm with adaptive output rank and output whitening. Interestingly, the extension maps onto a neural network whose neural architecture and synaptic updates resemble neural circuitry and non-Hebbian plasticity observed in the cortex.

Our brains can effortlessly extract a latent source contributing to two synchronous data streams, often from different sensory modalities. Consider, for example, following an actor while watching a movie with a soundtrack. We can easily pay attention to the actor's gesticulation and voice while filtering out irrelevant visual and auditory signals. How can biological neurons accomplish such multisensory integration?

In this article, we explore an algorithm for solving a linear version of this problem known as canonical correlation analysis (CCA; Hotelling, 1936). In CCA, the two synchronous data sets, known as views, are projected onto a common lower-dimensional subspace so that the projections are maximally correlated. For simple generative models, the sum of these projections yields an optimal estimate of the latent source (Bach & Jordan, 2005). CCA is a popular method because it has a closed-form exact solution in terms of the singular value decomposition (SVD) of the correlation matrix. Therefore, the projections can be computed using fast and well-understood spectral numerical methods.

To serve as a viable model of a neuronal circuit, the CCA algorithm must map onto a neural network consistent with basic biological facts. For our purposes, we say that a network is “biologically plausible” if it satisfies the following two minimal requirements: (1) the network operates in the online setting, that is, upon receiving an input, it computes the corresponding output without relying on the storage of any significant fraction of the full data set, and (2) the learning rules are local in the sense that each synaptic update depends only on the variables that are available as biophysical quantities represented in the pre- or postsynaptic neurons.

There are a number of neural network implementations of CCA (Lai & Fyfe, 1999; Pezeshki, Azimi-Sadjadi, & Scharf, 2003; Gou & Fyfe, 2004; Vía, Santamaría, & Pérez, 2007); however, most of these networks use nonlocal learning rules and are therefore not biologically plausible. One exception is the normative neural network model derived by Pehlevan, Zhao, Sengupta, and Chklovskii (2020). They start with formulating an objective for single-(output) channel CCA and derive an online optimization algorithm (previously proposed in Lai and Fyfe, 1999) that maps onto a pyramidal neuron with three electrotonic compartments: soma, as well as apical and basal dendrites. The apical and basal synaptic inputs represent the two views, the two dendritic compartments extract highly correlated CCA projections of the inputs, and the soma computes the sum of projections and outputs it downstream as action potentials. The communication between the compartments is implemented by calcium plateaus that also mediate non-Hebbian but local synaptic plasticity.

Whereas Pehlevan et al. (2020) also propose circuits of pyramidal neurons for multichannel CCA, their implementations lack biological plausibility. In one implementation, they resort to deflation where the circuit sequentially finds projections of the two views. Implementing this algorithm in a neural network requires a centralized mechanism to facilitate the sequential updates, and there is no experimental evidence of such a biological mechanism. In another implementation that does not require a centralized mechanism, the neural network has asymmetric lateral connections among pyramidal neurons. However, that algorithm is not derived from a principled objective for CCA and the network architecture does not match the neuronal circuitry observed in cortical microcircuits.

In this work, starting from a novel similarity-based CCA objective function, we derive an offline CCA algorithm, algorithm 1, and an online multichannel CCA algorithm, algorithm 2, which can be implemented in a single-layer network composed of three-compartment neurons with local, non-Hebbian synaptic update rules (see Figure 1). While our neural network implementation of CCA captures salient features of cortical microcircuits, the network includes direct lateral connections between the principal neurons (see Figure 1), in contrast to cortical microcircuits, where lateral influence between cortical pyramidal neurons is often indirect and mediated by interneurons. With this in mind, we derive an extension of our online CCA algorithm, algorithm 3, which adaptively chooses the rank of the output based on the level of correlation captured and also whitens the output. This extension is especially relevant for online unsupervised learning algorithms, which are often confronted with the challenge of adapting to nonstationary input streams. In addition, the algorithm naturally maps onto a neural network with multicompartmental principal neurons and without direct lateral connections between the principal neurons (see Figure 3 in section 5). Interestingly, both the neural architecture and the local, non-Hebbian plasticity resemble the neural circuitry and synaptic plasticity observed in cortical microcircuits.
Figure 1:

Single-layer network architecture with $k$ multicompartmental neurons for outputting the sum of the canonical correlation subspace projections (CCSPs) $\mathbf{z}=(z_1,\dots,z_k)$. See algorithm 2. Here $\mathbf{a}=\mathbf{W}_x\mathbf{x}$ and $\mathbf{b}=\mathbf{W}_y\mathbf{y}$ are projections of the views $\mathbf{x}=(x_1,\dots,x_m)$ and $\mathbf{y}=(y_1,\dots,y_n)$ onto a common $k$-dimensional subspace. The output, $\mathbf{z}=\mathbf{M}^{-1}(\mathbf{a}+\mathbf{b})$, is the sum of the CCSPs and is computed using recurrent lateral connections. The components of $\mathbf{a}$, $\mathbf{b}$, and $\mathbf{z}$ are represented in three separate compartments of the neurons. Filled circles denote non-Hebbian synapses, and empty circles denote anti-Hebbian synapses. Importantly, each synaptic update depends only on variables represented locally.

Figure 2:

Illustration of CCA. The left column depicts point clouds of two-dimensional views $X$ and $Y$ with lines indicating the span of the top principal component and top canonical correlation basis vector for each view (labeled “PCA” and “CCA,” respectively). The center and right columns depict point clouds of joint one-dimensional projections of $X$ and $Y$ onto their top principal component or top canonical correlation basis vector, with the correlations between the two projections listed. As illustrated in the lower right plot, the correlation between the projected views is maximized when each view is projected onto its top canonical correlation basis vector.

Figure 3:

Cortical microcircuit implementation of Adaptive Bio-CCA with output whitening (algorithm 3). The black cell bodies denote pyramidal neurons, with the apical tufts pointing upward. The red and blue lines denote the axons respectively transmitting the apical input $\mathbf{x}$ and basal input $\mathbf{y}$. The black lines originating from the bases of the pyramidal neurons are their axons, which transmit their output $\mathbf{z}$. The green cell bodies denote the interneurons, and the green lines are their axons, which transmit their output $\mathbf{n}$. Filled circles denote non-Hebbian synapses whose updates are proportional to the input (i.e., $\mathbf{x}$ or $\mathbf{y}$) and the weighted sum of the calcium plateau potential plus backpropagating somatic output [i.e., $\mathbf{c}^b+(1-\alpha)\mathbf{z}$ or $\mathbf{c}^a+(1-\alpha)\mathbf{z}$]. The directions of travel of these weighted sums are depicted using dashed lines with arrows. Empty circles denote Hebbian or anti-Hebbian synapses whose updates are proportional or inversely proportional to the pre- and postsynaptic activities.


There are a number of existing consequential models of cortical microcircuits with multicompartmental neurons and non-Hebbian plasticity (Körding & König, 2001; Urbanczik & Senn, 2014; Guerguiev, Lillicrap, & Richards, 2017; Sacramento, Costa, Bengio, & Senn, 2018; Haga & Fukai, 2018; Richards & Lillicrap, 2019; Milstein et al., 2020). These models provide mechanistic descriptions of the neural dynamics and synaptic plasticity and account for many experimental observations, including the nonlinearity of neural outputs and the layered organization of the cortex. While our neural network model is single-layered and linear, it is derived from a principled CCA objective function, which has several advantages. First, since biological neural networks evolved to adaptively perform behaviorally relevant computations, it is natural to view them as optimizing a relevant objective function. Second, our approach clarifies which features of the network (e.g., multicompartmental neurons and non-Hebbian synaptic updates) are central to computing correlations. Finally, since the optimization algorithm is derived from a CCA objective that can be solved offline, the neural activities and synaptic weights can be analytically predicted for any input without resorting to numerical simulation. In this way, our neural network model is interpretable and analytically tractable, and it provides a useful complement to nonlinear, layered neural network models.

The remainder of this work is organized as follows. We state the CCA problem in section 2. In section 3, we introduce a novel objective for the CCA problem and derive offline and online CCA algorithms. In section 4, we derive an extension of our CCA algorithm, and in section 5, we map the extension onto a simplified cortical microcircuit. We provide results of numerical simulations in section 6.

Notation. For positive integers $p,q$, let $\mathbb{R}^p$ denote $p$-dimensional Euclidean space, and let $\mathbb{R}^{p\times q}$ denote the set of $p\times q$ real-valued matrices equipped with the Frobenius norm $\|\cdot\|_F$. We use boldface lowercase letters (e.g., $\mathbf{v}$) to denote vectors and boldface uppercase letters (e.g., $\mathbf{M}$) to denote matrices. We let $O(p)$ denote the set of $p\times p$ orthogonal matrices and $\mathbb{S}_{++}^p$ denote the set of $p\times p$ positive definite matrices. We let $\mathbf{I}_p$ denote the $p\times p$ identity matrix.

Given $T$ pairs of full-rank, centered input data samples $(\mathbf{x}_1,\mathbf{y}_1),\dots,(\mathbf{x}_T,\mathbf{y}_T)\in\mathbb{R}^m\times\mathbb{R}^n$ and $k\le\min(m,n)$, the CCA problem is to find $k$-dimensional linear projections of the views $\mathbf{x}_1,\dots,\mathbf{x}_T$ and $\mathbf{y}_1,\dots,\mathbf{y}_T$ that are maximally correlated (see Figure 2). To be precise, consider the CCA objective,
$\underset{\mathbf{V}_x\in\mathbb{R}^{m\times k},\,\mathbf{V}_y\in\mathbb{R}^{n\times k}}{\arg\max}\ \operatorname{Tr}\big(\mathbf{V}_x^\top\mathbf{C}_{xy}\mathbf{V}_y\big),$
(2.1)
subject to the whitening constraint,1
$\mathbf{V}_x^\top\mathbf{C}_{xx}\mathbf{V}_x+\mathbf{V}_y^\top\mathbf{C}_{yy}\mathbf{V}_y=\mathbf{I}_k,$
(2.2)
where we have defined the sample covariance matrices
$\mathbf{C}_{xx}:=\frac{1}{T}\sum_{t=1}^{T}\mathbf{x}_t\mathbf{x}_t^\top,\qquad \mathbf{C}_{xy}:=\frac{1}{T}\sum_{t=1}^{T}\mathbf{x}_t\mathbf{y}_t^\top,\qquad \mathbf{C}_{yy}:=\frac{1}{T}\sum_{t=1}^{T}\mathbf{y}_t\mathbf{y}_t^\top.$
(2.3)
To compute the solution of the CCA objective, equations 2.1 and 2.2, define the $m\times n$ correlation matrix,
$\mathbf{R}_{xy}:=\mathbf{C}_{xx}^{-1/2}\mathbf{C}_{xy}\mathbf{C}_{yy}^{-1/2}.$
(2.4)
Let $\rho_1\ge\cdots\ge\rho_{\min(m,n)}$ denote the singular values, and let $\mathbf{U}_x\in O(m)$ and $\mathbf{U}_y\in O(n)$ denote the matrices whose column vectors are respectively the left- and right-singular vectors of the correlation matrix. The $i$th singular value $\rho_i$ is referred to as the $i$th canonical correlation, and the $i$th column vectors of $\mathbf{C}_{xx}^{-1/2}\mathbf{U}_x$ and $\mathbf{C}_{yy}^{-1/2}\mathbf{U}_y$ are jointly referred to as the $i$th pair of canonical correlation basis vectors, for $i=1,\dots,\min(m,n)$. The maximal value of the trace in equation 2.1 is the normalized sum of canonical correlations: $(\rho_1+\cdots+\rho_k)/2$. For simplicity, we assume $\rho_k>\rho_{k+1}$ so the subspace spanned by the first $k$ canonical correlation basis vectors is unique. In this case, every solution of the CCA objective, equations 2.1 and 2.2, denoted $(\bar{\mathbf{V}}_x,\bar{\mathbf{V}}_y)$, is of the form
$\bar{\mathbf{V}}_x=\tfrac{1}{\sqrt{2}}\mathbf{C}_{xx}^{-1/2}\mathbf{U}_x^{(k)}\mathbf{Q},\qquad \bar{\mathbf{V}}_y=\tfrac{1}{\sqrt{2}}\mathbf{C}_{yy}^{-1/2}\mathbf{U}_y^{(k)}\mathbf{Q},$
(2.5)
where $\mathbf{U}_x^{(k)}$ (resp. $\mathbf{U}_y^{(k)}$) is the $m\times k$ (resp. $n\times k$) matrix whose $i$th column vector is equal to the $i$th column vector of $\mathbf{U}_x$ (resp. $\mathbf{U}_y$) for $i=1,\dots,k$, $\mathbf{Q}\in O(k)$ is any orthogonal matrix, and the factor $\tfrac{1}{\sqrt{2}}$ ensures the whitening constraint 2.2 is satisfied. Since the column vectors of any solution $(\bar{\mathbf{V}}_x,\bar{\mathbf{V}}_y)$ span the same subspaces as the first $k$ pairs of canonical correlation basis vectors, we refer to the column vectors of $\bar{\mathbf{V}}_x$ and $\bar{\mathbf{V}}_y$ as basis vectors. We refer to the $k$-dimensional projections $\bar{\mathbf{V}}_x^\top\mathbf{x}_t$ and $\bar{\mathbf{V}}_y^\top\mathbf{y}_t$ as canonical correlation subspace projections (CCSPs).
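The closed-form solution above can be checked numerically. The following sketch (with synthetic views driven by a shared latent source; the data-generating setup and all variable names are illustrative, not from the paper) computes the canonical correlations from the SVD of the correlation matrix and verifies the whitening constraint 2.2 and the optimal value of objective 2.1:

```python
import numpy as np

rng = np.random.default_rng(0)
T, m, n, k = 2000, 5, 4, 2

# Synthetic centered views driven by a shared 2-dimensional latent source.
s = rng.standard_normal((2, T))
X = rng.standard_normal((m, 2)) @ s + 0.5 * rng.standard_normal((m, T))
Y = rng.standard_normal((n, 2)) @ s + 0.5 * rng.standard_normal((n, T))
X -= X.mean(axis=1, keepdims=True)
Y -= Y.mean(axis=1, keepdims=True)

# Sample covariances (equation 2.3).
Cxx, Cxy, Cyy = X @ X.T / T, X @ Y.T / T, Y @ Y.T / T

def inv_sqrt(C):
    """Inverse matrix square root of a symmetric positive definite matrix."""
    w, U = np.linalg.eigh(C)
    return U @ np.diag(w ** -0.5) @ U.T

# Correlation matrix (equation 2.4) and its SVD.
Rxy = inv_sqrt(Cxx) @ Cxy @ inv_sqrt(Cyy)
Ux, rho, VhT = np.linalg.svd(Rxy)

# Basis vectors (equation 2.5 with Q = I_k); the 1/sqrt(2) enforces constraint 2.2.
Vx = inv_sqrt(Cxx) @ Ux[:, :k] / np.sqrt(2)
Vy = inv_sqrt(Cyy) @ VhT.T[:, :k] / np.sqrt(2)

# Constraint 2.2 holds, and the trace 2.1 attains (rho_1 + ... + rho_k) / 2.
assert np.allclose(Vx.T @ Cxx @ Vx + Vy.T @ Cyy @ Vy, np.eye(k))
assert np.isclose(np.trace(Vx.T @ Cxy @ Vy), rho[:k].sum() / 2)
```
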
The focus of this work is to derive a single-layer, biologically plausible network whose input at each time $t$ is the pair $(\mathbf{x}_t,\mathbf{y}_t)$ and the output is the following sum of the CCSPs:
$\mathbf{z}_t:=\bar{\mathbf{V}}_x^\top\mathbf{x}_t+\bar{\mathbf{V}}_y^\top\mathbf{y}_t,$
(2.6)
which, as mentioned in section 1, is a highly relevant statistic (see also section 6.1). This is in contrast to many existing CCA networks that output one or both CCSPs (Lai & Fyfe, 1999; Pezeshki et al., 2003; Gou & Fyfe, 2004; Vía et al., 2007; Golkar, Lipshutz, Bahroun, Sengupta, & Chklovskii, 2020). The components of the two input vectors $\mathbf{x}_t$ and $\mathbf{y}_t$ are represented by the activity of upstream neurons belonging to two different populations, which are integrated in separate compartments of the principal neurons in our network. The components of the output vector $\mathbf{z}_t$ are represented by the activity of the principal neurons in our network (see Figure 1).

While CCA is typically viewed as an unsupervised learning method, it can also be interpreted as a special case of the supervised learning method reduced-rank regression, in which case one input is the feature vector and the other input is the label (see, e.g., Velu & Reinsel, 2013). With this supervised learning view of CCA, the natural output of a CCA network is the CCSP of the feature vector. In separate work (Golkar et al., 2020), we derive an algorithm for the general reduced-rank regression problem, which includes CCA as a special case, for outputting the projection of the feature vector. The algorithm derived in Golkar et al. (2020) resembles the adaptive CCA with output whitening algorithm that we derive in section 4 of this work (see algorithm 3 as well as appendix B.2 for a detailed comparison of the two algorithms); however, there are significant advantages to the algorithm derived here. First, our network outputs the (whitened) sum of the CCSPs, which is a relevant statistic in applications. The algorithm in Golkar et al. (2020) only outputs the CCSP of the feature vector, which is natural when viewing CCA as a supervised learning method, but not when viewing CCA as an unsupervised learning method for integrating multiview inputs. Second, in contrast to the algorithm derived in Golkar et al. (2020), our adaptive CCA with output whitening algorithm allows for adaptive output rank. This is particularly important for analyzing nonstationary input streams, a challenge that brains regularly face.

To derive a network that computes the sums of CCSPs for arbitrary input data sets, we adopt a normative approach in which we identify an appropriate cost function whose optimization leads to an online algorithm that can be implemented by a network with local learning rules. Previously, such an approach was taken to derive a biologically plausible PCA network from a similarity matching objective function (Pehlevan, Hu, & Chklovskii, 2015). We leverage this work by reformulating a CCA problem in terms of PCA of a modified data set and then solving it using similarity matching.

### 3.1  A Similarity Matching Objective

First, we note that the sums of CCSPs $\mathbf{z}_1,\dots,\mathbf{z}_T$ are equal to the principal subspace projections of the data $\boldsymbol{\xi}_1,\dots,\boldsymbol{\xi}_T$, where $\boldsymbol{\xi}_t$ is the following $d$-dimensional vector of concatenated whitened inputs (recall $d:=m+n$):
$\boldsymbol{\xi}_t:=\begin{pmatrix}\mathbf{C}_{xx}^{-1/2}\mathbf{x}_t\\ \mathbf{C}_{yy}^{-1/2}\mathbf{y}_t\end{pmatrix}.$
(3.1)
(See appendix A for a detailed justification.) Next, we use the fact that the principal subspace projections can be expressed in terms of solutions of similarity matching objectives. To this end, we define the matrices $\boldsymbol{\Xi}:=[\boldsymbol{\xi}_1,\dots,\boldsymbol{\xi}_T]\in\mathbb{R}^{d\times T}$ and $\mathbf{Z}:=[\mathbf{z}_1,\dots,\mathbf{z}_T]\in\mathbb{R}^{k\times T}$, so that $\mathbf{Z}$ is a linear projection of $\boldsymbol{\Xi}$ onto its $k$-dimensional principal subspace. As shown in Cox and Cox (2000) and Williams (2001), the principal subspace projection $\mathbf{Z}$ is a solution of the following similarity matching objective:
$\underset{\mathbf{Z}\in\mathbb{R}^{k\times T}}{\arg\min}\ \frac{1}{2T^2}\left\|\mathbf{Z}^\top\mathbf{Z}-\boldsymbol{\Xi}^\top\boldsymbol{\Xi}\right\|_F^2.$
(3.2)
The objective, equation 3.2, which comes from classical multidimensional scaling (Cox & Cox, 2000), minimizes the difference between the similarity of output pairs, $\mathbf{z}_t^\top\mathbf{z}_{t'}$, and the similarity of input pairs, $\boldsymbol{\xi}_t^\top\boldsymbol{\xi}_{t'}$, where similarity is measured in terms of inner products. Finally, defining $\mathbf{X}:=[\mathbf{x}_1,\dots,\mathbf{x}_T]\in\mathbb{R}^{m\times T}$ and $\mathbf{Y}:=[\mathbf{y}_1,\dots,\mathbf{y}_T]\in\mathbb{R}^{n\times T}$, we use the definition of $\boldsymbol{\xi}_t$ in equation 3.1 to rewrite the objective, equation 3.2, as follows:
$\underset{\mathbf{Z}\in\mathbb{R}^{k\times T}}{\arg\min}\ \frac{1}{2T^2}\left\|\mathbf{Z}^\top\mathbf{Z}-\mathbf{X}^\top\mathbf{C}_{xx}^{-1}\mathbf{X}-\mathbf{Y}^\top\mathbf{C}_{yy}^{-1}\mathbf{Y}\right\|_F^2.$
(3.3)
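As a quick numerical sanity check (illustrative code, not from the paper), the principal subspace projection indeed minimizes the similarity matching cost 3.2, and the minimizer is defined only up to a left rotation:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, T = 6, 2, 500
Xi = rng.standard_normal((d, d)) @ rng.standard_normal((d, T))  # data matrix

# Project the data onto its k-dimensional principal subspace.
w, U = np.linalg.eigh(Xi @ Xi.T / T)  # eigh returns eigenvalues in ascending order
Z = U[:, ::-1][:, :k].T @ Xi          # top-k principal subspace projection

def sm_cost(Z):
    """Similarity matching cost (equation 3.2)."""
    return np.sum((Z.T @ Z - Xi.T @ Xi) ** 2) / (2 * T ** 2)

# Any rotation Q @ Z attains the same cost (solutions are unique up to O(k));
# a generic k x d projection does strictly worse.
Q, _ = np.linalg.qr(rng.standard_normal((k, k)))
assert np.isclose(sm_cost(Z), sm_cost(Q @ Z))
assert sm_cost(Z) < sm_cost(rng.standard_normal((k, d)) @ Xi)
```
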
In the remainder of this section, we derive our online CCA algorithm. Then, in sections 4 and 5, we derive an extension of our CCA algorithm and map it onto the cortical microcircuit. Readers who are primarily interested in the derivation of the extension and its relation to the cortical microcircuit can safely skip to section 4.

### 3.2  A Min-Max Objective

While the similarity matching objective, equation 3.3, can be minimized by taking gradient descent steps with respect to $\mathbf{Z}$, this would not lead to an online algorithm because such computation requires combining data from different time steps. Instead, we introduce auxiliary matrix variables that store sufficient statistics, allowing the CCA computation to use only contemporary inputs; these variables will correspond to synaptic weights in the network implementation. We then rewrite the minimization problem 3.3 as a min-max problem.

Expanding the square in equation 3.3 and dropping terms that do not depend on $\mathbf{Z}$ yields the minimization problem:
$\min_{\mathbf{Z}\in\mathbb{R}^{k\times T}}\ -\frac{1}{T^2}\operatorname{Tr}\big(\mathbf{Z}^\top\mathbf{Z}\mathbf{X}^\top\mathbf{C}_{xx}^{-1}\mathbf{X}\big)-\frac{1}{T^2}\operatorname{Tr}\big(\mathbf{Z}^\top\mathbf{Z}\mathbf{Y}^\top\mathbf{C}_{yy}^{-1}\mathbf{Y}\big)+\frac{1}{2T^2}\operatorname{Tr}\big(\mathbf{Z}^\top\mathbf{Z}\mathbf{Z}^\top\mathbf{Z}\big).$
Next, we introduce dynamic matrix variables $\mathbf{W}_x$, $\mathbf{W}_y$, and $\mathbf{M}$ in place of the matrices $\frac{1}{T}\mathbf{Z}\mathbf{X}^\top\mathbf{C}_{xx}^{-1}$, $\frac{1}{T}\mathbf{Z}\mathbf{Y}^\top\mathbf{C}_{yy}^{-1}$, and $\frac{1}{T}\mathbf{Z}\mathbf{Z}^\top$, respectively, and rewrite the minimization problem as a min-max problem,
$\min_{\mathbf{Z}\in\mathbb{R}^{k\times T}}\ \min_{\mathbf{W}_x\in\mathbb{R}^{k\times m}}\ \min_{\mathbf{W}_y\in\mathbb{R}^{k\times n}}\ \max_{\mathbf{M}\in\mathbb{S}_{++}^{k}} L(\mathbf{W}_x,\mathbf{W}_y,\mathbf{M},\mathbf{Z}),$
where
$L(\mathbf{W}_x,\mathbf{W}_y,\mathbf{M},\mathbf{Z}):=\frac{1}{T}\operatorname{Tr}\big({-2}\mathbf{Z}^\top\mathbf{W}_x\mathbf{X}-2\mathbf{Z}^\top\mathbf{W}_y\mathbf{Y}+\mathbf{Z}^\top\mathbf{M}\mathbf{Z}\big)+\operatorname{Tr}\big(\mathbf{W}_x\mathbf{C}_{xx}\mathbf{W}_x^\top+\mathbf{W}_y\mathbf{C}_{yy}\mathbf{W}_y^\top-\tfrac{1}{2}\mathbf{M}^2\big).$
(3.4)
To verify the above substitutions are valid, it suffices to optimize over the matrices $\mathbf{W}_x$, $\mathbf{W}_y$, and $\mathbf{M}$, for example, by differentiating $L(\mathbf{W}_x,\mathbf{W}_y,\mathbf{M},\mathbf{Z})$ with respect to $\mathbf{W}_x$, $\mathbf{W}_y$, or $\mathbf{M}$, setting the derivative equal to zero, and solving for $\mathbf{W}_x$, $\mathbf{W}_y$, or $\mathbf{M}$. Finally, we interchange the order of minimization with respect to $\mathbf{Z}$ and $(\mathbf{W}_x,\mathbf{W}_y)$, as well as the order of minimization with respect to $\mathbf{Z}$ and maximization with respect to $\mathbf{M}$:
$\min_{\mathbf{W}_x\in\mathbb{R}^{k\times m}}\ \min_{\mathbf{W}_y\in\mathbb{R}^{k\times n}}\ \max_{\mathbf{M}\in\mathbb{S}_{++}^{k}}\ \min_{\mathbf{Z}\in\mathbb{R}^{k\times T}} L(\mathbf{W}_x,\mathbf{W}_y,\mathbf{M},\mathbf{Z}).$
(3.5)
The second interchange is justified by the fact that $L(\mathbf{W}_x,\mathbf{W}_y,\mathbf{M},\mathbf{Z})$ satisfies the saddle point property with respect to $\mathbf{Z}$ and $\mathbf{M}$, which follows from its strict convexity in $\mathbf{Z}$ (since $\mathbf{M}$ is positive definite) and strict concavity in $\mathbf{M}$.
Given an optimal quadruple of the min-max problem, equation 3.5, we can compute the basis vectors as follows. First, minimizing the objective $L(\mathbf{W}_x,\mathbf{W}_y,\mathbf{M},\mathbf{Z})$ over $\mathbf{Z}$ yields the relation
$\bar{\mathbf{Z}}:=\underset{\mathbf{Z}\in\mathbb{R}^{k\times T}}{\arg\min}\ L(\mathbf{W}_x,\mathbf{W}_y,\mathbf{M},\mathbf{Z})=\mathbf{M}^{-1}\mathbf{W}_x\mathbf{X}+\mathbf{M}^{-1}\mathbf{W}_y\mathbf{Y}.$
(3.6)
Therefore, if $(\bar{\mathbf{W}}_x,\bar{\mathbf{W}}_y,\bar{\mathbf{M}},\bar{\mathbf{Z}})$ is an optimal quadruple of the min-max problem 3.5, it follows from equation 2.6 that the corresponding basis vectors satisfy
$\bar{\mathbf{V}}_x^\top=\bar{\mathbf{M}}^{-1}\bar{\mathbf{W}}_x\qquad\text{and}\qquad\bar{\mathbf{V}}_y^\top=\bar{\mathbf{M}}^{-1}\bar{\mathbf{W}}_y.$
(3.7)

### 3.3  An Offline CCA Algorithm

Before deriving our online CCA algorithm, we first demonstrate how the objective, equation 3.5, can be optimized in the offline setting, where one has access to the data matrices $\mathbf{X}$ and $\mathbf{Y}$ in their entirety. In this case, the algorithm solves the min-max problem 3.5 by alternating minimization and maximization steps. First, for fixed $\mathbf{W}_x$, $\mathbf{W}_y$, and $\mathbf{M}$, we minimize the objective function $L(\mathbf{W}_x,\mathbf{W}_y,\mathbf{M},\mathbf{Z})$ over $\mathbf{Z}$ to obtain the minimum $\bar{\mathbf{Z}}$ defined in equation 3.6. Then, with $\bar{\mathbf{Z}}$ fixed, we perform a gradient descent-ascent step with respect to $(\mathbf{W}_x,\mathbf{W}_y)$ and $\mathbf{M}$:
$\mathbf{W}_x\leftarrow\mathbf{W}_x-\eta\frac{\partial L(\mathbf{W}_x,\mathbf{W}_y,\mathbf{M},\bar{\mathbf{Z}})}{\partial\mathbf{W}_x},\qquad \mathbf{W}_y\leftarrow\mathbf{W}_y-\eta\frac{\partial L(\mathbf{W}_x,\mathbf{W}_y,\mathbf{M},\bar{\mathbf{Z}})}{\partial\mathbf{W}_y},\qquad \mathbf{M}\leftarrow\mathbf{M}+\frac{\eta}{\tau}\frac{\partial L(\mathbf{W}_x,\mathbf{W}_y,\mathbf{M},\bar{\mathbf{Z}})}{\partial\mathbf{M}}.$
Here $\eta>0$ is the learning rate for $\mathbf{W}_x$ and $\mathbf{W}_y$, which may depend on the iteration, and $\tau>0$ is the ratio of the learning rates for $\mathbf{W}_x$ (or $\mathbf{W}_y$) and $\mathbf{M}$. Substituting in the explicit expressions for the partial derivatives of $L(\mathbf{W}_x,\mathbf{W}_y,\mathbf{M},\bar{\mathbf{Z}})$ yields our offline CCA algorithm (algorithm 1), which we refer to as Offline-CCA.

Recall that $\mathbf{M}$ is optimized over the set of positive-definite matrices $\mathbb{S}_{++}^{k}$. To ensure that $\mathbf{M}$ remains positive definite after each update, note that the update rule for $\mathbf{M}$ can be rewritten as the following convex combination (provided $\eta\le\tau$): $\mathbf{M}\leftarrow\big(1-\frac{\eta}{\tau}\big)\mathbf{M}+\frac{\eta}{\tau}\big(\frac{1}{T}\mathbf{Z}\mathbf{Z}^\top\big)$. Since $\frac{1}{T}\mathbf{Z}\mathbf{Z}^\top$ is positive semidefinite, to guarantee that $\mathbf{M}$ remains positive definite given a positive-definite initialization, it suffices to assume that $\eta<\tau$.
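The alternating scheme above can be sketched compactly in code. This is our illustrative rendering of Offline-CCA (the synthetic data, step sizes, and iteration count are assumptions chosen for the demo, not values from the paper); each iteration computes the minimizer of $L$ over $Z$ and then takes a gradient descent-ascent step:

```python
import numpy as np

rng = np.random.default_rng(2)
T, m, n, k = 1000, 5, 4, 2

# Synthetic centered views sharing a k-dimensional latent source.
s = rng.standard_normal((k, T))
X = rng.standard_normal((m, k)) @ s / np.sqrt(k) + 0.3 * rng.standard_normal((m, T))
Y = rng.standard_normal((n, k)) @ s / np.sqrt(k) + 0.3 * rng.standard_normal((n, T))
X -= X.mean(axis=1, keepdims=True)
Y -= Y.mean(axis=1, keepdims=True)
Cxx, Cyy = X @ X.T / T, Y @ Y.T / T

Wx = rng.standard_normal((k, m)) / np.sqrt(m)
Wy = rng.standard_normal((k, n)) / np.sqrt(n)
M = np.eye(k)
eta, tau = 0.02, 0.5  # eta < tau keeps M positive definite

for _ in range(10_000):
    Z = np.linalg.solve(M, Wx @ X + Wy @ Y)   # minimizer over Z (equation 3.6)
    Wx += 2 * eta * (Z @ X.T / T - Wx @ Cxx)  # descent step in Wx
    Wy += 2 * eta * (Z @ Y.T / T - Wy @ Cyy)  # descent step in Wy
    M += (eta / tau) * (Z @ Z.T / T - M)      # ascent step in M

# At the saddle point, M equals the output covariance Z Z^T / T.
assert np.allclose(M, Z @ Z.T / T, atol=1e-2)
```

At the optimum, the variable $M=\frac{1}{T}ZZ^\top$ coincides with the output covariance, consistent with its role as the substitute for $\frac{1}{T}\mathbf{Z}\mathbf{Z}^\top$ in the derivation.
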

### 3.4  An Online CCA Algorithm

In the online setting, the input data $(\mathbf{x}_t,\mathbf{y}_t)$ are streamed one at a time, and the algorithm must compute its output $\mathbf{z}_t$ without accessing any significant fraction of $\mathbf{X}$ and $\mathbf{Y}$. To derive an online algorithm, it is useful to write the cost function as an average over time-separable terms:
$L(\mathbf{W}_x,\mathbf{W}_y,\mathbf{M},\mathbf{Z})=\frac{1}{T}\sum_{t=1}^{T}\ell_t(\mathbf{W}_x,\mathbf{W}_y,\mathbf{M},\mathbf{z}_t),$
where
$\ell_t(\mathbf{W}_x,\mathbf{W}_y,\mathbf{M},\mathbf{z}_t):=-2\mathbf{z}_t^\top\mathbf{W}_x\mathbf{x}_t-2\mathbf{z}_t^\top\mathbf{W}_y\mathbf{y}_t+\mathbf{z}_t^\top\mathbf{M}\mathbf{z}_t+\operatorname{Tr}\big(\mathbf{W}_x\mathbf{x}_t\mathbf{x}_t^\top\mathbf{W}_x^\top+\mathbf{W}_y\mathbf{y}_t\mathbf{y}_t^\top\mathbf{W}_y^\top-\tfrac{1}{2}\mathbf{M}^2\big).$
(3.8)
At iteration $t$, to compute the output $\mathbf{z}_t$, we minimize the cost function $\ell_t(\mathbf{W}_x,\mathbf{W}_y,\mathbf{M},\mathbf{z}_t)$ with respect to $\mathbf{z}_t$ by running the following gradient descent dynamics to equilibrium:
$\frac{d\mathbf{z}_t(\gamma)}{d\gamma}=\mathbf{a}_t+\mathbf{b}_t-\mathbf{M}\mathbf{z}_t(\gamma),$
(3.9)
where we have defined the following $k$-dimensional projections of the inputs: $\mathbf{a}_t:=\mathbf{W}_x\mathbf{x}_t$ and $\mathbf{b}_t:=\mathbf{W}_y\mathbf{y}_t$. These dynamics, which will correspond to recurrent neural dynamics in our network implementation, are assumed to occur on a fast timescale, allowing $\mathbf{z}_t(\gamma)$ to equilibrate at $\bar{\mathbf{z}}_t:=\mathbf{M}^{-1}(\mathbf{a}_t+\mathbf{b}_t)$ before the algorithm outputs its value. After $\mathbf{z}_t(\gamma)$ equilibrates, we update the matrices $(\mathbf{W}_x,\mathbf{W}_y,\mathbf{M})$ by taking a stochastic gradient descent-ascent step of the cost function $\ell_t(\mathbf{W}_x,\mathbf{W}_y,\mathbf{M},\bar{\mathbf{z}}_t)$ with respect to $(\mathbf{W}_x,\mathbf{W}_y)$ and $\mathbf{M}$:
$\mathbf{W}_x\leftarrow\mathbf{W}_x-\eta\frac{\partial\ell_t(\mathbf{W}_x,\mathbf{W}_y,\mathbf{M},\bar{\mathbf{z}}_t)}{\partial\mathbf{W}_x},\qquad \mathbf{W}_y\leftarrow\mathbf{W}_y-\eta\frac{\partial\ell_t(\mathbf{W}_x,\mathbf{W}_y,\mathbf{M},\bar{\mathbf{z}}_t)}{\partial\mathbf{W}_y},\qquad \mathbf{M}\leftarrow\mathbf{M}+\frac{\eta}{\tau}\frac{\partial\ell_t(\mathbf{W}_x,\mathbf{W}_y,\mathbf{M},\bar{\mathbf{z}}_t)}{\partial\mathbf{M}}.$
Substituting in the explicit expressions for the partial derivatives of $lt(Wx,Wy,M,z¯t)$ yields our online CCA algorithm (algorithm 2), which we refer to as Bio-CCA.

Algorithm 2 can be implemented in a biologically plausible single-layer network with $k$ neurons that each consist of three separate compartments (see Figure 1). At each time step, the inputs $\mathbf{x}_t$ and $\mathbf{y}_t$ are multiplied by the respective feedforward synapses $\mathbf{W}_x$ and $\mathbf{W}_y$ to yield the $k$-dimensional vectors $\mathbf{a}_t$ and $\mathbf{b}_t$, which are represented in the first two compartments of the $k$ neurons. Lateral synapses, $-\mathbf{M}$, connect the $k$ neurons. The vector of neuronal outputs, $\mathbf{z}_t$, equals the normalized sum of the CCSPs and is computed locally using the recurrent dynamics in equation 3.9. The synaptic updates can be written elementwise as follows:
$W_{x,ij}\leftarrow W_{x,ij}+\eta(z_{t,i}-a_{t,i})x_{t,j},\qquad 1\le i\le k,\ 1\le j\le m,$
$W_{y,ij}\leftarrow W_{y,ij}+\eta(z_{t,i}-b_{t,i})y_{t,j},\qquad 1\le i\le k,\ 1\le j\le n,$
$M_{ij}\leftarrow M_{ij}+\frac{\eta}{\tau}(z_{t,i}z_{t,j}-M_{ij}),\qquad 1\le i,j\le k.$
As shown above, the update to synapse $W_{x,ij}$ (resp. $W_{y,ij}$), which connects the $j$th input $x_{t,j}$ (resp. $y_{t,j}$) to the $i$th output neuron, depends only on the quantities $z_{t,i}$, $a_{t,i}$ (resp. $b_{t,i}$), and $x_{t,j}$ (resp. $y_{t,j}$), which are represented in the pre- and postsynaptic neurons, so the updates are local but non-Hebbian due to the contribution from the $a_{t,i}$ (resp. $b_{t,i}$) term. Similarly, the update to synapse $-M_{ij}$, which connects the $j$th output neuron to the $i$th output neuron, depends on $-z_{t,i}z_{t,j}$, the negative of the product of the outputs of the pre- and postsynaptic neurons, so the updates are local and anti-Hebbian.
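The elementwise rules above amount to a single online step per input pair. The following sketch is our illustrative rendering of one Bio-CCA iteration (the example input stream, step sizes, and dimensions are assumptions for the demo); it makes the locality explicit, since every update uses only quantities available at the pre- and postsynaptic neurons:

```python
import numpy as np

def bio_cca_step(x, y, Wx, Wy, M, eta, tau):
    """One Bio-CCA iteration (algorithm 2), updating the weight arrays in place."""
    a, b = Wx @ x, Wy @ y                    # apical and basal projections
    z = np.linalg.solve(M, a + b)            # equilibrium of the fast dynamics (3.9)
    Wx += eta * np.outer(z - a, x)           # local, non-Hebbian update
    Wy += eta * np.outer(z - b, y)           # local, non-Hebbian update
    M += (eta / tau) * (np.outer(z, z) - M)  # local, anti-Hebbian lateral update
    return z

# Example stream with k = 2 outputs, m = 3 and n = 2 inputs.
rng = np.random.default_rng(5)
Wx = 0.1 * rng.standard_normal((2, 3))
Wy = 0.1 * rng.standard_normal((2, 2))
M = np.eye(2)
for _ in range(1000):
    s = rng.standard_normal()                # shared scalar latent source
    x = s * np.array([1.0, -1.0, 0.5]) + 0.1 * rng.standard_normal(3)
    y = s * np.array([0.5, 1.0]) + 0.1 * rng.standard_normal(2)
    z = bio_cca_step(x, y, Wx, Wy, M, eta=0.01, tau=0.5)

# With eta < tau, each update is a convex combination, so M stays positive definite.
assert np.all(np.linalg.eigvalsh(M) > 0)
```
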

We now introduce an extension of Bio-CCA that addresses two biologically relevant issues. First, Bio-CCA a priori sets the output rank at $k$; however, it may be advantageous for a neural circuit to instead adaptively set the output rank depending on the level of correlation captured. In particular, this can be achieved by projecting each view onto the subspace spanned by the canonical correlation basis vectors, which correspond to canonical correlations that exceed a threshold. Second, it is useful from an information-theoretic perspective for neural circuits to whiten their outputs (Plumbley, 1993), and there is experimental evidence that neural outputs in the cortex are decorrelated (Ecker et al., 2010; Miura, Mainen, & Uchida, 2012). Both adaptive output rank and output whitening modifications were implemented for a PCA network by Pehlevan and Chklovskii (2015) and can be adapted to the CCA setting. Here we present the modifications without providing detailed proofs, which can be found in the supplement of Pehlevan and Chklovskii (2015).

In order to implement these extensions, we need to appropriately modify the similarity matching objective function, equation 3.3. First, to adaptively choose the output rank, we add a quadratic penalty $\operatorname{Tr}(\mathbf{Z}^\top\mathbf{Z})$ to the objective function 3.3:
$\underset{\mathbf{Z}\in\mathbb{R}^{k\times T}}{\arg\min}\ \frac{1}{2T^2}\left\|\mathbf{Z}^\top\mathbf{Z}-\mathbf{X}^\top\mathbf{C}_{xx}^{-1}\mathbf{X}-\mathbf{Y}^\top\mathbf{C}_{yy}^{-1}\mathbf{Y}\right\|_F^2+\frac{\alpha}{T}\operatorname{Tr}\big(\mathbf{Z}^\top\mathbf{Z}\big).$
(4.1)
The effect of the quadratic penalty is to rank-constrain the output, with $\alpha\ge0$ acting as a threshold parameter on the eigenvalues of the output covariance.
Next, to whiten the output, we expand the square in equation 4.1 and replace the quartic term $\operatorname{Tr}(\mathbf{Z}^\top\mathbf{Z}\mathbf{Z}^\top\mathbf{Z})$ by a Lagrange constraint enforcing $\mathbf{Z}^\top\mathbf{Z}\preceq T\mathbf{I}_T$ (i.e., $T\mathbf{I}_T-\mathbf{Z}^\top\mathbf{Z}$ is positive semidefinite):
$\underset{\mathbf{Z}\in\mathbb{R}^{k\times T}}{\arg\min}\ \max_{\mathbf{N}\in\mathbb{R}^{k\times T}}\ \frac{1}{T^2}\operatorname{Tr}\big({-\mathbf{Z}^\top\mathbf{Z}\mathbf{X}^\top\mathbf{C}_{xx}^{-1}\mathbf{X}}-\mathbf{Z}^\top\mathbf{Z}\mathbf{Y}^\top\mathbf{C}_{yy}^{-1}\mathbf{Y}+\alpha T\,\mathbf{Z}^\top\mathbf{Z}\big)+\frac{1}{T^2}\operatorname{Tr}\big[\mathbf{N}^\top\mathbf{N}\big(\mathbf{Z}^\top\mathbf{Z}-T\mathbf{I}_T\big)\big].$
(4.2)
The effect of the Lagrange constraint in equation 4.2 is to enforce that all nonzero eigenvalues of the output covariance are set to one.
Solutions of the objective, equation 4.2, can be expressed in terms of the eigendecomposition of the Gram matrix $\boldsymbol{\Xi}^\top\boldsymbol{\Xi}=T\mathbf{U}_\xi\boldsymbol{\Lambda}_\xi\mathbf{U}_\xi^\top$, where $\mathbf{U}_\xi\in O(T)$ is a matrix of eigenvectors and $\boldsymbol{\Lambda}_\xi=\operatorname{diag}(\lambda_1,\dots,\lambda_d,0,\dots,0)$ is the $T\times T$ diagonal matrix whose nonzero entries $\lambda_1\ge\cdots\ge\lambda_d>0$ are the eigenvalues of the $d\times d$ covariance matrix:
$\mathbf{C}_{\xi\xi}:=\frac{1}{T}\sum_{t=1}^{T}\boldsymbol{\xi}_t\boldsymbol{\xi}_t^\top=\begin{pmatrix}\mathbf{I}_m&\mathbf{R}_{xy}\\ \mathbf{R}_{xy}^\top&\mathbf{I}_n\end{pmatrix}.$
(4.3)
Assume, for technical purposes, that $\alpha\notin\{\lambda_1,\dots,\lambda_d\}$. Then, as shown in Pehlevan et al. (2015, theorem 3), every solution $\hat{\mathbf{Z}}$ of objective 4.2 is of the form
$\hat{\mathbf{Z}}=\mathbf{Q}\sqrt{T\hat{\boldsymbol{\Lambda}}_\xi^{(k)}}\,\mathbf{U}_\xi^{(k)\top},\qquad \hat{\boldsymbol{\Lambda}}_\xi^{(k)}=\operatorname{diag}\big(H(\lambda_1-\alpha),\dots,H(\lambda_k-\alpha)\big),$
where $\mathbf{Q}\in O(k)$ is any orthogonal matrix, $\mathbf{U}_\xi^{(k)}\in\mathbb{R}^{T\times k}$ is the $T\times k$ matrix whose $i$th column vector is equal to the $i$th column vector of $\mathbf{U}_\xi$, for $i=1,\dots,k$, and $H$ is the Heaviside step function defined by $H(r)=1$ if $r>0$ and $H(r)=0$ otherwise. Finally, we note that in view of equation 4.3 and the SVD of $\mathbf{R}_{xy}$, the top $\min(m,n)$ eigenvalues of $\mathbf{C}_{\xi\xi}$ satisfy
$\lambda_i=1+\rho_i,\qquad i=1,\dots,\min(m,n),$
where we recall that $\rho_1,\dots,\rho_{\min(m,n)}$ are the canonical correlations. Thus, $H(\lambda_i-\alpha)=H(\rho_i-(\alpha-1))$, for $i=1,\dots,k$. In other words, the objective 4.2 outputs the sum of the projections of the inputs $\mathbf{x}_t$ and $\mathbf{y}_t$ onto the canonical correlation subspace spanned by the (at most $k$) pairs of canonical correlation basis vectors associated with canonical correlations exceeding the threshold $\max(\alpha-1,0)$, and sets the nonzero output covariance eigenvalues to one, thus implementing both the adaptive output rank and output whitening modifications.
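The relation $\lambda_i = 1+\rho_i$ follows directly from the block structure of $C_{\xi\xi}$ and is easy to verify numerically (the canonical correlations below are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 4, 3
rho = np.array([0.9, 0.5, 0.2])  # prescribed canonical correlations (illustrative)

# Build a correlation matrix Rxy with these singular values.
U, _ = np.linalg.qr(rng.standard_normal((m, m)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
Rxy = U[:, :n] @ np.diag(rho) @ V.T

# Covariance of the concatenated whitened views (equation 4.3).
Cxixi = np.block([[np.eye(m), Rxy], [Rxy.T, np.eye(n)]])
eigs = np.sort(np.linalg.eigvalsh(Cxixi))[::-1]

# Top min(m, n) eigenvalues are 1 + rho_i; the smallest are 1 - rho_i,
# with any remaining eigenvalues equal to 1.
assert np.allclose(eigs[:n], 1 + rho)
assert np.allclose(eigs[-n:], 1 - rho[::-1])
```
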
With the modified objective, equation 4.2, in hand, the next step is to derive an online algorithm. Similar to section 3.2, we introduce dynamic matrix variables $\mathbf{W}_x$, $\mathbf{W}_y$, and $\mathbf{P}$ in place of $\frac{1}{T}\mathbf{Z}\mathbf{X}^\top\mathbf{C}_{xx}^{-1}$, $\frac{1}{T}\mathbf{Z}\mathbf{Y}^\top\mathbf{C}_{yy}^{-1}$, and $\frac{1}{T}\mathbf{Z}\mathbf{N}^\top$ to rewrite the objective 4.2 as follows:
$\underset{\mathbf{Z}\in\mathbb{R}^{k\times T}}{\arg\min}\ \max_{\mathbf{N}\in\mathbb{R}^{k\times T}}\ \min_{\mathbf{W}_x\in\mathbb{R}^{k\times m}}\ \min_{\mathbf{W}_y\in\mathbb{R}^{k\times n}}\ \max_{\mathbf{P}\in\mathbb{R}^{k\times k}}\tilde{L}(\mathbf{W}_x,\mathbf{W}_y,\mathbf{P},\mathbf{Z},\mathbf{N}),$
(4.4)
where
$\tilde{L}(\mathbf{W}_x,\mathbf{W}_y,\mathbf{P},\mathbf{Z},\mathbf{N}):=\frac{1}{T}\operatorname{Tr}\big({-2}\mathbf{Z}^\top\mathbf{W}_x\mathbf{X}-2\mathbf{Z}^\top\mathbf{W}_y\mathbf{Y}+\alpha\mathbf{Z}^\top\mathbf{Z}\big)+\frac{1}{T}\operatorname{Tr}\big(2\mathbf{N}^\top\mathbf{P}^\top\mathbf{Z}-\mathbf{N}^\top\mathbf{N}\big)+\operatorname{Tr}\big(\mathbf{W}_x\mathbf{C}_{xx}\mathbf{W}_x^\top+\mathbf{W}_y\mathbf{C}_{yy}\mathbf{W}_y^\top-\mathbf{P}\mathbf{P}^\top\big).$
After interchanging the order of optimization, we solve the min-max optimization problem by taking online gradient descent-ascent steps, with descent step size $\eta$ and ascent step size $\eta/\tau$. Since the remaining steps are similar to those taken in section 3 to derive Bio-CCA, we defer the details to section B.1 and simply state the online algorithm (algorithm 3), which we refer to as Adaptive Bio-CCA with output whitening.

We now show that Adaptive Bio-CCA with output whitening (algorithm 3) maps onto a neural network with local, non-Hebbian synaptic update rules that emulate salient aspects of synaptic plasticity found experimentally in cortical microcircuits (in both the neocortex and the hippocampus).

Cortical microcircuits contain two classes of neurons: excitatory pyramidal neurons and inhibitory interneurons. Pyramidal neurons receive excitatory synaptic inputs from two distinct sources via their apical and basal dendrites. The apical dendrites are all oriented in a single direction, and the basal dendrites branch from the cell body in the opposite direction (Takahashi & Magee, 2009; Larkum, 2013); see Figure 3. The excitatory synaptic currents in the apical and basal dendrites are first integrated separately in their respective compartments (Takahashi & Magee, 2009; Larkum, 2013). If the integrated excitatory current in the apical compartment exceeds the corresponding inhibitory input (the source of which is explained below) it produces a calcium plateau potential that propagates through the basal dendrites, driving plasticity (Takahashi & Magee, 2009; Larkum, 2013; Bittner et al., 2015). When the apical calcium plateau potential and basal dendritic current coincidentally arrive in the soma, they generate a burst in spiking output (Larkum, Zhu, & Sakmann, 1999; Larkum, 2013; Bittner et al., 2015). Inhibitory interneurons integrate pyramidal outputs and reciprocally inhibit the apical dendrites of pyramidal neurons, thus closing the loop.

We propose that a network of $k$ pyramidal neurons implements CCA on the inputs received by the apical and basal dendrites and outputs the whitened sum of CCSPs (algorithm 3). In our model, each pyramidal neuron has three compartments: two for the apical and basal dendritic currents and one for the somatic output. The two data sets $X$ and $Y$ are represented as activity vectors $x_t$ and $y_t$ streamed onto the apical and basal dendrites, respectively (see Figure 3). At each time step, the activity vectors are multiplied by the corresponding synaptic weights to yield localized apical and basal dendritic currents, $a_t = W_x x_t$ and $b_t = W_y y_t$, thus implementing the projection onto the common subspace. These currents then drive the linear recurrent neural dynamics:
$\frac{dz_t(\gamma)}{d\gamma} = a_t + b_t - P n_t(\gamma) - \alpha z_t(\gamma),$
(5.1)
$\frac{dn_t(\gamma)}{d\gamma} = P^\top z_t(\gamma) - n_t(\gamma),$
(5.2)
where the components of $z_t$ are represented by the spiking activity of the pyramidal neurons, the components of $n_t$ are represented by the activity of the inhibitory interneurons, the components of $P$ are represented by the synaptic weights from the interneurons to the pyramidal neurons, the components of $P^\top$ are represented by the synaptic weights from the pyramidal neurons to the interneurons, and $\alpha$ is the threshold parameter of the adaptive algorithm. These dynamics equilibrate at $n_t = P^\top z_t$ and
$z_t = \left(P P^\top + \alpha I_k\right)^{-1}\left(a_t + b_t\right).$
(5.3)
Provided $\alpha > 0$, we can rearrange equation 5.3 to write the output as
$z_t = \alpha^{-1}\left(b_t + c_t^a\right),$
where the components of $c_t^a := a_t - P n_t$ are represented by the apical calcium plateau potentials within each pyramidal neuron. In other words, the output is proportional to the sum of the basal dendritic current and the apical calcium plateau potential, which is consistent with experimental evidence showing that the output depends on both the basal inputs and the apical calcium plateau potential (Bittner et al., 2015; Bittner, Milstein, Grienberger, Romani, & Magee, 2017; Magee & Grienberger, 2020).
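To make the fast neural dynamics concrete, the sketch below Euler-integrates equations 5.1 and 5.2 and compares the equilibrium against the closed form of equation 5.3. The function name, step size, and dimensions are illustrative choices of ours, not part of the article.

```python
import numpy as np

def equilibrate(a, b, P, alpha, dgamma=0.05, n_steps=2000):
    """Euler-integrate the recurrent dynamics of eqs. 5.1-5.2."""
    k = P.shape[0]
    z = np.zeros(k)  # pyramidal neuron activity
    n = np.zeros(k)  # interneuron activity
    for _ in range(n_steps):
        z = z + dgamma * (a + b - P @ n - alpha * z)
        n = n + dgamma * (P.T @ z - n)
    return z, n

rng = np.random.default_rng(0)
k = 4
P = rng.normal(size=(k, k)) / np.sqrt(k)
a, b = rng.normal(size=k), rng.normal(size=k)
alpha = 1.5

z_eq, n_eq = equilibrate(a, b, P, alpha)
# Closed-form equilibrium of eq. 5.3 for comparison.
z_closed = np.linalg.solve(P @ P.T + alpha * np.eye(k), a + b)
```

Because the dynamics are linear with a stable Jacobian, the simulated trajectory settles onto the same fixed point as the closed form.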
Next, we compare the synaptic update rules with experimental evidence. Rearranging equation 5.3 and substituting into the synaptic update rules in algorithm 3, we can rewrite the synaptic updates as follows:
$W_x \leftarrow W_x + \eta\left(c_t^b + (1-\alpha)z_t\right)x_t^\top, \quad W_y \leftarrow W_y + \eta\left(c_t^a + (1-\alpha)z_t\right)y_t^\top, \quad P \leftarrow P + \eta\tau\left(z_t n_t^\top - P\right),$
where the components of $c_t^b := b_t - P n_t$ are represented by basal calcium plateau potentials within each pyramidal neuron. The learning signal for the apical (resp. basal) synaptic updates of this circuit is the correlation between the basal (resp. apical) calcium plateau potential plus the scaled spiking activity of the pyramidal neurons, $c_t^b + (1-\alpha)z_t$ [resp. $c_t^a + (1-\alpha)z_t$], and the synaptic inputs to the apical (resp. basal) dendrites, $x_t$ (resp. $y_t$). When $\alpha = 1$, the spiking (action potentials) of the postsynaptic neuron is not required for synaptic plasticity, whereas when $\alpha \ne 1$, the spiking of the postsynaptic neuron affects synaptic plasticity along with the calcium plateau. Therefore, our model can account for a range of experimental observations demonstrating that spiking in the postsynaptic neuron contributes to plasticity in some contexts (Golding, Staff, & Spruston, 2002; Gambino et al., 2014; Bittner et al., 2017; Magee & Grienberger, 2020) but does not appear to affect plasticity in others (Tigaret, Olivo, Sadowski, Ashby, & Mellor, 2016; Sjöström & Häusser, 2006). Learning rules that depend on an internally represented signal such as the calcium plateau potential, rather than solely on the correlation between presynaptic and postsynaptic spiking, are called non-Hebbian. Because each synapse has access to both its presynaptic activity and the calcium plateau potential, the learning rule remains local.
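Putting the dendritic projections, the equilibrium output, and the plateau-gated updates together, a single step of the resulting online algorithm can be sketched as follows, using the closed-form equilibrium of equation 5.3 in place of running the neural dynamics. Variable names and the parameter values in the usage example are ours.

```python
import numpy as np

def adaptive_bio_cca_step(x, y, Wx, Wy, P, alpha, eta=1e-3, tau=0.5):
    """One online step of Adaptive Bio-CCA with output whitening (sketch)."""
    a = Wx @ x                                               # apical current
    b = Wy @ y                                               # basal current
    k = P.shape[0]
    z = np.linalg.solve(P @ P.T + alpha * np.eye(k), a + b)  # eq. 5.3
    n = P.T @ z                                              # interneurons
    c_a = a - P @ n   # apical calcium plateau potential
    c_b = b - P @ n   # basal calcium plateau potential
    Wx = Wx + eta * np.outer(c_b + (1 - alpha) * z, x)
    Wy = Wy + eta * np.outer(c_a + (1 - alpha) * z, y)
    P = P + eta * tau * (np.outer(z, n) - P)
    return Wx, Wy, P, z

rng = np.random.default_rng(0)
m, n_dim, k = 50, 30, 10
Wx = rng.normal(size=(k, m)) / np.sqrt(m)
Wy = rng.normal(size=(k, n_dim)) / np.sqrt(n_dim)
P = np.eye(k)
Wx, Wy, P, z = adaptive_bio_cca_step(rng.normal(size=m),
                                     rng.normal(size=n_dim),
                                     Wx, Wy, P, alpha=1.5)
```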

Note that the update rule for the synapses in the apical dendrites, $W_x$, depends on the basal calcium plateau potentials $c_t^b$. Experimental evidence has focused on apical calcium plateau potentials, and it is not clear whether differences between basal inputs and inhibitory signals generate calcium signals that drive plasticity in the apical dendrites. Alternatively, the learning rule for $W_x$ coincides with the learning rule for the apical dendrites in Golkar et al. (2020), where a biological implementation in terms of local depolarization and backpropagating spikes was proposed. Given the inconclusive evidence on plasticity in the apical tuft, we find it useful to put forth both interpretations.

Multicompartmental models of pyramidal neurons have been invoked previously in the context of biological implementation of the backpropagation algorithm (Körding & König, 2001; Urbanczik & Senn, 2014; Guerguiev et al., 2017; Haga & Fukai, 2018; Sacramento et al., 2018; Richards & Lillicrap, 2019). Under this interpretation, the apical compartment represents the target output, the basal compartment represents the algorithm prediction, and calcium plateau potentials communicate the error from the apical to the basal compartment, which is used for synaptic weight updates. The difference between these models and ours is that we use a normative approach to derive not only the learning rules but also the neural dynamics of the CCA algorithm, ensuring that the output of the network is known for any input. On the other hand, the linearity of neural dynamics in our network means that stacking our networks will not lead to any nontrivial results expected of a deep learning architecture. We leave introducing nonlinearities into neural dynamics and stacking our network to future work.

We conclude this section with comments on the interneuron-to-pyramidal neuron synaptic weight matrix $P$ and pyramidal neuron-to-interneuron synaptic weight matrix $P⊤$, as well as the computational role of the interneurons in this network. First, the algorithm appears to require a weight-sharing mechanism between the two sets of synapses to ensure the symmetry between the weight matrices, which is biologically unrealistic and commonly referred to as the weight transport problem. However, even without any initial symmetry between these feedforward and feedback synaptic weights, because of the symmetry of the local learning rules, the difference between the two will decay exponentially without requiring weight transport (see section B.3). Second, in equations 5.1 and 5.2, the interneuron-to-pyramidal neuron synaptic weight matrix $P$ is preceded by a negative sign, and the pyramidal neuron-to-interneuron synaptic weight matrix $P⊤$ is preceded by a positive sign, which is consistent with the fact that in simplified cortical microcircuits, interneuron-to-pyramidal neuron synapses are inhibitory, whereas the pyramidal neuron-to-interneuron synapses are excitatory. That being said, this interpretation is superficial because the weight matrices are not constrained to be nonnegative, which is due to the fact that we are implementing a linear statistical method. Imposing nonnegativity constraints on the weights $P$ and $P⊤$ may be useful for implementing nonlinear statistical methods; however, this requires further investigation. Finally, the activities of the interneurons $N$ were introduced in equation 4.2 to decorrelate the output. This is consistent with previous models of the cortex (King, Zylberberg, & DeWeese, 2013; Wanner & Friedrich, 2020), which have introduced inhibitory interneurons to decorrelate excitatory outputs; however, in contrast to the current work, the models proposed in King et al. (2013) and Wanner and Friedrich (2020) are not normative.
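The claimed decay of the asymmetry between the feedforward and feedback weights can be illustrated with a short simulation: if the feedback weights $Q$ (standing in for the pyramidal-to-interneuron synapses) are initialized independently of $P$ but updated with the transposed form of the same local rule, the mismatch $\lVert P - Q^\top \rVert$ contracts by a factor $(1 - \eta\tau)$ per step. The simulation below is our own sketch of this argument, not code from section B.3.

```python
import numpy as np

rng = np.random.default_rng(0)
k, eta_tau = 4, 0.05
P = rng.normal(size=(k, k))   # interneuron -> pyramidal weights
Q = rng.normal(size=(k, k))   # pyramidal -> interneuron weights (independent of P)

diffs = []
for _ in range(200):
    z = rng.normal(size=k)    # stand-in pyramidal activity
    n = Q @ z                 # interneuron activity computed with Q, not P.T
    P += eta_tau * (np.outer(z, n) - P)
    Q += eta_tau * (np.outer(n, z) - Q)
    diffs.append(np.linalg.norm(P - Q.T))
```

The data-dependent terms $z n^\top$ and $(n z^\top)^\top$ are identical, so they cancel in the difference and the mismatch decays deterministically.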

We now evaluate the performance of the online algorithms, Bio-CCA and Adaptive Bio-CCA with output whitening. In each plot, the lines and shaded regions respectively denote the means and 90% confidence intervals over five runs. Detailed descriptions of the implementations are given in section C.1. All experiments were performed in Python on an iMac Pro equipped with a 3.2 GHz 8-Core Intel Xeon W CPU. The evaluation code is available at https://github.com/flatironinstitute/bio-cca.

### 6.1  Data Sets

We first describe the evaluation data sets.

#### 6.1.1  Synthetic

We generated a synthetic data set with $T = 100{,}000$ samples according to the probabilistic model for CCA introduced by Bach and Jordan (2005). In particular, let $s_1, \dots, s_T$ be independent and identically distributed (i.i.d.) 8-dimensional latent mean-zero gaussian vectors with identity covariance. Let $T_x \in \mathbb{R}^{50 \times 8}$, $T_y \in \mathbb{R}^{30 \times 8}$, $\Psi_x \in \mathbb{S}_{++}^{50}$, and $\Psi_y \in \mathbb{S}_{++}^{30}$ be randomly generated matrices and define the 50-dimensional observations $x_1, \dots, x_T$ and 30-dimensional observations $y_1, \dots, y_T$ by
$x_t := T_x s_t + \phi_t, \quad y_t := T_y s_t + \psi_t, \quad t = 1, \dots, T,$
where $\phi_1, \dots, \phi_T$ (resp. $\psi_1, \dots, \psi_T$) are i.i.d. 50-dimensional (resp. 30-dimensional) mean-zero gaussian vectors with covariance $\Psi_x$ (resp. $\Psi_y$). Thus, conditioned on the latent random variable $s_t$, the observation $x_t$ (resp. $y_t$) has a gaussian distribution with mean $T_x s_t$ (resp. $T_y s_t$) and covariance $\Psi_x$ (resp. $\Psi_y$):
$x_t \mid s_t \sim \mathcal{N}\left(T_x s_t, \Psi_x\right), \quad y_t \mid s_t \sim \mathcal{N}\left(T_y s_t, \Psi_y\right).$
For this generative model, Bach and Jordan (2005) showed that the posterior expectation of the latent vector $s_t$ given the observation $(x_t, y_t)$ is a linear transformation of the sum of the 8-dimensional CCSPs $z_t$; that is, $\mathbb{E}[s_t \mid (x_t, y_t)] = L z_t$ for some $8 \times 8$ matrix $L$. (To see this, set $M_1 = M_2 = P_d^{1/2}$ in the paragraph following Bach & Jordan, 2005, theorem 2.) The first 10 canonical correlations are plotted in Figure 4 (left). Observe that the first 8 canonical correlations are close to 1, and the remaining canonical correlations are approximately 0. This sharp drop in the canonical correlations is a consequence of the linear generative model and is generally not the case in real data (see, e.g., the right panel of Figure 4). Still, we find it useful to test our algorithms on this synthetic data set since the generative model is well studied and relevant to CCA (Bach & Jordan, 2005).
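This generative model is straightforward to reproduce. The sketch below uses a smaller $T$ than the article and our own (arbitrary) way of drawing the random matrices, then recovers the canonical correlations as the singular values of the whitened cross-covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, m, n = 10_000, 8, 50, 30      # the article uses T = 100,000

Tx = rng.normal(size=(m, d))
Ty = rng.normal(size=(n, d))
# Random positive-definite noise covariances (one arbitrary construction).
Ax = rng.normal(size=(m, m)); Psi_x = Ax @ Ax.T / m + 0.1 * np.eye(m)
Ay = rng.normal(size=(n, n)); Psi_y = Ay @ Ay.T / n + 0.1 * np.eye(n)

S = rng.normal(size=(d, T))                                   # latent sources
X = Tx @ S + np.linalg.cholesky(Psi_x) @ rng.normal(size=(m, T))
Y = Ty @ S + np.linalg.cholesky(Psi_y) @ rng.normal(size=(n, T))

# Canonical correlations: singular values of Cxx^(-1/2) Cxy Cyy^(-1/2).
Cxx, Cyy, Cxy = X @ X.T / T, Y @ Y.T / T, X @ Y.T / T
Kx = np.linalg.inv(np.linalg.cholesky(Cxx))
Ky = np.linalg.inv(np.linalg.cholesky(Cyy))
rho = np.linalg.svd(Kx @ Cxy @ Ky.T, compute_uv=False)
```

With an 8-dimensional latent source, the first 8 canonical correlations are large and the remaining ones are close to the sampling noise floor.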
Figure 4:

Top 10 canonical correlations $\rho_1, \dots, \rho_{10}$ of the synthetic data set (left) and Mediamill (right).


#### 6.1.2  Mediamill

The data set Mediamill (Snoek, Worring, Van Gemert, Geusebroek, & Smeulders, 2006) consists of $T = 43{,}907$ samples (including training and testing sets) of video data and text annotations, and has been previously used to evaluate CCA algorithms (Arora, Marinov, Mianjy, & Srebro, 2017; Pehlevan et al., 2020). The first view consists of 120-dimensional visual features extracted from representative video frames. The second view consists of 101-dimensional vectors whose components correspond to manually labeled semantic concepts associated with the video frames (e.g., "basketball" or "tree"). To ensure that the problem is well conditioned, we add gaussian noise with covariance matrix $\varepsilon I_{120}$ (resp. $\varepsilon I_{101}$), for $\varepsilon = 0.1$, to the first (resp. second) view to generate the data matrix $X$ (resp. $Y$). The first 10 canonical correlations are plotted in Figure 4 (right).

#### 6.1.3  Nonstationary

To evaluate Adaptive Bio-CCA with output whitening, we generated a nonstationary synthetic data set with $T=300,000$ samples, which are streamed from three distinct distributions that are generated according to the probabilistic model in Bach and Jordan (2005). In this case, the first $N=100,000$ samples are generated from a 4-dimensional latent source, the second $N$ samples are generated from an 8-dimensional latent source, and the final $N$ samples are generated from a 1-dimensional latent source.

Specifically, we let $s_1, \dots, s_N$ (resp. $s_{N+1}, \dots, s_{2N}$ and $s_{2N+1}, \dots, s_T$) be i.i.d. 4-dimensional (resp. 8-dimensional and 1-dimensional) mean-zero gaussian vectors with identity covariance. We then let $T_x^{(1)} \in \mathbb{R}^{50 \times 4}$, $T_x^{(2)} \in \mathbb{R}^{50 \times 8}$, $T_x^{(3)} \in \mathbb{R}^{50 \times 1}$, $T_y^{(1)} \in \mathbb{R}^{30 \times 4}$, $T_y^{(2)} \in \mathbb{R}^{30 \times 8}$, $T_y^{(3)} \in \mathbb{R}^{30 \times 1}$, $\Psi_x \in \mathbb{S}_{++}^{50}$, and $\Psi_y \in \mathbb{S}_{++}^{30}$ be randomly generated matrices and define the 50-dimensional observations $x_1, \dots, x_T$ and 30-dimensional observations $y_1, \dots, y_T$ by
$x_t := T_x^{(1)} s_t + \phi_t, \quad y_t := T_y^{(1)} s_t + \psi_t, \quad t = 1, \dots, N,$
$x_t := T_x^{(2)} s_t + \phi_t, \quad y_t := T_y^{(2)} s_t + \psi_t, \quad t = N+1, \dots, 2N,$
$x_t := T_x^{(3)} s_t + \phi_t, \quad y_t := T_y^{(3)} s_t + \psi_t, \quad t = 2N+1, \dots, T,$
where, as before, $\phi_1, \dots, \phi_T$ (resp. $\psi_1, \dots, \psi_T$) are i.i.d. 50-dimensional (resp. 30-dimensional) mean-zero gaussian vectors with covariance $\Psi_x$ (resp. $\Psi_y$). In Figure 5, we plot the first 10 canonical correlations for each of the three distributions.
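The piecewise-stationary stream can be sketched as below, with a smaller $N$ per regime than the article; the matrix constructions are our own.

```python
import numpy as np

rng = np.random.default_rng(1)
N, m, n = 5_000, 50, 30            # the article uses N = 100,000 per regime
latent_dims = [4, 8, 1]            # latent dimension of each regime

Ax = rng.normal(size=(m, m))
Ay = rng.normal(size=(n, n))
Lx = np.linalg.cholesky(Ax @ Ax.T / m + 0.1 * np.eye(m))  # noise factor, view x
Ly = np.linalg.cholesky(Ay @ Ay.T / n + 0.1 * np.eye(n))  # noise factor, view y

X_parts, Y_parts = [], []
for d in latent_dims:
    S = rng.normal(size=(d, N))                    # fresh latent source
    X_parts.append(rng.normal(size=(m, d)) @ S + Lx @ rng.normal(size=(m, N)))
    Y_parts.append(rng.normal(size=(n, d)) @ S + Ly @ rng.normal(size=(n, N)))

X = np.hstack(X_parts)   # 50 x 3N stream, first view
Y = np.hstack(Y_parts)   # 30 x 3N stream, second view
```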
Figure 5:

Top 10 canonical correlations $\rho_1, \dots, \rho_{10}$ of the three distributions that contribute to the nonstationary synthetic data set.


### 6.2  Bio-CCA

We now evaluate the performance of Bio-CCA (see algorithm 2) on the synthetic data set and Mediamill.

#### 6.2.1  Competing Algorithms

We compare the performance of Bio-CCA with the following state-of-the-art online CCA algorithms:

• A two-timescale algorithm for computing the top canonical correlation basis vectors (i.e., $k=1$) introduced by Bhatia et al. (2018). The algorithm is abbreviated “Gen-Oja” due to its resemblance to Oja's method (Oja, 1982).

• An inexact matrix stochastic gradient method for solving CCA, abbreviated “MSG-CCA,” which was derived by Arora et al. (2017).

• The asymmetric neural network proposed by Pehlevan et al. (2020), which we abbreviate as “Asym-NN.”

• The biologically plausible reduced-rank regression algorithm derived by Golkar et al. (2020), abbreviated “Bio-RRR,” which implements a supervised version of CCA when $s=1$ (see algorithm 4 in section B.2).

Detailed descriptions of the implementations of each algorithm are provided in section C.1.

#### 6.2.2  Performance Metrics

To evaluate the performance of Bio-CCA, we use two performance metrics. The first performance metric is the following [0,2]-valued normalized objective error function:
$\text{Normalized objective error}(t) := \frac{\rho_{\max} - \operatorname{Tr}\left(V_{x,t}^\top C_{xy} V_{y,t}\right)}{\rho_{\max}}.$
(6.1)
Here $\rho_{\max} := (\rho_1 + \cdots + \rho_k)/2$ is the optimal value of the CCA objective, equations 2.1 and 2.2, and $(V_{x,t}, V_{y,t})$ are the basis vectors reported by the respective algorithm after iteration $t$, normalized to ensure they satisfy the orthonormality constraint, equation 2.2. (We do not evaluate Bio-RRR using this metric because the algorithm outputs only one set of basis vectors.)
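A sketch of this metric follows; the symmetric inverse-square-root rescaling is one simple way to enforce the constraint of equation 2.2, and the helper names are ours.

```python
import numpy as np

def normalized_objective_error(Vx, Vy, Cxx, Cxy, Cyy, rho):
    """Eq. 6.1; `rho` holds the canonical correlations in descending order."""
    k = Vx.shape[1]
    # Rescale so that Vx' Cxx Vx + Vy' Cyy Vy = I_k (symmetric normalization).
    M = Vx.T @ Cxx @ Vx + Vy.T @ Cyy @ Vy
    w, U = np.linalg.eigh(M)
    M_inv_half = U @ np.diag(w ** -0.5) @ U.T
    Vx, Vy = Vx @ M_inv_half, Vy @ M_inv_half
    rho_max = rho[:k].sum() / 2          # optimal value of the objective
    return (rho_max - np.trace(Vx.T @ Cxy @ Vy)) / rho_max

# Usage on random data: the error of any constrained basis lies in [0, 2].
rng = np.random.default_rng(1)
X = rng.normal(size=(8, 2000))
Y = X[:6] + rng.normal(size=(6, 2000))
Cxx, Cyy, Cxy = X @ X.T / 2000, Y @ Y.T / 2000, X @ Y.T / 2000
Kx = np.linalg.inv(np.linalg.cholesky(Cxx))
Ky = np.linalg.inv(np.linalg.cholesky(Cyy))
rho = np.linalg.svd(Kx @ Cxy @ Ky.T, compute_uv=False)
err = normalized_objective_error(rng.normal(size=(8, 3)),
                                 rng.normal(size=(6, 3)),
                                 Cxx, Cxy, Cyy, rho)
```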
The second performance metric is the ($x$-)subspace error function defined by
$\text{Subspace error}(t) := \left\lVert V_{x,t}\left(V_{x,t}^\top V_{x,t}\right)^{-1} V_{x,t}^\top - \bar V_x\left(\bar V_x^\top \bar V_x\right)^{-1} \bar V_x^\top \right\rVert_F^2,$
(6.2)
where $\bar V_x$ is the matrix of optimal basis vectors defined as in equation 2.5. (We do not evaluate MSG-CCA using this metric because the algorithm outputs the product $V_{x,t} V_{y,t}^\top$ rather than the basis vectors $V_{x,t}$ and $V_{y,t}$ separately.)
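Since this metric compares orthogonal projection matrices, it is invariant to the (arbitrary) basis in which each subspace is reported; a sketch with illustrative names:

```python
import numpy as np

def subspace_error(V, V_bar):
    """Eq. 6.2: squared Frobenius distance between the projections onto
    the column spaces of V and V_bar."""
    def proj(A):
        return A @ np.linalg.solve(A.T @ A, A.T)
    return np.linalg.norm(proj(V) - proj(V_bar), 'fro') ** 2

rng = np.random.default_rng(0)
V = rng.normal(size=(50, 4))
A = rng.normal(size=(4, 4))        # invertible change of basis (w.h.p.)
same = subspace_error(V, V @ A)    # ~0: identical column space
diff = subspace_error(V, rng.normal(size=(50, 4)))
```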

#### 6.2.3  Evaluation on the Synthetic Data Set

In Figure 6 we plot the performance of Bio-CCA, in terms of both sample and run-time efficiency, against the competing algorithms for target dimensions $k = 1, 2, 4$ on the synthetic data set, presented once in a randomly permuted order. For $k = 1$, Gen-Oja initially outperforms Bio-CCA in sample and run-time efficiency; however, Bio-CCA eventually performs comparably to Gen-Oja when given sufficiently many samples (and outperforms Gen-Oja in terms of subspace error). For $k = 2, 4$, Bio-CCA outperforms the other competing algorithms in both sample and run-time efficiency. The MSG-CCA error does not begin to decay until iteration $10^3$ because the first $10^3$ samples are used to obtain initial estimates of the covariance matrices $C_{xx}$ and $C_{yy}$. The poor performance of the asymmetric network (Pehlevan et al., 2020) is due in part to the fact that the algorithm depends on the gaps between the canonical correlations, which are small for the synthetic data set. In Figure 11 in section C.2, we verify that the Bio-CCA basis vectors asymptotically satisfy the orthonormality constraint, equation 2.2.
Figure 6:

Comparisons of Bio-CCA (algorithm 2) with the competing algorithms on the synthetic data set, for $k=1,2,4$, in terms of the normalized objective error defined in equation 6.1 as a function of sample number and run time (top two rows), and in terms of the subspace error defined in equation 6.2 as a function of sample number and run time (bottom two rows).

Figure 7:

Comparisons of Bio-CCA (algorithm 2) with the competing algorithms on Mediamill, for $k=1,2,4$, in terms of the normalized objective error defined in equation 6.1 as a function of sample number and run time (top two rows), and in terms of the subspace error defined in equation 6.2 as a function of sample number and run time (bottom two rows).

Figure 8:

Comparisons of Adaptive Bio-CCA with output whitening (see algorithm 3) and Bio-RRR on the synthetic data set, for $α=1.2,1.5,1.8$, in terms of the subspace error as a function of sample number (top row) and run time (bottom row). For each $α$, $r$ is the target output rank defined as in equation 6.3.

Figure 9:

Comparisons of Adaptive Bio-CCA with output whitening (see algorithm 3) with Bio-RRR on the data set Mediamill, for $α=1.15,1.2,1.3$, in terms of the subspace error as a function of sample number (top row) and run time (bottom row). For each $α$, $r$ is the target output rank defined as in equation 6.3.

Figure 10:

On the left, we plot the performance of Adaptive Bio-CCA with output whitening (with $α=1.5$) on the nonstationary synthetic data set in terms of the subspace error defined in equation 6.4. On the right we plot the output rank defined in equation 6.5.

Figure 11:

The deviation of the Bio-CCA solution from the CCA orthonormality constraint, in terms of the orthonormality error defined in equation C.3 on the synthetic data set.


#### 6.2.4  Evaluation on Mediamill

In Figure 7 we plot the performance of Bio-CCA, in terms of both sample and run-time efficiency, against the competing algorithms for target dimensions $k = 1, 2, 4$ on Mediamill, presented three times with a randomly permuted order in each presentation. When tested on Mediamill, Bio-CCA outperforms the competing algorithms in both sample and run-time efficiency (for $k = 1$, Bio-RRR performs comparably to Bio-CCA). In Figure 12 in section C.2, we verify that the Bio-CCA basis vectors asymptotically satisfy the orthonormality constraint, equation 2.2.
Figure 12:

The deviation of the Bio-CCA solution from the CCA orthonormality constraint, in terms of the orthonormality error defined in equation C.3, on the data set Mediamill.


### 6.3  Adaptive Bio-CCA with Output Whitening

Next, we evaluate the performance of Adaptive Bio-CCA with output whitening (see algorithm 3) on all three data sets. Since we are unaware of competing online algorithms for adaptive CCA with output whitening, to compare the performance of algorithm 3 to existing methods, we also plot the performance of Bio-RRR (Golkar et al., 2020) with respect to subspace error, where we select the target dimension a priori. We chose Bio-RRR because the algorithm also maps onto a neural network that resembles the cortical microcircuit and because it performs relatively well on the synthetic data set and Mediamill.

#### 6.3.1  Performance Metric

To evaluate the performance of Adaptive Bio-CCA with output whitening, we use a subspace error metric. Recall that the target output rank of algorithm 3, denoted $r$, is equal to the number of canonical correlations $\rho_1, \dots, \rho_k$ that exceed $\max(\alpha - 1, 0)$; that is,
$r := \max\left\{1 \le i \le k : \rho_i > \max(\alpha - 1, 0)\right\}.$
(6.3)
Since the target dimension is often less than $k$, the subspace error defined in equation 6.2 is not an appropriate metric because the projection matrices are rank $k$. Rather, we evaluate Adaptive Bio-CCA with output whitening using the adaptive subspace error function defined by
$\text{Adaptive subspace error}(t) := \left\lVert \tilde U_{x,t} \tilde U_{x,t}^\top - \tilde V_x\left(\tilde V_x^\top \tilde V_x\right)^{-1} \tilde V_x^\top \right\rVert_F^2,$
(6.4)
where $\tilde U_{x,t}$ is the $m \times r$ matrix whose column vectors are the top $r$ right-singular vectors of $W_{x,t}$, and $\tilde V_x$ is the $m \times r$ matrix whose $i$th column vector is equal to the $i$th column vector of the matrix $\bar V_x$ defined in equation 2.5, for $i = 1, \dots, r$. (The covariance matrices used in the definition of $\bar V_x$ are those associated with the distribution being streamed at iteration $t$.) If $r = k$, then this error coincides with the subspace error defined in equation 6.2. Since the output dimension of Bio-RRR is set a priori to $r$, the performance of Bio-RRR is evaluated using the subspace error defined in equation 6.2.
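A sketch of this metric; here `Wx_t` stands for the $k \times m$ feedforward weight matrix and the names are ours.

```python
import numpy as np

def adaptive_subspace_error(Wx_t, Vx_bar, r):
    """Eq. 6.4: compare the top-r right-singular subspace of Wx_t with the
    span of the first r optimal basis vectors."""
    _, _, Vt = np.linalg.svd(Wx_t, full_matrices=False)
    U_tilde = Vt[:r].T                                  # m x r, orthonormal
    V_tilde = Vx_bar[:, :r]                             # m x r
    P_hat = U_tilde @ U_tilde.T
    P_opt = V_tilde @ np.linalg.solve(V_tilde.T @ V_tilde, V_tilde.T)
    return np.linalg.norm(P_hat - P_opt, 'fro') ** 2

# Sanity check: weights whose top-r right-singular vectors span the optimal
# subspace give (near-)zero error.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(10, 4)))           # orthonormal "optimum"
W = np.diag([5.0, 4.0, 0.1, 0.05]) @ Q.T                # k = 4, m = 10
```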

#### 6.3.2  Evaluation on the Synthetic Data Set

From Figure 4 (left) we see that the first eight canonical correlations are close to one, and the remaining canonical correlations are approximately zero. Therefore, for $k \ge 8$ and $\alpha \in (1.1, 1.9)$, algorithm 3 should project the inputs $x_t$ and $y_t$ onto the 8-dimensional subspace spanned by the top eight pairs of canonical correlation basis vectors and set the nonzero output covariance eigenvalues to one. In Figure 8 we plot the performance of algorithm 3 with $k = 10$ for $\alpha = 1.2, 1.5, 1.8$ on the synthetic data set (presented once in a randomly permuted order). We see that Adaptive Bio-CCA with output whitening outperforms Bio-RRR, even though the target output dimension of Bio-RRR was set a priori. In Figure 13 in section C.2, we verify that the Adaptive Bio-CCA with output whitening basis vectors asymptotically satisfy the whitening constraint.
Figure 13:

The deviation of the Adaptive Bio-CCA with output whitening solution from the whitening constraint, in terms of the whitening error defined in equation C.4, on the synthetic data set.


#### 6.3.3  Evaluation on Mediamill

From Figure 4 (right) we see that the canonical correlations of the data set Mediamill decay more gradually than those of the synthetic data set. As we increase the threshold $\alpha$ in the interval $(1.1, 1.4)$, the rank of the output of algorithm 3 decreases. In Figure 9 we plot the performance of algorithm 3 with $k = 10$ for $\alpha = 1.15, 1.2, 1.3$ on Mediamill (presented three times with a randomly permuted order in each presentation). As with the synthetic data set, we find that Adaptive Bio-CCA with output whitening outperforms Bio-RRR. In Figure 14 in section C.2, we verify that the Adaptive Bio-CCA with output whitening basis vectors asymptotically satisfy the whitening constraint.
Figure 14:

The deviation of the Adaptive Bio-CCA with output whitening solution from the whitening constraint, in terms of the whitening error defined in equation C.4, on the data set Mediamill.


#### 6.3.4  Evaluation on the Nonstationary Data Set

From Figure 5, we see that the distributions that contribute to the nonstationary data set all have canonical correlations that are close to zero or one. The first distribution has four canonical correlations close to one, the second distribution has eight canonical correlations close to one, and the third distribution has one canonical correlation close to one. Therefore, for $k \ge 8$ and $\alpha \in (1.1, 1.9)$, algorithm 3 should initially (for the first 100,000 inputs) project the inputs $x_t$ and $y_t$ onto a 4-dimensional subspace, then (for the second 100,000 inputs) project the inputs onto an 8-dimensional subspace, and finally (for the final 100,000 inputs) project the inputs onto a 1-dimensional subspace. In Figure 10 (left) we plot the performance of algorithm 3 with $k = 10$ and $\alpha = 1.5$ on the nonstationary data set in terms of the adaptive subspace error. Note that the adaptive subspace error spikes each time the algorithm is streamed inputs from a new distribution (these epochs are denoted by the gray vertical dashed lines) before decaying. In Figure 10 (right) we plot the output rank of Adaptive Bio-CCA with output whitening defined by
$\text{Output rank}(t) = \operatorname{Tr}\left(V_{x,t}^\top C_{xx} V_{x,t} + V_{x,t}^\top C_{xy} V_{y,t} + V_{y,t}^\top C_{xy}^\top V_{x,t} + V_{y,t}^\top C_{yy} V_{y,t}\right),$
(6.5)
where, at each iteration $t$, the covariance matrices correspond to the distribution from which $(x_t, y_t)$ is streamed. Note that each time the algorithm is streamed inputs from a new distribution, the output rank quickly adapts to match the dimension of the latent variable generating that distribution.
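The output rank metric of equation 6.5 is simply the trace of the covariance of $z_t = V_{x,t}^\top x_t + V_{y,t}^\top y_t$, which counts the (approximately unit-variance) output channels when whitening holds. A sketch, with names of our choosing, verified against the empirical output covariance:

```python
import numpy as np

def output_rank(Vx, Vy, Cxx, Cxy, Cyy):
    """Eq. 6.5: trace of the covariance of z_t = Vx' x_t + Vy' y_t."""
    Czz = (Vx.T @ Cxx @ Vx + Vx.T @ Cxy @ Vy
           + Vy.T @ Cxy.T @ Vx + Vy.T @ Cyy @ Vy)
    return float(np.trace(Czz))

# Usage: compare with the trace of the empirical covariance of z_t.
rng = np.random.default_rng(0)
T, m, n, k = 2000, 7, 5, 3
X, Y = rng.normal(size=(m, T)), rng.normal(size=(n, T))
Cxx, Cyy, Cxy = X @ X.T / T, Y @ Y.T / T, X @ Y.T / T
Vx, Vy = rng.normal(size=(m, k)), rng.normal(size=(n, k))
Z = Vx.T @ X + Vy.T @ Y
empirical = float(np.trace(Z @ Z.T / T))
```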

In this work, we derived an online algorithm for CCA that can be implemented in a neural network with multicompartmental neurons and local, non-Hebbian learning rules. We also derived an extension that adaptively chooses the output rank and whitens the output. Remarkably, the neural architecture and non-Hebbian learning rules of our extension resembled neural circuitry and non-Hebbian plasticity in cortical pyramidal neurons. Thus, our neural network model may be useful for understanding the computational role of multicompartmental neurons with non-Hebbian plasticity.

While our neural network model captures salient features of cortical microcircuits, there are important biophysical properties that are not explained by our model. First, our model uses linear neurons to solve the linear CCA problem, which substantially limits its computational capabilities and is a major simplification of cortical pyramidal neurons that can perform nonlinear operations (Gidon et al., 2020). However, studying the analytically tractable and interpretable linear neural network model is useful for understanding more complex nonlinear models. Such an approach has proven successful for studying deep networks in the machine learning literature (Arora, Cohen, Golowich, & Hu, 2019). In future work, we plan to incorporate nonlinear neurons in our model.

Second, our neural network implementation requires the same number of interneurons and principal neurons, whereas in the cortex, there are approximately four times more pyramidal neurons than interneurons (Larkum et al., 1999). In our model, the interneurons decorrelate the output, and, in practice, the optimal fixed points of the algorithm can destabilize when there are fewer interneurons than principal neurons (see remark 3A in the supplementary material of Pehlevan & Chklovskii, 2015). In biological circuits, these instabilities could be mitigated by other biophysical constraints; however, a theoretical justification would require additional work.

Third, the output of our neural network is the equally weighted sum of the basal and apical projections. However, experimental evidence suggests that the pyramidal neurons integrate their apical and basal inputs asymmetrically (Larkum, Nevian, Sandler, Polsky, & Schiller, 2009; Larkum, 2013; Major, Larkum, & Schiller, 2013). In addition, in our model, the apical learning rule is non-Hebbian and depends on a calcium plateau potential that travels from the basal dendrites to the apical tuft. Experimental evidence for calcium plateau potential dependent plasticity is focused on the basal dendrites, with inconclusive evidence on the plasticity rules for the apical dendrites (Golding et al., 2002; Sjöström & Häusser, 2006).

To provide an alternative explanation of cortical computation, in a separate work (Golkar et al., 2020), we derive an online algorithm for the general supervised learning method reduced-rank regression, which includes CCA as a special case (see section B.2 for a detailed comparison of the two algorithms). The algorithm also maps onto a neural network with multicompartmental neurons and non-Hebbian plasticity in the basal dendrites. Both models adopt a normative approach in which the algorithms are derived from principled objective functions. This approach is highly instructive because the differences between the models highlight which features of the network are central to implementing an unsupervised learning method versus a supervised learning method.

There are three main differences between the biological interpretation of the two algorithms. First, the output of the network in Golkar et al. (2020) is the projection of the basal inputs, with no apical contribution. Second, the network in Golkar et al. (2020) allows for a range of apical synaptic update rules, including Hebbian updates. Third, the adaptive network derived here includes a threshold parameter $α$, which adaptively sets the output dimension and is not included in Golkar et al. (2020). In our model, this parameter corresponds to the contribution of the somatic output to plasticity in the basal dendrites. These differences can be compared to experimental outcomes to provide evidence that cortical microcircuits implement unsupervised algorithms, supervised algorithms, or mixtures of both. Thus, we find it informative to put forth and contrast the two models.

Finally, we did not prove theoretical guarantees that our algorithms converge. As we show in appendix D, Offline-CCA and Bio-CCA can be viewed as gradient descent-ascent and stochastic gradient descent-ascent algorithms for solving a nonconvex-concave min-max problem. While gradient descent-ascent algorithms are natural methods for solving such min-max problems, they are not always guaranteed to converge to a desired solution. In fact, when the gradient descent step size is not sufficiently small relative to the gradient ascent step size (i.e., when $\tau$ is not sufficiently small), gradient descent-ascent algorithms for solving nonconvex-concave min-max problems can converge to limit cycles (Hommes & Ochea, 2012; Mertikopoulos, Papadimitriou, & Piliouras, 2018). Establishing local or global convergence and convergence rate guarantees for general gradient descent-ascent algorithms is an active area of research, and even recent advances (Lin, Jin, & Jordan, 2020) impose assumptions that are not satisfied in our setting. In appendix D, we discuss these challenges and place our algorithms within the broader context of gradient descent-ascent algorithms for solving nonconvex-concave min-max problems.

In this section, we show that the sums of the CCSPs $z_1, \dots, z_T$ can be expressed as the principal subspace projection of the whitened concatenated data $\xi_1, \dots, \xi_T$, defined by
$\xi_t := \begin{bmatrix} C_{xx}^{-1/2} x_t \\ C_{yy}^{-1/2} y_t \end{bmatrix}.$
(A.1)
To this end, recall the CCA objective,
$\operatorname*{argmax}_{V_x \in \mathbb{R}^{m \times k},\, V_y \in \mathbb{R}^{n \times k}} \operatorname{Tr}\left( V_x^\top C_{xy} V_y \right),$
subject to the whitening constraint $V_x^\top C_{xx} V_x + V_y^\top C_{yy} V_y = I_k$. We can rewrite the CCA objective as a generalized eigenproblem,
$\operatorname*{argmax}_{V_x \in \mathbb{R}^{m \times k},\, V_y \in \mathbb{R}^{n \times k}} \operatorname{Tr}\left( \begin{bmatrix} V_x \\ V_y \end{bmatrix}^\top \begin{bmatrix} 0 & C_{xy} \\ C_{xy}^\top & 0 \end{bmatrix} \begin{bmatrix} V_x \\ V_y \end{bmatrix} \right),$
(A.2)
subject to the whitening constraint
$\begin{bmatrix} V_x \\ V_y \end{bmatrix}^\top \begin{bmatrix} C_{xx} & 0 \\ 0 & C_{yy} \end{bmatrix} \begin{bmatrix} V_x \\ V_y \end{bmatrix} = I_k.$
(A.3)
The equivalence can be readily verified by expanding the matrix products in equations A.2 and A.3.
Next, we transform this generalized eigenproblem formulation, equations A.2 and A.3, into a standard eigenproblem formulation. Since the trace of the left-hand side of equation A.3 is constrained to equal $k$, we can add it to the trace in equation A.2, without affecting the output of the argmax, to arrive at the CCA objective:
$\operatorname*{argmax}_{V_x \in \mathbb{R}^{m \times k},\, V_y \in \mathbb{R}^{n \times k}} \operatorname{Tr}\left( \begin{bmatrix} V_x \\ V_y \end{bmatrix}^\top \begin{bmatrix} C_{xx} & C_{xy} \\ C_{xy}^\top & C_{yy} \end{bmatrix} \begin{bmatrix} V_x \\ V_y \end{bmatrix} \right),$
(A.4)
subject to the whitening constraint in equation A.3.
Our final step is a substitution of variables. Define the $d \times k$ matrix
$V_\xi := \begin{bmatrix} C_{xx}^{1/2} V_x \\ C_{yy}^{1/2} V_y \end{bmatrix}.$
(A.5)
After substituting into equations A.4 and A.3, we see that $(\bar{V}_x, \bar{V}_y)$ is a solution of the CCA objective if and only if $\bar{V}_\xi$ is a solution of
$\operatorname*{argmax}_{V_\xi \in \mathbb{R}^{d \times k}} \operatorname{Tr}\left( V_\xi^\top C_{\xi\xi} V_\xi \right) \quad \text{subject to} \quad V_\xi^\top V_\xi = I_k,$
(A.6)
where we recall that $C_{\xi\xi}$ is the covariance matrix defined by $C_{\xi\xi} := \frac{1}{T} \sum_{t=1}^T \xi_t \xi_t^\top$. Importantly, equation A.6 is the variance maximization objective for the standard PCA eigenproblem, which is optimized when the column vectors of $V_\xi$ form an orthonormal basis spanning the $k$-dimensional principal subspace of the data $\xi_1, \dots, \xi_T$. Therefore, by equations A.1 and A.5, the projection of the vector $\xi_t$ onto its principal subspace is precisely the desired sum of CCSPs:
$z_t := \bar{V}_x^\top x_t + \bar{V}_y^\top y_t = \bar{V}_\xi^\top \xi_t.$
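This equivalence can be checked numerically. The sketch below is an illustration with arbitrary synthetic dimensions (not the paper's experiments): it builds $C_{\xi\xi}$ from two synchronized views, extracts its top-$k$ eigenvectors, and verifies that the resulting subspace projection matches the one obtained from the SVD of the correlation matrix $C_{xx}^{-1/2} C_{xy} C_{yy}^{-1/2}$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k, T = 5, 4, 2, 10_000

# Two synchronized views driven by a shared 2-D latent source (placeholder data).
s = rng.standard_normal((2, T))
X = rng.standard_normal((m, 2)) @ s + 0.5 * rng.standard_normal((m, T))
Y = rng.standard_normal((n, 2)) @ s + 0.5 * rng.standard_normal((n, T))

Cxx, Cyy, Cxy = X @ X.T / T, Y @ Y.T / T, X @ Y.T / T

def inv_sqrt(C):
    """Inverse symmetric square root of a positive definite matrix."""
    w, U = np.linalg.eigh(C)
    return U @ np.diag(w ** -0.5) @ U.T

# Whitened concatenated data (equation A.1) and its covariance.
Xi = np.vstack([inv_sqrt(Cxx) @ X, inv_sqrt(Cyy) @ Y])
Cxixi = Xi @ Xi.T / T

# Principal subspace of C_xixi (equation A.6): top-k eigenvectors
# (np.linalg.eigh returns eigenvalues in ascending order).
V_xi = np.linalg.eigh(Cxixi)[1][:, -k:]

# Classical CCA: top-k singular vectors of the correlation matrix.
U_svd, _, Vt_svd = np.linalg.svd(inv_sqrt(Cxx) @ Cxy @ inv_sqrt(Cyy))
V_cca = np.vstack([U_svd[:, :k], Vt_svd[:k, :].T]) / np.sqrt(2)

# Both constructions span the same subspace, so the projections agree.
P_eig = V_xi @ V_xi.T
P_cca = V_cca @ V_cca.T
assert np.allclose(P_eig, P_cca, atol=1e-6)
```

The comparison is made between projection matrices because the individual basis vectors are determined only up to an orthogonal rotation within the subspace.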

### B.1  Detailed Derivation of Algorithm 3

Recall the min-max objective given in equation 4.4:
$\operatorname*{argmin}_{Z \in \mathbb{R}^{k \times T}} \max_{N \in \mathbb{R}^{k \times T}} \min_{W_x \in \mathbb{R}^{k \times m}} \min_{W_y \in \mathbb{R}^{k \times n}} \max_{P \in \mathbb{R}^{k \times k}} \tilde{L}(W_x, W_y, P, Z, N),$
(B.1)
where
$\tilde{L}(W_x, W_y, P, Z, N) := \frac{1}{T} \operatorname{Tr}\left( -2 Z^\top W_x X - 2 Z^\top W_y Y + \alpha Z^\top Z \right) + \frac{1}{T} \operatorname{Tr}\left( 2 N^\top P^\top Z - N^\top N \right) + \operatorname{Tr}\left( W_x C_{xx} W_x^\top + W_y C_{yy} W_y^\top - P P^\top \right).$
Since $\tilde{L}(W_x, W_y, P, Z, N)$ is strictly convex in $W_x$, $W_y$, and $Z$, and strictly concave in $P$ and $N$, we can interchange the order of optimization to obtain
$\min_{W_x \in \mathbb{R}^{k \times m}} \min_{W_y \in \mathbb{R}^{k \times n}} \max_{P \in \mathbb{R}^{k \times k}} \min_{Z \in \mathbb{R}^{k \times T}} \max_{N \in \mathbb{R}^{k \times T}} \tilde{L}(W_x, W_y, P, Z, N).$
(B.2)
The interchange of the maximization with respect to $N$ and the minimization with respect to $W_x$ and $W_y$ is justified by the fact that for fixed $Z$ and $P$, $\tilde{L}(W_x, W_y, P, Z, N)$ is strictly convex in $(W_x, W_y)$ and strictly concave in $N$. Similarly, the interchange of the minimization with respect to $Z$ and the maximization with respect to $P$ is justified by the fact that for fixed $N$, $W_x$, and $W_y$, $\tilde{L}(W_x, W_y, P, Z, N)$ is convex in $Z$ and strictly concave in $P$. In order to derive an online algorithm, we write the objective in terms of time-separable terms,
$\tilde{L}(W_x, W_y, P, Z, N) = \frac{1}{T} \sum_{t=1}^T \tilde{l}_t(W_x, W_y, P, z_t, n_t),$
where
$\tilde{l}_t(W_x, W_y, P, z_t, n_t) := -2 z_t^\top W_x x_t - 2 z_t^\top W_y y_t + \alpha z_t^\top z_t + 2 n_t^\top P^\top z_t - n_t^\top n_t + \operatorname{Tr}\left( W_x x_t x_t^\top W_x^\top + W_y y_t y_t^\top W_y^\top - P P^\top \right).$
At each time step $t$, for fixed $W_x$, $W_y$, and $P$, we first simultaneously maximize the objective function $\tilde{l}_t(W_x, W_y, P, z_t, n_t)$ over $n_t$ and minimize over $z_t$ by running the fast gradient descent-ascent dynamics in equations 5.1 and 5.2 to equilibrium. Since $\tilde{l}_t(W_x, W_y, P, z_t, n_t)$ is convex in $z_t$ and concave in $n_t$, these fast dynamics equilibrate at the saddle point, where $\bar{z}_t = (P P^\top + \alpha I_k)^{-1}(a_t + b_t)$ and $\bar{n}_t = P^\top \bar{z}_t$. Then, with $(\bar{n}_t, \bar{z}_t)$ fixed, we perform a gradient descent-ascent step of the objective function with respect to $(W_x, W_y)$ and $P$:
$W_x \leftarrow W_x - \frac{\eta}{2} \frac{\partial \tilde{l}_t(W_x, W_y, P, \bar{z}_t, \bar{n}_t)}{\partial W_x}, \qquad W_y \leftarrow W_y - \frac{\eta}{2} \frac{\partial \tilde{l}_t(W_x, W_y, P, \bar{z}_t, \bar{n}_t)}{\partial W_y}, \qquad P \leftarrow P + \frac{\eta}{2\tau} \frac{\partial \tilde{l}_t(W_x, W_y, P, \bar{z}_t, \bar{n}_t)}{\partial P}.$
Substituting in the partial derivatives yields algorithm 3.
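Carrying out these substitutions, a single optimization step can be sketched as follows. This is a minimal NumPy illustration: the dimensions, step sizes, and random data are placeholder choices, not the settings used in the experiments, and the fast dynamics are replaced by their closed-form equilibrium:

```python
import numpy as np

def adaptive_bio_cca_step(Wx, Wy, P, x, y, alpha=1.0, eta=1e-3, tau=0.5):
    """One optimization step: equilibrate the fast neural dynamics in closed
    form, then take a gradient descent-ascent step on the synaptic weights."""
    k = P.shape[0]
    a, b = Wx @ x, Wy @ y                                    # dendritic currents
    z = np.linalg.solve(P @ P.T + alpha * np.eye(k), a + b)  # equilibrium output
    n = P.T @ z                                              # interneuron activity
    Wx += eta * np.outer(z - a, x)           # non-Hebbian basal update
    Wy += eta * np.outer(z - b, y)           # non-Hebbian apical update
    P += (eta / tau) * (np.outer(z, n) - P)  # interneuron weight update
    return z

rng = np.random.default_rng(1)
m, n_dim, k = 6, 5, 3
Wx, Wy = rng.standard_normal((k, m)), rng.standard_normal((k, n_dim))
P = rng.standard_normal((k, k))
for _ in range(100):
    z = adaptive_bio_cca_step(Wx, Wy, P,
                              rng.standard_normal(m), rng.standard_normal(n_dim))
```

Note that the weight updates follow from the partial derivatives of $\tilde{l}_t$: for example, $\partial \tilde{l}_t / \partial W_x = -2 \bar{z}_t x_t^\top + 2 W_x x_t x_t^\top$, which gives the update $W_x \leftarrow W_x + \eta(\bar{z}_t - a_t) x_t^\top$.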

### B.2  Comparison with Bio-RRR

In this section, we compare Adaptive Bio-CCA with output whitening (see algorithm 3) and Bio-RRR (Golkar et al., 2020, algorithm 2). We first state the Bio-RRR algorithm.³

Bio-RRR implements a supervised learning method for minimizing reconstruction error, with the parameter $0 \le s \le 1$ specifying the norm under which the error is measured (see Golkar et al., 2020, section 3, for details). Importantly, when $s = 1$, algorithm 4 implements a supervised version of CCA. Setting $\alpha = 1$ in algorithm 3 and $s = 1$ in algorithm 4, the algorithms have identical network architectures and synaptic update rules, namely:
$W_x \leftarrow W_x + \eta_x (z_t - a_t) x_t^\top, \qquad W_y \leftarrow W_y + \eta_y c_t^a y_t^\top, \qquad P \leftarrow P + \eta_p (z_t n_t^\top - P),$
where we recall that $c_t^a := a_t - P n_t$. For this parameter choice, the main difference between the algorithms is that Adaptive Bio-CCA with output whitening is an unsupervised learning algorithm, whereas Bio-RRR is a supervised learning algorithm. This is reflected in their outputs $z_t$: the output of algorithm 3 is the whitened sum of the basal dendritic currents and the apical calcium plateau potential, that is, $z_t = b_t + c_t^a$, whereas the output of Bio-RRR (Golkar et al., 2020, algorithm 2) is the (whitened) CCSP of the basal inputs, $z_t = b_t$. In other words, the apical inputs do not directly contribute to the output of the network in Golkar et al. (2020), only indirectly via plasticity in the basal dendrites. Experimental evidence suggests that apical calcium plateau potentials contribute significantly to the outputs of pyramidal cells, which supports the model derived here. Furthermore, the model in this work allows one to adjust the parameter $\alpha$ to adaptively set the output rank, which is important for analyzing nonstationary input streams. In Table 1, we summarize the differences between the two algorithms.
Table 1:

Comparison of Adaptive Bio-CCA with Output Whitening and Bio-RRR.

|  | Adaptive Bio-CCA with output whitening | Bio-RRR |
| --- | --- | --- |
| Unsupervised/supervised | unsupervised | supervised |
| Whitened outputs | ✓ | ✓ |
| Adaptive output rank | ✓ | ✗ |

### B.3  Decoupling the Interneuron Synapses

The neural network for Adaptive Bio-CCA with output whitening derived in section 4 requires the pyramidal neuron-to-interneuron synaptic weight matrix $P^\top$ to be the transpose of the interneuron-to-pyramidal neuron synaptic weight matrix $P$. Enforcing this symmetry via a centralized mechanism is not biologically plausible. Rather, following Golkar et al. (2020, appendix D), we show that the symmetry between these two sets of weights naturally follows from the local learning rules.

To begin, we replace the pyramidal neuron-to-interneuron weight matrix $P^\top$ by a new weight matrix $R$ with Hebbian learning rules:
$R \leftarrow R + \frac{\eta}{\tau} \left( n_t z_t^\top - R \right).$
(B.3)
Let $P_0$ and $R_0$ denote the initial values of $P$ and $R$. Then, in view of the update rules for $P$ and $R$ in algorithm 3 and equation B.3, respectively, the difference $P^\top - R$ after $T$ updates is given by
$P^\top - R = \left(1 - \frac{\eta}{\tau}\right)^T \left(P_0^\top - R_0\right).$
In particular, the difference decays exponentially (recall that $\eta < \tau$ by assumption).
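This decay can be observed directly in simulation. In the sketch below, the activities $z_t$ and $n_t$ are random placeholders rather than actual network outputs, which suffices because the asymmetry $P^\top - R$ contracts by exactly $(1 - \eta/\tau)$ per update regardless of the data:

```python
import numpy as np

rng = np.random.default_rng(2)
k, eta, tau, T = 4, 1e-2, 0.5, 200
P = rng.standard_normal((k, k))   # interneuron-to-pyramidal weights
R = rng.standard_normal((k, k))   # decoupled pyramidal-to-interneuron weights
D0 = P.T - R                      # initial asymmetry

for _ in range(T):
    z, n = rng.standard_normal(k), rng.standard_normal(k)
    P += (eta / tau) * (np.outer(z, n) - P)   # update from algorithm 3
    R += (eta / tau) * (np.outer(n, z) - R)   # Hebbian update, equation B.3

# The asymmetry contracts by a factor (1 - eta/tau) per step, data-independently.
assert np.allclose(P.T - R, (1 - eta / tau) ** T * D0)
```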

### C.1  Experimental Details

#### C.1.1  Bio-CCA

We implemented algorithm 2. We initialized $W_x$ (resp. $W_y$) to be a random matrix with i.i.d. mean-zero normal entries with variance $1/m$ (resp. $1/n$). We initialized $M$ to be the identity matrix $I_k$. We used a time-dependent learning rate of the form $\eta_t = \eta_0 / (1 + \gamma t)$. To find the optimal hyperparameters, we performed a grid search over $\eta_0 \in \{10^{-2}, 10^{-3}, 10^{-4}\}$, $\gamma \in \{10^{-3}, 10^{-4}, 10^{-5}\}$, and $\tau \in \{1, 0.5, 0.1\}$. The best-performing parameters are reported in Table 2. Finally, to ensure that the reported basis vectors satisfy the orthonormality constraint in equation 2.2, we report the following normalized basis vectors:
$V_x^\top := \left[ M^{-1} \left( W_x C_{xx} W_x^\top + W_y C_{yy} W_y^\top \right) M^{-1} \right]^{-1/2} M^{-1} W_x,$
(C.1)
$V_y^\top := \left[ M^{-1} \left( W_x C_{xx} W_x^\top + W_y C_{yy} W_y^\top \right) M^{-1} \right]^{-1/2} M^{-1} W_y.$
(C.2)
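The normalization in equations C.1 and C.2 can be computed with a symmetric eigendecomposition. The helper below is an illustrative sketch with random placeholder matrices (`normalized_bases` is our name for the routine, not one used in the paper); it verifies that the constraint of equation 2.2 holds after normalization:

```python
import numpy as np

def normalized_bases(Wx, Wy, M, Cxx, Cyy):
    """Equations C.1 and C.2: normalize (Wx, Wy, M) so that the reported
    basis vectors satisfy the orthonormality constraint of equation 2.2."""
    Minv = np.linalg.inv(M)
    S = Minv @ (Wx @ Cxx @ Wx.T + Wy @ Cyy @ Wy.T) @ Minv
    w, U = np.linalg.eigh(S)                    # S is symmetric positive definite
    S_inv_sqrt = U @ np.diag(w ** -0.5) @ U.T   # S^{-1/2}
    return S_inv_sqrt @ Minv @ Wx, S_inv_sqrt @ Minv @ Wy

# Check on random placeholder matrices.
rng = np.random.default_rng(0)
m, n, k = 6, 5, 3
A = rng.standard_normal((m, m)); Cxx = A @ A.T + m * np.eye(m)
B = rng.standard_normal((n, n)); Cyy = B @ B.T + n * np.eye(n)
Wx, Wy = rng.standard_normal((k, m)), rng.standard_normal((k, n))
G = rng.standard_normal((k, k)); M = G @ G.T + k * np.eye(k)
VxT, VyT = normalized_bases(Wx, Wy, M, Cxx, Cyy)
assert np.allclose(VxT @ Cxx @ VxT.T + VyT @ Cyy @ VyT.T, np.eye(k), atol=1e-8)
```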
Table 2:

Optimal Time-Dependent Learning Rates.

|  | Synthetic | Mediamill |
| --- | --- | --- |
| Bio-CCA $(\eta_0, \gamma, \tau)$ | $(10^{-3}, 10^{-4}, 0.1)$ | $(10^{-2}, 10^{-4}, 0.1)$ |
| Gen-Oja $(\beta_0, \gamma)$ | $(1, 10^{-2})$ | $(10^{-2}, 10^{-3})$ |
| Asym-NN $(\eta_0, \alpha)$ | $(2 \times 10^{-4}, 5 \times 10^{-6})$ | $(2 \times 10^{-3}, 5 \times 10^{-6})$ |
| Bio-RRR $(\eta_0, \gamma, \mu)$ | $(10^{-3}, 10^{-4}, 10)$ | $(10^{-2}, 10^{-5}, 10)$ |
| Adaptive Bio-CCA $(\eta_0, \gamma, \tau)$ | $(10^{-3}, 10^{-4}, 0.1)$ | $(10^{-2}, 10^{-4}, 0.5)$ |

#### C.1.2  MSG-CCA

We implemented the online algorithm stated in Arora et al. (2017). MSG-CCA requires a training set to estimate the covariance matrices $C_{xx}$ and $C_{yy}$. We provided the algorithm with 1000 samples to initially estimate the covariance matrices. Following Arora et al. (2017), we use the time-dependent learning rate $\eta_t = 0.1/t$.

#### C.1.3  Gen-Oja

We implemented the online algorithm stated in Bhatia et al. (2018). The algorithm includes two learning rates, $\alpha_t$ and $\beta_t$. As stated in Bhatia et al. (2018), Gen-Oja's performance is robust to changes in the learning rate $\alpha_t$ but sensitive to changes in the learning rate $\beta_t$. Following Bhatia et al. (2018), we set $\alpha_t$ to be constant and equal to $1/R^2$, where $R^2 := \operatorname{Tr}(C_{xx}) + \operatorname{Tr}(C_{yy})$. To optimize over $\beta_t$, we used a time-dependent learning rate of the form $\beta_t = \beta_0 / (1 + \gamma t)$ and performed a grid search over $\beta_0 \in \{1, 10^{-1}, 10^{-2}\}$ and $\gamma \in \{10^{-1}, 10^{-2}, 10^{-3}\}$. The best-performing parameters are reported in Table 2.

#### C.1.4  Asymmetric CCA Network

We implemented the online multichannel CCA algorithm derived in Pehlevan et al. (2020). Following Pehlevan et al. (2020), we use the linearly decaying learning rate $\eta_t = \eta_0 \times \max(1 - \alpha t, 0.1)$. To optimize the performance of the algorithm, we performed a grid search over $\eta_0 \in \{2 \times 10^{-2}, 10^{-2}, 2 \times 10^{-3}, 10^{-3}\}$ and $\alpha \in \{5 \times 10^{-5}, 10^{-5}, 5 \times 10^{-6}, 10^{-6}\}$. The best-performing parameters are reported in Table 2.

#### C.1.5  Bio-RRR

We implemented the online CCA algorithm derived in Golkar et al. (2020) with $s = 1$ (see algorithm 4). The algorithm includes learning rates $\eta_x$, $\eta_y$, and $\eta_p$. Following Golkar et al. (2020), we set $\eta_y = \eta_t$ and $\eta_x = \eta_p = \eta_y / \mu$ and use a time-dependent learning rate of the form $\eta_t = \eta_0 / (1 + \gamma t)$. We performed a grid search over $\eta_0 \in \{10^{-2}, 10^{-3}, 10^{-4}\}$, $\gamma \in \{10^{-3}, 10^{-4}, 10^{-5}\}$, and $\mu \in \{1, 10, 100\}$ and list the best-performing parameters in Table 2.

#### C.1.6  Adaptive Bio-CCA with Output Whitening

We implemented algorithm 3. We initialized $W_x$, $W_y$, and $P$ to be random matrices with i.i.d. standard normal entries. To find the optimal hyperparameters, we performed a grid search over $\eta_0 \in \{10^{-2}, 10^{-3}, 10^{-4}\}$, $\gamma \in \{10^{-3}, 10^{-4}, 10^{-5}\}$, and $\tau \in \{1, 0.5, 0.1\}$. The best-performing parameters are reported in Table 2.

### C.2  Orthonormality Constraints

#### C.2.1  Bio-CCA

To verify that the Bio-CCA basis vectors asymptotically satisfy the orthonormality constraint, equation 2.2, we use the following orthonormality error function:
$\text{Orthonormality Error}(t) := \frac{1}{k} \left\| M_t^{-1} \left( W_{x,t} C_{xx} W_{x,t}^\top + W_{y,t} C_{yy} W_{y,t}^\top \right) M_t^{-1} - I_k \right\|_F^2.$
(C.3)

In Figure 11 (resp. Figure 12), we demonstrate that Bio-CCA asymptotically satisfies the CCA orthonormality constraint, equation 2.2, on the synthetic data set (resp. Mediamill).
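For concreteness, equation C.3 can be implemented as follows. This is an illustrative sketch with random placeholder matrices; in the experiments, $W_{x,t}$, $W_{y,t}$, and $M_t$ are the iterates of algorithm 2:

```python
import numpy as np

def orthonormality_error(Wx, Wy, M, Cxx, Cyy):
    """Orthonormality error of equation C.3."""
    k = M.shape[0]
    Minv = np.linalg.inv(M)
    S = Minv @ (Wx @ Cxx @ Wx.T + Wy @ Cyy @ Wy.T) @ Minv
    return np.linalg.norm(S - np.eye(k), ord='fro') ** 2 / k

# Sanity check: if M is the symmetric square root of the bracketed sum,
# the error is exactly zero.
rng = np.random.default_rng(0)
m, n, k = 5, 4, 3
A = rng.standard_normal((m, m)); Cxx = A @ A.T + m * np.eye(m)
B = rng.standard_normal((n, n)); Cyy = B @ B.T + n * np.eye(n)
Wx, Wy = rng.standard_normal((k, m)), rng.standard_normal((k, n))
S0 = Wx @ Cxx @ Wx.T + Wy @ Cyy @ Wy.T
w, U = np.linalg.eigh(S0)
M = U @ np.diag(np.sqrt(w)) @ U.T
assert orthonormality_error(Wx, Wy, M, Cxx, Cyy) < 1e-10
```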

#### C.2.2  Adaptive Bio-CCA with Output Whitening

To verify that the top $r$ eigenvalues of the output covariance asymptotically approach 1, we let $\lambda_1(t) \ge \dots \ge \lambda_k(t)$ denote the ordered eigenvalues of the matrix
$C_{zz}(t) := V_{x,t}^\top C_{xx} V_{x,t} + V_{x,t}^\top C_{xy} V_{y,t} + V_{y,t}^\top C_{xy}^\top V_{x,t} + V_{y,t}^\top C_{yy} V_{y,t},$
and define the whitening error by
$\text{Whitening Error}(t) := \frac{1}{k} \left( \sum_{i=1}^r |\lambda_i(t) - 1|^2 + \sum_{i=r+1}^k |\lambda_i(t)|^2 \right).$
(C.4)
In Figure 13 (resp. Figure 14), we demonstrate that the whitening error of Adaptive Bio-CCA with output whitening asymptotically vanishes on the synthetic data set (resp. Mediamill).
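Equation C.4 reduces to a few lines of code. The sketch below uses hypothetical eigenvalue inputs to illustrate the convention that the top $r$ eigenvalues are compared against 1 and the remainder against 0:

```python
import numpy as np

def whitening_error(lam, r):
    """Whitening error of equation C.4: the top r eigenvalues should equal 1
    and the remaining k - r should equal 0. `lam` is sorted in decreasing order."""
    k = len(lam)
    return (np.sum(np.abs(lam[:r] - 1.0) ** 2) + np.sum(np.abs(lam[r:]) ** 2)) / k

# A perfectly whitened rank-2 output incurs zero error.
assert whitening_error(np.array([1.0, 1.0, 0.0, 0.0]), r=2) == 0.0
```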
To interpret our offline and online CCA algorithms (see algorithms 1 and 2) mathematically, we first make the following observation. Both algorithms optimize the min-max objective, equation 3.5, which includes optimization over the neural outputs $Z$. In this way, the neural activities can be viewed as optimization steps, which is useful for a biological interpretation of the algorithms. However, since we assume a separation of timescales in which the neural outputs equilibrate at their optimal values before the synaptic weight matrices are updated, the neural dynamics are superfluous when analyzing the algorithms from a mathematical perspective. Therefore, we set $Z$ equal to its equilibrium value $\bar{Z} = M^{-1}(W_x X + W_y Y)$ in the cost function $L(W_x, W_y, M, Z)$ to obtain a min-max problem in terms of the synaptic weights:
$\min_{W \in \mathbb{R}^{k \times d}} \max_{M \in \mathbb{S}_{++}^k} F(W, M),$
(D.1)
where $W := \begin{bmatrix} W_x & W_y \end{bmatrix}$ is the matrix of concatenated weights and $F(W, M) := L(W_x, W_y, M, \bar{Z})$. After substitution, we see that $F : \mathbb{R}^{k \times d} \times \mathbb{S}_{++}^k \to \mathbb{R}$ is the nonconvex-concave function,
$F(W, M) = \operatorname{Tr}\left( -M^{-1} W A W^\top + W B W^\top - \tfrac{1}{2} M^2 \right),$
with partial derivatives
$-\frac{\partial F(W, M)}{\partial W} = 2 M^{-1} W A - 2 W B, \qquad \frac{\partial F(W, M)}{\partial M} = M^{-1} W A W^\top M^{-1} - M,$
where we have defined
$A := \begin{bmatrix} C_{xx} & C_{xy} \\ C_{xy}^\top & C_{yy} \end{bmatrix}, \qquad B := \begin{bmatrix} C_{xx} & 0 \\ 0 & C_{yy} \end{bmatrix}.$
(D.2)
Similar objectives, with different values of $A$ and $B$, arise in the analysis of online principal subspace projection (Pehlevan, Sengupta, & Chklovskii, 2018) and slow feature analysis (Lipshutz et al., 2020) algorithms.
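The partial derivatives above can be verified numerically. The sketch below substitutes random symmetric positive semidefinite stand-ins for $A$, $B$, and $M$ (placeholders, not the actual block covariance matrices) and checks $\partial F / \partial W$ against a central finite difference:

```python
import numpy as np

rng = np.random.default_rng(4)
k, d = 3, 7
W = rng.standard_normal((k, d))
G = rng.standard_normal((k, k)); M = G @ G.T + k * np.eye(k)  # symmetric PD
H = rng.standard_normal((d, d)); A = H @ H.T                  # stand-ins for the
J = rng.standard_normal((d, d)); B = J @ J.T                  # block matrices A, B

def F(W, M):
    """Nonconvex-concave objective F(W, M)."""
    Minv = np.linalg.inv(M)
    return np.trace(-Minv @ W @ A @ W.T + W @ B @ W.T - 0.5 * M @ M)

# Analytic partial derivative with respect to W (A and M are symmetric).
dF_dW = -2 * np.linalg.inv(M) @ W @ A + 2 * W @ B

# Central finite-difference check of one entry.
eps = 1e-6
E = np.zeros_like(W); E[1, 2] = eps
num = (F(W + E, M) - F(W - E, M)) / (2 * eps)
assert np.isclose(num, dF_dW[1, 2], rtol=1e-4)
```

Since $F$ is quadratic in $W$, the central difference is exact up to floating-point rounding.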
The synaptic updates in both our offline and online algorithms can be viewed as (stochastic) gradient descent-ascent algorithms for solving the nonconvex-concave min-max problem, equation D.1. To make the comparison with our offline algorithm, we substitute the optimal value $\bar{Z}$ into the synaptic weight updates in algorithm 1 to obtain
$W \leftarrow W + 2\eta \left( M^{-1} W A - W B \right),$
(D.3)
$M \leftarrow M + \frac{\eta}{\tau} \left( M^{-1} W A W^\top M^{-1} - M \right).$
(D.4)
Comparing these updates to the partial derivatives of $F$, we see that Offline-CCA is naturally interpreted as a gradient descent-ascent algorithm for solving the min-max problem, equation D.1, with descent step size $\eta$ and ascent step size $\eta/\tau$. Similarly, to make the comparison with our online algorithm, we substitute the explicit expression for the equilibrium value $\bar{z}_t = M^{-1}(a_t + b_t)$, where $a_t = W_x x_t$ and $b_t = W_y y_t$, into the synaptic weight updates in algorithm 2 to rewrite the updates:
$W \leftarrow W + 2\eta \left( M^{-1} W A_t - W B_t \right), \qquad M \leftarrow M + \frac{\eta}{\tau} \left( M^{-1} W A_t W^\top M^{-1} - M \right),$
where
$A_t := \begin{bmatrix} x_t x_t^\top & x_t y_t^\top \\ y_t x_t^\top & y_t y_t^\top \end{bmatrix}, \qquad B_t := \begin{bmatrix} x_t x_t^\top & 0 \\ 0 & y_t y_t^\top \end{bmatrix}.$
Comparing these updates to the partial derivatives of $F$, we see that our online algorithm is naturally interpreted as a stochastic gradient descent-ascent algorithm for solving the min-max problem, equation D.1, using the time-$t$ rank-one approximations $A_t$ and $B_t$ in place of $A$ and $B$, respectively.
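This interpretation can be illustrated numerically: at fixed $(W, M)$, averaging the stochastic descent steps over the data set recovers the full-batch step of equation D.3, since the $A_t$ and $B_t$ average to the empirical $A$ and $B$. The sketch below uses placeholder data and an arbitrary step size:

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, k, T = 4, 3, 2, 1000
d = m + n
X, Y = rng.standard_normal((m, T)), rng.standard_normal((n, T))
C = np.vstack([X, Y])
A = C @ C.T / T                          # empirical block matrix A
B = np.zeros((d, d))
B[:m, :m] = X @ X.T / T; B[m:, m:] = Y @ Y.T / T

W = rng.standard_normal((k, d))
G = rng.standard_normal((k, k)); M = G @ G.T + k * np.eye(k)
Minv = np.linalg.inv(M)
eta = 1e-3

# Offline (full-batch) descent step, equation D.3.
dW_offline = 2 * eta * (Minv @ W @ A - W @ B)

# Average of the stochastic steps using rank-one A_t and block-diagonal B_t.
dW_stoch = np.zeros_like(W)
for t in range(T):
    At = np.outer(C[:, t], C[:, t])
    Bt = np.zeros((d, d))
    Bt[:m, :m] = np.outer(X[:, t], X[:, t])
    Bt[m:, m:] = np.outer(Y[:, t], Y[:, t])
    dW_stoch += 2 * eta * (Minv @ W @ At - W @ Bt) / T

assert np.allclose(dW_offline, dW_stoch)
```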

Establishing theoretical guarantees for solving nonconvex-concave min-max problems of the form of equation D.1 via stochastic gradient descent-ascent is an active area of research (Razaviyayn et al., 2020). Borkar (1997, 2009) proved asymptotic convergence to the solution of the min-max problem for a two-timescale stochastic gradient descent-ascent algorithm, in which the ratio between the learning rates for the minimization step and the maximization step, $\tau$, depends on the iteration and converges to zero in the limit as the iteration number approaches infinity. Lin et al. (2020) established convergence rate guarantees for a stochastic gradient descent-ascent algorithm to an equilibrium point (not necessarily a solution of the min-max problem). Both results, however, impose assumptions that do not hold in our setting: that the partial derivatives of $F$ are Lipschitz continuous and that $M$ is restricted to a bounded convex set. Therefore, establishing global stability with convergence rate guarantees for our offline and online CCA algorithms requires new mathematical techniques that are beyond the scope of this work.

Even proving local convergence properties is nontrivial. In the special case that $B = I_d$, Pehlevan et al. (2018) carefully analyzed the continuous dynamical system obtained by formally taking the step size $\eta$ to zero in equations D.3 and D.4. They computed an explicit value $\tau_0 \ge 1/2$, in terms of the eigenvalues of $A$, such that if $\tau \le \tau_0$, then solutions of the min-max problem, equation D.1, are the only linearly stable fixed points of the continuous dynamics. The case that $B \ne I_d$ is more complicated, and the approach in Pehlevan et al. (2018) does not readily extend. In ongoing work, we take a step toward understanding the asymptotics of our algorithms by analyzing the local stability properties of a general class of gradient descent-ascent algorithms, which includes Offline-CCA and Bio-CCA as special cases.

1. This constraint differs slightly from the usual CCA whitening constraint $V_x^\top C_{xx} V_x = V_y^\top C_{yy} V_y = I_k$; however, the constraints are equivalent up to a scaling factor of 2.

2. Since the competing algorithms are not adaptive and need to have their output dimension set by hand, we do not include a competing algorithm for comparison.

3. In Golkar et al. (2020), the inputs $x_t$ are the basal inputs and the inputs $y_t$ are the apical inputs. Here we switch the inputs to be consistent with algorithm 3.

We thank Nati Srebro for drawing our attention to CCA, and we thank Tiberiu Tesileanu and Charles Windolf for their helpful feedback on an earlier draft of this manuscript.

## References

Arora, R., Marinov, T. V., Mianjy, P., & Srebro, N. (2017). Stochastic approximation for canonical correlation analysis. In I. Guyon, Y. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in neural information processing systems, 30 (pp. 4775–4784). Red Hook, NY: Curran.

Arora, S., Cohen, N., Golowich, N., & Hu, W. (2019). A convergence analysis of gradient descent for deep linear neural networks. In Proceedings of the International Conference on Learning Representations.

Bach, F. R., & Jordan, M. I. (2005). A probabilistic interpretation of canonical correlation analysis (Technical Report). University of California, Berkeley.

Bhatia, K., Pacchiano, A., Flammarion, N., Bartlett, P. L., & Jordan, M. I. (2018). Gen-Oja: Simple and efficient algorithm for streaming generalized eigenvector computation. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Advances in neural information processing systems, 31 (pp. 7016–7025). Red Hook, NY: Curran.

Bittner, K. C., Grienberger, C., Vaidya, S. P., Milstein, A. D., Macklin, J. J., Suh, J., … Magee, J. C. (2015). Conjunctive input processing drives feature selectivity in hippocampal CA1 neurons. Nature Neuroscience, 18(8), 1133.

Bittner, K. C., Milstein, A. D., Grienberger, C., Romani, S., & Magee, J. C. (2017). Behavioral time scale synaptic plasticity underlies CA1 place fields. Science, 357(6355), 1033–1036.

Borkar, V. S. (1997). Stochastic approximation with two time scales. Systems and Control Letters, 29(5), 291–294.

Borkar, V. S. (2009). Stochastic approximation: A dynamical systems viewpoint. Berlin: Springer.

Cox, T. F., & Cox, M. A. (2000). Multidimensional scaling. London: Chapman and Hall/CRC.

Ecker, A. S., Berens, P., Keliris, G. A., Bethge, M., Logothetis, N. K., & Tolias, A. S. (2010). Decorrelated neuronal firing in cortical microcircuits. Science, 327(5965), 584–587.

Gambino, F., Pagès, S., Kehayas, V., Baptista, D., Tatti, R., Carleton, A., & Holtmaat, A. (2014). Sensory-evoked LTP driven by dendritic plateau potentials in vivo. Nature, 515(7525), 116–119.

Gidon, A., Zolnik, T. A., Fidzinski, P., Bolduan, F., Papoutsi, A., Poirazi, P., … Larkum, M. E. (2020). Dendritic action potentials and computation in human layer 2/3 cortical neurons. Science, 367(6473), 83–87.

Golding, N. L., Staff, N. P., & Spruston, N. (2002). Dendritic spikes as a mechanism for cooperative long-term potentiation. Nature, 418(6895), 326–331.

Golkar, S., Lipshutz, D., Bahroun, Y., Sengupta, A. M., & Chklovskii, D. B. (2020). A simple normative network approximates local non-Hebbian learning in the cortex. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, & H. Lin (Eds.), Advances in neural information processing systems, 33 (pp. 7283–7295). Red Hook, NY: Curran.

Gou, Z., & Fyfe, C. (2004). A canonical correlation neural network for multicollinearity and functional data. Neural Networks, 17(2), 285–293.

Guerguiev, J., Lillicrap, T. P., & Richards, B. A. (2017). Towards deep learning with segregated dendrites. eLife, 6, e22901.

Haga, T., & Fukai, T. (2018). Dendritic processing of spontaneous neuronal sequences for single-trial learning. Scientific Reports, 8(1), 1–22.

Hommes, C. H., & Ochea, M. I. (2012). Multiple equilibria and limit cycles in evolutionary games with logit dynamics. Games and Economic Behavior, 74(1), 434–441.

Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28(3–4), 321–377.

King, P. D., Zylberberg, J., & DeWeese, M. R. (2013). Inhibitory interneurons decorrelate excitatory cells to drive sparse code formation in a spiking model of V1. Journal of Neuroscience, 33(13), 5475–5485.

Körding, K. P., & König, P. (2001). Supervised and unsupervised learning with two sites of synaptic integration. Journal of Computational Neuroscience, 11(3), 207–215.

Lai, P. L., & Fyfe, C. (1999). A neural implementation of canonical correlation analysis. Neural Networks, 12(10), 1391–1397.

Larkum, M. (2013). A cellular mechanism for cortical associations: An organizing principle for the cerebral cortex. Trends in Neurosciences, 36(3), 141–151.

Larkum, M. E., Nevian, T., Sandler, M., Polsky, A., & Schiller, J. (2009). Synaptic integration in tuft dendrites of layer 5 pyramidal neurons: A new unifying principle. Science, 325(5941), 756–760.

Larkum, M. E., Zhu, J. J., & Sakmann, B. (1999). A new cellular mechanism for coupling inputs arriving at different cortical layers. Nature, 398(6725), 338–341.

Lin, T., Jin, C., & Jordan, M. I. (2020). On gradient descent ascent for nonconvex-concave minimax problems. In Proceedings of the International Conference on Machine Learning.

Lipshutz, D., Windolf, C., Golkar, S., & Chklovskii, D. B. (2020). A biologically plausible neural network for slow feature analysis. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, & H. Lin (Eds.), Advances in neural information processing systems, 33 (pp. 14986–14996). Red Hook, NY: Curran.

Magee, J. C., & Grienberger, C. (2020). Synaptic plasticity forms and functions. Annual Review of Neuroscience, 43, 95–117.

Major, G., Larkum, M. E., & Schiller, J. (2013). Active properties of neocortical pyramidal neuron dendrites. Annual Review of Neuroscience, 36, 1–24.

Mertikopoulos, P., Papadimitriou, C., & Piliouras, G. (2018). Cycles in adversarial regularized learning. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms (pp. 2703–2717). SIAM.

Milstein, A. D., Li, Y., Bittner, K. C., Grienberger, C., Soltesz, I., Magee, J. C., & Romani, S. (2020). Bidirectional synaptic plasticity rapidly modifies hippocampal representations independent of correlated activity. bioRxiv.

Miura, K., Mainen, Z. F., & Uchida, N. (2012). Odor representations in olfactory cortex: Distributed rate coding and decorrelated population activity. Neuron, 74(6), 1087–1098.

Oja, E. (1982). Simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15(3), 267–273.

Pehlevan, C., & Chklovskii, D. B. (2015). A normative theory of adaptive dimensionality reduction in neural networks. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in neural information processing systems, 28 (pp. 2269–2277). Red Hook, NY: Curran.

Pehlevan, C., Hu, T., & Chklovskii, D. B. (2015). A Hebbian/anti-Hebbian neural network for linear subspace learning: A derivation from multidimensional scaling of streaming data. Neural Computation, 27(7), 1461–1495.

Pehlevan, C., Sengupta, A. M., & Chklovskii, D. B. (2018). Why do similarity matching objectives lead to Hebbian/anti-Hebbian networks? Neural Computation, 30(1), 84–124.

Pehlevan, C., Zhao, X., Sengupta, A. M., & Chklovskii, D. B. (2020). Neurons as canonical correlation analyzers. Frontiers in Computational Neuroscience, 15(55).

Pezeshki, A., Azimi-Sadjadi, M. R., & Scharf, L. L. (2003). A network for recursive extraction of canonical coordinates. Neural Networks, 16(5–6), 801–808.

Plumbley, M. D. (1993). A Hebbian/anti-Hebbian network which optimizes information capacity by orthonormalizing the principal subspace. In Proceedings of the 1993 Third International Conference on Artificial Neural Networks (pp. 86–90).

Razaviyayn, M., Huang, T., Lu, S., Nouiehed, M., Sanjabi, M., & Hong, M. (2020). Nonconvex min-max optimization: Applications, challenges, and recent theoretical advances. IEEE Signal Processing Magazine, 37(5), 55–66.

Richards, B. A., & Lillicrap, T. P. (2019). Dendritic solutions to the credit assignment problem. Current Opinion in Neurobiology, 54, 28–36.

Sacramento, J., Costa, R. P., Bengio, Y., & Senn, W. (2018). Dendritic cortical microcircuits approximate the backpropagation algorithm. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Advances in neural information processing systems, 31 (pp. 8721–8732). Red Hook, NY: Curran.

Sjöström, P. J., & Häusser, M. (2006). A cooperative switch determines the sign of synaptic plasticity in distal dendrites of neocortical pyramidal neurons. Neuron, 51(2), 227–238.

Snoek, C. G., Worring, M., Van Gemert, J. C., Geusebroek, J.-M., & Smeulders, A. W. (2006). The challenge problem for automated detection of 101 semantic concepts in multimedia. In Proceedings of the 14th ACM International Conference on Multimedia (pp. 421–430). New York: ACM.

Takahashi, H., & Magee, J. C. (2009). Pathway interactions and synaptic plasticity in the dendritic tuft regions of CA1 pyramidal neurons. Neuron, 62(1), 102–111.

Tigaret, C. M., Olivo, V., Sadowski, J. H., Ashby, M. C., & Mellor, J. R. (2016). Coordinated activation of distinct Ca$^{2+}$ sources and metabotropic glutamate receptors encodes Hebbian synaptic plasticity. Nature Communications, 7(1), 1–14.

Urbanczik, R., & Senn, W. (2014). Learning by the dendritic prediction of somatic spiking. Neuron, 81(3), 521–528.

Velu, R., & Reinsel, G. C. (2013). Multivariate reduced-rank regression: Theory and applications. Berlin: Springer.

Vía, J., Santamaría, I., & Pérez, J. (2007). A learning algorithm for adaptive canonical correlation analysis of several data sets. Neural Networks, 20(1), 139–152.

Wanner, A. A., & Friedrich, R. W. (2020). Whitening of odor representations by the wiring diagram of the olfactory bulb. Nature Neuroscience, 23(3), 433–442.

Williams, C. K. (2001). On a connection between kernel PCA and metric multidimensional scaling. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 675–681). Cambridge, MA: MIT Press.

## Author notes

D.L., Y.B., and S.G. contributed equally.