For many years, a combination of principal component analysis (PCA) and independent component analysis (ICA) has been used for blind source separation (BSS). However, it remains unclear why these linear methods work well with real-world data that involve nonlinear source mixtures. This work theoretically validates that a cascade of linear PCA and ICA can solve a nonlinear BSS problem accurately—when the sensory inputs are generated from hidden sources via nonlinear mappings with sufficient dimensionality. Our proposed theorem, termed the asymptotic linearization theorem, theoretically guarantees that applying linear PCA to the inputs can reliably extract a subspace spanned by the linear projections from every hidden source as the major components—and thus projecting the inputs onto their major eigenspace can effectively recover a linear transformation of the hidden sources. Then subsequent application of linear ICA can separate all the true independent hidden sources accurately. Zero-element-wise-error nonlinear BSS is asymptotically attained when the source dimensionality is large and the input dimensionality is sufficiently larger than the source dimensionality. Our proposed theorem is validated analytically and numerically. Moreover, the same computation can be performed by using Hebbian-like plasticity rules, implying the biological plausibility of this nonlinear BSS strategy. Our results highlight the utility of linear PCA and ICA for accurately and reliably recovering nonlinearly mixed sources and suggest the importance of employing sensors with sufficient dimensionality to identify true hidden sources of real-world data.

Blind source separation (BSS) involves the separation of mixed sensory inputs into their hidden sources without knowledge of the manner in which they were mixed (Cichocki, Zdunek, Phan, & Amari, 2009; Comon & Jutten, 2010). Among the numerous BSS methods, a combination of principal component analysis (PCA) (Pearson, 1901; Oja, 1982, 1989; Sanger, 1989; Xu, 1993; Jolliffe, 2002) and independent component analysis (ICA) (Comon, 1994; Bell & Sejnowski, 1995, 1997; Amari, Cichocki, & Yang, 1996; Hyvarinen & Oja, 1997) is one of the most widely used approaches. In this combined PCA–ICA approach, PCA yields a low-dimensional concise representation (i.e., the major principal components) of sensory inputs that most suitably describes the original high-dimensional redundant inputs. Whereas, ICA provides a representation (i.e., encoders) that separates the compressed sensory inputs into independent hidden sources. A classical setup for BSS assumes a linear generative process (Bell & Sejnowski, 1995), in which sensory inputs are generated as a linear superposition of independent hidden sources. The linear BSS problem has been extensively studied both analytically and numerically (Amari, Chen, & Cichocki, 1997; Oja & Yuan, 2006; Erdogan, 2009), where the cascade of PCA and ICA is guaranteed to provide the optimal linear encoder that can separate sensory inputs into their true hidden sources, up to permutations and sign-flips (Baldi & Hornik, 1989; Chen, Hua, & Yan, 1998; Papadias, 2000; Erdogan, 2007).

Another crucial perspective is the applicability of BSS methods to real-world data generated from a nonlinear generative process. In particular, the aim of nonlinear BSS is to identify the inverse of a nonlinear generative process that generates sensory inputs and thereby infer their true independent hidden sources based exclusively on the sensory inputs. Although the cascade of linear PCA and ICA has been applied empirically to real-world BSS problems (Calhoun, Liu, & Adali, 2009), no one has yet theoretically proven that this linear BSS approach can solve a nonlinear BSS problem. To address this gap, this work demonstrates mathematically that the cascade of PCA and ICA can solve a nonlinear BSS problem accurately when the source dimensionality is large and the input dimensionality is sufficiently larger than the source dimensionality so that various nonlinear mappings from sources to inputs can be determined from the inputs themselves.

In general, there are five requirements for solving the nonlinear BSS problem. The first two requirements are related to the representation capacity of the encoder: (1) the encoder's parameter space must be sufficiently large to accommodate the actual solution that can express the inverse of the true generative process. (2) However, this parameter space should not be too large; otherwise, a nonlinear BSS problem can have infinitely many spurious solutions wherein all encoders are independent but dissimilar to the true hidden sources (Hyvarinen & Pajunen, 1999; Jutten & Karhunen, 2004). Hence, it is important to constrain the representation capacity of the encoder in order to satisfy these opposing requirements. A typical approach for solving the nonlinear BSS problem involves using a multilayer neural network—with nonlinear activation functions—that learns the inverse of the generative process (Lappalainen & Honkela, 2000; Karhunen, 2001; Hinton & Salakhutdinov, 2006; Kingma & Welling, 2013; Dinh, Krueger, & Bengio, 2014). The remaining three requirements are related to the unsupervised learning algorithms used to identify the optimal parameters for the encoder: (3) the learning algorithm must have a fixed point at which the network expresses the inverse of the generative process; (4) the fixed point must be linearly stable so that the learning process converges to the solution; and (5) the probability of not converging to this solution should be small, that is, most realistic initial conditions must be within the basin of attraction of the true solution.

Approaches using a nonlinear multilayer neural network satisfy requirement 1 when the number of neurons in each layer is sufficient (universality: Cybenko, 1989; Hornik, Stinchcombe, & White, 1989; Barron, 1993); moreover, learning algorithms that satisfy requirements 3 and 4 are also known (Dayan, Hinton, Neal, & Zemel, 1995; Friston, 2008; Friston, Trujillo-Barreto, & Daunizeau, 2008). However, reliable identification of the true hidden sources is still necessary because the encoder can have infinitely many spurious solutions if its representation capacity is too large (i.e., if requirement 2 is violated). As previously indicated, to the best of our knowledge, there is no theoretical proof that confirms a solution for a nonlinear BSS problem (Hyvarinen & Pajunen, 1999; Jutten & Karhunen, 2004), except for some cases wherein temporal information—such that each independent source has its own dynamics—is available (Hyvarinen & Morioka, 2016, 2017; Khemakhem, Kingma, Monti, & Hyvarinen, 2020). Moreover, even when requirement 2 is satisfied, there is no guarantee that a learning algorithm will converge to the true hidden source representation because it might be trapped in a local minimum wherein outputs are still not independent of each other. Thus, for a nonlinear BSS problem, it is of paramount importance to simplify the parameter space of the inverse model in order to remove spurious solutions and prevent the learning algorithm from attaining a local minimum (to satisfy requirements 2 and 5) while retaining its capacity to represent the actual solution (requirement 1). Hence, in this work, we apply a linear approach to solving a nonlinear BSS problem in order to ensure requirements 2 and 5 are satisfied. We demonstrate that the cascade of PCA and ICA can reliably identify a good approximation of the inverse of nonlinear generative processes asymptotically under the condition where the source dimensionality is large and the input dimensionality is sufficiently larger than the source dimensionality (thus satisfying requirements 1–5). Although such a condition is different from the case typically considered by earlier work—where the sources and inputs have the same dimensionality—the condition we consider turns out to be apt for the mathematical justification of the achievability of the nonlinear BSS.

Figure 1:

Structures of nonlinear generative process (top) and linear neural network (bottom). Hidden sources $s=(s1,…,sNs)T$ generate sensory inputs $x=(x1,…,xNx)T$ via nonlinear mappings characterized by nonlinear bases $f=(f1,…,fNf)T$, where $A={Akl}$ and $a=(a1,…,aNf)T$ are gaussian-distributed higher-layer mixing weights and offsets, respectively, and $B={Bjk}$ is a lower-layer mixing weight matrix. Encoders comprise a single-layer linear neural network, where $u=(u1,…,uNs)T$ are neural outputs and $W={Wij}$ is a synaptic weight matrix. Equations on the right-hand-side panels summarize the generation of sensory inputs from hidden sources through nonlinear mixtures and the inversion of this process using a linear neural network.

Figure 1:

Structures of nonlinear generative process (top) and linear neural network (bottom). Hidden sources $s=(s1,…,sNs)T$ generate sensory inputs $x=(x1,…,xNx)T$ via nonlinear mappings characterized by nonlinear bases $f=(f1,…,fNf)T$, where $A={Akl}$ and $a=(a1,…,aNf)T$ are gaussian-distributed higher-layer mixing weights and offsets, respectively, and $B={Bjk}$ is a lower-layer mixing weight matrix. Encoders comprise a single-layer linear neural network, where $u=(u1,…,uNs)T$ are neural outputs and $W={Wij}$ is a synaptic weight matrix. Equations on the right-hand-side panels summarize the generation of sensory inputs from hidden sources through nonlinear mixtures and the inversion of this process using a linear neural network.

Close modal

### 2.1  Overview

Our proposed theorem, referred to as asymptotic linearization theorem, is based on an intuition that when the dimensionality of sensory inputs is significantly larger than that of hidden sources, these inputs must involve various linear and nonlinear mappings of all hidden sources—thus providing sufficient information to identify the true hidden sources without ambiguity using an unsupervised learning approach. We consider that $Ns$-dimensional hidden sources $s≡(s1,…,sNs)T$ generate $Nx$-dimensional sensory inputs $x≡(x1,…,xNx)T$ through a generative process characterized by an arbitrary nonlinear mapping $x≡F(s)$ (see Figure 1). Here, the hidden sources are supposed to follow independently an identical probability distribution with zero mean and unit variance $ps(s)≡∏ipi(si)$. A class of nonlinear mappings $F(s)$ can be universally approximated by a specific but generic form of two-layer network. Suppose $A∈RNf×Ns$ and $B∈RNx×Nf$ as higher- and lower-layer mixing matrices, respectively; $a∈RNf$ as a constant vector of offsets; $y≡As$ as a linear mixture of sources, $f(•):R↦R$ as a nonlinear function; and $f≡(f1,…,fNf)T≡f(As+a)≡f(y+a)$ as $Nf$-dimensional nonlinear bases. The sensory inputs are given as
$x=Bf(As+a),$
(2.1)
or, equivalently, $x=Bf(y+a)=Bf$. This expression using the two-layer network is universal in the component-wise sense1 (Cybenko, 1989; Hornik et al., 1989; Barron, 1993) and each element of $x$ can represent an arbitrary mapping $x=F(s)$ as $Nf$ increases by adjusting parameters $a$, $A$, and $B$. We further suppose that each element of $A$ and $a$ is independently generated from a gaussian distribution $N[0,1/Ns]$, which retains its universality (Rahimi & Recht, 2008a, 2008b) as long as $B$ is tuned to minimize the mean squared error $E[|Bf-F(s)|2]$. Here, $E[•]$ describes the average over $ps(s)$. The scaling of $A$ is to ensure that the argument of $f$ is of order 1. The $Ns-1/2$ order offset $a$ is introduced to this model to express any generative process $F(s)$; however, it is negligibly small relative to $As$ for large $Ns$. The whole system, including the generative process and neural network, is depicted in Figure 1 (left). The corresponding equations are summarized in Figure 1 (right).
For analytical tractability, we decompose nonlinear bases $f$ into the sum of linear and nonlinear parts of hidden sources as follows: linear components of the hidden sources in the bases are defined as $Hs$ using a coefficient matrix $H$ that minimizes the mean squared error, $H≡argminHE[|f-E[f]-Hs|2]$. Such a coefficient matrix is computed as $H=E[fsT]$. The remaining part $ϕ≡f-E[f]-Hs$ is referred to as nonlinear components of the hidden sources, which are orthogonal (uncorrelated) to $s$ (i.e., $E[ϕsT]=O$). This definition of linear and nonlinear components is unique. Thus, the sensory inputs (see equation 2.1) are decomposed into linear and nonlinear transforms of the hidden sources:
$x-E[x]=BHs︸signal+Bϕ︸residual.$
(2.2)
The first term on the right-hand side represents the signal comprising the linear components of the hidden sources, whereas the second term represents the residual introduced by the nonlinearity in the generative process. Further, because $ϕ$ is uncorrelated with $s$, the covariance of the bases is decomposed into their linear and nonlinear counterparts $Cov[f]≡E[(f-E[f])(f-E[f])T]=Cov[Hs]+Cov[ϕ]=HHT+Σ$, where $Σ≡Cov[ϕ]$ indicates the covariance of $ϕ$. Thus, the covariance of the sensory inputs $Cov[x]≡BCov[f]BT$ can be decomposed into the signal and residual covariances:
$Cov[x]=BHHTBT︸signalcovariance+BΣBT︸residualcovariance.$
(2.3)
Crucially, the signal covariance has only $Ns$ nonzero eigenvalues when $Nf>Ns$ owing to low-column-rank matrix $H$, whereas the eigenvalues of the residual covariance are distributed in the $Nf$-dimensional eigenspace. This implies that if the norms of the linear and nonlinear components in the inputs are in a similar order and $Ns$ is sufficiently large, the eigenvalues of the signal covariance (i.e., linear components) are $Nf/Ns$ order times larger than those of the residual covariance (i.e., nonlinear components). We will prove in the following sections that when $Ns≫1$, this property is derived from the fact that elements of $A$ are gaussian distributed and singular values of $B$ are of order 1.

In what follows, we demonstrate that owing to the aforementioned property, the first $Ns$ major principal components of the input covariance precisely match the signal covariance when the source dimensionality is large and the input dimensionality is sufficiently larger than the source dimensionality. Consequently, projecting the inputs onto the subspace spanned by major eigenvectors can effectively extract the linear components in the inputs. Moreover, the same projection can effectively filter out the nonlinear components in the inputs because the majority of the nonlinear components are perpendicular to the major eigenspace. Thus, applying PCA to the inputs enables the recovery of a linear transformation of all the true hidden sources of the mixed sensory inputs with a small estimation error. This property, termed asymptotic linearization, enables the reduction of the original nonlinear BSS problem to a simple linear BSS problem, consequently satisfying requirements 1 to 5. In the remainder of this article, we mathematically validate this theorem for a wide range of nonlinear setups and thereby demonstrate that linear encoder neural networks can perform the nonlinear BSS in a self-organizing manner.2 We assume that $Nx=Nf≫Ns≫1$ throughout the article.3

### 2.2  PCA Can Extract Linear Projections of All True Hidden Sources

In this section, we first demonstrate that the major principal components of the input covariance precisely match the signal covariance when $Nf≫Ns≫1$ by analytically calculating eigenvalues of $HHT$ and $Σ$ for the aforementioned system. For analytical tractability, we assume that $f(•)$ is an odd nonlinear function. This assumption does not weaken our proposed theorem because the presumed generative process in equation 2.1 remains universal. We further assume that $ps(s)$ is a symmetric distribution.

Each element of—and any pair of two elements in—a vector $y≡As$ is approximately gaussian distributed for large $Ns$ due to the central limit theorem. The deviation of their marginal distribution $p(yi,yj)$ from the corresponding zero-mean gaussian distribution $pN(yi,yj)≡N[0,A˜A˜T]$ is order $Ns-1$ because the source distribution is symmetric, where $A˜∈R2×Ns$ indicates a submatrix of $A$ comprising its $i$th and $j$th rows (see lemma 1 and its proof in section 4.1). This asymptotic property allows us to compute $H$ and $Σ$ based on the expectation over a tractable gaussian distribution $pN(yi,yj)$—as a proxy for the expectation over $p(yi,yj)$—as the leading order for large $Ns$, despite the fact that $s$ actually follows a nongaussian distribution (ensure that $p(yi,yj)$ converges to $pN(yi,yj)$ in the large $Ns$ limit). We denote the expectation over $pN(yi,yj)$ as $EN[•]≡∫•pN(yi,yj)dyidyj$ to distinguish it from $E[•]$. The latter can be rewritten as $E[•]=EN[•(1+G(yi,yj)/Ns)]$ using an order-one function $G(yi,yj)$ that characterizes the deviation of $p(yi,yj)$ from $pN(yi,yj)$; thus, the deviation caused by this approximation is negligibly smaller than the leading order in the following analyses.

Owing to this asymptotic property, the coefficient matrix $H$ is computed as follows: the pseudo-inverse of $A$, $A+≡(ATA)-1AT$, satisfies $A+A=I$ and $A+T=AT+$. Thus, we have $H=HATA+T=E[fyT]A+T$. It can be approximated by $EN[fyT]A+T$ as the leading order from lemma 1. From the integration by parts, we obtain $EN[fyT]=-∫fpN'(y)TdyAAT=∫diag[f']pN(y)dyAAT=diag[EN[f']]AAT$. This relationship is also known as the Bussgang theorem (Bussgang, 1952). Thus, $EN[fyT]A+T=diag[EN[f']]A$ is order $Ns-1/2$ for a generic odd function $f(•)$ (as it is the product of order 1 diagonal matrix $diag[EN[f']]$ and order $Ns-1/2$ matrix $A$). According to lemma 1, the difference between $H$ and $diag[EN[f']]A$ is order $Ns-3/2$ as it is $Ns-1$ order times smaller than the leading order (see remark 1 in section 4.1 for details). Hence, we obtain
$H=diag[EN[f']]A+ONs-3/2=f'¯A+ONs-1.$
(2.4)
In the last equality, the scalar coefficient is expressed as $f'¯≡∫f'(ξ)exp(-ξ2/2)dξ/2π$ using the expectation over a unit gaussian variable $ξ$. The order $Ns-1$ error term includes the nongaussian contributions of $y$ and the effect of the order $Ns-1/2$ deviation of $diag[EN[f']]$ from $f'¯$, where the latter yields the order $Ns-1$ error owing to the product of order $Ns-1/2$ diagonal matrix $(diag[EN[f']]-f'¯I)$ and order $Ns-1/2$ matrix $A$. Due to the low column rank of matrix $A$, $AAT$ has $Ns$ nonzero eigenvalues, all of which are $Nf/Ns$ as the leading order, because elements of $A$ are independent and identically distributed variables sampled from a gaussian distribution $N[0,1/Ns]$ (Marchenko & Pastur, 1967). The same scaling of eigenvalues also holds for $HHT$ because the difference between the nonzero singular values of $H$ and those of $f'¯A$—caused by the order $Ns-1/2$ deviation of $diag[EN[f']]$ from $f'¯I$—is smaller than order $(Nf/Ns)1/2$.
Next, we characterize the covariance matrix $Σ=Cov[ϕ]$ using the asymptotic property. The nonlinear components can be cast as a function of $y$, $ϕ(y)=f(y+a)-E[f]-E[fyT]A+TA+y$. For large $Ns$, the covariance of $y$, $Cov[y]=AAT$, is close to the identity matrix—and the deviation $(AAT-I)$ is order $Ns-1/2$ for each element—because elements of $A$ are gaussian distributed. This weak correlation between $yi$ and $yj$ ($i≠j$) leads to a weak correlation of their functions $ϕi$ and $ϕj$. Thus, by computing the Taylor expansion of $CovN[ϕi,ϕj]≡EN[ϕiϕj]-EN[ϕi]EN[ϕj]$ with respect to the small covariance $CovN[yi,yj]≡EN[yiyj]$, we obtain (Toyoizumi & Abbott, 2011) (see lemma 2 in section 4.2):
$CovN[ϕi,ϕj]=∑n=1∞ENϕi(n)ENϕj(n)n!CovN[yi,yj]n$
(2.5)
for $i≠j$. Here, $CovN[yi,yj]n$ is order $Ns-n/2$. Because $EN[ϕi(1)]=O(Ns-1)$, $EN[ϕi(2)]=O(Ns-1/2)$ and $|EN[ϕi(n)]|≤O(1)$ for $n≥3$ hold with a generic odd function $f(•)$ (see remark 2 in section 4.2 for details), $CovN[ϕi,ϕj]$ is order $Ns-3/2$. In contrast, the $i$th diagonal element $CovN[ϕi,ϕi]$ is order 1. These observations conclude that eigenvalues of $CovN[ϕ]$—and therefore those of $Σ$ according to lemma 1 (see remark 1 in section 4.1 for details)—are not larger than order $max[1,NfNs-3/2]$; thus, all nonzero eigenvalues of $HHT$—which are order $Nf/Ns$—are sufficiently greater than those of $Σ$ when $Nf≫Ns≫1$.

One can further proceed to the calculation of $Σ$ by explicitly computing the coefficients $EN[ϕi(n)]$ up to the fourth order. Because $f(•)$ is an odd function, using $CovN[yi,yj]n=((AAT)⊙n)ij$, equation 2.5 becomes $CovN[ϕi,ϕj]=EN[f(3)(yi)]EN[f(3)(yj)]{aiaj((AAT)⊙2)ij/2+((AAT)⊙3)ij/6}+O(Ns-5/2)$ (see remark 2 in section 4.2 for details). This analytical expression involves the Hadamard (element-wise) power of matrix $AAT$ (denoted by $⊙$), for example, $(AAT)⊙3≡(AAT)⊙(AAT)⊙(AAT)$. When $Nf>Ns2≫1$, $(AAT)⊙3$ has $Ns$ major eigenvalues—all of which are $3Nf/Ns2$ (as the leading order) with the corresponding eigenvectors that match the directions of $AAT$—and $(Nf-Ns)$ minor eigenvalues that are negligibly smaller than order $Nf/Ns2$ (see lemma 3 in section 4.3). This property yields an approximation $(AAT)⊙3-diag[(AAT)⊙3]=(3/Ns)(AAT-diag[AAT])$ up to the negligible minor eigenmodes. Note that when $Ns2>Nf≫1$, off-diagonal elements of $(AAT)⊙3$ are negligible compared to diagonal elements of $Σ$ (see below). Similarly, $(AAT)⊙2$ has one major eigenvalue, $Nf/Ns$, in the direction of $(1,...,1)T$, while other minor eigenmodes of it are negligible. These approximations are used to compute off-diagonal elements of $Σ$.

Hence, $Σ$ is analytically expressed as
$Σ=f2¯-f'¯2I+f(3)¯22Ns(AAT+aaT)+Ξ.$
(2.6)
Here, from $Σ=Cov[f]-HHT$ and equation 2.4, each diagonal element is expressed as $Σii=f2¯-f'¯2+O(Ns-1/2)$ using $f2¯≡∫f2(ξ)exp(-ξ2/2)dξ/2π$, which yields the first term of equation 2.6 (where the error term is involved in the third term). The second term is generated from equation 2.5 followed by lemma 3, where $EN[f(3)(yi)]$ is approximated by $f(3)¯≡∫f(3)(ξ)exp(-ξ2/2)dξ/2π$ up to order $Ns-1/2$. The third term is the error matrix $Ξ$ that summarizes the deviations caused by the nongaussianity of $p(yi,yj)$ (see lemma 1; see also remark 1 in section 4.1), higher-order terms of the Taylor series in equation 2.5, minor eigenmodes of $(AAT)⊙2$ and $(AAT)⊙3$, and the effect of the order $Ns-1/2$ deviations of coefficients (e.g., $E[fi2]$) from their approximations (e.g., $f2¯$). Due to its construction, eigenvalues of $Ξ$ are smaller than those of the first or second term of equation 2.6. This indicates that for large $Ns$, either the first or second term of equation 2.6 provides the largest eigenvalue of $Σ$, which is order $max[1,Nf/Ns2]$. Thus, $Ξ$ is negligible for the following analyses.
Accordingly, all nonzero eigenvalues of $HHT$—that are order $Nf/Ns$—are much greater than the maximum eigenvalue of $Σ$—that is, order $max[1,Nf/Ns2]$—when $Nf≫Ns≫1$. Thus, unless $B$ specifically attenuates one of $Ns$ major eigenmodes of $HHT$ or significantly amplifies a particular eigenmode of $Σ$, we have
$mineig[HTBTBH]≫maxeig[BΣBT]$
(2.7)
for $Nx=Nf≫Ns≫1$. Here, $eig[•]$ indicates a set of eigenvalues of $•$. Because $mineig[HTBTBH]$ is not smaller than order $Nf/Nsmineig[BTB]$, while $maxeig[BΣBT]$ is not larger than order $max[1,Nf/Ns2]maxeig[BTB]$, inequality 2.7 holds at least when $Nf/Ns$ and $Ns$ are much greater than $maxeig[BTB]/mineig[BTB]$. This is the case, for example, if all the singular values of $B$ are order 1. We call $B$ sufficiently isotropic when $maxeig[BTB]/mineig[BTB]$ can be upper-bounded by a (possibly large) finite constant and focus on such $B$, because the effective input dimensionality is otherwise much smaller than $Nx$. In other words, one can redefine $Nx$ of any system by replacing the original sensory inputs with their compressed representation to render $B$ isotropic.
When inequality 2.7 holds, the first $Ns$ major eigenmodes of $Cov[x]$ precisely match the signal covariance, while minor eigenmodes are negligibly small. Hence, there is a clear spectrum gap between the largest $Ns$ eigenvalues and the rest. This indicates that one can reliably identify the signal covariance and source dimensionality based exclusively on PCA of $Cov[x]$ in an unsupervised manner. Hence, when $ΛM∈RNs×Ns$ is a diagonal matrix that arranges the first $Ns$ major eigenvalues of $Cov[x]$ in descending order, with the corresponding eigenvector matrix $PM∈RNx×Ns$, we obtain
$PMΛMPMT≃BHHTBT.$
(2.8)

The corrections of $ΛM$ and $PM$ due to the presence of the residual covariance are estimated as follows: by the first-order perturbation theorem (Griffiths, 2005), the correction of the $i$th major eigenvalue of $Cov[x]$ is upper-bounded by $maxeig[BΣBT]$, while the norm of the correction of the $i$th eigenvector is upper-bounded by $maxeig[BΣBT]/mineig[HTBTBH]$. These corrections are negligibly smaller than the leading-order terms, when inequality 2.7 holds. Further details are provided in section 4.4.

Therefore, applying PCA to $Cov[x]$ can extract the signal covariance as the major principal components with high accuracy when $Nx=Nf≫Ns≫1$ and $B$ is sufficiently isotropic. This property is numerically validated. The major principal components of $Cov[x]$ suitably approximate the signal covariance (see Figure 2A). In the simulations, elements of $B$ are sampled from a gaussian distribution with zero mean. The trace ratio $tr[BΣBT]/tr[HTBTBH]$ (i.e., the ratio of the eigenvalue sum of residual covariance $tr[BΣBT]$ to that of signal covariance $tr[HTBTBH]$) retains the value of about 0.6 irrespective of $Ns$ and $Nx$, indicating that the large-scale systems we consider are fairly nonlinear (see Figure 2B). In contrast, the eigenvalue ratio $maxeig[BΣBT]/mineig[HTBTBH]$ monotonically converges to zero when $Nx$ increases; thus, inequality 2.7 holds for large $Nx$ (see Figure 2C). Consequently, $PM$ approximates nonzero eigenmodes of the signal covariance accurately (see Figure 2D). These results indicate that PCA is a promising method for reliably identifying the linear components in sensory inputs. In what follows, we explicitly demonstrate that projecting the inputs onto the major eigenspace can recover true hidden sources of sensory inputs accurately.
Figure 2:

PCA can identify linear components comprising linear projections from every true hidden source. Here, hidden sources $s$ are independently generated from an identical uniform distribution with zero mean and unit variance; $Nf=Nx$ and $f(•)=sign(•)$ are fixed; elements of $A,a$ are independently sampled from $N[0,1/Ns]$; and elements of $B$ are independently sampled from $N[0,1/Nf]$. PCA is achieved via eigenvalue decomposition. (A) Comparison between the signal covariance and major principal components of sensory inputs, when $Ns=10$ and $Nx=103$. (B) Trace ratio $tr[BΣBT]/tr[HTBTBH]$. (C) Eigenvalue ratio $maxeig[BΣBT]/mineig[HTBTBH]$. (D) Error in extracting nonzero eigenmodes of signal covariance as major components, which is scored by $1-tr[PMTULULTPM]/Ns$, where $UL∈RNx×Ns$ indicates the left-singular vectors of $BH$. As $Nx$ increases, $PMPMT$ converges to $ULULT$ reliably and accurately. Solid lines indicate theoretical values of estimation errors: see equation 4.24 for details. Circles and error bars indicate the means and areas between maximum and minimum values obtained with 20 different realizations of $s,A,B,a$, where some error bars are hidden by the circles.

Figure 2:

PCA can identify linear components comprising linear projections from every true hidden source. Here, hidden sources $s$ are independently generated from an identical uniform distribution with zero mean and unit variance; $Nf=Nx$ and $f(•)=sign(•)$ are fixed; elements of $A,a$ are independently sampled from $N[0,1/Ns]$; and elements of $B$ are independently sampled from $N[0,1/Nf]$. PCA is achieved via eigenvalue decomposition. (A) Comparison between the signal covariance and major principal components of sensory inputs, when $Ns=10$ and $Nx=103$. (B) Trace ratio $tr[BΣBT]/tr[HTBTBH]$. (C) Eigenvalue ratio $maxeig[BΣBT]/mineig[HTBTBH]$. (D) Error in extracting nonzero eigenmodes of signal covariance as major components, which is scored by $1-tr[PMTULULTPM]/Ns$, where $UL∈RNx×Ns$ indicates the left-singular vectors of $BH$. As $Nx$ increases, $PMPMT$ converges to $ULULT$ reliably and accurately. Solid lines indicate theoretical values of estimation errors: see equation 4.24 for details. Circles and error bars indicate the means and areas between maximum and minimum values obtained with 20 different realizations of $s,A,B,a$, where some error bars are hidden by the circles.

Close modal

### 2.3  Asymptotic Linearization Theorem

We now consider a linear encoder comprising a single-layer linear neural network,
$u≡Wx-E[x],$
(2.9)
where $u≡(u1,…,uNs)T$ are $Ns$-dimensional neural outputs and $W∈RNs×Nx$ is a synaptic weight matrix. Suppose that by applying PCA to the inputs, one obtains $W$, which represents a subspace spanned by the major eigenvectors, $W=ΩMΛM-1/2PMT$, where $ΩM$ is an arbitrary $Ns×Ns$ orthogonal matrix expressing an ambiguity. A normalization factor $ΛM-1/2$ is multiplied so as to ensure $Cov[u]=I$. Neural outputs with this $W$ indeed express the optimal linear encoder of the inputs because it is the solution of the maximum likelihood estimation that minimizes the loss to reconstruct the inputs from lower-dimensional encoder $u$ using a linear network under gaussian assumption (see Xu, 1993, and Wentzell, Andrews, Hamilton, Faber, & Kowalski, 1997, for related studies). In other words, when we assume that the loss follows a unit gaussian distribution, $argminWE[|x-E[x]-W+u|2]=ΩMΛM-1/2PMT$ holds under the constraint of $dim[u]=Ns$ and $Cov[u]=I$, where $W+$ indicates the pseudo inverse of $W$.
Crucially, equation 2.8 directly provides the key analytical expression to represent a subspace spanned by the linear components:
$W≃Ω(BH)+.$
(2.10)
Here, $(BH)+≡(HTBTBH)-1HTBT$ is the pseudo inverse of $BH$, and $Ω$ is another arbitrary $Ns×Ns$ orthogonal matrix. Error in approximating $(BH)+$ is negligible for the following calculations as long as inequality 2.7 holds (see section 4.4 for more details). Equation 2.10 indicates that the directions of the linear components can be computed under the BSS setup—up to an arbitrary orthogonal ambiguity factor $Ω$. Hence, we obtain the following theorem:
Theorem 1
(Asymptotic Linearization). When inequality 2.7 holds, from equations 2.2, 2.9, and 2.10, the linear encoder with optimal matrix $W=ΩMΛM-1/2PMT$ can be analytically expressed as
$u=Ω(s+ɛ)$
(2.11)
using a linearization error $ɛ≡(BH)+Bϕ$ with the covariance matrix of
$Cov[ɛ]=(BH)+BΣBT(BH)+T.$
(2.12)
The maximum eigenvalue of $Cov[ɛ]$ is upper-bounded by $maxeig[BΣBT]/mineig[HTBTBH]$, which is sufficiently smaller than one from inequality 2.7. In particular, when $ps(s)$ is a symmetric distribution, $f(•)$ is an odd function, and $B$ is sufficiently isotropic, using equation 2.6, $Cov[ɛ]$ can be explicitly computed as
$Cov[ɛ]=NsNff2¯f'¯2-1(I+Δ)+f(3)¯22Nsf'¯2I$
(2.13)
as the leading order. Symmetric matrix $Δ≡UAT(BTB-I)2UA$ characterizes the anisotropy of $BTB$ in the directions of the left-singular vectors of $A$, $UA∈RNf×Ns$, wherein $maxeig[Δ]$ is upper-bounded by $maxeig[(BTB-I)2]=O(1)$. Together, we conclude that
$u=Ωs+ONsNf+O1Ns.$
(2.14)

The derivation detail of equation 2.13 is provided in section 4.5. Equation 2.13 indicates that the linearization error monotonically decreases as $Nf/Ns$ and $Ns$ increase; thus, $u$ converges to a linear mixture of all the true hidden sources ($u→Ωs$) in the limit of large $Nf/Ns$ and $Ns$. Only the anisotropy of $BTB$ in the directions of $UA$ increases the linearization error. In essence, applying PCA to the inputs effectively filters out nonlinear components in the inputs because the majority of nonlinear components are perpendicular to the directions of the signal covariance.

Although the obtained encoder is not independent of each other because of the multiplication with $Ω$, it is remarkable that the proposed approach enables the conversion of the original nonlinear BSS problem to a simple linear BSS problem. This indicates that $u$ can be separated into each independent encoder by further applying a linear ICA method (Comon, 1994; Bell & Sejnowski, 1995, 1997; Amari et al., 1996; Hyvarinen & Oja, 1997) to it, and these independent encoders match the true hidden sources up to permutations and sign-flips. Zero-element-wise-error nonlinear BSS is attained in the limit of large $Nf/Ns$ and $Ns$. As a side note, PCA can indeed recover a linear transformation of true hidden sources in the inputs even when these sources have higher-order correlations if the average of these correlations converges to zero with high source dimensionality. This property is potentially useful, for example, for identifying true hidden states of time series data generated from nonlinear systems (Isomura & Toyoizumi, 2021).

In summary, we analytically quantified the accuracy of the optimal linear encoder—obtained through the cascade of PCA and ICA—in inverting the nonlinear generative process to identify all hidden sources. The encoder increases its accuracy as $Nf/Ns$ and $Ns$ increase and asymptotically attains the true hidden sources.

The proposed theorem is empirically validated by numerical simulations. Each element of the optimal linear encoder obtained by the PCA–ICA cascade represents a hidden source, wherein BSS errors (the difference between the true and estimated sources) decrease as $Nx$ increases (see Figure 3A). The PCA–ICA cascade performs the nonlinear BSS with various types of nonlinear basis functions (see Figures 3B and 3C). This is a remarkable property of the PCA–ICA cascade because these results indicate that it can perform the nonlinear BSS without knowing the true nonlinearity that characterizes the generative process. Although equation 2.13 is not applicable to non-odd nonlinear basis function $f(•)$, empirical observations indicate that the PCA–ICA cascade can identify the true hidden sources even with non-odd function $f(•)$ (see Figure 3C), as long as $H=E[fsT]$ is nonzero.
Figure 3:

BSS error can be characterized by the source and input dimensionalities. In all panels, $s$ is generated by an identical uniform distribution; $Nf=Nx$ is fixed; and $A,B,a$ are sampled from gaussian distributions as in Figure 2. Encoders $u˜=(u˜1,…,u˜Ns)T$ are obtained by applying the PCA–ICA cascade to sensory inputs. For visualization purpose, elements of $u˜$ are permuted and sign-flipped to ensure that $u˜i$ encodes $si$. (A) Each element of $u˜$ encodes a hidden source, wherein an error in representing a source monotonically decreases when $Nx$ increases. The generative process is characterized with $Ns=100$ and $f(•)=sign(•)$. Shaded areas represent theoretical values of the standard deviation computed using equation 2.13. As elements of $B$ are gaussian distributed, $Δ=I$ holds. Inset panels depict the absolute value of covariance matrix $|Cov[u˜,s]|$ with gray-scale values ranging from 0 (white) to 1 (black). A diagonal covariance matrix indicates the successful identification of all the true hidden sources. (B, C) Nonlinear BSS when the generative process is characterized by $f(•)=(•)3$ or $f(•)=ReLU(•)$. $ReLU(•)$ outputs $•$ for $•>0$ or 0 otherwise. In panel C, the shaded area is computed using equation 2.12. (D) The quantitative relationship between the source and input dimensionalities and element-wise BSS error is scored by the mean squared error $E[|s-u˜|2]/Ns$. Here, $f(•)=sign(•)$ is supposed. Circles and error bars indicate the means and areas between maximum and minimum values obtained with 20 different realizations of $s,A,B,a$, where some error bars are hidden by the circles. Solid and dashed lines represent theoretical values computed using equations 2.12 and 2.13, respectively. These lines fit the actual BSS errors when $Nx≫Ns≫1$, although some deviations occur when $Ns$ or $Nx$ is small. Blue cross marks indicate actual BSS errors for $Ns=10$ when hidden sources are sampled from a symmetric truncated normal distribution.

Figure 3:

BSS error can be characterized by the source and input dimensionalities. In all panels, $s$ is generated by an identical uniform distribution; $Nf=Nx$ is fixed; and $A,B,a$ are sampled from gaussian distributions as in Figure 2. Encoders $u˜=(u˜1,…,u˜Ns)T$ are obtained by applying the PCA–ICA cascade to sensory inputs. For visualization purpose, elements of $u˜$ are permuted and sign-flipped to ensure that $u˜i$ encodes $si$. (A) Each element of $u˜$ encodes a hidden source, wherein an error in representing a source monotonically decreases when $Nx$ increases. The generative process is characterized with $Ns=100$ and $f(•)=sign(•)$. Shaded areas represent theoretical values of the standard deviation computed using equation 2.13. As elements of $B$ are gaussian distributed, $Δ=I$ holds. Inset panels depict the absolute value of covariance matrix $|Cov[u˜,s]|$ with gray-scale values ranging from 0 (white) to 1 (black). A diagonal covariance matrix indicates the successful identification of all the true hidden sources. (B, C) Nonlinear BSS when the generative process is characterized by $f(•)=(•)3$ or $f(•)=ReLU(•)$. $ReLU(•)$ outputs $•$ for $•>0$ or 0 otherwise. In panel C, the shaded area is computed using equation 2.12. (D) The quantitative relationship between the source and input dimensionalities and element-wise BSS error is scored by the mean squared error $E[|s-u˜|2]/Ns$. Here, $f(•)=sign(•)$ is supposed. Circles and error bars indicate the means and areas between maximum and minimum values obtained with 20 different realizations of $s,A,B,a$, where some error bars are hidden by the circles. Solid and dashed lines represent theoretical values computed using equations 2.12 and 2.13, respectively. These lines fit the actual BSS errors when $Nx≫Ns≫1$, although some deviations occur when $Ns$ or $Nx$ is small. Blue cross marks indicate actual BSS errors for $Ns=10$ when hidden sources are sampled from a symmetric truncated normal distribution.

Close modal

The log-log plot illustrates that the magnitude of element-wise BSS errors—scored by the mean squared error—decreases inversely proportional to $Nx/Ns$; however, it saturates around $Nx=Ns2$ (see Figure 3D). These observations validate equation 2.13 which asserts that the linearization error is determined by the sum of $Ns/Nf$ and $1/Ns$ order terms. Although equation 2.13 overestimates the BSS error when $Ns=10$, this is because each dimension of $As$ significantly deviates from a gaussian variable due to small $Ns$—as $s$ is sampled from a uniform distribution in these simulations. We confirm that when hidden sources are generated from a distribution close to gaussian, actual BSS errors shift toward the theoretical value of equation 2.13 even when $Ns=10$ (see the cross marks in Figure 3D). Indeed, this deviation disappears for large $Ns$ according to the central limit theorem. Therefore, equation 2.13 is a good approximation of actual BSS errors for $Nx=Nf≫Ns≫1$, and the proposed theorem suitably predicts the performance of the PCA–ICA cascade for a wide range of nonlinear BSS setups.

### 2.4  Hebbian-Like Learning Rules Can Reliably Solve Nonlinear BSS Problem

As a corollary of the proposed theorem, a linear neural network can identify true hidden sources through Hebbian-like plasticity rules in the nonlinear BSS setup under consideration. Oja's subspace rule (Oja, 1989)—which is a modified version of the Hebbian plasticity rule (see section 4.6)—is a well-known PCA approach that extracts the major eigenspace without yielding a spurious solution or attaining a local minimum (Baldi & Hornik, 1989; Chen et al., 1998). Thus, with generic random initial synaptic weights, this Hebbian-like learning rule can quickly and reliably identify an optimal linear encoder that can recover true hidden sources from their nonlinear mixtures in a self-organizing or unsupervised manner.

Numerical experiments demonstrate that regardless of the random initialization of synaptic weight matrix $WPCA∈RNs×Nx$, Oja's subspace rule updates $WPCA$ to converge to the major eigenvectors, that is, the directions of the linear components. The accuracy of extracting the linear components increases as the number of training samples increases, and reaches the same accuracy as an extraction via eigenvalue decomposition (see Figure 4A). Because the original nonlinear BSS problem has now been transformed to a simple linear BSS problem, the following linear ICA approach (Comon, 1994; Bell & Sejnowski, 1995, 1997; Amari et al., 1996; Hyvarinen & Oja, 1997) can reliably separate all the hidden sources from the features extracted using Oja's subspace rule. Amari's ICA algorithm (Amari et al., 1996), which is another Hebbian-like rule (see section 4.6), updates synaptic weight matrix $WICA∈RNs×Ns$ to render neural outputs independent of each other. The obtained independent components accurately match the true hidden sources of the nonlinear generative process up to their permutations and sign-flips (see Figure 4B). These results highlight that the cascade of PCA and ICA—implemented via Hebbian-like learning rules—can self-organize the optimal linear encoder and therefore identify all the true hidden sources in this nonlinear BSS setup, with high accuracy and reliability when $Nx=Nf≫Ns≫1$.
Figure 4:

Hebbian-like learning rules can identify the optimal linear encoder. As in Figures 2 and 3, $s$ is generated by an identical uniform distribution; $Ns=100$, $Nf=Nx$, and $f(•)=sign(•)$ are fixed; and $A,B,a$ are sampled from gaussian distributions. (A) Learning process of Oja's subspace rule for PCA. Synaptic weight matrix $WPCA∈RNs×Nx$—initialized as a random matrix—converges to $PMT$ up to multiplication of an orthogonal matrix from the left-hand side. This indicates that Oja's rule extracts the linear components according to the proposed theorem. Dashed lines indicate the estimation errors when eigenvalue decomposition is applied (see Figure 2C). (B) Learning process of Amari's ICA algorithm. Synaptic weight matrix $WICA∈RNs×Ns$—initialized as a random matrix—learns to separate the compressed inputs $WPCA(x-E[x])$ into independent signals. Because PCA yields a linear transformation of hidden sources, the ensuing independent encoder $u˜≡WICAWPCA(x-E[x])$ identifies all the true hidden sources with a small BSS error. Elements of $u˜$ are permuted and sign-flipped to ensure that $u˜i$ encodes $si$. Dashed lines are computed using equation 2.13. In panels A and B, learning rates are $ηPCA=10-3$ and $ηICA=0.02$, respectively. Solid lines represent the mean estimation errors, while shaded areas represent areas between maximum and minimum values obtained with 20 different realizations of $s,A,B,a$.

Figure 4:

Hebbian-like learning rules can identify the optimal linear encoder. As in Figures 2 and 3, $s$ is generated by an identical uniform distribution; $Ns=100$, $Nf=Nx$, and $f(•)=sign(•)$ are fixed; and $A,B,a$ are sampled from gaussian distributions. (A) Learning process of Oja's subspace rule for PCA. Synaptic weight matrix $WPCA∈RNs×Nx$—initialized as a random matrix—converges to $PMT$ up to multiplication of an orthogonal matrix from the left-hand side. This indicates that Oja's rule extracts the linear components according to the proposed theorem. Dashed lines indicate the estimation errors when eigenvalue decomposition is applied (see Figure 2C). (B) Learning process of Amari's ICA algorithm. Synaptic weight matrix $WICA∈RNs×Ns$—initialized as a random matrix—learns to separate the compressed inputs $WPCA(x-E[x])$ into independent signals. Because PCA yields a linear transformation of hidden sources, the ensuing independent encoder $u˜≡WICAWPCA(x-E[x])$ identifies all the true hidden sources with a small BSS error. Elements of $u˜$ are permuted and sign-flipped to ensure that $u˜i$ encodes $si$. Dashed lines are computed using equation 2.13. In panels A and B, learning rates are $ηPCA=10-3$ and $ηICA=0.02$, respectively. Solid lines represent the mean estimation errors, while shaded areas represent areas between maximum and minimum values obtained with 20 different realizations of $s,A,B,a$.

Close modal

In this work, we theoretically quantified the accuracy of nonlinear BSS performed using the cascade of linear PCA and ICA when sensory inputs are generated from a two-layer nonlinear generative process. First, we demonstrated that as the dimensionality of hidden sources increases and the dimensionalities of sensory inputs and nonlinear bases increase relative to the source dimensionality, the first $Ns$ major principal components approximately express a subspace spanned by the linear projections from all hidden sources of the sensory inputs. Under the same condition, we then demonstrated that the optimal linear encoder obtained by projecting the inputs onto the major eigenspace can accurately recover all the true hidden sources from their nonlinear mixtures. This property is termed the asymptotic linearization theorem. The accuracy of the subspace extraction increases as $Nf/Ns$ and $Ns$ increase, because the gap between the minimum eigenvalue of the linear (i.e., signal) components and maximum eigenvalue of the nonlinear (i.e., residual) components becomes significantly large. Hebbian-like plasticity rules can also identify the optimal linear encoder by extracting major principal components in a manner equal to PCA. Subsequent application of linear ICA on the extracted principal components can reliably identify all the true hidden sources up to permutations and sign-flips. Unlike conventional nonlinear BSS methods that can yield spurious solutions (Hyvarinen & Pajunen, 1999; Jutten & Karhunen, 2004), the PCA–ICA cascade is guaranteed to identify the true hidden sources in the asymptotic condition because it successfully satisfies requirements 1 to 5 specified earlier.

The unique identification of true hidden sources $s$ (up to permutations and sign-flips) is widely recognized only under the linear BSS setup. In contrast, it is well known that conventional nonlinear BSS approaches using nonlinear neural networks do not guarantee the identification of true hidden sources under the general nonlinear BSS setup (Hyvarinen & Pajunen, 1999; Jutten & Karhunen, 2004). One may ask if nonlinear BSS methods may find a component-wise nonlinear transformation of the original sources $s'=g(s)$ instead of $s$ because $s'$ is still an independent source representation. This is observed when conventional nonlinear BSS approaches based on the nonlinear ICA are employed because the nonlinear ICA finds one of many representations that minimize the dependency among outputs. However, such nonlinear BSS approaches do not guarantee the identification of true sources. This is because when $g'(s)$ is a component-wise nonlinear transformation that renders a nongaussian source gaussian distributed, a transformation $s''=g(Rg'(s))$—characterized by some orthogonal matrix $R$ and component-wise nonlinear transformation $g$—can yield an arbitrary independent representation $s''$ that differs from the original sources. The former ($s'$) is a special case of the latter ($s''$). Even in the former case, finding $s'$ transformed by a highly nonlinear, nonmonotonic function $g(s)$ is problematic. Thus, the general nonlinear BSS is essentially an ill-posed problem. Having said this, earlier works typically investigated the case in which the sources and inputs have the same dimensionality, while the other conditions have not been extensively analyzed. Thus, in the current work, we focused on the nonlinear BSS under the condition where $Nx=Nf≫Ns≫1$.

A remarkable aspect of the proposed nonlinear BSS approach is that it utilizes the linear PCA to extract the linear components of the original sources from the mixed sensory inputs. This is the process that enables the reduction of the original nonlinear BSS to a simple linear BSS and consequently ensures the reliable identification of the true sources. In other words, the proposed approach does not rely on the nonlinear ICA; thus, no concerns exist regarding the creation of the already noted spurious solutions. We mathematically demonstrated the absence of such spurious solutions when $Nx=Nf≫Ns≫1$. Specifically, we adopted a minimal assumption about the relationship between $B$ and the system dimensionalities such that $Nf/Ns$ and $Ns$ are much greater than $maxeig[BTB]/mineig[BTB]$. Our mathematical analyses demonstrate that this condition is sufficient to asymptotically determine the possible form of hidden sources, which can generate the observed sensory inputs $x$, in a unique manner. In contrast, in the limit of large $Nf/Ns$ and $Ns$, no nonlinear transformation of $s$ (i.e., $s'$ or $s''$) can generate the observed sensory inputs while satisfying the aforementioned condition.4 Thus, when inequality 2.7 holds, the true hidden sources $s$ are uniquely determined up to permutations and sign-flips ambiguities without component-wise nonlinear ambiguity. (Conversely, if inequality 2.7 does not hold, it might not be possible to distinguish true hidden sources and their nonlinear transformations in an unsupervised manner.) Hence, under the condition we consider, the nonlinear BSS is formally reduced to a simple linear BSS, wherein only permutations and sign-flips ambiguities exist while no nonlinear ambiguity remains. This property is crucial for identifying the true sources under the nonlinear BSS setup, without being attracted by spurious solutions such as $s'$ and $s''$.

The nonlinear generative processes that we considered in this work are sufficiently generic. Owing to the universality of two-layer networks, each element of an arbitrary generative process $x=F(s)$ can be approximated using equation 2.1 with a high degree of accuracy in the component-wise sense when $Nx=Nf≫Ns≫1$.5 This allows one to analyze the achievability of the nonlinear BSS using a class of generative processes comprising random basis functions, as a proxy for investigating $x=F(s)$. As a general property of such generative processes, the eigenvalues of the signal covariance (i.e., linear components) become significantly larger than those of the residual covariance (i.e., nonlinear components) when $Nx=Nf≫Ns≫1$. Sufficient system dimensionalities to exhibit this property are determined depending on matrix $B$ that characterizes the generative process. Namely, this observation holds true at least when both $Nf/Ns$ and $Ns$ are much greater than $maxeig[BTB]/mineig[BTB]$. The existence of such $Ns$ and $Nf$ is guaranteed if we consider sufficiently isotropic $B$ wherein $maxeig[BTB]/mineig[BTB]$ is upper-bounded by a large, finite constant. Meanwhile, with such a large constant, the two-layer networks can approximate each element of $x=F(s)$ accurately. This in turn means that as $Nf/Ns$ and $Ns$ increase, the proposed theorem becomes applicable to a sufficiently broad class of nonlinear generative processes.

This asymptotic property can be understood as follows. When adding a new random basis ($fk=Hks+ϕk$) to the existing large-scale system, the linear component of $fk$ ($Hks$) must be placed within the existing low-dimensional subspace spanned only by $Ns$ ($≪Nf$) linear projections of sources. In contrast, its nonlinear component ($ϕk$) is almost uncorrelated with other nonlinear components ($ϕ$), as shown in equation 2.6, because these components are characterized by gaussian distributed $A$. Thus, the former increases the eigenvalues of the signal covariance in proportion to $Nf/Ns$, while the latter adds a new dimension in the subspace of nonlinear components without significantly increasing the maximum eigenvalue of the residual covariance. Owing to this mechanism, the eigenvalues of the signal covariance become $Nf/Ns$ times larger than those of the residual covariance when $Ns≫1$. Hence, the linear PCA is sufficient to reduce the nonlinear BSS to a linear BSS.

Nonlinear variants of PCA such as autoencoders (Hinton & Salakhutdinov, 2006; Kingma & Welling, 2013) have been widely used for representation learning. Because natural sensory data are highly redundant and do not uniformly cover the entire input space (Chandler & Field, 2007), finding a concise representation of the sensory data is essential to characterize its properties (Arora & Risteski 2017). In general, if a large, nonlinear neural network is used, many equally good solutions will be produced (Kawaguchi, 2016; Lu & Kawaguchi, 2017; Nguyen & Hein, 2017). In this case, there is no objective reason to select one solution over another if they have similar reconstruction accuracy. However, this property also leads to infinitely many spurious solutions if the aim is the identification of the true hidden sources (Hyvarinen & Pajunen, 1999; Jutten & Karhunen, 2004). Consequently, the outcomes of these approaches using nonlinear neural networks are intrinsically ambiguous, and obtained solutions highly depend on the heuristic design of the regularization parameters used in the networks and learning algorithms (Dahl, Yu, Deng, & Acero, 2012; Hinton, Srivastava, Krizhevsky, Sutskever, & Salakhutdinov, 2012; Wan, Zeiler, Zhang, LeCun, & Fergus, 2013). However, unlike those nonlinear approaches, we proved that the cascade of linear PCA and ICA ensures that the true hidden sources of sensory inputs are obtained if there are sufficient dimensionalities in mappings from hidden sources to sensory inputs. Thus, this linear approach is suitable to solve the nonlinear BSS problem while retaining the guarantee to identify the true solution. Its BSS error converges in proportion with the source-to-input dimensionality ratio for a large-scale system comprising high-dimensional sources. Furthermore, if needed, its outcomes can be used as a plausible initial condition for nonlinear methods to further learn the forward model of the generative process.

Because most natural data (including biological, chemical, and social data) are generated from nonlinear generative processes, the scope of application of nonlinear BSS is quite broad. The proposed theorem provides a theoretical justification for the application of standard linear PCA and ICA to nonlinearly generated natural data to infer their true hidden sources. An essential lesson from the proposed theorem is that an artificial intelligence should employ a sufficient number of sensors to identify true hidden sources of sensory inputs accurately, because the BSS error is proportional to the source-to-input dimensionality ratio in a large-scale system.

In the case of computer vision, low-dimensional representations (typically 10–$104$ components) are extracted from up to millions of pixels of the high-dimensional raw images. Our theory implies that the linear PCA–ICA cascade can be utilized to identify the true hidden sources of natural image data. This has been demonstrated using video images (Isomura & Toyoizumi, 2021). In particular, a combination of the proposed theorem with another scheme that effectively removes unpredictable noise from sequential data (i.e., video sequence) is a powerful tool to identify the true hidden sources of real-world data. By this combination, one can obtain data with minimal irrelevant or noise components—an ideal condition for the asymptotic linearization theorem—from the original noisy observations. Such a combination enables the reliable and accurate estimations of hidden sources that generate the input sequences, even in the presence of considerable observation noise. The source estimation performance was demonstrated by using sequential visual inputs comprising hand-digits, rotating 3D objects, and natural scenes. These results support the applicability of the proposed theorem to the natural data and further highlight that the PCA–ICA cascade can extract true or relevant hidden sources when the natural data are sufficiently high-dimensional. (See Isomura & Toyoizumi, 2021, for details.)

In addition, a few works have cited a preprint version of this article to justify the application of their linear methods to real-world nonlinear BSS tasks, in the contexts of wireless sensor networks (van der Lee, Exarchakos, & de Groot, 2019) and single-channel source separation (Mika, Budzik, & Józwik, 2020). These works demonstrated that linear methods work well with nonlinear BSS tasks, as our theorem predicts. The asymptotic linearization theorem is of great importance in justifying the application of these linear methods to real-world nonlinear BSS tasks.

In terms of the interpretability of the outcomes of the PCA–ICA cascade, one may ask if applying the cascade to natural image patches simply yields less interpretable Gabor-filter-like outputs, which are usually not considered to be certain nonlinear, higher-order features underlying natural images. However, when we applied the PCA–ICA cascade—featuring a separate noise-reduction technique that we developed—to natural image data with a high-dimensionality reduction rate, we obtained outputs or images that represent features relevant to hidden sources, such as the categories of objects (Isomura & Toyoizumi, 2021). Based on our observations, three requirements to render the PCA–ICA cascade extract features relevant to hidden sources are considered. First, the dimensionality reduction rate should be high relative to that of the conventional PCA application. Indeed, when the dimensionality reduction rate was low, we observed that the obtained encoders exhibit Gabor-filter-like patterns. Second, the PCA–ICA cascade should be applied not to image patches but to entire images. Image patches contain only a fraction of information about the original hidden sources; thus, it is likely that the original hidden sources cannot be recovered from image patches. Third, PCA should be applied to denoised data wherein observation noise is removed in advance using a separate scheme, as in our case. When PCA is applied to noisy data, sufficient information for recovering hidden sources cannot be extracted as the major principal components because PCA preferentially extracts large noise owing to its extra variance. Hence, when the requirements are satisfied, the outcome of the PCA–ICA cascade can exhibit a high interpretability, wherein the extracted features are relevant to hidden sources.

As a side note, to utilize the asymptotic linearization theorem to estimate the accuracy of source separation, one cannot use up-sampling techniques to increase the input dimensionality beyond the resolution of the original data or images. This is because these techniques simply yield linear or nonlinear transformations of the original data, which typically do not increase information about true hidden sources. In other words, the resolutions of the original data determine the effective input dimensionality, so that these up-samplings merely render matrix $B$ singular (thereby violating the condition of isotropic $B$).

Furthermore, an interesting possibility is that living organisms might have developed high-dimensional biological sensors to perform nonlinear BSS using a linear biological encoder because this strategy guarantees robust nonlinear BSS. In particular, this might be the approach taken by the human nervous system using large numbers of sensory cells, such as approximately 100 million rod cells and 6 million cone cells in the retina, and approximately 16,000 hair cells in the cochlea, to process sensory signals (Kandel, Schwartz, Jessell, Siegelbaum, & Hudspeth, 2013).

Neuronal networks in the brain are known to update their synapses by following Hebbian plasticity rules (Hebb, 1949; Malenka & Bear, 2004), and researchers believe that the Hebbian plasticity plays a key role in representation learning (Dayan & Abbott, 2001; Gerstner & Kistler, 2002) and BSS (Brown, Yamada, & Sejnowski, 2001) in the brain—in particular, the dynamics of neural activity and plasticity in a class of canonical neural networks can be universally characterized in terms of representation learning or BSS from a Bayesian perspective (Isomura & Friston, 2020). Thus, it follows that major linear BSS algorithms for PCA (Oja, 1982, 1989) and ICA (Bell & Sejnowski, 1995, 1997; Amari et al., 1996; Hyvarinen & Oja, 1997) are formulated as a variant of Hebbian plasticity rules. Moreover, in vitro neuronal networks learn to separately represent independent hidden sources in (and only in) the presence of Hebbian plasticity (Isomura, Kotani, & Jimbo, 2015; Isomura & Friston, 2018). Nonetheless, the manner in which the brain can possibly solve a nonlinear BSS problem remains unclear, even though it might be a prerequisite for many of its cognitive processes such as visual recognition (DiCarlo et al., 2012). While Oja's subspace rule for PCA (Oja, 1989) and Amari's ICA algorithm (Amari et al., 1996) were used in this article, these rules can be replaced with more biologically plausible local Hebbian learning rules (Foldiak, 1990; Linsker, 1997; Pehlevan, Mohan, & Chklovskii, 2017; Isomura & Toyoizumi, 2016, 2018, 2019; Leugering & Pipa, 2018) that require only directly accessible signals to update synapses. A recent work indicated that even a single-layer neural network can perform both PCA and ICA through a local learning rule (Isomura & Toyoizumi, 2018), implying that even a single-layer network can perform a nonlinear BSS.

In summary, we demonstrated that when the source dimensionality is large and the input dimensionality is sufficiently larger than the source dimensionality, a cascaded setup of linear PCA and ICA can reliably identify the optimal linear encoder to decompose nonlinearly generated sensory inputs into their true hidden sources with increasing accuracy. This is because the higher-dimensional inputs can provide greater evidence about the hidden sources, which removes the possibility of finding spurious solutions and reduces the BSS error. As a corollary of the proposed theorem, Hebbian-like plasticity rules that perform PCA and ICA can reliably update synaptic weights to express the optimal encoding matrix and can consequently identify the true hidden sources in a self-organizing or unsupervised manner. This theoretical justification is potentially useful for designing reliable and explainable artificial intelligence, and for understanding the neuronal mechanisms underlying perceptual inference.

### 4.1  Lemma 1

Definition 1.

Suppose $ps(s)$ is a symmetric distribution, $y˜≡(yi,yj)T≡A˜s$ is a two-dimensional subvector of $y=As$, and $A˜≡(Ai1,…,AiNs;Aj1,…,AjNs)$ is a $2×Ns$ submatrix of $A$. The expectation of an arbitrary function $F(y˜)$ over $ps(s)$ is denoted as $E[F(y˜)]$, whereas when $y˜$ is sampled from a gaussian distribution $pN(y˜)≡N[0,A˜A˜T]$, the expectation of $F(y˜)$ over $pN(y˜)$ is denoted as $EN[F(y˜)]≡∫F(y˜)pN(y˜)dy˜$.

Lemma 1.

When $F(y˜)$ is order 1, the difference between $E[F(y˜)]$ and $EN[F(y˜)]$ is upper-bounded by order $Ns-1$ as a corollary of the central limit theorem. In particular, $E[F(y˜)]$ can be expressed as $EN[F(y˜)(1+G(y˜)/Ns)]$, where $G(y˜)=O(1)$ is a function that characterizes the deviation of $p(y˜)$ from $pN(y˜)$.

Proof.
The characteristic function of $p(y˜)$ is given as $ψ(τ)≡E[eiτTy˜]$ as a function of $τ∈R2$, and its logarithm is denoted as $Ψ(τ)≡logψ(τ)$. The first- to fourth-order derivatives of $Ψ(τ)$ with respect to $τ$ are provided as
$Ψ(1)=E[iy˜eiτTy˜]/ψ,Ψ(2)=E[(iy˜)⊗2eiτTy˜]/ψ-(Ψ(1))⊗2,Ψ(3)=E[(iy˜)⊗3eiτTy˜]/ψ-E[(iy˜)⊗2eiτTy˜]/ψ⊗Ψ(1)-Ψ(2)⊗Ψ(1)-Ψ(1)⊗Ψ(2)=E[(iy˜)⊗3eiτTy˜]/ψ-2Ψ(2)⊗Ψ(1)-(Ψ(1))⊗3-Ψ(1)⊗Ψ(2),Ψ(4)=E[(iy˜)⊗4eiτTy˜]/ψ-E[(iy˜)⊗3eiτTy˜]/ψ⊗Ψ(1)-2Ψ(3)⊗Ψ(1)-3(Ψ(2))⊗2-Ψ(2)⊗(Ψ(1))⊗2-Ψ(1)⊗Ψ(2)⊗Ψ(1)-(Ψ(1))⊗2⊗Ψ(2)-Ψ(1)⊗Ψ(3).$
(4.1)
When $τ=0$, they become $Ψ(0)=0$, $Ψ(1)(0)=0$, $Ψ(2)(0)=-E[y˜⊗2]$, $Ψ(3)(0)=0$, and $Ψ(4)(0)=E[y˜⊗4]-3E[y˜⊗2]⊗2$ owing to the symmetric source distribution $ps(s)$.
They are further computed as $E[y˜⊗2]=Vec[A˜A˜T]$ and $E[y˜⊗4]=Vec[E[(y˜y˜T)⊗(y˜y˜T)]]=3Vec[A˜A˜T]⊗2+κVec[∑k=1Ns(A˜•kA˜•kT)⊗2]$, where $κ≡E[sk4]-3$ indicates the kurtosis of the source distribution and $A˜•k$ denotes the $k$th column of $A˜$. Moreover, $Ψ(2)(0)·τ⊗2=-τTA˜A˜Tτ$ and $Ψ(4)(0)·τ⊗4=κVec[∑k=1Ns(A˜•kA˜•kT)⊗2]·τ⊗4=κ∑k=1Ns(τTA˜•k)4=3κ(τTτ)2/Ns+O(Ns-3/2)$ hold. Here, a quartic function of $τ$, $∑k=1Ns(τTA˜•k)4$, is order $Ns-1$ because $A˜•k$ is sampled from $N[0,1/Ns]$. Thus, the Taylor expansion of $Ψ(τ)$ is expressed as
$Ψ(τ)=∑n=0∞Ψ(n)(0)n!·τ⊗n=-12τTA˜A˜Tτ+κ8Ns(τTτ)2+ONs-3/2.$
(4.2)
Hence, from $ψ(τ)=exp[-τTA˜A˜Tτ/2+κ(τTτ)2/8Ns+O(Ns-3/2)]=e-τTA˜A˜Tτ/2(1+κ(τTτ)2/8Ns+O(Ns-3/2))$, $p(y˜)$ is computed as follows (the inversion theorem):
$p(y˜)=1(2π)2∫R2e-iτTy˜ψ(τ)λ(dτ)=1(2π)2∫R2e-iτTy˜-12τTA˜A˜Tτ1+κ8Ns(τTτ)2+ONs-3/2λ(dτ)=pN(y˜)1+G(y˜)Ns,$
(4.3)
where $λ$ denotes the Lebesgue measure. Here, $G(y˜)=O(1)$ indicates a function of $y˜$ that characterizes the difference between $p(y˜)$ and $pN(y˜)$. Hence, the expectation over $p(y˜)$ can be rewritten using that over $pN(y˜)$:
$E[F(y˜)]=∫F(y˜)ps(s)ds=∫F(y˜)p(y˜)dy˜=∫F(y˜)pN(y˜)1+G(y˜)Nsdy˜=ENF(y˜)1+G(y˜)Ns.$
(4.4)
Remark 1.
For equation 2.4, from the Bussgang theorem (Bussgang, 1952), we obtain
$E[f(y˜)y˜T]=ENf(y˜)y˜T1+G(y˜)Ns=ENdiag[f'(y˜)]1+G(y˜)Ns+f(y˜)G'(y˜)TNsA˜A˜T=diag[EN[f'(y˜)]]A˜+ONs-3/2A˜T$
(4.5)
as $A˜$ is order $Ns-1/2$. Thus, we obtain $H=E[fyT]A+T=diag[EN[f']]A+O(Ns-3/2)$.
For equation 2.5, because $yi$ and $yj$ are only weakly correlated with each other, $(1+G(y˜)/Ns)$ can be approximated as $(1+G(y˜)/Ns)≃(1+G(yi)/Ns)(1+G(yj)/Ns)$ as the leading order using newly defined functions $G(yi)$ and $G(yj)$. Thus, we obtain
$Cov[ϕi,ϕj]=E[(ϕi-E[ϕi])(ϕj-E[ϕj])]≃EN(ϕi-E[ϕi])1+G(yi)Ns(ϕj-E[ϕj])1+G(yj)Ns=CovN[ϕi*,ϕj*]=∑n=1∞ENϕi*(n)ENϕj*(n)n!CovN[yi,yj]n$
(4.6)
for $i≠j$ as the leading order, where $ϕi*≡(ϕi-E[ϕi])(1+G(yi)/Ns)$ and $ϕj*≡(ϕj-E[ϕj])(1+G(yj)/Ns)$ are modified nonlinear components. The last equality holds according to lemma 2 (see below), wherein $EN[ϕi*(n)]$ approximates $E[ϕi(n)]$. As a corollary of lemma 1, the deviation of $EN[ϕi*(n)]≃E[ϕi(n)]$ from $EN[ϕi(n)]$ due to the nongaussianity of $yi$ is order $Ns-1$. Because $CovN[yi,yj]n=O(Ns-n/2)$, $EN[ϕi(1)]=O(Ns-1)$, $EN[ϕi(2)]=O(Ns-1/2)$, and $|EN[ϕi(n)]|≤O(1)$ for $n≥3$ hold with a generic odd function $f(•)$ (see remark 2 for details), the difference between $CovN[ϕi*,ϕj*]$ and $CovN[ϕi,ϕj]$ is order $Ns-5/2$. Thus, $Cov[ϕi,ϕj]$ can be characterized by computing $CovN[ϕi,ϕj]$ as a proxy up to the negligible deviation; consequently, equation 2.6 is obtained.

### 4.2  Lemma 2

Lemma 2.
Suppose $v$ and $w$ are gaussian variables with zero mean, $σv$ and $σw$ are their standard deviations, and their correlation $c≡Cov[v,w]/σvσw$ is smaller than 1. For an arbitrary function $g(•)$,
$Cov[g(v),g(w)]=∑n=1∞σvnσwnE[g(n)(v)]E[g(n)(w)]cnn!.$
(4.7)
Proof.
Because the correlation between $v$ and $w$ is $c$, $w$ can be cast as $w=σwcv/σv+σw1-c2ξ$ using a new a zero-mean and unit-variance gaussian variable $ξ$ that is independent of $v$. When we define the covariance between $g(v)$ and $g(w)$ as $Φ(c)≡Cov[g(v),g(w)]$, its derivative with respect to $c$ is given as
$Φ'(c)=Covg(v),g'σwcvσv+σw1-c2ξσwvσv-σwcξ1-c2=E(g(v)-E[g(v)])g'σwcvσv+σw1-c2ξσwvσv-σwcξ1-c2.$
(4.8)
By applying the Bussgang theorem (Bussgang, 1952) with respect to $v$, we obtain
$E(g(v)-E[g(v)])g'σwcvσv+σw1-c2ξv=E[g'(v)g'σwcvσv+σw1-c2ξ+(g(v)-E[g(v)])g''σwcvσv+σw1-c2ξσwcσv]σv2.$
(4.9)
Similarly, applying the same theorem with respect to $ξ$ yields
$E(g(v)-E[g(v)])g'σwcvσv+σw1-c2ξξ=E(g(v)-E[g(v)])g''σwcvσv+σw1-c2ξσw1-c2.$
(4.10)
As equation 4.8 comprises equations 4.9 and 4.10, $Φ'(c)$ becomes
$Φ'(c)=σvσwEg'(v)g'σwcvσv+σw1-c2ξ.$
(4.11)
Hence, for an arbitrary natural number $n$, we obtain
$Φ(n)(c)=σvnσwnEg(n)(v)g(n)σwcvσv+σw1-c2ξ.$
(4.12)
When $c=0$, $Φ(n)(c)=σvnσwnE[g(n)(v)]E[g(n)(σwξ)]=σvnσwnE[g(n)(v)]E[g(n)(w)]$ holds, where we again regard $σwξ=w$. Thus, from the Taylor expansion with respect to $c$, we obtain
$Φ(c)=∑n=1∞Φ(n)(0)cnn!=∑n=1∞σvnσwnE[g(n)(v)]E[g(n)(w)]cnn!.$
(4.13)
Remark 2.
For $ϕ(y)=f(y+a)-E[f]-E[fyT]A+TA+y$, the coefficients $EN[ϕi(n)]$ can be computed as follows (up to the fourth order): because applying the Bussgang theorem (Bussgang, 1952) yields $EN[fyT]A+TA+=diag[EN[f']]=O(1)$, the difference between $E[fyT]A+TA+$ and $diag[EN[f']]$ is order $Ns-1$ (due to the nongaussian contributions of $y$; see lemma 1). Thus, from $ϕ(1)=diag[f']-E[fyT]A+TA+$, we have
$EN[ϕ(1)]=ENdiag[f']-diag[EN[f']]+O(Ns-1)=O(Ns-1).$
(4.14)
Moreover, for the second- to fourth-order derivatives, we obtain
$EN[ϕi(2)]=EN[f(2)(yi+ai)]=EN[f(3)(yi)]ai+O(Ns-3/2)=O(Ns-1/2),EN[ϕi(3)]=EN[f(3)(yi+ai)]=EN[f(3)(yi)]+O(Ns-1)=O(1),EN[ϕi(4)]=EN[f(4)(yi+ai)]=O(Ns-1/2).$
(4.15)
Here, the Taylor expansion with respect to $ai$ is applied, where $EN[f(2)(yi)]=EN[f(4)(yi)]=0$ owing to odd function $f(•)$. For $n≥5$, $|EN[ϕi(n)]|≤O(1)$ holds. Thus, because $CovN[yi,yj]n=((AAT)⊙n)ij$ is order $Ns-n/2$, equation 2.5 becomes
$CovN[ϕi,ϕj]=EN[ϕi(1)]EN[ϕj(1)](AAT)ij︸O(Ns-5/2)+12EN[ϕi(2)]EN[ϕj(2)]((AAT)⊙2)ij+16EN[ϕi(3)]EN[ϕj(3)]((AAT)⊙3)ij+124EN[ϕi(4)]EN[ϕj(4)]((AAT)⊙4)ij︸O(Ns-3)+O(Ns-5/2)=EN[f(3)(yi)]EN[f(3)(yj)]×12aiaj((AAT)⊙2)ij+16((AAT)⊙3)ij+O(Ns-5/2).$
(4.16)

### 4.3  Lemma 3

Lemma 3.

When elements of $A∈RNf×Ns$ are independently sampled from a gaussian distribution $N[0,1/Ns]$ and $Nf>Ns2≫1$, $(AAT)⊙3≡(AAT)⊙(AAT)⊙(AAT)$ has $Ns$ major eigenmodes—all of which have the eigenvalue of $3Nf/Ns2$ (as the leading order) with the corresponding eigenvectors that match the directions of $AAT$—and up to $(Nf-Ns)$ randomly distributed nonzero minor eigenmodes whose eigenvalues are order $max[1,Nf/Ns3]$ on average. Thus, the latter are negligibly smaller than the former when $Nf>Ns2≫1$.

Proof.
Suppose $Nf>Ns2≫1$. When the $k$th column vector of $A$ is denoted as $A•k≡(A1k,…,ANfk)T∈RNf$, $(AAT)⊙3$ becomes
$(AAT)⊙3=∑k=1Ns∑l=1Ns∑m=1Ns(A•k⊙A•l⊙A•m)(A•k⊙A•l⊙A•m)T=∑k=1Ns(A•k⊙3)(A•k⊙3)T+3∑k≠l(A•k⊙2⊙A•l)(A•k⊙2⊙A•l)T+6∑k
(4.17)
Here, $∑k=1Ns(A•k⊙3)(A•k⊙3)T=(A⊙3)(A⊙3)T$ and $∑k≠l(A•k⊙2⊙A•l)(A•k⊙2⊙A•l)T={(A⊙2)(A⊙2)T}⊙(AAT)-(A⊙3)(A⊙3)T$ hold. Let $QDQT$ be the eigenvalue decomposition of $(A⊙2)(A⊙2)T$ using a diagonal matrix that arranges eigenvalues in descending order $D∈RNs×Ns$ with the corresponding eigenvectors $Q∈RNf×Ns$. We thus have an expression ${(A⊙2)(A⊙2)T}⊙(AAT)=∑k=1Nsdiag(Dkk1/2V•k)AATdiag(Dkk1/2V•k)$. Thus, we obtain
$(AAT)⊙3=3∑k=1Nsdiag(Dkk1/2V•k)AATdiag(Dkk1/2V•k)-2(A⊙3)(A⊙3)T+6∑k
(4.18)
Hence, the first entry in the first term, $3diag[D111/2V•1]AATdiag[D111/2V•1]=(3/Ns)diag[1+O(Ns-1/2)]AATdiag[1+O(Ns-1/2)]$, corresponds to $Ns$ largest eigenmodes whose eigenvalues are $3Nf/Ns2$ (as the leading order) with the corresponding eigenvectors that match the directions of $AAT$. This is because elements of $A$ are independent and identically distributed variables sampled from a gaussian distribution $N[0,1/Ns]$, whereas the other entries (with $k≥2$) in the first term approximately yield $Ns(Ns-1)$ different minor eigenmodes whose eigenvalues are order $Nf/Ns3$. Then all the eigenmodes of the second term, $-2(A⊙3)(A⊙3)T$, can be expressed using eigenvectors involved in the first term, where all the eigenvalues of $(A⊙3)(A⊙3)T$ are order $Nf/Ns3$. Finally, the third term involves $Ns(Ns-1)(Ns-2)/6$ terms that are nearly uncorrelated with each other—which are also nearly uncorrelated with the first and second terms—leading to up to $Ns(Ns-1)(Ns-2)/6$ different minor eigenmodes whose eigenvalues are order $Nf/Ns3$ on average. Intuitively, $A•k⊙A•l⊙A•m$ with $k approximates each eigenmode as elements of $A$ are gaussian distributed. Ensure that $A•k'⊙A•l'⊙A•m'$ is nearly uncorrelated with $A•k⊙A•l⊙A•m$ if at least one of $(k',l',m')$ is different from $(k,l,m)$. In summary, when $Ns2+Ns(Ns-1)(Ns-2)/6, $(AAT)⊙3$ has $(Ns(Ns-1)+Ns(Ns-1)(Ns-2)/6)$ randomly distributed nonzero minor eigenmodes whose eigenvalues are of order $Nf/Ns3$ on average; otherwise, $(AAT)⊙3$ has $(Nf-Ns)$ randomly distributed nonzero minor eigenmodes whose eigenvalues are of order 1 on average.
Remark 3.

Because the nonzero minor eigenmodes are nearly uncorrelated with eigenmodes of $AAT$, their contributions are negligible when deriving equation 2.13. It is straightforward to apply this proof to $(AAT)⊙2$.

### 4.4  First-Order Approximation of Major Principal Components

The first $Ns$ major eigenmodes of the input covariance approximates the signal covariance comprising the linear projections from every hidden source. We first demonstrate this as the zeroth-order approximation by explicitly decomposing the input covariance (see equation 2.3) into the product of orthogonal and diagonal matrices. We define the singular value decomposition of the linear components $BH$ as $BH=ULSLVLT$ using orthogonal matrices $UL∈RNx×Ns$ and $VL∈RNs×Ns$ and a diagonal matrix $SL∈RNs×Ns$ that arranges singular values in descending order (note that $Nx>Ns$). Moreover, we define an orthogonal matrix $UN∈RNx×(Nx-Ns)$ such that it is perpendicular to $UL$ and that multiplying it by the residual covariance $BΣBT$ from both sides diagonalizes $BΣBT$. Thus, $(UL,UN)(UL,UN)T=I$ holds, and $XNN≡UNTBΣBTUN∈R(Nx-Ns)×(Nx-Ns)$ is a diagonal matrix that arranges its diagonal elements in descending order. Without loss of generality, $BΣBT$ is decomposed into three matrix factors:
$BΣBT=UL,UNXLLXNLTXNLXNNULTUNT.$
(4.19)
Here, $XLL≡ULTBΣBTUL∈RNs×Ns$ is a symmetric matrix and $XNL≡UNTBΣBTUL∈R(Nx-Ns)×Ns$ is a vertically long rectangular matrix. This decomposition separates $BΣBT$ into components in the directions of $UL$ and the rest. From equations 2.3 and 4.19, we obtain the key equality that links the eigenvalue decomposition with the decomposition into signal and residual covariances:
$PM,PmΛMOOΛmPMTPmT=UL,UNSL2+XLLXNLTXNLXNNULTUNT.$
(4.20)
The left-hand side is the eigenvalue decomposition of $Cov[x]$, where $ΛM∈RNs×Ns$ and $Λm∈R(Nx-Ns)×(Nx-Ns)$ are diagonal matrices of major and minor eigenvalues, respectively, and $PM∈RNx×Ns$ and $Pm∈RNx×(Nx-Ns)$ are their corresponding eigenvectors. When inequality 2.7 holds, $SL2+XLL$ is sufficiently close to a diagonal matrix and $XNL$ is negligibly small relative to $SL2+XLL$; thus, the right-hand side can be viewed as the zeroth-order approximation of the eigenvalue decomposition of $Cov[x]$. Owing to the uniqueness of eigenvalue decomposition, we obtain $(PM,Pm)=(UL,UN)$, $ΛM=SL2+diag[XLL]$, and $Λm=XNN$ as the leading order.
Next, we compute an error in the first order by considering a small perturbation that diagonalizes the right-hand side of equation 4.20 more accurately. Suppose $(I,ET;-E,I)∈RNx×Nx$ as a block matrix that comprises $Ns×Ns$ and $(Nx-Ns)×(Nx-Ns)$ identity matrices and a rectangular matrix $E∈R(Nx-Ns)×Ns$ with small value elements. As $(I,ET;-E,I)(I,ET;-E,I)T=I+O(|E|2)$, it is an orthogonal matrix as the first-order approximation. Because the middle matrix on the right-hand side of equation 4.20 has been almost diagonalized, a small perturbation by $(I,ET;-E,I)$ can diagonalize it with higher accuracy. By multiplying $(I,ET;-E,I)$ and its transpose by the near-diagonal matrix from both sides, we obtain
$IET-EISL2+XLLXNLTXNLXNNI-ETEI=SL2+XLL+ETXNL+XNLTE-(SL2+XLL)ET+XNLT+ETXNN-E(SL2+XLL)+XNL+XNNE-XNLET-EXNLT+XNN+O(|E|2)$
(4.21)
up to the first order of $E$. Equation 4.21 is diagonalized only when $-E(SL2+XLL)+XNL+XNNE=O$. Thus, we obtain
$Vec[E]=I⊗(SL2+XLL)-XNN⊗I-1Vec[XNL]≈I⊗SL-2Vec[XNL]⟺E≈XNLSL-2$
(4.22)
as the first-order approximation. Substituting $E=XNLSL-2$ into equation 4.21 yields the first-order approximation of major and minor eigenvalues:
$ΛM=SL2I+SL-2diag[XLL]+O(|E|2),Λm=XNN-2diag[XNLSL-2XNLT]+O(|E|2).$
(4.23)
Moreover, the first-order approximation of their corresponding eigenvectors is provided as
$(PM,Pm)=(UL,UN)I-ETEI=UL+UNE,UN-ULET=UL+UNXNLSL-2,UN-ULSL-2XNLT+O(|E|2).$
(4.24)
The accuracy of these approximations is empirically validated using numerical simulations (see Figure 2D), wherein an error in approximating $UL$ using $PM$ is analytically computed as $tr[ETE]/Ns=tr[SL-2ULTBΣBT(I-ULULT)BΣBTULSL-2]/Ns$ as the leading order. This error is negligibly small relative to the linearization error in equation 2.12 when inequality 2.7 holds.

### 4.5  Derivation of Linearization Error Covariance

Equation 2.13 is derived as follows: from equation 2.4, $(BH)+=f'¯-1(BA)++O(Ns-1)$ holds when $B$ is sufficiently isotropic. Thus, from equations 2.6 and 2.12, we obtain
$Cov[ɛ]=f2¯f'¯2-1(BA)+BBT(BA)+T+f(3)¯22Nsf'¯2I$
(4.25)
as the leading order. Here, $aaT$ and $Ξ$ in equation 2.6 are negligibly smaller than the leading-order eigenmodes when inequality 2.7 holds and $B$ is sufficiently isotropic. If we define the left-singular vectors of $A$ as $UA∈RNf×Ns$, because $A$ has $Ns$ singular values with the leading-order term of $(Nf/Ns)1/2$ (Marchenko & Pastur, 1967), $A=(Nf/Ns)1/2UA$ holds approximately. Thus, $(BA)+=(Ns/Nf)1/2(UATBTBUA)-1UATBT$ holds as the leading order.
Further, as $B$ is isotropic, $UATBTBUA$ is sufficiently close to the identity matrix; thus, $(UATBTBUA)-1≈I-UAT(BTB-I)UA$ provides the first-order approximation. Similarly, $UATBTBBTBUA=I+UAT{2(BTB-I)+(BTB-I)2}UA$ is sufficiently close to the identity matrix. Hence,
$(BA)+BBT(BA)+T≈NsNfI+UAT(BTB-I)2UA$
(4.26)
holds as the leading order. Thus, using $Δ≡UAT(BTB-I)2UA$, we obtain equation 2.13.

### 4.6  Hebbian-Like Learning Rules

Oja's subspace rule for PCA (Oja, 1989) is defined as
$W˙PCA∝EuPCA(x-E[x]-WPCATuPCA)T,$
(4.27)
where $WPCA∈RNs×Nx$ is a synaptic weight matrix, $uPCA≡WPCA(x-E[x])∈RNs$ is a vector of encoders, and the dot over $WPCA$ denotes a temporal derivative. This rule can extract a subspace spanned by the first $Ns$ principal components (i.e., $WPCA→ΩMPMT$) in a manner equal to PCA via eigenvalue decomposition. Although Oja's subspace rule does not have a cost function, a gradient descent rule for PCA, based on the least mean squared error, is known (Xu, 1993), whose cost function is given as $E[|x-E[x]-WPCATuPCA|2]$. Indeed, this equals the maximum likelihood estimation of $x-E[x]$ using a lower-dimensional linear encoder $uPCA$ under the assumption that the loss follows a unit gaussian distribution. The gradient descent on this cost function is equal to Oja's subspace rule up to an additional term that does not essentially change the behavior of the algorithm.
For BSS of $uPCA$, Amari's ICA algorithm is considered (Amari et al., 1996). Using encoders $u˜≡WICAuPCA∈RNs$ with synaptic weight matrix $WICA∈RNs×Ns$, the ICA cost function is defined as the Kullback–Leibler divergence between the actual distribution of $u˜$, $p(u˜)$, and its prior belief $p0(u˜)≡∏ip0(u˜i)$, $DKL≡DKL[p(u˜)||p0(u˜)]≡E[logp(u˜)-logp0(u˜)]$. The natural gradient of $DKL$ yields Amari's ICA algorithm,
$W˙ICA∝-∂DKL∂WICAWICATWICA=(I-Eg(u˜)u˜T)WICA,$
(4.28)
where $g(u˜)≡-dlogp0(u˜)/du˜$ is a nonlinear activation function.

This work was supported by RIKEN Center for Brain Science (T.I. and T.T.), Brain/MINDS from AMED under grant JP21dm020700 (T.T.), and JSPS KAKENHI grant JP18H05432 (T.T.). The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.

All relevant data are within the article. The Matlab scripts are available at https://github.com/takuyaisomura/asymptotic_linearization.

1

In this work, based on the literature (Cybenko, 1989), we define the universality in the component-wise sense as the condition wherein each element of $x=F(s)$ is approximated using the two-layer network with an approximation error smaller than an arbitrary $δ>0$, that is, $sups[|Fj(s)-Bjf(As+a)|]<δ$ for $j=1,…,Nx$ when $Nf$ is sufficiently large. Note that $Fj(s)$ is the $j$th element of $F(s)$ and $Bj$ is the $j$th row vector of $B$. It is known that when $f$ is a sigmoidal function and parameters are selected appropriately, the approximation error can be upper-bounded by the order $Nf-1$ (Barron, 1993); hence, $E[|Fj(s)-Bjf(As+a)|2]≤O(Nf-1)$. This relationship holds true when $Nf≫Ns≫1$, irrespective of $Nx$.

2

In this article, we refer to the neural network as the encoder because networks that convert the input data into a different (lower-dimensional) code or representation are widely recognized as encoders in the literature on machine learning (Hinton & Salakhutdinov, 2006; Goodfellow, Bengio & Courville, 2016). However, one may think that the generative process encodes the hidden sources into the sensory inputs through nonlinear mixtures. From this viewpoint, one may term the neural network as the decoder that decodes the inputs to recover the original sources.

3

Our numerical simulations suggest that the system behaves similarly for $Nx≥Nf≥Ns>1$ in some cases, although our theorem holds mathematically only when $Nx=Nf≫Ns≫1$.

4

This property can be understood as follows. Because $x$ is generated through $x=Bf(As+a)$ with a sufficiently isotropic $B$ that satisfies inequality 2.7, from the proposed theorem, the encoder $u=ΛM-1/2PMT(x-E[x])=Ω(s+ɛ)$ asymptotically becomes $u→Ωs$ in the limit of large $Nf/Ns$ and $Ns$. In addition, we define $s'=g(s)∈RNs$ as a nonlinear transformation of $s$. If there is another generative process $x=B'f'(A's'+a')$ that can generate the observed $x$ while satisfying inequality 2.7—where $A'$ and $a'$ are gaussian distributed matrix and vector, $f'$ is a nonlinear function (not the derivative of $f$, unlike in the main text), and $B'$ is a sufficiently isotropic matrix—the encoder can also be expressed as $u'=ΛM-1/2PMT(x-E[x])=Ω'(s'+ɛ')$, which asymptotically becomes $u'→Ω's'$ in the limit of large $Nf/Ns$ and $Ns$. Note that $ΛM$ and $PM$ are the same as above because $x$ does not change. However, while $u≡u'$ holds by construction, $Ωs¬≡Ω's'$ for any orthogonal matrices $Ω$ and $Ω'$ because $s'$ is a nonlinear transformation of $s$. Thus, such a generative process $x=B'f'(A's'+a')$ does not exist in the limit of large $Nf/Ns$ and $Ns$.

5

It should be noted that unlike the condition we considered (i.e., $Nx=Nf≫Ns≫1$), when $Nf≫Nx$, one may observe a counterexample of the proposed theorem such that the major principal components are dissimilar to a linear transformation of the true hidden sources, as is well-known in the literature on nonlinear ICA (Hyvarinen & Pajunen, 1999; Jutten & Karhunen, 2004). This is because when $Nf≫Nx$, a general $B$ may not be sufficiently isotropic. In other words, $Nf≫Nx$ corresponds to the case wherein $Nx$ is insufficient to ensure inequality, equation 2.7; thus, $Nx$ needs to be greater. Hence, it is remarkable that the proposed theorem specifies the condition under which the achievability of the nonlinear BSS is mathematically guaranteed.

Amari
,
S. I.
,
Chen
,
T. P.
, &
Cichocki
,
A.
(
1997
).
Stability analysis of learning algorithms for blind source separation
.
Neural Networks
,
10
(
8
),
1345
1351
.
Amari
,
S. I.
,
Cichocki
,
A.
, &
Yang
,
H. H.
(
1996
). A new learning algorithm for blind signal separation. In
D. S.
Touretzky
,
M. C.
Mozer
, &
M. E.
Hasselmo
(Eds.),
Advances in neural information processing systems
,
8
(pp.
757
763
).
Cambridge, MA
:
MIT Press
.
Arora
,
S.
, &
Risteski
,
A.
(
2017
).
Provable benefits of representation learning.
arXiv:1706.04601.
Baldi
,
P.
, &
Hornik
,
K.
(
1989
).
Neural networks and principal component analysis: Learning from examples without local minima
.
Neural Networks
,
2
(
1
),
53
58
.
Barron
,
A. R.
(
1993
).
Universal approximation bounds for superpositions of a sigmoidal function
.
IEEE Trans. Info. Theory
,
39
(
3
),
930
945
.
Bell
,
A. J.
, &
Sejnowski
,
T. J.
(
1995
).
An information-maximization approach to blind separation and blind deconvolution
.
Neural Comput.
,
7
(
6
),
1129
1159
.
Bell
,
A. J.
, &
Sejnowski
,
T. J.
(
1997
).
The “independent components” of natural scenes are edge filters
.
Vision Res.
,
37
(
23
),
3327
3338
.
Brown
,
G. D.
,
,
S.
, &
Sejnowski
,
T. J.
(
2001
).
Independent component analysis at the neural cocktail party
.
Trends Neurosci.
,
24
(
1
),
54
63
.
Bussgang
,
J. J.
(
1952
).
Cross-correlation functions of amplitude-distorted gaussian signals
(Technical Report 216).
Cambridge, MA
:
MIT Research Laboratory of Electronics
.
Calhoun
,
V. D.
,
Liu
,
J.
, &
,
T.
(
2009
).
A review of group ICA for fMRI data and ICA for joint inference of imaging, genetic, and ERP data
.
NeuroImage
,
45
(
1
),
S163
S172
.
Chandler
,
D. M.
, &
Field
,
D. J.
(
2007
).
Estimates of the information content and dimensionality of natural scenes from proximity distributions
.
J. Opt. Soc. Am. A
,
24
(
4
),
922
941
.
Chen
,
T.
,
Hua
,
Y.
, &
Yan
,
W. Y.
(
1998
).
Global convergence of Oja's subspace algorithm for principal component extraction
.
IEEE Trans. Neural Netw.
,
9
(
1
),
58
67
.
Cichocki
,
A.
,
Zdunek
,
R.
,
Phan
,
A. H.
, &
Amari
,
S. I.
(
2009
).
Nonnegative matrix and tensor factorizations: Applications to exploratory multi-way data analysis and blind source separation
.
Hoboken, NJ
:
Wiley
.
Comon
,
P.
(
1994
).
Independent component analysis, a new concept?
Sig. Process
,
36
(
3
),
287
314
.
Comon
,
P.
, &
Jutten
,
C.
(
2010
).
Handbook of blind source separation: Independent component analysis and applications
.
Orlando, FL
:
.
Cybenko
,
G.
(
1989
).
Approximation by superpositions of a sigmoidal function
.
Math Control Signals Syst.
,
2
(
4
),
303
314
.
Dahl
,
G. E.
,
Yu
,
D.
,
Deng
,
L.
, &
Acero
,
A.
(
2012
).
Context-dependent pretrained deep neural networks for large-vocabulary speech recognition
.
IEEE Trans. Audio Speech Lang. Proc.
,
20
(
1
),
30
42
.
Dayan
,
P.
, &
Abbott
,
L. F.
(
2001
).
Theoretical neuroscience: Computational and mathematical modeling of neural systems
.
Cambridge, MA
:
MIT Press
.
Dayan
,
P.
,
Hinton
,
G. E.
,
Neal
,
R. M.
, &
Zemel
,
R. S.
(
1995
).
The Helmholtz machine
.
Neural Comput.
,
7
(
5
),
889
904
.
DiCarlo
,
J. J.
,
Zoccolan
,
D.
, &
Rust
,
N.C.
(
2012
).
How does the brain solve visual object recognition?
Neuron
,
73
(
3
),
415
434
.
Dinh
,
L.
,
Krueger
,
D.
, &
Bengio
,
Y.
(
2014
).
NICE: Non-linear independent components estimation
. arXiv:1410.8516.
Erdogan
,
A. T.
(
2007
).
Globally convergent deflationary instantaneous blind source separation algorithm for digital communication signals
.
IEEE Trans. Signal Process.
,
55
(
5
),
2182
2192
.
Erdogan
,
A. T.
(
2009
).
On the convergence of ICA algorithms with symmetric orthogonalization
.
IEEE Trans. Signal Process.
,
57
(
6
),
2209
2221
.
Földiák
,
P.
(
1990
).
Forming sparse representations by local anti-Hebbian learning
.
Biol. Cybern.
,
64
(
2
),
165
170
.
Friston
,
K.
(
2008
).
Hierarchical models in the brain
.
PLOS Comput. Biol.
,
4
, e1000211.
Friston
,
K.
,
Trujillo-Barreto
,
N.
, &
Daunizeau
,
J.
(
2008
).
DEM: A variational treatment of dynamic systems
.
NeuroImage
,
41
(
3
),
849
885
.
Gerstner
,
W.
, &
Kistler
,
W. M.
(
2002
).
Spiking neuron models: Single neurons, populations, plasticity
.
Cambridge
:
Cambridge University Press
.
Goodfellow
,
I.
,
Bengio
,
Y.
, &
Courville
,
A.
(
2016
).
Deep learning
. (
Cambridge, MA
:
MIT Press
).
Griffiths
,
D. J.
(
2005
).
Introduction to quantum mechanics
(2nd ed.).
:
Pearson Prentice Hall
.
Hebb
,
D. O.
(
1949
).
The organization of behavior: A neuropsychological theory
.
New York
:
Wiley
.
Hinton
,
G. E.
, &
Salakhutdinov
,
R. R.
(
2006
).
Reducing the dimensionality of data with neural networks
.
Science
,
313
(
5786
),
504
507
.
Hinton
,
G. E.
,
Srivastava
,
N.
,
Krizhevsky
,
A.
,
Sutskever
,
I.
, &
Salakhutdinov
,
R. R.
(
2012
).
Improving neural networks by preventing co-adaptation of feature detectors
. arXiv:1207.0580.
Hornik
,
K.
,
Stinchcombe
,
M.
, &
White
,
H.
(
1989
).
Multilayer feedforward networks are universal approximators
.
Neural Netw.
,
2
(
5
),
359
366
.
Hyvärinen
,
A.
, &
Morioka
,
H.
(
2016
). Unsupervised feature extraction by time-contrastive learning and nonlinear ICA. In
D.
Lee
,
M.
Sugiyama
,
U.
Luxburg
,
I.
Guyon
, &
R.
Garnett
(Eds.),
Advances in neural information processing systems
, 29 (pp.
3765
3773
).
Red Hook, NY
:
Curran
.
Hyvärinen
,
A. J.
, &
Morioka
,
H.
(
2017
).
Nonlinear ICA of temporally dependent stationary sources
. In
Proceedings of the Conference on Machine Learn Research
.
Hyvärinen
,
A.
, &
Oja
,
E.
(
1997
).
A fast fixed-point algorithm for independent component analysis
.
Neural Comput.
,
9
(
7
),
1483
1492
.
Hyvärinen
,
A.
, &
Pajunen
,
P.
(
1999
).
Nonlinear independent component analysis: Existence and uniqueness results
.
Neural Netw.
,
12
(
3
),
429
439
.
Isomura
,
T.
, &
Friston
,
K.
(
2018
).
In vitro neural networks minimise variational free energy
.
Sci. Rep.
,
8
, 16926.
Isomura
,
T.
, &
Friston
,
K.
(
2020
).
Reverse engineering neural networks to characterize their cost functions
.
Neural Comput.
,
32
(
11
),
2085
2121
.
Isomura
,
T.
,
Kotani
,
K.
, &
Jimbo
,
Y.
(
2015
).
Cultured cortical neurons can perform blind source separation according to the free-energy principle
.
PLOS Comput. Biol.
,
11
(
12
), e1004643.
Isomura
,
T.
, &
Toyoizumi
,
T.
(
2016
).
A local learning rule for independent component analysis
.
Sci. Rep.
,
6
, 28073.
Isomura
,
T.
, &
Toyoizumi
,
T.
(
2018
).
Error-gated Hebbian rule: A local learning rule for principal and independent component analysis
.
Sci. Rep.
,
8
, 1835.
Isomura
,
T.
, &
Toyoizumi
,
T.
(
2019
).
Multi-context blind source separation by error-gated Hebbian rule
.
Sci. Rep.
,
9
, 7127.
Isomura
,
T.
, &
Toyoizumi
,
T.
(
2021
).
Dimensionality reduction to maximize prediction generalization capability
. Nat. Mach. Intell., in press.
Jolliffe
,
I. T.
(
2002
).
Principal component analysis
(2nd ed.).
Berlin
:
Springer
.
Jutten
,
C.
, &
Karhunen
,
J.
(
2004
).
Advances in blind source separation (BSS) and independent component analysis (ICA) for nonlinear mixtures
.
Int. J. Neural. Syst.
,
14
(
5
),
267
292
.
Kandel
,
E. R.
,
Schwartz
,
J. H.
,
Jessell
,
T., M.
,
Siegelbaum
,
S. A.
, &
Hudspeth
,
A. J.
(
2013
).
Principles of neural science
(5th ed.).
New York
:
McGraw-Hill
.
Karhunen
,
J.
(
2001
).
Nonlinear independent component analysis
. In
S.
Roberts
&
R.
Everson
(Eds.),
Independent component analysis: Principles and practice
(pp.
113
134
).
Cambridge
:
Cambridge University Press
.
Kawaguchi
,
K.
(
2016
). Deep learning without poor local minima. In
D.
Lee
,
M.
Sugiyama
,
U.
Luxburg
,
I.
Guyon
, &
R.
Garnett
(Eds.),
Advances in neural information processing systems
,
29
.
Red Hook, NY
:
Curran
.
Khemakhem
,
I.
,
Kingma
,
D.
,
Monti
,
R.
, &
Hyvarinen
,
A.
(
2020
).
Variational autoencoders and nonlinear ICA: A unifying framework
. In
Proceedings of the International Conference on Artificial Intelligence and Statistics
2207
2217
.
ML Research Press
.
Kingma
,
D. P.
, &
Welling
,
M.
(
2013
).
Auto-encoding variational Bayes
. arXiv:1312.6114.
Lappalainen
,
H.
, &
Honkela
,
A.
(
2000
).
Bayesian non-linear independent component analysis by multi-layer perceptrons
. In
M.
Girolami
, (Ed.),
(pp.
93
121
).
London
:
Springer
.
Leugering
,
J.
, &
Pipa
,
G.
(
2018
).
A unifying framework of synaptic and intrinsic plasticity in neural populations
.
Neural Comput.
,
30
(
4
),
945
986
.
Linsker
,
R.
(
1997
).
A local learning rule that enables information maximization for arbitrary input distributions
.
Neural Comput.
,
9
(
8
),
1661
1665
.
Lu
,
H.
, &
Kawaguchi
,
K.
(
2017
).
Depth creates no bad local minima
. arXiv:1702.08580.
Malenka
,
R. C.
, &
Bear
,
M. F.
(
2004
).
LTP and LTD: An embarrassment of riches
.
Neuron
,
44
(
1
),
5
21
.
Marchenko
,
V. A.
, &
Pastur
,
L. A.
(
1967
).
Distribution of eigenvalues for some sets of random matrices
.
Mat. Sbornik
,
114
,
507
536
.
Mika
,
D.
,
Budzik
,
G.
, &
Józwik
,
J.
(
2020
).
Single channel source separation with ICA-based time-frequency decomposition
.
Sensors
,
20
(
7
), 2019.
Nguyen
,
Q.
, &
Hein
,
M.
(
2017
).
The loss surface of deep and wide neural networks
. arXiv:1704.08045.
Oja
,
E.
(
1982
).
Simplified neuron model as a principal component analyzer
.
J. Math. Biol.
,
15
(
3
),
267
273
.
Oja
,
E.
(
1989
).
Neural networks, principal components, and subspaces
.
Int. J. Neural. Syst.
,
1
(
10
),
61
68
.
Oja
,
E.
, &
Yuan
,
Z.
(
2006
).
The FastICA algorithm revisited: Convergence analysis
.
IEEE Trans. Neural Netw.
,
17
(
6
),
1370
1381
.
,
C. B.
(
2000
).
Globally convergent blind source separation based on a multiuser kurtosis maximization criterion
.
IEEE Trans. Signal Process.
,
48
(
12
),
3508
3519
.
Pearson
,
K.
(
1901
).
On lines and planes of closest fit to systems of points in space
.
Philos. Mag.
,
2
(
11
),
559
572
.
Pehlevan
,
C.
,
Mohan
,
S.
, &
Chklovskii
,
D. B.
(
2017
).
Blind nonnegative source separation using biological neural networks
.
Neural Comput.
,
29
(
11
),
2925
2954
.
Rahimi
,
A.
, &
Recht
,
B.
(
2008a
).
Uniform approximation of functions with random bases.
In
Proceedings of the 46th Annual Allerton Conference on Communication, Control, and Computing
(pp.
555
561
).
Red Hook, NY
:
Curran
.
Rahimi
,
A.
, &
Recht
,
B.
(
2008b
). Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In
D.
Koller
,
D.
Schuurmans
,
Y.
Bengio
, &
L.
Bottou
(Eds.),
Advances in neural information processing systems
, 21 (pp.
1313
1320
).
Cambridge, MA
:
MIT Press
.
Sanger
,
T. D.
(
1989
).
Optimal unsupervised learning in a single-layer linear feedforward neural network
.
Neural Netw.
,
2
(
6
),
459
473
.
Toyoizumi
,
T.
, &
Abbott
,
L. F.
(
2011
).
Beyond the edge of chaos: Amplification and temporal integration by recurrent networks in the chaotic regime
.
Phys. Rev. E.
,
84
(
5
), 051908.
van der Lee
,
T.
,
Exarchakos
,
G.
, &
de Groot
,
S. H.
(
2019
).
In-network Hebbian plasticity for wireless sensor networks
. In
Proceedings of the International Conference on Internet and Distributed Computing Systems
(pp.
79
88
).
Cham
:
Springer
.
Wan
,
L.
,
Zeiler
,
M.
,
Zhang
,
S.
,
LeCun
,
Y.
, &
Fergus
,
R.
(
2013
).
Regularization of neural networks using DropConnect
. In
Proceedings of Machine Learning Research
(
28
)
3
,
1058
1066
.
Wentzell
,
P. D.
,
Andrews
,
D. T.
,
Hamilton
,
D. C.
,
Faber
,
K.
, &
Kowalski
,
B. R.
(
1997
).
Maximum likelihood principal component analysis
.
J. Chemom.
,
11
(
4
),
339
366
.
Xu
,
L.
(
1993
).
Least mean square error reconstruction principle for self-organizing neural–nets
.
Neural Netw.
,
6
(
5
),
627
648
.
This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. For a full description of the license, please visit https://creativecommons.org/licenses/by/4.0/legalcode.