For many years, a combination of principal component analysis (PCA) and independent component analysis (ICA) has been used for blind source separation (BSS). However, it remains unclear why these linear methods work well with real-world data that involve nonlinear source mixtures. This work theoretically validates that a cascade of linear PCA and ICA can solve a nonlinear BSS problem accurately—when the sensory inputs are generated from hidden sources via nonlinear mappings with sufficient dimensionality. Our proposed theorem, termed the asymptotic linearization theorem, theoretically guarantees that applying linear PCA to the inputs can reliably extract a subspace spanned by the linear projections from every hidden source as the major components—and thus projecting the inputs onto their major eigenspace can effectively recover a linear transformation of the hidden sources. Then subsequent application of linear ICA can separate all the true independent hidden sources accurately. Zero-element-wise-error nonlinear BSS is asymptotically attained when the source dimensionality is large and the input dimensionality is sufficiently larger than the source dimensionality. Our proposed theorem is validated analytically and numerically. Moreover, the same computation can be performed by using Hebbian-like plasticity rules, implying the biological plausibility of this nonlinear BSS strategy. Our results highlight the utility of linear PCA and ICA for accurately and reliably recovering nonlinearly mixed sources and suggest the importance of employing sensors with sufficient dimensionality to identify true hidden sources of real-world data.

## 1 Introduction

Blind source separation (BSS) involves the separation of mixed sensory inputs into their hidden sources without knowledge of the manner in which they were mixed (Cichocki, Zdunek, Phan, & Amari, 2009; Comon & Jutten, 2010). Among the numerous BSS methods, a combination of principal component analysis (PCA) (Pearson, 1901; Oja, 1982, 1989; Sanger, 1989; Xu, 1993; Jolliffe, 2002) and independent component analysis (ICA) (Comon, 1994; Bell & Sejnowski, 1995, 1997; Amari, Cichocki, & Yang, 1996; Hyvarinen & Oja, 1997) is one of the most widely used approaches. In this combined PCA–ICA approach, PCA yields a low-dimensional concise representation (i.e., the major principal components) of sensory inputs that most suitably describes the original high-dimensional redundant inputs, whereas ICA provides a representation (i.e., encoders) that separates the compressed sensory inputs into independent hidden sources. A classical setup for BSS assumes a linear generative process (Bell & Sejnowski, 1995), in which sensory inputs are generated as a linear superposition of independent hidden sources. The linear BSS problem has been extensively studied both analytically and numerically (Amari, Chen, & Cichocki, 1997; Oja & Yuan, 2006; Erdogan, 2009), where the cascade of PCA and ICA is guaranteed to provide the optimal linear encoder that can separate sensory inputs into their true hidden sources, up to permutations and sign-flips (Baldi & Hornik, 1989; Chen, Hua, & Yan, 1998; Papadias, 2000; Erdogan, 2007).

Another crucial perspective is the applicability of BSS methods to real-world data generated from a nonlinear generative process. In particular, the aim of nonlinear BSS is to identify the inverse of a nonlinear generative process that generates sensory inputs and thereby infer their true independent hidden sources based exclusively on the sensory inputs. Although the cascade of linear PCA and ICA has been applied empirically to real-world BSS problems (Calhoun, Liu, & Adali, 2009), no one has yet theoretically proven that this linear BSS approach can solve a nonlinear BSS problem. To address this gap, this work demonstrates mathematically that the cascade of PCA and ICA can solve a nonlinear BSS problem accurately when the source dimensionality is large and the input dimensionality is sufficiently larger than the source dimensionality so that various nonlinear mappings from sources to inputs can be determined from the inputs themselves.

In general, there are five requirements for solving the nonlinear BSS problem. The first two requirements are related to the representation capacity of the encoder: (1) the encoder's parameter space must be sufficiently large to accommodate the actual solution that can express the inverse of the true generative process. (2) However, this parameter space should not be too large; otherwise, a nonlinear BSS problem can have infinitely many spurious solutions wherein all encoders are independent but dissimilar to the true hidden sources (Hyvarinen & Pajunen, 1999; Jutten & Karhunen, 2004). Hence, it is important to constrain the representation capacity of the encoder in order to satisfy these opposing requirements. A typical approach for solving the nonlinear BSS problem involves using a multilayer neural network—with nonlinear activation functions—that learns the inverse of the generative process (Lappalainen & Honkela, 2000; Karhunen, 2001; Hinton & Salakhutdinov, 2006; Kingma & Welling, 2013; Dinh, Krueger, & Bengio, 2014). The remaining three requirements are related to the unsupervised learning algorithms used to identify the optimal parameters for the encoder: (3) the learning algorithm must have a fixed point at which the network expresses the inverse of the generative process; (4) the fixed point must be linearly stable so that the learning process converges to the solution; and (5) the probability of not converging to this solution should be small, that is, most realistic initial conditions must be within the basin of attraction of the true solution.

Approaches using a nonlinear multilayer neural network satisfy requirement 1 when the number of neurons in each layer is sufficient (universality: Cybenko, 1989; Hornik, Stinchcombe, & White, 1989; Barron, 1993); moreover, learning algorithms that satisfy requirements 3 and 4 are also known (Dayan, Hinton, Neal, & Zemel, 1995; Friston, 2008; Friston, Trujillo-Barreto, & Daunizeau, 2008). However, reliable identification of the true hidden sources is still not guaranteed because the encoder can have infinitely many spurious solutions if its representation capacity is too large (i.e., if requirement 2 is violated). As previously indicated, to the best of our knowledge, there is no theoretical proof that a general nonlinear BSS problem can be solved (Hyvarinen & Pajunen, 1999; Jutten & Karhunen, 2004), except for some cases wherein temporal information—such that each independent source has its own dynamics—is available (Hyvarinen & Morioka, 2016, 2017; Khemakhem, Kingma, Monti, & Hyvarinen, 2020). Moreover, even when requirement 2 is satisfied, there is no guarantee that a learning algorithm will converge to the true hidden source representation because it might be trapped in a local minimum wherein outputs are still not independent of each other. Thus, for a nonlinear BSS problem, it is of paramount importance to simplify the parameter space of the inverse model in order to remove spurious solutions and prevent the learning algorithm from attaining a local minimum (to satisfy requirements 2 and 5) while retaining its capacity to represent the actual solution (requirement 1). Hence, in this work, we apply a linear approach to solving a nonlinear BSS problem in order to ensure requirements 2 and 5 are satisfied.
We demonstrate that the cascade of PCA and ICA can reliably identify a good approximation of the inverse of nonlinear generative processes asymptotically under the condition where the source dimensionality is large and the input dimensionality is sufficiently larger than the source dimensionality (thus satisfying requirements 1–5). Although such a condition differs from the case typically considered by earlier work—where the sources and inputs have the same dimensionality—the condition we consider turns out to be apt for mathematically justifying the achievability of nonlinear BSS.

## 2 Results

### 2.1 Overview

^{1}(Cybenko, 1989; Hornik et al., 1989; Barron, 1993) and each element of $x$ can represent an arbitrary mapping $x=F(s)$ as $N_f$ increases by adjusting parameters $a$, $A$, and $B$. We further suppose that each element of $A$ and $a$ is independently generated from a gaussian distribution $N[0,1/N_s]$, which retains its universality (Rahimi & Recht, 2008a, 2008b) as long as $B$ is tuned to minimize the mean squared error $E[\|Bf-F(s)\|^2]$. Here, $E[\cdot]$ denotes the average over $p_s(s)$. The scaling of $A$ ensures that the argument of $f$ is of order 1. The $N_s^{-1/2}$-order offset $a$ is introduced to this model to express any generative process $F(s)$; however, it is negligibly small relative to $As$ for large $N_s$. The whole system, including the generative process and neural network, is depicted in Figure 1 (left). The corresponding equations are summarized in Figure 1 (right).

In what follows, we demonstrate that owing to the aforementioned property, the first $N_s$ major principal components of the input covariance precisely match the signal covariance when the source dimensionality is large and the input dimensionality is sufficiently larger than the source dimensionality. Consequently, projecting the inputs onto the subspace spanned by major eigenvectors can effectively extract the linear components in the inputs. Moreover, the same projection can effectively filter out the nonlinear components in the inputs because the majority of the nonlinear components are perpendicular to the major eigenspace. Thus, applying PCA to the inputs enables the recovery of a linear transformation of all the true hidden sources of the mixed sensory inputs with a small estimation error. This property, termed asymptotic linearization, enables the reduction of the original nonlinear BSS problem to a simple linear BSS problem, consequently satisfying requirements 1 to 5. In the remainder of this article, we mathematically validate this theorem for a wide range of nonlinear setups and thereby demonstrate that linear encoder neural networks can perform the nonlinear BSS in a self-organizing manner.^{2} We assume that $N_x=N_f\gg N_s\gg 1$ throughout the article.^{3}
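This claim can be checked with a minimal numerical sketch. The script below is illustrative rather than taken from the article: the tanh nonlinearity, uniform sources, $B$ set to the identity, and the dimensions $N_s=20$, $N_f=N_x=1000$ are our own hypothetical choices (consistent with the odd-function and symmetric-distribution assumptions introduced in section 2.2). It generates inputs through a random two-layer process and measures how close the PCA projection is to a linear mixture of the true sources.

```python
import numpy as np

rng = np.random.default_rng(0)
Ns, Nf, T = 20, 1000, 5000   # source dim, basis/input dim (Nx = Nf), samples

# Symmetric, non-gaussian, unit-variance hidden sources
s = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(Ns, T))

# Random two-layer generative process x = B f(As) with an odd f; here B = I
A = rng.normal(0.0, 1.0 / np.sqrt(Ns), size=(Nf, Ns))
x = np.tanh(A @ s)
x -= x.mean(axis=1, keepdims=True)

# PCA: project the inputs onto the Ns major eigenvectors of the input covariance
eigvec = np.linalg.eigh(x @ x.T / T)[1]
P = eigvec[:, -Ns:]
u = P.T @ x

# If asymptotic linearization holds, u is close to a linear mixture Omega s:
# regress u on s and measure the unexplained (nonlinear) variance fraction
W = u @ s.T @ np.linalg.inv(s @ s.T)
err = ((u - W @ s) ** 2).sum() / (u ** 2).sum()
print(f"nonlinear residual fraction: {err:.4f}")   # small when Nf >> Ns >> 1
```

A subsequent linear ICA step (omitted here) would then unmix $u$ into the individual sources, as described in section 2.3.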

### 2.2 PCA Can Extract Linear Projections of All True Hidden Sources

In this section, we first demonstrate that the major principal components of the input covariance precisely match the signal covariance when $N_f\gg N_s\gg 1$ by analytically calculating eigenvalues of $HH^T$ and $\Sigma$ for the aforementioned system. For analytical tractability, we assume that $f(\cdot)$ is an odd nonlinear function. This assumption does not weaken our proposed theorem because the presumed generative process in equation 2.1 remains universal. We further assume that $p_s(s)$ is a symmetric distribution.

Each element of—and any pair of two elements in—a vector $y\equiv As$ is approximately gaussian distributed for large $N_s$ owing to the central limit theorem. The deviation of their marginal distribution $p(y_i,y_j)$ from the corresponding zero-mean gaussian distribution $p_N(y_i,y_j)\equiv N[0,\tilde{A}\tilde{A}^T]$ is of order $N_s^{-1}$ because the source distribution is symmetric, where $\tilde{A}\in\mathbb{R}^{2\times N_s}$ indicates a submatrix of $A$ comprising its $i$th and $j$th rows (see lemma 1 and its proof in section 4.1). This asymptotic property allows us to compute $H$ and $\Sigma$ based on the expectation over a tractable gaussian distribution $p_N(y_i,y_j)$—as a proxy for the expectation over $p(y_i,y_j)$—as the leading order for large $N_s$, despite the fact that $s$ actually follows a nongaussian distribution (note that $p(y_i,y_j)$ converges to $p_N(y_i,y_j)$ in the large-$N_s$ limit). We denote the expectation over $p_N(y_i,y_j)$ as $E_N[\cdot]\equiv\int\cdot\,p_N(y_i,y_j)\,dy_i\,dy_j$ to distinguish it from $E[\cdot]$. The latter can be rewritten as $E[\cdot]=E_N[\cdot\,(1+G(y_i,y_j)/N_s)]$ using an order-one function $G(y_i,y_j)$ that characterizes the deviation of $p(y_i,y_j)$ from $p_N(y_i,y_j)$; thus, the deviation caused by this approximation is negligibly smaller than the leading order in the following analyses.
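The shrinking deviation from gaussianity can be observed directly. The following sketch (our own illustration; the uniform source distribution, sample sizes, and the choice of excess kurtosis as the deviation measure are hypothetical choices, not from the article) estimates the excess kurtosis of one element of $y=As$, which vanishes for an exact gaussian, at two source dimensionalities:

```python
import numpy as np

rng = np.random.default_rng(1)

def mean_excess_kurtosis(Ns, reps=20, T=50_000):
    # Excess kurtosis of one element of y = A s (zero for an exact gaussian),
    # averaged over random rows a ~ N[0, 1/Ns] of A
    ks = []
    for _ in range(reps):
        s = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(Ns, T))
        a = rng.normal(0.0, 1.0 / np.sqrt(Ns), size=Ns)
        y = a @ s
        y = (y - y.mean()) / y.std()
        ks.append((y ** 4).mean() - 3.0)
    return float(np.mean(ks))

k_small, k_large = mean_excess_kurtosis(10), mean_excess_kurtosis(100)
print(k_small, k_large)   # the deviation shrinks as Ns grows
```

For uniform sources the deviation decays roughly in inverse proportion to $N_s$, in line with the order-$N_s^{-1}$ bound stated above.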

One can further proceed to the calculation of $\Sigma$ by explicitly computing the coefficients $E_N[\varphi_i^{(n)}]$ up to the fourth order. Because $f(\cdot)$ is an odd function, using $\mathrm{Cov}_N[y_i,y_j]^n=((AA^T)^{\odot n})_{ij}$, equation 2.5 becomes $\mathrm{Cov}_N[\varphi_i,\varphi_j]=E_N[f^{(3)}(y_i)]E_N[f^{(3)}(y_j)]\{a_ia_j((AA^T)^{\odot 2})_{ij}/2+((AA^T)^{\odot 3})_{ij}/6\}+O(N_s^{-5/2})$ (see remark 2 in section 4.2 for details). This analytical expression involves the Hadamard (element-wise) power of matrix $AA^T$ (denoted by $\odot$), for example, $(AA^T)^{\odot 3}\equiv(AA^T)\odot(AA^T)\odot(AA^T)$. When $N_f>N_s^2\gg 1$, $(AA^T)^{\odot 3}$ has $N_s$ major eigenvalues—all of which are $3N_f/N_s^2$ (as the leading order) with the corresponding eigenvectors that match the directions of $AA^T$—and $(N_f-N_s)$ minor eigenvalues that are negligibly smaller than order $N_f/N_s^2$ (see lemma 3 in section 4.3). This property yields an approximation $(AA^T)^{\odot 3}-\mathrm{diag}[(AA^T)^{\odot 3}]=(3/N_s)(AA^T-\mathrm{diag}[AA^T])$ up to the negligible minor eigenmodes. Note that when $N_s^2>N_f\gg 1$, off-diagonal elements of $(AA^T)^{\odot 3}$ are negligible compared to diagonal elements of $\Sigma$ (see below). Similarly, $(AA^T)^{\odot 2}$ has one major eigenvalue, $N_f/N_s$, in the direction of $(1,\ldots,1)^T$, while its other minor eigenmodes are negligible. These approximations are used to compute off-diagonal elements of $\Sigma$.
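The stated off-diagonal approximation of the Hadamard cube can be verified numerically. In the sketch below (an illustrative check of ours, not from the article; $N_s=16$ and $N_f=2000$ are arbitrary values satisfying $N_f>N_s^2$), the operator-norm error of the approximation is compared against the scale of its leading term, $3N_f/N_s^2$:

```python
import numpy as np

rng = np.random.default_rng(2)
Ns, Nf = 16, 2000                        # satisfies Nf > Ns**2 >> 1

A = rng.normal(0.0, 1.0 / np.sqrt(Ns), size=(Nf, Ns))
G = A @ A.T
M3 = G ** 3                              # Hadamard cube (AA^T)^{(o)3}

# Off-diagonal approximation used in the text:
# (AA^T)^{(o)3} - diag[...] ~= (3/Ns)(AA^T - diag[AA^T]),
# up to the negligible minor eigenmodes
lhs = M3 - np.diag(np.diag(M3))
rhs = (3.0 / Ns) * (G - np.diag(np.diag(G)))
rel_err = np.linalg.norm(lhs - rhs, 2) / np.linalg.norm(rhs, 2)
print(f"relative operator-norm error: {rel_err:.3f}")
```

The relative error shrinks as $N_f/N_s^2$ grows, consistent with lemma 3's claim that the minor eigenmodes are negligible in this regime.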

The corrections of $\Lambda_M$ and $P_M$ due to the presence of the residual covariance are estimated as follows: by first-order perturbation theory (Griffiths, 2005), the correction of the $i$th major eigenvalue of $\mathrm{Cov}[x]$ is upper-bounded by $\mathrm{maxeig}[B\Sigma B^T]$, while the norm of the correction of the $i$th eigenvector is upper-bounded by $\mathrm{maxeig}[B\Sigma B^T]/\mathrm{mineig}[H^TB^TBH]$. These corrections are negligibly smaller than the leading-order terms when inequality 2.7 holds. Further details are provided in section 4.4.

### 2.3 Asymptotic Linearization Theorem

**Theorem 1**

The derivation details of equation 2.13 are provided in section 4.5. Equation 2.13 indicates that the linearization error monotonically decreases as $N_f/N_s$ and $N_s$ increase; thus, $u$ converges to a linear mixture of all the true hidden sources ($u\to\Omega s$) in the limit of large $N_f/N_s$ and $N_s$. Only the anisotropy of $B^TB$ in the directions of $U_A$ increases the linearization error. In essence, applying PCA to the inputs effectively filters out nonlinear components in the inputs because the majority of nonlinear components are perpendicular to the directions of the signal covariance.

Although the elements of the obtained encoder are not independent of each other because of the multiplication with $\Omega$, it is remarkable that the proposed approach enables the conversion of the original nonlinear BSS problem into a simple linear BSS problem. This indicates that $u$ can be separated into independent encoders by further applying a linear ICA method (Comon, 1994; Bell & Sejnowski, 1995, 1997; Amari et al., 1996; Hyvarinen & Oja, 1997) to it, and these independent encoders match the true hidden sources up to permutations and sign-flips. Zero-element-wise-error nonlinear BSS is attained in the limit of large $N_f/N_s$ and $N_s$. As a side note, PCA can indeed recover a linear transformation of true hidden sources in the inputs even when these sources have higher-order correlations, provided the average of these correlations converges to zero with high source dimensionality. This property is potentially useful, for example, for identifying true hidden states of time series data generated from nonlinear systems (Isomura & Toyoizumi, 2021).

In summary, we analytically quantified the accuracy of the optimal linear encoder—obtained through the cascade of PCA and ICA—in inverting the nonlinear generative process to identify all hidden sources. The encoder increases its accuracy as $N_f/N_s$ and $N_s$ increase and asymptotically attains the true hidden sources.

The log-log plot illustrates that the magnitude of element-wise BSS errors—scored by the mean squared error—decreases in inverse proportion to $N_x/N_s$; however, it saturates around $N_x=N_s^2$ (see Figure 3D). These observations validate equation 2.13, which asserts that the linearization error is determined by the sum of $N_s/N_f$- and $1/N_s$-order terms. Although equation 2.13 overestimates the BSS error when $N_s=10$, this is because each dimension of $As$ significantly deviates from a gaussian variable due to small $N_s$—as $s$ is sampled from a uniform distribution in these simulations. We confirmed that when hidden sources are generated from a distribution close to gaussian, actual BSS errors shift toward the theoretical value of equation 2.13 even when $N_s=10$ (see the cross marks in Figure 3D). Indeed, this deviation disappears for large $N_s$ according to the central limit theorem. Therefore, equation 2.13 is a good approximation of actual BSS errors for $N_x=N_f\gg N_s\gg 1$, and the proposed theorem suitably predicts the performance of the PCA–ICA cascade for a wide range of nonlinear BSS setups.

### 2.4 Hebbian-Like Learning Rules Can Reliably Solve the Nonlinear BSS Problem

As a corollary of the proposed theorem, a linear neural network can identify true hidden sources through Hebbian-like plasticity rules in the nonlinear BSS setup under consideration. Oja's subspace rule (Oja, 1989)—which is a modified version of the Hebbian plasticity rule (see section 4.6)—is a well-known PCA approach that extracts the major eigenspace without yielding a spurious solution or attaining a local minimum (Baldi & Hornik, 1989; Chen et al., 1998). Thus, with generic random initial synaptic weights, this Hebbian-like learning rule can quickly and reliably identify an optimal linear encoder that can recover true hidden sources from their nonlinear mixtures in a self-organizing or unsupervised manner.
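Oja's subspace rule is concise enough to sketch. The batch-averaged implementation below is our own illustration (the tanh/uniform generative process, all dimensions, the learning rate, and the iteration count are hypothetical choices, not from the article): starting from random synaptic weights, the rule converges to the major eigenspace of the input covariance, matching what PCA extracts.

```python
import numpy as np

rng = np.random.default_rng(3)
Ns, Nf, T = 10, 300, 5000

# Hypothetical nonlinear generative process (uniform sources, tanh bases, B = I)
s = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(Ns, T))
A = rng.normal(0.0, 1.0 / np.sqrt(Ns), size=(Nf, Ns))
x = np.tanh(A @ s)
x -= x.mean(axis=1, keepdims=True)
C = x @ x.T / T

# Oja's subspace rule, batch-averaged over inputs:
# dW = eta * (E[u x^T] - E[u u^T] W) with encoder outputs u = W x
W = rng.normal(0.0, 0.1 / np.sqrt(Nf), size=(Ns, Nf))
eta = 0.01
for _ in range(2000):
    WC = W @ C
    W += eta * (WC - WC @ W.T @ W)

# The learned row space should coincide with the Ns-dimensional major eigenspace
eigvec = np.linalg.eigh(C)[1][:, -Ns:]
Q = np.linalg.qr(W.T)[0]                 # orthonormal basis of the learned rows
alignment = np.linalg.norm(eigvec.T @ Q) ** 2 / Ns
print(f"subspace alignment: {alignment:.4f}")   # ~1 when the subspaces match
```

Averaging the update over samples is a simplification for clarity; the online (sample-by-sample) form of the rule has the same fixed point.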

## 3 Discussion

In this work, we theoretically quantified the accuracy of nonlinear BSS performed using the cascade of linear PCA and ICA when sensory inputs are generated from a two-layer nonlinear generative process. First, we demonstrated that as the dimensionality of hidden sources increases and the dimensionalities of sensory inputs and nonlinear bases increase relative to the source dimensionality, the first $N_s$ major principal components approximately express a subspace spanned by the linear projections from all hidden sources of the sensory inputs. Under the same condition, we then demonstrated that the optimal linear encoder obtained by projecting the inputs onto the major eigenspace can accurately recover all the true hidden sources from their nonlinear mixtures. This property is termed the asymptotic linearization theorem. The accuracy of the subspace extraction increases as $N_f/N_s$ and $N_s$ increase because the gap between the minimum eigenvalue of the linear (i.e., signal) components and the maximum eigenvalue of the nonlinear (i.e., residual) components becomes significantly large. Hebbian-like plasticity rules can also identify the optimal linear encoder by extracting major principal components in a manner equivalent to PCA. Subsequent application of linear ICA to the extracted principal components can reliably identify all the true hidden sources up to permutations and sign-flips. Unlike conventional nonlinear BSS methods that can yield spurious solutions (Hyvarinen & Pajunen, 1999; Jutten & Karhunen, 2004), the PCA–ICA cascade is guaranteed to identify the true hidden sources in the asymptotic condition because it successfully satisfies requirements 1 to 5 specified earlier.

The unique identification of true hidden sources $s$ (up to permutations and sign-flips) is widely recognized only under the linear BSS setup. In contrast, it is well known that conventional nonlinear BSS approaches using nonlinear neural networks do not guarantee the identification of true hidden sources under the general nonlinear BSS setup (Hyvarinen & Pajunen, 1999; Jutten & Karhunen, 2004). One may ask whether nonlinear BSS methods find a component-wise nonlinear transformation of the original sources, $s'=g(s)$, instead of $s$, because $s'$ is still an independent source representation. This is observed when conventional nonlinear BSS approaches based on nonlinear ICA are employed, because nonlinear ICA finds one of many representations that minimize the dependency among outputs. However, such nonlinear BSS approaches do not guarantee the identification of true sources. This is because when $g'(s)$ is a component-wise nonlinear transformation that renders a nongaussian source gaussian distributed, a transformation $s''=g(Rg'(s))$—characterized by some orthogonal matrix $R$ and component-wise nonlinear transformation $g$—can yield an arbitrary independent representation $s''$ that differs from the original sources. The former ($s'$) is a special case of the latter ($s''$). Even in the former case, finding $s'$ transformed by a highly nonlinear, nonmonotonic function $g(s)$ is problematic. Thus, general nonlinear BSS is essentially an ill-posed problem. Having said this, earlier works typically investigated the case in which the sources and inputs have the same dimensionality, while other conditions have not been extensively analyzed. Thus, in the current work, we focused on nonlinear BSS under the condition where $N_x=N_f\gg N_s\gg 1$.

A remarkable aspect of the proposed nonlinear BSS approach is that it utilizes linear PCA to extract the linear components of the original sources from the mixed sensory inputs. This process enables the reduction of the original nonlinear BSS to a simple linear BSS and consequently ensures the reliable identification of the true sources. In other words, the proposed approach does not rely on nonlinear ICA; thus, no concerns exist regarding the creation of the already noted spurious solutions. We mathematically demonstrated the absence of such spurious solutions when $N_x=N_f\gg N_s\gg 1$. Specifically, we adopted a minimal assumption about the relationship between $B$ and the system dimensionalities such that $N_f/N_s$ and $N_s$ are much greater than $\mathrm{maxeig}[B^TB]/\mathrm{mineig}[B^TB]$. Our mathematical analyses demonstrate that this condition is sufficient to asymptotically determine the possible form of hidden sources, which can generate the observed sensory inputs $x$, in a unique manner. In contrast, in the limit of large $N_f/N_s$ and $N_s$, no nonlinear transformation of $s$ (i.e., $s'$ or $s''$) can generate the observed sensory inputs while satisfying the aforementioned condition.^{4} Thus, when inequality 2.7 holds, the true hidden sources $s$ are uniquely determined up to permutation and sign-flip ambiguities, without component-wise nonlinear ambiguity. (Conversely, if inequality 2.7 does not hold, it might not be possible to distinguish true hidden sources from their nonlinear transformations in an unsupervised manner.) Hence, under the condition we consider, the nonlinear BSS is formally reduced to a simple linear BSS, wherein only permutation and sign-flip ambiguities exist while no nonlinear ambiguity remains. This property is crucial for identifying the true sources under the nonlinear BSS setup without being attracted by spurious solutions such as $s'$ and $s''$.

The nonlinear generative processes that we considered in this work are sufficiently generic. Owing to the universality of two-layer networks, each element of an arbitrary generative process $x=F(s)$ can be approximated using equation 2.1 with a high degree of accuracy in the component-wise sense when $N_x=N_f\gg N_s\gg 1$.^{5} This allows one to analyze the achievability of nonlinear BSS using a class of generative processes comprising random basis functions as a proxy for investigating $x=F(s)$. As a general property of such generative processes, the eigenvalues of the signal covariance (i.e., linear components) become significantly larger than those of the residual covariance (i.e., nonlinear components) when $N_x=N_f\gg N_s\gg 1$. The system dimensionalities sufficient to exhibit this property depend on the matrix $B$ that characterizes the generative process. Namely, this observation holds true at least when both $N_f/N_s$ and $N_s$ are much greater than $\mathrm{maxeig}[B^TB]/\mathrm{mineig}[B^TB]$. The existence of such $N_s$ and $N_f$ is guaranteed if we consider a sufficiently isotropic $B$ wherein $\mathrm{maxeig}[B^TB]/\mathrm{mineig}[B^TB]$ is upper-bounded by a large, finite constant. Meanwhile, with such a large constant, the two-layer networks can approximate each element of $x=F(s)$ accurately. This in turn means that as $N_f/N_s$ and $N_s$ increase, the proposed theorem becomes applicable to a sufficiently broad class of nonlinear generative processes.

This asymptotic property can be understood as follows. When adding a new random basis ($f_k=H_ks+\varphi_k$) to the existing large-scale system, the linear component of $f_k$ ($H_ks$) must be placed within the existing low-dimensional subspace spanned by only $N_s$ ($\ll N_f$) linear projections of sources. In contrast, its nonlinear component ($\varphi_k$) is almost uncorrelated with the other nonlinear components ($\varphi$), as shown in equation 2.6, because these components are characterized by the gaussian distributed $A$. Thus, the former increases the eigenvalues of the signal covariance in proportion to $N_f/N_s$, while the latter adds a new dimension to the subspace of nonlinear components without significantly increasing the maximum eigenvalue of the residual covariance. Owing to this mechanism, the eigenvalues of the signal covariance become $N_f/N_s$ times larger than those of the residual covariance when $N_s\gg 1$. Hence, linear PCA is sufficient to reduce the nonlinear BSS to a linear BSS.
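This eigenvalue-gap mechanism can be probed numerically. The sketch below (our own illustration; the tanh/uniform generative process and all dimensions are hypothetical choices) estimates the ratio of the smallest signal eigenvalue to the largest residual eigenvalue of the input covariance for two values of $N_f$:

```python
import numpy as np

rng = np.random.default_rng(4)

def eigengap(Ns, Nf, T=4000):
    # Ratio of the Ns-th largest (smallest signal) eigenvalue to the
    # (Ns+1)-th largest (largest residual) eigenvalue of the input covariance
    s = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(Ns, T))
    A = rng.normal(0.0, 1.0 / np.sqrt(Ns), size=(Nf, Ns))
    x = np.tanh(A @ s)
    x -= x.mean(axis=1, keepdims=True)
    w = np.linalg.eigvalsh(x @ x.T / T)
    return w[-Ns] / w[-Ns - 1]

g_small, g_large = eigengap(10, 200), eigengap(10, 800)
print(g_small, g_large)   # the gap widens as Nf/Ns grows
```

The widening gap with $N_f/N_s$ is what allows PCA to separate the signal subspace from the nonlinear residue.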

Nonlinear variants of PCA such as autoencoders (Hinton & Salakhutdinov, 2006; Kingma & Welling, 2013) have been widely used for representation learning. Because natural sensory data are highly redundant and do not uniformly cover the entire input space (Chandler & Field, 2007), finding a concise representation of the sensory data is essential to characterize their properties (Arora & Risteski, 2017). In general, if a large, nonlinear neural network is used, many equally good solutions will be produced (Kawaguchi, 2016; Lu & Kawaguchi, 2017; Nguyen & Hein, 2017). In this case, there is no objective reason to select one solution over another if they have similar reconstruction accuracy. However, this property also leads to infinitely many spurious solutions if the aim is the identification of the true hidden sources (Hyvarinen & Pajunen, 1999; Jutten & Karhunen, 2004). Consequently, the outcomes of these approaches using nonlinear neural networks are intrinsically ambiguous, and the obtained solutions depend highly on the heuristic design of the regularization parameters used in the networks and learning algorithms (Dahl, Yu, Deng, & Acero, 2012; Hinton, Srivastava, Krizhevsky, Sutskever, & Salakhutdinov, 2012; Wan, Zeiler, Zhang, LeCun, & Fergus, 2013). Unlike those nonlinear approaches, we proved that the cascade of linear PCA and ICA ensures that the true hidden sources of sensory inputs are obtained if the mappings from hidden sources to sensory inputs have sufficient dimensionality. Thus, this linear approach is suitable for solving the nonlinear BSS problem while retaining the guarantee of identifying the true solution. Its BSS error converges in proportion to the source-to-input dimensionality ratio for a large-scale system comprising high-dimensional sources. Furthermore, if needed, its outcomes can be used as a plausible initial condition for nonlinear methods to further learn the forward model of the generative process.

Because most natural data (including biological, chemical, and social data) are generated from nonlinear generative processes, the scope of application of nonlinear BSS is quite broad. The proposed theorem provides a theoretical justification for the application of standard linear PCA and ICA to nonlinearly generated natural data to infer their true hidden sources. An essential lesson from the proposed theorem is that an artificial intelligence should employ a sufficient number of sensors to identify true hidden sources of sensory inputs accurately, because the BSS error is proportional to the source-to-input dimensionality ratio in a large-scale system.

In the case of computer vision, low-dimensional representations (typically $10$ to $10^4$ components) are extracted from up to millions of pixels of high-dimensional raw images. Our theory implies that the linear PCA–ICA cascade can be utilized to identify the true hidden sources of natural image data. This has been demonstrated using video images (Isomura & Toyoizumi, 2021). In particular, a combination of the proposed theorem with another scheme that effectively removes unpredictable noise from sequential data (i.e., a video sequence) is a powerful tool for identifying the true hidden sources of real-world data. Through this combination, one can obtain data with minimal irrelevant or noise components—an ideal condition for the asymptotic linearization theorem—from the original noisy observations. Such a combination enables the reliable and accurate estimation of hidden sources that generate the input sequences, even in the presence of considerable observation noise. The source estimation performance was demonstrated using sequential visual inputs comprising handwritten digits, rotating 3D objects, and natural scenes. These results support the applicability of the proposed theorem to natural data and further highlight that the PCA–ICA cascade can extract true or relevant hidden sources when the natural data are sufficiently high-dimensional. (See Isomura & Toyoizumi, 2021, for details.)

In addition, a few works have cited a preprint version of this article to justify the application of their linear methods to real-world nonlinear BSS tasks, in the contexts of wireless sensor networks (van der Lee, Exarchakos, & de Groot, 2019) and single-channel source separation (Mika, Budzik, & Józwik, 2020). These works demonstrated that linear methods work well with nonlinear BSS tasks, as our theorem predicts. The asymptotic linearization theorem is of great importance in justifying the application of these linear methods to real-world nonlinear BSS tasks.

In terms of the interpretability of the outcomes of the PCA–ICA cascade, one may ask whether applying the cascade to natural image patches simply yields less interpretable Gabor-filter-like outputs, which are usually not considered to be nonlinear, higher-order features underlying natural images. However, when we applied the PCA–ICA cascade—featuring a separate noise-reduction technique that we developed—to natural image data with a high dimensionality reduction rate, we obtained outputs or images that represent features relevant to hidden sources, such as the categories of objects (Isomura & Toyoizumi, 2021). Based on our observations, we consider three requirements for the PCA–ICA cascade to extract features relevant to hidden sources. First, the dimensionality reduction rate should be high relative to that of the conventional PCA application. Indeed, when the dimensionality reduction rate was low, we observed that the obtained encoders exhibited Gabor-filter-like patterns. Second, the PCA–ICA cascade should be applied not to image patches but to entire images. Image patches contain only a fraction of the information about the original hidden sources; thus, it is likely that the original hidden sources cannot be recovered from image patches. Third, PCA should be applied to denoised data wherein observation noise has been removed in advance using a separate scheme, as in our case. When PCA is applied to noisy data, sufficient information for recovering hidden sources cannot be extracted as the major principal components because PCA preferentially extracts large noise owing to its extra variance. Hence, when these requirements are satisfied, the outcome of the PCA–ICA cascade can exhibit high interpretability, wherein the extracted features are relevant to hidden sources.

As a side note, when utilizing the asymptotic linearization theorem to estimate the accuracy of source separation, one cannot use up-sampling techniques to increase the input dimensionality beyond the resolution of the original data or images. This is because such techniques simply yield linear or nonlinear transformations of the original data, which typically do not add information about the true hidden sources. In other words, the resolution of the original data determines the effective input dimensionality, and such up-sampling merely renders the matrix $B$ singular (thereby violating the condition of an isotropic $B$).
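This point can be illustrated with a minimal NumPy sketch (the dimensions and the nearest-neighbour duplication scheme are our own illustrative choices): a linear up-sampling leaves the rank of the data matrix, and hence the effective input dimensionality, unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
Nx, T = 50, 1000
x = rng.normal(size=(Nx, T))          # original sensory inputs

# A linear 2x "up-sampling" that duplicates each input channel
# (nearest-neighbour style); U acts as an extra left factor on B.
U = np.repeat(np.eye(Nx), 2, axis=0)  # shape (2*Nx, Nx), rank Nx
x_up = U @ x                          # up-sampled inputs

# The rank (effective dimensionality) does not increase, so the
# effective mixing matrix U @ B is singular.
print(np.linalg.matrix_rank(x))       # 50
print(np.linalg.matrix_rank(x_up))    # still 50, not 100
```

The same conclusion holds for any fixed linear interpolation: the up-sampled covariance has at most $N_x$ nonzero eigenvalues.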

Furthermore, an interesting possibility is that living organisms might have developed high-dimensional biological sensors to perform nonlinear BSS using a linear biological encoder because this strategy guarantees robust nonlinear BSS. In particular, this might be the approach taken by the human nervous system using large numbers of sensory cells, such as approximately 100 million rod cells and 6 million cone cells in the retina, and approximately 16,000 hair cells in the cochlea, to process sensory signals (Kandel, Schwartz, Jessell, Siegelbaum, & Hudspeth, 2013).

Neuronal networks in the brain are known to update their synapses by following Hebbian plasticity rules (Hebb, 1949; Malenka & Bear, 2004), and researchers believe that Hebbian plasticity plays a key role in representation learning (Dayan & Abbott, 2001; Gerstner & Kistler, 2002) and BSS (Brown, Yamada, & Sejnowski, 2001) in the brain—in particular, the dynamics of neural activity and plasticity in a class of canonical neural networks can be universally characterized in terms of representation learning or BSS from a Bayesian perspective (Isomura & Friston, 2020). Notably, major linear BSS algorithms for PCA (Oja, 1982, 1989) and ICA (Bell & Sejnowski, 1995, 1997; Amari et al., 1996; Hyvarinen & Oja, 1997) are formulated as variants of Hebbian plasticity rules. Moreover, in vitro neuronal networks learn to separately represent independent hidden sources in (and only in) the presence of Hebbian plasticity (Isomura, Kotani, & Jimbo, 2015; Isomura & Friston, 2018). Nonetheless, the manner in which the brain can possibly solve a nonlinear BSS problem remains unclear, even though this ability might be a prerequisite for many of its cognitive processes, such as visual recognition (DiCarlo et al., 2012). While Oja's subspace rule for PCA (Oja, 1989) and Amari's ICA algorithm (Amari et al., 1996) were used in this article, these rules can be replaced with more biologically plausible local Hebbian learning rules (Foldiak, 1990; Linsker, 1997; Pehlevan, Mohan, & Chklovskii, 2017; Isomura & Toyoizumi, 2016, 2018, 2019; Leugering & Pipa, 2018) that require only directly accessible signals to update synapses. A recent work indicated that even a single-layer neural network can perform both PCA and ICA through a local learning rule (Isomura & Toyoizumi, 2018), implying that even a single-layer network can perform nonlinear BSS.
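As a concrete sketch, Oja's subspace rule and Amari's ICA rule can both be written as Hebbian-like updates driven by the pre- and postsynaptic activity. The toy NumPy example below applies them to a linear mixture for illustration; the dimensions, learning rate, and Laplace-distributed sources are our own choices, not those used in the article's simulations.

```python
import numpy as np

rng = np.random.default_rng(1)
Nx, Ns, T, eta = 20, 3, 20_000, 1e-3

# Toy linearly mixed inputs (for illustration only).
s = rng.laplace(size=(Ns, T)) / np.sqrt(2)      # unit-variance sources
x = rng.normal(size=(Nx, Ns)) @ s
x -= x.mean(axis=1, keepdims=True)

# Oja's subspace rule: dW ∝ u x^T − u u^T W, with output u = W x.
W_pca = rng.normal(0, 0.1, size=(Ns, Nx))
for epoch in range(3):
    for t in range(T):
        xt = x[:, t:t + 1]
        ut = W_pca @ xt
        W_pca += eta * (ut @ xt.T - ut @ ut.T @ W_pca)

u = W_pca @ x
u /= u.std(axis=1, keepdims=True)               # rescale the components

# Amari's ICA rule (natural gradient): dW ∝ (I − φ(y) y^T) W, φ = tanh,
# which suits the super-gaussian (Laplace) sources used here.
W_ica = np.eye(Ns)
for epoch in range(5):
    for t in range(T):
        yt = W_ica @ u[:, t:t + 1]
        W_ica += eta * (np.eye(Ns) - np.tanh(yt) @ yt.T) @ W_ica

y = W_ica @ u   # separated outputs (up to permutation and sign-flip)
```

Each update uses only signals available at the synapse (inputs, outputs, and the current weights), which is what makes these rules biologically plausible.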

In summary, we demonstrated that when the source dimensionality is large and the input dimensionality is sufficiently larger than the source dimensionality, a cascaded setup of linear PCA and ICA can reliably identify the optimal linear encoder to decompose nonlinearly generated sensory inputs into their true hidden sources with increasing accuracy. This is because the higher-dimensional inputs can provide greater evidence about the hidden sources, which removes the possibility of finding spurious solutions and reduces the BSS error. As a corollary of the proposed theorem, Hebbian-like plasticity rules that perform PCA and ICA can reliably update synaptic weights to express the optimal encoding matrix and can consequently identify the true hidden sources in a self-organizing or unsupervised manner. This theoretical justification is potentially useful for designing reliable and explainable artificial intelligence, and for understanding the neuronal mechanisms underlying perceptual inference.
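The claim can be probed with a minimal NumPy sketch of the PCA stage of the cascade (the dimensions, the tanh nonlinearity, and uniform sources are illustrative choices; $B$ is taken to be the identity for simplicity, and a standard linear ICA would follow). If the theorem holds, the major principal components should be close to a linear transformation of the sources, so a linear readout should reconstruct them almost perfectly.

```python
import numpy as np

rng = np.random.default_rng(2)
Ns, Nf, T = 8, 1000, 10_000       # Nx = Nf >> Ns

# Independent uniform hidden sources with zero mean and unit variance.
s = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(Ns, T))

# Nonlinear generative process x = B f(A s + a); here B = I for simplicity.
A = rng.normal(0, 1 / np.sqrt(Ns), size=(Nf, Ns))
a = rng.normal(size=(Nf, 1))
x = np.tanh(A @ s + a)

# PCA: project the centered inputs onto the Ns major eigenvectors.
xc = x - x.mean(axis=1, keepdims=True)
eigval, eigvec = np.linalg.eigh(xc @ xc.T / T)   # ascending order
u = eigvec[:, -Ns:].T @ xc        # major principal components

# Validation only: regress the true sources on u. The actual separation
# would be done blindly by a subsequent linear ICA.
W, *_ = np.linalg.lstsq(u.T, s.T, rcond=None)
r2 = 1 - np.sum((s.T - u.T @ W) ** 2) / np.sum(s.T ** 2)
print(f"R^2 of linear readout: {r2:.3f}")
```

The regression uses the true sources only to verify that the major eigenspace spans them; no source information enters the PCA step itself.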

## 4 Methods

### 4.1 Lemma 1

**Definition 1.**

Suppose $p_s(s)$ is a symmetric distribution, $\tilde{y} \equiv (y_i, y_j)^T \equiv \tilde{A}s$ is a two-dimensional subvector of $y = As$, and $\tilde{A} \equiv (A_{i1}, \dots, A_{iN_s}; A_{j1}, \dots, A_{jN_s})$ is a $2 \times N_s$ submatrix of $A$. The expectation of an arbitrary function $F(\tilde{y})$ over $p_s(s)$ is denoted as $E[F(\tilde{y})]$, whereas when $\tilde{y}$ is sampled from a gaussian distribution $p_N(\tilde{y}) \equiv \mathcal{N}[0, \tilde{A}\tilde{A}^T]$, the expectation of $F(\tilde{y})$ over $p_N(\tilde{y})$ is denoted as $E_N[F(\tilde{y})] \equiv \int F(\tilde{y}) p_N(\tilde{y}) d\tilde{y}$.

**Lemma 1.**

When $F(\tilde{y})$ is order 1, the difference between $E[F(\tilde{y})]$ and $E_N[F(\tilde{y})]$ is upper-bounded by order $N_s^{-1}$ as a corollary of the central limit theorem. In particular, $E[F(\tilde{y})]$ can be expressed as $E_N[F(\tilde{y})(1 + G(\tilde{y})/N_s)]$, where $G(\tilde{y}) = O(1)$ is a function that characterizes the deviation of $p(\tilde{y})$ from $p_N(\tilde{y})$.

**Proof.**

**Remark 1.**
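The $O(N_s^{-1})$ deviation in Lemma 1 can be checked in closed form for the particular choice $F(\tilde{y}) = y_i^2 y_j^2$ (our own illustrative choice). For i.i.d. symmetric unit-variance sources with kurtosis $\kappa$ and $A$ entries drawn from $\mathcal{N}[0, 1/N_s]$ as in Lemma 3, a direct moment expansion gives $E[y_i^2 y_j^2] - E_N[y_i^2 y_j^2] = (\kappa - 3)\sum_k A_{ik}^2 A_{jk}^2 = O(N_s^{-1})$, which the sketch below evaluates numerically.

```python
import numpy as np

def deviation(Ns, kappa, rng):
    # Two rows of A with entries ~ N(0, 1/Ns), as in the generative model.
    At = rng.normal(0.0, 1.0 / np.sqrt(Ns), size=(2, Ns))
    # Exact deviation of E[y_i^2 y_j^2] from its gaussian (Wick) value for
    # iid symmetric unit-variance sources with kurtosis kappa.
    return abs((kappa - 3.0) * np.sum(At[0] ** 2 * At[1] ** 2))

rng = np.random.default_rng(3)
kappa = 9.0 / 5.0                       # kurtosis of uniform sources
ds = {}
for Ns in (10, 100, 1000):
    ds[Ns] = np.mean([deviation(Ns, kappa, rng) for _ in range(200)])
    print(Ns, round(ds[Ns], 4))         # shrinks roughly as |kappa-3|/Ns
```

Each term $A_{ik}^2 A_{jk}^2$ is of order $N_s^{-2}$ and there are $N_s$ of them, so the deviation averages to $|\kappa - 3|/N_s$, consistent with the lemma.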

### 4.2 Lemma 2

**Lemma 2.**

**Proof.**

**Remark 2.**

### 4.3 Lemma 3

**Lemma 3.**

When the elements of $A \in \mathbb{R}^{N_f \times N_s}$ are independently sampled from a gaussian distribution $\mathcal{N}[0, 1/N_s]$ and $N_f > N_s^2 \gg 1$, $(AA^T)^{\circ 3} \equiv (AA^T) \circ (AA^T) \circ (AA^T)$ has $N_s$ major eigenmodes—all of which have the eigenvalue $3N_f/N_s^2$ (as the leading order), with corresponding eigenvectors that match the directions of $AA^T$—and up to $(N_f - N_s)$ randomly distributed nonzero minor eigenmodes whose eigenvalues are of order $\max[1, N_f/N_s^3]$ on average. Thus, the latter are negligibly smaller than the former when $N_f > N_s^2 \gg 1$.

**Proof.**

**Remark 3.**

Because the nonzero minor eigenmodes are nearly uncorrelated with the eigenmodes of $AA^T$, their contributions are negligible when deriving equation 2.13. It is straightforward to apply this proof to $(AA^T)^{\circ 2}$.

### 4.4 First-Order Approximation of Major Principal Components

### 4.5 Derivation of Linearization Error Covariance

### 4.6 Hebbian-Like Learning Rules

## Acknowledgments

This work was supported by RIKEN Center for Brain Science (T.I. and T.T.), Brain/MINDS from AMED under grant JP21dm020700 (T.T.), and JSPS KAKENHI grant JP18H05432 (T.T.). The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.

## Data Availability

All relevant data are within the article. The MATLAB scripts are available at https://github.com/takuyaisomura/asymptotic_linearization.

## Notes

^{1}

In this work, based on the literature (Cybenko, 1989), we define the universality in the component-wise sense as the condition wherein each element of $x = F(s)$ is approximated using the two-layer network with an approximation error smaller than an arbitrary $\delta > 0$, that is, $\sup_s[|F_j(s) - B_j f(As + a)|] < \delta$ for $j = 1, \dots, N_x$ when $N_f$ is sufficiently large. Note that $F_j(s)$ is the $j$th element of $F(s)$ and $B_j$ is the $j$th row vector of $B$. It is known that when $f$ is a sigmoidal function and the parameters are selected appropriately, the approximation error can be upper-bounded by order $N_f^{-1}$ (Barron, 1993); hence, $E[|F_j(s) - B_j f(As + a)|^2] \le O(N_f^{-1})$. This relationship holds true when $N_f \gg N_s \gg 1$, irrespective of $N_x$.

^{2}

In this article, we refer to the neural network as the encoder because networks that convert the input data into a different (lower-dimensional) code or representation are widely recognized as encoders in the literature on machine learning (Hinton & Salakhutdinov, 2006; Goodfellow, Bengio, & Courville, 2016). However, one may think that the generative process encodes the hidden sources into the sensory inputs through nonlinear mixtures. From this viewpoint, one may instead term the neural network the decoder, as it decodes the inputs to recover the original sources.

^{3}

Our numerical simulations suggest that the system behaves similarly for $N_x \ge N_f \ge N_s > 1$ in some cases, although our theorem holds mathematically only when $N_x = N_f \gg N_s \gg 1$.

^{4}

This property can be understood as follows. Because $x$ is generated through $x = Bf(As + a)$ with a sufficiently isotropic $B$ that satisfies inequality 2.7, from the proposed theorem, the encoder $u = \Lambda_M^{-1/2} P_M^T (x - E[x]) = \Omega(s + \varepsilon)$ asymptotically becomes $u \to \Omega s$ in the limit of large $N_f/N_s$ and $N_s$. In addition, we define $s' = g(s) \in \mathbb{R}^{N_s}$ as a nonlinear transformation of $s$. If there is another generative process $x = B'f'(A's' + a')$ that can generate the observed $x$ while satisfying inequality 2.7—where $A'$ and $a'$ are a gaussian distributed matrix and vector, $f'$ is a nonlinear function (not the derivative of $f$, unlike in the main text), and $B'$ is a sufficiently isotropic matrix—the encoder can also be expressed as $u' = \Lambda_M^{-1/2} P_M^T (x - E[x]) = \Omega'(s' + \varepsilon')$, which asymptotically becomes $u' \to \Omega' s'$ in the limit of large $N_f/N_s$ and $N_s$. Note that $\Lambda_M$ and $P_M$ are the same as above because $x$ does not change. However, while $u \equiv u'$ holds by construction, $\Omega s \not\equiv \Omega' s'$ for any orthogonal matrices $\Omega$ and $\Omega'$ because $s'$ is a nonlinear transformation of $s$. Thus, such a generative process $x = B'f'(A's' + a')$ does not exist in the limit of large $N_f/N_s$ and $N_s$.

^{5}

It should be noted that unlike the condition we considered (i.e., $N_x = N_f \gg N_s \gg 1$), when $N_f \gg N_x$, one may observe a counterexample to the proposed theorem such that the major principal components are dissimilar to a linear transformation of the true hidden sources, as is well known in the literature on nonlinear ICA (Hyvarinen & Pajunen, 1999; Jutten & Karhunen, 2004). This is because when $N_f \gg N_x$, a general $B$ may not be sufficiently isotropic. In other words, $N_f \gg N_x$ corresponds to the case wherein $N_x$ is insufficient to ensure inequality 2.7; thus, $N_x$ needs to be greater. Hence, it is remarkable that the proposed theorem specifies the condition under which the achievability of nonlinear BSS is mathematically guaranteed.
