## Abstract

Accurately evaluating statistical independence among random variables is a key
element of independent component analysis (ICA). In this letter, we employ a
squared-loss variant of mutual information as an independence measure and present
a method for estimating it. Our key idea is to estimate the ratio of probability
densities directly, without going through density estimation, thereby avoiding
this difficult task. In this density-ratio approach, a
natural cross-validation procedure is available for hyperparameter selection.
Thus, all tuning parameters such as the kernel width or the regularization
parameter can be objectively optimized. This is an advantage over recently
developed kernel-based independence measures and is a highly useful property in
unsupervised learning problems such as ICA. Based on this novel independence
measure, we develop an ICA algorithm, named *least-squares independent
component analysis*.

## 1. Introduction

The purpose of independent component analysis (ICA) (Hyvärinen, Karhunen, & Oja, 2001) is to obtain a transformation matrix that separates mixed signals into statistically independent source signals. A direct approach to ICA is to find a transformation matrix such that independence among separated signals is maximized under some independence measure such as mutual information (MI).

Various approaches to evaluating the independence among random variables from samples have been explored so far. A naive approach is to estimate probability densities based on parametric or nonparametric density estimation methods. However, finding an appropriate parametric model is not easy without strong prior knowledge, and nonparametric estimation is not accurate in high-dimensional problems. Thus, this naive approach is not reliable in practice. Another approach is to approximate the negentropy (or negative entropy) based on the Gram-Charlier expansion (Cardoso & Souloumiac, 1993; Comon, 1994; Amari, Cichocki, & Yang, 1996) or the Edgeworth expansion (Hulle, 2008). An advantage of this negentropy-based approach is that the hard task of density estimation is not directly involved. However, these expansion techniques are based on the assumption that the target density is close to normal, and violation of this assumption can cause a large approximation error.

The above approaches are based on the probability densities of signals. Another line of research that does not explicitly involve probability densities employs nonlinear correlation: signals are statistically independent if and only if all nonlinear correlations among the signals vanish. Following this line, computationally efficient algorithms have been developed based on a contrast function (Jutten & Hérault, 1991; Hyvärinen, 1999), which is an approximation of negentropy or mutual information. However, these methods require prespecifying nonlinearities in the contrast function and thus could be inaccurate if the predetermined nonlinearities do not match the target distribution. To cope with this problem, the kernel trick has been applied in ICA, which allows evaluating all nonlinear correlations in a computationally efficient manner (Bach & Jordan, 2002). However, its practical performance depends on the choice of kernels (more specifically, the gaussian kernel width), and there seems to be no theoretically justified method to determine the kernel width (see also Fukumizu, Bach, & Jordan, 2009). This is a critical problem in unsupervised learning problems such as ICA.

In this letter, we develop a new ICA algorithm that resolves these problems. We adopt a squared-loss variant of MI (which we call squared-loss MI, SMI) as an independence measure and approximate it by estimating the ratio of probability densities contained in SMI directly without going through density estimation. This approach, which follows the line of Sugiyama et al. (2008), Kanamori, Hido, and Sugiyama (2009), and Nguyen, Wainwright, and Jordan (in press), allows us to avoid the difficult task of density estimation. Another practical advantage of this density-ratio approach is that a natural cross-validation (CV) procedure is available for hyperparameter selection. Thus, all tuning parameters such as the kernel width or the regularization parameter can be objectively and systematically optimized through CV.

From an algorithmic point of view, our density-ratio approach analytically provides a
nonparametric estimator of SMI; furthermore, its derivative can also be computed
analytically, and these properties are used in deriving a new ICA algorithm. The
proposed method is named *least-squares independent component
analysis* (LICA). Characteristics of existing and proposed ICA methods
are summarized in Table 1, highlighting the
advantage of the proposed LICA approach.

Table 1: Characteristics of existing and proposed ICA methods.

| Method | Hyperparameter Selection | Distribution |
|---|---|---|
| Fast ICA (FICA) (Hyvärinen, 1999) | Not necessary | Not free |
| Natural gradient ICA (NICA) (Amari et al., 1996) | Not necessary | Not free |
| Kernel ICA (KICA) (Bach & Jordan, 2002) | Not available | Free |
| Edgeworth-expansion ICA (EICA) (Hulle, 2008) | Not necessary | Nearly normal |
| Least-squares ICA (LICA) (proposed) | Available | Free |


The structure of this letter is as follows. In section 2, we formulate our estimator of SMI. In section 3, we derive the LICA algorithm based on the SMI estimator. Section 4 is devoted to numerical experiments where we show that our method properly estimates the true demixing matrix using toy data sets and compare the performances of the proposed and existing methods on artificial and real data sets.

## 2. SMI Estimation for ICA

In this section, we formulate the ICA problem and introduce our independence measure, SMI. Then we give an estimation method of SMI and derive an ICA algorithm.

### 2.1. Problem Formulation.

Let $x = (x^{(1)},\dots,x^{(d)})^\top$ be a *d*-dimensional random signal drawn from a distribution with density $p(x)$, where the components $\{x^{(m)}\}_{m=1}^{d}$ are statistically independent of each other and $\top$ denotes the transpose of a matrix or a vector. Thus, $p(x)$ can be factorized as

$$p(x) = \prod_{m=1}^{d} p_m\big(x^{(m)}\big).$$

We cannot directly observe the source signal $x$, only a linearly mixed signal $y$:

$$y = A x,$$

where $A$ is a $d \times d$ invertible matrix called the *mixing matrix*. The goal of ICA is, given samples $\{y_i\}_{i=1}^{n}$ of the mixed signals, to obtain a demixing matrix $W$ that recovers the original source signal $x$. We denote the demixed signal by $z$:

$$z = W y.$$

The ideal solution is $W = A^{-1}$, but we can recover the source signals only up to permutation and scaling of the components of $x$ due to the nonidentifiability of the ICA setup (Hyvärinen et al., 2001).

Our approach is to determine $W$ so that the components of $z$ are as independent as possible. Here, we adopt SMI as the independence measure:

$$I_s\big(z^{(1)},\dots,z^{(d)}\big) = \frac{1}{2}\int \left(\frac{q(z)}{r(z)} - 1\right)^{2} r(z)\, \mathrm{d}z, \tag{2.1}$$

where $q(z)$ denotes the joint density of $z$ and $r(z)$ denotes the product of the marginal densities $\{q_m(z^{(m)})\}_{m=1}^{d}$:

$$r(z) = \prod_{m=1}^{d} q_m\big(z^{(m)}\big).$$

Note that SMI is the Pearson divergence (Pearson, 1900; Paninski, 2003; Liese & Vajda, 2006; Cichocki, Zdunek, Phan, & Amari, 2009) between $q(z)$ and $r(z)$, while ordinary MI is the Kullback-Leibler divergence (Kullback & Leibler, 1951). Since $I_s$ is nonnegative and it vanishes if and only if $q(z) = r(z)$, the degree of independence among $\{z^{(m)}\}_{m=1}^{d}$ may be measured by SMI. Note that equation 2.1 corresponds to the *f*-divergence (Ali & Silvey, 1966; Csiszár, 1967) between $q(z)$ and $r(z)$ with the squared loss, while ordinary MI corresponds to the *f*-divergence with the log loss. Thus, SMI could be regarded as a natural generalization of ordinary MI.

Our goal is to find the demixing matrix $W$ that minimizes SMI. Let us denote the demixed samples by $\{z_i = W y_i\}_{i=1}^{n}$. Our key constraint when estimating SMI is that we want to avoid density estimation since it is a difficult task (Vapnik, 1998). Below, we show how this could be accomplished.
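To make the definition concrete, here is a small worked example of ours (not from the letter): for a discrete joint distribution, the integral in the Pearson divergence becomes a sum over the joint probability table, so SMI can be computed exactly and its key property (zero if and only if the variables are independent) can be checked directly.

```python
import numpy as np

def smi_discrete(joint):
    """Squared-loss mutual information for a discrete 2-D joint
    probability table: the Pearson divergence between the joint
    distribution q and the product of its marginals r."""
    joint = np.asarray(joint, dtype=float)
    p1 = joint.sum(axis=1, keepdims=True)   # marginal of z^(1)
    p2 = joint.sum(axis=0, keepdims=True)   # marginal of z^(2)
    r = p1 * p2                             # product of marginals
    mask = r > 0
    # I_s = (1/2) * sum_z r(z) * (q(z)/r(z) - 1)^2
    return 0.5 * np.sum(r[mask] * (joint[mask] / r[mask] - 1.0) ** 2)

# Independent table: SMI vanishes.
indep = np.outer([0.3, 0.7], [0.6, 0.4])
print(smi_discrete(indep))        # ~0 (up to floating-point rounding)

# Dependent table: SMI is strictly positive.
dep = np.array([[0.45, 0.05],
                [0.05, 0.45]])
print(smi_discrete(dep))          # 0.32 (up to rounding)
```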

### 2.2. SMI Approximation by Density Ratio Estimation.

Let us denote the ratio of the densities $q(z)$ and $r(z)$ by

$$w(z) = \frac{q(z)}{r(z)}.$$

Then SMI can be written as

$$I_s = \frac{1}{2}\int w(z)\, q(z)\, \mathrm{d}z - \frac{1}{2}.$$

Therefore, SMI can be approximated through the estimation of $\int w(z)\,q(z)\,\mathrm{d}z$, the expectation of $w(z)$ over $q(z)$. This can be achieved by taking the sample average of an estimator $\hat w(z)$ of the density ratio $w(z)$:

$$\hat I_s = \frac{1}{2n}\sum_{i=1}^{n}\hat w(z_i) - \frac{1}{2}.$$

We model the density ratio $w(z)$ by the following linear model:

$$w_\alpha(z) = \alpha^\top \varphi(z),$$

where $\alpha = (\alpha_1,\dots,\alpha_b)^\top$ are parameters to be learned and $\varphi(z) = (\varphi_1(z),\dots,\varphi_b(z))^\top$ are nonnegative basis functions, that is, $\varphi(z) \ge 0_b$ for all $z$, where $0_b$ denotes the *b*-dimensional vector with all zeros. Note that $\varphi(z)$ could be dependent on the samples $\{z_i\}_{i=1}^{n}$; that is, kernel models are also allowed. We explain in section 2.3 how the basis functions are chosen.

The parameter $\alpha$ is learned so that the squared error to the true ratio $w(z)$ under $r(z)$ is minimized:

$$J_0(\alpha) = \frac{1}{2}\int \big(w_\alpha(z) - w(z)\big)^{2} r(z)\,\mathrm{d}z = \frac{1}{2}\int w_\alpha(z)^{2}\, r(z)\,\mathrm{d}z - \int w_\alpha(z)\, q(z)\,\mathrm{d}z + C, \tag{2.5}$$

where $C = \frac{1}{2}\int w(z)^{2} r(z)\,\mathrm{d}z$ is a constant independent of $\alpha$. However, the expectations over $q(z)$ and $r(z)$ cannot be computed since $q(z)$ and $r(z)$ are unknown. To cope with this problem, we approximate the expectations by their empirical averages. Then the optimization problem is reduced to

$$\hat\alpha = \operatorname*{argmin}_{\alpha}\left[\frac{1}{2}\alpha^\top \hat H \alpha - \hat h^\top \alpha + \frac{\lambda}{2}\alpha^\top R \alpha\right], \tag{2.7}$$

where we include the regularization term $\frac{\lambda}{2}\alpha^\top R \alpha$ for avoiding overfitting; $\lambda\ (\ge 0)$ is called the regularization parameter, and $R$ is some positive definite matrix. $\hat H$ and $\hat h$ are defined as

$$\hat H_{\ell,\ell'} = \frac{1}{n^{d}}\sum_{i_1,\dots,i_d=1}^{n} \varphi_\ell\big(z_{i_1}^{(1)},\dots,z_{i_d}^{(d)}\big)\, \varphi_{\ell'}\big(z_{i_1}^{(1)},\dots,z_{i_d}^{(d)}\big), \qquad \hat h_\ell = \frac{1}{n}\sum_{i=1}^{n} \varphi_\ell(z_i),$$

where the average defining $\hat H$ is taken over all index combinations so that it mimics the expectation under the product of marginals $r(z)$. Differentiating the objective function in equation 2.7 with respect to $\alpha$ and equating it to zero, we can obtain an analytic-form solution:

$$\hat\alpha = \big(\hat H + \lambda R\big)^{-1}\hat h.$$

Thus, the solution can be computed efficiently by solving a system of linear equations. Plugging $\hat w(z) = \hat\alpha^\top \varphi(z)$ into the sample average above yields the SMI estimator

$$\hat I_s = \frac{1}{2}\hat h^\top \hat\alpha - \frac{1}{2}. \tag{2.10}$$
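The estimator above can be sketched in a few lines of numpy for the case $d=2$. This is an illustrative implementation under our own simplifying assumptions, not the authors' Matlab code: gaussian kernels centered at the sample points serve as basis functions (as in section 2.3), $R$ is taken to be the identity for brevity, and $\hat H$ is computed exactly over all $n^2$ index pairs by exploiting the product form of the kernels.

```python
import numpy as np

def lsmi_d2(z1, z2, sigma=0.5, lam=0.1):
    """Analytic SMI estimator for d = 2 (a sketch).

    Gaussian kernels centered at the sample points serve as basis
    functions; the regularizer R is simplified to the identity."""
    n = len(z1)
    # Per-dimension kernel matrices: K1[i, l] = k(z1_i, z1_l), etc.
    K1 = np.exp(-(z1[:, None] - z1[None, :]) ** 2 / (2 * sigma ** 2))
    K2 = np.exp(-(z2[:, None] - z2[None, :]) ** 2 / (2 * sigma ** 2))
    # h_hat: average of phi over the paired (joint) samples.
    h = (K1 * K2).mean(axis=0)
    # H_hat: average of phi phi^T over all n^2 index pairs (i, j),
    # mimicking the product of marginals; factorizes for product kernels.
    H = (K1.T @ K1) * (K2.T @ K2) / n ** 2
    alpha = np.linalg.solve(H + lam * np.eye(n), h)   # analytic solution
    return 0.5 * h @ alpha - 0.5                      # SMI estimate

rng = np.random.default_rng(0)
s = rng.standard_normal(200)
ind_est = lsmi_d2(s, rng.standard_normal(200))           # independent pair
dep_est = lsmi_d2(s, s + 0.1 * rng.standard_normal(200)) # dependent pair
print(ind_est, dep_est)   # the dependent pair yields the larger estimate
```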

### 2.3. Design of Basis Functions and Hyperparameter Selection.

As basis functions, we use gaussian kernels centered at the demixed sample points:

$$\varphi_\ell(z) = \exp\left(-\frac{\|z - z_\ell\|^{2}}{2\sigma^{2}}\right),$$

where $\sigma$ is the gaussian kernel width. Since each basis function is a product of per-dimension gaussian kernels, the sum over all $n^d$ index combinations in $\hat H$ factorizes along dimensions and can be computed efficiently. As the regularization matrix, we use

$$R_{\ell,\ell'} = \exp\left(-\frac{\|z_\ell - z_{\ell'}\|^{2}}{2\sigma^{2}}\right), \tag{2.12}$$

so that $\alpha^\top R \alpha$ corresponds to the squared RKHS norm of $w_\alpha$ induced by the gaussian kernel.

The hyperparameters, that is, the gaussian width $\sigma$ and the regularization parameter $\lambda$, can be objectively selected by *K*-fold cross-validation (CV) as follows. The samples $\{z_i\}_{i=1}^{n}$ are divided into *K* disjoint subsets $\{\mathcal Z_k\}_{k=1}^{K}$ of (approximately) the same size (we use *K*=5 in the experiments). Then an estimator $\hat\alpha_{\mathcal Z\setminus\mathcal Z_k}$ is obtained using $\{\mathcal Z_j\}_{j\ne k}$ (i.e., without $\mathcal Z_k$), and the approximation error for the holdout samples $\mathcal Z_k$ is computed:

$$J_k^{(K\text{-CV})} = \frac{1}{2}\hat\alpha_{\mathcal Z\setminus\mathcal Z_k}^\top \hat H_{\mathcal Z_k}\hat\alpha_{\mathcal Z\setminus\mathcal Z_k} - \hat h_{\mathcal Z_k}^\top\hat\alpha_{\mathcal Z\setminus\mathcal Z_k},$$

where the matrix $\hat H_{\mathcal Z_k}$ and the vector $\hat h_{\mathcal Z_k}$ are defined in the same way as $\hat H$ and $\hat h$ but computed using only $\mathcal Z_k$. This procedure is repeated for $k = 1,\dots,K$, and its average

$$J^{(K\text{-CV})} = \frac{1}{K}\sum_{k=1}^{K} J_k^{(K\text{-CV})}$$

is computed. For parameter selection, we compute $J^{(K\text{-CV})}$ for all hyperparameter candidates (the gaussian width $\sigma$ and the regularization parameter $\lambda$ in the current setting) and choose the pair that minimizes $J^{(K\text{-CV})}$. We can show that $J^{(K\text{-CV})}$ is an almost unbiased estimator of the objective function in equation 2.5, where the "almost"-ness comes from the fact that the number of samples is reduced in the CV procedure due to data splitting (Geisser, 1975; Kohavi, 1995).
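The CV procedure above can be sketched as a grid search over $(\sigma, \lambda)$ scored by the holdout objective. This is again a simplified illustration under our own assumptions ($d=2$, identity regularizer, kernel centers fixed to at most the first 100 training points), not the reference implementation:

```python
import numpy as np

def _design(z1, z2, c1, c2, sigma):
    # Per-dimension gaussian kernel matrices for product-form bases.
    K1 = np.exp(-(z1[:, None] - c1[None, :]) ** 2 / (2 * sigma ** 2))
    K2 = np.exp(-(z2[:, None] - c2[None, :]) ** 2 / (2 * sigma ** 2))
    return K1, K2

def cv_score(z1, z2, sigma, lam, K=5):
    """K-fold CV score J^(K-CV) for one (sigma, lambda) candidate,
    using the holdout criterion J = 0.5 a'H a - h'a (a sketch)."""
    n = len(z1)
    folds = np.array_split(np.random.RandomState(0).permutation(n), K)
    scores = []
    for k in range(K):
        te = folds[k]
        tr = np.concatenate([folds[j] for j in range(K) if j != k])
        c1, c2 = z1[tr[:100]], z2[tr[:100]]          # kernel centers
        K1, K2 = _design(z1[tr], z2[tr], c1, c2, sigma)
        H = (K1.T @ K1) * (K2.T @ K2) / len(tr) ** 2
        h = (K1 * K2).mean(axis=0)
        alpha = np.linalg.solve(H + lam * np.eye(len(c1)), h)
        K1t, K2t = _design(z1[te], z2[te], c1, c2, sigma)
        Ht = (K1t.T @ K1t) * (K2t.T @ K2t) / len(te) ** 2
        ht = (K1t * K2t).mean(axis=0)
        scores.append(0.5 * alpha @ Ht @ alpha - ht @ alpha)
    return np.mean(scores)

def select_hyperparams(z1, z2):
    """Grid search: 10 sigmas in [0.1, 1] at regular intervals and
    10 lambdas log-spaced in [0.001, 1], as in the experiments."""
    grid = [(s, l) for s in np.linspace(0.1, 1, 10)
                   for l in np.logspace(-3, 0, 10)]
    return min(grid, key=lambda p: cv_score(z1, z2, *p))
```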

## 3. The LICA Algorithms

In this section, we show how the SMI estimation idea could be employed in the context
of ICA. Here we derive two algorithms, which we call least-squares independent
component analysis (LICA), for obtaining a minimizer of the SMI estimator $\hat I_s$ with respect to the demixing matrix **W**. One is based on a plain gradient method (which we refer to as PG-LICA) and
the other on a natural gradient method for whitened samples (which we refer to as
NG-LICA). A Matlab implementation of LICA is available online at http://www.simplex.t.u-tokyo.ac.jp/~s-taiji/software/LICA/index.html.

### 3.1. Plain Gradient Algorithm: PG-LICA.

In the plain gradient approach, the demixing matrix $W$ is updated so as to reduce $\hat I_s$:

$$W \longleftarrow W - \varepsilon\,\frac{\partial \hat I_s}{\partial W}, \tag{3.1}$$

where $\varepsilon\ (>0)$ is the step size. As shown in the appendix, the gradient is given by

$$\frac{\partial \hat I_s}{\partial W_{k,l}} = \frac{1}{2}\frac{\partial \hat h^\top}{\partial W_{k,l}}\hat\alpha + \frac{1}{2}\hat h^\top\frac{\partial \hat\alpha}{\partial W_{k,l}}, \quad\text{with}\quad \frac{\partial \hat\alpha}{\partial W_{k,l}} = \big(\hat H + \lambda R\big)^{-1}\left(\frac{\partial \hat h}{\partial W_{k,l}} - \left(\frac{\partial \hat H}{\partial W_{k,l}} + \lambda\frac{\partial R}{\partial W_{k,l}}\right)\hat\alpha\right), \tag{3.2}$$

where the partial derivatives of $\hat H$, $\hat h$, and the regularization matrix $R$ defined by equation 2.12 follow from differentiating the gaussian basis functions with respect to the demixed samples $z_i = W y_i$.

In practice, we may iteratively perform line search along the gradient and optimize the gaussian width and the regularization parameter by CV. A pseudo-code of the PG-LICA algorithm is summarized in algorithm 1.
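As a concrete (and deliberately simplified) illustration of one PG-LICA step for $d=2$, the snippet below reuses the analytic SMI estimator of section 2.2. The analytic gradient of equation 3.2 is replaced by a central finite-difference approximation purely for brevity, and the rows of $W$ are renormalized after each step to remove the scale indeterminacy; both simplifications are ours, not the letter's algorithm 1.

```python
import numpy as np

def smi_hat(Z, sigma=0.5, lam=0.1):
    """Analytic SMI estimator for d = 2 demixed samples Z (n x 2),
    with gaussian kernel bases centered at the samples (a sketch)."""
    z1, z2 = Z[:, 0], Z[:, 1]
    n = len(z1)
    K1 = np.exp(-(z1[:, None] - z1[None, :]) ** 2 / (2 * sigma ** 2))
    K2 = np.exp(-(z2[:, None] - z2[None, :]) ** 2 / (2 * sigma ** 2))
    H = (K1.T @ K1) * (K2.T @ K2) / n ** 2
    h = (K1 * K2).mean(axis=0)
    alpha = np.linalg.solve(H + lam * np.eye(n), h)
    return 0.5 * h @ alpha - 0.5

def pg_lica_step(W, Y, eps=0.1, delta=1e-4):
    """One plain-gradient step for d = 2: finite-difference gradient of
    the SMI estimate, a descent step, then row renormalization."""
    G = np.zeros_like(W)
    for k in range(2):
        for l in range(2):
            Wp, Wm = W.copy(), W.copy()
            Wp[k, l] += delta
            Wm[k, l] -= delta
            G[k, l] = (smi_hat(Y @ Wp.T) - smi_hat(Y @ Wm.T)) / (2 * delta)
    W = W - eps * G
    return W / np.linalg.norm(W, axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.uniform(-0.5, 0.5, size=(100, 2))       # independent sources
Y = X @ np.array([[1.0, 0.5], [0.5, 1.0]]).T    # arbitrary example mixing
W = pg_lica_step(np.eye(2), Y)
```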

### 3.2. Natural Gradient Algorithm for Whitened Data: NG-LICA.

The second algorithm is based on a *natural gradient* technique
(Amari, 1998).

When the data samples are whitened, that is, linearly transformed so that they have the identity covariance matrix, the demixing matrix $W$ can be restricted to the orthogonal group $O(d)$ without loss of generality (Hyvärinen et al., 2001).

The tangent space of $O(d)$ at $W$ is equal to the space of all matrices $U$ such that $UW^\top$ is skew symmetric, that is, $UW^\top = -(UW^\top)^\top$. The steepest direction on this tangent space, which is called the natural gradient, is given as follows (Amari, 1998):

$$\tilde\nabla \hat I_s(W) = \frac{\partial \hat I_s}{\partial W} - W\,\frac{\partial \hat I_s}{\partial W}^{\!\top} W,$$

where the canonical metric is adopted in the tangent space. Then the geodesic from $W$ in the direction of the natural gradient over $O(d)$ can be expressed by

$$W_t = \exp(-tD)\,W, \qquad D = \tilde\nabla \hat I_s(W)\, W^\top,$$

where $D$ is skew symmetric and $\exp$ denotes the matrix exponential. Thus, when we perform line search along the geodesic in the natural gradient direction, the minimizer may be searched from the set

$$\big\{\exp(-tD)\,W \mid t \ge 0\big\}.$$

That is, $t$ is chosen such that $\hat I_s$ (see equation 2.10) is minimized, and $W$ is updated as

$$W \longleftarrow \exp(-tD)\,W.$$

Geometry and optimization algorithms on a more general structure, the Stiefel manifold, are discussed in more detail in Nishimori and Akaho (2005).

A pseudo-code of the NG-LICA algorithm is summarized in algorithm 2.
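The geodesic update can be implemented so that $W$ stays exactly orthogonal. The sketch below computes the matrix exponential of a skew-symmetric matrix through the eigendecomposition of the Hermitian matrix $iD$ (numpy only; `scipy.linalg.expm` would do equally well):

```python
import numpy as np

def expm_skew(S):
    """Matrix exponential of a real skew-symmetric S via the
    eigendecomposition of the Hermitian matrix 1j * S."""
    lam, V = np.linalg.eigh(1j * S)          # 1j*S is Hermitian
    return (V @ np.diag(np.exp(-1j * lam)) @ V.conj().T).real

def ng_update(W, G, t):
    """One natural gradient geodesic step on O(d): W <- exp(-t D) W,
    where D = (G - W G^T W) W^T is skew-symmetric and G is the plain
    gradient of the SMI estimator at W."""
    nat = G - W @ G.T @ W                    # natural gradient
    D = nat @ W.T                            # skew-symmetric direction
    return expm_skew(-t * D) @ W

# The update preserves orthogonality exactly (up to rounding):
rng = np.random.default_rng(0)
W, _ = np.linalg.qr(rng.standard_normal((3, 3)))
G = rng.standard_normal((3, 3))
W_new = ng_update(W, G, t=0.1)
print(np.allclose(W_new @ W_new.T, np.eye(3)))   # True
```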

### 3.3. Remarks.

The proposed LICA algorithms can be regarded as an application of the general unconstrained least squares density-ratio estimator proposed by Kanamori et al. (2009) to SMI in the context of ICA.

SMI is closely related to the kernel independence measures developed recently
(Gretton, Bousquet, Smola, & Schölkopf, 2005; Gretton, Herbrich, Smola, Bousquet, &
Schölkopf, 2005; Fukumizu et al., 2008). In particular, it has been
shown that the normalized cross-covariance operator (NOCCO) proposed in
Fukumizu, Gretton, Sun, and Schölkopf (2008) is also an estimator of SMI for *d*=2.
However, there is no reasonable hyperparameter selection method for this and all
other kernel-based independence measures (see also Bach & Jordan, 2002, and Fukumizu et al., 2009). This is a crucial limitation in unsupervised
learning scenarios such as ICA. On the other hand, cross-validation can be
applied to our method for hyperparameter selection, as shown in section 2.3.

## 4. Experiments

In this section, we investigate the experimental performance of the proposed method.

### 4.1. Illustrative Examples.

First, we illustrate how the proposed method behaves using the following three two-dimensional data sets:

- (a) Sub-sub-gaussians: $p(x) = U(x^{(1)}; -0.5, 0.5)\,U(x^{(2)}; -0.5, 0.5)$,
- (b) Super-super-gaussians: $p(x) = L(x^{(1)}; 0, 1)\,L(x^{(2)}; 0, 1)$,
- (c) Sub-super-gaussians: $p(x) = U(x^{(1)}; -0.5, 0.5)\,L(x^{(2)}; 0, 1)$,

where $U(x; a, b)$ denotes the uniform density on $[a, b]$ and $L(x; \mu, v)$ denotes the Laplace density with mean $\mu$ and variance $v$. Let the number of samples be $n = 300$, and we observe mixed samples through a fixed $2 \times 2$ mixing matrix $A$. The observed samples are plotted in Figure 1. We employed the NG-LICA algorithm described in algorithm 2. The hyperparameters $\sigma$ and $\lambda$ in LICA were chosen by five-fold CV from the 10 values in [0.1, 1] at regular intervals and the 10 values in [0.001, 1] at regular intervals in log scale, respectively. The regularization term was set to the squared RKHS norm induced by the gaussian kernel, that is, we employed $R$ defined by equation 2.12.
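The three source densities are easy to reproduce. The snippet below draws the sources for all three data sets and mixes data set (c); the mixing matrix here is an arbitrary invertible example of ours, since the letter's actual matrix is not reproduced in this text:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300

# (a) two independent uniform (sub-gaussian) sources
x_a = rng.uniform(-0.5, 0.5, size=(n, 2))
# (b) two independent Laplace (super-gaussian) sources; scale 1/sqrt(2)
# gives mean 0 and variance 1, matching L(x; 0, 1)
x_b = rng.laplace(0.0, 1 / np.sqrt(2), size=(n, 2))
# (c) one uniform and one Laplace source
x_c = np.column_stack([rng.uniform(-0.5, 0.5, n),
                       rng.laplace(0.0, 1 / np.sqrt(2), n)])

# Mix with an invertible matrix (an arbitrary example of ours)
A = np.array([[1.0, 0.5],
              [0.5, 1.0]])
y = x_c @ A.T                     # observed mixed samples, one per row
```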

The true independent directions as well as the estimated independent directions are plotted in Figure 1. Figure 2 depicts the value of the estimated SMI over iterations, and Figure 3 depicts the elements of the demixing matrix $W$ over iterations. The results show that the estimated SMI decreases rapidly and good solutions are obtained for all the data sets. The reason the estimated SMI in Figure 2 does not decrease monotonically is that during the natural gradient optimization procedure, the hyperparameters ($\sigma$ and $\lambda$) are adjusted by CV (see algorithm 2), which can cause an increase in the objective value.

### 4.2. Performance Comparison.

Next, we compare the performance of NG-LICA with kernel ICA (KICA; Bach & Jordan, 2002), fast ICA (FICA; Hyvärinen, 1999), and JADE (Cardoso & Souloumiac, 1993). We used the three data sets a, b, and c in section 4.1; the demosig data set available in the FastICA package for Matlab; and the 10halo, Sergio7, Speech4, and c5signals data sets available in the ICALAB signal processing benchmark data sets (Cichocki & Amari, 2003). Data sets a, b, c, demosig, Sergio7, and c5signals are artificial data sets; 10halo and Speech4 are real data sets. We employed the Amari index (Amari et al., 1996) as the performance measure (smaller is better):

$$\text{Amari index} = \frac{1}{2d(d-1)}\left[\sum_{i=1}^{d}\left(\sum_{j=1}^{d}\frac{|b_{i,j}|}{\max_k |b_{i,k}|} - 1\right) + \sum_{j=1}^{d}\left(\sum_{i=1}^{d}\frac{|b_{i,j}|}{\max_k |b_{k,j}|} - 1\right)\right],$$

where $b_{i,j}$ denotes the $(i, j)$ element of $B = \hat W A$ for an estimated demixing matrix $\hat W$. We used the publicly available Matlab codes for KICA, FICA, and JADE, where default parameter settings were used. The hyperparameters $\sigma$ and $\lambda$ in LICA were chosen by five-fold CV from the 10 values in [0.1, 1] at regular intervals and the 10 values in [0.001, 1] at regular intervals in log scale, respectively. The regularization matrix $R$ was set as in equation 2.12.

We randomly generated the mixing matrix $A$ and source signals for the artificial data sets and computed the Amari index of $B = \hat W A$ for the demixing matrix $\hat W$ estimated by each method. As training samples, we used the first $n$ samples for Sergio7 and c5signals, and the $n$ samples between the 1001st and the $(1000+n)$-th for 10halo and Speech4, where we tested $n = 200$ and 500.
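The Amari index is straightforward to implement; it is zero exactly when $\hat W A$ is a scaled permutation matrix, that is, when separation succeeds up to the ICA indeterminacies:

```python
import numpy as np

def amari_index(W_hat, A):
    """Normalized Amari index of B = W_hat @ A; lies in [0, 1] and is 0
    exactly when B is a scaled permutation matrix."""
    B = np.abs(W_hat @ A)
    d = B.shape[0]
    row = (B / B.max(axis=1, keepdims=True)).sum(axis=1) - 1
    col = (B / B.max(axis=0, keepdims=True)).sum(axis=0) - 1
    return (row.sum() + col.sum()) / (2 * d * (d - 1))

A = np.array([[1.0, 0.5], [0.5, 1.0]])           # example mixing matrix
print(amari_index(np.linalg.inv(A), A))          # ~0: perfect demixing
P = np.array([[0.0, 2.0], [-3.0, 0.0]])          # permutation + scaling
print(amari_index(P @ np.linalg.inv(A), A))      # ~0: still perfect
```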

The performance of each method is summarized in Table 2, which depicts the mean and standard deviation of the Amari index over 50 trials. NG-LICA overall shows good performance. KICA tends to work reasonably well for data sets a, b, c, and demosig, but it performs poorly for the ICALAB data sets; this seems to be caused by an inappropriate choice of the gaussian kernel width and local optima. FICA and JADE tend to work reasonably well for the ICALAB data sets, but perform poorly for a, b, c, and demosig. We conjecture that the contrast functions in FICA and the fourth-order statistics in JADE did not appropriately catch the nongaussianity of data sets a, b, c, and demosig. Overall, the proposed LICA algorithm is shown to be a promising ICA method.

Table 2: Mean (standard deviation) of the Amari index over 50 trials.

| Data Set | n | NG-LICA | KICA | FICA | JADE |
|---|---|---|---|---|---|
| a | 200 | 0.05(0.03) | 0.04(0.02) | 0.06(0.03) | 0.04(0.02) |
|  | 500 | 0.03(0.01) | 0.03(0.01) | 0.03(0.02) | 0.02(0.01) |
| b | 200 | 0.06(0.04) | 0.12(0.15) | 0.16(0.20) | 0.15(0.17) |
|  | 500 | 0.04(0.03) | 0.05(0.04) | 0.11(0.12) | 0.05(0.04) |
| c | 200 | 0.08(0.05) | 0.09(0.06) | 0.14(0.11) | 0.13(0.09) |
|  | 500 | 0.04(0.03) | 0.04(0.03) | 0.09(0.08) | 0.10(0.06) |
| demosig | 200 | 0.04(0.01) | 0.05(0.11) | 0.08(0.05) | 0.08(0.08) |
|  | 500 | 0.02(0.01) | 0.04(0.09) | 0.04(0.03) | 0.04(0.02) |
| 10halo | 200 | 0.29(0.02) | 0.38(0.03) | 0.33(0.07) | 0.36(0.00) |
|  | 500 | 0.22(0.02) | 0.37(0.03) | 0.22(0.03) | 0.28(0.00) |
| Sergio7 | 200 | 0.04(0.01) | 0.38(0.04) | 0.05(0.02) | 0.07(0.00) |
|  | 500 | 0.05(0.02) | 0.37(0.03) | 0.04(0.01) | 0.04(0.00) |
| Speech4 | 200 | 0.18(0.03) | 0.29(0.05) | 0.20(0.03) | 0.22(0.00) |
|  | 500 | 0.07(0.00) | 0.10(0.04) | 0.10(0.04) | 0.06(0.00) |
| c5signals | 200 | 0.12(0.01) | 0.25(0.15) | 0.10(0.02) | 0.12(0.00) |
|  | 500 | 0.06(0.04) | 0.07(0.06) | 0.04(0.02) | 0.07(0.00) |


Notes: Data sets a, b, and c are taken from section 4.1. The demosig data set is from the
FastICA package. The 10halo, Sergio7, Speech4, and c5signals data
sets are taken from the ICALAB benchmark data sets. The best method
in terms of the mean Amari index and comparable ones based on the
one-sided *t*-test at the significance level 1% are
in bold.

## 5. Conclusion

In this letter, we proposed a new ICA method based on a squared-loss variant of mutual information. The proposed method, least-squares ICA (LICA), has several preferable properties: for example, it is distribution free, and hyperparameter selection by cross-validation is available.

Similar to other ICA algorithms, the optimization problem involved in LICA is nonconvex. Thus, in practice it is very important to develop good heuristics for initialization and for avoiding local optima in the gradient procedures; this is an open research topic to be investigated. Moreover, although our SMI estimator is analytic, the LICA algorithm is still computationally rather expensive because it repeatedly solves systems of linear equations and performs cross-validation. Our future work will address this computational issue, for example, by vectorization and parallelization.

## Appendix: Derivation of the Gradient of the SMI Estimator

Recall from equation 2.10 that $\hat I_s = \frac{1}{2}\hat h^\top\hat\alpha - \frac{1}{2}$, where $\hat\alpha = (\hat H + \lambda R)^{-1}\hat h$ and the demixing matrix $W$ enters through the demixed samples $z_i = W y_i$ in the basis functions. Differentiating $\hat I_s$ with respect to an element $W_{k,l}$ of the demixing matrix gives

$$\frac{\partial \hat I_s}{\partial W_{k,l}} = \frac{1}{2}\frac{\partial \hat h^\top}{\partial W_{k,l}}\hat\alpha + \frac{1}{2}\hat h^\top\frac{\partial \hat\alpha}{\partial W_{k,l}}. \tag{A.1}$$

Differentiating the linear system $(\hat H + \lambda R)\hat\alpha = \hat h$ with respect to $W_{k,l}$, the partial derivative of $\hat\alpha$ is given by

$$\frac{\partial \hat\alpha}{\partial W_{k,l}} = \big(\hat H + \lambda R\big)^{-1}\left(\frac{\partial \hat h}{\partial W_{k,l}} - \left(\frac{\partial \hat H}{\partial W_{k,l}} + \lambda\frac{\partial R}{\partial W_{k,l}}\right)\hat\alpha\right).$$

Substituting this in equation A.1, we obtain the gradient in terms of the partial derivatives of $\hat H$, $\hat h$, and $R$, which gives equation 3.2.
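The derivative of $\hat\alpha$ used above can be checked numerically on a toy parameterization. The functions $H(\theta)$ and $h(\theta)$ below are our own arbitrary smooth test case (with $R$ constant, so its derivative vanishes), not quantities from the letter:

```python
import numpy as np

rng = np.random.default_rng(0)
b = 5
lam = 0.1
M = rng.standard_normal((b, b))
R = np.eye(b)                                    # constant regularizer

def H_of(theta):
    # positive definite for theta >= 0: PD base plus a PSD perturbation
    return M @ M.T + np.eye(b) + theta * np.diag(np.arange(b, dtype=float))

def h_of(theta):
    return np.linspace(1, 2, b) + theta * np.arange(b, dtype=float)

def alpha_of(theta):
    return np.linalg.solve(H_of(theta) + lam * R, h_of(theta))

# Analytic derivative from differentiating (H + lam R) alpha = h:
#   d_alpha = (H + lam R)^{-1} (dh - dH @ alpha)
theta = 0.3
alpha = alpha_of(theta)
dH = np.diag(np.arange(b, dtype=float))          # dH/dtheta
dh = np.arange(b, dtype=float)                   # dh/dtheta
d_alpha = np.linalg.solve(H_of(theta) + lam * R, dh - dH @ alpha)

# Central finite-difference check
eps = 1e-6
fd = (alpha_of(theta + eps) - alpha_of(theta - eps)) / (2 * eps)
print(np.max(np.abs(fd - d_alpha)))              # small discretization error
```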

## Acknowledgments

We thank Takafumi Kanamori for his valuable comments. T.S. was supported in part by the JSPS Research Fellowships for Young Scientists and Global COE Program “The Research and Training Center for New Development in Mathematics,” MEXT, Japan. M.S. acknowledges support from SCAT, AOARD, and the JST PRESTO program.