## Abstract

The goal of sufficient dimension reduction in supervised learning is to find the low-dimensional subspace of input features that contains all of the information about the output values that the input features possess. In this letter, we propose a novel sufficient dimension-reduction method using a squared-loss variant of mutual information as a dependency measure. We apply a density-ratio estimator for approximating squared-loss mutual information that is formulated as a minimum contrast estimator on parametric or nonparametric models. Since cross-validation is available for choosing an appropriate model, our method does not require any prespecified structure on the underlying distributions. We elucidate the asymptotic bias of our estimator on parametric models and the asymptotic convergence rate on nonparametric models. The convergence analysis utilizes the uniform tail-bound of a U-process, and the convergence rate is characterized by the bracketing entropy of the model. We then develop a natural gradient algorithm on the Grassmann manifold for sufficient subspace search. The analytic formula of our estimator allows us to compute the gradient efficiently. Numerical experiments show that the proposed method compares favorably with existing dimension-reduction approaches on artificial and benchmark data sets.

## 1.  Introduction

The purpose of dimension reduction in supervised learning is to find the low-dimensional subspace of input features that has sufficient information for predicting output values. Sufficient dimension reduction (SDR), initiated by Li (1991), aims at finding a low-rank projection matrix such that, given the relevant subspace of input features, the rest becomes conditionally independent of output values (Cook, 1998b; Chiaromonte & Cook, 2002). Such a low-dimensional subspace contains all of the information about the output that the covariate contains. Finding such a subspace not only allows us to use the dimension-reduced features for estimating input-output relations, but also offers insights into which features are important.

A traditional dependency measure between random variables is the Pearson correlation coefficient (PCC), which detects only linear dependency and is therefore suited to gaussian data. However, the gaussian assumption is rarely fulfilled in practice. Recently, kernel-based dimension-reduction methods have been studied to overcome this weakness. The Hilbert-Schmidt independence criterion (HSIC) (Gretton, Bousquet, Smola, & Schölkopf, 2005) utilizes cross-covariance operators on universal reproducing kernel Hilbert spaces (RKHSs) (Steinwart, 2001). Cross-covariance operators are an infinite-dimensional generalization of covariance matrices. HSIC allows one to efficiently detect nonlinear dependency by making use of the reproducing property of RKHSs (Aronszajn, 1950). Its usefulness in feature selection scenarios has been shown in Song, Smola, Gretton, Borgwardt, and Bedo (2007). However, HSIC has several theoretical and practical weaknesses. Theoretically, it evaluates independence between random variables, not conditional independence; thus, it does not perform SDR in a strict sense. From the practical point of view, HSIC evaluates the covariance between random variables, not the correlation. This means that a change of input-feature scaling affects the dimension-reduction solution, which is not preferable in practice.
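As a concrete illustration of this limitation (our illustration, not part of the original comparison), PCC fails to detect even a simple quadratic dependency, while it easily detects a linear one:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.standard_normal(n)

# Quadratic dependence: y is a deterministic function of x plus small noise,
# but E[x * y] = E[x^3] = 0 for symmetric x, so PCC is blind to it.
y_quad = x ** 2 + 0.1 * rng.standard_normal(n)
pcc_quad = np.corrcoef(x, y_quad)[0, 1]

# Linear dependence: PCC detects it easily.
y_lin = 2 * x + 0.1 * rng.standard_normal(n)
pcc_lin = np.corrcoef(x, y_lin)[0, 1]

print(abs(pcc_quad) < 0.05, pcc_lin > 0.99)  # True True
```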

Kernel dimension reduction (KDR) (Fukumizu, Bach, & Jordan, 2004) can overcome these weaknesses. KDR evaluates conditional covariance using the kernel trick, and thus it directly performs SDR. Through experiments, KDR was demonstrated to outperform other dimension-reduction schemes such as canonical correlation analysis (Hotelling, 1936; Breiman & Friedman, 1985), partial least squares (Wold, 1966; Goutis & Fearn, 1996; Durand & Sabatier, 1997; Reiss & Ogden, 2007), sliced inverse regression (Li, 1991; Bura & Cook, 2001; Cook & Ni, 2005; Zhu, Miao, & Peng, 2006), and the principal Hessian direction (Li, 1992; Cook, 1998a; Li, Lue, & Chen, 2000). Theoretical properties of KDR such as consistency have been studied thoroughly (Fukumizu, Bach, & Jordan, 2009). However, KDR still has a weakness in practice: the performance of KDR (and also HSIC) depends on the choice of kernel parameters (e.g., the gaussian width) and the regularization parameter. So far, there seems to be no model selection method for KDR and HSIC (as discussed in Fukumizu et al., 2009).

Another possible criterion for SDR is mutual information (MI) (Cover & Thomas, 2006). MI could be directly employed for SDR since maximizing MI between the output and the projected input leads to conditional independence between output and input given the projected input. So far, a great deal of effort has been made to estimate MI accurately, for example, based on an adaptive histogram (HIST) (Darbellay & Vajda, 1999), kernel density estimation (KDE) (Torkkola, 2003), nearest-neighbor distance (NN) (Kraskov, Stögbauer, & Grassberger, 2004), Edgeworth expansion (EDGE) (Hulle, 2005), and maximum-likelihood MI estimation (MLMI) (Suzuki, Sugiyama, Sese, & Kanamori, 2008). Among them, MLMI has been shown to possess various practical advantages.

As summarized in Table 1, MLMI affords model selection by cross-validation, whereas there is no systematic method to choose tuning parameters for HSIC, KDR, and NN. MLMI does not require specific structures on the underlying distributions, whereas EDGE requires the distributions to be nearly gaussian. Moreover, MLMI does not involve density estimation of the underlying distributions, and thus it tends to perform well in practice.

Table 1:
Summary of Existing and Proposed Dependency Measures.
| Methods | Nonlinear Dependence | Model Selection | Distribution | Density Estimation | Feature Extraction |
|---|---|---|---|---|---|
| PCC | Not detectable | Not necessary | Gaussian | Not involved | Possible |
| HSIC | Detectable | Not available | Free | Not involved | Possible |
| KDR | Detectable | Not available | Free | Not involved | Possible |
| HIST | Detectable | Available | Free | Involved | Not available |
| KDE | Detectable | Available | Free | Involved | Possible |
| NN | Detectable | Not available | Free | Not involved | Not available |
| EDGE | Detectable | Not necessary | Near gaussian | Not involved | Possible |
| MLMI | Detectable | Available | Free | Not involved | Not available |
| LSMI | Detectable | Available | Free | Not involved | Possible |

Note: The boldface corresponds to a beneficial feature of the method.

Based on this comparison, we want to employ the MLMI method for dimension reduction. However, this is not straightforward since the MLMI estimator is not explicit; that is, it is implicitly defined as the solution of an optimization problem and is computed numerically. In the dimension-reduction (or feature-extraction) scenario, the projection matrix needs to be optimized with respect to an MI approximator. To cope with this problem, we adopt a squared-loss variant of MI called the squared-loss MI (SMI) as our independence measure and use an estimator of SMI called least-squares MI (LSMI) (Suzuki, Sugiyama, Kanamori, & Sese, 2009) for dimension reduction. LSMI inherits good properties from MLMI and, moreover, provides an analytic SMI estimator that gives an analytic formula for its derivative (see Table 1).

The goal of this letter is to develop a dimension-reduction algorithm based on LSMI. Our first contribution is to theoretically elucidate the rate of convergence of the LSMI estimator in parametric and nonparametric settings. Then we develop a practical dimension-reduction algorithm based on LSMI, which we call least-squares dimension reduction (LSDR). LSDR optimizes the projection matrix using a natural gradient algorithm (Amari, 1998) on the Grassmann manifold. Finally, through numerical experiments, we show the usefulness of the LSDR method.

## 2.  Dimension Reduction via SMI Estimation

In this section, we first formulate the problem of sufficient dimension reduction (SDR) (Cook, 1998b; Chiaromonte & Cook, 2002) and show how squared-loss mutual information (SMI) can be employed in the context of SDR. Then we introduce a method of approximating SMI without going through density estimation and elucidate convergence properties of the SMI estimator. Finally, we develop a dimension-reduction method based on the SMI estimator.

### 2.1.  Sufficient Dimension Reduction

Let 𝒳 (⊂ ℝ^d) be the domain of input feature x and 𝒴 be the domain of output data y. We suppose that 𝒴 is equipped with a σ-algebra, and there is a base measure denoted by dy. As for 𝒳, we consider the standard σ-algebra of the Lebesgue measurable sets and denote the Lebesgue measure by dx. We assume there is a joint density pxy(x, y) defined on the product space 𝒳 × 𝒴 with respect to dx dy.

To search a subspace of the input space containing sufficient information about the output, we utilize the Grassmann manifold, that is, the set of all m-dimensional subspaces in ℝ^d. The Grassmann manifold is obtained by identifying those orthonormal matrices whose rows span the same subspace:
$$\{ W \in \mathbb{R}^{m \times d} \mid W W^{\top} = I_m \} / \sim, \tag{2.1}$$
where ⊤ denotes the transpose, I_m is the m-dimensional identity matrix, and ∼ is the equivalence relation such that W ∼ W′ if the rows of W and W′ span the same subspace.
Let be any projection matrix corresponding to a member of the Grassmann manifold . Let () be the orthogonal projection of input x given by :
Suppose that satisfies
2.2
That is, given the projected feature , the (remaining) feature x is conditionally independent of output y and can be discarded without sacrificing the predictability of y. Note that the conditional independence is invariant against the choice of the representative .
Suppose that we are given n independent and identically distributed (i.i.d.) paired samples,
$$D_n = \{(x_i, y_i)\}_{i=1}^{n}, \tag{2.3}$$
drawn from a joint distribution with density pxy(x, y). The goal of SDR is, from data Dn, to find a projection matrix whose range agrees with that of . For a projection matrix W, we write
We assume that m is known throughout this letter.

### 2.2.  Squared-Loss Mutual Information

A direct approach to SDR would be to determine W so that equation 2.2 is fulfilled. Let z = Wx denote the projected feature for a projection matrix W. To this end, we adopt SMI as our criterion to be maximized with respect to W:
$$\mathrm{SMI}(Y, Z) = \frac{1}{2} \iint \left( \frac{p_{yz}(y, z)}{p_y(y)\, p_z(z)} - 1 \right)^{2} p_y(y)\, p_z(z)\, dy\, dz, \tag{2.4}$$
where pyz(y, z) denotes the joint density of y and z, and py(y) and pz(z) denote the marginal densities of y and z, respectively. SMI(Y, Z) allows us to evaluate the independence between y and z since SMI(Y, Z) vanishes if and only if
$$p_{yz}(y, z) = p_y(y)\, p_z(z).$$
Note that equation 2.4 corresponds to the f-divergence (Ali & Silvey, 1966; Csiszár, 1967) from pyz(y, z) to py(y)pz(z) with the squared loss, while ordinary MI corresponds to the f-divergence with the log loss, that is, the Kullback-Leibler (KL) divergence (Kullback & Leibler, 1951):
$$\mathrm{MI}(Y, Z) = \iint p_{yz}(y, z)\, \log \frac{p_{yz}(y, z)}{p_y(y)\, p_z(z)}\, dy\, dz.$$
Thus, SMI could be regarded as a natural alternative to ordinary MI.
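For intuition, SMI can be evaluated in closed form for a discrete joint distribution. The following small sketch (our illustration, not part of the letter) confirms the property stated above: SMI vanishes for an independent joint table and is positive otherwise.

```python
import numpy as np

def smi(p_yz):
    """SMI for a discrete joint probability table p_yz[i, j] = p(y_i, z_j)."""
    p_y = p_yz.sum(axis=1, keepdims=True)
    p_z = p_yz.sum(axis=0, keepdims=True)
    ratio = p_yz / (p_y * p_z)                      # density-ratio analogue
    return 0.5 * np.sum((ratio - 1.0) ** 2 * p_y * p_z)

independent = np.outer([0.3, 0.7], [0.6, 0.4])      # p(y, z) = p(y) p(z)
dependent = np.array([[0.45, 0.05],
                      [0.05, 0.45]])                # strong dependence

print(np.isclose(smi(independent), 0.0))  # True
print(smi(dependent))                     # positive (≈ 0.32)
```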

The rationale behind the use of SMI in the context of SDR relies on the following lemma:

Lemma 1.
Decompose x into z and the component orthogonal to z as , that is, z is a member of the image of W and is a member of the subspace perpendicular to the image of W. Let , and py|z(y|z) be conditional densities. Then we have
A proof of this lemma is given in appendix A. Lemma 1 implies
and the equality holds if and only if
which is equivalent to equation 2.2. Thus, equation 2.2 can be achieved by maximizing SMI(Z, Y) with respect to W; then the sufficient subspace can be identified.

Now we want to find the projection matrix W that maximizes SMI(Z, Y). However, SMI is inaccessible in practice since densities pyz(y, z), py(y), and pz(z) are unknown. Thus, SMI needs to be estimated from data samples. We next introduce an SMI estimator.

### 2.3.  SMI Approximation via Density-Ratio Estimation

Here, we consider a fixed projection matrix W and discuss the problem of approximating SMI from samples. The convex duality (Boyd & Vandenberghe, 2004) gives the variational representation (Keziou, 2003; Nguyen, Wainwright, & Jordan, 2010) of SMI as
where the infimum is taken over all measurable functions of (y, z), and
$$J(g) = \frac{1}{2} \iint g(y, z)^{2}\, p_y(y)\, p_z(z)\, dy\, dz - \iint g(y, z)\, p_{yz}(y, z)\, dy\, dz. \tag{2.5}$$
This can be checked as follows. For , we have
where f* is the convex conjugate of f (Boyd & Vandenberghe, 2004). Thus, computing SMI is reduced to finding the minimizer of J(g). We can show that the minimizer g* is given by
$$g^{*}(y, z) = \frac{p_{yz}(y, z)}{p_y(y)\, p_z(z)}. \tag{2.6}$$
Thus, estimating SMI(Y, Z) is reduced to estimating the above density ratio. We do not take the approach of plugging density estimators of pyz, py, and pz into formula 2.6 because, in a region where py(y)pz(z) is small, a small estimation error in pyz(y, z) is strongly amplified. To avoid this unstable behavior around the tails, we directly model the density ratio itself and impose regularization to control the instability of density-ratio estimators when needed.
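The variational argument can be written out explicitly (reconstructed from the definitions above; notation follows equations 2.5 and 2.6):

```latex
J(g) = \frac{1}{2}\iint g(y,z)^2\, p_y(y)\,p_z(z)\,dy\,dz
     - \iint g(y,z)\, p_{yz}(y,z)\,dy\,dz,
\qquad
g^{*} = \operatorname*{arg\,min}_{g} J(g) = \frac{p_{yz}}{p_y\, p_z},
\qquad
-J(g^{*}) - \frac{1}{2}
 = \frac{1}{2}\iint \frac{p_{yz}(y,z)^2}{p_y(y)\,p_z(z)}\,dy\,dz - \frac{1}{2}
 = \mathrm{SMI}(Y, Z).
```

The middle identity follows from pointwise minimization of the integrand, and the last equality follows by expanding the square in equation 2.4.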

Next we consider parametric and nonparametric methods for estimating SMI.

#### 2.3.1.  Parametric Convergence Analysis

Let us consider the case where the function class from which the function g is searched is a parametric model:
Suppose that the true density-ratio is contained in the model , that is, there exists () such that . Approximating the probability densities pyz(y, z), py(y), and pz(z) in equation 2.5 by their empirical distributions, we obtain the following optimization problem:
2.7
Then an SMI approximator can be constructed as
2.8
Suppose the standard regularity conditions for consistency are satisfied (see, e.g., section 3.2.1 of van der Vaart & Wellner, 1996). Let A and B be matrices defined as
where and are copies of y and z and the partial derivative is taken with respect to the th element of the parameter . Then we have the following theorem:
Theorem 1.
Suppose that the matrix A is positive definite. Then the SMI estimator 2.8 satisfies
2.9
where denotes the asymptotic order in probability. Furthermore, we have
2.10
where E denotes the expectation over data samples Dn (see equation 2.3).

A proof of theorem 1 can be found in appendix B. This theorem means that the above SMI estimator is optimal in terms of the order of convergence in n, since Op(n^{-1/2}) is the optimal convergence rate in the parametric setup (van der Vaart, 2000).

#### 2.3.2.  Nonparametric Convergence Analysis

Next, we consider nonparametric cases. Let the function class be a general set of functions on , where . Let us consider a nonparametric version of the empirical problem (cf. equation 2.7):
2.11
where R(g) is a nonnegative regularization functional such that
2.12
Then a nonparametric version of SMI approximator is given as
A useful example is to use a reproducing kernel Hilbert space (RKHS) (Aronszajn, 1950) as and the RKHS norm as R(g). Suppose is an RKHS associated with bounded kernel :
2.13
Let denote the norm in the RKHS . Then satisfies equation 2.12:
where we used the reproducing property of the kernel and the Cauchy-Schwarz inequality. Note that the gaussian kernel satisfies this with C=1. It is known that the gaussian kernel RKHS spans a dense subset of the set of continuous functions. Another example of an RKHS is a Sobolev space, whose canonical norm is the integral of the squared derivatives of functions; the regularization term then forces the solution to be smooth. The RKHS technique in Sobolev spaces has been well exploited in the context of spline models (Wahba, 1990). We regard the regularization functional R(g) as a generalization of the RKHS norm; roughly speaking, R(g) plays the role of a "norm" on the function space.
We assume that the true density-ratio function is contained in the model and is bounded from above:
Let be a ball of with radius M>0:
To derive the convergence rate of our estimator, we use the bracketing entropy, a complexity measure of a function class (van der Vaart & Wellner, 1996):
Definition 1.

Given two functions l and u, the bracket [l, u] is the set of all functions f with l(y, z) ≤ f(y, z) ≤ u(y, z) for all y and z. An ε-bracket is a bracket [l, u] with ‖u − l‖ ≤ ε. The bracketing entropy is the logarithm of the minimum number of ε-brackets needed to cover a function set.

We assume that there exists γ such that for all M>0,
2.14
This quantity represents the complexity of the function class: the larger γ is, the more complex the function class is because, for larger γ, more brackets are needed to cover the function class. The gaussian RKHS satisfies this condition for arbitrarily small γ (Steinwart & Scovel, 2007). Note that when R(g) is the RKHS norm, condition 2.14 holds for all M>0 if it holds for M=1.

Then we have the following theorem:

Theorem 2.
Under the above setting, if and , then we have
2.15

A proof of theorem 2 is in appendix C. The two conditions on the regularization parameter roughly mean that it should be sufficiently small but not too small. This theorem shows that the nonparametric version achieves essentially the same convergence rate as the parametric one if the regularization parameter is chosen appropriately for sufficiently small γ. However, the nonparametric method requires a milder model assumption.

According to Nguyen et al. (2010), where a log-loss version of theorem 2 has been proven in the context of KL divergence estimation, the above convergence rate achieves the optimal minimax rate under some setup. Thus, the convergence property of the above nonparametric method would also be optimal in the same sense.

As stated above, the gaussian RKHS satisfies the bracketing number condition 2.14 for arbitrarily small γ. Thus, SMI estimation with the gaussian kernel achieves the convergence rate 2.15 with arbitrarily small γ. However, the gaussian RKHS is not sufficiently rich to estimate a function in a Sobolev space at the optimal rate; to do so, the gaussian width needs to be adjusted depending on the sample size and the regularity of the Sobolev space. Analyzing the convergence rate for varying gaussian widths requires techniques other than those used in this letter, which cover only a fixed gaussian width. To cope with varying gaussian widths, the techniques recently developed by Eberts and Steinwart (2011) are useful.

#### 2.3.3.  Practical Implementation

Here we describe a practical version of the above SMI approximation method, least-squares mutual information (LSMI) (Suzuki et al., 2009).

Let us restrict the search space of function g to some linear subspace ,
2.16
where α is a parameter vector to be learned from the samples and φ(y, z) is a b-dimensional vector of basis functions such that φ(y, z) ≥ 0_b,
where 0_b is the b-dimensional vector with all zeros and the inequality for vectors is applied in the element-wise manner. The basis functions may depend on the samples, that is, kernel models are also allowed. Later, in section 2.3.5, we explain how the basis functions are designed.
Let us approximate the probability densities pyz(y, z), py(y), and pz(z) in equation 2.5 by their empirical distributions. Then we have
2.17
where we included for regularization purposes and
2.18
Note that when the model is an RKHS corresponding to a kernel k, g is a member of the RKHS, and the RKHS norm of g satisfies ‖g‖² = αᵀRα, where R is the Gram matrix of the kernel at the basis centers. Thus, the regularization term satisfies condition 2.12 if the kernel function is bounded as in equation 2.13.
Differentiating the objective function, equation 2.17, with respect to and equating it to zero, we obtain
2.19
Thus, the solution can be obtained just by solving the above system of linear equations. The solution is given analytically as
Then we can analytically approximate SMI as
2.20
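The analytic computation above can be sketched numerically as follows. This is our re-implementation sketch, not the authors' code: the regularization matrix R is taken to be the identity, a single gaussian width is shared by the y- and z-kernels, and the plug-in output (1/2) ĥᵀα̂ − 1/2 is one common form of the LSMI estimate; all of these are assumptions for illustration.

```python
import numpy as np

def gauss_gram(a, b, sigma):
    """Gaussian kernel matrix between the rows of a (n x p) and b (m x p)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def lsmi(z, y, sigma=1.0, lam=1e-3, b=100, seed=0):
    """Analytic SMI approximator with product gaussian kernels (sketch)."""
    n = len(z)
    b = min(b, n)
    centers = np.random.default_rng(seed).choice(n, size=b, replace=False)
    Kz = gauss_gram(z, z[centers], sigma)            # n x b
    Ky = gauss_gram(y, y[centers], sigma)            # n x b
    # H averages over all n^2 pairs (y_i, z_j); the product kernel lets it
    # factorize, so only 2n kernel columns are ever evaluated.
    H = (Kz.T @ Kz) * (Ky.T @ Ky) / n ** 2           # b x b
    h = (Kz * Ky).mean(axis=0)                       # b, uses paired (y_i, z_i)
    alpha = np.linalg.solve(H + lam * np.eye(b), h)  # analytic solution
    return 0.5 * h @ alpha - 0.5                     # plug-in SMI estimate

rng = np.random.default_rng(1)
n = 300
z = rng.standard_normal((n, 1))
y_dep = z + 0.3 * rng.standard_normal((n, 1))        # dependent pair
y_ind = rng.standard_normal((n, 1))                  # independent pair
print(lsmi(z, y_dep) > lsmi(z, y_ind))               # True
```

The estimate for the dependent pair is large, while that for the independent pair stays near zero, mirroring the behavior of SMI itself.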

#### 2.3.4.  Model Selection by Cross-Validation

As seen in sections 2.3.1 and 2.3.2, our SMI estimator possesses preferable convergence properties. Nevertheless, its practical performance depends on the choice of basis functions and the regularization parameter. In order to determine the basis functions and the regularization parameter, cross-validation (CV) is available for the LSMI estimator: First, the samples are divided into K disjoint subsets of (approximately) the same size. Then an estimator is obtained using K-1 subsets (i.e., without the holdout subset), and the approximation error for the holdout samples is computed. This procedure is repeated for all K choices of the holdout subset, and the mean is output:
2.21
where and denote and computed on the holdout samples in equation 2.18. For model selection, we compute for all model candidates (the basis function and the regularization parameter ), and choose the best model that minimizes . We can show that is an almost unbiased estimator of the objective function J, where the “almost-ness” comes from the fact that the sample size is reduced in the CV procedure due to data splitting (Schölkopf & Smola, 2002).
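The K-fold procedure described above can be sketched as a grid search. The helper names, candidate grids, fold assignment, and identity regularization matrix below are our illustrative assumptions:

```python
import itertools
import numpy as np

def gauss_gram(a, b, sigma):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def j_terms(Kz, Ky):
    # Empirical H and h (the quantities used by LSMI) on the given samples.
    n = Kz.shape[0]
    H = (Kz.T @ Kz) * (Ky.T @ Ky) / n ** 2
    h = (Kz * Ky).mean(axis=0)
    return H, h

def cv_choose(z, y, sigmas, lams, K=5, b=50, seed=0):
    """Return the (sigma, lam) pair minimizing the K-fold hold-out objective."""
    rng = np.random.default_rng(seed)
    n = len(z)
    centers = rng.choice(n, size=min(b, n), replace=False)
    folds = rng.permutation(n) % K                   # random balanced folds
    best, best_score = None, np.inf
    for sig, lam in itertools.product(sigmas, lams):
        Kz = gauss_gram(z, z[centers], sig)
        Ky = gauss_gram(y, y[centers], sig)
        score = 0.0
        for k in range(K):
            tr, te = folds != k, folds == k
            H_tr, h_tr = j_terms(Kz[tr], Ky[tr])     # fit on K-1 folds
            alpha = np.linalg.solve(H_tr + lam * np.eye(len(centers)), h_tr)
            H_te, h_te = j_terms(Kz[te], Ky[te])     # evaluate on holdout fold
            score += (0.5 * alpha @ H_te @ alpha - h_te @ alpha) / K
        if score < best_score:
            best, best_score = (sig, lam), score
    return best

rng = np.random.default_rng(2)
z = rng.standard_normal((200, 1))
y = np.sin(2 * z) + 0.2 * rng.standard_normal((200, 1))
sigma, lam = cv_choose(z, y, sigmas=[0.3, 1.0, 3.0], lams=[1e-3, 1e-1])
```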

For the parametric setup, we may derive an asymptotic unbiased estimator of J (also known as an information criterion, Akaike, 1974) based on theorem 1, which could be employed for model selection. However, we do not pursue this direction in this letter.

#### 2.3.5.  Design of Basis Functions

The above CV procedure would be useful when good candidates of basis functions are prepared. Here we propose to use the product kernel of the following form as basis functions:
since the number of kernel evaluations is reduced from n² to 2n:
In the regression scenarios where y is continuous, we use the gaussian kernel as the base kernels,
where are gaussian centers randomly chosen from . More precisely, we set and , where are randomly chosen from without replacement.

The rationale behind this basis function choice is as follows. The density ratio, equation 2.6, tends to take large values if py(y)pz(z) is small and pyz(y, z) is large. When a nonnegative function is approximated by a gaussian kernel model, many kernels may be needed in the region where the output of the target function is large. However, only a small number of kernels would be enough in the region where the output of the target function is close to zero. Following this heuristic, we decided to allocate many kernels in the regions where pyz(y, z) is large; this can be achieved by setting the gaussian centers at the paired samples.

In the classification scenarios where y is categorical, we use the delta kernel for y:
where is 1 if and 0 otherwise. Note that in this case, the matrix becomes block diagonal, given that the samples are sorted according to the class labels. Then the linear equation, 2.19, can be solved efficiently.
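The block-diagonal structure can be exploited by solving the linear system classwise. The following sketch (our construction, with the delta kernel written as an indicator matrix) verifies that per-block solves reproduce the full solve:

```python
import numpy as np

rng = np.random.default_rng(0)
n, b_per_class, classes = 120, 20, 3
z = rng.standard_normal((n, 2))
y = np.repeat(np.arange(classes), n // classes)      # samples sorted by label

# Product basis with a delta kernel on y: phi_l(y, z) = K(z, z_l) * 1[y == y_l],
# with centers grouped class by class.
centers = np.concatenate([np.flatnonzero(y == c)[:b_per_class] for c in range(classes)])
d2 = ((z[:, None, :] - z[centers][None, :, :]) ** 2).sum(axis=2)
Kz = np.exp(-d2 / 2.0)
Dy = (y[:, None] == y[centers][None, :]).astype(float)

H = (Kz.T @ Kz) * (Dy.T @ Dy) / n ** 2   # block diagonal: centers sorted by class
h = (Kz * Dy).mean(axis=0)
lam = 1e-3
A = H + lam * np.eye(len(centers))

# Full solve vs. per-class block solves give the same coefficients.
alpha_full = np.linalg.solve(A, h)
alpha_blocks = np.concatenate([
    np.linalg.solve(A[np.ix_(blk, blk)], h[blk])
    for blk in np.split(np.arange(len(centers)), classes)
])
print(np.allclose(alpha_full, alpha_blocks))  # True
```

Solving three 20-by-20 systems instead of one 60-by-60 system is what makes the classification case efficient.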

More generally, when y is structured (e.g., strings, trees, and graphs), we may employ kernels for structured data as (Lodhi, Saunders, Shawe-Taylor, Cristianini, & Watkins, 2002; Collins & Duffy, 2002; Kashima & Koyanagi, 2002; Kondor & Lafferty, 2002; Kashima, Tsuda, & Inokuchi, 2003; Gärtner, 2003; Gärtner, Flach, & Wrobel, 2003).

### 2.4.  Least-Squares Dimension Reduction

Finally, we show how the SMI approximator is employed for dimension reduction. To find a sufficient subspace, the dimension-reduction problem is cast as an optimization problem over the Grassmann manifold (see equation 2.1).

Here we employ a gradient-ascent algorithm to find the maximizer of the LSMI approximator with respect to W. After a few lines of calculations, we can show that the gradient is given by
where we used the formula $\frac{\partial X^{-1}}{\partial t} = -X^{-1} \frac{\partial X}{\partial t} X^{-1}$ for a positive symmetric matrix X, each element of which is a function of t.
In the Euclidean space, the ordinary gradient gives the steepest direction; on a manifold, however, the natural gradient (Amari, 1998) does. The natural gradient at W is the projection of the ordinary gradient onto the tangent space of the Grassmann manifold at W. If the tangent space is equipped with the canonical metric, the natural gradient is given as follows (Edelman, Arias, & Smith, 1998):
2.22
where is any matrix such that is orthogonal. Then the geodesic from W to the direction of the natural gradient over can be expressed using t () as
where exp for a matrix denotes the matrix exponential and Ob is the zero matrix (note that the derivative coincides with the natural gradient, equation 2.22; see Edelman et al., 1998 for detailed derivation of the geodesic). Thus, line search along the geodesic in the natural gradient direction is equivalent to finding the maximizer from .
For choosing the step size of each gradient update, we may use some approximate line search method such as Armijo's rule (Patriksson, 1999) or backtracking line search (Boyd & Vandenberghe, 2004). In our setting, Armijo's rule finds the step size as the maximum tk that satisfies
where t0=1, (), , and are given parameters.
We call the proposed dimension-reduction algorithm least-squares dimension reduction (LSDR). The procedure is summarized in algorithm 1. In practice, we perform CV once every several gradient steps because executing CV at every step is computationally expensive.
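A sketch of the update step (our illustration following Edelman et al., 1998; the objective gradient is replaced by a random stand-in): the tangent projection plays the role of the natural gradient, equation 2.22, and moving along the geodesic keeps the rows of W exactly orthonormal, so no re-orthonormalization is needed.

```python
import numpy as np

def natural_gradient(W, G):
    """Project the Euclidean gradient G (m x d) onto the tangent space at W.

    W has orthonormal rows (W W^T = I_m); the tangent component removes the
    part of G lying in the row space of W.
    """
    return G - G @ W.T @ W

def geodesic(W, Delta, t):
    """Point at parameter t on the Grassmann geodesic from W along Delta.

    Uses the thin SVD of the tangent direction; the result has exactly
    orthonormal rows for any t.
    """
    U, s, Vt = np.linalg.svd(Delta, full_matrices=False)  # Delta = U diag(s) Vt
    return U @ np.diag(np.cos(s * t)) @ U.T @ W + U @ np.diag(np.sin(s * t)) @ Vt

# One hypothetical ascent step for some objective with gradient G at W:
rng = np.random.default_rng(0)
m, d = 2, 5
W = np.linalg.qr(rng.standard_normal((d, m)))[0].T   # random start, W W^T = I_m
G = rng.standard_normal((m, d))                      # stand-in for the SMI gradient
Delta = natural_gradient(W, G)
W_next = geodesic(W, Delta, t=0.1)
print(np.allclose(W_next @ W_next.T, np.eye(m)))     # True
```

A line search over t (e.g., Armijo's rule, as in the text) then picks the step size along this geodesic.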

## 3.  Numerical Experiments

In this section, we experimentally investigate the performance of the proposed and existing dimension-reduction methods using artificial and real data sets. In the proposed method, we use the gaussian kernel as basis functions and employ the regularized kernel Gram matrix as the regularization matrix R, where the kernel Gram matrix is computed for the chosen centers. A small positive constant is added to the Gram matrix to avoid degeneracy. We fix the number of basis functions to b=min(100, n) and choose the gaussian width and the regularization parameter by five-fold CV with grid search. We restart the natural gradient search 10 times with random initial points and choose the solution with the minimum CV score, equation 2.21.

Figure 1:

Artificial data sets.

### 3.1.  Dimension Reduction for Artificial Data Sets

We use six artificial data sets, three that we designed and three borrowed from Fukumizu et al. (2009); see Figure 1:

1. Linear dependence: d=5, m=1. y has a linear dependence on x as
where x(k) denotes the kth element of x, and . Here denotes the normal density with mean and variance-covariance matrix . The optimal projection is given by
2. Nonlinear dependence 1: d=5, m=1. y has a quadratic dependence on x as
where and . The optimal projection is given by equation 3.1.
3. Nonlinear dependence 2: d=5, m=1. y has a lattice-structured dependence on x as
where denotes the uniform density on a set . The optimal projection is given by equation 3.1.
4. Fukumizu et al. (2009): d=4, m=2. y has a nonlinear dependence on x as
where and . The optimal projection is
5. Fukumizu et al. (2009): d=4, m=1. y has a nonlinear dependence on x as
where and . The optimal projection is .
6. Fukumizu et al. (2009): d=10, m=1. y has a nonlinear dependence on x as
where and . The optimal projection is .

Let us compare the proposed LSDR with the following methods.

• Kernel dimension reduction (KDR) (Fukumizu et al., 2009)

• Hilbert-Schmidt independence criterion (HSIC) (Gretton et al., 2005)

• Sliced inverse regression (SIR) (Li, 1991)

• Sliced average variance estimation (SAVE) (Cook, 2000)

• Principal Hessian direction (pHd) (Li, 1992)

• Minimum average (conditional) variance estimation (dMAVE) (Xia, 2007)

In KDR and HSIC, the gaussian width is set to the median sample distance, following the suggestions in the original papers (Gretton et al., 2005; Fukumizu et al., 2009). We used the dimension-reduction package included in R for SIR, SAVE, and pHd. The parameters for these methods, such as the number of slices, were set to their default values. To execute dMAVE, we used the publicly available code. The principal directions estimated by SIR, SAVE, pHd, and dMAVE do not necessarily form an orthogonal system; that is, if we let F be the matrix each row of which corresponds to a principal direction, then F is not necessarily a projection matrix. To recover a projection matrix W, we performed singular value decomposition of F as F=VSU and set W=U.

We evaluate the performance of each method by
3.2
where ‖·‖F denotes the Frobenius norm, Ŵ is an estimated projection matrix, and W* is an optimal projection matrix.
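The criterion compares subspaces rather than particular representative matrices. The sketch below assumes the unnormalized form ‖ŴᵀŴ − W*ᵀW*‖F (equation 3.2 may include a normalizing constant); it checks that the error ignores rotations within a subspace, consistent with the equivalence relation on the Grassmann manifold:

```python
import numpy as np

def frob_error(W_hat, W_opt):
    # W^T W is the orthogonal projector onto the row space of W, which is
    # invariant under the choice of representative of a subspace.
    return np.linalg.norm(W_hat.T @ W_hat - W_opt.T @ W_opt)

rng = np.random.default_rng(0)
Q = np.linalg.qr(rng.standard_normal((5, 5)))[0]     # random 5 x 5 orthogonal
W = Q[:, :2].T                                       # 2 x 5, orthonormal rows
Rot = np.linalg.qr(rng.standard_normal((2, 2)))[0]
W_rot = Rot @ W                                      # same subspace, rotated basis

print(np.isclose(frob_error(W_rot, W), 0.0))         # True: rotations ignored
print(np.isclose(frob_error(Q[:, 2:4].T, W), 2.0))   # True: orthogonal subspaces
```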

The performance of each method is summarized in Table 2, which depicts the mean and standard deviation of the Frobenius-norm error, equation 3.2, over 50 trials when the number of samples is n=100. LSDR overall shows good performance; in particular, it performs the best for data sets b and c. KDR also tends to work reasonably well, but it sometimes performs poorly; this seems to be caused by an inappropriate choice of the gaussian kernel width, implying that the heuristic of using the median sample distance as the kernel width is not always appropriate. On the other hand, LSDR with CV performs stably well for various types of data sets. dMAVE also works well and is competitive with LSDR for these artificial data sets.

Table 2:
Mean and Standard Deviation of the Frobenius-norm Error, Equation 3.2, for Toy Data Sets.
| Data Set | LSDR | KDR | HSIC | SIR | SAVE | pHd | dMAVE |
|---|---|---|---|---|---|---|---|
| (a) | .13(.04) | .13(.05) | .17(.07) | .11(.05) | .37(.27) | .89(.12) | .13(.04) |
| (b) | .15(.06) | .25(.21) | .44(.36) | .83(.19) | .31(.11) | .24(.07) | .20(.09) |
| (c) | .10(.05) | .44(.32) | .68(.32) | .89(.14) | .48(.20) | .86(.12) | .25(.28) |
| (d) | .20(.14) | .16(.06) | .18(.08) | .30(.15) | .44(.18) | .50(.18) | .10(.05) |
| (e) | .09(.06) | .13(.06) | .16(.07) | .21(.10) | .34(.19) | .36(.14) | .08(.04) |
| (f) | .35(.12) | .40(.12) | .49(.17) | .68(.22) | .91(.13) | .83(.12) | .26(.06) |

Notes: The best method in terms of the mean error and comparable ones based on the one-sided t-test at the significance level 1% are indicated by boldface. Standard deviations are in parentheses.

### 3.2.  Classification for Benchmark Data Sets

Finally, we evaluate the classification performance after dimension reduction for several benchmark data sets. We use Image, Waveform, Pima-Indians-Diabetes, and Letter Recognition from the UCI repository. We randomly choose 200 samples from each data set and apply LSDR, KDR, HSIC, and dMAVE to obtain projections onto low-dimensional subspaces with m = ⌈d/4⌉, ⌈d/2⌉, and ⌈3d/4⌉, where ⌈c⌉ denotes the smallest integer not smaller than c. Then we train a support vector machine (Schölkopf & Smola, 2002) on the 200 projected training samples.

The misclassification rate is computed for samples not used for training. Table 3 summarizes the mean and standard deviation of the classification error over 20 iterations. This shows that the proposed method overall compares favorably with the other methods.

Table 3:
Mean and Standard Deviation of Misclassification Rates for Benchmark Data Sets.
| Data Set | d | m | LSDR | KDR | HSIC | dMAVE |
|---|---|---|---|---|---|---|
| Image | 18 | 5 | .083(.019) | .125(.038) | .158(.044) | .501(.134) |
| Image | 18 | 9 | .088(.022) | .106(.026) | .115(.035) | .468(.130) |
| Image | 18 | 14 | .093(.018) | .091(.019) | .095(.023) | .468(.130) |
| Waveform | 21 | 6 | .130(.014) | .127(.008) | .160(.016) | .183(.016) |
| Waveform | 21 | 11 | .119(.013) | .135(.010) | .163(.016) | .184(.015) |
| Waveform | 21 | 16 | .116(.007) | .131(.008) | .159(.014) | .182(.021) |
| Pima | 8 | 2 | .249(.022) | .247(.024) | .252(.020) | .257(.017) |
| Pima | 8 | 4 | .260(.016) | .250(.021) | .252(.017) | .265(.027) |
| Pima | 8 | 6 | .244(.020) | .243(.019) | .251(.021) | .252(.019) |
| Letter (a,b,c) | 16 | 4 | .031(.009) | .028(.012) | .035(.014) | .018(.008) |
| Letter (a,b,c) | 16 | 8 | .026(.008) | .017(.007) | .020(.006) | .016(.006) |
| Letter (a,b,c) | 16 | 12 | .016(.006) | .014(.006) | .017(.008) | .013(.008) |

Notes: The best method in terms of the mean error and comparable ones based on the one-sided t-test at the significance level 1% are indicated by boldface. Standard deviations are in parentheses.

## 4.  Conclusion

In this letter, we have proposed a new dimension-reduction method utilizing a squared-loss variant of mutual information (SMI). Our contributions were parametric and nonparametric analyses of the rate of convergence of the SMI approximator and the proposal of a dimension-reduction algorithm based on the SMI approximator. The proposed method is advantageous in several respects: it does not involve density estimation, it is distribution free, and model selection by cross-validation is available. The effectiveness of the proposed method over existing methods was shown through experiments.

## Appendix A:  Proof of Lemma 1

Let . By the relations and we have
Thus, we obtain
Noticing that
we have
which concludes the proof of lemma 1.

## Appendix B:  Proof of Theorem 1

For notational simplicity, we define linear operators as
Let denote for . Since is the optimizer of problem 2.7, we have
B.1
Therefore, as in the standard asymptotic expansion for maximum likelihood estimators (van der Vaart, 2000), we have
This implies
B.2
Since , we have
and
B.3
Thus, equation B.2 implies
in particular, . Moreover, equation B.2 becomes
B.4
Equations 2.5 and 2.8 give
B.5
The first four terms of the right-hand side can be expanded as
B.6
On the other hand, by the central limit theorem,
B.7
Substituting equations B.6 and B.7 into equation B.5, we have the first assertion, equation 2.9.
Next we prove the second assertion, equation 2.10. Based on equations B.5 and B.6, we can evaluate the expectation of as
where we used the fact that
Below, we will show that
B.8
Obviously this gives equation 2.10.
Equation B.4 implies
B.9
Let
for w=(y, z) and . Then h is symmetric (i.e., ), has mean 0 (i.e., ), and satisfies
Therefore, is a U-statistic with the symmetric kernel h(wi, wj). It is known (theorem 7.1 in Hoeffding, 1948) that the variance of a U-statistic is given by
B.10
where w1=(y1, z1), w2=(y2, z2), and are i.i.d. variables with probability density pyz. For notational simplicity, we write
Then
Going through simple calculations, each term on the right-hand side of the above equation can be evaluated as
Therefore,

This with equations B.9 and B.10 yields equation B.8, and thus we have the second assertion, equation 2.10.
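For reference, the variance formula invoked as equation B.10 is an instance of Hoeffding's classical result. In generic notation (the symbols $\zeta_1, \zeta_2$ for the conditional and unconditional kernel variances are ours, not the letter's), theorem 7.1 of Hoeffding (1948) gives, for a degree-2 U-statistic $U_n = \binom{n}{2}^{-1}\sum_{i<j} h(w_i, w_j)$ with symmetric kernel $h$:

```latex
\mathrm{Var}(U_n) = \binom{n}{2}^{-1} \bigl\{ 2(n-2)\,\zeta_1 + \zeta_2 \bigr\},
\qquad
\zeta_1 = \mathrm{Var}\bigl(\mathbb{E}[h(w_1, w_2) \mid w_1]\bigr),
\quad
\zeta_2 = \mathrm{Var}\bigl(h(w_1, w_2)\bigr).
```

In particular, the leading term is of order $\zeta_1/n$, which is what drives the $O(1/n)$ behavior used in the proof.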

## Appendix C:  Proof of Theorem 2

Before proving theorem 2, we show the following auxiliary lemma:

Lemma 2.
Under the setting of theorem 2, if and , then we have
C.1
where $\|\cdot\|$ denotes the $L_2(p_xp_y)$-norm and $O_p$ denotes the asymptotic order in probability.
Proof.
Hoeffding (1963) derived Bernstein's inequality for double sums,
for . By applying this inequality, in place of Bernstein's inequality for i.i.d. sums, to the proof of theorem 2 in Birgé and Massart (1993) or theorem 5.11 in van de Geer (2000), we obtain a “double sum” version of those theorems, that is, exponential decay of the tail probability of , where is a class of uniformly bounded measurable functions that satisfies the polynomial bracketing condition 2.14 as . This is used later to obtain equations C.3 and C.4.
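For orientation, the classical Bernstein inequality for an i.i.d. sum, which the double-sum version extends, can be stated as follows (standard form; $\sigma^2$ bounds the variance of each summand and $M$ bounds $|X_i|$ almost surely — these symbols are generic, not the letter's):

```latex
\Pr\!\left( \left| \sum_{i=1}^{n} (X_i - \mathbb{E}X_i) \right| \ge t \right)
\le 2 \exp\!\left( - \frac{t^2/2}{\,n\sigma^2 + Mt/3\,} \right),
\qquad |X_i| \le M \ \text{a.s.}
```

The double-sum variant plays the same role for U-processes that this bound plays for empirical processes in the bracketing-entropy arguments of Birgé and Massart (1993) and van de Geer (2000).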
From the definition, we obtain
On the other hand, indicates
Therefore, to bound , it suffices to bound the left-hand side of the above inequality.
Define and as
For such that , the $L_2(p_xp_y)$-norm of the difference between and is bounded by
Thus, the bracketing entropy of has the following order:
Let . Then, as in lemma 5.14 and theorem 10.6 in van de Geer (2000), we obtain
C.2
where . Here we have used the double sum version of theorem 5.11 in van de Geer (2000), which is needed to obtain the same formula as lemma 5.14 in van de Geer. Since
the right-hand side of equation C.2 is further bounded by
C.3
Similarly, we can show that
C.4
Note that for ,
which implies
Thus,
C.5
Combining equations C.3 to C.5, we can bound the $L_2(p_xp_y)$-norm of f as
The rest can be proved by following a similar line to theorem 10.6 in van de Geer (2000).

We redefine and define . There are two possible situations: (1) or (2) . We show the stochastic order of and for the two situations separately.

Situation 1.

(i) If , then
In this case,
For the first event, we have
For the second event, we have
which indicates
and
Thus, the probability of this event goes to 0.
(ii) If , then
In this case,
Then, similarly to the above case, we can show that the probability of the second event goes to 0. On the other hand, for the first event, and .

Situation 2.

In this situation, .

(i) If , then
In this case,
(ii) If , it is obvious that

This completes the proof of lemma 2.

Based on lemma 2, we prove theorem 2 below.

Proof of Theorem 2.
As in the proof of lemma 2, let . Since , we have
C.6
We show that each term on the right-hand side of the above equation is . By the central limit theorem, we have
Since lemma 2 gives and , equations C.3 to C.5 in the proof of lemma 2 imply
In the above derivation, was used. Lemma 2 also implies
Combining these inequalities with equation C.6 implies
which concludes the proof of theorem 2.

## Acknowledgments

T.S. was supported in part by MEXT Kakenhi 22700289 and the Aihara Project, the FIRST program from JSPS, initiated by CSTP. M.S. was supported by AOARD, SCAT, and the JST PRESTO program.

## References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, AC-19, 716–723.
Ali, S. M., & Silvey, S. D. (1966). A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society, Series B, 28, 131–142.
Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10, 251–276.
Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68, 337–404.
Birgé, L., & Massart, P. (1993). Rates of convergence for minimum contrast estimators. Probability Theory and Related Fields, 97, 113–150.
Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge University Press.
Breiman, L., & Friedman, J. H. (1985). Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association, 80, 580–598.
Bura, E., & Cook, R. D. (2001). Extending sliced inverse regression. Journal of the American Statistical Association, 96, 996–1003.
Chiaromonte, F., & Cook, R. D. (2002). Sufficient dimension reduction and graphics in regression. Annals of the Institute of Statistical Mathematics, 54, 768–795.
Collins, M., & Duffy, N. (2002). Convolution kernels for natural language. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14. Cambridge, MA: MIT Press.
Cook, R. D. (1998a). Principal Hessian directions revisited. Journal of the American Statistical Association, 93, 84–100.
Cook, R. D. (1998b). Regression graphics: Ideas for studying regressions through graphics. New York: Wiley.
Cook, R. D. (2000). SAVE: A method for dimension reduction and graphics in regression. Communications in Statistics—Theory and Methods, 29, 2109–2121.
Cook, R. D., & Ni, L. (2005). Sufficient dimension reduction via inverse regression. Journal of the American Statistical Association, 100, 410–428.
Cover, T. M., & Thomas, J. A. (2006). Elements of information theory (2nd ed.). Hoboken, NJ: Wiley.
Csiszár, I. (1967). Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica, 2, 229–318.
Darbellay, G. A., & Vajda, I. (1999). Estimation of the information by an adaptive partitioning of the observation space. IEEE Transactions on Information Theory, 45, 1315–1321.
Durand, J., & Sabatier, R. (1997). Additive splines for partial least squares regression. Journal of the American Statistical Association, 92, 1546–1554.
Eberts, M., & Steinwart, I. (2011). Optimal learning rates for least squares SVMs using gaussian kernels. In J. Shawe-Taylor, R. S. Zemel, P. Bartlett, F. Pereira, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 24 (pp. 1539–1547). Red Hook, NY: Curran.
Edelman, A., Arias, T. A., & Smith, S. T. (1998). The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Anal. Appl., 20, 303–353.
Fukumizu, K., Bach, F. R., & Jordan, M. I. (2004). Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5, 73–99.
Fukumizu, K., Bach, F. R., & Jordan, M. (2009). Kernel dimension reduction in regression. Annals of Statistics, 37, 1871–1905.
Gärtner, T. (2003). A survey of kernels for structured data. 5, 49–58.
Gärtner, T., Flach, P., & Wrobel, S. (2003). On graph kernels: Hardness results and efficient alternatives. In Proceedings of the Sixteenth Annual Conference on Computational Learning Theory. Berlin: Springer-Verlag.
Goutis, C., & Fearn, T. (1996). Partial least squares regression on smooth factors. Journal of the American Statistical Association, 91, 627–632.
Gretton, A., Bousquet, O., Smola, A., & Schölkopf, B. (2005). Measuring statistical dependence with Hilbert-Schmidt norms. In S. Jain, H. U. Simon, & E. Tomita (Eds.), Algorithmic learning theory (pp. 63–77). Berlin: Springer-Verlag.
Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution. Annals of Mathematical Statistics, 19, 293–325.
Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58, 13–30.
Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28, 321–377.
Hulle, M. M. V. (2005). Edgeworth approximation of multivariate differential entropy. Neural Computation, 17, 1903–1910.
Kashima, H., & Koyanagi, T. (2002). Kernels for semi-structured data. In Proceedings of the Nineteenth International Conference on Machine Learning (pp. 291–298). San Francisco, CA: Morgan Kaufmann.
Kashima, H., Tsuda, K., & Inokuchi, A. (2003). Marginalized kernels between labeled graphs. In Proceedings of the 20th International Conference on Machine Learning. N.p.: AAAI Press.
Keziou, A. (2003). Dual representation of φ-divergences and applications. C.R. Acad. Sci. Paris, Ser. I, 336, 857–862.
Kondor, R. I., & Lafferty, J. (2002). Diffusion kernels on graphs and other discrete input spaces. In Proceedings of the Nineteenth International Conference on Machine Learning (pp. 315–322). San Francisco: Morgan Kaufmann.
Kraskov, A., Stögbauer, H., & Grassberger, P. (2004). Estimating mutual information. Physical Review E, 69, 066138.
Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22, 79–86.
Li, K. C. (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86, 316–342.
Li, K. C. (1992). On principal Hessian directions for data visualization and dimension reduction: Another application of Stein's lemma. Journal of the American Statistical Association, 87, 1025–1039.
Li, K. C., Lue, H. H., & Chen, C. H. (2000). Interactive tree-structured regression via principal Hessian directions. Journal of the American Statistical Association, 95, 547–560.
Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N., & Watkins, C. (2002). Text classification using string kernels. Journal of Machine Learning Research, 2, 419–444.
Nguyen, X., Wainwright, M. J., & Jordan, M. I. (2010). Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56, 5847–5861.
Patriksson, M. (1999). Nonlinear programming and variational inequality problems. Dordrecht.
Reiss, P. T., & Ogden, R. T. (2007). Functional principal component regression and functional partial least squares. Journal of the American Statistical Association, 102, 984–996.
Schölkopf, B., & Smola, A. J. (2002). Learning with kernels. Cambridge, MA: MIT Press.
Song, L., Smola, A., Gretton, A., Borgwardt, K. M., & Bedo, J. (2007). Supervised feature selection via dependence estimation. In Proceedings of the 24th International Conference on Machine Learning (pp. 823–830). New York: ACM.
Steinwart, I. (2001). On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2, 67–93.
Steinwart, I., & Scovel, C. (2007). Fast rates for support vector machines using gaussian kernels. Annals of Statistics, 35, 575–607.
Suzuki, T., Sugiyama, M., Kanamori, T., & Sese, J. (2009). Mutual information estimation reveals global associations between stimuli and biological processes. BMC Bioinformatics, 10, S52.
Suzuki, T., Sugiyama, M., Sese, J., & Kanamori, T. (2008). Approximating mutual information by maximum likelihood density ratio estimation. In Proceedings of the Third Workshop on New Challenges for Feature Selection in Data Mining and Knowledge Discovery (Vol. 4, pp. 5–20).
Torkkola, K. (2003). Feature extraction by non-parametric mutual information maximization. Journal of Machine Learning Research, 3, 1415–1438.
van de Geer, S. (2000). Empirical processes in M-estimation. Cambridge: Cambridge University Press.
van der Vaart, A. W. (2000). Asymptotic statistics. Cambridge: Cambridge University Press.
van der Vaart, A. W., & Wellner, J. A. (1996). Weak convergence and empirical processes: With applications to statistics. New York: Springer.
Wahba, G. (1990). Spline models for observational data. Philadelphia, PA: SIAM.
Wold, H. (1966). Estimation of principal components and related models by iterative least squares. In P. R. Krishnaiah (Ed.), Multivariate analysis (pp. 391–420). New York.
Xia, Y. (2007). A constructive approach to the estimation of dimension reduction directions. Annals of Statistics, 35, 2654–2690.
Zhu, L., Miao, B., & Peng, H. (2006). On sliced inverse regression with high-dimensional covariates. Journal of the American Statistical Association, 101, 630–643.

## Notes

1

In principle, it is possible to choose the gaussian width and the regularization parameter by cross-validation (CV) over a successive predictor. However, this is not preferable for two reasons. The first is a significant increase in computational cost: when CV is used, the tuning parameters in KDR (or HSIC) and the hyperparameters of the target predictor (such as basis parameters and the regularization parameter) must be optimized at the same time. This results in a deeply nested CV procedure, which could be computationally very expensive. The second is that features extracted based on CV are no longer independent of the predictor, which is not preferable from the viewpoint of interpretability.
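The cost blow-up from nesting can be made concrete with a back-of-the-envelope count of model fits; the fold and grid sizes below are illustrative choices, not values from the letter:

```python
# Rough count of model fits: tuning the dimension-reduction parameters
# jointly with the predictor's hyperparameters forces a nested CV loop.
outer_folds = 5          # CV folds for the dimension-reduction step
inner_folds = 5          # CV folds for the successive predictor
dr_grid = 9 * 9          # e.g., gaussian width x regularization for KDR/HSIC
pred_grid = 9 * 9        # e.g., predictor's kernel width x regularization

flat_fits = outer_folds * dr_grid                               # DR tuned alone
nested_fits = outer_folds * dr_grid * inner_folds * pred_grid   # joint tuning

print(flat_fits, nested_fits)  # 405 vs 164025: a ~400x increase
```

The multiplicative structure is the point: every candidate dimension-reduction setting must be re-scored by a full inner CV over the predictor's own grid.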

2

The output can be multidimensional and either continuous (i.e., regression) or categorical (i.e., classification). Structured outputs such as strings, trees, and graphs can also be handled in our framework, as explained later.

3

This result can be generalized to a general f-divergence with small modification (Keziou, 2003; Nguyen et al., 2010).

4

Alternatively, we may locate $n^2$ gaussian kernels at . However, in our preliminary experiments, this significantly increased the computational cost without improving the performance.
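To illustrate why a subset of b centers is so much cheaper than $n^2$ of them, here is a hedged numpy sketch of a product-gaussian basis evaluated at all paired points; the product form and all variable names are our illustrative assumptions, not the letter's exact construction:

```python
import numpy as np

def gaussian_basis(X, Y, centers_x, centers_y, sigma=1.0):
    """Product-Gaussian basis values at all (x_i, y_j) combinations:
    returns an (n*n, b) design matrix for b kernel centers."""
    dx = ((X[:, None, :] - centers_x[None]) ** 2).sum(axis=2)  # (n, b)
    dy = (Y[:, None] - centers_y[None]) ** 2                   # (n, b)
    kx = np.exp(-dx / (2.0 * sigma ** 2))
    ky = np.exp(-dy / (2.0 * sigma ** 2))
    n, b = X.shape[0], centers_x.shape[0]
    # every (i, j) pair is a row; each row holds b basis-function values
    return (kx[:, None, :] * ky[None, :, :]).reshape(n * n, b)

rng = np.random.default_rng(1)
n, d, b = 50, 3, 20
X, Y = rng.normal(size=(n, d)), rng.normal(size=n)
idx = rng.choice(n, size=b, replace=False)   # b paired samples as centers
Phi = gaussian_basis(X, Y, X[idx], Y[idx])
print(Phi.shape)  # (2500, 20): n^2 rows but only b columns
```

With $n^2$ centers instead, the design matrix would be $n^2 \times n^2$ (here 2500 x 2500), which is the computational-cost increase the note refers to.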