## Abstract

The goal of sufficient dimension reduction in supervised learning is to find the low-dimensional subspace of input features that contains all of the information about the output values that the input features possess. In this letter, we propose a novel sufficient dimension-reduction method using a squared-loss variant of mutual information as a dependency measure. We apply a density-ratio estimator for approximating squared-loss mutual information that is formulated as a minimum contrast estimator on parametric or nonparametric models. Since cross-validation is available for choosing an appropriate model, our method does not require any prespecified structure on the underlying distributions. We elucidate the asymptotic bias of our estimator on parametric models and the asymptotic convergence rate on nonparametric models. The convergence analysis utilizes the uniform tail-bound of a *U*-process, and the convergence rate is characterized by the bracketing entropy of the model. We then develop a natural gradient algorithm on the Grassmann manifold for sufficient subspace search. The analytic formula of our estimator allows us to compute the gradient efficiently. Numerical experiments show that the proposed method compares favorably with existing dimension-reduction approaches on artificial and benchmark data sets.

## 1. Introduction

The purpose of dimension reduction in supervised learning is to find the low-dimensional subspace of input features that has sufficient information for predicting output values. Sufficient dimension reduction (SDR), initiated by Li (1991), is aimed at finding a low-rank projection matrix such that given the relevant subspace of input features, the rest becomes conditionally independent of output values (Cook, 1998b; Chiaromonte & Cook, 2002). Such a low-dimensional subspace contains all of the information of the output that the covariate contains. Finding such a subspace not only allows us to use the dimension-reduced features for estimating input-output relations, but also offers insights into which features are important.

A traditional dependency measure between random variables is the Pearson correlation coefficient (PCC), which can detect only linear dependency and is therefore useful mainly for gaussian data. However, the gaussian assumption is rarely fulfilled in practice. Recently, kernel-based dimension reduction has been studied in order to overcome this weakness. The Hilbert-Schmidt independence criterion (HSIC) (Gretton, Bousquet, Smola, & Schölkopf, 2005) utilizes cross-covariance operators on universal reproducing kernel Hilbert spaces (RKHSs) (Steinwart, 2001). Cross-covariance operators are an infinite-dimensional generalization of covariance matrices. HSIC allows one to efficiently detect nonlinear dependency by making use of the reproducing property of RKHSs (Aronszajn, 1950). Its usefulness in feature selection scenarios has been shown in Song, Smola, Gretton, Borgwardt, and Bedo (2007). However, HSIC has several theoretical and practical weaknesses. Theoretically, it evaluates independence between random variables, not *conditional* independence. Thus, it does not perform SDR in a strict sense. From the practical point of view, HSIC evaluates the covariance between random variables, not the correlation. This means that a change of input feature scaling affects the dimension-reduction solution, which is not preferable in practice.

Kernel dimension reduction (KDR) (Fukumizu, Bach, & Jordan, 2004) can overcome these weaknesses. KDR evaluates conditional covariance using the kernel trick, and thus it directly performs SDR. Through experiments, KDR was demonstrated to outperform other dimension-reduction schemes such as canonical correlation analysis (Hotelling, 1936; Breiman & Friedman, 1985), partial least squares (Wold, 1966; Goutis & Fearn, 1996; Durand & Sabatier, 1997; Reiss & Ogden, 2007), sliced inverse regression (Li, 1991; Bura & Cook, 2001; Cook & Ni, 2005; Zhu, Miao, & Peng, 2006), and the principal Hessian direction (Li, 1992; Cook, 1998a; Li, Lue, & Chen, 2000). Theoretical properties of KDR such as consistency have been studied thoroughly (Fukumizu, Bach, & Jordan, 2009). However, KDR still has a weakness in practice: the performance of KDR (and also HSIC) depends on the choice of kernel parameters (e.g., the gaussian width) and the regularization parameter. So far, there seems to be no model selection method for KDR and HSIC (as discussed in Fukumizu et al., 2009).^{1}

Another possible criterion for SDR is mutual information (MI) (Cover & Thomas, 2006). MI could be directly employed for SDR since maximizing MI between output and projected input leads to conditional independence between output and input given the projected input. So far, a great deal of effort has been made to estimate MI accurately—for example, based on an adaptive histogram (HIST) (Darbellay & Vajda, 1999), kernel density estimation (KDE) (Torkkola, 2003), nearest-neighbor distance (NN) (Kraskov, Stögbauer, & Grassberger, 2004), Edgeworth expansion (EDGE) (Hulle, 2005), and maximum-likelihood MI estimation (MLMI) (Suzuki, Sugiyama, Sese, & Kanamori, 2008). Among them, MLMI has been shown to possess various practical advantages.

As summarized in Table 1, MLMI allows model selection by cross-validation, whereas there is no systematic method to choose tuning parameters for HSIC, KDR, and NN. MLMI does not require specific structures on the underlying distributions, while EDGE requires that the distributions be nearly gaussian. Moreover, MLMI does not involve density estimation of the underlying distributions and thus tends to perform well in practice.

| Methods | Nonlinear Dependence | Model Selection | Distribution | Density Estimation | Feature Extraction |
|---|---|---|---|---|---|
| PCC | Not detectable | **Not necessary** | Gaussian | **Not involved** | **Possible** |
| HSIC | **Detectable** | Not available | **Free** | **Not involved** | **Possible** |
| KDR | **Detectable** | Not available | **Free** | **Not involved** | **Possible** |
| HIST | **Detectable** | **Available** | **Free** | Involved | Not available |
| KDE | **Detectable** | **Available** | **Free** | Involved | **Possible** |
| NN | **Detectable** | Not available | **Free** | **Not involved** | Not available |
| EDGE | **Detectable** | **Not necessary** | Near gaussian | **Not involved** | **Possible** |
| MLMI | **Detectable** | **Available** | **Free** | **Not involved** | Not available |
| LSMI | **Detectable** | **Available** | **Free** | **Not involved** | **Possible** |


Note: The boldface corresponds to a beneficial feature of the method.

Based on this comparison, we want to employ the MLMI method for dimension reduction. However, this is not straightforward since the MLMI estimator is not explicit, that is, it is implicitly defined as the solution of an optimization problem and is computed numerically. In the dimension-reduction (or feature-extraction) scenarios, the projection matrix needs to be optimized over an MI approximator. To cope with this problem, we adopt a squared-loss variant of MI called the *squared-loss MI* (SMI) as our independence measure and use an estimator of SMI called *least-squares MI* (LSMI) (Suzuki, Sugiyama, Kanamori, & Sese, 2009) for dimension reduction. LSMI inherits good properties from MLMI and, moreover, provides an analytic SMI estimator that gives an analytic formula for its derivative (see Table 1).

The goal of this letter is to develop a dimension-reduction algorithm based on LSMI. Our first contribution is to theoretically elucidate the rate of convergence of the LSMI estimator in parametric and nonparametric settings. Then we develop a practical dimension-reduction algorithm based on LSMI, which we call *least-squares dimension reduction* (LSDR). LSDR optimizes the projection matrix using a natural gradient algorithm (Amari, 1998) on the Grassmann manifold. Finally, through numerical experiments, we show the usefulness of the LSDR method.

## 2. Dimension Reduction via SMI Estimation

In this section, we first formulate the problem of sufficient dimension reduction (SDR) (Cook, 1998b; Chiaromonte & Cook, 2002) and show how squared-loss mutual information (SMI) can be employed in the context of SDR. Then we introduce a method of approximating SMI without going through density estimation and elucidate convergence properties of the SMI estimator. Finally, we develop a dimension-reduction method based on the SMI estimator.

### 2.1. Sufficient Dimension Reduction

Let $\mathcal{D}_{\mathrm{x}}$ ($\subset \mathbb{R}^d$) be the domain of input feature $x$ and $\mathcal{D}_{\mathrm{y}}$ be the domain of output data $y$.^{2} We suppose that $\mathcal{D}_{\mathrm{y}}$ is equipped with a $\sigma$-algebra and that there is a base measure denoted by $\mathrm{d}y$. As for $\mathcal{D}_{\mathrm{x}}$, we consider the standard $\sigma$-algebra of the Lebesgue measurable sets and denote the Lebesgue measure by $\mathrm{d}x$. We assume there is a joint density $p_{\mathrm{xy}}(x, y)$ defined on the product space $\mathcal{D}_{\mathrm{x}} \times \mathcal{D}_{\mathrm{y}}$ with respect to $\mathrm{d}x\,\mathrm{d}y$.

Consider the set of $m$-dimensional subspaces in $\mathbb{R}^d$. The Grassmann manifold is obtained by identifying those orthonormal matrices whose rows span the same subspace:

$$\{W \in \mathbb{R}^{m \times d} \mid W W^{\top} = I_m\} / \sim, \tag{2.1}$$

where $\top$ denotes the transpose, $I_m$ is the $m$-dimensional identity matrix, and $\sim$ is the equivalence relation such that $W \sim W'$ if the rows of both $W$ and $W'$ span the same space.

Consider the projection of $x$ by $W$ given by $z = W x$. Suppose that $W^*$ satisfies

$$y \perp\!\!\!\perp x \mid W^* x. \tag{2.2}$$

That is, given the projected feature $z^* = W^* x$, the (remaining) feature $x$ is conditionally independent of output $y$ and can be discarded without sacrificing the predictability of $y$. Note that the conditional independence is invariant against the choice of the representative $W^*$.

Suppose we are given $n$ independent and identically distributed (i.i.d.) paired samples $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ drawn from the joint distribution with density $p_{\mathrm{xy}}(x, y)$. The goal of SDR is, from data $\mathcal{D}$, to find a projection matrix $W$ whose range agrees with that of $W^*$. For a projection matrix $W$, we write $z_i = W x_i$. We assume that the target dimensionality $m$ is known throughout this letter.
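As a concrete illustration of this setup, the following sketch draws samples satisfying equation 2.2 with a known $W^*$ (a hypothetical data-generating process for illustration only, not one of the letter's benchmark sets):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 5, 1, 1000

# True projection matrix W* (m x d) with orthonormal rows:
W_star = np.zeros((m, d))
W_star[0, 0] = 1.0

X = rng.normal(size=(n, d))
Z = X @ W_star.T                              # projected features, shape (n, m)
Y = Z[:, 0] ** 2 + 0.1 * rng.normal(size=n)   # y given x depends only on z = W* x

# Given z, the remaining directions carry no information about y:
# e.g., the second coordinate of x is (nearly) uncorrelated with y.
corr_rel = np.corrcoef(Z[:, 0] ** 2, Y)[0, 1]
corr_irr = np.corrcoef(X[:, 1], Y)[0, 1]
print(round(corr_rel, 3), round(abs(corr_irr), 3))
```

Here any direction outside the row space of `W_star` can be discarded without hurting the predictability of `Y`, which is exactly the sufficiency property of equation 2.2.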

### 2.2. Squared-Loss Mutual Information

Here we explain how to find a projection matrix $W$ so that equation 2.2 is fulfilled. Let us denote by $z$ the projected feature $z = W x$ for some projection matrix $W$. We adopt SMI as our criterion to be maximized with respect to $W$:

$$\mathrm{SMI}(Y, Z) = \frac{1}{2}\iint \left(\frac{p_{\mathrm{yz}}(y, z)}{p_{\mathrm{y}}(y)\, p_{\mathrm{z}}(z)} - 1\right)^{2} p_{\mathrm{y}}(y)\, p_{\mathrm{z}}(z)\, \mathrm{d}y\, \mathrm{d}z, \tag{2.4}$$

where $p_{\mathrm{yz}}(y, z)$ denotes the joint density of $y$ and $z$, and $p_{\mathrm{y}}(y)$ and $p_{\mathrm{z}}(z)$ denote the marginal densities of $y$ and $z$, respectively. $\mathrm{SMI}(Y, Z)$ allows us to evaluate independence between $y$ and $z$ since $\mathrm{SMI}(Y, Z)$ vanishes if and only if

$$p_{\mathrm{yz}}(y, z) = p_{\mathrm{y}}(y)\, p_{\mathrm{z}}(z).$$

Note that equation 2.4 corresponds to the $f$-divergence (Ali & Silvey, 1966; Csiszár, 1967) from $p_{\mathrm{yz}}(y, z)$ to $p_{\mathrm{y}}(y)\, p_{\mathrm{z}}(z)$ with the squared loss, while ordinary MI corresponds to the $f$-divergence with the log loss, that is, the Kullback-Leibler (KL) divergence (Kullback & Leibler, 1951):

$$\mathrm{MI}(Y, Z) = \iint p_{\mathrm{yz}}(y, z) \log \frac{p_{\mathrm{yz}}(y, z)}{p_{\mathrm{y}}(y)\, p_{\mathrm{z}}(z)}\, \mathrm{d}y\, \mathrm{d}z.$$

Thus, SMI could be regarded as a natural alternative to ordinary MI.
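For intuition, SMI can be computed in closed form for a small discrete joint distribution (an illustrative computation with sums in place of integrals, not part of the letter's estimation method):

```python
import numpy as np

def smi_discrete(P):
    """Squared-loss mutual information of a discrete joint distribution.

    P[i, j] = Pr(Y = i, Z = j).  Mirrors the SMI definition with sums in
    place of integrals over densities.
    """
    py = P.sum(axis=1, keepdims=True)   # marginal of Y
    pz = P.sum(axis=0, keepdims=True)   # marginal of Z
    ratio = P / (py * pz)               # density-ratio analogue
    return 0.5 * np.sum((ratio - 1.0) ** 2 * py * pz)

# Independent joint: the ratio is identically 1, so SMI is zero.
P_ind = np.outer([0.3, 0.7], [0.5, 0.5])
# Dependent joint: probability mass concentrated on the diagonal.
P_dep = np.array([[0.45, 0.05],
                  [0.05, 0.45]])

print(smi_discrete(P_ind))  # 0.0
print(smi_discrete(P_dep))
```

As the definition dictates, the value vanishes exactly when the joint factorizes into the product of its marginals and grows with the strength of the dependence.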

The rationale behind the use of SMI in the context of SDR relies on the following lemma:

**Lemma 1.** *$\mathrm{SMI}(Y, Z) \le \mathrm{SMI}(Y, X)$ holds for any projection matrix $W$, and the equality is attained if and only if $W$ satisfies equation 2.2.*

Now we want to find the projection matrix $W$ that maximizes $\mathrm{SMI}(Y, Z)$. However, SMI is inaccessible in practice since the densities $p_{\mathrm{yz}}(y, z)$, $p_{\mathrm{y}}(y)$, and $p_{\mathrm{z}}(z)$ are unknown. Thus, SMI needs to be estimated from data samples. We next introduce an SMI estimator.

### 2.3. SMI Approximation via Density-Ratio Estimation

Here we fix the projection matrix $W$ and discuss the problem of approximating SMI from samples. The convex duality (Boyd & Vandenberghe, 2004) gives the variational representation (Keziou, 2003; Nguyen, Wainwright, & Jordan, 2010) of SMI as

$$\mathrm{SMI}(Y, Z) = -\min_{g} J(g) - \frac{1}{2}, \qquad
J(g) = \frac{1}{2}\iint g(y, z)^{2}\, p_{\mathrm{y}}(y)\, p_{\mathrm{z}}(z)\, \mathrm{d}y\, \mathrm{d}z - \iint g(y, z)\, p_{\mathrm{yz}}(y, z)\, \mathrm{d}y\, \mathrm{d}z, \tag{2.5}$$

where the minimum is taken over all measurable functions $g$ on $\mathcal{D}_{\mathrm{y}} \times \mathcal{D}_{\mathrm{z}}$. This can be checked through the convex conjugate $f^*$ of $f$, which satisfies $f(t) = \sup_{s}\{st - f^*(s)\}$ (Boyd & Vandenberghe, 2004). Thus, computing SMI is reduced to finding the minimizer $g^*$ of $J(g)$. We can show that $g^*$ is given by the density ratio

$$g^*(y, z) = \frac{p_{\mathrm{yz}}(y, z)}{p_{\mathrm{y}}(y)\, p_{\mathrm{z}}(z)}. \tag{2.6}$$

Thus, estimating $\mathrm{SMI}(Y, Z)$ is reduced to estimating the above density ratio.
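Substituting the minimizer back into the objective confirms that the variational representation is consistent with the definition of SMI (writing $r = p_{yz}/(p_y p_z)$ for the density ratio):

```latex
\begin{align*}
J(g^*) &= \tfrac{1}{2}\iint r^2\, p_y p_z \,\mathrm{d}y\,\mathrm{d}z
          - \iint r\, p_{yz} \,\mathrm{d}y\,\mathrm{d}z
        = -\tfrac{1}{2}\iint r^2\, p_y p_z \,\mathrm{d}y\,\mathrm{d}z, \\
\mathrm{SMI}(Y,Z) &= \tfrac{1}{2}\iint (r-1)^2\, p_y p_z \,\mathrm{d}y\,\mathrm{d}z
        = \tfrac{1}{2}\iint r^2\, p_y p_z \,\mathrm{d}y\,\mathrm{d}z - 1 + \tfrac{1}{2}
        = -J(g^*) - \tfrac{1}{2},
\end{align*}
```

where we used $\iint r\, p_{yz} = \iint r^2\, p_y p_z$ in the first line and $\iint r\, p_y p_z = \iint p_{yz} = 1$ in the second.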

We do not choose a strategy to plug density estimators of $p_{\mathrm{yz}}$, $p_{\mathrm{y}}$, and $p_{\mathrm{z}}$ into formula 2.6.^{3} This is because, in a region where $p_{\mathrm{y}}(y)\, p_{\mathrm{z}}(z)$ is small, a small estimation error of $p_{\mathrm{yz}}(y, z)$ is strongly amplified. To avoid this unstable behavior around the tails, we directly model the density ratio itself and impose regularization to control the instability of density-ratio estimators when needed.

Next we consider parametric and nonparametric methods for estimating SMI.

#### 2.3.1. Parametric Convergence Analysis

Here we consider the case where the function class $\mathcal{G}$ from which $g$ is searched is a parametric model $\mathcal{G} = \{g_{\theta} \mid \theta \in \Theta\}$. Suppose that the true density ratio $g^*$ is contained in the model $\mathcal{G}$, that is, there exists $\theta^* \in \Theta$ such that $g_{\theta^*} = g^*$. Approximating the probability densities $p_{\mathrm{yz}}(y, z)$, $p_{\mathrm{y}}(y)$, and $p_{\mathrm{z}}(z)$ in equation 2.5 by their empirical distributions, we obtain the following optimization problem:

$$\hat{\theta} = \mathop{\mathrm{argmin}}_{\theta \in \Theta}\, \hat{J}(g_{\theta}), \qquad
\hat{J}(g) = \frac{1}{2n^{2}}\sum_{i,j=1}^{n} g(y_i, z_j)^{2} - \frac{1}{n}\sum_{i=1}^{n} g(y_i, z_i).$$

Then an SMI approximator can be constructed as $\widehat{\mathrm{SMI}} = -\hat{J}(g_{\hat{\theta}}) - \frac{1}{2}$. Let $A$ and $B$ be matrices defined through expectations of products of the partial derivatives $\partial_{\ell} g_{\theta^*}(y, z)$, where $\tilde{y}$ and $\tilde{z}$ are independent copies of $y$ and $z$ and the partial derivative $\partial_{\ell}$ is taken with respect to the $\ell$th element of the parameter $\theta$. Then we have the following theorem:

*z*#### 2.3.2. Nonparametric Convergence Analysis

Here we consider a nonparametric setup in which $\mathcal{G}$ is a general function class and $R(g)$ is a nonnegative regularization functional such that

$$\sup_{y, z}\, |g(y, z)| \le C\, R(g) \quad \text{for all } g \in \mathcal{G}. \tag{2.12}$$

Then a nonparametric version of the SMI approximator is given through

$$\hat{g} = \mathop{\mathrm{argmin}}_{g \in \mathcal{G}}\, \big[\hat{J}(g) + \lambda_n R(g)^{2}\big].$$

A typical choice of $R(g)$ is an RKHS norm. Suppose $\mathcal{G}$ is an RKHS associated with a bounded kernel $k$:

$$\sup_{y, z}\, k\big((y, z), (y, z)\big) \le C^{2}. \tag{2.13}$$

Let $\|\cdot\|_{\mathcal{G}}$ denote the norm in the RKHS $\mathcal{G}$. Then $R(g) = \|g\|_{\mathcal{G}}$ satisfies equation 2.12:

$$|g(y, z)| = \big|\langle g, k\big((y, z), \cdot\big)\rangle_{\mathcal{G}}\big| \le \|g\|_{\mathcal{G}}\, \sqrt{k\big((y, z), (y, z)\big)} \le C\, \|g\|_{\mathcal{G}},$$

where we used the reproducing property of the kernel and Schwarz's inequality. Note that the gaussian kernel satisfies this with $C = 1$. It is known that the gaussian kernel RKHS spans a dense subset of the set of continuous functions. Another example of an RKHS is a Sobolev space, whose canonical norm is the integral of the squared derivatives of functions; the regularization term then forces the solution to be smooth. The RKHS technique in Sobolev space has been well exploited in the context of spline models (Wahba, 1990). We intend the regularization term $R(g)$ to be a generalization of the RKHS norm; roughly speaking, $R(g)$ plays the role of a "norm" of the function space $\mathcal{G}$. To derive the convergence rate of our estimator, we use the bracketing entropy, which is a complexity measure of a function class (van der Vaart & Wellner, 1996):

*Given two functions l and u, the bracket [l, u] is the set of all functions f with for all y and z. An -bracket is a bracket [l, u] with . The bracketing entropy is the logarithm of the minimum number of -brackets needed to cover a function set*.

We assume that there exists $0 < \gamma < 2$ such that, for all $M > 0$, the bracketing entropy of $\mathcal{G}_M = \{g \in \mathcal{G} \mid R(g) \le M\}$ satisfies

$$H_B\big(\epsilon, \mathcal{G}_M, L_2(p_{\mathrm{y}} p_{\mathrm{z}})\big) \le O\big((M/\epsilon)^{\gamma}\big). \tag{2.14}$$

This quantity represents the complexity of the function class $\mathcal{G}_M$: the larger the bracketing entropy is, the more complex the function class is because, for a larger entropy, more brackets are needed to cover the function class. The gaussian RKHS satisfies this condition for arbitrarily small $\gamma$ (Steinwart & Scovel, 2007). Note that when $R(g)$ is the RKHS norm, condition 2.14 holds for all $M > 0$ if it holds for $M = 1$.

Then we have the following theorem:

A proof of theorem 2 is given in appendix C. The conditions imposed on the regularization parameter $\lambda_n$ roughly mean that it should be sufficiently small but not too small. The theorem shows that the nonparametric version also attains nearly the parametric convergence rate if $\lambda_n$ is decreased at an appropriate rate, for sufficiently small $\gamma$. However, the nonparametric method requires a milder model assumption.

According to Nguyen et al. (2010), where a log-loss version of theorem 2 has been proven in the context of KL divergence estimation, the above convergence rate achieves the optimal minimax rate under some setup. Thus, the convergence property of the above nonparametric method would also be optimal in the same sense.

As stated above, the gaussian RKHS satisfies the bracketing number condition 2.14 for arbitrarily small $\gamma$. Thus, SMI estimation with the gaussian kernel achieves the convergence rate 2.15 with arbitrarily small $\gamma$. However, the gaussian RKHS is not sufficiently rich to estimate a function in a Sobolev space at the optimal rate. To estimate a function in a Sobolev space at the optimal rate, we need to adjust the gaussian width appropriately depending on the sample size and the regularity of the Sobolev space. What we have shown in the theorem holds only for a fixed gaussian width; to analyze the convergence rate for varying gaussian widths, techniques other than those used in this letter are required. To cope with a situation of varying gaussian width, the techniques recently developed by Eberts and Steinwart (2011) are useful.

#### 2.3.3. Practical Implementation

Here we describe a practical version of the above SMI approximation method, least-squares mutual information (LSMI) (Suzuki et al., 2009).

We restrict the function class for $g$ to some linear subspace:

$$g_{\alpha}(y, z) = \alpha^{\top} \varphi(y, z),$$

where $\alpha = (\alpha_1, \ldots, \alpha_b)^{\top}$ is a parameter vector to be learned from samples and $\varphi(y, z)$ is a basis function vector such that $\varphi(y, z) \ge 0_b$ for all $(y, z)$, where $0_b$ is the $b$-dimensional vector with all zeros and the inequality for vectors is applied in the element-wise manner. $\varphi$ may be dependent on the samples $\{(y_i, z_i)\}_{i=1}^{n}$, that is, kernel models are also allowed. Later, in section 2.3.5, we explain how the basis functions are designed.

We approximate the probability densities $p_{\mathrm{yz}}(y, z)$, $p_{\mathrm{y}}(y)$, and $p_{\mathrm{z}}(z)$ in equation 2.5 by their empirical distributions. Then we have

$$\hat{\alpha} = \mathop{\mathrm{argmin}}_{\alpha \in \mathbb{R}^{b}} \left[\frac{1}{2}\alpha^{\top} \hat{H} \alpha - \hat{h}^{\top} \alpha + \lambda\, \alpha^{\top} R\, \alpha\right],$$

where we included the term $\lambda\, \alpha^{\top} R\, \alpha$ for regularization purposes and

$$\hat{H} = \frac{1}{n^{2}}\sum_{i=1}^{n}\sum_{j=1}^{n} \varphi(y_i, z_j)\, \varphi(y_i, z_j)^{\top}, \qquad
\hat{h} = \frac{1}{n}\sum_{i=1}^{n} \varphi(y_i, z_i). \tag{2.18}$$

The minimizer is obtained analytically by solving the linear equation

$$(\hat{H} + \lambda R)\, \alpha = \hat{h}, \tag{2.19}$$

and the resulting SMI approximator is $\widehat{\mathrm{SMI}} = -\hat{J}(g_{\hat{\alpha}}) - \frac{1}{2}$. Note that when $\mathcal{G}$ is an RKHS corresponding to a kernel $k$, $g_{\alpha}$ is a member of the RKHS and the RKHS norm of $g_{\alpha}$ satisfies $\|g_{\alpha}\|_{\mathcal{G}}^{2} = \alpha^{\top} K \alpha$, where $K$ is the Gram matrix, that is, $K_{\ell,\ell'} = k\big((y_\ell, z_\ell), (y_{\ell'}, z_{\ell'})\big)$. Thus, the regularization term satisfies condition 2.12 if the kernel function is bounded as in equation 2.13.
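The analytic-form solution above can be sketched in a few lines of NumPy (a minimal illustration of the LSMI procedure with gaussian basis functions centered at the paired samples; the width `sigma`, regularization parameter `lam`, and the choice of identity regularization matrix are placeholder settings, not the letter's cross-validated values):

```python
import numpy as np

def lsmi(y, z, sigma=1.0, lam=0.1):
    """Minimal LSMI sketch: estimate SMI(Y, Z) from paired samples."""
    y = np.atleast_2d(y.T).T    # (n, dim_y)
    z = np.atleast_2d(z.T).T    # (n, dim_z)
    n = y.shape[0]

    def gauss(a, b):
        # gaussian kernel matrix between the rows of a and the rows of b
        sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-sq / (2 * sigma ** 2))

    Ky = gauss(y, y)            # K(y_i, y_l)
    Kz = gauss(z, z)            # L(z_j, z_l)
    # H-hat averages phi_l * phi_l' over all n^2 combinations (y_i, z_j);
    # the product structure of the basis lets the double sum factorize:
    H = (Ky.T @ Ky) * (Kz.T @ Kz) / n ** 2
    # h-hat averages phi_l over the n paired samples (y_i, z_i):
    h = (Ky * Kz).mean(axis=0)
    # Solve the regularized linear system (identity regularizer here):
    alpha = np.linalg.solve(H + lam * np.eye(n), h)
    J_hat = 0.5 * alpha @ H @ alpha - h @ alpha
    return -J_hat - 0.5         # plug the fitted ratio into the SMI formula

rng = np.random.default_rng(1)
z = rng.normal(size=300)
y_dep = z + 0.1 * rng.normal(size=300)   # strongly dependent pair
y_ind = rng.normal(size=300)             # independent pair
print(lsmi(y_dep, z), lsmi(y_ind, z))
```

On the dependent pair the estimate is clearly positive, while on the independent pair it stays close to zero, in line with the role of SMI as an independence measure.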

#### 2.3.4. Model Selection by Cross-Validation

The performance of LSMI depends on the choice of the basis functions $\varphi$ and the regularization parameter $\lambda$, which can be optimized by cross-validation (CV) as follows. First, the samples are divided into $K$ disjoint subsets $\{\mathcal{D}_k\}_{k=1}^{K}$ of (approximately) the same size. Then an estimator $\hat{\alpha}_{\mathcal{D} \setminus \mathcal{D}_k}$ is obtained using $\mathcal{D} \setminus \mathcal{D}_k$ (i.e., without $\mathcal{D}_k$), and the approximation error for the holdout samples $\mathcal{D}_k$ is computed. This procedure is repeated for $k = 1, \ldots, K$, and its mean is output:

$$\hat{J}_{\mathrm{CV}} = \frac{1}{K}\sum_{k=1}^{K} \left[\frac{1}{2}\hat{\alpha}_{\mathcal{D} \setminus \mathcal{D}_k}^{\top} \hat{H}_{\mathcal{D}_k} \hat{\alpha}_{\mathcal{D} \setminus \mathcal{D}_k} - \hat{h}_{\mathcal{D}_k}^{\top} \hat{\alpha}_{\mathcal{D} \setminus \mathcal{D}_k}\right], \tag{2.21}$$

where $\hat{H}_{\mathcal{D}_k}$ and $\hat{h}_{\mathcal{D}_k}$ denote $\hat{H}$ and $\hat{h}$ in equation 2.18 computed from the holdout samples $\mathcal{D}_k$. For model selection, we compute $\hat{J}_{\mathrm{CV}}$ for all model candidates (the basis function $\varphi$ and the regularization parameter $\lambda$) and choose the best model that minimizes $\hat{J}_{\mathrm{CV}}$. We can show that $\hat{J}_{\mathrm{CV}}$ is an almost unbiased estimator of the objective function $J$, where the "almost"-ness comes from the fact that the sample size is reduced in the CV procedure due to data splitting (Schölkopf & Smola, 2002).

For the parametric setup, we may derive an asymptotic unbiased estimator of *J* (also known as an information criterion, Akaike, 1974) based on theorem 1, which could be employed for model selection. However, we do not pursue this direction in this letter.
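The CV procedure of equation 2.21 can be sketched as follows (a self-contained illustration for one-dimensional data; the candidate grids and $K = 5$ are arbitrary illustrative choices, not prescriptions from the letter):

```python
import numpy as np

def design(y, z, cy, cz, sigma):
    """Values of gaussian basis functions centered at (cy_l, cz_l)."""
    Ky = np.exp(-(y[:, None] - cy[None, :]) ** 2 / (2 * sigma ** 2))
    Kz = np.exp(-(z[:, None] - cz[None, :]) ** 2 / (2 * sigma ** 2))
    return Ky, Kz

def H_h(Ky, Kz):
    """H-hat and h-hat of equation 2.18 from precomputed kernel values."""
    n = Ky.shape[0]
    H = (Ky.T @ Ky) * (Kz.T @ Kz) / n ** 2
    h = (Ky * Kz).mean(axis=0)
    return H, h

def cv_select(y, z, sigmas, lams, K=5, seed=0):
    """Pick (sigma, lambda) minimizing the K-fold holdout score."""
    folds = np.random.default_rng(seed).integers(0, K, size=len(y))
    best_score, best_model = np.inf, None
    for sigma in sigmas:
        for lam in lams:
            score = 0.0
            for k in range(K):
                tr, te = folds != k, folds == k
                # Fit on the training folds (centers = training samples):
                Ky, Kz = design(y[tr], z[tr], y[tr], z[tr], sigma)
                H, h = H_h(Ky, Kz)
                a = np.linalg.solve(H + lam * np.eye(tr.sum()), h)
                # Holdout J-hat on the held-out fold (equation 2.21):
                Kyt, Kzt = design(y[te], z[te], y[tr], z[tr], sigma)
                Ht, ht = H_h(Kyt, Kzt)
                score += (0.5 * a @ Ht @ a - ht @ a) / K
            if score < best_score:
                best_score, best_model = score, (sigma, lam)
    return best_model

rng = np.random.default_rng(2)
z = rng.normal(size=200)
y = z + 0.3 * rng.normal(size=200)
sigma, lam = cv_select(y, z, sigmas=[0.3, 1.0], lams=[1e-2, 1.0])
print(sigma, lam)
```

Because the holdout score approximates $J$ itself, the selected model can be reused directly inside the dimension-reduction loop without reference to any downstream predictor.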

#### 2.3.5. Design of Basis Functions

The rationale behind this basis function choice is as follows. The density ratio, equation 2.6, tends to take large values where $p_{\mathrm{y}}(y)\, p_{\mathrm{z}}(z)$ is small and $p_{\mathrm{yz}}(y, z)$ is large. When a nonnegative function is approximated by a gaussian kernel model, many kernels may be needed in the region where the output of the target function is large, whereas only a small number of kernels are needed in the region where the output of the target function is close to zero. Following this heuristic, we allocate many kernels in the regions where $p_{\mathrm{yz}}(y, z)$ is large; this can be achieved by setting the gaussian centers at the paired samples $\{(y_i, z_i)\}_{i=1}^{n}$.^{4}

When $y$ is categorical, we use the delta kernel for $y$:

$$K(y, y_\ell) = \delta(y = y_\ell),$$

where $\delta(y = y')$ is 1 if $y = y'$ and 0 otherwise. Note that in this case, the matrix $\hat{H}$ becomes block diagonal, given that the samples are sorted according to the class labels. Then the linear equation 2.19 can be solved efficiently.

More generally, when $y$ is structured (e.g., strings, trees, and graphs), we may employ kernels for structured data as $K$ (Lodhi, Saunders, Shawe-Taylor, Cristianini, & Watkins, 2002; Collins & Duffy, 2002; Kashima & Koyanagi, 2002; Kondor & Lafferty, 2002; Kashima, Tsuda, & Inokuchi, 2003; Gärtner, 2003; Gärtner, Flach, & Wrobel, 2003).

### 2.4. Least-Squares Dimension Reduction

Finally, we show how the SMI approximator is employed for dimension reduction. To find a sufficient subspace, the dimension-reduction problem is cast as an optimization problem over the Grassmann manifold (see equation 2.1).

The natural gradient $\nabla \widehat{\mathrm{SMI}}(W)$ is the projection of the ordinary gradient $\partial \widehat{\mathrm{SMI}} / \partial W$ onto the tangent space of the Grassmann manifold at $W$. If the tangent space is equipped with the canonical metric, the natural gradient is given as follows (Edelman, Arias, & Smith, 1998):

$$\nabla \widehat{\mathrm{SMI}}(W) = \frac{\partial \widehat{\mathrm{SMI}}}{\partial W}\,\big(I_d - W^{\top} W\big). \tag{2.22}$$

The geodesic from $W$ in the direction of the natural gradient can then be expressed in closed form using the matrix exponential of a skew-symmetric matrix constructed from the natural gradient and a matrix $W^{\perp}$ chosen so that stacking $W$ on $W^{\perp}$ gives an orthogonal matrix; the derivative of the geodesic at $t = 0$ coincides with the natural gradient, equation 2.22 (see Edelman et al., 1998, for a detailed derivation of the geodesic). Thus, a line search along the geodesic in the natural gradient direction amounts to finding the maximizer of $\widehat{\mathrm{SMI}}$ over the scalar step parameter $t$.

In practice, we choose the step size $t_k$ by a backtracking line search along the geodesic: starting from $t_0 = 1$, the step size is shrunk until a sufficient increase of $\widehat{\mathrm{SMI}}$ is attained, where the shrinkage factor and the tolerance are given parameters.
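The optimization scheme can be sketched as follows (an illustrative implementation, not the letter's exact algorithm: it uses the tangent-space projection above but replaces the exact matrix-exponential geodesic with a simpler QR-based retraction, and a toy objective $\mathrm{tr}(W A W^{\top})$ stands in for the SMI approximator):

```python
import numpy as np

def natural_gradient(grad, W):
    """Project the ordinary gradient onto the tangent space at W,
    where W (m x d) has orthonormal rows: W @ W.T = I_m."""
    return grad - grad @ W.T @ W

def retract(W):
    """Map a perturbed matrix back to orthonormal rows via QR."""
    Q, _ = np.linalg.qr(W.T)
    return Q.T

def grassmann_ascent(grad_fn, W, step=0.1, iters=1000):
    for _ in range(iters):
        W = retract(W + step * natural_gradient(grad_fn(W), W))
    return W

# Toy objective: tr(W A W^T) with symmetric A.  Its maximizer over the
# Grassmann manifold spans the top-m eigenspace, with optimal value
# equal to the sum of the top-m eigenvalues.
A = np.diag([5.0, 4.0, 1.0, 0.5, 0.1])
d, m = 5, 2
obj = lambda W: np.trace(W @ A @ W.T)
grad = lambda W: 2 * W @ A

W0 = retract(np.random.default_rng(4).normal(size=(m, d)))
W = grassmann_ascent(grad, W0)
print(round(obj(W), 6))  # ≈ 9.0 (= 5 + 4)
```

The retraction keeps every iterate on the manifold, and because the objective depends on $W$ only through its row space, the iteration is insensitive to the choice of representative within each equivalence class.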

## 3. Numerical Experiments

In this section, we experimentally investigate the performance of the proposed and existing dimension-reduction methods using artificial and real data sets. In the proposed method, we use gaussian kernels as basis functions and employ the regularized kernel Gram matrix as the regularization matrix $R$:

$$R = K + \epsilon I_b,$$

where $K$ is the kernel Gram matrix for the chosen centers. The term $\epsilon I_b$ is added to $K$ to avoid degeneracy. We fix the number of basis functions to $b = \min(100, n)$ and choose the gaussian width and the regularization parameter $\lambda$ based on five-fold CV with grid search. We restart the natural gradient search 10 times with random initial points and choose the solution having the minimum CV score, equation 2.21.

### 3.1. Dimension Reduction for Artificial Data Sets

We use six artificial data sets: three that we designed and three borrowed from Fukumizu et al. (2009; see Figure 1):

- **Nonlinear dependence 1:** $d = 5$, $m = 1$. $y$ has a quadratic dependence on $x$. The optimal projection is given by equation 3.1.
- **Nonlinear dependence 2:** $d = 5$, $m = 1$. $y$ has a lattice-structured dependence on $x$, where the inputs are drawn from a uniform density over a set. The optimal projection is given by equation 3.1.
- **Fukumizu et al. (2009):** $d = 4$, $m = 2$. $y$ has a nonlinear dependence on $x$.
- **Fukumizu et al. (2009):** $d = 4$, $m = 1$. $y$ has a nonlinear dependence on $x$.
- **Fukumizu et al. (2009):** $d = 10$, $m = 1$. $y$ has a nonlinear dependence on $x$.

Let us compare the proposed LSDR with the following methods.

- Kernel dimension reduction (KDR) (Fukumizu et al., 2009)
- Hilbert-Schmidt independence criterion (HSIC) (Gretton et al., 2005)
- Sliced inverse regression (SIR) (Li, 1991)
- Sliced average variance estimation (SAVE) (Cook, 2000)
- Principal Hessian direction (pHd) (Li, 1992)
- Minimum average (conditional) variance estimation (dMAVE) (Xia, 2007)

In KDR and HSIC, the gaussian width is set to the median sample distance, following the suggestions in the original papers (Gretton et al., 2005; Fukumizu et al., 2009). We used the dimension-reduction package included in R for SIR, SAVE, and pHd; the parameters for these methods, such as the number of slices, were set to their default values. To execute dMAVE, we used the publicly available code.^{5} The principal directions estimated by SIR, SAVE, pHd, and dMAVE do not necessarily form an orthogonal system; that is, if we let $F$ be the matrix each row of which corresponds to a principal direction, then $F$ is not necessarily a projection matrix. To recover a projection matrix $W$, we performed the singular value decomposition of $F$ as $F = VSU$ and set $W = U$.

*U*The performance of each method is summarized in Table 2, which depicts the mean and standard deviation of the Frobenius-norm error, equation 3.2, over 50 trials when the number of samples is *n*=100. LSDR overall shows good performance; in particular, it performs the best for data sets b and c. KDR also tends to work reasonably well, but it sometimes performs poorly; this seems to be caused by an inappropriate choice of the gaussian kernel width, implying that the heuristic of using the median sample distance as the kernel width is not always appropriate. On the other hand, LSDR with CV performs stably well for various types of data sets. dMAVE also works well and is competitive with LSDR for these artificial data sets.

| Data | d | m | LSDR | KDR | HSIC | SIR | SAVE | pHd | dMAVE |
|---|---|---|---|---|---|---|---|---|---|
| a | 5 | 1 | .13(.04) | .13(.05) | .17(.07) | .11(.05) | .37(.27) | .89(.12) | .13(.04) |
| b | 5 | 1 | .15(.06) | .25(.21) | .44(.36) | .83(.19) | .31(.11) | .24(.07) | .20(.09) |
| c | 5 | 1 | .10(.05) | .44(.32) | .68(.32) | .89(.14) | .48(.20) | .86(.12) | .25(.28) |
| d | 4 | 2 | .20(.14) | .16(.06) | .18(.08) | .30(.15) | .44(.18) | .50(.18) | .10(.05) |
| e | 4 | 1 | .09(.06) | .13(.06) | .16(.07) | .21(.10) | .34(.19) | .36(.14) | .08(.04) |
| f | 10 | 1 | .35(.12) | .40(.12) | .49(.17) | .68(.22) | .91(.13) | .83(.12) | .26(.06) |

Notes: The best method in terms of the mean error and comparable ones based on the one-sided *t*-test at the significance level 1% are indicated by boldface. Standard deviations are in parentheses.

### 3.2. Classification for Benchmark Data Sets

Finally, we evaluate the classification performance after dimension reduction for several benchmark data sets. We use Image, Waveform, Pima-Indians-Diabetes, and Letter Recognition in the UCI repository.^{6} We randomly choose 200 samples from the data set and apply LSDR, KDR, HSIC, and dMAVE to obtain projections onto low-dimensional subspaces with $m = \lceil d/4 \rceil$, $\lceil d/2 \rceil$, and $\lceil 3d/4 \rceil$, where $\lceil c \rceil$ denotes the smallest integer not smaller than *c*. Then we train the support vector machine (Schölkopf & Smola, 2002) on the 200 projected training samples.

The misclassification rate is computed for samples not used for training. Table 3 summarizes the mean and standard deviation of the classification error over 20 iterations. This shows that the proposed method overall compares favorably with the other methods.

| Data Set | d | m | LSDR | KDR | HSIC | dMAVE |
|---|---|---|---|---|---|---|
| Image | 18 | 5 | .083(.019) | .125(.038) | .158(.044) | .501(.134) |
| | 18 | 9 | .088(.022) | .106(.026) | .115(.035) | .468(.130) |
| | 18 | 14 | .093(.018) | .091(.019) | .095(.023) | .468(.130) |
| Waveform | 21 | 6 | .130(.014) | .127(.008) | .160(.016) | .183(.016) |
| | 21 | 11 | .119(.013) | .135(.010) | .163(.016) | .184(.015) |
| | 21 | 16 | .116(.007) | .131(.008) | .159(.014) | .182(.021) |
| Pima | 8 | 2 | .249(.022) | .247(.024) | .252(.020) | .257(.017) |
| | 8 | 4 | .260(.016) | .250(.021) | .252(.017) | .265(.027) |
| | 8 | 6 | .244(.020) | .243(.019) | .251(.021) | .252(.019) |
| Letter (a,b,c) | 16 | 4 | .031(.009) | .028(.012) | .035(.014) | .018(.008) |
| | 16 | 8 | .026(.008) | .017(.007) | .020(.006) | .016(.006) |
| | 16 | 12 | .016(.006) | .014(.006) | .017(.008) | .013(.008) |

Notes: The best method in terms of the mean error and comparable ones based on the one-sided *t*-test at the significance level 1% are indicated by boldface. Standard deviations are in parentheses.

## 4. Conclusion

In this letter, we have proposed a new dimension-reduction method utilizing a squared-loss variant of mutual information (SMI). Our contributions are parametric and nonparametric analyses of the rate of convergence of the SMI approximator and a dimension-reduction algorithm based on the SMI approximator. The proposed method is advantageous in several respects: for example, density estimation is not involved, it is distribution free, and model selection by cross-validation is available. The effectiveness of the proposed method over existing methods was demonstrated through experiments.

## Appendix A: Proof of Lemma 1

## Appendix B: Proof of Theorem 1

Let $w = (y, z)$, and define the kernel $h(w_i, w_j)$ from the estimating function accordingly. Then $h$ is symmetric (i.e., $h(w, w') = h(w', w)$) and has mean zero, so the quantity of interest is a $U$-statistic with the symmetric kernel $h(w_i, w_j)$. It is known (theorem 7.1 in Hoeffding, 1948) that the variance of a $U$-statistic with a kernel of degree 2 is given by

$$\mathrm{Var}(U_n) = \binom{n}{2}^{-1}\big\{2(n-2)\,\zeta_1 + \zeta_2\big\}, \qquad
\zeta_1 = \mathrm{Var}\big(\mathbb{E}[h(w_1, w_2) \mid w_1]\big), \quad
\zeta_2 = \mathrm{Var}\big(h(w_1, w_2)\big),$$

where $w_1 = (y_1, z_1)$ and $w_2 = (y_2, z_2)$ are i.i.d. variables with probability density $p_{\mathrm{yz}}$. Going through simple calculations, each term on the right-hand side of the above equation can be evaluated, and the stated result follows.
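Hoeffding's variance formula can be checked by simulation (an independent illustration, not part of the proof: $h(w_1, w_2) = (w_1 - w_2)^2/2$ is the degree-2 kernel of the sample-variance $U$-statistic, for which $\zeta_1 = 1/2$ and $\zeta_2 = 2$ under standard normal data):

```python
import numpy as np
from math import comb

n, reps = 10, 20000
rng = np.random.default_rng(5)

# Hoeffding's formula: Var(U) = C(n,2)^{-1} * (2(n-2) zeta_1 + zeta_2).
zeta1, zeta2 = 0.5, 2.0
var_formula = (2 * (n - 2) * zeta1 + zeta2) / comb(n, 2)

# Monte Carlo estimate of Var(U) for the kernel h = (w1 - w2)^2 / 2.
us = np.empty(reps)
for r in range(reps):
    w = rng.normal(size=n)
    diffs = (w[:, None] - w[None, :]) ** 2 / 2
    # Average h over the n(n-1) off-diagonal (i, j) pairs:
    us[r] = (diffs.sum() - np.trace(diffs)) / (n * (n - 1))
var_mc = us.var()
print(var_formula, var_mc)
```

For this kernel, $U_n$ coincides with the unbiased sample variance, whose variance for standard normal data is $2/(n-1)$, agreeing with the formula.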

## Appendix C: Proof of Theorem 2

Before proving theorem 2, we show the following auxiliary lemma:

The $L_2(p_{\mathrm{x}} p_{\mathrm{y}})$-norm of the difference between the two functions of a bracket is bounded in terms of the regularization functional, so the bracketing entropy of the model inherits the order of condition 2.14. Then, as in lemma 5.14 and theorem 10.6 in van de Geer (2000), we obtain a uniform tail bound on the relevant empirical process. Here we have used the double-sum ($U$-process) version of theorem 5.11 in van de Geer (2000), which is needed to obtain the same formula as lemma 5.14 in van de Geer. The right-hand side of equation C.2 is further bounded in terms of the $L_2(p_{\mathrm{x}} p_{\mathrm{y}})$-norm of $f$. The rest can be proved by following a line of argument similar to theorem 10.6 in van de Geer (2000).

There are two possible situations, depending on which term of the bound dominates. We derive the stochastic orders of the relevant quantities for the two situations separately; in either case, the claimed rate is obtained. This completes the proof of lemma 2.

Based on lemma 2, we prove theorem 2 below.

## Acknowledgments

T.S. was supported in part by MEXT Kakenhi 22700289 and the Aihara Project, the FIRST program from JSPS, initiated by CSTP. M.S. was supported by AOARD, SCAT, and the JST PRESTO program.

## References

## Notes

^{1}

In principle, it is possible to choose the gaussian width and the regularization parameter by cross-validation (CV) over a successive predictor. However, this is not preferable for two reasons. The first is a significant increase in computational cost: when CV is used, the tuning parameters in KDR (or HSIC) and the hyperparameters in the target predictor (such as basis parameters and the regularization parameter) should be optimized at the same time. This results in a deeply nested CV procedure, which can be computationally very expensive. The second is that features extracted based on CV are no longer independent of predictors, which is not preferable from the viewpoint of interpretability.

^{2}

The output $y$ can be multidimensional and either continuous (i.e., regression) or categorical (i.e., classification). Structured outputs such as strings, trees, and graphs can also be handled in our framework, as explained later.

^{4}

Alternatively, we may locate *n*^{2} gaussian kernels at all combinations $\{(y_i, z_j)\}_{i,j=1}^{n}$. However, in our preliminary experiments, this significantly increased the computational cost without improving the performance.