## Abstract

Sufficient dimension reduction (SDR) is aimed at obtaining the low-rank projection matrix in the input space such that information about output data is maximally preserved. Among various approaches to SDR, a promising method is based on the eigendecomposition of the outer product of the gradient of the conditional density of output given input. In this letter, we propose a novel estimator of the gradient of the logarithmic conditional density that directly fits a linear-in-parameter model to the true gradient under the squared loss. Thanks to this simple least-squares formulation, its solution can be computed efficiently in a closed form. Then we develop a new SDR method based on the proposed gradient estimator. We theoretically prove that the proposed gradient estimator, as well as the SDR solution obtained from it, achieves the optimal parametric convergence rate. Finally, we experimentally demonstrate that our SDR method compares favorably with existing approaches in both accuracy and computational efficiency on a variety of artificial and benchmark data sets.

## 1  Introduction

Sufficient dimension reduction (SDR) is a solid framework of supervised linear dimension reduction and is aimed at finding the projection matrix in the input space so that information about output data is maximally preserved in the subspace. An influential work on SDR is sliced inverse regression (Li, 1991), which employs the inverse regression function for obtaining the projection matrix. Since then, a number of SDR methods, such as the principal Hessian direction (Li, 1992) and sliced average variance estimation (Dennis Cook, 2000), have been developed along the same line. However, these methods commonly require the strong assumption that the probability density function of input data is elliptically symmetric, which is often not fulfilled in practice.

To overcome this limitation, various SDR methods based on conditional independence of output data given projected input data have been proposed and demonstrated to work well. For example, kernel dimension reduction (Fukumizu, Bach, & Jordan, 2004, 2009) directly evaluates the conditional independence based on kernel methods; least-squares dimension reduction (Suzuki & Sugiyama, 2013) evaluates the conditional independence through least-squares estimation of mutual information; and other related SDR methods nonparametrically estimate the conditional density of output data given projected input data (Tangkaratt, Xie, & Sugiyama, 2015; Xia, 2007). Furthermore, a supervised linear dimension reduction method robust against outliers was recently proposed based on direct estimation of the derivative of the quadratic mutual information (Tangkaratt, Sasaki, & Sugiyama, 2017). However, a common drawback is that these methods require solving nonconvex optimization problems. Thus, when gradient-based optimization methods are employed, these SDR methods can be computationally expensive and may get stuck in bad local optima.

Another line of SDR research, which is computationally efficient and avoids bad local optima, is based on the eigendecomposition of the outer product of the gradient of the conditional expectation of output data given input data (Hristache, Juditsky, Polzehl, & Spokoiny, 2001; Samarov, 1993; Xia, Tong, Li, & Zhu, 2002). However, since this approach fulfills a necessary condition only for SDR, the obtained solution is not sufficient in general. To cope with this problem, a modified SDR method based on the gradient of the conditional density of output given input, which satisfies a sufficient condition for SDR, was developed (Xia, 2007). However, since the gradient is estimated by the local linear smoother (Fan & Gijbels, 1996), it is computationally expensive for large data sets, it performs poorly when the input or subspace dimensionality is high, and model selection is cumbersome in practice.

In this letter, we propose a novel estimator of the gradient of the logarithmic conditional density of output given input that can overcome these weaknesses. The essential idea of our method is to directly fit a linear-in-parameter model to the true gradient under the squared loss, which allows efficient computation of the solution in a closed form and straightforward model selection by cross-validation. Then we develop an SDR method based on the eigendecomposition of the outer product of our gradient estimates. We theoretically prove that our gradient estimator and SDR method asymptotically provide the optimal solutions at the optimal parametric convergence rate. Finally, we experimentally demonstrate that our proposed SDR method is more accurate and computationally efficient than existing methods on a variety of data sets.

The rest of this letter is organized as follows: Section 2 mathematically formulates the problem of SDR and reviews gradient-based SDR methods. Section 3 proposes a novel estimator of the gradient of the logarithmic conditional density and develops an SDR method based on it. Section 4 theoretically investigates the properties of the proposed gradient estimator and SDR method. Section 5 experimentally evaluates the performance of the proposed SDR method on a variety of data sets. Section 6 concludes the letter. Our preliminary results were presented at ACML 2015 (Sasaki, Tangkaratt, & Sugiyama, 2015), but here we additionally perform a theoretical analysis of the proposed gradient estimator and SDR method, extend the proposed SDR method to classification, and add more experiments.

## 2  Review of Existing Methods

In this section, we formulate the problem of SDR and review existing gradient-based SDR methods.

### 2.1  Problem Formulation

Suppose that we are given a collection of pairs of input and output data,

$$\{(x_i, y_i)\}_{i=1}^{n} \subset \mathcal{X} \times \mathcal{Y},$$

that are drawn independently from a joint distribution with density $p(x, y)$. $\mathcal{X} \subseteq \mathbb{R}^{d_x}$ and $\mathcal{Y}$ are the domains of inputs and outputs, respectively, and $\top$ denotes the transpose. Further, we assume that the input dimensionality $d_x$ is large, but the "intrinsic" dimensionality of $x$, which is denoted by $d_z$, is rather small. The goal of SDR is to estimate a low-rank projection matrix $B$ from $\{(x_i, y_i)\}_{i=1}^{n}$ so that the following SDR condition is satisfied:

$$p(y \mid x) = p(y \mid B^\top x), \tag{2.1}$$

where $B \in \mathbb{R}^{d_x \times d_z}$, $B^\top B = I_{d_z}$, $I_{d_z}$ denotes the $d_z$ by $d_z$ identity matrix, and $d_z < d_x$. Throughout this letter, we denote the subspace spanned by the columns of $B$ by $\mathcal{B}$.

### 2.2  Gradient-Based Approach in SDR

A simple and computationally efficient approach to SDR is based on the gradient of the conditional expectation of $y$ given $x$. This approach begins with the following equation, which can be easily derived from the SDR condition, equation 2.1:

$$\nabla_x g(x) = B\, \nabla_z \mathbb{E}[y \mid z]\big|_{z = B^\top x}, \tag{2.2}$$

where $g(x) = \mathbb{E}[y \mid x]$ is the conditional expectation of $y$ given $x$, $\nabla_x$ denotes the vector differential operator with respect to $x$, and $b_j$ denotes the $j$th column vector in $B$. Equation 2.2 indicates that the gradient $\nabla_x g(x)$ is contained in $\mathcal{B}$. Thus, in the same way as principal component analysis, the projection matrix can be estimated as a minimizer of the expected reconstruction error of the gradient $\nabla_x g(x)$,

$$\mathbb{E}\big[\|\nabla_x g(x) - B B^\top \nabla_x g(x)\|^2\big] = \mathbb{E}\big[\|\nabla_x g(x)\|^2\big] - \mathbb{E}\big[\|B^\top \nabla_x g(x)\|^2\big], \tag{2.3}$$

where $\mathbb{E}$ is the expectation over the joint density $p(x, y)$ and $\|\cdot\|$ denotes the $\ell_2$ norm. Minimizing the left-hand side of equation 2.3 is equivalent to maximizing the second term on the right-hand side. Thus, $B$ can be estimated as a collection of the eigenvectors associated with the top $d_z$ eigenvalues of $\mathbb{E}\big[\nabla_x g(x)\, \nabla_x g(x)^\top\big]$. This approach seems appealing because the eigendecomposition gives us one of the global optima and, moreover, the outer-product matrix can be efficiently estimated. However, using conditional expectations is problematic because equation 2.2 is only a necessary condition for equation 2.1. As Fukumizu and Leng (2014, sec. 2.1) discussed, for example, when $y$ follows a heteroscedastic regression model $y = f(b_1^\top x) + \sigma(b_2^\top x)\,\epsilon$, the regression function depends only on $b_1^\top x$, but the conditional density depends on both $b_1^\top x$ and $b_2^\top x$.
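To make the eigendecomposition step concrete, the following sketch (a minimal illustration of the outer-product approach, not the authors' implementation; it assumes gradient estimates at the sample points are already available as an array) extracts the top-$d_z$ eigenvectors of the average outer product of gradients:

```python
import numpy as np

def estimate_projection(gradients, m):
    """Estimate the projection matrix from gradient vectors.

    gradients: (n, d) array of gradient estimates at the n sample points.
    m: target subspace dimensionality.
    Returns a (d, m) matrix whose columns are the eigenvectors associated
    with the top-m eigenvalues of (1/n) * sum_i grad_i grad_i^T.
    """
    M = gradients.T @ gradients / gradients.shape[0]  # average outer product, (d, d)
    eigvals, eigvecs = np.linalg.eigh(M)              # eigenvalues in ascending order
    return eigvecs[:, ::-1][:, :m]                    # top-m eigenvectors

# Toy check: gradients confined to the first coordinate axis recover e_1.
rng = np.random.default_rng(0)
G = np.zeros((100, 3))
G[:, 0] = rng.normal(size=100)
B_hat = estimate_projection(G, 1)
```

Since all gradient vectors lie along the first axis, the recovered one-column projection matrix equals the first standard basis vector up to sign.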
Gradient-based kernel dimension reduction (gKDR), which copes with this problem, is based on the gradient of conditional expectation where is a function in a reproducing kernel Hilbert space (Fukumizu & Leng, 2014). Based on the kernel method, gKDR estimates by applying the eigendecomposition of the sample average of
2.4
where is the by matrix whose th element is the partial derivative of a kernel function , denotes the regularization parameter, and and are the Gram matrices whose th elements are kernel values and , respectively. However, the performance of gKDR strongly depends on the regularization parameter and the kernel parameters in and , and there seems to be no systematic model selection method for dimension reduction based on kernel methods (Fukumizu et al., 2004; Fukumizu, Bach, & Jordan, 2009).
An alternative way to handle the limitation of the method based on the gradient of the conditional expectation is to estimate the gradient of the conditional density $p(y \mid x)$, which is also contained in $\mathcal{B}$:

$$\nabla_x p(y \mid x) = B\, \nabla_z p(y \mid z)\big|_{z = B^\top x}. \tag{2.5}$$

Unlike equation 2.2, equation 2.5 is sufficient for equation 2.1, which can be proved similarly to Fukumizu and Leng (2014, theorem 2). Given equation 2.5, the technical challenge is to estimate the gradient $\nabla_x p(y \mid x)$. The outer product of gradients method based on conditional density functions (dOPG) (Xia, 2007) nonparametrically estimates the gradient by the local linear smoother (LLS) (Fan & Gijbels, 1996), which is briefly reviewed below.
Consider a regression-like model,

$$K_h(y - y_i) = m(x_i) + \epsilon_i,$$

where $K_h$ denotes a symmetric unimodal kernel with the bandwidth parameter $h$, and we assume that the conditional expectation of $\epsilon_i$ with respect to $y_i$ given $x_i$ is zero. As shown later, $m(x)$ can be seen as a model of the conditional density $p(y \mid x)$. By taking the conditional expectation of $y_i$ given $x_i$ on both the left- and right-hand sides, we obtain

$$\mathbb{E}\big[K_h(y - y_i) \mid x_i\big] = m(x_i).$$

The key point of the above equation is that when $h \to 0$ as $n \to \infty$, the conditional expectation $\mathbb{E}[K_h(y - y_i) \mid x_i]$ approaches the conditional density $p(y \mid x_i)$. Thus, estimating $m(x)$ and its gradient is asymptotically equivalent to estimating the conditional density $p(y \mid x)$ and its gradient $\nabla_x p(y \mid x)$.

To estimate $m(x)$, the first-order Taylor approximation is applied as follows:

$$m(x_i) \approx m(x) + \nabla_x m(x)^\top (x_i - x). \tag{2.6}$$

Then $m(x)$ and its gradient $\nabla_x m(x)$, at $x = x_j$, are simultaneously estimated with the first-order Taylor approximation as the minimizers of the following weighted squared errors:

$$\min_{a_j,\, b_j}\; \sum_{i=1}^{n} w_{ij} \Big[ K_h(y - y_i) - a_j - b_j^\top (x_i - x_j) \Big]^2, \tag{2.7}$$

where $w_{ij}$ is a weight function computed with a kernel function containing its own bandwidth parameter.
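The weighted least-squares step of LLS can be sketched in a few lines. The following example (our own illustration with gaussian weights; the function and variable names are ours, not from the letter) jointly estimates the value and gradient of a regression function at a query point:

```python
import numpy as np

def local_linear_fit(X, t, x0, h):
    """Local linear smoother: solve
        min_{a, b} sum_i w_i * (t_i - a - b^T (x_i - x0))^2,
    with gaussian weights w_i = exp(-||x_i - x0||^2 / (2 h^2)).
    Returns (a, b): the estimated function value and gradient at x0."""
    D = X - x0                                         # inputs centered at the query point
    w = np.exp(-np.sum(D**2, axis=1) / (2.0 * h**2))   # locality weights
    Phi = np.hstack([np.ones((X.shape[0], 1)), D])     # design matrix [1, x - x0]
    A = Phi.T @ (Phi * w[:, None])
    rhs = Phi.T @ (w * t)
    # Tiny ridge term only for numerical stability of the solve.
    coef = np.linalg.solve(A + 1e-9 * np.eye(A.shape[0]), rhs)
    return coef[0], coef[1:]

# For an exactly linear target, the local linear fit is exact:
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
t = 2.0 + 3.0 * X[:, 0]
a, b = local_linear_fit(X, t, np.array([0.5]), h=1.0)
```

Here `a` recovers the function value at the query point and `b` its slope, because a linear function is reproduced exactly by the first-order expansion.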

The dOPG algorithm is summarized in algorithm 1. As done in Hristache et al. (2001), to improve the performance of dOPG, an iterative method is adopted that updates the weight function according to the current estimate of the projection matrix. Furthermore, a trimming function is introduced to handle problematic boundary points. However, dOPG has three drawbacks:

1. In the Taylor approximation, equation 2.6, the proximity requirement that $x_i$ lie close to $x$ holds only when data are densely distributed. Thus, dOPG might not perform well when data are sparsely distributed. This is particularly problematic when the input dimensionality $d_x$ or intrinsic dimensionality $d_z$ is relatively high.

2. dOPG can be computationally expensive for large data sets because the number of parameters to be estimated in LLS is $n(d_x + 1)$.

3. The bandwidth parameter values are determined based on the normal reference rule of nonparametric conditional density estimation (Fan & Gijbels, 1996; Silverman, 1986). Therefore, when the density is far from the normal density, this parameter selection method may not work well.

To improve performance in gradient estimation, we propose a novel estimator for the gradient of the logarithmic conditional density. The proposed estimator does not rely on the Taylor approximation and includes a cross-validation procedure for parameter selection. In addition, with an integer $b$ less than or equal to $n$, the number of parameters in the estimator is $b\, d_x$, which is much smaller than the $n(d_x + 1)$ of LLS. We then develop a gradient-based SDR method based on our gradient estimator.

## 3  Least-Squares Logarithmic Conditional Density Gradients

In this section, we propose a novel estimator for the gradient of the logarithmic conditional density:

$$\nabla_x \log p(y \mid x) = \frac{\nabla_x p(y \mid x)}{p(y \mid x)}, \tag{3.1}$$

where $p(y \mid x) = p(x, y)/p(x)$. For the gradient of the logarithmic conditional density, the SDR condition, equation 2.1, gives an equation similar to equation 2.5:

$$\nabla_x \log p(y \mid x) = B\, \nabla_z \log p(y \mid z)\big|_{z = B^\top x}.$$

This equation indicates that the gradient of the logarithmic conditional density is also included in $\mathcal{B}$, and thus $\mathcal{B}$ can be estimated based on the eigendecomposition as in the previous gradient-based approach. After proposing our gradient estimator, we develop an SDR method.

### 3.1  The Estimator

The fundamental idea is to fit a gradient model $g_j(x, y)$ directly to the true partial derivative of the logarithmic conditional density of $y$ given $x$ under the squared loss:

$$J_j(g_j) = \mathbb{E}\Big[\big(g_j(x, y) - \partial_j \log p(y \mid x)\big)^2\Big] - C_j
= \mathbb{E}\big[g_j(x, y)^2\big] - 2\,\mathbb{E}\big[g_j(x, y)\, \partial_j \log p(x, y)\big] + 2\,\mathbb{E}\big[g_j(x, y)\, \partial_j \log p(x)\big], \tag{3.2}$$

where $\partial_j$ denotes the partial derivative with respect to the $j$th input coordinate $x^{(j)}$, $C_j = \mathbb{E}[(\partial_j \log p(y \mid x))^2]$ is a constant independent of $g_j$, and we used $\partial_j \log p(y \mid x) = \partial_j \log p(x, y) - \partial_j \log p(x)$. The first term in equation 3.2 can be easily estimated from samples, but estimating the second and third terms is not straightforward.

To estimate the second term in equation 3.2, we apply integration by parts:

$$\mathbb{E}\big[g_j(x, y)\, \partial_j \log p(x, y)\big] = \iint g_j(x, y)\, \partial_j p(x, y)\, \mathrm{d}x\, \mathrm{d}y = -\,\mathbb{E}\big[\partial_j g_j(x, y)\big],$$

where we assumed that $g_j(x, y)\, p(x, y) \to 0$ as $|x^{(j)}| \to \infty$. This shows that we can estimate the second term from samples without any special effort.
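The integration-by-parts identity can be checked numerically. In this sketch (our own sanity check, not from the letter), we use the one-dimensional standard normal, whose log-density derivative is $-x$, and the test function $g(x) = x$, so both sides should equal $-1$:

```python
import numpy as np

# Integration by parts: E[ (d/dx log p(x)) * g(x) ] = -E[ g'(x) ]
# when g(x) p(x) vanishes at infinity. For the standard normal,
# d/dx log p(x) = -x; with g(x) = x we have g'(x) = 1, so both sides are -1.
rng = np.random.default_rng(1)
x = rng.normal(size=200_000)
lhs = np.mean(-x * x)             # Monte Carlo estimate of E[score * g]
rhs = -np.mean(np.ones_like(x))   # -E[g'] = -1 exactly
```

The Monte Carlo left-hand side agrees with the exact right-hand side up to sampling noise, illustrating why the second term can be replaced by a plain sample average of the model's derivative.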

Unlike the second term, integration by parts does not help us estimate the third term. One difficulty is that the term includes the true partial derivative of the log density, $\partial_j \log p(x)$. Instead, our approach to estimating the third term is to employ a nonparametric plug-in estimator for the log-density derivative, called least-squares log-density gradients (LSLDG) (Cox, 1985; Sasaki, Hyvärinen, & Sugiyama, 2014). LSLDG directly estimates log-density derivatives without going through density estimation. Its solution can be computed efficiently in a closed form, and it includes a cross-validation procedure for model selection. Furthermore, it has been experimentally shown to work better than the gradient estimator based on kernel density estimation. Therefore, LSLDG should accurately and efficiently approximate the third term.

Substituting $\partial_j \log p(x)$ by the LSLDG estimator $\widehat{\partial_j \log p}(x)$ gives an approximation of the risk as

$$\tilde{J}_j(g_j) = \mathbb{E}\big[g_j(x, y)^2\big] + 2\,\mathbb{E}\big[\partial_j g_j(x, y)\big] + 2\,\mathbb{E}\big[g_j(x, y)\, \widehat{\partial_j \log p}(x)\big].$$

Then the empirical version of the approximative risk is provided by

$$\hat{J}_j(g_j) = \frac{1}{n} \sum_{i=1}^{n} \Big[ g_j(x_i, y_i)^2 + 2\, \partial_j g_j(x_i, y_i) + 2\, g_j(x_i, y_i)\, \widehat{\partial_j \log p}(x_i) \Big]. \tag{3.3}$$

To estimate $\partial_j \log p(y \mid x)$, we use the following linear-in-parameter model:

$$g_j(x, y) = \sum_{k=1}^{b} \theta_{j,k}\, \psi_{j,k}(x, y) = \theta_j^\top \psi_j(x, y), \tag{3.4}$$

where $\psi_{j,k}$ denote basis functions and $b$ is the number of basis functions, whose value is fixed in this letter. For regression problems, with the bandwidth parameters $\sigma_x$ and $\sigma_y$, we set

$$\psi_{j,k}(x, y) = \exp\!\left(-\frac{\|x - u_k\|^2}{2\sigma_x^2} - \frac{(y - v_k)^2}{2\sigma_y^2}\right),$$

while in the classification scenario,

$$\psi_{j,k}(x, y) = \exp\!\left(-\frac{\|x - u_k\|^2}{2\sigma_x^2}\right) I(y = v_k),$$

where $u_k$ and $v_k$ are the gaussian centers randomly chosen from $\{x_i\}_{i=1}^{n}$ and $\{y_i\}_{i=1}^{n}$, respectively, and $I(\cdot)$ equals 1 if its argument holds and 0 otherwise. By substituting this model into equation 3.3 and adding the regularizer $\lambda\, \theta_j^\top \theta_j$ to equation 3.3, we obtain the closed-form solution as

$$\hat{\theta}_j = -\big(G_j + \lambda I_b\big)^{-1} h_j,$$

where $\lambda \geq 0$ is the regularization parameter,

$$G_j = \frac{1}{n} \sum_{i=1}^{n} \psi_j(x_i, y_i)\, \psi_j(x_i, y_i)^\top, \qquad
h_j = \frac{1}{n} \sum_{i=1}^{n} \Big[ \partial_j \psi_j(x_i, y_i) + \psi_j(x_i, y_i)\, \widehat{\partial_j \log p}(x_i) \Big].$$
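Because the solution is just a regularized linear system, it can be computed in a few lines. The sketch below is our own illustration: `G` and `h` stand in for the Gram-type matrix and linear term of the empirical risk, whose exact basis design is not reproduced here, and the synthetic check builds `h` so that a known parameter vector is the exact minimizer.

```python
import numpy as np

def lslcg_solution(G, h, lam):
    """Closed-form minimizer of the regularized quadratic
        theta^T G theta + 2 h^T theta + lam * theta^T theta.
    Setting the gradient 2 (G + lam I) theta + 2 h to zero gives
        theta = -(G + lam I)^{-1} h."""
    b = G.shape[0]
    return -np.linalg.solve(G + lam * np.eye(b), h)

# Sanity check: construct h so that a known theta_true is the exact minimizer.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 5))
G = A.T @ A / 50                       # symmetric positive definite Gram-type matrix
theta_true = np.arange(1.0, 6.0)
lam = 0.1
h = -(G + lam * np.eye(5)) @ theta_true
theta_hat = lslcg_solution(G, h, lam)
```

Solving one small linear system per input dimension is what makes the overall estimator cheap compared with pointwise smoothing.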

### 3.2  Model Selection by Cross-Validation

The performance of LSLCG depends on the choice of models, which in the current setup are the gaussian bandwidth parameters, $\sigma_x$ and $\sigma_y$, and the regularization parameter $\lambda$. We perform model selection by cross-validation as follows:

• Step 1: Divide the samples $\{(x_i, y_i)\}_{i=1}^{n}$ into $K$ disjoint subsets $\{\mathcal{T}_k\}_{k=1}^{K}$.

• Step 2: Obtain the estimator $\hat{g}_j^{(k)}$ using $\{\mathcal{T}_{k'}\}_{k' \neq k}$ (i.e., without $\mathcal{T}_k$), and then compute the holdout error for $\mathcal{T}_k$ as

$$\mathrm{CV}(k) = \frac{1}{|\mathcal{T}_k|} \sum_{(x, y) \in \mathcal{T}_k} \sum_{j=1}^{d_x} \Big[ \hat{g}_j^{(k)}(x, y)^2 + 2\, \partial_j \hat{g}_j^{(k)}(x, y) + 2\, \hat{g}_j^{(k)}(x, y)\, \widehat{\partial_j \log p}(x) \Big], \tag{3.5}$$

where $|\mathcal{T}_k|$ denotes the number of elements in $\mathcal{T}_k$.

• Step 3: Choose the model that minimizes $\frac{1}{K} \sum_{k=1}^{K} \mathrm{CV}(k)$.
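The three steps above amount to a standard grid search. A schematic version (our own generic sketch: `fit` and `holdout_error` are placeholders for the LSLCG training step and the holdout criterion, and the toy example below is a stand-in objective, not the actual estimator):

```python
import numpy as np

def cross_validate(params, fit, holdout_error, data, K=5, seed=0):
    """K-fold CV: for each candidate parameter, fit on K-1 subsets and
    average the holdout error on the remaining subset; return the minimizer."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(data)), K)
    scores = []
    for p in params:
        err = 0.0
        for k in range(K):
            train = np.concatenate([folds[j] for j in range(K) if j != k])
            model = fit(data[train], p)
            err += holdout_error(model, data[folds[k]])
        scores.append(err / K)
    return params[int(np.argmin(scores))]

# Stand-in example: choosing an additive offset for a mean estimator;
# the offset 0.0 should win on held-out squared error.
rng = np.random.default_rng(2)
data = rng.normal(size=500)
best = cross_validate(
    params=[0.0, 5.0],
    fit=lambda d, p: d.mean() + p,
    holdout_error=lambda m, d: float(np.mean((d - m) ** 2)),
    data=data,
)
```

For LSLCG, the candidate grid would range over the bandwidths and the regularization parameter, with the holdout error of equation 3.5 in place of the squared error used here.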

### 3.3  Illustration of Gradient Estimation

Here, we illustrate how accurately LSLCG can estimate the gradient of the log-conditional density and compare it with existing methods. In this illustration, the output was generated from
where both and were drawn from the standard normal distribution. We applied the following three methods to the data:
1. LSLCG: The proposed estimator for the gradient of the logarithmic conditional density. We fix at , which denotes the median value of with respect to and , but and are cross-validated as in section 3.2: were selected from 10 different values of (), where is the median value of with respect to and , and with 10 different values of from to .

2. KDE: The gradient of the logarithmic conditional density was estimated through kernel density estimation (KDE). First, with the gaussian kernel for KDE, the conditional density was estimated by
where the bandwidth parameters and were independently cross-validated with respect to the log likelihood of the kernel density estimates of and by employing and , respectively. We selected 10 candidates of for and for where and is the median value of with respect to , , and . Then the log-conditional gradients were computed from .
3. LLS: The gradient of the logarithmic conditional density was estimated through LLS (Fan & Gijbels, 1996). A gradient estimate for the log-conditional density at is given by the ratio because and in equation 2.7 correspond to estimates of the conditional density and its gradient, respectively. We set and , where was cross-validated with respect to the log likelihood of the kernel density estimate of by employing as the kernel function; similarly, was also cross-validated. We used the same candidates of and for and , respectively.

The estimation error was assessed by
where denotes an estimate of .

The result is presented in Figure 1. As the dimensionality of input data increases, KDE produces larger errors (see Figure 1a). A possible reason is that a good density estimator does not necessarily mean a good density gradient estimator. Thus, estimating density derivatives via density estimation is not a good approach, as previously demonstrated in (nonconditional) log-density-derivative estimation (Sasaki et al., 2014). LLS also does not work well for high-dimensional data, while the errors of LSLCG increase much more mildly. In addition, Figure 1b shows that the errors of LSLCG quickly decrease as the sample size increases. Thus, our approach of fitting a model directly to the true gradient is promising.

Figure 1:

Comparison to existing methods in estimation of log-conditional density gradients. Each point and error bar denote the average and standard deviation over 100 runs, respectively.


### 3.4  Least-Squares Gradients for Dimension Reduction

Following the gradient-based approach, we propose a new SDR method in algorithm 2. The proposed SDR method can be interpreted as a simpler version of dOPG in algorithm 1, without the iterative steps and the trimming function. We call this method least-squares gradients for dimension reduction (LSGDR).
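To see why the eigendecomposition of the outer product of log-conditional-density gradients recovers the subspace, consider a toy linear-gaussian model where the gradient is available analytically (a self-contained illustration of the principle; LSGDR itself would use LSLCG estimates in place of the analytic gradients):

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, s = 5, 2000, 0.5
b = np.zeros(d); b[0] = 1.0                 # true 1-dimensional central subspace
X = rng.normal(size=(n, d))
y = X @ b + s * rng.normal(size=n)

# For y | x ~ N(b^T x, s^2): log p(y|x) = -(y - b^T x)^2 / (2 s^2) + const,
# so grad_x log p(y|x) = ((y - b^T x) / s^2) * b, which lies along b.
G = ((y - X @ b) / s**2)[:, None] * b[None, :]

M = G.T @ G / n                             # average outer product of gradients
eigvals, eigvecs = np.linalg.eigh(M)
b_hat = eigvecs[:, -1]                      # top eigenvector spans the subspace
```

Every gradient vector is a scalar multiple of the true direction, so the outer-product matrix is rank one and its leading eigenvector recovers the subspace exactly; with estimated gradients, the recovery is approximate in the same way.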

## 4  Theoretical Analysis

This section provides a theoretical investigation of the behavior of LSLCG and LSGDR.

Our analysis focuses on the estimation error relative to the optimal LSLCG estimate $\theta_j^{*}$ of $\theta_j$ without finite sample approximation, which is defined as
where
and and is the optimal LSLDG estimate of . If is strictly positive definite, then is allowed in our analysis.

As proved in the supplementary material of Sasaki, Niu, and Sugiyama (2016), the plug-in LSLDG estimator achieves the optimal parametric rate, $O_p(n^{-1/2})$, where $O_p$ denotes the probabilistic order, and thus, without loss of generality, we simply assume that the LSLDG estimate converges to the true log-density derivative at this rate for all $j$. Under this setting, we obtain the following theorem:

Theorem 1.
Suppose that the number of basis functions $b$ in the linear-in-parameter model, equation 3.4, is fixed and does not grow as $n$ increases. Then, as $n \to \infty$,

$$\hat{\theta}_j = \theta_j^{*} + O_p(n^{-1/2}),$$

provided that the regularization parameter $\lambda$ decreases to zero sufficiently fast.

The proof of theorem 1 is in appendix A. Theorem 1 asserts that LSLCG is consistent with the optimal estimate and also achieves the optimal parametric convergence rate in a standard setting.

Let us denote by $\hat{B}$ and $B^{*}$ the collections of the eigenvectors associated with the top $d_z$ eigenvalues of the empirical and optimal outer-product matrices, respectively. $\hat{B}$ corresponds to an estimate from LSGDR. Next, we define the minimum eigengap of the optimal outer-product matrix. Let $\lambda_{(1)} > \lambda_{(2)} > \cdots > \lambda_{(L)}$ be its disjoint eigenvalues, that is, the eigenvalues counted without multiplicity, such that $\lambda_{(L)}$ is the $d_z$th largest eigenvalue counted with multiplicity. The minimum eigengap is defined by

$$\delta = \min_{1 \leq \ell \leq L-1} \big( \lambda_{(\ell)} - \lambda_{(\ell+1)} \big).$$

Then, the following theorem indicates that $\hat{B}$ approaches $B^{*}$ as $n$ increases:
Theorem 2.
Suppose that the number of basis functions $b$ in the linear-in-parameter model, equation 3.4, is fixed and does not grow as $n$ increases. Furthermore, assume that the respective $d_z$th largest eigenvalues of the empirical and optimal outer-product matrices are larger than the respective $(d_z + 1)$th largest eigenvalues. Then, as $n \to \infty$,

$$\big\| \hat{B} \hat{B}^\top - B^{*} {B^{*}}^\top \big\|_F = O_p(n^{-1/2}), \tag{4.1}$$

where $\|\cdot\|_F$ stands for the Frobenius norm.

Appendix B includes the proof of theorem 2. As with LSLCG, LSGDR also provides a consistent estimate, and its convergence rate is optimal under the parametric setting.

## 5  Numerical Experiments

In this section, we experimentally investigate the performance of our SDR method on both artificial and benchmark data sets and compare our method with existing methods.

### 5.1  Illustration of Dimension Reduction on Artificial Data

First, we illustrate the performance of the proposed SDR method using artificial data, and comparison is made among the following SDR methods:

• LSGDR: The proposed method following a gradient-based approach. Here, we fix at but perform five-fold cross-validation for and as in section 3.2. were selected from 10 different values of (), and with 10 different values of from to 1. We recall that and are the median values of and with respect to and , respectively.

• dMAVE (Xia, 2007): To estimate , this method solves a nonconvex optimization problem. To avoid bad local optima, an initial estimate of is given by the first iteration of algorithm 1, the dOPG algorithm.

• LSDR (Suzuki & Sugiyama, 2013): To estimate , this method solves a nonconvex optimization problem. To avoid bad local optima, multiple point search with 10 different initial values is performed, and the best solution is chosen with respect to a criterion. LSDR also performs five-fold cross-validation for model selection, and we set the same number of parameter candidates as LSGDR.

• gKDR (Fukumizu & Leng, 2014): This method follows a gradient-based approach as reviewed in section 2.2. The gaussian kernels were used for and . As in LSGDR, we fix the width parameter in at , and the width parameter in and the regularization parameter are determined by five-fold cross-validation based on a $k$-nearest-neighbor regression/classification method where $k$ is fixed at 5. We set the parameter candidates by following Fukumizu and Leng (2014). The 10 candidates for the width in were given by (), where is the median of (Gretton et al., 2007), while with 10 different values of from to . The total number of parameter candidates is the same as for LSGDR.

• dOPG (Xia, 2007): This method follows a gradient-based approach as reviewed in section 2.2.

#### 5.1.1  Dimension-Scalability and Computational Efficiency

Here, we experimentally investigate the behavior of the SDR methods based on the following model:
where the inputs and noise were drawn from the standard normal distribution. Under this model, the optimal projection matrix is given by

$$B^{*} = \big( I_{d_z} \;\; O_{d_z \times (d_x - d_z)} \big)^\top, \tag{5.1}$$

where $O_{d_z \times (d_x - d_z)}$ denotes the $d_z$ by $(d_x - d_z)$ null matrix. As in section 4, the estimation error was assessed by

$$\mathrm{Error} = \big\| \hat{B} \hat{B}^\top - B^{*} {B^{*}}^\top \big\|_F,$$

where $\hat{B}$ denotes an estimate of $B^{*}$ by an SDR method.

As illustrated in Figures 2a and 2b, the performance of LSGDR is the best over a wide range of data and intrinsic dimensions. As the intrinsic dimension increases, dMAVE, LSDR, and dOPG do not work well. For LSDR, good initialization would be difficult when the intrinsic dimension is high, and dMAVE is initialized based on dOPG, which also performs poorly. The unsatisfactory performance of dOPG implies that it is the higher intrinsic dimension, rather than the original data dimension, that makes estimation harder. gKDR works reasonably well even for data with a relatively high intrinsic dimension. However, as the data dimension increases, gKDR also produces large errors. A possible reason is that model selection in gKDR might be more difficult for higher-dimensional data.

Figure 2:

Comparison to existing SDR methods in terms of estimation error against (a) the intrinsic dimension and (b) data dimension . Each point denotes the average over 50 runs, and the error bars are standard deviations.


Figure 3 reveals that LSGDR is a computationally efficient method. Since dMAVE and LSDR make use of the projected data, their computation costs increase as the intrinsic dimensionality grows. On the other hand, the intrinsic dimensionality does not strongly affect the computation costs of gKDR, dOPG, and LSGDR, which use the original (nonprojected) input data. Instead, when the data dimension increases, the computation costs of gKDR, dOPG, and LSGDR increase (see Figure 3b). Note that the computation costs of dOPG and LSGDR increase more mildly. This is because LSGDR uses only $b$ basis functions in the linear-in-parameter model, equation 3.4, and dOPG does not perform cross-validation. In addition, the sample size increases the computation costs of dMAVE, dOPG, and gKDR (see Figure 3c). Both dMAVE and dOPG employ LLS, in which the number of parameters to be estimated is $n(d_x + 1)$, and gKDR has to compute the inverse of an $n$ by $n$ matrix in equation 2.4. For gKDR, the computation cost can be decreased by reducing the size of the Gram matrices and the number of centers in the kernel function, as done in LSGDR. However, this would be inappropriate: when estimating the projection matrix, the sample average of equation 2.4 is taken over only the input data, and reducing the size of the Gram matrix is equivalent to discarding a number of output samples. Thus, the performance of gKDR can become worse. In contrast, LSGDR employs the sample average of the outer product over both input and output data and works well with a smaller number of centers in the basis functions.

Figure 3:

Comparison to existing SDR methods in terms of CPU time against (a) the intrinsic dimension , (b) data dimension , and (c) sample size . Each point denotes the average over 50 runs, and the error bars are standard deviations. The vertical axes are displayed in logarithmic scale.


#### 5.1.2  Illustration on a Variety of Artificial Data Sets

Here, we generated data according to the following various kinds of models, all of which were adopted from the articles of dMAVE, LSDR, gKDR, and dOPG:

• Model a (Xia, 2007):
where denotes the sign function, , , and denotes the normal density with the mean and covariance matrix . The first four elements in are all 0.5, while the others are zeros. For , the first four elements are 0.5, , 0.5, and , respectively, and the others are zeros. The optimal projection matrix is .
• Model b (Xia, 2007):
where , , , , and denotes the uniform density on . The optimal projection matrix is .
• Model c (Xia, 2007):
where and . The optimal projection matrix is the same as equation 5.1.
• Model d (Fukumizu et al., 2009; Suzuki & Sugiyama, 2013):
where and . The optimal projection matrix is the same as equation 5.1.
• Model e (Fukumizu et al., 2009; Suzuki & Sugiyama, 2013):
where and . The optimal projection matrix is the same as 5.1.
• Model f (Fukumizu & Leng, 2014):
where , and . The optimal projection matrix is .
• Model g (Fukumizu & Leng, 2014):
where , , , , and denotes the Gamma density with the shape parameter and scale parameter . The optimal projection matrix is .
• Model h (Fukumizu & Leng, 2014):
where is drawn from a normal distribution truncated on , and . The optimal projection matrix is the same as equation 5.1.

The results are summarized in Table 1. Table 1 indicates that, for models a to c, LSGDR performs best or comparably to the best method in terms of the estimation error. Since models a, b, and c are complex, LSGDR is a promising method for various kinds of data. In terms of computation cost, LSGDR is more advantageous than dOPG when the sample size is large, as reviewed in section 2.2. For the Table 1 entries for models d and e, LSGDR is the most accurate method, and dOPG is the most computationally efficient. Unlike LSGDR, LSDR, and gKDR, the parameters in dOPG are not cross-validated, and thus it is efficient when the sample size is small. Even so, the computation costs of LSGDR and gKDR are not so expensive. Overall, LSGDR also performs well on data drawn from nongaussian densities (see Table 1).

Table 1:
Estimation Errors and CPU Time.
| Model | Metric | LSGDR | dMAVE | LSDR | gKDR | dOPG |
|---|---|---|---|---|---|---|
| a | Error | 0.364 (0.117) | 0.657 (0.177) | 0.655 (0.104) | 0.670 (0.123) | 0.973 (0.217) |
| a | Time | 0.893 (0.084) | 1.546 (0.096) | 1.290 (0.100) | 2.342 (0.070) | 0.417 (0.085) |
| b | Error | 0.303 (0.095) | 0.280 (0.061) | 0.388 (0.085) | 0.424 (0.154) | 0.322 (0.069) |
| b | Time | 0.719 (0.067) | 1.600 (0.160) | 1.249 (0.095) | 2.665 (0.099) | 0.839 (0.074) |
| c | Error | 0.205 (0.257) | 0.286 (0.060) | 1.666 (0.186) | 0.623 (0.159) | 0.811 (0.364) |
| c | Time | 0.746 (0.099) | 1.871 (0.189) | 1.836 (0.137) | 2.796 (0.132) | 1.042 (0.125) |
| d | Error | 0.100 (0.114) | 0.223 (0.115) | 0.378 (0.262) | 0.382 (0.253) | 0.218 (0.112) |
| d | Time | 0.057 (0.061) | −0.763 (0.043) | 0.709 (0.059) | 0.449 (0.029) | 1.088 (0.048) |
| e | Error | 0.193 (0.143) | 0.368 (0.098) | 0.594 (0.198) | 0.676 (0.317) | 0.619 (0.162) |
| e | Time | 0.442 (0.041) | 0.032 (0.015) | 0.836 (0.064) | 0.962 (0.026) | 0.646 (0.021) |
| f | Error | 0.124 (0.099) | 0.097 (0.029) | 0.086 (0.021) | 0.126 (0.030) | 0.068 (0.015) |
| f | Time | 0.640 (0.106) | 0.626 (0.135) | 1.078 (0.113) | 1.910 (0.073) | 0.009 (0.099) |
| g | Error | 0.292 (0.477) | 1.022 (0.339) | 0.762 (0.354) | 0.757 (0.356) | 0.999 (0.379) |
| g | Time | 0.724 (0.075) | 1.386 (0.120) | 1.516 (0.088) | 2.512 (0.079) | 0.621 (0.084) |
| h | Error | 0.044 (0.027) | 0.150 (0.043) | 0.234 (0.066) | 0.184 (0.068) | 0.234 (0.063) |
| h | Time | 0.726 (0.053) | 1.380 (0.119) | 1.074 (0.070) | 2.509 (0.071) | 0.599 (0.054) |

Notes: Averages and standard deviations of estimation errors and CPU time over 50 runs. The numbers in parentheses are standard deviations. CPU time is displayed in logarithmic scale. The best and comparable methods judged by the t-test are in bold.

Table 2:
Regression Errors.
| Data set | LSGDR | dMAVE | LSDR | gKDR | No Reduc. |
|---|---|---|---|---|---|
| White wine | 0.840 (0.012) | 0.846 (0.014) | 0.847 (0.013) | 0.843 (0.012) | 0.836 (0.008) |
| | 0.844 (0.011) | 0.851 (0.012) | 0.856 (0.015) | 0.849 (0.014) | 0.844 (0.009) |
| | 0.846 (0.010) | 0.862 (0.014) | 0.865 (0.012) | 0.858 (0.015) | 0.850 (0.009) |
| | 0.848 (0.012) | 0.866 (0.015) | 0.879 (0.018) | 0.860 (0.013) | 0.856 (0.011) |
| Red wine | 0.810 (0.015) | 0.811 (0.019) | 0.808 (0.019) | 0.809 (0.016) | 0.804 (0.015) |
| | 0.816 (0.016) | 0.819 (0.017) | 0.820 (0.018) | 0.813 (0.017) | 0.813 (0.014) |
| | 0.816 (0.014) | 0.826 (0.015) | 0.823 (0.016) | 0.821 (0.016) | 0.825 (0.012) |
| | 0.815 (0.013) | 0.832 (0.014) | 0.831 (0.015) | 0.824 (0.016) | 0.828 (0.012) |
| Housing | 0.456 (0.047) | 0.436 (0.039) | 0.467 (0.054) | 0.428 (0.043) | 0.442 (0.045) |
| | 0.462 (0.043) | 0.465 (0.041) | 0.483 (0.052) | 0.457 (0.040) | 0.463 (0.046) |
| | 0.461 (0.042) | 0.461 (0.043) | 0.487 (0.042) | 0.455 (0.046) | 0.467 (0.043) |
| | 0.463 (0.044) | 0.493 (0.041) | 0.510 (0.049) | 0.484 (0.038) | 0.521 (0.042) |
| Concrete | 0.416 (0.019) | 0.420 (0.020) | 0.441 (0.023) | 0.404 (0.021) | 0.428 (0.015) |
| | 0.424 (0.024) | 0.437 (0.025) | 0.446 (0.021) | 0.413 (0.023) | 0.467 (0.015) |
| | 0.419 (0.023) | 0.447 (0.024) | 0.457 (0.023) | 0.440 (0.026) | 0.508 (0.017) |
| | 0.420 (0.021) | 0.457 (0.021) | 0.459 (0.022) | 0.454 (0.023) | 0.545 (0.018) |
| Yacht | 0.122 (0.017) | 0.160 (0.042) | 0.165 (0.063) | 0.139 (0.028) | 0.485 (0.035) |
| | 0.123 (0.023) | 0.176 (0.043) | 0.158 (0.045) | 0.204 (0.059) | 0.577 (0.043) |
| | 0.122 (0.016) | 0.202 (0.078) | 0.162 (0.046) | 0.257 (0.058) | 0.624 (0.037) |
| | 0.124 (0.016) | 0.217 (0.057) | 0.180 (0.059) | 0.285 (0.057) | 0.660 (0.042) |
| Auto MPG | 0.394 (0.032) | 0.381 (0.028) | 0.378 (0.030) | 0.373 (0.023) | 0.365 (0.023) |
| | 0.382 (0.033) | 0.387 (0.025) | 0.387 (0.023) | 0.383 (0.027) | 0.389 (0.023) |
| | 0.386 (0.030) | 0.394 (0.022) | 0.394 (0.025) | 0.394 (0.027) | 0.426 (0.023) |
| | 0.384 (0.029) | 0.391 (0.024) | 0.398 (0.032) | 0.389 (0.024) | 0.430 (0.023) |
| Physicochem | 0.827 (0.025) | 0.812 (0.027) | 0.808 (0.026) | 0.802 (0.023) | 0.801 (0.024) |
| | 0.831 (0.023) | 0.825 (0.028) | 0.825 (0.026) | 0.827 (0.026) | 0.836 (0.020) |
| | 0.840 (0.021) | 0.835 (0.024) | 0.840 (0.023) | 0.839 (0.025) | 0.845 (0.021) |
| | 0.837 (0.023) | 0.848 (0.024) | 0.855 (0.035) | 0.848 (0.022) | 0.861 (0.020) |
| Air foil | 0.443 (0.018) | 0.463 (0.022) | 0.481 (0.033) | 0.440 (0.021) | 0.475 (0.017) |
| | 0.464 (0.027) | 0.475 (0.023) | 0.523 (0.041) | 0.462 (0.024) | 0.569 (0.013) |
| | 0.461 (0.026) | 0.493 (0.029) | 0.555 (0.041) | 0.494 (0.019) | 0.613 (0.010) |
| | 0.481 (0.028) | 0.494 (0.027) | 0.575 (0.032) | 0.519 (0.026) | 0.637 (0.013) |
| Power plant | 0.256 (0.003) | 0.253 (0.003) | 0.255 (0.003) | 0.253 (0.003) | 0.252 (0.002) |
| | 0.257 (0.003) | 0.254 (0.004) | 0.256 (0.003) | 0.255 (0.003) | 0.264 (0.003) |
| | 0.257 (0.003) | 0.255 (0.003) | 0.258 (0.003) | 0.257 (0.003) | 0.280 (0.005) |
| | 0.259 (0.006) | 0.257 (0.003) | 0.259 (0.003) | 0.258 (0.005) | 0.295 (0.005) |
| Body fat (StatLib) | 0.586 (0.035) | 0.600 (0.032) | 0.605 (0.044) | 0.589 (0.035) | 0.612 (0.026) |
| | 0.587 (0.032) | 0.611 (0.045) | 0.623 (0.041) | 0.596 (0.042) | 0.623 (0.027) |
| | 0.593 (0.031) | 0.624 (0.047) | 0.631 (0.049) | 0.601 (0.043) | 0.641 (0.030) |
| | 0.603 (0.032) | 0.656 (0.052) | 0.652 (0.047) | 0.613 (0.038) | 0.659 (0.031) |

Notes: Averages and standard deviations of regression errors over 50 runs. "No Reduc." means the results without dimension reduction. The best and comparable methods according to the t-test at the significance level are in bold. For each data set, the four rows correspond to increasing numbers of noise dimensions added to the original data.

### 5.2  Regression on Benchmark Data Sets

Next, we apply LSGDR and the existing methods, dMAVE, LSDR, and gKDR, to regression tasks on the UCI benchmark data sets (Bache & Lichman, 2013) and StatLib.8 In this experiment, dOPG was excluded because dMAVE is initialized based on dOPG and, as demonstrated in the previous experiments, often showed similar or better estimation performance than dOPG. We first standardized each data set so that the mean and standard deviation are zero and one, respectively. Then we randomly selected training samples from each data set; the remaining samples were used in the test phase.9 After estimating the projection matrix by each method in the training phase, we performed kernel ridge regression with the gaussian kernel on the dimension-reduced data. In the test phase, the regression error of the learned regressor was measured on the test samples.
Furthermore, we made the data sets more challenging by concatenating independently drawn Gamma variables to the original inputs. Unlike the previous experiment, the true intrinsic dimensionality is unknown, and thus we cross-validated it as follows: fivefold cross-validation was performed over a set of candidate dimensionalities to choose the intrinsic dimensionality for each method so as to minimize the regression error.
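Under the assumption that the test-phase criterion is the root-mean-squared error of the learned regressor $\hat{f}$ over the $n_{\mathrm{te}}$ test pairs $(x_j, y_j)$, with $\widehat{W}$ the estimated projection matrix (the letter's exact normalization may differ), it can be written as:

```latex
\mathrm{Error} \;=\; \sqrt{\, \frac{1}{n_{\mathrm{te}}} \sum_{j=1}^{n_{\mathrm{te}}}
\left\{\, y_j - \hat{f}\!\left(\widehat{W}^{\top} x_j\right) \right\}^{2} \,}
```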

The results are summarized in Table 2. When no noise dimensions are added, the performance of LSGDR is comparable to or worse than the other methods on some data sets. However, as the number of noise dimensions increases, LSGDR often significantly outperforms the other methods. These results imply that LSGDR is useful for finding informative subspaces in relatively high-dimensional data.
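To make the evaluation protocol concrete, here is a minimal numpy sketch of the test-phase pipeline under illustrative assumptions: a random orthonormal projection `W` stands in for the matrix an SDR method would estimate, and the gaussian-kernel bandwidth `sigma` and ridge parameter `lam` are fixed rather than cross-validated.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    # Pairwise gaussian kernel values between rows of A and rows of B.
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def kernel_ridge_fit(X, y, sigma, lam):
    # Closed-form kernel ridge regression: solve (K + lam * I) alpha = y.
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def kernel_ridge_predict(alpha, X_train, X_test, sigma):
    return gaussian_kernel(X_test, X_train, sigma) @ alpha

rng = np.random.default_rng(0)
n, d = 200, 5
W = np.linalg.qr(rng.normal(size=(d, 1)))[0]     # hypothetical 1-D projection
X = rng.normal(size=(n, d))
y = np.sin(X @ W).ravel() + 0.1 * rng.normal(size=n)

Z = X @ W                                        # dimension-reduced inputs
alpha = kernel_ridge_fit(Z[:100], y[:100], sigma=1.0, lam=1e-3)
y_hat = kernel_ridge_predict(alpha, Z[:100], Z[100:], sigma=1.0)
rmse = np.sqrt(np.mean((y[100:] - y_hat) ** 2))  # test-phase regression error
```

The cross-validation of the intrinsic dimensionality described in the text would wrap this pipeline in an outer loop over candidate dimensionalities.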

### 5.3  Classification on Benchmark Data Sets

Finally, LSGDR is applied to binary classification. Data sets were downloaded from https://www.csie.ntu.edu.tw/∼cjlin/libsvmtools/datasets/.10 The training data were randomly selected from the whole data set, and the remaining data were used in the test phase; for large data sets, we randomly selected 1000 samples for the test phase. As preprocessing, we standardized both the training and test data using the mean and standard deviation of the training data. In the training phase, the projection matrix was estimated by each method from the training data, and then a support vector machine (SVM) (Schölkopf & Smola, 2001) was trained on the dimension-reduced data.11 The performance was measured by the misclassification rate of the trained SVM on the test data. In this classification experiment, we used a variant of gKDR called gKDR-v, as in Fukumizu and Leng (2014). As in the previous regression experiment, the intrinsic dimensionality was cross-validated in the training phase so that the misclassification rate was minimized.
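The cross-validation of the intrinsic dimensionality can be sketched as follows. Everything here is an illustrative assumption: a nearest-centroid rule replaces the SVM, and the top principal directions of each training fold stand in for the projection matrix an SDR method would estimate.

```python
import numpy as np

def nearest_centroid_error(Z_tr, y_tr, Z_te, y_te):
    # Misclassification rate of a nearest-centroid rule (stand-in for SVM).
    c0, c1 = Z_tr[y_tr == 0].mean(0), Z_tr[y_tr == 1].mean(0)
    pred = (np.linalg.norm(Z_te - c1, axis=1)
            < np.linalg.norm(Z_te - c0, axis=1)).astype(int)
    return np.mean(pred != y_te)

def cv_choose_dim(X, y, candidates, n_folds=5, seed=0):
    # Pick the intrinsic dimensionality minimizing CV misclassification.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    best_d, best_err = None, np.inf
    for d in candidates:
        errs = []
        for k in range(n_folds):
            te = folds[k]
            tr = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            # Hypothetical projection: top-d principal directions of the
            # training fold (an SDR method would supply this matrix instead).
            _, _, Vt = np.linalg.svd(X[tr] - X[tr].mean(0), full_matrices=False)
            W = Vt[:d].T
            errs.append(nearest_centroid_error(X[tr] @ W, y[tr],
                                               X[te] @ W, y[te]))
        if np.mean(errs) < best_err:
            best_d, best_err = d, np.mean(errs)
    return best_d, best_err

# Toy binary data whose class signal lives in a single input direction.
rng = np.random.default_rng(1)
n = 300
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, 6))
X[:, 0] += 3.0 * y
d_star, err = cv_choose_dim(X, y, candidates=[1, 2, 3])
```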

Table 3 shows the average and standard deviation of misclassification rates over 50 runs, indicating that LSGDR compares favorably with the other methods.

Table 3:
Misclassification Rates.
| Data set | LSGDR | dMAVE | LSDR | gKDR | No Reduc. |
|---|---|---|---|---|---|
| Australian | 14.780(1.596) | 14.604(1.431) | 15.576(1.538) | 14.751(1.662) | 14.482(1.271) |
| Breast-cancer | 4.010(2.567) | 3.802(1.000) | 4.214(0.970) | 3.546(1.236) | 3.332(0.843) |
| COD-RNA | 6.494(1.254) | 6.540(1.197) | 6.838(0.945) | 6.388(0.979) | 6.170(0.958) |
| Diabetes | 25.134(2.014) | 25.729(2.320) | 26.243(3.107) | 24.996(2.274) | 24.415(1.700) |
| Heart | 18.871(3.763) | 20.318(4.146) | 19.906(3.913) | 19.647(4.016) | 18.106(3.455) |
| Liver-disorders | 31.918(3.457) | 31.262(4.320) | 33.928(4.110) | 31.149(3.547) | 30.718(3.301) |
| SUSY | 24.366(1.518) | 24.818(2.301) | 26.160(3.079) | 25.538(2.876) | 24.286(1.677) |
| Shuttle | 0.700(1.363) | 1.266(0.789) | 1.836(1.215) | 1.240(1.091) | 1.544(0.731) |

Notes: Averages and standard deviations of misclassification rates over 50 runs. "No Reduc." means the results without dimension reduction. The best and comparable methods according to the t-test at the significance level are in bold.

## 6  Conclusion

The main contribution of this letter is a novel estimator for the gradients of logarithmic conditional densities, which improves the performance of SDR methods. The proposed gradient estimator admits a closed-form solution that can be computed efficiently, and a model selection method is also available. Based on the proposed gradient estimator, we developed an SDR method that relies on eigendecomposition. Our theoretical analysis showed that the proposed estimator and our SDR method converge to the optimal solutions at the optimal rate under a parametric setting. We experimentally demonstrated that the proposed SDR method works well on a variety of data sets.
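The eigendecomposition step of the SDR method can be sketched in a few lines of numpy. The gradients below are synthetic stand-ins confined to a known subspace (an assumption for illustration); in LSGDR they would come from the closed-form least-squares estimator of the log-conditional-density gradient.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 500, 6, 2
# Synthetic stand-in gradients lying exactly in an m-dimensional subspace.
B_true = np.linalg.qr(rng.normal(size=(d, m)))[0]
G = rng.normal(size=(n, m)) @ B_true.T   # each row is one gradient estimate

S = G.T @ G / n                          # averaged outer product of gradients
eigval, eigvec = np.linalg.eigh(S)       # eigenvalues in ascending order
B_hat = eigvec[:, -m:]                   # top-m eigenvectors span the subspace

# Subspace recovery error, compared via projection matrices.
err = np.linalg.norm(B_hat @ B_hat.T - B_true @ B_true.T)
```

Because the stand-in gradients lie exactly in the target subspace, the top-m eigenvectors recover it up to numerical precision; with estimated gradients, the recovery error would instead shrink at the parametric rate analyzed in the letter.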

## Notes

1

The output is continuous in regression, while it is categorical in classification.

2

In principle, model selection can be performed by cross-validation (CV) over a successive predictor. However, this should be avoided in practice for two reasons. First, when CV is applied, one should optimize both parameters in an SDR method and hyperparameters in the predictor. This procedure results in a nested CV, which is computationally quite inefficient. Second, features extracted based on CV are no longer independent of predictors, which is not preferable in terms of interpretability (Suzuki & Sugiyama, 2013).

3

Setting the minimum candidate of the bandwidth at some large value can be justified as follows. First, log densities tend to prefer large bandwidth values because of the logarithm. Second, if the output is completely independent of the input, the true gradient is zero. Thus, the ideal estimate is zero, which can be achieved in two ways; to exclude the second possibility, we set the minimum candidate of the bandwidth at some large value.

9

We subsampled only the Physicochem data set, extracting the first 1000 samples, because it is too large.

10

For the “shuttle” data set, we used only data samples in classes 1 and 4.

11

We employed the Matlab software for SVM called LIBSVM (Chang & Lin, 2011).

## Appendix A:  Proof of Theorem 1

Here, we prove theorem 1. Our proof essentially follows that of Shiino, Sasaki, Niu, and Sugiyama (2016) and proceeds in three steps.

### A.1  Step 1: Establishment of the Growth Condition.

We recall that
and define
With the following lemma, we establish the second-order growth condition (Bonnans & Shapiro, 1998; see definition 6.1):
Lemma 1.
Letting be the smallest eigenvalue of , the following growth condition holds:
where .
Proof. Taylor's theorem (Nocedal & Wright, 1999, theorem 2.1) gives
where is the Hessian matrix of and lies between and . Since ,
where the optimality condition was applied.

### A.2  Step 2: Stability Analysis.

Here, we provide stability analysis of around . We define a set of perturbation parameters by
where denotes the cone of by symmetric positive semidefinite matrices. Then a perturbed version of is given by
A stability property of around is characterized by the following lemma:
Lemma 2.
The difference function is Lipschitz continuous with modulus
on a sufficiently small neighborhood of .
Proof. The gradient of the difference function is computed as
Because of the regularization, for . Given a -ball of , which is defined by , the following inequality for holds:
This inequality provides
The above inequality states that the norm of the gradient is bounded with an order . Thus, the difference function is Lipschitz continuous on with a Lipschitz constant of the same order.

### A.3  Step 3: A Convergence Rate of LSLCG.

Let us recall that
Based on lemmas 1 and 2 and proposition 6.1 in Bonnans and Shapiro (1998),
is the minimizer of with respect to when , and .
The central limit theorem (CLT) asserts that . For ,
The convergence rate of (A) is because of CLT. However, (B) is not as straightforward as (A). Here, we further decompose (B) into
(D) is clearly . As proved in the supplementary material for Sasaki et al. (2016), converges to at any with , which implies that (C) is also . As a result, . We have already assumed that . Hence, as ,
We finally establish a convergence rate of LSLCG. The Cauchy–Schwarz inequality gives
Since all elements in are uniformly bounded,
Hence, theorem 1 is proved.

## Appendix B:  Proof of Theorem 2

Following the proof in the supplementary material of Sasaki et al. (2016), we prove theorem 2.

First, to bound , we define a relay matrix from to as
This relay matrix gives the following inequality:
B.1
The first term on the right-hand side of equation B.1 converges in according to theorem 1, and the second term converges at the same order by the CLT. Thus,
B.2
Next, we derive a probabilistic order of . Let be the orthogonal projector onto the eigenspace of associated with (). We recall that are the disjoint eigenvalues of such that is the largest eigenvalue counted with multiplicity. A perturbation matrix is denoted by . According to lemma 5.2 and the proof of lemma 5.3 in Koltchinskii and Giné (2000), whenever , we obtain
where denotes the orthogonal projector onto the eigenspace of . Since and are orthogonal matrices, and . Thus, equation B.2 gives

## Acknowledgments

G. N. acknowledges support from JST CREST JPMJCR1403.

## References

Bache, K., & Lichman, M. (2013). UCI machine learning repository. Irvine: University of California, Irvine. http://archive.ics.uci.edu/ml/

Bonnans, F., & Shapiro, A. (1998). Optimization problems with perturbations, a guided tour. SIAM Review, 40(2), 228–264.

Chang, C., & Lin, C. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2, 27:1–27:27. (Software available at http://www.csie.ntu.edu.tw/∼cjlin/libsvm)

Cox, D. D. (1985). A penalty method for nonparametric estimation of the logarithmic derivative of a density function. Annals of the Institute of Statistical Mathematics, 37(1), 271–288.

Dennis Cook, R. (2000). SAVE: A method for dimension reduction and graphics in regression. Communications in Statistics—Theory and Methods, 29(9–10), 2109–2121.

Fan, J., & Gijbels, I. (1996). Local polynomial modelling and its applications. Boca Raton, FL: CRC Press.

Fukumizu, K., Bach, F. R., & Jordan, M. I. (2004). Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5, 73–99.

Fukumizu, K., Bach, F. R., & Jordan, M. I. (2009). Kernel dimension reduction in regression. Annals of Statistics, 37(4), 1871–1905.

Fukumizu, K., & Leng, C. (2014). Gradient-based kernel dimension reduction for regression. Journal of the American Statistical Association, 109(505), 359–370.

Gretton, A., Fukumizu, K., Teo, C., Song, L., Schölkopf, B., & Smola, A. (2007). A kernel statistical test of independence. In J. C. Platt, D. Koller, Y. Singer, & S. Roweis (Eds.), Advances in neural information processing systems, 20 (pp. 585–592). Cambridge, MA: MIT Press.

Hristache, M., Juditsky, A., Polzehl, J., & Spokoiny, V. (2001). Structure adaptive approach for dimension reduction. Annals of Statistics, 29(6), 1537–1566.

Koltchinskii, V., & Giné, E. (2000). Random matrix approximation of spectra of integral operators. Bernoulli, 6(1), 113–167.

Li, K. (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414), 316–327.

Li, K. (1992). On principal Hessian directions for data visualization and dimension reduction: Another application of Stein's lemma. Journal of the American Statistical Association, 87(420), 1025–1039.

Nocedal, J., & Wright, S. (1999). Numerical optimization. Berlin: Springer-Verlag.

Samarov, A. M. (1993). Exploring regression structure using nonparametric functional estimation. Journal of the American Statistical Association, 88(423), 836–847.

Sasaki, H., Hyvärinen, A., & Sugiyama, M. (2014). Clustering via mode seeking by direct estimation of the gradient of a log-density. In Proceedings of the Machine Learning and Knowledge Discovery in Databases Part III—European Conference, ECML/PKDD 2014 (vol. 8726, pp. 19–34). Berlin: Springer.

Sasaki, H., Niu, G., & Sugiyama, M. (2016). Non-gaussian component analysis with log-density gradient estimation. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (vol. 51, pp. 1177–1185).

Sasaki, H., Tangkaratt, V., & Sugiyama, M. (2015). Sufficient dimension reduction via direct estimation of the gradients of logarithmic conditional densities. In Proceedings of the 7th Asian Conference on Machine Learning (vol. 45, pp. 33–48).

Schölkopf, B., & Smola, A. (2001). Learning with kernels: Support vector machines, regularization, optimization, and beyond. Cambridge, MA: MIT Press.

Shiino, H., Sasaki, H., Niu, G., & Sugiyama, M. (2016). Whitening-free least-squares non-gaussian component analysis. arXiv:1603.01029.

Silverman, B. (1986). Density estimation for statistics and data analysis. Boca Raton, FL: CRC Press.

Suzuki, T., & Sugiyama, M. (2013). Sufficient dimension reduction via squared-loss mutual information estimation. Neural Computation, 25(3), 725–758.

Tangkaratt, V., Sasaki, H., & Sugiyama, M. (2017). Direct estimation of the derivative of quadratic mutual information with application in supervised dimension reduction. Neural Computation, 29(8), 2076–2122.

Tangkaratt, V., Xie, N., & Sugiyama, M. (2015). Conditional density estimation with dimensionality reduction via squared-loss conditional entropy minimization. Neural Computation, 27(1), 228–254.

Xia, Y. (2007). A constructive approach to the estimation of dimension reduction directions. Annals of Statistics, 35(6), 2654–2690.

Xia, Y., Tong, H., Li, W. K., & Zhu, L. X. (2002). An adaptive estimation of dimension reduction space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(3), 363–410.