## Abstract

Sufficient dimension reduction (SDR) aims to find a low-rank projection matrix in the input space such that information about output data is maximally preserved. Among various approaches to SDR, a promising method is based on the eigendecomposition of the outer product of the gradient of the conditional density of output given input. In this letter, we propose a novel estimator of the gradient of the logarithmic conditional density that directly fits a linear-in-parameter model to the true gradient under the squared loss. Thanks to this simple least-squares formulation, the solution can be computed efficiently in closed form. We then develop a new SDR method based on the proposed gradient estimator. We theoretically prove that the proposed gradient estimator, as well as the SDR solution obtained from it, achieves the optimal parametric convergence rate. Finally, we experimentally demonstrate that our SDR method compares favorably with existing approaches in both accuracy and computational efficiency on a variety of artificial and benchmark data sets.

## 1 Introduction

Sufficient dimension reduction (SDR) is a well-established framework for supervised linear dimension reduction that aims to find a projection matrix in the input space so that information about output data is maximally preserved in the subspace. An influential work on SDR is sliced inverse regression (Li, 1991), which employs the inverse regression function to obtain the projection matrix. Since then, a number of SDR methods, such as the principal Hessian direction (Li, 1992) and sliced average variance estimation (Dennis Cook, 2000), have been developed along the same line. However, these methods commonly require the strong assumption that the probability density function of the input data is elliptical, which is often not fulfilled in practice.

To overcome this limitation, various SDR methods based on conditional independence of output data given projected input data have been proposed and demonstrated to work well. For example, kernel dimension reduction (Fukumizu, Bach, & Jordan, 2004, 2009) directly evaluates the conditional independence based on kernel methods; least-squares dimension reduction (Suzuki & Sugiyama, 2013) evaluates the conditional independence through least-squares estimation of mutual information; and other related SDR methods nonparametrically estimate the conditional density of output data given projected input data (Tangkaratt, Xie, & Sugiyama, 2015; Xia, 2007). Furthermore, a supervised linear dimension reduction method robust against outliers was recently proposed based on direct estimation of the derivative of the quadratic mutual information (Tangkaratt, Sasaki, & Sugiyama, 2017). However, a common drawback is that these methods require solving nonconvex optimization problems. Thus, when gradient-based optimization methods are employed, these SDR methods can be computationally expensive and may get stuck in bad local optima.

Another line of SDR research, which is computationally efficient and avoids bad local optima, is based on the eigendecomposition of the outer product of the gradient of the conditional expectation of output data given input data (Hristache, Juditsky, Polzehl, & Spokoiny, 2001; Samarov, 1993; Xia, Tong, Li, & Zhu, 2002). However, since this approach fulfills only a necessary condition for SDR, the obtained solution is not sufficient in general. To cope with this problem, a modified SDR method based on the gradient of the conditional density of output given input, which satisfies a sufficient condition for SDR, was developed (Xia, 2007). However, since the gradient is estimated by the local linear smoother (Fan & Gijbels, 1996), this method is computationally expensive for large data sets, performs poorly when the input or subspace dimensionality is high, and makes model selection cumbersome in practice.

In this letter, we propose a novel estimator of the gradient of the logarithmic conditional density of output given input that can overcome these weaknesses. The essential idea of our method is to directly fit a linear-in-parameter model to the true gradient under the squared loss, which allows efficient computation of the solution in closed form and straightforward model selection by cross-validation. We then develop an SDR method based on the eigendecomposition of the outer product of our gradient estimates. We theoretically prove that our gradient estimator and SDR method asymptotically provide the optimal solutions at the optimal parametric convergence rate. Finally, we experimentally demonstrate that our proposed SDR method is more accurate and computationally efficient than existing methods on a variety of data sets.

The rest of this letter is organized as follows: Section 2 mathematically formulates the problem of SDR and reviews gradient-based SDR methods. Section 3 proposes a novel estimator of the gradient of the logarithmic conditional density and develops an SDR method based on it. Section 4 theoretically investigates the properties of the proposed gradient estimator and SDR method. Section 5 experimentally evaluates the performance of the proposed SDR method on a variety of data sets. Section 6 concludes the letter. Our preliminary results were presented at ACML 2015 (Sasaki, Tangkaratt, & Sugiyama, 2015), but here we additionally perform theoretical analysis of the proposed gradient estimator and SDR method, extend the proposed SDR method to classification, and add more experiments.

## 2 Review of Existing Methods

In this section, we formulate the problem of SDR and review existing gradient-based SDR methods.

### 2.1 Problem Formulation

and denotes the transpose. Further, we assume that the input dimensionality is large, but the “intrinsic” dimensionality of , which is denoted by , is rather small. The goal of SDR is to estimate a low-rank projection matrix from so that the following SDR condition is satisfied: where , denotes the by identity matrix and . Throughout this letter, we denote the subspace spanned by the columns of by .
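As a concrete illustration of the SDR condition, the following sketch generates toy data in which the output depends on the input only through a one-dimensional projection, so that the conditional density of output given input coincides with that given the projected input. The dimensions, link function, and noise level here are illustrative choices, not the letter's models.

```python
import numpy as np

# Toy data satisfying the SDR condition: y depends on the d-dimensional
# input x only through the m-dimensional projection B^T x.
rng = np.random.default_rng(0)
d, m, n = 5, 1, 300
B = np.zeros((d, m))
B[0, 0] = 1.0                                  # subspace = first input axis
X = rng.normal(size=(n, d))                    # input samples
z = X @ B                                      # projected input B^T x
y = np.sin(z[:, 0]) + 0.1 * rng.normal(size=n) # output depends on x via z only
```

Here any information in the remaining d − m directions is irrelevant to y, which is exactly what an SDR method should detect.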

### 2.2 Gradient-Based Approach in SDR


Given equation 2.5, the technical challenge is to estimate the gradient . The method of the outer product of gradients based on conditional density functions (dOPG) (Xia, 2007) nonparametrically estimates the gradient by the local linear smoother (LLS) (Fan & Gijbels, 1996), which is briefly reviewed below.
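For intuition, the local linear smoother amounts to a kernel-weighted least-squares fit around a query point, where the slope of the local fit serves as a gradient estimate. The sketch below applies LLS to a conditional mean for simplicity (dOPG applies the same machinery to conditional density functions); the function name, Gaussian weight, bandwidth, and tiny ridge term are illustrative assumptions.

```python
import numpy as np

def lls_gradient(X, y, x0, bandwidth=1.0):
    """Local linear smoother: weighted least-squares fit of
    y ~ a + b^T (x - x0) around the query point x0. The slope b
    estimates the gradient of E[y|x] at x0."""
    d = X - x0                                        # (n, p) displacements
    w = np.exp(-0.5 * np.sum(d**2, axis=1) / bandwidth**2)  # kernel weights
    Z = np.hstack([np.ones((len(X), 1)), d])          # design matrix [1, x - x0]
    WZ = Z * w[:, None]
    # solve the weighted normal equations (tiny ridge for stability)
    beta = np.linalg.solve(Z.T @ WZ + 1e-8 * np.eye(Z.shape[1]), WZ.T @ y)
    return beta[0], beta[1:]                          # local intercept, gradient
```

Because a separate weighted fit is needed at every evaluation point, the cost grows quickly with the sample size, which is one of the drawbacks discussed below.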

The dOPG algorithm is summarized in algorithm 1. As done in Hristache et al. (2001), to improve the performance of dOPG, an iterative method is adopted by setting the weight function as . Furthermore, a trimming function is introduced to handle problematic boundary points. However, dOPG has three drawbacks:

In the Taylor approximation, equation 2.6, the proximity requirement holds when data are dense. Thus, dOPG might not perform well when data are sparsely distributed. This is particularly problematic when the input dimensionality or intrinsic dimensionality is relatively high.

dOPG can be computationally expensive for large data sets because the number of elements in is .

The parameter values of and are determined based on the normal reference rule for nonparametric conditional density estimation (Fan & Gijbels, 1996; Silverman, 1986). Therefore, when the density is far from normal, this parameter selection method may not work well.

To improve performance in gradient estimation, we propose a novel estimator for the gradient of the logarithmic conditional density. The proposed estimator does not rely on the Taylor approximation and includes a cross-validation procedure for parameter selection. In addition, with an integer less than or equal to , the number of parameters in the estimator is , which is much smaller than in LLS. We then develop a gradient-based SDR method based on our gradient estimator.

## 3 Least-Squares Logarithmic Conditional Density Gradients

### 3.1 The Estimator

Unlike the second term, integration by parts does not help us estimate the third term. One difficulty is that the term includes the true partial derivative of the log density . Instead, we estimate the third term by employing a nonparametric plug-in estimator for the log-density derivative, called *least-squares log-density gradients* (LSLDG) (Cox, 1985; Sasaki, Hyvärinen, & Sugiyama, 2014). LSLDG directly estimates log-density derivatives without going through density estimation. The solution of LSLDG can be computed efficiently in a closed form, and LSLDG includes a cross-validation procedure for model selection. Furthermore, it has been experimentally shown to work better than a gradient estimator based on kernel density estimation. Therefore, LSLDG should accurately and efficiently approximate the third term.
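An LSLDG-style estimator can be sketched as follows: a Gaussian linear-in-parameter model is fit to each partial derivative of the log density under the squared loss, and integration by parts turns the unknown cross term into an empirical average of basis derivatives, giving a closed-form solution without ever estimating the density itself. The function name, basis choice, and default parameter values are illustrative assumptions, not the letter's exact specification.

```python
import numpy as np

def lsldg(X, centers, sigma=1.0, lam=1e-3):
    """Least-squares log-density gradients (sketch): fit
    g_j(x) = theta_j^T psi(x) to d/dx_j log p(x) in closed form,
    with Gaussian basis psi_k(x) = exp(-||x - c_k||^2 / (2 sigma^2)).
    Minimizing the empirical squared loss (after integration by parts)
    theta^T G theta + 2 theta^T h + lam ||theta||^2 gives
    theta = -(G + lam I)^{-1} h."""
    n, d = X.shape
    b = centers.shape[0]
    diff = X[:, None, :] - centers[None, :, :]               # (n, b, d)
    psi = np.exp(-0.5 * np.sum(diff**2, axis=2) / sigma**2)  # (n, b) basis values
    G = psi.T @ psi / n                                      # (b, b) Gram-type matrix
    Theta = np.empty((b, d))
    for j in range(d):
        dpsi_j = -(diff[:, :, j] / sigma**2) * psi           # d psi_k / d x_j
        h = dpsi_j.mean(axis=0)                              # empirical cross term
        Theta[:, j] = -np.linalg.solve(G + lam * np.eye(b), h)
    return Theta, psi  # gradient estimate at the samples: psi @ Theta
```

For a standard normal density the true log-density gradient is −x, so the fitted values should closely track the negated inputs.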

*least-squares logarithmic conditional density gradients* (LSLCG), is obtained as

### 3.2 Model Selection by Cross-Validation

The performance of LSLCG depends on the choice of models, which are the gaussian bandwidth parameters, and , and the regularization parameter in the current setup. We perform model selection by cross-validation as follows:

Step 1: Divide the samples into disjoint subsets .

Step 3: Choose the model that minimizes .
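The steps above can be sketched as a grid search over candidate (bandwidth, regularization) pairs, fitting on the training folds and scoring each held-out fold with the same squared-loss criterion (up to a constant). This one-dimensional illustration uses hypothetical function names and candidate grids.

```python
import numpy as np
from itertools import product

def _features(x, centers, sigma):
    # Gaussian basis values and their derivatives at the points x (1-D).
    d = x[:, None] - centers[None, :]
    psi = np.exp(-0.5 * d**2 / sigma**2)
    dpsi = -(d / sigma**2) * psi
    return psi, dpsi

def cv_select(x, centers, sigmas, lams, n_folds=5, seed=0):
    """K-fold cross-validation over candidate (sigma, lambda) pairs.
    The hold-out score theta^T G theta + 2 theta^T h estimates the
    squared loss up to a model-independent constant."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), n_folds)
    best, best_score = None, np.inf
    for sigma, lam in product(sigmas, lams):
        score = 0.0
        for k in range(n_folds):
            tr = np.concatenate([f for j, f in enumerate(folds) if j != k])
            psi, dpsi = _features(x[tr], centers, sigma)
            G = psi.T @ psi / len(tr)
            h = dpsi.mean(axis=0)
            theta = -np.linalg.solve(G + lam * np.eye(len(centers)), h)
            psi_te, dpsi_te = _features(x[folds[k]], centers, sigma)
            # held-out squared-loss criterion (up to a constant)
            score += theta @ (psi_te.T @ psi_te / len(folds[k])) @ theta \
                     + 2 * theta @ dpsi_te.mean(axis=0)
        if score < best_score:
            best, best_score = (sigma, lam), score
    return best
```

Because the criterion is evaluated on held-out data, the selected model does not overfit the training folds.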

### 3.3 Illustration of Gradient Estimation

- **LSLCG**: The proposed estimator for the gradient of the logarithmic conditional density. We fix at , which denotes the median value of with respect to and , but and are cross-validated as in section 3.2: were selected from 10 different values of (), where is the median value of with respect to and , and with 10 different values of from to .
- **KDE**: The gradient of the logarithmic conditional density was estimated through kernel density estimation (KDE). First, with the gaussian kernel for KDE, the conditional density was estimated by where the bandwidth parameters and were independently cross-validated with respect to the log likelihood of the kernel density estimates of and by employing and , respectively. We selected 10 candidates of for and for , where and is the median value of with respect to , , and . Then the log-conditional gradients were computed from .
- **LLS**: The gradient of the logarithmic conditional density was estimated through LLS (Fan & Gijbels, 1996). A gradient estimate for the log-conditional density at is given by the ratio because and in equation 2.7 correspond to estimates of the conditional density and its gradient, respectively. We set and , where was cross-validated with respect to the log likelihood of the kernel density estimate of by employing as the kernel function; similarly, was also cross-validated. We used the same candidates of and for and , respectively.

The result is presented in Figure 1. As the dimensionality of input data increases, KDE produces larger errors (see Figure 1a). A possible reason is that a good density estimator does not necessarily mean a good density gradient estimator. Thus, estimating density derivatives via density estimation is not a good approach, as previously demonstrated in (nonconditional) log-density-derivative estimation (Sasaki et al., 2014). LLS also does not work well for high-dimensional data, while the errors of LSLCG increase much more mildly. In addition, Figure 1b shows that the errors of LSLCG quickly decrease as the sample size increases. Thus, our approach of fitting a model directly to the true gradient is promising.

### 3.4 Least-Squares Gradients for Dimension Reduction

Following the gradient-based approach, we propose a new SDR method in algorithm 2. The proposed SDR method can be interpreted as a simpler version of the dOPG in algorithm 1 without iterative steps and the trimming function. We call this method *least-squares gradients for dimension reduction* (LSGDR).
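The eigendecomposition step of this gradient-based approach can be sketched as follows, assuming the gradients of the logarithmic conditional density have already been estimated at each sample and are passed in as rows of a matrix; the function name is illustrative.

```python
import numpy as np

def sdr_from_gradients(G_hat, m):
    """Given estimated gradients at the n samples (rows of G_hat,
    shape (n, d)), recover the projection matrix as the top-m
    eigenvectors of the averaged outer product of the gradients."""
    M = G_hat.T @ G_hat / len(G_hat)   # (d, d) averaged outer product
    evals, evecs = np.linalg.eigh(M)   # eigenvalues in ascending order
    return evecs[:, -m:]               # top-m eigenvectors, shape (d, m)
```

Since the gradients concentrate in the informative subspace, the leading eigenvectors of the averaged outer product span an estimate of that subspace, with no iterative optimization involved.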

## 4 Theoretical Analysis

This section provides a theoretical investigation of the behavior of LSLCG and LSGDR.

As proved in the supplementary material of Sasaki, Niu, and Sugiyama (2016), the plug-in LSLDG estimator achieves the optimal parametric rate, , where denotes the probabilistic order, and thus, without loss of generality, we simply assume that converges to in this order for all . Under this setting, we obtain the following theorem:

The proof of theorem 1 is in appendix A. Theorem 1 asserts that LSLCG is consistent with the optimal estimate and also achieves the optimal parametric convergence rate in a standard setting.

Appendix B includes the proof of theorem 2. As with LSLCG, LSGDR also provides a consistent estimate, and its convergence rate is optimal under the parametric setting.

## 5 Numerical Experiments

In this section, we experimentally investigate the performance of our SDR method on both artificial and benchmark data sets and compare our method with existing methods.

### 5.1 Illustration of Dimension Reduction on Artificial Data

First, we illustrate the performance of the proposed SDR method using artificial data, comparing the following SDR methods:

- **LSGDR**: The proposed method following the gradient-based approach. Here, we fix at but performed five-fold cross-validation for and as in section 3.2. were selected from 10 different values of (),^{3} and with 10 different values of from to 1. We recall that and are the median values of and with respect to and , respectively.
- **dMAVE** (Xia, 2007):^{4} To estimate , this method solves a nonconvex optimization problem. To avoid bad local optima, an initial estimate of is given by the first iteration of algorithm 1, the dOPG algorithm.
- **LSDR** (Suzuki & Sugiyama, 2013):^{5} To estimate , this method solves a nonconvex optimization problem. To avoid bad local optima, multiple point search with 10 different initial values is performed, and the best solution is chosen with respect to a criterion. LSDR also performs five-fold cross-validation for model selection, and we set the same number of parameter candidates as LSGDR.
- **gKDR** (Fukumizu & Leng, 2014):^{6} This method follows a gradient-based approach as reviewed in section 2.2. The gaussian kernels were used for and . As in LSGDR, we fix the width parameter in at , and the width parameter in and the regularization parameter are determined by five-fold cross-validation based on a nearest-neighbor regression / classification method where is fixed at 5. We set the parameter candidates by following Fukumizu and Leng (2014). The 10 candidates for the width in were given by (), where is the median of (Gretton et al., 2007), while with 10 different values of from to . The total number of parameter candidates is the same as LSGDR.
- **dOPG** (Xia, 2007):^{7} This method follows a gradient-based approach as reviewed in section 2.2.

#### 5.1.1 Dimension-Scalability and Computational Efficiency

As illustrated in Figures 2a and 2b, the performance of LSGDR is the best over a wide range of data and intrinsic dimensions. As the intrinsic dimension increases, dMAVE, LSDR, and dOPG do not work well. For LSDR, good initialization would be difficult when the intrinsic dimension is high, and dMAVE is initialized based on dOPG, which also performs poorly. The unsatisfactory performance of dOPG implies that it is the intrinsic dimension, rather than the original data dimension , that makes estimation harder. gKDR works reasonably well even for data with relatively high . However, as the data dimension increases, gKDR also produces large errors. A possible reason is that model selection in gKDR might be more difficult for higher-dimensional data.

Figure 3 reveals that LSGDR is a computationally efficient method. Since dMAVE and LSDR make use of the projected data , their computation costs increase as the intrinsic dimensionality grows. On the other hand, does not strongly affect the computation costs of gKDR, dOPG, and LSGDR, which use the original (nonprojected) input data. Instead, when the data dimension increases, the computation costs of gKDR, dOPG, and LSGDR increase (see Figure 3d). Note that the computation costs of dOPG and LSGDR increase more mildly. This is because LSGDR uses only basis functions in the linear-in-parameter model, equation 3.4, and dOPG does not perform cross-validation. In addition, the sample size increases the computation costs of dMAVE, dOPG, and gKDR (see Figure 3d). Both dMAVE and dOPG employ LLS, which has parameters to estimate, and gKDR has to compute the inverse of an matrix in equation 2.4. For gKDR, the computation cost can be decreased by reducing the size of the Gram matrices, and , and the number of centers in the kernel function, as done in LSGDR. However, this would be inappropriate: when estimating , the sample average in equation 2.4 is taken over only the input data , and reducing the size of the Gram matrix is equivalent to discarding a number of output samples . Thus, the performance of gKDR could worsen. In contrast, LSGDR employs the sample average of the outer product over both input and output data and would work well with a smaller number of centers in the basis functions.

#### 5.1.2 Illustration on a Variety of Artificial Data Sets

Here, we generated data according to the following models, all of which were adopted from the articles on dMAVE, LSDR, gKDR, and dOPG:

- Model a (Xia, 2007): where denotes the sign function, , , and denotes the normal density with the mean and covariance matrix . The first four elements in are all 0.5, while the others are zeros. For , the first four elements are 0.5, , 0.5, and , respectively, and the others are zeros. The optimal projection matrix is .
- Model b (Xia, 2007): where , , , , and denotes the uniform density on . The optimal projection matrix is .
- Model g (Fukumizu & Leng, 2014): where , , , , and denotes the Gamma density with the shape parameter and scale parameter . The optimal projection matrix is .

The results are summarized in Table 1. Table 1 indicates that, for models a to c, LSGDR performs best or is comparable to the best method in terms of the estimation error. Since models a, b, and c are complex, LSGDR is a promising method for various kinds of data. In terms of computation cost, LSGDR is more advantageous than dOPG when the sample size is large, as reviewed in section 2.2. For models d and e, LSGDR is the most accurate method, and dOPG is the most computationally efficient. Unlike LSGDR, LSDR, and gKDR, the parameters in dOPG are not cross-validated, and thus it should be efficient when the sample size is small. However, the computation costs of LSGDR and gKDR are not so expensive. Overall, LSGDR also performs well on data drawn from nongaussian densities (see Table 1).

| | LSGDR | dMAVE | LSDR | gKDR | dOPG |
|---|---|---|---|---|---|
| **Model a:** | | | | | |
| Error | 0.364(0.117) | 0.657(0.177) | 0.655(0.104) | 0.670(0.123) | 0.973(0.217) |
| Time | 0.893(0.084) | 1.546(0.096) | 1.290(0.100) | 2.342(0.070) | 0.417(0.085) |
| **Model b:** | | | | | |
| Error | 0.303(0.095) | 0.280(0.061) | 0.388(0.085) | 0.424(0.154) | 0.322(0.069) |
| Time | 0.719(0.067) | 1.600(0.160) | 1.249(0.095) | 2.665(0.099) | 0.839(0.074) |
| **Model c:** | | | | | |
| Error | 0.205(0.257) | 0.286(0.060) | 1.666(0.186) | 0.623(0.159) | 0.811(0.364) |
| Time | 0.746(0.099) | 1.871(0.189) | 1.836(0.137) | 2.796(0.132) | 1.042(0.125) |
| **Model d:** | | | | | |
| Error | 0.100(0.114) | 0.223(0.115) | 0.378(0.262) | 0.382(0.253) | 0.218(0.112) |
| Time | 0.057(0.061) | −0.763(0.043) | 0.709(0.059) | 0.449(0.029) | 1.088(0.048) |
| **Model e:** | | | | | |
| Error | 0.193(0.143) | 0.368(0.098) | 0.594(0.198) | 0.676(0.317) | 0.619(0.162) |
| Time | 0.442(0.041) | 0.032(0.015) | 0.836(0.064) | 0.962(0.026) | 0.646(0.021) |
| **Model f:** | | | | | |
| Error | 0.124(0.099) | 0.097(0.029) | 0.086(0.021) | 0.126(0.030) | 0.068(0.015) |
| Time | 0.640(0.106) | 0.626(0.135) | 1.078(0.113) | 1.910(0.073) | 0.009(0.099) |
| **Model g:** | | | | | |
| Error | 0.292(0.477) | 1.022(0.339) | 0.762(0.354) | 0.757(0.356) | 0.999(0.379) |
| Time | 0.724(0.075) | 1.386(0.120) | 1.516(0.088) | 2.512(0.079) | 0.621(0.084) |
| **Model h:** | | | | | |
| Error | 0.044(0.027) | 0.150(0.043) | 0.234(0.066) | 0.184(0.068) | 0.234(0.063) |
| Time | 0.726(0.053) | 1.380(0.119) | 1.074(0.070) | 2.509(0.071) | 0.599(0.054) |


Notes: Averages and standard deviations of estimation errors and CPU time over 50 runs. The numbers in the parentheses are standard deviations. CPU time is displayed in logarithmic scale. The best and comparable methods judged by the -test at the significance level are in bold.

| | LSGDR | dMAVE | LSDR | gKDR | No Reduc. |
|---|---|---|---|---|---|
| **White wine ()** | | | | | |
| 0 | 0.840(0.012) | 0.846(0.014) | 0.847(0.013) | 0.843(0.012) | 0.836(0.008) |
| 2 | 0.844(0.011) | 0.851(0.012) | 0.856(0.015) | 0.849(0.014) | 0.844(0.009) |
| 4 | 0.846(0.010) | 0.862(0.014) | 0.865(0.012) | 0.858(0.015) | 0.850(0.009) |
| 6 | 0.848(0.012) | 0.866(0.015) | 0.879(0.018) | 0.860(0.013) | 0.856(0.011) |
| **Red wine ()** | | | | | |
| 0 | 0.810(0.015) | 0.811(0.019) | 0.808(0.019) | 0.809(0.016) | 0.804(0.015) |
| 2 | 0.816(0.016) | 0.819(0.017) | 0.820(0.018) | 0.813(0.017) | 0.813(0.014) |
| 4 | 0.816(0.014) | 0.826(0.015) | 0.823(0.016) | 0.821(0.016) | 0.825(0.012) |
| 6 | 0.815(0.013) | 0.832(0.014) | 0.831(0.015) | 0.824(0.016) | 0.828(0.012) |
| **Housing ()** | | | | | |
| 0 | 0.456(0.047) | 0.436(0.039) | 0.467(0.054) | 0.428(0.043) | 0.442(0.045) |
| 2 | 0.462(0.043) | 0.465(0.041) | 0.483(0.052) | 0.457(0.040) | 0.463(0.046) |
| 4 | 0.461(0.042) | 0.461(0.043) | 0.487(0.042) | 0.455(0.046) | 0.467(0.043) |
| 6 | 0.463(0.044) | 0.493(0.041) | 0.510(0.049) | 0.484(0.038) | 0.521(0.042) |
| **Concrete ()** | | | | | |
| 0 | 0.416(0.019) | 0.420(0.020) | 0.441(0.023) | 0.404(0.021) | 0.428(0.015) |
| 2 | 0.424(0.024) | 0.437(0.025) | 0.446(0.021) | 0.413(0.023) | 0.467(0.015) |
| 4 | 0.419(0.023) | 0.447(0.024) | 0.457(0.023) | 0.440(0.026) | 0.508(0.017) |
| 6 | 0.420(0.021) | 0.457(0.021) | 0.459(0.022) | 0.454(0.023) | 0.545(0.018) |
| **Yacht ()** | | | | | |
| 0 | 0.122(0.017) | 0.160(0.042) | 0.165(0.063) | 0.139(0.028) | 0.485(0.035) |
| 2 | 0.123(0.023) | 0.176(0.043) | 0.158(0.045) | 0.204(0.059) | 0.577(0.043) |
| 4 | 0.122(0.016) | 0.202(0.078) | 0.162(0.046) | 0.257(0.058) | 0.624(0.037) |
| 6 | 0.124(0.016) | 0.217(0.057) | 0.180(0.059) | 0.285(0.057) | 0.660(0.042) |
| **Auto MPG ()** | | | | | |
| 0 | 0.394(0.032) | 0.381(0.028) | 0.378(0.030) | 0.373(0.023) | 0.365(0.023) |
| 2 | 0.382(0.033) | 0.387(0.025) | 0.387(0.023) | 0.383(0.027) | 0.389(0.023) |
| 4 | 0.386(0.030) | 0.394(0.022) | 0.394(0.025) | 0.394(0.027) | 0.426(0.023) |
| 6 | 0.384(0.029) | 0.391(0.024) | 0.398(0.032) | 0.389(0.024) | 0.430(0.023) |
| **Physicochem ()** | | | | | |
| 0 | 0.827(0.025) | 0.812(0.027) | 0.808(0.026) | 0.802(0.023) | 0.801(0.024) |
| 2 | 0.831(0.023) | 0.825(0.028) | 0.825(0.026) | 0.827(0.026) | 0.836(0.020) |
| 4 | 0.840(0.021) | 0.835(0.024) | 0.840(0.023) | 0.839(0.025) | 0.845(0.021) |
| 6 | 0.837(0.023) | 0.848(0.024) | 0.855(0.035) | 0.848(0.022) | 0.861(0.020) |
| **Air foil ()** | | | | | |
| 0 | 0.443(0.018) | 0.463(0.022) | 0.481(0.033) | 0.440(0.021) | 0.475(0.017) |
| 2 | 0.464(0.027) | 0.475(0.023) | 0.523(0.041) | 0.462(0.024) | 0.569(0.013) |
| 4 | 0.461(0.026) | 0.493(0.029) | 0.555(0.041) | 0.494(0.019) | 0.613(0.010) |
| 6 | 0.481(0.028) | 0.494(0.027) | 0.575(0.032) | 0.519(0.026) | 0.637(0.013) |
| **Power plant ()** | | | | | |
| 0 | 0.256(0.003) | 0.253(0.003) | 0.255(0.003) | 0.253(0.003) | 0.252(0.002) |
| 2 | 0.257(0.003) | 0.254(0.004) | 0.256(0.003) | 0.255(0.003) | 0.264(0.003) |
| 4 | 0.257(0.003) | 0.255(0.003) | 0.258(0.003) | 0.257(0.003) | 0.280(0.005) |
| 6 | 0.259(0.006) | 0.257(0.003) | 0.259(0.003) | 0.258(0.005) | 0.295(0.005) |
| **Body fat (*StatLib) ()** | | | | | |
| 0 | 0.586(0.035) | 0.600(0.032) | 0.605(0.044) | 0.589(0.035) | 0.612(0.026) |
| 2 | 0.587(0.032) | 0.611(0.045) | 0.623(0.041) | 0.596(0.042) | 0.623(0.027) |
| 4 | 0.593(0.031) | 0.624(0.047) | 0.631(0.049) | 0.601(0.043) | 0.641(0.030) |
| 6 | 0.603(0.032) | 0.656(0.052) | 0.652(0.047) | 0.613(0.038) | 0.659(0.031) |


Notes: Averages and standard deviations of regression errors over 50 runs. “No Reduc.” means the results without dimension reduction. The best and comparable methods according to the -test at the significance level are in bold. denotes the number of added noise dimensions to the original data.

### 5.2 Regression on Benchmark Data Sets

In this experiment, dOPG was excluded because dMAVE is initialized based on dOPG and often showed similar or better estimation performance than dOPG in the previous experiments. We first standardized each data set so that the mean and standard deviation are zero and one, respectively. Then we randomly selected samples from each data set for the training phase; the rest, whose number is denoted by , were used in the test phase.

After estimating by each method in the training phase, we performed kernel ridge regression with the gaussian kernel on the dimension-reduced data . In the test phase, the regression error was measured by , where denotes a learned regressor. Furthermore, we made the data sets more challenging by concatenating Gamma variables independently drawn from to . Unlike the previous experiment, the true intrinsic dimensionality is unknown, and thus we cross-validated it as follows: five-fold cross-validation was performed to choose the intrinsic dimensionality for each method so as to minimize the regression error, where the candidates of were .
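The evaluation protocol above can be sketched with a small numpy implementation of Gaussian-kernel ridge regression on the projected data; the bandwidth and ridge values, like the function name, are placeholders rather than the cross-validated choices used in the experiments.

```python
import numpy as np

def krr_on_subspace(X_tr, y_tr, X_te, B, sigma=1.0, lam=1e-3):
    """Kernel ridge regression with a Gaussian kernel applied to the
    dimension-reduced data B^T x: fit alpha = (K + lam I)^{-1} y on the
    training data, then predict on the test data."""
    Z_tr, Z_te = X_tr @ B, X_te @ B            # project onto the estimated subspace

    def gram(A, C):
        # Gaussian kernel matrix between the rows of A and C
        sq = np.sum(A**2, 1)[:, None] + np.sum(C**2, 1)[None, :] - 2 * A @ C.T
        return np.exp(-sq / (2 * sigma**2))

    K = gram(Z_tr, Z_tr)
    alpha = np.linalg.solve(K + lam * np.eye(len(K)), y_tr)
    return gram(Z_te, Z_tr) @ alpha            # test predictions
```

When the estimated projection captures the informative subspace, the regressor only has to model a low-dimensional function, which is why the regression error serves as a proxy for the quality of the projection.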

The results are summarized in Table 2. When no noise dimensions are added, the performance of LSGDR is comparable to or worse than that of the other methods on some data sets. However, as the number of noise dimensions increases, LSGDR often significantly outperforms the other methods. These results imply that LSGDR is useful for finding informative subspaces in relatively high-dimensional data.

### 5.3 Classification on Benchmark Data Sets

Finally, LSGDR is applied to binary classification. Data sets were downloaded from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.^{10} The training data were randomly selected from the whole data set, and the remaining data were used in the test phase. For large data sets, we randomly selected 1000 samples for the test phase. As preprocessing, we standardized both the training and test data using the mean and standard deviation of the training data. In the training phase, the projection matrix was estimated by each method using the training data, and then a support vector machine (SVM) (Schölkopf & Smola, 2001) was trained on the dimension-reduced data.^{11} The performance was measured by the misclassification rate of the trained SVM on the test data. In this classification experiment, we used a variant of gKDR called *gKDR-v*, as in Fukumizu and Leng (2014). As in the previous regression experiment, the intrinsic dimension was cross-validated in the training phase so that the misclassification rate is minimized.

Table 3 shows the average and standard deviation of misclassification rates over 50 runs, indicating that LSGDR compares favorably with the other methods.

| Data Set | LSGDR | dMAVE | LSDR | gKDR | No Reduc. |
|---|---|---|---|---|---|
| Australian | 14.780 (1.596) | 14.604 (1.431) | 15.576 (1.538) | 14.751 (1.662) | 14.482 (1.271) |
| Breast-cancer | 4.010 (2.567) | 3.802 (1.000) | 4.214 (0.970) | 3.546 (1.236) | 3.332 (0.843) |
| COD-RNA | 6.494 (1.254) | 6.540 (1.197) | 6.838 (0.945) | 6.388 (0.979) | 6.170 (0.958) |
| Diabetes | 25.134 (2.014) | 25.729 (2.320) | 26.243 (3.107) | 24.996 (2.274) | 24.415 (1.700) |
| Heart | 18.871 (3.763) | 20.318 (4.146) | 19.906 (3.913) | 19.647 (4.016) | 18.106 (3.455) |
| Liver-disorders | 31.918 (3.457) | 31.262 (4.320) | 33.928 (4.110) | 31.149 (3.547) | 30.718 (3.301) |
| SUSY | 24.366 (1.518) | 24.818 (2.301) | 26.160 (3.079) | 25.538 (2.876) | 24.286 (1.677) |
| Shuttle | 0.700 (1.363) | 1.266 (0.789) | 1.836 (1.215) | 1.240 (1.091) | 1.544 (0.731) |

Notes: Averages and standard deviations of misclassification rates over 50 runs. "No Reduc." denotes the results without dimension reduction. The best method and the methods comparable to it according to the t-test are in bold.

## 6 Conclusion

The main contribution of this letter is a novel estimator for the gradient of logarithmic conditional densities, which improves the performance of SDR methods. The solution of the proposed gradient estimator can be computed efficiently in closed form, and a model selection method is also available. With the proposed gradient estimator, we developed an SDR method based on eigendecomposition. Our theoretical analysis showed that both the proposed estimator and our SDR method converge to the optimal solutions at the optimal rate under a parametric setting. We experimentally demonstrated that the proposed SDR method works well on a variety of data sets.
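The eigendecomposition step summarized above can be illustrated concretely. In the sketch below, the gradients of the logarithmic conditional density are computed analytically for a toy gaussian model; in the actual method they would come from the proposed least-squares estimator. The toy model and its parameters are assumptions for illustration only.

```python
# Minimal sketch of SDR via eigendecomposition of the averaged outer
# product of log-conditional-density gradients. Analytic gradients of a
# toy gaussian model stand in for the LSGDR estimates.
import numpy as np

rng = np.random.default_rng(0)

# Toy model: y = x_1 + noise, so the true subspace is spanned by e_1.
n, dx = 500, 4
B_true = np.zeros((dx, 1)); B_true[0, 0] = 1.0
X = rng.normal(size=(n, dx))
s = 0.5
y = X[:, 0] + s * rng.normal(size=n)

# For y | x ~ N(x_1, s^2), the gradient of log p(y|x) w.r.t. x is
# ((y - x_1) / s^2) * e_1, which lies in the true subspace.
G = ((y - X[:, 0]) / s**2)[:, None] * B_true.T   # n x dx gradient matrix

# Averaged outer product of the gradients and its eigendecomposition.
M = G.T @ G / n
eigvals, eigvecs = np.linalg.eigh(M)
B_hat = eigvecs[:, -1:]          # top eigenvector spans the estimate

# Subspace recovery: |cosine| between B_hat and B_true should be ~1.
cos = abs(float(B_hat.T @ B_true))
print(f"|cosine| between estimated and true direction: {cos:.3f}")
```

Because the analytic gradients here lie exactly in the true subspace, the top eigenvector recovers it; with estimated gradients the recovery is approximate, at the rate analyzed in the theory.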

## Notes

Output data are continuous in regression, while they are categorical in classification.

In principle, model selection can be performed by cross-validation (CV) over a successive predictor. However, this should be avoided in practice for two reasons. First, when CV is applied, one should optimize both parameters in an SDR method and hyperparameters in the predictor. This procedure results in a nested CV, which is computationally quite inefficient. Second, features extracted based on CV are no longer independent of predictors, which is not preferable in terms of interpretability (Suzuki & Sugiyama, 2013).

Setting the minimum candidate of the bandwidth to a fairly large value can be justified as follows. First, log densities tend to prefer large bandwidth values because of the logarithm. Second, if the output is completely independent of the input, the true gradient is zero. Thus, the ideal estimate is the zero function, which can be achieved in two ways: either all parameters are zero, or the bandwidth takes a degenerate value. To exclude the second possibility, we set the minimum candidate of the bandwidth at some large value.

We subsampled only the Physiochem data set by extracting the first 1000 samples because it is too large.

For the “shuttle” data set, we used only data samples in classes 1 and 4.

We employed the Matlab software for SVM called *LIBSVM* (Chang & Lin, 2011).

## Appendix A: Proof of Theorem 1

Here, we prove theorem 1. Our proof essentially follows that of Shiino, Sasaki, Niu, and Sugiyama (2016) and proceeds in three steps.

### A.1 Step 1: Establishment of the Growth Condition.

**Proof.** Taylor's theorem (Nocedal & Wright, 1999, theorem 2.1) expands the objective around the optimum, with the Hessian matrix evaluated at a point lying between the estimate and the optimum. Applying the first-order optimality condition to this expansion establishes the growth condition.

### A.2 Step 2: Stability Analysis.

**Proof.** The gradient of the difference function is bounded, because of the regularization, on a ball around the optimum. The resulting inequality shows that the norm of the gradient on the ball is bounded in probability; thus, the difference function is Lipschitz continuous on the ball with a Lipschitz constant of the same order.

### A.3 Step 3: A Convergence Rate of LSLCG.

Combining the growth condition of step 1 and the stability bound of step 2 with proposition 6.1 in Bonnans and Shapiro (1998), the estimated parameter is the minimizer of the perturbed objective, and the parametric convergence rate of LSLCG follows.

## Appendix B: Proof of Theorem 2

Following the supplementary material of Sasaki et al. (2016), we prove theorem 2.

The first term converges by theorem 1, and the second term also converges in the same order due to the central limit theorem. Thus, the claim follows.

## Acknowledgments

G. N. acknowledges support from JST CREST JPMJCR1403.