Abstract

A typical goal of linear-supervised dimension reduction is to find a low-dimensional subspace of the input space such that the projected input variables preserve maximal information about the output variables. The dependence-maximization approach solves the supervised dimension-reduction problem through maximizing a statistical dependence measure between projected input variables and output variables. A well-known statistical dependence measure is mutual information (MI), which is based on the Kullback-Leibler (KL) divergence. However, it is known that the KL divergence is sensitive to outliers. Quadratic MI (QMI) is a variant of MI based on the L2 distance, which is more robust against outliers than the KL divergence, and a computationally efficient method to estimate QMI from data, least-squares QMI (LSQMI), has been proposed recently. For these reasons, developing a supervised dimension-reduction method based on LSQMI seems promising. However, not QMI itself but the derivative of QMI is needed for subspace search in linear-supervised dimension reduction, and the derivative of an accurate QMI estimator is not necessarily a good estimator of the derivative of QMI. In this letter, we propose to directly estimate the derivative of QMI without estimating QMI itself. We show that the direct estimation of the derivative of QMI is more accurate than the derivative of the estimated QMI. Finally, we develop a linear-supervised dimension-reduction algorithm that efficiently uses the proposed derivative estimator and demonstrate through experiments that the proposed method is more robust against outliers than existing methods.

1  Introduction

Supervised learning is one of the central problems in machine learning, which aims at learning an input-output relationship from given input-output paired data samples. Although many methods were proposed to perform supervised learning, they often work poorly when the input variables have high dimensionality. Such a situation is commonly referred to as the curse of dimensionality (Bishop, 2006), and a common approach to mitigate the curse of dimensionality is to preprocess the input variables by dimension reduction (Burges, 2010).

A typical goal of linear dimension reduction in supervised learning is to find a low-dimensional subspace of the input space such that the projected input variables preserve maximal information about the output variables. Thus, a subsequent supervised learning method can use the low-dimensional projection of the input variables to learn the input-output relationship with a minimal loss of information. The purpose of this letter is to develop a novel linear-supervised dimension-reduction method.

The dependence-maximization approach solves the supervised dimension-reduction problem through maximizing a statistical dependence measure between projected input variables and output variables. Mutual information (MI) is a well-known tool for measuring statistical dependency between random variables (Cover & Thomas, 1991). It is well studied, and many methods have been proposed to estimate MI from data. A notable method is the maximum likelihood MI (MLMI) (Suzuki, Sugiyama, Sese, & Kanamori, 2008), which does not require any assumption on the data distribution and can perform model selection via cross validation. For these reasons, MLMI seems to be an appealing tool for supervised dimension reduction. However, MI is defined based on the Kullback-Leibler divergence (Kullback & Leibler, 1951), which is known to be sensitive to outliers (Basu, Harris, Hjort, & Jones, 1998). Hence, MI is not an appropriate tool when it is applied on a data set containing outliers.

Quadratic MI (QMI) is a variant of MI (Principe, Xu, Zhao, & Fisher, 2000). Unlike MI, it is defined based on the L2 distance. A notable advantage of the L2 distance over the KL divergence is that the L2 distance is more robust against outliers (Basu et al., 1998). Moreover, a computationally efficient method to estimate QMI from data, least-squares QMI (LSQMI) (Sainui & Sugiyama, 2013), has been proposed. LSQMI does not require any assumption on the data distribution and can perform model selection via cross validation. For these reasons, developing a supervised dimension-reduction method based on LSQMI is a more promising approach.

An approach to use LSQMI for supervised dimension reduction is to first estimate QMI between projected input variables and output variables by LSQMI, and then search for a subspace that maximizes the estimated QMI by a nonlinear optimization method such as gradient ascent. However, the essential quantity of the subspace search is the derivative of QMI with regard to the subspace, not QMI itself. Thus, LSQMI may not be an appropriate tool for developing supervised dimension-reduction methods since the derivative of an accurate QMI estimator is not necessarily an accurate estimator of the derivative of QMI.

To cope with this problem, we propose in this letter a novel method to directly estimate the derivative of QMI without estimating QMI itself. The proposed method has the following advantageous properties: it does not require any assumption on the data distribution, the estimator can be computed analytically, and the tuning parameters can be objectively chosen by cross validation. We show through experiments that the proposed direct estimator of the derivative of QMI is more accurate than the derivative of the estimated QMI. Then we develop a fixed-point iteration that efficiently uses the proposed estimator of the derivative of QMI to perform supervised dimension reduction. Finally, we demonstrate the usefulness of the proposed supervised dimension-reduction method through experiments and show that the proposed method is more robust against outliers than existing methods.

The organization of this letter is as follows. We formulate the linear-supervised dimension-reduction problem and review some existing methods in section 2. Then we give an overview of QMI and review some QMI estimators in section 3. The details of the proposed derivative estimator are given in section 4. In section 5 we develop a supervised dimension-reduction algorithm based on the proposed derivative estimator. The experimental results are given in section 6. A further extension of the proposed derivative estimator is presented in section 7. The conclusion is given in section 8.

2  Linear-Supervised Dimension Reduction

In this section, we formulate the linear-supervised dimension-reduction problem. Then we briefly review existing supervised dimension-reduction methods and discuss their problems.

2.1  Problem Formulation

Let and be the input domain and output domain with dimensionality and , respectively, and be a joint probability density on . First, assume that we are given an input-output paired data set , where each data sample is drawn independently from the joint density:
formula

Next, let be an orthonormal matrix with a known constant , where denotes the -by- identity matrix and denotes the matrix transpose. Then assume that there exists a -dimensional subspace in spanned by the rows of such that the projection of onto this subspace denoted by   preserves the maximal information about of . That is, we can substitute by with a minimal loss of information about . We refer to the problem of estimating from the given data as linear-supervised dimension reduction. Below, we review some of the existing linear-supervised dimension-reduction methods.
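
To make the setup concrete, the following NumPy sketch generates input samples, builds an orthonormal projection matrix, and computes the projected inputs. The dimensionalities and the construction of the matrix via a QR decomposition of a random matrix are illustrative assumptions, not part of the formulation above.

    import numpy as np

    rng = np.random.default_rng(0)
    n, dx, dz = 200, 5, 2            # sample size, input and subspace dimensionalities (illustrative)

    X = rng.normal(size=(n, dx))     # input samples, one per row

    # An orthonormal matrix W (dz rows of length dx, W W^T = I): here taken from
    # the QR decomposition of a random dx-by-dz matrix.
    Q, _ = np.linalg.qr(rng.normal(size=(dx, dz)))
    W = Q.T                          # shape (dz, dx), rows are orthonormal

    Z = X @ W.T                      # projected inputs, shape (n, dz)
    assert np.allclose(W @ W.T, np.eye(dz))   # orthonormality check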

2.2  Sliced Inverse Regression

Sliced inverse regression (SIR; Li, 1991) is a well-known linear-supervised dimension-reduction method. It formulates linear-supervised dimension reduction as a problem of finding , which makes and conditionally independent given :
formula
2.1
The key principle of SIR lies in the following equality,
formula
2.2
where denotes the conditional expectation and denotes the th row of . The importance of this equality is that if the equality holds for any and some constants , then the inverse regression curve lies on the space spanned by , which satisfies equation 2.1. Based on this fact, SIR estimates as follows. First, the range of is sliced into multiple slices. Then is estimated as the mean of for each slice of . Finally, is obtained as the largest principal components of the covariance matrix of the means.
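
The slicing procedure just described can be sketched as follows for a scalar output. The equal-frequency slicing, the weighting of the slice means, and the omission of the whitening step that SIR usually applies to the inputs are simplifications of this sketch rather than the exact algorithm of Li (1991).

    import numpy as np

    def sir(X, y, dz, n_slices=10):
        """Sliced inverse regression: a simplified sketch for one-dimensional y."""
        n, dx = X.shape
        Xc = X - X.mean(axis=0)                       # center the inputs
        # 1. Slice the range of y into slices of roughly equal size.
        slices = np.array_split(np.argsort(y), n_slices)
        # 2. Estimate the inverse regression curve by the mean of x within each slice.
        means = np.stack([Xc[idx].mean(axis=0) for idx in slices])
        weights = np.array([len(idx) for idx in slices]) / n
        # 3. Take the leading principal components of the covariance of the slice means.
        M = (means * weights[:, None]).T @ means      # dx-by-dx matrix
        eigvals, eigvecs = np.linalg.eigh(M)
        return eigvecs[:, np.argsort(eigvals)[::-1][:dz]].T   # top-dz directions as rows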

The significant advantages of SIR are its simplicity and scalability to large data sets. However, SIR relies on the equality in equation 2.2, which typically requires that is an elliptically symmetric distribution such as gaussian. This is restrictive, and thus the practical usefulness of SIR is limited.

2.3  Minimum Average Variance Estimation Based on the Conditional Density Functions

The minimum average variance estimation based on the conditional density functions (dMAVE; Xia, 2007) is a linear-supervised dimension-reduction method that does not require any assumption on the data distribution and is more practical compared to SIR. Briefly, dMAVE aims to find a matrix that yields an accurate nonparametric estimation of the conditional density .

The essential part of dMAVE is the following model,
formula
where denotes a symmetric kernel function with bandwidth , denotes a conditional expectation of given , and with . An important property of this model is that it converges to the conditional density as the bandwidth tends to zero. Then dMAVE estimates by a local linear smoother (Fan, Yao, & Tong, 1996). More specifically, a local linear smoother of is given by
formula
2.3
where is an arbitrary point close to and and are parameters. Based on this local linear smoother, dMAVE solves the following minimization problem,
formula
2.4
where is a symmetric kernel function with bandwidth . The function is a trimming function that is evaluated as zero when the densities of or are lower than some threshold. A solution to this minimization problem is obtained by alternately solving quadratic programming problems for , and until convergence.

The main advantage of dMAVE is that it does not require any assumption on the data distribution. However, a significant disadvantage of dMAVE is that there is no systematic method to choose the kernel bandwidths and the trimming threshold. In practice, dMAVE uses a bandwidth selection method based on the normal-reference rule of the nonparametric conditional density estimation (Silverman, 1986; Fan et al., 1996), and a fixed trimming threshold. Although this model selection strategy works reasonably well in general, it does not always guarantee good performance on all kinds of data sets.

Another disadvantage of dMAVE is that the optimization problem in equation 2.4 may have many local solutions. To cope with this problem, dMAVE uses the solution of another linear-supervised dimension-reduction method, the outer product of gradients based on conditional density functions (dOPG; Xia, 2007), as its initial solution. Thus, dMAVE may not perform well if dOPG fails to provide a good initial solution.

2.4  Kernel Dimension Reduction

Another linear-supervised dimension-reduction method that does not require any assumption on the data distribution is kernel dimension reduction (KDR; Fukumizu, Bach, & Jordan, 2009). Unlike dMAVE, which focuses on conditional density, KDR aims to find a matrix that satisfies the conditional independence in equation 2.1. The key idea of KDR is to evaluate the conditional independence through a conditional covariance operator over reproducing kernel Hilbert spaces (RKHSs; Aronszajn, 1950).

Throughout this section, we use to denote an RKHS of functions on the domain equipped with the reproducing kernel ,
formula
for and . The RKHSs of functions on domains and are also defined similarly as and , respectively. The cross-covariance operator : satisfies the following equality for all and ,
formula
where , , and denote expectations over densities , , and , respectively. Then the conditional covariance operator can be defined using cross-covariance operators as
formula
2.5
where it is assumed that always exists. The importance of the conditional covariance operator in supervised dimension reduction lies in the following relations,
formula
2.6
where the inequality refers to the partial order of self-adjoint operators, and
formula
2.7
These relations mean that the conditional independence can be achieved by finding a matrix that minimizes in the partial order of self-adjoint operators. Based on this fact, KDR solves the following minimization problem:
formula
2.8
where denotes a regularization parameter, and denote centered gram matrices with the kernels and , respectively, and denotes the trace of an operator. A solution to this minimization problem is obtained by a gradient descent method.
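
For reference, one common empirical form of this criterion is the trace of the centered output Gram matrix multiplied by the regularized inverse of the centered Gram matrix of the projected inputs. The sketch below assumes gaussian kernels and that particular scaling of the regularizer, which may differ from the exact expression in equation 2.8.

    import numpy as np

    def centered_gram(A, sigma):
        """Centered gaussian Gram matrix of the rows of the 2D array A."""
        sq = np.sum(A ** 2, axis=1)
        D = sq[:, None] + sq[None, :] - 2 * A @ A.T      # pairwise squared distances
        G = np.exp(-D / (2 * sigma ** 2))
        n = A.shape[0]
        H = np.eye(n) - np.ones((n, n)) / n              # centering matrix
        return H @ G @ H

    def kdr_objective(W, X, Y, sigma_z=1.0, sigma_y=1.0, eps=1e-3):
        """Trace criterion minimized by KDR (one common empirical form, sketched)."""
        n = X.shape[0]
        Gz = centered_gram(X @ W.T, sigma_z)             # Gram matrix of projected inputs
        Gy = centered_gram(Y, sigma_y)                   # Gram matrix of outputs (Y as 2D array)
        return np.trace(Gy @ np.linalg.inv(Gz + n * eps * np.eye(n)))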

KDR does not require any assumption on the data distribution and was shown to work well on various regression and classification tasks (Fukumizu et al., 2009). However, KDR has two weaknesses in practice. The first is that although the kernel parameters and the regularization parameter can heavily affect the performance, there seems to be no justifiable model selection method to choose these parameters so far. Although it is always possible to choose these tuning parameters based on the prediction performance of a successive supervised learning method with cross validation, this approach results in a nested loop of model selection for both KDR itself and the successive supervised learning method. Moreover, this approach makes supervised dimension reduction depend on the successive supervised learning method, which may not be favorable in practice.

The second weakness is that the optimization problem in equation 2.8 is nonconvex and may have multiple local solutions. Thus, if the initial solution is not properly chosen, the performance of KDR may be unreliable. A simple approach to cope with this problem is to restart the optimization with several different initial guesses and choose the best solution based on equation 2.8. However, this approach is computationally expensive. A more sophisticated approach was considered in Fukumizu and Leng (2014), which proposed using a solution of a linear-supervised dimension-reduction method, gradient-based kernel dimension reduction (gKDR), as an initial solution for KDR. However, it is not guaranteed that gKDR always provides a good initial solution for KDR.

2.5  Least-Squares Dimension Reduction

The least-squares dimension reduction (LSDR; Suzuki & Sugiyama, 2013) is another linear-supervised dimension-reduction method that does not require any assumption on the data distribution. Similar to KDR, LSDR aims to find a matrix that satisfies the conditional independence in equation 2.1. However, instead of the conditional covariance operators, LSDR evaluates the conditional independence through a statistical dependence measure.

LSDR utilizes a statistical dependence measure, squared-loss mutual information (SMI). SMI between random variables and is defined as
formula
2.9
is always nonnegative and equals zero if and only if and are statistically independent, that is, . The important properties of SMI in supervised dimension reduction are
formula
and
formula
Thus, conditional independence can be achieved by finding a matrix that maximizes . Since is typically unknown, it is estimated by the least-squares mutual information (Suzuki, Sugiyama, Kanamori, & Sese, 2009) method, which directly estimates the density ratio without performing any density estimation. Then LSDR solves the following maximization problem,
formula
2.10
where denotes the estimated SMI. The solution to this maximization problem is obtained by a gradient ascent method. Note that this maximization problem is nonconvex and may have many local solutions.

LSDR does not require any assumption on the data distribution, similar to dMAVE and KDR. It can also avoid a poor local solution based on the objective value, similar to KDR. However, the significant advantage of LSDR over dMAVE and KDR is that it can perform model selection via cross validation without requiring any successive supervised learning method. This is a practically favorable property as a supervised dimension-reduction method.

However, a disadvantage of LSDR is that the density ratio function can fluctuate wildly, especially when the data contain outliers. Since it is typically difficult to accurately estimate such a wildly fluctuating function, LSDR could be unreliable in the presence of outliers.

Recently, a linear-supervised dimension-reduction method, least-squares conditional entropy (LSCE; Tangkaratt, Xie, & Sugiyama, 2015) was proposed. Unlike LSDR, LSCE is based on the minimization of a squared-loss variant of conditional entropy, which contains the density ratio function . Since the marginal density is not contained in the formulation, LSCE tends to be robust against outliers when outliers are in the output domain. However, LSCE could still be unreliable when outliers are in the input domain.

Next, we consider a linear-supervised dimension-reduction approach based on quadratic mutual information, which can overcome the disadvantages of the existing methods.

3  Quadratic Mutual Information

In this section, we briefly introduce quadratic mutual information and discuss how it can be used to perform robust supervised dimension reduction.

3.1  Quadratic Mutual Information and Mutual Information

Quadratic mutual information (QMI) is a measure for statistical dependency between random variables (Principe et al., 2000) and is defined as
formula
3.1
is always nonnegative and equals zero if and only if and are statistically independent: . Such a property of QMI is similar to that of the ordinary mutual information (MI), which is defined as
formula
3.2
The essential difference between QMI and MI is the discrepancy measure: QMI is based on the L2 distance between and , while MI is based on the Kullback-Leibler (KL) divergence (Kullback & Leibler, 1951).

MI has been studied and applied to many data analysis tasks (Cover & Thomas, 1991). Moreover, an efficient method to estimate MI from data is also available (Suzuki et al., 2008). However, MI is not always the optimal choice for measuring statistical dependence because it is not robust against outliers. An intuitive explanation is that MI contains the log function and the density ratio: the logarithm changes sharply near zero, and the density ratio can fluctuate wildly and even diverge to infinity. Thus, the value of MI tends to be unstable and unreliable in the presence of outliers. In contrast, QMI contains neither the log function nor the density ratio, and thus should be more robust against outliers than MI.

The robustness of QMI and MI can also be understood through their discrepancy measures. Both the L2 distance (QMI) and the KL divergence (MI) can be regarded as members of a more general divergence class, the density power divergence (Basu et al., 1998):
formula
3.3
where α ≥ 0. Based on this divergence class, the L2 distance and the KL divergence can be obtained by setting α = 1 and α → 0, respectively. As Basu et al. (1998) discussed, the parameter α controls the robustness of the divergence against outliers, where a larger value of α indicates higher robustness. This means that the L2 distance (α = 1) is more robust against outliers than the KL divergence (α → 0).
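
The following toy computation, which is not taken from the letter, illustrates the point numerically: when a small fraction of outliers is mixed into one of two otherwise identical densities, the KL divergence between the contaminated and the clean density grows rapidly (driven by the log of a density ratio that explodes in the outlier region), while the L2 distance stays small.

    import numpy as np
    from scipy.stats import norm

    x = np.linspace(-10, 10, 2001)
    dx = x[1] - x[0]
    p = norm.pdf(x, loc=0, scale=1)                         # clean density

    for eps in [0.0, 0.01, 0.05]:
        # contaminate q with a fraction eps of outliers far from the bulk of p
        q = (1 - eps) * norm.pdf(x, 0, 1) + eps * norm.pdf(x, 8, 0.5)
        l2 = np.sum((q - p) ** 2) * dx                      # squared L2 distance
        kl = np.sum(q * np.log(q / p)) * dx                 # KL(contaminated || clean)
        print(f"eps={eps:.2f}  L2={l2:.5f}  KL={kl:.5f}")
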
In supervised dimension reduction, robustness against outliers is an important requirement because outliers often make supervised dimension-reduction methods work poorly. Thus, developing a supervised dimension-reduction method based on QMI is an attractive approach since QMI is robust against outliers. This QMI-based supervised dimension-reduction method is performed by finding a matrix that maximizes :
formula
The motivation is that if is maximized, then and are maximally dependent on each other, and thus we may disregard with a minimal loss of information about .

Since is typically unknown, it needs to be estimated from data. We next review existing QMI estimation methods and then discuss a weakness of performing supervised dimension reduction using these QMI estimation methods.

3.2  Existing QMI Estimation Methods

We review two QMI estimation methods that estimate from the given data. The first method estimates QMI through density estimation and the second through density difference estimation.

3.2.1  QMI Estimator Based on Density Estimation

Expanding equation 3.1 allows us to express as
formula
3.4
A naive approach to estimate is to separately estimate the unknown densities , , and by density estimation methods such as kernel density estimation (KDE; Silverman, 1986) and then plug the estimates into equation 3.4.

Following this approach, the KDE-based QMI estimator has been studied and applied to many problems such as feature extraction for classification (Torkkola, 2003; Principe et al., 2000), blind source separation (Principe et al., 2000), and image registration (Atif, Ripoche, Coussinet, & Osorio, 2003). Although this density estimation-based approach was shown to work well, accurately estimating densities for high-dimensional data is known to be one of the most challenging tasks (Vapnik, 1998). Moreover, the densities contained in equation 3.4 are estimated independently without regard for the accuracy of the QMI estimator. Thus, even if each density is accurately estimated, the QMI estimator obtained from these density estimates does not necessarily give an accurate QMI. An approach to mitigate this problem is to consider density estimators whose combination minimizes the estimation error of QMI. Although this approach shows better performance than the independent density estimation approach, it still performs poorly in high-dimensional problems (Sugiyama et al., 2013).
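
As an illustration of this plug-in strategy, the sketch below estimates the three densities with scipy's gaussian kernel density estimator and approximates the three integrals in the expanded expression of equation 3.4 by sample averages. The sample-average approximation and the omission of constant factors in the definition of QMI are assumptions of this sketch.

    import numpy as np
    from scipy.stats import gaussian_kde

    def qmi_kde(X, Y):
        """Naive plug-in QMI estimate based on kernel density estimation (sketch).

        X: (n, dx) inputs, Y: (n, dy) outputs.
        """
        XY = np.vstack([X.T, Y.T])                  # gaussian_kde expects shape (dim, n)
        p_xy, p_x, p_y = gaussian_kde(XY), gaussian_kde(X.T), gaussian_kde(Y.T)
        t1 = np.mean(p_xy(XY))                      # int p(x,y)^2        =  E_{p(x,y)}[p(x,y)]
        t2 = np.mean(p_x(X.T) * p_y(Y.T))           # int p(x,y)p(x)p(y)  =  E_{p(x,y)}[p(x)p(y)]
        t3 = np.mean(p_x(X.T)) * np.mean(p_y(Y.T))  # int p(x)^2 p(y)^2   =  E_p[p(x)] E_p[p(y)]
        return t1 - 2 * t2 + t3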

3.2.2  Least-Squares QMI

To avoid the separate density estimation, Sainui and Sugiyama (2013) proposed an alternative method, least-squares QMI (LSQMI). We briefly review the LSQMI method.

First, notice that can be expressed in terms of the density difference as
formula
3.5
where
formula
The key idea of LSQMI is to directly estimate the density difference without going through any density estimation by the procedure of the least-squares density difference (Sugiyama et al., 2013). Letting be a model of the density difference, LSQMI learns so that it is fitted to the true density difference under the squared loss:
formula
By expanding the integrand, we obtain
formula
Since the last term is a constant with regard to the model , we omit it and obtain the following criterion:
formula
3.6
Then the density difference estimator is obtained as the solution of the following minimization problem:
formula
3.7
The solution of the minimization problem in equation 3.7 depends on the choice of the model . LSQMI employs the following linear-in-parameter model,
formula
where is a parameter vector and is a basis function vector. For this model, finding the solution of equation 3.7 is equivalent to solving
formula
where
formula
3.8
formula
3.9
By approximating the expectation over the densities , , and with sample averages, we obtain the following empirical minimization problem,
formula
where is the sample approximation of equation 3.9:
formula
By including the regularization term, we obtain
formula
where is the regularization parameter. Then the solution is obtained analytically as
formula
3.10
Therefore, the density difference estimator is obtained as
formula
Finally, the QMI estimator is obtained by substituting the density difference estimator into equation 3.5. A direct substitution yields two possible QMI estimators:
formula
3.11
formula
3.12
However, Sugiyama et al. (2013) showed that a linear combination of the two estimators defined as
formula
3.13
provides smaller bias and is a more appropriate QMI estimator.

As shown above, LSQMI avoids multiple-step density estimation by directly estimating the density difference contained in QMI. It was shown that such a direct estimation procedure tends to be more accurate than multiple-step estimation (Sugiyama et al., 2013). Moreover, LSQMI is able to objectively choose the tuning parameter contained in the basis function and the regularization parameter based on cross validation. This property allows LSQMI to solve challenging tasks such as clustering (Sainui & Sugiyama, 2013) and unsupervised dimension reduction (Sainui & Sugiyama, 2014) in an objective way.
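
Given the empirical quantities above, the LSQMI computation reduces to a single regularized linear solve followed by a simple combination of quadratic forms. The sketch below assumes that the empirical matrix of equation 3.8 and the vector of equation 3.9 are precomputed, and it uses the combination of the two plug-in estimators proposed for least-squares density-difference estimation (minus twice the minimized objective); whether this coincides exactly with equation 3.13 is an assumption of the sketch.

    import numpy as np

    def lsqmi_from_design(H, h, lam):
        """LSQMI-style estimate from precomputed H-hat and h-hat (sketch).

        H   : (b, b) matrix of integrated products of the basis functions (eq. 3.8)
        h   : (b,)   sample-average vector approximating eq. 3.9
        lam : regularization parameter
        """
        theta = np.linalg.solve(H + lam * np.eye(H.shape[0]), h)   # closed form of eq. 3.10
        qmi_a = h @ theta                # first plug-in estimator (eq. 3.11 style)
        qmi_b = theta @ H @ theta        # second plug-in estimator (eq. 3.12 style)
        return theta, 2 * qmi_a - qmi_b  # combined, smaller-bias estimate (eq. 3.13 style)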

3.3  Supervised Dimension Reduction via LSQMI

Given an efficient QMI estimation method such as LSQMI, linear-supervised dimension reduction can be performed by finding a matrix defined as
formula
3.14
A straightforward approach to solving equation 3.14 is to perform the gradient ascent,
formula
where denotes the step size. The update formula shows that the essential quantity for QMI-based supervised dimension reduction is not the accuracy of the QMI estimator itself but the accuracy of the estimator of the derivative of QMI. Thus, the existing LSQMI-based approach, which first estimates QMI and then computes the derivative of the QMI estimator, is not necessarily appropriate, since the derivative of an accurate QMI estimator is not necessarily an accurate estimator of the derivative of QMI. Next, we describe our proposed method, which overcomes this problem.

4  Derivative of Quadratic Mutual Information

To cope with the weakness of the QMI estimation methods when performing linear-supervised dimension reduction, we propose to directly estimate the derivative of QMI without estimating QMI itself.

4.1  Direct Estimation of the Derivative of Quadratic Mutual Information

From equation 3.5, the derivative of QMI with regard to the th element of can be expressed by
formula
4.1
where in the second line, we assume that the order of the derivative and the integration is interchangeable. By approximating the expectations over the densities , , and with sample averages, we obtain an approximation of the derivative of QMI as
formula
4.2
Note that since , we have that is the -dimensional vector with zero everywhere except at the th dimension, which has value . Hence, equation 4.2 can be simplified as
formula
4.3
This means that the derivative of with regard to can be obtained once we know the derivatives of the density difference with regard to  for all . However, these derivatives are often unknown and need to be estimated from data. We next discuss existing approaches and their drawbacks. Then we propose our approach, which can overcome the drawbacks.

4.2  Existing Approaches to Estimate the Derivative of the Density Difference

Our current goal is to obtain the derivative of the density difference with regard to  which can be rewritten as
formula
4.4
All terms in equation 4.4 are unknown in practice and need to be estimated from data. There are three approaches to estimate them.
  1. Density estimation. Separately estimate the densities , , and by, for example, kernel density estimation. Then estimate the right-hand side of equation 4.4 as
    formula
    where , , and denote the estimated densities.
  2. Density derivative estimation. Estimate the density by, for example, kernel density estimation. Next, separately estimate the density derivatives and by, for example, the method of mean integrated square error for derivatives (Sasaki, Noh, & Sugiyama, 2015), which can estimate the density derivative without estimating the density itself. Then estimate the right-hand side of equation 4.4 as
    formula
    where denotes the estimated density and and denote the (directly) estimated density derivatives.
  3. Density difference estimation. Estimate the density difference by, for example, least-squares density difference (Sugiyama et al., 2013), which can estimate the density difference without estimating the densities themselves. Then estimate the left-hand side of equation 4.4 as
    formula
    where denotes the (directly) estimated density difference.

The problem with approaches 1 and 2 is that they involve multiple estimation steps: some quantities are estimated first and then plugged into equation 4.4. Such multiple-step methods are not appropriate since each quantity is estimated without regard to the others, and the subsequent plug-in step can magnify the estimation error contained in each estimate.

Approach 3 seems more promising than the other two since there is only one estimated quantity . However, it is still not optimal, because the derivative of an accurate density-difference estimator is not necessarily an accurate estimator of the derivative of the density difference.

To avoid the above problems, we propose a new approach, which directly estimates the derivative of the density difference.

4.3  Direct Estimation of the Derivative of the Density Difference

We propose to estimate the derivative of the density difference with regard to  using a model :
formula
The model is learned so that it is fitted to its corresponding derivative under the squared loss:
formula
4.5
By expanding the square, we obtain
formula
Since the last term is a constant with regard to the model , we omit it and obtain the following criterion:
formula
4.6
The second term is intractable due to the unknown derivative of the density difference. To make this term tractable, we use integration by parts (Kasube, 1983) to obtain the following:
formula
4.7
where denotes an integration over except for the th element. Here, we require
formula
4.8
which is a mild assumption since the tails of the density difference often vanish when approaches infinity. Applying the assumption to the left-hand side of equation 4.7 allows us to express equation 4.6 as
formula
Then the estimator is obtained as a solution of the following minimization problem:
formula
4.9
The solution of equation 4.9 depends on the choice of the model. Let us employ the following linear-in-parameter model as :
formula
4.10
where is a parameter vector and is a basis function vector whose practical choice will be discussed later in detail. For this model, finding the solution of equation 4.9 is equivalent to solving
formula
4.11
where we define
formula
4.12
formula
4.13
By approximating the expectation over the densities , , and with sample averages, we obtain the following empirical minimization problem:
formula
4.14
where is the sample approximation of equation 4.13:
formula
4.15
By including the regularization term to control the model complexity, we obtain
formula
4.16
where denotes the regularization parameter. This minimization problem is convex with regard to the parameter , and the solution can be obtained analytically as
formula
4.17
where denotes the identity matrix. Finally, the estimator of the derivative of the density difference is obtained by substituting the solution into the model, equation 4.10, as
formula
4.18
Using this solution, an estimator of the derivative of QMI can be directly obtained by substituting equation 4.18 into equation 4.3 as
formula
4.19
We call this method the least-squares QMI derivative (LSQMID).
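
Computationally, the heart of LSQMID is therefore a set of small regularized linear systems, one per derivative model. The sketch below assumes that the empirical matrices and vectors of equations 4.12, 4.13, and 4.15 have already been assembled for the chosen basis.

    import numpy as np

    def lsqmid_solve_all(H_list, h_list, lam):
        """Closed-form solutions of equation 4.17, one per derivative model (sketch).

        H_list[l] : (b, b) empirical matrix approximating equation 4.12 for the l-th model
        h_list[l] : (b,)   sample-average vector of equation 4.15
        lam       : regularization parameter chosen by cross validation
        """
        thetas = []
        for H, h in zip(H_list, h_list):
            thetas.append(np.linalg.solve(H + lam * np.eye(H.shape[0]), h))  # ridge-type solve
        return thetas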

4.4  Basis Function Design

As basis function , we propose to use
formula
where . First, we define the th gaussian function as
formula
4.20
where and denote gaussian centers chosen randomly from the data samples and denotes the gaussian width. We may use different gaussian widths for and , but this approach significantly increases the computation time for model selection, discussed in section 4.5. In our implementation, we standardize each dimension of and to have unit variance and zero mean and then use the common gaussian width for both and . We also set in the experiments.
Based on the above gaussian function, we propose to use the following function as the th basis for the th model of the derivative of the density difference:
formula
4.21
This function is the derivative of the th gaussian basis function with regard to . A benefit of this basis function design is that the integral appearing in can be computed analytically. Through a simple calculation, we obtain the th element of as follows:
formula

As discussed in section 5, this basis function choice has further benefits when we develop a linear-supervised dimension-reduction method.
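
For concreteness, the gaussian basis of equation 4.20 and its derivative with regard to one element of the projected input can be sketched as follows. The joint gaussian over the pair of projected input and output with a shared width follows the description above, while the exact indexing of the centers is an assumption of this sketch.

    import numpy as np

    def gauss_basis(Z, Y, U, V, sigma):
        """Gaussian basis of eq. 4.20: column j is the j-th basis evaluated at each sample."""
        # Z: (n, dz), Y: (n, dy); U: (b, dz), V: (b, dy) are centers chosen from the samples.
        dz2 = np.sum((Z[:, None, :] - U[None, :, :]) ** 2, axis=2)   # (n, b) squared distances in z
        dy2 = np.sum((Y[:, None, :] - V[None, :, :]) ** 2, axis=2)   # (n, b) squared distances in y
        return np.exp(-(dz2 + dy2) / (2 * sigma ** 2))

    def dgauss_basis_dz(Z, Y, U, V, sigma, l):
        """Derivative of the gaussian basis with regard to the l-th element of z (eq. 4.21 style)."""
        phi = gauss_basis(Z, Y, U, V, sigma)
        return -(Z[:, [l]] - U[None, :, l]) / sigma ** 2 * phi       # (n, b)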

4.5  Model Selection by Cross Validation

The practical performance of the LSQMID method depends on the choice of the gaussian width and the regularization parameter included in the estimator . These tuning parameters can be objectively chosen by the -fold cross-validation (CV) procedure:

  1. Divide the training data into disjoint subsets with approximately the same size. In the experiments, we choose .

  2. For each candidate and each subset , compute a solution by equation 4.17 with the candidate and samples from (i.e., all data samples except samples in ).

  3. Compute the CV score of each candidate pair by
    formula
    where denotes computed from the candidate and samples in .
  4. Choose the tuning parameter pair such that it minimizes the CV score as
    formula
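
The procedure above amounts to a grid search over the candidate pairs. A sketch of the resulting loop is given below, where build_Hh is a hypothetical helper that returns the empirical pair of equations 4.12 and 4.15 for a given gaussian width, and the held-out score is assumed to be the empirical objective of equation 4.14 evaluated on the held-out subset.

    import numpy as np

    def cv_select(Z, Y, sigma_grid, lam_grid, build_Hh, n_folds=5, seed=0):
        """Choose (sigma, lambda) by K-fold cross validation (sketch)."""
        n = Z.shape[0]
        folds = np.array_split(np.random.default_rng(seed).permutation(n), n_folds)
        best, best_score = None, np.inf
        for sigma in sigma_grid:
            for lam in lam_grid:
                score = 0.0
                for k in range(n_folds):
                    tr = np.concatenate([folds[j] for j in range(n_folds) if j != k])
                    H_tr, h_tr = build_Hh(Z[tr], Y[tr], sigma)
                    theta = np.linalg.solve(H_tr + lam * np.eye(len(h_tr)), h_tr)
                    H_te, h_te = build_Hh(Z[folds[k]], Y[folds[k]], sigma)
                    score += 0.5 * theta @ H_te @ theta - h_te @ theta   # held-out objective
                if score < best_score:
                    best, best_score = (sigma, lam), score
        return best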

5  Supervised Dimension Reduction via LSQMID

In this section, we propose a linear-supervised dimension-reduction method based on the proposed LSQMID estimator.

5.1  Gradient Ascent via LSQMID

Recall that our goal in linear-supervised dimension reduction is to find the matrix :
formula
5.1
A straightforward approach to find a solution to equation 5.1 using the proposed method is to perform gradient ascent as
formula
5.2
where denotes the step size. It is known that choosing a good step size is a difficult task in practice (Nocedal & Wright, 2006). Line search algorithms choose the step size by finding one that satisfies certain conditions, such as the Armijo rule (Armijo, 1966). However, these conditions often require access to the objective value , which is unavailable in our current setup since the QMI derivative is directly estimated without estimating QMI. Thus, if we want to perform line search, QMI needs to be estimated separately. However, this is problematic since the estimation of the derivative of QMI and the estimation of QMI itself are performed independently, without regard to each other, and thus they may not be consistent. For example, the gradient , which is supposed to be an ascent direction, may be regarded as a descent direction on the surface of the estimated QMI. In such a case, the step size chosen by any line search algorithm is unreliable, and the resulting may not be a good solution.
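
For reference, the straightforward scheme of equation 5.2 with a fixed step size and explicit re-orthonormalization can be sketched as follows, where grad_qmi stands for whichever estimator of the derivative of QMI is plugged in. The fixed step size, the iteration count, and the SVD-based orthonormalization are assumptions of this sketch, and the difficulty discussed above is precisely that a fixed step size is hard to choose well.

    import numpy as np

    def orthonormalize_rows(W):
        """Return a matrix with orthonormal rows spanning the same subspace as W."""
        U, _, Vt = np.linalg.svd(W, full_matrices=False)
        return U @ Vt

    def gradient_ascent(W0, grad_qmi, step=0.1, n_iters=100):
        W = orthonormalize_rows(W0)
        for _ in range(n_iters):
            W = W + step * grad_qmi(W)      # ascent step along the estimated derivative of QMI
            W = orthonormalize_rows(W)      # project back to the set of orthonormal matrices
        return W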

We consider two approaches that can cope with this problem.

5.2  QMI Approximation via LSQMID

To avoid separate QMI estimation, we consider an approximated QMI obtained as a by-product of the proposed method. Recall that the proposed method models the derivative of the density difference as
formula
This means that the density difference can be approximated by
formula
5.3
where is an unknown quantity, which is a constant with regard to .
In a special case where , we can use equation 5.3 to obtain a proper approximator of in a similar fashion to the LSQMI method. To verify this, let us substitute equation 5.3 into one of the in equation 3.5 to obtain
formula
where the last line follows from
formula
By approximating the expectation with sample averages, we obtain a QMI approximator as
formula
5.4
The main advantage of using is that it is obtained from the derivative estimation and thus should be consistent with the estimated derivative. This allows us to perform line search for the gradient ascent in a consistent manner. We may further improve the optimization procedure by considering an optimization problem over the Grassmann manifold:
formula
5.5
where is defined as
formula
That is, is a set of -by- orthonormal matrices whose rows span the same subspace. This manifold optimization is more efficient than the original optimization since every step of the optimization always satisfies the orthonormal constraint, and we no longer need to perform orthonormalization. More details of manifold optimization can be found in Absil, Mahony, and Sepulchre (2008).

Although the QMI approximation in equation 5.4 allows us to choose step size by line search in a consistent manner, such an approximation is unavailable when . Next, we consider an alternative optimization strategy that does not require access to the QMI value.

5.3  Fixed-Point Iteration

To avoid the problem of choosing the step size that requires access to the QMI value, we propose to use a fixed-point iteration for finding a solution to equation 5.1. Note that from the first-order optimality condition, a solution is a stationary point that satisfies
formula
where denotes -by- zero matrix. By using the proposed basis function in equation 4.21, equation 4.19 can be expressed as
formula
5.6
where we define
formula
with the column vector of length consisting of the th dimension over all and the symbol represents the element-wise vector product. Then an approximated solution may be obtained by finding for all such that the left-hand side of equation 5.6 is zero. This optimization strategy results in a fixed-point iteration for each dimension of :
formula
5.7
Finally, we orthonormalize the solution after each iteration as
formula
In practice, we perform this orthonormalization only every several iterations for computational efficiency.
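
The overall search can thus be organized as follows, with fp_update standing for the per-row update of equation 5.7 (its concrete form depends on the fitted derivative models and is not reproduced here) and with orthonormalization applied only every few iterations, as described above.

    import numpy as np

    def fixed_point_search(W0, fp_update, n_iters=200, orth_every=5):
        """Fixed-point subspace search with periodic row orthonormalization (sketch)."""
        W = W0.copy()
        for t in range(n_iters):
            for l in range(W.shape[0]):
                W[l] = fp_update(W, l)          # equation 5.7 applied to the l-th row of W
            if (t + 1) % orth_every == 0:
                U, _, Vt = np.linalg.svd(W, full_matrices=False)
                W = U @ Vt                      # restore orthonormal rows
        return W
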
There is a relation between the fixed-point iteration and gradient method. By substituting equation 5.6 into equation 5.7, we obtain
formula
5.8
This means that the fixed-point update step is a gradient method with an adaptive step size . Thus, if is always positive, the fixed-point iteration will converge to a local maximum. Unfortunately, there is no guarantee that is always positive in our formulation; indeed, in our numerical experiments, sometimes took a negative value. A heuristic remedy would be to update the matrix only when is positive. However, this approach did not work well in our preliminary experiments, so we decided not to adopt it. We will further investigate this issue in future work.

The optimization problem in equation 5.1 is nonconvex and may have saddle points and local solutions. To avoid obtaining a saddle point or a poor local solution, we perform the optimization starting from several initial guesses and choose the solution that gives the maximum estimated QMI as the final solution.

6  Experiments

In this section, we demonstrate the usefulness of the proposed method through experiments.

6.1  Illustrative Experiments

First, we illustrate the behavior of the proposed method in terms of QMI derivative estimation. Let denote the gaussian distribution with mean and covariance . Then, for , we generate a data set and a matrix as follows:
formula
where denotes the zero vector of length 2. Thus, we have . The goal is to estimate
formula
at different values of . Note that is maximized at , that is, .

Figure 1a shows the averaged value over 20 trials of the estimated by LSQMI. The vertical axis indicates the value of the estimated QMI, and the horizontal axis indicates the value of . We use and for estimating QMI and denote the results by LSQMI(3000) and LSQMI(100), respectively. We perform cross-validation at and use the chosen tuning parameters for all values of . The result shows that LSQMI accurately estimates when the sample size is large. However, when the sample size is small, the estimated has high fluctuation.

Figure 1:

The mean and standard error of the estimated and the estimated derivative of with regard to over 20 trials.

Figure 1b shows the averaged value over 20 trials of the derivative of with regard to  computed by LSQMI(3000), LSQMI(100), and the proposed method with , which is denoted by LSQMID(100). For the proposed method, we perform cross-validation at and use the chosen tuning parameters for all values of . The result shows that LSQMID(100) gives a smoother estimate than LSQMI(100), which has high fluctuation. To further explain the cause of the fluctuation of LSQMI(100), we plot experimental results of four trials in Figure 2, where the left column corresponds to the value of the estimated and the right column corresponds to the value of the estimated derivative of with regard to . These results show that for LSQMI(100), a small fluctuation in the estimated QMI can cause a large fluctuation in the estimated derivative of QMI. On the other hand, LSQMID directly estimates the derivative of QMI and thus does not suffer from this problem.

Figure 2:

Examples of the estimated QMI and the estimated derivative of QMI. The left column shows the estimated , and the right column shows the estimated derivative of with regard to . Each row indicates each trial.

Next, we investigate the behavior of the proposed method when the target dimensionality increases. For , we generate data set as follows:
formula
The goal is to estimate at different values of . The estimated derivative is evaluated by the mean squared error defined as
formula
where denotes the derivative of QMI estimated by LSQMI with sample size . The matrices with are generated randomly such that .

Figure 3 shows the mean and standard error over 10 trials of the mean squared error on sample sizes and target dimensionalities . The results show that for and , LSQMID gives much more accurate estimated derivatives than those of LSQMI, especially when the sample sizes are small.

Figure 3:

The mean and standard error of the mean squared error (MSE) of the estimated QMI derivatives over 10 trials on different sample sizes and different dimensionalities.

For , LSQMID performs better only when the sample size is small. When the sample size increases, the improvement of LSQMI is better than that of LSQMID, and LSQMI eventually outperforms LSQMID. The main reason behind this phenomenon is that derivative estimation is very challenging when the target dimensionality is large, and LSQMID would require a much larger number of samples in order to accurately estimate these derivatives.

6.2  Artificial Data Sets

Next, we evaluate the usefulness of the proposed method in linear-supervised dimension reduction using artificial data sets.

6.2.1  Setup

First, let denote the uniform distribution over an interval , denote the gamma distribution with shape parameter and scale parameter , and denote the Laplace distribution with mean and scale parameter . Then we consider data sets with the output dimensionality and the optimal matrix (including their rotations) as follows:

  • Data set A: For , we use
    formula
  • Data set B: For and , we use
    formula
  • Data set C: For and , we use
    formula
  • Data set D: For and , we use
    formula

For data sets A, B, and C, is an additive noise, while for data set D, is a multiplicative noise. Figure 4 shows plots of these data sets (after standardization). Note the presence of outliers in the data sets.

Figure 4:

Artificial data sets.

To estimate from , we execute the following linear-supervised dimension-reduction methods:

  • LSQMID: The proposed method. Linear-supervised dimension reduction is performed by maximizing where the derivative of is estimated by the proposed method. The solution is obtained by fixed-point iteration.

  • LSQMI: Linear-supervised dimension reduction is performed by maximizing where is estimated by LSQMI and the derivative of with regard to  is computed from the QMI estimator. The solution is obtained by gradient ascent with line search over the Grassmann manifold.

  • LSDR (Suzuki & Sugiyama, 2013): Linear-supervised dimension reduction is performed by maximizing . The solution is obtained by gradient ascent with line search over the Grassmann manifold.

  • LSCE (Tangkaratt et al., 2015): Linear-supervised dimension reduction is performed by minimizing a squared loss variant of conditional entropy. The solution is obtained by gradient descent with line search over the Grassmann manifold.

  • dMAVE (Xia, 2007): Linear-supervised dimension reduction is performed by minimizing an error of the local linear smoother of the conditional density . The solution is obtained by solving quadratic programming problems.

  • KDR (Fukumizu et al., 2009): Linear-supervised dimension reduction is performed by minimizing the trace of the conditional covariance operator . The solution is obtained by gradient descent with line search over the Stiefel manifold.

We set the number of basis functions to . For LSQMID, LSQMI, LSDR, and LSCE, we randomly generate 10 orthonormal matrices and use them as the initial solutions. For dMAVE, we use a solution obtained by dOPG (Xia, 2007) as the initial solution. For KDR, we consider two approaches for obtaining the initial solutions. KDR (gKDR) uses a solution obtained by gKDR (Fukumizu & Leng, 2014) as the initial solution. KDR (Random) uses 10 randomly generated orthonormal matrices as the initial solutions and chooses a solution with the minimum objective value as the final solution. We also compare these methods with a randomly generated orthonormal matrix. The obtained solution is evaluated by the dimension-reduction error defined as
formula
where denotes the Frobenius norm of a matrix. This dimension-reduction error is invariant to rotation within the subspace—an error of is the same as that of where is a -by- orthogonal matrix.
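
One rotation-invariant choice consistent with this description compares the projection matrices onto the estimated and the true subspaces; whether this coincides exactly with the formula referenced above is an assumption of this sketch, but it has the stated invariance, since the projection matrix is unchanged when the rows of the matrix are rotated within the subspace.

    import numpy as np

    def dimension_reduction_error(W_hat, W_true):
        """Frobenius distance between subspace projection matrices (rows assumed orthonormal)."""
        P_hat = W_hat.T @ W_hat        # projection onto the estimated subspace
        P_true = W_true.T @ W_true     # projection onto the true subspace
        return np.linalg.norm(P_hat - P_true, ord='fro')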

6.2.2  Results on Different Sample Sizes

We first evaluate the methods on different sample sizes. Table 1 shows the mean and standard error over 50 trials of the dimension-reduction error with different sample sizes where the input dimensionality is fixed to . The randomly generated matrices are uninformative and give large errors. LSQMID works very well for data sets A, C, and D, but it works quite poorly for data set B when compared with other methods. However, LSQMID gives the most informative results for data set C where outliers are in the input domain. These results demonstrate the weakness of existing methods in terms of robustness against outliers.

Table 1:
Mean and Standard Error of the Dimension-Reduction Error over 50 Trials for Artificial Data Sets on Different Sample Sizes with Input Dimensionality .
Data Set  Sample Size  LSQMID  LSQMI  LSDR  LSCE  dMAVE  KDR (gKDR)  KDR (Random)  Random
A 50 0.464(0.080) 0.990(0.066) 0.149(0.013) 0.652(0.083) 0.233(0.033) 0.418(0.071) 0.190(0.044) 1.304(0.016) 
 100 0.111(0.024) 0.473(0.077) 0.070(0.005) 0.160(0.044) 0.127(0.008) 0.124(0.035) 0.075(0.006) 1.304(0.016) 
 150 0.059(0.005) 0.165(0.044) 0.058(0.004) 0.054(0.005) 0.095(0.005) 0.056(0.004) 0.056(0.004) 1.304(0.016) 
 200 0.045(0.004) 0.072(0.027) 0.046(0.004) 0.052(0.009) 0.080(0.005) 0.047(0.004) 0.047(0.004) 1.304(0.016) 
 250 0.040(0.003) 0.070(0.027) 0.041(0.003) 0.041(0.004) 0.070(0.004) 0.045(0.004) 0.045(0.004) 1.304(0.016) 
B 100 0.362(0.037) 1.290(0.057) 0.370(0.032) 0.226(0.022) 0.248(0.016) 0.421(0.042) 0.433(0.042) 1.465(0.028) 
 200 0.221(0.022) 0.700(0.081) 0.196(0.007) 0.116(0.008) 0.155(0.009) 0.168(0.010) 0.168(0.010) 1.465(0.028) 
 300 0.103(0.008) 0.359(0.066) 0.138(0.005) 0.075(0.003) 0.109(0.004) 0.122(0.006) 0.122(0.006) 1.465(0.028) 
 400 0.081(0.005) 0.111(0.009) 0.128(0.004) 0.080(0.006) 0.104(0.005) 0.089(0.005) 0.089(0.005) 1.465(0.028) 
 500 0.081(0.004) 0.130(0.021) 0.114(0.004) 0.069(0.006) 0.075(0.003) 0.068(0.004) 0.068(0.004) 1.465(0.028) 
C 100 1.108(0.069) 1.316(0.057) 1.371(0.024) 1.240(0.039) 1.164(0.036) 1.437(0.023) 1.395(0.023) 1.465(0.028) 
 200 0.819(0.092) 1.086(0.089) 1.336(0.026) 1.205(0.043) 1.015(0.054) 1.325(0.020) 1.358(0.019) 1.465(0.028) 
 300 0.333(0.061) 0.618(0.081) 1.346(0.029) 1.120(0.048) 0.981(0.047) 1.271(0.024) 1.279(0.026) 1.465(0.028) 
 400 0.224(0.054) 0.404(0.080) 1.327(0.028) 1.133(0.044) 0.863(0.056) 1.198(0.033) 1.250(0.023) 1.465(0.028) 
 500 0.267(0.069) 0.461(0.087) 1.347(0.027) 1.084(0.050) 0.756(0.054) 1.215(0.032) 1.217(0.020) 1.465(0.028) 
D 100 0.602(0.070) 1.033(0.070) 0.706(0.055) 0.630(0.059) 0.877(0.056) 0.610(0.046) 0.466(0.036) 1.465(0.028) 
 200 0.401(0.049) 0.569(0.064) 0.408(0.037) 0.338(0.028) 0.630(0.057) 0.371(0.026) 0.338(0.028) 1.465(0.028) 
 300 0.274(0.040) 0.334(0.043) 0.276(0.021) 0.293(0.030) 0.453(0.045) 0.266(0.021) 0.263(0.020) 1.465(0.028) 
 400 0.216(0.035) 0.176(0.016) 0.223(0.013) 0.214(0.018) 0.324(0.043) 0.252(0.013) 0.238(0.013) 1.465(0.028) 
 500 0.137(0.013) 0.151(0.013) 0.191(0.012) 0.218(0.018) 0.258(0.028) 0.205(0.012) 0.195(0.012) 1.465(0.028) 

Note: The best methods in terms of the mean error and comparable methods according to the paired t-test at the significance level 5% are in bold.

LSQMI tends to be unstable and works poorly, except for data set D when the sample size is large. Note that in data set A, LSQMI is judged comparable to the best method (in terms of the mean error) only because its unstable behavior yields large standard errors. The cause of this instability could be the high fluctuation of the derivative of QMI estimated by LSQMI, as shown previously in the illustrative experiment.

The two variants of KDR work quite well on data sets A, B, and D. However, KDR (gKDR) is quite unstable for data set A when the sample size is small, which can be seen by its relatively large standard errors. In contrast, KDR (Random) gives much more stable results. This implies that gKDR might provide a poor initial solution to KDR in some trials, which makes KDR fail to find a good solution. On the other hand, dMAVE works quite poorly overall, which might be because its model selection strategy is not suitable for these data sets.

Table 2 shows the mean and standard error over 50 trials of the computation time on different sample sizes. All methods take longer as the number of samples increases. The results also show that LSQMID is computationally more demanding than other methods except LSQMI and KDR (Random). dMAVE and KDR (gKDR) are computationally very efficient because they do not perform cross validation for parameter tuning and do not restart optimization with many initial solutions. On the other hand, LSQMID, LSQMI, LSDR, and LSCE perform cross validation and restart optimization with 10 initial solutions. Despite these similarities, LSQMID is computationally more demanding than LSDR and LSCE for two reasons. First, LSQMID performs orthonormalization, while LSDR and LSCE use manifold optimization. It is known that manifold optimization tends to be computationally more efficient than orthonormalization (Absil et al., 2008). Second, LSQMID estimates the derivative of QMI with regard to by estimators for derivatives of the density difference, while LSDR and LSCE estimate a single quantity.

Table 2:
Mean and Standard Error of the Computation Time in Seconds over 50 Trials for Artificial Data Sets on Different Sample Sizes with Input Dimensionality .
Data Set  Sample Size  LSQMID  LSQMI  LSDR  LSCE  dMAVE  KDR (gKDR)  KDR (Random)
A 50 9.429(0.057) 34.110(0.591) 6.174(0.120) 7.305(0.099) 0.094(0.002) 0.395(0.016) 4.231(0.079) 
 100 27.814(0.274) 92.052(1.738) 14.150(0.227) 21.928(0.323) 0.360(0.013) 1.292(0.050) 16.573(0.490) 
 150 59.452(1.064) 124.585(2.866) 21.368(0.354) 36.561(0.498) 0.583(0.012) 2.064(0.038) 32.998(0.190) 
 200 93.544(1.900) 181.314(3.732) 30.290(0.473) 53.591(0.733) 0.895(0.024) 3.895(0.123) 56.161(0.644) 
 250 89.871(1.923) 174.112(3.526) 31.211(0.615) 56.277(0.749) 1.197(0.022) 4.915(0.082) 72.194(0.220) 
B 100 49.036(0.429) 154.616(3.053) 12.783(0.246) 18.447(0.200) 0.363(0.012) 1.210(0.026) 13.001(0.150) 
 200 145.692(2.185) 300.697(6.914) 24.349(0.432) 46.142(0.514) 0.852(0.018) 3.624(0.094) 38.819(0.287) 
 300 168.623(3.047) 251.485(5.868) 26.052(0.469) 47.725(0.531) 1.735(0.032) 7.823(0.127) 86.127(0.798) 
 400 183.009(2.868) 231.134(6.435) 28.021(0.430) 46.014(0.503) 2.681(0.055) 13.670(0.209) 144.049(0.332) 
 500 203.555(3.401) 223.437(5.363) 30.299(0.523) 48.906(0.538) 4.130(0.094) 24.843(0.321) 241.030(0.592) 
C 100 49.381(0.287) 155.790(2.051) 13.448(0.198) 16.040(0.167) 0.320(0.002) 1.060(0.008) 11.186(0.061) 
 200 132.830(0.152) 358.054(7.186) 29.940(0.474) 43.118(0.600) 0.812(0.011) 3.667(0.026) 38.001(0.253) 
 300 148.501(0.210) 331.875(6.626) 32.432(0.627) 38.933(0.448) 1.591(0.013) 6.766(0.045) 74.014(0.454) 
 400 169.348(0.375) 343.421(7.810) 36.026(0.672) 40.473(0.483) 2.450(0.016) 13.352(0.070) 140.416(0.343) 
 500 186.787(0.357) 352.525(8.282) 39.465(0.813) 45.053(0.552) 3.806(0.032) 22.837(0.126) 240.166(0.707) 
D 100 48.305(0.372) 153.212(3.692) 15.120(0.342) 18.610(0.231) 0.392(0.015) 1.283(0.048) 18.654(0.457) 
 200 161.487(2.647) 322.414(6.317) 32.599(0.605) 47.629(0.679) 0.920(0.020) 3.856(0.098) 52.280(0.430) 
 300 181.158(2.660) 271.453(6.801) 36.607(0.663) 49.746(0.685) 1.792(0.035) 7.559(0.099) 102.684(0.256) 
 400 202.340(3.286) 259.462(5.090) 40.922(0.732) 50.476(0.671) 2.880(0.078) 13.556(0.229) 157.811(0.571) 
 500 222.527(3.043) 261.036(5.512) 46.453(0.882) 54.916(0.804) 4.456(0.120) 23.930(0.302) 276.041(2.123) 

LSQMI is computationally the most inefficient method, even though it also uses manifold optimization and estimates only a single quantity. The reason could be that the backtracking line search parameters that we used for the toolbox (Boumal et al., 2014) are not suitable for LSQMI, which results in many backtracking steps per iteration. We believe that the computation time of LSQMI can be improved by tuning the backtracking line search parameters more carefully.
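To make the role of these parameters concrete, the following is a minimal Python sketch of a generic Armijo backtracking line search (not the toolbox's implementation); the initial step size, contraction factor, and sufficient-decrease constant are the kind of line-search parameters referred to above, and a poor choice of them inflates the number of backtracking steps per iteration.

```python
import numpy as np

def backtracking_line_search(f, grad_f, x, direction,
                             step0=1.0, contraction=0.5, suff_decr=1e-4,
                             max_steps=25):
    """Armijo backtracking: shrink the step until a sufficient-decrease
    condition holds. step0, contraction, and suff_decr are the tunable
    line-search parameters discussed in the text."""
    fx = f(x)
    slope = np.dot(grad_f(x), direction)  # directional derivative at x
    step = step0
    n_backtracks = 0
    while f(x + step * direction) > fx + suff_decr * step * slope:
        step *= contraction
        n_backtracks += 1
        if n_backtracks >= max_steps:
            break
    return step, n_backtracks

# Toy illustration: minimizing f(x) = ||x||^2 / 2 along the negative gradient.
f = lambda x: 0.5 * np.dot(x, x)
grad = lambda x: x
x = np.array([3.0, -4.0])
step, k = backtracking_line_search(f, grad, x, -grad(x))
print(step, k)  # badly chosen step0/contraction values would inflate k
```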

KDR (Random) also uses manifold optimization and estimates a single quantity. However, it takes longer than LSQMID for data sets B, C, and D when the sample size is 500. The reason is that KDR inverts a single matrix whose size grows with the sample size (see equation 2.8), while LSQMID inverts a fixed number of smaller matrices whose size is determined by the number of basis functions (see equation 4.17). Thus, when the sample size is much larger than the number of basis functions, inverting a single large matrix can take much longer than inverting several small matrices.
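As a rough illustration of this effect, the following Python snippet times one large matrix inversion against several small ones; the sizes n and b and the count d_small are arbitrary stand-ins for the quantities discussed above, not the values used in the experiments.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
n, b, d_small = 500, 100, 5  # hypothetical sizes: large matrix, small matrix, count

# One large n-by-n inversion (the kind of cost incurred by a Gram matrix).
A = rng.standard_normal((n, n))
A = A @ A.T + n * np.eye(n)          # make it well conditioned
t0 = time.perf_counter()
np.linalg.inv(A)
t_large = time.perf_counter() - t0

# Several small b-by-b inversions.
t0 = time.perf_counter()
for _ in range(d_small):
    B = rng.standard_normal((b, b))
    B = B @ B.T + b * np.eye(b)
    np.linalg.inv(B)
t_small = time.perf_counter() - t0

print(f"one {n}x{n} inverse: {t_large:.4f}s, "
      f"{d_small} inverses of size {b}x{b}: {t_small:.4f}s")
```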

6.2.3  Results on Different Input Dimensionalities

Next, we evaluate the methods on different input dimensionalities. Table 3 shows the mean and standard error over 50 trials of the dimension-reduction error with different input dimensionalities, where the sample size is fixed at 200 for data set A and at 400 for data sets B, C, and D. The randomly generated matrices are uninformative and give large errors. All methods perform best at low input dimensionalities. For all data sets, LSQMID works well overall except at the highest input dimensionality. For data set C, only LSQMID and LSQMI give informative results.

Table 3:
Mean and Standard Error of the Dimension-Reduction Error over 50 Trials for Artificial Data Sets on Different Input Dimensionalities with Fixed Sample Sizes.
Data Set  Input Dimensionality  LSQMID  LSQMI  LSDR  LSCE  dMAVE  KDR (gKDR)  KDR (Random)  Random
A 0.045(0.004) 0.046(0.005) 0.047(0.004) 0.045(0.004) 0.076(0.005) 0.047(0.004) 0.047(0.004) 1.182(0.034) 
 0.045(0.004) 0.072(0.027) 0.046(0.004) 0.052(0.009) 0.080(0.005) 0.047(0.004) 0.047(0.004) 1.304(0.016) 
 0.060(0.005) 0.321(0.068) 0.054(0.004) 0.540(0.082) 0.100(0.004) 0.057(0.004) 0.057(0.004) 1.341(0.011) 
 10 0.055(0.004) 0.597(0.086) 0.056(0.004) 0.883(0.083) 0.116(0.004) 0.061(0.003) 0.061(0.003) 1.355(0.009) 
 15 0.151(0.037) 0.885(0.088) 0.061(0.004) 1.253(0.050) 0.136(0.005) 0.456(0.084) 0.069(0.004) 1.376(0.007) 
B 0.039(0.003) 0.049(0.006) 0.062(0.004) 0.038(0.003) 0.053(0.003) 0.058(0.004) 0.058(0.004) 1.062(0.043) 
 0.081(0.005) 0.111(0.009) 0.128(0.004) 0.080(0.006) 0.104(0.005) 0.089(0.005) 0.089(0.005) 1.465(0.028) 
 0.143(0.007) 0.784(0.100) 0.163(0.006) 0.120(0.012) 0.139(0.004) 0.137(0.004) 0.137(0.004) 1.702(0.021) 
 10 0.201(0.025) 1.065(0.103) 0.180(0.004) 0.155(0.013) 0.168(0.005) 0.179(0.007) 0.179(0.007) 1.771(0.017) 
 15 0.368(0.046) 1.682(0.053) 0.227(0.006) 0.172(0.008) 0.207(0.004) 0.227(0.006) 0.226(0.006) 1.853(0.012) 
C 0.212(0.054) 0.219(0.065) 1.187(0.055) 0.695(0.071) 0.577(0.062) 0.946(0.043) 0.963(0.037) 1.062(0.043) 
 0.224(0.054) 0.404(0.080) 1.327(0.028) 1.133(0.044) 0.863(0.056) 1.198(0.033) 1.250(0.023) 1.465(0.028) 
 0.589(0.087) 0.746(0.094) 1.391(0.013) 1.204(0.037) 1.086(0.047) 1.259(0.029) 1.279(0.020) 1.702(0.021) 
 10 0.765(0.088) 0.829(0.089) 1.404(0.009) 1.328(0.022) 1.227(0.031) 1.295(0.023) 1.313(0.021) 1.771(0.017) 
 15 1.308(0.068) 1.095(0.073) 1.426(0.010) 1.399(0.012) 1.360(0.019) 1.362(0.014) 1.315(0.022) 1.853(0.012) 
D 0.067(0.010) 0.069(0.010) 0.131(0.012) 0.095(0.011) 0.317(0.049) 0.129(0.013) 0.124(0.012) 1.062(0.043) 
 0.216(0.035) 0.176(0.016) 0.223(0.013) 0.214(0.018) 0.324(0.043) 0.252(0.013) 0.238(0.013) 1.465(0.028) 
 0.343(0.036) 0.727(0.071) 0.314(0.023) 0.348(0.033) 0.380(0.030) 0.356(0.020) 0.345(0.018) 1.702(0.021) 
 10 0.473(0.049) 0.809(0.063) 0.387(0.020) 0.484(0.034) 0.605(0.061) 0.484(0.036) 0.420(0.022) 1.771(0.017) 
 15 0.689(0.049) 1.400(0.063) 0.616(0.040) 0.757(0.044) 0.936(0.056) 0.632(0.035) 0.558(0.018) 1.853(0.012) 

Notes: The sample size is 200 for data set A and 400 for data sets B, C, and D. The best methods in terms of the mean error and comparable methods according to the paired t-test at the significance level 5% are in bold.

Table 4 shows the mean and standard error over 50 trials of the computation time on different input dimensionalities. All methods take longer as the dimensionality increases. However, dMAVE has the largest relative increase among all methods; it takes approximately three times longer when the input dimensionality increases from 10 to 15.

Table 4:
Mean and Standard Error of the Computation Time in Seconds over 50 Trials for Artificial Data Sets on Different Input Dimensionalities with Fixed Sample Sizes.
Data Set  Input Dimensionality  LSQMID  LSQMI  LSDR  LSCE  dMAVE  KDR (gKDR)  KDR (Random)
A 87.730(1.582) 116.106(2.398) 21.621(0.517) 44.211(0.551) 0.412(0.009) 3.539(0.090) 41.643(0.627) 
 93.544(1.900) 181.314(3.732) 30.290(0.473) 53.591(0.733) 0.895(0.024) 3.895(0.123) 56.161(0.644) 
 95.935(1.649) 308.600(5.879) 36.921(0.690) 47.501(0.711) 2.147(0.034) 3.793(0.093) 54.826(0.605) 
 10 88.833(1.398) 363.697(6.154) 40.374(0.602) 50.793(0.641) 3.335(0.043) 3.695(0.085) 63.552(0.706) 
 15 109.637(1.437) 476.093(3.908) 48.330(0.756) 59.981(1.006) 12.079(0.261) 4.350(0.111) 70.260(0.741) 
B 169.567(2.992) 106.728(1.305) 29.736(0.499) 48.965(0.704) 1.254(0.025) 13.023(0.237) 137.888(1.222) 
 183.009(2.868) 231.134(6.435) 28.021(0.430) 46.014(0.503) 2.681(0.055) 13.670(0.209) 144.049(0.332) 
 205.578(3.333) 438.931(7.086) 36.879(0.659) 52.329(0.698) 6.704(0.159) 16.084(0.308) 154.936(0.421) 
 10 220.762(3.240) 499.952(6.730) 43.026(0.821) 57.738(0.727) 10.746(0.249) 17.482(0.319) 161.003(0.446) 
 15 263.757(3.493) 577.488(3.999) 61.165(1.145) 72.184(0.993) 31.961(0.747) 21.231(0.360) 188.233(0.408) 
C 154.131(0.343) 92.523(1.908) 25.316(0.478) 43.993(0.635) 1.230(0.011) 11.587(0.073) 124.230(0.498) 
 169.348(0.375) 343.421(7.810) 36.026(0.672) 40.473(0.483) 2.450(0.016) 13.352(0.070) 140.416(0.343) 
 191.035(0.345) 500.047(4.296) 50.586(0.708) 47.681(0.606) 6.112(0.044) 15.460(0.081) 158.438(0.479) 
 10 208.191(0.615) 529.924(2.915) 56.769(0.689) 48.858(0.621) 9.749(0.068) 15.328(0.094) 162.992(0.593) 
 15 250.683(0.865) 570.572(1.730) 74.443(0.774) 55.368(0.682) 28.005(0.072) 18.953(0.089) 192.165(0.457) 
D 186.339(2.808) 108.504(2.145) 37.751(0.687) 55.153(0.787) 1.333(0.032) 12.895(0.174) 149.617(0.489) 
 202.340(3.286) 259.462(5.090) 40.922(0.732) 50.476(0.671) 2.880(0.078) 13.556(0.229) 157.811(0.571) 
 226.080(2.683) 471.307(6.142) 55.825(1.108) 59.997(0.878) 6.873(0.140) 16.407(0.178) 176.396(0.405) 
 10 242.822(2.923) 540.447(5.618) 64.376(0.967) 65.743(1.044) 11.472(0.269) 17.919(0.212) 191.839(0.575) 
 15 275.041(3.748) 594.448(3.833) 88.215(1.650) 81.971(1.304) 32.900(0.722) 20.652(0.322) 219.071(0.640) 

Note: The sample size is 200 for data set A and 400 for data sets B, C, and D.

6.3  Benchmark Data Sets

Finally, we evaluate the usefulness of the proposed method on benchmark data sets. In the following experiments, we consider linear-supervised dimension reduction for classification and regression tasks.

6.3.1  Classification

We first evaluate the proposed method on a classification task. We consider the Wine data set from the UCI repository (Bache & Lichman, 2013). The input variables have dimensionality 13, and the output variable indicates one of three classes. We standardize the input so that it has zero mean and unit variance. The data set contains 178 samples. We randomly choose samples for training purposes and use the rest for testing purposes. We execute the linear-supervised dimension-reduction methods and principal component analysis (PCA; Jolliffe, 1986) with different target dimensionalities to obtain solutions. Then we train a support vector machine classifier (SVM; Cortes & Vapnik, 1995).11 The performance of a classifier is evaluated by the misclassification rate for the test samples
\[
\frac{1}{n_{\mathrm{te}}} \sum_{i=1}^{n_{\mathrm{te}}} I\left(\widehat{y}_i \neq y_i\right),
\]
where $n_{\mathrm{te}}$ denotes the number of test samples, $\widehat{y}_i$ denotes the predicted class of the $i$th test sample, and $I(\cdot)$ denotes the indicator function, which equals 1 when its argument is true and 0 otherwise.
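As a concrete illustration of this evaluation protocol, the following Python sketch projects the data with a given solution, trains an SVM on the projected training inputs, and computes the misclassification rate; the array names (W, X_tr, and so on) are hypothetical, and scikit-learn's SVC (which wraps LibSVM) stands in for the classifier used in the letter.

```python
import numpy as np
from sklearn.svm import SVC

def misclassification_rate(y_true, y_pred):
    # Fraction of test samples whose predicted class differs from the true class.
    return np.mean(y_true != y_pred)

def evaluate_projection(W, X_tr, y_tr, X_te, y_te):
    """Train an SVM on inputs projected by W and report the test error.
    W is a (d_x, d_z) matrix returned by a dimension-reduction method."""
    clf = SVC()                      # RBF-kernel SVM (LibSVM backend)
    clf.fit(X_tr @ W, y_tr)          # learn on the projected training inputs
    y_hat = clf.predict(X_te @ W)    # predict on the projected test inputs
    return misclassification_rate(y_te, y_hat)
```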

The misclassification rates in Table 5 show that LSQMID performs very well for this data set except at the smallest target dimensionality, and it attains the lowest misclassification rate of all methods at one of the intermediate target dimensionalities. In contrast, LSQMI performs very poorly and is also highly unstable, as can be seen from its relatively large standard errors. We expect that this is because the sample size is quite small, which makes the performance of LSQMI relatively poor, as demonstrated in our previous experiments.

Table 5:
Mean and Standard Error of the Misclassification Rate over 20 Trials for the Wine Data Set with Different Target Dimensionalities.
LSQMID  LSQMI  LSDR  dMAVE  KDR (Random)  PCA
23.33(10.82) 31.54(4.71) 8.59(3.09) 15.77(5.08) 8.53(2.70) 15.38(3.25) 
3.27(1.21) 24.81(5.94) 4.94(2.25) 3.33(1.46) 3.14(1.64) 3.65(2.05) 
2.95(2.00) 19.87(7.09) 6.03(3.14) 3.33(2.01) 3.53(1.66) 3.53(1.60) 
3.53(3.65) 21.03(10.75) 6.09(3.58) 3.59(2.55) 3.40(2.01) 3.59(2.02) 

Note: The best methods in terms of the mean error and comparable methods according to the paired t-test at the significance level 5% are in bold.

Figure 5 shows the data points after projection by each method. We can see that all methods except LSQMI give good projections, and we can easily distinguish the data points of different classes in the new data space. In contrast, for LSQMI, many data points from one class (in purple) cannot be distinguished from the other two classes in the new data space.

Figure 5:

Data points of the Wine data set after projection by each linear dimension-reduction method. Data points from the same class are indicated by the same color.

6.3.2  Regression

Next, we evaluate the proposed method on regression tasks using data sets from the UCI repository. To make the tasks more challenging, we append noise features of dimensionality 5 to the original input. More specifically, we consider the augmented input, whose dimensionality is larger by 5 than that of the original input, as
formula
where the appended components are random noise features. Then we use the paired data to perform the experiments. We randomly choose samples for training purposes and use the rest for testing purposes. We execute the linear-supervised dimension-reduction methods with different target dimensionalities to obtain solutions. Then we use a kernel ridge regressor and a k-nearest neighbor regressor to evaluate the performance. The performance of a regressor is measured by the root mean squared error (RMSE) for the test samples:
\[
\sqrt{\frac{1}{n_{\mathrm{te}}} \sum_{i=1}^{n_{\mathrm{te}}} \left(\widehat{y}_i - y_i\right)^2}.
\]
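For illustration, here is a minimal Python sketch of this setup: appending noise features to the input and computing the RMSE. Drawing the noise from a standard normal distribution is an assumption made for the sketch; the letter only states that the appended features are noise.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_with_noise(X, n_noise=5):
    # Append n_noise random columns to the input matrix X (samples x features).
    # The standard normal distribution is an assumption made for this sketch.
    E = rng.standard_normal((X.shape[0], n_noise))
    return np.hstack([X, E])

def rmse(y_true, y_pred):
    # Root mean squared error over the test samples.
    return np.sqrt(np.mean((y_true - y_pred) ** 2))
```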

We use a kernel ridge regressor with the gaussian kernel, where the gaussian width and the regularization parameter are chosen by fivefold cross validation. Table 6 shows the RMSE averaged over 30 trials. LSQMID performs well overall for all data sets. LSQMI also performs well for the Fertility and Bike data sets, where it outperforms LSQMID in terms of the mean error. However, LSQMI does not work well for the other data sets. LSCE and dMAVE perform well on only some data sets, and LSDR, KDR (gKDR), and KDR (Random) perform poorly on these data sets.
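A minimal scikit-learn sketch of this regressor setup is given below; the candidate grids for the gaussian width and the regularization parameter are illustrative rather than the ones used in the experiments, and Z_tr and Z_te denote hypothetical projected training and test inputs.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

# The gaussian width sigma enters through gamma = 1 / (2 * sigma^2).
sigmas = np.array([0.5, 1.0, 2.0, 4.0])
param_grid = {
    "gamma": list(1.0 / (2.0 * sigmas ** 2)),   # kernel width candidates
    "alpha": [1e-3, 1e-2, 1e-1, 1.0],           # regularization candidates
}
krr_cv = GridSearchCV(KernelRidge(kernel="rbf"), param_grid,
                      cv=5, scoring="neg_root_mean_squared_error")
# krr_cv.fit(Z_tr, y_tr)          # Z_tr: projected training inputs
# y_hat = krr_cv.predict(Z_te)    # Z_te: projected test inputs
```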

Table 6:
Mean and Standard Error of the Root Mean Squared Error over 30 Trials for Benchmark Data Sets Using a Kernel Ridge Regressor.
Data Set  Sample Size  Input Dimensionality  LSQMID  LSQMI  LSDR  LSCE  dMAVE  KDR (gKDR)  KDR (Random)
Fertility 50 14 1.215(0.049) 1.092(0.043) 1.315(0.043) 1.185(0.050) 1.321(0.063) 1.116(0.050) 1.174(0.047) 
   1.051(0.045) 1.029(0.043) 1.199(0.031) 1.080(0.047) 1.340(0.052) 1.104(0.044) 1.247(0.049) 
   1.052(0.044) 1.038(0.047) 1.104(0.044) 1.091(0.041) 1.288(0.048) 1.121(0.043) 1.231(0.037) 
   1.046(0.042) 1.026(0.042) 1.092(0.039) 1.083(0.044) 1.271(0.033) 1.146(0.044) 1.202(0.035) 
Yacht 100 11 0.120(0.005) 0.546(0.042) 0.180(0.012) 0.718(0.051) 0.213(0.017) 0.124(0.007) 0.124(0.007) 
   0.154(0.011) 0.675(0.047) 0.344(0.023) 0.275(0.013) 0.224(0.014) 0.278(0.033) 0.248(0.012) 
   0.314(0.024) 0.690(0.037) 0.425(0.018) 0.319(0.017) 0.265(0.013) 0.353(0.028) 0.318(0.015) 
   0.413(0.021) 0.732(0.043) 0.494(0.015) 0.355(0.013) 0.352(0.017) 0.399(0.012) 0.400(0.015) 
Concrete 200 13 0.621(0.013) 0.606(0.014) 0.606(0.008) 0.604(0.009) 0.582(0.006) 0.791(0.030) 0.637(0.012) 
   0.568(0.010) 0.591(0.009) 0.568(0.010) 0.567(0.011) 0.529(0.009) 0.614(0.025) 0.541(0.014) 
   0.557(0.009) 0.579(0.011) 0.576(0.012) 0.571(0.010) 0.539(0.007) 0.579(0.016) 0.558(0.012) 
   0.545(0.012) 0.667(0.025) 0.568(0.010) 0.577(0.010) 0.540(0.008) 0.571(0.014) 0.583(0.014) 
Breast-Cancer 200 15 0.447(0.011) 0.523(0.018) 0.442(0.010) 0.453(0.016) 0.375(0.007) 0.447(0.012) 0.465(0.014) 
   0.435(0.010) 0.473(0.012) 0.437(0.009) 0.437(0.011) 0.420(0.012) 0.454(0.014) 0.440(0.011) 
   0.376(0.004) 0.462(0.010) 0.431(0.007) 0.438(0.009) 0.426(0.008) 0.430(0.007) 0.430(0.009) 
   0.377(0.005) 0.419(0.008) 0.436(0.007) 0.425(0.012) 0.426(0.011) 0.433(0.007) 0.435(0.009) 
Bike 300 19 0.043(0.011) 0.070(0.019) 0.016(0.001) 0.015(0.004) 0.139(0.051) 0.513(0.059) 0.194(0.005) 
   0.036(0.005) 0.035(0.003) 0.049(0.002) 0.031(0.005) 0.081(0.007) 0.291(0.050) 0.086(0.006) 
   0.037(0.005) 0.032(0.003) 0.065(0.002) 0.043(0.005) 0.086(0.008) 0.243(0.037) 0.090(0.006) 
   0.060(0.006) 0.051(0.007) 0.077(0.002) 0.045(0.005) 0.071(0.005) 0.213(0.029) 0.074(0.006) 

Note: The best methods in terms of the mean error and comparable methods according to the paired t-test at the significance level 5% are in bold.

Next, we use a k-nearest neighbor regressor, where k is chosen by fivefold cross validation. Table 7 shows the RMSE averaged over 30 trials. The k-nearest neighbor regressor gives smaller RMSEs than the kernel ridge regressor, except for the Fertility data set. This is perhaps because k-nearest neighbor regression tends to work well when the data have low dimensionality. The relative performance of the linear-supervised dimension-reduction methods is quite similar to that with the kernel ridge regressor, with the exception that LSDR and dMAVE also perform well on the Bike data set.
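The corresponding scikit-learn sketch for this regressor is shown below; the candidate values of k are illustrative, and Z_tr and Z_te again denote hypothetical projected inputs.

```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV

# Choose k by fivefold cross validation over an illustrative candidate set.
knn_cv = GridSearchCV(KNeighborsRegressor(),
                      {"n_neighbors": [1, 2, 3, 5, 7, 10, 15]},
                      cv=5, scoring="neg_root_mean_squared_error")
# knn_cv.fit(Z_tr, y_tr)
# y_hat = knn_cv.predict(Z_te)
```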

Table 7:
Mean and Standard Error of the Root Mean Squared Error over 30 Trials for Benchmark Data Sets Using a k-Nearest Neighbor Regressor.
Data Set  Sample Size  Input Dimensionality  LSQMID  LSQMI  LSDR  LSCE  dMAVE  KDR (gKDR)  KDR (Random)
Fertility 50 14 1.875(0.154) 1.467(0.103) 2.330(0.146) 1.451(0.124) 2.162(0.149) 1.367(0.117) 1.440(0.121) 
   1.581(0.107) 1.387(0.100) 1.998(0.107) 1.344(0.102) 2.206(0.120) 1.407(0.130) 1.718(0.140) 
   1.517(0.119) 1.383(0.103) 1.794(0.117) 1.661(0.149) 1.953(0.140) 1.439(0.102) 1.677(0.126) 
   1.546(0.100) 1.236(0.091) 1.842(0.139) 1.696(0.126) 1.759(0.105) 1.575(0.124) 1.655(0.115) 
Yacht 100 11 0.020(0.002) 0.368(0.049) 0.031(0.003) 0.629(0.077) 0.040(0.005) 0.019(0.002) 0.018(0.002) 
   0.026(0.003) 0.510(0.078) 0.194(0.022) 0.147(0.011) 0.101(0.011) 0.201(0.029) 0.191(0.017) 
   0.171(0.021) 0.577(0.048) 0.311(0.022) 0.257(0.020) 0.186(0.014) 0.319(0.026) 0.337(0.024) 
   0.369(0.031) 0.674(0.066) 0.422(0.025) 0.344(0.026) 0.324(0.023) 0.437(0.025) 0.459(0.034) 
Concrete 200 13 0.411(0.019) 0.391(0.017) 0.379(0.009) 0.382(0.010) 0.356(0.007) 0.669(0.048) 0.428(0.016) 
   0.343(0.010) 0.369(0.009) 0.345(0.009) 0.349(0.013) 0.307(0.010) 0.404(0.033) 0.316(0.019) 
   0.356(0.012) 0.373(0.013) 0.375(0.012) 0.381(0.012) 0.347(0.010) 0.401(0.018) 0.388(0.014) 
   0.369(0.012) 0.525(0.034) 0.398(0.013) 0.397(0.014) 0.382(0.012) 0.440(0.014) 0.448(0.015) 
Breast-Cancer 200 15 0.203(0.009) 0.279(0.019) 0.209(0.010) 0.233(0.019) 0.139(0.006) 0.224(0.013) 0.234(0.015) 
   0.199(0.010) 0.236(0.012) 0.198(0.011) 0.221(0.017) 0.190(0.011) 0.215(0.013) 0.208(0.011) 
   0.145(0.005) 0.218(0.011) 0.180(0.008) 0.202(0.012) 0.197(0.010) 0.195(0.010) 0.197(0.011) 
   0.140(0.004) 0.179(0.008) 0.187(0.008) 0.194(0.014) 0.193(0.011) 0.189(0.011) 0.187(0.010) 
Bike 300 19 0.007(0.004) 0.016(0.006) 0.001(0.000) 0.001(0.000) 0.104(0.052) 0.390(0.075) 0.042(0.002) 
   0.006(0.001) 0.005(0.001) 0.006(0.001) 0.007(0.002) 0.006(0.001) 0.167(0.051) 0.018(0.001) 
   0.008(0.002) 0.007(0.002) 0.009(0.001) 0.011(0.001) 0.009(0.001) 0.123(0.035) 0.037(0.001) 
   0.018(0.003) 0.019(0.003) 0.019(0.002) 0.019(0.002) 0.014(0.001) 0.107(0.019) 0.055(0.002) 

Note: The best methods in terms of the mean error and comparable methods according to the paired t-test at the significance level 5% are in bold.

These results show that LSQMID works well as a linear-supervised dimension-reduction method for both the kernel ridge regressor and the k-nearest neighbor regressor.

7  Further Extension: Estimation of Higher-Order Derivatives of Quadratic Mutual Information

We have shown that the (first-order) derivative of QMI can be estimated once we know the (first-order) derivative of the density difference, and we proposed a least-squares estimator to directly estimate the (first-order) derivative of the density difference from data. We further show that a higher-order derivative of QMI can also be estimated in a similar manner.

From the approximation of the derivative of QMI in equation 4.3, a higher-order derivative of QMI can be obtained from data as
formula
7.1
This means that a higher-order derivative of QMI can be obtained once we know the corresponding higher-order derivative of the density difference. A least-squares estimator for this derivative can be obtained as follows. Consider a model of the higher-order derivative of the density difference:
formula
A least-squares estimator minimizes the following squared loss:
formula
7.2
By expanding the square and ignoring the constant term, we obtain
formula
7.3
Under mild assumptions, repeatedly applying integration by parts yields
formula
7.4
Then, the estimator is obtained as a solution of the following minimization problem:
formula
7.5
For a linear-in-parameter model, the parameter that minimizes the regularized empirical version of this minimization problem can be obtained analytically as
formula
7.6
where denotes the regularization parameter and
formula
Finally, an estimator of the higher-order derivative of the density difference is given by
formula
7.7
Substituting this estimator into the derivative of QMI in equation 7.1 yields
formula
7.8
We may also use the corresponding derivative of the gaussian function as the basis function. In such a case, the gaussian width and the regularization parameter can be objectively chosen by the cross-validation procedure in section 4.5, where the score of each candidate pair is
formula

It should be noted that we implicitly assume that the density difference is differentiable at least up to the order of the derivative being estimated. Moreover, we also implicitly assume that the derivatives are smooth and can be accurately approximated by a linear combination of smooth basis functions, such as derivatives of the gaussian function.
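To make the analytic solution in equation 7.6 concrete, here is a minimal Python sketch of the ridge-type solve and of selecting the gaussian width and regularization parameter by a cross-validation score; G_hat, h_hat, and cv_score are hypothetical placeholders for the data-dependent quantities defined above and the score of section 4.5, which are not constructed here.

```python
import numpy as np

def solve_regularized_ls(G_hat, h_hat, lam):
    """Ridge-type analytic solution of the form
    theta = (G_hat + lam * I)^{-1} h_hat (cf. equation 7.6).
    G_hat (b x b) and h_hat (b,) are assumed to be precomputed from data."""
    b = G_hat.shape[0]
    return np.linalg.solve(G_hat + lam * np.eye(b), h_hat)

def select_hyperparameters(candidates, cv_score):
    """Pick the (sigma, lam) pair with the smallest cross-validation score.
    cv_score is assumed to implement the criterion of section 4.5."""
    return min(candidates, key=cv_score)
```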

8  Conclusion

We proposed a novel linear-supervised dimension-reduction method based on efficient maximization of quadratic mutual information (QMI). Our key idea was to directly estimate the derivative of QMI without estimating QMI itself. We first developed a method to directly estimate the derivative of QMI and then developed a fixed-point iteration that efficiently uses the derivative estimator to find a maximizer of QMI. In addition to the robustness against outliers inherited from QMI, the proposed method is widely applicable because it does not require any assumption on the data distribution, and its tuning parameters can be objectively chosen by cross validation. The experimental results on artificial and benchmark data sets showed that the proposed method is promising.

The proposed method is relatively computationally expensive. The main reason for this inefficiency is that we restart the optimization from several initial guesses in order to avoid poor local solutions. Our future work includes developing a computationally more efficient approach to obtaining a good solution; exploring geodesic convexity (Udriste, 1994) would also be an interesting direction.

From another point of view, when the derivative of the gaussian function is used as the basis function, the estimator in LSQMID can be regarded as approximating the derivative of the density difference in the gaussian reproducing kernel Hilbert space (RKHS), because the derivative of the gaussian kernel belongs to the gaussian RKHS (Zhou, 2008). Kernelizing the estimator in LSQMID would be an interesting topic: the representer theorem for derivatives of kernels (Zhou, 2008) would reveal the optimal form of the estimator, and its properties could be understood theoretically, as was done for direct density-difference estimation (Sugiyama et al., 2013). We will pursue this research topic in the future.

The illustrative experiments showed that LSQMID is not suitable for estimating high-dimensional derivatives. The main reason might be that the derivative estimators are learned independently of one another. However, the derivatives are in fact derived from the same QMI, so it is likely that some information can be shared among the derivative estimators. This information-sharing aspect of derivative estimation was previously investigated in the multitask learning approach to density-derivative estimation (Yamane, Sasaki, & Sugiyama, 2016). A similar idea may also improve the performance of LSQMID in high-dimensional problems.

The experimental results on the artificial data sets showed that the performance of LSQMI, which aims at maximizing an estimated QMI, decreases significantly as the input dimensionality increases. Our proposed method significantly improves the QMI-based supervised dimension-reduction approach by directly estimating the derivative of QMI. The performance of LSDR, which aims at maximizing an estimated SMI, is not affected much by increasing input dimensionality. Hence, by analogy, an SMI-based supervised dimension-reduction approach that directly estimates the derivative of SMI may work even better than LSDR. Developing a method to directly estimate the derivative of SMI will be our future work.

Notes

1 

For simplicity, we assume that is standardized so that and .

2 

Throughout this section, we use instead of when we consider its derivative for notational convenience. However, they still represent the QMI between random variables and .

3 

We may also consider higher-order methods such as Newton's method (Nocedal & Wright, 2006) and directly estimate higher-order derivatives from data, as explained in section 7. However, estimating higher-order derivatives is computationally very expensive. Therefore, we consider only first-order methods in this letter.

4 

Our code is publicly available at http://www.ms.k.u-tokyo.ac.jp/software.html#LSQMID

5 

We use the manifold optimization toolbox (Boumal, Mishra, Absil, & Sepulchre, 2014) to perform the optimization.

7 

We use the program code: http://www.stat.nus.edu.sg/~staxyc/.

9 

The computation time is measured using Matlab on a 2.10 GHz 8 core processor with 128 GB memory.

10 

gKDR performs cross validation based on the regression error to choose its tuning parameter (Fukumizu & Leng, 2014). However, gKDR is not an iterative method, and the computation time of KDR (gKDR) is mostly dominated by KDR.

11 

We use the LibSVM implementation by Chang and Lin (2011).

Acknowledgments

V.T. was supported by KAKENHI 16J08434, H.S. was supported by KAKENHI 15H06103, and M.S. was supported by KAKENHI 25700022.

References

Absil, P.-A., Mahony, R., &