Abstract

Regression aims at estimating the conditional mean of output given input. However, regression is not informative enough if the conditional density is multimodal, heteroskedastic, and asymmetric. In such a case, estimating the conditional density itself is preferable, but conditional density estimation (CDE) is challenging in high-dimensional space. A naive approach to coping with high dimensionality is to first perform dimensionality reduction (DR) and then execute CDE. However, a two-step process does not perform well in practice because the error incurred in the first DR step can be magnified in the second CDE step. In this letter, we propose a novel single-shot procedure that performs CDE and DR simultaneously in an integrated way. Our key idea is to formulate DR as the problem of minimizing a squared-loss variant of conditional entropy, and this is solved using CDE. Thus, an additional CDE step is not needed after DR. We demonstrate the usefulness of the proposed method through extensive experiments on various data sets, including humanoid robot transition and computer art.

1  Introduction

Analyzing an input-output relationship from samples is one of the central challenges in machine learning. The most common approach is regression, which estimates the conditional mean of output y given input x. However, analyzing only the conditional mean is not informative enough when the conditional density possesses multimodality, asymmetry, or heteroskedasticity (i.e., input-dependent variance) as a function of output y. In such cases, it is more appropriate to estimate the conditional density itself (see Figure 2).

The most naive approach to conditional density estimation (CDE) would be ε-neighbor kernel density estimation (ε-KDE), which performs standard KDE using only the samples whose inputs are close to the test input. However, ε-KDE does not work well in high-dimensional problems because the number of nearby samples is too small. To avoid this small-sample problem, KDE may be applied twice to estimate p(x, y) and p(x) separately, and the estimated densities may be plugged into the decomposed form p(y|x) = p(x, y)/p(x) to estimate the conditional density. However, taking the ratio of two estimated densities significantly magnifies the estimation error and thus is not reliable. To overcome this problem, an approach that directly estimates the density ratio p(x, y)/p(x) without separate estimation of the densities p(x, y) and p(x) has been explored (Sugiyama et al., 2010). This method, called least-squares CDE (LSCDE), was proved to achieve the optimal nonparametric learning rate in the minimax sense, and its solution can be computed efficiently and analytically. Nevertheless, estimating conditional densities in high-dimensional problems is still challenging.

A natural idea to cope with the high dimensionality is to perform dimensionality reduction (DR) before CDE. Sufficient DR (Li, 1991; Cook & Ni, 2005) is a framework of supervised DR aimed at finding the subspace of input x that contains all information on output y, and a method based on conditional-covariance operators in reproducing kernel Hilbert spaces has been proposed (Fukumizu, Bach, & Jordan, 2009). Although this method possesses superior theoretical properties, it is not easy to use in practice because no systematic model selection method is available for its kernel parameters. To overcome this problem, an alternative sufficient DR method based on squared-loss mutual information (SMI) has been proposed recently (Suzuki & Sugiyama, 2013). This method involves a nonparametric estimator of SMI that is theoretically guaranteed to achieve the optimal estimation rate, and all tuning parameters can be systematically chosen in practice by cross-validation with respect to the SMI approximation error.

Given such state-of-the-art DR methods, performing DR before LSCDE would be a promising approach to improving the accuracy of CDE in high-dimensional problems. However, such a two-step approach is not preferable because DR in the first step is performed without regard to CDE in the second step, and thus small errors incurred in the DR step can be significantly magnified in the CDE step.

In this letter, we propose a single-shot method that integrates DR and CDE. Our key idea is to formulate the sufficient DR problem in terms of the squared-loss conditional entropy (SCE), which includes the conditional density in its definition, so that LSCDE is executed in the course of DR. Therefore, when DR is completed, the final conditional density estimator has already been obtained without an additional CDE step (see Figure 1). We demonstrate the usefulness of the proposed method, named least-squares conditional entropy (LSCE), through experiments on benchmark data sets, humanoid robot control simulations, and computer art.

Figure 1:

Conditional density estimation (CDE) and dimensionality reduction (DR). (a) CDE without DR performs poorly in high-dimensional problems. (b) CDE after DR can magnify the small DR error in the CDE step. (c) CDE with DR (proposed) performs CDE in the DR process in an integrated manner.


2  Conditional Density Estimation with Dimensionality Reduction

In this section, we describe our proposed method for conditional density estimation with dimensionality reduction.

2.1  Problem Formulation

Let the input and output domains be subsets of R^{d_x} and R^{d_y} with dimensionality d_x and d_y, respectively, and let p(x, y) be a joint probability density on the product of the two domains. Assume that we are given n independent and identically distributed (i.i.d.) training samples from the joint density:
{(x_i, y_i)}_{i=1}^n drawn i.i.d. from p(x, y).
The goal is to estimate the conditional density p(y|x) from the samples.
Our implicit assumption is that the input dimensionality d_x is large but the intrinsic dimensionality of the input, denoted by d_z, is rather small. More specifically, let W and W⊥ be d_z × d_x and (d_x − d_z) × d_x matrices such that the d_x × d_x matrix [W^T, W⊥^T] is an orthogonal matrix. Then we assume that x can be decomposed into the component z = Wx and its perpendicular component z⊥ = W⊥x so that y and z⊥ are conditionally independent given z:
p(y | z, z⊥) = p(y | z).
2.1
This means that z is the relevant part of x and the remaining component z⊥ does not contain any information on y. The problem of finding W is called sufficient dimensionality reduction (Li, 1991; Cook & Ni, 2005).
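To make the decomposition concrete, the following minimal numpy sketch (our illustration, not code from the letter; the matrix and variable names are ours) builds an orthonormal W together with its orthogonal complement W⊥ and projects inputs onto z = Wx and z⊥ = W⊥x.

```python
import numpy as np

def random_subspace_basis(d_x, d_z, seed=None):
    """Return orthonormal W (d_z x d_x) and its complement W_perp ((d_x - d_z) x d_x)."""
    rng = np.random.default_rng(seed)
    # QR decomposition of a random matrix gives an orthogonal d_x x d_x frame.
    Q, _ = np.linalg.qr(rng.standard_normal((d_x, d_x)))
    return Q[:, :d_z].T, Q[:, d_z:].T

d_x, d_z, n = 5, 1, 400
W, W_perp = random_subspace_basis(d_x, d_z, seed=0)
X = np.random.default_rng(1).standard_normal((n, d_x))  # rows are inputs x_i
Z = X @ W.T             # relevant component z = W x
Z_perp = X @ W_perp.T   # perpendicular component z_perp = W_perp x

# [W^T, W_perp^T]^T is orthogonal: its rows form an orthonormal basis of R^{d_x}.
frame = np.vstack([W, W_perp])
assert np.allclose(frame @ frame.T, np.eye(d_x))
```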

2.2  Sufficient Dimensionality Reduction with SCE

Let us consider a squared-loss variant of conditional entropy, squared-loss CE (SCE):
formula
2.2
By expanding the squared term in equation 2.2, we obtain
formula
2.3
where is defined as
formula
2.4
Then we have the following theorem (its proof is given in appendix A), which forms the basis of our proposed method:
Theorem 1.
SCE(Y|Z) ≥ SCE(Y|X).
This theorem shows that SCE(Y|Z) ≥ SCE(Y|X), and the equality holds if and only if
p(y|x) = p(y|z).
This is equivalent to the conditional independence in equation 2.1, and therefore sufficient dimensionality reduction can be performed by minimizing SCE(Y|Z) with respect to W:
W* = argmin_{W ∈ G(d_z, d_x)} SCE(Y|Z), where z = Wx.
2.5
Here, G(d_z, d_x) denotes the Grassmann manifold, which is a set of orthogonal matrices without overlaps,
G(d_z, d_x) := { W ∈ R^{d_z × d_x} | W W^T = I_{d_z} } / ∼,
where I_{d_z} denotes the d_z × d_z identity matrix and ∼ represents the equivalence relation: W and W′ are written as W ∼ W′ if their rows span the same subspace.
By its definition, SCE(Y|Z) is equivalent to the negative Pearson divergence (Pearson, 1900) from p(z, y) to p(z), which is a member of the f-divergence class (Ali & Silvey, 1966; Csiszár, 1967) with the squared-loss function. Ordinary conditional entropy (CE), defined by
CE(Y|Z) := −∬ p(z, y) log p(y|z) dz dy,
is the negative Kullback-Leibler divergence (Kullback & Leibler, 1951) from p(z, y) to p(z). Since the Kullback-Leibler divergence is also a member of the f-divergence class (with the log-loss function), CE and SCE share similar properties. Indeed, theorem 1 also holds for ordinary CE. However, the Pearson divergence is known to be more robust against outliers (Basu, Harris, Hjort, & Jones, 1998; Sugiyama, Suzuki, & Kanamori, 2012) because it does not involve the log function, which is very sharp near zero. Furthermore, as we show below, SCE can be approximated analytically, and thus its derivative can also be computed easily. This is a critical property for developing a dimensionality-reduction method because we want to minimize SCE with respect to W, for which the gradient is highly useful in devising an optimization algorithm. For this reason, we adopt SCE instead of CE below.

2.3  SCE Approximation

Since SCE(Y|Z) in equation 2.5 is unknown in practice, we approximate it using the samples {(x_i, y_i)}_{i=1}^n.

The trivial inequality yields , and thus we have
formula
2.6
If we set , we have
formula
If we multiply both sides of the above inequality by p(z) and integrate over z and y, we have
formula
2.7
where the minimization with respect to b is now performed over functions of z and y. (For more general discussions on divergence bounding, see Keziou, 2003, and Nguyen, Wainwright, & Jordan, 2010.)
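To see the mechanism behind this bound in concrete form, here is our paraphrase of the standard Legendre-type argument used by Keziou (2003) and Nguyen et al. (2010); the constants in equations 2.6 and 2.7 may be arranged differently in the original derivation. The pointwise inequality (b(z, y) − p(y|z))^2 ≥ 0 is equivalent to (1/2) b(z, y)^2 − b(z, y) p(y|z) ≥ −(1/2) p(y|z)^2. Multiplying by p(z) and integrating over z and y gives

∬ [ (1/2) b(z, y)^2 p(z) − b(z, y) p(z, y) ] dz dy ≥ −(1/2) ∬ p(y|z)^2 p(z) dz dy,

with equality if and only if b(z, y) = p(y|z). The left-hand side involves only expectations over p(z) and p(z, y), which can be approximated by sample averages, while the right-hand side captures the W-dependent part of SCE(Y|Z); this is why minimizing the left-hand side over b both tightens the bound and, as discussed in section 2.6, recovers the conditional density itself.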
Let us consider a linear-in-parameter model for b:
formula
where θ is a parameter vector and φ(z, y) is a vector of basis functions. If the expectations over the densities p(z) and p(z, y) are approximated by sample averages and an ℓ2-regularizer with regularization parameter λ is included, the above minimization problem yields
formula
where
formula
2.8
The solution is analytically given by
formula
which yields the estimate b̂(z, y) = θ̂^T φ(z, y). Then, from equation 2.7, we obtain an approximator of SCE(Y|Z) analytically as
formula
We call this method least-squares conditional entropy (LSCE).
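To make the computation concrete, the sketch below solves the regularized least-squares problem for the parameter vector and evaluates the resulting empirical upper bound. It is a minimal illustration under our own assumptions: we take the design quantities in the standard LSCDE form, G ≈ (1/n) Σ_i ∫ φ(z_i, y) φ(z_i, y)^T dy and h ≈ (1/n) Σ_i φ(z_i, y_i), which may differ in detail from equation 2.8, and the additive constant of the bound in equation 2.7 is omitted.

```python
import numpy as np

def lsce_fit(G, h, lam):
    """Ridge-regularized least-squares fit and empirical SCE upper bound.

    G   : (b, b) array; assumed to approximate (1/n) sum_i int phi(z_i, y) phi(z_i, y)^T dy
    h   : (b,) array; assumed to approximate (1/n) sum_i phi(z_i, y_i)
    lam : regularization parameter lambda > 0
    """
    theta = np.linalg.solve(G + lam * np.eye(len(h)), h)
    # Value of the empirical upper bound (additive constant omitted); smaller
    # values indicate a smaller estimated SCE, i.e., a better subspace W.
    bound = 0.5 * theta @ G @ theta - h @ theta
    return theta, bound
```

In the full method, G and h are computed with the gaussian design of section 2.7, and the returned bound is the quantity minimized with respect to W in section 2.5.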

2.4  Model Selection by Cross-Validation

The approximator of SCE depends on the choice of the model, that is, the basis function φ and the regularization parameter λ. Such a model can be objectively selected by cross-validation as follows:

  1. The training data set D = {(x_i, y_i)}_{i=1}^n is divided into K disjoint subsets D_1, ..., D_K of (approximately) the same size.

  2. For each model M in the candidate set,

    1. For k = 1, ..., K,

      • For model M, the LSCE solution is computed from D \ D_k (i.e., all samples except D_k).

      • Evaluate the upper bound of SCE obtained with the computed solution, using the hold-out data D_k:
        formula
        where |D_k| denotes the cardinality of D_k.

    2. The average score is computed as
      formula

  3. The model that minimizes the average score is chosen:
    formula
  4. For the chosen model, the LSCE solution is computed from all samples {(x_i, y_i)}_{i=1}^n, and the approximator of SCE is obtained.

In the experiments, we use K = 5.
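A minimal sketch of this selection loop is given below (our illustration; `fit_lsce` and `holdout_score` stand for the LSCE fit of section 2.3 and the hold-out bound evaluation in step 2a, and are assumed interfaces, not functions defined in the letter).

```python
import numpy as np

def cross_validate(X, Y, W, models, fit_lsce, holdout_score, K=5, seed=0):
    """Pick the model (e.g., a (sigma, lambda) pair) minimizing the average hold-out score.

    fit_lsce(M, X_tr, Y_tr, W)            -> fitted parameter vector theta
    holdout_score(M, theta, X_te, Y_te, W) -> hold-out estimate of the SCE bound
    """
    n = len(X)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), K)
    best_model, best_score = None, np.inf
    for M in models:
        scores = []
        for k in range(K):
            te = folds[k]
            tr = np.setdiff1d(np.arange(n), te)
            theta = fit_lsce(M, X[tr], Y[tr], W)
            scores.append(holdout_score(M, theta, X[te], Y[te], W))
        score = float(np.mean(scores))
        if score < best_score:
            best_model, best_score = M, score
    return best_model, best_score
```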

2.5  Dimensionality Reduction with SCE

Now we solve the following optimization problem by gradient descent:
formula
2.9
As shown in appendix B, the gradient of the SCE approximator with respect to W is given by
formula
where .

In Euclidean space, the above gradient gives the steepest direction; on a manifold, however, the steepest direction is given by the natural gradient (Amari, 1998).

The natural gradient at W is the projection of the ordinary gradient onto the tangent space of the Grassmann manifold at W. If the tangent space is equipped with the canonical metric, the natural gradient is given as follows (Edelman, Arias, & Smith, 1998):
formula
where W⊥ is a (d_x − d_z) × d_x matrix such that [W^T, W⊥^T] is an orthogonal matrix.
Then the geodesic from W in the direction of the natural gradient over the Grassmann manifold can be parameterized by a scalar t as
formula
where "exp" for a matrix denotes the matrix exponential and O denotes the zero matrix. Note that the derivative of the geodesic at t = 0 coincides with the natural gradient (see Edelman et al., 1998, for details). Thus, line search along the geodesic in the natural-gradient direction is equivalent to finding the minimizer of the SCE approximator with respect to the step size t ≥ 0.

Once W is updated, SCE is reestimated with the new W, and gradient descent is performed again. This entire procedure is repeated until W converges. Because performing cross-validation at every reestimation step is computationally expensive, in our implementation we perform cross-validation only once every five gradient updates. Furthermore, to find a better local optimum, the gradient descent procedure is executed 20 times with randomly chosen initial solutions, and the solution achieving the smallest value of the SCE approximator is chosen.
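The following sketch illustrates a single geodesic line-search step of the kind described above. It is our own generic parameterization in the spirit of Edelman et al. (1998), not the letter's exact formula: the ordinary gradient is projected onto the directions orthogonal to the current subspace (the natural gradient), a skew-symmetric generator is formed in the coordinates of the full frame [W; W⊥], and W is moved along the rotation exp(−tB); `objective` stands for the SCE approximator evaluated at a candidate W and is assumed to be supplied by the caller.

```python
import numpy as np
from scipy.linalg import expm, null_space

def geodesic_step(W, grad_W, objective, step_sizes=(1.0, 0.5, 0.25, 0.1, 0.01)):
    """One natural-gradient geodesic line search on the Grassmann manifold.

    W       : (d_z, d_x) matrix with orthonormal rows
    grad_W  : (d_z, d_x) ordinary gradient of the objective at W
    objective(W) -> scalar to be minimized (e.g., the SCE approximator)
    """
    W_perp = null_space(W).T                    # ((d_x - d_z), d_x), orthonormal rows
    A = grad_W @ W_perp.T                       # gradient component off the current subspace
    d_z, d_c = A.shape
    # Skew-symmetric generator in the coordinates of the full frame [W; W_perp].
    B = np.block([[np.zeros((d_z, d_z)), A],
                  [-A.T, np.zeros((d_c, d_c))]])
    frame = np.vstack([W, W_perp])
    best_W, best_val = W, objective(W)
    for t in step_sizes:                        # crude line search along the geodesic
        W_new = (expm(-t * B) @ frame)[:d_z]    # rows stay orthonormal for every t
        val = objective(W_new)
        if val < best_val:
            best_W, best_val = W_new, val
    return best_W, best_val
```

Because B is skew-symmetric, exp(−tB) is a rotation, so every candidate W stays on the manifold, and the derivative of the curve at t = 0 is the negative projected gradient, matching the natural-gradient descent direction described above.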

2.6  Conditional Density Estimation with SCE

Since the bound in equation 2.6 is attained at b(z, y) = p(y|z) in the current derivation, the optimal b is actually the conditional density itself. Therefore, the function b̂ obtained by LSCE is a conditional density estimator. This implies that the upper-bound minimization procedure described in section 2.3 is equivalent to least-squares conditional density estimation (LSCDE) (Sugiyama et al., 2010), which minimizes the squared error:
formula
Then, in the same way as the original LSCDE, we may postprocess the solution to make the conditional density estimator nonnegative and normalized as
formula
2.10
where the normalization term in the denominator can be computed analytically, as shown in section 2.7. Note that even if the solution is postprocessed as in equation 2.10, the optimal estimation rate of the LSCDE solution is still maintained (Sugiyama et al., 2010).

2.7  Basis Function Design

In practice, we use the following gaussian function as the kth basis:
φ_k(z, y) := exp( −( ||z − z_k||^2 + ||y − y_k||^2 ) / (2σ^2) ),
2.11
where (z_k, y_k) denotes the kth gaussian center, located at one of the (projected) training sample points (W x_k, y_k). When the sample size n is too large, we may use only a subset of the samples as gaussian centers. σ denotes the gaussian bandwidth, which is chosen by cross-validation as explained in section 2.4. We may use different bandwidths for z and y, but this would increase the computation time for model selection. In our implementation, we normalize each element of x and y to have unit variance in advance and then use a common bandwidth for z and y.
A notable advantage of using the gaussian function is that the integral over y appearing in equation 2.8 can be computed analytically as
formula
Similarly, the normalization term in equation 2.10 can also be computed analytically as
formula
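As a sketch of how these closed forms can be used (our own notation, assuming the gaussian basis of equation 2.11 with a common bandwidth σ and centers at projected sample points): the integral of a product of two basis functions over y equals (πσ^2)^{d_y/2} exp(−||y_k − y_l||^2 / (4σ^2)) times the z-dependent factors, and the integral of a single basis function over y equals (2πσ^2)^{d_y/2} times its z factor. The code below uses these identities to build the design quantities, solve for the parameters, and evaluate the postprocessed estimator of equation 2.10 under the clip-then-normalize convention of the original LSCDE, which is our reading of that step.

```python
import numpy as np

def gauss(D2, sigma):
    return np.exp(-D2 / (2.0 * sigma ** 2))

def sqdist(A, B):
    """Pairwise squared Euclidean distances between rows of A and rows of B."""
    return ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)

def fit_lscde(Z, Y, Zc, Yc, sigma, lam):
    """Solve theta = (G + lam I)^{-1} h with the analytic integral over y.

    Z, Y   : projected training data (n, d_z) and outputs (n, d_y)
    Zc, Yc : gaussian centers (b, d_z), (b, d_y)
    """
    n, d_y = Y.shape
    Phi_z = gauss(sqdist(Z, Zc), sigma)                                 # (n, b)
    h = (Phi_z * gauss(sqdist(Y, Yc), sigma)).mean(axis=0)              # (b,)
    # Closed form of int phi_k(z, y) phi_l(z, y) dy for gaussian bases.
    cross_y = (np.pi * sigma ** 2) ** (d_y / 2) * gauss(0.5 * sqdist(Yc, Yc), sigma)
    G = (Phi_z.T @ Phi_z / n) * cross_y                                 # (b, b)
    theta = np.linalg.solve(G + lam * np.eye(len(h)), h)
    return theta

def predict_cond_density(x_new, y_grid, W, Zc, Yc, theta, sigma):
    """Postprocessed estimator of eq. 2.10: clip negative weights, then normalize.

    y_grid : (m, d_y) array of output points at which to evaluate the density.
    """
    d_y = Yc.shape[1]
    z_new = np.atleast_2d(x_new) @ W.T
    t_pos = np.maximum(theta, 0.0)
    phi_z = gauss(sqdist(z_new, Zc), sigma).ravel()                     # (b,)
    unnorm = (gauss(sqdist(y_grid, Yc), sigma) * (t_pos * phi_z)).sum(axis=1)
    # Closed form of int phi_k(z, y) dy gives the normalization constant.
    norm = ((2 * np.pi * sigma ** 2) ** (d_y / 2) * t_pos * phi_z).sum()
    return unnorm / max(norm, 1e-12)
```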

2.8  Discussion

We have proposed minimizing SCE for dimensionality reduction:
formula
In previous work (Suzuki & Sugiyama, 2013), squared-loss mutual information (SMI) was maximized for dimensionality reduction:
formula
This shows that the essential difference is whether the marginal density p(y) is included in the denominator of the density ratio. Thus, if p(y) is uniform, the proposed dimensionality-reduction method using SCE reduces to the existing method using SMI. However, if p(y) is not uniform, the density-ratio function p(z, y)/(p(z)p(y)) included in SMI may fluctuate more than the function p(z, y)/p(z) = p(y|z) included in SCE. Since a smoother function can generally be estimated more accurately from a small number of samples, the proposed method using SCE is expected to work better than the existing method using SMI. We will experimentally demonstrate this effect in section 3.

Sufficient dimension reduction based on the conditional density has also been studied in the statistics literature. The density-minimum average variance estimation (dMAVE) method (Xia, 2007) finds a dimension-reduction subspace using local linear regression for the conditional density in a semiparametric manner. A similar approach has also been taken in the sliced regression for dimension reduction method (Wang & Xia, 2008), where the cumulative conditional density is used instead of the conditional density. A Bayesian approach to sufficient dimension reduction called the Bayesian dimension reduction (BDR) method (Reich, Bondell, & Li, 2011) has been proposed recently. This method models the conditional density as a gaussian mixture model and obtains a dimension-reduction subspace through sampling from the learned prior distribution of low-dimensional input. These methods have been shown to work well for dimension reduction in real-world data sets, although they are applicable only to univariate output data (d_y = 1).

In regression, learning with the squared loss is not robust against outliers (Huber, 1981). However, density estimation (Basu et al., 1998) and density ratio estimation (Sugiyama et al., 2012) under the Pearson divergence are known to be robust against outliers. Thus, in the same sense, the proposed LSCE estimator would also be robust against outliers. We experimentally investigate the robustness in section 3.

3  Experiments

In this section, we experimentally investigate the practical usefulness of the proposed method. We consider the following dimensionality-reduction schemes:

  • None:

    No dimensionality reduction is performed.

  • dMAVE:

    The density-minimum average variance estimation method, where dimension reduction is performed through local linear regression for the conditional density (Xia, 2007).

  • BDR:

    The Bayesian dimension-reduction method, where the conditional density is modeled by a gaussian mixture model and dimension reduction is performed by sampling from the prior distribution of low-dimensional input (Reich et al., 2011).

  • LSMI:

    Dimension reduction is performed by maximizing an SMI approximator called least-squares MI (LSMI) using natural gradients over the Grassmann manifold (Suzuki & Sugiyama, 2013).

  • LSCE (proposed):

    Dimension reduction is performed by minimizing the proposed LSCE using natural gradients over the Grassmann manifold.

  • True (reference):

    The “true” subspace is used (only for artificial data).

After dimension reduction, we execute the following conditional density estimators:

  • ε-KDE:

    ε-neighbor kernel density estimation, where ε is chosen by least-squares cross-validation.

  • LSCDE:

    Least-squares conditional density estimation (Sugiyama et al., 2010).

Note that the proposed method, the combination of LSCE and LSCDE, does not require a separate LSCDE step after DR because LSCDE is executed inside LSCE. Since the dMAVE and BDR methods are applicable only to univariate output, they are not included in the experiments with multivariate output data.

3.1  Illustration

First, we illustrate the behavior of plain LSCDE (None/LSCDE) and the proposed method (LSCE/LSCDE). The data sets illustrated in Figure 2 have a five-dimensional input and a univariate output. The first dimension of the input x and the output y of the samples are plotted in the graphs; the other four dimensions of x are just standard normal noise. The results show that plain LSCDE does not perform well due to the irrelevant noise dimensions of x, whereas the proposed method gives much better estimates.

Figure 2:

Examples of conditional density estimation by plain LSCDE (None/LSCDE) and the proposed method (LSCE/LSCDE).


3.2  Artificial Data Sets

Next, we compare the proposed method with the existing dimensionality-reduction methods on conditional density estimation by LSCDE in artificial data sets.

Here, N(μ, Σ) denotes the normal distribution with mean μ and covariance matrix Σ. We consider three artificial data sets, a, b, and c.

The first row of Figure 3 shows the dimensionality-reduction error between the true W and its estimate for different sample sizes n, measured by
formula
where ||·||_F denotes the Frobenius norm. All methods perform similarly for data set a, and the dMAVE and BDR methods outperform LSCE and LSMI at some sample sizes.
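Since the displayed error measure is not reproduced above, the sketch below uses one common choice for comparing an estimated subspace with the true one: the Frobenius norm of the difference between the corresponding projection matrices. The exact normalization in the letter may differ; this is only an assumed form.

```python
import numpy as np

def dr_error(W_true, W_hat):
    """Frobenius-norm distance between the projection matrices of two subspaces.

    Both inputs are (d_z, d_x) matrices with orthonormal rows, so the value is
    invariant to rotations within each subspace, matching the Grassmann view.
    """
    P_true = W_true.T @ W_true   # (d_x, d_x) projector onto the true subspace
    P_hat = W_hat.T @ W_hat      # projector onto the estimated subspace
    return np.linalg.norm(P_true - P_hat, ord="fro")
```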
Figure 3:

(Left) The mean and standard error of the dimensionality-reduction error over 20 runs. (Right) The mean and standard error of the conditional-density estimation error over 20 runs.


In data set b, LSMI does not work well compared to the other methods. To explain this behavior, we plot the histograms of the output y in the left column of Figure 4. They show that the profile of the histogram (a sample approximation of the marginal density p(y)) in data set b is much sharper than that in data set a. As discussed in section 2.8, the density ratio used in LSMI contains p(y) in its denominator. Thus, for data set b, this density ratio would be highly nonsmooth and hard to approximate. The conditional density used in the other methods is p(y|z), in which p(y) is not included; therefore, it would be smoother than the density ratio and easier to estimate.

Figure 4:

(Left) Example histograms of the output y on the artificial data sets. (Right) Example plots of the relevant features of x against y when n = 400 on the artificial data sets. The left distribution in the histogram of data set c corresponds to outliers.


For data set c, we consider the situation where the outputs contain outliers that are not related to the relevant features of x. The data profile of data set c in the right column of Figure 4 illustrates such a situation. The result on data set c shows that the proposed LSCE method is robust against outliers and gives the best subspace estimation accuracy, while the BDR method performs unreliably, with large standard errors.

The right column of Figure 3 plots the conditional density estimation error between the true conditional density p(y|x) and its estimate, evaluated by the squared loss:
formula
where the error is computed over a set of test samples that have not been used for training. For data sets a and c, all methods with dimension reduction perform equally well; they are much better than no dimension reduction (None/LSCDE) and comparable to the method with the true subspace (True/LSCDE). For data set b, all methods except LSMI/LSCDE perform well overall and are comparable to the methods with the true subspace.
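The squared-loss criterion can be approximated from test data alone, up to a constant that does not depend on the estimate, via the usual expansion (1/2) ∫ p̂(y|x)^2 dy − p̂(y|x) averaged over test points. The sketch below implements that approximation for univariate output with a simple grid integral; `cond_density` stands for the fitted estimator p̂(y|x) and is an assumed interface, not part of the letter.

```python
import numpy as np

def cde_error(cond_density, X_test, Y_test, y_grid):
    """Empirical squared-loss CDE error, up to an estimator-independent constant.

    cond_density(x, ys) -> array of p_hat(y | x) values for each y in ys
    y_grid              -> 1-D grid covering the output range (univariate y)
    """
    dy = y_grid[1] - y_grid[0]
    sq_term, fit_term = 0.0, 0.0
    for x, y in zip(X_test, Y_test):
        p = cond_density(x, y_grid)
        sq_term += 0.5 * np.sum(p ** 2) * dy            # (1/2) int p_hat(y|x)^2 dy
        fit_term += cond_density(x, np.atleast_1d(y))[0]  # p_hat(y_i | x_i)
    n = len(X_test)
    return sq_term / n - fit_term / n
```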

3.3  Benchmark Data Sets

Next, we use UCI benchmark data sets (Bache & Lichman, 2013). We randomly select n samples from each data set for training, and the rest are used to measure the conditional density estimation error in the test phase. Since the dimensionality of the subspace is unknown, we choose it by cross-validation. More specifically, five-fold cross-validation is performed for each combination of the dimensionality-reduction and conditional-density estimation methods to choose the subspace dimensionality that minimizes the conditional-density estimation error. Note that the tuning parameters, such as the gaussian bandwidth σ and the regularization parameter λ, are also chosen by cross-validation for each method. Since the conditional-density estimation error is equivalent to SCE, choosing the subspace dimensionality by the conditional-density estimation error in LSCE is equivalent to choosing the subspace dimensionality that gives the minimum SCE value.

The results for the univariate output benchmark data sets, averaged over 10 runs, are summarized in Table 1. They show that LSCDE tends to outperform ε-KDE and that the proposed LSCE/LSCDE method works well overall. The LSMI/LSCDE and dMAVE/LSCDE methods also perform well on all data sets, while BDR/LSCDE does not work well on the data sets containing outliers, such as Red Wine, White Wine, and Forest Fires. Table 2 lists the subspace dimensionalities chosen by cross-validation, averaged over 10 runs. It shows that all dimensionality-reduction methods reduce the input dimensionality significantly, especially for Yacht, Red Wine, and White Wine, where the best method chooses a small subspace dimensionality in all runs.

Table 1:
Mean and Standard Error of the Conditional Density Estimation Error over 10 Runs for Univariate Output Data Sets.
            LSCE           LSMI           dMAVE          BDR            No reduction
Data Set    LSCDE  ε-KDE   LSCDE  ε-KDE   LSCDE  ε-KDE   LSCDE  ε-KDE   LSCDE  ε-KDE
Servo           
Yacht           
Auto MPG           
Concrete           
Physicochem           
Red Wine           
White Wine           
Forest Fires           
Housing           

Note: The best method in terms of the mean error, and methods comparable to it according to the two-sample paired t-test at the chosen significance level, are specified in bold.

Table 2:
Mean and Standard Error of the Chosen Subspace Dimensionality over 10 Runs for Univariate Output Data Sets.
                 LSCE           LSMI           dMAVE          BDR
Data Set     n   LSCDE  ε-KDE   LSCDE  ε-KDE   LSCDE  ε-KDE   LSCDE  ε-KDE
Servo  50         
Yacht  80         
Auto MPG  100         
Concrete  300         
Physicochem  500         
Red Wine  300         
White Wine  400         
Forest Fires  100         
Housing  100         

The results for the multivariate output Stock and Energy benchmark data sets are summarized in Table 3, showing that the proposed LSCE/LSCDE method also works well for multivariate output data sets and significantly outperforms the methods without dimensionality reduction. Table 4 lists the subspace dimensionalities selected by cross-validation, showing that LSMI/LSCDE tends to reduce the dimensionality more aggressively than LSCE/LSCDE.

Table 3:
Mean and Standard Error of the Conditional Density Estimation Error over 10 Runs for Multivariate Output Data Sets.
            LSCE           LSMI           No reduction
Data Set    LSCDE  ε-KDE   LSCDE  ε-KDE   LSCDE  ε-KDE   Scale
Stock        
Energy        
2 joints        
4 joints        
9 joints        
Sumi-e 1        
Sumi-e 2        
Sumi-e 3        

Note: The best method in terms of the mean error, and methods comparable to it according to the two-sample paired t-test at the chosen significance level, are specified in bold.

Table 4:
Mean and Standard Error of the Chosen Subspace Dimensionality over 10 Runs for Multivariate Output Data Sets.
                 LSCE           LSMI
Data Set     n   LSCDE  ε-KDE   LSCDE  ε-KDE
Stock  100     
Energy  200     
2 joints  100     
4 joints  200     
9 joints  500