## Abstract

Regression aims at estimating the conditional mean of output given input. However, regression is not informative enough if the conditional density is multimodal, heteroskedastic, and asymmetric. In such a case, estimating the conditional density itself is preferable, but conditional density estimation (CDE) is challenging in high-dimensional space. A naive approach to coping with high dimensionality is to first perform dimensionality reduction (DR) and then execute CDE. However, a two-step process does not perform well in practice because the error incurred in the first DR step can be magnified in the second CDE step. In this letter, we propose a novel single-shot procedure that performs CDE and DR simultaneously in an integrated way. Our key idea is to formulate DR as the problem of minimizing a squared-loss variant of conditional entropy, and this is solved using CDE. Thus, an additional CDE step is not needed after DR. We demonstrate the usefulness of the proposed method through extensive experiments on various data sets, including humanoid robot transition and computer art.

## 1 Introduction

Analyzing an input-output relationship from samples is one of the central challenges in machine learning. The most common approach is regression, which estimates the conditional mean of output given input. However, analyzing only the conditional mean is not informative enough when the conditional density possesses multimodality, asymmetry, and heteroskedasticity (i.e., input-dependent variance) as a function of the output. In such cases, it is more appropriate to estimate the conditional density itself (see Figure 2).

The most naive approach to conditional density estimation (CDE) would be ε-neighbor kernel density estimation (ε-KDE), which performs standard KDE using only the samples whose inputs are close to the test input. However, ε-KDE does not work well in high-dimensional problems because the number of nearby samples is too small. To avoid this small sample problem, KDE may be applied twice to estimate the joint density p(x, y) and the marginal density p(x) separately, and the estimated densities may be plugged into the decomposition p(y|x) = p(x, y)/p(x) to estimate the conditional density. However, taking the ratio of two estimated densities significantly magnifies the estimation error and thus is not reliable. To overcome this problem, an approach that directly estimates the density ratio p(x, y)/p(x) without separate estimation of p(x, y) and p(x) has been explored (Sugiyama et al., 2010). This method, called least-squares CDE (LSCDE), was proved to achieve the optimal nonparametric learning rate in the minimax sense, and its solution can be computed efficiently and analytically. Nevertheless, estimating conditional densities in high-dimensional problems remains challenging.

A natural idea to cope with the high dimensionality is to perform dimensionality reduction (DR) before CDE. Sufficient DR (Li, 1991; Cook & Ni, 2005) is a framework of supervised DR aimed at finding a subspace of the input that contains all information on the output, and a method based on conditional-covariance operators in reproducing kernel Hilbert spaces has been proposed (Fukumizu, Bach, & Jordan, 2009). Although this method possesses superior theoretical properties, it is not easy to use in practice because no systematic model selection method is available for its kernel parameters. To overcome this problem, an alternative sufficient DR method based on squared-loss mutual information (SMI) has been proposed recently (Suzuki & Sugiyama, 2013). This method involves nonparametric estimation of SMI that is theoretically guaranteed to achieve the optimal estimation rate, and all tuning parameters can be systematically chosen in practice by cross-validation with respect to the SMI approximation error.

Given such state-of-the-art DR methods, performing DR before LSCDE would be a promising approach to improving the accuracy of CDE in high-dimensional problems. However, such a two-step approach is not preferable because DR in the first step is performed without regard to CDE in the second step, and thus small errors incurred in the DR step can be significantly magnified in the CDE step.

In this letter, we propose a single-shot method that integrates DR and CDE. Our key idea is to formulate the sufficient DR problem in terms of the squared-loss conditional entropy (SCE), which includes the conditional density in its definition, and LSCDE is executed when DR is performed. Therefore, when DR is completed, the final conditional density estimator has already been obtained without an additional CDE step (see Figure 1). We demonstrate the usefulness of the proposed method, named least-squares conditional entropy (LSCE), through experiments on benchmark data sets, humanoid robot control simulations, and computer art.

## 2 Conditional Density Estimation with Dimensionality Reduction

In this section, we describe our proposed method for conditional density estimation with dimensionality reduction.

### 2.1 Problem Formulation

Suppose that we are given *n* independent and identically distributed (i.i.d.) training samples (x_1, y_1), …, (x_n, y_n) drawn from the joint density p(x, y). The goal is to estimate the conditional density p(y|x) from these samples.

### 2.2 Sufficient Dimensionality Reduction with SCE

SCE is the negative Pearson divergence from the joint density p(x, y) to the marginal density p(x), and the Pearson divergence is a member of the *f*-divergence class (Ali & Silvey, 1966; Csiszár, 1967) with the squared-loss function. Analogously, ordinary conditional entropy (CE) is the negative Kullback-Leibler divergence (Kullback & Leibler, 1951) from p(x, y) to p(x). Since the Kullback-Leibler divergence is also a member of the *f*-divergence class (with the log-loss function), CE and SCE have similar properties. Indeed, theorem 1 also holds for ordinary CE. However, the Pearson divergence has been shown to be more robust against outliers (Basu, Harris, Hjort, & Jones, 1998; Sugiyama, Suzuki, & Kanamori, 2012), since it does not involve the log function, which is very sharp near zero. Furthermore, as we show below, SCE can be approximated analytically, and thus its derivative can also be computed easily. This is a critical property for developing a dimensionality-reduction method, because we want to minimize SCE with respect to the projection matrix W, and the gradient is highly useful in devising an optimization algorithm. For this reason, we adopt SCE instead of CE below.

### 2.3 SCE Approximation

Since the densities in equation 2.5 are unknown in practice, we approximate SCE using the samples (x_1, y_1), …, (x_n, y_n).

Minimization is now performed as a function of a lower bound *b*. (For more general discussions on divergence bounding, see Keziou, 2003, and Nguyen, Wainwright, & Jordan, 2010.) We approximate *b* with the linear-in-parameter model α^T φ(x, y), where α is a parameter vector and φ(x, y) is a vector of basis functions. If the expectations over the densities p(x) and p(x, y) are approximated by sample averages and the ℓ2-regularizer (λ/2)‖α‖^2 (with regularization parameter λ > 0) is included, the above minimization problem yields

α̂ = argmin_α [ (1/2) α^T Ĝ α − ĥ^T α + (λ/2) ‖α‖^2 ],

where

Ĝ = (1/n) Σ_{i=1}^n ∫ φ(x_i, y) φ(x_i, y)^T dy and ĥ = (1/n) Σ_{i=1}^n φ(x_i, y_i).

The solution is analytically given by

α̂ = (Ĝ + λI)^{−1} ĥ,

which yields b̂(x, y) = α̂^T φ(x, y). Then, from equation 2.7, we obtain an analytic approximator of SCE. We call this method least-squares conditional entropy (LSCE).
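The analytic solve above can be sketched numerically. This is a minimal illustration, assuming gaussian basis functions centered at the training points and a simple grid approximation of the integral over y (for gaussian bases this integral also has a closed form); `lsce_fit` and `gauss_basis` are illustrative names, not from the original.

```python
import numpy as np

def gauss_basis(X, y, cX, cy, sigma):
    """phi_k(x, y) = exp(-(||x - u_k||^2 + (y - v_k)^2) / (2 sigma^2))."""
    dx = ((X[:, None, :] - cX[None, :, :]) ** 2).sum(axis=-1)  # (m, b) input distances
    dy = (y[:, None] - cy[None, :]) ** 2                       # (m, b) output distances
    return np.exp(-(dx + dy) / (2.0 * sigma ** 2))

def lsce_fit(X, y, sigma=0.5, lam=0.1, n_grid=201):
    """Regularized least squares: alpha_hat = (G_hat + lam I)^{-1} h_hat."""
    n = len(y)
    # h_hat = (1/n) sum_i phi(x_i, y_i), with all samples used as centers
    h = gauss_basis(X, y, X, y, sigma).mean(axis=0)
    # G_hat = (1/n) sum_i  integral of  phi(x_i, y) phi(x_i, y)^T dy,
    # approximated here by a Riemann sum over an output grid
    ygrid = np.linspace(y.min() - 3 * sigma, y.max() + 3 * sigma, n_grid)
    dyg = ygrid[1] - ygrid[0]
    G = np.zeros((n, n))
    for xi in X:
        P = gauss_basis(np.tile(xi, (n_grid, 1)), ygrid, X, y, sigma)  # (n_grid, n)
        G += dyg * P.T @ P
    G /= n
    return np.linalg.solve(G + lam * np.eye(n), h)

# toy 1-D illustration
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(30, 1))
y = np.sin(np.pi * X[:, 0]) + 0.1 * rng.standard_normal(30)
alpha = lsce_fit(X, y)
```

The estimated conditional density is then obtained by evaluating the weighted basis expansion with α̂ at any test input.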

### 2.4 Model Selection by Cross-Validation

The approximator ŜCE depends on the choice of the model, that is, the basis function φ and the regularization parameter λ. Such a model can be objectively selected by cross-validation as follows:

1. The training data set D is divided into *K* disjoint subsets D_1, …, D_K of (approximately) the same size.
2. For each model *M* in the candidate set and for each k = 1, …, K, the LSCE solution is computed from D excluding D_k (i.e., all samples except D_k), and its validation error is computed on the held-out subset D_k.
3. The model M̂ with the smallest average validation error over the K folds is chosen.
4. For the chosen model M̂, the LSCE solution is computed from all samples D, and the approximator ŜCE is computed.

In the experiments, we use K = 5.
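The selection loop above can be sketched generically. The fold handling follows the steps in this section, while the toy model (a 1-D ridge fit) and the names `kfold_select`, `fit`, and `val_error` are placeholders standing in for the LSCE solution and its validation error.

```python
import numpy as np

def kfold_select(x, y, candidates, fit, val_error, K=5, seed=0):
    """Return the candidate model with the smallest average held-out error."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), K)
    best, best_err = None, np.inf
    for M in candidates:
        errs = []
        for k in range(K):
            tr = np.concatenate([folds[j] for j in range(K) if j != k])
            theta = fit(x[tr], y[tr], M)             # fit on all folds but one
            errs.append(val_error(x[folds[k]], y[folds[k]], theta))
        if np.mean(errs) < best_err:
            best, best_err = M, np.mean(errs)
    return best

# toy stand-in for LSCE: 1-D ridge regression, model M = regularization parameter
fit = lambda x, y, lam: (x @ y) / (x @ x + lam)
val_error = lambda x, y, w: np.mean((y - w * x) ** 2)

rng = np.random.default_rng(1)
x = rng.standard_normal(100)
y = 2.0 * x + 0.1 * rng.standard_normal(100)
best_lam = kfold_select(x, y, [1e-3, 10.0, 1000.0], fit, val_error)
```

Here heavy regularization biases the slope away from the true value 2, so cross-validation favors the smallest candidate.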

### 2.5 Dimensionality Reduction with SCE

In the Euclidean space, the above gradient gives the steepest direction. On a manifold such as the Grassmann manifold over which W is optimized, however, the natural gradient (Amari, 1998) gives the steepest direction.

Once W is updated, SCE is reestimated with the new W, and gradient descent is performed again. This entire procedure is repeated until W converges. Because performing cross-validation in every reestimation step is computationally expensive, our implementation performs cross-validation only once every five gradient updates. Furthermore, to find a better local solution, this gradient descent procedure is executed 20 times with randomly chosen initial solutions, and the one achieving the smallest value of ŜCE is chosen.
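The descent-with-restarts loop can be sketched schematically. This is not the paper's algorithm verbatim: a simple quadratic trace objective stands in for ŜCE, and a QR retraction replaces the natural-gradient geodesic update; `grassmann_descent` and the step sizes are illustrative.

```python
import numpy as np

def grassmann_descent(f, grad_f, dx, dz, step=0.05, iters=300, restarts=5, seed=0):
    """Minimize f over dz-by-dx matrices W with orthonormal rows (W W^T = I):
    project the gradient onto the tangent space, take a step, retract via QR,
    and keep the best of several randomly initialized runs."""
    rng = np.random.default_rng(seed)
    best_W, best_val = None, np.inf
    for _ in range(restarts):
        W = np.linalg.qr(rng.standard_normal((dx, dz)))[0].T  # random orthonormal rows
        for _ in range(iters):
            G = grad_f(W)
            G = G - G @ W.T @ W                       # tangent projection G (I - W^T W)
            W = np.linalg.qr((W - step * G).T)[0].T   # retract back onto the manifold
        if f(W) < best_val:
            best_W, best_val = W, f(W)
    return best_W

# stand-in objective: f(W) = tr(W A W^T), minimized by A's bottom eigenvectors
A = np.diag([5.0, 4.0, 3.0, 1.0, 0.5])
f = lambda W: np.trace(W @ A @ W.T)
grad_f = lambda W: 2.0 * W @ A
W_hat = grassmann_descent(f, grad_f, dx=5, dz=2)
```

For this objective the minimum value is 1.0 + 0.5 = 1.5, attained when the rows of W span the two smallest eigendirections, which the random restarts reliably find.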

### 2.6 Conditional Density Estimation with SCE

### 2.7 Basis Function Design

We use a gaussian kernel for the *k*th basis function: φ_k(x, y) = exp(−(‖x − u_k‖^2 + ‖y − v_k‖^2) / (2σ^2)), where (u_k, v_k) denotes the *k*th gaussian center located at a training sample. When the sample size *n* is too large, we may use only a subset of the samples as gaussian centers. σ denotes the gaussian bandwidth, which is chosen by cross-validation, as explained in section 2.4. We may use different bandwidths for x and y, but this would increase the computation time for model selection. In our implementation, we normalize each element of x and y to have unit variance in advance and then use the common bandwidth σ for both x and y.
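The normalization and basis evaluation can be sketched as follows, assuming the *k*th basis is a product gaussian with a single shared bandwidth; `standardize` and `basis_k` are illustrative names.

```python
import numpy as np

def standardize(V):
    """Scale each element (column) of the data to unit variance in advance,
    so one common bandwidth can serve both the input and output domains."""
    return (V - V.mean(axis=0)) / V.std(axis=0)

def basis_k(x, y, u_k, v_k, sigma):
    """k-th gaussian basis centered at a training sample (u_k, v_k)."""
    return np.exp(-(np.sum((x - u_k) ** 2) + np.sum((y - v_k) ** 2))
                  / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = standardize(rng.normal(3.0, 2.0, size=(50, 4)))
Y = standardize(rng.normal(0.0, 5.0, size=(50, 1)))
val = basis_k(X[0], Y[0], X[7], Y[7], sigma=1.0)
```

Each basis value lies in (0, 1] and equals 1 exactly at its own center.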

### 2.8 Discussion

Sufficient dimension reduction based on the conditional density has also been studied in the statistics literature. The density-minimum average variance estimation (dMAVE) method (Xia, 2007) finds a dimension-reduction subspace using local linear regression for the conditional density in a semiparametric manner. A similar approach has also been taken in the sliced regression for dimension reduction method (Wang & Xia, 2008), where the cumulative conditional density is used instead of the conditional density. A Bayesian approach to sufficient dimension reduction called the Bayesian dimension reduction (BDR) method (Reich, Bondell, & Li, 2011) has been proposed recently. This method models the conditional density as a gaussian mixture model and obtains a dimension-reduction subspace through sampling from the learned prior distribution of low-dimensional input. These methods have been shown to work well for dimension reduction in real-world data sets, although they are applicable only to univariate output data.

In regression, learning with the squared loss is not robust against outliers (Huber, 1981). However, density estimation (Basu et al., 1998) and density ratio estimation (Sugiyama et al., 2012) under the Pearson divergence are known to be robust against outliers. Thus, in the same sense, the proposed LSCE estimator would also be robust against outliers. We experimentally investigate the robustness in section 3.

## 3 Experiments

In this section, we experimentally investigate the practical usefulness of the proposed method. We consider the following dimensionality-reduction schemes:

- None:
No dimensionality reduction is performed.

- dMAVE:
The density-minimum average variance estimation method where dimension reduction is performed through local linear regression for the conditional density (Xia, 2007).

- BDR:
The Bayesian dimension-reduction method where the conditional density is modeled by a gaussian mixture model and dimension reduction is performed by sampling from the prior distribution of low-dimensional input (Reich et al., 2011).

- LSMI:
Dimension reduction is performed by maximizing an SMI approximator called least-squares MI (LSMI) using natural gradients over the Grassmann manifold (Suzuki & Sugiyama, 2013).

- LSCE (proposed):
Dimension reduction is performed by minimizing the proposed LSCE using natural gradients over the Grassmann manifold.

- True (reference):
The “true” subspace is used (only for artificial data).

After dimension reduction, we execute the following conditional density estimators:

- ε-KDE:
ε-neighbor kernel density estimation, where ε is chosen by least-squares cross-validation.

- LSCDE:
Least-squares conditional density estimation (Sugiyama et al., 2010).

Note that the proposed method, which is the combination of LSCE and LSCDE, does not explicitly require the post-LSCDE step because LSCDE is executed inside LSCE. Since the dMAVE and BDR methods are applicable only to univariate output, they are not included in experiments with multivariate output data.

### 3.1 Illustration

First, we illustrate the behavior of plain LSCDE (None/LSCDE) and the proposed method (LSCE/LSCDE). The data sets illustrated in Figure 2 have five-dimensional input and univariate output. The first dimension of the input and the output *y* of the samples are plotted in the graphs; the other four dimensions of the input are just standard normal noise. The results show that plain LSCDE does not perform well due to the irrelevant noise dimensions of the input, while the proposed method gives much better estimates.

### 3.2 Artificial Data Sets

Next, we compare the proposed method with the existing dimensionality-reduction methods on conditional density estimation by LSCDE in artificial data sets.

We consider three artificial data sets, denoted a, b, and c, in which the inputs are drawn from the normal distribution N(μ, Σ) with mean μ and covariance matrix Σ. The subspace estimation error is evaluated as a function of the sample size *n*, measured by Error = ‖Ŵ^T Ŵ − W*^T W*‖_F, where W* denotes the true projection matrix and ‖·‖_F denotes the Frobenius norm. All methods perform similarly for data set a, and the dMAVE and BDR methods outperform LSCE and LSMI for some sample sizes.
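A Frobenius distance between projection matrices can be computed directly; a minimal sketch (`subspace_error` is an illustrative name), assuming both W matrices have orthonormal rows:

```python
import numpy as np

def subspace_error(W_hat, W_star):
    """Frobenius distance between the projectors onto the estimated and
    true row spaces; it is 0 exactly when the two subspaces coincide."""
    return np.linalg.norm(W_hat.T @ W_hat - W_star.T @ W_star, ord='fro')

W_star = np.array([[1.0, 0.0, 0.0]])   # true 1-D subspace in 3-D
W_same = np.array([[-1.0, 0.0, 0.0]])  # same subspace, sign flipped
W_orth = np.array([[0.0, 1.0, 0.0]])   # orthogonal subspace
```

The measure is invariant to the sign or rotation of the rows: `subspace_error(W_star, W_same)` is 0, while two orthogonal one-dimensional subspaces give the maximum value sqrt(2).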

In data set b, LSMI does not work well compared with the other methods. To explain this behavior, we plot the histograms of the output y in the left column of Figure 4. They show that the profile of the histogram (a sample approximation of the marginal density p(y)) in data set b is much sharper than that in data set a. As discussed in section 2.8, the density ratio used in LSMI, p(x, y)/(p(x)p(y)), contains p(y). Thus, for data set b, this density ratio would be highly nonsmooth and hard to approximate. The conditional density used in the other methods is p(y|x) = p(x, y)/p(x), in which p(y) is not included. Therefore, p(y|x) would be smoother than the LSMI density ratio and easier to estimate.

For data set c, we consider the situation where the training samples contain outliers that are not related to the input-output relationship. The data profile of data set c in the right column of Figure 4 illustrates this situation. The result on data set c shows that the proposed LSCE method is robust against outliers and gives the best subspace estimation accuracy, while the BDR method performs unreliably, with large standard errors.

### 3.3 Benchmark Data Sets

Next, we use UCI benchmark data sets (Bache & Lichman, 2013). We randomly select *n* samples from each data set for training, and the rest are used to measure the conditional density estimation error in the test phase. Since the true dimensionality of the subspace is unknown, we choose it by cross-validation. More specifically, five-fold cross-validation is performed for each combination of the dimensionality-reduction and conditional-density estimation methods to choose the subspace dimensionality that minimizes the conditional-density estimation error. Note that the tuning parameters, the gaussian bandwidth σ and the regularization parameter λ, are also chosen by cross-validation for each method. Since the conditional-density estimation error is equivalent to SCE, choosing the subspace dimensionality by the conditional-density estimation error in LSCE is equivalent to choosing the subspace dimensionality that gives the minimum SCE value.

The results on the univariate output benchmark data sets averaged over 10 runs are summarized in Table 1, showing that LSCDE tends to outperform ε-KDE and the proposed LSCE/LSCDE method works well overall. The LSMI/LSCDE and dMAVE/LSCDE methods also perform well on all data sets, while BDR/LSCDE does not work well on the data sets containing outliers, such as Red Wine, White Wine, and Forest Fires. Table 2 describes the subspace dimensionalities chosen by cross-validation averaged over 10 runs. It shows that all dimensionality-reduction methods reduce the input dimension significantly, especially for Yacht, Red Wine, and White Wine, where the best method chooses the smallest candidate dimensionality in all runs.

| Data Set | LSCE/LSCDE | LSCE/ε-KDE | LSMI/LSCDE | LSMI/ε-KDE | dMAVE/LSCDE | dMAVE/ε-KDE | BDR/LSCDE | BDR/ε-KDE | No Reduction/LSCDE | No Reduction/ε-KDE |
|---|---|---|---|---|---|---|---|---|---|---|
| Servo | | | | | | | | | | |
| Yacht | | | | | | | | | | |
| Auto MPG | | | | | | | | | | |
| Concrete | | | | | | | | | | |
| Physicochem | | | | | | | | | | |
| Red Wine | | | | | | | | | | |
| White Wine | | | | | | | | | | |
| Forest Fires | | | | | | | | | | |
| Housing | | | | | | | | | | |


Note: The best method in terms of the mean error and methods comparable to it according to the two-sample paired *t*-test are specified in bold.

| Data Set | n | LSCE/LSCDE | LSCE/ε-KDE | LSMI/LSCDE | LSMI/ε-KDE | dMAVE/LSCDE | dMAVE/ε-KDE | BDR/LSCDE | BDR/ε-KDE |
|---|---|---|---|---|---|---|---|---|---|
| Servo | 50 | | | | | | | | |
| Yacht | 80 | | | | | | | | |
| Auto MPG | 100 | | | | | | | | |
| Concrete | 300 | | | | | | | | |
| Physicochem | 500 | | | | | | | | |
| Red Wine | 300 | | | | | | | | |
| White Wine | 400 | | | | | | | | |
| Forest Fires | 100 | | | | | | | | |
| Housing | 100 | | | | | | | | |


The results of multivariate output Stock and Energy benchmark data sets are summarized in Table 3, showing that the proposed LSCE/LSCDE method also works well for multivariate output data sets and significantly outperforms methods without dimensionality reduction. Table 4 describes the subspace dimensionalities selected by cross-validation, showing that LSMI/LSCDE tends to more aggressively reduce the dimensionality than LSCE/LSCDE.

| Data Set | LSCE/LSCDE | LSCE/ε-KDE | LSMI/LSCDE | LSMI/ε-KDE | No Reduction/LSCDE | No Reduction/ε-KDE | Scale |
|---|---|---|---|---|---|---|---|
| Stock | | | | | | | |
| Energy | | | | | | | |
| 2 joints | | | | | | | |
| 4 joints | | | | | | | |
| 9 joints | | | | | | | |
| Sumi-e 1 | | | | | | | |
| Sumi-e 2 | | | | | | | |
| Sumi-e 3 | | | | | | | |


Note: The best method in terms of the mean error and methods comparable to it according to the two-sample paired *t*-test are specified in bold.