We address the problem of estimating the difference between two probability densities. A naive approach is a two-step procedure of first estimating two densities separately and then computing their difference. However, this procedure does not necessarily work well because the first step is performed without regard to the second step, and thus a small estimation error incurred in the first stage can cause a big error in the second stage. In this letter, we propose a single-shot procedure for directly estimating the density difference without separately estimating two densities. We derive a nonparametric finite-sample error bound for the proposed single-shot density-difference estimator and show that it achieves the optimal convergence rate. We then show how the proposed density-difference estimator can be used in L2-distance approximation. Finally, we experimentally demonstrate the usefulness of the proposed method in robust distribution comparison such as class-prior estimation and change-point detection.
When estimating a quantity consisting of two elements, a two-stage approach of first estimating the two elements separately and then approximating the target quantity based on the estimates of the two elements often performs poorly; the reason is that the first stage is carried out without regard to the second stage, and thus a small estimation error incurred in the first stage can cause a big error in the second stage. To cope with this problem, it would be more appropriate to directly estimate the target quantity in a single-shot process without separately estimating the two elements.
A seminal example that follows this general idea is pattern recognition by the support vector machine (Boser, Guyon, & Vapnik, 1992; Cortes & Vapnik, 1995; Vapnik, 1998).1 Instead of separately estimating the probability distributions of positive and negative patterns, the support vector machine directly learns the boundary between the positive and negative classes, which is sufficient for pattern recognition. More recently, the problem of estimating the ratio of two probability densities was tackled in a similar fashion (Qin, 1998; Sugiyama et al., 2008; Gretton et al., 2009; Kanamori, Hido, & Sugiyama, 2009; Nguyen, Wainwright, & Jordan, 2010; Kanamori, Suzuki, & Sugiyama, 2012; Sugiyama, Suzuki, & Kanamori, 2012a, 2012b): the ratio of two probability densities is directly estimated without going through separate estimation of the two probability densities.
In this letter, we explore this line of research and propose a method for directly estimating the difference between two probability densities in a single-shot process. Density ratios and density differences can both be used for comparing probability densities via approximation of divergences such as the Kullback-Leibler (KL) divergence (Kullback & Leibler, 1951) and the L2-distance. A divergence estimator can be used for solving various machine learning tasks, including class-balance estimation under class-prior change (Saerens, Latinne, & Decaestecker, 2002; du Plessis & Sugiyama, 2012), image segmentation and registration (Liu et al., 2010; Atif, Ripoche, & Osorio, 2003), target object detection and recognition (Gray & Principe, 2010; Yamanaka, Matsugu, & Sugiyama, forthcoming b), feature selection and extraction (Torkkola, 2003; Suzuki & Sugiyama, 2013), and change-point detection in time series (Kawahara & Sugiyama, 2012; Liu, Yamada, Collier, & Sugiyama, 2013; Yamanaka, Matsugu, & Sugiyama, forthcoming a). In this divergence approximation scenario, density differences are more advantageous than density ratios in several respects. For example, density ratios can be unbounded even in simple cases (Cortes, Mansour, & Mohri, 2010; Yamada, Suzuki, Kanamori, Hachiya, & Sugiyama, 2013), whereas density differences are always bounded as long as both densities are bounded. Thus, density differences are expected to be easier to learn than density ratios. Also, density ratios are asymmetric, so their "direction" needs to be determined by the user, whereas density differences are symmetric, so there is no direction to choose. These are our primary motivations for developing a density-difference estimator.
Note that density ratios have their own applications beyond divergence approximation to which density differences may not be applied, such as importance sampling and conditional probability estimation (Sugiyama et al., 2012a). On the other hand, density differences also have their own unique applications to which density ratios may not be applied, such as the estimation of highest density-difference regions in flow cytometric data analysis (Duong, Koch, & Wand, 2009) and unsupervised labeling (du Plessis, 2013). Thus, neither density ratios nor density differences subsume the other in terms of their ranges of application. This is our additional motivation to pursue a practical algorithm for density-difference estimation.
For this density-difference estimation problem, we propose a single-shot method, the least-squares density-difference (LSDD) estimator, that directly estimates the density difference without separately estimating two densities. LSDD is derived within the framework of kernel regularized least-squares estimation, and its solution can be computed analytically in a computationally efficient and stable manner. Furthermore, LSDD is equipped with cross-validation, and thus all tuning parameters such as the kernel width and the regularization parameter can be systematically and objectively optimized. We derive a finite-sample error bound for the LSDD estimator in a nonparametric setup and show that it achieves the optimal convergence rate.
We also apply LSDD to L2-distance estimation and show that it is more accurate than the difference of KDEs, which tends to severely underestimate the L2-distance (Anderson, Hall, & Titterington, 1994). Compared with the KL divergence, the L2-distance is more robust against outliers (Basu, Harris, Hjort, & Jones, 1998; Scott, 2001; Besbeas & Morgan, 2004). We experimentally demonstrate the usefulness of LSDD in robust distribution comparison such as semisupervised class-prior estimation and unsupervised change detection.
2. Density-Difference Estimation
In this section, we propose a single-shot method for estimating the difference between two probability densities from samples and analyze its theoretical properties.
2.1. Problem Formulation and Naive Approach
First, we formulate the problem of density-difference estimation.
However, we argue that the KDE-based density-difference estimator is not the best approach because of its two-step nature: a small estimation error incurred in each density estimate can cause a big error in the final density-difference estimate. More intuitively, good density estimators tend to be smooth, and thus a density-difference estimator obtained from such smooth density estimators tends to be oversmoothed (Hall & Wand, 1988; Anderson et al., 1994; see also the numerical experiments in section 4.1.1).
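To make the two-step baseline concrete, it can be sketched as follows (a minimal one-dimensional numpy illustration of our own; the function names and bandwidths are hypothetical and not taken from the letter):

```python
import numpy as np

def gaussian_kde(x_train, x_query, h):
    """Gaussian kernel density estimate in one dimension with bandwidth h."""
    u = (x_query[:, None] - x_train[None, :]) / h
    return np.exp(-0.5 * u ** 2).mean(axis=1) / (h * np.sqrt(2.0 * np.pi))

def kde_density_difference(x, x_prime, x_query, h1, h2):
    """Two-step estimate: fit each density separately, then subtract."""
    return gaussian_kde(x, x_query, h1) - gaussian_kde(x_prime, x_query, h2)
```

Each KDE may be individually accurate, yet the subtraction combines two independently smoothed estimates, which is exactly the oversmoothing weakness discussed above.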
To overcome this weakness, we give a single-shot procedure that directly estimates the density difference f(x) without separately estimating the two densities.
2.2. Least-Squares Density-Difference Estimation
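The derivation itself is not reproduced in this excerpt. As a rough illustration of the kind of estimator described above, namely a regularized least-squares fit of a gaussian-kernel model of the density difference, consider the following self-contained numpy sketch; the function names, the choice of kernel centers, and the fixed width and regularization values are our own assumptions, not the letter's:

```python
import numpy as np

def lsdd_fit(x, x_prime, sigma=1.0, lam=0.1):
    """Sketch of a least-squares density-difference fit.

    Models g(x) = sum_l theta_l exp(-||x - c_l||^2 / (2 sigma^2)) and minimizes
    int g(x)^2 dx - 2 int g(x)(p(x) - p'(x)) dx + lam ||theta||^2,
    whose empirical minimizer is theta = (H + lam I)^{-1} h.
    """
    c = np.vstack([x, x_prime])                      # kernel centers: all samples
    d = x.shape[1]
    sq_cc = ((c[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
    # H_{ll'} = int k(x, c_l) k(x, c_{l'}) dx has a gaussian closed form
    H = (np.pi * sigma ** 2) ** (d / 2.0) * np.exp(-sq_cc / (4.0 * sigma ** 2))
    def design(a):
        return np.exp(-((a[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
                      / (2.0 * sigma ** 2))
    h = design(x).mean(axis=0) - design(x_prime).mean(axis=0)
    theta = np.linalg.solve(H + lam * np.eye(len(c)), h)
    return c, theta

def lsdd_predict(x_query, c, theta, sigma=1.0):
    """Evaluate the fitted density-difference model at query points."""
    sq = ((x_query[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2)) @ theta
```

Because the objective is quadratic in theta, the solution is obtained by a single linear solve, which is what makes this type of estimator computationally efficient and stable.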
2.3. Theoretical Analysis
Here, we theoretically investigate the behavior of the LSDD estimator.
2.3.1. Parametric Convergence
First, we consider a linear parametric setup where basis functions in our density-difference model 2.2 are fixed.
This result implies that the LSDD estimator is asymptotically normal and attains the optimal convergence rate in the parametric setup.
2.3.2. Nonparametric Error Bound
Next, we consider a nonparametric setup where a density-difference function is learned in a gaussian reproducing kernel Hilbert space (RKHS) (Aronszajn, 1950).
It is known that if the naive KDE with a gaussian kernel is used for estimating a probability density of a given regularity, the optimal learning rate cannot be achieved (Farrell, 1972; Silverman, 1986). To achieve the optimal rate by KDE, we should choose a kernel function specifically tailored to each regularity (Parzen, 1962). However, such a kernel function is not nonnegative and is difficult to implement in practice. In contrast, our LSDD estimator always achieves the optimal learning rate with a gaussian kernel, regardless of the regularity.
2.4. Model Selection by Cross-Validation
The theoretical analyses showed the superiority of LSDD in terms of convergence rates. However, the practical performance of LSDD depends on the choice of models (i.e., the kernel width and the regularization parameter). Here, we show that the model can be optimized by cross-validation (CV).
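As an illustration of the procedure (our sketch, not the letter's exact formulation): for each candidate pair of kernel width and regularization parameter, the model is fitted on training folds, and the quadratic criterion J(g) = ∫g(x)² dx − 2∫g(x)(p(x) − p′(x)) dx, which can be evaluated up to a constant from held-out samples, serves as the validation score:

```python
import numpy as np

def _sq(a, b):
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)

def lsdd_cv_score(x, x_prime, sigma, lam, k=5):
    """k-fold CV score of the density-difference criterion; smaller is better.
    For brevity this sketch assumes len(x) == len(x_prime)."""
    d = x.shape[1]
    scores = []
    for fold in range(k):
        tr = np.arange(len(x)) % k != fold
        c = np.vstack([x[tr], x_prime[tr]])          # centers from training folds
        H = (np.pi * sigma ** 2) ** (d / 2.0) * np.exp(-_sq(c, c) / (4.0 * sigma ** 2))
        h = (np.exp(-_sq(x[tr], c) / (2.0 * sigma ** 2)).mean(axis=0)
             - np.exp(-_sq(x_prime[tr], c) / (2.0 * sigma ** 2)).mean(axis=0))
        theta = np.linalg.solve(H + lam * np.eye(len(c)), h)
        # held-out estimate of J(g): int g^2 is exact, the cross term is empirical
        g_te = np.exp(-_sq(x[~tr], c) / (2.0 * sigma ** 2)) @ theta
        g_te_p = np.exp(-_sq(x_prime[~tr], c) / (2.0 * sigma ** 2)) @ theta
        scores.append(theta @ H @ theta - 2.0 * (g_te.mean() - g_te_p.mean()))
    return float(np.mean(scores))

def select_model(x, x_prime, sigmas, lams):
    """Exhaustive grid search over candidate widths and regularization values."""
    return min((lsdd_cv_score(x, x_prime, s, l), s, l)
               for s in sigmas for l in lams)
```

Because the criterion is evaluated on samples not used for fitting, minimizing it over the candidate grid gives an objective choice of both tuning parameters.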
A Matlab implementation of LSDD is available online from http://sugiyama-www.cs.titech.ac.jp/~sugi/software/LSDD/.
3. L2-Distance Estimation by LSDD
3.1. Basic Forms
3.2. Reduction of Bias Caused by Regularization
3.3. Further Bias Correction
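The derivations are omitted in this excerpt. Schematically, once θ = (H + λI)⁻¹h is available, a plug-in estimate of the L2-distance is hᵀθ, and a form like 2hᵀθ − θᵀHθ is one natural way to counteract the shrinkage introduced by the regularizer; the following self-contained numpy sketch (our own, with arbitrary width and regularization values) computes both:

```python
import numpy as np

def l2_distance_estimates(x, x_prime, sigma=1.0, lam=0.1):
    """Two L2-distance estimates from a least-squares density-difference fit:
    the plain plug-in value h@theta and the form 2*h@theta - theta@H@theta."""
    d = x.shape[1]
    c = np.vstack([x, x_prime])
    sq_cc = ((c[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
    # closed-form gaussian integral int k(x, c_l) k(x, c_{l'}) dx
    H = (np.pi * sigma ** 2) ** (d / 2.0) * np.exp(-sq_cc / (4.0 * sigma ** 2))
    def design(a):
        return np.exp(-((a[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
                      / (2.0 * sigma ** 2))
    h = design(x).mean(axis=0) - design(x_prime).mean(axis=0)
    theta = np.linalg.solve(H + lam * np.eye(len(c)), h)
    return float(h @ theta), float(2.0 * h @ theta - theta @ H @ theta)
```

Since h = (H + λI)θ, the corrected form equals hᵀθ + λ‖θ‖², so it is never smaller than the plain plug-in value; this compensates for the downward bias that regularization induces.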
4. Experiments
In this section, we experimentally evaluate the performance of LSDD.
4.1. Numerical Examples
First, we show numerical examples using artificial data sets.
4.1.1. LSDD versus KDE
Two gaussian widths are independently chosen from the above candidate values based on five-fold least-squares cross-validation. That is, for each density, we perform cross-validation in terms of the L2-distance between estimated and true densities so that the density is optimally approximated (Härdle et al., 2004).
Two gaussian widths are jointly chosen from the above candidate values based on five-fold cross-validation in terms of the LSDD criterion (Hall & Wand, 1988). That is, we compute the cross-validated LSDD criterion as a function of two gaussian widths and find the best pair that minimizes the criterion.
We first illustrate the behavior of the LSDD- and KDE-based methods for d=1. Figure 1 depicts the data samples and density-difference estimation results obtained by LSDD, KDEi, and KDEj in the case where the two densities coincide (i.e., the true density difference is f(x)=0). Cross-validation scores of LSDD are included at the bottom of the figure. The figure shows that LSDD and KDEj give accurate estimates of the true density difference f(x)=0. The density-difference estimate obtained by KDEi, however, fluctuates considerably, although both densities are reasonably well approximated by KDEs. This illustrates an advantage of directly estimating the density difference without going through separate estimation of each density. Figure 2 depicts the results in the case where the two densities differ, showing again that LSDD performs well. KDEi and KDEj give the same estimation result for this data set, which slightly underestimates the peaks.
Next, we compare the performance of L2-distance approximation based on LSDD, KDEi, and KDEj for d=1, 5, drawing samples from the above two densities. Figure 3 depicts the mean and standard error of estimated L2-distances over 1000 runs as functions of the gaussian mean. When d=1 (see Figure 3a), the LSDD-based L2-distance estimator gives the most accurate estimates of the true L2-distance, whereas the KDEi-based L2-distance estimator slightly underestimates the true L2-distance when the gaussian mean is large. This is caused by the fact that KDE tends to provide smooth density estimates (see Figure 2c again). Such smooth estimates are accurate as density estimates, but the difference of smooth density estimates yields a small L2-distance estimate (Anderson et al., 1994). More specifically, the density is estimated accurately at around x=0.5, but negative values of the density difference f(x) are underestimated there because the density estimate is smoother than the true density p(x) and thus its tail values around x=0.5 are larger. The KDEj-based L2-distance estimator mitigates this drawback of KDEi to some extent, but it still slightly underestimates the true L2-distance when the gaussian mean is large.
When d=5 (see Figure 3b), the KDE-based L2-distance estimators severely underestimate the true L2-distance when the gaussian mean is large, but the LSDD-based L2-distance estimator still gives reasonably accurate estimates even when d=5. We note, however, that LSDD also slightly underestimates the true L2-distance when the gaussian mean is large; this is because slight underestimation tends to yield smaller variance, and such stabilized solutions are more accurate in terms of the bias-variance trade-off.
In Figure 1, we illustrated that LSDD and KDEj work better than KDEi when the two densities coincide. However, in Figure 3a, KDEi is shown to be the best-performing method in that case in terms of the average over 1000 runs. To explain this gap, we depict in Figure 4 histograms of L2-distance estimates obtained by LSDD, KDEi, and KDEj over 1000 runs for that setting. The histograms show that LSDD and KDEj give exactly correct solutions (i.e., zero) about 300 times, whereas KDEi gives estimates of about 0.01 more than 300 times. The graphs plotted in Figure 1 correspond to such typical results, where LSDD and KDEj outperform KDEi. On the other hand, KDEi almost always gives estimates below 0.1, whereas LSDD and KDEj occasionally give large estimates. This rather unstable behavior of LSDD and KDEj, caused by an inappropriate choice of the gaussian width (and, for LSDD, the regularization parameter) by cross-validation, led to larger mean values for LSDD and KDEj than for KDEi in Figure 3a.
Finally, we investigate the behavior of LSDD, KDEi, and KDEj in L2-distance estimation when the numbers of samples from the two distributions are imbalanced. Figure 5 plots the means and standard errors of the L2-distance estimated by LSDD, KDEi, and KDEj for d=1, 5 over 1000 runs as functions of the number of samples from the second distribution, with n fixed to 200.
When d=1 (see Figure 5a), all three methods behave similarly, and the accuracy tends to improve as the second sample size increases, although the improvement beyond a certain point is moderate. When the input dimensionality is increased to d=5 (see Figure 5b), LSDD and KDEj still show the same tendency. However, KDEi behaves differently: its approximation error tends to grow as the second sample size increases. This implies that improving the accuracy of one of the density estimates does not necessarily improve the overall estimation accuracy of the density difference.
In the setting of Figure 5c (d=1), LSDD tends to provide better estimates as the second sample size increases, whereas KDEi and KDEj keep underestimating the true L2-distance even as it is increased. Finally, in the setting of Figure 5d (d=5), LSDD stably provides reasonably good results, and its performance does not change significantly as the second sample size is increased. On the other hand, KDEi and KDEj tend to give better results as it increases.
Overall, LSDD sometimes gives slightly better results when additional samples from one distribution are available, but its performance is not significantly different from the balanced case. The accuracy of KDEi and KDEj with increased samples from one distribution gets better or worse depending on the situation. Thus, having more data samples from one of the distributions does not always improve the estimation accuracy in density-difference estimation.
4.1.2. L2-Distance versus KL Divergence
Figure 7a depicts the L2-distance and the KL-divergence as functions of the outlier mean. Both the L2-distance and the KL-divergence increase as the outlier mean increases. However, the L2-distance is bounded from above, whereas the KL-divergence diverges to infinity as the outlier mean tends to infinity. This result implies that the L2-distance is less sensitive to outliers than the KL-divergence, which agrees well with the observation given in Basu et al. (1998).
Next, we draw samples from the two densities and estimate the L2-distance by LSDD and the KL-divergence by the Kullback-Leibler importance estimation procedure (KLIEP) (Sugiyama et al., 2008).5 Figure 7b depicts the estimated L2-distance and KL-divergence as functions of the outlier mean over 100 runs. Both LSDD and KLIEP reasonably capture the profiles of the true L2-distance and KL-divergence, although the scale of the KLIEP values differs considerably from the true values (see Figure 7a) because the estimated normalization factor was unreliable.
Finally, based on the permutation test procedure (Efron & Tibshirani, 1993), we conduct hypothesis testing of the null hypothesis that the two densities are the same. More specifically, we first compute a distance/divergence estimate for the original two data sets. Next, we randomly permute the pooled samples, split them into two sets of the original sizes, and compute a distance/divergence estimate again from the permuted data sets. Because the two permuted sets can be regarded as being drawn from the same distribution, this estimate would take a value close to zero. This random permutation procedure is repeated many times (100 times in the following experiments), and the distribution of the estimate under the null hypothesis (i.e., the two distributions are the same) is constructed. Finally, the p-value is approximated by evaluating the relative ranking of the original estimate in the histogram of the permuted estimates. We set the significance level at 5%.
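The procedure above can be written down generically (our sketch; `distance` stands for any distance/divergence estimator, such as the LSDD-based L2-distance, and the crude mean-shift statistic below is only a stand-in):

```python
import numpy as np

def permutation_pvalue(x, x_prime, distance, n_perm=100, seed=0):
    """Two-sample permutation test of the null that x and x_prime are drawn
    from the same distribution, given a distance/divergence estimator."""
    rng = np.random.RandomState(seed)
    d_obs = distance(x, x_prime)
    pooled = np.vstack([x, x_prime])
    n = len(x)
    d_null = []
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))           # shuffle, then re-split
        d_null.append(distance(pooled[idx[:n]], pooled[idx[n:]]))
    # fraction of permuted statistics at least as extreme as the observed one
    return (1 + sum(d >= d_obs for d in d_null)) / (1 + n_perm)

# crude stand-in statistic: squared difference of sample means
mean_shift = lambda a, b: float(((a.mean(axis=0) - b.mean(axis=0)) ** 2).sum())
```

The null hypothesis is rejected when the returned p-value falls below the chosen significance level.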
Figure 8a depicts the rejection rate of the null hypothesis over 1000 runs as a function of the outlier mean, based on the L2-distance estimated by LSDD and the KL-divergence estimated by KLIEP. The KLIEP-based test rejects the null hypothesis more frequently for a large outlier mean, whereas the rejection rate of the LSDD-based test is almost unchanged with respect to the outlier mean.
Figure 8b depicts the rejection rate of the null hypothesis as a function of the outlier rate. When the outlier rate is zero (i.e., no outliers), both the LSDD-based test and the KLIEP-based test accept the null hypothesis at approximately the designated significance level. For small positive outlier rates, the LSDD-based test still keeps a low rejection rate, whereas the KLIEP-based test tends to reject the null hypothesis more frequently. For large outlier rates, both tests always reject the null hypothesis.
Overall, the results imply that the two-sample homogeneity test by LSDD is more robust against outliers (i.e., two distributions tend to be regarded as the same even in the presence of outliers) than the KLIEP-based test.
4.2. Applications
Here, we apply LSDD to semisupervised class-balance estimation under class-prior change and to change-point detection in time series.
4.2.1. Semisupervised Class-Balance Estimation
In real-world pattern recognition tasks, changes in class balance between the training and test phases are often observed. In such cases, naive classifier training produces significant estimation bias because the class balance in the training data set does not properly reflect that of the test data set. Here, we consider the problem of learning the class balance of a test data set in a semisupervised learning setup where unlabeled test samples are provided in addition to labeled training samples (Chapelle, Schölkopf, & Zien, 2006).
Here, our objective is to estimate ptest(y) from labeled training samples drawn independently from ptrain(x, y) and unlabeled test samples drawn independently from ptest(x). Given test labels, ptest(y) could be naively estimated by the fraction of test samples belonging to class y; here, however, we want to estimate ptest(y) without the test labels.
As an additional baseline, we include the expectation-maximization (EM)-based class-prior estimation method (Saerens et al., 2002), which corresponds to distribution matching under the KL divergence. More specifically, in the EM-based algorithm, the test class-prior and test class-posterior estimates are iteratively updated as follows:
Obtain an estimate of the training class-posterior probability from the labeled training samples, for example, by kernel logistic regression (Hastie, Tibshirani, & Friedman, 2001) or its squared-loss variant (Sugiyama, 2010).
Obtain an estimate of the training class-prior probability from the training data as ny/n, where ny is the number of training samples in class y and n is the total number of training samples.
Iterate steps 4 and 5 until convergence.
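The intermediate steps are elided in this excerpt; for completeness, the fixed-point iteration of Saerens et al. (2002) can be sketched as follows (our notation: `post_train` holds the estimated training class-posteriors evaluated at the unlabeled test inputs, and `prior_train` the estimated training class priors):

```python
import numpy as np

def em_class_prior(post_train, prior_train, n_iter=100, tol=1e-8):
    """EM re-estimation of the test class priors (Saerens et al., 2002).

    post_train: array (n_test, n_classes) of training class-posteriors
                evaluated on the unlabeled test inputs.
    prior_train: array (n_classes,) of training class priors.
    """
    prior_test = prior_train.copy()
    for _ in range(n_iter):
        # E-step: reweight posteriors by the current prior ratio, renormalize
        w = post_train * (prior_test / prior_train)
        post_test = w / w.sum(axis=1, keepdims=True)
        # M-step: the new prior is the average test class-posterior
        new_prior = post_test.mean(axis=0)
        if np.abs(new_prior - prior_test).max() < tol:
            return new_prior
        prior_test = new_prior
    return prior_test
```

Each iteration matches the test-input distribution under the current prior estimate, which is what makes this procedure a KL-based distribution-matching baseline.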
The right graphs in Figure 10 plot the test misclassification error over 1000 runs. The results show that the LSDD-based method provides lower classification errors, presumably owing to its good estimates of the test class balance.
4.2.2. Unsupervised Change Detection
The objective of change detection is to discover abrupt property changes behind time-series data.
Our strategy is to compute a certain dissimilarity measure between two consecutive segments of the time series and use it as the plausibility of a change point (see Figure 11). As the dissimilarity measure, we use the L2-distance estimated by LSDD and the KL-divergence estimated by the KL importance estimation procedure (KLIEP) (Sugiyama et al., 2008). We set k=10 and r=50.
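Schematically, the change-score computation looks as follows (a simplified sketch of our own; the letter's method builds segments from k-dimensional subsequences with r retrospective windows, which we collapse into a single window length for clarity, and the mean-gap statistic is only a stand-in for the LSDD- or KLIEP-based dissimilarity):

```python
import numpy as np

def change_scores(series, window, dissimilarity):
    """Score each time point by the dissimilarity between the segments
    immediately before and after it; high scores suggest change points."""
    scores = np.full(len(series), np.nan)
    for t in range(window, len(series) - window):
        left = series[t - window:t]
        right = series[t:t + window]
        scores[t] = dissimilarity(left, right)
    return scores

# stand-in dissimilarity: absolute difference of segment means
mean_gap = lambda a, b: abs(float(np.mean(a)) - float(np.mean(b)))
```

Peaks in the resulting score sequence are then reported as candidate change points.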
We use two data sets. One is the IPSJ SIG-SLP Corpora and Environments for Noisy Speech Recognition (CENSREC) data set (http://research.nii.ac.jp/src/en/CENSREC-1-C.html). This data set, provided by the National Institute of Informatics, Japan, records human voice in noisy environments such as a restaurant. The other data set is the Human Activity Sensing Consortium (HASC) Challenge 2011 (http://hasc.jp/hc2011/), which provides human activity information collected by portable three-axis accelerometers. Because the orientation of the accelerometers is not necessarily fixed, we take the norm of the three-dimensional acceleration data. The HASC data set is relatively simple, so we artificially add zero-mean gaussian noise with standard deviation 5 at each time point with probability 0.005.
The top graphs in Figure 12 display the original time series, in which the true change points were manually annotated. The time-series data in Figure 12a correspond to a sequence of noise-speech-noise-speech-noise, whereas those in Figure 12b correspond to a sequence of actions jog-stay-stair down-stay-stair up. The bottom graphs in Figure 12 plot the change scores obtained by each method. The results show that the LSDD-based change score indicates the existence of change points more clearly than the KLIEP-based approach. The superior performance of LSDD over the KLIEP-based change score is presumably due to its robustness against outliers (see section 4.1.2).
Finally, we compare the change-detection performance more systematically using the receiver operating characteristic (ROC) curves (the false-positive rate versus the true-positive rate) and the area under the ROC curve (AUC) values. In addition to LSDD and KLIEP, we also test the L2-distance estimated by KDEi and KDEj (see section 4.1.1). Moreover, in our comparison, we include native change detection methods based on autoregressive models (AR) (Takeuchi & Yamanishi, 2006), subspace identification (SI) (Kawahara, Yairi, & Machida, 2007), singular spectrum transformation (SST) (Moskvina & Zhigljavsky, 2003), one-class support vector machine (SVM) (Desobry, Davy, & Doncarli, 2005), kernel Fisher discriminant analysis (KFD) (Harchaoui, Bach, & Moulines, 2009), and kernel change-point detection (KCP) (Arlot, Celisse, & Harchaoui, 2012). Tuning parameters included in these methods were manually optimized.
We use 10 data sets taken from each of the CENSREC and HASC data collections. Mean ROC curves are plotted in Figure 13, and AUC values are described in Table 1. The experimental results show that LSDD tends to outperform other methods and is comparable to state-of-the-art native change-detection methods.
Table 1: AUC values of LSDD, KDEi, KDEj, KLIEP, AR, SI, SST, SVM, KFD, and KCP on the CENSREC and HASC data sets.
Note: The best method and comparable ones in terms of mean AUC values by the t-test (Henkel, 1976) at the significance level of 5% are indicated with boldface.
In this letter, we proposed a method for directly estimating the difference between two probability density functions without density estimation. The proposed method, the least-squares density-difference (LSDD) estimator, was derived within the framework of kernel regularized least-squares estimation, and its solution can be computed analytically in a computationally efficient and stable manner. Furthermore, LSDD is equipped with cross-validation, and thus all tuning parameters, such as the kernel width and the regularization parameter, can be systematically and objectively optimized. We showed the asymptotic normality of LSDD in a parametric setup and derived a finite-sample error bound for LSDD in a nonparametric setup. In both cases, LSDD was shown to achieve the optimal convergence rates.
We also proposed an L2-distance estimator based on LSDD, which nicely cancels the bias caused by regularization. The LSDD-based L2-distance estimator was experimentally shown to be more accurate than differences of kernel density estimators and more robust against outliers than a Kullback-Leibler divergence estimator. However, we also experimentally observed that cross-validation of LSDD is sometimes rather unstable when the target density difference is zero (i.e., two distributions are equivalent). This can potentially cause performance degradation in two-sample homogeneity testing because estimation of zero density-difference is repeatedly executed when approximating the null distribution in the permutation-test framework. Stabilizing cross-validation and improving the accuracy of density-difference estimation when the target density-difference is zero is a topic for future work.
A related line of research to density-difference estimation is density-ratio estimation (Sugiyama et al., 2012a), which directly estimates the ratio of probability densities without separate density estimation (Qin, 1998; Huang et al., 2007; Bickel, Brückner, & Scheffer, 2007; Sugiyama et al., 2008; Kanamori et al., 2009; Sugiyama et al., 2012b). Potential weaknesses of density-ratio estimation are that density ratios can be unbounded even for simple cases (Cortes et al., 2010) and their estimation may suffer from outliers (Basu et al., 1998; Scott, 2001; Besbeas & Morgan, 2004).
To mitigate these weaknesses, the concept of relative density ratios, which "flatten" the density ratio by mixing the two densities in the denominator, was introduced recently (Yamada et al., 2013). Even when the plain density ratio is unbounded, the relative density ratio is always bounded. Although estimation of relative density ratios and approximation of relative divergences were demonstrated to be more reliable (Yamada et al., 2013), there is no systematic method to choose the relativity parameter, which is a critical limitation in practice.
On the other hand, density-difference estimation is more advantageous than density-ratio estimation in the sense that density differences are always bounded as long as each density is bounded, their estimation is robust against outliers (Basu et al., 1998; Scott, 2001; Besbeas & Morgan, 2004), and there is no tuning parameter such as the relativity parameter. However, a potential weakness of density differences is that they cannot be used for importance sampling (Sugiyama & Kawanabe, 2012) or conditional probability estimation (Sugiyama, Takeuchi, et al., 2010; Sugiyama, 2010), which are promising uses of density-ratio estimation. Thus, further exploring uses of density-difference estimation, particularly in tasks for which density-ratio estimation cannot be used, is promising future work.
A simple application of density-difference estimation would be probabilistic pattern recognition, because the sign of the density difference gives the Bayes-optimal decision (Duda, Hart, & Stork, 2001). Furthermore, in the context of pattern recognition with a reject option, the density difference can be used for finding the optimal rejection threshold (Chow, 1970). In future work, we will investigate the behavior of LSDD in probabilistic pattern recognition theoretically and experimentally.
Density-difference estimation is a novel research paradigm in machine learning, and we have proposed a simple but useful method for this emerging topic. Our future work will develop more powerful algorithms for density-difference estimation. For example, considering more general loss functions than the squared loss (Sugiyama et al., 2012b) and incorporating dimension reduction (von Bünau, Meinecke, Király, & Müller, 2009; Sugiyama, Kawanabe, & Chui, 2010; Sugiyama et al., 2011; Yamada & Sugiyama, 2011) would be interesting directions to pursue. Exploring a wide variety of real-world applications is also important future work.
Appendix A: Technical Details of Nonparametric Convergence Analysis in Section 2.3.2
We assume the following conditions:
Then we have the following theorem:
To prove this, we use the technique developed in Eberts and Steinwart (2011) for a regression problem:
Note that this is the optimal learning rate for estimating a function of the corresponding regularity (Eberts & Steinwart, 2011). Therefore, the density-difference estimator with a gaussian kernel achieves the optimal learning rate by appropriately choosing the regularization parameter and the gaussian width. Because the learning rate depends on the regularity of the true function, the LSDD estimator is adaptive to its smoothness.
Our analysis heavily relies on the techniques developed in Eberts and Steinwart (2011) for a regression problem. The main difference is that their analysis involves a clipping procedure, which stems from the fact that the analyzed estimator requires an empirical approximation of the expectation of the square term. The Lipschitz continuity of the square function is used to investigate this term, and the clipping procedure ensures this Lipschitz continuity. In this letter, the corresponding quantity can be computed exactly, so the Lipschitz continuity is not needed.
Appendix B: Derivation of Equation 3.6
Appendix C: Derivation of Equation 3.7
We thank Wittawat Jitkrittum and John Quinn for their comments and Zaïd Harchaoui for providing us a program code of kernel change-point detection. M.S. was supported by MEXT KAKENHI 23300069 and AOARD; T.K. was supported by MEXT KAKENHI 24500340; T.S. was supported by MEXT KAKENHI 22700289, the Aihara Project, the FIRST program from JSPS initiated by CSTP, and the Global COE Program “The research and Training Center for New development in Mathematics,” MEXT, Japan; M. du P. was supported by MEXT Scholarship; S.L. was supported by the JST PRESTO program; and I.T. was supported by MEXT KAKENHI 23700165. An earlier version of this paper was presented at the Neural Information Processing Systems conference, 2012 (Sugiyama, Suzuki, Kanamori, du Plessis, Liu, & Takeuchi, 2012).
More precisely, Vapnik (1998) said, “If you possess a restricted amount of information for solving some problem, try to solve the problem directly and never solve a more general problem as an intermediate step. It is possible that the available information is sufficient for a direct solution but is insufficient for solving a more general intermediate problem.” Two-stage density-difference estimation corresponds to solving a more general problem of separate density estimation in the first stage.
More specifically, the regularizer is changed from the squared norm of the parameter vector to the squared RKHS norm of the learned function, which is necessary to establish consistency. Nevertheless, we use the squared norm of the parameter vector in experiments because it is simpler and performs well in practice.
Estimation of the KL-divergence from data has been studied extensively (Wang, Kulkarni, & Verdú, 2005; Sugiyama et al., 2008; Pérez-Cruz, 2008; Silva & Narayanan, 2010; Nguyen et al., 2010); KLIEP was shown to possess a superior convergence property and demonstrated to work well in practice (Sugiyama et al., 2008). KLIEP is based on direct estimation of the density ratio without estimating the two densities themselves. See also Nguyen et al. (2010), which proposes essentially the same procedure.