## Abstract

We address the problem of estimating the difference between two probability densities. A naive approach is a two-step procedure of first estimating two densities separately and then computing their difference. However, this procedure does not necessarily work well because the first step is performed without regard to the second step, and thus a small estimation error incurred in the first stage can cause a big error in the second stage. In this letter, we propose a single-shot procedure for directly estimating the density difference without separately estimating two densities. We derive a nonparametric finite-sample error bound for the proposed single-shot density-difference estimator and show that it achieves the optimal convergence rate. We then show how the proposed density-difference estimator can be used in *L*^{2}-distance approximation. Finally, we experimentally demonstrate the usefulness of the proposed method in robust distribution comparison such as class-prior estimation and change-point detection.

## 1. Introduction

When estimating a quantity consisting of two elements, a two-stage approach of first estimating the two elements separately and then approximating the target quantity based on the estimates of the two elements often performs poorly; the reason is that the first stage is carried out without regard to the second stage, and thus a small estimation error incurred in the first stage can cause a big error in the second stage. To cope with this problem, it would be more appropriate to directly estimate the target quantity in a single-shot process without separately estimating the two elements.

A seminal example that follows this general idea is pattern recognition by the support vector machine (Boser, Guyon, & Vapnik, 1992; Cortes & Vapnik, 1995; Vapnik, 1998).^{1} Instead of separately estimating the two probability distributions of positive and negative patterns, the support vector machine directly learns the boundary between the positive and negative classes that is sufficient for pattern recognition. More recently, the problem of estimating the ratio of two probability densities was tackled in a similar fashion (Qin, 1998; Sugiyama et al., 2008; Gretton et al., 2009; Kanamori, Hido, & Sugiyama, 2009; Nguyen, Wainwright, & Jordan, 2010; Kanamori, Suzuki, & Sugiyama, 2012; Sugiyama, Suzuki, & Kanamori, 2012a, 2012b): the ratio of two probability densities is directly estimated without going through separate estimation of the two probability densities.

In this letter, we explore this line of research and propose a method for directly estimating the difference between two probability densities in a single-shot process. Density ratios and density differences can both be used for comparing probability densities via approximation of divergences such as the Kullback-Leibler (KL) divergence (Kullback & Leibler, 1951) and the *L*^{2}-distance. A divergence estimator can be used for solving various machine learning tasks, including class-balance estimation under class-prior change (Saerens, Latinne, & Decaestecker, 2002; du Plessis & Sugiyama, 2012), image segmentation and registration (Liu et al., 2010; Atif, Ripoche, & Osorio, 2003), target object detection and recognition (Gray & Principe, 2010; Yamanaka, Matsugu, & Sugiyama, forthcoming b), feature selection and extraction (Torkkola, 2003; Suzuki & Sugiyama, 2013), and change-point detection in time series (Kawahara & Sugiyama, 2012; Liu, Yamada, Collier, & Sugiyama, 2013; Yamanaka, Matsugu, & Sugiyama, forthcoming a). In this divergence approximation scenario, density differences are more advantageous than density ratios in several aspects. For example, density ratios can be unbounded even for simple cases (Cortes, Mansour, & Mohri, 2010; Yamada, Suzuki, Kanamori, Hachiya, & Sugiyama, 2013), whereas density differences are always bounded as long as both densities are bounded. Thus, density differences are expected to be learned more easily than density ratios. Also, density ratios are asymmetric, and thus the "direction" needs to be determined by a user, whereas density differences are symmetric, and thus there is no need to think about the direction. These are our primary motivations for developing a density-difference estimator.

Note that density ratios have their own applications beyond divergence approximation to which density differences may not be applied, such as importance sampling and conditional probability estimation (Sugiyama et al., 2012a). On the other hand, density differences also have their own unique applications to which density ratios may not be applied, such as the estimation of highest density-difference regions in flow cytometric data analysis (Duong, Koch, & Wand, 2009) and unsupervised labeling (du Plessis, 2013). Thus, neither density ratios nor density differences subsume the other in terms of their ranges of application. This is our additional motivation to pursue a practical algorithm for density-difference estimation.

For this density-difference estimation problem, we propose a single-shot method, the least-squares density-difference (LSDD) estimator, that directly estimates the density difference without separately estimating two densities. LSDD is derived within the framework of kernel regularized least-squares estimation, and its solution can be computed analytically in a computationally efficient and stable manner. Furthermore, LSDD is equipped with cross-validation, and thus all tuning parameters such as the kernel width and the regularization parameter can be systematically and objectively optimized. We derive a finite-sample error bound for the LSDD estimator in a nonparametric setup and show that it achieves the optimal convergence rate.

We also apply LSDD to *L*^{2}-distance estimation and show that it is more accurate than the difference of KDEs, which tends to severely underestimate the *L*^{2}-distance (Anderson, Hall, & Titterington, 1994). Compared with the KL divergence, the *L*^{2}-distance is more robust against outliers (Basu, Harris, Hjort, & Jones, 1998; Scott, 2001; Besbeas & Morgan, 2004). We experimentally demonstrate the usefulness of LSDD in robust distribution comparison such as semisupervised class-prior estimation and unsupervised change detection.

The rest of this letter is structured as follows. In section 2, we derive the LSDD method and investigate its theoretical properties. In section 3, we show how LSDD can be utilized for *L*^{2}-distance approximation. In section 4, we illustrate the numerical behavior of LSDD. We conclude in section 5.

## 2. Density-Difference Estimation

In this section, we propose a single-shot method for estimating the difference between two probability densities from samples and analyze its theoretical properties.

### 2.1. Problem Formulation and Naive Approach

First, we formulate the problem of density-difference estimation. Suppose we are given two sets of samples $X := \{\boldsymbol{x}_{i}\}_{i=1}^{n}$ and $X' := \{\boldsymbol{x}'_{j}\}_{j=1}^{n'}$ drawn independently from probability distributions on $\mathbb{R}^{d}$ with densities *p*(**x**) and *q*(**x**), respectively. Our goal is to estimate the density difference

$$f(\boldsymbol{x}) := p(\boldsymbol{x}) - q(\boldsymbol{x})$$

from the samples *X* and *X*′. A naive approach is to first estimate the two densities separately by kernel density estimation (KDE) and then compute the difference of the two KDEs.

However, we argue that the KDE-based density-difference estimator is not the best approach because of its two-step nature: a small estimation error incurred in each density estimate can cause a big error in the final density-difference estimate. More intuitively, good density estimators tend to be smooth, and thus a density-difference estimator obtained from such smooth density estimators tends to be oversmoothed (Hall & Wand, 1988; Anderson et al., 1994; see also the numerical experiments in section 4.1.1).

To overcome this weakness, we give a single-shot procedure for directly estimating the density difference *f*(**x**) without separately estimating the densities *p*(**x**) and *q*(**x**).

### 2.2. Least-Squares Density-Difference Estimation

We fit a density-difference model *g*(**x**) to the true density-difference function *f*(**x**) under the squared loss:

$$\min_{g} \int \big(g(\boldsymbol{x}) - f(\boldsymbol{x})\big)^{2}\, \mathrm{d}\boldsymbol{x}.$$

We use the following linear-in-parameter model as *g*(**x**):

$$g(\boldsymbol{x}) = \sum_{\ell=1}^{b} \theta_{\ell}\, \psi_{\ell}(\boldsymbol{x}) = \boldsymbol{\theta}^{\top}\boldsymbol{\psi}(\boldsymbol{x}), \qquad (2.2)$$

where *b* denotes the number of basis functions, $\boldsymbol{\psi}(\boldsymbol{x}) = (\psi_{1}(\boldsymbol{x}), \ldots, \psi_{b}(\boldsymbol{x}))^{\top}$ is a *b*-dimensional basis function vector, $\boldsymbol{\theta}$ is a *b*-dimensional parameter vector, and $^{\top}$ denotes the transpose. In practice, we use the following nonparametric gaussian kernel model as *g*(**x**):

$$g(\boldsymbol{x}) = \sum_{\ell=1}^{n+n'} \theta_{\ell} \exp\left(-\frac{\|\boldsymbol{x}-\boldsymbol{c}_{\ell}\|^{2}}{2\sigma^{2}}\right), \qquad (2.3)$$

where $\boldsymbol{c}_{1}, \ldots, \boldsymbol{c}_{n+n'}$ are gaussian kernel centers taken at the sample points. If $n+n'$ is large, we may use only a subset of the samples as gaussian kernel centers.

For the model, equation 2.2, the parameter that minimizes the squared loss is given analytically as

$$\boldsymbol{\theta}^{*} = \mathop{\mathrm{argmin}}_{\boldsymbol{\theta}} \big[\boldsymbol{\theta}^{\top}\boldsymbol{H}\boldsymbol{\theta} - 2\boldsymbol{\theta}^{\top}\boldsymbol{h}\big] = \boldsymbol{H}^{-1}\boldsymbol{h},$$

where **H** is the $b \times b$ matrix and **h** is the *b*-dimensional vector defined as

$$H_{\ell,\ell'} := \int \psi_{\ell}(\boldsymbol{x})\, \psi_{\ell'}(\boldsymbol{x})\, \mathrm{d}\boldsymbol{x}, \qquad h_{\ell} := \int \psi_{\ell}(\boldsymbol{x})\, p(\boldsymbol{x})\, \mathrm{d}\boldsymbol{x} - \int \psi_{\ell}(\boldsymbol{x})\, q(\boldsymbol{x})\, \mathrm{d}\boldsymbol{x}.$$

Note that for the gaussian kernel model, equation 2.3, the integral in **H** can be computed analytically as

$$H_{\ell,\ell'} = \big(\pi\sigma^{2}\big)^{d/2} \exp\left(-\frac{\|\boldsymbol{c}_{\ell}-\boldsymbol{c}_{\ell'}\|^{2}}{4\sigma^{2}}\right),$$

where *d* denotes the dimensionality of **x**. This is part of the reason that we chose the gaussian kernel model in practice. Another reason for this choice is its theoretical superiority, as discussed in section 2.3.2.

Replacing the expectations in **h** with empirical estimators and adding an $\ell_{2}$-regularizer to the objective function, we arrive at the following optimization problem:

$$\widehat{\boldsymbol{\theta}} := \mathop{\mathrm{argmin}}_{\boldsymbol{\theta}} \big[\boldsymbol{\theta}^{\top}\boldsymbol{H}\boldsymbol{\theta} - 2\boldsymbol{\theta}^{\top}\widehat{\boldsymbol{h}} + \lambda\boldsymbol{\theta}^{\top}\boldsymbol{\theta}\big], \qquad (2.4)$$

where λ (≥ 0) is the regularization parameter and $\widehat{\boldsymbol{h}}$ is the *b*-dimensional vector defined as

$$\widehat{h}_{\ell} := \frac{1}{n}\sum_{i=1}^{n} \psi_{\ell}(\boldsymbol{x}_{i}) - \frac{1}{n'}\sum_{j=1}^{n'} \psi_{\ell}(\boldsymbol{x}'_{j}).$$

Taking the derivative of the objective function in equation 2.4 and equating it to zero, we can obtain the solution analytically as

$$\widehat{\boldsymbol{\theta}} = \big(\boldsymbol{H} + \lambda\boldsymbol{I}_{b}\big)^{-1}\widehat{\boldsymbol{h}},$$

where *I*_{b} denotes the *b*-dimensional identity matrix.
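The closed-form solution above is straightforward to implement. The following is a minimal numerical sketch (not the authors' Matlab implementation); the function name `lsdd_fit` and the fixed choices of the gaussian width and regularization parameter are illustrative, since in practice they would be chosen by the cross-validation procedure of section 2.4.

```python
import numpy as np

def lsdd_fit(Xp, Xq, sigma, lam):
    """Sketch of the LSDD solution theta_hat = (H + lam*I)^(-1) h_hat
    for the gaussian kernel model with all samples as kernel centers."""
    C = np.vstack([Xp, Xq])          # kernel centers c_1, ..., c_{n+n'}
    b, d = C.shape

    # H_{l,l'} = (pi*sigma^2)^(d/2) * exp(-||c_l - c_l'||^2 / (4 sigma^2)):
    # the analytic integral of a product of two gaussian kernels over R^d.
    sq_cc = np.sum((C[:, None, :] - C[None, :, :]) ** 2, axis=2)
    H = (np.pi * sigma ** 2) ** (d / 2) * np.exp(-sq_cc / (4 * sigma ** 2))

    def phi(X):  # design matrix of gaussian basis functions
        sq = np.sum((X[:, None, :] - C[None, :, :]) ** 2, axis=2)
        return np.exp(-sq / (2 * sigma ** 2))

    # h_hat: empirical basis-function means under p minus under q.
    h = phi(Xp).mean(axis=0) - phi(Xq).mean(axis=0)

    theta = np.linalg.solve(H + lam * np.eye(b), h)  # ridge-type linear system
    return (lambda X: phi(X) @ theta), theta, H, h

# Toy usage: estimate f = p - q for two shifted 1-D gaussians.
rng = np.random.default_rng(0)
Xp = rng.normal(0.0, 1.0, size=(100, 1))
Xq = rng.normal(0.5, 1.0, size=(100, 1))
f_hat, theta, H, h = lsdd_fit(Xp, Xq, sigma=0.5, lam=0.1)
```

Because **H** is available in closed form for the gaussian model and the system is solved once, the estimate is obtained without any iterative optimization.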

### 2.3. Theoretical Analysis

Here, we theoretically investigate the behavior of the LSDD estimator.

#### 2.3.1. Parametric Convergence

First, we consider a linear parametric setup where basis functions in our density-difference model 2.2 are fixed.

In this case, the LSDD estimator $\widehat{\boldsymbol{\theta}}$, centered at the optimal parameter and suitably rescaled, converges in distribution to the normal distribution with mean **0** and a covariance matrix expressed through **H** and the covariance matrices $\boldsymbol{V}_{p}$ and $\boldsymbol{V}_{q}$, where *V*_{p} denotes the covariance matrix of $\boldsymbol{\psi}(\boldsymbol{x})$ under the probability density *p*(**x**):

$$\boldsymbol{V}_{p} := \mathbb{E}_{p}\big[\boldsymbol{\psi}(\boldsymbol{x})\boldsymbol{\psi}(\boldsymbol{x})^{\top}\big] - \mathbb{E}_{p}\big[\boldsymbol{\psi}(\boldsymbol{x})\big]\,\mathbb{E}_{p}\big[\boldsymbol{\psi}(\boldsymbol{x})\big]^{\top}, \qquad (2.6)$$

and $\mathbb{E}_{p}[\boldsymbol{\psi}(\boldsymbol{x})]$ denotes the expectation of $\boldsymbol{\psi}(\boldsymbol{x})$ under the probability density *p*(**x**):

$$\mathbb{E}_{p}\big[\boldsymbol{\psi}(\boldsymbol{x})\big] := \int \boldsymbol{\psi}(\boldsymbol{x})\, p(\boldsymbol{x})\, \mathrm{d}\boldsymbol{x}$$

($\boldsymbol{V}_{q}$ and $\mathbb{E}_{q}$ are defined analogously with *q*(**x**)).

This result implies that the LSDD estimator has asymptotic normality with asymptotic order $1/\sqrt{\min(n,n')}$, the optimal convergence rate in the parametric setup.

#### 2.3.2. Nonparametric Error Bound

Next, we consider a nonparametric setup where a density-difference function is learned in a gaussian reproducing kernel Hilbert space (RKHS) (Aronszajn, 1950).

Then it can be shown that there exists a constant *K* such that, for all $n, n' \ge 1$ and $\tau \ge 1$, the nonparametric LSDD estimator $\widehat{f}$ with an appropriate choice of the gaussian width σ and the regularization parameter λ satisfies

$$\int \big(\widehat{f}(\boldsymbol{x}) - f(\boldsymbol{x})\big)^{2}\, \mathrm{d}\boldsymbol{x} \le K\,\tau\, \min(n,n')^{-\frac{2\alpha}{2\alpha+d}+\varepsilon}$$

for any $\varepsilon > 0$, with probability not less than $1 - e^{-\tau}$.^{4} Here, *d* denotes the dimensionality of input vector **x**, and $\alpha > 0$ denotes the regularity of the Besov space to which the true density-difference function *f* belongs (smaller/larger α means *f* is “less/more complex”; see appendix A for its precise definition). Because $\min(n,n')^{-2\alpha/(2\alpha+d)}$ is the optimal learning rate in this setup (Eberts & Steinwart, 2011), the result shows that the nonparametric LSDD estimator achieves the optimal convergence rate.

It is known that if the naive KDE with a gaussian kernel is used for estimating a probability density with regularity α, the optimal learning rate cannot be achieved (Farrell, 1972; Silverman, 1986). To achieve the optimal rate by KDE, we should choose a kernel function specifically tailored to each regularity α (Parzen, 1962). However, such a kernel function takes negative values and is difficult to implement in practice. In contrast, our LSDD estimator always achieves the optimal learning rate with a gaussian kernel, without regard to the regularity α.

### 2.4. Model Selection by Cross-Validation

The theoretical analyses showed the superiority of LSDD in terms of the convergence rates. However, the practical performance of LSDD depends on the choice of models (i.e., the kernel width and the regularization parameter ). Here, we show that the model can be optimized by cross-validation (CV).

First, we divide the samples $X = \{\boldsymbol{x}_{i}\}_{i=1}^{n}$ and $X' = \{\boldsymbol{x}'_{j}\}_{j=1}^{n'}$ into *T* disjoint subsets $\{X_{t}\}_{t=1}^{T}$ and $\{X'_{t}\}_{t=1}^{T}$, respectively. Then we obtain a density-difference estimate $\widehat{f}_{t}(\boldsymbol{x})$ from $X \setminus X_{t}$ and $X' \setminus X'_{t}$ (i.e., all samples without $X_{t}$ and $X'_{t}$) and compute its holdout error for $X_{t}$ and $X'_{t}$ as

$$\mathrm{CV}^{(t)} := \int \widehat{f}_{t}(\boldsymbol{x})^{2}\, \mathrm{d}\boldsymbol{x} - \frac{2}{|X_{t}|}\sum_{\boldsymbol{x} \in X_{t}} \widehat{f}_{t}(\boldsymbol{x}) + \frac{2}{|X'_{t}|}\sum_{\boldsymbol{x}' \in X'_{t}} \widehat{f}_{t}(\boldsymbol{x}'),$$

where $|X|$ denotes the number of elements in the set *X*. We repeat this holdout validation procedure for $t = 1, \ldots, T$ and compute the average holdout error as

$$\mathrm{CV} := \frac{1}{T}\sum_{t=1}^{T} \mathrm{CV}^{(t)}.$$

Finally, we choose the model that minimizes CV.

A Matlab implementation of LSDD is available online from http://sugiyama-www.cs.titech.ac.jp/~sugi/software/LSDD/.
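The cross-validation procedure can be sketched as follows, assuming the helper `fit_theta` (an illustrative re-implementation of the closed-form solution of section 2.2, not the authors' Matlab code). The holdout score approximates $\int \widehat{f}_t^{\,2}\,\mathrm{d}\boldsymbol{x} - 2\int \widehat{f}_t\,(p-q)\,\mathrm{d}\boldsymbol{x}$, which equals the squared loss up to a constant.

```python
import numpy as np
from itertools import product

def fit_theta(C, Xp, Xq, sigma, lam):
    """Closed-form LSDD solution for gaussian kernel centers C (cf. section 2.2)."""
    d = C.shape[1]
    sq_cc = np.sum((C[:, None, :] - C[None, :, :]) ** 2, axis=2)
    H = (np.pi * sigma ** 2) ** (d / 2) * np.exp(-sq_cc / (4 * sigma ** 2))
    phi = lambda X: np.exp(-np.sum((X[:, None, :] - C[None, :, :]) ** 2, axis=2)
                           / (2 * sigma ** 2))
    h = phi(Xp).mean(axis=0) - phi(Xq).mean(axis=0)
    theta = np.linalg.solve(H + lam * np.eye(len(C)), h)
    return theta, H, phi

def lsdd_cv(Xp, Xq, sigmas, lams, T=5, seed=0):
    """T-fold CV over (sigma, lambda); returns the pair minimizing the
    average holdout estimate of the squared loss (up to a constant)."""
    rng = np.random.default_rng(seed)
    fp = rng.integers(T, size=len(Xp))   # fold labels for samples from p
    fq = rng.integers(T, size=len(Xq))   # fold labels for samples from q
    best, best_score = None, np.inf
    for sigma, lam in product(sigmas, lams):
        score = 0.0
        for t in range(T):
            Xp_tr, Xp_te = Xp[fp != t], Xp[fp == t]
            Xq_tr, Xq_te = Xq[fq != t], Xq[fq == t]
            C = np.vstack([Xp_tr, Xq_tr])
            theta, H, phi = fit_theta(C, Xp_tr, Xq_tr, sigma, lam)
            # int f_hat^2 dx = theta' H theta (analytic for gaussian kernels)
            score += (theta @ H @ theta
                      - 2 * phi(Xp_te).mean(axis=0) @ theta
                      + 2 * phi(Xq_te).mean(axis=0) @ theta) / T
        if score < best_score:
            best, best_score = (sigma, lam), score
    return best, best_score
```

All tuning parameters are thus selected objectively from the data alone, since the criterion involves no true densities.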

## 3. *L*^{2}-Distance Estimation by LSDD

In this section, we consider the problem of approximating the *L*^{2}-distance between *p*(**x**) and *q*(**x**),

$$L(p,q) := \int \big(p(\boldsymbol{x}) - q(\boldsymbol{x})\big)^{2}\, \mathrm{d}\boldsymbol{x}, \qquad (3.1)$$

from the samples *X* and *X*′.

### 3.1. Basic Forms

Because the *L*^{2}-distance can be expressed as $L(p,q) = \int f(\boldsymbol{x})\,p(\boldsymbol{x})\,\mathrm{d}\boldsymbol{x} - \int f(\boldsymbol{x})\,q(\boldsymbol{x})\,\mathrm{d}\boldsymbol{x}$, if we replace *f*(**x**) with an LSDD estimator $\widehat{f}(\boldsymbol{x}) = \widehat{\boldsymbol{\theta}}^{\top}\boldsymbol{\psi}(\boldsymbol{x})$ and approximate the expectations by empirical averages, the following *L*^{2}-distance estimator can be obtained:

$$\widehat{\boldsymbol{\theta}}^{\top}\widehat{\boldsymbol{h}}. \qquad (3.2)$$

Similarly, for another expression, $L(p,q) = \int f(\boldsymbol{x})^{2}\,\mathrm{d}\boldsymbol{x}$, replacing *f*(**x**) with an LSDD estimator gives another *L*^{2}-distance estimator:

$$\widehat{\boldsymbol{\theta}}^{\top}\boldsymbol{H}\widehat{\boldsymbol{\theta}}. \qquad (3.3)$$

As shown in the next subsection, combining these two expressions yields our final *L*^{2}-distance estimator,

$$2\widehat{\boldsymbol{\theta}}^{\top}\widehat{\boldsymbol{h}} - \widehat{\boldsymbol{\theta}}^{\top}\boldsymbol{H}\widehat{\boldsymbol{\theta}}. \qquad (3.4)$$

### 3.2. Reduction of Bias Caused by Regularization

To explain how equation 3.4 is obtained, consider a generalized *L*^{2}-distance estimator of the following form:

$$\beta\,\widehat{\boldsymbol{\theta}}^{\top}\widehat{\boldsymbol{h}} + (1-\beta)\,\widehat{\boldsymbol{\theta}}^{\top}\boldsymbol{H}\widehat{\boldsymbol{\theta}}, \qquad (3.5)$$

where β is a real scalar. If the regularization parameter λ (≥ 0) is small, then equation 3.5 can be expressed as

$$\widehat{\boldsymbol{h}}^{\top}\boldsymbol{H}^{-1}\widehat{\boldsymbol{h}} - (2-\beta)\,\lambda\,\widehat{\boldsymbol{h}}^{\top}\boldsymbol{H}^{-2}\widehat{\boldsymbol{h}} + o(\lambda), \qquad (3.6)$$

where *o* denotes the probabilistic order (the derivation of equation 3.6 is given in appendix B). Thus, up to $o(\lambda)$, the bias introduced by regularization (i.e., the second term on the right-hand side of equation 3.6 that depends on λ) can be eliminated if $\beta = 2$, which yields equation 3.4. Note that if no regularization is imposed (i.e., $\lambda = 0$), both equations 3.2 and 3.3 yield $\widehat{\boldsymbol{h}}^{\top}\boldsymbol{H}^{-1}\widehat{\boldsymbol{h}}$, the first term on the right-hand side of equation 3.6.

Equation 3.4 also has an interpretation as a lower bound of the *L*^{2}-distance: for any function *g*,

$$L(p,q) \ge 2\int g(\boldsymbol{x})\big(p(\boldsymbol{x}) - q(\boldsymbol{x})\big)\,\mathrm{d}\boldsymbol{x} - \int g(\boldsymbol{x})^{2}\,\mathrm{d}\boldsymbol{x},$$

with equality attained at *g* = *f*. If the expectations are replaced by empirical estimators and the linear-in-parameter model, equation 2.2, is used as *g*, the above maximization problem is reduced to the LSDD objective function without regularization (see equation 2.4). Thus, LSDD corresponds to approximately maximizing the above lower bound, and equation 3.4 is its maximum value.
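Given the quantities **H**, $\widehat{\boldsymbol{h}}$, and $\widehat{\boldsymbol{\theta}}$ from an LSDD fit, the three estimators above differ only in how these pieces are assembled. The following self-contained sketch uses illustrative data and a fixed (σ, λ), which are assumptions for the example only:

```python
import numpy as np

rng = np.random.default_rng(0)
Xp = rng.normal(0.0, 1.0, size=(200, 1))   # samples from p
Xq = rng.normal(1.0, 1.0, size=(200, 1))   # samples from q
sigma, lam = 0.6, 1e-2                     # illustrative fixed model

# LSDD building blocks (cf. section 2.2).
C = np.vstack([Xp, Xq])
b, d = C.shape
sq_cc = np.sum((C[:, None, :] - C[None, :, :]) ** 2, axis=2)
H = (np.pi * sigma ** 2) ** (d / 2) * np.exp(-sq_cc / (4 * sigma ** 2))
phi = lambda X: np.exp(-np.sum((X[:, None, :] - C[None, :, :]) ** 2, axis=2)
                       / (2 * sigma ** 2))
h = phi(Xp).mean(axis=0) - phi(Xq).mean(axis=0)
theta = np.linalg.solve(H + lam * np.eye(b), h)

# The three L2-distance estimators of section 3.
L_32 = theta @ h              # equation 3.2: theta' h
L_33 = theta @ H @ theta      # equation 3.3: theta' H theta
L_34 = 2 * L_32 - L_33        # equation 3.4: O(lambda) bias canceled
```

At λ = 0 the three values coincide; for λ > 0, the combination in equation 3.4 cancels the O(λ) bias term, as shown in equation 3.6.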

### 3.3. Further Bias Correction

The *L*^{2}-distance estimator, equation 3.4, eliminates the bias caused by regularization. However, it is actually still a slightly biased estimator of the target quantity $\boldsymbol{h}^{\top}\boldsymbol{H}^{-1}\boldsymbol{h}$:

$$\mathbb{E}\big[\widehat{\boldsymbol{h}}^{\top}\boldsymbol{H}^{-1}\widehat{\boldsymbol{h}}\big] = \boldsymbol{h}^{\top}\boldsymbol{H}^{-1}\boldsymbol{h} + \frac{1}{n}\,\mathrm{tr}\big(\boldsymbol{H}^{-1}\boldsymbol{V}_{p}\big) + \frac{1}{n'}\,\mathrm{tr}\big(\boldsymbol{H}^{-1}\boldsymbol{V}_{q}\big),$$

where $\mathbb{E}$ denotes the expectation over all samples *X* and *X*′, and *V*_{p} and *V*_{q} are defined by equation 2.6 (its derivation is given in appendix C).

Based on this expression, we can obtain a bias-corrected *L*^{2}-distance estimator as

$$2\widehat{\boldsymbol{\theta}}^{\top}\widehat{\boldsymbol{h}} - \widehat{\boldsymbol{\theta}}^{\top}\boldsymbol{H}\widehat{\boldsymbol{\theta}} - \frac{1}{n}\,\mathrm{tr}\big(\boldsymbol{H}^{-1}\widehat{\boldsymbol{V}}_{p}\big) - \frac{1}{n'}\,\mathrm{tr}\big(\boldsymbol{H}^{-1}\widehat{\boldsymbol{V}}_{q}\big),$$

where $\widehat{\boldsymbol{V}}_{p}$ is an empirical estimator of the covariance matrix *V*_{p},

$$\widehat{\boldsymbol{V}}_{p} := \frac{1}{n}\sum_{i=1}^{n} \boldsymbol{\psi}(\boldsymbol{x}_{i})\boldsymbol{\psi}(\boldsymbol{x}_{i})^{\top} - \widehat{\mathbb{E}}_{p}[\boldsymbol{\psi}]\,\widehat{\mathbb{E}}_{p}[\boldsymbol{\psi}]^{\top},$$

and $\widehat{\mathbb{E}}_{p}[\boldsymbol{\psi}]$ is an empirical estimator of the expectation $\mathbb{E}_{p}[\boldsymbol{\psi}(\boldsymbol{x})]$:

$$\widehat{\mathbb{E}}_{p}[\boldsymbol{\psi}] := \frac{1}{n}\sum_{i=1}^{n} \boldsymbol{\psi}(\boldsymbol{x}_{i})$$

(and similarly for *q*).

The *L*^{2}-distance is nonnegative by definition (see equation 3.1), but the above bias-corrected estimate can take a negative value. Following the same line as Baranchik (1964), rounding a negative estimate up to zero (a positive-part estimator) may be more accurate. However, in our preliminary experiments, the positive-part estimator does not always perform well, particularly when **H** is ill conditioned. For this reason, we propose using the estimator defined by equation 3.4.

## 4. Experiments

In this section, we experimentally evaluate the performance of LSDD.

### 4.1. Numerical Examples

First, we show numerical examples using artificial data sets.

#### 4.1.1. LSDD versus KDE

We compare LSDD with the following two KDE-based density-difference estimators:

- **KDEi:** Two gaussian widths are independently chosen from the above candidate values based on five-fold least-squares cross-validation. That is, for each density, we perform cross-validation in terms of the *L*^{2}-distance between the estimated and true densities so that each density is optimally approximated (Härdle et al., 2004).
- **KDEj:** Two gaussian widths are jointly chosen from the above candidate values based on five-fold cross-validation in terms of the LSDD criterion (Hall & Wand, 1988). That is, we compute the cross-validated LSDD criterion as a function of the two gaussian widths and find the best pair that minimizes the criterion.

We first illustrate the behavior of the LSDD- and KDE-based methods for *d* = 1. Figure 1 depicts the data samples and the density-difference estimation results obtained by LSDD, KDEi, and KDEj when the two densities are identical (i.e., the true density difference is *f*(*x*) = 0). Cross-validation scores of LSDD are included at the bottom of the figure. The figure shows that LSDD and KDEj give accurate estimates of the true density difference *f*(*x*) = 0. The density-difference estimate obtained by KDEi, however, fluctuates considerably, although both densities are reasonably well approximated by the KDEs. This illustrates an advantage of directly estimating the density difference without going through separate estimation of each density. Figure 2 depicts the results when the two densities are different, showing again that LSDD performs well. KDEi and KDEj give the same estimation result for this data set, which slightly underestimates the peaks.

Next, we compare the performance of *L*^{2}-distance approximation based on LSDD, KDEi, and KDEj. For *d* = 1, 5, we draw samples from the above *p*(**x**) and *q*(**x**). Figure 3 depicts the mean and standard error of the estimated *L*^{2}-distances over 1000 runs as functions of the gaussian mean of the second density. When *d* = 1 (see Figure 3a), the LSDD-based *L*^{2}-distance estimator gives the most accurate estimates of the true *L*^{2}-distance, whereas the KDEi-based *L*^{2}-distance estimator slightly underestimates the true *L*^{2}-distance when the gaussian mean is large. This is caused by the fact that KDE tends to provide smooth density estimates (see Figure 2c again). Such smooth density estimates are accurate as density estimates, but the difference of smooth density estimates yields a small *L*^{2}-distance estimate (Anderson et al., 1994). More specifically, the density is estimated accurately at around *x* = 0.5, but negative values of the density difference *f*(*x*) are underestimated there, because the density estimate is smoother than the true density *p*(*x*) and thus its tail values at around *x* = 0.5 are larger. The KDEj-based *L*^{2}-distance estimator mitigates this drawback of KDEi to some extent, but it still slightly underestimates the true *L*^{2}-distance when the gaussian mean is large.

When *d* = 5 (see Figure 3b), the KDE-based *L*^{2}-distance estimators severely underestimate the true *L*^{2}-distance when the gaussian mean is large, but the LSDD-based *L*^{2}-distance estimator still gives reasonably accurate estimates even when *d* = 5. We note, however, that LSDD also slightly underestimates the true *L*^{2}-distance when the gaussian mean is large; slight underestimation tends to yield smaller variance, and thus such stabilized solutions are more accurate in terms of the bias-variance trade-off.

In Figure 1, we illustrated that LSDD and KDEj work better than KDEi when the two gaussian densities are identical. However, in Figure 3a, KDEi is shown to be the best-performing method in that case in terms of the average over 1000 runs. To fill this gap, we depict in Figure 4 histograms of the *L*^{2}-distance estimates obtained by LSDD, KDEi, and KDEj over 1000 runs for that setup. The histograms show that LSDD and KDEj give exactly correct solutions (i.e., zero) about 300 times, whereas KDEi gives estimates of about 0.01 more than 300 times. The graphs plotted in Figure 1 correspond to such typical results, where LSDD and KDEj outperform KDEi. On the other hand, KDEi almost always gives stable estimates of less than 0.1, whereas LSDD and KDEj occasionally give large estimates. This rather unstable behavior of LSDD and KDEj, caused by an inappropriate choice of the gaussian width (and, for LSDD, the regularization parameter) by cross-validation, led to larger mean values for LSDD and KDEj than for KDEi in Figure 3a.

Finally, we investigate the behavior of LSDD, KDEi, and KDEj in *L*^{2}-distance estimation when the numbers of samples from the two distributions are imbalanced. Figure 5 plots the means and standard errors of the *L*^{2}-distance estimates obtained by LSDD, KDEi, and KDEj for *d* = 1, 5 over 1000 runs as functions of *n*′ when *n* is fixed to 200.

When *d* = 1 (see Figure 5a), all three methods behave similarly, and the accuracy tends to improve as *n*′ increases, although the improvement is moderate. When the input dimensionality is increased to *d* = 5 (see Figure 5b), LSDD and KDEj still show the same tendency. However, KDEi behaves differently: its approximation error tends to grow as *n*′ increases. This implies that improving the accuracy of one of the density estimates does not necessarily improve the overall estimation accuracy of the density difference.

When *d* = 1 (see Figure 5c), LSDD tends to provide better estimates as *n*′ increases, whereas KDEi and KDEj keep underestimating the true *L*^{2}-distance even when *n*′ is increased. Finally, when *d* = 5 (see Figure 5d), LSDD stably provides reasonably good results, and its performance does not change significantly as *n*′ is increased. On the other hand, KDEi and KDEj tend to give better results as *n*′ increases.

Overall, LSDD sometimes gives slightly better results when the sample sizes are imbalanced, but its performance is not significantly different from the balanced case. The accuracy of KDEi and KDEj as *n*′ is increased gets better or worse depending on the situation. Thus, having more data samples from one of the distributions does not always seem to improve the estimation accuracy in density-difference estimation.

#### 4.1.2. *L*^{2}-Distance versus KL Divergence

The Kullback-Leibler (KL) divergence from *p*(**x**) to *q*(**x**) is defined as

$$\mathrm{KL}(p\,\|\,q) := \int p(\boldsymbol{x}) \log \frac{p(\boldsymbol{x})}{q(\boldsymbol{x})}\, \mathrm{d}\boldsymbol{x}.$$

Here, we compare the robustness of the *L*^{2}-distance and the KL-divergence against outliers. For *d* = 1, we let *p* and *q* be gaussian mixture densities sharing the standard gaussian component *N*(*x*; 0, 1^{2}) but contaminated by an additional gaussian component. The implication of these densities is that samples drawn from *N*(*x*; 0, 1^{2}) are inliers, whereas samples drawn from the additional component are outliers. We control the contamination through the outlier rate and the outlier mean (see Figure 6).

Figure 7a depicts the *L*^{2}-distance and the KL-divergence as functions of the outlier mean. This shows that both the *L*^{2}-distance and the KL-divergence increase as the outlier mean increases. However, the *L*^{2}-distance is bounded from above, whereas the KL-divergence diverges to infinity as the outlier mean tends to infinity. This result implies that the *L*^{2}-distance is less sensitive to outliers than the KL-divergence, which agrees well with the observation given in Basu et al. (1998).

Next, we draw samples from *p*(*x*) and *q*(*x*) and estimate the *L*^{2}-distance by LSDD and the KL-divergence by the Kullback-Leibler importance estimation procedure (KLIEP) (Sugiyama et al., 2008).^{5} Figure 7b depicts the estimated *L*^{2}-distance and KL-divergence over 100 runs as functions of the outlier mean. This shows that both LSDD and KLIEP reasonably capture the profiles of the true *L*^{2}-distance and KL-divergence, although the scale of the KLIEP values differs considerably from the true values (see Figure 7a) because the estimated normalization factor was unreliable.

Finally, based on the permutation test procedure (Efron & Tibshirani, 1993), we conduct hypothesis testing of the null hypothesis that the densities *p* and *q* are the same. More specifically, we first compute a distance/divergence estimate for the original data sets *X* and *X*′. Next, we randomly permute all samples in $X \cup X'$, assign the first *n* samples to a set $\widetilde{X}$ and the remaining *n*′ samples to another set $\widetilde{X}'$, and compute a distance/divergence estimate again using the randomly permuted data sets $\widetilde{X}$ and $\widetilde{X}'$. Because $\widetilde{X}$ and $\widetilde{X}'$ can be regarded as being drawn from the same distribution, the permuted estimate would take a value close to zero. This random permutation procedure is repeated many times (100 times in the following experiments), and the distribution of the estimate under the null hypothesis (i.e., the two distributions are the same) is constructed. Finally, the *p*-value is approximated by evaluating the relative ranking of the original estimate in the histogram of the permuted estimates. We set the significance level at 5%.
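The permutation test described above is generic in the distance/divergence estimator. The sketch below uses a simple placeholder statistic (squared difference of sample means), purely for illustration; in the actual procedure, the LSDD-based *L*^{2}-distance estimate would be plugged in as `statistic`.

```python
import numpy as np

def permutation_test(Xp, Xq, statistic, n_perm=100, seed=0):
    """Approximate p-value of the null hypothesis p = q by comparing the
    observed statistic with its distribution under random relabeling."""
    rng = np.random.default_rng(seed)
    observed = statistic(Xp, Xq)
    pooled = np.vstack([Xp, Xq])
    n = len(Xp)
    null_stats = []
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        null_stats.append(statistic(pooled[perm[:n]], pooled[perm[n:]]))
    # p-value: fraction of permuted statistics at least as large as observed.
    return (1 + np.sum(np.array(null_stats) >= observed)) / (1 + n_perm)

# Placeholder statistic: squared difference of sample means (a stand-in for
# an L2-distance estimate such as LSDD's).
stat = lambda A, B: float(np.sum((A.mean(axis=0) - B.mean(axis=0)) ** 2))

rng = np.random.default_rng(1)
p_same = permutation_test(rng.normal(0, 1, (50, 1)),
                          rng.normal(0, 1, (50, 1)), stat)
p_diff = permutation_test(rng.normal(0, 1, (50, 1)),
                          rng.normal(2, 1, (50, 1)), stat)
```

The `(1 + count) / (1 + n_perm)` form avoids reporting an exact zero *p*-value from a finite number of permutations.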

Figure 8a depicts the rejection rate of the null hypothesis (i.e., *p* = *q*) over 1000 runs as a function of the outlier mean for a fixed outlier rate, based on the *L*^{2}-distance estimated by LSDD and the KL-divergence estimated by KLIEP. This shows that the KLIEP-based test rejects the null hypothesis more frequently as the outlier mean grows, whereas the rejection rate of the LSDD-based test is almost unchanged with respect to the outlier mean.

Figure 8b depicts the rejection rate of the null hypothesis as a function of the outlier rate for a fixed outlier mean. When the outlier rate is zero (i.e., no outliers), both the LSDD-based test and the KLIEP-based test accept the null hypothesis approximately at the designated significance level. As the outlier rate increases, the LSDD-based test still keeps a low rejection rate, whereas the KLIEP-based test tends to reject the null hypothesis more frequently. When the outlier rate is large, both the LSDD-based test and the KLIEP-based test always reject the null hypothesis.

Overall, the results imply that the two-sample homogeneity test by LSDD is more robust against outliers (i.e., two distributions tend to be regarded as the same even in the presence of outliers) than the KLIEP-based test.

### 4.2. Applications

Here, we apply LSDD to semisupervised class-balance estimation under class-prior change and change-point detection in time series.

#### 4.2.1. Semisupervised Class-Balance Estimation

In real-world pattern recognition tasks, changes in class balance between the training and test phases are often observed. In such cases, naive classifier training produces significant estimation bias because the class balance in the training data set does not properly reflect that of the test data set. Here, we consider the problem of learning the class balance of a test data set in a semisupervised learning setup where unlabeled test samples are provided in addition to labeled training samples (Chapelle, Schölkopf, & Zien, 2006).

We consider the situation where the class-prior probability for training data, *p*_{train}(*y*), and that for test data, *p*_{test}(*y*), are different:

$$p_{\mathrm{train}}(y) \ne p_{\mathrm{test}}(y).$$

However, we assume that the class-conditional density for training data, *p*_{train}(**x**|*y*), and that for test data, *p*_{test}(**x**|*y*), are unchanged:

$$p_{\mathrm{train}}(\boldsymbol{x}\,|\,y) = p_{\mathrm{test}}(\boldsymbol{x}\,|\,y).$$

Note that the training and test joint densities *p*_{train}(**x**, *y*) and *p*_{test}(**x**, *y*), as well as the training and test input densities *p*_{train}(**x**) and *p*_{test}(**x**), are generally different under this setup.

*x*Here, our objective is to estimate *p*_{test}(*y*) from labeled training samples drawn independently from *p*_{train}(** x**,

*y*) and unlabeled test samples drawn independently from

*p*

_{test}(

**). Given test labels ,**

*x**p*

_{test}(

*y*) can be naively estimated by , where is the number of test samples in class

*y*. Here, however, we want to estimate

*p*

_{test}(

*y*) without .

The basic idea of class-balance estimation is to match a mixture of the class-wise input densities,

$$\pi\, p_{\mathrm{test}}(\boldsymbol{x}\,|\,y=1) + (1-\pi)\, p_{\mathrm{test}}(\boldsymbol{x}\,|\,y=2),$$

with the test input density *p*_{test}(**x**) (Saerens et al., 2002), where π is a mixing coefficient to learn. (See Figure 9 for a schematic illustration.) Here, we use the *L*^{2}-distance estimated by LSDD, KDEi, and KDEj (see section 4.1.1) for this distribution matching. Note that when LSDD is used to estimate the *L*^{2}-distance, separate estimation of the densities is not involved: the difference between *p*_{test}(**x**) and the mixture density is directly estimated.

As an additional baseline, we include the expectation-maximization (EM)-based class-prior estimation method (Saerens et al., 2002), which corresponds to distribution matching under the KL divergence. More specifically, in the EM-based algorithm, the test class-prior estimate $\widehat{p}_{\mathrm{test}}(y)$ and the test class-posterior estimate $\widehat{p}_{\mathrm{test}}(y\,|\,\boldsymbol{x})$ are iteratively estimated as follows:

1. Obtain an estimate of the training class-posterior probability, $\widehat{p}_{\mathrm{train}}(y\,|\,\boldsymbol{x})$, from the labeled training samples, for example, by kernel logistic regression (Hastie, Tibshirani, & Friedman, 2001) or its squared-loss variant (Sugiyama, 2010).
2. Obtain an estimate of the training class-prior probability from the training data as $\widehat{p}_{\mathrm{train}}(y) = n_{y}/n$, where *n*_{y} is the number of training samples in class *y*.
3. Initialize the test class-prior estimate as $\widehat{p}_{\mathrm{test}}(y) \leftarrow \widehat{p}_{\mathrm{train}}(y)$.
4. Update the test class-posterior estimate as $\widehat{p}_{\mathrm{test}}(y\,|\,\boldsymbol{x}) \propto \dfrac{\widehat{p}_{\mathrm{test}}(y)}{\widehat{p}_{\mathrm{train}}(y)}\, \widehat{p}_{\mathrm{train}}(y\,|\,\boldsymbol{x})$, normalized so that it sums to one over *y*.
5. Update the test class-prior estimate as $\widehat{p}_{\mathrm{test}}(y) \leftarrow \frac{1}{n'} \sum_{j=1}^{n'} \widehat{p}_{\mathrm{test}}(y\,|\,\boldsymbol{x}'_{j})$.
6. Iterate steps 4 and 5 until convergence.
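The EM iteration can be sketched compactly. Here `post_train` stands for the training class-posterior estimates evaluated on the unlabeled test inputs (as would come from step 1); the hand-made posterior matrix below is purely illustrative.

```python
import numpy as np

def em_class_prior(post_train, prior_train, n_iter=100):
    """EM estimate of the test class prior (Saerens et al., 2002).

    post_train : (n_test, c) matrix of p_train(y | x'_j) on test inputs
    prior_train: (c,) vector of p_train(y)
    """
    prior_test = prior_train.copy()            # initialization
    for _ in range(n_iter):
        # Reweight the training posterior by the prior ratio, renormalize.
        w = post_train * (prior_test / prior_train)
        post_test = w / w.sum(axis=1, keepdims=True)
        # New prior estimate: average test posterior.
        prior_test = post_test.mean(axis=0)
    return prior_test

# Illustrative run: the training posterior came from a balanced training set,
# while the test set leans toward class 1 (60 of 100 rows favor class 1).
post = np.array([[0.9, 0.1]] * 60 + [[0.1, 0.9]] * 40, dtype=float)
prior = np.array([0.5, 0.5])
est = em_class_prior(post, prior)
```

The fixed point of this iteration is a stationary point of the test log-likelihood in the mixing proportions, which is the sense in which it performs distribution matching under the KL divergence.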

After the test class balance has been estimated, we train a classifier with the estimated class balance taken into account: the class-posterior probability of an input **x** is estimated by a kernel model with the gaussian kernel function, whose parameters are learned by class-balance-weighted regularized least squares with regularization parameter λ (≥ 0). The gaussian width and the regularization parameter are chosen by five-fold weighted cross-validation (Sugiyama, Krauledat, & Müller, 2007) in terms of the misclassification error.

The right graphs in Figure 10 plot the test misclassification error over 1000 runs. The results show that the LSDD-based method provides lower classification errors, which we attribute to its more accurate estimates of the test class balance.

#### 4.2.2. Unsupervised Change Detection

The objective of change detection is to discover abrupt property changes behind time-series data.

Let **y**(*t*) be an *m*-dimensional time-series sample at time *t*, and let

$$\boldsymbol{Y}(t) := \big[\boldsymbol{y}(t)^{\top}, \boldsymbol{y}(t+1)^{\top}, \ldots, \boldsymbol{y}(t+k-1)^{\top}\big]^{\top}$$

be a subsequence of the time series at time *t* with length *k*. We treat the subsequence **Y**(*t*) as a sample, instead of a single point **y**(*t*), by which time-dependent information can be incorporated naturally (Kawahara & Sugiyama, 2012). Let $\mathcal{Y}(t)$ be a set of *r* retrospective subsequence samples starting at time *t*:

$$\mathcal{Y}(t) := \{\boldsymbol{Y}(t), \boldsymbol{Y}(t+1), \ldots, \boldsymbol{Y}(t+r-1)\}.$$

Our strategy is to compute a dissimilarity measure between two consecutive segments $\mathcal{Y}(t)$ and $\mathcal{Y}(t+r)$ and use it as the plausibility of a change point (see Figure 11). As the dissimilarity measure, we use the *L*^{2}-distance estimated by LSDD and the KL-divergence estimated by the KL importance estimation procedure (KLIEP) (Sugiyama et al., 2008). We set *k* = 10 and *r* = 50.
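The sliding-segment scheme can be sketched as follows. The `dissimilarity` argument is a placeholder (here the squared distance between segment means, purely for illustration; the LSDD-based *L*^{2}-distance estimator would be plugged in for the actual experiments), and the window parameters mirror *k* and *r* above.

```python
import numpy as np

def subsequences(y, k):
    """Stack sliding windows: Y(t) = [y(t), ..., y(t+k-1)] as row vectors."""
    return np.stack([y[t:t + k].ravel() for t in range(len(y) - k + 1)])

def change_scores(y, k, r, dissimilarity):
    """Score time t by the dissimilarity between the segment of r
    subsequence samples ending at t and the one starting at t."""
    Y = subsequences(y, k)
    scores = np.full(len(y), np.nan)
    for t in range(r, len(Y) - r + 1):
        scores[t] = dissimilarity(Y[t - r:t], Y[t:t + r])
    return scores

# Placeholder dissimilarity: squared distance of segment means.
diss = lambda A, B: float(np.sum((A.mean(axis=0) - B.mean(axis=0)) ** 2))

# Toy series with an abrupt mean shift at t = 150.
y = np.concatenate([np.zeros(150), 3 * np.ones(150)]) \
    + 0.1 * np.random.default_rng(0).normal(size=300)
s = change_scores(y[:, None], k=10, r=50, dissimilarity=diss)
```

Peaks in the score sequence then serve as candidate change points, exactly as plotted in the bottom graphs of Figure 12.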

We use two data sets. One is the IPSJ SIG-SLP Corpora and Environments for Noisy Speech Recognition (CENSREC) data set (http://research.nii.ac.jp/src/en/CENSREC-1-C.html). This data set, provided by the National Institute of Informatics, Japan, records human voice in noisy environments such as a restaurant. The other data set is from the Human Activity Sensing Consortium (HASC) Challenge 2011 (http://hasc.jp/hc2011/), which provides human activity information collected by portable three-axis accelerometers. Because the orientation of the accelerometers is not necessarily fixed, we take the $\ell_{2}$-norm of the three-dimensional data. The HASC data set is relatively simple, so we artificially add zero-mean gaussian noise with standard deviation 5 at each time point with probability 0.005.

The top graphs in Figure 12 display the original time series, in which the true change points were manually annotated. The time series in Figure 12a corresponds to a sequence of noise-speech-noise-speech-noise, whereas that in Figure 12b corresponds to a sequence of actions: jog, stay, stair down, stay, stair up. The bottom graphs in Figure 12 plot the change scores obtained by each method. The results show that the LSDD-based change score indicates the existence of change points more clearly than the KLIEP-based score. The superior performance of LSDD over the KLIEP-based change score is likely due to its robustness against outliers (see section 4.1.2).

Finally, we compare the change-detection performance more systematically using the receiver operating characteristic (ROC) curves (the false-positive rate versus the true-positive rate) and the area under the ROC curve (AUC) values. In addition to LSDD and KLIEP, we also test the *L*^{2}-distance estimated by KDEi and KDEj (see section 4.1.1). Moreover, in our comparison, we include native change detection methods based on autoregressive models (AR) (Takeuchi & Yamanishi, 2006), subspace identification (SI) (Kawahara, Yairi, & Machida, 2007), singular spectrum transformation (SST) (Moskvina & Zhigljavsky, 2003), one-class support vector machine (SVM) (Desobry, Davy, & Doncarli, 2005), kernel Fisher discriminant analysis (KFD) (Harchaoui, Bach, & Moulines, 2009), and kernel change-point detection (KCP) (Arlot, Celisse, & Harchaoui, 2012). Tuning parameters included in these methods were manually optimized.

We use 10 data sets taken from each of the CENSREC and HASC data collections. Mean ROC curves are plotted in Figure 13, and AUC values are listed in Table 1. The experimental results show that LSDD tends to outperform the other methods and is comparable to state-of-the-art native change-detection methods.

Table 1: AUC values.

| Data ID | LSDD | KDEi | KDEj | KLIEP | AR | SI | SST | SVM | KFD | KCP |
|---|---|---|---|---|---|---|---|---|---|---|
| **CENSREC data set** | | | | | | | | | | |
| 1 | .888 | .737 | .731 | .437 | .769 | .739 | .507 | .604 | .881 | .917 |
| 2 | .871 | .803 | .706 | .618 | .777 | .736 | .541 | .612 | .912 | .879 |
| 3 | .910 | .753 | .690 | .744 | .762 | .821 | .616 | .886 | .876 | .743 |
| 4 | .936 | .823 | .578 | .683 | .776 | .816 | .723 | .871 | .981 | .826 |
| 5 | .878 | .712 | .799 | .667 | .768 | .701 | .625 | .843 | .880 | .945 |
| 6 | .830 | .732 | .711 | .696 | .679 | .727 | .484 | .781 | .841 | .947 |
| 7 | .813 | .727 | .737 | .513 | .727 | .733 | .612 | .779 | .938 | .968 |
| 8 | .889 | .841 | .734 | .691 | .783 | .775 | .526 | .698 | .934 | .935 |
| 9 | .828 | .739 | .586 | .609 | .776 | .770 | .609 | .819 | .922 | .980 |
| 10 | .943 | .687 | .773 | .692 | .670 | .747 | .551 | .835 | .889 | .984 |
| Mean | .879 | .755 | .705 | .635 | .749 | .756 | .580 | .773 | .905 | .913 |
| SE | .014 | .016 | .023 | .030 | .013 | .012 | .023 | .032 | .013 | .024 |
| **HASC data set** | | | | | | | | | | |
| 1 | .792 | .823 | .753 | .650 | .860 | .690 | .806 | .800 | .885 | .874 |
| 2 | .842 | .665 | .741 | .712 | .733 | .800 | .745 | .725 | .904 | .826 |
| 3 | .773 | .605 | .536 | .708 | .910 | .899 | .807 | .932 | .707 | .641 |
| 4 | .921 | .839 | .837 | .587 | .816 | .735 | .685 | .751 | .903 | .759 |
| 5 | .838 | .849 | .859 | .565 | .831 | .823 | .809 | .840 | .961 | .725 |
| 6 | .834 | .755 | .781 | .676 | .868 | .740 | .736 | .838 | .871 | .800 |
| 7 | .841 | .763 | .598 | .657 | .807 | .759 | .797 | .829 | .770 | .532 |
| 8 | .878 | .833 | .857 | .581 | .629 | .704 | .682 | .800 | .852 | .661 |
| 9 | .864 | .850 | .866 | .693 | .738 | .744 | .781 | .790 | .842 | .697 |
| 10 | .847 | .663 | .680 | .554 | .796 | .725 | .790 | .850 | .866 | .787 |
| Mean | .843 | .764 | .751 | .638 | .799 | .762 | .764 | .815 | .856 | .730 |
| SE | .013 | .029 | .036 | .020 | .026 | .020 | .016 | .018 | .023 | .032 |


Note: The best method and comparable ones in terms of mean AUC values by the *t*-test (Henkel, 1976) at the significance level of 5% are indicated with boldface.

## 5. Conclusion

In this letter, we proposed a method for directly estimating the difference between two probability density functions without density estimation. The proposed method, the least-squares density-difference (LSDD), was derived within the framework of kernel regularized least-squares estimation, and its solution can be computed analytically in a computationally efficient and stable manner. Furthermore, LSDD is equipped with cross-validation, and thus all tuning parameters, such as the kernel width and the regularization parameter, can be systematically and objectively optimized. We showed the asymptotic normality of LSDD in a parametric setup and derived a finite-sample error bound for LSDD in a nonparametric setup. In both cases, LSDD was shown to achieve the optimal convergence rates.
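For concreteness, the analytic LSDD solution can be sketched in a few lines of NumPy: gaussian basis functions centered at the pooled samples, a closed-form cross-moment matrix, and a regularized linear solve. The function name `lsdd_fit` and the fixed values of the kernel width `sigma` and regularization parameter `lam` are illustrative; in the method proper, both are chosen by cross-validation.

```python
import numpy as np

def lsdd_fit(X1, X2, sigma=1.0, lam=1e-1):
    """Least-squares density-difference sketch: fit f(x) = p1(x) - p2(x)
    with gaussian basis functions centered at the pooled samples."""
    C = np.vstack([X1, X2])                      # basis centers
    d = C.shape[1]

    def phi(X):                                  # design matrix of gaussian bases
        sq = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-sq / (2 * sigma ** 2))

    # H[l,l'] = integral over R^d of the product of two gaussian bases,
    # which is available in closed form for gaussian functions.
    sq_cc = ((C[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    H = (np.pi * sigma ** 2) ** (d / 2) * np.exp(-sq_cc / (4 * sigma ** 2))
    # h[l] = empirical mean of basis l under p1 minus that under p2.
    h = phi(X1).mean(axis=0) - phi(X2).mean(axis=0)
    theta = np.linalg.solve(H + lam * np.eye(len(C)), h)  # analytic solution
    return lambda X: phi(X) @ theta              # estimated density difference

rng = np.random.default_rng(0)
X1 = rng.normal(0.0, 1.0, size=(200, 1))         # samples from N(0, 1)
X2 = rng.normal(3.0, 1.0, size=(200, 1))         # samples from N(3, 1)
f = lsdd_fit(X1, X2)
```

The estimate should be positive near 0 (where the first density dominates) and negative near 3, matching the sign of the true density difference.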

We also proposed an *L*^{2}-distance estimator based on LSDD, which nicely cancels the bias caused by regularization. The LSDD-based *L*^{2}-distance estimator was experimentally shown to be more accurate than differences of kernel density estimators and more robust against outliers than a Kullback-Leibler divergence estimator. However, we also experimentally observed that cross-validation of LSDD is sometimes rather unstable when the target density difference is zero (i.e., the two distributions are equivalent). This can potentially cause performance degradation in two-sample homogeneity testing, because estimation of the zero density difference is repeatedly executed when approximating the null distribution in the permutation-test framework. Stabilizing cross-validation and improving the accuracy of density-difference estimation when the target density difference is zero are topics for future work.
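The permutation-test framework mentioned here can be sketched generically. The helper `permutation_test` and the simple squared-mean-difference statistic are illustrative stand-ins; in practice the statistic would be the LSDD-based *L*^{2}-distance estimate, re-fitted (including cross-validation) on every permuted split.

```python
import numpy as np

def permutation_test(X1, X2, stat, n_perm=200, seed=0):
    """Permutation two-sample test: the null distribution of the distance
    estimate is approximated by re-estimating the statistic on randomly
    re-partitioned pooled samples."""
    rng = np.random.default_rng(seed)
    observed = stat(X1, X2)
    pooled = np.concatenate([X1, X2])
    n1 = len(X1)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        s = stat(pooled[perm[:n1]], pooled[perm[n1:]])
        count += (s >= observed)
    return (count + 1) / (n_perm + 1)   # p-value with add-one correction

# Stand-in distance statistic (squared mean difference), NOT the
# LSDD-based L2-distance estimator itself.
stat = lambda A, B: (A.mean() - B.mean()) ** 2

rng = np.random.default_rng(1)
p_same = permutation_test(rng.normal(0, 1, 300), rng.normal(0, 1, 300), stat)
p_diff = permutation_test(rng.normal(0, 1, 300), rng.normal(1, 1, 300), stat)
```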

A related line of research to density-difference estimation is density-ratio estimation (Sugiyama et al., 2012a), which directly estimates the ratio of probability densities without separate density estimation (Qin, 1998; Huang et al., 2007; Bickel, Brückner, & Scheffer, 2007; Sugiyama et al., 2008; Kanamori et al., 2009; Sugiyama et al., 2012b). Potential weaknesses of density-ratio estimation are that density ratios can be unbounded even for simple cases (Cortes et al., 2010) and their estimation may suffer from outliers (Basu et al., 1998; Scott, 2001; Besbeas & Morgan, 2004).

To mitigate these weaknesses, the concept of relative density ratios was introduced recently, which "flattens" the density ratio *p*/*p*′ as *p*/(β*p* + (1 − β)*p*′) for 0 ≤ β ≤ 1 (Yamada et al., 2013). Even when the plain density ratio is unbounded, the relative density ratio is always bounded by 1/β for β > 0. Although estimation of relative density ratios as well as approximation of relative divergences was demonstrated to be more reliable (Yamada et al., 2013), there is no systematic method to choose the relativity parameter β, which is a critical limitation in practice.
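A quick numerical check illustrates this boundedness property; the symbols below are illustrative, with `beta` denoting the relativity parameter.

```python
import numpy as np

# Plain density ratio p1/p2 for two gaussians: it blows up in the tail
# where the denominator density decays faster than the numerator.
x = np.linspace(-5, 5, 1001)
p1 = np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)          # N(0, 1)
p2 = np.exp(-(x - 2) ** 2 / 2) / np.sqrt(2 * np.pi)    # N(2, 1)
plain = p1 / p2                                        # unbounded as x -> -inf

beta = 0.1
relative = p1 / (beta * p1 + (1 - beta) * p2)          # "flattened" ratio
# The relative ratio never exceeds 1/beta, however extreme x gets,
# because p1 / (beta*p1 + (1-beta)*p2) <= p1 / (beta*p1) = 1/beta.
```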

On the other hand, density-difference estimation is more advantageous than density-ratio estimation in the sense that density differences are always bounded as long as each density is bounded, their estimation is robust against outliers (Basu et al., 1998; Scott, 2001; Besbeas & Morgan, 2004), and there exist no tuning parameters such as the relativity parameter β. However, a potential weakness of density differences is that they cannot be used for importance sampling (Sugiyama & Kawanabe, 2012) and conditional probability estimation (Sugiyama, Takeuchi, et al., 2010; Sugiyama, 2010), which are promising uses of density-ratio estimation. Thus, further exploring uses of density-difference estimation, particularly in tasks that density-ratio estimation cannot be used for, is a promising direction for future work.

A simple application of density-difference estimation would be probabilistic pattern recognition, because the sign of the density difference gives the Bayes-optimal decision (Duda, Hart, & Stork, 2001). Furthermore, in the context of pattern recognition with a reject option, the density difference can be used for finding the optimal rejection threshold (Chow, 1970). In future work, we will investigate the behavior of LSDD in probabilistic pattern recognition theoretically and experimentally.
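A minimal sketch of this decision rule, assuming equal class priors and known gaussian class-conditional densities (the function `classify` and its parameters are illustrative, not part of the proposed method):

```python
import numpy as np

# With equal class priors, deciding by the sign of the density difference
# f(x) = p(x | y=+1) - p(x | y=-1) is the Bayes-optimal rule.  Here the
# two class-conditional densities are known unit-variance gaussians, so
# the difference is computed exactly rather than estimated by LSDD.
def classify(x, mu_pos=-1.0, mu_neg=1.0):
    p_pos = np.exp(-(x - mu_pos) ** 2 / 2)   # unnormalized N(mu_pos, 1)
    p_neg = np.exp(-(x - mu_neg) ** 2 / 2)   # unnormalized N(mu_neg, 1)
    return np.sign(p_pos - p_neg)            # +1 / -1 decision, 0 on boundary

x = np.array([-2.0, -0.5, 0.5, 2.0])
y = classify(x)
# Points left of the midpoint (x = 0) are assigned +1, right of it -1.
```

For a reject option, one would abstain whenever the magnitude of the density difference falls below a threshold, in the spirit of Chow (1970).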

Density-difference estimation is a novel research paradigm in machine learning, and we have proposed a simple but useful method for this emerging topic. Our future work will develop more powerful algorithms for density-difference estimation. For example, considering more general loss functions than the squared loss (Sugiyama et al., 2012b) and incorporating dimension reduction (von Bünau, Meinecke, Király, & Müller, 2009; Sugiyama, Kawanabe, & Chui, 2010; Sugiyama et al., 2011; Yamada & Sugiyama, 2011) would be interesting directions to pursue. Exploring a wide variety of real-world applications is also important future work.

## Appendix A: Technical Details of Nonparametric Convergence Analysis in Section 2.3.2

We assume the following conditions:

- There exists a constant *M* that bounds the two densities from above.
- The density difference is a member of a Besov space with regularity α; that is, *f* = *p* − *p*′ belongs to *B*^{α}_{2,∞}, the Besov space with regularity α, whose norm is defined through the *r*th modulus of smoothness ω_{*r*}, where ⌊α⌋ denotes the largest integer less than or equal to α (see Eberts & Steinwart, 2011, for the definitions).

Then we have the following theorem:

To prove this, we use the technique developed in Eberts and Steinwart (2011) for a regression problem. For a smoothing kernel *K* constructed from gaussian functions, we define *f*_{0} as the convolution of *K* and *f*; that is, *f*_{0} is a smoothed version of the true density difference. Because of lemma 2 in Eberts and Steinwart (2011), the approximation errors of *f*_{0} are bounded in terms of the gaussian width; moreover, lemma 3 in Eberts and Steinwart (2011) bounds the norms of *f*_{0} itself, and lemma 1 in Eberts and Steinwart (2011) yields that there exists a constant *C*_{*r*,2} bounding the RKHS norm of *f*_{0}, where *r* > 0 is a positive real number characterizing the smoothing kernel. Next, we decompose the excess error of the LSDD estimator and assume that there exists a function bounding the expected supremum of the associated empirical process, where the expectation is taken over all samples. Then, by the peeling device (see theorem 7.7 in Steinwart & Christmann, 2008) and Talagrand's concentration inequality, we obtain a high-probability bound on the stochastic part, where the probability is with respect to the draw of the samples.

The complexity of the function class containing *f* can be bounded if *d* < 2*m*, where *m* controls the decay of the entropy numbers of the gaussian RKHS. Setting the parameters accordingly, and noting that the required norm and entropy bounds hold by theorems 7.16 and 7.34 in Steinwart and Christmann (2008), we can take a bounding function whose exponents are arbitrary and whose constants depend on them. In the same way, we can also obtain a bound on the second stochastic term. Choosing *r* to satisfy the resulting condition, the desired inequality holds with high probability; to satisfy equation A.6, it suffices to set the regularization parameter and the gaussian width appropriately, where *C* is a sufficiently large constant.

It remains to bound the term (*Q*_{*n*} − *Q*)(*f*_{0} − *f*), where *Q*_{*n*} and *Q* denote the empirical and population measures, respectively. By Bernstein's inequality, this term is bounded with high probability, where *C* is a universal constant. In a similar way, we can also obtain a bound on the remaining term. Combining these inequalities, we obtain the assertion with high probability, where *C* is a universal constant.

Note that *n*^{−2α/(2α+*d*)} is the optimal learning rate for estimating a function in *B*^{α}_{2,∞} (Eberts & Steinwart, 2011). Therefore, the density-difference estimator with a gaussian kernel achieves the optimal learning rate by appropriately choosing the regularization parameter and the gaussian width. Because the learning rate depends on the regularity α, the LSDD estimator has adaptivity to the smoothness of the true function.

Our analysis heavily relies on the techniques developed in Eberts and Steinwart (2011) for a regression problem. The main difference is that their analysis involves a clipping procedure, which stems from the fact that the estimator they analyze requires an empirical approximation of the expectation of the squared term. The Lipschitz continuity of the square function is used to investigate this term, and the clipping procedure is used to ensure the Lipschitz continuity. In this letter, we can compute this expectation exactly, so we do not need the Lipschitz continuity.

## Appendix B: Derivation of Equation 3.6

Here, *o*_{p} denotes the probabilistic order. Then equation 3.5 can be expressed in the form of equation 3.6, which concludes the proof.

## Appendix C: Derivation of Equation 3.7

## Acknowledgments

We thank Wittawat Jitkrittum and John Quinn for their comments and Zaïd Harchaoui for providing us with the program code of kernel change-point detection. M.S. was supported by MEXT KAKENHI 23300069 and AOARD; T.K. was supported by MEXT KAKENHI 24500340; T.S. was supported by MEXT KAKENHI 22700289, the Aihara Project, the FIRST program from JSPS initiated by CSTP, and the Global COE Program "The research and Training Center for New development in Mathematics," MEXT, Japan; M. du P. was supported by a MEXT Scholarship; S.L. was supported by the JST PRESTO program; and I.T. was supported by MEXT KAKENHI 23700165. An earlier version of this paper was presented at the Neural Information Processing Systems conference, 2012 (Sugiyama, Suzuki, Kanamori, du Plessis, Liu, & Takeuchi, 2012).

## References

## Notes

^{1}

More precisely, Vapnik (1998) said, “If you possess a restricted amount of information for solving some problem, try to solve the problem directly and never solve a more general problem as an intermediate step. It is possible that the available information is sufficient for a direct solution but is insufficient for solving a more general intermediate problem.” Two-stage density-difference estimation corresponds to solving a more general problem of separate density estimation in the first stage.

^{3}

More specifically, the regularizer is changed from the squared ℓ₂-norm of the parameters to the squared RKHS norm of a learned function, which is necessary to establish consistency. Nevertheless, we use the squared ℓ₂-norm of the parameters in experiments because it is simpler and performs well in practice.

^{5}

Estimation of the KL-divergence from data has been extensively studied recently (Wang, Kulkarni, & Verdú, 2005; Sugiyama et al., 2008; Pérez-Cruz, 2008; Silva & Narayanan, 2010; Nguyen et al., 2010); among these, KLIEP was shown to possess a superior convergence property and demonstrated to work well in practice (Sugiyama et al., 2008). KLIEP is based on direct estimation of the density ratio without density estimation of *p*(**x**) and *p*′(**x**). See also Nguyen et al. (2010), which proposes essentially the same procedure.