## Abstract

Nearest-neighbor estimators for the Kullback-Leibler (KL) divergence that are asymptotically unbiased have recently been proposed and demonstrated in a number of applications. However, with a small number of samples, nonparametric methods typically suffer from large estimation bias due to the nonlocality of information derived from nearest-neighbor statistics. In this letter, we show that this estimation bias can be mitigated by modifying the metric function, and we propose a novel method for learning a locally optimal Mahalanobis distance function from parametric generative models of the underlying density distributions. Using both simulations and experiments on a variety of data sets, we demonstrate that this interplay between approximate generative models and nonparametric techniques can significantly improve the accuracy of nearest-neighbor-based estimation of the KL divergence.

## 1 Introduction

Estimates of the KL divergence have been used for a variety of machine learning applications, including homogeneity testing of a density function (Kanamori, Suzuki, & Sugiyama, 2012), dependency testing for feature selection (Bontempi & Meyer, 2010; Brown, 2009; Peng, Long, & Ding, 2005; Yu & Liu, 2004), state change detection (Kawahara & Sugiyama, 2012; Liu, Yamada, Collier, & Sugiyama, 2013), and model parameter estimation (Ranneby, 1984). Moreover, other information-theoretic measures, such as entropy and the Jensen-Shannon divergence, can be expressed in terms of the KL divergence. Therefore, an efficient algorithm for estimating the KL divergence is critical for a range of applications in machine learning and statistics.

Unfortunately, nonparametric estimators of the KL divergence suffer from high bias when used with a finite number of samples. For this reason, when only a small number of samples are available, parametric methods are typically more reliable than nonparametric methods. In this letter, we show how to improve a nonparametric estimator when using a finite number of samples by incorporating information from parametric models. More specifically, the bias of a nonparametric estimator can be reduced using information from parametric models by learning an appropriate local Mahalanobis-type metric for nearest-neighbor selection. Due to the invariance of the KL divergence under changes to the metric function, convergence of the nearest-neighbor estimator to the true divergence is guaranteed regardless of the metric. Thus, the known asymptotic convergence properties of nearest-neighbor estimators apply to our hybrid method without any additional assumptions.
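The invariance claim can be checked numerically with the closed-form KL divergence between univariate gaussians. `kl_gauss_1d` is a helper written for this sketch (the closed form itself is standard); the affine map `y = a * x` plays the role of a change of metric:

```python
import math

def kl_gauss_1d(m1, s1, m2, s2):
    """Closed-form KL divergence D(N(m1, s1^2) || N(m2, s2^2))."""
    return math.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

# An affine change of coordinates y = a * x rescales both densities
# identically, so the divergence is unchanged.
a = 2.5
kl_orig = kl_gauss_1d(0.0, 1.0, 1.0, 2.0)
kl_scaled = kl_gauss_1d(0.0, a * 1.0, a * 1.0, a * 2.0)
print(abs(kl_orig - kl_scaled))   # ~0: metric invariance of the KL divergence
```

The same cancellation happens for any full-rank linear map in higher dimensions, which is why the asymptotic guarantees carry over to a learned Mahalanobis metric.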

The remainder of the letter is organized as follows. In section 2, we describe nearest-neighbor-based methods for estimating the KL divergence. We then derive the finite sampling bias error and show how a metric for minimizing the bias can be learned. In section 3, we provide experimental results using a number of synthetic and benchmark data sets showing how our proposed method can improve the accuracy for KL divergence estimates. We conclude in section 4 with a discussion of our results and potential future work.

## 2 Metric Learning for Nearest-Neighbor KL-Divergence Estimation

In this section, we analyze the bias of nearest-neighbor estimation for the KL divergence and show that it depends on the distance metric. We then propose a metric learning algorithm to minimize this bias using parametric information.

### 2.1 Bias of Nearest-Neighbor Density Estimation

#### 2.1.1 Average Probability Density within a Sphere

#### 2.1.2 Deviation of the Probability Estimation

Next, we show that the deviation of the probability estimation is asymptotically the same as the deviation of the average probability density from the density at the center of the sphere. We first note that the volume of a $d$-dimensional hypersphere with diameter $\rho$ is $V = \bar{c}_d\, \rho^d$ with $\bar{c}_d = \pi^{d/2} / \big(2^d\, \Gamma(d/2 + 1)\big)$.
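For reference, the volume of a $d$-dimensional hypersphere of diameter $\rho$ is $V = \pi^{d/2} \rho^d / \big(2^d\, \Gamma(d/2+1)\big)$; a quick numerical check against the familiar disk and ball formulas (the function name is an illustrative choice):

```python
import math

def hypersphere_volume(d, rho):
    """Volume of a d-dimensional hypersphere with diameter rho:
    V = pi^(d/2) / (2^d * Gamma(d/2 + 1)) * rho^d."""
    return math.pi ** (d / 2) / (2 ** d * math.gamma(d / 2 + 1)) * rho ** d

# Sanity checks against the low-dimensional formulas (radius r = rho / 2):
r = 1.5
assert abs(hypersphere_volume(2, 2 * r) - math.pi * r**2) < 1e-9        # disk
assert abs(hypersphere_volume(3, 2 * r) - 4 / 3 * math.pi * r**3) < 1e-9  # ball
```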

The distortion in the observed nearest-neighbor distance is caused by the assumption that the data are generated from the model density even though they are in fact generated from the true density. As shown in Figure 1, the observed nearest-neighbor distance should ideally be close to its ideal counterpart, and the two shaded regions in the figure should enclose matching probability mass. The objective of our metric learning algorithm is thus to modify the metric in order to minimize the difference between the two.

When , equation 2.29 is equivalent to the Laplace probability density function and is related to the estimators derived in Leonenko and Pronzato (2010), Leonenko et al. (2008), and Póczos and Schneider (2011, 2012) under the assumption of a flat underlying probability density. However, for finite samples, we see that may not be small, and the more complete expression in equation 2.29 needs to be considered for more accurate density estimation.

#### 2.1.3 Comments on Using k-Nearest Neighbors

We note that in Noh, Zhang, and Lee (2018), classification was considered with k-nearest neighbors, and an appropriate metric was obtained analytically. In classification, only the label information of the nearest neighbors is relevant for obtaining a metric. For KL-divergence estimation, however, the actual distances to nearest neighbors are required for the estimator; therefore, the parameter k will modulate the bias in the estimation of the KL divergence differently from k-nearest-neighbor classification. Although we do not have an analytic expression for the bias in the general case, we find that our metric learning algorithm does improve the performance of KL-divergence estimation using nearest neighbors, and the proposed method can be used for general k, as shown below.

### 2.2 Bias of Nearest-Neighbor KL-Divergence Estimation

Turning back to the problem of KL-divergence estimation, plugging the density estimate from equation 2.1 into the definition of the KL divergence yields the nearest-neighbor KL-divergence estimator in equation 1.3, using two sets of i.i.d. samples drawn from the two densities.
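A minimal brute-force sketch of a standard 1-nearest-neighbor estimator of this form (the function name and the brute-force distance computation are illustrative choices, not the letter's implementation):

```python
import numpy as np

def nn_kl_divergence(X, Y):
    """1-nearest-neighbor KL-divergence estimate D(p||q) from samples
    X ~ p (n x d) and Y ~ q (m x d), in the standard form
    D_hat = (d/n) * sum_i log(nu_i / rho_i) + log(m / (n - 1)),
    where rho_i is the distance from x_i to its nearest neighbor in X \ {x_i}
    and nu_i is the distance from x_i to its nearest neighbor in Y."""
    n, d = X.shape
    m = Y.shape[0]
    # Pairwise Euclidean distances (brute force, for clarity).
    dxx = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dxx, np.inf)            # exclude the point itself
    rho = dxx.min(axis=1)
    dxy = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    nu = dxy.min(axis=1)
    return d / n * np.sum(np.log(nu / rho)) + np.log(m / (n - 1))

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(2000, 1))     # p = N(0, 1)
Y = rng.normal(1.0, 1.0, size=(2000, 1))     # q = N(1, 1), true KL = 0.5
print(nn_kl_divergence(X, Y))                # roughly 0.5, up to finite-sample bias
```

A k-d tree would replace the quadratic distance computation in practice; the estimator itself is unchanged.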

Note that this bias is for the estimate at a particular point, and the total bias is given by the expected value. In the asymptotic limit of infinitely many samples, the bias coefficients become zero, yielding a consistent estimate of the true KL divergence. Unfortunately, in high-dimensional spaces, the convergence rate is slow, with the bias decreasing at a rate that degrades with the dimensionality. However, with the proper choice of metric, the pointwise bias can be significantly reduced even for small sample sizes by minimizing the effect of the Laplacians of the two densities.

### 2.3 Metric Learning for Bias Reduction

The analysis shows that the bias of nearest-neighbor KL-divergence estimation depends not only on the number of data but also the curvatures of the underlying density functions and . In this section, we show how the bias can be reduced by learning an appropriate distance metric.

We see that the bias can be minimized by solving the following semidefinite program,^{2} where the symmetric matrix in the objective is defined from the Hessians of the two densities; note that this matrix also depends on the point at which the densities are evaluated. The global solution of this semidefinite program can be obtained analytically. The particular solution we use is the following (Noh, Zhang, & Lee, 2010), where the diagonal matrices contain the positive and negative eigenvalues of the symmetric matrix, respectively, and the accompanying matrices collect the corresponding eigenvectors: one gathers the eigenvectors associated with the positive eigenvalues, and the other those associated with the negative eigenvalues.

Note that the solution of equation 2.54 is not unique. We have selected the metric in equation 2.56 because it is well behaved and provides the minimum deviation for any number of positive and negative eigenvalues. The global scale factor is chosen to satisfy a normalization constraint in our experiments. Further details about the derivation of this solution are provided in the appendix.
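A generic sketch of building a positive-definite metric from the eigendecomposition of the symmetric bias matrix. The absolute-eigenvalue weighting and the unit-determinant normalization used here are illustrative assumptions, not the exact solution of equation 2.56; the sketch only shows the shared-eigenvector structure described above:

```python
import numpy as np

def metric_from_bias_matrix(S):
    """Build a positive-definite Mahalanobis matrix from the symmetric
    (generally indefinite) bias matrix S by reweighting its eigendirections.
    Absolute-eigenvalue weighting and det(A) = 1 normalization are
    illustrative choices, not the letter's equation 2.56."""
    lam, U = np.linalg.eigh(S)
    w = np.abs(lam) + 1e-12                   # magnitude weights; jitter guards zeros
    A = U @ np.diag(w) @ U.T                  # same eigenvectors as S, positive spectrum
    return A / np.linalg.det(A) ** (1.0 / len(lam))   # rescale so det(A) = 1

S = np.array([[1.0, 0.4], [0.4, -0.5]])       # indefinite, as bias matrices can be
A = metric_from_bias_matrix(S)
print(np.linalg.eigvalsh(A))                  # all positive: a valid metric
```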

At each sample in the first set, we use the appropriate metric to obtain nearest-neighbor distances and estimate the densities, and then we use Monte Carlo sampling and summation to estimate the KL divergence. The full procedure is summarized in algorithm 1.
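Algorithm 1 requires nearest-neighbor distances under a learned Mahalanobis metric. Because a positive-definite metric matrix admits a Cholesky factorization, a Mahalanobis search reduces to a Euclidean search after a linear transform; a minimal sketch (function names are illustrative):

```python
import numpy as np

def mahalanobis_nn_distance(x, Y, A):
    """Distance from x to its nearest neighbor in Y under the Mahalanobis
    metric d(a, b)^2 = (a - b)^T A (a - b), computed by transforming with a
    Cholesky factor L (A = L^T L) and searching in Euclidean space."""
    L = np.linalg.cholesky(A).T          # A = L^T L with this convention
    Yt = Y @ L.T                         # y -> L y for every sample
    xt = x @ L.T
    return np.linalg.norm(Yt - xt, axis=1).min()

# Check against the direct quadratic-form computation.
rng = np.random.default_rng(1)
A = np.array([[2.0, 0.3], [0.3, 1.0]])   # an example positive-definite metric
x = rng.standard_normal(2)
Y = rng.standard_normal((50, 2))
direct = min(np.sqrt((x - y) @ A @ (x - y)) for y in Y)
assert abs(mahalanobis_nn_distance(x, Y, A) - direct) < 1e-10
```

The transform-then-search trick also lets standard k-d tree libraries be reused unchanged for metric nearest-neighbor queries.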

## 3 Experiments

In this section, we provide experimental results to illustrate how our method estimates the KL divergence for a variety of problems.^{3}

### 3.1 KL-Divergence Estimation from Synthetic Data

We compare the performance of the following six nonparametric methods:

- Proposed nearest-neighbor KL-divergence estimator with the gaussian metric (NNG). The metric is obtained from the gaussian model for each density function.
- Conventional nearest-neighbor KL-divergence estimator without metric learning (NN).
- KL-divergence estimator using density-ratio estimation with a variational representation (Ratio) (Nguyen et al., 2007).
- KL-divergence estimator using direct density-ratio estimation with a kernel representation (KLIEP) (Sugiyama et al., 2008).
- Risk-based nearest-neighbor KL-divergence estimator (fRisk) (Garcia-Garcia, von Luxburg, & Santos-Rodriguez, 2011).
- Nearest-neighbor estimator with the bias-reduction method of Wang, Kulkarni, and Verdu (2009a) (NNW).

First, we consider the estimation of KL divergence between two gaussian densities with various shapes and dimensionalities. Figure 2 depicts the experimental results for isotropic-isotropic and isotropic-correlated gaussians with increasing numbers of samples. For the data, we use and as mean parameters for two gaussian density functions of dimensionality . The isotropic-isotropic gaussian experiment uses isotropic covariance matrices for two gaussians. The isotropic-correlated gaussians use one isotropic covariance matrix and one nonisotropic covariance matrix: the nonisotropic covariance matrix has one nonzero element added to the isotropic matrix. For Figure 2b, the nonisotropic covariance matrix is , for Figure 2d, we use the nonisotropic covariance matrix , and for Figure 2f, we use . For these various configurations of gaussian data, the KL-divergence estimation results are provided with increasing numbers of data samples, where the nearest-neighbor estimation with the proposed metric (NNG) shows consistently better estimates than the other estimation methods.

Figures 3a and 3b show how the estimation results change with an increase in the mean difference and covariance difference, respectively, for a fixed number of data samples. In Figure 3a, two isotropic five-dimensional gaussians with variance one are used, and the difference in means (horizontal axis) is increased from 0 to 5. In Figure 3b, two isotropic five-dimensional gaussians of variance one with the same mean are used, and one covariance element (horizontal axis) is increased from 0 to 1, increasing the KL divergence. Compared with the true KL divergence for two gaussian densities, the proposed NNG always provides a significant improvement in accuracy, while NN and Ratio tend to underestimate the true KL divergence.

In Figures 2 and 3, we see how nonparametric estimation using nearest-neighbor information can be improved with the help of parametric modeling. The information comes from fitting parametric generative models globally to the data distributions. The experimental results in Figures 2 and 3 show that the proposed metric is quite effective at exploiting accurate parametric models.

We use a 5D density function with independent marginal functions, each of which is represented by equation 3.1. In addition to NNG, NN, and Ratio, we include in our comparison the following methods:

- Proposed nearest-neighbor KL-divergence estimator with the Student's-t metric (NNS)
- Gaussian parametric method (GP)

Here, GP uses the plug-in estimate from equation 1.2 with maximum-likelihood estimation.
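A minimal sketch of such a plug-in estimator, assuming gaussian models fit by maximum likelihood and the standard closed-form gaussian KL divergence (the function name is an illustrative choice):

```python
import numpy as np

def gp_kl_estimate(X, Y):
    """Gaussian plug-in (GP) KL estimate: fit each sample set with a
    maximum-likelihood gaussian and evaluate the closed-form divergence
    D(N(mu1, S1) || N(mu2, S2))."""
    mu1, mu2 = X.mean(axis=0), Y.mean(axis=0)
    S1 = np.cov(X, rowvar=False, bias=True)    # ML covariance estimates
    S2 = np.cov(Y, rowvar=False, bias=True)
    d = X.shape[1]
    S2inv = np.linalg.inv(S2)
    diff = mu2 - mu1
    return 0.5 * (np.trace(S2inv @ S1) + diff @ S2inv @ diff - d
                  + np.log(np.linalg.det(S2) / np.linalg.det(S1)))

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(5000, 2))
Y = rng.normal(0.5, 1.0, size=(5000, 2))      # true KL = 0.5 * ||dmu||^2 = 0.25
print(gp_kl_estimate(X, Y))                   # close to 0.25 for gaussian data
```

On the heavy-tailed Student's-t data of this section, such a gaussian plug-in deviates systematically, which is the behavior GP shows in Figure 4.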

Figure 4a shows the estimation results for 5D Student's-t data as the number of data samples is varied. As expected, NNS shows rapid convergence to the true KL divergence because its model contains the true underlying density function. NN is worse, but it is better than Ratio in this 5D experiment. The large deviation of GP is due to the inaccuracy of the gaussian assumption. However, even with this inaccurate generative model, NNG is quite accurate and comparable to NNS. This robustness to an inaccurate parametric model illustrates the utility of our proposed hybrid method.

In Figure 4b, we show estimation results for different degrees of freedom in the Student's-t distributions. With the other parameters fixed, increasing the number of degrees of freedom lessens the overlap between the two density functions; hence, the true KL divergence (red) grows as the degrees of freedom increase. The NNS estimator, with estimated scale and location, achieves a very accurate estimate of the true KL divergence when the number of degrees of freedom is small. With high degrees of freedom, however, the difficulty of estimating the correct degrees of freedom gives rise to inaccurate estimates of the KL divergence. In this case, the gaussian model, although only approximate, yields better estimates in our method.

In this experiment, conventional NN performs poorly when the number of degrees of freedom is large. Considering that the curvature of the density increases with the degrees of freedom, our theoretical analysis explains that the inaccuracy in conventional NN estimates comes from the finite sampling bias associated with this curvature. The deviation of GP is reduced as the degrees of freedom increase, because the true Student's-t distribution approaches a gaussian, though it remains large compared with the nonparametric methods. Finally, the best performance in estimating the true KL divergence at large degrees of freedom is again achieved by NNG.

As shown in the experiments, metric information from coarse parametric models—in our case, gaussian for both densities—can help improve the estimation. The proposed metric does not overfit because the metric fitted globally is quite robust to local noise. Thus, the proposed hybrid method seamlessly incorporates global information from parametric models with local nonparametric estimation. It is similar in spirit to previous work that combines generative and discriminative methods in classification problems (Lacoste-Julien, Sha, & Jordan, 2009; Lasserre, Bishop, & Minka, 2006; Raina, Shen, Ng, & McCallum, 2004). For classification, hybrid methods are needed when there are different input regimes where one model works better than others (Ng & Jordan, 2001). Previously, it was shown that discriminative nearest-neighbor classification can be improved using a generative model (Noh et al., 2010). This letter shows how nonparametric divergence estimation can also be improved by incorporating information from parametric generative models.

### 3.2 Change Detection in Time Series

We next examine the problem of discovering change points in time-series data. In some cases, the change could be due to a shifting mean or variance, which can be detected by a simple difference measure in equation 1.2; otherwise, more complex statistical properties need to be detected.
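As a hedged illustration of window-based change scoring (not the letter's NNG detector), each time point can be scored by a divergence between the windows before and after it; here a univariate gaussian plug-in score stands in for the divergence estimator, and the window size is an arbitrary choice:

```python
import numpy as np

def gaussian_window_score(series, w):
    """Score each index by the symmetrized gaussian KL divergence between
    the w samples before it and the w samples after it. This univariate
    plug-in score is a simplified stand-in for a divergence estimator."""
    def kl(m1, v1, m2, v2):
        return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)
    scores = np.zeros(len(series))
    for t in range(w, len(series) - w):
        a, b = series[t - w:t], series[t:t + w]
        m1, v1 = a.mean(), a.var() + 1e-12
        m2, v2 = b.mean(), b.var() + 1e-12
        scores[t] = kl(m1, v1, m2, v2) + kl(m2, v2, m1, v1)
    return scores

rng = np.random.default_rng(0)
series = np.concatenate([rng.normal(0, 1, 300), rng.normal(2, 1, 300)])
scores = gaussian_window_score(series, w=50)
print(int(np.argmax(scores)))   # peaks near the true change point at t = 300
```

Replacing the gaussian window score with a nearest-neighbor divergence estimate gives the kind of detector evaluated below; thresholding the score trace yields the ROC curves summarized in Table 1.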

We use data collected for the Human Activity Sensing Consortium (HASC) Challenge 2011,^{4} which contains human activity information measured by a portable three-axis accelerometer. The task is to segment different behaviors such as “stay,” “walk,” “jog,” and “skip.” Because the orientation of the accelerometer is not fixed, we take the ℓ2-norm of the three-dimensional accelerometer data as the time series.

Figure 5 shows an example of the time series from HASC and its corresponding change points. We measured the KL divergence using NNG, NN, GP, fRisk, KLIEP, and NNW and compared the accuracies of the different methods in determining whether a change point is properly detected within a small tolerance window (a fixed number of samples) around the true labeled change point. The change point classification is performed for various thresholds, and the area under the ROC curve (AUC) scores are reported in Table 1.

| | NNG | NN | GP | fRisk | KLIEP | NNW |
|---|---|---|---|---|---|---|
| Original data | *0.810 (0.052) | 0.759 (0.051) | 0.787 (0.032) | 0.793 (0.047) | 0.576 (0.055) | 0.748 (0.050) |
| Gaussian noise | 0.550 (0.055) | 0.510 (0.082) | *0.608 (0.078) | 0.543 (0.078) | 0.491 (0.061) | 0.532 (0.058) |
| Poisson noise | *0.735 (0.028) | 0.709 (0.032) | 0.599 (0.060) | 0.729 (0.043) | 0.511 (0.041) | 0.699 (0.036) |

Note: In addition to the original data, data corrupted with gaussian and Poisson noise are tested. The methods with the best accuracy are starred, and those with p-values less than 0.05 from the single-sided t-test are in bold.

In Figure 5, the estimated KL divergence from NNG is compared with that from NN. The two methods show similar overall tendencies; however, NNG is quantitatively more accurate in detecting the locations of the true change points. Table 1 compares the accuracies of NNG, NN, GP, fRisk, KLIEP, and NNW for change point detection on the HASC data set. We expect that methods with more accurate estimates of the KL divergence will perform better in change point detection. Indeed, our NNG method outperforms the other methods not only on the original data set but also on data corrupted by noise. With gaussian noise, all algorithms show decreased accuracy, while the drop in performance of GP is relatively small, demonstrating that the mean and covariance statistics are good indicators of the change points. With Poisson noise, on the other hand, the mean and covariance information is more severely corrupted, and the parametric GP estimator performs worse than the nonparametric methods.

Surprisingly, fRisk performs well in this change point detection experiment, whereas it mostly fails at estimating the actual KL divergence on synthetic data. fRisk differs from the other estimators in that it is relatively insensitive when the KL divergence is very large, as shown in Figures 2e and 2f, and this property may serve to regularize its performance.

### 3.3 Feature Selection Using Jensen-Shannon Divergence Estimation

Another application that we consider is the use of KL-divergence estimation to select a smaller number of features relevant for classification.

#### 3.3.1 Scoring Features

A particular parametric approximation of the JS divergence, mIMR, has been suggested for feature selection to analyze gene expression data (Bontempi & Meyer, 2010). This method parameterizes certain correlations between features and labels, but mIMR may miss some higher-order associations. Another popular method using kernel functions, HSIC1 (Song et al., 2012), can capture information related to the JS divergence (Gretton et al., 2008). However, this method is easily corrupted by outliers. In our experiments, we test HSIC1 using a third-order polynomial kernel.

In contrast to the KL divergence, the JS divergence has no analytic solution between gaussians because of the mixture distribution. This makes plug-in estimators with estimated parameters problematic for the JS divergence. However, this poses no issue for our method, since it uses parametric density information only to define a metric and relies on nonparametric statistics to estimate the actual divergence. In the experiments below, gaussians for the class-conditional densities are estimated by maximum likelihood and used to define a metric for nearest-neighbor estimation of the JS divergence.
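The lack of a closed form can be seen directly: the JS divergence involves the mixture density, so even between gaussians it must be evaluated numerically. A simple Monte Carlo sketch for univariate gaussians (sample sizes and function names are illustrative):

```python
import numpy as np

def js_divergence_mc(mu1, s1, mu2, s2, n=200000, seed=0):
    """Monte Carlo JS divergence between two univariate gaussians.
    Because the mixture m = (p + q)/2 has no gaussian form, there is no
    closed-form answer; we average log-ratios over samples instead."""
    rng = np.random.default_rng(seed)
    def pdf(x, mu, s):
        return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))
    x1 = rng.normal(mu1, s1, n)            # samples from p
    x2 = rng.normal(mu2, s2, n)            # samples from q
    m1 = 0.5 * (pdf(x1, mu1, s1) + pdf(x1, mu2, s2))
    m2 = 0.5 * (pdf(x2, mu1, s1) + pdf(x2, mu2, s2))
    kl_pm = np.mean(np.log(pdf(x1, mu1, s1) / m1))
    kl_qm = np.mean(np.log(pdf(x2, mu2, s2) / m2))
    return 0.5 * (kl_pm + kl_qm)

print(js_divergence_mc(0.0, 1.0, 0.0, 1.0))   # ~0 for identical densities
print(js_divergence_mc(0.0, 1.0, 20.0, 1.0))  # approaches log 2 for disjoint ones
```

The nonparametric route used in our method avoids evaluating the mixture density altogether by working with nearest-neighbor statistics of the pooled samples.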

#### 3.3.2 Feature Selection in High-Dimensional Noisy Gaussians

We first consider synthetic data from two high-dimensional probability density functions and examine the performance of the various feature selection methods. Two 1000-dimensional gaussian densities are prepared that differ only in a 30-dimensional subspace. Even with a significant difference in this subspace, identifying these 30 dimensions correctly is difficult with a small number of samples.

The first gaussian has a scaled 1000-dimensional random covariance matrix with maximum eigenvalue 1. The second gaussian has the same covariance matrix, except that an additional random positive-semidefinite symmetric matrix, with maximum eigenvalue 1 and zero elements outside the 30-dimensional subspace, is added to it; the two covariance matrices therefore differ only in the 30-dimensional subspace. To make the problem even more difficult, uniform noise between 0 and 10 is added to all features with a fixed probability.

We use forward selection to choose the best 20 features, and the number of true relevant features among those selected is shown in Figure 6a, along with the classification performance with those selected features in Figure 6b.
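Forward selection itself is straightforward to sketch: greedily add the feature that most increases a set-scoring function, which in our setting would be the estimated divergence between class-conditional densities. The scoring function below is a toy stand-in for such an estimate:

```python
import numpy as np

def forward_select(score_fn, n_features, k):
    """Greedy forward selection: repeatedly add the feature whose inclusion
    maximizes score_fn(subset). score_fn is any set-scoring function, e.g.
    an estimated divergence between class-conditional distributions."""
    selected = []
    remaining = list(range(n_features))
    for _ in range(k):
        best = max(remaining, key=lambda f: score_fn(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy check: with a score that simply sums per-feature relevances,
# forward selection recovers the top-k most relevant features.
relevance = np.array([0.1, 0.9, 0.2, 0.8, 0.05])
score_fn = lambda subset: relevance[subset].sum()
print(forward_select(score_fn, 5, 2))   # [1, 3]
```

With a divergence-based score, each candidate evaluation requires re-estimating the divergence on the enlarged subset, which is where the nearest-neighbor estimators above are plugged in.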

With a small number of samples of 1000-dimensional feature vectors and noisy observations, it is quite difficult to find the true discriminative 30-dimensional subspace. Compared with the other algorithms, the nearest-neighbor-based methods discover a significantly larger number of correct components. NN performs reasonably in most situations, but NNG, using a learned metric, always finds more relevant components, resulting in better classification accuracy.

#### 3.3.3 Gene Selection with Microarray Data

We applied our feature selection method using the JS divergence to the problem of selecting genes for breast cancer prognosis. The experiments were performed with two gene expression data sets collected from clinical studies (Curtis et al., 2012; Freije et al., 2004; Loi et al., 2007; Spira et al., 2007; van't Veer et al., 2002); samples in each data set are classified using the selected genes, and the results are compared with those of other feature selection methods. Hyperparameters for all methods were chosen via cross-validation on a separate validation set not used for evaluation.

For the NNG method, gaussian parametric models are used for the two class-conditional densities. As described above, the gaussian models are used only to determine an appropriate metric, and this metric is used to obtain the nearest-neighbor information for estimating the JS divergence between the class-conditional distributions. The best features are then selected based on these estimates.

We use the forward selection strategy for feature selection and compare the average classification AUC of the proposed NNG method, the plain NN method, mIMR, a parametric method for estimating the JS divergence (Bontempi & Meyer, 2010), and a simple univariate t-score. The results are reported in Figure 7, showing that the proposed NNG method compares favorably with the other methods.

## 4 Conclusion

In this work, we have shown how nonparametric nearest-neighbor estimation techniques can be significantly improved by the proper choice of a metric. Finite sampling causes bias in nearest-neighbor-based estimates, and this bias can be reduced by choosing a metric that minimizes the effect of curvature in the underlying densities. We show how the bias can be computed using parametric density models and how the optimal metric function for nearest-neighbor distances can be learned from the parametric models.

We note that our proposed metric in equation 2.56 reduces the finite sampling bias to leading order and is effective in providing more accurate estimates in our empirical tests. Nevertheless, in some situations, higher-order deviations not included in our analysis may become important. Further work is needed to properly include those effects in the analysis and could result in more sophisticated metrics for nonparametric divergence estimation.

Finally, the dependency of the estimation on the assumptions of the underlying parametric models needs to be investigated more systematically. In this work, we showed that using even approximate gaussian models to globally model the underlying data distributions can yield useful estimates. Better results, however, may be obtained using more sophisticated generative models in combination with nonparametric estimation methods. One possible extension would be incorporating generative mixture models with this method.

In conclusion, we believe that the combination of parametric density modeling in conjunction with nonparametric nearest-neighbor methods will be valuable in analyzing data distributions with smaller numbers of samples. This letter has demonstrated its utility for KL-divergence estimation with applications for change detection, classification, and feature selection; it is anticipated that future extensions of this method will be adapted for a wide range of applications.

## Appendix: Derivation of Metric Matrix in Equation 2.53

## Notes

^{1}

We used N − 1 instead of N because one degree of freedom was used for estimating the expectation in equation 1.1.

^{2}

We note that the KL divergence itself is metric invariant. Due to the positive definiteness of $A$, we can always find a full-rank matrix $L$ such that $A = L^\top L$. Then the metric change can be regarded as a linear transformation of variables $y = Lx$, which uses the Euclidean metric in the transformed $y$-space. In this case, because $\tilde{p}(y)\,dy = p(x)\,dx$ and $\tilde{q}(y)\,dy = q(x)\,dx$, we have $D_{\mathrm{KL}}(\tilde{p}\,\|\,\tilde{q}) = D_{\mathrm{KL}}(p\,\|\,q)$.

^{3}

Code for KL-divergence estimation using our learned metric can be found at https://github.com/nohyung/KL-Divergence-Estimation-Metric/tree/master/matlab.

## Acknowledgments

Y.K.N. is supported by grants from NRF/MSIT-2017R1E1A1A03070945, M.S. and M.C.dP. from the JST CREST JPMJCR1403, S.L. from KAKENHI grant-in-Aid (RAS 15H06823), and Y.K.N. and F.C.P. from BK21Plus and MITIP-10048320. D.D.L. acknowledges support from the U.S. NSF, NIH, ONR, ARL, AFOSR, DOT, and DARPA.

## References
