Abstract

Nearest-neighbor estimators for the Kullback-Leibler (KL) divergence that are asymptotically unbiased have recently been proposed and demonstrated in a number of applications. However, with a small number of samples, nonparametric methods typically suffer from large estimation bias due to the nonlocality of information derived from nearest-neighbor statistics. In this letter, we show that this estimation bias can be mitigated by modifying the metric function, and we propose a novel method for learning a locally optimal Mahalanobis distance function from parametric generative models of the underlying density distributions. Using both simulations and experiments on a variety of data sets, we demonstrate that this interplay between approximate generative models and nonparametric techniques can significantly improve the accuracy of nearest-neighbor-based estimation of the KL divergence.

1  Introduction

In large-scale data analysis, it is critical that differences between data distributions are accurately modeled and quantitatively evaluated. The use of principled information-theoretic criteria to quantify differences in data distributions has been especially effective (Gao, Steeg, & Galstyan, 2015; Kraskov, Stögbauer, & Grassberger, 2004; Wang, Kulkarni, & Verdu, 2009b). In this letter, we consider the important problem of estimating the Kullback-Leibler (KL) divergence between two probability density functions, $p(x)$ and $q(x)$:
$$D(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)}\,dx, \qquad (1.1)$$
given independent and identically distributed (i.i.d.) samples from the two distributions: $X = \{x_1, \dots, x_N\}$ from $p$, and $Y = \{y_1, \dots, y_M\}$ from $q$, respectively.

Estimates of the KL divergence have been used for a variety of machine learning applications, including homogeneity testing of a density function (Kanamori, Suzuki, & Sugiyama, 2012), dependency testing for feature selection (Bontempi & Meyer, 2010; Brown, 2009; Peng, Long, & Ding, 2005; Yu & Liu, 2004), state change detection (Kawahara & Sugiyama, 2012; Liu, Yamada, Collier, & Sugiyama, 2013), and model parameter estimation (Ranneby, 1984). Moreover, other information-theoretic measures, such as entropy and the Jensen-Shannon divergence, can be expressed in terms of the KL divergence. Therefore, an efficient algorithm for estimating the KL divergence is critical for a range of applications in machine learning and statistics.

Current approaches to estimating the KL divergence can be roughly divided into parametric and nonparametric approaches. Parametric approaches typically assume a particular form for the underlying densities and first estimate the parameters of the density models from the data. The KL divergence is then estimated by computing the divergence of the fitted parametric models. For example, using gaussian assumptions for both $p$ and $q$, the corresponding means and covariance matrices are first estimated using maximum likelihood. The KL divergence is then computed from the following closed-form estimate:
$$\hat{D}_{\mathrm{GP}} = \frac{1}{2}\left[\log\frac{|\hat\Sigma_q|}{|\hat\Sigma_p|} + \operatorname{tr}\!\left(\hat\Sigma_q^{-1}\hat\Sigma_p\right) + (\hat\mu_p - \hat\mu_q)^\top \hat\Sigma_q^{-1}(\hat\mu_p - \hat\mu_q) - d\right], \qquad (1.2)$$
where $|\Sigma|$ denotes the determinant of matrix $\Sigma$. This approximation and its variants (Bontempi & Meyer, 2010; Das & Nenadic, 2008) are computationally efficient and stable. When the true distributions are close to gaussians, this estimate will be quite accurate; however, when the parametric models are not valid, this estimate will not asymptotically converge to the true value of the divergence.
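As a concrete illustration, the following minimal sketch implements this parametric plug-in estimate, assuming maximum-likelihood mean and covariance estimates and the standard closed-form KL divergence between two gaussians (variable names are illustrative):

```python
import numpy as np

def gaussian_plugin_kl(X, Y):
    """Plug-in KL estimate D(p||q) under gaussian assumptions (cf. equation 1.2).

    X, Y: (N, d) and (M, d) arrays of i.i.d. samples from p and q.
    """
    d = X.shape[1]
    mu_p, mu_q = X.mean(axis=0), Y.mean(axis=0)
    Sigma_p = np.cov(X, rowvar=False, bias=True)   # maximum-likelihood covariance
    Sigma_q = np.cov(Y, rowvar=False, bias=True)
    Sq_inv = np.linalg.inv(Sigma_q)
    diff = mu_p - mu_q
    return 0.5 * (np.log(np.linalg.det(Sigma_q) / np.linalg.det(Sigma_p))
                  + np.trace(Sq_inv @ Sigma_p)
                  + diff @ Sq_inv @ diff
                  - d)
```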
In contrast, nonparametric approaches do not make any restrictive assumptions about the underlying density models. One popular nonparametric approach uses only nearest-neighbor distances (Leonenko, Pronzato, & Savani, 2008; Pérez-Cruz, 2008; Wang, Kulkarni, & Verdu, 2006):
$$\hat{D} = \frac{d}{N}\sum_{i=1}^{N}\log\frac{\nu_k(x_i)}{\rho_k(x_i)} + \log\frac{M}{N-1}, \qquad (1.3)$$
where $\rho_k(x_i)$ and $\nu_k(x_i)$ denote the distances from $x_i \in X$ to its $k$th nearest neighbor in $X \setminus \{x_i\}$ and in $Y$, respectively. Previous work has shown that the associated nearest-neighbor density estimates of $p$ and $q$ are consistent up to a common normalization factor (Loftsgaarden & Quesenberry, 1965). Convergence to the individual underlying densities does not necessarily imply convergence of the density ratio; nevertheless, it has recently been proved that the estimator in equation 1.3 satisfies both almost sure convergence and convergence in mean to the true KL divergence in equation 1.1 (Leonenko & Pronzato, 2010; Leonenko, Pronzato, & Savani, 2008; Póczos & Schneider, 2011, 2012). Moon and Hero (2014) also proved asymptotic normality for various $f$-divergence plug-in estimators, including equation 1.3. A bound on the bias in the estimate of an individual probability density from finite samples was recently derived using the concentration properties of the nearest-neighbor estimates (Singh & Poczos, 2016). An alternative nonparametric approach in Reshef et al. (2011) counts the number of samples that lie close to a point when the samples are projected along particular coordinates. Another approach that directly estimates the density ratio was shown to achieve the optimal nonparametric convergence rate in the minimax sense (Nguyen, Wainwright, & Jordan, 2007; Sugiyama et al., 2008).
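For reference, a minimal sketch of the estimator in equation 1.3 with brute-force Euclidean nearest-neighbor search (the additive constant follows the form of equation 1.3 as written above; for large samples a tree-based neighbor search would be preferable):

```python
import numpy as np

def knn_kl_divergence(X, Y, k=1):
    """Nearest-neighbor KL estimate of D(p||q) from X ~ p and Y ~ q (equation 1.3).

    rho_k(x_i): distance from x_i to its k-th nearest neighbor in X without x_i.
    nu_k(x_i):  distance from x_i to its k-th nearest neighbor in Y.
    """
    N, d = X.shape
    M = Y.shape[0]
    d_xx = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    d_xy = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    np.fill_diagonal(d_xx, np.inf)            # exclude the query point itself
    rho = np.sort(d_xx, axis=1)[:, k - 1]     # k-th NN distances within X
    nu = np.sort(d_xy, axis=1)[:, k - 1]      # k-th NN distances within Y
    return d / N * np.sum(np.log(nu / rho)) + np.log(M / (N - 1))
```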

Unfortunately, all of these nonparametric methods suffer from high bias when used with a finite number of samples. For this reason, when only a small number of samples are available, parametric methods are typically more reliable than nonparametric methods. In this letter, we show how to improve a nonparametric estimator when using a finite number of samples by incorporating information from parametric models. More specifically, the bias of a nonparametric estimator can be reduced using information from parametric models by learning an appropriate local Mahalanobis-type metric for nearest-neighbor selection. Due to the invariance property of the KL divergence under changes to the metric function, convergence of the nearest-neighbor estimator to the true divergence is guaranteed regardless of the metric. Thus, the existing known asymptotic convergence properties of nearest-neighbor estimators apply to our hybrid method without any additional assumptions.

The remainder of the letter is organized as follows. In section 2, we describe nearest-neighbor-based methods for estimating the KL divergence. We then derive the finite sampling bias error and show how a metric for minimizing the bias can be learned. In section 3, we provide experimental results using a number of synthetic and benchmark data sets showing how our proposed method can improve the accuracy for KL divergence estimates. We conclude in section 4 with a discussion of our results and potential future work.

2  Metric Learning for Nearest-Neighbor KL-Divergence Estimation

In this section, we analyze the bias of nearest-neighbor estimation for the KL divergence and show that it depends on the distance metric. We then propose a metric learning algorithm to minimize this bias using parametric information.

2.1  Bias of Nearest-Neighbor Density Estimation

Let us consider a $d$-dimensional density estimation problem from $N$ samples $X = \{x_1, \dots, x_N\}$ following $p(x)$. A simple estimate of the underlying density, $\hat p(x)$, is computed from the nearest-neighbor distances (Loftsgaarden & Quesenberry, 1965):
$$\hat p(x) = \frac{k}{N\,V_d\,\rho_k^d(x)}, \qquad (2.1)$$
$$V_d = \frac{\pi^{d/2}}{\Gamma\!\left(\frac{d}{2}+1\right)}, \qquad (2.2)$$
$$x_{\mathrm{NN}} = \operatorname*{arg\,min}_{x' \in X \setminus \{x\}} \|x' - x\|, \qquad (2.3)$$
$$\rho_1(x) = \|x - x_{\mathrm{NN}}\|, \qquad (2.4)$$
where $\Gamma(\cdot)$ denotes the gamma function, $x_{\mathrm{NN}}$ is the nearest neighbor in $X$ from $x$, and $\rho_1(x)$ is the distance to that nearest neighbor ($\rho_k(x)$ is, more generally, the distance to the $k$th nearest neighbor).
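A direct transcription of this density estimate follows (a sketch; the query point is assumed not to be an element of the sample set, otherwise it should be excluded from the search):

```python
import numpy as np
from scipy.special import gamma

def knn_density(X, x, k=1):
    """k-NN density estimate at x (equation 2.1): k / (N * V_d * rho_k(x)^d)."""
    N, d = X.shape
    V_d = np.pi ** (d / 2) / gamma(d / 2 + 1)   # unit-ball volume (equation 2.2)
    rho_k = np.sort(np.linalg.norm(X - x, axis=1))[k - 1]
    return k / (N * V_d * rho_k ** d)
```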
The KL divergence between two distributions can then be estimated by plugging in the estimated densities computed from equation 2.1 (Póczos & Schneider, 2011). To analyze the bias in this method, we first consider how finite sampling perturbs the density estimate (Loftsgaarden & Quesenberry, 1965). When the underlying probability density function is twice differentiable, the second-order Taylor expansion of $p$ around $x$ gives
$$p(x') \approx p(x) + \nabla p(x)^\top (x' - x) + \frac{1}{2}(x' - x)^\top \nabla\nabla p(x)\,(x' - x), \qquad (2.5)$$
where $\nabla p(x)$ and $\nabla\nabla p(x)$ denote the gradient and Hessian of $p$ at $x$, respectively. In this expansion, we assume the nearest neighbors to be at a finite, nonzero distance from the point of interest $x$, which is generally true for points sampled from continuous densities. In this finite-sample situation, the bias of the nearest-neighbor density estimator is given by averaging equation 2.5 within the $d$-dimensional hypersphere with center $x$ and radius $\rho$.

2.1.1  Average Probability Density within a Sphere

We analyze a $d$-dimensional sphere with radius $r$, centered at a point $x$ with probability density $p(x)$. In a neighborhood around $x$, the density at $x'$ can be approximated using the second-order Taylor series:
$$p(x') \approx p(x) + \nabla p(x)^\top (x' - x) + \frac{1}{2}(x' - x)^\top H\,(x' - x), \qquad (2.6)$$
where $H = \nabla\nabla p(x)$ is the Hessian of $p$ at $x$. The average probability within the hypersphere can be obtained by integrating equation 2.6:
$$\int_{\|x' - x\| \le r} p(x')\,dx' \approx \int_{\|x' - x\| \le r} \left[p(x) + \nabla p(x)^\top (x' - x) + \frac{1}{2}(x' - x)^\top H\,(x' - x)\right] dx'. \qquad (2.7)$$
The first term is the probability density at $x$ multiplied by the volume of the hypersphere:
$$\int_{\|x' - x\| \le r} p(x)\,dx' = p(x)\,V_d\,r^d, \qquad (2.8)$$
with $V_d$ the volume of the hypersphere with radius 1. The second term can be converted into an integration along the direction of the gradient, $\nabla p(x)$. Along this direction, we use the parameterization $u$, the signed distance from the center $x$. Because the integrand can be written using $u$ as
$$\nabla p(x)^\top (x' - x) = \|\nabla p(x)\|\,u, \qquad (2.9)$$
the integrand is constant when $u$ is constant. The second term therefore can be understood as an integration over the single variable $u$, with weight equal to the volume of the $(d-1)$-dimensional hypersphere of radius $\sqrt{r^2 - u^2}$ orthogonal to the gradient vector:
$$\int_{\|x' - x\| \le r} \nabla p(x)^\top (x' - x)\,dx' = \|\nabla p(x)\| \int_{-r}^{r} u\,V_{d-1}\left(r^2 - u^2\right)^{\frac{d-1}{2}} du. \qquad (2.10)$$
Here, $V_{d-1}$ denotes the volume of the unit hypersphere in the orthogonal $(d-1)$-dimensional subspace. The integration can now be performed over $u$, and because the integrand is an odd function of $u$, it is easy to see that the contribution of the integral in the second term is exactly zero.
The third term can be rewritten in terms of the eigenvectors of the Hessian $H$, denoted by $v_1, \dots, v_d$. We represent the difference vector using an expansion in these eigenvectors with coefficients $c_1, \dots, c_d$:
$$x' - x = \sum_{j=1}^{d} c_j\,v_j. \qquad (2.11)$$
We then obtain
$$\frac{1}{2}(x' - x)^\top H\,(x' - x) = \frac{1}{2}\sum_{j=1}^{d} \lambda_j\,c_j^2, \qquad (2.12)$$
where $\lambda_1, \dots, \lambda_d$ are the corresponding eigenvalues of $H$.
In order to obtain the average probability over the hypersphere, we first note two equalities:
formula
2.13
formula
2.14
where we used a change of variables and the equality,
formula
2.15
We now reformulate equation 2.11 as
formula
2.16
and recursively integrate using equation 2.13 and equation 2.14, yielding
$$\frac{1}{2}\int_{\|x' - x\| \le r} (x' - x)^\top H\,(x' - x)\,dx' = \frac{V_d\,r^{d+2}}{2(d+2)}\,\Delta p(x), \qquad (2.17)$$
with the Laplacian of $p$ at $x$, $\Delta p(x) = \sum_{j=1}^{d} \lambda_j$.
By summing equations 2.8 and 2.17 and dividing by the volume of the hypersphere, the average value of the probability density within the hypersphere becomes
$$\frac{1}{V_d\,r^d}\int_{\|x' - x\| \le r} p(x')\,dx' \approx p(x) + \frac{r^2}{2(d+2)}\,\Delta p(x). \qquad (2.18)$$
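Equation 2.18 can be checked numerically. The following sketch assumes a standard two-dimensional gaussian, whose Laplacian is available in closed form, and compares a Monte Carlo ball average against the second-order expression:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 2, 0.3
x0 = np.array([1.0, 0.5])

def p(x):
    """Standard 2D gaussian density."""
    return np.exp(-0.5 * np.sum(x ** 2, axis=-1)) / (2 * np.pi)

def laplacian_p(x):
    """Closed-form Laplacian of the standard gaussian: p(x) * (||x||^2 - d)."""
    return p(x) * (np.sum(x ** 2) - d)

# uniform samples inside the ball of radius r around x0 (rejection from a cube)
u = rng.uniform(-r, r, size=(400000, d))
u = u[np.linalg.norm(u, axis=1) <= r]
ball_average = p(x0 + u).mean()

second_order = p(x0) + r ** 2 / (2 * (d + 2)) * laplacian_p(x0)   # equation 2.18
print(ball_average, second_order)   # the two values agree closely
```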

2.1.2  Deviation of the Probability Estimation

Next, we show that the deviation of the probability estimation is asymptotically the same as the deviation of the average probability density from the density at the center $x$. We first note that the volume of a $d$-dimensional hypersphere with diameter $D$ is $V_d (D/2)^d$.

We consider expanding the volume of the hypersphere at a constant rate until a nearest neighbor appears. As a function of time $t$, with the volume expanding at a constant unit rate, the probability density in time that the first data point (the nearest neighbor) appears is given by
formula
2.19
where the total probability is given by the product of the spherical volume and its average density. Equation 2.19 is the probability of a nearest neighbor appearing within the step increment $dt$, conditioned on the fact that no nearest neighbor has appeared at previous times.
By taking the logarithm of both sides and approximating for small $dt$,
formula
2.20
formula
2.21
formula
2.22
The density function in time of the nearest neighbor appearing is then given by the expression
formula
2.23
We compare this nonstationary density function with one having a constant probability density. If the probability density is constant, the solution to equation 2.23 is a simple exponential:
formula
2.24
For a general density, we consider the product of the spherical volume and the average probability density within it. Its derivative in time can then be derived:
formula
2.25
formula
2.26
Here, we use the fact that the volume is assumed to increase at a unit rate. Recalling equation 2.18,
formula
2.27
Taking the exponential of both sides of equation 2.22 and using equation 2.27, we can simplify equation 2.19 as
formula
2.28
formula
2.29
where we use the relationship between the diameter and time.

The distortion in the observed nearest-neighbor distance arises from the assumption that the data are generated from a locally uniform density even though they are in fact generated from the true density $p$. As shown in Figure 1, the observed nearest-neighbor distance should ideally be close to the distance expected under the uniform density, and the area under the true density should match the area under the uniform density. The objective of our metric learning algorithm is thus to modify the metric in order to minimize the difference between the two.

Figure 1:

Area of the uniform probability density and the area of the true probability density.

In order for the two areas to match, we require
formula
2.30
where the density is held constant at $p(x)$. This gives the relationship between the nearest-neighbor distances:
formula
2.31
The distortion between the observed and nondistorted nearest-neighbor distances can then be written as follows:
formula
2.32
formula
2.33
formula
2.34
and by substitution, we obtain the amount of underestimation in terms of the observed nearest-neighbor distance:
formula
2.35
Naively, the estimation can be directly corrected by adding this deviation to the observed nearest-neighbor distance, but rather than correcting the scalar deviation at each nearest-neighbor point, we seek a better definition of the distance itself, one that produces a small bias.
When the estimator is used for the estimation of $p(x)$, the bias in the density estimate can therefore be expressed as
formula
2.36
formula
2.37
formula
2.38
with the expectation taken over the nearest neighbor drawn from $p$.
The bias can be further calculated using the density estimated from the nearest-neighbor distance. It is known from Goria, Leonenko, Mergel, and Inverardi (2005) and Wang et al. (2006) that the normalized quantity $N\,V_d\,\rho_1^d(x)\,p(x)$ converges in distribution to the exponential distribution with rate parameter 1. Using this asymptotic density function, the expectation can be calculated as
formula
2.39
formula
2.40
and thus the bias can be approximated as
formula
2.41
formula
2.42
According to this expression, the bias depends on the curvature of the underlying density function, $\Delta p(x)$, and vanishes as $N$ goes to infinity. If either the underlying density function has a small curvature or $N$ is large, the bias will be small.
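The asymptotic exponential law used above can also be verified empirically. A small Monte Carlo sketch with a standard gaussian density follows; the residual deviation of the sample mean from 1 reflects the finite-sample, curvature-dependent bias just discussed:

```python
import numpy as np
from scipy.special import gamma

rng = np.random.default_rng(1)
d, N, trials = 3, 2000, 3000
V_d = np.pi ** (d / 2) / gamma(d / 2 + 1)
x0 = np.zeros(d)
p_x0 = (2 * np.pi) ** (-d / 2)               # standard gaussian density at the origin

z = np.empty(trials)
for i in range(trials):
    X = rng.standard_normal((N, d))          # N samples from p
    rho = np.linalg.norm(X - x0, axis=1).min()   # 1-NN distance from x0
    z[i] = N * V_d * rho ** d * p_x0
# For an Exp(1) variable, the mean is 1 and P(z > 1) = exp(-1).
print(z.mean(), (z > 1.0).mean(), np.exp(-1.0))
```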

When the curvature term vanishes, equation 2.29 is equivalent to the Laplace probability density function and is related to the estimators derived in Leonenko and Pronzato (2010), Leonenko et al. (2008), and Póczos and Schneider (2011, 2012) under the assumption of a flat underlying probability density. However, for finite samples, we see that the curvature term may not be small, and the more complete expression in equation 2.29 needs to be considered for more accurate density estimation.

2.1.3  Comments on Using -Nearest Neighbors

The derivation in the previous section does not directly apply to the $k$th nearest neighbor for $k > 1$. A similar calculation to equation 2.23 gives the following expression for the density of the $k$th-nearest-neighbor distance:
formula
2.43
which should be compared with the associated gamma density function for the $k$th-nearest-neighbor distance (Wang et al., 2006):
formula
2.44
Analogous to equation 2.30, the relationship between the distorted distance and the nondistorted distance under a uniform density assumption can be obtained for $k$-nearest neighbors. However, for $k$-nearest neighbors with $k > 1$, the distortion is not given by a simple analytic relationship as it is for $k = 1$.

We note that in Noh, Zhang, and Lee (2018), classification was considered with $k$-nearest neighbors, and an appropriate metric was obtained analytically. In classification, only the label information of the nearest neighbors is relevant for obtaining a metric. For KL-divergence estimation, however, the actual distances to nearest neighbors are required for the estimator; therefore, the parameter $k$ will modulate the bias in the estimation of the KL divergence differently from $k$-nearest-neighbor classification. Although we do not have an analytic expression for the bias in the general case, we find that our metric learning algorithm, derived for $k = 1$, does improve the performance of KL-divergence estimation using $k$ nearest neighbors, and the proposed method can be used for general $k$, as shown below.

2.2  Bias of Nearest-Neighbor KL-Divergence Estimation

Turning back to the problem of KL-divergence estimation, plugging the density estimate from equation 2.1 into the definition of the KL divergence in equation 1.1 yields the nearest-neighbor KL-divergence estimator in equation 1.3 using the two sets of i.i.d. samples $X$ and $Y$.

Based on the analysis of bias in the nearest-neighbor density estimates, we can analyze the bias of the nearest-neighbor KL-divergence estimation to leading order as
formula
2.45
formula
2.46
formula
2.47
using1
formula
2.48
formula
2.49
Thus, the asymptotic bias of the nearest-neighbor KL-divergence estimator is given as
formula
2.50

Note that this bias is for the estimate at a particular point, and the total bias is given by the expected value. In the asymptotic limit with infinite samples $N, M \to \infty$, the coefficients of the curvature terms become zero, yielding a consistent estimate of the true KL divergence. Unfortunately, in high-dimensional spaces, the convergence rate is slow, with the bias decreasing only at a rate of order $N^{-2/d}$ (and similarly in $M$). However, with the proper choice of metric, the pointwise bias can be significantly reduced even for small $N$ and $M$ by minimizing the effect of the Laplacians $\Delta p$ and $\Delta q$.

2.3  Metric Learning for Bias Reduction

The analysis shows that the bias of nearest-neighbor KL-divergence estimation depends not only on the number of data points but also on the curvatures of the underlying density functions $p$ and $q$. In this section, we show how the bias can be reduced by learning an appropriate distance metric.

We use a local Mahalanobis-type distance metric parameterized by a positive-definite symmetric matrix $M(x)$ at $x$: the distance between the point of interest $x$ and another point $x'$ is
$$d_M(x, x') = \sqrt{(x' - x)^\top M(x)\,(x' - x)}, \qquad (2.51)$$
formula
2.52
Below, $M(x)$ is simply denoted by $M$. The bias depends on the choice of the metric $M$,2 and we see that the bias can be minimized by solving the following semidefinite program:
formula
2.53
formula
2.54
where the symmetric matrix $A$ is defined from the Hessians of the two densities as
formula
2.55
Note that $A$ also depends on $x$. The global solution of this semidefinite program can be obtained analytically. The particular solution we use is the following (Noh, Zhang, & Lee, 2010),
formula
2.56
where $\Lambda_+$ and $\Lambda_-$ are the diagonal matrices containing the positive and negative eigenvalues of $A$, respectively. The matrices $M$ and $A$ share the same eigenvectors; $U_+$ is the collection of eigenvectors that correspond to the eigenvalues in $\Lambda_+$, and $U_-$ corresponds to the eigenvalues in $\Lambda_-$.

Note that the solution of equation 2.54 is not unique. We have selected the metric in equation 2.56 since it is well behaved and provides the minimum deviation for any number of positive and negative eigenvalues. The global scale factor is chosen to satisfy $|M| = 1$ in our experiments. Further details about the derivation of this solution are provided in the appendix.
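The eigendecomposition-based construction described above can be implemented in a few lines. The sketch below is one valid solution consistent with this description, assuming the determinant normalization $|M| = 1$ and choosing the positive/negative scale factors so that the leading bias term $\operatorname{tr}(M^{-1}A)$ cancels whenever $A$ has eigenvalues of both signs; equation 2.56 fixes the particular solution used in the experiments:

```python
import numpy as np

def metric_from_A(A, eps=1e-12):
    """Positive-definite local metric M sharing the eigenvectors of A.

    When A has both positive and negative eigenvalues, the scale factors
    d_plus and d_minus are chosen so that tr(M^{-1} A) = 0; M is then
    normalized to det(M) = 1.  (One valid solution; cf. equation 2.56.)
    """
    lam, U = np.linalg.eigh(A)
    pos, neg = lam > eps, lam < -eps
    d_plus, d_minus = int(pos.sum()), int(neg.sum())
    w = np.maximum(np.abs(lam), eps)          # eigenvalues of M before scaling
    if d_plus and d_minus:
        w[pos] *= d_plus
        w[neg] *= d_minus
    w /= np.exp(np.mean(np.log(w)))           # enforce det(M) = prod(w) = 1
    return (U * w) @ U.T
```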

In order to compute the matrix $A$, we incorporate information from approximate parametric models of the underlying densities $p$ and $q$. For example, when gaussian models are used for $p$ and $q$, we can explicitly derive the matrix using
formula
2.57
where $\hat\mu_p$, $\hat\mu_q$, $\hat\Sigma_p$, and $\hat\Sigma_q$ are estimates of the means and covariances. In our experiments, we have used maximum-likelihood estimators for the mean and covariance parameters.
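Under gaussian models the required Hessians are available in closed form, $\nabla\nabla\hat p(x) = \hat p(x)\,[\hat\Sigma_p^{-1}(x-\hat\mu_p)(x-\hat\mu_p)^\top\hat\Sigma_p^{-1} - \hat\Sigma_p^{-1}]$. The sketch below builds a curvature matrix from the two fitted gaussians; the unweighted difference of the Hessian-over-density terms is an illustrative stand-in, since the exact weighting of equation 2.57 is not reproduced here:

```python
import numpy as np

def hessian_over_density(x, mu, Sigma_inv):
    """(1/p) * Hessian of a gaussian density at x."""
    v = Sigma_inv @ (x - mu)
    return np.outer(v, v) - Sigma_inv

def local_A(x, mu_p, Sigma_p, mu_q, Sigma_q):
    """Curvature matrix at x from gaussian models of p and q (illustrative weighting)."""
    Hp = hessian_over_density(x, mu_p, np.linalg.inv(Sigma_p))
    Hq = hessian_over_density(x, mu_q, np.linalg.inv(Sigma_q))
    return Hp - Hq
```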

At each sample in $X$, we use the learned local metric to obtain nearest-neighbor distances and estimate the log density ratio, and then we use Monte Carlo summation over the samples to estimate the KL divergence. The full procedure is summarized in algorithm 1.

Algorithm 1: Nearest-neighbor KL-divergence estimation with a learned local Mahalanobis metric.
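A compact sketch of the overall procedure is given below. The per-point metric is passed in as a callable so that the gaussian-based construction from the two previous sketches can be plugged in; with the default identity metric the procedure reduces to the plain estimator of equation 1.3 (the function and parameter names are illustrative):

```python
import numpy as np

def mahalanobis_dists(x, Z, M):
    """Distances from x to the rows of Z under the local metric M."""
    diff = Z - x
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, M, diff))

def nng_kl_divergence(X, Y, metric_fn=None, k=1):
    """Nearest-neighbor KL estimate D(p||q) with a local metric at each x_i.

    metric_fn(x) should return a positive-definite matrix, for example
    metric_from_A(local_A(x, ...)) built from fitted gaussian models;
    None falls back to the Euclidean metric.
    """
    N, d = X.shape
    M_samples = Y.shape[0]
    log_ratios = []
    for i, x in enumerate(X):
        M = np.eye(d) if metric_fn is None else metric_fn(x)
        rho = np.sort(mahalanobis_dists(x, np.delete(X, i, axis=0), M))[k - 1]
        nu = np.sort(mahalanobis_dists(x, Y, M))[k - 1]
        log_ratios.append(np.log(nu / rho))
    return d / N * np.sum(log_ratios) + np.log(M_samples / (N - 1))
```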

3  Experiments

In this section, we provide experimental results to illustrate how our method estimates the KL divergence for a variety of problems.3

3.1  KL-Divergence Estimation from Synthetic Data

We compare the performance of the following six nonparametric methods:

  • Proposed nearest-neighbor KL-divergence estimator with the gaussian metric (NNG). The metric is obtained from the gaussian model for each density function.

  • Conventional nearest-neighbor KL-divergence estimator without metric learning (NN).

  • KL-divergence estimator using density-ratio estimation with variational representation (Ratio) (Nguyen et al., 2007).

  • KL-divergence estimator using direct density-ratio estimation with kernel representation (KLIEP) (Sugiyama et al., 2008).

  • Risk-based nearest-neighbor KL-divergence estimator (fRisk) (Garcia-Garcia, von Luxburg, & Santos-Rodriguez, 2011).

  • Bias-reduction method for the nearest-neighbor estimator of Wang, Kulkarni, and Verdu (2009a) (NNW).

First, we consider the estimation of the KL divergence between two gaussian densities with various shapes and dimensionalities. Figure 2 depicts the experimental results for isotropic-isotropic and isotropic-correlated gaussians with increasing numbers of samples. The data are generated from two gaussian density functions of dimensionality $d$ with two different mean parameters. The isotropic-isotropic gaussian experiment uses isotropic covariance matrices for both gaussians. The isotropic-correlated gaussians use one isotropic covariance matrix and one nonisotropic covariance matrix: the nonisotropic covariance matrix has one nonzero element added to the isotropic matrix; Figures 2b, 2d, and 2f use nonisotropic covariance matrices of this form at the corresponding dimensionalities. For these various configurations of gaussian data, the KL-divergence estimation results are provided with increasing numbers of data samples, where the nearest-neighbor estimation with the proposed metric (NNG) consistently shows better estimates than the other estimation methods.

Figure 2:

KL-divergence estimation for two gaussian probability densities. The two means differ, and the length of the mean vectors equals the dimensionality $d$. For isotropic data in panels a, c, and e, the covariance matrices are the same $d \times d$ identity matrices, and for correlated data in panels b, d, and f, the covariance of one gaussian is the identity matrix and the other has one nonzero off-diagonal covariance element added to the identity matrix. The true KL divergence is calculated using analytical integration of the two underlying gaussians from which the data are generated. NNG uses gaussian models, but with estimated parameters.

Figures 3a and 3b show how the estimation results change with an increase in the mean difference and the covariance difference, respectively, for a fixed number of data samples. In Figure 3a, two isotropic five-dimensional gaussians with variance one are used, and the difference in mean (horizontal axis) is increased from 0 to 5. In Figure 3b, two isotropic five-dimensional gaussians of variance one are used with the same mean, and one covariance element (horizontal axis) is increased from 0 to 1, which increases the KL divergence. Compared with the true KL divergence for two gaussian densities, the proposed NNG always provides a significant improvement in accuracy, while NN and Ratio tend to underestimate the true KL divergence.

Figure 3:

KL-divergence estimation for two 5D gaussian probability densities with increasing mean difference (left) and increasing covariance difference (right).

In Figures 2 and 3, we see how the nonparametric estimation using nearest-neighbor information can be improved with the help of parametric modeling. The information comes from fitting parametric generative models globally to the data distributions. The experimental results in Figures 2 and 3 show that the proposed metric is quite efficient in taking advantage of accurate parametric models.

Unfortunately, in many realistic situations, there may not be a parametric model that describes the true densities. However, we show that we can still take advantage of a rough parametric model in defining a good metric for nonparametric estimation. The example we consider is KL-divergence estimation between two Student's $t$ density functions. A one-dimensional Student's $t$ density function is defined as
$$f(x; \nu, \mu, \sigma) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)\sqrt{\pi\nu}\,\sigma}\left(1 + \frac{1}{\nu}\left(\frac{x - \mu}{\sigma}\right)^{2}\right)^{-\frac{\nu+1}{2}}, \qquad (3.1)$$
with location and scale parameters $\mu$ and $\sigma$. The overall shape of the Student's $t$ distribution differs according to the parameter $\nu$, representing the number of degrees of freedom. With high $\nu$, the Student's $t$ distribution is known to have a shape close to gaussian. As $\nu$ decreases, the distribution has heavier tails. We show how the nonparametric estimation of the KL divergence is assisted by a coarse parametric model, a gaussian.
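Because the KL divergence between two such product densities has no simple closed form, a Monte Carlo value of the log-density ratio is convenient as a ground-truth reference in this type of experiment. A short sketch with illustrative parameter values (not the ones used in the figures):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
d, n = 5, 100000
nu_p, loc_p, scale_p = 3.0, 0.0, 1.0   # illustrative parameters for p
nu_q, loc_q, scale_q = 3.0, 1.0, 1.0   # illustrative parameters for q

# 5D data with independent Student's t marginals, drawn from p
X = stats.t.rvs(nu_p, loc=loc_p, scale=scale_p, size=(n, d), random_state=rng)

# Monte Carlo approximation of D(p||q) = E_p[log p(x) - log q(x)]
log_p = stats.t.logpdf(X, nu_p, loc=loc_p, scale=scale_p).sum(axis=1)
log_q = stats.t.logpdf(X, nu_q, loc=loc_q, scale=scale_q).sum(axis=1)
print(np.mean(log_p - log_q))   # reference value for comparing the estimators
```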

We use a 5D density function with independent marginal functions, each of which is represented by equation 3.1. In addition to NNG, NN, and Ratio, we include in our comparison the following methods:

  • Proposed nearest-neighbor KL-divergence estimator with the Student's $t$ metric (NNS)

  • Gaussian parametric method (GP)

Here, GP uses the plug-in estimate from equation 1.2 with maximum-likelihood estimation.

Figure 4a shows the estimation results for 5D Student's $t$ data as the number of data samples is varied. As expected, NNS shows rapid convergence to the true KL divergence by using the model containing the true underlying density function. NN is worse, but it is better than Ratio in this 5D experiment. The large deviation of GP is due to the inaccuracy of the gaussian assumption. However, even with this inaccurate generative model, NNG is quite accurate and comparable to NNS. This robustness to using an inaccurate parametric model illustrates the utility of our proposed hybrid method.

Figure 4:

KL-divergence estimation for two 5D Student's $t$ probability densities. NNG incorporates information from approximate gaussian models, whereas NNS uses the Student's $t$ density models with parameters estimated by maximum likelihood.

In Figure 4b, we show estimation results for different degrees of freedom in the Student's $t$ distributions. With the other parameters fixed, an increase in the number of degrees of freedom lessens the overlap between the two density functions; hence, the true KL divergence (red) grows with the increase of $\nu$. The NNS estimator, with estimated scale and location, achieves a very accurate estimate of the true KL divergence when the number of degrees of freedom is small. However, with high degrees of freedom, the difficulty of estimating the correct degree gives rise to inaccurate estimates of the KL divergence. In this case, the gaussian model, which is only approximate, yields better estimates in our method.

In this experiment, conventional NN performs poorly when $\nu$ is large. Considering that the curvature in the density increases with $\nu$, our theoretical analysis explains that the inaccuracy in conventional NN estimates comes from the finite sampling bias associated with the curvature. The deviation in GP is reduced as we increase $\nu$ because the true Student's $t$ distribution approaches a gaussian, though it is still large compared with the deviations of the other nonparametric methods. Finally, the best performance in estimating the true KL divergence at large $\nu$ is again shown by NNG.

As shown in the experiments, metric information from coarse parametric models—in our case, gaussian for both densities—can help improve the estimation. The proposed metric does not overfit because the metric fitted globally is quite robust to local noise. Thus, the proposed hybrid method seamlessly incorporates global information from parametric models with local nonparametric estimation. It is similar in spirit to previous work that combines generative and discriminative methods in classification problems (Lacoste-Julien, Sha, & Jordan, 2009; Lasserre, Bishop, & Minka, 2006; Raina, Shen, Ng, & McCallum, 2004). For classification, hybrid methods are needed when there are different input regimes where one model works better than others (Ng & Jordan, 2001). Previously, it was shown that discriminative nearest-neighbor classification can be improved using a generative model (Noh et al., 2010). This letter shows how nonparametric divergence estimation can also be improved by incorporating information from parametric generative models.

3.2  Change Detection in Time Series

We next examine the problem of discovering change points in time-series data. In some cases, the change could be due to a shifting mean or variance, which can be detected by a simple difference measure in equation 1.2; otherwise, more complex statistical properties need to be detected.

We consider a column vector $x(t)$ of length $\ell$, representing a segment of the time series $y(t)$ at time $t$, and a collection of such vectors is obtained by taking data samples from a sliding window:
$$\mathcal{X}(t) = \{\,x(t),\ x(t+1),\ \dots,\ x(t+n-1)\,\}, \qquad x(t) = \big(y(t),\ y(t+1),\ \dots,\ y(t+\ell-1)\big)^{\!\top}. \qquad (3.2)$$
As in Kawahara and Sugiyama (2012), the $n$ vectors in $\mathcal{X}(t)$ are modeled as being generated from an underlying density function. We then measure the KL divergence between the densities representing two sets, $\mathcal{X}(t)$ and $\mathcal{X}(t')$, which represent data around time $t$ and a later time $t'$. A change point is detected whenever the estimated KL divergence between the two sets is larger than a predefined threshold.
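A minimal sketch of this sliding-window procedure with a pluggable KL estimator (the segment length, the number of vectors per set, the offset between the two sets, and the detection threshold are all illustrative choices):

```python
import numpy as np

def windowed_vectors(y, ell):
    """All length-ell segments x(t) of a one-dimensional series y."""
    return np.array([y[t:t + ell] for t in range(len(y) - ell + 1)])

def change_scores(y, kl_estimator, ell=5, n=50):
    """KL divergence between the vector sets around t and around t + n."""
    segs = windowed_vectors(y, ell)
    scores = np.zeros(len(segs) - 2 * n + 1)
    for t in range(len(scores)):
        scores[t] = kl_estimator(segs[t:t + n], segs[t + n:t + 2 * n])
    return scores

# A change point is flagged wherever the score exceeds a predefined threshold, e.g.
# detected = np.where(change_scores(series, knn_kl_divergence) > threshold)[0]
```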

We use data collected for the Human Activity Sensing Consortium (HASC) Challenge 2011,4 which contains human activity information measured by a portable three-axis accelerometer. The task is to segment different behaviors such as “stay,” “walk,” “jog,” and “skip.” Because the orientation of the accelerometer is not fixed, we take the $\ell_2$-norm of the three-dimensional accelerometer data as the time series.

Figure 5 shows an example of the time series from HASC and its corresponding change points. We measured the KL divergence using NNG, NN, GP, fRisk, KLIEP, and NNW and compared the accuracies of the different methods in determining whether a change point is properly detected within a small tolerance window of samples around the true labeled change point. The change-point classification is performed for various thresholds, and the area under the ROC curve (AUC) scores are reported in Table 1.

Figure 5:

Examples of time series in HASC and the estimated KL divergence for NNG (middle) and NN (bottom) in both panels a and b.

Table 1:
Area under the ROC Curve (AUC) for Discriminating Change Points.

                   NNG             NN              GP              fRisk           KLIEP           NNW
Original data      *0.810 (0.052)  0.759 (0.051)   0.787 (0.032)   0.793 (0.047)   0.576 (0.055)   0.748 (0.050)
Gaussian noise     0.550 (0.055)   0.510 (0.082)   *0.608 (0.078)  0.543 (0.078)   0.491 (0.061)   0.532 (0.058)
Poisson noise      *0.735 (0.028)  0.709 (0.032)   0.599 (0.060)   0.729 (0.043)   0.511 (0.041)   0.699 (0.036)

Note: In addition to the original data, data corrupted with gaussian and Poisson noise are tested. The methods with the best accuracy are starred, and those with $p$-values less than 0.05 from the single-sided $t$-test are in bold.

In Figure 5, the estimated KL divergence from NNG is compared with NN. The two methods show similar overall tendencies; however, NNG is quantitatively more accurate in detecting the location of the true change points. Table 1 compares the accuracies of NNG, NN, GP, fRisk, KLIEP, and NNW for change-point detection on the HASC data set. We expect that methods with more accurate estimates of the KL divergence will perform better in change-point detection. Indeed, our NNG method outperforms the other methods not only on the original data set but also on the data corrupted by noise. For gaussian noise, all algorithms show decreases in accuracy, while the drop in performance of GP is relatively small, demonstrating that the mean and covariance statistics are good indicators of change points. On the other hand, with Poisson noise, the mean and covariance information is more severely corrupted, and the parametric GP estimator is worse than the nonparametric methods.

Surprisingly, fRisk performs well in this change-point detection experiment, whereas it mostly fails in estimating the actual KL divergence with synthetic data. fRisk differs from the other estimators in that it is relatively insensitive when the KL divergence is very large, as shown in Figures 2e and 2f, and this property may serve to regularize the performance of this estimator.

3.3  Feature Selection Using Jensen-Shannon Divergence Estimation

Another application that we consider is the use of KL-divergence estimation to select a smaller number of features relevant for classification.

3.3.1  Scoring Features

For feature selection, the Welch $t$-score has previously been suggested as a criterion for selecting relevant features by examining the mean difference between two classes relative to the sizes of the class variances $s_1^2$ and $s_2^2$ for classes of size $n_1$ and $n_2$ (Loo, Roberts, Hrebien, & Kam, 2007; Song, Smola, Gretton, Bedo, & Borgwardt, 2012):
$$t = \frac{\bar{x}^{(1)} - \bar{x}^{(2)}}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}. \qquad (3.3)$$
With this criterion, simple correlations between individual features and the labels are considered, but more complex relationships between features are ignored. In order to improve feature selection, we will compare this simple method with the Jensen-Shannon (JS) divergence:
formula
3.4
The JS divergence is a well-known information-theoretic criterion that captures the mutual information between the features and labels. While this criterion has a strong theoretical foundation, the estimation of JS divergence from data is more challenging, and several approximations have been proposed for its use with feature selection.

A particular parametric approximation of the JS divergence, mIMR, has been suggested for feature selection to analyze gene expression data (Bontempi & Meyer, 2010). This method parameterizes certain correlations between features and labels, but mIMR may miss some higher-order associations. Another popular method using kernel functions, HSIC1 (Song et al., 2012), can capture information related to the JS divergence (Gretton et al., 2008). However, this method is easily corrupted by outliers. In our experiments, we test HSIC1 using a third-order polynomial kernel.

Along with the previously described criteria, the JS-divergence estimation method is compared in this section. We write the JS divergence as the combination of two KL divergences:
$$\mathrm{JS} = \pi_1\,D\big(p_1(x)\,\|\,m(x)\big) + \pi_2\,D\big(p_2(x)\,\|\,m(x)\big), \qquad (3.5)$$
where the mixture $m(x) = \pi_1\,p_1(x) + \pi_2\,p_2(x)$ is expressed in terms of the class priors $\pi_1$ and $\pi_2$ and the class-conditional densities $p_1(x)$ and $p_2(x)$ for classes 1 and 2, respectively. We use KL-divergence estimation with our learned metric and compare the feature selection results with the other methods.

In contrast to the KL divergence, the JS divergence does not have an analytic solution between gaussians due to the mixture distribution. This makes the use of plug-in estimators problematic for JS divergence with estimated parameters. Regardless, this does not pose an issue for our method since it incorporates parametric density information only to define a metric and uses nonparametric statistics to estimate the actual divergence. In the experiments below, gaussians for the class-conditional densities are estimated by maximum likelihood and used to define a metric for nearest-neighbor estimation of the JS divergence.
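A sketch of this JS-divergence estimate as the weighted combination in equation 3.5, under the assumption that samples from the mixture $m$ are represented by the pooled class samples and that the class priors are taken from the empirical class proportions; `kl_estimator` stands for any of the KL estimators sketched earlier:

```python
import numpy as np

def js_divergence(X1, X2, kl_estimator):
    """JS divergence between two class-conditional densities (equation 3.5)."""
    n1, n2 = len(X1), len(X2)
    pi1, pi2 = n1 / (n1 + n2), n2 / (n1 + n2)   # empirical class priors
    pooled = np.vstack([X1, X2])                 # stands in for samples from m
    return pi1 * kl_estimator(X1, pooled) + pi2 * kl_estimator(X2, pooled)
```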

3.3.2  Feature Selection in High-Dimensional Noisy Gaussians

We first consider synthetic data from two high-dimensional probability density functions and examine the performance of the various feature selection methods. Two 1000-dimensional gaussian densities are prepared that differ only in a 30-dimensional subspace. Even with a significant difference in this 30-dimensional subspace, identifying these 30 dimensions correctly is difficult with a small number of samples.

The first gaussian has a scaled 1000-dimensional random covariance matrix with maximum eigenvalue 1. The other gaussian has the same covariance matrix, but an additional random positive-semidefinite symmetric matrix is prepared with maximum eigenvalue 1 and with zero elements outside the 30-dimensional subspace. This matrix is added to the covariance of the second gaussian, so that its covariance differs from that of the first gaussian only in the 30-dimensional subspace. In order to make the problem even more difficult, uniform noise between 0 and 10 is added with a small probability to all features.

We use forward selection to choose the best 20 features, and the number of true relevant features among those selected is shown in Figure 6a, along with the classification performance with those selected features in Figure 6b.
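The greedy forward-selection loop used here is standard; a sketch, assuming a `score` callable such as the JS-divergence estimate above restricted to the candidate feature subset:

```python
def forward_selection(X1, X2, score, num_features=20):
    """Greedily add the feature that most increases score(X1[:, S], X2[:, S])."""
    d = X1.shape[1]
    selected = []
    for _ in range(num_features):
        remaining = [j for j in range(d) if j not in selected]
        best = max(remaining,
                   key=lambda j: score(X1[:, selected + [j]], X2[:, selected + [j]]))
        selected.append(best)
    return selected
```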

Figure 6:

Feature selection in gaussian example.

With a small number of samples of 1000-dimensional feature vectors and with noisy observations, it is quite difficult to find the true discriminative 30-dimensional space. Compared with other algorithms, the nearest-neighbor-based methods are able to discover a significant number of correct components. NN performs reasonably in most situations, but in comparison, NNG, using a learned metric, always finds more relevant components, resulting in better performance in terms of classification accuracy.

3.3.3  Gene Selection with Microarray Data

We applied our feature selection method using JS divergence to the problem of selecting genes for breast cancer prognosis. The experiments were performed with two gene expression data sets collected from clinical studies (Curtis et al., 2012; Freije et al., 2004; Loi et al., 2007; Spira et al., 2007; van't Veer et al., 2002), and samples in each data set are classified using selected genes and compared with other feature selection methods for classification. Hyperparameters for all methods were chosen via cross-validation with a separate validation set that is not used for evaluation.

For the NNG method, gaussian parametric models are used for the two class-conditional densities. As described above, the gaussian models are used only for determining an appropriate metric, and this metric is used to obtain the nearest-neighbor information for the estimation of the JS divergence on the class-conditional distributions. The best features are then selected based on these estimates.

We use the forward selection strategy for feature selection and compare the average AUC of classification for the proposed NNG method, the plain NN method, mIMR, which is a parametric method for estimating the JS divergence (Bontempi & Meyer, 2010), and a simple univariate $t$-score. The results are reported in Figure 7, showing that the proposed NNG method compares favorably with the other methods.

Figure 7:

Gene expression classification using selected features for data sets SMK-CAN-183 (Freije et al., 2004), GLI-85 (Spira et al., 2007), GSE2990 (Loi et al., 2007), Van't Veer (van't Veer et al., 2002), and Metabric (Curtis et al., 2012).

4  Conclusion

In this work, we have shown how nonparametric nearest-neighbor estimation techniques can be significantly improved by the proper choice of a metric. Finite sampling causes bias in nearest-neighbor-based estimates, and this bias can be reduced by choosing a metric that minimizes the effect of curvature in the underlying densities. We show how the bias can be computed using parametric density models and how the optimal metric function for nearest-neighbor distances can be learned from the parametric models.

We note that our proposed metric in equation 2.56 reduces the finite sampling bias to leading order and is effective in providing more accurate estimates in our empirical tests. Nevertheless, in some situations, higher-order deviations not included in our analysis may become important. Further work is needed to properly include those effects in the analysis and could result in more sophisticated metrics for nonparametric divergence estimation.

Finally, the dependency of the estimation on the assumptions of the underlying parametric models needs to be investigated more systematically. In this work, we showed that using even approximate gaussian models to globally model the underlying data distributions can yield useful estimates. Better results, however, may be obtained using more sophisticated generative models in combination with nonparametric estimation methods. One possible extension would be incorporating generative mixture models with this method.

In conclusion, we believe that the combination of parametric density modeling in conjunction with nonparametric nearest-neighbor methods will be valuable in analyzing data distributions with smaller numbers of samples. This letter has demonstrated its utility for KL-divergence estimation with applications for change detection, classification, and feature selection; it is anticipated that future extensions of this method will be adapted for a wide range of applications.

Appendix:  Derivation of Metric Matrix in Equation 2.53

The solution of the semidefinite program in equation 2.53 is not unique. Here, we explain how we chose the analytic solution in equation 2.56. We first reformulate the semidefinite program using the eigenvalues of $M$. The optimal solution shares its eigenvectors with $A$ according to equation 2.53. The optimization can then be formulated as
formula
A.1
To simplify, we change variables to obtain the new objective function,
formula
A.2
with a Lagrange multiplier for the equality constraint and multipliers for the inequality constraints on each eigenvalue. Setting the derivative of this objective function with respect to each variable to zero,
formula
A.3
formula
A.4
yielding the condition, for all $i$:
formula
A.5
Here, the KKT condition for the inequality constraints, which holds for each $i$, is
formula
A.6
We find a solution for which the inequality constraints are inactive for all $i$ and therefore do not have to consider the second KKT condition. If this holds for all $i$,
formula
A.7
There are two possibilities:
formula
A.8
formula
A.9
Condition 1 is equivalent to the objective in equation 2.53 being exactly zero, and unless all eigenvalues of $A$ have the same sign, there are many possible solutions. On the other hand, in order for condition 2 to be satisfied, either all eigenvalues of $A$ must be positive or all must be negative. In this case, we have the unique solution,
formula
A.10
formula
A.11
To obtain a metric that is consistent at the boundary between condition 1 and condition 2, we set
formula
A.12
formula
A.13
and find the two scale factors that satisfy the constraint. From this condition, we choose them in proportion to $d_+$ and $d_-$, where $d_+$ and $d_-$ are the numbers of positive and negative eigenvalues of $A$, respectively. This completes the derivation of the metric given by matrix $M$.

Notes

1

We used $N - 1$ instead of $N$ because 1 degree of freedom was used for estimating the expectation over $p$ in equation 1.1.

2

We note that the KL divergence itself is metric invariant. Due to the positive definiteness of $M$, we can always find a full-rank matrix $L$ such that $M = L^\top L$. Then the metric change can be regarded as a linear transformation of variables $z = Lx$, which uses the Euclidean metric in the transformed $z$-space. In this case, because $p_z(z) = p(x)/|L|$ and $q_z(z) = q(x)/|L|$, we have $D(p_z \,\|\, q_z) = D(p \,\|\, q)$.

3

Code for KL-divergence estimation using our learned metric can be found at https://github.com/nohyung/KL-Divergence-Estimation-Metric/tree/master/matlab.

Acknowledgments

Y.K.N. is supported by grants from NRF/MSIT-2017R1E1A1A03070945, M.S. and M.C.dP. from the JST CREST JPMJCR1403, S.L. from KAKENHI grant-in-Aid (RAS 15H06823), and Y.K.N. and F.C.P. from BK21Plus and MITIP-10048320. D.D.L. acknowledges support from the U.S. NSF, NIH, ONR, ARL, AFOSR, DOT, and DARPA.

References

Bontempi, G., & Meyer, P. E. (2010). Causal filter selection in microarray data. In Proceedings of the 27th International Conference on Machine Learning. New York: ACM.
Brown, G. (2009). A new perspective for information theoretic feature selection. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics.
Curtis, C., Shah, S. P., Chin, S.-F., Turashvili, G., Rueda, O. M., Dunning, M. J., … Aparicio, S. (2012). The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature, 486, 346–352.
Das, K., & Nenadic, Z. (2008). Approximate information discriminant analysis: A computationally simple heteroscedastic feature extraction technique. Pattern Recognition, 41(5), 1548–1557.
Freije, W. A., Castro-Vargas, F. E., Fang, Z., Horvath, S., Cloughesy, T., Liau, L. M., … Nelson, S. F. (2004). Gene expression profiling of gliomas strongly predicts survival. Cancer Research, 64(18), 6503–6510.
Gao, S., Steeg, G. V., & Galstyan, A. (2015). Efficient estimation of mutual information for strongly dependent variables. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics.
Garcia-Garcia, D., von Luxburg, U., & Santos-Rodriguez, R. (2011). Risk-based generalizations of f-divergences. In Proceedings of the 28th International Conference on Machine Learning. New York: ACM.
Goria, M. N., Leonenko, N. N., Mergel, V. V., & Inverardi, P. (2005). A new class of random vector entropy estimators and its applications in testing statistical hypotheses. Journal of Nonparametric Statistics, 17(3), 277–297.
Gretton, A., Fukumizu, K., Choon, H. T., Song, L., Schölkopf, B., & Alex, J. S. (2008). A kernel statistical test of independence. In J. C. Platt, D. Köller, Y. Singer, & S. T. Roweis (Eds.), Advances in neural information processing systems, 20. Cambridge, MA: MIT Press.
Kanamori, T., Suzuki, T., & Sugiyama, M. (2012). f-divergence estimation and two-sample homogeneity test under semiparametric density-ratio models. IEEE Transactions on Information Theory, 58(2), 708–720.
Kawahara, Y., & Sugiyama, M. (2012). Sequential change-point detection based on direct density-ratio estimation. Statistical Analysis and Data Mining, 5(2), 114–127.
Kraskov, A., Stögbauer, H., & Grassberger, P. (2004). Estimating mutual information. Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, 69(6 Pt. 2), 066138.
Lacoste-Julien, S., Sha, F., & Jordan, M. (2009). DiscLDA: Discriminative learning for dimensionality reduction and classification. In D. Köller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.), Advances in neural information processing systems, 21. Red Hook, NY: Curran.
Lasserre, J., Bishop, C., & Minka, T. (2006). Principled hybrids of generative and discriminative models. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE.
Leonenko, N., & Pronzato, L. (2010). Correction: A class of Rényi information estimators for multidimensional densities. Annals of Statistics, 38, 3837–3838.
Leonenko, N., Pronzato, L., & Savani, V. (2008). A class of Rényi information estimators for multidimensional densities. Annals of Statistics, 36, 2153–2182.
Liu, S., Yamada, M., Collier, N., & Sugiyama, M. (2013). Change-point detection in time-series data by relative density-ratio estimation. Neural Networks, 43, 72–83.
Loftsgaarden, D. O., & Quesenberry, C. P. (1965). A nonparametric estimate of a multivariate density function. Annals of Mathematical Statistics, 36(3), 1049–1051.
Loi, S., Haibe-Kains, B., Desmedt, C., Lallemand, F., Tutt, A., Gillet, C., … Sotiriou, C. (2007). Definition of clinically distinct molecular subtypes in estrogen receptor–positive breast carcinomas through genomic grade. Journal of Clinical Oncology, 25(10), 1239–1246.
Loo, L.-H., Roberts, S., Hrebien, L., & Kam, M. (2007). New criteria for selecting differentially expressed genes: Filter-based feature selection techniques for better detection of changes in the distributions of expression levels. IEEE Engineering in Medicine and Biology Magazine, 26(2), 17–26.
Moon, K., & Hero, A. (2014). Multivariate f-divergence estimation with confidence. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 27. Red Hook, NY: Curran.
Ng, A., & Jordan, M. (2001). On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14. Cambridge, MA: MIT Press.
Nguyen, X., Wainwright, M. J., & Jordan, M. I. (2007). Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization. In J. C. Platt, D. Köller, Y. Singer, & S. Roweis (Eds.), Advances in neural information processing systems, 20. Cambridge, MA: MIT Press.
Noh, Y.-K., Zhang, B.-T., & Lee, D. D. (2010). Generative local metric learning for nearest neighbor classification. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, & A. Culotta (Eds.), Advances in neural information processing systems, 23. Red Hook, NY: Curran.
Noh, Y.-K., Zhang, B.-T., & Lee, D. D. (2018). Generative local metric learning for nearest neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(1), 106–118.
Peng, H., Long, F., & Ding, C. (2005). Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8), 1226–1238.
Pérez-Cruz, F. (2008). Kullback-Leibler divergence estimation of continuous distributions. In Proceedings of the IEEE International Symposium on Information Theory. Piscataway, NJ: IEEE.
Póczos, B., & Schneider, J. (2011). On the estimation of alpha-divergences. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics.
Póczos, B., & Schneider, J. (2012). Nonparametric estimation of conditional information and divergences. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics.
Raina, R., Shen, Y., Ng, A., & McCallum, A. (2004). Classification with hybrid generative/discriminative models. In S. Thrun, L. K. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems, 16. Cambridge, MA: MIT Press.
Ranneby, B. (1984). The maximum spacing method–An estimation method related to the maximum likelihood method. Scandinavian Journal of Statistics, 1(2), 93–112.
Reshef, D. N., Reshef, Y. A., Finucane, H. K., Grossman, S. R., McVean, G., Turnbaugh, P. J., … Sabeti, P. C. (2011). Detecting novel associations in large data sets. Science, 334(6062), 1518–1524.
Singh, S., & Poczos, B. (2016). Finite-sample analysis of fixed-k nearest neighbor density functional estimators. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in neural information processing systems, 29.
Song, L., Smola, A., Gretton, A., Bedo, J., & Borgwardt, K. (2012). Feature selection via dependence maximization. Journal of Machine Learning Research, 13(1), 1393–1434.
Spira, A., Beane, J., Shah, V., Steiling, K., Liu, G., Schembri, F., … Brody, J. (2007). Airway epithelial gene expression in the diagnostic evaluation of smokers with suspect lung cancer. Nature Medicine, 13(3), 361–366.
Sugiyama, M., Suzuki, T., Nakajima, S., Kashima, H., von Bünau, P., & Kawanabe, M. (2008). Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60(4), 699–746.
van't Veer, L. J., Dai, H., van de Vijver, M. J., He, Y. D., Hart, A. A. M., Mao, M., … Friend, S. H. (2002). Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415, 530–536.
Wang, Q., Kulkarni, S. R., & Verdu, S. (2006). A nearest-neighbor approach to estimating divergence between continuous random vectors. In IEEE International Symposium on Information Theory (pp. 242–246).
Wang, Q., Kulkarni, S. R., & Verdu, S. (2009a). Divergence estimation for multidimensional densities via k-nearest-neighbor distances. IEEE Transactions on Information Theory, 55(5), 2392–2405.
Wang, Q., Kulkarni, S. R., & Verdu, S. (2009b). Universal estimation of information measures for analog sources. Foundations and Trends in Communications and Information Theory, 5(3), 265–353.
Yu, L., & Liu, H. (2004). Efficient feature selection via analysis of relevance and redundancy. Journal of Machine Learning Research, 5, 1205–1224.