## Abstract

In this letter, we introduce an estimator of Küllback-Leibler divergence based on two independent samples. We show that on any finite alphabet, this estimator has an exponentially decaying bias and that it is consistent and asymptotically normal. To explain the importance of this estimator, we provide a thorough analysis of the more standard plug-in estimator. We show that it is consistent and asymptotically normal, but with an infinite bias. Moreover, if we modify the plug-in estimator to remove the rare events that cause the bias to become infinite, the bias still decays at a rate no faster than . Further, we extend our results to estimating the symmetrized Küllback-Leibler divergence. We conclude by providing simulation results, which show that the asymptotic properties of these estimators hold even for relatively small sample sizes.

## 1  Introduction

Let and be two discrete probability distributions on the same finite alphabet, , where is a finite integer. The Küllback-Leibler divergence, also known as relative entropy, is defined to be
1.1
Here and throughout, we adopt the standard conventions that if , then , and if , then . The Küllback-Leibler divergence is the distance between two probability distributions, introduced by Küllback and Leibler (1951), and is an important measure of information in information theory. (For details, see Cover & Thomas, 2006.) This distance is not a metric since it does not satisfy the triangle inequality and is not symmetric. A symmetric modification of Küllback-Leibler divergence is considered in section 4.

In this letter, we consider nonparametric estimation of when P and Q are unknown. Mori, Nishikimi, and Smith (2005) considered the case when the distribution Q is known. Another varient was studied in Marcon, Hérault, Baraloto, and Lang (2012). Perhaps the most commonly studied variant of this problem is when P and Q are continuous distributions (see Wang, Kulkarni, & Verdú, 2009; Nguyen, Wainwright, & Jordan, 2010). It should be noted that when estimating the Küllback-Leibler divergence in the continuous case, one often discretizes the data, and hence our framework is applicable to such situations.

Assume that P and Q are unknown. Let be an independent and identically distributed (i.i.d.) sample of size n from according to P, and let be an i.i.d. sample of size m from according to Q. Assume further that the two samples are independent of each other. Let and be the sequences of observed counts, and let and be the sample proportions. We make the following assumptions:

1. and for .

2. there exists a such that .

Note that we are not assuming that K is known, only that it is some finite integer greater than or equal to two.

Perhaps the most intuitive estimator of is the so-called plug-in estimator given by
1.2
In section 2.1 we show that this estimator is consistent and asymptotically normal, but with an infinite bias. Then in section 2.2, we introduce a modification of this estimator and show that it is still consistent and asymptotically normal, but now with a finite bias. We further show that this bias decays no faster than . In section 3, we introduce an estimator of given by
1.3
and show that it is consistent and asymptotically normal, and that its bias decays exponentially fast. In section 4, we derive the corresponding results for the symmetrized Küllback-Leibler divergence. Finally, in section 5, we perform a small-scale simulation study to see how well our asymptotic results hold for finite sample sizes.

Note that the estimators and appear to depend on K. However, for any letter with , we have , and thus we do not need to include it in the sum. For this reason, the sum only needs to be taken over letters that have been observed; hence, knowledge of K is not required.

Before proceeding, let
1.4
and note that . The first term, A, is the negative of the entropy of the distribution P. Entropy is a measure of randomness, first introduced in Shannon (1948). The problem of estimating the entropy (and thus ) has a long history, going back to at least Miller (1955). For an overview of the literature on entropy estimation see Paninski (2003) and the references therein. More recent results can be found in Zhang (2012, 2013), Zhang and Zhang (2012), and Zhang and Grabchak (2013).
We conclude this section by introducing some notation. We use to indicate a defining equality, to indicate the integer part of a nonegative real number; and to denote convergence in law and convergence in probability, respectively; to denote a binomial distribution with parameters m and p; to denote a normal distribution with mean and variance ; and to denote a multivariate normal distribution with mean vector and covariance matrix . For any two functions f and g taking values in with , we write to mean
1.5
is defined as in equation 1.5, but allowing for equality at (but strict inequality at 0), and is defined as in equation 1.5, but allowing for equality at 0 (but strict inequality at ).

## 2  Preliminaries

In this section, we present some results about the plug-in estimator and its modifications. These results appear to be missing from the literature. While we will need them to prove the asymptotic normality of the estimator given in equation 1.3, they are of independent interest.

### 2.1  Properties of the Plug-in Estimator

In this section we show that , the plug-in estimator given in equation 1.2, is a consistent and asymptotically normal estimator of D, but with an infinite bias. Define the -dimensional vectors,
and note that as . Moreover, by the multivariate normal approximation to the multinomial distribution,
2.1
where is the covariance matrix given by
2.2
Here, and are matrices given by
and
Let
and
For each j, , we have
and
The delta method gives the following result:
Proposition 1.
Provided that ,
2.3
Remark 1.

It is well-known that is a positive-definite matrix (see Tanabe & Sagae, 1992). For this reason, the condition is equivalent to the condition that . It is straightforward to see that this holds if and only if .

By the continuous mapping theorem, is a consistent estimator of and is a consistent estimator of . From this, Slutsky’s theorem, and remark 1, we get the following:

Corollary 1.
Provided that ,
2.4

Although is a consistent and asymptotically normal estimator of D, it has an infinite bias. To see this, we note that for every k, , and hence .

It is clear that the infinite bias is an artifact of events with vanishingly small probability. Specifically, the problem stems from the fact that may be zero. To better understand the behavior of the plug-in, we will see what happens when we ignore such events. To do this, we modify the estimator of qk to make sure that it is never zero. While there are many ways to do this, we introduce the following augmentation:
2.5

### 2.2  Properties of the Augmented Plug-in Estimator

Define an augmented plug-in estimator of D by
2.6

As with the estimators and , this appears to depend on K, but, in fact, knowledge of K is not required.

In this section, we show that is a consistent and asymptotically normal estimator of D with a bias that decays no faster than . Our goal is not to suggest that this is a useful estimator, but to help us understand the behavior of the plug-in. Moreover, these results will be instrumental in deriving the corresponding properties of the estimator proposed in equation 1.3.

Proposition 2.
Provided that ,
2.7
Proof.
In light of proposition 1 and remark 1, it suffices to show that . The fact that for any ,

gives the result.

By arguments similar to the proof of corollary 1, we have the following:

Corollary 2.
Provided that ,
2.8
where .
Example 1.
In applications, a commonly used special case is when . In this case, P and Q are both Bernoulli distributions with probabilities of success being p and q, respectively. Note that in general, . In this case, ,
and reduces to
2.9
By propositions 1 and 2, it follows that
We now discuss the bias of . Miller (1951) (see also Paninski, 2003, for a more formal treatment) shows that
2.10
We will show that for large enough n,
2.11
Once this is established, we immediately get the following:
Proposition 3.
For large enough n,
2.12

The proof of equation 2.11, and hence of proposition 3, will be based on the following result for general plug-in estimators of B:

Lemma 1.
Let be any estimator of qk such that is independent of and a.s. for all . If and then
Proof.
By the Taylor expansion of the logarithm,
where
Since is an unbiased estimator of pk and it is independent of ,
Since , Jensen’s inequality implies that
and thus .
Proof of Proposition 3.
It suffices to show equation 2.11. Note that and , where . By lemma 1,
Note that
and for large enough m, by the Taylor expansion of the logarithm,
where the remainder term by properties of alternating series. This means that for large enough m,
We have
and for large enough m,
where the final inequality follows by the fact that . Thus, for large enough m,
which completes the proof.

## 3  A New Estimator

In this section, we derive the estimator of D given in equation 1.3. We show that it is consistent and asymptotically normal and that its bias decays exponentially fast. Recall that . We will estimate A and B separately.

Since A is the negative of the entropy of P, we can estimate it by the negative of an estimator of entropy. Specifically, we take the negative of the estimator presented in Zhang (2012), which has an exponentially decaying bias. This leads to the estimator
3.1
where
3.2
Let . Zhang (2012) showed that
3.3
where . Thus, the bias decays exponentially fast.
Remark 2.
The product term in equation 3.2 assumes a value of 0 when . This is because, if , then when , we have . Combining this observation with some basic algebra gives
3.4
We now develop an estimator of B. By the Taylor expansion of the logarithm,
3.5
First, for , we derive an unbiased estimator of . For each fixed k, , and each fixed v, , consider a size-v subsample of the size-m sample from Q. Let and be the counts and the relative frequencies of the subsample. Note that ; thus, is an unbiased estimator of . Since there are different subsamples of size from the sample of size m, the U-statistic
3.6
where the sum is taken over all possible subsamples of size , is also an unbiased estimator of . Following the argument of Zhang and Zhou (2010), we note that is simply counting the number of subsamples of size v in which the letter is missing. This count can be summarized as follows:
1. If , the count is
2. If the count is zero.

Since, in the second case, , in both cases we have
3.8
Note that is an unbiased estimator of pk and it is independent of , which, when , is an unbiased estimator of . Thus, for such v, an unbiased estimator of is . This leads to an estimator of B given by
3.9
By construction , and by equation 3.5, the bias satisfies
3.10
where . Thus, the bias of decays exponentially in m, and hence in n.
Combining the fact that the product term in equation 3.8 assumes a value of 0 for any and some basic algebra shows that we can write
3.11
Remark 3.

When constructing our estimator we needed to count the number of subsamples that do not have an observation of the letter . When looking at the estimator in the form given in equation 3.11, this is reflected in the fact that the product term equals 1 when . This seems to imply that we need to know the value of K a priori. However, in the estimator, the entire product term is multiplied by . Thus, we do not need to find the value of the product if the letter has not appeared in the sample . This means that we do not actually need to know the value of K.

The proposed estimator of is given by
3.12
Combining equations 3.4 and 3.11, it can be shown that this estimator is as in equation 1.3. The bias of is given by
3.13
Since each part of the bias decays exponentially fast, the bias decays at least exponentially as well. Note that when the distributions and the sample sizes are similar, the bias tends to be smaller. We will now show that the estimator is asymptotically normal.
Theorem 1.
Provided that ,

Before giving the proof, we state the following corollary. Its proof is similar to that of corollary 1.

Corollary 3.
Provided that ,
where is as in corollary 2.

To prove theorem 1, we need the following result:

Lemma 2.
If , then for any integer ,
Proof.

Noting that is convex on the interval and that , the result follows immediately.

Proof of Theorem 1.
Since , by proposition 2 and Slutsky’s theorem, it suffices to show that . Note that
We will show that and . For a better presentation, the proof is divided into two parts: part 1 is the proof of and part 2 is the proof of . We note that a proof of is given in Zhang (2012). However, we include a new proof, which is simpler and significantly different.
Part 1: We first consider . By the Taylor expansion of the logarithm we have
and hence
Now note that for we have
3.14
Let and note that . For any integer , we have
Thus,
Clearly, it suffices to show that for . We have by equation 3.3 and by equation 2.10. Now observe that and
where the third line follows by lemma 2 and the fact that . Thus,
Part 2: We now consider . Note that
Since for every and , by Slutsky’s theorem, it suffices to show that for each . By the Taylor expansion of the logarithm,
First, we show that . When , we have , , and thus
Hence, , and therefore . Now note that and
since when . To complete the proof, we will show that . We have
because is an unbiased estimator of by construction. By the Taylor expansion of the logarithm and by an argument similar to the proof of proposition 3. Finally because
Hence , but since is already established, we have , which implies . From here, the result of the theorem follows.

## 4  Symmetrized Küllback-Leibler Divergence

One of the limitations of Küllback-Leibler divergence is that it is not symmetric. This can make its use in certain applications appear artificial. For this reason, it is useful to consider a symmetric variant of Küllback-Leibler divergence. This was recognized already in Küllback and Leibler (1951). (See Veldhuis, 2002, for more recent applications.) The symmetrized Küllback-Leibler divergence is defined to be
4.1
We can define the plug-in estimator of S(P,Q) by
4.2
and the augmented plug-in estimator of S(P,Q) by
4.3
Here is as in equation 2.5 and for . Let be a function such that
and let . For each j, , we have

By arguments similar to the proofs of propositions 1 and 2 and corollaries 1 and 2, we get proposition 4 and corollary 4:

Proposition 4.
Provided that ,
and
Corollary 4.
Provided that ,
and
where .
Example 2.
Continuing from example 1 where P and Q both have Bernoulli distributions, with probabilities of success being p and q, we have
where
Remark 4.

Since has an infinite bias as an estimator of and has an infinite bias as an estimator of , it follows that has an infinite bias as an estimator of S(P,Q). Similarly, from proposition 3, it follows that has a bias that decays no faster than .

By modifying the estimator given in equation 1.3, we can get an estimator of S(P,Q), whose bias decays exponentially fast. Let
4.4
By equation 3.13, the bias of this estimator is
which decays at least exponentially fast as since each part does.
Theorem 2.
Provided that ,
Proof.

By proposition 4 and Slutsky’s theorem, it suffices to show that . This follows from the facts that and , which follow from the proof of theorem 1.

By arguments similar to the proof of corollary 1, we have the following:

Corollary 5.
Provided that ,
where is as in corollary 4.

## 5  Simulation Results

In this section we present a small-scale simulation study in order to verify the results obtained in the letter. For these simulations, we consider the distributions where
and where
Table 1 gives the values of the Küllback-Leibler divergence and symmetrized Küllback-Leibler divergence for these distributions. Note that, as expected, .
Table 1:
Küllback-Leibler Divergence.
DivergenceTrue Value

S(P,Q
DivergenceTrue Value

S(P,Q

The simulations were performed as follows. For a fixed sample size n, we simulated n observations from distribution P and n observations from distribution Q. We then estimated using both the augmented plug-in estimator and the proposed estimator . We did not consider the plug-in estimator since the only cases where it differs from are those where it takes the value of infinity.

To estimate the bias, we found the difference between the true value and the estimated value for both estimators. We then repeated this times for the same sample size and averaged the differences together. This was repeated for all sample sizes ranging from to in increments of . The results are given in Figure 1, panel 1a. In the plot, the dashed line represents the absolute value of the bias of and the solid line represents the absolute value of the bias of . We see that has a significantly smaller bias than . In fact, has a very small bias even for small sample sizes.

Figure 1:

Bias of the estimators and effectiveness of the confidence intervals. The solid line corresponds to the estimator (or in panels 3a and 3b) and the dashed line coresponds to the estimator (or in panels 3a and 3b).

Figure 1:

Bias of the estimators and effectiveness of the confidence intervals. The solid line corresponds to the estimator (or in panels 3a and 3b) and the dashed line coresponds to the estimator (or in panels 3a and 3b).

In a similar way, we estimated . Plots of the absolute value of the bias in this case are given in Figure 1, panel 2a. In this case, is competitive for smaller sample sizes, but has a significantly larger bias than for larger sample sizes. We also estimated the symmetrized Küllback-Leibler divergence S(P,Q) by and . The plots of the biases in this case are given in Figure 1, panel 3a. As we might expect, the biases are in between those of the other two cases. In all three cases, we see that the biases of and are small and, especially for larger sample sizes, significantly smaller than the biases of and .

One of the main applications of the asymptotic normality results given in this letter is the ability to construct confidence intervals. Specifically, in light of corollary 3, an asymptotic confidence interval for is given by
5.1
where
A similar asymptotic confidence interval can be constructed based on by using corollary 2. Likewise, we can construct asymptotic confidence intervals for S(P,Q) based on using corollary 5 or based on using corollary 4. We will now use simulation to see how well these asymptotic confidence intervals work for finite sample sizes.

The simulations are performed as before. First, for a fixed sample size n, we get our two random samples and estimate by and . Then we use equation 5.1 to construct asymptotic confidence intervals based on each of these estimators. We repeat this times and see what proportion of the time the true is in the interval. Plots of the proportions for different sample sizes are given in Figure 1, part 1b. The solid line represents the proportion of the time that the true value is in the confidence interval based on and the dashed line represents the proportion of the time that the true value is in the confidence interval based on . Ideally both lines should be close to .

We see that the confidence interval based on performs very poorly except for small sample sizes. This appears to be caused by the sizable bias of this estimator. Once the sample size is relatively large, the radius of the confidence interval becomes very small, and because it is smaller than the bias, we cannot capture the true value. On the other hand, the confidence interval based on works very well and gets close to fairly quickly.

In a similar way, we get confidence intervals for based on and . Plots of the proportion of the time that the confidence interval captures the true value are given in Figure 1, panel 2b. Similar results for S(P,Q) are presented in Figure 1, panel 3b. In both cases, we see that the confidence intervals based on and do very well, while those based on and are not useful for large sample sizes.

## Acknowledgments

We thank Lijuan Cao for help in performing the simulations and the anonymous referee whose comments helped to improve the presentation of this letter.

## References

Cover
,
T. M.
, &
Thomas
,
J. A.
(
2006
).
Elements of information theory
(2nd ed.).
Hoboken, NJ
:
Wiley
.
Küllback
,
S.
, &
Leibler
,
R. A.
(
1951
).
On information and sufficiency
.
Annals of Mathematical Statistics
,
22
(
1
),
79
86
.
Marcon
,
E.
,
Hérault
,
B.
,
Baraloto
,
C.
, &
Lang
,
G.
(
2012
).
The decomposition of Shannon’s entropy and a confidence interval for beta diversity
.
Oikos
,
121
(
4
),
516
522
.
Miller
,
G.
(
1955
).
Note on the bias of information estimates
. In
H.
Questler
(Ed.),
Information theory in psychology: Problems and methods
(pp.
95
100
).
Glencoe, IL
:
Free Press
.
Mori
,
T.
,
Nishikimi
,
K.
, &
Smith
,
T. E.
(
2005
).
A divergence statistic for industrial localization
.
Review of Economics and Statistics
,
87
(
4
),
635
651
.
Nguyen
,
X.
,
Wainwright
,
M. J.
, &
Jordan
,
M. I.
(
2010
).
Estimating divergence functionals and the likelihood ratio by convex risk minimization
.
IEEE Transactions on Information Theory
,
56
(
10
),
5847
5861
.
Paninski
,
L.
(
2003
).
Estimation of entropy and mutual information
.
Neural Computation
,
15
(
6
),
1191
1253
.
Shannon
,
C. E.
(
1948
).
A mathematical theory of communication
.
Bell System Technical Journal, 27
,
379–423
,
623
656
.
Tanabe
,
K.
, &
Sagae
,
M.
(
1992
).
An exact Cholesky decomposition and the generalized inverse of the variance-covariance matrix of the multinomial distribution, with applications
.
Journal of the Royal Statistical Society. Series B (Methodological)
,
54
(
1
),
211
219
.
Veldhuis
,
R.
(
2002
).
The centroid of the symmetrical Küllback-Leibler distance
.
IEEE Signal Processing Letters
,
9
(
3
),
96
99
.
Wang
,
Q.
,
Kulkarni
,
S. R.
, &
Verdú
,
S.
(
2009
).
Divergence estimation for multidimensional densities via k-nearest-neighbor distances
.
IEEE Transactions on Information Theory
,
55
(
5
),
2392
2405
.
Zhang
,
Z.
(
2012
).
Entropy estimation in Turing’s perspective
.
Neural Computation
,
24
(
5
),
1368
1389
.
Zhang
,
Z.
(
2013
).
Asymptotic normality of an entropy estimator with exponentially decaying bias
.
IEEE Transactions on Information Theory
,
59
(
1
),
504
508
.
Zhang
,
Z.
, &
Grabchak
,
M.
(
2013
).
Bias adjustment for a nonparametric entropy estimator
.
Entropy
,
15
(
6
),
1999
2011
.
Zhang
,
Z.
, &
Zhang
,
X.
(
2012
).
A normal law for the plug-in estimator of entropy
.
IEEE Transactions on Information Theory
,
58
(
5
),
2745
2747
.
Zhang
,
Z.
, &
Zhou
,
J.
(
2010
).
Re-parameterization of multinomial distribution and diversity indices
.
Journal of Statistical Planning and Inference
,
140
(
7
),
1731
1738
.