We compare an entropy estimator $\hat{H}_z$ recently discussed by Zhang (2012) with two estimators, $\hat{H}_1$ and $\hat{H}_2$, introduced by Grassberger (2003) and Schürmann (2004). We prove the identity $\hat{H}_z \equiv \hat{H}_1$, which has not been taken into account by Zhang (2012). Then we prove that the systematic error (bias) of $\hat{H}_z$ is less than or equal to the bias of the ordinary likelihood (or plug-in) estimator of entropy. Finally, by numerical simulation, we verify that for the most interesting regime of small sample estimation and large event spaces, the estimator $\hat{H}_2$ has a significantly smaller statistical error than $\hat{H}_z$.
Symbolic sequences are typically characterized by an alphabet $A$ of $d$ different letters. We assume statistical stationarity: any letter block (word or $n$-gram of constant length) $w_i$, $i = 1, \ldots, M$, can be expected at any chosen site to occur with a well-defined probability $p_i = \mathrm{prob}(w_i)$, with $\sum_{i=1}^{M} p_i = 1$.
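For reference, writing $n_i$ for the number of occurrences of $w_i$ in a sample of $N$ words, the entropy to be estimated and its plug-in (likelihood) estimate are, in our own shorthand $H$ and $\hat{H}_{\mathrm{ML}}$,
$$
H \;=\; -\sum_{i=1}^{M} p_i \ln p_i ,
\qquad
\hat{H}_{\mathrm{ML}} \;=\; -\sum_{i=1}^{M} \frac{n_i}{N} \ln \frac{n_i}{N} .
$$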
On the other hand, there is the Bayesian approach to entropy estimation, building on the method introduced by Nemenman, Shafee, and Bialek (2002) or its generalization recently proposed by Archer, Park, and Pillow (2014). There, the basic strategy is to place a prior over the space of probability distributions and then perform inference using the induced posterior distribution over entropy. A partial numerical comparison of these popular Bayesian entropy estimates with those discussed here can be found in the work of Archer et al. (2014). Unfortunately, those simulations consider only the bias of the entropy estimates, not their mean square error, which takes into account the important trade-off between bias and variance. However, for what we intend to demonstrate below, no explicit prior information on the distributions is assumed, and we restrict ourselves to non-Bayesian entropy estimates.
It should be mentioned that the numerical computation of the estimator in the form $\hat{H}_1$ is significantly faster than the direct evaluation of $\hat{H}_z$. This improvement has not been taken into account by Archer et al. (2014; see their Figure 11), where the authors still used expression 1.5 above.
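As a concrete illustration of this point, here is a minimal sketch (our own code, not taken from the note) that evaluates the estimator in the digamma form $\sum_i (n_i/N)\,(\psi(N) - \psi(n_i))$, which we assume for $\hat{H}_1$; by the identity $\hat{H}_z \equiv \hat{H}_1$, the same value is obtained in a single pass over the nonzero counts.

```python
import numpy as np
from scipy.special import digamma

def entropy_H1(counts):
    """Digamma-form entropy estimate: sum_i (n_i/N) * (psi(N) - psi(n_i)).

    Assumed form of H_1; by the identity H_z = H_1 it yields the same
    value as Zhang's estimator while being much cheaper to evaluate.
    """
    n = np.asarray(counts, dtype=float)
    n = n[n > 0]                      # empty bins do not contribute
    N = n.sum()
    return float(np.sum(n / N * (digamma(N) - digamma(n))))

# word counts n_i observed in a sample of size N = 10
print(entropy_H1([5, 3, 2]))
```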
2 Comparison of $\hat{H}_z$ and $\hat{H}_1$
The estimator $\hat{H}_z$ is less biased than (or at most equally biased as) the ordinary likelihood estimator, for all samples of size $N$ and all underlying distributions.
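In symbols, using the plug-in estimator $\hat{H}_{\mathrm{ML}}$ from the display above (our notation), the statement reads
$$
\big|\,\mathbb{E}[\hat{H}_z] - H\,\big| \;\le\; \big|\,\mathbb{E}[\hat{H}_{\mathrm{ML}}] - H\,\big| ,
$$
where the expectation is taken over samples of size $N$.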
3 Numerical Comparison of $\hat{H}_z$ and $\hat{H}_2$
As we can see, the RMSE of all estimators is monotonically decreasing in $N$. The convergence of the naive estimator is rather slow compared to the other estimators, while the performance of $\hat{H}_2$ is slightly better than that of $\hat{H}_z$. On the other hand, the statistical error of $\hat{H}_2$ is significantly smaller than the statistical error of $\hat{H}_z$ and the naive estimator, and this behavior seems to be representative for large $M$. The statistical error for increasing $M$ and fixed sample size $N$ is shown in Figures 3 and 4. For large $M$, the RMSE of the naive estimator and $\hat{H}_z$ is greater than that of $\hat{H}_2$. This phenomenon reflects the fact that the bias reduction becomes more and more relevant for increasing $M$ compared to the contribution of the variance.
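The following is a minimal sketch of the kind of Monte Carlo comparison reported in this section, with our own parameter values; the explicit correction term used for $\hat{H}_2$ is an assumption (a Grassberger 2003-type alternating term), not quoted from the note, so only the qualitative behavior is intended.

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)

def h_naive(n):
    p = n[n > 0] / n.sum()
    return -np.sum(p * np.log(p))

def h_1(n):
    # digamma form assumed for H_1 (= H_z by the identity above)
    n = n[n > 0].astype(float)
    N = n.sum()
    return np.sum(n / N * (digamma(N) - digamma(n)))

def h_2(n):
    # H_1 plus an alternating Grassberger (2003)-type correction;
    # this explicit form is our assumption, not taken from the note
    n = n[n > 0].astype(float)
    N = n.sum()
    corr = (-1.0) ** n * 0.5 * (digamma((n + 1) / 2) - digamma(n / 2))
    return np.sum(n / N * (digamma(N) - digamma(n) - corr))

def rmse(est, p, N, trials=300):
    """Root-mean-square error of an estimator over repeated samples."""
    H_true = -np.sum(p[p > 0] * np.log(p[p > 0]))
    errs = [est(np.bincount(rng.choice(len(p), size=N, p=p), minlength=len(p))) - H_true
            for _ in range(trials)]
    return np.sqrt(np.mean(np.square(errs)))

M, N = 1000, 100                     # large event space, small sample (example values)
p = np.full(M, 1.0 / M)              # uniform distribution; a Zipf law works analogously
for name, est in [("naive", h_naive), ("H_z = H_1", h_1), ("H_2", h_2)]:
    print(f"{name:10s} RMSE = {rmse(est, p, N):.3f}")
```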
As we can see from both examples, the gap between $\hat{H}_z$ and $\hat{H}_2$ is slightly smaller for the peaked Zipf distribution than for the uniform distribution. Thus, we ask about the performance in the extreme case of the delta distribution $p_i = \delta_{i1}$, which has entropy zero. Indeed, in this special case we have $\hat{H}_z = 0$ for any sample size $N$, but $\hat{H}_2 \neq 0$ for finite $N$. Actually, in this case the statistical error of the latter scales like $N^{-1}$ for large $N$.
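To see where the delta-case behavior comes from under the digamma form assumed in the sketches above: all $N$ counts fall on a single word, so
$$
\hat{H}_1 \;=\; \frac{N}{N}\,\big(\psi(N) - \psi(N)\big) \;=\; 0 \qquad \text{for every } N,
$$
whereas the assumed correction term of $\hat{H}_2$ leaves a residual of magnitude $\tfrac{1}{2}\big(\psi(\tfrac{N+1}{2}) - \psi(\tfrac{N}{2})\big) \approx \tfrac{1}{2N}$, consistent with an error decaying like $N^{-1}$.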
In this note, we classified the entropy estimator of Zhang (2012) within the family of entropy estimators originally introduced by Schürmann (2004). This reveals an interesting connection between two different approaches to entropy estimation: one coming from the generalization of the diversity index of Simpson, the other from the estimation of the limit $q \to 1$ in the family of Rényi entropies. This connection is explicitly established by the identity $\hat{H}_z \equiv \hat{H}_1$. In addition, we proved that the statistical bias of $\hat{H}_z$ is smaller than or equal to the bias of the likelihood estimator. Furthermore, by numerical computation for various probability distributions, we found that $\hat{H}_z$ (or the heuristic estimator $\hat{H}_1$) can be improved upon by the estimator $\hat{H}_2$, which is an excellent member of the estimator family of Grassberger (2003) and Schürmann (2004). There is a uniform variance upper bound of $\hat{H}_z$ (and therefore of $\hat{H}_1$) that decays at a rate of $(\ln N)^2/N$ for all distributions with finite entropy (Zhang, 2012). It would be interesting to know whether this variance bound also holds for the estimator $\hat{H}_2$ or, more generally, for the whole estimator family. The answer might be found in a forthcoming publication.