## Abstract

We compare an entropy estimator $H_z$ recently discussed by Zhang (2012) with two estimators, $H_1$ and $H_2$, introduced by Grassberger (2003) and Schürmann (2004). We prove the identity $H_z \equiv H_1$, which has not been taken into account by Zhang (2012). Then we prove that the systematic error (bias) of $H_1$ is less than or equal to the bias of the ordinary likelihood (or plug-in) estimator of entropy. Finally, by numerical simulation, we verify that for the most interesting regime of small sample estimation and large event spaces, the estimator $H_2$ has a significantly smaller statistical error than $H_z$.

## 1 Introduction

Symbolic sequences are typically characterized by an alphabet *A* of *d* different letters. We assume statistical stationarity: any letter-block (word or *n*-gram of constant length *n*) $w_i$, $i = 1, \ldots, M$, can be expected at any chosen site to occur with a known probability $p_i = \mathrm{prob}(w_i)$, with $\sum_{i=1}^{M} p_i = 1$. For growing word length *n*, the number *M* of possible words explodes exponentially with *n*.

Suppose we are given an ensemble of *N* independent observations, and let $k_i$, $\sum_{i=1}^{M} k_i = N$, be the frequency of realization $w_i$ in the ensemble. However, with the choice $\hat p_i = k_i / N$, the naive (or likelihood) estimate

$$\hat H = -\sum_{i=1}^{M} \hat p_i \ln \hat p_i$$

leads to a systematic underestimation of the Shannon entropy (Miller, 1955; Harris, 1975; Herzel, 1988; Schürmann & Grassberger, 1996; Grassberger, 2003; Schürmann, 2004). In particular, if the number of words *M* is on the order of the number of data points *N*, then fluctuations increase and estimates usually become significantly biased. By *bias*, we denote the deviation of the expectation value of an estimator from the true value. In general, the problem in estimating functions of probability distributions is to construct an estimator whose estimates both fluctuate with the smallest possible variance and are least biased.
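The systematic underestimation by the likelihood estimate is easy to observe in simulation. The sketch below (helper names are my own; this is only the naive plug-in estimate, not one of the improved estimators discussed later) samples a uniform distribution over $M = 100$ events with $N = 100$ observations per trial:

```python
import math
import random

def plugin_entropy(counts):
    """Naive (likelihood) entropy estimate: -sum p_i ln p_i with p_i = k_i / N."""
    n = sum(counts)
    return -sum(k / n * math.log(k / n) for k in counts if k > 0)

# Estimate the entropy of a uniform distribution over M = 100 events from
# samples of size N = 100; the average estimate falls below the true value.
random.seed(0)
M, N, trials = 100, 100, 2000
true_H = math.log(M)
estimates = []
for _ in range(trials):
    counts = [0] * M
    for _ in range(N):
        counts[random.randrange(M)] += 1
    estimates.append(plugin_entropy(counts))
mean_est = sum(estimates) / trials
print(f"true H = {true_H:.3f}, mean plug-in estimate = {mean_est:.3f}")
```

In this regime, with *M* on the order of *N*, the mean estimate lies well below $\ln M$, illustrating the negative bias.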

On the other hand, there is the Bayesian approach to entropy estimation, building on an approach introduced by Nemenman, Shafee, and Bialek (2002), or a generalization recently proposed by Archer, Park, and Pillow (2014). There, the basic strategy is to place a prior over the space of probability distributions and then perform inference using the induced posterior distribution over entropy. Actually, a partial numerical comparison of the popular Bayesian entropy estimates and those discussed here can be found in the work of Archer et al. (2014). Unfortunately, these simulations consider only the bias of the entropy estimates, not their mean square error, which takes into account the important trade-off between bias and variance. However, in the considerations to be discussed below, no explicit prior information on distributions is assumed, and we restrict ourselves to non-Bayesian entropy estimates only.
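The trade-off between bias and variance can be made concrete by decomposing the mean square error of an estimator into the exact identity MSE = bias² + variance. A small simulation with the naive plug-in estimate (a generic sketch; `plugin_entropy` is my own helper, not an estimator from the paper):

```python
import math
import random
import statistics

random.seed(1)

def plugin_entropy(counts):
    """Naive (likelihood) entropy estimate from a list of counts."""
    n = sum(counts)
    return -sum(k / n * math.log(k / n) for k in counts if k > 0)

# Sample a uniform distribution over M events, N draws per trial, and
# decompose the mean square error of the plug-in estimate.
M, N, trials = 50, 50, 3000
true_H = math.log(M)
estimates = []
for _ in range(trials):
    counts = [0] * M
    for _ in range(N):
        counts[random.randrange(M)] += 1
    estimates.append(plugin_entropy(counts))

bias = statistics.mean(estimates) - true_H
var = statistics.pvariance(estimates)
mse = statistics.mean((e - true_H) ** 2 for e in estimates)
print(f"bias = {bias:.3f}, variance = {var:.4f}, mse = {mse:.4f}")
```

Here the squared bias dominates the variance contribution, which is why comparing estimators by bias alone, as in the simulations of Archer et al. (2014), can be misleading.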

Zhang (2012) has mentioned that there exists an interesting estimator of each term in equation 1.3,^{1} unbiased up to a remainder term, with coefficients explicitly given by equation 1.5, such that $H_z$ is a statistically consistent entropy estimator of *H* with (negative) bias. Indeed, the estimator $H_z$ is notable because Zhang (2012) has proven a uniform variance upper bound, valid for all distributions with finite entropy, that decays faster than the corresponding bound for the ordinary likelihood estimator established by Antos and Kontoyiannis (2001). It should be mentioned here that the latter decay rate is an implication of the Efron-Stein inequality, whereas the former (faster) decay rate is derived within the completely different approach introduced by Zhang (2012). Actually, it seems hard to prove the same decay rate for the likelihood estimator.

It should be mentioned that the numerical computation of the estimator in the form of equation 1.7 is significantly faster than in the form of equation 1.5. Actually, this improvement has not been taken into account by Archer et al. (2014; see Figure 11), where the authors still used expression 1.5 above.

In the following, we consider large event spaces *M* and probability distributions that are not very strongly peaked; these are the cases we are mainly interested in. The numerical comparison of the mean square error of $H_z$ and $H_2$ will be evaluated for the uniform probability distribution, the Zipf distribution, and the zero-entropy delta distribution.

## 2 Comparison of $H_z$ and $H_1$

Consider the *i*th term of equation 1.4. By extending with *N* in the product, this expression can be rewritten accordingly. Next, the product is reformulated as a quotient of factorials and, in terms of binomial coefficients, we obtain the corresponding representation. Now, the *i*th term of the estimator, equation 1.5, is obtained by summation over $v$, where $h_k = \sum_{j=1}^{k} 1/j$ is the *k*th harmonic number (Abramowitz & Stegun, 1965). Applying the identity $h_k = \psi(k+1) + \gamma$ (with $\gamma$ the Euler-Mascheroni constant) and summing over *i*, we obtain the estimator of equation 1.7, which proves the identity $H_z \equiv H_1$. In addition, we have the following proposition:

The estimator $H_z$ is less biased than (or equally biased as) the likelihood estimator $\hat H$, for all samples of size *N* and all event space sizes *M*.

Consider a sample of size *N* with $1 \le k \le N$. For any finite *N*, the inequality $1 + x \le e^x$ holds, as is seen by Taylor series expansion of the exponential function. From this, by simple algebraic manipulations, it follows that the right-hand side of equation 2.8 is less than $\ln(N/k)$ for any finite *N*. It follows that equation 2.8 is satisfied for any *k* with $1 \le k \le N$. This proves that $H_z$ is less biased than $\hat H$ for any finite *N*.
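Two elementary facts used in the derivation and proof above are easy to verify numerically: the relation between harmonic numbers and the digamma function, and the exponential inequality obtained from the Taylor expansion. The sketch below uses a hand-rolled asymptotic digamma (helper names are my own, not the paper's notation):

```python
import math

def harmonic(k):
    """k-th harmonic number h_k = 1 + 1/2 + ... + 1/k."""
    return sum(1.0 / j for j in range(1, k + 1))

def digamma(x):
    """Digamma function via upward recurrence plus a short asymptotic series."""
    result = 0.0
    while x < 10.0:          # shift argument up to where the series is accurate
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # psi(x) ~ ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    result += math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result

gamma = 0.5772156649015329   # Euler-Mascheroni constant

# Identity h_k = psi(k+1) + gamma, used to pass between the harmonic-number
# and digamma forms of the estimator.
for k in (1, 5, 50, 500):
    assert abs(harmonic(k) - (digamma(k + 1) + gamma)) < 1e-9

# Elementary inequality 1 + x <= exp(x), with equality only at x = 0,
# following from the Taylor expansion of the exponential function.
for i in range(-50, 51):
    x = i / 10.0
    assert 1.0 + x <= math.exp(x) + 1e-12
print("identities verified")
```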

## 3 Numerical Comparison of $H_z$ and $H_2$

Since the bias $B_N$ of $\hat H$ is explicitly known, a correction is defined by subtraction of the bias term $B_N$, with $p_i$ replaced by its estimate $\hat p_i$. The modified estimator is then given by $\hat H' = \hat H - \hat B_N$, where $\hat B_N$ is the plug-in estimate of $B_N$. For simplicity, we refrain from applying the same procedure of bias correction to the other estimators. Our first data sample is taken from the uniform probability distribution $p_i = 1/M$, $i = 1, \ldots, M$. In addition, we consider the (right-tailed) Zipf distribution $p_i = c/i$, for $i = 1, \ldots, M$ and normalization constant $c$ (the reciprocal of the *M*th harmonic number). The statistical error for increasing sample size *N* and given *M* is shown in Figures 1 and 2.
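The Zipf setup can be reproduced in a few lines. As a simple stand-in for the estimators compared in the figures, the sketch below measures only the root mean square error of the plug-in estimate for increasing *N* (all helper names are my own):

```python
import bisect
import math
import random

random.seed(2)

def zipf_dist(M):
    """Zipf probabilities p_i = c / i, c the reciprocal of the M-th harmonic number."""
    h_M = sum(1.0 / i for i in range(1, M + 1))
    return [1.0 / (i * h_M) for i in range(1, M + 1)]

def sample_counts(p, N):
    """Draw N i.i.d. symbols from the distribution p and return their counts."""
    cdf, acc = [], 0.0
    for q in p:
        acc += q
        cdf.append(acc)
    counts = [0] * len(p)
    for _ in range(N):
        i = min(bisect.bisect_left(cdf, random.random()), len(p) - 1)
        counts[i] += 1
    return counts

def plugin_entropy(counts):
    n = sum(counts)
    return -sum(k / n * math.log(k / n) for k in counts if k > 0)

M = 200
p = zipf_dist(M)
true_H = -sum(q * math.log(q) for q in p)
trials = 300
results = {}
for N in (100, 400, 1600):
    sq = [(plugin_entropy(sample_counts(p, N)) - true_H) ** 2 for _ in range(trials)]
    results[N] = math.sqrt(sum(sq) / trials)
    print(f"N={N}: RMSE(plug-in) = {results[N]:.3f}")
```

As in Figures 1 and 2, the error decreases with growing sample size; the improved estimators discussed in the text converge considerably faster than this naive baseline.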

As we can see, the RMSE of all estimators decreases monotonically in *N*. The convergence of the naive estimator $\hat H$ is rather slow compared to the other estimators, while the performance of $\hat H'$ is slightly better than that of $H_z$. On the other hand, the statistical error of $H_2$ is significantly smaller than the statistical error of $H_z$ and $\hat H'$, and this behavior seems to be representative for large *M*. The statistical error for increasing *M* and fixed sample size *N* is shown in Figures 3 and 4. For large *M*, the RMSE of $H_z$ and $\hat H'$ is greater than that of $H_2$. This phenomenon reflects the fact that bias reduction becomes more and more relevant for increasing *M* compared to the contribution of the variance.
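The growing dominance of the bias for increasing *M* at fixed *N* can be illustrated with the plug-in estimate alone (a simple sketch under my own helper names, not the paper's experiment):

```python
import math
import random

random.seed(3)

def plugin_entropy(counts):
    n = sum(counts)
    return -sum(k / n * math.log(k / n) for k in counts if k > 0)

# Fixed sample size N; the mean error of the plug-in estimate for a uniform
# distribution over M events grows in magnitude with M.
N, trials = 100, 1000
mean_err = {}
for M in (10, 50, 200):
    errs = []
    for _ in range(trials):
        counts = [0] * M
        for _ in range(N):
            counts[random.randrange(M)] += 1
        errs.append(plugin_entropy(counts) - math.log(M))
    mean_err[M] = sum(errs) / trials
    print(f"M={M}: mean error = {mean_err[M]:.3f}")
```

The mean error stays negative throughout and becomes much larger in magnitude once *M* is on the order of *N*, which is why bias correction dominates the comparison in Figures 3 and 4.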

As we can see from both examples, the gap between $H_z$ and $H_2$ is slightly smaller for the peaked Zipf distribution compared to the uniform distribution. Thus, we ask for the performance in the extreme case of the delta distribution $p_i = \delta_{i,1}$, which has entropy zero. Indeed, in this special case, we have $\hat H = H_z = 0$ for any sample size *N*, but $H_2 \neq 0$ for finite *N*. Actually, in this case, the statistical error of the latter scales like $1/N$ for large *N*.
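That the naive estimate vanishes identically under the delta distribution is immediate: every sample consists of the same word, so the count vector is $(N, 0, \ldots, 0)$ and every term of the plug-in sum is zero. A minimal check (`plugin_entropy` is my own helper):

```python
import math

def plugin_entropy(counts):
    """Naive (likelihood) entropy estimate from a list of counts."""
    n = sum(counts)
    return -sum(k / n * math.log(k / n) for k in counts if k > 0)

# Delta distribution: every draw yields the same word, so the counts are
# (N, 0, ..., 0) and the plug-in estimate is exactly 0 for every sample size N.
for N in (1, 10, 1000):
    counts = [N] + [0] * 9
    assert plugin_entropy(counts) == 0.0
print("plug-in estimate is exactly zero under the delta distribution")
```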

## 4 Summary

In this note, we classified the entropy estimator $H_z$ of Zhang (2012) within the family of entropy estimators originally introduced by Schürmann (2004). This reveals an interesting connection between two different approaches to entropy estimation: one coming from the generalization of the diversity index of Simpson and the other one from the estimation of Rényi entropies. This connection is explicitly established by the identity $H_z \equiv H_1$. In addition, we proved that the statistical bias of $H_z$ is smaller than (or equal to) the bias of the likelihood estimator $\hat H$. Furthermore, by numerical computation for various probability distributions, we found that $H_z$ (or the heuristic estimator $\hat H'$) can be improved by the estimator $H_2$, which is an excellent member of the estimator family of Grassberger (2003) and Schürmann (2004). There is a uniform variance upper bound of $H_z$ (and therefore of $H_1$) that decays with *N* for all distributions with finite entropy (Zhang, 2012). It would be interesting to know if this variance bound also holds for the estimator $\hat H'$ or $H_2$. The answer might be found in a forthcoming publication.

## References

Harris, B. (1975). *The statistical estimation of entropy in the non-parametric case*.

## Note

^{1} For another interpretation of this representation, see Montgomery-Smith and Schürmann (2005).