Abstract
Information-theoretic measures are among the most standard techniques for the evaluation of clustering methods, including word sense induction (WSI) systems. Such measures rely on sample-based estimates of entropy. However, the standard maximum-likelihood estimates of entropy are heavily biased, with the bias depending on, among other things, the number of clusters and the sample size. This makes the measures unreliable and unfair when the number of clusters produced by different systems varies and the sample size is not exceedingly large. This corresponds exactly to the setting of WSI evaluation, where a ground-truth number of sense clusters arguably does not exist and the standard evaluation scenarios use a small number of instances of each word to compute the score. We describe more accurate entropy estimators and analyze their performance, both in simulations and in the evaluation of WSI systems.
1. Introduction
The task of word sense induction (WSI) has grown in popularity recently. WSI has the advantage of not assuming a predefined inventory of senses. Rather, senses are induced in an unsupervised fashion on the basis of corpus evidence (Schütze 1998; Purandare and Pedersen 2004). WSI systems can therefore better adapt to different target domains that may require sense inventories of different granularities. However, the fact that WSI systems do not rely on fixed inventories also makes it notoriously difficult to evaluate and compare their performance. WSI evaluation is a type of cluster evaluation problem. Although cluster evaluation has received much attention (see, e.g., Dom 2001; Strehl and Ghosh 2002; Meila 2007), it is still not a solved problem. Finding a good way to score partially incorrect clusters is particularly difficult. Several solutions have been proposed, but information-theoretic measures have been among the most successful and widely used techniques. One example is the normalized mutual information, also known as the V-measure (Strehl and Ghosh 2002; Rosenberg and Hirschberg 2007), which has, for example, been adopted in the SemEval 2010 WSI task (Manandhar et al. 2010).
All information-theoretic measures of cluster quality essentially rely on sample-based estimates of entropy. For instance, the mutual information I(c, k) between a gold standard class c and an output cluster k can be written H(c) + H(k) − H(k, c), where H(c) and H(k) are the marginal entropies of c and k, respectively, and H(k, c) is their joint entropy. The most standard estimator is the maximum-likelihood (ML) estimator, which substitutes the probability of each event (cluster, class, or cluster–class pair occurrence) with its normalized empirical frequency.
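To make the plug-in computation concrete, the following minimal Python sketch (with an invented toy contingency table) computes the ML estimates of the marginal and joint entropies and the resulting mutual information.

```python
import numpy as np

def ml_entropy(counts):
    """Plug-in (maximum-likelihood) entropy estimate from raw counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()   # normalized empirical frequencies
    return -np.sum(p * np.log(p))           # natural log; use np.log2 for bits

def ml_mutual_information(table):
    """I(c, k) = H(c) + H(k) - H(k, c), all terms estimated with the ML estimator.
    table[i, j] = number of instances with gold class i and induced cluster j."""
    table = np.asarray(table, dtype=float)
    h_c = ml_entropy(table.sum(axis=1))   # marginal entropy of classes
    h_k = ml_entropy(table.sum(axis=0))   # marginal entropy of clusters
    h_ck = ml_entropy(table.ravel())      # joint entropy of class-cluster pairs
    return h_c + h_k - h_ck

# toy contingency table: 3 gold classes x 4 induced clusters (invented numbers)
toy = [[5, 1, 0, 0],
       [0, 4, 2, 0],
       [0, 0, 1, 3]]
print(ml_mutual_information(toy))
```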
Entropy estimators, even though consistent, are biased: The expected estimate of the entropy on a finite sample is different from the true value. It is also different from the expected estimate on a larger test set generated from the same distribution, because the bias depends on the size of the sample. This discrepancy negatively affects entropy-based evaluation measures, such as the V-measure. The situation differs from supervised classification evaluation, where the expected accuracy on a finite test set equals the true accuracy (in the independent and identically distributed, i.i.d., case), even though any individual estimate may deviate from it due to variance (i.e., the choice of the test set). As long as the number of samples is large relative to the number of classes and clusters, the entropy estimate is sufficiently close to the true entropy. Otherwise, the quality of the entropy estimator matters and its bias can be large. This problem is especially prominent for the ML estimator (Miller 1955).
In WSI, we are faced with exactly those conditions that negatively affect the entropy estimators. In a typical setting, the number of examples per word is small—for example, less than 100 on average for the SemEval 2010 WSI task. The number of clusters, on the other hand, can be fairly high, with some systems outputting more than 10 sense clusters per word on average. Because the bias of an entropy estimator is dependent on, among other things, the number of clusters, the ranking of different WSI systems is partly affected by the number of clusters they produce. Even worse, the ranking is also affected by the size of the test set. The problem is exacerbated when computing the joint entropy between clusters and classes, H(k, c), because this requires estimating the joint probability of cluster-class pairs for which the statistics are even more sparse.
The bias problem of entropy estimators has long been known in the information theory community, and many studies have addressed this issue (e.g., Miller 1955; Grassberger and Schürmann 1996; Batu et al. 2002). In this article, we compare different estimators and their influence on the computed evaluation scores. We run simulations using a Zipfian distribution for which we know the true entropy. We also compare different estimators on the SemEval 2010 WSI benchmark. Our results strongly suggest that there are estimators, namely, the best-upper-bound (BUB) estimator (Paninski 2003) and the jackknifed estimator (Quenouille 1956; Tukey 1958), that are clearly preferable to the commonly used ML estimator.
2. Clustering Evaluation
2.1 Information-Theoretic Measures
The ML estimators of entropy are consistent but heavily negatively biased (see Section 3 for details). In other words, the expectation of Ĥ is lower than the true entropy, and this discrepancy increases with the number of clusters m and decreases with the sample size N. When m is comparable to N, the ML estimator is known to be very inaccurate (Paninski 2004).
Note that for V-measure estimation the main source of the estimation error is the joint entropy H(k, c),1 as the number of possible pairs (c, k) for most systems is large, whereas the total number of occurrences remains the same as for the estimation of H(c) and H(k). Therefore, the absolute value of the bias of Ĥ(c, k) will exceed the aggregate bias of the estimators of the marginal entropies, Ĥ(c) and Ĥ(k). As a result, the V-measure will be positively biased, and this bias will be especially large for systems predicting a large number of clusters.
This phenomenon has been noticed before (Manandhar et al. 2010), but no satisfactory explanation has been given. The shortcomings of the ML estimator are especially easy to see on the example of a baseline system that assigns every instance in the test set to its own cluster. This baseline, when averaged over the 100 target words, outperforms all the participating systems of the SemEval 2010 task on the standard test set (Manandhar and Klapaftis 2009). Though we cannot compute the true bias for any real system, the computation is trivial for this baseline. The true V-measure is equal to 0, as the baseline can be regarded as the limiting case of a stochastic system that picks one of m clusters uniformly at random, with m → ∞; the mutual information between any class labels and the clustering produced by such a model equals 0 for every m. However, the ML estimate of the V-measure is 2ĉ/(1 + ĉ), where the estimated completeness ĉ = 1 − Ĥ(k|c)/Ĥ(k) = Ĥ(c)/log N is strictly positive (the estimated homogeneity is trivially 1, since every cluster is pure). For the test set of SemEval 2010, this estimate, averaged over all the words, yields 31.7%, which by far exceeds the best result of any system (16.2%). On an infinite (or sufficiently large) test set, however, its performance would drop to the worst. This is a problem not only for the baseline but for any system that outputs a large number of clusters: The error measures computed on the small test set are far from their expectations on new data. We will see in our quantitative analyses (Section 5) that using more accurate estimators has the most significant effect on both the V-measure and the ranks of systems that output richer clusterings, in agreement with this argument.
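To illustrate, a small sketch of the plug-in V-measure (with β = 1, following the definition of Rosenberg and Hirschberg 2007; the gold sense counts below are invented) shows that the all-singletons clustering receives a clearly positive estimate even though its true V-measure is 0.

```python
import numpy as np

def ml_entropy(counts):
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log(p))

def v_measure_ml(gold, clusters):
    """Plug-in V-measure (beta = 1), following Rosenberg and Hirschberg (2007)."""
    cs, ks = sorted(set(gold)), sorted(set(clusters))
    table = np.zeros((len(cs), len(ks)))
    for g, k in zip(gold, clusters):
        table[cs.index(g), ks.index(k)] += 1
    h_c = ml_entropy(table.sum(axis=1))
    h_k = ml_entropy(table.sum(axis=0))
    h_ck = ml_entropy(table.ravel())
    hom = 1.0 if h_c == 0 else 1.0 - (h_ck - h_k) / h_c    # 1 - H(c|k)/H(c)
    com = 1.0 if h_k == 0 else 1.0 - (h_ck - h_c) / h_k    # 1 - H(k|c)/H(k)
    return 0.0 if hom + com == 0 else 2 * hom * com / (hom + com)

# invented gold sense labels for one target word; every instance gets its own cluster
gold = ['s1'] * 12 + ['s2'] * 5 + ['s3'] * 3
singletons = list(range(len(gold)))
print(v_measure_ml(gold, singletons))   # about 0.48, although the true V-measure is 0
```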
Though in this analysis we focused on the V-measure, other information-theoretic measures have also been proposed. Examples include the variation of information VI(c, k) = H(c|k) + H(k|c) (Meila 2007) and the Q0 measure H(c|k) (Dom 2001). The argument above applies to these evaluation measures as well, and they can all potentially be improved by using more accurate estimators.
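For reference, a brief sketch of how these two measures would be computed from the same plug-in entropy estimates (the conditional entropies are obtained as differences of joint and marginal estimates; the same bias caveats apply):

```python
import numpy as np

def ml_entropy(counts):
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log(p))

def vi_and_q0(table):
    """VI(c, k) = H(c|k) + H(k|c) and Q0 = H(c|k), both from plug-in estimates."""
    table = np.asarray(table, dtype=float)
    h_ck = ml_entropy(table.ravel())
    h_c_given_k = h_ck - ml_entropy(table.sum(axis=0))   # H(c|k) = H(c,k) - H(k)
    h_k_given_c = h_ck - ml_entropy(table.sum(axis=1))   # H(k|c) = H(c,k) - H(c)
    return h_c_given_k + h_k_given_c, h_c_given_k
```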
2.2 Alternative Measures
Information-theoretic measures are not the only ones that have been proposed for clustering evaluation. An alternative evaluation strategy is to find the best possible mapping between the predicted clusters and the gold-standard classes and then apply standard measures such as precision, recall, and F-score. However, if the best mapping is selected on the test set, the result can be overly optimistic, especially for rich clusterings. Consequently, such methods constrain the set of permissible mappings to a restricted family. For example, for the F-score, one considers only mappings from each class to a single predicted cluster (Zhao and Karypis 2005; Agirre and Soroa 2007). This restriction is generally too strong for many clustering problems (Meila 2007; Rosenberg and Hirschberg 2007), and it is especially inappropriate for WSI evaluation, as it penalizes sense induction systems that induce more fine-grained senses than those present in the gold-standard sense inventory.
The Paired F-score (Manandhar et al. 2010) is somewhat less restrictive than the F-score measures in that it defines precision and recall in terms of pairs of instances (i.e., it effectively evaluates systems based on the proportion of correct links). However, the Paired F-score has the undesirable property that it ranks highest those systems that put all instances in one cluster, thereby obtaining perfect recall.
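A minimal sketch of the Paired F-score over instance pairs (a generic implementation of the pairwise definition, not the official SemEval scorer; the labels are invented) illustrates why a single all-inclusive cluster attains perfect recall.

```python
from itertools import combinations

def paired_f_score(gold, clusters):
    """Precision and recall over unordered instance pairs: a pair is predicted
    if both instances share a cluster, and true if both share a gold class."""
    idx = range(len(gold))
    predicted = {(i, j) for i, j in combinations(idx, 2) if clusters[i] == clusters[j]}
    true_pairs = {(i, j) for i, j in combinations(idx, 2) if gold[i] == gold[j]}
    if not predicted or not true_pairs:
        return 0.0
    correct = len(predicted & true_pairs)
    if correct == 0:
        return 0.0
    precision, recall = correct / len(predicted), correct / len(true_pairs)
    return 2 * precision * recall / (precision + recall)

gold = ['s1'] * 12 + ['s2'] * 5 + ['s3'] * 3       # invented gold labels
one_cluster = [0] * len(gold)                      # everything in a single cluster
print(paired_f_score(gold, one_cluster))           # recall is 1.0 by construction
```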
As an alternative, a supervised evaluation measure has been proposed (Agirre et al. 2006). In addition to the test set, this approach uses an auxiliary mapping set: The mapping is first induced on the mapping set, and the quality of the mapping is then evaluated on the test set. One problem with this evaluation scenario is that the size of the mapping set affects the results, and there is no obvious criterion for selecting the right size. For the WSI task, the importance of the split size was empirically confirmed when the evaluation set was split in proportions 80:20 (80% for the mapping set, 20% for testing) instead of the original 60:40 split: The scores of all top 10 systems improved and the ranking changed as well (Manandhar et al. 2010) (see also Table 1 later in this article).
Further cluster evaluation measures have been proposed for other language processing tasks, such as B³ (Bagga and Baldwin 1998) and CEAF (Luo 2005) for coreference resolution. In this article, we are concerned with entropy-based measures. For a more general assessment of clustering evaluation measures, see Amigó et al. (2009) and Klapaftis and Manandhar (2013).
3. Entropy Estimation
Given the influence that information theory has had on many fields, including signal processing, neurophysiology, and psychology, to name a few, it is not surprising that the topic of entropy estimation has received considerable attention over the last 50 years.2 However, much of this work has focused on settings where the number of classes is significantly smaller than the size of the sample. More recently, the setting where the sample size N is comparable to the number of classes m has begun to receive attention (Paninski 2003, 2004).
In this section, we start by discussing the intuition for why the ML estimator is heavily biased in this setting. Though unbiased estimators of entropy do not exist,3 various techniques have been proposed to reduce the bias while controlling the variance (Grassberger and Schürmann 1996; Batu et al. 2002). We discuss two widely used bias-corrected estimators, the ML estimator with the Miller-Madow bias correction (Miller 1955) and the jackknifed estimator (Strong et al. 1998). We then turn to a more recent technique proposed specifically for the N ∼ m setting, the best-upper-bound (BUB) estimator (Paninski 2003). We conclude this section by explaining how these estimators can be computed from the stochastic (weighted) output of WSI systems.
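As a preview of the bias-corrected estimators discussed below, the following sketch implements the Miller-Madow correction and the jackknifed estimator directly from their standard definitions (it is written against raw counts rather than any WSI-specific data structure, and the sense counts in the usage example are invented).

```python
import numpy as np

def ml_entropy(counts):
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log(p))

def mm_entropy(counts):
    """Miller-Madow correction: H_ML + (m_hat - 1) / (2N), where m_hat is the
    number of bins actually observed in the sample."""
    counts = np.asarray(counts, dtype=float)
    n, m_hat = counts.sum(), np.count_nonzero(counts)
    return ml_entropy(counts) + (m_hat - 1) / (2.0 * n)

def jk_entropy(counts):
    """Jackknifed estimator: N * H_ML - (N - 1) * (average of the N leave-one-out
    ML estimates).  Observations in the same bin give identical leave-one-out
    values, so one evaluation per non-empty bin suffices."""
    counts = np.asarray(counts, dtype=float)
    n, h_full, loo_sum = counts.sum(), ml_entropy(counts), 0.0
    for j in np.flatnonzero(counts):
        reduced = counts.copy()
        reduced[j] -= 1                       # remove one observation from bin j
        loo_sum += counts[j] * ml_entropy(reduced)
    return n * h_full - (n - 1) * loo_sum / n

sense_counts = np.array([50, 20, 10, 5, 5, 3, 3, 2, 1, 1])   # invented counts
print(ml_entropy(sense_counts), mm_entropy(sense_counts), jk_entropy(sense_counts))
```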
3.1 Standard Estimators of Entropy
3.2 BUB Estimator
This observation suggests that it makes sense to study an estimator of the general form defined in Equation (2). Upper bounds on the bias and variance of such estimators4 are given in Paninski (2003). These bounds imply an upper bound on the standard measure of estimator performance, the mean squared error (MSE, the sum of the variance and the squared bias). The worst-case-optimal estimator is then obtained by selecting the coefficients to minimize the upper bound on the MSE; it is therefore called the best-upper-bound estimator. This optimization problem5 corresponds to a regularized least-squares problem and can be solved analytically (see Appendix A and Paninski [2003] for technical details).
This technique is fairly general, and can potentially be used to minimize the bound for a particular type of distribution. This direction can be promising, as the types of distributions observed in WSI are normally fairly skewed (arguably Zipfian) and tighter bounds on MSE may be possible. In this work, we use the universal worst-case bounds advocated in Paninski (2003).
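Although the full BUB coefficient selection is beyond a short sketch, the general form on which it is based is easy to illustrate: the estimate is a linear function of the histogram of counts (the number of bins occurring exactly j times, called h_j here), and only the coefficient vector a_{j,N} differs between estimators. The sketch below shows the ML coefficients as a special case; the BUB coefficients would instead come from the regularized least-squares problem of Appendix A, which is not reproduced here.

```python
import numpy as np

def count_histogram(counts, n):
    """h[j] = number of bins whose sample count equals j, for j = 0..N."""
    h = np.zeros(n + 1)
    for c in np.asarray(counts, dtype=int):
        h[c] += 1
    return h

def general_form_entropy(counts, a):
    """Estimators of the form H_hat = sum_j a[j] * h[j].  The ML estimator uses
    a[j] = -(j/N) log(j/N); BUB instead chooses a to minimize an upper bound on
    the MSE (the regularized least-squares problem of Appendix A, not shown)."""
    counts = np.asarray(counts, dtype=int)
    n = counts.sum()
    return float(np.dot(a, count_histogram(counts, n)))

def ml_coefficients(n):
    j = np.arange(n + 1, dtype=float)
    a = np.zeros(n + 1)
    a[1:] = -(j[1:] / n) * np.log(j[1:] / n)
    return a

counts = np.array([6, 3, 1])                        # invented counts, N = 10
print(general_form_entropy(counts, ml_coefficients(counts.sum())))  # equals the plug-in estimate
```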
3.3 Estimation with Stochastic Predictions
As many WSI systems maintain a distribution over predicted clusters, the SemEval 2010 participants were encouraged to provide weighted predictions (i.e., a distribution over potential clusters for each example) instead of predicting only the single most likely cluster.
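One natural way to feed such weighted output into count-based estimators, sketched below purely as an illustration (it is not necessarily the exact procedure of this section), is to accumulate expected (fractional) cluster-class counts from the per-instance distributions.

```python
import numpy as np

def expected_contingency(gold, soft_predictions, num_classes, num_clusters):
    """Expected cluster-class contingency table from weighted (stochastic) output.
    soft_predictions[i] is a distribution over clusters for instance i and
    gold[i] is its gold class index.  This is a generic soft-count construction,
    not necessarily the exact procedure used in the official evaluation."""
    table = np.zeros((num_classes, num_clusters))
    for c, dist in zip(gold, soft_predictions):
        table[c] += np.asarray(dist, dtype=float)   # accumulate fractional counts
    return table

# toy example: 3 instances, 2 gold classes, 3 candidate clusters (invented weights)
gold = [0, 0, 1]
soft = [[0.7, 0.2, 0.1],
        [0.6, 0.3, 0.1],
        [0.1, 0.1, 0.8]]
print(expected_contingency(gold, soft, num_classes=2, num_clusters=3))
```

Note that estimators defined in terms of integer counts (such as the histogram-of-counts form above or the jackknife) require additional care when the accumulated counts are fractional.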
4. Simulations
Because the true entropy (and V-measure) is not known on the WSI task, we start with simulations where we generated samples from a known distribution and can compare the estimates (and their biases) with the true entropy. In all our experiments, we set the number of clusters m to 10 and varied the sample size N (Figure 1). Each point on the graph is the result of averaging over 1,000 sampling experiments.6
The distribution of senses for a given word is normally skewed: For most words, the vast majority of occurrences correspond to the one or two most common senses, even though the total number of senses can be quite large (Kilgarriff 2004). This type of long-tailed distribution can be modeled with Zipf's law. Consequently, most of our experiments consider Zipfian distributions. Under Zipf's law, the probability of choosing an element with rank k is proportional to 1/k^s, where s is a shape parameter. Small values of s correspond to flatter distributions; distributions with larger s are increasingly skewed. The estimators' predictions for Zipfian distributions with different values of s are shown in Figure 2. For s = 4, over 90% of the probability mass is concentrated on a single class. For every distribution we plot the true entropy (H) and the estimated values; compare these results with those for the uniform distribution shown in Figure 1.
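The simulation protocol can be sketched as follows (a simplified version of the setup just described; only the ML estimator is shown, and the other estimators would be swapped in at the marked line).

```python
import numpy as np

rng = np.random.default_rng(0)

def zipf_distribution(m, s):
    """p_k proportional to 1 / k^s over ranks k = 1..m."""
    p = np.arange(1, m + 1, dtype=float) ** (-s)
    return p / p.sum()

def ml_entropy(counts):
    counts = np.asarray(counts, dtype=float)
    q = counts[counts > 0] / counts.sum()
    return -np.sum(q * np.log(q))

m, s, runs = 10, 2.0, 1000
p = zipf_distribution(m, s)
true_h = -np.sum(p * np.log(p))
for n in (5, 10, 20, 50, 100, 500):
    estimates = []
    for _ in range(runs):
        sample = rng.choice(m, size=n, p=p)          # N i.i.d. draws from the Zipf law
        estimates.append(ml_entropy(np.bincount(sample, minlength=m)))
        # other estimators (MM, JK, BUB) would be swapped in on the line above
    print(n, true_h, np.mean(estimates))             # the ML average stays below true_h
```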
In all figures, we observe that over the entire range of sample sizes, the bias of the bias-corrected estimates is indeed reduced substantially with respect to that of the ML estimator. This difference is particularly large for smaller N, the realistic setting for the computation of H(c, k) in the WSI task. For the uniform distribution and the flatter Zipf distributions (s = 1 and 2), the JK estimator seems preferable for all but the smallest sample sizes (N > 3). The BUB estimator outperforms the JK estimator for very skewed distributions (s = 3 and s = 4) and in most cases provides the least biased estimates for very small N. However, these results for very small sample sizes (N ≤ 2) may not have much practical relevance, as any estimator is highly inaccurate in this regime. The MM bias correction, as expected, is not sufficient for small N. Although it outperforms the ML estimates, its error is consistently larger than that of the other bias-correction strategies.
Overall, the simulations suggest that the ML estimators are not very appropriate for entropy estimation with the types of distributions which are likely to be observed in the WSI tasks. Both the JK and BUB estimators are considerably less biased alternatives to the ML estimations.
5. Effects on WSI Evaluation
To gauge the effect of the bias problem on WSI evaluation, we computed how the ranking of the SemEval 2010 systems (Manandhar et al. 2010) was affected by different estimators. The SemEval 2010 organizers supplied a test set containing 8,915 manually annotated examples covering 100 polysemous lemmas.
The average number of gold standard senses per lemma was 3.79. Overall, 27 systems participated and were ranked according to their performance on the test set, applying the V-measure evaluation as well as paired F-score and a supervised evaluation scheme. The systems were also compared against three baselines. For the Most Frequent Sense (MFS) baseline all test instances of a given target lemma are grouped into one cluster, that is, there is exactly one cluster per lemma. The second baseline, Random, assigns each instance randomly to one of four clusters. The last baseline, proposed in Manandhar and Klapaftis (2009), 1-cluster-per-instance (1ClI), produces as many clusters as there are instances in the test set.
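For reference, the three baselines are simple enough to state as code (a sketch with invented instance identifiers; the Random baseline uses four clusters, as described above).

```python
import random

def mfs_baseline(instances):
    """Most Frequent Sense: all instances of a lemma grouped into one cluster."""
    return [0 for _ in instances]

def random_baseline(instances, num_clusters=4, seed=0):
    """Random: each instance assigned to one of num_clusters clusters."""
    rnd = random.Random(seed)
    return [rnd.randrange(num_clusters) for _ in instances]

def one_cluster_per_instance(instances):
    """1ClI: as many clusters as there are instances."""
    return list(range(len(instances)))

instances = ['inst_%d' % i for i in range(10)]       # invented instance identifiers
print(mfs_baseline(instances))
print(random_baseline(instances))
print(one_cluster_per_instance(instances))
```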
Table 1 gives an overview of the different systems and the three baselines (shown in italics). The systems are presented in the order in which they were given in the official SemEval 2010 results table (Table 4 in Manandhar et al. 2010, p. 66). Table 1 shows the average number of clusters per word (C#), the V-measure computed with different estimators (ML, MM, JK, and BUB), and the rankings it produces (in brackets).7 For comparison, the results of a supervised evaluation are also shown. The bottom two rows (KCDC-PC-2* and UoY*) show the scores computed from the stochastic (weighted) output (Section 3.3) for systems KCDC-PC-2 and UoY, respectively. Other systems did not produce weighted output.
The 27 systems vary widely in the number of average clusters they output per lemma, ranging from 1.02 (Duluth-WSI-SVD) to 17.5 (KSU-KDD). To assess the influence of the cluster granularity on the entropy estimates, we compared the estimates given by the ML estimator against those given by JK and BUB for different numbers of clusters. Figure 3 plots the cluster numbers output by the systems against the estimate difference for ML vs. JK (Figure 3a) and ML vs. BUB (Figure 3b). If two estimators agree perfectly (i.e., produce the same estimate), their difference should always be zero, independent of the number of clusters. As can be seen, this is not the case. As expected, the difference is larger for systems with larger numbers of clusters, such as KSU-KDD. This trend will result in unfair preference towards systems producing richer clusterings.
Figure 4 shows the effect that the discrepancy in estimates has on the rankings produced by using either of the three estimators. Figure 4a plots the ranking of the ML estimator against JK, and Figure 4b plots the ranking of ML against BUB. Dots that lie on the diagonal line indicate systems whose rank has not changed. It can be seen that this only applies to a minority of the systems. In general, there are significant differences between the rankings produced by ML and those by JK or BUB. We have seen that the ML estimator can lead to counterintuitive and undesirable results, such as ranking the 1-cluster-per-instance baseline highest. The BUB estimator corrects this and assigns the last rank to this baseline.8
The estimate for the V-measure is based on the estimates of the marginal and joint entropies. To confirm our intuition that joint entropies are more significantly corrected, we looked into the differences between estimates of each entropy for five systems with the largest number of clusters (excluding the 1C1I baseline). The average differences in estimation of H(k) and H(k, c) between JK and ML estimators are 0.08 and 0.16, respectively, confirming our hypothesis. Analogous discrepancies for the pair BUB vs. ML are 0.15 and 0.06, respectively. The differences in the entropy of the gold standard clustering, H(c), is less significant (< 0.02 for both methods) as the gold standard is less fine-grained than the clusters proposed by these five systems.
For the evaluation of the stochastic output, we observe that the score of the KCDC-PC-2 system mostly decreases with respect to the "deterministic" evaluation (except under the MM estimator). Conversely, the score of UoY mostly improves, except under the BUB estimator. These differences are somewhat surprising: The stochastic version results in considerably larger disagreement between the estimators than the deterministic version. We do not yet have a satisfactory explanation for this phenomenon.
It is important to notice that for the vast majority of the systems there is agreement between the scores of the JK and BUB estimators, whereas the ML estimator significantly overestimates the V-measure for most of the systems. This observation, coupled with the observed behavior of the JK and BUB estimators in the simulations, suggests that their predictions are considerably more reliable than those of the plug-in ML estimator.
Comparing the V-measure (BUB) rankings to those obtained by supervised evaluation (last two columns in Table 1) shows noticeable differences. Several systems that rank highly according to the V-measure occupy the lower end of the scale when evaluated according to supervised recall (Hermit, KSU KDD, Duluth-Mix-Narrow-Gap).
6. Conclusions
In this work, we analyzed the shortcomings of information-theoretic measures in the context of WSI evaluation and argued that the main drawbacks of these approaches, such as the preference for systems predicting richer clusterings or the assignment of the top score to the 1-cluster-per-instance baseline, are caused by the bias of the underlying sample-based estimates of entropy. We studied alternative estimators, including one specifically designed for the case where the number of examples is comparable to the number of clusters. Two of the considered estimators, the jackknifed estimator and the best-upper-bound estimator, achieve consistently and significantly less biased results than the standard ML estimator when evaluated in simulations with Zipfian distributions. The corresponding estimates in the WSI evaluation context can result in significant changes in scores and relative rankings, with systems producing richer clusterings affected most severely. We believe these results strongly suggest that more accurate estimates of entropy should be used in future evaluations of sense induction systems. Other unsupervised tasks in natural language processing, such as word clustering or named entity disambiguation, may also benefit from information-theoretic scores based on more accurate estimators.
Appendix A: Derivation for the BUB Estimator
Finally, the MSE can be bounded by substituting Equations (A.2) and (A.3) into inequality (A.1). For computational reasons, instead of choosing the coefficients to minimize this bound directly, an L2 relaxation of the L∞ loss is used, resulting in a regularized least-squares problem.
Acknowledgments
The research was carried out when the authors were at Saarland University. It was funded by the German Research Foundation DFG (Cluster of Excellence on Multimodal Computing and Interaction, Saarland University, Germany). We would like to thank the anonymous reviewers for their valuable feedback.
Notes
The V-measure can be expressed via entropies in a number of different ways, although for ML estimation they are all equivalent. For some more complex estimators, including some of those considered here, the resulting estimates will differ somewhat depending on the decomposition. We focus on the symmetric form presented here.
For a relatively recent overview of progress in entropy estimation research see, for example, the proceedings of the NIPS 2003 workshop on entropy estimation.
The expectation of any estimate from i.i.d. samples is a polynomial function of class probabilities. The entropy is non-polynomial and therefore unbiased estimators do not exist.
We argued that variance is not particularly important for the ML estimator with N ∼ m. However, for an arbitrary estimator of the form of Equation (2) this may not be true, as the coefficients aj,N may be oscillating, resulting in an estimator with a large variance (Antos and Kontoyiannis 2001).
More formally, its modification in which the L2 norm is optimized instead of the original L∞ optimization set-up.
In this way we study only the bias of estimators.
The ranking produced by the ML estimator should mirror that of the official results. In some cases it does not—for example, system UoY was placed before KSU in the official results, whereas the ML estimator would predict the reverse order. As the difference in V-measure is small, we attribute this discrepancy to rounding errors. The system KCDC-GDC seems to be misplaced in the official results list; according to V-measure it should be ranked higher. Our ranking was computed before rounding, and there were no ties.
Note that the V-measure is actually negative here. Though this is not possible for the true V-measure, the estimated V-measure expresses a difference between the estimated joint entropy and the marginal entropies and can be negative.
References
Author notes
Microsoft Development Center Norway. E-mail: [email protected].
Institute for Logic, Language and Computation. E-mail: [email protected].
Computational Linguistics and Digital Humanities, Trier University, 54286 Trier, Germany. E-mail: [email protected].