## Abstract

Word sense disambiguation and the related field of automated word sense induction traditionally assume that the occurrences of a lemma can be partitioned into senses. But this seems to be a much easier task for some lemmas than others. Our work builds on recent work that proposes describing word meaning in a graded fashion rather than through a strict partition into senses; in this article we argue that not all lemmas may need the more complex graded analysis, depending on their partitionability. Although there is plenty of evidence from previous studies and from the linguistics literature that there is a spectrum of partitionability of word meanings, this is the first attempt to measure the phenomenon and to couple the machine learning literature on clusterability with word usage data used in computational linguistics.

We propose to operationalize partitionability as clusterability, a measure of how easy the occurrences of a lemma are to cluster. We test two ways of measuring clusterability: (1) existing measures from the machine learning literature that aim to measure the goodness of optimal k-means clusterings, and (2) the idea that if a lemma is more clusterable, two clusterings based on two different “views” of the same data points will be more congruent. The two views that we use are two different sets of manually constructed lexical substitutes for the target lemma, on the one hand monolingual paraphrases, and on the other hand translations. We apply automatic clustering to the manual annotations. We use manual annotations because we want the representations of the instances that we cluster to be as informative and “clean” as possible. We show that when we control for polysemy, our measures of clusterability tend to correlate with partitionability, in particular some of the type-(1) clusterability measures, and that these measures outperform a baseline that relies on the amount of overlap in a soft clustering.

## 1. Introduction

In computational linguistics, the field of word sense disambiguation (WSD), where a computer selects the appropriate sense from an inventory for a word in a given context, has received considerable attention.^{1} Initially, most work focused on manually constructed inventories such as WordNet (Fellbaum 1998) but there has subsequently been a great deal of work on the related field of word sense induction (WSI) (Pedersen 2006; Manandhar et al. 2010; Jurgens and Klapaftis 2013) prior to disambiguation. This article concerns the phenomenon of word meaning and current practice in the fields of WSD and WSI.

Computational approaches to determining word meaning in context have traditionally relied on a fixed sense inventory produced by humans or by a WSI system that groups token instances into hard clusters. Either sense inventory can then be applied to tag sentences on the premise that there will be one best-fitting sense for each token instance. However, word meanings do not always take the form of discrete senses but vary on a continuum between clear-cut ambiguity and vagueness (Tuggy 1993). For example, the noun *crane* is a clear-cut case of ambiguity between lifting device and bird, whereas the exact meaning of the noun *thing* can only be retrieved via the context of use rather than via a representation in the mental lexicon of speakers. Cases of polysemy such as the verb *paint*, which can mean painting a picture, decorating a room, or painting a mural on a house, lie somewhere between these two poles. Tuggy highlights the fact that boundaries between these different categories are blurred. Although specific context clearly plays a role (Copestake and Briscoe 1995; Passonneau et al. 2010) some lemmas are inherently much harder to partition than others (Kilgarriff 1998; Cruse 2000). There are recent attempts to address some of these issues by using alternative characterizations of word meaning that do not involve creating a partition of usages into senses (McCarthy and Navigli 2009; Erk, McCarthy, and Gaylord 2013), and by asking WSI systems to produce soft or graded clusterings (Jurgens and Klapaftis 2013) where tokens can belong to a mixture of the clusters. These approaches, however, do not explicitly consider where a lemma lies on the continuum, although doing so should help in determining an appropriate representation. Whereas the broad senses of the noun *crane* could easily be represented by a hard clustering, this would not make any sense for the noun *thing*; meanwhile, the verb *paint* might benefit from a more graded representation.

In this article, we propose the notion of **partitionability** of a lemma, that is, the ease with which usages can be grouped into senses. We exploit data from annotation studies to explore the partitionability of different lemmas and see where on the ambiguity–vagueness cline a lemma is. This should be useful in helping to determine the appropriate computational representation for a word's meanings—for example, whether a hard clustering will suffice, whether a soft clustering would be more appropriate, or whether a clustering representation does not make sense. To our knowledge, there has been no study on detecting partitionability of word senses.

We operationalize partitionability as **clusterability**, a measure of how much structure there is in the data and therefore how easy it is to cluster (Ackerman and Ben-David 2009a), and test to what extent clusterability can predict partitionability. For deriving a gold estimate of partitionability, we turn to the Usage Similarity (hereafter **Usim**) data set (Erk, McCarthy, and Gaylord 2009), for which annotators have rated the similarity of pairs of instances of a word using a graded scale (an example is given in Section 2.2). We use inter-annotator agreement (IAA) on this data set as an indication of partitionability. Passonneau et al. (2010) demonstrated that IAA is correlated with sense confusability. Because this data set consists of similarity judgments on a scale, rather than annotation with traditional word senses, it gives rise to a second indication of partitionability: We can use the degree to which annotators have used intermediate points on a scale, which indicate that two instances are neither identical in meaning nor completely different, but somewhat related.

We want to know to what extent measures of clusterability of instances can predict the partitionability of a lemma. As our focus in this article is to test the predictive power of clusterability measures in the best possible case, we want the representations of the instances that we cluster to be as informative and “clean” as possible. For this reason, we represent instances through manually annotated translations (Mihalcea, Sinha, and McCarthy 2010) and paraphrases (McCarthy and Navigli 2007). Both translations (Resnik and Yarowsky 2000; Carpuat and Wu 2007; Apidianaki 2008) and monolingual paraphrases (Yuret 2007; Biemann and Nygaard 2010; Apidianaki, Verzeni, and McCarthy 2014) have previously been used as a way of inducing word senses, so they should be well suited for the task. Since the suggestion by Resnik and Yarowsky (1997) to limit WSD to senses lexicalized in other languages, numerous works have exploited translations for semantic analysis. Dyvik (1998) discovers word senses and their relationships through translations in a parallel corpus and Ide, Erjavec, and Tufiş (2002) group the occurrences of words into senses by using translation vectors built from a multilingual corpus. More recent works focus on discovering the relationships between the translations and grouping them into clusters either automatically (Bannard and Callison-Burch 2005; Apidianaki 2009; Bansal, DeNero, and Lin 2012) or manually (Lefever and Hoste 2010). McCarthy (2011) shows that overlap of translations compared to overlap of paraphrases on sentence pairs for a given lemma are correlated with inter-annotator agreement of graded lemma usage similarity judgments (Erk, McCarthy, and Gaylord 2009) but does not attempt to cluster the translation or paraphrase data or examine the findings in terms of clusterability.
In this initial study of the clusterability phenomenon, we represent instances through translation and paraphrase annotations; in the future, we will move to automatically generated instance representations.

There is a small amount of work on clusterability in the area of machine learning theory (Epter, Krishnamoorthy, and Zaki 1999; Zhang 2001; Ostrovsky et al. 2006; Ackerman and Ben-David 2009a), and all existing measures are based on *k*-means clustering. Two of them (variance ratio and worst pair ratio) test how tight the clusters are and how far different clusters are from each other (Epter, Krishnamoorthy, and Zaki 1999; Zhang 2001), and one (separability) tests how much the value of the objective function changes as the number *k* of clusters changes (Ostrovsky et al. 2006). We test all three of these **intra-clustering** (hereafter intra-clust) measures of clusterability. In addition, we test the intuition that for a well-clusterable lemma, the clusterings based on two different “views” of the same data points—in our case, a clustering based on monolingual paraphrases and a clustering based on translations—should be similar. For this **inter-clustering** (inter-clust) notion of clusterability, we use a simple graphical method that does not have the requirement of needing a specified number of clusters. We use this same graphical clustering to provide the *k* for our intra-clust measures because the existing definitions of clusterability from machine learning theory need the number of clusters to be fixed in advance. There are a vast number of clustering algorithms with which we could experiment. The clustering algorithm itself is not being evaluated here. Instead, the hypothesis is that if a data set is more clusterable, then it should be computationally easier to cluster (Ackerman and Ben-David 2009b) because the structure in the data is more obvious, so any reasonable algorithm should be able to partition the data to reflect that structure. 
We contrast the performance of the three intra-clust measures and the inter-clust measure with a simplistic baseline that relies on the amount of overlapping items in a soft clustering of the instance data, since such a baseline would be immediately available if one applied soft clustering to all lemmas.

We show that when controlling for polysemy, our indicators of higher clusterability tend to correlate with our two gold standard partitionability estimates. In particular, clusterability tends to correlate positively with higher inter-annotator agreement and negatively with a greater proportion of mid-range judgments on a graded scale of instance similarity. Although all our measures show some positive results, it is the intra-clust measures (particularly two of these) that are most promising.

## 2. Characterizing Word Meaning

### 2.1 The Difficulty of Characterizing Word Meaning

There has been an enormous amount of work in the fields of WSD and WSI relying on a fixed inventory of senses and on the assumption of a single best sense for a given instance (for example, see the large body of work described in Navigli [2009]) though doubts have been expressed about this methodology when looking at the linguistic data (Kilgarriff 1998; Hanks 2000; Kilgarriff 2006). One major issue arises from the fact that there is a spectrum of word meaning phenomena (Tuggy 1993) from clear-cut cases of ambiguity where meanings are distinct and separable, to cases where meanings are intertwined (highly interrelated) (Cruse 2000; Kilgarriff 1998), to cases of vagueness at the other extreme where meanings are underspecified. For example, at the ambiguous end of the spectrum are words like *bank* (noun) with the distinct senses of financial institution and side of a river. In such cases, it is relatively straightforward to differentiate corpus examples and come up with clear definitions for a dictionary or other lexical resource.^{2} These clearly ambiguous words are commonplace in articles promoting WSD because the ambiguity is evident and the need to resolve it is compelling. On the other end of the spectrum are cases where meaning is unspecified (vague); for example, Tuggy notes that *aunt* can be father's sister or mother's sister. There may be no contextual evidence to determine the intended reading and this does not trouble hearers and should not trouble computers (the exact meaning can be left unspecified). Cases of polysemy are somewhere in between. Examples from Tuggy include the noun *set* (a chess set, a set in tennis, a set of dishes, and a set in logic) and the verb *break* (a stick, a law, a horse, water, ranks, a code, and a record), each having many connections between the related senses.
Although it is assumed in many cases that one meaning has spawned the other by a metaphorical process (Lakoff 1987)—for example, the mouth of a river from the mouth of a person—the process is not always transparent and neither is the point at which the spawned meaning takes an independent existence.

From the linguistics literature, it seems that the boundaries on this continuum are not clear-cut and tests aimed at distinguishing the different categories are not definitive (Cruse 2000). Meanwhile, in computational linguistics, researchers point to there being differences in distinguishing meanings with some words being much harder than others (Landes, Leacock, and Randee 1998), resulting in differences in inter-tagger agreement (Passonneau et al. 2010, 2012), issues in manually partitioning the semantic space (Chen and Palmer 2009), and difficulties in making alignments between lexical resources (Palmer, Dang, and Rosenzweig 2000; Eom, Dickinson, and Katz 2012). For example, OntoNotes is a project aimed at producing a sense inventory by iteratively grouping corpus instances into senses and then ensuring that these senses can be reliably distinguished by annotators to give an impressive 90% inter-annotator agreement (Hovy et al. 2006). Although the process is straightforward in many cases, for some lemmas this is not possible even after multiple re-partitionings (Chen and Palmer 2009).

Recent work on graded annotations (Erk, McCarthy, and Gaylord 2009, 2013) and graded word sense induction (Jurgens and Klapaftis 2013) has aimed to allow word sense annotations where it is assumed that more than one sense can apply and where the senses do not have to be equally applicable. In the graded annotation study, the annotators are assigned various tasks including two independent sense labeling tasks where they are given corpus instances of a target lemma and sense definitions (WordNet) and are asked to (1) find the most appropriate sense for the context and (2) assign a score out of 5 as to the applicability of every sense for that lemma. In graded word sense induction (Jurgens and Klapaftis 2013), computer systems and annotators preparing the gold standard have to assign tokens in context to clusters (WordNet senses) but each token is assigned to as many senses as deemed appropriate and with a graded level of applicability on a Likert scale (1–5). This scenario allows for overlapping sense assignments and sense clusters, which is a more natural fit for lemmas with related senses, but inter-annotator agreement is highly variable depending on the lemma, varying between 0.903 and 0.0 on Krippendorff's α (Krippendorff 1980). This concurs with the variation seen in other annotation efforts, such as the MASC word sense corpus (Passonneau et al. 2012). Erk, McCarthy, and Gaylord (2009) demonstrated that annotators produced more categorical decisions (5 - identical vs. 1 - completely different) for some words and more mid-range decisions (4 - very similar, 3 - similar, 2 - mostly different) for others. This is not solely due to granularity. In a later article (Erk, McCarthy, and Gaylord 2013), the authors demonstrated that when coarse-grained inventories are used, there are some words where, unsurprisingly, usages in the same coarse senses tend to have higher similarity than those in different coarse senses, but for some lemmas, the reverse happens.
Although graded annotations (Erk, McCarthy, and Gaylord 2009, 2013) and soft clusterings (Jurgens and Klapaftis 2013) allow for representing subtler relationships between senses, not all words necessitate such a complicated framework. This article is aimed at finding metrics that can measure how difficult a word's meanings are to partition.

### 2.2 Alternative Word Meaning Characterizations

Several groups have proposed alternative characterizations of word meaning that do not rely on a partition of instances into senses. We use three of these approaches in the current article: two to provide instance annotations that we use as the basis for clustering and one to provide a gold standard indication of partitionability. Crucially, these three data sets are all produced by adding annotations to samples taken from the same set of sentences used for the English lexical substitution task (McCarthy and Navigli 2007), hereafter **lexsub**. Ten sentences for the target lemma *post.n*^{3} are shown in Table 1, with the corresponding sentence ids (s#) in the lexsub data set and the target token underlined.

In lexsub, human annotators saw a target lemma in a given sentence context and were asked to provide one or more substitutes for the lemma in that context. There were 10 instances for each lemma, and the lemmas were manually selected by the task organizers. The cross-lingual lexical substitution task (Mihalcea, Sinha, and McCarthy 2010) (**clls**) is similar, except that whereas in lexsub both the original sentence and the substitutes were in English, clls used Spanish substitutes. For both tasks, multiple annotators provided substitutes for each target instance. Table 2 shows the English substitutes from lexsub alongside the Spanish substitutes from clls for the sentences for *post.n* displayed in Table 1.

In the Usim annotation (Erk, McCarthy, and Gaylord 2009, 2013), annotators saw a pair of sentences at a time that both contained an instance of the same target word. Annotators then provided a graded judgment on a scale of 1–5 of how similar the usage of the target lemma was in the two sentences. Multiple annotators rated each sentence pair. Table 3 shows the average judgments for the *post.n* example between each pair of sentence ids in Table 1.^{4}

The three data sets overlap in the sentences that they cover: Both Usim and clls are drawn from a subset of the data from lexsub.^{5} The overlap between all three data sets is 45 lemmas each in the context of ten sentences.^{6} In this article we only use data from this common subset as it provides us with a gold-standard (Usim) and two different representations of the instances (lexsub and clls substitutes). The 45 lemmas in this subset include 14 nouns, 14 adjectives, 15 verbs, and 2 adverbs.^{7}

In our experiments herein, we use the Usim data as a gold standard of how difficult it is to partition the usages of a lemma. We use both lexsub and clls independently as the basis for intra-clust clusterability experiments. We compare clusterings based on lexsub and clls for the inter-clust clusterability experiments.

## 3. Measuring Clusterability

We present two main approaches to estimating clusterability of word usages using the translation and paraphrase data from clls and lexsub. Firstly, we estimate clusterability using intra-clust measures from machine learning. Secondly, our inter-clust method uses clustering evaluation metrics to compare agreement between two clusterings obtained from clls and lexsub based on the intuition that less clusterable lemmas will have lower congruence between solutions from the two data sets (which provide different views of the same underlying data).

### 3.1 Intra-Clustering Clusterability Measures

The notion of the general clusterability of a data set (as opposed to the goodness of any particular clustering) is explored within the field of machine learning by Ackerman and Ben-David (2009a). Consider for example the plots in Figure 1, where the data points on the left should be more clusterable than those on the right because the partitions are easier to make. All the notions of clusterability that Ackerman and Ben-David consider are based on *k*-means and involve optimum clusterings for a fixed *k*.

*k*-means clustering. Let *X* be a set of data points, then a *k*-means *k*-clustering of *X* is a partitioning of *X* into *k* sets. We write *C* = {*X*_{1}, … , *X*_{k}} for a *k*-clustering of *X*, with $\bigcup_{i=1}^{k} X_i = X$. The *k*-means loss function for a *k*-clustering *C* is the sum of squared distances of all data points from the centroid of their cluster,

$$\mathcal{L}(C) = \sum_{i=1}^{k} \sum_{x \in X_i} \lVert x - \mathrm{centroid}(X_i) \rVert^2,$$

where the centroid or center mass of a set *Y* of points is

$$\mathrm{centroid}(Y) = \frac{1}{|Y|} \sum_{y \in Y} y.$$

A “*k*-means optimal *k*-clustering” of the set *X* is a *k*-clustering of *X* that has the minimal *k*-means loss of all *k*-clusterings of *X*. There may be multiple such clusterings.
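As an illustration of the loss just defined, the following is a minimal sketch that computes the centroid of a set of points and the *k*-means loss of a given clustering. The function names and the toy points are ours and purely illustrative; they are not taken from the data sets used in this article.

```python
from typing import List, Sequence

Point = Sequence[float]


def centroid(points: List[Point]) -> List[float]:
    """Center of mass: coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def sq_dist(x: Point, y: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def kmeans_loss(clustering: List[List[Point]]) -> float:
    """Sum of squared distances of every point from the centroid of its cluster."""
    loss = 0.0
    for cluster in clustering:
        center = centroid(cluster)
        loss += sum(sq_dist(x, center) for x in cluster)
    return loss


# Invented toy data: two tight clusters that lie far apart.
toy_clustering = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
print(kmeans_loss(toy_clustering))
```

Each toy cluster has its two points at squared distance 1 from the cluster centroid, so the loss of this clustering is 4.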

**Variance ratio** (vr), introduced by Zhang (2001). Its underlying intuition is that in a good clustering, points should be close to the centroid of their cluster, and clusters should be far apart. For a set *Y* of points,

$$\sigma^2(Y) = \frac{1}{|Y|} \sum_{y \in Y} \lVert y - \mathrm{centroid}(Y) \rVert^2$$

is the variance of *Y*. For a *k*-clustering *C* of *X*, we write $p_i = |X_i| / |X|$, and define within-cluster variance *W*(*C*) and between-cluster variance *B*(*C*) of *C* as follows:

$$W(C) = \sum_{i=1}^{k} p_i \, \sigma^2(X_i), \qquad B(C) = \sum_{i=1}^{k} p_i \, \lVert \mathrm{centroid}(X_i) - \mathrm{centroid}(X) \rVert^2.$$

Then the variance ratio of the data set *X* for the number *k* of clusters is

$$\mathrm{VR}_k(X) = \max_{C \in \mathcal{C}_k} \frac{B(C)}{W(C)},$$

where $\mathcal{C}_k$ is the set of *k*-means optimal *k*-clusterings of *X*. A higher variance ratio indicates better clusterability because variance ratio rises as the distance between clusters increases (*B*(*C*)) and the distance within clusters decreases (*W*(*C*)).
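The ratio of between-cluster to within-cluster variance can be sketched as follows for one given clustering. This is a minimal illustration under our own assumptions: it evaluates a single clustering rather than maximizing over all *k*-means optimal clusterings, it weights each cluster by its share of the data points, and the function names and toy points are invented.

```python
from typing import List, Sequence

Point = Sequence[float]


def centroid(points: List[Point]) -> List[float]:
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def sq_dist(x: Point, y: Point) -> float:
    return sum((a - b) ** 2 for a, b in zip(x, y))


def variance(points: List[Point]) -> float:
    """Mean squared distance of the points from their centroid."""
    center = centroid(points)
    return sum(sq_dist(p, center) for p in points) / len(points)


def variance_ratio(clustering: List[List[Point]]) -> float:
    """B(C) / W(C) for one clustering; higher means more clusterable."""
    all_points = [p for cluster in clustering for p in cluster]
    n = len(all_points)
    grand = centroid(all_points)
    within = sum(len(c) / n * variance(c) for c in clustering)
    between = sum(len(c) / n * sq_dist(centroid(c), grand) for c in clustering)
    # Assumption of this sketch: zero within-cluster variance counts as infinitely clusterable.
    return float("inf") if within == 0 else between / within


# Invented toy data: the same layout with tight and with loose clusters.
tight = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
loose = [[(0.0, 0.0), (0.0, 6.0)], [(4.0, 0.0), (4.0, 6.0)]]
print(variance_ratio(tight), variance_ratio(loose))
```

The tight, well-separated toy clustering receives a much higher ratio than the loose one, which matches the intuition that the former data set is easier to cluster.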

**Worst pair ratio** (wpr) uses a similar intuition as variance ratio, in that it, too, considers a ratio of a within-cluster measure and a between-cluster measure. But it focuses on “worst pairs” (Epter, Krishnamoorthy, and Zaki 1999), the closest pair of points that are in different clusters, and the most distant points that are in the same cluster. For two data points *x*, *y* ∈ *X* and a *k*-clustering *C* of *X*, we write *x* ∼_{C} *y* if *x* and *y* are in the same cluster of *C*, and $x \not\sim_C y$ otherwise. Then the split of *C* is the minimum distance of two data points in different clusters, and the width of *C* is the maximum distance of two data points in the same cluster:

$$\mathrm{split}_C(X) = \min_{x, y \in X,\; x \not\sim_C y} \lVert x - y \rVert, \qquad \mathrm{width}_C(X) = \max_{x, y \in X,\; x \sim_C y} \lVert x - y \rVert.$$

We use the variant of worst pair ratio given by Ackerman and Ben-David (2009b), as their definition is analogous to variance ratio:

$$\mathrm{WPR}_k(X) = \max_{C \in \mathcal{C}_k} \frac{\mathrm{split}_C(X)}{\mathrm{width}_C(X)},$$

where $\mathcal{C}_k$ is the set of *k*-means optimal *k*-clusterings of *X*. Worst pair ratio is similar to variance ratio but can be expected to be more affected by noise in the data, as it only looks at two pairs of data points while variance ratio averages over all data points.
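The split and width of a clustering, and their ratio, can be sketched as below. As before, this is an illustration for one given clustering only (no maximization over optimal clusterings); the representation of a clustering as a list of cluster labels, the handling of degenerate cases, and the toy data are assumptions of the sketch.

```python
from itertools import combinations
from math import dist, inf  # math.dist requires Python 3.8+
from typing import List, Sequence


def worst_pair_ratio(points: List[Sequence[float]], labels: List[int]) -> float:
    """split / width for one clustering, given one cluster label per point."""
    split = inf   # closest pair of points in different clusters
    width = 0.0   # most distant pair of points in the same cluster
    for i, j in combinations(range(len(points)), 2):
        d = dist(points[i], points[j])
        if labels[i] == labels[j]:
            width = max(width, d)
        else:
            split = min(split, d)
    # Assumption of this sketch: if no two distinct points share a cluster, return infinity.
    return inf if width == 0 else split / width


# Invented toy data: two pairs of nearby points, far from each other.
toy_points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
toy_labels = [0, 0, 1, 1]
print(worst_pair_ratio(toy_points, toy_labels))
```

Because only the two extreme pairs enter the ratio, moving a single point can change the value substantially, which is the sensitivity to noise mentioned above.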

**Separability** (sep), due to Ostrovsky et al. (2006). Its intuition is different from that of variance ratio and worst pair ratio: It measures the improvement in clustering (in terms of the *k*-means loss function) when we move from (*k* − 1) clusters to *k* clusters. We write

$$\mathrm{Opt}_k(X) = \min_{C\ k\text{-clustering of } X} \mathcal{L}(C)$$

for the *k*-means loss of a *k*-means optimal *k*-clustering of *X*. Then a data set *X* is (*k*, ε)-separable if Opt_{k}(*X*) ≤ ε Opt_{k−1}(*X*). Separability-based clusterability is defined by

$$\mathrm{Sep}_k(X) = \min \{ \varepsilon : X \text{ is } (k, \varepsilon)\text{-separable} \}.$$

Whereas for variance ratio and worst pair ratio higher values indicate better clusterability, the opposite is true for separability: Lower values of separability signal a larger drop in *k*-means loss when moving from (*k* − 1) to *k* clusters.^{8}

The clusterability measures that we describe here all rely on *k*-means optimal clusterings, as they were all designed to prove properties of clusterings in the area of clustering theory. To use them to test clusterability of concrete data sets in practice, we use an external measure to determine *k* (described in Section 4.3), and we approximate *k*-means optimality by performing many clusterings of the same data set with different random starting points, and using the clustering with minimal *k*-means loss.
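The approximation by random restarts can be sketched as follows, using separability as the example measure. The sketch assumes a plain Lloyd-style *k*-means iteration, which is our choice for illustration and not necessarily the implementation used in the experiments; the number of restarts, the seed, and the toy data are likewise arbitrary. It also uses the fact that, when the loss for (*k* − 1) clusters is positive, the smallest ε satisfying the separability condition is the ratio of the two optimal losses.

```python
import random
from typing import List, Sequence, Tuple

Point = Sequence[float]


def sq_dist(x: Point, y: Point) -> float:
    return sum((a - b) ** 2 for a, b in zip(x, y))


def centroid(points: List[Point]) -> Tuple[float, ...]:
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def lloyd_loss(points: List[Point], k: int, rng: random.Random, max_iter: int = 100) -> float:
    """One k-means run from random starting centers; returns its final loss."""
    centers = list(rng.sample(points, k))
    assignment: List[int] = []
    for _ in range(max_iter):
        new_assignment = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = centroid(members)
    return sum(sq_dist(p, centers[a]) for p, a in zip(points, assignment))


def approx_opt(points: List[Point], k: int, restarts: int = 50, seed: int = 0) -> float:
    """Approximate Opt_k(X): the minimal loss over many random restarts."""
    rng = random.Random(seed)
    return min(lloyd_loss(points, k, rng) for _ in range(restarts))


def separability(points: List[Point], k: int, restarts: int = 50, seed: int = 0) -> float:
    """Approximate Opt_k(X) / Opt_{k-1}(X); requires k >= 2 and a non-zero denominator."""
    return approx_opt(points, k, restarts, seed) / approx_opt(points, k - 1, restarts, seed)


# Invented toy data with two obvious clusters.
toy_points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
print(separability(toy_points, 2))
```

A single run can end in a poor local optimum (on the toy data, a top/bottom split instead of the left/right one), which is why the minimum over many restarts is taken.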

### 3.2 Inter-Clustering Clusterability Measures

If the instances of a lemma are highly clusterable, then an instance clustering derived from monolingual paraphrase substitutes and a second clustering of the same instances derived from translation substitutes should be relatively similar. We compare two clustering solutions using the SemEval 2010 WSI task (Manandhar et al. 2010) measures: **V-measure** (*V*) (Rosenberg and Hirschberg 2007) and **paired F score** (*pF*) (Artiles, Amigó, and Gonzalo 2009).

*V* is the harmonic mean of homogeneity and completeness. Homogeneity refers to the degree that each cluster consists of data points primarily belonging to a single gold-standard class, and completeness refers to the degree that each gold-standard class consists of data points primarily assigned to a single cluster. The *V* measure is noted to depend on both entropy and number of clusters: Systems that provide more clusters do better. For this reason, Manandhar et al. (2010) also used the paired F score (*pF*), which is the harmonic mean of precision and recall. Precision is the number of common instance pairs between clustering solution and gold-standard classes divided by the number of pairs in the clustering solution, and recall is the same numerator but divided by the total number of pairs in the gold-standard. *pF* penalizes a difference in number of clusters to the gold-standard in either direction.^{9}
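The paired F score can be sketched directly from the definition of precision and recall over instance pairs. The function names and the toy label lists are illustrative assumptions. Because swapping the two clusterings merely swaps precision and recall, the harmonic mean is unchanged, so the score can be used to compare two clusterings without treating either as the gold standard.

```python
from itertools import combinations
from typing import List, Set, Tuple


def same_cluster_pairs(labels: List[int]) -> Set[Tuple[int, int]]:
    """All pairs of instance indices that are placed in the same cluster."""
    return {(i, j) for i, j in combinations(range(len(labels)), 2) if labels[i] == labels[j]}


def paired_f(solution: List[int], gold: List[int]) -> float:
    """Harmonic mean of pairwise precision and recall between two clusterings."""
    sol_pairs = same_cluster_pairs(solution)
    gold_pairs = same_cluster_pairs(gold)
    common = len(sol_pairs & gold_pairs)
    if common == 0:
        return 0.0  # also covers clusterings made up of singletons only
    precision = common / len(sol_pairs)
    recall = common / len(gold_pairs)
    return 2 * precision * recall / (precision + recall)


# Invented labels for four instances under two different views.
view_a = [0, 0, 0, 1]
view_b = [0, 0, 1, 1]
print(paired_f(view_a, view_b))
```

In the toy example the two views share one same-cluster pair out of three pairs in the first view and two in the second, giving a precision of 1/3, a recall of 1/2, and a paired F score of 0.4.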

## 4. Experimental Design

In our experiments reported here, we test both intra-clust and inter-clust clusterability measures. All clusterability results are computed on the basis of lexsub and clls data. The clusterings that we use for the intra-clust measures are *k*-means clusterings. We use *k*-means because this is how these measures have been defined in the machine learning literature; as *k*-means is a widely used clustering, this is not an onerous restriction. The similarity between sentences used by *k*-means is defined in Section 4.2. The *k*-means method needs the number *k* of clusters as input; we determine this number for each lemma by a simple graph-partitioning method that groups all instances that have a minimum number of substitutes in common (Section 4.3). The graph-partitioning method is also used for the inter-clust approach, since it provides the simplest partitioning of the data and determines the number of partitions (clusters) automatically.

In addition to the intra-clust and inter-clust clusterability measures, we test a baseline measure based on degree of overlap in an overlapping clustering (Section 4.4).

We compare the clusterability ratings to two gold standard partitionability estimates, both of which are derived from Usim (Section 4.1). We perform two experiments to measure how well clusterability tracks partitionability (Section 4.6).

### 4.1 The Gold Standard: Estimating Partitionability from Usim

We derive two gold standard estimates of partitionability from the Usim data. First, we model partitionability through inter-tagger agreement on Usim (**Uiaa**): Uiaa is the inter-tagger agreement for a given lemma taken as the average pairwise Spearman's correlation between the ranked judgments of the annotators. Second, we model partitionability through the proportion of mid-range judgments over all instances for a lemma and all annotators (**Umid**). We follow McCarthy (2011) in calculating Umid as follows. Mid-range judgments are between 2 and 4, that is not 1 (completely different usages) and not 5 (the same usage). Let *a* ∈ *A* be an annotator from the set *A* of all annotators, and *j*_{a} ∈ *P*_{l} be the judgment of annotator *a* for a sentence pair for a lemma from all possible such pairings for that lemma (*P*_{l}). Then the Umid score for that lemma is calculated as

$$\mathrm{Umid} = \frac{\sum_{a \in A} \sum_{j_a \in P_l} \mathbb{1}[\, j_a \in \{2, 3, 4\} \,]}{|P_l| \cdot |A|},$$

where the indicator $\mathbb{1}[\cdot]$ is 1 if the judgment is mid-range and 0 otherwise.
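The Umid computation amounts to counting mid-range ratings and dividing by the total number of ratings. The sketch below assumes that the judgments are available as one list of ratings per annotator, with every annotator rating every sentence pair; this data layout, the annotator names, and the ratings are invented for illustration.

```python
from typing import Dict, List


def umid(judgments: Dict[str, List[int]]) -> float:
    """Proportion of mid-range ratings (2, 3, or 4) over all annotators and sentence pairs."""
    total = sum(len(ratings) for ratings in judgments.values())
    mid = sum(1 for ratings in judgments.values() for j in ratings if 2 <= j <= 4)
    return mid / total


# Invented ratings on the 1-5 scale: two annotators, four sentence pairs each.
toy_judgments = {
    "annotator_1": [1, 3, 5, 4],
    "annotator_2": [2, 2, 5, 1],
}
print(umid(toy_judgments))
```

Here four of the eight ratings are mid-range, so the toy lemma receives a Umid of 0.5; a lemma whose ratings are all 1 or 5 would receive 0.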

Umid is a more direct indication of partitionability than Uiaa in that one might have high values of inter-tagger agreement where annotators all agree on mid-range scores. Uiaa is useful as it demonstrates clearly that these measures can indicate “tricky” lemmas that might prove problematic for human annotators and computational linguistic systems.

### 4.2 Similarity of Sentences Through lexsub and clls for *k*-Means Clustering

In the lexsub data, each instance of a target lemma, for example *post.n*, is turned into a vector as follows. Each possible lexsub substitute for *post.n* over all its ten instances becomes a dimension. For a given sentence, for example sentence 701 in Table 2, the value for dimension *t* is the number of times *t* was named as a substitute for sentence 701. So the vector for sentence 701 has an entry of 3 in the dimension *position*, an entry of 2 in the dimension *job*, and a value of 1 in the dimension *role*, and zero in all other dimensions, and analogously for the other instances. The clls data is turned into one vector per instance in the same way. This results in vectors of the same dimensionality for all instances of the same lemma, though the instances of different lemmas can be in different spaces (which does not matter, as they will never be compared). The distance (*d*_{vec}) between two instances *s*, *s*′ of the same lemma ℓ is calculated as the Euclidean distance between their vectors. If there are *n* substitutes overall for ℓ across all its instances, then the distance of *s* and *s*′ is

$$d_{vec}(s, s') = \sqrt{\sum_{i=1}^{n} (s_i - s'_i)^2},$$

where $s_i$ and $s'_i$ are the values of the vectors for *s* and *s*′ in dimension *i*.
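The construction of the instance vectors and the Euclidean distance between them can be sketched as follows. The counts for sentence 701 follow the description above; the second instance, its identifier, and the function names are invented for illustration and do not come from lexsub.

```python
from collections import Counter
from math import sqrt
from typing import Dict, List, Tuple


def build_vectors(substitutes: Dict[str, List[str]]) -> Tuple[List[str], Dict[str, List[int]]]:
    """Turn each instance's substitute tokens (with repeats) into a count vector."""
    # One dimension per substitute type seen for the lemma across all its instances.
    dims = sorted({t for subs in substitutes.values() for t in subs})
    vectors = {}
    for sentence_id, subs in substitutes.items():
        counts = Counter(subs)
        vectors[sentence_id] = [counts[t] for t in dims]
    return dims, vectors


def d_vec(u: List[int], v: List[int]) -> float:
    """Euclidean distance between two instance vectors of the same lemma."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


toy_substitutes = {
    "701": ["position"] * 3 + ["job"] * 2 + ["role"],  # counts as described for sentence 701
    "hypothetical": ["job", "position"],               # invented second instance
}
dims, vectors = build_vectors(toy_substitutes)
print(dims, vectors["701"], d_vec(vectors["701"], vectors["hypothetical"]))
```

Sorting the substitute types only fixes an arbitrary but reproducible order of dimensions; the distance does not depend on that order.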

### 4.3 Graphical Partitioning

This subsection describes the method that we use for determining the number of clusters (*k*) for a given lemma needed by the intra-clust approach described in Section 3.1, and for providing data partitions for the inter-clust measure of clusterability described in Section 3.2. We adopt a simple graph-based approach to partitioning word usages according to their distance, following Di Marco and Navigli (2013). Traditionally, graph-based WSI algorithms reveal a word's senses by partitioning a co-occurrence graph built from its contexts into vertex sets that group semantically related words (Véronis 2004). In these experiments we build graphs for the lexsub and clls target lemmas and partition them based on the distance of the instances, reflected in the substitute annotations. Although the graphical approach is straightforward and representative of the sort of WSI methods used in our field, the exact graph partitioning method is not being evaluated here. Other graph partitioning or clustering algorithms could equally be used.

For a given lemma *l*, we build two undirected graphs using the lexsub and clls substitutes for *l*. An instance of *l* is identified by a sentence id (s#) and is represented by a vertex in the graph. Each instance is associated with a set of substitutes (from either lexsub or clls) as shown in Table 2 for the noun *post*. Two vertices are linked by an edge if their distance is found to be low enough.

The distance of two vertices is estimated based on the overlap of their substitute sets. As the number of substitutes in each set varies, we use the size of the whole sets along with the size of the intersection for calculating the distance. Let *s* be an instance (sentence) from a data set (lexsub or clls) and *T* be the set of substitute types^{10} provided for that instance in lexsub or clls. The distance (*d*_{node}) between two instances (nodes) *s* and *s*′ with substitute sets *T* and *T*′ corresponds to the number of moves necessary to convert *T* into *T*′. We use the metric proposed by Goldberg, Hayvanovych, and Magdon-Ismail (2010), which considers the elements that are shared by, and are unique to, each of the sets.

We consider two instances as similar enough to be linked by an edge if their intersection is not empty (i.e., they have at least one common substitute) and their distance is below a threshold. After observation of the distance results for different lemmas, the threshold was defined to be equal to 7.^{11} A pair of instances with a distance below the threshold is linked by an edge in the graph. For example, instances 705 and 706 of *post.n* are linked in the graph built from the lexsub data (cf. Table 2) because their intersection is not empty (they share […] **comp**). As the comps do not share any instances, they correspond to a hard (non-overlapping) clustering solution over the set of instances. Two instances belong to the same component if there is a path between their vertices. The top part of Table 4 displays the comps obtained for *post.n* from the lexsub data. The 10 instances of the lemma in Table 2 are grouped into four comps. Instances 705 and 706 that were linked in the graph are found in the same connected component. On the contrary, 710 shares no substitutes with any other instance as shown in Table 2, and, as a consequence, does not satisfy either the intersection or the distance criterion. Instance 710 is thus isolated as it is linked to no other instances, and forms a separate component.
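The graph construction and the partition into comps can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments. It assumes that the distance is the size of the symmetric difference of the two substitute sets (one reading of "the number of moves necessary to convert *T* into *T*′"), and the instance ids and substitutes in `toy` are hypothetical, not the content of Table 2.

```python
from itertools import combinations

THRESHOLD = 7  # distance threshold reported in the text


def node_distance(t1, t2):
    """Moves needed to turn set t1 into t2.

    Assumption: this equals |T| + |T'| - 2|T & T'| (symmetric difference size).
    """
    return len(t1) + len(t2) - 2 * len(t1 & t2)


def build_graph(substitutes, threshold=THRESHOLD):
    """Adjacency dict linking instances that share at least one substitute
    and whose distance is below the threshold."""
    graph = {s: set() for s in substitutes}
    for a, b in combinations(substitutes, 2):
        ta, tb = substitutes[a], substitutes[b]
        if ta & tb and node_distance(ta, tb) < threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Hard partition of the instances: one set per connected component."""
    seen, comps = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(graph[node] - comp)
        seen |= comp
        comps.append(comp)
    return comps


# Hypothetical toy data: instance id -> set of substitute types.
toy = {
    701: {"mail", "letter"},
    702: {"mail", "email"},
    703: {"job", "position"},
    704: {"position", "role"},
    705: {"pole"},
}
comps = connected_components(build_graph(toy))
```

In this toy example the instance with no shared substitute ends up alone in its own component, mirroring the behavior described for instance 710.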

Figure 2 shows the frequency distribution of lemmas over number of comps.

### 4.4 A Baseline Measure Based on Cluster Overlap

Our proposed clusterability measures (both intra- and inter-clust) are applicable to hard clusterings. WSI in computational linguistics has traditionally focused on a hard partition of usages into senses but there have been recent attempts to allow for graded annotation (Erk, McCarthy, and Gaylord 2009, 2013) and soft clustering (Jurgens and Klapaftis 2013). We wanted to see how well the extent of overlap between clusters might be used as a measure of clusterability because this information is present for any soft clustering. If this simple criterion worked well, it would avoid the need for an independent measure of clusterability. If the amount of overlap is an indicator of clusterability then soft clustering can be applied and lemmas with clear-cut sense distinctions will be identified as having little or no overlap between clusters, as depicted in Figure 3.

For this baseline, we measure overlap from a second set of node groupings of the graphs described in Section 4.3, where an instance can fall into more than one of the groups. We refer to this soft grouping solution as **cliques**. A clique consists of a maximal set of nodes that are pairwise adjacent.^{12} They are typically finer grained than the comps because there may be vertices in a component that have a path between them without being adjacent.^{13}

The lower part of Table 4 contains the cliques obtained for *post.n* in lexsub. The two solutions, comps and cliques, presented for the lemma in this table are very similar except that there is a further distinction in the cliques as the first cluster in the comps is subdivided between two different senses of *mail* (broadly speaking, the physical and electronic senses). Note that these two cliques overlap and share instance 705.

Let *C*_{s} be the set of partitions (cliques) to which a sentence *s* from the sentences for a given lemma (*S*_{l}) is automatically assigned. Then

$$nc_s(l) = \frac{\sum_{s \in S_l} |C_s|}{|S_l|}$$

measures the average number of cliques to which the sentences for a given lemma are assigned. We assume that lemmas that are less easy to partition will have higher values of *nc*_{s} compared with lemmas with a similar number of clusters over all sentences but with lower values of *nc*_{s}.
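The overlap baseline (the average number of cliques per sentence) can be illustrated with a short sketch. It enumerates maximal cliques with the basic Bron-Kerbosch recursion, which is one standard way to do this and not necessarily the procedure used for the reported figures. The function names and the toy graph are hypothetical.

```python
def maximal_cliques(graph):
    """Enumerate maximal cliques of an undirected graph given as an
    adjacency dict (node -> set of neighbours, no self-loops)."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates, x: already processed nodes
        if not p and not x:
            cliques.append(r)
            return
        for v in list(p):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(graph), set())
    return cliques


def nc_s(graph):
    """Average number of maximal cliques that each instance belongs to."""
    cliques = maximal_cliques(graph)
    counts = [sum(1 for c in cliques if s in c) for s in graph]
    return sum(counts) / len(counts)


# Hypothetical toy graph: "b" is adjacent to both "a" and "c",
# but "a" and "c" are not adjacent, so "b" falls into two cliques.
path_graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
overlap = nc_s(path_graph)
```

A graph whose components are all complete (every pair in a component is adjacent) gives a value of exactly 1, since the cliques then coincide with the comps.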

### 4.5 Experimental Design Overview

In Figure 4 we give an overview of the whole processing pipeline, from the input data to the clusterability estimation. The graphs built for each lemma from the lexsub and clls data are partitioned twice, creating comps and cliques. The comps serve to define the *k* per lemma needed by the intra-clust clusterability metrics (vr, sep, wpr). The inter-clust metrics (*V* and *pF*) compare the two sets of comps created for a lemma from the lexsub and clls data. The overlaps present in the cliques are exploited by the baseline metric (*nc*_{s}).
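To make the inter-clust step of the pipeline concrete, the sketch below compares two hard clusterings of the same instances with the V-measure and the paired F-score. It follows the commonly used definitions of these two metrics. The definitions adopted in this article are given in Section 3.2 and may differ in detail, so the code should be read as an illustration only. The label lists are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log


def _entropy(counts, n):
    return -sum(c / n * log(c / n) for c in counts if c)


def _conditional_entropy(joint, given_totals, n):
    """H(X | Y) from joint counts keyed by (x, y) and the totals of y."""
    return -sum(c / n * log(c / given_totals[y]) for (_, y), c in joint.items())


def v_measure(labels_a, labels_b):
    """Harmonic mean of homogeneity and completeness of b with respect to a."""
    n = len(labels_a)
    tot_a, tot_b = Counter(labels_a), Counter(labels_b)
    h_a, h_b = _entropy(tot_a.values(), n), _entropy(tot_b.values(), n)
    h_a_given_b = _conditional_entropy(Counter(zip(labels_a, labels_b)), tot_b, n)
    h_b_given_a = _conditional_entropy(Counter(zip(labels_b, labels_a)), tot_a, n)
    homogeneity = 1.0 if h_a == 0 else 1.0 - h_a_given_b / h_a
    completeness = 1.0 if h_b == 0 else 1.0 - h_b_given_a / h_b
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


def same_cluster_pairs(labels):
    """Index pairs of instances that share a cluster label."""
    return {(i, j) for i, j in combinations(range(len(labels)), 2)
            if labels[i] == labels[j]}


def paired_f_score(labels_a, labels_b):
    """F-score over instance pairs grouped together in each clustering."""
    pairs_a, pairs_b = same_cluster_pairs(labels_a), same_cluster_pairs(labels_b)
    if not pairs_a or not pairs_b:
        return 0.0
    shared = len(pairs_a & pairs_b)
    precision, recall = shared / len(pairs_b), shared / len(pairs_a)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical comps labels for six instances under two views of the data.
view_one = [0, 0, 0, 1, 1, 1]
view_two = [0, 0, 1, 1, 2, 2]
v_score = v_measure(view_one, view_two)
pf_score = paired_f_score(view_one, view_two)
```

Two identical partitions score 1 on both measures, which is the situation reported for *fire.v* in Table 12.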

### 4.6 Evaluation

Table 5 provides a summary of the two gold standard partitionability estimates and the two types of clusterability measures, along with the baseline clusterability measure that we test. The partitionability estimates and the clusterability measures vary in their directions: In some cases, high values denote high partitionability; in other cases high values indicate low partitionability. Because wpr and vr are predicted to have high values for more clusterable lemmas and sep has low values, we expect wpr and vr to positively correlate with Uiaa and negatively with Umid and the direction of correlation to be reversed for sep. Our clustering evaluation metrics (*V* and *pF*) should provide correlations with the gold standards in the same direction as wpr and vr since a high congruence between the two solutions for a lemma from different annotations of the same sentences should be indicative of higher clusterability and consequently higher values of Uiaa and lower values of Umid. As regards the baseline approach based on cluster overlap, because we assume that lemmas that are less easy to partition will have higher values of *nc*_{s}, high values of *nc*_{s} should be positively correlated with Umid and negatively correlated with Uiaa (like sep). Table 6 gives an overview of the expected directions.

Table 5: Summary of the gold partitionability estimates, the clusterability measures, and the baseline.

Category | Measures
---|---
Gold partitionability estimates | Umid: proportion of mid-range (2–4) instance similarity ratings for a lemma; Uiaa: inter-annotator agreement on the Usim data set (average pairwise Spearman)
Intra-clust clusterability measures | vr, wpr, sep, based on *k*-means clustering; *k* estimated as comps; clustering computed based on either lexsub or clls substitutes
Inter-clust clusterability measures | comparing comps partitioning of clls with comps partitioning of lexsub; comparison either through *V* or *pF*
Baseline | average number *nc*_{s} of cliques clusters, computed either from lexsub or clls data


Table 6: Expected directions of the measures for more partitionable lemmas.

Gold partitionability estimates | Clusterability measures
---|---
Umid: ↘ | vr: ↗
Uiaa: ↗ | wpr: ↗
 | sep: ↘
 | *V*: ↗
 | *pF*: ↗
 | *nc*_{s}: ↘


We perform two sets of experiments, which differ in the way in which we control for polysemy. Partitionability estimates as well as clusterability predictions can be expected to be influenced by polysemy. Polysemy has an influence on inter-annotator agreement in that agreement is lower with higher attested polysemy (Passonneau et al. 2010). The number of clusters also influences all our measures of clusterability. Manandhar et al. (2010) note that *V* and *pF* are influenced by polysemy. Also, all intra-clust clusterability measures are influenced by *k*. Variance ratio and worst pair ratio both improve monotonically with *k* because the distance of points from the center mass of their cluster decreases as the number of clusters rises (this affects the within-cluster variance *W*(*C*) and width(*C*)). Separability is always lowest for *k* = *n* (number of data points), and almost always second-lowest for *k* = *n* − 1.

The first set of experiments measures correlation using Spearman's ρ between a ranking of partitionability estimates and a ranking of clusterability predictions. We do not perform correlation across all lemmas but control for polysemy by grouping lemmas into polysemy bands, and performing correlations only on lemmas with a polysemy within the bounds of the same band. Let *k* be the number of clusters for lemma *l*, which is the number of comps for all clusterability metrics other than *nc*_{s}, and the number of cliques for *nc*_{s}. For the cluster congruence metrics (*V* and *pF*), we take the average number of clusters for a lemma in both lexsub and clls.^{14} Then we define three polysemy bands:

- low: 2 ≤ *k* < 4.3
- mid: 4.3 ≤ *k* < 6.6
- high: 6.6 ≤ *k* < 9

Note that none of the intra-clust clusterability measures are applicable for *k* = 1, so in cases where the number of comps is one, the lemma is excluded from analysis. In these cases the clustering algorithm itself decides that the instances are not easy to partition.

The second set of experiments performs linear regression to link partitionability to clusterability, using the degree of polysemy *k* as an additional independent variable. As we expect polysemy to interfere with all clusterability measures, we are interested not so much in polysemy as a separate variable but in the interaction polysemy × clusterability. This lets us test experimentally whether our prediction that polysemy influences clusterability is borne out in the data. As the second set of experiments does not break the lemmas into polysemy bands, we have a single, larger set of data points undergoing analysis, which gives us a stronger basis for assessing significance.

## 5. Experiments

In this section we provide our main results evaluating the various clusterability measures against our gold-standard estimates. Section 5.1 discusses the evaluation via correlation with Spearman's ρ. In Section 5.2 we present the regression experiments. In Section 5.3 we provide examples and lemma rankings by two of our best performing metrics.

### 5.1 Correlation of Clusterability Measures Using Spearman's ρ

We calculated Spearman's correlation coefficient (ρ) for both gold standards (Uiaa and Umid) against all clusterability measures: intra-clust (vr, wpr, and sep), inter-clust (*V* and *pF*), and the baseline *nc*_{s}. For all these measures except the inter-clust, we calculate ρ using lexsub and clls separately as our clusterability measure input. The inter-clust measures rely on two views of the data so we use lexsub and clls together as input. We calculate the correlation for lemmas in the polysemy bands (low, mid, and high, as described above in Section 4.6) subject to the constraint that there are at least five lemmas within the polysemy range for that band. We provide the details of all trials in Appendix A and report the main findings here.
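The banded correlation procedure can be sketched as below. The band boundaries and the minimum of five lemmas per band come from the text. The rank-based implementation of Spearman's ρ (Pearson correlation of average ranks) is a standard formulation written out here for self-containment, and the lemma tuples are hypothetical values, not data from the experiments.

```python
def polysemy_band(k):
    """Map a number of clusters k to a polysemy band, or None if out of range."""
    if 2 <= k < 4.3:
        return "low"
    if 4.3 <= k < 6.6:
        return "mid"
    if 6.6 <= k < 9:
        return "high"
    return None


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            result[idx] = mean_rank
        i = j + 1
    return result


def spearman_rho(xs, ys):
    """Pearson correlation of the ranks (undefined if either input is constant)."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def banded_correlations(lemmas, min_lemmas=5):
    """lemmas: iterable of (k, clusterability, partitionability) tuples.
    Returns rho per band, skipping bands with fewer than min_lemmas lemmas."""
    by_band = {}
    for k, clust, part in lemmas:
        band = polysemy_band(k)
        if band is not None:
            by_band.setdefault(band, []).append((clust, part))
    return {band: spearman_rho(*zip(*pairs))
            for band, pairs in by_band.items() if len(pairs) >= min_lemmas}


# Hypothetical lemmas: five fall in the low band, two in the mid band.
toy_lemmas = [(2, 0.1, 1), (3, 0.2, 2), (3, 0.3, 3), (4, 0.4, 4),
              (2, 0.5, 5), (5, 0.9, 1), (6, 0.1, 2)]
per_band = banded_correlations(toy_lemmas)
```

Only the low band is reported for the toy data, because the mid band holds fewer than five lemmas. Significance testing is left out of the sketch.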

Table 7 shows the average Spearman's ρ over all trials for each clusterability measure. Although there are a few non-significant results from individual trials that are in the unanticipated direction (as discussed in the following paragraph), all average ρ are in the anticipated direction, specified in Table 6; sep and *nc*_{s} are positively correlated with Umid and negatively with Uiaa whereas for all other measures the direction of correlation is reversed. Some of the metrics show a promising level of correlation but the performance of the metrics varies. The baseline *nc*_{s} is particularly weak, highlighting that the amount of shared sentences in overlapping clusters is not a strong indication of clusterability. This is important because if this simple baseline had been a good indicator of clusterability, then a sensible approach to the phenomenon of partitionability of word meaning would be to simply soft cluster a word's instances and the extent of overlap would be a direct indication that the meanings are highly intertwined. wpr is also quite weak, which is not unexpected: It only considers the worst pair rather than all data points, as noted in Section 3.1. Both inter-clust measures (*pF* and *V*) have a stronger correlation with Uiaa than with Umid, whereas for the machine learning measures the reverse is true and the correlation is stronger for Umid. As mentioned in Section 4.1, Umid is a more direct gold-standard indicator of partitionability but Uiaa is useful as a gold standard as it indicates how problematic annotation will be for humans. The machine learning metric sep and our proposal for *pF* as an indication of clusterability provide the strongest average correlations, though the results for *pF* are less consistent over trials.^{15}

Table 7: Average Spearman's ρ over all trials for each clusterability measure.

measure type | measure | average ρ (Umid) | average ρ (Uiaa) | prop. ρ > 0.4* or ** (Umid) | prop. ρ > 0.4* or ** (Uiaa)
---|---|---|---|---|---
intra-clust | vr | −0.483 | 0.365 | 2/3 | 2/3
intra-clust | sep | 0.569 | −0.390 | 2/3 | 1/3
intra-clust | wpr | −0.322 | 0.210 | 1/3 | 0/3
inter-clust | *pF* | −0.318 | 0.540 | 0/2 | 1/2
inter-clust | *V* | −0.123 | 0.493 | 0/2 | 0/2
baseline | *nc*_{s} | 0.053 | −0.164 | 0/6 | 1/6


Because we are controlling for polysemy, there is less data (lemmas) for each correlation measurement so many individual trials do not give significant results, but all significant correlations are in the anticipated direction. The final two columns of Table 7 show the proportion of cases that are significant at the 0.05 level or above and have ρ > 0.4^{16} in the anticipated direction out of all individual trials meeting the constraint of five or more lemmas in the respective polysemy band for lexsub or clls input data. We are limited by the available gold-standard data and need to control for polysemy. So there are several results with a promising ρ which, however, are not significant, such that they are scored negatively in this more stringent summary. Nevertheless, from this summary of the results we can see that the machine learning metrics, particularly vr (which has a higher proportion of successful trials) and sep (which has the highest average correlations) are most consistent in indicating partitionability using either gold-standard estimate (Umid or Uiaa) with vr achieving 66.7% success (2 out of 3 trials for each gold-standard ranking). wpr is less promising for the reasons stated above. Although there are some successful trials for the inter-clust approaches, the results are not consistent and only one trial showed a (highly) significant correlation. The baseline approach which measures cluster overlap has only one significant result in all 6 trials, but more worrisome for this measure is the fact that in 4 out of the 12 trials (2 each for Umid and Uiaa) the correlation was in the non-anticipated direction. In contrast there was only one result for wpr (on clls) in the non-anticipated direction and one result for *V* on the fence (ρ = 0) and all other individual results for the inter- and intra-clust measures were in the anticipated direction.

There were typically more lemmas in the intra-clust trials with lexsub compared to clls, as shown in Appendix A, because many lemmas in clls have only one component (see Figure 2) and are therefore excluded from the intra-clust clusterability estimation.^{17}

### 5.2 Linking Partitionability to Clusterability and Polysemy Through Regression

Our first round of experiments revealed some clear differences between approaches and implied good performance, particularly for the intra-clust measures vr and sep. In the first round of experiments, however, we separated lemmas into polysemy bands and this resulted in the set of lemmas involved in each individual correlation experiment being somewhat small. This makes it hard to obtain significant results. Even for the overall most successful measures, not all trials came out as significant. In this second round of experiments, we therefore change the set-up in a way that allows us to test on all lemmas in a single experiment, to see which clusterability measures will exhibit an overall significant ability to predict partitionability.

We use linear regression, an analysis closely related to correlation.^{18} The dependent variable to be predicted is a partitionability estimate, either Umid or Uiaa. We use two types of independent variables (predictors). The first is the clusterability measure, which we call **clust**. The second is the degree of polysemy, which we call **poly**. This way we can model an influence of polysemy on clusterability as an interaction of variables, and have all lemmas undergo analysis at the same time. This lets us obtain more reliable results: Previously, a non-significant result could indicate either a weak predictor or a data set that was too small after controlling for polysemy, but now the data set undergoing analysis is much bigger.^{19} Furthermore, this experiment demonstrates how clusterability and polysemy can be used together as predictors.

The variable clust reflects the clusterability predictions of each measure. We use the actual values, not their rank among the clusterability values for all lemmas. This way we can test the ability of our clusterability measures to predict partitionability for individual lemmas, while the rank is always relative to other lemmas that are being analyzed at the same time. The values of the variable clust are obviously different for each clusterability measure, but the values of poly also vary across clusterability measures: For all intra-clust measures poly is the number of comps. For the inter-clust measures, it is the average number of comps between the numbers computed from lexsub and from clls. For the *nc*_{s} baseline it is the number of cliques. In all cases, poly is the actual number of comps or cliques, not the polysemy band.

We test three different models in our linear regression experiment. The first model has poly as its sole predictor. It tests to what extent partitionability issues can be explained solely by a larger number of comps or cliques. Our hypothesis is that this simple model will not suffice. The second model has clust as its sole predictor, ignoring possible influences from polysemy. The third model uses the interaction poly × clust as a predictor (along with poly and clust as separate variables). Our hypothesis is that this third model should fare particularly well, given the influence of polysemy on clusterability measures that we derived theoretically above.^{20}

We evaluate the models in two ways. First, we use the F test. Given a model *M* predicting *Y* from predictors *X*_{1}, … , *X*_{m} as *Y* = β_{0} + β_{1}*X*_{1} + … + β_{m}*X*_{m}, it tests the null hypothesis that β_{0} = β_{1} = … = β_{m} = 0. That is, it tests whether *M* is statistically indistinguishable from a model with no predictors.^{21}

Second, we use the Akaike Information Criterion (AIC) to compare models. AIC tests how well a model will likely generalize (rather than overfit) by penalizing models with more predictors. AIC uses the log likelihood of the model under the data, corrected for model complexity computed as its number of predictors. Given again a model *M* predicting *Y* (in our case, either Umid or Uiaa) from *m* predictors, the AIC is

$$\mathrm{AIC} = -2 \log p(Y \mid M) + 2m$$

The lower the AIC value, the better the generalization of the model. The model preferred by AIC is the one that minimizes the Kullback-Leibler divergence between the model and the data. AIC allows us to compare all models that model the same data, that is, all models predicting Umid can be compared to each other, and likewise all models predicting Uiaa.

The number of data points in each model depends on the partitioning (as lemmas with *k* = 1 cannot enter into intra-clust clusterability analysis), which differs between clls and lexsub. AIC depends on the sample size (through *p*(*Y*|*M*)), so in order to be able to compare all models that model the same partitionability estimate, we compute AIC only on the subset of lemmas that enters in all analyses.^{22} In contrast, we compute the F test on all lemmas where the clusterability measure is valid,^{23} in order to use the largest possible set of lemmas to test the viability of a model.^{24}
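The three regression models and their comparison by AIC can be illustrated with a small self-contained sketch. It fits ordinary least squares through the normal equations and scores each model with the Gaussian form of AIC, *n* ln(RSS/*n*) + 2*k*, which drops an additive constant. Here *k* counts all fitted coefficients including the intercept, which is an assumption and may not match the parameter count behind the reported figures. All data values and function names are hypothetical.

```python
from math import log


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_ols(rows, y):
    """Least-squares coefficients (intercept first) and residual sum of squares."""
    design = [[1.0] + list(r) for r in rows]
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    beta = solve(xtx, xty)
    rss = sum((yi - sum(b * v for b, v in zip(beta, d))) ** 2
              for d, yi in zip(design, y))
    return beta, rss


def gaussian_aic(rss, n, n_params):
    """AIC up to an additive constant, assuming Gaussian errors and rss > 0."""
    return n * log(rss / n) + 2 * n_params


def three_models(poly, clust, y):
    """AIC for the poly, clust, and poly x clust models on the same lemmas."""
    specs = {
        "poly": [(p,) for p in poly],
        "clust": [(c,) for c in clust],
        "poly x clust": [(p, c, p * c) for p, c in zip(poly, clust)],
    }
    scores = {}
    for name, rows in specs.items():
        beta, rss = fit_ols(rows, y)
        scores[name] = gaussian_aic(rss, len(y), len(beta))
    return scores


# Hypothetical data in which the target depends on the interaction term.
poly = [2, 3, 4, 5, 6, 7, 2, 3]
clust = [0.1, 0.9, 0.3, 0.7, 0.5, 0.2, 0.8, 0.4]
noise = [0.01, -0.02, 0.015, -0.01, 0.02, -0.015, 0.005, -0.005]
target = [p * c + e for p, c, e in zip(poly, clust, noise)]
aic_scores = three_models(poly, clust, target)
best_model = min(aic_scores, key=aic_scores.get)
```

Because all three models are scored on the same set of lemmas, their AIC values are directly comparable, which is the reason given in the text for restricting AIC to the subset of lemmas that enters all analyses.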

Table 8 shows the results for models predicting Umid, and Table 9 shows the results for the prediction of Uiaa. The bolded figures are the best AIC values for each substitute set (clls, lexsub, both) where the corresponding F-tests reach significance.^{25}

Table 8: Regression results for models predicting Umid.

data | cl. measure | poly: F | poly: AIC | clust: F | clust: AIC | poly × clust: F | poly × clust: AIC
---|---|---|---|---|---|---|---
clls | vr | - | -24.9 | - | -25.7 | ** | -35.1
clls | sep | - | -24.9 | ** | -34.2 | * | -30.3
clls | wpr | - | -24.9 | - | -27.5 | - | -25.4
clls | *nc*_{s} | - | -30.6 | - | -25.8 | - | -26.8
lexsub | vr | - | -24.8 | * | -31.4 | *** | -34.5
lexsub | sep | - | -24.8 | ** | -32.6 | *** | -32.2
lexsub | wpr | - | -24.8 | *** | -34.0 | * | -30.2
lexsub | *nc*_{s} | ** | -26.5 | ** | -27.9 | * | -28.2
both | *pF* | - | -24.8 | - | -25.3 | - | -21.5
both | *V* | - | -24.8 | * | -28.0 | - | -25.7


Table 9: Regression results for models predicting Uiaa.

data | cl. measure | poly: F | poly: AIC | clust: F | clust: AIC | poly × clust: F | poly × clust: AIC
---|---|---|---|---|---|---|---
clls | vr | - | -20.2 | - | -21.2 | - | -20.7
clls | sep | - | -20.2 | - | -23.0 | - | -20.9
clls | wpr | - | -20.2 | - | -24.1 | - | -21.8
clls | *nc*_{s} | - | -20.8 | - | -20.3 | - | -19.3
lexsub | vr | - | -20.4 | - | -21.7 | * | -26.9
lexsub | sep | - | -20.4 | ** | -27.7 | * | -25.4
lexsub | wpr | - | -20.4 | * | -29.7 | - | -27.0
lexsub | *nc*_{s} | - | -22.7 | ** | -21.4 | ** | -24.8
both | *pF* | - | -20.0 | - | -22.1 | - | -18.8
both | *V* | - | -20.0 | - | -24.8 | - | -21.9


Confirming the results from our first round of experiments, we obtain the best results for sep and vr: The best AIC results in predicting Umid are reached by vr, while sep shows a particularly reliable performance. In predicting Umid, all sep models that use clust reach significance, and in predicting Uiaa, all sep models that use clust reach significance if they are based on lexsub substitutes. wpr reaches the best AIC values on predicting Uiaa, but on the F test, which takes into account more lemmas, its results are less often significant.

As in the first round of experiments, the performance of the two inter-clust measures is not as strong as that of the intra-clust measures. Here the inter-clust measures are in fact often comparable to the *nc*_{s} baseline. However, as clls seems to be harder to use as a basis than lexsub (we comment on this subsequently), the inter-clust measures may be hampered by problems with the clls data.

The baseline *nc*_{s} measure does not have as dismal a performance here as it did in the first round of experiments, but its performance is still worse throughout than that of the intra-clust measures. Interestingly, the poly variable that we use for *nc*_{s}, which is the absolute number of cliques for a lemma, is informative to some extent for Umid but not for Uiaa, and the clust variable is informative to some extent for Uiaa but not for Umid.

The regression experiments overall confirm the influence of polysemy on the clusterability measures. Although clusterability as a predictor on its own (the clust models) often reaches significance in predicting partitionability, taking polysemy into account (in the poly × clust models) often strengthens the model in predicting Umid and achieves the overall best results (the two bolded models); however for Uiaa the results are more ambivalent, where of the four clusterability measures that produce significant models, two improve when the interaction with polysemy is taken into account, and the two others do not. We also note that comps alone (the poly variable for the intra-clust models) never manages to predict partitionability in any way, for either Umid or Uiaa. In contrast, the number of cliques (the poly variable of the *nc*_{s} model) emerges as a predictor of Umid, though not of Uiaa.

In comparing Umid versus Uiaa, we see that Umid seems to be generally easier to predict, as it has more models with a significant F test.

Comparing the clls and lexsub substitutions, we see that the use of lexsub leads to much better predictions than clls. Most strikingly, in predicting Uiaa no model achieves significance using clls. We have commented on this issue before: The reason for this effect is that many lemmas in clls have only one component and are therefore excluded from the intra-clust clusterability estimation.

*Clusterability in practice.* As this round of experiments used the raw clusterability figures to predict partitionability, rather than their rank, it points the way to using clusterability in practice: Given a lemma, collect instance data (for example paraphrases, translations, or vectors). Estimate the number of clusters, for example using a graphical clustering approach. Then use a clusterability measure (sep or vr recommended) to determine its degree of clusterability, and use a regression classifier to predict a partitionability estimate. It may help to take the interaction of clust and poly into account. If the estimate is high, then a hard clustering is more likely to be appropriate, and sense tagging for training or testing should not be difficult. Where the estimate is low it is more likely that a more complex graded representation is needed, and in extreme cases clustering should be avoided altogether. Determining where the boundaries are would depend on the purpose of the lexical representation and is not addressed in this article. Our contribution is an approach to determine the relative location of lemmas on a continuum of partitionability.

### 5.3 Lemma Clusterability Rankings and Some Examples

Our clusterability metrics, in particular vr and sep, are useful for determining the partitionability of lemmas. In this section we show the rankings for these two metrics with our lemmas and provide a couple of more detailed examples with the lexsub and clls data.

In Table 10 we show the lemmas that have *k* > 1 when partitioned into comps using the lexsub substitutes, their respective gold standard Umid and Uiaa values, and the sep and vr values calculated for them on the basis of lexsub substitutes. The “L by Uiaa” and “L by Umid” columns display the lemmas reranked according to the two gold-standard estimates, and the “L by vr” and “L by sep” columns do likewise for the vr and sep clusterability measures. We have reversed the order of the ranking by Umid and sep because these measures are high when clusterability is low and vice versa. Lemmas with high partitionability should therefore be near the bottom of the table in columns 7–10 and lemmas with low partitionability should be near the top. There are differences and all rankings are influenced by polysemy, but we can see from this table that on the whole the metrics rank lemmas similarly to the gold-standard rankings with highly clusterable lemmas (such as *fire.v*) at the bottom of the table and less clusterable lemmas (such as *work.v*) nearer the top.

We now take a closer look at two example lemmas, *fire.v* and *solid.a*. Table 11 provides the comps from both the lexsub and the clls data. Both lemmas have a polysemy of 2 according to the comps clustering. *fire.v* is an example of a highly clusterable lemma whereas *solid.a* is a less-clusterable lemma. Table 12 shows the values for the clusterability measures. The intra-clust metrics are calculated for both lexsub and clls independently whereas the inter-clust metrics (*pF* and *V*) compare the two independent clustering solutions with each other. *fire.v* is more clusterable as can be seen by the clusters over the lexsub and clls data (Table 11), which denote a clear sense distinction, and by the Uiaa and Umid from the Usim gold standard. The measures wpr, vr, *V*, and *pF* are all higher for the more clusterable *fire.v* compared with *solid.a*, whereas sep is lower as anticipated. The two lemmas were selected as examples with the same number of comps to allow for a comparison of the values. The overlap measure *nc*_{s} is higher for *solid.a* as anticipated.^{26}

Table 12: Values of the clusterability measures and gold-standard estimates for *fire.v* and *solid.a* (#cl: number of cliques).

measure type | measure | data | *fire.v* | *solid.a*
---|---|---|---|---
intra-clust (comps) | sep | lexsub | 0.122 | 0.584
intra-clust (comps) | sep | clls | 0.179 | 0.685
intra-clust (comps) | vr | lexsub | 7.178 | 0.713
intra-clust (comps) | vr | clls | 4.579 | 0.459
intra-clust (comps) | wpr | lexsub | 1.732 | 0.845
intra-clust (comps) | wpr | clls | 1.795 | 0.707
inter-clust (comps) | *pF* | lexsub and clls | 1 | 0.081
inter-clust (comps) | *V* | lexsub and clls | 1 | 0.590
baseline (cliques) | *nc*_{s} | lexsub | 1.0 (2 #cl) | 1.5 (4 #cl)
baseline (cliques) | *nc*_{s} | clls | 1 (2 #cl) | 2.1 (7 #cl)
gold standard | Uiaa | Usim | 0.930 | 0.490
gold standard | Umid | Usim | 0.169 | 0.630


Note that for the highly clusterable lemma *fire.v* there are no substitutes in common in the two groupings with either the lexsub or clls data because there is no substitute overlap in the sentences, which results in the comps and cliques solutions being equivalent, whereas for *solid.a* there are several substitutes shared by the groupings for lexsub (e.g., *strong*) and clls (e.g., *solido*).

## 6. Conclusions and Future Work

In this article, we have introduced the theoretical notion of clusterability from machine learning discussed by Ackerman and Ben-David (2009a) and argued that it is relevant to WSI since lemmas vary as to the degree of partitionability, as highlighted in the linguistics literature (Tuggy 1993) and supported by evidence from annotation studies (Chen and Palmer 2009; Erk, McCarthy, and Gaylord 2009, 2013). We have demonstrated here how clustering of translation or paraphrase data can be used with clusterability measures to estimate how easily a word's usages can be partitioned into discrete senses. In addition to the intra-clust measures from the machine learning literature, we have also operationalized clusterability as consistency in clustering across information sources using clustering solutions from translation and paraphrase data together. We refer to this second set of measures as inter-clust measures.

We conducted two sets of experiments. In the first, we controlled for polysemy by performing correlations between clusterability estimates and our gold standard on our lemmas in three polysemy bands, which allows us to look at correlation independent of polysemy. In the second set of experiments, we used linear regression on the data from all lemmas together, which allows us to see how polysemy and clusterability can work together as predictors. We find that the machine learning metrics sep and vr produce the most promising results. The inter-clust metrics (*V* and *pF*) are interesting in that they consider the congruence of different views of the same underlying usages. Although there are some promising results, these measures are less consistent and, in the second set of experiments in particular, do not outperform the baseline. This may be due to their reliance on clls, which generally produces weaker results than lexsub. Our baseline, which measures the amount of overlap in overlapping clustering solutions, shows consistently weaker performance than the intra-clust measures.
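The first set of experiments (Spearman's correlation within polysemy bands) can be illustrated with a minimal sketch. The function names (`average_ranks`, `spearman_rho`, `rho_by_band`) and the toy records are hypothetical and are not the code or data used in our experiments. The `min_lemmas=5` default mirrors the requirement, stated in Appendix A, that a band contain at least five lemmas. Significance testing is omitted.

```python
from collections import defaultdict
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho, computed as the Pearson correlation of the ranks.

    Raises ZeroDivisionError if either input is constant (rho is undefined).
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def rho_by_band(records, min_lemmas=5):
    """Correlate clusterability with a gold standard separately per band.

    records: iterable of (band, clusterability_score, gold_score) triples,
    one per lemma. Bands with fewer than min_lemmas lemmas are skipped.
    """
    by_band = defaultdict(list)
    for band, clust, gold in records:
        by_band[band].append((clust, gold))
    result = {}
    for band, pairs in by_band.items():
        if len(pairs) >= min_lemmas:
            xs, ys = zip(*pairs)
            result[band] = spearman_rho(xs, ys)
    return result


# Made-up values for illustration only (not results from this article).
toy_records = [
    ("low", 0.1, 0.9), ("low", 0.2, 0.8), ("low", 0.3, 0.7),
    ("low", 0.4, 0.6), ("low", 0.5, 0.5),
    ("mid", 0.3, 0.2), ("mid", 0.6, 0.4),  # too few lemmas: band is skipped
]
print(rho_by_band(toy_records))
```

Grouping by band before correlating is what keeps polysemy from driving the correlation: within a band, lemmas have a comparable number of clusters.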

A variant of the inter-clust measures we would like to explore is a comparison of results from different clustering algorithms. Because more clusterable data is computationally easier to cluster (Ackerman and Ben-David 2009a), we assume more clusterable data should produce closer results across different algorithms operating on the same input data. We plan to test this empirically in future.
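As a rough illustration of how the output of two clustering algorithms on the same instances might be compared, the sketch below computes the V-measure (the harmonic mean of homogeneity and completeness) between two label assignments. It is a plain-Python sketch under our own assumptions: the function names and the example label lists are hypothetical, and it is not the implementation used for the inter-clust experiments.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a list of cluster labels."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given) for two parallel label lists."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum(c / n * log(c / given_counts[g]) for (g, _), c in joint.items())


def v_measure(labels_a, labels_b):
    """V-measure between two clusterings of the same instances.

    labels_a[i] and labels_b[i] are the cluster ids that two (hypothetical)
    algorithms assign to instance i. Swapping the arguments swaps
    homogeneity and completeness, so their harmonic mean is unchanged.
    """
    h_a, h_b = entropy(labels_a), entropy(labels_b)
    homogeneity = 1.0 if h_a == 0 else 1.0 - conditional_entropy(labels_a, labels_b) / h_a
    completeness = 1.0 if h_b == 0 else 1.0 - conditional_entropy(labels_b, labels_a) / h_b
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


# Hypothetical outputs of two algorithms on six instances of one lemma.
algo_1 = [0, 0, 0, 1, 1, 1]
algo_2 = ["a", "a", "b", "b", "c", "c"]
print(v_measure(algo_1, algo_2))
```

Under the assumption made above, a more clusterable lemma would be expected to yield scores closer to 1 across pairs of algorithms; whether this holds in practice remains to be tested.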

Clusterability metrics should be useful in planning annotation projects (and estimating their costs) as well as for determining the appropriate lexical representation for a lemma. A more clusterable lemma is anticipated to be better suited to the traditional hard-clustering, winner-takes-all WSD methodology than a less clusterable lemma, for which a more complex soft-clustering approach should be considered and more time and expertise are anticipated for any annotation and verification tasks. For some tasks, it may be worthwhile to focus disambiguation efforts only on lemmas with a reasonable level of partitionability.

We believe that notions of clusterability from machine learning are particularly relevant to WSI and the field of word meaning representation in general. These notions might prove useful in other areas of computational linguistics, and in lexical semantics in particular. One such area to explore would be clustering predicate-argument data (Sun and Korhonen 2009; Schulte im Walde 2006).

All the metrics and gold standards measure clusterability on a continuum. We have yet to address the issue of where the cut-off points on that continuum for alternate representations might be. There is also the issue that for a given word, there may be some meanings which are distinct and others that are intertwined. It may in future be possible to find contiguous regions of the data that are clusterable, even if there are other regions where the meanings are less distinguishable.

The paraphrase and translation data we have used to examine clusterability metrics have been produced manually. In future work, the measures could be applied to automatically generated paraphrases and translations or to vector-space or word (or phrase) embedding representations of the instances. Use of automatically produced data would allow us to measure clusterability over a larger vocabulary and corpus of instances but we would need to find an appropriate gold standard. One option might be evidence of inter-tagger agreement from corpus annotation studies (Passonneau et al. 2012) or data on ease of word sense alignment (Eom, Dickinson, and Katz 2012).

## Appendix A: Individual Spearman's Correlation Trials

Tables A.1–A.5 provide the details of the individual Spearman's correlation trials of clusterability measures against the gold standards reported in Section 5.1. Correlations in the anticipated direction and those in the counter-intuitive direction are marked differently, and the latter are also noted by *opp* in the final column. In the same column, we use * for statistical significance with p < 0.05 and ** for p < 0.01. We use only those polysemy bands where there are at least five lemmas within the polysemy range for that band. The number of lemmas (#) in each band is shown within parentheses.

Band (#) | Clusterability measure | Usim measure | ρ | sig/opp
---|---|---|---|---
low (22) | vr | Umid | | *
low (22) | vr | Uiaa | | *
low (22) | sep | Umid | | **
low (22) | sep | Uiaa | |
low (22) | wpr | Umid | |
low (22) | wpr | Uiaa | | opp


Band (#) | Clusterability measure | Usim measure | ρ | sig/opp
---|---|---|---|---
low (29) | vr | Umid | | **
mid (10) | vr | Umid | |
low (29) | vr | Uiaa | | *
mid (10) | vr | Uiaa | |
low (29) | sep | Umid | |
mid (10) | sep | Umid | | *
low (29) | sep | Uiaa | |
mid (10) | sep | Uiaa | | *
low (29) | wpr | Umid | | *
mid (10) | wpr | Umid | |
low (29) | wpr | Uiaa | |
mid (10) | wpr | Uiaa | |


Band (#) | Clusterability measure | Usim measure | ρ | sig/opp
---|---|---|---|---
low (29) | pF | Umid | | **
mid (5) | pF | Umid | |
low (29) | pF | Uiaa | |
mid (5) | pF | Uiaa | | *
low (29) | V | Umid | |
mid (5) | V | Umid | 0 |
low (29) | V | Uiaa | | *
mid (5) | V | Uiaa | |


Band (#) | Clusterability measure | Usim measure | ρ | sig/opp
---|---|---|---|---
low (14) | nc_{s} | Umid | |
mid (17) | nc_{s} | Umid | |
high (9) | nc_{s} | Umid | | opp
low (14) | nc_{s} | Uiaa | |
mid (17) | nc_{s} | Uiaa | | *
high (9) | nc_{s} | Uiaa | | opp


Band (#) | Clusterability measure | Usim measure | ρ | sig/opp
---|---|---|---|---
low (14) | nc_{s} | Umid | |
mid (19) | nc_{s} | Umid | |
high (10) | nc_{s} | Umid | | opp
low (14) | nc_{s} | Uiaa | |
mid (19) | nc_{s} | Uiaa | |
high (10) | nc_{s} | Uiaa | | opp


## Acknowledgments

This work was partially supported by National Science Foundation grant IIS-0845925 to K. E. We thank the anonymous reviewers for many helpful comments and suggestions.

## Notes

Different etymology can help in determining such homonymous cases where several meanings have coincidentally ended up having the same word form, but there are many cases where etymologically related meanings are just as distinct to speakers (Ide and Wilks 2006).

We use *n*, *v*, *a*, *r* suffixes to denote nouns, verbs, adjectives, and adverbs, respectively.

There are no judgments for a sentence paired with itself and we do not repeat values where a judgment has appeared already in the table (for example, 702-701, given that we already have 701-702 displayed).

Some sentences in lexsub did not have two or more responses and for that reason were omitted from the data.

Usim data were collected in two rounds. For the four lemmas where there is both round 1 and round 2 Usim and clls data, we use round 2 data only because there are more annotators (8 for round 2 in contrast to 3 for round 1) (Erk, McCarthy, and Gaylord 2013).

These are the lemmas used in our experiments: account.n, call.v, charge.v, check.v, clear.v, coach.n, dismiss.v, draw.v, dry.a, execution.n, field.n, figure.n, fire.v, flat.a, fresh.a, function.n, hard.r, heavy.a, hold.v, investigator.n, lead.n, light.a, match.n, new.a, order.v, paper.n, poor.a, post.n, put.v, range.n, raw.a, right.r, ring.n, rude.a, shade.n, shed.v, skip.v, soft.a, solid.a, special.a, stiff.a, strong.a, tap.v, throw.v, work.v.

Ackerman and Ben-David (2009b) proposed an additional clusterability measure, center perturbation. However, this measure is not scale invariant, in that its clusterability scores depend on the overall distance between data points in *X*. As we found this dependency to be very strong, we are not using center perturbation in our experiments in this article.

Because both measures (*V* and *pF*) use the harmonic mean, it does not matter whether we use clls as the gold standard against lexsub or vice versa: The harmonic mean of homogeneity and completeness, or precision and recall, is the same regardless of which clustering solution is considered as “gold.”

We have not used the frequency of each substitute, which is the number of annotators that provided it in lexsub or clls, though it would be possible to experiment with this in future work.

In future work, we intend to explore ways for defining the distance threshold dynamically, on a per lemma basis.

Cliques are computed directly from a graph, not from the comps.

Note that two different comps and cliques can share substitutes (translations or paraphrases). Substitutes serve to determine the distance of the instances. If the distance is high, two instances are not linked in the graph despite their shared substitutes.

Differences in granularity are quite possibly an indication of non-clusterability, but not necessarily. We have also tried using the difference in the number of clusters between clls and lexsub as an indicator of clusterability but the proposed measures allow a more complete estimation of disparity and so far seem more reliable.

This can be seen in Table A.3 in Appendix A.

This is generally considered the lower bound of moderate correlation for Spearman's and is the level of inter-annotator agreement achieved in other semantics tasks (for example see Mitchell and Lapata [2008]).

As noted before, none of the intra-clust measures are applicable for the case of *k* = 1.

The regression coefficient is a standardization of Pearson's r, a correlation coefficient, related via a ratio of standard deviations.

Also, the first round of experiments had to drop some lemmas from the analysis when they were in a polysemy band with too few members; the second round of experiments does not have this issue.

We also tested a model with predictors poly+clust, without interaction. We do not report on results for this model here as it did not yield any interesting results. It was basically always between clust and poly × clust.

We will say an F test “reached significance” to mean that the null hypothesis was rejected for some model.

This subset comprises 27 lemmas: charge.v, clear.v, draw.v, dry.a, fire.v, flat.a, hard.r, heavy.a, hold.v, lead.n, light.a, match.n, paper.n, post.n, range.n, raw.a, right.r, ring.n, rude.a, shade.n, shed.v, skip.v, soft.a, solid.a, stiff.a, tap.v, throw.v.

For the intra-clust measures, this is only lemmas where *k* > 1.

We also computed AIC separately for substitute sets lexsub, clls, and both (for inter-clust). The relative ordering of models within each substitute set remained mostly the same.

Log-likelihood values can be positive, as they are in our case, leading to negative AIC values. See, for example, http://blog.stata.com/2011/02/16/positive-log-likelihood-values-happen/.

The cliniques clustering gives a different number of clusters to the two lemmas, so these two lemmas would be in different polysemy bands for the correlation experiments on *nc*_{s} since we control for polysemy.


## Author notes

Department of Theoretical and Applied Linguistics, University of Cambridge, UK. E-mail: diana@dianamccarthy.co.uk.

LIMSI, CNRS, Université Paris-Saclay, France. E-mail: marianna.apidianaki@limsi.fr.

Department of Linguistics, University of Texas at Austin, USA. E-mail: katrin.erk@mail.utexas.edu.