Abstract
Word Sense Disambiguation (WSD) systems automatically choose the intended meaning of a word in context. In this article we present a WSD algorithm based on random walks over large Lexical Knowledge Bases (LKB). We show that our algorithm performs better than other graph-based methods when run on a graph built from WordNet and eXtended WordNet. Our algorithm and LKB combination compares favorably to other knowledge-based approaches in the literature that use similar knowledge on a variety of English data sets and one Spanish data set. We include a detailed analysis of the factors that affect the algorithm. The algorithm and the LKBs used are publicly available, and the results are easily reproducible.
1. Introduction
Word Sense Disambiguation (WSD) is a key enabling technology that automatically chooses the intended sense of a word in context. It has been the focus of intensive research since the beginning of Natural Language Processing (NLP), and more recently it has been shown to be useful in several tasks such as parsing (Agirre, Baldwin, and Martinez 2008; Agirre et al. 2011), machine translation (Carpuat and Wu 2007; Chan, Ng, and Chiang 2007), information retrieval (Pérez-Agüera and Zaragoza 2008; Zhong and Ng 2012), question answering (Surdeanu, Ciaramita, and Zaragoza 2008), and summarization (Barzilay and Elhadad 1997). WSD is considered to be a key step in order to approach language understanding beyond keyword matching.
The best performing WSD systems are currently those based on supervised learning, as attested in public evaluation exercises (Snyder and Palmer 2004; Pradhan et al. 2007), but they need large amounts of hand-tagged data, which are typically very expensive to produce. Contrary to lexical-sample exercises (where plenty of training and testing examples for a handful of words are provided), all-words exercises (which comprise all words occurring in a running text, and where training data is scarcer) show that only a few systems beat the most frequent sense (MFS) heuristic, and only by small margins. For instance, the best system in SensEval-3 scored 65.2 F1, compared to 62.4 for the MFS (Snyder and Palmer 2004). The best current state-of-the-art WSD system (Zhong and Ng 2010) outperforms the MFS heuristic by 5% to 8% in absolute F1 on the SensEval and SemEval fine-grained English all-words tasks.
The causes of the small improvement over the MFS heuristic can be found in the relatively small amount of training data available (sparseness) and in the problems that arise when supervised systems are applied to corpora different from the one used to train the system (corpus mismatch) (Ng 1997; Escudero, Márquez, and Rigau 2000). Note that most of the supervised systems for English are trained over SemCor (Miller et al. 1993), a half-a-million word subset of the Brown Corpus made available by the WordNet team, and DSO (Ng and Lee 1996), comprising 192,800 word occurrences from the Brown and WSJ corpora corresponding to the 191 most frequent nouns and verbs. Several researchers have explored solutions to sparseness. For instance, Chan and Ng (2005) present an unsupervised method to obtain training examples from bilingual data, which was used together with SemCor and DSO to train one of the best performing supervised systems to date (Zhong and Ng 2010).
In view of the problems of supervised systems, knowledge-based WSD is emerging as a powerful alternative. Knowledge-based WSD systems exploit the information in a lexical knowledge base (LKB) to perform WSD. They currently perform below supervised systems on general domain data, but are attaining performance close or above MFS without access to hand-tagged data (Ponzetto and Navigli 2010). In this sense, they provide a complementary strand of research which could be combined with supervised methods, as shown for instance in Navigli (2008). In addition, Agirre, López de Lacalle, and Soroa (2009) show that knowledge-based WSD systems can outperform supervised systems in a domain-specific data set, where MFS from general domains also fails. In this article, we will focus our attention on knowledge-based methods.
Early work on knowledge-based WSD was based on measures of similarity between pairs of concepts. In order to maximize pairwise similarity for a sequence of n words where each has up to k senses, the algorithms had to consider up to k^n sense sequences. Greedy methods were often used to avoid the combinatorial explosion (Patwardhan, Banerjee, and Pedersen 2003). As an alternative, graph-based methods are able to exploit the structural properties of the graph underlying a particular LKB. These methods are able to consider all possible combinations of the senses occurring in a particular context, and thus offer a way to efficiently analyze the inter-relations among them; they have gained much attention in the NLP community (Mihalcea 2005; Navigli and Lapata 2007; Sinha and Mihalcea 2007; Agirre and Soroa 2008; Navigli and Lapata 2010). The nodes in the graph represent the concepts (word senses) in the LKB, and the edges represent relations between them, such as subclass and part-of. Network analysis techniques based on random walks, like PageRank (Brin and Page 1998), can then be used to choose the senses that are most relevant in the graph, and output those senses.
In order to deal with large knowledge bases containing more than 100,000 concepts (Fellbaum 1998), previous algorithms had to extract subsets of the LKB (Navigli and Lapata 2007; Navigli and Lapata 2010) or construct ad hoc graphs for each context to be disambiguated (Mihalcea 2005; Sinha and Mihalcea 2007). An additional reason for the use of custom-built subsets or ad hoc graphs for each context is that a centrality algorithm like PageRank run over the whole graph would choose the most important senses in the LKB regardless of context, limiting the applicability of the algorithm. For instance, the word coach is ambiguous at least between the “sports coach” and the “transport service” meanings, as shown in the following examples:
- (1)
Nadal is sharing a house with his uncle and coach, Toni, and his physical trainer, Rafael Maymo.
- (2)
Our fleet comprises coaches from 35 to 58 seats.
If we were to run a centrality algorithm over the whole LKB, with no context, then we would always assign coach to the same concept, and we would thus fail to correctly disambiguate at least one of the given examples.
The contributions of this article are the following: (1) A WSD method based on random walks over large LKBs. The algorithm outperforms other graph-based algorithms when using a LKB built from WordNet and eXtended WordNet, and the algorithm and LKB combination compares favorably to the state of the art in knowledge-based WSD on a wide variety of data sets, including four English data sets and one Spanish data set. (2) A detailed analysis of the factors that affect the algorithm. (3) The algorithm, together with the corresponding graphs, is publicly available and can be applied easily to sense inventories and knowledge bases different from WordNet.
The algorithm for WSD was first presented in Agirre and Soroa (2009). In this article, we present further evaluation on two more recent data sets, analyze the parameters and options of the system, compare it to the state of the art, and discuss the relation of our algorithm with PageRank and the MFS heuristic.
2. Related Work
Traditional knowledge-based WSD systems assign a sense to an ambiguous word by comparing each of its senses with those of the surrounding context. Typically, some semantic similarity metric is used for calculating the relatedness among senses (Lesk 1986; Patwardhan, Banerjee, and Pedersen 2003). The metrics range from counting word overlaps between the definitions of the words (Lesk 1986) to measuring distances between concepts following the structure of the LKB (Patwardhan, Banerjee, and Pedersen 2003). Usually the distances are calculated using only hierarchical relations in the LKB (Sussna 1993; Agirre and Rigau 1996). Combining both intuitions, Jiang and Conrath (1997) present a metric that combines corpus statistics with the structure of a lexical taxonomy. One of the major drawbacks of these approaches stems from the fact that senses are compared in a pairwise fashion, and thus the number of computations grows exponentially with the number of words: for a sequence of n words where each has up to k senses, they need to consider up to k^n sense sequences. Although alternatives like simulated annealing (Cowie, Guthrie, and Guthrie 1992) and conceptual density (Agirre and Rigau 1996) were tried, most of the knowledge-based WSD at the time was done in a suboptimal word-by-word greedy process, namely, disambiguating words one at a time (Patwardhan, Banerjee, and Pedersen 2003). Still, some recent work on finding predominant senses in domains has applied such similarity-based techniques with success (McCarthy et al. 2007).
Recently, graph-based methods for knowledge-based WSD have gained much attention in the NLP community (Mihalcea 2005; Navigli and Velardi 2005; Navigli and Lapata 2007; Sinha and Mihalcea 2007; Agirre and Soroa 2008; Navigli and Lapata 2010). These methods use well-known graph-based techniques to find and exploit the structural properties of the graph underlying a particular LKB. Graph-based techniques consider all the sense combinations of the words occurring in a particular context at once, and thus offer a way to analyze the relations among them with respect to the whole graph. They are particularly suited for disambiguating all the words in a sequence, and they manage to exploit the interrelations among the senses in the given context. In this sense, they provide a principled solution to the exponential explosion problem mentioned before, with excellent performance.
Graph-based WSD is performed over a graph composed of senses (nodes) and relations between pairs of senses (edges). The relations may be of several types (lexico-semantic, cooccurrence relations, etc.) and may have some weight attached to them. All the methods reviewed in this section use some version of WordNet as a LKB. Apart from the relations in WordNet, some authors have used semi-automatic and fully automatic methods to enrich WordNet with additional relations. Mihalcea and Moldovan (2001) disambiguated the WordNet glosses in a resource called eXtended WordNet. The disambiguated glosses have been shown to improve the results of a graph-based system (Agirre and Soroa 2008), and we have also used them in our experiments. Navigli and Velardi (2005) enriched WordNet with cooccurrence relations semi-automatically and showed that those relations are effective in a number of graph-based WSD systems (Navigli and Velardi 2005; Navigli and Lapata 2007, 2010). More recently, Cuadros and Rigau (2006, 2007, 2008) automatically learned so-called KnowNets, and showed that the newly provided relations improved WSD performance when plugged into a simple vector-based WSD system. Finally, Ponzetto and Navigli (2010) have acquired relations automatically from Wikipedia, released as WordNet++, and have shown that they are beneficial in a graph-based WSD algorithm. All of these relations are publicly available with the exception of Navigli and Velardi (2005), but note that the system is available on-line.
Disambiguation is typically performed by applying a ranking algorithm over the graph, and then assigning the concepts with highest rank to the corresponding words. Given the computational cost of using large graphs like WordNet, most researchers use smaller subgraphs built on-line for each target context. The main idea of the subgraph method is to extract the subgraph whose vertices and relations are particularly relevant for the set of senses from a given input context. The subgraph is then analyzed and the most relevant vertices are chosen as the correct senses of the words.
The TextRank algorithm for WSD (Mihalcea 2005) creates a complete weighted graph (i.e., a graph in which every pair of distinct vertices is connected by a weighted edge) formed by the synsets of the words in the input context. The weight of the links joining two synsets is calculated by executing Lesk's algorithm (Lesk 1986) between them, that is, by calculating the overlap between the words in the glosses of the corresponding senses. Once the complete graph is built, a random walk algorithm (PageRank) is executed over it and each word is assigned its most relevant synset. In this sense, PageRank is used as an alternative to simulated annealing to find the optimal pairwise combinations. This work is extended in Sinha and Mihalcea (2007), using a collection of semantic similarity measures when assigning a weight to the links across synsets. They also compare different graph-based centrality algorithms to rank the vertices of the complete graph. They use different similarity metrics for different POS types and a voting scheme among the centrality algorithm ranks.
In Navigli and Velardi (2005), the authors develop a knowledge-based WSD method based on lexical chains called structural semantic interconnections (SSI). Although the system was first designed to find the meaning of the words in WordNet glosses, the authors also apply the method for labeling each word in a text sequence. Given a text sequence, SSI first identifies monosemous words and assigns the corresponding synset to them. Then, it iteratively disambiguates the rest of the terms by selecting the senses that get the strongest interconnection with the synsets selected so far. The interconnection is calculated by searching for paths on the LKB, constrained by some hand-made rules of possible semantic patterns.
In Navigli and Lapata (2007, 2010), the authors perform a two-stage process for WSD. Given an input context, the method first explores the whole LKB in order to find a subgraph that is particularly relevant for the words of the context. The subgraph is calculated by applying a depth-first search algorithm over the LKB graph for every word sense occurring in the context. Then, they study different graph-based centrality algorithms for deciding the relevance of the nodes of the subgraph. As a result, every word of the context is attached to the highest-ranking concept among its possible senses. The best results were obtained by a simple algorithm that chooses, for each word, the concept with the largest degree (number of edges), and by PageRank (Brin and Page 1998). We reimplemented their best methods in order to compare our algorithm with theirs in the same setting (cf. Section 6.3). In later work (Ponzetto and Navigli 2010) the authors apply a subset of their methods to a WordNet enriched with additional relations from Wikipedia, improving their results for nouns.
Tsatsaronis, Vazirgiannis, and Androutsopoulos (2007) and Agirre and Soroa (2008) also use such a two-stage process. They build the graph as before, but using breadth-first search. The former apply a spreading activation algorithm over the subgraph for node ranking, whereas the latter use PageRank. In later work (Tsatsaronis, Varlamis, and Nørvåg 2010) spreading activation is compared with PageRank and other centrality measures like HITS (Kleinberg 1998), obtaining better results than in their previous work.
This work departs from earlier work in its use of the full graph, and its ability to infuse context information when computing the importance of nodes in the graph. For this, we resort to an extension of the PageRank algorithm (Brin and Page 1998), called Personalized PageRank (Haveliwala 2002), which tries to bias PageRank using a set of representative topics and thus capture more accurately the notion of importance with respect to a particular topic. In our case, we initialize the random walk with the words in the context of the target word, and thus we obtain a context-dependent PageRank. We will show that this method is indeed effective for WSD. Note that in order to use other centrality algorithms (e.g., HITS [Kleinberg 1998]), previous authors had to build a subgraph first. In principle, those algorithms could be made context-dependent when using the full graph and altering their formulae, but we are not aware of such variations.
Random walks over WordNet using Personalized PageRank have also been used to measure semantic similarity between two words (Hughes and Ramage 2007; Agirre et al. 2009). In those papers, the random walks are initialized with a single word, whereas we use all content words in the context. The results obtained by the authors, especially in the latter paper, are well above other WordNet-based methods.
Most previous work on knowledge-based WSD has presented results on one or two general domain corpora for English. We present our results on four general domain data sets for English and a Spanish data set (Màrquez et al. 2007). Alternatively, some researchers have applied knowledge-based WSD to specific domains, using different methods to adapt the method to the particular test domain. In Agirre, López de Lacalle, and Soroa (2009) and Navigli et al. (2011), the authors apply our Personalized PageRank method to a domain-specific corpus with good results. Ponzetto and Navigli (2010) also apply graph-based algorithms to the same domain-specific corpus.
3. WordNet
Most WSD work uses WordNet as the sense inventory of choice. WordNet (Fellbaum 1998) is a freely available lexical database of English, which groups nouns, verbs, adjectives, and adverbs into sets of synonyms, each expressing a distinct concept (called a synset in WordNet parlance). For instance, coach has five nominal senses and two verbal senses, which correspond to the following synsets:
<coach#n1, manager#n2, handler#n3>
<coach#n2, private instructor#n1, tutor#n1>
<coach#n3, passenger car#n1, carriage#n1>
<coach#n4, four-in-hand#n2, coach-and-four#n1>
<coach#n5, bus#n1, autobus#n1, charabanc#n1, double-decker#n1, jitney#n1, …>
<coach#v1, train#v7>
<coach#v2>
The synsets in WordNet are interlinked with conceptual-semantic and lexical relations. Examples of conceptual-semantic relations are hypernymy, which corresponds to the superclass or is-a relation, and holonymy, the part-of relation. Figure 1 shows two small regions of the graph around three synsets of the word coach, including several conceptual-semantic relations and lexical relations. For example, the figure shows that concept trainer#n1 is a coach#n1 (hypernymy relation), and that seat#n1 is a part of coach#n5 (holonymy relation). The figure only shows a small subset of the relations for three synsets of coach. If we were to show the relations of the rest of the synsets in WordNet we would end up with a densely connected graph, where one can go from one synset to another following the semantic relations. In addition to purely conceptual-semantic relations which hold between synsets, there are also lexical relations which hold between specific senses. For instance, angry#a2 is the antonym of calm#a2 and a derivation relation exists between handler#n3 and handle#v6, meaning that handler is a derived form of handle and that the third nominal sense of handler is related to the sixth verbal sense of handle. Although lexical relations hold only between two senses, we generalize to the whole synset. This generalization captures the notion that if handler#n3 is related by derivation to handle#v6, then coach#n1 is also semantically related to handle#v6 (as shown in Figure 1).
In addition to these relations, we also use the relation between each synset and the words in its gloss. Most of the words in the glosses have been manually associated with their corresponding senses, and we can thus produce a link between the synset being glossed and the synsets of each of the words in the gloss. For instance, the gloss of coach#v2 ("drive a coach") yields a gloss relation between coach#v2 and drive#v2. The gloss relations were not available prior to WordNet 3.0, and we thus used automatically disambiguated glosses for WordNet 1.7 and WordNet 2.1, as made freely available in the eXtended WordNet (Mihalcea and Moldovan 2001). Note also that the eXtended WordNet provides about 550,000 relations, whereas the disambiguated glosses made available with WordNet 3.0 provide around 339,000 relations. We compare the performance of XWN relations and WordNet 3.0 gloss relations in Section 6.4.4.
Table 1 summarizes the most relevant relations (with less frequent relations grouped as “other”). The table also lists how we grouped the relations, and the overall counts. Note that inverse relations are not counted, as their numbers equal those of the original relation. In Section 6.4.5 we report the impact of the relations in the behavior of the system. Overall, the graph for WordNet 1.7 has 109,359 vertices (concepts) and 620,396 edges (relations between concepts). Note that there is some overlap between xwn and other types of relations. For instance, the hypernym of coach#n4 is carriage#n2, which is also present in its gloss. Note that most of the relation types relate concepts from the same part of speech, with the exception of derivation and xwn.
Finally, we have also used the Spanish WordNet (Atserias, Rigau, and Villarejo 2004). In addition to the native relations, we also added relations from the eXtended WordNet. All in all, it contains 105,501 vertices and 623,316 relations.
3.1 Representing WordNet as a Graph
An LKB such as WordNet can be seen as a set of concepts and relations among them, plus a dictionary, which contains the list of words (typically word lemmas) linked to the corresponding concepts (senses). WordNet can thus be represented as a graph G = (V,E). V is the set of nodes, where each node represents one concept (vi ∈ V), and E is the set of edges: each relation between concepts vi and vj is represented by an edge ei,j ∈ E. We ignore the type of the relations: if one or more WordNet relations exist between two nodes, we represent them with a single, untyped edge. We chose to use undirected relations between concepts, because most of the relations are symmetric or have their inverse counterpart (cf. Section 3), and in preliminary work we failed to see any effect using directed relations.
In addition, we also add vertices for the dictionary words, which are linked to their corresponding concepts by directed edges (cf. Figure 1). Note that monosemous words will be related to just one concept, whereas polysemous words may be attached to several. Section 5.2 explains the reason for using directed edges, and also mentions an alternative to avoid introducing these vertices.
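To make the representation concrete, the following Python sketch builds such a graph with a plain adjacency-set representation. The function names and the toy relations are illustrative only (they are not part of the released software); the relations shown are the ones discussed around Figure 1.

```python
from collections import defaultdict

# Adjacency sets. Concept-concept relations are undirected, so they are
# stored in both directions; word-concept edges are directed (word -> concept).
graph = defaultdict(set)

def add_concept_relation(c1, c2):
    """Single untyped, undirected edge between two synsets."""
    graph[c1].add(c2)
    graph[c2].add(c1)

def add_dictionary_word(word, synsets):
    """Directed edges from a word node to each of its candidate synsets;
    words only inject probability mass, so no reverse edge is added."""
    for s in synsets:
        graph[word].add(s)
        graph[s]  # make sure the synset node exists even if isolated

# Toy fragment around "coach" (cf. Figure 1):
add_concept_relation("coach#n1", "trainer#n1")  # hypernymy
add_concept_relation("coach#n5", "seat#n1")     # holonymy (part-of)
add_dictionary_word("coach", ["coach#n1", "coach#n2", "coach#n5"])
```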
4. PageRank and Personalized PageRank
The PageRank random walk algorithm (Brin and Page 1998) is a method for ranking the vertices in a graph according to their relative structural importance. The main idea of PageRank is that whenever a link from vi to vj exists in a graph, a vote from node i to node j is produced, and hence the rank of node j increases. In addition, the strength of the vote from i to j also depends on the rank of node i: The more important node i is, the more strength its votes will have. Alternatively, PageRank can also be viewed as the result of a random walk process, where the final rank of node i represents the probability of a random walk over the graph ending on node i, at a sufficiently large time.

Let G be a graph with N vertices v1,…,vN and let di be the outdegree of node i; let M be an N × N transition probability matrix, where Mji = 1/di if a link from i to j exists, and zero otherwise. The PageRank vector P over G is the vector satisfying

P = cMP + (1 − c)v    (1)

where v is an N × 1 stochastic vector, c ∈ [0,1] is the so-called damping factor, and the first term of the sum models the voting scheme described above, while the second term models the probability of a random jump to any node in the graph.
The second term in Equation (1) can also be seen as a smoothing factor that makes any graph fulfill the property of being aperiodic and irreducible, and thus guarantees that the PageRank calculation converges to a unique stationary distribution.
In the traditional PageRank formulation the vector v is a stochastic normalized vector whose element values are all 1/N, thus assigning equal probabilities to all nodes in the graph in the case of random jumps. However, as pointed out by Haveliwala (2002), the vector v can be non-uniform and assign stronger probabilities to certain kinds of nodes, effectively biasing the resulting PageRank vector to prefer these nodes. For example, if we concentrate all the probability mass on a unique node i, all random jumps on the walk will return to i and thus its rank will be high; moreover, the high rank of i will make all the nodes in its vicinity also receive a high rank. Thus, the importance of node i given by the initial distribution of v spreads along the graph on successive iterations of the algorithm. As a consequence, the P vector can be seen as representing the relevance of every node in the graph from the perspective of node i.
In this article, we will use Static PageRank to refer to the case when a uniform v vector is used in Equation (1); and whenever a modified v is used, we will call it Personalized PageRank. The next section shows how we define a modified v.
PageRank is actually calculated by applying an iterative algorithm that computes Equation (1) successively until convergence below a given threshold is achieved, or until a fixed number of iterations has been executed. Following usual practice, we use a damping factor of 0.85 and finish the calculation after 30 iterations (Haveliwala 2002; Langville and Meyer 2003; Mihalcea 2005). Some preliminary experiments with higher iteration counts showed that although sometimes the node ranks varied, the relative order among the synsets of a particular word remained stable after the initial iterations (cf. Section 6.4 for further details). Note that, in order to discard the effect of dangling nodes (i.e., nodes without outlinks), one would need to slightly modify Equation (1) following Langville and Meyer (2003). This modification is not necessary for WordNet, as it does not have dangling nodes.
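As an illustration, the following sketch implements the power iteration for Equation (1) over the adjacency representation given in Section 3.1. It assumes, as in WordNet, that dangling nodes are absent, and the defaults follow the settings used in this article.

```python
def pagerank(graph, nodes, v, damping=0.85, iterations=30):
    """Power iteration for Equation (1): P = c M P + (1 - c) v.

    graph: dict mapping each node to its successor nodes.
    nodes: list of all nodes in the graph.
    v:     teleport distribution, a dict node -> probability (sums to 1).
    """
    p = {n: v.get(n, 0.0) for n in nodes}  # start from the teleport vector
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * v.get(n, 0.0) for n in nodes}
        for n in nodes:
            out = graph.get(n, ())
            if out:  # dangling nodes assumed absent, as in WordNet
                share = damping * p[n] / len(out)  # rank split evenly
                for m in out:
                    nxt[m] += share
        p = nxt
    return p
```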
5. Random Walks for WSD
We tested two different methods to apply random walks to WSD.
5.1 Static PageRank, No Context
If we apply traditional PageRank over the whole WordNet, we get a context-independent ranking of word senses. All concepts in WordNet get ranked according to their PageRank value. Given a target word, it suffices to check the relative ranking of its senses, and the WSD system outputs the one ranking highest. We call this application of PageRank to WSD Static PageRank (static for short), as it does not change with the context, and we use it as a baseline.
Because the PageRank of a node in an undirected graph is closely related to its degree, Static PageRank returns the predominant sense according to the number of relations each sense has. We think that this is closely related to the Most Frequent Sense attested in general corpora, as lexicon builders tend to assign more relations to the most predominant sense. In fact, our results (cf. Section 6.4.6) show that this is indeed the case for the English WordNet.
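A minimal sketch of this baseline, reusing the graph and pagerank fragments above: run PageRank once with a uniform teleport vector and rank each target word's candidate senses by the resulting scores.

```python
nodes = list(graph)  # graph from the construction sketch in Section 3.1
v_uniform = {n: 1.0 / len(nodes) for n in nodes}
static_rank = pagerank(graph, nodes, v_uniform)  # computed once, no context

def disambiguate_static(candidate_senses):
    """Pick the candidate synset with the highest context-independent rank."""
    return max(candidate_senses, key=lambda s: static_rank.get(s, 0.0))

print(disambiguate_static(["coach#n1", "coach#n2", "coach#n5"]))
```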
5.2 Personalized PageRank, Using Context
Static PageRank is independent of context, but this is not what we want in a WSD system. Given an input piece of text we want to disambiguate all content words in the input according to the relationships among them. For this we can use Personalized PageRank (ppr for short) over the whole WordNet graph.
Given an input text (e.g., a sentence), we extract the list Wi (i = 1…m) of content words (i.e., nouns, verbs, adjectives, and adverbs) that have an entry in the dictionary, and thus can be related to LKB concepts. As a result of the disambiguation process, every LKB concept receives a score. Then, for each target word to be disambiguated, we just choose its associated concept in G with maximum score.
In order to apply Personalized PageRank over the LKB graph, the context words are first inserted into the graph G as nodes, and linked with directed edges to their respective concepts. Then, the Personalized PageRank of the graph G is computed by concentrating the initial probability mass uniformly over the newly introduced word nodes. As the words are linked to the concepts by directed edges, they act as source nodes injecting mass into the concepts they are associated with, which thus become relevant nodes, and spread their mass over the LKB graph. Therefore, the resulting Personalized PageRank vector can be seen as a measure of the structural relevance of LKB concepts in the presence of the input context.
Making the edges from words to concepts directed is important, as the use of undirected edges will move part of the probability mass in the concepts to the word nodes. Note the contrast with the edges representing relations between concepts, which are undirected (cf. Section 3.1).
Alternatively, we could do without the word nodes, concentrating the initial probability mass on the senses of the words under consideration. Such an initialization over the graph with undirected edges between synset nodes is equivalent to initializing the walk on the words in a graph with undirected edges between synset nodes and directed edges from words to synsets. We experimentally checked that the results of both alternatives are indistinguishable. Although the alternative without word nodes is marginally more efficient, we keep the word nodes as they provide a more intuitive and appealing formalization.
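The following sketch implements ppr under the same assumed representation: the teleport mass is concentrated uniformly on the context word nodes, and each target word then receives its highest-ranking candidate sense. senses_of stands for a dictionary lookup function and is an assumption of the sketch.

```python
def disambiguate_ppr(graph, nodes, context_words, senses_of):
    """ppr: one random walk per context; initial mass on the word nodes."""
    v = {w: 1.0 / len(context_words) for w in context_words}
    rank = pagerank(graph, nodes, v)  # pagerank from the Section 4 sketch
    # Each target word gets its highest-ranking candidate sense.
    return {w: max(senses_of(w), key=lambda s: rank.get(s, 0.0))
            for w in context_words}
```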
One problem with Personalized PageRank is that if one of the target words has two senses that are related by semantic relations, those senses reinforce each other, and could thus dampen the effect of the other senses in the context. Although one could remove direct edges between competing senses from the graph, it is quite rare that those senses are directly linked; usually a path with several edges is involved. With this observation in mind we devised a variant called word-to-word heuristic (pprw2w for short), where we run Personalized PageRank separately for each target word in the context: for each target word Wi, we concentrate the initial probability mass on the senses of the rest of the words in the context of Wi, but not on the senses of the target word itself, so that the context words increase their relative importance in the graph. The main idea of this approach is to avoid biasing the initial score of concepts associated with the target word Wi, and let the surrounding words decide which concept associated with Wi has more relevance. Contrary to the previous approach, pprw2w does not disambiguate all target words of the context in a single run, which makes it less efficient (cf. Section 6.4).
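A corresponding sketch of the word-to-word variant follows; note the extra cost, one full PageRank computation per target word instead of one per context. As before, senses_of is an assumed lookup, and contexts are assumed to contain several words (cf. Section 6).

```python
def disambiguate_pprw2w(graph, nodes, context_words, senses_of):
    """pprw2w: one walk per target; the target word node gets no initial
    mass, so only the surrounding words vote for its senses."""
    result = {}
    for target in context_words:
        others = [w for w in context_words if w != target]
        v = {w: 1.0 / len(others) for w in others}
        rank = pagerank(graph, nodes, v)  # one full computation per target
        result[target] = max(senses_of(target),
                             key=lambda s: rank.get(s, 0.0))
    return result
```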
Figure 2 illustrates the disambiguation of a sample sentence. The static method (not shown in the figure) would choose the synset coach#n1 for the word coach, because it is related to more concepts than the other senses, and because those concepts are related to concepts with a high degree (for instance, sport#n1). The ppr method (left side of Figure 2) concentrates the initial mass on the content words in the example. After running the iterative algorithm, the system would return coach#n1 as the result for the target word coach. Although the words in the sentence clearly indicate that the correct synset in this sentence corresponds to coach#n5, the fact that teacher#n1 is related to trainer#n1 in WordNet causes coach#n2 and coach#n1 to reinforce each other and raise their PageRank. The right side of Figure 2 depicts the pprw2w method, where the word coach is not activated. Thus, there is no reinforcement between coach senses, and the method correctly chooses coach#n5 as the proper synset.
6. Evaluation
WSD literature has used several measures for evaluation. Precision is the number of correctly disambiguated instances divided by the number of instances disambiguated. Some systems do not disambiguate all instances, so precision can be high even if the system disambiguates only a handful of instances. In our case, the algorithm abstains when a word has two or more senses tied for the highest PageRank value. Recall, in contrast, is the number of correctly disambiguated instances divided by the total number of instances to be disambiguated; it penalizes systems that are unable to return a solution for all instances. Finally, F1, the harmonic mean between precision and recall, combines both measures. F1 is our main evaluation measure, as it provides a balanced measure between the two extremes. Note that a system that returns a solution for every instance has equal precision, recall, and F1.
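In code, the three measures relate as follows (a trivial sketch; the counts are the ones defined in the text):

```python
def evaluate(correct, answered, total):
    """Precision, recall, and their harmonic mean F1 (as percentages)."""
    precision = 100.0 * correct / answered
    recall = 100.0 * correct / total
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# A system answering every instance (answered == total) gets P == R == F1.
```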
In our experiments we build a context of at least 20 content words for each sentence to be disambiguated, adding the sentences immediately before and after it when the original sentence is too short. The parameters for the PageRank algorithm were set to a damping factor of 0.85 and 30 iterations, following standard practice (Haveliwala 2002; Langville and Meyer 2003; Mihalcea 2005). The impact of these and other parameters is analyzed post hoc in Section 6.4.
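A sketch of this context-building step is given below, assuming each sentence has already been reduced to its list of content lemmas; the exact windowing policy of our implementation may differ in details.

```python
def build_context(sentences, i, min_words=20):
    """Extend the window around sentence i with neighboring sentences
    until it contains at least min_words content words.

    sentences: list of sentences, each a list of content lemmas."""
    lo, hi = i, i
    words = list(sentences[i])
    while len(words) < min_words and (lo > 0 or hi < len(sentences) - 1):
        if lo > 0:                       # prepend the previous sentence
            lo -= 1
            words = sentences[lo] + words
        if len(words) < min_words and hi < len(sentences) - 1:
            hi += 1                      # append the following sentence
            words = words + sentences[hi]
    return words
```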
The general domain data sets used in this work are the SensEval-2 (S2AW) (Palmer et al. 2001), SensEval-3 (S3AW) (Snyder and Palmer 2004), SemEval-2007 fine-grained (S07AW) (Pradhan et al. 2007), and SemEval-2007 coarse-grained (S07CG) (Navigli, Litkowski, and Hargraves 2007) all-words data sets. All data sets have been produced similarly: A few documents were selected for tagging, at least two annotators tagged nouns, verbs, adjectives, and adverbs, inter-tagger agreement was measured, and the discrepancies between taggers were solved. The first two data sets are labeled with WordNet 1.7 tags, the third uses WordNet 2.1 tags, and the last one uses coarse-grained senses that group WordNet 2.1 senses. We run our system using WordNet 1.7 relations and senses for the first two data sets, and WordNet 2.1 for the other two. Section 6.4.3 explores the use of WordNet 3.0 and compares the performance with the use of other versions.
Regarding the coarse senses in S07CG, we used the mapping from WordNet 2.1 senses made available by the authors of the data set. In order to return coarse-grained senses, we run our algorithm on fine-grained senses, and aggregate the scores of all senses that map to the same coarse-grained sense. We finally choose the coarse-grained sense with the highest score.
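The aggregation step can be sketched as follows, where coarse_of is the (assumed) dictionary encoding the sense mapping released with S07CG:

```python
from collections import defaultdict

def coarse_sense(fine_scores, coarse_of):
    """Sum the scores of all fine-grained senses that map to the same
    coarse-grained sense, and return the best coarse-grained sense.

    fine_scores: dict fine sense -> PageRank score
    coarse_of:   dict fine sense -> coarse sense
    """
    totals = defaultdict(float)
    for sense, score in fine_scores.items():
        totals[coarse_of[sense]] += score
    return max(totals, key=totals.get)
```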
The data sets used in this article contain polysemous and monosemous words, as customary; the percentage of monosemous word occurrences in the S2AW, S3AW, S07AW, and S07CG data sets are 20.7%, 16.9%, 14.4%, and 29.9%, respectively.
6.1 Results
Table 2 shows the results as F1 of our random walk WSD systems over these data sets. We detail overall results, as well as results per part of speech, and mark with * whether there is any statistically significant difference with respect to the best result in each column. Statistical significance is obtained using the paired bootstrap resampling method (Noreen 1989), p < 0.01.
S2AW - SensEval-2 All-Words

| Method | All | N | V | Adj. | Adv. |
|---|---|---|---|---|---|
| ppr | 58.7* | 71.8 | 35.0 | 58.9 | 69.8 |
| pprw2w | 59.7 | 70.3 | 40.3 | 59.8 | 72.9 |
| static | 58.0* | 66.5 | 40.2 | 59.8 | 72.5 |

S3AW - SensEval-3 All-Words

| Method | All | N | V | Adj. | Adv. |
|---|---|---|---|---|---|
| ppr | 57.3* | 63.7 | 47.5 | 61.3 | 96.3 |
| pprw2w | 57.9 | 65.3 | 47.2 | 63.6 | 96.3 |
| static | 56.5* | 62.5 | 47.1 | 62.8 | 96.3 |

S07AW - SemEval-2007 All-Words

| Method | All | N | V | Adj. | Adv. |
|---|---|---|---|---|---|
| ppr | 39.7* | 51.6 | 34.6 | – | – |
| pprw2w | 41.7* | 56.0 | 35.3 | – | – |
| static | 43.0 | 56.0 | 37.3 | – | – |

S07CG - SemEval-2007 Coarse-grained All-Words

| Method | All | N | V | Adj. | Adv. |
|---|---|---|---|---|---|
| ppr | 78.1* | 78.3 | 73.8 | 84.0 | 78.4 |
| pprw2w | 80.1 | 83.6 | 71.1 | 83.1 | 82.3 |
| static | 79.2* | 81.0 | 72.4 | 82.9 | 82.8 |
The table shows that pprw2w is consistently the best method in three of the four data sets. All in all the differences are small, and in one data set static obtains the best results; still, the differences with respect to the best system in each data set are always statistically significant. In fact, it is remarkable that a simple non-contextual measure like static performs so well, without the need for building subgraphs or any other manipulation. Section 6.4.6 will show that in some circumstances the performance of static is much lower and analyzes the reasons for this drop. Regarding the word-to-word heuristic, it consistently provides slightly better results than ppr in all four data sets. An analysis of the performance according to POS shows that pprw2w performs better particularly on nouns, but there does not seem to be a clear pattern for the rest. In the rest of the article we will only show the overall results, omitting the per-POS breakdown, in order not to clutter the result tables.
Our algorithms do not always return an answer, and thus the precision is higher than the F1 measure. For instance, in S2AW the percentage of instances that get an answer ranges between 95.4% and 95.6% for ppr, pprw2w, and static. The precision for pprw2w in S2AW is 61.1%, the recall is 58.4%, and F1 is 59.7%. This pattern of slightly higher precision, lower recall, and F1 in between is repeated across all data sets and POS. The percentage of instances that get an answer in the other data sets is higher, ranging between 98.1% in S3AW and 99.9% in S07CG.
6.2 Comparison to State-of-the-Art Systems
In this section we compare our results with the WSD systems described in Section 2, as well as the top performing supervised systems at competition time and other unsupervised systems that improved on them. Note that we do not mention all unsupervised systems participating in the competitions, but we do select the top performing ones. All results in Table 3 are given as overall F1 for all Parts of Speech, but we also report F1 for nouns in the case of S07CG, where Ponz10 (Ponzetto and Navigli 2010) reported very high results, but only for nouns. Note that the systems reported here and our system might use different context sizes.
| System | S2AW | S3AW | S07AW | S07CG | S07CG (N) |
|---|---|---|---|---|---|
| Mih05 | 54.2 | 52.2 | | | |
| Sinha07 | 57.6 | 53.6 | | | |
| Tsatsa10 | 58.8 | 57.4 | | | |
| Agirre08 | | 56.8 | | | |
| Nav10 | | 52.9 | 43.1 | | |
| JU-SKNSB / TKB-UO | | | 40.2 | 70.2 | (70.8) |
| Ponz10 | | | | | (79.4) |
| pprw2w | 59.7 | 57.9 | 41.7 | 80.1 | (83.6) |
| MFS(1) | 60.1 | 62.3 | 51.4 | 78.9 | (77.4) |
| IRST-DDD-00(1) | | 58.3 | | | |
| Nav05(1) / UOR-SSI(1) | | 60.4 | | 83.2 | (84.1) |
| Best sup(2) | 68.6 | 65.2 | 59.1 | 82.5 | (82.3) |
| Zhong10(2) | 68.2 | 67.6 | 58.3 | 82.6 | |
For easier reference, Table 3 uses a shorthand for each system, whereas the text in this section includes the shorthand and the full reference the first time the shorthand is used. The shorthand uses the first letters of the first author followed by the year of the paper, except for systems which participated in SensEval and SemEval, where we use their acronym. Most systems in the table have been presented in Section 2, with a few exceptions that will be presented in this section.
The results in Table 3 confirm that our system (pprw2w) performs at the state of the art of knowledge-based and unsupervised systems, with two exceptions:
- (1)
Nav10 (Navigli and Lapata 2010) obtained better results on S07AW. We will compare both systems in more detail below, and also include a reimplementation in the next subsection which shows that, when using the same LKB, our method obtains better results.
- (2)
Although not reported in the table, an unsupervised system using automatically acquired training examples from bilingual data (Chan and Ng 2005) obtained very good results on S2AW nouns (77.2 F1, compared with our 70.3 F1 in Table 2). The automatically acquired training examples are used in addition to hand-annotated data in Zhong10 (Zhong and Ng 2010), also reported in the table (see below).
We report the best unsupervised systems in S07AW and S07CG on the same row. JU-SKNSB (Naskar and Bandyopadhyay 2007) is a system based on an extended version of the Lesk algorithm (Lesk 1986), evaluated on S07AW. TKB-UO (Anaya-Sánchez, Pons-Porrata, and Berlanga-Llavori 2007), which was evaluated in S07CG, clusters WordNet senses and uses so-called topic signatures based on WordNet information for disambiguation. IRST-DDD-00 (Strapparava, Gliozzo, and Giuliano 2004) is a system based on WordNet domains which leverages large unannotated corpora. They obtained excellent results, but their calculation of scores takes into account synset probabilities from SemCor, and the system can thus be considered to use some degree of supervision. We consider that systems which make use of information derived from hand-annotated corpora need to be singled out as having some degree of supervision. This includes systems using the MFS heuristic, as it is derived from hand-annotated corpora. In the case of the English WordNet, the use of the first sense also falls in this category, as the order of senses in WordNet is based on sense counts from hand-annotated corpora. Note that for wordnets in other languages hand-annotated corpora are scarce, and thus our main results do not use this information. Section 6.4.7 analyzes the results of our system when combined with this information.
Among supervised systems, the best supervised systems at competition time are reported in a single row (Mihalcea 2002; Decadt et al. 2004; Chan, Ng, and Zhong 2007; Tratz et al. 2007). We also report Zhong10 (Zhong and Ng 2010), which is a freely available supervised system giving some of the strongest results in WSD.
We will now discuss in detail the systems that are most similar to our own. We first review the WordNet versions and relations used by each system. Mih05 (Mihalcea 2005) and Sinha07 (Sinha and Mihalcea 2007) apply several similarity methods, which use WordNet information from versions 1.7.1 and 2.0, respectively, including all relations and the text in the glosses. Tsatsa10 (Tsatsaronis, Varlamis, and Nørvåg 2010) uses WordNet 2.0. Agirre08 (Agirre and Soroa 2008) experimented with several LKBs formed by combining relations from different sources and versions, including WordNet 1.7 and eXtended WordNet. Nav05 and Nav10 (Navigli and Velardi 2005; Navigli and Lapata 2010) use WordNet 2.0, enriched with manually added co-occurrence relations which are not publicly available.
We can see in Table 3 that the combination of Personalized PageRank and LKB presented in this article outperforms both Mih05 and Sinha07. In order to factor out the difference in the WordNet version, we performed experiments using WN2.1 and eXtended WordNet, yielding 58.7 and 56.5 F1 for S2AW and S3AW, respectively. A head-to-head comparison is not possible, but the systems use similar information: they use the text of the glosses, which our algorithm cannot use directly, so we use the disambiguated glosses as delivered in eXtended WordNet instead. All in all, the results suggest that analyzing the LKB structure as a graph is preferable to computing pairwise similarity measures over synsets to build a custom graph and then applying graph measures. The results of various in-house experiments replicating Mih05 also confirmed this observation. Note also that our methods are simpler than the combination strategy used in Sinha07.
Nav05 (Navigli and Velardi 2005) uses a knowledge-based WSD method based on lexical chains called structural semantic interconnections (SSI). The SSI method was evaluated on the SensEval-3 data set, as shown in row Nav05 in Table 3. Note that the method labels an instance with the MFS of the word if the algorithm produces no output for that instance, which makes comparison to our system unfair, especially given the fact that the MFS performs better than SSI. In fact, it is not possible to separate the effect of SSI from that of the MFS, and we thus report it as using some degree of supervision in the table. A variant of the algorithm called UOR-SSI (Navigli, Litkowski, and Hargraves 2007) (reported in the same row) used a manually added set of 70,000 relations and obtained the best results in S07CG out-of-competition, even better than the best supervised method. Reimplementing SSI is not trivial, so we did not check the performance of a variant of SSI that does not use MFS and that uses the same LKB as our method. Section 6.4.7 analyzes the results of our system when combined with MFS information.
Agirre08 (Agirre and Soroa 2008) uses breadth-first search to extract subgraphs of the WordNet graph for each context to be disambiguated, and then applies PageRank. Our better results seem to indicate that using the full graph instead of those subgraphs would perform better. In order to check whether the better results are due to differences in the information used, the next subsection presents the results of our reimplementation of the systems using the same information as our full-graph algorithm.
Tsatsa10 (Tsatsaronis, Vazirgiannis, and Androutsopoulos 2007; Tsatsaronis, Varlamis, and Nørvåg 2010) also builds the graph using breadth-first search, but weighting each type of edge differently, and using graph-based measures that take into account those weights. This is in contrast to the experiments performed in this article where edges have no weight, and is an interesting avenue for future work.
Nav10 (Navigli and Lapata 2010) first builds a subgraph of WordNet composed of paths between synsets using depth-first search and then applies a set of graph centrality algorithms. The best results are obtained using the degree of the nodes, and they present two variants, depending on how they treat ties: Either they return a sense at random, or they return the most frequent sense. For fair comparison to our system (which does not use MFS as a back-off), Table 3 reports the former variant as Nav10. This system is better than ours in one data set and worse in another. They use 60,000 relations that are not publicly available, but they do not use eXtended WordNet relations. In order to check whether the difference in performance is due to the relations used or the algorithm, the next subsection presents a reimplementation of their best graph-based algorithms using the same LKB as we do. In earlier work (Navigli and Lapata 2007) they test a similar system on S3AW, but report results only for nouns, verbs, and adjectives (F1 of 61.9, 36.1, and 62.8, respectively), all of which are below the results of our system (cf. Table 2).
6.3 Comparison with Related Algorithms
The previous section shows that our algorithm when applied to a LKB built from WordNet and eXtended WordNet outperforms other knowledge-based systems in all cases but one system in one data set. In this section we factor out algorithm and LKB, and present the results of other graph-based methods for WSD using the same WordNet versions and relations as in the previous section. As we mentioned in Section 2, ours is the only method using the full WordNet graph. Navigli and Lapata (2010) and Ponzetto and Navigli (2010) build a custom graph based on the relations in WordNet as follows: For each sense si of each word in the context, a depth-first search (dfs for short) is conducted through the WordNet graph starting in si until another sense sj of a word in the context is found or maximum distance is reached. The maximum distance was set by the authors to 6. All nodes and edges between si and sj, inclusive, are added to the subgraph. Graph-based measures are then used to select the output senses for each target word, with degree and PageRank yielding the best results. In closely related work, Agirre and Soroa (2008) and Tsatsaronis, Varlamis, and Nørvåg (2010) use breadth-first search (bfs) over the whole graph, and keep all paths connecting senses. Note that unlike the dfs approach, bfs does not require any threshold. The subgraphs obtained by each of these methods are slightly different.
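For concreteness, a sketch of the dfs subgraph extraction as we reimplemented it is given below, with the depth threshold of 6; this is our reading of the method, and details may differ from the original implementations.

```python
def dfs_subgraph(graph, context_senses, max_depth=6):
    """Keep every path of at most max_depth edges that connects two
    different context senses; the union of those paths is the subgraph."""
    targets = set(context_senses)
    sub_edges = set()  # edges kept as ordered pairs for simplicity

    def dfs(path):
        depth = len(path) - 1          # edges used so far
        node = path[-1]
        for nxt in graph.get(node, ()):
            if nxt in path:            # avoid cycles
                continue
            if nxt in targets:         # reached another context sense
                full = path + [nxt]
                sub_edges.update(zip(full, full[1:]))
            elif depth + 1 < max_depth:
                dfs(path + [nxt])

    for s in targets:
        dfs([s])
    return sub_edges
```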
We reimplemented both strategies, namely, dfs with threshold 6 and bfs with no threshold. Table 4 shows the overall results of degree and PageRank for both kinds of subgraphs (differences that are statistically significant with respect to the best result in each column are marked with *; non-significant differences are marked with ⁰). dfs yields slightly better results than bfs, but pprw2w over the full graph beats the degree and PageRank variants in all four data sets, with statistical significance.
| Method | Reference | S2AW | S3AW | S07AW | S07CG |
|---|---|---|---|---|---|
| dfs degree | Nav10, Ponz10 | 58.4* | 56.4* | 40.3* | 79.4* |
| bfs degree | | 57.9* | 56.5* | 39.9* | 79.2* |
| dfs PageRank | Nav10 | 58.2* | 56.4* | 39.9* | 79.6* |
| bfs PageRank | Agirre08 | 57.7* | 56.7* | 39.7* | 79.4* |
| dfs ppr | | 59.3* | 58.2 | 41.4⁰ | 78.1* |
| bfs ppr | | 58.8⁰ | 57.5⁰ | 41.2⁰ | 78.8* |
| dfs pprw2w | | 58.7⁰ | 58.0⁰ | 41.2* | 79.7* |
| bfs pprw2w | | 58.1* | 57.9⁰ | 41.9 | 79.5* |
| pprw2w | | 59.7 | 57.9⁰ | 41.7⁰ | 80.1 |
In addition, we ran ppr and pprw2w on the dfs and bfs subgraphs, and obtained better results than degree and PageRank in all data sets. ppr on dfs subgraphs and pprw2w on bfs subgraphs are best in S3AW and S07AW, respectively, although the differences with respect to full-graph pprw2w are not statistically significant. pprw2w on the full graph is best in the other two data sets, with statistical significance.
From these results we can conclude that ppr and pprw2w yield the best results also for subgraphs. Regarding the use of the full graph with respect to dfs or bfs, the performances for pprw2w are very similar, but using the full graph gives a small advantage. Section 6.4.5 provides an analysis of efficiency.
6.4 Analysis of Performance Factors
The behavior of the WSD system is influenced by a set of parameters that can yield different results. In our main experiments we did not perform any parameter tuning; we just used some default values which were found to be useful according to previous work. In this section we perform a post hoc analysis of several parameters on the general performance of the system, reporting F1 on a single data set, S2AW.
6.4.1 PageRank Parameters
The PageRank algorithm has two main parameters, the so-called damping factor and the number of iterations (or, conversely, the convergence threshold), which we set as 0.85 and 30, respectively (cf. Section 4). Figure 3 depicts the effect of varying the number of iterations. It shows that the algorithm converges very quickly: One sole iteration yields relatively high performance, and 20 iterations are enough to achieve convergence. Note also that the performance is in the [58.0, 58.5] range for iterations over 5. Note that we use the same range of F1 for the y axis of Figures 3, 4, and 5 for easier comparison.
Figure 4 shows the effect of varying the damping factor. Note that a damping factor of zero means that the PageRank value coincides with the initial probability distribution. Given the way we initialize the distribution (cf. Section 5.2), this would mean that the algorithm is not able to disambiguate the target words. Thus, the initial value in Figure 4 corresponds to a damping factor of 0.001. On the other hand, a damping factor of 1 yields the same results as the static method (cf. Section 5.1). The best value is attained with 0.90, with similar values around it (less than 0.5 absolute points of variation), in agreement with previous results which preferred values in the 0.85–0.95 interval (Haveliwala 2002; Langville and Meyer 2003; Mihalcea 2005).
6.4.2 Size of Context Window
Figure 5 shows the performance of the system when trying different context windows for the target words. The best context size is for windows of 20 content words, with less than 0.5 absolute point losses for windows in the [5, 25] range.
6.4.3 Using Different WordNet Versions
There has been little research on the best strategy to use when dealing with data sets and resources attached to different WordNet versions. Table 5 shows the results for the four data sets used in this study when using different WordNet versions. Two of the data sets (S2AW and S3AW) were tagged with senses from version 1.7, S07AW with senses from version 2.1, and S07CG with coarse senses built on version 2.1 senses.
| Data set | version | 1.7 + xwn | 2.1 + xwn | 3.0 + xwn |
|---|---|---|---|---|
| S2AW | 1.7 | 59.7 | 58.7 | 58.4 |
| S3AW | 1.7 | 57.9 | 56.5 | 56.8 |
| S07AW | 2.1 | 40.7 | 41.7 | 40.9 |
| S07CG | 2.1 coarse | 79.6 | 80.1 | 79.6 |
Given the fact that WordNet 3.0 is a more recent version that includes more relations, one would hope that using it would provide the best results (Cuadros and Rigau 2008; Navigli and Lapata 2010). We built a graph analogous to the ones for versions 1.7 and 2.1, but using the hand-disambiguated glosses instead of the eXtended WordNet glosses. We used freely available mappings (Daude, Padro, and Rigau 2000) to convert our eXtended WordNet relations to 3.0, and to map the WordNet 3.0 sense output back to the version used in each data set. In addition, we also tested WN1.7 on S07AW and S07CG, and WN2.1 on S2AW and S3AW, also using the mappings from Daude, Padro, and Rigau (2000).
Table 5 shows that the best results are obtained using our algorithm on the same WordNet version as used in the respective data set. When testing on data sets tagged with WordNet 1.7, similar results are obtained using 2.1 or 3.0. When testing on data sets based on 2.1, 3.0 has a small lead over 1.7. In any case, the differences are small, ranging from 0.5 to 1.4 absolute points. All in all, it seems that the changes introduced by different versions slightly deteriorate the results, and the best strategy is to use the same WordNet version as was used for tagging.
6.4.4 Using xwn vs. WN3.0 Gloss Relations
WordNet 3.0 was released with an accompanying data set comprising glosses in which some of the words have been manually disambiguated. In Table 6 we present the results of using these glosses with the WN3.0 graph, showing that the results are lower than when using XWN relations. We also checked the use of WN3.0 gloss relations with other WordNet versions, and the results using XWN were always slightly better. We hypothesize that the better results for XWN are due to the amount of relations, with XWN holding 61% more relations than the WN3.0 glosses. Still, the best results are obtained with the combination of both kinds of gloss relations.
6.4.5 Analysis of Relations
Previous results were obtained using all the relations of WordNet and taking eXtended WordNet relations into account. In this section we analyze the effect of the relation types on the whole process, following the relation groups presented in Table 1. Table 7 shows the results when using different combinations of relation types. The eXtended WordNet (xwn) relations appear to be the most valuable for random walk WSD: their performance alone is as good as that of the whole graph, and they produce a large drop when ablated from the graph. Ignoring antonymy relations produces a small improvement, but the differences between using all the relations, eliminating antonyms, and using xwn relations only are too small to draw any further conclusions. It seems that given the xwn relations (the most numerous, cf. Section 3), our algorithm is fairly robust to the addition or deletion of other, less numerous kinds of relations.
Table 7: Results for each relation group used alone (single), together with taxonomy (+ tax), and ablated from the full graph (ablation).

relation | single | + tax | ablation
---|---|---|---
tax | 37.4 | – | 59.9
ant | 19.1 | 42.1 | 59.9
mer | 23.4 | 36.4 | 59.6
rel | 35.4 | 46.1 | 59.6
xwn | 59.9 | 59.8 | 47.1
Reference system (all relations) | 59.7 | |
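As a minimal illustration of the ablation setup described above, the following Python sketch enumerates the configurations of Table 7 over an LKB represented as a list of typed edges. The relation group names follow Table 1; the toy edges and the omitted random-walk step are assumptions of the example.

```python
# Hedged sketch of the relation-ablation configurations of Table 7.
# The LKB is a list of (source, target, relation_group) edges; the actual
# random-walk WSD step over each filtered graph is not shown.

GROUPS = ["tax", "ant", "mer", "rel", "xwn"]

def filter_edges(edges, keep):
    """Keep only the edges whose relation group is in `keep`."""
    return [(u, v) for (u, v, group) in edges if group in keep]

def configurations():
    """Yield the settings of Table 7: each group alone (single), each group
    plus taxonomy (+ tax), and the full set minus one group (ablation)."""
    for g in GROUPS:
        yield "single:" + g, {g}
        if g != "tax":
            yield "+tax:" + g, {g, "tax"}
        yield "ablate:" + g, set(GROUPS) - {g}

# Toy LKB with three typed edges.
edges = [("s1", "s2", "tax"), ("s1", "s3", "xwn"), ("s2", "s4", "ant")]
for name, keep in configurations():
    print(name, filter_edges(edges, keep))
```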
6.4.6 Behavior with Respect to static and MFS
The high results of the very simple static method (PageRank with no context) seem to imply that there is no need to use context for disambiguation. Our intuition was that the synsets corresponding to the most frequent senses have more relations. We thus computed the correlation between systems, gold tags, and MFS. In order to make the correlation results comparable to the figures used in evaluation, we count the number of times two sets of results agree, divided by the number of results returned by the first system. Table 8 shows the resulting matrix of pairwise correlations. In the row for gold tags, the results reflect the performance of each system (its precision). In the MFS column, static shows a slightly larger correlation with MFS than the other two methods. The matrix also shows that all three of our methods agree more than 80% of the time, with ppr and static having the relatively smallest agreement.
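The agreement measure just described is straightforward to compute; the sketch below spells it out for hypothetical system outputs given as dictionaries from instance identifiers to sense tags (identifiers and sense keys are invented). Note that the row computed against gold tags reproduces each system's precision, as stated above.

```python
# Sketch of the pairwise agreement measure behind Table 8: the number of
# instances on which two result sets assign the same sense, divided by the
# number of results returned by the first one. All data here is invented.

def agreement(first, second):
    """Agreement of `second` with `first`, normalized by |first|."""
    if not first:
        return 0.0
    matches = sum(1 for inst, sense in first.items()
                  if second.get(inst) == sense)
    return matches / len(first)

results = {
    "gold":   {"d001": "bank%1", "d002": "run%2", "d003": "cold%1"},
    "ppr":    {"d001": "bank%1", "d002": "run%3", "d003": "cold%1"},
    "static": {"d001": "bank%2", "d002": "run%3", "d003": "cold%1"},
}
for a in results:
    for b in results:
        print(a, "vs", b, round(agreement(results[a], results[b]), 2))
```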
In contrast, related work using the same techniques on domain-specific words (Agirre, López de Lacalle, and Soroa 2009) shows that the results of our Personalized PageRank models depart significantly from MFS and static. Table 9 shows the results of the three techniques on the three subcorpora that constitute the evaluation data set published in Koeling, McCarthy, and Carroll (2005). This data set consists of examples retrieved from the Sports and Finance sections of the Reuters corpus, and from the balanced British National Corpus (BNC), which serves as a general-domain contrast corpus.
Table 9: Results on the domain-specific data set of Koeling, McCarthy, and Carroll (2005).

System | BNC | Sports | Finance
---|---|---|---
MFS | 34.9 | 19.6 | 37.1
static | 36.6 | 20.1 | 39.6
pprw2w | 37.7 | 51.5 | 59.3
Applying static PageRank over the entire WordNet graph yields low results, very similar to those of MFS and below those of Personalized PageRank. This confirms that static PageRank is closely related to MFS, as we hypothesized in Section 5.1 and showed in Table 8 for the other general-domain data sets. Whereas the results of pprw2w are very similar to those of static and MFS on the general-domain BNC, pprw2w departs from static and MFS by roughly 30 and 20 points on the domain-specific Sports and Finance corpora, respectively. These results are highly relevant, because they show that ppr is able to effectively use contextual information and depart from the MFS and static baselines.
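The contrast between static and Personalized PageRank can be illustrated in a few lines of code. The sketch below uses the networkx library and an invented toy graph rather than the authors' UKB implementation, and it simplifies the article's setup by personalizing directly on the candidate synsets of the context words instead of inserting word nodes into the graph.

```python
# Hedged illustration: static PageRank (uniform teleport, no context)
# versus Personalized PageRank (teleport mass on the context synsets).
# Toy graph; not the authors' UKB code.

import networkx as nx

G = nx.Graph()
# Two senses of "bank" linked to related synsets (all nodes invented).
G.add_edges_from([
    ("bank#1", "money#1"), ("bank#1", "deposit#1"),
    ("bank#2", "river#1"), ("bank#2", "water#1"),
    ("money#1", "deposit#1"), ("river#1", "water#1"),
])

# static: uniform teleport vector, independent of the sentence.
static_pr = nx.pagerank(G, alpha=0.85)

# Personalized: teleport only to the synsets of the context words
# ("river", "water"), so probability mass flows toward bank#2.
ppr = nx.pagerank(G, alpha=0.85,
                  personalization={"river#1": 0.5, "water#1": 0.5})

print("static:", static_pr["bank#1"], static_pr["bank#2"])  # equal by symmetry
print("ppr:   ", ppr["bank#1"], ppr["bank#2"])              # bank#2 wins
```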
6.4.7 Combination with MFS
As mentioned in Section 6.2, we have avoided using any information about sense frequencies from annotated corpora, as this information is not available for all wordnets. In this section we report the results of our algorithm when prior probabilities of senses, derived from sense counts, are taken into account. We used the sense counts provided with WordNet in the index.sense file.8 In this setting, the edges linking words with their respective senses are weighted according to the prior probabilities of those senses, instead of receiving uniform weights as in Section 5.2.
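A sketch of how such priors can be derived from the index.sense file is given below. The four-column line format (sense key, synset offset, sense number, tag count) is that of the file distributed with WordNet; the path and the add-one smoothing are our own illustrative choices, not necessarily those used in the experiments.

```python
# Hedged sketch: sense priors from WordNet's index.sense, whose lines are
# '<sense_key> <synset_offset> <sense_number> <tag_cnt>'. The smoothing
# constant and file path are illustrative assumptions.

from collections import defaultdict

def sense_priors(path, smooth=1.0):
    counts = defaultdict(dict)  # lemma -> {synset_offset: tag count}
    with open(path) as f:
        for line in f:
            sense_key, offset, _num, tag_cnt = line.split()
            lemma = sense_key.split("%")[0]  # lemma precedes '%' in the key
            counts[lemma][offset] = counts[lemma].get(offset, 0) + int(tag_cnt)
    priors = {}
    for lemma, senses in counts.items():
        total = sum(senses.values()) + smooth * len(senses)
        priors[lemma] = {off: (c + smooth) / total
                         for off, c in senses.items()}
    return priors

# priors = sense_priors("WordNet-3.0/dict/index.sense")
# priors["bank"] would then weight the edges linking "bank" to its senses.
```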
Table 10 shows that the results when using sense priors improve over those of the original pprw2w in all data sets. The improvement varies across parts of speech; for instance, the results for nouns in S07CG actually decrease (shown in the rightmost column of Table 10). In addition, pprw2w with MFS information improves over the MFS baseline in all cases except S07AW.
Table 10: Results using sense priors, compared with the original pprw2w, MFS, and related systems; the parenthesized rightmost column gives results for nouns only on S07CG.

System | S2AW | S3AW | S07AW | S07CG | (nouns)
---|---|---|---|---|---
pprw2w | 59.7 | 57.9 | 41.7 | 80.1 | (83.6)
pprw2w MFS | 62.6 | 63.0 | 48.6 | 81.4 | (82.1)
MFS | 60.1 | 62.3 | 51.4 | 78.9 | (77.4)
IRST-DDD-00 | | 58.3 | | |
Nav05 / UOR-SSI | | 60.4 | | 83.2 | (84.1)
Ponz10 | | | | 81.7 | (85.5)
The table also reports the best systems that do use MFS (see Section 6.3 for detailed explanations). For S2AW and S07AW we have no comparable systems to reference. For S3AW, our system performs best. In the case of S07CG, UOR-SSI reports better results than our system. Finally, the last row shows the system of Ponzetto and Navigli (2010) combined with MFS information as back-off, which also attains better results than ours. We tried a combination method similar to theirs, but did not manage to improve our results.
6.4.8 Efficiency of Full Graphs vs. Subgraphs
Given the very similar results of our algorithm when using full graphs and subgraphs (cf. Section 6.3), we studied the efficiency of each. We benchmarked several graph-based methods on the S2AW data set, which comprises 2,473 instances to be disambiguated. All tests were run on a multicore computer with 16 GB of memory, using a single 2.66 GHz processor. When using the full graph, ppr disambiguates whole sentences in one go at 684 instances per minute, whereas pprw2w disambiguates one word at a time, at 70 instances per minute. The dfs subgraphs provide better throughput than full-graph pprw2w: 228 instances per minute when using degree, and marginally less when using pprw2w (210 instances per minute). The bfs subgraphs are slowest, at around 20 instances per minute. The memory footprint of the full-graph algorithm is small, just 270 MB, so several processes can easily be run on a multiprocessor machine.
All in all, there is a trade-off between accuracy and speed: pprw2w on the full graph provides the best results at the cost of some speed, whereas ppr on the full graph provides the best speed at the cost of lower accuracy. Running pprw2w on dfs subgraphs lies in between and is also a good alternative, and its speed can be improved further using pre-indexed paths. A minimal timing harness is sketched below.
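For completeness, a timing harness of the kind used to obtain the throughput figures above might look as follows; the `disambiguate` callable stands in for any of the configurations (ppr, pprw2w, subgraph variants) and is an assumption of the sketch.

```python
# Hedged sketch: measuring throughput in instances per minute, as reported
# in the benchmark above. `disambiguate` is a placeholder for any method.

import time

def instances_per_minute(disambiguate, instances):
    start = time.perf_counter()
    for instance in instances:
        disambiguate(instance)
    elapsed = time.perf_counter() - start
    return 60.0 * len(instances) / elapsed

# Example with a dummy disambiguator over fake instances.
rate = instances_per_minute(lambda inst: None, list(range(2473)))
print(round(rate))
```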
6.5 Experiments on Spanish
Our WSD algorithm can be applied to non-English text, provided an LKB exists for the language in question. We applied our random walk algorithms to the Spanish WordNet (Atserias, Rigau, and Villarejo 2004), using the SemEval-2007 Task 09 data set as the evaluation gold standard (Màrquez et al. 2007). The data set contains examples of the 150 most frequent nouns in the CESS-ECE corpus, manually annotated with Spanish WordNet synsets. It is split into train and test parts, and has an “all words” shape (i.e., the input consists of sentences, each having at least one occurrence of a target noun). We ran the experiment over the test part (792 instances) and used the train part to calculate the MFS heuristic. The results in Table 11 are consistent with those for English, with our algorithms approaching MFS performance and pprw2w yielding the best results. Note that on this data set the best supervised system could barely improve over MFS, which performs very well, suggesting that the sense distributions in this particular data set are highly skewed.
Table 11: Accuracy on the Spanish SemEval-2007 Task 09 test data.

Method | Acc.
---|---
ppr | 78.4*
pprw2w | 79.3
static | 76.5*
First sense | 66.4*
MFS | 84.6*
Best | 85.1*
Finally, we also show results for the first sense in the Spanish WordNet. In the Spanish WordNet the order of the senses of a word was assigned directly by the lexicographers (Atserias, Rigau, and Villarejo 2004), as no sense frequency information from hand-annotated corpora is available. This contrasts with the English WordNet, where senses are ordered according to their frequency in annotated corpora (Fellbaum 1998), and reflects the situation of most other wordnets. In this case, our algorithm clearly improves over the first sense in the dictionary.
7. Conclusions
In this article we present a method for knowledge-based Word Sense Disambiguation based on random walks over relations in a LKB. Our algorithm uses the full graph of WordNet efficiently, and performs better than PageRank or degree on subgraphs (Navigli and Lapata 2007; Agirre and Soroa 2008; Navigli and Lapata 2010; Ponzetto and Navigli 2010). We also show that our combination of method and LKB built from WordNet and eXtended WordNet compares favorably to other knowledge-based systems using similar information sources (Mihalcea 2005; Sinha and Mihalcea 2007; Tsatsaronis, Vazirgiannis, and Androutsopoulos 2007; Tsatsaronis, Varlamis, and Nørvåg 2010). Our analysis shows that Personalized PageRank yields similar results when using subgraphs and the full graph, with a trade-off between speed and performance, where Personalized PageRank over the full graph is fastest, its word-to-word variant slowest, and Personalized PageRank over the subgraph lies in between.
We also show that the algorithm can be easily ported to other languages with good results, the only requirement being a wordnet for the language. Our results improve over the first sense of the Spanish dictionary. This is particularly relevant for wordnets other than the English one: in the English WordNet the senses of a word are ordered according to their frequency in hand-annotated corpora, and thus the first sense is equivalent to the most frequent sense, but this information is not available for languages that lack large-scale hand-annotated corpora.
We have performed an extensive analysis, showing the behavior of the algorithm according to the PageRank parameters, and studying the impact of different relations and WordNet versions. We have also analyzed the relation between our ppr algorithm, MFS, and static PageRank. On general-domain corpora they obtain similar results, close to the performance of the MFS learned from SemCor, but the results reported on domain-specific data sets (Agirre, López de Lacalle, and Soroa 2009) show that ppr moves away from MFS and static PageRank and improves over them, indicating that it is able to effectively use contextual information.
The experiments in this study are readily reproducible, as the algorithm and the LKBs are publicly available.9 The system can be applied easily to sense inventories and knowledge bases different from WordNet.
In the future we would like to explore methods to incorporate global weights of the edges in the random walk calculations (Tsatsaronis, Varlamis, and Nørvåg 2010). Given the complementarity of the WordNet++ resource (Ponzetto and Navigli 2010) and our algorithm, it would be very interesting to explore the combination of both, as well as the contribution of other WordNet related resources (Cuadros and Rigau 2008).
Acknowledgments
We are deeply indebted to the reviewers, who greatly helped to improve the article. Our thanks to Rob Koeling and Diana McCarthy for kindly providing the data set, thesauri, and assistance, and to Roberto Navigli and Simone Ponzetto for clarifying the method to map to coarse-grained senses. This work has been partially funded by the Education Ministry (project KNOW2 TIN2009-15049-C03-01).
Notes
The equation becomes P = cMP + (ca + (1 − c)e)v, where a_i = 1 if node i is a dangling node and 0 otherwise, and e is a vector of all ones.
Personal communication.
The task was co-organized by the authors.
Author notes
Informatika Fakultatea, Manuel Lardizabal 1, 20018 Donostia, Basque Country. E-mail: [email protected].
IKERBASQUE, Basque Foundation for Science, 48011, Bilbao, Basque Country. E-mail: [email protected].
Informatika Fakultatea, Manuel Lardizabal 1, 20018 Donostia, Basque Country. E-mail: [email protected].