Abstract
This article presents a new model for word sense disambiguation formulated in terms of evolutionary game theory, where each word to be disambiguated is represented as a node on a graph whose edges represent word relations, and senses are represented as classes. The words simultaneously update their class membership preferences according to the senses that neighboring words are likely to choose. We use distributional information to weigh the influence that each word has on the decisions of the others, and semantic similarity information to measure the strength of compatibility among the choices. With this information we can formulate the word sense disambiguation problem as a constraint satisfaction problem and solve it using tools derived from game theory, maintaining the textual coherence. The model is based on two ideas: similar words should be assigned similar classes, and the meaning of a word does not depend on all the words in a text but only on some of them. The article provides an in-depth motivation of the idea of modeling the word sense disambiguation problem in terms of game theory, illustrated by an example. It also presents an extensive analysis of the combinations of similarity measures to use in the framework and a comparison with state-of-the-art systems. The results show that our model outperforms state-of-the-art algorithms and can be applied to different tasks and in different scenarios.
1. Introduction
Word Sense Disambiguation (WSD) is the task of identifying the intended meaning of a word based on the context in which it appears (Navigli 2009). It has been studied since the beginnings of Natural Language Processing (NLP) (Weaver 1955) and today it is still a central topic of this discipline. This is because it is important for many NLP tasks such as text understanding (Kilgarriff 1997), text entailment (Dagan and Glickman 2004), machine translation (Vickrey et al. 2005), opinion mining (Smrž 2006), sentiment analysis (Rentoumi et al. 2009), and information extraction (Zhong and Ng 2012). All these applications can benefit from the disambiguation of ambiguous words as a preliminary process; otherwise they remain at the surface level of the words, compromising the coherence of the data to be analyzed (Pantel and Lin 2002).
To solve this problem, over the past few years, the research community has proposed several algorithms based on supervised (Tratz et al. 2007; Zhong and Ng 2010), semi-supervised (Navigli and Velardi 2005; Pham, Ng, and Lee 2005), and unsupervised (Mihalcea 2005; McCarthy et al. 2007) learning models. Nowadays, although supervised methods perform better in general domains, unsupervised and semi-supervised models are receiving increasing attention from the research community, with performances close to the state of the art of supervised systems (Ponzetto and Navigli 2010). In particular, knowledge-based and graph-based algorithms are emerging as promising approaches to solve the disambiguation problem (Sinha and Mihalcea 2007; Agirre et al. 2009). The peculiarities of these algorithms are that they do not require any corpus evidence and use only the structural properties of a lexical database to perform the disambiguation task. In fact, unsupervised methods are able to overcome a common problem in supervised learning—the knowledge acquisition problem, which requires the production of large-scale resources, manually annotated with word senses.
Knowledge-based approaches exploit the information from knowledge resources such as dictionaries, thesauri, or ontologies and compute sense similarity scores to disambiguate words in context (Mihalcea 2006). Graph-based approaches model the relations among words and senses in a text with graphs, representing words and senses as nodes and the relations among them as edges. From this representation the structural properties of the graph can be extracted and the most relevant concepts in the network can be computed (Agirre et al. 2006; Navigli and Lapata 2007).
Our approach falls within these two lines of research; it uses a graph structure to model the geometry of the data points (the words in a text) and a knowledge base to extract the senses of each word and to compute the similarity among them. The most important difference between our approach and state-of-the-art graph-based approaches (Véronis 2004; Sinha and Mihalcea 2007; Navigli and Lapata 2010; Agirre, de Lacalle, and Soroa 2014; Moro, Raganato, and Navigli 2014) is that in our method the graph contains only words and not senses. This graph is used to model the pairwise interaction among words and not to rank the senses in the graph according to their relative importance.
The starting point of our research is based on two fundamental assumptions:

1. The meaning of a sentence emerges from the interaction of the components that are involved in it.
2. These interactions are different and must be weighted in order to supply the right amount of information.

These assumptions can be illustrated with the following two sentences:

- There is a financial institution near the river bank.
- They were troubled by insects while playing cricket.
Our approach is based on the principle that the senses of words that share a strong relation must be similar. The idea of assigning a similar class to similar objects has been implemented in a different way by Kleinberg and Tardos (2002), within a Markov random field framework. They have shown that it is beneficial in combinatorial optimization problems. In our case, this idea can preserve the textual coherence—a characteristic that is missing in many state-of-the-art systems, in particular in systems in which the words are disambiguated independently. In contrast, our approach disambiguates all the words in a text concurrently, using an underlying structure of interconnected links, which models the interdependence between the words. In so doing, we model the idea that the meaning of any word depends, at least implicitly, on the combined meaning of all the interacting words.
In our study, we model these interactions by developing a system in which it is possible to map lexical items onto concepts exploiting contextual information in a way in which collocated words influence each other simultaneously, imposing constraints in order to preserve the textual coherence. For this reason, we have decided to use a powerful tool, derived from game theory: the non-cooperative game (see Section 4). In our system, the nodes of the graph are interpreted as players, in the game theoretic sense (see Section 4), that play a game with the other words in the graph in order to maximize their utility; constraints are defined as similarity measures among the senses of two words that are playing a game. The concept of utility has been used in different ways in the game theory literature; in general, it refers to the satisfaction that a player derives from the outcome of a game (Szabó and Fath 2007). From our point of view, increasing the utility of a word means increasing the textual coherence in a distributional semantics perspective (Firth 1957). In fact, it has been shown that collocated words tend to have a determined meaning (Gale, Church, and Yarowsky 1992; Yarowsky 1993).
Game theoretic frameworks have been used in different ways to study language use (Pietarinen 2007; Skyrms 2010) and evolution (Nowak, Komarova, and Niyogi 2001), but to the best of our knowledge, our method is the first attempt to use it in a specific NLP task. This choice is motivated by the fact that game theoretic models are able to perform a consistent labeling of the data (Hummel and Zucker 1983; Pelillo 1997), taking into account contextual information. These features are of great importance for an unsupervised or semi-supervised algorithm, which tries to perform a WSD task, because, by assumption, the sense of a word is given by the context in which it appears. Within a game theoretic framework we are able to cast the WSD problem as a continuous optimization problem, exploiting contextual information in a dynamic way. Furthermore, no supervision is required and the system can adapt easily to different contextual domains, which is exactly what is required for a WSD algorithm.
A further reason for using a consistent labeling system is that it is able to deal with semantic drifts (Curran, Murphy, and Scholz 2007). In fact, as shown in the two example sentences, concentrating the disambiguation task of a word on highly collocated words, taking into account proximity (or even syntactic) information, guides the interpretation only towards senses that are strongly related to the word that has to be disambiguated.
In this article, we provide a detailed discussion about the motivation behind our approach and a full evaluation of our algorithm, comparing it with state-of-the-art systems in WSD tasks. In a previous work we used a similar algorithm in a semi-supervised scenario (Tripodi, Pelillo, and Delmonte 2015), casting the WSD task as a graph transduction problem. Now we have extended that work, making the algorithm fully unsupervised. Furthermore, in this article we provide a complete evaluation of the algorithm extending our previous works (Tripodi and Pelillo 2015), exploiting proximity relations among words.
An important feature of our approach is that it is versatile. In fact, the method can adapt to different scenarios and to different tasks, and it is possible to use it in an unsupervised or semi-supervised way. The semi-supervised approach, presented in Tripodi, Pelillo, and Delmonte (2015), is a bootstrapping graph-based method, which propagates, over the graph, the information from labeled nodes to unlabeled ones. In this article, we also provide a new semi-supervised version of the approach, which can exploit the evidence from sense-tagged corpora or the most frequent sense heuristic and does not require labeled nodes to propagate the labeling information.
We tested our approach on different data sets from WSD and entity-linking tasks in order to find the similarity measures that perform better, and evaluated our approach against unsupervised, semi-supervised, and supervised state-of-the-art systems. The results of this evaluation show that our method performs well and can be considered as a valid alternative to current models.
2. Related Work
There are two major paradigms in WSD: supervised and knowledge-based. Supervised algorithms learn, from sense-labeled corpora, a computational model of the words of interest. Then, the obtained model is used to classify new instances of the same words. Knowledge-based algorithms perform the disambiguation task by using an existing lexical knowledge base, which usually is structured as a semantic network. Then, these approaches use graph algorithms to disambiguate the words of interest, based on the relations that these words' senses have in the network (Pilehvar and Navigli 2014).
A popular supervised WSD system, which has shown good performances in different WSD tasks, is It Makes Sense (Zhong and Ng 2010). It takes as input a text and, for each content word (noun, verb, adjective, or adverb), outputs a list of possible senses, extracted from a knowledge base and ranked according to the likelihood of appearing in the given context. The training data used by this system are derived from SemCor (Miller et al. 1993) and DSO (Ng and Lee 1996), and collected automatically by exploiting parallel corpora (Chan and Ng 2005). Its default classifier is LIBLINEAR with a linear kernel and its default parameters.
Unsupervised and knowledge-based algorithms for WSD are attracting great attention from the research community. This is because supervised systems require training data, which are difficult to obtain. In fact, producing sense-tagged data is a time-consuming process, which has to be carried out separately for each language of interest. Furthermore, as investigated by Yarowsky and Florian (2002), the performance of a supervised algorithm degrades substantially with an increase of sense entropy. Sense entropy refers to the distribution over the possible senses of a word, as seen in training data. Additionally, a supervised system has difficulty in adapting to different contexts, because it depends on prior knowledge, which makes the algorithm rigid; therefore, it cannot efficiently adapt to domain specific cases, when other optimal solutions may be available (Yarowsky and Florian 2002).
One of the most common heuristics that allows us to exploit sense tagged data such as SemCor (Miller et al. 1993) is the most frequent sense. It exploits the overall sense distribution for each word to be disambiguated, choosing the sense with the highest probability regardless of any other information. This simple procedure is very powerful in general domains but cannot handle senses with a low distribution, which can be found in specific domains.
With these observations in mind, Koeling et al. (2005) created three domain-specific corpora to evaluate WSD systems. They tested whether WSD algorithms are able to adapt to different contexts, comparing their results with the most frequent sense heuristic computed on general domains corpora. They used an unsupervised approach to obtain the most frequent sense for a specific domain (McCarthy et al. 2007) and demonstrated that their approach outperforms the most frequent sense heuristic derived from general domain and labeled data.
This heuristic for the unsupervised acquisition of the predominant sense of a word consists of collecting all the possible senses of a word and then ranking these senses. The ranking is computed according to the information derived from a distributional thesaurus automatically produced from a large corpus and a semantic similarity measure derived from the sense inventory. Although the authors have demonstrated that this approach is able to outperform the most frequent sense heuristic computed on sense-tagged data on general domains, it is not easy to use in real-world applications, especially when the domain of the text to be disambiguated is not known in advance.
Other unsupervised and semi-supervised approaches, rather than computing the prevalent sense of a word, try to identify the actual sense of a word in a determined phrase, exploiting the information derived from its context. This is the case with traditional algorithms, which exploit the pairwise semantic similarity among a target word and the words in its context (Lesk 1986; Resnik 1995; Patwardhan, Banerjee, and Pedersen 2003). Our work could be considered as a continuation of this tradition, which tries to identify the intended meaning of a word given its context, using a new approach for the computation of the sense combinations.
Graph-based algorithms for WSD are gaining much attention in the NLP community. This is because graph theory is a powerful tool that can be used both for the organization of the contextual information and for the computation of the relations among word senses. It allows us to extract the structural properties of a text. Examples of this kind of approach construct a graph from all the senses of the words in a text and then use connectivity measures in order to identify the most relevant word senses in the graph (Navigli and Lapata 2007; Sinha and Mihalcea 2007). Navigli and Lapata (2007) conducted an extensive analysis of graph connectivity measures for unsupervised WSD. Their approach uses a knowledge base, such as WordNet, to collect and organize all the possible senses of the words to be disambiguated in a graph structure, then uses the same resource to search for a path (of predefined length) between each pair of senses in the graph. Then, if it exists, it adds all the nodes and edges on this path to the graph. These measures analyze local and global properties of the graph. Local measures, such as degree centrality and eigenvector centrality, determine the degree of relevance of a single vertex. Global properties, such as compactness, graph entropy, and edge density, analyze the structure of the graph as a whole. The results of the study show that local measures outperform global measures and, in particular, that degree centrality and PageRank (Page et al. 1999) (which is a variant of the eigenvector centrality measure) achieve the best results.
One algorithm that tries to improve centrality algorithms is SUDOKU, introduced by Manion and Sainudiin (2014). It is an iterative approach, which simultaneously constructs the graph and disambiguates the words using a centrality function. It starts by inserting the nodes corresponding to the senses of the words with low polysemy and iteratively inserts the more ambiguous words. The advantages of this method are that the use of small graphs, at the beginning of the process, reduces the complexity of the problem and that it can be used with different centrality measures.
Recently, a new model for WSD has been introduced, based on an undirected graphical model (Chaplot, Bhattacharyya, and Paranjape 2015). It approaches the WSD problem as a maximum a posteriori query on a Markov random field (Jordan and Weiss 2002). The graph is constructed using the content words of a sentence as nodes and connecting them with edges if they share a relation, determined using a dependency parser. The values that each node in the graphical model can take include the senses of the corresponding word. The senses are collected using a knowledge base and weighted using a probability distribution based on the frequency of the senses in the knowledge base. Furthermore, the senses between two related words are weighted using a similarity measure. The goal of this approach is to maximize the joint probability of the senses of all the words in the sentence, given the dependency structure of the sentences, the frequency of the senses, and the similarity among them.
A new graph-based, semi-supervised approach introduced to deal with multilingual WSD (Navigli and Ponzetto 2012b) and entity linking problems is Babelfy (Moro, Raganato, and Navigli 2014). Multilingual WSD is an important task because traditional WSD algorithms and resources are focused on the English language; Babelfy exploits the information from large multilingual knowledge bases, such as BabelNet (Navigli and Ponzetto 2012a), to perform this task. Entity linking consists of disambiguating the named entities in a text and of finding the appropriate resources in an ontology, which correspond to the specific entities mentioned in the text. Babelfy creates the semantic signature of each word to be disambiguated, which consists of collecting, from a semantic network, all the nodes related to a particular concept, exploiting the global structure of the network. This process leads to the construction of a graph-based representation of the whole text. It then applies Random Walk with Restart (Tong, Faloutsos, and Pan 2006) to find the most important nodes in the network, solving the WSD problem.
Approaches that are more similar to ours in the formulation of the problem have been described by Araujo (2007). The author reviewed the literature devoted to the application of different evolutionary algorithms to several aspects of NLP: syntactical analysis, grammar induction, machine translation, text summarization, semantic analysis, document clustering, and classification. Basically, these approaches are search and optimization methods inspired by biological evolution principles. A specific evolutionary approach for WSD has been introduced by Menai (2014). It uses genetic algorithms (Holland 1975) and memetic algorithms (Moscato 1989) in order to improve the performances of a gloss-based method. It assumes that there is a population of individuals, represented by all the senses of the words to be disambiguated, and that there is a selection process, which selects the best candidates in the population. The selection process is defined as a sense similarity function, which gives a higher score to candidates with specific features, increasing their fitness to the detriment of the other population members. This process is repeated until the fitness level of the population stabilizes, and at the end the candidates with the highest fitness are selected as solutions of the problem. Another approach, which addresses the disambiguation problem in terms of state-space search, is GETALP (Schwab et al. 2013). It uses an Ant Colony algorithm to find the best path in a weighted graph, constructed by measuring the similarity of all the senses in a text, and assigns to each word to be disambiguated the sense corresponding to its node on this path.
These methods are similar to our study in the formulation of the problem; the main difference is that our approach is defined in terms of evolutionary game theory. As we show in the next section, this approach ensures that the final labeling of the data is consistent and that the solution of the problem is always found. In fact, our system always converges to the nearest Nash equilibrium from which the dynamics have been started.
3. Word Sense Disambiguation as a Consistent Labeling Problem
WSD can be interpreted as a sense-labeling task (Navigli 2009), which consists in assigning a sense label to a target word. As a labeling problem, we need an algorithm that performs this task in a consistent way, taking into account the context in which the target word occurs. Following this observation, we can formulate the WSD task as a constraint satisfaction problem (Tsang 1995) in which the labeling process has to satisfy certain constraints in order to be consistent. This approach gives us the possibility not only to exploit the contextual information of a word but also to find the most appropriate sense association for the target word and the words in its context. This is the most important contribution of our work, which distinguishes it from existing WSD algorithms. In fact, in some cases using only contextual information without the imposition of constraints can lead to inconsistencies in the assignment of senses to related words.
As an illustrative example we can consider a binary constraint satisfaction problem, which is defined by a set of variables representing the elements of the problem and a set of binary constraints representing the relationships among variables. The problem is considered solved if there is a solution that satisfies all the constraints. This setting can be described in a formal manner as a triple (V, D, R), where V = {v1, …, vn} is the set of variables; D = {Dv1, …, Dvn} is the set of domains, each Dvi denoting a finite set of possible values for variable vi; and R = {Rij | Rij ⊆ Dvi × Dvj} is a set of binary constraints, where each Rij describes a set of compatible pairs of values for the variables vi and vj. Each Rij can be defined as a binary matrix of size p × q, where p and q are the cardinalities of the domains of vi and vj, respectively. An element Rij(λ, λ′) = 1 indicates that the assignment vi = λ is compatible with the assignment vj = λ′. R is used to impose constraints on the labeling so that each label assignment is consistent.
This binary case assumes that the constraints are completely violated or completely respected, which is restrictive; it is more appropriate, in many real-world cases, to have a weight that expresses the level of confidence about a particular assignment (Hummel and Zucker 1983). This notion of consistency has been shown to be related to the Nash equilibrium concept in game theory (Miller and Zucker 1991). We have adopted this method to approach the WSD task in order to perform a consistent labeling of the data. In our case, we can consider variables as words, labels as word senses, and compatibility coefficients as similarity values among two word senses. To explain how the Nash equilibria are computed we need to introduce basic notions of game theory in the following section.
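To make this formulation concrete, the following minimal sketch shows how words, senses, and weighted compatibility matrices fit together. All names and values are illustrative toys, not taken from the article: a real system would fill the matrix with the sense similarity scores described in Section 5.2.

```python
import numpy as np

# Hypothetical toy instance: two variables (words) and their domains (senses).
words = ["bank", "river"]
senses = {
    "bank": ["bank#financial", "bank#sloping_land"],
    "river": ["river#stream"],
}

# Weighted compatibility matrix R_ij: rows are senses of "bank", columns are
# senses of "river".  Instead of hard 0/1 constraints we use confidence
# weights in [0, 1], in the spirit of Hummel and Zucker (1983).
R = np.array([
    [0.1],   # bank#financial    vs. river#stream
    [0.9],   # bank#sloping_land vs. river#stream
])

# A labeling is the more consistent the higher its total compatibility:
# assigning bank = sloping_land and river = stream scores 0.9.
assignment = (1, 0)  # indices into the sense lists above
print(R[assignment])
```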
4. Game Theory
4.1 Classical Game Theory
Game theory provides predictive power in interactive decision situations. It was introduced by Von Neumann and Morgenstern (1944) in order to develop a mathematical framework able to model the essentials of decision-making in interactive situations. In its normal-form representation (which is the one we use in this article) it consists of a finite set of players I = {1, …, n}, a set of pure strategies Si = {s1, …, sm} for each player, and a utility function ui : S1 × ⋯ × Sn → ℝ, which associates strategies to payoffs. Each player can adopt a strategy in order to play a game, and the utility function depends on the combination of strategies played at the same time by the players involved in the game, not just on the strategy chosen by a single player. An important assumption in game theory is that the players are rational and try to maximize the value of ui. Furthermore, in non-cooperative games the players choose their strategies independently, considering what the other players can play, and try to find the best strategy profile to use in a game.
As an example, we can consider the famous Prisoner's Dilemma, whose payoff matrix is shown in Table 1. Each cell of the matrix represents a strategy profile, where the first number represents the payoff of Player 1 (P1) and the second is the payoff of Player 2 (P2), when both players use the strategy associated with a specific cell. P1 is called the row player because it selects its strategy according to the rows of the payoff matrix, and P2 is called the column player because it selects its strategy according to the columns of the payoff matrix. In this game the strategy confess is a dominant strategy for both players and this strategy combination is the Nash equilibrium of the game.
The Prisoner's Dilemma.
| P1\P2 | confess | don't confess |
|---|---|---|
| confess | −5,−5 | 0,−6 |
| don't confess | −6,0 | −1,−1 |
Nash equilibria represent the key concept of game theory and can be defined as those strategy profiles in which each strategy is a best response to the strategy of the co-player and no player has the incentive to unilaterally deviate from their decision, because there is no way to do better.
In a two-player game we can define a strategy profile as a pair (p, q) where p ∈ Δi and q ∈ Δj. The expected payoff for this strategy profile is computed as follows: ui(p, q) = p ⋅ Aiq and uj(p, q) = q ⋅ Ajp, where Ai and Aj are the payoff matrices of player i and player j, respectively. The Nash equilibrium is computed in mixed strategies in the same way as pure strategies. It is represented by a pair of strategies such that each is a best response to the other. The only difference is that, in this setting, the strategies are probabilities and must be computed considering the payoff matrix of each player.
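As an illustration, the expected payoffs of a mixed strategy profile are plain matrix-vector products. The payoff matrices below are invented toy values, not taken from the article:

```python
import numpy as np

# Hypothetical payoff matrices of the two players (toy values).
A_i = np.array([[3.0, 0.0],
                [5.0, 1.0]])
A_j = A_i.T  # a symmetric game, for simplicity

# Mixed strategies live on the simplex: non-negative entries summing to 1.
p = np.array([0.5, 0.5])
q = np.array([0.2, 0.8])

u_i = p @ A_i @ q  # u_i(p, q) = p · A_i q
u_j = q @ A_j @ p  # u_j(p, q) = q · A_j p
print(u_i, u_j)
```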
A game theoretic framework can be considered as a solid tool in decision-making situations because a fundamental theorem by Nash (1951) states that any normal-form game has at least one mixed Nash equilibrium, which can be used as the solution of the decision problem.
4.2 Evolutionary Game Theory
Evolutionary game theory was introduced by Smith and Price (1973), overcoming some limitations of traditional game theory, such as the hyper-rationality imposed on the players. In fact, in real-life situations the players choose a strategy according to heuristics or social norms (Szabó and Fath 2007). It has been introduced in biology to explain the evolution of species. In this context, strategies correspond to phenotypes (traits or behaviors), payoffs correspond to offspring, allowing players with a high actual payoff (obtained thanks to their phenotype) to be more prevalent in the population. This formulation explains natural selection choices among alternative phenotypes based on their utility function. This aspect can be linked to rational choice theory, in which players make a choice that maximizes their utility, balancing cost against benefits (Okasha and Binmore 2012).
The following theorem states that with the replicator dynamics in Equation (5) it is always possible to find the Nash equilibria of the games (see Weibull [1997] for the proof).
Theorem 1.
A point x ∈ Θ is the limit of a trajectory of Equation (5) starting from the interior of Θ if and only if x is a Nash equilibrium. Further, if point x ∈ Θ is a strict Nash equilibrium, then it is asymptotically stable, additionally implying that the trajectories starting from all nearby states converge to x.
The game is played with the new strategy spaces until the system converges—that is, when the difference between the payoffs at times tn and tn−1 falls below a small threshold. In Figure 1 we can see how the cooperate strategy increases over time, reaching a stationary point, which corresponds to the equilibrium of the game.
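A minimal sketch of a common discrete-time form of these dynamics is given below, under the assumption of non-negative payoffs (which holds for the similarity-based payoffs used later). The payoff matrix and the starting point are illustrative:

```python
import numpy as np

def replicator_dynamics(A, x, tol=1e-8, max_iter=10_000):
    """Iterate x_h(t+1) = x_h(t) * (A x)_h / (x · A x) until the strategy
    distribution stops changing.  Assumes non-negative payoffs in A."""
    for _ in range(max_iter):
        payoff = A @ x                     # payoff of each pure strategy
        x_new = x * payoff / (x @ payoff)  # fitter strategies gain share
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Toy 2x2 coordination game: starting above the mixed equilibrium (1/3, 2/3),
# the dynamics converge to the strict equilibrium (1, 0).
A = np.array([[2.0, 0.0],
              [0.0, 1.0]])
print(replicator_dynamics(A, np.array([0.4, 0.6])))
```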
5. WSD Games
In this section we describe how the WSD games are formulated. We assume that each player i ∈ I that participates in the games is a particular word in a text and that each strategy is a particular word sense. The players can choose a determined strategy among the set of strategies Si = {1,…,c}, each expressing a certain hypothesis about its membership in a class and c being the total number of classes available. We consider Si as the mixed strategy for player i as described in Section 4. The games are played between two similar words, i and j, imposing only pairwise interaction between them. The payoff matrix Zij of a single game is defined as a sense similarity matrix between the senses of word i and word j. The payoff function for each word is additively separable and is computed as described in Section 4.2.
5.1 Implementation of the WSD Games
In order to run our algorithm, we need the network that models the interactions among the players, the strategy space of the game, and the payoff matrices. We adopted the following steps in order to model the data required by our framework (a condensed sketch of the whole pipeline follows the list). Specifically, for each text to be disambiguated we:

- extract from the text the list of words I that have an entry in a lexical database;
- compute from I the word similarity matrix W, in which the pairwise similarities among the words are stored and which represents the players' interactions;
- increase the weights between two words that share a proximity relation;
- extract from I the list C of all the possible senses, which represents the strategy space of the system;
- assign to each word in I a probability distribution over the senses in C, creating for each player a probability distribution over the possible strategies;
- compute the sense similarity matrix Z among each pair of senses in C, which is then used to compute the partial payoff matrices of each game;
- apply the replicator dynamics equation in order to compute the Nash equilibria of the games; and
- assign to each word i ∈ I a strategy s ∈ C.
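The sketch below condenses these steps into a single hypothetical function. `word_sim`, `sense_inventory`, and `sense_sim` are placeholders for the association and semantic measures described in Section 5.2; the update rule is the discrete replicator dynamics, and non-negative similarities are assumed so that the multiplicative update is well defined:

```python
import numpy as np

def wsd_games(words, word_sim, sense_inventory, sense_sim,
              tol=1e-6, max_iter=1000):
    n = len(words)
    senses = [sense_inventory(w) for w in words]   # strategy space per word
    W = np.array([[word_sim(a, b) for b in words] for a in words])
    np.fill_diagonal(W, 0.0)                       # no games against oneself
    # Partial payoff matrices Z_ij between the senses of words i and j.
    Z = {(i, j): np.array([[sense_sim(a, b) for b in senses[j]]
                           for a in senses[i]])
         for i in range(n) for j in range(n) if i != j}
    # Uniform priors: each player starts at the barycenter of its simplex.
    X = [np.full(len(s), 1.0 / len(s)) for s in senses]
    for _ in range(max_iter):
        X_new = []
        for i in range(n):
            # Payoff of each sense of word i: games against all neighbors.
            payoff = sum(W[i, j] * (Z[i, j] @ X[j])
                         for j in range(n) if j != i)
            x = X[i] * payoff
            X_new.append(x / x.sum())
        shift = max(np.abs(a - b).max() for a, b in zip(X_new, X))
        X = X_new
        if shift < tol:
            break
    # Each word is assigned its most probable sense at convergence.
    return [senses[i][int(np.argmax(X[i]))] for i in range(n)]
```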
These steps are described in the following sections. In Section 5.1.1 we describe the graph construction procedure that we used in order to model the geometry of the data. In Section 5.1.2 we explain how we implement the strategy space of the game that allows each player to choose from a predetermined number of strategies. In Section 5.1.3 we describe how we compute the sense similarity matrix and how it is used to create the partial payoff matrices of the games. Finally, in Section 5.1.4 we describe the system dynamics.
5.1.1 Graph Construction
In our study, we modeled the geometry of the data as a graph. The nodes of the graph correspond to the words of a text, which have an entry in a lexical database. We denote the words by I = {i1, …, iN}, where ij is the jth word and N is the total number of words retrieved. From I we construct an N × N similarity matrix W, where each element wij is the similarity value assigned by a similarity function to the words i and j. W can be exploited as a useful tool for graph-based algorithms because it is treatable as the weighted adjacency matrix of a weighted graph.
In some cases, it is possible that some target words are not present in the reference corpus, because of different text segmentation techniques or spelling differences. In this case, we use query expansion techniques in order to find an appropriate substitute (Carpineto and Romano 2012). Specifically, we use WordNet to find alternative lexicalizations of a lemma, choosing the one that co-occurs more frequently with the words in its context.
The information obtained from an association measure can be enriched by taking into account the proximity of the words in the text (or the syntactic structure of the sentence). The first task can be achieved by augmenting the similarities among a target word and the n words that appear on its right and on its left, where n is a parameter that with small values can capture fixed expressions and with large values can detect semantic concepts (Fkih and Omri 2012). The second task can be achieved by using a dependency parser to obtain the syntactical relations among the words in the target sentence, but this approach is not used in this article. In this way, the system is able to exploit local and global cues, mixing together the one sense per discourse (Kelly and Stone 1975) and the one sense per collocation (Yarowsky 1993) hypotheses.
We are not interested in all the relations in the sentence but we focus only on relations among target words. The use of a dependency/proximity structure makes the graph reflect the structure of the sentence and the use of a distributional approach allows us to exploit the relations of semantically correlated words. This is particularly useful when the proximity information is poor—for example, when it connects words to auxiliary or modal verbs. Furthermore, these operations ensure that there are no disconnected nodes in the graph.
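A possible implementation of this weight augmentation, assuming (as in the example of Section 5.3) that the boost is the mean weight of the graph, could look as follows; stop-word filtering is omitted for brevity:

```python
import numpy as np

def augment_with_proximity(W, n=1, boost=None):
    """Increase w_ij for pairs of words at most n positions apart in the
    text (rows/columns of W are assumed to follow textual order)."""
    W = W.copy()
    if boost is None:
        boost = W[W > 0].mean()  # mean weight of the graph, as in Section 5.3
    N = W.shape[0]
    for i in range(N):
        for j in range(max(0, i - n), min(N, i + n + 1)):
            if i != j:
                W[i, j] += boost
    return W
```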
5.1.2 Strategy Space Implementation
The strategy space of the game is created using a knowledge base to collect the sense inventory of each word in a text, where mi is the number of senses associated with word i. Then we create the list of all the unique concepts in the sense inventories, which corresponds to the strategy space of the game.
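In the unsupervised setting described here, with no prior evidence about the sense distributions, each player's strategy vector can simply start at the barycenter of its simplex (see also Section 5.3):

```python
import numpy as np

# Uniform initialization over a word's m_i senses.
def init_strategy(m_i):
    return np.full(m_i, 1.0 / m_i)

print(init_strategy(4))  # [0.25 0.25 0.25 0.25]
```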
5.1.3 The Payoff Matrices
We encoded the payoff matrix of a WSD game as a sense similarity matrix among all the senses in the strategy spaces of the game. In this way, the higher the similarity among the senses of two words, the higher the incentive for a word to choose that sense, and play the strategy associated with it.
5.1.4 System Dynamics
The complexity of each step of the replicator dynamics is quadratic but there are different dynamics that can be used with our framework to solve the problem more efficiently, such as the recently introduced infection and immunization dynamics (Rota Buló, Pelillo, and Bomze 2011), which have a linear-time/space complexity per step and are known to be much faster than, and as accurate as, the replicator dynamics.
5.2 Implementation Details
In this section we describe the association measures used to weight the graph W (Section 5.2.1), the semantic and relatedness measures used to compare the synsets (Section 5.2.2), the computation of the payoff matrices of the games (Section 5.2.3), and the different implementations of the system strategy space (Section 5.2.4) in cases of unsupervised, semi-supervised, and coarse-grained WSD.
5.2.1 Association Measures
We evaluated our algorithm with different similarity measures in order to find the measure that performs best; the results of this evaluation are presented in Section 6.2.1. Specifically, for our experiments, we used eight different measures: the Dice coefficient (dice) (Dice 1945), the modified Dice coefficient (mDice) (Kitamura and Matsumoto 1996), the pointwise mutual information (pmi) (Church and Hanks 1990), the t-score measure (t-score) (Church and Hanks 1990), the z-score measure (z-score) (Burrows 2002), the odds ratio (odds-r) (Blaheta and Johnson 2001), the chi-squared test (chi-s) (Rao 2002), and the corrected chi-squared test (chi-s-c) (DeGroot et al. 1986).
The measures that we used are presented in Figure 2, where the notation refers to the standard contingency tables (Evert 2008) used to display the observed and expected frequency distribution of the variables, respectively, on the left and on the right of Figure 3. All the measures for the experiments in this article have been calculated using the BNC corpus (Leech 1992) because it is a well balanced general domain corpus.
Contingency tables of observed frequency (on the left) and expected frequency (on the right).
5.2.2 Semantic and Relatedness Measures
We used WordNet (Miller 1995) and BabelNet (Navigli and Ponzetto 2012a) as knowledge bases to collect the sense inventories of each word to be disambiguated.
Semantic and Relatedness Measures Calculated with WordNet. WordNet (Miller 1995) is a lexical database where the lexicon is organized according to a psycholinguistic theory of the human lexical memory, in which the vocabulary is organized conceptually rather than alphabetically, giving prominence to word meanings rather than to lexical forms. The database is divided into five parts: nouns, verbs, adjectives, adverbs, and functional words. In each part the lexical forms are mapped to the senses related to them; in this way it is possible to cluster words that share a particular meaning (synonyms) and to create the basic component of the resource: the synset. Each synset is connected in a network to other synsets, which have a semantic relation with it.
The relations in WordNet are: hyponymy, hypernymy, antonymy, meronymy, and holonymy. Hyponymy gives the relations from more general concepts to more specific; hypernymy gives the relations from particular concepts to more general; antonymy relates two concepts that have an opposite meaning; meronymy connects the concept that is part of a given concept with it; and holonymy relates a concept with its constituents. Furthermore, each synset is associated with a definition and gives the morphological relations of the word forms related to it. Given the popularity of the resource many parallel projects have been developed. One of them is eXtended WordNet (Mihalcea and Moldovan 2001), which gives a parsed version of the glosses together with their logical form and the disambiguation of the term in it.
We have used this resource to compute similarity and relatedness measures in order to construct the payoff matrices of the games. The computation of the sense similarity measures is generally conducted using relations of likeness such as the is-a relation in a taxonomy; on the other hand, the relatedness measures are more general and take into account a wider range of relations such as the is-a-part-of or is-the-opposite-of.
The procedure to compute the semantic relatedness of two synsets has been introduced by Patwardhan and Pedersen (2006) as the Gloss Vector measure, and we used it with four different variations for our experiments. The four variations are named: tf-idf, tf-idfext, vec, and vecext. The difference among them lies in the way the gloss vectors are constructed. Because the synset gloss is usually short, we used the concept of super-gloss, as in Patwardhan and Pedersen (2006), to construct the vector of each synset. A super-gloss is the concatenation of the gloss of the synset plus the glosses of the synsets that are connected to it via some WordNet relations (Pedersen 2012). We used the WordNet version that has been used to label each data set. Specifically, the different implementations of the vector construction vary in the way in which the co-occurrence is calculated, the corpus used, and the source of the relations. tf-idf constructs the co-occurrence vectors exploiting the term frequency - inverse document frequency weighting schema (tf-idf). tf-idfext uses the same information as tf-idf plus the relations derived from eXtended WordNet (Mihalcea and Moldovan 2001). vec uses a standard bag-of-words approach to compute the co-occurrences. vecext uses the same information as vec plus the relations from eXtended WordNet.
Instead of considering only the raw frequency of terms in documents, the tf-idf method scales down the importance of less informative terms by taking into account the number of documents in which a term occurs. Formally, it is the product of two statistics: the term frequency and the inverse document frequency. The former is computed as the number of times a term occurs in a document (gloss in our case); the latter is computed as log(N/dft), where N is the number of documents in the corpus and dft is the number of documents in which the term occurs.
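As an illustration, a bare-bones tf-idf weighting over a set of glosses can be computed as follows. This is only a schematic sketch with invented glosses: the article's variants operate on super-glosses and exploit additional relations:

```python
import math
from collections import Counter

def tf_idf(glosses):
    """Weight each term t of each gloss by tf * log(N / df_t), where N is
    the number of glosses and df_t the number of glosses containing t."""
    N = len(glosses)
    df = Counter(t for g in glosses for t in set(g))
    return [{t: tf * math.log(N / df[t]) for t, tf in Counter(g).items()}
            for g in glosses]

vectors = tf_idf([["sloping", "land", "beside", "water"],
                  ["financial", "institution", "money"]])
print(vectors[0]["land"])  # 1 * log(2/1) ≈ 0.693
```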
Relatedness Measure Calculated with BabelNet and NASARI. BabelNet (Navigli and Ponzetto 2012a) is a wide-coverage multilingual semantic network. It integrates lexicographic and encyclopedic knowledge from WordNet and Wikipedia, automatically mapping the concepts shared by the two knowledge bases. This mapping generates a semantic network where millions of concepts are lexicalized in different languages. Furthermore, it allows linking named entities, such as Johann Sebastian Bach, and concepts, such as composer and organist.
BabelNet can be represented as a labeled directed graph G = (V, E), where V is the set of nodes (concepts or named entities) and E ⊆ V × R × V is the set of edges connecting pairs of concepts or named entities. The edges are labeled with a semantic relation from R, such as: is-a, given name, or occupation. Each node v ∈ V contains a set of lexicalizations of the concept for different languages, which forms a BabelNet synset.
The semantic measure, which we developed using BabelNet, is based on NASARI (Camacho-Collados, Pilehvar, and Navigli 2015), a semantic representation of the concepts and named entities in BabelNet. This approach first exploits the BabelNet network to find the set of related concepts in WordNet and Wikipedia and then constructs two vectors to obtain a semantic representation of a concept b. These representations are projected in two different semantic spaces, one based on words and the other on synsets. They use lexical specificity (Lafon 1980) to extract the most representative words to use in the first vector and the most representative synsets to use in the second vector.
In this article, we computed the similarity between two senses using the vectors (of the word-based semantic space) provided by NASARI. These semantic representations provide for each sense the set of words that best represent the particular concept and the score of representativeness of each word. From this representation we computed the pairwise cosine similarity between each concept, as described in the previous section for the semantic relatedness measures.
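A sketch of this pairwise computation over sparse word-weight vectors follows; the vectors and weights are invented for illustration, not actual NASARI representations:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse word-weight dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

s1 = {"river": 2.3, "water": 1.7, "land": 0.9}  # hypothetical sense vectors
s2 = {"money": 2.1, "deposit": 1.5, "water": 0.2}
print(cosine(s1, s2))
```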
The use of NASARI is particularly useful in the case of named entity disambiguation because it includes many entities that are not included in WordNet. On the other hand, it is difficult to use it in all-words sense disambiguation tasks, since it includes only WordNet synsets that are mapped to Wikipedia pages in BabelNet. For this reason it is not possible to find the semantic representation for many verbs, adjectives, and adverbs that are commonly found in all-words sense disambiguation tasks.
We used the SPARQL endpoint provided by BabelNet to collect the sense inventories of each word in the texts of each data set. For this task we filtered the first 100 resources whose label contains the lexicalization of the word to be disambiguated. This operation is required because in many cases it is possible to have indirect references to entities.
5.2.3 From Similarities to Payoffs
The similarity and relatedness measures are computed for all the senses of the words to be disambiguated. From this computation it is possible to obtain a similarity matrix Z that incorporates the pairwise similarity among all the possible senses. This computation could have a heavy computational cost if there are many words to be disambiguated. To overcome this issue, the pairwise similarities can be computed just once on the entire knowledge base and reused in actual applications, reducing the computational cost of the algorithm. From this matrix we can obtain the partial semantic similarity matrix Zij, of size m × n, for each pair of players i and j, where m and n are the numbers of senses of i and j in Z.
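Operationally, once Z has been precomputed, each partial payoff matrix is just a submatrix lookup, as in this sketch (the sense indices are hypothetical):

```python
import numpy as np

Z = np.random.rand(131, 131)          # e.g., the 131 senses of Section 6.1.1
senses_i = [3, 4, 5]                  # hypothetical sense indices of word i
senses_j = [60, 61]                   # hypothetical sense indices of word j
Z_ij = Z[np.ix_(senses_i, senses_j)]  # partial payoff matrix of size m x n
```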
5.2.4 Strategy Space Implementation
5.3 An Example
As an example, we can consider the following sentence, which we encountered before:

- There is a financial institution near the river bank.
We first tokenize, lemmatize, and tag the sentence; then we extract the content words that have an entry in WordNet 3.0 (Miller 1995), constructing the list of words to be disambiguated: {is, financial, institution, river, bank}. Once we have identified the target words we compute the pairwise similarity for each target word. For this task we use the Google Web 1T 5-Gram Database (Brants and Franz 2006) to compute the modified Dice coefficient (Kitamura and Matsumoto 1996). With the information derived by this process we can construct a co-occurrence graph (Figure 4(a)), which indicates the strength of association between the words in the text. This information can be augmented, taking into account other sources of information, such as the dependency structure of the syntactic relations between the words or the proximity information derived from a simple n-gram model (Figure 4(b), n = 1).
The operation to increment the weights of structurally related words is important because it prevents the system from relying only on distributional information, which could lead to a sense shift for the ambiguous word bank. In fact, its association with the words financial and institution would have the effect of interpreting it as a financial institution and not as sloping land, as defined in WordNet. Furthermore, using only distributional information could exclude associations between words that do not appear in the corpus in use.
In Figure 4(c) we see the final form of the graph for our target sentence, in which we have combined the information from the co-occurrence graph and from the n-gram graph. The weights in the co-occurrence graph are increased by the mean weight of the graph if a corresponding edge exists in the n-gram graph and does not include a stop-word.
Three graph representations for the sentence: there is a financial institution near the river bank. (a) A co-occurrence graph constructed using the modified Dice coefficient as similarity measure over the Google Web 1T 5-Gram Database (Brants and Franz 2006) to weight the edges. (b) Graph representation of the n-gram structure of the sentence, with n = 1; for each node, an edge is added to another node if the corresponding word appears to its left or right in a window the size of one word. (c) A weighted graph that combines the information of the co-occurrence graph and the n-gram graph. The edges of the co-occurrence graph are augmented by its mean weight if a corresponding edge exists in the n-gram graph and does not include a stop-word.
After the pairwise similarities between the words are computed, we access a lexical database in order to obtain the sense inventories of each word, so that each word can be associated with a predefined number of senses. For this task, we use WordNet 3.0 (Miller 1995). Then, for each unique sense in all the sense inventories, we compute the pairwise semantic similarity in order to identify the affinity among all the pairwise sense combinations. This task can be done using a semantic similarity or relatedness measure. For this example, we used a variant of the gloss vector measure (Patwardhan and Pedersen 2006), the tf-idf, described in Section 5.2.2.
Having obtained the similarity information, we can initialize the strategy space of each player with a uniform distribution, given the fact that we are not considering any prior information about the senses distributions. Now the system dynamics can be started. In each iteration of the dynamics each player plays a game with its neighbors, obtaining a payoff for each of its strategies according to Equation (10); once the players have played the games with their neighbors in W, the strategy space of each player is updated at time t + 1 according to Equation (6).
We present the dynamics of the system created for the example sentence in Figure 5. The dynamics are shown only for the ambiguous words at time steps t1, t2, t3, and t12 (when the system converges). As we can see, at time step 1 the senses of each word are equiprobable, but as soon as the games are played some senses start to emerge. In fact, at time step 2 many senses are discarded, by virtue of two principles: a) the words in the text push the senses of the other words toward a specific sense; and b) the sense similarity values for certain senses are very low. Regarding the first principle, we can consider that the word institution, which is playing the games with the words financial and bank, is immediately driven toward a specific sense, an organization founded and united for a specific purpose as defined in WordNet 3.0—thus discarding the other senses. Regarding the second principle, we can consider that many senses of the word bank are not compatible with the senses of the other words in the text, and therefore their values decrease rapidly.
System dynamics for the words: be, institution, and bank at time step 1, 2, 3, and 12 (system convergence). The strategy space of each word is represented as a regular polygon of radius 1, where the distance from the center to any vertex represents the probability associated with a particular word sense. The values on each radius in a polygon are connected with a darker line in order to show the actual probability distribution obtained at each time step.
The most interesting phenomenon that can be appreciated from the example is the behavior of the strategy space of the word bank. It has ten senses according to WordNet 3.0 (Miller 1995), and can be used in different contexts and domains to indicate, among other things, a financial institution (s22 in Figure 5) or a sloping land (s20 in Figure 5). When it plays a game with the words financial and institution, it is directed toward its financial sense; when it plays a game with the word river, it is directed toward its naturalistic meaning. As we can see in Figure 5, at time step 2 the two meanings (s20 and s22) have almost the same value, and at time step 3 the word starts to define a precise meaning to the detriment of s21 but not of s22. The balancing of these forces toward a specific meaning is given by the similarity value wij, which allows bank in this case to choose its naturalistic meaning. Furthermore, we can see that the inclination to a particular sense is given by the payoff matrix Zij and by the strategy distribution Sj, which indicates what sense word j is going to choose, ensuring that word i's choice is coherent with it.
6. Experimental Evaluation
6.1 Parameter Tuning
We used two data sets to tune the parameters of our approach, SemEval-2010 Task 17 (S10) (Agirre et al. 2009) and SemEval-2015 Task 13 (S15) (Moro and Navigli 2015). The first data set is composed of three English texts from the ecology domain, for a total of 1,398 words to be disambiguated (1,032 nouns/named entities and 366 verbs). The second data set is composed of four English documents from different domains: medical, drug, math, and social issues, for a total of 1,261 instances, including nouns/named entities, verbs, adjectives, and adverbs. Both data sets have been manually labeled using WordNet 3.0. The only difference between these data sets is that the target words of the first data set belong to a specific domain, whereas all the content words of the second data set have to be disambiguated. We used these two typologies of data set to evaluate our algorithm in different scenarios; furthermore, we created, from each data set, 50 different data sets, selecting from each text a random number of sentences and evaluating our approach on each of them to identify the parameters that on average perform better than others. In this way it is possible to simulate a situation in which the system has to work on texts of different sizes and on different domains. This is because, as demonstrated by Søgaard et al. (2014), the results of a given algorithm are very sensitive to sample size. The number of target words for each text in the random data sets ranges from 12 to 571. The parameters to be tuned are the association and semantic measures used to weight the similarity among words and senses (Section 6.1.1), the n of the n-gram graph used to increase the weights of nearby words (Section 6.1.2), and the p of the geometric distribution used by our semi-supervised system (Section 6.1.3).
6.1.1 Association and Semantic Measures
The first experiment that we present is aimed at finding the semantic and distributional measures with the highest performances. Recall that we used WordNet 3.0 as knowledge base and the BNC corpus (Leech 1992) to compute the association measures. In Tables 2 and 3 we report the average results on the S10 and S15 data sets, respectively. From these tables it is possible to see that the performance of the system is highly influenced by the combination of measures used. As an example of the different representations generated by the measures described in Section 5.2, we can observe Figures 6 and 7, which depict the matrices Z and the adjacency matrix of the graph W, respectively, and are computed on the following three sentences from the second text of S10:
The rivers Trent and Ouse, which provide the main fresh water flow into the Humber, drain large industrial and urban areas to the south and west (River Trent), and less densely populated agricultural areas to the north and west (River Ouse). The Trent/Ouse confluence is known as Trent Falls. On the north bank of the Humber estuary the principal river is the river Hull, which flows through the city of Kingston-upon-Hull and has a tidal length of 32 km up to the Hempholme Weir.
Results as F1 for S10. The first result with a statistically significant difference from the best (bold result) is marked with * (χ2, p < 0.05).
| | dice | mdice | pmi | t-score | z-score | odds-r | chi-s | chi-s-c |
|---|---|---|---|---|---|---|---|---|
| tfidf | 55.5 | 56.3 | 50.6 | 45.4 | 50.1 | 49.8 | 39.1 | 54.4 |
| tfidfext | **56.5** | 55.9 | 50.1 | 45.0 | 49.9 | 49.5 | 39.1 | 54.2 |
| vec | 54.7 | 54.3 | 49.3 | 44.1 | 49.4 | 53.6 | 39.3 | 50.5 |
| vecext | 55.0 | 54.3 | 48.8 | 43.8 | 48.6 | 53.6 | 39.1 | 49.9 |
| jcn | 51.3 | 50.6 | 40.1 | 50.1 | 47.6 | 52.6* | 50.1 | 50.6 |
| wup | 37.2 | 36.9 | 35.6 | 32.2 | 37.9 | 36.8 | 38.4 | 35.4 |
Results as F1 for S15. The first result with a statistically significant difference from the best (bold result) is marked with * (χ2, p < 0.05).
| | dice | mdice | pmi | t-score | z-score | odds-r | chi-s | chi-s-c |
|---|---|---|---|---|---|---|---|---|
| tfidf | 64.1 | 64.2 | 63.1 | 59.0 | 61.8 | **65.3** | 63.3* | 62.4 |
| tfidfext | 62.9 | 63.1 | 62.4 | 58.7 | 60.9 | 63.0 | 62.0 | 61.1 |
| vec | 62.8 | 62.3 | 62.8 | 59.8 | 62.3 | 62.9 | 61.1 | 60.3 |
| vecext | 60.5 | 59.9 | 61.2 | 57.8 | 59.7 | 60.6 | 60.1 | 59.4 |
| jcn | 57.2 | 57.6 | 56.7 | 57.9 | 57.0 | 56.9 | 57.5 | 57.6 |
| wup | 46.2 | 45.4 | 43.8 | 45.4 | 45.9 | 47.4 | 46.1 | 45.5 |
. | dice . | mdice . | pmi . | t-score . | z-score . | odds-r . | chi-s . | chi-s-c . |
---|---|---|---|---|---|---|---|---|
tfidf | 64.1 | 64.2 | 63.1 | 59.0 | 61.8 | 65.3 | 63.3* | 62.4 |
tfidfext | 62.9 | 63.1 | 62.4 | 58.7 | 60.9 | 63.0 | 62.0 | 61.1 |
vec | 62.8 | 62.3 | 62.8 | 59.8 | 62.3 | 62.9 | 61.1 | 60.3 |
vecext | 60.5 | 59.9 | 61.2 | 57.8 | 59.7 | 60.6 | 60.1 | 59.4 |
jcn | 57.2 | 57.6 | 56.7 | 57.9 | 57.0 | 56.9 | 57.5 | 57.6 |
wup | 46.2 | 45.4 | 43.8 | 45.4 | 45.9 | 47.4 | 46.1 | 45.5 |
Figure 6: The representations of the payoff matrix Z computed on three sentences of the second text of S10, with the measures described in Section 5.2.2. All the senses of the words in the text are sequentially ordered.
Figure 7: The representations of the adjacency matrix of the graph W computed on three sentences of the second text of S10, with the measures described in Section 5.2.1. The words are ordered sequentially and reflect the list given below. For a better visual comparison only positive values are shown, whereas the experiments also consider negative values. The last image represents the strategy space of the players.
resulting in the following 35 content words (nouns and verbs) and 131 senses:
1. river n | 10. area n | 19. Ouse n | 28. be v |
2. Trent n | 11. south n | 20. confluence n | 29. river n |
3. Ouse n | 12. west n | 21. be v | 30. flow v |
4. provide v | 13. River n | 22. Trent n | 31. city n |
5. main n | 14. Trent n | 23. Falls n | 32. have v |
6. water n | 15. area n | 24. bank n | 33. length* n |
7. flow n | 16. River n | 25. Humber n | 34. km n |
8. Humber n | 17. Ouse n | 26. estuary n | 35. Weir n |
9. drain v | 18. Trent n | 27. river n |
The first observation that can be made on the results concerns the semantic measures: the relatedness measures perform significantly better than the semantic similarity measures. This is because wup and jcn can be computed only on synsets that have the same part of speech. This limitation affects the results of the algorithm because the games played between two words with different parts of speech have no effect on the dynamics of the system, since the values of the resulting payoff matrices are all zeros. This hurts the performance of the system in terms of recall, because in this situation these words tend to remain at the central point of the simplex, and also in terms of precision, because the meaning of a word is chosen taking into account only the influence of words with the same part of speech. In fact, from Figure 6 we can see that the representations provided by wup and jcn for the text described above have many uniform areas, meaning that these approaches are not able to provide a clear representation of the data. On the contrary, the representations provided by the relatedness measures show a block structure on the main diagonal of the matrix, which is exactly what is required of a similarity measure. The tf-idf weighting schema seems to reduce the noise in the data representation: the weights on the left part of the matrix are reduced by tfidf and tfidf-ext, whereas they have high values in vec and vec-ext. The representations obtained with eXtended WordNet are very similar to those obtained with WordNet and their performance is also very close, although on average WordNet outperforms eXtended WordNet.
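The degenerate cross-part-of-speech case can be reproduced directly with the NLTK WordNet reader; the sketch below (the function name is ours, and Wu-Palmer stands in for either similarity measure) builds the payoff matrix between the sense inventories of two words and yields an all-zero matrix when their parts of speech differ:

```python
from nltk.corpus import wordnet as wn

def payoff_matrix(word_i, pos_i, word_j, pos_j):
    """Payoff matrix between the sense inventories of two words, using
    Wu-Palmer similarity; cross-POS pairs get no payoff (None -> 0.0)."""
    senses_i = wn.synsets(word_i, pos=pos_i)
    senses_j = wn.synsets(word_j, pos=pos_j)
    return [[s.wup_similarity(t) or 0.0 for t in senses_j]
            for s in senses_i]

# payoff_matrix('river', wn.NOUN, 'flow', wn.VERB) is all zeros, so the
# game between these two players has no effect on the dynamics.
```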
If we observe the performance of the association measures, we notice that on average the best measures are dice, mdice, chi-s-c, and, on S15, also odds-r; the other measures almost always fall below the statistical significance threshold. Observing the representations in Figure 7, we can see that dice and mdice have a similar structure; the difference between these two measures is that mdice has values on a different range and tends to differentiate the weights better, whereas in dice the values are almost uniform. pmi tends to take high values when one word in the collocation has low frequency, but this does not imply high dependency, and thus it compromises the results of the games. From its representation we can observe that its structure is different from the previous two: it concentrates its values on collocations such as river Trent and river Ouse, which has the effect of unbalancing the data representation, whereas dice and mdice concentrate their values on collocations such as river flow and bank estuary. t-score and z-score have a similar structure, the only difference being the range of the values; for these measures the distribution of the values is quite homogeneous, meaning that they are not able to balance the weights well. In odds-r we recognize a structure similar to that of pmi, the main difference being that it works on a different range. The values obtained with chi-s span a wide range, which compromises the data representation; in fact, its results are always below statistical significance. chi-s-c works on a narrower range than chi-s and its structure resembles that of dice; in fact, its results are often high.
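For reference, here is a sketch of three of these measures over raw corpus counts, under their usual definitions (the log-weighted form of mdice is an assumption about the exact variant used):

```python
import math

def dice(f_xy, f_x, f_y):
    # Dice coefficient: high only when both words favor the collocation
    return 2.0 * f_xy / (f_x + f_y)

def mdice(f_xy, f_x, f_y):
    # modified Dice: log-scaling spreads the weights over a wider range
    return math.log2(f_xy) * dice(f_xy, f_x, f_y)

def pmi(f_xy, f_x, f_y, n):
    # PMI: inflated when one word is rare, the behavior criticized above
    return math.log2(n * f_xy / (f_x * f_y))
```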
6.1.2 n-gram Graph
The association measures are able to provide a good representation of the text, but in many cases a word in a specific text may not be present in the corpus on which these measures are calculated; furthermore, words may appear with different lexicalizations. One way to overcome these problems is to increase the weights of the nodes near a given word; in this way it is possible to ensure that the nodes in W are always connected. Furthermore, this allows us to exploit local information, increasing the importance of the words that share a proximity relation with a given word and thus giving more weight to (possibly syntactically) related words, as described in Section 5.1.1. To test the influence that the parameter of the n-gram graph has on the performance of the algorithm, we selected the association and relatedness measures with the highest results and conducted a series of experiments on the same data sets presented above, with increasing values of n. The results of these experiments on S10 and S15 are presented in Figures 8(a) and 8(b), respectively. From the plots we can see that this approach is always beneficial for S15 and that the results increase substantially for values of n greater than 2. On the contrary, on S10 this approach is not always beneficial, although in many cases an improvement can be observed. In particular, we notice that the pair of measures with the highest results on both data sets is tfidf-mdice with n = 5. This also confirms our earlier experiments, in which we saw that these two measures are particularly suited for our algorithm.
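A minimal sketch of this step (the additive boost is our simplification; Section 5.1.1 defines the exact weighting): given the word-by-word adjacency matrix W, the weights of the n words to the left and right of each word are increased, which also guarantees that the graph stays connected:

```python
import numpy as np

def add_ngram_weights(W, n=5, boost=1.0):
    """Strengthen the edges between each word and its n left/right
    neighbors in the running order of the text."""
    W = W.copy()
    m = W.shape[0]
    for i in range(m):
        for j in range(max(0, i - n), min(m, i + n + 1)):
            if i != j:
                W[i, j] += boost
    return W
```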
Figure 8: Results as F1 on S10 (on the left) and S15 (on the right) with increasing values of neighbor nodes (n).
6.1.3 Geometric Distribution
Once we have identified the measures to use in our unsupervised system, we can test for the best parameter to use in case we want to exploit information from sense-labeled corpora. To tune the parameter of the geometric distribution (described in Section 5.2.4), we used the pair of measures and the value of n detected with the previous experiments and ran the algorithm on S10 and S15 with increasing values of p, in the interval [0.05,0.95].
The results of this experiment are presented in Figure 9(a), where we can see that the performance of the semi-supervised system on S15 is always better than that obtained with the unsupervised system (p = 0). On the other hand, the performance on S10 is always lower than that obtained with the unsupervised system. This behavior is not surprising, because the target words of S10 belong to a specific semantic domain. We used SemCor to obtain the information about the sense distributions, and this resource is a general-domain corpus, not tailored for this specific task. In fact, as pointed out by McCarthy et al. (2007), the distribution of word senses in specific domains is highly skewed; for this reason, the most frequent sense heuristic calculated on general-domain corpora, such as SemCor, is not beneficial for this kind of text.
From the plot we can see that on S15 the highest results are obtained with values of p ranging from 0.4 to 0.7; for the evaluation of our model we decided to use p = 0.4 as the parameter of the geometric distribution, because with this value we obtained the highest result.
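A sketch of this initialization (the function name is ours): each player's initial mixed strategy is a normalized geometric distribution over its senses ranked by SemCor frequency, with p = 0 reducing to the uniform, unsupervised initialization:

```python
import numpy as np

def geometric_strategy(n_senses, p=0.4):
    """Initial strategy over senses ranked by SemCor frequency."""
    if p == 0:                                   # unsupervised setting
        return np.full(n_senses, 1.0 / n_senses)
    probs = p * (1.0 - p) ** np.arange(n_senses)  # rank 0 = most frequent
    return probs / probs.sum()                    # renormalize the tail
```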
Figure 9: Results as F1 on S10 and S15 with increasing values of p, where p = 0 corresponds to the results of the unsupervised setting (on the left). An example of a geometric distribution with six ranked senses compared with the uniform distribution (on the right).
6.1.4 Error Analysis
The main problems that we noticed when analyzing the results of the previous experiments are related to the semantic measures. As we pointed out in Section 6.1.1, these measures can be computed only on synsets with the same part of speech, and this affects the results in terms of recall. Adverbs and adjectives are never disambiguated with these measures because of the lack of payoffs. This happens not only in the case of words with low semantic content but also for verbs with rich semantic content, such as generate, prevent, and obtain. The use of the relatedness measures substantially reduces the number of words that are not disambiguated. With these measures, a word is left undisambiguated only when the concepts it denotes are not sufficiently covered by the reference corpus; for example, in our experiments words such as drawn-out, dribble, and catchment are not disambiguated.
To overcome this problem we used the n-gram graph to increase the weights among neighboring words. Experimentally, we noticed that when this approach is used with the relatedness measures, it leads to the disambiguation of all the target words, and with n ≥ 1 we have precision = recall. This approach also affects the results in terms of precision: if we consider the performance of the system on the word actor, we pass from F1 = 0 (n = 0) to F1 = 71.4 (n = 5). This is because the number of relations of the two senses (synsets) of the word actor is not balanced in WordNet 3.0; actor as theatrical performer has 21 relations, whereas actor as person who acts and gets things done has only 8, and this can compromise the computation of the semantic relatedness measures. It is possible to overcome this limitation using the local information given by the n-gram graph, which allows us to balance the influence of the words in the text.
Another aspect to consider is whether the polysemy of the words influences the results of the system. Analyzing the results, we noticed that the majority of the errors are made on words such as make-v, give-v, play-v, better-a, work-v, follow-v, see-v, and come-v, which have more than 20 different senses and are very frequent words that are difficult to disambiguate in fine-grained tasks. As we can see from Figure 10, this problem can be partially alleviated using the semi-supervised system. In fact, the use of information from sense-labeled corpora is particularly useful when the polysemy of the words is high.
Figure 10: Average F1 on the words of S15 grouped by number of senses, using the unsupervised and the semi-supervised system.
6.2 Evaluation Set-up
We evaluated our algorithm on three fine-grained data sets, Senseval-2 English all-words (S2) (Palmer et al. 2001), Senseval-3 English all-words (S3) (Snyder and Palmer 2004), and SemEval-2007 all-words (S7) (Pradhan et al. 2007), and on one coarse-grained data set, SemEval-2007 English all-words (S7CG) (Navigli, Litkowski, and Hargraves 2007), using WordNet as the knowledge base. Furthermore, we evaluated our approach on two data sets, SemEval-2013 task 12 (S13) (Navigli, Jurgens, and Vannella 2013) and KORE50 (Hoffart et al. 2012), using BabelNet as the knowledge base.
We describe the evaluation using WordNet as the knowledge base in the next sections, and in Section 6.2.2 we present the evaluation conducted using BabelNet as the knowledge base. Recall that for all the next experiments we used mdice to weight the graph W, tfidf to compute the payoffs, n = 5 for the n-gram graph, and p = 0.4 in the case of semi-supervised learning. The results are provided as F1 for all the data sets except KORE50; for this data set the results are provided as accuracy, as is common in the literature.
6.2.1 Experiments Using WordNet as Knowledge Base
Table 4 shows the results as F1 for the four data sets that we used for the experiments with WordNet. The table includes the results for the two implementations of our system, the unsupervised and the semi-supervised one, and the results obtained using the most frequent sense heuristic. For the computation of the most frequent sense, we assigned to each word to be disambiguated the first sense returned by the WordNet reader provided by the Natural Language Toolkit (version 3.0) (Bird 2006). As we can see, the best performance of our system is obtained on nouns on all the data sets. This is in line with state-of-the-art systems, because nouns generally have lower polysemy and higher inter-annotator agreement (Palmer et al. 2001). Furthermore, our method is particularly suited for nouns: their disambiguation benefits from a wide context and local collocations (Agirre and Edmonds 2007).
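Since NLTK's WordNet reader lists synsets in decreasing SemCor frequency, this baseline reduces to taking the first synset; a minimal sketch (the lemma/POS handling of the actual experiments may differ):

```python
from nltk.corpus import wordnet as wn

def most_frequent_sense(lemma, pos=None):
    """MFS heuristic: first synset returned by the WordNet reader."""
    synsets = wn.synsets(lemma, pos=pos)
    return synsets[0] if synsets else None
```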
Table 4: Detailed results as F1 for the four data sets studied with tf-idf and mdice as measures. The results show the performance of our unsupervised (uns) and semi-supervised (ssup) system and the results obtained using the most frequent sense heuristic (MFS). Detailed information about the performance of the systems on different parts of speech is provided: nouns (N), verbs (V), adjectives (A), and adverbs (R).

| Method | All | N | V | A | R |
|---|---|---|---|---|---|
| *SemEval 2007 coarse-grained - S7CG* | | | | | |
| uns | 80.4 | 85.5 | 71.2 | 81.5 | 76.0 |
| ssup | 82.8 | 85.4 | 77.2 | 82.9 | 84.6 |
| MFS | 76.3 | 76.0 | 70.1 | 82.0 | 86.0 |
| *SemEval 2007 fine-grained - S7* | | | | | |
| uns | 43.3 | 49.7 | 39.9 | − | − |
| ssup | 56.5 | 62.9 | 53.0 | − | − |
| MFS | 54.7 | 60.4 | 51.7 | − | − |
| *Senseval 3 fine-grained - S3* | | | | | |
| uns | 59.1 | 63.3 | 50.7 | 64.5 | 71.4 |
| ssup | 64.7 | 70.3 | 54.1 | 70.7 | 85.7 |
| MFS | 62.8 | 69.3 | 51.4 | 68.2 | 100.0 |
| *Senseval 2 fine-grained - S2* | | | | | |
| uns | 61.2 | 69.8 | 41.7 | 61.9 | 65.1 |
| ssup | 66.0 | 72.4 | 43.5 | 71.8 | 75.7 |
| MFS | 65.6 | 72.1 | 42.4 | 71.6 | 76.1 |
We obtained low results on verbs on all data sets. This, as pointed out by Dang (1975), is a common problem not only for supervised and unsupervised WSD systems but also for human annotators, who in many cases disagree about what constitutes a distinct sense of a polysemous verb, compromising the sense-tagging procedure.
As we anticipated in Section 6.1.3, the use of prior knowledge is beneficial for this kind of data set. As we can see in Table 4, using a semi-supervised setting improves the results by about 5 points on S2 and S3 and by 12 points on S7. The large improvement obtained on S7 can be explained by the fact that the results of the unsupervised system are well below the most frequent sense heuristic, so exploiting the evidence from the sense-labeled data set is beneficial. For the same reason, the results obtained on S7CG with a semi-supervised setting are less impressive than those obtained with the unsupervised system; the structure of the data set is different, and the results obtained with the unsupervised setting are well above the most frequent sense. This series of experiments confirms that the use of prior knowledge is beneficial on general-domain data sets and that, when it is used, the system performs better than the most frequent sense heuristic computed on the same corpus.
Comparison to State-of-the-Art Algorithms. Table 5 shows the results of our system alongside those obtained by state-of-the-art systems on the same data sets. We compared our method with supervised, semi-supervised, and unsupervised systems on four data sets. The supervised systems are It Makes Sense (Zhong10) (Zhong and Ng 2010), an open-source WSD system based on support vector machines (Steinwart and Christmann 2008), and the best system that participated in each competition (Best). The semi-supervised systems are IRST-DDD-00 (Strapparava, Gliozzo, and Giuliano 2004), based on WordNet domains and on manually annotated domain corpora; MFS, the most frequent sense heuristic implemented using the WordNet corpus reader of the Natural Language Toolkit; MRF-LP, based on Markov random fields (Chaplot, Bhattacharyya, and Paranjape 2015); Nav05 (Navigli and Velardi 2005), a knowledge-based method that exploits manually disambiguated word senses to enrich the knowledge-base relations; and PPRw2w (Agirre, de Lacalle, and Soroa 2014), a random-walk method that uses contextual information and prior knowledge about sense distributions to compute the most important sense in a network given a specific word and its context. The unsupervised systems are Nav10, a graph-based WSD algorithm that exploits connectivity measures to determine the most important node in the graph composed of all the senses of the words in a sentence, and a version of the PPRw2w algorithm that does not use sense-tagged resources.
Table 5: Comparison with state-of-the-art algorithms: unsupervised (unsup.), semi-supervised (semi-sup.), and supervised (sup.). MFS refers to the MFS heuristic computed on SemCor on each data set and Best refers to the best supervised system of each competition. The results are provided as F1 and the first result with a statistically significant difference from the best (bold result) of each data set is marked with * (χ2, p < 0.05).
| | Method | S7CG | S7CG (N) | S7 | S3 | S2 |
|---|---|---|---|---|---|---|
| unsup. | Nav10 | − | − | 43.1 | 52.9 | − |
| | PPRw2w | 80.1 | 83.6 | 41.7 | 57.9 | 59.7 |
| | WSDgames | 80.4* | **85.5** | 43.3 | 59.1 | 61.2 |
| semi-sup. | IRST-DDD-00 | − | − | − | 58.3 | − |
| | MFS | 76.3 | 77.4 | 54.7 | 62.8 | 65.6* |
| | MRF-LP | − | − | 50.6* | 58.6 | 60.5 |
| | Nav05 | **83.2** | 84.1 | − | 60.4 | − |
| | PPRw2w | 81.4 | 82.1 | 48.6 | 63.0 | 62.6 |
| | WSDgames | 82.8 | 85.4 | 56.5 | 64.7* | 66.0 |
| sup. | Best | 82.5 | 82.3* | **59.1** | 65.2 | **68.6** |
| | Zhong10 | 82.6 | − | 58.3 | **67.6** | 68.2 |
The results show that our unsupervised system performs better than any other unsupervised algorithm on all data sets. On S7CG and S7 the difference is minimal compared with PPRw2w and Nav10, respectively; on S3 and S2 the difference over both unsupervised systems is more substantial. Furthermore, the performance of our system is more stable across the four data sets, showing a consistent improvement over the state of the art.
The comparison with semi-supervised systems shows that our system always performs better than the most frequent sense heuristic when we use information from sense-labeled corpora. We note an unusual behavior on S7CG: when we use prior knowledge, the performance of our semi-supervised system is lower than that of our unsupervised system and of the state of the art. This is because on this data set the performance of our unsupervised system is better than the results that can be achieved by using labeled data to initialize the strategy space of the semi-supervised system. On the other three data sets we note a substantial improvement in the performance of our system, with stable results higher than those of state-of-the-art systems.
Finally, we note that the results of our semi-supervised system on the fine-grained data sets are close to the performance of state-of-the-art supervised systems, with a difference that is statistically significant only on S3. We also note that the performance of our system on the nouns of the S7CG data set is higher than the results of the supervised systems.
6.2.2 Experiments with BabelNet
BabelNet is particularly useful when the number of named entities to disambiguate is high. In fact, it is not possible to perform this task using only WordNet, because its coverage of named entities is limited. For the experiments in this section we used BabelNet to collect the sense inventories of each word to be disambiguated, the mdice measure to weight the graph W, and NASARI to obtain the semantic representation of each sense. The similarity among the representations obtained with this resource is computed using the cosine similarity measure, described in Section 5.2.2. The only difference from the experiments presented in Section 6.2.1 is that we used BabelNet as the knowledge base and NASARI, instead of WordNet, as the resource from which to collect the sense representations.
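The similarity computation itself is standard; a minimal sketch, assuming the NASARI sense vectors have been loaded as dense arrays:

```python
import numpy as np

def cosine(u, v):
    # cosine similarity between two NASARI sense vectors,
    # used to fill the payoff matrices of the games
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```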
S13 consists of 13 documents in different domains, available in five languages (we used only English). All the nouns in these texts are annotated using BabelNet, with a total of 1,931 words to be disambiguated (English data set). KORE50 consists of 50 short English sentences with a total number of 146 mentions manually annotated using YAGO2 (Hoffart et al. 2013). We used the mapping between YAGO2 and Wikipedia to obtain for each mention the corresponding BabelNet concept, since there exists a mapping between Wikipedia and BabelNet. This data set contains highly ambiguous mentions that are difficult to capture without the use of a large and well-organized knowledge base. In fact, the mentions are not explicit and require the use of common knowledge to identify their intended meaning.
We used the SPARQL endpoint provided by BabelNet to collect the sense inventories of the words in the texts of each data set. For this task we kept the first 100 resources whose label contains the lexicalization of the word to be disambiguated. This operation can increase the dimensionality of the strategy space, but it is required because, particularly in KORE50, there are many indirect references, such as Tiger to refer to Tiger Woods (the famous golf player) or Jones to refer to John Paul Jones (the Led Zeppelin bassist).
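A hypothetical sketch of this collection step with SPARQLWrapper; the endpoint URL, the rdfs:label pattern, and the filtering predicate are our assumptions, not the exact query used in the experiments:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://babelnet.org/sparql/")  # assumed endpoint
sparql.setQuery("""
    SELECT DISTINCT ?synset WHERE {
        ?synset rdfs:label ?label .
        FILTER (CONTAINS(LCASE(STR(?label)), "jones"))
    } LIMIT 100
""")
sparql.setReturnFormat(JSON)
rows = sparql.query().convert()["results"]["bindings"]
candidates = [r["synset"]["value"] for r in rows]  # sense inventory
```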
Comparison to State-of-the-Art Algorithms. The results of these experiments are shown in Table 6, where it is possible to see that the performance of our system is close to the results obtained by Babelfy on S13 and substantially higher on KORE50. This is because our approach requires that textual coherence be respected, which is essential when a sentence contains a high level of ambiguity, as is the case in KORE50. On the contrary, PPRw2w performs poorly on this data set because, as noted by Moro, Raganato, and Navigli (2014), it disambiguates the words independently, without imposing any consistency requirements.
Table 6: Comparison with state-of-the-art algorithms on WSD and entity linking. The results are provided as F1 for S13 and as accuracy for KORE50. The first result with a statistically significant difference from the best (bold result) is marked with * (χ2, p < 0.05).

| Method | S13 | KORE50 |
|---|---|---|
| WSDgames | **70.8** | **75.7** |
| Babelfy | 69.2 | 71.5 |
| SUDOKU | 66.3 | − |
| MFS | 66.5* | − |
| PPRw2w | 60.8 | − |
| KORE | − | 63.9* |
| GETALP | 58.3 | − |
The good performance of our approach is also due to the good semantic representations provided by NASARI; in fact, it exploits a richer source of information, Wikipedia, which provides much larger coverage than WordNet alone.
The results on KORE50 are presented as accuracy, following the custom of previous work on this data set. As we anticipated, it contains decontextualized sentences, which require common knowledge to be disambiguated. This common knowledge is obtained by exploiting the relations in BabelNet that connect related entities, but in many cases this is not enough, because the references to the entities are too general and the system is not able to provide an answer. It is also difficult to exploit distributional information on this data set, because the sentences are short and in many cases cryptic. For these reasons the recall on this data set (55.5%) is well below the precision. The system does not provide answers for the entities in sentences such as Jobs and Baez dated in the late 1970s, and she performed at his Stanford memorial, but it is able to correctly disambiguate the same entities in sentences with more contextual information.
7 Conclusions
In this article we introduced a new method for WSD based on game theory. We have provided an extensive introduction to the WSD task and explained the motivations behind the choice to model the WSD problem as a constraint-satisfaction problem. We conducted an extensive series of experiments to identify the similarity measures that perform better in our framework. We have also evaluated our system with two different implementations and compared our results with state-of-the-art systems, on different WSD tasks.
Our method can be considered a continuation of knowledge-based, graph-based, and similarity-based approaches: we combined the methodologies of these three approaches in a game-theoretic framework, which is used to perform a consistent labeling of senses. In our model we try to maximize the textual coherence, imposing that the meaning of each word in a text must be related to the meanings of the other words in the text. To do this we exploited distributional and proximity information to weight the influence that each word has on the others, and semantic similarity information to weight the strength of compatibility between two senses. This is of great importance because it imposes constraints on the labeling process, enforcing contextual coherence in the assignment of senses. The application of a game-theoretic framework guarantees that these assumptions are met. Furthermore, the use of the replicator dynamics equation allows us to always converge to a consistent labeling assignment, a Nash equilibrium of the disambiguation games.
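For concreteness, here is a single-player sketch of the discrete-time replicator update (the multi-player version of the model updates all words simultaneously; nonnegative payoffs are assumed so that the strategy stays on the simplex):

```python
import numpy as np

def replicator_dynamics(x, A, iters=10_000, tol=1e-8):
    """Iterate x_i <- x_i * (Ax)_i / (x' A x) until convergence;
    x is a mixed strategy over senses, A a nonnegative payoff matrix."""
    for _ in range(iters):
        payoff = A @ x                      # expected payoff of each sense
        x_new = x * payoff / (x @ payoff)   # proportional-growth update
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x
```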
Our system, in addition to having a solid mathematical and linguistic foundation, has been demonstrated to perform well compared with state-of-the-art systems and to be extremely flexible. In fact, it is possible to implement new similarity measures, graph constructions, and strategy space initializations to test it in different scenarios. It is also possible to use it in a completely unsupervised fashion or to exploit information from sense-labeled corpora.
The features that make our system competitive with state-of-the-art systems are the following: instead of finding the most important sense in a network and associating it with the meaning of a single word, our system disambiguates all the words at the same time, taking into account the influence that each word has on the others and imposing compatibility among the selected senses before assigning a meaning. We have demonstrated how our system can deal with sense shifts, where a centrality algorithm, which tries to find the most important sense in a network, can be deceived by the context. In our case, the weighting of the context ensures that the proximity structure of a sentence is respected and that each word is disambiguated according to the context in which it appears. This is because the meaning of a word in a sentence does not depend on all the words in the sentence but only on those that share a proximity (or syntactic) relation with it and those that have a high distributional similarity with it.
Acknowledgments
This work was supported by the Samsung Global Research Outreach Program. We are deeply grateful to Rodolfo Delmonte for his insights on the preliminary phase of this work and to Bernadette Sharp for her help during the final part of it. We would also like to thank Phil Edmonds for providing us the correct version of the Senseval 2 data set.
Notes
A complete example of the disambiguation of the first sentence is given in Section 5.3.
We used the IC files computed on SemCor (Miller et al. 1993) for the experiments in this article. They are available at http://wn-similarity.sourceforge.net and are mapped to the corresponding version of WordNet of each data set.
In our case the corpus is composed of all the WordNet glosses.
The resource is available at http://lcl.uniroma1.it/nasari/.
A statistical measure based on the hypergeometric distribution over word frequencies.
Specifically, we used the service provided by the Corpus Linguistics group at FAU Erlangen-Nürnberg, with a collocation span of four words on the left and on the right and collocates with a minimum frequency of 100.
This aspect is not treated in this article.
A more accurate representation of the data can be obtained using the dependency structure of the sentence instead of the n-gram graph; but in this case the results would not have changed, since in both cases there is an edge between river and bank. In fact, in many cases a simple n-gram model can implicitly detect syntactical relations. We used the stop-word list available in the Python Natural Language Toolkit, described earlier.
Semantic similarity and relatedness measures are discussed in Sections 5.2.1 and 5.2.2.
The code of the algorithm and the data sets used are available at http://www.dsi.unive.it/∼tripodi/wsd.
We downloaded S2 from www.hipposmond.com/senseval2, S3 from http://www.senseval.org/senseval3, S7 from http://nlp.cs.swarthmore.edu/semeval/tasks/index.php, and S7CG from http://lcl.uniroma1.it/coarse-grained-aw.
We downloaded S13 from https://www.cs.york.ac.uk/semeval-2013/task12/index.html and KORE50 from http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/aida/downloads/.
Author notes
European Centre for Living Technology, Ca' Minich, S. Marco 2940 30124 Venezia, Italy. E-mail: rocco.tripodi@unive.it.
Dipartimento di Scienze Ambientali, Informatica e Statistica, Via Torino 155 - 30172 Venezia, Italy. E-mail: pelillo@unive.it.