Abstract
Web search result clustering aims to facilitate information search on the Web. Rather than the results of a query being presented as a flat list, they are grouped on the basis of their similarity and subsequently shown to the user as a list of clusters. Each cluster is intended to represent a different meaning of the input query, thus taking into account the lexical ambiguity (i.e., polysemy) issue. Existing Web clustering methods typically rely on some shallow notion of textual similarity between search result snippets, however. As a result, text snippets with no word in common tend to be clustered separately even if they share the same meaning, whereas snippets with words in common may be grouped together even if they refer to different meanings of the input query.
In this article we present a novel approach to Web search result clustering based on the automatic discovery of word senses from raw text, a task referred to as Word Sense Induction. Key to our approach is to first acquire the various senses (i.e., meanings) of an ambiguous query and then cluster the search results based on their semantic similarity to the word senses induced. Our experiments, conducted on data sets of ambiguous queries, show that our approach outperforms both Web clustering and search engines.
1. Introduction
The Web is by far the largest information archive available worldwide. This vast pool of text contains information of the most wildly disparate kinds, and is potentially capable of satisfying virtually any conceivable user need. Unfortunately, however, in this setting retrieving the precise item of information that is relevant to a given user search can be like looking for a needle in a haystack. State-of-the-art search engines such as Google and Yahoo! generally do a good job at retrieving a small number of relevant results from such an enormous collection of data (i.e., retrieving with high precision, low recall). Such systems today, however, still find themselves up against the lexical ambiguity issue (Furnas et al. 1987; Navigli 2009), that is, the linguistic property due to which a single word may convey different meanings.
Recently, the degree of ambiguity of Web queries has been studied using WordNet (Miller et al. 1990; Fellbaum 1998) and Wikipedia1 as sources of ambiguous words.2 It has been estimated that around 4% of Web queries and 16% of the most frequent queries are ambiguous (Sanderson 2008), a finding confirmed in later studies (Clough et al. 2009; Song et al. 2009). An example of an ambiguous query is Butterfly effect, which could refer to chaos theory, a film, a band, an album, a novel, or a collection of poetry. Similarly, black spider could refer to an arachnid, a car, or a frying pan, and so forth.
Lexical ambiguity is often a consequence of the small number of query words that Web users, on average, tend to type (Kamvar and Baluja 2006). This issue could be solved by expanding the initial query with unequivocal cue words. Interestingly, the average query length is continually growing and is now estimated at around three words per query,3 a number that is still too low to eradicate polysemy.
The fact that there may be different information needs for the same user query has been tackled by diversifying search results, an approach whereby a list of heterogeneous results is presented, and Web pages that are similar to ones already near the top are prevented from ranking too highly in the list (Agrawal et al. 2009; Swaminathan, Mathew, and Kirovski 2009). Today even commercial search engines are starting to rerank and diversify their results. Unfortunately, recent work suggests that diversity does not yet play a primary role in ranking algorithms (Santamaría, Gonzalo, and Artiles 2010), but it undoubtedly has the potential to do so (Chapelle, Chang, and Liu 2011).
Another mainstream approach to the lexical ambiguity issue is that of Web clustering engines (Carpineto et al. 2009), such as Carrot4 and Yippy.5 These systems group search results by providing a cluster for each specific meaning of the input query. Users can then select the cluster(s) and the pages therein that best answer their information needs. These approaches, however, do not perform any semantic analysis of search results, clustering them solely on the basis of their lexical similarity.
For instance, given the query snow leopard, Google search returns, among others, the snippets reported in Table 1.6 In the third column of the table we provide the correct meanings associated with each snippet (i.e., either the operating system or the animal sense). Although snippets 2, 4, and 5 all refer to the same meaning, they have no content word in common apart from our query words. As a result, a traditional Web clustering engine would most likely assign these snippets to different clusters. Moreover, snippet 6 shares words with snippets referring to both query meanings (i.e., snippets 1, 2, and 3 in Table 1), thus making it even harder for Web clustering engines to group search results effectively. Finally, none of the top-ranking snippets refers to The Snow Leopard, a popular 1978 book by Peter Matthiessen.
| # | Snippet | Meaning |
|---|---|---|
| 1 | To advance Mac OS X Leopard, Apple engineers… | Software |
| 2 | The snow leopard (Uncia uncia or Panthera uncia) is a moderately large cat native to the mountain ranges | Animal |
| 3 | Mac OS X Snow Leopard (version 10.6) is the seventh and current major… | Software |
| 4 | Get the facts on snow leopards. Endangered Species Act (ESA): the snow leopard is listed as endangered… | Animal |
| 5 | Snow leopards are exceptional athletes capable of making huge leaps over ravines. | Animal |
| 6 | Snow Leopard. Even the name seems to underpromise – it's the first ‘big cat’ OS X codename to reference | Software |
In this article, we present a novel approach to Web search result clustering that explicitly addresses the language ambiguity issue. Key to our approach is the use of Word Sense Induction (WSI), that is, techniques aimed at automatically discovering the different meanings of a given term (i.e., query). Each sense of the query is represented as a cluster of words co-occurring in raw text with the query. Each search result snippet returned by a Web search engine is then mapped to the most appropriate meaning (i.e., cluster) and the resulting clustering of snippets is returned.
This article provides four main contributions:
We present a general evaluation framework for Web search result clustering, which we also exploit to perform a large-scale end-to-end experimental comparison of several graph-based WSI algorithms. In fact, the output of WSI (i.e., the automatically discovered senses) is evaluated in terms of both the quality of the corresponding search result clusters and the resulting ability to diversify search results. This is in contrast with most literature in the field of Word Sense Induction, where experiments are mainly performed in vitro (i.e., not in the context of an everyday application; Matsuo et al. 2006; Manandhar et al. 2010).
In order to test whether our results were strongly dependent on the evaluation measures we implemented in the framework, we complemented our extrinsic experimental evaluation with a qualitative analysis of the automatically induced senses. This study was performed via a manual evaluation carried out by several human annotators.
We present novel versions of previously proposed WSI graph-based algorithms, namely, SquaT++ and Balanced Maximum Spanning Tree (B-MST) (the former is an enhancement of the original SquaT algorithm [Navigli and Crisafulli 2010], and the latter is a variant of MST [Di Marco and Navigli 2011] that produces more balanced clusters).
We show how, thanks to our framework, WSI can be successfully integrated into real-world applications, such as Web search result clustering, so as to outperform non-semantic state-of-the-art Web clustering systems. To the best of our knowledge, with the exception of some very preliminary results (Véronis 2004; Basile, Caputo, and Semeraro 2009), this is the first time that unsupervised text understanding techniques have been shown to considerably boost an Information Retrieval task in a solid evaluation framework.
This article extends previous conference work (Navigli and Crisafulli 2010; Di Marco and Navigli 2011) by performing a novel, in-depth study of the interactions between different corpora and several different WSI algorithms, including novel ones, within the same framework, and, additionally, by providing a comparison with a state-of-the-art search result clustering engine.
The article is structured as follows: in Section 2 we present related work, in Section 3 we illustrate our approach, end-to-end experiments are reported in Section 4, and in vitro experiments are discussed in Section 5. We present a time performance analysis in Section 6, and conclude the paper in Section 7.
2. Related Work
Our work is aimed at addressing the difficulties arising within the different approaches to the issue of lexical ambiguity in Web Information Retrieval. Given the large body of work in this field, in this section we summarize the main research directions on the topic.
2.1 Web Directories
In Web 1.0—mainly based on static Web pages—the solution to clustering search results was that of manually organizing and categorizing Web sites. The resulting repositories are called Web directories and list Web sites by category and possible subcategories. These categories are sometimes organized as taxonomies (like in the Open Directory Project, ODP7).
Although Web directories are not search engines, information can be searched therein. So, given a query, the returned search results are organized by category. For instance, given the query snow leopard the ODP returns the categories shown in Table 2 (the number of matching Web pages is reported in the second column). As can be seen from this example, the Web directory approach has evident limits:
1. It is static, thus it needs manual updates to cover new pages and new meanings (e.g., the book sense of snow leopard is not considered in ODP).
2. It covers only a small portion of the Web (e.g., we only have one Web page categorized with the computing sense of snow leopard, cf. the last row of Table 2).
3. It classifies Web pages using coarse categories, which makes it difficult to distinguish between instances of the same kind (e.g., pages about artists with the same surname classified as Arts:Music:Bands and Artists).
| ODP Category | # pages |
|---|---|
| Science: Biology: Flora and Fauna: …Felidae: Uncia | 6 |
| Kids and Teens: School Time: Science: …Leopards: Snow Leopards | 4 |
| Science: Environment: Biodiversity: Conservation: Mammals: Felines | 3 |
| Kids and Teens: School Time: Science: …Animals: Endangered Species | 1 |
| Computers: Emulators: Apple: Macintosh: SheepShaver | 1 |
Although methods for the automatic classification of Web documents have been proposed (Liu et al. 2005; Xue et al. 2008, inter alia) and some problems have been tackled effectively (Bennett and Nguyen 2009), these approaches are usually supervised and still suffer from reliance on a predefined taxonomy of categories. Finally, it has been reported that directory-based systems are among the most ineffective solutions to Web information retrieval (Bruza, McArthur, and Dennis 2000).
2.2 Semantic Information Retrieval
A second approach to query ambiguity consists of associating explicit semantics (i.e., word senses or concepts) with queries and documents, that is, performing Semantic Information Retrieval (SIR). SIR is performed by indexing and searching concepts rather than terms, that is, by means of Word Sense Disambiguation (WSD; Navigli 2009), thus potentially coping with two linguistic phenomena: expressing a single meaning with different words (synonymy) and using the same word to express various different meanings (polysemy). The main idea is that assigning concepts to words can potentially overcome these two issues, enabling a shift from the lexical to the semantic level to be achieved.
Over the years, various methods for SIR have been proposed (Krovetz and Croft 1992; Voorhees 1993; Mandala, Tokunaga, and Tanaka 1998; Gonzalo, Penas, and Verdejo 1999; Kim, Seo, and Rim 2004; Liu, Yu, and Meng 2005, inter alia). Contrasting results have been reported on the benefits of these techniques, however: It has been shown that WSD has to be very accurate to benefit Information Retrieval (Sanderson 1994)—a result that was later debated (Gonzalo, Penas, and Verdejo 1999; Stokoe, Oakes, and Tait 2003). Also, it has been reported that WSD has to be very precise on minority senses and uncommon terms, rather than on frequent words (Krovetz and Croft 1992; Sanderson 2000).
The main drawback of SIR is that it relies on the existence of a reference dictionary to perform WSD (typically, WordNet) and thus suffers from this dictionary's static nature and its inherent paucity of most proper nouns. This latter problem is particularly important for Web searches, as users tend to retrieve more information about named entities (e.g., singers, artists, cities) than concepts (such as abstract information about singers or artists). Although lexical knowledge resources that integrate lexicographic senses with named entities on a large scale have recently been created (Navigli and Ponzetto 2012), it is still to be shown that their use for SIR is beneficial. Moreover, these resources do not yet tackle the dynamic evolution of language.
In contrast, our WSI approach to search result clustering automatically discovers both lexicographic and encyclopedic senses of a query (including new ones), thus taking into account all of the mentioned issues.
2.3 Search Result Clustering
A more popular approach to query ambiguity is that of search result clustering. Typically, given a query, the system starts from a flat list of text snippets returned from one or more commonly available search engines and clusters them on the basis of some notion of textual similarity. At the root of the clustering approach lies van Rijsbergen's cluster hypothesis (van Rijsbergen 1979, page 45): “closely associated documents tend to be relevant to the same requests,” whereas results concerning different meanings of the input query are expected to belong to different clusters.
Approaches to search result clustering can be classified as data-centric or description-centric (Carpineto et al. 2009). The former focus more on the problem of data clustering than on presenting the results to the user. A pioneering example is Scatter/Gather (Cutting et al. 1992), which divides the data set into a small number of clusters and, after the selection of a group, performs clustering again and proceeds iteratively. Developments of this approach have been proposed that improve on cluster quality and retrieval performance (Ke, Sugimoto, and Mostafa 2009). Other data-centric approaches use agglomerative hierarchical clustering (e.g., LASSI [Maarek et al. 2000]), rough sets (Ngo and Nguyen 2005), or exploit link information (Zhang, Hu, and Zhou 2008).
Description-centric approaches are, instead, more focused on the description to produce for each cluster of search results. Among the most popular and successful approaches are those based on suffix trees. Suffix trees are rooted directed trees that contain all the suffixes of a string s. The label of each edge is a non-empty substring of s and each vertex v is labeled with the concatenation of the edge labels on the path from the root to v. If we view the search result snippets to be clustered as a set of strings (i.e., their bag of words), each vertex of the corresponding suffix tree can be considered as a set of documents that share a phrase (i.e., the label of the vertex itself) and therefore the vertices represent a set of base clusters B = (b1, b2, …, bn). The original Suffix Tree Clustering (STC; Zamir et al. 1997; Zamir and Etzioni 1998) algorithm obtains the final clustering by merging the clusters in B with a high overlap in the documents they contain. A scoring function is defined, based on both the number of documents in the base cluster and the length of the common phrase, with the aim of returning only the top k clusters.
Later developments improved the performance of the STC algorithm using document–document similarity scores in order to overcome the low scalability of the original approach (Branson and Greenberg 2002). Crabtree, Gao, and Andreae (2005) identified an issue in the original scoring function whereby unreasonably high scores tend to be assigned to clusters obtained as a result of the merging of very similar base clusters. To solve this problem, they proposed the Extended Suffix Tree Clustering algorithm (ESTC) with a novel scoring function and a new procedure for selecting the top k clusters to be returned.
More recent approaches based on suffix trees extract relevant keyphrases from generalized suffix trees (i.e., trees which contain suffixes of a set of strings S = {s1, s2, …, s|S|}) in order to choose meaningful labels for the output clusters (Bernardini, Carpineto, and D'Amico 2009; Carpineto, D'Amico, and Bernardini 2011).
Other approaches to description-centric search result clustering in the literature are based on formal concept analysis (Carpineto and Romano 2004), singular value decomposition (Osinski and Weiss 2005), spectral clustering (Cheng et al. 2005), spectral geometry (Liu et al. 2008), link analysis (Gelgi, Davulcu, and Vadrevu 2007), and graph connectivity measures (Di Giacomo et al. 2007). Search result clustering has also been viewed as a supervised salient phrase ranking task (Zeng et al. 2004).
Whereas search result clustering has heretofore been performed without the explicit use of lexical semantics, in our work we show how to exploit search result clustering as the common evaluation framework of both semantic and non-semantic clustering engines.
2.4 Diversification
Rather than clustering the top search results by their similarity, one can aim at reranking them on the basis of criteria that maximize their diversity, so as to present top results which are as different from each other as possible. This technique, called diversification of search results, is a recent research topic that, yet again, deals with the query ambiguity issue. To some extent, today's search engines, such as Google and Yahoo!, apply some diversification technique to their top-ranking results.
One of the first examples of diversification algorithms was based on the use of similarity functions to measure the diversity between documents and between document and query (Carbonell and Goldstein 1998). Other diversification techniques use conditional probabilities to determine which document is most different from higher-ranking ones (Chen and Karger 2006), or use affinity ranking (Zhang et al. 2005), based on topic variance and coverage.
An algorithm called Essential Pages (Swaminathan, Mathew, and Kirovski 2009) has been proposed that aims to reduce information redundancy and returns Web pages that maximize coverage with respect to the input query. In this approach the Web search results for a query q are transformed into bags of words containing the terms occurring in the corresponding Web page. Frequency information from raw corpora is then used to find relevant words for q, that is, words which are generally infrequent, but occur often in the results retrieved for q. The coverage score of a search result r is then calculated as a function of the number of terms relevant for q and contained in r. Another interesting approach reformulates the problem explicitly in terms of how to minimize the risk of dissatisfaction for the average user (Agrawal et al. 2009). A greedy algorithm is proposed that balances between relevance and diversity of the search results. The algorithm is evaluated using generalizations of classical Information Retrieval metrics that are based on statistical considerations and take into account the intentions of the user.
More recently, vector space model representations have been explored to improve diversity in search results (Santamaría, Gonzalo, and Artiles 2010). Web page results are represented as vectors and compared against vector representations of encyclopedic entries available from Wikipedia using cosine similarity. Search results are diversified accordingly.
Finally, in the last few years the specific structure of the Web has been exploited to perform diversification, as proposed by Ma, Lyu, and King (2010), who make use of Markov random walks on query-URL bipartite graphs, and Chandar and Carterette (2010), who cluster search results by exploiting the links in Web pages in order to identify the subtopics of the returned documents.
2.5 Word Sense Induction
A fifth solution to the query ambiguity issue is Word Sense Induction (WSI), namely, the automatic discovery of word (i.e., query) senses from raw text (see Navigli [2009, 2012] for a survey). WSI allows us to go beyond the surface similarity of Web snippets (which hampers the performance of Web search result clustering) by dynamically acquiring an inventory of senses of the input query. The core idea is to then use these query senses to cluster the Web snippets returned by a traditional search engine.
Very little work on this topic exists: Vector-based WSI was successfully shown to improve bag-of-words ad hoc Information Retrieval (Schütze and Pedersen 1995) and preliminary studies (Udani et al. 2005; Chen, Zaïane, and Goebel 2008) have provided interesting insights into the use of WSI for Web search result clustering. A more recent attempt at automatically identifying query meanings is based on the use of hidden topics (Nguyen et al. 2009). In this approach, however, topics (estimated from a universal data set) are query-independent and thus their number needs to be established beforehand. In contrast, we aim to cluster snippets on the basis of a dynamic and finer-grained notion of sense.
An exploratory study on ten query words has shown that the majority of relevant uses of query words can be identified using graph-based WSI (Véronis 2004). In the present work we take this preliminary finding to the next level, by studying the impact of several graph-based WSI algorithms on a large scale and by integrating them into a Web search result clustering framework. As a result, we are able not only to perform an end-to-end evaluation of WSI approaches, but also to compare them with traditional search result clustering techniques, which instead lack explicit semantics for the query meanings.
2.6 Aspect Identification
Over recent years a line of research has been developed in the field of Information Retrieval that makes use of query logs and clickthrough information to identify and model the aspects of a given query in terms of the user intents for that query. Aspects can be identified by exploiting those queries in the past that enabled the user to retrieve documents that are close to the current input query (Wang and Zhai 2007). A different approach aims, instead, at extracting related queries from query logs as candidate aspects and discarding duplicate and redundant aspects using search results. Wikipedia InfoBoxes are used to cluster candidate aspects into classes (Wu, Madhavan, and Halevy 2011). Latent aspects of queries can also be extracted from query reformulations within historical search session logs (Wang, Chakrabarti, and Punera 2009). More recently, a topic modeling approach based on query logs and click data has been proposed that aims at discovering generic aspects pervading manually fixed categories of named entities (Xue and Yin 2011). The implicit user-specific aspect of a query can be obtained from short query log sessions of other users using a Markov logic learning model. This results in the documents that best model the user's intentions when entering a query (Mihalkova and Mooney 2009). Finally, a semi-supervised approach has recently been applied to create class labels that are later assigned to latent clusters of queries using a Hierarchical Dirichlet Process (Reisinger and Pasca 2011).
This line of research has some points of contact with WSI, but also important differences:
- Most important, aspect identification aims at discriminating between very fine-grained facets of a given query, such as those of rental, pricing, and accidents of a car, in contrast to WSI, whose goal is that of inducing different meanings of the given query, such as car as a motor vehicle, railroad car, song, novel, or even primitive in the LISP programming language. In this respect, the two tasks are complementary, because once WSI has discovered the different senses of a query, one can then apply aspect identification to detect subsenses of each meaning.
- Much work based on query logs and click data requires reliable statistics, which are not always available in all languages. WSI relies instead on raw text corpora, which can easily be obtained for any language. This difference also holds for custom search engines not working on the Web, which might not have enough statistics from their users, but could instead resort to raw (domain) corpora.
- Privacy and availability issues are often mentioned in connection with query logs and clickthrough data, therefore making research on this topic hard to replicate and evaluate objectively, especially in comparison with other systems.
3. Semantically Enhanced Search Result Clustering
Web search result clustering is usually performed in three main steps:
1. Given a query q, a search engine is used to retrieve a list of results R = ( r1, …, rn ).
2. A clustering C of the results in R is obtained by means of a clustering algorithm.
3. The clusters in C are optionally labeled with an appropriate algorithm (e.g., Zamir and Etzioni 1998; Carmel, Roitman, and Zwerdling 2009) for visualization purposes.
First, we preprocess the set R of search results returned by the search engine (Section 3.1). Next, to inject semantics into search result clustering, we propose improving Step 2 by means of a WSI algorithm: Given a query q, we first dynamically induce, from a text corpus, the set of word senses of q (Section 3.2); next, we cluster the Web results on the basis of the word senses previously induced (Section 3.3). We show our framework in Figure 1.
3.1 Preprocessing of Web Search Results
As a result of submitting our query q to a search engine, we obtain a list of relevant search results R = ( r1, …, rn ). In order to make this list usable by a clustering algorithm, each result ri is processed by means of four steps aimed at transforming it into a bag of words bi:
1. We obtain the snippet si corresponding to the result ri.
2. We apply tokenization to si, thus splitting the string into tokens and setting them to lowercase.
3. We augment the current token set with multi-word expressions obtained by compounding subsequent word tokens up to φ words (a parameter whose tuning is described later in Section 4.1.4). The terms in the resulting token set are lemmatized using WordNet as reference lexicon. We remove tokens that are not in the WordNet lexicon (e.g., the, esa).
4. We remove the stopwords (e.g., get, on, be, as) and the target query words (e.g., snow, leopard, snow leopard) from the token set.
| step | output |
|---|---|
| initial snippet | “Get the facts on snow leopards. Endangered Species Act (ESA): the snow leopard is listed as endangered” |
| tokenization | { get, the, facts, on, snow, leopards, endangered, species, act, esa, leopard, is, listed, as } |
| compounding and lemmatization | { get, fact, on, snow, leopard, snow_leopard, endangered, species, endangered_species, act, be, listed, as } |
| stopword and query word removal | { fact, endangered, species, endangered_species, act, listed } |
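As an illustration, the four steps could be sketched as follows with NLTK and WordNet; this is a hypothetical re-implementation, and the stopword list, lemmatizer, and compounder used in the article may differ in detail.

```python
import re
from nltk.corpus import stopwords, wordnet as wn   # requires the NLTK wordnet and stopwords data
from nltk.stem import WordNetLemmatizer

_lemmatizer = WordNetLemmatizer()
_stop = set(stopwords.words("english"))

def snippet_to_bag(snippet, query_words, phi=2):
    """Turn one search result snippet into a bag of lemmatized WordNet terms."""
    # step 2: tokenize and lowercase
    tokens = re.findall(r"[a-z]+", snippet.lower())
    # step 3: compound consecutive tokens (up to phi words), lemmatize,
    # and keep only terms found in the WordNet lexicon
    terms = set()
    for n in range(1, phi + 1):
        for i in range(len(tokens) - n + 1):
            lemma = _lemmatizer.lemmatize("_".join(tokens[i:i + n]))
            if wn.synsets(lemma):
                terms.add(lemma)
    # step 4: remove stopwords and the query words themselves
    return {t for t in terms if t not in _stop and t not in query_words}
```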
3.2 Graph-Based Word Sense Induction
The next step is to dynamically discover the senses of the input query q and provide a representation for them that will later be used for semantically clustering the snippets preprocessed in the previous step. WSI algorithms are unsupervised techniques aimed at automatically identifying the set of senses denoted by a word. These methods induce word senses from text by clustering word occurrences on the basis of the idea that a given word—used in a specific sense—tends to co-occur with the same neighboring words (Harris 1954). Several approaches to WSI have been proposed in the literature (see Navigli [2009, 2012] for a survey), ranging from clustering based on context vectors (e.g., Schütze 1998) and word similarity (e.g., Lin 1998) to probabilistic frameworks (Brody and Lapata 2009), latent semantic models (Van de Cruys and Apidianaki 2011), and co-occurrence graphs (e.g., Widdows and Dorow 2002).
In our work, we chose to focus on approaches based on co-occurrence graphs for two reasons:
1. They have been shown to achieve state-of-the-art performance in standard evaluation tasks (Agirre et al. 2006b; Agirre and Soroa 2007; Korkontzelos and Manandhar 2010).
2. Other approaches are either based on syntactic dependency statistics (Lin 1998; Van de Cruys and Apidianaki 2011), which are hard to obtain on a large scale for many domains and languages, or based on large matrix computation methods such as context-group discrimination (Schütze 1998), non-negative matrix factorization (Van de Cruys and Apidianaki 2011), and Clustering by Committee (Lin and Pantel 2002). Instead, in our approach we aim to exploit the relational structure of word co-occurrences with lower requirements (i.e., using just a stopword list, a lemmatizer, and a compounder, cf. Section 3.1), assuming that the semantics of a word are represented by means of a co-occurrence graph whose vertices are co-occurrences and whose edges are co-occurrence relations.
We consider the following five graph-based WSI algorithms in our framework:

- Curvature clustering (Dorow et al. 2005), an algorithm based on the participation ratio of words in graph triangles, that is, complete graphs with three vertices.
- Squares, Triangles, and Diamonds (SquaT++), an algorithm that integrates two graph patterns previously exploited in the literature (Navigli and Crisafulli 2010), namely, squares and triangles, with a novel pattern called diamond.
- Balanced Maximum Spanning Tree Clustering (B-MST), an extension of a WSI algorithm based on the calculation of a Maximum Spanning Tree (Di Marco and Navigli 2011) that aims at balancing the number of co-occurrences in each sense cluster.
- HyperLex (Véronis 2004), an algorithm based on the identification of hubs (representing basic meanings) in co-occurrence graphs.
- Chinese Whispers (Biemann 2006), a randomized algorithm that partitions the graph vertices by iteratively transferring the mainstream message (i.e., word sense) to neighboring vertices.
All of these graph algorithms for WSI consist of a common step, namely, co-occurrence graph construction (described in Section 3.2.1) and a second step, namely, the discovery of word senses, whose implementation depends on the specific algorithm adopted. We discuss the second phase of each algorithm separately (Section 3.2.2).
3.2.1 Step 1: Graph Construction
Given a target query q, we build a co-occurrence graph Gq = (V, E) such that V is the set of words8 co-occurring with q, and E is the set of undirected edges, each denoting a co-occurrence between pairs of words in V. We harvest the statistics for co-occurring words V from a text corpus (we used two different corpora, see Section 4.1.2), which was previously tokenized and lemmatized.
First, for each word w we calculate the total number c(w) of its occurrences and the number of times c(w, w′) that w occurs together with some word w′ in the same context (to this end, we use the lemmas corresponding to inflected forms in the text). For instance, in Table 4, assuming w = lion, we show the absolute count c(w′) of some words (second column) together with the joint co-occurrence count c(w, w′) of words w′ occurring with w = lion in the same context (third column). Note that the co-occurrences w′ may refer to different senses of word w—for example, africa and savannah refer to the animal sense of lion, whereas technology and software to the operating system sense. Moreover, w′ may be ambiguous itself in the context of w (e.g., tiger as either an animal or an operating system).
| word w′ | c(w′) | c(w, w′) | Dice(w, w′) |
|---|---|---|---|
| animal | 213,414 | 5,109 | 0.2534 |
| videogame | 201,342 | 4,945 | 0.2042 |
| mac | 194,056 | 4,940 | 0.1568 |
| africa | 189,011 | 4,521 | 0.1961 |
| feline | 167,487 | 4,548 | 0.1472 |
| cat | 161,980 | 4,493 | 0.1214 |
| savannah | 159,693 | 3,535 | 0.1091 |
| predator | 145,239 | 3,643 | 0.1065 |
| apple | 140,670 | 3,261 | 0.1043 |
| tiger | 134,702 | 2,147 | 0.1024 |
| technology | 129,483 | 2,017 | 0.0097 |
| software | 113,045 | 1,846 | 0.0084 |
| iPod | 112,100 | 1,803 | 0.0070 |
| simulation | 93,899 | 1,367 | 0.0031 |
Table 4 reports the Dice coefficients in the fourth column for the example words. The rationale behind the use of the Dice coefficient, as opposed to, for example, a simple co-occurrence count such as c(w, w′), is that dividing by the average of the total counts of the two words drastically decreases the ranking of words that tend to co-occur frequently with many other words (home, page, etc.).
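In symbols, the coefficient described here (the joint co-occurrence count divided by the average of the two total counts) is the standard Dice coefficient:

```latex
\mathrm{Dice}(w, w') \;=\; \frac{c(w, w')}{\tfrac{1}{2}\bigl(c(w) + c(w')\bigr)} \;=\; \frac{2\,c(w, w')}{c(w) + c(w')}
```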
Finally, we use the occurrence and co-occurrence counts just collected to construct the co-occurrence graph Gq = (V, E) for the input query q. The pseudocode of our graph construction procedure is shown in Algorithm 1 and consists of the following steps:
- a. Initialization with snippet words (lines 1–2): Initially we set V to contain all the content words from the bags of words obtained from the snippet results of query q, that is, V := b1 ∪ … ∪ bn, where bj is the bag of words corresponding to the search result rj ∈ R as obtained after the preprocessing step (see Section 3.1). We also set E := ∅, that is, the edge set is initially empty.
- b. Adding first-order co-occurrences (lines 3–5): We augment V with the highest-ranking words co-occurring with query q in the selected text corpus, that is, those words w whose number of co-occurrences with q and whose Dice score with q reach the thresholds δ and δ′, respectively (Equation (2)), where δ and δ′ are experimentally tuned thresholds (cf. Section 4.1.4).
- c. Adding second-order co-occurrences (lines 6–11): Optionally, we create an auxiliary copy V(0) of V. For each word w ∈ V(0) we augment V with those words w′ which are strongly related to w in the text corpus. In other words we add w′ to V if both Equations (2) are satisfied for the pair of words w and w′.
- d. Creating the co-occurrence graph (lines 12–15): For each pair of words (w, w′) ∈ V × V, we add the corresponding edge {w, w′} to E with weight Dice(w, w′) if Dice(w, w′) ≥ θ, where θ is a confidence threshold for the co-occurrence relation. Note that we use a threshold δ′ to select which vertices to add to the graph Gq (Step [b]) and we use a potentially different threshold θ for the selection of which edges to add to Gq. Finally, we remove from V all the disconnected vertices (i.e., those with degree 0).
As a result of this algorithm a co-occurrence graph Gq for the query q is produced. Consider again the target word lion and let us assume that the words in Table 4 are the only co-occurrences of lion. In Figure 2 we show the execution of the four steps of our graph construction algorithm for the input query lion, assuming δ = 0.38, δ′ = θ = 0.003, and c(lion) = 350,727. First, we initialize the graph with the words in the snippets returned for lion (Figure 2a), next we add the words co-occurring with the query (Figure 2b), then second-order co-occurrences, that is, words co-occurring with those just added to the graph (Figure 2c), and finally we add those edges between word pairs whose Dice value is above a threshold (Figure 2d).
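The following sketch mirrors the four steps above. Because Equation (2) is not reproduced here, the vertex-addition test is approximated by requiring the raw co-occurrence count to reach δ and the Dice score to reach δ′; `count` and `joint` are assumed dictionaries of corpus statistics (c(w) and c(w, w′) with keys in sorted order).

```python
from itertools import combinations
import networkx as nx

def dice(c1, c2, c12):
    # joint count divided by the average of the two total counts
    return 2.0 * c12 / (c1 + c2) if (c1 + c2) > 0 else 0.0

def build_cooccurrence_graph(query, snippet_bags, count, joint,
                             delta, delta_prime, theta, second_order=False):
    """Hedged sketch of the graph construction procedure (Algorithm 1)."""
    def c(w, w2):
        return joint.get(tuple(sorted((w, w2))), 0)

    def related(w, w2):
        return (c(w, w2) >= delta and
                dice(count.get(w, 0), count.get(w2, 0), c(w, w2)) >= delta_prime)

    # (a) initialise the vertex set with the content words of the snippets
    V = set().union(*snippet_bags) if snippet_bags else set()
    # (b) add first-order co-occurrences of the query
    V |= {w for w in count if w != query and related(query, w)}
    # (c) optionally add second-order co-occurrences
    if second_order:
        V |= {w2 for w in list(V) for w2 in count if w2 != w and related(w, w2)}
    # (d) add an edge for every pair whose Dice score reaches theta
    G = nx.Graph()
    G.add_nodes_from(V)
    for w, w2 in combinations(V, 2):
        d = dice(count.get(w, 0), count.get(w2, 0), c(w, w2))
        if d >= theta:
            G.add_edge(w, w2, weight=d)
    G.remove_nodes_from(list(nx.isolates(G)))  # drop disconnected vertices
    return G
```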
3.2.2 Step 2: Sense Discovery
All the graph-based WSI algorithms that we implemented in our framework are designed to discover the senses of an input term, which in our specific application is the query q. This process of meaning discovery is carried out through the use of the relational and structural information contained in the co-occurrence graph we have just created. In fact, a co-occurrence graph Gq = (V,E) for a query q contains: (i) vertices w ∈ V corresponding to words highly related to q, and (ii) edges e ∈ E representing co-occurrence relations between vertices (i.e., words) in V. The key idea behind graph-based WSI is to obtain a partition S = (S1, …, Sm) of Gq such that each component Si = (Vi, Ei) contains structurally (i.e., semantically) related vertices (i.e., words). In other words, each vertex set Vi is intended to contain only words related to a specific sense of q. As a result S is the sense inventory for the query q and each Si is a sense cluster.
We now introduce each graph-based WSI algorithm in detail.
The curvature algorithm is designed to identify the meaning components by removing all vertices whose curvature is below a certain threshold σ, where the curvature of a vertex is the ratio between the number of triangles in which it actually participates and the number of triangles in which it could potentially participate given its neighbors. For example, we can attribute two different meanings to the word Napoleon, namely, a French emperor and an American city. By looking at the graph in Figure 3 we can easily see that Napoleon participates in two triangles (represented by continuous lines) and could potentially participate in four additional triangles (i.e., those including dashed lines). It follows that the curvature of Napoleon is 2/(2+4) = 1/3. The deletion of the vertex Napoleon results in two components (respectively containing the vertices { France, revolution } and { Ohio, America }) representing the two mentioned meanings.
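A minimal sketch of this procedure, assuming that the curvature of a vertex coincides with its local clustering coefficient (the triangle participation ratio described above), might look as follows:

```python
import networkx as nx

def curvature_clusters(G, sigma):
    """Remove vertices with curvature below sigma and return the remaining
    connected components as sense clusters (a hedged sketch)."""
    # local clustering coefficient = triangles through v / potential triangles
    curvature = nx.clustering(G)
    pruned = G.copy()
    pruned.remove_nodes_from([v for v, c in curvature.items() if c < sigma])
    return [set(component) for component in nx.connected_components(pruned)]
```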
SquaT++ is a generalization of the curvature algorithm in that: (i) it uses the triangle pattern to calculate curvature, and (ii) it disconnects the graph using the same algorithm as curvature. SquaT++ is a novel algorithm, however, that extends the previously proposed SquaT (Navigli and Crisafulli 2010), based on triangles and squares, by introducing a new pattern, namely, the diamond, whose clustering coefficient is linearly combined with the other two. Moreover, in our experiments we tested two versions of SquaT++: the traditional one in which the coefficient is calculated on vertices (like in Equation (8)), and a variant calculated on edges. Our hunch here is that removing low-ranking edges rather than vertices might produce more informative clusters, because no word is removed from the original graph. In what follows, we refer to the vertex version as SquaT++V and to the variant on edges as SquaT++E, and we refer to the general algorithm as SquaT++.
B-MST. A more global approach to the identification of sense components is the Balanced Maximum Spanning Tree (B-MST), which is based on the computation of the Maximum Spanning Tree (MST) of the co-occurrence graph. Cluster meanings are identified by iteratively removing the edges which represent structurally weak relations, i.e., those with lower weight in the MST. The procedure is as follows:
1. Eliminate from Gq all vertices whose degree is 1.
2. Calculate the maximum spanning tree of the graph Gq (e.g., the bold edges in Figure 4(c) represent the maximum spanning tree for our initial graph).
The original MST algorithm for WSI, proposed by Di Marco and Navigli (2011), iteratively eliminates the minimum-weight edge whose endpoints both have degree ≥ 2, until N connected components (i.e., word clusters) are obtained or there are no more edges to eliminate. The problem with this approach is that it can generate unbalanced clusters (i.e., a few very large clusters and several small clusters); for this reason we developed the B-MST variant, which calculates an appropriate cluster mean cardinality10 and removes an edge only if its elimination does not lead to connected components with cardinality less than half of the calculated mean value. This additional constraint prevents the creation of very small clusters, while at the same time avoiding artificial equal-size clusters.
Following our lion example, and assuming that the value of the only parameter of B-MST (i.e., the maximum number N of meanings to be identified) is set to 3, we obtain the clusters in Figure 4(d).
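A minimal sketch of B-MST under these assumptions (the balance constraint is implemented as described above: no removal may create a component smaller than half of the mean cluster cardinality; the article's exact procedure may differ in detail):

```python
import networkx as nx

def b_mst_clusters(G, n_clusters):
    """Hedged sketch of B-MST over a weighted co-occurrence graph."""
    G = G.copy()
    # step 1: drop vertices of degree 1
    G.remove_nodes_from([v for v, d in dict(G.degree()).items() if d == 1])
    # step 2: maximum spanning tree (or forest) of the remaining graph
    T = nx.maximum_spanning_tree(G, weight="weight")
    min_size = (T.number_of_nodes() / n_clusters) / 2.0
    # step 3: remove weak edges in increasing weight order,
    # subject to the balance constraint
    for u, v, w in sorted(T.edges(data="weight"), key=lambda e: e[2]):
        if nx.number_connected_components(T) >= n_clusters:
            break
        T.remove_edge(u, v)
        if min(len(c) for c in nx.connected_components(T)) < min_size:
            T.add_edge(u, v, weight=w)   # undo: would create a too-small cluster
    return [set(c) for c in nx.connected_components(T)]
```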
HyperLex. Another option for sense discovery is that of HyperLex, which identifies the most interconnected vertices in the graph Gq, called hubs. Each hub acts as the “root” of a specific component of Gq and, correspondingly, a meaning of the target query q.
As an example, consider the co-occurrence graph in Figure 2(d). A list of the vertices in the graph is created, sorted by c(w′), as shown in Table 4. For the purpose of our example, let us assume σ = 0.5 and σ′ = 0.015. The first hub to be selected is animal. All its neighbors (tiger, feline, cat, predator, africa, and savannah in our example) are also removed from the list. The next hub is videogame (its neighbors simulation, software, and technology are also removed from the list). The last hub is mac; after the removal of its neighbor (apple) from the list, the last vertex to be examined is iPod, which cannot be selected as a hub because it does not satisfy the second condition of Equation (9). The selected hubs are shown as rectangles in Figure 4(e).
Once the hub selection process is complete, the target query q is added to the set of vertices V of graph Gq and each hub is connected to q with an infinite-weight edge (see vertex lion and its edges added to the graph in Figure 4(e)). Then, a maximum spanning tree Tq of the graph is calculated starting from vertex q (see the bold edges in Figure 4(e)). As a result, Tq will include all the infinite-weight edges from q to its direct descendants, namely, the hubs. Vertex q is then removed from the graph so that each subtree rooted at a hub in Tq represents a word sense for the target query q (see Figure 4(f)). In our example, three clusters are produced: { videogame, simulation, software, technology }, { mac, apple, iPod }, and { animal, feline, tiger, cat, predator, africa, savannah }. Note that, in our example, HyperLex and SquaT++ found the same meanings for the query word lion (namely, the animal, the operating system, and the videogame), but produced different clusters (e.g., HyperLex assigns the word tiger to the animal cluster whereas SquaT++ removes it from the graph). Finally, notice that in HyperLex the number of senses is dynamically chosen on the basis of the co-occurrences of q and the algorithm's thresholds.
An alternative approach to hub selection as performed in HyperLex consists of using the PageRank algorithm to sort the vertices of the co-occurrence graph and choose the best ranking ones as hubs of the target word (Agirre et al. 2006b). Given that the performance of this variant is comparable to that of HyperLex, in this work we focus on the original version of the induction algorithm.
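A minimal sketch of the hub-based procedure follows. The hub-selection test of Equation (9) is abstracted into a user-supplied predicate `is_hub` (the equation itself is not reproduced here), and corpus frequencies are assumed to be stored as a `count` attribute on vertices.

```python
import networkx as nx

def hyperlex_clusters(G, query, is_hub):
    """Hedged sketch of HyperLex sense discovery on a weighted graph."""
    # 1. select hubs greedily from vertices sorted by corpus frequency,
    #    skipping neighbors of already selected hubs
    order = sorted(G.nodes(), key=lambda v: G.nodes[v].get("count", 0), reverse=True)
    hubs, covered = [], set()
    for v in order:
        if v in covered:
            continue
        if is_hub(v, G):
            hubs.append(v)
            covered.update(G[v])
    # 2. connect the query to every hub with an infinite-weight edge and
    #    compute a maximum spanning tree rooted at the query
    H = G.copy()
    H.add_node(query)
    for h in hubs:
        H.add_edge(query, h, weight=float("inf"))
    T = nx.maximum_spanning_tree(H, weight="weight")
    # 3. remove the query: each subtree rooted at a hub is one sense cluster
    T.remove_node(query)
    return [set(c) for c in nx.connected_components(T)]
```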
Chinese Whispers. All the previously presented algorithms work in a top–down fashion, that is, they iteratively remove edges or vertices from an initial co-occurrence graph until a number of partitions are obtained. The last algorithm we consider, called Chinese Whispers, works, instead, bottom–up. The pseudocode, shown in Algorithm 2, consists of the following two steps:
1. First, the algorithm assigns a distinct class i to each vertex vi and creates a clustering C containing the singleton clusters Ci (lines 1–4 of the algorithm).
2. Second, a series of iterations is performed aimed at merging the clusters (lines 5–11). Specifically, at each iteration the algorithm analyzes each vertex v in random order and assigns it to the majority class among those associated with its neighbors. In other words, it assigns each vertex v to the class c that maximizes the sum of the weights of the edges {u, v} incident on v such that c is the class of u, that is, class(v) := argmax_c Σ_{ {u,v} ∈ E : class(u) = c } weight({u, v}).

As soon as an iteration produces no change in the clustering (line 11), the algorithm stops and outputs the final clustering (line 12). In contrast to the previous algorithms, Chinese Whispers is parameter-free. Figure 4(g) shows an output example for this algorithm on the lion co-occurrence graph.
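A minimal sketch of this procedure, assuming an undirected, weighted networkx-style graph (tie-breaking and the iteration cap are implementation choices, not part of Algorithm 2):

```python
import random

def chinese_whispers(G, max_iter=100):
    """Hedged sketch of Chinese Whispers; G is a networkx-style weighted graph."""
    # 1. assign a distinct class to each vertex
    labels = {v: i for i, v in enumerate(G.nodes())}
    for _ in range(max_iter):
        changed = False
        nodes = list(G.nodes())
        random.shuffle(nodes)          # visit vertices in random order
        for v in nodes:
            if not G[v]:
                continue
            # sum incident edge weights per neighboring class
            scores = {}
            for u in G[v]:
                w = G[v][u].get("weight", 1.0)
                scores[labels[u]] = scores.get(labels[u], 0.0) + w
            best = max(scores, key=scores.get)
            if best != labels[v]:
                labels[v] = best
                changed = True
        if not changed:                # 2. stop when an iteration changes nothing
            break
    # group vertices by final class
    clusters = {}
    for v, c in labels.items():
        clusters.setdefault(c, set()).add(v)
    return list(clusters.values())
```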
3.3 Clustering of Web Search Results
We are now ready to semantically cluster our Web search results R, which we previously transformed into bags of words B (cf. Section 3.1). To this end we use the automatically discovered senses for our input query q (cf. Section 3.2). We adopt different measures, each of which calculates the similarity between a bag of words bi ∈ B and the sense clusters { S1, …, Sm } acquired as a result of Word Sense Induction.
We now present three different similarity measures between snippet bags of words and sense clusters (cf. Equation (11)), which we implemented in our framework.
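As a simple illustration in the spirit of the Word Overlap measure used later in Section 4.1.4 (the exact definitions of the three measures may differ), a snippet bag can be assigned to the induced sense whose word cluster it overlaps most:

```python
def word_overlap(bag, sense_cluster):
    """One plausible snippet-to-sense similarity: normalized word overlap.
    This is an illustrative assumption, not necessarily the article's definition."""
    return len(bag & sense_cluster) / len(bag) if bag else 0.0

def assign_snippets(bags, sense_clusters):
    """Map each snippet bag of words to its most similar induced sense cluster."""
    clustering = {i: [] for i in range(len(sense_clusters))}
    for j, bag in enumerate(bags):
        best = max(range(len(sense_clusters)),
                   key=lambda i: word_overlap(bag, sense_clusters[i]))
        clustering[best].append(j)
    return clustering
```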
3.4 Cluster Sorting
Clusters are sorted by a score defined as the average similarity between the bags of words bi of the search results ri in cluster Cj and the corresponding sense cluster Sj. The similarity function sim is the same as that stated in Equation (11) and defined in Section 3.3.
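In symbols, the score just described for a cluster Cj is:

```latex
\mathrm{score}(C_j) \;=\; \frac{1}{|C_j|} \sum_{r_i \in C_j} \mathrm{sim}(b_i, S_j)
```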
Finally, we rank the elements ri within each cluster Cj by their similarity sim(bi, Sj). We note that the ranking and optimality of clusters can be improved with more sophisticated techniques (e.g., Crabtree, Gao, and Andreae 2005; Kurland 2008; Kurland and Domshlak 2008; Lee, Croft, and Allan 2008). This is beyond the scope of this article, however.
4. In Vivo Experiments: Web Search Result Clustering
We now present two extrinsic experiments aimed at determining the impact of WSI when integrated into Web search result clustering. We first describe our experimental set-up (Section 4.1). Next, we present a first experiment focused on the quality of the output search result clusters (Section 4.2) and a second experiment on the degree of diversification of semantically enhanced versus non-semantic search result clustering algorithms (Section 4.3).
4.1 Experimental Set-up
4.1.1 Lexicon
4.1.2 Corpora
To calculate the co-occurrence strength between words we need a large corpus to extract co-occurrence counts and calculate the Dice values (cf. Equation (1)). To this end we performed separate experiments on two different corpora and constructed the corresponding co-occurrence databases:
Google Web1T (Brants and Franz 2006): This corpus is a large collection of n-grams (n = 1, …, 5)—namely, windows of n consecutive tokens—occurring in one terabyte of Web documents as collected by Google. We consider all the co-occurrences for lemmas which appear in the same n-gram (we applied the WordNet lemmatizer to obtain the canonical form of any word sequence).
ukWaC (Ferraresi et al. 2008): This corpus was constructed by crawling the .uk domain and obtaining a large sample of Web pages that were automatically part-of-speech tagged using the TreeTagger tool. For this corpus we considered all the co-occurrences of WordNet lemmas that appear in the same sentence.
We selected these two corpora for their very different natures, namely: Google Web1T is a very large corpus, but with very narrow contexts (5-grams) with a minimum occurrence frequency; ukWaC represents a smaller portion of the Web, but with larger contexts. This enabled us to observe the behavior of WSI algorithms when co-occurrences were extracted from different kinds of textual source. In Table 5 we show examples of the contexts available in the two corpora for the same word (i.e., lion) and the content words that are found to co-occur with it (shown in italics in Table 5).
| corpus | context example |
|---|---|
| Web1T | roar of the lion in |
| ukWaC | Wilson's Zoo and its sad lion had given way to the brave attempt to create an early “Safari Park”. |
4.1.3 Tuning Set
Given that our graph construction step and our WSI algorithms have parameters, we created a data set to perform tuning. In order to fix the parameter values independently of our application we created this data set by means of pseudowords (Schütze 1992; Yarowsky 1993). A pseudoword is an ambiguous artificial word created by concatenating two or more monosemous words. Each monosemous word represents a meaning of the pseudoword. For example, given the words pizza and blog we can create the pseudoword pizza*blog. The list of pseudowords we used is reported in Table 6.
| # | pseudoword |
|---|---|
| 1 | pizza*blog |
| 2 | banana*plush |
| 3 | kalashnikov*mollusk*sky |
| 4 | hurricane*glue*modem |
| 5 | pistol*stair*yacht*semantics |
| 6 | potassium*razor*walrus*calendula |
| 7 | monarchy*archery*google*locomotive*beach |
| 8 | hyena*helium*soccer*ukulele*wife |
| 9 | human*orchid*candela*colosseum*movie*guitar |
| 10 | journey*harmonica*vine*mustache*rhino*police |
| 11 | glossary*river*dad*kitchen*aikido*geranium*italy |
| 12 | microbe*hug*ship*skull*beer*giraffe*mathematics |
The powerful property of pseudowords is that they enable the automatic construction of sense-tagged corpora with virtually no effort. In fact, we automatically created our tuning data set as follows:
1. First, we collected the top 100 results retrieved by Yahoo! for each meaning (i.e., monosemous word) of the pseudoword (e.g., pizza and blog for pizza*blog).
2. We created a set of 100 snippets for the “pseudoword” query (e.g., pizza*blog) by selecting snippets from each meaning of the pseudoword in a number that was proportional to their total occurrence count. For instance, if pizza and blog occur, respectively, 73,000 and 27,000 times in the reference corpus (e.g., ukWaC), we selected 73 snippets from pizza and 27 from blog. As a result we simulated the distribution of the two senses of the pseudoword within the retrieved snippets.
3. Finally, within each of the 100 snippets, we replaced each monosemous word occurrence (e.g., pizza and blog) with the pseudoword itself (i.e., pizza*blog). As a result we obtained a set of 100 snippets for each ambiguous pseudoword.
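A minimal sketch of this construction (the helper below is hypothetical; simple rounding may make the per-sense snippet counts sum to slightly more or less than 100):

```python
import random

def build_pseudoword_instance(monosemous_words, snippets_per_word,
                              corpus_counts, total=100):
    """Hedged sketch of the tuning-set construction of Section 4.1.3.
    snippets_per_word maps each monosemous word to its top-100 snippets;
    corpus_counts maps it to its frequency in the reference corpus."""
    pseudoword = "*".join(monosemous_words)
    total_count = sum(corpus_counts[w] for w in monosemous_words)
    dataset = []
    for w in monosemous_words:
        # number of snippets proportional to the word's corpus frequency
        k = round(total * corpus_counts[w] / total_count)
        for snippet in random.sample(snippets_per_word[w],
                                     min(k, len(snippets_per_word[w]))):
            # replace the monosemous word with the pseudoword itself
            dataset.append((snippet.replace(w, pseudoword), w))
    return pseudoword, dataset
```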
4.1.4 Parameters
We used our tuning set to select, first, the optimal values of the parameters needed to perform graph construction, and, second, to choose the parameter values specific to each graph-based WSI algorithm. To find the best configurations we performed tuning by combining the three evaluation measures of Adjusted Rand Index, Jaccard Index, and F1 (introduced in Section 4.2.1).
Graph construction. Because all our WSI algorithms draw on the co-occurrence graph, we first tuned the parameters for graph construction for each of the two corpora (cf. Section 3.2.1), namely: the maximum length of the compounds extracted from the corpus (φ), the minimum number of co-occurrences (δ) and minimum Dice value (δ′) for vertex addition, the minimum weight for a graph edge (θ), and whether vertices are added using first- or second-order co-occurrences. In Table 7 we show the values for these parameters that optimize the performance of each WSI algorithm on the two corpora.11 In all our runs we used Word Overlap as the similarity measure for Web search result clustering.
We observed that the optimal values for many of the parameters used for graph construction were stable across algorithms, whereas they changed across corpora due to the different scales of the two corpora. Instead, the maximum compound length and the co-occurrence order were fixed for all configurations. For the former we observed no performance increase with longer compound lengths. For the latter we found negligible improvements with second-order co-occurrences, at the cost, however, of increasing the size of the resulting graph exponentially. Given the large number of experiments that would be involved, we decided to avoid this additional workload and use first-order co-occurrences in all our experiments.
WSI algorithms. Next, for each graph-based WSI algorithm, we kept the given optimal values fixed for building the co-occurrence graphs for the tuning set queries, while varying the parameter values of the WSI algorithm, using Word Overlap as similarity measure for Web search result clustering. In Table 8 we show the optimal values for each algorithm when using Web1T (third column) and ukWaC (fourth column) to build the co-occurrence graph. Chinese Whispers is not shown as it is parameter-free (cf. Section 3.2.2).
| algorithm | parameter | Web1T | ukWaC |
|---|---|---|---|
| Curvature | removal threshold (σ) | 0.25 | 0.35 |
| SquaT++ | vertex removal threshold (σ) | 0.07 | 0.2 |
| | edge removal threshold (σ) | 0.2 | 0.25 |
| B-MST | number of clusters (N) | 4 | 4 |
| HyperLex | min hub degree (σ) | 0.05 | 0.06 |
| | min edge weight (σ′) | 0.004 | 0.01 |
For SquaT++, together with the σ threshold, we also tuned the three coefficient values α, β, and γ, that is, we needed to find the best values for the coefficients in Equation (8). The optimal coefficient combinations are shown in Table 9 for SquaT++ on vertices and edges, when using the two corpora for graph construction. The values indicate that all the three graph patterns provide a positive contribution to the algorithm's performance, with the same coefficients for SquaT++ on vertices and edges. Interestingly, we observe that, whereas the contribution of triangles (weighted by α) is the same across corpora, the respective weights of squares (β) and diamonds (γ) are flipped. After inspection we found that the graphs obtained with Web1T are less interconnected than those produced with ukWaC. Consequently, diamonds are sparser but more reliable in the Web1T setting, whereas they are much more frequent, and thus noisier, in the ukWaC setting.
4.1.5 Test Sets
We conducted our in vivo experiments on two test sets of ambiguous queries:
- •
AMBIENT (AMBIguous ENTries), a data set that contains 44 ambiguous queries.12 The sense inventory for the meanings (i.e., subtopics)13 of queries is given by Wikipedia disambiguation pages. For instance, given the beagle query, its disambiguation page in Wikipedia provides the meanings of dog, Mars lander, computer search service, beer brand, and so forth. The top 100 Web results of each query returned by the Yahoo! search engine were tagged with the most appropriate query senses according to Wikipedia (amounting to 4,400 sense-annotated search results). To our knowledge, this is currently the largest data set of ambiguous queries available on-line. In fact, other existing data sets, such as those from the TREC Interactive Tracks, are not focused on distinguishing the subtopics of a query.
- •
MORESQUE (MORE Sense-tagged QUEry results), a data set that we developed as an integration of AMBIENT following guidelines provided by its authors.14 Our aim was to study the behavior of Web search algorithms on queries of different lengths, ranging from one to four words; the AMBIENT data set, however, is composed mainly of one-word queries. MORESQUE provides dozens of queries of length 2, 3, and 4, together with the top 100 results from Yahoo! for each query, annotated exactly as was done for the AMBIENT data set. We retained Yahoo! as the source of search results mainly for reasons of homogeneity with AMBIENT.
Wikipedia has already been used as a sense inventory by, among others, Bunescu and Pasca (2006), Mihalcea (2007), and Gabrilovich and Markovitch (2009). Santamaría, Gonzalo, and Artiles (2010) have investigated in depth the benefit of using Wikipedia as the sense inventory for diversifying search results, showing that Wikipedia offers much more sense coverage for search results than other resources such as WordNet.
We report the statistics on the composition of the two data sets in Table 10. Given that a snippet can be annotated with more than one Wikipedia subtopic, we also determined the average number of subtopics per snippet: 1.01 for AMBIENT and 1.04 for MORESQUE, considering only snippets with at least one subtopic annotation. We can thus conclude that multiple subtopic annotations are infrequent. Finally, we analyzed how the different subtopics are distributed over the snippet results for each query. To do this we calculated the standard deviation of the subtopic population for each individual query, which we show in Figure 5. We observed a considerable difference between the standard deviations of shorter and longer queries (i.e., between those from the AMBIENT data set [from 1 to 44 in the figure] and those from the MORESQUE data set [from 45 to 158]). We further calculated the average standard deviation over the two data sets' queries, obtaining 6.5 for AMBIENT and 13.1 for MORESQUE. We therefore expect that the longer the query, the more unbalanced the distribution of its subtopics over the top-ranking results.
Table 10. Composition of the AMBIENT and MORESQUE data sets: number of queries, breakdown by query length (in words), and average number of subtopics per query.

| Data set | Queries | Length 1 | Length 2 | Length 3 | Length 4 | Avg. subtopics |
|---|---|---|---|---|---|---|
| AMBIENT | 44 | 35 | 6 | 3 | 0 | 17.9 |
| MORESQUE | 114 | 0 | 47 | 36 | 31 | 6.6 |
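The per-snippet and per-query statistics just reported can be reproduced with a few lines of code. The following is a minimal sketch in which the data layout (a mapping from each query to the subtopic labels of its annotated snippets) is hypothetical and serves only to illustrate the computation:

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical layout: query -> list of subtopic-label lists, one per annotated snippet.
annotations = {
    "beagle": [["dog"], ["dog"], ["Mars lander"], ["dog", "beer brand"]],
    # ... one entry per query in AMBIENT/MORESQUE
}

# Average number of subtopics per snippet (only snippets with >= 1 annotation).
labelled = [len(labels) for snippets in annotations.values()
            for labels in snippets if labels]
print("avg subtopics per snippet:", mean(labelled))

# Standard deviation of the subtopic population of each query
# (how unevenly the query's snippets are spread over its subtopics).
for query, snippets in annotations.items():
    counts = Counter(label for labels in snippets for label in labels)
    print(query, pstdev(counts.values()) if len(counts) > 1 else 0.0)
```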
In line with previous experiments on search result clustering, our data set does not contain monosemous queries, for two reasons: (i) we are interested in queries with multiple meanings, and (ii) monosemous queries would artificially inflate performance, because no diversification is needed for them.
4.1.6 Systems
We performed a comparison of our semantically enhanced search result clustering systems with nonsemantic ones.
Semantically enhanced systems. We integrated our graph-based WSI algorithms (Curvature, SquaT++, B-MST, HyperLex, and Chinese Whispers; cf. Section 3.2) into our search result clustering framework. We tested each algorithm when combined with any of the snippet-to-sense similarity measures introduced in Section 3.3.
Nonsemantic systems. We compared our semantically enhanced systems with four Web clustering engines, namely:
Lingo (Osinski and Weiss 2005): A Web clustering engine implemented in the Carrot open-source framework15 that clusters the most frequent phrases extracted using suffix arrays.
Suffix Tree Clustering (STC) (Zamir and Etzioni 1998): The original Web search clustering approach based on suffix trees.
KeySRC (Bernardini, Carpineto, and D'Amico 2009): A state-of-the-art Web clustering engine built on top of STC with part-of-speech pruning and dynamic selection of the cut-off level of the clustering dendrogram.
Yippy16 (formerly Clusty): A state-of-the-art metasearch engine developed by Vivísimo aimed at clustering search results into meaningful topics.
For Lingo and STC we used the Carrot implementation, which we integrated into our framework; for Yippy, instead, we used the on-line output provided by the Web search engine.
4.1.7 Baselines
We compared the above systems against three baselines:
- •
Singletons: Each snippet is clustered as a separate singleton (i.e., the cardinality of the resulting clustering equals the number of search results returned for the query).
- •
All-in-one: All the snippets are clustered into a single cluster (i.e., the resulting clustering has cardinality 1).
- •
Wikipedia clustering: Given an input query q, we apply Equation (13) to match the bag of content words of each snippet against that of each Wikipedia page representing a meaning of q (we use the disambiguation page of q as its sense inventory). The snippet is then added to the cluster corresponding to the best-matching Wikipedia page. Given q, we obtain a clustering whose size is determined by the number of meanings listed in the Wikipedia disambiguation page of q (a minimal sketch of this baseline is shown after the next paragraph).
The first two baselines help us determine whether the evaluation measures have a bias towards very small (singleton) or very large (all-in-one) clusters. The third baseline, based on Wikipedia, is a tough one in that, in contrast to our systems, it relies on a predefined sense inventory (the same one used in the manual classification of the test set) to cluster the snippets. Consequently the baseline does not “induce” the senses, but simply classifies (or labels) each snippet with the best-matching Wikipedia sense of the input query.
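The following is a minimal sketch of the Wikipedia clustering baseline referenced in the list above. The simple word-overlap score stands in for Equation (13), which is not reproduced here, and the tokenization and stopword list are illustrative assumptions:

```python
from collections import defaultdict

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in"}  # illustrative only

def content_words(text):
    """Bag of content words: lowercase alphabetic tokens minus stopwords."""
    return {w for w in text.lower().split() if w.isalpha() and w not in STOPWORDS}

def wikipedia_baseline(snippets, sense_pages):
    """Assign each snippet to the best-matching Wikipedia sense page.

    snippets: list of snippet strings for the query.
    sense_pages: dict mapping a sense label (from the disambiguation page)
                 to the text of the corresponding Wikipedia page.
    Returns a clustering: sense label -> list of snippet indices.
    """
    page_bags = {sense: content_words(text) for sense, text in sense_pages.items()}
    clustering = defaultdict(list)
    for i, snippet in enumerate(snippets):
        bag = content_words(snippet)
        # Word overlap stands in for Equation (13); ties are broken arbitrarily.
        best = max(page_bags, key=lambda s: len(bag & page_bags[s]))
        clustering[best].append(i)
    return dict(clustering)
```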
4.2 Experiment 1: Evaluation of the Clustering Quality
4.2.1 Evaluation Measures
In this first experiment our goal is to evaluate the quality of the output produced by our search result clustering systems. Unfortunately, clustering evaluation is a notoriously hard problem, and one for which no unequivocal solution exists. Many evaluation measures have been proposed in the literature (Rand 1971; Zhao and Karypis 2004; Rosenberg and Hirschberg 2007; Geiss 2009, inter alia), so, in order to obtain exhaustive results, we tested three different clustering quality measures, namely, the Adjusted Rand Index, the Jaccard Index, and the F1-measure, which we introduce hereafter. Each of these measures calculates the quality of the clustering output for a given query q against the gold standard clustering for that query. We then determine the overall results on the entire set of queries Q in the test set according to a measure M by averaging the per-query values of M over all q ∈ Q.
(Adjusted) Rand Index. The Rand Index is computed from counts of snippet pairs: TP is the number of true positives (i.e., snippet pairs that are in the same cluster both in the output clustering and in the gold standard), TN is the number of true negatives (i.e., pairs that are in different clusters in both clusterings), and FP and FN are, respectively, the numbers of false positives and false negatives. For the gold standard we use the clustering induced by the sense annotations provided in our data sets for each snippet (i.e., each cluster contains the snippets manually associated with a particular Wikipedia page, that is, subtopic, of the query).
Unlike the original RI (which ranges between 0 and 1), the ARI ranges between −1 and +1 and equals 0 when the index matches its expected value under chance. Given these issues with RI, in our experiments we focus on ARI.
Jaccard Index. The JI is also computed from snippet-pair counts but, in contrast to RI (cf. Equation (17)), neither its numerator nor its denominator includes the TN term.
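For completeness, the standard pair-counting formulations consistent with the description above are as follows (the article's own equation numbering, e.g., Equation (17) for RI, is not reproduced):

```latex
% Pair-counting measures over snippet pairs, with TP, TN, FP, FN as in the text:
\mathrm{RI} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\mathrm{JI} = \frac{TP}{TP + FP + FN}, \qquad
\mathrm{ARI} = \frac{\mathrm{RI} - \mathbb{E}[\mathrm{RI}]}{\max(\mathrm{RI}) - \mathbb{E}[\mathrm{RI}]} .
```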
F1-Measure. The ARI and the JI calculate clustering quality using snippet pairs as the basic unit. A clustering can instead be evaluated by focusing on the precision of its individual clusters and the topics they recall, that is, we evaluate the output clustering in terms of its precision (P) and recall (R) against the gold standard. Precision determines how accurately the output clusters represent the topics in the gold standard, and recall measures how accurately the gold-standard topics are covered by the output clusters.
Note that, in contrast with ARI, in calculating precision and recall we do not consider untagged gold standard snippets.
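As a hedged sketch only, assuming the majority-topic formulation hinted at in the Notes (the article's exact definitions may differ in the averaging details), cluster precision can be taken as the fraction of snippets tagged with their cluster's majority topic, recall as the fraction of gold topics that are the majority topic of at least one cluster, and F1 as their harmonic mean:

```latex
% Assumed majority-topic formulation (a sketch, not the article's verbatim equations):
P = \frac{\sum_{C_j} \bigl|\{\, s \in C_j : s \text{ tagged with } \mathrm{maj}(C_j) \,\}\bigr|}{\sum_{C_j} |C_j|},
\qquad
R = \frac{\bigl|\{\, t : t = \mathrm{maj}(C_j) \text{ for some } C_j \,\}\bigr|}{\text{number of gold topics}},
\qquad
F_1 = \frac{2PR}{P + R} .
```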
4.2.2 Results and Discussion
We show the results of the WSI algorithms in Table 12. With few exceptions, the results obtained on the two corpora are comparable. SquaT++, which extends Curvature with the Square and Diamond patterns, obtains higher performance than Curvature. Although integrating three different graph patterns is beneficial, the difference between applying them to edges rather than to vertices is mostly marginal.
Table 12. Results of the WSI algorithms (ARI, JI, F1, and average number of clusters per query) for each snippet-to-sense similarity measure (WO, DO, TO), using Web1T and ukWaC for graph construction.

| Algorithm | Sim. | Web1T ARI | Web1T JI | Web1T F1 | Web1T # cl. | ukWaC ARI | ukWaC JI | ukWaC F1 | ukWaC # cl. |
|---|---|---|---|---|---|---|---|---|---|
| Curvature | WO | 67.03 | 74.10 | 58.34 | 2.3 | 64.86 | 72.74 | 58.84 | 3.5 |
| Curvature | DO | 66.88 | 73.76 | 58.67 | 2.3 | 64.02 | 71.04 | 59.85 | 3.5 |
| Curvature | TO | 67.14 | 74.04 | 58.36 | 2.3 | 65.03 | 72.46 | 58.73 | 3.5 |
| SquaT++V | WO | 69.65 | 75.69 | 59.19 | 2.1 | 69.27 | 75.55 | 59.18 | 2.3 |
| SquaT++V | DO | 69.21 | 75.45 | 59.19 | 2.1 | 68.73 | 75.14 | 59.75 | 2.3 |
| SquaT++V | TO | 69.67 | 75.69 | 59.19 | 2.1 | 69.33 | 75.55 | 59.23 | 2.3 |
| SquaT++E | WO | 69.88 | 75.82 | 59.39 | 2.7 | 69.84 | 75.74 | 59.70 | 3.9 |
| SquaT++E | DO | 69.63 | 75.74 | 60.99 | 2.7 | 69.86 | 75.35 | 63.00 | 3.9 |
| SquaT++E | TO | 69.88 | 75.83 | 59.40 | 2.7 | 69.86 | 75.70 | 59.78 | 3.9 |
| B-MST | WO | 60.76 | 71.51 | 64.56 | 5.0 | 61.15 | 72.24 | 65.57 | 5.0 |
| B-MST | DO | 66.48 | 69.37 | 64.84 | 5.0 | 67.60 | 70.02 | 67.41 | 5.0 |
| B-MST | TO | 63.17 | 71.21 | 64.04 | 5.0 | 64.18 | 71.93 | 65.46 | 5.0 |
| HyperLex | WO | 60.86 | 72.05 | 65.41 | 13.0 | 56.59 | 72.00 | 70.69 | 17.0 |
| HyperLex | DO | 66.27 | 68.00 | 71.91 | 13.0 | 65.92 | 67.31 | 76.88 | 17.0 |
| HyperLex | TO | 62.82 | 70.87 | 65.08 | 13.0 | 61.64 | 70.61 | 70.42 | 17.0 |
| Chinese Whispers | WO | 67.75 | 75.37 | 60.25 | 12.5 | 68.77 | 75.45 | 59.66 | 6.5 |
| Chinese Whispers | DO | 65.95 | 69.49 | 70.33 | 12.5 | 67.86 | 72.34 | 66.16 | 6.5 |
| Chinese Whispers | TO | 67.57 | 74.69 | 60.50 | 12.5 | 68.97 | 75.28 | 59.79 | 6.5 |
The first important insight is that the best results in Table 12 are consistent across corpora and similarity measures (i.e., WO, DO, and TO), thus showing the robustness of the WSI algorithms when co-occurrences are extracted from different textual sources. The pairwise evaluation measures (i.e., ARI and JI), however, rank the WSI algorithms differently from the F1 measure. When we focus on the pairwise evaluation measures, SquaT++ outperforms all other systems on both corpora, with Chinese Whispers ranking second, while B-MST and HyperLex obtain lower results. When we look at the precision of the output clusters and the recall of the gold-standard topics (i.e., F1), however, we observe the opposite trend: HyperLex, B-MST and, to a lesser extent, Chinese Whispers achieve the best performance, whereas Curvature and SquaT++ obtain lower F1. This is because, assuming comparable precision, producing more clusters (as HyperLex, B-MST, and Chinese Whispers do) implies more chances of obtaining higher recall, and thus of better diversifying among the topics of the retrieved search results. More specifically, B-MST and, especially, HyperLex benefit from the use of ukWaC in terms of F1 performance, with HyperLex gaining around 5% when moving from Web1T to ukWaC.
Finally, in most cases we observe negligible differences between the three different similarity measures (i.e., WO, DO, TO, cf. Section 3.3), with some exceptions concerning B-MST, HyperLex, and Chinese Whispers.
We now report the best results for our WSI algorithms in Table 13, compared against those of nonsemantic systems (i.e., Lingo, STC, and KeySRC, cf. Section 4.1.6) and our three baselines (i.e., all-in-one, singleton, and Wikipedia, cf. Section 4.1.7). For the WSI algorithms we show the results when using the WO measure, because, first, DO uses graph information and thus cannot be applied to nonsemantic systems, and, second, in most cases (as remarked earlier) there are negligible differences between the two other similarity measures (i.e., WO and TO, see Table 12).
Table 13. Best results of the WSI algorithms (using the Word Overlap similarity measure) compared with nonsemantic SRC systems and the baselines.

| System type | Algorithm | Web1T ARI | Web1T JI | Web1T F1 | Web1T # cl. | ukWaC ARI | ukWaC JI | ukWaC F1 | ukWaC # cl. |
|---|---|---|---|---|---|---|---|---|---|
| WSI-based | Curvature | 67.03 | 74.10 | 58.34 | 2.3 | 64.86 | 72.74 | 58.84 | 3.5 |
| WSI-based | SquaT++V | 69.65 | 75.69 | 59.19 | 2.1 | 69.27 | 75.55 | 59.18 | 2.3 |
| WSI-based | SquaT++E | 69.88 | 75.82 | 59.39 | 2.7 | 69.84 | 75.74 | 59.70 | 3.9 |
| WSI-based | B-MST | 60.76 | 71.51 | 64.56 | 5.0 | 61.15 | 72.24 | 65.57 | 5.0 |
| WSI-based | HyperLex | 60.86 | 72.05 | 65.41 | 13.0 | 56.59 | 72.00 | 70.69 | 17.0 |
| WSI-based | Chinese Whispers | 67.75 | 75.37 | 64.25 | 12.5 | 68.77 | 75.45 | 59.66 | 6.5 |
| SRC systems | Lingo | −0.53 | 36.36 | 16.73 | 2.0 | −0.53 | 36.36 | 16.73 | 2.0 |
| SRC systems | STC | −7.90 | 38.23 | 14.96 | 2.0 | −7.90 | 38.23 | 14.96 | 2.0 |
| SRC systems | KeySRC | 14.34 | 27.77 | 63.11 | 18.5 | 14.34 | 27.77 | 63.11 | 18.5 |
| Baselines | All-in-one | 0.00 | 47.12 | 42.40 | 1.0 | 0.00 | 47.12 | 42.40 | 1.0 |
| Baselines | Singleton | 0.00 | 0.00 | 68.17 | 100.0 | 0.00 | 0.00 | 68.17 | 100.0 |
| Baselines | Wikipedia | 13.83 | 56.02 | 14.33 | 5.7 | 13.83 | 56.02 | 14.33 | 5.7 |
Our first finding here is that WSI-based search result clustering outperforms all other approaches across all evaluation measures on the two corpora, except for KeySRC and the singleton baseline when the F1 measure is used. We note, however, that although KeySRC outperforms the WSI algorithms based on graph patterns in terms of F1, it attains very low ARI and JI results. Even worse, the singleton baseline produces trivial, meaningless clusterings, as reflected by its ARI and JI. The all-in-one baseline, instead, obtains a non-zero JI (thanks to the true positives taken into account), but again a zero ARI; further, its F1 is lower than that of the singleton baseline, because of its lower recall. The Wikipedia baseline fares well compared with the other baselines in terms of ARI and JI, but achieves lower F1, again because of low recall. Finally, KeySRC consistently outperforms the other SRC systems in terms of ARI and F1.
In order to perform a fair comparison of our systems with Yippy we used a modified version of our test set that retains only the Yahoo! results that were also returned by Yippy. The average number of results over all queries in the resulting data set is 24.4, with a minimum and maximum number of 3 and 56 results per query, respectively.
We report the results on the reduced data set in Table 14. Among the classical search result clustering systems, Yippy fares poorly in terms of ARI and JI. When we focus on the precision and recall of the output clusters, however, Yippy outperforms all other nonsemantic systems, while still lagging behind all WSI algorithms (which here use Web1T). One finding is that, even with a smaller number of snippets per query, semantic systems perform best, whereas approaches that, like KeySRC, rely on the availability of a sufficient number of snippets fall short.
Table 14. Results on the reduced data set containing only the Yahoo! results also returned by Yippy.

| System type | Algorithm | ARI | JI | F1 | # cl. |
|---|---|---|---|---|---|
| WSI-based | Curvature | 63.21 | 75.22 | 64.86 | 2.1 |
| WSI-based | SquaT++V | 64.25 | 75.29 | 64.99 | 2.0 |
| WSI-based | SquaT++E | 64.13 | 75.96 | 65.30 | 2.3 |
| WSI-based | B-MST | 53.72 | 76.51 | 70.82 | 5.0 |
| WSI-based | HyperLex | 55.93 | 76.63 | 72.04 | 7.9 |
| WSI-based | Chinese Whispers | 55.41 | 74.83 | 70.52 | 8.4 |
| SRC systems | Lingo | −1.58 | 35.65 | 17.00 | 2.0 |
| SRC systems | STC | 7.91 | 38.34 | 15.04 | 2.0 |
| SRC systems | KeySRC | −0.01 | 31.80 | 32.90 | 3.3 |
| SRC systems | Yippy | 3.80 | 14.81 | 64.52 | 14.9 |
| Baselines | All-in-one | 0.00 | 45.66 | 48.30 | 1.0 |
| Baselines | Singleton | 0.00 | 0.00 | 72.19 | 24.4† |
| Baselines | Wikipedia | 10.00 | 52.05 | 15.60 | 5.8 |
†Corresponding to the average number of snippet results in the reduced data set.
4.3 Experiment 2: Evaluation of the Clustering Diversity
4.3.1 Evaluation Measure
Whereas S-recall@K aims at determining the performance of a system at retrieving the largest number of topics for the query q in the K top-ranking results, S-precision@r quantifies the ratio of distinct subtopics covered by the minimal set of returned results for which the system achieves a specific recall r. Note that unambiguous queries would obtain S-precision@r = S-recall@K = 1 for all values of r and K.18
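The following is a sketch of the standard subtopic-retrieval formulations consistent with the description above; the article's own equations are not reproduced, and the minRank-based form of S-precision@r is an assumption. Here d1, …, dK are the top-K results, subtopics(di) the set of gold subtopics of result di, nq the number of subtopics of query q, S the system ranking, and Sopt an optimal ranking:

```latex
% Sketch of the subtopic recall/precision measures (assumed standard forms):
\text{S-recall@}K = \frac{\bigl|\bigcup_{i=1}^{K} \mathrm{subtopics}(d_i)\bigr|}{n_q},
\qquad
\text{S-precision@}r = \frac{\mathrm{minRank}(S_{\mathrm{opt}}, r)}{\mathrm{minRank}(S, r)},
```

where minRank(S, r) denotes the length of the shortest prefix of ranking S that achieves subtopic recall r.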
These two measures are only suitable, however, for systems returning ranked lists (such as Yahoo! and Essential Pages). In order to apply them to search result clustering systems, we flatten each clustering into a list of search results. To do so, given a clustering made up of clusters C1, …, Cm, we add to the initially empty list the first element19 of each cluster Cj (j = 1, …, m); we then iterate the process by selecting the second element of each cluster Cj such that |Cj| ≥ 2, and so on. The remaining elements returned by the search engine, but not included in any cluster, are appended to the bottom of the list in their original order.
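A minimal sketch of the flattening procedure just described; the cluster contents and the trailing list of unclustered results are hypothetical inputs chosen for illustration:

```python
def flatten_clustering(clusters, remaining):
    """Turn a clustering into a ranked list by round-robin selection.

    clusters: list of clusters, each a list of search results already
              sorted by relevance (cf. Section 3.4).
    remaining: results returned by the search engine but not present in
               any cluster, in their original engine order.
    """
    ranked = []
    depth = 0
    # Pick the 1st element of every cluster, then the 2nd, and so on.
    while any(depth < len(c) for c in clusters):
        for cluster in clusters:
            if depth < len(cluster):
                ranked.append(cluster[depth])
        depth += 1
    # Unclustered results are appended at the bottom in their original order.
    ranked.extend(remaining)
    return ranked

# Example: three clusters of snippet ids plus two unclustered results.
print(flatten_clustering([[1, 4, 7], [2, 5], [3]], [8, 9]))
# -> [1, 2, 3, 4, 5, 7, 8, 9]
```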
4.3.2 Results and Discussion
The results in terms of S-recall@K are shown in Table 15. The first key finding is that, independently of the corpus adopted for graph construction, each of the WSI algorithms outperforms all nonsemantic systems, including the state-of-the-art search result clustering engine (KeySRC), Essential Pages (see Section 2.4), and the Yahoo! baseline. This result provides strong evidence that inducing senses for a given ambiguous query is beneficial for the diversification of snippet results. Not all WSI algorithms perform the same, however: we observe that exploiting local graph patterns (as done by Curvature and SquaT++) typically leads to worse results than the other graph-based approaches. We do not observe substantial differences between Curvature and SquaT++ on edges and vertices. We hypothesize that the lack of significant difference in the diversification performance of the three pattern-based WSI algorithms is due to the lower number of clusters they produce (on the order of two to three clusters, cf. Table 12).
The best performance on both corpora is, instead, obtained by HyperLex, followed by B-MST and Chinese Whispers. HyperLex is more complex, however, and requires the tuning of many parameters (Agirre et al. 2006a). Interestingly, we observe that the ranking of WSI algorithms according to S-recall@K closely matches that obtained with the F1 measure for clustering quality. Finally, among the nonsemantic alternatives, Yahoo! fares well and surpasses KeySRC and Essential Pages (EP).
To get further insights into the performance of the best semantic systems, in Figure 6 we plot the values of S-recall@K for representative systems, namely, B-MST, SquaT++V, and the three nonsemantic systems. The results shown in the figure are those obtained with Web1T (top) and ukWaC (bottom). We can see that SquaT++V lags behind B-MST, especially for low values of K. As also remarked previously, Yahoo! tends to perform better than KeySRC.
As regards S-precision@r, shown in Table 16, again all WSI algorithms outperform nonsemantic systems. The general trend observed for S-recall@K is confirmed here: HyperLex generally achieves the best values of S-precision@r, with good performance for all other semantic systems. All in all, HyperLex has the best balance between recall and precision, with better diversification performance on ukWaC, and therefore looks like the most suitable choice. B-MST, however, is much simpler and requires just one parameter (i.e., the number of clusters), which can also be exploited by the user to get finer- or coarser-grained search result groups. As was previously done for S-recall@K, we also graphed the values of S-precision@r for the same representative systems in Figure 7.
5. In Vitro Experiment: Evaluating the Induced Senses
Although the primary aim of this work was to demonstrate a relevant, end-to-end application of sense discovery techniques, we performed an additional in vitro experiment aimed at verifying the quality of the discovered senses independently of the task in which they are used.
When performing in vitro evaluations, no single intrinsic measure provides a clear indication as to which algorithm performs best (Manandhar et al. 2010). In fact, some measures favor large clusters, whereas others are based on the expectation that the WSI algorithm will discover more fine-grained sense distinctions. To provide further insights into the clusters produced by our graph-based WSI algorithms, we performed a qualitative evaluation of the output clusters. To this end we randomly selected 17 queries from our query data set. For each query, we submitted in random order the output of three representative WSI algorithms on the ukWaC corpus, namely, Curvature, HyperLex, and B-MST, to five annotators.
We show an excerpt of the evaluation procedure for the query excalibur in Table 17. On the left side of the table we show an example of an anonymized set of three clusterings (i.e., one for each algorithm, shown in columns 2–4) as presented to our annotators. Each algorithm produced a group of clusters, each consisting of a set of words strictly related to the meaning conveyed by the cluster itself, as discussed in Section 3.2.2. The annotators were asked to rank the three clusterings according to their own preference (ties were allowed). On the right side of Table 17 we show an example ranking for the three clusterings. In the example, clustering B was deemed the most representative, because it better models three meanings of excalibur, namely the film-novel meaning, the sword meaning, and the hotel-casino meaning, whereas clustering A mixes the movie and casino meanings within cluster 1 and, even worse, clustering C provides just a singleton cluster.
Finally, for each query, and for the entire set of 17 queries, we calculated the average ranking obtained by each WSI algorithm. The overall results are shown in Table 18 (last row): 1.7 for HyperLex, 1.8 for B-MST, and 2.4 for Curvature. This experiment corroborates the findings obtained from our extrinsic experiments: Curvature is the worst-ranking system (probably because of the low number of induced senses), whereas HyperLex and B-MST are more apt to discriminate between the meanings of an input query. It is worth noting that the annotators often assigned the same rank to the clusters produced by B-MST and HyperLex, confirming our extrinsic finding that the two algorithms tend to have a similar behavior, compared with local graph pattern WSI.
Table 18. Average rank (1 = best, 3 = worst) assigned by the annotators to the clusterings produced by the three WSI algorithms, per query and overall.

| Query id | B-MST | HyperLex | Curvature |
|---|---|---|---|
| 1 | 2.0 | 1.0 | 3.0 |
| 2 | 1.4 | 2.0 | 2.4 |
| 3 | 2.2 | 1.6 | 2.2 |
| 4 | 1.4 | 2.4 | 2.2 |
| 5 | 2.6 | 2.0 | 1.4 |
| 6 | 1.2 | 1.8 | 3.0 |
| 7 | 2.2 | 1.4 | 2.1 |
| 8 | 2.2 | 2.4 | 1.4 |
| 9 | 1.8 | 2.4 | 1.6 |
| 10 | 1.8 | 1.6 | 2.2 |
| 11 | 1.6 | 1.4 | 3.0 |
| 12 | 1.6 | 1.8 | 2.6 |
| 13 | 1.8 | 1.2 | 3.0 |
| 14 | 1.8 | 1.2 | 3.0 |
| 15 | 1.8 | 1.8 | 2.0 |
| 16 | 1.6 | 1.4 | 3.0 |
| 17 | 1.8 | 1.2 | 3.0 |
| all | 1.8 | 1.7 | 2.4 |
6. Time Performance Analysis
Finally, because we are interested in the real-world application of the WSI techniques we have discussed, we collected statistics about the execution times of each system on the AMBIENT and MORESQUE data sets. We carried out this performance analysis on a workstation with an Intel Xeon CPU, 16 GB PC3-15000 RAM, and 1.5 TB of hard disk space, using the Sun Java 1.6 VM running on OpenSuse 11.4 (64 bit).
In the graph construction step, which is common to all WSI algorithms, most (about 80%) of the computational load is due to the interaction with the database management system (DBMS; we used MySQL 5.1), with the remaining CPU time spent populating the graph. On average, constructing a co-occurrence graph takes 10–12 seconds per query. We note, however, that our algorithms were not engineered to work in an enterprise, possibly distributed, environment with a commercial DBMS. Moreover, a fully engineered architecture might precalculate and cache the graphs for the most frequent queries, as in the sketch below.
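As a small illustration of the caching idea just mentioned, the following sketch memoizes the expensive graph-construction step; the builder function, its return type, and the cache size are hypothetical:

```python
from functools import lru_cache

def build_graph_from_dbms(query: str) -> dict:
    """Placeholder for the expensive, DBMS-backed graph construction step
    (roughly 10-12 seconds per query in our setting)."""
    return {"query": query, "nodes": [], "edges": []}

@lru_cache(maxsize=10_000)  # keep the graphs of the most frequent queries in memory
def cooccurrence_graph(query: str) -> dict:
    return build_graph_from_dbms(query)
```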
The average time performance of the WSI algorithms, including sense discovery and snippet clustering (but excluding graph construction), is shown in Table 19, expressed as the average number of seconds per query for both corpora. These numbers are compared with the time performance of the nonsemantic systems (bottom part of the table).
Table 19. Average per-query execution time (in seconds) for sense discovery and snippet clustering, excluding graph construction. SRC systems do not depend on a corpus, so a single value is reported for them.

| System type | Algorithm | Web1T | ukWaC |
|---|---|---|---|
| WSI-based systems | Curvature | 0.34 | 0.34 |
| WSI-based systems | SquaT++V | 28.98 | 14.45 |
| WSI-based systems | SquaT++E | 21.49 | 169.13 |
| WSI-based systems | B-MST | 0.24 | 0.27 |
| WSI-based systems | HyperLex | 0.16 | 0.13 |
| WSI-based systems | Chinese Whispers | 0.28 | 0.35 |
| SRC systems | Lingo | 0.27 | – |
| SRC systems | STC | 0.20 | – |
| SRC systems | KeySRC | 1.00† | – |
†Estimated by the authors of KeySRC.
We observe that, among the pattern-based algorithms, SquaT++ has a high runtime cost, due to the heavy computation of three different graph patterns. SquaT++E is particularly onerous in the presence of large numbers of edges, as is the case for ukWaC. Curvature, instead, has a lower cost, because the triangle pattern is cheaper to compute. Interestingly, the algorithms that we experimentally found to perform best (i.e., B-MST, HyperLex, and Chinese Whispers) have a much lower computational load than the graph-pattern-based algorithms. HyperLex is particularly fast, with an average time of roughly 0.1–0.2 seconds per query. Finally, we observe that the cost of the best WSI algorithms is not very far off that of the nonsemantic SRC systems.
7. Conclusions
In this article we have presented a novel approach to Web search result clustering based on the automatic discovery of word senses from raw text. Key to our approach is the idea of, first, automatically inducing senses for the target query and, second, clustering the search results based on their semantic similarity to the word senses induced.
A sizeable body of work on the benefit of word senses for Web search already exists at the intersection between lexical semantics and information retrieval. That research, however, has focused almost exclusively on classical Word Sense Disambiguation, with conflicting and often inconclusive results. In this article, instead, we provide clear evidence of the usefulness of a looser notion of sense for coping with ambiguous queries.
In fact, our experiments on data sets of queries of different lengths show that our approach outperforms all nonsemantic approaches to Web search result clustering. The main advantage of using Word Sense Induction lies in its dynamic production of word senses that cover both concepts (e.g., beagle as a specific breed of dog) and instances (e.g., beagle as a specific instance of a space lander). This is in contrast with static dictionaries such as WordNet, which are typically used in Word Sense Disambiguation and which, by their very nature, mainly encode concepts.
Not only have we shown that graph-based WSI, when applied to search result clustering, surpasses its nonsemantic alternatives, but we have also provided an end-to-end evaluation framework that enables fair comparison of WSI algorithms. As a result, we are able to overcome many of the issues with the evaluation of clustering algorithms (von Luxburg, Williamson, and Guyon 2012), including the lack of a single unbiased intrinsic measure (Manandhar et al. 2010). Moreover, new WSI algorithms can be added at any time and compared with those already integrated into the framework. Building upon this, we are currently organizing a Semeval-2013 task for the extrinsic evaluation of WSI algorithms.20 As of today, we are releasing a new data set of 114 ambiguous queries and 11,400 sense-annotated snippets.21 Given the present paucity of ambiguous query data sets available (Sanderson 2008), we hope our data set will be useful in future comparative experiments.
Thanks to its modular structure, our framework can easily be extended in many other ways, including the addition of new snippet similarity measures, text corpora, query data sets, evaluation measures, and so on. Although our graphs are centered on words (as vertices), we are also interested in testing new graph construction procedures based on the use of collocations as vertices, as done by Korkontzelos and Manandhar (2010). Furthermore, the framework is independent of the target language, in that it just requires a large-enough corpus for co-occurrence extraction in that language and some basic tools for processing text (i.e., a stopword list, a lemmatizer, and a compounder). As future work, the framework might be integrated with distributional semantics models and techniques (Baroni and Lenci 2010; Erk, Padó, and Padó 2010; Mitchell and Lapata 2010; Boleda, im Walde, and Badia 2012; Clarke 2012; Silberer and Lapata 2012, inter alia).
Finally we note that, although in this article our framework was applied to polysemous queries only, nothing prevents it from being used to perform experiments at different levels of sense granularity. A qualitative evaluation of preliminary experiments in aspect identification (cf. Section 2.6), which requires the detection of very fine-grained subsenses of possibly monosemous queries, showed that WSI also seems to perform well in this task. Given the high number of monosemous queries submitted to Web search engines, we believe that further investigation in this direction may well reveal additional benefits of WSI for Web Information Retrieval.
Acknowledgments
The authors gratefully acknowledge the support of the ERC Starting Grant MultiJEDI no. 259234 and the CASPUR High-Performance Computing Grants 515/2011 and 118/2012. Thanks go to Google for providing the Web1T corpus for research purposes, Claudio Carpineto and Massimiliano D'Amico for producing the output of KeySRC and Essential Pages, and Stanislaw Osinski and Dawid Weiss for their help with Lingo and STC. Additional thanks go to Jim McManus and the three anonymous reviewers for their helpful comments.
Notes
Note that we focus here on the ambiguity of queries in terms of their polysemy, rather than on the identification of aspects or subsenses of a given meaning of a query, as was done in recent work on topic identification (Wang, Chakrabarti, and Punera 2009; Xue and Yin 2011; Wu, Madhavan, and Halevy 2011). We discuss this point further in Section 2.6.
See Hitwise on 2008–2009 Google data: http://www.hitwise.com/us/press-center/press-releases/google-searches-apr-09.
Results retrieved in May 2011.
Because our application (i.e., Web search result clustering) typically deals with nominal senses, and to avoid overly large graphs, we restrict our vocabulary to nouns only.
We note that the Dice coefficient can have a probabilistic interpretation in terms of the conditional probabilities P(w|w′) and P(w′|w) or, alternatively, the joint probability P(w,w′) and the marginal probabilities P(w) and P(w′) (Smadja, McKeown, and Hatzivassiloglou 1996).
We calculate the mean cardinality of a cluster by dividing the total number of vertices in the graph by the maximum number N of clusters that we want to obtain.
To this end we used empirically chosen parameters for each WSI algorithm, while delaying the optimal choice of these parameters to the next paragraph.
In the following we use the terms subtopic and word sense interchangeably. As stated in the Introduction, this work focuses on query disambiguation along the lines of Word Sense Induction and Disambiguation (Navigli 2009), rather than aspect identification, which concerns subtle distinctions within the same meaning of a query.
The majority topic for a cluster is the topic t for which there exists the maximum number of snippets in Cj tagged with t.
Here again we focus on the polysemy of queries in the traditional (computational) linguistic sense. See Section 2.6 for a discussion.
Recall that the snippets within a cluster are sorted by relevance, cf. Section 3.4.
The data set is available at http://lcl.uniroma1.it/moresque/.
References
Author notes
Dipartimento di Informatica, Sapienza Università di Roma, Via Salaria, 113, 00198 Roma Italy. E-mail: {dimarco,navigli}@di.uniroma1.it.