Abstract
We present SimLex-999, a gold standard resource for evaluating distributional semantic models that improves on existing resources in several important ways. First, in contrast to gold standards such as WordSim-353 and MEN, it explicitly quantifies similarity rather than association or relatedness so that pairs of entities that are associated but not actually similar (Freud, psychology) have a low rating. We show that, via this focus on similarity, SimLex-999 incentivizes the development of models with a different, and arguably wider, range of applications than those which reflect conceptual association. Second, SimLex-999 contains a range of concrete and abstract adjective, noun, and verb pairs, together with an independent rating of concreteness and (free) association strength for each pair. This diversity enables fine-grained analyses of the performance of models on concepts of different types, and consequently greater insight into how architectures can be improved. Further, unlike existing gold standard evaluations, for which automatic approaches have reached or surpassed the inter-annotator agreement ceiling, state-of-the-art models perform well below this ceiling on SimLex-999. There is therefore plenty of scope for SimLex-999 to quantify future improvements to distributional semantic models, guiding the development of the next generation of representation-learning architectures.
1. Introduction
There is very little that is similar about coffee and cups. Coffee refers to a plant (a living organism) or to a hot brown (liquid) drink. In contrast, a cup is a man-made solid of broadly well-defined shape and size with a specific function relating to the consumption of liquids. Perhaps the only clear trait these concepts have in common is that they are concrete entities. Nevertheless, in what is currently the most popular evaluation gold standard for semantic similarity, WordSim(WS)-353 (Finkelstein et al. 2001), coffee and cup are rated as more “similar” than pairs such as car and train, which share numerous common properties (function, material, dynamic behavior, wheels, windows, etc.). Such anomalies also exist in other gold standards such as the MEN data set (Bruni et al. 2012a). As a consequence, these evaluations effectively penalize models for learning the evident truth that coffee and cup are dissimilar.
Although clearly different, coffee and cup are very much related. The psychological literature refers to the conceptual relationship between these concepts as association, although it has been given a range of names including relatedness (Budanitsky and Hirst 2006; Agirre et al. 2009), topical similarity (Hatzivassiloglou et al. 2001), and domain similarity (Turney 2012). Association contrasts with similarity, the relation connecting cup and mug (Tversky 1977). At its strongest, the similarity relation is exemplified by pairs of synonyms; words with identical referents.
Computational models that effectively capture similarity as distinct from association have numerous applications. Such models are used for the automatic generation of dictionaries, thesauri, ontologies, and language correction tools (Biemann 2005; Cimiano, Hotho, and Staab 2005; Li et al. 2006). Machine translation systems, which aim to define mappings between fragments of different languages whose meaning is similar, but not necessarily associated, are another established application (He et al. 2008; Marton, Callison-Burch, and Resnik 2009). Moreover, since, as we establish, similarity is a cognitively complex operation that can require rich, structured conceptual knowledge to compute accurately, similarity estimation constitutes an effective proxy evaluation for general-purpose representation-learning models whose ultimate application is variable or unknown (Collobert and Weston 2008; Baroni and Lenci 2010).
As we show in Section 2, the predominant gold standards for semantic evaluation in NLP do not measure the ability of models to reflect similarity. In particular, in both WS-353 and MEN, pairs of words with associated meaning, such as coffee and cup (rating = 6.8 on the 10-point scale), telephone and communication (7.5), or movie and theater (7.7), receive a high rating regardless of whether or not their constituents are similar. Thus, the utility of such resources to the development and application of similarity models is limited, a problem exacerbated by the fact that many researchers appear unaware of what their evaluation resources actually measure.
Although certain smaller gold standards—those of Rubenstein and Goodenough (1965) (RG) and Agirre et al. (2009) (WS-Sim)—do focus clearly on similarity, these resources suffer from other important limitations. For instance, as we show, and as is also the case for WS-353 and MEN, state-of-the-art models have reached the average performance of a human annotator on these evaluations. It is common practice in NLP to define the upper limit for automated performance on an evaluation as the average human performance or inter-annotator agreement (Yong and Foo 1999; Cunningham 2005; Resnik and Lin 2010). Based on this established principle and the current evaluations, it would therefore be reasonable to conclude that the problem of representation learning, at least for similarity modeling, is approaching resolution. However, circumstantial evidence suggests that distributional models are far from perfect. For instance, we are some way from automatically generated dictionaries, thesauri, or ontologies that can be used with the same confidence as their manually created equivalents.
Motivated by these observations, in Section 3 we present SimLex-999, a gold standard resource for evaluating the ability of models to reflect similarity. SimLex-999 was produced by 500 paid native English speakers, recruited via Amazon Mechanical Turk,2 who were asked to rate the similarity, as opposed to association, of concepts via a simple visual interface. The choice of evaluation pairs in SimLex-999 was motivated by empirical evidence that humans represent concepts of distinct part-of-speech (POS) (Gentner 1978) and conceptual concreteness (Hill, Korhonen, and Bentz 2014) differently. Whereas existing gold standards contain only concrete noun concepts (MEN) or cover only some of these distinctions via a random selection of items (WS-353, RG), SimLex-999 contains a principled selection of adjective, verb, and noun concept pairs covering the full concreteness spectrum. This design enables more nuanced analyses of how computational models overcome the distinct challenges of representing concepts of these types.
In Section 4 we present quantitative and qualitative analyses of the SimLex-999 ratings, which indicate that participants found it unproblematic to quantify consistently the similarity of the full range of concepts and to distinguish it from association. Unlike existing data sets, SimLex-999 therefore contains a significant number of pairs, such as [movie, theater], which are strongly associated but receive low similarity scores.
The second main contribution of this paper, presented in Section 5, is the evaluation of state-of-the-art distributional semantic models using SimLex-999. These include the well-known neural language models (NLMs) of Huang et al. (2012), Collobert and Weston (2008), and Mikolov et al. (2013a), which we compare with traditional vector-space co-occurrence models (VSMs) (Turney and Pantel 2010) with and without dimensionality reduction (SVD) (Landauer and Dumais 1997). Our analyses demonstrate how SimLex-999 can be applied to uncover substantial differences in the ability of models to represent concepts of different types.
Despite these differences, the models we consider each share the characteristic of being better able to capture association than similarity. We show that the difficulty of estimating similarity is driven primarily by those strongly associated pairs with a high (association) rating in gold standards such as WS-353 and MEN, but a low similarity rating in SimLex-999. As a result of including these challenging cases, together with a wider diversity of lexical concepts in general, current models achieve notably lower scores on SimLex-999 than on existing gold standard evaluations, and well below the SimLex-999 inter-human agreement ceiling.
Finally, we explore ways in which distributional models might improve on this performance in similarity modeling. To do so, we evaluate the models on the SimLex999 subsets of adjectives, nouns, and verbs, as well as on abstract and concrete subsets and subsets of more and less strongly associated pairs (Sections 5.2.2–5.2.4). As part of these analyses, we confirm the hypothesis (Agirre et al. 2009; Levy and Goldberg 2014) that models learning from input informed by dependency parsing, rather than simple running-text input, yield improved similarity estimation and, specifically, clearer distinction between similarity and association. In contrast, we find no evidence for a related hypothesis (Agirre et al. 2009; Kiela and Clark 2014) that smaller context windows improve the ability of models to capture similarity. We do, however, observe clear differences in model performance on the distinct concept types included in SimLex-999. Taken together, these experiments demonstrate the benefit of the diversity of concepts included in SimLex-999; it would not have been possible to derive similar insights by evaluating based on existing gold standards.
We conclude by discussing how observations such as these can guide future research into distributional semantic models. By facilitating better-defined evaluations and finer-grained analyses, we hope that SimLex-999 will ultimately contribute to the development of models that accurately reflect human intuitions of similarity for the full range of concepts in language.
2. Design Motivation
In this section, we motivate the design decisions made in developing SimLex-999. We begin (2.1) by examining the distinction between similarity and association. We then show that for a meaningful treatment of similarity it is also important to take a principled approach to both POS and conceptual concreteness (2.2). We finish by reviewing existing gold standards, and show that none enables a satisfactory evaluation of the capability of models to capture similarity (2.3).
2.1 Similarity and Association
The difference between association and similarity is exemplified by the concept pairs [car, bike] and [car, petrol]. Car is said to be (semantically) similar to bike and associated with (but not similar to) petrol. Intuitively, car and bike can be understood as similar because of their common physical features (e.g., wheels), their common function (transport), or because they fall within a clearly definable category (modes of transport). In contrast, car and petrol are associated because they frequently occur together in space and language, in this case as a result of a clear functional relationship (Plaut 1995; McRae, Khalkhali, and Hare 2012).
Association and similarity are neither mutually exclusive nor independent. Bike and car, for instance, are related to some degree by both relations. Because it is common in both the physical world and in language for distinct entities to interact, it is relatively easy to conceive of concept pairs, such as car and petrol, that are strongly associated but not similar. Identifying pairs of concepts for which the converse is true is comparatively more difficult. One exception is common concepts paired with low frequency synonyms, such as camel and dromedary. Because the essence of association is co-occurrence (linguistic or otherwise [McRae, Khalkhali, and Hare 2012]), such pairs can seem, at least intuitively, to be similar but not strongly associated.
To explore the interaction between the two cognitive phenomena quantitatively, we exploited perhaps the only two existing large-scale means of quantifying similarity and association. To estimate similarity, we considered proximity in the WordNet taxonomy (Fellbaum 1998). Specifically, we applied the measure of Wu and Palmer (1994) (henceforth WupSim), which approximates similarity on a [0,1] scale reflecting the minimum distance between any two synsets of two given concepts in WordNet. WupSim has been shown to correlate well with human judgments on the similarity-focused RG data set (Wu and Palmer 1994). To estimate association, we extracted ratings directly from the University of South Florida Free Association Database (USF) (Nelson, McEvoy, and Schreiber 2004). These data were generated by presenting human subjects with one of 5,000 cue concepts and asking them to write the first word that comes into their head that is associated with or meaningfully related to that concept. Each cue concept c was normed in this way by over 10 participants, resulting in a set of associates for each cue, and a total of over 72,000 (c, a) pairs. Moreover, for each such pair, the proportion of participants who produced associate a when presented with cue c can be used as a proxy for the strength of association between the two concepts.
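For concreteness, the following minimal sketch shows how a WupSim-style score can be computed with NLTK's WordNet interface; the library choice is ours and not necessarily the authors' implementation, and we take the maximum Wu–Palmer similarity over all noun-synset pairs so that the score reflects the closest pair of senses.

```python
from itertools import product

from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")


def wupsim(word1, word2):
    """Wu-Palmer similarity between two words, taken as the maximum over
    all pairs of noun synsets (i.e., the closest pair of senses)."""
    scores = [s1.wup_similarity(s2)
              for s1, s2 in product(wn.synsets(word1, pos=wn.NOUN),
                                    wn.synsets(word2, pos=wn.NOUN))]
    scores = [s for s in scores if s is not None]
    return max(scores) if scores else 0.0


print(wupsim("car", "bike"))    # similar pair: relatively high score
print(wupsim("car", "petrol"))  # associated but dissimilar: lower score
```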
By measuring WupSim between all pairs in the USF data set, we observed, as expected, a high correlation between similarity and association strength across all USF pairs (Spearman ρ = 0.65, p < 0.001). However, in line with the intuitive ubiquity of pairs such as car and petrol, of the USF pairs (all of which are associated to a greater or lesser degree) over 10% had a WupSim score of less than 0.25. These include pairs of ontologically different entities with a clear functional relationship in the world [refrigerator, food], which may be of differing concreteness [lung, disease]; pairs in which one concept is a small concrete part of a larger abstract category [sheriff, police]; pairs in a relationship of modification or subcategorization [gravy, boat]; and even those whose principal connection is phonetic [wiggle, giggle]. As we show in Section 2.2, these are precisely the sort of pairs that are not contained in existing evaluation gold standards. Table 1 lists the USF noun pairs with the lowest similarity scores overall, and also those with the largest additive discrepancy between association strength and similarity.
2.1.1 Association and Similarity in NLP
As noted in the Introduction, the similarity/association distinction is not only of interest to researchers in psychology or linguistics. Models of similarity are particularly applicable to various NLP tasks, such as lexical resource building, semantic parsing, and machine translation (Haghighi et al. 2008; He et al. 2008; Marton, Callison-Burch, and Resnik 2009; Beltagy, Erk, and Mooney 2014). Models of association, on the other hand, may be better suited to tasks such as word-sense disambiguation (Navigli 2009), and applications such as text classification (Phan, Nguyen, and Horiguchi 2008) in which the target classes correspond to topical domains such as agriculture or sport (Rose, Stevenson, and Whitehead 2002).
Much recent research in distributional semantics does not distinguish between association and similarity in a principled way (see, e.g., Reisinger and Mooney 2010b; Huang et al. 2012; Luong, Socher, and Manning 2013).3 One exception is Turney (2012), who constructs two distributional models with different features and parameter settings, explicitly designed to capture either similarity or association. Using the output of these two models as input to a logistic regression classifier, Turney predicts whether two concepts are associated, similar, or both, with 61% accuracy. However, in the absence of a gold standard covering the full range of similarity ratings (rather than a list of pairs identified as being similar or not) Turney cannot confirm directly that the similarity-focused model does indeed effectively quantify similarity.
Agirre et al. (2009) explicitly examine the distinction between association and similarity in relation to distributional semantic models. Their study is based on the partition of WS-353 into a subset focused on similarity, which we refer to as WS-Sim, and a subset focused on association, which we term WS-Rel. More precisely, WS-Sim is the union of the pairs in WS-353 judged by three annotators to be similar and the set U of entirely unrelated pairs, and WS-Rel is the union of U and pairs judged to be associated but not similar. Agirre et al. confirm the importance of the association/similarity distinction by showing that certain models perform relatively well on WS-Rel, whereas others perform comparatively better on WS-Sim. However, as shown in the following section, a model need not be an exemplary model of similarity in order to perform well on WS-Sim, because an important class of concept pair (associated but not similar entities) is not represented in this data set. Therefore the insights that can be drawn from the results of the Agirre et al. study are limited.
Several other authors touch on the similarity/association distinction in inspecting the output of distributional models (Andrews, Vigliocco, and Vinson 2009; Kiela and Clark 2014; Levy and Goldberg 2014). Although the strength of the conclusions that can be drawn from such qualitative analyses is clearly limited, there appear to be two broad areas of consensus concerning similarity and distributional models:
- Models that learn from input annotated for syntactic or dependency relations better reflect similarity, whereas approaches that learn from running-text or bag-of-words input better model association (Agirre et al. 2009; Levy and Goldberg 2014).
- Models with larger context windows may learn representations that better capture association, whereas models with narrower windows better reflect similarity (Agirre et al. 2009; Kiela and Clark 2014).
2.2 Concepts, Part-of-Speech, and Concreteness
Empirical studies have shown that the performance of both humans and distributional models depends on the POS category of the concepts learned. Gentner (2006) showed that children find verb concepts harder to learn than noun concepts, and Markman and Wisniewski (1997) present evidence that different cognitive operations are used when comparing two nouns or two verbs. Hill, Reichart, and Korhonen (2014) demonstrate differences in the ability of distributional models to acquire noun and verb semantics. Further, they show that these differences are greater for models that learn from both text and perceptual input (as with humans).
In addition to POS category, differences in human and computational concept learning and representation have been attributed to the effects of concreteness, the extent to which a concept has a directly perceptible physical referent. On the cognitive side, these “concreteness effects” are well established, even if the causes are still debated (Paivio 1991; Hill, Korhonen, and Bentz 2014). Concreteness has also been associated with differential performance in computational text-based (Hill, Kiela, and Korhonen 2013) and multi-modal semantic models (Kiela et al. 2014).
2.3 Existing Gold Standards and Evaluation Resources
For brevity, we do not exhaustively review all methods that have been used to evaluate semantic models, but instead focus on the similarity- or association-based gold standards that are most commonly applied in recent work in NLP. In each case, we consider how well the data set satisfies each of the following three criteria:
Representative. The resource should cover the full range of concepts that occur in natural language. In particular, it should include cases representing the different ways in which humans represent or process concepts, and cases that are both challenging and straightforward for computational models.
Clearly defined. In order for a gold standard to be diagnostic of how well a model can be applied to downstream applications, a clear understanding is needed of what exactly the gold standard measures. In particular, it must clearly distinguish between dissociable semantic relations such as association and similarity.
Consistent and reliable. Untrained native speakers must be able to quantify the target property consistently, without requiring lengthy or detailed instructions. This ensures that the data reflect a meaningful cognitive or semantic phenomenon, and also enables the data set to be scaled up or transferred to other languages at minimal cost and effort.
WordSim-353. WS-353 (Finkelstein et al. 2001) is perhaps the most commonly used evaluation gold standard for semantic models. Despite its name, and the fact that it is often referred to as a “similarity gold standard,”4 in fact, the instructions given to annotators when producing WS-353 were ambiguous with respect to similarity and association. Subjects were asked to:
Assign a numerical similarity score between 0 and 10 (0 = words totally unrelated, 10 = words VERY closely related) … when estimating similarity of antonyms, consider them “similar” (i.e., belonging to the same domain or representing features of the same concept), not “dissimilar”.
As we confirm analytically in Section 5.2, these instructions result in pairs being rated according to association rather than similarity.5 WS-353 consequently suffers two important limitations as an evaluation of similarity (which also afflict other resources to a greater or lesser degree):
1. Many dissimilar word pairs receive a high rating.
2. No associated but dissimilar concepts receive low ratings.
As noted in the Introduction, an arguably more serious third limitation of WS-353 is low inter-annotator agreement, and the fact that state-of-the-art models such as those of Collobert and Weston (2008) and Huang et al. (2012) reach, or even surpass, the inter-annotator agreement ceiling in estimating the WS-353 scores. Huang et al. report a Spearman correlation of ρ = 0.713 between their model output and WS-353. This is 10 percentage points higher than inter-annotator agreement (ρ = 0.611) when defined as the average pairwise correlation between two annotators, as is common in NLP work (Padó, Padó, and Erk 2007; Reisinger and Mooney 2010a; Silberer and Lapata 2014). It could be argued that a different comparison is more appropriate: Because the model is compared to the gold-standard average across all annotators, we should compare a single annotator with the (almost) gold-standard average over all other annotators. Based on this metric the average performance of an annotator on WS-353 is ρ = 0.756, which is still only marginally better than the best automatic method.6
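To make these two agreement statistics concrete, the sketch below (our own illustration, with a small hypothetical ratings matrix) computes both the average pairwise correlation between annotators and the leave-one-out correlation of each annotator with the mean of the remaining annotators.

```python
from itertools import combinations

import numpy as np
from scipy.stats import spearmanr

# rows = annotators, columns = word pairs (hypothetical toy data)
ratings = np.array([
    [7, 2, 9, 4, 6, 1],
    [8, 3, 8, 5, 5, 2],
    [6, 1, 9, 3, 7, 1],
])

# (i) average pairwise correlation between annotators
pairwise = [spearmanr(ratings[i], ratings[j]).correlation
            for i, j in combinations(range(len(ratings)), 2)]
print("average pairwise agreement:", np.mean(pairwise))

# (ii) leave-one-out: each annotator vs. the mean of all other annotators
loo = []
for i in range(len(ratings)):
    others = np.delete(ratings, i, axis=0).mean(axis=0)
    loo.append(spearmanr(ratings[i], others).correlation)
print("average leave-one-out agreement:", np.mean(loo))
```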
Thus, at least according to the established wisdom in NLP evaluation (Yong and Foo 1999; Cunningham 2005; Resnik and Lin 2010), the strength of the conclusions that can be inferred from improvements on WS-353 is limited. At the same time, however, state-of-the-art distributional models are clearly not perfect representation-learning or even similarity estimation engines, as evidenced by the fact they cannot yet be applied, for instance, to generate flawless lexical resources (Alfonseca and Manandhar 2002).
WS-Sim. WS-Sim is the set of pairs in WS-353 identified by Agirre et al. (2009) as either containing similar or unrelated (neither similar nor associated) concepts. The ratings in WS-Sim are mapped directly from WS-353, so that all concept pairs in WS-Sim that receive a high rating are associated and all pairs that receive a low rating are unassociated. Consequently, any model that simply reflects association would score highly on WS-Sim, irrespective of how well it captures similarity.
Such a possibility could be excluded by requiring models to perform well on WS-Sim and poorly on WS-Rel, the subset of WS-353 identified by Agirre et al. (2009) as containing no pairs of similar concepts. However, although this would exclude models of pure association, it would not test the ability of models to quantify the similarity of the pairs in WS-Sim. Put another way, the WS-Sim/WS-Rel partition could in theory resolve limitation (1) of WS-353 but it would not resolve limitation (2): Models are not tested on their ability to attribute low scores to associated but dissimilar pairs.
In fact, there are more fundamental limitations of WS-Sim as a similarity-based evaluation resource. It does not, strictly speaking, reflect similarity at all, since the ratings of its constituent pairs were assigned by the WS-353 annotators, who were asked to estimate association, not similarity. Moreover, it inherits the limitation of low inter-annotator agreement from WS-353. The average pairwise correlation between annotators on WS-Sim is ρ = 0.667, and the average correlation of a single annotator with the gold standard is only ρ = 0.651, both below the performance of automatic methods (Agirre et al. 2009). Finally, the small size of WS-Sim renders it poorly representative of the full range of concepts that semantic models may be required to learn.
Rubenstein & Goodenough. Prior to WS-353, the smaller RG data set, consisting of 65 pairs, was often used to evaluate semantic models. The 15 raters employed in the data collection were asked to rate the “similarity of meaning” of each concept pair. Thus RG does appear to reflect similarity rather than association. However, although limitation (1) of WS-353 is therefore avoided, RG still suffers from limitation (2): By inspection, it is clear that the low similarity pairs in RG are not associated. A further limitation is that distributional models now achieve better performance on RG (correlations of up to Pearson r = 0.86 [Hassan and Mihalcea 2011]) than the reported inter-annotator agreement of r = 0.85 (Rubenstein and Goodenough 1965). Finally, the size of RG renders it an even less comprehensive evaluation than WS-Sim.
The MEN Test Collection. A larger data set, MEN (Bruni et al. 2012a), is used in a handful of recent studies (Bruni et al. 2012b; Bernardi et al. 2013). As with WS-353, both terms similarity and relatedness are used by the authors when describing MEN, although the annotators were expressly asked to rate pairs according to relatedness.7
The construction of MEN differed from RG and WS-353 in that each pair was only considered by one rater, who ranked it for relatedness relative to 50 other pairs in the data set. An overall score out of 50 was then attributed to each pair corresponding to how many times it was ranked as more related than an alternative. However, because these rankings are based on relatedness, with respect to evaluating similarity MEN necessarily suffers from both of the limitations (1) and (2) that apply to WS-353. Further, there is a strong bias towards concrete concepts in MEN because the concepts were originally selected from those identified in an image-bank (Bruni et al. 2012a).
Synonym Detection Sets. Multiple-choice synonym detection tasks, such as the TOEFL test questions (Landauer and Dumais 1997), are an alternative means of evaluating distributional models. A question in the TOEFL task consists of a cue word and four possible answer words, only one of which is a true synonym. Models are scored on the number of true synonyms identified out of 80 questions. The questions were designed by linguists to evaluate synonymy, so, unlike the evaluations considered thus far, TOEFL-style tests effectively discriminate between similarity and association. However, because they require a zero-one classification of pairs as synonymous or not, they do not test how well models discern pairs of medium or low similarity. More generally, in opposition to the fuzzy, statistical approaches to meaning predominant in both cognitive psychology (Griffiths, Steyvers, and Tenenbaum 2007) and NLP (Turney and Pantel 2010), they do not require similarity to be measured on a continuous scale.
3. The SimLex-999 Data Set
Having considered the limitations of existing gold standards, in this section we describe the design of SimLex-999 in detail.
3.1 Choice of Concepts
Separating similarity from association. To create a test of the ability of models to capture similarity as opposed to association, we started with the ≈ 72,000 pairs of concepts in the USF data set. As the output of a free-association experiment, each of these pairs is associated to a greater or lesser extent. Importantly, inspecting the pairs revealed that a good range of similarity values are represented. In particular, there were many examples of hypernym/hyponym pairs [body, abdomen], cohyponym pairs [cat, dog], synonyms or near synonyms [deodorant, antiperspirant], and antonym pairs [good, evil]. From this cohort, we excluded pairs containing a multiple-word item [hot dog, mustard], and pairs containing a capital letter [Mexico, sun]. We ultimately sampled 900 of the SimLex-999 pairs from the resulting cohort of pairs, according to the stratification procedures outlined in the following sections.
To complement this cohort with entirely unassociated pairs, we paired up the concepts from the 900 associated pairs at random. From these random pairings, we excluded those that coincidentally occurred elsewhere in USF (and therefore had a degree of association). From the remaining pairs, we accepted only those in which both concepts had been subject to the USF norming procedure, ensuring that these non-USF pairs were indeed unassociated rather than simply not normed. We sampled the remaining 99 SimLex-999 pairs from this resulting cohort of unassociated pairs.
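The filtering and random re-pairing steps can be sketched as follows. These are hypothetical helper functions of our own (the stratified sampling by POS and concreteness described in the following paragraphs is omitted); `normed_cues` is assumed to be the set of words that served as cues in the USF norming procedure.

```python
import random


def filter_usf_pairs(usf_pairs):
    """Keep only single-word, lower-case USF pairs (hypothetical helper)."""
    return [(c, a) for c, a in usf_pairs
            if " " not in c and " " not in a      # no multi-word items
            and c.islower() and a.islower()]      # no capitalized items


def sample_unassociated(associated_pairs, usf_pairs, normed_cues, n=99, seed=0):
    """Randomly re-pair concepts from the associated pairs, excluding any
    pairing that occurs in USF (hence associated) or involves an un-normed word.
    Assumes enough candidate words are available."""
    rng = random.Random(seed)
    usf_set = {frozenset(p) for p in usf_pairs}
    words = [w for pair in associated_pairs for w in pair]
    unassociated = set()
    while len(unassociated) < n:
        w1, w2 = rng.sample(words, 2)
        if w1 == w2 or frozenset((w1, w2)) in usf_set:
            continue
        if w1 not in normed_cues or w2 not in normed_cues:
            continue  # both words must have been normed as USF cues
        unassociated.add(frozenset((w1, w2)))
    return [tuple(p) for p in unassociated]
```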
POS category. In light of the conceptual differences outlined in Section 2.2, SimLex-999 includes subsets of pairs from the three principal meaning-bearing POS categories: nouns, verbs, and adjectives. To classify potential pairs according to POS, we counted the frequency with which the items in each pair occurred with the three possible tags in the POS-tagged British National Corpus (Leech, Garside, and Bryant 1994). To minimize POS ambiguity, which could lead to inconsistent ratings, we excluded pairs containing a concept with lower than 75% tendency towards one particular POS. This yielded three sets of potential pairs: [A,A] pairs (of two concepts whose majority tag was Adjective), [N,N] pairs, and [V,V] pairs.
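A minimal sketch of this POS filter, assuming per-word tag counts (e.g., derived from the tagged BNC) are available as dictionaries; the function names and data layout are our own illustration.

```python
def majority_pos(tag_counts, threshold=0.75):
    """Return the majority POS ('A', 'N' or 'V') if the word carries that
    tag at least `threshold` of the time, otherwise None.
    `tag_counts` is a dict such as {'A': 12, 'N': 980, 'V': 3}."""
    total = sum(tag_counts.values())
    if total == 0:
        return None
    pos, count = max(tag_counts.items(), key=lambda kv: kv[1])
    return pos if count / total >= threshold else None


def pos_matched_pairs(pairs, counts):
    """Keep only [A,A], [N,N] and [V,V] pairs with unambiguous POS."""
    kept = {"A": [], "N": [], "V": []}
    for w1, w2 in pairs:
        p1 = majority_pos(counts.get(w1, {}))
        p2 = majority_pos(counts.get(w2, {}))
        if p1 is not None and p1 == p2:
            kept[p1].append((w1, w2))
    return kept
```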
Given the likelihood that different cognitive operations are used in estimating the similarity between items of different POS-category (Section 2.2), concept pairs were presented to raters in batches defined according to POS. Unlike both WS-353 and MEN, pairs of concepts of mixed POS ([white, rabbit], [run, marathon]) were excluded. POS categories are generally considered to reflect very broad ontological classes (Fellbaum 1998). We thus felt it would be very difficult, or even counter-intuitive, for annotators to quantify the similarity of mixed POS pairs according to our instructions.
Concreteness. Although a clear majority of pairs in gold standards such as MEN and RG contain concrete items, perhaps surprisingly, the vast majority of adjective, noun, and verb concepts in everyday language are in fact abstract (Hill, Reichart, and Korhonen 2014; Kiela et al. 2014).8 To facilitate the evaluation of models for both concrete and abstract concept meaning, and in light of the cognitive and computational modeling differences between abstract and concrete concepts noted in Section 2.2, we aimed to include both concept types in SimLex-999.
Unlike the POS distinction, concreteness is generally considered to be a gradual phenomenon. One benefit of sampling pairs for SimLex-999 from the USF data set is that most items have been rated according to concreteness on a scale of 1–7 by at least 10 human subjects. As Figure 1 demonstrates, concreteness (as the average over these ratings) interacts with POS on these concepts: Nouns are on average more concrete than verbs, which are more concrete than adjectives. However, there is also clear variation in concreteness within each POS category. We therefore aimed to select pairs for SimLex-999 that spanned the full abstract–concrete continuum within each POS category.
After excluding any pairs that contained an item with no concreteness rating, for each potential SimLex-999 pair we considered both the concreteness of the first item and the additive difference in concreteness between the two items. This enabled us to stratify our sampling equally across four classes: (C1) concrete first item (rating > 4) with below-median concreteness difference; (C2) concrete first item (rating > 4), second item of lower concreteness and the difference being greater than the median; (C3) abstract first item (rating ≤ 4) with below-median concreteness difference; and (C4) abstract first item (rating ≤ 4) with the second item of greater concreteness and the difference being greater than the median.
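The stratification rule can be summarized with a small sketch (our own rendering of the four classes; `median_diff` is assumed to be the median concreteness difference computed over all candidate pairs).

```python
def concreteness_class(conc1, conc2, median_diff):
    """Assign a candidate pair to one of the sampling classes C1-C4, given
    the mean USF concreteness ratings (1-7 scale) of its two items."""
    diff = abs(conc1 - conc2)
    if conc1 > 4:                          # concrete first item
        if diff <= median_diff:
            return "C1"
        if conc2 < conc1:                  # second item less concrete
            return "C2"
    else:                                  # abstract first item (rating <= 4)
        if diff <= median_diff:
            return "C3"
        if conc2 > conc1:                  # second item more concrete
            return "C4"
    return None                            # pair falls outside the four classes
```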
Final sampling. From the associated (USF) cohort of potential pairs we selected 600 noun pairs, 200 verb pairs, and 100 adjective pairs, and from the unassociated (non-USF) cohort, we sampled 66 noun pairs, 22 verb pairs, and 11 adjective pairs. In both cases, the sampling was stratified such that, in each POS subset, each of the four concreteness classes C1−C4 was equally represented.
3.2 Question Design
The annotator instructions for SimLex-999 are shown in Figure 2. We did not attempt to formalize the notion of similarity, but rather introduce it via the well-understood idea of synonymy, and in contrast to association. Even if a formal characterization of similarity existed, the evidence in Section 2 suggests that the instructions would need separate cases to cover different concept types, increasing the difficulty of the rating task. Therefore, we preferred to appeal to intuition on similarity, and to verify post hoc that subjects were able to interpret and apply the informal characterization consistently for each concept type.
Immediately following the instructions in Figure 2, participants were presented with two “checkpoint” questions, one with abstract examples and one with concrete examples. In each case the participant was required to identify the most similar pair from a set of three options, all of which were associated, but only one of which was clearly similar (e.g., [bread, butter], [bread, toast], [stale, bread]). After this, the participants began rating pairs in groups of six or seven pairs by moving a slider, as shown in Figure 3.
This group size was chosen because the (relative) rating of a set of pairs implicitly requires pairwise comparisons between all pairs in that set. Therefore, larger groups would have significantly increased the cognitive load on the annotators. Another advantage of grouping was the clear break (submitting a set of ratings and moving to the next page) between the tasks of rating adjective, noun, and verb pairs. For better inter-group calibration, from the second group onwards the last pair of the previous group became the first pair of the present group, and participants were asked to re-assign the rating previously attributed to the first pair before rating the remaining new items.
3.3 Context-Free Rating
As with MEN, WS-353, and RG, SimLex-999 consists of pairs of concept words together with a numerical rating. Thus, unlike in the small evaluation constructed by Huang et al. (2012), words are not rated in a phrasal or sentential context. Such meaning-in-context evaluations are motivated by a desire to disambiguate words that otherwise might be considered to have multiple senses.
We did not attempt to construct an evaluation based on meaning-in-context for several reasons. First, determining the set of senses for a given word, and then the set of contexts that represent those senses, introduces a high degree of subjectivity into the design process. Second, ensuring that a model has learned a high quality representation of a given concept would have required evaluating that concept in each of its given contexts, necessitating many more cases and a far greater annotation effort. Third, in the (infrequent) case that some concept c1 in an evaluation pair (c1, c2) is genuinely (etymologically) polysemous, c2 can provide sufficient context to disambiguate c1.9 Finally, the POS grouping of pairs in the survey can also serve to disambiguate in the case that the conflicting senses of the polysemous concept are of differing POS categories.
3.4 Questionnaire Structure
Each participant was asked to rate 20 groups of pairs on a 0–6 scale of integers (nonintegral ratings were not possible). Checkpoint multiple-choice questions were inserted at points between the 20 groups in order to ensure the participant had retained the correct notion of similarity. In addition to the checkpoint of three noun pairs presented before the first group (which contained noun pairs), checkpoint questions containing adjective pairs were inserted before the first adjective group and checkpoints of three verb pairs were inserted before the first verb group.
From the 999 evaluation pairs, 14 noun pairs, 4 verb pairs, and 2 adjective pairs were selected as a consistency set. The data set of pairs was then partitioned into 10 tranches, each consisting of 119 pairs, of which 20 were from the consistency set and the remaining 99 unique to that tranche. To reduce workload, each annotator was asked to rate the pairs in a single tranche only. The tranche itself was divided into 20 groups, with each group consisting of 7 pairs (with the exception of the last group of the 20, which had 6). Of these seven pairs, the first pair was the last pair from the previous group, and the second pair was taken from the consistency set. The remaining pairs were unique to that particular group and tranche. The design enabled control for possible systematic differences between annotators and tranches, which could be detected by variation on the consistency set.
3.5 Participants
Five hundred residents of the United States were recruited from Amazon Mechanical Turk, each with at least 95% approval rate for work on the Web service. Each participant was required to check a box confirming that he or she was a native speaker of English and warned that work would be rejected if the pattern of responses indicated otherwise. The participants were distributed evenly to rate pairs in one of the ten question tranches, so that each pair was rated by approximately 50 subjects. Participants took between 8 and 21 minutes to rate the 119 pairs across the 20 groups, together with the checkpoint questions.
3.6 Post-Processing
In order to correct for systematic differences in the overall calibration of the rating scale between respondents, we measured the average (mean) response of each rater on the consistency set. For 32 respondents, the absolute difference between this average and the mean of all such averages was greater than 1 (though never greater than 2); that is, 32 respondents demonstrated a clear tendency to rate pairs as either more or less similar than the overall rater population. To correct for this bias, we increased (or decreased) the rating of such respondents for each pair by one, except in cases where they had given the maximum rating, 6 (or minimum rating, 0). This adjustment, which ensured that the average response of each participant was within one of the mean of all respondents on the consistency set, resulted in a small increase to the inter-rater agreement on the data set as a whole.
After controlling for systematic calibration differences, we imposed three conditions for the responses of a rater to be included in the final data collation. First, the average pairwise Spearman correlation of a participant's responses with all other responses could not be more than one standard deviation below the mean of all such averages. Second, the increase in overall inter-rater agreement obtained by excluding a rater had to be smaller than the corresponding increase for at least 50 other raters (i.e., the 10% of raters whose exclusion most increased agreement were excluded on this criterion). Third, we excluded the six participants who got one or more of the checkpoint questions wrong. A total of 99 participants were excluded based on one or more of these conditions, but no more than 16 from any one tranche (so that each pair in the final data set was rated by a minimum of 36 raters). Finally, we computed average (mean) scores for each pair, and transformed all scores linearly from the interval [0, 6] to the interval [0, 10].
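A sketch of the calibration correction described in the previous paragraph and of the final rescaling, as our own illustration (the rater-exclusion criteria are omitted); `consistency_idx` is assumed to index the columns belonging to the consistency set.

```python
import numpy as np


def calibrate(rater_scores, consistency_idx):
    """Shift a rater's 0-6 ratings by +/-1 if their mean on the consistency
    set deviates from the population mean by more than 1 (sketch)."""
    scores = np.asarray(rater_scores, dtype=float)
    rater_means = scores[:, consistency_idx].mean(axis=1)
    grand_mean = rater_means.mean()
    adjusted = scores.copy()
    for i, m in enumerate(rater_means):
        if m - grand_mean > 1:       # rates too high: decrease, but not below 0
            adjusted[i] = np.where(adjusted[i] > 0, adjusted[i] - 1, adjusted[i])
        elif grand_mean - m > 1:     # rates too low: increase, but not above 6
            adjusted[i] = np.where(adjusted[i] < 6, adjusted[i] + 1, adjusted[i])
    return adjusted


def final_scores(adjusted):
    """Average over raters and rescale linearly from [0, 6] to [0, 10]."""
    return adjusted.mean(axis=0) * 10.0 / 6.0
```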
4. Analysis of the Data Set
In this section we analyze the responses of the SimLex-999 annotators and the resulting ratings. First, by considering inter-annotator agreement, we examine the consistency with which annotators were able to apply the characterization of similarity outlined in the instructions across the range of concept types in SimLex-999. Second, we verify that a valid notion of similarity was understood by the annotators, in that they were able to accurately separate similarity from association.
4.1 Inter-Annotator Agreement
As in previous annotation or data collection for computational semantics (Padó, Padó, and Erk 2007; Reisinger and Mooney 2010a; Silberer and Lapata 2014) we computed the inter-rater agreement as the average of pairwise Spearman ρ correlations between the ratings of all respondents. Overall agreement was ρ = 0.67. This compares favorably with the agreement on WS-353 (ρ = 0.61 using the same method). The design of the MEN rating system precludes a conventional calculation of inter-rater agreement (Bruni et al. 2012b). However, two of the creators of MEN who independently rated the data set achieved an agreement of ρ = 0.68.
The SimLex-999 inter-rater agreement suggests that participants were able to understand the (single) characterization of similarity presented in the instructions and to apply it to concepts of various types consistently. This conclusion was supported by inspection of the brief feedback offered by the majority of annotators in a final text field in the questionnaire: 78% commented that the test was clear, easy to complete, or expressed some similar sentiment.
Interestingly, as shown in Figure 4 (left), agreement was not uniform across the concept types. Contrary to what might be expected given established concreteness effects (Paivio 1991), we observed not only higher inter-rater agreement but also less per-pair variability for abstract rather than concrete concepts.11
Strikingly, the highest inter-rater consistency and the lowest per-pair variation (measured as the standard deviation of all ratings for that pair) were observed on adjective pairs. Although we are unsure exactly what drives this effect, a possible cause is that many pairs of adjectives in SimLex-999 cohabit a single salient, one-dimensional scale (freezing > cold > warm > hot). This may be a consequence of the fact that many pairs in SimLex-999 were selected (from USF) to have a degree of association. On inspection, pairs of nouns and verbs in SimLex-999 do not appear to occupy scales in the same way, possibly because concepts of these POS categories come to be associated via a more diverse range of relations. It seems plausible that humans are able to estimate the similarity of scale-based concepts more consistently than pairs of concepts related in a less uni-dimensional fashion.
Regardless of cause, however, the high agreement on adjectives is a satisfactory property of SimLex-999. Adjectives exhibit various aspects of lexical semantics that have proved challenging for computational models, including antonymy, polarity (Williams and Anand 2009), and sentiment (Wiebe 2000). To approach the high level of human confidence on the adjective pairs in SimLex-999, it may be necessary to focus particularly on developing automatic ways to capture these phenomena.
4.2 Response Validity: Similarity not Association
Inspection of the SimLex-999 ratings indicated that pairs were indeed evaluated according to similarity rather than association. Table 2 includes examples that demonstrate a clear dissociation between the two semantic relations.
To verify this effect quantitatively, we recruited 100 additional participants to rate the WS-353 pairs, but following the SimLex-999 instructions and question format. As shown in Figure 5(a), there were clear differences between these new ratings and the original WS-353 ratings. In particular, a high proportion of pairs was given a lower rating by subjects following the SimLex-999 instructions than those following the WS-353 guidelines: The mean SimLex rating was 4.07 compared with 5.91 for WS-353.
This was consistent with our expectations that pairs of associated but dissimilar concepts would receive lower ratings based on the SimLex-999 than on the WS-353 instructions, whereas pairs that were both associated and similar would receive similar ratings in both cases. To confirm this, we compared the WS-353 and SimLex-999-based ratings on the subsets WS-Rel and WS-Sim, which were hand-sorted by Agirre et al. (2009) to include pairs connected by association (and not similarity) and those connected by similarity (but possibly also association), respectively.
As shown in Figure 5(b–c), the correlation between the SimLex-999-based and WS-353 ratings was notably higher (ρ = 0.73) on the WS-Sim subset than the WS-Rel subset (ρ = 0.38). Specifically, the tendency of subjects following the SimLex-999 instructions to assign lower ratings than those following the WS-353 instructions was far more pronounced for pairs in WS-Sim (Figure 5(b)) than for those in WS-Rel (Figure 5(c)). This observation suggests that the associated but dissimilar pairs in WS-353 were an important driver of the overall lower mean for SimLex-999-based ratings, and thus provide strong evidence that the SimLex-999 instructions do indeed enable subjects to distinguish similarity from association effectively.
4.3 Finer-Grained Semantic Relations
We have established the validity of similarity as a notion understood by human raters and distinct from association. However, much theoretical semantics focuses on relations between words or concepts that are finer-grained than similarity and association. These include meronymy (a part to its whole, e.g., blade–knife), hypernymy (a category concept to a member of that category, e.g., animal–dog), and cohyponymy (two members of the same implicit category, e.g., the pair of animals dog–cat) (Cruse 1986). Beyond theoretical interest, these relations can have practical relevance. For instance, hypernymy can form the basis of semantic entailment and therefore textual inference: The proposition a cat is on the table entails that an animal is on the table precisely because of the hypernymy relation from animal to cat.
We chose not to make these finer-grained relations the basis of our evaluation for several reasons. At present, detecting relations such as hypernymy using distributional methods is challenging, even when supported by supervised classifiers with access to labeled pairs (Levy et al. 2015). Such a designation can seem to require specific world-knowledge (is a snail a reptile?), can be gradual, as evidenced by typicality effects (Rosch, Simpson, and Miller 1976), or simply highly subjective. Moreover, a fine-grained relation R will only be attested (to any degree) between a small subset of all possible word pairs, whereas similarity can in theory be quantified for any two words chosen at random. We thus considered a focus on fine-grained semantic relations to be less appropriate for a general-purpose evaluation of representation quality.
Nevertheless, post hoc analysis of the SimLex annotator responses and fine-grained relation classes, as defined by lexicographers, yields further interesting insights into the nature of both similarity and association. Of the 999 word pairs in SimLex, 382 are also connected by one of the common finer-grained semantic relations in WordNet. For each of these relations, Figure 6 shows the average similarity rating and average USF free association score for all pairs that exhibit that relation.
In cases where a relationship of hypernymy/hyponymy exists between the words in a pair (not necessarily immediate : 1_hypernym, 2_hypernym, etc.) similarity and association coincide. Hyper/hyponym pairs that are separated by fewer levels in the WordNet hierarchy are both more strongly associated and rated as more similar. However, there are also interesting discrepancies between similarity and association. Unsurprisingly, pairs that are classed as synonyms in WordNet (i.e., having at least one sense in some common synset) are rated as more similar than pairs of any other relation type by SimLex annotators. In contrast, antonyms are the most strongly associated word pairs among these finer-grained relations. Further, pairs consisting of a meronym and holonym (part and whole) are comparatively strongly associated but not judged to be similar.
The analysis also highlights a case that can be particularly problematic when rating similarity: cohyponyms, or members of the same salient category (such as knife and fork). We gave no specific guidelines for how to rate such pairs in the SimLex annotator instructions, and whether they are considered similar or not seems to be a matter of perspective. On one hand, their membership of a common category could make them appear similar, particularly if the category is relatively specific. On the other hand, in the case of knife and fork, for instance, the underlying category cutlery might provide a backdrop against which the differences of distinct members become particularly salient.
5. Evaluating Models with SimLex-999
In this section, we demonstrate the applicability of SimLex-999 by analyzing the performance of various distributional semantic models in estimating the new ratings. The models were selected to cover the main classes of representation learning architectures (Baroni, Dinu, and Kruszewski 2014): Vector space co-occurrence (counting) models and NLMs (Bengio et al. 2003). We first show that SimLex-999 is notably more difficult for state-of-the-art models to estimate than existing gold standards. We then conduct more focused analyses on the various concept subsets defined in SimLex-999, exploring possible causes for the comparatively low performance of current models and, in turn, demonstrating how SimLex-999 can be applied to investigate such questions.
5.1 Semantic Models
Collobert and Weston. Collobert and Weston (2008) train a NLM that learns word embeddings by scoring short word sequences, distinguishing sequences observed in the training corpus from sequences corrupted by substituting a randomly selected word. The relatively low-dimensional, dense (vector) representations learned by this model and the other NLMs introduced in this section are sometimes referred to as embeddings (Turian, Ratinov, and Bengio 2010). Collobert and Weston train their models on 852 million words of text from a 2007 dump of Wikipedia and the RCV1 Corpus (Lewis et al. 2004) and use their embeddings to achieve state-of-the-art results on a variety of NLP tasks. We downloaded the embeddings directly from the authors' Web page.12
Huang et al. Huang et al. (2012) present a NLM that learns word embeddings to maximize the likelihood of predicting the last word in a sentence s based on (i) the previous words in that sentence (local context, as with Collobert and Weston [2008]) and (ii) the document d in which that word occurs (global context). As with Collobert and Weston (2008), the model represents input sentences as a matrix of word embeddings. In addition, it represents documents in the input corpus as single-vector averages over all word embeddings in that document. It can then compute scores g(s, d) and g(sw, d), where, as before, sw is a sentence whose last word has been replaced by an “incorrect” randomly selected word, and the embeddings are trained so that g(s, d) exceeds g(sw, d).
The combination of local and global contexts in the objective encourages the final word embeddings to reflect aspects of both the meaning of nearby words and of the documents in which those words appear. When learning from 990M words of Wikipedia text, Huang et al. report a Spearman correlation of ρ = 0.713 between the cosine similarity of their model embeddings and the WS-353 scores, which constitutes state-of-the-art performance for a NLM model on that data set. We downloaded these embeddings from the authors' Web page.13
Mikolov et al. Mikolov et al. (2013a) present an architecture that learns word embeddings similar to those of standard NLMs but with no nonlinear hidden layer (resulting in a simpler scoring function). This enables faster representation learning for large vocabularies. Despite this simplification, the embeddings achieve state-of-the-art performance on several semantic tasks including sentence completion and analogy modeling (Mikolov et al. 2013a, 2013b).
The model learns from a set of (target-word, context-word) pairs, extracted from a corpus of sentences as follows. In a given sentence s (of length N), for each position n ≤ N, each word wn is treated in turn as a target word. An integer t(n) is then sampled from a uniform distribution on {1, …, k}, where k > 0 is a predefined maximum context-window parameter. The pair tokens {(wn, wn+j) : −t(n) ≤ j ≤ t(n), j ≠ 0, wn+j ∈ s} are then appended to the training data. Thus, target/context training pairs are such that (i) only words within a k-window of the target are selected as context words for that target, and (ii) words closer to the target are more likely to be selected than those further away.
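The extraction procedure can be sketched as follows; this is an illustration of the sampling scheme described above, not the authors' code.

```python
import random


def training_pairs(sentence, k, rng=random.Random(0)):
    """Extract (target, context) pairs from one sentence. For each target
    position a window size t is drawn uniformly from {1, ..., k}, so words
    nearer the target are more likely to be selected as contexts."""
    pairs = []
    for n, target in enumerate(sentence):
        t = rng.randint(1, k)                 # dynamic window size
        for j in range(-t, t + 1):
            if j == 0 or not (0 <= n + j < len(sentence)):
                continue
            pairs.append((target, sentence[n + j]))
    return pairs


print(training_pairs("the cat sat on the mat".split(), k=2))
```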
As with other NLMs, Mikolov et al.'s model captures conceptual semantics by exploiting the fact that words appearing in similar linguistic contexts are likely to have similar meanings. Informally, the model adjusts its embeddings to increase the probability of observing the training corpus. Because this probability increases with p(c|w), and p(c|w) increases with the dot product of the context embedding and the target embedding, the updates have the effect of moving each target-embedding incrementally “closer” to the context-embeddings of its collocates. In the target-embedding space, this results in embeddings of concept words that regularly occur in similar contexts moving closer together.
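For reference, the conditional probability underlying these updates is the standard skip-gram softmax (written in our own notation, which is not spelled out in the text above): $v_w$ and $v'_c$ denote the target and context embeddings, and $V$ is the vocabulary.

```latex
p(c \mid w) \;=\; \frac{\exp\left(v'_c \cdot v_w\right)}
                       {\sum_{c' \in V} \exp\left(v'_{c'} \cdot v_w\right)}
```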
We use the authors' Word2vec software to train their model and use the target embeddings in our evaluations. We experimented with embeddings of dimension 100, 200, 300, 400, and 500 and found that 200 gave the best performance on both WS-353 and SimLex-999.
Vector Space Model (VSM). As an alternative to NLMs, we constructed a vector space model following the guidelines for optimal performance outlined by Kiela and Clark (2014). After extracting the 2,000 most frequent word tokens in the corpus that are not in a common list of stopwords as features, we populated a matrix of co-occurrence counts with a row for each of the concepts in some pair in our evaluation sets, and a column for each of the features. Co-occurrence was counted within a specified window size, although never across a sentence boundary. The resulting matrix was then weighted according to Pointwise Mutual Information (PMI) (Recchia and Jones 2009). The rows of the resulting matrix constitute the vector representations of the concepts.
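A compact sketch of this construction, as our own simplification: we use positive PMI and plain dictionaries rather than an explicit matrix, and the feature and target sets are assumed to have been selected beforehand.

```python
import math
from collections import Counter, defaultdict


def pmi_weights(sentences, targets, features, window=4):
    """Window-based co-occurrence counts for the target concepts, weighted
    with positive PMI. Sentence boundaries are never crossed."""
    counts = defaultdict(Counter)
    target_totals, feature_totals, total = Counter(), Counter(), 0
    for sent in sentences:
        for i, w in enumerate(sent):
            if w not in targets:
                continue
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for c in sent[lo:i] + sent[i + 1:hi]:
                if c in features:
                    counts[w][c] += 1
                    target_totals[w] += 1
                    feature_totals[c] += 1
                    total += 1
    pmi = defaultdict(dict)
    for w, ctx in counts.items():
        for c, n in ctx.items():
            val = math.log((n * total) / (target_totals[w] * feature_totals[c]))
            pmi[w][c] = max(val, 0.0)   # positive PMI
    return pmi
```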
SVD. As proposed initially in Landauer and Dumais (1997), we also experimented with models in which SVD (Golub and Reinsch 1970) is applied to the PMI-weighted VSM matrix, reducing the dimension of each concept representation to 300 (which yielded best results after experimenting, as before, with 100–500 dimension vectors).
For each model described in this section, we calculate similarity as the cosine similarity between the (vector) representations learned by that model.
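The SVD reduction and the cosine similarity used for all models can be sketched together, assuming the PMI-weighted matrix from the previous sketch has been converted to a dense NumPy array whose row order is given by `words` (both assumptions of this illustration).

```python
import numpy as np


def reduce_and_compare(matrix, words, dim=300):
    """Apply truncated SVD to a PMI-weighted co-occurrence matrix (rows =
    concepts) and return a cosine-similarity function over the reduced space."""
    U, S, _ = np.linalg.svd(matrix, full_matrices=False)
    reduced = U[:, :dim] * S[:dim]
    index = {w: i for i, w in enumerate(words)}

    def cosine(w1, w2):
        v1, v2 = reduced[index[w1]], reduced[index[w2]]
        return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))

    return cosine
```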
5.2 Results
In experimenting with different models on SimLex-999, we aimed to answer the following questions: (i) How well do the established models perform on SimLex-999 versus on existing gold standards? (ii) Are any observed differences caused by the potential of different models to measure similarity vs. association? (iii) Are there interesting differences in the ability of models to capture similarity between adjectives vs. nouns vs. verbs? (iv) If so, are the observed differences driven by concreteness and its interaction with POS, or are other factors also relevant?
Overall Performance on SimLex-999. Figure 7 shows the performance of the NLMs on SimLex-999 versus on comparable data sets, measured by Spearman's ρ correlation. All models estimate the ratings of MEN and WS-353 more accurately than SimLex-999. The Huang et al. (2012) model performs well on WS-353, but is not very robust to changes in evaluation gold standard, and performs worst of all the models on SimLex-999. Given the focus of the WS-353 ratings, it is tempting to explain this by concluding that the global context objective leads the Huang et al. (2012) model to focus on association rather than similarity. However, the true explanation may be less simple, since the Huang et al. (2012) model performs weakly on the association-based MEN data set. The Collobert and Weston (2008) model is more robust across WS-353 and MEN, but still does not match the performance of the Mikolov et al. (2013a) model on SimLex-999.
Figure 8 compares the best performing NLM model (Mikolov et al. 2013a) with the VSM and SVD models.16 In contrast to recent results that emphasize the superiority of NLMs over alternatives (Baroni, Dinu, and Kruszewski 2014), we observed no clear advantage for the NLM over the VSM or SVD when considering the association-based gold standards WS-353 and MEN together. While the NLM is the strongest performer on WS-353, SVD is the strongest performer on MEN. However, the NLM model performs notably better than the alternatives at modeling similarity, as measured by SimLex-999.
Comparing all models in Figures 7 and 8 suggests that SimLex-999 is notably more challenging to model than the alternative data sets, with correlation scores ranging from 0.098 to 0.414. Thus, even when state-of-the-art models are trained for several days on massive text corpora,17 their performance on SimLex-999 is well below the inter-annotator agreement (Figure 7). This suggests that there is ample scope for SimLex-999 to guide the development of improved models.
Modeling Similarity vs. Association. The comparatively low performance of the NLM, VSM, and SVD models on SimLex-999 compared with MEN and WS-353 is consistent with our hypothesis that modeling similarity is more difficult than modeling association. Indeed, given that many strongly associated but dissimilar pairs, such as [coffee, cup], are likely to co-occur frequently in the training data, and that all models infer connections between concepts from linguistic co-occurrence in one form or another, it seems plausible that models overestimate the similarity of such pairs because they are “distracted” by association.
To test this hypothesis more precisely, we compared the performance of models on the whole of SimLex-999 versus its 333 most associated pairs (according to the USF free association scores). Importantly, pairs in this strongly associated subset still span the full range of possible similarity scores (min similarity = 0.23 [shrink, grow], max similarity = 9.80 [vanish, disappear]).
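Reusing the `evaluate` function and `simlex` DataFrame from the earlier sketch, the subset comparison looks roughly as follows; the association column name is again an assumption about the released file.

```python
# Restrict evaluation to the 333 most strongly associated pairs (by USF score)
# and compare against the full data set. Column name is an assumption.
assoc333 = simlex.nlargest(333, "Assoc(USF)")
# rho_all, rho_assoc = evaluate(simlex, vectors), evaluate(assoc333, vectors)
```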
As shown in Figure 9, all models performed worse when the evaluation was restricted to pairs of strongly associated concepts, consistent with our hypothesis. The Collobert and Weston (2008) model was better than the Huang et al. (2012) model at estimating similarity in the face of high association. This is not entirely surprising given the global-context objective of the latter model, which may have encouraged more association-based connections between concepts. The Mikolov et al. model, however, performed notably better than both of the other NLMs. Moreover, its superiority was proportionally greater when evaluating on the most associated pairs only (as indicated by the difference between the red and gray bars), suggesting that the improvement is driven at least in part by an increased ability to “distinguish” similarity from association.
To understand better how the architecture of models captures information pertinent to similarity modeling, we performed two additional experiments using SimLex-999. These comparisons were also motivated by the hypotheses, made in previous studies and outlined in Section 2.1.2, that both dependency-informed input and smaller context windows encourage models to capture similarity rather than association.
We tested the first hypothesis using the embeddings of Levy and Goldberg (2014), whose model extends the Mikolov et al. (2013a) model so that target-context training instances are extracted based on dependency-parsed rather than simple running text. As illustrated in Figure 9, the dependency-based embeddings outperform the original (running text) embeddings trained on the same corpus. Moreover, the comparatively large increase in the red bar compared to the gray bar suggests that an important part of the improvement of the dependency-based model derives from a greater ability to discern similarity from association.
Our comparisons provided less support for the second (window size) hypothesis. As shown in Figure 10, reducing the window size from 10 to 2 yields only a negligible improvement in the model's performance. For the SVD model, however, we observed the converse: the SVD model with window size 10 slightly outperforms the SVD model with window size 2, and this advantage is quite pronounced on the most associated pairs in SimLex-999.
Learning Concepts of Different POS. Given the theoretical likelihood of variation in model performance across POS categories noted in Section 2.2, we evaluated the Mikolov et al. (2013a), VSM, and SVD models on the subsets of SimLex-999 containing adjective, noun, and verb concept pairs.
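Under the same assumptions as the earlier sketches (the `evaluate` function, the `simlex` DataFrame, a `vectors` lookup per model, and a POS column coded A/N/V in the released file), the per-POS analysis is simply:

```python
# Score each POS subset separately; the POS column and its A/N/V coding are
# assumptions about the released file.
for pos in ("A", "N", "V"):
    subset = simlex[simlex["POS"] == pos]
    print(pos, evaluate(subset, vectors))
```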
As shown in Figure 11, the analyses yield two notable conclusions. First, perhaps contrary to intuition, all models estimate the similarity of adjectives better than that of other concept categories. This aligns with the (also unexpected) observation that humans rate the similarity of adjectives more consistently, and with more agreement, than other parts of speech (see the dashed lines). Second, the parallels between human raters and the models do not extend to verbs and nouns: verb similarity is rated more consistently than noun similarity by humans, but models estimate these ratings more accurately for nouns than for verbs.
To better understand the linguistic information exploited by models when acquiring concepts of different POS, we also computed performance on the POS subsets of SimLex-999 for the dependency-based model of Levy and Goldberg (2014) and for the standard skipgram model of Mikolov et al. (2013a), in which linguistic contexts are encoded as simple bags-of-words (BOW), trained on the same Wikipedia text. As shown in Figure 12, dependency-aware contexts yield the largest improvement when capturing verb similarity. This aligns with the cognitive theory of verbs as relational concepts (Markman and Wisniewski 1997), whose meanings rely on their interaction with (or dependency on) other words or concepts. It is also consistent with research on the automatic acquisition of verb semantics, in which syntactic features have proven particularly important (Sun, Korhonen, and Krymolowski 2008). Although a deeper exploration of these effects is beyond the scope of this work, this preliminary analysis again highlights how the word classes integrated into SimLex-999 are pertinent to a range of questions concerning lexical semantics.
Learning Concrete and Abstract Concepts. Given the strong interdependence between POS and conceptual concreteness (Figure 1), we aimed to explore whether the variation in model performance across POS categories was in fact driven by an underlying effect of concreteness. To do so, we ranked each pair in the SimLex-999 data set according to the sum of the concreteness ratings of its two words, and compared the performance of models on the most concrete and least concrete quartiles of this ranking (Figure 13).
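A rough sketch of this quartile analysis, again reusing the earlier `evaluate` and `simlex` objects and assuming per-word concreteness columns in the released file:

```python
# Rank pairs by the summed concreteness of their two words and compare model
# performance on the most and least concrete quartiles. Column names assumed.
conc_sum = simlex["conc(w1)"] + simlex["conc(w2)"]
q1, q3 = conc_sum.quantile([0.25, 0.75])
most_abstract = simlex[conc_sum <= q1]
most_concrete = simlex[conc_sum >= q3]
print("abstract quartile:", evaluate(most_abstract, vectors))
print("concrete quartile:", evaluate(most_concrete, vectors))
```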
Interestingly, the performance of models on the most abstract and most concrete pairs suggests that the distinction characterized by concreteness is at least partially independent of POS. Specifically, while the Mikolov et al. model was the highest performer on all POS categories, its performance was worse than that of both the simple VSM and SVD models (with window size 10) on the most concrete concept pairs.
This finding supports the growing evidence for systematic differences in representation and/or similarity operations between abstract and concrete concepts (Hill, Kiela, and Korhonen 2013), and suggests that at least some of these concreteness effects are independent of POS. In particular, it appears that models built from underlying vectors of co-occurrence counts, such as VSMs and SVD, are better equipped to capture the semantics of concrete entities, whereas the embeddings learned by NLMs better capture abstract semantics.
6. Conclusion
Although the ultimate test of semantic models should be their utility in downstream applications, the research community can undoubtedly benefit from ways to evaluate the general quality of the representations learned by such models, prior to their integration in any particular system. We have presented SimLex-999, a gold standard resource for the evaluation of semantic representations containing similarity ratings of word pairs of different POS categories and concreteness levels.
The development of SimLex-999 was principally motivated by two factors. First, as we demonstrated, several existing gold standards measure the ability of models to capture association rather than similarity, and others do not adequately test their ability to discriminate similarity from association. This is despite the many potential applications for accurate similarity-focused representation learning models. Analysis of the ratings of the 500 SimLex-999 annotators showed that subjects can consistently quantify similarity, as distinct from association, and apply it to various concept types, based on minimal intuitive instructions.
Second, as we showed, state-of-the-art models trained solely on running-text corpora have now reached or surpassed the human agreement ceiling on WordSim-353 and MEN, the most popular existing gold standards, as well as on RG and WS-Sim. These evaluations may therefore be of limited use in guiding or moderating future improvements to distributional semantic models. Nevertheless, there is clearly still room for improvement in how distributional models perform in functional applications. We therefore consider the comparatively low performance of state-of-the-art models on SimLex-999 to be one of its principal strengths: there is clear room beneath the inter-annotator agreement ceiling for SimLex-999 to guide the development of the next generation of distributional models.
We conducted a brief exploration of how models might improve on this performance, and verified the hypothesis that models trained on dependency-parsed input capture similarity more effectively than those trained on running text. The evidence that smaller context windows also benefit similarity modeling was mixed, however: we showed that the optimal window size depends both on the general model architecture and on the part of speech and concreteness of the target concepts.
Our analysis of these hypotheses illustrates how the design of SimLex-999, covering a principled set of concept categories and including meta-information on concreteness and free-association strength, enables fine-grained analyses of the performance and parameterization of semantic models. However, these experiments only scratch the surface of the possible analyses. We hope that researchers will adopt the resource as a robust means of answering a diverse range of questions pertinent to similarity modeling, distributional semantics, and representation learning in general.
In particular, for models to learn high-quality representations for all linguistic concepts, we believe that future work must uncover ways to explicitly or implicitly infer “deeper,” more general conceptual properties such as intentionality, polarity, subjectivity, or concreteness (Gershman and Dyer 2014). However, although improving corpus-based models in this direction is certainly realistic, models that learn exclusively via the linguistic modality may never reach human-level performance on evaluations such as SimLex-999. This is because much conceptual knowledge, particularly that which underlies similarity computations for concrete concepts, appears to be grounded in the perceptual modalities as much as in language (Barsalou et al. 2003).
Whatever the means by which the improvements are achieved, accurate concept-level representation is likely to constitute a necessary first step towards learning informative, language-neutral phrasal and sentential representations. Such representations would be hugely valuable for fundamental NLP applications such as language understanding tools and machine translation.
Distributional semantics aims to infer the meaning of words based on the company they keep (Firth 1957). However, although words that occur together in text often have associated meanings, these meanings may be very similar or indeed very different. Thus, possibly excepting the population of Argentina, most people would agree that, strictly speaking, Maradona is not synonymous with football (despite their high rating of 8.62 in WordSim-353). The challenge for the next generation of distributional models may therefore be to infer what is useful from the co-occurrence signal and to overlook what is not. Perhaps only then will models capture most, or even all, of what humans know when they know how to use a language.
Notes
For instance, Huang et al. (2012, pages 1, 4, 10) and Reisinger and Mooney (2010b, page 4) refer to MEN and/or WS-353 as “similarity data sets.” Others evaluate on both these association-based and genuine similarity-based gold standards with no reference to the fact that they measure different things (Medelyan et al. 2009; Li et al. 2014).
Several papers that take a knowledge-based or symbolic approach to meaning do address the similarity/association issue (Budanitsky and Hirst 2006).
This fact is also noted by the data set authors. See www.cs.technion.ac.il/~gabr/resources/data/wordsim353/.
Individual annotator responses for WS-353 were downloaded from www.cs.technion.ac.il/~gabr/resources/data/wordsim353.
According to the USF concreteness ratings, 72% of noun or verb types in the British National Corpus are more abstract than the concept war, a concept many would already consider quite abstract.
Reported at http://clic.cimec.unitn.it/~elia.bruni/MEN. It is reasonable to assume that actual agreement on MEN may be somewhat lower than 0.68, given the small sample size and the expertise of the raters.
Per-pair variability was measured by calculating the standard deviation of responses for each pair, and averaging these scores across the pairs of each concept type.
Taken from the Python Natural Language Toolkit (Bird 2006).
This score, based on embeddings downloaded from the authors' webpage, is notably lower than the score reported in Huang et al. (2012), mentioned in Section 5.1.
We conduct this comparison on the smaller RCV1 Corpus (Lewis et al. 2004) because training the VSM and SVD models is comparatively slow.
Training times are reported by Huang et al. (2012) and, for Collobert and Weston (2008), at http://ronan.collobert.com/senna/.
References
Author notes
Computer Laboratory, University of Cambridge, UK.
Technion, Israel Institute of Technology, Haifa, Israel.