Abstract
We present AutoExtend, a system that combines word embeddings with semantic resources by learning embeddings for non-word objects like synsets and entities and learning word embeddings that incorporate the semantic information from the resource. The method is based on encoding and decoding the word embeddings and is flexible in that it can take any word embeddings as input and does not need an additional training corpus. The obtained embeddings live in the same vector space as the input word embeddings. A sparse tensor formalization guarantees efficiency and parallelizability. We use WordNet, GermaNet, and Freebase as semantic resources. AutoExtend achieves state-of-the-art performance on Word-in-Context Similarity and Word Sense Disambiguation tasks.
1. Introduction
Unsupervised methods for learning word embeddings are widely used in natural language processing (NLP). The only data these methods need as input are very large corpora. However, in addition to corpora, there are many other resources that are undoubtedly useful in NLP, including lexical resources like WordNet and Wiktionary and knowledge bases like Wikipedia and Freebase. We will simply refer to these as resources. In this article, we present AutoExtend, a method for enriching these valuable resources with embeddings for non-word objects they describe; for example, AutoExtend enriches WordNet with embeddings for synsets. The word embeddings and the new non-word embeddings live in the same vector space.
Many NLP applications benefit if non-word objects described by resources—such as synsets in WordNet—are also available as embeddings. For example, in sentiment analysis, Balamurali, Joshi, and Bhattacharyya (2011) showed the superiority of sense-based features over word-based features. Generally, the arguments for the utility of embeddings for words carry over to the utility of embeddings for non-word objects like synsets in WordNet. We demonstrate this through improved performance when AutoExtend embeddings for non-word objects are used in experiments on Word-in-Context Similarity, Word Sense Disambiguation (WSD), and several other tasks.
To extend a resource with AutoExtend, we first formalize it as a graph in which (i) objects of the resource (both word objects and non-word objects) are nodes and (ii) edges describe relations between nodes. These relations can be of an additive or a similarity nature. Additive relations capture the basic intuition of the offset calculus (Mikolov et al. 2013a) as we will discuss in detail in Section 2. Similarity relations simply define similar nodes. We then define various constraints based on these relations. For example, one of our constraints states that the embeddings of two synsets related by the similarity relation should be close. Finally, we select the set of embeddings that minimizes the learning objective.
The advantage of our approach is that it decouples (i) the learning of word embeddings on the one hand and (ii) the extension of these word embeddings to non-word objects in a resource on the other hand. If someone identifies a better way of learning word embeddings, AutoExtend can immediately extend these embeddings to similarly improved embeddings for non-word objects. We do not rely on any specific properties of word embeddings that make them usable with some resources but not with others.
The main contributions of this article are as follows. We present AutoExtend, a flexible method that extends word embeddings to embeddings of non-word objects. We demonstrate the generality and flexibility of AutoExtend by running experiments on three different resources: WordNet (Fellbaum 1998), Freebase (Bollacker et al. 2008), and GermaNet (Hamp and Feldweg 1997). AutoExtend does not require manually labeled corpora. In fact, it does not require any corpora. All we need as input is a set of word embeddings and a resource that can be formally modeled as a graph in the way described above. We show that AutoExtend achieves state-of-the-art performance on several tasks including WSD.
This article is structured as follows. In Section 2, we introduce the AutoExtend model. In Section 3, we describe the three resources we use in our experiments and how we model them. We evaluate the embeddings of word and non-word objects in Section 4 using the tasks of WSD, Entity Linking, Word Similarity, Word-in-Context Similarity, and Synset Alignment. Finally, we give an overview of related work in Section 5 and present our conclusions in Section 6.
2. Model
The graph formalization that underlies AutoExtend is based on the offset calculus introduced by Mikolov et al. (2013a). We interpret this calculus as a group theory formalization of word relations: We have a set of elements (the word embeddings) and an operation (vector addition) satisfying the axioms of a commutative group, in particular, commutativity, closure,1 associativity, and invertibility.
In addition to semantic properties like gender, the offset calculus has been applied to morphological properties (e.g., running − walking + walked = ran) and even to properties of regional varieties of English (e.g., bonnet − aubergine + eggplant = hood). We take an expansive view of what a property is and include complex properties that are captured by resources. The most important instance of this expansive view in this article is that we model a word's embedding as the sum of the embeddings of its senses. For example, the vector of the word suit is modeled as the sum of a vector representing lawsuit and a vector representing business suit. Apart from the offset calculus, this can also be motivated by the additivity that underlies many embedding learning algorithms. This is most obvious for the counts in vector space models. They are clearly additive and thus support the view of a word as the sum of its senses. To be more precise, a word is a weighted sum of its senses, where the weights represent the probability of a sense. Our model incorporates this by simply learning shorter or longer vectors.
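In equation form (our notation; a restatement of the view above, not a new modeling assumption):

```latex
\vec{w}(\mathit{suit}) = \vec{s}(\mathit{suit}_{\mathit{lawsuit}}) + \vec{s}(\mathit{suit}_{\mathit{business\ suit}}),
\qquad
\vec{w} = \sum_{j} p_j \, \vec{s}_j = \sum_{j} \tilde{\vec{s}}_j
\quad\text{with } \tilde{\vec{s}}_j = p_j \, \vec{s}_j ,
```

where p_j is the probability of sense j; the length-scaled vectors s̃_j are what the model effectively learns ("shorter or longer vectors").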
The basic idea behind AutoExtend is that it takes the embedding of an object that is a bundle of properties as input and “decodes” or “unravels” this embedding to the embeddings of these properties. For example, AutoExtend unravels the embedding of a word to the embeddings of its senses. These senses are not directly observable, so we can view them as hidden variables.
2.1 General Framework
The basic input to AutoExtend is a semantic resource represented as a graph and an embedding space given as a set of vectors. Each node in the graph is associated with a vector in a high-dimensional vector space. Nodes in the graph can have different types; for example, in WordNet, the types are word, lexeme, and synset (see Figure 1). One type is the input type. Embeddings of nodes for this type are known. Embeddings of the other types are unknown and will be learned by AutoExtend.
Concretely, to extend a resource with AutoExtend, we (i) formalize it as a graph based on the offset calculus, (ii) assign known objects an input embedding, (iii) define a learning objective on the graph, and, finally, find the set of embeddings that optimizes the learning objective.
(i) Graph formalization of resource. In our formalization of the resource as a graph, objects of the resource—both word objects and non-word objects—are nodes; some edges of the resource describe additive relations between nodes. These additive relations are the basic relations of the offset calculus between embeddings of words, on the one hand, and embeddings of constituents derived from the resource (e.g., semantic properties, morphological properties, or senses), on the other hand. More precisely, the embedding of a node x is the sum of the embeddings of all nodes yi that are connected via an edge (yi,x). Other edges of the resource describe similarity relations. One example of this is that the embeddings of two synsets related by the hyponymy relation should be close. An example of such a graph can be seen in Figure 1.
(ii) Connecting resource and word embeddings. Each node is associated with a vector. The vectors of some nodes are known and the vectors of other nodes (e.g., senses) are not known. Throughout this article a known object is a word; note that a word can also be a short phrase. An example for the embedding space we want to learn can be seen in Figure 2.
(iii) Learning objective. We define the learning objective based on various constraints. The additive relations define the topology of an autoencoder, which will result in autoencoding constraints that apply if a resource object participates in different additive relations (see next section). We also use similarity relations that are specified in the resource. Finally, we select the set of embeddings for non-word objects that minimizes the learning objective. We will assign them to those nodes in the graph (e.g., senses) that do not occur in corpora and do not have corpus-based embeddings.
We present AutoExtend in more detail in the following sections. Although we could couch the discussion in terms of generic resources, the presentation is easier to follow if a specific resource is used as an example. We will therefore use WordNet as an example resource where appropriate. We now give a brief description of those aspects of WordNet that we make use of in this article.
Words in WordNet are lemmata where a lemma is defined as a particular spelling of the base form of an inflected word form; i.e., a lemma is a sequence of letters with a particular part of speech. A lexeme pairs such a spelling with a particular meaning. A synset is a set of lexemes with the same meaning in the sense that they are interchangeable for each other in context. Thus, we can also define a lexeme as the conjunction of a word and a synset. Additive relations between lexemes and words and between lexemes and synsets correspond to a graph in which each lexeme node is connected to exactly one word node and to exactly one synset node. Additionally, two synset nodes can be connected to indicate a (dis)similarity relation holding between them, for example, hyponymy or antonymy.
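The following snippet illustrates this word/lexeme/synset structure using NLTK's WordNet interface; it is purely illustrative (NLTK ships a newer WordNet version than the WordNet 2.1 used in the experiments of this article, and the article works on the WordNet database directly).

```python
# Illustration of the graph structure described above: a lexeme (NLTK "Lemma")
# connects exactly one word to exactly one synset; synsets are additionally
# connected by similarity relations such as hyponymy.
from nltk.corpus import wordnet as wn

word = "suit"
for synset in wn.synsets(word):          # synsets the word participates in
    for lemma in synset.lemmas():        # each lemma object is a (word, synset) pair, i.e., a lexeme
        print(synset.name(), "<->", lemma.name())
    print("  hyponyms:", [h.name() for h in synset.hyponyms()])
```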
In the context of this article, word nodes are "known" in the sense that we have learned their vectors from a large corpus. Embeddings for inflected forms are not used in this paper.2 Lexeme and synset nodes are unknown because they are not directly observable in a corpus and vectors cannot be learned for them using standard embedding learning algorithms.
2.2 Additive Edges
As already mentioned, we will use WordNet as an example resource to simplify the presentation of our model. We will use the additive edges to formulate two basic premises of our model: (i) words are sums of their lexemes and (ii) synsets are sums of their lexemes. For example, the embedding of the word suit is a sum of the embeddings of its two lexemes suit(textile) and suit(law); and the embedding of the synset lawsuit-case-suit(law) is a sum of the embeddings of its three lexemes lawsuit, case(law), and suit(law) (see Figure 3). This is equivalent to saying that words split up into their lexemes and lexemes sum up to their synsets; we formalize this view in the remainder of this subsection.
Note that we allow E(i,j) < 0 and in general the distribution weights for each dimension (diagonal entries of E(i,j)) will be different. Our assumption can be interpreted as word w(i) distributing its embedding activations to its lexemes on each dimension separately.
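Written out per lexeme (our notation; the full matrix formalization appears in Section 2.4), this distribution assumption reads:

```latex
\vec{l}^{\,(i,j)} = E^{(i,j)} \, \vec{w}^{(i)},
\qquad
E^{(i,j)} = \operatorname{diag}\big(e^{(i,j)}_1, \ldots, e^{(i,j)}_n\big),
\qquad
\sum_{j} E^{(i,j)} = I_n ,
```

so that the lexeme vectors of a word sum back up to the word vector, in line with the column normalization discussed in Sections 2.7 and 2.8.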
2.3 Learning Through Autoencoding
Now we have an autoencoder where input and output layers are the word embeddings. Aligning these two layers (i.e., minimizing the difference between them) will give us the word constraint. The hidden layer represents the synset vectors. The tensors E and D have to be learned. They are rank 4 tensors of size ≈10¹⁵. However, we already discussed that they are very sparse, for two reasons: (i) We make the assumption that there is no interaction between dimensions. (ii) There are only a few interactions between words and synsets (only when a lexeme exists). In practice, there are only ≈10⁷ elements to learn, which is technically feasible.
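A minimal sketch of this encode-decode step for a single embedding dimension d, with sparse per-dimension matrices as described above (matrix names and shapes are our assumptions):

```python
import numpy as np
from scipy import sparse

def encode_decode(W_d, E_d, D_d):
    """One dimension d of the AutoExtend autoencoder (sketch).

    W_d : 1-D array with the d-th component of every word embedding (input layer).
    E_d : sparse (num_synsets x num_words) matrix; a cell is non-zero only if the
          corresponding lexeme exists, which keeps the tensors very sparse.
    D_d : sparse (num_words x num_synsets) matrix with the transposed sparsity pattern.
    Returns the synset activations (hidden layer) and the reconstructed words
    (output layer); their difference gives the word constraint.
    """
    S_d = E_d @ W_d        # encode: words -> synsets
    W_hat_d = D_d @ S_d    # decode: synsets -> words
    return S_d, W_hat_d
```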
2.4 Matrix Formalization
2.5 Lexeme Embeddings
2.6 Similarity Edges
2.7 Column Normalization
2.8 Implementation
Our training objective is the minimization of the sum of all constraints normalized by their output size, namely, the word constraint (Equation (29)) divided by the number of words, the lexeme constraint (Equation (33)) divided by the number of lexemes, and the similarity constraint (Equation (34)) divided by the number of similarities; the three normalized constraints are weighted by α, β, and 1 − α − β, respectively. The parameters α and β are tuned on development sets using a grid search with step size 0.1. To save computational cost, we explore a "lazy" approach for the column normalization constraint (Equations (35) and (36)): We start the computation with column-normalized matrices and normalize them again after each iteration (doing a gradient descent on the other three constraints) as long as the error function still decreases. When the error function starts increasing, we stop normalizing the matrices and continue with a normal gradient descent. This respects the fact that whereas E(d) and D(d) should be column normalized in theory, there are many practical issues that prevent this (e.g., out-of-vocabulary words).
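The weighted objective for one dimension can be sketched as follows; this is a simplified illustration in which R_d is assumed to contain one row per similarity edge with entries +1 and −1, so that the rows of R_d S_d measure how far related synsets are apart.

```python
import numpy as np

def autoextend_loss(W_d, W_hat_d, L_enc_d, L_dec_d, S_d, R_d, alpha, beta):
    """Weighted sum of the three normalized constraints for one embedding dimension."""
    word_c = np.mean((W_hat_d - W_d) ** 2)        # word constraint (Eq. 29)
    lexeme_c = np.mean((L_enc_d - L_dec_d) ** 2)  # lexeme constraint (Eq. 33)
    sim_c = np.mean((R_d @ S_d) ** 2)             # similarity constraint (Eq. 34)
    return alpha * word_c + beta * lexeme_c + (1 - alpha - beta) * sim_c

def renormalize_columns(M):
    """'Lazy' column normalization (dense sketch): divide each column by its sum,
    applied after each gradient step only while the loss keeps decreasing."""
    s = M.sum(axis=0, keepdims=True)
    s[s == 0] = 1.0
    return M / s
```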
The overall training objective cannot be solved analytically because it is subject to Equation (16) and Equation (25). We therefore use backpropagation. It turned out to be unnecessary to use regularization: All learned weights in the experiments presented below are in [−2,2].
3. Data
We test our framework in three different problem settings that cover three resources and two languages.
3.1 WordNet
We use publicly available 300-dimensional embeddings3 for 3,000,000 words and phrases trained on Google News, a corpus of ≈10¹¹ tokens, using word2vec continuous bag-of-words (CBOW), with a window size of 5 (Mikolov et al. 2013). Unless stated otherwise we use WordNet 2.1, as the SensEval tasks are based on this version. Many words in the word2vec vocabulary are not in WordNet, for example, inflected forms (cars) and proper nouns (Tony Blair). Conversely, many WordNet lemmata are not in the word2vec vocabulary, for example, 42 (digits were converted to 0). This results in a number of empty synsets (see Table 2). Note, however, that AutoExtend can produce embeddings for empty synsets because we also use similarity relations, not just additive relations.
| | WordNet 2.1 | ∩ w2v | GermaNet 9.0 | ∩ w2v | Freebase | ∩ w2v |
|---|---|---|---|---|---|---|
| words | 147,478 | 54,570 | 109,683 | 89,160 | ≈23,000 | 17,165 |
| synsets | 117,791 | 73,844 | 93,246 | 82,027 | e: ≈50,000,000 | 12,362 |
| | | | | | t: ≈26,000 | 3,516 |
| lexemes | 207,272 | 106,167 | 124,996 | 103,926 | ≈47,000,000 | 27,478 |
We run AutoExtend on the word2vec vectors. Our main goal is to produce compatible embeddings for lexemes and synsets. In this way, we can compute nearest neighbors across all three types, as shown in Figure 5.
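As a sketch of such a cross-type nearest-neighbor query, assuming the AutoExtend output has been saved to hypothetical word2vec-format text files alongside the publicly released Google News vectors:

```python
import numpy as np
from gensim.models import KeyedVectors

# original word embeddings plus AutoExtend output (the .txt file names are hypothetical)
words   = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
synsets = KeyedVectors.load_word2vec_format("autoextend-synsets.txt")
lexemes = KeyedVectors.load_word2vec_format("autoextend-lexemes.txt")

def neighbors(query_vec, spaces, topn=5):
    """Nearest neighbors of one vector across several embedding spaces
    (possible because all spaces share the same vector space)."""
    results = []
    for space in spaces:
        results.extend(space.similar_by_vector(query_vec, topn=topn))
    return sorted(results, key=lambda x: -x[1])[:topn]

print(neighbors(words["suit"], [words, synsets, lexemes]))
```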
3.2 GermaNet
For this set-up, we train word2vec embeddings for German using settings similar to those that were used to train the English word2vec embeddings. We use the German Wikipedia with 5 × 10⁸ tokens and preprocess them with the word2phrase tool included in word2vec twice, first with a threshold of 200 and then with a threshold of 100. After that, we run word2vec with identical settings as the downloaded word embeddings (i.e., CBOW, window size 5, minimal count 5, negative sampling 3, and hierarchical softmax off). We run 10 iterations to compensate for the smaller corpus. After that, we intersect them with words found in GermaNet 9.0. As GermaNet has the same structure as WordNet, we can directly apply AutoExtend to it. For similarity relations, we only use hypernymy and antonymy. In GermaNet, antonymy is a relationship between lexemes. To match our model, we extend it to synsets by viewing any pair of synsets as antonyms if they contain lexemes that are antonyms.
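A rough equivalent of this training set-up using gensim rather than the original word2phrase/word2vec C tools (parameter names follow gensim's API; the hyperparameter values are those stated above):

```python
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases

def train_german_embeddings(sentences):
    """sentences: list of tokenized German Wikipedia sentences (~5 x 10^8 tokens)."""
    # two word2phrase passes: threshold 200 first, then 100
    first = Phrases(sentences, threshold=200)
    second = Phrases(first[sentences], threshold=100)
    phrased = [second[first[s]] for s in sentences]

    return Word2Vec(
        phrased,
        vector_size=300,  # same dimensionality as the English embeddings
        sg=0,             # CBOW
        window=5,
        min_count=5,
        negative=3,
        hs=0,             # hierarchical softmax off
        epochs=10,        # 10 iterations to compensate for the smaller corpus
    )
```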
3.3 Freebase
Freebase contains word nodes4 (whose embeddings are known) and alias nodes and entity nodes (whose embeddings are unknown). Each entity also has one or more types (e.g., director). As we will explain subsequently, we also create type nodes and learn embeddings for them. An alias node is connected to exactly one word node and exactly one entity node. An entity node is connected to one or more type nodes.
We use the same English word embeddings as for WordNet and intersect them with words found in Freebase. A Freebase entity has one or more aliases (e.g., the entity Barack Obama has the aliases Barack Obama, President Obama, and Barack Hussein Obama). Aliases are available in different languages, but we only use English aliases. The role of synsets in WordNet corresponds to the role of entities in Freebase; the role of lexemes in WordNet corresponds to the role of aliases in Freebase (i.e., they connect words and entities). An overview is shown in Table 3. Freebase contains a large number of entities with a single alias; we exclude these because they are usually not completely modeled and contain little information.
Freebase also contains a great diversity of relations, but most of them do not fulfill the requirement of connecting similar entities. For example, the relation born-in connects a person and a city, and we do not want to align these embeddings. We therefore only use the relation same-type. There are about 26,000 types in Freebase, with different granularity, and well-modeled entities usually have several types. For example, Barack Obama has the types President-of-the-US, person, and author as well as several other types.5 For a type with n members, this would give us n² relations. This would result in a huge relation matrix that would slow down the AutoExtend computation. To address this, we add type nodes to the graph. The similarity relation same-type is only constructed between type nodes and entity nodes, but not between entity nodes and entity nodes. An added benefit is that AutoExtend also produces type embeddings; these may be useful for several tasks, for example, for entity typing (Yaghoobzadeh and Schütze 2015).
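A small sketch of this edge construction (identifiers and the dict layout are ours): rather than connecting all n² pairs of entities that share a type, each entity is connected to a single type node, which also yields type embeddings as a by-product.

```python
def same_type_edges(entities_by_type):
    """entities_by_type: dict mapping a Freebase type id to the ids of its entities."""
    edges = []
    for type_id, entity_ids in entities_by_type.items():
        for entity_id in entity_ids:
            # one similarity edge per entity, instead of one per entity pair
            edges.append((("entity", entity_id), ("type", type_id)))
    return edges
```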
4. Experiments and Evaluation
We evaluate AutoExtend embeddings on the following tasks: WSD, Entity Linking, Word-in-Context Similarity, Word Similarity, and Synset Alignment. Our results depend directly on the quality of the underlying word embeddings. We would expect even better evaluation results as word representation learning methods improve. Using a new and improved set of underlying embeddings in AutoExtend is straightforward: one simply switches the input file that contains the word embeddings.
4.1 Word Sense Disambiguation
We use IMS (It Makes Sense) for our WSD evaluation (Zhong and Ng 2010). As in the original paper, preprocessing consists of sentence splitting, tokenization, POS tagging, and lemmatization; the classifier is a linear SVM. In our experiments (Table 4), we run IMS with each feature set by itself to assess the relative strengths of individual feature sets (lines 1–7) and on feature set combinations to determine which combination is best for WSD (lines 8, 12–15). We use SensEval-2 as development set for SensEval-3 and vice versa. This gives us a weighting of α = β = 0.4 for both sets.
IMS implements three standard WSD feature sets: part of speech (POS), surrounding word, and local collocation (lines 1–3).
Let w be an ambiguous word with k senses. The three feature sets on lines 5–7 are based on the AutoExtend embeddings s(j), 1 ≤ j ≤ k, of the k synsets of w and the centroid c of the sentence in which w occurs. The centroid is simply the sum of all word2vec vectors of the words in the sentence, excluding stop words.
With this experiment, we would like to determine whether AutoExtend features improve WSD performance when added to standard WSD features. To make sure that improvements we obtain are not solely due to the power of word2vec, we also investigate a simple word2vec baseline. For S-product (the AutoExtend feature set that performs best in the experiment, see line 14), we test the alternative word2vec-based Snaive-product feature set. It has the same definition as S-product except that we replace the synset vectors s(j) with naive synset vectors z(j), defined as the sum of the word2vec vectors of the words that are members of synset j.
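The following sketch illustrates how such embedding-based feature sets can be computed for one ambiguous token. The exact feature definitions beyond what is stated in this section (k cosine values for S-cosine, n(k + 1) raw values for S-raw, element-wise products for S-product) are our assumptions.

```python
import numpy as np

def autoextend_features(sentence_vecs, synset_vecs):
    """Sketch of embedding-based WSD features for one ambiguous word.

    sentence_vecs: word2vec vectors of the context words (stop words removed).
    synset_vecs:   AutoExtend vectors s(1..k) of the k synsets of the target word.
    """
    c = np.sum(sentence_vecs, axis=0)                          # sentence centroid
    s_cosine = [c @ s / (np.linalg.norm(c) * np.linalg.norm(s))
                for s in synset_vecs]                          # k features
    s_product = np.concatenate([c * s for s in synset_vecs])   # element-wise products
    s_raw = np.concatenate([c] + list(synset_vecs))            # raw vectors, n*(k+1) features
    return s_cosine, s_product, s_raw
```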
Lines 1–7 in Table 4 show the performance of each feature set by itself. We see that the synset feature sets (lines 5–7) have a comparable performance to standard feature sets. S-product is the strongest of the synset feature sets.
Lines 9–16 show the performance of different feature set combinations. MFS (line 8) is the most frequent sense baseline. Lines 9 and 10 are the winners of SensEval. The standard configuration of IMS (line 11) uses the three feature sets on lines 1–3 (POS, surrounding word, local collocation) and achieves an accuracy of 65.2% on the English lexical sample task of SensEval-2 (Kilgarriff 2001) and 72.3% on SensEval-3 (Mihalcea, Chklovski, and Kilgarriff 2004).6 Lines 12–16 add one additional feature set to the IMS system on line 11; e.g., the system on line 14 uses POS, surrounding word, local collocation, and S-product feature sets. The system on line 14 outperforms all previous systems, most of them significantly. Although S-raw performs quite reasonably as a feature set alone, it hurts the performance when used as an additional feature set. Because this is the feature set that contains the largest number of features (n(k + 1)), overfitting is the likely reason. Conversely, S-cosine only adds k features and therefore may suffer from underfitting.
The main result of this experiment is that we achieve an improvement of more than 1% in WSD performance when using AutoExtend.
4.2 Entity Linking
We use the same IMS system for Entity Linking. The train, development, and test sets are created as follows. We start with the annotated FACC (Gabrilovich, Ringgaard, and Subramanya 2013) corpus and extract all entity-annotated words and their surrounding words—ten to the left and ten to the right. Recall that throughout this article, a word can also be a phrase. We remove aliases that occur less than 0.1 times as often as the corresponding word, as well as words that have a character length of one or two. We extract at most 400 examples for each entity–word combination. This procedure selects entities that are ambiguous and that are frequent enough to give us a sufficient number of training examples. We randomly select 50 words with 1,000 examples each and split each word into 700 train, 100 development, and 200 test instances. This results in a test set of 10,000 instances.7 We optimize the constraint weights on the development set; the optimal values are α = 0.7 and β = 0.0. We incorporate the embeddings in three different ways as described in Section 4.1. The results can be seen in Table 4. Again the element-wise product (line 14) performs better than cosine and raw (lines 13 and 15). The new feature set achieves an accuracy of 65.4%—significantly better than the baseline IMS system (line 11, 61.7%).
4.3 Word-in-Context Similarity
The third evaluation uses SCWS (Huang et al. 2012). SCWS is a Word Similarity test set that provides not only isolated words and corresponding similarity scores, but also a context for each word. The similarity score is an average score of 10 human ratings. See Table 5 for examples. In contrast to normal Word Similarity test sets, this data set also contains pairs of two instances of the same word. SCWS is based on WordNet, but the information as to which synset a Word-in-Context came from is not available. However, the data set is the closest approximation to a sense similarity test set that we could find. Synset and lexeme embeddings are obtained by running AutoExtend. We set α = 0.2 and β = 0.2 based on Section 4.4. Lexeme embeddings are the natural choice for this task as human subjects are provided with two words and a context for each and then have to assign a similarity score. For completeness, we also run experiments for synsets.
| word 1 | similarity | word 2 |
|---|---|---|
| … Crew members advised passengers to sit quietly in order to increase their chances of survival … | 7.1 | … the Rome Statute stipulates that the court may inform the Assembly of States Parties or Security Council … |
| … and Andy's getting ready to pack his bags and head up to Los Angeles tomorrow to get ready to fly back home on Thursday | 2.1 | … she encounters Ben ( Duane Jones ), who arrives in a pickup truck and defends the house against another pack of zombies … |
For each word, we compute a context vector c by adding all word vectors of the context, excluding the test word itself. Following Reisinger and Mooney (2010), we compute the lexeme (respectively, synset) vector l either as the simple average of the lexeme (respectively, synset) vectors l(ij) (respectively, s(j)) (method AvgSim, no dependence on c in this case) or as the average of the lexeme (respectively, synset) vectors weighted by cosine similarity to c (method AvgSimC). The latter method is supposed to give higher weights to lexemes that better fit the context.
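A minimal sketch of AvgSim and AvgSimC as described above (the handling of negative cosine weights in AvgSimC is our simplification):

```python
import numpy as np

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def avg_sim(lexeme_vecs):
    """AvgSim: plain average of a word's lexeme (or synset) vectors; ignores the context."""
    return np.mean(lexeme_vecs, axis=0)

def avg_sim_c(lexeme_vecs, context_vec):
    """AvgSimC: average weighted by cosine similarity to the context vector c
    (negative cosines are clipped to a small positive value; a simplification)."""
    weights = np.array([max(cos(l, context_vec), 1e-6) for l in lexeme_vecs])
    return np.average(lexeme_vecs, axis=0, weights=weights)

# the predicted SCWS score for a word pair is then cos(vec1, vec2), where each vec
# is computed by avg_sim or avg_sim_c from that word's lexeme vectors
```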
Table 6 shows that AutoExtend lexeme embeddings (line 7) perform better than previous work, including Huang et al. (2012) and Tian et al. (2014). Lexeme embeddings perform better than synset embeddings (lines 7 vs. 6), presumably because using a representation that is specific to the actual word being judged is more precise than using a representation that also includes synonyms.
| | | AvgSim | AvgSimC |
|---|---|---|---|
| 1 | Huang et al. (2012) | 62.8 | 65.7† |
| 2 | Tian et al. (2014) | – | 65.4† |
| 3 | Neelakantan et al. (2014) | 67.2 | 69.3 |
| 4 | Chen, Liu, and Sun (2014) | 66.2‡ | 68.9 |
| 5 | words (word2vec) | 66.7† | 66.7† |
| 6 | synsets | 63.2† | 63.5† |
| 7 | lexemes | 68.3 | 70.2 |
A simple baseline is to use the underlying word2vec embeddings directly (line 5). In this case, there is only one embedding, so there is no difference between AvgSim and AvgSimC. It is interesting to note that even if we do not take the context into account (method AvgSim) the lexeme embeddings outperform the original word embeddings. As AvgSim simply adds up all lexemes of a word, this is equivalent to the motivation we proposed in the beginning of the article (Equation (8)). Thus, replacing a word's embedding by the sum of the embeddings of its senses could generally improve the quality of embeddings—see Huang et al. (2012) for a similar argument. We will provide a deeper evaluation of this in Section 4.4.
4.4 Word Similarity
The results of the previous experiments motivate us to test the new embeddings also on Word Similarity test sets, namely, MC (Miller and Charles 1991), MEN (Bruni, Tran, and Baroni 2014), RG (Rubenstein and Goodenough 1965), SIMLEX (Hill, Reichart, and Korhonen 2014), RW (Luong, Socher, and Manning 2013), and WordSim-353 (Finkelstein et al. 2001) for English (using embeddings autoextended based on WordNet) and GUR-65, GUR-350 (Gurevych 2005), and ZG-222 (Zesch and Gurevych 2006) for German (using embeddings autoextended based on GermaNet). Because the simple sum of the lexeme vectors (method AvgSim, line 7, Table 6) ignores the context and outperforms the underlying word embeddings (line 5), we expect a similar performance improvement on other Word Similarity test sets. Note that AutoExtend makes available three different word embeddings:
1. the original word embeddings W0 = W, i.e., the input to AutoExtend;
2. the word embeddings W1 that we obtain when we add lexeme vectors of the encoding part (see Equation (30));
3. the word embeddings W2 that we obtain when we add lexeme vectors of the decoding part (see Equation (31) or (22)).
We observe that each pair (Wi, Wj), i ≠ j, of word embedding sets corresponds to a constraint of AutoExtend. (i) The column normalization constraint (Equation (35)) will align W0 and W1, as we just split the original word embeddings and add them up again. (ii) The word constraint (Equation (29)) will align W0 and W2. This was the initial idea of our system. (iii) The lexeme constraint (Equation (33)) will align W1 and W2.
As in the previous section, we use the cosine similarity of word embeddings to predict a similarity score and report the Spearman correlation. We use W (Table 7, line 1) as our baseline. Lines 2 and 3 are the word embeddings described above. The SIMLEX and GUR-65 test sets are used as development sets to obtain the parameters α = 0.2 and β = 0.2 for both models by optimizing max(W1,W2) (i.e., the best result of line 2 and 3). Although we observe a significant performance drop from W to W1, we also observe a small improvement in W2 for English. The improvement is significant for German, but not for English. This is most likely because of the very strong baseline of the Google News word embeddings, which are used for the English test sets. The German embeddings are trained on the smaller Wikipedia corpus. This suggests that our method is especially suited to improve lower quality embeddings.
| | | MC | MEN | RG | SIMLEX | RW | WORDSIM | GUR-65 | GUR-350 | ZG-222 |
|---|---|---|---|---|---|---|---|---|---|---|
| | size | 30 | 3,000 | 65 | 999 | 2,034 | 353 | 65 | 350 | 222 |
| | coverage | 30 | 2,922 | 65 | 999 | 1,246 | 332 | 47 | 213 | 108 |
| 1 | W | 78.9 | 77.0 | 76.1 | 44.2 | 54.2 | 69.9 | 41.0 | 39.1 | 23.0 |
| 2 | W1 | 70.9 | 67.5† | 67.8‡ | 37.6† | 49.3† | 61.0† | 25.1† | 40.4 | 28.4‡ |
| 3 | W2 | 85.2 | 77.5 | 82.5 | 47.4† | 54.8 | 69.0 | 63.3† | 57.1† | 34.3† |
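Concretely, the evaluation protocol described above (cosine similarity between word vectors, scored by Spearman correlation against the human ratings, skipping out-of-vocabulary pairs) can be sketched as follows; the function and variable names are ours.

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate_word_similarity(pairs, gold_scores, emb):
    """Predict similarity as the cosine of the two word vectors and report Spearman's rho.
    Pairs without both words in `emb` are skipped (cf. the coverage row in Table 7)."""
    preds, gold = [], []
    for (w1, w2), g in zip(pairs, gold_scores):
        if w1 in emb and w2 in emb:
            a, b = emb[w1], emb[w2]
            preds.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
            gold.append(g)
    return spearmanr(preds, gold).correlation
```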
4.5 Synset Alignment
We create two test sets, one for words and one for synsets. The 1,000 translation pairs we held back are concatenated with 1,000 random German–English word pairs. The task is to predict whether a pair is a translation (positive) or not (negative); the test set contains 1,000 positive and 1,000 negative instances. We construct similar development and test sets for synsets by using the interlingual index provided in GermaNet. The interlingual index allows a mapping of concepts (e.g., synsets) of different languages. We randomly collect 1,000 correct German–English synset pairs and 1,000 false synset pairs for development and test each. The development set is used to optimize the parameters α and β of German and English models. The best performance is found for α = 0.9 and β = 0.1 for both German and English. Note that we do not need a development set for words as there are no parameters to tune. Errors in the word test set are probably due to insufficient word embedding models or errors caused by the linear mapping L. As we already mentioned, our synset embeddings can only be as good as the underlying word embeddings. Thus both cases, insufficient word embeddings and insufficient linear map, also affect the performance of the synset embeddings. Because of this, the accuracy of the word test set (line 1 in Table 8) can be seen as an upper bound. Line 2 shows the performance of our synset embeddings on the development and test sets. Line 3 shows the performance of naive synset embeddings, defined as the sum of the vectors of the words that are members of a synset.
| | | Words | Synsets (dev) | Synsets (test) |
|---|---|---|---|---|
| | size | 2,000 | 2,000 | 2,000 |
| 1 | word | 0.943 | | |
| 2 | synset | | 0.872 | 0.870 |
| 3 | synset (naive) | | 0.852‡ | 0.826† |
The main result of this experiment is that the synset vectors obtained by AutoExtend perform better in bilingual synset alignment than a naive sum of words.
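For illustration, the sketch below fits a linear map on the held-back translation pairs and classifies a cross-lingual pair by the cosine of the mapped source vector and the target vector; the least-squares fit and the thresholding are our assumptions, as this section only refers to "the linear mapping L".

```python
import numpy as np

def fit_linear_map(X_src, Y_tgt):
    """Least-squares map from a source to a target embedding space, fitted on the
    held-back translation pairs (rows of X_src and Y_tgt are aligned vectors)."""
    L, *_ = np.linalg.lstsq(X_src, Y_tgt, rcond=None)
    return L

def is_translation(x_src, y_tgt, L, threshold=0.5):
    """Classify a cross-lingual (word or synset) pair as a translation if the mapped
    source vector is close enough to the target vector; the threshold would be
    tuned on the development set."""
    z = x_src @ L
    score = z @ y_tgt / (np.linalg.norm(z) * np.linalg.norm(y_tgt))
    return score >= threshold
```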
4.6 Analysis
The most important parameters of AutoExtend are the weights α and β given to the objectives. Table 9 shows a summary of all weightings used in this article. We observe that although all constraints are important, the optimal weighting differs between applications. These differences are due to different corpora and resources. For example, aligning types in Freebase has a different effect than aligning antonyms in WordNet. More important, however, is the actual task for which the embeddings are used. For example, when we compute embedding similarities, we want similar words to have similar embeddings, resulting in a large weight for the similarity constraint (lines 4 and 5). For Synset Alignment, by contrast, we want similar synsets to remain distinguishable in embedding space, resulting in no weight for the similarity constraint (lines 6 and 7).
| | task | corpus | resource | α (word c.) | β (lexeme c.) | 1 − α − β (similarity c.) |
|---|---|---|---|---|---|---|
| 1 | WSD | Google News | WordNet 2.1 | 0.4 | 0.4 | 0.2 |
| 2 | Entity Linking | Google News | Freebase | 0.7 | 0.0 | 0.3 |
| 3 | SCWS | Google News | WordNet 2.1 | 0.2 | 0.2 | 0.6 |
| 4 | Word Similarity | Google News | WordNet 2.1 | 0.2 | 0.2 | 0.6 |
| 5 | Word Similarity | Wikipedia | GermaNet 9.0 | 0.2 | 0.2 | 0.6 |
| 6 | Synset Alignment | Google News | WordNet 3.0 | 0.9 | 0.1 | 0.0 |
| 7 | Synset Alignment | Wikipedia | GermaNet 9.0 | 0.9 | 0.1 | 0.0 |
We found that some applications are not sensitive to the weighting; for example, for Entity Linking (line 2), the differences between weightings that result in non-zero weights to all three constraints are negligible (less than 0.3).
We also analyzed the impact of the four different relations in WordNet (see Table 1) on performance. In Tables 4 and 6, all four relations are used together. We found that any combination of three relation types performs worse than using all four together. A comparison of different relations must be done carefully as they differ in the POS they affect and in quantity (see Table 1). In general, relation types with more relations outperformed relation types with fewer relations.
5. Related Work
5.1 Word Embeddings
Among the earliest work on distributed word representations (usually called “word embeddings” today) was Rumelhart, Hinton, and Williams (1988). Non-neural-network techniques that create low-dimensional word representations also have been used widely, including singular value decomposition (SVD) (Deerwester et al. 1990; Schütze 1992) and random indexing (Kanerva 1998, 2009). There has recently been a resurgence of work on embeddings (e.g., Bengio, Ducharme, and Vincent 2003; Mnih and Hinton 2007; Collobert et al. 2011; Mikolov et al. 2013a; Pennington, Socher, and Manning 2014), including methods that are SVD-based (Levy and Goldberg 2014; Stratos, Collins, and Hsu 2015). All of these models differ from AutoExtend in that they produce only a single embedding for each word, but all of them can be used as input for AutoExtend.
5.2 Sense Embeddings Not Related to Lexical Resources
There are several approaches to finding embeddings for senses, variously called meaning, sense, and multiple word embeddings. Schütze (1998) created sense representations by clustering context representations derived from co-occurrence. The centroid of each cluster is used as the representation of a sense. Reisinger and Mooney (2010) and Huang et al. (2012) also presented methods that learn multiple embeddings per word by clustering the contexts. Bordes et al. (2011) created similarity measures for relations in WordNet and Freebase to learn entity embeddings. An energy-based model was proposed by Bordes et al. (2012) to create disambiguated meaning embeddings, and Neelakantan et al. (2014) and Tian et al. (2014) extended the Skip-gram model (Mikolov et al. 2013a) to learn multiple word embeddings. Another interesting approach to create sense-specific word embeddings uses bilingual resources (Guo et al. 2014). The downside of this approach is that parallel data are needed. Although all these embeddings correspond to different word senses, there is no clear mapping between them and a resource like WordNet.
5.3 Sense Embeddings Related to Lexical Resources
Recently, Bhingardive et al. (2015) used WordNet to create sense embeddings similar to the naive method in this article. They used these sense embeddings to extract the most frequent synset. Chen, Liu, and Sun (2014) modified word2vec to learn sense embeddings, each corresponding to a WordNet synset. They used glosses to initialize sense embeddings, which in turn can be used for WSD. The sense disambiguated data can again be used to improve sense embeddings. Although WordNet is by far the most used resource, Iacobacci, Pilehvar, and Navigli (2015) computed sense embeddings with BabelNet, which is a superset of WordNet. They used a state-of-the-art WSD system to generate a large sense annotated corpus that is used to train sense embeddings. In contrast, our approach can be used to improve WSD without relying on input from an existing WSD system.
5.4 Embeddings Using Lexical Resources
Other work tried to combine distributed word representations and semantic resources to create better or specialized embeddings. These include the ADMM by Fried and Duh (2014) and the work of Wang, Mohamed, and Hirst (2015). Liu et al. (2015) also used WordNet to create ordinal similarity inequalities to extend the Skip-gram model into a Semantic Word Embedding model. In the Relation Constrained Model, Yu and Dredze (2014) used word2vec to learn embeddings that are optimized to predict a related word in the resource, with good evaluation results. Bian, Gao, and Liu (2014) used not only semantic but also morphological and syntactic knowledge to compute more effective word embeddings. Cotterell, Schütze, and Eisner (2016) focus on generating embeddings for inflected forms not observed during training based on morphological resources. Wang et al. (2014) used Freebase to learn embeddings for entities and words. This is done during embedding learning, in contrast to our post-processing method. Zhong et al. (2015) improved this by requiring the embedding vector not only to fit the structured constraints in the knowledge base but also to be equal to the embedding vector computed from the text description.
5.5 Post-processing Embeddings
This prior work needs a training step to learn embeddings. In contrast, we can “AutoExtend” any set of given word embeddings—without (re)training them. There is an increasing amount of work on taking existing word embeddings and combining them with a lexical resource. Labutov and Lipson (2013) re-embedded existing word embeddings in supervised training, not to create new embeddings for senses or entities, but to obtain better predictive performance on a task while not changing the space of embeddings. A similar approach was chosen by Faruqui et al. (2015) and called retrofitting. That work is also related to our work in that it uses WordNet. However, it only uses the similarity relations in order to change embeddings for known objects (i.e., words). They did not use additive relations nor did they compute embeddings for non-word objects. Jauhar, Dyer, and Hovy (2015) also used the same retrofitting technique to model sense embeddings. Their work is similar to our approach but instead of distinguishing between additive and similarity relations all edges are treated as similarity relations (see Figure 1). Their results show an improvement for word embeddings but the sense embeddings perform worse than the embeddings on which they were trained (0.42 on SCWS, see Table 6). We therefore believe that the additive relation is the superior model for the relationship between words and lexemes as well as for the relationship between synsets and lexemes. Kiela, Hill, and Clark (2015) used retrofitting and joint-learning approaches to specialize their embeddings for either similarity or relatedness tasks.
5.6 Other Related Work
In this work, we treated WSD and Entity Linking as the same problem and used IMS to solve this task. Moro, Raganato, and Navigli (2014) discussed the differences between the two tasks and also presented a unified approach, called Babelfy. An overview and analysis of the main approaches to Entity Linking was given by Shen, Wang, and Han (2015). Whereas we use cosine to compute the similarity between synsets, there are also many similarity measures that only rely on a given resource, mostly WordNet. These measures are often functions that depend on information like glosses or on topological properties like shortest paths. Examples include Wu and Palmer (1994) and Leacock and Chodorow (1998); Blanchard et al. (2005) give a good overview. A purely graph-based approach to WSD was presented by Agirre, de Lacalle, and Soroa (2014).
6. Conclusions
We presented AutoExtend, a flexible method to learn embeddings for non-word objects in resources. AutoExtend is a general method that can be used for any set of embeddings and for any resource that imposes constraints of a certain type on the relation between words and other objects. Our experimental results show that AutoExtend can be applied to different tasks including Word Sense Disambiguation, Entity Linking, Word-in-Context Similarity, Word Similarity, and Synset Alignment. It achieves state-of-the-art performance on Word-in-Context Similarity and Word Sense Disambiguation.
Acknowledgments
This work was funded by Deutsche Forschungsgemeinschaft (DFG SCHU 2246/2-2). We are grateful to Christiane Fellbaum for discussions leading up to this article and to the anonymous reviewers for their comments.
Notes
Closure does not hold literally for the set of words of a language represented in a finite corpus: There can be no bijection between the countable set of words and the uncountable set of real-valued vectors. When the sum of the embeddings of two words x and y is not attested as the embedding of a word, then we can see it as the embedding of a longer description. A simple example is that for many animals x, there are special words for the sum of x and young (calf, cub, chick), but for others, a phrase must be used (baby koala in non-Australian varieties of English, infant baboon).
We also tried (i) lemmatizing the corpus, (ii) using an averaged embedding of all inflected forms and (iii) using only the embeddings of the most frequent inflected form. These three methods yielded worse performance.
Recall that our definition of words also includes phrases. Just as we subsume both suit and red tape under the concept of word in WordNet, we also refer to both Clinton and George Miller in Freebase as words.
Entities also have one notable type, e.g., President-of-the-US for Barack Obama, but we do not distinguish notable types from other types.
Zhong and Ng (2010) report accuracies of 65.3% / 72.6% for this configuration.
This set is publicly available at http://cistern.cis.lmu.de/.
We could also transfer the German embedding space to the English one, but the performance is lower for this setting. The most likely reason is that the English embeddings are learned on a bigger corpus and thus contain more information. For the linear map, it is easy to drop information, but it is difficult to infer new information.