Abstract
We describe a probabilistic framework for acquiring selectional preferences of linguistic predicates and for using the acquired representations to model the effects of context on word meaning. Our framework uses Bayesian latent-variable models inspired by, and extending, the well-known Latent Dirichlet Allocation (LDA) model of topical structure in documents; when applied to predicate–argument data, topic models automatically induce semantic classes of arguments and assign each predicate a distribution over those classes. We consider LDA and a number of extensions to the model and evaluate them on a variety of semantic prediction tasks, demonstrating that our approach attains state-of-the-art performance. More generally, we argue that probabilistic methods provide an effective and flexible methodology for distributional semantics.
1. Introduction
Computational models of lexical semantics attempt to represent aspects of word meaning. For example, a model of the meaning of dog may capture the facts that dogs are animals, that they bark and chase cats, that they are often kept as pets, and so on. Word meaning is a fundamental component of the way language works: Sentences (and larger structures) consist of words, and their meaning is derived in part from the contributions of their constituent words' lexical meanings. At the same time, words instantiate a mapping between conceptual “world knowledge” and knowledge of language.
The relationship between the meanings of an individual word and the larger linguistic structure in which it appears is not unidirectional; while the word contributes to the meaning of the structure, the structure also clarifies the meaning of the word. Taken on its own a word may be vague or ambiguous, in the senses of Zwicky and Sadock (1975); even when the word's meaning is relatively clear it may still admit specification of additional details that affect its interpretation (e.g., what color/breed was the dog?). This specification comes through context, which consists of both linguistic and extralinguistic factors but shows a strong effect of the immediate lexical and syntactic environment—the other words surrounding the word of interest and their syntactic relations to it.
These diverse concerns motivate lexical semantic modeling as an important task for all computational systems that must tackle problems of meaning. In this article we develop a framework for modeling word meaning and how it is modulated by contextual effects.1 Our models are distributional in the sense that their parameters are learned from observed co-occurrences between words and contexts in corpus data. More specifically, they are probabilistic models that associate latent variables with automatically induced classes of distributional behavior and associate each word with a probability distribution over those classes. This has a natural interpretation as a model of selectional preference, the semantic phenomenon by which predicates such as verbs or adjectives more plausibly combine with some classes of arguments than with others. It also has an interpretation as a disambiguation model: The different latent variable values correspond to different aspects of meaning and a word's distribution over those values can be modified by information coming from the context it appears in. We present a number of specific models within this framework and demonstrate that they can give state-of-the-art performance on tasks requiring models of preference and disambiguation. More generally, we illustrate that probabilistic modeling is an effective general-purpose framework for distributional semantics and a useful alternative to the popular vector-space framework.
The main contributions of the article are as follows:
We describe the probabilistic approach to distributional semantics, showing how it can be applied as generally as the vector-space approach.
We present three novel probabilistic selectional preference models and show that they outperform a variety of previously proposed models on a plausibility-based evaluation.
Furthermore, the representations learned by these models correspond to semantic classes that are useful for modeling the effect of context on semantic similarity and disambiguation.
2. Background and Related Work
2.1 Distributional Semantics
The distributional approach to semantics is often traced back to the so-called “distributional hypothesis” put forward by mid-century linguists such as Zellig Harris and J.R. Firth:
If we consider words or morphemes A and B to be more different in meaning than A and C, then we will often find that the distributions of A and B are more different than the distributions of A and C. (Harris 1954)
You shall know a word by the company it keeps. (Firth 1957)
The basic unit of distributional semantics is the co-occurrence: an observation of a word appearing in a particular context. The definition is a general one: We may be interested in all kinds of words, or only a particular subset of the vocabulary; we may define the context of interest to be a document, a fixed-size window around a nearby word, or a syntactic dependency arc incident to a nearby word. Given a data set of co-occurrence observations we can extract an indexed set of co-occurrence counts fw for each word of interest w; each entry fwc counts the number of times that w was observed in context c. Alternatively, we can extract an indexed set fc for each context.
The vector-space approach is the best-known methodology for distributional semantics; under this conception fw is treated as a vector in ℝ^|C|, where C is the vocabulary of contexts. As such, fw is amenable to computations known from linear algebra. We can compare co-occurrence vectors for different words with a similarity function such as the cosine measure or a dissimilarity function such as Euclidean distance; we can cluster neighboring vectors; we can project a matrix of co-occurrence counts onto a low-dimensional subspace; and so on. This is perhaps the most popular approach to distributional semantics and there are many good general overviews covering the possibilities and applications of the vector space model (Curran 2003; Weeds and Weir 2005; Padó and Lapata 2007; Turney and Pantel 2010).
Although it is natural to view the aggregate of co-occurrence counts for a word as constituting a vector, it is equally natural to view it as defining a probability distribution. When normalized to have unit sum, fw parameterizes a discrete distribution giving the conditional probability of observing a particular context given that we observe w. The contents of the vector-space modeler's toolkit generally have probabilistic analogs: similarity and dissimilarity can be computed using measures from information theory such as the Kullback–Leibler or Jensen–Shannon divergences (Lee 1999); the effects of clustering and dimensionality reduction can be achieved through the use of latent variable models (see Section 3.2.2). Additionally, Bayesian priors on parameter distributions provide a flexible toolbox for performing regularization and incorporating prior information in learning. A further advantage of the probabilistic framework is that it is often straightforward to extend existing models to account for additional structure in the data, or to tie together parameters for shared statistical strength, while maintaining guarantees of well-normalized behavior thanks to the laws of probability. In this article we focus on selectional preference learning and contextual disambiguation but we believe that the probabilistic approach exemplified here can fruitfully be applied in any scenario involving distributional semantic modeling.
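As a concrete illustration of the probabilistic view (with invented toy counts), the sketch below normalizes co-occurrence counts into conditional context distributions and compares words with the Jensen–Shannon divergence.

```python
import numpy as np

def to_distribution(counts, smoothing=1e-9):
    """Normalize a vector of co-occurrence counts into P(context | word)."""
    probs = np.asarray(counts, dtype=float) + smoothing
    return probs / probs.sum()

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy co-occurrence counts over the same four contexts (hypothetical numbers).
f_dog = to_distribution([12, 3, 0, 5])
f_cat = to_distribution([10, 4, 1, 3])
f_car = to_distribution([0, 1, 9, 0])

print(js_divergence(f_dog, f_cat))  # small: similar distributional profiles
print(js_divergence(f_dog, f_car))  # larger: dissimilar profiles
```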
2.2 Selectional Preferences
2.2.1 Motivation
A fundamental concept in linguistic knowledge is the predicate, by which we mean a word or other symbol that combines with one or more arguments to produce a composite representation with a composite meaning (by the principle of compositionality). The archetypal predicate is a verb; for example, transitive drink takes two noun arguments as subject and object, with which it combines to form a basic sentence. However, the concept is a general one, encompassing other word classes as well as more abstract items such as semantic relations (Yao et al. 2011), semantic frames (Erk, Padó, and Padó 2010), and inference rules (Pantel et al. 2007). The asymmetric distinction between predicate and argument is analogous to that between context and word in the more general distributional framework.
It is intuitive that a particular predicate will be more compatible with some semantic argument classes than with others. For example, the subject of drink is typically an animate entity (human or animal) and the object of drink is typically a beverage. The subject of eat is also typically an animate entity but its object is typically a foodstuff. The noun modified by the adjective tasty is also typically a foodstuff, whereas the noun modified by informative is an information-bearing object. This intuition can be formalized in terms of a predicate's selectional preference: a function that assigns a numerical score to a combination of a predicate and one or more arguments according to the semantic plausibility of that combination. This score may be a probability, a rank, a real value, or a binary value; in the last case, the usual term is selectional restriction.
Models of selectional preference aim to capture conceptual knowledge that all language users are assumed to have. Speakers of English can readily identify that examples such as the following are semantically infelicitous despite being syntactically well-formed:
- 1.
The beer drank the man.
- 2.
Quadruplicity drinks procrastination. (Russell 1940)
- 3.
Colorless green ideas sleep furiously. (Chomsky 1957)
- 4.
The paint is silent. (Katz and Fodor 1963)
In NLP, one motivation for modeling predicate–argument plausibility is to investigate whether this aspect of human conceptual knowledge can be learned automatically from text corpora. If the predictions of a computational model correlate with judgments collected from human behavioral data, the assumption is that the model itself shares some properties with human linguistic knowledge and is in some sense a “good” semantic model. More practically, NLP researchers have shown that selectional preference knowledge is useful for downstream applications, including metaphor detection (Shutova 2010), identification of non-compositional multiword expressions (McCarthy, Venkatapathy, and Joshi 2007), semantic role labeling (Gildea and Jurafsky 2002; Zapirain, Agirre, and Màrquez 2009; Zapirain et al. 2010), word sense disambiguation (McCarthy and Carroll 2003), and parsing (Zhou et al. 2011).
2.2.2 The “Counting” Approach
The simplest way to estimate the plausibility of a predicate–argument combination from a corpus is to count the number of times that combination appears, on the assumptions that frequency correlates with plausibility and that given enough data the resulting estimates will be relatively accurate. For example, Keller and Lapata (2003) estimate predicate–argument plausibilities by submitting appropriate queries to a Web search engine and counting the number of “hits” returned. To estimate the frequency with which the verb drink takes beer as a direct object, Keller and Lapata's method uses the query <drink|drinks|drank|drunk|drinking a|the|∅ beer|beers>; to estimate the frequency with which tasty modifies pizza the query is simply <tasty pizza|pizzas>. Where desired, these joint frequency counts can be normalized by unigram hit counts to estimate conditional probabilities such as P(pizza|tasty).
The main advantages of this approach are its simplicity and its ability to exploit massive corpora of raw text. On the other hand, it is hindered by the facts that only shallow processing is possible and that even in a Web-scale corpus the probability estimates for rare combinations will not be accurate. At the time of writing, Google returns zero hits for the query <draughtsman|draughtsmen whistle|whistles|whistled|whistling> and 1,570 hits for <onion|onions whistle|whistles|whistled|whistling>, suggesting the implausible conclusion that an onion is far more likely to whistle than a draughtsman.2
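A minimal sketch of the counting estimate and the sparsity problem it runs into, using a small invented list of extracted predicate–argument pairs in place of Web hit counts:

```python
from collections import Counter

# Hypothetical (predicate, argument) observations extracted from a parsed corpus.
pairs = [("tasty", "pizza"), ("tasty", "soup"), ("tasty", "pizza"),
         ("drink", "beer"), ("drink", "water"), ("drink", "beer")]

joint = Counter(pairs)
pred_totals = Counter(p for p, _ in pairs)

def mle_plausibility(predicate, argument):
    """P(argument | predicate) by maximum likelihood; zero for unseen pairs."""
    if pred_totals[predicate] == 0:
        return 0.0
    return joint[(predicate, argument)] / pred_totals[predicate]

print(mle_plausibility("tasty", "pizza"))  # 2/3
print(mle_plausibility("drink", "pizza"))  # 0.0 -- the sparsity problem
```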
2.2.3 Similarity-Based Smoothing Methods
During the 1990s, research on language modeling led to the development of various “smoothing” methods for overcoming the data sparsity problem that inevitably arises when estimating co-occurrence counts from finite corpora (Chen and Goodman 1999). The general goal of smoothing algorithms is to alter the distributional profile of observed counts to better match the known statistical properties of linguistic data (e.g., that language exhibits power-law behavior). Some also incorporate semantic information on the assumption that meaning guides the distribution of words in a text.
2.2.4 Discriminative Models
Bergsma, Lin, and Goebel (2008) cast selectional preference acquisition as a supervised learning problem to which a discriminatively trained classifier such as a Support Vector Machine (SVM) can be applied. To produce training data for a predicate, they pair “positive” arguments that were observed for that predicate in the training corpus and have an association with that predicate above a specified threshold (measured by mutual information) with randomly selected “negative” arguments of similar frequency that do not occur with the predicate or fall below the association threshold. Given this training data, a classifier can be trained in a standard way to predict a positive or negative score for unseen predicate–argument pairs.
An advantage of this approach is that arbitrary sets of features can be used to represent the training and testing items. Bergsma, Lin, and Goebel include conditional probabilities P(a|p) for all predicates the candidate argument co-occurs with, typographic features of the argument itself (e.g., whether it is capitalized, or contains digits), lists of named entities, and precompiled semantic classes.
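A minimal sketch of this setup, assuming a single predicate with a handful of invented positive and pseudo-negative arguments and a deliberately simplified feature map; a logistic regression classifier stands in for the SVM used by Bergsma, Lin, and Goebel.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical training data for the predicate "drink": observed arguments as
# positives, frequency-matched unattested arguments as pseudo-negatives.
positives = ["beer", "water", "coffee", "wine"]
negatives = ["table", "theory", "door", "speech"]  # sampled at random in practice

def features(argument):
    """A toy feature map: length, suffix, and capitalization of the argument."""
    return {"len": len(argument),
            "suffix": argument[-2:],
            "capitalized": argument[0].isupper()}

X_dicts = [features(a) for a in positives + negatives]
y = [1] * len(positives) + [0] * len(negatives)

vec = DictVectorizer()
X = vec.fit_transform(X_dicts)
clf = LogisticRegression().fit(X, y)

# Score an unseen candidate argument for this predicate.
print(clf.predict_proba(vec.transform([features("juice")]))[0, 1])
```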
2.2.5 WordNet-Based Models
An alternative approach to preference learning models the argument distribution for a predicate as a distribution over semantic classes provided by a predefined lexical resource. The most popular such resource is the WordNet lexical hierarchy (Fellbaum 1998), which provides semantic classes and hypernymic structures for nouns, verbs, adjectives, and adverbs.3 Incorporating knowledge about the WordNet taxonomy structure in a preference model enables the use of graph-based regularization techniques to complement distributional information, while also expanding the coverage of the model to types that are not encountered in the training corpus. On the other hand, taxonomy-based methods build in an assumption that the lexical hierarchy chosen is the universally “correct” one and they will not perform as well when faced with data that violates the hierarchy or contains unknown words. A further issue faced by these models is that the resources they rely on require significant effort to create and will not always be available to model data in a new language or a new domain.
Resnik (1993) proposes a measure of associational strength between a predicate and WordNet classes based on the empirical distribution of words of each class (and their hyponyms) in a corpus. Abney and Light (1999) conceptualize the process of generating an argument for a predicate in terms of a Markovian random walk from the hierarchy's root to a leaf node and choosing the word associated with that leaf node. Ciaramita and Johnson (2000) likewise treat WordNet as defining the structure of a probabilistic graphical model, in this case a Bayesian network. Li and Abe (1998) and Clark and Weir (2002) both describe models in which a predicate “cuts” the hierarchy at an appropriate level of generalization, such that all classes below the cut are considered appropriate arguments (whether observed in data or not) and all classes above the cut are considered inappropriate.
In this article we focus on purely distributional models that do not rely on manually constructed lexical resources; therefore we do not revisit the models described in this section subsequently, except as a basis for empirical comparison. Ó Séaghdha and Korhonen (2012) do investigate a number of Bayesian preference models that incorporate WordNet classes and structure, finding that such models outperform previously proposed WordNet-based models and perform comparably to the distributional Bayesian models presented here.
2.3 Measuring Similarity in Context
2.3.1 Motivation
A fundamental idea in semantics is that the meaning of a word is disambiguated and modulated by the context in which it appears. The word body clearly has a different sense in each of the following text fragments:
- 1.
Depending on the present position of the planetary body in its orbital path, …
- 2.
The executive body decided…
- 3.
The human body is intriguing in all its forms.
A complementary perspective on the disambiguatory power of context models is provided by research on semantic composition, namely, how the syntactic effect of a grammar rule is accompanied by a combinatory semantic effect. In this view, the goal is to represent the combination of a context and an in-context word, not just to represent the word given the context. The co-occurrence models described in this article are not designed to scale up and provide a representation for complex syntactic structures,4 but they are applicable to evaluation scenarios that involve representing binary co-occurrences.
2.3.2 Vector-Space Models
As described in Section 2.1, the vector-space approach to distributional semantics casts word meanings as vectors of real numbers and uses linear algebra operations to compare and combine these vectors. A word w is represented by a vector vw that models aspects of its distribution in the training corpus; the elements of this vector may be co-occurrence counts (in which case it is the same as the frequency vector fw) or, more typically, some transformation of the raw counts.
All the models described in this section provide a way of relating a word's standard co-occurrence vector to a vector representation of the word's meaning in context. This allows us to calculate the similarity between two in-context words or between a word and an in-context word using standard vector similarity measures such as the cosine. In applications where the task is to judge the appropriateness of substituting a word ws for an observed word wo in context C = {(r1, w1), (r2, w2), … , (rn, wn)}, a common approach is to compute the similarity between the contextualized vector for wo in C and the uncontextualized word vector vws. It has been demonstrated empirically that this approach yields better performance than contextualizing both vectors before the similarity computation.
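As an illustration of this general recipe (a sketch only; the co-occurrence vectors are invented, and component-wise multiplication is used as one simple contextualization operation in the spirit of the multiplicative model of Mitchell and Lapata (2008)):

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical co-occurrence vectors over a shared context vocabulary.
v_observed = np.array([4.0, 1.0, 3.0, 0.5])    # e.g. the observed word
v_context = np.array([0.1, 2.0, 3.0, 0.2])     # e.g. a word it depends on
v_substitute = np.array([3.5, 1.5, 2.5, 0.3])  # e.g. a candidate substitute

# Contextualize the observed word by component-wise multiplication with the context.
v_in_context = v_observed * v_context

# Compare the contextualized observed word with the uncontextualized substitute.
print(cosine(v_in_context, v_substitute))
```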
3. Probabilistic Latent Variable Models for Lexical Semantics
3.1 Notation and Terminology
We define a co-occurrence as a pair (c,w), where c is a context belonging to the vocabulary of contexts C and w is a word belonging to the word vocabulary W.5 Unless otherwise stated, the contexts considered in this article are head-lexicalized dependency edges c = (r,wh), where r is the grammatical relation and wh is the head lemma. We notate grammatical relations as ph:label:pd, where ph is the head word's part of speech, pd is the dependent word's part of speech, and label is the dependency label.6 We use a coarse set of part-of-speech tags: n (noun), v (verb), j (adjective), r (adverb). The dependency labels are the grammatical relations used by the RASP system (Briscoe 2006; Briscoe, Carroll, and Watson 2006), though in principle any dependency formalism could be used. The assumption that predicates correspond to head-lexicalized dependency edges means that they have arity one.
To estimate our preference models we will rely on co-occurrence counts extracted from a corpus of observations O. Each observation is a co-occurrence of a predicate and an argument. The set of observations for context c is denoted O(c). The co-occurrence frequency of context c and word w is denoted by fcw, and the total co-occurrence frequency of c by fc = ∑w∈W fcw.
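For example (a hypothetical fragment, with the relation labels shown only for illustration), the direct-object edge from drink to beer yields the co-occurrence (c, w) with c = (v:dobj:n, drink) and w = beer. A minimal sketch of accumulating such counts:

```python
from collections import Counter

# Hypothetical parsed co-occurrences: each context c is a head-lexicalized
# dependency edge (relation, head lemma); each word w is the dependent lemma.
observations = [
    (("v:dobj:n", "drink"), "beer"),
    (("v:dobj:n", "drink"), "water"),
    (("n:ncmod:j", "pizza"), "tasty"),
]

f = Counter(observations)                  # f[(c, w)]: joint co-occurrence counts
f_c = Counter(c for c, _ in observations)  # f[c]: total count per context

print(f[("v:dobj:n", "drink"), "beer"], f_c[("v:dobj:n", "drink")])  # 1 2
```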
3.2 Modeling Assumptions
3.2.1 Bayesian Modeling
The Bayesian approach to probabilistic modeling (Gelman et al. 2003) is characterized by (1) the use of prior distributions over model parameters to encode the modeler's expectations about the values they will take; and (2) the explicit quantification of uncertainty by maintaining posterior distributions over parameters rather than point estimates.7
Other priors commonly used for discrete distributions in NLP include the Dirichlet process and the Pitman–Yor process (Goldwater, Griffiths, and Johnson 2011). The Dirichlet process provides similar behavior to the Dirichlet distribution prior but is “non-parametric” in the sense of varying the size of its support according to the data; in the context of mixture modeling, a Dirichlet process prior allows the number of mixture components to be learned rather than fixed in advance. The Pitman–Yor process is a generalization of the Dirichlet process that is better suited to learning power-law distributions. This makes it particularly suitable for language modeling where the Dirichlet distribution or Dirichlet process would not produce a long enough tail due to their preference for sparsity (Teh 2006). On the other hand, Dirichlet-like behavior may be preferable in semantic modeling, where we expect, for example, predicate–class and class–argument distributions to be sparse.
3.2.2 The Latent Variable Assumption
Here the latent variables z index mixture components, each of which is associated with a distribution over observations x, and the resulting likelihood is an average of the component distributions weighted by the mixing weights P(z). The set of possible values for z is the set of components Z. When |Z| is small relative to the size of the training data, this model has a clustering effect in the sense that the distribution learned for P(x|z) is informed by all datapoints assigned to component z.
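In equation form (a sketch of the general shape described here, using the notation of Section 3.1), the mixture likelihood and its conditional co-occurrence variant are:

```latex
P(x) \;=\; \sum_{z \in Z} P(z)\, P(x \mid z)
\qquad\text{and, conditioned on a context,}\qquad
P(w \mid c) \;=\; \sum_{z \in Z} P(z \mid c)\, P(w \mid z)
```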
The idea of compressing the observed co-occurrence data through a small layer of latent variables shares the same basic motivations as other, not necessarily probabilistic, dimensionality reduction techniques such as Latent Semantic Analysis or Non-negative Matrix Factorization. An advantage of probabilistic models is their flexibility, both in terms of learning methods and model structures. For example, the models considered in this article can potentially be extended to multi-way co-occurrences and to hierarchically defined contexts that cannot easily be expressed in frameworks that require the input to be a co-occurrence matrix.
3.3 Bayesian Models for Binary Co-occurrences
Combining the latent variable co-occurrence model (23) with the use of Dirichlet priors naturally leads to Latent Dirichlet Allocation (LDA) (Blei, Ng, and Jordan 2003). Often described as a “topic model,” LDA is a model of document content that assumes each document is generated from a mixture of multinomial distributions or “topics.” Topics are shared across documents and correspond to thematically coherent patterns of word usage. For example, one topic may assign high probability to the words finance, fund, bank, and invest, whereas another topic may assign high probability to the words football, goal, referee, and header. LDA has proven to be a very successful model with many applications and extensions, and the topic modeling framework remains an area of active research in machine learning.
Figure 1 sketches the “generative story” according to which LDA generates arguments for predicates and also presents a plate diagram indicating the dependencies between variables in the model. Table 1 illustrates the semantic representation induced by a 600-topic LDA model trained on predicate–noun co-occurrences extracted from the British National Corpus (for more details of this training data, see Section 4.1). The “semantic classes” are actually distributions over all nouns in the vocabulary rather than a hard partitioning; therefore we present the eight most probable words for each. We also present the contexts most frequently associated with each class. Whereas a topic model trained on document–word co-occurrences will find topics that reflect broad thematic commonalities, the model trained on syntactic co-occurrences finds semantic classes that capture a much tighter sense of similarity: Words assigned high probability in the same topic tend to refer to entities that have similar properties, that perform similar actions, and have similar actions performed on them. Thus Class 1 is represented by attack, raid, assault, campaign, and so on, forming a coherent semantic grouping. Classes 2, 3, and 4 correspond to groups of tests, geometric objects, and public/educational institutions, respectively. Class 5 has been selected to illustrate a potential pitfall of using syntactic co-occurrences for semantic class induction: fund, revenue, eyebrow, and awareness hardly belong together as a coherent conceptual class. The reason, it seems, is that they are all entities that can be (and in the corpus, are) raised. This class has also conflated different (but related) senses of reserve and as a result the modifier nature is often associated with it.
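Figure 1 itself is not reproduced here, but the standard LDA generative story adapted to predicate–argument data takes roughly the following shape; this is a minimal simulation with an invented vocabulary, class count, and hyperparameter values.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["beer", "water", "pizza", "soup", "theory"]  # hypothetical noun vocabulary
num_classes, alpha, beta = 3, 0.1, 0.01

# Each semantic class z has a distribution over arguments: phi_z ~ Dirichlet(beta).
phi = rng.dirichlet([beta] * len(vocab), size=num_classes)

def generate_arguments(num_args):
    """Generate arguments for one predicate under the LDA-style story."""
    theta = rng.dirichlet([alpha] * num_classes)  # predicate's class distribution
    args = []
    for _ in range(num_args):
        z = rng.choice(num_classes, p=theta)      # choose a semantic class
        w = rng.choice(vocab, p=phi[z])           # choose an argument from that class
        args.append(str(w))
    return args

print(generate_arguments(10))  # arguments "generated" for one hypothetical predicate
```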
3.4 Parameter and Hyperparameter Learning
3.4.1 Learning Methods
A variety of methods are available for parameter learning in Bayesian models. The two standard approaches are variational inference, in which an approximation to the true distribution over parameters is estimated exactly, and sampling, in which convergence to the true posterior is guaranteed in theory but rarely verifiable in practice. In some cases the choice of approach is guided by the model, but often it is a matter of personal preference; for LDA, there is evidence that equivalent levels of performance can be achieved through variational learning and sampling given appropriate parameterization (Asuncion et al. 2009). In this article we use learning methods based on Gibbs sampling, following Griffiths and Steyvers (2004). The basic idea of Gibbs sampling is to iterate through the corpus one observation at a time, updating the latent variable value for each observation according to the conditional probability distribution determined by the current observed and latent variable values for all other observations. Because the likelihoods are multinomials with Dirichlet priors, we can integrate out their parameters using Equation (21).
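As an illustration, here is a minimal sketch of a single collapsed Gibbs update of the kind described above, assuming symmetric hyperparameters alpha and beta and count matrices n_cz (observations of predicate c assigned to class z), n_zw (assignments of word w to class z), and n_z (class totals); the variable names are ours, not the article's. The factor that depends only on the predicate is constant across classes and is dropped.

```python
import numpy as np

def resample_class(i, c, w, z, n_cz, n_zw, n_z, alpha, beta, vocab_size):
    """Resample the class assignment of observation i, a (predicate c, argument w)
    pair, conditioned on the assignments of all other observations."""
    old = z[i]
    # Remove the observation's current assignment from the count matrices.
    n_cz[c, old] -= 1
    n_zw[old, w] -= 1
    n_z[old] -= 1
    # P(z = k | rest) ∝ (n_cz[c, k] + alpha) * (n_zw[k, w] + beta) / (n_z[k] + |W| * beta)
    p = (n_cz[c] + alpha) * (n_zw[:, w] + beta) / (n_z + vocab_size * beta)
    p = p / p.sum()
    new = np.random.choice(len(p), p=p)
    # Add the observation back under its new assignment.
    n_cz[c, new] += 1
    n_zw[new, w] += 1
    n_z[new] += 1
    return new
```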
A naive implementation of the sampler will take time linear in the number of topics and the number of observations to complete one iteration. Yao, Mimno, and McCallum (2009) present a sampling algorithm for LDA that yields a considerable speedup by reformulating Equation (30) to allow caching of intermediate values and an intelligent sorting of topics so that in many cases only a small number of topics need be iterated through before assigning a topic to an observation. In this article we use Yao, Mimno, and McCallum's algorithm for LDA, as well as a transformation of the Rooth-LDA and Lex-LDA samplers that can be derived in an analogous fashion.
3.4.2 Inference
As noted previously, the Gibbs sampling procedure is guaranteed to converge to the true posterior only in the limit; the number of iterations required for a good approximation is unknown and it is difficult to detect convergence. In practice, we run the sampler for a hopefully sufficient number of iterations and perform inference based on the final sampling state (assignments of all z and s variables) and/or a set of intermediate sampling states.
Given a sequence or chain of sampling states S1, … , Sn, we can predict a value for P(w|c) or P(c,w) using these equations and the set of latent variable assignments at a single state Si. As the sampler is initialized randomly and will take time to find a good area of the search space, it is standard to wait until a number of iterations have passed before using any samples for prediction. States S1, … , Sb from this burn-in period are discarded.
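One common form of the per-state estimate for an LDA-style model, with averaging over the post-burn-in states (a sketch in standard collapsed-count notation; it need not match the article's exact equation numbering), is:

```latex
\hat{P}_{S_i}(w \mid c) \;=\; \sum_{z \in Z}
  \frac{n_{zw} + \beta}{n_{z} + |W|\,\beta}\;
  \frac{n_{cz} + \alpha_z}{n_{c} + \sum_{z'} \alpha_{z'}},
\qquad
\hat{P}(w \mid c) \;\approx\; \frac{1}{n-b} \sum_{i=b+1}^{n} \hat{P}_{S_i}(w \mid c)
```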
3.4.3 Choosing |Z|
In the “parametric” latent variable models used here the number of topics or semantic classes, |Z|, must be fixed in advance. This brings significant efficiency advantages but also the problem of choosing an appropriate value for |Z|. The more classes a model has, the greater its capacity to capture fine distinctions between entities. However, this finer granularity inevitably comes at a cost of reduced generalization. One approach is to choose a value that works well on training or development data before evaluating held-out test items. Results in lexical semantics are often reported over the entirety of a data set, meaning that if we wish to compare those results we cannot hold out any portion. If the method is relatively insensitive to the parameter it may be sufficient to choose a default value. Rooth et al. (1999) suggest cross-validating on the training data likelihood (and not on the ultimate evaluation measure). An alternative solution is to average the predictions of models trained with different choices of |Z|; this avoids the need to pick a default and can give better results than any one value as it integrates contributions at different levels of granularity. As mentioned in Section 3.4.2 we must take care when averaging predictions to compute with quantities that do not rely on topic identity—for example, estimates of P(a|p) can safely be combined whereas estimates of P(z1|p) cannot.
3.4.4 Hyperparameter Estimation
Although the likelihood parameters can be integrated out, the parameters for the Dirichlet and Beta priors (often referred to as “hyperparameters”) cannot and must be specified either manually or automatically. The value of these parameters affects the sparsity of the learned posterior distributions. Furthermore, the use of an asymmetric prior (where not all its parameters have equal value) implements an assumption that some observation values are more likely than others before any observations have been made. Wallach, Mimno, and McCallum (2009) demonstrate that the parameterization of the Dirichlet priors in an LDA model has a material effect on performance, recommending in conclusion a symmetric prior on the “emission” likelihood P(w|z) and an asymmetric prior on the document topic likelihoods P(z|d). In this article we follow these recommendations and, like Wallach, Mimno, and McCallum, we optimize the relevant hyperparameters using a fixed point iteration to maximize the log evidence (Minka 2003; Wallach 2008).
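As a sketch of this step, the following applies a Minka-style fixed-point update to an asymmetric Dirichlet prior over predicate–class distributions; the count matrix n_cz (observations of predicate c assigned to class k) and the iteration count are placeholders, and alpha is a positive vector of initial hyperparameter values.

```python
import numpy as np
from scipy.special import digamma

def optimize_dirichlet(alpha, n_cz, iters=100):
    """Fixed-point updates for asymmetric Dirichlet hyperparameters alpha given
    per-predicate class counts n_cz[c, k], ascending the log evidence."""
    n_c = n_cz.sum(axis=1)            # total observations per predicate
    num_predicates = n_cz.shape[0]
    for _ in range(iters):
        alpha0 = alpha.sum()
        numer = digamma(n_cz + alpha).sum(axis=0) - num_predicates * digamma(alpha)
        denom = (digamma(n_c + alpha0) - digamma(alpha0)).sum()
        alpha = alpha * numer / denom
    return alpha
```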
3.5 Measuring Similarity in Context with Latent-Variable Models
The representation induced by latent variable selectional preference models also allows us to capture the disambiguatory effect of context. Given an observation of a word in a context, we can infer the most probable semantic classes to appear in that context and we can also infer the probability that a class generated the observed word. We can also estimate the probability that the semantic classes suggested by the observation would have licensed an alternative word. Taken together, these can be used to estimate in-context semantic similarity. The fundamental intuitions are similar to those behind the vector-space models in Section 2.3.2, but once again we are viewing them from the perspective of probabilistic modeling.
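A minimal sketch of this intuition under our own simplified notation (not the article's exact equations): infer the class posterior for the observed word in its context as P(z|o,c) ∝ P(z|c) P(o|z), then score a substitute by how strongly those classes generate it.

```python
import numpy as np

def class_posterior(p_z_given_c, p_w_given_z, observed):
    """P(z | observed word, context) ∝ P(z | c) * P(observed | z)."""
    post = p_z_given_c * p_w_given_z[:, observed]
    return post / post.sum()

def in_context_score(p_z_given_c, p_w_given_z, observed, substitute):
    """How strongly the classes suggested by the observation license the substitute."""
    post = class_posterior(p_z_given_c, p_w_given_z, observed)
    return float(np.dot(post, p_w_given_z[:, substitute]))

# Toy model: 3 classes over a 4-word vocabulary (all numbers invented).
p_w_given_z = np.array([[0.70, 0.20, 0.05, 0.05],
                        [0.10, 0.10, 0.40, 0.40],
                        [0.25, 0.25, 0.25, 0.25]])
p_z_given_c = np.array([0.8, 0.1, 0.1])  # class distribution for the context

print(in_context_score(p_z_given_c, p_w_given_z, observed=0, substitute=1))
```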
3.6 Related Work
As related earlier, non-Bayesian mixture or latent-variable approaches to co-occurrence modeling were proposed by Pereira, Tishby, and Lee (1993) and Rooth et al. (1999). Blitzer, Globerson, and Pereira (2005) describe a co-occurrence model based on a different kind of distributed latent-variable architecture similar to that used in the literature on neural language models. Brody and Lapata (2009) use the clustering effects of LDA to perform word sense induction. Vlachos, Korhonen, and Ghahramani (2009) use non-parametric Bayesian methods to cluster verbs according to their co-occurrences with subcategorization frames. Reisinger and Mooney (2010, 2011) have also investigated Bayesian methods for lexical semantics in a spirit similar to that adopted here. Reisinger and Mooney (2010) describe a “tiered clustering” model that, like Lex-LDA, mixes a cluster-based preference model with a predicate-specific distribution over words; however, their model does not encourage sharing of classes between different predicates. Reisinger and Mooney (2011) propose a very interesting variant of the latent-variable approach in which different kinds of contextual behavior can be explained by different “views,” each of which has its own distribution over latent variables; this model can give more interpretable classes than LDA for higher settings of |Z|.
Some extensions of the LDA topic model incorporate local as well as document context to explain lexical choice. Griffiths et al. (2004) combine LDA and a hidden Markov model (HMM) in a single model structure, allowing each word to be drawn from either the document's topic distribution or a latent HMM state conditioned on the preceding word's state; Moon, Erk, and Baldridge (2010) show that combining HMM and LDA components can improve unsupervised part-of-speech induction. Wallach (2006) also seeks to capture the influence of the preceding word, while at the same time generating every word from inside the LDA model; this is achieved by conditioning the distribution over words on the preceding word type as well as on the chosen topic. Boyd-Graber and Blei (2008) propose a “syntactic topic model” that makes topic selection conditional on both the document's topic distribution and on the topic of the word's parent in a dependency tree. Although these models do represent a form of local context, they either use a very restrictive one-word window or a notion of syntax that ignores lexical or dependency-label effects; for example, knowing that the head of a noun is a verb is far less informative than knowing that the noun is the direct object of eat.
More generally, there is a connection between the models developed here and latent-variable models used for parsing (e.g., Petrov et al. 2006). In such models each latent state corresponds to a “splitting” of a part-of-speech label so as to produce a finer-grained grammar and tease out intricacies of word–rule “co-occurrence.” Finkel, Grenager, and Manning (2007) and Liang et al. (2007) propose a non-parametric Bayesian treatment of state splitting. This is very similar to the motivation behind an LDA-style selectional preference model. One difference is that the parsing model must explain the parse tree structure as well as the choice of lexical items; another is that in the selectional preference models described here each head–dependent relation is treated as an independent observation (though this could be changed). These differences allow our selectional preference models to be trained efficiently on large corpora and, by focusing on lexical choice rather than syntax, to home in on purely semantic information. Titov and Klementiev (2011) extend the idea of latent-variable distributional modeling to do “unsupervised semantic parsing” and reason about classes of semantically similar lexicalized syntactic fragments.
4. Experiments
4.1 Training Corpora
In our experiments we use two training corpora:
BNC the written component of the British National Corpus,9 comprising around 90 million words. The corpus was tagged for part of speech, lemmatized, and parsed with the RASP toolkit (Briscoe, Carroll, and Watson 2006).
WIKI a Wikipedia dump of over 45 million sentences (almost 1 billion words) tagged, lemmatized, and parsed with the C&C toolkit10 and the fast CCG parser described by Clark et al. (2009).
In order to train our selectional preference models, we extracted word–context observations from the parsed corpora. Prior to extraction, the dependency graph for each sentence was transformed using the preprocessing steps illustrated in Figure 4. We then filtered for semantically discriminative information by ignoring all words with part of speech other than common noun, verb, adjective, and adverb. We also ignored instances of the verbs be and have and discarded all words containing non-alphabetic characters and all words with fewer than three characters.11
As mentioned in Section 2.1, the distributional semantics framework admits flexibility in how the practitioner defines the context of a word w. We investigate two possibilities in this article:
Syn The context of w is determined by the syntactic relations r and words w′ incident to it in the sentence's parse tree, as illustrated in Section 3.1.
Win5 The context of w is determined by the words appearing within a window of five words on either side of it. There are no relation labels, so there is essentially just one relation r to consider.
Training topic models on a data set with very large “documents” leads to tractability issues. The window-based approach is particularly susceptible to an explosion in the number of extracted contexts, as each token in the data can contribute 2×W word–context observations, where W is the window size. We reduced the data by applying a simple downsampling technique to the training corpora. For the WIKI/Syn corpus, all word–context counts were divided by 5 and rounded to the nearest integer. For the WIKI/Win5 corpus we divided all counts by 70; this number was suggested by Dinu and Lapata (2010), who used the same ratio for downsampling the similarly sized English Gigaword Corpus. Being an order of magnitude smaller, the BNC required less pruning; we divided all counts in the BNC/Win5 by 5 and left the BNC/Syn corpus unaltered. Type/token statistics for the resulting sets of observations are given in Table 4.
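For concreteness, the downsampling step simply divides each word–context count by a corpus-specific ratio and rounds to the nearest integer; a trivial sketch (using the ratio of 5 from the WIKI/Syn setting, and dropping pairs whose count rounds to zero) follows.

```python
def downsample(counts, ratio):
    """Divide co-occurrence counts by a ratio, round to the nearest integer,
    and drop pairs whose count rounds to zero."""
    return {pair: round(count / ratio)
            for pair, count in counts.items()
            if round(count / ratio) > 0}

print(downsample({("drink", "beer"): 12, ("drink", "nectar"): 2}, ratio=5))
# {('drink', 'beer'): 2}
```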
4.2 Evaluating Selectional Preference Models
Various approaches have been suggested in the literature for evaluating selectional preference models. One popular method is “pseudo-disambiguation,” in which a system must distinguish between actually occurring and randomly generated predicate–argument combinations (Pereira, Tishby, and Lee 1993; Chambers and Jurafsky 2010). In a similar vein, probabilistic topic models are often evaluated by measuring the probability they assign to held-out data; held-out likelihood has also been used for evaluation in a task involving selectional preferences (Schulte im Walde et al. 2008). These two approaches take a “language modeling” approach in which model quality is identified with the ability to predict the distribution of co-occurrences in unseen text. Although this metric should certainly correlate with the semantic quality of the model, it may also be affected by frequency and other idiosyncratic aspects of language use unless tightly controlled. In the context of document topic modeling, Chang et al. (2009) find that a model can have better predictive performance on held-out data while inducing topics that human subjects judge to be less semantically coherent.
In this article we choose to evaluate models by comparing system predictions with semantic judgments elicited from human subjects. These judgments take various forms. In Section 4.3 we use judgments of how plausible it is that a given predicate takes a given word as its argument. In Section 4.4 we use judgments of similarity between pairs of predicate–argument combinations. In Section 4.5 we use judgments of substitutability for a target word as disambiguated by its sentential context. Taken together, these different experimental designs provide a multifaceted analysis of model quality.
4.3 Predicate–Argument Plausibility
4.3.1 Data
For the plausibility-based evaluation we use a data set of human judgments collected by Keller and Lapata (2003). This comprises data for three grammatical relations: verb–object, adjective–noun, and noun–noun modification. For each relation, 30 predicates were selected; each predicate was paired with three noun arguments from different predicate–argument frequency bands in the BNC as well as three noun arguments that were not observed for that predicate in the BNC. In this way two subsets (Seen and Unseen) of 90 items each were assembled for each relation. Human plausibility judgments were elicited from a large number of subjects using a Magnitude Estimation procedure; these numerical judgments were then normalized, log-transformed, and averaged.
Following Keller and Lapata (2003), we evaluate our models by measuring the correlation between system predictions and the human judgments. Keller and Lapata use Pearson's correlation coefficient r; we additionally use Spearman's rank correlation coefficient ρ for a non-parametric evaluation. Each system prediction is log-transformed before calculating the correlation to improve the linear fit to the gold standard.
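The correlation evaluation can be reproduced with standard routines; the sketch below uses invented numbers in place of real gold judgments and model predictions, log-transforming the predictions before the Pearson computation as described above.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

gold = np.array([5.1, 3.2, 4.4, 1.0, 2.7])                   # mean human judgments
predictions = np.array([0.02, 0.003, 0.01, 0.0001, 0.002])   # model plausibilities

log_predictions = np.log(predictions)       # log-transform before the Pearson fit
r, _ = pearsonr(gold, log_predictions)
rho, _ = spearmanr(gold, predictions)        # rank correlation is unaffected by the log
print(r, rho)
```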
4.3.2 Methods
We evaluate the LDA, Rooth-LDA, and Lex-LDA latent-variable preference models, trained on predicate–argument pairs (c,w) extracted from the BNC. We use a default setting |Z| = 100 for the number of classes; in our experiments we have observed that our Bayesian models are relatively robust to the choice of |Z|. We average predictions of the joint probability P(c,w) over three independent samples, each of which is obtained by sampling P(c,w) every 50 iterations after a burn-in period of 200 iterations. Rooth-LDA gives joint probabilities by definition (25), but LDA and Lex-LDA are defined in terms of conditional probabilities (24). There are two options for training these models:
P → A: Model the distribution P(w|c) over arguments for each predicate.
A → P: Model the distribution P(c|w) over predicates for each argument.
For comparison, we report the performance figures given by Keller and Lapata for their search-engine method using AltaVista and Google12 as well as a number of alternative methods that we have reimplemented and trained on identical data:
BNC (MLE) A maximum-likelihood estimate proportional to the co-occurrence frequency f(c,w) in the parsed BNC.
BNC (KN) BNC relative frequencies smoothed with modified Kneser-Ney (Chen and Goodman 1999).
Resnik The WordNet-based association strength of Resnik (1993). We used WordNet version 2.1 as the method requires multiple roots in the hierarchy for good performance.
Clark/Weir The WordNet-based method of Clark and Weir (2002), using WordNet 3.0. This method requires that a significance threshold α and significance test be chosen; we investigated a variety of settings and report performance for α = 0.9 and Pearson's χ2 test, as this combination consistently gave the best results.
Rooth-EM Rooth et al. (1999)'s latent-variable model without priors, trained with EM. As for the Bayesian models, we average the predictions over three runs. This method is very sensitive to the number of classes; as proposed by Rooth et al., we choose the number of classes from the range (20, 25, … , 50) through 5-fold cross-validation on a held-out log-likelihood measure.
EPP The vector-space method of Erk, Padó, and Padó (2010), as described in Section 2.2.3. We used the cosine similarity measure for smoothing as it performed well in Erk, Padó, & Padó's experiments.
Disc A discriminative model inspired by Bergsma, Lin, and Goebel (2008) (see Section 2.2.4). In order to get true probabilistic predictions, we used a logistic regression classifier with L1 regularization rather than a Support Vector Machine.13 We train one classifier per predicate in the Keller and Lapata data set. Following Bergsma, Lin, and Goebel, we generate pseudonegative instances for each predicate by sampling noun arguments that either do not co-occur with it or have a negative PMI association. Again following Bergsma, Lin, and Goebel, we use a ratio of two pseudonegative instances for each positive instance and require pseudonegative arguments to be in the same frequency quintile as the matched observed argument. The features used for each data instance, corresponding to an argument, are: the conditional probability of the argument co-occurring with each predicate in the training data; and string-based features capturing the length and initial and final character n-grams of the argument word.14 We also investigate whether our LDA model can be used to provide additional features for the discriminative model, by giving the index of the most probable class argmaxz P(z|c,w); results for this system are labeled Disc+LDA.
In order to test statistical significance of performance differences we use a test for correlated correlation coefficients proposed by Meng, Rosenthal, and Rubin (1992). This is more appropriate than a standard test for independent correlation coefficients as it takes into account the strength of correlation between two sets of system outputs as well as each output's correlation with the gold standard. Essentially, if the two sets of system outputs are correlated there is less chance that their difference will be deemed significant. As we have no a priori reason to believe that one model will perform better than another, all tests are two-tailed.
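The following sketch implements our reading of the Meng, Rosenthal, and Rubin (1992) test: both correlations with the gold standard are Fisher-transformed, and their difference is scaled by a factor that shrinks as the two systems' outputs become more correlated. The example figures are invented.

```python
import math
from scipy.stats import norm

def meng_test(r1, r2, r12, n):
    """Two-tailed test for the difference between two correlations with a common
    gold standard. r1, r2: each system's correlation with the gold standard;
    r12: correlation between the two systems' outputs; n: number of items."""
    z1, z2 = math.atanh(r1), math.atanh(r2)      # Fisher z-transform
    rbar2 = (r1 ** 2 + r2 ** 2) / 2
    f = min((1 - r12) / (2 * (1 - rbar2)), 1.0)
    h = (1 - f * rbar2) / (1 - rbar2)
    z = (z1 - z2) * math.sqrt((n - 3) / (2 * (1 - r12) * h))
    return 2 * (1 - norm.cdf(abs(z)))            # two-tailed p-value

print(meng_test(r1=0.55, r2=0.48, r12=0.70, n=90))
```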
4.3.3 Results
Results on the Keller and Lapata (2003) plausibility data set are presented in Table 5.15 For common combinations (the Seen data) it is clear that relative corpus frequency is a reliable indicator of plausibility, especially when Web-scale resources are available. The BNC MLE estimate outperforms the best selectional preference model on three out of six Seen evaluations, and the AltaVista and Google estimates from Keller and Lapata (2003) outperform the best selectional preference model on every applicable Seen evaluation. For the rarer Unseen combinations, however, MLE estimates are not sufficient and the latent-variable selectional preference models frequently outperform even the Web-based predictions. The results for BNC(KN) improve on the MLE estimates for the Unseen data but do not match the models that have a semantic component.
It is clear from Table 5 that the new Bayesian latent-variable models outperform the previously proposed selectional preference models under almost every evaluation. Among the latent-variable models there is no one clear winner, and small differences in performance are as likely to arise through random sampling variation as through qualitative differences between models. That said, Rooth-LDA and Lex-LDA do score higher than LDA in a majority of cases. As expected, the bidirectional P ↔ A models tend to perform at around the midpoint of the P → A and A → P models, though they can also exceed both; this suggests that they are a good choice when there is no intuitive reason to choose one direction over the other.
Table 6 aggregates comparisons for all combinations of the six data sets and two evaluation measures. As before, all the Bayesian latent-variable models achieve a roughly similar level of performance, consistently outperforming the models selected from the literature and frequently reaching statistical significance (p < 0.05). These results confirm that LDA-style models can be considered the current state of the art for selectional preference modeling.
4.4 Predicate–Argument Similarity
4.4.1 Data
Mitchell and Lapata (2008, 2010) collected human judgments of similarity between pairs of predicates and arguments corresponding to minimal sentences. Mitchell and Lapata's explicit aim was to facilitate evaluation of general semantic compositionality models but their data sets are also suitable for evaluating predicate–argument representations.
Mitchell and Lapata (2008) used the BNC to extract 4 attested subject nouns for each of 15 verbs, yielding 60 reference combinations. Each verb–noun tuple was matched with two verbs that are synonyms of the reference verb in some contexts but not in others. In this way, Mitchell and Lapata created a data set of 120 pairs of predicate–argument combinations. Similarity judgments were obtained from human subjects for each pair on a Likert scale of 1–7. Examples of the resulting data items are given in Table 8. Mitchell and Lapata use six subjects' ratings as a development data set for setting model parameters and the remaining 54 subjects' ratings for testing. In this article we use the same split.
shoulder slump / shoulder slouch | 6, 7, 5, 5, 6, 5, 5, 7, 5, 5, 7, 5, 6, 6, 5, 6, 6, 6, 7, 5, 7, 6, 6, 5, 5, 5, 5, 6, 6, 7, 7, 7, 7, 7
shoulder slump / shoulder decline | 2, 5, 4, 4, 3, 3, 2, 3, 2, 1, 3, 3, 6, 5, 3, 2, 1, 1, 1, 7, 4, 4, 6, 3, 5, 6
Mitchell and Lapata (2010) adopt a similar approach to data collection with the difference that instead of keeping arguments constant across combinations in a pair, both predicates and arguments vary across comparand combinations. They also consider a range of grammatical relations: verb–object, adjective–noun, and noun–noun modification. Human subjects rated similarity between predicate–argument combinations on a 1–7 scale as before; examples are given in Table 9. Inspection of the data suggests that the subjects' annotation may conflate semantic similarity and relatedness; for example, football club and league match are often given a high similarity score. Mitchell and Lapata again split the data into development and testing sections, the former comprising 54 subjects' ratings and the latter comprising 108 subjects' ratings.
stress importance / emphasize need | 6, 7, 7, 5, 5, 7, 7, 7, 6, 5, 6, 7, 3, 7, 7, 6, 7, 7
ask man / stretch arm | 3, 1, 4, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1
football club / league match | 7, 6, 7, 6, 6, 5, 5, 3, 6, 6, 4, 5, 4, 6, 2, 7, 5, 5
education course / training program | 7, 7, 5, 5, 7, 5, 5, 7, 7, 4, 6, 2, 5, 6, 6, 7, 7, 4
4.4.2 Models
For the Mitchell and Lapata (2008) data set we train the following models on the BNC corpus:
LDA An LDA selectional preference model of verb–subject co-occurrence with similarity computed as described in Section 3.5. Similarity predictions sim(n,o|c) are averaged over five runs. We consider three models of context–target interaction, which in this case corresponds to verb–subject interaction:
LDAC→T Target generation is conditioned on the context, as in equation (53).
LDAT→C Context generation is conditioned on the target, as in equation (56).
LDAC↔T An average of the predictions made by LDAC→T and LDAT→C.
Mult Pointwise multiplication (6) using Win5 co-occurrences.
M+L08 The best-performing system of Mitchell and Lapata (2008), combining an additive and a multiplicative model and using window-based co-occurrences.
SVS The best-performing system of Erk and Padó (2008); the Structured Vector Space model (8), parameterized to use window-based co-occurrences and raising the expectation vector values (7) to the 20th power (this parameter was optimized on the development data).
For the Mitchell and Lapata (2010) data set we train the following models, again on the BNC corpus:
Rooth-LDA/Syn A Rooth-LDA model trained on the appropriate set of syntactic co-occurrences (verb–object, noun–noun modification, or adjective–noun), with the topic distribution calculated as in Equation (59).
LDA/Win5 An LDA model trained on the Win5 window-based co-occurrences. Because all observations are modeled using the same latent classes, the distributions P(z|o,c) (Equation (53)) for each word in the pair can be combined by taking a normalized product.
Combined This model averages the similarity prediction of the Rooth-LDA/Syn and LDA/Win5 models.
Mult Pointwise multiplication (6) using Win5 co-occurrences.
M+L10/Mult A multiplicative model (6) using a vector space based on window co-occurrences in the BNC.
M+L10/Best The best result for each grammatical relation from any of the semantic spaces and combination methods tested by Mitchell and Lapata. Some of these methods require parameters to be set through optimization on the development set.18
4.4.3 Results
Results for the Mitchell and Lapata (2008) data set are presented in Tables 10 and 11.19 The LDA preference models clearly outperform the previous state of the art of ρcat = 0.27 (Erk and Padó 2008), with the best simple average of predictors scoring ρcat = 0.38, ρave = 0.41, and the best optimized combination scoring ρcat = 0.39, ρave = 0.41. This is comparable to the average level of agreement between human judges, estimated by Mitchell and Lapata to be ρave = 0.40. Optimizing on the development data consistently gave better performance than averaging over all predictors, though in most cases the differences are small.
Results for the Mitchell and Lapata (2010) data set are presented in Tables 12 and 13.20 Again the latent-variable models perform well, comfortably outperforming the Mult baseline, and with just one exception the Combined models surpass Mitchell and Lapata's reported results. Combining the syntactic co-occurrence model Rooth-LDA/Syn and the window-based model LDA/Win5 consistently gives the best performance, suggesting that the human ratings in this data set are sensitive to both strict similarity and a looser sense of relatedness. As Turney (2012) observes, the average-ρcat-per-group approach of Mitchell and Lapata leads to lower performance figures than averaging across annotators; with the latter approach (Table 12) the ρave correlation values approach the level of human interannotator agreement for two of the three relations: noun–noun and adjective–noun modification.
4.5 Lexical Substitution
4.5.1 Data
The data set for the English Lexical Substitution Task (McCarthy and Navigli 2009) consists of 2,010 sentences sourced from Web pages. Each sentence features one of 205 distinct target words that may be nouns, verbs, adjectives, or adverbs. The sentences have been annotated by human judges to suggest semantically acceptable substitutes for their target words. Table 14 gives example sentences and annotations for the target verb charge. For the original shared task the data was divided into development and test sections; in this article we follow subsequent work using parameter-free models and use the whole data set for testing.
Commission is the amount charged to execute a trade. | levy (2), impose (1), take (1), demand (1)
Annual fees are charged on a pro-rata basis to correspond with the standardized renewal date in December. | levy (2), require (1), impose (1), demand (1)
Meanwhile, George begins obsessive plans for his funeral…George, suspicious, charges to her room to confront them. | run (2), rush (2), storm (1), dash (1)
Realizing immediately that strangers have come, the animals charge them and the horses began to fight. | attack (5), rush at (1)
The gold standard substitute annotations contain a number of multiword terms such as rush at and generate electricity. As it is impossible for a standard lexical distributional model to reason about such terms, we remove these substitutes from the gold standard.21 We remove entirely the 17 sentences that have only multiword substitutes in the gold standard, as well as 7 sentences for which no gold annotations are provided. This leaves 1,986 sentences.
In this article we report both τb and GAP scores, calculated individually for each sentence and averaged. The open-vocabulary design of the original Lexical Substitution Task facilitated the use of other evaluation measures such as “precision out of ten”: the proportion of the first 10 words in a system's ranked substitute list that are contained in the gold standard annotation for that sentence. This measure is not appropriate in the constrained-vocabulary scenario considered here; when there are fewer than 10 candidate substitutes for a target word, the precision will always be 1.
4.5.2 Models
We apply both window-based and syntactic models of similarity in context to the lexical substitution data set. We expect the latter to give more accurate predictions but to have incomplete coverage when a test sentence is not fully and correctly parsed or when the test lexical items were not seen in the appropriate contexts during training. We therefore also average the predictions of the two model types in the hope of attaining superior performance with full coverage.
The models we train on the BNC and combined BNC + WIKI corpora are as follows:
Win5 An LDA model using 5-word-window contexts (so |C| ≤ 10) and similarity P(z|o,C) computed according to Equation (54).
C → T An LDA model using syntactic co-occurrences with similarity computed according to Equation (54).
T → C An LDA model using syntactic co-occurrences with similarity computed according to Equation (57).
T ↔ C A model averaging the predictions of the C → T and T → C models.
Win5 + C → T, Win5 + T → C, Win5 + T ↔ C Models averaging the predictions of Win5 and the corresponding syntactic model.
TFP11 Our reimplementation of the vector-space model of Thater, Fürstenau, and Pinkal (2011). We report figures with and without backoff to lexical similarity between target and substitute words when no syntax-based prediction is available.
We also consider two baseline LDA models:
No Context A model that ranks substitutes n by computing the Bhattacharyya similarity between their topic distributions P(z|n) and the target word's topic distribution P(z|o); a sketch of this ranking appears after this list.
No Similarity A model that ranks substitutes n by their context-conditioned probability P(n|C) only; this is essentially a language-modeling approach using syntactic “bigrams.”
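To make the No Context baseline and the combined models concrete, the following sketch shows Bhattacharyya-based ranking and simple score averaging; all names are ours, and backing off to the window-based score when the syntactic model makes no prediction is our assumption about how full coverage is retained.

import numpy as np

def bhattacharyya(p, q):
    # Bhattacharyya coefficient between two discrete topic distributions.
    return float(np.sum(np.sqrt(np.asarray(p) * np.asarray(q))))

def rank_no_context(target_topics, candidate_topics):
    # No Context baseline: rank each substitute n by the similarity of its
    # topic distribution P(z|n) to the target's P(z|o), ignoring the sentence.
    scores = {n: bhattacharyya(target_topics, p_z_n)
              for n, p_z_n in candidate_topics.items()}
    return sorted(scores, key=scores.get, reverse=True)

def combine(scores_window, scores_syntactic):
    # Combined models (e.g., Win5 + T <-> C): average the two models' scores,
    # keeping the window-based score alone when the syntactic model is silent.
    combined = {}
    for n, s in scores_window.items():
        combined[n] = 0.5 * (s + scores_syntactic[n]) if n in scores_syntactic else s
    return combined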
4.5.3 Results
Table 15 presents results on the Lexical Substitution Task data set. As expected, the window-based LDA models attain good coverage but worse performance than the syntactic models. The combined model Win5 + T ↔ C trained on BNC+WIKI gives the best scores (GAP = 49.5, τ_b = 0.23). Every combined model gives a statistically significant improvement (p < 0.01) over the corresponding window-based Win5 model. Our TFP11 reimplementation of Thater, Fürstenau, and Pinkal (2011) has slightly less than complete coverage and performs worse than almost all combined LDA models. To compute statistical significance we use only the sentences for which TFP11 made predictions; for both the BNC and BNC+WIKI corpora, the Win5 + T ↔ C model gives a statistically significant (p < 0.05) improvement over TFP11 for both GAP and τ_b, while Win5 + T → C gives a significant improvement for GAP and τ_b on the BNC training corpus. The no-context and no-similarity baselines are clearly worse than the full models; the difference is statistically significant (p < 0.01) for both training corpora and all models.
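The paired comparisons reported above can be reproduced with an approximate randomization test of the kind provided by the sigf package cited in the notes; the sketch below is our own, and the number of shuffles and the two-sided p-value computation are assumptions.

import random

def approximate_randomization(scores_a, scores_b, trials=10000, seed=0):
    # Paired approximate randomization test on per-sentence scores (e.g., GAP),
    # restricted to the sentences covered by both systems.
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n
    extreme = 0
    for _ in range(trials):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:  # randomly swap the pair's labels
                a, b = b, a
            diff += a - b
        if abs(diff) / n >= observed:
            extreme += 1
    return (extreme + 1) / (trials + 1)  # two-sided p-value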
Table 16 breaks performance down across the four parts of speech in the data set. Verbs appear to present the most difficult substitution questions and also show the greatest benefit from adding syntactic disambiguation to the basic Win5 model. The full Win5 + T ↔ C model outperforms our reimplementation of Thater, Fürstenau, and Pinkal (2011) on all parts of speech for the GAP statistic and on verbs and adjectives for τ_b, tying on nouns and adverbs. Table 16 also lists results reported by Dinu and Lapata (2010) and Thater, Fürstenau, and Pinkal (2010, 2011) for their models trained on the English Gigaword Corpus. This corpus is of comparable size to the BNC+WIKI corpus, but the results reported by Thater, Fürstenau, and Pinkal (2011) are better than those attained by our reimplementation, suggesting that uncontrolled factors such as the choice of corpus, parser, or dependency representation account for the difference. Thater, Fürstenau, and Pinkal's (2011) results remain the best reported for this data set; in this uncontrolled setting our Win5 + T ↔ C results are better than those of Dinu and Lapata (2010) and Thater, Fürstenau, and Pinkal (2010).
5. Conclusion
In this article we have shown that the probabilistic latent-variable framework provides a flexible and effective toolbox for distributional modeling of lexical meaning and gives state-of-the-art results on a number of semantic prediction tasks. One useful feature of this framework is that it induces a representation of semantic classes at the same time as it learns about selectional preference distributions. This can be viewed as a kind of coarse-grained sense induction or as a kind of concept induction. We have demonstrated that reasoning about these classes leads to an accurate method for calculating semantic similarity in context. By applying our models we attain state-of-the-art performance on a range of evaluations involving plausibility prediction, in-context similarity, and lexical substitution. The three models we have investigated—LDA, Rooth-LDA and Lex-LDA—all perform at a similar level for predicting plausibility, but in other cases the representation induced by one model may be more suitable than the others.
In future work, we anticipate that the same intuitions may lead to similarly accurate methods for other tasks where disambiguation is required; an obvious candidate is traditional word sense disambiguation, perhaps in combination with the probabilistic WordNet-based preference models of Ó Séaghdha and Korhonen (2012). More generally, we expect that latent-variable models will prove useful in applications where other selectional preference models have been applied, for example, metaphor interpretation and semantic role labeling.
A second route for future work is to enrich the semantic representations that are learned by the model. As previously mentioned, probabilistic generative models are modular in the sense that they can be integrated in larger models. Bayesian methods for learning tree structures could be applied to learn taxonomies of semantic classes (Blei, Griffiths, and Jordan 2010; Blundell, Teh, and Heller 2010). Borrowing ideas from Bayesian hierarchical language modeling (Teh 2006), one could build a model of selectional preference and disambiguation in the context of arbitrarily long dependency paths, relaxing our current assumption that only the immediate neighbors of a target word affect its meaning. Our class-based preference model also suggests an approach to identifying regular polysemy alternation by finding class co-occurrences that repeat across words, offering a fully data-driven alternative to polysemy models based on WordNet (Boleda, Padó, and Utt 2012). In principle, any structure that can be reasoned about probabilistically, from syntax trees to coreference chains or semantic relations, can be coupled with a selectional preference model to incorporate disambiguation or lexical smoothing in a task-oriented architecture.
Acknowledgments
The work in this article was funded by the EPSRC (grant EP/G051070/1) and by the Royal Society. We are grateful to Frank Keller and Mirella Lapata for sharing their data set of plausibility judgments; to Georgiana Dinu, Karl Moritz Hermann, Jeff Mitchell, Sebastian Padó, and Andreas Vlachos for offering information and advice; and to the anonymous Computational Linguistics reviewers, whose suggestions have substantially improved the quality of this article.
Notes
The analogous example given by Ó Séaghdha (2010) relates to the plausibility of a manservant or a carrot laughing; Google no longer returns zero hits for <a|the manservant|manservants|menservants laugh|laughs|laughed> but a frequency-based estimate still puts the probability of a carrot laughing at roughly 20 times that of a manservant laughing (1,680 hits against 81 hits).
WordNet also contains many other kinds of semantic relations besides hypernymy but these are not typically used for selectional preference modeling.
When specifically discussing selectional preferences, we will also use the terms predicate and argument to describe a co-occurrence pair; when restricted to syntactic predicates, the former term is synonymous with our definition of context.
Strictly speaking, w and w_h are drawn from subsets of the vocabulary that are licensed by r when r is a syntactic relation, that is, they must have parts of speech p_d and p_h, respectively. Our models assume a fixed argument vocabulary, so we can partition the training data according to part of speech; the models are agnostic regarding the predicate vocabulary as these are subsumed by the context vocabulary. In the interest of parsimony we leave this detail implicit in our notation.
However, the second point is often relaxed in application contexts where the posterior mean is used for inference (e.g., Section 3.4.2).
In the notation of Section 3.4, this estimate is given by .
An exception was made for the word PC as it appears in the Keller and Lapata (2003) data set used for evaluation.
Keller and Lapata only report Pearson's r correlations; as we do not have their per-item predictions we cannot calculate Spearman's ρ correlations or statistical significance scores.
We used the logistic regression implementation provided by LIBLINEAR (Fan et al. 2008), available at http://www.csie.ntu.edu.tw/~cjlin/liblinear.
Bergsma, Lin, and Goebel (2008) also use features extracted from gazetteers. However, they observe that additional features only give a small improvement over co-occurrence features alone. We do not use such features here but hypothesize that the improvement would be even smaller in our experiments as the data do not contain proper nouns.
Results for LDA P → A and Rooth-LDA were previously published in Ó Séaghdha (2010).
We do not compare against the system of Turney (2012) as Turney uses a different experimental design based on partitioning by phrases rather than annotators.
In practice the sequence of items is not the same for every annotator and the sequence of predictions s must be changed accordingly.
Ultimately, however, none of the combination methods that require parameter optimization outperform the parameter-free methods in Mitchell and Lapata's results.
These results were not previously published.
An LDA model cannot make an informative prediction of P(z|o,C) if word o was never seen entering into at least one (unlexicalized) syntactic relation in C. Other syntactic models such as that of Thater, Fürstenau, and Pinkal (2011) face analogous restrictions.
We use the software package provided by Sebastian Padó at http://www.nlpado.de/~sebastian/sigf.html.
Results for the LDA models were reported in Ó Séaghdha and Korhonen (2011).
References
Author notes
15 JJ Thomson Avenue, Cambridge, CB3 0FD, United Kingdom. E-mail: Diarmuid.O'[email protected].