## Abstract

We describe a probabilistic framework for acquiring selectional preferences of linguistic predicates and for using the acquired representations to model the effects of context on word meaning. Our framework uses Bayesian latent-variable models inspired by, and extending, the well-known Latent Dirichlet Allocation (LDA) model of topical structure in documents; when applied to predicate–argument data, topic models automatically induce semantic classes of arguments and assign each predicate a distribution over those classes. We consider LDA and a number of extensions to the model and evaluate them on a variety of semantic prediction tasks, demonstrating that our approach attains state-of-the-art performance. More generally, we argue that probabilistic methods provide an effective and flexible methodology for distributional semantics.

## 1. Introduction

Computational models of **lexical semantics** attempt to represent aspects of word meaning. For example, a model of the meaning of *dog* may capture the facts that dogs are animals, that they bark and chase cats, that they are often kept as pets, and so on. Word meaning is a fundamental component of the way language works: Sentences (and larger structures) consist of words, and their meaning is derived in part from the contributions of their constituent words' lexical meanings. At the same time, words instantiate a mapping between conceptual “world knowledge” and knowledge of language.

The relationship between the meanings of an individual word and the larger linguistic structure in which it appears is not unidirectional; while the word contributes to the meaning of the structure, the structure also clarifies the meaning of the word. Taken on its own a word may be vague or ambiguous, in the senses of Zwicky and Sadock (1975); even when the word's meaning is relatively clear it may still admit specification of additional details that affect its interpretation (e.g., what color/breed was the *dog*?). This specification comes through **context**, which consists of both linguistic and extralinguistic factors but shows a strong effect of the immediate lexical and syntactic environment—the other words surrounding the word of interest and their syntactic relations to it.

These diverse concerns motivate lexical semantic modeling as an important task for all computational systems that must tackle problems of meaning. In this article we develop a framework for modeling word meaning and how it is modulated by contextual effects.^{1} Our models are **distributional** in the sense that their parameters are learned from observed co-occurrences between words and contexts in corpus data. More specifically, they are **probabilistic models** that associate latent variables with automatically induced classes of distributional behavior and associate each word with a probability distribution over those classes. This has a natural interpretation as a model of **selectional preference**, the semantic phenomenon by which predicates such as verbs or adjectives more plausibly combine with some classes of arguments than with others. It also has an interpretation as a disambiguation model: The different latent variable values correspond to different aspects of meaning and a word's distribution over those values can be modified by information coming from the context it appears in. We present a number of specific models within this framework and demonstrate that they can give state-of-the-art performance on tasks requiring models of preference and disambiguation. More generally, we illustrate that probabilistic modeling is an effective general-purpose framework for distributional semantics and a useful alternative to the popular vector-space framework.

The main contributions of the article are as follows:

- We describe the probabilistic approach to distributional semantics, showing how it can be applied as generally as the vector-space approach.

- We present three novel probabilistic selectional preference models and show that they outperform a variety of previously proposed models on a plausibility-based evaluation.

- Furthermore, the representations learned by these models correspond to semantic classes that are useful for modeling the effect of context on semantic similarity and disambiguation.

## 2. Background and Related Work

### 2.1 Distributional Semantics

The distributional approach to semantics is often traced back to the so-called “distributional hypothesis” put forward by mid-century linguists such as Zellig Harris and J. R. Firth:

> If we consider words or morphemes *A* and *B* to be more different in meaning than *A* and *C*, then we will often find that the distributions of *A* and *B* are more different than the distributions of *A* and *C*. (Harris 1954)

> You shall know a word by the company it keeps. (Firth 1957)

The term **distributional semantics** encompasses a broad range of methods that identify the semantic properties of a word or other linguistic unit with its patterns of co-occurrence in a corpus of textual data. The potential for learning semantic knowledge from text was recognized very early in the development of NLP (Spärck Jones 1964; Cordier 1965; Harper 1965), but it is with the technological developments of the past twenty years that this data-driven approach to semantics has become dominant. Distributional approaches may use a representation based on vector spaces, on graphs, or (like this article) on probabilistic models, but they all share the common property of estimating their parameters from empirically observed co-occurrences.

The basic unit of distributional semantics is the **co-occurrence**: an observation of a word appearing in a particular context. The definition is a general one: We may be interested in all kinds of words, or only a particular subset of the vocabulary; we may define the context of interest to be a document, a fixed-size window around a nearby word, or a syntactic dependency arc incident to a nearby word. Given a data set of co-occurrence observations we can extract an indexed set of co-occurrence counts **f**_{w} for each word of interest *w*; each entry *f*_{wc} counts the number of times that *w* was observed in context *c*. Alternatively, we can extract an indexed set **f**_{c} for each context.
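As a concrete illustration, extracting the indexed count sets **f**_{w} and **f**_{c} from a list of observations can be sketched in a few lines of Python; the toy dependency-style contexts below are invented for illustration only.

```python
from collections import Counter, defaultdict

# Build the indexed count sets f_w (counts per word) and f_c (counts per
# context) from a list of (word, context) co-occurrence observations.
def cooccurrence_counts(observations):
    f_w = defaultdict(Counter)  # f_w[w][c]: times word w appeared in context c
    f_c = defaultdict(Counter)  # f_c[c][w]: the same counts indexed by context
    for w, c in observations:
        f_w[w][c] += 1
        f_c[c][w] += 1
    return f_w, f_c

# Toy observations: (word, context) pairs with invented dependency contexts.
obs = [("dog", ("ncsubj", "bark")), ("dog", ("ncsubj", "bark")),
       ("dog", ("dobj", "walk")), ("cat", ("ncsubj", "purr"))]
f_w, f_c = cooccurrence_counts(obs)
```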

The vector-space approach is the best-known methodology for distributional semantics; under this conception **f**_{w} is treated as a vector in ℝ^{|𝒞|}, where 𝒞 is the vocabulary of contexts. As such, **f**_{w} is amenable to computations known from linear algebra. We can compare co-occurrence vectors for different words with a similarity function such as the cosine measure or a dissimilarity function such as Euclidean distance; we can cluster neighboring vectors; we can project a matrix of co-occurrence counts onto a low-dimensional subspace; and so on. This is perhaps the most popular approach to distributional semantics and there are many good general overviews covering the possibilities and applications of the vector space model (Curran 2003; Weeds and Weir 2005; Padó and Lapata 2007; Turney and Pantel 2010).
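For illustration, a minimal sketch of the cosine measure over sparse count vectors (the contexts and counts here are invented):

```python
import math

# Cosine similarity between two sparse co-occurrence vectors represented
# as {context: count} mappings.
def cosine(f_u, f_v):
    dot = sum(x * f_v.get(c, 0) for c, x in f_u.items())
    norm_u = math.sqrt(sum(x * x for x in f_u.values()))
    norm_v = math.sqrt(sum(x * x for x in f_v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy count vectors: "dog" and "cat" share only the "pet" context.
f_dog = {"bark": 4, "pet": 2, "leash": 1}
f_cat = {"purr": 3, "pet": 2, "whisker": 1}
sim = cosine(f_dog, f_cat)
```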

Although it is natural to view the aggregate of co-occurrence counts for a word as constituting a vector, it is equally natural to view it as defining a probability distribution. When normalized to have unit sum, **f**_{w} parameterizes a discrete distribution giving the conditional probability of observing a particular context given that we observe *w*. The contents of the vector-space modeler's toolkit generally have probabilistic analogs: similarity and dissimilarity can be computed using measures from information theory such as the Kullback–Leibler or Jensen–Shannon divergences (Lee 1999); the effects of clustering and dimensionality reduction can be achieved through the use of latent variable models (see Section 3.2.2). Additionally, Bayesian priors on parameter distributions provide a flexible toolbox for performing regularization and incorporating prior information in learning. A further advantage of the probabilistic framework is that it is often straightforward to extend existing models to account for additional structure in the data, or to tie together parameters for shared statistical strength, while maintaining guarantees of well-normalized behavior thanks to the laws of probability. In this article we focus on selectional preference learning and contextual disambiguation but we believe that the probabilistic approach exemplified here can fruitfully be applied in any scenario involving distributional semantic modeling.
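To make the parallel concrete, here is a minimal sketch of the Jensen–Shannon divergence computed on normalized count distributions (the toy counts are illustrative):

```python
import math

# Normalize a count vector to a discrete probability distribution.
def normalize(f):
    total = sum(f.values())
    return {k: v / total for k, v in f.items()}

# Kullback-Leibler divergence D(p || q); assumes support(p) is contained
# in support(q), which holds when q is the mixture m below.
def kl(p, q):
    return sum(pk * math.log2(pk / q[k]) for k, pk in p.items() if pk > 0)

# Jensen-Shannon divergence: symmetric and, with log base 2, bounded by 1.
def js(p, q):
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = normalize({"bark": 4, "pet": 2})
q = normalize({"purr": 3, "pet": 3})
d = js(p, q)
```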

### 2.2 Selectional Preferences

#### 2.2.1 Motivation

A fundamental concept in linguistic knowledge is the **predicate**, by which we mean a word or other symbol that combines with one or more **arguments** to produce a composite representation with a composite meaning (by the principle of compositionality). The archetypal predicate is a verb; for example, transitive *drink* takes two noun arguments as subject and object, with which it combines to form a basic sentence. However, the concept is a general one, encompassing other word classes as well as more abstract items such as semantic relations (Yao et al. 2011), semantic frames (Erk, Padó, and Padó 2010), and inference rules (Pantel et al. 2007). The asymmetric distinction between predicate and argument is analogous to that between context and word in the more general distributional framework.

It is intuitive that a particular predicate will be more compatible with some semantic argument classes than with others. For example, the subject of *drink* is typically an animate entity (human or animal) and the object of *drink* is typically a beverage. The subject of *eat* is also typically an animate entity but its object is typically a foodstuff. The noun modified by the adjective *tasty* is also typically a foodstuff, whereas the noun modified by *informative* is an information-bearing object. This intuition can be formalized in terms of a predicate's **selectional preference**: a function that assigns a numerical score to a combination of a predicate and one or more arguments according to the semantic plausibility of that combination. This score may be a probability, a rank, a real value, or a binary value; in the last case, the usual term is **selectional restriction**.

Models of selectional preference aim to capture conceptual knowledge that all language users are assumed to have. Speakers of English can readily identify that examples such as the following are semantically infelicitous despite being syntactically well-formed:

1. The beer drank the man.

2. Quadruplicity drinks procrastination. (Russell 1940)

3. Colorless green ideas sleep furiously. (Chomsky 1957)

4. The paint is silent. (Katz and Fodor 1963)

A related example is *My car drinks gasoline*, which must be understood non-literally since *car* strongly violates the subject preference of *drink* and *gasoline* is also an unlikely candidate for something to drink.

In NLP, one motivation for modeling predicate–argument plausibility is to investigate whether this aspect of human conceptual knowledge can be learned automatically from text corpora. If the predictions of a computational model correlate with judgments collected from human behavioral data, the assumption is that the model itself shares some properties with human linguistic knowledge and is in some sense a “good” semantic model. More practically, NLP researchers have shown that selectional preference knowledge is useful for downstream applications, including metaphor detection (Shutova 2010), identification of non-compositional multiword expressions (McCarthy, Venkatapathy, and Joshi 2007), semantic role labeling (Gildea and Jurafsky 2002; Zapirain, Agirre, and Màrquez 2009; Zapirain et al. 2010), word sense disambiguation (McCarthy and Carroll 2003), and parsing (Zhou et al. 2011).

#### 2.2.2 The “Counting” Approach

The simplest way to estimate the plausibility of a predicate–argument combination from a corpus is to count the number of times that combination appears, on the assumptions that frequency correlates with plausibility and that given enough data the resulting estimates will be relatively accurate. For example, Keller and Lapata (2003) estimate predicate–argument plausibilities by submitting appropriate queries to a Web search engine and counting the number of “hits” returned. To estimate the frequency with which the verb *drink* takes *beer* as a direct object, Keller and Lapata's method uses the query <*drink*|*drinks*|*drank*|*drunk*|*drinking a*|*the*|∅ *beer*|*beers*>; to estimate the frequency with which *tasty* modifies *pizza* the query is simply <*tasty pizza*|*pizzas*>. Where desired, these joint frequency counts can be normalized by unigram hit counts to estimate conditional probabilities such as *P*(*pizza*|*tasty*).

The main advantages of this approach are its simplicity and its ability to exploit massive corpora of raw text. On the other hand, it is hindered by the facts that only shallow processing is possible and that even in a Web-scale corpus the probability estimates for rare combinations will not be accurate. At the time of writing, Google returns zero hits for the query <*draughtsman*|*draughtsmen whistle*|*whistles*|*whistled*|*whistling*> and 1,570 hits for <*onion*|*onions whistle*|*whistles*|*whistled*|*whistling*>, suggesting the implausible conclusion that an onion is far more likely to whistle than a draughtsman.^{2}

#### 2.2.3 Similarity-Based Smoothing Methods

During the 1990s, research on language modeling led to the development of various “smoothing” methods for overcoming the data sparsity problem that inevitably arises when estimating co-occurrence counts from finite corpora (Chen and Goodman 1999). The general goal of smoothing algorithms is to alter the distributional profile of observed counts to better match the known statistical properties of linguistic data (e.g., that language exhibits power-law behavior). Some also incorporate semantic information on the assumption that meaning guides the distribution of words in a text.

One family of such techniques performs **similarity-based smoothing**, by which one can extrapolate from observed co-occurrences by implementing the distributional hypothesis: “similar” words will have similar distributional properties. A general form for similarity-based co-occurrence estimates is

*P̂*(*w*_{1}|*w*_{2}) = ∑_{w′ ∈ S} *sim*(*w*_{2}, *w*′) *P*(*w*_{1}|*w*′)

where *sim* can be an arbitrarily chosen similarity function; Dagan, Lee, and Pereira (1999) investigate a number of options. *S* is a set of comparison words that may depend on *w*_{1} or *w*_{2}, or neither: Essen and Steinbiss (1992) use the entire vocabulary, whereas Dagan, Lee, and Pereira use a fixed number of the most similar words to *w*_{2}, provided their similarity value is above a threshold *t*.

Erk (2007) and Erk, Padó, and Padó (2010) describe a similarity-smoothed model of selectional preference (EPP) that smooths over the arguments observed for a predicate *p* in the training corpus, denoted *Seenargs*(*p*):

*EPP*(*a*, *p*) = ∑_{a′ ∈ Seenargs(p)} *sim*(*a*, *a*′) · *weight*(*a*′|*p*)   (3)

The co-occurrence strength *weight*(*a*′|*p*) may simply be normalized co-occurrence frequency; alternatively a statistical association measure such as pointwise mutual information may be used. As before, *sim*(*a*, *a*′) may be any similarity measure defined on members of the argument vocabulary *A*. One advantage of this and other similarity-based models is that the corpus used to estimate similarity need not be the same as that used to estimate predicate–argument co-occurrence, which is useful when the corpus labeled with these co-occurrences is small (e.g., a corpus labeled with FrameNet frames).
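As an illustration of this family of models, the following sketch implements a similarity-smoothed preference score of the general form above, using cosine similarity over argument co-occurrence vectors; the function names, toy vectors, and weights are assumptions for illustration, not the exact configuration of any published model.

```python
import math

# Cosine similarity over sparse co-occurrence vectors.
def cosine(u, v):
    dot = sum(x * v.get(k, 0) for k, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Preference of a predicate for argument a: sum over the arguments a2 seen
# with the predicate of sim(a, a2), weighted by co-occurrence strength.
def smoothed_preference(a, seen_args, arg_vectors):
    v_a = arg_vectors[a]
    return sum(cosine(v_a, arg_vectors[a2]) * w for a2, w in seen_args.items())

# Toy argument vectors and seen-argument weights for a predicate such as
# the direct-object slot of "drink"; all values are invented.
arg_vectors = {
    "beer":  {"drink": 3, "pub": 2},
    "wine":  {"drink": 2, "pour": 1},
    "juice": {"drink": 2, "pour": 2},
    "rock":  {"throw": 3, "climb": 1},
}
seen_args = {"beer": 0.7, "wine": 0.3}  # weight(a2|p), e.g., relative frequency
score_juice = smoothed_preference("juice", seen_args, arg_vectors)
score_rock = smoothed_preference("rock", seen_args, arg_vectors)
```

Note how the unseen but distributionally similar argument *juice* receives a higher score than the dissimilar *rock*.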

#### 2.2.4 Discriminative Models

Bergsma, Lin, and Goebel (2008) cast selectional preference acquisition as a supervised learning problem to which a discriminatively trained classifier such as a Support Vector Machine (SVM) can be applied. To produce training data for a predicate, they pair “positive” arguments that were observed for that predicate in the training corpus and have an association with that predicate above a specified threshold (measured by mutual information) with randomly selected “negative” arguments of similar frequency that do not occur with the predicate or fall below the association threshold. Given this training data, a classifier can be trained in a standard way to predict a positive or negative score for unseen predicate–argument pairs.

An advantage of this approach is that arbitrary sets of features can be used to represent the training and testing items. Bergsma, Lin, and Goebel include conditional probabilities *P*(*a*|*p*) for all predicates the candidate argument co-occurs with, typographic features of the argument itself (e.g., whether it is capitalized, or contains digits), lists of named entities, and precompiled semantic classes.

#### 2.2.5 WordNet-Based Models

An alternative approach to preference learning models the argument distribution for a predicate as a distribution over semantic classes provided by a predefined lexical resource. The most popular such resource is the WordNet lexical hierarchy (Fellbaum 1998), which provides semantic classes and hypernymic structures for nouns, verbs, adjectives, and adverbs.^{3} Incorporating knowledge about the WordNet taxonomy structure in a preference model enables the use of graph-based regularization techniques to complement distributional information, while also expanding the coverage of the model to types that are not encountered in the training corpus. On the other hand, taxonomy-based methods build in an assumption that the lexical hierarchy chosen is the universally “correct” one and they will not perform as well when faced with data that violates the hierarchy or contains unknown words. A further issue faced by these models is that the resources they rely on require significant effort to create and will not always be available to model data in a new language or a new domain.

Resnik (1993) proposes a measure of associational strength between a predicate and WordNet classes based on the empirical distribution of words of each class (and their hyponyms) in a corpus. Abney and Light (1999) conceptualize the process of generating an argument for a predicate in terms of a Markovian random walk from the hierarchy's root to a leaf node and choosing the word associated with that leaf node. Ciaramita and Johnson (2000) likewise treat WordNet as defining the structure of a probabilistic graphical model, in this case a Bayesian network. Li and Abe (1998) and Clark and Weir (2002) both describe models in which a predicate “cuts” the hierarchy at an appropriate level of generalization, such that all classes below the cut are considered appropriate arguments (whether observed in data or not) and all classes above the cut are considered inappropriate.

In this article we focus on purely distributional models that do not rely on manually constructed lexical resources; therefore we do not revisit the models described in this section subsequently, except as a basis for empirical comparison. Ó Séaghdha and Korhonen (2012) do investigate a number of Bayesian preference models that incorporate WordNet classes and structure, finding that such models outperform previously proposed WordNet-based models and perform comparably to the distributional Bayesian models presented here.

### 2.3 Measuring Similarity in Context

#### 2.3.1 Motivation

A fundamental idea in semantics is that the meaning of a word is disambiguated and modulated by the context in which it appears. The word *body* clearly has a different sense in each of the following text fragments:

1. *Depending on the present position of the planetary **body** in its orbital path, …*

2. *The executive **body** decided …*

3. *The human **body** is intriguing in all its forms.*

For example, *committee* is a reasonable substitute for *body* in fragment 2 but less reasonable in fragment 1. An evaluation of semantic models based on this principle was run as the English Lexical Substitution Task in SemEval 2007 (McCarthy and Navigli 2009). The annotated data from the Lexical Substitution Task have been used by numerous researchers to evaluate models of lexical choice; see Section 4.5 for further details.

In this article we consider the general task of comparing the meanings of two words *w*_{o}, *w*_{s} in a given context *C* = {(*r*^{1}, *w*^{1}), (*r*^{2}, *w*^{2}), … , (*r*^{n}, *w*^{n})}. When the task is substitution, *w*_{o} is the original word and *w*_{s} is the candidate substitute. Our general approach is to compute a representation *Rep*(*w*_{o}|*C*) for *w*_{o} in context *C* and compare it with *Rep*(*w*_{s}), our representation for *w*_{s}:

*sim*_{C}(*w*_{o}, *w*_{s}) = *sim*(*Rep*(*w*_{o}|*C*), *Rep*(*w*_{s}))

where *sim* is a suitable similarity function for comparing the representations. This general framework leaves open the question of what kind of representation we use for *Rep*(*w*_{o}|*C*) and *Rep*(*w*_{s}); in Section 2.3.2 we describe representations based on vector-space semantics and in Section 3.5 we describe representations based on latent-variable models.

A complementary perspective on the disambiguatory power of context models is provided by research on semantic composition, namely, how the syntactic effect of a grammar rule is accompanied by a combinatory semantic effect. In this view, the goal is to represent the combination of a context and an in-context word, not just to represent the word given the context. The co-occurrence models described in this article are not designed to scale up and provide a representation for complex syntactic structures,^{4} but they are applicable to evaluation scenarios that involve representing binary co-occurrences.

#### 2.3.2 Vector-Space Models

As described in Section 2.1, the vector-space approach to distributional semantics casts word meanings as vectors of real numbers and uses linear algebra operations to compare and combine these vectors. A word *w* is represented by a vector **v**_{w} that models aspects of its distribution in the training corpus; the elements of this vector may be co-occurrence counts (in which case it is the same as the frequency vector **f**_{w}) or, more typically, some transformation of the raw counts.

Mitchell and Lapata (2008) propose a general framework for vector composition: Given two vectors **v**_{w}, **v**_{w′}, their combination **p** is provided by a function *g* that may also depend on syntax *R* and background knowledge *K*:

**p** = *g*(**v**_{w}, **v**_{w′}, *R*, *K*)   (5)

Mitchell and Lapata investigate a number of functions that instantiate Equation (5), finding that elementwise multiplication is a simple and consistently effective choice:

**p** = **v**_{w} ⊙ **v**_{w′}

The motivation for this “disambiguation by multiplication” is that lexical vectors are sparse and the multiplication operation has the effect of sending entries not supported in both **v**_{w} and **v**_{w′} towards zero while boosting entries that have high weights in both vectors.
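A minimal sketch of composition by elementwise multiplication over sparse vectors (the toy weights are invented for illustration):

```python
# Elementwise ("pointwise") multiplication of two sparse vectors: an entry
# survives only if it is supported in both vectors.
def multiply(v, w):
    return {k: v[k] * w[k] for k in v.keys() & w.keys()}

# Toy vectors for "body" and "executive"; only the shared "committee"
# dimension survives composition, modeling disambiguation toward that sense.
v_body = {"orbit": 2.0, "committee": 1.0, "organ": 3.0}
v_executive = {"committee": 4.0, "decision": 2.0}
p = multiply(v_body, v_executive)
```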

Erk and Padó (2008) propose a **structured vector space** approach in which each word *w* is associated with a set of “expectation” vectors *R*_{w}, indexed by dependency label, in addition to its standard co-occurrence vector **v**_{w}. The expectation vector *R*_{w}(*r*) for word *w* and dependency label *r* is an average over co-occurrence vectors for seen arguments of *w* and *r* in the training corpus:

*R*_{w}(*r*) = ∑_{w′ ∈ Seenargs(w, r)} *f*(*w*, *r*, *w*′) · **v**_{w′}   (7)

where the weight *f*(*w*, *r*, *w*′) reflects the co-occurrence strength of the argument *w*′ with (*w*, *r*). Whereas a standard selectional preference model addresses the question “which words are probable as arguments of predicate (*w*, *r*)?”, the expectation vector (7) addresses the question “what does a typical co-occurrence vector for an argument of the predicate (*w*, *r*) look like?”. To disambiguate the semantics of word *w* in the context of a predicate (*w*′, *r*′), Erk and Padó combine the expectation vector *R*_{w′}(*r*′) with the word vector **v**_{w}:

**v**_{w|r′,w′} = **v**_{w} ⊙ *R*_{w′}(*r*′)

Thater, Fürstenau, and Pinkal (2011) define the contextualized vector of *w* in the context of (*r*′, *w*′) to be

**v**_{w|r′,w′} = ∑_{(r″,w″)} *α*(*r*′, *w*′, *r*″, *w*″) · *weight*(*w*, *r*″, *w*″) · **e**_{r″,w″}

where *α* quantifies the compatibility of the observed predicate (*w*′, *r*′) with the smoothing predicate (*w*″, *r*″), *weight* quantifies the co-occurrence strength between (*w*″, *r*″) and *w*, and **e**_{r″,w″} is a basis vector for (*w*″, *r*″). This is a general formulation admitting various choices of *α* and *weight*; the optimal configuration found by Thater, Fürstenau, and Pinkal bases *α* on the distributional similarity between the observed and smoothing predicates and *weight* on pointwise mutual information. This is conceptually very similar to the EPP selectional preference model (3) of Erk, Padó, and Padó (2010); each entry in the vector **v**_{w|r′,w′} is a similarity-smoothed estimate of the preference of (*w*′, *r*′) for *w*. EPP uses seen arguments of (*w*′, *r*′) for smoothing, whereas Thater, Fürstenau, and Pinkal (2011) take a complementary approach and smooth with seen predicates for *w*. In order to combine the disambiguatory effects of multiple predicates, a sum over contextualized vectors is taken:

**v**_{w|C} = ∑_{(r′,w′) ∈ C} **v**_{w|r′,w′}

All the models described in this section provide a way of relating a word's standard co-occurrence vector to a vector representation of the word's meaning in context. This allows us to calculate the similarity between two in-context words or between a word and an in-context word using standard vector similarity measures such as the cosine. In applications where the task is to judge the appropriateness of substituting a word *w*_{s} for an observed word *w*_{o} in context *C* = {(*r*^{1}, *w*^{1}), (*r*^{2}, *w*^{2}), … , (*r*^{n}, *w*^{n})}, a common approach is to compute the similarity between the contextualized vector **v**_{w_{o}|C} and the uncontextualized word vector **v**_{w_{s}}. It has been demonstrated empirically that this approach yields better performance than contextualizing both vectors before the similarity computation.
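The following sketch illustrates one simple instantiation of this setup: the original word's vector is contextualized by elementwise multiplication with its context vectors (just one of the composition choices discussed above) and compared with the uncontextualized substitute vector; all toy vectors are invented.

```python
import math

def cosine(u, v):
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Contextualize the target vector by elementwise multiplication with the
# vector of each context word.
def contextualize(v_w, context_vectors):
    v = dict(v_w)
    for v_c in context_vectors:
        v = {k: v[k] * v_c[k] for k in v.keys() & v_c.keys()}
    return v

# Score a substitute: contextualized original vs. uncontextualized substitute.
def substitutability(v_o, context_vectors, v_s):
    return cosine(contextualize(v_o, context_vectors), v_s)

# Toy vectors: in the context of "executive", "committee" should be a
# better substitute for "body" than "planet".
v_body = {"member": 2.0, "orbit": 2.0, "meeting": 1.0}
v_executive = {"member": 3.0, "meeting": 2.0, "decision": 1.0}
v_committee = {"member": 3.0, "meeting": 2.0}
v_planet = {"orbit": 4.0}
good = substitutability(v_body, [v_executive], v_committee)
bad = substitutability(v_body, [v_executive], v_planet)
```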

## 3. Probabilistic Latent Variable Models for Lexical Semantics

### 3.1 Notation and Terminology

We define a co-occurrence as a pair (*c*,*w*), where *c* is a context belonging to the vocabulary of contexts 𝒞 and *w* is a word belonging to the word vocabulary 𝒲.^{5} Unless otherwise stated, the contexts considered in this article are head-lexicalized dependency edges *c* = (*r*,*w*_{h}), where *r* is the grammatical relation and *w*_{h} ∈ 𝒲 is the head lemma. We notate grammatical relations as *p*_{h}:*label*:*p*_{d}, where *p*_{h} is the head word's part of speech, *p*_{d} is the dependent word's part of speech, and *label* is the dependency label.^{6} We use a coarse set of part-of-speech tags: *n* (noun), *v* (verb), *j* (adjective), *r* (adverb). The dependency labels are the grammatical relations used by the RASP system (Briscoe 2006; Briscoe, Carroll, and Watson 2006), though in principle any dependency formalism could be used. The assumption that predicates correspond to head-lexicalized dependency edges means that they have arity one.

Each word *w* in the sentence has a syntactic context set *C* comprising all the dependency edges incident to *w*. In the sentence fragment *The executive body decided…*, the word *body* has two incident edges. The context set for *body* is *C* = {(*j:ncmod*^{ − 1}*:n*, *executive*), (*v:ncsubj:n*, *decide*)}, where (*v:ncsubj:n*, *decide*) indicates that *body* is the subject of *decide* and (*j:ncmod*^{ − 1}*:n*, *executive*) denotes that it stands in an inverse non-clausal modifier relation to *executive* (we assume that nouns are the heads of their adjectival modifiers).

To estimate our preference models we will rely on co-occurrence counts extracted from a corpus of observations *O*. Each observation is a co-occurrence of a predicate and an argument. The set of observations for context *c* is denoted *O*(*c*). The co-occurrence frequency of context *c* and word *w* is denoted by *f*_{cw}, and the total co-occurrence frequency of *c* by *f*_{c} = ∑_{w ∈ W}*f*_{cw}.

### 3.2 Modeling Assumptions

#### 3.2.1 Bayesian Modeling

The Bayesian approach to probabilistic modeling (Gelman et al. 2003) is characterized by (1) the use of prior distributions over model parameters to encode the modeler's expectations about the values they will take; and (2) the explicit quantification of uncertainty by maintaining posterior distributions over parameters rather than point estimates.^{7}

Consider a discrete distribution over a finite sample space *K*, parameterized by a probability vector **θ** with length |*K*|. The probability that an observation *o* takes value *k* is then:

*P*(*o* = *k*|**θ**) = *θ*_{k}

The value of **θ** must typically be learned from data. The maximum likelihood estimate (MLE) sets *θ*_{k} proportional to the number of times *k* was observed in a set of observations *O*, where each observation *o*_{i} ∈ *K*:

*θ*_{k} = |{*o*_{i} ∈ *O* : *o*_{i} = *k*}| / |*O*|

Although simple, such an approach has significant limitations. Because a linguistic vocabulary contains a large number of items that individually have low probability but together account for considerable total probability mass, even a large corpus is unlikely to give accurate estimates for low-probability types (Evert 2004). Items that do not appear in the training data will be assigned zero probability of appearing in unseen data, which is rarely if ever a valid assumption. Sparsity increases further when the sample space contains composite items (e.g., context–word pairs).
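A short illustration of the MLE and its zero-probability problem (the toy data are invented):

```python
from collections import Counter

# Maximum likelihood estimate of a discrete distribution: theta_k is
# proportional to the observed count of value k.
def mle(observations):
    counts = Counter(observations)
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}

theta = mle(["beer", "beer", "wine", "water"])
# Any unseen type ("juice") implicitly receives probability zero, which is
# rarely a valid assumption for language data.
p_unseen = theta.get("juice", 0.0)
```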

In the Bayesian approach we place a prior distribution on **θ** and apply Bayes' Theorem:

*P*(**θ**|*O*) = *P*(*O*|**θ**) *P*(**θ**) / *P*(*O*)   (16)

A standard choice for the prior distribution over the parameters of a discrete distribution is the Dirichlet distribution:

*P*(**θ**|**α**) = Dirichlet(**θ**; **α**) = (Γ(∑_{k} *α*_{k}) / ∏_{k} Γ(*α*_{k})) ∏_{k} *θ*_{k}^{α_{k} − 1}

Here, **α** is a |*K*|-length vector where each *α*_{k} > 0. One effect of the Dirichlet prior is that setting the sum ∑_{k} *α*_{k} to a small value will encode the expectation that the parameter vector **θ** is likely to distribute its mass more sparsely. The Dirichlet distribution is a **conjugate prior** for multinomial and categorical likelihoods, in the sense that the posterior distribution *P*(**θ**|*O*, **α**) in Equation (16) is also a Dirichlet distribution when *P*(*O*|**θ**) is multinomial or categorical and *P*(**θ**|**α**) is Dirichlet:

*P*(**θ**|*O*, **α**) = Dirichlet(**θ**; **α** ⊕ **f**_{O})

where ⊕ indicates elementwise addition of the observed count vector **f**_{O} to the Dirichlet parameter vector **α**. Furthermore, the conjugacy property allows us to do a number of important computations in an efficient way. In many applications we are interested in predicting the distribution over values *K* for a “new” observation given a set of prior observations *O* while retaining our uncertainty about the model parameters. We can average over possible values of **θ**, weighted according to their probability *P*(**θ**|*O*, **α**), by “integrating out” the parameter and still retain a simple closed-form expression for the posterior predictive distribution:

*P*(*o*_{new} = *k*|*O*, **α**) = (*f*_{k} + *α*_{k}) / ∑_{k′} (*f*_{k′} + *α*_{k′})   (21)

Expression (21) is central to the implementation of collapsed Gibbs samplers for Bayesian models such as latent Dirichlet allocation (Section 3.3). For mathematical details of these derivations, see Heinrich (2009).
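The posterior predictive distribution can be illustrated directly: under a Dirichlet prior it amounts to adding the pseudo-counts *α*_{k} to the observed counts and renormalizing (the toy data and hyperparameter values below are invented).

```python
from collections import Counter

# Posterior predictive under a Dirichlet prior: add the pseudo-counts
# alpha_k to the observed counts f_k and renormalize.
def posterior_predictive(observations, alpha):
    counts = Counter(observations)
    denom = sum(counts.values()) + sum(alpha.values())
    return {k: (counts.get(k, 0) + a) / denom for k, a in alpha.items()}

alpha = {"beer": 0.1, "wine": 0.1, "juice": 0.1}
pred = posterior_predictive(["beer", "beer", "wine"], alpha)
# The unseen type "juice" now receives a small but nonzero probability.
```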

Other priors commonly used for discrete distributions in NLP include the Dirichlet process and the Pitman–Yor process (Goldwater, Griffiths, and Johnson 2011). The Dirichlet process provides similar behavior to the Dirichlet distribution prior but is “non-parametric” in the sense of varying the size of its support according to the data; in the context of mixture modeling, a Dirichlet process prior allows the number of mixture components to be learned rather than fixed in advance. The Pitman–Yor process is a generalization of the Dirichlet process that is better suited to learning power-law distributions. This makes it particularly suitable for language modeling where the Dirichlet distribution or Dirichlet process would not produce a long enough tail due to their preference for sparsity (Teh 2006). On the other hand, Dirichlet-like behavior may be preferable in semantic modeling, where we expect, for example, predicate–class and class–argument distributions to be sparse.

#### 3.2.2 The Latent Variable Assumption

**Latent variables** are random variables whose values are not provided by the input data. As a result, their values must be inferred at the same time as the model parameters on the basis of the training data and model structure. The latent variable concept is a very general one that is used across a wide range of probabilistic frameworks, from hidden Markov models to neural networks. One important application is in mixture models, where the data likelihood is assumed to have the following form:

*P*(*x*) = ∑_{z ∈ Z} *P*(*z*) *P*(*x*|*z*)

Here the latent variables *z* index mixture components, each of which is associated with a distribution over observations *x*, and the resulting likelihood is an average of the component distributions weighted by the mixing weights *P*(*z*). The set of possible values for *z* is the set of components *Z*. When |*Z*| is small relative to the size of the training data, this model has a clustering effect in the sense that the distribution learned for *P*(*x*|*z*) is informed by all datapoints assigned to component *z*.
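A minimal numerical sketch of the mixture likelihood (the component names and probabilities are invented for illustration):

```python
# Mixture likelihood P(x) = sum_z P(z) P(x|z): an average of the component
# distributions weighted by the mixing weights.
def mixture_likelihood(x, p_z, p_x_given_z):
    return sum(p_z[z] * p_x_given_z[z].get(x, 0.0) for z in p_z)

# Toy components: two invented latent classes with their own distributions.
p_z = {"BEVERAGE": 0.5, "FOOD": 0.5}
p_x_given_z = {"BEVERAGE": {"beer": 0.6, "water": 0.4},
               "FOOD": {"pizza": 0.7, "soup": 0.3}}
prob_beer = mixture_likelihood("beer", p_z, p_x_given_z)
total = sum(mixture_likelihood(x, p_z, p_x_given_z)
            for x in ["beer", "water", "pizza", "soup"])
```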

The idea of compressing the observed co-occurrence data through a small layer of latent variables shares the same basic motivations as other, not necessarily probabilistic, dimensionality reduction techniques such as Latent Semantic Analysis or Non-negative Matrix Factorization. An advantage of probabilistic models is their flexibility, both in terms of learning methods and model structures. For example, the models considered in this article can potentially be extended to multi-way co-occurrences and to hierarchically defined contexts that cannot easily be expressed in frameworks that require the input to be a co-occurrence matrix.

The latent variable approach applies directly to predicate–argument co-occurrences: the distribution over arguments *w* for a context *c* can be modeled as

$$P(w \mid c) = \sum_{z \in Z} P(w \mid z)\, P(z \mid c) \tag{23}$$

Pereira, Tishby, and Lee (1993) apply a model of this form to verb–noun co-occurrence data, modeling each noun *n*'s distribution over verbs *v* as

$$P(v \mid n) = \sum_{z \in Z} P(v \mid z)\, P(z \mid n) \tag{24}$$

which is equivalent to Equation (23) when we take *n* as the predicate and *v* as the argument, in effect defining an inverse selectional preference model. Pereira, Tishby, and Lee also observe that given certain assumptions Equation (24) can be written more symmetrically as

$$P(n, v) = \sum_{z \in Z} P(z)\, P(v \mid z)\, P(n \mid z) \tag{25}$$

The distributions *P*(*v*|*z*), *P*(*n*|*z*), and *P*(*z*) are estimated by an optimization procedure based on Maximum Entropy. Rooth et al. (1999) propose a much simpler Expectation Maximization (EM) procedure for estimating the parameters of Equation (25).

### 3.3 Bayesian Models for Binary Co-occurrences

Combining the latent variable co-occurrence model (23) with the use of Dirichlet priors naturally leads to **Latent Dirichlet Allocation** (LDA) (Blei, Ng, and Jordan 2003). Often described as a “topic model,” LDA is a model of document content that assumes each document is generated from a mixture of multinomial distributions or “topics.” Topics are shared across documents and correspond to thematically coherent patterns of word usage. For example, one topic may assign high probability to the words *finance*, *fund*, *bank*, and *invest*, whereas another topic may assign high probability to the words *football*, *goal*, *referee*, and *header*. LDA has proven to be a very successful model with many applications and extensions, and the topic modeling framework remains an area of active research in machine learning.

To transfer this machinery to selectional preferences, we treat each predicate context *c* as a "document" whose content is the set of arguments observed with it; *P*(*w*|*c*) is modeled as

$$P(w \mid c) = \sum_{z \in Z} P(w \mid z)\, P(z \mid c)$$

where the multinomials *P*(*z*|*c*) and *P*(*w*|*z*) are given Dirichlet priors.

Figure 1 sketches the “generative story” according to which LDA generates arguments for predicates and also presents a plate diagram indicating the dependencies between variables in the model. Table 1 illustrates the semantic representation induced by a 600-topic LDA model trained on predicate–noun co-occurrences extracted from the British National Corpus (for more details of this training data, see Section 4.1). The “semantic classes” are actually distributions over all nouns in the vocabulary rather than a hard partitioning; therefore we present the eight most probable words for each. We also present the contexts most frequently associated with each class. Whereas a topic model trained on document–word co-occurrences will find topics that reflect broad thematic commonalities, the model trained on syntactic co-occurrences finds semantic classes that capture a much tighter sense of similarity: Words assigned high probability in the same topic tend to refer to entities that have similar properties, that perform similar actions, and have similar actions performed on them. Thus Class 1 is represented by *attack*, *raid*, *assault*, *campaign*, and so on, forming a coherent semantic grouping. Classes 2, 3, and 4 correspond to groups of tests, geometric objects, and public/educational institutions, respectively. Class 5 has been selected to illustrate a potential pitfall of using syntactic co-occurrences for semantic class induction: *fund*, *revenue*, *eyebrow*, and *awareness* hardly belong together as a coherent conceptual class. The reason, it seems, is that they are all entities that can be (and in the corpus, are) *raised*. This class has also conflated different (but related) senses of *reserve* and as a result the modifier *nature* is often associated with it.

The classes induced by Rooth-LDA have a similar flavor: one class captures the fact that a *cost*, *number*, *risk*, or *expenditure* can plausibly be *increased*, *reduced*, *cut*, or *involved*; another shows that a *house*, *building*, *home*, or *station* can be *built*, *left*, *visited*, or *used*. As with LDA, there are some over-generalizations; the fact that an *eye* or *mouth* can be *opened*, *closed*, or *shut* does not necessarily entail that it can be *locked* or *unlocked*.

The Lex-LDA model interpolates between class-based and predicate-specific lexical distributions, defining *P*(*w*|*c*) as

$$P(w \mid c) = \sigma_c\, P_{lex}(w \mid c) + (1 - \sigma_c)\, P_{class}(w \mid c)$$

where *σ*_{c} is a value between 0 and 1 that can be interpreted as a measure of argument lexicalization, or as the probability that an observation for context *c* is drawn from the lexical distribution *P*_{lex} rather than the class-based distribution *P*_{class}. *P*_{class} has the same form as the LDA preference model. The value of *σ*_{c} will vary across predicates according to how well their argument preferences can be fit by the class-based model; a predicate with high *σ*_{c} has idiosyncratic argument patterns that are best learned by observing that predicate's co-occurrences in isolation. In many cases this may reflect idiomatic or non-compositional usages, though *σ*_{c} is also expected to correlate with frequency: given sufficient data for a context, smoothing becomes less important. As an example, we trained the Lex-LDA model on BNC verb–object co-occurrences and estimated posterior mean values of *σ*_{c} for all verbs occurring more than 100 times and taking at least 10 different object argument types. The verbs with the highest and lowest values are listed in Table 3. Although almost anything can be *discussed* or *highlighted*, verbs such as *pose* and *wreak* have very lexicalized argument preferences. The semantic classes learned by Lex-LDA are broadly comparable to those learned by LDA, though Lex-LDA is less likely to mix classes on the basis of a single argument lexicalization; whereas the LDA class in row 5 of Table 1 is distracted by the high-frequency collocations *nature reserve* and *raise eyebrow*, Lex-LDA models trained on the same data can explain these through lexicalization effects and separate body parts, conservation areas, and investments into different classes.

### 3.4 Parameter and Hyperparameter Learning

#### 3.4.1 Learning Methods

A variety of methods are available for parameter learning in Bayesian models. The two standard approaches are variational inference, in which an approximation to the true distribution over parameters is estimated exactly, and sampling, in which convergence to the true posterior is guaranteed in theory but rarely verifiable in practice. In some cases the choice of approach is guided by the model, but often it is a matter of personal preference; for LDA, there is evidence that equivalent levels of performance can be achieved through variational learning and sampling given appropriate parameterization (Asuncion et al. 2009). In this article we use learning methods based on Gibbs sampling, following Griffiths and Steyvers (2004). The basic idea of Gibbs sampling is to iterate through the corpus one observation at a time, updating the latent variable value for each observation according to the conditional probability distribution determined by the current observed and latent variable values for all other observations. Because the likelihoods are multinomials with Dirichlet priors, we can integrate out their parameters using Equation (21).

The probability that the *i*th observation is assigned value *z* is computed as

$$P(z_i = z \mid \mathbf{z}^{-i}, c_i, w_i) \propto (f_{c_i z} + \alpha_z)\, \frac{f_{w_i z} + \beta_{w_i}}{f_z + \sum_{w'} \beta_{w'}} \tag{30}$$

where **z**^{−i} is the set of current assignments for all observations other than the *i*th, *f*_{z} is the number of observations in that set assigned latent variable *z*, *f*_{c_i z} is the number of observations with context *c*_{i} assigned latent variable *z*, and *f*_{w_i z} is the number of observations with word *w*_{i} assigned latent variable *z*.
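As an illustration, the conditional update described above can be implemented as a collapsed Gibbs sampler. The sketch below uses symmetric hyperparameters and hypothetical names, and is the naive (per-topic) sampler rather than the optimized algorithm of Yao, Mimno, and McCallum (2009); it maintains the count arrays corresponding to *f*_{cz}, *f*_{wz}, and *f*_{z}:

```python
import numpy as np

def gibbs_lda(pairs, n_topics, n_words, n_contexts,
              alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for an LDA-style preference model over
    (context, word) observations. Hyperparameters are symmetric here."""
    rng = np.random.default_rng(seed)
    z = rng.integers(n_topics, size=len(pairs))      # random initialization
    f_cz = np.zeros((n_contexts, n_topics))          # context-topic counts
    f_wz = np.zeros((n_words, n_topics))             # word-topic counts
    f_z = np.zeros(n_topics)                         # topic totals
    for i, (c, w) in enumerate(pairs):
        f_cz[c, z[i]] += 1; f_wz[w, z[i]] += 1; f_z[z[i]] += 1
    for _ in range(n_iter):
        for i, (c, w) in enumerate(pairs):
            t = z[i]                                 # remove observation i
            f_cz[c, t] -= 1; f_wz[w, t] -= 1; f_z[t] -= 1
            # conditional: (f_cz + alpha) * (f_wz + beta) / (f_z + V*beta)
            p = (f_cz[c] + alpha) * (f_wz[w] + beta) / (f_z + n_words * beta)
            t = rng.choice(n_topics, p=p / p.sum())  # resample topic
            z[i] = t
            f_cz[c, t] += 1; f_wz[w, t] += 1; f_z[t] += 1
    return z, f_cz, f_wz, f_z
```

Predictions such as P̂(*w*|*c*) can then be read off the final count arrays with the appropriate Dirichlet smoothing.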

For Lex-LDA, the lexicalization indicator variables *s*_{i} must also be sampled for each token. We "block" the sampling of *z*_{i} and *s*_{i}, drawing the pair jointly from its conditional distribution to improve convergence; when *s*_{i} selects the lexical distribution, *z*_{i} takes the value ∅, indicating that no topic is assigned. The fact that topics are not assigned for all tokens means that Lex-LDA is less useful in situations that require the representational power they afford, such as the contextual similarity paradigm described in Section 3.5.

A naive implementation of the sampler will take time linear in the number of topics and the number of observations to complete one iteration. Yao, Mimno, and McCallum (2009) present a sampling algorithm for LDA that yields a considerable speedup by reformulating Equation (30) to allow caching of intermediate values and an intelligent sorting of topics, so that in many cases only a small number of topics need be iterated through before assigning a topic to an observation. In this article we use Yao, Mimno, and McCallum's algorithm for LDA, as well as a transformation of the Rooth-LDA and Lex-LDA samplers that can be derived in an analogous fashion.

#### 3.4.2 Inference

As noted previously, the Gibbs sampling procedure is guaranteed to converge to the true posterior after a finite number of iterations; however, this number is unknown and it is difficult to detect convergence. In practice, we run the sampler for a hopefully sufficient number of iterations and perform inference based on the final sampling state (assignments of all *z* and *s* variables) and/or a set of intermediate sampling states.

Given a sequence or **chain** of sampling states *S*_{1}, … , *S*_{n}, we can predict a value for *P*(*w*|*c*) or *P*(*c*,*w*) using these equations and the set of latent variable assignments at a single state *S*_{i}. As the sampler is initialized randomly and will take time to find a good area of the search space, it is standard to wait until a number of iterations have passed before using any samples for prediction. States *S*_{1}, … , *S*_{b} from this **burn-in** period are discarded.

We estimate *P*(*w*|*c*) by averaging the predictions obtained from a set of states *S*:

$$\hat{P}(w \mid c) = \frac{1}{|S|} \sum_{S_i \in S} \hat{P}_{S_i}(w \mid c)$$

It is also possible to average over states drawn from multiple chains. However, averaging of any kind can only be performed on quantities whose interpretation does not depend on the sampling state itself. For example, we cannot average estimates of *P*(*z*_{1}|*c*) drawn from different samples, as the topic called *z*_{1} in one iteration is not identical to the topic called *z*_{1} in another; even within the same chain, the meaning of a topic will often change gradually from state to state.

#### 3.4.3 Choosing |*Z*|

In the “parametric” latent variable models used here the number of topics or semantic classes, |*Z*|, must be fixed in advance. This brings significant efficiency advantages but also the problem of choosing an appropriate value for |*Z*|. The more classes a model has, the greater its capacity to capture fine distinctions between entities. However, this finer granularity inevitably comes at a cost of reduced generalization. One approach is to choose a value that works well on training or development data before evaluating held-out test items. Results in lexical semantics are often reported over the entirety of a data set, meaning that if we wish to compare those results we cannot hold out any portion. If the method is relatively insensitive to the parameter it may be sufficient to choose a default value. Rooth et al. (1999) suggest cross-validating on the training data likelihood (and not on the ultimate evaluation measure). An alternative solution is to average the predictions of models trained with different choices of |*Z*|; this avoids the need to pick a default and can give better results than any one value as it integrates contributions at different levels of granularity. As mentioned in Section 3.4.2 we must take care when averaging predictions to compute with quantities that do not rely on topic identity—for example, estimates of *P*(*a*|*p*) can safely be combined whereas estimates of *P*(*z*_{1}|*p*) cannot.

#### 3.4.4 Hyperparameter Estimation

Although the likelihood parameters can be integrated out, the parameters for the Dirichlet and Beta priors (often referred to as “hyperparameters”) cannot and must be specified either manually or automatically. The value of these parameters affects the sparsity of the learned posterior distributions. Furthermore, the use of an asymmetric prior (where not all its parameters have equal value) implements an assumption that some observation values are more likely than others before any observations have been made. Wallach, Mimno, and McCallum (2009) demonstrate that the parameterization of the Dirichlet priors in an LDA model has a material effect on performance, recommending in conclusion a symmetric prior on the “emission” likelihood *P*(*w*|*z*) and an asymmetric prior on the document topic likelihoods *P*(*z*|*d*). In this article we follow these recommendations and, like Wallach, Mimno, and McCallum, we optimize the relevant hyperparameters using a fixed point iteration to maximize the log evidence (Minka 2003; Wallach 2008).
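A fixed-point update of the general form described by Minka for an asymmetric Dirichlet prior over context–class distributions can be sketched as follows; the function name, convergence settings, and input layout are our own, and this is an illustration of the estimator's shape rather than the exact procedure used in the article:

```python
import numpy as np
from scipy.special import psi  # the digamma function

def update_alpha(alpha, counts, iters=200, tol=1e-6):
    """Fixed-point estimation of Dirichlet hyperparameters alpha (length K)
    from a (D, K) matrix of per-document topic counts, maximizing the
    Dirichlet-multinomial evidence."""
    counts = np.asarray(counts, dtype=float)
    n_d = counts.sum(axis=1)                     # total tokens per document
    for _ in range(iters):
        a0 = alpha.sum()
        # numerator: sum_d [psi(n_dk + a_k) - psi(a_k)] for each k
        num = psi(counts + alpha).sum(axis=0) - counts.shape[0] * psi(alpha)
        # denominator: sum_d [psi(n_d + a0) - psi(a0)]
        den = (psi(n_d + a0) - psi(a0)).sum()
        new_alpha = alpha * num / den
        if np.max(np.abs(new_alpha - alpha)) < tol:
            return new_alpha
        alpha = new_alpha
    return alpha
```

Documents whose counts favor one topic pull the corresponding hyperparameter upward, yielding the asymmetric prior over *P*(*z*|*d*) recommended by Wallach, Mimno, and McCallum (2009).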

### 3.5 Measuring Similarity in Context with Latent-Variable Models

The representation induced by latent variable selectional preference models also allows us to capture the disambiguatory effect of context. Given an observation of a word in a context, we can infer the most probable semantic classes to appear in that context and we can also infer the probability that a class generated the observed word. We can also estimate the probability that the semantic classes suggested by the observation would have licensed an alternative word. Taken together, these can be used to estimate in-context semantic similarity. The fundamental intuitions are similar to those behind the vector-space models in Section 2.3.2, but once again we are viewing them from the perspective of probabilistic modeling.

We compare an observed term *w*_{o} and an alternative term *w*_{s} in context *C* by measuring the similarity between the probability distribution over latent variables associated with *w*_{o} and *C* and the probability distribution over latent variables associated with *w*_{s}:

$$sim(w_o, w_s; C) = sim\big(P(z \mid w_o, C),\; P(z \mid w_s)\big)$$

This assumes that we can associate a distribution over the same set of latent variables with each context item *c* ∈ *C*. As noted in Section 2.3.2, previous research has found that conditioning the representations of both the observed term and the candidate substitute on the context gives worse performance than conditioning the observed term alone; we also found a similar effect. Dinu and Lapata (2010) present a specific version of this framework, using a window-based definition of context and the assumption that the similarity given a set of contexts is the product of the similarity values for the individual contexts:

$$sim(w_o, w_s; C) = \prod_{c \in C} sim\big(P(z \mid w_o, c),\; P(z \mid w_s)\big)$$

In this article we generalize to syntactic as well as window-based contexts and also derive a well-motivated approach to incorporating multiple contexts inside the probability model; in Section 4.5 we show that both innovations contribute to improved performance on a lexical substitution data set.

For a set of contexts *C*, each of which has an opinion about the distribution over latent variables, this becomes

$$P(z \mid w_o, C) \propto P(w_o \mid z) \prod_{c \in C} P(z \mid c)$$

The uncontextualized distribution *P*(*z*|*w*_{s}) is not given directly by the LDA model. It can be estimated from relative frequencies in the Gibbs sampling state; we use an unsmoothed estimate.^{8} We denote this model *C* → *T* to note that the target word is generated given the context.

The direction of conditioning can also be reversed, so that the contexts are generated from the target word's distribution over latent variables; we denote this model *T* → *C*:

$$P(z \mid w_o, c) \propto P(c \mid z)\, P(z \mid w_o)$$

Again, we can generalize to non-singleton context sets:

$$P(z \mid w_o, C) \propto P(z \mid w_o) \prod_{c \in C} P(c \mid z) \tag{57}$$

Equation (57) has the form of a "product of experts" model (Hinton 2002), though unlike many applications of such models we train the experts independently and thus avoid additional complexity in the learning process. The uncontextualized distribution *P*(*z*|*w*_{s}) is an explicit component of the *T* → *C* model.

For Rooth-LDA, the contextualized distribution over classes given a context *c* and target word *w*_{o} is given by

$$P(z \mid c, w_o) \propto P(z)\, P(c \mid z)\, P(w_o \mid z)$$

while calculating the uncontextualized distribution *P*(*z*|*w*_{s}) requires summing over the set of possible contexts *C*′:

$$P(z \mid w_s) \propto \sum_{c' \in C'} P(z)\, P(c' \mid z)\, P(w_s \mid z)$$

Because the interaction classes learned by Rooth-LDA are specific to a relation type, this model is less applicable than LDA to problems that involve a rich context set *C*. To compare the resulting distributions over latent variables we require a similarity measure on probability distributions; suitable choices include measures based on the *L*_{1} distance (Lee 1999; Ó Séaghdha and Copestake 2008).

### 3.6 Related Work

As noted earlier, non-Bayesian mixture or latent-variable approaches to co-occurrence modeling were proposed by Pereira, Tishby, and Lee (1993) and Rooth et al. (1999). Blitzer, Globerson, and Pereira (2005) describe a co-occurrence model based on a different kind of distributed latent-variable architecture similar to that used in the literature on neural language models. Brody and Lapata (2009) use the clustering effects of LDA to perform word sense induction. Vlachos, Korhonen, and Ghahramani (2009) use non-parametric Bayesian methods to cluster verbs according to their co-occurrences with subcategorization frames. Reisinger and Mooney (2010, 2011) have also investigated Bayesian methods for lexical semantics in a spirit similar to that adopted here. Reisinger and Mooney (2010) describe a "tiered clustering" model that, like Lex-LDA, mixes a cluster-based preference model with a predicate-specific distribution over words; however, their model does not encourage sharing of classes between different predicates. Reisinger and Mooney (2011) propose a very interesting variant of the latent-variable approach in which different kinds of contextual behavior can be explained by different "views," each of which has its own distribution over latent variables; this model can give more interpretable classes than LDA for higher settings of |*Z*|.

Some extensions of the LDA topic model incorporate local as well as document context to explain lexical choice. Griffiths et al. (2004) combine LDA and a hidden Markov model (HMM) in a single model structure, allowing each word to be drawn from either the document's topic distribution or a latent HMM state conditioned on the preceding word's state; Moon, Erk, and Baldridge (2010) show that combining HMM and LDA components can improve unsupervised part-of-speech induction. Wallach (2006) also seeks to capture the influence of the preceding word, while at the same time generating every word from inside the LDA model; this is achieved by conditioning the distribution over words on the preceding word type as well as on the chosen topic. Boyd-Graber and Blei (2008) propose a “syntactic topic model” that makes topic selection conditional on both the document's topic distribution and on the topic of the word's parent in a dependency tree. Although these models do represent a form of local context, they either use a very restrictive one-word window or a notion of syntax that ignores lexical or dependency-label effects; for example, knowing that the head of a noun is a verb is far less informative than knowing that the noun is the direct object of *eat*.

More generally, there is a connection between the models developed here and latent-variable models used for parsing (e.g., Petrov et al. 2006). In such models each latent state corresponds to a “splitting” of a part-of-speech label so as to produce a finer-grained grammar and tease out intricacies of word–rule “co-occurrence.” Finkel, Grenager, and Manning (2007) and Liang et al. (2007) propose a non-parametric Bayesian treatment of state splitting. This is very similar to the motivation behind an LDA-style selectional preference model. One difference is that the parsing model must explain the parse tree structure as well as the choice of lexical items; another is that in the selectional preference models described here each head–dependent relation is treated as an independent observation (though this could be changed). These differences allow our selectional preference models to be trained efficiently on large corpora and, by focusing on lexical choice rather than syntax, to home in on purely semantic information. Titov and Klementiev (2011) extend the idea of latent-variable distributional modeling to do “unsupervised semantic parsing” and reason about classes of semantically similar lexicalized syntactic fragments.

## 4. Experiments

### 4.1 Training Corpora

In our experiments we use two training corpora:

- **BNC**: the written component of the British National Corpus,^{9} comprising around 90 million words. The corpus was tagged for part of speech, lemmatized, and parsed with the RASP toolkit (Briscoe, Carroll, and Watson 2006).
- **WIKI**: a Wikipedia dump of over 45 million sentences (almost 1 billion words), tagged, lemmatized, and parsed with the C&C toolkit^{10} and the fast CCG parser described by Clark et al. (2009).


In order to train our selectional preference models, we extracted word–context observations from the parsed corpora. Prior to extraction, the dependency graph for each sentence was transformed using the preprocessing steps illustrated in Figure 4. We then filtered for semantically discriminative information by ignoring all words with part of speech other than common noun, verb, adjective, and adverb. We also ignored instances of the verbs *be* and *have* and discarded all words containing non-alphabetic characters and all words with fewer than three characters.^{11}

As mentioned in Section 2.1, the distributional semantics framework admits flexibility in how the practitioner defines the context of a word *w*. We investigate two possibilities in this article:

- **Syn**: The context of *w* is determined by the syntactic relations *r* and words *w*′ incident to it in the sentence's parse tree, as illustrated in Section 3.1.
- **Win5**: The context of *w* is determined by the words appearing within a window of five words on either side of it. There are no relation labels, so there is essentially just one relation *r* to consider.

Training topic models on a data set with very large “documents” leads to tractability issues. The window-based approach is particularly susceptible to an explosion in the number of extracted contexts, as each token in the data can contribute 2×*W* word–context observations, where *W* is the window size. We reduced the data by applying a simple downsampling technique to the training corpora. For the **WIKI/Syn** corpus, all word–context counts were divided by 5 and rounded to the nearest integer. For the **WIKI/Win5** corpus we divided all counts by 70; this number was suggested by Dinu and Lapata (2010), who used the same ratio for downsampling the similarly sized English Gigaword Corpus. Being an order of magnitude smaller, the BNC required less pruning; we divided all counts in the **BNC/Win5** by 5 and left the **BNC/Syn** corpus unaltered. Type/token statistics for the resulting sets of observations are given in Table 4.
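The downsampling step amounts to dividing each word–context count by a fixed ratio, rounding, and discarding pairs whose count rounds to zero. A minimal sketch (the function name is our own; note that Python 3's `round` rounds exact halves to the nearest even integer):

```python
def downsample(counts, ratio):
    """Divide each co-occurrence count by `ratio`, round to the nearest
    integer, and drop pairs whose count rounds to zero."""
    out = {}
    for pair, count in counts.items():
        reduced = round(count / ratio)
        if reduced > 0:
            out[pair] = reduced
    return out
```

For example, with a ratio of 5 a pair observed 12 times keeps a count of 2, while a pair observed twice is dropped.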

### 4.2 Evaluating Selectional Preference Models

Various approaches have been suggested in the literature for evaluating selectional preference models. One popular method is "pseudo-disambiguation," in which a system must distinguish between actually occurring and randomly generated predicate–argument combinations (Pereira, Tishby, and Lee 1993; Chambers and Jurafsky 2010). In a similar vein, probabilistic topic models are often evaluated by measuring the probability they assign to held-out data; held-out likelihood has also been used for evaluation in a task involving selectional preferences (Schulte im Walde et al. 2008). Both approaches adopt a "language modeling" perspective in which model quality is identified with the ability to predict the distribution of co-occurrences in unseen text. Although this metric should certainly correlate with the semantic quality of the model, it may also be affected by frequency and other idiosyncratic aspects of language use unless tightly controlled. In the context of document topic modeling, Chang et al. (2009) find that a model can have better predictive performance on held-out data while inducing topics that human subjects judge to be less semantically coherent.

In this article we choose to evaluate models by comparing system predictions with semantic judgments elicited from human subjects. These judgments take various forms. In Section 4.3 we use judgments of how plausible it is that a given predicate takes a given word as its argument. In Section 4.4 we use judgments of similarity between pairs of predicate–argument combinations. In Section 4.5 we use judgments of substitutability for a target word as disambiguated by its sentential context. Taken together, these different experimental designs provide a multifaceted analysis of model quality.

### 4.3 Predicate–Argument Plausibility

#### 4.3.1 Data

For the plausibility-based evaluation we use a data set of human judgments collected by Keller and Lapata (2003). This comprises data for three grammatical relations: verb–object, adjective–noun, and noun–noun modification. For each relation, 30 predicates were selected; each predicate was paired with three noun arguments from different predicate–argument frequency bands in the BNC as well as three noun arguments that were not observed for that predicate in the BNC. In this way two subsets (*Seen* and *Unseen*) of 90 items each were assembled for each predicate. Human plausibility judgments were elicited from a large number of subjects; these numerical judgments were then normalized, log-transformed, and averaged in a Magnitude Estimation procedure.

Following Keller and Lapata (2003), we evaluate our models by measuring the correlation between system predictions and the human judgments. Keller and Lapata use Pearson's correlation coefficient *r*; we additionally use Spearman's rank correlation coefficient *ρ* for a non-parametric evaluation. Each system prediction is log-transformed before calculating the correlation to improve the linear fit to the gold standard.
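This evaluation procedure, log-transforming the predictions and then computing Pearson's *r* and Spearman's *ρ* against the gold judgments, can be sketched with SciPy; the function name is our own:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def evaluate(predictions, gold):
    """Correlate log-transformed system scores with human judgments.
    Spearman's rho is rank-based, so the log transform leaves it unchanged;
    Pearson's r benefits from the improved linear fit."""
    log_preds = np.log(np.asarray(predictions, dtype=float))
    r, _ = pearsonr(log_preds, gold)
    rho, _ = spearmanr(log_preds, gold)
    return r, rho
```

Because probabilities span several orders of magnitude, the log transform matters mainly for the parametric (Pearson) evaluation.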

#### 4.3.2 Methods

We evaluate the LDA, Rooth-LDA, and Lex-LDA latent-variable preference models, trained on predicate–argument pairs (*c*,*w*) extracted from the BNC. We use a default setting |*Z*| = 100 for the number of classes; in our experiments we have observed that our Bayesian models are relatively robust to the choice of |*Z*|. We average predictions of the joint probability *P*(*c*,*w*) over three independent samples, each of which is obtained by sampling *P*(*c*,*w*) every 50 iterations after a burn-in period of 200 iterations. Rooth-LDA gives joint probabilities by definition (25), but LDA and Lex-LDA are defined in terms of conditional probabilities (24). There are two options for training these models:

- *P* → *A*: Model the distribution *P*(*w*|*c*) over arguments for each predicate.
- *A* → *P*: Model the distribution *P*(*c*|*w*) over predicates for each argument.

Given the *P* → *A* and *A* → *P* implementations of LDA and Lex-LDA, we can also evaluate a combined model *P* ↔ *A* that simply averages the two sets of predictions; this removes the arbitrariness involved in choosing one direction or the other.

For comparison, we report the performance figures given by Keller and Lapata for their search-engine method using AltaVista and Google^{12} as well as a number of alternative methods that we have reimplemented and trained on identical data:

- **BNC (MLE)**: A maximum-likelihood estimate proportional to the co-occurrence frequency *f*(*c*,*w*) in the parsed BNC.
- **BNC (KN)**: BNC relative frequencies smoothed with modified Kneser–Ney (Chen and Goodman 1999).
- **Resnik**: The WordNet-based association strength of Resnik (1993). We used WordNet version 2.1, as the method requires multiple roots in the hierarchy for good performance.
- **Clark/Weir**: The WordNet-based method of Clark and Weir (2002), using WordNet 3.0. This method requires a choice of significance threshold *α* and significance test; we investigated a variety of settings and report performance for *α* = 0.9 and Pearson's *χ*^{2} test, as this combination consistently gave the best results.
- **Rooth-EM**: The latent-variable model of Rooth et al. (1999) without priors, trained with EM. As for the Bayesian models, we average the predictions over three runs. This method is very sensitive to the number of classes; as proposed by Rooth et al., we choose the number of classes from the range (20, 25, …, 50) through 5-fold cross-validation on a held-out log-likelihood measure.
- **EPP**: The vector-space method of Erk, Padó, and Padó (2010), as described in Section 2.2.3. We used the cosine similarity measure for smoothing as it performed well in Erk, Padó, and Padó's experiments.
- **Disc**: A discriminative model inspired by Bergsma, Lin, and Goebel (2008) (see Section 2.2.4). To obtain true probabilistic predictions, we used a logistic regression classifier with *L*_{1} regularization rather than a Support Vector Machine.^{13} We train one classifier per predicate in the Keller and Lapata data set. Following Bergsma, Lin, and Goebel, we generate pseudo-negative instances for each predicate by sampling noun arguments that either do not co-occur with it or have a negative PMI association, using a ratio of two pseudo-negative instances per positive instance and requiring pseudo-negative arguments to fall in the same frequency quintile as the matched observed argument. The features used for each data instance (corresponding to an argument) are the conditional probability of the argument co-occurring with each predicate in the training data, and string-based features capturing the length and the initial and final character *n*-grams of the argument word.^{14}
- **Disc+LDA**: The discriminative model above, augmented with an LDA-derived feature giving the index of the most probable class, max_{z} *P*(*z*|*c*,*w*).

In order to test statistical significance of performance differences we use a test for correlated correlation coefficients proposed by Meng, Rosenthal, and Rubin (1992). This is more appropriate than a standard test for independent correlation coefficients as it takes into account the strength of correlation between two sets of system outputs as well as each output's correlation with the gold standard. Essentially, if the two sets of system outputs are correlated there is less chance that their difference will be deemed significant. As we have no a priori reason to believe that one model will perform better than another, all tests are two-tailed.

#### 4.3.3 Results

Results on the Keller and Lapata (2003) plausibility data set are presented in Table 5.^{15} For common combinations (the Seen data) it is clear that relative corpus frequency is a reliable indicator of plausibility, especially when Web-scale resources are available. The BNC MLE estimate outperforms the best selectional preference model on three out of six Seen evaluations, and the AltaVista and Google estimates from Keller and Lapata (2003) outperform the best selectional preference model on every applicable Seen evaluation. For the rarer Unseen combinations, however, MLE estimates are not sufficient and the latent-variable selectional preference models frequently outperform even the Web-based predictions. The results for BNC (KN) improve on the MLE estimates for the Unseen data but do not match the models that have a semantic component.

It is clear from Table 5 that the new Bayesian latent-variable models outperform the previously proposed selectional preference models under almost every evaluation. Among the latent-variable models there is no one clear winner, and small differences in performance are as likely to arise through random sampling variation as through qualitative differences between models. That said, Rooth-LDA and Lex-LDA do score higher than LDA in a majority of cases. As expected, the bidirectional *P* ↔ *A* models tend to perform at around the midpoint of the *P* → *A* and *A* → *P* models, though they can also exceed both; this suggests that they are a good choice when there is no intuitive reason to choose one direction over the other.

Table 6 aggregates comparisons for all combinations of the six data sets and two evaluation measures. As before, all the Bayesian latent-variable models achieve a roughly similar level of performance, consistently outperforming the models selected from the literature and frequently reaching statistical significance (p < 0.05). These results confirm that LDA-style models can be considered the current state of the art for selectional preference modeling.

A Pearson correlation can be decomposed into a sum of per-item terms, where the contribution of the *i*th item is the product of that item's standardized scores. Spearman's *ρ* is equivalent to the Pearson *r* correlation between ranks, and so a similar quantity can be computed. Table 7 illustrates the items with highest and lowest contributions for one evaluation (Spearman's *ρ* on the Keller and Lapata Unseen data set). We have attempted to identify general factors that predict the difficulty of an item by measuring rank correlation between the per-item pseudo-coefficients and various corpus statistics. However, it has proven difficult to isolate reliable patterns. One finding is that arguments with high corpus frequency tend to incur larger errors for the *P* → *A* latent-variable models and Rooth-LDA, whereas predicates with high corpus frequency tend to incur smaller errors; with the *A* → *P* models the effect is lessened but not reversed, suggesting that part of the effect may be inherent in the data set rather than in the prediction model.
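The per-item analysis can be sketched in code. Since Pearson's *r* is the mean of per-item products of standardized scores, applying the same decomposition to rank vectors yields per-item contributions to Spearman's *ρ*. This is a minimal sketch of one such pseudo-coefficient, assuming population standard deviations and untied ranks; the paper's exact definition may differ in normalization.

```python
def per_item_contributions(xs, ys):
    """Decompose Pearson's r into per-item terms: r = sum_i zx_i * zy_i / n.
    Applied to rank vectors, the terms are per-item contributions to
    Spearman's rho (assuming no tied ranks)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = (sum((x - mx) ** 2 for x in xs) / n) ** 0.5
    sy = (sum((y - my) ** 2 for y in ys) / n) ** 0.5
    # Each term is the product of the item's standardized scores, scaled by 1/n.
    return [((x - mx) / sx) * ((y - my) / sy) / n for x, y in zip(xs, ys)]

# Gold ranks vs. system ranks for four hypothetical items; the contributions
# sum to Spearman's rho, and large-magnitude terms flag difficult items.
contribs = per_item_contributions([1, 2, 3, 4], [1, 2, 4, 3])
rho = sum(contribs)
```

Items whose contribution is strongly negative are exactly those driving the correlation down, which is what the inspection in Table 7 relies on.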

### 4.4 Predicate–Argument Similarity

#### 4.4.1 Data

Mitchell and Lapata (2008, 2010) collected human judgments of similarity between pairs of predicates and arguments corresponding to minimal sentences. Mitchell and Lapata's explicit aim was to facilitate evaluation of general semantic compositionality models but their data sets are also suitable for evaluating predicate–argument representations.

Mitchell and Lapata (2008) used the BNC to extract 4 attested subject nouns for each of 15 verbs, yielding 60 reference combinations. Each verb–noun tuple was matched with two verbs that are synonyms of the reference verb in some contexts but not in others. In this way, Mitchell and Lapata created a data set of 120 pairs of predicate–argument combinations. Similarity judgments were obtained from human subjects for each pair on a Likert scale of 1–7. Examples of the resulting data items are given in Table 8. Mitchell and Lapata use six subjects' ratings as a development data set for setting model parameters and the remaining 54 subjects' ratings for testing. In this article we use the same split.

| Combination 1 | Combination 2 | Ratings |
|---|---|---|
| shoulder slump | shoulder slouch | 6, 7, 5, 5, 6, 5, 5, 7, 5, 5, 7, 5, 6, 6, 5, 6, 6, 6, 7, 5, 7, 6, 6, 5, 5, 5, 5, 6, 6, 7, 7, 7, 7, 7 |
| shoulder slump | shoulder decline | 2, 5, 4, 4, 3, 3, 2, 3, 2, 1, 3, 3, 6, 5, 3, 2, 1, 1, 1, 7, 4, 4, 6, 3, 5, 6 |

Mitchell and Lapata (2010) adopt a similar approach to data collection with the difference that instead of keeping arguments constant across combinations in a pair, both predicates and arguments vary across comparand combinations. They also consider a range of grammatical relations: verb–object, adjective–noun, and noun–noun modification. Human subjects rated similarity between predicate–argument combinations on a 1–7 scale as before; examples are given in Table 9. Inspection of the data suggests that the subjects' annotation may conflate semantic similarity and relatedness; for example, *football club* and *league match* are often given a high similarity score. Mitchell and Lapata again split the data into development and testing sections, the former comprising 54 subjects' ratings and the latter comprising 108 subjects' ratings.

| Combination 1 | Combination 2 | Ratings |
|---|---|---|
| stress importance | emphasize need | 6, 7, 7, 5, 5, 7, 7, 7, 6, 5, 6, 7, 3, 7, 7, 6, 7, 7 |
| ask man | stretch arm | 3, 1, 4, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1 |
| football club | league match | 7, 6, 7, 6, 6, 5, 5, 3, 6, 6, 4, 5, 4, 6, 2, 7, 5, 5 |
| education course | training program | 7, 7, 5, 5, 7, 5, 5, 7, 7, 4, 6, 2, 5, 6, 6, 7, 7, 4 |

The *ρ* correlation computed for each group is combined by averaging.^{16} The analogous approach for the Mitchell and Lapata (2008) data set calculates a single *ρ* value by pairing each annotator-item score with the system prediction for the appropriate item. Let **s** be the sequence of system predictions for |*I*| items and **y**_{a} be the scores assigned by annotator *a* ∈ *A* to those |*I*| items. Then the "concatenated" correlation *ρ*_{cat} is calculated as follows:^{17}

$$\rho_{cat} = \rho(\mathbf{y}_{cat}, \mathbf{s}_{cat}), \qquad \mathbf{y}_{cat} = \mathbf{y}_{a_1} \oplus \cdots \oplus \mathbf{y}_{a_{|A|}}, \qquad \mathbf{s}_{cat} = \underbrace{\mathbf{s} \oplus \cdots \oplus \mathbf{s}}_{|A| \text{ times}}$$

where ⊕ denotes sequence concatenation. The length of the **y**_{cat} and **s**_{cat} sequences is equal to the total number of annotator-item scores. For the Mitchell and Lapata (2010) data set, a *ρ*_{cat} value is calculated for each of the three annotator groups and these are then averaged. As Turney observes, this approach seems to have the effect of underestimating model quality relative to the inter-annotator agreement figure, which is calculated as the average intersubject correlation. Therefore, in addition to Mitchell and Lapata's *ρ*_{cat} evaluation, we also perform an evaluation that computes the average correlation *ρ*_{ave} between the system output and each individual annotator:

$$\rho_{ave} = \frac{1}{|A|} \sum_{a \in A} \rho(\mathbf{y}_{a}, \mathbf{s})$$
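The two measures can be sketched in pure Python; `spearman` here is a generic implementation using average ranks for ties, standing in for whatever rank-correlation routine is used in practice.

```python
def ranks(xs):
    """Average ranks: tied values share the mean of their rank positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

def spearman(xs, ys):
    return pearson(ranks(xs), ranks(ys))

def rho_cat(s, ys):
    """Concatenate all annotators' score sequences; repeat the system
    predictions once per annotator; compute a single correlation."""
    y_cat = [v for y_a in ys for v in y_a]
    s_cat = s * len(ys)
    return spearman(y_cat, s_cat)

def rho_ave(s, ys):
    """Average the per-annotator correlations with the system output."""
    return sum(spearman(y_a, s) for y_a in ys) / len(ys)
```

With `s` a list of per-item system scores and `ys` a list of per-annotator score lists, `rho_cat(s, ys)` implements the concatenated evaluation and `rho_ave(s, ys)` the per-annotator average; the latter is the quantity directly comparable to average intersubject agreement.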

#### 4.4.2 Models

For the Mitchell and Lapata (2008) data set we train the following models on the BNC corpus:

- **LDA**: An LDA selectional preference model of verb–subject co-occurrence, with similarity computed as described in Section 3.5. Similarity predictions *sim*(*n*, *o*|*c*) are averaged over five runs. We consider three models of context–target interaction, which in this case corresponds to verb–subject interaction:
  - **LDA**_{C→T}: Target generation is conditioned on the context, as in Equation (53).
  - **LDA**_{T→C}: Context generation is conditioned on the target, as in Equation (56).
  - **LDA**_{C↔T}: An average of the predictions made by LDA_{C→T} and LDA_{T→C}.

  All models are trained with |*Z*| = 100. As well as presenting results for an average over all predictors, we investigate whether the choice of predictors can be optimized by using the development data to select the best subset of predictors.
- **Mult**: Pointwise multiplication (6) using **Win5** co-occurrences.
- **M+L08**: The best-performing system of Mitchell and Lapata (2008), combining an additive and a multiplicative model and using window-based co-occurrences.
- **SVS**: The best-performing system of Erk and Padó (2008): the Structured Vector Space model (8), parameterized to use window-based co-occurrences and raising the expectation vector values (7) to the 20th power (this parameter was optimized on the development data).

For the Mitchell and Lapata (2010) data set we train the following models, again on the BNC corpus:

- **Rooth-LDA/Syn**: A Rooth-LDA model trained on the appropriate set of syntactic co-occurrences (verb–object, noun–noun modification, or adjective–noun), with the topic distribution calculated as in Equation (59).
- **LDA/Win5**: An LDA model trained on the **Win5** window-based co-occurrences. Because all observations are modeled using the same latent classes, the distributions *P*(*z*|*o*, *c*) (Equation (53)) for each word in the pair can be combined by taking a normalized product.
- **Combined**: This model averages the similarity predictions of the **Rooth-LDA/Syn** and **LDA/Win5** models.
- **Mult**: Pointwise multiplication (6) using **Win5** co-occurrences.

- **M+L10/Mult**: A multiplicative model (6) using a vector space based on window co-occurrences in the BNC.
- **M+L10/Best**: The best result for each grammatical relation from any of the semantic spaces and combination methods tested by Mitchell and Lapata. Some of these methods require parameters to be set through optimization on the development set.^{18}
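The normalized-product combination used to merge the per-word distributions *P*(*z*|*o*, *c*) over a shared set of latent classes can be sketched as follows; the input distributions here are illustrative.

```python
def combine_by_product(p1, p2):
    """Pointwise-multiply two distributions over the same latent classes,
    then renormalize so the result is again a probability distribution."""
    raw = [a * b for a, b in zip(p1, p2)]
    z = sum(raw)
    return [v / z for v in raw]

# Two words that both place most of their mass on class 0 reinforce
# each other: the combined distribution is sharper than either input.
combined = combine_by_product([0.5, 0.5], [0.8, 0.2])  # roughly [0.8, 0.2]
```

The product behaves like a conjunction: a class receives high combined weight only if both words assign it non-negligible probability, which is the desired behavior when the two words in a pair should jointly disambiguate each other.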

#### 4.4.3 Results

Results for the Mitchell and Lapata (2008) data set are presented in Tables 10 and 11.^{19} The LDA preference models clearly outperform the previous state of the art of *ρ*_{cat} = 0.27 (Erk and Padó 2008), with the best simple average of predictors scoring *ρ*_{cat} = 0.38, *ρ*_{ave} = 0.41, and the best optimized combination scoring *ρ*_{cat} = 0.39, *ρ*_{ave} = 0.41. This is comparable to the average level of agreement between human judges, estimated by Mitchell and Lapata to be *ρ*_{ave} = 0.40. Optimizing on the development data consistently gave better performance than averaging over all predictors, though in most cases the differences are small.

Results for the Mitchell and Lapata (2010) data set are presented in Tables 12 and 13.^{20} Again the latent-variable models perform well, comfortably outperforming the **Mult** baseline, and with just one exception the **Combined** models surpass Mitchell and Lapata's reported results. Combining the syntactic co-occurrence model **Rooth-LDA/Syn** and the window-based model **LDA/Win5** consistently gives the best performance, suggesting that the human ratings in this data set are sensitive to both strict similarity and a looser sense of relatedness. As Turney (2012) observes, the average-*ρ*_{cat}-per-group approach of Mitchell and Lapata leads to lower performance figures than averaging across annotators; with the latter approach (Table 12) the *ρ*_{ave} correlation values approach the level of human interannotator agreement for two of the three relations: noun–noun and adjective–noun modification.

### 4.5 Lexical Substitution

#### 4.5.1 Data

The data set for the English Lexical Substitution Task (McCarthy and Navigli 2009) consists of 2,010 sentences sourced from Web pages. Each sentence features one of 205 distinct target words that may be nouns, verbs, adjectives, or adverbs. The sentences have been annotated by human judges to suggest semantically acceptable substitutes for their target words. Table 14 gives example sentences and annotations for the target verb *charge*. For the original shared task the data was divided into development and test sections; in this article we follow subsequent work using parameter-free models and use the whole data set for testing.

| Sentence | Substitutes |
|---|---|
| Commission is the amount charged to execute a trade. | levy (2), impose (1), take (1), demand (1) |
| Annual fees are charged on a pro-rata basis to correspond with the standardized renewal date in December. | levy (2), require (1), impose (1), demand (1) |
| Meanwhile, George begins obsessive plans for his funeral…George, suspicious, charges to her room to confront them. | run (2), rush (2), storm (1), dash (1) |
| Realizing immediately that strangers have come, the animals charge them and the horses began to fight. | attack (5), rush at (1) |

The gold standard substitute annotations contain a number of multiword terms such as *rush at* and *generate electricity*. As it is impossible for a standard lexical distributional model to reason about such terms, we remove these substitutes from the gold standard.^{21} We remove entirely the 17 sentences that have only multiword substitutes in the gold standard, as well as 7 sentences for which no gold annotations are provided. This leaves 1,986 sentences.

The candidate substitute set for a target word such as *charge* is the union of the substitute lists in the gold standard for every sentence containing *charge* as a target word: *levy, impose, take, demand, require, run, rush, storm, dash, attack, …*. Evaluation of system predictions for a given sentence then involves comparing the ranking produced by the system with the implicit ranking produced by the annotators, assuming that any candidates not attested for the sentence appear with frequency 0 at the bottom of the ranking. Dinu and Lapata (2010) use Kendall's *τ*_{b}, a standard rank correlation measure that is appropriate for data containing tied ranks. Thater, Fürstenau, and Pinkal (2010, 2011) use Generalized Average Precision (GAP), a precision-like measure originally proposed by Kishida (2005) for information retrieval:

$$GAP = \frac{\sum_{i=1}^{n} I(x_i)\,\bar{x}_i}{\sum_{j=1}^{R} I(y_j)\,\bar{y}_j}, \qquad \bar{x}_i = \frac{1}{i}\sum_{k=1}^{i} x_k, \qquad \bar{y}_j = \frac{1}{j}\sum_{k=1}^{j} y_k$$

where *x*_{1}, …, *x*_{n} are the ranked candidate scores provided by the system, *y*_{1}, …, *y*_{R} are the ranked scores in the gold standard, and *I*(*x*) is an indicator function with value 1 if *x* > 0 and 0 otherwise.

In this article we report both *τ*_{b} and GAP scores, calculated individually for each sentence and averaged. The open-vocabulary design of the original Lexical Substitution Task facilitated the use of other evaluation measures such as “precision out of ten”: the proportion of the first 10 words in a system's ranked substitute list that are contained in the gold standard annotation for that sentence. This measure is not appropriate in the constrained-vocabulary scenario considered here; when there are fewer than 10 candidate substitutes for a target word, the precision will always be 1.
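The GAP computation can be sketched as follows, assuming `x` holds the gold-standard score of each candidate in the order of the system's ranking (0 for candidates the annotators did not propose) and `y` holds the gold-standard scores in their own descending order; *τ*_{b} would be computed analogously with any tie-aware rank-correlation routine.

```python
def gap(x, y):
    """Generalized Average Precision (Kishida 2005): sum the running average
    of gold scores at each attested position of the system ranking, then
    normalize by the same quantity computed on the ideal (gold) ranking."""
    def score(seq):
        total, prefix = 0.0, 0.0
        for i, v in enumerate(seq, start=1):
            prefix += v
            if v > 0:                # indicator I(v): attested candidates only
                total += prefix / i  # running average of the first i scores
        return total
    return score(x) / score(y)

# A system that reproduces the gold ordering scores 1.0; pushing the
# attested candidates to the bottom of the ranking lowers the score.
best = gap([2, 1, 0], [2, 1])
worst = gap([0, 1, 2], [2, 1])
```

Like average precision, GAP rewards placing high-weight gold substitutes early in the ranking, but it also credits the magnitude of the gold weights rather than treating all attested substitutes equally.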

#### 4.5.2 Models

We apply both window-based and syntactic models of similarity in context to the lexical substitution data set; we expect the latter to give more accurate predictions but to have incomplete coverage when a test sentence is not fully and correctly parsed or when the test lexical items were not seen in the appropriate contexts in training.^{22} We therefore also average the predictions of the two model types in the hope of attaining superior performance with full coverage.

The models we train on the **BNC** and combined **BNC + WIKI** corpora are as follows:

- **Win5**: An LDA model using 5-word-window contexts (so |*C*| ≤ 10), with similarity *P*(*z*|*o*, *C*) computed according to Equation (54).
- ***C* → *T***: An LDA model using syntactic co-occurrences, with similarity computed according to Equation (54).
- ***T* → *C***: An LDA model using syntactic co-occurrences, with similarity computed according to Equation (57).
- ***T* ↔ *C***: A model averaging the predictions of the *C* → *T* and *T* → *C* models.
- **Win5 +** *C* → *T*, **Win5 +** *T* → *C*, **Win5 +** *T* ↔ *C*: Models averaging the predictions of **Win5** and the appropriate syntactic model.
- **TFP11**: The vector-space model of Thater, Fürstenau, and Pinkal (2011). We report figures with and without backoff to lexical similarity between target and substitute words in the absence of a syntax-based prediction.

We also consider two baseline LDA models:

- **No Context**: A model that ranks substitutes *n* by computing the Bhattacharyya similarity between their topic distributions *P*(*z*|*n*) and the target word's topic distribution *P*(*z*|*o*).
- **No Similarity**: A model that ranks substitutes *n* by their context-conditioned probability *P*(*n*|*C*) only; this is essentially a language-modeling approach using syntactic "bigrams."

These baselines use the *T* ↔ *C* syntactic model, but performance is similar with other co-occurrence types.
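The Bhattacharyya similarity between two topic distributions, as used by the **No Context** baseline, has a one-line implementation for discrete distributions over the same latent classes:

```python
def bhattacharyya(p, q):
    """Bhattacharyya coefficient between discrete distributions over the
    same latent classes: 1.0 for identical distributions, 0.0 for
    distributions with disjoint support."""
    return sum((pz * qz) ** 0.5 for pz, qz in zip(p, q))
```

Because the coefficient is bounded in [0, 1] and symmetric, it is convenient for ranking substitute candidates by how closely their class profile matches that of the target word.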

Predictions for the LDA models are averaged over five runs for each setting of |*Z*| in the range {600,800,1000,1200}. In order to test statistical significance of differences between models we use stratified shuffling (Yeh 2000).^{23}
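The significance test can be sketched as a minimal approximate-randomization procedure in the spirit of Yeh (2000): per-sentence scores of the two systems being compared are swapped with probability 1/2 on each iteration, and the proportion of shuffles whose score difference matches or exceeds the observed one estimates the p-value. Details such as stratification structure and smoothing follow the cited software rather than this sketch.

```python
import random

def stratified_shuffle_test(scores_a, scores_b, n_iter=10000, seed=0):
    """Approximate randomization test for paired per-sentence scores.
    Returns an estimated two-sided p-value for the difference in means."""
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n
    hits = 0
    for _ in range(n_iter):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:  # swap this sentence's scores between systems
                a, b = b, a
            diff += a - b
        if abs(diff) / n >= observed:
            hits += 1
    return (hits + 1) / (n_iter + 1)  # add-one smoothing avoids a zero p-value
```

Pairing the scores sentence-by-sentence keeps per-item difficulty fixed across shuffles, so the test asks only whether the assignment of scores to systems could plausibly be due to chance.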

#### 4.5.3 Results

Table 15 presents results on the Lexical Substitution Task data set.^{24} As expected, the window-based LDA models attain good coverage but worse performance than the syntactic models. The combined model **Win5** + *T* ↔ *C* trained on **BNC+WIKI** gives the best scores (GAP = 49.5, *τ*_{b} = 0.23). Every combined model gives a statistically significant improvement (p < 0.01) over the corresponding window-based **Win5** model. Our **TFP11** reimplementation of Thater, Fürstenau, and Pinkal (2011) has slightly less than complete coverage, and performs worse than almost all combined LDA models. To compute statistical significance we only use the sentences for which **TFP11** made predictions; for both the **BNC** and **BNC+WIKI** corpora, the **Win5** + *T* ↔ *C* model gives a statistically significant (p < 0.05) improvement over **TFP11** for both GAP and *τ*_{b}, while **Win5** + *T* → *C* gives a significant improvement for GAP and *τ*_{b} on the BNC training corpus. The no-context and no-similarity baselines are clearly worse than the full models; this difference is statistically significant (p < 0.01) for both training corpora and all models.

Table 16 breaks performance down across the four parts of speech used in the data set. Verbs appear to present the most difficult substitution questions and also demonstrate the greatest beneficial effect of adding syntactic disambiguation to the basic **Win5** model. The full **Win5** + *T* ↔ *C* model outperforms our reimplementation of Thater, Fürstenau, and Pinkal (2011) on all parts of speech for the GAP statistic and on verbs and adjectives for *τ*_{b}, scoring a tie on nouns and adverbs. Table 16 also lists results reported by Dinu and Lapata (2010) and Thater, Fürstenau, and Pinkal (2010, 2011) for their models trained on the English Gigaword Corpus. This corpus is of comparable size to the **BNC+WIKI** corpus, but we note that the results reported by Thater, Fürstenau, and Pinkal (2011) are better than those attained by our reimplementation, suggesting that uncontrolled factors such as choice of corpus, parser, or dependency representation may be responsible. Thater, Fürstenau, and Pinkal's (2011) results remain the best reported for this data set; our **Win5** + *T* ↔ *C* results are better than those of Dinu and Lapata (2010) and Thater, Fürstenau, and Pinkal (2010) in this uncontrolled setting.

## 5. Conclusion

In this article we have shown that the probabilistic latent-variable framework provides a flexible and effective toolbox for distributional modeling of lexical meaning and gives state-of-the-art results on a number of semantic prediction tasks. One useful feature of this framework is that it induces a representation of semantic classes at the same time as it learns about selectional preference distributions. This can be viewed as a kind of coarse-grained sense induction or as a kind of concept induction. We have demonstrated that reasoning about these classes leads to an accurate method for calculating semantic similarity in context. By applying our models we attain state-of-the-art performance on a range of evaluations involving plausibility prediction, in-context similarity, and lexical substitution. The three models we have investigated—LDA, Rooth-LDA and Lex-LDA—all perform at a similar level for predicting plausibility, but in other cases the representation induced by one model may be more suitable than the others.

In future work, we anticipate that the same intuitions may lead to similarly accurate methods for other tasks where disambiguation is required; an obvious candidate would be traditional word sense disambiguation, perhaps in combination with the probabilistic WordNet-based preference models of Ó Séaghdha and Korhonen (2012). More generally, we expect that latent-variable models will prove useful in applications where other selectional preference models have been applied, for example, metaphor interpretation and semantic role labeling.

A second route for future work is to enrich the semantic representations that are learned by the model. As previously mentioned, probabilistic generative models are modular in the sense that they can be integrated in larger models. Bayesian methods for learning tree structures could be applied to learn taxonomies of semantic classes (Blei, Griffiths, and Jordan 2010; Blundell, Teh, and Heller 2010). Borrowing ideas from Bayesian hierarchical language modeling (Teh 2006), one could build a model of selectional preference and disambiguation in the context of arbitrarily long dependency *paths*, relaxing our current assumption that only the immediate neighbors of a target word affect its meaning. Our class-based preference model also suggests an approach to identifying regular polysemy alternation by finding class co-occurrences that repeat across words, offering a fully data-driven alternative to polysemy models based on WordNet (Boleda, Padó, and Utt 2012). In principle, any structure that can be reasoned about probabilistically, from syntax trees to coreference chains or semantic relations, can be coupled with a selectional preference model to incorporate disambiguation or lexical smoothing in a task-oriented architecture.

## Acknowledgments

The work in this article was funded by the EPSRC (grant EP/G051070/1) and by the Royal Society. We are grateful to Frank Keller and Mirella Lapata for sharing their data set of plausibility judgments; to Georgiana Dinu, Karl Moritz Hermann, Jeff Mitchell, Sebastian Padó, and Andreas Vlachos for offering information and advice; and to the anonymous *Computational Linguistics* reviewers, whose suggestions have substantially improved the quality of this article.

## Notes

The analogous example given by Ó Séaghdha (2010) relates to the plausibility of a manservant or a carrot laughing; Google no longer returns zero hits for <*a*|*the manservant*|*manservants*|*menservants laugh*|*laughs*|*laughed*> but a frequency-based estimate still puts the probability of a carrot laughing at 200 times that of a manservant laughing (1,680 hits against 81 hits).

WordNet also contains many other kinds of semantic relations besides hypernymy but these are not typically used for selectional preference modeling.

When specifically discussing selectional preferences, we will also use the terms **predicate** and **argument** to describe a co-occurrence pair; when restricted to syntactic predicates, the former term is synonymous with our definition of context.

Strictly speaking, *w* and *w*_{h} are drawn from subsets of the vocabulary that are licensed by *r* when *r* is a syntactic relation, that is, they must have parts of speech *p*_{d} and *p*_{h}, respectively. Our models assume a fixed argument vocabulary, so we can partition the training data according to part of speech; the models are agnostic regarding the predicate vocabulary as these are subsumed by the context vocabulary. In the interest of parsimony we leave this detail implicit in our notation.

However, the second point is often relaxed in application contexts where the posterior mean is used for inference (e.g., Section 3.4.2).

In the notation of Section 3.4, this estimate is given by .

An exception was made for the word *PC* as it appears in the Keller and Lapata (2003) data set used for evaluation.

Keller and Lapata only report Pearson's r correlations; as we do not have their per-item predictions we cannot calculate Spearman's *ρ* correlations or statistical significance scores.

We used the logistic regression implementation provided by LIBLINEAR (Fan et al. 2008), available at http://www.csie.ntu.edu.tw/∼cjlin/liblinear.

Bergsma, Lin, and Goebel (2008) also use features extracted from gazetteers. However, they observe that additional features only give a small improvement over co-occurrence features alone. We do not use such features here but hypothesize that the improvement would be even smaller in our experiments as the data do not contain proper nouns.

Results for LDA_{P → A} and Rooth-LDA were previously published in Ó Séaghdha (2010).

We do not compare against the system of Turney (2012) as Turney uses a different experimental design based on partitioning by phrases rather than annotators.

In practice the sequence of items is not the same for every annotator and the sequence of predictions **s** must be changed accordingly.

Ultimately, however, none of the combination methods needing optimization outperform the parameter-free methods in Mitchell and Lapata's results.

These results were not previously published.

An LDA model cannot make an informative prediction of *P*(*z*|*o*,*C*) if word *o* was never seen entering into at least one (unlexicalized) syntactic relation in *C*. Other syntactic models such as that of Thater, Fürstenau, and Pinkal (2011) face analogous restrictions.

We use the software package provided by Sebastian Padó at http://www.nlpado.de/∼sebastian/sigf.html.

Results for the LDA models were reported in Ó Séaghdha and Korhonen (2011).

## References

## Author notes

15 JJ Thomson Avenue, Cambridge, CB3 0FD, United Kingdom. E-mail: Diarmuid.O'Seaghdha@cl.cam.ac.uk.