Abstract
Formalizing “meaning as context” mathematically leads to a new, algebraic theory of meaning, in which composition is bilinear and associative. These properties are shared by other methods that have been proposed in the literature, including the tensor product, vector addition, point-wise multiplication, and matrix multiplication.
Entailment can be represented by a vector lattice ordering, inspired by a strengthened form of the distributional hypothesis, and a degree of entailment is defined in the form of a conditional probability. Approaches to the task of recognizing textual entailment, including the use of subsequence matching, lexical entailment probability, and latent Dirichlet allocation, can be described within our framework.
1. Introduction
This article presents the thesis that defining meaning as context leads naturally to a model in which meanings of strings are represented as elements of an associative algebra over the real numbers, and entailment is described by a vector lattice ordering. This model is general enough to encompass several proposed methods of composition in vector-based representations of meaning.
In recent years, the abundance of text corpora and computing power has allowed the development of techniques to analyze statistical properties of words. For example techniques such as latent semantic analysis (Deerwester et al. 1990) and its variants, and measures of distributional similarity (Lin 1998; Lee 1999), attempt to derive aspects of the meanings of words by statistical analysis, and statistical information is often used when parsing to determine sentence structure (Collins 1997). These techniques have proved useful in many applications within computational linguistics and natural language processing (Grefenstette 1994; Schütze 1998; Bellegarda 2000; Choi, Wiemer-Hastings, and Moore 2001; Lin 2003; McCarthy et al. 2004), arguably providing evidence that they capture something about the nature of words that should be included in representations of their meaning. However, it is very difficult to reconcile these techniques with existing theories of meaning in language, which revolve around logical and ontological representations. The new techniques, almost without exception, can be viewed as dealing with vector-based representations of meaning, placing meaning (at least at the word level) within the realm of mathematics and algebra; conversely the older theories of meaning dwell in the realm of logic and ontology. It seems there is no unifying theory of meaning to provide guidance to those making use of the new techniques.
The problem appears to be a fundamental one in computational linguistics because the whole foundation of meaning seems to be in question. The older, logical theories often subscribe to a model-theoretic philosophy of meaning (Kamp and Reyle 1993; Blackburn and Bos 2005). According to this approach, sentences should be translated to a logical form that can be interpreted as a description of the state of the world. The new vector-based techniques, on the other hand, are often closer in spirit to the philosophy of meaning as context, the idea that the meaning of an expression is determined by how it is used. This is an old idea with origins in the philosophy of Wittgenstein (1953), who said that “meaning just is use,” Firth's (1968) “You shall know a word by the company it keeps,” and the distributional hypothesis of Harris (1968), that words will occur in similar contexts if and only if they have similar meanings. This hypothesis is justified by the success of techniques such as latent semantic analysis as well as experimental evidence (Miller and Charles 1991). Although the two philosophies are not obviously incompatible—especially because the former applies mainly at the sentence level and the latter mainly at the word level—it is not clear how they relate to each other.
The problem of how to compose vector representations of meanings of words has recently received increased attention (Clark, Coecke, and Sadrzadeh 2008; Mitchell and Lapata 2008; Widdows 2008; Erk and Padó 2009; Baroni and Zamparelli 2010; Guevara 2011; Preller and Sadrzadeh 2011) although the problem has been considered in earlier work (Smolensky 1990; Landauer and Dumais 1997; Foltz, Kintsch, and Landauer 1998; Kintsch 2001). A solution to this problem would have practical as well as philosophical benefits. Current techniques such as latent semantic analysis work well at the word level, but we cannot extend them much beyond this, to the phrase or sentence level, without quickly encountering the data-sparseness problem: There are not enough occurrences of strings of words to determine what their vectors should be merely by looking in corpora. If we knew how such vectors should compose then we would be able to extend the benefits of the vector based techniques to the many applications that require reasoning about the meaning of phrases and sentences.
This article describes the results of our own efforts to identify a theory that can unite these two paradigms, introduced in the author's DPhil thesis (Clarke 2007). In addition, we also discuss the relationship between this theory and methods of composition that have recently been proposed in the literature, showing that many of them can be considered as falling within our framework.
Our approach in identifying the framework is summarized in Figure 1:
Inspired by the philosophy of meaning as context and vector-based techniques we developed a mathematical model of meaning as context, in which the meaning of a string is a vector representing contexts in which that string occurs in a hypothetical infinite corpus.
The theory on its own is not useful when applied to real-world corpora because of the problem of data sparseness. Instead we examine the mathematical properties of the model, and abstract them to form a framework which contains many of the properties of the model. Implementations of the framework are called context theories because they can be viewed as theories about the contexts in which strings occur. By analogy with the term “model-theoretic” we use the term “context-theoretic” for concepts relating to context theories, thus we call our framework the context-theoretic framework.
In order to ensure that the framework was practically useful, context theories were developed in parallel with the framework itself. The aim was to be able to describe existing approaches to representing meaning within the framework as fully as possible. In particular, we required that the framework should:
provide some guidelines describing in what way the representation of a phrase or sentence should relate to the representations of the individual words as vectors;
require information about the probability of a string of words to be incorporated into the representation;
provide a way to measure the degree of entailment between strings based on the particular meaning representation;
be general enough to encompass logical representations of meaning; and
be able to incorporate the representation of ambiguity and uncertainty, including statistical information such as the probability of a parse or the probability that a word takes a particular sense.
The contribution of this article is as follows:
We define the context-theoretic framework and introduce the mathematics necessary to understand it. The description presented here is cleaner than that of Clarke (2007), and in addition we provide examples that should provide intuition for the concepts we describe.
We relate the framework to methods of composition that have been proposed in the literature, namely:
– vector addition (Landauer and Dumais 1997; Foltz, Kintsch, and Landauer 1998);
– the tensor product (Smolensky 1990; Clark and Pulman 2007; Widdows 2008);
– the multiplicative models of Mitchell and Lapata (2008);
– matrix multiplication (Baroni and Zamparelli 2010; Rudolph and Giesbrecht 2010);
– the approach of Clark, Coecke, and Sadrzadeh (2008).
It is important to note that the purpose of describing related work in terms of our framework is not merely to demonstrate the generality of our framework: In doing so, we identify previously ignored features of this work such as the lattice structure within the vector space. This allows any one of these approaches to be endowed with an entailment property defined by this lattice structure, based on a philosophy of meaning as context.
Although the examples described here show that existing approaches can be described within the framework and show some of its potential, they cannot demonstrate its full power. The mathematical structures we make use of are extremely general, and we hope that in the future many interesting discoveries will be made by exploring the realm we identify here.
Our approach in defining the framework may be perceived as overly abstract; however, we believe this approach has many potential benefits, because approaches to composition which may have been considered unrelated (such as the tensor product and vector addition) are now shown to be related. This means that when studying such constructions, work can be avoided by considering the general case, for the same reason that class inheritance aids code reuse. For example, definitions given in terms of the framework can be applied to all instances, such as our definition of a degree of entailment. We also hope to motivate people to prove theorems in terms of the framework, having demonstrated its wide applicability.
The remainder of the article is as follows: In Section 2 we define our framework, introducing the necessary definitions, and showing how related work fits into the framework. In Section 3 we introduce our motivating example, showing that a simple mathematical definition of the notions of “corpus” and “context” leads to an instance of our framework. In Section 4, we describe specific instances of our framework in application to the task of recognizing textual entailment. In Section 5 we show how the sophisticated approach of Clark, Coecke, and Sadrzadeh (2008) can be described within our framework. Finally, in Section 6 we present our conclusions and plans for further work.
2. Context Theory
In this section, we define the fundamental concept of our concern, a context theory, and discuss its properties. The definition is an abstraction of both the more commonly used methods of defining composition in vector-based semantics and our motivating example of meaning as context, described in the next section. Because of its relation to this motivating example, a context theory can be thought of as a hypothesis describing in what contexts all strings occur.
Definition 1 (Context Theory)
A context theory is a tuple ⟨A, 𝒜, ξ, V, ψ⟩, where A is a set (the alphabet), 𝒜 is a unital algebra over the real numbers, ξ is a function from A to 𝒜, V is an abstract Lebesgue space, and ψ is an injective linear map from 𝒜 to V.
We will explain each part of this definition, introducing the necessary mathematics as we proceed. We assume the reader is familiar with linear algebra; see Halmos (1974) for definitions that are not included here.
2.1 Algebra over a Field
We have identified an algebra over a field (or simply algebra when there is no ambiguity) as an important construction because it generalizes nearly all the methods of vector-based composition that have been proposed. An algebra adds a multiplication operation to a vector space; the vector space is intended to describe meaning, and it is this multiplication operation that defines the composition of meaning in context-theoretic semantics.
Definition 2 (Algebra over a Field)
An algebra over a field F is a vector space 𝒜 over F together with a binary operation · on 𝒜 that is bilinear, so that a·(βb + γc) = β(a·b) + γ(a·c) and (βb + γc)·a = β(b·a) + γ(c·a) for all a, b, c ∈ 𝒜 and all β, γ ∈ F. The algebra is associative if (a·b)·c = a·(b·c) for all a, b, c ∈ 𝒜, and unital if there is an element 1 ∈ 𝒜, the unity, such that 1·a = a·1 = a for all a ∈ 𝒜. In this article we are only concerned with algebras over the field ℝ of real numbers.
Example 1
The square real-valued matrices of order n form a real unital associative algebra under standard matrix multiplication. The vector operations are defined entry-wise. The unity element of the algebra is the identity matrix.
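The following is a minimal sketch of this example, with invented 2×2 matrices standing in for word meanings; it simply checks the algebraic properties named above.

```python
import numpy as np

# Toy matrices (invented values) representing hypothetical word meanings.
A = np.array([[1.0, 0.5],
              [0.0, 2.0]])
B = np.array([[0.0, 1.0],
              [1.0, 0.0]])
C = np.array([[2.0, 0.0],
              [0.5, 1.0]])
I = np.eye(2)  # the unity element of the algebra

# Bilinearity: (2A + 3B)C == 2(AC) + 3(BC)
assert np.allclose((2*A + 3*B) @ C, 2*(A @ C) + 3*(B @ C))
# Associativity: (AB)C == A(BC)
assert np.allclose((A @ B) @ C, A @ (B @ C))
# Unity: IA == AI == A
assert np.allclose(I @ A, A) and np.allclose(A @ I, A)
# Unlike point-wise multiplication or addition, the product is not commutative.
print(np.allclose(A @ B, B @ A))  # False
```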
This means that our proposal is more general than that of Rudolph and Giesbrecht (2010), who suggest using matrix multiplication as a framework for distributional semantic composition. The main differences in our proposal are as follows.
We allow dimensionality to be infinite, instead of restricting ourselves to finite-dimensional matrices.
Matrix algebras form a *-algebra, whereas we do not currently impose this requirement.
Many of the vector spaces used in computational linguistics have an implicit lattice structure; we emphasize the importance of this structure and use the associated partial ordering to define entailment.








The algebra is what tells us how meanings compose. A crucial part of our thesis is that meanings can be represented by elements of an algebra, and that the type of composition that can be defined using an algebra is general enough to describe the composition of meaning in natural language. To go some way towards justifying this, we give several examples of algebras that describe methods of composition that have been proposed in the literature: namely, point-wise multiplication (Mitchell and Lapata 2008), vector addition (Landauer and Dumais 1997; Foltz, Kintsch, and Landauer 1998), and the tensor product (Smolensky 1990; Clark and Pulman 2007; Widdows 2008).
Example 2 (Point-wiseMultiplication)
Example of possible occurrences for three terms in three different contexts.

 | d1 | d2 | d3
---|---|---|---
cat | 0 | 2 | 3
animal | 2 | 1 | 2
big | 1 | 3 | 0
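As a minimal sketch (using the counts in the table above; the interpretation is ours, for illustration only), composition by point-wise multiplication works as follows.

```python
import numpy as np

# Context vectors taken from the table above (contexts d1, d2, d3).
cat    = np.array([0.0, 2.0, 3.0])
animal = np.array([2.0, 1.0, 2.0])
big    = np.array([1.0, 3.0, 0.0])

# Point-wise (Hadamard) multiplication as the product of the algebra:
# each component of the composed vector is the product of the components.
big_cat = big * cat  # array([0., 6., 0.])

# The product is bilinear, associative, and commutative; the all-ones vector
# acts as the unity element.
assert np.allclose((big + animal) * cat, big * cat + animal * cat)
assert np.allclose(big * cat, cat * big)
print(big_cat)
```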



Example 3 (Additive Algebra)



Point-wise multiplication and addition are not ideal as methods for composing meaning in natural language because they are commutative; although it is often useful to consider the simpler, commutative case, natural language itself is inherently non-commutative. One obvious method of composing vectors that is not commutative is the tensor product. This method of composition can be viewed as a product in an algebra by considering the tensor algebra, which is formed from direct sums of all tensor powers of a base vector space.
Example 4
The multiplicative models of Mitchell and Lapata (2008) correspond to the class of finite-dimensional algebras. Let V be a finite-dimensional vector space. Then every associative bilinear product · on V can be described by a linear function T from V ⊗ V to V, as required in Mitchell and Lapata's model. To see this, consider the action of the product · on two orthonormal basis vectors a and b of V. This is a vector in V, thus we can define T(a ⊗ b) = a·b. By considering all basis vectors, we can define the linear function T.
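The following small numerical check illustrates this correspondence; point-wise multiplication is used as the bilinear product purely for illustration, and the construction of T from basis vectors follows the argument above.

```python
import numpy as np

n = 3
basis = np.eye(n)

# A bilinear product on R^n; here, point-wise multiplication (our choice).
def product(a, b):
    return a * b

# Build T : R^n (x) R^n -> R^n from its action on basis vectors,
# T(e_i (x) e_j) = e_i . e_j, stored as an n x (n*n) matrix.
T = np.zeros((n, n * n))
for i in range(n):
    for j in range(n):
        T[:, i * n + j] = product(basis[i], basis[j])

# For arbitrary vectors, the product equals T applied to the Kronecker product.
a = np.array([1.0, 2.0, 0.5])
b = np.array([0.0, 3.0, 4.0])
assert np.allclose(T @ np.kron(a, b), product(a, b))
```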
Example 5 (Tensor Algebra)
The tensor algebra T(V) over a vector space V is the direct sum of all tensor powers of V, that is, ℝ ⊕ V ⊕ (V ⊗ V) ⊕ (V ⊗ V ⊗ V) ⊕ ⋯, with multiplication defined on pure tensors by the tensor product, (u1 ⊗ ⋯ ⊗ um)·(v1 ⊗ ⋯ ⊗ vn) = u1 ⊗ ⋯ ⊗ um ⊗ v1 ⊗ ⋯ ⊗ vn, and extended bilinearly. This is a unital associative algebra whose unity is the element 1 ∈ ℝ. Representing words as elements of V and composing them with this product recovers the tensor product method of composition.
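A sketch of composition in the tensor algebra, using toy word vectors with invented values; np.kron gives the coordinates of the tensor product of two vectors.

```python
import numpy as np

# Toy word vectors (invented values) in a 2-dimensional base space V.
u = np.array([1.0, 2.0])
v = np.array([0.5, 0.0])
w = np.array([3.0, 1.0])

# In the tensor algebra the product of elements of V is their tensor product,
# which lives in V (x) V; composing again moves to the third tensor power.
uv = np.kron(u, v)
uvw = np.kron(uv, w)

# The product is associative ...
assert np.allclose(np.kron(np.kron(u, v), w), np.kron(u, np.kron(v, w)))
# ... but, unlike point-wise multiplication or addition, not commutative.
print(np.allclose(np.kron(u, v), np.kron(v, u)))  # False
```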
2.2 Vector Lattices
The next part of the definition specifies an abstract Lebesgue space. This is a special kind of vector lattice, or even more generally, a partially ordered vector space. This lattice structure is implicit in most vector spaces used in computational linguistics, and an important part of our thesis is that the partial ordering can be interpreted as an entailment relation.
Definition 3 (Partially Ordered Vector Space)
A partially ordered vector space V is a vector space together with a partial ordering ≤ on V such that if u ≤ v then u + w ≤ v + w and αu ≤ αv, for all u, v, w ∈ V and all real numbers α ≥ 0. If, in addition, every pair of elements u, v ∈ V has a least upper bound u ∨ v and a greatest lower bound u ∧ v, then V is called a vector lattice.
Example 6 (Lattice Operations on ℝn)
In ℝn, with respect to the standard basis, the partial ordering is defined component-wise: u ≤ v if and only if ui ≤ vi for each i. The lattice operations are then the component-wise minimum and maximum: (u ∧ v)i = min(ui, vi) and (u ∨ v)i = max(ui, vi).
Vector representations of the terms orange and fruit based on hypothetical occurrences in six documents and their vector lattice meet (the darker shaded area).
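A small sketch of these operations, using hypothetical counts (not the exact values behind the figure):

```python
import numpy as np

# Hypothetical occurrence counts of two terms in six documents.
orange = np.array([2.0, 0.0, 3.0, 1.0, 0.0, 4.0])
fruit  = np.array([1.0, 2.0, 5.0, 0.0, 0.0, 2.0])

# Component-wise lattice operations with respect to the document basis.
meet = np.minimum(orange, fruit)  # greatest lower bound: the shared occurrences
join = np.maximum(orange, fruit)  # least upper bound

# The partial ordering: u <= v iff every component of u is <= that of v.
entails = bool(np.all(meet == orange))  # does orange <= fruit hold here?
print(meet, join, entails)
```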
The vector operations of addition and multiplication by scalar, which can be defined in a similar component-wise fashion, are nevertheless independent of the particular basis chosen. Conversely, the lattice operations depend on the choice of basis, so the operations as defined herein would behave differently if the components were written using a different basis. We argue that it makes sense for us to consider these properties of vectors in the context of computational linguistics because we can often have a distinguished basis: namely, the one defined by the contexts in which terms occur. Of course it is true that techniques such as latent semantic analysis introduce a new basis which does not have a clear interpretation in relation to contexts; nevertheless they nearly always identify a distinguished basis which we can use to define the lattice operations. Because our aim is a theory of meaning as context, we should include in our theory a description of the lattice structure which arises out of consideration of these contexts.
We argue that the mere association of words with vectors is not enough to constitute a theory of meaning: a theory of meaning must allow us to interpret these vectors. In particular, it should be able to tell us whether one meaning entails or implies another; indeed this is one meaning of the verb to mean. Entailment is not a symmetric relation: “x entails y” does not have the same meaning as “y entails x”. Vector representations allow the measurement of similarity or distance, through an inner product or metric; this is a symmetric relation, however, and so cannot be suitable for describing entailment.
In propositional and first order logic, the entailment relation is a partial ordering; in fact it is a Boolean algebra, which is a special kind of lattice. It seems natural to consider whether the lattice structure that is inherent in the vector representations used in computational linguistics can be used to model entailment.
We believe our framework is suited to all vector-based representations of natural language meaning, however the vectors are obtained. Given this generality, we can only justify the assumption that the partial order structure of the vector space is suitable to represent the entailment relation by observing that it has the right kind of properties we would expect of this relation.
There may be more justification for this assumption, however, based on the case where the vectors for terms are simply their frequencies of occurrences in n different contexts, so that they are vectors in ℝn. In this case, the relation ξ(x) ≤ ξ(y) means that y occurs at least as frequently as x in every context. This means that y occurs in at least as wide a range of contexts as x, and occurs at least as frequently as x. Thus the statement “x entails y if and only if ξ(x) ≤ ξ(y)” can be viewed as a stronger form of the distributional hypothesis of Harris (1968).
In fact, this idea can be related to the notion of distributional generality, introduced by Weeds, Weir, and McCarthy (2004) and developed by Geffet and Dagan (2005). A term x is distributionally more general than another term y if y occurs in a subset of the contexts that x occurs in. The idea is that distributional generality may be connected to semantic generality. An example of this is the hypernymy or is-a relation that is used to express generality of concepts in ontologies; for example, the term animal is a hypernym of dog because a dog is an animal. Weeds, Weir, and McCarthy (2004, p. 1019) explain the connection to distributional generality as follows:
Although one can obviously think of counter-examples, we would generally expect that the more specific term dog can only be used in contexts where animal can be used and that the more general term animal might be used in all of the contexts where dog is used and possibly others. Thus, we might expect that distributional generality is correlated with semantic generality…
Our proposal, in the case where words are represented by frequency vectors, can be considered a stronger version of distributional generality, where the additional requirement is on the frequency of occurrences. In practice, this assumption is unlikely to be compatible with the ontological view of entailment. For example the term entity is semantically more general than the term animal; however, entity is unlikely to occur more frequently in each context, because it is a rarer word. A more realistic foundation for this assumption might be if we were to consider the components for a word to represent the plausibility of observing the word in each context. The question then, of course, is how such vectors might be obtained. Another possibility is to attempt to weight components in such a way that entailment becomes a plausible interpretation for the partial ordering relation.
Even if we allow for such alternatives, however, in general it is unlikely that the relation will hold between any two strings, because u ≤ v iff ui ≤ vi for each component, ui,vi, of the two vectors. Instead, we propose to allow for degrees of entailment. We take a Bayesian perspective on this, and suggest that the degree of entailment should take the form of a conditional probability. In order to define this, however, we need some additional structure on the vector lattice that allows it to be viewed as a description of probability, by requiring it to be an abstract Lebesgue space.
Definition 4 (Banach Lattice)
A Banach lattice V is a vector lattice together with a norm ∥·∥ such that ∥u∥ ≤ ∥v∥ whenever |u| ≤ |v| (where |u| = u ∨ (−u)), and such that V is complete with respect to ∥·∥.
Definition 5 (Abstract Lebesgue Space)
An abstract Lebesgue space (or AL-space) is a Banach lattice V in which the norm is additive on positive elements: ∥u + v∥ = ∥u∥ + ∥v∥ for all u, v ≥ 0 in V.
Example 7 (ℓp Spaces)
For a countable set S and 1 ≤ p < ∞, the space ℓp(S) consists of the functions f from S to ℝ for which the sum of |f(x)|^p over all x ∈ S is finite, with norm ∥f∥p = (∑x ∈ S|f(x)|^p)^(1/p); the space ℓ∞(S) consists of the bounded functions on S, with norm ∥f∥∞ = supx ∈ S|f(x)|. With the component-wise partial ordering and lattice operations these are Banach lattices, and ℓ1(S) with the ℓ1 norm is an abstract Lebesgue space.
The finite-dimensional real vector spaces ℝn can be considered as special cases of the sequence spaces (consisting of vectors in which all but n components are zero) and, because they are finite-dimensional, we can use any of the ℓp norms. Thus, our previous examples, in which ξ mapped terms to vectors in ℝn, can be considered as mapping to abstract Lebesgue spaces if we adopt the ℓ1 norm.
2.3 Degrees of Entailment
We propose that in vector-based semantics, a degree of entailment is more appropriate than a black-and-white observation of whether or not entailment holds. If we think of the vectors as describing “degrees of meaning,” it makes sense that we should then look for degrees of entailment.
Conditional probability is closely connected to entailment: If A entails B, then P(B|A) = 1. Moreover, if A and B are mutually exclusive, then P(A|B) = P(B|A) = 0. It is thus natural to think of conditional probability as a degree of entailment.
An abstract Lebesgue space has many of the properties of a probability space, where the set operations of a probability space are replaced by the lattice operations of the vector space. This means that we can think of an abstract Lebesgue space as a vector-based probability space. Here, events correspond to positive elements with the norm less than or equal to 1; the probability of an event u is given by its norm ∥u∥ (which we shall always assume is the ℓ1 norm), and the joint probability of two events u and v is ∥u ∧ v∥.
Definition 6 (Degree of Entailment)
For positive elements u and v of an abstract Lebesgue space, the degree to which u entails v is defined as the conditional probability ∥u ∧ v∥/∥u∥. For a context theory ⟨A, 𝒜, ξ, V, ψ⟩, with ξ extended from symbols to strings using the multiplication of the algebra, ξ(a1a2⋯an) = ξ(a1)ξ(a2)⋯ξ(an), the degree to which a string x entails a string y is the degree to which ψ(ξ(x)) entails ψ(ξ(y)).
Example 8
An important question is how this context-theoretic definition of the degree of entailment relates to more familiar notions of entailment. There are three main ways in which the term entailment is used:
the model-theoretic sense of entailment in which a theory A entails a theory B if every model of A is also a model of B. It was shown in Clarke (2007) that this type of entailment can be described using context theories, where sentences are represented as projections on a vector space.
entailment between terms (as expressed for example in the WordNet hierarchy), for example the hypernymy relation between the terms cat and animal encodes the fact that a cat is an animal. In Clarke (2007) we showed that such relations can be encoded in the partial order structure of a vector lattice.
human common-sense judgments as to whether one sentence entails or implies another sentence, as used in the Recognising Textual Entailment Challenges (Dagan, Glickman, and Magnini 2005).
Our definition is more general than the model-theoretic and hypernymy notions of entailment, however, as it allows the measurement of a degree of entailment between any two strings: As an extreme example, one may measure the degree to which not a entails in the. Although this may not be useful or philosophically meaningful, we view it as a practical consequence of the fact that every string has a vector representation in our model, which coincides with the current practice in vector-based compositionality techniques (Clark, Coecke, and Sadrzadeh 2008; Widdows 2008).
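To make the computation concrete, the following is a minimal sketch (with invented frequency vectors) of the degree of entailment as just defined, using the ℓ1 norm and the component-wise lattice meet.

```python
import numpy as np

def degree_of_entailment(u, v):
    """Degree to which u entails v: ||u ^ v||_1 / ||u||_1 (a conditional probability)."""
    u, v = np.abs(u), np.abs(v)      # work with positive elements
    meet = np.minimum(u, v)          # lattice meet = component-wise minimum
    return meet.sum() / u.sum()

# Invented context frequencies over five contexts.
cat    = np.array([0.0, 2.0, 3.0, 1.0, 0.0])
animal = np.array([2.0, 2.0, 4.0, 1.0, 3.0])

print(degree_of_entailment(cat, animal))   # 1.0: cat <= animal component-wise
print(degree_of_entailment(animal, cat))   # 0.5: only partial entailment
```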
2.4 Lattice Ordered Algebras
A lattice-ordered algebra merges the lattice ordering of the vector space V with the product of the algebra. This structure encapsulates the ordering properties that are familiar from multiplication in matrices and elementary arithmetic. For this reason, many proposed methods of composing vector-based representations of meaning can be viewed as lattice-ordered algebras. The only reason we have not included it as a requirement of the framework is that our motivating example (described in the next section) is not guaranteed to have this property, although it does give us a partially ordered algebra.
Definition 7 (Partially Ordered Algebra)
A partially ordered algebra is an algebra 𝒜 which is also a partially ordered vector space, and which satisfies u·v ≥ 0 for all u, v ≥ 0 in 𝒜. If the partial ordering is a lattice, then 𝒜 is called a lattice-ordered algebra.
Example 9 (Lattice-Ordered Algebra of Matrices)
The matrices of order n form a lattice-ordered algebra under normal matrix multiplication, where the lattice operations are defined as the entry-wise minimum and maximum.
Example 10 (Operators on ℓp Spaces)
If 𝒜 is a lattice-ordered algebra which is also an abstract Lebesgue space, then ⟨A, 𝒜, ξ, 𝒜, 1⟩ is a context theory. In this simplified situation, 𝒜 plays the role of the vector lattice as well as the algebra; ξ maps from A to 𝒜 as before, and 1 indicates the identity map on 𝒜. Many of the examples we discuss will be of this form, so we will use the shorthand notation ⟨A, 𝒜, ξ⟩. It is tempting to adopt this as the definition of a context theory; as we will see in the next section, however, this is not supported by our prototypical example of a context theory, as in this case the algebra is not necessarily lattice-ordered.
3. Context Algebras
In this section we describe the prototypical examples of a context theory, the context algebras. The definition of a context algebra originates in the idea that the notion of “meaning as context” can be extended beyond the word level to strings of arbitrary length. In fact, the notion of context algebra can be thought of as a generalization of the syntactic monoid of a formal language: Instead of a set of strings defining the language, we have a fuzzy set of strings, or more generally, a real-valued function on a free monoid.
We call such functions real-valued languages and they take the place of formal languages in our theory. We attach a real number to each string which is intended as an indication of its importance or likelihood of being observed; for example, those with a value of zero are considered not to occur.
Definition 8 (Real-Valued Language)
Let A be a finite set of symbols. A real-valued language (or simply a language when there is no ambiguity) L on A is a function from A* to ℝ. If the range of L is a subset of ℝ+ then L is called a positive language. If the range of L is a subset of [0,1] then L is called a fuzzy language. If L is a positive language such that ∑x ∈ A*L(x) = 1, then L is a probability distribution over A*, which we call a distributional language.
One possible interpretation for L when it is a distributional language is that L(x) is the probability of observing the string x when selecting a document at random from an infinite collection of documents.


Note that probability distributions are in ℓ1(A*) and fuzzy languages are in ℓ∞(A*). If L ∈ ℓ1(A*) + (the space of positive functions on A* such that the sum of all values of the function is finite) then we can define a probability distribution pL over A* by pL(x) = L(x)/∥L∥1. Similarly, if L ∈ ℓ∞(A*) + (the space of bounded positive functions on A*) then we can define a fuzzy language fL by fL(x) = L(x)/∥L∥∞.
Example 11
Given a finite set of strings C ⊂ A*, which we may imagine to be a corpus of documents, define L(x) = 1/|C| if x ∈ C, or 0 otherwise. Then L is a probability distribution over A*.
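The following is a minimal sketch of this example with a toy corpus of our own:

```python
# A toy corpus C of documents, each a string of symbols.
C = {
    "the cat sat",
    "the dog sat",
    "a dog sat on the mat",
}

# L(x) = 1/|C| if x is in C, and 0 otherwise, as in Example 11.
def L(x: str) -> float:
    return 1.0 / len(C) if x in C else 0.0

assert abs(sum(L(x) for x in C) - 1.0) < 1e-9  # L is a probability distribution
print(L("the cat sat"), L("the fish sat"))
```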
In general, we think of a real-valued language as an abstraction of a corpus; in particular, we think of a corpus as a finite sample of a distributional language representing all possible documents that could ever be written.
Example 12
Let L be a language such that L(x) = 0 for all but a finite subset of A*. Then L ∈ ℓp(A*) for all p.
Example 13
Let L be the language defined by L(x) = |x|, where |x| is the length of (i.e., the number of symbols in) the string x. Then L is a positive language which is not bounded: For any string y there exists a z such that L(z) > L(y), for example z = ay for a ∈ A.
Example 14
Let L be the language defined by L(x) = 1/2 for all x. Then L is a fuzzy language, but L ∉ ℓ1(A*).
We will assume now that L is fixed, and consider the properties of contexts of strings with respect to this language. As in a syntactic monoid, we consider the context to be everything surrounding the string, although in this case instead of a set of pairs of strings we have a function from pairs of strings to the real numbers. We emphasize the vector nature of these real-valued functions by calling them “context vectors.” Our thesis is centered around these vectors, and it is their properties that form the inspiration for the context-theoretic framework.
Definition 9 (Context Vectors)
The context vector x̂ of a string x ∈ A* is the real-valued function on A*×A* defined by x̂(y,z) = L(yxz).
In other words, x̂ is a function from pairs of strings to the real numbers, and the value of x̂ at (y,z) is the value of x in the context (y,z), which is L(yxz).
The question we are addressing is: Does there exist some algebra containing the context vectors of strings in A* such that x̂ · ŷ is the context vector of the concatenation xy, where x,y ∈ A* and · indicates multiplication in the algebra? As a first try, consider the vector space ℓ∞(A*×A*) in which the context vectors live. Is it possible to define multiplication on the whole vector space such that the condition just specified holds?
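To make the definition concrete, the following toy sketch (our own corpus and code, not part of the original development) computes context vectors directly from a small distributional language.

```python
from collections import defaultdict

# Toy corpus: each document is a tuple of symbols, with uniform probability.
C = [("the", "cat", "sat"), ("the", "dog", "sat"), ("a", "cat", "sat", "down")]
L = defaultdict(float)
for doc in C:
    L[doc] += 1.0 / len(C)

def context_vector(x):
    """Return the context vector of x as a dict mapping (y, z) to L(y + x + z)."""
    x = tuple(x)
    vec = defaultdict(float)
    for w, p in L.items():
        # every way of writing w = y + x + z contributes L(w) to the context (y, z)
        for i in range(len(w) - len(x) + 1):
            if w[i:i + len(x)] == x:
                vec[(w[:i], w[i + len(x):])] += p
    return vec

cat_hat = context_vector(["cat"])
print(dict(cat_hat))
print(sum(cat_hat.values()))  # the l1 norm of the context vector
```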
Example 15




Definition 10 (Generated Subspace Â)
The generated subspace Â is the subspace of ℓ∞(A*×A*) generated by the set of context vectors {x̂ : x ∈ A*}.
In other words, it is the space of all vectors formed from linear combinations of context vectors.
Because of the way we define the subspace, there will always exist some basis B = {x̂ : x ∈ S}, where S ⊆ A*, and we can define multiplication on this basis by taking x̂ · ŷ to be the context vector of the concatenation xy, where x̂, ŷ ∈ B. Defining multiplication on the basis defines it for this whole vector subspace, because we require multiplication to be bilinear, making Â an algebra.
There are potentially many different bases we could choose, however, each corresponding to a different subset of A*, and each potentially giving rise to a different definition of multiplication. Remarkably, this is not a problem.
Proposition 1 (Context Algebra)
Multiplication on Â is the same irrespective of the choice of basis B.
Proof
We say that a set S ⊆ A* defines a basis B for Â when B = {x̂ : x ∈ S} is a basis for Â. Assume there are two sets S1 and S2 that define corresponding bases B1 and B2 for Â. We will show that multiplication in the basis B1 is the same as in the basis B2.







Example 16


The notion of a context theory is founded on the prototypical example given by context vectors. So far we have shown that multiplication can be defined on the vector space Â generated by context vectors of strings; we have not discussed the lattice properties of the vector space, however. In fact, Â does not come with a natural lattice ordering that makes sense for our purposes, although the original space of real-valued functions on A*×A* does—it is isomorphic to the sequence space. Thus ⟨A, Â, ξ, ℓ1(A*×A*), ψ⟩ will form our context theory, where ξ(a) = â for a ∈ A and ψ is the canonical map that simply maps elements of Â to themselves, considered as elements of the larger space. There is an important caveat here, however: We required that the vector lattice be an abstract Lebesgue space, which means we need to be able to define a norm on it. The ℓ1 norm is an obvious candidate, although it is not guaranteed to be finite. This is where the nature of the underlying language L becomes important.
We might hope that the most restrictive class of the languages we discussed, the distributional languages, would guarantee that the norm is finite. Unfortunately, this is not the case, as the following example demonstrates.
Example 17




The problem in the previous example is that the average string length is infinite. If we restrict ourselves to distributional languages in which the average string length is finite, then the problem goes away.
Proposition 2
If L is a distributional language with finite average string length, then the context vector x̂ of every string x ∈ A* lies in ℓ1(A*×A*).
Proof
If L is of finite average length, then x̂ ∈ ℓ1(A*×A*) for every x ∈ A*, and hence every element of Â lies in ℓ1(A*×A*), and so ⟨A, Â, ξ, ℓ1(A*×A*), ψ⟩ is a context theory, where ψ is the canonical map from Â to ℓ1(A*×A*). Thus context algebras of finite average length languages provide our prototypical examples of context theories.
3.1 Discussion
The benefit of the context-theoretic framework is in providing a space of exploration for models of meaning in language. Our effort has been in finding principles by which to define the boundaries of this space. Each of the key boundaries, namely, bilinearity and associativity of multiplication and entailment through vector lattice structure, can also be viewed as limitations of the model.
Bilinearity is a strong requirement to place, and has wide-ranging implications for the way meaning is represented in the model. It can be interpreted loosely as follows: Components of meaning persist or diminish but do not spontaneously appear. This is particularly counterintuitive in the case of idiom and metaphor in language. It means that, for example, both red and herring must contain some components relating to the meaning of red herring which only come into play when these two words are combined in this particular order. Any other combination would give a zero product for these components. It is easy to see how this requirement arises from a context-theoretic perspective; nevertheless, from a linguistic perspective it is arguably undesirable.
One potential limitation of the model is that it does not explicitly model syntax, but rather syntactic restrictions are encoded into the vector space and product itself. For example, we may assume the word square has some component of meaning in common with the word shape. Then we would expect this component to be preserved in the sentences He drew a square and He drew a shape. However, in the case of the two sentences The box is square and *The box is shape we would expect the second to be represented by the zero vector because it is not grammatical; square can be a noun and an adjective, whereas shape cannot. Distributivity of meaning means that the component of meaning that square has in common with shape must be disjoint with the adjectival component of the meaning of square.
Associativity is also a very strong requirement to place; indeed Lambek (1961) introduced non-associativity into his calculus precisely to deal with examples that were not satisfactorily dealt with by his associative model (Lambek 1958).
Our framework provides answers to someone considering the use of algebra for natural language semantics. What field should be used? The real numbers. Need the algebra be finite-dimensional? No. Should the algebra be unital? Yes. Some of these answers impose restrictions on what is possible within the framework. The full implication of these restrictions for linguistics is beyond the scope of this article, and indeed is not yet known.
Although we hope that these features or boundaries are useful in their current form, it may be that with time, or for certain applications, there is a reason to expand or contract certain of them, perhaps because of theoretical discoveries relating to the model of meaning as context, or for practical or linguistic reasons, if, for example, the model is found to be too restrictive to model certain linguistic phenomena.
4. Applications to Textual Entailment
In this section we analyze approaches to the problem of recognizing textual entailment, showing how they can be related to the context-theoretic framework, and discussing potential new approaches that are suggested by looking at them within the framework. We first discuss some simple approaches to textual entailment based on subsequence matching and measuring lexical overlap. We then look at the approach of Glickman and Dagan (2005), showing that it can be considered as a context theory in which words are represented as projections on the vector space of documents. This leads us to an implementation of our own in which we used latent Dirichlet allocation as an alternative approach to overcoming the problem of data sparseness.
A fair amount of effort is required to describe these approaches within our framework. Although there is no immediate practical benefit to be gained from this, our main purpose in doing this is to demonstrate the generality of the framework. We also hope that insight into these approaches may be gleaned by viewing them from a new perspective.
4.1 Subsequence Matching and Lexical Overlap
A sequence x ∈ A* is a subsequence of y ∈ A* if each element of x occurs in y in the same order, but with the possibility of other elements occurring in between, so for example abba is a subsequence of acabcba in {a,b,c}*. Subsequence matching compares the subsequences of two sequences: The more subsequences they have in common the more similar they are assumed to be. This idea has been used successfully in text classification (Lodhi et al. 2002) and also formed the basis of the author's entry to the second Recognising Textual Entailment Challenge (Clarke 2006).
Example 18 (Subsequence Matching)
Consider the algebra ℓ1(A*) for some alphabet A, with multiplication defined on basis elements by ex·ey = exy and extended bilinearly. This has a basis consisting of elements ex for x ∈ A*, where ex is the function that is 1 on x and 0 elsewhere. In particular, eε is a unity for the algebra. Define ξ(a) = eε + ea; then ⟨A, ℓ1(A*), ξ⟩ is a context theory. Under this context theory, a sequence x completely entails y if and only if it is a subsequence of y. In our experiments, we have shown that this type of context theory can perform significantly better than straightforward lexical overlap (Clarke 2006). Many variations on this idea are possible: for example, using more complex mappings from A* to ℓ1(A*).
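A minimal sketch of subsequence-based entailment in this spirit is given below; the exhaustive enumeration of subsequences is exponential and only intended for short strings, and the code is illustrative rather than the implementation used in Clarke (2006).

```python
from collections import Counter
from itertools import combinations

def xi(s):
    """Map a string s to the multiset of its subsequences, one for each choice
    of positions, mirroring xi(a) = e_eps + e_a extended multiplicatively."""
    counts = Counter()
    for r in range(len(s) + 1):
        for positions in combinations(range(len(s)), r):
            counts["".join(s[i] for i in positions)] += 1
    return counts

def degree_of_entailment(x, y):
    """||xi(x) ^ xi(y)||_1 / ||xi(x)||_1 with the component-wise lattice meet."""
    cx, cy = xi(x), xi(y)
    meet = sum(min(cx[u], cy[u]) for u in cx)
    return meet / sum(cx.values())

print(degree_of_entailment("abba", "acabcba"))  # 1.0: abba is a subsequence
print(degree_of_entailment("acabcba", "abba"))  # < 1.0
```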
Example 19 (Lexical Overlap)
The simplest approach to textual entailment is to measure the degree of lexical overlap: the proportion of words in the hypothesis sentence that are contained in the text sentence (Dagan, Glickman, and Magnini 2005). This approach can be described as a context theory in terms of a free commutative semigroup on a set A, defined by A*/≡, where x ≡ y in A* if the symbols making up x can be reordered to make y. Then define ξ′ in the same manner as the previous example, in terms of the equivalence class [a] of a in A*/≡. Then ⟨A, ℓ1(A*/≡), ξ′⟩ is a context theory in which entailment is defined by lexical overlap. More complex definitions of ξ′ can be used, for example, to weight different words by their probabilities.
4.2 Document Projections

In their paper, Glickman and Dagan (2005) assume that probabilities can be attached to individual words, as we do, although they interpret these as the probability that a word is “true” in a possible world. In their interpretation, a document corresponds to a possible world, and a word is true in that world if it occurs in the document.
They do not, however, determine these probabilities directly; instead they make assumptions about how the entailment probability of a sentence depends on lexical entailment probability. Although they do not state this, the reason for this is presumably data sparseness: They assume that a sentence is true if all its lexical components are true; this will only happen if all the words occur in the same document. For any sizeable sentence this is extremely unlikely, hence their alternative approach.
It is nevertheless useful to consider this idea from a context-theoretic perspective. We define a context theory ⟨A, B(ℓ∞(D)), ξ, ℓ∞(D), ψ⟩, where:
We denote by B(U) the set of bounded operators on the vector space U; in this case we are considering the bounded operators on the vector space indexed by the set of documents D. Because D is finite, all operators on this space are in fact bounded; this property will be needed when we generalize D to an infinite set, however.
ξ: A → B(ℓ∞(D)) is defined by ξ(u) = Pu, the projection onto the components corresponding to the documents in which the word u occurs; it maps words to document projections.
ψ: B(ℓ∞(D)) → ℓ∞(D) is the map defined by ψ(T) = Tp, where p ∈ ℓ∞(D) is defined by p(d) = 1/|D| for all d ∈ D. This is defined such that ∥Pup∥1 is the probability of the term u.
The degree to which x entails y is then given by ∥ψ(ξ(x)) ∧ ψ(ξ(y))∥1/∥ψ(ξ(x))∥1. This corresponds directly to Glickman and Dagan's (2005) entailment “confidence”; it is simply the proportion of documents that contain all the terms of x which also contain all the terms of y.
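As a toy illustration of this document-projection view (the documents below are invented), the degree of entailment reduces to counting documents:

```python
# Each document is represented simply by the set of terms it contains.
documents = [
    {"stock", "prices", "fell", "sharply"},
    {"stock", "prices", "rose"},
    {"the", "cat", "sat"},
    {"stock", "market", "prices", "fell"},
]

def entailment_degree(x_terms, y_terms):
    """Proportion of documents containing every term of x that also contain every term of y."""
    docs_x = [d for d in documents if set(x_terms) <= d]
    if not docs_x:
        return 0.0
    docs_xy = [d for d in docs_x if set(y_terms) <= d]
    return len(docs_xy) / len(docs_x)

# "stock prices" occurs in three documents, two of which also contain "fell".
print(entailment_degree(["stock", "prices"], ["fell"]))  # 2/3
```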
4.3 Latent Dirichlet Projections
The formulation in the previous section suggests an alternative approach to that of Glickman and Dagan (2005) to cope with the data sparseness problem. We consider the finite data available D as a sample from a distributional language D′; the vector p then becomes a probability distribution over the documents in D′. In our own experiments, we used latent Dirichlet allocation (Blei, Ng, and Jordan 2003) to build a model of the corpus as a probabilistic language based on a subset of around 380,000 documents from the Gigaword corpus. Having this model allows us to consider an infinite array of possible documents, and thus we can use our context-theoretic definition of entailment because there is no problem of data sparseness.
Latent Dirichlet allocation (LDA) follows the same vein as latent semantic analysis (LSA; Deerwester et al. 1990) and probabilistic latent semantic analysis (PLSA; Hofmann 1999) in that it can be used to build models of corpora in which words within a document are considered to be exchangeable, so that a document is treated as a bag of words. LSA performs a singular value decomposition on the matrix of words and documents which brings out hidden “latent” similarities in meaning between words, even though they may not occur together.
In contrast, PLSA and LDA provide probabilistic models of corpora using Bayesian methods. LDA differs from PLSA in that, whereas the latter assumes a fixed number of documents, LDA assumes that the data at hand are a sample from an infinite set of documents, allowing new documents to be assigned probabilities in a straightforward manner.
Figure 3 shows a graphical representation of the latent Dirichlet allocation generative model, and Figure 4 shows how the model generates a document of length N. In this model, the probability of occurrence of a word w in a document is considered to be a multinomial variable conditioned on a k-dimensional “topic” variable z. The number of topics k is generally chosen to be much fewer than the number of possible words, so that topics provide a “bottleneck” through which the latent similarity in meaning between words becomes exposed.
Graphical representation of the Dirichlet model. The inner box shows the choices that are repeated for each word in the document; the outer box shows the choice that is made for each document; the parameters outside the boxes are constant for the model.
The model is thus entirely specified by α and the conditional probabilities p(w|z) that we can assume are specified in a k×V matrix β where V is the number of words in the vocabulary. The parameters α and β can be estimated from a corpus of documents by a variational expectation maximization algorithm, as described by Blei, Ng, and Jordan (2003).
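To make the generative process concrete, the following small sketch (with invented parameters: two topics over a four-word vocabulary) samples a document in the manner just described; it is an illustration only, not the model used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented parameters: k = 2 topics over a vocabulary of 4 words.
alpha = np.array([0.5, 0.5])                  # Dirichlet parameter
beta = np.array([[0.4, 0.4, 0.1, 0.1],        # p(w | z = 0)
                 [0.1, 0.1, 0.4, 0.4]])       # p(w | z = 1)
vocabulary = ["stock", "price", "cat", "dog"]

def generate_document(n_words):
    theta = rng.dirichlet(alpha)              # topic mixture for this document
    words = []
    for _ in range(n_words):
        z = rng.choice(len(alpha), p=theta)               # choose a topic
        w = rng.choice(len(vocabulary), p=beta[z])        # choose a word given the topic
        words.append(vocabulary[w])
    return words

print(generate_document(10))
```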
LDA was applied by Blei, Ng, and Jordan (2003) to the tasks of document modeling, document classification, and collaborative filtering. They compare LDA to several techniques including PLSA; LDA outperforms these on all of the applications. LDA has been applied to the task of word sense disambiguation (Boyd-Graber, Blei, and Zhu 2007; Cai, Lee, and Teh 2007) with significant success.
Consider the vector space ℓ∞(A*) for some alphabet A, the space of all bounded functions on possible documents. In this approach, we define the representation of a string x to be a projection Px on the subspace representing the (infinite) set of documents in which all the words in string x occur. We define a vector q(x) for x ∈ A* where q(x) is the probability of string x in the probabilistic language.



We built a latent Dirichlet allocation model using Blei, Ng, and Jordan's (2003) implementation on documents from the British National Corpus, using 100 topics. We evaluated this model on the 800 entailment pairs from the first Recognizing Textual Entailment Challenge test set.1 Results were comparable to those obtained by Glickman and Dagan (2005) (see Table 2). In this table, Accuracy is the accuracy on the test set, consisting of 800 entailment pairs, and CWS is the confidence weighted score; see Dagan, Glickman, and Magnini (2005) for the definition. The differences between the accuracy values in the table are not statistically significant because of the small data set, although all accuracies in the table are significantly better than chance at the 1% level. The accuracy of the model is considerably lower than the state of the art, which is around 75% (Bar-Haim et al. 2006). We experimented with various document lengths and found very long documents (N = 10^6 and N = 10^7) to work best.
Results obtained with our latent Dirichlet projection model on the data from the first Recognizing Textual Entailment Challenge for two document lengths N = 10^6 and N = 10^7, using a cut-off for the degree of entailment of 0.5 at which entailment was regarded as holding.
Model | Accuracy | CWS
---|---|---
Dirichlet (10^6) | 0.584 | 0.630
Dirichlet (10^7) | 0.576 | 0.642
Bayer (MITRE) | 0.586 | 0.617
Glickman (Bar Ilan) | 0.586 | 0.572
Jijkoun (Amsterdam) | 0.552 | 0.559
Newman (Dublin) | 0.565 | 0.600
It is important to note that because the LDA model is commutative, the resulting context algebra must also be commutative, which is clearly far from ideal in modeling natural language.
5. The Model of Clark, Coecke, and Sadrzadeh
One of the most sophisticated proposals for a method of composition is that of Clark, Coecke, and Sadrzadeh (2008) and the more recent implementation of Grefenstette et al. (2011). In this section, we will show how their model can be described as a context theory.
The authors describe the syntactic element of their construction using pregroups (Lambek 2001), a formalism which simplifies the syntactic calculus of Lambek (1958). These can be described in terms of partially ordered monoids, a monoid G with a partial ordering ≤ satisfying x ≤ y implies xz ≤ yz and zx ≤ zy for all x,y,z ∈ G.
Definition 11 (Pregroup)
A pregroup is a partially ordered monoid in which each element x has a left adjoint x^l and a right adjoint x^r satisfying x^l x ≤ 1 ≤ x x^l and x x^r ≤ 1 ≤ x^r x, where 1 is the unit of the monoid.
For an element v in a particular tensor power of V of the form v = (s1 ⊗ p1) ⊗ (s2 ⊗ p2) ⊗ ⋯ ⊗ (sn ⊗ pn), where the pi are basis vectors of P, we can recover a complex grammatical type for v as the product γ(v) = γ1γ2 ⋯ γn, where γi is the basic grammatical type corresponding to pi. We will call vectors such as this which have a single complex type (i.e., they are not formed from a weighted sum of more than one type) unambiguous.
We also assume that words are represented by vectors whose grammatical type is irreducible: There is no pregroup reduction possible on the type. We define Γ(T(V)) as the vector space generated by all such vectors.
This construction thus allows us to represent complex grammatical types, similar to Clark, Coecke, and Sadrzadeh (2008), although it also allows us to take weighted sums of these complex types, giving us a powerful method of expressing syntactic and semantic ambiguity.
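The following toy sketch illustrates how pregroup types drive composition; the integer encoding of adjoints and the greedy cancellation are simplifications of our own, adequate only for short examples such as the transitive sentence shown.

```python
# A basic type is a pair (base, k), where k counts adjoints: (b, -1) is the
# left adjoint b^l, (b, 1) the right adjoint b^r, and (b, 0) is b itself.
# Adjacent types (b, k)(b, k+1) reduce to the unit, since x^l x <= 1 and x x^r <= 1.
def reduce_type(types):
    types = list(types)
    changed = True
    while changed:
        changed = False
        for i in range(len(types) - 1):
            (b1, k1), (b2, k2) = types[i], types[i + 1]
            if b1 == b2 and k2 == k1 + 1:
                del types[i:i + 2]
                changed = True
                break
    return types

# "John likes Mary": n . (n^r s n^l) . n, which reduces to the sentence type s.
john = [("n", 0)]
likes = [("n", 1), ("s", 0), ("n", -1)]
mary = [("n", 0)]
print(reduce_type(john + likes + mary))  # [('s', 0)]
```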
6. Conclusions and Future Work
We have presented a context-theoretic framework for natural language semantics. The framework is founded on the idea that meaning in natural language can be determined by context, and is inspired by techniques that make use of statistical properties of language by analyzing large text corpora. Such techniques can generally be viewed as representing language in terms of vectors. These techniques are currently used in applications such as textual entailment recognition, although the lack of a theory of meaning that incorporates these techniques means that they are often used in a somewhat ad hoc manner. The purpose behind the framework is to provide a unified theoretical foundation for such techniques so that they may be used in a principled manner.
By formalizing the notion of “meaning as context” we have been able to build a mathematical model that informs us about the nature of meaning under this paradigm. Specifically, it gives us a theory about how to represent words and phrases using vectors, and tells us that the product of two meanings should be distributive and associative. It also gives us an interpretation of the inherent lattice structure on these vector spaces as defining the relation of entailment. It tells us how to measure the size of the vector representation of a string in such a way that the size corresponds to the probability of the string.
We have demonstrated that the framework encompasses several related approaches to compositional distributional semantics, including those based on a predefined composition operation such as addition (Landauer and Dumais 1997; Foltz, Kintsch, and Landauer 1998; Mitchell and Lapata 2008) or the tensor product (Smolensky 1990; Clark and Pulman 2007; Widdows 2008), matrix multiplication (Rudolph and Giesbrecht 2010), and the more sophisticated construction of Clark, Coecke, and Sadrzadeh (2008).
6.1 Practical Investigations
Section 4 raises many possibilities for the design of systems to recognize textual entailment within the framework.
Variations on substring matching: experiments with different weighting schemes for substrings, allowing partial commutativity of words or phrases, and replacing words with vectors representing their context, using tensor products of these vectors instead of concatenation.
Extensions of Glickman and Dagan's approach and our own context-theoretic approach using LDA, perhaps using other distributional languages based on n-grams or other models in which words do not commute, or a combination of context theories based on commutative and non-commutative models.
The LDA model we used is a commutative one. This is a considerable simplification of what is possible within the context-theoretic framework; it would be interesting to investigate methods of incorporating non-commutativity into the model.
Implementations based on the approach to representing uncertainty in logical semantics similar to those described in Clarke (2007).
There are many approaches to textual entailment that we have not considered here; we conjecture that variations of many of them could be described within our framework. We leave the task of investigating the relationship between these approaches and our framework to further work.
Another area that we are investigating, together with researchers at the University of Sussex, is the possibility of learning finite-dimensional algebras directly from corpus data, along the lines of Guevara (2011) and Baroni and Zamparelli (2010).
One question we have not addressed in this article is the feasibility of computing with algebraic representations. Although this question is highly dependent on the particular context theory chosen, it is possible that general algorithms for computation within this framework could be found; this is another area that we intend to address in further work.
6.2 Theoretical Investigations
Although the context-theoretic framework is an abstraction of the model of meaning as context, it would be good to have a complete understanding of the model and the types of context theories that it allows. Tying down these properties would allow us to define algebras that could truly be called “context theories.”
The context-theoretic framework shares a lot of properties with the study of free probability (Voiculescu 1997). It would be interesting to investigate whether ideas from free probability would carry over to context-theoretic semantics.
Although we have related our model to many techniques described in the literature, we still have to investigate its relationship with other models such as that of Song and Bruza (2003) and Guevara (2011).
We have not given much consideration here to the issue of multi-word expressions and non-compositionality. What predictions does the context-theoretic framework make about non-compositionality? Answering this may lead us to new techniques for recognizing and handling multi-word expressions and non-compositionality.
Of course it is hard to predict the benefits that may result from what we have presented, because we have given a way of thinking about meaning in natural language that in many respects is new. This new way of thinking opens the door to the unification of logic-based and vector-based methods in computational linguistics, and the potential fruits of this union are many.
Acknowledgements
The ideas presented here have benefitted enormously from the input and support of my DPhil supervisor, David Weir, without whom this work would not exist; Rudi Lutz; and Stephen Clark, who really grokked this and made many excellent suggestions for improvements. I am also grateful for the advice and encouragement of Bill Keller, John Carroll, Peter Williams, Mark W. Hopkins, Peter Lane, Paul Hender, and Peter Hines. I am indebted to the anonymous reviewers; their suggestions have undoubtedly improved this article beyond measure; the paragraph on the three uses of the term entailment was derived directly from one of their suggestions.
Notes
We have so far only used data from the first challenge, because we performed the experiment before the other challenges had taken place.
References
Author notes
Gorkana Group, Discovery House, 28–48 Banner Street, London EC1Y8QE. E-mail: [email protected].