Abstract
Distributional semantics has been extended to phrases and sentences by means of composition operations. We look at how these operations affect similarity measurements, showing that the similarity equations of an important class of composition methods can be decomposed into operations performed on the subparts of the input phrases. This establishes a strong link between these models and convolution kernels.
1. Introduction
Distributional semantics approximates word meanings with vectors tracking co-occurrence in corpora (Turney and Pantel 2010). Recent work has extended this approach to phrases and sentences through vector composition (Clark 2015). The resulting compositional distributional semantic models (CDSMs) estimate degrees of semantic similarity (or, more generally, relatedness) between two phrases: a good CDSM might tell us that green bird is closer to parrot than to pigeon, which is useful for tasks such as paraphrasing.
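To make the setup concrete, here is a toy sketch of the similarity judgment just described, with additive composition; the three-dimensional vectors are invented stand-ins, not corpus-derived representations:

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Invented toy vectors; real CDSMs derive these from corpus co-occurrence counts.
green  = np.array([0.9, 0.1, 0.3])
bird   = np.array([0.2, 0.8, 0.6])
parrot = np.array([0.6, 0.7, 0.5])
pigeon = np.array([0.1, 0.9, 0.1])

phrase = green + bird              # additive composition (Mitchell and Lapata 2008)
print(cosine(phrase, parrot))      # ~0.99 with these toy numbers
print(cosine(phrase, pigeon))      # ~0.66: green bird is closer to parrot
```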
We take a mathematical look¹ at how the composition operations postulated by CDSMs affect similarity measurements involving the vectors they produce for phrases or sentences. We show that, for an important class of composition methods, encompassing at least those based on linear transformations, the similarity equations can be decomposed into operations performed on the subparts of the input phrases, and typically factorized into terms that reflect the linguistic structure of the input. This establishes a strong link between CDSMs and convolution kernels (Haussler 1999), which act in the same way. We thus refer to our claim as the “Convolution Conjecture.”
We focus on the models in Table 1. These CDSMs all apply linear methods, and we suspect that linearity is a sufficient (but not necessary) condition to ensure that the Convolution Conjecture holds. We will first illustrate the conjecture for the linear methods, and then briefly consider two nonlinear approaches: the dual-space model of Turney (2012), for which the conjecture holds, and a representative of the recent strand of work on neural-network models of composition, for which it does not.
Table 1. The composition methods under consideration.

| Model | 2-word phrase | 3-word phrase | Reference |
|---|---|---|---|
| Additive | $\vec{a}+\vec{n}$ | $\vec{s}+\vec{v}+\vec{o}$ | Mitchell and Lapata (2008) |
| Multiplicative | $\vec{a}\odot\vec{n}$ | $\vec{s}\odot\vec{v}\odot\vec{o}$ | Mitchell and Lapata (2008) |
| Full Additive | $A\vec{a}+N\vec{n}$ | $S\vec{s}+V\vec{v}+O\vec{o}$ | Guevara (2010); Zanzotto et al. (2010) |
| Lexical Function | $\mathbb{A}\,\vec{n}$ | $\mathbb{V}\,(\vec{s}\otimes\vec{o})$ | Coecke, Sadrzadeh, and Clark (2010) |
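For concreteness, the operations in Table 1 can be sketched in numpy as follows; the function names, dimensionalities, and the einsum-based contraction for Lexical Function are our own illustrative choices:

```python
import numpy as np

def additive(a, n):
    # Mitchell and Lapata (2008): componentwise sum.
    return a + n

def multiplicative(a, n):
    # Mitchell and Lapata (2008): componentwise (Hadamard) product.
    return a * n

def full_additive(A, N, a, n):
    # Guevara (2010); Zanzotto et al. (2010): one matrix per syntactic slot.
    return A @ a + N @ n

def lexical_function_an(A, n):
    # Coecke, Sadrzadeh, and Clark (2010): the adjective is a matrix
    # acting on the noun vector.
    return A @ n

def lexical_function_svo(V, s, o):
    # The transitive verb is a third-order tensor contracted with the
    # subject and object vectors.
    return np.einsum('kij,i,j->k', V, s, o)
```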
2. Mathematical Preliminaries
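The computations below rely on three standard operations on vectors $\vec{a}, \vec{b} \in \mathbb{R}^d$ and on tensors $\mathbb{A}, \mathbb{B}$ of equal shape; we recall the standard definitions so the discussion is self-contained:

$$\langle \vec{a}, \vec{b} \rangle = \sum_i a_i b_i \qquad (\vec{a} \otimes \vec{b})_{ij} = a_i b_j \qquad \langle \mathbb{A}, \mathbb{B} \rangle_F = \sum_{i_1 \cdots i_k} \mathbb{A}_{i_1 \cdots i_k}\, \mathbb{B}_{i_1 \cdots i_k}$$

The Frobenius product generalizes the dot product to higher-order tensors; it is the similarity that appears in the Lexical Function comparison of Section 4.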
3. Formalizing the Convolution Conjecture
Structured Objects. In line with Haussler (1999), a structured object $x \in X$ is either a terminal object that cannot be further decomposed, or a non-terminal object that can be decomposed into $n$ subparts. We indicate with $(x_1, \ldots, x_n)$ one such decomposition, where the subparts $x_i \in X$ are structured objects themselves. The set $X$ is the set of the structured objects and $T_X \subseteq X$ is the set of the terminal objects. A structured object $x$ can be anything according to the representational needs. Here, $x$ is a representation of a text fragment, and so it can be a sequence of words, a sequence of words along with their parts of speech, a tree structure, and so on. The set $R(x)$ is the set of decompositions of $x$ relevant to define a specific CDSM. Note that a given decomposition of a structured object $x$ need not contain all the subparts of the original object. For example, let us consider the phrase $x =$ tall boy. We can then define $R(x) = \{(\textit{tall}, \textit{boy}), (\textit{tall}), (\textit{boy})\}$. This set contains the three possible decompositions of the phrase: $(\textit{tall}, \textit{boy})$, $(\textit{tall})$, and $(\textit{boy})$.
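For instance, one simple choice of $R$ enumerates all in-order, non-empty subsequences of a word sequence; a minimal sketch, where this enumeration strategy is just one possible instantiation:

```python
from itertools import combinations

def R(words):
    # All in-order, non-empty subsequences: for ("tall", "boy") this yields
    # ('tall', 'boy'), ('tall',), ('boy',) -- a decomposition need not
    # contain every subpart of the original object.
    return [c for r in range(len(words), 0, -1)
            for c in combinations(words, r)]

print(R(("tall", "boy")))  # [('tall', 'boy'), ('tall',), ('boy',)]
```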
Definition 1 (Convolution Conjecture)
Given a CDSM $f$, a similarity function $K$, and structured objects $x, y \in X$, there exist functions $g$ and $K_i$ such that:

$$K(f(x), f(y)) = \sum_{\substack{(x_1, \ldots, x_n)\, \in\, R(x) \\ (y_1, \ldots, y_n)\, \in\, R(y)}} g\big(K_1(f(x_1), f(y_1)), \ldots, K_n(f(x_n), f(y_n))\big)$$

The Convolution Conjecture postulates that the similarity $K(f(x), f(y))$ between the tensors $f(x)$ and $f(y)$ is computed by combining operations on the subparts, that is, $K_i(f(x_i), f(y_i))$, using the function $g$. This is exactly what happens in convolution kernels (Haussler 1999). $K$ is usually the dot product, but this is not necessary: we will show that for the dual-space model of Turney (2012), $K$ turns out to be the fourth root of the Frobenius product.
4. Comparing Composed Phrases
We now illustrate how the Convolution Conjecture (CC) applies to the CDSMs under consideration, exemplifying with adjective–noun and subject–verb–object phrases. Without loss of generality, we use tall boy and red cat for adjective–noun phrases, and goats eat grass and cows drink water for subject–verb–object phrases.
For Additive, $f(a\,n) = \vec{a} + \vec{n}$, and the dot product between two composed phrases unfolds by bilinearity:

$$\langle \vec{a} + \vec{n},\ \vec{a}' + \vec{n}' \rangle = \langle \vec{a}, \vec{a}' \rangle + \langle \vec{a}, \vec{n}' \rangle + \langle \vec{n}, \vec{a}' \rangle + \langle \vec{n}, \vec{n}' \rangle \qquad (1)$$

The CC form of Additive shows that the overall dot product can be decomposed into dot products of the vectors of the single words: composition does not add any further information. These results can easily be extended to longer phrases and to phrases of different lengths.
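Instantiated on the running example, Equation (1) reads:

$$\langle \overrightarrow{\textit{tall}} + \overrightarrow{\textit{boy}},\ \overrightarrow{\textit{red}} + \overrightarrow{\textit{cat}} \rangle = \langle \overrightarrow{\textit{tall}}, \overrightarrow{\textit{red}} \rangle + \langle \overrightarrow{\textit{tall}}, \overrightarrow{\textit{cat}} \rangle + \langle \overrightarrow{\textit{boy}}, \overrightarrow{\textit{red}} \rangle + \langle \overrightarrow{\textit{boy}}, \overrightarrow{\textit{cat}} \rangle$$

Each summand compares one subpart of the first phrase with one subpart of the second, exactly as the CC prescribes, with $K$ and each $K_i$ the dot product.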
For Full Additive, $f(a\,n) = A\vec{a} + N\vec{n}$, where $A$ and $N$ are the matrices associated with the adjective and noun slots, and the dot product factorizes as:

$$\langle A\vec{a} + N\vec{n},\ A\vec{a}' + N\vec{n}' \rangle = \vec{a}^\top A^\top A\, \vec{a}' + \vec{a}^\top A^\top N\, \vec{n}' + \vec{n}^\top N^\top A\, \vec{a}' + \vec{n}^\top N^\top N\, \vec{n}' \qquad (2)$$

By looking at Full Additive in the CC form, we observe that when $X^\top Y \approx I$ for all matrix pairs $X, Y \in \{A, N\}$, it degenerates to Additive. Interestingly, Full Additive can also approximate a semantic convolution kernel (Mehdad, Moschitti, and Zanzotto 2010), which combines dot products of elements in the same slot. In the adjective–noun case, we obtain this approximation by choosing two nearly orthonormal matrices $A$ and $N$ such that $AA^\top = NN^\top \approx I$ and $AN^\top \approx 0$, and applying Equation (2):

$$\langle A\vec{a} + N\vec{n},\ A\vec{a}' + N\vec{n}' \rangle \approx \langle \vec{a}, \vec{a}' \rangle + \langle \vec{n}, \vec{n}' \rangle$$
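A quick numerical check of this approximation; a sketch under the slightly stronger assumption that $A$ and $N$ are exactly orthogonal column blocks, so the cross terms vanish exactly (dimensions and names are illustrative):

```python
import numpy as np

d = 50
rng = np.random.default_rng(0)

# Two orthogonal column blocks of a random orthogonal matrix:
# A^T A = N^T N = I and A^T N = 0 hold exactly by construction.
Q, _ = np.linalg.qr(rng.standard_normal((2 * d, 2 * d)))
A, N = Q[:, :d], Q[:, d:]

a, n, a2, n2 = rng.standard_normal((4, d))  # e.g. tall, boy, red, cat
lhs = (A @ a + N @ n) @ (A @ a2 + N @ n2)   # Full Additive dot product
rhs = a @ a2 + n @ n2                       # slot-wise convolution kernel
print(np.isclose(lhs, rhs))                 # True
```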
Results can again be easily extended to longer and different-length phrases.
For Lexical Function, a transitive verb is a third-order tensor contracted with the vectors of its arguments,² so that, for example, $f(\textit{goats eat grass}) = \mathbb{V}(\vec{s} \otimes \vec{o})$, with $\mathbb{V}$, $\vec{s}$, $\vec{o}$ standing for eat, goats, and grass. We rewrote the similarity equation as a Frobenius product between two fourth-order tensors. The first combines the two third-order tensors of the verbs and the second combines the vectors representing the arguments of the verbs, that is, with $\mathbb{W}$, $\vec{u}$, $\vec{w}$ standing for drink, cows, and water:

$$\langle \mathbb{V}(\vec{s} \otimes \vec{o}),\ \mathbb{W}(\vec{u} \otimes \vec{w}) \rangle = \big\langle\, \mathbb{V} \cdot \mathbb{W},\ (\vec{s} \otimes \vec{o}) \otimes (\vec{u} \otimes \vec{w}) \,\big\rangle_F, \qquad (\mathbb{V} \cdot \mathbb{W})_{ijmn} = \sum_k \mathbb{V}_{kij}\, \mathbb{W}_{kmn}$$

In this case as well we can separate the role of predicate and argument types in the comparison computation.
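The identity is easy to verify numerically; a minimal numpy sketch with random tensors standing in for trained verb representations (all names and dimensions are illustrative):

```python
import numpy as np

d = 8
rng = np.random.default_rng(1)
V, W = rng.standard_normal((2, d, d, d))   # third-order tensors for eat, drink
s, o = rng.standard_normal((2, d))         # goats, grass
u, w = rng.standard_normal((2, d))         # cows, water

# Composed sentence vectors: contract each verb tensor with its arguments.
p = np.einsum('kij,i,j->k', V, s, o)       # goats eat grass
q = np.einsum('kij,i,j->k', W, u, w)       # cows drink water

# CC form: fourth-order "verb" tensor against fourth-order "argument" tensor.
verbs = np.einsum('kij,kmn->ijmn', V, W)
args  = np.einsum('i,j,m,n->ijmn', s, o, u, w)
print(np.isclose(p @ q, np.sum(verbs * args)))  # True
```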
5. Conclusion
The Convolution Conjecture offers a general way to rewrite the phrase similarity computations of CDSMs by highlighting the role played by the subparts of a composed representation. This perspective allows for a better understanding of the exact operations that a composition model applies to its input. The Convolution Conjecture also suggests a strong connection between CDSMs and semantic convolution kernels: insights from the CDSM literature could thus be directly integrated into the development of convolution kernels, with all the benefits offered by this well-understood general machine-learning framework.
Acknowledgments
We thank the reviewers for helpful comments. Marco Baroni acknowledges ERC 2011 Starting Independent Research Grant n. 283554 (COMPOSES).
Notes
1. Ganesalingam and Herbelot (2013) also present a mathematical investigation of CDSMs. However, except for the tensor product (a composition method we do not consider here, as it is not empirically effective), they do not look at how composition strategies affect similarity comparisons.
2. Grefenstette et al. (2013) first framed the Lexical Function in terms of tensor contraction.
References
Author notes
Department of Enterprise Engineering, University of Rome “Tor Vergata,” Viale del Politecnico, 1, 00133 Rome, Italy. E-mail: [email protected].