This article introduces RELPRON, a large data set of subject and object relative clauses, for the evaluation of methods in compositional distributional semantics. RELPRON targets an intermediate level of grammatical complexity between content-word pairs and full sentences. The task involves matching terms, such as “wisdom,” with representative properties, such as “quality that experience teaches.” A unique feature of RELPRON is that it is built from attested properties, but without the need for them to appear in relative clause format in the source corpus. The article also presents some initial experiments on RELPRON, using a variety of composition methods including simple baselines, arithmetic operators on vectors, and finally, more complex methods in which argument-taking words are represented as tensors. The latter methods are based on the Categorial framework, which is described in detail. The results show that vector addition is difficult to beat—in line with the existing literature—but that an implementation of the Categorial framework based on the Practical Lexical Function model is able to match the performance of vector addition. The article finishes with an in-depth analysis of RELPRON, showing how results vary across subject and object relative clauses, across different head nouns, and how the methods perform on the subtasks necessary for capturing relative clause semantics, as well as providing a qualitative analysis highlighting some of the more common errors. Our hope is that the competitive results presented here, in which the best systems are on average ranking one out of every two properties correctly for a given term, will inspire new approaches to the RELPRON ranking task and other tasks based on linguistically interesting constructions.
The field of compositional distributional semantics (Mitchell and Lapata 2008; Clark, Coecke, and Sadrzadeh 2008; Baroni, Bernardi, and Zamparelli 2014) integrates distributional semantic representations of words (Schütze 1998; Turney and Pantel 2010; Clark 2015) with formal methods for composing word representations into larger phrases and sentences (Montague 1970; Dowty, Wall, and Peters 1981). In recent years a number of composition methods have been proposed, including simple arithmetic operations on distributional word vectors (Mitchell and Lapata 2008, 2010), multi-linear operations involving higher-order representations of argument-taking words such as verbs and adjectives (Baroni and Zamparelli 2010; Coecke, Sadrzadeh, and Clark 2010), and composition of distributed word vectors learned with neural networks (Socher, Manning, and Ng 2010; Mikolov, Yih, and Zweig 2013). To compare such approaches it is important to have high-quality data sets for evaluating composed phrase representations at different levels of granularity and complexity.
Existing evaluation data sets fall largely into two categories. Some data sets focus on two to three word phrases consisting of content words only (i.e., no closed-class function words), including subject–verb, verb–object, subject–verb–object, and adjective–noun combinations (Mitchell and Lapata 2008, 2010; Baroni and Zamparelli 2010; Vecchi, Baroni, and Zamparelli 2011; Grefenstette and Sadrzadeh 2011; Boleda et al. 2012; Boleda et al. 2013; Lazaridou, Vecchi, and Baroni 2013). Such data sets have been essential for evaluating the first generation of compositional distributional models, but the relative grammatical simplicity of the phrases (in particular, the absence of function words) leaves open a wide range of compositional phenomena in natural language against which models must be evaluated.
Other data sets, including those used in recent SemEval and *SEM Shared Tasks (Agirre et al. 2012, 2013; Marelli et al. 2014; Agirre 2015), focus on pairs of full sentences, some as long as 20 words or more. The evaluation is based on a numeric similarity measure for each sentence pair. These data sets represent a more realistic task for language technology applications, but reducing sentence similarity to a single value makes it difficult to identify how a model performs on subparts of a sentence, or on specific grammatical constructions, masking the areas where models need improvement.
In the domain of syntactic parsing, a call has been made for “grammatical construction-focused” parser evaluation (Rimell, Clark, and Steedman 2009; Bender et al. 2011), focusing on individual, often challenging syntactic structures, in order to tease out parser performance in these areas from overall accuracy scores. We make an analogous call here, for a wider range of compositional phenomena to be investigated in compositional distributional semantics in the near future.
This article begins to answer that call by presenting relpron, a data set of noun phrases consisting of a noun modified by a relative clause. For example, building that hosts premieres is a noun phrase containing a subject relative clause (a relative clause with an extracted subject), and in our data set describes a theater; while person that helicopter saves contains an object relative clause, and in our data set describes a survivor. The relpron data set is primarily concerned with the analysis of a particular type of closed-class function word, namely, relative pronouns (that, in our examples); function words have traditionally been of greater concern than content words for formal semantics, and we address how this focus can be extended to distributional semantics.
Relative clauses, although still fairly short and therefore more manageable than full sentences, are nevertheless more grammatically complex than the short phrases in previous data sets, because of the relative pronoun and the long-distance relationship between the verb and the extracted argument—known as the head noun of the relative clause (building, person). The aim of relpron is to expand the variety of phrase types on which compositional distributional semantic methods can be evaluated, thus helping to build these methods up step-by-step to the full task of compositional sentence representations.
relpron is a large (1,087 relative clauses), corpus-based, naturalistic data set, including both subject and object relative clauses that modify a variety of concrete and abstract nouns. The relative clauses are matched with terms and represent distinctive properties of those terms, such as wisdom: quality that experience teaches, and bowler: player that dominates batsman. A unique feature of relpron is that it is built from attested properties of terms, but without the properties needing to appear in relative clause form in the source corpus.
This article also presents some initial experiments on relpron, using a variety of composition methods. We find that a simple arithmetic vector operation provides a challenging baseline, in line with existing literature evaluating such methods, but we are able to match this baseline using a more sophisticated method similar to the Practical Lexical Function (PLF) model of Paperno, Pham, and Baroni (2014). We hope that the compositional methods presented here will inspire new approaches.
The remainder of the article is organized as follows. Section 2 provides some motivation for the relpron data set and the canonical ranking task that we imagine relpron being used for, as well as a description of existing data sets designed to test compositional distributional models. Section 3 describes the data set itself, and provides a detailed description of how it was built. Section 4 surveys some of the existing compositional methods that have been applied in distributional semantics, including a short historical section describing work from cognitive science in the 1980s and 1990s. Section 4.2 provides a fairly detailed description of the Categorial framework, which provides the basis for the more complex composition methods that we have tested, as well as a description of existing implementations of the framework. Section 5 provides further details of the composition methods, including lower-level details of how the vectors and tensors were built. Section 6 presents our results on the development and test portions of relpron, comparing different composition methods and different methods of building the vectors and tensors (including count-based vectors vs. neural embeddings). Finally, Section 7 provides an in-depth analysis of relpron, showing how results vary across subject and object relative clauses, across the different head nouns, and how the methods perform on the subtasks necessary for capturing relative pronoun semantics, as well as providing a qualitative analysis highlighting some of the more common errors. Section 8 concludes.
2. Motivation for relpron
Relative clauses have been suggested as an interesting test case for compositional distributional semantic methods by Sadrzadeh, Clark, and Coecke (2013) and Baroni, Bernardi, and Zamparelli (2014). One of the main inspirations for relpron is the pilot experiment of Sadrzadeh, Clark, and Coecke, in which terms such as carnivore are matched with descriptions such as animal who eats meat. However, the data set is small and all examples are manually constructed. relpron aims to provide a larger, more realistic, and more challenging testing ground for relative clause composition methods.
relpron falls into the general category of cross-level semantic similarity tasks (Jurgens, Pilehvar, and Navigli 2014), which involve comparing phrases of different lengths; in this case, a single word with a short phrase. The task of matching a phrase with a single word can be thought of as a special case of paraphrase recognition. However, restricting one of the phrases to a word means that the experiment is better controlled with regard to composition methods; because we know that we have available high-quality, state-of-the-art distributional representations of single words, the only unknown is the composed phrase representation, whereas a comparison of two composed phrases can be more difficult to interpret.
The relpron task is similar to the definition classification task of Kartsaklis, Sadrzadeh, and Pulman (2012), which targeted verb phrase composition. In that task, methods were tested on their ability to match a term (in this case a verb) with its definition (usually of the form verb–object), such as embark: enter boat or vessel. However, the definitions of Kartsaklis, Sadrzadeh, and Pulman were mined from a set of dictionaries, whereas the relpron relative clauses are not limited to dictionary definitions.
Figure 1 introduces some terminology for describing the terms and properties in the data set. A term is a single word which is associated with a set of descriptive phrases. Semantically, each descriptive phrase is a property of the term. Syntactically, each descriptive phrase is a noun phrase containing a relative clause. For the sake of brevity, we often refer to the whole noun phrase containing the relative clause as a relative clause, when we believe this will not lead to confusion. We say subject relative clause or object relative clause (or subject property or object property) to indicate which argument of the verb has been extracted. The head noun of the relative clause is the extracted noun, and is a hypernym of the term; for example, device is a hypernym of telescope. The subject or object that remains in situ in the clause we call the argument of the verb, and the general term for the subject or object role is grammatical function. The relative clauses in relpron are restrictive relative clauses, because they narrow down the meaning of the head noun. In contrast, a non-restrictive relative clause provides incidental situational information; for example, a device, which was in the room yesterday.
Our goal was to construct a data set containing at least 1,000 properties in the form of relative clauses. To make the ranking task more challenging, we chose groups of terms that share a head noun. For example, telescope, watch, button, and pipe are all subclasses of device, so their relative clauses will all begin device that…. Because the relative clauses are not definitions, each term can be associated with multiple properties; we selected between four and ten properties per term.
The construction of relpron also takes a novel approach to producing evaluation data sets that target a particular grammatical construction. Rather than mining examples of the construction directly from a corpus, which may be difficult because of data sparsity, the properties in relpron did not have to occur in relative clause format in our source corpus. Instead, the properties were obtained from subject–verb–object (SVO) triples such as astronomer uses telescope, and joined with head nouns such as device. The head nouns are also attested in the corpus as heads of relative clauses in general, though not necessarily with these specific properties. The relative clauses in relpron are therefore naturalistic, plausible, and corpus-based, without having to be attested in the exact relative clause form in which they ultimately appear in the data set.
The canonical task on the relpron data set is conceived as a ranking task. Given a term, a system must rank all properties by their similarity to the term. The ideal system will rank all properties corresponding to that term above all other properties. This is analogous to an Information Retrieval task (Manning, Raghavan, and Schütze 2008), where the term is the query and the properties are the documents, with the identifying properties for a given term as the relevant documents. Evaluation can make use of standard Information Retrieval measures; in this article we use Mean Average Precision (MAP).
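The ranking task and its MAP evaluation can be sketched in a few lines. This is a minimal illustration, not the evaluation code used for the article: the function names, the choice of cosine as the similarity measure, the two-dimensional vectors, and the properties other than device that astronomer uses are all invented for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def average_precision(ranked, relevant):
    """Average Precision of one ranked list, given the set of relevant items."""
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank  # precision at the rank of each relevant item
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(term_vecs, prop_vecs, gold):
    """Each term (query) ranks ALL properties (documents); average AP over terms."""
    scores = []
    for term, t_vec in term_vecs.items():
        ranked = sorted(prop_vecs,
                        key=lambda p: cosine(t_vec, prop_vecs[p]),
                        reverse=True)
        scores.append(average_precision(ranked, gold[term]))
    return sum(scores) / len(scores)


# Toy, made-up vectors; real systems would use composed phrase vectors.
term_vecs = {"telescope": [1.0, 0.0], "watch": [0.0, 1.0]}
prop_vecs = {
    "device that astronomer uses": [0.9, 0.1],
    "device that observatory has": [0.1, 0.9],   # invented property
    "device that tells time": [0.3, 0.7],        # invented property
}
gold = {
    "telescope": {"device that astronomer uses", "device that observatory has"},
    "watch": {"device that tells time"},
}
map_score = mean_average_precision(term_vecs, prop_vecs, gold)
print(round(map_score, 4))
```

In this toy setting, telescope ranks one of its two properties last and watch ranks its single property second, so neither term reaches an Average Precision of 1.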
2.1 Existing Evaluation Data Sets
Most existing evaluation data sets for compositional distributional semantics have been based on small, fixed syntactic contexts, typically adjective–noun or subject–verb–object. Mitchell and Lapata (2008) introduce an evaluation that has been adopted by a number of other researchers. The task is to predict human similarity judgments on subject–verb phrases, where the similarity rating depends on word sense disambiguation in context. For example, horse run is more similar to horse gallop than to horse dissolve, whereas colors run is more similar to colors dissolve than to colors gallop. Compositional models are evaluated on how well their similarity judgments correlate with the human judgments. Mitchell and Lapata (2010) introduce a phrase similarity data set consisting of adjective–noun, verb–object, and noun–noun combinations, where the task again is to produce similarity ratings that correlate well with human judgments. For example, large number and great majority have high similarity in the gold-standard data, whereas further evidence and low cost have low similarity. This data set formed the basis of a shared task at the GEMS 2011 workshop (Padó and Peirsman 2011). Grefenstette and Sadrzadeh (2011) and Kartsaklis and Sadrzadeh (2014) introduce analogous tasks for subject–verb–object triples. In the disambiguation task, people try door is more similar to people test door than to people judge door. In the similarity task, medication achieve result and drug produce effect have high similarity ratings, whereas author write book and delegate buy land have low ratings. Sentence pairs with mid-similarity ratings tend to have high relatedness but are not mutually substitutable, such as team win match and people play game.
Other evaluations have made use of the contrast between compositional and non-compositional phrases, based on the intuition that a successful compositional model should produce phrase vectors similar to the observed context vectors for compositional phrases, but diverge from the observed context vectors for phrases known to be non-compositional. Reddy, McCarthy, and Manandhar (2011) introduce an evaluation based on compound nouns like climate change (compositional) and gravy train (non-compositional). Boleda et al. (2012, 2013) evaluate models on a data set involving intersective adjectives in phrases like white towel (compositional) and non-intersective adjectives in phrases like black hole and false floor (non-compositional). The data from the shared task of the ACL 2011 Distributional Semantics and Compositionality workshop (Biemann and Giesbrecht 2011) are also relevant in this context.
Also relevant for compositional distributional semantics are larger data sets for full sentence similarity and entailment, including those from recent SemEval and *SEM Shared Tasks (Agirre et al. 2012, 2013; Marelli et al. 2014; Agirre 2015). The SemEval Semantic Textual Similarity tasks make use of a number of paraphrase corpora that include text from news articles, headlines, tweets, image captions, video descriptions, student work, online discussion forums, and statistical machine translation. These corpora include sentences of widely varying length and quality. The SICK data set (Marelli et al. 2014) is based on image and video description corpora, with named entities removed and other normalization steps applied, in order to remove phenomena outside the scope of compositional distributional semantic models. Additional sentences are added based on linguistic transformations that vary the meaning of the original sentence in a controlled fashion. However, this data set is not designed for targeted evaluation of models on specific compositional linguistic phenomena, especially because many of the sentence pairs exhibit significant lexical overlap; for example, An airplane is in the air, An airplane is flying in the air.
Some recent data sets have extended the range of linguistic phenomena that may be evaluated. Cheung and Penn (2012) test the ability of compositional models to detect semantic similarity in the presence of lexical and syntactic variation, for example, among the sentences Pfizer buys Rinat, Pfizer paid several hundred million dollars for Rinat, and Pfizer's acquisition of Rinat. Bernardi et al. (2013) introduce a data set comparing nouns to noun phrases, including determiners as function words; for example, duel is compared to two opponents (similar), various opponents (dissimilar), and two engineers (dissimilar). Pham et al. (2013) introduce a data set involving full sentences from a limited grammar, where determiners and word order vary; for example, A man plays a guitar is compared to A guitar plays a man and The man plays a guitar.
3. The relpron Data Set
One challenge in targeting specific grammatical constructions for evaluation is that the more linguistically interesting constructions can be relatively rare in text corpora. The development of relpron takes a novel approach, using a syntactic transformation to assemble properties in relative clause form without the properties needing to appear in this syntactic form in the source corpus. The properties were obtained from SVO triples such as astronomer uses telescope, and joined with head nouns such as device, a hypernym of telescope, to create the relative clause device that astronomer uses. The source corpus used for constructing the data set was the union of a 2010 Wikipedia download and the British National Corpus (BNC) (Burnard 2007). Here we describe the steps in the construction of relpron.
3.1 Selection of Candidate Head Nouns
To make the data set maximally challenging, we designed it with multiple terms (hyponyms) per head noun (hypernym). This ensures that the composition method cannot rely solely on the similarity between the term and the head noun—say, traveler and person—and fail to consider the verb and argument in the relative clause.
We began by selecting a set of candidate head nouns, each of which would need to have a number of good candidate terms. To improve the naturalness of the examples, we chose candidate head nouns from among nouns frequently used as heads of relative clauses in the source corpus. We used a simple regular expression search for nouns occurring in the pattern “a[n] NOUN that|who|which,” considering as candidates the 200 most frequent results. We manually excluded nouns where the that-clause could be an argument of the noun, for example statement, decision, and belief.
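The pattern search can be approximated as follows. This is only a sketch under stated assumptions: it runs over raw text and treats any lowercase word in the NOUN slot as a noun, whereas the actual search may have relied on part-of-speech information. The function name and the sample text are hypothetical.

```python
import re
from collections import Counter

# "a[n] NOUN that|who|which": an indefinite article, one word, a relative pronoun.
# Assumption: no part-of-speech check, so non-nouns can slip into the NOUN slot.
PATTERN = re.compile(r"\ban?\s+([a-z]+)\s+(?:that|who|which)\b", re.IGNORECASE)


def candidate_head_nouns(text, top_n=200):
    """Return the top_n most frequent words filling the NOUN slot, with counts."""
    counts = Counter(m.group(1).lower() for m in PATTERN.finditer(text))
    return counts.most_common(top_n)


sample = (
    "A device that measures time is a watch. She met a person who travels. "
    "It is a device which astronomers use. An organization that helps people."
)
top = candidate_head_nouns(sample)
print(top)
```

The manual exclusion of nouns such as statement, decision, and belief would then be applied to the resulting list by hand.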
One author (Rimell) used WordNet (Miller 1995) as a guide to manually narrow down the candidates, with the following two requirements. First, the head nouns should have a large number of WordNet hyponyms, to maximize the chances of finding appropriate terms. The hyponyms were required to be direct children or grandchildren, because we wanted to avoid extremely abstract nouns like entity, which has many distant hyponyms but few direct ones. Second, the hyponyms should be mostly unigrams, because we wanted to avoid multiword terms such as remote control. For simplicity, we considered only the first WordNet sense of each candidate head noun. Occasionally, we utilized the WordNet level one down from the frequent head noun in the corpus; for example, the corpus search yielded substance, but we used its direct hyponym material as a head noun, because it had more direct hyponyms of its own.
3.2 Selection of Candidate Terms
Next, potential terms were filtered by frequency. Based on initial exploration of the data, we passed along for manual annotation only those candidate terms that occurred at least 5,000 times in the source corpus, to maximize the chances that there would be enough SVO triples for annotation (see Section 3.3). Finally, candidate terms that appeared in WordNet as hyponyms of more than one candidate head noun were removed. For example, surgery could be an activity or a room; school could be a building or an organization. We believed that the inclusion of such terms would complicate the data set and the evaluation unnecessarily.
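The two filters described above (a corpus frequency threshold and the removal of terms with more than one candidate head noun) can be sketched as follows. The function name and data layout are hypothetical, and the frequency counts are invented for illustration; only the threshold of 5,000 and the surgery and school examples come from the text.

```python
from collections import defaultdict

MIN_TERM_FREQ = 5000  # threshold stated in the text


def filter_candidate_terms(hyponyms_by_head, corpus_freq, min_freq=MIN_TERM_FREQ):
    """Keep terms that are frequent enough and belong to exactly one head noun."""
    heads_by_term = defaultdict(set)
    for head, terms in hyponyms_by_head.items():
        for term in terms:
            heads_by_term[term].add(head)

    kept = {}
    for head, terms in hyponyms_by_head.items():
        kept[head] = sorted(
            t for t in terms
            if corpus_freq.get(t, 0) >= min_freq    # frequency filter
            and len(heads_by_term[t]) == 1          # ambiguity filter
        )
    return kept


# Toy input: hyponym lists and counts are invented.
hyponyms_by_head = {
    "building": {"theater", "school"},
    "organization": {"school", "charity"},
    "room": {"surgery"},
    "activity": {"surgery"},
    "person": {"traveler", "voyager"},
}
corpus_freq = {"theater": 20000, "school": 90000, "charity": 15000,
               "surgery": 12000, "traveler": 9000, "voyager": 1200}

kept = filter_candidate_terms(hyponyms_by_head, corpus_freq)
print(kept)
```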
3.3 Extraction of SVO Triples
For each candidate term, candidate properties were obtained from the source corpus by extraction of all triples in which the term occurred in either subject or object position. We used a version of the corpus lemmatized with Morpha (Minnen, Carroll, and Pearce 2001) and parsed with the C&C Tools (Clark and Curran 2007) to obtain the subject and object relations. A minimum frequency cutoff of six occurrences per verb, per grammatical function (subject/object), per candidate term was applied. Properties containing proper noun arguments were removed. Table 1 shows some sample triples extracted from the corpus prior to manual annotation.
| Object | Subject |
| --- | --- |
| it allow traveler | traveler make decision |
| website allow traveler | traveler make journey |
| inn serve traveler | traveler take road |
| business serve traveler | traveler take advantage |
| she become traveler | traveler have option |
| student become traveler | traveler have luggage |
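One possible reading of the extraction filters is sketched below: a verb is kept for a term in a given grammatical function only if that term, verb, and function combination occurs at least six times, and triples whose in-situ argument is a proper noun are dropped. This interpretation of the cutoff, the function name, and the toy counts are assumptions, not a description of the actual pipeline, which operated on parsed and lemmatized corpus data.

```python
from collections import Counter

MIN_VERB_FREQ = 6  # cutoff stated in the text


def candidate_triples(triples, term, proper_nouns=frozenset(), min_freq=MIN_VERB_FREQ):
    """Return distinct SVO triples for `term` that pass both filters, sorted."""
    # Count how often each verb occurs with the term, per grammatical function.
    counts = Counter()
    for subj, verb, obj in triples:
        if subj == term:
            counts[("subject", verb)] += 1
        if obj == term:
            counts[("object", verb)] += 1

    kept = set()
    for subj, verb, obj in triples:
        if subj == term and counts[("subject", verb)] >= min_freq \
                and obj not in proper_nouns:
            kept.add((subj, verb, obj))
        if obj == term and counts[("object", verb)] >= min_freq \
                and subj not in proper_nouns:
            kept.add((subj, verb, obj))
    return sorted(kept)


# Toy corpus of (subject, verb, object) lemma triples; counts are invented.
triples = (
    [("inn", "serve", "traveler")] * 4
    + [("business", "serve", "traveler")] * 2
    + [("London", "serve", "traveler")]          # proper noun argument
    + [("traveler", "make", "journey")] * 6
    + [("traveler", "see", "city")] * 3          # verb below the cutoff
)
result = candidate_triples(triples, "traveler", proper_nouns={"London"})
print(result)
```

Pronoun arguments such as it and she are deliberately not filtered here, since Table 1 shows that they survive until manual annotation.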
3.4 Conversion to Relative Clause Form
The candidate properties were converted to relative clause form prior to annotation, by joining the head noun with the relevant part of the SVO triple, using the template term: headnoun that V O for a subject property, or term: headnoun that S V for an object property. Using the examples from Table 1, inn serve traveler thus became traveler: person that inn serve; and traveler make journey became traveler: person that make journey. For simplicity, that was used as the relative pronoun throughout the data set, although some head nouns are more likely to appear with who or which.
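The two templates amount to a simple string transformation, sketched here with a hypothetical helper function. The inputs are lemmatized triples, so the output keeps the uninflected verb form, as in the examples above.

```python
def to_relative_clause(triple, term, head_noun):
    """Apply 'term: headnoun that V O' (subject property)
    or 'term: headnoun that S V' (object property)."""
    subj, verb, obj = triple
    if subj == term:
        # The term is the extracted subject, so the object stays in situ.
        return f"{term}: {head_noun} that {verb} {obj}"
    if obj == term:
        # The term is the extracted object, so the subject stays in situ.
        return f"{term}: {head_noun} that {subj} {verb}"
    raise ValueError(f"{term!r} is neither subject nor object of {triple!r}")


object_property = to_relative_clause(("inn", "serve", "traveler"), "traveler", "person")
subject_property = to_relative_clause(("traveler", "make", "journey"), "traveler", "person")
print(object_property)
print(subject_property)
```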
3.5 Manual Annotation of Properties
The manual annotation step was the stage at which candidate head nouns, terms, and properties were filtered for the final data set. Only terms with at least four good properties, and head nouns with at least four terms, were retained after this step. Annotation was continued until the data set reached the desired size.
Most of the SVO triples extracted from the corpus did not make good identifying properties for their terms. In many cases the SVO triple represented a property which, although it could be true of the term, did not distinguish it from other terms, especially from other hyponyms of the same head noun. Thus player that wins tournament is not a good property for golfer: many golfers do win tournaments, but so do other sports players. Similarly, organization that money helps is an accurate description of a charity, but does not distinguish it from other types of organizations. In other cases, the extracted triple contained a light verb and/or a pronoun in argument position, such as golfer: player that he become, or charity: organization that these include. These candidates were too vague to be meaningful properties. Some candidate properties were unsuitable because they were not generic. Among the candidates for killer were person that win lottery and person that actor play, because some specific killer in the corpus experienced these events. However, they are not descriptive of killers in general. Lexical ambiguity and parser errors contributed other unsuitable candidates.
Two of the authors (Rimell and Clark) selected good properties from among the candidates for each term, looking for what we called identifying properties. Such a property should distinguish a term from other hyponyms of its head noun, both those in relpron, and ideally other examples of the head noun as well (barring near-synonyms such as traveler and voyager). As a further guideline for annotation, we defined an identifying property as one that distinguishes a term because it mainly or canonically applies to the term in question, compared with other examples of the head noun. For example, person that tavern serves was not judged to be a good property for traveler. It does not distinguish a traveler from other examples of people, because serving travelers is not the main or canonical purpose of a tavern (since the majority of people who eat and drink in a tavern are not traveling). On the other hand, person that hotel accommodates is a good property, because, although hotels might accommodate other people (e.g., for meetings or in a restaurant), their main and canonical purpose is to accommodate travelers. A more subtle example is the property device that bomb has, a candidate property for timer. This property does not uniquely distinguish timer from among other devices, because bombs contain multiple (sub-)devices, but we felt that a timer was such a canonical part of a bomb that this could be considered a good property for timer.
For a given head noun (hypernym), the annotator began with a random selection of ten candidate terms (hyponyms) from the candidates selected as in Section 3.2. For each candidate term, the annotator chose up to ten good properties. Once ten had been found, annotation stopped on that term, even if more good properties might have been available, because the aim was a broad and balanced data set rather than recall of all good properties. Not all terms had ten good properties. We required a minimum of four, otherwise the candidate term was discarded and another candidate randomly selected from the list of candidates. We also required a minimum of four terms per head noun, otherwise the entire head noun was discarded and another chosen from the list of candidates.
The two annotators originally worked together on a small set of data. After the annotation guidelines were refined, the head nouns were divided between the annotators. As a final check, all properties were reviewed and discussed; any property on which the annotators could not agree was discarded. We aimed for a balance of subject and object relative clauses for each term, but this was not always possible. Some head nouns were very unbalanced in this respect: person terms tended to have better subject properties thanks to their agentivity, such as philosopher: person that interprets world; whereas quality terms tended to have better object properties because of their lack of agentivity, such as popularity: quality that artist achieves. We also tried to make the data set more challenging by choosing properties with lexical confounders when we noticed them. For example, team: organization that loses match and division: organization that team joins share the lemma team, which means a ranking method cannot rely on lexical overlap between the term and the property as an indicator of similarity. Table 2 shows the full list of selected properties for two terms in the development set, navy and expert.
Finding terms with enough good properties was a challenging task. Overall, we had to examine candidate properties for 284 candidate terms to yield the 138 terms in the final data set. Most head nouns in the final data set have ten terms, but two (player and woman) have only five terms, and one (scientist) has eight terms. The final list of head nouns, in alphabetical order, is: activity, building, device, document, mammal, material, organization, person, phenomenon, player, quality, room, scientist, vehicle, woman.
3.6 Final Data Set
After annotation was complete, the data were divided into test and development sets. The division was made by head noun (hypernym), with all terms (and therefore properties) for a given head noun appearing in either the test or the development set. Assuming that terms with the same head noun are good mutual confounders, this split was chosen to maximize the difficulty of the task. We arranged the split so that both sets contained a range of concrete and abstract head nouns. We also ensured that both sets contained the maximum possible number of lexical confounders.
The final data set consists of 1,087 properties, comprising 15 head nouns and 138 terms. This is divided into a test set of 569 properties, comprising 8 head nouns and 73 terms; and a development set of 518 properties, comprising 7 head nouns and 65 terms. The total number of terms and properties by head noun is given in Table 3. Overall, the data set includes 565 subject and 522 object properties, with the distribution across test and development sets as shown in Table 4.
4. Composition Methods: Background
This section surveys some of the composition methods that have been proposed in the distributional semantics literature, focusing on the methods applied in this article. We also include a short section giving more of a historical perspective, contrasting current work in computational linguistics with that from cognitive science in the 1980s and 1990s. Although in our experiments we do use neural embeddings (Mikolov, Yih, and Zweig 2013) as input vectors to the composition, we leave the application of neural networks to the modeling of the relative pronoun itself as important future work. Hence in this section we leave out work using recurrent or recursive neural networks specifically designed for composition, such as Socher, Manning, and Ng (2010), Socher et al. (2013), and Hermann, Grefenstette, and Blunsom (2013).
In this section, we are agnostic about how the vector representations are built. Many of the composition techniques we describe can be equally applied to “classical” distributional vectors built using count methods (Turney and Pantel 2010; Clark 2015) and vectors built using neural embedding techniques. It may be that some composition techniques work better with one type of vector or the other—for example, there is evidence that elementwise multiplication performs better with count methods—although research comparing the two vector-building methods is still in its infancy (Baroni, Dinu, and Kruszewski 2014; Levy and Goldberg 2014; Milajevs et al. 2014; Levy et al. 2015). Section 5 provides details of how our vector representations are built in practice.
4.1 Simple Operators on Vectors
The simplest method of combining two distributional representations, for example, vectors for fast and car, is to perform some elementwise combination, such as vector addition or elementwise multiplication. As a method of integrating formal and distributional semantics, these operators appear to be clearly inadequate a priori, not least because they are commutative, and so do not respect word order. Despite this fact, there is a large literature investigating how such operators can be used for phrasal composition, starting with the work of Mitchell and Lapata (2008, 2010) (M&L subsequently).
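As a small illustration of these operators, and of why they cannot encode word order, consider the following sketch. The three-dimensional vectors for fast and car are made up and carry no distributional meaning.

```python
def add(u, v):
    """Vector addition, component by component."""
    return [a + b for a, b in zip(u, v)]


def multiply(u, v):
    """Elementwise (Hadamard) multiplication."""
    return [a * b for a, b in zip(u, v)]


# Invented toy vectors.
fast = [1.0, 2.0, 0.0]
car = [3.0, 0.5, 4.0]

fast_car_add = add(fast, car)         # [4.0, 2.5, 4.0]
fast_car_mult = multiply(fast, car)   # [3.0, 1.0, 0.0]

# Both operators are commutative: "fast car" and "car fast" get the same vector.
print(fast_car_add == add(car, fast))
print(fast_car_mult == multiply(car, fast))
```

Note also that multiplication zeroes out any dimension in which either word has a zero value, whereas addition preserves it.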
One obvious question to consider is how well such simple operators scale when applied to phrases or sentences longer than a few words. Polajnar, Rimell, and Clark (2014) and Polajnar and Clark (2014) suggest that elementwise multiplication performs badly in this respect, with the quality of the composed representation degrading quickly as the number of composition operations increases. Vector addition is more stable, but Polajnar, Rimell, and Clark suggest that the quality of the composed representation degrades after around 10 binary composition operations, even with addition.
M&L also consider the tensor product, which is an instance of the general multiplicative model described above (when C is the identity operator) and is discussed further in Section 4.4, but find that it performs poorly on their phrasal composition tasks and data sets, using classical count vectors. One of the potential problems with the tensor product is that the resulting phrase vector lives in a different vector space from the argument vectors, so that the tensor product of the vectors for fast and car, for example, lives in a different space from the vector for car. Hence M&L also discuss circular convolution, a technique for projecting a vector in a tensor product space onto the original vector space components (Plate 1991).
4.2 The Categorial Framework
Because the majority of the methods used in our experiments are variants of the Categorial framework, this section provides a fairly detailed description of that framework.
In an attempt to unite a formal, Montague-style semantics (Montague 1970; Dowty, Wall, and Peters 1981) with a distributional semantics, Coecke, Sadrzadeh, and Clark (2010) and Baroni, Bernardi, and Zamparelli (2014) both make the observation that the semantics of argument-taking words such as verbs and adjectives can be represented using multi-linear algebraic objects. This is the basis for the Categorial framework (Coecke, Sadrzadeh, and Clark 2010), which uses higher-order tensors (a generalization of matrices) for words which take more than one argument. The Categorial framework has vectors and their multi-linear analogues as native elements, and provides a natural operation for composition, namely, tensor contraction (see Section 4.2.2). It therefore stands in contrast to frameworks for compositional distributional semantics that effectively take the logic of formal semantics as a starting point, and use distributional semantics to enrich the logical representations (Garrette, Erk, and Mooney 2011; Lewis and Steedman 2013; Herbelot and Copestake 2015).
It is important to note that the Categorial framework makes no commitment to how the multi-linear algebraic objects are realized, only a commitment to their shape and how they are combined. In particular, the vector spaces corresponding to atomic categories, such as noun and sentence, do not have to be distributional in the classical sense; they could be distributed in the connectionist sense (Smolensky 1990), where the basis vectors themselves are not readily interpretable (the neural embeddings that we use in this article fall into this category).
4.2.1 Composition as Matrix Multiplication
A useful starting point in describing the framework is adjectival modification. In fact, the proposal in Baroni and Zamparelli (2010) (B&Z hereafter) to model the meanings of adjectives as matrices can be seen as an instance of the Categorial framework (and also an instance of the general framework described in Baroni, Bernardi, and Zamparelli 2014). The insight in B&Z is that, in theoretical linguistics, adjectives are typically thought of as having a functional role, mapping noun denotations to noun denotations. B&Z make the transition to vector spaces by arguing that, in linear algebra, functions are represented as matrices (the linear maps). This insight provides an obvious answer to the question of what the composition operator should be in this framework, namely, matrix multiplication. Figure 2 shows how the context vector for car is multiplied by the matrix for red to produce the vector for red car. In this simple example, the noun space containing the vectors for car and red car has five basis vectors, which means that a 5 × 5 matrix is required for the adjective red.
The second contribution of B&Z is to propose a method for learning the adjective matrix from supervised training data. What should the gold-standard representation for red car be? B&Z argue that, given large enough corpora, it should ideally be the context vector for the compound red car; in other words, the vector for red car should be built in exactly the same way as the vector for car. These context vectors for phrases are generally known as holistic vectors. The reason for the adjective matrix is that it allows the generalization of adjective–noun combinations beyond those seen in the training data. The details of the B&Z training process will not be covered in this section, but briefly, the process is to find all examples of adjective–noun pairs in a large training corpus, and then use standard linear regression techniques to obtain matrices for each adjective in the training data. These learned matrices can then be used to generalize beyond the seen adjective–noun pairs. So if tasty artichoke, for example, has not been seen in the training data (or perhaps seen infrequently), the prediction for its context vector can be obtained, as long as there are enough examples of tasty X in the data to obtain the tasty matrix (via linear regression), and enough occurrences of artichoke to obtain the artichoke vector.
Testing is performed by using held-out context vectors for some of the adjective–noun pairs in the data, and seeing how close—according to the cosine measure—the predicted context vector is to the actual context vector for each adjective–noun pair in the test set. The learning method using linear regression was compared against various methods of combining the context vectors of the adjective and noun, such as vector addition and elementwise multiplication, and was found to perform significantly better at this prediction task.
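The training and testing procedure described above can be sketched in NumPy. This is a minimal illustration on synthetic data, not B&Z's implementation: the function names, the toy dimensionality, the synthetic "true" adjective map, and the use of unregularized least squares via `np.linalg.lstsq` are all assumptions made here.

```python
import numpy as np


def learn_adjective_matrix(noun_vecs, phrase_vecs):
    """Least-squares fit of a matrix A such that A @ noun ~= holistic phrase vector.

    noun_vecs, phrase_vecs: arrays of shape (n_pairs, dim).
    """
    # lstsq solves noun_vecs @ X = phrase_vecs; the adjective matrix is X.T.
    solution, *_ = np.linalg.lstsq(noun_vecs, phrase_vecs, rcond=None)
    return solution.T


def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


rng = np.random.default_rng(0)
dim, n_pairs = 5, 50

# Synthetic stand-ins (assumptions): a hidden linear map plays the role of the
# adjective, and noisy images of noun vectors play the role of holistic vectors.
true_adj = rng.normal(size=(dim, dim))
nouns = rng.normal(size=(n_pairs, dim))
phrases = nouns @ true_adj.T + 0.01 * rng.normal(size=(n_pairs, dim))

adj = learn_adjective_matrix(nouns, phrases)

# Held-out test: predict the phrase vector for an unseen noun and compare it
# with the "observed" one using cosine, as in the evaluation described above.
unseen_noun = rng.normal(size=dim)
predicted = adj @ unseen_noun
held_out_similarity = cosine(predicted, true_adj @ unseen_noun)
```

On real data the held-out target would be the holistic context vector of the unseen adjective–noun pair rather than the output of a known map.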
4.2.2 An Extension Using Multi-Linear Algebra: Tensor-Based CCG Semantics
The grammatical formalism assumed in this section will be a variant of Categorial Grammar, which was also the grammar formalism used by Montague (1970). The original papers setting out the tensor-based compositional framework (Clark, Coecke, and Sadrzadeh 2008; Coecke, Sadrzadeh, and Clark 2010) used pregroup categorial grammars (Lambek 2008), largely because they share an abstract mathematical structure with vector spaces. However, other forms of categorial grammar can be used equally well in practice, and here we use Combinatory Categorial Grammar (CCG; Steedman 2000), which has been shown to fit seamlessly with the tensor-based semantic framework (Grefenstette 2013; Maillard, Clark, and Grefenstette 2014). With a computational linguistics audience in mind, we also present the framework entirely in terms of multi-linear algebra, rather than the category theory of the original papers.
A matrix is a second-order tensor; for example, the red matrix mentioned earlier lives in the N ⊗ N space, meaning that two indices (each corresponding to a basis vector in the noun space N) are needed to specify an entry in the matrix. Noting that N ⊗ N is structurally similar to the CCG syntactic type for an adjective (N/N), in that both specify functions from nouns to nouns, a recipe for translating a syntactic type into a semantic type suggests itself: Replace all slash operators in the syntactic type with tensor product operators. With this translation, the combinators used by CCG to combine syntactic categories carry over seamlessly to the meaning spaces, maintaining what is often described as CCG's “transparent interface” between syntax and semantics. The seamless integration with CCG arises from the (somewhat trivial) observation that tensors are linear maps (a particular kind of function) and hence can be manipulated using CCG's combinatory rules.
Here are some example syntactic types, and the corresponding tensor spaces in which words with those types are semantically represented (using the notation syntactic type : semantic type). We first assume that all atomic types have meanings living in distinct vector spaces:
Noun phrase, NP : N
Sentence, S : S
Intransitive verb, S\NP : S ⊗ N
Transitive verb, (S\NP)/NP : S ⊗ N ⊗ N
Ditransitive verb, ((S\NP)/NP)/NP : S ⊗ N ⊗ N ⊗ N
Adverbial modifier, (S\NP)\(S\NP) : S ⊗ N ⊗ S ⊗ N
Preposition modifying NP, (NP\NP)/NP : N ⊗ N ⊗ N
Hence the meaning of an intransitive verb, for example, is a particular matrix, or second-order tensor, in the tensor product space S ⊗ N. The meaning of a transitive verb is a “cuboid,” or third-order tensor, in the tensor product space S ⊗ N ⊗ N. In the same way that the syntactic type of an intransitive verb can be thought of as a function—taking an NP and returning an S—the meaning of an intransitive verb is also a function (linear map)—taking a vector in N and returning a vector in S. Another way to think of this function is that each element of the matrix specifies, for a pair of basis vectors (one from N and one from S), how a value on the given N basis vector contributes to the result for the given S basis vector.
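To make the shapes concrete, the following sketch composes an intransitive and a transitive verb with their arguments by tensor contraction. The dimensionalities, the random values, the word labels, and the axis order chosen for the transitive verb tensor (sentence, subject, object) are illustrative assumptions, not part of the framework's definition.

```python
import numpy as np

rng = np.random.default_rng(0)
N_DIM, S_DIM = 4, 3  # toy sizes for the noun space N and the sentence space S

subj = rng.normal(size=N_DIM)
obj = rng.normal(size=N_DIM)

# Intransitive verb, S\NP : S (x) N. A matrix mapping a noun vector to a sentence vector.
sleep = rng.normal(size=(S_DIM, N_DIM))
sentence_iv = sleep @ subj

# Transitive verb, (S\NP)/NP : S (x) N (x) N. A third-order tensor.
# Assumed axis order: (sentence, subject, object).
chase = rng.normal(size=(S_DIM, N_DIM, N_DIM))

# Contract with the object first, giving a verb phrase in S (x) N (a matrix),
# then contract that matrix with the subject.
verb_phrase = np.einsum("sij,j->si", chase, obj)
sentence_tv = verb_phrase @ subj

# Contracting both arguments in a single step gives the same sentence vector.
sentence_direct = np.einsum("sij,i,j->s", chase, subj, obj)
```

The intermediate `verb_phrase` has the same shape as an intransitive verb, mirroring the way the syntactic type (S\NP)/NP reduces to S\NP once the object has been consumed.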
4.2.3 Existing Implementations of the Categorial Framework
There are a few existing implementations of the Categorial framework, focusing mostly on adjectives and transitive verbs. The adjective implementation is that of B&Z as described in Section 4.2.1, where linear regression is used to learn each adjective as a mapping from noun contexts to adjective–noun contexts, with the observed context vectors for the noun and adjective–noun instances as training data. Because an adjective–noun combination has noun phrase meaning, the noun space is the obvious choice for the space in which the composed meanings should live. Maillard and Clark (2015) extend this idea by showing how the skip-gram model of Mikolov, Yih, and Zweig (2013) can be adapted to learn adjective matrices as part of a neural embedding objective.
An alternative way of learning the transitive verb tensor parameters is presented in Grefenstette et al. (2013), using a process analogous to B&Z's process for learning adjectives. Two linear regression steps are performed. In the first step, a matrix is learned representing verb–object phrases, that is, a verb that has already been paired with its object. For example, the matrix for eat meat is learned as a mapping from corpus instances such as 〈dogs, dogs eat meat〉. The full tensor for eat is then learned with meat as input and the eat meat matrix as output. Essentially, the subject mapping is learned in the first step, and the object mapping in the second.
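The two regression steps can be illustrated on synthetic data. The sketch below is not the implementation of Grefenstette et al. (2013): the helper `fit_linear_map`, the noise-free synthetic "true" verb tensor, the toy sizes, and the use of unregularized least squares are assumptions made for illustration.

```python
import numpy as np


def fit_linear_map(inputs, outputs):
    """Least-squares W such that W @ x ~= y for each row pair (x, y)."""
    solution, *_ = np.linalg.lstsq(inputs, outputs, rcond=None)
    return solution.T


rng = np.random.default_rng(1)
n_dim, s_dim = 3, 3
n_objects, n_subjects = 12, 20

# Hidden verb tensor used only to generate synthetic "holistic" sentence vectors.
# Assumed axis order: (sentence, subject, object).
true_verb = rng.normal(size=(s_dim, n_dim, n_dim))
objects = rng.normal(size=(n_objects, n_dim))

# Step 1: for each object, learn a verb-object matrix mapping subject vectors
# to holistic subject-verb-object vectors.
vo_matrices = []
for obj in objects:
    subjects = rng.normal(size=(n_subjects, n_dim))
    sentences = np.einsum("sij,ni,j->ns", true_verb, subjects, obj)
    vo_matrices.append(fit_linear_map(subjects, sentences))  # shape (s_dim, n_dim)

# Step 2: learn the map from an object vector to its (flattened) verb-object
# matrix; reshaping that map gives the third-order verb tensor.
flat_targets = np.stack([m.reshape(-1) for m in vo_matrices])
object_map = fit_linear_map(objects, flat_targets)  # shape (s_dim * n_dim, n_dim)
verb = object_map.reshape(s_dim, n_dim, n_dim)
```

Because the toy data are noise-free and generated by a truly multilinear map, the learned `verb` recovers `true_verb`; with corpus-derived holistic vectors the fit would only be approximate.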
Polajnar, Fagarasan, and Clark (2014), following Krishnamurthy and Mitchell (2013), investigate a non-distributional sentence space, the “plausibility space” described by Clark (2013, 2015). Here the sentence space is one- or two-dimensional, with sentence meaning either a real number between 0 and 1, or a probability distribution over the classes plausible and implausible. A verb tensor is learned using single-step linear regression, with the training data consisting of positive (plausible) and negative (implausible) SVO examples. Positive examples are attested in the corpus, and negative examples for a given verb have frequency-matched random nouns substituted in the argument positions. One tensor investigated has dimensions K × K × S, where K is the number of noun dimensions and S has two dimensions, each ranging from 0 to 1. The parameters of the tensor are learned so that when combined with a subject and object, a plausibility judgment is produced. Polajnar, Rimell, and Clark (2015) investigate a distributional sentence space based on the wider discourse context, in which the meaning of a sentence is represented as a distributional vector based on context words in the surrounding discourse, making the sentence space completely distinct from the noun space.
Existing implementations of the Categorial framework have investigated learning up to third-order tensors (for transitive verbs). However, in practice, syntactic categories such as ((N/N)/(N/N))/((N/N)/(N/N)) are not uncommon in the wide-coverage CCG grammar of Hockenmaier and Steedman (2007); such a category would require an eighth-order tensor. The combination of many word–category pairs and higher-order tensors results in a huge number of parameters to be learned in any implementation. As a solution to this problem, various ways of reducing the number of parameters are being investigated, for example, using tensor decomposition techniques (Kolda and Bader 2009; Fried, Polajnar, and Clark 2015), and removing some of the interactions encoded in the tensor by using only matrices to encode each predicate–argument combination (Paperno, Pham, and Baroni 2014; Polajnar, Fagarasan, and Clark 2014).
For this article, the problem of large numbers of parameters arises in the modeling of the relative pronoun, because its CCG type is (NP\NP)/(S\NP) for subject relative clauses or (NP\NP)/(S/NP) for object relative clauses, resulting in a fourth-order tensor, N ⊗ N ⊗ S ⊗ N. Given the limited amount of training data available (see Section 5.5.1), we do not attempt to model the full tensor, but investigate two approximations, based on the methods of Paperno, Pham, and Baroni (2014). First, we model the transitive verb as a pair of matrices, one for the subject interaction and the other for the object, each with semantic type S ⊗ N. This approximation is described in Section 5.2.2, and allows us to reduce the relative pronoun to a third-order tensor, N ⊗ N ⊗ S (Section 5.5.1). Second, we apply the same “decoupling” approximation to the relative pronoun, in order to represent the relative pronoun as a pair of matrices (Section 5.5.4), which can be learned independently of one another.
4.3 Relative Clause Composition
Two approaches to modeling relative clauses have been proposed in the compositional distributional semantics literature, both within the Categorial framework, although neither one has received a large-scale implementation. Both approaches are based on the fact that the relative clause (interpreted strictly, without the head noun) is a noun modifier, so the relative pronoun must map the meaning of the composed verb–argument phrase to a modifier of the head noun. In set theoretic terms, the relative pronoun signals the intersection between the set of individuals denoted by the head noun, and those denoted by the verb–argument phrase. For example, in accuracy: quality that correction improves, accuracy is both a quality and a thing that is improved by correction.
Baroni, Bernardi, and Zamparelli (2014) propose a method for learning a relative pronoun as a fourth-order tensor that maps verb–argument phrases to noun modifiers. The proposed method would use multi-step linear regression to learn: (1) in the first step, verb–argument matrices, such as a chase cats matrix from training data such as 〈dogs, dogs chase cats〉; (2) in the second step, noun modifier matrices, such as a that chases cats matrix from training data such as 〈animal, animal that chases cats〉; and (3) in the third step, a tensor for the relative pronoun that, using the matrices from steps 1 and 2 with matched predicates. The final tensor would thus be learned from paired matrices such as 〈chases cats, that chases cats〉. Under this approach, the intersective semantics of the relative pronoun must be captured within the tensor. Although the necessary training data might be relatively sparse, particularly for holistic relative clause vectors, Baroni, Bernardi, and Zamparelli (2014) hypothesize that they are sufficiently common to learn a general representation for the relative pronoun. We test a similar approach in this article, using holistic relative clause vectors to learn a tensor representing the relative pronoun (Section 5.5.1), though with single-step linear regression and a simplification of the verb representation that allows the tensor to be third- rather than fourth-order. We also learn a variant that represents the relative pronoun as a pair of matrices (Section 5.5.4) and requires learning fewer parameters.
4.4 Historical Perspective
The general question of how to combine vector-based meaning representations in a compositional framework is a relatively new question for computational linguistics, but one that has been actively researched in cognitive science since the 1980s. The question arose because of the perceived failure of distributed models in general, and connectionist models in particular, to provide a suitable account of compositionality in language (Fodor and Pylyshyn 1988). The question is also relevant to the broad enterprise of artificial intelligence (AI) in that connectionist or distributed representations may seem to be in opposition to the more traditional symbolic systems often used in AI.
A dependency tree, for example, is naturally represented this way, with the roles being predicates paired with grammatical relations, such as object of eat, and the fillers being heads of arguments, such as hotdog. The vector representation of eat hotdog is the tensor product of the role vector for object of eat and the filler vector for hotdog. The vector representation of a whole dependency tree is the sum of the tensor products over all the dependency edges in the tree.
Smolensky and Legendre (2006) argue at length for why the tensor product is appropriate for combining connectionist and symbolic structures. A number of other proposals for realizing symbolic structures in vector representations are described, including Plate (2000) and Pollack (1990). Smolensky's claim is that, despite appearances to the contrary, all are special cases of a generalized tensor product representation.
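A role–filler representation of this kind can be written down in a few lines. The sketch below is a toy illustration: the role and filler names, the dimensionality, and the choice of orthonormal role vectors are assumptions made here so that the "unbinding" step at the end works exactly.

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 4

# Roles: predicates paired with grammatical relations. Orthonormal role
# vectors are assumed here (rows of the identity matrix) for illustration.
role_basis = np.eye(dim)
roles = {"eat:subject": role_basis[0], "eat:object": role_basis[1]}

# Fillers: heads of the arguments, with arbitrary toy vectors.
fillers = {"dog": rng.normal(size=dim), "hotdog": rng.normal(size=dim)}

# Dependency edges of a small tree for "dog eats hotdog".
edges = [("eat:subject", "dog"), ("eat:object", "hotdog")]

# One edge: the tensor (outer) product of role and filler.
eat_hotdog = np.outer(roles["eat:object"], fillers["hotdog"])

# Whole tree: the sum of the tensor products over all dependency edges.
tree = sum(np.outer(roles[r], fillers[f]) for r, f in edges)

# With orthonormal roles, contracting the tree with a role vector
# recovers the filler bound to that role.
recovered_object = roles["eat:object"] @ tree
```

Note that the tree representation lives in a `dim × dim` space rather than the original `dim`-dimensional space, which is the growth in dimensionality associated with the tensor product.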
Clark and Pulman (2007) suggest that a similar scheme to Smolensky's could be applied to parse trees for NLP applications, perhaps with the use of a convolution kernel (Haussler 1999) to calculate similarity between two parse trees with distributed representations at each node. A similar idea has been fleshed out in Zanzotto, Ferrone, and Baroni (2015), showing how kernel functions can be defined for a variety of composition operations (including most of the operations discussed in this article), allowing the similarity between two parse trees to be computed based on the distributional representations at the leaves and the composed representations on the internal nodes.
There are similarities between Smolensky's approach and the Categorial framework from Section 4.2, in that tensor representations are central to both. However, in Smolensky's approach, the tensor product is the operation used for composition of the predicate and its argument, whereas tensor contraction is the composition operation in the Categorial framework.
5. Composition Methods: Implementation
This section provides a more detailed description of the methods we have tested on RELPRON, ranging from some trivial lexical baselines, through to the simple arithmetic operators, and finally more sophisticated methods based on the Categorial framework. Both count-based vectors and neural embeddings have been investigated.
5.1 Word Vectors
In this section we describe how we built the word and holistic phrase vectors used by the various composition methods.
5.1.1 Count-Based Vectors (Count)
For historical completeness we include classical count-based vectors in this article, although these large vectors are impractical for more sophisticated tensor learning. These vectors are used only for the baseline results (Section 6, Table 5), including the Frobenius algebra composition method of Sadrzadeh, Clark, and Coecke (2013).
[Table: Method | Count | Count-SVD | Skip-Gram (table body missing in source)]
Count-based word vectors were built from a 2013 download of Wikipedia, using the settings from Grefenstette and Sadrzadeh (2011). The top 2,000 words in the corpus (excluding stopwords) were used as features, with a co-occurrence window of three content words on either side of the target word (i.e., with stopwords removed before applying the window), and co-occurrence counts weighted as the probability of a context word given the target word divided by the overall probability of the context word (Mitchell and Lapata 2008). We found that these settings achieved the best result on the transitive verb composition task of Grefenstette and Sadrzadeh (2011).
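The weighting scheme described above, p(context | target) / p(context), can be sketched as follows. The function name is hypothetical, and the sketch assumes that both probabilities are estimated from the co-occurrence matrix itself and that no row or column is empty; the original experiments may have estimated the context probability differently.

```python
import numpy as np


def ratio_weight(counts):
    """Weight co-occurrence counts as p(context | target) / p(context).

    counts[t, c] is the number of times context word c occurs in the
    window of target word t. Assumes every row and column has a nonzero sum.
    """
    counts = np.asarray(counts, dtype=float)
    p_context_given_target = counts / counts.sum(axis=1, keepdims=True)
    p_context = counts.sum(axis=0) / counts.sum()
    return p_context_given_target / p_context


# Toy example: two targets, two context features.
toy_counts = [[8, 2],
              [2, 8]]
weights = ratio_weight(toy_counts)
# Each target's preferred context gets weight 0.8 / 0.5 = 1.6,
# the other context gets 0.2 / 0.5 = 0.4.
```

Values above 1 indicate a context word that is more likely near the target than in general; values below 1 indicate the opposite.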
5.1.2 Reduced Count-Based Vectors (Count-SVD)
As a variant on the classical count-based vectors, we also present baseline results (Section 6, Table 5) on a count-based space reduced with Singular Value Decomposition (SVD), which produces dense vectors more comparable to neural embeddings.
Word vectors were built from the same 2013 Wikipedia download, using the settings of Polajnar, Fagarasan, and Clark (2014). The top 10,000 words in the corpus (excluding stopwords) were used as features, with sentence boundaries defining the co-occurrence window, and co-occurrence counts weighted with the t-test statistic (Curran 2004). SVD was used to reduce the number of dimensions to 300. Following Polajnar and Clark (2014), context selection (with n = 140) and row normalization were performed prior to SVD.
5.1.3 Neural Embeddings (Skip-Gram)
Our main results in Section 6 are produced using neural embeddings. These types of vectors have proven useful for a wide variety of tasks in recent literature and, in our early experiments with tensor learning, provided the best results on the RELPRON development data and other composition tasks. For our methods, in addition to the word vectors, we required holistic vectors to use as targets for training the verb and relative pronoun matrices and tensors. The holistic vectors were also obtained using neural embeddings.
Word vectors were built using a re-implementation of skip-gram with negative sampling (Mikolov et al. 2013) on a lemmatized 2015 download of Wikipedia. We used a window of ten words on either side of the target, with ten negative samples per word occurrence, and 100-dimensional target and context vectors. All lemmas occurring fewer than 100 times in the corpus were ignored. The context vectors produced during word vector learning were retained for use during training of the holistic vectors for phrases.
Verb matrices and relative pronoun matrices and tensors require holistic vectors to serve as training targets for linear regression. These vectors represent the observed contexts in which instances of verbs or relative pronouns with their arguments are found in the Wikipedia corpus. We tagged and parsed the corpus with the Stanford Parser (Chen and Manning 2014). For learning of verb matrices, we extracted verb–subject and verb–object pairs, ensuring that each pair occurred within a transitive verb construction. For learning of relative pronoun matrices and tensors, we extracted 〈head noun, verb, argument〉 tuples that occurred within relative clause constructions. To identify relative clauses, we looked for characteristic dependency subtrees containing the Stanford Universal Dependency labels relcl and PronType=Rel. The grammatical role of the relative pronoun argument of the verb, and of the verb's in situ argument, determined whether it was a subject or object relative clause. We used a small number of manually defined heuristics to help limit the results to true relative clauses, for example, to exclude cases of clausal complementation on the verb, as in an incident that left one government embarrassed.
The contexts of our extracted verb–argument pairs and relative clause tuples were used to train holistic vectors using our implementation of skip-gram, only retaining phrases that occurred at least twice in the corpus. For holistic vectors, we followed Paperno, Pham, and Baroni (2014) and used a wider window of 15 words on either side of the target. Skip-gram has a random component in the initialization of vectors. In order to ensure that the learned holistic and word vectors were in the same vector space, we trained both using the same context vectors.
5.2 Verb Matrices
In this section we describe two methods used for constructing verb matrices for the composition methods based on the Categorial framework.
5.2.1 Relational Verb Matrices
Using the relational method of Grefenstette and Sadrzadeh (2011), as described in Section 4.2.3, verb matrices were built as a sum of outer products of their subject–object pairs. Pairs were subject to a minimum per-verb count of 2 and a minimum frequency of 100 for both nouns in the 2013 Wikipedia download, and not weighted by frequency.
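A relational verb matrix of this kind is a sum of outer products, as in the sketch below. The function name, the toy vectors, and the example pairs are hypothetical; the count and frequency thresholds mentioned above are assumed to have been applied before the pairs are passed in.

```python
import numpy as np


def relational_verb_matrix(subject_object_pairs, vectors):
    """Sum of outer products of subject and object vectors for one verb.

    Each distinct (subject, object) pair contributes once, that is,
    pairs are not weighted by frequency.
    """
    dim = len(next(iter(vectors.values())))
    matrix = np.zeros((dim, dim))
    for subj, obj in set(subject_object_pairs):
        matrix += np.outer(vectors[subj], vectors[obj])
    return matrix


# Toy two-dimensional noun vectors (illustrative values only).
toy_vectors = {
    "dog": np.array([1.0, 0.0]),
    "cat": np.array([0.0, 1.0]),
    "meat": np.array([1.0, 1.0]),
}
eat = relational_verb_matrix([("dog", "meat"), ("cat", "meat")], toy_vectors)
# outer(dog, meat) + outer(cat, meat) = [[1, 1], [0, 0]] + [[0, 0], [1, 1]]
```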
5.2.2 Decoupled Verb Matrices with Linear Regression
All of the Categorial methods require a representation for the transitive verb in the relative clause. Based on the CCG category (S\NP)/NP of a transitive verb, this is a third-order tensor, S ⊗ N ⊗ N. The relative pronoun has a CCG type of (NP\NP)/(S\NP) or (NP\NP)/(S/NP), resulting in a fourth-order tensor, N ⊗ N ⊗ S ⊗ N. In order to make the learning of a relative pronoun tensor tractable, we make use of the method introduced by Paperno, Pham, and Baroni (2014) and Polajnar, Fagarasan, and Clark (2014) in order to reduce the semantic type of the transitive verb to S ⊗ N. This method models a transitive verb as two second-order tensors (matrices), one governing the interaction of the verb with its subject and the other with its object, and has been shown to reproduce sentence semantics without loss of accuracy. This “decoupling” approximation for the transitive verb allows the relative pronoun to be modeled as a third-order tensor, N ⊗ N ⊗ S.
Verb matrices were learned using L2 ridge regression (Hoerl and Kennard 1970), also known as L2-regularized regression. For a given verb V, the training data for its subject matrix consisted of (subject vector, holistic subject–verb vector) pairs, and the training data for its object matrix consisted of (object vector, holistic verb–object vector) pairs. In addition, each pair was weighted by the logarithm of its number of occurrences in the corpus, which we found improved performance on the RELPRON development set. Regression was implemented in Python using standard NumPy operations, and we optimized the regularization parameter on the RELPRON development set, using MAP scores and the SPLF composition method described in Section 5.5.3. A single regularization parameter was used globally for all verbs and set to 75 after tuning.
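Weighted ridge regression has a closed-form solution that needs only a few NumPy operations. The sketch below is one standard formulation and is not taken from the original code: the function name, the synthetic data, and the exact placement of the weights in the objective are assumptions.

```python
import numpy as np


def weighted_ridge(inputs, targets, counts, lam):
    """Solve min_M sum_i w_i * ||M x_i - y_i||^2 + lam * ||M||_F^2.

    inputs:  (n, d_in) argument vectors x_i
    targets: (n, d_out) holistic phrase vectors y_i
    counts:  (n,) corpus frequency of each pair; w_i = log(count_i)
    """
    weights = np.log(counts)
    weighted_inputs = inputs * weights[:, None]
    gram = inputs.T @ weighted_inputs + lam * np.eye(inputs.shape[1])
    # Closed form: M^T = (X^T W X + lam I)^{-1} X^T W Y
    return np.linalg.solve(gram, weighted_inputs.T @ targets).T


rng = np.random.default_rng(3)
dim, n_pairs = 4, 60
true_map = rng.normal(size=(dim, dim))          # synthetic stand-in for a verb matrix
args = rng.normal(size=(n_pairs, dim))          # e.g., object vectors
holistic = args @ true_map.T                    # e.g., holistic verb-object vectors
counts = rng.integers(2, 200, size=n_pairs)     # counts >= 2, so log weights are positive

almost_unregularized = weighted_ridge(args, holistic, counts, lam=1e-8)
regularized = weighted_ridge(args, holistic, counts, lam=75.0)
```

With a very small regularization parameter the synthetic map is recovered almost exactly, whereas a larger value such as 75 shrinks the learned matrix toward zero, trading fit for stability when training pairs are scarce.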
5.3 Simple Composition Methods
In this section we describe our relative clause composition methods. We first present the baseline methods, consisting of lexical comparisons, simple arithmetic operators, and the Frobenius algebra method of Clark, Coecke, and Sadrzadeh (2013) and Sadrzadeh, Clark, and Coecke (2013). We then introduce our main methods, all of which, like the Frobenius algebra method, are based on the Categorial framework, but which differ according to whether and how the relative pronoun is modeled. All of our main methods are implemented using neural embedding vectors.
5.3.1 Lexical Baselines
Two simple lexical similarity baselines (Lexical) set the vector representation of the property equal to the verb vector or the argument vector. No composition is involved. These baselines check whether a system can predict the similarity between, for example, traveler and person that hotel serves by the similarity between traveler and hotel (argument) or traveler and serve (verb). We do not have a lexical baseline consisting of the head noun, because this would not produce a well-defined ranking of properties, given that many properties share the same head noun (person in this example).
5.3.2 Arithmetic Operators
Arithmetic composition methods (Arithmetic) are the vector addition and elementwise vector product (Mitchell and Lapata 2008, 2010) of the vectors for the three lexical items in the property: head noun, verb, and argument. For vector addition, we also perform ablation to determine the contribution of each lexical item. The relative pronoun itself is not modeled in the composed representation for the arithmetic operators. Although the arithmetic operators are simple, vector addition has proven a difficult baseline to beat in many composition tasks.
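The two arithmetic operators are simple enough to state directly in code. The vectors below are toy values chosen for illustration, and the function names are hypothetical.

```python
import numpy as np


def compose_add(head, verb, arg):
    """Vector addition of head noun, verb, and argument vectors."""
    return head + verb + arg


def compose_mult(head, verb, arg):
    """Elementwise product of head noun, verb, and argument vectors."""
    return head * verb * arg


# Toy vectors for the property "person that hotel serves".
person = np.array([1.0, 2.0])
serve = np.array([3.0, 4.0])
hotel = np.array([5.0, 6.0])

added = compose_add(person, serve, hotel)        # [9, 12]
multiplied = compose_mult(person, serve, hotel)  # [15, 48]

# Both operators are commutative, so the order of the lexical items,
# and hence the subject/object distinction, does not affect the result.
reordered = compose_add(hotel, serve, person)
```

Because `added` and `reordered` are identical, these operators cannot by themselves distinguish a subject relative clause from an object relative clause built from the same three words.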
5.4 Frobenius Algebra Baseline
5.5 Composition Methods within the Categorial Framework
In the next sections we describe the main composition methods evaluated in this article. These methods are based on the Categorial framework, as is the Frobenius algebra baseline; however, the methods described here are all implemented with Skip-Gram vectors, and matrices and tensors learned by linear regression. Figure 3 gives a schematic view of these composition methods.
5.5.1 Tensor Composition Method (RPTensor)
The RPTensor method models the relative pronoun as a third-order tensor. Learning a relative pronoun tensor from holistic vectors for relative clauses was suggested by Baroni, Bernardi, and Zamparelli (2014), but to our knowledge this is the first implementation of a function word tensor. We learn two separate tensors, one for subject and one for object relative clauses, because we assume that the relative pronoun may have a different semantics in each case.
As with the verb matrices, the two relative pronoun tensors are learned by weighted L2 ridge regression, where the training data consist of (head noun vector, holistic verb–argument vector, holistic relative clause vector) tuples. The regularization parameter was optimized on the RELPRON development set using MAP scores and set to 80 for both tensors.
The power of the tensor comes from the fact that it captures the interaction between the head noun and the composed verb–argument phrase, and hence models the relative pronoun in a meaningful way that composition methods such as vector addition, which omits the pronoun entirely, do not. However, capturing these interactions leads to a large number of parameters, and the sparsity of training data is therefore a challenge in training the full tensor model. Our corpus of 72M sentences contained only around 800K occurrences of relative clauses, compared with over 80M verb–subject and verb–object occurrences (4M types); and this was heavily unbalanced across subject and object relative clauses, with 771K subject relative clause occurrences (222K types) compared with 21K object relative clause occurrences (only 8K types).
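One way to fit a third-order tensor with ordinary ridge regression is to use the flattened outer product of the two input vectors as the feature vector. The sketch below takes that route on synthetic data; it is an assumption about how such a regression can be set up, not a description of the original code. The function names are hypothetical, and the per-example log-frequency weights are omitted for brevity.

```python
import numpy as np


def learn_rp_tensor(heads, verb_args, targets, lam):
    """Ridge regression for a tensor T with T[o, i, j] * head[i] * verb_arg[j] ~= target[o].

    heads:     (n, d_head) head noun vectors
    verb_args: (n, d_va)   holistic verb-argument vectors
    targets:   (n, d_out)  holistic relative clause vectors
    """
    n, d_head = heads.shape
    d_va = verb_args.shape[1]
    # Feature vector for each example: flattened outer product head (x) verb_arg.
    feats = np.einsum("ni,nj->nij", heads, verb_args).reshape(n, -1)
    gram = feats.T @ feats + lam * np.eye(feats.shape[1])
    solution = np.linalg.solve(gram, feats.T @ targets)  # (d_head * d_va, d_out)
    return solution.T.reshape(targets.shape[1], d_head, d_va)


def apply_rp_tensor(tensor, head, verb_arg):
    """Contract the tensor with the head noun and verb-argument vectors."""
    return np.einsum("oij,i,j->o", tensor, head, verb_arg)


rng = np.random.default_rng(4)
dim, n_examples = 3, 200
true_tensor = rng.normal(size=(dim, dim, dim))  # synthetic stand-in
heads = rng.normal(size=(n_examples, dim))
verb_args = rng.normal(size=(n_examples, dim))
clauses = np.einsum("oij,ni,nj->no", true_tensor, heads, verb_args)

rp_tensor = learn_rp_tensor(heads, verb_args, clauses, lam=1e-8)
composed = apply_rp_tensor(rp_tensor, heads[0], verb_args[0])
```

The number of features grows as the product of the two input dimensionalities, so with 100-dimensional vectors for the head noun, the verb–argument phrase, and the output (an assumption about the sentence space), a tensor of this shape has 100 × 100 × 100 = 1,000,000 parameters.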
5.5.2 PLF Composition Method (PLF)
5.5.3 Simplified PLF Composition Method (SPLF)
5.5.4 Full PLF Composition Method (FPLF)
The four relative pronoun matrices (a head noun matrix and a verb–argument matrix for each of subject and object relative clauses) are learned by weighted L2 ridge regression, where the training data consist of (noun vector, holistic relative clause vector) pairs for the head noun matrices, and (holistic verb–argument vector, holistic relative clause vector) pairs for the verb–argument matrices. The regularization parameter was optimized on the RELPRON development set and set to 75 for all four matrices.
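The saving in parameters from the decoupled representation is easy to quantify. The sketch below also shows one plausible way of combining the two matrices at composition time; summing the two matrix–vector products follows the general style of the Practical Lexical Function model and is an assumption here, as are the function name and the use of identity matrices in the example.

```python
import numpy as np

dim = 100  # dimensionality of the Skip-Gram vectors (Section 5.1.3)

# Parameters per relative pronoun: one third-order tensor versus a pair of
# matrices, assuming all spaces involved are 100-dimensional.
full_tensor_params = dim ** 3     # 1,000,000
decoupled_params = 2 * dim ** 2   # 20,000


def compose_decoupled(rp_noun, rp_verb_arg, head, verb_arg):
    """Apply a decoupled relative pronoun to a head noun and a verb-argument vector.

    ASSUMPTION: the outputs of the two matrices are summed.
    """
    return rp_noun @ head + rp_verb_arg @ verb_arg


# Sanity check with identity matrices: the result reduces to plain addition.
head = np.arange(dim, dtype=float)
verb_arg = np.ones(dim)
clause = compose_decoupled(np.eye(dim), np.eye(dim), head, verb_arg)
```

Because each matrix depends on only one of the two inputs, the two regressions can be run independently, at the cost of no longer modeling the interaction between head noun and verb–argument phrase.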
5.5.5 Categorial Baselines
6. Evaluation and Results
This section describes the evaluation of the ranking task on RELPRON and presents the results of the various composition methods.
Table 5 shows the results for the baseline methods, using Count, Count-SVD, and Skip-Gram vectors. The highest MAP score for all three types of vectors is achieved with vector addition, adding the vectors for head noun, verb, and argument, and the Skip-Gram vectors achieve the best result overall, with a MAP of 0.496. This is already a challenging score to beat, because it represents a system that on average places a correct property at every second position in the top-ranked properties for each term. Elementwise multiplication performs almost as well for Count vectors, but very poorly for Count-SVD and Skip-Gram vectors, which is consistent with previous findings on elementwise multiplication in dense vector spaces (Vecchi, Baroni, and Zamparelli 2011; Utsumi 2012; Milajevs et al. 2014; Polajnar and Clark 2014). We will therefore use vector addition as our primary arithmetic vector operation for the rest of the article. Moreover, we will use Skip-Gram vectors for our main Categorial framework-based experiments.
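The interpretation of a MAP score near 0.5 can be checked with a small computation. The following is a generic sketch of Mean Average Precision over ranked lists of properties, with hypothetical function names and toy relevance judgments; it assumes that every correct property for a term appears somewhere in that term's ranking.

```python
def average_precision(ranked_relevance):
    """Average precision for one term.

    ranked_relevance: list of 0/1 flags, one per ranked property,
    where 1 marks a property that is correct for the term.
    """
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings):
    """Mean of the per-term average precision scores."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


# A correct property at every second position gives precision 1/2, 2/4, 3/6.
every_second = [0, 1, 0, 1, 0, 1]
ap_every_second = average_precision(every_second)

# A perfect ranking for one term and the alternating ranking for another.
map_score = mean_average_precision([[1, 1, 0], every_second])
```

Here `ap_every_second` is exactly 0.5, and `map_score` is the mean of 1.0 and 0.5, that is, 0.75.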
Turning to the lexical baselines and the ablation tests for vector addition, we may assess the relative contribution to the vector addition result of the three lexical items in the relative clause. We look first at the lexical baselines. Similarity between the term and argument achieves a MAP of 0.347 for Skip-Gram vectors, although similarity between the term and the verb produces a much worse ranking. Looking at the arithmetic operators, the ablation tests show that the head noun, verb, and argument all contribute to the correct ranking of properties. However, the argument is the most important, as the sum of head noun and verb vectors achieves a lower MAP (0.264 for Skip-Gram vectors) than the other two combinations (0.401 and 0.450, respectively). The importance of the argument is fairly intuitive; for example, in family: organization that claims ancestry, the words family and ancestry are clearly the most closely related. However, Table 5 shows that the argument alone is insufficient to rank properties and a method must perform some composition to achieve a respectable MAP score.
In Table 5 we also see that the Frobenius method, which uses a relational matrix for the verb, is not competitive with vector addition. It achieves a MAP of only 0.277 using Count vectors, and in the other two vector spaces, the MAP is almost zero, presumably because of the elementwise multiplication that is part of the method. Based on this result, we do not pursue the Frobenius algebra method further.
Table 6 shows the results of the composition methods and Categorial baselines on the relpron development and test data. The relative performance of the methods is very similar across the development and test portions of the data set. In both cases, SPLF and Addition are the top two performers, with SPLF having a higher MAP than Addition on the test data, although the difference is not significant. Both methods achieve a MAP of nearly 0.50, which corresponds to putting a correct property in every second place on average.
[Table 6: MAP scores of the composition methods and Categorial baselines on the relpron development and test data. Columns: Method, Development, Test.]
Significantly higher than next lowest result (p < 0.05).
The next highest MAP is achieved by PLF. It is instructive to compare PLF with the Categorial baselines, VArg and Vhn, since PLF sums these two components. Vhn performs very badly, in line with the sum of the verb and head noun vectors in Table 5. However, the head noun does contribute to the composition of relative clause meaning, as we can see from the fact that on the test data, PLF achieves a significantly higher MAP than VArg. It is also relevant to compare PLF with SPLF. PLF learns a verb matrix for the interaction with the head noun, whereas SPLF approximates the head noun interaction as addition. SPLF significantly outperforms PLF on the development and test data, a difference that suggests that modeling the relative clause as an SVO sentence may not be the optimal approach for relative clause composition.
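The contrast between PLF and SPLF drawn above can be made concrete with a small sketch. It covers only the two components named in this section (VArg and Vhn) and should not be read as the full model definitions; the function names, dimensionality, and random matrices are assumptions for illustration.

```python
import numpy as np

def compose_varg(verb_arg_matrix, arg):
    """VArg: the verb matrix for the argument slot applied to the argument."""
    return verb_arg_matrix @ arg

def compose_vhn(verb_hn_matrix, head_noun):
    """Vhn: the verb matrix for the head noun slot applied to the head noun."""
    return verb_hn_matrix @ head_noun

def compose_plf(verb_hn_matrix, verb_arg_matrix, head_noun, arg):
    """PLF as described here: the sum of the Vhn and VArg components."""
    return compose_vhn(verb_hn_matrix, head_noun) + compose_varg(verb_arg_matrix, arg)

def compose_splf(verb_arg_matrix, head_noun, arg):
    """SPLF as described here: the head noun is added without a verb matrix."""
    return head_noun + compose_varg(verb_arg_matrix, arg)

# Toy inputs only: random stand-ins for learned vectors and matrices.
rng = np.random.default_rng(0)
dim = 4
head_noun, arg = rng.normal(size=dim), rng.normal(size=dim)
verb_hn_matrix = rng.normal(size=(dim, dim))
verb_arg_matrix = rng.normal(size=(dim, dim))

plf_vec = compose_plf(verb_hn_matrix, verb_arg_matrix, head_noun, arg)
splf_vec = compose_splf(verb_arg_matrix, head_noun, arg)
```

Under this reading, SPLF equals PLF with the head noun matrix replaced by the identity, so the two differ only in whether the verb is allowed to transform the head noun.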
FPLF and RPTensor are the two methods that learn a relative pronoun function by regression. FPLF achieves a lower MAP than SPLF, Addition, and PLF, but a higher MAP than RPTensor, supporting the hypothesis that the available training data in the form of holistic vectors for relative clauses is insufficient for the number of parameters in the full tensor method. Both methods underperform the VArg functional baseline and some of the arithmetic baselines.
7. Error Analysis
In this section we examine in detail the performance of some of the main composition methods on the relpron development data. We are interested in four main points. First, we investigate whether the grammatical function (subject or object) of the relative clause plays a role in the MAP score. We find that most of the methods are relatively well-balanced across grammatical functions, but FPLF shows a lower MAP on object properties, where the amount of training data is substantially lower.
Second, we investigate whether the different head nouns in the data set have any effect on MAP score. We find that the scores are relatively well-balanced across head nouns, but that more concrete terms and their properties may be easier to model. However, based on qualitative observations, a more important factor seems to be term polysemy.
Third, we look at how well the composition methods capture the intersective semantics of the relative clause construction. We break down the evaluation into two independent components: how often the methods are able to identify the correct head noun, and how often they are able to pick out correct properties when the head noun is known. We find that all methods do very well at identifying the correct head noun, but FPLF lags behind the others at picking the correct property.
Finally, we look at some common errors on the relpron development data. One source of errors is lexical overlap in terms and properties. We also observe that some of the systems assign high ranks to plausible but unannotated properties, and address this issue with an evaluation scenario where the property is the query and the task is to rank the correct term highest. We find that on average all the methods rank the correct term at approximately position two.
We focus in this section on four methods: SPLF and Addition, the two methods with the highest overall MAP score; FPLF, the better-performing method of the two which explicitly learn a relative pronoun representation; and VArg, as a high-performing Categorial baseline.
7.1 Accuracy by Grammatical Function
An obvious first step in understanding the relpron results is to break them down by grammatical function. Because subjects and objects are asymmetric in their interaction with the verb—in particular, verbs show stronger selectional preferences for objects than for subjects (Marantz 1984; Kratzer 1996, 2002)—we might expect that relative clauses with different extraction sites are not equally easy to compose. In addition, for the methods that learn an explicit representation of the relative pronoun (FPLF in this analysis), the substantial difference in the amount of training data might affect the result.
The MAP scores for each grammatical function are shown in Table 7. The first row shows the subject results, in which only the subject properties from the development data are ranked by similarity for each term. The AP is calculated for each term, and MAP is calculated as the mean AP over the set of all terms.13 The object results are calculated analogously, although note that the MAP scores in Table 7 are not directly comparable to the scores on the full development set because the number of confounders is lower for each term.
[Table 7: MAP scores by grammatical function on the relpron development data. Rows: Subject, Object. Columns: SPLF, Addition, FPLF, VArg.]
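The per-function breakdown can be sketched as below, assuming each term comes with a full ranking of (grammatical function, is-correct) pairs. The data layout, term names, and values are hypothetical; the only behavior taken from the text is that the ranking is restricted to properties of one function and that terms with no correct properties of that function are left out of the mean.

```python
def average_precision(ranked_labels):
    """AP of a ranked list of booleans (True = correct property)."""
    hits, total = 0, 0.0
    for rank, is_correct in enumerate(ranked_labels, start=1):
        if is_correct:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def map_by_function(rankings, function):
    """MAP over terms, ranking only properties with the given function.

    rankings: dict mapping a term to its ranked list of
              (function, is_correct) pairs, with function "S" or "O".
    """
    aps = []
    for ranked in rankings.values():
        labels = [correct for func, correct in ranked if func == function]
        if any(labels):                 # skip terms with no such gold property
            aps.append(average_precision(labels))
    return sum(aps) / len(aps) if aps else 0.0

# Toy rankings (hypothetical terms and labels).
rankings = {
    "term_a": [("S", True), ("O", False), ("S", False), ("O", True)],
    "term_b": [("O", True), ("S", False)],   # no correct subject property
}
subject_map = map_by_function(rankings, "S")   # term_b is omitted
object_map = map_by_function(rankings, "O")
```

Filtering a full ranking preserves the relative order of the remaining properties, so it is equivalent to ranking only the subject (or object) properties by similarity.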
For SPLF and Addition, Table 7 shows that the MAP scores are well balanced. For FPLF, on the other hand, the MAP scores are unbalanced: 0.570 for subject properties, in line with SPLF and Addition; and only 0.428 for object properties. This difference provides support for the hypothesis that the amount of training data available for subject vs. object relative clauses makes a difference to the resulting accuracy.14 The term navy provides an example of the differential performance between subject and object properties. SPLF ranks four correct subject and three correct object properties within its top ten ranked properties, whereas FPLF ranks five correct subject properties and only one correct object property within its top ten.15
The MAP scores for VArg are also unbalanced in favor of subject relative clauses, though the effect is not as pronounced. We attribute this difference to the strength of the selectional preference of a verb for its object compared to its subject, since verb–object combinations (for subject properties) may have greater semantic content, enabling a more accurate ranking of properties.
7.2 Accuracy by Head Noun
Breaking down the results on relpron by head noun—for example, device vs. quality—is another obvious place to look for differential performance in the composition methods. Table 8 shows the MAP scores by head noun on the development set. The head nouns are ordered from left to right according to the average concreteness rating of their terms, based on the ratings of Brysbaert, Warriner, and Kuperman (2014). In Table 8, the AP for each term is based on a ranking of all properties in the development set, and the MAP for a head noun is the mean AP over all its terms.16
Table 8 shows that the MAP scores are relatively well-balanced across head nouns for all methods, suggesting that there are no real outliers among head nouns in the ability of the methods to compose relative clauses. The general distribution of scores across head nouns for the four methods is also roughly similar. Building and player are consistently high, with building in particular having a MAP of at least 0.5 for all methods. These head nouns have some of the most concrete terms, which might suggest that concrete terms have properties that are easier to model, or at least that concrete terms have better-quality word vectors that make the ranking easier. This is not a clear effect, however, because the methods mostly have average-to-high MAPs for quality and organization, which have terms with low concreteness, and average-to-low MAPs for device, which has the most concrete terms in the development set.
A factor that appears to play a role in MAP scores is term polysemy. All of the methods score low on document, and SPLF in particular achieves its lowest MAP for this head noun. Several document terms (lease, form, assignment, bond) exhibit very low AP. The term form provides an example; it has an extremely low AP of 0.01. Although form was chosen by the annotators in its document sense, with correct properties including document that parent signs and document that applicant completes, other senses may be more dominant in the source corpus, confusing the similarity ranking.17 Prior disambiguation of words in the data set, as in Kartsaklis and Sadrzadeh (2013), might improve performance.
We believe that polysemy may come into play with device terms as well, though the terms have high concreteness. One term with a very low AP is watch. Among the properties that SPLF ranks high for watch are person that police hunt (killer) and device that views stars (telescope), both of which are related to different senses of watch.
7.3 Capturing the Semantics of Relative Clauses
A system that successfully captures the semantics of relative clauses must integrate the semantic contribution of the head noun with the contributions of the verb and argument; for example, identifying that a saw is a device, and that among devices, it is the one that cuts wood. In this section we break down the results on the relpron development set into two measures that demonstrate performance on these subtasks independently.
We look first at how often the methods are able to identify the correct head noun. We consider the top ten ranked properties for each term from the full development set, and calculate the percentage of them that have the correct head noun, regardless of whether the whole property is correct. The results are shown in Table 9, where we omit the VArg method because it does not take the head noun into account. Table 9 shows that overall, nearly eight out of the top ten properties have the correct head noun. FPLF matches the performance of the other two methods, although it is the only one of the three that incorporates the head noun by an operation other than addition.
We next look at the MAP scores when the ranking of properties for each term is restricted to properties with the correct head noun. For example, given a building term such as ruin, the methods must rank only the building properties. The task here is to distinguish ruin from cinema, pub, observatory, house, mosque, and so forth. The results are shown in Table 10.18 We again omit VArg, because SPLF is equivalent to VArg when the head noun is held fixed.
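The two measures described in this section can be sketched as follows: the share of top-ranked properties with the correct head noun, and the average precision when the ranking is restricted to properties with that head noun. The function names and the toy ranking are assumptions for illustration, not data from the experiments.

```python
def average_precision(ranked_labels):
    """AP of a ranked list of booleans (True = correct property)."""
    hits, total = 0, 0.0
    for rank, is_correct in enumerate(ranked_labels, start=1):
        if is_correct:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def head_noun_share(ranked, correct_head_noun, k=10):
    """Fraction of the top-k properties whose head noun is correct,
    regardless of whether the whole property is correct."""
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(hn == correct_head_noun for hn, _ in top) / len(top)

def restricted_ap(ranked, correct_head_noun):
    """AP over only those properties that have the correct head noun."""
    labels = [correct for hn, correct in ranked if hn == correct_head_noun]
    return average_precision(labels)

# Toy ranking for a hypothetical building term: (head noun, is_correct).
ranked = [
    ("person", False),
    ("building", True),
    ("building", False),
    ("person", False),
    ("building", True),
]
share = head_noun_share(ranked, "building")
full_ap = average_precision([correct for _, correct in ranked])
within_ap = restricted_ap(ranked, "building")
```

In the toy ranking the restricted score is higher than the full score simply because the confounders from other head nouns are removed, which mirrors the caveat about comparing the restricted MAP values with those on the full development set.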
Table 10 shows that SPLF and Addition perform similarly on distinguishing properties of terms that share a head noun, while FPLF lags behind. Combined with Table 9, this result shows that FPLF's overall lower MAP on the full development set is due to difficulty distinguishing terms from one another when the head noun is known, not to difficulty in identifying the correct head noun.
Comparing Tables 9 and 10 highlights cases where the methods struggle with one relative clause composition subtask or the other, an effect that appears to vary by head noun. For example, all methods demonstrate perfect or near-perfect ability to rank player properties at the top for player terms, but average-to-low ability to distinguish players from one another. This may be a feature of the topic domain, in that the activities undertaken by different kinds of sports players—golfer, batter, pitcher, quarterback, and so forth—are distributionally similar. On the other hand, all methods exhibit only average ability to identify the correct head noun for organization terms, but relatively high ability to select the correct organization properties when the head noun is known. For example, the SPLF ranking for religion shows a large amount of confusion with person properties, such as person that defends rationalism (philosopher).19 However, when restricted to ranking organization properties, as in Table 10, SPLF achieves a perfect MAP of 1.0 for religion.
7.4 Common Errors and Evaluation with Properties as Queries
In addition to poor performance on polysemous terms (Section 7.2), we observed two common sources of error on relpron. The first is lexical overlap between terms and properties, a type of confounder that we deliberately included in the data set (see Section 3.5). For example, the top two ranked properties for batter for all three methods are player that walks batter (pitcher) and player that strikes batter (pitcher); and the top ranked property for balance for all three methods is document that has balance (account). FPLF, despite its more sophisticated modeling of the relative pronoun, suffers from lexical overlap about as much as SPLF and Addition. However, we note that the RPTensor method is able to overcome this problem to a small extent, with the lexically confounding properties ranked a few positions below the top.
The second source of error is that the methods often assign high similarity to properties that are plausible descriptions for a term, but not annotated as gold positives. For example, the top-ranked property for lease for the RPTensor method is document that government sells (bond). The highest-ranked properties for intellectual for SPLF include person that promotes idea (advocate) and person that analyzes ontology (philosopher).20 This phenomenon speaks to the difficulty of collecting distinctive properties during the manual annotation phase; future work may involve re-annotation of the data set for additional positives. At present, the effect is to artificially lower the MAP ceiling, which affects all methods equally.
relpron was designed to test compositional distributional semantic methods on a phrase type of intermediate size and complexity: more complex than two-word combinations, and involving a function word, but less complex than full sentences. The data set is challenging, but results are promising with current techniques; a MAP score of 0.5 corresponds to every other property being ranked correctly, given a term. Again, in line with a growing body of literature, we observe that vector addition performs extremely well, but we are able to match its performance with a more complex method based on the Practical Lexical Function model.
Based on the qualitative analysis, our contention is that, in order to improve substantially on the results presented in this article, methods more linguistically sophisticated than vector addition will be required. We base this contention on a few factors. First, vector addition has little room for improvement, with potential gains limited to those that come from improving the overall quality of the vectors. It seems unlikely that vector addition can achieve a MAP much higher than its current one of around 0.5. We also note that the relative clauses used in this article are well below the ten-word length at which the quality of composed representations with vector addition has been suggested to degrade, so future data sets focusing on more complex linguistic constructions may require alternative methods.
For the more complex methods, on the other hand, there are likely to be readily achievable gains on relpron from increasing or changing the training data. This article represents the first large-scale implementation of methods that learn explicit categorial representations for relative pronouns, but the training data in the form of holistic vectors for relative clauses was still observed to be relatively small compared with the number of parameters learned by the methods, especially for object relative clauses. Larger corpora or different sources of data—for example, including dictionary definitions or paraphrase data sets in the training data—may improve the models.
Beyond the issue of training data, the qualitative analysis showed that the various methods were sometimes unable to integrate the semantic contributions of the head noun and the remainder of the relative clause. Vector addition will not be able to capture this relationship any better than it already does, but other learning methods (e.g., involving non-linear models) may be able to do so.
It is also possible that a learning method designed more specifically for the relpron ranking task would perform better, especially because many of the training examples in our existing data are non-restrictive relative clauses—although this would require an alternative data source, and relpron itself is too small to provide training data. We expect relpron and similar data sets to be important evaluation tools for future methods that combine formal and distributional semantics, and hope that the insights provided by relpron inspire new work focused on linguistically challenging grammatical constructions.
We are grateful to Mehrnoosh Sadrzadeh for many detailed and constructive discussions on the topic of relative clause composition, and comments on earlier drafts of this article. We would like to thank Bob Coecke, Dimitri Kartsaklis, Daoud Clarke, Jeremy Reffin, Julie Weeds, David Weir, and the anonymous reviewers for their detailed and helpful comments. Laura Rimell and Stephen Clark were supported by EPSRC grant EP/I037512/1. Jean Maillard is supported by an EPSRC Doctoral Training Grant and a St. John's Scholarship. Laura Rimell, Tamara Polajnar, and Stephen Clark are supported by ERC Starting Grant DisCoTex (306920).
The pilot subtask on interpretable Semantic Textual Similarity (Agirre 2015) begins to address this problem.
For readability, we sometimes present verbs and nouns in their inflected form in relative clauses in this article, e.g., detects planets rather than detect planet. In the data set, all words are lemmatized.
For example, for person the WordNet command for viewing the hyponym hierarchy was: wn person -n1 -treen.
Readers may wonder why woman and not man was included in the data set. This was a simple matter of statistics: man did not have enough WordNet hyponyms with sufficient corpus frequency, according to the criteria in Section 3.2.
Another way to motivate this idea is that some words naturally have more of an “operator semantics,” whereas others have more of a “content semantics.” Socher et al. (2012, 2013) realize this distinction in the context of a recursive neural network, using matrices for the operator semantics and vectors for the content semantics. Every word has both associated types, but the network may learn to put more weight on one type or the other for a given word.
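The tensor product $V \otimes W$ of two vector spaces $V = \mathbb{R}^n$ and $W = \mathbb{R}^m$ is an $nm$-dimensional space spanned by elements of the form $\vec{v}_i \otimes \vec{w}_j$; that is, pairs of basis vectors of $V$ and $W$. An element $a \in V \otimes W$ can be written as $a = \sum_{ij} a_{ij}\, \vec{v}_i \otimes \vec{w}_j$, where $a_{ij}$ is a scalar.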
We make no distinction in this section between nouns and noun phrases, as far as the corresponding semantic space is concerned.
We consider a verb with all of its argument positions saturated—for example, an SVO triple—to be a sentence, although real-world sentences are much longer and contain determiners and other function words.
Kiros et al. (2015) investigate a similar sentence space within the context of recurrent neural networks.
In practice, the formulae work out equivalently to the copy-subject and copy-object methods used by Kartsaklis, Sadrzadeh, and Pulman (2012) for composed SVO triples, although they are arrived at in a different way. The formulation for the object relative clause also glosses over the fact that the CCG derivation requires type-raising and function composition, but it can easily be shown that the resulting semantic interpretation is equivalent based on the type-raising implementation in Maillard, Clark, and Grefenstette (2014).
The word slipping, which occurs once in the relpron test set, was missing from the resulting corpus. We substituted the vector for slip at test time.
Any term that has no properties with the relevant grammatical function is omitted from the calculation; for example, accuracy has no subject properties, so it is not included in the MAP for Subject.
As further support for this hypothesis, the RPTensor method (not in the table) showed the same discrepancy, with a MAP of 0.520 on subject properties, and 0.400 on object properties.
The correct properties for SPLF are, in order, organization that uses submarine (S), organization that blockades port (S), organization that fleet destroys (O), organization that battleship fights (O), organization that maintains blockade (S), organization that vessel serves (O), organization that defeats fleet (S). The correct properties for FPLF are, in order, organization that uses submarine (S), organization that blockades port (S), organization that maintains blockade (S), organization that fleet destroys (O), organization that establishes blockade (S), organization that defeats fleet (S).
Unlike in Section 7.1, these MAP scores are directly comparable to those on the full development set, since the number of confounders is the same.
The top four properties for form as ranked by SPLF are: organization that undergoes merger (division), organization that siblings form (family), organization that infantry reinforces (army), organization that restructuring creates (division).
One question that might be asked is whether the ranking task within head nouns would be more difficult than the full ranking task, since all buildings (for example) share certain characteristics that might make their properties particularly good mutual confounders. In practice, this is difficult to measure, since the MAP scores in Table 10 are higher than the MAPs on the full development set because there are fewer confounders.
The top five properties in the SPLF ranking are person that defends rationalism (philosopher), person that religion has (follower), person that questions theology (philosopher), person that teaches epistemology (philosopher), and person that accepts philosophy (follower).
Some of the highly-ranked but incorrect properties are surprisingly plausible, although clearly not identifying properties for the term in question: SPLF ranks building that sells popcorn (theater) in second position for pub.