Abstract
We present a study on the automatic acquisition of semantic classes for Catalan adjectives from distributional and morphological information, with particular emphasis on polysemous adjectives. The aim is to distinguish and characterize broad classes, such as qualitative (gran ‘big’) and relational (pulmonar ‘pulmonary’) adjectives, as well as to identify polysemous adjectives such as econòmic (‘economic | cheap’). We specifically aim at modeling regular polysemy, that is, types of sense alternations that are shared across lemmata. To date, both semantic classes for adjectives and regular polysemy have only been sparsely addressed in empirical computational linguistics.
Two specific questions are tackled in this article. First, what is an adequate broad semantic classification for adjectives? We provide empirical support for the qualitative and relational classes as defined in theoretical work, and uncover one type of adjective that has not received enough attention, namely, the event-related class. Second, how is regular polysemy best modeled in computational terms? We present two models, and argue that the second one, which models regular polysemy in terms of simultaneous membership in multiple basic classes, is both theoretically and empirically more adequate than the first one, which attempts to identify independent polysemous classes. Our best classifier achieves 69.1% accuracy, against a 51% baseline.
1. Introduction
Adjectives are one of the most elusive parts of speech with respect to meaning. For example, it is very difficult to establish a broad classification of adjectives into semantic classes, analogous to a broad ontological classification of nouns (Raskin and Nirenburg 1998). This article tackles precisely this task, that is, the semantic classification of adjectives, for Catalan. We aim at automatically inducing the semantic class for an adjective given its linguistic properties, as extracted from corpora and other resources.
The acquisition of semantic classes has been widely studied for verbs (Dorr and Jones 1996; McCarthy 2000; Korhonen, Krymolowski, and Marx 2003; Lapata and Brew 2004; Schulte im Walde 2006; Joanis, Stevenson, and James 2008) and, to a lesser extent, for nouns (Hindle 1990; Pereira, Tishby, and Lee 1993), but, with very few exceptions (Bohnet, Klatt, and Wanner 2002; Carvalho and Ranchhod 2003), not for adjectives. Furthermore, we cannot rely on a well-established classification for adjectives. The classes themselves are subject to experimentation. We will test two different classifications, analyzing the empirical properties of the classes and the problems in their definition.
Another significant challenge is posed by polysemy, or the fact that one and the same adjective can have multiple senses. Different senses may fall into different classes, such that it is no longer possible to identify one single semantic class per adjective. Moreover, many adjectives exhibit similar sense alternations, in a phenomenon known as regular or systematic polysemy (Apresjan 1974; Copestake and Briscoe 1995). A special focus of the research presented, therefore, is on modeling regular polysemy. As an example of regular polysemy, take the sense alternation for the adjective econòmic shown in Example (1). Econòmic, derived from economia (‘economy’), can be translated as ‘economic, of the economy’, as in Example (1a), or as ‘cheap’, as in Example (1b). As we will see, each of these senses corresponds to a different semantic class in our classifications.
- (1)
- a.
recuperació econòmica
recovery economy-suffix
‘recovery of the economy’
- b.
pantalons econòmics
trousers economy-suffix
‘cheap trousers’
Other adjectives exhibit similar sense alternations; for example, familiar (derived from família, ‘family’) and amorós (derived from amor, ‘love’), as shown in Example (2).
- (2)
- a.
reunió familiar / cara familiar
meeting family-suffix / face family-suffix
‘family meeting / familiar face’
- b.
problema amorós / noi amorós
problem love-suffix / boy love-suffix
‘love problem / lovely boy’
The first senses in Examples (1) and (2) have a transparent relation to the denotation of the deriving noun, as witnessed by the fact that they are translated as nouns in English (economy, family, love), whereas the other senses are translated as adjectives (cheap, familiar, lovely). For each of these adjectives, there is a relationship between the two senses, such that the sense alternations seem to correspond to a productive semantic process along the lines of Example (3) (Raskin and Nirenburg 1998, page 173, schema (43)).
- (3)
Pertaining to [noun meaning] → characteristic of [noun meaning]
Because of the systematic semantic relationship between the two senses of these adjectives, they constitute an instance of regular polysemy. In this article, therefore, we address not only the acquisition of semantic classes, but also the acquisition of polysemy: Our goal is to determine, for a given adjective, whether it is monosemous or polysemous, and to which class(es) it belongs. Note that we are not dealing with individual sense alternations, as related work on sense induction does (Schütze 1998; McCarthy et al. 2004; Brody and Lapata 2009), but with types of sense alternation that hold systematically across different lemmata. Thus, the present research lies at the crossroads of sense induction and lexical acquisition.
Regularities in sense alternations are pervasive in human languages, and they are probably favored by the properties of human cognition (Murphy 2002). Regular polysemy has been studied in theoretical linguistics (Apresjan 1974; Pustejovsky 1995) and in symbolic approaches to computational semantics (Copestake and Briscoe 1995). It has received little attention in empirical computational semantics, however. This is surprising, given the amount of work devoted to sense-related tasks such as Word Sense Disambiguation (WSD). In WSD (see Navigli [2009] for an overview), sense ambiguities are almost exclusively modeled for each individual lemma, despite the ensuing sparsity problems (Ando [2006] is an exception). Properly modeling regular polysemy, therefore, promises to improve computational semantic tasks such as WSD and sense discrimination.
This article has the goal of finding a computational model that responds to the theoretical and empirical properties of regular polysemy. In this direction, we test two alternative approaches. We first model polysemy in terms of independent classes to be separately acquired (e.g., an adjective with two senses a_i and b_i belongs to a class AB defined independently of classes A and B), and show that this model is not adequate. A second approach, which posits that polysemous adjectives simultaneously belong to more than one class (e.g., an adjective with two senses a_i and b_i belongs to both class A and class B), is more successful. Our best classifier achieves 69.1% accuracy against a 51% baseline, which is satisfactory, considering that the estimated upper bound (human agreement) for this task is 68%. We discuss pros and cons of the two models described and ways to overcome their limitations.
In the following, we first review related work (Section 2) and linguistic aspects of adjective classification (Section 3), then present the two acquisition experiments (Sections 4 and 5), and finish with a general discussion (Section 6) and some conclusions and directions for future research (Section 7).
2. Related Work
As mentioned in the Introduction, there has been very little research on the semantic classification of adjectives. We know of only two articles specifically on this topic. Carvalho and Ranchhod (2003) used adjective classes similar to the ones explored here to disambiguate between nominal and adjectival readings in Portuguese. Adjective information, manually coded, served to establish constraints in a finite-state transducer part-of-speech tagger. POS tagging was in fact also the initial motivation for the present research, as adjective–noun and adjective–verb (participle) ambiguities cause most difficulties to both humans and machines in languages such as English, German, and Catalan (Marcus, Santorini, and Marcinkiewicz 1993; Brants 2000; Boleda 2007). Bohnet, Klatt, and Wanner (2002) also have goals similar to those of the present research, as they aim at automatically classifying German adjectives. However, the classification used is not purely semantic, polysemy is not taken into account, and the evidence and techniques used are more limited than the ones used here.
Other research on adjectives within computational linguistics is oriented toward different goals than ours. Yallop, Korhonen, and Briscoe (2005) tackle syntactic, not semantic classification, akin to the acquisition of subcategorization frames for verbs. Another relevant line of research pursues WSD. Justeson and Katz (1995) and Chao and Dyer (2000) showed that adjectives are a very useful cue for disambiguating the sense of the nouns they modify. Adjective classes could be further exploited in WSD in at least two respects: (1) to establish an inventory of adjective senses (if polysemous instances are correctly detected; this is where sense induction and our own work fit in), and (2) to exploit class-based properties for the disambiguation, similar to related work on verb classes (Resnik 1993; Prescher, Riezler, and Rooth 2000; Kohomban and Lee 2005).
The application where adjectives have received most attention, however, is Opinion Mining and Sentiment Analysis (Pang and Lee 2008), as adjectives are known to convey much of the evaluative and subjective information in language (Wiebe et al. 2004). The typical goal of this kind of study has been to identify subjective adjectives and their orientation (positive, neutral, negative). This type of research, from pioneering work by Hatzivassiloglou and colleagues (Hatzivassiloglou and McKeown 1993; Hatzivassiloglou and McKeown 1997; Hatzivassiloglou and Wiebe 2000) to current research (de Marneffe, Manning, and Potts 2010), has thus focused on scalar adjectives, that is, adjectives like good and bad, which can be translated into values that can be ordered along a scale. These adjectives typically enter into antonymy relations (the semantic relation between good and bad), and in fact antonymy is the main organizing criterion for adjectives in WordNet (Miller 1998), the most widely used semantic resource in NLP. However, when examining a large-scale lexicon, it becomes immediately apparent that there are many other types of adjectives that do not easily fit in a scale-based or antonymy-based view of adjectives (Alonge et al. 2000). Some examples are pulmonary, former, and foldable. It is not clear, for instance, whether it makes sense to ask for an antonym of pulmonary, or to establish a “foldability” scale for foldable. These adjectives need a different treatment, and they are treated in terms of different semantic classes in this article.
The semantic properties of adjectives can also be exploited in advanced NLP tasks and applications such as Question Answering, Dialog Systems, Natural Language Generation, or Information Extraction. For instance, from a sentence like This maimai is round and sweet, we can quite safely infer that the (invented) object maimai is a physical object, probably edible. This type of process could be exploited in, for instance, Information Extraction and ontology population, although to our knowledge this possibility has received but little attention (Malouf 2000; Almuhareb and Poesio 2004).
As for polysemy, previous approaches to the automatic acquisition of semantic classes have mostly disregarded the problem, by biasing the experimental material to include monosemous words only, or by choosing an approach that ignores polysemy (Hindle 1990; Merlo and Stevenson 2001; Schulte im Walde 2006; Joanis, Stevenson, and James 2008). There are a few exceptions to this tradition, such as Pereira, Tishby, and Lee (1993), Rooth et al. (1999), and Korhonen, Krymolowski, and Marx (2003), who used soft clustering methods for multiple assignment to verb semantic classes (see Section 4.5).
There is very little related work in empirical computational semantics on modeling regular polysemy. A pioneering piece of research is Buitelaar (1998), which tried to account for regular polysemy with the CoreLex resource. CoreLex, building on the Generative Lexicon theory (Pustejovsky 1995), groups WordNet senses into 39 “basic types” (broad ontological categories). In CoreLex, each word is associated with a polysemy class, that is, the set of all basic types its synsets belong to. Some of these polysemy classes constitute instances of regular polysemy, as recently explored in Utt and Padó (2011).
Lapata (2000, 2001) also addresses regular polysemy in the Generative Lexicon framework. This work attempts to establish all the possible meanings of adjective-noun combinations, and rank them using information gathered from the British National Corpus (Burnage and Dunlop 1992). This information should indicate that an easy problem is usually equivalent to a problem that is easy to solve (as opposed to, for example, an easy text, which is usually equivalent to a text that is easy to read). Thus, the focus is on the meaning of adjective-noun combinations, not on that of adjectives alone as in the present research.
3. Basis for a Semantic Classification of Adjectives
Adjective classes in our definition are broad classes of lexical meaning. We will present lexical acquisition experiments in which, given the evidence found in corpora and other lexical resources, a semantic class can be assigned to a given adjective. For this purpose, two preconditions are required:
- (a)
a classification that establishes the number and characteristics of the target semantic classes;
- (b)
a stable relation between observable features and each semantic class.
There is no established semantic classification for adjectives in computational linguistics that we can use and, therefore, one subgoal of the research is to establish the classification in the first place, addressing (a), and exploiting the morphology–semantics and syntax–semantics interfaces for acquisition, addressing (b). We are thus facing a highly exploratory endeavor, and we do not regard the classifications we use as final. We test two different classifications: an initial classification, based on the literature, for the experiments reported in Section 4, and an alternative classification, for the experiments reported in Section 5. We next turn to presenting the two tested classifications.
3.1 Initial Classification
Qualitative adjectives
These are prototypical adjectives like gran (‘big’) or dolç (‘sweet’), including scalar adjectives, which denote attributes or properties of objects. Adjectives in this class tend to be gradable and comparable (see Examples (4a–4b)). They are characterized by exhibiting the greatest variability with respect to their syntactic behavior: In Catalan, they can act as predicates in copular sentences and other constructions (Examples (4c–4d)), and they can typically act as both pre- and post-nominal modifiers (Examples (4e–4f)). When an adjective modifies a head noun in pre-nominal position, the interpretation is usually nonrestrictive, as shown by the fact that they can modify proper nouns (Example (4e)).
- (4)
- a.
Taula molt gran / grandíssima
Table very big / big-superlative
‘Very big table’
- b.
Aquesta taula és més gran que aquella
This table is more big than that
‘This table is bigger than that one’
- c.
Aquesta taula és gran
This table is big
‘This table is big’
- d.
Aquesta taula la veig massa gran
This table it-obj-cl-fem see-pres-1stp-sg too big
‘This table seems to me to be too big’
- e.
La gran Diana va seguir cantant
The great Diana past-aux continue singing.
‘Great Diana continued singing.’
- f.
Van portar una taula gran
past-aux bring a table big
‘They brought in a big table’
Intensional adjectives
These are adjectives like presumpte (‘alleged’) or antic (‘former’), which according to formal semantics denote second-order properties (Montague 1974 and subsequent work). Most intensional adjectives modify nouns in pre-nominal position only (Example (5a)), and they cannot act as predicates (Example (5b)). They are also typically not gradable (Example (5c)).
- (5)
- a.
El Joan és el presumpte assassí
The Joan is the alleged murderer
‘Joan is the alleged murderer’
- b.
#El Joan és presumpte
The Joan is alleged
‘#Joan is alleged’
- c.
#Més presumpte assassí / #presumptíssim assassí
More alleged murderer / alleged-superlative murderer
‘#More/very alleged murderer’
Intensional adjectives like presumpte may appear in any order with respect to qualitative adjectives, as in Example (6). The order, however, affects interpretation: Example (6a) entails that the referent of the noun phrase is young, whereas Example (6b) does not (McNally and Boleda 2004).
- (6)
- a.
jove presumpte assassí
‘young alleged murderer’
- b.
presumpte jove assassí
‘alleged young murderer’
Relational adjectives
Adjectives such as pulmonar, estacional, botànic (‘pulmonary, seasonal, botanical’) denote a relationship to an object (in the mentioned examples, lung, season, and plant objects). Most of them are denominal (e.g., pulmonar is derived from pulmó, ‘lung’) and can only modify nouns post-nominally (see Example (7a)). Also, contrary to qualitative adjectives, they are not gradable (Example (7b)) and act as predicates only under very restricted circumstances (Example (7c) vs. (7d)). If other adjectives or modifiers co-occur with relational adjectives, these occur after the adjective (Example (7e)). We will say relational adjectives are adjacent to the head noun.
- (7)
- a.
Tenia una malaltia pulmonar / #pulmonar malaltia
Had a disease pulmonary / pulmonary disease
‘He/she had a pulmonary disease’
- b.
#Malaltia molt pulmonar / pulmonaríssima
Disease very pulmonary / pulmonary-superlative
#‘Very pulmonary disease’
- c.
La decisió europea → ??Aquesta decisió és europea
The decision European → This decision is European
‘The European decision → ??This decision is European’
- d.
La tuberculosi pot ser pulmonar
The tuberculosis can be pulmonary
‘Tuberculosis can be pulmonary’
- e.
inflamació pulmonar greu / #inflamació greu pulmonar
inflammation pulmonary serious / inflammation serious pulmonary
‘serious pulmonary inflammation’
Table 1 summarizes the properties just explained. Our goal is to use these properties to induce the semantic class of adjectives. For instance, if an adjective is denominal, appears almost exclusively in postnominal position, and is strictly adjacent to the head noun, we predict that it is relational. In the experiments reported in Sections 4 and 5, we extract data related to these and other properties of adjectives from linguistic resources, and use them as features in machine learning experiments.
Property | Qualitative | Intensional | Relational |
---|---|---|---|
example | gran (‘big’) | presumpte (‘alleged’) | pulmonar (‘pulmonary’) |
predicative | + | − | restricted |
gradable/comparable | + | − | − |
position with respect to head noun | both | pre-nom. | post-nom. |
adjacent | − | − | + |
denominal | − | − | + |
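The prediction stated before Table 1 (a denominal adjective that is almost always post-nominal and strictly adjacent to the head noun is expected to be relational) can be pictured as a hand-written rule. The sketch below is only an illustration of that reasoning: the article uses such properties as features in machine learning experiments, not as rules. The `AdjectiveProfile` fields, the thresholds, and the example values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AdjectiveProfile:
    """Hypothetical per-lemma summary; field names and scales are assumptions."""
    denominal: bool          # derived from a noun
    prenominal_rate: float   # share of modifier uses that are pre-nominal
    predicative_rate: float  # share of uses as a predicate
    gradable_rate: float     # share of uses with degree modification
    adjacent_rate: float     # share of post-nominal uses directly next to the noun


def predict_class(p: AdjectiveProfile, low: float = 0.05, high: float = 0.9) -> str:
    """Toy decision rule mirroring Table 1; thresholds are illustrative only."""
    # Relational: denominal, (almost) never pre-nominal, adjacent to the noun.
    if p.denominal and p.prenominal_rate <= low and p.adjacent_rate >= high:
        return "relational"
    # Intensional: pre-nominal only, neither predicative nor gradable.
    if (p.prenominal_rate >= high and p.predicative_rate <= low
            and p.gradable_rate <= low):
        return "intensional"
    # Everything else falls back to the most variable class.
    return "qualitative"


# Invented values, chosen only to exercise each branch.
pulmonar = AdjectiveProfile(True, 0.0, 0.02, 0.0, 0.98)
presumpte = AdjectiveProfile(False, 0.97, 0.0, 0.0, 0.10)
gran = AdjectiveProfile(False, 0.40, 0.20, 0.15, 0.30)
print(predict_class(pulmonar), predict_class(presumpte), predict_class(gran))
```

A fixed rule of this kind cannot express intermediate behavior, which is one reason to prefer learned models when polysemous adjectives are expected to show mixed evidence.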
3.2 Alternative Classification
In the acquisition experiments reported in Section 5, we distinguish between qualitative, relational, and event-related adjectives. The classification presented in Section 3.1 is thus altered in two ways: (1) The intensional class is dropped. (2) A new class, that of event-related adjectives, is added to the classification. The reasons for these changes will become clear in the discussion of the experiments in Section 4. Here, we describe the new class and provide a summary table of the alternative classification.
Event-related adjectives
Adjectives such as exportador, promès, resultant (‘exporting, promised, resulting’) denote a relationship to an event, in this case, export, promise, and result events, respectively. Most of them are deverbal. Like relational adjectives, they are typically nongradable (see Example (8a)) and prefer the postnominal position when modifying nouns (Example (8b)). Like qualitative adjectives, they typically can act as predicates (Example (8c)).
- (8)
- a.
És un país {exportador / #molt exportador} de petrol
Is a country {exporting / very exporting} of oil
‘It is an oil exporting / #very exporting country’
- b.
#exportador país
‘exporting country’
- c.
Aquest país és exportador
This country is exporting
‘This is an exporting country’
Table 2 summarizes the properties of the alternative classification (for a more thorough discussion of previous research on the semantics of adjectives and more motivation for the classification, see Boleda [2007]). For comparison, we will briefly outline the treatment of adjectives in WordNet (Miller 1998; Alonge et al. 2000). As mentioned in Section 2, the main semantic relation around which adjectives are organized in WordNet is antonymy. Also as explained, however, not all adjectives have antonyms. This is solved in WordNet by the use of indirect antonyms (e.g., swift and slow are indirect antonyms, through the semantic similarity between swift and fast). Still, indirect antonymy only applies to a small subset of the adjectives in WordNet (slightly over 20% in WordNet 1.5). Therefore, some kinds of adjectives receive a differentiated treatment.
Property | Qualitative | Event-related | Relational |
---|---|---|---|
example | gran (‘big’) | exportador (‘exporting’) | pulmonar (‘pulmonary’) |
predicative | + | + | restricted |
gradable/comparable | + | typically not | − |
position with respect to head noun | both | post-nom. | post-nom. |
adjacent | − | − | + |
derivational type | non-derived | deverbal | denominal |
Specifically, two main kinds of adjectives are distinguished in WordNet: (1) Descriptive adjectives, akin to our qualitative adjectives, which are organized around antonymy (descriptive adjectives, however, include intensional adjectives). (2) Relational adjectives, as defined in this article, for which two different solutions are adopted. If a suitable antonym can be found for a given relational adjective (antonym in a broad sense; in Miller [1998, page 60], physical and mental are considered antonyms), it is treated in the same way as a descriptive adjective. Otherwise, it is linked through a pertain-to pointer to the related noun. In addition, a subclass of descriptive adjectives, having the form of past or present participles, is distinguished, and also receives a hybrid treatment. Those that can be accommodated to antonymy are treated as descriptive adjectives (laughing–unhappy, through the similarity between laughing and happy). Those that cannot are linked to the source verb through a principal-part-of pointer. Our event-related class includes not only past and present participles, but also other types of deverbal adjectives. Thus, most of the classes used in this article are to some extent backed up by the organization of adjectives in WordNet.
3.3 The Role of Polysemy
As explained in the Introduction, some adjectives are polysemous such that each sense falls into a different class of the classifications just presented. Consider for instance the adjective econòmic in Example (1), repeated here as Example (9) for convenience. The two main senses of econòmic instantiate the relational (sense in Example (9a)) and the qualitative class (sense in Example (9b)), respectively.
- (9)
- a.
anàlisi econòmica
‘economic analysis’
- b.
pantalons econòmics
‘cheap trousers’
Crucially for our purposes, in each of the senses the adjective exhibits the properties of each of the associated classes. When used as a relational adjective, it is not gradable and cannot be used in a pre-nominal position (Example (10)). When used as a qualitative adjective, it is gradable and it can be used predicatively (see Example (11)). In the experiments that follow, we aim at capturing this hybrid behavior.
- (10)
- a.
#L’anàlisi molt econòmica de les dades
The-analysis very economic of the data
‘#The very economic analysis of the data’
- b.
#Va dur a terme una econòmica anàlisi
Past-aux bring to term an economic analysis
‘#He/she carried out an economic analysis’
- (11)
Aquests pantalons són molt econòmics!
These trousers are very economic!
‘These trousers are very cheap!’
Cases of regular polysemy between the intensional and qualitative classes also exist, as illustrated in Examples (12) and (13). Antic has two major senses, a qualitative one (equivalent to ‘old, ancient’) and an intensional one (equivalent to ‘former’). Note again that, when used in the intensional sense, it exhibits properties of the intensional class: It appears pre-nominally (Example (13a)) and is not gradable (Example (13b)).
- (12)
- a.
edifici antic
building ancient
‘ancient building’
- b.
edifici molt antic
building very ancient
‘very ancient building’
- (13)
- a.
antic president
ancient president
‘former president’
- b.
#molt antic president
very ancient president
‘#very former president’
The new class in the alternative classification, that of event-related adjectives, also introduces regular polysemy, specifically, between event-related and qualitative adjectives, as illustrated in Examples (14) and (15). The participial adjective sabut (‘known’) has an event-related sense, corresponding to the verb saber (‘know’), and a qualitative sense that can be translated as ‘wise’. Likewise, the deverbal adjective cridaner derived from cridar (‘to shout’) alternates between an event-related sense and a qualitative sense.
- (14)
problema sabut / home sabut
problem known / man known
‘known problem / wise man’
- (15)
noi cridaner / camisa cridanera
boy shout-suffix / shirt attention-gaining
‘boy who shouts a lot / attention-gaining shirt’
Examples (14) and (15) represent cases of regular polysemy because, as can be drawn from the translations, there is a systematic shift from a transparent relation with the event to a quality that bears a more distant relation to the event. In the case of sabut the relation is clear (if a man knows a lot, he is wise); in the case of cridaner, a shirt qualifies for the adjective if it is for instance loud-colored or has an eccentric cut, such that it gains the attention of people, as shouting does.
In this article, we only consider types of polysemy that cut across the classification pursued. Other kinds of polysemy that have traditionally been tackled in the literature will not be considered. For instance, we will not be concerned with the polysemy illustrated in Example (16), which arguably has more to do with the semantics of the modified noun than that of the adjective (Pustejovsky 1995). Both of the uses of trist (‘sad’) illustrated in Example (16) fall into the qualitative class, so, contrary to the work by Lapata (2000, 2001) cited previously, we do not treat the adjective as polysemous in the context of the present experiments.
- (16)
noi trist / pel·lícula trista
boy sad / film sad
‘sad boy / sad film’
4. First Model: Polysemous Adjectives Constitute Independent Classes
Given the hybrid behavior of polysemous adjectives explained in Section 3, we can expect that they behave differently from adjectives in the basic classes. For instance, adjectives polysemous between a qualitative and a relational use should exhibit more evidence for gradability than pure relational adjectives, but less than pure qualitative adjectives. In this view, polysemous adjectives belong to a class, for instance, the qualitative-relational class, that is distinct from both the qualitative and the relational classes, typically exhibiting feature values that are in between those of the basic classes. In this section, we report on experiments testing precisely this model for regular polysemy. We will therefore distinguish between five types of adjectives: qualitative, intensional, relational, polysemous between a qualitative and an intensional reading (intensional-qualitative), and polysemous between a qualitative and a relational reading (qualitative-relational). There is one polysemous class missing (intensional-relational). No cases of polysemy between intensional and relational adjectives were observed in our data.
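To make the difference between the two views of polysemy concrete, the following sketch contrasts two label encodings for the same lemma: the atomic labels of the first model, where a polysemous type is a class of its own, and a membership vector over basic classes, as in the second approach outlined in the Introduction. Function and constant names are hypothetical, and the encoding is only an illustration, not the representation used in the experiments.

```python
from __future__ import annotations

BASIC = ("qualitative", "intensional", "relational")

# First model: each attested polysemous type is an atomic class of its own.
FIRST_MODEL_CLASSES = BASIC + ("intensional-qualitative", "qualitative-relational")


def to_atomic_label(senses: set[str]) -> str:
    """Map the basic classes of a lemma's senses to one first-model label."""
    label = "-".join(sorted(senses))
    if label not in FIRST_MODEL_CLASSES:
        # For example, intensional-relational is not part of the class inventory.
        raise ValueError(f"no first-model class for {sorted(senses)}")
    return label


def to_membership_vector(senses: set[str]) -> tuple[int, ...]:
    """Alternative encoding: one binary membership decision per basic class."""
    return tuple(int(c in senses) for c in BASIC)


economic = {"relational", "qualitative"}   # econòmic: 'economic' | 'cheap'
print(to_atomic_label(economic))           # qualitative-relational
print(to_membership_vector(economic))      # (1, 0, 1)
```

Under the atomic encoding, a qualitative-relational lemma shares nothing with a purely qualitative one as far as the label is concerned; under the vector encoding, the two agree on the first component.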
Recall from the previous sections that we cannot reuse an established classification, and that there is virtually no previous work on the automatic semantic classification of adjectives. The present experiments also aim at testing the overall enterprise of inducing semantic classes from distributional properties for adjectives. Given the exploratory nature of the experiment, we use clustering, an unsupervised technique, to uncover natural groupings of adjectives and test to what extent these correspond to the classes described in the literature.
4.1 Data and Gold Standard
The experiments reported in this section are based on an eight million word fragment of the CTILC corpus (Corpus Informatitzat de la Llengua Catalana; Rafel 1994), developed at the Institut d’Estudis Catalans. Each word is associated with its lemma, part of speech, and inflectional features, as well as syntactic function. Lemma and morphological information have been manually checked. We automatically added syntactic information with CatCG (Alsina et al. 2002). CatCG is a shallow parser that assigns one or more syntactic functions to each word. In the case of the adjective, CatCG distinguishes between (1) predicate of a copular sentence; (2) predicate in another construction; (3) pre-nominal modifier; (4) post-nominal modifier. As no full dependencies are indicated, the head noun can only be identified with heuristics.
In the experiments, we cluster all adjectives occurring more than ten times in the corpus (a total of 3,521 lemmata), and analyze the results using a subset of the data. This is a randomly chosen 101-lemma gold standard (available in the Appendix). Fifty lemmata were chosen token-wise and 50 type-wise to balance high-frequency and low-frequency adjectives (one lemma was chosen with both methods, so the repetition was removed). Two lemmata were added in a post-hoc fashion, as explained subsequently.
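The two sampling schemes can be sketched as follows, under the assumption that "type-wise" means drawing lemmata uniformly and "token-wise" means drawing lemmata with probability proportional to their corpus frequency, each without replacement. The function name, the seed, and the toy frequency table are hypothetical; the article does not describe the sampling procedure at this level of detail.

```python
import random
from collections import Counter


def sample_gold_standard(freq: Counter, n_each: int, seed: int = 0) -> list:
    """Draw n_each lemmata token-wise and n_each type-wise, then merge them."""
    if n_each > len(freq):
        raise ValueError("n_each exceeds the number of lemmata")
    rng = random.Random(seed)

    # Type-wise: every lemma is equally likely, so rare lemmata dominate.
    type_wise = rng.sample(sorted(freq), n_each)

    # Token-wise: probability proportional to frequency (assumed reading),
    # drawn without replacement, so frequent lemmata dominate.
    pool = dict(freq)
    token_wise = []
    while len(token_wise) < n_each:
        names = sorted(pool)
        pick = rng.choices(names, weights=[pool[n] for n in names], k=1)[0]
        token_wise.append(pick)
        del pool[pick]

    # A lemma picked by both methods is kept only once.
    return list(dict.fromkeys(token_wise + type_wise))


toy_freq = Counter({"gran": 500, "nou": 300, "antic": 50,
                    "pulmonar": 5, "sabut": 4, "cridaner": 3})
print(sample_gold_standard(toy_freq, n_each=3, seed=1))
```

With 50 lemmata per method, one overlap, and two lemmata added afterwards, this procedure would yield the 101 lemmata mentioned above.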
The lemmata were annotated by four doctoral students in computational linguistics. The task of the judges was to assign each lemma to one of the five classes (qualitative, intensional, relational, intensional-qualitative, and qualitative-relational). The instructions for the judges included information about all linguistic characteristics discussed in Section 3, including syntactic and semantic characteristics.
The judges had a moderate degree of agreement, comparable to that obtained in other tasks on semantics or discourse, with inter-annotator scores ranging between κ = 0.54 and 0.64 (see Artstein and Poesio [2008] for a discussion of agreement measures for computational linguistics). For comparison, Véronis (1998) reported a mean pair-wise weighted κ = 0.43 for a word sense tagging task in French; and Merlo and Stevenson (2001) obtained κ = 0.53–0.66 for the task of classifying English verbs as unergative, unaccusative, or object-drop. Poesio and Artstein (2005) report κ values of 0.63–0.66 (0.45–0.50 if a trivial category is dropped) for the tagging of anaphoric relations. Our judges reported difficulties in tagging particular kinds of adjectives, such as deverbal adjectives. This issue will be taken up again in Section 4.5.
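For readers unfamiliar with the measure, the following is a minimal sketch of pairwise Cohen's κ for two judges over categorical labels. It assumes the unweighted two-annotator variant; the text above does not specify which κ variant was computed, and the labels in the usage example are invented.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1.0 - expected)


judge_1 = ["qualitative", "qualitative", "relational", "relational"]
judge_2 = ["qualitative", "relational", "relational", "relational"]
print(cohen_kappa(judge_1, judge_2))
```

In this toy case the judges agree on three of four items (0.75), chance agreement is 0.5, and κ is therefore 0.5.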
No intensional adjectives were identified in the data by the judges, and only one intensional-qualitative adjective was identified. Two intensional lemmata were manually added to be able to minimally track the class. This is clearly insufficient for a quantitative approach, however, so the intensional class is dropped in the alternative classification. It is striking that intensional adjectives, which have traditionally been the focus of formal semantic approaches to the semantics of adjectives, constitute a very small class (less than a dozen lemmata are mentioned in the reviewed literature).
4.2 Features
We use two sets of distributional features to model adjective behavior: on the one hand, theoretically motivated features (theoretical features for short); on the other hand, features that encode the part-of-speech distribution of a four-word window around the adjective (POS features). The former provide a theoretically informed model of adjectives, because they are cues to the properties of each class as described in the literature. The latter are meant to provide a theory-independent representation of adjectives, to test to what extent the structures obtained with theoretical and POS features are similar. Both sets of features take a narrow context into account (at most five words to each side of the adjective), because of the limited syntactic behavior of adjectives.
4.2.1 Theoretical Features
Feature | Textual correlate | Mean | SD |
---|---|---|---|
gradable | degree adverbs, degree suffixation | 0.04 | 0.08 |
comparable | comparative constructions | 0.03 | 0.07 |
copular | copular predicate syntactic tag | 0.06 | 0.10 |
predicative | predicate syntactic tag | 0.03 | 0.06 |
pre-nom | pre-nominal modifier syntactic tag | 0.04 | 0.08 |
adjacent | first adjective in a series of two or more | 0.03 | 0.05 |
Table 3 is the translation of Table 1 into shallow cues that can be extracted from a corpus. The mean values in Table 3 are very low, which points to the sparseness of theoretically defined properties such as predicativity or gradability, at least in written texts (oral corpora would presumably yield different values). Also note that standard deviations are higher than mean values, which indicates a high variability in the feature values, something that will be exploited for classification.
- (1) In comparison with the other classes, qualitative adjectives should have higher values for features gradable, comparable, copular, predicative, middle values for feature prenominal, and low values for feature adjacent.
- (2) Relational adjectives should have an almost opposite distribution, with very low values for all features except for adjacent.
- (3) Intensional adjectives should exhibit very low values for all features except for pre-nominal, for which a very high value is expected.
- (4) With respect to polysemous adjectives, it can be foreseen that their feature values will be in between those of the basic classes. For instance, an adjective that is polysemous between a qualitative and a relational reading should exhibit a higher value for feature gradable than a monosemous relational adjective, but a lower value than a monosemous qualitative adjective.
The differences in value distributions, although significant,1 are not sharp, as most of the ranges in the boxes overlap. This affects mainly polysemous classes: Although they show the tendency predicted—exhibiting values that are in between those of the basic classes—they do not present clearly distinct values. The clustering results will be affected by this distribution, as will be discussed in Section 4.5.
4.2.2 POS Features
POS features encode the part-of-speech distribution of a four-word window around the adjective, providing a theory-independent representation of the linguistic behavior of adjectives. To avoid data sparseness, we encode possible POS for each position as a different feature. For instance, for an occurrence of alta (‘tall’) as in Example (17a), the representation would be as in Example (17b). In the example, the target adjective is in boldface, and the relevant word window is in italics. Negative numbers indicate positions to the left, positive ones positions to the right. The representation in Example (17b) corresponds to the parts of speech of és, més, que, and la, respectively.
- (17)
  - a. la Bruna és més alta que l’Angelina
    the Bruna is more tall than the-Angelina
    ‘Bruna is taller than Angelina’
  - b. -2 verb, -1 adverb, +1 conjunction, +2 determiner
Feature values are defined as in theoretical features (see Equation (1)). The ten features with the overall highest mean value in our data (among a total of 36 features) are listed in Table 4. Note that the mean values are much higher for the POS features (Table 4) than for the theoretical features (Table 3), as theoretical features are much sparser.
Feature | Mean | SD | Feature | Mean | SD |
---|---|---|---|---|---|
−1 noun | 0.52 | 0.25 | −2 preposition | 0.13 | 0.09 |
+1 punctuation | 0.42 | 0.15 | −1 adverb | 0.10 | 0.11 |
−2 determiner | 0.39 | 0.20 | −1 verb | 0.08 | 0.11 |
+2 determiner | 0.24 | 0.13 | −1 determiner | 0.06 | 0.10 |
+1 preposition | 0.21 | 0.15 | +1 noun | 0.06 | 0.10 |
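The POS encoding can be sketched as follows. This is a minimal illustration, assuming that a feature value is the proportion of a lemma's occurrences in which a given window position holds a given part of speech; the function name and the tag strings are hypothetical.

```python
from collections import Counter

def pos_features(occurrences, offsets=(-2, -1, 1, 2)):
    """occurrences: list of (pos_tags, index), with the adjective at `index`.

    Returns a dict mapping (offset, pos) to the proportion of occurrences
    in which that window position carries that part of speech.
    """
    counts = Counter()
    for tags, i in occurrences:
        for off in offsets:
            j = i + off
            if 0 <= j < len(tags):  # ignore positions outside the sentence
                counts[(off, tags[j])] += 1
    total = len(occurrences)
    return {feat: c / total for feat, c in counts.items()} if total else {}

# Example (17a): la Bruna és més alta que l'Angelina (adjective at index 4).
tags_17a = ["determiner", "noun", "verb", "adverb", "adjective",
            "conjunction", "determiner", "noun"]
features = pos_features([(tags_17a, 4)])
```

With a single occurrence every observed feature has value 1.0; over many occurrences the values become proportions comparable to the means in Table 4.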
4.3 Clustering Algorithm and Parameters
We use the k-means clustering algorithm (see Kaufman and Rousseeuw [1990] and Everitt, Landau, and Leese [2001] for comprehensive introductions to clustering).2 This is a classical algorithm, conceptually simple and computationally efficient, which has been used in related work, such as the induction of German semantic verb classes (Schulte im Walde 2006) and the syntactic classification of Catalan verbs (Mayol, Boleda, and Badia 2005). Also, it performs hard clustering, which is adequate for our purposes (recall from Section 4.1 that we model polysemy in terms of separate classes). Additional experiments with other clustering methods yielded similar results: We tested two hierarchical and one flat algorithm, one of them agglomerative and the other two partitional, with several clustering criteria, always using the cosine distance measure.
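For readers unfamiliar with the algorithm, a minimal sketch of k-means with cosine similarity follows. It is not the implementation used in the experiments: the initialization (the first k points) and the stopping rule are simplifying assumptions, and the toy vectors are invented.

```python
import math

def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def _cosine(u, v):
    # Inputs are unit-length (or zero) vectors, so the dot product is the cosine.
    return sum(a * b for a, b in zip(u, v))

def kmeans_cosine(vectors, k, max_iter=100):
    """Hard clustering: returns one cluster index per input vector."""
    points = [_normalize(v) for v in vectors]
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    assignment = None
    for _ in range(max_iter):
        # Assign each point to the most similar centroid.
        new_assignment = [
            max(range(k), key=lambda c: _cosine(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Recompute each centroid as the normalized mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster empties
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = _normalize(mean)
    return assignment

toy_vectors = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.2], [0.2, 0.9]]
clusters = kmeans_cosine(toy_vectors, k=2)
```

Each adjective ends up in exactly one cluster, which is the hard-clustering property referred to in the text.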
We experimented with two representations of the feature values: raw and standardized proportions. In clustering, features with higher mean and standard deviation values tend to dominate over more sparse features. Standardization smooths the differences in the strengths of features. We standardize to z-scores, so that all features have mean 0 and standard deviation 1. As the most interpretable results were obtained with standardized values, we will restrict the discussion in the next section to the results obtained with standardized values.
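The standardization step can be sketched as below. The sketch assumes the population standard deviation (the text does not state which estimator was used), and the feature names and values are invented.

```python
import statistics

def standardize(columns):
    """columns: dict mapping a feature name to its values, one per adjective.

    Returns the same structure with each feature converted to z-scores.
    """
    out = {}
    for name, values in columns.items():
        mean = statistics.fmean(values)
        sd = statistics.pstdev(values)  # population SD (an assumption)
        # A constant feature carries no information; map it to zeros.
        out[name] = [(v - mean) / sd if sd else 0.0 for v in values]
    return out

raw = {"gradable": [0.0, 0.04, 0.20], "minus1_noun": [0.30, 0.52, 0.80]}
z = standardize(raw)
```

After this transformation a sparse feature such as gradable and a dense one such as −1 noun contribute on the same scale to the distance computation.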
4.4 Results
The discussion focuses on the cluster analyses with three and five clusters because our basis is three classes (intensional, qualitative, and relational) and we consider a total of five classes (basic classes plus polysemous classes: intensional-qualitative and qualitative-relational). A higher number of clusters introduces more noise (in the form of small clusters with no clear content).
The contingency tables of the clustering results with three clusters are depicted in Table 5. Part A of the table depicts the solution obtained with theoretical features, while Part B represents the solution obtained with POS features. Rows are gold standard classes and columns are clusters, labeled with the cluster number provided by the algorithm. The ordering of the cluster numbers corresponds to the quality of the cluster, measured in terms of the clustering criterion (see Equation (2)), 0 representing the cluster with the highest quality. Each cell Cij of Table 5 gives the number of adjectives of class i that are assigned to cluster j by the algorithm. The largest value for each class in each solution is highlighted in bold.
Class | A (theoretical): 0 | A: 1 | A: 2 | B (POS): 0 | B: 1 | B: 2 | Total |
---|---|---|---|---|---|---|---|
intensional (I) | 0 | 0 | **2** | 0 | **2** | 0 | 2 |
intensional-qualitative (IQ) | 0 | 0 | **1** | 0 | **1** | 0 | 1 |
qualitative (Q) | 4 | 13 | **35** | 10 | **37** | 5 | 52 |
qualitative-relational (QR) | 3 | **5** | 3 | **7** | 2 | 2 | 11 |
relational (R) | **21** | 13 | 1 | **20** | 5 | 10 | 35 |
Total GS | 28 | 31 | 42 | 37 | 47 | 17 | 101 |
Total cl | 834 | 1,287 | 1,400 | 1,234 | 1,754 | 533 | 3,521 |
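A contingency table of this kind can be built directly from the gold standard labels and the cluster assignments. The following sketch uses invented data for six lemmata; the function name is hypothetical.

```python
from collections import Counter

def contingency(gold, clusters):
    """Count, for each (class, cluster) pair, the lemmata that fall in it."""
    return Counter(zip(gold, clusters))

# Invented gold classes and cluster numbers for six lemmata.
gold = ["Q", "Q", "Q", "R", "R", "QR"]
clusters = [2, 2, 1, 0, 0, 1]
table = contingency(gold, clusters)

# Marginals: lemmata per class (row totals) and per cluster (column totals).
row_totals = Counter(gold)
col_totals = Counter(clusters)
```

Pairs that never occur simply have a count of zero, so `table[("I", 0)]` can be read without a key check.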
A striking feature of Table 5 is that results in the two parts (A and B) are very similar. The following can be observed:
- (1) There is one cluster (cluster 0 in both solutions) that contains the majority of relational adjectives in the gold standard. This is the most compact cluster according to the clustering criterion.
- (2) Another cluster (2 in solution A, 1 in solution B) contains the majority of qualitative adjectives in the gold standard, as well as all intensional and IQ adjectives.
- (3) The remaining cluster contains a mixture of qualitative and relational adjectives in both solutions.
- (4) Adjectives that are polysemous between a qualitative and a relational reading (QR) are scattered through all the clusters, although they show a tendency to be ascribed to the relational cluster in solution B (cluster 0).
The five-way results are depicted in Table 6. On the one hand, the table shows that the five-way structure found by the clustering algorithm is very similar to the three-way structure in Table 5. This means that the three clusters in A and B have basically been replicated by the first three clusters in C and D, respectively. On the other hand, the differences between the structures obtained using theoretical versus POS features are more obvious in the five-way solutions. From the set-up of the experiment, we had expected one cluster per class, plus QR and IQ adjectives isolated in a cluster of their own. This is clearly not borne out in Table 6. What we find instead is that (a) the mixed clusters persist and score high in the clustering criterion (see cluster 0 in solution C and clusters 0–1 in solution D, with a mixture of Q, QR, and R adjectives), and (b) two additional small clusters are created (clusters 3 and 4 in both solutions) with no clear interpretation, suggesting that the three-way set-up better matches the structure uncovered by the clustering algorithm.
Class | C (theoretical): 0 | C: 1 | C: 2 | C: 3 | C: 4 | D (POS): 0 | D: 1 | D: 2 | D: 3 | D: 4 | Total |
---|---|---|---|---|---|---|---|---|---|---|---|
I | 0 | 0 | **2** | 0 | 0 | 0 | 0 | **2** | 0 | 0 | 2 |
IQ | 0 | 0 | **1** | 0 | 0 | 0 | 0 | **1** | 0 | 0 | 1 |
Q | 7 | 4 | **35** | 4 | 2 | 3 | 7 | **37** | 2 | 3 | 52 |
QR | **5** | 3 | 3 | 0 | 0 | **6** | 1 | 2 | 1 | 1 | 11 |
R | 12 | **21** | 1 | 0 | 1 | **11** | 9 | 5 | 7 | 3 | 35 |
Total GS | 24 | 28 | 42 | 4 | 3 | 20 | 17 | 47 | 10 | 7 | 101 |
Total cl | 857 | 854 | 1,462 | 156 | 192 | 828 | 406 | 1,754 | 275 | 258 | 3,521 |
From the discussion of Tables 5 and 6 we conclude that the three-way clustering meets the target classification better than the five-way clustering, and that polysemous adjectives are not identified as a separate class. These results suggest that modeling polysemous adjectives in terms of additional, complex classes is not an adequate strategy (we return to this point subsequently).
Recall that we defined theoretical and POS features to compare the structures obtained using theoretically informed and theory-independent features. Further feature analysis, not reported here for space reasons, reveals a high correlation between the most descriptive features of solutions A and B.3 This highlights the correspondence between the two feature representations with respect to the clustering results: The POS features elicited as most discriminative by the clustering algorithm are precisely those that correspond to the theoretical features. This correspondence explains the resemblance between the solutions obtained with the two types of representation and at the same time provides support for the present definition of the theoretical features.
Last but not least, note that we do not assign a score to each clustering solution. Evaluation of clustering is very problematic when there is no one-to-one correspondence between classes and clusters (Hatzivassiloglou and McKeown 1993), as is our case. Schulte im Walde (2006) provides a thorough discussion of this issue and proposes different metrics and types of evaluation. We defer numerical evaluation until Section 5.
4.5 Discussion
4.5.1 Classification
The experiments presented provide feedback to the question, what is an appropriate broad semantic classification for adjectives? The clustering experiments provide empirical support for the qualitative and relational classes, as is particularly evident in the three-way solution (Table 5). These are classes that have traditionally been taken into account in descriptive grammar (Bally 1944; Picallo 2002) and computational resources such as WordNet (Miller 1998; Alonge et al. 2000), so we consider them to be quite stable and keep them in our classification.
Intensional and IQ adjectives, in contrast, are grouped together with qualitative adjectives in all solutions, because they do not exhibit distinctive enough distributional properties to differentiate them, a fact aggravated by the small size of the intensional class. From the point of view of NLP, it is reasonable to encode intensional adjectives by hand, given their limited number. For these reasons, we include the intensional class in the qualitative class in what follows (remember that, as mentioned in Section 3, WordNet also includes intensional adjectives in the qualitative—in their terms, descriptive—class).
“Hybrid” clusters, that is, clusters that contain adjectives from several semantic classes, play an interesting role in our cluster analyses. Such clusters seem to be coherent and stable, as they appear in all examined solutions (A, B, and also C and D in Tables 5 and 6) and have good scores in the clustering criterion. Significantly, however, most of the adjectives that are problematic for humans are assigned to hybrid clusters, where problematic means that they are not assigned to the same class by all four judges. Conversely, most adjectives in the hybrid clusters are problematic. Thus, hybrid clusters are useful to signal problems in the proposed classification. As an example, consider cluster 0 in Part C of Table 6: 17 out of the 24 (70.8%) gold standard adjectives in this hybrid cluster are problematic for humans. This cluster contrasts with the qualitative cluster (cluster 2 of Table 6), where only 10 out of its 42 (23.8%) lemmata are problematic.
Two kinds of adjectives crop up among problematic adjectives: so-called ethnic adjectives (alemany ‘German’, menorquí ‘Menorcan’, sud-africà ‘South African’, xinès ‘Chinese’), and deverbal adjectives (indicador ‘indicating’, parlant ‘speaking’, protector ‘protecting, protective’, salvador ‘savior’). Ethnic adjectives can act as predicates of copular sentences in a much more natural way than typical relational adjectives, and seem to be vague between a relational and a qualitative reading in their semantics (Raskin and Nirenburg 1998, page 173). This kind of adjective will mainly be treated as polysemous in the experiments reported in Section 5.
As for deverbal adjectives, they are clearly neither relational (they do not express a relationship to an object) nor intensional. They are also not typically qualitative, however, because they trigger a relationship to an event instead of denoting a simple property. For instance, protector triggers a relationship with a stable event of protecting in Example (18): A person named Serra belongs to the kind of associates who have as a primary role to protect the association.
- (18) Serra ... Era soci protector de l’Associació de concerts
  Serra ... was associate protecting of the-Association of concerts
  ‘Serra was a protecting associate of the Association of concerts’
These considerations motivate the addition of a class of event-related adjectives in the overall classification. Event-related adjectives have not received much attention in the linguistic literature, except for one particular subtype, namely, adjectival uses of the participle (Bresnan 1982; Levin and Rappaport 1986; Bresnan 1995). As for computational resources, the English WordNet, as explained in Section 3, only distinguishes some participial adjectives. In the Italian WordNet, however, other event-related adjectives receive a specific treatment, through the encoding of the lexical relations causes and liable-to, as exemplified in Example (19) (Alonge et al. 2000):
- (19)
  - a. depuratorio ‘depurative, purifying’ causes depurare ‘to depurate/purify’.
  - b. giudicabile ‘triable’ liable-to giudicare ‘to judge’.
To sum up, the results of the experiments reported in this section motivate a three-way classification into qualitative, event-related, and relational adjectives. Note that, in the revised classification proposed in this section, classes are uniformly defined according to the ontological type of their denotation: Qualitative adjectives denote attributes or properties, relational adjectives denote relationships to objects, and event-related adjectives denote relationships to events. The classes correspond to the three major types of entities in an ontology (attributes, objects, events), more specifically, to the way adjectives relate to those entities. In this view, relational and event-related adjectives denote properties, just as qualitative adjectives do, but they are a specific type of property involving a relationship with either an object or an event. The classification is in fact similar to the one proposed in the Ontological Semantics framework (Raskin and Nirenburg 1998; Nirenburg and Raskin 2004).
Also note that the revised semantic classification bears a prominent relationship to morphology: In the default case, qualitative adjectives are not derived, event-related adjectives are deverbal, and relational adjectives are denominal. However, the correspondence between semantic classes and derivational type is not a one-to-one mapping. Although most event-related adjectives are deverbal, adjectives that are not strictly deverbal can also evoke events: For instance, tangible ‘tangible’ evokes an event of touching, but there is no verb *tangir in Catalan (tangible is built on the Latin verb tangō, ‘touch’). Raskin and Nirenburg (1998, page 187) cite examples for English such as audible or ablaze. Similarly, some relational adjectives are not denominal (such as botànic ‘botanical’). Conversely, some denominal or deverbal adjectives are qualitative: vergonyós ‘shy’ (from vergonya ‘shyness’), amable (literally ‘suitable to be loved’; has evolved to ‘kind, friendly’). We will empirically check the correspondence between morphology and semantic class in Section 5.5.
4.5.2 Regular Polysemy
Our first series of experiments also provides feedback to the question, what is an adequate computational model for regular polysemy? Specifically, we have shown that the treatment of regular polysemy in terms of independent classes is not adequate. Remember that the motivation for the experiments presented in this section was the hypothesis that polysemous adjectives exhibit a linguistic behavior that partakes of the basic classes involved in the regular polysemy, thus yielding feature values that are in between those of the basic classes (cf. Figure 1). Thus, we had expected polysemous adjectives to form a homogeneous group of lexical items, characterized precisely by the fact that they exhibit properties from each class to a certain degree. However, this expectation is not borne out in the results of the experiments. In this respect, it is striking that QR adjectives (polysemous between a qualitative and a relational reading) are spread throughout all the clusters in all solutions. They are not identified as a homogeneous group, nor as distinct from the rest. Crucially, as pointed out in Section 4.2, the differences between the feature values of polysemous adjectives and those of the basic classes are not strong enough to motivate a separate cluster.
We believe that the reason for these results is the fact that polysemous adjectives do not in fact have a homogeneous, differentiated profile: In a given corpus, most adjectives are used predominantly in one of their senses, corresponding to one of the basic classes, and thus the “hard” classification with three clusters fits better. For instance, the qualitative-relational adjective irònic (‘ironic’) is mainly used as a qualitative adjective in the corpus. Accordingly, it always appears in the qualitative clusters. Conversely, militar (‘military’) is mostly used as a relational adjective, and is consistently assigned to one of the relational clusters in all solutions. Thus, although polysemous adjectives on average do show a mixed behavior, each lexical item tends to pattern with one of the basic classes. An alternative conceptualization of regular polysemy and experimental design is called for, and this will be the topic of the next section.
5. Second Model: Polysemous Adjectives Simultaneously Belong to Different Classes
The experiments presented in the previous section pursued two goals: on the one hand, to test the initial classification proposal; on the other, to test a model of regular polysemy that treats polysemous adjectives in terms of separate classes. With respect to the first goal, the experiments in this section rely on the results of the previous experiments, and use the alternative classification described in Section 3.2. The alternative classification has in addition been supported by a clustering experiment not reported here for space reasons (see Boleda, Badia, and Batlle [2004] for details and discussion).
With respect to the second goal, we have shown that the first model is not successful at modeling regular polysemy. Furthermore, the analysis of feature values in the previous section suggests that the lack of success is not related to the specific technique used in the initial experiment, but to the properties of polysemous adjectives: the fact that they are used predominantly in one of their senses, and the fact that the feature distributions of “polysemous classes” largely overlap with those of the basic classes.
In the present experiments, we develop an alternative approach to regular polysemy that is based on the perspective that polysemous adjectives belong to more than one semantic class, in the framework of multi-label classification. A typical example of a multi-label classification task is Text Categorization (Schapire and Singer 2000), where a document can be described via more than one label (e.g., Health and Local), so that it effectively belongs to more than one of the target classes. The motivation for this new approach is the fact that polysemous adjectives exhibit properties of all the classes involved (see Section 3.3). The hypothesis is that the evidence found for an adjective that is polysemous between, say, a relational and a qualitative use should be strong enough for the adjective to be assigned to both the relational and the qualitative classes. Note that by assigning the adjective to the two classes independently, we make an implicit classification of the adjective as polysemous. The success of the approach will depend on whether the different senses are sufficiently represented in the data, and it will be especially challenging to distinguish between noise and evidence for a given class.
5.1 Data and Gold Standard
The experiments reported in this section are based on a 16 million word fragment of the CTILC corpus (see Section 4.1). We additionally use an adjective database (Sanromà and Boleda 2010) with manually coded information about all adjectives occurring more than 50 times in the corpus (2,296 lemmata). The database codes the derivational type (deverbal, denominal, participial, non-derived) and suffix of each adjective.
A gold standard of 210 adjective lemmata (available in the Appendix) was selected from this database for the experiments. The lemmata were randomly sampled in a stratified fashion, balancing three factors of variability: frequency, morphological type, and suffix. Thus, the gold standard contains an equal number of adjectives from three frequency bands (low, medium, high), from the four derivational types, and from a series of suffixes within each type. This sampling method is aimed at achieving semantic variability.
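A stratified sample of this kind can be sketched as follows. The sketch balances only two of the three factors (frequency band and derivational type), draws a fixed number of lemmata per stratum, and uses invented lemma records; all names are hypothetical and the actual sampling procedure may differ.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(items, key, per_stratum, seed=0):
    """Draw up to `per_stratum` items at random from each stratum."""
    rng = random.Random(seed)  # fixed seed for a reproducible sample
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for name in sorted(strata):
        pool = strata[name]
        sample.extend(rng.sample(pool, min(per_stratum, len(pool))))
    return sample

# Invented database: five lemmata for each band/type combination.
toy = [
    {"lemma": f"adj_{band}_{dtype}_{i}", "band": band, "dtype": dtype}
    for band in ("low", "medium", "high")
    for dtype in ("deverbal", "denominal", "participial", "non-derived")
    for i in range(5)
]
sample = stratified_sample(toy, key=lambda a: (a["band"], a["dtype"]), per_stratum=2)
per_cell = Counter((a["band"], a["dtype"]) for a in sample)
```

Every combination of frequency band and derivational type is represented by the same number of lemmata, which is what keeps rare types from being swamped by frequent ones.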
Three experts assigned each of the 210 lemmata to one or two of the classes in the alternative classification, namely, event-related, qualitative, or relational. The decisions were reached by consensus and were based on expert knowledge together with the examination of the information in the database, corpus examples, and the judgments provided by 322 naive subjects in a large-scale annotation experiment.4
Table 7 shows the distribution of adjectives in the gold standard into classes according to the three experts. These are the data used in the experiments presented in this section. The proportion of polysemous adjectives is quite high, over 17%, with qualitative-relational being the most frequent type of polysemy. Also note that 51% of the adjectives are qualitative; this will be the baseline for the machine learning experiments presented subsequently.
Class | Label | Example | # | % |
---|---|---|---|---|
qualitative | Q | tenaç, ‘tenacious’ | 107 | 51.0 |
event | E | informatiu, ‘informative’ | 37 | 17.6 |
relational | R | cranià, ‘cranial’ | 30 | 14.3 |
qualitative-relational | QR | familiar, ‘familiar’ | 23 | 11.0 |
qualitative-event | QE | sabut, ‘known’ | 7 | 3.3 |
event-relational | ER | comptable, ‘countable’ | 6 | 2.9 |
Total | | | 210 | 100 |
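The baseline and the proportion of polysemous adjectives follow directly from the counts in Table 7, as the short computation below shows. Variable names are illustrative; two-letter labels are taken to mark polysemous classes.

```python
# Class counts as given in Table 7.
counts = {"Q": 107, "E": 37, "R": 30, "QR": 23, "QE": 7, "ER": 6}
total = sum(counts.values())

# Majority-class baseline: always predict the most frequent class.
majority_class, majority_count = max(counts.items(), key=lambda kv: kv[1])
baseline = majority_count / total

# Two-letter labels denote adjectives assigned to two classes.
polysemous = sum(n for label, n in counts.items() if len(label) == 2) / total
```

This yields a baseline of 107/210 and a polysemous proportion of 36/210, matching the 51% and "over 17%" figures cited in the text.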
5.2 Features
5.2.1 Feature Definition
We define five feature sets based on different types of linguistic information, to gain further insight into the properties of each class. In particular, we are interested in the properties of event-related adjectives, for which we do not have a description in the linguistic literature. Table 8 summarizes the properties of the feature sets used for the present experiments.
Feature set | Description | # | Example |
---|---|---|---|
morph | morphological (derivational) properties | 2 (25) | suffix |
func | syntactic function of the adjective | 4 | post-nom. modifier |
uni | uni-gram POS (1 word to left or to right) | 24 | −1noun |
bi | bi-gram POS (1 word to left and 1 to right) | 50 | −1noun+1adj |
theor | distributional cues of theoretical properties | 18 | gradable |
Total | | 98 (121) | |
Feature set morph represents derivational properties of adjectives, as encoded in the adjective database. We include this type of information because of the relevance of morphology for the new classification (see Section 4.5). Func encodes the syntactic functions of the adjectives in the corpus, as explained in Section 4.1. Uni (for unigram) and bi (for bigram) encode the distribution of the adjective in the corpus in terms of the parts of speech of the surrounding words. Feature analysis of the first experiment showed that the words immediately preceding and following the target were the most informative, so in the present experiment only a one-word window is taken into account. The unigram distribution (uni) encodes each part of speech separately, as was done in the first experiment, and the bigram distribution (bi) takes the left and right word jointly, to avoid feature correlation effects. In the latter feature set, only the 50 most frequent bigrams are considered, to avoid features that are too sparse.5
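The restriction to the most frequent bigrams can be sketched as follows. The sketch assumes that frequency is counted over all adjective occurrences and that feature values are per-lemma proportions; the contexts and the function name are invented.

```python
from collections import Counter

def bigram_features(contexts_by_lemma, top_n=50):
    """contexts_by_lemma: lemma -> list of (left_pos, right_pos) pairs.

    Keeps the `top_n` most frequent bigrams overall and returns, for each
    lemma, the proportion of its occurrences showing each kept bigram.
    """
    overall = Counter(bg for ctxs in contexts_by_lemma.values() for bg in ctxs)
    kept = [bg for bg, _ in overall.most_common(top_n)]
    features = {}
    for lemma, ctxs in contexts_by_lemma.items():
        local = Counter(ctxs)
        features[lemma] = {bg: local[bg] / len(ctxs) for bg in kept} if ctxs else {}
    return kept, features

toy = {
    "alt": [("adverb", "conjunction"), ("adverb", "conjunction"),
            ("noun", "punctuation")],
    "pulmonar": [("noun", "punctuation"), ("noun", "punctuation"),
                 ("noun", "preposition")],
}
kept, feats = bigram_features(toy, top_n=2)
```

In the toy data the rarest bigram (noun, preposition) is discarded, which is the intended effect of the cut-off: very sparse contexts do not become features.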
Finally, feature set theor (for theoretical) generalizes and adds to the theoretical properties used in the first experiment (Table 3 in Section 4.2). Upon inspection of the clustering solutions (not reported here for space reasons), some further potentially relevant distributional pieces of information cropped up that were included in the theor features of the present experiment. The new features, summarized in Table 9, cover several aspects of the noun phrases (NPs) in which adjectives occur: The type of determiner of the NP, agreement properties (as these can correlate with semantic properties), the syntactic function of the head noun, and the presence of a potential adjective complement. The latter are usually headed by prepositions (El Joan està gelós d’en Pere, ‘Joan is jealous of Pere’). Finally, feature distance to the head is a reformulation of feature adjacent from Section 4.2. It encodes the mean distance of the adjective to the head, in number of words, as this is a more general definition that alleviates data sparseness.
Property | Features |
---|---|
type of determiner | NP headed by definite/indefinite/no determiner |
agreement properties | gender and number of the NP |
syntactic function of head noun | subject, object, complement to a preposition |
complement-bearing | adjective followed by a preposition |
distance to the head | linear distance (number of words) |
5.2.2 Feature Tuning
We test the effects of feature selection on the performance of the classifiers. The features are selected according to their performance within the machine learning algorithm used for classification. Accuracy for a given subset of features is estimated by cross-validation over the training data. Because the number of subsets increases exponentially with the number of features, this method is computationally very expensive, so we use a best-first search strategy. We also experiment with binarization of the two categorical features (suffix, derivational type).
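As an illustration of this kind of wrapper selection, the following is a minimal sketch of a forward best-first search over feature subsets. The `evaluate` callback stands in for the cross-validated accuracy of the classifier on the training data; the stopping parameter `patience` and all names are assumptions made for the example, not settings reported for the experiments.

```python
import heapq
from itertools import count


def best_first_selection(features, evaluate, patience=5):
    """Forward best-first search over feature subsets.

    evaluate(subset) must return an accuracy estimate for a frozenset of
    features. The search stops after `patience` consecutive expansions
    that fail to improve on the best score found so far.
    """
    start = frozenset()
    best_subset, best_score = start, evaluate(start)
    tie = count()  # tie-breaker so the heap never compares frozensets
    open_list = [(-best_score, next(tie), start)]
    seen = {start}
    stale = 0
    while open_list and stale < patience:
        # Expand the most promising subset evaluated so far.
        _, _, subset = heapq.heappop(open_list)
        improved = False
        for feature in features:
            if feature in subset:
                continue
            child = subset | {feature}
            if child in seen:
                continue
            seen.add(child)
            score = evaluate(child)
            heapq.heappush(open_list, (-score, next(tie), child))
            if score > best_score:
                best_subset, best_score, improved = child, score, True
        stale = 0 if improved else stale + 1
    return best_subset, best_score
```

Unlike plain greedy selection, the open list lets the search return to an earlier subset when the current path stops improving, which is what keeps the cost below that of exhaustive enumeration while avoiding some local optima.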
5.3 Method
The classification task is approached with a two-level architecture.
1. The decision on the class of the adjective is decomposed into three binary decisions: Is it qualitative or not? Is it event-related or not? Is it relational or not?
2. A complete classification is achieved by merging the results of the binary decisions. A consistency check is applied by which (a) if all decisions are negative, the adjective is assigned to the qualitative class (the most frequent one; this was the case for a mean of 4.6% of the class assignments); (b) if all decisions are positive, we randomly discard one (three-way polysemy is not foreseen in our classification; this was the case for a mean of 0.6% of the class assignments).
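The merging step and its consistency check can be sketched as follows. This is a minimal illustration of the two rules stated above; the function name and the single-letter class labels are chosen for the example.

```python
import random


def merge_decisions(is_qualitative, is_event, is_relational, rng=random):
    """Merge three binary decisions into one (possibly polysemous) class.

    Returns a frozenset of labels from {"Q", "E", "R"} with one or two members.
    """
    labels = [label for label, positive in
              (("Q", is_qualitative), ("E", is_event), ("R", is_relational))
              if positive]
    if not labels:
        # (a) all decisions negative: default to the most frequent class
        return frozenset({"Q"})
    if len(labels) == 3:
        # (b) all decisions positive: randomly discard one class,
        # since three-way polysemy is not foreseen
        labels.remove(rng.choice(labels))
    return frozenset(labels)
```

For example, `merge_decisions(True, False, True)` yields the qualitative-relational class, and passing a seeded `random.Random` instance as `rng` makes the random discard reproducible.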
Note that in the present experiments we change both the classification and the approach (unsupervised vs. supervised) with respect to the first set of experiments presented in Section 4, which can be seen as a sub-optimal technical choice. After the first series of experiments that required a more exploratory analysis, however, we believe that we have now reached a more stable classification, which we can test by supervised methods. In addition, we need a one-to-one correspondence between gold standard classes and clusters for the approach to work, which we cannot guarantee when using an unsupervised approach that outputs a certain number of clusters with no mapping to the gold standard classes.
We test two types of classifiers. The first type are Decision Tree classifiers trained on different types of linguistic information coded as feature sets. Decision Trees are one of the most widely used machine learning techniques (Quinlan 1993), and they have been used in related work (Merlo and Stevenson 2001). They have relatively few parameters to tune (a requirement with small data sets such as ours) and provide a transparent representation of the decisions made by the algorithm, which facilitates the inspection of results and the error analysis. We will refer to these Decision Tree classifiers as simple classifiers, in opposition to the ensemble classifiers, which are complex, as explained next.
The second type of classifier we use are ensemble classifiers (ECs), which have received much attention in the machine learning community (Dietterich 2000). When building an ensemble classifier, several class proposals for each item are obtained from multiple simple classifiers, and one of them is chosen on the basis of majority voting, weighted voting, or more sophisticated decision methods. It has been shown that in most cases, the accuracy of the ensemble classifier is higher than that of the best individual classifier (Freund and Schapire 1996; Dietterich 2000; Breiman 2001). The main reason for the general success of ensemble classifiers is that they are more robust towards the biases particular to individual classifiers: A bias shows up in the data in the form of “strange” class assignments made by one single classifier, which are therefore overridden by the class assignments of the remaining classifiers.7
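The simplest of the decision methods mentioned above, majority voting, can be sketched in a few lines. The tie-breaking rule (the earliest proposal among the tied labels wins) is an assumption of this example and is not specified in the text.

```python
from collections import Counter


def majority_vote(proposals):
    """Pick the class proposed by most simple classifiers.

    proposals: non-empty list of labels, one per simple classifier.
    Ties are broken in favor of the label that appears first in the list.
    """
    counts = Counter(proposals)
    top = max(counts.values())
    for label in proposals:
        if counts[label] == top:
            return label
```

A single classifier with an idiosyncratic bias is outvoted as long as the remaining classifiers agree, which is the robustness argument made in the paragraph above.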
For the evaluation, 100 different estimates of accuracy are obtained for each feature set using 10-run, 10-fold cross-validation (10x10 cv for short). In this schema, 10-fold cross-validation is performed 10 times, that is, 10 different random partitions of the data (runs) are made, and 10-fold cross-validation is carried out for each partition. To avoid the inflated Type I error probability when reusing data (Dietterich 1998), the significance of the differences between accuracies is tested with the corrected resampled t-test as proposed by Nadeau and Bengio (2003).8
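As a rough sketch of the significance test, the corrected resampled t statistic can be computed from the paired accuracy differences of two classifiers over the 100 train/test splits. The formula below is the commonly cited form of the Nadeau and Bengio correction; the function name and argument layout are assumptions of this example. With 10-fold cross-validation, the test/train size ratio is 1/9.

```python
from math import sqrt
from statistics import mean, variance


def corrected_resampled_t(diffs, test_train_ratio):
    """Corrected resampled t statistic for paired accuracy differences.

    diffs: per-split differences in accuracy between two classifiers.
    test_train_ratio: size of the test set divided by the size of the
    training set in each split (1/9 for 10-fold cross-validation).
    The statistic is compared against a t distribution with
    len(diffs) - 1 degrees of freedom.
    """
    k = len(diffs)
    sample_var = variance(diffs)  # unbiased estimate (k - 1 denominator)
    return mean(diffs) / sqrt((1.0 / k + test_train_ratio) * sample_var)
```

Compared to the standard paired t-test, the extra `test_train_ratio` term inflates the variance estimate to compensate for the overlap between training sets across splits, so the corrected statistic is always smaller in magnitude.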
5.4 Results
5.4.1 Simple Classifiers
The accuracies for the simple classifiers are shown in Table 10. Part A of the table lists the results for each of the binary decisions (qualitative/non-qualitative, event/non-event, relational/non-relational). The accuracy for each decision is computed independently. For instance, a qualitative-event adjective is judged correct within the qualitative class iff the decision is qualitative; correct within the event class iff the decision is event; and correct within the relational class iff the decision is non-relational.
Feature set | A: Per-class, Qualitative | A: Per-class, Event | A: Per-class, Relational | B: Overall, Full | B: Overall, Partial |
---|---|---|---|---|---|
baseline | 65.2 ± 11.1 | 76.2 ± 9.9 | 71.9 ± 9.6 | 51.0 ± 0.0 | 65.2 ± 0.0 |
morph | 68.2 ± 11.1 | 87.3** ± 6.3 | 85.2*** ± 7.2 | 59.9*** ± 2.2 | 84.7*** ± 0.7 |
morphFS | 72.5* ± 7.9 | 89.1** ± 6.0 | 84.2*** ± 7.5 | 60.6*** ± 1.3 | 87.8*** ± 0.4 |
func | 75.1** ± 9.0 | 76.1 ± 9.8 | 82.8** ± 7.5 | 56.0*** ± 1.9 | 80.6*** ± 1.8 |
uni | 64.2 ± 10.8 | 68.4 ± 12.0 | 82.1** ± 9.0 | 42.8 ± 2.7 | 74.8*** ± 2.6 |
uniFS | 66.0 ± 9.3 | 75.1 ± 10.6 | 82.2** ± 7.5 | 52.9 ± 1.9 | 77.0*** ± 2.0 |
bi | 63.8 ± 9.9 | 66.2 ± 9.8 | 78.2* ± 8.2 | 46.1 ± 2.3 | 77.8*** ± 1.8 |
biFS | 67.4 ± 10.6 | 72.3 ± 10.2 | 83.0*** ± 8.3 | 52.3 ± 1.7 | 76.7*** ± 1.0 |
theor | 71.8 ± 10.0 | 74.1 ± 9.9 | 86.4*** ± 7.6 | 54.8*** ± 1.7 | 81.8*** ± 1.8 |
all | 75.5** ± 9.0 | 86.5** ± 6.4 | 86.0*** ± 6.5 | 62.5*** ± 2.5 | 87.6*** ± 2.5 |
Part B reports the accuracies for the overall, merged class assignments, taking polysemy into account (qualitative vs. qualitative-event vs. qualitative-relational vs. event, etc.).9 In Part B, we report two accuracy measures: full and partial. Full accuracy requires the class assignments to be identical (an assignment of qualitative for an adjective labeled as qualitative-relational in the gold standard will count as an error), whereas partial accuracy only requires some overlap in the classification of the machine learning algorithm and the gold standard for a given class assignment (a qualitative assignment for a qualitative-relational adjective will be counted as correct). The motivation for reporting partial accuracy is that a class assignment with some overlap with the gold standard is more useful than a class assignment with no overlap. The figures in the discussion that follow refer to full accuracy unless otherwise stated.
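The two measures can be made concrete with a short sketch, assuming each class assignment is represented as a set of basic class labels (so that a qualitative-relational adjective is `{"Q", "R"}`). The representation and the function name are choices made for this example.

```python
def full_and_partial_accuracy(predicted, gold):
    """Compute full and partial accuracy over parallel lists of label sets.

    Full accuracy counts an item as correct only if the two sets are
    identical; partial accuracy counts it as correct if they overlap.
    """
    pairs = list(zip(predicted, gold))
    full = sum(1 for p, g in pairs if p == g)
    partial = sum(1 for p, g in pairs if p & g)
    return full / len(pairs), partial / len(pairs)
```

Under this definition a qualitative assignment `{"Q"}` for a gold label `{"Q", "R"}` is an error for full accuracy but counts as correct for partial accuracy, as in the example given above.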
For the qualitative and relational classes, taking into account distributional information allows for an improvement over the default morphology–semantics mapping outlined in Section 4.5: Feature set all, containing all the features, achieves 75.5% accuracy for qualitative adjectives; feature set theor, with carefully defined features, achieves 86.4% for relational adjectives. In contrast, morphology seems to act as a ceiling for event-related adjectives: The best result, 89.1%, is obtained with morphological features using feature selection. As will be shown in Section 5.5, event-related adjectives do not exhibit a differentiated distributional profile from qualitative adjectives, which accounts for the failure of distributional features to capture this class. As could be expected, the best overall result is obtained with feature set all, that is, by taking all features into account: 62.5% full accuracy is a highly significant improvement over the baseline, 51.0%. The second best results are obtained with morphological features using feature selection (60.6%), due to the high performance of morphological information with event adjectives.
Also note that the POS feature sets, uni and bi, are not able to beat the baseline for full accuracy: Results are 42.8% and 46.1%, respectively, jumping to 52.9% and 52.3% when feature selection is used, still not enough to achieve a significant improvement over the baseline. Thus, for this task and this set-up, it is necessary to use well-motivated features. In this respect, it is also remarkable that feature selection actually decreased performance for the motivated distributional feature sets (func, sem, all; results not shown in the table), and only slightly improved over morph (59.9% to 60.6% accuracy). Carefully defined features are of high quality and therefore do not benefit from automatic feature selection. In fact, Witten and Frank (2011, page 308) state that “the best way to select relevant attributes is manually, based on a deep understanding of the learning problem and what the [features] actually mean.”
In the partial evaluation condition, however, all feature sets achieve a highly significant improvement over the baseline (p < 0.001). Therefore, the classifications obtained with any of the feature sets are more useful than the baseline, in the sense that they present more overlap with the gold standard.
5.4.2 Ensemble Classifiers
Table 11 shows the results of Attribute Bagging, compared to the best simple classifier and human agreement (observed agreement, in percentage). The results obtained with AdaBoost (a standard EC; default parameters) are also included as a sanity check. The best results with Attribute Bagging, reported in the table, were obtained using both feature selection and binarization (binarization did not improve results for the remaining classifiers in Tables 10 and 11).
Classifier | A: Per-class, Qualitative | A: Per-class, Event | A: Per-class, Relational | B: Overall, Full | B: Overall, Partial |
---|---|---|---|---|---|
best simple (all) | 75.5 ± 9.0 | 86.5 ± 6.4 | 86.0 ± 6.5 | 62.5 ± 2.5 | 87.6 ± 2.5 |
AdaBoost | 82.0* ± 8.6 | 85.6 ± 7.1 | 88.0 ± 6.7 | 66.0* ± 1.9 | 89.9* ± 1.3 |
Att. Bagg.FS, bin,i=5 | 77.0 ± 8.7 | 85.8 ± 7.1 | 89.0 ± 6.5 | 66.3* ± 1.1 | 87.0 ± 1.5 |
Att. Bagg.FS, bin,i=100 | 81.0 ± 8.8 | 86.1 ± 6.9 | 90.1* ± 5.3 | 69.1*** ± 1.0 | 89.0 ± 1.0 |
Human agreement | − | − | − | 68 | 85 |
The Attribute Bagging EC with i = 5 achieves comparable accuracy to AdaBoost with default parameters. Full accuracy results with the Attribute Bagging classifier with i = 100 (69.1%) are significantly higher than those of the best simple classifier (62.5%; p < 0.0001) and the AdaBoost classifier (p = 0.01; recall however that we did not optimize AdaBoost’s parameters). Ensemble classifiers are thus helpful for our task.
The best classifier in our experiments (Att. Bagg.FS, bin,i = 100) obtains 69.1% full and 89.0% partial accuracy. This is comparable to the agreement between the expert annotation of the gold standard and naive subjects participating in a large-scale annotation experiment (po = 0.68, or 68%, and κ = 0.55 for full accuracy, po = 0.85, or 85%, and κ = 0.72 for overlapping accuracy; see Boleda, Schulte im Walde, and Badia (2008) for details on the comparison). If we view human agreement as an upper bound, we have reached the maximum accuracy that could be obtained via machine learning for the present task. Further improvements will need to be preceded by an improvement in the agreement scores of human judges, that is, by a better definition of the classes and the classifying task.
Finally, Table 11 shows that the best results are obtained for the relational class (90.1%), followed by the event class (86.5%), and the qualitative class has the lowest scores (at most 82%). The qualitative class contains attribute-denoting adjectives, but in the present definition it is also populated with adjectives that simply do not fit into the other classes (such as intensional adjectives, as explained earlier). Also, whereas some adjectives in the class are prototypical qualitative adjectives such as gros ‘big’ or llarg ‘long’, others are unprototypical types of properties (subaltern ‘subordinate’, subsidiari ‘subsidiary’). This factor brings heterogeneity into the class, which justifies the relatively poor performance of the classifier on this task. Significantly, also, ECs do not improve upon simple classifiers for the event class; again, morphological information acts as a ceiling and no combination of information serves to go beyond that ceiling, as will become clear in the error analysis explained next.
5.5 Error Analysis
Table 12 depicts the contingency table of the classifications by the experts (rows) and one randomly chosen run of the Attribute Bagging classifier with i = 100 (columns). The table shows that there are two major sources of errors: First, the confusion between the qualitative and event classes, which is responsible for 14 errors (see dark-gray shaded cells in the table; also note that the related Q–QE and E–QE misclassifications account for another 14 errors). To compare, note that the confusion between the qualitative and relational classes only accounts for six of the errors, and there are no cases of confusion between event and relational adjectives.
Expert class | Q | E | R | QR | QE | ER | Total |
---|---|---|---|---|---|---|---|
Q | 90 | 2 | 0 | 107 | |||
E | 17 | 0 | 1 | 37 | |||
R | 4 | 0 | 20 | 0 | 30 | ||
QR | 0 | 13 | 0 | 1 | 23 | ||
QE | 0 | 0 | 5 | 0 | 7 | ||
ER | 0 | 0 | 1 | 0 | 3 | 6 | |
Total | 110 | 22 | 28 | 22 | 19 | 9 | 210 |
The second major source of errors is the overgeneration of polysemous adjectives (see medium-gray shaded cells): there are 26 adjectives tagged as monosemous by the experts and assigned a polysemous class by the system. To compare, the opposite case (i.e., tagging polysemous adjectives as monosemous) accounts for 13 errors only (see light-gray shaded cells). We next examine the two main types of errors in more detail.
5.5.1 Distinguishing between Qualitative and Event Adjectives
In fact, the class distribution varies with the suffix (see Table 13): Some types, such as -or and the participle, show a clear predominance of the event class (see dark-gray shaded cells); other types, such as -ble, -iu, or -nt, are more spread in their distribution (see light-gray shaded cells). Thus, the suffix seems to influence the resulting readings, with some active-like suffixes building a much more transparent relation to the event (creador ‘creating’, exportador ‘exporting’, recomanat ‘recommended’), and some passive-like or stative suffixes being more prone to creating a stative meaning (contingent ‘contingent’, formidable ‘formidable | terrific’, significatiu ‘significant’). The aspectual class of the deriving verb (Vendler 1957) also plays a role: For instance, although the meaning of abundant (‘abundant’) is related to that of the verb abundar (‘abound’), it clearly has a more stative (property-like) meaning than many of the other event adjectives, due to the fact that the deriving verb is stative. Correspondingly, abundant is classified as qualitative by the Attribute Bagging algorithm. This variation in the morphology–semantics interface is also mirrored in the feature value distributions, as will be shown subsequently.
Suffix | Q | E | R | QR | QE | ER | Total |
---|---|---|---|---|---|---|---|
-ble | 0 | 0 | 1 | 1 | 11 | ||
-iu | 1 | 1 | 2 | 0 | 11 | ||
-nt | 0 | 0 | 0 | 1 | 11 | ||
-or | 1 | 0 | 0 | 0 | 0 | 11 | |
participle | 2 | 0 | 0 | 5 | 0 | 15 | |
Total | 16 | 36 | 3 | 2 | 7 | 6 | 70 |
The remaining features do not show differences between event and qualitative adjectives, but rather properties of relational adjectives. In addition to the properties that were already known, the figure shows that relational adjectives appear more often in definite NPs acting as preposition complements (graphs E and H). Thus, the typical syntactic context for a relational adjective is preposition + definite determiner + noun + relational adjective. This type of adjective also appears slightly more often with feminine head nouns, which could be due to the fact that, in Catalan, many abstract nouns (física ‘physics’, capacitat ‘ability’) are feminine, for morphological reasons. These nouns are often modified by relational adjectives to select for subtypes of the class of objects denoted by the nouns (McNally and Boleda 2004).
Another difficulty in the distributional characterization of the event class is the fact that it is quite heterogeneous, due to the variation at the morphology–semantics interface discussed earlier. This can be traced in Figure 4 by the fact that for most of the features, the box of the event class is larger than the box of the other two classes, meaning that there is more variation within the event class than within the other two classes.
To sum up, morphological features can quite reliably spot event-related adjectives, but distributional information cannot. As a result, in the cases where morphology gives the wrong prediction, nothing can be done on the distributional side to remedy this. This results in the confusion of event-related and qualitative adjectives shown in Table 12.
5.5.2 Detecting Polysemous Adjectives
Table 14 shows that the distribution of polysemous items predicted by Equation (4) is more similar to the distribution obtained with the best machine learning classifier (ML) than to the distribution of polysemous items in the gold standard (GS) for the QE cases. The distribution is estimated from the frequency over the 210 adjectives in the gold standard, and shown as absolute numbers.
Polysemy type | Predicted | ML | GS |
---|---|---|---|
QR | 15 | 22 | 23 |
QE | 19 | 19 | 7 |
ER | 5 | 9 | 6 |
Both Equation (4) and the ML classifier assign 19 adjectives to the QE polysemy type, although the gold standard contains only 7 QE adjectives. The equation predicts fewer QR adjectives than observed in the data, but in this case the classifier produces a number of QR adjectives similar to that attested (22 vs. 23). Finally, the classifier produces more ER adjectives than observed and also than predicted by Equation (4), but in this case the numbers are so small that no clear tendencies can be observed. Thus, the procedure followed can be said to cause the overgeneration of items for the QE polysemy type, but it does not account for the other two polysemous classes.
Further qualitative analysis of the overgenerated polysemous adjectives (corresponding to the medium-gray cells in Table 12; not reported because of space concerns) showed that different types of evidence motivate the inclusion of monosemous adjectives in two classes, causing them to be considered polysemous. This suggests that, because polysemous adjectives exhibit only partial or limited evidence of each class, the threshold for positive assignment to a class is lowered, resulting in the observed overgeneration. Recall that at the beginning of this section, when introducing the model, we warned that it would be especially challenging to distinguish between noise and evidence for a given class. We have indeed found this to be a challenge. The mentioned effect is amplified by the procedure followed, which assumes that the class assignments are independent, and thus does not model the empirical distribution of polysemy adequately enough.
6. Discussion: Towards a Model for Regular Polysemy
classes (n monosemous classes plus n(n − 1)/2 polysemous classes, all the possible two-combinations of the monosemous classes). This formula assumes that only two-way regular polysemy is allowed, as in this article; polysemy across three or more classes would make the explosion of classes even worse. It is clear that the second model is easier to learn.
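Read literally, the parenthesis above counts one class per basic class plus one class per unordered pair of basic classes. Under that reading (an interpretation of the surviving text, with $n$ denoting the number of basic classes), the number of classes the first model has to learn is:

```latex
\[
  n + \binom{n}{2} \;=\; n + \frac{n(n-1)}{2},
  \qquad\text{e.g. } n = 3:\quad 3 + 3 = 6
  \quad (\mathrm{Q},\ \mathrm{E},\ \mathrm{R},\ \mathrm{QR},\ \mathrm{QE},\ \mathrm{ER}).
\]
```

The second model, in contrast, only requires one binary decision per basic class, that is, $n = 3$ decisions for the classification used in this article.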
The second difference concerns the way class assignments to polysemous words are carried out. In the first model, polysemous words are assigned to one single, independent class, whereas in the second they are assigned to each of the two basic classes that give rise to the regular polysemy. Recall that the motivation for the first model was that, given that regularly polysemous adjectives show a particular hybrid behavior, we could expect that polysemous adjectives could be characterized as differentiated classes. This expectation has clearly not been borne out. A further problem with the first model is that it in principle allows for a polysemous class AB whose properties do not necessarily have anything to do with those of the basic classes A and B. The second model, in contrast, enforces that polysemous adjectives exhibit properties of each of the classes they participate in, which is both theoretically and empirically more adequate. For these reasons, we believe that the second model is more suitable to represent regular polysemy than the first model.
The second model is also not completely satisfactory, however. As discussed in the previous section, in the current implementation of the model the class assignments are assumed to be independent (though this need not be the case in other instantiations of the model). Also, in a way, it is at the opposite end of the scale with respect to the first model: Whereas in the first model polysemous adjectives do not need to have anything in common with the basic classes, in the second model a polysemous word is assumed to be just like any other word in each of the basic classes. For instance, a qualitative-relational adjective is assumed to function both as a full-fledged qualitative adjective and a full-fledged relational adjective. By their very nature, polysemous words will show only some evidence for each of the classes, as their occurrences (and thus their properties) will be distributed across the two classes. Therefore, they will be untypical members of at least one of the intervening classes.
An alternative instantiation of the second model could use soft clustering (Pereira, Tishby, and Lee 1993; Rooth et al. 1999; Korhonen, Krymolowski, and Marx 2003), which assigns a probability to each of the classes and is thus not bound to a hard yes/no decision, as our approach does. From a theoretical point of view (and for many practical purposes such as dictionary construction), however, a distinction between monosemous and polysemous words is desirable, which adds a further parameter to be optimized in a soft clustering setting. Overlapping clustering (Banerjee et al. 2005), which allows for membership in multiple clusters, avoids this difficulty. Both methods have the advantage that they do not assume independence of the decisions. The most serious problem for the experiments presented in this article, however, would presumably also be a problem for these settings: The fact that the skewed sense distribution of many words makes it difficult to distinguish evidence for a particular class from noise. In the soft clustering setting, for instance, it would be hard to distinguish whether 10% evidence for class A and 90% for class B corresponds to polysemy with a skewed distribution, to noise in the data, or simply to an untypical instance.
To sum up, the main problem for the models presented in this article is that neither model can capture the distributional connection between P(AB) and P(A), either because AB and A are seen as unrelated atoms in the first place (first model), or because AB is diluted into A and B (second model). A more refined statistical approach that can model this interdependency is needed for further progress. Such a model should take into account both the differences of polysemous adjectives with respect to the other adjectives in the basic classes (first model) and their similarities (second model), thus directly capturing their hybrid behavior.
7. Conclusion
This article has tackled the automatic induction of semantic classes for Catalan adjectives, with a special emphasis on regular polysemy. To our knowledge, this is the first time that such an endeavor has been carried out, as (1) related work on lexical acquisition has focused on verbs (and, to a lesser extent, nouns) and on major languages such as English and German; and (2) polysemy in general has been largely ignored in lexical acquisition, and regular polysemy has only been sparsely addressed in empirical computational semantics.
We have explored the relationship between observable cues and semantic properties for adjectives, and, specifically, the morphology–semantics and syntax–semantics interfaces. We have shown that there is a systematic relation between the type of denotation of an adjective and its morphological and distributional properties. Our experiments have furthermore related the linguistic properties of adjectives as described in the literature to the information that can be extracted from linguistic resources, such as corpora or lexical databases. The presented results and analyses provide empirical support for the qualitative and relational classes, defined in theoretical work, and bring event-related adjectives into focus, a type of adjective that has been largely neglected in the literature.
This article has focused on Catalan as a case study, but most of the properties discussed (predicativity, gradability, complementation patterns), as well as the types of polysemy explored, are relevant for a broader range of languages, especially Indo-European languages (Dixon and Aikhenvald 2004). The approach does not require deep-processing resources (full parsing, semantic tagging, semantic role labeling), which makes it useful for lesser-researched languages.
The experiments show that a major bottleneck for our purposes is the definition of the classification itself: The machine learning results obtained have reached an upper bound, as the best classifier has achieved 69.1% accuracy (against a 51.0% baseline), and the human agreement is 68%. Thus, improvements in the computational task will need to be preceded by improvements in the agreement scores, that is, by a better and clearer definition of the classification and the classification task. We have shown that this is by no means a trivial issue. In fact, low inter-coder agreement scores are a problem for machine learning approaches to semantic and discourse-related phenomena in general. This is in contrast to tasks such as POS tagging or syntactic parsing, where relatively high inter-coder agreement scores are achieved. This state of affairs is probably due to the fact that semantic and pragmatic phenomena are much less well understood than morphological or syntactic phenomena.
Our experiments have highlighted a number of problems with the current classification proposal. First, the distinction between event-related and qualitative adjectives. The event class cannot be distinguished from the qualitative class with the distributional information used in this article, and its members are not homogeneous. We have shown that factors such as the aspectual class of the deriving verb or the suffix of the deverbal adjective play a role in the semantic and syntactic behavior of these adjectives that should be further explored. Also, a crucial type of evidence remains to be explored, namely, the selectional preferences of adjectives. These may be a relevant clue to the differences between qualitative and event-related adjectives. The second main problem is the fact that the qualitative class contains adjectives that do not fit into the other classes, constituting a sort of “catch-all” class. A natural extension for the work presented in this article would be to define a finer-grained categorization including the problematic cases discussed earlier. For instance, adjectives deriving from stative verbs could be distinguished from those deriving from active verbs, and different types of qualitative adjectives could be treated as different classes.
As for regular polysemy, we have shown that polysemous adjectives exhibit a hybrid behavior, with properties from all the classes involved in each type of regular polysemy. We have empirically tested two models of the phenomenon aimed at exploiting this hybrid behavior. The first model treats polysemous words in terms of independent classes, and we have argued that it is not adequate from either a theoretical or an empirical perspective. The second model assumes that polysemous words belong to each of the basic classes participating in the regular polysemy. This model is more adequate than the first one, as it accounts for the properties of the basic classes found in polysemous words, but it fails to account for the differences between polysemous and monosemous words. To improve on the modeling of regular polysemy, we plan to move to token-based (word-in-context) models (Schütze 1998; Erk and Padó 2010), as opposed to the type-based models used in this article. This should in turn shed light on the problem of distinguishing evidence for a particular class from noise, discussed previously.
Finally, at a methodological level, we have illustrated how the broad coverage, large-scale, radically empirical approaches developed in computational linguistics can be of use to uncover phenomena and facts that are relevant for the study of language, providing complementary evidence to the analytic tools traditionally used by linguists. Most prominently, we have shown that (1) by randomly sampling the set of words to be analyzed, new or neglected phenomena emerge; (2) the feature representation typically used by machine learning algorithms provides an empirical handle to the linguistic properties of words that can be explored in different ways (e.g., to test hypotheses about the morphology-syntax and semantics-syntax interfaces); (3) machine learning experiments provide a framework for the systematic evaluation of different models of the phenomenon under study (in our case, both adjective classification and regular polysemy). Computational linguistic studies are also inherently limited in several aspects, such as the type of evidence that can be used or the ways in which it can be used. Despite these limitations, we believe that empirical computational linguistics approaches are a gold mine of new knowledge about language.
Appendix: Gold Standard Data
In the following, we include the lemmata that were manually classified for the first and second set of experiments, respectively (Sections 4 and 5). For details on the classes and the methodology, see the body of the article. The translation of the adjectives has been carried out with the help of the Spanish–English/English–Spanish Collins Dictionary (3rd edition) and Google Translator. Different senses are separated with a vertical bar (‘|’), different translations of the same sense with a comma (‘,’). Whenever possible, we have included adjective equivalents; many of the relational adjectives, however, are equivalent to attributive uses of nouns. Such nominal translations have been marked with (attr.).
Recall that the gold standard for the second experiment, together with its feature values, is available at the ACL repository (see URL in footnote 6).
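The notation just described (vertical bar between senses, comma between translations of one sense) can be read mechanically. The following is a minimal sketch, assuming each class is given as one line of the form `class (LABEL): lemma ‘gloss’, lemma ‘gloss’, ….`; the function name `parse_entries` and the regular expression are illustrative assumptions, not part of the released gold standard.

```python
import re

# A lemma is any run of characters without quotes or commas, followed by a
# gloss enclosed in the typographic quotes used in the appendix.
ENTRY = re.compile(r"(?P<lemma>[^‘’,]+?)\s*‘(?P<gloss>[^’]*)’")


def parse_entries(line):
    """Return (class label, {lemma: [[translation, ...], ...]}) for one class line.

    The outer list holds the senses (split on '|'); each inner list holds the
    alternative translations of that sense (split on ',').
    """
    label, _, body = line.partition(":")
    entries = {}
    for match in ENTRY.finditer(body):
        senses = [
            [t.strip() for t in sense.split(",")]
            for sense in match.group("gloss").split("|")
        ]
        entries[match.group("lemma").strip()] = senses
    return label.strip(), entries


if __name__ == "__main__":
    sample = ("qualitative-relational (QR): celest ‘celestial | sky blue’, "
              "viril ‘man (attr.) | virile, manly’.")
    label, entries = parse_entries(sample)
    print(label)
    for lemma, senses in entries.items():
        print(lemma, senses)
```

Nominal translations remain recognizable through the `(attr.)` suffix, which the sketch leaves untouched inside the translation string.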
Gold standard for the experiments with the first model (Section 4)
intensional (I): mer ‘mere’, presumpte ‘alleged’.
qualitative (Q): accidental ‘accidental’, accidentat ‘uneven, rough | injured’, alienant ‘alienating’, anticlerical ‘anticlerical’, avergonyit ‘ashamed’, bastard ‘bastard’, benigne ‘benign’, caracurt ‘short-faced’, coherent ‘coherent’, colpidor ‘striking’, contradictori ‘contradictory’, cosmopolita ‘cosmopolitan’, destructor ‘destructive’, diversificador ‘diversifying’, duratiu ‘durative’, escàpol ‘fleeing’, esfereïdor ‘terrifying’, evident ‘evident’, exempt ‘exempt’, expeditiu ‘expeditious’, fortuït ‘fortuitous’, gradual ‘gradual’, grandiós ‘grand’, gratuït ‘free | gratuitous’, honest ‘honest’, implacable ‘implacable’, infreqüent ‘infrequent’, innoble ‘ignoble’, inquiet ‘anxious | restless’, insalvable ‘insuperable’, inservible ‘useless’, invers ‘inverse’, irreductible ‘unyielding’, laberíntic ‘labyrinthine’, llaminer ‘sweet-toothed | appetising’, malalt ‘ill’, morat ‘purple’, negatiu ‘negative’, nombrós ‘numerous’, penós ‘distressing’, preeminent ‘pre-eminent’, preponderant ‘preponderant’, raonable ‘reasonable’, real ‘real’, representatiu ‘representative’, sobrenatural ‘supernatural’, subsidiari ‘subsidiary’, supraracional ‘supra-rational’, trivial ‘trivial’, uniforme ‘uniform’, usual ‘usual’, utòpic ‘Utopian’, vitalista ‘vitalist(ic)’.
relational (R): adquisitiu ‘acquisitive’, alfabètic ‘alphabetical’, carbònic ‘carbonic’, cervical ‘neck (attr.), cervical’, climatològic ‘climatologic’, col·laborador ‘collaborating’, curatiu ‘curative’, diofàntic ‘Diophantine’, formatiu ‘formative’, freudià ‘Freudian’, governatiu ‘governmental’, indicador ‘indicating’, onomàstic ‘name (attr.), onomastic’, parlant ‘talking’, penitenciari ‘penitentiary, prison (attr.)’, periglacial ‘periglacial’, pesquer ‘fishing’, petri ‘stony’, preescolar ‘preschool (attr.)’, protector ‘protecting’, salvador ‘rescuing’, sociocultural ‘sociocultural’, sud-africà ‘South African’, tàctil ‘tactile’, terciari ‘tertiary’, terminològic ‘terminological’, topogràfic ‘topographic(al)’, toràcic ‘thoracic’, vaginal ‘vaginal’, valencianoparlant ‘Valencian-speaking’, ventral ‘ventral’, veterinari ‘veterinary’, vocàlic ‘vocalic, vowel (attr.)’, xinès ‘Chinese’.
intensional-qualitative (IQ): antic ‘ancient | former’.
qualitative-relational (QR): alemany ‘German’, celest ‘celestial | sky blue’, contaminant ‘pollutant’, cultural ‘cultural’, femení ‘female (attr.) | feminine’, irònic ‘irony (attr.) | ironic’, menorquí ‘Menorcan’, militar ‘war (attr.) | military’, sonor ‘sound (attr.) | sonorous’, triomfal ‘triumphal | triumphant’, viril ‘man (attr.) | virile, manly’.
Gold standard for the experiments with the second model (Section 5)
qualitative (Q): absort ‘absorbed’, aleatori ‘random’, altiu ‘haughty’, ample ‘wide’, animal ‘animal’, anòmal ‘anomalous’, baix ‘low’, benigne ‘benign’, bord ‘infertile (plant) | stroppy (person)’, caduc ‘deciduous’, calb ‘bald’, capaç ‘able’, cardinal ‘cardinal’, caut ‘cautious’, cèlebre ‘famous’, concret ‘concrete’, conservador ‘conservative’, contingent ‘contingent’, cru ‘raw | crude’, curull ‘full’, decisiu ‘decisive’, deficient ‘deficient, defective’, deliciós ‘delicious’, desproporcionat ‘disproportionate’, dificultós ‘difficult’, esquerre ‘left’, excels ‘sublime’, exquisit ‘exquisite’, fluix ‘weak | loose’, foll ‘crazy’, formidable ‘formidable | terrific’, franc ‘frank’, fresc ‘fresh’, gros ‘big’, gruixut ‘thick’, humil ‘humble’, igual ‘equal, alike’, imperfecte ‘imperfect’, impropi ‘improper’, incomplet ‘incomplete’, inhumà ‘inhuman’, insuficient ‘insufficient’, integral ‘integral | wholegrain’, íntegre ‘entire’, intel·ligent ‘intelligent’, intern ‘internal’, líquid ‘liquid’, llarg ‘long’, llis ‘smooth’, mal ‘bad’, màxim ‘maximum’, menor ‘minor | smaller | younger’, mínim ‘minimum’, moll ‘wet’, morat ‘purple’, mutu ‘mutual’, notori ‘notorious’, ocult ‘hidden’, opac ‘opaque’, paradoxal ‘paradoxical’, peculiar ‘peculiar’, perillós ‘dangerous’, pertinent ‘pertinent’, pessimista ‘pessimistic’, plàcid ‘placid’, precoç ‘precocious’, predilecte ‘favorite’, primari ‘primary’, primitiu ‘primitive’, propens ‘prone’, pròsper ‘prosperous’, prudent ‘prudent’, punxegut ‘sharp-pointed’, quadrat ‘square’, reaccionari ‘reactionary’, recent ‘recent’, recíproc ‘reciprocal’, remarcable ‘remarkable’, responsable ‘responsible’, rígid ‘rigid’, roent ‘burning’, sant ‘saint’, semicircular ‘semicircular’, seriós ‘serious’, significatiu ‘significant’, silenciós ‘silent’, similar ‘similar’, simplista ‘simplistic’, subaltern ‘subordinate’, sublim ‘sublime’, subsidiari ‘subsidiary’, subterrani ‘underground’, superflu ‘superfluous’, tenaç ‘tenacious’, terrible ‘terrible’, típic ‘typical’, titular ‘titular, official’, tort ‘bent’, total ‘total’, tou ‘soft’, triangular ‘triangular’, vague ‘vague’, ver ‘true’, viciós ‘vicious’, vigorós ‘vigorous’, viril ‘virile’, vulgar ‘vulgar’.
event-related (E): abundant ‘abundant’, abundós ‘plentiful’, acompanyat ‘accompanied’, admirable ‘admirable’, contradictori ‘contradictory’, convincent ‘convincing’, creador ‘creative’, divergent ‘divergent’, encarregat ‘in charge’, exigent ‘demanding’, exportador ‘exporting’, immutable ‘immutable’, imperceptible ‘imperceptible’, informatiu ‘informative’, irat ‘angry’, matiner ‘who gets up early’, motor ‘motor’, oblidat ‘forgotten’, orientat ‘oriented’, picat ‘pricked | minced | offended’, preferible ‘preferable’, productor ‘producing’, promès ‘promised’, protector ‘protecting, protective’, receptor ‘receiving’, recomanat ‘recommended’, regulador ‘regulating’, resultant ‘resulting’, revelador ‘revealing’, salvador ‘savior’, satisfactori ‘satisfactory’, sospitós ‘suspicious | suspect’, temible ‘fearsome’, treballador ‘working’, variable ‘variable’, victoriós ‘victorious’, vivent ‘living’.
relational (R): americà ‘American’, angular ‘angular’, atòmic ‘atomic’, barceloní ‘Barcelonian’, calcari ‘calcareous’, causal ‘causal’, ciutadà ‘city (attr.)’, conflictiu ‘conflict (attr.)’, corporatiu ‘corporate’, cranià ‘skull (attr.)’, diari ‘daily’, elèctric ‘electric(al)’, epistemològic ‘epistemological’, escènic ‘scenic’, estacional ‘seasonal’, fangós ‘muddy’, imperial ‘imperial’, lleidatà ‘Leridan’, manresà ‘Manresan’, marxià ‘Marx (attr.)’, melòdic ‘melodic’, mercantil ‘mercantile’, obrer ‘working-class, labour (attr.)’, ontològic ‘ontological’, pasqual ‘paschal’, peninsular ‘peninsular’, renaixentista ‘Renaissance (attr.)’, respiratori ‘respiratory’, terrestre ‘terrestrial’, viari ‘road (attr.)’.
event-qualitative (EQ): animat ‘animate | lively’, cridaner ‘who usually shouts | loud-colored’, embolicat ‘wrapped up | messy’, encantat ‘charmed | happy’, obert ‘opened | open’, raonable ‘that can be reasoned on | reasonable, fair’, sabut ‘known | wise’.
event-relational (ER): comptable ‘countable | account (attr.)’, cooperatiu ‘cooperative | cooperative (attr.)’, digestiu ‘digestive | digestion (attr.)’, docent ‘teaching | educational’, nutritiu ‘nutritive | nutritional’, vegetatiu ‘vegetative | vegetation (attr.)’.
qualitative-relational (QR): alegre ‘cheerful’, amorós ‘lovely | love (attr.)’, anarquista ‘anarchistic | anarchist’, capitalista ‘capitalistic | capitalist’, catalanista ‘Catalanistic | Catalanist’, comunista ‘communistic | communist’, diürn ‘diurnal, day (attr.)’, eròtic ‘erotic | love (attr.)’, familiar ‘familiar | family (attr.)’, feminista ‘feminist | feminism (attr.)’, humà ‘humane | human’, infantil ‘childish | child (attr.)’, intuïtiu ‘intuitive | intuition (attr.)’, local ‘local | place (attr.)’, nocturn ‘nocturnal, night (attr.)’, poètic ‘poetic, idealized | poetry (attr.)’, professional ‘(worker) who works well | professional, job (attr.)’, revolucionari ‘revolutionary | revolution (attr.)’, sensitiu ‘sensitive | sensation (attr.)’, socialista ‘socialistic | socialist’, turístic ‘touristy | tourist (attr.)’, unitari ‘unitary | union (attr.)’, utilitari ‘utilitarian | utility (attr.)’.
Acknowledgements
The authors wish to thank Àngel Gil, Laia Mayol, Martí Quixal, and Roser Sanromà for participating in the annotation of the gold standards; David Farwell, Louise McNally, Sebastian Padó, and Martí Quixal for comments and discussion on previous versions of this article; Josep Maria Boleda, Montse Cuadros, and Edgar Gonzàlez for technical help; and the anonymous reviewers for their constructive criticism, which has greatly helped improve the article. This work has been supported via Ph.D. grants to the first author by the Generalitat de Catalunya (2001FI 00582), the Fundación Caja Madrid, and the Universitat Pompeu Fabra; also by the Ministry of Education and the Ministry of Science and Technology of Spain under contracts FFI2010-09464-E (REDISIM), FFI2010-15006 (OntoSem 2), TIN2009-14715-C04-04 (KNOW2), and JCI2007-57-1479; and by the European Union via the EU PASCAL2 Network of Excellence (FP7-ICT-216886). The second author was funded by the DFG Collaborative Research Center 732.
Notes
Tested by one-way ANOVA tests on each of the features (factor: Classes), excluding items in the I and IQ classes because not enough observations are available. The test yields p-values lower than 0.05 (predicative), 0.01 (comparable, pre-nominal, adjacent), and 0.001 (gradable, copular), respectively.
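As an illustration of the test named in this note, the following minimal sketch computes the one-way ANOVA F statistic for a single feature, with the classes as the factor. The feature values are invented toy numbers, and the function name is hypothetical; obtaining a p-value would additionally require the F distribution (for instance from a statistics package), which is left out here.

```python
def one_way_anova_f(groups):
    """Return the F statistic for a list of groups (one list of values per class)."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Between-group sum of squares: how far each class mean is from the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of the values around their class mean.
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


if __name__ == "__main__":
    # Toy values of one feature (e.g., a "gradable" score) for three classes.
    toy = {"Q": [0.30, 0.25, 0.35], "R": [0.05, 0.02, 0.08], "QR": [0.15, 0.20, 0.10]}
    print(one_way_anova_f(list(toy.values())))
```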
Descriptive features are defined here as those that are among the three features with highest or lowest mean values for at least three clusters in the five-way solution.
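This definition can be stated operationally. The sketch below assumes a mapping from each cluster to the mean value of every feature in that cluster, and counts a cluster once for a feature that falls among either its highest or its lowest means; the function name, the parameter names, and the reading of "highest or lowest" as a single combined count are assumptions.

```python
from collections import Counter


def descriptive_features(cluster_means, n_extreme=3, min_clusters=3):
    """Return features ranked among the n_extreme highest or lowest means
    in at least min_clusters clusters.

    cluster_means: {cluster_id: {feature_name: mean_value}}
    """
    counts = Counter()
    for means in cluster_means.values():
        ranked = sorted(means, key=means.get)  # ascending by mean value
        extremes = set(ranked[:n_extreme]) | set(ranked[-n_extreme:])
        counts.update(extremes)  # each cluster counts at most once per feature
    return sorted(f for f, c in counts.items() if c >= min_clusters)


if __name__ == "__main__":
    # Toy three-cluster example with one extreme per side and a threshold of two.
    toy = {
        1: {"a": 3, "b": 0, "c": 1, "d": -2},
        2: {"a": 2, "b": -1, "c": 0, "d": 1},
        3: {"a": 0, "b": 1, "c": 2, "d": -3},
    }
    print(descriptive_features(toy, n_extreme=1, min_clusters=2))
```

With the default arguments the function corresponds to the five-way setting described in the note, provided the input holds five clusters.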
For details on the annotation experiment, see Boleda, Schulte im Walde, and Badia (2008). The experiment yielded low inter-coder agreement scores (estimated κ 0.31–0.45, observed agreement 0.62–0.70). Note that the consensus classification is sub-optimal in the sense that its replicability cannot be estimated.
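To make the relation between observed agreement and κ concrete, here is a minimal sketch of Cohen's kappa for two coders. This is an assumption made for illustration: the annotation experiment involved its own estimation procedure, described in the cited paper, and the labels below are invented.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two coders labelling the same items."""
    n = len(labels_a)
    # Proportion of items on which both coders chose the same class.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each coder's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    coder_1 = ["Q", "Q", "R", "R"]  # toy annotations
    coder_2 = ["Q", "R", "R", "R"]
    print(cohen_kappa(coder_1, coder_2))
```

The toy example shows how an observed agreement of 0.75 shrinks once chance agreement is discounted, which is the same effect visible in the gap between the two ranges reported in the note.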
For a more detailed explanation of the information encoded in feature sets uni and bi, see Boleda (2007, Section 5.2.2).
The experiments discussed in this section were carried out with the Weka software package (Witten and Frank 2011), version 3.6. The Decision Tree algorithm used is J48, the latest open source version of C4.5 (Quinlan 1993), with default parameters (binary splits = False, confidence factor for pruning = 0.25, minimum number of instances per leaf = 2, reduced-error pruning = False, subtree raising = True, unpruned = False, use Laplace = False). AdaBoost has also been used with default parameters (base classifier = Decision Stump, number of iterations = 10, random seed = 1, use resampling instead of reweighting = False, weight threshold = 100). For Attribute Bagging, we used the Random Subspace algorithm, with J48 as base classifier (parameters as before), bag size = 1/3, and random seed = 1. We experimented with different values for the number of iterations (see Section 5.4.2).
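The idea behind Attribute Bagging with the Random Subspace method can be sketched in a few lines: each iteration trains a base classifier on a random third of the features, and the ensemble combines the individual predictions. The sketch below is not the Weka implementation. It swaps in a nearest-centroid base learner instead of J48 and combines predictions by majority vote, both of which are simplifying assumptions, and all function names are hypothetical.

```python
import random
from collections import Counter


def train_centroids(X, y, feats):
    """Nearest-centroid base learner restricted to the feature indices in feats."""
    sums, counts = {}, Counter(y)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(feats))
        for j, f in enumerate(feats):
            acc[j] += row[f]
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict_centroid(centroids, feats, row):
    """Return the label whose centroid is closest on the selected features."""
    def dist(c):
        return sum((row[f] - c[j]) ** 2 for j, f in enumerate(feats))
    return min(sorted(centroids), key=lambda label: dist(centroids[label]))


def random_subspace_fit(X, y, n_iter=10, frac=1 / 3, seed=1):
    """Train n_iter base learners, each on a random fraction of the features."""
    rng = random.Random(seed)
    n_feats = len(X[0])
    k = max(1, round(frac * n_feats))
    models = []
    for _ in range(n_iter):
        feats = sorted(rng.sample(range(n_feats), k))
        models.append((feats, train_centroids(X, y, feats)))
    return models


def random_subspace_predict(models, row):
    """Combine the base learners by majority vote (an assumption of this sketch)."""
    votes = Counter(predict_centroid(c, f, row) for f, c in models)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    X = [[0.0] * 6, [0.1] * 6, [1.0] * 6, [0.9] * 6]  # toy feature vectors
    y = ["A", "A", "B", "B"]
    models = random_subspace_fit(X, y)
    print(random_subspace_predict(models, [0.05] * 6))
```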
Note that the corrected resampled t-test can only compare accuracies obtained under two conditions (algorithms or, as in our case, feature sets); ANOVA would be more adequate. In the field of machine learning, there is no established correction for ANOVA for the purposes of testing differences in accuracy (Bouckaert 2004). Therefore, we use multiple t-tests instead, which increases the overall error probability of the results for the significance tests.
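For reference, a minimal sketch of the corrected resampled t statistic follows. It assumes the commonly used variance correction, in which the usual 1/k term is inflated by the ratio of test to training set size; whether this matches every detail of the variant used in the experiments is an assumption, and the accuracy differences below are invented.

```python
from statistics import mean, variance


def corrected_resampled_t(diffs, n_train, n_test):
    """t statistic for paired accuracy differences from repeated resampling.

    diffs:   per-fold accuracy differences between the two conditions
    n_train: number of training items in each fold
    n_test:  number of test items in each fold
    The statistic is compared against a t distribution with len(diffs) - 1
    degrees of freedom.
    """
    k = len(diffs)
    # variance() is the sample variance (k - 1 in the denominator).
    corrected_var = (1 / k + n_test / n_train) * variance(diffs)
    return mean(diffs) / corrected_var ** 0.5


if __name__ == "__main__":
    toy_diffs = [0.02, 0.04, 0.03, 0.01, 0.05]  # toy accuracy differences
    print(corrected_resampled_t(toy_diffs, n_train=90, n_test=10))
```

The correction makes the test more conservative than a plain paired t-test, because the resampled training sets overlap and the individual estimates are therefore not independent.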
Note that, for each adjective, only 10 different full classification proposals are obtained in each feature set, because each adjective is only used once per run for testing. Therefore, while the per-class accuracy for each feature set is assessed from 100 estimates (obtained via 10×10 cross-validation), the accuracy of the different feature sets for full classification is assessed by comparing 10 accuracies. This holds for Tables 10 and 11.
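The counts in this note follow directly from the resampling scheme, as the following minimal sketch shows. It assumes a plain (unstratified) split and a hypothetical data set size; the function name is illustrative.

```python
import random
from collections import Counter


def repeated_kfold(n_items, n_runs=10, n_folds=10, seed=1):
    """Yield (run, fold, test_indices) for n_runs repetitions of n_folds-fold CV."""
    rng = random.Random(seed)
    for run in range(n_runs):
        order = list(range(n_items))
        rng.shuffle(order)  # a new random partition in every run
        for fold in range(n_folds):
            # Every n_folds-th item goes to this fold's test set.
            yield run, fold, order[fold::n_folds]


if __name__ == "__main__":
    splits = list(repeated_kfold(n_items=50))  # 50 items is a toy size
    times_tested = Counter(i for _, _, test in splits for i in test)
    print(len(splits))                 # number of train/test splits
    print(set(times_tested.values()))  # how often each item is tested
```

Each split yields one accuracy estimate, whereas each item receives exactly one prediction per run, which is why the per-item full classification can only be compared across runs.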
Grouping subsets according to linguistic considerations (i.e., building an EC over the feature subsets listed in Table 8) improved upon the best simple classifier, but not upon Attribute Bagging.
References
Author notes
Department of Translation and Language Sciences, Universitat Pompeu Fabra, Roc Boronat 138, 08018 Barcelona, Spain.