Decomposing Generalization: Models of Generic, Habitual, and Episodic Statements

Abstract

We present a novel semantic framework for modeling linguistic expressions of generalization—generic, habitual, and episodic statements—as combinations of simple, real-valued referential properties of predicates and their arguments. We use this framework to construct a dataset covering the entirety of the Universal Dependencies English Web Treebank. We use this dataset to probe the efficacy of type-level and token-level information—including hand-engineered features and static (GloVe) and contextual (ELMo) word embeddings—for predicting expressions of generalization.


Introduction
Natural language allows us to convey not only information about particular individuals and events, as in (1), but also generalizations about those individuals and events, as in (2).
(1) a. Mary ate oatmeal for breakfast today.
b. The students completed their assignments.
(2) a. Mary eats oatmeal for breakfast. b. The students always complete their assignments on time.
This capacity for expressing generalization is extremely flexible, allowing for generalizations about the kinds of events that particular individuals are habitually involved in, as in (2), as well as characterizations of kinds of things, as in (3).
(3) a. Bishops move diagonally. b. Soap is used to remove dirt.
Such distinctions between episodic statements (1), on the one hand, and habitual (2) and generic (or characterizing) statements (3), on the other, have a long history in both the linguistics and artificial intelligence literatures. Nevertheless, few modern semantic parsers make a systematic distinction (though see Abzianidze and Bos 2017). This is problematic, because the ability to accurately capture different modes of generalization is likely key to building systems with robust common sense reasoning (Zhang et al., 2017a; Bauer et al., 2018) - a central component of general artificial intelligence (McCarthy, 1960, 1980, 1986; Minsky, 1974; Schank and Abelson, 1975; Hobbs et al., 1987; Reiter, 1987). It is also surprising, since there is no dearth of data relevant to generalization (Doddington et al., 2004; Cybulska and Vossen, 2014b; Friedrich et al., 2015).
One obstacle to further progress on generalization is that current frameworks tend to take standard descriptive categories as sharp classes - e.g. EPISODIC, GENERIC, HABITUAL for statements and KIND, INDIVIDUAL for noun phrases. This may seem reasonable for sentences like (1a), where Mary clearly refers to a particular individual, or (3a), where Bishops clearly refers to a kind; but natural text is less forgiving (Grimm, 2014, 2016, 2018). Consider the underlined arguments in (4): do they refer to kinds or individuals?
(4) a. I will manage client expectations.
b. The atmosphere may not be for everyone.
c. Thanks again for great customer service!
To remedy this, we propose a novel framework for capturing linguistic expressions of generalization. Taking inspiration from decompositional semantics (Reisinger et al., 2015; White et al., 2016), we suggest that linguistic expressions of generalization should be captured in a continuous multi-label system, rather than a multi-class system. We do this by decomposing categories such as EPISODIC, HABITUAL, and GENERIC into simple referential properties of predicates and their arguments. Using this framework (§3), we develop an annotation protocol, which we deploy (§4) to construct a new large-scale dataset of annotations covering the entire Universal Dependencies (Nivre et al., 2015) English Web Treebank (Bies et al., 2012) - the Universal Decompositional Semantics Genericity (UDS-G) dataset (available at decomp.io).
Through exploratory analysis of this dataset, we demonstrate that this multi-label framework is well-motivated (§5). We then present models for predicting expressions of linguistic generalization that combine hand-engineered type- and token-level features with static and contextual learned representations (§6). We find that (i) referential properties of arguments are easier to predict than those of predicates; and that (ii) contextual learned representations contain most of the relevant information for both arguments and predicates (§7).

Background
Most existing annotation frameworks aim to capture expressions of linguistic generalization using multi-class annotation schemes. We argue that this reliance on multi-class annotation schemes is problematic on the basis of descriptive and theoretical work in the linguistics literature.
One of the earliest frameworks explicitly aimed at capturing expressions of linguistic generalization was developed under the ACE-2 program (Mitchell et al., 2003; Doddington et al., 2004; see also Reiter and Frank 2010). This framework associates entity mentions with discrete labels for whether they refer to a specific member of the set in question (SPECIFIC) or any member of the set in question (GENERIC), with no formal definitions for kind- or particular-referring expressions.
The ACE-2005 Multilingual Training Corpus (Walker et al., 2006) adds data from broadcast conversations, weblogs, and Usenet forums to the 40,106 noun phrases (NPs) from 520 newswire and broadcast documents annotated under ACE-2 and, importantly, makes changes to the genericity annotation guidelines, providing two additional classes: (i) negatively quantified entries (NEG) for referring to empty sets and (ii) underspecified entries (USP) where the referent is ambiguous between GENERIC and SPECIFIC. The existence of the USP label already portends an issue with multi-class annotation schemes, which have no way of capturing the well-known phenomena of taxonomic reference (see Carlson and Pelletier, 1995, and references therein), abstract/event reference (Grimm, 2014, 2016, 2018), and weak definites (Carlson and Sussman, 2005). For example, wines in (5) refers to particular kinds of wine; service in (6) refers to an abstract entity/event that could be construed as both particular-referring, in that it is the service at a specific restaurant, and kind-referring, in that it encompasses all service events at that restaurant; and bus in (7) refers to potentially multiple distinct buses that are grouped into a kind by the fact that they drive a particular line.
(5) That vintner makes three different wines. (6) The service at that restaurant is excellent. (7) That bureaucrat takes the 90 bus to work.
A similar inflexibility is inherited by later schemes, such as ARRAU (Poesio et al., 2008; see also Mathew 2009; Louis and Nenkova 2011), which is mainly intended to capture anaphora resolution but which also annotates NPs for a binary GENERIC attribute following the GNOME guidelines (Poesio, 2004). This is remedied to some extent in ECB+ (Cybulska and Vossen, 2014b,a), an extension of the EventCorefBank (ECB; Bejan and Harabagiu, 2010; Lee et al., 2012), which annotates Google News texts for event coreference in accordance with the TimeML specification (Pustejovsky et al., 2003). ECB+ is an improvement in the sense that event and entity mentions may be labeled with a GENERIC class.
The ECB+ approach is useful, since episodic, habitual, and generic statements can straightforwardly be described using combinations of event and entity mention labels. For example, episodic statements will involve only non-generic entity and event mentions; habitual statements will involve a generic event mention and at least one non-generic entity mention; and generic statements will involve only generic event and entity mentions. This demonstrates the strength of decomposing statements into properties of the events and entities they describe; but there remain difficult issues arising from the fact that the decomposition does not go far enough. One is that, like ACE-2/2005 and ARRAU, ECB+ does not make it possible to capture taxonomic and abstract reference or weak definites; another is that, because ECB+ treats generics as mutually exclusive from other event classes, it is not possible to capture that events and states in those classes can themselves be particular or generic. This is well-known for different classes of events, such as those determined by a predicate's lexical aspect (Vendler, 1957); but it is likely also important for distinguishing more particular stage-level properties, e.g. availability (8), from more generic individual-level properties, e.g. strength (9) (Carlson, 1977a).
(8) Those firemen are available.
(9) Those firemen are strong.
This situation is improved upon in the Richer Event Descriptions (RED; O'Gorman et al., 2016) and Situation Entities (SitEnt; Friedrich and Palmer, 2014a,b; Friedrich et al., 2015; Friedrich and Pinkal, 2015a,b; Friedrich et al., 2016) frameworks, which annotate both NPs and entire clauses for genericity. In particular, SitEnt, which is used to annotate MASC (Ide et al., 2010) and Wikipedia, has the nice property that it recognizes the existence of abstract entities and the lexical aspectual class of clauses' main verbs, along with habituality and genericity. This is useful because, in addition to decomposing statements using the genericity of the main referent and event, this framework recognizes that lexical aspect is an independent phenomenon. In practice, however, the annotations produced by this framework are mapped into a multi-class scheme containing only the high-level GENERIC-HABITUAL-EPISODIC distinction, alongside a conceptually independent distinction among illocutionary acts.
A potential argument in favor of mapping into a multi-class scheme is that, if it is sufficiently elaborated, the relevant decomposition may be recoverable. But regardless of such an elaboration, uncertainty about which class any particular entity or event falls into cannot be ignored. Some examples may just not have categorically correct answers; and even if they do, annotator uncertainty and bias may obscure them. To account for this, we develop a novel annotation framework that both (i) explicitly captures annotator confidence about the different referential properties discussed above and (ii) automatically corrects for annotator bias using standard psycholinguistic methods.

Annotation Framework
We divide our framework into two protocols - the argument and predicate protocols - that probe properties of individuals and situations (i.e. events or states) referred to in a clause. A crucial aspect of our framework is that (i) multiple properties can be simultaneously true for a particular individual or situation; and (ii) we explicitly collect confidence ratings for each property. This makes our framework highly extensible, since further properties can be added without breaking a strict multi-class ontology.

Figure 1: Examples of the argument protocol (top) and predicate protocol (bottom) for the sentence I will manage client expectations accordingly.
We focus on properties that lie along three main axes: whether a predicate or its arguments refer to (i) something instantiated or spatiotemporally delimited - i.e. particular situations or individuals; (ii) classes - i.e. hypothetical situations or kinds of individuals; and/or (iii) something intangible - i.e. abstract or stative situations or individuals. Figure 1 shows examples of the argument protocol (top) and predicate protocol (bottom), whose implementation is based on the event factuality annotation protocol described by White et al. (2016) and Rudinger et al. (2018). Annotators are presented with a sentence with one or more words highlighted, followed by statements pertaining to the highlighted words in the context of the sentence. They are then asked to fill in each statement with a binary response saying whether it does or does not hold and to give their confidence on a 5-point scale: not at all confident (1), not very confident (2), somewhat confident (3), very confident (4), and totally confident (5). The task instructions, along with the protocol implementation, are available at decomp.io.

Data Collection
We use our annotation framework to collect annotations of predicates and arguments in the Universal Dependencies English Web Treebank (UD-EWT; Bies et al., 2012), thus yielding the Universal Decompositional Semantics Genericity (UDS-G) dataset. UD-EWT has three main advantages over other similar corpora: (i) it contains text from multiple genres, not just newswire; (ii) it contains gold standard Universal Dependency parses; and (iii) there are now a wide variety of other semantic annotations using the same predicate-argument extraction standard (White et al., 2016; Zhang et al., 2017b; Rudinger et al., 2018). Table 1 compares our dataset against other large annotated resources for generalization. Our data collection procedure had four stages: (i) predicate-argument extraction; (ii) predicate-argument filtering; (iii) bulk annotation; and (iv) rating normalization.
Predicate-argument extraction We extract predicates and arguments using PredPatt (White et al., 2016; Zhang et al., 2017b), which identified 34,025 predicates and 56,246 arguments of those predicates from 16,622 sentences. The parameters used for extraction are shipped with the code.
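As a rough illustration, this extraction step can be reproduced with the publicly released PredPatt package. The sketch below follows the interface in the PredPatt README as we recall it (load_conllu, PredPatt, and per-predicate argument lists); exact names and options may differ across versions, and the CoNLL-U path is only an example.

```python
# Sketch of predicate-argument extraction with PredPatt; interface names follow
# the PredPatt README and may differ by version. The file path is an example.
from predpatt import PredPatt, load_conllu

n_predicates, n_arguments = 0, 0
for sent_id, ud_parse in load_conllu("en_ewt-ud-train.conllu"):
    ppatt = PredPatt(ud_parse)
    for predicate in ppatt.instances:        # extracted predicates
        n_predicates += 1
        n_arguments += len(predicate.arguments)

print(n_predicates, n_arguments)
```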
Predicate and argument filtering Based on analysis of pilot data, we developed a set of heuristics for filtering certain tokens that PredPatt identifies as predicates and arguments, either because we found that there was little variability in the label assigned to particular subsets of tokens - e.g. pronominal arguments, such as I, we, he, she, etc., are almost always labeled particular, non-kind, and non-abstract (with the exception of you and they, which can be kind-referring) - or because it is not generally possible to answer questions about those tokens - e.g. adverbial predicates are excluded. A full specification of these filtering heuristics is shipped with the data. Based on these filtering heuristics, we retain 37,146 arguments and 33,114 predicates for annotation.
Bulk annotation 482 annotators were recruited from Amazon Mechanical Turk to annotate arguments, and 438 annotators were recruited to annotate predicates. Arguments and predicates in the UD-EWT validation and test sets were annotated by three annotators each; those in the UD-EWT train set were annotated by one annotator each.

Annotation normalization The need to adjust for biases introduced by different annotators has long been recognized in the psycholinguistics literature (Baayen, 2008) and is often addressed using mixed effects models (Gelman and Hill, 2014) and/or rating normalization procedures, such as z-scoring or ridit scoring (Agresti, 2003). We employ such procedures with the aim of producing a single real-valued score for each property that accounts for annotator confidence while adjusting for annotator bias.
Confidence normalization Different annotators use the confidence scale in different ways - e.g. some annotators use all five options while others only ever respond with totally confident (5). To adjust for these differences, we normalize the confidence ratings for each property using a standard ordinal scale normalization technique known as ridit scoring, in which ordinal labels are mapped to (0, 1) using the empirical cumulative distribution function of the ratings given by each annotator. Specifically, for a response y given by annotator a, ridit_a(y) = Pr_a(Y < y) + 0.5 · Pr_a(Y = y), where Pr_a is the empirical distribution of annotator a's ratings. Ridit scoring has the effect of reweighting the importance of a scale label based on the frequency with which it is used. For example, insofar as an annotator rarely uses extreme values, such as not at all confident or totally confident, the annotator is likely signaling very low or very high confidence, respectively, when they are used; and insofar as an annotator often uses extreme values, the annotator is likely not signaling particularly low or particularly high confidence.
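To make the scoring concrete, here is a minimal, self-contained sketch of ridit scoring applied to one annotator's ratings; the function and variable names are ours, not part of any released code.

```python
from collections import Counter

def ridit_scores(ratings):
    """Map one annotator's ordinal ratings (e.g. 1-5 confidence) into (0, 1).

    ridit(y) = Pr(Y < y) + 0.5 * Pr(Y = y), with the probabilities taken from
    the annotator's own empirical rating distribution.
    """
    n = len(ratings)
    counts = Counter(ratings)
    ridit = {}
    cumulative = 0.0
    for value in sorted(counts):
        p = counts[value] / n
        ridit[value] = cumulative + 0.5 * p   # Pr(Y < y) + 0.5 * Pr(Y = y)
        cumulative += p
    return [ridit[y] for y in ratings]

# An annotator who almost always answers "totally confident" (5):
# the rare low ratings receive very low ridit scores, while 5 itself
# maps to 0.6 rather than the top of the scale.
print(ridit_scores([5, 5, 5, 5, 2, 5, 5, 1, 5, 5]))
```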
Binary normalization In analyzing pilot data, we found that different annotators also have different biases for responding true or false on different properties. To adjust for these biases, we construct a normalized score using mixed effects logistic regressions fit separately to our train and development splits and to our test split. These mixed effects models all had (i) a hinge loss with margin set to the normalized confidence rating; (ii) fixed effects for property - PARTICULAR, KIND, and ABSTRACT for arguments; PARTICULAR, HYPOTHETICAL, and DYNAMIC for predicates - token, and their interaction; and (iii) by-annotator random intercepts and random slopes for property with diagonal covariance matrices. We obtain a normalized score from these models by setting the Best Linear Unbiased Predictors for the by-annotator random effects to zero and using the Best Linear Unbiased Estimators for the fixed effects to obtain a real-valued label for each token on each property. This procedure amounts to estimating a label for each property and each token based on the 'average annotator.'
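The sketch below illustrates the underlying idea in a deliberately simplified form: it replaces the confidence-weighted hinge loss and the random-effects structure described above with an ordinary logistic regression that has one-hot token effects and one-hot annotator effects, and then reads off an "average annotator" score per token by dropping the annotator terms. All data and names are hypothetical; the actual models were fit as described in the text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical long-format annotations for one property (e.g. IS.PARTICULAR):
# each position is (token index, annotator index, binary response).
token_ids = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
annotator_ids = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2])
responses = np.array([1, 1, 1, 0, 1, 0, 0, 0, 1])

n_tokens, n_annotators = token_ids.max() + 1, annotator_ids.max() + 1
X_tok = np.eye(n_tokens)[token_ids]          # one-hot token (fixed) effects
X_ann = np.eye(n_annotators)[annotator_ids]  # one-hot annotator (bias) effects
X = np.hstack([X_tok, X_ann])

model = LogisticRegression(C=1.0).fit(X, responses)

# "Average annotator" score per token: keep the token effects and intercept,
# drop the annotator effects (analogous to zeroing the by-annotator BLUPs).
token_scores = model.coef_[0][:n_tokens] + model.intercept_[0]
print(token_scores)
```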

Exploratory Analysis
Before presenting models for predicting our properties, we conduct a variety of exploratory analyses to demonstrate that the properties of the dataset relate to other token- and type-level semantic properties in intuitive ways. Figure 2 plots the normalized ratings for the argument (left) and predicate (right) protocols. Each point corresponds to a token and the density plots visualize the number of points in a region.
Arguments We see that arguments have a clear tendency (Pearson correlation ρ=-0.33) to refer to either a kind or a particular - e.g. place in (10) falls in the lower right quadrant (particular-referring) and transportation in (11) falls in the upper left quadrant (kind-referring) - though there are a not insignificant number of arguments that refer to something that is both - e.g. registration in (12) falls in the upper right quadrant.
(10) I think this place is probably really great especially judging by the reviews on here .
(11) What made it perfect was that they offered transportation so that...
(12) Some places do the registration right at the hospital...
We also see that there is a clear tendency for arguments that are neither particular-referring (ρ=-0.28) nor kind-referring (ρ=-0.11) to be abstract-referring - e.g. power in (13) falls in the lower left quadrant (only abstract-referring) - but that there are some arguments that refer to abstract kinds and some that refer to abstract particulars - e.g. both reputation (14) and argument (15) are abstract, but reputation falls in the lower right quadrant, while argument falls in the upper left (kind-referring).
(13) Power be where power lies. (14) Meanwhile, his reputation seems to be improving, although Bangs noted a "pretty interesting social dynamic." (15) The Pew researchers tried to transcend the economic argument.
Predicates We see that there is effectively no tendency (ρ=0.00) for predicates that refer to particular situations to refer to dynamic events - e.g. faxed in (16) falls in the upper right quadrant (particular- and dynamic-referring), while available in (17) falls in the lower right quadrant (particular- and non-dynamic-referring).
(16) I have faxed to you the form of Bond... (17) is gare montparnasse storage still available?
But we do see that there is a clear tendency (ρ=-0.25) for predicates that are hypothetical-referring not to be particular-referring - e.g. knows in (18a) and do in (18b) both fall in the lower left quadrant.
(18) a. Who knows what the future might hold , and it might be expensive ? b. I have tryed to give him water but he wont take it..what should i do?
Inducing clause types One impetus for developing a multi-label framework for capturing linguistic expressions of generalization was that three-way classification of clauses into EPISODIC, GENERIC, and HABITUAL appeared insufficient.
If it were sufficient, we would expect that clauses represented using our multi-label framework should cluster into three (or at least some small number of) distinct groups.
To check this, we concatenate the normalized ratings for each argument with the normalized ratings for its corresponding predicate and fit a Gaussian Mixture Model (GMM) with a Dirichlet Process (DP) prior to these predicate-argument pairs. Even with concentration parameters set to induce high sparsity (α = 0.01), this method assigns only 22% of the predicate-argument pairs to the three most populous categories (see Figure 3).
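A clustering of this kind can be run with scikit-learn's BayesianGaussianMixture using a Dirichlet process prior; the sketch below uses random stand-in data in place of the six concatenated property scores, and every hyperparameter other than the concentration parameter is illustrative.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Stand-in for the real data: each row would be the three normalized argument
# scores concatenated with the three normalized predicate scores (6 dims).
rng = np.random.default_rng(0)
pairs = rng.normal(size=(1000, 6))

dpgmm = BayesianGaussianMixture(
    n_components=20,                                   # truncation level
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=0.01,                   # encourage sparsity
    covariance_type="full",
    max_iter=500,
    random_state=0,
).fit(pairs)

labels = dpgmm.predict(pairs)
counts = np.bincount(labels, minlength=20)
top3 = np.sort(counts)[::-1][:3]
print("share of data in the three largest components:", top3.sum() / len(pairs))
```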

Comparison to other token-level properties
We compare our token-level argument and predicate properties against argument and predicate properties found in two other token-level datasets.

Event Factuality
We expect that event hypotheticality should be related to event factuality - i.e. whether an event happened or not. Specifically, we expect hypothetical events to tend not to be factual. To test this, we use an event factuality dataset annotated on UD-EWT, developed by White et al. (2016) and Rudinger et al. (2018). In this dataset, all verbal predicates produced by PredPatt are annotated for whether the event they refer to already happened or is currently happening, along with a confidence rating on a five-point scale. We apply the same normalization procedure used for our properties to the factuality data and compare our normalized predicate properties against this normalized factuality score. We find that 78% of the predicates annotated in the train and dev portions of our dataset were also annotated in the factuality dataset. Among these predicates, we corroborate our expectations, finding a Spearman correlation with IS.FACTUAL of -0.25 for IS.HYPOTHETICAL, 0.12 for IS.PARTICULAR, and 0.02 for IS.DYNAMIC.
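Operationally, this comparison amounts to joining the two datasets on predicate tokens and computing a Spearman correlation; a toy sketch (with hypothetical column names and made-up values) follows.

```python
import pandas as pd
from scipy.stats import spearmanr

# Toy stand-ins keyed by (sentence id, token id); column names are hypothetical.
genericity = pd.DataFrame({
    "sent_id": [1, 1, 2, 3], "tok_id": [2, 5, 3, 1],
    "is_hypothetical": [0.9, -0.4, 1.2, -1.1],
})
factuality = pd.DataFrame({
    "sent_id": [1, 2, 3, 4], "tok_id": [2, 3, 1, 6],
    "is_factual": [-1.3, -0.2, 2.1, 0.4],
})

merged = genericity.merge(factuality, on=["sent_id", "tok_id"], how="inner")
rho, p = spearmanr(merged["is_hypothetical"], merged["is_factual"])
print(f"overlap: {len(merged)} predicates, spearman rho = {rho:.2f}")
```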
Semantic Proto-Role Properties Referential properties have long been known to be important for determining argument-taking behavior in ways similar to the semantic proto-role properties of Dowty (1991) - see, e.g., the noun incorporation literature (Mithun, 1984, 1986; Baker, 1988; Van Geenhoven, 1998; Farkas and de Swart, 2003; Massam, 2009). We thus expect some amount of correlation between our properties and proto-role properties. Reisinger et al. (2015) present an annotation framework for semantic properties relevant to determining semantic role based on Dowty's (1991) seminal work. This framework was then updated and applied to UD-EWT by White et al. (2016). Table 2 gives the correlations between the ridit-normalized ratings for various SPR properties on argument spans and our argument properties. We see that properties that are associated with agentivity (AWARENESS, VOLITION, INSTIGATION, etc.) correlate positively with particularity and negatively with abstractness and (to some extent) kindhood.
Comparison to type-level properties

We compare our token-level argument and predicate properties against argument and predicate properties found in two type-level datasets.

Eventivity The LCS Database contains hand-built lexical conceptual structures, from which predicate eventivity and stativity can be inferred based on whether or not a particular sense contains a root node be (Dorr and Voss, 1993). We compare our IS.DYNAMIC predicate annotations against the eventivity ratings of verb lemmas from LCS. If a lemma possesses at least one LCS structure (sense) with a dynamic or stative reading, we consider it to be dynamic or stative (or both). 43.7% of the predicate lemmas in our dataset were present in the LCS database, and by thresholding the normalized scores for IS.DYNAMIC at zero - greater than 0 is dynamic, less than 0 is not dynamic - we observe that for 86.4% of predicates at least one LCS sense agrees with our annotation (both eventive or both stative) and for 40.9% all senses agree. For example, the lemmas exist, thrive, and take have both eventive and stative senses in the LCS database and in our annotations.
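The agreement computation itself is simple thresholding and counting; the sketch below shows the logic on hypothetical data, with each LCS lemma mapped to a list of its senses' dynamic/stative labels.

```python
# Hypothetical LCS sense inventory: each entry lists the dynamic/stative label
# of each sense of the lemma. The predicate scores are also made up.
lcs_senses = {
    "exist":  ["dynamic", "stative"],
    "thrive": ["dynamic", "stative"],
    "run":    ["dynamic"],
}
predicates = [("exist", -0.7), ("thrive", 0.4), ("run", 1.2), ("run", -0.1)]

def label(score):
    # Threshold the normalized IS.DYNAMIC score at zero.
    return "dynamic" if score > 0 else "stative"

covered = [(l, s) for l, s in predicates if l in lcs_senses]
at_least_one = sum(label(s) in lcs_senses[l] for l, s in covered) / len(covered)
every_sense = sum(all(x == label(s) for x in lcs_senses[l]) for l, s in covered) / len(covered)
print(at_least_one, every_sense)
```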
Concreteness The concreteness lexicon of Brysbaert et al. (2014) provides concreteness ratings - which evaluate the degree to which the concept denoted by a word refers to a perceptible entity - for 40,000 generally known English lemmas. We compare our IS.ABSTRACT argument annotations against these ratings. Concreteness ratings were found for 66% of the argument lemmas in our dataset, and the normalized IS.ABSTRACT score exhibited a Spearman correlation of -0.45 with the concreteness ratings.

Models
We consider two forms of predicate and argument representations to predict the three attributes in our framework: hand-engineered features and learned features. For both, we contrast type-level and token-level information.
Hand-engineered features We consider five sets of type-level hand-engineered features.
1. Concreteness: Concreteness ratings for root argument lemmas in the argument protocol, from the concreteness database (Brysbaert et al., 2014). For the predicate protocol, we assign three concreteness features: the mean, maximum, and minimum concreteness rating of the predicate's arguments.
2. Eventivity: Eventivity and stativity for the root predicate lemma in the predicate protocol and for the predicate head of the root argument in the argument protocol, from the LCS database.
3. VerbNet: Verb classes from VerbNet (Schuler, 2005) for predicate lemmas.
4. FrameNet: Frames evoked by root predicate lemmas in the predicate protocol and by both the root argument lemma and its predicate head in the argument protocol, from FrameNet (Baker et al., 1998).
5. WordNet: WordNet (Fellbaum, 1998) supersenses (Ciaramita and Johnson, 2003) for argument and predicate lemmas.

We also consider two sets of token-level hand-engineered features.
1. Syntactic features: POS tags, UD morphological features, and governing dependencies, extracted using PredPatt for the predicate/argument root and all of its dependents.
2. Lexical features: Function words - determiners, modals, auxiliaries - among the dependents of the annotated arguments and predicates.

Learned features For our type-level learned features, we use the 42B uncased GloVe embeddings for the root of the annotated predicate or argument (Pennington et al., 2014). For our token-level learned features, we use 1,024-dimensional ELMo embeddings (Peters et al., 2018). To obtain the latter, the UD-EWT sentences are passed as input to the three-layered ELMo biLM, and we extract the output of all three layers for the root of the annotated predicates and arguments, giving us 3,072-dimensional vectors for each.
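For the contextual features, a sketch of the token-level extraction is shown below, assuming the ElmoEmbedder convenience class from allennlp 0.x; class locations and defaults differ across allennlp releases, and the sentence and root index are just examples.

```python
import numpy as np
from allennlp.commands.elmo import ElmoEmbedder  # allennlp 0.x convenience API

elmo = ElmoEmbedder()  # downloads the default pretrained 3-layer biLM

tokens = ["I", "will", "manage", "client", "expectations", "accordingly"]
layers = elmo.embed_sentence(tokens)        # shape: (3 layers, n_tokens, 1024)

root_index = 2                              # e.g. the predicate root "manage"
feature = np.concatenate([layers[l, root_index] for l in range(3)])
print(feature.shape)                        # (3072,) token-level representation
```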
Labeling models For each protocol, we predict the three normalized properties corresponding to the annotated token(s) using different subsets of the above features. The feature representation is used as the input to a multilayer perceptron with ReLU nonlinearity and L1 loss. The number of hidden layers and the hidden layer sizes are hyperparameters that we tune on the development set.
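A minimal PyTorch sketch of such a labeling model is given below: a feature vector goes in, three real-valued property scores come out, with ReLU hidden layers and an L1 training loss. The layer sizes are illustrative placeholders, not the tuned values.

```python
import torch
import torch.nn as nn

class PropertyLabeler(nn.Module):
    """MLP mapping a predicate/argument feature vector to three property scores."""

    def __init__(self, input_dim, hidden_sizes=(512, 256), n_properties=3):
        super().__init__()
        layers, prev = [], input_dim
        for size in hidden_sizes:               # depth and widths tuned on dev
            layers += [nn.Linear(prev, size), nn.ReLU()]
            prev = size
        layers.append(nn.Linear(prev, n_properties))
        self.net = nn.Sequential(*layers)

    def forward(self, features):
        return self.net(features)

loss_fn = nn.L1Loss()                           # mean absolute error training loss
```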
Implementation For all experiments, we use stochastic gradient descent to train the multi-layer neural network parameters with the Adam optimizer (Kingma and Ba, 2014), using the default learning rate in PyTorch (1e-3). We performed ablation experiments on the four major classes of features discussed above.
Development For all models, we train for at most 20 epochs with early stopping. At the end of each epoch, the L1 loss is calculated on the development set, and if it is higher than the previous epoch, we stop training, saving the parameter values from the previous epoch.
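Putting the previous pieces together, the following sketch shows one way the Adam training loop with early stopping could look; it reuses the PropertyLabeler sketch above and random stand-in tensors in place of the real features and labels.

```python
import copy
import torch

# Random stand-in data; in practice these would be the feature vectors and
# normalized property scores for the train and development splits.
X_train, y_train = torch.randn(256, 3072), torch.randn(256, 3)
X_dev, y_dev = torch.randn(64, 3072), torch.randn(64, 3)

model = PropertyLabeler(input_dim=3072)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.L1Loss()

best_dev, best_state = float("inf"), None
for epoch in range(20):                              # at most 20 epochs
    model.train()
    optimizer.zero_grad()
    loss_fn(model(X_train), y_train).backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        dev_loss = loss_fn(model(X_dev), y_dev).item()
    if dev_loss > best_dev:                          # dev loss went up: stop,
        break                                        # keeping the previous epoch
    best_dev, best_state = dev_loss, copy.deepcopy(model.state_dict())

model.load_state_dict(best_state)
```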
Evaluation Consonant with work in event factuality prediction, we report Pearson correlation (ρ) and the proportion of mean absolute error (MAE) explained by the model, which we refer to as R1 by analogy with the variance explained R2 = ρ²:

R1_p = 1 − MAE_p(model) / MAE_p(baseline)

where MAE_p(baseline) is the MAE obtained by always guessing the median for property p. We calculate R1 across properties (wR1) by taking the mean R1 weighted by the MAE for each property.
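The metrics can be computed as in the sketch below; weighting wR1 by each property's baseline MAE is one reading of the description above, and the data is random stand-in data.

```python
import numpy as np
from scipy.stats import pearsonr

def r1(gold, pred):
    """Proportion of mean absolute error explained relative to a median baseline."""
    mae_model = np.mean(np.abs(gold - pred))
    mae_baseline = np.mean(np.abs(gold - np.median(gold)))
    return 1.0 - mae_model / mae_baseline

def weighted_r1(gold_by_prop, pred_by_prop):
    """Mean R1 across properties, weighted by each property's baseline MAE."""
    weights, scores = [], []
    for prop, gold in gold_by_prop.items():
        pred = pred_by_prop[prop]
        weights.append(np.mean(np.abs(gold - np.median(gold))))
        scores.append(r1(gold, pred))
    return np.average(scores, weights=weights)

# Toy usage with random stand-in scores for the three argument properties.
rng = np.random.default_rng(0)
gold = {p: rng.normal(size=200) for p in ["particular", "kind", "abstract"]}
pred = {p: 0.5 * gold[p] + 0.1 * rng.normal(size=200) for p in gold}
print({p: round(pearsonr(gold[p], pred[p])[0], 2) for p in gold})
print(round(weighted_r1(gold, pred), 2))
```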
These metrics together are useful, since ρ tells us how similar the predictions are to the true values, ignoring scale, and R1 tells us how close the predictions are to the true values, after accounting for variability in the data. We focus mainly on differences in relative performance among our models on these metrics, but for comparison, state-of-the-art event factuality prediction systems obtain ρ ≈ 0.77 and R1 ≈ 0.57 for predicting event factuality on the predicates we annotate.

Results

Table 3 contains the results on the test set for both the argument (top) and predicate (bottom) protocols. We see that (i) our models are generally better able to predict referential properties of arguments than those of predicates; (ii) for both predicates and arguments, contextual learned representations contain most of the relevant information, though the addition of hand-engineered features can give a slight performance boost, particularly for the predicate properties; and (iii) the results for proportion of absolute error explained are significantly lower than what we might expect from the variance explained implied by the correlations. We discuss (i) and (ii) here, deferring discussion of (iii) to §8.

Argument properties While type-level hand-engineered and learned features perform relatively poorly for properties such as IS.PARTICULAR and IS.KIND for arguments, they are able to predict IS.ABSTRACT relatively well compared to the models with all features. The converse of this also holds: token-level hand-engineered features are better able to predict IS.PARTICULAR and IS.KIND, but perform relatively poorly on their own for IS.ABSTRACT.
This seems likely to be a product of abstract reference being fairly strongly associated with particular lexical items, while most arguments can refer to particulars and kinds, and which they refer to is context-dependent. And in light of the relatively good performance of contextual learned features alone, it suggests that these contextual learned features - in contrast to the hand-engineered token-level features - are able to exploit this information coming from the lexical item.
Interestingly, however, the models with both contextual learned features (ELMo) and hand-engineered token-level features perform slightly better than those without the hand-engineered features across the board, suggesting that there is some (small) amount of contextual information relevant to generalization that the contextual learned features are missing. This performance boost may be diminished by improved contextual encoders, such as BERT (Devlin et al., 2018).
Predicate properties We see a pattern similar to the one observed for the argument properties mirrored in the predicate properties: while type-level hand-engineered and learned features perform relatively poorly for properties such as IS.PARTICULAR and IS.HYPOTHETICAL, they are able to predict IS.DYNAMIC relatively well compared to the models with all features. The converse of this also holds: token-level hand-engineered features are better able to predict IS.PARTICULAR and IS.HYPOTHETICAL, but perform relatively poorly on their own for IS.DYNAMIC.

One caveat here is that, unlike for IS.ABSTRACT, type-level learned features (GloVe) alone perform quite poorly for IS.DYNAMIC, and the difference between the models with only type-level hand-engineered features and the ones with only token-level hand-engineered features is less stark for IS.DYNAMIC than for IS.ABSTRACT. This may suggest that, though IS.DYNAMIC is relatively constrained by the lexical item, it may be more contextually determined than IS.ABSTRACT. Another major difference between the argument properties and the predicate properties is that IS.PARTICULAR is much more difficult to predict than IS.HYPOTHETICAL. This contrasts with IS.PARTICULAR for arguments, which is easier to predict than IS.KIND.

Figure 4 plots the true (normalized) property values for the argument (top) and predicate (bottom) protocols from the development set against the values predicted by the models highlighted in blue in Table 3. Points are colored by the part-of-speech of the argument or predicate root.

Analysis
We see two overarching patterns. First, our models are generally reluctant to predict values outside the [-1, 1] range, despite the fact that there are a not insignificant number of true values outside this range. This behavior likely contributes to the difference we saw between the ρ and R1 metrics, wherein R1 was generally worse than we would expect from ρ. This pattern is starkest for IS.PARTICULAR in the predicate protocol, where predictions are nearly all constrained to [0, 1]. Second, the model appears to be heavily reliant on part-of-speech information - or some semantic information related to part-of-speech - for making predictions. This behavior can be seen in the fact that, though common noun-rooted arguments get relatively variable predictions, pronoun- and proper noun-rooted arguments are almost always predicted to be particular, non-kind, and non-abstract; and though verb-rooted predicates also get relatively variable predictions, common noun-, adjective-, and proper noun-rooted predicates are almost always predicted to be non-dynamic.
Argument protocol Proper nouns tend to refer to particular, non-kind, non-abstract entities, but they can be kind-referring, which our models miss: iPhone in (20) and Marines in (19) were predicted to have low kind scores and high particular scores, while annotators label these arguments as non-particular and kind-referring.
This similarly holds for pronouns. As mentioned in §4, we filtered out several pronominal arguments, but certain pronouns -like you, they, yourself, themselves -were not filtered because they can have both particular-and kind-referring uses.
Our models fail to capture instances where pronouns are labeled kind-referring - e.g. you in (21) and (22) - consistently predicting low IS.KIND scores, likely because such uses are rare in our data. This behavior is not seen with common nouns: the model correctly predicts common nouns in certain contexts as non-particular, non-abstract, and kind-referring - e.g. food in (23) and men in (24).
(23) Kitchen puts out good food...
(24) just saying most men suck!

Predicate protocol As in the argument protocol, general trends associated with part-of-speech are exaggerated by the model. We noted in §5 that annotators tend to annotate hypothetical predicates as non-particular and vice versa (ρ=-0.25), but the model's predictions are anti-correlated to a much greater extent (ρ=-0.79). For example, annotators are more willing to say a predicate can refer to a particular, hypothetical situation, as in (25), or a non-particular, non-hypothetical situation, as in (26).
(25) Read the entire article; there 's a punchline... (26) it s illegal to sell stolen property, even if you don't know its stolen.
The model also had a bias towards predicting that particular-referring predicates are dynamic (ρ=0.34) - a correlation not present among annotators. For instance, is closed in (27) was annotated as particular but non-dynamic, yet predicted by the model to be particular and dynamic; and helped in (28) was annotated as non-particular and dynamic, but the model predicted particular and dynamic.
(27) library is closed (28) I have a new born daughter and she helped me with a lot.

Conclusion
We proposed a novel semantic framework for modeling linguistic expressions of generalization as combinations of simple, real-valued referential properties of predicates and their arguments. We used this framework to construct a dataset covering the entirety of the Universal Dependencies English Web Treebank.