Expectations over Unspoken Alternatives Predict Pragmatic Inferences

Abstract
Scalar inferences (SI) are a signature example of how humans interpret language based on unspoken alternatives. While empirical studies have demonstrated that human SI rates are highly variable (both within instances of a single scale and across different scales), there have been few proposals that quantitatively explain both cross- and within-scale variation. Furthermore, while it is generally assumed that SIs arise through reasoning about unspoken alternatives, it remains debated whether humans reason about alternatives as linguistic forms, or at the level of concepts. Here, we test a shared mechanism explaining SI rates within and across scales: context-driven expectations about the unspoken alternatives. Using neural language models to approximate human predictive distributions, we find that SI rates are captured by the expectedness of the strong scalemate as an alternative. Crucially, however, expectedness robustly predicts cross-scale variation only under a meaning-based view of alternatives. Our results suggest that pragmatic inferences arise from context-driven expectations over alternatives, and these expectations operate at the level of concepts.


Introduction
Much of the richness of linguistic meaning arises from what is left unsaid (e.g., Grice, 1975; Sperber and Wilson, 1986; Horn, 1989). For example, if Alice says "Some of the students passed the exam", Bob can infer that Alice means not all students passed the exam, even though Alice's utterance would still be logically true if all students had passed. One explanation of this inference is that Bob reasons about the unspoken alternatives that were available to the speaker. Under the assumptions that (1) speakers generally try to be informative, (2) Alice has full knowledge of the situation, and (3) it would have been relevant and more informative for Alice to say "All of the students passed the exam", Alice's choice to say "some" suggests that she believes the sentence with "all" is false. This inference pattern is more generally known as scalar inference (SI), which arises from orderings between linguistic items (scales).
Code and data can be found at https://github.com/jennhu/expectations-over-alternatives.
SI has often been treated as a categorical phenomenon: when a speaker utters a weaker (less informative) item on a scale, a listener rules out the meaning of stronger (more informative) items on that scale (e.g., Levinson, 2000). However, empirical studies have demonstrated substantial variability in the rates at which humans draw SIs, both within instances of a single scale (Degen, 2015; Eiteljoerge et al., 2018; Li et al., 2021) and across scales formed by different lexical items (e.g., Doran et al., 2009; Beltrama and Xiang, 2013; van Tiel et al., 2016; Gotzner et al., 2018; Pankratz and van Tiel, 2021; Ronai and Xiang, 2022). For example, consider the following instances of the scale ⟨some, all⟩:
(1) a. I like some country music.
    b. I like some, but not all, country music.
(2) a. It would certainly help them to appreciate some of the things that we have here.
    b. It would certainly help them to appreciate some, but not all, of the things that we have here.
Degen (2015) finds that humans are highly likely to consider (1-a) as conveying a similar meaning as (1-b), but unlikely to consider (2-a) as conveying a similar meaning as (2-b) (Figure 1a). Similarly, consider the following instances of the scales ⟨possible, certain⟩ and ⟨ugly, hideous⟩, which both consist of adjectives ordered by entailment:

(3) a. Success is possible.
    b. Success is not certain.
(4) a. The painting is ugly.
    b. The painting is not hideous.
[Figure 1, panel "Variation across scales". Low scalar inference rate: "The painting is ugly" / "The painting is not hideous". High scalar inference rate: "Success is possible" / "Success is not certain".]
van Tiel et al. (2016) find that humans are highly likely to conclude that (3-a) implies (3-b), but unlikely to conclude that (4-a) implies (4-b) (Figure 1b). While cross-scale and within-scale variation have typically been studied as distinct empirical phenomena, they both reflect gradedness in listener inferences based on alternatives and context. It therefore seems desirable to explain these empirical findings with a shared account, but there have been few proposals that quantitatively explain both within- and cross-scale variation. For example, cross-scale variation can be explained by intrinsic properties of the scale (e.g., whether the strong scalemate refers to an extreme endpoint; van Tiel et al., 2016), but these factors cannot explain variation within instances of a single scale. On the other hand, many factors explaining within-scale variance are scale-specific (e.g., the partitive "of the" for ⟨some, all⟩; Degen, 2015) and may not generalize to new scales.
Here, we investigate a shared account of SI rates within and across scales. Since the alternatives are not explicitly produced (by definition), the listener has uncertainty over which alternatives the speaker could have used, and therefore over which strong scalemates ought to be negated through SI. Building upon constraint-based accounts of human language processing (Degen and Tanenhaus, 2015, 2016), we test the hypothesis that SIs depend on the availability of alternatives, which in turn depends on context-driven expectations maintained by the listener. For example, if a speaker says "The movie was good", the listener might predict that amazing is a more likely alternative than funny to the weak term good. An expectation-based view predicts that the listener would thus be more likely to infer that the movie is not amazing (according to the speaker), and less likely to infer that the movie is not funny. However, while Degen and Tanenhaus (2015, 2016) have argued that listeners maintain context-driven expectations over alternatives, these studies have primarily investigated a single scale (⟨some, all⟩) in small domains, arguing from qualitative patterns and in the absence of a formal theory.
Furthermore, while it is generally assumed that SIs arise based on reasoning about unspoken alternatives, it remains debated whether humans reason about alternatives as linguistic structures (e.g., Katzir, 2007; Fox and Katzir, 2011), or at the level of concepts (e.g., Gazdar, 1979; Buccola et al., 2021). Returning to the earlier example, if the weak scalemate is good, listeners may reason about a concept like VERYGOOD instead of a specific linguistic expression like amazing. In this sense, the listener's uncertainty about alternatives might arise from uncertainty about both the scale itself (Is the speaker implying the plot wasn't amazing, or that the jokes weren't funny?), as well as the exact word forms under consideration by the speaker (Is the speaker implying the movie wasn't amazing, fantastic, or wonderful?). Despite theoretical debates about the nature of alternatives, however, the role of concept-based alternatives in SI has not been tested in a systematic, quantitative way.
We provide a formalization of an expectation-based account of alternatives and test it on both string-based and concept-based views of alternatives. Instead of empirically estimating human expectations over alternatives (cf. Ronai and Xiang, 2022), we use neural language models as an approximation, which allows us to generate predictions for arbitrary sentences and contexts. We test the account's predictions on human SI rates within the ⟨some, all⟩ scale (Degen, 2015), and across 148 scales from four datasets (van Tiel et al., 2016; Gotzner et al., 2018; Pankratz and van Tiel, 2021; Ronai and Xiang, 2022). We find support for the expectation-based account, and also provide the first evidence that concept-based alternatives may underlie a wide range of SIs. Our results suggest that pragmatic inferences may arise from context-driven expectations over unspoken alternatives, and these expectations operate at the level of concepts.

Within-scale variation
Within-scale variation refers to the variation in SI rates across instances of a single scale, such as ⟨some, all⟩. To explore SI variation within the scale ⟨some, all⟩, we use the dataset collected by Degen (2015), which features 1363 naturalistic sentences containing a "some"-NP from the Switchboard corpus of telephone dialogues (Godfrey et al., 1992) (Table 1). For each sentence, SI rates were measured using a sentence-similarity paradigm. On each trial, participants saw two sentence variants: the original sentence containing "some", and a minimally differing sentence where ", but not all," was inserted directly after "some". Participants were asked, "How similar is the statement with 'some, but not all' to the statement with 'some'?" and indicated responses (similarity judgments) on a seven-point Likert scale. If the speaker's originally intended meaning clearly includes an implicature, then making the implicature explicit by inserting ", but not all," should not change the meaning of the sentence, so similarity judgments should be high. Thus, a higher similarity judgment indicates a stronger SI. Degen (2015) finds substantial variation in SI rates across contexts, challenging the idea that the "some, but not all" inference arises reliably without sensitivity to context (Horn, 1989; Levinson, 2000). She also reports several features that predict SI rates, such as whether "some" occurs with the partitive "of the", or whether the "some"-NP is in subject position. However, these features may be highly specific to the ⟨some, all⟩ scale, and it is unclear whether a more general mechanism may also explain variation within or across other scales.

Cross-scale variation (scalar diversity)
Cross-scale variation refers to the variation in SI rates across scales formed by different lexical items. To explore this, we use SI rates across 148 unique scales from four datasets, summarized in Table 1. Each scale involves a pair of English words (adjectives, adverbs, or verbs) of the form ⟨[WEAK], [STRONG]⟩, where [WEAK] is less informative than [STRONG] (e.g., ⟨intelligent, brilliant⟩). For each dataset, SI rates were measured through a binary choice task. Participants saw a character make a short, unembedded statement consisting of a simple noun phrase subject and a predicate with a weak scalar item (e.g., "John says: This student is intelligent."). Their task was to indicate (Yes or No) whether they would conclude that the speaker believes the negation of a strong scalar item (e.g., "Would you conclude from this that, according to John, she is not brilliant?"). The SI rate for a scale is the proportion of Yes responses.
This method has revealed large variation in SI rates, ranging from 4% (⟨ugly, hideous⟩) to 100% (⟨sometimes, always⟩) (van Tiel et al., 2016). van Tiel et al. (2016) test two classes of factors that might predict SI rates: the availability of the strong scalemate given the weak scalemate, and the degree to which scalemates can be distinguished from each other. They find SI rates are predicted by measures of scalemate distinctness (e.g., whether the strong scalemate forms a fixed endpoint on the scale), but not by availability (but see Westera and Boleda, 2020; Ronai and Xiang, 2022). Other studies have proposed additional scale-intrinsic factors (e.g., Gotzner et al., 2018; Sun et al., 2018; Pankratz and van Tiel, 2021). However, structural properties of a scale cannot explain variability in SI rates within a scale, as these properties do not change across contexts.
While others have proposed context-dependent factors, which could in principle explain both cross- and within-scale variation, these factors often lack explanatory power in practice. For example, Ronai and Xiang (2021) find that the prominence of the Question Under Discussion (Roberts, 2012) is correlated with SI rates, but only for unbounded scales (i.e., scales where neither scalemate has a fixed, extreme meaning).
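As a minimal sketch of the SI rate defined above (the proportion of Yes responses per scale), the following assumes responses arrive as (scale, answer) pairs. The function name `si_rates` and the toy responses are hypothetical and are not taken from any of the cited datasets.

```python
from collections import defaultdict


def si_rates(responses):
    """Map each scale to its proportion of "Yes" responses.

    `responses` is an iterable of (scale, answer) pairs, where `scale` is a
    (weak, strong) tuple and `answer` is the string "Yes" or "No".
    """
    yes = defaultdict(int)
    total = defaultdict(int)
    for scale, answer in responses:
        total[scale] += 1
        if answer == "Yes":
            yes[scale] += 1
    return {scale: yes[scale] / total[scale] for scale in total}


# Made-up responses for illustration only (not data from the studies).
toy_responses = [
    (("intelligent", "brilliant"), "Yes"),
    (("intelligent", "brilliant"), "No"),
    (("intelligent", "brilliant"), "No"),
    (("intelligent", "brilliant"), "No"),
    (("sometimes", "always"), "Yes"),
    (("sometimes", "always"), "Yes"),
]
rates = si_rates(toy_responses)
```

Because each scale yields a single proportion under this measure, it captures variation across scales but not across contexts of the same scale.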

An expectation-based account of SI
Theoretically, it is the set of alternative utterances (utterances that the speaker could have used, but didn't) that drives scalar implicature, and in principle every possible utterance in a language might be an alternative to every other. However, at an algorithmic level (Marr, 1982), it would be intractable for listeners to perform inference over this entire set. Furthermore, the signature pattern of SI would not arise without restrictions on the alternatives: otherwise, "[WEAK], but not [STRONG]" and "[STRONG]" would both be alternatives to "[WEAK]", leading to contradictory inferences without a mechanism for breaking symmetry (Kroch, 1972; Katzir, 2007; Breheny et al., 2018).
To solve this symmetry problem, some approaches restrict alternatives based on structural complexity through grammar-internal mechanisms (e.g., Katzir, 2007; Fox and Katzir, 2011). However, these theories do not capture the uncertainty that listeners maintain, and are difficult to test quantitatively. Here, we test the view that listeners form probabilistic expectations over alternatives, given information from their interaction with the speaker. In the remainder of this section, we first discuss the conceptual predictions of an expectation-based account of SI, and then describe how we operationalize these predictions using neural language models.
Suppose that a listener hears a sentence with a weak scalar term [WEAK] (e.g., "This student is intelligent"). To rule out the meaning of a particular strong scalemate [STRONG] (e.g., the student is not brilliant), the listener must have reason to believe that the speaker would have said [STRONG] if they had intended to convey the strong meaning. However, since the alternatives are not explicitly produced, the listener has some degree of uncertainty over what alternatives were considered by the speaker. If it is likely that the speaker would have said [STRONG] to convey the strong meaning, then their choice to say [WEAK] suggests that they did not have grounds to say [STRONG], and thus an SI should be more likely to arise.
The key question, then, is how listeners estimate which alternatives are likely to be considered by the speaker. An expectation-based account proposes that listeners integrate contextual and grammatical cues to maintain probabilistic expectations over these alternatives. A scalemate that is more probable (given these cues) should be more likely to enter the scalar inference computation. Thus, this account predicts that the more expected the strong scalemate is as an alternative to the weak scalemate, the higher SI rates should be.

String-based view of alternatives
When an alternative is likely to be a strong scalemate, listeners should be more likely to rule out its meaning, resulting in higher SI rates. Conditioned on the context and the speaker's choice to use [WEAK], the listener must estimate the probability of [WEAK] and [STRONG] being contrasted in a scalar relationship. Since it is difficult to directly estimate this probability, we construct a sentence frame where the probability of [STRONG], at the level of forms, approximates the probability of [STRONG] being in a scalar relationship with a weak scalemate [WEAK]. This approach allows us to re-frame the problem of estimating listeners' expectations over strong scalemates into a word prediction problem.
To do this, we use the scalar construction "X, but not Y", which in many cases suggests that Y is a strong scalemate to X (Hearst, 1992; de Melo and Bansal, 2013; van Miltenburg, 2015; Pankratz and van Tiel, 2021). For a given utterance [CONTEXT] [WEAK] [CONTEXT] and hypothesized scale ⟨[WEAK], [STRONG]⟩, we form a sentence that explicitly states the SI:
$$[\text{CONTEXT}]\ \underbrace{[\text{WEAK}]\text{, but not }[\text{STRONG}]}_{\text{scalar construction}}\ [\text{CONTEXT}] \tag{1}$$
To test how expected [STRONG] is as an alternative to [WEAK], we need to estimate how likely a human would be to predict [STRONG] in the [STRONG] position in (1). Instead of attempting to directly measure these predictions (cf. Ronai and Xiang, 2022; see (3)), we approximate them with neural language models. We measure how unexpected [STRONG] is by computing its surprisal (negative log probability) under a language model, conditioned on the rest of the sentence. Since surprisal measures unexpectedness, we predict a negative relationship between SI rate and the surprisal of the strong scalemate.
This predictor is closely related to the notion of an SI's "relevance" (Pankratz and van Tiel, 2021). Under usage-based theories of language (e.g., Tomasello, 2003; Bybee and Beckner, 2015), if a weak scalar term is encountered frequently in a scalar relationship with a particular strong term, then the scalar relationship between these items will be enforced. Thus, Pankratz and van Tiel (2021) measure the relevance of an SI by counting corpus frequencies of the scalemates in the string "[WEAK], but not [STRONG]." This is conceptually aligned with our setup, where we might expect higher corpus frequencies to correspond to lower surprisal under a language model. However, our predictor differs from Pankratz and van Tiel's in an important way: they aim to measure the "general relevance" of an SI, which they define as "relevance even in the absence of a situated context." It is unclear how general relevance can explain variation in SI rates within instances of a scale. By using context-conditioned probabilities from a language model, our predictor could account for both the general frequency of "[WEAK], but not [STRONG]" as well as expectations driven by the context in which the scale occurs.
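The string-based predictor can be illustrated with a small sketch: build the scalar construction, then convert a probability for the [STRONG] slot into a surprisal. The helper names `scalar_construction` and `surprisal` are hypothetical, the logarithm base (nats here) is an assumption, and the probability would in practice come from a language model rather than being supplied by hand.

```python
import math


def scalar_construction(before, weak, strong, after=""):
    """Insert ', but not [STRONG]' directly after the weak scalar item."""
    text = f"{before} {weak}, but not {strong}".strip()
    # Close the inserted phrase with a comma only when more context follows.
    return f"{text}, {after}" if after else text


def surprisal(prob):
    """Surprisal in nats: the negative log of a probability in (0, 1]."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("prob must be in (0, 1]")
    return -math.log(prob)


# Strong scalemate in sentence-final position.
sentence = scalar_construction("The elephant is", "big", "enormous")
# Strong scalemate in the middle of the sentence.
mid_sentence = scalar_construction("I like", "some", "all", "country music")
```

Under this definition a less probable strong scalemate receives a higher surprisal, which is why the account predicts a negative relationship between surprisal and SI rate.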

Concept-based view of alternatives
The method described above implicitly treats linguistic forms as the alternatives driving scalar inferences. However, recent proposals have advanced the view that alternatives are not linguistic objects, but instead operate at the level of more general reasoning preferences (Buccola et al., 2021). On this view, alternatives are constructed by replacing primitives of the concept expressed by the speaker with primitives of equal or less complexity.
Here, we test a generalization of this concept-based view of alternatives. Suppose, for example, a speaker uses the weak scalar term big. On a concept-based view, the listener may infer that the speaker is contrasting big with a concept like VERYBIG instead of a particular linguistic expression like enormous. However, in the experiments mentioned in Section 2.2, the SI process likely needs to be grounded in linguistic forms before the listener makes a judgment about a particular strong scalemate (in string form). One hypothesis is that upon hearing an expression with a weak scalemate, a stronger conceptual alternative is activated, which in turn probabilistically activates all the strings that could reflect it. Returning to our earlier example, if the conceptual alternative is VERYBIG, and huge, massive, and enormous are string-based realizations of that alternative, they may be assigned a high likelihood. When asked about a specific string-form alternative (e.g., "The elephant is big. Would you conclude that it is not enormous?"), humans may endorse the SI if the probability of conceptually similar linguistic alternatives is sufficiently high, even if the probability of the tested alternative (here, enormous) is low.
If SIs involve reasoning about conceptual alternatives, then surprisal values estimated from assumed string-form alternatives may be poor estimates of the true relevant surprisal, as a single concept could be expressed with multiple forms. Therefore, in addition to assessing whether expectedness of specific linguistic forms predicts SI rates (Section 3.1), we also test a second predictor which approximates the expectedness of conceptual alternatives. To do this, we need a set of alternatives A that could serve as potential linguistic scalemates. As described in more detail in Sections 4.3 and 5.3, we obtain A by taking a fixed set of words with the same part of speech as the weak scalemate, inspired by grammatical theories of alternatives (e.g., Rooth, 1985; Katzir, 2007). Using this alternative set A, we compute the weighted average surprisal of A using weights determined by the conceptual similarity between each alternative and the tested strong scalemate. We use GloVe embeddings (Pennington et al., 2014) as an approximation for conceptual representations of scalar items, and cosine similarity between GloVe vectors to approximate conceptual similarity.
For each scale ⟨[WEAK], [STRONG]⟩, we obtain weights by computing the cosine similarity between the GloVe embeddings for [STRONG] ($v_{[\text{STRONG}]}$) and each potential alternative $a$ ($v_a$) in the alternative set $A$. We compute the weighted average probability over $A$ using these weights, and then take the negative log to obtain the weighted average surprisal:
$$-\log \left( \frac{\sum_{a \in A} \cos(v_{[\text{STRONG}]}, v_a)\, P(a)}{\sum_{a \in A} \cos(v_{[\text{STRONG}]}, v_a)} \right) \tag{2}$$
where $P(a)$ denotes the language model probability of $a$ in the [STRONG] position of the scalar construction, conditioned on the rest of the sentence. If there are many conceptually similar alternatives with low surprisal, then the weighted average surprisal will be low, even if the surprisal of the tested scalemate is high. Therefore, weighted average surprisal forms a proxy for concept-based surprisal, which we compare to string-based surprisal.
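The similarity-weighted surprisal described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the two-dimensional vectors stand in for GloVe embeddings, the probabilities are made up, and the toy weights are all non-negative (how negative cosine similarities would be handled is not specified in the text).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def weighted_average_surprisal(strong, probs, vectors):
    """Negative log of the similarity-weighted average probability.

    probs:   alternative -> probability in the [STRONG] slot
    vectors: word -> embedding (must cover `strong` and every alternative)
    """
    weights = {a: cosine(vectors[strong], vectors[a]) for a in probs}
    total_weight = sum(weights.values())
    avg_prob = sum(weights[a] * probs[a] for a in probs) / total_weight
    return -math.log(avg_prob)


# Toy stand-ins for embeddings and language model probabilities.
toy_vectors = {"enormous": [1.0, 0.0], "huge": [1.0, 0.0], "tiny": [0.0, 1.0]}
toy_probs = {"enormous": 0.01, "huge": 0.5, "tiny": 0.2}

string_based = -math.log(toy_probs["enormous"])
concept_based = weighted_average_surprisal("enormous", toy_probs, toy_vectors)
```

In this toy case the tested scalemate is improbable on its own, but a highly similar and probable alternative pulls the weighted average surprisal well below the string-based value.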
4 Predicting variation within ⟨some, all⟩

Human data
To investigate variation within the scale ⟨some, all⟩, we use human SI strength ratings collected by Degen (2015). These ratings were measured by asking participants to rate the similarity (1-7) between a sentence with "some" and a minimally differing sentence with "some, but not all". See Section 2.1 for details.

Model
Following the experiment conducted by Degen (2015), we construct scalar templates by inserting ", but not all," after the occurrence of "some" in each sentence from the dataset. Since this scalar construction ("some, but not all,") often occurs in the middle of the sentence, we use the bidirectional language model BERT (Devlin et al., 2019) to measure model expectations at the position of the strong scalemate. Concretely, we replace "all" with the [MASK] token and measure BERT's probability distribution at that token. All models in our study are accessed via the Huggingface transformers library (Wolf et al., 2020).
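The template construction above amounts to a small string operation, sketched below. The helper `masked_template` is hypothetical, and matching the first standalone "some" is a simplifying assumption; in the dataset the relevant "some"-NP is given for each sentence.

```python
import re


def masked_template(sentence, mask_token="[MASK]"):
    """Insert ', but not [MASK],' after the first standalone 'some'."""
    # \b keeps words such as "something" or "sometimes" from matching.
    pattern = re.compile(r"\bsome\b", flags=re.IGNORECASE)
    if not pattern.search(sentence):
        raise ValueError("sentence contains no standalone 'some'")
    return pattern.sub(
        lambda m: f"{m.group(0)}, but not {mask_token},", sentence, count=1
    )


template = masked_template("I like some country music.")
tricky = masked_template("It is something, and some of it is new.")
```

The resulting string can then be passed to a masked language model, and the probability assigned to "all" (or to any other candidate) read off at the mask position.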

Candidate alternatives
For our string-based surprisal predictor (Section 3.1), we are only concerned with the surprisal of the alternative all in the [STRONG] position in (1). However, to compute our concept-based surprisal predictor (Section 3.2), we need a set of candidate alternatives that could potentially serve as the strong scalemates implied by the speaker. Since the alternatives to some are highly constrained by the grammar, we manually constructed a set of English quantifiers that can be used in contrast to some: each, every, few, half, much, many, most, and all.

Results
Figure 2 shows the relationship between our predictors and human SI ratings for Degen's (2015) dataset of variation within ⟨some, all⟩. We find that both string-based and concept-based surprisal are indeed negatively correlated with human similarity judgments (string-based: Figure 2a; [...]).
We additionally conducted a multivariate analysis including our two new predictors (string- and concept-based surprisal) among the predictors investigated in Degen's original study. We centered and transformed all variables according to Degen's original analyses. The results are summarized in Table 2. We find that the original predictors remain statistically significant, and that concept-based surprisal (but not string-based surprisal) is a significant predictor in the full model. This suggests that listeners draw stronger scalar inferences when all (or a conceptually similar alternative) is more expected in a given context.
5 Predicting variation across scales

Human data
To investigate variation across scales, we use human SI rates collected by four studies (Ronai and Xiang, 2022; Pankratz and van Tiel, 2021; Gotzner et al., 2018; van Tiel et al., 2016). SI rates were measured by showing participants a sentence with the weak scalemate (e.g., "The student is intelligent"), and asking whether they would endorse the negation of the strong scalemate (e.g., "The student is not brilliant"). See Section 2.2 for details.

Model
We construct scalar templates following the pattern summarized in Table 3. Since in each case the strong scalemate is the final word in the sentence, we use an autoregressive language model to measure expectations over potential scalemates in the [STRONG] position. We use the base GPT-2 model (Radford et al., 2019) via Huggingface and obtain model surprisals through the SyntaxGym command-line interface (Gauthier et al., 2020).

Candidate alternatives
Recall from Section 3.2 that we need a set of potential linguistic alternatives to compute the weighted average surprisal. We take this set of alternatives to be a set of words with the same part of speech (POS) as the weak scalemate and obtain these candidate alternative sets by extracting lists of English adjectives, adverbs, and verbs from WordNet (Miller, 1995). We then used NLTK (Loper and Bird, 2002) to find the words satisfying finer-grained POS tags (JJ for adjectives, RB for adverbs, and VB for verbs), and sorted each POS set according to word frequencies from the OpenSubtitles corpus (Lison and Tiedemann, 2016). We excluded words in the POS sets that were not in the frequency corpus, resulting in 3204 adjectives, 1953 adverbs, and 226 verbs. We restricted each POS set to its 1000 highest-frequency words, and performed some manual exclusions (e.g., removing "do" and "be" from the verb set, which are unlikely to form scales with any of the tested items and follow different syntactic rules). This finally resulted in our three alternative sets: 1000 adjectives, 960 adverbs, and 224 verbs.
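The filtering steps above (drop words without a corpus frequency, keep the most frequent words, then apply manual exclusions) can be sketched as follows. The function name, the toy word list, and the toy frequencies are hypothetical; the WordNet extraction and POS tagging steps are assumed to have already produced the input list.

```python
def build_alternative_set(words, frequencies, max_size=1000, excluded=()):
    """Build a candidate alternative set for one part of speech."""
    # 1. Keep only words attested in the frequency corpus.
    attested = [w for w in words if w in frequencies]
    # 2. Sort by descending frequency and keep the top `max_size`.
    ranked = sorted(attested, key=lambda w: frequencies[w], reverse=True)
    top = ranked[:max_size]
    # 3. Apply manual exclusions after the frequency cutoff.
    excluded = set(excluded)
    return [w for w in top if w not in excluded]


# Toy inputs for illustration only.
toy_verbs = ["be", "do", "like", "love", "adore", "zzyzx"]
toy_freqs = {"be": 900, "do": 800, "like": 500, "love": 300, "adore": 20}
alternatives = build_alternative_set(
    toy_verbs, toy_freqs, max_size=4, excluded=("do", "be")
)
```

Applying exclusions after the cutoff matches the order of steps in the text, which is why a set can end up smaller than the cutoff (e.g., 960 adverbs).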

String-based analyses
Figure 3a shows our results for cross-scale variation, under a string-based view of alternatives. We find that surprisal is a significant predictor only for Ronai and Xiang's dataset (Pearson ρ = −0.361, p = 0.006). We repeated this analysis after removing an outlier from Gotzner et al.'s dataset, and again found a lack of relationship between SI rate and surprisal (ρ = −0.0452, p = 0.719).
Table 3: Scalar construction templates by part of speech.
POS  | # unique | Form of original sentence | Form of scalar construction | Example
Adj  | 120      | [...]                     | [...]                       | The elephant is big, but not enormous
Adv  | 12       | [...]                     | [...]                       | The director is sometimes late, but not always
Verb | 16       | [...]                     | [...]                       | [...]
[...] to ensure broad coverage over potential scalemates.
Model surprisal vs. human completions. For the dataset where we do find a relationship between surprisal and SI rates, we ask whether model surprisals are correlated with human-derived measurements of how "accessible" the strong scalemate is. If model surprisals and human accessibility scores are strongly linked, this would suggest that models and humans are aligned at the level of predictive distributions over alternatives, validating our approach of using language models to approximate human predictions.
To this end, we use data from Ronai and Xiang's Experiment 2, which measured the accessibility of scalemates through a Cloze task. Humans were presented with a short dialogue featuring a sentence with the weak scalemate, as in (3), and then asked to generate a completion of the dialogue in the blank. The "accessibility" of the strong scalemate is taken to be the frequency with which it is generated in this paradigm.
(3) Sue: The movie is good.
    Mary: So you mean it's not ___.
We find that model surprisals are negatively correlated with accessibility scores (Figure 4; ρ = −0.357, p = 0.006), suggesting that our method of estimating expectations over alternatives using artificial language models aligns with direct measurements in humans.
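The comparison above can be sketched in two steps: compute an accessibility score as the share of Cloze completions that match the strong scalemate, then correlate model surprisals with those scores. All names and numbers below are hypothetical, and matching completions by lowercased exact string is an assumption about how responses are normalized.

```python
import math
from collections import Counter


def accessibility(completions, strong):
    """Proportion of Cloze completions equal to the strong scalemate."""
    counts = Counter(c.strip().lower() for c in completions)
    return counts[strong.lower()] / len(completions)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up completions for "So you mean it's not ___."
toy_completions = ["great", "amazing", "Amazing", "excellent"]
acc = accessibility(toy_completions, "amazing")

# Made-up per-scale values: higher surprisal paired with lower accessibility.
toy_surprisals = [2.0, 4.0, 6.0]
toy_accessibility = [0.6, 0.4, 0.2]
r = pearson(toy_surprisals, toy_accessibility)
```

A negative coefficient, as in this toy case, is the direction expected if less surprising scalemates are also the ones humans produce more often.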

Concept-based analyses
Turning to a conceptual view of alternatives, Figure 3b shows the relationship between human SI rates and weighted average surprisals (Equation 2). We find a significant negative correlation for all but one of the tested datasets (Ronai and Xiang: ρ = −0.400, p = 0.002; Pankratz and van Tiel: ρ = −0.342, p = 0.015; Gotzner et al.: ρ = −0.415, p = 0.0005; van Tiel et al.: ρ = −0.167, p = 0.310), demonstrating that similarity-weighted surprisal captures more variation than raw surprisal (cf. Figure 3a; Section 5.4.1). We additionally included both (centered) string-based and concept-based surprisal as predictors in a multivariate model, summarized in Table 4 (middle columns). As in the within-scale analysis, for three of the four datasets we find that concept-based surprisal is a stronger predictor than string-based surprisal. With that said, we find only a marginal effect of concept-based surprisal in Ronai and Xiang's data, and no effect of either predictor in van Tiel et al.'s data. However, for Ronai and Xiang's data, this does not mean that there is no value in either predictor; rather, the predictors are too closely correlated to definitively favor one over the other. To demonstrate this, for each dataset we performed an analysis of variance (ANOVA) comparing the full model to a null intercept-only model (Table 4, right columns). We find that for all datasets except that of van Tiel et al., the model with both surprisal predictors explains significantly more variance than the null model. In sum, our results suggest that the expectedness of the strong scalemate can capture significant cross-scale SI variation, but these expectations may operate over groups of semantically similar linguistic forms instead of individual strings.
Qualitative analysis. As a follow-up analysis, we identified cases where GPT-2 assigns low probability to the tested strong scalemate, but high probability to near synonyms. We analyzed the top 5 alternatives from the full alternative set (Section 5.3) that were assigned highest probability as strong scalemates under GPT-2. Figure 5 shows three examples from Ronai and Xiang's dataset. The title of each subplot shows the scalar construction, with the weak scalemate highlighted in teal and the tested strong scalemate underlined in red. The y-axis shows the top 5 candidate scalemates, and the x-axis shows the probability assigned by the model. For the weak scalemate big (left), GPT-2 assigns highest probability to the alternative huge, which semantically conveys similar information to the empirically tested alternative enormous. We see a similar pattern for weak scalemate largely and alternatives completely and totally (middle), as well as for weak scalemate hard and alternative impossible (right). This is consistent with the hypothesis that surprisal of a specific string may not capture surprisal of the underlying concept.
Taken together, these analyses suggest that a concept-based view of alternatives is better aligned with human inferences than treating alternatives as specific linguistic forms. Testing additional ways of operationalizing concept-based alternatives is a promising direction for future work.
Figure 5: Probability assigned by GPT-2 to top 5 candidate strong alternatives (y-axis) for 3 example weak scalar items: big, largely, and hard (Ronai and Xiang, 2022). The full scalar construction is shown above each subplot, with the original tested strong scalemate underlined in red. Subplot titles: "The elephant is big, but not enormous"; "The coast is largely flooded, but not totally"; "The problem is hard, but not unsolvable".

Related work
Prior work has evaluated the ability of computational models to capture scalar inferences. For example, the IMPPRES benchmark (Jeretic et al., 2020) frames SI as a natural language inference problem: the weak scalar expression (e.g., "Jo ate some of the cake") is the premise, and the negated strong scalar expression (e.g., "Jo didn't eat all of the cake") is the hypothesis. Under this setup, an interpretation consistent with the strictly logical reading would assign a neutral relationship between the premise and hypothesis, whereas a pragmatic reading would assign an entailment relationship. Models are evaluated based on how often they assign the entailment label across items, which treats SIs as a homogeneous phenomenon and does not capture SI variation. Another line of work has attempted to predict within-scale SI variation through a supervised approach (Schuster et al., 2020; Li et al., 2021). This approach takes a sentence with a weak scalar item, and attempts to directly predict the human SI strength through a prediction head on top of a sentence encoder. This differs from our approach in that it requires training directly on the SI-rate-prediction task, whereas we probe the predictive distribution that emerges from language modeling with no task-specific representations. This allows us to compare model probability distributions to the expectations deployed by humans during pragmatic inferences, building upon a literature linking language models to predictive processing (e.g., Frank and Bod, 2011; Smith and Levy, 2013; Wilcox et al., 2020; Merkx and Frank, 2021).
There have also been several studies extracting scalar orderings from corpora or language model representations. For example, de Marneffe et al. (2010) use distributional information from a web corpus to ground the meanings of adjectives for an indirect question answering task. Similarly, Shivade et al. (2015) use scalar constructions like "X, but not Y" to identify scales from a corpus of biomedical texts. Others have found that adjectival scale orderings can be derived from static word embeddings (Kim and de Marneffe, 2013) and contextualized word representations (Garí Soler and Apidianaki, 2020, 2021).

Discussion
We tested a shared mechanism explaining variation in SI rates across scales and within ⟨some, all⟩, based on the hypothesis that humans maintain context-driven expectations about unspoken alternatives (Degen and Tanenhaus, 2015, 2016). We operationalized this in two ways using neural language models: the expectedness of a linguistic alternative as a scalemate (string-based surprisal), and the expectedness of a conceptual alternative (weighted average surprisal). We found that for both within-scale and cross-scale variation, expectedness captures human SI rates. Crucially, however, expectedness of the strong scalemate is a robust predictor of cross-scale variation only under a conceptual view of alternatives (Buccola et al., 2021). Our results support the idea that the strength of pragmatic inferences depends on the availability of alternatives, which depends on in-context predictability.
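To make the two predictors concrete, the following is a minimal Python sketch of string-based surprisal and a weighted average surprisal over candidate alternatives. It is an illustration under stated assumptions rather than the paper's implementation: the probabilities are made-up values instead of language model outputs, surprisal is taken in bits, and the weights are assumed to encode each candidate's similarity to the tested strong scalemate (the exact weighting scheme is defined in sections not reproduced here).

```python
import math


def surprisal(prob: float) -> float:
    """Surprisal in bits (-log2 p) of an alternative with in-context probability prob."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)


def weighted_average_surprisal(probs: dict[str, float],
                               weights: dict[str, float]) -> float:
    """Average surprisal over candidate alternatives.

    weights[a] is a hypothetical non-negative weight for candidate a,
    assumed here to reflect similarity to the tested strong scalemate.
    """
    total_weight = sum(weights[a] for a in probs)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(weights[a] * surprisal(p) for a, p in probs.items())
    return weighted_sum / total_weight


# Hypothetical probabilities for continuations of "The soup is warm, but not ___".
toy_probs = {"hot": 0.25, "boiling": 0.125, "scalding": 0.0625}
# Hypothetical similarity weights relative to the tested strong scalemate "hot".
toy_weights = {"hot": 1.0, "boiling": 0.5, "scalding": 0.5}

string_based = surprisal(toy_probs["hot"])                            # 2.0 bits
concept_based = weighted_average_surprisal(toy_probs, toy_weights)    # 2.75 bits
```

In this toy case the string-based predictor looks only at "hot", while the concept-based predictor also draws on near-synonymous candidates, which is the distinction the two operationalizations are meant to capture.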
One open question is the source of variability across the tested human behavioral datasets, in particular the lack of a surprisal effect for van Tiel et al.'s data (Section 5.4). While we cannot be certain about why the results vary, we identified a few differences that might affect data quality across datasets (see Table 1). van Tiel et al.'s study has the smallest number of participants (28), smallest number of ratings per scale (10), and smallest number of scales (39). In addition, their experiments presented multiple sentence contexts per scale, whereas the other experiments only presented one sentence per scale. Other experimental factors, such as participant recruitment and exclusion criteria, may have also contributed to differences in data reliability.

How do listeners restrict the alternatives?
We now return to the issue raised in Footnote 2: what information do listeners use to form expectations about alternatives? To illustrate potential hypotheses, consider the item "The soup is warm/hot" from van Tiel et al.'s experimental materials. In our framework described in Section 3.1, [CONTEXT] = "The soup is", [WEAK] = "warm", and [STRONG] = "hot". One hypothesis is that listeners form expectations over relevant scalar expressions given [CONTEXT] alone. On this view, expectations over strong scalemates could be measured by computing the probability of [STRONG] in the template [CONTEXT][STRONG]; i.e., "The soup is [STRONG]". In contrast, in this paper we test expectations of [STRONG] in the template "The soup is warm, but not [STRONG]", which instantiates an alternate theoretical position: that listeners use not only the context, but also [WEAK] as information for forming expectations over alternatives.
We adopt this view for several reasons. First, it could be the case that the context does not provide enough information for the listener to narrow down alternatives. Returning to the running example, "The soup is" could be followed by many continuations, some potentially relating to the taste or size of the soup in addition to its temperature. Taking the weak scalar term "warm" into account allows the listener to restrict the relevant alternatives to a smaller, more tractable set, which presents an algorithmic solution to the computationally challenging inference problem. However, the underinformativity of the context may be a problem unique to the simple stimuli used in the behavioral experiments. It is plausible that listeners could sufficiently restrict alternative sets given more naturalistic contexts, which likely provide more cues to the Question Under Discussion (Roberts, 2012).
In addition, there could be cues from [WEAK] that provide information about likely alternatives, independent of the context. For example, listeners might prefer strong scalemates that match [WEAK] in register or formality, or in shared phonological features. This motivates why we chose template (1) to measure expectations over alternatives, instead of [CONTEXT][STRONG]. However, the extent to which listeners tune their predictions based on [WEAK] above and beyond the context remains an open empirical question.
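The contrast between the two hypotheses can be sketched as two ways of building the string that a language model scores. The Python snippet below is a minimal illustration using the running example; the function names are hypothetical, and it covers only the adjective-style construction shown above, not the part-of-speech-specific templates listed in Table 3.

```python
def context_only_template(context: str, strong: str) -> str:
    """[CONTEXT][STRONG]: expectations formed from the context alone."""
    return f"{context} {strong}"


def scalar_contrast_template(context: str, weak: str, strong: str) -> str:
    """"[CONTEXT] [WEAK], but not [STRONG]": expectations also conditioned on [WEAK]."""
    return f"{context} {weak}, but not {strong}"


def scalar_contrast_prefix(context: str, weak: str) -> str:
    """Prefix after which the distribution over [STRONG] would be read off."""
    return f"{context} {weak}, but not"


context, weak, strong = "The soup is", "warm", "hot"
print(context_only_template(context, strong))           # The soup is hot
print(scalar_contrast_template(context, weak, strong))  # The soup is warm, but not hot
print(scalar_contrast_prefix(context, weak))            # The soup is warm, but not
```

Under the first template a model could just as easily continue with a taste or size term, whereas the second prefix already commits the continuation to a contrast with "warm".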

From alternatives to inference
Conceptually, computing an SI involves two steps: (1) determining the suitable alternatives, and (2) ruling out the meaning of alternatives to arrive at a strengthened interpretation of the weak scalar term. Our results primarily shed light on the first step, providing evidence that expectations play a role in determining alternatives, and that alternatives are likely based on meanings in addition to linguistic forms.
When considering the higher-level reasoning process, many factors beyond alternatives play a causal role in SI. One view is that humans use alternatives in a cooperative reasoning process, such as that formalized by the Rational Speech Act framework (RSA; Frank and Goodman, 2012; Goodman and Frank, 2016). In an RSA model, a pragmatic listener L_1(m | u) uses a speaker's utterance u to update their prior beliefs P(m) over which meaning m the speaker is trying to convey. The listener does this by computing the likelihood of a pragmatic speaker S_1 producing u given each potential meaning. The pragmatic speaker S_1 corresponds to the utility U of the utterance u to convey m, relative to the utility of the alternative utterances in the set of alternatives A.
Our findings appear compatible with RSA: listeners reason about a speaker that normalizes over alternatives. However, it remains an open question how variable expectations over alternatives should be operationalized in an RSA model. One option, as recently proposed by Zhang et al. (2023), is that the pragmatic speaker is conditioned on the alternative set A. The pragmatic listener has beliefs over different sets of A and marginalizes over these beliefs when drawing an inference.
Another possibility is that the variable expectations are not inputs to the model, but instead fall out of reasoning about how likely speakers are to use the weaker versus stronger terms, given variable contextual priors over meanings and questions under discussion (see, e.g., Goodman and Lassiter, 2015; Qing et al., 2016). We leave a detailed exploration of such a model to future work.
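As a rough illustration of this reasoning, the following Python sketch implements a toy RSA model for the ⟨some, all⟩ scale. It is not the model from this paper or from Zhang et al. (2023). It assumes the standard RSA form, in which S_1(u | m, A) is proportional to exp(α · U(u, m)) normalized over the utterances in A, takes U(u, m) to be the log probability that a literal listener assigns to m after hearing u, and lets the listener compute L_1(m | u) ∝ P(m) Σ_A P(A) S_1(u | m, A). The two-state world, the uniform prior, α = 1, and the beliefs over alternative sets are all illustrative assumptions.

```python
MEANINGS = ("some_not_all", "all")

# Literal truth conditions: "some" is true in both states, "all" only in the "all" state.
TRUTH = {
    "some": {"some_not_all": 1.0, "all": 1.0},
    "all": {"some_not_all": 0.0, "all": 1.0},
}


def literal_listener(u, prior):
    """L_0(m | u): prior restricted to meanings where u is literally true."""
    scores = {m: TRUTH[u][m] * prior[m] for m in MEANINGS}
    z = sum(scores.values())
    return {m: s / z for m, s in scores.items()}


def speaker(u, m, alternatives, prior, alpha=1.0):
    """S_1(u | m, A): softmax of utility, normalized over the alternatives in A.

    With U(u, m) = log L_0(m | u), exp(alpha * U) equals L_0(m | u) ** alpha.
    """
    def score(v):
        return literal_listener(v, prior)[m] ** alpha

    return score(u) / sum(score(v) for v in alternatives)


def pragmatic_listener(u, prior, alt_set_beliefs, alpha=1.0):
    """L_1(m | u): marginalizes the speaker over beliefs P(A) about the alternative set.

    Assumes every alternative set in alt_set_beliefs contains u.
    """
    scores = {}
    for m in MEANINGS:
        marginal = sum(p_a * speaker(u, m, alts, prior, alpha)
                       for alts, p_a in alt_set_beliefs.items())
        scores[m] = prior[m] * marginal
    z = sum(scores.values())
    return {m: s / z for m, s in scores.items()}


prior = {"some_not_all": 0.5, "all": 0.5}
# Listener is certain that "all" was an available alternative.
certain = pragmatic_listener("some", prior, {("some", "all"): 1.0})
# Listener is unsure whether "all" was among the alternatives.
uncertain = pragmatic_listener("some", prior, {("some", "all"): 0.5, ("some",): 0.5})
```

Under these assumptions, the posterior on the strengthened "some but not all" meaning is 0.75 when the alternative set containing "all" is certain and 0.6 when that set is only half expected, so a less expected strong alternative yields a weaker inference.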
The role of priors. Pragmatic inferences are influenced by the prior probabilities of the world states compatible with the weak and strong meanings (Degen et al., 2015; Sikos et al., 2021). For example, consider the scale ⟨start, finish⟩. If a human were asked "The movie started at 2:30. Would you conclude that the movie did not finish at 2:30?", they would likely answer Yes. This Yes response would count as an SI under the experimental paradigm, but does not reflect pragmatic reasoning over scalar alternatives: it is simply implausible for a movie to start and finish at the same time, given our knowledge of the world.10
These priors have an important connection to our analyses. As outlined in Section 3.1, we approximate the expectedness of a strong scalemate by measuring the expectedness of its linguistic form. This approach can be seen as reflecting an implicit assumption that the more likely a certain meaning is, the more likely it is to be expressed linguistically. This is likely to be wrong in certain cases. For example, if a certain meaning is so likely that it is obvious without being said, then speakers may avoid the effort of explicitly producing the linguistic expression (and thus, the linguistic expression would have low probability). This could potentially be the case for relatively common SIs. For example, a speaker might be able to get away with only saying some and expecting a listener to recover the meaning some but not all.
With that said, we believe our estimation method may minimize this issue, as we measure expectations conditioned on an explicit scalar contrast with the weak scalemate (i.e., "[WEAK], but not"). Thus, our approach can be seen as approximating listeners' expectations about upcoming linguistic material, given that the speaker has already chosen to produce a scalar contrast. Nevertheless, a complete account of scalar inferences will need to account for the influence of the prior probabilities over world states, which may explain some of the variance not captured by our expectedness predictors.
10 This example is due to Lassiter (2022).

Implications for NLP
While the main role of language models in our analyses was to systematically test a cognitive theory, we believe this work also has implications for NLP evaluation. A growing body of work uses controlled assessments to evaluate the linguistic knowledge of NLP models. Many studies test whether models exhibit a categorical pattern of behavior that reflects a particular linguistic generalization. For example, in syntactic evaluations, a model is successful if it satisfies certain inequality relationships between grammatical and ungrammatical sentences (e.g., Linzen et al., 2016; Futrell et al., 2019; Hu et al., 2020). SI (and other types of implicatures) have largely been treated the same way (see Section 6).
In contrast, we do not evaluate whether language models exhibit a categorical pattern of behavior ("Do models interpret SIs pragmatically?"). Instead, based on the empirical evidence for scalar variation, we test whether models capture systematic variability in human inferences ("Are models sensitive to the factors that modulate human pragmatic inferences?"). We urge other NLP researchers to consider variability in human behaviors instead of relying on categorical generalizations (see also Pavlick and Kwiatkowski, 2019; Jiang and Marneffe, 2022; Baan et al., 2022; Webson et al., 2023). Through this approach, we can build models that capture the rich variability of human language, and use these models to refine our theories about the human mind.
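The difference between the two evaluation styles can be sketched in a few lines of Python. The snippet below is only an illustration: the function names are hypothetical, the input values are made up rather than drawn from any dataset, and the Pearson correlation stands in for the fuller regression analyses reported in the paper.

```python
import math


def categorical_score(predicted_labels, target="entailment"):
    """Categorical evaluation: fraction of items given the pragmatic label."""
    return sum(label == target for label in predicted_labels) / len(predicted_labels)


def pearson_r(xs, ys):
    """Gradient evaluation: correlation between a model predictor and human rates."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Hypothetical per-item NLI labels from a model (categorical view).
toy_labels = ["entailment", "neutral", "entailment", "entailment"]
# Hypothetical per-scale surprisal values and human SI rates (gradient view).
toy_surprisal = [1.0, 2.0, 3.0, 4.0]
toy_si_rate = [0.8, 0.6, 0.4, 0.2]

accuracy = categorical_score(toy_labels)    # single number, ignores item-level variation
r = pearson_r(toy_surprisal, toy_si_rate)   # asks whether the predictor tracks variation
```

The categorical score collapses all items into one proportion, whereas the correlation asks whether items with a more expected strong alternative (lower surprisal) are also the ones where humans draw the inference more often.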

Figure 1: (a) Distribution of human scalar inference (SI) ratings (on a scale of 1-7) across instances of the ⟨some, all⟩ scale (reproduction of Fig. 1, Degen 2015). (b) Average SI rates across scales formed by different lexical items (reproduction of Fig. 2, van Tiel et al. 2016).
Figure 2 shows the relationship between our predictors and human SI ratings for Degen's (2015) dataset of variation within ⟨some, all⟩. We find that both string-based and concept-based surprisal are indeed negatively correlated with human similarity judgments (string-based: Figure 2a, Pearson ρ = −0.400, p < 0.0001; concept-based: Figure 2b, ρ = −0.432, p < 0.0001).4 We additionally conducted a multivariate analysis including our two new predictors (string- and concept-based surprisal) among the predictors investigated in Degen's original study. We centered and transformed all variables according to Degen's original analyses. The results are summarized in Table 2. We find that the original predictors

Figure 2: Relationship between human SI strength ratings within the ⟨some, all⟩ scale (Degen, 2015) and BERT-derived predictors: (a) surprisal of scalemate all in the scalar construction, and (b) weighted average surprisal over the full set of candidate alternatives (Section 4.3). Each point represents a sentence. Shaded region denotes 95% CI.

Figure 3: Relationship between human SI rates and GPT-2-derived predictors across scales, for four datasets. Each point represents a single scale. Shaded region denotes 95% CI. (a) SI rate vs. surprisal of strong scalemate in the scalar construction. (b) SI rate vs. weighted average surprisal over the full set of candidate alternatives (Section 5.3).

Table 1: Details of human data used in our analyses. An item is a unique (scale, context) combination.

Table 3: Scalar construction templates for different parts of speech (for cross-scale variation).

Table 4: Summary of full regression model (middle columns) and ANOVA comparing full model against intercept-only model (right columns) for each cross-scale variation dataset.