A Pragmatic Account of the Weak Evidence Effect

Language is not only used to transmit neutral information; we often seek to persuade by arguing in favor of a particular view. Persuasion raises a number of challenges for classical accounts of belief updating, as information cannot be taken at face value. How should listeners account for a speaker’s “hidden agenda” when incorporating new information? Here, we extend recent probabilistic models of recursive social reasoning to allow for persuasive goals and show that our model provides a pragmatic account for why weakly favorable arguments may backfire, a phenomenon known as the weak evidence effect. Critically, this model predicts a systematic relationship between belief updates and expectations about the information source: weak evidence should only backfire when speakers are expected to act under persuasive goals and prefer the strongest evidence. We introduce a simple experimental paradigm called the Stick Contest to measure the extent to which the weak evidence effect depends on speaker expectations, and show that a pragmatic listener model accounts for the empirical data better than alternative models. Our findings suggest further avenues for rational models of social reasoning to illuminate classical decision-making phenomena.

were not revealed, allowing us to impute a "generative" average across the two observed values and the three guessed values (Table S1). We say a participant passed the 2AFC check if their binary verdict ('longer' vs. 'shorter') is consistent with the direction of their point estimate. We say a participant also passed the stricter "generative" check if the average imputed from their guesses for the three remaining unobserved sticks matches their 2AFC and point estimates. Pass rates for these stricter checks were somewhat lower for participants who expected speakers not to show the strongest evidence first (97% vs. 96% and 89% vs. 86%, respectively), though neither difference was significant, χ2(1) = 0.76, p = 0.38 and χ2(1) = 1.05, p = 0.31, respectively. Rates were far above chance in all groups. To ensure robustness, we re-ran our primary analyses on the subset of participants who passed the strictest conjunction of all checks, which is highly improbable under an inattentive null model, and obtained nearly identical results.
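The logic of the two attention checks can be sketched as follows. This is a minimal illustration, assuming the comparison point is the prior mean of 5 inches (sticks are uniform on 1-9 inches); the function names are ours, not the analysis code's:

```python
# Sketch of the 2AFC consistency check and the stricter "generative"
# check described above. The 5-inch reference point is our assumption
# (the prior mean when sticks are uniform on 1-9 inches).
REFERENCE_MEAN = 5.0

def passes_2afc_check(verdict: str, point_estimate: float) -> bool:
    """Pass if the binary verdict ('longer' vs. 'shorter') agrees with
    the direction of the point estimate."""
    if verdict == "longer":
        return point_estimate > REFERENCE_MEAN
    if verdict == "shorter":
        return point_estimate < REFERENCE_MEAN
    return False

def passes_generative_check(verdict: str, observed: list, guessed: list) -> bool:
    """Stricter check: the average imputed from the two observed sticks
    and the three guessed sticks must match the 2AFC verdict."""
    values = observed + guessed
    imputed_mean = sum(values) / len(values)
    direction = "longer" if imputed_mean > REFERENCE_MEAN else "shorter"
    return direction == verdict
```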
Figure S1: Participants revised their beliefs after obtaining a second piece of evidence. Each facet represents participants who were given the same initial piece of evidence (blue dots), with each arrow connecting their judgment after the first piece of evidence to their judgment after the second. In most cases, participants revised their estimates down, although participants who showed a weak evidence effect for the first stick (top row) also displayed a classical weak evidence effect on the second piece of evidence (e.g., in the second row, participants who saw a 7" stick on the first trial were slightly more confident the average was longer after seeing a 4" stick).

Figure S2: We found strong order effects, with the belief judgment elicited after the second stick apparently affected by a recency bias. Under perfect averaging, the diagonal would leave the judge with complete uncertainty (denoted on our color scale by white), since the evidence from the longer side (blue) and the shorter side (red) should cancel out.

APPENDIX B: ORDER EFFECTS
While we focus on the first piece of evidence as the clearest test of the weak evidence effect, we also collected a second belief judgment after a second piece of evidence (Figures S1 and S2).

Proof. We begin by substituting the combined utility (Eq. 6) into the speaker soft-max. Using Eq. 3 to expand the first term, note that the epistemic utility depends only on N, the number of sticks in the true set (N = 5 in our experiment). Because we already assume that the set of possible utterances U consists of the true sticks in the underlying set (i.e., the contestants cannot make up sticks; they must choose one of the N sticks in the set), every utterance has exactly the same epistemic utility U_epi. Because this term is constant across utterances, it drops out of the soft-max, yielding Eq. 8.
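The key step of the proof above, that a utility term shared by all utterances cancels out of the soft-max, can be verified numerically. This is only an illustration: the persuasive utilities below are arbitrary values, not the model's fitted quantities.

```python
import math

def softmax(utilities, lam=1.0):
    # Speaker chooses utterances with probability proportional to
    # exp(lam * utility).
    exps = [math.exp(lam * u) for u in utilities]
    z = sum(exps)
    return [e / z for e in exps]

# Illustrative persuasive utilities for the N = 5 true sticks.
persuasive = [0.1, 0.3, 0.5, 0.7, 0.9]

# Adding the same epistemic utility U_epi to every utterance leaves the
# choice distribution unchanged, which is why the constant term drops
# out of the soft-max.
u_epi = -math.log(5)  # the same for all N = 5 truthful utterances
with_epi = softmax([p + u_epi for p in persuasive])
without_epi = softmax(persuasive)
```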

Theorem 2. Persuasiveness monotonically increases as a function of stick length.
Proof. We say an utterance u is more persuasive than an utterance u′ when it induces a larger listener posterior on the speaker's goal. Under the Stick Contest, let L = {l_1, . . . , l_N} be an ordered set of N stick lengths, such that l_i ≤ l_j for any indices i < j. We denote the mean stick length by l̄ = (1/N) Σ_i l_i. Without loss of generality, let the speaker's persuasive goal be w* = shorter, i.e., l̄ < 5 (the argument follows analogously for longer). Take two utterances u = l_i and u′ = l_j such that l_i ≤ l_j (i.e., such that u is just as short or shorter than u′). Expanding the utility, let X be a random variable representing the sum of the N − 1 still-unknown sticks, X = Σ_{j≠i} l_j. The probability that the mean is shorter than 5 given the revealed stick l_i is then P(X < 5N − l_i), which we recognize as the cumulative distribution function (CDF), F_X(5N − l_i). Because the underlying set of sticks L is assumed to be i.i.d., the random variable X does not depend on the original choice of i. Critically, we know that the cumulative distribution function is monotonically non-decreasing, so l_i ≤ l_j implies F_X(5N − l_i) ≥ F_X(5N − l_j): the shorter utterance is at least as persuasive, as claimed.

The results reported in the main text are based on a pre-registered replication we conducted during the revision of the manuscript (May 2022). In this appendix, we report the corresponding results from our original sample (February 2020). The only methodological difference between the original study and the internal replication was the way we counter-balanced the order of the "long"- vs. "short"-biased contestants. In our original study, the "long"-biased contestant always presented their evidence first; in our replication, the order of the contestants was randomized. Additionally, in our replication, we added the following clarification to the instructions: "Sticks ranging in length from 1 to 9 inches are equally likely to appear in the set." Participants in the initial sample were recruited on the Prolific platform.

Our regression model was the same as in the main text, except we did not include a fixed effect of "long" vs. "short": all participants were shown evidence from the "long"-biased speaker. As in the study reported in the main text, we found a significant interaction between speaker expectations and evidence.

We used the following priors for our Bayesian data analysis, where p_z is the mixture weight used for heterogeneous models and µ = P_{J_i}(longer | u) ∈ [0, 1] is the RSA model's predicted judgment. Our findings are robust to whether we collapse participants who expected the strongest evidence with those who expected less strong evidence.
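As a numerical sanity check of Theorem 2's monotonicity claim, the following Monte Carlo sketch assumes sticks are i.i.d. uniform on 1-9 inches (as in the instruction quoted above) and estimates persuasiveness for the goal "shorter"; the trial count and seed are arbitrary choices:

```python
import random

random.seed(0)

def persuasiveness_shorter(revealed, n_sticks=5, trials=20_000):
    """Monte Carlo estimate of P(mean < 5 | revealed stick), with the
    remaining n_sticks - 1 sticks drawn i.i.d. uniform from 1-9 inches."""
    hits = 0
    for _ in range(trials):
        rest = sum(random.randint(1, 9) for _ in range(n_sticks - 1))
        if (revealed + rest) / n_sticks < 5:
            hits += 1
    return hits / trials

# Shorter revealed sticks are (weakly) more persuasive for the goal
# "the average is shorter than 5", matching the CDF argument.
estimates = [persuasiveness_shorter(l) for l in range(1, 10)]
```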

Belief-adjustment models

In the notation of McKenzie, Lee, and Chen (2002), Eq. 9 is written C_k = C_{k−1} + w_k · (s(e_k) − R_k), where C_k ∈ [0, 1] is the degree of belief in a particular claim after being presented with evidence e_k, w_k is an adjustment weight, s(e_k) is the subjective strength of the evidence, and the reference point R_k depends on the evidence previously presented. We can therefore rewrite Eq. S1 as C_k = C_{k−1} + w_k · (s(e_k) − R(m_k | e_1, . . . , e_{k−1})).
To fit this class of models to our data, we follow Trueblood and Busemeyer (2011) in assuming a mapping between stick length and evidence strength given by a centered logistic function, where the logistic growth rate B is fit to the data (we used a uniform prior B ∼ Unif[0, 10]). This class includes both an adding variant and an averaging variant. The averaging variant, in which evidence is encoded in relationship to the current belief in the hypothesis, is better suited to estimation tasks involving some kind of moving average (Hogarth & Einhorn, 1992), whereas the Stick Contest is better described as an evaluation task in which a single hypothesis is under consideration ("is the sample long?"). We also found empirically that the adding variant provided a better fit to the data than the averaging variant. We fit each variant's parameters, as well as a heterogeneous model, in which we assume a priori that participants are a convex combination of the two models. As in the RSA models, we infer the mixture weight p_z that best explains the population-level mixture (marginalizing over latent variable assignments z).
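A minimal sketch of the adding variant with a logistic strength mapping may help. We assume here that the logistic is centered at the 5-inch prior mean and scaled to (−1, 1), and the weight w and rate B are placeholder values, not fitted parameters:

```python
import math

def evidence_strength(stick_length, B=1.0, midpoint=5.0):
    """Centered logistic mapping from stick length to signed evidence
    strength in (-1, 1). B is the growth rate (fit to data in the paper);
    centering at the 5-inch prior mean is our assumption here."""
    return 2.0 / (1.0 + math.exp(-B * (stick_length - midpoint))) - 1.0

def adding_update(belief, stick_length, w=0.5, B=1.0):
    """One step of the adding variant: the reference point is a fixed
    neutral value (0) rather than the current belief, as it would be in
    the averaging variant. Beliefs are clipped to [0, 1]."""
    return min(1.0, max(0.0, belief + w * evidence_strength(stick_length, B)))

belief = 0.5                       # start uncertain: "is the sample long?"
belief = adding_update(belief, 7)  # a 7" stick pushes the belief up
belief = adding_update(belief, 4)  # a 4" stick pulls it back down
```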

Higher levels of reasoning and the strong evidence effect

While our cover story explicitly provided participants with the speakers' motivations, in terms of their financial incentives, these motivations are less obvious in most real-world scenarios: they must be inferred from what the speaker is saying. This is straightforwardly derived in our framework by allowing the listener to jointly infer the true state of the world w and the speaker's bias β. Our formulation raises a natural question about how speakers would behave if they were aware that judges were making such inferences. This emerges at the next level of recursive reasoning, where C(u) represents some cost associated with being perceived as biased by the judge, and w_c ≥ 0 is a parameter specifying the degree of the cost. We included a J_2 model that reasons about this more sophisticated speaker.

Table S2: Participants were presented with a free-response text field to explain their reasoning at the end of both phases. Here we provide sample responses from the end of the speaker phase, from both participants who expected the strongest evidence and those who expected less strong evidence.

Table S3: Sample responses from the end of the judge phase, from both participants who expected the strongest evidence and those who expected less strong evidence.
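Returning to the joint inference over the world state w and speaker bias β described above, a small grid sketch can make the idea concrete. Everything here (the soft-max speaker, the bias values {0, 1}, the rationality parameter lam) is an illustrative assumption, not the paper's fitted model:

```python
import math
from itertools import product

STICKS = range(1, 10)  # possible stick lengths, 1-9 inches

def speaker_prob(u, w, beta, lam=2.0):
    """Soft-max speaker: when beta = 0 the speaker is unbiased and picks
    sticks uniformly; when beta = 1 they prefer sticks supporting w."""
    def util(stick):
        support = (stick - 5) if w == "longer" else (5 - stick)
        return beta * support
    z = sum(math.exp(lam * util(s)) for s in STICKS)
    return math.exp(lam * util(u)) / z

def joint_posterior(u):
    """P(w, beta | u) under uniform priors over both latent variables."""
    hyps = list(product(["longer", "shorter"], [0.0, 1.0]))
    scores = {h: speaker_prob(u, *h) for h in hyps}
    z = sum(scores.values())
    return {h: s / z for h, s in scores.items()}

post = joint_posterior(9)  # an extreme stick implicates a biased speaker
```

Under these illustrative settings, hearing the most extreme stick shifts posterior mass toward the hypothesis that the speaker is biased, which is the qualitative signature of the joint inference.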