## Abstract

Over the past two decades, numerous studies have demonstrated how less-predictable (i.e., higher surprisal) words take more time to read. In general, these studies have implicitly assumed the reading process is purely *responsive*: Readers observe a new word and allocate time to process it as required. We argue that prior results are also compatible with a reading process that is at least partially *anticipatory*: Readers could make predictions about a future word and allocate time to process it based on their expectation. In this work, we operationalize this anticipation as a word’s contextual entropy. We assess the effect of anticipation on reading by comparing how well surprisal and contextual entropy predict reading times on four naturalistic reading datasets: two self-paced and two eye-tracking. Experimentally, across datasets and analyses, we find substantial evidence for effects of contextual entropy over surprisal on a word’s reading time (RT): In fact, entropy is sometimes better than surprisal in predicting a word’s RT. Spillover effects, however, are generally not captured by entropy, but only by surprisal. Further, we hypothesize four cognitive mechanisms through which contextual entropy could impact RTs—three of which we are able to design experiments to analyze. Overall, our results support a view of reading that is not just responsive, but also anticipatory.^{1}

## 1 Introduction

Language comprehension—and, by proxy, the reading process—is assumed to be incremental and dynamic in nature: Readers take in one word at a time, process it, and then move on to the next word (Hale, 2001, 2006; Rayner and Clifton, 2009; Boston et al., 2011). As each word requires a different amount of processing effort, readers must dynamically allocate differing amounts of time to words as they read. Indeed, this effect has been confirmed by a number of studies, which show a word’s reading time is a monotonically increasing function of the word’s length and surprisal (Hale, 2001; Smith and Levy, 2008; Shain, 2019, *inter alia*).

Most prior work (e.g., Levy, 2008; Demberg and Keller, 2008; Fernandez Monsalve et al., 2012; Wilcox et al., 2020), however, focuses on the **responsive** nature of the reading process, i.e., prior work looks solely at how a reader’s behavior is influenced by attributes of words which have already been observed. Such analyses assume that readers dynamically allocate resources to predict future words’ identities in advance, but that the distributions of those predictions do not themselves directly affect reading behavior. However, a closer analysis of RT data shows the above theory might not capture the whole picture. In addition to being responsive, reading behavior may also be **anticipatory**: Readers’ predictions may influence reading behavior for a word regardless of its actual identity.

Theoretically, anticipatory reading behavior may be an adaptive response to oculomotor constraints, as it takes time both to identify a word and to program a motor response to move beyond it. An example of anticipatory behavior is that the eyes often skip over words while reading—a decision that must be made while the word’s identity remains uncertain (Ehrlich and Rayner, 1981; Schotter et al., 2012). We identify four mechanisms that are anticipatory in nature and may impact reading behaviors:

- (i) **word skipping**: readers may completely omit fixating on a word;
- (ii) **budgeting**: readers may pre-allocate RTs for a word before reaching it;
- (iii) **preemptive processing**: readers may start processing a future word based on their expectations (and before knowing its identity);
- (iv) **uncertainty cost**: readers may incur an additional processing load when in high-uncertainty contexts.

In this work, we look beyond responsiveness, investigating anticipatory reading behaviors and the mechanisms above. Specifically, we look at how a reader’s expectation about a word’s surprisal—operationalized as that word’s **contextual entropy**—affects the time taken to read it. For various reasons, however, a reader’s anticipation may not exactly match a word’s expected surprisal value, which would make the contextual entropy a poor operationalization of anticipation. Rather, readers may rely on skewed approximations instead, e.g., anticipating that the next word’s surprisal is simply the surprisal of the most likely next word. We use the Rényi entropy (a generalization of Shannon’s entropy) to operationalize these different skewed expectation strategies. We then design several experiments to investigate the mechanisms above, analyzing the relationship between readers’ expectations about a word’s surprisal and its observed RTs.

We run our analyses on four naturalistic datasets: two self-paced reading and two eye-tracking. In line with prior work, we find a significant effect of a word’s surprisal on its RTs across all datasets, reaffirming the responsive nature of reading. In addition, we find the word’s contextual entropy to be a significant predictor of its RTs in three of the four analyzed datasets—in fact, in two of these, entropy is a more powerful predictor than surprisal; see Table 3. Unlike surprisal, however, entropy does not in general predict spillover effects. We further find that a specific Rényi entropy (with *α* = 1/2) consistently leads to stronger predictors than the Shannon entropy. This finding suggests readers may anticipate a future word’s surprisal to be a function of the number of plausible word-level continuations (as opposed to the actual expected surprisal).

## 2 Predicting Reading Behavior

One behavior of interest in psycholinguistics is reading time (RT) allocation, i.e., how much time a reader spends processing each word in a text. RTs and other eye movement measures, such as word skipping (§ 5.4), are important for psycholinguistics because they offer insights into the mechanisms driving the reading process. Indeed, there exists a vast literature of such analyses (Rayner, 1998; Hale, 2001, 2003, 2016; Keller, 2004; van Schijndel and Linzen, 2018; Shain, 2019, 2021; Shain and Schuler, 2021, 2022; Wilcox et al., 2020; Meister et al., 2021, 2022; Kuribayashi et al., 2021, 2022; Hoover et al., 2022, *inter alia*).^{2}

To study such mechanisms quantitatively, one first chooses a set of variables $\mathbf{x} \in \mathbb{R}^{d}$ which is believed to impact reading—e.g., we could choose $\mathbf{x}_t = [\,|w_t|,\, u(w_t)\,]^{\top}$, where $|w_t|$ is the length of word $w_t$ and $u(w_t)$ is its frequency (quantified as its unigram log-probability). These variables are then used to fit a regressor $f_{\boldsymbol{\phi}}(\mathbf{x})$ of a reading measure $y$:

$$y \approx f_{\boldsymbol{\phi}}(\mathbf{x}) \tag{1}$$

where $\boldsymbol{\phi}$ are learned parameters, and $y$ will be either reading times or word-skipping ratio here. We then evaluate this regressor by looking at its performance, which is typically operationalized as the average log-likelihood assigned by $f_{\boldsymbol{\phi}}(\mathbf{x})$ to held-out data (Goodkind and Bicknell, 2018; Wilcox et al., 2020).

When comparing different theories of the reading process, each may predict a different architecture *f*_{ϕ} or set of variables **x** which should be used in eq. (1). We can then compare these theories by looking at the performance of their associated regressors. Specifically, we take a model that leads to higher log-likelihoods on held out data as evidence in favor of its corresponding theory about underlying cognitive mechanisms. Further, model *f*_{ϕ}(**x**) can then be used to understand the relationship between the employed predictors **x** and RTs.

### 2.1 Responsive Reading

One of the most studied variables in the above paradigm is **surprisal**, which measures a word’s information content. Surprisal theory (Hale, 2001; Levy, 2008) posits that a word’s surprisal should directly impact its processing cost. Intuitively, this makes sense: The higher a word’s information content, the more resources it should take to process that word. Surprisal theory has since sparked a line of research exploring the relationship between surprisal and processing cost, where a word’s processing cost is typically quantified as its RT.^{3}

Formally, a word’s surprisal is defined as:

$$h_t(w_t) \stackrel{\text{def}}{=} -\log p(w_t \mid \mathbf{w}_{<t}) \tag{2}$$

where $p$ is the ground-truth probability distribution over natural language utterances. We use $h_t(w)$ as a convenient shorthand that avoids notational clutter. In words, eq. (2) states that a word is more surprising—and thus conveys more information—if it is less likely, and vice versa.

Time and again, surprisal has proven to be a strong predictor in RT analyses (Smith and Levy, 2008, 2013; Goodkind and Bicknell, 2018; Wilcox et al., 2020; Shain et al., 2022, *inter alia*). Importantly, surprisal (as well as other properties of a word, like frequency or length) is a quantity that can only feasibly impact readers’ behaviors after they have encountered the word in question.^{4} Thus, by limiting their analyses to such characteristics, these prior works assume RT allocation happens *after* word identification, being thus **responsive** to the context a reader finds themselves in and happening on demand as needed for processing a word.

### 2.2 Anticipatory Reading

Not all reading behaviors, however, can be characterized as reactive. As a concrete example, readers often skip words—a decision which must be made while the next word’s identity is unknown. Furthermore, prior work has shown that the uncertainty over a sentence’s continuations impacts RTs (where this uncertainty is quantified as contextual entropy, as we make explicit later; Roark et al., 2009; Angele et al., 2015; van Schijndel and Schuler, 2017; van Schijndel and Linzen, 2019). Both of these observations offer initial evidence that some form of **anticipatory** planning is performed by readers, influencing the way they read a text.

We quantify a reader’s anticipation of an upcoming word as its expected surprisal, i.e., its **contextual entropy**, which is defined as follows:

$$\mathrm{H}(W_t \mid \mathbf{W}_{<t} = \mathbf{w}_{<t}) \stackrel{\text{def}}{=} \sum_{w \in \overline{\mathcal{W}}} p(w \mid \mathbf{w}_{<t}) \log \frac{1}{p(w \mid \mathbf{w}_{<t})} \tag{3}$$

where $W_t$ denotes a $\overline{\mathcal{W}}$-valued random variable, which takes on values $w \in \overline{\mathcal{W}}$ with distribution $p(\cdot \mid \mathbf{w}_{<t})$. Specifically, we assume a (potentially infinite) vocabulary $\mathcal{W}$, which we augment to include a special token $\textsc{eos} \notin \mathcal{W}$ that indicates the end of an utterance. To that end, we define $\overline{\mathcal{W}} \stackrel{\text{def}}{=} \mathcal{W} \cup \{\textsc{eos}\}$. When clear from context, we shorten $\mathrm{H}(W_t \mid \mathbf{W}_{<t} = \mathbf{w}_{<t})$ to simply $\mathrm{H}(W_t)$.
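To make the contextual entropy concrete, the following sketch computes it for a small hypothetical next-word distribution (the words and probabilities are invented for illustration); the entropy is simply the expected surprisal over continuations:

```python
import math

def surprisal(p_word: float) -> float:
    """Surprisal in nats: -log p(w | context)."""
    return -math.log(p_word)

def contextual_entropy(p_next: dict) -> float:
    """Shannon contextual entropy: the expected surprisal
    over possible next-word continuations."""
    return sum(p * surprisal(p) for p in p_next.values() if p > 0)

# Hypothetical next-word distribution after "The cat sat on the ..."
p_next = {"mat": 0.5, "sofa": 0.25, "floor": 0.125, "<eos>": 0.125}
h = contextual_entropy(p_next)  # 1.75 bits, expressed here in nats
```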

Prior work has also investigated the role of entropy in RTs. Hale (2003, 2006), for instance, investigated the role of entropy reduction on reading times; Hale defines entropy reduction as the change in the conditional entropy over sentence parses induced by word *t*, which is a different measure from the word entropy we investigate here. More recently, other work (Roark et al., 2009; van Schijndel and Schuler, 2017; Aurnhammer and Frank, 2019) investigated the role of successor entropy (i.e., word (*t* + 1)’s entropy) on RTs. Linzen and Jaeger (2014) investigated entropy reduction, total future entropy (i.e., the conditional entropy over sentence parses), and single-step syntactic entropy (i.e., the entropy over the next step in a syntactic derivation). In this work, we are instead interested in the role of the entropy of word *t* itself because of its theoretical motivation as the expected processing difficulty under surprisal theory.

Similarly to us, Cevoli et al. (2022) also study the role of the entropy of word *t* on RTs for word *t*; more specifically, Cevoli et al. analyze *prediction error costs* by investigating how surprisal and entropy interact in predicting RTs. Finally, Smith and Levy (2010) also investigate how word *t*’s contextual entropy influences RTs, but while further conditioning a reader’s predictions on a noisy version of word *t*’s visual signal.

### 2.3 Skewed Anticipations

The contextual entropy quantifies a reader’s expectation of how surprising word $w_t$ will be, while knowing only its context $\mathbf{w}_{<t}$. However, a reader may employ a different strategy when making anticipatory predictions. One possibility, for instance, is that readers could be overly confident, and trust their best (i.e., most likely) guess when making this prediction. In this case, readers would instead anticipate a subsequent word’s surprisal to be:^{5}

$$-\log \max_{w \in \overline{\mathcal{W}}} p(w \mid \mathbf{w}_{<t}) \tag{4}$$

Another possibility is that readers anticipate a word’s surprisal as a function of the number of plausible continuations:

$$\log \big|\, \mathrm{supp}\big( p(\cdot \mid \mathbf{w}_{<t}) \big) \big| \tag{5}$$

Both of these skewed expectation strategies are special cases of the **contextual Rényi entropy** (Rényi, 1961), which is defined as:

$$\mathrm{H}_{\alpha}(W_t \mid \mathbf{W}_{<t} = \mathbf{w}_{<t}) \stackrel{\text{def}}{=} \frac{1}{1 - \alpha} \log \sum_{w \in \overline{\mathcal{W}}} p(w \mid \mathbf{w}_{<t})^{\alpha} \tag{6}$$

As before, we shorten $\mathrm{H}_{\alpha}(W_t \mid \mathbf{W}_{<t} = \mathbf{w}_{<t})$ to $\mathrm{H}_{\alpha}(W_t)$. Notably, with different values of $\alpha$, the Rényi entropy leads to different interpretations of a reader’s anticipation strategies. The Rényi entropy is equivalent to eq. (5) when $\alpha = 0$, in which case it measures the size of the support of $p(\cdot \mid \mathbf{w}_{<t})$, or equivalently, the number of competing (word-level) continuations at a time step. Further, it is equivalent to eq. (4) when $\alpha = \infty$, in which case it measures the surprisal of the word with maximal probability in a context. Finally, through L’Hôpital’s rule, it is equivalent to eq. (3), the Shannon entropy, when $\alpha = 1$. In general, however, the Rényi entropy does not have as clear an intuitive meaning when $\alpha \notin \{0, 1, \infty\}$. Notably, the Rényi entropy with $\alpha = 1/2$ will be relevant in our results section. In this case, it can be thought of as measuring a softened version of distribution $p$’s support.
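The special cases above can be checked numerically. The sketch below (our own illustration; the distribution is invented) implements the Rényi entropy with the limiting cases handled explicitly:

```python
import math

def renyi_entropy(probs, alpha: float) -> float:
    """Rényi entropy (in nats) of a discrete distribution.

    Special cases: alpha=0 gives the log of the support size,
    alpha=1 recovers the Shannon entropy (the L'Hôpital limit),
    alpha=inf gives the surprisal of the most likely outcome.
    """
    support = [p for p in probs if p > 0]
    if alpha == 0:
        return math.log(len(support))
    if alpha == 1:
        return -sum(p * math.log(p) for p in support)
    if math.isinf(alpha):
        return -math.log(max(support))
    return math.log(sum(p ** alpha for p in support)) / (1 - alpha)

p = [0.5, 0.25, 0.125, 0.125]
h_half = renyi_entropy(p, 0.5)  # the "softened support size" case
```

For a uniform distribution, all values of α coincide; for skewed distributions, the Rényi entropy decreases as α grows.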

## 3 Anticipatory Mechanisms

In this paper, we are mainly interested in the effect of anticipations on RTs, where we operationalize anticipation in terms of contextual (Rényi) entropy, as defined above. We consider four main mechanisms under which anticipation could affect RTs: word skipping, budgeting, preemptive processing, and uncertainty cost. We discuss each of these in turn.

### Word Skipping.

The first way in which anticipation could affect RTs is by allowing readers to skip words entirely, allocating the word a reading time of zero. A reader must, by definition, decide whether or not to skip a word *before* fixating on it.^{6} We hypothesize the reader may thus decide to skip a word when they are confident in its identity, i.e., when the word’s contextual entropy is low. If this hypothesis is true, then contextual entropy should be a good predictor of when a reader skips words.

### Budgeting.

The reading process can be described as a sequence of fixations and saccades.^{7} Saccades, however, do not happen instantly: On average, they must be planned at least 125 milliseconds in advance (Reichle et al., 2009). Further, there is an average eye-to-brain delay of 50 ms (Pollatsek et al., 2008). We may thus estimate that the effects of a word’s surprisal, as well as of other word properties such as frequency, on RT allocation will only show up 175 ms after that word is fixated, or later.^{8} Considering this delay on saccade execution, it is not unreasonable that RTs could be decided (or budgeted) further in advance, when the reader still does not know word *w*_{t}’s identity. If a reader indeed budgets reading times beforehand, RTs should be—at least in part—predictable from the contextual entropy. Processing costs, however, may still be driven by surprisal. In this case, we might observe budgeting effects: e.g., if a reader *under*-budgets RTs for a word (i.e., if the word’s contextual entropy is smaller than its actual surprisal), we may see a compensation, which could manifest as larger spillover effects on the following word.

### Preemptive Processing.

Recent work (e.g., Willems et al., 2015; Goldstein et al., 2022) suggests that—especially for low entropy contexts—the brain starts preemptively processing future words before reaching them.^{9} Thus, shorter reading times in low entropy words at time *t* + 1 may be compensated by longer times in the previous word *w*_{t}. However, recent work investigating the effect of successor entropy, i.e., word (*t* + 1)’s entropy, on RTs has found conflicting results.^{10} Also on the topic of preprocessing, Smith and Levy (2008) derive from first principles what a reader’s optimal preprocessing effort should be for any given context: Under their assumption that reading times should be scale-free and that readers optimally trade off preprocessing and reading costs, a reader should always allocate a *constant* amount of resources for preprocessing future words.^{11} We will investigate the effect of successor entropy on RTs in § 5.6.

### Uncertainty Cost.

Finally, uncertainty about a word’s identity, as quantified by its contextual entropy, may directly increase processing load. For example, keeping a large number of competing word continuations under consideration may require additional cognitive resources, impacting the reader’s processing load beyond the effect of the observed word’s surprisal. We know of no way, however, to test this hypothesis directly under our experimental setup. Therefore, we will not analyze this mechanism specifically; we only study it in our main experiment (§ 5.2), where it is measured in aggregate with the other mechanisms.

## 4 Experimental Setup

### 4.1 Estimators

Unfortunately, we cannot compute the values discussed in § 2, as we do not have access to the true natural language distribution *p*(·∣*w*_{ <t}). We can, however, estimate these values using a language model *p*_{θ}(·∣*w*_{ <t}). We will thus use *p*_{θ} in place of *p* in order to estimate all the information-theoretic quantities in § 2. Using language model-based estimators is standard practice when investigating the relationship between RTs and information-theoretic quantities, e.g., surprisal.

#### Language Models.

We use GPT-2 small (Radford et al., 2019) as our language model *p*_{θ} in all experiments.^{12} Although some work has shown that a language model’s quality correlates with its psychometric predictive power (Goodkind and Bicknell, 2018; Wilcox et al., 2020), both Shain et al. (2022) and Oh and Schuler (2022) have more recently found that GPT-2 small’s surprisal estimates are actually more predictive of RTs than those of both larger versions of GPT-2 and GPT-3. We note, however, that GPT-2 predicts subwords at each time-step, rather than predicting full words. Thus, to get word-level surprisal, we must sum over the subwords’ surprisal estimates. In some cases, many distinct subword sequences may represent a single word. In this case, we only consider the *canonical* subword sequence output by GPT-2’s tokenizer. Estimating the contextual entropy per word is harder because computing it requires summing over the entire vocabulary $W\xaf$, whose cardinality can be infinite. We approximate the contextual entropy by computing the entropy over the subwords instead.^{13} In practice, this is equivalent to computing a lower bound on the true contextual entropies, as we show in App. B.
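To illustrate the word-level aggregation (with invented log-probabilities, not real GPT-2 outputs), summing subword surprisals follows directly from the chain rule of probability:

```python
import math

# Invented subword log-probabilities (base e) for the canonical
# tokenization of one word; NOT real GPT-2 outputs.
subword_logprobs = [
    -3.2,  # log p(first subword | context)
    -0.4,  # log p(second subword | context, first subword)
]

def word_surprisal(logprobs) -> float:
    """A word's surprisal is the sum of its canonical subwords'
    surprisals, since the word's probability is the product of
    the subwords' conditional probabilities."""
    return -sum(logprobs)

s = word_surprisal(subword_logprobs)
p_word = math.exp(-s)  # equivalently, the product of subword probabilities
```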

### 4.2 Data

We perform our analyses on two eye-tracking and two self-paced reading datasets. The self-paced reading corpora we study are the Natural Stories Corpus (Futrell et al., 2018) and the Brown Corpus (Smith and Levy, 2013). The eye-tracking corpora are the Provo Corpus (Luke and Christianson, 2018) and the Dundee Corpus (Kennedy et al., 2003). We refer readers to App. C for more details on these corpora, as well as dataset statistics and preprocessing steps. For the eye-tracking data, we focus our analyses on Progressive Gaze Duration: A word’s RT is taken to be the sum of all fixations on it before a reader first passes it, i.e., we only consider fixations in a reader’s first forward pass. Further, for our first set of experiments, we consider a skipped word’s RT to be zero (following Rayner et al., 2011);^{14} we denote these datasets as Provo (✓) and Dundee (✓). In later experiments, we discard skipped words, denoting these datasets with an (✗) instead. Following prior work (e.g., Wilcox et al., 2020), we average RT measurements across readers, analyzing one RT value per word token.
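As a small sketch of this preprocessing step (with hypothetical per-reader durations), the two skipped-word conventions can be computed as:

```python
# Hypothetical first-pass gaze durations (ms) for one word token across
# five readers; None marks a reader who skipped the word.
readers_rt = [210, None, 180, 240, None]

def average_rt(rts, skips_as_zero: bool) -> float:
    """Average RT across readers. With skips_as_zero=True, skipped
    words count as 0 ms (the ✓ datasets); otherwise they are
    discarded entirely (the ✗ datasets)."""
    if skips_as_zero:
        values = [0 if rt is None else rt for rt in rts]
    else:
        values = [rt for rt in rts if rt is not None]
    return sum(values) / len(values)

rt_with_skips = average_rt(readers_rt, skips_as_zero=True)      # 126.0
rt_without_skips = average_rt(readers_rt, skips_as_zero=False)  # 210.0
```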

### 4.3 Linear Modeling

In most of our experiments, we use linear regressors:^{15} $f_{\boldsymbol{\phi}}(\mathbf{x}) = \boldsymbol{\phi}^{\top}\mathbf{x}$, where $\boldsymbol{\phi}$ is a column vector which parameterizes $f_{\boldsymbol{\phi}}$. Further, given data $\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$, regressor $f_{\boldsymbol{\phi}}(\mathbf{x})$’s *average* log-likelihood on $\mathcal{D}$ is given by:

$$\frac{1}{N} \sum_{n=1}^{N} \log \mathcal{N}\big(y_n;\; f_{\boldsymbol{\phi}}(\mathbf{x}_n),\; \sigma^2\big) \tag{7}$$

where $\mathcal{N}$ denotes a Gaussian density with mean $f_{\boldsymbol{\phi}}(\mathbf{x}_n)$ and variance $\sigma^2 > 0$.

^{16}

### 4.4 Evaluation

We estimate held-out performance via cross-validation: We fit our regressors on all but one fold of the data and compute their log-likelihood on the held-out fold. Further, as is standard in RT analyses, we test the predictive power of a hypothesis by comparing a target model against a baseline model. These models differ only in that the target model contains a predictor of interest, whereas the baseline model does not. Our metric of interest is thus the difference in log-likelihood of held-out data between the two models:

$$\Delta_{\text{llh}} \stackrel{\text{def}}{=} \text{llh}(\text{target}) - \text{llh}(\text{base}) \tag{8}$$

where $\text{llh}(\cdot)$ denotes a model’s average log-likelihood on held-out data.

^{17}
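To illustrate the full evaluation loop on synthetic stand-in data (our own toy example, with a single train/test split in place of full cross-validation and invented effect sizes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: RTs driven by a baseline predictor and a
# "surprisal-like" predictor of interest (all values invented).
n = 500
x_cmn = rng.normal(size=(n, 1))   # e.g., word length
x_surp = rng.normal(size=(n, 1))  # predictor of interest
y = 200 + 10 * x_cmn[:, 0] + 25 * x_surp[:, 0] + rng.normal(scale=30.0, size=n)

def fit_ols(X, y):
    """Fit a linear regressor (with intercept) by least squares."""
    X1 = np.column_stack([np.ones(len(X)), X])
    phi, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return phi

def avg_loglik(X, y, phi, sigma2):
    """Average Gaussian log-likelihood of the data under the regressor."""
    resid = y - np.column_stack([np.ones(len(X)), X]) @ phi
    return np.mean(-0.5 * np.log(2 * np.pi * sigma2) - resid**2 / (2 * sigma2))

# One train/test split stands in for full cross-validation here.
tr, te = slice(0, 400), slice(400, None)
lls = {}
for name, X in [("base", x_cmn), ("target", np.column_stack([x_cmn, x_surp]))]:
    phi = fit_ols(X[tr], y[tr])
    resid = y[tr] - np.column_stack([np.ones(400), X[tr]]) @ phi
    lls[name] = avg_loglik(X[te], y[te], phi, resid.var())

delta_llh = lls["target"] - lls["base"]  # positive favors the richer model
```

Since the synthetic RTs genuinely depend on the predictor of interest, the target model attains a higher held-out log-likelihood.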

## 5 Experiments and Results

### 5.1 Experiment #1: Confirmatory Analysis

Our regressors include a set of common baseline predictors $\mathbf{x}_t^{\text{cmn}}$ based on word length and frequency, where $|w_t|$ is the word length in characters and $u(w_t)$ is the unigram frequency of the $t^{\text{th}}$ word. Notably, we include predictors for words $w_{t-1}$, $w_{t-2}$, and $w_{t-3}$ because prior work has shown that a word’s RT is impacted not only by its own surprisal, but also by the surprisal of previous words. These effects are referred to as **spillover effects**. We then estimate the $\Delta_{\text{llh}}$ associated with the surprisal predictors $h_{t'}(w_{t'})$, for $t' \in \{t, t-1, t-2, t-3\}$. In words, the target model’s $\mathbf{x}^{\text{model}}$ includes all surprisal predictors $\mathbf{x}_t^{\text{surp}}$, while for the baseline model $\mathbf{x}^{\text{base}}$ we remove surprisal predictors one at a time. We present these results in Table 1. The results show that the surprisal of word $w_t$ is a strong predictor of RTs in all four analyzed datasets. Additionally, we see significant spillover effects for the surprisal of the three previous words in the self-paced reading corpora, for the two previous words in Dundee, and for the single previous word in Provo. Interestingly, and consistent with prior work (Smith and Levy, 2008, 2013), we find that spillover effects are stronger than the current word’s effect in Brown. On the other three datasets, however, we find the surprisal effect on the current word to be stronger than the spillover effects.

### 5.2 Experiment #2: Surprisal vs. Entropy

In the second experiment, we analyze the predictive power of the contextual Shannon entropy on RTs. Specifically, Table 2 presents the Δ_{llh} between the baseline model **x**_{t}^{base} = **x**_{t}^{cmn} ⊕**x**_{t}^{surp} and two target models. The first is a model where the entropy term H(*W*_{t}) is added *in addition* to the predictors already present in **x**^{base}. The second is a model where the surprisal term *h*_{t}(*w*_{t}) is replaced by the entropy term H(*W*_{t}). From Table 2, we see that adding the entropy of the current word significantly increases the predictive power in three out of the four analyzed datasets. Furthermore, replacing the surprisal predictor with the entropy only leads to a model with worse predictive power in one of the three analyzed datasets (in Provo). On the other three datasets, the entropy’s predictive power is as good as the surprisal’s—more precisely, there is no statistically significant difference in their power. Together, these results suggest that the reading process is both responsive and anticipatory.

Analyzing the impact of the previous words’ entropies, i.e., H(*W*_{t−1}), H(*W*_{t−2}), H(*W*_{t−3}),^{18} on RTs, we see a somewhat different story. When adding spillover entropy terms as extra predictors we see no consistent improvements in predictive power. We observe a weak improvement on self-paced reading datasets when adding H(*W*_{t−1}) as a predictor, but, even then, the improvement is only significant on Natural Stories. We find a similarly weak effect when adding H(*W*_{t−2}) on eye-tracking data, which is only significant on the Dundee corpus. This lack of predictive power further stands out when contrasted to surprisal spillover effects, which were mostly significant; see Table 1. Furthermore, replacing surprisal spillover terms with the corresponding entropy terms generally leads to models with weaker predictive power. Together, these results imply the effect of entropy (expected surprisal) on RTs is mostly local, i.e., the expectation over a word’s surprisal impacts its RT, but not future words’ RTs.

### 5.3 Experiment #3: Skewed Expectations

We now compare the effect of Rényi entropy with *α*≠1 on RTs. We follow a similar setup to before. Specifically, we compute the contextual Rényi entropy for several values of *α*. We then train regressors where we either add the Rényi entropy as an additional predictor, or where we replace the current word’s surprisal *h*_{t}(*w*_{t}) with the Rényi entropy. We plot these values in Figure 1. Analyzing this figure, we see that Provo again presents different trends from the other datasets. We also see a clear trend in the three other datasets: The predictive power of expectations seems to improve for smaller values of *α*. More precisely, in Brown, Natural Stories, and Dundee, *α* = 1/2 seems to lead to stronger predictive power than *α* > 1/2.

Based on these results, we then produce a similar table to the previous experiment’s, but using the Rényi entropy with *α* = 1/2 instead. These results are depicted in Table 3. Similarly to before, we still see a significant improvement in predictive powers on three of the datasets when adding the entropy as an extra predictor. Unlike before, however, replacing the surprisal predictors (for time step *t*) with Rényi entropy predictors significantly improves log-likelihoods in two of the analyzed datasets. In other words, the Rényi entropy has a stronger predictive power than the surprisal in both these datasets. We now move on to investigate why this is the case, analyzing the mechanisms proposed in § 3.

### 5.4 Experiment #4: Word Skipping

_{llh}between a baseline model and an additional model that captures our target effect. In contrast to previous experiments, though, we employ a logistic regressor that predicts whether or not a word was skipped during the readers’ initial pass. Our prediction function can thus be written as

**is a column vector of the model’s parameters and**

*ϕ**σ*is the sigmoid function. Now, given data $D={(xn,yn)}n=1N$, where

*y*represents the ratio of readers who skipped a word, the average log-likelihood of this predictor on $D$ is:

*y*represent the ratio of readers who skipped a word—as opposed to the per-reader binary skipped vs not distinction—is equivalent to averaging the predicted feature across readers, as we do when predicting reading times.
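A minimal sketch of this soft-label log-likelihood, with invented features and weights (the feature values and parameters below are assumptions for illustration, not fitted values):

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def avg_loglik_skip(X, y, phi):
    """Average log-likelihood of a logistic regressor with soft targets,
    where y_n is the ratio of readers who skipped word n."""
    total = 0.0
    for x_n, y_n in zip(X, y):
        p = sigmoid(sum(w * v for w, v in zip(phi, x_n)))
        total += y_n * math.log(p) + (1 - y_n) * math.log(1 - p)
    return total / len(y)

# Invented features [bias, contextual entropy] and per-word skip ratios:
# lower-entropy contexts are skipped more often in this toy example.
X = [(1.0, 0.5), (1.0, 2.0), (1.0, 4.0)]
y = [0.8, 0.4, 0.1]
phi = (1.5, -1.0)  # assumed weights, for illustration only
ll = avg_loglik_skip(X, y, phi)
```

Because the assumed weights track the skip ratios, this model scores better than an uninformative one with zero weights.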

Table 4 presents our results. First, we see that surprisal is a significant predictor of whether or not a word is skipped in Dundee; however, it is not a significant predictor in Provo. Second, we find that in Dundee the predictive power over whether a word was skipped is significantly stronger when using the Rényi entropy of the current word than when using its surprisal. Finally, while we find an improvement in predictive power when adding entropy (in addition to surprisal) as a predictor, we find no significant improvement when starting with entropy and adding surprisal. This implies that, at least for Dundee, word-skipping effects are predicted solely by the entropy, with the surprisal of the current word adding no extra predictive power.

Note that we represented skipped words as having RTs of 0 ms in our previous experiments on eye-tracking datasets. Thus, our previous results could be driven purely by word-skipping effects. We now run the same experiments as in § 5.2 and § 5.3, but with skipped words removed from our analysis. These results are presented in Table 5. In short, when skipped words are not considered, the Rényi entropy is no more predictive of RTs than the surprisal. In fact, the surprisal seems to be a slightly stronger predictor, albeit not significantly so in Dundee. However, adding the Rényi entropy as a predictor to a model which already has surprisal still adds significant predictive power in Dundee. In short, this table shows that, while partly driven by word skipping, there are still potentially other effects of anticipation on RTs.

### 5.5 Experiment #5: Budgeting Effects

Specifically, we operationalize **under-budgeting** as any positive difference between surprisal and entropy. Similarly, we may expect **over-budgeting** to lead to negative spillover effects, since spending extra time on a word might allow the reader to start working through some of their processing debt (i.e., the still-unprocessed spillover effects of that and of previous words). We operationalize these potential budgeting effects as:

$$\max\big(0,\; h_t(w_t) - \mathrm{H}_{\alpha}(W_t)\big) \quad \text{(under-budgeting)}, \qquad \max\big(0,\; \mathrm{H}_{\alpha}(W_t) - h_t(w_t)\big) \quad \text{(over-budgeting)}$$

We then measure the Δ_{llh} of adding these effects as predictors of RT on top of a baseline which contains the current word’s entropy, as well as all four surprisal terms, as predictors: $\mathbf{x}_t^{\text{base}} = \mathbf{x}^{\text{cmn}} \oplus \mathbf{x}^{\text{surp}} \oplus [\mathrm{H}_{\alpha}(W_t)]^{\top}$. Unlike previous experiments, thus, our baseline here already contains the entropy as a predictor. Further, we show results for eye-tracking datasets both including (✓) and excluding (✗) skipped words for this and future analyses.

Table 6 presents these results. In short, we do find budgeting effects of word *t* −1 on RTs in our two analyzed self-paced reading datasets, and in Dundee (✓). We do not, however, find them in Dundee (✗). This may imply budgeting effects impact word skipping, but not actual RTs once the word is fixated. Further, we also find weak budgeting effects of word *t* −2 in our (✗) eye-tracking datasets; these, however, are only significant in Dundee. We conclude that these results do not provide concrete evidence of a budgeting mechanism influencing RTs, but only of one influencing word skipping. We further analyze these effects in our discussion section (§ 6).

### 5.6 Experiment #6: Preemptive Processing

In our analysis of preemptive processing, we will analyze the impact of successor entropy, i.e., H_{α}(*W*_{t +1}), on RTs. While prior work has analyzed this impact, the results in the literature are contradictory. Table 7 presents the results of our analysis. In short, this table shows that the successor entropy is only significant in Natural Stories.^{19} In contrast, the current word’s contextual entropy is a significant predictor of RTs in 3/4 analyzed datasets, even when added to a model that already has the successor entropy. Further, while most of our results suggest readers rely on skewed expectations for their anticipatory predictions—i.e., the Rényi entropy with *α* = 1/2 is in general a stronger predictor than Shannon’s entropy—the successor Shannon entropy seems more predictive of RTs than the Rényi. Our full model, though, still has a larger log-likelihood when using Rényi entropies. Overall, our results support the findings of Smith and Levy (2008), which suggests preemptive processing costs are constant with respect to the successor entropy. Thus, we conclude preemptive processing is likely not the main mechanism through which H(*W*_{t}) affects *w*_{t}’s reading times.

## 6 Discussion

We wrap up our paper with an overall discussion of results. A key overall finding seen across Tables 2 and 3 is that effects of entropy (expected surprisal) are generally local, i.e., they are clearest and most pronounced on current-word RTs. On the other hand, the effects of surprisal also show up on subsequent words, e.g., in spillover effects. This is consistent with our overall hypothesis that entropy effects capture **anticipatory** reading behavior.

To make this point more concrete, we plot the values of the parameters $\boldsymbol{\phi}$ from our best regressor per dataset in Figure 2—showing the effect of predictor variables not included in a dataset as zero. As the contextual Rényi entropy models yield overall higher data log-likelihoods, we focus on them here. Figure 2 shows that—for Brown, Natural Stories, and Dundee—not only does the entropy have similar (or stronger) psychometric predictive power than the surprisal, it also has a similar (or stronger) *effect size* on RTs. In other words, an increase of 1 bit in contextual entropy leads to a similar or larger increase in RTs than a 1-bit increase in surprisal.

Figure 2 also shows that in Natural Stories—the only dataset where it is significant—the successor entropy has a larger effect on RTs than the surprisal, and its impact is positive. This suggests an increase in the next word’s entropy may lead to an increase in the current word’s RT. In turn, this could imply that readers preemptively process future words, and that they need more time to do this when there are more plausible future alternatives. Moreover, we see the successor Rényi entropy has a similar (or slightly smaller) effect on RTs than the current word’s Rényi entropy. Why the successor entropy is only significant in the Natural Stories dataset is left as an open question.

Figure 2 further shows the effect of over-budgeting on RTs in Brown, Natural Stories, and Dundee.^{20} We see that our operationalization of over-budgeting leads to a negative effect on RTs in Dundee (✓), but to no effect in Dundee (✗). Together, these results suggest that when a reader over-budgets time for a word, they are more likely to skip the following one. In Brown and Natural Stories, however, over-budgeting seems to lead to a positive effect on the next word’s RT. As this is only the case in self-paced reading datasets, we suspect this could be related to specific properties of this experimental setting, e.g., a reader’s attention could break when they become idle due to over-budgeting RT for a specific word.

Finally, while we get roughly consistent effect sizes for all predictors in Brown, Natural Stories, and Dundee, results differ for Provo. While we note that Provo is the smallest of our analyzed datasets (in terms of its number of annotated word tokens; see Table 8 in App. C), this is likely not the whole story behind these differences. As it is non-trivial to diagnose their source, we leave this task open for future work.

## 7 Limitations and Caveats

Throughout this paper, we have discussed the effect of anticipation on RTs (and on the reading process, more generally)—where we quantify a reader’s anticipation as a contextual entropy. We do not, however, have access to the true distribution *p*, which is necessary to compute this entropy. Rather, we rely on a language model *p*_{θ} to approximate it. How this approximation impacts our results is a non-trivial question—especially since we do not know which errors our estimator is likely to commit. If we assume *p*_{θ} to be equivalent to *p* up to the addition of white noise to its logits,^{21} for instance, we could have good estimates of the entropy (as the noise would be partially averaged out), while having worse estimates of the surprisal (since surprisal estimates would be affected by the entire noise in the *p*_{θ}(*w*_{t}∣*w*_{ <t}) estimates).^{22}

We believe this not to be the main reason behind our results for two reasons. First, if the entropy helped predict RTs simply because we have noisy versions of the surprisal in our estimates, the same should be true for predicting spillover effects, which are also predictable from surprisals. This is not the case, however: While the entropy, i.e., H(*W*_{t}), helps predict RTs, spillover entropies, e.g., H(*W*_{t−1}), do not. Second, even if our estimates are noisy, assuming that this noise is not unreasonably large, a noisy estimate of the surprisal should better approximate the true surprisal than an estimate of the contextual entropy. Since replacing the surprisal with the contextual entropy eventually leads to better predictions of RTs, this is likely not the only mechanism on which our results rely.
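
The intuition behind this argument can be checked with a toy simulation. Below, we perturb a fixed distribution's logits with Gaussian white noise (mirroring the footnoted noise model) and compare the squared errors of a single word's surprisal estimate against those of the entropy estimate. All names and constants here are our own illustrative choices, not the paper's analysis:

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy_bits(p):
    nz = p[p > 0]
    return -np.sum(nz * np.log2(nz))

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

V, sigma, n_trials = 1000, 0.5, 500
true_logits = rng.normal(0.0, 2.0, size=V)   # an arbitrary "true" distribution p
p = softmax(true_logits)
target = int(np.argmax(p))                   # track the surprisal of one fixed word
true_surp, true_ent = -np.log2(p[target]), entropy_bits(p)

surp_se, ent_se = [], []
for _ in range(n_trials):
    # Noisy estimate p_theta: white noise added to the logits of p.
    q = softmax(true_logits + rng.normal(0.0, sigma, size=V))
    surp_se.append((-np.log2(q[target]) - true_surp) ** 2)
    ent_se.append((entropy_bits(q) - true_ent) ** 2)

# Because the entropy averages over surprisals, the noise partially cancels,
# so its mean squared error comes out smaller than the surprisal's here.
```

This matches the mean-squared-error argument in the footnote: the bias terms are comparable, but the variance term is larger for the single-word surprisal.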

A further limitation is that we estimate the surprisal (and entropy) of each word *w*_{t} while considering its entire context *w*_{ <t}. Modeling surprisal and entropy effects while considering skipped words, however, would be an important future step for an analysis of anticipation in reading. As an example, van Schijndel and Schuler (2016) show that when a word *w*_{t −1} is skipped, the subsequent word *w*_{t}’s RT is not only proportional to its own surprisal (i.e., *h*_{t}(*w*_{t})), but to the sum of both these words’ surprisals (i.e., to *h*_{t}(*w*_{t}) + *h*_{t −1}(*w*_{t −1})). They justify this by arguing that a reader would need to incorporate both words’ information at once when reading. Another model of the reading process, however, could predict that readers simply marginalize over the word in the (*t* −1)^{th} position, and compute the surprisal of word *w*_{t} directly as:

$$h_t(w_t) = -\log \sum_{w' \in \overline{W}} p(w' \mid \boldsymbol{w}_{<t-1}) \, p(w_t \mid \boldsymbol{w}_{<t-1} \circ w')$$

We leave to future work an analysis of the impact that this choice of *p*_{θ}—as well as the effects of skipped words—has on our results.
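
The marginalization described above is straightforward to implement. The helper below is a hypothetical sketch (its name and inputs are ours, not part of the released code): given a distribution over candidate skipped words and the next-word probabilities conditioned on each candidate, it returns the marginal surprisal in bits.

```python
import numpy as np

def marginal_surprisal(p_skipped, p_next_given_skipped):
    """Surprisal (bits) of w_t when w_{t-1} was skipped.

    p_skipped:            p(w' | w_{<t-1}) for each candidate skipped word w'.
    p_next_given_skipped: p(w_t | w_{<t-1} . w'), aligned with p_skipped.
    """
    p_prev = np.asarray(p_skipped, dtype=float)
    p_next = np.asarray(p_next_given_skipped, dtype=float)
    marginal = float(np.sum(p_prev * p_next))   # p(w_t | w_{<t-1})
    return -np.log2(marginal)
```

For instance, if two candidate skipped words are equally likely and each assigns *w*_{t} probability 0.5, the marginal surprisal is 1 bit.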

## 8 Conclusion

This work investigates the anticipatory nature of the reading process. We examine the relationship between expected information content—as quantified by contextual entropy—and RTs in four naturalistic datasets, specifically looking at the additional predictive power over surprisal that this quantity provides. While our results confirm the responsive nature of reading, they also highlight its anticipatory nature. We observe that contextual entropy has significant predictive power for reading behavior—most reliably on current-word processing—which gives us evidence of a non-trivial anticipatory component to reading.

## Acknowledgments

We thank Simone Teufel for conversations in early stages of this project. We also thank our action editor Ehud Reiter, and the reviewers for their detailed feedback on this paper. TP was supported by a Facebook PhD Fellowship. CM was supported by the Google PhD Fellowship. EGW was supported by an ETH Zürich Postdoctoral Fellowship. RPL was supported by NSF grant BCS-2121074 and a Newton Brain Science Award.

## Ethical Considerations

The authors foresee no ethical concerns with the research presented in this paper.

## Notes

Code is available at https://github.com/rycolab/anticipation-on-reading-times.

Levy (2005), for instance, showed that a word’s surprisal quantifies a change in the reader’s belief over sentence continuations. He then posited this change in belief may be reflected as processing cost.

This follows from standard theories of causality. Granger (1969), for instance, posits that future material cannot influence present behavior.

While most state-of-the-art language models cannot assign zero probabilities to a word due to their use of a softmax in their final layers, it is plausible that humans could.

This is under the assumption that the reader is not able to identify upcoming words through their parafoveal vision.

Fixations are when the gaze focuses on a word; saccades are rapid eye movements shifting the gaze from one point to another. In self-paced reading, saccades are analogous to mouse clicks.

Again, this assumes that the reader has not identified the word parafoveally. A second caveat regarding this analysis is that, once a saccade is initiated, there is an initial period during which it can be canceled or reprogrammed to target a different location (Van Gisbergen et al., 1987).

Specifically, they show the brain’s processing load before a word’s onset correlates negatively with its entropy.

While Roark et al. (2009) and van Schijndel and Schuler (2017) have found that successor entropy has a positive impact on RTs, i.e., that when *w*_{t +1} has lower entropy, word *w*_{t} takes a shorter time to be read, both Linzen and Jaeger (2014) and Aurnhammer and Frank (2019) have found no effect.

We give the full derivation—including the necessary assumptions—in App. A for completeness.

We make use of Wolf et al.’s (2020) library.

There are many ways to estimate the Rényi entropy: e.g., one could also have estimated it by assuming a fixed finite vocabulary $\overline{W}$, and then computing the probability of the words’ canonical tokenizations.

This choice goes against the more common practice of simply discarding skipped words from the analyses. It is based on two factors. First, we are interested in word skipping as a mechanism by which anticipation impacts RTs. Second, we want to make the eye-tracking setting more closely comparable to the self-paced reading setting, where fully skipping a word is not possible.

As both Shannon and Rényi entropies are linear functions of surprisal, we believe this assumption is justifiable.

We note that RTs cannot be negative, and thus prediction errors will not actually be Gaussian.

Significance is assessed using a paired permutation test. We correct for multiple hypothesis testing (Benjamini and Hochberg, 1995) and mark: in green significant Δ_{llh} where a variable adds predictive power (i.e., when the model with more predictors is better), in red significant Δ_{llh} where a variable leads to overfitting (i.e., when the model with more predictors is worse). ^{*}*p* < 0.05, ^{**}*p* < 0.01, ^{***}*p* < 0.001.

We term these predictors *spillover* entropy effects by analogy to the surprisal case. As before, we omit the conditioning factor on these entropies for notational succinctness, i.e., we write H(*W*_{t−1}) instead of H(*W*_{t−1}∣*W*_{ <t−1} = *w*_{ <t−1}).

We note this is the same dataset previously analyzed by van Schijndel and Linzen (2019), who found a significant effect of the successor entropy.

While over-budgeting is not a significant predictor in Brown, it leads to slightly stronger models and we add it to this dataset’s regressor for an improved comparison.

I.e., $p_{\theta}(\cdot \mid \boldsymbol{w}_{<t}) \propto p(\cdot \mid \boldsymbol{w}_{<t})\, e^{\mathcal{N}(0;\,\sigma^{2})}$, where $\mathcal{N}(0;\sigma^{2})$ is normally distributed, 0-mean noise with variance *σ*^{2}.

This can be made clearer if discussed in terms of the mean squared error of the surprisal and entropy estimates. The mean squared error of an estimator equals its squared bias plus its variance. Since contextual entropy is simply the average across surprisals, we should expect the bias term induced by the addition of white noise to be the same in our estimates of both entropy and surprisal. However, the variance term would be larger for surprisals. This could bias our analyses towards preferring the entropy as a predictor.

Superadditivity means *f*(*a*) + *f*(*b*) ≤ *f*(*a* + *b*), while subadditivity means *f*(*a*) + *f*(*b*) ≥ *f*(*a* + *b*). See https://math.stackexchange.com/questions/3736657/proof-of-xp-sub-super-additive for several simple proofs.


### A Smith and Levy’s (2008) Constant Preemptive Processing Effort

*Assume that reading times and preprocessing effort (PE) are allocated as in Smith and Levy’s (2008) model, where *y* represents reading times and *k* > 1 is a free parameter. Then, the total effort to preprocess all words in the vocabulary, i.e., $\sum_{w \in \overline{W}} \mathrm{pe}(w \mid \boldsymbol{w}_{<t})$, is constant.*

### B A Subword Bound on the Rényi Entropy

*Let p be a language model with vocabulary $S$, an alphabet of subwords. Let $W \subseteq S^{*}$ be the set of words constructable with subwords drawn from $S$. Further, assume that, for every $w \in W$, there exists a **unique** tokenization of w into subwords $s_1, \ldots, s_T \in S$ whose concatenation is w, i.e., $w = s_1 \cdots s_T$. Due to this uniqueness assumption, we may regard p as either a distribution over $S^{*}$ or $W^{*}$. Then, for all $\alpha \in \mathbb{R}_{>0}$, we have*

$$\mathrm{H}_{\alpha}(S_t \mid \boldsymbol{W}_{<t} = \boldsymbol{w}_{<t}) \;\leq\; \mathrm{H}_{\alpha}(W_t \mid \boldsymbol{W}_{<t} = \boldsymbol{w}_{<t}),$$

*where $S_t$ is an $\overline{S}$-valued random variable that takes on the value of the first subword of the word in $t^{\text{th}}$ position, and $\overline{S} \stackrel{\mathrm{def}}{=} S \cup \{\textsc{eos}\}$.*

*Proof*. Let $W_s \subseteq \overline{W}$ denote the set of words *w* which start with subword *s*; by the uniqueness assumption, $p(s \mid \boldsymbol{w}_{<t}) = \sum_{w \in W_s} p(w \mid \boldsymbol{w}_{<t})$. This allows us to rewrite the Rényi entropy as follows:

$$\mathrm{H}_{\alpha}(S_t \mid \boldsymbol{W}_{<t} = \boldsymbol{w}_{<t}) = \frac{1}{1-\alpha} \log_2 \sum_{s \in \overline{S}} \Big(\sum_{w \in W_s} p(w \mid \boldsymbol{w}_{<t})\Big)^{\!\alpha}$$

The bound then follows for $\alpha \in \mathbb{R}_{>0} \setminus \{1\}$ because, for $\alpha > 1$, we have that $\frac{1}{1-\alpha} < 0$ and $x^{\alpha}$ is superadditive for $x \geq 0$; similarly, for $0 < \alpha < 1$, we have that $\frac{1}{1-\alpha} > 0$ and $x^{\alpha}$ is subadditive for $x \geq 0$.^{23} Finally, for the case that $\alpha = 1$, we can apply the chain rule of entropy to write the joint entropy of $W_t$ and $S_t$ in two different ways:

$$\mathrm{H}_{1}(W_t) + \mathrm{H}_{1}(S_t \mid W_t) = \mathrm{H}_{1}(S_t) + \mathrm{H}_{1}(W_t \mid S_t),$$

where we leave the conditioning on $\boldsymbol{W}_{<t} = \boldsymbol{w}_{<t}$ implicit. Here, $\mathrm{H}_{1}(S_t \mid W_t, \boldsymbol{W}_{<t} = \boldsymbol{w}_{<t}) = 0$ because $S_t$ is deterministically derived from $W_t$. This implies $\mathrm{H}_{1}(S_t) \leq \mathrm{H}_{1}(W_t)$, and thus the bound holds for all $\alpha \in \mathbb{R}_{>0}$. □
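
The bound in this appendix can be sanity-checked numerically. The toy lexicon below is our own invention (with `#` marking an assumed subword boundary); for several orders α, including the Shannon case α = 1, the Rényi entropy of the first-subword distribution never exceeds that of the word distribution:

```python
import numpy as np

def renyi_bits(p, alpha):
    """Renyi entropy in bits; alpha = 1 recovers the Shannon entropy."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    if alpha == 1.0:
        return float(-np.sum(p * np.log2(p)))
    return float(np.log2(np.sum(p ** alpha)) / (1.0 - alpha))

# Toy word distribution; '#' separates an assumed first subword from the rest.
words = {"un#do": 0.20, "un#til": 0.30, "re#do": 0.10, "re#run": 0.25, "a": 0.15}
p_word = list(words.values())

# Marginal distribution over first subwords (summing words that share one).
firsts = {}
for w, pr in words.items():
    s = w.split("#")[0]
    firsts[s] = firsts.get(s, 0.0) + pr
p_sub = list(firsts.values())

# H_alpha(S_t) <= H_alpha(W_t) for every order alpha > 0.
for alpha in (0.25, 0.5, 1.0, 2.0):
    assert renyi_bits(p_sub, alpha) <= renyi_bits(p_word, alpha) + 1e-9
```

The check mirrors the proof's case split: merging outcomes can only decrease the Rényi entropy, whichever side of α = 1 we are on.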

### C Datasets

Unless otherwise stated, we follow the data preprocessing steps (including cleaning and tokenization) performed by Meister et al. (2021). We use the following corpora in our experiments:

Table 8: Statistics of the analyzed datasets; (✓) and (✗) denote the eye-tracking variants with and without skipped words, respectively.

| Dataset | # RTs | # Words | # Texts | # Readers |
|---|---|---|---|---|
| Brown | 136,907 | 6,907 | 13 | 35 |
| Natural Stories | 848,767 | 9,325 | 10 | 180 |
| Provo (✓) | 225,624 | 2,422 | 55 | 84 |
| Dundee (✓) | 463,236 | 48,404 | 20 | 9 |
| Provo (✗) | 125,884 | 2,422 | 55 | 84 |
| Dundee (✗) | 246,031 | 46,583 | 20 | 9 |

##### Brown Corpus.

This corpus, first presented in Smith and Levy (2013), consists of moving-window self-paced RTs of selections from the Brown corpus of American English. The subjects were 35 UCSD undergraduate native English speakers, each reading short (292–902 word) passages. Comprehension questions were asked after reading, and participants were excluded if their performance was at chance.

##### Natural Stories.

This corpus is based on 10 stories constructed to contain unlikely, but still grammatically correct, sentences. As it includes psychometric data on sentences with rare constructions, this corpus gives us a broader understanding of how different sentences are processed. Self-paced RTs on these texts were collected from 180 native English speakers. We use this dataset’s 2021 version, which includes fixes released by the authors.

##### Provo Corpus.

This dataset consists of 55 paragraphs of English text from various sources and genres. A high-resolution eye tracker (1000 Hz) was used to collect eye movement data while reading from 84 native speakers of American English.

##### Dundee Corpus.

We employ this corpus’ English portion, which contains eye-tracking recordings (1000 Hz) of 10 native English speakers. We drop all measurements from one of these readers (with ID sg) because, as reported by Smith and Levy (2013), they did not display any surprisal effects. Each participant read 20 newspaper articles from *The Independent*, with a total of 2,377 sentences.

### D Surprisal vs. Entropy

Surprisal and contextual entropy are bound to be strongly related, as one is the other’s expected value. To see the extent of their relation, we compute their Spearman correlation per dataset and display it in Figure 3. This figure shows that these values are indeed strongly correlated, and that Shannon’s entropy is more strongly correlated with the surprisal than the Rényi entropy with *α* = 1/2 is. Given that the Rényi entropy is in general a stronger predictor of RTs than the Shannon entropy, this finding provides further evidence that our results do not rely solely on the entropy “averaging out” the noise in our surprisal estimates.
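
A rank-based correlation like the one reported here can be computed as follows. This is a sketch with mock inputs (the real analysis would feed in per-token surprisal and entropy estimates); the helper assumes no tied values:

```python
import numpy as np

def spearman(x, y):
    """Spearman correlation via Pearson correlation of ranks (no ties assumed)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float(np.sum(rx * ry) / np.sqrt(np.sum(rx ** 2) * np.sum(ry ** 2)))

# Mock per-token estimates: any strictly monotone transform yields correlation 1.
surprisal = np.array([1.2, 3.4, 0.7, 2.9, 5.1])
entropy = np.log1p(surprisal)           # monotone in surprisal
print(spearman(surprisal, entropy))     # -> 1.0
```

Because Spearman correlation depends only on ranks, it measures the monotone association between the two predictors rather than their linear fit.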
