In a typical text, readers look much longer at some words than at others, even skipping many altogether. Historically, researchers explained this variation via low-level visual or oculomotor factors, but today it is primarily explained via factors determining a word’s lexical processing ease, such as how well word identity can be predicted from context or discerned from parafoveal preview. While the existence of these effects is well established in controlled experiments, the relative importance of prediction, preview and low-level factors in natural reading remains unclear. Here, we address this question in three large naturalistic reading corpora (n = 104, 1.5 million words), using deep neural networks and Bayesian ideal observers to model linguistic prediction and parafoveal preview from moment to moment in natural reading. Strikingly, neither prediction nor preview was important for explaining word skipping—the vast majority of explained variation was explained by a simple oculomotor model, using just fixation position and word length. For reading times, by contrast, we found strong but independent contributions of prediction and preview, with effect sizes matching those from controlled experiments. Together, these results challenge dominant models of eye movements in reading, and instead support alternative models that describe skipping (but not reading times) as largely autonomous from word identification, and mostly determined by low-level oculomotor information.

When reading a text, readers move their eyes across the page to bring new information to the centre of the visual field, where perceptual sensitivity is highest. While it may subjectively feel as if the eyes smoothly slide along the text, they in fact traverse the words with rapid jerky movements called saccades, followed by brief stationary periods called fixations. Across a text, saccades and fixations are highly variable and seemingly erratic: Some fixations last less than 100 ms, others more than 400; and while some words are fixated multiple times, many other words are skipped altogether (Dearborn, 1906; Rayner & Pollatsek, 1987). What explains this striking variation?

Historically, researchers have pointed to low-level non-linguistic factors like word length, oculomotor noise, or the relative position where the eyes happen to land (Bouma & de Voogd, 1974; Buswell, 1920; Dearborn, 1906; O’Regan, 1980). Such explanations were motivated by the idea that oculomotor control was largely autonomous. In this view, readers can adjust saccade lengths and fixation durations to global characteristics like text difficulty or reading strategy, but not to subtle word-by-word differences in language processing (Bouma & de Voogd, 1974; Buswell, 1920; Dearborn, 1906; Morton, 1964).

As reading was studied in more detail, however, it became clear that the link between eye movements and cognition was more direct. For instance, it was found that fixation durations were shorter for words with higher frequency (Inhoff, 1984; Rayner, 1977). Eye movements were even shown to depend on how well a word’s identity could be inferred before fixation. Specifically, researchers found that words are read faster and skipped more often if they are predictable from linguistic context (Balota et al., 1985; Ehrlich & Rayner, 1981) or if they are identifiable from a parafoveal preview (McConkie & Rayner, 1975; Rayner, 1975; Schotter et al., 2012).

These demonstrations of a direct link between eye movements and language processing overturned the autonomous view, replacing it by cognitive accounts describing eye movements during reading as largely, if not entirely, controlled by linguistic processing (Clifton et al., 2016; Reichle et al., 2003). Today, many studies still build on the powerful techniques like gaze-contingent displays that helped overturn the autonomous view, but now ask much more detailed questions, like whether word identification is a distributed or sequential process (Kliegl et al., 2006, 2007); how many words can be processed in the parafovea (Rayner et al., 2007); at which level they are analysed (Hohenstein & Kliegl, 2014; Pan et al., 2021), and how this may differ between writing systems or orthographies (Tiffin-Richards & Schroeder, 2015; Yan et al., 2010).

Here, we ask a different, perhaps more elemental question: how much of the variation in eye movements do linguistic prediction, parafoveal preview, and non-linguistic factors each explain? That is, how important are these factors for determining how the eyes move during reading? Dominant cognitive models explain eye movement variation primarily as a function of lexical processing. Skipping, for instance, is modelled as the probability that a word is identified before fixation (Engbert & Kliegl, 2003; Engbert et al., 2005; Reichle et al., 2003). Some, however, have questioned this purely cognitive view, suggesting that low-level features like word eccentricity or length might be more important (Brysbaert et al., 2005; Reilly & O’Regan, 1998; Vitu et al., 1995). One particularly relevant analysis comes from Brysbaert et al. (2005). Presenting a meta-analysis of aggregate effect sizes on word skipping, they argue that the effect of length and distance is so large that skipping may be driven not just by ongoing word identification but also, and indeed perhaps primarily, by low-level heuristics that are part of a simple scanning strategy. Similarly, one may ask what drives next-word identification: is identifying the next word mostly driven by linguistic predictions (Goodman, 1967) or by parafoveal perception? Remarkably, while it is well established that linguistic and oculomotor factors, and predictive and parafoveal processing, all affect eye movements (Brysbaert et al., 2005; Kliegl et al., 2004; Schotter et al., 2012; Staub, 2015), a comprehensive picture of their relative explanatory power is currently missing, perhaps because they are seldom studied all at the same time.

To arrive at such a comprehensive picture, we focus on natural reading, analysing three large datasets of participants reading passages, long articles, and even an entire novel, together encompassing 1.5 million (un)fixated words across 108 individuals (Cop et al., 2017; Kennedy, 2003; Luke & Christianson, 2018). We use a model-based approach: instead of manipulating word predictability or perturbing parafoveal perceptibility, we combine deep neural language modelling (Radford et al., 2019) and Bayesian ideal observer analysis (Duan & Bicknell, 2020) to quantify how much information about next-word identity is conveyed by both prediction and preview, on a moment-by-moment basis. Our model-based analysis is quite different from the experimental approach, especially in the case of parafoveal preview, which is generally studied with a boundary paradigm. However, the underlying logic is the same: in the boundary paradigm, eye movements are compared between conditions in which the preview is informative (valid) and conditions in which it conveys no (or incorrect) information about word identity. Following Bicknell and Levy (2010) and Duan and Bicknell (2020), we simply replace this categorical contrast with a more continuous analysis, quantifying the subtle word-by-word variation in the amount of information conveyed by the prior preview. In this sense, our approach can be seen as an extension and refinement of the seminal analyses by Brysbaert and colleagues: it allows us, for instance, to quantify not just the effect of word length on skipping but also, simultaneously, to estimate and control for the effect word length has on a word’s prior parafoveal identifiability.

In this way, our word-by-word, information-theoretic analysis brings us closer to the underlying mechanisms than analysing effect sizes in the aggregate. However, we want to stress we use these models as normative models to estimate how much information is in principle available from prediction and preview at each moment, but do not take these as processing models of human cognition (see Methods and Discussion for a more extensive comparison of our model-based approach and traditional methods). Such a broad-coverage model-based approach has been applied to predictability effects on reading before (Frank et al., 2013; Goodkind & Bicknell, 2018; Kliegl et al., 2004; Luke & Christianson, 2016; Shain et al., 2022; Smith & Levy, 2013), but either without considering preview or only through coarse heuristics such as using word frequency as a proxy for parafoveal identifiability (Kennedy et al., 2013; Kliegl et al., 2006; Pynte & Kennedy, 2006) (but see Duan & Bicknell, 2020). By contrast, we explicitly model both, in addition to low-level explanations like autonomous oculomotor control. To assess explanatory power, we use set theory to derive the unique and shared variation in eye movements explained by each model.

To preview the results, this revealed a striking dissociation between skipping and reading times. For word skipping, the overwhelming majority of explained variation could be explained—mostly uniquely explained—by a non-linguistic oculomotor model, that explained word skipping just as a function of a word’s distance to the prior fixation position and its length. These two low-level variables explained much more skipping variation than the degree to which a word was identifiable or predictable prior to fixation. For reading times, by contrast, we did find that factors determining a word’s lexical processing explained most variance. In line with dominant models, we found strong effects of both prediction and preview, matching effect sizes from controlled designs. Interestingly, prediction and parafoveal preview seem to operate independently: we found strong evidence against Bayes-optimal integration of the two. Together, these results support and extend the earlier conclusions of Brysbaert and colleagues, while challenging dominant cognitive models of reading, showing that skipping (or the decision of where to fixate) and reading times (i.e., how long to fixate) are governed by different principles, and that for word skipping, the link between eye movements and cognition is less direct than commonly thought.

We analysed eye movements from three large datasets of participants reading texts ranging from isolated paragraphs to an entire novel: Dundee (Kennedy, 2003) (N = 10, 51,502 words per participant), Geco (Cop et al., 2017) (N = 14, 54,364 words per participant) and Provo (Luke & Christianson, 2018) (N = 84, 2,689 words per participant). In each corpus, we analysed both skipping and reading times (indexed by gaze duration), as they are thought to reflect separate processes: the decision of where vs. how long to fixate, respectively (Brysbaert et al., 2005; Reichle et al., 2003). For more descriptive details about the data across participants and datasets, see Methods and Figures A.5–A.7.

To estimate the effect of linguistic prediction and parafoveal preview, we quantified the amount of information conveyed by both factors for each word in the corpus (for preview, this was tailored to each individual participant, since each word was previewed at a different eccentricity by each participant). To this end, we formalised both processes as a probabilistic belief about the identity of the next word, given either the preceding words (prediction) or a noisy parafoveal percept (preview; see Figure 1A). As such, we could describe these disparate cognitive processes using a common information-theoretic currency. To compute the probability distributions, we used GPT-2 for prediction (Radford et al., 2019) and a Bayesian ideal observer for preview (Duan & Bicknell, 2020) (see Figure 1B and Methods). Note that we use both computational models as normative models: tools to estimate how much information is in principle available from linguistic context (prediction) or parafoveal perception (preview) on a moment-by-moment basis. In other words, we use these models much in the same way as we rely on the counting algorithms used to aggregate lexical frequency statistics: in both cases we are interested in the computed statistic (e.g., lexical surprisal or entropy, or lexical frequency) but we do not want to make any cognitive claim about the underlying algorithm that we happened to use to compute this statistic (e.g., GPT-2 for lexical surprisal, or a counting algorithm for lexical frequency). For more details on the exact choice, and relation to alternative metrics (e.g., GPT-3 or cloze probabilities) see Methods and Discussion.
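
The two information-theoretic quantities used throughout (lexical surprisal for prediction, entropy for preview) can be illustrated with a minimal sketch. The distributions below are invented toy numbers, not output of GPT-2 or of the ideal observer, and the function names are hypothetical.

```python
import math

def surprisal_bits(prob):
    """Surprisal (in bits) of an outcome that was assigned probability `prob`."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("prob must be in (0, 1]")
    return -math.log2(prob)

def entropy_bits(dist):
    """Shannon entropy (in bits) of a distribution given as {word: probability}."""
    total = sum(dist.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return -sum(p * math.log2(p) for p in dist.values() if p > 0.0)

# Hypothetical beliefs about the identity of the next word (toy numbers).
prediction = {"cat": 0.5, "car": 0.25, "cap": 0.25}              # given preceding words
preview = {"cat": 0.25, "car": 0.25, "cap": 0.25, "can": 0.25}   # given a noisy percept

print(surprisal_bits(prediction["cat"]))  # 1.0
print(entropy_bits(prediction))           # 1.5
print(entropy_bits(preview))              # 2.0
```

Lower surprisal means the word that actually occurred was more predictable; lower entropy means less remaining uncertainty about word identity.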

Figure 1.

Quantifying two types of context during natural reading. (A) Readers can infer the identity of the next word before fixation either by predicting it from context or by discerning it from the parafovea. Both can be cast as a probabilistic inference about the next word, either given the preceding words (prediction, blue) or given a parafoveal percept (preview, orange). (B) To model prediction, we use GPT-2, one of the most powerful publicly available language models (Radford et al., 2019). For preview, we use an ideal observer (Duan & Bicknell, 2020) based on well-established ‘Bayesian Reader’ models (Bicknell & Levy, 2010; Norris, 2006, 2009). Importantly, we do not use either model as a cognitive model per se, but rather as a tool to quantify how much information is in principle available from prediction or preview on a moment-by-moment basis.

Prediction and Preview Increase Skipping Rates and Reduce Reading Times

We first asked whether our formalisations allowed us to observe the expected effects of prediction and preview, while statistically controlling for other explanatory variables. This was done by performing a multiple regression analysis, and statistically testing whether the coefficients were in the expected direction. Word skipping was modelled with a logistic regression, whereas reading times (gaze durations) were modelled with ordinary least squares regression. Because the decisions of whether to skip and how long to fixate a word are made at different moments, when different types of information are available, we modelled each separately with a different set of explanatory variables. In both cases, for inference on the coefficients, we considered the full model (variables motivated and detailed below; see Tables A.1 and A.2 for a tabular overview of all variables).
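
To make the logistic part of this analysis concrete, the following is a minimal sketch that fits skipping as a function of a single predictor (word length) by gradient ascent. It is an illustration only: the data are invented, the fitting routine is a plain textbook one, and the full model described here contains many more explanatory variables.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Fit p(skip) = sigmoid(b0 + b1 * x) by gradient ascent on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)  # residual on the probability scale
            g0 += err
            g1 += err * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Hypothetical toy data: word length (characters) and whether the word was skipped.
lengths = [2, 2, 3, 3, 3, 4, 5, 5, 6, 6, 7, 8]
skipped = [1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0]

mean_len = sum(lengths) / len(lengths)
centred = [x - mean_len for x in lengths]  # centring keeps the fit well conditioned
b0, b1 = fit_logistic(centred, skipped)

def p_skip(length):
    return sigmoid(b0 + b1 * (length - mean_len))

print(b1 < 0)                 # True: longer words are skipped less in this toy data
print(p_skip(2) > p_skip(8))  # True
```

Testing whether a coefficient is "in the expected direction" then amounts to checking its sign, with a resampling procedure supplying the uncertainty.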

As expected, we found in all datasets that words were more likely to be skipped if there was more information available from the linguistic prediction (Bootstrap: Dundee, p = 0.023; Geco, p = 0.034; Provo, p < 10⁻⁵) and/or the parafoveal preview (Bootstrap: Dundee, p = 4 × 10⁻⁵; Geco, p < 10⁻⁵; Provo, p < 10⁻⁵). Similarly, reading times were reduced for words that were more predictable (all p’s < 3.2 × 10⁻⁴) or more identifiable from the parafovea (all p’s < 4 × 10⁻⁵). Together, this confirms that our model-based approach can capture the expected effects of both prediction (Clifton et al., 2016) and preview (Schotter et al., 2012) in natural reading, while statistically controlling for other variables.

Word Skipping is Largely Independent of Online Lexical Processing

After confirming that prediction and preview had a statistically significant influence on word skipping and reading times, we went on to assess their relative explanatory power. That is, we asked how important these factors were, by examining how much variance was explained by each. To this end, we grouped the variables from the full regression model into different types of explanations, and assessed how well each type accounted for the data, in terms of the unique and overlapping amount of variation explained by each explanation. This in turn was measured by the cross-validated R² for reading times, and R²_McF for skipping, which both quantify the proportion of variation explained (see Methods).
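
As a rough sketch of the two scores, the functions below compute R² for a continuous outcome and a pseudo-R² for a binary outcome on held-out data. This assumes that the skipping measure is McFadden's pseudo-R² (one minus the ratio of model to null log-likelihood); the function names and all numbers are hypothetical.

```python
import math

def r_squared(y_true, y_pred, y_train_mean):
    """Proportion of held-out variance explained, relative to predicting the training mean."""
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - y_train_mean) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot

def bernoulli_log_likelihood(y_true, p_pred):
    """Log-likelihood of binary outcomes under predicted probabilities."""
    return sum(math.log(p if y == 1 else 1.0 - p) for y, p in zip(y_true, p_pred))

def mcfadden_r_squared(y_true, p_model, p_null):
    """McFadden's pseudo-R²: 1 - LL(model) / LL(null), here on held-out data."""
    return 1.0 - (bernoulli_log_likelihood(y_true, p_model)
                  / bernoulli_log_likelihood(y_true, p_null))

# Hypothetical held-out gaze durations (ms) and model predictions.
gaze = [200, 250, 300, 250]
gaze_pred = [210, 240, 290, 260]
print(round(r_squared(gaze, gaze_pred, y_train_mean=250), 3))  # 0.92

# Hypothetical held-out skipping outcomes; the null model predicts the base rate.
skip = [1, 0, 0, 1]
p_model = [0.8, 0.2, 0.2, 0.8]
p_null = [0.5, 0.5, 0.5, 0.5]
print(round(mcfadden_r_squared(skip, p_model, p_null), 3))  # 0.678
```

Both scores are 0 for a model that does no better than the baseline, and both can become negative on held-out data when a model overfits.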

For skipping, we considered three explanations. First, a word might be skipped purely because it could be predicted from context, that is, purely as a function of the amount of information about word identity conveyed by the prediction. Second, a word might be skipped because its identity could be gleaned from a parafoveal preview, that is, purely as a function of the amount of information about word identity conveyed by the preview. Finally, a word might be skipped simply because it is so short or so close to the prior fixation location that an autonomously generated saccade will likely overshoot it, irrespective of its linguistic properties; in other words, purely as a function of length and eccentricity. Note that we did not include often-used lexical attributes like frequency to predict skipping, because using attributes of word n+1 already presupposes parafoveal identification. Moreover, to the extent that a lexical attribute like frequency might influence a word’s parafoveal identifiability, this should already be captured by the parafoveal entropy (see Figure A.3 and Methods for more details).

For each word, we thus modelled the probability of skipping either as a function of prediction, preview, or oculomotor information (i.e., eccentricity and length), or by any combination of the three. Then we partitioned the unique and shared cross-validated variation explained by each account. Strikingly, this revealed that the overwhelming majority of explained skipping variation (94%) could be accounted for by the oculomotor baseline that consisted just of eccentricity and length (Figure 2). Moreover, the majority of the variation was only explained by the baseline, which explained 10 times more unique variation than prediction and preview combined. There was a large degree of overlap between preview and the oculomotor baseline, which is unsurprising since a word’s identifiability decreases as a function of its eccentricity and length. Interestingly, there was even more overlap between the prediction and baseline model: almost all skipping variation that could be explained by contextual constraint could be equally well explained by the oculomotor baseline factors.
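
One standard way to carry out such a partition is inclusion-exclusion over the scores of all seven model combinations. The sketch below assumes that approach; the function name is hypothetical and the scores are invented toy values, not the results reported here.

```python
def partition_variation(r2):
    """Split explained variation into the seven regions of a three-set Venn diagram.

    `r2` maps a frozenset of explanation names to the proportion of variation
    explained by the model using exactly those explanations. All seven
    non-empty combinations must be present.
    """
    names = sorted(set().union(*r2))
    if len(names) != 3:
        raise ValueError("expected exactly three explanations")

    def score(*keys):
        return r2[frozenset(keys)]

    full = score(*names)
    parts = {}
    # Unique to x: what the full model loses when x is left out.
    for x in names:
        y, z = [n for n in names if n != x]
        parts[(x,)] = full - score(y, z)
    # Shared by x and y only (not by the third explanation z).
    for x, y in [(names[0], names[1]), (names[0], names[2]), (names[1], names[2])]:
        (z,) = [n for n in names if n not in (x, y)]
        parts[(x, y)] = score(x, z) + score(y, z) - score(z) - full
    # Shared by all three.
    a, b, c = names
    parts[(a, b, c)] = (score(a) + score(b) + score(c)
                        - score(a, b) - score(a, c) - score(b, c) + full)
    return parts

# Hypothetical scores for each model combination (toy values).
toy = {
    frozenset(["oculomotor"]): 0.077,
    frozenset(["prediction"]): 0.008,
    frozenset(["preview"]): 0.025,
    frozenset(["oculomotor", "prediction"]): 0.078,
    frozenset(["oculomotor", "preview"]): 0.079,
    frozenset(["prediction", "preview"]): 0.030,
    frozenset(["oculomotor", "prediction", "preview"]): 0.080,
}
for region, value in partition_variation(toy).items():
    print(region, round(value, 3))
```

The seven regions sum to the score of the full model, so each can be expressed as a share of the total explained variation.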

Figure 2.

Variation in skipping explained by predictive, parafoveal and autonomous oculomotor processing. (A) Proportions of cross-validated variation explained by prediction (blue), preview (orange), oculomotor baseline (grey) and their overlap; averaged across datasets (each dataset weighted equally). (B) Variation partitions for each individual dataset, including statistical significance of variation uniquely explained by predictive, parafoveal or oculomotor processing. Stars indicate significance levels of the cross-validated unique variation explained (bootstrap t-test against zero): p < 0.05 (*), p < 0.01 (**), p < 0.001 (***). For results of individual participants, and their consistency, see Figure A.9.

Importantly, while the contribution of prediction and preview was small, it was significant both for prediction (Dundee: 0.015%, bootstrap 95CI: 0.003–0.029%; bootstrap t-test compared to zero, p = 0.014; Geco: 0.039%, 95CI: 0.018–0.065%; p = 0.0001; Provo: 0.20%; 95CI: 0.14–0.28%, p < 10⁻⁵) and preview (Dundee: 2.14%, 95CI: 1.66–2.60%; p < 10⁻⁵; Geco: 1.71%, 95CI: 1.20–2.29%, p < 10⁻⁵; Provo: 0.56%, 95CI: 0.36–0.79%, p < 10⁻⁵), confirming that both factors do affect skipping. Crucially, however, the vast majority of skipping that could be explained by either prediction or preview was equally well explained by the more low-level and computationally frugal oculomotor model, which also explained much more of the skipping data overall. This challenges the idea that word identification is the main driver behind skipping, instead pointing to a more low-level, computationally simpler strategy.
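
Intervals of this kind can be illustrated with a percentile bootstrap over participants. The sketch below is a simpler stand-in, not the bootstrap t-test reported above: the resampling unit, the percentile method, the function name and the per-participant values are all assumptions made for illustration.

```python
import random

def bootstrap_mean_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean, plus the share of resampled means <= 0."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    means = []
    for _ in range(n_boot):
        sample = [values[rng.randrange(n)] for _ in range(n)]  # resample with replacement
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    share_le_zero = sum(m <= 0 for m in means) / n_boot
    return lo, hi, share_le_zero

# Hypothetical unique variation explained (%) for ten participants.
unique_pct = [0.010, 0.022, 0.015, 0.008, 0.019, 0.013, 0.027, 0.011, 0.016, 0.009]
lo, hi, share = bootstrap_mean_ci(unique_pct)
print(lo < sum(unique_pct) / len(unique_pct) < hi)  # True
print(share)                                         # 0.0, since every value is positive
```

An interval that excludes zero corresponds to the claim that the unique contribution, however small, is reliably above zero.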

What might this simpler strategy be? One possibility is a ‘blind’ random walk: generating saccades of some average length, plus oculomotor noise. However, we find that saccades are tailored to word length and exhibit a well-known preferred landing position, slightly left of a word’s centre (see Figure A.8; compare McConkie et al., 1988; Rayner, 1979). This suggests the decision of where to look next is not ‘blind’ but is based on a coarse low-level visual analysis of the parafovea, for instance conveying just the location of the next word ‘blob’ within a preferred range (i.e., skipping words too close or short; cf. Brysbaert et al., 2005; Deubel et al., 2000; Reilly & O’Regan, 1998). Presumably, such a simple strategy would on average sample visual input conveniently, yielding saccades large enough to read efficiently but small enough for comprehension to keep track. However, if such an ‘autopilot’ is indeed largely independent of online comprehension, one would expect it to occasionally go out of step, such that a skipped word cannot be recognised or guessed, derailing comprehension. In line with this suggestion, we find evidence for a compensation strategy. The probability that an initially skipped word is subsequently (regressively) fixated is significantly and inversely related to its parafoveal identifiability before skipping (see Figure A.10; logistic regression to prior parafoveal entropy: all β’s > 0.15; bootstrap test on coefficients: all p’s < 10⁻⁵). Together, this suggests that initial skipping decisions are primarily driven by a low-level oculomotor ‘autopilot’, which is kept in line with online comprehension by correcting saccades that outrun word recognition (much in line with the suggestions by Brysbaert et al., 2005).

Reading Times are Strongly Modulated by Lexical Processing Difficulty

For reading times (defined as gaze durations, so considering foveal reading time only), we similarly considered three broad explanations. First, a word might be read faster because it was predictable from the preceding context, which we formalised via lexical surprisal. Second, a word might be read faster if it could already be partly identified from the parafoveal preview (before fixation). This informativeness of the preview was again formalised via the parafoveal preview entropy. Finally, a word might be read faster due to non-contextual attributes of the fixated word itself, such as frequency or word-class or the viewing position. This last explanatory factor functioned as a baseline that captured key non-contextual attributes, both linguistic and non-linguistic (see Methods).

In all datasets, we again found that all explanations accounted for some unique variation: prediction (Dundee: 0.80%, bootstrap 95CI: 0.55–1.09%, bootstrap t-test compared to zero: p < 6−5; Geco: 0.68%, 95CI: 0.55–0.83%; p = 0.0001; Provo: 0.35%, 95CI: 0.20–0.43%, p < 10⁻⁵), preview (Dundee: 1.91%, 95CI: 1.00–3.14%, p = 0.00012; Geco: 1.59%, 95CI: 0.96–2.30%, p = 5 × 10⁻⁵; Provo: 0.93%, 95CI: 0.70–1.98%, p < 10⁻⁵) and the non-contextual word attributes (Dundee: 8.06%, 95CI: 5.84–10.32%, p = 5 × 10⁻⁵; Geco: 1.99%, 95CI: 1.32–2.81%, p < 10⁻⁵; Provo: 5.38%, 95CI: 4.48–6.83%, p < 10⁻⁵).

The non-contextual baseline explained the most variance, which shows, unsurprisingly, that properties of the fixated word itself are more important than contextual factors in determining how long a word is fixated. Critically, however, the unique contribution of prediction and preview was more than three times higher than it was for skipping (see Figure 3). Specifically, while prediction and preview could only uniquely account for 6% of explained word skipping variation, they uniquely accounted for more than 18% of explained variation in reading times. This suggests that while for skipping most explained variation can be accounted for by purely oculomotor variables, this is not the case for reading times.

Figure 3.

Variation in reading times explained by predictive, parafoveal and non-contextual information. (A) Grand average of partitions of cross-validated variance in reading times (indexed by gaze durations) across datasets (each dataset weighted equally) explained by non-contextual factors (grey), parafoveal preview (orange), and linguistic prediction (blue). (B) Variance partitions for each individual dataset, including statistical significance of the cross-validated variance explained uniquely by the predictive, parafoveal or non-contextual explanatory variables. Stars indicate significance levels of the cross-validated unique variance explained (bootstrap t-test against zero): p < 0.01 (**), p < 0.001 (***). For results of individual participants, see Figure A.11. Note that the baseline model here includes both lexical attributes (e.g., frequency) and oculomotor factors (relative viewing/landing position). For a direct contrast between lexical processing-based explanations and purely oculomotor explanations, see Figure 4.

However, this comparison (between oculomotor and lexical processing-based accounts) is difficult to make based on Figures 2 and 3 alone. This is because in the reading times analysis, the baseline model contained both oculomotor (i.e., viewing position) and lexical factors (notably lexical frequency). Therefore, we performed an additional analysis, grouping the explanatory variables differently to contrast purely oculomotor explanatory variables versus variables affecting lexical processing ease (such as predictability, parafoveal identifiability, and lexical frequency; see Tables A.3 and A.4). This shows that for skipping, purely oculomotor explanations can account for much more than a lexical processing-based explanation; for reading times, it is exactly the other way around (Figure 4). Note that in Figure 4, the oculomotor model for reading times only contains variables quantifying viewing/landing position, because this is the primary oculomotor explanation for reading time differences (O’Regan, 1980, 1992; Vitu et al., 1990). If we also include word length in the oculomotor model for reading times, there is much more overlapping variance explained by the lexical and oculomotor model, presumably due to the correlation between word length and (log)frequency, which may inflate the importance of the oculomotor account (see Figure A.13). However, even with this potentially inflated estimate, the overall dissociation persists: if we compare the ratios of unique variation explained by oculomotor vs. lexical processing-based models, there is still more than a 30-fold difference between the skipping and reading times analysis (in Figure A.13A). Together, this supports the conclusion that for skipping, most explained variation is captured by purely oculomotor rather than lexical processing-based explanations, whereas for reading times, it is the other way around.

Figure 4.

Comparing oculomotor and lexical processing-based explanations for skipping and reading times. Analysis with the same explanatory variables as Figures 2 and 3, grouped differently to directly contrast purely oculomotor explanatory variables and those that affect lexical processing ease (such as predictability, parafoveal identifiability, and lexical frequency; see Methods). Venn diagrams represent the proportions of unique and overlapping amount of explained variation (in R² for reading times, and R²_McF for skipping) by each explanation (grand average across datasets). For partitions for individual datasets with statistics, see Figure A.12; for an alternative partitioning that includes word length in the oculomotor model of reading times, see Figure A.13.

Model-Based Estimates of Naturalistic Prediction and Preview Benefits Match Experimental Effect Sizes

The reading times results confirm that reading times are highly sensitive to factors influencing a word’s lexical processing ease, including contextual factors like linguistic and parafoveal context. This is in line with the scientific consensus and decades of experimental research on eye movements in reading (Rayner, 2009). But how well do our model-based, correlational results compare exactly to findings from the experimental literature?

To directly address this question, we quantitatively derived, for each participant, the size of two well-established effects that we would expect to obtain if we were to conduct a well-controlled factorial experiment. While we did not actually perform such a factorial experiment, we can derive this from the regression model, because we quantitatively estimated how much additional information from either prediction or preview (in bits) reduced reading times (in milliseconds). Therefore, the regression analysis allows us to estimate the expected difference in reading times for words that are expected vs. unexpected (predictability benefit; Rayner & Well, 1996; Staub, 2015) or have valid vs. invalid preview (i.e., preview benefit; Schotter et al., 2012).
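
The arithmetic behind such a derived effect size can be sketched as a slope (milliseconds per bit) multiplied by a difference in information (bits) between two conditions. The slopes, probabilities and entropies below are hypothetical placeholders rather than fitted values, and the sketch assumes the effect is linear in bits.

```python
import math

def expected_benefit_ms(slope_ms_per_bit, bits_hard, bits_easy):
    """Expected reading-time difference (ms) between a harder and an easier condition."""
    return slope_ms_per_bit * (bits_hard - bits_easy)

# Predictability benefit: low- vs. high-probability word (hypothetical values).
slope_surprisal = 4.0                      # ms of gaze duration per bit of surprisal
surprisal_low = -math.log2(2 ** -9)        # an unexpected word: 9 bits
surprisal_high = -math.log2(0.5)           # an expected word: 1 bit
print(round(expected_benefit_ms(slope_surprisal, surprisal_low, surprisal_high), 6))  # 32.0

# Preview benefit: no preview vs. a preview of average informativeness.
slope_entropy = 3.0                        # ms per bit of remaining preview uncertainty
entropy_no_preview = 10.0                  # bits of uncertainty without any preview
entropy_avg_preview = 4.0                  # bits of uncertainty after an average preview
print(expected_benefit_ms(slope_entropy, entropy_no_preview, entropy_avg_preview))  # 18.0
```

Computed per participant, such values can be placed on the same millisecond scale as benefits reported in factorial experiments.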

Interestingly, the model-derived effect sizes are very well in line with those observed in experimental studies (see Figure 5). This suggests that our analysis does not strongly underfit or otherwise underestimate the effect of prediction or preview. Moreover, it shows that the effect sizes, which are well-established in controlled designs, generalise to natural reading. This last point is especially interesting for the preview benefit, because it implies that the effect can be largely explained in terms of parafoveal lexical identifiability (Pan et al., 2021; Rayner, 2009), and that other factors such as low-level visual ‘preprocessing’, or interference between the (invalid) parafoveal percept and foveal percept, may only play a minor role (cf. Reichle et al., 2003; Schotter et al., 2012).

Figure 5.

Model-derived effect sizes match experimentally observed effect sizes. Preview (left) and predictability benefits (right) inferred from our analysis of each dataset, and observed in a sample of studies (see Table A.5). In this analysis, preview benefit was derived from the regression model as the expected difference in gaze duration after a preview of average informativeness versus after no preview at all. Predictability benefit was defined as the difference in gaze duration for high versus low probability words; ‘high’ and ‘low’ were defined by subdividing the cloze probabilities from Provo into equal thirds of ‘low’, ‘medium’ and ‘high’ probability (see Methods). In each plot, small dots with dark edges represent either individual subjects within one dataset or individual studies in the sample of the literature; larger dots with error bars represent the mean effect across individuals or studies, plus the bootstrapped 99%CI.


No Integration of Prediction and Preview

So far, we have treated prediction and preview as being independent. However, it might be that these processes, while using different information, are integrated—such that a word is parafoveally more identifiable when it is also more predictable in context. Bayesian probability theory proposes an elegant and mathematically optimal way to integrate these sources of information: the prediction of the next word could be incorporated as a prior in perceptual inference. Such a contextual prior fits into hierarchical Bayesian models of vision (Lee & Mumford, 2003), and has been observed in speech perception, where a contextual prior guides the recognition of words from a partial sequence of phonemes (Brodbeck et al., 2022; Heilbron et al., 2022). Does such a prior also guide word recognition in reading, based on a partial parafoveal percept?

To test this, we recomputed the parafoveal identifiability of each word for each participant, but now with an ideal observer using the prediction from GPT-2 as a prior. As expected, Bayesian integration enhanced perceptual inference: on average, the observer using linguistic prediction as a prior extracted more information from the preview (approximately 6.25 bits) than the observer not taking the prediction into account (approximately 4.30 bits; $T_{1.39 \times 10^{6}} = 1.35 \times 10^{11}$, p ≈ 0). Interestingly, however, it provided a worse fit to the human reading data. This was established by comparing two versions of the full regression model: one with parafoveal entropy from the (theoretically superior) contextual ideal observer and one with parafoveal entropy from the non-contextual ideal observer. In all datasets, both skipping and reading times were better explained by a model including parafoveal identifiability from the non-contextual observer (skipping: all p’s < $10^{-5}$; reading times: p’s < $10^{-5}$; see Figure 6). This replicates Duan and Bicknell (2020), who performed a similar analysis comparing a contextual (5-gram) and non-contextual prior in natural reading. Our findings significantly extend theirs, since Duan and Bicknell (2020) only analysed skipping in the Dundee corpus. Our analysis not only investigates additional datasets but also finds exactly the same result for reading times (for which the importance of both prediction and preview is decidedly larger).

Figure 6.

Evidence against Bayesian integration of linguistic prediction and parafoveal preview. Cross-validated prediction performance of the full reading times (top) and skipping (bottom) model (including all variables), equipped with parafoveal preview information either from the contextual observer or from the non-contextual observer. Dots with connecting lines indicate participants; stars indicate significance: p < 0.001 (***).


Together, this suggests that while both linguistic prediction and parafoveal preview influence online reading behaviour, the two sources of information are not integrated but instead operate independently, highlighting a remarkable sub-optimality in reading.

Eye movements during reading are highly variable. Across three large datasets, we assessed the relative importance of different explanations for this variability. In particular, we quantified the importance of two major contextual determinants of a word’s lexical processing difficulty—linguistic prediction and parafoveal preview—and compared such lexical processing-based explanations to alternative (non-linguistic) explanations. This revealed a stark dissociation: for word skipping, a simple low-level oculomotor model (using just word length and distance to prior fixation location) could account for much more (unique) variation than lexical processing-based explanations, whereas for reading times, it was exactly the other way around. Interestingly, preview effects were best captured by a non-contextual observer, suggesting that while readers use both linguistic prediction and preview, these do not appear to be integrated on-line. Together, the results underscore the dissociation between skipping and reading times, and show that for word skipping, the link between eye movements and cognition is less direct than commonly thought.

Our results on skipping strongly support the earlier findings and theoretical perspective by Brysbaert et al. (2005). They analysed effect sizes from studies on skipping and found a disproportionately large effect of length, compared to proxies of processing-difficulty like frequency and predictability. We significantly extend their findings by modelling skipping itself (rather than effect sizes from studies) and making a direct link to processing mechanisms. For instance, based on their analysis it was unclear how much of the length effect could be attributed to the lower visibility of longer words—that is, how much of the length effect may be an identifiability effect (Brysbaert et al., 2005, p. 19). We show that length and eccentricity alone explained three times as much variation as parafoveal identifiability—and that most of the variation explained by identifiability was equally well explained by length and eccentricity. This demonstrates that length and eccentricity themselves—not just to the extent they reduce identifiability—are key drivers of skipping.

This conclusion challenges dominant, cognitive models of eye movements, which describe lexical identification as the primary driver behind skipping (Engbert & Kliegl, 2003; Engbert et al., 2005; Reichle et al., 2003). Importantly, our results do not challenge predictive or parafoveal word identification itself. Rather, they challenge the notion that moment-to-moment decisions of whether to skip individual words are primarily driven by the recognition of those words. Instead, our results suggest a simpler strategy in which a coarse (e.g., dorsal stream) visual representation is used to reflexively select the next saccade target following the simple heuristic to move forward to the next word ‘blob’ within a certain range (see also Brysbaert et al., 2005; Deubel et al., 2000; Reilly & O’Regan, 1998).

Given that readers use both prediction and preview, why would they strongly affect reading times but hardly word skipping? We suggest this is because these different decisions—of where versus how long to fixate—are largely independent and are made at different moments (Findlay & Walker, 1999; Hanes & Schall, 1995; Schall & Cohen, 2011). Specifically, the decision of where to fixate—and hence whether to skip the next word—is made early in saccade programming, which can take 100–150 ms (Becker & Jürgens, 1979; Brysbaert et al., 2005; Hanes & Schall, 1995). Although the exact sequence of operations leading to a saccade remains debated, given that readers on average only look some 250 ms at a word, it is clear that skipping decisions are made under strong time constraints, especially given the lower processing rate of parafoveal information. We suggest that the brain meets this constraint by resorting to a computationally frugal ‘move forward’ policy. How long to fixate, by contrast, depends on saccade initiation. This process is separate from target selection, as indicated by physiological evidence that variation in target selection time only weakly explains variation in initiation times, which are affected by more factors and can be adjusted later (Findlay & Walker, 1999; Schall & Cohen, 2011). This can allow initiation to be informed by foveal information, which is processed more rapidly and may thus more directly influence the decision to either keep dwelling or execute the saccade.

One simplifying assumption we made is that during natural reading, the relative importance of identification-related processes (like prediction and preview) and oculomotor processing is relatively stable within a single reader. However, it might be that underneath the average, aggregate relative importance we estimated, there is variability between specific moments in a text, or even within a sentence, during which the relative importance might be quite different. One such moment could be sentence transitions, where due to end-of-sentence ‘wrap up’ effects (Andrews & Veldre, 2021; Just & Carpenter, 1980), the relative importance of, for instance, preview may be reduced. Here we did not treat sentence transitions (or other such moments) as special, but looking into the possibility of moment-to-moment variability in relative importance is an interesting avenue for future research.

A distinctive feature of our analysis is that we focus on a few sets of computationally explicit variables, each forming a coherent explanation, and quantify the shared and unique variation accounted for by each explanation. The advantage of this approach is interpretability. However, a limitation of the partitioning analysis is that it is not always possible to add all potentially statistically significant variables (or interactions) to the regression, because partitioning requires that each variable can be assigned to a single explanation. When this is not possible (e.g., when a variable is (indirectly) associated with multiple explanations) this requires making a decision. Either the variable is omitted and the regression may not capture all explainable variance. Alternatively, the variable is assigned to just one explanation, which may distort the results by inflating the importance of that explanation.
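To illustrate the logic of partitioning for two explanations, the sketch below uses standard commonality-analysis arithmetic: the unique contribution of one explanation is the drop in explained variation when it is removed from the joint model, and the shared part is what remains. The function name and the $R^2$ values are hypothetical; the procedure actually used is described in the Methods.

```python
def partition_two_sets(r2_a: float, r2_b: float, r2_ab: float) -> dict:
    """Split explained variation of a joint model into unique and shared parts.

    r2_a, r2_b: (cross-validated) explained variation of each explanation alone.
    r2_ab:      explained variation of the model containing both explanations.
    """
    return {
        "unique_a": r2_ab - r2_b,          # lost when explanation A is removed
        "unique_b": r2_ab - r2_a,          # lost when explanation B is removed
        "shared": r2_a + r2_b - r2_ab,     # explained equally well by either
    }


# Hypothetical values, for illustration only.
parts = partition_two_sets(r2_a=0.03, r2_b=0.02, r2_ab=0.04)
for name, value in parts.items():
    print(f"{name}: {value:.3f}")
```

The three parts always sum to the explained variation of the joint model, which is why a variable that cannot be assigned to a single explanation cannot be accommodated without a decision.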

A primary example of a variable requiring such a decision is (log-)frequency in the context of skipping. Frequency is sometimes used to predict skipping, as a proxy for a word’s parafoveal identifiability. However, this relationship is indirect, and (log-)frequency is also (and much more strongly) correlated with length, and is hence also associated with the oculomotor explanation. Therefore, if one uses frequency as a proxy for parafoveal identifiability, one may find apparent preview effects which are in fact length effects, and strongly overestimate preview importance (Brysbaert & Drieghe, 2003). To avoid such overestimates, especially because the effect of frequency on identifiability should already be captured by the Ideal Observer (see Figure A.3 and Methods), we did not include frequency in our skipping analysis, nor did we include any other attribute that is sometimes used as a ‘proxy’ for either prediction/constraint or preview. A conceptually related problem is posed by interactions between variables from different explanations, such as between prediction/preview entropy and oculomotor predictors. These are impossible to assign to a single explanation, and were hence excluded from the regression.

As a result, the regression model did not include some variables or interactions that were used by prior regression analyses of skipping or reading times. This means that our regression may leave some explainable variation unexplained, and that our importance estimates are specific to the variables we consider, and our modelling thereof. However, we believe this limitation trades off favourably against the advantages afforded by the analysis, because for both skipping and reading times (1) we included all the factors deemed most important by prior regression-based studies (e.g., Duan & Bicknell, 2020; Hahn & Keller, 2023; Kliegl et al., 2006); (2) the amount of overall (cross-validated) explained variation is in line with prior regression-based analyses (e.g., Duan & Bicknell, 2020; Kliegl et al., 2006); and (3) our model-based effect sizes of prediction and preview effects are well in line with those from the experimental literature, suggesting our modelling of prediction or preview does not significantly fail to capture major aspects of either (Figure 5). In sum, we therefore do not believe that our selective and computationally explicit regression analysis significantly underestimates major factors of importance, and we are optimistic that our analysis yielded the comprehensive, interpretable picture that we aimed for.

To quantify predictability (surprisal) and constraint (lexical entropy) we used a neural language model (GPT-2), instead of the more standard cloze procedure. The reason for this is that we are interested in natural texts, where many words will have relatively low predictability values (e.g., below p = 0.01) which are inherently difficult to estimate in a cloze task. Since the effect of word predictability is logarithmic (Shain et al., 2022; Smith & Levy, 2013) the differences between small probabilities (e.g., between p = 0.001 and p = 0.0001) can have non-negligible effects, which is why for natural texts language models are superior to cloze metrics to capture predictability effects. Since the PROVO corpus includes cloze probability for every word, we could confirm this empirically, finding that model-derived surprisal indeed predicts reading times much better (Figure A.1).
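A short worked example shows why small probabilities matter under a logarithmic effect. Surprisal is the negative log probability; the helper name below is ours, and the probabilities are the two illustrative values mentioned above.

```python
import math


def surprisal_bits(p: float) -> float:
    """Surprisal of an event with probability p, in bits."""
    return -math.log2(p)


low = surprisal_bits(0.001)        # roughly 9.97 bits
very_low = surprisal_bits(0.0001)  # roughly 13.29 bits
diff = very_low - low              # equals log2(10), roughly 3.32 bits

print(f"{low:.2f} bits vs {very_low:.2f} bits, difference {diff:.2f} bits")
```

On a linear probability scale these two words differ by less than 0.001, yet in surprisal they differ by more than 3 bits, a difference that a cloze task with a realistic number of respondents cannot resolve.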

We used this specific language model (GPT-2) simply because it was among the best publicly available ones, and prior work demonstrates that better language models (measured in perplexity) also predict human reading behaviour better (Goodkind & Bicknell, 2018; Wilcox et al., 2020). This raises the question whether an even better model (e.g., GPT-3, GPT-4, GPT-5, etc.) could predict human behaviour even better, and whether this might change the results. However, we do not believe this is likely. First, compared to the increases in quality (decreases in perplexity) from n-grams to GPT, further model improvements will be very subtle when quantified in the aggregate, and since reading behaviour is itself a noisy metric it is not obvious if such improvements will have a measurable impact. Second, one recent study even suggested that models larger than GPT-2 (GPT-J and GPT-3) predicted reading slightly worse, perhaps due to super-human memorisation capacities (Shain et al., 2022). In short, we used GPT-2 simply because it is a strong measure of lexical predictability (in English). Our analyses do not depend on GPT-2 specifically, in the same way that we do not believe the results would change if we had used different (but similar quality) lexical frequency statistics.

One apparent complication is that the skipping and reading times analyses use different metrics for explained variation ($R^2$ and $R^2_{\mathrm{McF}}$). This is due to the difference between continuous and discrete dependent variables. As a result, directly numerically comparing the two (e.g., interpreting 4% $R^2$ as ‘less’ than 5% $R^2_{\mathrm{McF}}$) is difficult. However, our comparisons between skipping and reading times are not based on such absolute, numerical comparisons. Instead, the conclusions only rely on comparing the relative importance of different explanations, that is, on comparing the relative size and overlap of the Venn diagrams in Figures 2, 3 and 4 (and hence only directly comparing quantities of the same metric).
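For readers less familiar with the pseudo-$R^2$ used for the binary skipping outcome, the sketch below computes McFadden's $R^2$ as one minus the ratio of the model log-likelihood to that of an intercept-only (base-rate) model. The function names and the toy data are hypothetical, and the sketch omits cross-validation.

```python
import math


def bernoulli_log_likelihood(y, p):
    """Log-likelihood of binary outcomes y under predicted probabilities p."""
    return sum(math.log(pi) if yi == 1 else math.log(1.0 - pi)
               for yi, pi in zip(y, p))


def mcfadden_r2(y, p_model):
    """McFadden's pseudo-R2: 1 - LL(model) / LL(null), null = base rate only."""
    base_rate = sum(y) / len(y)
    ll_null = bernoulli_log_likelihood(y, [base_rate] * len(y))
    ll_model = bernoulli_log_likelihood(y, p_model)
    return 1.0 - ll_model / ll_null


# Hypothetical toy data: 1 = word skipped, 0 = word fixated.
skipped = [1, 0, 0, 1, 0, 1]
p_skip = [0.8, 0.3, 0.2, 0.7, 0.4, 0.6]   # predicted skipping probabilities

r2_mcf = mcfadden_r2(skipped, p_skip)
print(f"McFadden R2: {r2_mcf:.3f}")
```

Because this quantity is a ratio of log-likelihoods rather than a proportion of variance, its absolute values are not on the same scale as the ordinary $R^2$ used for reading times.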

If one does look at absolute numerical values across Figures 2, 3 and 6, the $R^2$ values of the reading times regression may seem rather small. This could indicate a poor fit, which would potentially undermine our claim that reading times are to a large degree explained by cognitive factors. However, we do not believe this is the case, since our $R^2$ values for gaze durations are not lower than those reported by other regression analyses in natural reading (e.g., Kliegl et al., 2006); and because we find effect sizes in line with the experimental literature (Figure 5). Therefore, we do not believe we overfit or underfit gaze durations. Instead, what the relatively low $R^2$ values indicate, we suggest, is that gaze durations are inherently noisy; that only a limited amount of the variation is systematic variation. While this noisiness might be interesting in itself (e.g., reflecting an autonomous timer; Engbert et al., 2005), it is not of interest in this study, which focusses on systematic variation, and hence only on relative importance of different explanations, not on absolute $R^2$ values.

A final notable finding is that preview was best explained by a non-contextual observer. This replicates the only other study that compared contextual and non-contextual models of preview (Duan & Bicknell, 2020). That study focussed on skipping; the fact that we obtain the same result for reading times and in different datasets strengthens the conclusion that context does not inform preview. This is also in line with a number of studies on skipping, suggesting no or a weak effect of contextual fit (Angele et al., 2014; Angele & Rayner, 2013; Hahn & Keller, 2023). However, it contradicts a possibly larger range of experimental studies on preview more broadly, that do find interactions between contextual constraint/prediction and preview (e.g., Balota et al., 1985; McClelland & O’Regan, 1981; Schotter et al., 2015; Veldre & Andrews, 2018). One explanation for this discrepancy stems from how the effect is measured. Experimental studies looked at the effect of context on the difference in reading time after valid versus invalid preview (Schotter et al., 2015; Veldre & Andrews, 2018). This may reveal a context effect not on recognition, but at a later stage (e.g., priming between context, preview and foveal word). Arguably, these yield different predictions. If context affects recognition it may allow identification of otherwise unidentifiable words. But if the interaction occurs later it may only amplify processing of recognisable words. Constructing a model that formally reconciles this discrepancy is an interesting challenge for future work.

Given that readers use both prediction and preview, why doesn’t contextual prediction inform preview? One explanation stems from time constraints imposed by eye movements. Given that readers on average only look some 250 ms at a word, in which they have to recognise the foveal word and process the parafoveal percept, this perhaps leaves too little time to fully let the foveal word and context inform parafoveal preview. On the other hand, word recognition based on partial input also occurs in speech perception under significant time constraints. But despite those constraints, sentence context does influence auditory word recognition (McClelland & Elman, 1986; Zwitserlood, 1989), a process best modelled by a contextual prior (i.e., the opposite of what we find here; Brodbeck et al., 2022; Heilbron et al., 2022). Therefore, rather than being related to time constraints per se, it might also be related to the underlying circuitry. More precisely, it might be related to the fact that, contrary to auditory word recognition, visual word recognition is a laboriously acquired skill that occurs throughout areas in the visual system that are repurposed (not evolved) for reading (Dehaene, 2009; Yeatman & White, 2021). Therefore, global sentence context might be able to dynamically influence the recognition of speech sounds in temporal cortex, but not that of words in visual cortex; there, context effects might be confined to simpler, more local context, like lexical context effects on letter perception (Heilbron et al., 2020; Reicher, 1969; Wheeler, 1970; Woolnough et al., 2021).

In conclusion, we have found that two important contextual sources of information about next-word identity in reading, linguistic prediction and parafoveal preview, strongly drive variation in reading times, but hardly affect word skipping, which is largely based on low-level factors. Our results show that as readers, we do not always use all information available to us; and that we are, in a sense, of two minds: consulting complex inferences to decide how long to look at a word, while employing semi-mindless scanning routines to decide where to look next. It is striking that these disparate strategies operate mostly in harmony. Only occasionally they go out of step—then we notice that our eyes have moved too far and we have to look back, back to where our eyes left cognition behind.

We analysed eye-tracking data from three large naturalistic reading corpora, in which native English speakers read texts while eye-movement data were recorded (Cop et al., 2017; Kennedy, 2003; Luke & Christianson, 2016).

Stimulus Materials

We considered the English-native portions of the Dundee, Geco and Provo corpora. The Dundee corpus comprises eye movements from 10 native speakers from the UK (Kennedy, 2003), who read a total of 56,212 words across 20 long articles from The Independent newspaper. Secondly, the English portion of the Ghent Eye-tracking Corpus (Geco) (Cop et al., 2017) is a collection of eye movement data from 14 UK English speakers who each read Agatha Christie’s The Mysterious Affair at Styles in full (54,364 words per participant). Lastly, the Provo corpus (Luke & Christianson, 2018) is a collection of eye movement data from 84 US English speakers, who each read a total of 55 paragraphs (extracted from diverse sources) for a total of 2,689 words.

Eye Tracking Apparatus and Procedure

In all datasets, eye movements were recorded monocularly, by recording the right eye. In Geco and Provo, recordings were made using an EyeLink 1000 (SR Research, Canada) with a spatial resolution of 0.01° and a temporal resolution of 1000 Hz. For Dundee, a Dr. Bouis oculometer (Dr. Bouis, Karlsruhe, Germany), with a spatial resolution of <0.1° and a temporal resolution of 1000 Hz was used. To minimize head movement, participants’ heads were stabilised with a chinrest (Geco, Provo) or a bite bar (Dundee). In each experiment, texts were presented in ‘screens’ with either five lines (Dundee) or one paragraph per screen (Geco and Provo), presented using a font size of 0.33° per character. Each screen began with a fixation mark (gaze trigger) that was replaced by the initial word when stable fixation was achieved. In all datasets, a 9-point calibration was performed prior to the recording. In the longer experiments, a recalibration was performed every three screens (Dundee) or either every 10 minutes or whenever the drift correction exceeded 0.5° (Geco). For Dundee and Provo, the order of the different texts was randomized across participants. In Geco, the entire novel was read start to finish with breaks between each chapter, during which participants answered comprehension questions.

For each corpus, the x, y-values per fixation position were converted into a word-by-word format. In Dundee, raw x, y-values were smoothed by rounding to single-character precision. In Geco and Provo, raw x, y-values for each within-word or within-letter fixation were preserved and available for each word. Across the three datasets we redefined the bounding boxes around each word, such that they subtended the area from the first to the last character of the word, with the boundary set halfway to the neighbouring character (i.e., halfway into the space before and after the word). Punctuation before or after the word was left out, and words for which the bounding box was inconsistently defined were ignored. For distributions of saccade and fixation data, see Figures A.5–A.7.
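The bounding-box definition can be sketched in one dimension as follows. This is a simplified illustration under stated assumptions rather than the preprocessing code itself: positions are in character units on a single line, words are separated by single spaces, punctuation handling is omitted, and the function names are hypothetical.

```python
def word_boxes(line: str):
    """Return (word, left, right) per word, in character units.

    Character i occupies [i, i + 1); each box is extended half a character
    into the neighbouring space on either side.
    """
    boxes = []
    pos = 0
    for word in line.split(" "):
        if word:
            boxes.append((word, pos - 0.5, pos + len(word) + 0.5))
        pos += len(word) + 1  # skip the word and the following space
    return boxes


def word_at(x: float, boxes):
    """Return the word whose bounding box contains horizontal position x."""
    for word, left, right in boxes:
        if left <= x < right:
            return word
    return None


boxes = word_boxes("the cat sat")
print(boxes)
print(word_at(3.6, boxes))  # falls in the second half of the first space
```

With this definition, a fixation landing on a space is assigned to whichever neighbouring word is closer.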

Language Model

Contextual predictions were formalised using a language model, that is, a model computing the probability of each word given the preceding words. Here, we used GPT-2 (XL), currently among the best publicly released English language models. GPT-2 is a transformer-based model that, in a single pass, turns a sequence of tokens $U = (u_1, \ldots, u_k)$ into a sequence of conditional probabilities, $(p(u_1), p(u_2 \mid u_1), \ldots, p(u_k \mid u_1, \ldots, u_{k-1}))$.

Roughly, this happens in three steps: first, an embedding encodes the sequence of symbolic tokens as a sequence of vectors, which are the first hidden state $h_0$. Then, a stack of $n$ transformer blocks each applies a series of operations resulting in a new set of hidden states $h_l$, for each block $l$. Finally, a (log-)softmax layer is applied to compute (log-)probabilities over target tokens. In other words, the model can be summarised as follows:
$$h_0 = U W_e + W_p \tag{1}$$
$$h_l = \mathrm{transformer\_block}(h_{l-1}) \quad \forall\, l \in [1, n] \tag{2}$$
$$P(u) = \mathrm{softmax}\left(h_n W_e^{T}\right), \tag{3}$$
where $W_e$ is the token embedding and $W_p$ is the position embedding.
The key component of the transformer block is masked multi-headed self-attention. This transforms a sequence of input vectors $(x_1, x_2, \ldots, x_k)$ into a sequence of output vectors $(y_1, y_2, \ldots, y_k)$. Fundamentally, each output vector $y_i$ is simply a weighted average of the input vectors: $y_i = \sum_{j=1}^{k} w_{ij} x_j$. Critically, the weight $w_{ij}$ is not a parameter, but is derived from a dot product between the input vectors, $x_i^{T} x_j$, scaled by a constant determined by the dimensionality $d_k$ and passed through a softmax: $w_{ij} = \exp(x_i^{T} x_j / \sqrt{d_k}) / \sum_{j'=1}^{k} \exp(x_i^{T} x_{j'} / \sqrt{d_k})$. Because this is done for each position, each input vector $x_i$ is used in three ways: first, to derive the weights for its own output, $y_i$ (as the query); second, to derive the weight for any other output $y_j$ (as the key); finally, it is used in the weighted sum (as the value). Different linear transformations are applied to the vectors in each case, resulting in Query, Key and Value matrices ($Q$, $K$, $V$). Putting this all together, we obtain:
$$\mathrm{self\_attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V. \tag{4}$$

To be used as a language model, two elements are added. First, to make the operation position-sensitive, a position embedding $W_p$ is added in the embedding step (Equation 1). Second, to enforce that the model only uses information from the past, attention from future vectors is masked out. To give the model more flexibility, each transformer block contains multiple instances (‘heads’) of the self-attention mechanism from Equation 4. In total, GPT-2 (XL) contains $n = 48$ blocks, with 25 heads each; a dimensionality of $d = 1600$ and a context window of $k = 1024$, yielding a total of $1.5 \times 10^{9}$ parameters. We used the PyTorch implementation of GPT-2 from the Transformers package (Wolf et al., 2020).
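To make the self-attention operation and the masking step concrete, the following is a minimal single-head sketch in NumPy. It is not the GPT-2 implementation: the dimensions and random weights are hypothetical, and multiple heads, layer normalisation and the feed-forward layers of a full transformer block are omitted.

```python
import numpy as np


def masked_self_attention(X, Wq, Wk, Wv):
    """Single-head causal self-attention over a sequence X of shape (k, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # scaled dot products

    # Causal mask: position i may only attend to positions j <= i.
    k = X.shape[0]
    future = np.triu(np.ones((k, k), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Row-wise softmax (subtracting the row maximum for numerical stability).
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights


rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                    # 4 positions, 8 dimensions
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
Y, W = masked_self_attention(X, Wq, Wk, Wv)
print(np.round(W, 2))                          # lower-triangular attention weights
```

Each row of the weight matrix sums to one and has zeros above the diagonal, so the output at a given position is a weighted average of value vectors from that position and earlier ones only.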

One complication of deriving word probabilities from GPT-2 is that it does not operate on words but on tokens. Tokens can be whole words (as with most common words) or sub-words. To derive word probabilities, we take the token probability for a single-token word, and the joint probability for words spanning multiple tokens, as is standard practice in psycholinguistics (Pimentel et al., 2022; Shain et al., 2022; Wilcox et al., 2020). However, because GPT-2 marks word boundaries (i.e., spaces) at the beginning of a token, the ‘end of word’ decision is technically made at the next token. Defining word probabilities via (joint) constituent token probabilities does not take this into account. Therefore, it will on average slightly overestimate word probabilities (underestimate surprisal). However, this slight underestimation of surprisal is not likely to affect any of our conclusions, since our regression analyses primarily depend on differences between relative predictabilities of different words, and since GPT-2, when used in this way, has been shown to result in probabilities that predict reading behaviour very well compared to other language models that do use whole-word tokenisation (Wilcox et al., 2020), and even compared to larger models like GPT-3 (Shain et al., 2022).
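The aggregation from tokens to words amounts to summing token log-probabilities within a word. The sketch below illustrates this with hand-specified values: the tokenisation and the probabilities are hypothetical, tokens are given as decoded strings in which a leading space marks the start of a word, and no language model is actually run.

```python
import math


def word_surprisals(tokens, token_logprobs):
    """Sum natural-log token probabilities per word; return (word, bits) pairs."""
    words, logps = [], []
    for tok, lp in zip(tokens, token_logprobs):
        if tok.startswith(" ") or not words:
            words.append(tok.strip())   # a leading space opens a new word
            logps.append(lp)
        else:
            words[-1] += tok            # sub-word token: extend current word
            logps[-1] += lp             # joint probability = sum of log-probs
    return [(w, -lp / math.log(2)) for w, lp in zip(words, logps)]


# Hypothetical tokenisation and conditional probabilities, for illustration only.
tokens = [" The", " oc", "ulom", "otor", " model"]
logprobs = [math.log(p) for p in (0.5, 0.25, 0.5, 0.5, 0.125)]

result = word_surprisals(tokens, logprobs)
for word, bits in result:
    print(f"{word}: {bits:.2f} bits")
```

In this toy case the three-token word receives a surprisal of 4 bits, the sum of its constituent token surprisals (2 + 1 + 1).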

We chose GPT-2 because it is a high-quality language model (measured in perplexity on English texts), and better language models generally predict reading behaviour better (Goodkind & Bicknell, 2018; Wilcox et al., 2020). Our analysis does not depend on GPT-2 specifically: it could be replaced with any similarly high-quality language model, much in the same way as we do not believe our results to be specific to the exact lexical frequency estimates we used (see Discussion).

Ideal Observer

To compute parafoveal identifiability, we implemented an ideal observer based on the formalism by Duan and Bicknell (2020). This model formalises parafoveal word identification using Bayesian inference and builds on previous well-established ‘Bayesian Reader’ models (Bicknell & Levy, 2010; Norris, 2006, 2009). It computes the probability of the next word given a noisy percept by combining a prior over possible words with a likelihood of the noisy percept, given a word identity:
$$p(w \mid \mathcal{I}) \propto p(w)\, p(\mathcal{I} \mid w), \tag{5}$$
where $\mathcal{I}$ represents the noisy visual input, and $w$ represents a word identity. We considered two priors (see Figure 6): a non-contextual prior (the overall probability of words in English based on their frequency in Subtlex; Brysbaert & New, 2009), and a contextual prior based on GPT-2 (see below). Below we describe how visual information is represented and perceptual inference is performed. For a graphical schematic of the model, see Figure A.2; for some distinctive simulations showing how the model captures key effects of linguistic and visual characteristics on word recognition, see Figure A.3.
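As a toy illustration of this inference step, the sketch below combines a prior over three candidate words with a likelihood of a noisy percept and normalises the result. The candidate words, prior and likelihood values are hypothetical and chosen only to show the arithmetic.

```python
def posterior(prior: dict, likelihood: dict) -> dict:
    """Posterior over words: prior times likelihood, normalised to sum to 1."""
    unnormalised = {w: prior[w] * likelihood[w] for w in prior}
    total = sum(unnormalised.values())
    return {w: v / total for w, v in unnormalised.items()}


# Hypothetical values, for illustration only.
prior = {"cat": 0.6, "car": 0.3, "cap": 0.1}        # p(w)
likelihood = {"cat": 0.2, "car": 0.6, "cap": 0.2}   # p(I | w) for one percept

post = posterior(prior, likelihood)
for word, p in post.items():
    print(f"p({word} | I) = {p:.4f}")
```

Here the percept favours "car" strongly enough to overturn the prior preference for "cat"; swapping the non-contextual prior for a contextual one changes only the first factor.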

Sampling Visual Information.

Like in other Bayesian Readers (Bicknell & Levy, 2010; Norris, 2006, 2009), noisy visual input is accumulated by sampling from a multivariate Gaussian which is centred on a one-hot ‘true’ letter vector (here represented in an uncased 26-dimensional encoding) with a diagonal covariance matrix $\Sigma(\varepsilon) = \lambda(\varepsilon)^{-1/2} I$. The covariance $\Sigma$ is thus scaled by the sensory quality $\lambda(\varepsilon)$ for a letter at eccentricity $\varepsilon$. Sensory quality is computed as a function of the perceptual span, using a Gaussian integral that follows the perceptual span or processing rate function from the SWIFT model (Engbert et al., 2005). Specifically, for a letter at eccentricity $\varepsilon$, $\lambda$ is given by the integral within the bounding box of the letter:
$$\lambda(\varepsilon) = \int_{\varepsilon - .5}^{\varepsilon + .5} \frac{1}{\sqrt{2\pi\sigma^{2}}} \exp\left(-\frac{x^{2}}{2\sigma^{2}}\right) dx, \tag{6}$$
which, following Bicknell and Levy (2010) and Duan and Bicknell (2020), is scaled by a scaling factor Λ. Unlike SWIFT, the Gaussian in Equation 6 is symmetric, since we only perform inference on information about the next word. By using one-hot encoding and a diagonal covariance matrix, the ideal observer ignores similarity structure between letters. This is clearly a simplification, but one with significant computational benefits; moreover, it is a simplification shared by all Bayesian Reader-like models (Bicknell & Levy, 2010; Duan & Bicknell, 2020; Norris, 2006), which can nonetheless capture many important aspects of visual word recognition and reading. To determine parameters Λ and σ, we performed a grid search on a subset of Dundee and Geco (see Figure A.4), resulting in Λ = 1 and σ = 3. Note that this σ value is close to the average σ value of SWIFT and (3.075) and corresponds well to prior literature on the size of the perceptual span (±15 characters; Bicknell & Levy, 2010; Engbert et al., 2005; Schotter et al., 2012).
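As a minimal sketch of Equation 6, the snippet below evaluates the Gaussian integral over a letter's one-character bounding box using the error function. The function names are illustrative and are not taken from the authors' code; the defaults σ = 3 and Λ = 1 are the grid-search values reported above.

```python
import math

def gaussian_cdf(x, sigma):
    """CDF of a zero-mean Gaussian with standard deviation sigma."""
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))

def sensory_quality(eccentricity, sigma=3.0, scale=1.0):
    """Equation 6: Gaussian mass inside the letter's box [e - 0.5, e + 0.5],
    multiplied by the scaling factor (Lambda in the text)."""
    lower, upper = eccentricity - 0.5, eccentricity + 0.5
    return scale * (gaussian_cdf(upper, sigma) - gaussian_cdf(lower, sigma))

def noise_variance(eccentricity, sigma=3.0, scale=1.0):
    """Per-dimension variance implied by the covariance lambda(e)^(-1/2) * I."""
    return sensory_quality(eccentricity, sigma, scale) ** -0.5

# Sensory quality falls, and sampling noise grows, with distance from fixation.
for ecc in (0, 3, 6, 9):
    print(ecc, round(sensory_quality(ecc), 4), round(noise_variance(ecc), 2))
```

Because the Gaussian is symmetric, letters at the same distance to the left and right of fixation receive the same sensory quality in this sketch.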

Perceptual Inference.

Inference is performed over the full vocabulary. This is represented as a matrix which can be seen as a stack of word vectors, y1, y2, …, yv, obtained by concatenating the letter vectors. The vocabulary is thus a V × d matrix, with V the number of words in the vocabulary and d the dimensionality of the word vectors (determined by the length of the longest word: d = 26 × lmax).
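The vocabulary matrix can be illustrated with a short sketch. It assumes that words shorter than the longest word are padded with all-zero letter slots; the text does not specify the padding scheme, and the function names are hypothetical.

```python
import string

ALPHABET = string.ascii_lowercase  # uncased 26-letter encoding

def word_to_vector(word, max_len):
    """Concatenate one-hot letter vectors; unused slots stay zero (an assumption)."""
    vec = [0.0] * (26 * max_len)
    for pos, ch in enumerate(word.lower()):
        vec[26 * pos + ALPHABET.index(ch)] = 1.0
    return vec

def build_vocabulary_matrix(words):
    """Return a V x d list of lists, with d = 26 * length of the longest word."""
    max_len = max(len(w) for w in words)
    return [word_to_vector(w, max_len) for w in words]

matrix = build_vocabulary_matrix(["cat", "the", "reading"])
print(len(matrix), len(matrix[0]))  # V rows, d columns
```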

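To perform inference, we use the belief-updating scheme from Duan and Bicknell (2020), in which the posterior at sample t is expressed as a (V − 1) dimensional log-odds vector x^(t), in which each entry x_i^(t) represents the log-odds of y_i relative to the final word y_v. In this formulation, the initial value of x is thus simply the prior log odds, x_i^(0) = log p(w_i) − log p(w_v), and updating is done by summing prior log-odds and the log-odds likelihood. This procedure is repeated for T samples, each time taking the posterior of the previous timestep as the prior in the current timestep. Note that using log odds in this way avoids renormalization:
$$\begin{aligned}
x_i^{(t)} &= \log \frac{p(w_i \mid \mathcal{I}^{(0 \ldots t)})}{p(w_v \mid \mathcal{I}^{(0 \ldots t)})} \\
&= \log \frac{p(w_i \mid \mathcal{I}^{(0 \ldots t-1)})\, p(\mathcal{I}^{(t)} \mid w_i)}{p(w_v \mid \mathcal{I}^{(0 \ldots t-1)})\, p(\mathcal{I}^{(t)} \mid w_v)} \\
&= \log \frac{p(w_i \mid \mathcal{I}^{(0 \ldots t-1)})}{p(w_v \mid \mathcal{I}^{(0 \ldots t-1)})} + \log \frac{p(\mathcal{I}^{(t)} \mid w_i)}{p(\mathcal{I}^{(t)} \mid w_v)} \\
&= x_i^{(t-1)} + \Delta x_i^{(t)}.
\end{aligned} \tag{7}$$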

In other words, as visual sample 𝓘(t) comes in, beliefs are updated by summing the prior log odds x(t−1) and the log-odds likelihood of the new information, Δx(t).

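For a given word w_i, the log-odds likelihood of each new sample is the difference of two multivariate Gaussian log-likelihoods, one centred on y_i and one on the last vector y_v. This can be formulated as a linear transformation of 𝓘:
$$\begin{aligned}
\Delta x_i &= \log p(\mathcal{I} \mid w_i) - \log p(\mathcal{I} \mid w_v) \\
&= \log p(\mathcal{I} \mid \mathcal{N}(y_i, \Sigma)) - \log p(\mathcal{I} \mid \mathcal{N}(y_v, \Sigma)) \\
&= -\tfrac{1}{2} (\mathcal{I} - y_i)^{T} \Sigma^{-1} (\mathcal{I} - y_i) + \tfrac{1}{2} (\mathcal{I} - y_v)^{T} \Sigma^{-1} (\mathcal{I} - y_v) \\
&= \frac{y_v^{T} \Sigma^{-1} y_v - y_i^{T} \Sigma^{-1} y_i}{2} + (y_i - y_v)^{T} \Sigma^{-1} \mathcal{I},
\end{aligned} \tag{8}$$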
which implies that updating can be implemented by sampling from a multivariate normal. To perform inference on a given word, we performed this sampling scheme until convergence (using T = 50), and then transformed the posterior log-odds into the log posterior, from which we computed the Shannon entropy as a metric of parafoveal identifiability.
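The update in Equations 7 and 8 can be sketched on a toy vocabulary. This is a simplified illustration, not the authors' implementation: it assumes a single isotropic noise variance for all letters (the model above uses an eccentricity-dependent one), pads short words with zeros, and reports entropy in bits. All names are hypothetical.

```python
import math
import random

def one_hot(word, max_len):
    vec = [0.0] * (26 * max_len)
    for pos, ch in enumerate(word):
        vec[26 * pos + ord(ch) - ord("a")] = 1.0
    return vec

def posterior_entropy(words, prior, true_word, noise_var, n_samples=50, seed=0):
    """Accumulate log-odds relative to the last word, then return entropy in bits."""
    rng = random.Random(seed)
    max_len = max(len(w) for w in words)
    vecs = [one_hot(w, max_len) for w in words]
    ref = vecs[-1]                                   # y_v, the reference word
    target = vecs[words.index(true_word)]
    sd = math.sqrt(noise_var)
    # x^(0): prior log-odds of each word relative to the reference word.
    x = [math.log(p) - math.log(prior[-1]) for p in prior[:-1]]
    # Constant term of Equation 8 with Sigma = noise_var * I.
    ref_norm = sum(v * v for v in ref)
    const = [(ref_norm - sum(v * v for v in y)) / (2.0 * noise_var) for y in vecs[:-1]]
    for _ in range(n_samples):
        sample = [rng.gauss(mu, sd) for mu in target]  # noisy percept of the true word
        for i, y in enumerate(vecs[:-1]):
            linear = sum((a - b) * s for a, b, s in zip(y, ref, sample)) / noise_var
            x[i] += const[i] + linear                  # x^(t) = x^(t-1) + delta x^(t)
    logits = x + [0.0]                                 # reference word has log-odds 0
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    posterior = [w / total for w in weights]
    return -sum(p * math.log2(p) for p in posterior if p > 0.0)

words = ["cat", "car", "dog", "the"]
prior = [0.25, 0.25, 0.25, 0.25]
print(posterior_entropy(words, prior, "cat", noise_var=0.5))   # clear preview
print(posterior_entropy(words, prior, "cat", noise_var=1e6))   # nearly uninformative preview
```

With a uniform prior over four words the prior entropy is 2 bits, so the information gained from the preview in this toy setting is 2 minus the returned value.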

To compute the parafoveal entropy for each word in the corpus, we make the simplifying assumption that parafoveal preview only occurs during the last fixation prior to a saccade, thus computing the entropy as a function of the word itself and its distance to the last fixation location within the previously fixated word (which is not always the previous word). Because this distance is different for each participant, it was computed separately for each word, for each participant. Moreover, because the inference scheme is based on sampling, we repeated it 3 times, and averaged these to compute the posterior entropy of the word. The amount of information obtained from the preview is then simply the difference between prior and posterior entropy.

The ideal observer was implemented in custom Python code, and can be found in the data sharing collection (see below).

Contextual vs. Non-Contextual Prior

We considered two observers: one with a non-contextual prior, capturing the overall probability of a word in a language, and one with a contextual prior, capturing the probability of a word in a specific context. For the non-contextual prior, we simply used lexical frequencies, from which we computed the (log-)odds prior used in Equation 7. The contextual prior was derived from the log-probabilities from GPT-2. This effectively involves constructing a new Bayesian model for each word, for each participant, in each dataset. To simplify this process, we did not take the full predicted distribution of GPT-2, but only the ‘nucleus’ of the top k predicted words with a cumulative probability of 0.95, and truncated the (less reliable) tail of the distribution. Further, we simply assumed that the tail was ‘flat’ and had a uniform probability. Since the prior odds can be derived from relative frequencies, we can think of the probabilities in the flat tail as having a ‘pseudocount’ of 1. If we similarly express the prior probabilities in the nucleus as implied ‘pseudofrequencies’, the cumulative implied nucleus frequency is then complementary to the length of the tail, which is simply the difference between the vocabulary size and nucleus size (V − k). As such, for word i in the text, we can express the nucleus as implied frequencies as follows:
$$\mathrm{freqs}_{\psi} = P_{tr}(w^{(i)} \mid \mathrm{context}) \, \frac{V - k}{1 - \sum_{j=1}^{k} P(w_j^{(i)} \mid \mathrm{context})} \tag{9}$$
where P_tr(w^(i) ∣ context) is the truncated lexical prediction, and P(w_j^(i) ∣ context) is the predicted probability that word i in the text is word j in the sorted vocabulary. Note that using this flat tail not only simplifies the computation, but also deals with the fact that the vocabulary of GPT-2 is smaller than that of the ideal observer: using this tail we can still use the full vocabulary (e.g., to capture orthographic uniqueness effects), while using 95% of the density from GPT-2.
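A small sketch of the pseudofrequency construction in Equation 9 is given below. It assumes the truncated prediction is simply the model's probabilities restricted to the nucleus, and that every tail word has a pseudocount of 1; the function name and the toy probabilities are illustrative only.

```python
def nucleus_pseudofrequencies(probs, vocab_size, mass=0.95):
    """Return implied frequencies for the nucleus words, and the nucleus size k.

    probs: dict mapping candidate words to predicted probabilities.
    vocab_size: size V of the ideal observer's vocabulary.
    """
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for word, p in ranked:
        nucleus.append((word, p))
        cumulative += p
        if cumulative >= mass:
            break
    k = len(nucleus)
    tail_mass = 1.0 - cumulative
    if tail_mass <= 0.0:
        raise ValueError("nucleus covers all probability mass; no flat tail left")
    scale = (vocab_size - k) / tail_mass   # (V - k) / (1 - sum of nucleus probabilities)
    return {word: p * scale for word, p in nucleus}, k

toy = {"the": 0.6, "a": 0.3, "an": 0.06, "this": 0.04}
freqs, k = nucleus_pseudofrequencies(toy, vocab_size=1004)
print(k, freqs)
```

Dividing each pseudofrequency by the total count (nucleus frequencies plus V − k tail counts) recovers the original nucleus probabilities, which is the property the prior odds rely on.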

Data Selection

In our analyses, we focus on first-pass reading (i.e., progressive eye movements), analysing only those fixations or skips when none of the subsequent words have been fixated before. Moreover, we exclude return sweeps (i.e., line transitions), which are very different from within-line saccades. We extensively preprocessed the corpora so that we could include as many words as possible. However, we had to impose some additional restrictions. Specifically, we did not include words if a) they contained non-alphabetic characters; b) they were adjacent to blinks; or c) the distance to the prior fixation location was more than 24 characters (±8); moreover, for the gaze duration analysis we excluded d) words with implausibly short (< 70 ms) or long (> 900 ms) gaze durations. Criterion c) was chosen because some participants occasionally skipped long sequences of words, up to entire lines or more. Such ‘skipping’ (indicated by saccades much larger than the perceptual span) is clearly different from the skipping of words during normal reading, and was therefore excluded. Note that these criteria are comparatively mild (cf. Duan & Bicknell, 2020; Smith & Levy, 2013), and leave approximately 1.1 million observations for the skipping analysis and 593,000 reading time observations.
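The selection criteria can be expressed as two simple predicates. The record layout and field names below are hypothetical (the corpora use their own formats); only the thresholds come from the text.

```python
def keep_for_skipping(obs):
    """Criteria a) to c), plus the first-pass and return-sweep restrictions."""
    return (
        obs["first_pass"]
        and not obs["return_sweep"]
        and obs["word"].isalpha()              # a) alphabetic characters only
        and not obs["adjacent_to_blink"]       # b) not next to a blink
        and obs["distance_chars"] <= 24        # c) at most 24 characters away
    )

def keep_for_gaze_duration(obs):
    """Additionally apply criterion d): plausible gaze durations (70 to 900 ms)."""
    gaze = obs.get("gaze_ms")
    return keep_for_skipping(obs) and gaze is not None and 70 <= gaze <= 900

example = {
    "word": "reading", "first_pass": True, "return_sweep": False,
    "adjacent_to_blink": False, "distance_chars": 7, "gaze_ms": 240,
}
print(keep_for_skipping(example), keep_for_gaze_duration(example))
```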

Regression Models: Skipping

Skipping was modelled via logistic regression in scikit-learn (Pedregosa et al., 2011), with three sets of explanatory variables (or ‘models’), each formalising a different explanation for why a word might be skipped.

First, a word might be skipped because it could be confidently predicted from context. We formalise this via linguistic entropy, quantifying the information conveyed by the prediction from GPT-2. We used entropy, not (log) probability, because using the next word’s probability directly would presuppose that the word is identified, undermining the dissociation of prediction and preview. By contrast, prior entropy specifically probes the information available from prediction only.

Secondly, a word might be skipped because it could be identified from a parafoveal preview. This was formalised via parafoveal entropy, which quantifies the parafoveal preview uncertainty (or, inversely, the amount of information conveyed by the preview). This is a complex function integrating low-level visual (e.g., decreasing visibility as a function of eccentricity) and higher-level information (e.g., frequency or orthographic effects) and their interaction (see Figure A.3). Here, too, we did not use lexical features (e.g., frequency) of the next word to model skipping directly, as this presupposes that the word is identified; and to the extent that these factors are expected to influence identifiability, this is already captured by the parafoveal entropy (Figure A.3).

Finally, a word might be skipped simply because it is too short and/or too close to the prior fixation location, such that a saccade of average length would overshoot the word. This autonomous oculomotor account was formalised by modelling skipping probability purely as a function of a word’s length and its distance to the previous fixation location.

Note that these explanations are not mutually exclusive, so we also evaluated their combinations (see below).
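To make the model space concrete, the sketch below enumerates the three skipping models and all of their combinations. The column names are hypothetical placeholders for the variables described above; the exact regressors are those listed in Tables A.1–A.3.

```python
from itertools import combinations

# Hypothetical column names for the three sets of explanatory variables.
FEATURE_SETS = {
    "prediction": ["linguistic_entropy"],
    "preview": ["parafoveal_entropy"],
    "oculomotor": ["word_length", "distance_to_prior_fixation"],
}

def model_combinations(feature_sets):
    """Map each non-empty combination of explanations to its list of columns."""
    names = sorted(feature_sets)
    models = {}
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            columns = [col for name in combo for col in feature_sets[name]]
            models["+".join(combo)] = columns
    return models

for name, columns in model_combinations(FEATURE_SETS).items():
    print(name, columns)
```

Three sets of explanatory variables yield seven models in total: three single models, three pairs and the full model.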

Regression Models: Reading Time

As an index of reading time, we analysed first-pass gaze duration, the sum of a word’s first-pass fixation durations. We analyse gaze durations as they arguably most comprehensively reflect how long a word is looked at, and are the focus of similar model-based analyses of contextual effects in reading (Goodkind & Bicknell, 2018; Smith & Levy, 2013). For reading times, we used linear regression, and again considered three sets of explanatory variables, each formalising a different kind of explanation.

First, a word may be read more slowly because it is unexpected in context. We formalised this using surprisal −log(p), a metric of a word’s unexpectedness—or how much information is conveyed by a word’s identity in light of a prior expectation about the identity. To capture spillover (Rayner et al., 2006; Smith & Levy, 2013) we included not just the surprisal of the current word, but also that of the previous two words.

Secondly, a word might be read more slowly because it was difficult to discern from the parafoveal preview. This was formalised using the parafoveal entropy (see above).

Finally, a word might be read more slowly because of non-contextual factors of the word itself. This is an aggregate baseline explanation, aimed to capture all relevant non-contextual word attributes, which we contrast to the two major contextual sources of information about a word identity that might affect reading times (prediction and preview). We included word class, length, log-frequency, and the relative landing position (quantified as the distance to word centre, both in fraction and in characters). For log-frequency we used the UK or US version of SUBTLEX depending on the corpus and included the log-frequency of the past two words to capture spillover effects.

The full model was defined as the joint of all models. For a tabular overview of all explanatory variables, see Tables A.1–A.3.
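The surprisal regressors with spillover, as described for the prediction model above, can be sketched as follows. The natural logarithm is an assumption (the text writes −log(p) without a base), and the handling of the first words, which have no predecessors, is a placeholder choice.

```python
import math

def surprisal(p):
    """Surprisal -log(p) of a word with contextual probability p (natural log assumed)."""
    return -math.log(p)

def spillover_features(probs, n_back=2):
    """For each word, return [surprisal(word), surprisal(word-1), surprisal(word-2)].

    Missing predecessors at the start of a text are marked with None.
    """
    values = [surprisal(p) for p in probs]
    rows = []
    for i in range(len(values)):
        lagged = [values[i - lag] if i - lag >= 0 else None
                  for lag in range(1, n_back + 1)]
        rows.append([values[i]] + lagged)
    return rows

for row in spillover_features([0.5, 0.25, 0.125]):
    print(row)
```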

Model Evaluation

We compared the ability of each model to account for the variation in the data by probing prediction performance in a 10-fold cross-validation scheme, in which we quantified how much of the observed variation in skipping rates and gaze durations could be explained.

For reading times, we did this using the coefficient of determination, defined via the ratio of residual and total sum of squares: R² = 1 − SSres/SStot. The ratio SSres/SStot relates the error of the model (SSres) to the error of a ‘null’ model predicting just the mean (SStot), so that R² gives the proportion of variance explained. For skipping, we use a closely related metric, the McFadden R². Like the R², it is computed by comparing the error of the model to the error of a null model with only an intercept: R²_McF = 1 − L_M/L_null, where L indicates the loss.

While R² and R²_McF are not identical, they are formally closely related: critically, both are zero when the prediction is constant (no variation explained) and go towards one proportionally as the error decreases to zero (i.e., towards all variation explained). Note that in a cross-validated setting, both metrics can become negative when the prediction of the model is worse than the prediction of a constant null model.
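Both metrics can be computed in a few lines. The sketch below assumes that the loss L for skipping is the negative log-likelihood and, for brevity, estimates the null model's mean or base rate from the evaluated data itself; in a cross-validated analysis these would typically come from the training folds.

```python
import math

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, with SS_tot taken around the mean of y_true."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((y - f) ** 2 for y, f in zip(y_true, y_pred))
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot

def log_loss(skipped, p_skip, eps=1e-12):
    """Summed negative log-likelihood of binary skip outcomes."""
    total = 0.0
    for s, p in zip(skipped, p_skip):
        p = min(max(p, eps), 1.0 - eps)   # avoid log(0)
        total -= math.log(p) if s else math.log(1.0 - p)
    return total

def mcfadden_r2(skipped, p_skip):
    """McFadden R^2 = 1 - L_model / L_null, with an intercept-only null model."""
    base_rate = sum(skipped) / len(skipped)
    null_loss = log_loss(skipped, [base_rate] * len(skipped))
    return 1.0 - log_loss(skipped, p_skip) / null_loss

skipped = [1, 0, 0, 1, 0, 0]
print(mcfadden_r2(skipped, [0.9, 0.1, 0.1, 0.9, 0.1, 0.1]))  # informative model
print(mcfadden_r2(skipped, [0.1, 0.9, 0.9, 0.1, 0.9, 0.1]))  # worse than the null model
```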

Variation Partitioning

To assess relative importance, we used variation partitioning to estimate how much explained variation could be attributed to each set of explanatory variables. This is also known as variance partitioning, as it is originally based on partitioning sums of squares; here we use the more general term ‘variation’ following Legendre (2008).

Variation partitioning builds on the insight that when two (groups of) explanatory variables (A and B) both explain some variation in the data y, and A and B are independent, then variation explained by combining A and B will be approximately additive. By contrast, when A and B are fully redundant (e.g., when B only has an apparent effect on y through its correlation with A), then a model combining A and B will not explain more than either alone. Following de Heer et al. (2017), we generalise this logic to up to three (sets of) explanatory variables, by testing each individually and all combinations, and using set theory notation and graphical representation for their simplicity and clarity.

A two-way partition of two sets of explanatory variables, A and B (Figures 4, A.12, A.13), involves fitting three models: two partial models with their features alone (A and B), and a joint model with both (A ∪ B). The unique variation explained by A (A*) or by B (B*) is derived via the difference between the joint model and the other partial model:
$$A^{*} = A \setminus B = A \cup B - B, \qquad B^{*} = B \setminus A = A \cup B - A \tag{10}$$
And the intersection is derived from the joint model and sum of the partial models:
$$A \cap B = A + B - A \cup B \tag{11}$$
For three groups of explanatory variables (A, B, and C), the situation is a bit more complex. We first evaluate each separately and all combinations, resulting in 7 models:
$$A,\; B,\; C,\; A \cup B,\; A \cup C,\; B \cup C,\; A \cup B \cup C.$$
From these 7 models we obtain 7 ‘empirical’ scores (of variation explained), from which we derive the 7 ‘theoretical’ partitions: 4 overlap partitions and 3 unique partitions. The first overlap partition is the variation explained by all models, which we can derive as:
$$A \cap B \cap C = A \cup B \cup C + A + B + C - A \cup B - A \cup C - B \cup C. \tag{12}$$
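The next three overlap partitions contain all pairwise intersections of models that did not include the other model:
$$\begin{aligned}
(A \cap B) \setminus C &= A + B - A \cup B - A \cap B \cap C \\
(A \cap C) \setminus B &= A + C - A \cup C - A \cap B \cap C \\
(B \cap C) \setminus A &= B + C - B \cup C - A \cap B \cap C.
\end{aligned} \tag{13}$$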
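The last three partitions are those explained exclusively by each model. This is the relative complement: the partition unique to A is the relative complement of B ∪ C, written (B ∪ C)^RC. For simplicity we also use a star notation, indicating the unique partition of A as A*. These are derived as follows:
$$\begin{aligned}
A^{*} &= (B \cup C)^{RC} = A \cup B \cup C - B \cup C \\
B^{*} &= (A \cup C)^{RC} = A \cup B \cup C - A \cup C \\
C^{*} &= (A \cup B)^{RC} = A \cup B \cup C - A \cup B.
\end{aligned} \tag{14}$$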

Note that, in the cross-validated setting, the results can become paradoxical and depart from what is possible in classical statistical theory, such as partitioning sums of squares. For instance, due to over-fitting, a model that combines multiple EVs could explain less variance than all of the EVs alone, in which case some partitions would become negative. However, following de Heer et al. (2017), we believe that the advantages of using cross-validation outweigh the risk of potentially paradoxical results in some subjects. Partitioning was carried out for each subject, allowing us to statistically assess whether the additional variation explained by a given model was significant. On average, none of the partitions were paradoxical.
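The three-way partition in Equations 12 to 14 reduces to simple arithmetic on the seven cross-validated scores. The sketch below uses made-up scores and hypothetical key names; with real cross-validated scores, individual partitions may come out negative, as noted above.

```python
def partition_three(scores):
    """Derive the 7 partitions from the 7 model scores.

    scores maps 'A', 'B', 'C' to the single-model scores and
    'AB', 'AC', 'BC', 'ABC' to the scores of the corresponding joint models.
    """
    s = scores
    shared_all = s["ABC"] + s["A"] + s["B"] + s["C"] - s["AB"] - s["AC"] - s["BC"]
    return {
        "A_and_B_and_C": shared_all,                              # Equation 12
        "A_and_B_only": s["A"] + s["B"] - s["AB"] - shared_all,   # Equation 13
        "A_and_C_only": s["A"] + s["C"] - s["AC"] - shared_all,
        "B_and_C_only": s["B"] + s["C"] - s["BC"] - shared_all,
        "A_unique": s["ABC"] - s["BC"],                           # Equation 14
        "B_unique": s["ABC"] - s["AC"],
        "C_unique": s["ABC"] - s["AB"],
    }

# Illustrative scores only, not results from the paper.
toy_scores = {"A": 0.18, "B": 0.12, "C": 0.07,
              "AB": 0.23, "AC": 0.20, "BC": 0.15, "ABC": 0.25}
parts = partition_three(toy_scores)
print(parts)
print(sum(parts.values()))  # the partitions add up to the full-model score
```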

Simulating Effect Sizes

Regression-based preview benefits were defined as the expected difference in gaze duration after a preview of average informativeness versus after no preview at all. This best corresponds to an experiment in which the preceding preview was masked (e.g., XXXX) rather than invalid (see Discussion). To compute this, we took the difference in parafoveal entropy between an average preview and the prior entropy. Because we standardised our explanatory variables, this difference was transformed to subject-specific z-scores and then multiplied by the regression weights to obtain an expected effect size.

For the predictability benefit, we computed the expected difference in gaze duration between ‘high’ and ‘low’ probability words. ‘High’ and ‘low’ were empirically defined based on the human-normed cloze probabilities in Provo (using the OrthoMatchModel definition for additional granularity; Luke & Christianson, 2018), which we divided into thirds using percentiles. The resulting cutoff points (low < 0.02; high > 0.25) were log-transformed, applied to the surprisal values from GPT-2, and multiplied by the weights to predict effect sizes. Note that these definitions of ‘low’ and ‘high’ may appear low compared to those in the literature. However, most studies collect cloze only for specific ‘target’ words in relatively predictable contexts, which biases the definition of ‘low’ vs. ‘high’ probability. By contrast, we analysed cloze probabilities for all words, yielding these values.
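The arithmetic behind both simulated effect sizes can be sketched as below. This is a simplified reading of the procedure: it assumes one standardised regressor per effect (ignoring spillover terms), natural-log surprisal, and regression weights expressed in milliseconds per standard deviation. The numbers in the usage lines are placeholders, not estimates from the paper.

```python
import math

def preview_benefit_ms(prior_entropy, mean_preview_entropy, entropy_sd, weight_ms_per_sd):
    """Expected gaze-duration difference between no preview and an average preview."""
    delta_z = (prior_entropy - mean_preview_entropy) / entropy_sd  # subject-specific z-units
    return delta_z * weight_ms_per_sd

def predictability_benefit_ms(p_low, p_high, surprisal_sd, weight_ms_per_sd):
    """Expected gaze-duration difference between low- and high-probability words."""
    delta_surprisal = -math.log(p_low) - (-math.log(p_high))
    return (delta_surprisal / surprisal_sd) * weight_ms_per_sd

# Placeholder inputs for one hypothetical subject.
print(preview_benefit_ms(prior_entropy=10.0, mean_preview_entropy=6.0,
                         entropy_sd=2.0, weight_ms_per_sd=5.0))
print(predictability_benefit_ms(p_low=0.02, p_high=0.25,
                                surprisal_sd=2.5, weight_ms_per_sd=6.0))
```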

Statistical Testing

Statistical testing was performed across participants within each dataset. Because two of the three corpora had a low number of participants (10 and 14 respectively), we used data-driven, non-analytical bootstrap t-tests, which involve resampling a null distribution with zero mean (by removing the mean) and counting across bootstraps how likely a t-value at least as extreme as the true t-value was to occur. Each test used at least 10^4 bootstraps; p values were computed without assuming symmetry (equal-tail bootstrap; Rousselet et al., 2019). Confidence intervals (in the figures and text) were also based on bootstrapping.
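A minimal sketch of such a one-sample bootstrap t-test is shown below. It is an illustration under stated assumptions rather than the authors' code: the equal-tail p value is taken as twice the smaller of the two tail proportions, and degenerate resamples with zero variance are handled by a simple convention.

```python
import math
import random

def t_statistic(values):
    """One-sample t statistic against zero."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    if var == 0.0:  # degenerate resample: all values identical
        return 0.0 if mean == 0.0 else math.copysign(math.inf, mean)
    return mean / math.sqrt(var / n)

def bootstrap_t_test(values, n_boot=10_000, seed=0):
    """Equal-tail bootstrap p value for the hypothesis that the mean is zero."""
    rng = random.Random(seed)
    t_obs = t_statistic(values)
    mean = sum(values) / len(values)
    centred = [v - mean for v in values]   # null distribution with zero mean
    n_below = n_above = 0
    for _ in range(n_boot):
        resample = [rng.choice(centred) for _ in centred]
        t_null = t_statistic(resample)
        n_below += t_null <= t_obs
        n_above += t_null >= t_obs
    return min(1.0, 2.0 * min(n_below, n_above) / n_boot)

# Per-participant values of some partition (made-up numbers).
per_subject = [0.8, 1.1, 0.9, 1.3, 1.0, 0.7, 1.2, 0.95, 1.05, 1.15]
print(bootstrap_t_test(per_subject))
```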

We thank Maria Barrett, Yunyan Duan, and Benedikt Ehinger for useful input and inspiring discussions during various stages of this project.

This work was supported by The Netherlands Organisation for Scientific Research (NWO Research Talent grant to M.H.; NWO Vidi 452-13-016 to F.P.d.L.; Gravitation Program Grant Language in Interaction no. 024.001.006 to P.H.) and the European Union Horizon 2020 Program (ERC Starting Grant 678286, “Contextvision” to F.P.d.L.).

Conceptualisation: MH. Data wrangling and preprocessing: JvH. Formal analysis: MH, JvH. Statistical analysis and visualisations: JvH, MH. Supervision: FPdL, PH. Initial draft: MH. Final draft: MH, JvH, PH, FPdL.

The Provo and Geco corpora are freely available (Cop et al. 2017; Luke & Christianson, 2018). All additional data and code needed to reproduce the results will be made public on the Donders Repository at https://doi.org/10.34973/kgm8-6z09.

Andrews, S., & Veldre, A. (2021). Wrapping up sentence comprehension: The role of task demands and individual differences. Scientific Studies of Reading, 25(2), 123–140.
Angele, B., Laishley, A. E., Rayner, K., & Liversedge, S. P. (2014). The effect of high- and low-frequency previews and sentential fit on word skipping during reading. Journal of Experimental Psychology: Learning, Memory, and Cognition, 40(4), 1181–1203.
Angele, B., & Rayner, K. (2013). Processing the in the parafovea: Are articles skipped automatically? Journal of Experimental Psychology: Learning, Memory, and Cognition, 39(2), 649–662.
Balota, D. A., Pollatsek, A., & Rayner, K. (1985). The interaction of contextual constraints and parafoveal visual information in reading. Cognitive Psychology, 17(3), 364–390.
Becker, W., & Jürgens, R. (1979). An analysis of the saccadic system by means of double step stimuli. Vision Research, 19(9), 967–983.
Bicknell, K., & Levy, R. (2010). A rational model of eye movement control in reading. In J. Hajič (Ed.), Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (pp. 1168–1178). Association for Computational Linguistics.
Bouma, H., & de Voogd, A. H. (1974). On the control of eye saccades in reading. Vision Research, 14(4), 273–284.
Brodbeck, C., Bhattasali, S., Cruz Heredia, A. A. L., Resnik, P., Simon, J. Z., & Lau, E. (2022). Parallel processing in speech perception with local and global representations of linguistic context. eLife, 11, Article e72056.
Brysbaert, M., & Drieghe, D. (2003). Please stop using word frequency data that are likely to be word length effects in disguise. Behavioral and Brain Sciences, 26(4), 479.
Brysbaert, M., Drieghe, D., & Vitu, F. (2005). Word skipping: Implications for theories of eye movement control in reading. In G. Underwood (Ed.), Cognitive processes in eye guidance (pp. 53–78). Oxford University Press.
Brysbaert, M., & New, B. (2009). Moving beyond Kucera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41(4), 977–990.
Buswell, G. T. (1920). An experimental study of the eye-voice span in reading. University of Chicago.
Clifton, C., Jr., Ferreira, F., Henderson, J. M., Inhoff, A. W., Liversedge, S. P., Reichle, E. D., & Schotter, E. R. (2016). Eye movements in reading and information processing: Keith Rayner’s 40 year legacy. Journal of Memory and Language, 86, 1–19.
Cop, U., Dirix, N., Drieghe, D., & Duyck, W. (2017). Presenting GECO: An eyetracking corpus of monolingual and bilingual sentence reading. Behavior Research Methods, 49(2), 602–615.
de Heer, W. A., Huth, A. G., Griffiths, T. L., Gallant, J. L., & Theunissen, F. E. (2017). The hierarchical cortical organization of human speech processing. Journal of Neuroscience, 37(27), 6539–6557.
Dearborn, W. F. (1906). The psychology of reading: An experimental study of the reading pauses and movements of the eye. Columbia University Contributions to Philosophy and Psychology, 4, 1–134.
Dehaene, S. (2009). Reading in the brain: The new science of how we read. Penguin.
Deubel, H., O’Regan, J. K., & Radach, R. (2000). Commentary on section 2—Attention, information processing, and eye movement control. In A. Kennedy, R. Radach, D. Heller, & J. Pynte (Eds.), Reading as a perceptual process (pp. 355–374). North-Holland/Elsevier Science Publishers.
Duan, Y., & Bicknell, K. (2020). A rational model of word skipping in reading: Ideal integration of visual and linguistic information. Topics in Cognitive Science, 12(1), 387–401.
Ehrlich, S. F., & Rayner, K. (1981). Contextual effects on word perception and eye movements during reading. Journal of Verbal Learning and Verbal Behavior, 20(6), 641–655.
Engbert, R., & Kliegl, R. (2003). The game of word skipping: Who are the competitors? Behavioral and Brain Sciences, 26(4), 481–482.
Engbert, R., Nuthmann, A., Richter, E. M., & Kliegl, R. (2005). SWIFT: A dynamical model of saccade generation during reading. Psychological Review, 112(4), 777–813.
Findlay, J. M., & Walker, R. (1999). A model of saccade generation based on parallel processing and competitive inhibition. Behavioral and Brain Sciences, 22(4), 661–721.
Frank, S. L., Fernandez Monsalve, I., Thompson, R. L., & Vigliocco, G. (2013). Reading time data for evaluating broad-coverage models of English sentence processing. Behavior Research Methods, 45(4), 1182–1190.
Goodkind, A., & Bicknell, K. (2018). Predictive power of word surprisal for reading times is a linear function of language model quality. In Proceedings of the 8th Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2018) (pp. 10–18). Association for Computational Linguistics.
Goodman, K. S. (1967). Reading: A psycholinguistic guessing game. Journal of the Reading Specialist, 6(4), 126–135.
Hahn, M., & Keller, F. (2023). Modeling task effects in human reading with neural network-based attention. Cognition, 230, Article 105289.
Hanes, D. P., & Schall, J. D. (1995). Countermanding saccades in macaque. Visual Neuroscience, 12(5), 929–937.
Heilbron, M., Armeni, K., Schoffelen, J.-M., Hagoort, P., & de Lange, F. P. (2022). A hierarchy of linguistic predictions during natural language comprehension. Proceedings of the National Academy of Sciences, 119(32), Article e2201968119.
Heilbron, M., Richter, D., Ekman, M., Hagoort, P., & de Lange, F. P. (2020). Word contexts enhance the neural representation of individual letters in early visual cortex. Nature Communications, 11(1), Article 321.
Hohenstein, S., & Kliegl, R. (2014). Semantic preview benefit during reading. Journal of Experimental Psychology: Learning, Memory, and Cognition, 40(1), 166–190.
Inhoff, A. W. (1984). Two stages of word processing during eye fixations in the reading of prose. Journal of Verbal Learning and Verbal Behavior, 23(5), 612–624.
Just, M. A., & Carpenter, P. A. (1980). A theory of reading: From eye fixations to comprehension. Psychological Review, 87(4), 329–354.
Kennedy, A. (2003). The Dundee Corpus [CD-ROM]. Psychology Department, University of Dundee.
Kennedy, A., Pynte, J., Murray, W. S., & Paul, S.-A. (2013). Frequency and predictability effects in the Dundee Corpus: An eye movement analysis. Quarterly Journal of Experimental Psychology, 66(3), 601–618.
Kliegl, R., Grabner, E., Rolfs, M., & Engbert, R. (2004). Length, frequency, and predictability effects of words on eye movements in reading. European Journal of Cognitive Psychology, 16(1–2), 262–284.
Kliegl, R., Nuthmann, A., & Engbert, R. (2006). Tracking the mind during reading: The influence of past, present, and future words on fixation durations. Journal of Experimental Psychology: General, 135(1), 12–35.
Kliegl, R., Risse, S., & Laubrock, J. (2007). Preview benefit and parafoveal-on-foveal effects from word n + 2. Journal of Experimental Psychology: Human Perception and Performance, 33(5), 1250–1255.
Lee, T. S., & Mumford, D. (2003). Hierarchical Bayesian inference in the visual cortex. Journal of the Optical Society of America A, 20(7), 1434–1448.
Legendre, P. (2008). Studying beta diversity: Ecological variation partitioning by multiple regression and canonical analysis. Journal of Plant Ecology, 1(1), 3–8.
Luke, S. G., & Christianson, K. (2016). Limits on lexical prediction during reading. Cognitive Psychology, 88, 22–60.
Luke, S. G., & Christianson, K. (2018). The Provo Corpus: A large eye-tracking corpus with predictability norms. Behavior Research Methods, 50(2), 826–833.
McClelland, J. L., & Elman, J. L. (1986). The TRACE model of speech perception. Cognitive Psychology, 18(1), 1–86.
McClelland, J. L., & O’Regan, J. K. (1981). Expectations increase the benefit derived from parafoveal visual information in reading words aloud. Journal of Experimental Psychology: Human Perception and Performance, 7(3), 634–644.
McConkie, G. W., Kerr, P. W., Reddix, M. D., & Zola, D. (1988). Eye movement control during reading: I. The location of initial eye fixations on words. Vision Research, 28(10), 1107–1118.
McConkie, G. W., & Rayner, K. (1975). The span of the effective stimulus during a fixation in reading. Perception & Psychophysics, 17(6), 578–586.
Morton, J. (1964). The effects of context upon speed of reading, eye movements and eye-voice span. Quarterly Journal of Experimental Psychology, 16(4), 340–354.
Norris, D. (2006). The Bayesian reader: Explaining word recognition as an optimal Bayesian decision process. Psychological Review, 113(2), 327–357.
Norris, D. (2009). Putting it all together: A unified account of word recognition and reaction-time distributions. Psychological Review, 116(1), 207–219.
O’Regan, J. K. (1980). The control of saccade size and fixation duration in reading: The limits of linguistic control. Perception & Psychophysics, 28(2), 112–117.
O’Regan, J. K. (1992). Optimal viewing position in words and the strategy-tactics theory of eye movements in reading. In K. Rayner (Ed.), Eye movements and visual cognition: Scene perception and reading (pp. 333–354). Springer.
Pan, Y., Frisson, S., & Jensen, O. (2021). Neural evidence for lexical parafoveal processing. Nature Communications, 12(1), Article 5234.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
Pimentel, T., Meister, C., Wilcox, E. G., Levy, R., & Cotterell, R. (2022). On the effect of anticipation on reading times. arXiv:2211.14301.
Pynte, J., & Kennedy, A. (2006). An influence over eye movements in reading exerted from beyond the level of the word: Evidence from reading English and French. Vision Research, 46(22), 3786–3801.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 9.
Rayner
,
K.
(
1975
).
The perceptual span and peripheral cues in reading
.
Cognitive Psychology
,
7
(
1
),
65
81
.
Rayner
,
K.
(
1977
).
Visual attention in reading: Eye movements reflect cognitive processes
.
Memory & Cognition
,
5
(
4
),
443
448
. ,
[PubMed]
Rayner
,
K.
(
1979
).
Eye guidance in reading: Fixation locations within words
.
Perception
,
8
(
1
),
21
30
. ,
[PubMed]
Rayner
,
K.
(
2009
).
Eye movements and attention in reading, scene perception, and visual search
.
Quarterly Journal of Experimental Psychology
,
62
(
8
),
1457
1506
. ,
[PubMed]
Rayner
,
K.
,
Juhasz
,
B. J.
, &
Brown
,
S. J.
(
2007
).
Do readers obtain preview benefit from word N + 2? A test of serial attention shift versus distributed lexical processing models of eye movement control in reading
.
Journal of Experimental Psychology: Human Perception and Performance
,
33
(
1
),
230
245
. ,
[PubMed]
Rayner
,
K.
, &
Pollatsek
,
A.
(
1987
).
Eye movements in reading: A tutorial review
. In
M.
Coltheart
(Ed.),
Attention and performance XII: The psychology of reading
(pp.
327
362
).
Lawrence Erlbaum Associates, Inc
.
Rayner
,
K.
,
Reichle
,
E. D.
,
Stroud
,
M. J.
,
Williams
,
C. C.
, &
Pollatsek
,
A.
(
2006
).
The effect of word frequency, word predictability, and font difficulty on the eye movements of young and older readers
.
Psychology and Aging
,
21
(
3
),
448
465
. ,
[PubMed]
Rayner
,
K.
, &
Well
,
A. D.
(
1996
).
Effects of contextual constraint on eye movements in reading: A further examination
.
Psychonomic Bulletin & Review
,
3
(
4
),
504
509
. ,
[PubMed]
Reicher
,
G. M.
(
1969
).
Perceptual recognition as a function of meaningfulness of stimulus material
.
Journal of Experimental Psychology
,
81
(
2
),
275
280
. ,
[PubMed]
Reichle
,
E. D.
,
Rayner
,
K.
, &
Pollatsek
,
A.
(
2003
).
The E-Z reader model of eye-movement control in reading: Comparisons to other models
.
Behavioral and Brain Sciences
,
26
(
4
),
445
526
. ,
[PubMed]
Reilly
,
R. G.
, &
O’Regan
,
J. K.
(
1998
).
Eye movement control during reading: A simulation of some word-targeting strategies
.
Vision Research
,
38
(
2
),
303
317
. ,
[PubMed]
Rousselet, G. A., Pernet, C. R., & Wilcox, R. R. (2019). An introduction to the bootstrap: A versatile method to make inferences by using data-driven simulations. PsyArXiv.
Schall, J. D., & Cohen, J. Y. (2011). The neural basis of saccade target selection. In S. Liversedge, I. Gilchrist, & S. Everling (Eds.), The Oxford handbook of eye movements (pp. 357–381). Oxford University Press.
Schotter, E. R., Angele, B., & Rayner, K. (2012). Parafoveal processing in reading. Attention, Perception, & Psychophysics, 74(1), 5–35.
Schotter, E. R., Lee, M., Reiderman, M., & Rayner, K. (2015). The effect of contextual constraint on parafoveal processing in reading. Journal of Memory and Language, 83, 118–139.
Shain, C., Meister, C., Pimentel, T., Cotterell, R., & Levy, R. (2022). Large-scale evidence for logarithmic effects of word predictability on reading time. PsyArXiv.
Smith, N. J., & Levy, R. (2013). The effect of word predictability on reading time is logarithmic. Cognition, 128(3), 302–319.
Staub, A. (2015). The effect of lexical predictability on eye movements in reading: Critical review and theoretical interpretation. Language and Linguistics Compass, 9(8), 311–327.
Tiffin-Richards, S. P., & Schroeder, S. (2015). Children’s and adults’ parafoveal processes in German: Phonological and orthographic effects. Journal of Cognitive Psychology, 27(5), 531–548.
Veldre, A., & Andrews, S. (2018). Parafoveal preview effects depend on both preview plausibility and target predictability. Quarterly Journal of Experimental Psychology, 71(1), 64–74.
Vitu, F., O’Regan, J. K., Inhoff, A. W., & Topolski, R. (1995). Mindless reading: Eye-movement characteristics are similar in scanning letter strings and reading texts. Perception & Psychophysics, 57(3), 352–364.
Vitu, F., O’Regan, J. K., & Mittau, M. (1990). Optimal landing position in reading isolated words and continuous text. Perception & Psychophysics, 47(6), 583–600.
Wheeler, D. D. (1970). Processes in word recognition. Cognitive Psychology, 1(1), 59–85.
Wilcox, E. G., Gauthier, J., Hu, J., Qian, P., & Levy, R. (2020). On the predictive power of neural language models for human real-time comprehension behavior. arXiv:2006.01912.
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Le Scao, T., Gugger, S., … Rush, A. M. (2020). HuggingFace’s Transformers: State-of-the-art natural language processing. arXiv:1910.03771.
Woolnough, O., Donos, C., Rollo, P. S., Forseth, K. J., Lakretz, Y., Crone, N. E., Fischer-Baum, S., Dehaene, S., & Tandon, N. (2021). Spatiotemporal dynamics of orthographic and lexical processing in the ventral visual pathway. Nature Human Behaviour, 5(3), 389–398.
Yan, M., Kliegl, R., Shu, H., Pan, J., & Zhou, X. (2010). Parafoveal load of word N + 1 modulates preprocessing effectiveness of word N + 2 in Chinese reading. Journal of Experimental Psychology: Human Perception and Performance, 36(6), 1669–1676.
Yeatman, J. D., & White, A. L. (2021). Reading: The confluence of vision and language. Annual Review of Vision Science, 7, 487–517.
Zwitserlood, P. (1989). The locus of the effects of sentential-semantic context in spoken-word processing. Cognition, 32(1), 25–64.

Competing Interests

The authors declare no conflict of interest.

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. For a full description of the license, please visit https://creativecommons.org/licenses/by/4.0/legalcode.

Supplementary data