Abstract
Surprisal theory posits that less-predictable words should take more time to process, with word predictability quantified as surprisal, i.e., negative log probability in context. While evidence supporting the predictions of surprisal theory has been replicated widely, much of it has focused on a very narrow slice of data: native English speakers reading English texts. Indeed, no comprehensive multilingual analysis exists. We address this gap in the current literature by investigating the relationship between surprisal and reading times in eleven different languages, distributed across five language families. Deriving estimates from language models trained on monolingual and multilingual corpora, we test three predictions associated with surprisal theory: (i) whether surprisal is predictive of reading times, (ii) whether expected surprisal, i.e., contextual entropy, is predictive of reading times, and (iii) whether the linking function between surprisal and reading times is linear. We find that all three predictions are borne out crosslinguistically. By focusing on a more diverse set of languages, we argue that these results offer the most robust link to date between information theory and incremental language processing across languages.
1 Introduction
Language processing is incremental and dynamic: When a reader encounters a word, they allocate a certain amount of time to process it before moving on to the next one. One influential theory of the mechanism underlying this process is surprisal theory (Hale, 2001; Levy, 2008), which states that the time required to successfully comprehend a word is based on its predictability. Notably, predictability is often quantified as surprisal (negative log-probability given preceding context), from which the theory’s name is derived. Surprisal theory is supported empirically by a number of studies which have found that surprisal is strongly correlated with psychometric measurements in large naturalistic reading corpora (Demberg and Keller, 2008; Wilcox et al., 2020; Shain, 2019, 2021; Meister et al., 2021; Pimentel et al., 2023; Hoover et al., 2022, inter alia). Put differently, a word’s surprisal is a strong correlate of its processing effort, operationalized as reading time.
However, there is one serious limitation with most previous studies: While making general claims about human language processing, they predominantly investigate reading times in English. And, while a few studies have investigated surprisal effects in languages other than English, e.g., Meister et al. (2021) in Dutch and Kuribayashi et al. (2021, 2022) in Japanese, no systematic, crosslinguistic analysis has been performed. As multiple sentence processing phenomena exhibit significant crosslinguistic variation (Hillert, 1998), the extent to which surprisal theory generalizes crosslinguistically is a nontrivial limitation of the current state of the literature.
In addition, recent work has posited two extensions to surprisal theory: (a) that contextual entropy, i.e., expected surprisal, also correlates with reading times, and (b) that the relationship between surprisal and reading time is linear (Smith and Levy, 2013; Wilcox et al., 2020; Shain et al., 2022). Regarding (a), Pimentel et al. (2023) and Cevoli et al. (2022) have argued for what may be considered an expanded version of surprisal theory, where processing difficulty is still determined by surprisal, but where people’s reading behavior is additionally sensitive to expected surprisal (contextual entropy). Building on prior work that has investigated the role of entropy in language processing (Hale, 2003; Roark et al., 2009; Linzen and Jaeger, 2016; van Schijndel and Schuler, 2017), these recent studies suggest that readers may allocate reading times in advance of encountering a word, based on their expectations of how difficult the word will be to process. Regarding (b), a number of studies have found evidence that the linking function between reading times and surprisal is linear (Smith and Levy, 2013; Wilcox et al., 2020; Shain et al., 2022). However, these results have been challenged recently, with different studies coming to different conclusions about the most appropriate linking function. In the past two years, for example, investigations have concluded that this function is sublinear (Brothers and Kuperberg, 2021), linear (Shain et al., 2022), and superlinear (Meister et al., 2021; Hoover et al., 2022). Here, we will use the term surprisal theory to refer to both the core hypothesis that reading times are correlated with surprisal, as well as the two extensions—(a) and (b)—described above.
We address a gap in the current literature by investigating the predictions of surprisal theory in eleven languages distributed across five language families.1 We enumerate these three predictions as hypotheses below.
(i) The surprisal hypothesis: Surprisal is predictive of reading times.
(ii) The contextual entropy hypothesis: Contextual entropy is predictive of reading times.
(iii) The linear link hypothesis: The linking function between surprisal and reading times is linear.
We facilitate crosslinguistic comparison by using the MECO dataset (Siegelman et al., 2022), which presents eye-tracking data on reading materials with the same content in each language. We estimate surprisal and contextual entropy from two types of autoregressive language models—a single, large, multilingual model (mGPT; Shliazhko et al. 2022), as well as monolingual models trained on large and small datasets, where the small dataset is the same size across languages (≈ 30 million words). We quantify the psychometric predictive power of surprisal and contextual entropy (i.e., how well each predicts reading times) by including them as variables in linear regression models. These models are then trained to predict by-word reading times; if the log-likelihood of the regression improves after including these variables, we take this as evidence that those variables have psychometric predictive power (Frank and Bod, 2011; Fossum and Levy, 2012; Goodkind and Bicknell, 2018).
We find that, in all languages tested, regression models that include surprisal are significantly better predictors of reading times than baselines which do not include surprisal, confirming the surprisal hypothesis. Additionally, we find that models which include contextual entropy are even better predictors of reading times in most languages tested, confirming the contextual entropy hypothesis. Finally, compatible with the linear link hypothesis, we find that models constrained to a linear relationship between surprisal and reading times are just as good as those that can express more complex relationships. Overall, our results provide the largest crosslinguistic analysis of the relationship between reading and word-level information-theoretic properties to date.
2 Psychometric Predictive Power
Our behavior of interest is how long readers spend visually attending to a given word wt in its linguistic context, i.e., wt’s reading time. This quantity offers a window into the psychological processes that underlie language comprehension and is typically taken as a direct reflection of the word’s processing difficulty (Rayner, 1998). A word’s reading time can be measured via multiple experimental modalities, including self-paced reading (Just et al., 1982; Jegerski, 2013) and the maze task (Forster et al., 2009; Boyce et al., 2020). In this work, we focus on eye-tracking measurements. These measurements have high temporal resolution and exhibit smaller spillover effects than self-paced reading (Smith and Levy, 2013), where spillover is the effect of a word’s properties on later words’ reading behavior.
In order to contrast different theories of language processing, we compare regression models with different vectors of predictor variables x and with different architectures fϕ, each of which is taken to instantiate a different hypothesis about what underlying factors determine reading times. We fit each regression model on a portion of our dataset and evaluate it by measuring the log-likelihood that it assigns to held-out data. Models that lead to higher log-likelihood can be said to have better predictive power or psychological accuracy for human reading—and their associated theories are then taken to be better models of the underlying psycholinguistic processes (Frank and Bod, 2011; Fossum and Levy, 2012).
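Concretely, we quantify this comparison with Δ, the difference in held-out log-likelihood between a target and a baseline regression. One natural formalization (shown here schematically, with per-word averaging as a convention) is

$$
\Delta = \frac{1}{N}\sum_{t=1}^{N}\Big[\log p_{\text{target}}\big(y_t \mid \mathbf{x}_t\big) - \log p_{\text{base}}\big(y_t \mid \mathbf{x}_t\big)\Big],
$$

where $y_t$ is the held-out reading time of word $w_t$ and $\mathbf{x}_t$ is its predictor vector.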
2.1 Surprisal
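In what follows, a word’s surprisal is its negative log probability given the preceding context, measured in bits (the standard definition, stated here for concreteness):

$$
s(w_t) = -\log_2 p(w_t \mid \mathbf{w}_{<t}).
$$

Under surprisal theory, the time required to process $w_t$ is an increasing function of this quantity.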
2.2 Contextual Entropy
Prior work has investigated the relationship between contextual entropy and reading behavior: A number of studies have investigated entropy reduction, or the extent to which wt reduces uncertainty over possible next words (Frank, 2010, 2013) or over the possible incremental parses that can be assigned to a sentence prefix (Hale, 2003, 2006). Other researchers have investigated the effect of successor entropy, i.e., the entropy of Wt+1, on the current word’s reading times (Roark et al., 2009; Linzen and Jaeger, 2016; van Schijndel and Schuler, 2017).3 In contrast, we look at the effect of Wt’s own contextual entropy on reading times, following Pimentel et al. (2023) and Cevoli et al. (2022). As discussed in Pimentel et al. (2023), investigating contextual entropy separately from surprisal can uncover to what extent reading behavior is responsive (i.e., driven by surprisal) or anticipatory (i.e., driven by expected surprisal). Pimentel et al. (2023) specifically found that contextual entropy is a significant predictor of reading times on 3 out of 4 of their tested English eye-tracking and self-paced reading datasets.
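Formally, the contextual entropy of the upcoming word is the expected surprisal under the next-word distribution (again stated schematically, in bits):

$$
\mathrm{H}(W_t \mid \mathbf{w}_{<t}) = -\sum_{w \in \mathcal{V}} p(w \mid \mathbf{w}_{<t}) \log_2 p(w \mid \mathbf{w}_{<t}),
$$

where $\mathcal{V}$ is the vocabulary.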
3 Experimental Setup
3.1 Dataset
We use the Multilingual Eye Movement Corpus (MECO; Siegelman et al., 2022). MECO contains eye-tracking data from L1 speakers (between 29 and 54 per language) for twelve simplified Wikipedia-style articles in thirteen languages; these languages come from five different language families. Articles in the MECO corpus went through an iterative translation process by separate teams of translators to ensure that article content was the same across languages; total article length ranges from a minimum of 1,487 words (Finnish) to a maximum of 3,021 words (Russian). The eleven languages we include in our analysis are: Korean (Koreanic), Turkish (Turkic), Hebrew (Semitic), Finnish (Uralic), Dutch, English, German, Greek, Italian, Russian, and Spanish (Indo-European).4 While this sample is still biased towards Indo-European languages, it is more diverse than that of previous studies, which have tended to focus exclusively on a single language.
The following pre-processing steps were taken: Words that were skipped on the first pass were given a reading time of zero and included in the analysis. Eye-tracking datasets report multiple different word-based measurements of reading times, of which we use three (Rayner, 1998): First fixation is the duration of the first fixation on a word during its first pass. Gaze duration is the sum of all first-pass fixations on a word. And total fixation time is the sum of all fixations on a word during the trial. While we report results for all three for the sake of completeness, our discussion will focus on results for gaze duration, as has been done in previous studies (e.g., Wilcox et al., 2020). First fixation times are typically associated with word identification (Clifton et al., 2007) and are not expected to reflect strong contextual influences. Total reading durations can be influenced by material from the right context (i.e., regressive saccades). Thus, for studies such as ours that focus on progressive movement through a text, gaze duration is expected to be most strongly associated with first-pass processing difficulty, which is our cognitive process of interest. For each of these metrics, we fit a regression model on averages of the reading time measures taken across subjects, as has been done in previous work (Smith and Levy, 2013; Wilcox et al., 2020). This step was performed to mitigate the potentially high by-participant variance present in eye-tracking data.
3.2 Language Models
We derive surprisal and contextual entropy estimates from both monolingual and multilingual models, which we describe in greater detail below.
Monolingual Models
We train monolingual transformer models using the Wiki40B dataset (Guo et al., 2020), relying on the training and validation splits from the original paper for each of our analyzed languages. We first fit language-specific UnigramLM tokenizers (Kudo, 2018) with a vocabulary size of 32k on the training portion of this dataset, which we then use to tokenize both the Wiki40B and MECO text into subword units. We then train two models per language, with different amounts of training data: For the monoT(all) variant, we train the model on the total amount of data in Wiki40B for each language; for the monoT(30m) variant, we subsample ≈ 30 million tokens from each language. For a list of the training dataset sizes for the monoT(all) models, see Table 1. We train all our models using fairseq (Ott et al., 2019), following their recommended language modeling training hyper-parameters. We use a standard decoder-only transformer with 6 layers, a context window size of 512 tokens, and shared input–output embeddings. We train our models using Adam (Kingma and Ba, 2015), with a learning rate of 5e−4, 4000 warm-up updates, and dropout of 0.1. For both of our monolingual models, as well as the multilingual model described below, per-word surprisals are computed by summing over subword-unit surprisals, which is the appropriate procedure since surprisal decomposes additively over the units comprising a word. Because of spurious ambiguity inherent in the tokenization scheme, an efficient algorithm to estimate contextual entropy over full words is unavailable to us; such an algorithm would require summing over a potentially infinite number of subword combinations. Instead, we simplify this computation by estimating contextual entropy over a single step of subword tokens, as suggested in Pimentel et al. (2023). Similar techniques have been employed previously in studies of entropy (Frank, 2010), e.g., to account for clitics and contractions.
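The sketch below illustrates this aggregation step in plain R, with hypothetical variable names (the actual pipeline is not reproduced here): per-word surprisal is the sum of its subwords’ surprisals, and the single-step contextual entropy is computed from a next-subword probability vector.

```r
# Sketch of the aggregation step (hypothetical variable names; not the actual pipeline).
# `subword_surprisal`: per-subword surprisals in bits, from the language model.
# `word_id`: index of the word each subword belongs to, in reading order.
subword_surprisal <- c(1.2, 0.4, 3.1, 0.9, 2.2)
word_id           <- c(1,   1,   2,   3,   3)

# Surprisal decomposes additively, so per-word surprisal is the sum over its subwords.
word_surprisal <- tapply(subword_surprisal, word_id, sum)

# Single-step contextual entropy: expected surprisal under the model's
# next-subword distribution `p` (a probability vector over the subword vocabulary).
single_step_entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
}
single_step_entropy(c(0.5, 0.25, 0.25))  # = 1.5 bits
```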
Table 1: Training dataset sizes (in millions of tokens) for the monoT(all) models.

| Language | Code | # Training Tokens (mil) |
|---|---|---|
| Dutch | du | 171 |
| English | en | 1,966 |
| Finnish | fi | 89 |
| German | ge | 883 |
| Greek | gr | 57 |
| Hebrew | he | 112 |
| Italian | it | 376 |
| Korean | ko | 75 |
| Russian | ru | 488 |
| Spanish | sp | 508 |
| Turkish | tr | 48 |
Multilingual Model
For our multilingual estimates, we use mGPT (Shliazhko et al., 2022), a single autoregressive transformer language model trained jointly on text from many languages, including all eleven analyzed here.
Context Length
One recent study has hypothesized that, when deriving surprisal estimates for psycholinguistic modeling, the size of the context window can bias estimates (Hoover et al., 2022). The reasoning is that short context windows could shift probability mass away from very low-frequency words, which would be better predicted from longer contexts. Therefore, we estimate surprisal and contextual entropy from mGPT in two contexts: In short contexts, the model is given only the current sentence (up until the current word); in long contexts, we use the model’s full input window size of 512 tokens. We use long contexts for our first analysis, and both contexts for our second analysis, which investigates both the shape of the reading times–surprisal linking function and the influence of context length on these results.
Psychological Plausibility
Increasingly, researchers who use language models for cognitive modeling have considered their psychological plausibility as estimates of humans’ internal notions of word predictability. In particular, some researchers have compared the size of the models’ training data to the amount of linguistic experience of the average human child (Zhang et al., 2021). Assuming, as an upper limit, that children are typically exposed to ≈ 11 million words per year (Hart and Risley, 1995), the mGPT model is trained on multiple human lifetimes’ worth of language data. The monoT(all) models are trained on data scales equivalent to or less than one human lifetime,6 and the monoT(30m) models are trained on data equivalent to the linguistic exposure of a young child. However, we argue that the psychological plausibility of a model’s next-word predictions is not completely determined by whether its training data matches the amount of data a human learner is exposed to. Indeed, there is a body of evidence suggesting that, beyond a certain minimal amount of data, the more data a model is trained on, the more human-like that model’s next-word predictions become (Goodkind and Bicknell, 2018; Wilcox et al., 2020). All of our models are trained on an amount of data within this range. At the other end of the scale, however, the relationship flips: Models trained on extremely large amounts of data seem to be slightly worse predictors of human reading (Shain et al., 2022; Oh and Schuler, 2023). Our models are trained on uni-modal (i.e., language-only) data and with arguably weaker priors for language-like structure, whereas humans learn from multi-modal data, potentially with much stronger priors for linguistic structure. More training data likely compensates for the lack of multi-modal input and for these weaker priors.
3.3 Regression Models
All of our regression models are fit to predict the reading time y(wt, w<t) of a word wt in a context w<t from the predictor vector xt. In addition to quantities derived from the word wt itself, our predictor includes quantities derived from the previous two words, wt−1 and wt−2, to control for potential spillover effects. We will refer to the three words wt, wt−1, wt−2 as our regressor words. Following previous work in this area, all regression models include word length and log-unigram frequency (as estimated by Speer, 2022) for all regressor words in the predictor xt at a given index t. These constitute our (context-invariant) baseline predictors. Regression models are trained and evaluated using 10-fold cross validation. For more information on the regressions used in each of our experiments, see Appendix A. The significance of the observed Δ values between target and baseline models is assessed via a paired permutation test that checks whether Δ is significantly different from zero. We use permutation tests for our comparisons because they make no assumption about the distribution of the test statistic. Instead, the test computes p-values from the empirical distribution of the test statistic, estimated by averaging over random permutations of the paired differences in likelihoods.
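As a concrete illustration, the following R sketch (with toy stand-in data and a deliberately simplified predictor set; see Appendix A for the actual formulas) shows the two steps just described: computing per-word differences in held-out log-likelihood across folds, and assessing their mean with a paired permutation test.

```r
# Toy stand-in for the real data (hypothetical columns).
set.seed(1)
df <- data.frame(surp = runif(1000, 0, 20), len = rpois(1000, 5), freq = rnorm(1000))
df$reading_time <- 200 + 3 * df$surp + 10 * df$len + rnorm(1000, sd = 30)

folds <- sample(rep(1:10, length.out = nrow(df)))

# Per-word log-likelihood of held-out reading times under a fitted linear model.
loglik_per_word <- function(formula, train, test) {
  m <- lm(formula, data = train)
  mu <- predict(m, newdata = test)
  dnorm(test$reading_time, mean = mu, sd = summary(m)$sigma, log = TRUE)
}

# Paired, per-word differences in held-out log-likelihood (target - baseline).
delta <- unlist(lapply(1:10, function(k) {
  train <- df[folds != k, ]
  test  <- df[folds == k, ]
  loglik_per_word(reading_time ~ surp + len + freq, train, test) -
    loglik_per_word(reading_time ~ len + freq, train, test)
}))

# Paired permutation test: randomly flip the sign of each paired difference.
obs  <- mean(delta)
perm <- replicate(10000, mean(delta * sample(c(-1, 1), length(delta), replace = TRUE)))
p_value <- mean(abs(perm) >= abs(obs))
```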
4 Results
4.1 Surprisal
To test the surprisal hypothesis, we fit a target regression model whose predictors include the surprisals of our regressor words plus the baseline predictors described above. We compare this to a baseline that does not include the surprisal predictors. For this and subsequent tests, we calculate results for each language individually, as well as for the combined data from all languages. Results can be seen in Figure 1, broken down by language, model, and each of our three word-based measurements of reading time. We observe a clear pattern in the results across the languages: positive Δ in nearly every test for gaze duration and total fixation, and less consistently positive Δ for first fixation, where, as noted before, we would not necessarily expect surprisal effects to show up. Looking at the results for each model, we observe the most robust results for mGPT, where Δ is significantly greater than zero in every language for gaze duration and total fixation. For the monolingual models, we observe more robust effects for the monoT(all) model than for the monoT(30m) model, which is sensible given the latter’s limited training data.
For an aggregate test of the effects of surprisal, we fit an additional regression model on the combined data from all languages to predict gaze duration with random by-language effects. We use a fully maximal random effect structure, as advocated by Barr et al. (2013). We find that the model with surprisal leads to Δ significantly greater than zero in all cases (p < 0.001). Although surprisal leads to a positive Δ across languages, we do observe some variation in the magnitude of this effect, i.e., in the predictive power obtained by the regression model. For both mGPT and monoT(all), we observe the highest predictive power in Russian and Dutch, with lower predictive power in Spanish, English, and Hebrew. One natural question to ask is whether imbalances in the models’ training data lead to some of this variation—do models make better predictions for languages where they have seen more data? However, there are converging pieces of evidence from our data suggesting that differences in dataset size are not the main cause of the by-language variation. First, both mGPT and monoT(all) show relatively lower predictive power for some large-data languages such as Spanish and English. Second, and quite interestingly, similar patterns of predictive power can be observed for our monoT(30m) models, where training dataset size is controlled across languages. Here, as with the other models, we observe larger values of Δ in Dutch and Russian and smaller values of Δ in English, Spanish, and Hebrew. These results pose a puzzle, as the languages for which the models obtain higher Δ are not obviously different, in terms of their linguistic features, from those for which the models obtain lower Δ. For example, English (lower Δ) and Dutch (higher Δ) are both West Germanic. Further investigation is needed to determine whether these patterns hold up for other crosslinguistic reading time datasets.
4.2 Contextual Entropy
To test the contextual entropy hypothesis, we first fit a single baseline regression model. This baseline includes the surprisal of all regressor words, plus the baseline predictors. We then evaluate target regression models in two variants: For the replace regression model, we replace surprisal with contextual entropy for all regressor words. For the add regression model, we add an additional contextual entropy term for all regressor words. As results do not differ much between our two monolingual language models, we present results only for monoT(all).
Results can be seen in Figure 2, where the replace regression is indicated with a triangle and the add regression is indicated with a circle. First, we find that replacing surprisal with entropy tends to hurt predictive power in most cases. For example, for mGPT and gaze duration, Δ is negative in 6/11 languages and significantly so in two, Dutch (p < 0.05) and Italian (p < 0.05), implying overfitting. Negative effects are even stronger for the monoT(all) model, where we find negative gaze duration Δ in every language (results are significant in 5/11). Adding entropy as an additional predictor, on the other hand, generally improves the model’s predictive power. For example, for mGPT and gaze duration, Δ from the add regression is positive in 8/11 languages, and significantly so in 5 (English, Greek, Korean, Russian, and Turkish). In addition, Δ is significantly positive for the add regression for all three reading time measures when data is combined across languages, as shown in the ‘All’ column at the left of Figure 2. Results are less strong for monoT(all), where positive Δ shows up predominantly for first fixation. As before, we run an aggregate test with data from all languages including by-language random effects.7 For gaze duration, we find that adding contextual entropy leads to positive Δ (mGPT, p < 0.001; monoT(all), p < 0.01) and that replacement leads to negative Δ (mGPT, p < 0.01; monoT(all), p < 0.001). Overall, we take these results as being in line with those reported in Pimentel et al. (2023). Our findings suggest that contextual entropy has a weak—albeit consistent—effect on reading times across languages, and therefore that participants may be pre-planning their processing times based on the expected surprisal of upcoming words.
4.3 Variation Across Languages
The crosslinguistic relationship between Δ and language model quality is relevant to current debates about whether language models can plausibly be used to understand psycholinguistic processes. As mentioned in Section 3.2, it has been observed that, within English, models with lower perplexity tend to exhibit better predictive power (Goodkind and Bicknell, 2018; Wilcox et al., 2020). However, studies on Japanese have failed to replicate these results, suggesting that the relationship does not hold for all languages (Kuribayashi et al., 2021). Further, Oh and Schuler (2023) and Shain et al. (2022) show that this relationship may not hold even in English for the most recent language models. To investigate this, we compute, for mGPT, the Pearson correlation between Δ and test set perplexity, as reported in Shliazhko et al. (2022), both across languages and across language families.8 For this analysis we show results only for mGPT and leave a full analysis, comparing different monolingual models, for future work.
The correlations can be seen in Figure 4. We do find a relatively strong negative correlation across languages; however, it is not significant (ρ = −0.497, p = 0.1). We find no evidence of a correlation in the language family data. Although the negative by-language correlation suggests that mGPT may be a better model of psycholinguistic behavior for languages where it has lower perplexity, the lack of significance is in line with the negative results from Japanese.
Notably, there are important differences between this analysis and the studies cited above, which train a number of different language models within a single language and with a single shared vocabulary, as opposed to comparing the outputs of a single multilingual language model across languages, as we do here. Additionally, although mGPT does share a single vocabulary across languages, different languages might be a priori harder or easier to language-model (Cotterell et al., 2018; Mielke et al., 2019), and the quality of the tokenization might vary across languages as well. Thus, more fine-grained linguistic controls are necessary before drawing strong conclusions about the relationship between perplexity and psychometric predictive power across languages.
4.4 Model Coefficients
How do surprisal, entropy, frequency, and length individually affect reading times? Figure 3 shows the estimates from regression models for each of our predictor variables, estimated across 10 folds of data. Unlike the figures presented above, effects are broken down by the coefficients for each of our regressor words, from wt (on the left of each facet) to wt−2 (on the right of each facet). Note that effect size here does not correspond to the predictive power of the model as a whole, but rather to the impact of word-level properties on reading times. Because predictor variables are not normalized, units differ across rows. The top two rows indicate the estimated slowdown in milliseconds for each additional bit (of surprisal or entropy). The third row indicates the slowdown for each additional occurrence per billion words of text (on a log scale). And the bottom row indicates the slowdown for each additional character in the word.
We find a consistent effect of surprisal for wt of between 2 and 4 ms/bit. There is some inter-language variability, with the smallest effect for Hebrew, and larger effects for Dutch, Russian, Greek, and Italian. We find smaller effects for wt−1, ranging between 0 and 2 ms/bit. There is no obvious effect of surprisal for wt−2. Overall, these results differ slightly from those reported in Smith and Levy (2013), who investigate reading times on the English Dundee Corpus (Kennedy et al., 2003) and find a stronger effect for wt−1 than we do. However, our results are not inconsistent with the relatively lower spillover effects traditionally observed in eye-tracking data.
Turning to contextual entropy, we find slightly smaller effects, and slightly more variance, between languages. There is no obvious relationship between the effect sizes for surprisal and contextual entropy. For example, Dutch, which has a larger surprisal effect, has one of the smallest effect sizes for entropy. For frequency, we find a consistently negative effect for wt, as expected—as words get more frequent they take less time to read. For wt−1 and wt−2 effects are much smaller and less consistent across languages. For example, Dutch, Finnish, Italian, and Russian all have consistently positive frequency effects for wt−1, whereas in Turkish and Greek, these effects are negative.
We find consistent effects for word length, which are positive for every language on wt. We also find consistent negative effects for wt−1. This may be due to the fact that readers are likely to skip a word if it comes after a long word, which would be associated with a reading time of zero in our analysis. Overall, these coefficient estimates are in line with previous reading time studies and further highlight the crosslinguistic consistency of our results.
5 Surprisal–RT Linking Function
The regression models we have been using to assess Δ have implicitly assumed a linear linking function between surprisal and reading time—a relationship that has been empirically verified in some previous studies in English (Smith and Levy, 2013; Wilcox et al., 2020; Shain et al., 2022). Other recent studies, however, have questioned linearity, including Meister et al. (2021) and Hoover et al. (2022), who argue for a superlinear relationship, and Brothers and Kuperberg (2021), who argue for a sublinear relationship. In this section, we directly test the linear link hypothesis. We compare the Δ of our linear regression models against regression models that can capture non-linear relationships. We present results exclusively for gaze duration for the reasons discussed in Section 3.1.
5.1 Visualizing the Link with GAMs
In order to visualize the link between surprisal and reading times, we use generalized additive models (GAMs), a class of models that can fit non-linear relationships between predictor and response variables. Given the less-constrained hypothesis space of the GAM, if the model finds a relationship that is (visually) linear, this is good first evidence that the underlying effect is linear. We fit a GAM to predict reading times from word frequency, length, and surprisal, derived for short contexts (sentence level) and long contexts (document level). We include smooth terms for current- and previous-word surprisal, as well as tensor product terms for a non-linear interaction between log-frequency and word length. By way of comparison, we also fit a GAM that enforces a linear effect of surprisal, following Hoover et al. (2022). For this comparison, we fit new models, all using the mgcv library, rather than simply comparing GAMs to our linear models from the previous section, to ensure that the effects of our baseline variables are exactly the same between the models in this section.9 For each language and language model combination, we visualize the fitted curve using 10-fold cross validation, i.e., we train a GAM on 9 of the 10 folds and sample reading times from the trained model. To sample reading times, we vary the surprisal value of wt from 0 to 20 in increments of 0.1; no other predictors are fed into the model.
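A minimal mgcv sketch of this visualization step is given below, using toy stand-in data and a simplified formula (hypothetical column names; the full formula appears in Appendix A). Extracting only the surprisal smooth’s contribution ensures that no other predictor shapes the plotted curve.

```r
library(mgcv)
set.seed(1)
# Toy stand-in for one language's reading-time data (hypothetical columns).
df <- data.frame(surp = runif(500, 0, 20), freq = rnorm(500), len = rpois(500, 5))
df$reading_time <- 200 + 3 * df$surp + rnorm(500, sd = 30)
m <- gam(reading_time ~ s(surp, k = 6) + te(freq, len), data = df)

# Vary current-word surprisal from 0 to 20 in steps of 0.1, keeping only the
# contribution of the surprisal smooth term.
grid  <- data.frame(surp = seq(0, 20, by = 0.1), freq = 0, len = 0)
curve <- predict(m, newdata = grid, type = "terms", terms = "s(surp)")
plot(grid$surp, curve, type = "l",
     xlab = "surprisal (bits)", ylab = "partial effect on reading time (ms)")
```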
The visualizations of the estimated GAMs for effects on wt can be seen in Figure 5. Below each fit, we show density plots for surprisal values in the corpus. The results are consistent across languages and contexts. Visually, the non-linear GAMs capture the effect of surprisal on reading times by fitting an approximately linear curve, which sometimes falls directly on top of the linear control GAM (e.g., for Finnish and Turkish). Unlike Hoover et al. (2022), we do not find a consistent difference between fits for surprisals derived in short contexts versus long contexts. We note, however, that Hoover et al. (2022) find superlinear trends especially for their best examined models (e.g., GPT-3), which may outperform the multilingual mGPT.
5.2 Testing Linearity
Although the GAM fits in Figure 5 are visually linear, we would like to test the question of linearity with a more rigorous method. To do so, we compare the Δ of the linear and non-linear GAMs described above. Δ is calculated by comparing each model to a shared baseline that includes only tensor product terms for frequency and length. The idea is that if the underlying relationship between surprisal and reading time is non-linear, then the non-linear GAMs should be able to achieve higher Δ, whereas if the underlying relationship is linear then the non-linear GAMs would not have an advantage. Thus, a consistently null result across languages suggests that the relationship is linear.
The results of this comparison can be seen in Figure 6. Here, Δ is slightly different for linear models than in Section 4.1, as we fit these models with tensor product terms for the baseline predictors. Visually, there is no consistent difference between linear and non-linear models across languages. We test the difference in Δ statistically with permutation tests, as described in Section 3.3. Our tests do not support the alternative hypothesis at α = 0.05 for any of the models or languages. Together with the visualizations presented above, these results support a linear linking function between surprisal and reading times.
6 Discussion
6.1 Implications for Psycholinguistic Theories
Throughout the paper, we have noted that the eleven languages studied come from five different language families, but what does this mean in terms of the actual linguistic characteristics they exhibit? At the highest organizational level, our sample includes languages with multiple different word orders and headedness, including SVO (Hebrew, English) and SOV (Korean, Turkish), as well as languages with no dominant word order (German and Greek; Haspelmath et al., 2005). Our sample includes languages with extensive case marking, such as Finnish (15 cases), as well as languages with extremely impoverished case systems, such as English. In terms of word construction, our sample includes both agglutinating (Turkish, Finnish, and Korean) and fusional (Russian, the Romance languages) languages. While this set does not come close to covering all the ways that human languages can vary, we bring up these differences to highlight that it does contain important high-level parametric variations observed in human languages.
In light of this, the stability observed in our results testing the surprisal hypothesis is rather remarkable. Across language families and model types, we observe essentially consistent results, in terms of the predictive power of the models, the effect size associated with surprisal, and the shape of the surprisal–reading-time relationship. Focusing first on predictive power, we find a relatively tight range of Δ values associated with surprisal. For example, for gaze duration and mGPT, all Δ values fall between 0.012 and 0.040. Indeed, across languages and models, we find relatively little variance in the predictive power of surprisal. Turning to the effect size of surprisal, we observe a millisecond-per-bit trade-off that falls between 2 and 4 ms/bit for every language (see Figure 3). The previous estimate of 3.75 ms of slowdown per bit of surprisal reported in Smith and Levy (2013) for English falls well within this range (though note that this previous work used surprisal estimates derived from an n-gram model, which will generally be higher than surprisal estimates derived from large neural language models such as the ones we consider in this study). We take these results to suggest that humans may have stable crosslinguistic preferences for the rate at which they process information during reading, i.e., no more than 4 milliseconds per bit of information. This is consistent with previous work that has observed crosslinguistic consistency in information rate during speech production (Pellegrino et al., 2011; Coupé et al., 2019), as well as trade-offs between the information content of a word and the time taken to produce it (Pimentel et al., 2021).10
One point of difference between these and previous results, however, is the size of the effect of the surprisal of previous words. Looking at gaze duration in the Dundee corpus of English (Kennedy et al., 2003), Smith and Levy (2013) find an effect of previous-word surprisal on reading time that is about as strong as that of the current word. We find much weaker effects in this study, ranging from 0 to 2 ms/bit. Note that this lower effect for previous words is in line with other, strongly incremental processing measures, such as the maze task, where previous-word surprisal has little to no effect on the reading time of the current word (Boyce and Levy, 2020), as well as with the results reported in Pimentel et al. (2023) for eye-tracking over the Provo (Luke and Christianson, 2018) and Dundee corpora.
Turning to the shape of the surprisal–reading-time relationship, our results support the linear link hypothesis and are in line with the comprehensive results recently reported in Shain et al. (2022). Unlike Hoover et al. (2022), we do not observe superlinear surprisal–reading time relationships for larger and more data-intensive language models, or for language models that had access to longer context windows. Interestingly, the one language which visually appears to be superlinear (i.e., has an upward curve in Figure 5) is English. Thus, while we believe Hoover et al. (2022) were right to be concerned about a potential visual nonlinearity in the English relationship, this effect does not appear to exist crosslinguistically and is not borne out by our statistical testing.
Surprisal theory is attractive because it offers a general-purpose link between statistical properties of natural language and human behavior. While its domain generality gives the theory a universal-like flavor, previous literature has (in our opinion) correctly refrained from overtly discussing it as a universal of human language processing. By conducting the most comprehensive crosslinguistic assessment of surprisal theory to date, this study presents initial evidence supporting the universality of surprisal effects in naturalistic reading. That being said, further testing is a necessary next step.
6.2 Implications for Multilingual Language Modeling
As the number of multilingual language models has proliferated, it has become increasingly important to understand how they differ from more traditional, monolingual models. Previous studies have produced mixed results: Some have found that the larger training data scales of multilingual models lead to better performance (Conneau et al., 2020), while others have found advantages for monolingual models (Agerri et al., 2020; Rönnqvist et al., 2019; Virtanen et al., 2019), which are often attributed to monolingual models’ language-specific tokenization and vocabulary representation. The majority of these previous studies have focused on masked language models (mostly using architectures based on BERT) and on evaluation via downstream task performance (Doddapaneni et al., 2021). This study offers a useful complement to previous work by focusing on autoregressive models, as well as on their cognitive modeling capacities.11 Our results are more or less in line with previous studies, insofar as we find no obvious differences between our multilingual model and our monolingual models. Our results thus suggest that, for computational linguists interested in cognitive modeling, multilingual and monolingual language models may be equally viable options. However, we note that we did not compare models in truly low-resource settings, as the training datasets of our smallest monolingual models still included 30 million tokens. It may be the case that, when trained on much smaller datasets, multilingual models benefit from crosslingual transfer.
6.3 Concurrent Work
We want to briefly note the differences between the work presented here and a concurrent study that also used the MECO dataset (i.e., de Varda and Marelli, 2022). While de Varda and Marelli’s research questions are similar to ours, their methods and conclusions are quite different. Instead of an autoregressive language model, they use a masked language model (mBERT; Devlin et al., 2019), which has access to both left and right context. An issue with this strategy is that the surprisal values produced by this setup are not psychologically plausible estimates of actual surprisals, which are estimated from the left context alone.12 This weakens the ability to test psycholinguistic causal claims about the relationship between surprisal and reading times. In their experiments, de Varda and Marelli do not find significant effects of pseudo-surprisal on gaze duration in four of the 12 languages in MECO,13 including English, and find significant effects of pseudo-surprisal on other eye movement measures in even fewer of the languages, which they view as evidence that surprisal might not be a consistent predictor of reading times across languages.14 While we are aligned on the importance of de Varda and Marelli’s research questions, we believe that their failure to replicate surprisal effects for English—or to find them for other languages—reflects limitations of their methodological choices.
6.4 Limitations and Future Directions
Turning back to our own study, there are a few limitations we would like to discuss: Although our sample of languages is much larger than in previous studies, Indo-European languages are still overrepresented. Indeed, each of our non-Indo-European language families is represented by a single language. Additionally, all the data tested here comes from high-resource languages with long traditions of writing systems, and from individuals who live in industrialized societies. Finally, the methodology we employ here requires a large corpus of (written) language on which a language model can be trained. It may be the case that, for much lower-resource languages, there is not enough linguistic data to derive the statistical estimates needed to test surprisal theory in this manner. Thus, while our methods may be able to test the predictions of surprisal theory in lower-resource settings where corpora of a few hundred thousand words exist, they may not be suitable for a large number of the world’s languages. While our results put surprisal theory on firmer empirical footing, testing its predictions beyond these settings is an important and necessary step in assessing the theory’s universality.
7 Conclusion
This paper has presented the most comprehensive crosslinguistic evaluation of surprisal theory reported in the literature to date. Using eye-tracking data from controlled materials in eleven languages across five language families, we have tested three hypotheses: (i) the surprisal hypothesis (surprisal is predictive of reading times), (ii) the contextual entropy hypothesis (contextual entropy is predictive of reading times), and (iii) the linear link hypothesis (the relationship between surprisal and reading times is linear). We found exceptionally strong crosslinguistic stability in our results, with each prediction borne out in every language tested. These results provide the most robust link to date between information-theoretic quantities and incremental processing.
Acknowledgments
We would like to thank our TACL action editor, Maggie Li, as well as our reviewers, whose thoughtful feedback greatly improved this work. T.P. was supported by a Facebook PhD Fellowship. C.M. was supported by the Google PhD Fellowship. E.G.W. was supported by an ETH Zurich Postdoctoral Fellowship. R.P.L. was supported by NSF grant BCS-2121074 and a Newton Brain Science Award.
Notes
Our languages (and families) are: Korean (Koreanic), Turkish (Turkic), Hebrew (Semitic), Finnish (Uralic), Dutch, English, German, Greek, Italian, Russian, and Spanish (Indo-European).
In practice, negative values of Δ are also possible; they indicate overfitting, and imply the same theoretical conclusion as a Δ of 0.
When computing the entropy of Wt+1, it is common to treat Wt = wt as observed.
The dataset also includes Norwegian and Estonian; however, these are not supported by our multilingual language model and are therefore excluded.
Shliazhko et al. (2022) report that their combined dataset contains 489 billion characters. Assuming a crosslinguistic average of ≈ 5 characters per word, this puts their training set at slightly under 100 billion words.
The only exception is English, which at ≈ 2 billion words is about two lifetimes’ worth of linguistic data, assuming the 11-million-word-per-year estimate of Hart and Risley (1995).
Following the same methodology as the previous test, we look at the effect of adding or replacing surprisal across all regressor words.
For language families, Δ and perplexities are within-family averages.
For these analyses we choose to only include surprisal, frequency, and length from wt and wt−1 as predictors. This was done because of the minimal effects found on wt−2 in our analysis of coefficients (see Figure 3). A sample GAM call for this analysis is given in Appendix A.
Our results are not necessarily consistent with a universal channel capacity, or an information rate above which comprehension cannot be sustained. A channel capacity could explain uniform information density effects, or the tendency to spread information out uniformly over a sentence, presumably at or near the channel capacity (Levy and Jaeger, 2006; Frank and Jaeger, 2008; Meister et al., 2021). However, as pointed out in Smith and Levy (2013), such effects require a superlinear surprisal link hypothesis, which we do not observe empirically.
However, see Hollenstein et al. (2021) for a previous investigation of multilingual language models’ ability to predict reading times.
Because the perceptual span is limited to about 14 characters to the right of a fixation (Rayner, 1975) and little linguistic information is gleaned from the far right of the perceptual span (Schotter et al., 2012), upcoming word identities cannot have a substantial causal influence on a word’s first-pass reading behavior (Granger, 1969).
They include Estonian, which we drop as it was not in mGPT’s training data.
Their study does not consider contextual entropy.
References
A Regression Modeling Details
We give more details on the regression formulae used in the various experiments reported in the main text. Our notation is as follows: reading_time is the reading time of the word of interest, i.e., wt; surp is the surprisal of wt; prev_surp is the surprisal of the previous word, i.e., wt−1; and prev2_surp is the surprisal of the word two back, i.e., wt−2. The other variables use the same prev and prev2 prefixes; for brevity, we explain the remaining variable names only for the current index t. Here, ent indicates the contextual entropy of Wt, len indicates the length of wt in characters, and freq indicates the log unigram frequency of wt.
Effect of Surprisal (Section 4.1)
For the tests assessing the effect of surprisal within individual languages, we use the following model:
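A schematic reconstruction of this formula, using the variable names defined above (the exact terms of the original call may differ), is:

```r
# Schematic reconstruction (per-language linear regression).
reading_time ~ surp + prev_surp + prev2_surp +
  len + prev_len + prev2_len +
  freq + prev_freq + prev2_freq
```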
The baseline models are the same with the exception that the surprisal terms are removed. For the aggregate test assessing the effect of surprisal across languages, we use the following model:
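For the aggregate model, a schematic reconstruction in lme4-style syntax, assuming lang as the by-language grouping factor and a fully maximal random-effect structure, is:

```r
# Schematic reconstruction (aggregate model with by-language random effects).
reading_time ~ surp + prev_surp + prev2_surp +
  len + prev_len + prev2_len +
  freq + prev_freq + prev2_freq +
  (1 + surp + prev_surp + prev2_surp +
     len + prev_len + prev2_len +
     freq + prev_freq + prev2_freq | lang)
```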
Effect of Contextual Entropy (Section 4.2)
For both tests, the baseline model included surprisal, length and unigram frequency, i.e., it was the first model given in the paragraph above. For the replace test, the target regression model we use is
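Schematically (with the same caveat as above), surprisal is replaced by contextual entropy:

```r
# Schematic reconstruction (replace test: entropy in place of surprisal).
reading_time ~ ent + prev_ent + prev2_ent +
  len + prev_len + prev2_len +
  freq + prev_freq + prev2_freq
```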
For the add test, the target regression model we use is
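And schematically for the add test, contextual entropy enters alongside surprisal:

```r
# Schematic reconstruction (add test: entropy added alongside surprisal).
reading_time ~ surp + prev_surp + prev2_surp +
  ent + prev_ent + prev2_ent +
  len + prev_len + prev2_len +
  freq + prev_freq + prev2_freq
```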
Surprisal–RT Linking Function (Section 5)
The GAM formula we use for non-linear models is
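A schematic reconstruction in mgcv syntax, restricted to wt and wt−1 predictors as described in the main text, is:

```r
# Schematic reconstruction (non-linear GAM; k = 6 basis functions per smooth).
reading_time ~ s(surp, k = 6) + s(prev_surp, k = 6) +
  te(freq, len) + te(prev_freq, prev_len)
```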
And for linear models:
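Schematically, the linear control replaces the surprisal smooths with ordinary linear terms:

```r
# Schematic reconstruction (linear-surprisal control GAM).
reading_time ~ surp + prev_surp +
  te(freq, len) + te(prev_freq, prev_len)
```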
We now briefly explain the components of these regressions. s() sets up a spline-based smooth term between a predictor and response variable that can take on a wide variety of non-linear functional relationships. Here, k=6 indicates a maximum of 6 basis functions for the smooth. We choose k=6 following the logic from Hoover et al. (2022), Appendix C. Having 6 basis functions allows for five degrees of freedom, which enables the regression to fit non-linear yet still relatively simple curves. The other term, te(), sets up a tensor product smooth term, which can effectively capture non-linear interactions between two variables.
B Surprisal versus RT for wt−1
As mentioned in the main text, previous work has investigated the relationship between surprisal and reading times not just for the current word wt, but also for the previous word, wt−1. Looking at gaze duration in the Dundee corpus of English (Kennedy et al., 2003), Smith and Levy (2013) find an effect of wt−1’s surprisal which is about as strong as the effect of wt’s surprisal on the reading time of wt. In Figure 7 we show this relationship in our corpus for mGPT and monoT(all), using the same methods and presentational paradigm as in Section 5.1.
The results are consistent across models used, and suggest that the relationship between reading time and surprisal of the previous word is somewhat variable across languages. For English, Italian, Korean, Russian, and Spanish we find a relationship that is roughly linear and increasing, i.e., similar to the results for surprisal of the current word. For Dutch, Turkish, and Hebrew, we find a relationship that is roughly increasing, but visually non-linear. For Finnish, German, and Greek, we find either a flat or negative relationship. These results are in line with the effect terms plotted in Figure 3, where we find very weak and sometimes negative coefficients for the wt−1 surprisal term for these languages (i.e., the middle x-tick position in the top row). Overall, these results are consistent with the linear effect that has been previously observed in English. However, they suggest that the impact of the surprisal of the previous word varies between languages.