Abstract
Human translators linger on some words and phrases more than others, and predicting this variation is a step towards explaining the underlying cognitive processes. Using data from the CRITT Translation Process Research Database, we evaluate the extent to which surprisal and attentional features derived from a Neural Machine Translation (NMT) model account for reading and production times of human translators. We find that surprisal and attention are complementary predictors of translation difficulty, and that surprisal derived from an NMT model is the single most successful predictor of production duration. Our analyses draw on data from hundreds of translators operating across 13 language pairs, and represent the most comprehensive investigation of human translation difficulty to date.
1 Introduction
During the Nuremberg trials, a Soviet interpreter paused and lost her train of thought when faced with the challenge of translating the phrase “Trojan Horse politics”, and the presiding judge had to stop the session (Matasov, 2017). Translation difficulty rarely has such extreme consequences, but the process of translating any text requires a human translator to handle words and phrases that vary in difficulty. Translation difficulty can be operationalized in various ways (Sun, 2015), and one approach considers texts to be difficult if they trigger translation errors (Vanroy et al., 2019). Here, however, we focus on difficulty in cognitive processing, and consider a word or phrase to be difficult if it requires extended processing time. Our opening example (i.e., “Trojan Horse politics”) illustrates translation difficulty at the phrase level, and Figure 1 shows how our notion of translation difficulty varies at the level of individual words. Across this sample, words like “societies” and “population” are consistently linked with longer production times than words like “result” and “tend”.
Processing times have been extensively studied by psycholinguists, but the majority of this work is carried out in a monolingual setting. Within the literature on translation, analysis of cognitive processing is most prominent within a small but growing area known as Translation Process Research (TPR). Researchers in this area aim to characterize the cognitive processes and strategies that support human translation, and do so by analyzing eye movements and keystrokes collected from translators (Carl, 2012). Here we build on this tradition and focus on three variables at the word and segment level derived from the CRITT TPR-DB database (Carl et al., 2016b): source reading time (TrtS), target reading time (TrtT), and translation duration (Dur). Our analyses are relatively large in scale by the standards of previous work in this area, and we draw on data from 312 translators working across 13 language pairs.
A central goal of our work is to bring translation process research into contact with modern work on Neural Machine Translation (NMT). Recent work in NLP has led to dramatic improvements in the multilingual abilities of NMT models (Kudugunta et al., 2019; Aharoni et al., 2019), and these models can support tests of existing psycholinguistic theories and inspire new theories. Our work demonstrates the promise of NMT models for TPR research by testing whether surprisal and attentional features derived from an NMT model are predictive of human translation difficulty. Two of these predictors are shown in the right panels of Figure 1, and both are correlated with translation duration for the example sentence shown.
In what follows we introduce the surprisal and attentional features (Sections 3 and 4) that we consider, then evaluate the extent to which they yield improvements over baseline models of translation difficulty. We find that surprisal is a strong predictor of difficulty, which supports and extends previous psycholinguistic findings that surprisal predicts both monolingual processing (Levy, 2008; Wilcox et al., 2023) and translation processing (Teich et al., 2020; Wei, 2022; Carl, 2021a). The attentional features we evaluate predict difficulty less well, but provide supplementary predictive power when combined with surprisal.
2 Related Work
An extensive body of monolingual work has demonstrated that the pace of human reading depends on next-word predictability—more contextually surprising words incur higher cognitive costs and are slower to process (Hale, 2001; Levy, 2008). The phenomenon is observed across a wide range of language usage, including reading (Monsalve et al., 2012; Smith and Levy, 2013; Meister et al., 2021; Shain et al., 2024; Wilcox et al., 2021, 2023), listening and comprehension (Russo et al., 2020, 2022; Kumar et al., 2023; Yu et al., 2023), speech (Jurafsky, 2003; Levy and Jaeger, 2006; Cohen-Priva and Jurafsky, 2008; Demberg et al., 2012; Malisz et al., 2018; Dammalapati et al., 2019; Pimentel et al., 2021), typing (Chen et al., 2021), and code-switching (Calvillo et al., 2020).
Surprisal has also been proposed as a predictor of translation difficulty in Translation Process Research (TPR) (Schaeffer and Carl, 2014; Teich et al., 2020; Carl, 2021a; Wei, 2022; Deilen et al., 2023; Lim et al., 2023), but existing evaluations of this idea are limited in two respects. First, previous measures of translation surprisal are based on overly simplistic probability estimates, which are inferior to modern NMT models. Prior work on TPR mostly relies on probabilities extracted from manual word alignments, and these probabilities are noisy because they are not sensitive to target context and because the corpora are relatively small. Second, relying on manual alignment means that most previous approaches do not scale well and cannot be applied to novel unaligned texts. Lim et al. (2023) address this second limitation using automatic alignment of existing parallel corpora, but their probability estimates are again insensitive to context and therefore less accurate than probabilities derived from NMT models.
Along with surprisal, prior TPR research has also proposed contextual entropy as a predictor of translation difficulty (Teich et al., 2020; Carl, 2021a; Wei, 2022). Entropy is the expected surprisal of a translation distribution, and indicates the effort in resolving the uncertainty over all possible translation choices. The results of Wei (2022) and Carl (2021a) suggest, however, that entropy is weaker than surprisal as a predictor of translation duration. Entropy is also a relatively weak predictor of monolingual reading difficulty, despite being hypothesized to indicate a reader’s anticipation of processing effort (Hale, 2003; Linzen and Jaeger, 2016; Lowder et al., 2018; Wilcox et al., 2023).
In the broader NLP literature, several lines of work suggest that attentional measures derived from language models and NMT models can capture processing difficulty. Attention weights in language models (Li et al., 2024) and NMT models (Ferrando and Costa-jussà, 2021) are known to contribute to next-token prediction, and previous studies have explored whether attentional weights in language models can account for monolingual processing time (Ryu and Lewis, 2021; Oh and Schuler, 2022). For translation, NMT studies suggest that the contextualization of ambiguous tokens occurs in encoder self-attention (Tang et al., 2019; Yin et al., 2021), and NMT models show higher cross-attention entropy with increasingly difficult tokens (Dabre and Fujita, 2019; Zhang and Feng, 2021; Lu et al., 2021). NMT attentional weights also reflect whether a figurative expression is handled by paraphrasing as opposed to literal translation (Dankers et al., 2022). All of these results suggest that attentional weights are worth exploring as predictors of human translation difficulty.
In the TPR literature, prior work has explored whether errors made by MT models tend to predict translation difficulty and post-editing difficulty for humans (Carl and Báez, 2019). To our knowledge, however, previous work has not used probabilities and attentional measures derived from NMT models as predictors of human translation difficulty. Building on previous work, we focus on surprisal rather than entropy, and provide a comprehensive evaluation of the extent to which surprisal predicts translation difficulty. We also propose several attentional features based on previous literature and test whether they contribute predictive power that goes beyond surprisal alone.1
3 Surprisal: An Information-theoretic Account of Translation Processing
In contrast with traditional monolingual surprisal, translation surprisal is based on a context that includes a complete sequence in a source language (SL) and a sequence of previously translated words in a target language (TL). Target words with high translation surprisal are hypothesized to require extended cognitive processing because they are relatively unpredictable in context.
Relative to prior work, we aim to provide a more comprehensive and rigorous evaluation of the role of surprisal in translation processing. We consider predictors of translation difficulty based on both monolingual and translation surprisal, which we estimate using a large language model and a neural translation model respectively.
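Both notions of surprisal reduce to the same computation: the negative log probability of a token under some conditional distribution, with segment-level surprisal summing over the tokens of a segment. The sketch below illustrates this with a toy next-token distribution; the token names and probabilities are invented for illustration, whereas in our analyses these probabilities come from the softmax outputs of a language model or an NMT model.

```python
import math

def surprisal(prob: float) -> float:
    """Surprisal in bits: s = -log2 p."""
    return -math.log2(prob)

# Toy conditional distribution p(y_j | x, y_<j), for illustration only;
# in practice these probabilities are read off a model's softmax layer.
p_next = {"cheval": 0.6, "politique": 0.3, "Trojan": 0.1}

s_cheval = surprisal(p_next["cheval"])  # predictable token: low surprisal
s_trojan = surprisal(p_next["Trojan"])  # unpredictable token: high surprisal

# Segment-level surprisal sums over the tokens belonging to the segment.
segment = ["cheval", "politique"]
s_segment = sum(surprisal(t) for t in (p_next[t] for t in segment))
```

Unpredictable tokens receive higher surprisal, which on our account corresponds to longer processing time.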
3.1 Monolingual Surprisal
Given the well-established link between monolingual surprisal (slm) and reading time (Monsalve et al., 2012; Smith and Levy, 2013; Shain et al., 2024; Wilcox et al., 2021), we test whether slm predicts reading times for source and target texts when participants are engaged in the act of translation. Previous work also establishes links between surprisal and language production in tasks involving speech (Dammalapati et al., 2021), oral reading (Klebanov et al., 2023), code-switching (Calvillo et al., 2020), and typing (Chen et al., 2021). Building on this literature, we also test whether monolingual surprisal predicts the time participants take to type their translations.
3.2 Translation Surprisal
4 Predicting Translation Difficulty using NMT Attention
Modern encoder-decoder NMT models rely on the transformer architecture and incorporate three kinds of attention: encoder self-attention, cross-attention, and decoder self-attention (Vaswani et al., 2017).3 We consider all three sets of attention weights as potential predictors of translation difficulty. By some accounts, reading, transferring, and writing are three distinct stages in the human translation process (Shreve et al., 1993; Macizo and Bajo, 2004, 2006; Shreve and Lacruz, 2017), and we propose that the three attentional modules roughly align with these three stages.
Let x = [x1,…, xm] and y = [y1,…, yn] denote parallel source and target sequences, with m = {1,…, m} and n = {1,…, n} as their sets of token indices. Let u and v denote segments of x and y respectively, such that u = xi and v = yj, where the index sets i ⊆ m and j ⊆ n. Note that x and y do not include special tokens (e.g., the end-of-sequence tag eos) added to the sequences under the NMT’s tokenization scheme. We additionally define x∖i and y∖j as the contexts from which the corresponding segments are excluded, i.e., x∖i = [xk]k ∈ m∖i and y∖j = [yk]k < min(j). In contrast to x∖i, y∖j only includes token indices from the context preceding v. We define y∖j in this way because the decoder typically does not have access to future tokens when generating translations.
Using Equations 3 and 4, we define the following six attentional features as candidate predictors of source-side reading time, and five features for target-side reading time and production duration. For simplicity, we obtain the final attentional features in our analysis by averaging values computed for each attention head across layers.
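To illustrate how features of this kind are read off an attention matrix, the sketch below computes attention mass and attention entropy for one source token, averaging over heads and layers first as described above. The attention rows are invented for illustration (a 2-layer, 2-head toy over positions [token, context, context, eos]); real values come from the NMT model.

```python
import math

def attention_mass(attn_row, idx):
    """Sum of attention weights from one token to a set of positions."""
    return sum(attn_row[k] for k in idx)

def attention_entropy(attn_row):
    """Shannon entropy (nats) of a row-normalized attention distribution."""
    return -sum(a * math.log(a) for a in attn_row if a > 0)

# Hypothetical attention rows for one source token, one row per (layer, head).
rows = [
    [0.50, 0.20, 0.20, 0.10],  # layer 0, head 0
    [0.40, 0.30, 0.20, 0.10],  # layer 0, head 1
    [0.60, 0.10, 0.20, 0.10],  # layer 1, head 0
    [0.30, 0.30, 0.30, 0.10],  # layer 1, head 1
]

# Average across heads and layers, then read off the features.
avg = [sum(r[k] for r in rows) / len(rows) for k in range(4)]

self_mass = attention_mass(avg, [0])        # attention the token keeps for itself
context_mass = attention_mass(avg, [1, 2])  # attention directed to the context
eos_mass = attention_mass(avg, [3])         # attention directed to eos
entropy = attention_entropy(avg)
```

Because each row is normalized, the three masses partition the token's attention budget, so more attention to self necessarily means less to context and eos.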
4.1 Predicting Source Text Difficulty
Encoder Attention.
We propose four features extracted from encoder attention that are inspired by Dankers et al. (2022), who find that when producing a paraphrase rather than a literal translation of a figurative expression, an NMT encoder tends to allocate more attention from the phrase to itself, while reducing attention directed to and received from the context. This result is relevant to our goal of predicting translation difficulty because paraphrasing is known to be more effortful than literal translation (Balling et al., 2014; Schaeffer and Carl, 2014; Rojo, 2015; Carl and Schaeffer, 2017).
Cross-attention.
Cross-attention allows information to pass from encoder to decoder and establishes rough alignments between input and output tokens in an NMT model (Alkhouli et al., 2018; Li et al., 2019). However, it is unclear from previous work whether more attention weight received from the target sequence contributes to harder or easier translation. Tu et al. (2016) and Mi et al. (2016) show that increased attention received by part of the source text is related to over-translation, a phenomenon in which the model focuses too much on some parts of the input and neglects others when generating a translation. In contrast, Dankers et al. (2022) demonstrate that paraphrasing a figurative expression instead of literal translation reduces attention to corresponding source tokens.
4.2 Predicting Target Text Difficulty
Cross-attention.
Decoder Attention.
While previous work on NMT models has considered encoder and cross-attention in depth, decoder self-attention has received less investigation. Yang et al. (2020) demonstrate that the role of decoder self-attention is to ensure translation fluency. Relative to encoder attention and cross attention, decoder attention aligns less well with human annotations, and contributes less to improving NMT performance when regularized with human annotations (Yin et al., 2021).
5 Data
Our empirical measures of translation difficulty are derived from the CRITT Translation Process Research Database (TPR-DB) (Carl et al., 2016b). We focus on three behavioral measures: source text reading time (TrtS) is the sum of all fixation durations on a given source segment during a session; target text reading time (TrtT) is the sum of all fixation durations on a target segment; and translation production duration (Dur) is the time taken to produce a segment. Both reading time measures are based on eye-tracking data, and the translation production measure is based on keylogging data. The data set is organized in terms of words, segments, and sentences, and we carry out separate analyses at the word and segment levels.
In CRITT TPR-DB, word and segment boundaries and alignments are provided by human annotators. For consistency, we remove alignments of words and segments that cross sentence boundaries. Following Carl (2021b), we filter values of TrtS, TrtT, and Dur lower than 20ms. The remaining values are log scaled. We analyze data from 17 studies available from the public database.5 These studies represent 13 different language pairs, and each study includes data from an average of 18 human translators. The studies included along with the size of each one are summarized in Table 1. For cross-validation, we divide the samples into 10 folds. To ensure that all predictions are evaluated using previously unseen sentences, we randomly sample test data at the sentence level, which means that the source sentences in train and test partitions do not overlap.
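The preprocessing steps above (the 20 ms filter, log scaling, and sentence-level fold assignment) can be sketched as follows; the function names are ours, for illustration only.

```python
import math
import random

MIN_MS = 20  # durations below 20 ms are treated as noise, following Carl (2021b)

def preprocess(durations_ms):
    """Drop durations below 20 ms, then log-scale the remainder."""
    return [math.log(d) for d in durations_ms if d >= MIN_MS]

def sentence_folds(sentence_ids, k=10, seed=0):
    """Assign whole sentences to k cross-validation folds, so that
    train and test partitions never share a source sentence."""
    ids = sorted(set(sentence_ids))
    random.Random(seed).shuffle(ids)
    return [ids[i::k] for i in range(k)]
```

Sampling folds at the sentence level (rather than at the word or segment level) is what guarantees that every test prediction is made on a previously unseen sentence.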
| Lang. pair | Study | TrtS (token) | TrtT (token) | Dur (token) | TrtS (segment) | TrtT (segment) | Dur (segment) |
|---|---|---|---|---|---|---|---|
| en → da | ACS08 (Sjørup, 2013), BD13, BD08 (Dragsted, 2010) | 5305 | 5320 | 6176 | 4121 | 4203 | 4779 |
| en → de | SG12 (Nitzke, 2019) | 3691 | 3956 | 4534 | 2991 | 3243 | 3589 |
| en → es | BML12 (Mesa-Lao, 2014) | 4067 | 4020 | 8280 | 3555 | 3330 | 6072 |
| en → hi | NJ12 (Carl et al., 2016b) | 4717 | 4828 | 5205 | 2917 | 2851 | 2933 |
| en → ja | ENJA15 (Carl et al., 2016a) | 6806 | 8329 | 2168 | 4263 | 4299 | 2130 |
| en → nl | ENDU20 (Vanroy, 2021) | 0 | 0 | 7814 | 0 | 0 | 6318 |
| en → pt | JLG10 (Alves and Gonçalves, 2013) | 0 | 0 | 2443 | 0 | 0 | 2217 |
| en → zh | RUC17, STC17 (Carl and Báez, 2019), CREATIVE (Vieira et al., 2023) | 8949 | 7934 | 3876 | 6097 | 5922 | 3925 |
| da → en | LWB09 (Jensen et al., 2009) | 3844 | 4177 | 5327 | 3445 | 3493 | 4315 |
| fr → pl | DG01 (Płońska, 2016) | 0 | 0 | 17041 | 0 | 0 | 13283 |
| pt → en | JLG10 | 0 | 0 | 2053 | 0 | 0 | 1876 |
| pt → zh | MS13 (Schmaltz et al., 2016) | 1011 | 830 | 203 | 781 | 755 | 237 |
| zh → pt | MS13 | 1210 | 1237 | 1509 | 1101 | 1027 | 1209 |
6 Models and Methods
LM and NMT Models.
We follow Wilcox et al. (2023) and use mGPT (Shliazhko et al., 2024), a multilingual language model, to estimate monolingual surprisal (slm).6 To compute translation surprisal (smt), we use NLLB-200’s 600M variant, a multi-way multilingual translation model that is distilled from a much larger 54.5B Mixture-of-Experts model (Costa-jussà et al., 2022).7 Among publicly available NMT models, NLLB-200 is a standard benchmark and achieves state-of-the-art results across many language pairs (Moslem et al., 2023; Seamless Communication et al., 2023). We compute attentional features for each of 16 heads across 12 layers, then average across heads and layers to create the final set of attentional features for our analyses.
Normalization.
Sections 3 and 4 describe feature definitions that are sums over a sequence of tokens, which makes it crucial to control for segment length when predicting reading time and production duration. All surprisal values are therefore normalized by the lengths of the input segments, wi and yj. To normalize attentional features, we first calculate dummy feature values by replacing alk in Equations 3 and 4 with uniform attention values (i.e., alk = 1/|k|, where |k| is the length of the attention vector). A normalized attentional feature is defined as the ratio of the raw feature value (defined in Section 4) to its dummy value.
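The normalization scheme can be sketched as follows, using a single attention row and a feature defined as a sum of weights over an index set; the helper names are illustrative.

```python
def raw_feature(attn_row, idx):
    """Attentional feature as a sum of attention weights over an index set."""
    return sum(attn_row[k] for k in idx)

def normalized_feature(attn_row, idx):
    """Raw feature divided by its dummy value, i.e., its value under
    uniform attention where every weight is 1/len(attn_row)."""
    uniform = [1.0 / len(attn_row)] * len(attn_row)
    return raw_feature(attn_row, idx) / raw_feature(uniform, idx)
```

A normalized value of 1.0 means the segment receives exactly the attention expected under uniform weighting; values above 1.0 indicate disproportionate attention.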
Control Features.
Although we are most interested in surprisal and attentional features as predictors of translation difficulty, other simple features may also predict difficulty. In particular, longer segments, low-frequency segments, and segments towards the beginning of a sentence might be systematically more difficult than shorter segments, high-frequency segments, and segments towards the end of a sentence. We therefore include segment length, average unigram frequency8 (log scaled) and average position quantile as control features in all models, where both averages are computed over all tokens belonging to a segment.
Linear Models.
Following previous studies, we use linear models to evaluate the predictive power of both surprisal and attentional features. To predict translation difficulty for all languages, we use a mixed model that includes language pair and participant id as random effects. In addition, we fit individual linear regression models for the four language pairs (en → da, en → de, en → hi, and en → zh) for which we have the most data.
7 Results
Δllh as a Measure of Predictive Power.
Prior work on translation difficulty rarely uses held-out evaluation, but we follow previous psycholinguistic studies (Goodkind and Bicknell, 2018; Kuribayashi et al., 2021; Wilcox et al., 2020, 2023; De Varda and Marelli, 2023) and evaluate our models using log-likelihood of held-out data. To assess the predictive power of a feature, we train a mixed model with the feature of interest in addition to all control features, and compare against a baseline model which includes only the control features. The contribution of the predictor feature is then measured as the difference in log-likelihood of the held-out test data (Δllh) between the two models. A positive Δllh indicates added predictive power from the feature relative to the baseline model, whereas Δllh ≤ 0 means that we have no evidence for the effect of the feature on reading and production times.9 Like Wilcox et al. (2023), we test if Δllh > 0 is significant across held-out samples using a paired permutation test based on 1000 random permutations.
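The paired permutation test can be sketched as follows: per-sample differences in held-out log-likelihood are randomly sign-flipped to build a null distribution for the mean. This is a minimal illustration of the procedure, not the exact test script used in our analyses.

```python
import random

def paired_permutation_test(deltas, n_perm=1000, seed=0):
    """One-sided test of mean(deltas) > 0 via random sign flips.

    `deltas` holds per-sample differences in held-out log-likelihood
    between the feature model and the baseline model.
    """
    rng = random.Random(seed)
    observed = sum(deltas) / len(deltas)
    count = 0
    for _ in range(n_perm):
        # Under the null, each pairing is exchangeable, so flip signs at random.
        flipped = [d if rng.random() < 0.5 else -d for d in deltas]
        if sum(flipped) / len(flipped) >= observed:
            count += 1
    # Add-one smoothing so the p-value is never exactly zero.
    return (count + 1) / (n_perm + 1)
```

A small p-value indicates that the observed improvement in held-out log-likelihood is unlikely to arise from sign-symmetric noise alone.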
7.1 Surprisal and Attentional Features
Figure 2 shows Δllh of surprisal predictors and attentional features at both word and segment levels. Data points shown in red indicate predictors that are statistically significant (p < .05) relative to the baseline model. When fitted on all language pairs, slm (Eq. 1) is a significant predictor of source text reading time and target production duration, but not target reading time. On the target side, smt (Eq. 2) is the best overall predictor of difficulty. Target reading time has fewer significant predictors than does target production duration, and therefore appears to be harder to predict. From here on, we restrict our analyses to features that are significant at at least one of the two segmentation levels.
Figure 3 shows analogous results for four individual language pairs. Surprisal features slm and smt remain strong predictors for reading time and target production duration in general. Among the attentional features, those of Eq. 5 and Hu, xe (Eq. 8) most consistently predict source reading time across language pairs and segmentation levels. However, the predictions of target reading time and duration by attentional features are less consistent—despite predicting en → da target difficulty, most attentional features fail to contribute in other language pairs. One possible reason is that these features overfit to small samples of individual language pairs, compared to a mixed model that is fitted on a much larger data set including all language pairs.
7.2 Attention is Supplementary to slm and smt
Our results so far confirm that both slm and smt individually predict translation difficulty, whereas attentional features on their own are less consistent. We next ask if the attentional features that proved significant in Section 7.1 provide supplementary predictive power when combined with slm and smt. To predict source reading time, we train models that include control features, slm and one attentional feature. For target difficulty, the models are trained with control features, smt and an attentional feature. We then calculate two variants of Δllh for these models; the first compares against the baseline model, and the second compares against a model that is trained on control features and either slm or smt.
We repeat the same significance tests as before, and the results are shown in Figure 4. For the entire data set (top row of Figure 4), models with the addition of individual attentional features predict translation difficulty better than those trained with surprisal and control features only (except Hv, xc, Eq. 11). Again, however, these results are weaker for individual language pairs.
7.3 Predictor Coefficients
Thus far we have only demonstrated the predictive power of surprisal and attentional features. To enable conclusions about the nature of the relationship between individual features and translation difficulty, Figure 5 shows average mixed model coefficients over data folds. The coefficients plotted support conclusions about the direction and effect size of the relationship between each predictor and translation difficulty, but the bar heights may not reflect predictive power, which has been indicated previously by Δllh.10 In general, segments are more difficult when they are longer and occur earlier in the sequence. Rare words in general take longer to read and produce, as our baseline models consistently converge to negative coefficients for frequency (not shown in the figure). However, with the addition of surprisal and attentional features, the frequency effect for rare source words is reversed, whereas rare targets still require more attention and take longer to produce. As expected, increases in slm and smt are associated with increased reading time and production duration.
On the source side, the coefficients related to encoder self-attention indicate that harder-to-translate source texts direct less attention to context (Eq. 5) and more to eos (fu,eose, Eq. 6), which reduces their entropy (Hu, xe, Eq. 8). Difficult source words are also singled out as important by having more incoming cross-attention from the target sequence (fy, uc, Eq. 9).
On the target side, harder translations tend to show slight increases in cross-attention to source eos (fv,eosc, Eq. 10), and show more diffuse attention across the source sequence (Hv, xc, Eq. 11).11 Our results thus support Dankers et al.’s (2022) claim that paraphrases show increased attention to eos and take longer to produce than literal translations.
Figures 5b and 5c also suggest that harder translations have more informative decoder attention (Eq. 14), and direct more attention to themselves (fv, vd, Eq. 12) and the context (Eq. 13). These results imply reduced attention to bos, the initial token of a translation sequence that conveys the target language to the NLLB model.
8 Discussion
Section 7.1 showed that monolingual surprisal predicts source reading time, but that translation surprisal is a more consistent predictor of target reading time and production duration. On its own, NMT attention also predicts translation difficulty to some degree, but the most accurate predictions are achieved by combining surprisal and attentional features.
8.1 Psycholinguistic Implications
Our results support previous findings that surprisal predicts translation difficulty (Wei, 2022; Teich et al., 2020; Carl, 2021a). Surprisal has several justifications as a cognitive difficulty metric (Levy, 2013; Futrell and Hahn, 2022), and one approach interprets surprisal as a measure of a shift in cognitive resource allocation. On this account, higher translation surprisal indicates that more effort is needed to shift cognitive resources to the word ultimately selected (Wei, 2022).
Teich et al. (2020) suggest that translators aim for translations that are both faithful to the source (pmt(t|s) is high) and fluent (plm(t) is high). These goals do not always align, and correspond to two different translation strategies: literal translation optimizes MT probability at the expense of LM probability, whereas figurative translation prioritizes the latter. Although increases in slm and smt both predict increased target difficulty, our data reveal that these predictors have a weak but significant negative correlation (p < .001) at both token (ρ = −.053) and segment (ρ = −.079) levels. We therefore find quantitative support for a trade-off between fidelity and fluency (Müller et al., 2020; Lim et al., 2024).
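Spearman's ρ reported for this correlation can be computed by ranking both variables (averaging ranks over ties) and taking the Pearson correlation of the ranks; a minimal sketch:

```python
def rank(values):
    """Average 1-based ranks, with tied values sharing the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = rank(x), rank(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

A negative ρ between slm and smt, as reported above, means that segments ranked as fluent by the LM tend to be ranked as less faithful by the NMT model, and vice versa.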
8.2 Translatability by Parts of Speech
To gain more insight into the aspects of translation difficulty captured by surprisal and attention, we analyzed human difficulty and model predictions for different parts of speech. All results that follow are based on a subset of studies that use the same source texts, multiLing, a small sample of news articles and sociological texts in English.12 We break down the difficulty of these English words by their part-of-speech (POS) tags, which are available in the corpus.13
Figure 6a shows reading time, slm and attentional features of words grouped by their POS tags. Compared to function words, open-class words, such as proper nouns, nouns and adjectives, are the most difficult to translate and have higher surprisal. These words also direct more attention to eos and less to the context, and attract more cross-attention from the translated sequence.
For target difficulty, the distinction between open-class and function words is also evident in Figure 6b. For each source word, translation duration is defined as the duration of the target segment aligned with the source word, divided by the number of alignments between the target segment and the source sentence. Translations of coordinating conjunctions and punctuation stand out as among the easiest for humans, but are surprising for the LM and difficult for the NMT model. One possible reason is that conjunctions can be cross-linguistically ambiguous (Li et al., 2014; Gromann and Declerck, 2014; Novák and Novák, 2022). For example, English “but” and “and” have been shown to affect NMT fluency (Popović and Castilho, 2019; Popović, 2019). For punctuation, He et al. (2019) demonstrate that the importance of these tokens in NMT can vary across language pairs; for example, translation to Japanese often relies on punctuation to demarcate coherent groups, which is useful for syntactic reordering.
9 Conclusion
Our results support the prevailing view that current NLP models, including LMs and NMT models, align partially with human language usage and are predictive of language processing complexity. We evaluated surprisal and NMT attention as predictors of human translation difficulty, and found that both factors predict reading times and production duration. Previous work provides some evidence that surprisal and NMT attention capture important aspects of translation difficulty, and our work strengthens this conclusion by estimating surprisal based on state-of-the-art models and analyzing data based on 13 language pairs and hundreds of human translators.
Although the attentional features we consider are empirically successful and grounded in prior literature, they are not without limitations. These features are relatively simple and combining attention weights in more sophisticated ways may allow stronger predictions of human translation difficulty. A more theoretically motivated approach that builds on recent studies of the interpretability of attention distributions (Vashishth et al., 2019; Zhang et al., 2021; Madsen et al., 2022) is worth exploring to develop more fine-grained predictors of translation processing.
To work with as much data as possible, we focused primarily on analyses that combine data from all 13 language pairs, but analyzing translation challenges in individual language pairs is a high priority for future work. A possible next step is an analysis exploring whether the predictors considered here are sensitive to constructions in specific languages that are known sources of processing difficulty (Campbell, 1999; Vanroy, 2021). Factors such as surprisal and attentional flow are appealing in part because their generality makes them broadly applicable across languages, but understanding the idiosyncratic ways in which each pair of languages poses translation challenges is equally important.
Acknowledgments
We thank the action editor and reviewers for thoughtful feedback that improved this work. This project was supported by ARC FT190100200.
Notes
Code available at https://github.com/ZhengWeiLim/pred-trans-difficulty-NMT.
wi is also defined in a way that allows difficulty prediction of non-contiguous segments.
Cross-attention is sometimes known as encoder-decoder attention.
ℓ1 normalization.
The translation studies selected from the TPR database exclude data sets where many alignments cross sentence boundaries, or that contain too many errors (e.g., missing values and inconsistent sentence segmentations) across tables.
Wilcox et al. (2023) point out that Δllh may be ≤ 0 because of overfitting, or because the relationship between the predictor feature and the target variable is not adequately captured by the model class used (in our case, linear models).
Mean coefficients of fv,eosc for token and segment are .001 and .002, respectively.
Studies included from multiLing corpus are RUC17, ENJA15, NJ12, STC17, SG12, ENDU20 and BML12.
POS tags are predictions of NLTK tagger converted to universal POS tags.
References
Author notes
Now at Google.
Action Editors: Liang Huang and Stefan Riezler