While natural languages differ widely in both canonical word order and word order flexibility, their word orders still follow shared cross-linguistic statistical patterns, often attributed to functional pressures. In the effort to identify these pressures, prior work has compared real and counterfactual word orders. Yet one functional pressure has been overlooked in such investigations: The uniform information density (UID) hypothesis, which holds that information should be spread evenly throughout an utterance. Here, we ask whether a pressure for UID may have influenced word order patterns cross-linguistically. To this end, we use computational models to test whether real orders lead to greater information uniformity than counterfactual orders. In our empirical study of 10 typologically diverse languages, we find that: (i) among SVO languages, real word orders consistently have greater uniformity than reverse word orders, and (ii) only linguistically implausible counterfactual orders consistently exceed the uniformity of real orders. These findings are compatible with a pressure for information uniformity in the development and usage of natural languages.
Human languages differ widely in many respects, yet there are patterns that appear to hold consistently across languages. Identifying explanations for these patterns is a fundamental goal of linguistic typology. Furthermore, such explanations may shed light on the cognitive pressures underlying and shaping human communication.
This work studies the uniform information density (UID) hypothesis as an explanatory principle for word order patterns (Fenk and Fenk, 1980; Genzel and Charniak, 2002; Aylett and Turk, 2004; Jaeger, 2010; Meister et al., 2021). The UID hypothesis posits a communicative pressure to avoid spikes in information within an utterance, thereby keeping the information profile of an utterance relatively close to uniform over time. While the UID hypothesis has been proposed as an explanatory principle for a range of linguistic phenomena, e.g., speakers’ choices when faced with lexical and syntactic alternations (Levy and Jaeger, 2006), its relationship to word order patterns has received limited attention, with the notable exception of Maurits et al. (2010).
Our work investigates the relationship between UID and word order patterns, differing from prior work in several ways. We (i) use Transformer language models (LMs) (Vaswani et al., 2017) to estimate information-theoretic operationalizations of information uniformity; (ii) analyze large-scale naturalistic datasets of 10 typologically diverse languages; and (iii) compare a range of theoretically motivated counterfactual grammar variants.
Experimentally, we find that among SVO languages, the real word order has a more uniform information density than nearly all counterfactual word orders; the only orders that consistently exceed real orders in uniformity are generated using an implausibly strong bias for uniformity, at the cost of expressivity. Further, we find that counterfactual word orders that place verbs before objects are more uniform than ones that place objects before verbs in nearly every language.
Our findings suggest that a tendency for uniform information density may exist in human language, with two potential sources: (i) word order rules, with SVO order generally being more uniform than SOV; and (ii) choices made by speakers, who use the flexibility present in real languages to structure information more uniformly at a global level (and not only in a small number of isolated constructions).
2 Functional Pressures in Language
2.1 Linguistic Optimizations
A number of linguistic theories link cross-linguistic patterns to functional pressures. For example, both the grammatical rules of a language and speakers’ choices (within the space of grammatically acceptable utterances) are posited to reflect a trade-off between effort and robustness: Shorter and simpler structures are easier to produce and comprehend, but longer and more complex utterances can encode more information (Gabelentz, 1901; Zipf, 1935; Hawkins, 1994, 2004, 2014; Haspelmath, 2008). Another such functional pressure follows from the principle of dependency length minimization (DLM), which holds that, in order to minimize working memory load during comprehension, word orders should place words in direct dependency relations close to each other (Rijkhoff, 1986, 1990; Hawkins, 1990, 1994, 2004, 2014; Grodner and Gibson, 2005; Gibson, 1998, 2000; Bartek et al., 2011; Temperley and Gildea, 2018; Futrell et al., 2020). A growing body of work has turned to information theory, the mathematical theory of communication (Shannon, 1948), to formalize principles that explain linguistic phenomena (Jaeger and Tily, 2011; Gibson et al., 2019; Pimentel et al., 2021c). One such principle is that of uniform information density.
2.2 Uniform Information Density
According to the UID hypothesis, speakers tend to spread information evenly throughout an utterance; large fluctuations in the per-unit information content of an utterance can impede communication by increasing the processing load on the listener. Speakers may modulate the information profile of an utterance by selectively producing linguistic units such as optional complementizers in English (Levy and Jaeger, 2006; Jaeger, 2010). A pressure for UID in speaker choices has also been studied in specific constructions in other languages, though with mixed conclusions (Zhan and Levy, 2018; Clark et al., 2022).
Formally, the information conveyed by a linguistic signal $y$, e.g., an utterance or piece of text, is quantified in terms of its surprisal $s(\cdot)$, which is defined as $y$’s negative log-probability: $s(y) = -\log p_\ell(y)$. Here, $p_\ell$ is the underlying probability distribution over sentences $y$ for a language $\ell$. Note that we do not have access to the true distribution $p_\ell$, and typically rely on a language model with learned parameters $\theta$ to estimate surprisal values with a second distribution $p_\theta$.
Surprisal can be additively decomposed over the units that comprise a signal. Explicitly, for a signal $y$ that can be expressed as a series of linguistic units $\langle y_1, \ldots, y_N \rangle$, where $y_n \in \mathcal{V}$ and $\mathcal{V}$ is a set vocabulary of words or morphemes, the surprisal of a unit $y_n$ is its negative log-probability given prior context: $s(y_n) = -\log p_\ell(y_n \mid y_{<n})$. Note that the distribution $p_\ell(\cdot \mid y_{<n})$ has support $\mathcal{V} \cup \{\mathrm{eos}\}$, where eos is a designated symbol indicating the end of a sequence; a valid, complete signal has $y_N = \mathrm{eos}$. The quantity $s(y)$ can thus likewise be expressed as $s(y) = \sum_{n=1}^{N} s(y_n)$. Assuming that we have a fixed amount of information to convey and that high-surprisal items are disproportionately difficult to process, it can be shown mathematically that spreading information evenly throughout a signal optimizes ease of processing for the comprehender (Levy and Jaeger, 2006; Smith and Levy, 2013; Levy, 2018; Meister et al., 2021).
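The additive decomposition can be checked numerically. The following is a minimal sketch, assuming base-2 logarithms (surprisal in bits) and a hypothetical three-unit signal whose conditional probabilities are illustrative values rather than estimates from any model.

```python
import math

def surprisal(p):
    """Surprisal in bits of an event with probability p (base 2 is an assumption)."""
    return -math.log2(p)

# Hypothetical conditional probabilities p(y_n | y_<n) for a signal of three
# units, the last of which is eos. Illustrative values, not estimated from data.
conditional_probs = [0.5, 0.25, 0.5]

# Per-unit surprisals s(y_n).
unit_surprisals = [surprisal(p) for p in conditional_probs]

# By the chain rule, p(y) is the product of the conditional probabilities,
# so s(y) computed from the joint equals the sum of the per-unit surprisals.
joint_prob = math.prod(conditional_probs)
total_from_joint = surprisal(joint_prob)
total_from_units = sum(unit_surprisals)
```

Both totals describe the same quantity $s(y)$; only the per-unit values show how that information is distributed over the signal.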
While the UID hypothesis is often discussed in the context of speaker choices, it has also been presented as a general cognitive constraint that might influence reading times (Meister et al., 2021), speech duration (Pimentel et al., 2021b), and word lengths (Piantadosi et al., 2011). Selection for UID has also been discussed as a potential evolutionary pressure on language that can explain typological differences (Jaeger and Tily, 2011). Within this literature, there is not a consensus on how to formally operationalize UID. For example, Frank and Jaeger (2008) measure regression of surprisal towards a language-wide mean; Collins (2014) and Bloem (2016) consider more local changes in surprisal in their quantification of UID.
3 Counterfactual Language Paradigm
Following prior work that has used counterfactual languages to study the functional pressures at play in word order patterns, we investigate to what degree a language’s word order shows signs of optimization for UID. In this approach, a corpus of natural language is compared against a counterfactual corpus containing minimally changed versions of the same sentences, where the changes target an attribute of interest, e.g., the language’s word order. For example, several studies of DLM have compared syntactic dependency lengths in real and counterfactual corpora, generated by permuting the sentences’ word order either randomly (Ferrer-i-Cancho, 2004; Liu, 2008) or deterministically by applying a counterfactual grammar (Gildea and Temperley, 2010; Gildea and Jaeger, 2015; Futrell et al., 2015b, 2020). Similarly, we will compare measures of UID in real and counterfactual corpora to investigate whether real languages’ word orders exhibit more uniform information density than alternative realizations.
3.1 Formal Definition
We build on the counterfactual generation procedure introduced by Hahn et al. (2020) to create parallel corpora. This procedure operates on sentences’ dependency parses. Formally, a dependency parse of a sentence y is a directed tree with one node for every word, where each word in y, with the exception of a designated root word, is the child of its (unique) syntactic head; see Zmigrod et al. (2020) for a discussion of the role of the root constraint in dependency tree annotation. Each edge in the tree is annotated with the syntactic relationship between the words connected by that edge; see Figure 1 for an example. Here we use the set of dependency relations defined by the Universal Dependencies (UD) paradigm (de Marneffe et al., 2021), though we follow Hahn et al. (2020) in transforming dependency trees such that function words are treated as heads, leading to representations closer to those of standard syntactic theories; see also Gerdes et al. (2018).
While syntactic relationships are naturally described hierarchically, sentences are produced and processed as linear strings of words. Importantly, there are many ways to linearize the nodes of a dependency parse into a string y. Concretely, a grammar under our formalism is defined by an ordering function (see Kuhlmann, 2010) g(·,·) which takes as arguments a dependency parse and a specific node in it, and returns an ordering of the node and its dependents. For each node, its dependents are arranged from left to right according to this ordering; any node without dependents is trivially an ordered set on its own. This process proceeds recursively to arrive at a final ordering of all nodes in a dependency tree, yielding the final string y. Pseudo-code for the linearization of a tree based on an ordering function g is given in Figure 2.
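The recursive procedure described above can be sketched in Python as follows. This is an illustration under stated assumptions, not a reproduction of Figure 2: the `Node` class, the function names, and the two example ordering functions are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Node:
    """Hypothetical tree node: a word, its relation to its head, its dependents."""
    word: str
    relation: str = "root"
    children: List["Node"] = field(default_factory=list)

# An ordering function g(tree, node) returns the node and its direct
# dependents as one ordered list.
OrderingFn = Callable[[Node, Node], List[Node]]

def linearize(tree: Node, node: Node, g: OrderingFn) -> List[str]:
    """Recursively linearize the subtree rooted at `node` according to g."""
    words: List[str] = []
    for item in g(tree, node):
        if item is node:
            words.append(node.word)                  # the head itself
        else:
            words.extend(linearize(tree, item, g))   # a dependent's whole subtree
    return words

# Two toy ordering functions (assumptions for illustration only).
def head_final(tree: Node, node: Node) -> List[Node]:
    return node.children + [node]

def head_initial(tree: Node, node: Node) -> List[Node]:
    return [node] + node.children

root = Node("ate", "root", [
    Node("she", "nsubj"),
    Node("apples", "obj", [Node("red", "amod")]),
])
final_order = linearize(root, root, head_final)
initial_order = linearize(root, root, head_initial)
```

Because each dependent's subtree is emitted as one contiguous block, every string produced this way corresponds to a projective tree.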
One consequence of this formalism is that all counterfactual orders correspond to projective trees, i.e., trees with no crossing dependencies. While projectivity is a well-attested cross-linguistic tendency, human languages do not obey it absolutely (Ferrer-i-Cancho et al., 2018; Yadav et al., 2021). Within the space of projective word order interventions allowed by this formalism, the grammars which we borrow from Hahn et al. (2020) enforce two additional simplifying constraints. First, the relative positioning (left or right) between the head and dependent of a particular relation is fixed. Second, the relative ordering of different relations on the same side of a head is also fixed. We denote grammars which satisfy both constraints as consistent. Notably, natural languages violate both of these assumptions to varying degrees. For example, even in English—a language with relatively strict word order—adverbs can generally appear before or after their head. While these simplifications mean that the formalism cannot perfectly describe natural languages, it provides a computationally well-defined method for intervening on many features of word order. In particular, the consistent grammars of Hahn et al. (2020) are parameterized by a set of scalar weights corresponding to each possible syntactic relation; the ordering function thus reduces to sorting each head’s dependents based on their weight values. Notably, Hahn et al. (2020) also introduced a method for optimizing these grammars for various objective functions by performing stochastic gradient descent on a probabilistic relaxation of the grammar formalism; we use several of these grammars (described in §3.2) in our subsequent analysis.
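A weight-parameterized ordering function of this kind can be sketched as below. The convention used here is an assumption made for illustration: the sign of a relation's weight fixes the side of the head, and dependents are sorted by ascending weight. The weight values are hypothetical and are not the fitted parameters of Hahn et al. (2020).

```python
def order_by_weights(head, dependents, weights):
    """Order a head word and its dependents from one scalar weight per relation.

    `dependents` is a list of (word, relation) pairs. Assumed convention:
    negative weight = left of the head, non-negative = right of the head,
    and dependents are sorted by ascending weight.
    """
    def weight(dep):
        return weights[dep[1]]

    left = sorted((d for d in dependents if weight(d) < 0), key=weight)
    right = sorted((d for d in dependents if weight(d) >= 0), key=weight)
    return [w for w, _ in left] + [head] + [w for w, _ in right]

deps = [("apples", "obj"), ("she", "nsubj"), ("quickly", "advmod")]

# Hypothetical VO-like weights: objects follow the verb.
vo_weights = {"nsubj": -0.8, "advmod": -0.3, "obj": 0.6}
vo_order = order_by_weights("ate", deps, vo_weights)

# Changing only the object's weight yields an OV-like counterfactual order.
ov_weights = {"nsubj": -0.8, "advmod": -0.3, "obj": -0.5}
ov_order = order_by_weights("ate", deps, ov_weights)
```

Since the order depends only on relation labels, every instance of a relation lands on the same side of its head, which is the consistency property described above.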
Creating Counterfactual Word Orderings.
3.2 Counterfactual Grammar Specifications
In addition to the original Real word order, we explore the following theoretically motivated counterfactual grammars for each language. Example sentences from several of these grammars are shown in Figure 3.
Consistent Approximation to Real Order.
Approx is a consistent approximation to the real word order within our formalism; it uses an ordering function parameterized by weights that were fitted to maximize the likelihood of observed word orders for each language, as reported by Hahn et al. (2020). This variant captures most of the word order features of a real language while allowing for a fair comparison to deterministic counterfactual grammars that do not model the flexibility of real language. From the perspective of the UID hypothesis, we expect this variant to be less uniform than Real because it has less flexibility to accommodate speakers’ choices that optimize for UID.
Consistent Random Grammars.
We include variants Random1 through Random5, which use ordering functions parameterized by randomly assigned weights. This means that for a given random grammar, each dependency relation has a fixed direction (left or right), but that the directions of these relations lack the correlations observed in natural language (Greenberg, 1963). Random grammars with the same numerical index share weights across languages.
Consistent Grammars Optimized for Efficiency.
We include two consistent grammars that are optimized for the joint objective of parseability (how much information an utterance provides about its underlying syntactic structure) and sentence-internal predictability, as reported by Hahn et al. (2020), one with OV order (Efficient-OV) and one with VO order (Efficient-VO). For example, the Efficient-OV grammar for English would give a plausible version of a consistent and efficient grammar in the counterfactual world where English has verbs after objects.
Grammars Optimized for Dependency Length Minimization.
From the same work we also take consistent grammars that are optimized for DLM, denoted as Min-Dl-Opt. While linearizations produced by these grammars are not guaranteed to minimize dependency length for any particular sentence, they minimize the expected average dependency length of a large sample of sentences in a language. In addition, we include Min-Dl-Loc, an inconsistent grammar that applies the projective dependency-length minimization algorithm of Gildea and Temperley (2007) at the sentence level, leading to sentences with minimal DL but without the constraint of consistency.
Sort-Freq is an inconsistent grammar which orders words in a sentence from highest to lowest frequency, ignoring dependency structure altogether. We use this ordering as a heuristic baseline for which we expect UID to hold relatively strongly: Low-frequency elements, which tend to have higher surprisal even if solely from their less frequent usage (Ellis, 2002), are given more context, and thus should have smaller surprisals than if they occurred early; more conditioning context tends to reduce the surprisal of the next word (Luke and Christianson, 2016). We also test Sort-Freq-Rev, ordering words from least to most frequent, which for analogous reasons we expect to perform poorly in terms of UID. However, both of these orderings lead to massive syntactic ambiguity by introducing many string collisions—any two sentences containing the same words in different orders would be linearized identically. This eliminates word order as a mechanism for expressing distinctions in meaning, so these orders are implausible as alternatives to natural languages (Mahowald et al., 2022).
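The string collisions mentioned above are easy to reproduce. The sketch below assumes corpus-level word counts and alphabetical tie-breaking between equally frequent words; the tie-breaking rule and the toy corpus are assumptions, since the text does not specify them.

```python
from collections import Counter

def sort_freq(sentence, counts, reverse=False):
    """Order the words of a sentence by corpus frequency.

    reverse=False: most frequent first (Sort-Freq).
    reverse=True:  least frequent first (Sort-Freq-Rev).
    Ties are broken alphabetically (an assumption, to keep output deterministic).
    """
    if reverse:
        return sorted(sentence, key=lambda w: (counts[w], w))
    return sorted(sentence, key=lambda w: (-counts[w], w))

# Toy corpus: two sentences with the same words and different meanings.
s1 = "the dog bit the man".split()
s2 = "the man bit the dog".split()
counts = Counter(s1 + s2)

sorted_s1 = sort_freq(s1, counts)
sorted_s2 = sort_freq(s2, counts)
collision = sorted_s1 == sorted_s2
```

Both sentences map to the same string, so the distinction between who bit whom can no longer be read off the word order.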
Finally, we also include the Reverse variant, where the words in each sentence appear in the reverse order of the original. This variant preserves all pairwise distances between words within sentences and has the same dependency lengths as the original order, thus isolating the effect of linear order on information density from other potential influences. Notably, if the original language happens to be perfectly consistent, then Reverse will also satisfy consistency; in practice, this is unlikely to hold with natural languages.
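That reversal leaves dependency lengths unchanged can be verified directly. The sketch below assumes a sentence is represented by a word list and a parallel list of head indices (`None` for the root); this representation and the example parse are hypothetical.

```python
def dependency_lengths(heads):
    """Sorted distances between each word and its head.

    heads[i] is the index of the head of word i, or None for the root.
    """
    return sorted(abs(i - h) for i, h in enumerate(heads) if h is not None)

def reverse_sentence(words, heads):
    """Reverse the word order and remap head indices accordingly."""
    n = len(words)
    new_heads = [None] * n
    for i, h in enumerate(heads):
        # The word at old position i moves to n - 1 - i, and so does its head.
        new_heads[n - 1 - i] = None if h is None else n - 1 - h
    return words[::-1], new_heads

words = ["she", "ate", "red", "apples"]
heads = [1, None, 3, 1]          # she <- ate, red <- apples, apples <- ate

rev_words, rev_heads = reverse_sentence(words, heads)
original_lengths = dependency_lengths(heads)
reversed_lengths = dependency_lengths(rev_heads)
```

Each distance |i - h| becomes |(n - 1 - i) - (n - 1 - h)|, which is the same value, so any difference in uniformity between Real and Reverse cannot come from dependency length.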
3.3 UID and Counterfactual Grammars
Crucially, however, the transformation f might change the UID score of such a language, allowing us to evaluate the impact of word order on information uniformity. As a simple example, consider the language ℓ1 that places a uniform distribution over only four strings: aw, ax, by, and bz. In this language, the first and second symbols always have 1 bit of surprisal, and the end of the string has 0 bits of surprisal. If the counterfactual language ℓ2 is the reverse of ℓ1, we have a uniform distribution over the strings wa, xa, yb, and zb. Here, the first symbol always has 2 bits of surprisal, and the second symbol and end of sentence always have zero bits, as their values are deterministic for a given initial symbol. While the mean surprisal per symbol is the same for ℓ1 and ℓ2, ℓ1 has more uniform information density than ℓ2.
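The toy example can be reproduced with a few lines of Python. The sketch below computes per-position surprisals under a uniform distribution over a finite set of strings and compares their variance; including the eos position in the variance is a choice made for this illustration only.

```python
import math
from statistics import pvariance

EOS = "<eos>"

def per_position_surprisals(language, string):
    """Surprisal in bits of each symbol of `string`, plus eos, under a
    uniform distribution over the strings in `language`."""
    target = list(string) + [EOS]
    surprisals = []
    for n, symbol in enumerate(target):
        # Strings consistent with the prefix seen so far.
        compatible = [s for s in language if (list(s) + [EOS])[:n] == target[:n]]
        # Of those, the strings whose next symbol matches.
        matching = [s for s in compatible if (list(s) + [EOS])[n] == symbol]
        surprisals.append(math.log2(len(compatible) / len(matching)))
    return surprisals

l1 = ["aw", "ax", "by", "bz"]
l2 = [s[::-1] for s in l1]       # the reversed language: wa, xa, yb, zb

s1 = per_position_surprisals(l1, "aw")
s2 = per_position_surprisals(l2, "wa")

var1 = pvariance(s1)
var2 = pvariance(s2)
```

Both strings carry 2 bits in total, but the forward language spreads them over two symbols while the reversed language concentrates them on the first, which gives it the larger variance.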
4.1 Use of Counterfactual Grammars
Real Word Orders Are not Consistent.
The consistent grammars borrowed from Hahn et al. (2020) assume that the direction of each syntactic relation, as well as the relative ordering of dependents on the same side of a head, are fixed. This is not generally true of natural languages. We address this difference by including the variant Approx as a comparison to the counterfactual variants, which are constrained by consistency, and by including Reverse as a comparison to Real, neither of which is constrained by consistency.
Automatic Parsing Errors.
Another issue is that the dependency parses extracted for each original sentence as part of the counterfactual generation pipeline may contain parsing errors. These errors may introduce noise into the counterfactual datasets that is not present in the original sentences, and may cause deviations from the characteristics that we assume our counterfactual grammars should induce. For example, Min-Dl-Loc only produces sentences with minimized dependency length if the automatic parse is correct.
Finally, our counterfactual generation procedure assumes a deterministic mapping from sentences to dependency trees as one of its steps. However, multiple valid parses of sentences are possible in the presence of syntactic ambiguity. In such cases, we always select the most likely structure according to the parser, which learns these probabilities based on its training data. Therefore, this design choice could lead to underrepresentation of certain syntactic structures when applying a transformation. However, we note that the variants Real, Reverse, Sort-Freq, and Sort-Freq-Rev do not depend on dependency parses and so are unaffected by this design choice.
4.2 Choice of Dataset
Properties of language can vary across genres and domains. When drawing conclusions about human language in general, no single dataset will be completely representative. Due to the amount of data required to train LMs, we use written corpora in this work, and use the term speaker loosely to refer to any language producer regardless of modality. To address potential concerns about the choice of dataset in this study, we conducted a supplementary analysis on a subset of languages using a different web corpus, which we report in §4.2.
4.3 Errors and Inductive Biases
In summary, our UID metrics could be biased either positively or negatively by the quality of our models. However, since our analysis focuses on the comparison of UID metrics between word order variants rather than their absolute value, this bias should not be a major concern. We use the same model architecture for all language–variant combinations, and so a bias in the UID metric corresponding to one combination should likewise be reflected in all of the metrics that it is compared to. Further, our results hold even when controlling for mean surprisal, as described in §6.
Because modern LMs have been developed to model natural language, they may contain subtle biases towards the properties of real word orders or of highly resourced languages. Based on Inequality (10), if two probabilistic models $m_\ell$ and $m_{\ell_f}$ were to perfectly learn the true and counterfactual distributions $p_\ell$ and $p_{\ell_f}$, respectively, then $m_\ell$ should assign approximately the same or higher mean surprisal to a corpus from $\ell$ than $m_{\ell_f}$ assigns to the counterfactual corpus from $\ell_f$. This implies that previous results of Gildea and Jaeger (2015), Ravfogel et al. (2019), Hahn et al. (2020), and White and Cotterell (2021), which found that real corpora tend to have lower average per-word surprisal than deterministically generated counterfactual versions of the same corpora, were in fact due to the inductive bias of the learning algorithms used to estimate surprisals. There is a clear reason why the trigram model of Gildea and Jaeger (2015) would yield higher mean surprisals for counterfactual corpora: The transformation functions f tended to increase dependency lengths, and words in a dependent–head relation tend to have higher mutual information than other pairs of words (Futrell and Levy, 2017; Futrell et al., 2019, 2020). Hence the transformations tended to push words that are predictive of each other outside of the conditioning window of the model (see also Hahn and Xu, 2022, for similar effects). The Transformer architecture we use in this work could thus also contain biases favoring features of real language, which we attempt to control for (see §6).
5 Experimental Setup
This work uses the publicly available Wiki40b dataset (Guo et al., 2020), a large text corpus derived from Wikipedia articles. We use subsets of the Wiki40b dataset in 10 languages: English, Russian, French, German, Hindi, Farsi, Vietnamese, Indonesian, Hungarian, and Turkish. The first six represent the Germanic, Slavic, Romance, Indo-Aryan, and Iranian sub-families of the Indo-European language family. The latter four belong to the Austroasiatic, Austronesian, Uralic, and Turkic language families, respectively. Turkish, Hindi, and Farsi have basic SOV word order, while the other languages have SVO order, with Hungarian being mixed (Dryer, 2013). Languages were chosen based on the amount of available data in the Wiki40b dataset, their typological properties (covering a range of families, canonical word orders, and morphological complexity), and availability of automatic dependency parsing models.
The datasets are subsampled to yield approximately 20M words in the training set of each language and approximately 1M words in the test and validation sets. We automatically generate dependency parses for all sentences using the UDPipe parser (Straka and Straková, 2017), yielding syntactic representations in the UD paradigm. We then apply each of the counterfactual orderings introduced in §3.2 to the original data to create parallel corpora for each language. Sentences are stripped of punctuation (as determined by the dependency parser’s Punct label) and are lowercased. Periods are added back in to mark the end of sentences, regardless of what the original final punctuation was. Sub-word tokenization is then applied to the corpora using a byte-pair encoding (BPE) model, trained with a fixed vocabulary size of 30K tokens and using the algorithm of Sennrich et al. (2016).
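The punctuation and casing steps can be sketched as follows. The input format (a list of word form and tag pairs per sentence), the tag string `"PUNCT"`, and the choice to emit the final period as a separate token are assumptions for illustration; the sketch covers neither parsing nor BPE.

```python
def preprocess(tokens):
    """Strip punctuation, lowercase, and append a sentence-final period.

    `tokens` is a list of (form, tag) pairs for one parsed sentence, where
    the tag "PUNCT" marks punctuation (assumed label string).
    """
    words = [form.lower() for form, tag in tokens if tag != "PUNCT"]
    # A period marks the sentence end regardless of the original punctuation.
    return " ".join(words + ["."])

parsed = [("Is", "AUX"), ("it", "PRON"), ("raining", "VERB"), ("?", "PUNCT")]
cleaned = preprocess(parsed)
```

Normalizing the final punctuation in this way gives every variant of a sentence the same end-of-sentence marker.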
5.2 Language Modeling
For each variant of each language, we train a Transformer language model (Vaswani et al., 2017) using fairseq (Ott et al., 2019). Models are trained on document-level inputs, with a maximum length of 512 tokens; this means that each token is predicted with the preceding material of the entire document as context. Each model is trained with early stopping, halting training after no improvement in validation loss for three epochs. We use the Adam optimizer (Kingma and Ba, 2017), with a learning rate of 0.0005, weight decay of 0.01, and dropout of 0.1. Training scripts are available in the project’s GitHub repository. In all of our analyses, we use the word-by-word surprisals estimated using our trained models on their corresponding held-out test sets. Note that we do not consider the designated eos symbol in the computation of any of our UID-related metrics. In the case that a word is composed of multiple sub-word tokens, we aggregate their surprisals by summation, since surprisal decomposes additively.
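The aggregation of sub-word surprisals into word-level surprisals can be sketched as below. The sketch assumes that non-final sub-word pieces carry a trailing `@@` continuation marker; this marker convention, the function name, and the surprisal values are assumptions and may differ from the actual pipeline.

```python
def word_surprisals(tokens, surprisals, cont="@@"):
    """Sum sub-word surprisals into word-level surprisals.

    A token ending in `cont` continues into the next token (assumed
    convention); a token without it closes the current word.
    """
    words, totals = [], []
    current, total = "", 0.0
    for tok, s in zip(tokens, surprisals):
        total += s                           # surprisal is additive over pieces
        if tok.endswith(cont):
            current += tok[: -len(cont)]     # word continues
        else:
            words.append(current + tok)      # word is complete
            totals.append(total)
            current, total = "", 0.0
    return words, totals

# Hypothetical sub-word tokens with illustrative surprisal values.
tokens = ["uni@@", "form@@", "ity", "is", "good"]
values = [3.0, 1.5, 0.5, 2.0, 4.0]
words, totals = word_surprisals(tokens, values)
```

Summation rather than averaging is what keeps the word-level value equal to the negative log-probability of the whole word given its context.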
Estimates of mean per-word surprisal on the test set are in Figure 4A. Consistent with the results of Hahn et al. (2020), our trained models for nearly all counterfactual variants assign higher per-word surprisal to their respective test sets than the Real models assign to theirs. Across all 10 languages, Reverse has mean surprisal close to, but consistently slightly higher than, that of the real ordering. Sort-Freq and Sort-Freq-Rev have mean surprisals close to or below those of Real.
Estimates of mean surprisal variance (uidv) over sentences are shown in Figure 4B. Notably, there is a dissociation between the rank order of variants according to mean surprisal and according to uidv: Variants with similar mean surprisals did not necessarily have similar uidv scores, and vice versa, suggesting that information uniformity and mean surprisal can vary independently of each other. Our main observations are as follows: (i) In all languages except Turkish and Hindi, our estimates of uidv for Real are lower than those for Reverse, despite the variants’ similarities in mean surprisal. (ii) As predicted, the Sort-Freq baseline has uidv equal to or lower than that of Real. (iii) The other counterfactual variants typically exhibit higher uidv than Real, with the exception of mixed results for Sort-Freq-Rev. (iv) The Efficient-VO variants typically have lower uidv than Efficient-OV (with Hungarian being a noteworthy exception), which supports findings based on toy grammars showing that SVO orders are more uniform than SOV orders (Maurits et al., 2010). Crucially, these results are qualitatively similar using the uidlv metric (Figure 6B).
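The dissociation between mean surprisal and surprisal variance can be illustrated with a small sketch. It assumes that the variance is computed within each sentence around that sentence's own mean and then averaged over sentences; the formal definition of uidv is not restated in this section, and the surprisal values are illustrative.

```python
from statistics import fmean, pvariance

def sentence_variance(surprisals):
    """Population variance of word-level surprisals within one sentence."""
    return pvariance(surprisals)

def mean_surprisal_variance(corpus):
    """Average of the per-sentence surprisal variances over a corpus."""
    return fmean([sentence_variance(s) for s in corpus])

# Two hypothetical sentences with the same mean surprisal (2 bits per word).
uniform_sentence = [2.0, 2.0, 2.0, 2.0]
spiky_sentence = [0.0, 0.0, 0.0, 8.0]

uniform_var = sentence_variance(uniform_sentence)
spiky_var = sentence_variance(spiky_sentence)
corpus_score = mean_surprisal_variance([uniform_sentence, spiky_sentence])
```

The two sentences are indistinguishable by mean surprisal, yet the second concentrates all of its information in one word and is therefore far less uniform.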
To fairly compare variants using the uidp metric, we first need to account for the fact that, unlike surprisal variance, the metric is sensitive to shifts in mean surprisal. To control for this, we fit a regression model predicting the uidp score based on three variables: The mean surprisal, the grammar variant, and the dataset size (20M, 6.6M, and 2.2M words). We train multiple language models for each language-variant combination (3 dataset sizes and 2 random seeds), resulting in 84 data points per language. We apply treatment coding to the variants, with Real as the reference level. Figure 5 shows the resulting estimates of the coefficients for each variant, where a coefficient should be positive if that variant is less uniform than Real. Qualitatively, the regression results match the results given by uidv and uidlv: Real is more uniform than Reverse in SVO languages, Sort-Freq is the only counterfactual variant that is consistently more uniform than Real, and Efficient-VO is more uniform than Efficient-OV in most languages; the opposite is true in Hungarian and the difference is negligible in Russian.
We offer a discussion of the results observed in §6, including their implications for the role of functional pressures in language.
7.1 Differences in Mean Surprisal
Across 10 typologically diverse languages, we find that Transformer LMs learn to predict data from real word orders better than data from counterfactual orders, with the exception of the Sort-Freq and Sort-Freq-Rev variants. This suggests that these LMs’ inductive biases somehow favor properties of real languages, in line with previous work on other modeling architectures (Gildea and Jaeger, 2015; Ravfogel et al., 2019). This is not surprising, given that commonly used architectures and hyperparameters have been selected specifically based on their good performance on real language tasks. Unlike in n-gram models, the precise inductive bias of Transformer models that favors real word orders is not transparent and merits further study.
7.2 Differences Between Real and Approx
We observe that despite the similarities between the Real and Approx variants of a given language, the latter are consistently assigned higher mean surprisal by their respective LMs. Meanwhile, the various UID metrics show similar results for Real and Approx, suggesting that the greater flexibility of Real is not responsible for UID differences in our results. This is somewhat surprising, since it may appear that such flexibility is what enables speakers’ choices, which have been previously discussed as contributing to UID. However, many speaker choices that potentially impact UID, such as word choice, active versus passive voice, and optional words, are not captured by this difference in flexibility between Real and Approx.
7.3 Greater Uniformity of Real over Reverse in SVO Languages
While mean surprisal is always very close for Real and Reverse grammars, Reverse is less uniform in 8 out of 10 languages, including all SVO languages. This held across multiple operationalizations of UID, with the exception of mixed results for Hungarian, a language with considerable flexibility in word order. Thus, while both Real and Reverse orders are learned approximately equally well by language models, they differ in how uniformly they distribute information.
One key difference between Real and Reverse is that insofar as Real sentences exhibit a tendency to mention entities from the end of a given sentence close to the beginning of the next one, Reverse does not preserve this property. For example, the pair of sentences “I like dogs. They are friendly.” would become “Dogs like I. Friendly are they.”; note that the distance between antecedent and pronoun is significantly increased. This feature of Reverse raises the possibility that the uniformity patterns we observe are due to speaker choices taking cross-sentence dependencies into consideration. To minimize the influence of cross-sentence dependencies, we can consider only sentences occurring at the start of a document, which cannot refer to previous sentences. Figure 6A shows that the tendency for Real to have lower surprisal variance than Reverse still holds in this setting across most languages. This suggests that cross-sentence dependencies alone cannot fully explain the observed differences in information uniformity.
Notably, our results show that the UID preference for Real over Reverse is not consistently present in languages with basic SOV order (Turkish, Hindi, and Farsi). We propose the following explanation for this result: As argued in Maurits et al. (2010), SVO languages tend to have more uniform information density profiles than SOV languages, a finding supported by our empirical results in which Efficient-VO had lower surprisal variance than Efficient-OV in 9 out of 10 languages. Unlike the short, simple sentences of Maurits et al., however, the present study considers long and complex sentences where speaker choices have considerable opportunity to influence information uniformity, in addition to the role of basic word order. These choices include whether to use a pronoun, whether to use an active or passive construction, and what order to present a conjunction or list of items, among others. Importantly, speakers make choices conditional on the forward ordering of real language, so we expect that the choices made in an attempt to increase UID, which constitute a non-trivial percentage of utterances (Levy and Jaeger, 2007), would have a greater effect on UID in Real than in Reverse. In SVO languages, the effects upon UID of basic word order and speaker choices both go in the same direction: towards more uniformity. In SOV languages, these effects conflict: The basic word order is non-optimal in terms of UID, and so uniformity can theoretically be increased by a transformation to Reverse, while speaker choices are presumably already mostly optimal in Real. This may explain the heterogeneous patterning among the three SOV languages.
Furthermore, these results can potentially shed light on an important question in linguistic typology: Why are some basic word orders more common than others? According to some theories, SOV order (the most typologically common) is the most natural for expressing events with subjects and objects (Goldin-Meadow et al., 2008; Gibson et al., 2013; Futrell et al., 2015a). If these theories are correct, an evolutionary pressure on languages to shift from SOV to SVO could help account for the prevalence of SVO languages, which are nearly as common as SOV ones. A pressure for information uniformity offers one such account.
Finally, Pimentel et al. (2021a) have recently shown that the distribution of per-phone information within words is more uniform when analysed in reverse order than in forward order, the opposite of what we observe in our sentence-level analysis. This difference may suggest that qualitatively distinct information-theoretic pressures are present at the lexical and sentential levels, and it is a potential topic for further study.
7.4 Other Variants
The variants designed to minimize dependency length, Min-Dl-Loc and Min-Dl-Opt, showed mixed results in terms of information uniformity compared to Real. The random grammars fell into two groups: Random1, Random2, and Random4 tended to be less uniform than Real, while Random3 and Random5 tended to be similar in uniformity to Real. Since random grammars have fixed but uncorrelated directions of syntactic relations, these cross-linguistically consistent patterns suggest that some settings of the parameterized grammar are inherently more favorable from the perspective of UID than others.
The only counterfactual word order to consistently have a higher degree of information uniformity than the real orders was the highly constrained Sort-Freq, which turns sentences into sorted word lists. Thus, while it appears possible to improve on real word orders’ information uniformity, this comes at the cost of massive syntactic ambiguity and reduced expressivity.
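To make the ambiguity cost concrete, the following is a minimal sketch of a frequency-sorting transformation. It assumes that tokens are ordered from most to least frequent and that ties are broken alphabetically; both choices, the function name `sort_freq`, and the toy corpus are illustrative assumptions rather than details of the experiments.

```python
from collections import Counter


def sort_freq(tokens, counts):
    """Order a sentence's tokens from most to least frequent.

    Assumptions: descending frequency order, with alphabetical order as an
    arbitrary tie-breaker so that the output is deterministic.
    """
    return sorted(tokens, key=lambda tok: (-counts[tok], tok))


# Toy corpus: two sentences with opposite meanings.
corpus = [
    ["the", "dog", "chased", "the", "cat"],
    ["the", "cat", "chased", "the", "dog"],
]
counts = Counter(tok for sent in corpus for tok in sent)

sorted_a = sort_freq(corpus[0], counts)
sorted_b = sort_freq(corpus[1], counts)
print(sorted_a)
print(sorted_a == sorted_b)
```

Both toy sentences map to the same sorted word list, so the information about which entity did the chasing is no longer recoverable from the output.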
7.5 Robustness to Dataset Choice
In this study, the chosen dataset (Wiki40b) contains formal writing that may not exhibit the same communicative pressures as spoken language. It is largely devoid of first- and second-person pronouns, interrogatives, and other features common in everyday speech; further, it may have disproportionate amounts of translationese (Koppel and Ordan, 2011). As a supplementary analysis, we repeated the experiments on the CC100 dataset (Conneau et al., 2020), using only a subset of languages due to computational constraints. This dataset is sourced from a web crawl and therefore contains a wider range of genres and styles than Wiki40b. The uidv scores for these experiments are shown in Figure 7. The results qualitatively match the patterns from the Wiki40b experiments in the following ways: (i) better uidv scores for Real than for Reverse among SVO languages, (ii) better uidv scores for Efficient-VO than Efficient-OV in most languages (with Hungarian again being an exception), and (iii) the only variant that has higher uniformity than Real across a majority of languages is Sort-Freq.
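For concreteness, the following sketch shows how variance-based uniformity scores of this kind can be computed from per-token surprisals. It assumes that uidv is the mean squared deviation of a sentence's token surprisals from their mean, and that uidlv is the mean squared difference between adjacent surprisals; these definitions, the function names, and the toy surprisal values are assumptions made for illustration and are not taken from the experiments.

```python
def surprisal_variance(surprisals):
    """Mean squared deviation of token surprisals from the sentence mean."""
    mean = sum(surprisals) / len(surprisals)
    return sum((s - mean) ** 2 for s in surprisals) / len(surprisals)


def local_surprisal_variance(surprisals):
    """Mean squared difference between adjacent token surprisals.

    Requires at least two tokens.
    """
    if len(surprisals) < 2:
        raise ValueError("need at least two surprisal values")
    diffs = [(b - a) ** 2 for a, b in zip(surprisals, surprisals[1:])]
    return sum(diffs) / len(diffs)


# Two toy sentences carrying the same total surprisal (12 bits each).
uniform = [3.0, 3.0, 3.0, 3.0]
spiky = [1.0, 5.0, 1.0, 5.0]

print(surprisal_variance(uniform), local_surprisal_variance(uniform))
print(surprisal_variance(spiky), local_surprisal_variance(spiky))
```

Under these assumed definitions, lower values indicate a more uniform information profile: the perfectly uniform toy sentence scores 0 on both measures, while the spiky one scores higher despite carrying the same total information.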
In conclusion, we have empirically demonstrated that in many languages, real word orders distribute information more uniformly than a range of counterfactual orders. The fact that this pattern holds in every SVO language but is mixed among SOV languages lends support to the view that SVO basic word order is preferable to SOV order from the perspective of maximizing UID. We posit that there are two potential sources of optimization within a language for greater UID: Language evolution favoring word orders that produce less variance in information content, and speaker choices in favor of constructions that smooth the information profile of utterances. Our results are consistent with the UID hypothesis, and support the idea that communicative pressures (operationalized in terms of information theory) influence the structure of human language.
We thank our action editor and the anonymous reviewers for their detailed feedback on this paper. CM was supported by the Google PhD Fellowship. TP was supported by a Facebook PhD Fellowship. This work was supported by NSF grant BCS-2121074 to RPL.
Code for reproducing our experiments is available at https://github.com/thomashikaru/word-order-uid.
This symbol allows for the global normalization of pℓ, i.e., a valid probability distribution over finite-length sequences (see Du et al., 2022, for a discussion).
Most empirical results (Hale, 2001; Levy, 2008; Shain et al., 2022) suggest that a word’s processing effort is directly proportional to its surprisal. Yet there is also evidence of a superlinear relationship, which would imply a preference by the comprehender for UID (Meister et al., 2021; Hoover et al., 2022).
This metric suggests a superlinear processing cost for surprisal.
We note that, while a fully uniform language would have value 0 for uidv and uidlv, it would not for uidp(y), so the metrics are not directly comparable.
For notational brevity, we leave the dependency of on ℓ implicit as it should be clear from context.
All variants of the same language are tokenized using the same BPE model, trained on a sample of 100K documents from all variants; BPE tokens could not cross word boundaries for compatibility with different word orders.
Notably, White and Cotterell (2021) show that there is a large variation in how Transformer language models perform in toy languages with diverse word orders; they, however, do not find evidence that Transformers perform better on the most frequently occurring orders (as opposed to, e.g., OVS and VOS word orders, which are found in few languages).
Action Editor: Mark-Jan Nederhof