A Cross-Linguistic Pressure for Uniform Information Density in Word Order

Abstract
While natural languages differ widely in both canonical word order and word order flexibility, their word orders still follow shared cross-linguistic statistical patterns, often attributed to functional pressures. In the effort to identify these pressures, prior work has compared real and counterfactual word orders. Yet one functional pressure has been overlooked in such investigations: the uniform information density (UID) hypothesis, which holds that information should be spread evenly throughout an utterance. Here, we ask whether a pressure for UID may have influenced word order patterns cross-linguistically. To this end, we use computational models to test whether real orders lead to greater information uniformity than counterfactual orders. In our empirical study of 10 typologically diverse languages, we find that: (i) among SVO languages, real word orders consistently have greater uniformity than reverse word orders, and (ii) only linguistically implausible counterfactual orders consistently exceed the uniformity of real orders. These findings are compatible with a pressure for information uniformity in the development and usage of natural languages.1


Introduction
Human languages differ widely in many respects, yet there are patterns that appear to hold consistently across languages. Identifying explanations for these patterns is a fundamental goal of linguistic typology. Furthermore, such explanations may shed light on the cognitive pressures underlying and shaping human communication.
This work studies the uniform information density (UID) hypothesis as an explanatory principle for word order patterns (Fenk and Fenk, 1980; Genzel and Charniak, 2002; Aylett and Turk, 2004; Jaeger, 2010; Meister et al., 2021). The UID hypothesis posits a communicative pressure to avoid spikes in information within an utterance, thereby keeping the information profile of an utterance relatively close to uniform over time. While the UID hypothesis has been proposed as an explanatory principle for many linguistic phenomena, e.g., speakers' choices when faced with lexical and syntactic alternations (Levy and Jaeger, 2006), its relationship to word order patterns has received limited attention, with the notable exceptions of Maurits et al. (2010) and Jain et al. (2018).
1 Code for reproducing our experiments is available at https://github.com/thomashikaru/word-order-uid.
Our work investigates the relationship between UID and word order patterns, differing from prior work in several ways. We (i) use Transformer language models (LMs) (Vaswani et al., 2017) to estimate information-theoretic operationalizations of information uniformity; (ii) analyze large-scale naturalistic datasets of 10 typologically diverse languages; and (iii) compare a range of theoretically motivated counterfactual grammar variants.
Experimentally, we find that among SVO languages, the real word order has a more uniform information density than nearly all counterfactual word orders; the only orders that consistently exceed real orders in uniformity are generated using an implausibly strong bias for uniformity, at the cost of expressivity. Further, we find that counterfactual word orders that place verbs before objects are more uniform than ones that place objects before verbs in nearly every language.
Our findings suggest that a tendency for uniform information density may exist in human language, with two potential sources: (i) word order rules, with SVO order generally being more uniform than SOV; and (ii) choices made by speakers, who use the flexibility present in real languages to structure information more uniformly at a global level (and not only in a small number of isolated constructions).

arXiv:2306.03734v2 [cs.CL] 9 Jul 2023

Functional Pressures in Language

Linguistic Optimizations
A number of linguistic theories link cross-linguistic patterns to functional pressures. For example, both the grammatical rules of a language and speakers' choices (within the space of grammatically acceptable utterances) are posited to reflect a trade-off between effort and robustness: shorter and simpler structures are easier to produce and comprehend, but longer and more complex utterances can encode more information (Gabelentz, 1901; Zipf, 1935; Hawkins, 1994, 2004, 2014; Haspelmath, 2008). Another such functional pressure follows from the principle of dependency length minimization (DLM), which holds that, in order to minimize working memory load during comprehension, word orders should place words in direct dependency relations close to each other (Rijkhoff, 1986, 1990; Hawkins, 1990, 1994, 2004, 2014; Grodner and Gibson, 2005; Gibson, 1998, 2000; Bartek et al., 2011; Temperley and Gildea, 2018; Futrell et al., 2020). A growing body of work has turned to information theory, the mathematical theory of communication (Shannon, 1948), to formalize principles that explain linguistic phenomena (Jaeger and Tily, 2011; Gibson et al., 2019; Pimentel et al., 2021c). One such principle is that of uniform information density.

Uniform Information Density
According to the uniform information density (UID) hypothesis, speakers tend to spread information evenly throughout an utterance; large fluctuations in the per-unit information content of an utterance can impede communication by increasing the processing load on the listener. Speakers may modulate the information profile of an utterance by selectively producing linguistic units such as optional complementizers in English (Levy and Jaeger, 2006; Jaeger, 2010). A pressure for UID in speaker choices has also been studied in specific constructions in other languages, though with mixed conclusions (Zhan and Levy, 2018; Clark et al., 2022).
Formally, the information conveyed by a linguistic signal $y$, e.g., an utterance or piece of text, is quantified in terms of its surprisal $s(\cdot)$, which is defined as $y$'s negative log-probability: $s(y) \overset{\mathrm{def}}{=} -\log p_\ell(y)$. Here, $p_\ell$ is the underlying probability distribution over sentences $y$ for a language $\ell$. Note that we do not have access to the true distribution $p_\ell$, and typically rely on a language model with learned parameters $\theta$ to estimate surprisal values with a second distribution $p_\theta$.
Surprisal can be additively decomposed over the units that comprise a signal. Explicitly, for a signal $y$ that can be expressed as a series of linguistic units $y_1, \ldots, y_N$, where $y_n \in V$ and $V$ is a set vocabulary of words or morphemes, the surprisal of a unit $y_n$ is its negative log-probability given prior context: $s(y_n) = -\log p_\ell(y_n \mid y_{<n})$. Note that the distribution $p_\ell(\cdot \mid y_{<n})$ has support $\overline{V} \overset{\mathrm{def}}{=} V \cup \{\mathrm{EOS}\}$, where EOS is a designated symbol indicating the end of a sequence; a valid, complete signal $y = y_1, \ldots, y_N$ has $y_N = \mathrm{EOS}$. The quantity $s(y)$ can thus likewise be expressed as $s(y) = \sum_{n=1}^{N} s(y_n)$. Assuming that we have a fixed amount of information to convey and that high-surprisal items are disproportionately difficult to process, it can be shown mathematically that spreading information evenly throughout a signal optimizes ease of processing for the comprehender (Levy and Jaeger, 2006; Smith and Levy, 2013; Levy, 2018; Meister et al., 2021).
While the UID hypothesis is often discussed in the context of speaker choices, it has also been presented as a general cognitive constraint that might influence reading times (Meister et al., 2021), speech duration (Pimentel et al., 2021b), and word lengths (Piantadosi et al., 2011). Selection for UID has also been discussed as a potential evolutionary pressure on language that can explain typological differences (Jaeger and Tily, 2011). Within this literature, there is no consensus on how to formally operationalize UID. For example, Frank and Jaeger (2008) measure regression of surprisal towards a language-wide mean; Collins (2014) and Bloem (2016) consider more local changes in surprisal in their quantification of UID.
In this work, we consider three metrics for operationalizing UID (Meister et al., 2021):
\[ \mathrm{UID}_v(y) = \frac{1}{N} \sum_{n=1}^{N} \big(s(y_n) - \mu\big)^2 \tag{1} \]
\[ \mathrm{UID}_{lv}(y) = \frac{1}{N-1} \sum_{n=2}^{N} \big(s(y_n) - s(y_{n-1})\big)^2 \tag{2} \]
\[ \mathrm{UID}_p(y) = \frac{1}{N} \sum_{n=1}^{N} s(y_n)^k \tag{3} \]
In Equation (1), $\mathrm{UID}_v$ is the mean within-sentence variance of word surprisals, where $\mu = \frac{1}{N} \sum_{n=1}^{N} s(y_n)$. In Equation (2), $\mathrm{UID}_{lv}$ quantifies the average word-to-word change in surprisal, a more localized measure (Collins, 2014). Intuitively, this is maximized when high-surprisal words alternate with low-surprisal words, and minimized when words appear in sorted order by information content.
In Equation (3), $\mathrm{UID}_p$ is a power mean with $k > 1$, which disproportionately increases in the presence of larger surprisal values. Note that for all of these operationalizations, lower values correspond to greater uniformity.
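The three metrics can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the variance, squared successive-difference, and mean k-th-power forms of Equations (1) to (3), the function names are hypothetical, and the default `k=1.25` is an illustrative choice rather than a value taken from the text.

```python
from statistics import fmean


def uid_variance(surprisals):
    """Variance of per-word surprisals within one sentence (Equation 1)."""
    mu = fmean(surprisals)
    return fmean([(s - mu) ** 2 for s in surprisals])


def uid_local_variance(surprisals):
    """Mean squared word-to-word change in surprisal (Equation 2)."""
    if len(surprisals) < 2:
        return 0.0  # assumption: a one-word sentence has no local change
    return fmean([(b - a) ** 2 for a, b in zip(surprisals, surprisals[1:])])


def uid_power(surprisals, k=1.25):
    """Mean of surprisals raised to the power k > 1 (Equation 3).

    k=1.25 is an illustrative default, not a value stated in the text.
    """
    return fmean([s ** k for s in surprisals])


# Same multiset of surprisals, different orders: the variance is identical,
# but the local measure is larger when high and low values alternate.
alternating = [0.0, 4.0, 0.0, 4.0]
sorted_order = [0.0, 0.0, 4.0, 4.0]
print(uid_variance(alternating), uid_variance(sorted_order))              # 4.0 4.0
print(uid_local_variance(alternating), uid_local_variance(sorted_order))  # 16.0 and about 5.33
```

Lower values indicate greater uniformity for all three functions, so a perfectly flat profile such as `[2.0, 2.0, 2.0]` scores 0 on the first two.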

Counterfactual Language Paradigm
Following prior work that has used counterfactual languages to study the functional pressures at play in word order patterns, we investigate to what degree a language's word order shows signs of optimization for UID. In this approach, a corpus of natural language is compared against a counterfactual corpus containing minimally changed versions of the same sentences, where the changes target an attribute of interest, e.g., the language's word order. For example, several studies of DLM have compared syntactic dependency lengths in real and counterfactual corpora, generated by permuting the sentences' word order either randomly (Ferrer-i-Cancho, 2004; Liu, 2008) or deterministically by applying a counterfactual grammar (Gildea and Temperley, 2010; Gildea and Jaeger, 2015; Futrell et al., 2015b, 2020). Similarly, we will compare measures of UID in real and counterfactual corpora to investigate whether real languages' word orders exhibit more uniform information density than alternative realizations.

Formal Definition
We build on the counterfactual generation procedure introduced by Hahn et al. (2020) to create parallel corpora. This procedure operates on sentences' dependency parses. Formally, a dependency parse of a sentence $y$ is a directed tree with one node for every word, where each word in $y$, with the exception of a designated root word, is the child of its (unique) syntactic head; see Zmigrod et al. (2020) for a discussion of the role of the root constraint in dependency tree annotation. Each edge in the tree is annotated with the syntactic relationship between the words connected by that edge; see Fig. 1 for an example. Here we use the set of dependency relations defined by the Universal Dependencies (UD) paradigm (de Marneffe et al., 2021), though we follow Hahn et al. (2020) in transforming dependency trees such that function words are treated as heads, leading to representations closer to those of standard syntactic theories; see also Gerdes et al. (2018).
Tree linearization. While syntactic relationships are naturally described hierarchically, sentences are produced and processed as linear strings of words. Importantly, there are many ways to linearize a dependency parse $t$'s nodes into a string $y$. Concretely, a grammar under our formalism is defined by an ordering function (see Kuhlmann, 2010) $g(\cdot, \cdot)$ which takes as arguments a dependency parse and a specific node in it, and returns an ordering of the node and its dependents. For each node, its dependents are arranged from left to right according to this ordering; any node without dependents is trivially an ordered set on its own. This process proceeds recursively to arrive at a final ordering of all nodes in a dependency tree, yielding the final string $y$. Pseudo-code for the linearization of a tree based on an ordering function $g$ is given in Fig. 2.
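The recursive procedure described above can be sketched as follows. This is an illustrative toy, not the pseudo-code of Fig. 2: the `Node` class, the single-argument form of `g` (which sees only the node and its dependents), and the weight-based grammar `weighted_order` are all assumptions introduced for the example, and the example tree uses content-word heads for readability rather than the function-word-headed trees used in the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    word: str
    relation: str                      # relation to the head ("root" for the root)
    children: list = field(default_factory=list)


def linearize(node, g):
    """Recursively flatten the subtree rooted at `node` into a list of words."""
    words = []
    for item in g(node):               # g orders the node and its direct dependents
        if item is node:
            words.append(node.word)
        else:
            words.extend(linearize(item, g))
    return words


def weighted_order(weights):
    """Hypothetical consistent grammar: one scalar weight per relation.

    Negative weights place a dependent left of its head, non-negative weights
    place it to the right; dependents on one side are sorted by weight.
    """
    def g(node):
        ranked = sorted(node.children, key=lambda c: weights[c.relation])
        left = [c for c in ranked if weights[c.relation] < 0]
        right = [c for c in ranked if weights[c.relation] >= 0]
        return left + [node] + right
    return g


tree = Node("chased", "root", [
    Node("dog", "nsubj", [Node("the", "det")]),
    Node("cat", "obj", [Node("a", "det")]),
])

vo = weighted_order({"nsubj": -1.0, "obj": 1.0, "det": -1.0})
ov = weighted_order({"nsubj": -1.0, "obj": -0.5, "det": -1.0})
print(" ".join(linearize(tree, vo)))   # the dog chased a cat
print(" ".join(linearize(tree, ov)))   # the dog a cat chased
```

Because each subtree is emitted as one contiguous block, every order produced this way is projective.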
Simplifying assumptions. One consequence of this formalism is that all counterfactual orders correspond to projective trees, i.e., trees with no crossing dependencies. While projectivity is a well-attested cross-linguistic tendency, human languages do not obey it absolutely (Ferrer-i-Cancho et al., 2018; Yadav et al., 2021). Within the space of projective word order interventions allowed by this formalism, the grammars which we borrow from Hahn et al. (2020) enforce two additional simplifying constraints. First, the relative positioning (left or right) between the head and dependent of a particular relation is fixed. Second, the relative ordering of different relations on the same side of a head is also fixed. We denote grammars which satisfy both constraints as consistent. Notably, natural languages violate both of these assumptions to varying degrees. For example, even in English (a language with relatively strict word order), adverbs can generally appear before or after their head. While these simplifications mean that the formalism cannot perfectly describe natural languages, it provides a computationally well-defined method for intervening on many features of word order. In particular, the consistent grammars of Hahn et al. (2020) use ordering functions parameterized by weights (see §3.2). Hahn et al. (2020) also introduced a method for optimizing these grammars for various objective functions by performing stochastic gradient descent on a probabilistic relaxation of the grammar formalism; we use several of these grammars (described in §3.2) in our subsequent analysis.
Creating counterfactual word orderings. The above paradigm equips us with the tools necessary for systematically altering sentences' word orderings, which, in turn, enables us to create counterfactual corpora. Notably, the large corpora we use in this study contain sentences as strings, not as their dependency parses. We therefore define our counterfactual grammar intervention as the output of a (deterministic) word re-ordering function $f : \mathcal{Y} \to \mathcal{Y}$, where $\mathcal{Y} \overset{\mathrm{def}}{=} V^{*}$ is the set of all possible sentences that can be constructed using a language's vocabulary $V$. This function takes as input a sentence from our original language and outputs a sentence with the counterfactual word order defined by a given ordering function $g$. We decompose this function into two steps:
\[ f(y) = \mathrm{linearize}(\mathrm{parse}(y), g) \tag{4} \]
We use a state-of-the-art parser (Straka and Straková, 2017) to implement $\mathrm{parse} : \mathcal{Y} \to \mathcal{T}$, where $\mathcal{T}$ is the set of all dependency parses. Specifically, we define $\mathrm{parse}(y) = \operatorname{argmax}_{t \in \mathcal{T}} p(t \mid y)$ for a learned conditional probability distribution over possible parses $p(\cdot \mid y)$. We then obtain the linearized form of the resulting tree by supplying it and the ordering function $g$ to $\mathrm{linearize}$, as defined above. Collectively, the outputs of this process (parallel datasets differing only in word order) are referred to as variants. Importantly, $f$ here is a deterministic function; one could instead consider $f$ to be probabilistic in nature, with each sentence $y$ having a distribution over tree structures $t$. We discuss the implications of this choice in §4.

Counterfactual Grammar Specifications
In addition to the original REAL word order, we explore the following theoretically motivated counterfactual grammars for each language.
Consistent approximation to real order. APPROX is a consistent approximation to the real word order within our formalism; it uses an ordering function parameterized by weights that were fitted to maximize the likelihood of observed word orders for each language, as reported by Hahn et al. (2020). This variant captures most of the word order features of a real language while allowing for a fair comparison to deterministic counterfactual grammars that do not model the flexibility of real language. From the perspective of the UID hypothesis, we expect this variant to be less uniform than REAL because it has less flexibility to accommodate speakers' choices that optimize for UID.
Consistent random grammars. We include variants RANDOM$_1$ through RANDOM$_5$, which use ordering functions parameterized by randomly assigned weights. This means that for a given random grammar, each dependency relation has a fixed direction (left or right), but that the directions of these relations lack the correlations observed in natural language (Greenberg, 1963). Random grammars with the same numerical index share weights across languages.
Consistent grammars optimized for efficiency. We include two consistent grammars that are optimized for the joint objective of parseability (how much information an utterance provides about its underlying syntactic structure) and sentence-internal predictability, as reported by Hahn et al. (2020), one with OV order (EFFICIENT-OV) and one with VO order (EFFICIENT-VO). For example, the EFFICIENT-OV grammar for English would give a plausible version of a consistent and efficient grammar in the counterfactual world where English has verbs after objects.
Grammars optimized for dependency length minimization. From the same work we also take consistent grammars that are optimized for DLM, denoted as MIN-DL-OPT. While linearizations produced by these grammars are not guaranteed to minimize dependency length for any particular sentence, they minimize the expected average dependency length of a large sample of sentences in a language. In addition, we include MIN-DL-LOC, an inconsistent grammar that applies the projective dependency-length minimization algorithm of Gildea and Temperley (2007) at the sentence level, leading to sentences with minimal DL but without the constraint of consistency.
Frequency-sorted grammars. SORT-FREQ is an inconsistent grammar which orders words in a sentence from highest to lowest frequency, ignoring dependency structure altogether. We use this ordering as a heuristic baseline for which we expect UID to hold relatively strongly: low-frequency elements, which tend to have higher surprisal even if solely from their less frequent usage (Ellis, 2002), are given more context, and thus should have smaller surprisals than if they occurred early; more conditioning context tends to reduce the surprisal of the next word (Luke and Christianson, 2016). We also test SORT-FREQ-REV, ordering words from least to most frequent, which for analogous reasons we expect to perform poorly in terms of UID. However, both of these orderings lead to massive syntactic ambiguity by introducing many string collisions: any two sentences containing the same words in different orders would be linearized identically. This eliminates word order as a mechanism for expressing distinctions in meaning, so these orders are implausible as alternatives to natural languages (Mahowald et al., 2022).
Reverse grammar. Finally, we also include the REVERSE variant, where the words in each sentence appear in the reverse order of the original. This variant preserves all pairwise distances between words within sentences and has the same dependency lengths as the original order, thus isolating the effect of linear order on information density from other potential influences. Notably, if the original language happens to be perfectly consistent, then REVERSE will also satisfy consistency; in practice, this is unlikely to hold with natural languages.
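The two parse-free transformations can be illustrated with a short sketch. The function names, the three-sentence corpus, and the alphabetical tie-breaking rule are assumptions made for the example; the text does not specify how ties in frequency are resolved.

```python
from collections import Counter


def reverse_variant(words):
    """REVERSE: the same words in the opposite order."""
    return words[::-1]


def sort_freq_variant(words, freq, least_first=False):
    """SORT-FREQ (or SORT-FREQ-REV when least_first=True).

    Ties are broken alphabetically, an assumption made so that the output
    depends only on the multiset of words, not on their original order.
    """
    sign = 1 if least_first else -1
    return sorted(words, key=lambda w: (sign * freq[w], w))


corpus = [s.split() for s in ["dog bites man", "man bites dog", "the dog sleeps"]]
freq = Counter(w for sent in corpus for w in sent)

a, b = corpus[0], corpus[1]
# Reversal is invertible, so distinct sentences stay distinct.
print(reverse_variant(a), reverse_variant(b))
# Frequency sorting maps both sentences to the same string (a collision).
print(sort_freq_variant(a, freq), sort_freq_variant(b, freq))
```

The collision in the last line is the loss of expressivity noted above: once "dog bites man" and "man bites dog" share a linearization, word order can no longer signal who did what to whom.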

UID and Counterfactual Grammars
Let $p_\ell(y)$ be the probability distribution over sentences $y$ for a language of interest $\ell$. We can define a language's UID score as the expected value of its sentences' UID scores, where we overload the UID function to take either a sentence $y$ or an entire language $\ell$:
\[ \mathrm{UID}(\ell) = \mathbb{E}_{y \sim p_\ell}\big[\mathrm{UID}(y)\big] = \sum_{y \in \mathcal{Y}} p_\ell(y)\, \mathrm{UID}(y) \tag{5} \]
where sentence-level UID can be $\mathrm{UID}_v(y)$, $\mathrm{UID}_{lv}(y)$, or $\mathrm{UID}_p(y)$. In practice, we estimate this language-level UID score using a Monte Carlo estimator, taking the mean sentence-level UID score across a held-out test set $S_\ell$ of sentences $y$ in language $\ell$, where we assume $y \sim p_\ell$:
\[ \widehat{\mathrm{UID}}(\ell) = \frac{1}{|S_\ell|} \sum_{y \in S_\ell} \mathrm{UID}(y) \tag{6} \]
Similarly, the expected surprisal (or Shannon entropy, $H$) of this language is computed as:
\[ H(\ell) = \mathbb{E}_{y \sim p_\ell}\big[s(y)\big] = -\sum_{y \in \mathcal{Y}} p_\ell(y) \log p_\ell(y) \tag{7} \]
We evaluate how well a language model $p_\theta$ approximates $p_\ell$ by its cross-entropy:
\[ H(p_\ell, p_\theta) = -\sum_{y \in \mathcal{Y}} p_\ell(y) \log p_\theta(y) \tag{8} \]
where a smaller value of $H$ implies a better model. Again using a Monte Carlo estimator, we measure cross-entropy using the held-out test set $S_\ell$:
\[ \widehat{H}(p_\ell, p_\theta) = -\frac{1}{|S_\ell|} \sum_{y \in S_\ell} \log p_\theta(y) \tag{9} \]
This is simply the mean surprisal that the model assigns to a corpus of naturalistic data.
These computations can also be applied to counterfactual variants of a language. Let $\ell_f$ stand for a language identical to $\ell$, but where its strings have been transformed by $f$; this language's distribution over sentences would be $p_{\ell_f}(y) = \sum_{y' \in \mathcal{Y}} p_\ell(y')\, \mathbb{1}\{y = f(y')\}$. Since entropy is non-increasing over function transformations (by Jensen's inequality), it follows that:
\[ H(\ell) \geq H(\ell_f) \tag{10} \]
Further, if our counterfactual generation function $f$ is a bijection (meaning that each input string gets mapped to a distinct output string and each output string has an input that maps to it), then we can create a second function $f^{-1} : \mathcal{Y} \to \mathcal{Y}$, which would generate $\ell$ from $\ell_f$. Then, the following holds:
\[ H(\ell) \geq H(\ell_f) \geq H(\ell) \tag{11} \]
i.e., it must be that $H(\ell) = H(\ell_f)$. Reversing a sentence is an example of a bijective function, and thus Equation (11) holds necessarily for the pair of REAL and REVERSE variants; the counterfactual generation procedure thus should not produce differences in mean surprisal between these variants. At the same time, bijectivity does not necessarily hold for our other counterfactual transformations and is violated to a large degree when mapping to SORT-FREQ and SORT-FREQ-REV. Thus in general, we can only guarantee Inequality (10).
Crucially, however, the transformation $f$ might change the UID score of such a language, allowing us to evaluate the impact of word order on information uniformity. As a simple example, consider the language $\ell_1$ that places a uniform distribution over only four strings: aw, ax, by, and bz. In this language, the first and second symbols always have 1 bit of surprisal, and the end of the string has 0 bits of surprisal. If the counterfactual language $\ell_2$ is the reverse of $\ell_1$, we have a uniform distribution over the strings wa, xa, yb, and zb. Here, the first symbol always has 2 bits of surprisal, and the second symbol and end of sentence always have zero bits, as their values are deterministic for a given initial symbol. While the mean surprisal per symbol is the same for $\ell_1$ and $\ell_2$, $\ell_1$ has more uniform information density than $\ell_2$.
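The four-string example can be checked numerically. The sketch below computes exact per-symbol surprisals (in bits, with an end-of-sequence symbol included, as in the example) directly from a distribution over strings; the helper names and the `"<eos>"` marker are assumptions introduced for illustration.

```python
import math
from collections import defaultdict

EOS = "<eos>"


def symbol_surprisals(lang):
    """Map each string to its per-symbol surprisals in bits, including EOS.

    `lang` maps strings to probabilities. Conditional probabilities are
    ratios of prefix masses: p(y_n | y_<n) = mass(y_<=n) / mass(y_<n).
    """
    mass = defaultdict(float)
    for s, p in lang.items():
        symbols = tuple(s) + (EOS,)
        for i in range(len(symbols) + 1):
            mass[symbols[:i]] += p
    out = {}
    for s in lang:
        symbols = tuple(s) + (EOS,)
        out[s] = [math.log2(mass[symbols[:i]] / mass[symbols[:i + 1]])
                  for i in range(len(symbols))]
    return out


def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    mu = mean(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)


def expected(lang, stat):
    """Expectation of a sentence-level statistic under the language."""
    surps = symbol_surprisals(lang)
    return sum(p * stat(surps[s]) for s, p in lang.items())


l1 = {s: 0.25 for s in ["aw", "ax", "by", "bz"]}
l2 = {s[::-1]: p for s, p in l1.items()}          # the reversed language

print(symbol_surprisals(l1)["aw"])   # [1.0, 1.0, 0.0]
print(symbol_surprisals(l2)["wa"])   # [2.0, 0.0, 0.0]
print(expected(l1, mean), expected(l2, mean))          # equal mean surprisal
print(expected(l1, variance), expected(l2, variance))  # about 0.22 vs 0.89
```

Both languages carry 2 bits per string, so their mean surprisal per symbol matches, while the surprisal variance of the reversed language is four times larger.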

Use of Counterfactual Grammars
Real word orders are not consistent. The consistent grammars borrowed from Hahn et al. (2020) assume that the direction of each syntactic relation, as well as the relative ordering of dependents on the same side of a head, are fixed. This is not generally true of natural languages. We address this difference by including the variant APPROX as a comparison to the counterfactual variants, which are constrained by consistency, and by including REVERSE as a comparison to REAL, both of which are not constrained by consistency.
Automatic parsing errors. Another issue is that the dependency parses extracted for each original sentence as part of the counterfactual generation pipeline may contain parsing errors. These errors may introduce noise into the counterfactual datasets that is not present in the original sentences, and may cause deviations from the characteristics that we assume our counterfactual grammars should induce. For example, MIN-DL-LOC only produces sentences with minimized dependency length if the automatic parse is correct.
Deterministic parsing. Finally, our counterfactual generation procedure assumes a deterministic mapping from sentences to dependency trees as one of its steps. However, multiple valid parses of sentences are possible in the presence of syntactic ambiguity. In such cases, we always select the most likely structure according to the parser, which learns these probabilities based on its training data. Therefore, this design choice could lead to underrepresentation of certain syntactic structures when applying a transformation. However, we note that the variants REAL, REVERSE, SORT-FREQ, and SORT-FREQ-REV do not depend on dependency parses and so are unaffected by this design choice.

Choice of Dataset
Properties of language can vary across genres and domains. When drawing conclusions about human language in general, no single dataset will be completely representative. Due to the amount of data required to train LMs, we use written corpora in this work, and use the term speaker loosely to refer to any language producer regardless of modality. To address potential concerns about the choice of dataset in this study, we conducted a supplementary analysis on a subset of languages using a different web corpus, which we report in §7.5.

Errors and Inductive Biases
Model Errors. Language model quality could impact the estimated values of our UID metrics $\mathrm{UID}_v$, $\mathrm{UID}_p$, and $\mathrm{UID}_{lv}$. To see why, consider a model $p_\theta$ that, rather than providing unbiased estimates of $p_\ell$, is a smoothed interpolation between $p_\ell$ and the uniform distribution:
\[ p_\theta(y_n \mid y_{<n}) = \lambda\, p_\ell(y_n \mid y_{<n}) + (1 - \lambda)\, \frac{1}{|\overline{V}|} \]
where $|\overline{V}|$ is the size of the vocabulary including EOS. Here, an increase in $1 - \lambda$ would lead to an increase in $H(p_\ell, p_\theta)$, since the cross-entropy is only minimized when $p_\theta(\cdot \mid y_{<n}) = p_\ell(\cdot \mid y_{<n})$. This change, however, would be reflected as an increase in uniformity, e.g., a decrease in $\mathrm{UID}_v$: surprisals would be closer to uniform for smaller values of $\lambda$. Alternatively, consider the situation where a language has perfect information uniformity, i.e., where $\mathrm{UID}_v$, $\mathrm{UID}_p$, and $\mathrm{UID}_{lv}$ are their minimum possible values. The interpolation of $p_\ell$ with any non-uniform distribution should instead decrease the measured uniformity, at least with respect to $\mathrm{UID}_v$ and $\mathrm{UID}_{lv}$. In summary, our UID metrics could be biased either positively or negatively by the quality of our models. However, since our analysis focuses on the comparison of UID metrics between word order variants rather than their absolute value, this bias should not be a major concern. We use the same model architecture for all language-variant combinations, and so a bias in the UID metric corresponding to one combination should likewise be reflected in all of the metrics that it is compared to. Further, our results hold even when controlling for mean surprisal, as described in §6.
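The smoothing argument can be made concrete with a toy calculation. In the sketch below, the four "true" conditional probabilities of the observed words, the vocabulary size of 4, and the function names are invented purely for illustration; the point is only that mixing with the uniform distribution pulls surprisals together.

```python
import math


def interpolate(p_true, lam, vocab_size):
    """Mix a true conditional probability with the uniform distribution."""
    return lam * p_true + (1.0 - lam) / vocab_size


def surprisal_variance(probs):
    """Variance of the surprisals (in bits) of the observed words."""
    surprisals = [-math.log2(p) for p in probs]
    mu = sum(surprisals) / len(surprisals)
    return sum((s - mu) ** 2 for s in surprisals) / len(surprisals)


# Hypothetical true probabilities of the four words actually observed,
# each in its own context, over a toy vocabulary of 4 symbols.
true_probs = [0.9, 0.05, 0.5, 0.01]

for lam in (1.0, 0.5, 0.0):
    smoothed = [interpolate(p, lam, vocab_size=4) for p in true_probs]
    print(lam, round(surprisal_variance(smoothed), 3))
```

As the weight on the true distribution shrinks from 1 to 0, the measured variance in this example falls from roughly 6.8 to 0, even though the model is getting worse, which is why the comparison across variants relies on a shared architecture rather than on absolute metric values.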
Inductive Biases. Because modern LMs have been developed to model natural language, they may contain subtle biases towards the properties of real word orders or of highly resourced languages. Based on Inequality (10), if two probabilistic models $m$ and $m_f$ were to perfectly learn the true and counterfactual distributions $p_\ell$ and $p_{\ell_f}$, respectively, then $m$ should assign approximately the same or higher mean surprisal to a corpus $\{y^{(m)}\}_{m=1}^{M}$ from $\ell$ than $m_f$ assigns to the counterfactual corpus from $\ell_f$. This implies that previous results of Gildea and Jaeger (2015), Ravfogel et al. (2019), Hahn et al. (2020) and White and Cotterell (2021), which found that real corpora tend to have lower average per-word surprisal than deterministically generated counterfactual versions of the same corpora, were in fact due to the inductive bias of the learning algorithms used to estimate surprisals. There is a clear reason why the trigram model of Gildea and Jaeger (2015) would yield higher mean surprisals for counterfactual corpora: the transformation functions $f$ tended to increase dependency lengths, and words in a dependent-head relation tend to have higher mutual information than other pairs of words (Futrell and Levy, 2017; Futrell et al., 2019, 2020). Hence the transformations tended to push words that are predictive of each other outside of the conditioning window of the model (see also Hahn and Xu, 2022, for similar effects). The Transformer architecture we use in this work could thus also contain biases favoring features of real language, which we attempt to control for (see §6).

Data
This work uses the publicly available Wiki40b dataset (?), a large text corpus derived from Wikipedia articles. We use subsets of the Wiki40b dataset in 10 languages: English, Russian, French, German, Hindi, Farsi, Vietnamese, Indonesian, Hungarian, and Turkish. The first six represent the Germanic, Slavic, Romance, Indo-Aryan, and Iranian sub-families of the Indo-European language family. The latter four belong to the Austroasiatic, Austronesian, Uralic, and Turkic language families, respectively. Turkish, Hindi, and Farsi have basic SOV word order, while the other languages have SVO order with Hungarian being mixed (Dryer, 2013). Languages were chosen based on the amount of available data in the Wiki40b dataset, their typological properties (covering a range of families, canonical word orders, and morphological complexity), and availability of automatic dependency parsing models.
The datasets are subsampled to yield approximately 20M words in the training set of each language and approximately 1M words in the test and validation sets. We automatically generate dependency parses for all sentences using the UDPipe parser (Straka and Straková, 2017), yielding syntactic representations in the UD paradigm. We then apply each of the counterfactual orderings introduced in §3.2 to the original data to create parallel corpora for each language. Sentences are stripped of punctuation (as determined by the dependency parser's PUNCT label) and are lowercased. Periods are added back in to mark the end of sentences, regardless of what the original final punctuation was. Sub-word tokenization is then applied to the corpora using a byte-pair encoding (BPE) model, trained with a fixed vocabulary size of 30K tokens and using the algorithm of Sennrich et al. (2016).
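The sentence clean-up step can be sketched as below. This is a simplified illustration, not the project's pipeline: it assumes each sentence arrives as a list of `(form, upos)` pairs taken from the parser output, the function name is hypothetical, and BPE tokenization is left out.

```python
def clean_sentence(tokens):
    """Drop PUNCT tokens, lowercase the rest, and append a final period.

    `tokens` is a list of (form, upos) pairs, where `upos` is the
    Universal Dependencies part-of-speech tag assigned by the parser.
    """
    words = [form.lower() for form, upos in tokens if upos != "PUNCT"]
    return words + ["."]


parsed = [("Hello", "INTJ"), (",", "PUNCT"), ("World", "PROPN"), ("!", "PUNCT")]
print(clean_sentence(parsed))   # ['hello', 'world', '.']
```

Normalizing the sentence-final mark to a period keeps an explicit end-of-sentence signal in every variant, whatever the original punctuation was.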

Language Modeling
For each variant of each language, we train a Transformer language model (Vaswani et al., 2017) using fairseq (Ott et al., 2019). Models are trained on document-level inputs, with a maximum length of 512 tokens; this means that each token is predicted with the preceding material of the entire document as context. Each model is trained with early stopping, halting training after no improvement in validation loss for three epochs. The Adam optimizer was used (Kingma and Ba, 2017), with a learning rate of 0.0005, weight decay of 0.01, and dropout of 0.1. Training scripts are available in the project's GitHub repository. In all of our analyses, we use the word-by-word surprisals estimated using our trained models on their corresponding held-out test sets. Note that we do not consider the designated EOS symbol in the computation of any of our UID-related metrics. In the case that a word is comprised of multiple sub-word tokens, we aggregate their surprisals by summation, since surprisal decomposes additively.
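Aggregating sub-word surprisals into word surprisals can be sketched as follows. The `@@` continuation marker is an assumption (it is the convention of some BPE tools, and the text does not say which marker is used), and the function name and example values are hypothetical.

```python
def word_surprisals(tokens, surprisals, cont="@@"):
    """Sum sub-word surprisals into per-word surprisals.

    A token ending in `cont` is assumed to continue into the next token.
    Summation is valid because surprisal decomposes additively.
    """
    words, totals = [], []
    current, total = "", 0.0
    for tok, s in zip(tokens, surprisals):
        total += s
        if tok.endswith(cont):
            current += tok[:-len(cont)]      # word continues
        else:
            words.append(current + tok)      # word is complete
            totals.append(total)
            current, total = "", 0.0
    if current:                              # flush a trailing unfinished word
        words.append(current)
        totals.append(total)
    return words, totals


toks = ["un@@", "believ@@", "able", "news", "."]
surp = [3.0, 1.5, 0.5, 4.0, 1.0]
print(word_surprisals(toks, surp))
# (['unbelievable', 'news', '.'], [5.0, 4.0, 1.0])
```

Any EOS entry would be dropped before this step, since it is excluded from the UID metrics.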

Results
Estimates of mean per-word surprisal on the test set are in Fig. 4A. Consistent with the results of Hahn et al. (2020), our trained models for nearly all counterfactual variants assign higher per-word surprisal to their respective test sets than the REAL models assign to theirs. Across all 10 languages, REVERSE has mean surprisal close to, but consistently slightly higher than, that of the real ordering. SORT-FREQ and SORT-FREQ-REV have mean surprisals close to or below those of REAL.
Estimates of mean surprisal variance ($\mathrm{UID}_v$) over sentences are shown in Fig. 4B. Notably, there is a dissociation between the rank order of variants according to mean surprisal and according to $\mathrm{UID}_v$: variants with similar mean surprisals did not necessarily have similar $\mathrm{UID}_v$ scores, and vice versa, suggesting that information uniformity and mean surprisal can vary independently of each other. Our main observations are as follows: (i) In all languages except Turkish and Hindi, our estimates of $\mathrm{UID}_v$ for REAL are lower than those for REVERSE, despite the variants' similarities in mean surprisal. (ii) As predicted, the SORT-FREQ baseline has $\mathrm{UID}_v$ equal to or lower than that of REAL. (iii) The other counterfactual variants typically exhibit higher $\mathrm{UID}_v$ than REAL, with the exception of mixed results for SORT-FREQ-REV. (iv) The EFFICIENT-VO variants typically have lower $\mathrm{UID}_v$ than EFFICIENT-OV (with Hungarian being a noteworthy exception), which supports findings based on toy grammars showing that SVO orders are more uniform than SOV orders (Maurits et al., 2010). Crucially, these results are qualitatively similar using the $\mathrm{UID}_{lv}$ metric (Fig. 6B).
To fairly compare variants using the $\mathrm{UID}_p$ metric, we first need to account for the fact that, unlike surprisal variance, the metric is sensitive to shifts in mean surprisal. To control for this, we fit a regression model predicting the $\mathrm{UID}_p$ score based on three variables: the mean surprisal, the grammar variant, and the dataset size (20M, 6.6M, and 3.3M words). We train multiple language models for each language-variant combination (3 dataset sizes and 2 random seeds), resulting in 84 data points per language. We apply treatment coding to the variants, with REAL as the reference level. Fig. 5 shows the resulting estimates of the coefficients for each variant, where a coefficient should be positive if that variant is less uniform than REAL. Qualitatively, the regression results match the results given by $\mathrm{UID}_v$ and $\mathrm{UID}_{lv}$: REAL is more uniform than REVERSE in SVO languages, SORT-FREQ is the only counterfactual variant that is consistently more uniform than REAL, and EFFICIENT-VO is more uniform than EFFICIENT-OV in most languages; the opposite is true in Hungarian and the difference is negligible in Russian.

Discussion
We offer a discussion of the results observed in §6, including their implications for the role of functional pressures in language.

Differences in mean surprisal
Across 10 typologically diverse languages, we find that Transformer LMs learn to predict data from real word orders better than data from counterfactual orders, with the exception of the SORT-FREQ and SORT-FREQ-REV variants. This suggests that these LMs' inductive biases somehow favor properties of real languages, in line with previous work on other modeling architectures (Gildea and Jaeger, 2015; Ravfogel et al., 2019). This is not surprising, given that commonly used architectures and hyperparameters have been selected specifically based on their good performance on real language tasks. Unlike in n-gram models, the precise inductive bias of Transformer models that favors real word orders is not transparent and merits further study.8

Differences between REAL and APPROX
We observe that despite the similarities between the REAL and APPROX variants of a given language, the latter are consistently assigned higher mean surprisal by their respective LMs. Meanwhile, the various UID metrics show similar results for REAL and APPROX, suggesting that the greater flexibility of REAL is not responsible for UID differences in our results. This is somewhat surprising, since it may appear that such flexibility is what enables speakers' choices, which have been previously discussed as contributing to UID. However, many speaker choices that potentially impact UID, such as word choice, active versus passive voice, and optional words, are not captured by this difference in flexibility between REAL and APPROX.
8 Notably, White and Cotterell (2021) show that there is a large variation in how Transformer language models perform in toy languages with diverse word orders; they, however, do not find evidence that Transformers perform better on the most frequently occurring orders (as opposed to, e.g., OVS and VOS word orders, which are found in few languages).

Greater uniformity of REAL over REVERSE in SVO languages
While mean surprisal is always very close for REAL and REVERSE grammars, REVERSE is less uniform in 8 out of 10 languages, including all SVO languages. This held across multiple operationalizations of UID, with the exception of mixed results for Hungarian, a language with considerable flexibility in word order. Thus, while both REAL and REVERSE orders are learned approximately equally well by language models, they differ in how uniformly they distribute information.
One key difference between REAL and REVERSE is that insofar as REAL sentences exhibit a tendency to mention entities from the end of a given sentence close to the beginning of the next one, REVERSE does not preserve this property. For example, the pair of sentences "I like dogs. They are friendly." would become "Dogs like I. Friendly are they."; note that the distance between antecedent and pronoun is significantly increased. This feature of REVERSE raises the possibility that the uniformity patterns we observe are due to speaker choices taking cross-sentence dependencies into consideration. To minimize the influence of cross-sentence dependencies, we can consider only sentences occurring at the start of a document, which cannot refer to previous sentences. Fig. 6A shows that the tendency for REAL to have lower surprisal variance than REVERSE still holds in this setting across most languages. This suggests that cross-sentence dependencies alone cannot fully explain the observed differences in information uniformity.
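The effect of reversal on cross-sentence distances in the example above can be checked with a small sketch. It is a simplification under stated assumptions: sentences are pre-tokenized lists, punctuation and capitalization are ignored, and the helper names are hypothetical.

```python
def reverse_variant(sentences):
    """Reverse word order within each sentence; sentence order is unchanged."""
    return [list(reversed(sentence)) for sentence in sentences]


def token_distance(sentences, antecedent, pronoun):
    """Distance in tokens between two words in the concatenated document."""
    tokens = [tok for sentence in sentences for tok in sentence]
    return abs(tokens.index(pronoun) - tokens.index(antecedent))


real = [["I", "like", "dogs"], ["They", "are", "friendly"]]
reverse = reverse_variant(real)

print(reverse)
print(token_distance(real, "dogs", "They"))     # adjacent in REAL
print(token_distance(reverse, "dogs", "They"))  # far apart in REVERSE
```

In REAL the pronoun immediately follows its antecedent, whereas in the reversed document the two end up at opposite ends of the two-sentence span.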
Notably, our results show that the UID preference for REAL over REVERSE is not consistently present in languages with basic SOV order (Turkish, Hindi, and Farsi). We propose the following explanation for this result: As argued in Maurits et al. (2010), SVO languages tend to have more uniform information density profiles than SOV languages, a finding supported by our empirical results in which EFFICIENT-VO had lower surprisal variance than EFFICIENT-OV in 9 out of 10 languages. Unlike the short, simple sentences of Maurits et al., however, the present study considers long and complex sentences where speaker choices have considerable opportunity to influence information uniformity, in addition to the role of basic word order. These choices include whether to use a pronoun, whether to use an active or passive construction, and in what order to present a conjunction or list of items, among others. Importantly, speakers make choices conditional on the forward ordering of real language, so we expect that the choices made in an attempt to increase UID, which constitute a non-trivial percentage of utterances (Levy and Jaeger, 2007), would have a greater effect on UID in REAL than in REVERSE. In SVO languages, the effects upon UID of basic word order and speaker choices both go in the same direction: towards more uniformity. In SOV languages, these effects conflict: the basic word order is non-optimal in terms of UID, and so uniformity can theoretically be increased by a transformation to REVERSE, while speaker choices are presumably already mostly optimal in REAL. This may explain the heterogeneous patterning among the three SOV languages.
Furthermore, these results can potentially shed light on an important question in linguistic typology: Why are some basic word orders more common than others? According to some theories, SOV order (the most typologically common) is the most natural for expressing events with subjects and objects (?; Gibson et al., 2013; Futrell et al., 2015a). If these theories are correct, an evolutionary pressure on languages to shift from SOV to SVO could help account for the prevalence of SVO languages, which are nearly as common as SOV ones. A pressure for information uniformity offers one such account.
Finally, Pimentel et al. (2021a) have recently shown that the distribution of per-phone information within words is more uniform when analysed in reverse order than in forward order, which is the opposite of what we observe in our sentence-level analysis. This difference may suggest that qualitatively distinct information-theoretic pressures are present at the lexical and sentential levels, and it is a potential topic for further study.

Other Variants
The variants designed to minimize dependency length, MIN-DL-LOC and MIN-DL-OPT, showed mixed results in terms of information uniformity compared to REAL. The random grammars fell into two groups: RANDOM_1, RANDOM_2, and RANDOM_4 tended to be less uniform than REAL, while RANDOM_3 and RANDOM_5 tended to be similar in uniformity to REAL. Since random grammars have fixed but uncorrelated directions of syntactic relations, these cross-linguistically consistent patterns suggest that some settings of the parameterized grammar are inherently more favorable from the perspective of UID than others.
The only counterfactual word order to consistently have a higher degree of information uniformity than the real orders was the highly constrained SORT-FREQ, which turns sentences into sorted word lists.Thus, while it appears possible to improve on real word orders' information uniformity, this comes at the cost of massive syntactic ambiguity and reduced expressivity.
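The ambiguity cost of a sorted-word-list order can be made concrete with a small sketch. It assumes that SORT-FREQ orders the tokens of each sentence from most to least frequent in the corpus and breaks ties alphabetically; both the tie-breaking rule and the toy corpus are hypothetical.

```python
from collections import Counter


def sort_freq(sentence, counts):
    """Order tokens from most to least frequent (ties broken alphabetically)."""
    return sorted(sentence, key=lambda tok: (-counts[tok], tok))


corpus = [
    ["dogs", "chase", "cats"],
    ["cats", "chase", "dogs"],
    ["dogs", "sleep"],
]
counts = Counter(tok for sentence in corpus for tok in sentence)

# Two sentences with opposite meanings collapse to the same word list.
print(sort_freq(corpus[0], counts))
print(sort_freq(corpus[1], counts))
```

Because the output depends only on which words occur, not on their original arrangement, any two sentences that share a bag of words become indistinguishable, which is the loss of expressivity noted above.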

Robustness to Dataset Choice
In this study, the chosen dataset (Wiki40b) contains formal writing that may not exhibit the same communicative pressures as spoken language. It is largely devoid of first and second person pronouns, interrogatives, and other features common in everyday speech; further, it may have disproportionate amounts of translationese (Koppel and Ordan, 2011). As a supplementary analysis, we repeated the experiments on the CC100 dataset (?), using only a subset of languages due to computational constraints. This dataset is sourced from a web crawl and therefore contains a wider range of genres and styles than Wiki40b. UID_v scores for these experiments are shown in Fig. 7. The results qualitatively match the patterns from the Wiki40b experiments in the following ways: (i) better UID_v scores for REAL than for REVERSE among SVO languages, (ii) better UID_v scores for EFFICIENT-VO than EFFICIENT-OV in most languages (with Hungarian again being an exception), and (iii) the only variant that has higher uniformity than REAL across a majority of languages is SORT-FREQ.

Conclusion
In conclusion, we have empirically demonstrated that in many languages, real word orders distribute information more uniformly than a range of counterfactual orders. The fact that this pattern holds in every SVO language but is mixed among SOV languages lends support to the view that SVO basic word order is preferable to SOV order from the perspective of maximizing UID. We posit that there are two potential sources of optimization within a language for greater UID: language evolution favoring word orders that produce less variance in information content, and speaker choices in favor of constructions that smooth the information profile of utterances. Our results are consistent with the UID hypothesis, and support the idea that communicative pressures (operationalized in terms of information theory) influence the structure of human language.

Figure 1: An example dependency tree showing syntactic relationships according to UD, transformed so that function words are heads (§3.2). Arrows point from heads to dependents.
are parameterized by a set of scalar weights corresponding to each possible syntactic relation; the ordering function thus reduces to sorting each head's dependents based on their weight values.Notably, Hahn et al. (

Figure 2: Pseudo-code to linearize a dependency tree according to a grammar's ordering function g. In this code, each node contains a word and its syntactic dependents.
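Since the pseudo-code itself is not reproduced here, the following is a minimal runnable sketch of what such a linearization could look like. It is an illustration under stated assumptions rather than the procedure used in our experiments: the Node class, the signature of g, and the weight-based ordering function (in which the head is given weight 0 and each dependent is sorted by a per-relation scalar weight) are all hypothetical simplifications.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    word: str
    relation: str = "root"  # syntactic relation to this node's head
    dependents: list = field(default_factory=list)


def linearize(node, g):
    """Return the words under `node` in the order chosen by `g`."""
    words = []
    # g orders the head together with its direct dependents.
    for item in g(node, [node] + node.dependents):
        if item is node:
            words.append(node.word)
        else:
            words.extend(linearize(item, g))  # recurse into the subtree
    return words


def weight_order(weights):
    """Hypothetical g: sort by per-relation weight, with the head at 0."""
    def g(head, items):
        return sorted(
            items,
            key=lambda n: 0.0 if n is head else weights[n.relation],
        )
    return g


tree = Node("like", dependents=[Node("I", "nsubj"), Node("dogs", "obj")])
svo = weight_order({"nsubj": -1.0, "obj": 1.0})
sov = weight_order({"nsubj": -2.0, "obj": -1.0})

print(linearize(tree, svo))
print(linearize(tree, sov))
```

Changing only the weights moves the same tree between verb-medial and verb-final orders, which is the sense in which a grammar variant here is a setting of the ordering function.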

Figure 3: The same source sentence according to 4 real and counterfactual orderings.

Figure 4: Mean test-set surprisal and surprisal variance of language models across real and counterfactual grammars in 10 languages. Error bars denote the 95% CI of the mean.

Figure 5: Linear regression coefficient estimates when predicting UID_p as a function of mean surprisal, variant, and dataset size. The reference level for variant is REAL, so positive coefficients (blue) indicate variants with greater UID_p, i.e., less uniformity, than real language.

Figure 7: Surprisal mean and variance for a subset of languages on the CC100 dataset. Error bars denote 95% CI.