On the Difficulty of Translating Free-Order Case-Marking Languages

Identifying factors that make certain languages harder to model than others is essential to reach language equality in future Natural Language Processing technologies. Free-order case-marking languages, such as Russian, Latin or Tamil, have proved more challenging than fixed-order languages for the tasks of syntactic parsing and subject-verb agreement prediction. In this work, we investigate whether this class of languages is also more difficult to translate by state-of-the-art Neural Machine Translation (NMT) models. Using a variety of synthetic languages and a newly introduced translation challenge set, we find that word order flexibility in the source language only leads to a very small loss of NMT quality, even though the core verb arguments become impossible to disambiguate in sentences without semantic cues. The latter issue is indeed solved by the addition of case marking. However, in medium- and low-resource settings, the overall NMT quality of fixed-order languages remains unmatched.


Introduction
Despite the tremendous advances achieved in less than a decade, Natural Language Processing remains a field where language equality is far from being reached (Joshi et al., 2020). In the field of Machine Translation, modern neural models have attained remarkable quality for high-resource language pairs like German-English, Chinese-English or English-Czech, with a number of studies claiming even human parity (Hassan et al., 2018; Bojar et al., 2018; Barrault et al., 2019; Popel et al., 2020). These results may lead to the unfounded belief that NMT methods will perform equally well in any language pair, provided similar amounts of training data. In fact, several studies suggest the opposite (Platanios et al., 2018; Ataman and Federico, 2018; Bugliarello et al., 2020).
Why, then, do some language pairs have lower translation accuracy? And, more specifically: Are certain typological profiles more challenging for current state-of-the-art NMT models? Every language has its own combination of typological properties, including word order, morphosyntactic features and more (Dryer and Haspelmath, 2013). Identifying language properties (or combinations thereof) that pose major problems to the current modeling paradigms is essential to reach language equality in future MT (and other NLP) technologies (Joshi et al., 2020), in a way that is orthogonal to data collection efforts. Among others, natural languages adopt different mechanisms to disambiguate the role of their constituents: Flexible order typically correlates with the presence of case marking and, vice versa, fixed order is observed in languages with little or no case marking (Comrie, 1981; Sinnemäki, 2008; Futrell et al., 2015b). Morphologically rich languages in general are known to be challenging for MT at least since the times of phrase-based statistical MT (Birch et al., 2008) due to their larger and sparser vocabularies, and remain challenging even for modern neural architectures (Ataman and Federico, 2018; Belinkov et al., 2017). By contrast, the relation between word order flexibility and MT quality has not been directly studied to our knowledge.
In this paper, we study this relationship using strictly controlled experimental setups. Specifically, we ask:
• Are current state-of-the-art NMT systems biased towards fixed-order languages?
• To what extent does case marking compensate for the lack of a fixed order in the source language?
Unfortunately, parallel data is scarce in most of the world's languages (Guzmán et al., 2019), and corpora in different languages are drawn from different domains. Exceptions exist, like the widely used Europarl (Koehn, 2005), but they represent a small fraction of the large variety of typological feature combinations attested in the world. This makes it very difficult to run a large-scale comparative study and isolate the factors of interest from, e.g., domain mismatch effects. As a solution, we propose to evaluate NMT on synthetic languages (Gulordava and Merlo, 2016; Wang and Eisner, 2016; Ravfogel et al., 2019) that differ from each other only by specific properties, namely: the order of main constituents, or the presence and nature of case markers (see example in Table 1).
We use this approach to isolate the impact of various source-language typological features on MT quality and to remove the typical confounders of corpus size and domain. Using a variety of synthetic languages and a newly introduced challenge set, we find that state-of-the-art NMT has little to no bias towards fixed-order languages, but only when a sizeable training set is available.

Free-order Case-marking Languages
The word order profile of a language is usually represented by the canonical order of its main constituents, (S)ubject, (O)bject, (V)erb. For instance, English and French are SVO languages, while Turkish and Hindi are SOV. Other, less commonly attested, word orders are VSO and VOS, while OSV and OVS are extremely rare (Dryer, 2013). While many other word order features exist (e.g., noun/adjective), they often correlate with the order of main constituents (Greenberg, 1963).
A different, but likewise important dimension is that of word order freedom (or flexibility). Languages that primarily rely on the position of a word to encode grammatical roles typically display rigid orders (like English or Mandarin Chinese), while languages that rely on case marking can be more flexible, allowing word order to express discourse-related factors like topicalization. Examples of highly flexible-order languages include languages as diverse as Russian, Hungarian, Latin, Tamil and Turkish.1 In the field of psycholinguistics, due to the historical influence of English-centered studies, word order has long been considered the primary and most natural device through which children learn to infer syntactic relationships in their language (Slobin, 1966). However, cross-linguistic studies have later revealed that children are equally prepared to acquire both fixed-order and inflectional languages (Slobin and Bever, 1982).
1 See Futrell et al. (2015b) for detailed figures of word order freedom (measured by the entropy of subject and object dependency relation order) in a diverse sample of 34 languages.
Coming to computational linguistics, data-driven MT and other NLP approaches were also historically developed around languages with remarkably fixed order and very simple to moderately simple morphological systems, like English or French. Luckily, our community has been giving increasing attention to more and more languages with diverse typologies, especially in the last decade. So far, previous work has found that free-order languages are more challenging for parsing (Gulordava and Merlo, 2015, 2016) and subject-verb agreement prediction (Ravfogel et al., 2019) than their fixed-order counterparts. This raises the question of whether word order flexibility also negatively affects MT quality.
Before the advent of modern NMT, Birch et al. (2008) used the Europarl corpus to study how various language properties affected the quality of phrase-based Statistical MT. Amount of reordering, target morphological complexity, and historical relatedness of source and target languages were identified as strong predictors of MT quality. Recent work by Bugliarello et al. (2020), however, has failed to show a correlation between NMT difficulty (measured by a novel information-theoretic metric) and several linguistic properties of source and target language, including Morphological Counting Complexity (Sagot, 2013) and Average Dependency Length (Futrell et al., 2015a). While that work specifically aimed at ensuring cross-linguistic comparability, the sample on which the linguistic properties could be computed (Europarl) was rather small and not very typologically diverse, leaving our research questions open to further investigation. In this paper, we therefore opt for a different methodology: namely, synthetic languages.

Methodology
Synthetic languages This paper presents two sets of experiments: In the first (§4), we create parallel corpora using very simple and predictable artificial grammars and small vocabularies (Lupyan and Christiansen, 2002). See example in Table 1. By varying the position of subject/verb/object and introducing case markers to the source language, we study the biases of two NMT architectures in optimal training data conditions and a fully controlled setup, i.e. without any other linguistic cues that may disambiguate constituent roles. In the second set of experiments (§5), we move to a more realistic setup using synthetic versions of the English language that differ from it in only one or few selected typological features (Ravfogel et al., 2019). For instance, the original sentence's order (SVO) is transformed to different orders, like SOV or VSO, based on its syntactic parse tree.
In both cases, typological variations are introduced in the source side of the parallel corpora, while the target language remains fixed. In this way, we avoid the issue of non-comparable BLEU scores across different target languages. Lastly, we make the simplifying assumption that, when verb-argument order varies from the canonical order in a flexible-order language, it does so in a totally arbitrary way. While this is rarely true in practice, as word order may be predictable given pragmatics or other factors, we focus here on "the extent to which word order is conditioned on the syntactic and compositional semantic properties of an utterance" (Futrell et al., 2015b).
Translation models We consider two widely used NMT architectures that crucially differ in their encoding of positional information: (i) Recurrent sequence-to-sequence BiLSTM with attention (Bahdanau et al., 2015; Luong et al., 2015) processes the input symbols sequentially and has each hidden state directly conditioned on that of the previous (or following, for the backward LSTM) timestep (Elman, 1990; Hochreiter and Schmidhuber, 1997). (ii) The non-recurrent, fully attention-based Transformer (Vaswani et al., 2017) processes all input symbols in parallel, relying on dedicated embeddings to encode each input's position. The Transformer has nowadays surpassed recurrent encoder-decoder models in terms of generic MT quality. Moreover, Choshen and Abend (2019) have recently shown that Transformer-based NMT models are indifferent to the absolute order of source words, at least when equipped with learned positional embeddings. On the other hand, the lack of recurrence in Transformers has been linked to a limited ability to capture hierarchical structure (Tran et al., 2018; Hahn, 2020). To our knowledge, no previous work has studied the biases of either architecture towards fixed-order languages in a systematic manner.

Toy Parallel Grammar
We start by evaluating our models on a pair of toy languages inspired by the English-Dutch pair and created using a Synchronous Context-Free Grammar (Chiang and Knight, 2006). Each sentence consists of a simple clause with a transitive verb, subject and object. Both arguments are singular and optionally modified by an adjective. The source vocabulary contains 6 nouns, 6 verbs, 6 adjectives, and the complete corpus contains 10k generated sentence pairs. Working with such a small, finite grammar allows us to simulate an otherwise impossible situation where the NMT model can be trained on (almost) the totality of a language's utterances, canceling out data sparsity effects.
Source Language Variants We consider three source language variants, illustrated in Table 1:
• fixed-order VSO;
• fixed-order VOS;
• mixed-order (randomly chosen between VSO or VOS) with nominal case marking.
We choose these word orders so that, in the flexible-order corpus, the only way to disambiguate argument roles is case marking, realized by simple unambiguous suffixes (#S and #O). The target language is always fixed SVO. The same random split (80/10/10% training/validation/test) is applied to the three corpora.
NMT Setup As the recurrent model, we trained a 2-layer BiLSTM with attention (Luong et al., 2015) and a hidden layer size of 500. As Transformer models, we trained one using the standard 6-layer configuration (Vaswani et al., 2017) and a smaller one with only 2 layers, given the simplicity of the languages. All models are trained at the word level using the complete vocabulary. More hyper-parameters are provided in Appendix A.1. Note that our goal is not to compare LSTM and Transformer accuracy to each other, but rather to observe the different trends across fixed- and flexible-order language variants. Given the small vocabulary, we use sentence-level accuracy instead of BLEU for evaluation.
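The three source variants described above can be sketched in a few lines of Python. This is only an illustration, not the Synchronous Context-Free Grammar used in the experiments: the placeholder lexicon (`n0`, `v0`, `a0`, ...), the adjective-before-noun order, the attachment of `#S`/`#O` directly to the noun, and all function names are assumptions. Only the vocabulary sizes, the VSO/VOS/mixed orders and the two suffixes are taken from the text.

```python
import itertools
import random

# Placeholder lexicon (assumption): 6 nouns, 6 verbs, 6 adjectives.
NOUNS = [f"n{i}" for i in range(6)]
VERBS = [f"v{i}" for i in range(6)]
ADJS = [f"a{i}" for i in range(6)]


def noun_phrases():
    """All noun phrases: a bare noun, or a noun modified by one adjective."""
    phrases = []
    for noun in NOUNS:
        phrases.append((None, noun))
        for adj in ADJS:
            phrases.append((adj, noun))
    return phrases


def render_np(phrase, case=""):
    """Turn an (adjective, noun) pair into tokens, optionally case-marked."""
    adj, noun = phrase
    noun = noun + case
    return [adj, noun] if adj else [noun]


def render(subj, verb, obj, variant, rng):
    """Render one clause in one of the three source-language variants."""
    if variant == "VSO":
        return [verb] + render_np(subj) + render_np(obj)
    if variant == "VOS":
        return [verb] + render_np(obj) + render_np(subj)
    if variant == "mixed+case":
        s, o = render_np(subj, "#S"), render_np(obj, "#O")
        return [verb] + (s + o if rng.random() < 0.5 else o + s)
    raise ValueError(f"unknown variant: {variant}")


def clauses():
    """Every (subject, verb, object) combination licensed by the toy grammar."""
    nps = noun_phrases()
    return list(itertools.product(nps, VERBS, nps))
```

Under these assumptions the grammar licenses 42 × 6 × 42 = 10,584 clauses, which is in the same range as the 10k sentence pairs reported above; the exact count in the original corpus may differ.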
Results As shown in Figure 1, all models achieve perfect accuracy on all language pairs after 1000 training steps, except for the Large Transformer on the free-order language, likely due to overparametrization (Sankararaman et al., 2020). These results demonstrate that our NMT architectures are equally capable of modeling translation of both types of language, when all other factors of variation are controlled for. Nonetheless, a pattern emerges when looking at the learning curves within each plot: While the two fixed-order languages have very similar learning curves, the free-order language with case markers always requires slightly more training steps to converge. This is also the case, albeit to a lesser extent, when the mixed-order corpus is pre-processed by splitting all case suffixes from the nouns (extra experiment not shown in the plot). This trend is noteworthy, given the simplicity of our grammars and the transparency of the case system. As our training sets cover a large majority of the languages, this result might suggest that free-order natural languages need larger training datasets to reach a translation quality similar to that of their fixed-order counterparts. In §5 we validate this hypothesis on more naturalistic language data.

Synthetic English Variants
Experimenting with toy languages has its shortcomings, like the small vocabulary size and non-realistic distribution of words and structures. In this section, we follow the approach of Ravfogel et al. (2019) to validate our findings in a less controlled but more realistic setup. Specifically, we create several variants of the Europarl English-French parallel corpus where the source sentences are modified by changing word order and adding artificial case markers. We choose French as target language because of its fixed order, SVO, and its relatively simple morphology.4 As Indo-European languages, English and French are moderately related in terms of syntax and vocabulary while being sufficiently distant to avoid a word-by-word translation strategy in many cases. Source language variants are obtained by transforming the syntactic tree of the original sentences. While Ravfogel et al. (2019) could rely on the Penn Treebank (Marcus et al., 1993) for their monolingual task of agreement prediction, we instead need parallel data. For this reason, we parse the English side of the Europarl v.7 corpus (Koehn, 2005) using the Stanza dependency parser (Qi et al., 2020; Manning et al., 2014). After parsing, we adopt a modified version of the synthetic language generator by Ravfogel et al. (2019) to create the following English variants:5
• fixed-order: either SVO, SOV, VSO or VOS;
• free-order: for each sentence in the corpus, one of the six possible orders of (Subject, Object, Verb) is chosen randomly;
• shuffled words: all source words are shuffled regardless of their syntactic role. This is our lower bound, measuring the reordering ability of a model in the total absence of source-side order cues (akin to bag-of-words input).
4 According to the Morphological Counting Complexity (Sagot, 2013) values reported by Cotterell et al. (2018), English scores 6 (least complex), Dutch 26, French 30, Spanish 71, Czech 195, and Finnish 198 (most complex).
5 Our revised language generator is available at https:
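To make the three kinds of variants concrete, the sketch below reorders a clause that has already been split into subject, verb and object spans. It is a deliberately simplified illustration: the actual generator works on full dependency trees and handles many more constructions, and the `clause` dictionary format and function names are assumptions introduced here.

```python
import random

FIXED_ORDERS = ("SVO", "SOV", "VSO", "VOS")
ALL_ORDERS = ("SVO", "SOV", "VSO", "VOS", "OSV", "OVS")


def reorder(clause, order):
    """Concatenate the S, V and O token spans in the requested order."""
    return [tok for role in order for tok in clause[role]]


def free_order(clause, rng):
    """Pick one of the six constituent orders at random for this sentence."""
    return reorder(clause, rng.choice(ALL_ORDERS))


def shuffled_words(clause, rng):
    """Lower bound: shuffle all words regardless of their syntactic role."""
    tokens = reorder(clause, "SVO")
    rng.shuffle(tokens)
    return tokens


# Hypothetical pre-segmented clause, with verb agreement already removed.
clause = {"S": ["the", "president"], "V": ["thank"], "O": ["the", "minister"]}
print(" ".join(reorder(clause, "SOV")))
```

Note that constituents move as whole spans in the fixed-order and free-order variants, whereas the shuffled variant destroys the spans themselves.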
To allow for a fair comparison with the artificial case-marking languages, we remove number agreement features from verbs in all the above variants (cf. says → say in Table 2).
To answer our second research question, we experiment with two artificial case systems proposed by Ravfogel et al. (2019) and illustrated in Table 2 (overt suffixes):
• unambiguous case system: suffixes indicating argument role (subject/object/indirect object) and number (singular/plural) are added to the heads of noun and verb phrases;
• syncretic case system: suffixes indicating number but not grammatical function are added to the heads of main arguments, providing only partial disambiguation of argument roles. This system is inspired by subject/object syncretism in Russian.
Syncretic case systems were found to be roughly as common as non-syncretic ones in a large sample of almost 200 world languages (Baerman and Brown, 2013). Case marking is always combined with the fully flexible order of main constituents. As in Ravfogel et al. (2019), English number marking is removed from verbs and their arguments before adding the artificial suffixes.
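The difference between the two case systems can be illustrated with a small sketch. The `.nsubj.sg`-style suffix shape follows the overt suffixes mentioned in this paper; the number-only form of the syncretic suffix (`.sg`, `.pl`), the `iobj` label and the helper names are assumptions made for illustration.

```python
def mark(head, role, number, system):
    """Attach an artificial suffix to an argument head.

    role is one of 'nsubj', 'dobj', 'iobj' (labels assumed here);
    number is 'sg' or 'pl'.
    """
    if system == "unambiguous":
        return f"{head}.{role}.{number}"
    if system == "syncretic":
        # Number is marked, grammatical function is not.
        return f"{head}.{number}"
    raise ValueError(f"unknown case system: {system}")


def vso(verb, subj, obj):
    return " ".join([verb, subj, obj])


def vos(verb, subj, obj):
    return " ".join([verb, obj, subj])


def surface_forms(system):
    """A sentence in VSO order and its argument-swapped twin in VOS order."""
    a = vso("thank",
            mark("president", "nsubj", "sg", system),
            mark("minister", "dobj", "sg", system))
    b = vos("thank",
            mark("minister", "nsubj", "sg", system),
            mark("president", "dobj", "sg", system))
    return a, b
```

In this sketch, `surface_forms("syncretic")` returns two identical strings for two sentences with opposite meanings, while `surface_forms("unambiguous")` keeps them apart, which is the kind of ambiguity the challenge set is designed to expose.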

NMT Setup
Models As the recurrent model, we used a 3-layer BiLSTM with hidden size of 512 and MLP attention (Bahdanau et al., 2015). The Transformer model has the standard 6-layer configuration with hidden size of 512, 8 attention heads, and sinusoidal positional encoding (Vaswani et al., 2017). All models use subword representation based on 32k BPE merge operations (Sennrich et al., 2016), except in the low-resource setup where this is reduced to 10k operations. More hyper-parameters are provided in Appendix A.1.
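For readers unfamiliar with sinusoidal positional encoding, the following sketch computes the encoding vector for a single position using the formulation of Vaswani et al. (2017). It is a plain-Python illustration, not the code of the NMT toolkit used in our experiments; the function name and default dimension are chosen here for convenience.

```python
import math


def sinusoidal_encoding(position, d_model=512):
    """Fixed (non-learned) encoding of one position.

    Even dimensions use sine, odd dimensions use cosine, with
    wavelengths forming a geometric progression up to 10000 * 2 * pi.
    """
    encoding = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))
        encoding.append(math.cos(angle))
    return encoding


# The vector is added to the embedding of the token at that position.
first = sinusoidal_encoding(0, d_model=8)
```

Because the encoding depends only on the position index, it carries no information about which constituent occupies that position; any sensitivity to constituent order has to be learned by the attention layers.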

Data and Evaluation
We train our models on various subsets of the English-French Europarl corpus: 1.9M sentence pairs (high-resource), 100K (medium-resource), 10K (low-resource). For evaluation, we use 5K sentences randomly held out from the same corpus. Given the importance of word order to assess the correct translation of verb arguments into French, we compute the reordering-focused RIBES metric (Isozaki et al., 2010) in addition to the more commonly used BLEU (Papineni et al., 2002). In each experiment, the source side of training and test data is transformed using the same procedure whereas the target side remains unchanged. We repeat each experiment 3 times (or 4 for languages with random order choice) and report the averaged results.
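The intuition behind a reordering-focused metric can be illustrated with a normalized Kendall's tau computed over word positions, which is the rank-correlation idea that RIBES builds on. The sketch below is not the official RIBES implementation: it aligns only words that occur exactly once in both hypothesis and reference (a simplifying assumption) and omits the other components of the metric.

```python
from collections import Counter
from itertools import combinations


def aligned_ranks(hyp, ref):
    """Reference positions of hypothesis words that are unique in both."""
    hyp_counts, ref_counts = Counter(hyp), Counter(ref)
    ref_pos = {word: i for i, word in enumerate(ref)}
    return [ref_pos[w] for w in hyp
            if hyp_counts[w] == 1 and ref_counts[w] == 1]


def normalized_kendall_tau(ranks):
    """Fraction of word pairs that appear in the same relative order.

    Equals (tau + 1) / 2 when there are no ties; 0.0 if nothing aligns.
    """
    pairs = list(combinations(ranks, 2))
    if not pairs:
        return 0.0
    concordant = sum(1 for a, b in pairs if a < b)
    return concordant / len(pairs)


ref = "le président remercie le ministre".split()
good = "le président remercie le ministre".split()
swapped = "le ministre remercie le président".split()
```

With these toy inputs, the correct hypothesis obtains an order score of 1.0 while the hypothesis with swapped arguments obtains 0.0, even though both contain exactly the same words; an n-gram metric would separate them far less sharply.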

Challenge Set
Besides syntactic structure, natural language often contains semantic and collocational cues that help disambiguate the role of an argument. Small BLEU/RIBES differences between our language variants may indicate actual robustness of a model to word order flexibility, but may also indicate that a model relies on those cues rather than on syntactic structure (Gulordava et al., 2018). To distinguish between these two hypotheses, we create a challenge set of 7,200 simple affirmative and negative sentences where swapping subject and object leads to another plausible sentence.8 Each English sentence and its reverse are included in the test set together with the respective translations, as for example:
(1) a. The president thanks the minister. / Le président remercie le ministre.
    b. The minister thanks the president. / Le ministre remercie le président.
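The construction principle of such reversible items can be sketched as follows. The tiny lexicon, the single affirmative template and the function names are hypothetical and only reproduce example (1); the released challenge set is larger and also contains negative sentences, which this sketch does not cover.

```python
from itertools import permutations

# Hypothetical mini-lexicon; only masculine French nouns, so that the
# article "le" is always correct in the template below.
NOUNS = {"president": "président", "minister": "ministre"}
VERBS = {"thanks": "remercie"}


def make_pair(subj, verb, obj):
    """Build one English sentence and its French reference translation."""
    english = f"The {subj} {verb} the {obj}."
    french = f"Le {NOUNS[subj]} {VERBS[verb]} le {NOUNS[obj]}."
    return english, french


def challenge_items():
    """Every sentence is generated together with its argument-swapped twin."""
    items = []
    for verb in VERBS:
        for subj, obj in permutations(NOUNS, 2):
            items.append(make_pair(subj, verb, obj))
    return items


for english, french in challenge_items():
    print(english, "/", french)
```

Since both members of each pair contain the same words, a model can only translate both correctly if it recovers argument roles from the source-side order or case marking.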
The source side is then processed as explained in §5 and translated by the NMT model trained on the corresponding language variant. Thus, translation quality on this set reflects the extent to which NMT models have robustly learnt to detect verb arguments and their roles independently from other cues, which we consider an important sign of linguistic generalization ability. For space constraints, we only present RIBES scores on the challenge set.9
8 More details can be found in Appendix A.2. We release the challenge set at https://github.com/arianna-bis/freeorder-mt
9 We also computed BLEU scores: they strongly correlate with RIBES but fluctuate more due to the larger effect of lexical choice.

High-Resource Results
Table 3 reports the high-resource setting results. The first row (original English to French) is given only for reference and shows the overall highest results. The BLEU drop observed when moving to any of the fixed-order variants (including SVO) is likely due to parsing flaws resulting in awkward reorderings. As this issue affects all our synthetic variants, it does not undermine the validity of our findings. For clarity, we center our main discussion on the Transformer results and comment on the BiLSTM results at the end of this section.
Fixed-Order Variants All four tested fixed-order variants obtain very similar BLEU/RIBES scores on the Europarl-test. This is in line with previous work in NMT showing that linguistically motivated pre-ordering leads to small gains (Zhao et al., 2018) or none at all (Du and Way, 2017), and that Transformer-based models are not biased towards monotonic translation (Choshen and Abend, 2019). On the challenge set, scores are slightly more variable but a manual inspection reveals that this is due to different lexical choices, while word order is always correct for this group of languages. To sum up, in the high-resource setup, our Transformer models are perfectly able to disambiguate the core argument roles when these are consistently encoded by word order.

Fixed-Order vs Random-Order
Somewhat surprisingly, the Transformer results are only marginally affected by the random ordering of verb and core arguments. Recall that in the 'Random' language all six possible permutations of (S,V,O) are equally likely. Thus, Transformer shows an excellent ability to reconstruct the correct constituent order in the general-purpose test set. The picture is very different on the challenge set, where RIBES drops severely from 97.6 to 74.1. These low results were to be expected given the challenge set design (it is impossible even for a human to recognize subject from object in the 'Random, no case' challenge set). Nonetheless, they demonstrate that the general-purpose set cannot tell us whether an NMT model has learnt to reliably exploit syntactic structure of the source language, because of the abundant non-syntactic cues. In fact, even when all source words are shuffled, Transformer still achieves a respectable 25.8/71.2 BLEU/RIBES on the Europarl-test.

Case Marking
The key comparison in our study lies between fixed-order and free-order case-marking languages. Here, we find that case marking can indeed restore near-perfect accuracy on the challenge set (98.1 RIBES). However, this only happens when the marking system is completely unambiguous, which, as already mentioned, is true for only about a half of the real case-marking languages (Baerman and Brown, 2013). Indeed, the syncretic system visibly improves quality on the challenge set (74.1 to 84.4 RIBES) but remains far behind the fixed-order score (97.6). In terms of overall NMT quality (Europarl-test), fixed-order languages score only marginally higher than the free-order case-marking ones, regardless of the unambiguous/syncretic distinction. Thus our finding that Transformer NMT systems are equally capable of modeling the two types of languages (§4) is also confirmed with more naturalistic language data. That said, we will show in Sect. 5.4 that this positive finding is conditional on the availability of large amounts of training samples.

BiLSTM vs Transformer
The LSTM-based results generally correlate with the Transformer results discussed above; however, our recurrent models appear to be slightly more sensitive to changes in the source-side order, in line with previous findings (Choshen and Abend, 2019). Specifically, translation quality on Europarl-test fluctuates slightly more than Transformer among different fixed orders, with the most monotonic order (SVO) leading to the best results. When all words are randomly shuffled, BiLSTM scores drop much more than Transformer. However, when comparing the fixed-order variants to the ones with free order of main constituents, BiLSTM shows only a slightly stronger preference for fixed-order, compared to Transformer. This suggests that, by experimenting with arbitrary permutations, Choshen and Abend (2019) might have overestimated the bias of recurrent NMT towards more monotonic translation, whereas the more realistic combination of constituent-level reordering with case marking used in our study is not so problematic for this type of model. Interestingly, on the challenge set, BiLSTM and Transformer perform on par, with the notable exception that syncretic case is much more difficult for the BiLSTM model. Our results agree with the large drop of subject-verb agreement prediction accuracy observed by Ravfogel et al. (2019) when experimenting with the random order of main constituents. However, their scores were also low for SOV and VOS, which is not the case in our NMT experiments. Besides the fact that our challenge set only contains short sentences (hence no long dependencies and few agreement attractors), our task is considerably different in that agreement only needs to be predicted in the target language, which is fixed-order SVO.
Summary Our results so far suggest that state-of-the-art NMT models, especially if Transformer-based, have little or no bias towards fixed-order languages. In what follows, we study whether this finding is robust to differences in data size, type of morphology, and target language.

Effect of Data Size and Morphological Features
Data Size The results shown in Table 3 represent a high-resource setting (almost 2M training sentences). While recent successes in cross-lingual transfer learning alleviate the need for labeled data (Liu et al., 2020), their success still depends on the availability of large unlabeled data as well as other, yet to be explained, language properties (Joshi et al., 2020). We then ask: Do free-order case-marking languages need more data than fixed-order non-case-marking ones to reach similar NMT quality? We simulate a medium- and low-resource scenario by sampling 100K and 10K training sentences, respectively, from the full Europarl data. To reduce the number of experiments, we only consider Transformer with one fixed-order language variant (SOV) and exclude syncretic case marking. To disentangle the effect of word order from that of case marking on low-resource translation quality, we also experiment with a language variant combining fixed-order (SOV) and case marking. Results are shown in Figure 2 and discussed below.

Morphological Features
The artificial case systems used so far included easily separable suffixes with a 1:1 mapping between grammatical categories and morphemes (e.g. .nsubj.sg, .dobj.pl) reminiscent of agglutinative morphologies. Many world languages, however, do not comply with this 1:1 mapping principle but display exponence (multiple categories conveyed by one morpheme) and/or flexivity (the same category expressed by various, lexically determined, morphemes). Well-studied examples of languages with case+number exponence include Russian and Finnish, while flexive languages include, again, Russian and Latin. Motivated by previous findings on the impact of fine-grained morphological features on language modeling difficulty (Gerz et al., 2018), we experiment with three types of suffixes (see examples in Table 2):
• overt: number and case are denoted by easily separable suffixes (e.g. .nsubj.sg, .dobj.pl) similar to agglutinative languages (1:1);
• implicit: the combination of number and case is expressed by unique suffixes without internal structure (e.g. kar for .nsubj.sg, ker for .dobj.pl) similar to fusional languages. This system displays exponence (many:1);
• implicit with declensions: like the previous, but with three different paradigms each arbitrarily assigned to a different subset of the lexicon. This system displays exponence and flexivity (many:many).
A complete overview of our morphological paradigms is provided in Appendix A.3. All our languages have moderate inflectional synthesis and, in terms of fusion, are exclusively concatenative. Despite this, the effect on vocabulary size is substantial: 180% increase by overt and implicit case marking, 250% by implicit marking with declensions (in the full data setting).
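The three suffix types can be contrasted with a short sketch. Only the suffixes `kar` (for .nsubj.sg) and `ker` (for .dobj.pl) are taken from the text; every other suffix, the omission of indirect objects, the direct concatenation of the suffix to the word, and the hash-based assignment of words to declension classes are assumptions made for illustration. The actual paradigms are those of Appendix A.3.

```python
import zlib

# Paradigm 0 doubles as the 'implicit' system. Only 'kar' and 'ker'
# come from the paper; all other suffixes are invented placeholders.
IMPLICIT = {
    ("nsubj", "sg"): "kar", ("nsubj", "pl"): "kon",
    ("dobj", "sg"): "kin", ("dobj", "pl"): "ker",
}
DECLENSIONS = [
    IMPLICIT,
    {("nsubj", "sg"): "pal", ("nsubj", "pl"): "pon",
     ("dobj", "sg"): "pit", ("dobj", "pl"): "pel"},
    {("nsubj", "sg"): "sut", ("nsubj", "pl"): "sam",
     ("dobj", "sg"): "sif", ("dobj", "pl"): "sed"},
]


def declension_class(head):
    """Arbitrary but deterministic assignment of a word to a paradigm."""
    return zlib.crc32(head.encode("utf-8")) % len(DECLENSIONS)


def inflect(head, role, number, system):
    if system == "overt":
        # 1:1 mapping: separable case and number markers.
        return f"{head}.{role}.{number}"
    if system == "implicit":
        # Exponence: one unanalyzable suffix for case + number.
        return head + IMPLICIT[(role, number)]
    if system == "declensions":
        # Exponence and flexivity: the suffix also depends on the word.
        return head + DECLENSIONS[declension_class(head)][(role, number)]
    raise ValueError(f"unknown system: {system}")
```

With declensions, the same case and number combination surfaces as up to three different suffixes across the lexicon, so each individual suffix is observed less often in training, which is one way of thinking about the additional sparsity.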
Results Results are shown in the plots of Figure 2 (detailed numerical scores are given in Appendix A.4). We find that reducing training size has, not surprisingly, a major effect on translation quality. Among source language variants, fixed-order obtains the highest quality across all setups. In terms of BLEU (2(a)), the spread among variants increases somewhat with less data; however, differences are small. A clearer picture emerges from RIBES (2(b)), whereby less data clearly leads to more disparity. This is already visible in the 100k setup, with the fixed SOV language dominating the others. Case marking, despite being necessary to disambiguate argument roles in the absence of semantic cues, does not improve translation quality and even degrades it in the low-resource setup. Looking at the challenge set results (2(c)) we see that the free-order case-marking languages are clearly disadvantaged: In the mid-resource setup, case marking improves substantially over the underspecified random, no-case language but remains far behind fixed-order. In low-resource, case marking notably hurts quality even in comparison with the underspecified language. These results thus demonstrate that free-order case-marking languages require more data than their fixed-order counterparts to be accurately translated by state-of-the-art NMT. Our experiments also show that this greater learning difficulty is not only due to case marking (and subsequent data sparsity), but also to word order flexibility (compare sov+overt to r+overt in Figure 2). Regarding different morphology types, we do not observe a consistent trend in terms of overall translation quality (Europarl-test): in some cases, the richest morphology (with declensions) slightly outperforms the one without declensions, a result that would deserve further exploration.
On the other hand, results on the challenge set, where most words are case-marked, show that morphological richness inversely correlates with translation quality when data is scarce. We postulate that our artificial morphologies may be too limited in scope (only 3-way case and number marking) to impact overall translation quality and leave the investigation of richer inflectional synthesis to future work.

Effect of Target Language
All results so far involved translation into a fixed-order (SVO) language without case marking. To verify the generality of our findings, we repeat a subset of experiments with the same synthetic English variants, but using Czech or Dutch as target languages. Czech has rich fusional morphology including case marking, and very flexible order. Dutch has simple morphology (no case marking) and moderately flexible, syntactically determined order. Figure 3 shows the results with 100k training sentences. In terms of BLEU, differences are even smaller than in English-French. In terms of RIBES, trends are similar across target languages, with the fixed SOV source language obtaining best results and the case-marked source language obtaining worst results. This suggests that the major findings of our study are not due to the specific choice of French as the target language.

Related Work
The effect of word order flexibility on NLP model performance has been mostly studied in the field of syntactic parsing, for instance using Average Dependency Length (Gildea and Temperley, 2010; Futrell et al., 2015a) or head-dependent order entropy (Futrell et al., 2015b; Gulordava and Merlo, 2016) as syntactic correlates of word order freedom. Related work in language modeling has shown that certain languages are intrinsically more difficult to model than others (Cotterell et al., 2018; Mielke et al., 2019) and has furthermore studied the impact of fine-grained morphology features (Gerz et al., 2018) on LM perplexity.
Regarding the word order biases of seq-to-seq models, Chaabouni et al. (2019) use miniature languages similar to those of Sect. 4 to study the evolution of LSTM-based agents in a simulated iterated learning setup. Their results in a standard "individual learning" setup show, like ours, that a free-order case-marking toy language can be learned just as well as a fixed-order one, confirming earlier results obtained by simple Elman networks trained for grammatical role classification (Lupyan and Christiansen, 2002). Transformer was not included in these studies. Choshen and Abend (2019) measure the ability of LSTM- and Transformer-based NMT to model a language pair where the same arbitrary (non syntactically motivated) permutation is applied to all source sentences. They find that Transformer is largely indifferent to the order of source words (provided this is fixed and consistent across training and test set) but nonetheless struggles to translate long dependencies actually occurring in natural data. They do not directly study the effect of order flexibility.
The idea of permuting dependency trees to generate synthetic languages was introduced independently by Gulordava and Merlo (2016) (discussed above) and by Wang and Eisner (2016), the latter with the aim of diversifying the set of treebanks currently available for language adaptation.

Conclusions
We have presented an in-depth analysis of how Neural Machine Translation difficulty is affected by word order flexibility and case marking in the source language. Although these common language properties were previously shown to negatively affect parsing and agreement prediction accuracy, our main results show that state-of-the-art NMT models, especially Transformer-based ones, have little or no bias towards fixed-order languages. Our simulated low-resource experiments, however, reveal a different picture, that is: free-order case-marking languages require more data to be translated as accurately as their fixed-order counterparts. Since parallel data (like labeled data in general) is scarce for most of the world's languages (Guzmán et al., 2019; Joshi et al., 2020), we believe this should be considered as a further obstacle to language equality in future NLP technologies.
In future work, our analysis should be extended to target language variants using principled alternatives to BLEU (Bugliarello et al., 2020), and to other typological features that are likely to affect MT performance, such as inflectional synthesis and degree of fusion (Gerz et al., 2018). Finally, the synthetic languages and challenge set proposed in this paper could be used to evaluate syntax-aware NMT models (Eriguchi et al., 2016;Bisk and Tran, 2018;Currey and Heafield, 2019), which promise to better capture linguistic structure, especially in low-resource scenarios.