To Diverge or Not to Diverge: A Morphosyntactic Perspective on Machine Translation vs Human Translation

We conduct a large-scale fine-grained comparative analysis of machine translations (MTs) against human translations (HTs) through the lens of morphosyntactic divergence. Across three language pairs and two types of divergence defined as the structural difference between the source and the target, MT is consistently more conservative than HT, with less morphosyntactic diversity, more convergent patterns, and more one-to-one alignments. Through analysis on different decoding algorithms, we attribute this discrepancy to the use of beam search that biases MT towards more convergent patterns. This bias is most amplified when the convergent pattern appears around 50% of the time in training data. Lastly, we show that for a majority of morphosyntactic divergences, their presence in HT is correlated with decreased MT performance, presenting a greater challenge for MT systems.


Introduction
Translation divergences occur when the translations differ structurally from the source sentences, typically as a result of either inherent crosslingual differences or idiosyncratic preferences of translators. These divergences happen naturally in the translation process and can be readily found in human translations (HT), including those used for training machine translation (MT) systems (see the table in Figure 1 for some examples). Their existence in HT has long been regarded as a key challenge for MT (Dorr, 1994), and more recent empirical studies have demonstrated the abundance of translation divergences in HT (Deng and Xue, 2017; Nikolaev et al., 2020).
In contrast to HT, MT outputs tend to be less diverse and more literal (i.e., absent of translation divergence), exhibiting the features of translationese (Gellerstam, 1986).

[Figure 1: Top table: Examples of divergences in HT for the En→Fr WMT15 training data (Bojar et al., 2015), with relevant fragments of the source/target shown in the first/second rows. The English control constructions are bolded, including both the finite root verb and the controlled word, while the French phrases of interest are underlined. Bottom figure: Percentages of target patterns for HT and MT, with obligatory control finite verbs as the source pattern. o2o:conv: one-to-one convergent patterns where the target phrase uses a similar control construction to the source; o2o:div: one-to-one divergent patterns where the target differs structurally from the source; null: no target word is aligned; others: other less frequent patterns (e.g., one-to-many alignments). The percentages of all four categories sum up to 100%.]

This qualitative difference between HT and MT has inspired a rich body of work attempting to narrow the gap, such as automatic detection of machine-translated texts in the training data (Kurokawa et al., 2009; Lembersky et al., 2012; Aharoni et al., 2014; Riley et al., 2020; Freitag et al., 2022), training MT systems on more diverse translations (Khayrallah et al., 2020; Bao et al., 2023), and carefully reordering examples to reduce the degree of divergence between the source and the target (Wang et al., 2007; Zhang and Zong, 2016; Zhou et al., 2019). The challenges that translation divergences present concern not just the training of MT systems, but also their evaluation (Koppel and Ordan, 2011; Freitag et al., 2020). Nonetheless, even as we gain a deepened understanding of how to address these challenges, it remains unclear how quantitatively different MT and HT are in terms of divergences.

Control verbs, for instance, provide a great case study to showcase this difference. There is much uncertainty when translating them from English to French, and human translators employ a wide variety of constructions including many divergent patterns (Figure 1). In comparison, MT is much more likely to preserve the source structure, with the convergent pattern comprising about 20% more of all translations of control verbs. This difference exemplifies MT's undesirable tendency to produce translationese that is too literal and lacks structural diversity (Freitag et al., 2019; Bizzoni et al., 2020).
In this work, we seek to systematically investigate this difference by conducting a large-scale fine-grained comparative analysis of the distribution of translation divergences for HT and MT, all through the lens of morphosyntax. More specifically, we aim to answer the following research questions: 1) How are MT and HT quantitatively different in terms of morphosyntactic divergence? 2) How do we explain or understand this difference? 3) How do translation divergences in HT affect MT quality? In other words, do MT systems have more difficulty translating source sentences that exhibit divergences in HT?
Through extensive analyses based on three language pairs and two types of morphosyntactic divergence using the annotation framework of Universal Dependencies (Nivre et al., 2016), we make the following empirical observations: 1. MT is more conservative than HT, with less morphosyntactic diversity, more convergent patterns, and more one-to-one alignments.
2. MT is morphosyntactically less similar to HT for less frequent source patterns.
3. The distributional difference can be largely attributed to the use of beam search, which is biased towards convergent patterns. This bias is most amplified when the convergent target patterns appear around 50% of the time out of all translations of the same source pattern in the training data.
4. A majority of the most frequent divergent patterns are correlated with decreased MT performance.This correlation cannot be fully explained by the lower frequencies of the relevant divergences.
To the best of our knowledge, this is the first work to present a comparative perspective on HT vs MT at such fine granularity, covering thousands of morphosyntactic constructions. In the remaining sections, we first briefly describe related work in Section 2. The experimental setup is described in detail in Section 3. We demonstrate the quantitative difference between MT and HT in Section 4, and seek to understand this discrepancy in Section 5. Lastly, we explore the correlation between the presence of divergences in HT and MT performance in Section 6, and conclude in Section 7.
The closest work to ours is from Nikolaev et al. (2020), who proposed to investigate fine-grained crosslingual morphosyntactic divergence based on Universal Dependencies. They augmented a subset of the Parallel Universal Dependencies (PUD) corpus (Zeman et al., 2017) with human-annotated word alignments for five language pairs and focused exclusively on content words. While our work shares a similar conceptual and methodological foundation to theirs, our goal is to conduct a comparative analysis between HT and MT. In addition, we rely on a dependency parser and a word aligner (see Section 3 for more details) to reach a sufficiently large scale to enable the investigation of more fine-grained divergences.
Diverse Machine Translation MT systems tend to produce less diverse outputs in general (Gimpel et al., 2013; Ott et al., 2018), which is particularly harmful for back translation (Edunov et al., 2018; Soto et al., 2020; Burchell et al., 2022). To address this issue, various techniques have been proposed in the literature, including modified decoding algorithms (Li et al., 2016; Sun et al., 2020; Li et al., 2021), mixtures of experts (Shen et al., 2019), Bayesian models (Wu et al., 2020), additional codes (syntactic or latent) (Shu et al., 2019; Lachaux et al., 2020), and training with simulated multi-reference corpora (Lin et al., 2022). In all aforementioned works, the emphasis is on the lack of diversity in MT outputs rather than comparing them systematically against HT. Notable exceptions include Roberts et al. (2020), who investigated the distributional differences between MT and HT in terms of n-grams, sentence length, punctuation, and copy rates. Marchisio et al. (2022) compared translations from supervised and unsupervised MT and noted their systematic style differences based on similarity and monotonicity in their POS sequences. In contrast, our work goes beyond surface features and focuses on fine-grained morphosyntactic divergences.
Algorithmic Bias Another closely related line of work studies algorithmic biases of current NLP systems, with particular emphasis on gender and racial biases (Bolukbasi et al., 2016; Caliskan et al., 2017; Zhao et al., 2017; Garg et al., 2018). Specifically for MT, researchers have focused on lexical diversity by comparing HT against post-editese (Toral, 2019) or MT outputs directly (Vanmassenhove et al., 2019); Bizzoni et al. (2020) have compared HT, MT, and simultaneous interpreting in terms of translationese using POS perplexity and dependency length. Most related to our work, Vanmassenhove et al. (2021) have conducted an extensive comparison between HT and MT based on a suite of lexical and morphological diversity metrics. While our study reaches a similar conclusion that MT is less diverse than HT, we explore morphosyntactic patterns on a more fine-grained level, and also reveal the bias of MT (and more specifically beam search) towards convergent structures.

Experimental Setup
Types of Morphosyntactic Divergence In this study, we experiment with two types of translation patterns based on the annotation scheme of Universal Dependencies: (A) Word-based: the POS tags of the aligned word pair. We additionally include their parent and child syntactic dependencies for more granularity. The order of the child dependencies is ignored.
(B) Arc-based: the source dependency arc, and the target path between the aligned words of the arc's head and tail. The directionality of the target dependencies is ignored. We additionally include the POS tags of both the head and the tail for more granularity.
These types are largely based on the proposal of Nikolaev et al. (2020), with modifications to accommodate more granularity. With either type, a translation pattern is a convergence if the source and the target sides have the same structure (word-based or arc-based), and otherwise a divergence. Notationally, we use tildes to connect the various parts of a pattern in a fixed order. For instance, for the control verb "cautioned" in Figure 2, its word-based pattern has root~VERB~nsubj+xcomp on the source side, where VERB corresponds to its POS tag, root to its parent dependency, and nsubj and xcomp to its two child dependencies. Similarly, we have root~VERB~nsubj+obl+xcomp on the target side. With regard to an arc-based divergence, for the source arc between the words "cautioned" and "readers", we denote it as VERB~nsubj~NOUN, where nsubj is the dependency relation of the arc, and VERB and NOUN the POS tags of the head and the tail, respectively. Similarly, we denote the aligned target pattern as VERB~obl~NOUN.
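To make the notation concrete, the pattern strings above can be assembled as in the following minimal sketch. The function names and the alphabetical canonicalization of child dependencies are our own illustrative choices, not the authors' implementation; the example values come from the "cautioned" example in Figure 2.

```python
def word_pattern(pos, parent_dep, child_deps):
    """Word-based pattern: parent dep ~ POS ~ child deps.

    The order of child dependencies is ignored, so we sort them
    into a canonical alphabetical form before joining with '+'.
    """
    return f"{parent_dep}~{pos}~{'+'.join(sorted(child_deps))}"


def arc_pattern(head_pos, dep, tail_pos):
    """Arc-based pattern: head POS ~ dependency relation ~ tail POS."""
    return f"{head_pos}~{dep}~{tail_pos}"


# The control verb "cautioned" from Figure 2:
src = word_pattern("VERB", "root", ["nsubj", "xcomp"])
tgt = word_pattern("VERB", "root", ["nsubj", "obl", "xcomp"])
print(src)         # root~VERB~nsubj+xcomp
print(tgt)         # root~VERB~nsubj+obl+xcomp
print(src == tgt)  # False -> a divergence

# The source arc "cautioned" -nsubj-> "readers" and its aligned target arc:
print(arc_pattern("VERB", "nsubj", "NOUN"))  # VERB~nsubj~NOUN
print(arc_pattern("VERB", "obl", "NOUN"))    # VERB~obl~NOUN
```

Because child-dependency order is discarded, two words with the same dependencies listed in different orders map to the same pattern string, which is what makes counting pattern frequencies straightforward.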
Data We conduct experiments for three language pairs using WMT datasets (Bojar et al., 2015).

Models We train a bilingual Transformer base model (Vaswani et al., 2017) for each language pair using the T5X framework (Roberts et al., 2022). All models are trained with the Adafactor optimizer (Shazeer and Stern, 2018). The dependency parser is an implementation of Dozat and Manning (2017) based on mBERT (Devlin et al., 2019). The neural word aligner is based on AMBER (Hu et al., 2021) and fine-tuned on human-annotated alignments. We follow Nikolaev et al. (2020) in keeping only the content words and their dependencies and alignments, and focus on one-to-one alignments unless otherwise noted.
As reported in Table 2 parts (ii) and (iii), we validate that both tools have high accuracy on public datasets: UD test sets for parsing and human-annotated PUD datasets (Nikolaev et al., 2020) for word alignment. We will release the automatic annotations to the public.

Comparative Analysis of MT vs HT
We proceed to conduct a comparative analysis of MT vs HT based on the fine-grained morphosyntactic patterns defined in the previous section. For any given source pattern p, according to the word-based or arc-based definition detailed in the previous section, we study the distribution of its aligned target patterns, i.e., Pr HT (• | p) and Pr MT (• | p), along two major dimensions: diversity/uncertainty as measured by the entropy of the target pattern, and convergence/divergence rate. Figure 3 shows that there is considerable variance in how the most frequent source patterns in HT are distributed along these two axes, and that each dimension captures a different property of the distribution.
Through analyses at both the aggregate level and the individual pattern level, we conclude that MT is more conservative than HT, with less morphosyntactic diversity, more convergent patterns, and more one-to-one alignments. We also observe that MT tends to be less similar to HT for the less frequent source patterns. The analyses in this section are based on the held-out subset consisting of one million sentence pairs. We refer readers to Appendix A for similar results on a subset further filtered using LaBSE crosslingual embeddings (Feng et al., 2022); the trend there is remarkably similar, showing that our conclusions are unchanged when we test on data filtered to improve its cross-lingual equivalence.

MT is Less Morphosyntactically Diverse Than HT

Preliminaries We define the diversity score as the conditional entropy of target patterns given source patterns, which reflects the aggregate level of uncertainty when translating a morphosyntactic pattern. More formally, let P and Q denote the categorical random variables for source patterns and their aligned target patterns, respectively. The aggregate diversity score is defined as

H(Q | P) = Σ_p Pr(P = p) H(Q | P = p),    (1)

where p is any specific source pattern that occurs in the corpus.
In addition, for any given source pattern p, we define a source pattern-specific diversity score as the entropy of the target patterns aligned to that source pattern p. This score corresponds to the term H(Q | P = p) in Equation (1).

Aggregate Finding As summarized in Table 3 part (i), MT is less morphosyntactically diverse than HT in aggregate, across all three language pairs.

Finding by Source Pattern On the level of individual source patterns, we observe that the reduction of diversity among their aligned target patterns is across-the-board but unevenly distributed. Figure 4 plots a stacked histogram of the relative differences in diversity score (MT vs HT) for the most frequent source patterns with at least 1000 occurrences, and it shows that the vast majority of them see a drop in diversity (i.e., a negative difference). This reduction varies from pattern to pattern, ranging from 0% to 60%.
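The diversity scores above can be computed directly from (source pattern, target pattern) pairs. The sketch below is illustrative (the function name and toy pairs are our own); it returns both the pattern-specific entropies H(Q | P = p) and the aggregate score from Equation (1).

```python
import math
from collections import Counter, defaultdict


def diversity_scores(aligned_pairs):
    """Pattern-specific entropies H(Q | P = p) and the aggregate
    conditional entropy H(Q | P) = sum_p Pr(P = p) H(Q | P = p),
    computed from (source_pattern, target_pattern) pairs."""
    by_src = defaultdict(Counter)
    for p, q in aligned_pairs:
        by_src[p][q] += 1
    total = len(aligned_pairs)
    per_pattern, aggregate = {}, 0.0
    for p, tgt_counts in by_src.items():
        n = sum(tgt_counts.values())
        # Entropy (in bits) of the target-pattern distribution given p.
        h = -sum((c / n) * math.log2(c / n) for c in tgt_counts.values())
        per_pattern[p] = h
        aggregate += (n / total) * h  # weight by Pr(P = p)
    return per_pattern, aggregate


# Toy corpus: source pattern "A" has two equally likely targets,
# source pattern "B" is translated deterministically.
pairs = [("A", "x"), ("A", "x"), ("A", "y"), ("A", "y"), ("B", "z")]
per, agg = diversity_scores(pairs)
print(per["A"])  # 1.0 bit: targets x and y are equally likely
print(agg)       # 0.8 = (4/5) * 1.0 + (1/5) * 0.0
```

A lower aggregate score for MT than for HT over the same source sentences is exactly the "less diverse" finding reported in Table 3 part (i).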

MT is More Convergent Than HT
Preliminaries We tally divergences and convergences according to the two types detailed in Section 3. We then define the convergence or divergence rate as the percentage of convergent or divergent patterns out of all translation patterns. As with diversity, we can compute convergence/divergence rates both for the entire corpus in aggregate and for individual source patterns. For the latter case, we tally all the aligned target patterns for a specific source pattern and calculate the rates accordingly.
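The tallying above amounts to a simple count of identical source/target pattern pairs. A minimal sketch (function name and toy data are our own, not the authors' code):

```python
from collections import defaultdict


def convergence_rates(aligned_pairs):
    """Corpus-level and per-source-pattern convergence rates.

    A pair counts as convergent when the source and target
    patterns are structurally identical (equal pattern strings).
    """
    per_src = defaultdict(lambda: [0, 0])  # pattern -> [convergent, total]
    for p, q in aligned_pairs:
        per_src[p][0] += int(p == q)
        per_src[p][1] += 1
    corpus = sum(c for c, _ in per_src.values()) / len(aligned_pairs)
    return corpus, {p: c / t for p, (c, t) in per_src.items()}


pairs = [("A", "A"), ("A", "B"), ("A", "A"), ("C", "D")]
corpus_rate, per_pattern = convergence_rates(pairs)
print(corpus_rate)       # 0.5 (2 of 4 pairs converge)
print(per_pattern["A"])  # 2/3 for source pattern "A"
```

The divergence rate is simply one minus the convergence rate at either granularity.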
Aggregate Finding As summarized in Table 3 part (ii), we observe a consistent increase in convergence rate across all three language pairs and both types of divergence. This increase is most pronounced for En→Fr and En→De, whereas En→Zh shows a smaller although still consistent increase and starts from a much lower convergence rate in HT: the highest rate for En→Zh is 23.4%, whereas En→De can reach 57.8%.
Finding by Source Pattern On a more granular level, we again notice a consistent increase in convergent patterns for MT among the top source patterns (Figure 5). For the vast majority of top source patterns, MT produces more convergent translations than HT, and this discrepancy ranges from a negligible amount (~0%) for most patterns to more than 20%. The discrepancy is distributed differently across the three language pairs: En→Fr and En→De see more patterns with increased convergence rates, while for En→Zh most patterns barely change and cluster around 0%. As we later show in Figure 9, this trend is unsurprising given the much lower convergence rates for En→Zh in general.

MT Looks Less Like HT For Less Frequent Patterns
Preliminaries Both diversity score and convergence rate are properties of the translations produced by one system, either MT or HT. To directly measure the distributional difference between MT and HT, we resort to the Wasserstein distance (WD) between the two conditional distributions Pr MT (• | p) and Pr HT (• | p), using a unit cost matrix in which diagonal entries are 0 and off-diagonal entries are 1. (Recall that we treat both source patterns and target patterns as categorical random variables where every unique source or target pattern is a distinct value that the random variables can take.) This metric can be intuitively interpreted as the minimal amount of probability mass that has to be moved from Pr MT (• | p) to match Pr HT (• | p), with an upper bound of 1 (i.e., the sum of all probability mass). Other metrics such as KL divergence could also be used to measure distributional difference, but we chose WD for its interpretability.

Finding As Figure 6 shows, there is a negative correlation between WD and source pattern frequency: MT matches HT more closely for the more frequent source patterns, while having difficulty reproducing the HT distribution for the less frequent ones. This trend persists across all tested settings and points to a potential weakness of MT systems when it comes to learning the distributions of less common structures.
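Under a unit cost matrix, the optimal transport cost between two categorical distributions reduces to the total variation distance, i.e., half the L1 distance between the probability vectors, which is why the metric reads as "minimal probability mass moved" with an upper bound of 1. A minimal sketch (function name and toy distributions are our own):

```python
def unit_cost_wasserstein(pr_mt, pr_ht):
    """Wasserstein distance between two categorical distributions
    under a unit cost matrix (0 on the diagonal, 1 elsewhere).

    In this special case the optimal transport cost equals the
    total variation distance: 0.5 * sum |pr_mt - pr_ht|, i.e., the
    minimal probability mass that must be moved to match pr_ht.
    """
    support = set(pr_mt) | set(pr_ht)
    return 0.5 * sum(abs(pr_mt.get(k, 0.0) - pr_ht.get(k, 0.0)) for k in support)


# Toy target-pattern distributions for one source pattern p:
mt = {"conv": 0.8, "div": 0.2}
ht = {"conv": 0.5, "div": 0.4, "null": 0.1}
print(unit_cost_wasserstein(mt, ht))  # ~0.3: move 0.3 mass off "conv"
```

Disjoint distributions give the maximal distance of 1, and identical distributions give 0, matching the interpretation in the text.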

Beyond One-to-one Alignments
Preliminaries One-to-one alignments constitute a majority of all detected alignments, but they fail to account for translation patterns involving deletions and insertions. To investigate the quantitative differences between HT and MT on those special patterns, we conduct additional analyses on the distribution of all categories of alignments based on the word-based definition. Besides deletions (src2null) and insertions (null2tgt), the remaining alignments are collapsed into the other category (e.g., one-to-many mappings).
Finding Figure 7 summarizes the distribution of all alignment categories and demonstrates a significant and consistent difference between HT and MT. More specifically, MT produces fewer deletions (green), fewer insertions (red), and more one-to-one translations (blue). En→Fr again exhibits the biggest discrepancy, with 9.6% fewer deletions (10.8% vs 20.4%) and 14.8% fewer insertions (13.0% vs 26.8%), both around a 50% relative reduction. This trend contributes to the overall conservative nature of MT predictions, which favor one-to-one alignments at the expense of the other (more uncertain) categories.

Understanding the Discrepancy
In this section, we seek to understand the source of the discrepancy between HT and MT demonstrated in the previous section. By investigating different decoding algorithms, we attribute this discrepancy to the use of beam search, echoing the thesis laid out by previous work (Edunov et al., 2018; Eikema and Aziz, 2020). More specifically, we show in our experiments that beam search is biased towards less diverse and more convergent translations, even when the learned model distribution actually resembles HT. This bias is most prominent when the convergent patterns appear around 50% of the time in training data. Moreover, the frequencies of convergent patterns in MT increase even when they are uncommon in HT, suggesting perhaps a more inherent structural bias in current MT architectures.
Decoding Algorithms Besides beam search, we additionally obtain translations through two sampling methods. More specifically, to make a fair comparison with single-reference HT, we sample one translation per source sentence using either ancestral sampling or nucleus sampling with p = 0.95 (Holtzman et al., 2020).
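For readers unfamiliar with the second method, here is a simplified sketch of nucleus (top-p) sampling over a single categorical distribution; real decoders apply this per decoding step to the model's token distribution, and all names and the toy distribution below are our own illustrative choices. Ancestral sampling is the special case p = 1.0 (sample from the full, unmodified distribution).

```python
import random


def nucleus_sample(probs, p=0.95, rng=None):
    """Nucleus (top-p) sampling: keep the smallest set of
    highest-probability tokens whose cumulative mass reaches p,
    renormalize within that set, then sample from it."""
    rng = rng or random
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cum = [], 0.0
    for tok, pr in ranked:
        nucleus.append((tok, pr))
        cum += pr
        if cum >= p:
            break
    z = sum(pr for _, pr in nucleus)  # renormalization constant
    r = rng.random() * z
    for tok, pr in nucleus:
        r -= pr
        if r <= 0:
            return tok
    return nucleus[-1][0]


# Toy next-token distribution: with p=0.95 the cumulative mass
# 0.6 + 0.3 + 0.07 = 0.97 >= 0.95, so the tail token "de" is cut.
dist = {"le": 0.6, "la": 0.3, "un": 0.07, "de": 0.03}
print(nucleus_sample(dist, p=0.95))
```

Because nucleus sampling only truncates the low-probability tail, it sits between beam search (mode-seeking) and ancestral sampling (unbiased) in how faithfully it reflects the model distribution, which is what makes the three-way comparison in Figure 8 informative.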
Beam Search is Biased Against Diversity and Divergence As Figure 8 illustrates, for all three language pairs and two types of divergence, translations obtained through beam search are significantly less diverse and more convergent than those from either sampling method. Indeed, ancestral sampling consistently produces higher diversity scores and lower convergence rates than even HT; we hypothesize that this is attributable to the use of label smoothing during training, whose effect on various diversity diagnostics has also been demonstrated by Roberts et al. (2020). Since ancestral sampling is an unbiased estimator of the model distribution, this suggests that on the aggregate distribution level, the model learns to be at least as morphosyntactically diverse and divergent as HT.

A further breakdown of the most frequent individual source patterns (those with at least 1000 occurrences) reveals that beam search's bias towards convergent translations is a function of the relative frequency of the convergent patterns. As Figure 9 demonstrates, the increase in convergence rate for beam search compared to ancestral sampling appears quadratically correlated with the convergence rate for ancestral sampling: the peak difference is reached at around 40-50%. This suggests that beam search favors the convergent pattern more when the pattern appears around 50% of the time in training data. This could be because the model has seen the pattern enough to assign it substantial probability mass, but there is still enough uncertainty that humans will frequently choose other patterns.
We additionally note that the convergence rate increases for the overwhelming majority of the most frequent source patterns even when the convergent patterns are uncommon in HT. This strongly suggests an inherent bias of beam search towards convergent patterns, and that this bias is distinct from the typical bias amplification due to data exposure, e.g., "cooking" being more likely to co-occur with "women" than "men" in the training data (Zhao et al., 2017). We suspect that this bias towards convergence is due to the architectural design of MT systems, but we leave this subject for future work.

Divergence and MT Quality
In our final analysis, we investigate how the presence of morphosyntactic divergence in HT might affect MT quality. In contrast to the previous sections, which analyze conditional distributions given a source pattern, we focus instead on individual divergences/convergences. The potential connection between divergence and MT quality is motivated by second-language acquisition research that describes interference from the learner's first language (i.e., negative transfer) as one source of difficulty (Gass et al., 2020), which can happen when the two languages diverge structurally. Do MT systems have similar problems with divergences?
Preliminaries To answer this question, we conduct an analysis on the presence (or absence) of a word-based morphosyntactic divergence in HT and the corresponding MT quality as measured by BLEU (Papineni et al., 2002) and BLEURT (Sellam et al., 2020). The basic idea is to construct two contrastive groups of source sentences (called the experiment group and the control group) and compare the MT performance on each group. The HT references of the experiment group contain a given divergent pattern, corresponding to sentences that are perhaps more challenging to translate, whereas those of the control group do not. More specifically, for a given divergence with source pattern p and target pattern q (p ≠ q), its control group consists of source sentences for which HT translates every source p into target p (i.e., a convergent pattern), and its experiment group consists of source sentences for which HT translates every source p into p except for one occurrence that is translated into q. For a simplified example, if we are interested in the divergence that translates nouns into verbs, the corresponding control group contains source sentences for which HT translates every noun into a noun, whereas the experiment group contains source sentences for which exactly one noun is translated into a verb and the rest into nouns.
We then collect the MT outputs for both groups and compute the differences in BLEU and BLEURT. This procedure is repeated for every divergence pattern for which both groups have at least 100 sentences.
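The group construction above can be sketched as follows. The sentence representation (a dict with a hypothetical "targets" list recording the aligned target pattern for each occurrence of source pattern p) and all names are our own illustrative choices, not the authors' code.

```python
from collections import Counter


def split_groups(sentences, p, q):
    """Partition sentences into control and experiment groups for
    the divergence p -> q (p != q):
      control    = every occurrence of source pattern p maps to p;
      experiment = exactly one p maps to q, all the rest map to p.
    Sentences matching neither condition are discarded."""
    control, experiment = [], []
    for sent in sentences:
        targets = Counter(sent["targets"])
        if targets and set(targets) == {p}:
            control.append(sent)
        elif targets[q] == 1 and targets[p] == len(sent["targets"]) - 1:
            experiment.append(sent)
    return control, experiment


# Hypothetical sentences tracking how each source NOUN was translated:
sents = [
    {"id": 1, "targets": ["NOUN", "NOUN"]},  # all convergent -> control
    {"id": 2, "targets": ["NOUN", "VERB"]},  # exactly one NOUN->VERB -> experiment
    {"id": 3, "targets": ["VERB", "VERB"]},  # two divergences -> neither group
]
ctrl, exp = split_groups(sents, "NOUN", "VERB")
print([s["id"] for s in ctrl])  # [1]
print([s["id"] for s in exp])   # [2]
```

Comparing corpus-level BLEU/BLEURT on the two resulting sets then gives one data point per divergence pattern, as described above.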
Findings We treat each difference in BLEU or BLEURT as one data point and plot their estimated probability density function. As illustrated in Figure 10, divergences are more often associated with significantly lower BLEU scores (i.e., negative differences), with a fairly large amount of variance. Trends for BLEURT scores are similar, though En→De shows less drastic differences than with BLEU. On the other hand, a substantial number of divergent patterns show either virtually no change or an increase in BLEU or BLEURT scores. This suggests that being a divergence pattern in itself is not associated with decreased MT performance. What could explain this variance? Why are some divergent patterns associated with worse MT performance while others are not? One obvious hypothesis is that these patterns are seen less frequently during training. However, a closer inspection suggests that the frequency of divergent patterns alone is not an adequate predictor. More specifically, we use the absolute or relative frequency of the divergent pattern (where relative frequency is the ratio of the number of training examples with the divergence over that with the convergence, counterbalancing the fact that some extremely common source patterns have far more frequent divergences), with or without taking a log, and correlate it with BLEU or BLEURT scores. Even with the best option (log of relative frequency), presented in Table 4, there is only weak correlation (Pearson or Kendall τ) for En→Fr and En→De, and no correlation for En→Zh. It is unclear what aspects of divergent patterns make them more difficult to translate, or whether they merely co-occur with the true causes of difficulty. We leave it to future work to investigate the underlying cause.
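The frequency-vs-quality check above boils down to correlating two per-divergence vectors. A minimal sketch of the Pearson side (the helper name and the data points are fabricated for illustration; the paper additionally reports Kendall τ and p-values):

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length
    sequences, e.g., log relative frequency of each divergent
    pattern vs. its BLEURT difference (experiment - control)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical per-divergence data points:
log_rel_freq = [-3.0, -2.0, -1.5, -1.0, -0.5]
bleurt_delta = [-1.2, -0.9, -0.4, -0.5, -0.1]
print(pearson(log_rel_freq, bleurt_delta))  # a value in [-1, 1]
```

In the paper's actual data, this correlation is only weak for En→Fr and En→De and absent for En→Zh, which is precisely why frequency alone is rejected as the explanation.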

Conclusion
We conduct a large-scale fine-grained comparative investigation of HT and MT outputs through the lens of morphosyntactic divergence. Based on extensive analyses of three language pairs, we demonstrate that MT is less morphosyntactically diverse and more convergent than HT. We further attribute this difference to the use of beam search, which biases MT outputs towards less diverse and less divergent patterns. Finally, we show that the presence of divergent patterns in HT has an overall adverse effect on MT quality.
In future work, we are interested in applying the same analysis to large language model (LLM)-based MT systems. Recent studies have noted that LLM-based systems tend to produce less literal translations than traditional encoder-decoder models (Vilar et al., 2023; Raunak et al., 2023). It would be interesting to see whether and to what extent LLM translations differ from those produced by traditional models when viewed through a morphological lens.

Figure 4: Stacked histogram of the relative differences in source pattern-specific diversity score.

Figure 5: Stacked histogram of the absolute differences in source pattern-specific convergence rate.

Figure 6: Wasserstein distance with a unit cost matrix between Pr MT (• | p) and Pr HT (• | p) for any given source pattern p. Patterns are binned by frequency on a log scale, and both the means (lines) and the 95% confidence intervals (shaded areas) are shown. The plot shows a negative correlation between WD and source pattern frequency.

Figure 7: Distribution of all types of alignments. Percentages are defined relative to the total number of source content words. o2o: one-to-one; src2null: deletions; null2tgt: insertions; other: other types such as one-to-many.

Figure 9: Plot of difference in convergence rate (beam search vs HT) against convergence rate of HT.The plot is similar when comparing beam search against ancestral sampling.

Figure 10: Kernel density estimation of the difference in BLEU or BLEURT scores between the experiment and the control group. Negative values indicate that the experiment group has a lower score than the control group.
Figure 1: Examples of divergences in HT for the En→Fr WMT15 training data.
Table 1: Number of distinct source or target patterns found in the analysis set (1M sentences from WMT).
… where the relative pronoun que is obligatory in French but not in English. (3) amod~PROPN~leaf (low convergence rate, low entropy): adjectives as part of proper nouns. Adjectives in official institutions and titles are typically capitalized and annotated as PROPN in English (e.g., Secretary General) but lowercased and annotated as ADJ in French (e.g., secrétaire général).

Table 3: Aggregate diversity scores and convergence rates. The ∆% columns show the relative change in percentage from HT to MT.

Table 4: Correlation between the difference in BLEURT score and the ratio of frequencies (i.e., the number of training examples with divergences over that with convergences). p-values are displayed in gray.