Are Character-level Translations Worth the Wait? Comparing ByT5 and mT5 for Machine Translation

Pretrained character-level and byte-level language models have been shown to be competitive with popular subword models across a range of Natural Language Processing tasks. However, there has been little research on their effectiveness for neural machine translation (NMT), particularly within the popular pretrain-then-finetune paradigm. This work performs an extensive comparison across multiple languages and experimental conditions of character- and subword-level pretrained models (ByT5 and mT5, respectively) on NMT. We show the effectiveness of character-level modeling in translation, particularly in cases where fine-tuning data is limited. In our analysis, we show how character models’ gains in translation quality are reflected in better translations of orthographically similar words and rare words. While evaluating the importance of source texts in driving model predictions, we highlight word-level patterns within ByT5, suggesting an ability to modulate word-level and character-level information during generation. We conclude by assessing the efficiency tradeoff of byte models, suggesting their usage in non-time-critical scenarios to boost translation quality.


Introduction
Character-level and byte-level models have been a source of interest in Natural Language Processing (NLP) for many years, with the promise of tokenization-free systems able to process and generate text at a finer granularity. However, these systems have failed to become the dominant paradigm over subword-level models, despite comparable performance. This is likely because of the additional time and compute resources required, due to the longer input sequences used in character-based approaches. There are cases where character models have been shown to outperform subword models; however, these instances may be seen as niche (e.g. tasks that specifically require character information) or unrealistic (e.g. using data corrupted with synthetic noise (Xue et al., 2022)).
Additionally, previous studies presented only limited evaluations of these systems for popular tasks where character-level information could lead to major performance benefits. In this work, we conduct a comprehensive comparison of character and subword pretrained models on machine translation (MT). Despite the popularity of NMT, pretrained character models have not yet been thoroughly assessed for this task, with recent research on character models for MT focusing on models trained from scratch (Libovický et al., 2022; Edman et al., 2022). We posit that these approaches can only reliably assess translation performance for high-resource languages, where the beneficial effects of multilingual pretraining were found to be less impactful (Liu et al., 2020).
Character models can leverage more fine-grained information, which could be helpful in many challenging NMT settings, such as low-resource translation. To validate this, we fine-tune the character model ByT5 (Xue et al., 2022) and its subword counterpart mT5 (Xue et al., 2021) for translation, across a variety of languages and settings. Our most notable findings are: (1) ByT5 generates higher-quality translations than mT5 in general, especially when resources are limited.
(2) ByT5 shows better cross-lingual generalization than mT5 on average, especially for high-resource languages and for low-resource languages that are related to high-resource ones.

Figure 1: ByT5's overall source vs. target contribution for the German→English translation "I study international relations." ($: end-of-sentence token.) Peaks in source importance at the beginning of each word suggest word-level modeling.
(3) Fine-tuning with many examples causes ByT5's translations in the zero-shot setting to degrade faster than mT5's. (4) ByT5 shows a cohesive word-level source importance pattern, suggesting a capacity to capture relationships beyond the character level (as shown in Figure 1). (5) When ByT5 outperforms mT5, it is also better at translating orthographically similar words and rare words.
Our findings support the idea that in several realistic circumstances, and particularly in low-resource scenarios, character models yield better translation quality than their widely-used subword counterparts. In our analysis, we further show how these gains in quality connect to specific word properties, such as orthographic similarity and frequency.

Related Work
Character-level models have long been of interest for machine translation, dating back to when statistical models were the dominant paradigm (Tiedemann and Nakov, 2013). At that time, character models were already competitive with word-level models, especially when training data was limited to < 100k sentence pairs. Durrani et al. (2010) also showed that character models were particularly adept at translating closely related languages such as Hindi and Urdu. We note that, at the time, subword tokenizers such as BPE (Sennrich et al., 2016) were not yet commonly used.
Neural approaches to character-level MT were first based on RNNs (Costa-jussà and Fonollosa, 2016; Lee et al., 2017), with extensive work comparing character models to subword models (now equipped with BPE). For NMT, it was shown that character models using an RNN architecture (with or without a CNN for processing characters) perform equally to or better than subword models (Larriba Flor, 2017; Lee et al., 2017; Sennrich, 2017; Cherry et al., 2018). Their better translation performance could be attributed to their ability to handle complex morphology, rare or unseen words, and noisy input (Jozefowicz et al., 2016; Kim et al., 2016; Belinkov and Bisk, 2017; Lee et al., 2017; Singh, 2017; Durrani et al., 2019). However, the inefficiency of character models was already apparent, as they were slower than subword models by a considerable margin (Chung et al., 2016). Byte-level and character-level models had also been compared, with little difference found; however, those experiments mostly involved Latin-scripted languages (Costa-jussà et al., 2017).
More recent work has looked at Transformers at the character and byte level. Xue et al. (2022) created ByT5 and compared it to its subword counterpart, mT5 (Xue et al., 2021); these are also the models we focus on in this work. Their comparisons, however, covered either multilingual classification tasks or English-based generative tasks, but no multilingual generative tasks or machine translation.
Previous work analyzed character-level Transformers trained from scratch for NMT, rather than the T5-based models. Libovický et al. (2022) looked at "vanilla" character models (those without any compression of the sequence length prior to the computation in the Transformer), as well as the sequence-length compression methods of Lee et al. (2017), Tay et al. (2022), and Clark et al. (2022). They conclude that these character-level models do not provide any benefits over subword models while being less efficient. There are two factors to note about these conclusions. Firstly, their experiments show similar performance, but they only experiment on high-resource languages. Given the prior work on RNNs mentioned above, this does not appear to be the application that would benefit most from character-level models. Secondly, they introduce a two-step decoder to achieve character-level decoding from subword-level hidden states; however, as Edman et al. (2022) point out, this decoder does not scale well to higher-resource scenarios. This two-step decoder adds an additional layer of complexity to evaluating such models, as an ablation of the model's granularity and of its decoding process would be necessary to fully understand the performance of each model. Added to the fact that there are no pretrained models using this sequence-length compression that are comparable to mT5 and ByT5 in terms of pretraining data or model scale, we do not experiment with models that compress the sequence length in our work.
In the context of low-resource MT, Edman et al. (2022) showed that character-level models can outperform subword models on the low-resource pair Xhosa-Zulu. Li et al. (2021) showed that character-level models create higher-quality translations in synthetic low-resource settings of English→{German, Finnish} with a corpus size of 50k parallel sentences. Carrión-Ponz and Casacuberta (2022) showed that quasi-character models (subword models with a small vocabulary of size 350) produce higher-quality translations than subword models with a more standard vocabulary size of 32 thousand when data is limited for a number of European languages, finding consistent improvements across several domains.
There are two major caveats with this previous work that should be considered, however. First, previous work using character models for MT focused on training models from scratch, as this is a long-standing practice in the MT field. However, this practice could be especially harmful for low-resource languages, where the paradigm of fine-tuning a pretrained, multilingual model was shown to be effective (Liu et al., 2020).
The second caveat is that previous evaluations of cross-lingual transfer were also limited by a relatively small model size (< 70M parameters). In contrast, this work evaluates models of up to 1.2B parameters. With an order of magnitude more parameters, we investigate the presence of emergent properties in larger character- and subword-based NMT models, following the evidence from other generative tasks.
Within the realm of the extensive previous work on character models, this work provides an updated overview of the performance of character versus subword models for NMT. To our knowledge, we provide the first such overview using multilingual, Transformer-based models of up to 1.2B parameters. We fine-tune a total of 162 models (varying languages, amount of training data, model size, and model type), and test on 200 languages to thoroughly compare the performance of character and subword models. We also perform an attribution analysis to gain a deeper understanding of the differences between the character-level and subword-level granularities, which to our knowledge has not yet been researched.

Method
As our experiments aim to provide a fair comparison of character and subword pretrained models for translation, we first justify our choice of models, followed by our training scheme, and lastly our choice of evaluation metric.

Models
With many character-level models available (El Boukkouri et al., 2020; Tay et al., 2022), we opt to compare ByT5 to its subword counterpart mT5. These models are, to our knowledge, the most comparable due to their similar training setup and parameter count. We note that although the parameter counts of mT5 and ByT5 are similar, Xue et al. opted to increase the width (i.e. hidden dimension size) of the ByT5 models to compensate for their using fewer parameters for embedding bytes. This is most noticeable in the small models, with 85% of mT5's parameters being used for embeddings, compared to 0.3% for ByT5 (Xue et al., 2022). As this disparity lessens with increasing model size, we consider this difference to be a meaningful factor in explaining results that correlate negatively with model size. As such, most of our experiments use the small (300M), base (582M), and large (1.23B) ByT5 and mT5 models, or focus only on the large models, where the disparity is lowest.

Training
We fine-tune mT5 and ByT5 models using the same prompt used in Raffel et al. (2020): Translate <S> to <T>: <src>, where <S> is the source language, <T> is the target language, and <src> is the source text. We primarily use WMT's NewsCommentary v16 datasets for fine-tuning. We consider 5 levels of "resourcedness" for fine-tuning, using {0.4, 2, 10, 50, 250} thousand sentence pairs. We also use the WMT14 German-English dataset to test higher-resource settings of 1.25 and 4.5 million sentence pairs (i.e. the entire dataset). Our hyperparameter choices for training can be found in Appendix A. For development and testing, we use the FLoRes-200 dataset (NLLB Team et al., 2022).
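For illustration, the prompt construction amounts to simple string formatting; a minimal sketch, where the helper name is ours rather than from the paper:

```python
def make_prompt(src_lang, tgt_lang, src_text):
    # Prompt format from Raffel et al. (2020), as used for fine-tuning here:
    # "Translate <S> to <T>: <src>"
    return f"Translate {src_lang} to {tgt_lang}: {src_text}"

prompt = make_prompt("German", "English", "Ich studiere internationale Beziehungen.")
```

For zero-shot evaluation (Section 3.2), only the source language name and the source text are swapped; the target language stays fixed.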
As for training language pairs, we train on {German, Russian}↔English and {Portuguese, English}→Spanish. We choose these language pairs as they are all within NewsCommentary, guaranteeing a similar quality, and accounting for varying degrees of language similarity.
We additionally test the models' ability to retain cross-lingual information with the FLoRes-200 dataset (NLLB Team et al., 2022), whose wide variety of languages allows us to further isolate important language characteristics. To test the models' zero-shot capabilities, we simply swap out <S> and <src> for a new language, keeping the target language the same. No further training is performed, making the setting zero-shot.

Evaluation
We considered several translation quality metrics, eventually opting for chrF++ (Popović, 2017), which is formulated as a combination of character-level and word-level F-scores, weighted to maximize correlation with human judgment. Note this differs from the original chrF, which only factors in character-level scores. Combining both word-level and character-level scores means neither ByT5 nor mT5 should be favored by chrF++.
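To make the metric concrete, here is a deliberately simplified sketch of the chrF++ idea: character n-gram F-scores (n = 1..6) averaged together with word n-gram F-scores (n = 1..2). This is not the official implementation (whitespace handling and smoothing details differ); in practice one would use a standard implementation such as sacreBLEU's CHRF with `word_order=2`.

```python
from collections import Counter

def ngrams(seq, n):
    # Multiset of n-grams of a sequence (characters or words).
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))

def fscore(hyp, ref, orders, beta=2.0):
    # Average F_beta over the given n-gram orders (beta=2 favors recall,
    # as in chrF). Orders with no hypothesis/reference n-grams are skipped.
    scores = []
    for n in orders:
        h, r = ngrams(hyp, n), ngrams(ref, n)
        if sum(h.values()) == 0 or sum(r.values()) == 0:
            continue
        overlap = sum((h & r).values())
        p, rec = overlap / sum(h.values()), overlap / sum(r.values())
        if p + rec == 0:
            scores.append(0.0)
            continue
        scores.append((1 + beta**2) * p * rec / (beta**2 * p + rec))
    return sum(scores) / len(scores) if scores else 0.0

def chrf_pp_sketch(hyp, ref):
    # Simplified chrF++: mean over 6 character orders and 2 word orders.
    char_f = fscore(list(hyp.replace(" ", "")), list(ref.replace(" ", "")), range(1, 7))
    word_f = fscore(hyp.split(), ref.split(), range(1, 3))
    return 100 * (6 * char_f + 2 * word_f) / 8
```

A perfect hypothesis scores 100; partial character overlap still earns credit even when no full word matches, which is the property that makes the metric fair to both granularities.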
We also considered modern neural metrics such as COMET (Rei et al., 2020, 2022), which are known to correlate better with sentence-level human judgments (Freitag et al., 2022). However, the fact that COMET is itself a subword model, and has mostly been trained to evaluate the outputs of subword models, makes it unclear how the metric performs for character models. As such, we resort to the more transparent chrF++. Nevertheless, we provide COMET scores for the direct translation results (Section 4), using the wmt22-comet-da model. We also include BLEU scores in Appendix B, whose trends are highly similar to those seen for chrF++.

Direct Translation Results
Our direct translation setup evaluates models on the same language pairs for which they are fine-tuned. Since character models generally outperform subword models, especially when training data is scarce (as noted in Section 2), we vary the amount of fine-tuning data, confirming these findings for {German, Russian}↔English translation.
Varying the amount of fine-tuning data reveals the largest difference in the translation quality of character and subword models. Figure 2 shows that ByT5 outperforms mT5 at all resource levels according to chrF++, and is competitive with mT5 according to COMET. When resources are limited, the quality gap between ByT5 and mT5 also tends to increase. We see that model size also plays a role, with the large model having the largest quality gap, of up to 10 chrF++ points, when only 400 sentences are available. This goes against our assumption that, since the architectural differences are largest between the small models, we would see the largest performance difference from them (see Section 3.1).

Table 1 (German→English): underlined scores are significantly better (p < 0.05).

Table 1 shows the performance of the large models on the German→English WMT14 dataset, accounting for higher-resource settings. The results according to chrF++ show that ByT5 continues to outperform mT5 by a significant margin, while COMET finds no significant difference. While we expect that the chrF++ performance of the two models will eventually converge given enough data, we did not observe this even in our highest-resource setup of 4.5M sentence pairs.

Zero-shot Results
Our zero-shot evaluation examines how well the models retain information from similar languages seen in pretraining or fine-tuning, and how well they generalize to unseen languages when fine-tuned on a specific language pair. As such, we consider zero-shot translation as translating from a source language that has not been fine-tuned on, using our scheme described in Section 3.2. This has important implications for best practices in low-resource model training, as cross-lingual transfer learning is a common technique for obtaining a good initialization for less-resourced language pairs. We first look at the general translation quality across all languages in FLoRes-200, and study patterns in performance with respect to geography and language resourcedness. We then investigate a degradation in zero-shot quality which we observe when ByT5 is trained for too long.

General Performance
Figure 3 shows the average translation quality of {German, Russian}↔English fine-tuned models tested on all 204 languages from the FLoRes-200 dataset (X→English).
Overall, we see that ByT5 continues to outperform mT5 in lower-resource scenarios. However, its zero-shot performance drastically decreases in several cases above 10k training examples, while mT5 continues to perform well up to 250k examples, with comparatively only a slight dip in performance or no dip at all. We further investigate ByT5's large degradation in Section 5.3, but first, we take a closer look at language-specific results on German→English with 10k training examples.
In Figure 4, we see the performance difference of ByT5 and mT5 for each source language, plotted in their respective geographic locations. While ByT5 performs well overall, we notice an apparent weakness when zero-shot translating languages from West and Central Africa. Many of these languages are considered low-resource; however, there are several other languages also considered low-resource on which ByT5 performs well. We therefore next break down what a language requires for ByT5 to perform well.

Language Resourcedness
As we have seen in Section 4, the amount of data used for fine-tuning plays a major role in the quality of translations. Similarly, we would expect whether the model has seen a particular language in pretraining to play a role as well.
If we only use the presence of a language in the pretraining dataset to predict whether ByT5 outperforms mT5, we get an accuracy of 62% on the FLoRes-200 languages, using our German→English model. For our other language pairs, we get 71%, 45%, and 50% for models fine-tuned on English→German, Russian→English, and English→Russian, respectively. So presence in pretraining alone is not a reliable predictor of performance.
However, this does not tell the full story, as many low-resource languages are related to high-resource languages, which the models, particularly ByT5, are capable of exploiting. Specifically, we categorize each language into one of three categories:
• High Resource: The language is in the pretraining data.
• Low Resource - Related: The language is in the same language subgrouping (as determined by NLLB Team et al. (2022)) and the same script as a language in pretraining.
• Low Resource - Unrelated: All other languages.
As an example, Acehnese (ace) is written in both Arabic and Latin scripts. For Latin, we consider it "Low Resource - Related" because it is related to the high-resource Indonesian (both being Malayo-Polynesian languages), which is also written in Latin script. However, for Arabic, we consider it "Low Resource - Unrelated" because there are no Malayo-Polynesian languages written in Arabic script in the pretraining data.
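The categorization rule can be sketched as a small function; the data below is a toy illustration, not the actual pretraining inventory:

```python
def categorize(lang, script, pretraining, subgroup_of):
    # pretraining: set of (language, script) pairs seen in pretraining.
    # subgroup_of: maps a language to its NLLB language subgrouping.
    if (lang, script) in pretraining:
        return "High Resource"
    for seen_lang, seen_script in pretraining:
        if (subgroup_of.get(lang) is not None
                and subgroup_of.get(lang) == subgroup_of.get(seen_lang)
                and script == seen_script):
            return "Low Resource - Related"
    return "Low Resource - Unrelated"

# Toy data reproducing the Acehnese example:
pretraining = {("Indonesian", "Latin")}
subgroup_of = {"Indonesian": "Malayo-Polynesian", "Acehnese": "Malayo-Polynesian"}
```

Note that both conditions (same subgrouping and same script) must hold for the "Related" category, matching the predictor described in the text.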
Requiring both language subgrouping and script to be the same performs much better as a predictor than using only one of the two. It is also intuitive: related languages typically share many similar words, but this will not result in shared embeddings if they are written in different scripts.
In Figure 5, we see that, for the most part, languages in the first two categories tend to be languages where ByT5 outperforms mT5, while the reverse is true for the third category. Using this rule, we can correctly predict which model will perform better for 89% of the languages when fine-tuning on German→English. While not a perfect predictor, the efficacy of this simple rule shows a clear trend: ByT5 can better leverage information from other languages it has seen to inform its translation of low-resource languages. The reason why mT5 performs better on unrelated low-resource languages is somewhat unclear. One possible explanation is that mT5's sparse embeddings allow for a more robust encoding of languages where false friends are more abundant, as mT5 would carry less of a bias toward assuming semantic similarity given orthographic similarity. In fact, from the perspective of a character model, "false friends" need not be words, but could simply be character n-grams. Nevertheless, many of the languages in the third category are West and Central African languages, corresponding to the pattern we see in Figure 4. These languages mostly use the Latin script, increasing the likelihood of false friends with the original source language (German). It should be noted, however, that for the majority of the languages on which mT5 performs better, neither mT5 nor ByT5 scored above 25 chrF++ (roughly equivalent to a BLEU score of 5). As such, we recommend further work before drawing conclusions based on these languages.

Zero-shot Degradation
To understand the dip in quality we see from ByT5-large fine-tuned on 250k examples (Figure 3), we begin by showing that freezing certain parts of the model can mitigate this weakness. Based on these findings, we then examine whether writing script is an explaining factor in how much a specific language degrades in quality.

Table 2: chrF++ on FLoRes-200 for the trained language pair ("Trained", i.e. German) and all remaining languages ("Zero-shot", averaged), using large models fine-tuned on 250k German→English examples. Top: performance when freezing some model components ("Enc n%": freeze the first n% of encoder layers; "¬ X-Attn": freeze all parameters except cross-attentions). Bottom: performance of unfrozen models trained for 10k and 250k steps (i.e. the step counts of the best-performing models for the zero-shot and trained settings, respectively).

Does Freezing the Encoder Help?
We test this by first fine-tuning for 10k steps (the number of steps it takes for the models fine-tuned on 10k sentences to converge), and then continuing to train only part of the model. We proceed to test the models on their original source language (in this case German), as well as the 204 languages from the test set. Results are shown in Table 2, including a comparison to the best-performing unfrozen models.
Here we can see that, for ByT5, freezing anywhere from the first quarter of the encoder layers up to everything except the cross-attentions is effective at preventing the loss of generality that comes with extra training. Ingle et al. (2022) saw a similar pattern in the few-shot setting, where freezing the first 25%-50% of RoBERTa's layers improved performance over freezing nothing. This also shows that the issue is not the variety of the training data, but specifically the number of training steps taken.
As we see a relatively steep incline in zero-shot translation quality from 0 to 25% of the encoder layers frozen, and a relatively stable quality afterwards, it seems the majority of language-specific operations occur in the first 25% of ByT5's encoder layers. On the other hand, mT5 appears to achieve this language abstraction solely through its sparser embeddings, as it does not suffer the same zero-shot quality decrease when left unfrozen. However, when freezing layers of mT5 beyond the embeddings, we see a steep decrease in translation quality both on the trained language and on zero-shot languages. This implies that mT5 needs to adjust a much larger portion of its parameters in order to translate with higher quality, whereas ByT5 only needs to adjust at most its cross-attentions.
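The "Enc n%" settings from Table 2 amount to selecting the first fraction of encoder layers and disabling their gradients; a minimal sketch, with the framework-specific part shown only as a hedged comment:

```python
def layers_to_freeze(n_layers, fraction):
    # Indices of the first `fraction` of encoder layers to freeze
    # (e.g. fraction=0.25 corresponds to the "Enc 25%" setting).
    return list(range(int(n_layers * fraction)))

# With a Hugging Face T5-style model (an assumption; adapt to your setup),
# freezing would then look roughly like:
#   for i in layers_to_freeze(len(model.encoder.block), 0.25):
#       for p in model.encoder.block[i].parameters():
#           p.requires_grad = False
```

Frozen layers keep their pretrained (or early fine-tuned) weights, which is what preserves the cross-lingual generality discussed above.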
Given the minimal zero-shot gains from further training, early stopping using a left-out set of low-resource language examples seems ideal.

Does Writing Script Affect Degradation?
The fact that freezing the encoder preserves quality on zero-shot languages narrows ByT5's failure point down to the source side. Since a major difference between character and subword models is their respectively dense and sparse token embeddings, we examine the impact of each language's writing script on the degradation of translation quality. We expect that, for example, with our German→English models, other Latin-scripted languages will be more severely impacted, due to their high degree of embedding overlap. We expect those embeddings (and subsequent encodings) to become specialized to German only, while non-Latin encodings should remain relatively unchanged. Figure 6 shows the degradation factor of Latin versus non-Latin-scripted languages for our ByT5-large English↔German and English→Russian models, and of Latin, Cyrillic, and all other multi-byte scripts for our Russian→English models. For the first three subfigures, we can see that Latin-scripted languages degrade more than the non-Latin languages. This is expected, seeing as ByT5 shares embeddings (and presumably encodings) across all Latin-scripted languages. So, for example, when it is specializing for German→English, any Latin-scripted languages unrelated to German should be particularly negatively affected. Meanwhile, other scripts share fewer bytes, and thus would be less affected.
For Russian→English, we do not see the same trend with Cyrillic; rather, the other scripts are heavily affected. We suspect this is for two reasons. Firstly, the Cyrillic languages are much more concentrated around Russia than the Latin languages are around Europe, so the Cyrillic languages tend to be either closely related to, and partially mutually intelligible with, Russian (e.g. Ukrainian), or to have several loan words from Russian (e.g. Kazakh). Meanwhile, the other non-Latin, non-Cyrillic scripts are heavily affected because they are also multi-byte scripts. These scripts share a large portion of bytes with Cyrillic and with each other due to how UTF-8 encodes multi-byte characters. For example, the Cyrillic "И" is encoded in bytes as [208, 152], while the Devanagari "घ" is [224, 164, 152]. Both share the final byte, as both are the 25th character within their respective code blocks.
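This byte-level overlap can be checked directly in Python (the characters are Cyrillic И, U+0418, and Devanagari घ, U+0918, both at offset 0x18 within their Unicode blocks):

```python
# UTF-8 encodings of two characters at the same offset within their blocks.
i_cyrillic = list("И".encode("utf-8"))      # → [208, 152]
gha_devanagari = list("घ".encode("utf-8"))  # → [224, 164, 152]

# UTF-8 stores the low six bits of the code point in the final byte,
# and 0x418 and 0x918 share those bits (0x18), so the last bytes match.
assert i_cyrillic[-1] == gha_devanagari[-1]
```

For a byte-level model, such shared continuation bytes mean partially shared "vocabulary" across otherwise unrelated scripts.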
As such, we find that multi-byte scripts may suffer from zero-shot translation degradation due to vocabulary overlap with the original source-side script, assuming this source-side script is another multi-byte script.
Since we observe a loss of generality only when at least 50k sentences are used for fine-tuning, the following sections mostly focus on results from models using between 0.4k and 10k fine-tuning examples.

Attribution Analysis
In translation studies, words are widely recognized as the minimal unit of translation (Hatim and Munday, 2004). It is therefore natural to assume that a source-to-target word or subword mapping might be easier to create than a character-level one, except perhaps for closely related languages where many words translate into orthographically similar cognates. It would thus seem that (sub)word models are naturally better suited to machine translation, but our results from the previous sections appear to contradict that. This leads us to investigate whether and how character models can learn to incorporate word-level information to improve their translation capabilities.
A common way to quantify the influence of inputs in driving model predictions is by means of feature attribution methods (Madsen et al., 2022), which have previously been applied in the MT domain to highlight word alignments, coreference resolution capabilities, and model training dynamics (Ding et al., 2019; He et al., 2019; Voita et al., 2021b, inter alia). For our analysis, we study source versus target character contributions in ByT5 models using gradients (Simonyan et al., 2014), which were recently shown to be more faithful than other advanced approaches for language tasks (Bastings et al., 2022). One can take gradients with respect to the probability of a predicted token and propagate them back to the model's inputs in order to quantify the importance of input tokens to the resulting prediction. Our analysis is inspired by Voita et al. (2021a), where a similar analysis was conducted on subword MT models to highlight a "language modeling regime" in which high importance is given to the target prefix while source tokens are disregarded.
In our study, we compute gradients for each input token (i.e. a character) with respect to the next token's prediction probability at every generation step using the Inseq library (Sarti et al., 2023). For every generated token, gradient vectors are aggregated at the character level using the L2 norm of the gradient vector to gauge the influence of the source and target context on the model's predictions. We verify our hypothesis that character-level NMT models might implicitly operate at a word level by comparing source-side and target-side contributions across different model sizes and language pairs. In this context, a character-by-character translation would be characterized by relatively uniform source attributions across all generated characters, while a word-by-word translation would imply a higher source contribution for the characters marking the beginning of a new word, followed by a shift of importance towards the target prefix, indicating that completing the word requires limited access to source-side information. In the left plot of our attribution results, we show source importance with respect to the character's position in the sentence. As generation progresses, we observe that the model relies less on the source side and more on the target side for its next-byte predictions. This aligns with Voita et al. (2021a)'s findings for subword-level MT models, with the intuitive explanation that a longer generated context provides more cues for the model to produce a coherent output without needing to attend to source-side tokens.
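The per-step aggregation can be sketched as follows; the gradient vectors here are toy values, whereas in practice they would come from a framework such as Inseq:

```python
import math

def source_contribution(src_grads, tgt_grads):
    # src_grads / tgt_grads: per-token gradient vectors for one generation
    # step. Each token's importance is the L2 norm of its gradient vector;
    # the return value is the fraction of total importance on the source side.
    def total(grads):
        return sum(math.sqrt(sum(g * g for g in vec)) for vec in grads)
    s, t = total(src_grads), total(tgt_grads)
    return s / (s + t)
```

Plotting this fraction over generation steps yields curves like the source-vs-target contributions in Figure 1.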
In the right plot, instead, we compute the relative average source importance for byte positions within words. Since bytes occurring later in a word also occur later in a sentence, we would expect source importance to drop later in a word as a consequence of the trend seen in the left plot. To disentangle the effect of the in-word byte position from the sentence-level one, we normalize the source importance scores by their average for the corresponding position in the sentence (i.e., we divide every in-word score by the respective in-sentence average from the left plot). As a result, if the position of characters inside the word did not affect source importance, we would expect a stable relative source importance close to 100% across all in-word byte positions. Instead, we observe a rapid decline of source importance within the word across all languages, implying an increase in target-side importance for non-word-initial generated characters. Upon manual inspection of the importance patterns, we note that the target-side importance is focused mainly on the previous characters belonging to the same word, suggesting the usage of word-level information to facilitate translation.
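The normalization step described above amounts to an element-wise division; a small sketch with illustrative numbers:

```python
def relative_in_word_importance(in_word_scores, sentence_pos_avgs):
    # in_word_scores[i]: average source importance of the i-th byte within
    # a word; sentence_pos_avgs[i]: the in-sentence average importance at
    # the positions where such bytes occur. Result is in percent, where
    # 100% means the in-word position has no effect beyond sentence position.
    return [100 * s / avg for s, avg in zip(in_word_scores, sentence_pos_avgs)]
```

For example, if a word-initial byte matches its sentence-position average (relative importance 100%) while the second byte has half of its average, the resulting curve declines from 100% to 50%, the shape reported for all languages here.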
Interestingly, for English→Russian we observe an oscillatory importance pattern peaking at even-numbered positions. The observed trend reflects the UTF-8 Cyrillic encoding using two bytes per character, with peaks at first-byte boundaries pointing to the model's ability to discriminate characters and bytes in languages using multi-byte characters.
We also note that relative source attributions seem to converge to 100% around the third character for Latin-script languages and around the sixth in-word character for Russian. These results indicate that similar importance patterns might emerge across different languages in character models despite separate fine-tuning procedures.
How does Resourcedness Affect Word-level Modeling?

In Figure 8, we examine how the relative source importance changes when more training data becomes available, and how it differs between trained and zero-shot languages. The left plot shows that a larger number of training examples sharpens the source contribution around the first few bytes of a word while decreasing source importance for the following bytes. These results support our intuition that memorized co-occurrences between source and target words, which are captured more easily given more training examples, enable confident generation with less reliance on the source text. From the right plot, we also note that this trend does not automatically extend to related languages when performing zero-shot translation: the source importance patterns for {Dutch, Limburgish}→English translation using a German→English model present a smooth trend comparable to the 0.4k German→English one in the left plot. This suggests that memorized co-occurrences lowering source-side dependence do not generalize to related languages even when translation quality does.

Effect of Word Similarity and Frequency
Character models have previously been shown to have a more robust vocabulary and are thus able to form useful biases for translating orthographically similar words, as well as rare words (Lee et al., 2017). We revisit these findings to see if they hold with larger, multilingually-pretrained Transformers.
Can Character Models Exploit Word Similarity? We proceed to investigate whether character models can operate at the character level when desirable. To this end, we focus our analysis on orthographically similar words (OSWs), starting from the assumption that a character model can easily exploit their similarity by learning to copy matching substrings. We start by assessing whether character models show improved performance for OSW translation. We use AWESoME (Dou and Neubig, 2021) to align source, reference, and hypothesis sentences and calculate word-level translation accuracy. Our definition of OSWs is based on the inverse normalized Levenshtein distance of the source and reference words, varying the threshold from 0 to 100%, e.g. German gesundheit and Dutch gezondheid have a similarity of 70% since they differ in three letters. Figure 9 shows the accuracy difference for the large ByT5 and mT5 models trained for German→English and Portuguese→Spanish on words grouped at different levels of orthographic similarity. We observe that, as words become more similar, the accuracy delta also increases in favor of ByT5, especially when less fine-tuning data is used. These results indicate that character models can learn OSW translations more rapidly and effectively than their subword counterparts. We proceed by examining the source contributions of OSWs and non-OSWs using ByT5, and how those change according to the number of examples used for fine-tuning. We restrict our analysis to >70% similarity for OSWs, and <30% similarity for non-OSWs. We consider only the source importance over characters belonging to the same word, in order to investigate the presence of copying patterns for OSWs. As shown in Figure 10, the importance of the source text grows with the amount of fine-tuning data when translating OSWs in the German→English setting (the same direction as training), suggesting an improved ability of the model to copy information from the source. Conversely, we observe a declining trend for the zero-shot Dutch→English direction, converging to the source importance of Limburgish, which was not present in the model's pretraining data. This could indicate that the Dutch knowledge acquired during pretraining is progressively lost when fine-tuning on increasingly more German data, supporting the results of Figure 3. For non-OSWs, we find a relative source importance of 94 ± 4% across all three directions and all amounts of training examples, indicating that the observed copying behavior is restricted to highly similar words only.
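The similarity measure used above can be sketched with a plain dynamic-programming edit distance; the paper's actual implementation may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def orthographic_similarity(src: str, ref: str) -> float:
    """Inverse normalized Levenshtein distance, in [0, 1]."""
    if not src and not ref:
        return 1.0
    return 1.0 - levenshtein(src, ref) / max(len(src), len(ref))

# German "gesundheit" vs. Dutch "gezondheid" differ in three letters:
orthographic_similarity("gesundheit", "gezondheid")  # 0.7
```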
To further confirm this copying behavior is indeed distinct between a subword and a character model, we create a synthetic control test set by replacing all proper noun pairs (word pairs tagged as proper nouns in both source and reference sentences) with random strings of Latin alphabet characters matching the original words' length. This is done for our German→English large models, modifying the German→English test set from FLoRes-200. Table 3 reports subword and character models' accuracies in copying the random strings from the source input into the generated output. We observe that for the subword-level mT5 model, the copying mechanism is learned progressively during fine-tuning. Meanwhile, the ByT5 model achieves superior copying performance with very little supervision, reaching 91% accuracy with only 400 training examples. We also observe that mT5's chrF++ performance improves with increasing examples, while ByT5's performance remains relatively unaffected. Comparing this to the chrF++ scores on the original test set, we see that ByT5's scores are unchanged, while mT5's scores drop by up to 8 points when the test set is modified.
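The replacement step of the control-set construction can be sketched roughly as below; identifying the proper noun pairs (e.g. with a POS tagger) is assumed to have happened already, and preserving the leading capital is our own assumption, not a detail from the paper.

```python
import random
import string

def random_replacement(word: str, rng: random.Random) -> str:
    """Replace a proper noun with a random Latin-alphabet string of the
    same length (keeping an initial capital, an assumption on our part)."""
    chars = [rng.choice(string.ascii_lowercase) for _ in word]
    if word[:1].isupper():
        chars[0] = chars[0].upper()
    return "".join(chars)

rng = random.Random(0)
random_replacement("Berlin", rng)  # a random 6-letter capitalized string
```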
While our control set only targets proper nouns, this copying behavior may be far more pervasive, as it could be present in any loan words, regardless of part of speech. The large difference in scores, particularly for the models fine-tuned on only 400 examples, seems to show a major difference in the default behavior of the two models.
Can Character Models Translate Rare Words More Accurately? Prior work has shown character models achieving better translation accuracy on rare words (Lee et al., 2017; Singh, 2017), so we expect the same to be true for large, pretrained Transformers as well. We adopt the same word-level translation accuracy method as earlier in the section, instead binning words based on their frequency in the English fine-tuning data. We evaluate the word-level accuracy of ByT5 versus mT5-large when fine-tuned on 10k German→English sentence pairs. We additionally distinguish between accuracy using the original source language and accuracy using zero-shot languages.
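The frequency binning can be sketched as below; the bin edges are illustrative, not the ones used in the paper.

```python
from collections import Counter
import bisect

def frequency_bins(target_sentences, bin_edges=(0, 1, 10, 100, 1000)):
    """Count word frequencies in the English fine-tuning targets and return
    a function assigning each word to a frequency bin (edges illustrative)."""
    counts = Counter(w for s in target_sentences for w in s.split())
    def bin_of(word: str) -> int:
        return bisect.bisect_right(bin_edges, counts[word]) - 1
    return counts, bin_of

counts, bin_of = frequency_bins(["the cat sat", "the dog"])
bin_of("the")    # frequency 2 -> second bin
bin_of("zebra")  # unseen -> lowest bin
```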
As shown in Figure 11, ByT5 has a higher per-word translation accuracy across all frequencies for the zero-shot languages, and all but the most frequent words for German. Generally, as the frequency of the word increases, the disparity between the models decreases. This is expected based on previous work (Singh, 2017). The disparity is higher for zero-shot languages overall, which indicates a higher level of robustness to changes in vocabulary from ByT5, a phenomenon also previously observed (Belinkov and Bisk, 2017).

Efficiency
Although we have shown that the translation quality of character models is competitive with or better than subword models, another important aspect is the efficiency of character models. While the lack of efficiency of character-level Transformer models is well-documented (Libovický et al., 2022; Tay et al., 2022; Edman et al., 2022), we have yet to see their efficiency on larger (300M+ parameters) Transformers for NMT, and so we provide our findings for completeness. We report model training and inference times for 1 NVIDIA V100 32GB GPU in Table 4.
Both training and inference speeds in samples per second are considerably slower for the character models (4-5 times slower for training, 5-6 times slower for inference). The number of epochs needed to converge is lower for character models, but not enough to counteract the slowness of training.
The slower speed comes largely from the increase in sequence length. While we tried to balance the size of the batches such that each model sees the same amount of text per batch, achieving this required the character models to accumulate gradients for 4 times as many iterations as the subword models. Thus, if training or inference speed is a concern, subword models are likely the superior choice, particularly for high-resource languages. In the low-resource setting, there is a significant trade-off between accuracy and speed when choosing between character and subword models.
It should be noted that there are several alternatives to vanilla character or byte-level models which offer a speedup, typically by compressing the sequence length before the bulk of the computation is done in the Transformer (El Boukkouri et al., 2020; Clark et al., 2022; Yu et al., 2023, inter alia). None of these methods claim faster processing over a standard subword-level model, however, so there would still be a trade-off between accuracy and speed when applying these methods, albeit to a lesser extent.

Conclusion
Subword-level models are currently the dominant paradigm for machine translation. However, this work showed that character models can produce competitive or even superior results in many circumstances. First, character models obtain an overall better translation quality on trained language pairs. Moreover, we highlighted how the gain in translation quality from using character models is particularly marked when fine-tuning data is scarce. Finally, we showed how character models have superior cross-lingual transferability, especially with languages seen in pretraining, or those that are similar to a language seen in pretraining.
Following the results of our analyses, we can attribute this superior quality to a character model's ability to implicitly account for the appropriate input granularity given the context, translating at times in a word-by-word fashion, or character-by-character when needed. This ultimately results in better accuracy when translating orthographically similar and rare words.
The quality increase is, however, not without a trade-off. Indeed, character models are at least 4 times slower in training and inference, making them suboptimal for many real-world situations. We posit that further work into speeding up generation models (Stern et al., 2018; Edman et al., 2022; Santilli et al., 2023, inter alia), as well as more thorough evaluations of neural metrics for character models, would greatly benefit this area of research. Nevertheless, we can conclude that in less time-sensitive, or low-resource settings, character-level translations are worth the wait.

A Hyperparameters
We train models using the AdaFactor (Shazeer and Stern, 2018) optimizer with 4000 warmup batches and a final constant learning rate of 1e-4. We batch by # of tokens (20k for ByT5, 5k for mT5) to provide each model with a roughly equivalent amount of context, as we estimate roughly 4 bytes per mT5 token during initial testing. We use early stopping with a patience of 5, based on a set number of steps, varying with the amount of data used. The step sizes were {50, 100, 250, 500, 1000} batches for {0.4k, 2k, 10k, 50k, 250k} total training examples, respectively. Similarly, we set a maximum number of epochs to {6250, 1250, 250, 50, 10}. All of our experiments on WMT14 data used the same step sizes and max epochs as our experiments using 250k training examples.
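For clarity, the schedule above can be written out as a table; this simply restates the numbers in this appendix:

```python
# Evaluation interval (batches between early-stopping checks) and maximum
# epochs for each fine-tuning set size, as described above.
SCHEDULE = {
    400:     {"eval_every": 50,   "max_epochs": 6250},
    2_000:   {"eval_every": 100,  "max_epochs": 1250},
    10_000:  {"eval_every": 250,  "max_epochs": 250},
    50_000:  {"eval_every": 500,  "max_epochs": 50},
    250_000: {"eval_every": 1000, "max_epochs": 10},
}
```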

B Additional scores
Figure 12 shows our main results in terms of BLEU. We can see that the trends of the BLEU scores are very similar to those of the chrF++ scores, and generally favor ByT5. We also provide Table 5, which shows the raw chrF++ scores that are used to create Figures 4 and 5.
All individual scores used to create the figures in this work (totaling over 24 thousand values), as well as the model weights and model outputs, are provided.

Figure 2 :
Figure 2: Translation quality using chrF++ (left) and COMET (right) of mT5 and ByT5 when fine-tuned and tested on German↔English and Russian↔English. Stars indicate a significant difference (p < 0.05) in a paired t-test between the two respective models.

Figure 3 :
Figure 3 shows the average translation quality of {German, Russian}↔English fine-tuned models tested on all 204 languages from the FLoRes-200 dataset (X→English). Overall, we see that ByT5 continues to outperform mT5 in lower-resource scenarios. However, its zero-shot performance drastically decreases in several cases above 10k training examples, while mT5 continues to perform well up to 250k examples, with only a slight dip in performance comparatively or no dip at all. We further investigate the large degradation of ByT5 in Section 5.3, but first, we take a closer look at language-specific results on German→English with 10k training examples. In Figure 4, we see the performance difference of ByT5 and mT5 for each source language, plotted in their respective geographic locations (locations sourced from https://glottolog.org/glottolog/language).

Figure 4 :
Figure 4: Map of all languages in FLoRes-200, colored according to which large model performed better (red "△" if ByT5, cyan "▽" if mT5) when fine-tuned on 10k German→English examples.

Figure 7 :
Figure 7: Left: Average ByT5-large source importance across byte positions within a sentence. Values are smoothed using a 10-point rolling window. Right: Average ByT5-large source importance across bytes within a word, normalized by sentence position (left plot) to account for the effect of decreasing source importance.

Figure 8 :
Figure 8: Left: Source contribution across a word's characters for ByT5-large fine-tuned on deu→eng translation with different numbers of fine-tuning examples (0.4k to 250k). Right: Source contributions for same-language (deu→eng) and zero-shot cross-lingual translation using the same 250k-example model.

Figure 9 :
Figure 9: Word-level accuracy deltas for large models at different orthographic similarity levels, trained on different numbers of examples (0.4k, 2k, 10k).
Figure 11: Word-level accuracy delta on the original source (German) and zero-shot (X→English, averaged) languages, using large German→English models with 10k training pairs. Bins are based on frequencies in the English fine-tuning data.

Figure 12 :
Figure 12: BLEU scores of mT5 and ByT5 on German↔English and Russian↔English. Stars indicate a significant difference (p < 0.05) in a paired t-test between the two respective models.
chrF++ scores of large mT5 and ByT5 models fine-tuned on 10k German→English examples.

Table 3 :
Per-word copying accuracies and sentence-level chrF++ scores of large German→English models on the control test set (top, middle), and chrF++ scores on the original test set (bottom).

Table 4 :
The training and inference speeds for the German→English experiments. Epochs reported use the models fine-tuned on 10 thousand pairs, with early stopping. The best result per model size and column is shown in bold.