Eliciting the Translation Ability of Large Language Models via Multilingual Finetuning with Translation Instructions

Large language models (LLMs), such as ChatGPT and GPT-4, have shown strong abilities in multilingual translation without being explicitly trained on parallel corpora. It is intriguing how LLMs obtain their ability to carry out translation instructions for different languages. In this paper, we present a detailed analysis by finetuning a multilingual pretrained language model, XGLM-7.5B, to perform multilingual translation following given instructions. Firstly, we show that multilingual LLMs have stronger translation abilities than previously demonstrated. For a given language, translation performance depends on its similarity to English and the amount of data used in the pretraining phase. Secondly, we find that LLMs' ability to carry out translation instructions relies on understanding the instructions and on the alignment among different languages. With multilingual finetuning on translation instructions, LLMs can learn to perform the translation task well even for language pairs unseen during the instruction tuning phase.


Introduction
The emergence of large language models (LLMs) (Brown et al., 2020; OpenAI, 2023) has revolutionized the research of machine translation (Hendy et al., 2023; Garcia et al., 2023). These models have demonstrated remarkable multilingual translation capabilities without requiring explicit training on parallel corpora. For instance, XGLM, a medium-sized multilingual language model, outperforms supervised models using only several examples as demonstrations (Lin et al., 2022); the cutting-edge LLM GPT-4 has been shown to perform comparably to commercial translation systems on multiple language pairs (Jiao et al., 2023b).
Most existing research on LLMs for machine translation focuses on in-context learning (ICL), i.e. taking several parallel sentences as demonstrations to guide LLMs to perform translation (Vilar et al., 2023; Agrawal et al., 2023; Hendy et al., 2023; Zhu et al., 2023). However, these methods rely heavily on the in-context learning ability of LLMs. For smaller models, e.g. models with only 1B or 7B parameters, the relatively weak ICL ability may result in an underestimation of their potential translation ability.
Instead of relying on ICL abilities, we propose to investigate the ability of LLMs by directly training them to follow translation instructions. Inspired by the recent success of instruction tuning (Wei et al., 2022; Chung et al., 2022), we organize multilingual translation tasks as different instances of the translation instruction, with each instance corresponding to a specific language pair. By training the LLMs to follow these instructions, i.e. with multilingual Finetuning with Translation Instructions (mFTI), it is possible to better elicit the translation ability inside LLMs.
Our results show that by training on a mixed dataset of 1,000 sentences per language pair, mFTI outperforms 8-shot in-context learning by nearly 3 BLEU on average, showing a greater potential of LLMs' translation ability than previously demonstrated (Lin et al., 2022). In addition, we also discuss how mFTI improves the LLMs and which factors influence the performance.
To better understand why LLMs can follow these instructions, we design an mFTI setting where only a subset of the translation instructions, i.e. language pairs, is used for training. LLMs thus need to generalize their instruction-following abilities to language pairs unseen during mFTI. Surprisingly, mFTI elicits the translation ability not only for trained language pairs but also for those unseen during instruction training. With further experiments and analyses, we find that LLMs can learn the translation behavior in general by being trained to translate even irrelevant language pairs. It is also interesting that with mFTI, LLMs learn to directly align languages through the use of pivot languages, which enhances the instruction-following ability for unseen language pairs.

Multilingual Finetuning with Translation Instructions

Overall Framework
Given a corpus of multilingual parallel sentences and their languages M = {(l_i^s, l_i^t, x_i, y_i)}, where l_i^s and l_i^t are the names of the source and target language of the i-th parallel sentence (x_i, y_i), respectively, mFTI leverages an instruction template T to organize the corpus M into a language modeling dataset D. Each sequence d_i in D is an instantiation of the translation instruction with a specific sentence pair:

d_i = T(l_i^s, l_i^t, x_i, y_i).

The parameters of the LLM are then optimized using the standard next-token-prediction objective on D:

L(θ) = − Σ_{d∈D} Σ_{t=1}^{|d|} log p_θ(d_t | d_{<t}),

where θ are the parameters of the LLM. The instruction template we adopt is

Translation: [l^s]: x [l^t]: y

where the prefix "Translation:" indicates the translation task and the pattern "[·]:" identifies the name of the specific language.
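As a concrete illustration, instantiating the template over a parallel corpus can be sketched as follows (the line breaks inside the template are an assumption, since the paper prints it inline; the helper names are ours):

```python
def build_instruction(src_lang, tgt_lang, src_text, tgt_text):
    # Instantiate the paper's template:
    #   Translation:
    #   [src_lang]: src_text
    #   [tgt_lang]: tgt_text
    # (exact whitespace/newlines are our guess from the flattened text)
    return f"Translation:\n[{src_lang}]: {src_text}\n[{tgt_lang}]: {tgt_text}"

def build_dataset(corpus):
    # corpus: iterable of (src_lang, tgt_lang, src_text, tgt_text) tuples,
    # one per parallel sentence, as in the definition of M above
    return [build_instruction(*example) for example in corpus]

corpus = [("English", "French", "Good morning.", "Bonjour.")]
print(build_dataset(corpus)[0])
```

Each resulting string is then treated as one language-modeling example in D.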

Experiment Setup
Backbone Language Model We consider XGLM-7.5B (Lin et al., 2022) as our backbone language model. XGLM-7.5B is a multilingual auto-regressive language model trained on a corpus of 500 billion tokens comprising 30 diverse languages. Low-resource languages were up-sampled during training, making it an ideal backbone model for multilingual translation research.
Languages Following Lin et al. (2022), our evaluation involves 13 languages covered in the pretraining corpus of XGLM.

Evaluation
Datasets Following previous works (Lin et al., 2022), we evaluate translation models on the FLORES-101 dataset (Goyal et al., 2022), which provides manual translations of 1012 sentences in 101 languages.
Finetuning Datasets Our finetuning dataset primarily comes from WikiMatrix (Schwenk et al., 2021). WikiMatrix provides parallel corpora for 1620 different language pairs, including many non-English language pairs, which enables a systematic investigation of the translation of languages other than English. We also leverage the MultiCCAligned (El-Kishky et al., 2020) corpus for language pairs that are not contained in WikiMatrix, namely Hi-Sw, Ko-Sw, Ta-Sw, Sw-Hi, Sw-Ko and Sw-Ta.

Optimization Details
We finetune all models using the Adam (Kingma and Ba, 2014) optimizer with the learning rate fixed at 5e-6. We use a fixed batch size of 80 sentences and finetune models for 1 epoch or 2000 steps (depending on the size of the training corpus) in all experiments.
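The stopping criterion can be read as "one epoch over the corpus, capped at 2000 optimizer steps"; a minimal sketch of that schedule under this reading (the helper name is hypothetical):

```python
import math

# Batch size and step cap from the setup above.
BATCH_SIZE = 80
MAX_STEPS = 2000

def num_optimizer_steps(num_sentences):
    # "1 epoch or 2000 steps": one pass over the corpus, capped at 2000
    # optimizer steps (our reading of the description above)
    steps_per_epoch = math.ceil(num_sentences / BATCH_SIZE)
    return min(steps_per_epoch, MAX_STEPS)

# 156 language pairs x 1000 sentences each fits within the cap:
print(num_optimizer_steps(156 * 1000))  # -> 1950
```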

Understanding the Potential Translation Ability of LLMs
In this section, we first assess the overall translation performance of mFTI by comparing it to few-shot in-context learning. We then present a detailed analysis of how the corpus used for mFTI influences translation quality.

Translation Ability of LLMs
We finetune XGLM on 156 language pairs spanning all 13 languages. Since our goal is to elicit the translation ability of LLMs using a small number of examples, we limit the number of parallel sentences to 1000 per language pair.
mFTI Better Elicits Translation Ability than Few-shot ICL. Figure 1 shows the average BLEU for translation to and from language X, respectively. Full results on each language direction can be found in Appendix A. It is clear that mFTI leads to better translation performance than 8-shot ICL for all language pairs (3 BLEU on average). For some languages, the gap is up to 8 BLEU (e.g. translating into Catalan). This demonstrates the effectiveness of mFTI in eliciting LLMs' translation ability, and shows that LLMs have a greater potential for multilingual translation than ICL suggests (Lin et al., 2022).
Even for translating to and from English, mFTI still outperforms 8-shot ICL, though with a much smaller gap. This indicates that with ICL, LLMs are better at performing tasks that involve English than tasks involving other languages, but they still have the potential to perform even better.
XGLM is still an English-centric Model. The translation performance for each language varies greatly. Considering that the number of sentences used in mFTI is the same for each language, one may suspect that the translation performance of each language largely depends on the amount of its pretraining data. For this reason, the languages in Figure 1 are listed in descending order of their data amount in the XGLM pretraining. However, there are clear fluctuations. For example, Russian and Chinese are the two languages with the largest portion of pretraining data other than English, but their translation performance is much worse than that of some other languages such as French.
We calculate the Spearman correlation between the translation performance and two possible influencing factors, namely data amount in pretraining and similarity to English. For data amount, we use the size of the pretraining corpus reported in Lin et al. (2022). For similarity to English, we use lang2vec, a toolkit for querying the URIEL typological database, to get each language's feature vectors from different perspectives including geography, syntax, phylogeny, phonology and inventory. As shown in Table 1, the translation performance indeed has a positive correlation with data amount in pretraining (0.39/0.36). However, the similarity between a specific language and English plays a more important role in determining the final translation performance: all considered features demonstrate a higher correlation coefficient than the data amount in pretraining. This indicates that XGLM is still a predominantly English-centric model. Based on these observations, we suggest taking the relations between different languages into consideration when collecting and sampling data for pretraining multilingual LLMs.
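The coefficients in Table 1 can be computed with any statistics package (e.g. scipy.stats.spearmanr); a dependency-free sketch of Spearman's rank correlation, using average ranks for ties, looks like this (the helper names are ours, and the inputs would be per-language BLEU scores paired with pretraining sizes or lang2vec similarities):

```python
def average_ranks(values):
    # 1-based ranks, with tied values sharing their average rank
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    # Spearman's rho = Pearson correlation of the two rank vectors
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

A monotonically increasing pairing yields rho close to 1, a reversed one close to -1.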
It is Not Trivial for LLM-based Models to Outperform Conventional Supervised MT Models. To better position the performance of mFTI, we compare it with two conventional supervised MT models, M2M-1.2B (Fan et al., 2020) and NLLB-3B (Costa-jussà et al., 2022), in Figure 2. We can see that although mFTI significantly improves over 8-shot ICL and sometimes achieves performance comparable to M2M-615M, it still lags behind the stronger NLLB-3B by a large margin, highlighting the challenge of adopting a medium-sized LLM to outperform large-scale supervised MT models.

mFTI Brings Consistent Improvements across Different Metrics, LLMs and Finetuning Strategies
To understand the universal effectiveness of mFTI, we present experiments on more LLMs, i.e. BLOOM-7b1 (Scao et al., 2022) and LLaMA (Touvron et al., 2023), and on the parameter-efficient finetuning strategy LoRA (Hu et al., 2022). We report the performance averaged over 156 translation directions, evaluated by both sacreBLEU (Post, 2018) and COMET (Rei et al., 2022), in Table 2. Firstly, we can see that methods based on XGLM-7.5B perform significantly better than those based on BLOOM-7B and LLaMA-7B. This is because many low-resource languages are ill-represented in BLOOM and LLaMA. Secondly, mFTI consistently outperforms 8-shot ICL in terms of BLEU and COMET on all three studied LLMs, regardless of the finetuning strategy, which demonstrates its universal effectiveness in different scenarios. Contrary to previous findings (Jiao et al., 2023a), we did not find that LoRA performs better than full finetuning. We hypothesize that learning translation on 156 pairs simultaneously is more challenging and requires more model capacity, making full finetuning a better choice than LoRA in this scenario.

mFTI Enhances Direct Language Alignment
A distinct difference between ICL and mFTI is that mFTI can learn from more parallel sentences and update the model if needed. It is interesting to see what changes after the update. Many previous works (Zhang et al., 2023; Jiao et al., 2023b) have shown that translating by pivoting through English significantly improves ICL's translation performance. To this end, we compare the performance gains of pivot translation for ICL and mFTI, respectively. Figure 3 presents the results. Each value in the grid is the BLEU difference before and after pivoting through English. We can first observe that pivoting through English indeed improves translation performance for ICL, by up to 10 BLEU for some language pairs. However, after mFTI, the gap is significantly reduced. Considering that mFTI achieves on average 3 BLEU more than ICL, the reduced benefit of pivoting through English compared to direct translation may indicate a better direct alignment between languages.

Influencing Factors of mFTI
Quality of Finetuning Corpus is Crucial. Recent work on instruction tuning demonstrates that the quality of instruction data is crucial for achieving good performance. We observe a similar trend when performing mFTI. Specifically, we construct high- and low-quality finetuning corpora by selecting parallel sentences from the full set according to their attached LASER similarity scores. According to the results in Table 3, finetuning with high-quality parallel sentences improves the BLEU score by around 2 points compared to finetuning with low-quality ones.

Translation Performance Scales Log-linearly with Data and Model Size. Figure 4 shows the translation performance when varying the number of training examples per language pair (1k, 2k, 4k, 8k, 16k, 32k) and the number of model parameters (564M, 1.7B, 2.9B, 4.5B, 7.5B). As we can see, performance follows a standard log-linear scaling law in terms of both the number of training examples and the model size, which is consistent with findings in previous work (Kaplan et al., 2020).

Understanding the Ability of Carrying Out Translation Instructions
In this section, we present a comprehensive analysis of how mFTI improves the model's ability to carry out translation instructions. We begin with an overarching experiment in which we intentionally withhold certain language pairs during the mFTI process, which allows us to study the model's ability to carry out translation instructions under different conditions.
Furthermore, we deepen our analysis by exploring how mFTI enhances LLMs' ability to carry out translation instructions from the following perspectives: better understanding of translation instructions (Section 4.3 and Section 4.4) and better alignment between languages to execute translation instructions (Section 4.5).

Manipulating Conditions
In Section 3, we presented results in a fully supervised setting, where all test language pairs are seen during instruction tuning. To provide further insight into LLMs' generalization across language pairs, we simulate a more realistic scenario in which source and/or target language sentences may be absent during instruction tuning. More specifically, from the 13 selected languages, we hold out 6 as unseen languages. We further partition the remaining 7 languages into three groups: Only-Source (languages that only appear on the source side), Only-Target (languages that only appear on the target side) and Source-Target (languages that appear on both the source and target side). We then form language pairs from these partitions following the requirements of each partition. This allows us to assess mFTI's performance under the following conditions:
• Seen Both Sides. Both the source and target language appear in the finetuning corpus. This can be further divided into:
- Same Direction. The same translation direction is trained during mFTI.
- Reversed Direction. The same translation direction does not appear during training, but the reversed direction does.
- Unseen Direction. Neither the same nor the reversed direction appears during training.
• Unseen Src. Only the target language sentences appear during training.
• Unseen Tgt. Only the source language sentences appear during training.
• Unseen Both Sides. Neither source nor target language sentences appear in the finetuning corpus.

mFTI Learns to Follow Translation Instruction across Conditions
We finetune XGLM on the corpus described in the previous section. Since there are 16 language directions in the training corpus, we denote the finetuned model as mFTI-16. The model finetuned on all language pairs is denoted as mFTI-all. Table 4 shows the results.
mFTI-16 Brings Improvements in Most Settings, Yet Much Less Than mFTI-all. Firstly, we can see that mFTI-16 brings improvements in most settings except Reversed Direction, demonstrating its effectiveness. However, the improvements are smaller than those of mFTI-all, even for the Same Direction partition. This can be attributed to fewer language pairs during finetuning, which we discuss in Section 4.3.

Table 4: Translation performance under different data conditions. mFTI-16: XGLM multilingual finetuned with translation instructions on a mixture of 16 language pairs described in Section 4.1.
Seeing Target Languages When Finetuning is Better Than Source Languages. When there are unseen languages in the language direction, the improvement on Unseen Src is much larger than on Unseen Tgt, indicating that understanding the specified target language may be more important than understanding the source language.
Unseen Both Sides Also Benefits From mFTI Training. The most surprising phenomenon is that language pairs from the Unseen Both Sides partition also benefit from mFTI, with an improvement of 0.7 BLEU compared to 8-shot ICL. Since mFTI-16 does not see any sentences of the source and target languages, the improvements indicate a better understanding of the translation instruction, which we will discuss in Section 4.4.

Instruction Tuning with More Language Pairs Leads to Better Translation Performance
Previous instruction-tuning works show that scaling the number of tasks significantly benefits unseen tasks (Chung et al., 2022). Observing the performance gap on Same Direction between mFTI-16 and mFTI-all, we gradually add more language pairs to mFTI-16 and plot the translation performance on each partition in Figure 5. To isolate the possible effects of additional monolingual sentences, we only add language pairs that exclude the studied 13 languages. As the number of language pairs grows, the translation performance of all partitions generally increases, validating the importance of more language pairs. Notably, the performance of the Reversed Direction partition is significantly boosted, outperforming 8-shot ICL by a large margin when increasing the number of language pairs from 16 to 30.
Surprisingly, the performance of the Unseen Both Sides partition improves the most (the added language pairs are detailed in Appendix D). Since no data of language pairs in Unseen Both Sides is added, this indicates that the instruction-following ability on these language pairs has been significantly enhanced, which we discuss in the next section.

mFTI Generalizes the Understanding of Translation Instruction to Unseen Directions
In this section, we aim to understand how mFTI facilitates the understanding of instructions from a more fine-grained view, i.e. specific language directions and instruction-following errors. For the language directions, we select Ru→Fr (high-resource), Bg→Ar (medium-resource) and Ca→Ta (low-resource) from the Unseen Both Sides partition to study mFTI's effectiveness under different resource settings.
For instruction errors, we identify the following four major problems in translations:
• Source Copy (SC): The model simply copies the source sentence as the translation without making any meaningful changes. We identify this error by calculating the sentence-level BLEU score between the translation and the source sentence; a BLEU score above 80 indicates that the translation is nearly identical to the source.
• Off-Target translation (OT): The model fails to produce sentences in the target language. We detect this error with a language identification tool, such as fastText, applied to the generated translations.
• Over/Under translation (OU): The model produces translations that are significantly longer or shorter than the references. We consider translations with a length ratio above 2 or below 0.5 as over- or under-translations, respectively.
• Oscillatory Hallucination (OH): The model gets stuck in a specific translation state and generates repeated n-grams until reaching the maximum length. We define translations containing n-grams that consecutively repeat at least three times as oscillatory hallucinations.
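Two of these checks (OU and OH) are simple enough to sketch directly; SC and OT additionally require a sentence-BLEU implementation and a language identifier (e.g. sacrebleu and fastText) and are omitted here. Whitespace tokenization and the helper names are our assumptions:

```python
def length_ratio_error(hyp, ref):
    # Over/under translation (OU): hypothesis longer than 2x or shorter
    # than 0.5x the reference (measured here in whitespace tokens; the
    # paper does not specify the length unit)
    ratio = len(hyp.split()) / max(len(ref.split()), 1)
    return ratio > 2 or ratio < 0.5

def oscillatory_hallucination(hyp, max_n=4):
    # OH: some n-gram repeats at least three times consecutively
    toks = hyp.split()
    for n in range(1, max_n + 1):
        for i in range(len(toks) - 3 * n + 1):
            gram = toks[i:i + n]
            if toks[i + n:i + 2 * n] == gram and toks[i + 2 * n:i + 3 * n] == gram:
                return True
    return False
```

Each test translation is flagged with whichever error types fire, giving the per-error ratios plotted later.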

Adding Irrelevant Language Pairs Reduces SC, OT and OU Ratios
In Section 4.3, we showed that additional language pairs in mFTI lead to improved BLEU scores even for the Unseen Both Sides partition. We provide an in-depth analysis here from the aforementioned fine-grained views. We plot the trends of translation and instruction-following performance, and the ratios of the 4 specific instruction-following errors, as the number of additional language pairs grows. The results are in Figure 6.

Table 5: BLEU score, off-target ratio and oscillatory hallucination ratio before and after adding monolingual sentences to the finetuning corpus. Scores where adding monolingual sentences leads to improved quality are shown with a green background.
More Language Pairs Reduce Instruction-Following Errors and Improve Translation Performance. Firstly, we can see that as more language pairs are added to the training corpus, instruction-following errors on Unseen-both language pairs are gradually reduced, leading to improvements in BLEU scores. Comparing different language pairs, we see that high- and medium-resource language pairs generally perform better than low-resource language pairs on all four types of errors. Since all these language directions are unseen during instruction finetuning, this highlights the importance of language skills acquired during the pretraining phase.
SC: Solved. It can be observed that after adding about 30-60 language pairs, the model learns to avoid the SC problem, indicating that this is a relatively easy problem to solve.
OU: Decreased to the Level of mFTI-all. We can further see that adding more language pairs is also effective for reducing OU errors, as the error ratios significantly decrease as the number of language pairs grows. Notably, after scaling the number of language pairs to 150, the OU ratios of the three unseen language pairs are comparable to supervised full finetuning. This demonstrates the effectiveness of mFTI.
OT: Decreased, but Not to a Satisfactory Level. Turning to the OT ratio, we observe that it also decreases as the number of language pairs grows. However, even after scaling the number of language pairs to 150, the OT ratio still cannot be reduced to the level of mFTI-all.
OH: No Effect. Finally, we can see that with the increase in the number of language pairs, the OH ratio does not show a clear decreasing trend, which we further discuss in the next section.

Joint Training with Monolingual Generation Instructions Helps Reduce OH and OT Problems More Efficiently
In the previous section, we found that the off-target (OT) and oscillatory hallucination (OH) problems on some language pairs cannot be fully solved to the level of mFTI-all by adding more irrelevant language pairs. We note that both problems relate only to the target language: the OT problem can be attributed to the model's inability to relate target language names to the corresponding scripts, and the OH problem might be caused by poor modeling of the target languages. We hypothesize that finetuning models on monolingual generation instructions, i.e. given a language name, generate fluent sentences in that language, should help ease these problems.
To this end, we organize the monolingual sentences of the held-out languages into monolingual generation instructions, using the template "[l_i]: y". We then finetune XGLM on the dataset composed of translation instructions and these monolingual generation instructions.
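For concreteness, instantiating this monolingual template looks like the following (the spacing around the colon is our guess from the flattened text, and the helper name is ours):

```python
def mono_instruction(lang, sentence):
    # The monolingual generation template "[l_i]: y" described above:
    # given a language name, the model should generate a fluent sentence
    return f"[{lang}]: {sentence}"

print(mono_instruction("French", "Bonjour."))
```

These strings are simply mixed into the finetuning dataset alongside the translation instructions.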
We report the BLEU score, OT ratio and OH ratio in Table 5. Firstly, we can see that adding monolingual generation instructions for the three Unseen Both Sides language pairs helps mitigate the OT and OH problems in most scenarios, leading to better translation performance. Notably, by combining more irrelevant language pairs and monolingual sentences, the gap between mFTI-150 with monolingual sentences and mFTI-all has significantly diminished, even though the model has never seen parallel sentences of the tested languages.

mFTI Improves Language Alignment via Pivot Languages
Besides the understanding of the translation instruction, another crucial piece of knowledge that models must grasp to carry out the instruction is the alignment between source and target languages. However, in scenarios where direct parallel sentences are not available, models have limited access to alignment information. This situation resembles the zero-shot setting commonly studied in multilingual translation research (Gu et al., 2019; Zhang et al., 2020; Arivazhagan et al., 2019; Liu et al., 2021). In this section, we investigate the ability of mFTI to establish meaningful alignments through pivot languages in this scenario.
Specifically, for the three Unseen Both Sides language pairs X→Y studied in the previous section, i.e. Ru→Fr, Bg→Ar and Ca→Ta, we start from the mFTI-150 setting and add parallel sentences of X→En and En→Y to the training corpus. We then perform mFTI on these augmented corpora and evaluate the model on test sentences that do not contain instruction-following errors. Since knowledge of language alignments is the last requirement for carrying out translation instructions once the model has learned to execute them correctly, the performance on these sentences serves as a reliable indicator of the model's proficiency in language alignment. The results are in Table 6. First, we can see that mFTI-150 and 8-shot ICL perform comparably, both significantly worse than mFTI-all. Since the three tested language pairs are unseen in mFTI-150, this indicates that, similar to mFTI-150, the main role of ICL is to enhance the model's understanding of the translation behavior rather than to provide source-target alignment knowledge. However, after adding pivot parallel sentences, the model's performance (+pivot) is significantly boosted. This demonstrates the potential of mFTI to leverage pivot languages to improve direct alignment between languages and thereby translation performance.

Related Work

(2023b) have studied the translation quality of various GPT-3 models and found their performance to be comparable to commercial translation systems on high-resource language pairs. In contrast to these works, our research focuses on exploring existing LLMs' translation ability by directly tuning them to follow translation instructions. The most similar work to ours is Jiao et al.
(2023a), which finetunes an open-source LLM, LLaMA (Touvron et al., 2023), on a mix of translation data and the Alpaca instruction dataset (Taori et al., 2023) to make it a better translator. However, they mainly focus on the bilingual translation setting, while our work investigates multilingual generalization when finetuning LLMs to carry out translation instructions.

Generalization On Unseen Language Pairs
Our work is also closely related to zero-shot translation in the multilingual translation setting, where there are no direct parallel sentences between the source and target languages. Zero-shot translation has two major problems: generating the correct language and learning universal language representations. For the first problem, prior works (e.g. Gu et al., 2019) impose regularization on the encoder/decoder to make the model more aware of the target language. Unlike these works, we discuss the off-target problem in the context of LLMs, and find that adding both irrelevant language pairs and additional monolingual sentences can ease the problem to a great extent.
For the second problem, previous works focus on learning language-agnostic representations through additional regularization of model representations (Arivazhagan et al., 2019; Pan et al., 2021) and consistency between semantically equivalent sentences (Al-Shedivat and Parikh, 2019; Yang et al., 2021). Instead, our work mainly aims to reveal the helpfulness of multilingual finetuning of LLMs for unseen language pairs by internalizing pivot language information. Furthermore, our discussion encompasses a more stringent version of zero-shot translation, where neither source nor target language sentences are present in the finetuning corpus. This demands a stronger generalization ability, as the model must effectively utilize the language knowledge acquired during pretraining and the translation task knowledge acquired during finetuning to generate high-quality translations.

Instruction Finetuning
Our work focuses on finetuning LLMs with instructions to improve zero-shot translation performance. Prior works have demonstrated that LLMs face great difficulty in achieving good zero-shot performance when lacking few-shot examples. Nevertheless, finetuning LLMs on a variety of tasks can significantly improve zero-shot performance on several tasks. For instance, Wei et al. (2022) aim to improve generalization to unseen tasks by performing instruction tuning. Muennighoff et al. (2023) further extend this to finetuning LLMs on multilingual data instead of English data and find that multilingual finetuning leads to better performance on unseen tasks and unseen languages. Chung et al. (2022) explore instruction tuning from the perspective of the number of tasks in the finetuning corpus and the LLM size, and find that scaling these factors can dramatically improve zero-shot performance.
In our work, we primarily focus on the translation performance of LLMs. We consider the factors mentioned above, including the scale of the finetuning corpus, the size of model parameters, and the language selection within the finetuning corpus, for a comprehensive analysis of the translation performance of LLMs. Additionally, we conduct a detailed analysis of the model's ability to understand and execute translation instructions after instruction finetuning.

Conclusion
In this paper, we explore Multilingual Finetuning with Translation Instructions (mFTI) to better unleash the translation ability of multilingual LLMs. Through extensive experiments, we demonstrate that by training on a mixture of 1000 sentences per language pair, mFTI achieves better performance than 8-shot ICL, indicating that previous works have underestimated the translation potential of LLMs.
Moreover, we systematically discuss the working mechanism of mFTI by analyzing it from the view of instruction following. Our experiments demonstrate that mFTI helps the model better follow instructions by introducing more language pairs and monolingual sentences, and enhances direct language alignment by learning from pivot language pairs. Our paper also unveils remaining translation issues when adopting LLMs for zero-shot machine translation, i.e. over/under translation, oscillatory hallucination, and mistranslation caused by incorrect alignments. Future work should focus on acquiring more language knowledge during the pretraining phase and designing better regularization terms to solve these problems.

Figure 1: Translation performance of 8-shot ICL and mFTI using 1000 sentences per language pair. Languages are ordered by the data amount in the pretraining corpus.

Figure 2: Comparison of mFTI with conventional supervised machine translation models. Performances are evaluated in BLEU.

Figure 3: Changes of BLEU score after pivoting through English for 8-shot ICL and mFTI.

Figure 4: The translation performance of finetuned XGLM as the number of model parameters and training examples scales.

Figure 5: Translation performance on different partitions as the number of language pairs grows. Left: partitions where sentences of both the source and target languages are seen during training. Right: partitions where source and/or target language sentences are unseen during training.

Figure 6: Trends of translation and instruction-following performance on the 3 Unseen-both language pairs when scaling up the number of language pairs during mFTI. The left 2 figures show the BLEU score and overall instruction-following error ratios, respectively. The remaining 4 figures show the ratios of the 4 specific error types, i.e. source copy, off-target, over/under translation, and oscillatory hallucination. The X-axis denotes the number of training language pairs. The Y-axis denotes the percentage of translations with specific error types.

Table 2: Averaged translation performance on all 156 language pairs of 8-shot ICL and mFTI using different LLMs and finetuning strategies.

Table 3: The translation performance of finetuned XGLM as the quality of the finetuning corpus varies.