FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation

We present FRMT, a new dataset and evaluation benchmark for Few-shot Region-aware Machine Translation, a type of style-targeted translation. The dataset consists of professional translations from English into two regional variants each of Portuguese and Mandarin Chinese. Source documents are selected to enable detailed analysis of phenomena of interest, including lexically distinct terms and distractor terms. We explore automatic evaluation metrics for FRMT and validate their correlation with expert human evaluation across both region-matched and mismatched rating scenarios. Finally, we present a number of baseline models for this task, and offer guidelines for how researchers can train, evaluate, and compare their own models. Our dataset and evaluation code are publicly available: https://bit.ly/frmt-task.


Introduction
Machine translation (MT) has made rapid advances in recent years, achieving impressive performance for many language pairs, especially those with high amounts of parallel data available. Although the MT task is typically specified at the coarse level of a language (e.g. Spanish or Hindi), some prior work has explored finer-grained distinctions, such as between regional varieties of Arabic (Zbib et al., 2012), or specific levels of politeness in German (Sennrich et al., 2016). Unfortunately, most approaches to style-targeted translation thus far rely on large, labeled training corpora (Zbib et al., 2012; Lakew et al., 2018; Costa-jussà et al., 2018; Honnet et al., 2018; Sajjad et al., 2020; Wan et al., 2020; Kumar et al., 2021), and in many cases these resources are unavailable or expensive to create. We explore a setting for MT where unlabeled training data is plentiful for the desired language pair, but only a few parallel examples (0-100, called "exemplars") are annotated for the target varieties. As a specific use-case, we examine translation into regional varieties: Brazilian vs. European Portuguese and Mainland vs. Taiwan Mandarin. While these varieties are mutually intelligible, they often exhibit lexical, syntactic, or orthographic differences that can negatively impact an MT user's experience. Figure 1 illustrates the use of exemplars to control the regional variety at inference time.
MT systems that do not support region or style distinctions may be biased toward varieties with more available data (the "web-majority" varieties). We observe this bias in a widely used proprietary MT system, with measurable negative effects for speakers of web-minority varieties (§6.2). One barrier to further research on this issue is the lack of a high-quality evaluation benchmark. Thus, to encourage more access to language technologies for speakers of web-minority varieties and more equitable NLP research, we make the following contributions: (1) We construct and release FRMT, a new dataset for evaluating few-shot region-aware translation from English to Brazilian/European Portuguese and Mainland/Taiwan Mandarin. (2) We evaluate predictions from a number of existing and custom-trained baseline systems on the FRMT task using automatic metrics. (3) We conduct detailed human evaluations of gold and model-based translations on FRMT, under all combinations of rater region and target region. (4) We analyze the correlation of automatic metrics and human evaluations on FRMT, and propose a new targeted metric for lexical accuracy.

Related Work
Textual style transfer aims to control fine-grained stylistic features of generated text. Earlier work leverages supervised parallel data (Jhamtani et al., 2017); later work assumes labeled but non-parallel training data (Shen et al., 2017; Li et al., 2018; Niu et al., 2018a), or forgoes training-time labels entirely, as in our setting, relying only on few-shot exemplars provided at inference time (Xu et al., 2020; Riley et al., 2021; Garcia et al., 2021). However, style transfer evaluation protocols are known to be lacking (Pang and Gimpel, 2019; Briakou et al., 2021; Hu et al., 2022), due to the underspecification of stylistic attributes (e.g. formality, sentiment) and the absence of standardization across studies. Region-aware translation addresses these issues, providing a test-bed for exploring few-shot attribute control: MT evaluation methods are relatively mature, and many regional language varieties can be sufficiently delineated for the task.
Previous work has explored many sub-types of variety-targeted MT. Region-aware MT targets specific regions or dialects (Zbib et al., 2012; Costa-jussà et al., 2018; Honnet et al., 2018; Lakew et al., 2018; Sajjad et al., 2020; Wan et al., 2020; Kumar et al., 2021); formality-aware MT targets different formality levels (Niu et al., 2017, 2018b; Wang et al., 2019); and personalized MT aims to match an individual's specific style (Michel and Neubig, 2018; Vincent, 2021). However, with few exceptions (e.g. Garcia et al. 2021), these works assume the availability of large-scale datasets containing examples with the target varieties explicitly labeled. In the present work, we design a benchmark that emphasizes few-shot adaptability. Although our dataset is limited to four regions and two languages, the few-shot setup and the high degree of linguistic dissimilarity between the selected languages mean that approaches performing well on the entire FRMT benchmark can be expected to generalize reasonably well to other languages, other regions, and other stylistic attributes.
Several existing parallel corpora cover regional language varieties, but have limitations that motivate us to construct a new high-quality, targeted dataset. e-PACT (Barreiro and Mota, 2017) comprises translations from English books into Portuguese variants, but is small and not easily accessible. OpenSubtitles (Lison et al., 2018) skews toward shorter utterances and is noisy due to automatic alignment. WIT3 (Cettolo et al., 2012) provides translations of TED-talk transcripts into many languages, but relies on volunteer translators, which may limit quality.
Popular shared tasks have not included region-targeted translation either: the Conference on Machine Translation (WMT) has included translation between similar languages (e.g. Akhbardeh et al., 2021), while the Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial) focuses mainly on classification rather than translation (e.g. Zampieri et al., 2021).
Furthermore, we are not aware of previous work that (1) measures deltas in human evaluation metrics between the region-matched and region-mismatched settings, (2) correlates these with automated metrics, (3) offers tailored subtasks targeting region-differentiated lexical items and region-biased distractors, or (4) defines targeted metrics testing region-appropriateness.

FRMT Dataset
We introduce the FRMT dataset for evaluating the quality of few-shot region-aware machine translation. The dataset covers two regions each for Portuguese (Brazil and Portugal) and Mandarin (Mainland China and Taiwan). These languages and varieties were selected for multiple reasons: (1) They have many speakers who can benefit from increased regional support in NLP. (2) Portuguese and Mandarin are linguistically very distinct, coming from different families; we therefore hypothesize that methods that perform well on both are more likely to generalize well to other languages. The dataset was created by sampling English sentences from Wikipedia and acquiring professional human translations in the target regional varieties. Final quality verification is done through manual evaluation by an independent set of translators, using the MQM protocol (Freitag et al., 2021a) that we also employ to evaluate system translation quality.

Data sampling method
FRMT seeks to capture region-specific linguistic differences, as well as potential distractors. To this end, we divide the dataset into three buckets (lexical, entity, random), each containing human translations of sentences extracted from different sets of English Wikipedia articles.1

Lexical: We collect English lexical items for which the best translation into the target language differs depending on the target region. To source these, we rely on blogs and educational websites that list terms differing by region. We further validate each pair of translations by asking a native speaker of each region whether each translation is appropriate for the intended meaning in their region. We filter to only use pairs where exactly one translation is appropriate per region. This is done independently for Portuguese and Mandarin as target languages, yielding lists of 23 and 15 terms, respectively. For each term t, we extract up to 100 sentences from the beginning of the English Wikipedia article with title t.
Entity: We select entities that are strongly associated with the specific regions under consideration (e.g. Lisbon and São Paulo), which may have adversarial effects for models that rely heavily on correlations learned from pretraining. Our selection comprises 38 Mandarin-focused and 34 Portuguese-focused entities. We extract up to 100 source sentences from the beginning of the English Wikipedia article about each selected entity.
Random: For a more naturally-distributed subset, we randomly sample 100 articles from Wikipedia's collections of "featured" or "good" articles.2 Here, we take up to 20 sentences from the start of a randomly chosen section within each article. Unlike the other two buckets, this one features one common set of sentences to be translated into all four target variants.

1 As Wikipedia data source, we use the training split of wiki40b (v1.3.0) by Guo et al. (2020).

Human translation
14 paid professionals translated the selected English texts into the four target language variants: 3 translators per Portuguese region and 4 per Mandarin region. For each region, each sentence was translated by one translator, resulting in one reference per source. Each translator translated non-overlapping chunks of the source data one sentence at a time, in the order of the original text. Sentences that were rejected by at least one translator (e.g. for having too much non-English text) are not included in our dataset.

Corpus statistics
For each bucket, we split our data into exemplar, development (dev), and test data. The exemplars are intended to be the only pairs where the region label is shown to the model, such as via few-shot or in-context learning (Brown et al., 2020). Providing these ensures increased comparability across methods on the FRMT benchmark, in addition to sidestepping potential domain mismatch issues by providing exemplars from the same domain (Wikipedia text) as the evaluation data. Table 1 reports the number of released sentence pairs for each split of the dataset. Sentences from a given Wikipedia page appear only in a single split, ensuring a system cannot "cheat" by memorizing word-region associations from the exemplars, or by overfitting to words and entities while hill-climbing on the dev set.
Table 2 shows example items from each bucket.

Table 2: Examples from the dataset, limited to the Portuguese dev set for brevity. The last two columns show the reference human translations obtained for each region given the English source text (in italics). For the lexical and entity buckets, we show examples for which the Levenshtein edit distance between the two translations is near the median observed for the whole dev set.

Limitations
Our dataset is designed to capture differences in regional varieties, but capturing all such differences in a finite dataset is impossible. While we specifically target lexical differences, the terms were selected via a manual process based on online resources that discuss lexical differences in these languages, and these resources can sometimes be incorrect, outdated, or inattentive to rare words or words with more subtle differences.
Other regional distinctions, such as grammatical differences, were not specifically targeted by our data bucketing process, and thus the degree to which they are captured by the dataset is determined by their likelihood to occur in translations of English Wikipedia text. This also means that differences that only surface in informal settings are unlikely to be included, as Wikipedia text has a generally formal style. While we believe that methods that perform well on all four varieties included in FRMT should be applicable to other languages and varieties, measuring this would require a similar dataset with wider coverage. Constructing such a dataset requires only knowledge of regional differences to inform the selection of source texts, as in our lexical and entity buckets, and translators who are native speakers of the target varieties. An additional pool of MQM-trained translators would be needed to validate the collected translations for regional distinctiveness.
In spite of validation through MQM, it should be noted that the region-targeted translations we collected are not necessarily minimal contrastive pairs, but may include differences arising from factors other than regional variation, such as individual style preferences of the human translators.

Evaluation Metrics
While human judgments are the gold standard for evaluating machine-generated texts, collecting them can be time-consuming and expensive.
For faster iteration, it can be helpful to measure progress against automatic metrics that are known to correlate well with human judgments. We hypothesize that common reference-based MT evaluation metrics might have differing sensitivities to regional differences, and so we conduct a human evaluation of several baseline models (see §6.1) and compute the correlation of several automatic metrics with the human judgments. We also propose a new automated lexical accuracy metric that more directly targets region-awareness.

Human evaluation
To obtain the highest-fidelity human ratings, we use the expert-based Multidimensional Quality Metrics (MQM) evaluation framework proposed by Freitag et al. (2021a) and recommended by the WMT'21 Evaluation Campaign (Freitag et al., 2021b). We show expert raters chunks of 10 contiguous English sentences from our test set with one corresponding set of translations. Raters then identify errors in the translations, assigning a category and severity to each. Due to cost constraints, we evaluate 25% of the test set, evenly distributed across our three evaluation buckets. Within each region, each chunk is rated by three raters, who achieve interannotator consistency of 70.4 ± 2.2 (as 100-scaled intraclass correlation3). Each output is shown to raters of both regions of the corresponding language. All Mandarin outputs are automatically transliterated into the Han script of the rater's region (Mainland: simplified; Taiwan: traditional), using Google Translate "Chinese (Simplified)" ↔ "Chinese (Traditional)".
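To make the MQM scoring concrete, the sketch below computes a chunk-level penalty from per-rater error annotations. The weights are the standard ones from Freitag et al. (2021a); the exact scheme used here may differ, and the category names are illustrative.

```python
# Minimal sketch of chunk-level MQM scoring. Weights assume the standard
# scheme of Freitag et al. (2021a); category names are illustrative.

def error_weight(category: str, severity: str) -> float:
    """Penalty weight for one annotated error."""
    if category == "non-translation":
        return 25.0
    if severity == "major":
        return 5.0
    if category == "fluency/punctuation":
        return 0.1
    return 1.0  # all remaining minor errors

def chunk_mqm(errors_per_rater):
    """Average per-sentence MQM penalty for one 10-sentence chunk.

    `errors_per_rater[i]` lists (category, severity) pairs annotated by
    rater i over the whole chunk. Lower is better; 0 means error-free.
    """
    per_rater = [
        sum(error_weight(c, s) for c, s in errs) / 10.0
        for errs in errors_per_rater
    ]
    return sum(per_rater) / len(per_rater)
```

With three raters per chunk, as in our setup, `errors_per_rater` would have three entries, yielding the average of 30 weighted per-sentence ratings described in §4.3.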

Automatic translation quality metrics
We evaluate the following automatic, reference-based metrics:

BLEU (Papineni et al., 2002): Based on token n-grams, using corpus_bleu from Post (2018).4

chrF (Popović, 2015): Based on character n-gram F1, using corpus_chrf from Post (2018).5

BLEURT (Sellam et al., 2020): A learned, model-based metric that has good correlation with human judgments of translation quality. To the best of our knowledge, BLEURT has not been evaluated with respect to human judgments of region-specific translation quality.
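For intuition about the n-gram metrics, the following is a simplified, single-reference corpus BLEU in pure Python, without tokenization or smoothing; actual FRMT numbers should come from sacrebleu's corpus_bleu, which handles both.

```python
import math
from collections import Counter

def _ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hyps, refs, max_n=4):
    """Simplified corpus-level BLEU over whitespace tokens (one reference)."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hyps, refs):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            overlap = _ngrams(h, n) & _ngrams(r, n)  # clipped n-gram counts
            matches[n - 1] += sum(overlap.values())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if min(matches) == 0:
        return 0.0  # real implementations apply smoothing here
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity * math.exp(log_prec)
```

chrF follows the same pattern but computes an F-score over character n-grams rather than clipped precision over token n-grams.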
As in the human evaluation, all Mandarin outputs are transliterated into the target regional Han script before evaluation.

Correlation
For computing correlation, each data point is a score on a 10-sentence chunk of model output, covering the three models discussed in §6.1, using both matched and mismatched ratings. For MQM, this is the average of 30 weighted ratings: one per sentence per rater. The category/severity weights are described in Freitag et al. (2021a). For BLEU and chrF, which are corpus-level metrics, we take the 10 input/output sentence pairs as the "corpus". For BLEURT, we use the average sentence-level score. Table 3 presents the correlation results, scaled by −100. We observe that the learned BLEURT metrics outperform the non-learned metrics by wide margins, in line with findings from Sellam et al. (2020) that neural methods outperform n-gram-based methods. Additionally, the teacher model (BLEURT) outperforms the distilled student models, with larger students consistently outperforming smaller ones.

3 Using the icc function of R's irr library (Gamer et al., 2019).

4 SacreBLEU version strings for {Portuguese, Mandarin}:
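The chunk-level correlation computation can be sketched as below, assuming Pearson's r (this excerpt does not name the specific correlation statistic); the chunk scores shown are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of chunk scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# MQM is a penalty (lower is better) while BLEU/chrF/BLEURT are rewards,
# so a well-behaved metric yields a strongly negative r; scaling r by -100
# makes higher reported values indicate better correlation.
mqm = [1.2, 3.5, 0.8, 5.0]         # hypothetical chunk-level MQM penalties
metric = [72.0, 55.0, 80.0, 40.0]  # hypothetical chunk-level metric scores
scaled = -100 * pearson(mqm, metric)
```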

Lexical accuracy
To assess a model's ability to select lexical forms appropriate to the target region, we define a lexical accuracy metric. As discussed in §3.1, sentences in the lexical bucket are drawn from Wikipedia articles containing specific words that we expect to have distinct regional translations.
For instance, we include source sentences from the English Wikipedia article "Bus" in the Portuguese lexical bucket, as the word for bus is distinct in Brazil and Portugal (ônibus vs. autocarro). As the expected output words are known ahead of time, we can directly measure the rate at which a model selects region-appropriate variants.
Starting from the list of terms used to select articles for the lexical bucket, we remove the terms selected for the exemplars split in order to test generalization to unseen terms. This results in 18 term-pairs in Portuguese and 13 in Mandarin.
We calculate the metric over all model outputs for the lexical bucket, covering both regions. For each term-pair, we calculate the number of sentences containing the matched variant and the number of sentences containing the mismatched variant. The model's lexical accuracy (LA) for the given language is then the total number of matches divided by the sum of matches and mismatches: LA = matches / (matches + mismatches). To account for Portuguese inflection, we considered matching lemmatized forms rather than surface forms, but found little difference in the resulting scores. We thus report results using naive surface matching, which avoids a dependency on a specific lemmatizer and improves reproducibility.
To disentangle lexical choice from script choice, we define lexical accuracy to be script-agnostic. For example, for the word pineapple, if the target is zh-TW, we count both script forms of the Taiwan variant fènglí (鳳梨 and 凤梨) as correct, and both script forms of the Mainland variant bōluó (菠萝 and 菠蘿) as incorrect. This ensures that models are judged solely on their lexical choices, and prevents "gaming" the metric by only using the lexical forms and script of a single region.
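The surface-matching version of the metric, including the script-agnostic counting, can be sketched as follows; the term-pair shown is the pineapple example from the text, and the data structure is our own illustration rather than the released evaluation code.

```python
# Sketch of the lexical accuracy (LA) metric with script-agnostic matching.
def lexical_accuracy(outputs, term_pairs):
    """LA = matches / (matches + mismatches) over all lexical-bucket outputs.

    `term_pairs` is a list of (matched_variants, mismatched_variants),
    where each element is a set of acceptable surface forms. Listing both
    Han-script forms in each set makes the counting script-agnostic.
    """
    matches = mismatches = 0
    for sent in outputs:
        for matched, mismatched in term_pairs:
            if any(v in sent for v in matched):
                matches += 1
            if any(v in sent for v in mismatched):
                mismatches += 1
    total = matches + mismatches
    return matches / total if total else 0.0

# Targeting zh-TW: both scripts of fènglí count as matches,
# both scripts of bōluó count as mismatches.
zh_tw_pairs = [({"鳳梨", "凤梨"}, {"菠萝", "菠蘿"})]
```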
We emphasize that lexical choice is just one important facet of region-aware translation, alongside morphology, syntax, and beyond. Even so, we believe this easy-to-calculate metric is worth iterating on, since one may safely say that a model that scores poorly on lexical accuracy has not solved region-aware translation.

Reporting FRMT results
For the FRMT task (as opposed to the dataset), we stipulate a key "few-shot" restriction: candidate models may not be intentionally exposed to any region-labeled data at any point during training, except for data from the FRMT exemplars split. This restriction covers both region-labeled monolingual data and region-labeled parallel translation data.7 While it may not be difficult to obtain region labels for Brazil/Portugal or Mainland/Taiwan (e.g. by filtering web pages on top-level web domain), we intend for FRMT to serve as a measure of few-shot generalization to arbitrary regions and language varieties, for which obtaining labels may be much harder.
Researchers sharing FRMT results should report lexical accuracy, per-bucket BLEU, and the "FRMT" score (described in §6.2) on test, as shown in Tables 4 and 5. These metrics can be calculated with our provided evaluation scripts.8 We also recommend reporting BLEURT scores, but recognize that this may not always be possible, as it requires significantly more computational resources. Similarly, we encourage human evaluation using MQM as a gold standard, but do not wish to promote this as a community metric, due to its impracticality for many researchers and the potential confound of having different rater pools.
Finally, for any model candidate, it is important to report how many exemplars were supplied for each variety.To improve comparability, we recommend 0, 10, or 100 exemplars per region.

Baseline Models
We evaluate a handful of academic MT models that claim some ability to provide few-shot controllable translations. We also evaluate a commercial MT system that does not distinguish between these regional varieties.
Our first baseline is the Universal Rewriter (UR) of Garcia et al. (2021), which supports multilingual style transfer and translation. It is initialized from an mT5-XL checkpoint (Xue et al., 2021) and finetuned on a combination of monolingual and parallel data from mC4 and OPUS, respectively. We train it with a sequence length of 128 instead of 64, to be directly comparable to our other models.
Our second baseline is UR finetuned from the Massively Multilingual Massive Machine translation (M4) model of Siddhant et al. (2022) instead of mT5 (M4-UR). We hypothesize that initializing from a model explicitly designed for translation will outperform one trained as a general language model. For both UR and M4-UR, we use the first 100 exemplars from the lexical buckets.
Our third baseline uses natural language prompting to control the regional variety of M4's output (M4-Prompts), such as prefixing the input with "A Brazilian would write it like this:". This is motivated by earlier work using this technique effectively for large language models (Wei et al., 2022; Sanh et al., 2022; Brown et al., 2020), and more recent work applying it to region-aware MT (Garcia and Firat, 2022).
Our fourth baseline fine-tunes the M4-Prompts model, where the source-side language tags used to induce the target language are replaced with prompts of the form "Translate to [language]:". This model (M4-Prompts FT) is designed to explicitly introduce prompting behavior. At inference time, we replace "[language]" with the variety name (e.g. "Brazilian Portuguese"). Neither M4-Prompts nor M4-Prompts FT uses exemplars.
Our next three baselines are different-sized versions of PaLM (Chowdhery et al., 2022), a large language model that has demonstrated remarkable zero-shot and few-shot performance on a variety of tasks (PaLM 540B, PaLM 62B, and PaLM 8B, referring to their approximate parameter counts). The prompt for these models begins with "Translate the following texts from English to [language variety]" and is followed by ten exemplars selected randomly from the lexical bucket.9 Each exemplar is put on two lines: first the English text, prefixed by "English:", and then the translation in the target variety, prefixed by the variety's name. At the end of the prompt, we show the model the input text and the language variety prefix, and take the first decoded line of text.
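The prompt layout just described can be sketched as below. The exact punctuation and spacing of the real prompts may differ slightly, and the exemplar pair shown is hypothetical.

```python
# Sketch of the few-shot prompt format for the PaLM baselines.
def build_prompt(exemplars, source, variety="Brazilian Portuguese"):
    """Build a few-shot translation prompt.

    `exemplars` is a list of (english, translation) pairs; the model is
    expected to complete the final line, and we keep only the first
    decoded line of its output.
    """
    lines = [f"Translate the following texts from English to {variety}."]
    for en, translation in exemplars:
        lines.append(f"English: {en}")
        lines.append(f"{variety}: {translation}")
    lines.append(f"English: {source}")
    lines.append(f"{variety}:")  # model completes from here
    return "\n".join(lines)

prompt = build_prompt(
    [("The bus was late.", "O ônibus estava atrasado.")],
    "I bought a new cell phone.",
)
```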
Finally, we examine Google Translate,10 a publicly-available commercial MT model that does not support regional varieties for Portuguese or Mandarin (though it does support transliteration between traditional and simplified scripts). We evaluate this system mainly to test the hypothesis that variety-agnostic systems will be biased toward the web-majority variety.

9 The model has a fixed input sequence length, including the prompt, and a fixed output sequence length. We ensure that the ten exemplars are short enough to leave at least 128 tokens for the input text, to match the 128 tokens allotted to the output.

10 translate.google.com, accessed April 4th, 2022.
Baseline Model Performance

Human evaluation results
We select three baseline models for human evaluation: M4-UR, M4-Prompts, and PaLM 540B, covering a variety of modeling techniques. Figure 2 presents human evaluation of our baselines on the 25% sample of our test set described in §4.2. For the gold data, we observe that raters of all regions prefer translations from their own region (the "matched" case) over translations from the other region (the "mismatched" case) in all three buckets; when averaged over buckets, the MQM penalties for the matched and mismatched cases are significantly different (1.73 matched and 3.55 mismatched; t = −3.34; p < 0.001). This indicates that, despite the limitations discussed in §3.4, our data collection process succeeded in producing regionally-distinct translations. This effect is strongest in the lexical bucket, presumably due to the high rate of region-distinct terms in these sentences.
In Portuguese, we find that all models perform better in the region-matched setting, indicating that each model has some ability to localize to Brazil and Portugal.However, in Mandarin, apart from PaLM's lexical bucket, region match does not lead to MQM gains, indicating that these models are not able to produce better, more region-specific translations in this case.
Comparing across models, we find that PaLM performs the best, followed by M4-Prompts and then M4-UR, consistently across both Portuguese and Mandarin. PaLM performs particularly well in the lexical bucket, suggesting that larger models may be better suited to the task of memorizing region-specific lexical variants.
For Mandarin, a large gap remains between expert translations and our baselines: averaged over buckets, the gold matched MQM penalty is 2.5 vs. PaLM's 8.8. It is apparent that better region handling will be needed to close this gap, since our baselines have much worse match/mismatch deltas than gold translations: the average gold mismatched penalty minus matched penalty was 2.7, while PaLM's was −0.3.
For Portuguese, while PaLM gives impressive results, there is still a meaningful gap with expert translation: averaged over buckets, the gold MQM penalty was 2.1 vs. PaLM's 2.7, indicating headroom for our task. There is also the important question of whether competitive performance can be achieved with smaller models, which are better suited for production use-cases.
Figure 3 breaks down scores by rater and target region, over the full 25% sample. As before, in each setting, raters prefer region-matched over mismatched gold translations. For Portuguese, we find that our pt-PT raters were "harder graders" than our pt-BR raters, with a delta of +2 MQM between the regions in both matched and mismatched settings; by contrast, our Mandarin raters were well calibrated across regions.
We further examined model performance on the entity bucket, to test whether the presence of "distractor" entities (associated with the non-target region) would hurt translation quality, but we did not find significant differences in MQM scores. Still, we note isolated examples of this effect; for instance, when targeting pt-BR, the M4-Prompts model produces the pt-PT spelling património (cf. pt-BR patrimônio), but only when the English source contains the words Lisbon or Portugal. We expect the entity bucket will be useful to researchers looking for similar effects.

Automated metric results
Table 4 shows performance of our baseline models on the automated metrics BLEU and BLEURT. The "FRMT" score is a summary of per-language performance, calculated as the geometric mean across regions of the arithmetic mean across buckets.
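The aggregation just described can be sketched directly; the per-bucket scores below are hypothetical placeholders, not results from Table 4.

```python
import math

def frmt_score(per_region_bucket_scores):
    """Geometric mean over regions of the arithmetic mean over buckets.

    `per_region_bucket_scores` maps region -> {bucket: score}, e.g.
    {"pt-BR": {"lexical": ..., "entity": ..., "random": ...}, "pt-PT": {...}}.
    """
    region_means = [
        sum(buckets.values()) / len(buckets)
        for buckets in per_region_bucket_scores.values()
    ]
    return math.prod(region_means) ** (1.0 / len(region_means))

# Hypothetical per-bucket BLEU scores for the two Portuguese regions.
scores = {
    "pt-BR": {"lexical": 50.0, "entity": 40.0, "random": 45.0},
    "pt-PT": {"lexical": 30.0, "entity": 35.0, "random": 40.0},
}
```

Using the geometric mean across regions penalizes models that do well only on the web-majority region, since a low score in either region drags the summary down more than an arithmetic mean would.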
As mentioned at the outset, we observe that region-agnostic models have a strong bias toward the region with larger presence in web-crawled corpora. This is especially apparent in the lexical bucket, where Google Translate has a +20.6 BLEU gap between pt-BR and pt-PT and a +17.8 gap between zh-CN and zh-TW.
Within the lexical bucket, we note that PaLM outperforms the public Google Translate model in web-minority regions (pt-PT and zh-TW) despite being trained in a fully unsupervised manner. This highlights that even with minimal region-labeled data (10 exemplars), it is possible to make meaningful progress over region-agnostic approaches.
Table 5 shows lexical accuracy performance, assessing whether specific terms receive region-appropriate translations. Here, the PaLM models outperform alternatives by a wide margin. As even the smallest PaLM model has more than 2× the parameters of our other baselines (3.7B parameters each), this suggests that model capacity is a key ingredient for learning to use region-specific terminology in a few-shot manner. Still, there is a wide gap compared to human performance.
Notably, while the smaller PaLM models outperform our UR and M4 baselines on lexical accuracy, they underperform on BLEU and BLEURT. This highlights that using region-appropriate terminology is only a small part of the translation task, and at smaller sizes, models designed specifically for translation have the advantage.

Mismatched outputs
Given a reference in a specified language variety (e.g. pt-PT), a "good" model should achieve a higher score when translating into that variety (the "matched" case) than into an alternative variety (e.g. pt-BR; the "mismatched" case). To measure the extent to which this holds for our baseline models, we show the delta between matched and mismatched outputs on the test set in Table 6.
We observe that in the Portuguese case, most models do score better when asked to produce text in the same regional variety as the reference. However, when it comes to Mandarin, most models (PaLM being the exception) struggle to produce zh-TW output that outperforms their zh-CN output when evaluated against a zh-TW reference, indicating that the attempts to appropriately stylize the generated text degrade its quality more than they improve its regional acceptability.

Effect of exemplars
To test sensitivity to the number and choice of exemplars, we evaluate PaLM 540B while varying the set of exemplars used. Table 7 shows the effect of ablating the number of exemplars in the range 0-10. We observe that using zero exemplars already yields reasonably strong results, a single exemplar is sufficient to achieve strong results, and gains from additional exemplars are marginal.
To measure the variance in performance across exemplar choice, we re-run the PaLM 540B evaluation three times each using either 1 or 10 exemplars, resampling the exemplars on each run. We find that the choice of exemplar(s) has a relatively small effect: with 10 exemplars, the standard deviations of FRMT-BLEU and FRMT-BLEURT across all four runs (including the original) were below 0.5 in each language, and with just 1 exemplar, the standard deviations remained under 1.0.

Qualitative analysis
To provide additional insights on regional differences and model behavior, we manually inspect dev-set gold translations and model outputs across the models sent to human evaluation. In both languages, we observe regional differences beyond just the lexical items underlying our lexical bucket. For instance, in one dev-set example the gold translations differ in their choice of preposition, with pt-PT using a where pt-BR uses a different form. As another example, in Table 9, we observe both gold and PaLM outputs use the term 程式 (chéngshì, en: program) only in zh-TW when translating the phrase "coding errors". In many cases, PaLM uses the expected region-specific lexical forms, as already reflected in our lexical accuracy metric. By contrast, we observe that the other models are more prone to use terms from the web-majority region (pt-BR and zh-CN) irrespective of the target. For example, in Table 9, PaLM matches the gold translations in using the region-specific terms for software (zh-CN: 软件, ruǎnjiàn; zh-TW: 軟體, ruǎntǐ), while the M4-based models use the zh-CN term throughout (simplified: 软件, traditional: 軟件).

Conclusion
In this paper, we introduced FRMT, a new benchmark for evaluating few-shot region-aware machine translation.
Our dataset covers four regional varieties (two each of Portuguese and Mandarin), and enables fine-grained comparison across region-matched and mismatched conditions and across different classes of inputs (lexical, entity, random).
While we found the large-scale generalist model PaLM 540B to show impressive few-shot region control, there is still significant room for improvement. None of the models we evaluated match human performance, and the gap is particularly large in Mandarin. Additionally, it remains an open research question whether robust few-shot regional control can be achieved at more modest model scales.
We are eager to see progress on FRMT, as methods that do well in this few-shot setting are likely to be easily extensible to other regions and styles.We anticipate that the flexibility to adapt to new output styles in the absence of extensive labeled data will be a key factor in making generative text models more useful, inclusive, and equitable.

Figure 1 :
Figure 1: FRMT requires a machine translation model to adapt its output to be appropriate for a specific region, such as Brazil (left) or Portugal (right). Because only a few exemplars are provided to convey the target region, methods that perform well on FRMT can likely extend to other regions and styles.

Figure 2 :
Figure 2: MQM (↓) scores for gold translations and model predictions in Portuguese (left) and Mandarin (right). Thick "match" bars show scores from raters in the target region. Thin "mismatch" bars show scores from raters in the opposite region. In all conditions, raters prefer region-matched gold translations, confirming the presence of region-specific phenomena in the collected data. PaLM is the highest-rated baseline, but still has room for improvement, particularly in Mandarin.

Figure 3 :
Figure 3: MQM (↓) scores for gold translations and model predictions, broken down by rater region and target region. For example, "BR rates PT" indicates Brazilian raters scoring sentences targeted to Portugal.

Table 3 :
Coefficients of correlation between human MQM ratings and several automated metrics. CHRF has the lowest correlation, with BLEU performing slightly better. All BLEURT models outperform the non-learned metrics, with the full-size model achieving higher correlation than the smaller distillations thereof.

Table 4 :
FRMT per-bucket test set results, in the format: BLEU (BLEURT).The "FRMT" score is the geometric mean across regions of the arithmetic mean across buckets.
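The aggregation described here can be written out as a short sketch; the bucket scores below are invented placeholders, not benchmark results:

```python
import math
from statistics import mean

def frmt_score(per_region_bucket_scores: dict) -> float:
    """Geometric mean across regions of the arithmetic mean across buckets."""
    region_means = [mean(buckets.values())
                    for buckets in per_region_bucket_scores.values()]
    return math.prod(region_means) ** (1.0 / len(region_means))

# Illustrative per-bucket BLEU-style numbers for one language pair.
scores = {
    "pt-BR": {"lexical": 48.0, "entity": 45.0, "random": 51.0},
    "pt-PT": {"lexical": 42.0, "entity": 40.0, "random": 44.0},
}
frmt = frmt_score(scores)
```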

Table 5 :
Lexical accuracy on FRMT test.PaLM outperforms other approaches, while region-agnostic models like Google Translate are guaranteed 50%.

Table 6 :
FRMT test set deltas between matched and mismatched outputs for a given reference, shown in the format: ∆BLEU (∆BLEURT).Negative numbers indicate that the reference-based metric preferred the model output that targeted the opposite language variety.The last column shows deltas between FRMT scores evaluated with respect to matched vs. mismatched outputs.

Table 7 :
FRMT test set results of PaLM 540B when varying the number of exemplars, shown in the format: BLEU (BLEURT). Across both languages, even one exemplar is sufficient for strong results, and zero-shot performance is reasonably strong. Increasing to 10 exemplars in Portuguese or 7 exemplars in Mandarin gives marginal additional gains. Note that these results were not used to select the number of exemplars for the PaLM 540B results reported elsewhere; this ablation was run afterward.

Table 8 :
Gold and model outputs for the source: "Same-sex marriage in Portugal was legalized on 17 May 2010." Phenomena of interest are bolded.

Table 9 :
Gold and model outputs for the source: "Not all software defects are caused by coding errors." Phenomena of interest are bolded, and region-specific errors are underlined and red. Note that M4-based model zh-TW outputs have been transliterated to traditional script, matching our evaluation setting.