QAmeleon: Multilingual QA with Only 5 Examples

Abstract: The availability of large, high-quality datasets has been a major driver of recent progress in question answering (QA). Such annotated datasets, however, are difficult and costly to collect, and rarely exist in languages other than English, rendering QA technology inaccessible to underrepresented languages. An alternative to building large monolingual training datasets is to leverage pre-trained language models (PLMs) under a few-shot learning setting. Our approach, QAmeleon, uses a PLM to automatically generate multilingual data upon which QA models are fine-tuned, thus avoiding costly annotation. Prompt tuning the PLM with only five examples per language delivers accuracy superior to translation-based baselines; it bridges nearly 60% of the gap between an English-only baseline and a fully-supervised upper bound fine-tuned on almost 50,000 hand-labeled examples; and consistently leads to improvements compared to directly fine-tuning a QA model on labeled examples in low resource settings. Experiments on the TYDIQA-GOLDP and MLQA benchmarks show that few-shot prompt tuning for data synthesis scales across languages and is a viable alternative to large-scale annotation.


Introduction
Question answering (QA) has seen impressive progress in recent years, enabled by the use of large pre-trained language models (Devlin et al., 2019; Lewis et al., 2020a; Raffel et al., 2020) and the availability of high-quality benchmarks (Rajpurkar et al., 2016; Trischler et al., 2017; Kwiatkowski et al., 2019). Many QA datasets frame the task as reading comprehension, where the question is about a paragraph or document and the answer is a span therein. Advances in QA modeling have been primarily reported for English, which offers a considerable amount of high-quality training data compared to other languages. More recently, efforts have focused on the creation of multilingual QA benchmarks such as TYDI QA (10 languages; Clark et al. 2020), MLQA (6 languages; Lewis et al. 2020b), and XQUAD (10 languages; Artetxe et al. 2020b). Among these, only TYDI QA is genuinely large-scale; MLQA and XQUAD are limited to evaluation sets due to the high cost and labor required to collect data across languages.
As a result, efforts to localize QA models to new languages have primarily focused on zero-shot approaches. Recent proposals include using machine translation to approximate training data for supervised learning (Lewis et al., 2020b), and data augmentation via generating synthetic questions for new languages (Riabi et al., 2021; Shakeri et al., 2021). Both approaches rely on transfer from English, which leads to a dependence on translation artifacts (Koppel and Ordan, 2011; Artetxe et al., 2020a) and a bias towards the linguistic characteristics of English, which is not the best source for all target languages (Lin et al., 2019). However, annotating a minimally sized data sample can potentially overcome these limitations while incurring significantly reduced costs compared to full dataset translation (Garrette and Baldridge, 2013).
In this paper, we argue that a few-shot approach in combination with synthetic data generation and existing high-quality English resources can mitigate some of the above-mentioned artifacts. Beyond question answering, multilingual approaches have succeeded at leveraging a small number of annotations within a variety of tasks (Zhao et al., 2021, inter alia), including natural language inference, paraphrase identification, and semantic parsing (Sherborne and Lapata, 2022).
We synthesize these directions into QAMELEON, an approach for bootstrapping multilingual QA systems with as few as five examples in a new target language (see Figure 1). We use gold annotations to prompt-tune a PLM in order to automatically generate multilingual QA data, which is then used to fine-tune a QA model. We find that QAMELEON delivers accuracy superior to zero-shot methods and competitive translation-based baselines, and in some cases competes with the fully supervised upper bound. Experiments on the TYDI QA (Clark et al., 2020) and MLQA (Lewis et al., 2020b) benchmarks show that few-shot prompt tuning (Lester et al., 2021) scales across languages, significantly outperforms prompt engineering (Brown et al., 2020) with the same number of labeled examples, and is a viable alternative to large-scale annotation.
Our contributions include (a) a new approach to bootstrapping a multilingual QA system; QAMELEON prompt-tunes a PLM with as few as five gold examples to automatically generate multilingual QA data, which is then used to fine-tune a QA model; (b) a series of experimental results showing significant improvements over existing approaches in the few-shot regime, including a 12% absolute accuracy gain on TYDIQA-GOLDP (Clark et al., 2020) over an English-only baseline and 4% absolute over a competitive translate-train baseline; (c) extensive analysis of the behavior of QAMELEON in zero-shot and low resource regimes, on different multilingual QA datasets, and in comparison to prompt engineering.

Synthetic Data Generation
Let D_l denote a QA dataset with examples provided by human annotators, where l is a target language in a set L of languages of interest. D_l consists of samples (c, q, a)_l, where c is a paragraph of text, q is a question, and a is an answer extracted from c (see Figure 1, left). We further use D_{l,n} to denote a dataset D_l, but making n explicit, with n referring to the number of examples it contains. For instance, D_{fr,5} denotes a French QA dataset with 5 examples. Finally, let U_l denote sets of unlabeled paragraphs in language l; we assume these are in-domain with respect to the paragraphs in D_l but are not accompanied by questions or answers.
Throughout this work, we will assume the availability of D_en, a large QA dataset in English (the source language). This assumption corresponds to the observation that most large-scale QA datasets (Rajpurkar et al., 2016; Yang et al., 2018; Bajaj et al., 2016; Kwiatkowski et al., 2019) contain examples exclusively in English. For languages other than English, we assume that only small datasets D_{l,n} are available for training (e.g., n = 5; the "Few-Shot" scenario) or no data at all (the "English-Only" scenario). We will also assume that sets U_l of unlabeled passages are available for all target languages. Our task is to synthesize QA data in each target language l in order to fine-tune QA models on l directly.
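As a concrete illustration of the notation above, the following minimal Python sketch shows one way the data assumptions could be represented (all names and example contents are hypothetical, for illustration only):

```python
from dataclasses import dataclass

@dataclass
class QAExample:
    context: str   # paragraph c
    question: str  # question q
    answer: str    # answer a, a span extracted from c

# D_en: a large labeled English dataset (source language); one example shown.
d_en = [QAExample("Paris is the capital of France.",
                  "What is the capital of France?", "Paris")]

# D_{l,n}: n labeled examples in target language l, e.g. D_{fr,5}.
d_fr_5 = [QAExample("Paris est la capitale de la France.",
                    "Quelle est la capitale de la France ?", "Paris")]

# U_l: unlabeled in-domain paragraphs in language l (no questions or answers).
u_fr = ["La Tour Eiffel a été construite en 1889."]
```

Note that in every labeled example the answer is an extractive span of the context, a property the filtering heuristics described later rely on.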
In the rest of this section we formally describe three ways of synthesizing QA data and give further details on the two scenarios we consider, "English-Only" and "Few-Shot".

Machine Translation (MT)
A widely adopted approach (Lewis et al., 2020b; Shakeri et al., 2020) makes use of a machine translation system T to automatically translate text from one language into another. Let T_{l′}(D_l) denote the translation of dataset D_l from language l to language l′. The translation is performed by independently applying T to context c, question q, and answer a for each example in the source dataset (see approach 2 in Figure 1). A synthetic QA dataset D_MT is generated by translating the entire English dataset into each language of interest:

D_MT = ∪_{l∈L} T_l(D_en)

The approach described here is known as "translate-train". An alternative is "translate-test", where translation is employed during inference instead of training. Multilingual inputs are translated to English and inference is done via an English QA model. The English predictions are then translated back to the respective target language. We experimentally found "translate-test" to perform poorly on our task in comparison to "translate-train" due to its reliance on multiple noisy translation steps.
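The translate-train construction of D_MT can be sketched as follows. The `translate` function is a placeholder standing in for the MT system T; a real implementation would call an external translation service:

```python
def translate(text, target_lang):
    # Placeholder for the MT system T; here we merely tag the text so the
    # data flow is visible. A real system would return translated text.
    return f"[{target_lang}] {text}"

def translate_train(d_en, target_langs):
    """Build D_MT by translating every (c, q, a) in D_en into each target language.

    Context, question, and answer are translated independently, matching the
    procedure described above.
    """
    d_mt = {}
    for lang in target_langs:
        d_mt[lang] = [
            (translate(c, lang), translate(q, lang), translate(a, lang))
            for (c, q, a) in d_en
        ]
    return d_mt

d_mt = translate_train([("context", "question", "answer")], ["fr", "de"])
```

Because each field is translated independently, the translated answer is not guaranteed to remain a span of the translated context, which is one source of the answer-span misalignment discussed below.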
Note that fine-tuning on D_MT still relies on the support of the high-quality D_en. Previous work (Kramchaninova and Defauw, 2022; Vu et al., 2022) has highlighted various limitations of multilingual approaches based on MT, including (a) their dependence on the quality of available MT systems in a given language and, in turn, on the availability of high-quality (expensive) parallel data, (b) a potential misalignment of answer spans when answers are translated in context with the passage versus translated independently, and (c) translationese artifacts and English-centric content topics (Clark et al., 2020).

Prompt Engineering (PE)
PLMs (Brown et al., 2020; Chowdhery et al., 2022) have recently shown unprecedented performance on a vast number of tasks, including natural language generation, without the need for modifying any of the model's parameters, simply by hand-designing a textual prompt that instructs the model to perform a certain task. Following Brown et al. (2020), we consider a class of hand-designed prompts referred to as "prompting" or "in-context learning". The prompt starts with a free-form instruction, followed by a small number of instances exemplifying how the task is solved. An incomplete instance is then appended to this prompt, and the PLM performs the task by completing that instance. We refer to this approach as "prompt engineering" (PE), since the input to the PLM has to be hand-engineered based on human intuition about the target task (see approach 3 in Figure 1).
In order to hand-engineer prompts for our task, we use a small set of parallel examples C_{l,n} consisting of passages, questions, and their answers in the English source and target language l. We discuss how we construct these examples shortly. For now, suffice it to say that we create two prompts for answer and question generation, respectively. Our first prompt is used to obtain an answer a_l in the target language l from passage c_l:

I will write potential answers for the following passages.
Passage: c_l
Answer in English: a_en
Answer in the original language: a_l
...

The second prompt generates question q_l, utilizing passage c_l and the previously predicted answer a_l:

I will write questions and answers for the following passages.
Passage: c_l
Answer: a_l
Question in English: q_en
Question in the original language: q_l
...

We generate synthetic data instances (c, q, a)_l where a and q are inferred by applying our two prompts consecutively to each passage c_l ∈ U_l (recall U_l is the set of unlabeled passages in target language l).
In the English-Only scenario, neither questions nor answers are available in the target language; we obtain these by resorting to machine translation:

C_{l,n} = (D_{en,n}, T_l(D_{en,n}))

In the "Few-Shot" setting, we have access to n labeled examples (questions and answers) in the target language, and translate these into English:

C_{l,n} = (T_en(D_{l,n}), D_{l,n})

Let P^e_l denote this prompting-based generation. We can write the generated synthetic dataset as:

D_PE = ∪_{l∈L} P^e_l(U_l)

Note that in the composition of the prompt, we always include English as an intermediate or "bridge", i.e., asking the model to predict questions and answers in English in addition to the ones in the target language, as we experimentally found this improves the quality of the generated data. The use of a bridge for this task can be thought of as an example of multilingual "chain of thought" prompting (Wei et al., 2022).
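The two-stage PE generation can be sketched as below. The prompt templates mirror the ones shown above; the PLM call is a hypothetical stand-in (a stub is used here purely to exercise the data flow), and a single exemplar is shown where the real prompts would contain n of them:

```python
# Stage-1 template: generate the answer (English bridge first).
ANSWER_PROMPT = (
    "I will write potential answers for the following passages.\n"
    "Passage: {c_ex}\n"
    "Answer in English: {a_en}\n"
    "Answer in the original language: {a_ex}\n"
    "Passage: {c}\n"
    "Answer in English:"
)

# Stage-2 template: generate the question for the predicted answer.
QUESTION_PROMPT = (
    "I will write questions and answers for the following passages.\n"
    "Passage: {c_ex}\n"
    "Answer: {a_ex}\n"
    "Question in English: {q_en}\n"
    "Question in the original language: {q_ex}\n"
    "Passage: {c}\n"
    "Answer: {a}\n"
    "Question in English:"
)

def generate_pair(plm, passage, exemplar):
    """Apply the two prompts consecutively: answer first, then question."""
    a = plm(ANSWER_PROMPT.format(c=passage, **exemplar))
    q = plm(QUESTION_PROMPT.format(c=passage, a=a, **exemplar))
    return (passage, q, a)

# A stub standing in for the PLM; a real call would decode from the model.
def stub_plm(prompt):
    if prompt.startswith("I will write potential answers"):
        return "1889"
    return "Quand la tour a-t-elle été construite ?"

exemplar = {"c_ex": "…", "a_en": "…", "a_ex": "…", "q_en": "…", "q_ex": "…"}
c, q, a = generate_pair(stub_plm, "La Tour Eiffel a été construite en 1889.", exemplar)
```

Chaining the stages this way lets the question generation condition on the answer the model itself just produced, which is what makes the two prompts act as a single chain-of-thought-style pipeline.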

QAMELEON (PT)
In this approach, an optimizer is utilized to minimize the cross-entropy loss by updating the PLM's parameters for P(a, q | c, l) over a training set containing examples (c, q, a)_l for the languages in L. As with PE, we generate the training set for the PLM in two ways. For "English-Only" we construct the dataset as ∪_{l∈L} T_l(D_en), while for "Few-Shot" we use ∪_{l∈L} D_{l,n}.
Given the small size of the training set in the "Few-Shot" setting and the large size of current models, we opt for prompt tuning (PT; Lester et al., 2021), a parameter-efficient fine-tuning variant where we concatenate a soft prompt of length m tokens to the input of the PLM, where m is a hyperparameter always set to 50 in this work. Only the embeddings of these m prompt tokens are allowed to be modified by the optimizer. We note that in prompt tuning, as in prompt engineering, the parameters of the PLM remain unchanged. What is trained is only the short soft prompt that is prepended to the input embeddings at inference time.
We use P^t_l to denote the operation of generating question-answer pairs through greedy decoding on the prompt-tuned PLM, taking an unlabeled passage c_l ∈ U_l as input, preceded by a few tokens encoding language l. We finally obtain the synthetic QA dataset D_PT as:

D_PT = ∪_{l∈L} P^t_l(U_l)
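The mechanics of the soft prompt can be sketched in a few lines: the only trainable tensor is the m × d_model block of prompt embeddings that is prepended to the frozen input embeddings (plain lists are used here for a dependency-free illustration; shapes are hypothetical except for m = 50):

```python
D_MODEL, M = 8, 50  # m = 50 soft prompt tokens, as in this work

def prepend_soft_prompt(soft_prompt, input_embeddings):
    """Prepend the m trainable prompt embeddings to the frozen input embeddings.

    Each embedding is a length-D_MODEL vector; only `soft_prompt` would be
    updated by the optimizer, while the PLM's own parameters stay frozen.
    """
    return soft_prompt + input_embeddings

soft_prompt = [[0.0] * D_MODEL for _ in range(M)]           # trainable
passage_embeddings = [[0.1] * D_MODEL for _ in range(120)]  # frozen input

combined = prepend_soft_prompt(soft_prompt, passage_embeddings)
```

The combined sequence of length m + seq_len is what the PLM actually consumes, which is why storing a tuned prompt requires only a tiny fraction of the model's parameter count.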

Data Assumptions
English-Only In this scenario, only training data in English is available, denoted as D_en. Prompt Engineering (PE) assumes parallel exemplars are available, while Prompt Tuning (PT) requires exemplars in the target language only. Both are possible by translating examples of the English data D_en into each target language. Machine Translation (MT) approaches in this work follow this scenario only.
Few-Shot In this scenario, a small number of examples (n-shot) are available in each target language, denoted as D_{l,n}. Parallel exemplars for Prompt Engineering (PE) can be obtained by translating the target language data into English. Prompt Tuning (PT) only requires exemplars in the target language, which are readily available in this setting.

Experimental Setup
We evaluate the synthetic data generation approaches presented in Section 2 across various languages on two benchmark datasets, which we discuss below. We also describe various model configurations and comparison systems before presenting our results.
Unlabeled Data We obtained paragraphs U_l in each target language from Wikipedia. Specifically, we pre-processed Wikipedia pages using WikiExtractor (Attardi, 2015). Paragraphs were sampled uniformly, with lengths between 200 and 510 characters. The target language was determined based on the language code of the Wikipedia edition.

Model Configuration
Synthetic Data Generation In our TYDI QA experiments, we treat the English training data as the English source. For MLQA, we employ the English SQUAD (Rajpurkar et al., 2016) training data as the source. In the Few-Shot scenario, our human-annotated target-language examples D_{l,n} are taken from the training split of TYDIQA-GOLDP and the validation split of MLQA.
For machine translation (MT), we employ the public Google Translate API (Wu et al., 2016), while the PLM utilized in this work is PaLM-540B (Chowdhery et al., 2022). We perform heuristic checks to clean the synthetic datasets D_PE and D_PT. We only preserve a question-answer pair if the generated answer a is a substring of the given context c, but not a substring of the question q. We perform the first check because both TYDIQA-GOLDP and MLQA are extractive QA datasets. We perform the latter check because we empirically found that some low-quality generated question-answer pairs were trivially answered based on the content of the question alone, for example, q: "where is X?", a: "X".
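These two heuristic checks reduce to a simple predicate, sketched below (function name hypothetical):

```python
def keep_pair(context, question, answer):
    """Heuristic filter for synthetic QA pairs as described above: keep a pair
    only if the answer is a substring of the context (extractive QA) but not
    a substring of the question (avoids trivially self-answering questions)."""
    return answer in context and answer not in question

# Kept: extractive answer, not leaked into the question.
ok = keep_pair("The Eiffel Tower is in Paris.", "Where is the Eiffel Tower?", "Paris")
# Dropped: the answer appears verbatim in the question.
leaked = keep_pair("The Eiffel Tower is in Paris.", "Is Paris where the tower is?", "Paris")
# Dropped: the answer is not a span of the context.
not_extractive = keep_pair("The Eiffel Tower is in Paris.", "Where is it?", "London")
```

In practice the substring tests would run on the raw generated strings, so normalization (casing, whitespace) is a design choice left open here.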
In the construction of D_PE, we additionally perform round-trip filtering (Alberti et al., 2019), as qualitative analysis of random QA pairs suggested a higher level of noise in the PE-generated data. This round-trip consistency check is done by comparing the originally generated answer a in (c, q, a)_l with the predicted answer, which is obtained by prompting the PLM to answer question q based on passage c. We also tried round-trip filtering for PT-generated data; however, we did not observe any gains. We report detailed statistics of the synthetically generated datasets in Section 5.
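The round-trip consistency check can be sketched as follows; `plm_answer` is a hypothetical stand-in for prompting the PLM to answer q given c (a stub is used in the usage example):

```python
def round_trip_filter(plm_answer, pairs):
    """Keep a synthetic pair (c, q, a) only if re-answering q from c with the
    PLM reproduces the originally generated answer a."""
    kept = []
    for (c, q, a) in pairs:
        if plm_answer(c, q) == a:
            kept.append((c, q, a))
    return kept

pairs = [("ctx one", "q1", "a1"),      # consistent: re-answered as "a1"
         ("ctx two", "q2", "wrong")]   # inconsistent: re-answered as "a2"
stub = lambda c, q: "a1" if q == "q1" else "a2"
kept = round_trip_filter(stub, pairs)
```

Exact string match is used here for simplicity; a real check might instead compare normalized or token-level answers.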
In the construction of D_PT, we prompt-tune the PLM on ∪_{l∈L} T_l(D_en) or ∪_{l∈L} D_{l,n} as detailed earlier. Prompt tuning is performed with the AdaFactor optimizer (Shazeer and Stern, 2018). We tune a prompt of length m = 50 tokens for up to 1,000 steps, evaluating every 50 steps, with a batch size of 16 examples and a learning rate of 0.3 with a linear warmup of 200 steps. We use early stopping to select the best prompt per language based on BLEU (Papineni et al., 2002) on a held-out dataset from the English TYDIQA-GOLDP, translated to each target language.
Question Answering We fine-tuned an mT5-XL model (Xue et al., 2021) for question answering to evaluate the different synthetic data generation methods (D_MT, D_PE, and D_PT). As a baseline, we further use mT5-XL fine-tuned on the available training data. Specifically, in the English-Only scenario, the baseline mT5-XL is fine-tuned on the English QA data D_en. In the Few-Shot scenario, the baseline mT5-XL is fine-tuned on n human-annotated examples in the target languages (the same number given to PE and PT). We conducted experiments on TYDIQA-GOLDP (Clark et al., 2020) and MLQA (Lewis et al., 2020b); see Section 3.1. Throughout downstream QA evaluation, mT5-XL was fine-tuned with AdaFactor, with a learning rate of 0.0002, a batch size of 64, and up to 3,000 and 5,000 steps of training for TYDIQA-GOLDP and MLQA respectively, evaluating every 50 steps. We measure QA performance with Exact Match (EM) and F1, and report the unweighted average across languages (excluding English). For TYDIQA-GOLDP, we report results on the development split, which is commonly used as an evaluation set since the test split is unavailable. We select mT5 checkpoints per language using EM, and report the average of 3 runs. For MLQA, we present results on the test split, selecting the best mT5 checkpoint based on the average EM on the MLQA dev set.

QAMELEON (PT) Delivers the Best QA System
Table 2 summarizes our results on TYDI QA for both English-Only and Few-Shot scenarios. Overall, we find that a low resource setting with 5 human-annotated examples in the target language (D_{l,5}) is useful for scaling QA to multiple languages. More specifically, 5-shot prompt tuning gives an improvement of 11.7% absolute in exact match answer accuracy (58.5% → 70.2%) on the TYDIQA-GOLDP evaluation set over mT5 fine-tuned on English data only (Baseline), 3.7% (66.5% → 70.2%) over mT5 fine-tuned on 5 examples per language (Few-Shot Baseline), and 4.1% (66.1% → 70.2%) over mT5 fine-tuned on the data obtained with the MT approach.
QAMELEON further improves over the few-shot results obtained by prompting extremely large PLMs such as code-davinci-002 (Chen et al., 2021) and PaLM-540B (Chowdhery et al., 2022) directly for the task of QA, whereas QAMELEON leverages data synthesis to distill a PLM into a much smaller mT5-XL model. It is also important to note that QAMELEON as an approach is orthogonal and possibly complementary to any improvements due to more performant QA models and more sophisticated PLMs (e.g., Flan-U-PaLM-540B).
In both English-Only and Few-Shot resource scenarios, QAMELEON outperforms the other two data generation approaches, Machine Translation (MT) and Prompt Engineering (PE). Despite employing PE in two stages, chain-of-thought style, we observe that the generated data leads to lower QA performance. Moreover, we see better performance when using English-Only data in comparison to the Few-Shot scenario, suggesting that the PLM is better able to utilize high-quality English data than small amounts of labeled data in other languages. Finally, augmenting PLM-generated data (either via PE or PT) with data generated via MT leads to gains in QA performance over using any of these methods independently. This could be due to the coupling of diverse QA data, i.e., language-specific content and task-specific English-centric translated content.
Table 3 shows QA performance in individual languages for each of the methods in Table 2, including combinations of English-centric translated content and target language-specific content obtained from the PLM. Eventually, we observe a saturation effect, i.e., beyond O(1,000) QA pairs in the target language, improvements are limited.
BLEU Does Not Correlate with Downstream QA Performance An interesting question is whether improvements in QA performance are due to better (e.g., more grammatical or diverse) questions. We assessed the quality of questions generated by QAMELEON (PT) on TYDIQA-GOLDP by measuring their similarity to gold standard questions. We compare this with an mT5-XL model for question generation fine-tuned in a Few-Shot setting. Both QAMELEON (PT) and the mT5-XL question generation model were given the same number of examples in each language. Table 5 reports BLEU (Papineni et al., 2002) scores for these two models; we additionally report question answering performance (in terms of EM) via another set of mT5-XL models fine-tuned on the synthetic data generated by the respective models.
Even though mT5-XL produces questions with a slightly higher BLEU score, QAMELEON generates QA data that leads to much higher QA performance. This result underscores the need for better, more trustworthy automatic evaluation metrics (Sellam et al., 2020) across languages.
Our Results Generalize to MLQA To validate the general applicability of our approach, we evaluate QAMELEON on MLQA (Lewis et al., 2020b). We prompt-tune the PLM on 5 examples per language taken from the MLQA development set, since MLQA does not provide training partitions. We generate synthetic datasets in all of the MLQA languages and compare an English-only baseline, MT, and QAMELEON (PT) approaches as we did previously for TYDIQA-GOLDP. We report results (EM and F1) using mT5-XL as the QA model in Table 6, where English is included in the average performance. We find that the MT approach is very effective on MLQA, which is not surprising since MLQA questions are translated from English. QAMELEON (PT), however, still delivers an improvement in combination with MT synthetic data. Table 6 further reports comparisons with the state-of-the-art models of Xue et al. (2021) and Chi et al. (2022). The former is mT5-XL (3.7B parameters) fine-tuned on English data only, whereas XLM-E-XL (2.2B parameters) benefits from a different language model pre-training technique. The latter approach is orthogonal and potentially complementary to QAMELEON.

Data Analysis
Table 7 shows the size of synthetic data resources generated via Prompt Engineering (PE) and QAMELEON (PT), per language and in total. These were in the range of 47,000-53,000 QA examples for TYDIQA-GOLDP, and 89,000 for MLQA. The varying size of the data across languages is due to the filtering described in Section 3. In some languages (e.g., Telugu) generation is noisier, leading to fewer data points. We conjecture this is due to the PLM being exposed to less data representing these languages during pre-training. We further hypothesize that a more multilingual pre-training of PLMs could potentially lead to better quality data across all languages.
Machine translation (MT) creates the same number of data points as the source training set. For TYDIQA-GOLDP, the English training set contains 3,696 data points (Table 1), leading to 3,696 translated examples per target language.

Figure 3 shows the distribution of various question types for individual languages and on average across all languages. For each language, synthetically generated questions were first translated to English and then grouped into categories (inner circle) based on their first word and sub-categories (outer circle) based on their first two words. We find that QAMELEON (Figure 3 (c)) generates more diverse questions in comparison to TYDIQA-GOLDP (Figure 3 (f)). The distribution of question words varies across languages in both datasets. For example, diversity is higher for Russian, Finnish, and Indonesian (Figure 3 (a,d)); however, for Bengali and Telugu (Figure 3 (b,e)), the distribution of questions is skewed towards a specific question type ('What' for Telugu and 'Other' for Bengali). This could be attributed to a lack of diversity in questions for these languages or to poor translation quality leading to skewed utterances.
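The grouping of translated questions into categories and sub-categories can be sketched as below (a simplified illustration; the actual analysis may normalize questions differently):

```python
def categorize(question_en):
    """Group a question (already translated to English) by its first word
    (category) and its first two words (sub-category), as in the Figure 3
    style of analysis."""
    words = question_en.lower().rstrip("?").split()
    if not words:
        return ("other", "other")
    category = words[0]
    subcategory = " ".join(words[:2]) if len(words) > 1 else category
    return (category, subcategory)

examples = [categorize("What is the capital of France?"),
            categorize("Who wrote Hamlet?")]
```

Aggregating these (category, sub-category) pairs per language yields the inner- and outer-circle distributions plotted in Figure 3.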
Table 8 illustrates randomly sampled examples of QA pairs generated by QAMELEON (PT) for passages in the TYDIQA-GOLDP eval set. For these passages, we also have access to human-annotated QA pairs. As can be seen, the QA pairs generated by QAMELEON are of similar quality to the human-annotated ones.

The above approaches generally require large amounts of labeled QA data in the form of SQUAD (Rajpurkar et al., 2016) or Natural Questions (Kwiatkowski et al., 2019) to train passage selection and question generation models. In contrast, we only assume access to a few question-answer pairs per language.
Multilingual QA In this work we used mT5-XL (Xue et al., 2021) as our reference QA model. We note that a slightly more performant choice could have been ByT5 (Xue et al., 2022), which reports improvements on TYDIQA-GOLDP by operating directly on raw text instead of sentence pieces. Existing work on low resource multilingual QA has been relatively limited. Lee et al. (2018) propose to use automatically translated high-confidence QA examples for training, while other approaches (Kumar et al., 2019; Chi et al., 2020) only generate questions and require supervised training data in the target language. Other approaches (Riabi et al., 2021; Shakeri et al., 2021; Kramchaninova and Defauw, 2022) focus on zero-shot transfer, i.e., a multilingual model trained on QA data generation on SQUAD (and optionally automatically translated SQUAD data) is applied to other languages. Our work shows that few-shot settings result in better multilingual generation quality in comparison to zero-shot models.
Prompting Existing work (Brown et al., 2020; Schick and Schütze, 2021, inter alia) has shown that prompting pre-trained large language models can lead to strong performance on a wide range of tasks, including natural language generation and common sense reasoning. In the context of multilingual QA, Chowdhery et al. (2022) employ a single prompt and a few labeled examples in the target language. In contrast, we employ chain-of-thought prompting, with English answers and questions as a bridge. Moreover, our experiments with QAMELEON demonstrate that prompt tuning is superior and a viable alternative to large-scale annotation. Prompting in multilingual settings has achieved the best performance using English prompts and target language exemplars (Winata et al., 2021; Lin et al., 2022; Shi et al., 2022). We demonstrate that a parameter-efficient method such as prompt tuning with target language exemplars (Lester et al., 2021) is a superior choice.

Benefits and Limitations
The method proposed in this work, QAmeleon, prompt-tunes large PLMs to generate multilingual synthetic question answering data. In this section we discuss its benefits over related approaches, but also drawbacks and limitations. The main benefits are large performance improvements over alternative methods, as borne out by our experiments, as well as surprising data efficiency achieved through large-scale pre-training and a few manual annotations. Alternative methods considered here are multilingual QA approaches for low resource languages, such as translate-test, translate-train, fine-tuning multilingual models directly on the small amount of available training data, performing multilingual QA directly through in-context learning, or even synthetic data generation with prompt-engineered PLMs. Another benefit of our approach stems from prompt tuning itself, which is able to learn from a tiny number of training examples, as low as one example per language in our experiments, whereas fine-tuning cannot be utilized as easily. Prompt tuning also affords the practical advantage of being space efficient; a fraction of a percent of the storage space is used to save the learned parameters, since only the learned soft prompt needs to be stored. Our evaluation methodology also provides benefits, since we measure question answering performance directly on downstream models instead of using a proxy like BLEU or ROUGE on generated questions. As shown in Table 5, such metrics can be misleading, and one might conclude that smaller models generate better questions than large PLMs if the evaluation were to consider only question BLEU scores.
The main drawback of our method is the high computational cost to prompt-tune the PLM and to generate the synthetic data. While prompt tuning is not as expensive as fine-tuning, we still need to perform optimization on a model containing hundreds of billions of parameters. We estimate the cost of each prompt tuning and data generation experiment to be on the order of 256 TPU v3 chips for 12 hours. Another limitation of our experimental results is that they are fundamentally tied to a specific large PLM. PLMs are an area of active research, so any improvements in pre-training techniques, construction of pre-training sets, instruction tuning, or reinforcement learning are likely to translate into improvements for our synthetic data generation method. Promising areas of future work are parameter-efficient techniques similar to prompt tuning, as well as analysis of data augmentation techniques like QAmeleon across different types and sizes of PLMs. Moreover, a more formal understanding of how the number of manual annotations (i.e., the few shots) interacts with the quality of synthetic generation would also be useful. Perhaps somewhat counter-intuitively, our experiments showed that QA performance does not drastically improve when scaling from 50 to 100 manual examples.

Data Release
To assist with the replicability of our results and to allow other researchers to benefit from our work, we will release a significant portion of the synthetic data generated by QAMELEON in the 5-shot scenario. To minimize the chance that question-answer pairs generated by the PLM contain sensitive, offensive, or controversial material, we vetted each generated question with three human raters. We asked each rater to discard question-answer pairs that made generalized claims about groups, contained opinions that were potentially disparaging or embarrassing to one or more people, or contained names of individuals not related to media (e.g., film, TV) or sport. The release will contain 47,173 examples, each with a Wikipedia passage, a question, and an extractive answer, corresponding to 89% of the examples utilized in this work for the 5-shot scenario.

Conclusions
In this work, we examined the ability of pre-trained language models to generate synthetic data for bootstrapping multilingual QA systems, with as few as five examples in a new target language. We introduced QAMELEON, a parameter-efficient approach which uses prompt tuning to automatically create multilingual QA data. Extensive experiments under different resource scenarios demonstrate that QAMELEON is superior to prompt engineering and competitive baselines based on machine translation. In the future, we would like to extend this approach to other multilingual tasks, including retrieval, summarization, and semantic parsing.

Figure 1 :
Figure 1: Synthetic data generation for multilingual question answering (QA). Left: Examples of the multilingual QA task. Translations are added for readability. Middle: Strategies for localizing QA models to new languages: 1. using English QA data as a zero-shot approach, 2. using Machine Translation (MT) to approximate training data for supervised learning, and 3. few-shot approaches with a handful of multilingual examples. Right: Model performance on the multilingual QA task. We report average Exact Match (EM) across all languages on the TYDIQA-GOLDP dataset (Clark et al., 2020).

Figure 3 :
Figure 3: Distribution of question categories for QAMELEON (PT) generated questions (a,b,c) and TYDIQA-GOLDP training questions (d,e,f). Categories are obtained by translating the questions to English with Google Translate and grouping by the first two word tokens.

Table 1 :
Number of question-answer pairs per language and data split for the datasets considered in this work.

Table 2 :
Synthetic question-answering data generation methods for training multilingual reading comprehension systems on TYDIQA-GOLDP. We report averages over 3 runs of fine-tuning mT5-XL on gold or synthetic data. Standard deviation is given in parentheses. Performance for individual languages (excluding English) is shown in Table 3. For comparison we also include recent few-shot prompting results with large language models on TYDIQA-GOLDP: Chen et al. (2021) §, Chowdhery et al. (2022) †, and Chung et al. (2022) ‡.

Table 3 :
QA performance (average EM over three runs) for individual languages on the TYDIQA-GOLDP evaluation set; the backbone of the QA model is mT5-XL fine-tuned on gold (Baseline, Supervised) or synthetically generated data. The final row displays the percentage of tokens for each language in the PLM training data.

Table 8 :
Examples of QA pairs from human-annotated TYDI QA and generated by QAMELEON (PT) on corresponding passages.English translations from Google Translate are added for readability.

Table 9 :
QA pairs (random selection) generated by QAMELEON (PT) on Wikipedia passages.English translations from Google Translate are added for readability.