Abstract
The availability of large, high-quality datasets has been a major driver of recent progress in question answering (QA). Such annotated datasets, however, are difficult and costly to collect, and rarely exist in languages other than English, rendering QA technology inaccessible to underrepresented languages. An alternative to building large monolingual training datasets is to leverage pre-trained language models (PLMs) under a few-shot learning setting. Our approach, QAmeleon, uses a PLM to automatically generate multilingual data upon which QA models are fine-tuned, thus avoiding costly annotation. Prompt tuning the PLM with only five examples per language delivers accuracy superior to translation-based baselines; it bridges nearly 60% of the gap between an English-only baseline and a fully-supervised upper bound fine-tuned on almost 50,000 hand-labeled examples; and it consistently leads to improvements compared to directly fine-tuning a QA model on labeled examples in low resource settings. Experiments on the TyDiQA-GoldP and MLQA benchmarks show that few-shot prompt tuning for data synthesis scales across languages and is a viable alternative to large-scale annotation.1
1 Introduction
Question answering (QA) has seen impressive progress in recent years enabled by the use of large pre-trained language models (Devlin et al., 2019; Lewis et al., 2020a; Raffel et al., 2020), and the availability of high-quality benchmarks (Rajpurkar et al., 2016; Trischler et al., 2017; Kwiatkowski et al., 2019). Many QA datasets frame the task as reading comprehension, where the question is about a paragraph or document and the answer is a span therein. Advances in QA modeling have been primarily reported for English, which offers a considerable amount of high-quality training data compared to other languages. More recently, efforts have focused on the creation of multilingual QA benchmarks such as TyDi QA (10 languages; Clark et al., 2020), MLQA (6 languages; Lewis et al., 2020b), and XQuAD (10 languages; Artetxe et al., 2020b). Among these, only TyDi QA is genuinely large-scale; MLQA and XQuAD are limited to evaluation sets due to the high cost and labor required to collect data across languages.
As a result, efforts to localize QA models to new languages have primarily focused on zero-shot approaches. Recent proposals include using machine translation to approximate training data for supervised learning (Lewis et al., 2020b), and data augmentation via generating synthetic questions for new languages (Riabi et al., 2021; Shakeri et al., 2021). Both approaches rely on transfer from English, which leads to a dependence on translation artifacts (Koppel and Ordan, 2011; Artetxe et al., 2020a) and a bias towards the linguistic characteristics of English, which is not the best source for all target languages (Lin et al., 2019). Annotating a minimally sized data sample, however, can potentially overcome these limitations while incurring significantly reduced costs compared to full dataset translation (Garrette and Baldridge, 2013).
In this paper, we argue that a few-shot approach in combination with synthetic data generation and existing high-quality English resources can mitigate some of the above-mentioned artifacts. Beyond question answering, multilingual approaches have succeeded at leveraging a small number of annotations within a variety of tasks (Zhao et al., 2021, inter alia) including natural language inference, paraphrase identification, and semantic parsing (Sherborne and Lapata, 2022). Existing work (Brown et al., 2020; Schick and Schütze, 2021, inter alia) has further shown that prompting pre-trained large language models (PLMs) can lead to strong performance on various tasks, including question answering (Khashabi et al., 2020; Chowdhery et al., 2022) and open-ended natural language generation (Tang et al., 2022; Yang et al., 2022). Investigations of prompting in multilingual settings have also shown strong few-shot performance in classification tasks (Winata et al., 2021), natural language inference (Zhao and Schütze, 2021), common sense reasoning (Shi et al., 2022), machine translation (Lin et al., 2022), and retrieval (Dai et al., 2022).
We synthesize these directions into QAmeleon, an approach for bootstrapping multilingual QA systems, with as few as five examples in a new target language (see Figure 1). We use gold annotations to prompt-tune a PLM in order to automatically generate multilingual QA data, which is then used to fine-tune a QA model. We find that QAmeleon delivers accuracy superior to zero-shot methods and competitive translation-based baselines, and in some cases competes with the fully supervised upper bound.2 Experiments on the TyDi QA (Clark et al., 2020) and MLQA (Lewis et al., 2020b) benchmarks show that few-shot prompt tuning (Lester et al., 2021) scales across languages, significantly outperforms prompt engineering (Brown et al., 2020) with the same number of labeled examples, and is a viable alternative to large-scale annotation.
Synthetic data generation for multilingual question-answering (QA). Left: Examples of the multilingual QA task. Translations are added for readability. Middle: Strategies for localizing QA models to new languages: 1. Using English QA data as a zero-shot approach, 2. with Machine Translation (MT) to approximate training data for supervised learning, and 3. few-shot approaches with a handful of multilingual examples. Right: Model performance on the multilingual QA task. We report average Exact Match (EM) across all languages on the TyDiQA-GoldP dataset (Clark et al., 2020).
Our contributions include (a) a new approach to bootstrapping a multilingual QA system: QAmeleon prompt-tunes a PLM with as few as five gold examples to automatically generate multilingual QA data which is then used to fine-tune a QA model; (b) a series of experimental results showing significant improvements over existing approaches in the few-shot regime, including a 12% absolute accuracy gain on TyDiQA-GoldP (Clark et al., 2020) over an English-only baseline and a 4% absolute gain over a competitive translate-train baseline; (c) extensive analysis of the behavior of QAmeleon in zero-shot and low resource regimes, on different multilingual QA datasets, and in comparison to prompt engineering.
2 Synthetic Data Generation
Let $D_l$ denote a QA dataset with examples provided by human annotators, where l is a target language in a set L of languages of interest. $D_l$ consists of samples $(c, q, a)_l$, where c is a paragraph of text, q is a question, and a is an answer extracted from c (see Figure 1, left). We further write $D_l^n$ for the same dataset when we wish to make explicit the number n of examples it contains. For instance, $D_{fr}^{5}$ denotes a French QA dataset with 5 examples. Finally, let $C_l$ denote a set of unlabeled paragraphs in language l; we assume these are in-domain with respect to the paragraphs in $D_l$ but are not accompanied by questions or answers.
Throughout this work, we will assume the availability of $D_{en}$, a large QA dataset in English (the source language). This assumption corresponds to the observation that most large-scale QA datasets (Rajpurkar et al., 2016; Yang et al., 2018; Bajaj et al., 2016; Kwiatkowski et al., 2019) contain examples exclusively in English. For languages other than English, we assume that only small datasets are available for training (e.g., n = 5; the "Few-Shot" scenario) or no data at all (the "English-Only" scenario). We will also assume that sets of unlabeled passages $C_l$ are available for all target languages. Our task is to synthesize QA data in each target language l in order to fine-tune QA models on l directly.
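As a concrete illustration of this notation, the Python sketch below models $D_{fr}^{5}$ and $C_{fr}$ as simple in-memory collections; the class and field names are our own illustrative choices, not part of the paper's formalism.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class QAExample:
    context: str   # paragraph c
    question: str  # question q
    answer: str    # answer a, an extractive span of `context`

# D_fr^5: five human-annotated French (context, question, answer) triples.
D_fr_5: List[QAExample] = [
    QAExample(context="La Tour Eiffel se trouve à Paris ...",
              question="Où se trouve la Tour Eiffel ?",
              answer="Paris"),
    # ... four more annotated examples ...
]

# C_fr: unlabeled in-domain French paragraphs, with no questions or answers.
C_fr: List[str] = ["Le Louvre est le musée le plus visité au monde ...", "..."]
```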
In the rest of this section we formally describe three ways of synthesizing QA data and give further details on the two scenarios we consider, “English-Only” and “Few-Shot”.
2.1 Machine Translation (MT)
In this approach, known as "translate-train", we machine-translate the English dataset $D_{en}$ into each target language l to obtain a synthetic training set $D_l^{MT}$, on which a QA model is then fine-tuned (approach 2 in Figure 1). An alternative is "translate-test", where translation is employed during inference instead of training: multilingual inputs are translated into English, inference is performed with an English QA model, and the English predictions are then translated back into the respective target language. We experimentally found "translate-test" to perform poorly on our task in comparison to translate-train due to its reliance on multiple noisy translation steps.
Note that fine-tuning on $D_l^{MT}$ still relies on the support of the high-quality $D_{en}$. Previous work (Kramchaninova and Defauw, 2022; Vu et al., 2022) has highlighted various limitations of multilingual approaches based on MT, including (a) their dependence on the quality of available MT systems for a given language, and in turn on the availability of high-quality (expensive) parallel data, (b) a potential misalignment of answer spans when answers are translated independently rather than in the context of their passage, and (c) translationese artifacts and English-centric content topics (Clark et al., 2020).
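The sketch below illustrates the translate-train construction of $D_l^{MT}$ and, in particular, limitation (b): an answer translated in isolation may no longer match a span of the translated passage. The `translate` argument is a hypothetical stand-in for an MT system call, not an API used in the paper.

```python
from typing import Callable, Optional

def translate_train_example(context_en: str, question_en: str, answer_en: str,
                            translate: Callable[[str, str], str],
                            tgt_lang: str) -> Optional[dict]:
    """Translate an English (c, q, a) triple into `tgt_lang` for translate-train."""
    context_t = translate(context_en, tgt_lang)
    question_t = translate(question_en, tgt_lang)
    answer_t = translate(answer_en, tgt_lang)
    # Span-alignment check: the independently translated answer must still occur
    # verbatim in the translated passage, otherwise the example is dropped.
    if answer_t not in context_t:
        return None
    return {"context": context_t, "question": question_t, "answer": answer_t}
```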
2.2 Prompt Engineering (PE)
PLMs (Brown et al., 2020; Chowdhery et al., 2022) have recently shown unprecedented performance on a vast number of tasks, including natural language generation, without the need for modifying any of the model’s parameters, simply by hand-designing a textual prompt that instructs the model to perform a certain task. Following Brown et al. (2020), we consider a class of hand-designed prompts referred to as “prompting” or “in-context learning”. The prompt starts with a free form instruction, followed by a small number of instances exemplifying how the task is solved. An incomplete instance is then appended to this prompt and the PLM performs the task by completing that instance. We refer to this approach as “prompt engineering” (PE), since the input to the PLM has to be hand-engineered based on human intuition about the target task (see approach 3 in Figure 1).
In order to hand-engineer prompts for our task, we use a small set of parallel examples consisting of passages, questions, and their answers in the English source and target language l. We discuss how we construct these examples shortly. For now, suffice it to say that we create two prompts for answer and question generation, respectively.3 Our first prompt is used to obtain an answer al in the target language l from passage cl:
I will write potential answers
for the following passages.
Passage: c_l
Answer in English: a_en
Answer in the original language: a_l
...
The second prompt generates question ql, utilizing passage cl and the previously predicted answer al:
I will write questions and answers
for the following passages.
Passage: c_l
Answer: a_l
Question in English: q_en
Question in the original language: q_l
...
We generate synthetic data instances $(c, q, a)_l$, where a and q are inferred by applying our two prompts consecutively to each passage in $C_l$ (recall $C_l$ is the set of unlabeled passages in target language l); we denote the resulting synthetic dataset $D_l^{PE}$.
Note that in composing the prompt we always include English as an intermediate or "bridge", i.e., we ask the model to predict questions and answers in English in addition to those in the target language, as we experimentally found this improves the quality of the generated data. The use of a bridge for this task can be thought of as an example of multilingual "chain of thought" prompting (Wei et al., 2022).
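A minimal sketch of how the two prompts above can be assembled is given below. The exemplar field names and the `exemplars`/`passage` variables are illustrative placeholders; the instruction strings follow the templates shown above.

```python
ANSWER_HEADER = "I will write potential answers for the following passages."
QUESTION_HEADER = "I will write questions and answers for the following passages."

def answer_prompt(exemplars, passage):
    """Stage 1: elicit an answer in the target language, using English as a bridge."""
    blocks = [ANSWER_HEADER]
    for ex in exemplars:  # each ex has passage / answer_en / answer_tgt fields
        blocks.append(f"Passage: {ex['passage']}\n"
                      f"Answer in English: {ex['answer_en']}\n"
                      f"Answer in the original language: {ex['answer_tgt']}")
    # Incomplete instance: the PLM completes the two answer lines.
    blocks.append(f"Passage: {passage}\nAnswer in English:")
    return "\n\n".join(blocks)

def question_prompt(exemplars, passage, answer_tgt):
    """Stage 2: elicit a question conditioned on the passage and predicted answer."""
    blocks = [QUESTION_HEADER]
    for ex in exemplars:
        blocks.append(f"Passage: {ex['passage']}\n"
                      f"Answer: {ex['answer_tgt']}\n"
                      f"Question in English: {ex['question_en']}\n"
                      f"Question in the original language: {ex['question_tgt']}")
    blocks.append(f"Passage: {passage}\nAnswer: {answer_tgt}\nQuestion in English:")
    return "\n\n".join(blocks)
```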
2.3 QAmeleon (PT)
In this approach, the PLM is trained to model $P(a, q | c, l)$ by minimizing the cross-entropy loss over a training set containing examples $(c, q, a)_l$ for the languages in L; after training, the PLM generates question-answer pairs from the unlabeled passages in $C_l$, yielding the synthetic dataset $D_l^{PT}$. As with PE, we build the PLM's training set in two ways: for "English-Only" we machine-translate examples from $D_{en}$ into each target language, while for "Few-Shot" we use the n gold examples $D_l^n$ available per target language.
Given the small size of the training set in the "Few-Shot" setting and the large size of current models, we opt for prompt tuning (PT; Lester et al., 2021), a parameter-efficient fine-tuning variant in which a soft prompt of m tokens is concatenated to the input of the PLM, where m is a hyperparameter set to 50 throughout this work. Only the embeddings of these m prompt tokens are modified by the optimizer. Note that in prompt tuning, as in prompt engineering, the parameters of the PLM remain unchanged; what is trained is only the short soft prompt that is prepended to the input embeddings at inference time.
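The sketch below illustrates this setup on a public T5-style checkpoint, used here as a stand-in for the (non-public) PaLM model employed in the paper; the checkpoint name, single-example loop, and exact optimizer settings are illustrative assumptions rather than the authors' configuration.

```python
import torch
from transformers import Adafactor, AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/mt5-small"  # illustrative stand-in checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Freeze every PLM parameter; only the soft prompt below receives gradients.
for p in model.parameters():
    p.requires_grad = False

m = 50  # soft prompt length, as used in the paper
soft_prompt = torch.nn.Parameter(0.01 * torch.randn(m, model.config.d_model))
optimizer = Adafactor([soft_prompt], lr=0.3, relative_step=False,
                      scale_parameter=False, warmup_init=False)

def prompt_tuning_loss(source_text: str, target_text: str) -> torch.Tensor:
    enc = tokenizer(source_text, return_tensors="pt")
    labels = tokenizer(target_text, return_tensors="pt").input_ids
    # Embed the input tokens and prepend the trainable soft prompt.
    token_embeds = model.get_input_embeddings()(enc.input_ids)       # (1, seq, d)
    inputs_embeds = torch.cat([soft_prompt.unsqueeze(0), token_embeds], dim=1)
    attention_mask = torch.cat(
        [torch.ones(1, m, dtype=enc.attention_mask.dtype), enc.attention_mask], dim=1)
    return model(inputs_embeds=inputs_embeds,
                 attention_mask=attention_mask, labels=labels).loss

# One optimization step on a single (passage -> answer/question) training pair.
loss = prompt_tuning_loss("Passage: ...", "Answer: ... Question: ...")
loss.backward()
optimizer.step()
optimizer.zero_grad()
```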
2.4 Data Assumptions
English-Only
In this scenario, only training data in English is available, denoted as $D_{en}$. Prompt Engineering (PE) assumes parallel exemplars are available, while Prompt Tuning (PT) requires exemplars in the target language only; both can be obtained by translating examples of the English data into each target language. The Machine Translation (MT) approach in this work follows this scenario only.
Few-Shot
In this scenario, a small number of examples (n-shot) is available in each target language, denoted as $D_l^n$. Parallel exemplars for Prompt Engineering (PE) can be obtained by translating the target-language data into English, while Prompt Tuning (PT) only requires exemplars in the target language, which are readily available in this setting.
3 Experimental Setup
We evaluate the synthetic data generation approaches presented in Section 2 across various languages on two benchmark datasets, which we discuss below. We also describe various model configurations, and comparison systems before presenting our results.
3.1 Datasets
TyDi QA
(Clark et al., 2020) is a multilingual extractive question answering dataset designed to represent a typologically diverse set of languages. Annotators were given a Wikipedia passage in the target language and asked to write a question that could not be answered by that passage. For each question, the top-ranked Wikipedia article was then retrieved via Google Search. Annotators were subsequently asked to answer the question given the retrieved Wikipedia article. As a result of this information-seeking task design, questions in TyDi QA are often without an answer. In this work we consider TyDiQA-GoldP: the Gold Passage version of TyDi QA where only questions with answers in the Wikipedia page are given and the model has to identify the answer in the passage that contains it (see Table 1 for statistics on this dataset).
Number of question-answer pairs per language and data split for the datasets considered in this work.
| Language | TyDiQA-GoldP Train | TyDiQA-GoldP Eval | MLQA Dev | MLQA Test |
|---|---|---|---|---|
| Arabic | 14,805 | 921 | 517 | 5,335 |
| Bengali | 2,390 | 113 | – | – |
| Chinese | – | – | 504 | 5,137 |
| English | 3,696 | 440 | 1,148 | 11,590 |
| Finnish | 6,855 | 782 | – | – |
| German | – | – | 512 | 4,517 |
| Hindi | – | – | 507 | 4,918 |
| Indonesian | 5,702 | 565 | – | – |
| Kiswahili | 2,755 | 499 | – | – |
| Korean | 1,625 | 276 | – | – |
| Russian | 6,490 | 812 | – | – |
| Spanish | – | – | 500 | 5,253 |
| Telugu | 5,563 | 669 | – | – |
| Vietnamese | – | – | 511 | 5,495 |
| Total | 49,881 | 5,077 | 4,199 | 42,245 |
MLQA
(Lewis et al., 2020b) is an extractive question answering dataset, designed for evaluating multilingual and cross-lingual question answering models. MLQA does not publish a training split, but only development and test partitions. MLQA was created by aligning sentences in Wikipedia passages across different languages. Annotators then created questions based on English sentences, professional translators translated these questions to other languages, and finally annotators selected answers from passages containing sentences aligned to the translated questions. As in TyDiQA-GoldP, the task is to extract the answer from a passage given a question (dataset statistics are shown in Table 1).
Unlabeled Data
We obtained paragraphs in each target language from Wikipedia. Specifically, we pre-processed Wikipedia pages using WikiExtractor (Attardi, 2015). Paragraphs were sampled uniformly, with a length between 200 and 510 characters. The target language was determined based on the language code of the Wikipedia edition.
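The following sketch shows one way to implement this sampling step over WikiExtractor output; the JSONL path and the "text" field name are assumptions about the extractor's JSON output format, not details specified in the paper.

```python
import json
import random

def sample_passages(jsonl_path: str, k: int, min_len: int = 200,
                    max_len: int = 510, seed: int = 0):
    """Uniformly sample k paragraphs whose length falls between min_len and
    max_len characters from a WikiExtractor JSONL dump (one article per line)."""
    paragraphs = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            text = json.loads(line).get("text", "")
            for para in text.split("\n"):
                if min_len <= len(para) <= max_len:
                    paragraphs.append(para)
    random.Random(seed).shuffle(paragraphs)
    return paragraphs[:k]
```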
3.2 Model Configuration
Synthetic Data Generation
In our TyDi QA experiments, we treat the TyDiQA-GoldP English training data as the English source $D_{en}$. For MLQA, we employ the English SQuAD (Rajpurkar et al., 2016) training data as the source. In the Few-Shot scenario, our human-annotated target-language examples are taken from the training split of TyDiQA-GoldP and the development split of MLQA.
For machine translation (MT), we employ the public Google Translate API (Wu et al., 2016), while the PLM utilized in this work is PaLM-540B (Chowdhery et al., 2022). We perform heuristic checks to clean the synthetic datasets $D_l^{PE}$ and $D_l^{PT}$: we only preserve a question-answer pair if the generated answer a is a substring of the given context c but not a substring of the question q. The first check is needed because both TyDiQA-GoldP and MLQA are extractive QA datasets; the second because we empirically found that some low-quality generated question-answer pairs were trivially answerable from the content of the question alone, for example, q: "where is X?", a: "X".
In the construction of $D_l^{PE}$, we additionally perform round-trip filtering (Alberti et al., 2019), as qualitative analysis of random QA pairs suggested a higher level of noise in the PE-generated data. This round-trip consistency check compares the originally generated answer a in $(c, q, a)_l$ with the answer predicted by prompting the PLM to answer question q based on passage c. We also tried round-trip filtering for PT-generated data but did not observe any gains. We report detailed statistics of the synthetically generated datasets in Section 5.
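A minimal sketch of these two filters is shown below; `answer_with_plm` is a hypothetical helper standing in for the PLM round-trip call and is not part of any real API.

```python
def keep_pair(context: str, question: str, answer: str) -> bool:
    """Heuristic filter: keep a generated pair only if the answer is extractive
    (a substring of the context) and has not leaked into the question."""
    return (answer in context) and (answer not in question)

def round_trip_consistent(context: str, question: str, answer: str,
                          answer_with_plm) -> bool:
    """Round-trip check (applied to PE data): re-answer the question with the PLM
    and keep the pair only if the original answer is recovered."""
    return answer_with_plm(context, question).strip() == answer.strip()

# Toy usage: kept, since the answer occurs in the context but not in the question.
print(keep_pair("Paris is the capital of France.",
                "What is the capital of France?", "Paris"))
```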
In the construction of $D_l^{PT}$, we prompt-tune the PLM on the translated English examples (English-Only) or on $D_l^n$ (Few-Shot), as detailed earlier. Prompt tuning is performed with the AdaFactor optimizer (Shazeer and Stern, 2018). We tune a prompt of length m = 50 tokens for up to 1,000 steps, evaluating every 50 steps, with a batch size of 16 examples and a learning rate of 0.3 with a linear warmup of 200 steps. We use early stopping to select the best prompt per language based on BLEU (Papineni et al., 2002) on a held-out set from the English TyDiQA-GoldP data, translated into each target language.
Question Answering
We fine-tuned an mT5-XL model (Xue et al., 2021) for question answering to evaluate the different synthetic data generation methods ($D_l^{MT}$, $D_l^{PE}$, and $D_l^{PT}$). As a baseline, we further use mT5-XL fine-tuned on the available training data: in the English-Only scenario, the Baseline mT5-XL is fine-tuned on the English QA data $D_{en}$; in the Few-Shot scenario, it is fine-tuned on the n human-annotated examples in the target languages (the same number given to PE and PT). We conducted experiments on TyDiQA-GoldP (Clark et al., 2020) and MLQA (Lewis et al., 2020b); see Section 3.1.
For downstream QA evaluation, mT5-XL was fine-tuned with AdaFactor, a learning rate of 0.0002, a batch size of 64, and up to 3,000 and 5,000 training steps for TyDiQA-GoldP and MLQA, respectively, evaluating every 50 steps. We measure QA performance with Exact Match (EM) and F1, and report the unweighted average across languages (excluding English). For TyDiQA-GoldP, we report results on the development split, which is commonly used as an evaluation set since the test split is unavailable. We select mT5 checkpoints per language using EM, and report the average of 3 runs. For MLQA, we present results on the test split, selecting the best mT5 checkpoint based on the average EM on the MLQA dev set.
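The sketch below illustrates the aggregation used for reporting: per-example exact match, averaged per language and then unweighted across languages (excluding English). The answer normalization is simplified relative to the official SQuAD/TyDi QA evaluation scripts.

```python
from collections import defaultdict

def exact_match(pred: str, gold: str) -> float:
    # Simplified normalization; official scripts also strip punctuation/articles.
    return float(pred.strip().lower() == gold.strip().lower())

def macro_average_em(examples) -> float:
    """examples: iterable of dicts with 'lang', 'pred', and 'gold' keys."""
    per_lang = defaultdict(list)
    for ex in examples:
        if ex["lang"] != "en":  # English is excluded from the average
            per_lang[ex["lang"]].append(exact_match(ex["pred"], ex["gold"]))
    lang_scores = [sum(v) / len(v) for v in per_lang.values()]
    return 100.0 * sum(lang_scores) / len(lang_scores)
```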
4 Results
QAmeleon (PT) Delivers the Best QA System
Table 2 summarizes our results on TyDi QA for both the English-Only and Few-Shot scenarios. Overall, we find that a low resource setting with 5 human-annotated examples per target language ($D_l^5$) is useful for scaling QA to multiple languages. More specifically, 5-shot prompt tuning improves exact match (EM) answer accuracy on the TyDiQA-GoldP evaluation set by 11.7% absolute (58.5% → 70.2%) over mT5 fine-tuned on English data only (Baseline), by 3.7% (66.5% → 70.2%) over mT5 fine-tuned on 5 examples per language (Few-Shot Baseline), and by 4.1% (66.1% → 70.2%) over mT5 fine-tuned on the data obtained with the MT approach.
Synthetic question-answering data generation methods for training multilingual reading comprehension systems on TyDiQA-GoldP. We report averages over 3 runs of fine-tuning mT5-XL on gold or synthetic data. Standard deviation is given in parentheses. Performance for individual languages (excluding English) is shown in Table 3. For comparison we also include recent few-shot prompting results with large language models on TyDiQA-GoldP: Chen et al. (2021)§, Chowdhery et al. (2022)†, and Chung et al. (2022)‡.
| Method | English-Only Avg EM | English-Only Avg F1 | n-Shot | Few-Shot Avg EM | Few-Shot Avg F1 |
|---|---|---|---|---|---|
| Baseline | 58.5 (±3.1) | 74.2 (±2.6) | 5 | 66.5 (±0.7) | 79.8 (±0.4) |
| MT | 66.1 (±2.1) | 79.5 (±1.8) | 5 | – | – |
| PE | 64.4 (±1.4) | 76.9 (±1.1) | 5 | 62.6 (±1.8) | 77.6 (±1.2) |
| PE + MT | 69.4 (±0.4) | 81.4 (±0.4) | 5 | 67.9 (±0.2) | 80.5 (±0.6) |
| QAmeleon (PT) | 65.5 (±0.7) | 79.4 (±0.7) | 5 | 70.2 (±0.2) | 81.7 (±0.1) |
| QAmeleon (PT) + MT | 68.1 (±0.8) | 80.9 (±0.7) | 5 | 70.7 (±0.9) | 82.2 (±0.8) |
| code-davinci-002§ | – | – | 1 | 48.1 | – |
| PaLM-540B† | – | – | 1–10 | 60.0 | – |
| Flan-U-PaLM-540B‡ | – | – | 1 | 68.3 | – |
QAmeleon further improves over the few-shot results obtained by prompting code-davinci-002 (Chen et al., 2021), PaLM-540B (Chowdhery et al., 2022), and Flan-U-PaLM-540B (Chung et al., 2022) with a similar number of available labeled examples. These approaches directly employ extremely large PLMs for the task of QA, whereas QAmeleon leverages data synthesis to distill a PLM into a much smaller mT5-XL model. It is also important to note that QAmeleon as an approach is orthogonal and possibly complementary to any improvements due to more performant QA models and more sophisticated PLMs (e.g., Flan-U-PaLM-540B).
In both the English-Only and Few-Shot resource scenarios, QAmeleon outperforms the other two data generation approaches, Machine Translation (MT) and Prompt Engineering (PE). Despite employing PE in two stages, chain-of-thought style, we observe that the generated data leads to lower QA performance. Moreover, PE performs better with English-Only data than in the Few-Shot scenario, suggesting that the PLM is better able to utilize high-quality English data than small amounts of labeled data in other languages. Finally, augmenting PLM-generated data (either via PE or PT) with data generated via MT leads to gains in QA performance over using either method independently. This could be due to the coupling of diverse QA data, i.e., language-specific content and task-specific English-centric translated content.
Table 3 shows QA performance for individual languages, for each of the methods in Table 2 in its best performing setting: Few-Shot Baseline, Machine Translation (MT), Prompt Engineering (PE), Prompt Tuning (PT), and PE or PT augmented with MT. Data generated by QAmeleon (PT) using 5 target-language examples provides the best performance in Bengali, Finnish, and Telugu. A further boost can be seen for Arabic, Indonesian, Russian, and Kiswahili when QAmeleon data is combined with MT data. The language distribution listed under '% tokens in PLM' reflects the extremely low representation of many languages in the pre-training corpus of the PLM used in this work. As an upper bound, we additionally show the performance of supervised mT5-XL fine-tuned on large amounts of gold training data (see Table 1) to illustrate the remaining gap, which could potentially be bridged by increasing the number of labeled examples or by improved (e.g., more multilingual or FLAN-tuned) PLMs.
QA performance (Average EM over three runs) for individual languages on the TyDiQA-GoldP evaluation set; the backbone of the QA model is mT5-XL fine-tuned on gold (Baseline, Supervised) or synthetically generated data. The final row displays the percent of tokens for each language in the PLM training data.
| Method | n-Shot | Ar | Bn | Fi | Id | Ko | Ru | Sw | Te | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 5 | 65.9 | 68.4 | 65.1 | 71.3 | 68.4 | 57.6 | 60.1 | 75.4 | 66.5 |
| MT | 0 | 66.3 | 62.2 | 65.2 | 72.4 | 63.9 | 61.1 | 70.5 | 67.0 | 66.1 |
| PE | 0 | 60.4 | 66.7 | 63.5 | 63.6 | 65.1 | 53.8 | 74.5 | 67.3 | 64.4 |
| PE + MT | 0 | 68.1 | 70.5 | 68.2 | 73.6 | 68.5 | 61.0 | 78.4 | 66.9 | 69.4 |
| QAmeleon (PT) | 5 | 65.4 | 76.7 | 69.4 | 69.0 | 67.6 | 61.5 | 75.6 | 76.7 | 70.2 |
| QAmeleon (PT) + MT | 5 | 67.9 | 72.6 | 69.2 | 73.8 | 65.1 | 62.8 | 77.7 | 76.1 | 70.7 |
| Supervised | Multi-k | 75.7 | 81.4 | 74.5 | 79.8 | 77.2 | 72.8 | 82.6 | 83.0 | 78.4 |
| % tokens in PLM | – | 0.15 | 0.03 | 0.42 | 0.16 | 0.19 | 0.53 | 0.01 | 0.02 | – |
Increasing Labeled Examples Improves QA Performance
So far, we have tested QAmeleon in an extremely low resource setting, using only 5 examples per target language. We next examine its performance as we vary the number of annotated examples. Table 4 compares the performance of mT5-XL fine-tuned on 1 to 100 examples per language (Baseline), on synthetic QA datasets generated by QAmeleon using the corresponding number of examples, and, as an upper bound, on the entire TyDiQA-GoldP training set (Supervised). As can be seen, increasing the number of examples from 1 to 50 improves the performance of QA models. We observe, however, a decrease in performance at 100 examples, showing that further research will likely be needed to close the gap to the fully supervised upper bound while still labeling only a small number of examples. Importantly, for all the amounts of annotated data we considered, significant improvements in multilingual QA were obtained by generating data with QAmeleon rather than fine-tuning the QA model directly on the labeled data.
Comparison of QA performance from fine-tuning mT5-XL on 1 to 100 examples (Baseline), on synthetic data generated with prompt tuning (PT), or on the full TyDiQA-GoldP training set (Supervised). Results are averaged across languages.
| Method | n-Shot | Avg EM |
|---|---|---|
| Baseline | 1 | 63.7 |
| QAmeleon (PT) | 1 | 69.7 |
| Baseline | 5 | 66.5 |
| QAmeleon (PT) | 5 | 70.2 |
| Baseline | 50 | 69.3 |
| QAmeleon (PT) | 50 | 73.7 |
| Baseline | 100 | 70.6 |
| QAmeleon (PT) | 100 | 71.9 |
| Supervised | Multi-k | 78.4 |
The Larger the Synthetic Data, the Better the QA Model
We now study the impact of the size of the automatically generated datasets on QA performance. As shown in Figure 2, absolute accuracy increases as larger amounts of synthetic data are used to fine-tune the QA model. The gain is larger when combining PLM-generated data with translation data than when using either dataset individually, which can be explained by the increased diversity of the combined data: it contains both English-centric translated content and target language-specific content obtained from the PLM. Eventually, we observe a saturation effect, i.e., beyond O(1,000) QA pairs per target language, improvements are limited.
Effect of synthetic data size on downstream QA performance (Average EM on TyDiQA-GoldP evaluation set); results shown for mT5-XL QA model fine-tuned via Machine Translation (MT), Prompt Engineering (PE), Prompt Tuning (QAmeleon (PT)), and combinations thereof (PE + MT and QAmeleon (PT) + MT).
BLEU Does not Correlate with Downstream QA Performance
An interesting question is whether improvements in QA performance are due to better (e.g., more grammatical or diverse) questions. We assessed the quality of questions generated by QAmeleon (PT) on TyDiQA-GoldP by measuring their similarity to gold-standard questions, and compared against an mT5-XL question generation model fine-tuned in the Few-Shot setting. Both QAmeleon (PT) and the mT5-XL question generation model were given the same number of examples per language. Table 5 reports BLEU (Papineni et al., 2002) scores for the two models; we additionally report question answering performance (in terms of EM) of another set of mT5-XL models fine-tuned on the synthetic data generated by each.
BLEU scores and downstream QA performance on TyDiQA-GoldP for questions generated by mT5-XL and QAmeleon (Few-shot setting, 5 examples in the target language).
| Method | n-Shot | Avg BLEU | Avg EM |
|---|---|---|---|
| mT5-XL | 5 | 24.74 | 57.3 |
| QAmeleon (PT) | 5 | 24.29 | 70.2 |
Even though mT5-XL produces questions with a slightly higher BLEU score, QAmeleon generates QA data that leads to much higher QA performance. This result underscores the need for more trustworthy automatic evaluation metrics (Sellam et al., 2020) across languages.
Our Results Generalize to MLQA
To validate the general applicability of our approach, we evaluate QAmeleon on MLQA (Lewis et al., 2020b). We prompt-tune the PLM on 5 examples per language taken from the MLQA development set, since MLQA does not provide training partitions. We generate synthetic datasets in all of the MLQA languages and compare an English-only baseline, MT, and QAmeleon (PT) approaches as we did previously for TyDiQA-GoldP. We report results (EM and F1) using mT5-XL as the QA model in Table 6, where English is included in the average performance.
Downstream QA performance on the MLQA test set with an mT5-XL model trained on SQuAD English data (English-Only), SQuAD translated to all MLQA languages (MT), on synthetic data generated by QAmeleon (5-shot) in all MLQA languages, or on a combination of data generated by MT and QAmeleon. Results for Xue et al. (2021) and Chi et al. (2022) are taken from the respective papers.
| Method | Avg EM | Avg F1 |
|---|---|---|
| English-Only | 53.1 | 71.8 |
| MT | 56.4 | 74.8 |
| QAmeleon (PT) | 55.0 | 74.3 |
| QAmeleon (PT) + MT | 56.8 | 75.3 |
| mT5-XL (Xue et al., 2021) | 54.5 | 73.5 |
| XLM-E-XL (Chi et al., 2022) | 57.9 | 76.2 |
We find that the MT approach is very effective on MLQA, which is not surprising since MLQA questions are translated from English. QAmeleon (PT), however, still delivers an improvement in combination with MT synthetic data. Table 6 further reports comparisons with the state-of-the-art models of Xue et al. (2021) and Chi et al. (2022). The former is mT5-XL (3.7B parameters) fine-tuned on English data only, whereas XLM-E-XL (2.2B parameters) benefits from a different language model pretraining technique. The latter approach is orthogonal and potentially complementary to QAmeleon.
5 Data Analysis
Table 7 shows the size of the synthetic data resources generated via Prompt Engineering (PE) and QAmeleon (PT), per language and in total. These were in the range of 47,000–53,000 QA examples for TyDiQA-GoldP, and 89,000 for MLQA. The varying size of the data across languages is due to the filtering described in Section 3. In some languages (e.g., Telugu) generation is noisier, leading to fewer data points. We conjecture this is due to the PLM being exposed to less data representing these languages during pre-training, and further hypothesize that more multilingual pre-training of PLMs could lead to better quality data across all languages.
Number of synthetic question-answer pairs per language generated via Prompt Engineering (PE) and QAmeleon (PT) with 5 human-labeled examples.
| Language | TyDiQA-GoldP PE | TyDiQA-GoldP QAmeleon | MLQA QAmeleon |
|---|---|---|---|
| Arabic | 5,219 | 8,499 | 14,738 |
| Bengali | 5,948 | 8,036 | – |
| Chinese | – | – | 14,669 |
| Finnish | 8,062 | 5,943 | – |
| German | – | – | 11,186 |
| Hindi | – | – | 12,036 |
| Indonesian | 6,487 | 7,810 | – |
| Kiswahili | 8,003 | 8,041 | – |
| Korean | 5,229 | 7,906 | – |
| Russian | 5,619 | 7,441 | – |
| Spanish | – | – | 10,134 |
| Telugu | 2,742 | 5,222 | – |
| Vietnamese | – | – | 13,333 |
| Total | 47,309 | 52,955 | 89,344 |
Machine translation (MT) creates the same number of data points as the source training set. For TyDiQA-GoldP, the English training set contains 3,696 data points (Table 1), leading to approximately 29,000 QA examples across 8 languages. For MLQA, machine translation uses SQuAD as the English dataset, consisting of ∼87,000 data points and leading to ∼522,000 QA examples across 6 languages.
Figure 3 shows the distribution of question types for individual languages and on average across all languages. For each language, synthetically generated questions were first translated to English and then grouped into categories (inner circle) based on their first word and into sub-categories (outer circle) based on their first two words. We find that QAmeleon (Figure 3c) generates more diverse questions than TyDiQA-GoldP (Figure 3f). The distribution of question words varies across languages in both datasets: for example, diversity is higher for Russian, Finnish, and Indonesian (Figure 3a,d), whereas for Bengali and Telugu (Figure 3b,e) the distribution is skewed towards a specific question type ('What' for Telugu and 'Other' for Bengali). This could be attributed to a lack of diversity in questions for these languages or to poor translation quality leading to skewed utterances.
Distribution of question categories for QAmeleon (PT) generated questions (a,b,c) and TyDiQA-GoldP training questions (d,e,f). Categories are obtained by translating the questions to English with Google Translate and grouping by the first two word tokens.
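The categorization itself can be sketched as follows; the `translate` argument is a hypothetical stand-in for the Google Translate call, and the toy usage at the end is purely illustrative.

```python
from collections import Counter

def question_type_distribution(questions, translate):
    """Bucket questions by their first English word (category) and first two
    English words (sub-category), mirroring the inner/outer circles of Figure 3."""
    categories, subcategories = Counter(), Counter()
    for q in questions:
        tokens = translate(q).lower().split()
        if not tokens:
            continue
        categories[tokens[0]] += 1
        subcategories[" ".join(tokens[:2])] += 1
    return categories, subcategories

# Toy usage with a fake single-sentence "translator".
cats, subcats = question_type_distribution(
    ["¿Dónde nació Frida Kahlo?"], translate=lambda q: "Where was Frida Kahlo born?")
```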
Table 8 illustrates randomly sampled examples of QA pairs generated by QAmeleon (PT) for passages in the TyDiQA-GoldP eval set. For these passages, we also have access to human annotated QA pairs. As can be seen, QA pairs generated by QAmeleon are of similar quality and at times more diverse compared to the human-annotated dataset. Table 9 illustrates examples of QA pairs generated by QAmeleon from randomly selected Wikipedia passages.
6 Related Work
Data Generation for QA
Prior work on the generation of QA data has mostly focused on English and typically divides the task into answer extraction/generation and question generation, followed by some type of filtering. Alberti et al. (2019) employ round-trip consistency for filtering with BERT-based models. Other work (Shakeri et al., 2020) uses BART to jointly generate a question and its answer given an input passage, employing likelihood-based filtering. Lewis et al. (2021) use a RoBERTa-based passage selection model to identify interesting passages. Bartolo et al. (2021) additionally train the generation models on an adversarial QA dataset, while Yao et al. (2022) integrate a QA-pair ranking module.
The above approaches generally require large amounts of labeled QA data in the form of SQuAD (Rajpurkar et al., 2016) or Natural Questions (Kwiatkowski et al., 2019) to train passage selection and question generation models. In contrast, we only assume access to a few question-answer pairs per language.
Multilingual QA
In this work we used mT5-XL (Xue et al., 2021) as our reference QA model. We note that a slightly more performant choice could have been ByT5 (Xue et al., 2022), which reports improvements on TyDiQA-GoldP by operating directly on raw text instead of sentence pieces. Existing work on low resource multilingual QA has been relatively limited. Lee et al. (2018) propose to use automatically translated high-confidence QA examples for training, while other approaches (Kumar et al., 2019; Chi et al., 2020) only generate questions and require supervised training data in the target language. Other approaches (Riabi et al., 2021; Shakeri et al., 2021; Kramchaninova and Defauw, 2022) focus on zero-shot transfer, i.e., a multilingual model trained for QA data generation on SQuAD (and optionally on automatically translated SQuAD data) is applied to other languages. Our work shows that few-shot settings result in better multilingual generation quality in comparison to zero-shot models.
Prompting
Existing work (Brown et al., 2020; Schick and Schütze, 2021, inter alia) has shown that prompting pre-trained large language models can lead to strong performance on a wide range of tasks including natural language generation and common sense reasoning. In the context of multilingual QA, Chowdhery et al. (2022) employ a single prompt and a few labeled examples in the target language. In contrast, we employ chain-of-thought prompting, with English answers and questions as a bridge. Moreover, our experiments with QAmeleon demonstrate that prompt tuning is superior and a viable alternative to large-scale annotation. Prompting in multilingual settings has achieved the best performance using English prompts and target language exemplars (Winata et al., 2021; Lin et al., 2022; Shi et al., 2022). We demonstrate that a parameter-efficient method such as prompt tuning with target language exemplars (Lester et al., 2021) is a superior choice.
7 Benefits and Limitations
The method proposed in this work, QAmeleon, prompt tunes large PLMs to generate multilingual synthetic question answering data. In this section we discuss its benefits over related approaches, but also drawbacks and limitations. The main benefits are large performance improvements over alternative methods, as borne out by our experiments, as well as surprising data efficiency achieved through large-scale pre-training and a few manual annotations. Alternative methods considered here are multilingual QA approaches for low resource languages, such as translate-test, translate-train, fine-tuning multilingual models directly on the small amount of available training data, performing multilingual QA directly through in-context learning, or even synthetic data generation with prompt engineered PLMs.
Another benefit of our approach stems from prompt tuning itself, which is able to learn from a tiny number of training examples, as low as one example per language in our experiments, whereas fine-tuning cannot be applied as easily. Prompt tuning also affords the practical advantage of being space efficient: only a fraction of a percent of the storage space is needed to save the learned parameters, since only the learned soft prompt needs to be stored. Our evaluation methodology also provides benefits, since we measure question answering performance directly on downstream models instead of using a proxy like BLEU or ROUGE on generated questions. As shown in Table 5, proxy metrics can be misleading, and one might conclude that smaller models generate better questions than large PLMs if the evaluation were to consider only question BLEU scores.
The main drawback of our method is the high computational cost required to prompt tune the PLM and to generate the synthetic data. While prompt tuning is not as expensive as fine-tuning, we still need to perform optimization on a model containing hundreds of billions of parameters. We estimate the cost of each prompt tuning and data generation experiment to be on the order of 256 TPU v3 chips for 12 hours. Another limitation of our experimental results is that they are fundamentally tied to a specific large PLM. PLMs are an area of active research, so any improvements in pre-training techniques, construction of pre-training sets, instruction tuning, or reinforcement learning are likely to translate into improvements for our synthetic data generation method. Promising areas of future work are parameter-efficient techniques similar to prompt tuning, as well as analysis of data augmentation techniques like QAmeleon across different types and sizes of PLMs. Moreover, a more formal understanding of how the number of manual annotations (i.e., few shots) interacts with the quality of synthetic generation would also be useful. Perhaps somewhat counter-intuitively, our experiments showed that QA performance does not drastically improve when scaling from 50 to 100 manual examples.
8 Data Release
To assist with the replicability of our results and to allow other researchers to benefit from our work, we will release a significant portion of the synthetic data generated by QAmeleon in the 5-shot scenario. To minimize the chance that question-answer pairs generated by the PLM contain sensitive, offensive or controversial material, we vetted each generated question with three human raters. We asked each rater to discard question-answer pairs that made generalized claims about groups, contained opinions that were potentially disparaging or embarrassing to one or more people, or names of individuals not related to media (e.g., film, TV) or sport. The release will contain 47,173 examples, each with a Wikipedia passage, a question and an extractive answer, corresponding to 89% of the examples utilized in this work for the 5-shot scenario.
9 Conclusions
In this work, we examined the ability of pre-trained language models to generate synthetic data for bootstrapping multilingual QA systems, with as few as five examples in a new target language. We introduced QAmeleon, a parameter efficient approach which uses prompt tuning to automatically create multilingual QA data. Extensive experiments under different resource scenarios demonstrate that QAmeleon is superior to prompt engineering and competitive baselines based on machine translation. In the future, we would like to extend this approach to other multilingual tasks, including retrieval, summarization, and semantic parsing.
Notes
1. We release the multilingual QA synthetic data used for fine-tuning our QA models at https://github.com/google-research-datasets/QAmeleon.
2. This is noteworthy as multilingual models fine-tuned on translated data (also known as translate-train) form the state of the art on most multilingual datasets (Ruder et al., 2021).
3. We find that joint answer and question generation using single-stage prompting performs worse in comparison to two-stage generation.
Author notes
Equal contribution. See Contributions section for details.
Action Editor: Lidong Bing