Recent approaches to multilingual open-domain question answering (MLODQA) have achieved promising results given abundant language-specific training data. However, the considerable annotation cost limits the application of these methods to underrepresented languages. We introduce a few-shot learning approach that synthesizes large-scale multilingual data from large language models (LLMs). Our method begins with large-scale self-supervised pre-training using WikiData, followed by training on high-quality synthetic multilingual data generated by prompting LLMs with few-shot supervision. The final model, FsModQA, significantly outperforms existing few-shot and supervised baselines on MLODQA and on cross-lingual and monolingual retrieval. We further show that our method can be extended for effective zero-shot adaptation to new languages through a cross-lingual prompting strategy with only English-supervised data, making it a general and applicable solution for MLODQA tasks without costly large-scale annotation.

Open-domain QA has demonstrated impressive performance by employing the retrieve-then-read (Figure 1(a)) pipeline (Chen et al., 2017), which is built upon dense retrievers (Karpukhin et al., 2020) and efficient generative readers (Izacard and Grave, 2021). However, this success has been primarily limited to English, leaving the multilingual setting under-explored. This limitation is mainly due to the difficulty and costs of creating high-quality and balanced human-supervised training data for languages other than English. Moreover, multilingual open-domain QA introduces additional challenges with retrieving evidence from multilingual corpora, requiring the underlying retrieval system to be capable of both cross-lingual and monolingual retrieval (Asai et al., 2021b).

Figure 1: 

Left: Multilingual open-domain QA pipeline. Middle: Training strategies: 1) self-supervised pre-training; 2–4) baselines using English QA data: 2) used directly; 3) machine translated into target languages; 4) used to prompt LLMs to generate target language QA samples; 5) our method using few-shot in-language data to prompt an LLM. Right: Result comparison (Avg. F1) on the XOR-Full dataset.


More recently, efforts have been made to create multilingual open-domain QA benchmarks from existing multilingual machine reading comprehension tasks (e.g., Xor-TyDi QA [Asai et al., 2021a]) and by translating English datasets (e.g., MKQA [Longpre et al., 2021]). These datasets have enabled various approaches to address multilingual open-domain QA problems, including iterative data augmentation (Asai et al., 2021b) and extensive additional pre-training on Wikipedia texts (Abulkhanov et al., 2023; Jiang et al., 2024). However, these methods still heavily depend on abundant high-quality language-specific data for fine-tuning, making them less effective solutions when language resources are limited. Therefore, a more generalizable approach to multilingual open-domain QA should aim to mitigate this reliance and be capable of facilitating language adaptation with minimally supervised samples.

In this paper, we present FsModQA, a method for Few-Shot Multilingual Open-Domain QA using minimal supervised data (i.e., up to five examples per language).1 Our approach consists of two core components: a self-supervised pre-training objective on multilingual corpora; and a synthetic data generation pipeline that prompts a large language model (LLM) using few-shot supervised examples. Concretely, we generate question-answer pairs from WikiData triples by leveraging LLMs’ In-Context Learning (ICL) ability. To construct ICL prompts, we use ChatGPT to generate a curated set of input-output pairs, which serve as examples for prompting LLMs to generate millions of questions from WikiData triples across various languages. After generating these question-answer pairs, we identify the supporting Wikipedia passages through answer string matching. We further gather cross-lingual answers and evidence passages through Wikipedia language links to facilitate cross-lingual retrieval. Employing this generated data, we train a multilingual model with a joint objective for retrieval and QA, producing a promising pre-trained model (Figure 1(c)) for subsequent few-shot learning.

In few-shot learning, we employ LLMs for data generation from few-shot examples. For each target language, we feed the few-shot examples to an LLM and prompt it to generate question-answer pairs from a given document. The few-shot examples are assumed to encapsulate the QA style and distribution of the target dataset, guiding the LLM to generate synthetic data with similar characteristics. With this abundant synthetic data, the pre-trained model can be further fine-tuned to achieve superior results (Figure 1(c)). As an unsupervised alternative, we explore a zero-shot cross-lingual prompting strategy that uses data from other languages as prompts for data generation, and we show this is almost on par with few-shot prompting (Figure 1(c)).

We evaluate FsModQA on various datasets, covering cross-lingual and monolingual retrieval as well as multilingual open-domain QA. We observe notable improvements over competitive few-shot baselines, with a +5.1% gain on retrieval and a +8.4% gain on multilingual open-domain QA. To further test FsModQA’s language adaptation ability, we conduct zero-shot adaptation experiments using our cross-lingual prompting strategy on 15 languages. This adaptation significantly improves performance in both monolingual retrieval and multilingual QA, achieving results that are superior or comparable to strong translation-based methods.2

Figure 2 presents the full pipeline for generating self-supervised pre-training and fine-tuning data.

Figure 2: 

Full pipeline for data construction and model training: (1) generate large-scale data from Wikidata for self-supervised pre-training; (2) use few-shot prompting to generate synthetic Q&A pairs from Wikipedia passages of target languages, on which the pre-trained model is further fine-tuned.


2.1 Self-Supervised Data Construction

Sampling Factual Triplets.

Our self-supervised training dataset is constructed based on Wikidata (Vrandečić and Krötzsch, 2014), a multilingual knowledge base consisting of fact triplets linked to millions of entities. We manually select 50 common properties (Appendix Table 10) based on English and consider all triples associated with these relations. We then gather fact triplets in the desired target languages through language links.

Generating Questions.

Given a triplet $T=(s,r,o)$, we aim to write a question $q$ about the head entity $s$’s property $r$, with the gold answer $a$ being the tail entity $o$. One can use relation-specific templates to efficiently transform each triple into a natural question (Sciavolino et al., 2021). However, this method lacks diversity, as triples sharing the same property yield questions with similar surface forms. Instead, we adopt a generative approach, using an LLM to automatically generate questions with more diverse styles.

Specifically, we first sample five triples for each property and prompt ChatGPT (gpt-3.5-turbo) to generate three questions for each triple. This process yields a curated set of high-quality questions: $\mathcal{K}=\{(s_i, r_i, o_i, q_i)\}_{i=0}^{k}$.

We additionally generate questions with Yes/No answers from the same set of sampled triples. Yes questions can be generated directly from true triples. For No questions, we need to create false fact triples from existing ones. Specifically, we randomly replace a triple’s head or tail entity with the most similar Wikidata entity and verify that the perturbed triple is not a valid fact according to Wikidata. We then generate questions using ChatGPT as before. Examples are included in Appendix Table 11.
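A minimal sketch of this perturbation step, where entity_neighbors (a ranked similar-entity lookup) and valid_triples (the set of known facts) are placeholders for however entity similarity and fact validity are actually computed:

```python
import random

def make_false_triple(triple, entity_neighbors, valid_triples):
    """Perturb a Wikidata triple (s, r, o) into a false fact by swapping the head
    or tail entity with a similar entity, then verify the result is not a known fact."""
    s, r, o = triple
    position = random.choice(["head", "tail"])
    candidates = entity_neighbors[s] if position == "head" else entity_neighbors[o]
    for candidate in candidates:  # take the most similar entity that breaks the fact
        perturbed = (candidate, r, o) if position == "head" else (s, r, candidate)
        if perturbed not in valid_triples:
            return perturbed
    return None  # no valid perturbation found for this triple
```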

Subsequently, these curated questions are used as ICL examples to prompt a smaller LLM to transform all sampled triples into natural questions. We use Gemma-7B (Gemma Team et al., 2024) as the LLM and include the prompts we used in Appendix Table 12.

Multilingual Positive Passage Identification.

As shown in Figure 3, for a question $q^{ja}$ and answer $a^{ja}$ derived from a triple $(s^{ja}, r, o^{ja})$, we gather all passages from the Wikipedia page $W^{ja}$ linked by $s^{ja}$ and add passages containing $a^{ja}$ as positives $D_q^{ja}$. If no such passage exists, we use partial matching and select the passage with the highest lexical overlap with $a^{ja}$ as the positive. We further include positive passages in other languages to facilitate cross-lingual retrieval. We first translate the triple into target languages $(s^{L}, r, a^{L})$ using language links and identify cross-lingual positives by searching for $a^{L}$ in the Wikipedia page $W^{L}$ linked by $s^{L}$, as above. This yields monolingual and cross-lingual positive passages $D_q = D_q^{ja} \cup D_q^{L}$. We generate 18.7M $(q, a, D_q)$ triples across 8 languages in total, denoted as MlWikiQA.3
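A minimal sketch of the positive-passage identification step, assuming passages are plain strings; the token-overlap fallback here is a simplification of the partial matching described above:

```python
def find_positive_passages(answer: str, page_passages: list[str]) -> list[str]:
    """Return positive passages for a generated QA pair: exact answer-string match
    first, otherwise the passage with the highest lexical overlap with the answer."""
    exact = [p for p in page_passages if answer in p]
    if exact:
        return exact

    answer_tokens = set(answer.split())

    def overlap(passage: str) -> float:
        # fraction of answer tokens that appear in the passage
        return len(answer_tokens & set(passage.split())) / max(len(answer_tokens), 1)

    best = max(page_passages, key=overlap, default=None)
    return [best] if best is not None and overlap(best) > 0 else []
```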

Figure 3: 

Pre-training data construction pipeline: (1) transform WikiData triples into QAs using LLMs for each target language L, and (2) identify in-language and cross-lingual positive passages from the head entity’s Wikipedia page and through language links. English translations are added for readability.


2.2 Few-shot Synthetic Data Generation

Few-shot Setting.

The main idea of FsModQA is to amplify a limited number of annotated examples into a substantially larger volume of synthetic data by prompting LLMs. In this work, we consider Xor-TyDi QA (Asai et al., 2021a) as our target dataset. For each language in Xor-TyDi QA, we randomly sample five triples $\mathcal{K}=\{(q_i^{L}, a_i^{L}, d_i^{L})\}_{i=1}^{5}$ from the training set as few-shot examples. Each triple contains the question, answer, and the ground-truth passage. We ensure that three examples have span answers, while the remaining two have Yes and No answers, to align with the Xor-TyDi QA answer distribution.

Prompt-based Question & Answer Generation.

We populate a hand-engineered template with our few-shot language-specific examples $\mathcal{K}$ and use them as the ICL examples to prompt the LLM. Given a randomly sampled passage $d^{L}$ from language $L$, we append $d^{L}$ to the template, and the LLM is expected to generate a relevant question $q^{L}$ and answer $a^{L}$ in language $L$. We further constrain the answer $a^{L}$ to be a span within $d^{L}$, a property of the original Xor-TyDi QA dataset.

Many questions classified as unanswerable in Clark et al. (2020) can be answered by referring to English Wikipedia (Asai et al., 2021a). These questions are included as cross-lingual questions in Xor-TyDi QA. To simulate this scenario, we generate synthetic cross-lingual data from English passages. We first use Google Translate to translate the few-shot examples into English: $\mathcal{K}=\{(q_i^{En}, q_i^{L}, a_i^{En}, a_i^{L}, d_i^{En})\}_{i=1}^{5}$. Subsequently, we use these translated few-shot examples $\mathcal{K}$ to fill another template and instruct the LLM to generate a QA pair from a randomly sampled English passage $d^{En}$, first in English $(q^{En}, a^{En})$ and then in the target language $(q^{L}, a^{L})$. Similarly, we restrict $a^{En}$ to be a span within $d^{En}$. We include the prompts we used in Tables 13 and 14 in the Appendix.
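A sketch of the in-language generation step; llm_generate is a placeholder for the underlying decoding call (e.g., to Gemma), and the template wording is illustrative rather than the exact prompt from the Appendix:

```python
# llm_generate(prompt) is assumed to return only the model's continuation.
FEW_SHOT_TEMPLATE = (
    "Generate a question and a short answer span for the passage.\n\n"
    "{examples}\n"
    "Passage: {passage}\nQuestion:"
)

def format_examples(few_shot):
    # few_shot: list of (question, answer, passage) triples in the target language
    return "\n".join(f"Passage: {d}\nQuestion: {q}\nAnswer: {a}\n" for q, a, d in few_shot)

def generate_qa(passage, few_shot, llm_generate):
    prompt = FEW_SHOT_TEMPLATE.format(examples=format_examples(few_shot), passage=passage)
    output = llm_generate(prompt)  # expected form: "<question>\nAnswer: <answer>..."
    if "Answer:" not in output:
        return None
    question, answer = output.split("Answer:", 1)
    question, answer = question.strip(), answer.strip().split("\n")[0]
    # keep only examples whose answer is a span of the source passage
    return (question, answer) if answer and answer in passage else None
```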

Data Filtering.

We employ a method based on Natural Language Inference (NLI) to enhance the quality of our synthetic data. NLI techniques classify whether a hypothesis text is entailed by, neutral with respect to, or contradictory to a given premise text (Bowman et al., 2015). They have been widely used for identifying factual errors in text summarization (Laban et al., 2022) and hallucinations in machine-generated texts (Honovich et al., 2022). In this study, we employ NLI methods for data filtering (Yoran et al., 2024). Given a synthetic example $(q, a, d)$, we consider the source passage $d$ as the premise and the concatenation of the generated question $q$ and answer $a$ as the hypothesis. We retain an example only when the premise entails the hypothesis.

In more detail, we apply a novel local-to-global filtering mechanism. In local filtering, we evaluate whether the originating passage $d$ entails the synthetic $(q, a)$ pair. We take the output probability of the entailment label as the score and keep examples whose entailment score exceeds a threshold $T_l$. In global filtering, we use a pre-trained model (i.e., the self-supervised model in Figure 1(b)) to perform retrieval for the question $q$ and obtain a set of passages $\hat{D}_q$. We compute an entailment score vector $\mathbf{x} \in \mathbb{R}^{|\hat{D}_q|}$, with each entry being the entailment score between $(q, a)$ and a retrieved passage $d \in \hat{D}_q$. We then apply a maximum pooling operation $\max(\mathbf{x})$ to derive the final score. The intuition is that a valid $(q, a)$ pair should be supported by at least one of the retrieved passages, which aligns with the open-domain setting. Similarly, we retain only those examples whose scores surpass a predefined threshold $T_g$. In this way, we obtain 1.7M synthetic examples in total across 7 languages, denoted as FsMlQA.
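A minimal sketch of the local-to-global filter, assuming hypothetical nli_entailment_score(premise, hypothesis) and retrieve(question, k) helpers standing in for the NLI model and the pre-trained retriever; the thresholds follow the values reported in §3.3:

```python
T_LOCAL, T_GLOBAL = 0.5, 0.8  # thresholds T_l and T_g from Section 3.3

def keep_example(question, answer, source_passage, nli_entailment_score, retrieve, k=10):
    hypothesis = f"{question} {answer}"
    # local filtering: the originating passage must entail the QA pair
    if nli_entailment_score(source_passage, hypothesis) < T_LOCAL:
        return False
    # global filtering: at least one retrieved passage must entail the QA pair
    scores = [nli_entailment_score(p, hypothesis) for p in retrieve(question, k)]
    return max(scores, default=0.0) >= T_GLOBAL
```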

2.2.1 Zero-shot Cross-lingual Prompting

Our few-shot setting relies on a few annotated examples to generate synthetic QA pairs in target languages. However, this approach encounters significant challenges when the target language is extremely low-resourced, making it nearly impossible to obtain even a few examples. For this setting, we explore zero-shot prompting, which uses cross-lingual examples to prompt LLMs to generate synthetic QA pairs in target languages.

We consider two zero-shot prompting settings. In the English-Prompting setting, we use English QA data to fill a template and use it as the prompt, asking the LLM to generate QA pairs from passages randomly sampled from the target language. In the Multilingual-Prompting setting, we assume access to a handful of examples in a held-out language set. We randomly sample five multilingual examples from this held-out set to populate another template, and prompt the LLM to generate QA pairs in target languages. We include the prompts used in Tables 15 and 16 in the Appendix.

2.2.2 Data Sampling

Our synthetic dataset, FsMlQA, exhibits a strongly skewed distribution towards shorter answer lengths (often single tokens), whereas the human-annotated answers in Xor-TyDi QA tend to be substantially longer. To address this mismatch, we resample the training data from FsMlQA according to answer length, using a geometric distribution, $l \sim \mathrm{Geo}(p)$, to achieve a better balance between short and long answers.4
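A minimal sketch of this length-balanced resampling, assuming each example is a dict with an "answer" field; the truncation at 30 tokens follows footnote 4, and the handling of exhausted length buckets is a simplification:

```python
import random
from collections import defaultdict

def geometric_resample(examples, p=0.4, max_len=30, n_samples=100_000, seed=0):
    """Resample examples so answer lengths roughly follow Geo(p), truncated at max_len
    (without replacement, as in the main setting)."""
    rng = random.Random(seed)
    by_length = defaultdict(list)
    for ex in examples:
        length = min(len(ex["answer"].split()), max_len)
        by_length[length].append(ex)
    # truncated geometric target distribution over lengths 1..max_len
    weights = [(1 - p) ** (l - 1) * p for l in range(1, max_len + 1)]
    total = sum(weights)
    sampled = []
    for l, w in zip(range(1, max_len + 1), weights):
        bucket = by_length.get(l, [])
        rng.shuffle(bucket)
        quota = int(n_samples * w / total)
        sampled.extend(bucket[:quota])  # long-answer buckets may run out early
    return sampled
```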

2.3 FsModQA Model

Model Structure.

As shown in Figure 4, we employ a single encoder-decoder model to perform both passage retrieval and QA tasks. The first half of the encoder functions as a dual-encoder with shared parameters, which separately encodes the question $q$ and the passage corpus $D$. Additionally, we append an instruction to the question to indicate the language of the target answer: “Answer in {lang}”. A LayerNorm operation, followed by average pooling, is applied to compress the inputs into single vectors: $E_q$ and $\{E_{d_i} \mid d_i \in D\}$, which are used for matching via dot products. The top-$k$ most relevant passages to the question are selected: $D_q = \operatorname{arg\,topk}_{d_i \in D}\,(E_q \cdot E_{d_i})$. The embeddings of the question and each top-$k$ passage in $D_q$ are concatenated and fed into the remaining cross-encoder layers. Finally, the cross-encoder embeddings are flattened and incorporated into the decoder through cross-attention to generate the answer $a$, following the Fusion-in-Decoder approach (Izacard and Grave, 2021).
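A small NumPy sketch of the dot-product matching and top-$k$ selection step, assuming the question and passage vectors have already been produced by the dual-encoder:

```python
import numpy as np

def top_k_passages(E_q: np.ndarray, E_d: np.ndarray, k: int = 5) -> np.ndarray:
    """Select indices of the k passages with the highest dot-product similarity
    to the question. E_q: (dim,) vector; E_d: (num_passages, dim) matrix."""
    scores = E_d @ E_q                        # dot-product relevance scores
    top_k = np.argpartition(-scores, k)[:k]   # unsorted indices of the k largest scores
    return top_k[np.argsort(-scores[top_k])]  # sorted by descending score
```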

Figure 4: 

The unified model for passage retrieval and question answering.


Model Training.

FsModQA is first pre-trained on MlWikiQA and later fine-tuned on FsMlQA. In self-supervised pre-training, we use a simple contrastive loss and an answer generation loss to train FsModQA. The dual-encoder is updated by contrasting each question’s paired positive passage against the positives of the other questions in the same training batch (i.e., in-batch negatives). Formally, for the $i$-th training example, the loss function $\mathcal{L}_{ssl}^{i}$ combines these two terms.
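A minimal sketch of one form consistent with this description, assuming dot-product similarity, in-batch softmax normalization over a batch of size $B$, and a generation term conditioned on the paired positive passage $d^{i}$ (the exact temperature and term weighting are assumptions):

$$\mathcal{L}_{ssl}^{i} = -\log \frac{\exp\big(E_{q^i} \cdot E_{d^i}\big)}{\sum_{j=1}^{B} \exp\big(E_{q^i} \cdot E_{d^j}\big)} \;-\; \sum_{t=1}^{T} \log P\big(a_t^{i} \mid a_{<t}^{i}, q^{i}, d^{i}\big)$$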
The pre-trained FsModQA is subsequently fine-tuned on our synthetic data through an end-to-end training mechanism. The dual-encoder is trained using signals derived from the answer generation task, with the cross-attention scores from the decoder serving as the target for assessing question-passage relevance. For the $i$-th training example, the retrieval loss $\mathcal{L}_{ret}^{i}$ matches the retriever’s distribution over its retrieved passages to this cross-attention target.
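One plausible formulation consistent with the terms defined below, written as a sketch; the use of a KL divergence and this particular aggregation of the cross-attention scores are assumptions rather than a verbatim reproduction:

$$\mathcal{L}_{ret}^{i} = \mathrm{KL}\Big(\mathrm{SG}\big(P_{ca}(\cdot \mid q^i)\big) \,\Big\|\, P_{ret}(\cdot \mid q^i)\Big), \quad P_{ret}(d_j \mid q^i) = \frac{\exp\big(E_{q^i} \cdot E_{d_j}\big)}{\sum_{d_k \in D_{q^i}} \exp\big(E_{q^i} \cdot E_{d_k}\big)}, \quad P_{ca}(d_j \mid q^i) \propto \frac{1}{H\,|d_j|} \sum_{h=1}^{H} \sum_{t \in d_j} \mathrm{CA}_{0,h,t}$$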
where $D_{q^i}$ is the set of passages returned by the dual-encoder itself and $P_{ca}$ is the target distribution gathered from the decoder’s cross-attention scores. $\mathrm{SG}$ signifies stop-gradient, which prevents the decoder from being affected by the retriever loss, and $\mathrm{CA}$ denotes the cross-attention score at the last decoder layer. The subscript $0$ refers to the first output token, $H$ is the number of cross-attention heads, and $|d_j|$ stands for the length of passage $d_j$.

The entire model is optimized to generate the target answer $a^i$ given $q^i$ and relevant passages $D_{q^i}$. The final loss is $\mathcal{L}_{e2e}^{i} = \mathcal{L}_{ret}^{i} + \mathcal{L}_{ans}^{i}$, where $\mathcal{L}_{ans}^{i} = -\sum_{t=1}^{T} \log P\big(a_t^{i} \mid a_{<t}^{i}, q^{i}, D_{q^i}\big)$.

3.1 Datasets and Metrics

We evaluate on the Xor-TyDi QA dataset (Asai et al., 2021a), with XOR-Retrieve for cross-lingual retrieval and XOR-Full for multilingual open-retrieval QA. We conduct zero-shot evaluations on two benchmarks, MIRACL (Zhang et al., 2023) for monolingual retrieval and MKQA (Longpre et al., 2021) for multilingual open-domain QA. For XOR-Retrieve, we use the February 2019 English Wikipedia dump as the retrieval corpus and the same dumps from 13 languages for XOR-Full and MKQA (Asai et al., 2021a). For MIRACL, we use the monolingual Wikipedia preprocessed by Zhang et al. (2023). Following prior work, we evaluate models at Recall@5kt (top 5000 tokens) on XOR-Retrieve; F1, exact match (EM) and BLEU on XOR-Full; nDCG@10 on MIRACL; and F1 on MKQA.

3.2 Baselines

We evaluate three groups of representative baselines, based on the type of supervised data used: (i) Zero-shot baselines (“-En”) fine-tuned on supervised English-only data (i.e., Natural Questions (Kwiatkowski et al., 2019)). (ii) Supervised baselines fine-tuned on human-annotated multilingual data (i.e., Xor-TyDi QA). (iii) Few-shot models that improve zero-shot baselines with only a few supervised multilingual instances.

Retriever Baselines.

For XOR-Retrieve, we include: (1) Zero-shot retrievers: translate-test methods DPR+MT (Asai et al., 2021a) and ReAtt+MT (Jiang et al., 2022); models pre-trained on multilingual Wikipedia: CLASS-En (Jiang et al., 2024) and LAPCA (Abulkhanov et al., 2023). (2) Supervised retrievers: multilingual dense retrievers mDPR (Asai et al., 2021a), CORA (Asai et al., 2021b), Sentri (Sorokin et al., 2022), and QuiCK (Ren et al., 2022); token-level dense retrievers: DrDecr (Li et al., 2022), which pre-trains ColBERT on WikiMatrix (Schwenk et al., 2021). (3) Few-shot retrievers: SWIM-X (Thakur et al., 2024), which generates massive synthetic data from LLMs through a summarization-then-ask technique; and CLASS (5-shot), which fine-tunes CLASS-En on our 5-shot examples. For MIRACL (Zhang et al., 2023), we include two supervised retrievers: fine-tuned mContriever (Izacard et al., 2022) and Hybrid, which combines the results of BM25, mDPR, and mColBERT (Khattab and Zaharia, 2020).

Reader Baselines.

(1) Zero-shot baselines: translate-test methods MT+DPR, ReAtt+MT, and GMT+GS generate answers from English retrieved passages with question and answer translations. (2) Supervised baselines: BM25 performs in-language retrieval with an extractive multilingual QA model; MT+Mono first applies BM25 and then MT+DPR if no answer was generated. Fusion-in-Decoder methods (i.e., CORA, CLASS, Sentri, LAPCA) use retrieval-augmented generation, generating target-language answers from multilingual retrieved passages. (3) Few-shot readers: Gemma (5-shot) (Gemma Team et al., 2024) and LLaMA3 (5-shot) (Touvron et al., 2023) prompt LLMs with few-shot examples and retrieved passages using the template in Appendix Table 17; CLASS (5-shot) fine-tunes CLASS-En on few-shot examples. We use the same 5-shot examples for all methods.

3.3 Implementation Details

With the proposed self-supervised data construction method, we generate 18,735,159 triplets for pre-training across 8 languages, with statistics in Appendix Table 19. We initialize our model from the mT5-large checkpoint (Xue et al., 2021) and pre-train it using the loss function $\mathcal{L}_{ssl}$ for 100K steps with a batch size of 800 on 16 A100 GPUs for 64 hours. We set the learning rate to 5 × 10−5, with warm-up over the first 10% of steps and linear decay to 0.

With our few-shot data generation method, we obtain 1,746,156 question-answer pairs across the 7 languages included in Xor-TyDi QA after data filtering with $T_l=0.5$ and $T_g=0.8$, with detailed statistics shown in Table 19 in the Appendix. For fine-tuning, we first train the pre-trained model using NQ data for 8K steps and then on FsMlQA for 6K–14K steps depending upon the size of the sampled training dataset, with the loss function $\mathcal{L}_{e2e}$. We set the batch size to 128 and the learning rate to 5 × 10−5. We apply an asynchronous passage update mechanism, where we periodically refresh the retrieved passages for each training query using the most recent checkpoint every 1K steps.

3.4 Retrieval Results

XOR-Retrieve.

Table 1 shows that FsModQA, fine-tuned on 100K synthetic examples, surpasses the few-shot SWIM-X (7M) by 5.5% at Recall@5kt, despite the latter using substantially more synthetic data generated by a significantly larger proprietary LLM (PaLM 2). This highlights the efficiency of our method in both training and data generation. Further scaling up the training data to the full size does not improve retrieval accuracy. In addition, we find that fine-tuning CLASS, a sophisticated pre-training method, on the same set of 5-shot examples lags FsModQA by 3.1 points. This shows that amplifying data through LLM prompting is superior to direct fine-tuning.

Table 1: 

Results on XOR-Retrieve dev sets. Best performance is in bold. † and * denote results reported by Asai et al. (2021a) and Jiang et al. (2024), respectively. Others are copied from the original papers.

Method | Backbone | # Total Params | Pre-training Data | Fine-tuning Data | R@5kt: Ar / Bn / Fi / Ja / Ko / Ru / Te / Avg.
Zero-shot Retrievers 
DPR+MT mBERT 220M – NQ 52.4 62.8 61.8 48.1 58.6 37.8 32.4 50.6 
ReAtt+MT* T5-L 583M – NQ 67.3 71.0 29.3 61.8 67.0 61.2 66.4 60.6 
CLASS-En* mT5-L 410M Wikipedia NQ 66.7 78.6 66.6 60.2 63.2 58.2 78.2 67.4 
Supervised Retrievers 
CORA mBERT 557M – NQ + XOR 42.7 52.0 49.0 32.8 43.5 39.2 41.6 43.0 
mDPR mBERT 557M – NQ + XOR 48.9 60.2 59.2 34.9 49.8 43.0 55.5 50.2 
Sentri XLM-R 560M – NQ + TQA + XOR 56.8 62.2 65.5 53.2 55.5 52.3 80.3 60.8 
QuiCK mBERT 557M – NQ + XOR 63.8 78.0 65.3 63.5 69.8 67.1 74.8 68.9 
DrDecr XLM-R 278M WikiMatrix NQ + XOR 70.2 85.9 69.4 65.1 68.8 68.8 83.2 73.1 
LAPCA XLM-R 560M Wikipedia NQ + XPAQ + XOR 70.2 83.8 79.6 69.7 73.6 75.5 83.1 76.5 
CLASS mT5-L 410M Wikipedia NQ 70.6 84.9 71.0 66.0 72.6 70.0 81.9 73.9 
Few-shot Retrievers 
SWIM-X (7M) mT5-B 580M mC4 SWIM-IR 57.9 75.0 65.6 59.3 58.9 64.6 74.4 65.1 
CLASS (5-shot) mT5-L 410M Wikipedia NQ + XOR (5-shot) 67.0 78.6 65.6 59.0 63.6 59.0 79.5 67.5 
 
FsModQA (100K) mT5-L 410M MlWikiQA NQ + FsMlQA 66.3 79.3 67.8 66.4 65.6 73.8 75.2 70.6 
FsModQA (1.7M) mT5-L 410M MlWikiQA NQ + FsMlQA 63.4 80.6 67.5 66.0 66.7 74.3 75.6 70.6 

MIRACL.

Table 2 shows that FsModQA surpasses the few-shot retriever SWIM-X by 5.1%, even though SWIM-X generates synthetic data for each MIRACL language through 3-shot prompting, whereas FsModQA is trained exclusively on synthetic data generated from 5-shot examples of Xor-TyDi QA and is thus evaluated in a zero-shot manner. We further divide languages into seen and unseen groups based on FsModQA’s training data. It outperforms SWIM-X on all seen languages and on 7 out of 10 unseen languages, the exceptions being zh, fr, and de. We suspect SWIM-X benefits significantly from large-scale synthetic data generation on these high-resource languages.

Table 2: 

Monolingual retrieval results on MIRACL dev sets. Best performance is in bold. Hybrid scores are taken from Zhang et al. (2023). mContriever and SWIM-X are copied from Thakur et al. (2024).

Method | Seen languages: ar / bn / en / fi / ja / ko / ru / te | Unseen languages: es / fa / fr / hi / id / sw / th / zh / de / yo | Avg.
Supervised Retrievers 
Hybrid 67.3 65.4 54.9 67.2 57.6 60.9 53.2 60.2 64.1 59.4 52.3 61.6 44.3 44.6 59.9 52.6 56.5 37.4 56.6 
mContriever 66.4 68.4 44.2 65.2 56.8 58.8 51.2 79.0 42.8 48.9 46.2 45.0 45.8 67.7 70.7 49.4 42.3 48.4 55.4 
Few-shot Retrievers 
SWIM-X (180K) 60.2 57.1 34.7 40.6 40.8 43.3 49.7 55.9 33.4 36.3 64.3 33.0 39.5 40.0 56.3 63.3 50.2 36.5 46.4 
 
FsModQA (100K) 64.4 63.6 45.4 64.7 55.1 49.6 50.0 76.2 40.5 43.7 36.5 43.2 42.6 50.2 60.4 43.2 36.7 60.2 51.5 

3.5 Multilingual QA Results

XOR-Full.

In Table 3, we show that FsModQA achieves the best results in the few-shot setting, outperforming CLASS-En directly fine-tuned on 5-shot examples by 8.4% and directly few-shot prompting LLMs for QA by 18%. Compared to supervised readers, FsModQA surpasses CORA and other pipeline methods while achieving results comparable to the rest. It is also noteworthy that, in the two low-resource languages, FsModQA outperforms comparable supervised baselines in Bengali and comes close in Telugu, indicating the effectiveness of our method in handling low-resource languages.

Table 3: 

Multilingual QA results on the XOR-Full dev set. Best performance is in bold. † and * denote results taken from Asai et al. (2021b) and Jiang et al. (2024), respectively. Others are copied from the original papers.

Method | Backbone | # Total Params | Pre-training Data | Fine-tuning Data | F1: Ar / Bn / Fi / Ja / Ko / Ru / Te | Macro Average: F1 / EM / BLEU
Zero-shot Readers 
MT+DPR mBERT – – NQ 7.2 4.3 17.0 7.9 7.1 13.6 0.5 8.2 3.8 6.8 
ReAtt+MT* T5-L 1.19B – NQ 15.0 10.5 1.8 13.1 14.9 15.4 8.2 11.3 5.5 9.5 
GMT+GS – – – NQ 18.0 29.1 13.8 5.7 15.2 14.9 15.6 16.0 9.9 14.9 
Supervised Readers 
BM25 – – – XOR 31.1 21.9 21.4 12.4 12.1 17.7 – – – – 
MT+Mono mBERT – – NQ + XOR 15.8 9.6 20.5 12.2 11.4 16.0 0.5 17.3 7.5 10.7 
CORA mBERT+mT5-B 1.14B – NQ + XOR 42.9 26.9 41.4 36.8 30.4 33.8 30.9 34.7 25.8 23.3 
CLASS mT5-L 1.23B Wikipedia NQ + XOR 49.1 32.0 46.7 44.1 38.4 39.9 41.1 41.6 32.5 28.2 
Sentri XLM-R+mT5-B 1.14B – NQ + TQA + XOR 52.5 31.2 45.5 44.9 43.1 41.2 30.7 41.3 34.9 30.7 
LAPCA XLM-R+mT5-B 1.14B Wikipedia NQ + XPAQ + XOR 53.4 50.2 49.3 44.7 49.5 49.3 38.9 47.8 38.7 35.5 
Few-shot Readers 
Gemma (5-shot) Gemma 7B – – 13.4 19.0 21.7 20.2 20.5 23.0 23.4 20.2 12.2 15.3 
LLaMA3 (5-shot) LLaMA3 8B – – 22.7 13.2 22.9 17.8 19.0 19.2 28.9 20.5 12.8 15.6 
CLASS (5-shot) mT5-L 1.23B Wikipedia NQ + XOR (5-shot) 32.3 28.1 29.9 25.7 29.5 27.7 24.7 29.8 20.5 21.2 
 
FsModQA mT5-L 1.23B MlWikiQA NQ + FsMlQA 41.3 35.4 39.6 41.5 35.0 38.2 36.3 38.2 27.9 24.4 

MKQA.

In Table 4, FsModQA achieves the best zero-shot results on MKQA in almost all languages, with an improvement of +2.8% compared to the supervised CORA and CLASS. This suggests that training on our synthetic data generalizes well to new languages, indicating that generating synthetic data for each target language may not be necessary for language adaptation.

Table 4: 

Zero-shot multilingual QA results (F1) on MKQA. Best performance is in bold. “cn”: “Zh-cn” (Chinese, simplified). “hk”: “Zh-hk” (Chinese, Hong Kong). “tw”: “Zh-tw” (Chinese, traditional).

Method | F1: Da / De / Es / Fr / He / Hu / It / Km / Ms / Nl / No / Pl / Pt / Sv / Th / Tr / Vi / cn / hk / tw / Avg
Supervised Readers 
CORA 30.4 30.2 32.0 30.8 15.8 18.4 29.0 5.8 27.8 32.1 29.2 25.6 28.4 30.9 8.5 22.2 20.9 5.2 6.7 5.4 21.8 
CLASS 28.3 32.3 33.3 31.2 10.3 23.1 30.6 7.1 24.7 30.2 28.4 25.6 29.3 28.9 14.1 24.8 19.0 8.0 7.8 6.7 22.2 
Few-shot Readers 
 
FsModQA 34.8 33.3 38.5 34.8 19.5 28.4 31.9 7.5 36.7 34.1 35.5 18.4 33.4 37.2 15.1 24.8 9.9 9.1 8.6 7.9 25.0 

3.6 Ablation

We perform ablation studies to justify each of our designs, with results shown in Tables 5 and 6.

Table 5: 

The effects of generating cross-lingual queries from English passages, at 100K data scale.

Method | XOR-Full In-LG (Avg. F1) | XOR-Full Cross-LG (Avg. F1) | XOR-Full All (Avg. F1) | XOR-Retrieve Retrieval (RM@100) | XOR-Retrieve CL-Retrieval (R@5kt)
FsModQA 46.8 31.2 36.9 75.0 70.6 
- CL Queries 49.3 30.0 36.8 72.0 68.4 
Table 6: 

Ablations by removing one component of our method, at 100K data scale.

Method | Ar / Bn / Fi / Ja / Ko / Ru / Te / Avg.
FsModQA 40.6 34.3 38.4 40.7 32.9 37.7 33.9 36.9 
 - Data Filtering 39.0 31.7 37.4 39.2 32.3 35.5 35.3 35.8 
 - Geo Sampling 37.9 35.9 36.7 38.5 34.1 35.0 33.5 36.0 
 - MlWikiQA 11.2 7.2 10.2 17.5 7.9 8.5 4.4 9.6 

Cross-lingual Data Improves Cross-lingual Ability.

Excluding cross-lingual synthetic training data enhances performance on questions that require only the retrieval of in-language passages. However, the result on questions relying on cross-lingual passage retrieval declines, reducing the overall score. This is further evidenced by the RM@100 retrieval results, where the accuracy of finding evidence in any language (i.e., English or in-language) drops, as do the cross-lingual passage retrieval results.

Data Filtering Improves Data Quality.

Using the raw synthetic data from LLMs without any quality control hurts performance in every examined language except Telugu. We suspect that the NLI model is deficient in this language.

Geometric Sampling Improves Long-answer Generation.

Sampling data according to a geometric distribution over answer length leads to a 0.9% gain on average. In languages that contain a significant number of long answers (i.e., ar, fi, ja, ru), geometric sampling shows gains of up to 2.7%. Conversely, in bn and ko, where short answers dominate, random sampling is usually better.

Pre-training is Crucial.

We observe extremely poor results in all languages without pre-training on our MlWikiQA, primarily due to the model’s low retrieval accuracy in identifying relevant passages. We believe pre-training enables the model to achieve good initial retrieval accuracy, which is essential in the subsequent fine-tuning process.

3.7 Training Data Scaling

Performance Improves with More Synthetic Data.

To investigate the effect of data scale, we train FsModQA on subsets ranging from 0.05M QA pairs to the entire 1.7M. Results for each language and the average performance are shown in Figure 5. As the data size increases, the average performance of FsModQA improves up to the 0.6M scale and gradually declines afterward. We observe that as the data size increases, the proportion of examples with short answers increases (78.4% → 95.3%), and the result on long-answer examples drops from 18.0% to 15.1%, indicating overfitting to short answers.

Figure 5: 

Performance when trained with different sizes of our synthetic data.


Our geometric sampling method (§2.2.2) attempts to balance the answers by length; however, its use of sampling without replacement means the few long-answer instances are quickly exhausted, so larger sampled datasets become skewed toward shorter answers. To mitigate this issue, we employ sampling with replacement. This method upsamples longer-answer examples such that the length distribution follows the precomputed geometric distribution.5 As a result, it effectively increases the number of training epochs for data points with longer answers. As shown in Figure 6, sampling with replacement significantly improves performance on longer answers (≥4 tokens) while maintaining comparable performance on shorter answers relative to the default without-replacement method.

Figure 6: 

Performance comparison when sampling data with or without replacement by using our geometric sampling strategy.


Few-shot Prompting Is Superior to Direct Fine-tuning and Benefits from More Supervised Data.

In Figure 7, we show that directly fine-tuning on the 5-shot examples is beneficial (28.6% → 33.5%) but remains inferior to our few-shot method. When increasing the size of the supervised data, both methods achieve consistent improvements, although the performance gap narrows. With full-sized training data, FsModQA surpasses CLASS (43.0% vs. 41.6%), achieving new state-of-the-art results. See Appendix Table 21 for results in each language.

Figure 7: 

Results when trained with varying sizes of supervised data. The average together with the best and worst languages are reported.


3.8 Zero-shot Prompting Strategies

We compare our few-shot prompting strategy with the two zero-shot cross-lingual prompting methods from §2.2.1. For English-Prompting, we consider NQ training data and TyDi QA English training data as two alternative prompting sources. For Multilingual-Prompting, we use the 5-shot examples from all languages in Xor-TyDi QA (i.e., those used in our few-shot setting) for prompting. When generating synthetic data for each target language, we exclude its 5-shot examples from the prompting source. We compare the success rate of generating valid examples under different prompting strategies in Appendix Table 20; few-shot prompting achieves the highest rate and English-Prompting with NQ the lowest.

Zero-shot prompting is comparable to few-shot prompting.

Table 7 shows that all three zero-shot prompting variants achieve consistent improvements over FsModQA-En, with gains of up to 8.1%, highlighting the versatility of our method for zero-shot language adaptation. Prompting with an English dataset created under the same annotation guidelines achieves better results (TyDi-En vs. NQ-En), and using multilingual examples for prompting (i.e., Xor-TyDi-*) is comparable to FsModQA. Specifically, the diversity and QA styles in the prompts matter more for fi and te, while for the other languages, employing in-language prompts usually leads to the best performance.

Table 7: 

XOR-Full performance comparison when using zero-shot prompting strategies for synthetic data generation, at 100K scale. FsModQA-En indicates the model pre-trained on MlWikiQA and fine-tuned on the English NQ dataset.

Method | Ar / Bn / Fi / Ja / Ko / Ru / Te / Avg.
FsModQA-En 30.7 30.2 31.0 24.3 26.2 29.6 28.5 28.6 
 
FsModQA 41.7 34.7 38.7 39.4 34.7 35.0 33.5 36.8 
NQ-En 38.8 33.9 40.1 33.0 33.0 34.9 34.3 35.4 
TyDi-En 39.1 34.4 41.2 35.6 31.9 34.6 36.0 36.1 
Xor-TyDi-* 42.5 33.9 40.3 37.9 33.7 34.6 34.3 36.7 

English-prompting is the best way of using English data and is complementary to existing methods.

We compare three different ways of using TyDi QA English data for zero-shot learning: direct English fine-tuning, fine-tuning on machine-translated data from English, and English-Prompting. Table 8 shows the benefits of all three methods, with our English-Prompting approach yielding the best results in all languages. Additionally, combining data from all three methods results in improvements over any of them used independently, and matches the performance of our few-shot setting.

Table 8: 

Result comparison on XOR-Full for different means of using TyDi-En data.

Method | Ar / Bn / Fi / Ja / Ko / Ru / Te / Avg.
FsModQA-En 30.7 30.2 31.0 24.3 26.2 29.6 28.5 28.6 
 
 + Fine-tuning 36.8 30.7 35.5 29.1 28.6 30.4 29.4 31.5 
 + Translate-train 31.5 31.2 29.9 26.5 28.8 27.7 31.5 29.6 
 + English-prompt 39.1 34.4 41.2 35.6 31.9 34.6 36.0 36.1 
 + All 41.9 36.2 43.0 37.3 33.7 37.0 37.4 38.1 

In §2.2.1, we propose a zero-shot prompting strategy that uses few-shot examples from other languages to generate synthetic data for a distinct target language. The effectiveness of this approach is demonstrated in §3.8. In this section, we evaluate the impact of this strategy in adapting FsModQA to a diverse range of previously unseen languages, using only English labeled data.

4.1 Experimental Setup

Languages

We select ten languages unseen by FsModQA from the MIRACL dataset for monolingual retrieval adaptation. We choose ten unseen languages from the MKQA dataset with high, medium, and low resources for multilingual open-domain QA adaptation.

Data Generation

We consider the English NQ training data as the source for prompts. For each target language, we randomly sample five-shot examples from the NQ dataset to prompt the generation of Q&A pairs from selected Wikipedia passages, following the procedure described in §2.2. This approach yields 128,000 training instances for each target language. Additionally, we compare this method to the translate-train baseline (MT), which uses Google Translate to translate the NQ training data into the target languages.

Model Training

For both methods, we fine-tune FsModQA for 3K steps following the same procedure used for FsMlQA (§3.3). The final checkpoint obtained at the last training step is used for evaluation. Note that a separate model is created per language in this experiment.

4.2 Results

Monolingual Retrieval Adaptation.

As shown in the upper part of Table 9, the zero-shot adaptation significantly improves FsModQA’s monolingual retrieval results by an average of 5.4% across ten unseen languages. These improvements are particularly pronounced in low-resource languages (i.e., th, yo, sw), whereas the MT baseline results in notable declines both in these languages (e.g., −36.3% in yo) and overall (−3.3%). Note that MIRACL was created by native speakers from texts in the target languages, which aligns with our data generation process. This explains the consistent gains achieved by our method and shows its superiority to translation-based approaches.

Table 9: 

Zero-shot adaptation to unseen languages in monolingual retrieval (nDCG@10) and multilingual open-domain QA (F1).

MIRACL | High: De / Es / Fr / Zh | Medium: Fa / Hi / Id | Low: Sw / Th / Yo | Avg.
FsModQA 36.7 40.5 36.5 43.2 43.7 42.6 43.2 50.2 60.4 60.2 45.7 
 + MT 41.3 41.8 37.1 41.7 40.7 42.4 44.1 50.7 60.5 23.9 42.4 
 + Adapt 38.8 41.6 38.6 47.0 47.7 45.9 44.2 62.3 66.6 78.3 51.1 
 
MKQA | High: De / Es / Fr / Zh | Medium: He / Pl / Tr / Vi | Low: Km / Th | Avg.
FsModQA 33.3 38.5 34.8 8.5 19.5 18.4 24.9 9.9 7.5 15.1 21.0 
 + MT 42.6 41.6 41.2 12.8 32.1 29.5 39.9 39.5 13.8 22.1 31.5 
 + Adapt 42.1 42.1 43.0 12.0 27.4 39.7 40.4 40.4 13.3 22.7 32.3 

Multilingual Open-domain QA Adaptation.

As shown in the bottom of Table 9, the adaptation effectively enhances multilingual open-domain QA performance across the ten unseen languages, achieving an average improvement of 11.3%. MT-based approaches yield results comparable to our adaptation, which is expected since MKQA was translated from NQ and the machine-translated data share the same topic distribution (i.e., American-centric). In contrast, our method generates data from Wikipedia texts written in the target languages to simulate how native speakers ask questions, which better reflects real-world scenarios.

5.1 Quality Validation

To assess the overall quality of our synthetic data, we randomly sample 1,000 examples from the silver pre-training data (MlWikiQA) and the few-shot synthetic data (FsMlQA). These samples are evaluated using GPT-4o mini, which scores each example on: 1) Fluency (0–2): whether the query is understandable, readable, and free of spelling or grammatical mistakes; 2) Relevance (0–2): the alignment between the generated query-answer pair and the passage used for data generation. The prompts employed for quality assessment are included in Appendix Table 18.
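A sketch of the Model-as-Judge call, using the OpenAI Python client; the rubric prompt below is illustrative and does not reproduce the exact wording of Appendix Table 18:

```python
from openai import OpenAI

client = OpenAI()

# Illustrative rubric; the actual prompt wording is given in Appendix Table 18.
JUDGE_PROMPT = """Rate a generated question-answer pair against its source passage.
Fluency (0-2): is the question understandable and free of spelling/grammar errors?
Relevance (0-2): does the question-answer pair align with the passage?
Return two integers as "fluency,relevance".

Passage: {passage}
Question: {question}
Answer: {answer}"""

def judge_example(passage: str, question: str, answer: str) -> tuple[int, int]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            passage=passage, question=question, answer=answer)}],
        temperature=0,
    )
    fluency, relevance = response.choices[0].message.content.strip().split(",")
    return int(fluency), int(relevance)
```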

Figure 8 illustrates that both types of generated queries exhibit fluency and strong grounding in the corresponding positive passages. The silver-standard MlWikiQA, derived using heuristics from WikiData (§2.1), consistently achieves higher scores across both metrics in all languages compared to the unfiltered synthetic FsMlQA (w/o Filter columns). However, the quality of FsMlQA improves significantly after applying our tailored filtering mechanism (w/ Filter columns), almost matching the fluency and relevance scores of MlWikiQA. This finding underscores the critical role of the filtering procedure in producing a synthetic dataset of quality comparable to the silver-standard dataset.

Figure 8: 

Quality validation results on the synthetic FsMlQA (with and without filter), comparing against the silver-standard pre-training data MlWikiQA. We employ Model-as-Judge to evaluate the quality of generated data on a three-level rating scale (0–2) based on two factors: fluency and relevance.


5.2 Query Distribution Comparison

To examine the distributional differences between our synthetic FsMlQA and the gold-standard data in Xor-TyDi QA, we randomly sample up to 20,000 examples from both datasets and visualize their distributions using t-SNE (van der Maaten and Hinton, 2008), which projects the queries onto a two-dimensional space. Figure 9 highlights several key findings: 1) The synthetic queries exhibit sufficient diversity, as they are scattered across the plot, indicating that our approach is capable of generating queries of various types using only five labeled examples. 2) The synthetic data shows significant overlap with the gold-standard data, demonstrating that it retains the core characteristics of the gold distribution. 3) The gold-standard data exhibits greater diversity than the synthetic data, suggesting that there is still room for improvement in enhancing diversity and variation during the data generation process, which we leave for future work. Similar findings are observed in the other languages (see Appendix Figure 10).
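A sketch of this visualization, assuming a multilingual sentence encoder is available as embed_queries (a placeholder, not specified here); the projection uses scikit-learn's t-SNE:

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_query_distributions(synthetic_queries, gold_queries, embed_queries):
    # embed_queries: maps a list of strings to an (n, dim) array (placeholder encoder)
    X = embed_queries(synthetic_queries + gold_queries)
    coords = TSNE(n_components=2, init="pca", random_state=0).fit_transform(X)
    n = len(synthetic_queries)
    plt.scatter(coords[:n, 0], coords[:n, 1], s=2, alpha=0.3, label="FsMlQA (synthetic)")
    plt.scatter(coords[n:, 0], coords[n:, 1], s=2, alpha=0.3, label="Xor-TyDi QA (gold)")
    plt.legend()
    plt.savefig("query_tsne.png", dpi=200)
```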

Figure 9: 

Distribution comparison between FsMlQA and Xor-TyDi QA in Japanese. We show that the synthetic data is diverse and significantly overlaps with the gold standard.


5.3 Safety

We employ Llama-Guard-2 as the content safety classifier to assess the presence of unsafe content within our synthetic dataset. Our analysis reveals that 98.9% of the 1,746,156 queries in FsMlQA are classified as safe.

Pre-training for Open-domain QA.

Open-domain QA requires retrieving relevant passages and extracting answers from them. This necessity has driven various methods that jointly train retrievers and readers. REALM (Guu et al., 2020), RAG (Lewis et al., 2020), EMDR2 (Sachan et al., 2021), YONO (Lee et al., 2022), ReAtt (Jiang et al., 2022), and Atlas (Izacard et al., 2024) first pre-train retrievers or initialize from pre-trained (Izacard et al., 2022) and fine-tuned retrievers. Subsequently, both components are fine-tuned jointly: the reader is trained using an answer generation loss, and the retriever is trained to promote passages that increase the likelihood of generating correct answers. Recently, this joint training mechanism has been adapted for multilingual open-domain QA (Jiang et al., 2024), where retrievers are initially trained by learning from English teachers using multilingual parallel data, followed by a joint training stage with query-answer pairs generated by LLMs. Our approach follows this joint training paradigm for model pre-training but differs significantly. We use WikiData as a source to generate more informative natural questions and answers. Additionally, our pre-training method is more efficient by eliminating knowledge distillation from English models.

LLMs for Few-shot Data Generation.

Prompting LLMs to generate synthetic data has been widely adopted to improve the performance of retrieval and QA tasks. UPR (Sachan et al., 2022) and InPars (Bonifacio et al., 2022) use zero-shot or few-shot prompting for passage reranking. Promptagator (Dai et al., 2023) and SWIM-X (Thakur et al., 2024) prompt LLMs with few-shot examples to generate massive synthetic queries, either in English or in multiple languages, for retriever fine-tuning. Gecko (Lee et al., 2024) prompts LLMs to generate synthetic instructions and queries from Web documents and create high-quality labels for retriever fine-tuning. Beyond retrieval, LLMs are employed to generate QA data, where QAmeleon (Agrawal et al., 2023) prompts a 540B LLM to generate multilingual QA pairs from only five examples. Nevertheless, these methods primarily focus on retrieval tasks and the more narrowly defined machine reading comprehension tasks. In our work, we rigorously investigate how LLMs can improve the more challenging multilingual open-domain QA tasks under few-shot settings. In addition, we explore zero-shot prompting, demonstrating that cross-lingual prompting using English data or limited multilingual data from held-out languages can yield results comparable to few-shot prompting, and we show this technique can also be leveraged for effective zero-shot language adaptation.

In this work, we propose FsModQA, a few-shot learning approach for multilingual open-domain retrieval tasks. We present a novel self-supervised pre-training framework that exploits WikiData to effectively initialize both multilingual retrieval and QA capabilities. This process is followed by few-shot synthetic multilingual QA generation from LLMs using only five human-annotated examples. We demonstrate that the resulting model achieves competitive multilingual retrieval and QA performance through fine-tuning on the high-quality synthetic data. We further show that this few-shot approach generalizes to zero-shot settings that only require English-supervised data. This mechanism serves as an effective approach for language adaptation, enabling the adapted model to achieve both boosted retrieval and end-to-end QA performance across fifteen previously unseen languages.

This work uses LLMs for synthetic data generation, which may propagate undesirable biases into the generated data. We believe such biases will not be amplified, as we sample prompts from Xor-TyDi QA, a dataset annotated with strict guidelines. Our preliminary safety analysis also reveals that less than 1% of the data contains potentially harmful queries, as identified by Llama-Guard-2.

We thank the action editor Shay Cohen and anonymous reviewers for their helpful feedback and suggestions. The first author is supported by the Graduate Research Scholarships funded by the University of Melbourne. This work was funded by the Australian Research Council, Discovery grant DP230102775.

1. We use the term few-shot throughout this paper to denote that our method relies on only a small number of human-annotated examples. Thus, we classify our method as a few-shot learning approach, consistent with Dai et al. (2023).

2. Code, data, and checkpoints are available here.

3. We classify MlWikiQA as a silver-standard dataset rather than a synthetic one, as it is derived from the structured information in WikiData and Wikipedia.

4. Empirically, we set p = 0.4 (μ = 2.5) for all languages except for Japanese, where we set p = 0.1 (μ = 10) to favor longer answers. When computing the distribution, we truncate the answer length to 30.

5. We do not cap the number of repeats.

Dmitry Abulkhanov, Nikita Sorokin, Sergey Nikolenko, and Valentin Malykh. 2023. LAPCA: Language-agnostic pretraining with cross-lingual alignment. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’23, pages 2098–2102, New York, NY, USA. Association for Computing Machinery.

Priyanka Agrawal, Chris Alberti, Fantine Huot, Joshua Maynez, Ji Ma, Sebastian Ruder, Kuzman Ganchev, Dipanjan Das, and Mirella Lapata. 2023. QAmeleon: Multilingual QA with only 5 examples. Transactions of the Association for Computational Linguistics, 11:1754–1771.

Akari Asai, Jungo Kasai, Jonathan Clark, Kenton Lee, Eunsol Choi, and Hannaneh Hajishirzi. 2021a. XOR QA: Cross-lingual open-retrieval question answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 547–564, Online. Association for Computational Linguistics.

Akari Asai, Xinyan Yu, Jungo Kasai, and Hannaneh Hajishirzi. 2021b. One question answering model for many languages with cross-lingual dense passage retrieval. In Advances in Neural Information Processing Systems.

Luiz Bonifacio, Hugo Abonizio, Marzieh Fadaee, and Rodrigo Nogueira. 2022. InPars: Unsupervised dataset generation for information retrieval. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’22, pages 2387–2392, New York, NY, USA. Association for Computing Machinery.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879, Vancouver, Canada. Association for Computational Linguistics.

Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. 2020. TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. Transactions of the Association for Computational Linguistics, 8:454–470.

Zhuyun Dai, Vincent Y. Zhao, Ji Ma, Yi Luan, Jianmo Ni, Jing Lu, Anton Bakalov, Kelvin Guu, Keith Hall, and Ming-Wei Chang. 2023. Promptagator: Few-shot dense retrieval from 8 examples. In The Eleventh International Conference on Learning Representations.

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. REALM: Retrieval-augmented language model pre-training. In Proceedings of the 37th International Conference on Machine Learning, ICML’20. JMLR.org.

Or Honovich, Roee Aharoni, Jonathan Herzig, Hagai Taitelbaum, Doron Kukliansy, Vered Cohen, Thomas Scialom, Idan Szpektor, Avinatan Hassidim, and Yossi Matias. 2022. TRUE: Re-evaluating factual consistency evaluation. In Proceedings of the Second DialDoc Workshop on Document-grounded Dialogue and Conversational Question Answering, pages 161–175, Dublin, Ireland. Association for Computational Linguistics.

Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. 2022. Unsupervised dense information retrieval with contrastive learning. Transactions on Machine Learning Research.

Gautier Izacard and Edouard Grave. 2021. Leveraging passage retrieval with generative models for open domain question answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 874–880, Online. Association for Computational Linguistics.

Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. 2024. Atlas: Few-shot learning with retrieval augmented language models. Journal of Machine Learning Research, 24(1).

Fan Jiang, Tom Drummond, and Trevor Cohn. 2024. Pre-training cross-lingual open domain question answering with large-scale synthetic supervision. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 13906–13933, Miami, Florida, USA. Association for Computational Linguistics.

Zhengbao Jiang, Luyu Gao, Zhiruo Wang, Jun Araki, Haibo Ding, Jamie Callan, and Graham Neubig. 2022. Retrieval as attention: End-to-end learning of retrieval and reading within a single transformer. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2336–2349, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online. Association for Computational Linguistics.

Omar Khattab and Matei Zaharia. 2020. ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’20, pages 39–48, New York, NY, USA. Association for Computing Machinery.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466.

Philippe Laban, Tobias Schnabel, Paul N. Bennett, and Marti A. Hearst. 2022. SummaC: Re-visiting NLI-based models for inconsistency detection in summarization. Transactions of the Association for Computational Linguistics, 10:163–177.

Haejun Lee, Akhil Kedia, Jongwon Lee, Ashwin Paranjape, Christopher Manning, and Kyoung-Gu Woo. 2022. You only need one model for open-domain question answering. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3047–3060, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Jinhyuk Lee, Zhuyun Dai, Xiaoqi Ren, Blair Chen, Daniel Cer, Jeremy R. Cole, Kai Hui, Michael Boratko, Rajvi Kapadia, Wen Ding, Yi Luan, Sai Meher Karthik Duddu, Gustavo Hernandez Abrego, Weiqiang Shi, Nithi Gupta, Aditya Kusupati, Prateek Jain, Siddhartha Reddy Jonnalagadda, Ming-Wei Chang, and Iftekhar Naim. 2024. Gecko: Versatile text embeddings distilled from large language models. arXiv preprint, 2403.20327v1.
Patrick
Lewis
,
Ethan
Perez
,
Aleksandra
Piktus
,
Fabio
Petroni
,
Vladimir
Karpukhin
,
Naman
Goyal
,
Heinrich
Küttler
,
Mike
Lewis
,
Wen-tau
Yih
,
Tim
Rocktäschel
,
Sebastian
Riedel
, and
Douwe
Kiela
.
2020
.
Retrieval-augmented generation for knowledge-intensive nlp tasks
. In
Proceedings of the 34th International Conference on Neural Information Processing Systems
,
NIPS ’20
,
Red Hook, NY, USA
.
Curran Associates Inc.
Yulong
Li
,
Martin
Franz
,
Md
Arafat Sultan
,
Bhavani
Iyer
,
Young-Suk
Lee
, and
Avirup
Sil
.
2022
.
Learning cross-lingual IR from an English retriever
. In
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
, pages
4428
4436
,
Seattle, United States
.
Association for Computational Linguistics
.
Shayne
Longpre
,
Yi
Lu
, and
Joachim
Daiber
,
Cambridge, MA
.
2021
.
MKQA: A linguistically diverse benchmark for multilingual open domain question answering
.
Transactions of the Association for Computational Linguistics
,
9
:
1389
1406
.
Houxing
Ren
,
Linjun
Shou
,
Ning
Wu
,
Ming
Gong
, and
Daxin
Jiang
.
2022
.
Empowering dual-encoder with query generator for cross- lingual dense retrieval
. In
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
, pages
3107
3121
,
Abu Dhabi, United Arab Emirates
.
Association for Computational Linguistics
.
Devendra
Sachan
,
Mike
Lewis
,
Mandar
Joshi
,
Armen
Aghajanyan
,
Wen-tau
Yih
,
Joelle
Pineau
, and
Luke
Zettlemoyer
.
2022
.
Improving passage retrieval with zero-shot question generation
. In
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
, pages
3781
3797
,
Abu Dhabi, United Arab Emirates
.
Association for Computational Linguistics
.
Devendra Singh
Sachan
,
Siva
Reddy
,
William L.
Hamilton
,
Chris
Dyer
, and
Dani
Yogatama
.
2021
.
End-to-end training of multi-document reader and retriever for open-domain question answering
. In
Advances in Neural Information Processing Systems
.
Holger
Schwenk
,
Vishrav
Chaudhary
,
Shuo
Sun
,
Hongyu
Gong
, and
Francisco
Guzmán
.
2021
.
WikiMatrix: Mining 135M parallel sentences in 1620 language pairs from Wikipedia
. In
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume
, pages
1351
1361
,
Online
.
Association for Computational Linguistics
.
Christopher
Sciavolino
,
Zexuan
Zhong
,
Jinhyuk
Lee
, and
Danqi
Chen
.
2021
.
Simple entity-centric questions challenge dense retrievers
. In
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
, pages
6138
6148
,
Online and Punta Cana, Dominican Republic
.
Association for Computational Linguistics
.
Nikita
Sorokin
,
Dmitry
Abulkhanov
,
Irina
Piontkovskaya
, and
Valentin
Malykh
.
2022
.
Ask me anything in your native language
. In
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
, pages
395
406
,
Seattle, United States
.
Association for Computational Linguistics
.
Gemma
Team
,
Thomas
Mesnard
,
Cassidy
Hardin
,
Robert
Dadashi
,
Surya
Bhupatiraju
,
Shreya
Pathak
,
Laurent
Sifre
,
Morgane
Rivière
,
Mihir Sanjay
Kale
,
Juliette
Love
,
Pouya
Tafti
,
Léonard
Hussenot
,
Pier Giuseppe
Sessa
,
Aakanksha
Chowdhery
,
Adam
Roberts
,
Aditya
Barua
,
Alex
Botev
,
Alex
Castro-Ros
,
Ambrose
Slone
,
Amélie
Héliou
,
Andrea
Tacchetti
,
Anna
Bulanova
,
Antonia
Paterson
,
Beth
Tsai
,
Bobak
Shahriari
,
Charline Le
Lan
,
Christopher A.
Choquette-Choo
,
Clément
Crepy
,
Daniel
Cer
,
Daphne
Ippolito
,
David
Reid
,
Elena
Buchatskaya
,
Eric
Ni
,
Eric
Noland
,
Geng
Yan
,
George
Tucker
,
George-Christian
Muraru
,
Grigory
Rozhdestvenskiy
,
Henryk
Michalewski
,
Ian
Tenney
,
Ivan
Grishchenko
,
Jacob
Austin
,
James
Keeling
,
Jane
Labanowski
,
Jean-Baptiste
Lespiau
,
Jeff
Stanway
,
Jenny
Brennan
,
Jeremy
Chen
,
Johan
Ferret
,
Justin
Chiu
,
Justin
Mao-Jones
,
Katherine
Lee
,
Kathy
Yu
,
Katie
Millican
,
Lars Lowe
Sjoesund
,
Lisa
Lee
,
Lucas
Dixon
,
Machel
Reid
,
Maciej
Mikuła
,
Mateo
Wirth
,
Michael
Sharman
,
Nikolai
Chinaev
,
Nithum
Thain
,
Olivier
Bachem
,
Oscar
Chang
,
Oscar
Wahltinez
,
Paige
Bailey
,
Paul
Michel
,
Petko
Yotov
,
Rahma
Chaabouni
,
Ramona
Comanescu
,
Reena
Jana
,
Rohan
Anil
,
Ross
McIlroy
,
Ruibo
Liu
,
Ryan
Mullins
,
Samuel L.
Smith
,
Sebastian
Borgeaud
,
Sertan
Girgin
,
Sholto
Douglas
,
Shree
Pandya
,
Siamak
Shakeri
,
Soham
De
,
Ted
Klimenko
,
Tom
Hennigan
,
Vlad
Feinberg
,
Wojciech
Stokowiec
,
Yu
hui Chen
,
Zafarali
Ahmed
,
Zhitao
Gong
,
Tris
Warkentin
,
Ludovic
Peran
,
Minh
Giang
,
Clément
Farabet
,
Oriol
Vinyals
,
Jeff
Dean
,
Koray
Kavukcuoglu
,
Demis
Hassabis
,
Zoubin
Ghahramani
,
Douglas
Eck
,
Joelle
Barral
,
Fernando
Pereira
,
Eli
Collins
,
Armand
Joulin
,
Noah
Fiedel
,
Evan
Senter
,
Alek
Andreev
, and
Kathleen
Kenealy
.
2024
.
Gemma: Open models based on gemini research and technology
.
arXiv preprint
,
2403.08295v4
.
Nandan
Thakur
,
Jianmo
Ni
,
Gustavo Hernandez
Abrego
,
John
Wieting
,
Jimmy
Lin
, and
Daniel
Cer
.
2024
.
Leveraging LLMs for synthesizing training data across many languages in multilingual dense retrieval
. In
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
, pages
7699
7724
,
Mexico City, Mexico
.
Association for Computational Linguistics
.
Hugo
Touvron
,
Thibaut
Lavril
,
Gautier
Izacard
,
Xavier
Martinet
,
Marie-Anne
Lachaux
,
Timothée
Lacroix
,
Baptiste
Rozière
,
Naman
Goyal
,
Eric
Hambro
,
Faisal
Azhar
,
Aurelien
Rodriguez
,
Armand
Joulin
,
Edouard
Grave
, and
Guillaume
Lample
.
2023
.
Llama: Open and efficient foundation language models
.
arXiv preprint
,
2302.13971v1
.
Laurens
van der Maaten
and
Geoffrey
Hinton
.
2008
.
Visualizing data using t-sne
.
Journal of Machine Learning Research
,
9
(
86
):
2579
2605
.
Denny
Vrandečić
and
Markus
Krötzsch
.
2014
.
Wikidata: A free collaborative knowledgebase
.
Communications of the ACM
,
57
(
10
):
78
85
.
Linting
Xue
,
Noah
Constant
,
Adam
Roberts
,
Mihir
Kale
,
Rami
Al-Rfou
,
Aditya
Siddhant
,
Aditya
Barua
, and
Colin
Raffel
.
2021
.
mT5: A massively multilingual pre-trained text-to- text transformer
. In
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
, pages
483
498
,
Online
.
Association for Computational Linguistics
.
Ori
Yoran
,
Tomer
Wolfson
,
Ori
Ram
, and
Jonathan
Berant
.
2024
.
Making retrieval- augmented language models robust to irrel evant context
. In
The Twelfth International Conference on Learning Representations
.
Xinyu
Zhang
,
Nandan
Thakur
,
Odunayo
Ogundepo
,
Ehsan
Kamalloo
,
David Alfonso-
Hermelo
,
Xiaoguang
Li
,
Qun
Liu
,
Mehdi
Rezagholizadeh
, and
Jimmy
Lin
.
2023
.
MIRACL: A multilingual retrieval dataset covering 18 diverse languages
.
Transactions of the Association for Computational Linguistics
,
11
:
1114
1131
.

Table 10: List of English properties used for generating MlWikiQA. Note that we do not generate data for a property if it does not exist in the Wikidata of the target languages.
Property ID | Description | Property ID | Description
P264 | record label | P175 | performer
P176 | manufacturer | P112 | founded by
P127 | owned by | P840 | narrative location
P495 | country of origin | P20 | place of death
P407 | language of work or name | P582 | end time
P69 | educated at | P159 | headquarters location
P740 | location of formation | P17 | country
P136 | genre | P800 | notable work
P36 | capital | P570 | date of death
P190 | twinned administrative body | P4552 | mountain range
P915 | filming location | P3086 | speed limit
P84 | architect | P2046 | area
P569 | date of birth | P86 | composer
P515 | phase of matter | P2048 | height
P40 | child | P580 | start time
P828 | has cause | P50 | author
P2067 | mass | P108 | employer
P170 | creator | P2049 | width
P364 | original language of film or TV show | P277 | programmed in
P276 | location | P413 | position played on team / speciality
P131 | located in the administrative territorial entity | P26 | spouse
P106 | occupation | P607 | conflict
P942 | theme music | P571 | inception
P6 | head of government | P19 | place of birth
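For concreteness, the sketch below shows one way the property IDs listed in Table 10 could be used to pull (subject, property, object) triples from the public Wikidata SPARQL endpoint. This is an illustrative sketch, not the paper's extraction pipeline: the endpoint URL and label service are real, but the property subset, result limit, and the helper name fetch_triples are assumptions.

```python
# Illustrative sketch: fetch (subject, property, object) triples for a few of
# the Table 10 properties from the public Wikidata SPARQL endpoint.
# The property subset, LIMIT, and function name are assumptions.
import requests

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

def fetch_triples(property_id: str, limit: int = 100):
    """Return (subject_label, property_id, object_label) triples for one property."""
    query = f"""
    SELECT ?sLabel ?oLabel WHERE {{
      ?s wdt:{property_id} ?o .
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }}
    LIMIT {limit}
    """
    resp = requests.get(
        WIKIDATA_SPARQL,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "triple-extraction-sketch/0.1"},
        timeout=60,
    )
    resp.raise_for_status()
    rows = resp.json()["results"]["bindings"]
    return [(r["sLabel"]["value"], property_id, r["oLabel"]["value"]) for r in rows]

if __name__ == "__main__":
    for prop in ["P36", "P50", "P86"]:  # capital, author, composer
        print(prop, fetch_triples(prop, limit=5)[:2])
```

In practice the same query can be issued per language by changing the label-service language code, which matches the caption's note that a property is skipped when it is absent from the target-language Wikidata.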
Table 11: Examples of using ChatGPT to generate questions from triples. We use the same prompt as for Yes questions to generate No questions by sampling perturbed triples. Highlighted text indicates system outputs.
Table 12: An example of prompting Gemma-7B to generate questions with ICL examples from ChatGPT. Highlighted text indicates system outputs.
Table 13: Complete prompt for few-shot question-answer generation from passages in the target language.
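The full template of Table 13 is not reproduced in this excerpt. As a hedged illustration of how such a few-shot prompt might be assembled, the snippet below concatenates a handful of (passage, question, answer) exemplars with a new target-language passage; the instruction wording, exemplar layout, and the names Exemplar and build_fewshot_prompt are assumptions rather than the paper's exact template.

```python
# Hedged sketch: assemble a few-shot prompt for generating a question-answer
# pair from a target-language passage. Wording, layout, and names are
# illustrative assumptions, not the exact template used in Table 13.
from dataclasses import dataclass

@dataclass
class Exemplar:
    passage: str
    question: str
    answer: str

INSTRUCTION = (
    "Read the passage and write a question in the same language as the passage, "
    "followed by its short answer."
)

def build_fewshot_prompt(exemplars: list[Exemplar], target_passage: str) -> str:
    blocks = [INSTRUCTION, ""]
    for ex in exemplars:
        blocks += [f"Passage: {ex.passage}",
                   f"Question: {ex.question}",
                   f"Answer: {ex.answer}",
                   ""]
    # The model is expected to continue after "Question:" for the new passage.
    blocks += [f"Passage: {target_passage}", "Question:"]
    return "\n".join(blocks)
```

The cross-lingual variant in Table 14 would differ mainly in pairing English passages with target-language questions and answers in the exemplars.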
Table 14: Complete prompt for few-shot cross-lingual question-answer generation from English passages.
Table 15: An example of zero-shot English Prompting. Highlighted text indicates system outputs.
Figure 10: Distribution comparison between FsMlQA and Xor-TyDi QA in the remaining languages. The diverse synthetic data can be expanded from only five-shot examples while retaining the core characteristics of the gold distribution.
Table 16: An example of zero-shot Multilingual Prompting. Highlighted text indicates system outputs.
Table 17: Prompt template for few-shot multilingual QA with LLMs.
Table 18: Prompt template for quality validation of synthetic data using Model-as-Judge.
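As a rough sketch of how a Model-as-Judge check could be wired around synthetic examples, the snippet below formats a judge prompt and keeps an example only if the verdict begins with "Yes". The template wording and the judge callable are assumptions; the actual prompt is the one given in Table 18.

```python
# Hedged sketch: wrap a Model-as-Judge validation step around a synthetic
# (passage, question, answer) triple. The template wording and the `judge`
# callable (any text-in/text-out LLM interface) are assumptions.
JUDGE_TEMPLATE = (
    "Passage: {passage}\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Is the answer to the question supported by the passage? Reply Yes or No."
)

def is_valid_example(judge, passage: str, question: str, answer: str) -> bool:
    prompt = JUDGE_TEMPLATE.format(passage=passage, question=question, answer=answer)
    verdict = judge(prompt)  # call into whichever LLM serves as the judge
    return verdict.strip().lower().startswith("yes")

# Usage: keep only the examples the judge accepts.
# filtered = [ex for ex in synthetic_data
#             if is_valid_example(judge, ex["passage"], ex["question"], ex["answer"])]
```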
Table 19: Dataset statistics of our pre-training data MlWikiQA and few-shot synthetic data FsMlQA in each language. Question and answer lengths are reported as mean ± standard deviation.

Language | MlWikiQA # Q-A Pairs | MlWikiQA Question Length | MlWikiQA Answer Length | FsMlQA # Q-A Pairs | FsMlQA Question Length | FsMlQA Answer Length
Arabic | 1,803,765 | 7.00±2.13 | 1.65±0.84 | 80,575 | 8.20±2.86 | 1.57±1.30
Bengali | 407,496 | 6.13±1.80 | 1.65±0.85 | 127,562 | 8.97±2.99 | 1.63±1.44
English | 7,963,985 | 7.95±2.54 | 1.78±1.01 | – | – | –
Finnish | 2,135,790 | 6.02±1.75 | 1.32±0.64 | 270,627 | 5.83±2.09 | 1.38±0.90
Japanese | 2,735,635 | 14.74±3.57 | 3.57±1.73 | 143,265 | 10.19±2.18 | 3.96±4.69
Korean | 1,018,348 | 5.46±1.78 | 1.55±0.80 | 192,002 | 5.72±2.29 | 1.42±0.92
Russian | 2,561,925 | 6.94±2.17 | 1.70±1.10 | 792,914 | 7.34±2.64 | 1.44±1.03
Telugu | 108,215 | 5.60±1.84 | 1.50±0.74 | 139,211 | 6.48±2.48 | 1.49±1.17
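The question and answer lengths in Table 19 are means with standard deviations. A minimal sketch of computing such per-language statistics is given below; whitespace tokenization and the field names (lang, question, answer) are assumptions, and the paper's tokenization may differ.

```python
# Minimal sketch: per-language question/answer length statistics (mean, std),
# in the spirit of Table 19. Whitespace tokenization and field names are
# assumptions about the data format, not the paper's exact procedure.
from collections import defaultdict
from statistics import mean, stdev

def length_stats(examples):
    """examples: iterable of dicts with 'lang', 'question', and 'answer' keys."""
    by_lang = defaultdict(lambda: {"q": [], "a": []})
    for ex in examples:
        by_lang[ex["lang"]]["q"].append(len(ex["question"].split()))
        by_lang[ex["lang"]]["a"].append(len(ex["answer"].split()))

    def mean_std(xs):
        # Guard against single-example languages, where stdev is undefined.
        return mean(xs), (stdev(xs) if len(xs) > 1 else 0.0)

    return {
        lang: {
            "num_pairs": len(v["q"]),
            "question_length": mean_std(v["q"]),
            "answer_length": mean_std(v["a"]),
        }
        for lang, v in by_lang.items()
    }
```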
Table 20: Success Rate of synthetic data generation across seven languages with different prompting strategies. Success Rate = valid examples after data filtering / total examples (i.e., # Documents).

Prompting Strategy | Ar | Bn | Fi | Ja | Ko | Ru | Te | Avg.
FsModQA | 5.9% | 13.9% | 16.9% | 7.7% | 21.4% | 15.4% | 13.2% | 13.4%
NQ-En | 3.4% | 8.8% | 8.2% | 1.2% | 8.3% | 5.9% | 4.2% | 5.3%
TyDi-En | 5.0% | 13.6% | 12.8% | 1.9% | 17.0% | 6.3% | 5.9% | 7.0%
Xor-TyDi-* | 10.2% | 14.6% | 14.2% | 2.5% | 22.7% | 9.1% | 10.7% | 9.8%
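Following the caption's definition, the success rate is the number of examples surviving data filtering divided by the number of prompted documents. The tiny sketch below restates that computation; passes_filter is a placeholder assumption standing in for the actual filtering criteria.

```python
# Success rate as defined in Table 20: valid examples after filtering
# divided by the total number of prompted documents. `passes_filter`
# is a placeholder for whatever filtering criteria are applied.
def success_rate(generated_examples, num_documents: int,
                 passes_filter=lambda ex: ex is not None) -> float:
    valid = sum(1 for ex in generated_examples if passes_filter(ex))
    return valid / num_documents

# e.g., 13,400 valid examples from 100,000 prompted documents -> 0.134 (13.4%)
```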
Table 21: Detailed results in each language when trained with varying sizes of supervised data.

Method | F1 (Ar) | F1 (Bn) | F1 (Fi) | F1 (Ja) | F1 (Ko) | F1 (Ru) | F1 (Te) | Macro F1 | Macro EM | Macro BLEU
5-shot
FsModQA-En | 35.6 | 32.7 | 35.5 | 35.1 | 30.2 | 33.6 | 31.8 | 33.5 | 23.8 | 23.0
FsModQA | 41.3 | 35.4 | 39.6 | 41.5 | 35.0 | 38.2 | 36.3 | 38.2 | 27.9 | 24.4
16-shot
FsModQA-En | 38.3 | 31.0 | 39.4 | 38.3 | 35.2 | 34.9 | 34.6 | 35.9 | 26.1 | 24.1
FsModQA | 42.0 | 35.6 | 41.4 | 41.7 | 35.3 | 39.2 | 40.0 | 39.3 | 29.3 | 26.6
32-shot
FsModQA-En | 42.4 | 31.2 | 40.8 | 38.1 | 33.0 | 37.9 | 34.9 | 36.9 | 26.3 | 25.5
FsModQA | 43.6 | 35.6 | 42.2 | 42.5 | 34.1 | 38.6 | 37.0 | 39.1 | 28.8 | 26.6
128-shot
FsModQA-En | 42.0 | 28.8 | 41.7 | 40.3 | 34.6 | 34.7 | 36.0 | 36.9 | 27.0 | 25.2
FsModQA | 45.3 | 32.8 | 44.3 | 43.8 | 34.0 | 39.9 | 42.1 | 40.3 | 30.5 | 27.4
1024-shot
FsModQA-En | 45.0 | 30.8 | 45.1 | 39.2 | 34.1 | 39.1 | 37.5 | 38.7 | 29.3 | 26.5
FsModQA | 47.5 | 33.7 | 46.7 | 41.4 | 35.9 | 40.2 | 40.1 | 40.8 | 31.3 | 27.9
full
FsModQA-En | 48.9 | 33.3 | 47.7 | 42.9 | 39.6 | 40.0 | 41.7 | 42.0 | 32.7 | 28.5
FsModQA | 50.8 | 33.3 | 47.8 | 45.0 | 38.9 | 42.0 | 43.1 | 43.0 | 33.4 | 29.6
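The "Macro" columns in Table 21 are uniform means of the per-language scores. The short sketch below reproduces that aggregation for the 5-shot FsModQA F1 row as a sanity check; the dictionary layout is an illustrative assumption.

```python
# Sketch: macro-averaging per-language scores as in Table 21's macro columns
# (uniform mean over the seven languages). The example numbers are the
# 5-shot FsModQA F1 row from Table 21.
from statistics import mean

per_language_f1 = {
    "ar": 41.3, "bn": 35.4, "fi": 39.6, "ja": 41.5,
    "ko": 35.0, "ru": 38.2, "te": 36.3,
}

macro_f1 = mean(per_language_f1.values())
print(round(macro_f1, 1))  # 38.2, matching the reported macro F1
```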

Author notes

* Also at Google.

Action Editor: Shay Cohen

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. For a full description of the license, please visit https://creativecommons.org/licenses/by/4.0/legalcode.