Abstract
Recent approaches to multilingual open-domain question answering (MLODQA) have achieved promising results given abundant language-specific training data. However, the considerable annotation cost limits the application of these methods for underrepresented languages. We introduce a few-shot learning approach to synthesize large-scale multilingual data from large language models (LLMs). Our method begins with large-scale self-supervised pre-training using WikiData, followed by training on high-quality synthetic multilingual data generated by prompting LLMs with few-shot supervision. The final model, FsModQA, significantly outperforms existing few-shot and supervised baselines in MLODQA and in cross-lingual and monolingual retrieval. We further show our method can be extended for effective zero-shot adaptation to new languages through a cross-lingual prompting strategy with only English-supervised data, making it a general and applicable solution for MLODQA tasks without costly large-scale annotation.
1 Introduction
Open-domain QA has demonstrated impressive performance by employing the retrieve-then-read (Figure 1(a)) pipeline (Chen et al., 2017), which is built upon dense retrievers (Karpukhin et al., 2020) and efficient generative readers (Izacard and Grave, 2021). However, this success has been primarily limited to English, leaving the multilingual setting under-explored. This limitation is mainly due to the difficulty and costs of creating high-quality and balanced human-supervised training data for languages other than English. Moreover, multilingual open-domain QA introduces additional challenges with retrieving evidence from multilingual corpora, requiring the underlying retrieval system to be capable of both cross-lingual and monolingual retrieval (Asai et al., 2021b).
Left: Multilingual open-domain QA pipeline. Middle: Training strategies: 1) self-supervised pre-training; 2–4) baselines using English QA data: 2) used directly; 3) machine translated into target languages; 4) used to prompt LLMs to generate target language QA samples; 5) our method using few-shot in-language data to prompt an LLM. Right: Result comparison (Avg. F1) on the XOR-Full dataset.
More recently, efforts have been made to create multilingual open-domain QA benchmarks from existing multilingual machine reading comprehension tasks (e.g., Xor-TyDi QA [Asai et al., 2021a]) and by translating English datasets (e.g., MKQA [Longpre et al., 2021]). These datasets have enabled various approaches to address multilingual open-domain QA problems, including iterative data augmentation (Asai et al., 2021b) and extensive additional pre-training on Wikipedia texts (Abulkhanov et al., 2023; Jiang et al., 2024). However, these methods still heavily depend on abundant high-quality language-specific data for fine-tuning, making them less effective solutions when language resources are limited. Therefore, a more generalizable approach to multilingual open-domain QA should aim to mitigate this reliance and be capable of facilitating language adaptation with minimally supervised samples.
In this paper, we present FsModQA, a method for Few-Shot Multilingual Open-Domain QA using minimally-sized supervised data (i.e., up to 5 examples per language).1 Our approach consists of two core components: a self-supervised pre-training objective on multilingual corpora, and a synthetic data generation pipeline that prompts a large language model (LLM) with few-shot supervised examples. Concretely, we generate question-answer pairs from WikiData triples by leveraging LLMs’ In-Context Learning (ICL) ability. To construct ICL prompts, we use ChatGPT to generate curated input-output pairs, which serve as examples for prompting LLMs to generate millions of questions from WikiData triples across various languages. After generating these question-answer pairs, we identify the supporting Wikipedia passages through answer string matching. We further gather cross-lingual answers and evidence passages through Wikipedia language links to facilitate cross-lingual retrieval. Using this generated data, we train a multilingual model with a joint objective for retrieval and QA, producing a strong pre-trained model (Figure 1(c)) for subsequent few-shot learning.
In few-shot learning, we employ LLMs for data generation from few-shot examples. For each target language, we feed the few-shot examples to an LLM and prompt it to generate question-answer pairs from a given document. The few-shot examples are assumed to encapsulate the QA style and distribution of the target dataset, encouraging the LLM to generate synthetic data with similar characteristics. With this abundant synthetic data, the pre-trained model can be further fine-tuned to achieve superior results (Figure 1(c)). As an unsupervised alternative, we explore a zero-shot cross-lingual prompting strategy that uses data from other languages as prompts for data generation, and we show this is almost on par with few-shot prompting (Figure 1(c)).
We evaluate FsModQA on various datasets, including cross-lingual and monolingual retrieval, and multilingual open-domain QA. We observe notable improvements over competitive few-shot baselines, with a +5.1% gain on retrieval and a +8.4% gain on multilingual open-domain QA. To further test FsModQA's language adaptation ability, we conduct zero-shot adaptation experiments using our cross-lingual prompting strategy on 15 languages. This adaptation significantly improves performance in both monolingual retrieval and multilingual QA, achieving results that are superior or comparable to strong translation-based methods.2
2 FsModQA
Figure 2 presents the full pipeline for generating self-supervised pre-training and fine-tuning data.
Full pipeline for data construction and model training: (1) generate large-scale data from Wikidata for self-supervised pre-training; (2) use few-shot prompting to generate synthetic Q&A pairs from Wikipedia passages of target languages, on which the pre-trained model is further fine-tuned.
2.1 Self-Supervised Data Construction
Sampling Factual Triplets.
Our self-supervised training dataset is constructed based on Wikidata (Vrandečić and Krötzsch, 2014), a multilingual knowledge base consisting of fact triplets linked to millions of entities. We manually select 50 common properties (Appendix Table 10) based on English and consider all triples associated with these relations. We then gather fact triplets in the desired target languages through language links.
Generating Questions.
Given a triplet (s, r, o), we aim to write a question q about the head entity s’s property r, with the gold answer a being the tail entity o. One could use relation-specific templates to efficiently transform each triple into a natural question (Sciavolino et al., 2021). However, this method lacks diversity: triples with the same property yield questions with similar surface forms. Instead, we adopt a generative approach, using an LLM to automatically generate questions with more diverse styles.
Specifically, we first sample five triples for each property and prompt ChatGPT (gpt-3.5-turbo) to generate three questions for each triple. This process yields a curated set of high-quality questions paired with their source triples, which later serve as in-context examples for large-scale question generation.
We additionally generate questions with Yes/No answers from the same set of sampled triples. Yes questions can be generated directly from valid triples. For No questions, we need to create false fact triples from existing ones: we randomly replace a triple’s head or tail entity with the most similar Wikidata entity and check that the perturbed triple is not a valid fact according to Wikidata. We then generate questions using ChatGPT as before. Examples are included in Appendix Table 11.
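To make the perturbation step concrete, the following toy Python sketch builds a false triple by swapping an entity for a lexically similar one and rejecting the swap if the result is still a valid fact; the triple store and entity list are illustrative stand-ins for Wikidata, not part of the released pipeline.

```python
import difflib

# Hypothetical stand-ins for the Wikidata triple store and entity vocabulary.
wikidata_triples = {("Paris", "P17", "France"), ("Berlin", "P17", "Germany")}
all_entities = ["France", "Germany", "Finland", "Japan"]

def most_similar_entity(entity, candidates):
    """Pick the lexically closest candidate that is not the entity itself."""
    others = [c for c in candidates if c != entity]
    return max(others, key=lambda c: difflib.SequenceMatcher(None, entity, c).ratio())

def perturb_triple(triple, replace_tail=True):
    """Swap the head or tail with a similar entity; keep the perturbed triple only
    if it is NOT a valid fact according to the triple store."""
    head, prop, tail = triple
    if replace_tail:
        candidate = (head, prop, most_similar_entity(tail, all_entities))
    else:
        candidate = (most_similar_entity(head, all_entities), prop, tail)
    return candidate if candidate not in wikidata_triples else None

false_triple = perturb_triple(("Paris", "P17", "France"))
print(false_triple)  # e.g., ('Paris', 'P17', 'Finland'), later verbalized as a "No" question
```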
Multilingual Positive Passage Identification.
As shown in Figure 3, for a question qja and answer aja derived from a triple (sja, r, oja), we gather all passages from the Wikipedia page linked by sja and add the passages containing aja as positives. If no such passage exists, we use partial matching and select the passage with the highest lexical overlap with aja as the positive. We further include positive passages in other languages to facilitate cross-lingual retrieval: we translate the triple into each target language (sL, r, aL) using language links and identify cross-lingual positives by searching for aL in the Wikipedia page linked by sL, as above. This yields both monolingual and cross-lingual positive passages. In total, we generate 18.7M triples across 8 languages, denoted as MlWikiQA.3
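The passage-identification step reduces to answer-string matching with a lexical-overlap fallback. The sketch below illustrates this under the assumption that answer containment and whitespace-token overlap are adequate proxies; the function and variable names are ours, not the released code.

```python
# Minimal sketch of positive-passage identification; `passages` stands in for the
# paragraphs of the head entity's Wikipedia page.
def lexical_overlap(answer: str, passage: str) -> float:
    a, p = set(answer.lower().split()), set(passage.lower().split())
    return len(a & p) / max(len(a), 1)

def find_positive(answer: str, passages: list[str]) -> str | None:
    # 1) exact answer-string match
    exact = [p for p in passages if answer.lower() in p.lower()]
    if exact:
        return exact[0]
    # 2) fall back to the passage with the highest partial (token-level) overlap
    best = max(passages, key=lambda p: lexical_overlap(answer, p), default=None)
    return best if best and lexical_overlap(answer, best) > 0 else None

passages = ["Tokyo is the capital of Japan.", "Mount Fuji is the highest mountain in Japan."]
print(find_positive("capital of Japan", passages))
```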
Pre-training data construction pipeline: (1) transform WikiData triples into QAs using LLMs for each target language L, and (2) identify in-language and cross-lingual positive passages from the head entity’s Wikipedia page and through language links. English translations are added for readability.
2.2 Few-shot Synthetic Data Generation
Few-shot Setting.
The main idea of FsModQA is to amplify a limited number of annotated examples into a substantially larger volume of synthetic data by prompting LLMs. In this work, we consider Xor-TyDi QA (Asai et al., 2021a) as our target dataset. For each language in Xor-TyDi QA, we randomly sample five examples from the training set as few-shot supervision. Each example contains a question, its answer, and the ground-truth passage. We ensure that three examples have span answers, while the remaining two have Yes and No answers, to align with the Xor-TyDi QA answer distribution.
Prompt-based Question & Answer Generation.
We populate a hand-engineered template with our few-shot language-specific examples and use them as ICL examples to prompt the LLM. Given a randomly sampled passage dL from language L, we append dL to the template, and the LLM is expected to generate a relevant question qL and answer aL in language L. We further constrain the answer aL to be a span within dL, a property of the original Xor-TyDi QA dataset.
Many questions classified as unanswerable in Clark et al. (2020) can be answered by referring to English Wikipedia (Asai et al., 2021a). These questions are included as cross-lingual questions in Xor-TyDi QA. To simulate this scenario, we generate synthetic cross-lingual data from English passages. We first use Google Translate to translate the few-shot examples into English. We then use these translated examples to fill another template and instruct the LLM to generate a QA pair from a randomly sampled English passage dEn, first in English (qEn, aEn) and then in the target language (qL, aL). Similarly, we restrict aEn to be a span within dEn. The prompts we used are included in Appendix Tables 13 and 14.
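As an illustration of how the few-shot examples are turned into a generation prompt, the sketch below fills a simple template with the annotated demonstrations and a sampled passage. The wording is an assumption of this sketch only; the actual prompts appear in Appendix Tables 13 and 14.

```python
# Illustrative prompt builder for few-shot Q&A generation (template wording is hypothetical).
def build_prompt(few_shot_examples, passage, lang):
    header = (f"Generate a question in {lang} that is answered by a span of the "
              f"passage, then give that answer span.\n\n")
    demos = ""
    for ex in few_shot_examples:  # each ex: dict with 'passage', 'question', 'answer'
        demos += (f"Passage: {ex['passage']}\n"
                  f"Question: {ex['question']}\n"
                  f"Answer: {ex['answer']}\n\n")
    return header + demos + f"Passage: {passage}\nQuestion:"

examples = [{"passage": "Helsinki on Suomen pääkaupunki.",
             "question": "Mikä on Suomen pääkaupunki?",
             "answer": "Helsinki"}]
print(build_prompt(examples, "Oulu on kaupunki Pohjois-Suomessa.", "Finnish"))
```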
Data Filtering.
We employ a method based on Natural Language Inference (NLI) to enhance the quality of our synthetic data. NLI techniques classify whether a hypothesis text is entailed by, neutral with respect to, or contradicted by a given premise text (Bowman et al., 2015). They have been widely used for identifying factual errors in text summarization (Laban et al., 2022) and hallucinations in machine-generated texts (Honovich et al., 2022). In this study, we employ NLI for data filtering (Yoran et al., 2024). Given a synthetic example (q, a, d), we treat the source passage d as the premise and the concatenation of the generated question q and answer a as the hypothesis. We retain an example only when the premise entails the hypothesis.
In more detail, we apply a novel local-to-global filtering mechanism. In local filtering, we evaluate whether the originating passage d entails the synthetic (q, a) pair. We take the output probability of the entailment label as the score and keep an example only when this score exceeds a local threshold. In global filtering, we use a pre-trained model (i.e., the self-supervised model in Figure 1(b)) to retrieve a set of passages for the question q. We compute an entailment score between (q, a) and each retrieved passage and apply maximum pooling over these scores to derive the final score. The intuition is that a valid (q, a) pair should be supported by at least one of the retrieved passages, which aligns with the open-domain setting. Again, we retain only those examples whose scores surpass a predefined global threshold. In this way, we end up with 1.7M synthetic examples in total across 7 languages, denoted as FsMlQA.
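A minimal sketch of the local-to-global filter follows. The entailment scorer is a crude token-overlap stand-in for a real NLI model, and both thresholds and the toy retriever are illustrative, not the values or components used in the paper.

```python
LOCAL_THRESHOLD = 0.5    # illustrative values, not the paper's thresholds
GLOBAL_THRESHOLD = 0.5

def entailment_score(premise: str, hypothesis: str) -> float:
    """Stand-in for an NLI model's P(entailment): fraction of hypothesis tokens in the premise."""
    p, h = set(premise.lower().split()), set(hypothesis.lower().split())
    return len(p & h) / max(len(h), 1)

def keep_example(question, answer, source_passage, retrieve_top_k):
    hypothesis = f"{question} {answer}"
    # Local filtering: the originating passage must entail the synthetic (q, a) pair.
    if entailment_score(source_passage, hypothesis) < LOCAL_THRESHOLD:
        return False
    # Global filtering: max-pool entailment over the passages retrieved for the question;
    # at least one retrieved passage should support the pair.
    global_score = max(entailment_score(p, hypothesis) for p in retrieve_top_k(question))
    return global_score >= GLOBAL_THRESHOLD

corpus = ["Helsinki on Suomen pääkaupunki.", "Oulu on kaupunki Pohjois-Suomessa."]
retriever = lambda q: corpus[:2]   # toy retriever returning "top-k" passages
print(keep_example("Mikä on Suomen pääkaupunki?", "Helsinki",
                   "Helsinki on Suomen pääkaupunki.", retriever))
```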
2.2.1 Zero-shot Cross-lingual Prompting
Our few-shot setting relies on a few annotated examples to generate synthetic QA pairs in target languages. However, this approach encounters significant challenges when the target language is extremely low-resourced, making it nearly impossible to obtain even a few examples. For this setting, we explore zero-shot prompting, which uses cross-lingual examples to prompt LLMs to generate synthetic QA pairs in target languages.
We consider two zero-shot prompting settings. In the English-Prompting setting, we populate a template with English QA data and use it as the prompt to ask LLMs to generate QA pairs from passages randomly sampled from the target language. In the Multilingual-Prompting setting, we assume access to a handful of examples in a held-out language set. We randomly sample five multilingual examples from this held-out set to populate another template, and prompt LLMs to generate QA pairs in the target languages. The prompts used are included in Appendix Tables 15 and 16.
2.2.2 Data Sampling
Our synthetic dataset, FsMlQA, exhibits a strongly skewed distribution towards shorter answer lengths (often single tokens), whereas the human-annotated answers in Xor-TyDi QA tend to be substantially longer. To address this mismatch, we resample the training data from FsMlQA according to answer length, using a geometric distribution, l ∼Geo(p), to achieve a better balance between short and long answers.4
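The sketch below illustrates this resampling step: examples are bucketed by answer length and lengths are drawn from l ∼ Geo(p), with p = 0.4 and a length cap of 30 following the values reported in the notes. Measuring length in whitespace tokens, and the toy data, are assumptions of the sketch.

```python
import random

P, MAX_LEN = 0.4, 30

def geo_pmf(l: int, p: float = P) -> float:
    return (1 - p) ** (l - 1) * p          # geometric pmf with support l = 1, 2, ...

def resample_by_length(examples, n_samples, rng=random.Random(0)):
    # Bucket examples by (truncated) answer length.
    by_len = {}
    for ex in examples:
        l = min(len(ex["answer"].split()), MAX_LEN)
        by_len.setdefault(l, []).append(ex)
    for bucket in by_len.values():
        rng.shuffle(bucket)
    sampled = []
    # Sample without replacement: draw a target length, then pop one example of that length.
    while len(sampled) < n_samples and any(by_len.values()):
        lengths = [l for l, b in by_len.items() if b]
        weights = [geo_pmf(l) for l in lengths]
        l = rng.choices(lengths, weights=weights, k=1)[0]
        sampled.append(by_len[l].pop())
    return sampled

data = [{"answer": "Helsinki"}, {"answer": "the capital of Finland"}, {"answer": "1952"}]
print(len(resample_by_length(data, 2)))
```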
2.3 FsModQA Model
Model Structure.
As shown in Figure 4, we employ a single encoder-decoder model to perform both passage retrieval and QA. The first half of the encoder functions as a dual-encoder with shared parameters, separately encoding the question q and the passages in the corpus. Additionally, we append an instruction to the question to specify the language of the target answer: “Answer in {lang}”. A LayerNorm operation, followed by average pooling, compresses the inputs into single vectors Eq and Ed, which are matched via dot products. The top-k passages most relevant to the question are selected, and the embeddings of the question and each top-k passage are concatenated and fed into the remaining cross-encoder layers. Finally, the cross-encoder embeddings are flattened and incorporated into the decoder through cross-attention to generate the answer a, following the Fusion-in-Decoder approach (Izacard and Grave, 2021).
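The retrieval half can be pictured as follows. The encoder call is a random stand-in for the first half of the mT5 encoder; the sketch only illustrates the LayerNorm-then-mean-pool representation and dot-product top-k selection, not the actual model.

```python
import numpy as np

def encode_first_half(texts):
    """Stand-in for the shared first half of the encoder: [batch, seq, dim] hidden states."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 16, 32))

def layernorm(x, eps=1e-6):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def pool(hidden):
    # LayerNorm followed by average pooling -> one vector per text
    return layernorm(hidden).mean(axis=1)

question = ["Answer in Japanese. 日本の首都はどこですか?"]   # language instruction appended
passages = ["Tokyo is the capital of Japan.",
            "Mount Fuji is the highest mountain in Japan.",
            "Kyoto was the old capital."]

E_q = pool(encode_first_half(question))    # [1, dim]
E_d = pool(encode_first_half(passages))    # [n_passages, dim]
scores = E_q @ E_d.T                        # dot-product relevance
top_k = np.argsort(-scores[0])[:2]          # indices of top-k passages passed to the cross-encoder
print(top_k)
```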
Model Training.
The entire model is optimized to generate the target answer ai given qi and its retrieved passages; we denote this end-to-end loss ℒe2e.
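As a rough illustration only, and not the paper's exact formulation, the end-to-end objective can be written as the standard Fusion-in-Decoder negative log-likelihood of the answer tokens given the question and its top-K retrieved passages D_K(q_i):

```latex
% Assumed (illustrative) form of the answer-generation objective:
\mathcal{L}_{\mathrm{e2e}}
  \;=\; -\sum_{i} \sum_{t} \log P_{\theta}\bigl(a_{i,t} \mid a_{i,<t},\, q_i,\, \mathcal{D}_K(q_i)\bigr).
```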
3 Experiments
3.1 Datasets and Metrics
We evaluate on the Xor-TyDi QA dataset (Asai et al., 2021a), with XOR-Retrieve for cross-lingual retrieval and XOR-Full for multilingual open-retrieval QA. We conduct zero-shot evaluations on two benchmarks, MIRACL (Zhang et al., 2023) for monolingual retrieval and MKQA (Longpre et al., 2021) for multilingual open-domain QA. For XOR-Retrieve, we use the February 2019 English Wikipedia dump as the retrieval corpus and the same dumps from 13 languages for XOR-Full and MKQA (Asai et al., 2021a). For MIRACL, we use the monolingual Wikipedia preprocessed by Zhang et al. (2023). Following prior work, we evaluate models at Recall@5kt (top 5000 tokens) on XOR-Retrieve; F1, exact match (EM) and BLEU on XOR-Full; nDCG@10 on MIRACL; and F1 on MKQA.
3.2 Baselines
We evaluate three groups of representative baselines based on the type of supervised data used: (i) Zero-shot baselines (“-En”) fine-tuned on supervised English-only data (i.e., Natural Questions (Kwiatkowski et al., 2019)). (ii) Supervised baselines fine-tuned on human-annotated multilingual data (i.e., Xor-TyDi QA). (iii) Few-shot models that improve zero-shot baselines with only a few supervised multilingual instances.
Retriever Baselines.
For XOR-Retrieve, we include: (1) Zero-shot retrievers: translate-test methods DPR+MT (Asai et al., 2021a) and ReATT+MT (Jiang et al., 2022); models pre-trained on multilingual Wikipedia: CLASS-En (Jiang et al., 2024) and LAPCA (Abulkhanov et al., 2023). (2) Supervised retrievers: multilingual dense retrievers mDPR (Asai et al., 2021a), CORA (Asai et al., 2021b), Sentri (Sorokin et al., 2022), and QuiCK (Ren et al., 2022); and the token-level dense retriever DrDecr (Li et al., 2022), which pre-trains ColBERT on WikiMatrix (Schwenk et al., 2021). (3) Few-shot retrievers: SWIM-X (Thakur et al., 2024), which generates massive synthetic data from LLMs through a summarisation-then-ask technique, and CLASS (5-shot), which fine-tunes CLASS-En on our 5-shot examples. For MIRACL (Zhang et al., 2023), we include two supervised retrievers: fine-tuned mContriever (Izacard et al., 2022) and Hybrid, which combines the results of BM25, mDPR, and mColbert (Khattab and Zaharia, 2020).
Reader Baselines.
(1) Zero-shot baselines: translate-test methods MT+DPR, ReAtt+MT, and GMT+GS generate answers from English retrieved passages with question and answer translations. (2) Supervised baselines: BM25 does in-language retrieval with an extractive multilingual QA model; MT+Mono first applies BM25 and then MT+DPR if no answer was generated. Fusion-in-decoder methods (i.e., CORA, CLASS, Sentri, LAPCA) use retrieval-augmented generation, generating target language answers from multilingual retrieved passages. (3) Few-shot readers: Gemma (5-shot) (Gemma Team et al., 2024) and LLaMa3 (5-shot) (Touvron et al., 2023) prompt LLMs with few-shot examples and retrieved passages using the template in Appendix Table 17; CLASS (5-shot) fine-tunes CLASS-En on few-shot examples. We use the same 5-shot examples for all methods.
3.3 Implementation Details
With the proposed self-supervised data construction method, we generate 18,735,159 triplets for pre-training across 8 languages, with statistics in Appendix Table 19. We initialize our model from the mT5-large checkpoint (Xue et al., 2021) and pre-train it using the loss function ℒssl for 100K steps with a batch size of 800 on 16 A100 GPUs for 64 hours. We set the learning rate to 5 × 10−5 with 10% steps of warm-up, and linear decay to 0.
With our few-shot data generation method, we obtain 1,746,156 question-answer pairs across the 7 languages included in Xor-TyDi QA after data filtering with the local and global entailment thresholds (§2.2), with detailed statistics shown in Appendix Table 19. For fine-tuning, we first train the pre-trained model on NQ data for 8K steps and then on FsMlQA for 6K–14K steps depending on the size of the sampled training dataset, with the loss function ℒe2e. We set the batch size to 128 and the learning rate to 5 × 10−5. We apply an asynchronous passage update mechanism: every 1K steps, we refresh the retrieved passages for each training query using the most recent checkpoint.
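The refresh schedule can be sketched as follows. The scoring function is a random stand-in for dot-product similarity under the current checkpoint, and the corpus, queries, and step counts are toy values rather than the actual training configuration.

```python
import random

REFRESH_STEPS = 1_000
TOP_K = 3

corpus = [f"passage {i}" for i in range(100)]          # toy corpus
train_queries = [f"query {i}" for i in range(10)]      # toy training queries

def score(query: str, passage: str, step: int) -> float:
    """Stand-in for similarity under the checkpoint available at `step`."""
    return random.Random(hash((query, passage, step // REFRESH_STEPS))).random()

cached_topk: dict[str, list[str]] = {}
for step in range(3_000):
    if step % REFRESH_STEPS == 0:                      # periodic refresh with the latest checkpoint
        for q in train_queries:
            ranked = sorted(corpus, key=lambda p: -score(q, p, step))
            cached_topk[q] = ranked[:TOP_K]
    # ... one optimization step on a batch, reading passages from cached_topk ...
print(cached_topk[train_queries[0]])
```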
3.4 Retrieval Results
XOR-Retrieve.
Table 1 shows that FsModQA, fine-tuned on 100K synthetic data, surpasses the few-shot SWIM-X (7M) by 5.5% at Recall@5kt, despite the latter using substantially more synthetic data generated by a significantly larger proprietary LLM (PaLM2). This indicates our method’s great efficiency in training and data generation. Further scaling up the training data to full size does not improve retrieval accuracy. In addition, we find that fine-tuning CLASS, a sophisticated pre-training method, on the same set of 5-shot examples, lags FsModQA by 3.1 points. This shows our method of amplifying data through LLM prompting is superior to direct fine-tuning.
R@5kt on XOR-Retrieve per language.
Method | Backbone | # Total Params | Pre-training Data | Fine-tuning Data | Ar | Bn | Fi | Ja | Ko | Ru | Te | Avg.
---|---|---|---|---|---|---|---|---|---|---|---|---
Zero-shot Retrievers | ||||||||||||
DPR+MT† | mBERT | 220M | – | NQ | 52.4 | 62.8 | 61.8 | 48.1 | 58.6 | 37.8 | 32.4 | 50.6 |
ReAtt+MT* | T5-L | 583M | – | NQ | 67.3 | 71.0 | 29.3 | 61.8 | 67.0 | 61.2 | 66.4 | 60.6 |
CLASS-En* | mT5-L | 410M | Wikipedia | NQ | 66.7 | 78.6 | 66.6 | 60.2 | 63.2 | 58.2 | 78.2 | 67.4 |
Supervised Retrievers | ||||||||||||
CORA | mBERT | 557M | – | NQ + XOR | 42.7 | 52.0 | 49.0 | 32.8 | 43.5 | 39.2 | 41.6 | 43.0 |
mDPR† | mBERT | 557M | – | NQ + XOR | 48.9 | 60.2 | 59.2 | 34.9 | 49.8 | 43.0 | 55.5 | 50.2 |
Sentri | XLM-R | 560M | – | NQ + TQA + XOR | 56.8 | 62.2 | 65.5 | 53.2 | 55.5 | 52.3 | 80.3 | 60.8 |
QuiCK | mBERT | 557M | – | NQ + XOR | 63.8 | 78.0 | 65.3 | 63.5 | 69.8 | 67.1 | 74.8 | 68.9 |
DrDecr | XLM-R | 278M | WikiMatrix | NQ + XOR | 70.2 | 85.9 | 69.4 | 65.1 | 68.8 | 68.8 | 83.2 | 73.1 |
LAPCA | XLM-R | 560M | Wikipedia | NQ + XPAQ + XOR | 70.2 | 83.8 | 79.6 | 69.7 | 73.6 | 75.5 | 83.1 | 76.5 |
CLASS | mT5-L | 410M | Wikipedia | NQ | 70.6 | 84.9 | 71.0 | 66.0 | 72.6 | 70.0 | 81.9 | 73.9 |
Few-shot Retrievers | ||||||||||||
SWIM-X (7M) | mT5-B | 580M | mC4 | SWIM-IR | 57.9 | 75.0 | 65.6 | 59.3 | 58.9 | 64.6 | 74.4 | 65.1 |
CLASS (5-shot) | mT5-L | 410M | Wikipedia | NQ + XOR (5-shot) | 67.0 | 78.6 | 65.6 | 59.0 | 63.6 | 59.0 | 79.5 | 67.5 |
FsModQA (100K) | mT5-L | 410M | MlWikiQA | NQ + FsMlQA | 66.3 | 79.3 | 67.8 | 66.4 | 65.6 | 73.8 | 75.2 | 70.6 |
FsModQA (1.7M) | mT5-L | 410M | MlWikiQA | NQ + FsMlQA | 63.4 | 80.6 | 67.5 | 66.0 | 66.7 | 74.3 | 75.6 | 70.6 |
MIRACL.
Table 2 shows that FsModQA surpasses the few-shot retriever SWIM-X by 5.1%, even though SWIM-X generates synthetic data for each MIRACL language through 3-shot prompting, whereas FsModQA is trained exclusively on synthetic data generated from 5-shot examples of Xor-TyDi QA and is thus evaluated in a zero-shot manner. We further divide languages into seen and unseen groups based on FsModQA’s training data. It outperforms SWIM-X on all seen languages and on 7 out of 10 unseen languages, the exceptions being zh, fr, and de. We suspect SWIM-X benefits significantly from large-scale synthetic data generation on these high-resource languages.
nDCG@10 on MIRACL. Seen languages: ar–te; unseen languages: es–yo.
Method | ar | bn | en | fi | ja | ko | ru | te | es | fa | fr | hi | id | sw | th | zh | de | yo | Avg.
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Supervised Retrievers | |||||||||||||||||||
Hybrid | 67.3 | 65.4 | 54.9 | 67.2 | 57.6 | 60.9 | 53.2 | 60.2 | 64.1 | 59.4 | 52.3 | 61.6 | 44.3 | 44.6 | 59.9 | 52.6 | 56.5 | 37.4 | 56.6 |
mContriever | 66.4 | 68.4 | 44.2 | 65.2 | 56.8 | 58.8 | 51.2 | 79.0 | 42.8 | 48.9 | 46.2 | 45.0 | 45.8 | 67.7 | 70.7 | 49.4 | 42.3 | 48.4 | 55.4 |
Few-shot Retrievers | |||||||||||||||||||
SWIM-X (180K) | 60.2 | 57.1 | 34.7 | 40.6 | 40.8 | 43.3 | 49.7 | 55.9 | 33.4 | 36.3 | 64.3 | 33.0 | 39.5 | 40.0 | 56.3 | 63.3 | 50.2 | 36.5 | 46.4 |
FsModQA (100K) | 64.4 | 63.6 | 45.4 | 64.7 | 55.1 | 49.6 | 50.0 | 76.2 | 40.5 | 43.7 | 36.5 | 43.2 | 42.6 | 50.2 | 60.4 | 43.2 | 36.7 | 60.2 | 51.5 |
3.5 Multilingual QA Results
XOR-Full.
In Table 3, we show FsModQA achieves the best results in the few-shot setting, outperforming CLASS (5-shot) (CLASS-En directly fine-tuned on the 5-shot examples) by 8.4% and direct few-shot prompting of LLMs for QA by 18%. Compared to supervised readers, FsModQA surpasses CORA and other pipeline methods while achieving results comparable to the rest. It is also noteworthy that, on the two low-resource languages, FsModQA outperforms comparable supervised baselines in Bengali and comes close to them in Telugu, indicating the effectiveness of our method in handling low-resource languages.
Per-language F1 and macro-averaged F1/EM/BLEU on XOR-Full.
Method | Backbone | # Total Params | Pre-training Data | Fine-tuning Data | Ar | Bn | Fi | Ja | Ko | Ru | Te | Macro F1 | Macro EM | Macro BLEU
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Zero-shot Readers | ||||||||||||||
MT+DPR† | mBERT | – | – | NQ | 7.2 | 4.3 | 17.0 | 7.9 | 7.1 | 13.6 | 0.5 | 8.2 | 3.8 | 6.8 |
ReAtt+MT* | T5-L | 1.19B | – | NQ | 15.0 | 10.5 | 1.8 | 13.1 | 14.9 | 15.4 | 8.2 | 11.3 | 5.5 | 9.5 |
GMT+GS† | – | – | – | NQ | 18.0 | 29.1 | 13.8 | 5.7 | 15.2 | 14.9 | 15.6 | 16.0 | 9.9 | 14.9 |
Supervised Readers | ||||||||||||||
BM25† | – | – | – | XOR | 31.1 | 21.9 | 21.4 | 12.4 | 12.1 | 17.7 | – | – | – | – |
MT+Mono† | mBERT | – | – | NQ + XOR | 15.8 | 9.6 | 20.5 | 12.2 | 11.4 | 16.0 | 0.5 | 17.3 | 7.5 | 10.7 |
CORA | mBERT+mT5-B | 1.14B | – | NQ + XOR | 42.9 | 26.9 | 41.4 | 36.8 | 30.4 | 33.8 | 30.9 | 34.7 | 25.8 | 23.3 |
CLASS | mT5-L | 1.23B | Wikipedia | NQ + XOR | 49.1 | 32.0 | 46.7 | 44.1 | 38.4 | 39.9 | 41.1 | 41.6 | 32.5 | 28.2 |
Sentri | XLM-R+mT5-B | 1.14B | – | NQ + TQA + XOR | 52.5 | 31.2 | 45.5 | 44.9 | 43.1 | 41.2 | 30.7 | 41.3 | 34.9 | 30.7 |
LAPCA | XLM-R+mT5-B | 1.14B | Wikipedia | NQ + XPAQ + XOR | 53.4 | 50.2 | 49.3 | 44.7 | 49.5 | 49.3 | 38.9 | 47.8 | 38.7 | 35.5 |
Few-shot Readers | ||||||||||||||
Gemma (5-shot) | Gemma | 7B | – | – | 13.4 | 19.0 | 21.7 | 20.2 | 20.5 | 23.0 | 23.4 | 20.2 | 12.2 | 15.3 |
LLaMA3 (5-shot) | LLaMA3 | 8B | – | – | 22.7 | 13.2 | 22.9 | 17.8 | 19.0 | 19.2 | 28.9 | 20.5 | 12.8 | 15.6 |
CLASS (5-shot) | mT5-L | 1.23B | Wikipedia | NQ + XOR (5-shot) | 32.3 | 28.1 | 29.9 | 25.7 | 29.5 | 27.7 | 24.7 | 29.8 | 20.5 | 21.2 |
FsModQA | mT5-L | 1.23B | MlWikiQA | NQ + FsMlQA | 41.3 | 35.4 | 39.6 | 41.5 | 35.0 | 38.2 | 36.3 | 38.2 | 27.9 | 24.4 |
MKQA.
In Table 4, FsModQA achieves the best zero-shot results on MKQA in almost all languages, with an improvement of +2.8% over the supervised CORA and CLASS. This suggests that training on our synthetic data generalizes well to new languages, indicating that generating synthetic data for each target language may not be necessary for language adaptation.
Zero-shot multilingual QA results (F1) on MKQA. Best performance is in bold. “cn”: “Zh-cn” (Chinese, simplified). “hk”: “Zh-hk” (Chinese, Hong Kong). “tw”: “Zh-tw” (Chinese, traditional).
Method | Da | De | Es | Fr | He | Hu | It | Km | Ms | Nl | No | Pl | Pt | Sv | Th | Tr | Vi | cn | hk | tw | Avg.
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Supervised Readers | |||||||||||||||||||||
CORA | 30.4 | 30.2 | 32.0 | 30.8 | 15.8 | 18.4 | 29.0 | 5.8 | 27.8 | 32.1 | 29.2 | 25.6 | 28.4 | 30.9 | 8.5 | 22.2 | 20.9 | 5.2 | 6.7 | 5.4 | 21.8 |
CLASS | 28.3 | 32.3 | 33.3 | 31.2 | 10.3 | 23.1 | 30.6 | 7.1 | 24.7 | 30.2 | 28.4 | 25.6 | 29.3 | 28.9 | 14.1 | 24.8 | 19.0 | 8.0 | 7.8 | 6.7 | 22.2 |
Few-shot Readers | |||||||||||||||||||||
FsModQA | 34.8 | 33.3 | 38.5 | 34.8 | 19.5 | 28.4 | 31.9 | 7.5 | 36.7 | 34.1 | 35.5 | 18.4 | 33.4 | 37.2 | 15.1 | 24.8 | 9.9 | 9.1 | 8.6 | 7.9 | 25.0 |
3.6 Ablation
The effects of generating cross-lingual queries from English passages, at 100K data scale.
Method | XOR-Full In-LG (Avg. F1) | XOR-Full Cross-LG (Avg. F1) | XOR-Full All (Avg. F1) | Retrieval (RM@100) | CL-Retrieval (R@5kt)
---|---|---|---|---|---
FsModQA | 46.8 | 31.2 | 36.9 | 75.0 | 70.6 |
- CL Queries | 49.3 | 30.0 | 36.8 | 72.0 | 68.4 |
Ablations by removing one component of our method, at 100K data scale.
Method | Ar | Bn | Fi | Ja | Ko | Ru | Te | Avg.
---|---|---|---|---|---|---|---|---
FsModQA | 40.6 | 34.3 | 38.4 | 40.7 | 32.9 | 37.7 | 33.9 | 36.9 |
- Data Filtering | 39.0 | 31.7 | 37.4 | 39.2 | 32.3 | 35.5 | 35.3 | 35.8 |
- Geo Sampling | 37.9 | 35.9 | 36.7 | 38.5 | 34.1 | 35.0 | 33.5 | 36.0 |
- MlWikiQA | 11.2 | 7.2 | 10.2 | 17.5 | 7.9 | 8.5 | 4.4 | 9.6 |
Cross-lingual Data Improves Cross-lingual Ability.
Excluding cross-lingual synthetic training data improves performance on questions that require only in-language passage retrieval. However, performance on questions relying on cross-lingual passage retrieval declines, lowering the overall result. This is further evidenced by the RM@100 retrieval results, where the accuracy of finding evidence in any language (English or in-language) drops, and by the corresponding decline in cross-lingual passage retrieval (R@5kt).
Data Filtering Improves Data Quality.
Using the raw synthetic data from LLMs without any quality control hurts performance in every examined language except Telugu. We suspect that the NLI model is deficient in this language.
Geometric Sampling Improves Long-answer Generation.
Sampling data according to a geometric distribution over answer length leads to a 0.9% gain on average. In languages that contain a significant number of long answers (i.e., ar, fi, ja, ru), geometric sampling shows gains of up to 2.7%. Conversely, in bn and ko, where short answers dominate, random sampling is usually better.
Pre-training is Crucial.
We observe extremely poor results in all languages without pre-training on our MlWikiQA, primarily due to the model’s low retrieval accuracy in identifying relevant passages. We believe pre-training enables the model to achieve good initial retrieval accuracy, which is essential in the subsequent fine-tuning process.
3.7 Training Data Scaling
Performance Improves with More Synthetic Data.
To investigate the effect of data scale, we train FsModQA on subsets ranging from 0.05M to the entire 1.7M QA pairs. Results for each language and the average performance are shown in Figure 5. As the data size increases, FsModQA’s average performance improves up to the 0.6M scale and gradually degrades afterward. We observe that as data size increases, the proportion of examples with short answers grows (78.4% → 95.3%), and the result on long-answer examples drops from 18.0% to 15.1%, indicating overfitting to short answers.
Performance when trained with different sizes of our synthetic data.
Our geometric sampling method (§2.2.2) attempts to balance the answers by length; however, its use of sampling without replacement means the few long-answer instances are quickly exhausted, so larger sampled datasets become skewed toward shorter answers. To mitigate this issue, we employ sampling with replacement, which upsamples longer-answer examples so that the length distribution follows the precomputed geometric distribution.5 As a result, it effectively increases the number of training epochs for data points with longer answers. As shown in Figure 6, sampling with replacement significantly improves performance on longer answers (≥4 tokens) while maintaining comparable performance on shorter answers relative to the current method.
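A compact sketch of the with-replacement variant is given below. It reuses the length-bucketing idea from §2.2.2 and accepts any length prior, so the bucket contents and the geometric pmf passed in the toy call are illustrative.

```python
import random

def resample_with_replacement(by_len, n_samples, geo_pmf, rng=random.Random(0)):
    """Draw target lengths from the prior, then examples of that length, with no cap on repeats,
    so scarce long-answer buckets are upsampled instead of being exhausted."""
    lengths = sorted(by_len)
    weights = [geo_pmf(l) for l in lengths]
    drawn_lengths = rng.choices(lengths, weights=weights, k=n_samples)
    return [rng.choice(by_len[l]) for l in drawn_lengths]

by_len = {1: [{"answer": "1952"}], 4: [{"answer": "the capital of Finland"}]}
print(len(resample_with_replacement(by_len, 5, lambda l: 0.6 ** (l - 1) * 0.4)))
```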
Performance comparison when sampling data with or without replacement by using our geometric sampling strategy.
Few-shot Prompting Is Superior to Direct Fine-tuning and Benefits from More Supervised Data.
In Figure 7, we show that directly fine-tuning on the 5-shot examples is beneficial (28.6% → 33.5%) but remains inferior to our few-shot method. When increasing the size of supervised data, both methods achieve consistent improvements, although the performance gap narrows. With full-sized training data, FsModQA surpasses CLASS (43.0% vs. 41.6%), achieving new state-of-the-art results. See Appendix Table 21 for results in each language.
Results when trained with varying sizes of supervised data. The average together with the best and worst languages are reported.
3.8 Zero-shot Prompting Strategies
We compare our few-shot prompting strategy with the two zero-shot cross-lingual prompting methods from §2.2.1. For English-Prompting, we consider the NQ training data and the TyDi QA English training data as two alternative prompting sources. For Multilingual-Prompting, we use 5-shot examples from all languages in Xor-TyDi QA (i.e., those used in our few-shot setting) for prompting. When generating synthetic data for each target language, we exclude its 5-shot examples from the prompting source. We compare the success rates of generating valid examples under the different prompting strategies in Appendix Table 20; few-shot prompting achieves the highest rate and English-Prompting with NQ the lowest.
Zero-shot prompting is comparable to few-shot prompting.
Table 7 shows that all three zero-shot prompting variants achieve consistent improvements over FsModQA-En, with gains of up to 8.1%, highlighting the versatility of our method in zero-shot language adaptation. Prompting with an English dataset created under the same annotation guidelines achieves better results (TyDi-En vs. NQ-En), and prompting with multilingual examples (i.e., Xor-TyDi-*) is comparable to FsModQA. Specifically, the diversity and QA styles in the prompts matter more for fi and te, while for the other languages, employing in-language prompts usually leads to the best performance.
XOR-Full performance comparison when using zero-shot prompting strategies for synthetic data generation, at 100K scale. FsModQA-En indicates the model pre-trained on MlWikiQA and fine-tuned on the English NQ dataset.
Method | Ar | Bn | Fi | Ja | Ko | Ru | Te | Avg.
---|---|---|---|---|---|---|---|---
FsModQA-En | 30.7 | 30.2 | 31.0 | 24.3 | 26.2 | 29.6 | 28.5 | 28.6 |
FsModQA | 41.7 | 34.7 | 38.7 | 39.4 | 34.7 | 35.0 | 33.5 | 36.8 |
NQ-En | 38.8 | 33.9 | 40.1 | 33.0 | 33.0 | 34.9 | 34.3 | 35.4 |
TyDi-En | 39.1 | 34.4 | 41.2 | 35.6 | 31.9 | 34.6 | 36.0 | 36.1 |
Xor-TyDi-* | 42.5 | 33.9 | 40.3 | 37.9 | 33.7 | 34.6 | 34.3 | 36.7 |
English-prompting is the best way of using English data and is complementary to existing methods.
We compare three different ways of using TyDi QA English data for zero-shot learning: direct English fine-tuning, fine-tuning on machine-translated data from English, and English-Prompting. Table 8 shows that all three methods help, with our English-Prompting approach yielding the best results in all languages. Additionally, combining the data from all three methods yields improvements over any of them used independently and matches the performance of our few-shot setting.
Result comparison on XOR-Full for different means of using TyDi-En data.
Method | Ar | Bn | Fi | Ja | Ko | Ru | Te | Avg.
---|---|---|---|---|---|---|---|---
FsModQA-En | 30.7 | 30.2 | 31.0 | 24.3 | 26.2 | 29.6 | 28.5 | 28.6 |
+ Fine-tuning | 36.8 | 30.7 | 35.5 | 29.1 | 28.6 | 30.4 | 29.4 | 31.5 |
+ Translate-train | 31.5 | 31.2 | 29.9 | 26.5 | 28.8 | 27.7 | 31.5 | 29.6 |
+ English-prompt | 39.1 | 34.4 | 41.2 | 35.6 | 31.9 | 34.6 | 36.0 | 36.1 |
+ All | 41.9 | 36.2 | 43.0 | 37.3 | 33.7 | 37.0 | 37.4 | 38.1 |
4 Zero-shot Language Adaptation
In §2.2.1, we propose a zero-shot prompting strategy that uses few-shot examples from other languages to generate synthetic data for a distinct target language. The effectiveness of this approach is demonstrated in §3.8. In this section, we evaluate the impact of this strategy in adapting FsModQA to a diverse range of previously unseen languages, using only English labeled data.
4.1 Experimental Setup
Languages
We select ten languages unseen by FsModQA from the MIRACL dataset for monolingual retrieval adaptation. We choose ten unseen languages from the MKQA dataset with high, medium, and low resources for multilingual open-domain QA adaptation.
Data Generation
We consider the English NQ training data as the source for prompts. For each target language, we randomly sample five-shot examples from the NQ dataset to prompt the generation of Q&A pairs from selected Wikipedia passages, following the procedure described in §2.2. This approach yields 128,000 training instances for each target language. Additionally, we compare this method to the translate-train baseline (MT), which uses Google Translate to translate the NQ training data into the target languages.
Model Training
For both methods, we fine-tune FsModQA for 3K steps following the same procedure used in FsMlQA (§3.3). The final checkpoint obtained at the last training step is used for evaluation. Note that separate models are created per language in this experiment.
4.2 Results
Monolingual Retrieval Adaptation.
As shown in the upper part of Table 9, the zero-shot adaptation significantly improves FsModQA’s monolingual retrieval results by an average of 5.4% across ten unseen languages. These improvements are particularly pronounced in low-resource languages (i.e., th, yo, sw), whereas the MT baseline results in notable declines both in these languages (e.g., −36.3% in yo) and overall (−3.3%). Note that MIRACL was created by native speakers from texts in the target languages, which aligns with our data generation process. This explains the consistent gains achieved by our method and shows its superiority to translation-based approaches.
Zero-shot adaptation to unseen languages in monolingual retrieval (nDCG@10) and multilingual open-domain QA (F1).
Columns run from high- to low-resource languages (left to right).
MIRACL (nDCG@10) | De | Es | Fr | Zh | Fa | Hi | Id | Sw | Th | Yo | Avg.
---|---|---|---|---|---|---|---|---|---|---|---
FsModQA | 36.7 | 40.5 | 36.5 | 43.2 | 43.7 | 42.6 | 43.2 | 50.2 | 60.4 | 60.2 | 45.7 |
+ MT | 41.3 | 41.8 | 37.1 | 41.7 | 40.7 | 42.4 | 44.1 | 50.7 | 60.5 | 23.9 | 42.4 |
+ Adapt | 38.8 | 41.6 | 38.6 | 47.0 | 47.7 | 45.9 | 44.2 | 62.3 | 66.6 | 78.3 | 51.1 |
MKQA (F1) | De | Es | Fr | Zh | He | Pl | Tr | Vi | Km | Th | Avg.
---|---|---|---|---|---|---|---|---|---|---|---
FsModQA | 33.3 | 38.5 | 34.8 | 8.5 | 19.5 | 18.4 | 24.9 | 9.9 | 7.5 | 15.1 | 21.0 |
+ MT | 42.6 | 41.6 | 41.2 | 12.8 | 32.1 | 29.5 | 39.9 | 39.5 | 13.8 | 22.1 | 31.5 |
+ Adapt | 42.1 | 42.1 | 43.0 | 12.0 | 27.4 | 39.7 | 40.4 | 40.4 | 13.3 | 22.7 | 32.3 |
Multilingual Open-domain QA Adaptation.
As shown in the bottom of Table 9, the adaptation effectively enhances multilingual open-domain QA performance across the unseen languages, achieving an average improvement of 11.3%. MT-based approaches yield results comparable to our adaptation, which is expected since MKQA was translated from NQ and the machine-translated data shares the same (American-centric) topic distribution. In contrast, our method generates data from Wikipedia texts written in the target languages to simulate how native speakers ask questions, which is closer to real-world scenarios.
5 Data Analysis
5.1 Quality Validation
To assess the overall quality of our synthetic data, we randomly sample 1,000 examples from the silver pre-training data (MlWikiQA) and the few-shot synthetic data (FsMlQA). These samples are evaluated using GPT-4o mini on two criteria: 1) Fluency (0–2): whether the query is understandable, readable, and free of spelling or grammatical mistakes; 2) Relevance (0–2): the alignment between the generated query-answer pair and the passage used for data generation. The prompts employed for quality assessment are included in Appendix Table 18.
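The evaluation loop can be sketched as below. The rubric wording and the judge call are placeholders: the real prompt is in Appendix Table 18, and any chat API (e.g., GPT-4o mini) could back `call_judge`; the canned reply only keeps the sketch runnable.

```python
import re

RUBRIC = ("Rate the query-answer pair.\n"
          "Fluency (0-2): is the query readable and free of grammatical mistakes?\n"
          "Relevance (0-2): is the pair grounded in the passage?\n"
          "Reply as: Fluency=<n> Relevance=<n>")

def call_judge(prompt: str) -> str:
    """Placeholder for an API call to the judge model."""
    return "Fluency=2 Relevance=1"     # canned reply so the sketch runs end to end

def judge(question, answer, passage):
    prompt = f"{RUBRIC}\n\nPassage: {passage}\nQuery: {question}\nAnswer: {answer}"
    reply = call_judge(prompt)
    scores = dict(re.findall(r"(Fluency|Relevance)=(\d)", reply))
    return {k: int(v) for k, v in scores.items()}

print(judge("Mikä on Suomen pääkaupunki?", "Helsinki", "Helsinki on Suomen pääkaupunki."))
```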
Figure 8 illustrates that both types of our generated queries exhibit fluency and strong grounding in the corresponding positive passages. The silver-standard MlWikiQA, derived using heuristics from WikiData (§2.1), consistently achieves higher scores across both metrics in all languages compared to the unfiltered synthetic FsMlQA (w/o Filter columns). However, the quality of FsMlQA improves significantly after applying our tailored filtering mechanism (w/ Filter columns), almost matching the quality and fluency scores for MlWikiQA. This finding underscores the critical role of the filtering procedure in producing a synthetic dataset of quality comparable to the silver-standard dataset.
Quality validation results on the synthetic FsMlQA (with and without filter), comparing against the silver-standard pre-training data MlWikiQA. We employ Model-as-Judge to evaluate the quality of generated data on a three-level rating scale (0–2) based on two factors: fluency and relevance.
5.2 Query Distribution Comparison
To examine the distributional differences between our synthetic FsMlQA and the gold-standard data in Xor-TyDi QA, we randomly sample up to 20,000 examples from both datasets and visualize their distributions using t-SNE (van der Maaten and Hinton, 2008), which projects the queries onto a two-dimensional space. Figure 9 highlights several key findings: 1) The synthetic queries exhibit sufficient diversity, as they are scattered across the plot, indicating that our approach is capable of generating queries of various types using only five labeled examples. 2) The synthetic data shows significant overlap with the gold-standard data, demonstrating that it retains the core characteristics of the gold distribution. 3) The gold-standard data exhibits greater diversity than the synthetic data, suggesting that there is still room for improvement in enhancing diversity and variation during the data generation process, which we leave for future work. Similar findings are observed in the other languages (see Appendix Figure 10).
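The visualization itself is a standard embed-then-project pipeline. In the sketch below the query encoder is a random stand-in, and any multilingual sentence encoder could be substituted before running t-SNE.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def embed(queries, dim=64, seed=0):
    """Stand-in for a real multilingual query encoder."""
    return np.random.default_rng(seed).normal(size=(len(queries), dim))

synthetic = [f"synthetic query {i}" for i in range(200)]
gold = [f"gold query {i}" for i in range(200)]

X = np.vstack([embed(synthetic, seed=0), embed(gold, seed=1)])
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

plt.scatter(X_2d[:len(synthetic), 0], X_2d[:len(synthetic), 1], s=5, label="FsMlQA (synthetic)")
plt.scatter(X_2d[len(synthetic):, 0], X_2d[len(synthetic):, 1], s=5, label="Xor-TyDi QA (gold)")
plt.legend()
plt.savefig("query_tsne.png")
```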
Distribution comparison between FsMlQA and Xor-TyDi QA in Japanese. We show that the synthetic data is diverse and significantly overlaps with the gold standard.
5.3 Safety
We employ Llama-Guard-2 as the content safety classifier to assess the presence of unsafe content within our synthetic dataset. Our analysis reveals that 98.9% of the 1,746,156 queries in FsMlQA are classified as safe.
6 Related Work
Pre-training for Open-domain QA.
Open-domain QA requires retrieving relevant passages and extracting answers from them. This necessity has driven various methods that jointly train retrievers and readers. REALM (Guu et al., 2020), RAG (Lewis et al., 2020), EMDR2 (Sachan et al., 2021), YONO (Lee et al., 2022), ReAtt (Jiang et al., 2022), and Atlas (Izacard et al., 2024) first pre-train retrievers or initialize from pre-trained (Izacard et al., 2022) and fine-tuned retrievers. Subsequently, both components are fine-tuned jointly: the reader is trained using an answer generation loss, and the retriever is trained to promote passages that increase the likelihood of generating correct answers. Recently, this joint training mechanism has been adapted for multilingual open-domain QA (Jiang et al., 2024), where retrievers are initially trained by learning from English teachers using multilingual parallel data, followed by a joint training stage with query-answer pairs generated by LLMs. Our approach follows this joint training paradigm for model pre-training but differs significantly. We use WikiData as a source to generate more informative natural questions and answers. Additionally, our pre-training method is more efficient by eliminating knowledge distillation from English models.
LLMs for Few-shot Data Generation.
Prompting LLMs to generate synthetic data has been widely adopted to improve the performance of retrieval and QA tasks. UPR (Sachan et al., 2022) and InPars (Bonifacio et al., 2022) use zero-shot or few-shot prompting for passage reranking. Promptagator (Dai et al., 2023) and SWIM-X (Thakur et al., 2024) prompt LLMs with few-shot examples to generate massive synthetic queries, either in English or in multiple languages, for retriever fine-tuning. Gecko (Lee et al., 2024) prompts LLMs to generate synthetic instructions and queries from Web documents and create high-quality labels for retriever fine-tuning. Beyond retrieval, LLMs are employed to generate QA data, where QAmeleon (Agrawal et al., 2023) prompts a 540B LLM to generate multilingual QA pairs from only five examples. Nevertheless, these methods primarily focus on retrieval tasks and the more narrowly defined machine reading comprehension tasks. In our work, we rigorously investigate how LLMs can improve the more challenging multilingual open-domain QA tasks under few-shot settings. In addition, we explore zero-shot prompting, demonstrating that cross-lingual prompting using English data or limited multilingual data from held-out languages can yield results comparable to few-shot prompting, and we show this technique can also be leveraged for effective zero-shot language adaptation.
7 Conclusion and Limitation
In this work, we propose FsModQA, a few-shot learning approach for multilingual open-domain retrieval tasks. We present a novel self-supervised pre-training framework that exploits WikiData to effectively initialize both multilingual retrieval and QA capabilities. This process is followed by few-shot synthetic multilingual QA generation from LLMs using only five human-annotated examples. We demonstrate that the resulting model achieves competitive multilingual retrieval and QA performance through fine-tuning on the high-quality synthetic data. We further show that this few-shot approach generalizes to zero-shot settings that only require English-supervised data. This mechanism serves as an effective approach for language adaptation, enabling the adapted model to achieve both boosted retrieval and end-to-end QA performance across fifteen previously unseen languages.
This work uses LLMs for synthetic data generation, which may propagate undesirable biases to the generated data. We believe such biases will not be amplified, as we sample prompts from Xor-TyDi QA, a dataset annotated with strict guidelines. Our preliminary safety analysis also reveals that less than 1% of the data contains potentially harmful queries, as identified by Llama-Guard-2.
Acknowledgments
We thank the action editor Shay Cohen and anonymous reviewers for their helpful feedback and suggestions. The first author is supported by the Graduate Research Scholarships funded by the University of Melbourne. This work was funded by the Australian Research Council, Discovery grant DP230102775.
Notes
We use the term few-shot throughout this paper to denote that our method relies on only a small number of human-annotated examples. Thus, we classify our method as a few-shot learning approach, consistent with Dai et al. (2023).
Code, data, and checkpoints are available here.
We classify MlWikiQA as a silver-standard dataset rather than a synthetic one, as it is derived from the structured information in WikiData and Wikipedia.
Empirically, we set p = 0.4 (μ = 2.5) for all languages except Japanese, where we set p = 0.1 (μ = 10) to favor longer answers. When computing the distribution, we truncate the answer length to 30.
We do not cap the number of repeats.
References
List of English properties used for generating MlWikiQA. Note that we do not generate data for a property if it does not exist in the Wikidata of target languages.
Property ID | Description | Property ID | Description
---|---|---|---
P264 | record label | P175 | performer |
P176 | manufacturer | P112 | founded by |
P127 | owned by | P840 | narrative location |
P495 | country of origin | P20 | place of death |
P407 | language of work or name | P582 | end time |
P69 | educated at | P159 | headquarters location |
P740 | location of formation | P17 | country |
P136 | genre | P800 | notable work |
P36 | capital | P570 | date of death |
P190 | twinned administrative body | P4552 | mountain range |
P915 | filming location | P3086 | speed limit |
P84 | architect | P2046 | area |
P569 | date of birth | P86 | composer |
P515 | phase of matter | P2048 | height |
P40 | child | P580 | start time |
P828 | has cause | P50 | author |
P2067 | mass | P108 | employer
P170 | creator | P2049 | width |
P364 | original language of film or TV show | P277 | programmed in |
P276 | location | P413 | position played on team / speciality |
P131 | located in the administrative territorial entity | P26 | spouse |
P106 | occupation | P607 | conflict |
P942 | theme music | P571 | inception |
P6 | head of government | P19 | place of birth |
P1830 | owner of | P61 | discoverer or inventor |
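The appendix does not spell out how property availability in a target language is checked. One plausible proxy, sketched below under our own assumptions, is to ask the public Wikidata SPARQL endpoint whether a property has a label in that language and to skip the property otherwise.

```python
# Sketch of one possible availability check (our assumption, not the paper's code):
# keep a property for a target language only if it has a label in that language
# on Wikidata, queried through the public SPARQL endpoint.
import requests

WDQS = "https://query.wikidata.org/sparql"

def property_available(pid: str, lang: str) -> bool:
    query = f'ASK {{ wd:{pid} rdfs:label ?label . FILTER(LANG(?label) = "{lang}") }}'
    resp = requests.get(
        WDQS,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "property-availability-check/0.1 (research use)"},
    )
    resp.raise_for_status()
    return resp.json()["boolean"]

# Example: keep only properties with a Telugu label.
properties = ["P264", "P176", "P3086"]
print([pid for pid in properties if property_available(pid, "te")])
```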
Examples of using ChatGPT to generate questions from triples. We use the same prompt as for Yes questions to generate No questions by sampling perturbed triples. Highlighted text indicates system outputs.

An example of prompting Gemma-7B to generate questions with ICL examples from ChatGPT. Highlighted text indicates system outputs.

Complete prompt for few-shot cross-lingual question-answer generation from English passages.

Distribution comparison between FsMlQA and Xor-TyDi QA in the remaining languages. The figure shows that diverse synthetic data can be expanded from only five examples while retaining the core characteristics of the gold distribution.
Dataset statistics of our pre-training data MlWikiQA and few-shot synthetic data FsMlQA in each language.
| Language | MlWikiQA: # Q-A Pairs | MlWikiQA: Question Length | MlWikiQA: Answer Length | FsMlQA: # Q-A Pairs | FsMlQA: Question Length | FsMlQA: Answer Length |
|---|---|---|---|---|---|---|
| Arabic | 1,803,765 | 7.00±2.13 | 1.65±0.84 | 80,575 | 8.20±2.86 | 1.57±1.30 |
| Bengali | 407,496 | 6.13±1.80 | 1.65±0.85 | 127,562 | 8.97±2.99 | 1.63±1.44 |
| English | 7,963,985 | 7.95±2.54 | 1.78±1.01 | – | – | – |
| Finnish | 2,135,790 | 6.02±1.75 | 1.32±0.64 | 270,627 | 5.83±2.09 | 1.38±0.90 |
| Japanese | 2,735,635 | 14.74±3.57 | 3.57±1.73 | 143,265 | 10.19±2.18 | 3.96±4.69 |
| Korean | 1,018,348 | 5.46±1.78 | 1.55±0.80 | 192,002 | 5.72±2.29 | 1.42±0.92 |
| Russian | 2,561,925 | 6.94±2.17 | 1.70±1.10 | 792,914 | 7.34±2.64 | 1.44±1.03 |
| Telugu | 108,215 | 5.60±1.84 | 1.50±0.74 | 139,211 | 6.48±2.48 | 1.49±1.17 |
Success Rate of synthetic data generation across seven languages with different prompting strategies. Success Rate is the number of valid examples remaining after data filtering divided by the total number of prompted documents (i.e., # Documents).
| Prompting Strategy | Ar | Bn | Fi | Ja | Ko | Ru | Te | Avg. |
|---|---|---|---|---|---|---|---|---|
| FsModQA | 5.9% | 13.9% | 16.9% | 7.7% | 21.4% | 15.4% | 13.2% | 13.4% |
| NQ-En | 3.4% | 8.8% | 8.2% | 1.2% | 8.3% | 5.9% | 4.2% | 5.3% |
| TyDi-En | 5.0% | 13.6% | 12.8% | 1.9% | 17.0% | 6.3% | 5.9% | 7.0% |
| Xor-TyDi-* | 10.2% | 14.6% | 14.2% | 2.5% | 22.7% | 9.1% | 10.7% | 9.8% |
Detailed results for each language when trained with varying amounts of supervised data.
| Method | Ar (F1) | Bn (F1) | Fi (F1) | Ja (F1) | Ko (F1) | Ru (F1) | Te (F1) | Macro F1 | Macro EM | Macro BLEU |
|---|---|---|---|---|---|---|---|---|---|---|
| *5-shot* |  |  |  |  |  |  |  |  |  |  |
| FsModQA-En | 35.6 | 32.7 | 35.5 | 35.1 | 30.2 | 33.6 | 31.8 | 33.5 | 23.8 | 23.0 |
| FsModQA | 41.3 | 35.4 | 39.6 | 41.5 | 35.0 | 38.2 | 36.3 | 38.2 | 27.9 | 24.4 |
| *16-shot* |  |  |  |  |  |  |  |  |  |  |
| FsModQA-En | 38.3 | 31.0 | 39.4 | 38.3 | 35.2 | 34.9 | 34.6 | 35.9 | 26.1 | 24.1 |
| FsModQA | 42.0 | 35.6 | 41.4 | 41.7 | 35.3 | 39.2 | 40.0 | 39.3 | 29.3 | 26.6 |
| *32-shot* |  |  |  |  |  |  |  |  |  |  |
| FsModQA-En | 42.4 | 31.2 | 40.8 | 38.1 | 33.0 | 37.9 | 34.9 | 36.9 | 26.3 | 25.5 |
| FsModQA | 43.6 | 35.6 | 42.2 | 42.5 | 34.1 | 38.6 | 37.0 | 39.1 | 28.8 | 26.6 |
| *128-shot* |  |  |  |  |  |  |  |  |  |  |
| FsModQA-En | 42.0 | 28.8 | 41.7 | 40.3 | 34.6 | 34.7 | 36.0 | 36.9 | 27.0 | 25.2 |
| FsModQA | 45.3 | 32.8 | 44.3 | 43.8 | 34.0 | 39.9 | 42.1 | 40.3 | 30.5 | 27.4 |
| *1024-shot* |  |  |  |  |  |  |  |  |  |  |
| FsModQA-En | 45.0 | 30.8 | 45.1 | 39.2 | 34.1 | 39.1 | 37.5 | 38.7 | 29.3 | 26.5 |
| FsModQA | 47.5 | 33.7 | 46.7 | 41.4 | 35.9 | 40.2 | 40.1 | 40.8 | 31.3 | 27.9 |
| *full* |  |  |  |  |  |  |  |  |  |  |
| FsModQA-En | 48.9 | 33.3 | 47.7 | 42.9 | 39.6 | 40.0 | 41.7 | 42.0 | 32.7 | 28.5 |
| FsModQA | 50.8 | 33.3 | 47.8 | 45.0 | 38.9 | 42.0 | 43.1 | 43.0 | 33.4 | 29.6 |