Meta-Learning a Cross-lingual Manifold for Semantic Parsing

Localizing a semantic parser to support new languages requires effective cross-lingual generalization. Recent work has found success with machine-translation or zero-shot methods, although these approaches can struggle to model how native speakers ask questions. We consider how to effectively leverage minimal annotated examples in new languages for few-shot cross-lingual semantic parsing. We introduce a first-order meta-learning algorithm to train a semantic parser with maximal sample efficiency during cross-lingual transfer. Our algorithm uses high-resource languages to train the parser and simultaneously optimizes for cross-lingual generalization to lower-resource languages. Results across six languages on ATIS demonstrate that our combination of generalization steps yields accurate semantic parsers sampling ≤10% of source training data in each new language. Our approach also trains a competitive model on Spider using English with generalization to Chinese similarly sampling ≤10% of training data.

Within cross-lingual semantic parsing, there has been an effort to bootstrap parsers with minimal data to avoid the cost and labor required to support new languages (our code and data are available at github.com/tomsherborne/xgr). Recent proposals include using machine translation to approximate training data for supervised learning (Moradshahi et al., 2020; Sherborne et al., 2020; Nicosia et al., 2021) and zero-shot models, which engineer cross-lingual similarity with auxiliary losses (van der Goot et al., 2021; Yang et al., 2021; Sherborne and Lapata, 2022). These shortcuts bypass costly data annotation but present limitations such as "translationese" artifacts from machine translation (Koppel and Ordan, 2011) or undesirable domain shift (Sherborne and Lapata, 2022). However, annotating a minimally sized data sample can potentially overcome these limitations while incurring significantly reduced costs compared to full dataset translation (Garrette and Baldridge, 2013).
We argue that a few-shot approach is more realistic for an engineer motivated to support additional languages for a database, as one can rapidly retrieve a high-quality sample of translations and combine these with existing supported languages (i.e., English). Beyond semantic parsing, cross-lingual few-shot approaches have also succeeded at leveraging a small number of annotations within a variety of tasks (Zhao et al., 2021, inter alia), including natural language inference, paraphrase identification, part-of-speech tagging, and named-entity recognition. Recently, the application of meta-learning to domain generalization has further demonstrated the capability for models to adapt to new domains with small samples (Gu et al., 2018; Li et al., 2018; Wang et al., 2020b).
In this work, we synthesize these directions into a meta-learning algorithm for cross-lingual semantic parsing. Our approach explicitly optimizes cross-lingual generalization using fewer training samples per new language without performance degradation. We also require minimal computational overhead beyond standard gradient-descent training and no external dependencies beyond in-language data.

Our approach builds on Model-Agnostic Meta-Learning (Finn et al., 2017, MAML), wherein the goal is to train a model on a variety of learning tasks such that it can solve new tasks using a small number of training samples. In effect, MAML facilitates task-specific fine-tuning using few samples in a two-stage process. MAML requires computing higher-order gradients (i.e., "gradient through a gradient"), which can often be prohibitively expensive for complex models. This limitation has motivated first-order approaches to MAML, which offer similar performance with improved computational efficiency.
In this vein, the Reptile algorithm (Nichol et al., 2018) transforms the higher-order gradient approach into K successive first-order steps. Reptile-based training learns an optimal manifold across tasks (i.e., a high-density parameter subregion biased for strong cross-task likelihood), followed by rapid fine-tuning. By learning an optimal initialization, meta-learning proves useful for low-resource adaptation by minimizing the data required for out-of-domain tuning on new tasks. Kedia et al. (2021) also demonstrate the utility of Reptile to improve single-task performance. We build on this to examine single-task cross-lingual transfer using the optimal manifold learned with Reptile.
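The first-order Reptile update described above can be sketched on a toy objective (a minimal illustration, not the paper's implementation; the quadratic tasks, step counts, and learning rates below are invented for demonstration):

```python
import numpy as np

def sgd_steps(phi, grad_fn, k, alpha):
    """Run k plain first-order SGD steps from phi (no second derivatives)."""
    for _ in range(k):
        phi = phi - alpha * grad_fn(phi)
    return phi

def reptile_step(theta, grad_fn, k=5, alpha=0.1, beta=0.5):
    """One Reptile meta-update: move theta toward the k-step adapted weights.
    The difference (theta - phi_k) acts as a 'macro-gradient' over k steps."""
    phi_k = sgd_steps(theta.copy(), grad_fn, k, alpha)
    return theta - beta * (theta - phi_k)

# Toy task distribution: quadratic bowls with optima near [1, -1].
rng = np.random.default_rng(0)
theta = np.zeros(2)
for _ in range(100):
    target = rng.normal(loc=[1.0, -1.0], scale=0.1)
    theta = reptile_step(theta, lambda p: 2.0 * (p - target))
# theta now sits near the shared optimum: a good initialization for new tasks.
```

Fine-tuning from this initialization on any one sampled task then needs only a few steps, which is the low-resource adaptation property the paper exploits.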
Meta-Learning for Semantic Parsing A variety of NLP applications have adopted meta-learning in zero- and few-shot learning scenarios as a method of explicitly training for generalization (Lee et al., 2021; Hedderich et al., 2021). Within semantic parsing, there has been increasing interest in cross-database generalization, motivated by datasets such as Spider (Yu et al., 2018) requiring navigation of unseen databases (Herzig and Berant, 2017; Suhr et al., 2020).
Approaches to generalization have included simulating source and target domains (Givoli and Reichart, 2019) and synthesizing new training data based on unseen databases (Zhong et al., 2020; Xu et al., 2020a). Meta-learning has demonstrated fast adaptation to new data within a monolingual low-resource setting (Huang et al., 2018; Guo et al., 2019; Lee et al., 2019; Sun et al., 2020). Similarly, Chen et al. (2020) utilize Reptile to improve the generalization of a model, trained on source domains, when fine-tuning on new domains. Our work builds on Wang et al. (2021a), who explicitly promote monolingual cross-domain generalization by "meta-generalizing" across disjoint domain-specific batches during training.
Cross-lingual Semantic Parsing A surge of interest in cross-lingual NLU has seen the creation of many benchmarks across a breadth of languages (Conneau et al., 2018; Hu et al., 2020; Liang et al., 2020), thereby motivating significant exploration of cross-lingual transfer (Nooralahzadeh et al., 2020; Xia et al., 2021; Xu et al., 2021; Zhao et al., 2021, inter alia). Previous approaches to cross-lingual semantic parsing assume parallel multilingual training data (Jie and Lu, 2014) and exploit multi-language inputs for training without resource constraints (Susanto and Lu, 2017a,b).
There has been recent interest in evaluating whether machine translation is an economic proxy for creating training data in new languages (Sherborne et al., 2020; Moradshahi et al., 2020). Zero-shot approaches to cross-lingual parsing have also been explored using auxiliary training objectives (Yang et al., 2021; Sherborne and Lapata, 2022). Cross-lingual learning has also been gaining traction in the adjacent field of spoken-language understanding (SLU). For datasets such as MultiATIS (Upadhyay et al., 2018), MultiATIS++ (Xu et al., 2020b), and MTOP (Li et al., 2021), zero-shot cross-lingual transfer has been studied through specialized decoding methods (Zhu et al., 2020), machine translation (Nicosia et al., 2021), and auxiliary objectives (van der Goot et al., 2021).
Cross-lingual semantic parsing has mostly remained orthogonal to the cross-database generalization challenges raised by datasets such as Spider (Yu et al., 2018). While we primarily present findings for multilingual ATIS into SQL (Hemphill et al., 1990), we also train a parser on both Spider and its Chinese version (Min et al., 2019). To the best of our knowledge, we are the first to explore a multilingual approach to this cross-database benchmark. We use Reptile to learn the overall task and leverage domain generalization techniques (Li et al., 2018; Wang et al., 2021a) for sample-efficient cross-lingual transfer.

Problem Definition
Semantic Parsing We wish to learn a parameterized parsing function, p_θ, which maps from a natural language utterance and a relational database context to an executable program expressed in a logical form (LF) language:

θ* = arg max_θ Σ_{(Q,P,D)} log p_θ(P | Q, D) (1)

As formalized in Equation (1), we learn parameters, θ, using paired data (Q, P, D), where P is the logical form equivalent of natural language question Q. In this work, our LFs are all executable SQL queries and therefore grounded in a database D. A single-domain dataset references only one database D for all (Q, P), whereas a multi-domain dataset demands reasoning about unseen databases to generalize to new queries. This is expressed as a 'zero-shot' problem if the databases at test time, D_test, were unseen during training. This challenge demands a parser capable of domain generalization beyond observed databases, in addition to the structured prediction challenge of semantic parsing.
Cross-Lingual Generalization Prototypical semantic parsing datasets express the question, Q, in English only. As discussed in Section 1, our parser should be capable of mapping from additional languages to well-formed, executable programs. However, prohibitive expense limits us from reproducing a monolingual model for each additional language, and previous work demonstrates accuracy improvement by training multilingual models (Jie and Lu, 2014). In addition to the challenges of structured prediction and domain generalization, we jointly consider cross-lingual generalization. Training primarily relies on existing English data (i.e., Q_EN samples), and we show that our meta-learning algorithm in Section 4 leverages a small sample of training data in new languages for accurate parsing. We express this sample, S_l, for some language, l, as:

S_l = {(Q_l, P, D)_i}_{i=1}^{N_l} (2)

where N_l is the sample size from l, assumed to be smaller than the original English dataset (i.e., N_l ≪ N_EN). Where available, we extend this paradigm to develop models for L different languages simultaneously in a multilingual setup by combining samples as:

S_L = ∪_{l=1}^{L} S_l (3)

We can express cross-lingual generalization as:

p_θ(P | Q_l, D) → p_θ(P | Q_EN, D) (4)

where p_θ(P | Q_EN, D) is the predicted distribution over all possible output SQL sequences conditioned on an English question, Q_EN, and a database D.
Our goal is for the prediction from a new language, Q_l, to converge towards this existing distribution using the same parameters θ, constrained to fewer samples in l than in English. We aim to maximize the accuracy of predicting programs on unseen test data from each non-English language l. The key challenge is learning a performant distribution over each new language with minimal available samples. This includes learning to incorporate each l into the parsing task and modeling the language-specific surface form of questions. Our setup is akin to few-shot learning; however, the number of examples needed for satisfactory performance is an empirical question. We are searching for both minimal sample sizes and maximal sampling efficiency. We discuss our sampling strategy in Section 5.2, with results at multiple sizes of S_L in Section 6.
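The data setup above can be sketched as follows (a minimal illustration; the toy triples, sizes, and the sample_target helper are hypothetical, not from the released code):

```python
import random

def sample_target(pairs, n_l, seed=0):
    """Draw a small target-language sample S_l, with n_l << N_EN."""
    return random.Random(seed).sample(pairs, n_l)

# Hypothetical (question, SQL, database) triples per language.
s_en = [(f"en question {i}", f"SELECT {i}", "atis") for i in range(1000)]
s_de = [(f"de frage {i}", f"SELECT {i}", "atis") for i in range(1000)]
s_zh = [(f"zh wenti {i}", f"SELECT {i}", "atis") for i in range(1000)]

# 1% samples per target language, combined into the multilingual set S_L.
samples = {"de": sample_target(s_de, 10), "zh": sample_target(s_zh, 10)}
s_L = [example for s_l in samples.values() for example in s_l]
assert len(s_L) < len(s_en)  # N_l << N_EN for every target language
```

Training then draws support batches from the large English set and target batches from the much smaller combined S_L.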

Methodology
We combine two meta-learning techniques for cross-lingual semantic parsing.
The first is the Reptile algorithm outlined in Section 2. Reptile optimizes for dense likelihood regions within the parameters (i.e., an optimal manifold) by promoting inter-batch generalization (Nichol et al., 2018). Standard Reptile iteratively optimizes the manifold for an improved initialization across objectives; rapid fine-tuning then yields the final task-specific model. The second technique is the first-order approximation of DG-MAML (Li et al., 2018; Wang et al., 2021a). This single-stage process optimizes for domain generalization by simulating "source" and "target" batches from different domains to explicitly optimize for cross-batch generalization. Our algorithm, XG-REPTILE, combines these paradigms to optimize a target loss with the overall learning "direction" derived as the optimal manifold learned via Reptile. This trains an accurate parser demonstrating sample-efficient cross-lingual transfer within an efficient single-stage learning process.

The XG-REPTILE Algorithm
Each learning episode of XG-REPTILE comprises two component steps, intra-task learning and inter-language generalization, to jointly learn parsing and cross-lingual transfer.

Figure 1: One XG-REPTILE episode: (1) run K iterations of gradient descent over K support batches to learn φ_K; (2) compute ∇_macro, the difference between φ_K and φ_1; (3) find the loss on the target batch using φ_K; and (4) compute the final gradient update from ∇_macro and the target loss.

Alternating these processes trains a competitive parser from multiple languages with low computational overhead beyond existing gradient-descent training. Our approach combines the typical two stages of meta-learning to produce a single model without a fine-tuning requirement.

Task Learning Step

We first sample, from the high-resource language (i.e., S_EN), K "support" batches of examples, B_S = {(Q_EN, P, D)}. For each of the K batches, we compute predictions, compute losses, calculate gradients, and adjust parameters using some optimizer (see illustration in Figure 1). After K successive optimization steps, the initial weights in this episode, φ_1, have been optimized to φ_K. The difference between final and initial weights is calculated as:

∇_macro = φ_1 − φ_K (5)

This "macro-gradient" step is equivalent to a Reptile step (Nichol et al., 2018), representing learning an optimal manifold as an approximation of the overall learning trajectory.

Cross-Lingual Step

The second step samples one "target" batch, B_T = (Q_l, P, D), from a sampled target language (i.e., S_l ⊂ S_L). We compute the cross-entropy loss and gradients from the prediction of the model at φ_K on B_T:

L_T = − log p_{φ_K}(P | Q_l, D) (6)

Algorithm 1 XG-REPTILE
Require: Support data S_EN; target data S_L
Require: Inner learning rate α; outer learning rate β
1: Initialise θ_1, the vector of initial parameters
2: for t ← 1 to T do
3:   Copy φ_1 ← θ_t
4:   Sample K support batches {B_S} from S_EN
5:   Sample target language l from L languages
6:   Sample target batch B_T from S_l
7:   for k ← 1 to K do
8:     Compute loss L_S on the k-th support batch at φ_k
9:     Update φ_{k+1} ← φ_k − α ∇L_S
10:  end for
11:  Macro grad: ∇_macro ← φ_1 − φ_K
12:  Target loss: L_T on B_T at φ_K
13:  Total gradient: ∇_total ← ∇_macro + ∇L_T
14:  Update θ_{t+1} ← θ_t − β ∇_total
15: end for

We evaluate the parser at φ_K on a target language we desire to generalize to. We show below that the gradient of L_T comprises the loss at φ_K and additional terms maximizing the inner product between the high-likelihood manifold and the target loss. The total gradient encourages intra-task and cross-lingual learning (see Figure 1).
Algorithm 1 outlines the XG-REPTILE process (loss calculation and batch processing are simplified for brevity). We repeat this process over T episodes to train model p_θ to convergence. If we optimized for target data to align with individual support batches (i.e., K = 1), then we may observe batch-level noise in cross-lingual generalization. Our intuition is that aligning the target gradient with an approximation of the task manifold, i.e., ∇_macro, will overcome this noise and align new languages to a more mutually beneficial direction during training. We observe this intuitive behavior during learning in Section 6.
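A single episode can be sketched end-to-end on toy quadratic losses (an illustration only: the bowls, learning rates, and the α-scaling of the target gradient, which we add here so the two terms are of comparable magnitude in this toy, are choices of this sketch, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
support_centre = np.array([1.0, -1.0])  # toy "English" task optimum
target_centre = np.array([1.4, -0.6])   # toy "target language" optimum

def quad_grad(centre, phi):
    """Gradient of the quadratic bowl ||phi - centre||^2."""
    return 2.0 * (phi - centre)

def xg_reptile_episode(theta, k=10, alpha=0.1, beta=0.5):
    """One XG-REPTILE episode: (1) K inner steps on support batches,
    (2) macro-gradient, (3) target gradient at phi_K, (4) combined update."""
    phi = theta.copy()
    for _ in range(k):  # (1) K SGD steps over noisy "English" batches
        batch_centre = support_centre + rng.normal(scale=0.05, size=2)
        phi -= alpha * quad_grad(batch_centre, phi)
    grad_macro = theta - phi                     # (2) Reptile macro-gradient
    grad_target = quad_grad(target_centre, phi)  # (3) loss gradient on B_T
    # (4) alpha-scaled sum keeps both terms comparable in this toy setting.
    return theta - beta * (grad_macro + alpha * grad_target)

theta = np.zeros(2)
for _ in range(200):
    theta = xg_reptile_episode(theta)
```

Over repeated episodes, θ settles near the support optimum while being pulled toward the target-language optimum, mirroring how ∇_macro and the target loss are combined in Algorithm 1.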
We efficiently generalize to low-resource languages by exploiting the asymmetric data requirements between steps: one batch of the target language is required for every K batches of the source language. For example, if K = 10, then this 1/K proportionality requires 10% of target-language data relative to support. We demonstrate in Section 6 that we can use a quantity smaller than 1/K per target language to increase sample efficiency.
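The 1/K proportionality is simple arithmetic (the batch counts below are hypothetical):

```python
K = 10
support_batches = 400            # hypothetical English batches in one epoch
episodes = support_batches // K  # each episode consumes K support batches
target_batches = episodes        # ...and exactly one target batch
fraction = target_batches / support_batches
assert fraction == 1 / K         # 10% of target data relative to support
# With fewer target batches available, each is simply re-sampled
# ("over-sampled") across episodes, as explored in Section 6.
```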
Gradient Analysis Following Nichol et al. (2018), we express g_k = ∇L_S^(k)(φ_k), the gradient in a single step of the inner loop (Line 9 of Algorithm 1), via a Taylor series expansion around the initial point φ_1:

g_k = ḡ_k + H̄_k (φ_k − φ_1) + O(α²) (7)

where ḡ_k is the gradient at the original point φ_1, H̄_k is the Hessian matrix of the gradient at the initial point, (φ_k − φ_1) is the step difference between position φ_k and the initial position, and O(α²) collects terms with marginal influence. Rewriting the difference as a sum of gradient steps,

φ_k − φ_1 = −α Σ_{j=1}^{k−1} g_j (8)

g_j = ḡ_j + O(α) (9)

we arrive at an expression for g_k:

g_k = ḡ_k − α H̄_k Σ_{j=1}^{k−1} ḡ_j + O(α²) (10)

expressing the gradient as an initial component, ḡ_k, and the product of the Hessian at k with all prior gradient steps. We refer to Nichol et al. (2018) for further validation that the gradient of this product maximizes the cross-batch expectation, therefore promoting intra-task generalization and learning the optimal manifold. The final gradient is the accumulation over the g_k steps and is equivalent to Equation (5):

∇_macro = α Σ_{k=1}^{K} g_k = φ_1 − φ_K (11)

∇_macro comprises both the gradients of K steps and additional terms maximizing the inner product of inter-batch gradients. We can similarly express the gradient of the target batch as:

∇L_T = ḡ_T − H̄_T ∇_macro + O(α²) (12)

where the term H̄_T ∇_macro is the cross-lingual generalization product, similar to the intra-task generalization seen above.
Equation (13) shows an example final gradient when K = 2:

∇_total = α(ḡ_1 + ḡ_2) + ḡ_T − α(α H̄_2 ḡ_1 + H̄_T (ḡ_1 + ḡ_2)) (13)

Within the parentheses are the intra-task and cross-lingual gradient products, the components promoting fast learning across multiple axes of generalization.
The key hyperparameter in XG-REPTILE is the number of inner-loop steps, K, representing a trade-off between manifold approximation and target step frequency. At small K, the manifold approximation may be poor, leading to sub-optimal learning. At large K, the improved manifold approximation incurs fewer target-batch steps per epoch, leading to weakened cross-lingual transfer. In practice, K is set empirically, and Section 6 identifies an optimal region for our task.
XG-REPTILE can be viewed as generalizing two existing algorithms. Without the L_T loss, our approach is equivalent to Reptile and lacks cross-lingual alignment. If K = 1, then XG-REPTILE is equivalent to DG-FMAML (Wang et al., 2021a) but lacks intra-task generalization across support batches. Our unification of these algorithms represents the best of both approaches and outperforms both techniques within semantic parsing. Another perspective is that XG-REPTILE learns a regularized optimal manifold with immediate cross-lingual capability, as opposed to standard Reptile, which requires fine-tuning to transfer across tasks. We identify how this contrast in approaches influences cross-lingual transfer in Section 6.

Experimental Design
We evaluate XG-REPTILE against several comparison systems across multiple languages.Where possible, we re-implement existing models and use identical data splits to isolate the contribution of our training algorithm.

Data
We report results on two semantic parsing datasets. First, on ATIS (Hemphill et al., 1990), we use the multilingual version from Sherborne and Lapata (2022) pairing utterances in six languages (English, French, Portuguese, Spanish, German, Chinese) with SQL queries. ATIS is split into 4,473 training pairs with 493 and 448 examples for validation and testing, respectively. We report performance as execution accuracy to test if predicted SQL queries can retrieve accurate database results.
We also evaluate on Spider (Yu et al., 2018), combining the English and Chinese (Min et al., 2019, CSpider) versions as a cross-lingual task. The latter translates all questions into Chinese but retains the English database. Spider is significantly more challenging; it contains 10,181 questions and 5,693 unique SQL queries for 200 multi-table databases over 138 domains. We use the same split as Wang et al. (2021a) to measure generalization to unseen databases/table schemas during testing. This split uses 8,659 examples from 146 databases for training and 1,034 examples from 20 databases for validation. The test set contains 2,147 examples from 40 held-out databases and is held privately by the authors. To our knowledge, we report the first multilingual approach for Spider by training one model for English and Chinese. Our challenge is now multi-dimensional, requiring cross-lingual and cross-domain generalization. Following Yu et al. (2018), we report exact set match accuracy for evaluation.

Sampling for Generalization
Training for cross-lingual generalization often uses parallel samples across languages. We illustrate this in Equation (14), where y_1 is the equivalent output for inputs, x_1, in each language:

EN: (x_1, y_1)  DE: (x_1, y_1)  ZH: (x_1, y_1) (14)

However, high sample overlap risks trivializing the task because models are not learning from new pairs, but instead matching only new inputs to known outputs. A preferable evaluation will test composition of novel outputs from unseen inputs:

EN: (x_1, y_1)  DE: (x_2, y_2)  ZH: (x_2, y_2) (15)

Equation (15) samples exclusive, disjoint datasets for English and target languages during training. In other words, this process is subtractive: e.g., a 5% sample of German (or Chinese) target data leaves 95% of the data as the English support. This is similar to the K-fold cross-validation used to evaluate across many data splits. We sample data for our experiments with Equation (15). It is also possible to use Equation (16), where target samples are also disjoint from each other:

EN: (x_1, y_1)  DE: (x_2, y_2)  ZH: (x_3, y_3) (16)

but we find this setup results in too few English examples for effective learning.
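The subtractive split of Equation (15) can be sketched with index sets (a schematic only; the sizes follow the ATIS training set, and the helper name is our own):

```python
import random

def subtractive_split(n_total, target_frac, seed=0):
    """Disjoint sampling in the style of Equation (15): target languages
    share one index set, disjoint from the English support indices."""
    rng = random.Random(seed)
    indices = list(range(n_total))
    rng.shuffle(indices)
    n_target = int(n_total * target_frac)
    target_idx = indices[:n_target]   # same ids used for e.g. DE and ZH
    support_idx = indices[n_target:]  # remainder stays as English support
    return support_idx, target_idx

# A 5% target sample of ATIS leaves 95% as the English support.
support_idx, target_idx = subtractive_split(4473, 0.05)
assert not set(support_idx) & set(target_idx)  # disjoint, per Equation (15)
```

Repeating this with different seeds yields the multiple random data splits evaluated in Section 6.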

Semantic Parsing Models
We use a Transformer encoder-decoder model similar to Sherborne and Lapata (2022) for our ATIS experiments.We use the same mBART50 encoder (Tang et al., 2021) and train a Transformer decoder from scratch to generate SQL.
For Spider, we use the RAT-SQL model (Wang et al., 2020a), which has formed the basis of many performant submissions to the Spider leaderboard. RAT-SQL can successfully reason about unseen databases and table schemas using a novel schema-linking approach within the encoder. We use the version from Wang et al. (2021a) with mBERT (Devlin et al., 2019) input embeddings for a unified model between English and Chinese inputs. Notably, RAT-SQL can be over-reliant on lexical similarity features between input questions and tables (Wang et al., 2020a). This raises the challenge of generalizing to Chinese, where such overlap is null. For fair comparison, we implement identical models to prior work on each dataset and only evaluate the change in training algorithm. This is why we use an mBART50 encoder component for ATIS experiments and mBERT input embeddings for Spider experiments.

Comparison Systems
We compare our algorithm against several strong baselines and adjacent training methods, including:

Monolingual Training A monolingual Transformer is trained on gold-standard, professionally translated data for each new language. This is a monolingual upper bound without few-shot constraints.

Multilingual Training A multilingual Transformer is trained on the union of all data from the "Monolingual Training" method. This ideal upper bound uses all data in all languages without few-shot constraints.

Reptile-EN→FT-All Initial training uses Reptile (Nichol et al., 2018) on English support data, S_EN, followed by fine-tuning on target samples, S_L. This is a typical usage of Reptile for training a low-resource multi-domain parser (Chen et al., 2020).

Translate-Test Test utterances in each target language are machine-translated into English and parsed by the English-trained model.
We also compare to DG-FMAML (Wang et al., 2021a) as a special case of XG-REPTILE when K = 1. Additionally, we omit pairwise versions of XG-REPTILE (e.g., separate models generalizing from English to individual languages); these approaches demand more computation and demonstrated no significant improvement over a multi-language approach. All machine translation uses Google Translate (Wu et al., 2016).

Training Configuration
Experiments focus on the expansion from English to additional languages, where we use English as the "support" language and additional languages as "target". Key hyperparameters are outlined in Table 1. We train each model using the given optimizers with early stopping, where model selection is through minimal validation loss for combined support and target languages. Input utterances are tokenized using SentencePiece (Kudo and Richardson, 2018) and Stanza (Qi et al., 2020) for ATIS and Spider, respectively. All experiments are implemented in PyTorch on a single V100 GPU. We report key results for ATIS averaged over three seeds and five random data splits. For Spider, we submit the best singular model from five random splits to the leaderboard.

Results and Analysis
We contrast XG-REPTILE with baselines for ATIS in Table 2. XG-REPTILE@10% improves on the multilingual upper-bound model by an average +1.3%. Considering that this upper bound uses 10× the data of XG-REPTILE@10%, the accuracy gain highlights the benefit of explicit cross-lingual generalization. This is consistent at higher sample sizes (see Figure 2(c) for German).
At the smallest sample size, XG-REPTILE@1% demonstrates a +12.4% and +13.2% improvement relative to Translate-Train and Translate-Test. Machine translation is often viable for cross-lingual transfer (Conneau et al., 2018). However, we find that mistranslation of named entities incurs an exaggerated parsing penalty, leading to inaccurate logical forms (Sherborne et al., 2020). This suggests that sample quality has an exaggerated influence on semantic parsing performance. When training XG-REPTILE with MT data, we also observe a lower target-language average of 66.9%. This contrast further supports the importance of sample quality in our context.
XG-REPTILE improves cross-lingual generalization across all languages at equivalent and lower sample sizes. At 1%, it improves by an average +15.7% over the closest model, Reptile-EN→FT-All. Similarly, at 5%, we find a +9.8% gain, and at 10%, +8.9% relative to the closest competitor. Contrasting across sample sizes, our best approach is @10%; however, this is only +3.5% above @1%, suggesting that smaller samples could be sufficient if 10% sampling is unattainable. This relative stability is an improvement compared to the 17.7%, 11.2%, or 10.3% differences between @1% and @10% for other models. This implies that XG-REPTILE better utilizes smaller samples than adjacent methods.
Across languages at 1%, XG-REPTILE improves primarily for languages dissimilar to English (Ahmad et al., 2019), better minimizing the cross-lingual transfer gap. For Chinese (ZH), we see that XG-REPTILE@1% is +26.4% above the closest baseline. This contrasts with the smallest gain, +8.5% for German, which has greater similarity to English. Our improvement also yields less variability across target languages: the standard deviation across languages for XG-REPTILE@1% is 1.1, compared to 2.8 for Train-EN∪All or 7.4 for Reptile-EN→FT-All.
We can also compare to ZX-PARSE, the method of Sherborne and Lapata (2022), which engineers cross-lingual latent alignment for zero-shot semantic parsing without data in target languages. With 45 samples per target language, XG-REPTILE@1% improves by an average of +4.9%. XG-REPTILE is more beneficial for distant languages: the cross-lingual transfer penalty between English and Chinese is −12.3% for ZX-PARSE compared to −5.7% in our case. While these systems are not truly comparable, given different data requirements, this contrast is practically useful for comparison between zero- and few-shot localization.
Influence of K on Performance In Figure 2(a), we study how variation in the key hyperparameter K (the size of the inner loop in Algorithm 1, or the number of batches used to approximate the optimal task manifold) influences model performance across languages (single run at 5% sampling). When K = 1, the model learns generalization from batch-wise similarity, which is equivalent to DG-FMAML (Wang et al., 2021a). We empirically find that increasing K beyond one benefits performance by encouraging cross-lingual generalization with the task rather than a single batch; it is, therefore, beneficial to align an out-of-domain example with the overall direction of training. However, as theorized in Section 4, increasing K also decreases the frequency of the outer step within an epoch, leading to poor cross-lingual transfer at high K. This trade-off yields an optimal operating regime for this hyperparameter. We use K = 10 in our experiments as the center of this region. Given this setting of K, the target sample size must be 10% of the support sample size for training in a single epoch. However, Table 2 identifies XG-REPTILE as the most capable algorithm for "over-sampling" smaller target samples for resource-constrained generalization.
Influence of Batch Size on Performance We consider two further case studies to analyze XG-REPTILE performance. For clarity, we focus on German; however, these trends are consistent across all target languages. Figure 2(b) examines if the effects of cross-lingual transfer within XG-REPTILE are sensitive to batch size during training (single run at 5% sampling). A dependence between K and batch size could imply that the desired inter-task and cross-lingual generalization outlined in Equation (13) is an unrealistic, edge-case phenomenon. This is not the case: a trend of optimal K setting is consistent across many batch sizes. This suggests that K is an independent hyperparameter requiring tuning alongside existing experimental settings.

Performance across Larger Sample Sizes We consider a wider range of target data sample sizes, between 1% and 50%, in Figure 2(c). We observe that baseline approaches converge to between 69.3% and 73.9% at a 50% target sample size. Surprisingly, the improvement of XG-REPTILE is retained at higher sample sizes with an accuracy of 76.5%. The benefit of XG-REPTILE is still greatest at low sample sizes, with a +5.4% improvement at 1%; however, we maintain a +2.6% gain over the closest system at 50%. While low sampling is the most economical, the consistent benefit of XG-REPTILE suggests a promising strategy for other cross-lingual tasks.
Learning Spider and CSpider Our results on Spider and CSpider are shown in Table 3. We compare XG-REPTILE to monolingual approaches from Wang et al. (2021a) and discuss cross-lingual results when sampling between 1% and 10% of the CSpider target during training.
In the monolingual setting, XG-REPTILE shows significant improvement (p < 0.01; using an independent-samples t-test) compared to DG-FMAML, with +6.7% for English and +16.4% for Chinese dev accuracy. This further supports our claim that generalizing with a task manifold is superior to batch-level generalization.
Our results are closer to DG-MAML (Wang et al., 2021a), a higher-order meta-learning method requiring computational resources and training times exceeding 4× the requirements of XG-REPTILE. XG-REPTILE yields accuracies −5.4% and −1.5% below DG-MAML for English and Chinese, where DG-FMAML performs much lower at −12.1% (EN) and −17.9% (ZH). Our results suggest that XG-REPTILE is a superior first-order meta-learning algorithm rivaling prior work with greater computational demands. In the multilingual setting, we observe that XG-REPTILE performs competitively using as little as 1% of Chinese examples. While training with 1% and 5% sampling performs similarly, the best model sees 10% of CSpider samples during training to yield accuracy only −0.9% (test) below the monolingual DG-MAML model. While performance does not match monolingual models, the multilingual approach has additional utility in serving more users. As a zero-shot setup, predicting SQL from CSpider inputs through the model trained for English yields 7.9% validation accuracy, underscoring that cross-lingual transfer for this dataset is non-trivial.
Varying the target sample size demonstrates more variable effects for Spider compared to ATIS. Notably, increasing the sample size yields poorer English performance beyond the optimal XG-REPTILE@5% setting for English. This may be a consequence of the cross-database challenge in Spider: information shared across languages may be less beneficial than for single-domain ATIS. The least performant model for both languages is XG-REPTILE@1%. Low performance here for Chinese can be expected, but the performance for English is surprising. We suggest that this result is a consequence of "over-sampling" of the target data disrupting the overall training process. That is, for 1% sampling and optimal K = 4, the target data is "over-used" 25× for each epoch of support data. We further observe diminishing benefits for English with additional Chinese samples. While we trained a competitive parser with minimal Chinese data, this effect could be a consequence of RAT-SQL being unable to exploit certain English-oriented learning features (e.g., lexical similarity scores). Future work could explore cross-lingual strategies to unify entity modeling for improved feature sharing.

Visualizing the Manifold

Analysis of XG-REPTILE in Section 4 relies on a theoretical basis that first-order meta-learning creates a dense high-likelihood sub-region in the parameters (i.e., an optimal manifold). Under these conditions, representations of new domains should cluster within the manifold, allowing rapid adaptation with minimal samples. This contrasts with methods without meta-learning, which provide no guarantees of representation density. However, the metrics in Tables 2 and 3 do not directly explain whether this expected effect arises. To this end, we visualize ATIS test set encoder outputs using PCA (Halko et al., 2011) in Figure 3. We contrast English (support) with French and Chinese as the most and least similar target languages. Using PCA allows for direct interpretation of low-dimensional distances across approaches. Cross-lingual similarity is a proxy for manifold alignment, as our goal is accurate cross-lingual transfer from closely aligned representations of source and target languages (Xia et al., 2021; Sherborne and Lapata, 2022).

Figure 3: PCA visualizations of sentence-averaged encodings for English (EN), French (FR), and Chinese (ZH) from the ATIS test set (@1% sampling from Table 2). We identify the regularized weight manifold which improves cross-lingual transfer using XG-REPTILE. We also improve in two similarity metrics averaged across languages.
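The visualization procedure can be sketched as follows. This is a minimal illustration, not the paper's implementation: the encodings are random stand-ins for sentence-averaged encoder outputs, `pca_2d` is a hypothetical helper using a plain SVD rather than the randomized PCA of Halko et al. (2011), and the centroid-distance proxy is our own simplification. The key detail it shows is fitting a single projection over all languages so that low-dimensional distances are directly comparable.

```python
import numpy as np

def pca_2d(x):
    """Project row vectors onto their top-2 principal components via SVD."""
    x = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return x @ vt[:2].T

rng = np.random.default_rng(0)
# Random stand-ins for sentence-averaged encoder outputs per language
enc_en = rng.normal(0.0, 1.0, size=(100, 512))   # support (English)
enc_fr = rng.normal(0.2, 1.0, size=(100, 512))   # closer target (French)
enc_zh = rng.normal(0.5, 1.0, size=(100, 512))   # farther target (Chinese)

# Fit one projection over all languages so 2D distances are comparable
points = pca_2d(np.vstack([enc_en, enc_fr, enc_zh]))
en2d, fr2d, zh2d = np.split(points, 3)

# Centroid distance to English as a coarse proxy for manifold alignment
dist_fr = np.linalg.norm(fr2d.mean(axis=0) - en2d.mean(axis=0))
dist_zh = np.linalg.norm(zh2d.mean(axis=0) - en2d.mean(axis=0))
```

Under this synthetic setup the French centroid lands closer to English than the Chinese one, mirroring the most-similar/least-similar contrast drawn in Figure 3.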
Analyzing Figure 3, we observe that meta-learning methods (Reptile-EN→FT-All, XG-REPTILE) fit target languages closer to the support (English, yellow circle). In contrast, methods not utilizing meta-learning (Train-EN∪All, Train-EN→FT-All) appear less ordered, with weaker representation overlap. Encodings from XG-REPTILE are less separable across languages and densely clustered, suggesting the regularized manifold hypothesized in Section 4 ultimately yields improved cross-lingual transfer. Visualizing encodings of English in the Reptile-EN model before fine-tuning produces a similar cluster (not shown); however, the required fine-tuning results in "spreading", leading to less cross-lingual similarity.
We also quantitatively examine the average encoding change in Figure 3 using cosine similarity and Hausdorff distance (Patra et al., 2019) between English and each target language. Cosine similarity is measured pair-wise across parallel inputs in each language to gauge similarity between representations with equivalent SQL outputs. As a measure of mutual proximity between sets, Hausdorff distance denotes a worst-case distance between languages to measure more general "closeness". Under both metrics, XG-REPTILE yields the best performance with the strongest pair-wise similarity and Hausdorff similarity. These indicators of cross-lingual similarity further support the observation that our expected behavior genuinely occurs during training.
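The two metrics above can be sketched as follows; the function names are hypothetical, and we assume the paper's pair-wise cosine is averaged over parallel sentence pairs and its Hausdorff distance is the standard symmetric form (the maximum over the two directed worst-case nearest-neighbour distances).

```python
import numpy as np

def pairwise_cosine(a, b):
    # Mean cosine similarity over parallel pairs: row i of `a` (source
    # language) against row i of `b` (target language, same SQL output).
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float(np.mean(np.sum(a_n * b_n, axis=1)))

def hausdorff(a, b):
    # Symmetric Hausdorff distance between two sets of encodings:
    # the worst-case distance from any point to the other set.
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # all pairs
    return float(max(d.min(axis=1).max(), d.min(axis=0).max()))
```

For identical encoding sets, `pairwise_cosine` returns 1.0 and `hausdorff` returns 0.0; closer alignment between languages pushes the metrics toward these values.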
Our findings better explain why XG-REPTILE performs above other training algorithms. Specifically, our results suggest that XG-REPTILE learns a regularized manifold which produces stronger cross-lingual similarity and improved parsing compared to Reptile followed by fine-tuning a manifold. This contrast will inform future work on cross-lingual meta-learning where XG-REPTILE can be applied.
Error Analysis We can also examine where improved cross-lingual transfer influences parsing performance. Similar to Figure 3, we consider the results of models using 1% sampling as the worst-case performance and examine where XG-REPTILE improves on other methods on the test set (448 examples) over five languages.
Accurate semantic parsing requires sophisticated entity handling to translate mentioned proper nouns from utterance to logical form. In our few-shot sampling scenario, most entities will appear in the English support data (e.g., "Denver" or "American Airlines"), and some will be mentioned within the target language sample (e.g., "Mineápolis" or "Nueva York" in Spanish). These samples cannot include all possible entities; effective cross-lingual learning must "connect" these entities from the support data to the target language, such that these names can be parsed when predicting SQL from the target language. As shown in Figure 4, the failure to recognize entities from support data during inference on target languages is a critical failing of all models besides XG-REPTILE.
The improvement in cross-lingual similarity using XG-REPTILE expresses a specific improvement in entity recognition. Compared to the worst performing model, Train-EN∪All, 55% of the improvement accounts for handling entities absent from the 1% target sample but present in the 99% English support data. While XG-REPTILE can generate accurate SQL, other models are limited in expressivity and fall back on entities seen in the 1% sample. This notably accounts for 60% of the improvement in parsing Chinese, which has minimal orthographic overlap with English, indicating that XG-REPTILE better leverages support data without reliance on token similarity. In 48% of improved parses, entity mishandling is the sole error, highlighting how limiting poor cross-lingual transfer is for our task.
Our model also improves the handling of novel modifiers (e.g., "on a weekday", "round-trip") absent from target language samples. Modifiers are often realized as additional sub-queries and filtering logic in SQL outputs. Comparing XG-REPTILE to Train-EN∪All, 33% of the improvement is related to modifier handling. Less capable systems fall back on modifiers observed in the target sample or ignore them entirely, generating inaccurate SQL.
While XG-REPTILE better links parsing knowledge from English to target languages, the problem is not solved. Outstanding errors in all languages primarily relate to query complexity, and the cross-lingual transfer gap is not closed. Furthermore, our error analysis suggests a future direction for optimal sample selection to minimize the error from interpreting unseen phenomena.

Conclusion
We propose XG-REPTILE, a meta-learning algorithm for few-shot cross-lingual generalization in semantic parsing. XG-REPTILE better utilizes fewer samples to learn an economical multilingual semantic parser with minimal cost and improved sample efficiency. Compared to adjacent training algorithms and zero-shot approaches, we obtain more accurate and consistent logical forms across languages both similar and dissimilar to English. Results on ATIS show clear benefits across many languages, and results on Spider demonstrate that XG-REPTILE is effective in a challenging cross-lingual and cross-database scenario. We focus our study on semantic parsing; however, this algorithm could be beneficial in other low-resource cross-lingual tasks. In future work, we plan to examine how to better align entities in low-resource languages to further improve parsing accuracy.

Figure 1: One iteration of XG-REPTILE. (1) Run K iterations of gradient descent over K support batches to learn φK; (2) compute ∇macro, the difference between φK and φ1; (3) find the loss on the target batch using φK; (4) compute the final gradient update from ∇macro and the target loss.
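The four steps of Figure 1 can be sketched on a toy model as follows. This is an illustrative sketch, not the paper's implementation: `sgd_step`, the linear-regression loss, and the hyperparameters `inner_lr`, `outer_lr`, and `beta` (the weight combining the target-loss gradient with ∇macro) are all assumptions for exposition.

```python
import numpy as np

def sgd_step(phi, batch, lr):
    # One SGD step on mean squared error for a toy linear model y = X @ phi
    X, y = batch
    grad = 2.0 * X.T @ (X @ phi - y) / len(y)
    return phi - lr * grad

def xg_reptile_step(phi, support_batches, target_batch, inner_lr, outer_lr, beta):
    # (1) K inner gradient-descent steps over the K support batches -> phi_K
    phi_k = phi.copy()
    for batch in support_batches:
        phi_k = sgd_step(phi_k, batch, inner_lr)
    # (2) macro gradient: difference between the start and end of the inner loop
    grad_macro = phi - phi_k
    # (3) gradient of the target-batch loss evaluated at phi_K
    X_t, y_t = target_batch
    grad_target = 2.0 * X_t.T @ (X_t @ phi_k - y_t) / len(y_t)
    # (4) final update combining both generalization signals (combination
    # weighting is a hypothetical choice here)
    return phi - outer_lr * (grad_macro + beta * grad_target)
```

On a toy regression problem, iterating `xg_reptile_step` drives the target-batch loss down, mirroring how the support languages shape the inner loop while the target language steers the outer update.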

Figure 2: Ablation experiments on ATIS: (a) accuracy against inner loop size K across languages; (b) accuracy against K for German when varying batch size; and (c) accuracy against dataset sample size relative to the support dataset, from 1% to 50%, for German. For (b), the K = 1 case is equivalent to DG-FMAML (Wang et al., 2021a).

Figure 4: Contrast between SQL from a French input from ATIS for Train-EN∪All and XG-REPTILE. The entities "San José" and "Phoenix" are not observed in the 1% sample of French data but are mentioned in the English support data. The Train-EN∪All approach fails to connect attributes seen in English when generating SQL from French inputs (✗). Training with XG-REPTILE better leverages support data to generate accurate SQL from other languages (✓).
A monolingual Transformer is trained on source English data (SEN). Machine translation is used to translate test data from additional languages into English, and logical forms are then predicted from the translated data using the English model.
Train-EN→FT-All We first train on English support data, SEN, and then fine-tune on target samples, SL.

Table 2: Denotation accuracy using varying learning algorithms, including XG-REPTILE at 1%, 5%, and 10% sampling rates for target dataset size relative to the support dataset for ATIS. We report results for English, French, Portuguese, Spanish, German, and Chinese. Target Avg reports the average denotation accuracy across non-English languages ± standard deviation across languages. For few-shot experiments, we also report the standard deviation (±) across random samples. Best few-shot results per language are bolded.

Table 3: Denotation accuracy for RAT-SQL trained on Spider (English) and CSpider (Chinese), comparing XG-REPTILE to DG-MAML and DG-FMAML (Wang et al., 2021a). We experiment with sampling between 1% and 10% of Chinese examples relative to English. Monolingual and multilingual best results are bolded.