Optimal Transport Posterior Alignment for Cross-lingual Semantic Parsing

Abstract Cross-lingual semantic parsing transfers parsing capability from a high-resource language (e.g., English) to low-resource languages with scarce training data. Previous work has primarily considered silver-standard data augmentation or zero-shot methods; exploiting few-shot gold data is comparatively unexplored. We propose a new approach to cross-lingual semantic parsing by explicitly minimizing cross-lingual divergence between probabilistic latent variables using Optimal Transport. We demonstrate how this direct guidance improves parsing from natural languages using fewer examples and less training. We evaluate our method on two datasets, MTOP and MultiATIS++SQL, establishing state-of-the-art results under a few-shot cross-lingual regime. Ablation studies further reveal that our method improves performance even without parallel input translations. In addition, we show that our model better captures cross-lingual structure in the latent space to improve semantic representation similarity.


Introduction
Semantic parsing maps natural language utterances to logical form (LF) representations of meaning. As an interface between human- and computer-readable languages, semantic parsers are a critical component in various natural language understanding (NLU) pipelines, including assistant technologies (Kollar et al., 2018), knowledge base question answering (Berant et al., 2013; Liang, 2016), and code generation (Wang et al., 2023).
Recent advances in semantic parsing have led to improved reasoning over challenging questions (Li et al., 2023) and accurate generation of complex queries (Scholak et al., 2021); however, most prior work has focused on English (Kamath and Das, 2019; Qin et al., 2022a). (Our code and data are publicly available at github.com/tomsherborne/minotaur.) Expanding, or localizing, an English-trained model to additional languages is challenging for several reasons. There is typically little labeled data in the target languages due to high annotation costs. Cross-lingual parsers must also be sensitive to how different languages refer to entities or model abstract and mathematical relationships (Reddy et al., 2017; Hershcovich et al., 2019). Transfer between dissimilar languages can also degrade in multilingual models with insufficient capacity (Pfeiffer et al., 2022).
Previous strategies for resource-efficient localization include generating "silver-standard" training data through machine translation (Nicosia et al., 2021) or prompting large language models (Rosenbaum et al., 2022). Alternatively, zero-shot models use "gold-standard" external corpora for auxiliary tasks (van der Goot et al., 2021) and few-shot models maximize sample efficiency using meta-learning (Sherborne and Lapata, 2023). We argue that previous work encourages cross-lingual transfer through implicit alignment only, via minimizing silver-standard data perplexity, multi-task ensembling, or constraining gradients.
We instead propose to localize an encoder-decoder semantic parser by explicitly inducing cross-lingual alignment between representations. We present MINOTAUR (Minimizing Optimal Transport distance for Alignment Under Representations), a method for cross-lingual semantic parsing which explicitly minimizes distances between probabilistic latent variables to reduce representation divergence across languages (Figure 1). MINOTAUR leverages Optimal Transport theory (Villani, 2008) to measure and minimize this divergence between English and target languages during episodic few-shot learning. Our hypothesis is that explicit alignment between latent variables can improve knowledge transfer between languages without requiring additional annotations or lexical alignment. We evaluate this hypothesis in a few-shot cross-lingual regime and study how many examples in languages beyond English are needed for "good" performance.
Our technique allows us to precisely measure, and minimize, the cross-lingual transfer gap between languages. This yields sample-efficient training and establishes leading performance for few-shot cross-lingual transfer on two datasets. We focus our evaluation on semantic parsing, but MINOTAUR can be applied directly to a wide range of other tasks. Our contributions are as follows:
• We propose a method for learning a semantic parser using explicit cross-lingual alignment between probabilistic latent variables. MINOTAUR jointly minimizes marginal and conditional posterior divergence for fast and sample-efficient cross-lingual transfer.
• We propose an episodic training scheme for cross-lingual posterior alignment which requires minimal modifications to typical learning.

Related Work
Cross-lingual Semantic Parsing Growing interest in cross-lingual NLU has motivated the expansion of benchmarks to study model adaptation across many languages (Hu et al., 2020; Liang et al., 2020). Within executable semantic parsing, ATIS (Hemphill et al., 1990) has been translated into multiple languages such as Chinese and Indonesian (Susanto and Lu, 2017a), and GeoQuery (Zelle and Mooney, 1996) has been translated into German, Greek, and Thai (Jones et al., 2012). Adjacent research in Task-Oriented Spoken Language Understanding (SLU) has given rise to datasets such as MTOP in five languages (Li et al., 2021) and MultiATIS++ in seven languages (Xu et al., 2020). SLU aims to parse inputs into functional representations of dialog acts (often embedded in an assistant NLU pipeline) instead of executable machine-readable language. In all cases, cross-lingual semantic parsing demands fine-grained semantic understanding for successful transfer across languages. Multilingual pretraining (Pires et al., 2019) has the potential to unlock certain understanding capabilities but is often insufficient. Previous methods resort to expensive dataset translation (Jie and Lu, 2014; Susanto and Lu, 2017b) or attempt to mitigate data paucity by creating "silver-standard" data through machine translation (Sherborne et al., 2020; Nicosia et al., 2021; Xia and Monti, 2021; Guo et al., 2021) or prompting (Rosenbaum et al., 2022; Shi et al., 2022). However, methods that rely on synthetic data creation are yet to produce cross-lingual parsing equitable to using gold-standard professional translation.
Zero-shot methods bypass the need for in-domain data augmentation using multi-task objectives which incorporate gold-standard data for external tasks such as language modeling or dependency parsing (van der Goot et al., 2021; Sherborne and Lapata, 2022; Gritta et al., 2022). Few-shot approaches which leverage a small number of annotations have shown promise in various tasks (Zhao et al., 2021, inter alia), including semantic parsing. Sherborne and Lapata (2023) propose a first-order meta-learning algorithm to train a semantic parser capable of sample-efficient cross-lingual transfer.
Our work is most similar to recent studies on cross-lingual alignment for classification tasks (Wu and Dredze, 2020) and spoken-language understanding using token- and slot-level annotations between parallel inputs (Qin et al., 2022b; Liang et al., 2022). While similar in motivation, we contrast in our exploration of latent variables with parametric alignment for a closed-form solution to cross-lingual transfer. Additionally, our method does not require fine-grained word and phrase alignment annotations, instead inducing alignment in the continuous latent space.
Alignment and Optimal Transport Optimal Transport (OT; Villani 2008) minimizes the cost of mapping from one distribution (e.g., utterances) to another (e.g., logical forms) through some joint distribution with conditional independence (Monge, 1781), i.e., a latent variable conditional on samples from one input domain. OT in NLP has mainly used Sinkhorn distances to measure the divergence between non-parametric discrete distributions as an online minimization sub-problem (Cuturi, 2013).
Cross-lingual approaches to OT have been proposed for embedding alignment (Alvarez-Melis and Jaakkola, 2018; Alqahtani et al., 2021), bilingual lexicon induction (Marchisio et al., 2022), and summarization (Nguyen and Luu, 2022). Our method is similar to recent proposals for cross-lingual retrieval using variational or OT-oriented representation alignment (Huang et al., 2023; Wieting et al., 2023). Wang and Wang (2019) consider a "continuous" perspective on OT using the Wasserstein Auto-Encoder (Tolstikhin et al., 2018, WAE) as a language model which respects geometric input characteristics within the latent space.
Our parametric formulation allows this continuous approach to OT, similar to the WAE model. While monolingual prior work in semantic parsing has identified that latent structure can benefit the parsing task (Kočiský et al., 2016; Yin et al., 2018), it does not consider whether latent structure can inform transfer between languages. To the best of our knowledge, we are the first to consider the continuous form of OT for cross-lingual transfer in a sequence-to-sequence task. We formulate the parsing task as a transportation problem in Section 3 and describe how this framework gives rise to explicit cross-lingual alignment in Section 4.

Cross-lingual Semantic Parsing
Given a natural language utterance x, represented as a sequence of tokens (x_1, ..., x_T), a semantic parser generates a faithful logical-form meaning representation y. A typical neural network parser trains on input-output pairs {x_i, y_i}_{i=0}^{N}, using the cross-entropy between the predicted ŷ and the gold-standard logical form y as supervision (Cheng et al., 2019).
Following the standard VAE framework (Kingma and Welling, 2014; Rezende et al., 2014), an encoder Qφ represents inputs from X as a continuous latent variable Z, Qφ : X → Z. A decoder Gθ predicts outputs conditioned on samples from the latent space, Gθ : Z → Y. The encoder therefore acts as the approximate posterior Qφ(Z|X). Qφ is a multilingual pre-trained encoder shared across all languages.
For cross-lingual transfer, the parser must also generalize to languages from which it has seen few (or zero) training examples. Our goal is for the prediction for input x_l ∈ X_l in language l to match the prediction for the equivalent input from a high-resource language (typically English), i.e., x_l → y and x_EN → y, subject to the constraint of fewer training examples in l (|N_l| ≪ |N_EN|). As shown in Figure 1, we propose measuring the divergence between approximate posteriors (i.e., Q(Z|X_EN) and Q(Z|X_l)) as the distance between individual samples and an approximation of the "mean" encoding of each language. This goal of aligning distributions naturally fits an Optimal Transport perspective. Tolstikhin et al. (2018) propose the Wasserstein Auto-Encoder (WAE) as an alternative variational model. The WAE minimizes the transportation cost under the Kantorovich form of the Optimal Transport problem (Kantorovich, 1958). Given two distributions P_X, P_Y, the objective is to find a transportation plan Γ(X, Y), within the set of all joint distributions P(X ∼ P_X, Y ∼ P_Y), to map probability mass from P_X to P_Y with minimal cost. Equation (1) expresses the problem of finding a plan which minimizes a transportation cost function c(X, Y) : X × Y → R+:

Kantorovich Transportation Problem:

T_c(P_X, P_Y) := inf_{Γ ∈ P(X∼P_X, Y∼P_Y)} E_{(X,Y)∼Γ} [c(X, Y)]    (1)
The WAE is proposed as an auto-encoder (i.e., P_Y approximates P_X); however, in our setting P_X is the natural language input distribution and P_Y is the logical form output distribution, and both are realizations of the same semantics. Using conditional independence, y ⊥⊥ x | z, we can transform the plan, Γ(X, Y) → Γ(Y|X)P_X, and consider a non-deterministic mapping from X to Y under observed P_X. Tolstikhin et al. (2018, Theorem 1) identify how to factor this mapping through latent variable Z, leading to:

inf_{Qφ(Z|X)} E_{P_X} E_{Qφ(Z|X)} [c(Y, Gθ(Z))] + λ · D(Q(Z), P(Z))    (2)

Equation (2) expresses a minimizable objective: identify the probabilistic encoder Qφ(Z|X) and decoder Gθ(Z) which minimize a cost, subject to regularization on the divergence D between the marginal posterior Q(Z) and the prior P(Z).
The additional regularization is how the WAE improves on the evidence lower bound of the variational auto-encoder, where the equivalent alignment on the individual posterior Qφ(Z|X) drives latent representations to zero. Regularization on the marginal posterior instead allows individual posteriors for different samples to remain distinct and non-zero. This limits posterior collapse, guiding Z to remain informative for decoding.
We use Maximum Mean Discrepancy (MMD; Gretton et al., 2012) for an unbiased estimate of D(Q(Z), P(Z)) as a robust measure of the distance between high-dimensional Gaussian distributions. Equation (3) defines MMD using some kernel k : Z × Z → R, defined over a reproducing kernel Hilbert space H_k:

MMD_k(P, Q) = || E_{z∼P}[k(z, ·)] − E_{z∼Q}[k(z, ·)] ||_{H_k}    (3)

Informally, MMD minimizes the distance between the "feature means" of variables P and Q estimated over a batch sample. Equation (4) defines MMD estimation over observed batches p = {p_1, ..., p_n} and q = {q_1, ..., q_n} using the heavy-tailed inverse multiquadratic (IMQ) kernel k:

MMD_k(p, q) = 1/(n(n−1)) Σ_{i≠j} k(p_i, p_j) + 1/(n(n−1)) Σ_{i≠j} k(q_i, q_j) − 2/n² Σ_{i,j} k(p_i, q_j)    (4)

We define the IMQ kernel in Equation (5) below, where C = 2|z|σ² and S = {0.1, 0.2, 0.5, 1, 2, 5, 10}:

k(x, y) = Σ_{s∈S} sC / (sC + ||x − y||²₂)    (5)
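The MMD estimator and IMQ kernel above can be sketched in plain Python. This is a minimal illustration assuming a unit prior variance (σ² = 1) when setting C, with the vector dimensionality standing in for |z|:

```python
S = [0.1, 0.2, 0.5, 1, 2, 5, 10]  # kernel scales, as in Equation (5)

def imq_kernel(x, y, sigma2=1.0):
    """Inverse multiquadratic kernel summed over scales (Equation (5))."""
    C = 2.0 * len(x) * sigma2  # C = 2|z|sigma^2 with |z| = dimensionality
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return sum(s * C / (s * C + sq_dist) for s in S)

def mmd(p, q):
    """Unbiased MMD estimate between two equal-size batches (Equation (4))."""
    n = len(p)
    k_pp = sum(imq_kernel(p[i], p[j]) for i in range(n) for j in range(n) if i != j)
    k_qq = sum(imq_kernel(q[i], q[j]) for i in range(n) for j in range(n) if i != j)
    k_pq = sum(imq_kernel(a, b) for a in p for b in q)
    return k_pp / (n * (n - 1)) + k_qq / (n * (n - 1)) - 2.0 * k_pq / n ** 2

# A batch far from the reference scores a larger discrepancy than a near one.
near = [[0.0], [0.1], [-0.1], [0.05]]
also_near = [[0.02], [-0.05], [0.12], [-0.08]]
far = [[5.0], [5.1], [4.9], [5.05]]
assert mmd(near, far) > mmd(near, also_near)
```

In training, `p` would be posterior samples and `q` samples from the prior (or, later, from another language's posterior); the kernel sums over several scales so no bandwidth needs tuning.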
This framework defines a WAE objective using a cost function c to map from P_X to P_Y through latent variable Z. We now describe how MINOTAUR integrates explicit posterior alignment during this learning process.

MINOTAUR: Posterior Alignment for Cross-lingual Transfer
Variational Encoder-Decoder Our model comprises an encoder (and approximate posterior) Qφ and a generator decoder Gθ. The encoder Qφ produces a distribution over latent encodings z = {z_1, ..., z_T}, parameterized as a sequence of T mean states μ_{1,...,T} ∈ R^{T×d} and a single variance estimate σ² ∈ R^d. The latent encodings z are sampled using the Gaussian reparameterization trick (Kingma and Welling, 2014), z_t = μ_t + σ ⊙ ε with ε ∼ N(0, I). Finally, an output sequence ŷ is generated from z through autoregressive generation. For an input sequence of T tokens, we use a sequence of T latent variables for z over pooling into a single representation. This allows for more 'bandwidth' in the latent state to minimize the risk of the decoder ignoring z, i.e., posterior collapse. We find this design choice to be necessary, as lossy pooling leads to weak overall performance. We also use a single variance estimate for the sequence z; this minimizes variance noise across z and simplifies computation in posterior alignment. We follow the convention of an isotropic unit Gaussian prior, P(z) ∼ N(0, I).
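A minimal sketch of this parameterization, with hypothetical shapes: T per-token means plus one variance vector shared across positions, sampled with the reparameterization trick:

```python
import math
import random

def sample_latents(mu, log_var, seed=0):
    """z_t = mu_t + sigma * eps_t: per-token means, one shared variance."""
    rng = random.Random(seed)
    sigma = [math.exp(0.5 * lv) for lv in log_var]  # d-dimensional std dev
    z = []
    for mu_t in mu:  # one latent variable per input token
        eps = [rng.gauss(0.0, 1.0) for _ in sigma]
        z.append([m + s * e for m, s, e in zip(mu_t, sigma, eps)])
    return z

T, d = 4, 8
mu = [[0.0] * d for _ in range(T)]  # T mean states in R^{T x d}
log_var = [0.0] * d                 # single shared log-variance -> sigma = 1
z = sample_latents(mu, log_var)
assert len(z) == T and len(z[0]) == d
```

Sharing one variance over the sequence means the per-token Gaussians differ only in their means, which keeps the closed-form divergences in the alignment losses cheap to compute.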
Cross-lingual Alignment Typical WAE modeling builds meaningful latent structure by aligning the estimated posterior to the prior only. MINOTAUR extends this by additionally aligning posteriors between languages. Consider learning the optimal mapping from English utterances X_EN to logical forms Y within Equation (1) via latent variable Z, from monolingual data (X_EN, Y). The optimization in Equation (2) converges on an optimal transportation plan Γ*_EN as the minimum cost (Γ* is implicit within the model parameters). For transfer from English to language l, previous work either requires token alignment between X_EN and X_l or exploits the shared Y between X_EN and X_l (Qin et al., 2022b, inter alia). We instead induce alignment by explicitly matching Z between languages. Since Y is dependent only on Z, the latent variable offers a continuous representation space for alignment with the minimal and intuitive condition that equivalent z yields equivalent y. Therefore, our proposal is a straightforward extension of learning Γ*_EN; we propose to bootstrap the transportation plan for target language l (i.e., Γ*_l(X_l, Y)) by aligning on Z in a few-shot learning scenario. MINOTAUR explicitly aligns Z_l (from a target language l) towards Z (from EN) by matching Q(Z_l|X_l) to Q(Z|X_EN) for the goal Γ*_l = Γ*_EN, thereby transferring the learned capabilities from high-resource languages with only a few training examples.
Given parallel inputs x_EN and x_l in English and language l, with equivalent LF (y_EN = y_l), their latent encodings z_EN ∼ Qφ(Z|x_EN) and z_l ∼ Qφ(Z|x_l) are produced by the shared encoder. Unlike vanilla VAEs, where z is a single vector, the posterior samples (z_EN, z_l ∈ R^{T×d}) are complex structures. We therefore follow Mathieu et al. (2019) in using a decomposed alignment signal minimizing both aggregate posterior alignment (higher-level) and individual posterior alignment (lower-level), with scaling factors (α_P, β_P) respectively. This leads to the MINOTAUR alignment outlined in Figure 1, where D_Z|X is a divergence penalty between individual representations to match local structure, while D_Z is a divergence penalty between representation aggregates to match more global structure. The intuition is that individual matching promotes contextual encoding similarity and aggregate matching promotes similarity at the language level. Similar to the prior alignment, we use the MMD distance of Equation (3) to align aggregate posteriors (i.e., marginal posteriors over Z between languages). For individual alignment, we consider two numerically stable exact solutions to measure individual divergence which are well suited to matching high-dimensional Gaussians (Takatsu, 2011). Modeling Qφ(Z|X) as a parametric statistic yields the benefit of closed-form computation during learning. We primarily use the L2-Wasserstein distance, W2, as the Optimal Transport-derived minimum transportation cost between Gaussians (p, q) across domains:

W2²(p, q) = ||μ_p − μ_q||²₂ + Tr{Σ_p + Σ_q − 2(Σ_p^½ Σ_q Σ_p^½)^½}    (12)

Within Equation (12) the mean is μ, the covariance is Σ = Diag{σ_1², ..., σ_d²}, encodings have dimensionality d, and Tr{} is the matrix trace function.
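For diagonal covariances the trace term in Equation (12) collapses to a sum over dimensions, so W2 has a simple closed form; a sketch assuming diagonal Gaussians given as mean and standard-deviation vectors:

```python
def w2_squared(mu_p, sigma_p, mu_q, sigma_q):
    """Squared L2-Wasserstein distance between diagonal Gaussians:
    ||mu_p - mu_q||^2 + sum_i (sigma_p_i - sigma_q_i)^2."""
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_p, mu_q))
    cov_term = sum((a - b) ** 2 for a, b in zip(sigma_p, sigma_q))
    return mean_term + cov_term

# Identical Gaussians are at distance zero; shifting one mean by 1 in a
# single dimension gives squared distance 1.
assert w2_squared([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0]) == 0.0
assert w2_squared([0.0], [1.0], [1.0], [1.0]) == 1.0
```

Because both terms are sums of squares, the distance is symmetric and numerically stable, which matters when minimizing it between many token-level posteriors during training.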
We also consider the Kullback-Leibler divergence (KL) between two Gaussian distributions as Equation (13):

KL(p || q) = ½ [ log(|Σ_q| / |Σ_p|) − d + Tr{Σ_q⁻¹ Σ_p} + (μ_q − μ_p)ᵀ Σ_q⁻¹ (μ_q − μ_p) ]    (13)

Minimizing KL is equivalent to maximizing the mutual information between distributions as an information-theoretic goal of semantically aligning z. Section 6 demonstrates that W2 is superior to KL in all cases.
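The diagonal-Gaussian KL in Equation (13) likewise reduces to a per-dimension sum; a sketch taking means and variances:

```python
import math

def kl_gauss(mu_p, var_p, mu_q, var_q):
    """KL(p || q) for diagonal Gaussians, summed over dimensions:
    0.5 * [log(var_q / var_p) + (var_p + (mu_p - mu_q)^2) / var_q - 1]."""
    total = 0.0
    for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q):
        total += 0.5 * (math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0)
    return total

assert kl_gauss([0.0], [1.0], [0.0], [1.0]) == 0.0
# Unlike W2, KL is asymmetric in its arguments.
a = kl_gauss([0.0], [1.0], [0.0], [4.0])
b = kl_gauss([0.0], [4.0], [0.0], [1.0])
assert abs(a - b) > 1e-6
```

The asymmetry shown in the last lines is one practical difference from W2 and a plausible factor in W2's stronger empirical performance reported in Section 6.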
We express D_Z|X (see Equation (11)) between singular p and q representations for individual tokens for clarity; however, we actually minimize the mean of D_Z|X between all token pairs across both sequences, i.e., (1 / (T_1 T_2)) Σ_i Σ_j D_Z|X(z_1i, z_2j). We observe that minimizing this mean divergence between all (z_1i, z_2j) pairs is most empirically effective.
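The mean pairwise reduction above can be sketched as follows, using squared mean distance as a hypothetical stand-in for the divergence D:

```python
def mean_pairwise_divergence(z1, z2, div):
    """Average divergence over all (z1_i, z2_j) token pairs:
    (1 / (T1 * T2)) * sum_i sum_j D(z1_i, z2_j)."""
    total = sum(div(a, b) for a in z1 for b in z2)
    return total / (len(z1) * len(z2))

# Stand-in divergence: squared distance between token means.
sq_dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))

z_en = [[0.0, 0.0], [1.0, 0.0]]
z_l = [[0.0, 0.0], [1.0, 0.0]]
# Identical sequences still incur cross-pair cost: pairs (0,1) and (1,0)
# each contribute 1, so the mean over four pairs is 0.5.
assert mean_pairwise_divergence(z_en, z_l, sq_dist) == 0.5
```

Averaging over all pairs avoids committing to any token-level correspondence between the two languages, which is exactly what makes word-alignment annotation unnecessary.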
Finally, Equation (14) expresses the transportation cost T_c for a single (x, y) pair during training: the cross-entropy between the predicted and gold y, plus WAE marginal prior regularization:

T_c(x, y) = CE(ŷ, y) + λ · MMD(Q(Z), P(Z))    (14)
We episodically augment Equation (14) as Equation (15) using the MINOTAUR loss every k steps for few-shot induction of cross-lingual alignment:

L = T_c + α_P · D_Z + β_P · D_Z|X    (15)

Sampling (x, y) is detailed in Section 5.
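The episodic schedule can be sketched as a counter over training steps. The loss computations are elided; only the every-k interleaving of standard and alignment batches is the point:

```python
def train_schedule(num_steps, k=20):
    """Every k-th step runs a MINOTAUR alignment episode on parallel
    (x_EN, x_l) batches; all other steps run the standard objective."""
    log = []
    for step in range(1, num_steps + 1):
        if step % k == 0:
            log.append("align")     # parallel batch, augmented loss (Eq. 15)
        else:
            log.append("standard")  # mixed-language batch, T_c only (Eq. 14)
    return log

log = train_schedule(100, k=20)
assert log.count("align") == 5          # steps 20, 40, 60, 80, 100
assert log[0] == "standard" and log[19] == "align"
```

This matches the description in Section 5 that alignment episodes are interleaved every k = 20 steps rather than added to every update, so the alignment signal guides without dominating learning.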
Another perspective on our approach is that we are aligning pushforward distributions, Q(X) : X → Z. Cross-lingual alignment at the input token level (in X) requires fine-grained annotations and is an outstanding research problem (see Section 2). Our method of aligning pushforwards in Z is smoothly continuous, does not require word alignment, and does not always require input utterances to be parallel translations. While we evaluate MINOTAUR principally on semantic parsing, our framework can extend to any sequence-to-sequence or representation learning task which may benefit from explicit alignment between languages or domains.

Experimental Setting
MTOP (Li et al., 2021) contains dialog utterances of "assistant" queries and their corresponding tree-structured slot and intent LFs. MTOP is split into 15,667 training, 2,235 validation, and 4,386 test examples in English (EN). A variable subsample of each split is translated into French (FR), Spanish (ES), German (DE), and Hindi (HI). We refer to Li et al. (2021, Table 1) for complete dataset details. As shown in Figure 2, we follow Rosenbaum et al. (2022, Appendix B.2) in using "space-joined" tokens and "sentinel words" (i.e., a wordi token is prepended to each input token and replaces this token in the LF) to produce a closed decoder vocabulary (Raman et al., 2022). This allows the output LF to reference input tokens by label without a copy mechanism. We evaluate LF accuracy using the Space and Case Invariant Exact-Match metric (SCIEM; Rosenbaum et al. 2022).
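The sentinel-word scheme can be sketched as follows. This is a simplified illustration assuming whitespace tokenization and the `wordi` label format shown in Figure 2; the authors' exact preprocessing follows Rosenbaum et al. (2022):

```python
def add_sentinels(utterance):
    """Prepend a wordN sentinel before each input token so the decoder
    can reference tokens by label instead of copying raw text."""
    out = []
    for i, tok in enumerate(utterance.split(), start=1):
        out.extend([f"word{i}", tok])
    return " ".join(out)

assert add_sentinels("Who attended Yale?") == "word1 Who word2 attended word3 Yale?"
```

Because the LF then only ever emits sentinel labels (a closed vocabulary), the decoder never needs to generate language-specific surface strings, which helps cross-lingual transfer.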
Optimization We train for a maximum of ten epochs with early stopping using validation loss. Optimization uses Adam (Kingma and Ba, 2015) with a batch size of 256 and a learning rate of 1 × 10−4. We empirically tune the hyperparameters (β_P, α_P) to (0.5, 0.01) respectively. During learning, a typical step (without MINOTAUR alignment) samples a batch of (x_L, y) pairs in languages L ∈ {EN, l_1, l_2, ...} from a sampled dataset described above. Each MINOTAUR step instead uses a sampled batch of parallel data (x_EN, x_l, y_EN, y_l) to induce explicit cross-lingual alignment from the same data pool. The episodic learning loop size is tuned to k = 20; we find that if alignment episodes are too infrequent then posterior alignment is weaker, and if they are too frequent then overall parsing degrades as the posterior alignment dominates learning. Tokenization uses SentencePiece (Kudo and Richardson, 2018) and beam search prediction uses five hypotheses. All experiments are implemented in PyTorch (Paszke et al., 2019) and AllenNLP (Gardner et al., 2018).
Training takes one hour using 1× A100 80GB GPU for either dataset.
Comparison Systems As an upper bound, we train the WAE-derived model without low-resource constraints. We report monolingual (one language) and multilingual (all languages) versions of training a model on available data. We use the monolingual upper-bound EN model as a "Translate-Test" comparison. We also compare to monolingual and multilingual "Translate-Train" models to evaluate the value of gold samples compared to silver-standard training data. We follow previous work in using OPUS (Tiedemann, 2012) translations for MTOP and Google Translate (Wu et al., 2016) for MultiATIS++SQL in all directions. Following Rosenbaum et al. (2022), we use a cross-lingual word alignment tool (SimAlign; Jalili Sabet et al. 2020) to project token positions from the MTOP source to the parallel machine-translated output (e.g., to shift label wordi in EN to wordj in FR).
In all results, we report averages of five runs over different few-shot splits. For MTOP, we compare to "silver-standard" methods: "Translate-and-Fill" (Nicosia et al., 2021, TaF), which generates training data using MT, and CLASP (Rosenbaum et al., 2022), which uses MT and prompting to generate multilingual training data. We note that these models and dataset pre-processing methods are not public (we have confirmed with the authors that our methods are reasonably comparable). For MultiATIS++SQL, we compare to XG-REPTILE (Sherborne and Lapata, 2023). This method uses meta-learning to approximate a "task manifold" using English data and constrains representations of target languages to be close to this manifold. This approach implicitly optimizes for cross-lingual transfer by regularizing the gradients for target languages to align with gradients for English. MINOTAUR differs in explicitly measuring the representation divergence across languages.

Results
We find that MINOTAUR validates our hypothesis that explicitly minimizing latent divergence improves cross-lingual transfer with few training examples in the target language. As evidenced by our ablation studies, our technique is surprisingly robust and can function without any parallel data between languages. Overall, our method outperforms silver-standard data augmentation techniques (Table 1) and few-shot meta-learning (Table 2).
MINOTAUR is competitive with upper bounds using 100% of gold translated data and so represents a promising new strategy for this dataset. We note that even at a high SPIS rate of 100 (approximately 53.1% of training data), MINOTAUR is significantly (p < 0.01) poorer than the "Gold Multilingual" upper bound, highlighting that few-shot transfer is challenging on MTOP. MINOTAUR outperforms all translation-based comparisons; augmenting "Translate-Train Multilingual" with our posterior alignment objective (+ MINOTAUR) yields a +10.1% average improvement. With equivalent data, this comparison shows that cross-lingual alignment by aligning each latent representation to the prior only (i.e., a WAE-based model) is weaker than cross-lingual alignment between posteriors.
Comparing to "Silver-Standard" Methods A more realistic comparison is between TaF (Nicosia et al., 2021) or CLASP (Rosenbaum et al., 2022), which optimize MT quality in their pipelines, and our method, which uses sampled gold data. We outperform CLASP by >3% and TaF using mT5-large (Xue et al., 2021) by >2.1% at all sample rates. However, MINOTAUR requires >5 SPIS sampling to improve upon TaF using mT5-xxl. We highlight that our model has only ∼116 million parameters, whereas CLASP uses AlexaTM-500M (FitzGerald et al., 2022) with 500 million parameters, mT5-large has 700 million parameters, and mT5-xxl has 3.3 billion parameters. Relative to model size, our approach offers improved computational efficiency. The improvement of our method is mostly seen in languages typologically distant from English, as MINOTAUR is always the strongest model for Hindi. In contrast, our method underperforms for English and German (more similar to EN), which may benefit from stronger pre-trained knowledge transfer within larger models. Our efficacy using gold data and a smaller model, compared to silver data in larger models, suggests a quality trade-off, constrained by computation, as a future study.

Cross-lingual Transfer in Executable Parsing
The results for MultiATIS++SQL in Table 2 show similar trends. However, here MINOTAUR can outperform the upper bounds: sampling at >5% significantly (p < 0.01) improves on "Gold-Monolingual" and is similar or better than "Gold-Multilingual" (p < 0.05). Further increasing the sample rate yields marginal gains. MINOTAUR generally improves on XG-REPTILE and performs on par at a lower sample rate, i.e., MINOTAUR at 1% sampling is closer to XG-REPTILE at 5% sampling. This suggests that our approach is more sample-efficient, achieving greater accuracy with fewer samples. MINOTAUR requires <10 epochs to train, whereas XG-REPTILE reports ∼50 training epochs for poorer results.
This sensitivity tracks language similarity (dissimilar languages demanding greater distances to minimize), whereas alignment via gradient constraints within meta-learning could be less sensitive to this phenomenon. These results, together with the observation that MINOTAUR improves most on Hindi for MTOP, illustrate a need for more in-depth studies of cross-lingual transfer between distant and lower-resource languages. Future work can consider more challenging benchmarks across a wider pool of languages (Ruder et al., 2023).

Contrasting Alignment Signals
We report ablations of MINOTAUR on MTOP at 10 SPIS sampling (Table 3). The joint method using the L2-Wasserstein distance is empirically optimal but not significantly above the aggregate-only method (p = 0.07). This suggests that minimizing aggregate divergence is the main contributor to alignment, with individual divergence as a weaker additional contribution.
Alignment without Latent Variables Table 4 considers alignment without the latent variable formulation on an encoder-decoder Transformer model (Vaswani et al., 2017). Here, the output of the encoder is not probabilistically bound without the parametric "guidance" of the Gaussian reparameterization. This is similar to the analysis of explicit alignment from Wu and Dredze (2020). We test MMD, statistical KL divergence (i.e., Σ_x p(x) log(p(x)/q(x))), and Euclidean L2 distance as minimization functions and observe that all techniques are significantly weaker (p < 0.01) than their counterparts outlined in Table 3. This contrast suggests the smooth curvature and bounded structure of the probabilistic latent space is critical for alignment: computing exact divergences (e.g., Equation (12)) between unbounded representations leads to numerical underflow instability during training. This impeded alignment against typically reasonable comparisons such as cosine distance. Even MMD, which does not require an exact solution, fared worse without the bounding of the latent variable Z.

Parallelism in Alignment
We further investigate whether MINOTAUR induces cross-lingual transfer when aligning posterior samples from inputs which are not parallel (i.e., x_l is not a translation of x_EN and output LFs are not equivalent). We intuitively expect parallelism to be necessary for the model to minimize divergence between representations with equivalent semantics. As shown in Table 5, data parallelism is surprisingly not required when using MMD to align marginal distributions only. The D_Z|X only and D_Z|X + D_Z techniques significantly under-perform relative to equivalent methods using parallel data (p < 0.01). This is largely expected, because individual alignment between posterior samples which should not be equivalent could inject unnecessary noise into the learning process. However, MMD (D_Z only) is significantly (p < 0.01) above other methods, with performance closest to the parallel equivalent. This supports our interpretation that MMD aligns "at the language level", as minimization between languages should not mandate parallel data. For lower-resource scenarios, this approach could require less parallel data for cross-lingual transfer to the long tail of under-resourced languages.

We also compare encodings from MINOTAUR against MBART50 (Tang et al., 2021) representations before training, and against XG-REPTILE. The significant improvement in cross-lingual cosine similarity using MINOTAUR in Table 6 (p < 0.01) further supports how our proposed method learns improved cross-lingual similarity.
We also consider the most cosine-similar neighbors for each representation and test whether the top-k closest representations are from a parallel utterance in a different language or some other utterance in the same language. Table 6 shows that >99% of representations learned by MINOTAUR have a parallel utterance within the five closest representations, and a ∼50% improvement in mean reciprocal rank (MRR) between parallel utterances. We interpret this as the representation space under MINOTAUR being more semantically distributed relative to MBART50, as representations for a given utterance are closer to their semantic equivalents. We visualize this in Figure 3: the original pre-trained model has minimal cross-lingual overlap, whereas our system produces encodings with similarity aligned by semantics rather than language. MINOTAUR can rapidly adapt the pre-trained representations using an explicit alignment objective to produce a non-trivial, informative latent structure. This formulation could have further utility within multilingual representation learning or information retrieval, e.g., to induce more coherent relationships between cross-lingual semantics.
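The top-k and MRR evaluation described above can be sketched in plain Python. This is a toy illustration with made-up vectors, not the paper's evaluation code:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def reciprocal_rank(query, candidates, parallel_idx):
    """Rank candidates by cosine similarity to the query and return the
    reciprocal rank of the parallel (translated) utterance's encoding."""
    order = sorted(range(len(candidates)),
                   key=lambda i: cosine(query, candidates[i]), reverse=True)
    return 1.0 / (order.index(parallel_idx) + 1)

en = [1.0, 0.1]
candidates = [[0.9, 0.2],    # hypothetical parallel FR encoding (index 0)
              [0.0, 1.0],    # unrelated utterance
              [-1.0, 0.3]]   # unrelated utterance
rr = reciprocal_rank(en, candidates, parallel_idx=0)
assert rr == 1.0  # the parallel encoding is the nearest neighbour
```

Averaging `reciprocal_rank` over all query utterances yields MRR, and a top-k hit simply checks whether `order.index(parallel_idx) < k`.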
Error Analysis We conduct an error analysis on MultiATIS++SQL examples correctly predicted by MINOTAUR and incorrectly predicted by baselines. The primary improvement arises from improved handling of multi-word expressions and language-specific modifiers. For example, adjectives in English are often multi-word adjectival phrases in French (e.g., "cheapest" → "le moins cher" or "earliest" → "au plus tôt"). Improved handling of this error type accounts for an average of 53% of improvement across languages, with the highest in French (69%) and lowest in Chinese (38%). We hypothesize that the combination of aggregate and mean-pooled individual alignment in MINOTAUR benefits this specific case where semantics are expressed in varying numbers of words between languages. While this could be similarly approached using fine-grained token alignment labels, MINOTAUR improves transfer in this context without additional annotation. While this analysis is straightforward for French, it is unclear why the transfer to Chinese is weaker. A potential interpretation is that weaker transfer of multi-word expressions to Chinese could be related to poor tokenization. Suboptimal sub-word tokenization of logographic or information-dense languages is an ongoing debate (Hofmann et al., 2022; Si et al., 2023) and exact explanations require further study. Translation-based models and weaker systems often generate malformed, non-executable SQL. Most additional improvement is due to a 23% boost in generating syntactically well-formed SQL evaluated within a database. Syntactic correctness is critical when a parser encounters a rare entity or unfamiliar linguistic construction and highlights how our model can better navigate inputs from languages minimally observed during training. This could potentially be further improved using recent incremental decoding advancements (Scholak et al., 2021).

Figure 3: Encodings visualized using t-SNE (van der Maaten and Hinton, 2008). Compared to MBART50, MINOTAUR organizes the latent space to be more semantically distributed across languages without monolingual separability.

Conclusion
We propose MINOTAUR, a method for few-shot cross-lingual semantic parsing leveraging Optimal Transport for knowledge transfer between languages. MINOTAUR uses a multi-level posterior alignment signal to enable sample-efficient semantic parsing of languages with few annotated examples. We identify how MINOTAUR aligns individual and aggregate representations to bootstrap parsing capability from English to multiple target languages. Our method is robust to different choices of alignment metrics and does not mandate parallel data for effective cross-lingual transfer. In addition, MINOTAUR learns more semantically distributed and language-agnostic latent representations with verifiably improved semantic similarity, indicating its potential application to improve cross-lingual generalization in a wide range of other tasks.

Figure 1: Upper: We align representations explicitly in the latent representation space, z, between encoder Q and decoder G. Lower: MINOTAUR induces cross-lingual similarity by minimizing divergence between latent distributions at two levels: between individual and aggregate posteriors.
We subsample the number of training instances for low-resource languages following the Samples-per-Intent-and-Slot (SPIS) strategy from Chen et al. (2020), which we adapt to our cross-lingual scenario. SPIS randomly selects examples and keeps those that mention any slot and intent value (e.g., "IN:" and "SL:" from Figure 2) with fewer than some rate in the existing subset.

Figure 2: Sentinel-word inputs with a shared output LF y, e.g., x_EN = "word1 Who word2 attended word3 Yale?" and x_DE = "word1 Wer word2 besuchte word3 Yale?".

The aggregate representation (Equation (6)) is a multi-head pooler from Liu and Lapata (2019).

Table 3.

Table 5: Accuracy on MTOP at 10 SPIS using non-parallel inputs between languages in MINOTAUR. During training, we sample an English input, x_EN, and an input in language l, x_l, which is not a translation of x_EN, for Equation (15). This approach weakens individual posterior alignment but identifies that MMD is the least sensitive to input parallelism.

Table 6: Average similarity between encodings of English and target languages for MultiATIS++SQL. Cosine similarity evaluates the average distance between encodings of parallel sentences. Top-k evaluates whether the parallel encoding is ranked within the k most cosine-similar vectors. Mean Reciprocal Rank (MRR) evaluates the average position of parallel encodings ranked by similarity. Significant best results are bolded (p < 0.01).