Abstract
Cross-lingual semantic parsing transfers parsing capability from a high-resource language (e.g., English) to low-resource languages with scarce training data. Previous work has primarily considered silver-standard data augmentation or zero-shot methods; exploiting few-shot gold data is comparatively unexplored. We propose a new approach to cross-lingual semantic parsing by explicitly minimizing cross-lingual divergence between probabilistic latent variables using Optimal Transport. We demonstrate how this direct guidance improves parsing from natural languages using fewer examples and less training. We evaluate our method on two datasets, MTOP and MultiATIS++SQL, establishing state-of-the-art results under a few-shot cross-lingual regime. Ablation studies further reveal that our method improves performance even without parallel input translations. In addition, we show that our model better captures cross-lingual structure in the latent space to improve semantic representation similarity.1
1 Introduction
Semantic parsing maps natural language utterances to logical form (LF) representations of meaning. As an interface between human- and computer-readable languages, semantic parsers are a critical component in various natural language understanding (NLU) pipelines, including assistant technologies (Kollar et al., 2018), knowledge base question answering (Berant et al., 2013; Liang, 2016), and code generation (Wang et al., 2023).
Recent advances in semantic parsing have led to improved reasoning over challenging questions (Li et al., 2023) and accurate generation of complex queries (Scholak et al., 2021); however, most prior work has focused on English (Kamath and Das, 2019; Qin et al., 2022a). Expanding, or localizing, an English-trained model to additional languages is challenging for several reasons. There is typically little labeled data in the target languages due to high annotation costs. Cross-lingual parsers must also be sensitive to how different languages refer to entities or model abstract and mathematical relationships (Reddy et al., 2017; Hershcovich et al., 2019). Transfer between dissimilar languages can also degrade in multilingual models with insufficient capacity (Pfeiffer et al., 2022).
Previous strategies for resource-efficient localization include generating “silver-standard” training data through machine-translation (Nicosia et al., 2021) or prompting large language models (Rosenbaum et al., 2022). Alternatively, zero-shot models use “gold-standard” external corpora for auxiliary tasks (van der Goot et al., 2021) and few-shot models maximize sample-efficiency using meta-learning (Sherborne and Lapata, 2023). We argue that previous work encourages cross-lingual transfer through implicit alignment only via minimizing silver-standard data perplexity, multi-task ensembling, or constraining gradients.
We instead propose to localize an encoder-decoder semantic parser by explicitly inducing cross-lingual alignment between representations. We present Minotaur (Minimizing Optimal Transport distance for Alignment Under Representations)—a method for cross-lingual semantic parsing which explicitly minimizes distances between probabilistic latent variables to reduce representation divergence across languages (Figure 1). Minotaur leverages Optimal Transport theory (Villani, 2008) to measure and minimize this divergence between English and target languages during episodic few-shot learning. Our hypothesis is that explicit alignment between latent variables can improve knowledge transfer between languages without requiring additional annotations or lexical alignment. We evaluate this hypothesis in a few-shot cross-lingual regime and study how many examples in languages beyond English are needed for “good” performance.
Our technique allows us to precisely measure, and minimize, the cross-lingual transfer gap between languages. This both yields sample-efficient training and establishes leading performance for few-shot cross-lingual transfer on two datasets. We focus our evaluation on semantic parsing, but Minotaur can be applied directly to a wide range of other tasks. Our contributions are as follows:
- We propose a method for learning a semantic parser using explicit cross-lingual alignment between probabilistic latent variables. Minotaur jointly minimizes marginal and conditional posterior divergence for fast and sample-efficient cross-lingual transfer.
- We propose an episodic training scheme for cross-lingual posterior alignment during training which requires minimal modifications to typical learning.
- Experiments on task-oriented semantic parsing (MTOP; Li et al., 2021) and executable semantic parsing (MultiATIS++SQL; Sherborne and Lapata, 2022) demonstrate that Minotaur outperforms prior methods with fewer data resources and faster convergence.
2 Related Work
Cross-lingual Semantic Parsing
Growing interest in cross-lingual NLU has motivated the expansion of benchmarks to study model adaptation across many languages (Hu et al., 2020; Liang et al., 2020). Within executable semantic parsing, ATIS (Hemphill et al., 1990) has been translated into multiple languages such as Chinese and Indonesian (Susanto and Lu, 2017a), and GeoQuery (Zelle and Mooney, 1996) has been translated into German, Greek, and Thai (Jones et al., 2012). Adjacent research in Task-Oriented Spoken Language Understanding (SLU) has given rise to datasets such as MTOP in five languages (Li et al., 2021), and MultiATIS++ in seven languages (Xu et al., 2020). SLU aims to parse inputs into functional representations of dialog acts (which are often embedded in an assistant NLU pipeline) instead of executable machine-readable language.
In all cases, cross-lingual semantic parsing demands fine-grained semantic understanding for successful transfer across languages. Multilingual pre-training (Pires et al., 2019) has the potential to unlock certain understanding capabilities but is often insufficient. Previous methods resort to expensive dataset translation (Jie and Lu, 2014; Susanto and Lu, 2017b) or attempt to mitigate data paucity by creating “silver” standard data through machine translation (Sherborne et al., 2020; Nicosia et al., 2021; Xia and Monti, 2021; Guo et al., 2021) or prompting (Rosenbaum et al., 2022; Shi et al., 2022). However, methods that rely on synthetic data creation have yet to produce cross-lingual parsing on par with using gold-standard professional translation.
Zero-shot methods bypass the need for in-domain data augmentation using multi-task objectives which incorporate gold-standard data for external tasks such as language modeling or dependency parsing (van der Goot et al., 2021; Sherborne and Lapata, 2022; Gritta et al., 2022). Few-shot approaches which leverage a small number of annotations have shown promise in various tasks (Zhao et al., 2021, inter alia) including semantic parsing. Sherborne and Lapata (2023) propose a first-order meta-learning algorithm to train a semantic parser capable of sample-efficient cross-lingual transfer.
Our work is most similar to recent studies on cross-lingual alignment for classification tasks (Wu and Dredze, 2020) and spoken-language understanding using token- and slot-level annotations between parallel inputs (Qin et al., 2022b; Liang et al., 2022). While similar in motivation, we contrast in our exploration of latent variables with parametric alignment for a closed-form solution to cross-lingual transfer. Additionally, our method does not require fine-grained word and phrase alignment annotations, instead inducing alignment in the continuous latent space.
Alignment and Optimal Transport
Optimal Transport (OT; Villani, 2008) minimizes the cost of mapping from one distribution (e.g., utterances) to another (e.g., logical forms) through some joint distribution with conditional independence (Monge, 1781), i.e., a latent variable conditional on samples from one input domain. OT in NLP has mainly used Sinkhorn distances to measure the divergence between non-parametric discrete distributions as an online minimization sub-problem (Cuturi, 2013).
Cross-lingual approaches to OT have been proposed for embedding alignment (Alvarez-Melis and Jaakkola, 2018; Alqahtani et al., 2021), bilingual lexicon induction (Marchisio et al., 2022), and summarization (Nguyen and Luu, 2022). Our method is similar to recent proposals for cross-lingual retrieval using variational or OT-oriented representation alignment (Huang et al., 2023; Wieting et al., 2023). Wang and Wang (2019) consider a “continuous” perspective on OT using the Wasserstein Auto-Encoder (Tolstikhin et al., 2018, Wae) as a language model which respects geometric input characteristics within the latent space.
Our parametric formulation allows this continuous approach to OT, similar to the Wae model. While monolingual prior work in semantic parsing has identified that latent structure can benefit the semantic parsing task (Kočiský et al., 2016; Yin et al., 2018), it does not consider whether it can inform transfer between languages. To the best of our knowledge, we are the first to consider the continuous form of OT for cross-lingual transfer in a sequence-to-sequence task. We formulate the parsing task as a transportation problem in Section 3 and describe how this framework gives rise to explicit cross-lingual alignment in Section 4.
3 Background
3.1 Cross-lingual Semantic Parsing
Given a natural language utterance x, represented as a sequence of tokens x = (x1, …, x|x|), a semantic parser generates a faithful logical-form meaning representation y.2 A typical neural network parser trains on input-output pairs (x, y), using the cross-entropy between the predicted logical form ŷ and the gold-standard logical form y as supervision (Cheng et al., 2019).
Following the standard VAE framework (Kingma and Welling, 2014; Rezende et al., 2014), an encoder Qϕ represents inputs from X as a continuous latent variable Z, i.e., z ∼ Qϕ(Z|X = x). A decoder Gθ predicts outputs conditioned on samples from the latent space, ŷ ∼ Gθ(Y|Z = z). The encoder therefore acts as the approximate posterior Qϕ(Z|X). Qϕ is a multilingual pre-trained encoder shared across all languages.
For cross-lingual transfer, the parser must also generalize to languages from which it has seen few (or zero) training examples.3 Our goal is for the prediction ŷl for input xl ∈ Xl in language l to match the prediction ŷEN for the equivalent input from a high-resource language (typically English), subject to the constraint of far fewer training examples in l (|Xl| ≪ |XEN|). As shown in Figure 1, we propose measuring the divergence between approximate posteriors (i.e., Qϕ(Z|Xl) and Qϕ(Z|XEN)) as the distance between individual samples and an approximation of the “mean” encoding of each language. This goal of aligning distributions naturally fits an Optimal Transport perspective.
3.2 Kantorovich Transportation Problem
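As a reference point, the Kantorovich formulation of Optimal Transport and the Wae relaxation we build on (Tolstikhin et al., 2018) can be written in the following general form, adapted here so that the cost compares logical forms Y with decoded samples Gθ(Z); this is a schematic restatement of the standard objectives rather than a verbatim reproduction.

```latex
% Kantorovich formulation: the optimal transport cost between P_X and P_Y
% is an infimum over couplings \Gamma with the prescribed marginals.
\begin{equation}
  W_c(P_X, P_Y) \;=\; \inf_{\Gamma \in \mathcal{P}(X \sim P_X,\, Y \sim P_Y)}
      \mathbb{E}_{(X,Y) \sim \Gamma}\big[\, c(X, Y) \,\big]
\end{equation}
% Wae-style relaxation for an encoder-decoder: factor the coupling through the
% latent variable Z and penalize the divergence D_Z between the aggregate
% posterior Q_Z and the prior P_Z.
\begin{equation}
  \mathcal{L}_{\textsc{Wae}} \;=\;
      \inf_{Q_\phi}\; \mathbb{E}_{P_X}\, \mathbb{E}_{Q_\phi(Z \mid X)}
      \big[\, c\big(Y, G_\theta(Z)\big) \,\big]
      \;+\; \lambda\, \mathcal{D}_Z\big(Q_Z, P_Z\big)
\end{equation}
```

Here Q_Z = 𝔼_{P_X}[Qϕ(Z|X)] denotes the aggregate (marginal) posterior and P_Z the prior.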
This regularization of the aggregate posterior is how the Wae improves on the evidence lower bound of the variational auto-encoder, where the equivalent alignment applied to each individual posterior Qϕ(Z|X) drives latent representations to zero. Regularizing the marginal posterior instead allows individual posteriors for different samples to remain distinct and non-zero. This limits posterior collapse, guiding Z to remain informative for decoding.
This framework defines a Wae objective using a cost function c to map from PX to PY through the latent variable Z. We now describe how Minotaur integrates explicit posterior alignment during this learning process.
4 Minotaur: Posterior Alignment for Cross-lingual Transfer
Variational Encoder-Decoder
For an input sequence of T tokens, we use a sequence of T latent variables for z rather than pooling into a single representation. This allows for more ‘bandwidth’ in the latent state and minimizes the risk of the decoder ignoring z, i.e., posterior collapse. We find this design choice to be necessary, as lossy pooling leads to weak overall performance. We also use a single variance estimate for the sequence z—this minimizes variance noise across z and simplifies computation in posterior alignment. We follow the convention of an isotropic unit Gaussian prior, P(Z) = N(0, I).
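The sketch below illustrates this latent parameterization: per-token means, a single variance estimate shared across the sequence, and the reparameterization trick. The module name, attention-based pooling, and head count are assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class VariationalEncoder(nn.Module):
    """Minimal sketch of the latent parameterization described above: one
    latent vector per input token plus a single variance estimate shared
    across the sequence, under an isotropic unit-Gaussian prior."""

    def __init__(self, d_model: int = 1024, n_heads: int = 8):
        super().__init__()
        self.mu = nn.Linear(d_model, d_model)            # per-token means
        self.pool = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.pool_query = nn.Parameter(torch.zeros(1, 1, d_model))
        self.logvar = nn.Linear(d_model, 1)              # single log-variance per sequence

    def forward(self, enc_out: torch.Tensor):
        # enc_out: (batch, T, d_model) contextualized encoder states.
        mu = self.mu(enc_out)                                      # (B, T, d)
        query = self.pool_query.expand(enc_out.size(0), -1, -1)    # (B, 1, d)
        pooled, _ = self.pool(query, enc_out, enc_out)             # (B, 1, d)
        logvar = self.logvar(pooled)                               # (B, 1, 1), shared over T
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(mu)                        # reparameterization trick
        return z, mu, logvar
```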
Cross-lingual Alignment
Typical Wae modeling builds meaningful latent structure by aligning the estimated posterior to the prior only. Minotaur extends this by additionally aligning posteriors between languages. Consider learning the optimal mapping from English utterances XEN to logical forms Y within Equation (1) via latent variable Z, from monolingual data {(xEN, y)}. The optimization in Equation (2) converges on an optimal transportation plan ΓEN* as the minimum cost.4
For transfer from English to language l, previous work either requires token alignment between XEN and Xl or exploits the shared Y between XEN and Xl (Qin et al., 2022b, inter alia). We instead induce alignment by explicitly matching Z between languages. Since Y is dependent only on Z, the latent variable offers a continuous representation space for alignment with the minimal and intuitive condition that equivalent z yields equivalent y. Therefore, our proposal is a straightforward extension of learning ΓEN*: we bootstrap the transportation plan for target language l (i.e., Γl*) by aligning on Z in a few-shot learning scenario. Minotaur explicitly aligns Zl (from a target language l) towards Z (from EN) by matching Q(Zl|Xl) to Q(Z|XEN), with the goal of recovering a transportation plan Γl* close to ΓEN*, thereby transferring the learned capabilities from high-resource languages with only a few training examples.
For clarity, we express W2 (see Equation (11)) between single p and q representations for individual tokens; in practice, we minimize the mean of W2 over all token pairs across both sequences, i.e., (1/T1T2) Σi Σj W2(z1,i, z2,j). We observe that minimizing this mean divergence between all pairs is most empirically effective.
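A minimal sketch of this computation follows, assuming diagonal Gaussian token posteriors with a standard deviation shared per sequence (as above); the function names are illustrative.

```python
import torch

def w2_sq_diag_gaussian(mu1, mu2, std1, std2):
    """Closed-form squared 2-Wasserstein distance between diagonal Gaussians
    N(mu1, diag(std1^2)) and N(mu2, diag(std2^2)):
    ||mu1 - mu2||^2 + ||std1 - std2||^2."""
    return ((mu1 - mu2) ** 2).sum(-1) + ((std1 - std2) ** 2).sum(-1)

def mean_pairwise_w2(mu_en, std_en, mu_l, std_l):
    """Mean W2^2 over all token pairs of two token-posterior sequences.
    mu_en: (T1, d), mu_l: (T2, d); std_en, std_l: broadcastable to these
    shapes (e.g., a single shared value per sequence, as described above)."""
    std_en = torch.as_tensor(std_en).expand_as(mu_en)
    std_l = torch.as_tensor(std_l).expand_as(mu_l)
    pair = w2_sq_diag_gaussian(mu_en[:, None, :], mu_l[None, :, :],
                               std_en[:, None, :], std_l[None, :, :])
    return pair.mean()   # average over all (T1 x T2) token pairs
```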
Another perspective on our approach is that we are aligning pushforward distributions, i.e., the images of PXl and PXEN under Qϕ in the latent space Z. Cross-lingual alignment at the input token level (in X) requires fine-grained annotations and is an outstanding research problem (see Section 2). Our method of aligning pushforwards in Z is smoothly continuous, does not require word alignment, and does not always require input utterances to be parallel translations. While we evaluate Minotaur principally on semantic parsing, our framework can extend to any sequence-to-sequence or representation learning task which may benefit from explicit alignment between languages or domains.
5 Experimental Setting
MTOP (Li et al., 2021)
This contains dialog utterances of “assistant” queries and their corresponding tree-structured slot and intent LFs. MTOP is split into 15,667 training, 2,235 validation, and 4,386 test examples in English (EN). A variable subsample of each split is translated into French (FR), Spanish (ES), German (DE), and Hindi (HI). We refer to Li et al. (2021, Table 1) for complete dataset details. As shown in Figure 2, we follow Rosenbaum et al. (2022, Appendix B.2) using “space-joined” tokens and “sentinel words” (i.e., a wordi token is prepended to each input token and replaces this token in the LF) to produce a closed decoder vocabulary (Raman et al., 2022). This allows the output LF to reference input tokens by label without a copy mechanism. We evaluate LF accuracy using the Space and Case Invariant Exact-Match metric (SCIEM; Rosenbaum et al., 2022).
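The sketch below illustrates the sentinel-word idea in the spirit of this preprocessing; the exact marker format, tokenization, and the example utterance are assumptions for illustration only.

```python
def add_sentinels(utterance: str, logical_form: str):
    """Illustrative sketch of sentinel-word preprocessing: prepend a word_i
    marker to each input token and replace that token's surface form in the
    LF so the decoder vocabulary stays closed."""
    tokens = utterance.split()
    marked_input = " ".join(f"word_{i} {tok}" for i, tok in enumerate(tokens))
    marker = {tok: f"word_{i}" for i, tok in enumerate(tokens)}
    marked_lf = " ".join(marker.get(t, t) for t in logical_form.split())
    return marked_input, marked_lf

# Hypothetical MTOP-style example:
src, lf = add_sentinels(
    "play music by beethoven",
    "[IN:PLAY_MUSIC [SL:MUSIC_ARTIST_NAME beethoven ] ]")
# src == "word_0 play word_1 music word_2 by word_3 beethoven"
# lf  == "[IN:PLAY_MUSIC [SL:MUSIC_ARTIST_NAME word_3 ] ]"
```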
Table 1: MTOP test accuracy (SCIEM; %). Avg. is computed over the non-English languages.

| Model | EN | FR | ES | DE | HI | Avg. |
|---|---|---|---|---|---|---|
| Gold Monolingual | 79.4 | 69.8 | 72.3 | 67.1 | 60.5 | 67.4 ± 5.3 |
| Gold Multilingual | 81.3 | 75.7 | 77.2 | 72.8 | 71.6 | 74.4 ± 3.5 |
| Translate-Test | — | 7.7 | 7.4 | 7.6 | 7.3 | 7.5 ± 0.2 |
| Translate-Train Monolingual | — | 41.7 | 31.4 | 50.1 | 32.2 | 38.9 ± 9.4 |
| Translate-Train Multilingual | 74.2 | 46.9 | 43.0 | 53.6 | 39.9 | 45.9 ± 5.9 |
| Translate-Train Multilingual + Minotaur | 77.5 | 59.9 | 60.2 | 61.6 | 42.2 | 56.0 ± 9.2 |
| TaF mT5-large (Nicosia et al., 2021) | 83.5 | 71.1 | 69.6 | 70.5 | 58.1 | 67.3 ± 6.2 |
| TaF mT5-xxl (Nicosia et al., 2021) | 85.9 | 74.0 | 71.5 | 72.4 | 61.9 | 70.0 ± 5.5 |
| CLASP (Rosenbaum et al., 2022) | 84.4 | 72.6 | 68.1 | 66.7 | 58.1 | 66.4 ± 6.1 |
| Minotaur 1 SPIS | 79.5 ± 0.4 | 71.9 ± 0.2 | 72.3 ± 0.1 | 68.4 ± 0.3 | 65.1 ± 0.1 | 69.4 ± 3.4 |
| Minotaur 5 SPIS | 77.7 ± 0.6 | 72.0 ± 0.6 | 73.6 ± 0.3 | 69.1 ± 0.5 | 68.2 ± 0.5 | 70.7 ± 2.5 |
| Minotaur 10 SPIS | 80.2 ± 0.4 | 72.8 ± 0.5 | 74.9 ± 0.1 | 70.0 ± 0.7 | 68.6 ± 0.5 | 71.6 ± 2.8 |
We sample a small number of training instances for low-resource languages, following the Samples-per-Intent-and-Slot (SPIS) strategy from Chen et al. (2020), which we adapt to our cross-lingual scenario. SPIS randomly selects examples and keeps those mentioning any slot or intent value (e.g., “IN:” and “SL:” from Figure 2) that occurs fewer times than the target rate in the subset selected so far. Sampling stops when every slot and intent reaches a minimum frequency of the sampling rate (or its maximum available frequency, if it occurs less often than the sampling rate). SPIS sampling ensures a minimum coverage of all slot and intent types during cross-lingual transfer. This normalizes unbalanced low-resource data, as the model observes approximately similar numbers of examples across all semantic categories. Practically, SPIS rates of 1, 5, and 10 equate to 284 (1.8%), 1,125 (7.2%), and 1,867 (11.9%) examples (% of training data).
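A sketch of this sampling procedure, approximating Chen et al. (2020), is shown below; the function name, `get_labels` interface, and stopping details are assumptions.

```python
import random
from collections import Counter

def spis_sample(examples, get_labels, rate, seed=0):
    """Sketch of Samples-per-Intent-and-Slot (SPIS) sampling. `get_labels(ex)`
    returns the set of intent/slot labels (e.g., {"IN:...", "SL:..."}) in an
    example's logical form."""
    rng = random.Random(seed)
    pool = examples[:]
    rng.shuffle(pool)
    max_counts = Counter()                 # frequency of each label in the full data
    for ex in examples:
        max_counts.update(get_labels(ex))
    counts, subset = Counter(), []
    for ex in pool:
        labels = get_labels(ex)
        # Keep the example if any of its labels is still under-represented.
        if any(counts[l] < min(rate, max_counts[l]) for l in labels):
            subset.append(ex)
            counts.update(labels)
        # Stop once every label reaches the rate (or its maximum frequency).
        if all(counts[l] >= min(rate, max_counts[l]) for l in max_counts):
            break
    return subset
```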
MultiATIS++SQL (Sherborne and Lapata, 2022)
Experiments on ATIS (Hemphill et al., 1990) study cross-lingual transfer using an executable LF to retrieve database information. We use the MultiATIS++SQL version (see Table 2), pairing executable SQL with parallel inputs in English (EN), French (FR), Portuguese (PT), Spanish (ES), German (DE), and Chinese (ZH). We measure denotation accuracy—the proportion of executed predictions retrieving the same database results as executing the gold LF. Data is split into 4,473 training, 493 validation, and 448 test examples with complete translation for all splits. We follow Sherborne and Lapata (2023) in using random sampling: rates of 1%, 5%, and 10% correspond to 45, 224, and 447 examples, respectively. For both datasets, the model observes the remaining data in English only, e.g., sampling at 5% uses 224 multilingual examples and 4,249 English-only examples for training.
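A minimal sketch of the denotation-accuracy check, assuming an SQLite copy of the ATIS database at a hypothetical `db_path`:

```python
import sqlite3
from collections import Counter

def denotation_match(pred_sql: str, gold_sql: str, db_path: str) -> bool:
    """A prediction counts as correct if executing it returns the same rows
    as executing the gold SQL."""
    conn = sqlite3.connect(db_path)
    try:
        try:
            pred_rows = conn.execute(pred_sql).fetchall()
        except sqlite3.Error:
            return False                  # malformed SQL counts as incorrect
        gold_rows = conn.execute(gold_sql).fetchall()
        # Compare result multisets, ignoring row order.
        return Counter(map(tuple, pred_rows)) == Counter(map(tuple, gold_rows))
    finally:
        conn.close()
```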
Modeling
We follow prior work in using a Transformer encoder-decoder: we use the frozen pre-trained 12-layer encoder from mBART50 (Tang et al., 2021) and append an identical learnable layer. The decoder is a six-layer Transformer stack (Vaswani et al., 2017) matching the encoder dimensionality (d = 1,024). Decoder layers are trained from scratch following prior work; early experiments verified that pre-training the decoder did not assist cross-lingual transfer and offered minimal improvement on English. The variance predictor (σ2 for predicting z in Equation (6)) is a multi-head pooler from Liu and Lapata (2019), which adapts multi-head attention to produce a single output from a sequential input. The final model has ∼116 million trainable parameters and ∼340 million frozen parameters.
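A rough PyTorch sketch of this architecture follows, under stated assumptions: the Hugging Face `facebook/mbart-large-50` checkpoint stands in for the mBART50 encoder, the head counts and masking details are guesses, and `VariationalEncoder` refers to the earlier sketch.

```python
import torch
import torch.nn as nn
from transformers import AutoModel  # assumed: Hugging Face Transformers

class MinotaurParser(nn.Module):
    """Rough architectural sketch matching the description above; layer sizes
    follow the text, other details are assumptions."""

    def __init__(self, vocab_size: int, d_model: int = 1024):
        super().__init__()
        # Frozen 12-layer multilingual pre-trained encoder.
        self.pretrained_encoder = AutoModel.from_pretrained(
            "facebook/mbart-large-50").get_encoder()
        for p in self.pretrained_encoder.parameters():
            p.requires_grad = False
        # One additional trainable encoder layer of matching dimensionality.
        self.adapter_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=16, batch_first=True)
        # Latent parameterization (per-token mean, pooled variance; see earlier sketch).
        self.latent = VariationalEncoder(d_model)
        # Six-layer Transformer decoder trained from scratch.
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=d_model, nhead=16, batch_first=True),
            num_layers=6)
        self.out_proj = nn.Linear(d_model, vocab_size)

    def forward(self, src_ids, src_mask, tgt_embeds, tgt_causal_mask):
        enc = self.pretrained_encoder(
            input_ids=src_ids, attention_mask=src_mask).last_hidden_state
        enc = self.adapter_layer(enc, src_key_padding_mask=~src_mask.bool())
        z, mu, logvar = self.latent(enc)
        dec = self.decoder(tgt_embeds, z, tgt_mask=tgt_causal_mask)
        return self.out_proj(dec), mu, logvar
```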
Optimization
We train for a maximum of ten epochs with early stopping on validation loss. Optimization uses Adam (Kingma and Ba, 2015) with a batch size of 256 and a learning rate of 1 × 10−4. We empirically tune hyperparameters to (0.5, 0.01), respectively. During learning, a typical step (without Minotaur alignment) samples a batch of pairs in languages L ∈ {EN, l1, l2, …} from a sampled dataset described above. Each Minotaur step instead uses a sampled batch of parallel data from the same data pool to induce explicit cross-lingual alignment. The episodic learning loop size is tuned to k = 20; if alignment steps are too infrequent then posterior alignment is weaker, and if they are too frequent then overall parsing degrades as the posterior alignment dominates learning. Tokenization uses SentencePiece (Kudo and Richardson, 2018) and beam search prediction uses five hypotheses. All experiments are implemented in PyTorch (Paszke et al., 2019) and AllenNLP (Gardner et al., 2018). Training takes one hour on a single A100 80GB GPU for either dataset.
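The episodic schedule can be sketched as below; the exact loop structure is an assumption, and `parsing_loss` / `alignment_loss` are hypothetical helpers standing in for the cross-entropy parsing objective and the W2 + MMD posterior alignment objective.

```python
from itertools import cycle

def train_epoch(model, parse_loader, parallel_loader, optimizer, k=20):
    """Sketch of the episodic loop: standard supervised parsing steps on
    sampled multilingual batches, with an explicit cross-lingual alignment
    step on a batch of parallel EN/target-language inputs every k steps."""
    parallel_batches = cycle(parallel_loader)
    for step, batch in enumerate(parse_loader):
        if step % k == 0:
            # Alignment episode: minimize posterior divergence (W2 + MMD).
            en_batch, tgt_batch = next(parallel_batches)
            loss = alignment_loss(model, en_batch, tgt_batch)   # hypothetical helper
        else:
            # Standard parsing step: cross-entropy over logical-form tokens.
            loss = parsing_loss(model, batch)                   # hypothetical helper
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```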
Comparison Systems
As an upper-bound, we train the Wae-derived model without low-resource constraints. We report monolingual (one language) and multilingual (all languages) versions of training a model on available data. We use the monolingual upper-bound EN model as a “Translate-Test” comparison. We also compare to monolingual and multilingual “Translate-Train” models to evaluate the value of gold samples compared to silver-standard training data. We follow previous work in using OPUS (Tiedemann, 2012) translations for MTOP and Google Translate (Wu et al., 2016) for MultiATIS++SQL in all directions. Following Rosenbaum et al. (2022), we use a cross-lingual word alignment tool (SimAlign; Jalili Sabet et al., 2020) to project token positions from the MTOP source to the parallel machine-translated output (e.g., to shift label wordi in EN to wordj in FR).
In all results, we report averages of five runs over different few-shot splits. For MTOP, we compare to “silver-standard” methods: “Translate-and-Fill” (Nicosia et al., 2021, TaF), which generates training data using MT, and CLASP (Rosenbaum et al., 2022), which uses MT and prompting to generate multilingual training data. We note that these models and dataset pre-processing methods are not public (we have confirmed with the authors that our methods are reasonably comparable). For MultiATIS++SQL, we compare to XG-Reptile (Sherborne and Lapata, 2023). This method uses meta-learning to approximate a “task manifold” using English data and constrains representations of target languages to be close to this manifold. This approach implicitly optimizes for cross-lingual transfer by regularizing the gradients for target languages to align with gradients for English. Minotaur differs in explicitly measuring the representation divergence across languages.
6 Results
We find that Minotaur validates our hypothesis that explicitly minimizing latent divergence improves cross-lingual transfer with few training examples in the target language. As evidenced by our ablation studies, our technique is surprisingly robust and can function without any parallel data between languages. Overall, our method outperforms silver-standard data augmentation techniques (in Table 1) and few-shot meta-learning (in Table 2).
Table 2: MultiATIS++SQL denotation accuracy (%). Avg. is computed over the non-English languages.

| | Model | EN | FR | PT | ES | DE | ZH | Avg. |
|---|---|---|---|---|---|---|---|---|
| | Gold Monolingual | 72.3 | 73.0 | 71.8 | 67.2 | 73.4 | 73.7 | 71.9 ± 2.7 |
| | Gold Multilingual | 73.7 | 74.4 | 72.3 | 71.7 | 74.6 | 71.3 | 72.9 ± 1.5 |
| | Translate-Test | — | 70.1 | 70.6 | 66.9 | 68.5 | 62.9 | 67.8 ± 3.1 |
| | Translate-Train Monolingual | — | 62.2 | 53.0 | 65.9 | 55.4 | 67.1 | 60.8 ± 6.3 |
| | Translate-Train Multilingual | 72.7 | 69.4 | 67.3 | 66.2 | 65.0 | 69.2 | 67.5 ± 1.9 |
| | Translate-Train Multilingual + Minotaur | 74.8 | 73.7 | 71.3 | 68.5 | 70.1 | 69.0 | 70.6 ± 2.1 |
| @1% | XG-Reptile | 73.8 ± 0.3 | 70.4 ± 1.8 | 70.8 ± 0.7 | 68.9 ± 2.3 | 69.1 ± 1.2 | 68.1 ± 1.2 | 69.5 ± 1.1 |
| | Minotaur | 75.6 ± 0.4 | 73.7 ± 0.6 | 71.4 ± 0.9 | 71.0 ± 0.5 | 70.4 ± 1.3 | 70.0 ± 0.9 | 71.3 ± 1.4 |
| @5% | XG-Reptile | 74.4 ± 1.3 | 73.0 ± 0.9 | 71.6 ± 1.1 | 71.6 ± 0.7 | 71.1 ± 0.6 | 69.5 ± 0.5 | 71.4 ± 1.3 |
| | Minotaur | 77.0 ± 1.0 | 73.9 ± 1.4 | 72.8 ± 1.1 | 71.1 ± 0.6 | 72.8 ± 2.0 | 72.3 ± 0.6 | 72.6 ± 1.0 |
| @10% | XG-Reptile | 75.8 ± 1.3 | 74.2 ± 0.2 | 72.8 ± 0.6 | 72.1 ± 0.7 | 73.0 ± 0.6 | 72.8 ± 0.5 | 73.0 ± 0.8 |
| | Minotaur | 79.8 ± 0.4 | 75.6 ± 1.8 | 75.4 ± 0.8 | 73.2 ± 1.7 | 76.8 ± 1.5 | 72.5 ± 0.7 | 74.7 ± 1.8 |
Cross-lingual Transfer in Task-Oriented Parsing
Table 1 summarizes our results on MTOP against comparison models at multiple SPIS rates. Our system significantly improves on the “Gold Monolingual” upper-bound by >2% even at 1 SPIS (p < 0.01, using a two-tailed sign test here and throughout). For few-shot transfer on MTOP, we observe strong cross-lingual transfer even at 1 SPIS, which translates only 1.8% of the dataset. Few-shot transfer is competitive with a monolingual model using 100% of gold translated data and so represents a promising new strategy for this dataset. We note that even at a high SPIS rate of 100 (∼53.1% of training data), Minotaur remains significantly (p < 0.01) below the “Gold Multilingual” upper-bound, highlighting that few-shot transfer is challenging on MTOP.
Minotaur outperforms all translation-based comparisons, and augmenting “Translate-Train Multilingual” with our posterior alignment objective (+ Minotaur) yields a +10.1% average improvement. With equivalent data, this comparison shows that aligning each latent representation to the prior only (i.e., a Wae-based model) induces weaker cross-lingual transfer than explicitly aligning posteriors across languages.
Comparing to “Silver-Standard” Methods
A more realistic comparison is between TaF (Nicosia et al., 2021) or CLASP (Rosenbaum et al., 2022), which optimize MT quality in their pipelines, and our method, which uses sampled gold data. We outperform CLASP by >3% and TaF using mT5-large (Xue et al., 2021) by >2.1% at all sample rates. However, Minotaur requires sampling at >5 SPIS to improve upon TaF using mT5-xxl. We highlight that our model has only ∼116 million trainable parameters, whereas CLASP uses AlexaTM-500M (FitzGerald et al., 2022) with 500 million parameters, mT5-large has 700 million parameters, and mT5-xxl has 3.3 billion parameters. Relative to model size, our approach offers improved computational efficiency. The improvement of our method is most pronounced for languages typologically distant from English: Minotaur is always the strongest model for Hindi. In contrast, our method underperforms on English and German (the language most similar to English), which may benefit from stronger pre-trained knowledge transfer within larger models. Our efficacy using gold data and a smaller model, compared to silver data in larger models, suggests a quality-versus-computation trade-off worth studying in future work.
Cross-lingual Transfer in Executable Parsing
The results for MultiATIS++SQL in Table 2 show similar trends. However, here Minotaur can outperform the upper-bounds: sampling at >5% significantly (p < 0.01) improves on “Gold Monolingual” and is similar to or better than “Gold Multilingual” (p < 0.05). Further increasing the sample rate yields marginal gains. Minotaur generally improves on XG-Reptile and matches it at a lower sample rate, i.e., Minotaur at 1% sampling is closer to XG-Reptile at 5% sampling. This suggests that our approach is more sample-efficient, achieving greater accuracy with fewer samples. Minotaur requires fewer than ten training epochs, whereas XG-Reptile reports ∼50 epochs for poorer results.
Despite demonstrating overall improvement, Minotaur is not universally superior. Notably, our performance on Chinese (ZH) is weaker than XG-Reptile at 10% sampling, and our method appears to benefit less from additional data in this case. The divergence minimization in Minotaur may be more sensitive to language similarity (dissimilar languages demand minimizing greater distances), whereas alignment via gradient constraints within meta-learning could be less sensitive to this phenomenon. These results, together with the observation that Minotaur improves most on Hindi for MTOP, illustrate a need for more in-depth studies of cross-lingual transfer between distant and lower-resource languages. Future work can consider more challenging benchmarks across a wider pool of languages (Ruder et al., 2023).
Contrasting Alignment Signals
We report ablations of Minotaur on MTOP at 10 SPIS sampling. Table 3 considers each function for cross-lingual alignment outlined in Section 3.2 as an individual or composite element. The best approach, used in all other reported results, minimizes the Wasserstein distance W2 for individual divergence and MMD for aggregate divergence. W2 is significantly superior to the Kullback-Leibler divergence (KL) for minimizing individual posterior divergence (p < 0.01 for both the individual and joint cases). The W2 distance directly minimizes the Euclidean L2 distance when the variances of different languages are equivalent. This in turn is more similar to the Maximum Mean Discrepancy function (the best singular objective), which minimizes the distance between approximate “means” of each distribution, i.e., between marginal distributions over Z. Note that the MMD and W2 alignments are not significantly different (p = 0.08). The W2 + MMD approach significantly outperforms all other combinations (p < 0.01). The strength of MMD, compared to the functions for computing individual divergence, highlights that minimizing aggregate divergence is the main contributor to alignment, with individual divergence as a weaker additional contribution.
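The aggregate term can be sketched as a kernel MMD estimate between batches of latent samples from the two languages; the RBF kernel choice and bandwidth below are assumptions.

```python
import torch

def mmd_rbf(z_en: torch.Tensor, z_l: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Biased RBF-kernel MMD estimate between two sets of latent samples,
    z_en: (N, d) and z_l: (M, d)."""
    def kernel(a, b):
        d2 = torch.cdist(a, b) ** 2                    # pairwise squared distances
        return torch.exp(-d2 / (2 * sigma ** 2))
    return (kernel(z_en, z_en).mean()
            + kernel(z_l, z_l).mean()
            - 2 * kernel(z_en, z_l).mean())
```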
Table 3: MTOP accuracy (SCIEM; %) at 10 SPIS for individual and aggregate alignment functions.

| Individual | Aggregate | EN | FR | ES | DE | HI | Avg. |
|---|---|---|---|---|---|---|---|
| KL | — | 78.3 | 70.6 | 73.1 | 67.0 | 66.6 | 69.3 |
| W2 | — | 78.6 | 72.1 | 74.3 | 68.7 | 67.4 | 70.6 |
| — | MMD | 78.7 | 72.3 | 74.3 | 68.8 | 67.5 | 70.7 |
| KL | MMD | 78.4 | 71.8 | 73.3 | 68.5 | 67.3 | 70.2 |
| W2 | MMD | 80.2 | 72.8 | 74.9 | 70.0 | 68.6 | 71.6 |
Alignment without Latent Variables
Table 4 considers alignment without the latent variable formulation, using an encoder-decoder Transformer model (Vaswani et al., 2017). Here, the output of the encoder is not probabilistically bound without the parametric “guidance” of the Gaussian reparameterization. This is similar to the analysis of explicit alignment from Wu and Dredze (2020). We test MMD, a statistical KL divergence estimate, and Euclidean L2 distance as minimization functions and observe that all techniques are significantly weaker (p < 0.01) than their counterparts in Table 3. This contrast suggests that the smooth curvature and bounded structure of the Z parameterization contribute to effective cross-lingual alignment. Practically, these non-parametric approaches are challenging to implement. The lack of precise divergences (i.e., Equation (13) or Equation (12)) between representations leads to numerical underflow instability during training. This impeded alignment even for reasonable alternatives such as cosine distance. Even MMD, which does not require an exact solution, fared worse without the bounding of the latent variable Z.
Parallelism in Alignment
We further investigate whether Minotaur induces cross-lingual transfer when aligning posterior samples from inputs which are not parallel (i.e., xl is not a translation of xEN and the output LFs are not equivalent). Intuitively, we expect parallelism to be necessary for the model to minimize divergence between representations with equivalent semantics.
As shown in Table 5, data parallelism is surprisingly not required when using MMD to align marginal distributions only. Aligning individual posteriors only (W2) and the combined W2 + MMD objective significantly under-perform relative to their parallel-data equivalents (p < 0.01). This is largely expected, because individual alignment between posterior samples which should not be equivalent can inject unnecessary noise into the learning process. However, MMD alone is significantly (p < 0.01) better than the other non-parallel methods, with performance closest to the parallel equivalent. This supports our interpretation that MMD aligns “at the language level”, as minimization between languages should not mandate parallel data. For lower-resource scenarios, this approach could support cross-lingual transfer to the long tail of under-resourced languages where parallel data is scarce.
Table 5: MTOP accuracy (SCIEM; %) at 10 SPIS when aligning with non-parallel inputs.

| Alignment | EN | FR | ES | DE | HI | Avg. |
|---|---|---|---|---|---|---|
| Parallel Ref. | 80.2 | 72.8 | 74.9 | 70.0 | 68.6 | 71.6 |
| W2 only | 78.9 | 67.3 | 68.3 | 64.6 | 59.4 | 64.9 |
| MMD only | 77.6 | 71.5 | 72.9 | 68.4 | 67.2 | 70.0 |
| W2 + MMD | 78.8 | 70.9 | 71.9 | 67.9 | 64.5 | 68.8 |
Learning a Latent Semantic Structure
We study the representation space learned by our method when training on MultiATIS++SQL at 1% sampling, for direct comparison to a similar analysis from Sherborne and Lapata (2023). We compute sentence representations from the test set as the average of the z representations for each input utterance x. Table 6 compares Minotaur with mBART50 (Tang et al., 2021) representations before training and with XG-Reptile. The significantly higher cross-lingual cosine similarity of Minotaur in Table 6 (p < 0.01) further supports that our proposed method learns better-aligned cross-lingual representations.
Table 6: Cross-lingual similarity of MultiATIS++SQL test representations at 1% sampling.

| Model | Cosine | Top-1 | Top-5 | Top-10 | MRR |
|---|---|---|---|---|---|
| mBART50 | 0.576 | 0.521 | 0.745 | 0.796 | 0.622 |
| XG-Reptile | 0.844 | 0.797 | 0.949 | 0.963 | 0.865 |
| Minotaur | 0.941 | 0.874 | 0.994 | 0.998 | 0.927 |
We also consider the most cosine-similar neighbors for each representation and test whether the top-k closest representations are from a parallel utterance in a different language or some other utterance in the same language. Table 6 shows that >99% of representations learned by Minotaur have a parallel utterance within their five closest neighbors, and a ∼50% improvement over mBART50 in mean reciprocal rank (MRR) for retrieving parallel utterances. We interpret this as evidence that the representation space learned by Minotaur is more semantically distributed relative to mBART50, as representations for a given utterance lie closer to their semantic equivalents. We visualize this in Figure 3: the original pre-trained model has minimal cross-lingual overlap, whereas our system produces encodings whose similarity is aligned by semantics rather than language. Minotaur can rapidly adapt the pre-trained representations using an explicit alignment objective to produce a non-trivial, informative latent structure. This formulation could have further utility within multilingual representation learning or information retrieval, e.g., to induce more coherent relationships between cross-lingual semantics.
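A sketch of this retrieval analysis follows; the exact protocol (e.g., whether same-language neighbors are excluded) and the tensor layout are assumptions.

```python
import torch

def parallel_retrieval_metrics(reps: torch.Tensor, ks=(1, 5, 10)):
    """`reps` is (L, N, d): N parallel test utterances encoded in L languages,
    each sentence representation the mean of its token-level z. For every
    utterance, rank all other utterances by cosine similarity and record the
    rank of the first parallel translation."""
    L, N, d = reps.shape
    flat = torch.nn.functional.normalize(reps.reshape(L * N, d), dim=-1)
    sims = flat @ flat.T
    sims.fill_diagonal_(float("-inf"))              # exclude the query itself
    groups = torch.arange(L * N) % N                # utterance index = parallel group
    parallel = groups[:, None] == groups[None, :]
    parallel.fill_diagonal_(False)
    ranks = sims.argsort(dim=-1, descending=True)
    hit_rank = torch.full((L * N,), float("inf"))
    for q in range(L * N):
        hits = parallel[q][ranks[q]].nonzero()
        hit_rank[q] = hits[0].item() + 1            # 1-based rank of first parallel hit
    topk = {k: (hit_rank <= k).float().mean().item() for k in ks}
    mrr = (1.0 / hit_rank).mean().item()
    return topk, mrr
```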
Error Analysis
We conduct an error analysis on MultiATIS++SQL examples correctly predicted by Minotaur and incorrectly predicted by baselines. The primary improvement arises from better handling of multi-word expressions and language-specific modifiers. For example, adjectives in English are often multi-word adjectival phrases in French (e.g., “cheapest” → “le moins cher” or “earliest” → “au plus tôt”). Improved handling of this error type accounts for 53% of the improvement on average across languages, highest in French (69%) and lowest in Chinese (38%). We hypothesize that the combination of aggregate alignment and mean-pooled individual alignment in Minotaur benefits this specific case, where semantics are expressed in varying numbers of words across languages. While this could be similarly approached using fine-grained token alignment labels, Minotaur improves transfer in this context without additional annotation. While this analysis is straightforward for French, it is unclear why the transfer to Chinese is weaker. A potential interpretation is that weaker transfer of multi-word expressions to Chinese is related to poor tokenization; sub-optimal sub-word tokenization of logographic or information-dense languages is an ongoing debate (Hofmann et al., 2022; Si et al., 2023) and exact explanations require further study. Translation-based models and weaker systems often generate malformed, non-executable SQL. Most of the additional improvement is due to a 23% boost in generating syntactically well-formed SQL when evaluated against a database. Syntactic correctness is critical when a parser encounters a rare entity or unfamiliar linguistic construction, and this improvement highlights how our model better navigates inputs from languages minimally observed during training. This could potentially be improved further using recent incremental decoding advances (Scholak et al., 2021).
7 Conclusion
We propose Minotaur, a method for few-shot cross-lingual semantic parsing leveraging Optimal Transport for knowledge transfer between languages. Minotaur uses a multi-level posterior alignment signal to enable sample-efficient semantic parsing of languages with few annotated examples. We identify how Minotaur aligns individual and aggregate representations to bootstrap parsing capability from English to multiple target languages. Our method is robust to different choices of alignment metrics and does not mandate parallel data for effective cross-lingual transfer. In addition, Minotaur learns more semantically distributed and language-agnostic latent representations with verifiably improved semantic similarity, indicating its potential application to improve cross-lingual generalization in a wide range of other tasks.
Acknowledgments
We thank the action editor and anonymous reviewers for their constructive feedback. The authors also thank Nikita Moghe, Mattia Opper, and N. Siddarth for their insightful comments on earlier versions of this paper. The authors (Sherborne, Lapata) gratefully acknowledge the support of the UK Engineering and Physical Sciences Research Council (grant EP/W002876/1). This work was supported in part by the UKRI Centre for Doctoral Training in Natural Language Processing, funded by the UKRI (grant EP/S022481/1) and the University of Edinburgh (Hosking).
Notes
1. Our code and data are publicly available at github.com/tomsherborne/minotaur.
2. Notation key: capital letters (X, Y, Z) denote random variables; calligraphic (curly) letters denote functional domains; lowercase letters (x, y, z) denote observations; and P{·} denotes a probability distribution.
3. Resource parity between languages corresponds to multilingual semantic parsing, which we view as an upper bound.
4. Γ* is implicit within the model parameters.
References
Action Editor: Xavier Carreras