Abstract
Large multilingual language models typically share their parameters across all languages, which enables cross-lingual task transfer, but learning can also be hindered when training updates from different languages are in conflict. In this article, we propose novel methods for using language-specific subnetworks, which control cross-lingual parameter sharing, to reduce conflicts and increase positive transfer during fine-tuning. We introduce dynamic subnetworks, which are jointly updated with the model, and we combine our methods with meta-learning, an established, but complementary, technique for improving cross-lingual transfer. Finally, we provide extensive analyses of how each of our methods affects the models.
1 Introduction
Large multilingual language models, such as mBERT (Devlin et al. 2019), are pretrained on data covering many languages, but share their parameters across all languages. This modeling approach has several powerful advantages, such as allowing similar languages to exert positive influence on each other, and enabling cross-lingual task transfer (i.e., fine-tuning on some source language(s), then using the model on different target languages) (Pires, Schlinger, and Garrette 2019). These advantages are particularly enticing in low-resource scenarios since without sufficient training data in the target language, the model’s effectiveness hinges on its ability to derive benefit from other languages’ data. In practice, however, even state-of-the-art multilingual models tend to perform poorly on low-resource languages (Lauscher et al. 2020; Üstün et al. 2020), due in part to negative interference effects—parameter updates that help the model on one language, but harm its ability to handle another—which undercut the benefits of multilingual modeling (Arivazhagan et al. 2019; Wang, Lipton, and Tsvetkov 2020; Ansell et al. 2021).
In this article, we propose novel methods for using language-specific subnetworks, which control cross-lingual parameter sharing, to reduce conflicts and increase positive transfer during fine-tuning, with the goal of improving the performance of multilingual language models on low-resource languages. While recent work applies various subnetwork-based approaches to their models statically (Lu et al. 2022; Yang et al. 2022; Nooralahzadeh and Sennrich 2022), we propose a new method that allows the model to dynamically update the subnetworks during fine-tuning. This allows language pairs to share parameters to different extents at different learning stages of the model. We accomplish this by using pruning techniques (Frankle and Carbin 2018) to select an optimal subset of parameters from the full model for further language-specific fine-tuning. Inspired by studies that show that attention heads in BERT-based models have specialized functions (Voita et al. 2019; Htut et al. 2019), we focus on learning subnetworks at the attention-head level. We learn separate—but potentially overlapping—head masks for each language by fine-tuning the model on the language, and then pruning out the least important heads.
Given our focus on low-resource languages, we also combine our methods with meta-learning, a data-efficient technique to learn tasks from a few samples (Finn, Abbeel, and Levine 2017). Motivated by Wang, Lipton, and Tsvetkov (2020), who find that meta-learning can reduce negative interference in the multilingual set-up, we test how much our subnetwork methods can further benefit performance in this learning framework, as well as compare the subnetwork based approach to a meta-learning baseline. Our results show that a combination of meta-learning and dynamic subnetworks is particularly powerful. To the best of our knowledge, we are the first to adapt subnetwork sharing to the meta-learning framework.
We extensively test the effectiveness of our methods on the task of dependency parsing. We use data from Universal Dependencies (UD) (Nivre et al. 2016) comprising 82 datasets covering 70 distinct languages, from 43 language families; 58 of the languages can be considered truly low-resource. Our experiments show, quantitatively, that our language-specific subnetworks, when used during fine-tuning, act as an effective sharing mechanism: permitting positive influence from similar languages, while shielding each language’s parameters from negative interference that would otherwise have been introduced by more distant languages. Moreover, we show substantial improvements in cross-lingual transfer to new languages at test time. Importantly, we are able to achieve this while relying on data from just 8 treebanks before few-shot fine-tuning at test time.
Finally, we perform extensive analyses of our models to better understand how different choices affect generalization properties. We analyze model behavior with respect to several factors: typological relatedness of fine-tuning and test languages, data-scarcity during pretraining, robustness to domain transfer, and their ability to predict rare and unseen labels. We find interesting differences in model behavior that can provide useful guidance on which method to choose based on the properties of the target language.
2 Background and Related Work
2.1 Pruning and Sparse Networks
Frankle and Carbin (2018) were the first to show that neural network pruning (Han et al. 2015; Li et al. 2016) can be used to find a subnetwork that matches the test accuracy of the full network. Later studies confirmed that such subnetworks also exist within (multilingual) BERT (Prasanna, Rogers, and Rumshisky 2020; Budhraja et al. 2021; Li et al. 2022), and that they can even be transferred across different NLP tasks (Chen et al. 2020). While these studies are typically motivated by a desire to find a smaller, faster version of the model (Jiao et al. 2020; Lan et al. 2019; Sanh et al. 2019; Held and Yang 2022; Zhang et al. 2021), we use pruning to find multiple simultaneous subnetworks (one for each fine-tuning language) within the overall multilingual model, which we use during both fine-tuning and inference to guide cross-lingual sharing.
2.2 Selective Parameter Sharing
Naseem, Barzilay, and Globerson (2012) used categorizations from linguistic typology to explicitly share subsets of parameters across separate languages’ dependency parsing models. Large multilingual models have, however, been shown to induce implicit typological properties automatically, and different design decisions (e.g., training strategy) can influence the language relationships they encode (Chi, Hewitt, and Manning 2020; Choenni and Shutova 2022). Rather than attempting to force the model to follow an externally defined typology, we instead take a data-driven approach, using pruning methods to automatically identify the subnetwork of parameters most relevant to each language, and letting subnetwork overlap naturally dictate parameter sharing.
A related line of research aims to control selective sharing by injecting language-specific parameters (Üstün et al. 2020; Wang, Lipton, and Tsvetkov 2020; Le et al. 2021; Ansell et al. 2021; Pfeiffer et al. 2020), which is often realized by inserting adapter modules into the network (Houlsby et al. 2019). Our approach, in contrast, uses subnetwork masking of the existing model parameters to control language interaction.
Lastly, Wang, Lipton, and Tsvetkov (2020) separate language-specific and language-universal parameters within bilingual models, and then meta-train the language-specific parameters only. However, given that we work in a multilingual as opposed to a bilingual setting, most parameters are shared by at least a few languages, and are thus somewhere between purely language-specific and fully universal. Our approach, instead, allows for parameters to be shared among any specific subset of languages.
Analyzing and Training Shared Subnetworks
The idea of sharing through sparse subnetworks was first proposed for multi-task learning (Sun et al. 2020), and was recently studied in the multilingual setting: Foroutan et al. (2022) show that both language-neutral and language-specific subnetworks exist in multilingual models, and Nooralahzadeh and Sennrich (2022) show that training task-specific subnetworks can help in cross-lingual transfer as well.
Moreover, Lin et al. (2021) train multilingual models using language-pair-specific subnetworks for neural machine translation (NMT), and Hendy et al. (2022) build on their work, but use domain-specific subnetworks instead. In both studies, subnetworks are found using magnitude pruning and kept static during training. In addition, while Lin et al. (2021) show that their method can perform well in a zero-shot setting, their strategy for merging masks for new language-pairs relies on the availability of translation data between English and both the source and target language. This makes their approach unsuitable in low-resource scenarios where such resources are not available. In addition, they show that their methods work for unseen language-pairs, but the individual languages are not unseen during training on NMT.
Furthermore, Ansell et al. (2021) learn real-valued (composable) masks instead of binary ones. Thus, instead of fully enabling or disabling parameters, they essentially apply new weights to them, making the workings of these masks more similar to that of adapter modules (Pfeiffer et al. 2020).
Finally, in concurrent work, Lu et al. (2022) show that using language-specific subnetworks at the pretraining stage can mitigate negative interference for speech recognition, and Xu et al. (2022) apply subnetworks during the backward pass only. We instead apply subnetworks during fine-tuning and few-shot fine-tuning at test time, allowing us to both make use of existing pretrained models and apply our models to truly low-resource languages. Moreover, we go beyond existing work by experimenting with structured subnetworks, by allowing subnetworks to dynamically change during fine-tuning, and by extensively analyzing the effects and benefits of our methods.
2.3 Meta-learning
Meta-learning is motivated by the idea that a model can “learn to learn” many tasks from only a few samples. This has been adapted to the multilingual setting by optimizing a model to be able to quickly adapt to new languages: By using meta-learning to fine-tune a multilingual model on a small set of (higher-resource) languages, the model can then be adapted to a new language using only a few examples (Nooralahzadeh et al. 2020). In this work, we use the Model-Agnostic Meta-Learning algorithm (MAML) (Finn, Abbeel, and Levine 2017), which has already proven useful for cross-lingual transfer of NLP tasks (Nooralahzadeh et al. 2020; Wu et al. 2020; Gu et al. 2020), including being applied to dependency parsing by Langedijk et al. (2022), whose approach we follow for our own experiments.
MAML iteratively selects a batch of training tasks, also known as episodes. For each task t, we sample a training dataset Dt that consists of a support set St, used for adaptation, and a query set Qt, used for evaluation. MAML casts the meta-training step as a bilevel optimization problem. Within each episode, the parameters θ of a model fθ are fine-tuned on the support set St of each task t, yielding adapted parameters θ′t; that is, the model adapts to a new task. The adapted model fθ′t is then evaluated on the query set Qt of task t, for all of the tasks in the batch. This adaptation step is referred to as the inner loop of MAML. In the outer loop, the original model fθ is then updated using the gradients of the query-set loss of each fθ′t with respect to the original model parameters θ. MAML strives to learn a good initialization of fθ that allows for quick adaptation to new tasks. This set-up is mimicked at test time, where we again select a support set from the test task for few-shot adaptation, prior to evaluating the model on the remainder of the task data.
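To make the bilevel structure concrete, the sketch below implements one first-order MAML step for a toy regression model; the episode format, loss function, learning rates, and the functional_forward helper are illustrative placeholders rather than the configuration used in our experiments.

```python
import torch
import torch.nn as nn

def functional_forward(x, weights):
    # Toy functional forward pass for a single nn.Linear model (weight, bias).
    w, b = weights
    return x @ w.t() + b

def maml_step(model, episodes, inner_lr=1e-2, outer_lr=1e-3, inner_steps=1):
    """One first-order MAML meta-training step over a batch of episodes.
    Each episode is a ((x_support, y_support), (x_query, y_query)) pair."""
    loss_fn = nn.MSELoss()
    meta_grads = [torch.zeros_like(p) for p in model.parameters()]

    for (x_s, y_s), (x_q, y_q) in episodes:
        # Inner loop: adapt a detached copy of the parameters on the support set.
        fast = [p.detach().clone().requires_grad_(True) for p in model.parameters()]
        for _ in range(inner_steps):
            loss = loss_fn(functional_forward(x_s, fast), y_s)
            grads = torch.autograd.grad(loss, fast)
            fast = [(w - inner_lr * g).detach().requires_grad_(True)
                    for w, g in zip(fast, grads)]
        # Outer loop: evaluate the adapted parameters on the query set and
        # accumulate the (first-order) meta-gradient.
        q_loss = loss_fn(functional_forward(x_q, fast), y_q)
        meta_grads = [m + g for m, g in
                      zip(meta_grads, torch.autograd.grad(q_loss, fast))]

    # Update the shared initialization with the averaged meta-gradient.
    with torch.no_grad():
        for p, g in zip(model.parameters(), meta_grads):
            p -= outer_lr * g / len(episodes)

# Toy usage: 3 episodes of random regression data for a linear model.
model = nn.Linear(4, 1)
episodes = [((torch.randn(10, 4), torch.randn(10, 1)),
             (torch.randn(10, 4), torch.randn(10, 1))) for _ in range(3)]
maml_step(model, episodes)
```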
2.4 Dependency Parsing
In dependency parsing, a model must predict, given an input sentence, a dependency tree: A directed graph of binary, asymmetrical arcs between words. Each arc is labeled with a dependency relation type that holds between the two words, commonly referred to as the head and its dependent.
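As a small illustration, a dependency tree can be represented as a set of labeled head-to-dependent arcs; the toy encoding below (with invented variable names) shows the tree for the English sentence "She reads books":

```python
# Dependency tree for "She reads books", encoded as (head, dependent, relation)
# triples; token 0 is an artificial root, and "reads" is the syntactic head
# of both "She" (nsubj) and "books" (obj).
tokens = ["<root>", "She", "reads", "books"]
arcs = [
    (2, 1, "nsubj"),  # reads -> She
    (0, 2, "root"),   # <root> -> reads
    (2, 3, "obj"),    # reads -> books
]
for head, dep, rel in arcs:
    print(f"{tokens[head]:>7} --{rel}--> {tokens[dep]}")
```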
The UD project has brought forth a dependency formalism that allows for consistent morphosyntactic annotation across typologically diverse languages (Nivre et al. 2016). While UD parsing has received much attention in the NLP community, performance on low-resource languages remains far below that of high-resource languages (Zeman et al. 2018). State-of-the-art multilingual parsers generally exploit a pretrained multilingual language model with a deep biaffine parser (Dozat and Manning 2016) on top. The model is then fine-tuned on data (typically) from high-resource languages. This fine-tuning stage has been performed on English data only (Wu et al. 2020), or multiple languages (Tran and Bisazza 2019).
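For illustration, the sketch below shows a generic biaffine arc scorer of the kind introduced by Dozat and Manning (2016): two small MLPs produce dependent and head representations, and a biaffine product scores every candidate head for every token. The class name and hidden sizes are illustrative, not those of any specific parser discussed here.

```python
import torch
import torch.nn as nn

class BiaffineArcScorer(nn.Module):
    """Scores s[b, i, j]: how likely token j is the head of token i."""

    def __init__(self, enc_dim=768, arc_dim=128):
        super().__init__()
        self.dep_mlp = nn.Sequential(nn.Linear(enc_dim, arc_dim), nn.ReLU())
        self.head_mlp = nn.Sequential(nn.Linear(enc_dim, arc_dim), nn.ReLU())
        # Biaffine weight; the extra row acts as a bias for the head side.
        self.U = nn.Parameter(torch.empty(arc_dim + 1, arc_dim))
        nn.init.xavier_uniform_(self.U)

    def forward(self, enc):                          # enc: (batch, seq, enc_dim)
        dep = self.dep_mlp(enc)                      # (batch, seq, arc_dim)
        head = self.head_mlp(enc)                    # (batch, seq, arc_dim)
        head = torch.cat([head, torch.ones_like(head[..., :1])], dim=-1)
        # s[b, i, j] = sum_{d, h} dep[b, i, d] * U[h, d] * head[b, j, h]
        return torch.einsum("bid,hd,bjh->bij", dep, self.U, head)

scores = BiaffineArcScorer()(torch.randn(2, 5, 768))  # (2, 5, 5) arc scores
```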
UDify (Kondratyuk and Straka 2019) takes multilingual fine-tuning a step further and is fine-tuned on all available training sets together (covering 75 languages). Moreover, they use a multi-task training objective that combines parsing with predicting part-of-speech tags, morphological features, and lemmas.
On the modeling side, previous studies have attempted to exploit knowledge from the field of Linguistic Typology to further improve upon this training paradigm. For instance, UDapter (Üstün et al. 2020) is trained on 13 languages using the same set-up as UDify, but freezes mBERT’s parameters and trains language-specific adapter modules. It induces typological guidance by taking language embeddings predicted from typological features as input. In a related study, Choudhary (2021) tries to induce typological knowledge into UDify by using typology prediction as an auxiliary task instead.
Other studies have taken a data-centric approach instead. van der Goot et al. (2021) propose MACHAMP, a toolkit for multi-task learning of a variety of NLP tasks, including dependency parsing. While using a similar architecture to existing literature, they show that they can further improve performance by resampling datasets according to a multinomial distribution on the batch level to prevent larger datasets from overwhelming the model. In addition, Glavaš and Vulić (2021) propose hierarchical source selection, a model-agnostic method for finding the optimal subset of UD treebanks for cross-lingual transfer to a specific target language.
3 Data
We use data from Universal Dependencies v2.9 and test on 82 datasets covering 70 unique and highly typologically diverse languages belonging to 19 language families and 43 subfamilies. We consider 54 of these languages to be extremely low-resource, as fewer than 31 training samples are available for them. For the other 28 languages, 50% have approximately 150–2K training samples and the other 50% have 2K–15K samples available. In total, our test data contains 233 possible arc labels. We use 8 high-resource languages for fine-tuning, based on the selection used by Langedijk et al. (2022) and Tran and Bisazza (2019): English, Arabic, Czech, Estonian,[2] Hindi, Italian, Norwegian, and Russian.
4 Methodology
In §4.1–4.2 we describe the model that will be used throughout our experiments and the training strategy. In §4.3 we then explain how we define and select subnetworks, and how we apply them to our models. In §4.4 we explain how our approach is adapted to the meta-learning setting, and in §4.5–4.6 we describe our test set-up and baselines.
4.1 Model
4.2 Training Procedure
Taking inspiration from Nooralahzadeh et al. (2020) for cross-lingual transfer to low-resource languages, our training procedure is split into two stages: (1) fine-tune on the full English training set (∼12.5K samples), without applying any subnetwork restrictions, for 60 epochs, to provide the full model with a general understanding of the task; and (2) fine-tune on the 7 other high-resource languages, to give the model a broad view over a typologically diverse set of languages in order to facilitate cross-lingual transfer to new languages.
For stage 2, in each iteration, we sample a batch from each language and average the losses of all languages to update the model. During this stage, we restrict each language's forward pass and updates to just the parameters in that language's subnetwork. We perform 1,000 iterations, with a batch of size 20 from each of the 7 languages, for a total of 140K samples.
We use a cosine-based learning rate scheduler with 10% warm-up and the Adam optimizer (Kingma and Ba 2015), with separate learning rates for updating the encoder and the classifier (see Appendix A, Table 9 for details).
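To make the sharing mechanism concrete, the sketch below outlines one stage-2 iteration as described above: one batch per language, losses averaged, and each language's loss computed under its own head mask (applied as illustrated in §4.3). The compute_loss interface is a simplified placeholder, not our actual implementation.

```python
import torch

def stage2_step(optimizer, compute_loss, lang_batches, subnetworks):
    """One stage-2 iteration: one batch per language, losses averaged.

    compute_loss(batch, head_mask) should run the parser on `batch` with the
    given binary head mask applied and return a scalar loss (a placeholder
    interface standing in for the actual model call).
    """
    optimizer.zero_grad()
    losses = [compute_loss(batch, subnetworks[lang])
              for lang, batch in lang_batches.items()]
    loss = torch.stack(losses).mean()   # average the losses over the 7 languages
    loss.backward()                     # masked heads receive zero gradient
    optimizer.step()
    return loss.item()
```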
4.3 Subnetwork Masks
We represent language-specific subnetworks as masks that are applied to the model in order to ensure that only a subset of the model’s parameters are activated (or updated) during fine-tuning and inference. We follow Prasanna, Rogers, and Rumshisky (2020) in using structured masks, treating entire attention heads as units which are always fully enabled or disabled. Thus, for language ℓ, its subnetwork is implemented as a binary mask ξℓ ∈ {0,1}^{12×12}, with one entry per attention head across 12 layers and 12 heads per layer.
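In practice, such a mask can be applied by zeroing the outputs of disabled heads. For instance, HuggingFace's BERT implementation accepts a head_mask argument of shape (layers, heads) that nullifies selected self-attention heads, so a masked forward pass can be sketched as follows (the specific mask entries are illustrative):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Binary subnetwork mask for one language: 12 layers x 12 heads (1 = keep head).
subnetwork = torch.ones(12, 12)
subnetwork[3, 7] = 0.0   # e.g., disable head 7 in layer 3 (illustrative)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

inputs = tokenizer("A short example sentence.", return_tensors="pt")
# head_mask zeroes the outputs of disabled heads, so they contribute nothing
# to the forward pass and receive no gradient in the backward pass.
outputs = model(**inputs, head_mask=subnetwork)
hidden_states = outputs.last_hidden_state   # fed to the biaffine parser
```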
In our experiments, we present two ways of using the masks during fine-tuning: statically, in which we find initial masks based on the pretrained model parameters and hold those masks fixed throughout fine-tuning and inference (SNstatic); and dynamically, in which we update those masks over the course of fine-tuning (SNdyna). In Figure 1, we give a general overview of our training procedure.
4.3.1 Finding Initial Subnetwork Masks
We aim to find a mask for each of the 7 fine-tuning languages that prunes away as many heads as possible without harming performance for that language (i.e., by pruning away heads that are only used by other languages, or that are unrelated to the dependency parsing task). For this, we apply the procedure introduced by Michel, Levy, and Neubig (2019).
Consistent with findings from Prasanna, Rogers, and Rumshisky (2020), we observed that the subnetworks found by the procedure are unstable across different random initializations. To ensure that the subnetwork we end up with is more robust to these variations, we repeat the pruning procedure with 4 random seeds, and take the union[3] of their results as the true subnetwork (i.e., it includes even those heads that were only sometimes found to be important).
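A rough sketch of this procedure is given below: head importance is estimated from the gradient of the loss with respect to a head mask (in the spirit of Michel, Levy, and Neubig 2019), the least important heads are disabled, and the masks from several seeds are combined by taking their union. The compute_loss interface is a placeholder, and the outer loop that keeps pruning while development performance stays acceptable is omitted.

```python
import torch

def head_importance(model, batches, compute_loss, n_layers=12, n_heads=12):
    """Accumulate |d loss / d head_mask| as a per-head importance score;
    compute_loss is a placeholder that must pass head_mask into the model."""
    importance = torch.zeros(n_layers, n_heads)
    for batch in batches:
        head_mask = torch.ones(n_layers, n_heads, requires_grad=True)
        compute_loss(model, batch, head_mask).backward()
        importance += head_mask.grad.abs()
    return importance / len(batches)

def prune_least_important(importance, n_prune):
    """Binary mask (1 = keep) that disables the n_prune least important heads."""
    mask = torch.ones_like(importance).view(-1)
    mask[importance.view(-1).argsort()[:n_prune]] = 0.0
    return mask.view_as(importance)

def union_over_seeds(masks):
    """Keep a head if the pruning run of *any* seed kept it."""
    return (torch.stack(masks).sum(dim=0) > 0).float()
```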
4.3.2 Dynamically Adapting Subnetworks
Blevins, Gonen, and Zettlemoyer (2022) showed that multilingual models acquire linguistic knowledge progressively—lower-level syntax is learned prior to higher-level syntax, and then semantics—but that the order in which the model learns to transfer information between specific languages varies. As such, the optimal set of parameters to share may depend on what learning stage the model is in, or on other factors, for example, the domains of the specific training datasets, the amounts of data available, the complexity of the language with respect to the task, and so forth. Thus, we propose a dynamic approach to subnetwork sharing, in which each language’s subnetwork mask is trained jointly with the model during fine-tuning. This allows the subnetwork masks to be improved, and also allows for different patterns of sharing at different points during fine-tuning.
For dynamic adaptation, we initialize the mask weights from the static subnetworks identified as described in §4.3.1, using small positive values. We then allow the model to update the mask weights during fine-tuning. After each iteration, the learned weights are fed to a threshold function that sets the smallest 20% of weights to zero (i.e., 28 heads[4]) to obtain a binary mask again. Given that the derivative of a threshold function is zero, we use a straight-through estimator (Bengio, Léonard, and Courville 2013) in the backward pass, meaning that we ignore the derivative of the threshold function and pass the incoming gradient on as if the threshold function were an identity function.
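A minimal sketch of this mechanism is shown below: real-valued mask weights are trained jointly with the model, a threshold keeps only the largest 80% of them in the forward pass, and a straight-through estimator passes gradients through the threshold unchanged. The class names and the exact initialization values are illustrative.

```python
import torch
import torch.nn as nn

class StraightThroughThreshold(torch.autograd.Function):
    """Binarize mask weights by dropping the smallest drop_frac fraction;
    in the backward pass, gradients pass through as if this were the identity."""

    @staticmethod
    def forward(ctx, weights, drop_frac=0.2):
        flat = weights.view(-1)
        n_drop = int(drop_frac * flat.numel())      # e.g., 28 of 144 heads
        mask = torch.ones_like(flat)
        mask[flat.argsort()[:n_drop]] = 0.0
        return mask.view_as(weights)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None                    # straight-through estimator

class DynamicHeadMask(nn.Module):
    def __init__(self, init_mask, init_value=0.01):
        super().__init__()
        # One possible initialization from the static subnetwork: kept heads
        # get a small positive weight, pruned heads start at zero.
        self.weights = nn.Parameter(init_mask.float() * init_value)

    def forward(self):
        return StraightThroughThreshold.apply(self.weights)

# Usage: the binary mask is recomputed each iteration and fed to the model as a
# head_mask (see the masking sketch above), while self.weights receive gradients.
dyn_mask = DynamicHeadMask(torch.ones(12, 12))
binary_mask = dyn_mask()        # (12, 12) tensor of zeros and ones
```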
4.4 Meta-learning with Subnetworks
Meta-learning for multilingual models has been shown to enable both quick adaptation to unseen languages (Langedijk et al. 2022) and mitigation of negative interference (Wang, Lipton, and Tsvetkov 2020), but it does so using techniques that are different from—though compatible with—our subnetwork-sharing approach. Therefore, we experiment with the combination of these methods, and test the extent to which their benefits are complementary (as opposed to redundant) in practice.
To integrate our subnetworks within a meta-learning set-up, we just have to apply them in the inner loop of MAML; that is, given a model f parameterized by θ, we train θ by optimizing for the performance of the learner model of a language ℓ masked with the corresponding subnetwork ξℓ. See Algorithm 1 for the details of the procedure.[5]
For all meta-learning experiments, we train for 500 episodes with support and query sets of size 20, that is, 10K samples per language are used for meta-training and validation each. We use 20 inner loop updates (k) and we follow Finn, Abbeel, and Levine (2017) in using SGD for updating the learner. All other training details are kept consistent with the non-episodic (NonEp) models (as described in §4.2).
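The sketch below is not Algorithm 1 itself, but a simplified first-order view of where the subnetwork enters: each language's inner-loop adaptation and its query evaluation are both run under that language's head mask. The sample_episode and masked_loss interfaces, and the default learning rate, are placeholders.

```python
import torch

def meta_step_with_subnetworks(theta, languages, sample_episode, masked_loss,
                               subnetworks, inner_lr=1e-4, k=20):
    """Returns the averaged (first-order) meta-gradient for one meta-training step.

    masked_loss(params, batch, head_mask) is a placeholder interface that runs
    the parser with the given parameters under the given binary head mask and
    returns a scalar loss; sample_episode(lang) returns a (support, query) pair.
    """
    meta_grads = [torch.zeros_like(p) for p in theta]
    for lang in languages:
        support, query = sample_episode(lang)
        mask = subnetworks[lang]                       # the language's subnetwork
        # Inner loop: k SGD steps on the support set, restricted to the subnetwork.
        fast = [p.detach().clone().requires_grad_(True) for p in theta]
        for _ in range(k):
            grads = torch.autograd.grad(masked_loss(fast, support, mask), fast)
            fast = [(w - inner_lr * g).detach().requires_grad_(True)
                    for w, g in zip(fast, grads)]
        # Outer gradient: query loss of the adapted learner, still masked.
        grads = torch.autograd.grad(masked_loss(fast, query, mask), fast)
        meta_grads = [m + g for m, g in zip(meta_grads, grads)]
    return [g / len(languages) for g in meta_grads]
```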
4.5 Few-shot Fine-tuning at Test Time
Because the primary goal of this work is to improve performance in low-resource scenarios, we evaluate our models using a set-up that is appropriate when there is almost no annotated data in the target language: Few-shot fine-tuning. For a given test language, the model is fine-tuned on just 20 examples in that language, using 20 gradient updates. The examples are drawn from the development set, if there is one; otherwise they are drawn from (and removed from) the test set. We use the same hyperparameter values as during training. We report Labeled Attachment Scores (LAS) averaged across 5 random seeds, as computed by the official CoNLL 2018 Shared Task evaluation script.
Since we do not have subnetworks for the test languages—only for the 7 high-resource languages used in stage 2 of fine-tuning (§4.2)—we instead use the subnetwork of the typologically most similar training language. We determine typological similarity by computing the cosine similarity between the language vectors from the URIEL database (syntax_knn) (Littell et al. 2017).
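A sketch of this selection step, assuming the URIEL syntax_knn vectors have already been retrieved (e.g., via the lang2vec package) and are stored in a dictionary; the toy vectors in the usage example are made up.

```python
import numpy as np

def most_similar_training_language(target_vec, train_vecs):
    """Pick the training language whose URIEL syntax_knn vector has the highest
    cosine similarity with the target language's vector.

    train_vecs: dict mapping language code -> 1-D numpy feature vector.
    """
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(train_vecs, key=lambda lang: cosine(target_vec, train_vecs[lang]))

# Illustrative toy vectors only; real vectors come from the URIEL database.
train_vecs = {"rus": np.array([1.0, 0.0, 1.0]), "ita": np.array([0.0, 1.0, 1.0])}
print(most_similar_training_language(np.array([1.0, 0.0, 0.9]), train_vecs))  # rus
```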
4.6 Baselines
To measure the effectiveness of our subnetwork-based methods, we train and evaluate baselines in which no subnetwork masking is applied (but for which all other details of the training and testing set-ups are kept unchanged). We refer to this as full model training (Full) to contrast with our training approaches that use static or dynamic subnetworks (SNstatic and SNdyna), and we report these baselines for both the non-episodic (NonEp)[7] and meta-learning (Meta) frameworks. For a fair comparison to existing literature, we also re-train UDify on dependency parsing using only our 8 treebanks for training and perform few-shot fine-tuning at test time as was done for all other models (Udf8).
5 Results
Overall, the results show that our subnetwork-based methods yield improvements over baseline models trained without any subnetwork masking. In Table 1, we see that, based on average LAS scores across all test languages, static subnetworks (SNstatic) perform best in the non-episodic training set-up, resulting in +2.8% average improvement over the Full baseline, and yielding the highest average LAS of all the models. Dynamic subnetworks (SNdyna), on the other hand, exhibit superior performance in the meta-learning setting, resulting in the model that performed best across all settings for the largest number of languages. In Table 2, we report the full set of results on all 82 test languages.
Table 1: Average LAS over all test languages, and the percentage of test languages for which each model performs best across all settings (Best%), per learning framework.

|  |  | Full | SNstatic | SNdyna | Total |
|---|---|---|---|---|---|
| NonEp | LAS | 38.49 | 41.32 | 40.0 | |
| | Best% | 0% | 22% | 8.5% | 30.5% |
| Meta | LAS | 40.68 | 40.27 | 40.89 | |
| | Best% | 14.5% | 27% | 28% | 69.5% |
To gain more insight into the effects of our methods across the test languages, we plot the distribution over performance changes compared with the baseline per method and learning framework in Figure 2. We find that static and dynamic subnetworks exhibit opposite trends. NonEp-SNstatic achieves large gains (up to +25%), but can also cause more deterioration on other languages (up to −6%). In contrast, the performance change distribution for NonEp-SNdyna is centered around more modest improvements, but is also the safest option given that it deteriorates performance for the fewest languages. The same trade-off can be observed in the meta-learning framework, except that now Meta-SNstatic results in modest changes compared with Meta-SNdyna.
Lastly, we do not find strong trends for transfer languages; different magnitudes of performance changes are scattered across all transfer languages. Yet, when transferring from Norwegian, Meta-SNstatic and Meta-SNdyna particularly often underperform compared with Meta-Full; see Table 2. In contrast, Meta-SNdyna performs particularly well when transferring from Arabic; similarly, SNstatic performs especially well when transferring from Czech. Thus, the best approach might be dependent on the relationship between the transfer and test languages, or the properties of the transfer language itself.
We note that despite the observed improvements, overall performance remains low for many languages. Yet we would like to point out that we also find instances where our methods might already make the difference in acquiring a usable system compared with state-of-the-art models. For example, even with few-shot fine-tuning, Udf75 (UDify fine-tuned on all 75 languages) only reaches 53.8% on Faroese OFT, which is much lower than our 70.4% (NonEp-SNstatic), and for Indonesian PUD it reaches 69.0% versus our 74.9% (NonEp-SNstatic).
6 Analysis
In this section, we provide more insight into the effects of our methods by analyzing performance with respect to four factors: typological relatedness, data-scarcity, robustness to domain transfer, and ability to predict unseen and rare labels. We focus on the best model from each learning framework: NonEp-SNstatic and Meta-SNdyna.
Typological Relatedness
The languages most similar to a low-resource language are often themselves low-resource, meaning that a low-resource language is often quite dissimilar from all the languages that are resource-rich enough to be used for fine-tuning. A method that only works well when a very similar high-resource language is available for fine-tuning will not be as useful in practice. Thus, we want to understand the degree to which our methods depend on similarity to a high-resource fine-tuning language. In Figure 3 (top), we plot each test language’s performance improvement against its typological closeness to the nearest high-resource fine-tuning language, where that closeness is computed using the cosine similarity between the languages’ URIEL features. Interestingly, we find that our models show opposite trends: Whereas NonEp-SNstatic works well for typologically similar languages, the biggest gains from Meta-SNdyna actually come from less similar languages.
Data Scarcity
Given that language distribution in the mBERT pretraining corpus is very uneven, and 37 of our 70 unique test languages are not covered at all, we want to understand what effect this has on downstream model performance. As shown in Figure 3 (middle), we find that Meta-SNdyna provides the most benefit to previously unseen languages. In contrast, more data in pretraining positively correlates with the performance of NonEp-SNstatic.
Out-of-domain Data
For cross-lingual transfer we often focus on the linguistic properties of source and target languages. However, the similarity of the source and target datasets will also be based on the domains from which they were drawn (Glavaš and Vulić 2021). For example, our training datasets cover only 11 of 17 domains, as annotated by the creators of the UD treebanks. While we acknowledge that it is difficult to neatly separate data based on source domain, we test for a correlation between performance and the proportion of out-of-domain data. Interestingly, we find no clear correlation with the percentage of domains from the test language covered by the transfer language. We do, however, find a strong correlation with the overall domain diversity of the transfer and test languages for NonEp-SNstatic, as shown in Figure 3 (bottom), where we plot improvements against the number of domains from which the test data is drawn (more sources → more diversity). In contrast, we see that Meta-SNdyna remains insensitive to this variable.
Unseen and Rare Labels
Lastly, another problem in cross-lingual transfer, especially when fine-tuning on only a few languages, is that the fine-tuning data may not cover the entire space of possible labels from our test data. In principle, only a model that is able to adequately adapt to unseen and rare labels can truly succeed in cross-lingual transfer. Given that we perform few-shot fine-tuning at test time, we could potentially overcome this problem (Lauscher et al. 2020). Thus, we investigate the extent to which our models succeed in predicting such labels for our test data. We consider a label to be rare when it is covered by our training data, but makes up <0.1% of training instances (23 such labels). There are 169 unseen labels, thus in total, 192 of 233 (82%) of the labels from our test data are rare or unseen during training. In Table 3, we report how often each model correctly predicts instances of unseen and rare labels. We find that models differ greatly, and, in particular, Meta-SNdyna vastly outperforms all other models when it comes to both unseen and rare labels. Upon further inspection, we find that two unseen labels are particularly often predicted correctly: sentence particle (discourse:sp) and inflectional dependency (dep:infl). The former label seems specific to Chinese linguistics and has a wide range of functions (e.g., modifying the modality of a sentence or its proposition, and expressing discourse and pragmatic information). The latter represents inflectional suffixes for the morpheme-level annotations, something that is unlikely to be observed in morphologically poor languages such as English; but, for instance, Yupik has much of its performance boost due to it.
Table 3: How often each model correctly predicts instances of unseen and rare labels.

| NonEp- | Full | SNstatic | SNdyna |
|---|---|---|---|
| Unseen | 0.04% (3/3) | 0.003% (1/1) | 0.004% (2/2) |
| Rare | 12.5% (12/50) | 6.4% (11/41) | 9.9% (8/49) |
| Meta- | Full | SNstatic | SNdyna |
| Unseen | 0% | 0% | 6.6% (15/23) |
| Rare | 3.5% (10/39) | 3.0% (7/36) | 21.3% (13/55) |
7 Effect of Subnetworks at Training Time
7.1 Interaction Between Subnetworks
We now further investigate the selected subnetworks and their impact during training. Our findings were similar for meta-learning, so we just focus our analysis here on the non-episodic models.
Table 4 shows how using subnetworks affects performance on the training languages. Training with the subnetworks always improves performance; however, this effect is larger when subnetworks are kept static during training. Moreover, for the static subnetworks, the number of heads that are masked out can vary considerably per language; for example, for Arabic we only disable 13 heads compared with 37 for Estonian. Yet, we observe similar effects on performance, obtaining ∼+4% improvement for both languages. To disentangle how much of the performance gain comes from disabling suboptimal heads vs. protection from negative interference by other languages, we re-train NonEp-SNstatic in two ways using Czech as a test case: (1) we keep updates from Czech restricted to its subnetwork (i.e., we disable the suboptimal heads for Czech), but drop subnetwork masking for the other languages (i.e., we do not protect Czech from negative interference as all other languages can still update the full model); (2) we use subnetworks for all languages except Czech (i.e., we protect Czech from the other languages by restricting their updates to their subnetworks only, but still allow Czech to use the full model capacity).
Table 4: LAS on the training languages for each method; the number of attention heads masked out is given in parentheses.

| Language | Full | SNstatic | SNdyna |
|---|---|---|---|
| Arabic | 68.6 | 72.9 (13) | 69.1 (28) |
| Czech | 75.4 | 81.2 (13) | 77.9 (28) |
| Estonian | 65.4 | 69.2 (37) | 68.3 (28) |
| Hindi | 74.4 | 77.2 (21) | 75.2 (28) |
| Italian | 85.0 | 87.7 (23) | 86.1 (28) |
| Norwegian | 73.2 | 79.8 (24) | 73.6 (28) |
| Russian | 79.5 | 81.6 (27) | 80.4 (28) |
We find that (1) disabling the suboptimal heads for Czech only results in 79.5 LAS on Czech (a +4.1% improvement over the baseline), while (2) protection from the other languages alone results in 80.1 LAS (a +4.7% improvement); see Table 5 for results on the other training languages. This indicates that protection from negative interference has a slightly larger positive effect on the training language in this case. Still, a combination of both (i.e., using subnetworks for all fine-tuning languages) results in the best performance in most cases (81.2 LAS for Czech, a +5.9% improvement, as reported in Table 4). This suggests that the interaction between the subnetworks is a driving factor behind the selective sharing mechanism that resolves language conflicts. We confirm that similar trends were found for the other languages.
Table 5: LAS on the training languages when only that language's own subnetwork is used (Selection) versus when subnetworks are used only for all other languages (Protection); improvements over the Full baseline are given in parentheses.

| Language | Full | Selection | Protection |
|---|---|---|---|
| Arabic | 68.6 | 71.2 (+2.5) | 72.2 (+3.6) |
| Czech | 75.4 | 79.5 (+4.1) | 80.1 (+4.7) |
| Estonian | 65.4 | 67.9 (+2.5) | 67.9 (+2.5) |
| Hindi | 74.4 | 76.8 (+2.4) | 76.7 (+2.3) |
| Italian | 85.0 | 86.8 (+1.8) | 87.3 (+2.3) |
| Norwegian | 73.2 | 80.2 (+7.1) | 80.3 (+7.2) |
| Russian | 79.5 | 80.5 (+1.0) | 80.4 (+0.9) |
This, however, also means that if the quality of one subnetwork is suboptimal, it is still likely to negatively affect other languages. Moreover, analyzing the subnetworks can provide insights on language conflicts. For instance, using a subnetwork for only Czech or Arabic results in the biggest performance gains for Norwegian (+7.1% and +7.3% compared with the Full baseline), indicating that, in this set-up, Norwegian suffers more from interference.
7.2 Gradient Conflicts and Similarity
In multilingual learning, we aim to maximize knowledge transfer between languages while minimizing negative transfer between them. In this study, our main goal is the latter. To evaluate the extent to which our methods succeed in doing this, we explicitly test whether we are able to mitigate negative interference by adopting the gradient conflict measure from Yu et al. (2020). They show that conflicting gradients between dissimilar tasks, defined as a negative cosine similarity between the gradients, are predictive of negative interference in multi-task learning. Similar to Wang, Lipton, and Tsvetkov (2020), we deploy this method in the multilingual setting: We study how often gradient conflicts occur between batches from different languages. For batches from each language, we compute the gradient of the loss with respect to the parameters of the full model during backpropagation. To get a stable estimate, we use gradient accumulation for 50 episodes/iterations before computing conflicts. Gradient conflicts are then computed between each language pair (21 pairs in total for our 7 fine-tuning languages), and the percentage of conflicts is computed across all language pairs.
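The statistic itself can be sketched as follows, assuming the accumulated per-language gradients have already been flattened into vectors; the language codes in the toy usage are just placeholders.

```python
import itertools
import torch

def gradient_conflict_stats(lang_grads):
    """lang_grads: dict mapping language -> flattened accumulated gradient tensor.
    Returns the fraction of language pairs with conflicting (negative-cosine)
    gradients, and the average pairwise cosine similarity."""
    sims = []
    for a, b in itertools.combinations(lang_grads, 2):     # 21 pairs for 7 languages
        sims.append(torch.nn.functional.cosine_similarity(
            lang_grads[a], lang_grads[b], dim=0).item())
    conflicts = sum(s < 0 for s in sims) / len(sims)
    return conflicts, sum(sims) / len(sims)

# Toy usage with random "gradients":
grads = {lang: torch.randn(1000) for lang in ["ar", "cs", "et", "hi", "it", "no", "ru"]}
print(gradient_conflict_stats(grads))
```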
At the same time, Lee et al. (2021) argue that lower cosine similarity between language gradients indicates that the model starts memorizing language-specific knowledge that at some point might cause catastrophic forgetting of the pretrained knowledge. This suggests that, ideally, our approach would find a good balance between minimizing gradient conflicts and maximizing the cosine similarity between the language gradients.
We quantitatively find that both subnetwork-based methods indeed reduce the percentage of gradient conflicts between languages. Over the last 50 iterations, we find that NonEp-SNstatic has reduced conflicts by 16% and NonEp-SNdyna by 4% compared with the NonEp-Full baseline as reported in Table 6. In the meta-learning set-up, we found an opposite trend where Meta-SNstatic reduces conflicts by 1% and Meta-SNdyna by 11% over the last 50 iterations compared to Meta-Full. This partly explains why NonEp-SNstatic and Meta-SNdyna are found to be the best performing models: They suffer the least from gradient conflicts. Interestingly, we do not find that our meta-trained models suffer less from gradient conflicts than the non-episodic models. In fact, while we found that, on average, Meta-Full improves over NonEp-Full (recall Table 1), its training procedure suffers from 13% more conflicts, meaning that we do not find meta-learning in itself to be a suitable method for reducing gradient conflicts, but our subnetwork-based methods are.
Table 6: Percentage of gradient conflicts and average cosine similarity between language gradients over the last 50 training iterations.

| | Conflicts | Cosine Sim. |
|---|---|---|
| NonEp-Full | 42% | 0.03 |
| NonEp-SNstatic | 26% | 0.05 |
| NonEp-SNdyna | 38% | 0.07 |
| Meta-Full | 55% | −0.04 |
| Meta-SNstatic | 54% | −0.02 |
| Meta-SNdyna | 44% | 0.12 |
At the same time, the average cosine similarity between gradients increases when using both subnetwork methods compared with the Full model baselines. We compute the Pearson correlation coefficient between the relative decrease in percentage of gradient conflicts and increase in cosine similarity over training iterations compared with the baselines. We test for statistical significance (p-value <0.02), and average results over 4 random seeds. We get statistically significant positive correlation scores of 0.08, 0.16, 0.33, and 0.58 for NonEp-SNstatic, NonEp-SNdyna, Meta-SNstatic, and Meta-SNdyna, respectively. This indicates that our subnetwork-based methods try to minimize negative interference while simultaneously maximizing knowledge transfer.
8 Ablations
To ensure that each aspect of our set-up is indeed contributing to the improvements shown in our experiments, we retrained models with specific aspects ablated.
8.1 Random Ablations
Random Mask Initialization – Static
In these experiments, we verify that there is value in using the iterative pruning procedure to generate subnetwork masks (as opposed to the value coming entirely from the mere fact that masks were used).
First, we re-trained NonEp-SNstatic, but swapped out the subnetwork masks derived from iterative pruning with masks containing the same number of enabled heads, but that were randomly generated (Shuffle). Second, given that the number of masked heads might be more important than which exact heads are being masked out, we experiment with masking 20, 30, 40, and 50 random heads. We find that using the random masks results, on average, in ∼5% performance decreases on the training languages compared with using the subnetworks initialized using importance pruning; see Figure 4. In addition, we see that randomly masking out more heads results in further negative effects on performance.
Lastly, given that for many languages our subnetworks mask out very few heads (e.g., 13 for Arabic and Czech), we also try swapping these out with “intentionally bad” masks, where we randomly choose 20 heads to mask out, but do not allow any of the heads selected by the real pruning procedure to be chosen (Bad). From this, we see that preventing the right heads from being selected for masking does result in lower performance versus pure random selection (R20).
Random Mask Initialization – Dynamic
In these experiments, we verify that there is value in using the iterative pruning procedure to initialize subnetwork masks that will then be dynamically updated during fine-tuning.
We retrained NonEp-SNdyna 3 times using randomly initialized subnetworks. Figure 4 (DR20) shows that average performance across all test languages drops substantially (∼10%), making this method considerably worse than any of our other random baselines. We hypothesize that this is because the model is able to correct for any random static subnetwork, but that with dynamic masking, the subnetworks keep changing, which deprives the model of the chance to properly re-structure its information. This also gives us a strong indication that the improvements we observe are not merely an effect of regularization (Bartoldson et al. 2020).
Random Transfer Language
To test the effectiveness of our typology-based approach to selecting which high-resource fine-tuning language’s subnetwork should be used for a given test language, we experimented with just picking one of the high-resource languages at random, and found that this performed worse overall, resulting in lower scores for 78 of 82 test languages.
8.2 Unstructured Pruning
Our approach relies on the assumption that attention heads function independently. However, attention head interpretability studies have sometimes given mixed results on their function in isolation (Prasanna, Rogers, and Rumshisky 2020; Clark et al. 2019; Htut et al. 2019). Moreover, related works commonly focus on unstructured methods (Lu et al. 2022; Nooralahzadeh et al. 2020). Thus, we compare our strategy of masking whole attention heads against versions of NonEp-SNstatic and NonEp-SNdyna that were retrained using subnetwork masks found using the most popular unstructured method, magnitude pruning. In magnitude pruning, instead of disabling entire heads during the iterative pruning procedure described in §4.3.1, we prune the 10% of parameters with the lowest-magnitude weights across all heads. Again, we check the development set score in each iteration and keep pruning until performance drops below 95% of the original performance. Note that we exclude the embedding and MLP layers.[8]
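For reference, a sketch of the unstructured alternative: a global magnitude threshold is computed over the attention parameters and the lowest-magnitude 10% of weights are masked out. The name-based parameter filter is a simplification of excluding the embedding and MLP layers.

```python
import torch

def magnitude_prune_masks(model, frac=0.10, include=("attention",)):
    """Binary masks (1 = keep) zeroing the `frac` lowest-magnitude weights,
    computed globally over all parameters whose names match `include`
    (a simplified name-based filter for the attention weights)."""
    params = {name: p.detach().abs() for name, p in model.named_parameters()
              if any(key in name for key in include)}
    all_values = torch.cat([p.view(-1) for p in params.values()])
    k = max(1, int(frac * all_values.numel()))
    threshold = all_values.kthvalue(k).values      # global magnitude cut-off
    return {name: (p > threshold).float() for name, p in params.items()}
```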
We find that for both the static and dynamic strategies, unstructured pruning performs worse overall, resulting in lower scores for 76% of test languages, and is especially harmful for dynamic subnetworks (SNstatic: 40.4 vs. 39.9, and SNdyna: 39.0 vs. 36.7 average LAS). We hypothesize that it might be more difficult to learn to adapt the unstructured masks as there are more weights to learn (weights per head × heads per layer × layers).
8.3 Effect of Selected Training Languages
Given that the selection of training languages can have an important effect on the overall performance, we now also perform a set of ablations to test the robustness of our findings with respect to the choice of training languages.
Fine-tuning Stage 1
As presented above, we fine-tune mBERT first using English to learn the task of dependency parsing. While English is still the most commonly used source language for cross-lingual transfer, it is important to understand how this choice may affect downstream performance. Therefore, we also tried using three other languages in place of English for this step: (1) Chinese (GSD) as it uses a different script, (2) Turkish (PENN) as it has a different word order (SOV), and (3) German (GSD) as it has no dominant word order and it was found to be the best source language for transfer by Turc et al. (2021) (together with Russian).
In Table 7, we show, for each of the three languages, how much the average performance changes in comparison with using English. For both Chinese and Turkish, we find that the average performance across test languages slightly decreases for all NonEp models. Even though the decreases are only minor, this indicates that Chinese and Turkish do not transfer as well to our test languages as English. This is not completely surprising as more of our test languages are written in the Latin script and, like English, use SVO word ordering. Yet, similar to Turc et al. (2021), we find that German is the best source language as it increases our average results when using both static and dynamic subnetworks compared with using English. Interestingly, in Figure 5, we see that all three alternative source languages increase average performance on the test languages most closely related to Hindi, which could indicate that English has some properties that are particularly badly suited for transfer to this set of languages. At the same time, swapping out English for any of our three new languages causes an average decrease in performance on test languages that are most closely related to Arabic.
Fine-tuning Stage 2
The set of 7 languages we used above for the second stage of fine-tuning was chosen to be comparable to previous studies, but that set of languages is dominated by the Indo-European language family, which may result in poor generalization to other language families. Thus, we also re-trained our NonEp models on a completely different set of 7 languages, which were chosen from among those languages with relatively large treebanks (≥ 100K tokens), but selected in order to maximize diversity with respect to: (1) language family, (2) word order, and (3) data domain. This yielded the following set of languages: Belarusian (IE, Slavic, no dominant order), Chinese (Sino-Tibetan, SVO), Finnish (Uralic, SVO), Hebrew (Afro-Asiatic, SOV), Indonesian (Austronesian, SVO), Irish (Celtic, VSO), and Turkish (Turkic, SOV). These 7 languages cover 7 language families, 4 word orderings, and 14 data domains. Note that to limit the scope of this experiment, and to keep the results comparable to our original findings, we now again use English for fine-tuning stage 1.
We find that average results across all test languages[9] are very similar using this different set of training languages. More concretely, for our Full, SNstatic, and SNdyna (NonEp) models we only get +0.12, −1.09, and +0.02 average differences in LAS scores compared with our original results.[10] Thus, our methods seem to be fairly robust with respect to the choice of training languages, and more diversity in training languages does not automatically result in better performance. One artifact that could influence this is the fact that we have much less training data for some of these selected languages (e.g., Irish and Indonesian [see Appendix A, Table 10]), so the quality of the retrieved subnetworks could be worse than those found for our more high-resource training languages. Thus, it could be possible that with more training data, this same set of training languages would result in higher performance gains.
9 Conclusion
We present and compare two methods, namely, static and dynamic subnetworks, that successfully help us guide selective sharing in multilingual training across two learning frameworks: non-episodic learning and meta-learning. We show that through the use of subnetworks, we can obtain considerable performance gains on cross-lingual transfer to low resource languages compared to full model training baselines for dependency parsing. Moreover, we quantitatively show that our subnetwork-based methods are able to reduce negative interference. Finally, we extensively analyze the behavior of our best performing models and show that they possess different strengths, obtaining relatively large improvements on different sets of test languages with often opposing properties. Given that our Meta-SNdyna model performs particularly well on data-scarce and typologically distant languages from our training languages, this is an interesting approach to further explore in future work on low-resource languages. In particular, it would be interesting to investigate methods to integrate the strengths of NonEp-SNstatic and Meta-SNdyna into one model.
Lastly, we test our results only on the task of dependency parsing, which is somewhat different from other NLP tasks as it has an annotation scheme explicitly designed to be applied across languages universally. However, we would like to point out that many NLP tasks are implicitly multilingual as well since most tasks do not involve a language-specific annotation scheme. For instance, in Named Entity Recognition (NER), the goal is to classify named entities into predefined categories such as “person”, “location”, “organization”, and so forth. When performing NER for other languages, we still select from the same categories. Moreover, negative interference is a general problem, first addressed in multi-task learning (Ruder 2017), and later studied in multilingual NLP (Wang, Lipton, and Tsvetkov 2020), that seems to occur whenever we attempt to learn multiple tasks/languages within one model. In multilingual NLP, languages will compete for the limited model capacity regardless of the task we are trying to solve. It was already shown that across a wide range of NLP tasks—NER, POS tagging, question answering, and natural language inference—negative interference occurs in multilingual models, and resolving such language conflicts can improve overall cross-lingual performance (Wang, Lipton, and Tsvetkov 2020). From our analysis of gradient conflicts, we find that similar negative interference issues can be found for the task of dependency parsing, and are mitigated by our subnetwork-based methods. Thus, as training with subnetworks appears to be a general approach to mitigating negative interference, we expect it to bring the same benefits to other NLP tasks for which this problem occurs. Moreover, we would like to point out that other studies have already shown the effectiveness of various other types of subnetworks for different tasks, for example, for Neural Machine Translation (Lin et al. 2021; Hendy et al. 2022) and cross-lingual speech recognition (Lu et al. 2022), making it less likely that the effectiveness of our methods are limited to dependency parsing only.
10 Limitations
One problem in multilingual NLP is that performance increases tend to happen for a specific set of languages at a time rather than across all languages simultaneously. This makes it hard to compare models and determine the state-of-the-art performance. Moreover, it is hard to determine the usefulness of a new method, as average scores are not very informative when they are dominated by a few particularly difficult test languages; for instance, removing a few low-performing languages would already boost our average performance substantially.
This also makes it more complicated to choose training languages. Changing the training languages can positively influence our performance at test time, especially if they are more similar to a large number of our test languages. However, when we want our model to generalize beyond our chosen set of test languages, it can be misleading to tailor the training set-up to the test data. Thus, while we do show that our methods generally improve performance when using two completely different sets of training languages, further experiments on finding an “optimal” set of training languages are omitted from this study. In addition, meta-learning is notorious for being hard to optimize; for example, slight changes in learning rates can have a detrimental effect on performance (Antoniou, Edwards, and Storkey 2019). This also means that different training languages can require different hyperparameter settings to work, which further complicates the search for an optimal training set.
Another limitation is that while we use a diverse set of test languages, our approach relies on the pretrained mBERT model, which means that it is unsuited to low-resource languages whose scripts are not seen during pretraining. Finding useful ways to circumvent this problem would be a good direction for follow-up work.
Lastly, given that we fine-tune on only 8 languages, the smallest typological distance between the training languages and a test language is often still relatively large. This makes the motivation for typology-informed subnetwork transfer at test time less satisfactory. Future work should further investigate the effect of using more similar training and test language pairs for subnetwork transfer.
A Training and Data Details
Table 8: Statistics (number of sentences) of the treebanks used for fine-tuning.

| Lang. | Family | TB | Train | Val. | Test |
|---|---|---|---|---|---|
| ar | Afro-Asiatic | PADT | 6,075 | 909 | 680 |
| cs | Slavic | PDT | 68,495 | 9,270 | 10,148 |
| en | Germanic | EWT | 12,543 | 2,002 | 2,077 |
| hi | Indic | HDTB | 13,304 | 1,659 | 1,684 |
| it | Romance | ISDT | 13,121 | 564 | 482 |
| et | Uralic | EDT | 24,633 | 3,125 | 3,214 |
| no | Germanic | Norsk | 14,174 | 1,890 | 1,511 |
| ru | Slavic | SynTag | 48,814 | 6,584 | 6,491 |
Table 9: Learning rates considered for mBERT and the decoder.

| Inner/Test LR | mBERT | decoder |
|---|---|---|
| NonEp | {1e-04, 5e-05, 1e-05} | {1e-03, 5e-04, 1e-04} |
| Unstructured | {1e-04, 5e-05, 1e-05} | {1e-03, 5e-04, 1e-04} |
| Meta-Full | {1e-04, 5e-05, 1e-05} | {1e-03, 5e-04, 1e-04} |
| Meta-SNstatic | {1e-04, 5e-05, 1e-05} | {1e-03, 5e-04, 1e-04} |
| Meta-SNdyna | {1e-04, 5e-05, 1e-05} | {1e-03, 5e-04, 1e-04} |
| Outer LR | | |
| Meta-All | {1e-04, 5e-05, 1e-05} | {1e-03, 5e-04, 1e-04} |
Table 10: Statistics (number of sentences) of the typologically diverse treebanks used for the training-language ablation in §8.3.

| Lang. | Family | TB | Train | Val. | Test |
|---|---|---|---|---|---|
| be | IE, Slavic | HSE | 22,853 | 1,301 | 1,077 |
| fi | Uralic | TDT | 12,217 | 1,364 | 1,555 |
| ga | Celtic | IDT | 4,005 | 451 | 454 |
| he | Afro-Asiatic | HTB | 5,241 | 484 | 491 |
| id | Austronesian | GSD | 4,482 | 559 | 557 |
| tr | Turkic | Penn | 14,850 | 622 | 924 |
| zh | Sino-Tibetan | GSD | 3,997 | 500 | 500 |
All models use the same UDify architecture with the dependency tag and arc dimensions set to 256 and 768, respectively. At fine-tuning stage 1, we train for 60 epochs following the procedure of Langedijk et al. (2022) and Kondratyuk and Straka (2019). The Adam optimizer is used with the learning rates of the decoder and BERT layers set to 1e-3 and 5e-5, respectively. Weight decay of 0.01 is applied, and we use a gradual unfreezing scheme, freezing the BERT layer weights for the first epoch. For more details on the training procedure and hyperparameter selection, see Langedijk et al. (2022). For fine-tuning on separate languages to find the subnetworks, we apply the same procedure.
Moreover, we need ∼3 hours for pretraining and, depending on the training set size, ∼4 hours per language for fine-tuning and finding a subnetwork (note that this step is run in parallel for all languages and only needs to be performed once for all models trained with subnetworks). We then only require ∼1 hour for non-episodic training or ∼6 hours for meta-training. All models are trained on an NVIDIA TITAN RTX.
Acknowledgments
This project was in part supported by a Google PhD Fellowship for the first author. We would like to thank Tim Dozat and Vera Axelrod for their thorough feedback and insights.
Notes
[2] Note that we swapped out Korean with Estonian as we were unable to learn a high-quality subnetwork for Korean. The choice of Estonian is mainly motivated by the high-resource data requirement, in combination with the fact that its subfamily, Uralic, was not yet represented among our fine-tuning languages.
[3] Stricter criteria (e.g., the intersection of the 4 subnetworks) resulted in lower performance on the development set.
[4] We opted for a number roughly between our largest (13 heads pruned) and smallest (37 heads pruned) language-specific subnetwork found via pruning.
[5] Note that for the meta-update, we use a first-order approximation, replacing the gradient with respect to the original parameters θ by the gradient with respect to the adapted parameters θ′. See Finn, Abbeel, and Levine (2017) for more details on first-order MAML.
[7] Note that Non-Episodic (NonEp) is used throughout the article to refer to models trained without meta-learning.
[8] We recognize that the interaction between the MLPs and attention heads is important, but by focusing on the attention heads, we keep results comparable to importance pruning.
[9] For a fair comparison, we removed test languages included in our new training set, e.g., Indonesian, so we average over 74 test languages instead. This was done for every experiment, where applicable.
[10] We did not find clear patterns for the individual languages on which performance improvements are obtained.
References
Author notes
Action Editor: Kevin Duh