Meta-Learning the Difference: Preparing Large Language Models for Efficient Adaptation

Abstract Large pretrained language models (PLMs) are often domain- or task-adapted via finetuning or prompting. Finetuning requires modifying all of the parameters and having enough data to avoid overfitting, while prompting requires no training and few examples but limits performance. Instead, we prepare PLMs for data- and parameter-efficient adaptation by learning to learn the difference between general and adapted PLMs. This difference is expressed in terms of model weights and sublayer structure through our proposed dynamic low-rank reparameterization and learned architecture controller. Experiments on few-shot dialogue completion, low-resource abstractive summarization, and multi-domain language modeling show improvements in adaptation time and performance over direct finetuning or preparation via domain-adaptive pretraining. Ablations show that our task-adaptive reparameterization (TARP) and model search (TAMS) components individually improve on other parameter-efficient transfer methods, such as adapters, and on structure-learning methods, such as learned sparsification.


Introduction
Finetuning large pretrained language models (PLMs) on task-specific supervised data has become the default strategy to produce performant models for various NLP tasks (Dai and Le, 2015; Howard and Ruder, 2018; Radford et al., 2019, inter alia), provided a task has enough training data to be adapted to without overfitting. For few-shot tasks, very large PLMs like the 175B-parameter GPT-3 (Brown et al., 2020) do surprisingly well without training using prompts, where task-specific examples $(x_j, y_j)$ are presented as text to condition the PLM before a test input $x_{\text{test}}$ is given. Our work considers an important middle ground: minimizing the computational cost of finetuning while improving on its performance in low-resource and few-shot settings.

* Work done during an internship at Amazon AWS AI.
In general, self-supervised objectives used for PLMs assume little about the nature of downstream tasks. Earlier works suggested that task-awareness is unnecessary for PLMs of sufficient scale; e.g., Raffel et al. (2020) found that multi-task learning underperformed pretrain-finetune for the largest T5 models on multi-format question answering. However, Gururangan et al. (2020) showed that further pretraining on unlabeled text from the downstream task (task-adaptive pretraining, or TAPT) or a related domain (DAPT) consistently improved adaptation performance. Aghajanyan et al. (2021a) revisited Raffel et al. and found that by greatly improving the number and balance of tasks, one can utilize a multitask objective after pretraining and achieve gains in proportion to the number of tasks. As for even larger models, Brown et al. (2020) argue that the impressive few-shot prompting ability of GPT-3 comes from "implicit" meta-learning (Schmidhuber, 1987; Bengio et al., 1990), which they term in-context learning, where the outer loop is performed by self-supervised pretraining, and the inner loop is performed by forward passes on implicit examples in unlabeled texts.
These works motivate that exposure to broad information about downstream tasks remains useful in preparing a large PLM for adaptation. Hence, we propose explicit meta-learning for preparing large PLMs for data-efficient adaptation; a visual comparison is in Figure 1. To also achieve parameter efficiency and performance, we adapt meta-transfer learning (Sun et al., 2019) to large PLMs in two proposed ways: an inner loop optimizing a low-rank task-adaptive reparameterization (TARP) of weights, and an outer loop learning an architecture controller for searching task-adaptive model structures (TAMS). These improve over direct finetuning and even DAPT-prepared LMs on generative and unconditional few-shot and low-resource settings, such as multi-domain abstractive summarization (AdaptSum; Yu et al., 2021) and language modeling. Furthermore, our analysis shows that each component of our task distribution-aware strategy independently improves over prior work: (1) meta-transfer learning improves over model-agnostic meta-learning (Finn et al., 2017) even after multitask learning on the same data, setting a new state-of-the-art on few-shot Persona-Chat dialogue personalization (Zhang et al., 2018); (2) our proposed dynamic low-rank TARP outperforms recent methods such as MAM adapters (He et al., 2022) and alternate reparameterizations like Kronecker products (Zhang et al., 2021); (3) our lightweight controller for generating task-aware architectures in TAMS extends improvements into higher-resource tasks and rediscovers task-specific modifications like 1D convolutions for Transformers.
Our proposal is summarized in Figure 2, with pseudocode in Algorithm 1 at the end of the next section. We publicly release the code for our experiments and our reference library online.¹

¹ https://github.com/amazon-research/meta-learning-the-difference

Methodology
Our goal is to explicitly optimize a PLM for efficient adaptation to any task $T_i$ sampled from a distribution of low-resource NLP tasks $p(T)$. Each task consists of a training set $D_i^{\text{train}}$, a test set $D_i^{\text{test}}$, and a loss function $L_i$.
The prevailing approach for efficiently optimizing a base model $f_\Theta$ on a (relatively) small task-specific dataset is model-agnostic meta-learning (MAML; Finn et al., 2017). This is a bi-level optimization process that uses a stochastic gradient-based strategy to sample a batch of tasks $\{T_i\}_{i=1}^{B}$ from the task distribution $p(T)$ in each meta-iteration. In the inner loop, each task finetunes a copy of the model's weights $\Theta$ for a small number of steps $T_{\text{in}}$, producing task-specific weights $\Theta_i$. In the outer loop, each task model $f_{\Theta_i}$ is evaluated on its corresponding task's test set $D_i^{\text{test}}$, and these losses are summed to produce the overall meta-loss $\sum_{T_i \sim p(T)} L_i^{\text{test}}(f_{\Theta_i})$, which is then used to optimize and update $\Theta$; see Weng (2018) for a more detailed overview.
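The bi-level loop can be made concrete with a minimal first-order MAML sketch on toy 1-D regression tasks; the task family, learning rates, and first-order gradient approximation here are illustrative stand-ins, not our actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task():
    # A toy "task" T_i: 1-D linear regression y = a*x with task-specific slope a.
    a = rng.uniform(-2.0, 2.0)
    x_tr, x_te = rng.normal(size=8), rng.normal(size=8)
    return (x_tr, a * x_tr), (x_te, a * x_te)

def loss_and_grad(theta, x, y):
    err = theta * x - y
    return np.mean(err ** 2), np.mean(2.0 * err * x)

theta = 0.0                          # meta-parameters Θ
lr_in, lr_out, T_in, B = 0.1, 0.05, 3, 4
for _ in range(200):                 # meta-iterations
    meta_grad = 0.0
    for _ in range(B):               # batch of tasks {T_i}
        (x_tr, y_tr), (x_te, y_te) = sample_task()
        theta_i = theta
        for _ in range(T_in):        # inner loop: adapt a copy of Θ on D_i^train
            _, g = loss_and_grad(theta_i, x_tr, y_tr)
            theta_i -= lr_in * g
        # Outer loop: accumulate the test-set gradient (first-order approximation).
        _, g_te = loss_and_grad(theta_i, x_te, y_te)
        meta_grad += g_te
    theta -= lr_out * meta_grad      # update Θ with the summed meta-loss gradient
```

True MAML differentiates through the inner updates; the first-order variant above drops those second-order terms for brevity.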
MAML, however, is not generally used in NLP as a competitive alternative to pretrain-then-finetune methods for low-resource and few-shot settings. To rectify MAML's limitations, we propose a meta-learning the difference (MLtD) framework to optimize PLMs for fast and data-efficient adaptation with the following contributions:

MAML after pretraining. Earlier works performed MAML using random initializations, or at best with pretrained token embeddings (Madotto et al., 2019), which was shown to underperform the pretrain-finetune paradigm. With the increased prevalence of large-scale pretraining, recent works have begun to initialize MAML with PLMs (Dou et al., 2019). We continue this approach, but further show that pretraining + MAML, even when labeled (i.e., multitask) and performed only on the meta-training data (i.e., no external text), improves performance and mitigates overfitting versus pretraining alone or MAML alone (Section 4), suggesting that pretraining produces a better initialization that promotes generalization in later meta-learning.
Parameter-efficient transfer. Adaptation data is typically limited, making it easy for large models to overfit. Previous works use very shallow CNNs (Finn et al., 2017), only adapt scale-and-shift parameters atop the original model (Sun et al., 2019), or apply various general regularization techniques such as weight decay, label smoothing, dropout, early stopping, and $\ell_1$ regularization (Madotto et al., 2019; Song et al., 2020). In contrast, we propose learning dynamic low-rank reparameterizations $g_{\Phi_i}$ (Section 2.1) of the base model such that $\Theta_i(x) = g_{\Phi_i}(\Theta_{LM}, x)$ for task $T_i$. Here, $\Phi$ is a small set of new parameters that are adapted into task-specific $\Phi_i$ when finetuning. Notably, we modify MAML to incorporate these parameter-efficient modules, so that during task adaptation in both meta-training (the inner loop) and meta-testing (novel tasks) $T_i$, we only adapt $\Phi \to \Phi_i$ instead of $\Theta \to \Theta_i$, speeding up both phases and improving overall performance. Though some works explore the benefits of joint training or fusion of parameter-efficient modules (Stickland and Murray, 2019; Lin et al., 2020; Pfeiffer et al., 2021), prior work has not explored meta-learning these adaptations in a task distribution-aware setting.

Figure 2: Overview of our proposed method, which learns to transform a small set of weights $\Phi_i$ (TARP learning) and modify sublayer modules $\alpha_i$ (TAMS learning) in a task-specific, data-efficient, and parameter-efficient manner. First, we initialize with a base PLM (top left). In each meta-iteration, we sample a batch of tasks from a task distribution (left). In the inner loop (middle), independent sets of dynamic low-rank reparameterizations are initialized, and an architecture controller generates independent task-specific sublayer modules, all of whose weights are adapted to the task's training set. Each task model is evaluated on the corresponding task's test set. In the outer loop (right), these task losses are summed to produce the overall meta-loss, and the backward pass optimizes the base model, the initial reparameterization, and the architecture controller.
Architecture adaptation. While the Transformer has proven to be a robust general-purpose architecture, recent work has shown that the optimal attention-then-FFN sublayer structure can vary across tasks (e.g., Sandwich Transformers; Press et al., 2020). However, previous data-driven sublayer searches are often task-agnostic (e.g., So et al., 2021), where the sublayer search is performed before pretraining. Meta-learning enables learning data-driven sublayers after pretraining, in a differentiable, task-adaptive manner (Section 2.2). Instead of a separate search per task as in previous methods (DARTS; Liu et al., 2019a), we propose meta-learning a task-aware architecture controller that generalizes to new tasks when searching neural architectures, by learning to directly generate task-specific sublayer structures from the dataset. By exploiting architectural knowledge learned over the task distribution, our task-adaptive model structure approach improves test-time performance. A related work in customizing model structure is CMAML (Song et al., 2020), which applies a sparse pruning algorithm to obtain task-specific weight masks. Our method differs in that we consider generalization over a distribution of tasks (instead of a single task) and has a richer search space with different operations, numbers of layers, and widths of layers, providing architectural diversity to accommodate different task data.
In all, we employ meta-learning to improve upon initializing from a pretrained $\Theta_{LM}$, allowing better downstream finetuning on tasks. By learning only the transformation weights $\Phi_i$ and (optionally) the task-specific architecture $\alpha_i$ for new tasks $T_i$, our method "learns to learn the difference" between a PLM and a task-specific LM in a training-efficient way.
Adding parameter-efficient prefixes. Inspired by prompting, the learning of automated prompts or task-specific continuous variants has been applied to encoder-only PLMs like BERT (Shin et al., 2020; Hambardzumyan et al., 2021) and generative PLMs (Li and Liang, 2021; Liu et al., 2021; Lester et al., 2021), where one learns task-specific vectors prepended to inputs or hidden representations.
Transformations only. The adapter and prefix-tuning strategies insert layers or introduce prefixes, increasing inference time or in-memory size. Instead, Zhao et al. (2020) learn binary masks, and diff pruning (Guo et al., 2021) learns sparse additive vectors; both methods use unstructured sparsity to achieve parameter efficiency. Later works like BitFit (Zaken et al., 2021) and LoRA (Hu et al., 2022) introduce parameter-efficient modifications targeting the Transformer architecture: BitFit only tunes the bias parameters, while LoRA adds low-rank decomposition weights to the self-attention weights. Recently, He et al. (2022) proposed parallel mix-and-match (MAM) adapters, which leverage benefits of the preceding types.
Hence, to minimize overhead we focus on a "transformations only" approach. Inspired by the scale-and-shift parameters of Sun et al. (2019), we propose learning affine transformations to reparameterize the pretrained model weights towards a task. For a pretrained weight matrix $W_0^l \in \mathbb{R}^{C_{in} \times C_{out}}$ (any dense layer in the self-attention or FFN module of a Transformer-based architecture), we first reparameterize the task-specific weights as

$W_i^l = \Phi_{i,1}^l \odot W_0^l + \Phi_{i,2}^l$,    (1)

where $\Phi_{i,1}^l, \Phi_{i,2}^l \in \mathbb{R}^{C_{in} \times C_{out}}$ and $\odot$ denotes the elementwise (Hadamard) product.
At adaptation time, we apply low-rank constraints while optimizing only the reparameterization weights, giving the training objective

$\min_{\Phi_i} L_i(f_{\Theta_i}; D_i^{\text{train}})$  s.t.  $\text{rank}(\Phi_{i,j}^l) \le r$.

A straightforward approach to solve this rank-constrained problem is to apply a low-rank decomposition to the transformation weights $\Phi_i^l$. We term this approach of learning parameter-efficient affine transformations task-adaptive reparameterization (TARP). We consider two standard static decomposition methods:

• Bilinear, which takes $\Phi_j^l = U_j^l (V_j^l)^T$, where $U_j^l \in \mathbb{R}^{C_{in} \times r}$ and $V_j^l \in \mathbb{R}^{C_{out} \times r}$, as done in the additive-only setting ($\Phi_1^l = I$) by LoRA.
• Kronecker product, which takes $\Phi_j^l = \sum_{k=1}^{n} A_k^l \otimes B_k^l$, where $A_k^l \in \mathbb{R}^{n \times n}$, $B_k^l \in \mathbb{R}^{(C_{in}/n) \times (C_{out}/n)}$, and $n$ is a hyperparameter, as used in the "added-layer" Compacter approach.
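For illustration, the Kronecker-product decomposition can be sketched as follows; the shapes and factor count $n$ are hypothetical, and this shows only the general $\sum_k A_k \otimes B_k$ form rather than Compacter's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, C_in, C_out = 4, 16, 12                       # illustrative dimensions

# n Kronecker factors give n^3 + C_in*C_out/n parameters
# instead of the full C_in*C_out.
A = rng.normal(size=(n, n, n))                   # A_k ∈ R^{n×n}
B = rng.normal(size=(n, C_in // n, C_out // n))  # B_k ∈ R^{(C_in/n)×(C_out/n)}

phi = sum(np.kron(A[k], B[k]) for k in range(n)) # Φ = Σ_k A_k ⊗ B_k
```

Each `np.kron(A[k], B[k])` expands to the full $C_{in} \times C_{out}$ shape, so the sum is a dense matrix parameterized by far fewer free weights.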
In addition, we propose a novel decomposition inspired by the self-attention mechanism, which aggregates features using input-dependent attention weights. Self-attention can be regarded as a function with input-dependent parameters, $y = f_{\theta(x)}(x)$. Similarly, the optimal reparameterization may vary with different input values. The computation of a reparameterized layer in the PLM becomes $y = W_i^l(x)^T x$, where the TARP parameters $\Phi_j^l(x)$ are modeled by a dynamic low-rank decomposition (Figure 3):

$\Phi_j^l(x) = U_j^l \, \Sigma_j^l(x) \, (V_j^l)^T$.

The square matrices $\Sigma_j^l(x) \in \mathbb{R}^{r \times r}$ are generated by a lightweight multi-layer perceptron (MLP) for different input vectors, and $U_j^l, V_j^l$ are learnable weight matrices with $r \ll \min(C_{in}, C_{out})$. We compare popular parameter-efficient transfer schemes and these three decompositions in Section 4.
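A minimal NumPy sketch of the dynamic decomposition follows; the dimensions and the small MLP generating $\Sigma(x)$ are illustrative, and only the additive term is shown (the multiplicative term is taken as identity), mirroring Figure 3:

```python
import numpy as np

rng = np.random.default_rng(0)
C_in, C_out, r, d_mlp = 16, 12, 4, 8      # illustrative dimensions

W0 = rng.normal(size=(C_in, C_out))       # frozen pretrained weight W_0^l
# Learnable low-rank factors for one TARP term.
U = rng.normal(scale=0.1, size=(C_in, r))
V = rng.normal(scale=0.1, size=(C_out, r))
# Lightweight MLP producing the input-dependent core Σ(x) ∈ R^{r×r}.
M1 = rng.normal(scale=0.1, size=(C_in, d_mlp))
M2 = rng.normal(scale=0.1, size=(d_mlp, r * r))

def sigma(x):
    h = np.maximum(M1.T @ x, 0.0)         # ReLU hidden layer
    return (M2.T @ h).reshape(r, r)

def tarp_layer(x):
    phi_add = U @ sigma(x) @ V.T          # dynamic low-rank additive term Φ(x)
    W = W0 + phi_add                      # reparameterized weight (Φ_1 = I here)
    return W.T @ x

x = rng.normal(size=C_in)
y = tarp_layer(x)
```

Because $\Sigma(x)$ depends on the input token, the effective modification to $W_0^l$ varies per token while its rank stays bounded by $r$.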

Efficient architecture adaptation
We also propose adapting the model structure for each task in a data-driven manner. The weights of the task-specific architectures are learned in the inner loop, while the task-aware architecture generator that produces architecture candidates is learned in the outer loop. We term this approach task-adaptive model structure (TAMS) learning.
We first represent each task $T_i$ with an embedding vector $z_i$ based on the task training set $D_i^{\text{train}}$. An embedding module $E$ computes the task representation by aggregating features of all training data:

$z_i = E(D_i^{\text{train}}) = \frac{1}{|D_i^{\text{train}}|} \sum_{x \in D_i^{\text{train}}} \text{mean}(\text{Embed}(x))$,

where $\text{Embed}(x)$ are intermediate representations produced by the PLM. For encoder-decoder models (Transformer), we take Embed to be the encoder; for encoder-only or decoder-only PLMs (BERT, GPT-2), we use the token embedding layer. Inspired by DARTS (Liu et al., 2019a), we define the possible sublayer structures by a search space expressed as a directed acyclic graph (DAG), where each directed edge corresponds to a set of candidate operations $O$. The task architecture is represented by a set of parameters $\alpha_i \in \mathbb{R}^{E \times |O|}$ that encode the structure ($E$ is the number of edges and $|O|$ is the number of operations). In our proposed TAMS approach we also introduce a controller $A$ that generates these task-specific architecture parameters as a function of the task embedding vector, $\alpha_i = A(z_i)$. The probability of choosing operation $m$ in edge $n$ is given by $P_n(m) = \text{softmax}_{m \in O}(\alpha_i[n, m])$. In meta-testing, the discrete architecture is obtained by taking the argmax. Since argmax is non-differentiable, we use the straight-through Gumbel-Softmax estimator to backpropagate gradients for optimizing the architecture controller during meta-training.
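The pipeline from task data to architecture choice can be sketched as follows, with illustrative dimensions; in the real method Embed comes from the PLM and gradients flow through the straight-through estimator (which NumPy cannot express), so the argmax branch here mirrors only the meta-testing path:

```python
import numpy as np

rng = np.random.default_rng(0)
d_task, E, n_ops = 32, 4, 3                  # embedding dim, #edges, |O|

# Controller A: a small MLP mapping task embedding z_i to logits α_i ∈ R^{E×|O|}.
W1 = rng.normal(scale=0.1, size=(d_task, 64))
W2 = rng.normal(scale=0.1, size=(64, E * n_ops))

def controller(z):
    h = np.maximum(z @ W1, 0.0)              # ReLU hidden layer
    return (h @ W2).reshape(E, n_ops)        # architecture parameters α_i

def task_embedding(train_feats):
    # Mean-pool stand-in "Embed(x)" features over the task training set.
    return np.mean(train_feats, axis=0)

def gumbel_softmax_sample(logits, tau=1.0):
    # Relaxed categorical sample per edge (forward pass of the ST estimator).
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = np.exp((logits + g) / tau)
    return y / y.sum(axis=-1, keepdims=True)

z = task_embedding(rng.normal(size=(20, d_task)))
alpha = controller(z)
soft = gumbel_softmax_sample(alpha)          # relaxed choice (meta-training)
hard = np.argmax(alpha, axis=-1)             # discrete choice (meta-testing)
```

Each row of `soft` is a distribution over candidate operations for one edge; `hard` picks one operation per edge for the final discrete architecture.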
In TAMS, all possible architectures are initialized as part of the meta-parameters $w$ based on weight-sharing (Pham et al., 2018); i.e., architecture $\alpha_i$'s weights $w(\alpha_i)$ are selected from the meta-parameters. After the reparameterization steps in TARP and the architecture generation steps in TAMS, our inner-loop optimization takes parameters $(\Phi, w(\alpha_i))$ and performs a small number of gradient steps $T_{\text{in}}$ on the task training set to give $(\Phi_i, \tilde{w}_i)$. In the outer-loop optimization, we thus have to simultaneously optimize the architecture controller to perform architecture search, as well as the parameter initialization. This is in contrast to MAML, which optimizes only the parameter initialization in the outer loop. The meta-loss becomes

$\min_{W} \sum_{T_i \sim p(T)} L_i^{\text{test}}\big(f_{\Theta_{LM} \cup \Phi_i \cup \tilde{w}_i}\big)$,

where the tuple $W$ contains the base PLM's weights $\Theta_{LM}$, the low-rank reparameterization weights $\Phi$, the architecture controller $A$, and the weight-sharing meta-parameters $w$.
In summary, our contributions with the TAMS framework are that (1) it meta-learns a task-aware controller by training on the task distribution and then generalizes to new tasks by automatically generating an optimized architecture $\alpha_i$ from the task training data, and (2) it optimizes the controller and the parameter initialization (shared by all tasks) simultaneously under a unified meta-learning objective. This is in contrast to DARTS, which performs a separate search and architecture-parameter optimization for each task independently.
We summarize our complete method with TARP and TAMS in Algorithm 1.

Main results
To demonstrate the overall benefit of our method, we compare our results to other approaches on generative adaptation tasks in the few-shot (dialogue personalization), low-resource (abstractive summarization), and medium-resource (multi-domain language modeling) regimes. In Section 4 we perform some analyses and also compare TARP by itself to previous parameter-efficient works.

Implementation
All of our experiments were run using PyTorch on single machines with NVIDIA V100 GPUs. See per-task hyperparameters in Appendix A.1.
TARP decomposition. We apply task-adaptive reparameterization (TARP) to the pretrained self-attention and feed-forward network (FFN) blocks. In Section 4.2 we conclude that TARP with dynamic decomposition outperforms other parameter-efficient transfer methods; TARP will always be of this form for our main experiments, with rank $r \le 32$.

Results. Pretraining followed by MAML outperforms either alone, even though the same labeled data is used for both Pretrain and MAML, suggesting that meta-learning benefits from the more robust initialization that pretraining provides to improve task-specific few-shot adaptation (also see analysis in Section 4.1). Moreover, we see further improvements by "meta-learning the difference" (MLtD). By using TARP for MAML's inner-loop adaptation (MLtD, TARP only), we attain equivalent or better results and faster training time while updating only a small number of task-specific parameters (Section 4.2). This indicates that our method helps mitigate overfitting to low-resource tasks. Finally, by incorporating TAMS (MLtD), we use the full framework and achieve the best performance, suggesting the task-adapted model structure gives better architectures for personas. In this regard, CMAML lags behind MLtD as well. We conjecture this is because it uses a pruning algorithm to "customize" the model with different weight masks, which may not generate enough model diversity for diverse tasks, as the architectural inductive bias remains the same.

Low-resource abstractive summarization
AdaptSum (Yu et al., 2021) is a new multi-domain dataset used to evaluate domain adaptation schemes for abstractive summarization. It consists of six diverse target domains ranging from movie reviews to scientific abstracts. Each domain has a low-resource task corpus as well as a larger unlabeled text corpus (list and statistics in Table 4) that is used to evaluate domain- and task-adaptive pretraining (DAPT/TAPT; Gururangan et al., 2020). We use pretrained BART (Lewis et al., 2020) and finetune to each low-resource task corpus as in Yu et al. (2021), whose code we extend.
Baselines. DAPT continues pretraining with BART's self-supervised objective on the unlabeled domain corpus. TAPT continues pretraining with the set of unlabeled documents found in the target summarization task. SDPT uses the XSum dataset in the News domain to further pretrain BART with a supervised training objective on document-summary pairs before finetuning.

Results (Table 2). We find that MLtD, even without architecture search (TARP only), outperforms DAPT, TAPT, and SDPT. These methods use in-domain/-task knowledge and the standard pretraining objective to help adaptation to the target task, while our method considers cross-domain knowledge via the meta-learning objective, sampling meta-training tasks from multiple domain corpora to train the model. Moreover, the use of meta-learning as preparation outperforms multitask pretraining (TARP only, multitask pretraining instead), signifying that mere exposure to the cross-domain data may not be enough, and that using a meta-learning objective to explicitly optimize for the lightweight adaptation is beneficial. Finally, we see that even without meta-learning or multitasking (TARP only, no meta-learning), our performance is better than the baseline. This demonstrates the effectiveness of the lightweight TARP adaptation, which matches the performance of full finetuning while updating less than 5% of parameters.

Multi-domain language modeling
Though the text corpora in AdaptSum were originally included to evaluate DAPT, we also use them to evaluate our methods on multi-domain language modeling. As this is a novel benchmark, to demonstrate fast adaptation we take $T_{\text{in}} = 1$.

Table 3: Test perplexities from multi-domain language modeling adaptation on AdaptSum (lower is better). All methods are initialized with pretrained GPT-2 medium and finetuned on the labeled domain set at the end. †: our re-implementation of Gururangan et al. (2020).
Baselines. We start with pretrained GPT-2 medium (345M) (Radford et al., 2019) with input sequence length 512, using the Transformers library (Wolf et al., 2019). Finetuning is performed on the training documents of the task corpus, and we evaluate perplexity on the test documents of the task corpus. The only exception to finetuning is Zero-shot, which evaluates the pretrained GPT-2 model directly. DAPT continues pretraining of GPT-2 with the language modeling objective on the unlabeled domain corpus before finetuning.
Results (Table 3). Our findings in summarization also hold for the unconditional causal language modeling task. Namely, we see equal or better performance of TARP vs. full finetuning, and meta-learning again plays a significant role in task adaptation quality. In contrast to summarization with BART (Section 3.3) but similar to Persona-Chat with Transformer (Section 3.2), we see that TAMS leads to noticeable improvements. We explain why this may be the case and present the TAMS-learnt sublayer modules in Section 4.4.

Pretraining improves meta-learning
We analyze the performance of MLtD on Persona-Chat at meta-testing time (i.e., finetuning then testing on unseen personas) with respect to the number of inner-loop steps and training dialogues. In Figure 4 (left), we see that original MAML (no pretraining) overfits, while finetuning the multitask-pretrained model keeps improving. Moreover, MLtD atop the multitask-pretrained model followed by finetuning continues to improve test perplexity. In Figure 4 (right), we fix the finetuning steps and vary the number of training dialogues used in finetuning. Using more dialogues improves perplexity for all three methods, with MLtD still leading over full MAML and direct finetuning after pretraining. The takeaway from these results is that applying MAML on a pretrained model prevents overfitting and promotes better generalization from meta-learning.

Dynamic TARP versus alternatives
We benchmark our dynamic low-rank reparameterization on a variety of NLP models and tasks. To compare against prior parameter-efficient adaptation methods, we report single-task results here. For generative tasks, we use pretrained GPT-2 medium on natural language generation datasets: we specifically evaluate on E2E (Novikova et al., 2017), which was used by the adapter method of Lin et al. (2020); WebNLG (Gardent et al., 2017); and DART (Nan et al., 2021), which was used by LoRA (Hu et al., 2022). For classification, we use pretrained RoBERTa (Liu et al., 2019b) on low-resource GLUE (Wang et al., 2019) tasks, which have been evaluated on by many recent parameter-efficient adaptation methods (Houlsby et al., 2019; Zhao et al., 2020; Zaken et al., 2021; He et al., 2022). Further dataset and experimental setup details are in Appendix B. In particular, we chose the rank $r$ to give similar parameter counts to other approaches: $r = 4$ in Table 5, $r = 8$ in Table 6.
For generative tasks, we compare with finetuning all layers; FT-Top2, which only finetunes the last two layers of the model; BitFit (Zaken et al., 2021), which only finetunes the biases; and Adapter tuning (Houlsby et al., 2019), which only finetunes adapter layers inserted after each feed-forward and self-attention sublayer. As shown in Table 5, TARP methods match or outperform other parameter-efficient methods, while learning task-specific parameters that are <3% of the number of base parameters and keeping the base model unchanged. Among the three TARP variants, we find that Dynamic > Bilinear > Kronecker in terms of performance across generative metrics. This suggests that the optimal adaptation to the underlying model weights may vary per token, which dynamic low-rank accounts for. Moreover, dynamic TARP performs better than an alternative where the $O(n^2)$ Hadamard product in Eq. (1) is replaced by $O(n^3)$ matrix multiplication (w/ matrix mult.).
For classification tasks, we compare with He et al. (2022), who proposed a unified framework connecting several state-of-the-art adaptation methods (Houlsby et al., 2019; Hu et al., 2022; Li and Liang, 2021) and devised an improved method (MAM Adapter). Our dynamic TARP can only partly be viewed in this unified framework, as we explore a novel design dimension: making the modification to the base model dynamic w.r.t. input tokens. Moreover, in contrast to the additive-only modifications in He et al. (2022), our dynamic TARP applies both multiplicative and additive modifications. For fair comparison, we follow past works (Liu et al., 2019b; Zhao et al., 2020; He et al., 2022) and set the maximum finetuning epochs to 10 on each task. In Table 6, dynamic TARP introduces and trains only 1% as many parameters as the original model, while achieving comparable results to full finetuning (+0.3 abs.) and outperforming the previous best, MAM adapters (+1.0 abs.).

Dynamic TARP outperforms finetuning
Tables 5 and 6 also show that dynamic low-rank reparameterization outperforms finetuning on the corresponding evaluation metrics, while being faster as it only adapts a small set of weights. The training time further improves through utilizing the training data more efficiently. In Figure 5 (left) we compare perplexities of our method against finetuning on subsets of WikiText-2 and see that finetuning increasingly underperforms as the number of examples decreases. To explain this behavior, in Figure 5 (right) we fix the number of training examples to 100 and ablate the rank. Our method performs best with a very small rank value, suggesting that the difference between the pretrained and finetuned weight matrices lies in a lower-dimensional subspace. This complements Aghajanyan et al. (2021b)'s observation that direct adaptation in lower-dimensional spaces can be equally as effective as in the original space. Moreover, we find that the larger the model (GPT-2 medium vs. GPT-2 small), the lower the rank value required for the best adaptation performance.

TAMS discovers better architectures
Recent studies have shown that simple modifications to the Transformer architecture, such as reorganizing the MHSA and FFN modules (Zhao et al., 2021) or adding 1D convolutions to self-attention (So et al., 2021), improve task performance. Similarly, from our results in Table 3, adapting the model structure through sublayer modifications in our meta-learning framework further reduces test perplexity compared to MLtD with a fixed model structure. Applying task-aware architecture search (TAMS) on the FFN module incurs less than 5% additional model parameters compared to the original GPT-2 model, but reduces the perplexity by 3 points on average.
A limitation we observe is that TAMS tends to produce a dominant architecture (cf. Figure 6) rather than a different architecture for each task. We conjecture this may be because our initial task representation strategy has low variance due to averaging across the entire task training data. This may explain why TAMS did not uniformly improve MLtD in all settings. Nevertheless, the perplexity reduction implies that there is still room to optimize the architecture of current LMs without significantly increasing total model size. Thus, we believe task-aware architecture search is a promising direction for future work.

Training efficiency of MLtD
We study training efficiency by comparing the training and finetuning wall-clock time for multi-domain abstractive summarization on AdaptSum. The results are shown in Table 7.
We have the following observations: (1) since meta-learning explicitly optimizes the model for fast adaptation, MLtD takes fewer epochs to reach convergence than previous methods (e.g., Figure 7) and takes the least time to adapt the model to each task; (2) since our lightweight adaptation method (TARP) only updates a small set of task-specific weights, our model variant (TARP only, no meta-learning) still reduces adaptation time by 25% over direct BART finetuning.
On the other hand, the proposed TARP and TAMS components introduce some inference overhead. Due to limitations of current DL libraries in implementing parallel computation branches, the dynamic low-rank decomposition and the task-aware architecture generation increase inference time by 10% and 6%, respectively, measured with a batch size of 4 and a sequence length of 1024 on one GPU.

Conclusion
We have shown that explicit meta-learning is a useful preparation step on top of PLMs to improve later finetuning. Specifically, our MLtD framework, incorporating dynamic task-adaptive reparameterization (TARP) and task-adaptive model search (TAMS), enables data- and parameter-efficient adaptation to a family of low-resource tasks. Future avenues include applying our method to other modalities like vision and speech, as well as exploring better model formulations for TARP and TAMS.

Figure 1 :
Figure 1: Comparison between (top) implicit meta-learning from text corpora that incidentally contain task "prefixes", as in GPT-3 (Brown et al., 2020; Fig. 1.1), and (bottom) explicit meta-learning the transformation of a PLM's weights and sublayers for a distribution of tasks.

Figure 3 :
Figure 3: TARP with dynamic decomposition (only the additive $\Phi_2^l$ is depicted for simplicity).

Figure 4 :
Figure 4: Perplexities on Persona-Chat testing tasks with MLtD (TARP only) versus Pretrain+Finetune and MAML+Finetune. Left: influence of the number of adaptation iterations. Right: influence of the number of adaptation dialogues.

Figure 5 :
Figure 5: Test perplexities on WikiText-2 with full finetuning and/or low-rank adaptation with dynamic TARP. Left: low-rank adaptation is extremely helpful on low-resource tasks. Right: holding the number of training examples fixed, the best adaptation is achieved with a small number of dimensions.

Figure 6 :
Figure 6: Dominant structure of the TAMS-learned sublayers for AdaptSum language modeling.
Table 7: Wall-clock time comparison on AdaptSum during preparation on the meta-training data (Prep.) and during meta-testing (Finetuning) to convergence (early stopping), summed over all domains. Times were measured on one GPU.

Figure 7 :
Figure 7: Convergence analysis for finetuning BART models obtained by different methods on AdaptSum, using the Debate domain as an example.

Table 2 :
Domain adaptation for abstractive summarization on AdaptSum (higher is better). All methods are initialized with pretrained BART and finetuned on the labeled task training set of each domain at the end. *: published results from Yu et al. (2021), using DAPT and TAPT methods from Gururangan et al. (2020); the rest are ours.
Algorithm 1 (excerpt):
14: Evaluate on $D_i^{\text{test}}$:
15:   $L_{\text{meta\_loss}}$ += $L_{D_i^{\text{test}}}\big(f_{\Theta_{LM} \cup \Phi_i \cup \tilde{w}_i}\big)$;
16: Perform outer-loop optimization: $(\Theta_{LM}, \Phi, A, w)$ -= $\eta_{\text{out}} \nabla_{(\Theta_{LM}, \Phi, A, w)} L_{\text{meta\_loss}}$;
17: Return: meta-trained PLM with learned $(\Theta_{LM}, \Phi, A, w)$;

...the parameter efficiency of the search cell. Our controller $A$ is a two-layer MLP. The first fully-connected layer has 128 output neurons, and the second layer has $E \times |O|$ neurons (see Section 2.2 for notation). We apply ReLU after the first layer.

Table 4 :
Data sizes for AdaptSum (Yu et al., 2021) across the six domains, for both the text-only domain-related corpus and the low-resource task corpus.

Table 6 :
Comparison with other adaptation methods on low-resource GLUE tasks. We report single-task results (including our dynamic TARP) of adapting RoBERTa base on the training set of each task only. *: published results from Liu et al. (2019b); Zhao et al. (2020); Zaken et al. (2021); Pfeiffer et al. (2021); He et al. (2022); †: recreated using He et al. (2022)'s implementation; the rest are ours. For our dynamic TARP, we provide the 95% confidence interval over 5 runs.