Abstract
Large pretrained language models (PLMs) are often domain- or task-adapted via finetuning or prompting. Finetuning requires modifying all of the parameters and having enough data to avoid overfitting, while prompting requires no training and few examples but limits performance. Instead, we prepare PLMs for data- and parameter-efficient adaptation by learning to learn the difference between general and adapted PLMs. This difference is expressed in terms of model weights and sublayer structure through our proposed dynamic low-rank reparameterization and learned architecture controller. Experiments on few-shot dialogue completion, low-resource abstractive summarization, and multi-domain language modeling show improvements in adaptation time and performance over direct finetuning or preparation via domain-adaptive pretraining. Ablations show that our task-adaptive reparameterization (TARP) and model search (TAMS) components individually improve on other parameter-efficient transfer methods like adapters and structure-learning methods like learned sparsification.
1 Introduction
Finetuning large pretrained language models (PLMs) on task-specific supervised data has become the default strategy for producing performant models for various NLP tasks (Dai and Le, 2015; Howard and Ruder, 2018; Radford et al., 2019, inter alia), provided a task has enough training data to be adapted without overfitting. For few-shot tasks, very large PLMs like the 175B-parameter GPT-3 (Brown et al., 2020) do surprisingly well without any training by using prompts, where task-specific examples (xj, yj) are presented as text to condition the PLM before a test input xtest is given. Our work considers an important middle ground: minimizing the computational cost of finetuning while improving on its performance in low-resource and few-shot settings.
In general, self-supervised objectives used for PLMs assume little about the nature of downstream tasks. Earlier works suggested that task-awareness is unnecessary for PLMs of sufficient scale; for example, Raffel et al. (2020) found that multi-task learning underperformed pretrain-finetune for the largest T5 models on multi-format question answering. However, Gururangan et al. (2020) showed that further pretraining on unlabeled text from the downstream task (task-adaptive pretraining, or TAPT) or a related domain (DAPT) consistently improved adaptation performance. Aghajanyan et al. (2021a) revisited Raffel et al. and found that by greatly improving the number and balance of tasks, one can utilize a multitask objective after pretraining and achieve gains in proportion to the number of tasks. As for even larger models, Brown et al. (2020) argue that the impressive few-shot prompting ability of GPT-3 comes from “implicit” meta-learning (Schmidhuber, 1987; Bengio et al., 1990) which they term in-context learning, where the outer loop is performed by self-supervised pretraining, and the inner loop is performed by forward passes on implicit examples in unlabeled texts.
These works suggest that exposure to broad information about downstream tasks remains useful in preparing a large PLM for adaptation. Hence, we propose explicit meta-learning for preparing large PLMs for data-efficient adaptation; a visual comparison is in Figure 1. To also achieve parameter efficiency and performance, we adapt meta-transfer learning (Sun et al., 2019) to large PLMs in two proposed ways: an inner loop optimizing a low-rank task-adaptive reparameterization (TARP) of weights, and an outer loop learning an architecture controller for searching task-adaptive model structures (TAMS). These improve over general finetuning and even DAPT-prepared LMs in few-shot and low-resource settings for conditional and unconditional generation, such as multi-domain abstractive summarization (AdaptSum; Yu et al., 2021) and language modeling.
Furthermore, our analysis shows that each component of our task distribution-aware strategy independently improves over prior work: (1) meta-transfer learning improves over model-agnostic meta learning (Finn et al., 2017) even after multitask learning on the same data, setting a new state-of-the-art on few-shot Persona-Chat dialog personalization (Zhang et al., 2018); (2) our proposed dynamic low-rank TARP outperforms recent methods such as MAM adapters (He et al., 2022) and alternate reparameterizations like Kronecker products (Zhang et al., 2021); (3) our lightweight controller for generating task-aware architectures in TAMS extends improvements into higher resource tasks and rediscovers task-specific modifications like 1D convolutions for Transformers.
2 Methodology
Our goal is to explicitly optimize a PLM for efficient adaptation to any task sampled from a distribution of low-resource NLP tasks p(𝒯). Each task 𝒯i consists of a training set 𝒟itrain, a test set 𝒟itest, and a loss function ℒi.
The prevailing approach for efficiently optimizing a base model fΘ on a (relatively) small task-specific dataset is model-agnostic meta-learning (MAML; Finn et al., 2017). This is a bi-level optimization process that, in each meta-iteration, samples a batch of tasks from the task distribution and applies a stochastic gradient-based strategy. In the inner loop, each task finetunes a copy of the model’s weights Θ for a small number of steps Tin, producing task-specific weights Θi. In the outer loop, each task model is evaluated on its corresponding task’s test set and these losses are summed to produce the overall meta-loss. The meta-loss is then used to optimize and update Θ; see Weng (2018) for a more detailed overview.
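To make the bi-level structure concrete, the following is a minimal first-order sketch of one meta-iteration in PyTorch (the framework used in our experiments). It is illustrative rather than our actual implementation: the `task` dictionary keys are assumptions, and full MAML would additionally backpropagate through the inner-loop updates rather than using the first-order approximation shown here.

```python
import copy
import torch

def maml_meta_step(model, tasks, loss_fn, meta_optimizer, inner_lr=1e-2, inner_steps=5):
    """One meta-iteration of first-order MAML over a batch of sampled tasks."""
    meta_grads = [torch.zeros_like(p) for p in model.parameters()]
    total_meta_loss = 0.0
    for task in tasks:  # each task supplies a train batch and a test batch
        # Inner loop: finetune a copy of the shared weights Theta on the task's train set.
        task_model = copy.deepcopy(model)
        inner_opt = torch.optim.SGD(task_model.parameters(), lr=inner_lr)
        for _ in range(inner_steps):  # T_in steps
            inner_opt.zero_grad()
            loss_fn(task_model, task["train_batch"]).backward()
            inner_opt.step()
        # Outer loop: evaluate the adapted weights Theta_i on the task's test set.
        task_model.zero_grad()
        test_loss = loss_fn(task_model, task["test_batch"])
        test_loss.backward()
        total_meta_loss += test_loss.item()
        # First-order approximation: use the test-set gradient at Theta_i as this
        # task's contribution to the meta-gradient with respect to Theta.
        for g, p in zip(meta_grads, task_model.parameters()):
            if p.grad is not None:
                g += p.grad
    # Update the shared initialization Theta with the summed meta-gradient.
    for p, g in zip(model.parameters(), meta_grads):
        p.grad = g
    meta_optimizer.step()
    meta_optimizer.zero_grad()
    return total_meta_loss
```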
MAML, however, is not generally used in NLP as a competitive alternative to pretrain-then-finetune methods for low-resource and few-shot settings. To rectify MAML’s limitations, we propose a meta-learning the difference (MLtD) framework to optimize PLMs for fast and data-efficient adaptation with the following contributions:
MAML after Pretraining.
Earlier work performed MAML from random initializations, or at best with pretrained token embeddings (Madotto et al., 2019), which was shown to underperform the pretrain-finetune paradigm. With the increased prevalence of large-scale pretraining, recent work has begun to initialize MAML with PLMs (Dou et al., 2019). We continue this approach, but further show that pretraining + MAML, even when the pretraining is supervised (i.e., multitask) and performed only on the meta-training data (i.e., no external text), improves performance and mitigates overfitting versus pretraining alone or MAML alone (Section 4), suggesting that pretraining produces a better initialization that promotes generalization in later meta-learning.
Parameter-efficient Transfer.
Adaptation data is typically limited, making it easy for large models to overfit. Previous work uses very shallow CNNs (Finn et al., 2017), adapts only scale-and-shift parameters atop the original model (Sun et al., 2019), or applies general regularization techniques such as weight decay, label smoothing, dropout, early stopping, and ℓ1 regularization (Madotto et al., 2019; Song et al., 2020). In contrast, we propose learning dynamic low-rank reparameterizations (Section 2.1) of the base model, such that the model for task 𝒯i is parameterized by the frozen ΘLM together with task-specific weights Φi. Here, Φ is a small set of new parameters that are adapted into task-specific Φi when finetuning. Notably, we modify MAML to incorporate these parameter-efficient modules, so that during task adaptation in both meta-training (the inner loop) and meta-testing (novel tasks), we only adapt Φ → Φi instead of Θ → Θi, speeding up both phases and improving overall performance; a sketch is given below. Though some work explores the benefits of joint training or fusion of parameter-efficient modules (Stickland and Murray, 2019; Lin et al., 2020; Pfeiffer et al., 2021), prior work has not explored meta-learning these adaptations in a task distribution-aware setting.
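As a rough illustration (not our released code), the inner loop now optimizes only the small set Φ while ΘLM stays frozen; the `base_parameters` and `adaptation_parameters` helpers are hypothetical names for the frozen and TARP parameter groups.

```python
import torch

def adapt_to_task(model, task_train_batch, loss_fn, inner_lr=1e-2, inner_steps=5):
    """Inner-loop adaptation that touches only the reparameterization weights Phi."""
    for p in model.base_parameters():           # hypothetical helper: the frozen Theta_LM
        p.requires_grad_(False)
    phi = list(model.adaptation_parameters())   # hypothetical helper: the TARP weights Phi
    inner_opt = torch.optim.SGD(phi, lr=inner_lr)
    for _ in range(inner_steps):                # T_in steps, as in meta-training
        inner_opt.zero_grad()
        loss_fn(model, task_train_batch).backward()
        inner_opt.step()
    return phi  # the task-specific Phi_i; Theta_LM is untouched
```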
Architecture Adaptation.
While the Transformer has proven to be a robust general-purpose architecture, recent work has shown that the optimal attention-then-FFN sublayer structure can vary across tasks (e.g., Sandwich Transformers; Press et al., 2020). However, previous data-driven sublayer searches are often task-agnostic (e.g., So et al., 2021), where the sublayer search is performed before pretraining. Meta-learning enables learning data-driven sublayers after pretraining, in a differentiable, task-adaptive manner (Section 2.2). Instead of a separate search per task as in previous methods (DARTS; Liu et al., 2019a), we propose meta-learning a task-aware architecture controller that generalizes to new tasks by learning to directly generate task-specific sublayer structures from the dataset. By exploiting architectural knowledge learned over the task distribution, our task-adaptive model structure approach improves test-time performance. A related work in customizing model structure is CMAML (Song et al., 2020), which applies a sparse pruning algorithm to obtain task-specific weight masks. Our method differs in that we consider generalization over a distribution of tasks (instead of a single task), and our search space is richer, with different operations, numbers of layers, and widths of layers, providing the architectural diversity needed to accommodate different task data.
In all, we employ meta-learning to improve upon initializing from a pretrained ΘLM, allowing better downstream finetuning on tasks. By learning only the transformation weights Φi and (optionally) the task-specific architecture αi for new tasks 𝒯i, our method “learns to learn the difference” between a PLM and a task-specific LM in a training-efficient way.
2.1 Efficient Parameter Adaptation
We categorize recent works in parameter-efficient adaptation of large PLMs into three types:
Adding Parameter-efficient Layers.
Low-dimensional adapters (Rebuffi et al., 2018) have been injected into a frozen pretrained BERT either serially after each sublayer (Houlsby et al., 2019), or in parallel to the self-attention layers (PALs; Stickland and Murray, 2019). Previous work (Bapna and Firat, 2019; Lin et al., 2020) has applied adapters to other NLP models (e.g., GPT-2). Compacters (Mahabadi et al., 2021) reduce the adapter parameter count via hypercomplex multiplications (Zhang et al., 2021).
Adding Parameter-efficient Prefixes.
Inspired by prompting, the learning of automated prompts or task-specific continuous variants has been applied to encoder-only PLMs like BERT (Shin et al., 2020; Hambardzumyan et al., 2021) and generative PLMs (Li and Liang, 2021; Liu et al., 2021; Lester et al., 2021), where one learns task-specific vectors prepended to inputs or hidden representations.
Transformations Only.
The adapter and prefix-tuning strategies insert layers or introduce prefixes, increasing inference time or in-memory size. Instead, Zhao et al. (2020) learn binary masks, and diff pruning (Guo et al., 2021) learns sparse additive vectors. Both methods use unstructured sparsity to achieve parameter efficiency. Later work like BitFit (Zaken et al., 2021) and LoRA (Hu et al., 2022) introduces parameter-efficient modifications targeting the Transformer architecture: BitFit only tunes the bias parameters, while LoRA adds low-rank decomposition weights to the self-attention weights. Recently, He et al. (2022) proposed parallel mix-and-match (MAM) adapters, which leverage the benefits of the preceding types.
A straightforward approach to solving the rank-constrained problem is to apply a low-rank decomposition to the transformation weights. We term this approach of learning parameter-efficient affine transformations task-adaptive reparameterization (TARP). We consider two standard static decomposition methods, followed by our proposed dynamic variant:
Bilinear, which takes ΔW = AB, where A ∈ ℝCin×r and B ∈ ℝr×Cout, as done in the additive-only setting (W′ = W + ΔW) by LoRA.
Kronecker product, which takes ΔW = Σk Hk ⊗ (AkBk), where Hk ∈ ℝn×n, Ak ∈ ℝ(Cin/n)×r, Bk ∈ ℝr×(Cout/n), and n is a hyperparameter, as used in the “added-layer” Compacter approach.
In our proposed dynamic decomposition, the square matrices are instead generated by a lightweight multi-layer perceptron (MLP) for different input vectors, while the low-rank factors are the learnable weight matrices with r ≪ min(Cin, Cout).
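The sketch below illustrates the two static decompositions around a frozen linear layer, written as additive low-rank updates for concreteness; the exact way TARP combines ΔW with the pretrained weights, and the dynamic variant defined by Eq. (1), follow the definitions above and in the paper, so treat the module and hyperparameter names here as assumptions.

```python
import torch
import torch.nn as nn

class BilinearTARP(nn.Module):
    """Static bilinear reparameterization: delta_W = A @ B with rank r."""
    def __init__(self, base_linear: nn.Linear, r: int = 8):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():   # pretrained weights stay frozen
            p.requires_grad_(False)
        c_in, c_out = base_linear.in_features, base_linear.out_features
        self.A = nn.Parameter(torch.randn(c_in, r) * 0.02)
        self.B = nn.Parameter(torch.zeros(r, c_out))  # zero init: start from the PLM

    def forward(self, x):
        # Additive low-rank update, as in the LoRA-style setting.
        return self.base(x) + (x @ self.A) @ self.B

class KroneckerTARP(nn.Module):
    """Static Kronecker reparameterization: delta_W = sum_k H_k kron (A_k @ B_k)."""
    def __init__(self, base_linear: nn.Linear, r: int = 8, n: int = 4, k: int = 2):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad_(False)
        c_in, c_out = base_linear.in_features, base_linear.out_features
        self.H = nn.Parameter(torch.randn(k, n, n) * 0.02)
        self.A = nn.Parameter(torch.randn(k, c_in // n, r) * 0.02)
        self.B = nn.Parameter(torch.zeros(k, r, c_out // n))

    def forward(self, x):
        # Each Kronecker term has shape (c_in, c_out); sum over the k terms.
        delta_w = sum(torch.kron(self.H[i], self.A[i] @ self.B[i])
                      for i in range(self.H.shape[0]))
        return self.base(x) + x @ delta_w
```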
We compare popular parameter-efficient transfer schemes and these three decompositions in Section 4.
2.2 Efficient Architecture Adaptation
We also propose adapting the model structure for each task in a data-driven manner. The weights of the task-specific architectures are learned in the inner loop, while the task-aware architecture generator which produces architecture candidates is learned in the outer loop. We term this approach task-adaptive model structure (TAMS) learning.
Inspired by DARTS (Liu et al., 2019a), we define the possible sublayer structures by a search space expressed as a directed acyclic graph (DAG), where each directed edge corresponds to a set of candidate operations 𝒪. The task architecture is represented by a set of parameters αi ∈ ℝE×|𝒪| that encode the structure (E is the number of edges and |𝒪| is the number of operations). In our proposed TAMS approach we also introduce a controller that generates these task-specific architecture parameters αi as a function of a task embedding vector computed from the task’s training data. The probability of choosing operation m on edge n is given by the softmax of αi over the candidate operations for that edge. In meta-testing, the discrete architecture is obtained by taking the argmax. Since argmax is non-differentiable, we use the straight-through Gumbel-Softmax estimator to backpropagate gradients for optimizing the architecture controller during meta-training; a sketch follows.
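The following is a minimal sketch of the operation selection step, using PyTorch's built-in straight-through Gumbel-Softmax; the function and variable names are ours, not the released code's.

```python
import torch
import torch.nn.functional as F

def select_operations(alpha, tau=1.0, training=True):
    """alpha: (E, |O|) architecture logits/scores for each edge of the search DAG.

    During meta-training, sample a one-hot operation choice per edge with the
    straight-through Gumbel-Softmax (discrete forward pass, soft gradients);
    at meta-test time, simply take the argmax.
    """
    if training:
        return F.gumbel_softmax(alpha, tau=tau, hard=True, dim=-1)
    return F.one_hot(alpha.argmax(dim=-1), num_classes=alpha.shape[-1]).float()

# Each one-hot row then weights the candidate operations on its DAG edge:
#   edge_output = sum_m choices[n, m] * op_m(edge_input)
```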
In TAMS, all possible architectures are initialized as part of the meta-parameters based on weight-sharing (Pham et al., 2018); that is, architecture αi’s weights are selected from the meta-parameters. After the reparameterization steps in TARP and the architecture generation steps in TAMS, our inner loop optimization takes the task-specific parameters (Φ and the weights of αi) and performs a small number of gradient steps Tin on the task training set 𝒟itrain to give the adapted parameters. In the outer loop optimization, we thus have to simultaneously optimize the architecture controller to perform architecture search, as well as the parameter initialization. This is in contrast to MAML, which only optimizes the parameter initialization in the outer loop. The meta-loss becomes the sum of each task’s test-set loss under its adapted parameters and generated architecture, minimized with respect to a tuple that contains the base PLM’s weights ΘLM, the low-rank reparameterization weights Φ, the architecture controller’s weights, and the weight-sharing meta-parameters.
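One possible LaTeX rendering of this outer-loop objective is given below; the symbols ψ (controller weights) and ω (weight-sharing meta-parameters) are introduced here for exposition and are not necessarily the paper's original notation.

```latex
% One possible rendering of the outer-loop objective; \psi and \omega are our notation.
\min_{\Theta_{\mathrm{LM}},\, \Phi,\, \psi,\, \omega}\;
  \sum_{\mathcal{T}_i \sim p(\mathcal{T})}
  \mathcal{L}_i\!\left(f_{\Theta_{\mathrm{LM}},\, \Phi_i,\, \alpha_i};\;
    \mathcal{D}_i^{\mathrm{test}}\right),
\qquad
\alpha_i = g_{\psi}\!\left(\mathcal{D}_i^{\mathrm{train}}\right),\quad
\Phi_i = \operatorname{Adapt}_{T_{\mathrm{in}}}\!\left(\Phi,\, \alpha_i,\, \omega;\;
    \mathcal{D}_i^{\mathrm{train}}\right).
```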
In summary, our contributions with the TAMS framework are that (1) it meta-learns a task-aware controller by training on the task distribution and then generalizes to new tasks by automatically generating an optimized architecture αi from the task training data, and (2) it optimizes the controller and parameter initialization (shared by all tasks) simultaneously under a unified meta-learning objective. This is in contrast to DARTS, which performs a separate search and architecture parameter optimization for each task independently.
We summarize our complete method with TARP and TAMS as pseudocode in Algorithm 1.
3 Main Results
To demonstrate the overall benefit of our method, we compare our results to other approaches on generative adaptation tasks in the few-shot (dialogue personalization), low-resource (abstractive summarization), and medium-resource (multi-domain language modeling) regimes. In Section 4 we present further analyses and also compare TARP by itself to previous parameter-efficient work.
3.1 Implementation
All of our experiments were run using PyTorch on single machines with 32GB NVIDIA V100 GPUs. See per-task hyperparameters in Appendix A.1 and our code release.
TARP Decomposition.
We apply task-adaptive reparameterization (TARP) to the pretrained self-attention and feed-forward network (FFN) blocks. In Section 4.2 we conclude that TARP with the dynamic decomposition outperforms other parameter-efficient transfer methods; TARP therefore always takes this form in our main experiments, with rank r ≤ 32.
TAMS details.
We apply TAMS to expand the FFN block, so the shared (in structure) sublayers capture the commonalities among tasks while the newly searched sublayers capture task-specific structure. Our search DAG contains two input nodes that project the inputs to a low-dimensional space, one output node that projects the intermediate representation back to the original dimension, and three intermediate nodes. Candidate operations for each edge are {linear, conv-3×1, conv-5×1, gated linear unit (GLU), zeroize, skip connection}; see code for definitions. All the candidates operate on a reduced feature dimension to ensure the parameter efficiency of the search cell. Our controller is a two-layer MLP. The first fully connected layer has 128 output neurons, and the second layer has E · |𝒪| output neurons (see Section 2.2 for notation). We apply ReLU after the first layer and a softmax to the final output.
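A sketch of such a controller is shown below, matching the stated dimensions (128 hidden units, ReLU, softmax over E · |𝒪| outputs); how the task embedding is computed (e.g., by averaging features over the task's training data, as discussed in Section 4.4) is left abstract, so treat the class and argument names as assumptions.

```python
import torch
import torch.nn as nn

class ArchitectureController(nn.Module):
    """Two-layer MLP mapping a task embedding to per-edge operation probabilities."""
    def __init__(self, task_emb_dim: int, num_edges: int, num_ops: int, hidden: int = 128):
        super().__init__()
        self.num_edges, self.num_ops = num_edges, num_ops
        self.fc1 = nn.Linear(task_emb_dim, hidden)          # 128 output neurons
        self.fc2 = nn.Linear(hidden, num_edges * num_ops)   # E * |O| output neurons

    def forward(self, task_embedding):
        h = torch.relu(self.fc1(task_embedding))
        logits = self.fc2(h).view(self.num_edges, self.num_ops)
        # Softmax over candidate operations per edge gives p(m | n) from Section 2.2;
        # straight-through Gumbel-Softmax sampling can use log(probs) as its logits.
        return torch.softmax(logits, dim=-1)
```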
3.2 Few-shot Dialogue Personalization
Persona-Chat (Zhang et al., 2018) is a dialogue generation benchmark with 1137/99/100 personas for training/validation/testing. We follow recent work (Madotto et al., 2019; Song et al., 2020) and regard learning a dialogue model for each persona as a few-shot meta-learning task. On average, each persona has 8.3 unique dialogues, 6–8 turns per dialogue, and 15 words per turn. Following previous work, we use a standard Transformer model with pretrained GloVe embeddings and separate the dialogues by their persona description into meta-training/-validation/-testing using the splits and code of Madotto et al. (2019).
Baselines.
The following are from previous work. Pretrain denotes a multitask dialogue model trained on labeled data from all meta-training tasks. MAML meta-trains the Transformer model from scratch (Madotto et al., 2019), and CMAML (Song et al., 2020) additionally applies a pruning algorithm to customize the model structures for different tasks. +Finetune corresponds to finetuning on each testing task. Finally, Pretrain+Persona is a partial oracle for reference only, where the persona description is available.
Results (Table 1).
We report the same evaluation metrics as previous work: perplexity, BLEU score, and C-score, where C-score is a domain-specific metric that uses a pretrained natural language inference model to evaluate whether the hypothesis matches the persona. Training MAML from scratch yields worse results than the Pretrain model. However, when MAML is initialized from the multitask model (Pretrain+MAML+Finetune), the result already outperforms previous work. Note that the same labeled data is used for both Pretrain and MAML, suggesting that meta-learning benefits from the more robust initialization that pretraining provides, improving task-specific few-shot adaptation (also see the analysis in Section 4.1).
Table 1: Few-shot dialogue personalization results on Persona-Chat (* denotes results from previous work).

| Method | PPL | BLEU | C-score |
|---|---|---|---|
| Pretrain (multitask)* | 36.75 | 0.64 | −0.03 |
| Pretrain+Finetune* | 33.14 | 0.90 | 0.00 |
| MAML+Finetune* | 40.34 | 0.74 | 0.20 |
| CMAML+Finetune* | 36.30 | 0.89 | 0.18 |
| Pretrain+Persona* | 30.42 | 1.00 | 0.07 |
| Pretrain+MAML+Finetune | 32.54 | 0.97 | 0.23 |
| MLtD (TARP only) | 32.15 | 0.99 | 0.25 |
| MLtD | 28.14 | 1.20 | 0.30 |
Moreover, we see further improvements by “meta-learning the difference” (MLtD). By using TARP for MAML’s inner loop adaptation (MLtD, TARP only), we attain equivalent or better results and faster training time while updating only a small number of task-specific parameters (Section 4.2). This indicates that our method helps mitigate overfitting on low-resource tasks. Finally, by incorporating TAMS (MLtD), we use the full framework and achieve the best performance, suggesting the task-adapted model structure yields better architectures for personas. In this regard, CMAML lags behind MLtD as well. We conjecture this is because it uses a pruning algorithm to “customize” the model with different weight masks, which may not generate enough model diversity for diverse tasks, as the architectural inductive bias remains the same.
3.3 Low-resource Abstractive Summarization
AdaptSum (Yu et al., 2021) is a new multi-domain dataset used to evaluate domain adaptation schemes for abstractive summarization. It consists of six diverse target domains ranging from movie reviews to scientific abstracts. Each domain has a low-resource task corpus as well as a larger unlabeled text corpus (statistics in Table 4) used to evaluate domain- and task-adaptive pretraining (DAPT/TAPT; Gururangan et al., 2020). We use pretrained BART (Lewis et al., 2020) and finetune on each low-resource task corpus as in Yu et al. (2021), whose code we extend.
Baselines.
DAPT continues pretraining with BART’s self-supervised objective using the unlabeled domain corpus. TAPT continues pretraining with the set of unlabeled documents found in the target summarization task. SDPT uses the XSum dataset in the News domain to further pretrain BART with a supervised training objective using document-summary pairs before finetuning.
Results (Table 2).
We find that MLtD, even without architecture search (TARP only), outperforms DAPT, TAPT, and SDPT. These methods use in-domain/-task knowledge and the standard pretraining objective to help adaptation to the target task, while our method considers cross-domain knowledge via the meta-learning objective, sampling meta-training tasks from multiple domain corpora to train the model. Moreover, the use of meta-learning as preparation outperforms multitask pretraining (TARP only, multitask pretraining instead), signifying that mere exposure to the cross-domain data may not be enough and using a meta-learning objective to explicitly optimize for the lightweight adaptation is beneficial. Finally, we see that without meta-learning or multitasking (TARP only, no meta-learning) our performance is also better than the baseline. This demonstrates the effectiveness of the lightweight TARP adaptation, which matches the performance of full finetuning while only updating less than 5% of parameters.
Table 2: Low-resource abstractive summarization results on the six AdaptSum domains (* denotes results from previous work).

| Method | Dialog | Email | Movie | Debate | Social | Science | Avg. |
|---|---|---|---|---|---|---|---|
| Baseline (full finetuning)* | 39.95 | 24.71 | 25.13 | 24.48 | 21.76 | 72.76 | 34.80 |
| DAPT (Domain-Adaptive Pre-Training)* | 41.22 | 26.50 | 24.25 | 26.71 | 22.95 | 71.88 | 35.59 |
| TAPT (Task-Adaptive Pre-Training)* | 40.15 | 25.30 | 25.27 | 24.59 | 22.81 | 73.08 | 35.20 |
| SDPT (Supervised Domain Pre-Training)* | 42.84 | 25.16 | 25.45 | 25.61 | 22.43 | 73.09 | 35.76 |
| MLtD | 44.81 | 25.30 | 26.83 | 26.88 | 24.40 | 74.03 | 37.04 |
| (TARP only) | 42.88 | 26.92 | 25.98 | 25.95 | 23.34 | 73.69 | 36.46 |
| (TARP only, no meta-learning) | 40.39 | 23.20 | 25.81 | 26.67 | 21.46 | 73.20 | 35.12 |
| (TARP only, multitask pretraining instead) | 41.82 | 25.41 | 26.17 | 25.70 | 22.54 | 73.50 | 35.85 |
3.4 Multi-domain Language Modeling
Though the text corpora in AdaptSum were originally included to evaluate DAPT, we also use them to evaluate our methods on multi-domain language modeling. As this is a novel benchmark, to demonstrate fast adaptation we take Tin = 1.
Baselines.
We start with pretrained GPT-2 medium (345M) (Radford et al., 2019) with input sequence length 512 using the Transformers library (Wolf et al., 2019). Finetuning is performed on the training documents of the task corpus, and we evaluate perplexity on the test documents of the task corpus. The only exception to finetuning is Zero-shot, which evaluates the pretrained GPT-2 model directly. DAPT continues pretraining of GPT-2 with the language modeling objective on the unlabeled domain corpus before finetuning.
Results (Table 3).
Our findings in summarization also hold for the unconditional causal language modeling task. Namely, we see equal or better performance of TARP vs. full finetuning and that meta-learning plays a significant role in the task adaptation quality. In contrast to summarization with BART (Section 3.3) but similar to Persona-Chat with Transformer (Section 3.2), we see that TAMS leads to noticeable improvements. We explain why this may be the case and present the TAMS-learnt sublayer modules in Section 4.4.
Table 3: Multi-domain language modeling test perplexity on the AdaptSum domains (lower is better).

| Method | Dialog | Email | Movie | Debate | Social | Science | Avg. |
|---|---|---|---|---|---|---|---|
| Baseline (full finetuning) | 31.95 | 31.57 | 42.25 | 34.38 | 33.02 | 28.82 | 33.67 |
| Zero-shot (no finetuning) | 37.26 | 38.45 | 49.46 | 41.38 | 37.13 | 34.20 | 39.65 |
| DAPT† | 35.15 | 16.04 | 43.12 | 33.83 | 27.15 | 18.96 | 29.04 |
| MLtD | 29.66 | 16.93 | 35.38 | 30.61 | 19.78 | 17.06 | 24.90 |
| (TARP only) | 28.63 | 18.67 | 39.73 | 32.70 | 26.93 | 20.39 | 27.84 |
| (TARP only, no meta-learning) | 31.66 | 31.59 | 41.78 | 33.18 | 32.78 | 28.20 | 33.19 |
Table 4: Number of tokens in the unlabeled text corpus and the low-resource task corpus for each AdaptSum domain.

| Domain | Text only | Task corpus (train) | Task corpus (val) | Task corpus (test) |
|---|---|---|---|---|
| Dialog | 44.96M | 27K | 74K | 75K |
| Email | 117.54M | 37K | 243K | 237K |
| Movie review | 11.36M | 633K | 1056K | 6193K |
| Debate | 122.99M | 59K | 188K | 197K |
| Social media | 153.30M | 68K | 229K | 229K |
| Science | 41.73M | 63K | 221K | 314K |
4 Analysis
4.1 Pretraining Improves Meta-learning
We analyze the performance of MLtD on Persona-Chat at meta-testing time (i.e., finetuning then testing on unseen personas) with respect to the number of inner loop steps and training dialogues. In Figure 4 (left), we see that original MAML (no pretraining) overfits, while finetuning the multitask-pretrained model keeps improving. Moreover, MLtD atop the multitask-pretrained model followed by finetuning continues to improve test perplexity. In Figure 4 (right), we fix the finetuning steps and vary the number of training dialogues used in finetuning. Using more dialogues improves perplexity for all three methods, with MLtD still leading over full MAML and direct finetuning after pretraining. The takeaway from these results is that applying MAML on a pretrained model prevents overfitting and promotes better generalizability from meta-learning.
4.2 Dynamic TARP versus Alternatives
We benchmark our dynamic low-rank reparameterization on a variety of NLP models and tasks. To show that dynamic TARP individually improves on full finetuning and other parameter-efficient adaptation methods, we report single-task results here. For classification, we use pretrained RoBERTa (Liu et al., 2019b) on the GLUE benchmark tasks (Wang et al., 2019), which have been used to evaluate many recent parameter-efficient adaptation methods (Houlsby et al., 2019; Zhao et al., 2020; Zaken et al., 2021; Pfeiffer et al., 2021; He et al., 2022). For generative tasks, we use pretrained GPT-2 medium on natural language generation datasets: we specifically evaluate on E2E (Novikova et al., 2017), which was used for adapters (Lin et al., 2020); WebNLG (Gardent et al., 2017); and DART (Nan et al., 2021), which was used by LoRA (Hu et al., 2022). Further dataset and experimental setup details are in Appendix B. In particular, we chose the rank r to give similar parameter counts to other approaches: r = 4 in Table 6 and r = 8 in Table 5.
Table 5: GLUE development set results with RoBERTa-base (* denotes results from previous work).

| Method | Params. per task | CoLA (Matt. corr) | MRPC (Acc.) | STS-B (Pear. corr) | RTE (Acc.) | SST-2 (Acc.) | MNLI (Acc.) | QNLI (Acc.) | QQP (Acc.) | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Finetuning (full)* | 100% | 63.6 | 90.2 | 91.2 | 78.7 | 94.8 | 87.6 | 92.8 | 91.9 | 86.4 |
| Masking* | 3% | 60.3 | 88.5 | – | 69.2 | 94.5 | – | 92.4 | – | – |
| AdapterFusion* | 1% | – | 89.7 | – | 78.8 | 93.7 | 86.2 | – | 90.3 | – |
| MAM Adapter* | 0.5% | 59.2 | 88.5 | 90.6 | 74.3 | 94.2 | 87.4 | 92.6 | 90.2 | 84.6 |
| BitFit* | 0.1% | 61.8 | 92.0 | 90.8 | 77.8 | 93.7 | 84.8 | 91.3 | 84.5 | 84.6 |
| MAM Adapter† | 1% | 59.7 | 90.2 | 90.6 | 77.3 | 94.6 | 87.6 | 92.9 | 90.9 | 85.5 |
| LoRA (orig.)† | 1% | 63.9 | 89.7 | 90.7 | 76.2 | 94.5 | 87.5 | 92.7 | 90.8 | 85.8 |
| Dynamic TARP | 1% | 65.3±.8 | 90.9±.4 | 91.0±.2 | 80.9±.7 | 94.8±.2 | 87.6±.2 | 93.0±.2 | 91.3±.1 | 86.8 |
For classification tasks, we compare with finetuning all layers; weight Masking (Zhao et al., 2020); BitFit (Zaken et al., 2021), which only finetunes the biases; AdapterFusion (Pfeiffer et al., 2021), which composes learned adapters (Houlsby et al., 2019); and He et al. (2022), who proposed a unified framework connecting several state-of-the-art adaptation methods like LoRA (Hu et al., 2022) and Adapter (Houlsby et al., 2019), and derived an improved method (MAM Adapter). Our dynamic TARP can only partly be viewed in this unified framework, as we explore a novel design dimension, namely, making the modification to the base model dynamic with respect to input tokens. For fair comparison, we follow past work (Liu et al., 2019b; Zhao et al., 2020; He et al., 2022) and set the maximum number of finetuning epochs to 10 on each task. In Table 5, dynamic TARP introduces and trains only 1% as many parameters as the original model, while achieving comparable results to full finetuning and outperforming the previous best results from MAM adapters.
For generative tasks, we ablate the design of TARP and compare with available numbers, including finetuning all layers; FT-Top2, which only finetunes the last two layers of the model; Adapter (Houlsby et al., 2019), which only finetunes adapter layers inserted after each feed-forward and self-attention sublayer; Prefix-tuning (Li and Liang, 2021); and LoRA (Hu et al., 2022). As shown in Table 6, TARP methods match or outperform other parameter-efficient methods, while learning task-specific parameters that are <3% of the number of base parameters and keeping the base model unchanged. Among the three TARP variants, we find that Dynamic > Bilinear > Kronecker in terms of performance across generative metrics. This suggests that the optimal adaptation to the underlying model weights may vary per token, which the dynamic low-rank form accounts for. Moreover, dynamic TARP performs better than an alternative where the O(n²) Hadamard product in Eq. (1) is replaced by an O(n³) matrix multiplication (w/ matrix mult.).
Table 6: Natural language generation results with GPT-2 medium on E2E, DART, and WebNLG (* denotes results from previous work).

| Method | Params. per task | E2E BLEU | E2E NIST | E2E METEOR | E2E ROUGE-L | E2E CIDEr | DART BLEU | WebNLG BLEU |
|---|---|---|---|---|---|---|---|---|
| Finetuning (full)* | 100% | 68.2 | 8.62 | 46.2 | 71.0 | 2.47 | 46.0 | 47.6 |
| FT-Top2* | 7.1% | 68.1 | 8.59 | 46.0 | 70.8 | 2.41 | 38.1 | 33.5 |
| BitFit* | 0.1% | 67.2 | 8.63 | 45.1 | 69.3 | 2.32 | 43.3 | 50.5 |
| Adapter* | 3.2% | 68.9 | 8.71 | 46.1 | 71.3 | 2.47 | 45.4 | 54.0 |
| Prefix* | 1.0% | 69.7 | 8.81 | 46.1 | 71.4 | 2.49 | 45.7±.2 | 54.4±.1 |
| LoRA* | 1.0% | 70.4±.1 | 8.85±.02 | 46.8±.2 | 71.8±.1 | 2.53±.02 | 47.1±.2 | 55.3±.2 |
| Bilinear TARP | 2.4% | 68.8 | 8.75 | 46.1 | 70.8 | 2.43 | 46.7 | 54.0 |
| Kronecker TARP | 2.4% | 68.2 | 8.73 | 45.2 | 69.4 | 2.36 | 45.6 | 53.1 |
| Dynamic TARP | 1.0% | 69.7±.1 | 8.78±.02 | 46.9±.2 | 72.1±.1 | 2.51±.01 | 47.9±.2 | 55.3±.1 |
| w/ matrix mult. | 1.0% | 68.3 | 8.64 | 46.4 | 71.1 | 2.47 | 46.5 | 53.2 |
4.3 Dynamic TARP Outperforms Finetuning
Tables 5 and 6 also show that dynamic low-rank reparameterization outperforms finetuning on the corresponding evaluation metrics while being faster, as it only adapts a small set of weights. Training time further improves because the training data are used more efficiently. In Figure 5 (left) we compare perplexities of our method against finetuning on subsets of WikiText-2 and see that finetuning increasingly underperforms as the number of examples decreases. To explain this behavior, in Figure 5 (right) we fix the number of training examples to 100 and ablate the rank. Our method performs best with a very small rank value, suggesting that the difference between the pretrained and finetuned weight matrices lies in a lower-dimensional subspace. This complements Aghajanyan et al.’s (2021b) observation that direct adaptation in lower-dimensional spaces can be as effective as in the original space. Moreover, we find that the larger the model (GPT-2 medium vs. GPT-2 small), the lower the rank value required for the best adaptation performance.
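As a back-of-the-envelope illustration of why small ranks suffice, consider the parameters added by a rank-r bilinear update to a single d×d weight matrix (d = 1024 for GPT-2 medium); the counts below assume a purely additive factorization and ignore the dynamic variant's MLP, so they are indicative only.

```python
# Added parameters for a rank-r bilinear update to a d x d weight matrix:
#   r * (d + d), versus d * d for fully finetuning that matrix.
d = 1024                      # hidden size of GPT-2 medium (assumed)
full = d * d
for r in (1, 4, 8, 32):
    added = r * (d + d)
    print(f"rank {r:>2}: {added:>6} added params ({100 * added / full:.2f}% of the matrix)")
# rank  1:   2048 added params (0.20% of the matrix)
# rank  4:   8192 added params (0.78% of the matrix)
# rank  8:  16384 added params (1.56% of the matrix)
# rank 32:  65536 added params (6.25% of the matrix)
```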
4.4 TAMS Discovers Better Architectures
Recent studies have shown that simple modifications to the Transformer architecture, such as reorganizing the MHSA and FFN modules (Zhao et al., 2021) or adding 1D convolutions to self-attention (So et al., 2021), improve task performance. Similarly, from our results in Table 3, adapting the model structure through sublayer modifications in our meta-learning framework further reduces test perplexity compared to MLtD with a fixed model structure. Applying task-aware architecture search (TAMS) to the FFN module adds less than 5% more model parameters compared to the original GPT-2 model, but reduces perplexity by 3 points on average.
A limitation we observe is that TAMS tends to produce a dominant architecture (cf. Figure 6) rather than a distinct architecture for each task. We conjecture this may be because our initial task representation strategy has low variance, as it averages across the entire task training set. This may explain why TAMS did not uniformly improve MLtD in all settings. Nevertheless, the perplexity reduction implies that there is still room to optimize the architecture of current LMs without significantly increasing total model size. We thus believe that task-aware architecture search is a promising direction for future work.
4.5 Training Efficiency of MLtD
We study training efficiency by comparing the training and finetuning wall-clock time for multi-domain abstractive summarization on AdaptSum. The results are shown in Table 7.
Table 7: Wall-clock preparation and finetuning time for multi-domain abstractive summarization on AdaptSum.

| Method | Prep. (hrs.) | Finetuning (mins.) |
|---|---|---|
| Baseline (direct finetuning) | – | 26 |
| SDPT | 64 | 16 |
| DAPT | 208 | 23 |
| TAPT | 8 | 18 |
| MLtD | 39 | 9 |
| (TARP only) | 22 | 7 |
| (TARP only, no meta-learning) | – | 20 |
We have the following observations: (1) Since meta-learning explicitly optimizes the model for fast adaptation, MLtD takes fewer epochs to reach convergence than previous methods (e.g., Figure 7) and takes the least time to adapt the model to each task; (2) Since our lightweight adaptation method (TARP) only updates a small set of task-specific weights, our model variant (TARP only, no meta-learning) reduces adaptation time by 20% over direct BART finetuning.
On the other hand, the proposed TARP and TAMS components introduce some inference overhead. Due to limitations of current DL libraries in implementing parallel computation branches, the dynamic low-rank decomposition and the task-aware architecture generation increase inference time by 10% and 6%, respectively, measured with a batch size of 4 and a sequence length of 1024 on one GPU.
5 Conclusion
We have shown that explicit meta-learning is a useful preparation step on top of PLMs to improve later finetuning. Specifically, our MLtD framework incorporating dynamic task-adaptive reparameterization (TARP) and task-adaptive model search (TAMS) enables data- and parameter-efficient adaptation to a family of low-resource tasks. Future avenues include applying our method to other modalities like vision and speech, as well as exploring better model formulations for TARP and TAMS.
A Further Details for Main Experiments
A.1 Hyperparameters
Most of the experimental setup, for example, model type, maximum sequence length, optimizer, batch size, and beam search size, is taken from previous methods for fair comparison. We tuned the inner-loop and outer-loop learning rates during meta-training on the meta-validation set, and adjusted the learning rate schedule accordingly. We chose the rank values r in our dynamic low-rank reparameterization to give similar parameter counts to other parameter-efficient methods. We adapted the search space in our task-aware model structure from DARTS. ηin denotes the inner-loop and finetuning learning rate, ηout the outer-loop learning rate, Bin the inner-loop and finetuning batch size, Bout the meta-batch size, bsz the decoding beam size, and Tin the number of inner loop steps.
Few-shot Dialogue Personalization.
We take r = 4, Bout = 16 (as in previous work), bsz = 5, and Tin = 10. For meta-training we use SGD (ηin = 0.01) in the inner loop and Adam (ηout = 0.0003) for the outer loop.
Low-resource Abstractive Summarization.
We take r = 16, Bin = 40 (via gradient accumulation), Tin = 20, and bsz = 4. We truncated input documents to 1024 tokens due to the maximum input length of the BART model. We used Adam with momentum (β1 = 0.9, β2 = 0.998) and the Noam schedule (linear warmup of 1000 steps, then inverse square-root decay). Since the low-resource training set of the science domain only has 100 samples, we used 3× more training epochs than for the other domains.
Multi-domain Language Modeling.
We take r = 32, Bin = 4, and Tin = 1. We used Adam with ηin = 5 × 10−4 and ηout = 5 × 10−5. In meta-testing we linearly decay ηin.
A.2 Training Costs
Table 8 provides information about the amount of training that MLtD required for the main experiments. The reported training times are for a single run of each experiment. For hyperparameter tuning, we searched the inner-loop and outer-loop learning rates over five runs each.
Table 8: Training cost details for the main experiments.

| Experiment | #GPUs | GPU type | Training time | Meta-iterations | Cluster | Cost |
|---|---|---|---|---|---|---|
| Low-resource abstractive summarization (Table 2) | 1 | 32GB V100 | 39 hrs | 100 | AWS p3.2xlarge | $120 |
| Few-shot dialogue personalization (Table 1) | 1 | 32GB V100 | 1.5 hrs | 100 | AWS p3.2xlarge | $5 |
| Multi-domain language modeling (Table 3) | 1 | 32GB V100 | 22 hrs | 100 | AWS p3.2xlarge | $68 |
B Further Details for TARP Experiments
Datasets.
E2E (Novikova et al., 2017) is commonly used for data-to-text evaluation of NLG systems. It consists of approximately 50K examples in total from the restaurant domain. Each input consists of a sequence of slot-value pairs and can have multiple references; the average output length is 22.9. We use the official evaluation script, which reports BLEU, NIST, METEOR, ROUGE-L, and CIDEr. WebNLG (Gardent et al., 2017) is a multi-domain dataset for data-to-text evaluation. It contains 22K examples in total from 14 distinct domains, and the average output length is 22.5. Nine domains are used for training, and the remaining five are used for testing. Each input is represented by a sequence of SUBJECT — PROPERTY — OBJECT triples. The evaluation metric is BLEU. DART (Nan et al., 2021) is an open-domain data-to-text dataset whose inputs are structured as sequences of ENTITY — RELATION — ENTITY triples. It contains 82K examples in total and the average output length is 21.6. The evaluation metric is BLEU. For GLUE, we report Matthews correlation for CoLA, Pearson correlation for STS-B, and accuracy for the other tasks in Table 5; following Zhao et al. (2020) and He et al. (2022), we report dev set performance.
Setup.
For the natural language generation tasks, we build upon Hu et al.’s (2022) code. We used GPT-2 medium as the underlying LM. In training, we used the AdamW optimizer with weight decay 0.01 and a batch size of 8, and trained for 5 epochs in total. We used a linear-decay learning rate schedule with the first 500 iterations for warmup and an initial learning rate of 0.0002. In decoding, we used beam search with beam size 10.
For GLUE tasks, we built upon He et al.’s (2022) code. Our experiments were performed on the RoBERTa-base model. We limited the maximum length of a sentence (pair) to 512 after wordpiece tokenization. We used the Adam optimizer with batch size 32 and trained for 10 epochs on each task. The learning rate is a hyperparameter tuned per task over {1, 2, 3, 4, 5} × 10−4, with a linear warmup for the first 6% of steps followed by a linear decay to zero.
Acknowledgments
We thank our colleagues on the Speech Science team at Amazon AWS AI for supporting this research. We also thank our TACL action editor Shay Cohen and the reviewers for their helpful feedback.
Notes
Our LoRA comparisons use the snapshot at https://github.com/microsoft/LoRA/tree/snapshot-9-15-2021; the code has been greatly refactored since our experiments.