Abstract
Large pretrained language models are widely used in downstream NLP tasks via task-specific fine-tuning, but such procedures can be costly. Recently, Parameter-Efficient Fine-Tuning (PEFT) methods have achieved strong task performance while updating far fewer parameters than full-model fine-tuning (FFT). However, it is non-trivial to make informed design choices on the PEFT configurations, such as their architecture, the number of tunable parameters, and even the layers in which the PEFT modules are inserted. Consequently, it is highly likely that the current, manually designed configurations are suboptimal in terms of their performance-efficiency trade-off. Inspired by advances in neural architecture search, we propose AutoPEFT for automatic PEFT configuration selection: We first design an expressive configuration search space with multiple representative PEFT modules as building blocks. Using multi-objective Bayesian optimization in a low-cost setup, we then discover a Pareto-optimal set of configurations with strong performance-cost trade-offs across different numbers of parameters that are also highly transferable across different tasks. Empirically, on GLUE and SuperGLUE tasks, we show that AutoPEFT-discovered configurations significantly outperform existing PEFT methods and are on par with or better than FFT without incurring substantial training efficiency costs.
1 Introduction and Motivation
Pretrained language models (PLMs) are used in downstream tasks via the standard transfer learning paradigm, where they are fine-tuned for particular tasks (Devlin et al., 2019; Liu et al., 2019b). This achieves state-of-the-art results in a wide spectrum of NLP tasks, becoming a prevalent modeling paradigm in NLP (Raffel et al., 2020). Fine-tuning the PLMs typically requires a full update of their original parameters (i.e., the so-called full-model fine-tuning (FFT)); however, this is (i) computationally expensive and also (ii) storage-wise expensive, as it requires saving a separate full model copy for each task-tuned model. With the ever-growing size of PLMs (Brown et al., 2020; Sanh et al., 2022), the cost of FFT becomes a major bottleneck, due to its increasing demands as well as its computational (time and space) inefficiency.
Parameter-efficient fine-tuning (PEFT) delivers a solution for alleviating the issues with full-model FT (Houlsby et al., 2019). By freezing the majority of pretrained weights of PLMs, PEFT approaches only update a small portion of parameters for efficiently adapting the PLM to a new downstream task. Recent studies have shown that PEFT can achieve competitive task performance while being modular, adaptable, and preventing catastrophic forgetting in comparison to traditional FFT (Wang et al., 2022; Pfeiffer et al., 2023).
Recent developments have created diverse PEFT modules with distinctive characteristics (Pfeiffer et al., 2020b; Li and Liang, 2021), with one of two main aims in focus: 1) improve task performance over other PEFT approaches while maintaining the same parameter budget as the competitor PEFT methods; or 2) maintain task performance while reducing the parameter budget needed. Existing PEFT modules, optimizing for one of the two aims, have been successfully applied to transfer learning tasks (Chen et al., 2022b; Pfeiffer et al., 2022). However, different tasks, of different complexity, show distinct sensitivity to the allocated parameter budget and even to the chosen PEFT approach (He et al., 2022). At the same time, most PEFT applications are limited to a single PEFT architecture (e.g., serial adapters, prefix-tuning) with fixed decisions on its components (e.g., hidden size dimensionality, insertion layers), resulting in potentially suboptimal PEFT configurations across many tasks. Therefore, in this work, we propose a new, versatile, and unified framework that automatically searches for improved and task-adapted PEFT configurations, aiming to effectively balance the two (often colliding) goals of (i) improving performance and (ii) keeping the desired low parameter budget for PEFT.
While recent research has started exploring more dynamic PEFT configurations, prior studies remain limited across several dimensions, including how they define the configuration search space. Namely, they typically focus only on a single PEFT architecture (e.g., adapters) or their simple combinations, or a single property (e.g., insertion layers—where to insert the module); see a short overview later in §3. Here, we propose a unified and more comprehensive framework for improved configuration search. It covers multiple standard PEFT modules (serial adapters, parallel adapters, and prefix-tuning) as building blocks, combined with the critical parameter budget-related decisions: the size of each constituent module and the insertion layers for the modules.
Our defined comprehensive search space is huge; consequently, traversing it effectively and efficiently is extremely challenging. To enable search over the large configuration space, we thus propose the novel AutoPEFT framework. It automatically configures multiple PEFT modules along with their efficiency-oriented design decisions, relying on a high-dimensional Bayesian optimization (BO) approach. Crucially, within the search space, we propose a multi-objective optimization which learns to balance simultaneously between maximizing the searched configurations’ task performance and parameter efficiency.
We conduct extensive experiments on the standard GLUE and SuperGLUE benchmarks (Wang et al., 2018, 2019), with encoder-only and encoder-decoder models. We first study the transferability of the AutoPEFT-searched architecture by running AutoPEFT on a single task with a low-fidelity proxy (aiming to reduce computational cost), followed by transferring the found architecture to other tasks. Experimental results show that this architecture can outperform existing PEFT baselines while achieving on-par performance with the standard FFT. Further slight gains can be achieved with a larger computation budget for training, where we run AutoPEFT per task to find a task-adapted PEFT configuration. As revealed in Figure 1, AutoPEFT can find configurations that offer a solid trade-off between task performance and parameter efficiency, even outperforming FFT. We also provide ablation studies over the search space, validating that the AutoPEFT framework is versatile and portable to different search spaces.
Contributions.
1) We propose the AutoPEFT search space, which contains diverse and expressive combinations of PEFT configurations, with three representative PEFT modules as foundational building blocks and the binary decisions on the Transformer layers in which to insert these modules as searchable dimensions. 2) To navigate the vast AutoPEFT search space and to discover a set of transferable PEFT configurations that optimally trade performance against cost across various parameter ranges in a single run, we further propose an effective search method based on multi-objective Bayesian optimization. 3) We demonstrate that the one-time search cost of AutoPEFT is low and that it yields task-shareable configurations, outperforming existing PEFT modules while being transferable across tasks. The AutoPEFT framework can also be easily extended to other and new PEFT modules. The code is available at https://github.com/cambridgeltl/autopeft.
2 AutoPEFT Framework
2.1 Designing the AutoPEFT Search Space
Inspired by the success of neural architecture search (NAS) methodology (Ru et al., 2020), we similarly start by designing a large and expressive configuration space. We additionally provide the motivation behind each decision to include a particular module and its components in the configuration space, along with a mathematical formulation.
The search space is known to be one of the most important factors in the performance of the configurations to be discovered subsequently (Ru et al., 2020; Xie et al., 2019; Li and Talwalkar, 2019; Dong and Yang, 2020; Yang et al., 2020). In order to simultaneously maximize task performance along with parameter efficiency, it is necessary to first define a ‘parameter-reducible’ search space, where each dimension within the space potentially contributes to reducing the parameter budget. Similarly, each dimension potentially impacts the performance positively without introducing redundancy in the space (Wan et al., 2022). Therefore, we propose the following search space with representative PEFT modules spanning a plethora of (non-redundant) configurations as illustrated in Figure 2:
PEFT Modules.
Inspired by common practices in NAS of using known well-performing modules as building blocks, we include three distinctive PEFT designs to efficiently adapt different forwarding stages of hidden states in the PLM layers. We combine Serial Adapters (SA), Parallel Adapters (PA), and Prefix-Tuning (PT) as the three representative modules in the search space, where the PT module adapts the multi-head attention layer, and SA and PA interact with the FFN layer (Figure 2). Each configuration makes a decision on the PEFT modules in each insertion layer: each of them can be turned on or off. We combine this binary decision with the actual non-binary decision on the module size (see below), so that a value of 0 denotes the absence of the module in the layer(s). We note that other PEFT modules such as LoRA (Hu et al., 2022a) are scaled variants of PA with the same insertion form (He et al., 2022). As we empirically validate later, the resulting search space spanned by the selected building blocks is extremely expressive and flexible and enables the discovery of configurations that outperform any of the individual building blocks and other PEFT modules.
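To make the insertion points concrete, the following minimal PyTorch-style sketch shows the shared bottleneck form of SA and PA and their serial vs. parallel placement around the FFN sub-layer (layer norms, dropout, and scaling factors are omitted, and all names are illustrative rather than taken from AdapterHub); the PT module, which instead adapts the attention sub-layer, is sketched after the 'Combining PEFT Modules' paragraph below.

```python
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Shared bottleneck form used by both the serial (SA) and parallel (PA) adapters."""
    def __init__(self, d_model: int, d_bottleneck: int):
        super().__init__()
        self.down = nn.Linear(d_model, d_bottleneck)
        self.up = nn.Linear(d_bottleneck, d_model)
        self.act = nn.ReLU()

    def forward(self, h):
        return self.up(self.act(self.down(h)))

def adapted_ffn_sublayer(x, ffn, sa=None, pa=None):
    """FFN sub-layer with optional serial/parallel adapters (residuals simplified)."""
    h = ffn(x)
    if pa is not None:   # the parallel adapter reads the sub-layer *input*
        h = h + pa(x)
    if sa is not None:   # the serial adapter transforms the sub-layer *output*
        h = h + sa(h)
    return x + h         # residual connection
```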
Size.
Previous studies show that PEFT methods are highly sensitive to the number of tunable parameters: Adaptively setting their capacity in accordance with the target task is, therefore, essential for achieving good performance (Chen et al., 2022a). The number of tunable parameters depends on each particular module. The additional parameters introduced by both SA and PA are dominated by their bottleneck dimension D. Similarly, the size of the PT module is defined by its prefix length LPT. Thus, we define a binary logarithmic search scale for the respective discrete sets DSA, DPA, and LPT, spanning values from 0 (absence of the module) to Dh, where Dh is the dimensionality of the output embedding of the PLM (e.g., Dh = 768 for BERTbase).
Insertion Layers.
Prior work has also shown that different layers in the PLMs store different semantic information (Vulić et al., 2020), where the higher layers produce more task-specific and contextualized representations (Tenney et al., 2019). Therefore, as another configuration dimension, we aim to search for the minimal number and the actual position of layers in which to insert the PEFT modules. We define a binary ‘insertion’ decision at each layer li.
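Putting the three dimensions together, a single point in the search space can be encoded as a short fixed-length vector. The dataclass below is a hypothetical encoding (field names, the parameter-count formula, and the example values are ours, not the paper's): a size of 0 doubles as the 'module absent' decision, and one bit per Transformer layer controls insertion.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PEFTConfig:
    # Binary insertion decision per Transformer layer (12 for BERT-base):
    # a layer set to False receives no PEFT module at all.
    insert_layer: List[bool]
    # Global module sizes, shared by all inserted layers; 0 disables the module.
    d_sa: int     # serial-adapter bottleneck dimension
    d_pa: int     # parallel-adapter bottleneck dimension
    l_pt: int     # prefix length for prefix-tuning

    def num_extra_params(self, d_model: int = 768) -> int:
        """Rough count of added parameters (biases and reparameterizations ignored)."""
        per_layer = 2 * d_model * (self.d_sa + self.d_pa) + 2 * d_model * self.l_pt
        return per_layer * sum(self.insert_layer)

# Example: enable layers 2-12, medium-sized SA and PA, no prefix-tuning.
cfg = PEFTConfig(insert_layer=[False] + [True] * 11, d_sa=64, d_pa=128, l_pt=0)
print(cfg.num_extra_params())
```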
Combining PEFT Modules.
The final composition should adapt effectively to both the bias-influenced hidden states and the original inputs before the pretrained FFN layer.1
Further, applying PEFT modules to interact with both the FFNs and the multi-head attention should positively impact task performance (Mao et al., 2022; He et al., 2022). PT learns two prefix vectors, Pk and Pv, that are concatenated with the original multi-head attention's key and value vectors, which efficiently adapts the multi-head attention layer to fit the target task. Thus, we finally combine the SA and the PA (i.e., SAPA from above) with PT.
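For completeness, the sketch below (again simplified: single-head, no reparameterization of the prefixes) illustrates how the learned prefixes Pk and Pv are prepended to the attention keys and values, which is how the PT component of the combined configuration adapts the attention sub-layer.

```python
import torch

def prefix_attention(q, k, v, p_k, p_v):
    """Single-head attention with learned prefixes prepended to keys/values.

    q, k, v: (batch, seq, d); p_k, p_v: (L_pt, d) learned prefix vectors.
    """
    p_k = p_k.unsqueeze(0).expand(k.shape[0], -1, -1)
    p_v = p_v.unsqueeze(0).expand(v.shape[0], -1, -1)
    k = torch.cat([p_k, k], dim=1)              # keys:   (batch, L_pt + seq, d)
    v = torch.cat([p_v, v], dim=1)              # values: (batch, L_pt + seq, d)
    scores = q @ k.transpose(-1, -2) / k.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v    # (batch, seq, d)
```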
In sum, an overview of the dimensions spanning the final configuration space is provided in Figure 2. The combination of the different ‘configuration dimensions’ outlined above gives rise to a total of, e.g., 5,451,776 possible configurations with BERTbase and ∼3 × 10^10 configurations with RoBERTalarge (i.e., the number of configurations is 2^|l| × |DSA| × |DPA| × |LPT|). While a large search space is crucial for expressiveness and to ensure that good-performing configurations are contained, it also increases the difficulty for search strategies to navigate the search space well while remaining sample- and thus computationally efficient. Furthermore, in the PEFT setting, we are also often interested in discovering a family of configurations that trade off between performance and efficiency for general application in various scenarios with different resource constraints, thus giving rise to a multi-objective optimization problem where we simultaneously aim to maximize performance while minimizing costs. In what follows, we propose a search framework that satisfies all those criteria.
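The quoted search-space sizes can be reproduced from the formula above. The snippet below does so under an assumed grid of eleven size options per module for BERTbase and twelve for RoBERTalarge (0 plus a binary-logarithmic scale up to Dh); the exact grids are our illustrative reconstruction, chosen to be consistent with the reported counts.

```python
# Number of configurations = 2^|l| * |D_SA| * |D_PA| * |L_PT|.
n_layers_base = 12                                            # BERT-base
size_grid_base = [0, 1, 2, 4, 8, 16, 32, 64, 128, 256, 768]   # 11 options (assumed grid)
print(2 ** n_layers_base * len(size_grid_base) ** 3)          # 5451776, as quoted above

n_layers_large = 24                                           # RoBERTa-large, D_h = 1024
n_size_options_large = 12                                     # assumed: 0 plus 1, 2, ..., 512, 1024
print(f"{2 ** n_layers_large * n_size_options_large ** 3:.1e}")  # ~2.9e+10, i.e. ~3 x 10^10
```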
2.2 Pareto-Optimal Configuration Search
Multi-objective Optimization Formulation.
The ultimate goal of AutoPEFT is to discover promising PEFT configuration(s) from the expressive search space designed in §2.1, which is itself challenging. In this paper, we focus on an even more challenging but practical goal: Instead of aiming to find a single, best-performing PEFT configuration, we aim to discover a family of Pareto-optimal PEFT configurations that trade performance against parameter-efficiency (or parameter cost) optimally: One of the most impactful use cases of PEFT is its ability to allow fine-tuning of massive language models even with modest computational resources, and thus we argue that searching Pareto-optimal configurations is key as it allows tailored user- and scenario-specific PEFT deployment depending on the computational budget.
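Schematically, the description above corresponds to a bi-level, multi-objective problem of the form below (a reconstruction from the surrounding text, since Eq. 4 itself is not reproduced in this section; the paper's exact notation may differ), whose Pareto front is the target of the search:

```latex
\max_{a \in \mathcal{A}} \;
\Big( f\big(a;\, W^{*}(a)\big),\; -\mathrm{cost}(a) \Big)
\qquad \text{s.t.} \qquad
W^{*}(a) = \arg\min_{W} \mathcal{L}_{\text{train}}\big(W;\, a\big),
```

where a is a PEFT configuration from the search space, W are the tunable parameters it introduces, f is the validation metric obtained after fine-tuning those parameters, and cost(a) is the number of parameters a adds.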
Bayesian Optimization (BO).
To solve Eq. (4), we adopt a BO approach, illustrated in Figure 3. On a high level, BO consists of a surrogate model that sequentially approximates the objective function based on the observations so far and an acquisition function that is optimized at each iteration to actively select the next configuration to evaluate. Typically, the surrogate model is a Gaussian process (GP), a flexible and non-parametric model with well-principled and closed-form uncertainty estimates: Given an observed set of n configurations and their evaluated performance, D_n = {(a_i, f(a_i))}_{i=1,…,n}, the GP surrogate model gives a closed-form posterior distribution over the true, unobserved function values f, including for configurations that have not been evaluated before. The acquisition function α(·), on the other hand, uses the posterior distribution of the surrogate model to assign a utility value to candidate configurations in the search space A, typically balancing exploitation (i.e., querying near configurations in D_n that were previously observed to be strong) and exploration (i.e., querying configurations far from D_n, about which we have little knowledge and which can potentially be even better). At each step of BO, the acquisition function is optimized (note that while evaluating f(a) is expensive, evaluating α(a), which only uses the posterior distribution from the surrogate model, is not) to select the next configuration (or batch of configurations) to evaluate. For a detailed overview of BO, we refer the readers to Garnett (2023) and Frazier (2018).
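As a minimal, self-contained illustration of this surrogate/acquisition loop (single-objective, with a toy one-dimensional objective standing in for 'fine-tune a configuration and return its dev score'; this is not the paper's multi-objective SAAS-GP/NEHVI pipeline, which is described next):

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):
    """Toy stand-in for 'train a PEFT configuration and return its dev score'."""
    return float(np.sin(3 * x) + 0.5 * np.cos(5 * x))

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 2.0, size=(5, 1))            # initial random "configurations"
y = np.array([objective(x[0]) for x in X])

def expected_improvement(cands, gp, best_y):
    mu, sigma = gp.predict(cands, return_std=True)
    sigma = np.clip(sigma, 1e-9, None)
    z = (mu - best_y) / sigma
    return (mu - best_y) * norm.cdf(z) + sigma * norm.pdf(z)

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
for _ in range(20):
    gp.fit(X, y)                                  # (re)fit the surrogate on all observations
    cands = rng.uniform(0.0, 2.0, size=(256, 1))  # candidate pool (stand-in for local search)
    ei = expected_improvement(cands, gp, y.max())
    x_next = cands[np.argmax(ei)]                 # optimize the cheap acquisition, not f itself
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next[0]))

print("best observed value:", y.max())
```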
Rationales for Using BO.
We argue that BO is well-suited to the task in principle and has various advantages over alternative, viable approaches such as those based on differentiable NAS (DARTS) (Liu et al., 2019a), which typically utilize a continuous relaxation of the discrete configurations, thereby allowing the configuration a to be jointly optimized with the model weights W in Eq. 4 via a supernet.
First, unlike the DARTS-based approach, by treating the optimization problem defined in Eq. 4 as a black box, BO decouples the optimization of the weights W from the optimization of the architecture a, and solves the latter problem with no gradient information at all (White et al., 2021; Ru et al., 2021). This makes a BO-based solution more parallelizable and more amenable to a distributed setup, which modern large PLMs often rely on, as multiple configuration evaluations may take place simultaneously on different client machines as long as they can relay the evaluation results f(a) back to a central server running the BO. This further contributes to memory efficiency: unlike the DARTS-based method that optimizes a supernet (a heavily over-parameterized network that can be deemed a weighted superposition of all configurations in the search space A), each parallel evaluation in BO trains a single configuration only; we argue that this point is particularly important for PEFT given its main promise of parameter efficiency.
Second, as discussed, it is often desirable to discover a family of configurations with different trade-offs between performance and parameters for different application scenarios. As we will show, while BO generalizes elegantly to handle vector-valued objective functions and may generate a Pareto front (PF) of configurations in a single run, competing methods, such as supernet-based NAS methods, typically require a scalar objective function and are thus limited to discovering a single best-performing configuration (Eriksson et al., 2021; Izquierdo et al., 2021); this means that one typically needs to run the NAS pipeline multiple times for different cost budgets with these methods.
Lastly, while one of the main arguments favoring differentiable techniques is their lighter computational expense, since one only needs to train the supernet once rather than repeatedly training different candidate configurations, the sample-efficient nature of BO and the strong transferability of the discovered configurations also ensure that the computational cost of our proposed method remains tractable. As we will show in §4, while DARTS-based NAS is indeed a plausible approach for PEFT configuration search, our approach performs competitively with S3PET (Hu et al., 2022b), a DARTS-based method.
Adapting BO to the AutoPEFT Task.
Adapting BO to the high-dimensional and combinatorial AutoPEFT search space is non-trivial. To address the challenges, we customize both components of BO, and the overall pipeline is shown in Algorithm 1. Instead of a standard GP, we propose to use a Gaussian process with sparse axis-aligned subspaces (SAAS-GP) (Eriksson and Jankowiak, 2021) as the surrogate model: As an intuitive explanation, SAAS-GP places strong, sparsity-inducing priors on the GP hyperparameters to alleviate the difficulty in modeling high-dimensional data by assuming that despite the high nominal dimensionality, some search dimensions contribute much more significantly to the variation of the objective function than others—this assumption is shown to hold in related problems of NAS in computer vision (Wan et al., 2022) and discrete prompt search in PLMs (Zhou et al., 2023), and we expect similar findings in our particular case.
For the acquisition function, we use the noisy expected hypervolume improvement (NEHVI) (Daulton et al., 2021) to handle the multi-objective setting: Unlike the commonly used scalarisation approach that transforms the vector-valued objective function to a scalar weighted sum (which corresponds to a single point on the PF), NEHVI is capable of automatically exploring all parts of the PF in a single run. Lastly, we additionally use low-fidelity approximations, a popular low-cost performance estimation strategy in NAS (Elsken et al., 2019), to manage the search cost: At search-time, instead of fine-tuning each candidate PEFT configuration in full, we only fine-tune with a much smaller number of iterations (5% of full)—this is possible as we are only interested in the relative ranking (rather than the performance itself) of the different configurations during search. Consistent with NAS literature, we also find the low-fidelity estimate to provide a reliable ranking, with the best-performing configurations in low fidelity also performing the best under fine-tuning with the full number of iterations. As we will show in §5, using the low-fidelity search pipeline, in combination with the strong transferability of the discovered configurations, AutoPEFT only incurs an additional one-off, 1.9% of the total GLUE fine-tuning cost, but delivers significant performance gains.
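A sketch of how these two components could be wired together in BoTorch is shown below; it assumes a recent BoTorch version in which fully Bayesian SAAS models can be wrapped in a ModelListGP and consumed by qNEHVI, and the module paths, argument names, and MCMC settings are indicative rather than the paper's actual implementation.

```python
import torch
from botorch.models.fully_bayesian import SaasFullyBayesianSingleTaskGP
from botorch.models.model_list_gp_regression import ModelListGP
from botorch.fit import fit_fully_bayesian_model_nuts
from botorch.acquisition.multi_objective.monte_carlo import (
    qNoisyExpectedHypervolumeImprovement,
)

# Encoded configurations (n x d, scaled to [0, 1]) and two objectives to maximize:
# low-fidelity dev score and negated parameter fraction (toy random data here).
train_X = torch.rand(20, 15, dtype=torch.double)
train_score = torch.rand(20, 1, dtype=torch.double)
train_negcost = -torch.rand(20, 1, dtype=torch.double)

# One SAAS-GP per objective, fitted with NUTS; the sparsity-inducing priors
# downweight unimportant search dimensions.
models = []
for y in (train_score, train_negcost):
    gp = SaasFullyBayesianSingleTaskGP(train_X, y)
    fit_fully_bayesian_model_nuts(
        gp, warmup_steps=64, num_samples=64, thinning=8, disable_progbar=True
    )
    models.append(gp)
model = ModelListGP(*models)

# qNEHVI scores candidates by their expected improvement of the Pareto hypervolume.
acqf = qNoisyExpectedHypervolumeImprovement(
    model=model,
    ref_point=[0.0, -1.0],     # worst acceptable (score, -parameter fraction)
    X_baseline=train_X,
    prune_baseline=True,
)
candidates = torch.rand(8, 1, 15, dtype=torch.double)   # 8 candidate configs, q = 1
print(acqf(candidates))        # pick the argmax as the next configuration to fine-tune
```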
3 Related Work
PEFT Methods in NLP.
Standard PEFT methods can be divided into two main groups (Pfeiffer et al., 2023). 1) Some methods fine-tune a small portion of pretrained parameters (Zhao et al., 2020; Guo et al., 2021). For instance, Ben Zaken et al. (2022) propose to fine-tune the PLM’s bias terms, while Sung et al. (2021) and Ansell et al. (2022) fine-tune sparse subnetworks within the original PLM for a particular task. 2) Other methods fine-tune an additional set of parameters (Liu et al., 2022). Since there is no interference with the pretrained parameters, this class of PEFT modules, besides offering strong task performance, is arguably more modular; we thus focus on this class of PEFT methods in this work. The original adapter modules (Houlsby et al., 2019; Pfeiffer et al., 2020b) have a bottleneck serial architecture which can be inserted into every Transformer layer, see Figure 2. LoRA (Hu et al., 2022a) assumes the low-rank intrinsic dimensionality of the target task and performs low-rank updates (Mahabadi et al., 2021). Li and Liang (2021) propose the Prefix-Tuning method that appends a learnable vector to the attention heads at each Transformer layer. Similarly, prompt-tuning (Lester et al., 2021) only appends this vector to the input embedding. UniPELT (Mao et al., 2022) integrates multiple PEFT modules with a dynamic gating mechanism. He et al. (2022) provide a unified formulation of existing PEFT modules and propose a parallel adapter module, along with a combined ‘Mix-and-Match Adapter (MAM)’ architecture that blends parallel adapters and prefix-tuning. Wang et al. (2022) propose the mixture-of-adaptations (AdaMix) architecture with weight averaging for a mixture of adapters.
Optimizing Parameter Efficiency in PEFT.
Recent work further aims to optimize the parameter efficiency of existing PEFT modules while maintaining task performance. The standard approach is to insert (typically serial) adapters into all Transformer layers, which still requires a sizeable parameter budget. Rücklé et al. (2021) address this question by randomly dropping adapters from lower-level layers, displaying only a small decrease in task performance. Adaptable Adapters (AA) (Moosavi et al., 2022) generalize this idea by learning gates that switch on or off adapters in particular Transformer layers. Neural Architecture Search (NAS) methods aim to automate the design of neural net architectures themselves, and NAS has seen great advances recently, with performance often surpassing human expert-designed architectures in various tasks (Zoph and Le, 2017; Ren et al., 2021; Elsken et al., 2019). Concerning NLP tasks and PEFT, Hu et al. (2022b) propose S3PET, which adapts Differentiable Architecture Search (DARTS) (Liu et al., 2019a) to learn the positions for inserting the PEFT modules. This work is closest in spirit to ours and is empirically compared to in §4. Conceptually, however, as discussed in detail in §2, we argue that our method offers a spectrum of advantages over S3PET and other related PEFT work, including but not limited to the ability to automatically discover a family of PEFT configurations across parameter budgets in a single run, better parallelisability and memory efficiency. Other concurrent work (Valipour et al., 2023; Zhang et al., 2023) also approaches the same problem by dynamic budget allocation mechanisms on a single PEFT module within a limited search space. Nonetheless, this field still lacks a compact solution for automatically configuring a complex space of PEFT modules (Chen et al., 2023).
4 Experimental Setup
Evaluation Data.
We follow prior PEFT research and base our evaluation on the standard and established GLUE and SuperGLUE benchmarks. For GLUE, we include four types of text classification tasks: linguistic acceptability (CoLA); similarity and paraphrase (STS-B, MRPC, QQP); sentiment analysis (SST-2); and natural language inference (RTE, QNLI, MNLI). We exclude WNLI following previous work (Houlsby et al., 2019; Mao et al., 2022). We also include CB, COPA, WiC, and BoolQ from SuperGLUE to further validate the transferability of AutoPEFT-found configurations across different tasks and datasets.
Baselines.
We compare the performance of the AutoPEFT-found configurations to standard full-model FT and to each individual PEFT module (SA, PA, PT) from the search space, used in the default setup from its respective original work. We also compare with the LoRA module to provide a comparison to low-rank decomposition methods. To compare with recent methods that also integrate multiple PEFT modules (see §3), we further include UniPELT and the MAM adapter in their default settings. We reproduce AdaMix for a comparison to a mixture of homogeneous adaptations. In ablations on insertion layers, we also include the Adaptable Adapter (AA) as a baseline, which proposes a differentiable gate learning method to select the insertion layers for PEFT modules (i.e., serial adapters originally). On T5 (Raffel et al., 2020) models, we also compare against S3PET (Hu et al., 2022b), one of the works most similar to ours, which uses differentiable NAS for configuration search.
Implementation Details.
Following previous work on the GLUE benchmark, we report the best GLUE dev set performance (Ben Zaken et al., 2022) and use 20 training epochs with an early stopping scheme of 10 epochs for all per-task experiments. We use AdapterHub (Pfeiffer et al., 2020a) as the codebase and conduct extensive experiments with the uncased BERTbase (Devlin et al., 2019) as the main backbone model. We report main experiments with the mean and standard deviation over 5 different random seeds. Following Pfeiffer et al. (2020b), we use a recommended learning rate of 10^−4 for all PEFT experiments. We use a learning rate of 2 × 10^−5 for full-model FT, following Mao et al. (2022). We use batch sizes 32 and 16 for all BERT and RoBERTa experiments, respectively. The optimizer settings for each PEFT module follow the default settings in AdapterHub (Pfeiffer et al., 2020a). We implement the BO search algorithm in BoTorch (Balandat et al., 2020) and use the recommended settings from Eriksson and Jankowiak (2021) for the surrogate. For acquisition function optimization, we use a local search method similar to previous literature with a similar setup (Wan et al., 2021; Eriksson et al., 2021): At each search iteration (after the initial randomly sampled points), we collect the Pareto-optimal architectures found up to that point. From this collection, we perform a local search by evaluating the acquisition function values of their neighbors and moving the current point to a neighbor with a higher acquisition function value; this process is repeated until convergence. Due to the relatively noisy nature of the problem, we use 100 random initialization points for all experiments, followed by 100 BO iterations. We further show results using RoBERTalarge (Liu et al., 2019b) in Table 5, with findings that are consistent with those for BERTbase. In experiments with RoBERTalarge as the underlying PLM, we report the RTE results with a learning rate of 2 × 10^−5 for AutoPEFTMRPC and AutoPEFTCoLA, and 10^−4 for AutoPEFTRTE. We use batch size 16 and a learning rate of 3 × 10^−4 for the T5base experiments with AutoPEFT in the SAPA space, and 10^−5 for STS-B. We reproduce S3PET results with batch size 8 in the same experimental setup as AutoPEFT.
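The local-search step for optimizing the acquisition function over the discrete configuration space amounts to simple hill climbing; the routine below is a schematic reconstruction over the binary layer-insertion bits only (the actual search also perturbs the size dimensions, and acq stands for whatever acquisition value the fitted surrogate provides):

```python
import random

def neighbors(config):
    """All configurations that differ from `config` by flipping one insertion bit."""
    out = []
    for i in range(len(config)):
        flipped = list(config)
        flipped[i] = 1 - flipped[i]
        out.append(tuple(flipped))
    return out

def local_search(start, acq):
    """Hill-climb on the acquisition value until no neighbor improves on it."""
    current, current_val = start, acq(start)
    while True:
        best_nb = max(neighbors(current), key=acq)
        if acq(best_nb) <= current_val:
            return current, current_val
        current, current_val = best_nb, acq(best_nb)

def acq(config):
    """Toy acquisition value: prefer ~6 enabled layers, plus a deterministic jitter."""
    return -abs(sum(config) - 6) + random.Random(str(config)).random() * 0.1

rng = random.Random(0)
start = tuple(rng.randint(0, 1) for _ in range(12))   # random 12-layer insertion pattern
print(local_search(start, acq))
```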
5 Results and Discussion
Discussion of Main Results.
The main results on BERT are summarized in Table 1, where we evaluate the AutoPEFT-found configurations searched on RTE, the most low-resource and challenging task, on the full GLUE suite. We further report selected GLUE tasks on T5 in Table 4 (where we also compare against S3PET) and RoBERTalarge in Table 5. For simplicity, we report a single configuration that leads to the highest task performance within a predefined, user-specified parameter budget from the discovered Pareto-optimal set in Table 1, whereas the full Pareto-optimal set is evaluated in Figure 4. On BERT (Table 1), we find that using only 0.76% of parameters, AutoPEFTRTE outperforms all the PEFT baselines (by more than 2% on RTE). The AutoPEFT-found configuration also outperforms the full-model FT baseline on the RTE task by more than 1%. These results indicate the effectiveness of the AutoPEFT framework in optimizing both task performance and parameter efficiency. Transferring the RTE-based configuration to other tasks, we find that strong performance is maintained across the target tasks, with more benefits on the medium-resource tasks (MRPC, STS-B, CoLA), but the configuration remains competitive also for higher-resource tasks (e.g., QQP, MNLI). Finally, we find the strength of AutoPEFT to persist on RoBERTa and on T5 as a representative of the encoder-decoder model family. It is particularly noteworthy that, in addition to outperforming the baseline PEFT methods without configuration search, AutoPEFT also performs competitively compared to S3PET with configuration search under a comparable parameter count, even though S3PET was developed and tested exclusively on T5 and its search space was designed with meticulous hand-tuning, where the authors manually excluded several building blocks that did not lead to empirical gains; this provides further empirical support for the strength of the BO-based search strategy described in §2.2.
Table 1: GLUE development set results with BERTbase (mean±std over 5 random seeds).

| Method | #Param. | RTE | MRPC | STS-B | CoLA | SST-2 | QNLI | QQP | MNLI | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| FFT | 100% | 71.12±1.46 | 85.74±1.75 | 89.00±0.45 | 59.32±0.62 | 92.57±0.24 | 91.50±0.08 | 91.52±0.04 | 84.43±0.22 | 83.15 |
| Prefix | 0.17% | 70.54±0.49 | 85.93±0.89 | 88.76±0.15 | 58.88±1.15 | 91.93±0.45 | 90.76±0.14 | 89.12±0.07 | 82.78±0.16 | 82.33 |
| LoRA | 0.27% | 65.85±1.49 | 84.46±1.04 | 88.73±0.08 | 57.58±0.78 | 92.06±0.38 | 90.62±0.22 | 89.41±0.04 | 83.00±0.07 | 81.46 |
| Serial | 0.81% | 68.01±1.34 | 84.75±0.45 | 88.61±0.11 | 59.73±0.62 | 91.93±0.33 | 91.06±0.12 | 90.52±0.05 | 84.18±0.22 | 82.35 |
| AdaMix | 0.81% | 70.11±0.62 | 86.86±1.12 | 89.12±0.11 | 59.11±1.00 | 92.06±0.22 | 91.52±0.15 | 90.22±0.04 | 84.25±0.14 | 82.91 |
| UniPELT | 1.25% | 67.07±1.82 | 84.22±0.78 | 88.84±0.11 | 60.13±0.46 | 92.52±0.24 | 91.09±0.13 | 90.69±0.11 | 84.28±0.18 | 82.35 |
| Parallel | 6.46% | 68.52±3.44 | 86.52±0.96 | 88.90±0.28 | 58.72±1.69 | 92.13±0.35 | 90.83±0.22 | 90.74±0.08 | 73.93±19.24 | 81.29 |
| MAM | 6.97% | 69.10±1.76 | 87.16±0.74 | 89.01±0.48 | 47.87±23.97 | 83.94±16.52 | 90.85±0.22 | 90.76±0.05 | 83.31±0.17 | 80.25 |
| AutoPEFTRTE | 0.76% | 72.20±0.72 | 87.16±0.83 | 88.77±0.07 | 60.30±1.24 | 92.22±0.30 | 90.90±0.10 | 90.37±0.06 | 83.46±0.21 | 83.17 |
| AutoPEFTAvg.task | % | 72.35±0.94 | 87.45±0.87 | 89.17±0.24 | 60.92±1.47 | 92.22±0.30 | 91.12±0.13 | 90.64±0.05 | 84.01±0.10 | 83.49 |
Table 2 specifies the composition of the found configuration, indicating the exact active layers and showing that more of the parameter budget is allocated to the efficient and effective PA module. On average, the AutoPEFTRTE configuration achieves fine-tuning performance comparable to FFT (83.17 vs. 83.15) while only updating 0.76% of the parameters. With strong transferability across similar tasks, AutoPEFT provides distinct advantages in parameter efficiency; the search algorithm itself, coupled with this transfer, also becomes more sample-efficient under limited training resources.
Extending AutoPEFT to More Tasks.
We next ‘stress-test’ the ability of the AutoPEFT-found configuration in a more challenging scenario, experimenting on a completely new set of dissimilar tasks. Table 3 reports the results of transferring AutoPEFTRTE from Table 1 to four SuperGLUE tasks. In terms of parameter efficiency, we observe patterns consistent with Table 1, where our plug-and-play PEFT configuration outperforms existing PEFT baselines by a substantial margin (2%) on average while being comparable to the costly full-model FT.2 In terms of search cost, we recall that through the use of the low-fidelity proxy and the strong transferability, AutoPEFTRTE in Table 1 only requires an additional, one-off cost of 1.9% of the training time (or, equivalently, the number of fine-tuning steps) of single-seed training on the GLUE training sets. Furthermore, Figure 5 demonstrates the robustness of our framework to the choice of the source task to search on. Therefore, our framework is task-agnostic with a cheap one-time cost, yet yields ‘permanent’ improvements on all efficiency metrics for PEFT: space, time, and memory.
Table 3: Transferring AutoPEFTRTE from Table 1 to four SuperGLUE tasks (mean±std).

| Method | CB | COPA | WiC | BoolQ | Avg. |
|---|---|---|---|---|---|
| FFT | 71.43±1.13 | 51.80±3.76 | 68.62±1.93 | 72.17±0.86 | 66.01 |
| LoRA | 67.14±2.42 | 55.80±1.47 | 68.56±1.11 | 69.09±0.42 | 65.15 |
| Serial | 67.86±1.13 | 54.20±7.68 | 67.34±0.61 | 70.00±0.85 | 64.86 |
| OursRTE | 71.07±2.86 | 56.40±6.83 | 68.87±1.06 | 70.86±0.89 | 66.80 |
Table 4: Selected GLUE tasks with T5base.

| Method | #Param. | RTE | MRPC | STS-B | CoLA | SST-2 | QNLI | QQP | MNLI | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| LoRA | 0.40% | 80.1 | 89.5 | 89.2 | 59.9 | 94.4 | 93.6 | 91.0 | 86.5 | 85.5 |
| Serial | 0.79% | 78.0 | 88.2 | 89.1 | 60.6 | 94.6 | 93.1 | 90.7 | 86.4 | 85.1 |
| S3PETRTE | 0.30% | 79.8 | 89.0 | 90.2 | 58.6 | 94.2 | 93.3 | 90.6 | 86.5 | 85.3 |
| AutoPEFTRTE | 0.33% | 82.7 | 89.0 | 89.6 | 61.7 | 94.6 | 93.3 | 90.8 | 86.7 | 86.1 |
Table 5: Selected GLUE tasks with RoBERTalarge.

| Method | #Param. | RTE | MRPC | STS-B | CoLA | SST-2 | QNLI | Avg. |
|---|---|---|---|---|---|---|---|---|
| FFT† | 100% | 86.6 | 90.9 | 92.4 | 68.0 | 96.4 | 94.7 | 88.2 |
| LoRA‡ | 0.22% | 85.2 | 90.2 | 92.3 | 68.2 | 96.2 | 94.8 | 87.8 |
| Serial | 0.89% | 84.8 | 90.2 | 92.0 | 66.8 | 96.3 | 94.7 | 87.5 |
| AutoPEFTRTE | 0.03% | 88.1 | 89.5 | 92.3 | 67.0 | 96.0 | 94.6 | 87.9 |
| AutoPEFTAvg.task | % | 88.1 | 92.2 | 92.4 | 70.6 | 96.8 | 94.6 | 89.1 |
Per-Task Search.
We further conduct full-resource per-task AutoPEFT searches. While naturally more expensive, we argue that this setup is useful if, for example, one is interested in finding the absolute best configuration for a particular task and where search cost is less of a concern. Due to computational constraints, we search per-task on RTE, MRPC, STS-B, and CoLA, then port the small set of best configurations to the remaining higher-resource tasks (SST-2, QNLI, QQP, MNLI). We observe consistent gains on all tasks we search on over the best-performing PEFT baselines, e.g., MRPC (87.16% (best baseline) to 87.45% (ours)) and CoLA (60.13% to 60.92%), and also over the transferred configuration AutoPEFTRTE in Table 1. One interpretation is that while configurations are highly transferable, the optimal configurations may nonetheless differ slightly across tasks, such that transferred AutoPEFT configurations (e.g., the one reported in Table 1) perform well, but searching per-task performs best. Crucially, we also find per-task AutoPEFT in this setup to even outperform FFT, despite only using 1.4% of all parameters, except for the high-resource tasks where we mostly perform on par; this is consistent with our observation that, due to the richness of training resources, performance on these tasks is mostly saturated and PEFT methods, like the baselines, at most achieve on-par performance with FFT.
Analyzing the ‘Behavior’ of BO and the Discovered Configurations.
Figure 7 shows the distribution of AutoPEFT-found configurations when we conduct its search experiment on RTE. Recalling that the search strategy (§2.2) starts with random initialization, we compare the behaviors of the random explorations and the BO-suggested configurations: Whereas the random search baseline is purely exploratory and discovers less parameter-efficient configurations, BO succeeds in discovering configurations towards the regions with improved parameter efficiency. The superiority of BO over the random search baseline is further demonstrated quantitatively by Figure 8 where we compare the evolution of the hypervolume, which measures the size of the space enclosed by the Pareto front over a reference point (set to the nadir point of the optimization trajectory) (Zitzler and Thiele, 1998), discovered by BO and random search as a function of the number of configurations evaluated; it is clear that as optimization proceeds, BO finds a better Pareto set with a better trade-off between performance and cost in the end. BO eventually discovers a rich family of PEFT configurations across a wide range of parameters, whereas previous approaches typically fail to explore the entire PF. This is a critical strength motivating our BO search strategy.
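To make the hypervolume comparison concrete, the generic utility below (not code from the paper) extracts the 2-D Pareto front from a set of (score, parameter-fraction) observations and computes its hypervolume with respect to a reference point:

```python
def pareto_front(points):
    """Points are (score, cost); keep those not dominated (higher score, lower cost)."""
    front = []
    for s, c in points:
        if not any(s2 >= s and c2 <= c and (s2, c2) != (s, c) for s2, c2 in points):
            front.append((s, c))
    return sorted(front, key=lambda p: p[1])

def hypervolume(front, ref):
    """Area dominated by the front w.r.t. a reference point (worst score, worst cost)."""
    ref_s, ref_c = ref
    hv, prev_c = 0.0, ref_c
    for s, c in sorted(front, key=lambda p: -p[1]):   # from most to least costly
        hv += (s - ref_s) * (prev_c - c)
        prev_c = c
    return hv

obs = [(70.1, 0.8), (71.5, 1.6), (69.0, 0.4), (71.0, 0.6), (72.2, 3.2)]  # (accuracy, %params)
front = pareto_front(obs)
print(front, hypervolume(front, ref=(65.0, 4.0)))
```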
Figure 6, on the other hand, visualizes the discovered sets on different tasks: we observe that within the Pareto-optimal configuration set of the same task, some layers are consistently enabled (e.g., Layer 2 in CoLA) whereas some are consistently disabled (e.g., Layer 1 across all tasks) even under very different cost budgets; this suggests that PEFT modules in different layers are not equally important, and by selectively enabling them, AutoPEFT is capable of making better use of the parameter budget by allocating it to the more beneficial Transformer layers only. This unanimous preference for (or against) certain layers extends even across tasks, which is unlikely to stem from randomness alone: For example, we find that Layers 2 and 10 are enabled in 71.2% and 69.2% of all Pareto-optimal configurations over all tasks, whereas Layers 1 and 12 are enabled only 7.7% and 13.4% of the time, respectively. We also observe that across all tasks, a common trend is that serial adapters and prefix-tuning are universally preferred in low-budget ranges, and parallel adapters are only enabled when we have a more lenient budget allowance; these commonalities in high-performing configurations may, to some extent, account for the strong transferability of the discovered configurations, as shown in Figure 5.
Ablation of the Configuration Space.
To provide a finer-grained analysis of the factors that have a positive impact on AutoPEFT, we ablate the AutoPEFT search space from the full configuration space: 1) to the basic enumeration of the bottleneck size DSA of the SA only (the SA space); and 2) to a naïve baseline where, instead of searching over each dimension independently, we vary a single, common coefficient that generates a family of configurations of different sizes by scaling down from the largest PEFT configuration in our search space (SA-PA-PT) over DSA, DPA, and LPT. We then include the Transformer layer and the SA size in the search space (the SA-Layer space) to validate the usefulness of layer selection as one configuration dimension. We can then also expand the search space by adding another module (e.g., adding PA yields the SA-PA-Layer space). Figure 9 plots the performance over the ablated configuration spaces and different parameter budgets. Several key findings emerge. First, combining multiple single PEFT modules has a positive impact on AutoPEFT in general (cf. full AutoPEFT vs. SA-PA-Layer vs. SA-Layer). Second, simply scaling all search dimensions by a common scaling factor is suboptimal. This is likely because not all parameters are equally important, necessitating a configuration search. Relying on layer selection also brings benefits (cf. SA vs. SA-Layer): the comparison indicates that leaving out Transformer layers while increasing the capacity of the PEFT module is a straightforward way to improve the parameter efficiency and task performance of a PEFT module within a fixed parameter budget. The ablation results also demonstrate that AutoPEFT is search-space-agnostic, capable of effectively operating over configuration spaces of different granularity.
Layer Selection.
The ability to disable some PEFT layers altogether is a key novelty of the AutoPEFT search space. To further compare different layer selection approaches, we conduct a controlled experiment with the SA module on BERTlarge (24 Transformer layers) under a predefined parameter budget. In Table 6, we compare against AdapterDrop, which simply drops the adapters from the first 11 layers while doubling their bottleneck sizes, and, within the same architecture, we also include the Adaptable Adapter with layers selected by switch learning (3 and 10 layers from the first 12 and the remaining 12 layers, respectively). We show that AutoPEFT outperforms the existing layer selection baselines while activating fewer PEFT layers, leading to better parameter efficiency (12.5% fewer parameters in relative terms) yet better performance. This indicates that selecting the best insertion layers is non-trivial and that AutoPEFT can efficiently learn the correlations between layers.
Table 6: Layer selection on RTE with the SA module on BERTlarge under a predefined parameter budget (mean±std).

| Method | #Layers | Size DSA | RTE |
|---|---|---|---|
| Serial | 24 | 64 | 72.56±0.76 |
| Adaptable Adapter | 13 | 128 | 73.36±0.80 |
| AdapterDrop | 13 | 128 | 73.50±1.40 |
| AutoPEFTLayerSA | 10 | 128 | 73.86±0.94 |
6 Conclusion
We proposed AutoPEFT, a novel search framework for automatically configuring parameter-efficient fine-tuning (PEFT) modules of pretrained language models. AutoPEFT features both a large, expressive, newly designed configuration search space and an effective search method based on Bayesian optimization that discovers a Pareto-optimal set of novel PEFT configurations with promising performance-efficiency trade-offs. Empirically, we demonstrated that AutoPEFT-discovered configurations transfer strongly across different GLUE and SuperGLUE tasks, outperforming various strong PEFT baselines and being competitive with full-model fine-tuning.
Limitations and Future Work
AutoPEFT search inevitably incurs a search cost since it requires iterative optimization at search time. However, we mitigate this by (i) using a low-fidelity proxy of 1-epoch training and (ii) leveraging strong transferability by generalising from low-resource and, thus, quick-to-train tasks. While the search itself can be seen as a one-time cost yielding a permanent, well-performing, and shareable configuration for particular tasks, we plan to delve deeper into further optimizing the search cost in future work.
Furthermore, while we conduct extensive experiments on a search space that contains three existing PEFT modules as building blocks, novel PEFT modules may emerge. However, the AutoPEFT framework is general, so such forthcoming modules can be easily integrated. We defer thorough investigations to future work.
Acknowledgments
Han Zhou is supported by the UK Research and Innovation (UKRI) Frontier Research Grant EP/Y031350/1 (the UK government’s funding guarantee for ERC Advanced Grants) awarded to Anna Korhonen at the University of Cambridge. Xingchen Wan is supported by the Clarendon Scholarship at the University of Oxford. The work has been supported in part by a Royal Society University Research Fellowship (no 221137; 2022-) awarded to Ivan Vulić, and by the UK EPSRC grant EP/T02450X/1. We thank TACL editors and anonymous reviewers for their constructive feedback that enabled us to strengthen our work.
Notes
1. The PA module also acts as the low-rank reparameterisation of the learned SA and the frozen FFN layer to further match the intrinsic dimensionality of the target task.
2. With the AutoPEFT-found off-the-shelf configuration, this requires no additional search cost and enables a more efficient and effective tuning approach for new tasks.
References
Author notes
Equal contribution.
Action Editor: Jacob Eisenstein