AutoPEFT: Automatic Configuration Search for Parameter-Efficient Fine-Tuning

Large pretrained language models are widely used in downstream NLP tasks via task-specific fine-tuning, but such procedures can be costly. Recently, Parameter-Efficient Fine-Tuning (PEFT) methods have achieved strong task performance while updating far fewer parameters than full model fine-tuning (FFT). However, it is non-trivial to make informed design choices on the PEFT configurations, such as their architecture, the number of tunable parameters, and even the layers in which the PEFT modules are inserted. Consequently, it is highly likely that the current, manually designed configurations are suboptimal in terms of their performance-efficiency trade-off. Inspired by advances in neural architecture search, we propose AutoPEFT for automatic PEFT configuration selection: We first design an expressive configuration search space with multiple representative PEFT modules as building blocks. Using multi-objective Bayesian optimization in a low-cost setup, we then discover a Pareto-optimal set of configurations with strong performance-cost trade-offs across different numbers of parameters that are also highly transferable across different tasks. Empirically, on GLUE and SuperGLUE tasks, we show that AutoPEFT-discovered configurations significantly outperform existing PEFT methods and are on par or better than FFT without incurring substantial training efficiency costs.


Introduction and Motivation
Pretrained language models (PLMs) are used in downstream tasks via the standard transfer learning paradigm, where they get fine-tuned for particular tasks (Devlin et al., 2019; Liu et al., 2019b). This achieves state-of-the-art results in a wide spectrum of NLP tasks, becoming a prevalent modelling paradigm in NLP (Raffel et al., 2020a). Fine-tuning the PLMs typically requires a full update of their original parameters (i.e., the so-called full-model fine-tuning (FFT)); however, this is (i) computationally expensive and also (ii) storage-wise expensive, as it requires saving a separate full model copy for each task-tuned model. With the ever-growing size of PLMs (Brown et al., 2020; Sanh et al., 2022), the cost of full-model FT becomes a major bottleneck, due to its increasing demands as well as computational (time and space) non-efficiency.
Parameter-efficient fine-tuning (PEFT) delivers a solution for alleviating the issues with full-model FT (Houlsby et al., 2019). By freezing the majority of pretrained weights of PLMs, PEFT approaches only update a small portion of parameters to efficiently adapt the PLM to a new downstream task. Recent studies have shown that, compared with traditional FFT, PEFT can achieve competitive task performance while being modular, adaptable, and preventing catastrophic forgetting (Wang et al., 2022; Pfeiffer et al., 2023).
Recent developments have created diverse PEFT modules with distinctive characteristics (Pfeiffer et al., 2020b; Li and Liang, 2021), typically with one of two main aims in focus: 1) improve task performance over other PEFT approaches while maintaining the same parameter budget as the competitor PEFT methods; or 2) maintain task performance while reducing the parameter budget needed. Existing PEFT modules, optimising for one of the two aims, have been successfully applied to transfer learning tasks (Chen et al., 2022b; Pfeiffer et al., 2022). However, different tasks, with different complexity, show distinct sensitivity to the allocated parameter budget and even to the chosen PEFT approach (He et al., 2022). At the same time, most PEFT applications are limited to a single PEFT architecture (e.g., serial adapters, prefix-tuning) with fixed decisions on its components (e.g., hidden size dimensionality, insertion layers), resulting in potentially suboptimal PEFT configurations across many tasks. Therefore, in this work, we propose a new, versatile and unified framework that automatically searches for improved and task-adapted PEFT configurations, aiming to effectively balance the two (often colliding) goals of (i) improving performance and (ii) keeping the desired low parameter budget for PEFT.
While recent research has started exploring more dynamic PEFT configurations, prior studies remain limited across several dimensions, including how they define the configuration search space. Namely, they typically focus only on a single PEFT architecture (e.g., adapters) or simple combinations thereof, or on a single property (e.g., the insertion layers, i.e., where to insert the module); see a short overview in §3. Here, we propose a unified and more comprehensive framework for improved configuration search. It covers multiple standard PEFT modules (serial adapters, parallel adapters, and prefix-tuning) as building blocks, combined with the critical parameter budget-related decisions: the size of each constituent module and the insertion layers for the modules.
Our defined comprehensive search space is huge; consequently, traversing it effectively and efficiently is extremely challenging. To enable search over the large configuration space, we thus propose the novel AUTOPEFT framework. It automatically configures multiple PEFT modules along with their efficiency-oriented design decisions, relying on a high-dimensional Bayesian optimisation (BO) approach. Crucially, within this search space, we propose a multi-objective optimisation approach which learns to simultaneously balance maximising the searched configurations' task performance and their parameter efficiency.
We conduct extensive experiments on the standard GLUE and SuperGLUE benchmarks (Wang et al., 2018, 2019), with encoder-only and encoder-decoder models. We first study the transferability of the AUTOPEFT-searched architecture by running AUTOPEFT on a single task with a low-fidelity proxy (aiming to reduce computational cost), followed by transferring the found architecture to other tasks. Experimental results show that this architecture can outperform existing PEFT baselines while achieving on-par performance with standard FFT. Further slight gains can be achieved with a larger computation budget for training, where we run AUTOPEFT per task to find a task-adapted PEFT configuration. As revealed in Figure 1, AUTOPEFT can find configurations that offer a solid trade-off between task performance and parameter efficiency, even outperforming FFT. We also provide ablation studies over the search space, validating that the AUTOPEFT framework is versatile and portable to different search spaces.

Contributions. 1) We propose the AUTOPEFT search space, containing diverse and expressive combinations of PEFT configurations, with three representative PEFT modules as foundational building blocks and the binary decisions concerning the Transformer layers in which to insert these modules as searchable dimensions. 2) To navigate the vast AUTOPEFT search space and to discover a set of transferable PEFT configurations that optimally trade performance against cost across various parameter ranges in a single run, we further propose an effective search method based on multi-objective Bayesian optimisation. 3) We demonstrate that the one-time search cost of AUTOPEFT is low, and that AUTOPEFT yields task-shareable configurations, outperforming existing PEFT modules while being transferable across tasks. The AUTOPEFT framework can also be easily extended to other, new PEFT modules. The code is available at https://github.com/cambridgeltl/autopeft.

Designing the AUTOPEFT Search Space
Inspired by the success of neural architecture search (NAS) methodology (Ru et al., 2020), we similarly start by designing a large and expressive configuration space. We additionally provide the motivation behind each decision to include a particular module and its components in the configuration space, along with a mathematical formulation.

Figure 2: Illustration of the AUTOPEFT search space, which combines both layer-level (Layers) and within-layer (Serial, Parallel, Prefix) search, and the connections within a layer (Left). We further show two possible configurations in the search space (Right): note that some PEFT layers can be inactive altogether, and that the searchable module sizes (shaded in green), i.e., the bottleneck sizes in Serial and Parallel (D_SA and D_PA, respectively) and the sizes of P_K, P_V in Prefix (L_PT), are dynamic.
The search space is known to be one of the most important factors in the performance of the configurations to be discovered subsequently (Ru et al., 2020; Xie et al., 2019; Li and Talwalkar, 2019; Dong and Yang, 2020; Yang et al., 2020). In order to simultaneously maximise task performance and parameter efficiency, it is necessary to first define a 'parameter-reducible' search space, where each dimension within the space potentially contributes to reducing the parameter budget. Similarly, each dimension potentially impacts performance positively without introducing redundancy into the space (Wan et al., 2022). Therefore, we propose the following search space with representative PEFT modules spanning a plethora of (non-redundant) configurations, as illustrated in Figure 2.

PEFT Modules. Inspired by common practices in NAS of using known well-performing modules as building blocks, we include three distinctive PEFT designs to efficiently adapt different forwarding stages of hidden states in the PLM layers. We combine Serial Adapters (SA), Parallel Adapters (PA), and Prefix-Tuning (PT) as the three representative modules, i.e., the building blocks of the search space, where the PT module adapts the multi-head attention layer, and SA and PA interact with the FFN layer (Figure 2). Each configuration makes a decision on the PEFT modules in each insertion layer: all of them can be turned on or off. We combine this binary decision with the actual non-binary decision on the module size (see next), so that a value of 0 in fact denotes the absence of the module in the layer(s). We note that other PEFT modules such as LoRA (Hu et al., 2022a) are scaled variants of PA with the same insertion form (He et al., 2022). As we empirically validate later, the resultant search space spanned by the selected building blocks is extremely expressive and flexible and enables the discovery of configurations that outscore any of the individual building blocks and other PEFT modules.
Size. Previous studies show that PEFT methods are highly sensitive to the number of tunable parameters: adaptively setting their capacity in accordance with the target task is therefore essential for achieving good performance (Chen et al., 2022a). The number of tunable parameters depends on each particular module. The additional parameters introduced by both SA and PA are dominated by their bottleneck dimension D. Similarly, the size of the PT module is defined by its prefix length L_PT. Thus, we define a binary logarithmic search scale for the respective discrete sets of values of D_SA, D_PA, and L_PT, spanning the values from 0 (absence of the module) to D_h, where D_h is the dimensionality of the output embedding of the PLM (e.g., D_h = 768 for BERT base).
Insertion Layers. Prior work has also shown that different layers in the PLMs store different semantic information (Vulić et al., 2020), where the higher layers produce more task-specific and contextualized representations (Tenney et al., 2019). Therefore, as another configuration dimension, we aim to search for the minimal number and the actual position of layers in which to insert the PEFT modules. We define a binary 'insertion' decision at each layer l_i.
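For concreteness, a single point in this search space can be represented as a fixed-length vector: one binary insertion decision per Transformer layer plus the three shared size dimensions. The following minimal sketch is our own illustrative encoding (not taken from the released code) of such a configuration.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AutoPEFTConfig:
    """One point in the AUTOPEFT search space (illustrative encoding)."""
    insert_layer: List[bool]  # one binary insertion decision per Transformer layer
    d_sa: int                 # Serial Adapter bottleneck size D_SA (0 = module absent)
    d_pa: int                 # Parallel Adapter bottleneck size D_PA (0 = module absent)
    l_pt: int                 # Prefix-Tuning prefix length L_PT (0 = module absent)

    def to_vector(self) -> List[int]:
        """Flatten into the fixed-length vector a BO surrogate could consume."""
        return [int(b) for b in self.insert_layer] + [self.d_sa, self.d_pa, self.l_pt]

# Example: modules inserted only in the top half of a 12-layer PLM.
cfg = AutoPEFTConfig(insert_layer=[False] * 6 + [True] * 6, d_sa=8, d_pa=64, l_pt=1)
print(cfg.to_vector())
```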
Combining PEFT Modules. The SA module and the PA module share a bottleneck architecture. The SA receives the hidden states from the FFN output as its input, adapting them with a down-projection matrix $W^{SA}_{down} \in \mathbb{R}^{D_h \times D_{SA}}$, followed by a non-linear activation function $f(\cdot)$ and an up-projection matrix $W^{SA}_{up} \in \mathbb{R}^{D_{SA} \times D_h}$:

$$h_{SA} = f(h\,W^{SA}_{down})\,W^{SA}_{up} + h, \quad \text{where } h = F(x). \tag{1}$$

The PA, on the other hand, receives its input from the hidden states before the FFN layer with the same formulation:

$$h_{PA} = f(x\,W^{PA}_{down})\,W^{PA}_{up}. \tag{2}$$

Therefore, it is able to act in parallel with the SA without interference. Note that the FFN hidden states h = F(x) contain the task-specific bias learned in its pretrained weights. Therefore, by combining SA with PA, the following composition of functions is achieved:

$$h_{SAPA} = f(F(x)\,W^{SA}_{down})\,W^{SA}_{up} + F(x) + f(x\,W^{PA}_{down})\,W^{PA}_{up}. \tag{3}$$

The final composition should adapt effectively to both the bias-influenced hidden states and the original inputs before the pretrained FFN layer. Further, applying PEFT modules to interact with both the FFNs and the multi-head attention should positively impact task performance (Mao et al., 2022; He et al., 2022). PT learns two prefix vectors, $P_k, P_v \in \mathbb{R}^{L_{PT} \times D_h}$, that are concatenated with the original multi-head attention's key and value vectors, which efficiently adapts the multi-head attention layer to fit the target task. Thus, we finally combine the SA and the PA (i.e., SAPA from above) with PT.
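To make the composition in Eqs. (1)-(3) concrete, below is a minimal PyTorch-style sketch of how a Serial Adapter and a Parallel Adapter might jointly wrap a pretrained FFN block. Module and variable names are our own; layer normalisation, dropout, scaling factors and the PT module are omitted. It illustrates the equations rather than the released implementation.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-projection, non-linearity, up-projection."""
    def __init__(self, d_model: int, bottleneck: int):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.act(self.down(x)))  # f(x W_down) W_up

class SAPABlock(nn.Module):
    """Wraps a pretrained FFN F(x) with a Serial (SA) and a Parallel (PA) adapter."""
    def __init__(self, ffn: nn.Module, d_model: int, d_sa: int, d_pa: int):
        super().__init__()
        self.ffn = ffn
        for p in self.ffn.parameters():          # pretrained weights stay frozen under PEFT
            p.requires_grad_(False)
        self.sa = Adapter(d_model, d_sa) if d_sa > 0 else None
        self.pa = Adapter(d_model, d_pa) if d_pa > 0 else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.ffn(x)                          # h = F(x)
        out = h
        if self.sa is not None:                  # Eq. (1): adapt the FFN output
            out = out + self.sa(h)
        if self.pa is not None:                  # Eq. (2): adapt the pre-FFN hidden states
            out = out + self.pa(x)
        return out                               # Eq. (3): SA(F(x)) + PA(x)

# Toy usage with a BERT-base-sized FFN.
ffn = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))
block = SAPABlock(ffn, d_model=768, d_sa=8, d_pa=64)
y = block(torch.randn(2, 16, 768))
```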
In sum, an overview of the dimensions spanning the final configuration space is provided in Figure 2. The combination of the different 'configuration dimensions' outlined above gives rise to a total of, e.g., 5,451,776 possible configurations with BERT base and ∼3×10^10 configurations with RoBERTa large. While a large search space is crucial for expressiveness and to ensure that well-performing configurations are contained, it also increases the difficulty for search strategies to navigate the search space well while remaining sample-efficient and thus computationally efficient. Furthermore, in the PEFT setting, we are also often interested in discovering a family of configurations that trade off performance against efficiency for general application in various scenarios with different resource constraints, thus giving rise to a multi-objective optimisation problem in which we simultaneously aim to maximise performance while minimising cost. In what follows, we propose a search framework that satisfies all those criteria.
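The configuration counts quoted above follow directly from the dimensions described in this section: one binary insertion decision per Transformer layer plus three shared size dimensions on a logarithmic grid. The following back-of-the-envelope check assumes 11 discrete values per size dimension (including 0) for BERT base and 12 for RoBERTa large; the exact grids are not spelled out in the text, but these counts reproduce the quoted numbers.

```python
# Back-of-the-envelope check of the search-space sizes quoted above.
# Assumption: each of the three size dimensions (D_SA, D_PA, L_PT) takes one of
# `n_sizes` discrete values on a binary-log grid that includes 0 (module absent).
def search_space_size(n_layers: int, n_sizes: int) -> int:
    layer_choices = 2 ** n_layers      # binary insertion decision per layer
    size_choices = n_sizes ** 3        # shared D_SA, D_PA and L_PT choices
    return layer_choices * size_choices

print(search_space_size(n_layers=12, n_sizes=11))  # 5,451,776 for BERT base
print(search_space_size(n_layers=24, n_sizes=12))  # ~2.9e10 for RoBERTa large
```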

Pareto-Optimal Configuration Search
Multi-objective Optimisation Formulation. The ultimate goal of AUTOPEFT is to discover promising PEFT configuration(s) from the expressive search space designed in §2.1, which is itself challenging. In this paper, we focus on an even more challenging but practical goal: instead of aiming to find a single, best-performing PEFT configuration, we aim to discover a family of Pareto-optimal PEFT configurations that trade performance against parameter efficiency (or parameter cost) optimally. One of the most impactful use cases of PEFT is its ability to allow fine-tuning of massive language models even with modest computational resources, and thus we argue that searching for Pareto-optimal configurations is key, as it allows tailored user- and scenario-specific PEFT deployment depending on the computational budget.

Algorithm 1: Overall AUTOPEFT search pipeline.
1: Input: number of random initialisation points N_0, maximum number of configuration evaluations N > N_0, AUTOPEFT search space A.
2: Output: a set of Pareto-optimal configurations A*.
3: Initialise by randomly sampling N_0 configurations a ∼ A and fine-tuning the PLM to obtain f(·) of the corresponding configurations; fit a SAAS-GP model on D_0.
4: for n = N_0, ..., N do
5:   Select the next configuration(s) to evaluate, a_n, by maximising the NEHVI acquisition function: a_n = argmax_{a ∈ A} α(a | D_{n−1}).
6:   Fine-tune the PLM with the candidate configuration(s) a_n (possibly with low-fidelity estimates) to obtain f(a_n). // Inner-loop optimisation in Eq. (4).
7:   Augment the observation data D_n ← D_{n−1} ∪ {(a_n, f(a_n))} and update the SAAS-GP model.
8: end for
9: Return the set of non-dominated configurations A*.

Formally, denoting the full AUTOPEFT search space as A and a single configuration a ∈ A with trainable weights W, and, without loss of generality, assuming our objective is to (i) maximise a performance metric f(a, W) (e.g., the accuracy on the dev set) and (ii) minimise a cost metric g(a) (e.g., the number of parameters in a), a search method aims to solve the bi-level, bi-objective optimisation problem:

$$\max_{a \in \mathcal{A}} \big[ f(a, W^*(a)),\; -g(a) \big] \quad \text{s.t.} \quad W^*(a) = \arg\min_{W} \mathcal{L}_{\text{train}}(a, W), \tag{4}$$

where the inner-loop optimisation problem is the optimisation of the configuration weights, achieved by fine-tuning the configuration a itself over the training loss L_train. Given the bi-objective nature of the problem, there is, in general, no single maximiser of Eq. (4) but a set of Pareto-optimal configurations A* = {a*_1, ..., a*_{|A*|}} that are non-dominated: we say that a configuration a dominates another a′ if it is no worse than a′ in both objectives and strictly better in at least one. Denoting the vector-valued objective as F(a) = [f(a, W*(a)), −g(a)]^⊤, the set of Pareto-optimal architectures A* are those that are mutually non-dominated, and the Pareto front (PF) P* is the image of the Pareto set of architectures, P* = {F(a) : a ∈ A*}.

Bayesian Optimisation (BO). To solve Eq. (4), we adopt a BO approach, illustrated in Figure 3. On a high level, BO consists of a surrogate model that sequentially approximates the objective function based on the observations so far, and an acquisition function that is optimised at each iteration to actively select the next configuration to evaluate. Typically, the surrogate model is a Gaussian process (GP), a flexible and non-parametric model with well-principled and closed-form uncertainty estimates: given an observed set of n configurations and their evaluated performance, D_n = {(a_i, f(a_i))}_{i=1}^n, the GP surrogate model gives a closed-form posterior distribution P(f(a) | D_n) over the true, unobserved function values f, potentially over configurations that have not been evaluated before. The acquisition function α : A → R, on the other hand, uses the posterior distribution of the surrogate model to assign a utility value to possible configuration candidates in A, typically balancing exploitation (i.e., querying near configurations in {a_i}_{i=1}^n that were previously observed to be strong) and exploration (i.e., querying configurations far from {a_i}_{i=1}^n, about which we have no knowledge and which can potentially be even better). At each step of BO, the acquisition function is optimised (note that while evaluating f(a) is expensive, evaluating α(a|D_n), which only uses the posterior distribution from the surrogate model, is not) to select the next configuration (or batch of configurations) a_{n+1} = argmax_{a ∈ A} α(a | D_n) to evaluate. For a detailed overview of BO, we refer the reader to Garnett (2023) and Frazier (2018).
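Returning to the non-dominance criterion above, a minimal Python sketch of extracting the Pareto-optimal subset A* from a set of evaluated configurations (performance to be maximised, cost to be minimised) is shown below; the numbers in the toy example are made up purely for illustration, and a naive O(n²) filter is used rather than a library routine.

```python
from typing import List, Tuple

def pareto_optimal(points: List[Tuple[float, float]]) -> List[int]:
    """Return indices of non-dominated points.

    Each point is (performance, cost); performance is maximised, cost minimised.
    A point p dominates q if p is no worse in both objectives and strictly
    better in at least one.
    """
    def dominates(p, q):
        return (p[0] >= q[0] and p[1] <= q[1]) and (p[0] > q[0] or p[1] < q[1])

    return [
        i for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]

# Toy example: (dev accuracy, % of trainable parameters) per evaluated configuration.
configs = [(70.5, 0.10), (71.8, 0.29), (71.6, 0.50), (72.4, 0.76), (72.3, 1.40)]
print(pareto_optimal(configs))  # -> [0, 1, 3]
```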
Rationales for Using BO. We argue that BO is well-suited to the task in principle and has various advantages over alternative, viable approaches such as those based on differentiable NAS (DARTS) (Liu et al., 2019a), which typically utilise a continuous relaxation of the discrete configurations, thereby allowing a to be jointly optimised with the model weights W in Eq. (4) via a supernet. First, unlike the DARTS-based approach, by treating the optimisation problem defined in Eq. (4) as a black box, BO decouples the optimisation of the weights W from the optimisation of the architecture a, and solves the latter problem with no gradient information at all (White et al., 2021; Ru et al., 2021). This makes a BO-based solution more parallelisable and more amenable to a distributed setup, which modern large PLMs often rely on, as multiple configuration evaluations may take place simultaneously on different client machines as long as they can relay the evaluation results f back to a central server running the BO. This further contributes to memory efficiency: unlike the DARTS-based method that optimises a supernet (a heavily over-parameterised network that can be deemed a weighted superposition of all configurations in A), each parallel evaluation in BO trains a single configuration only; we argue that this point is particularly important for PEFT given its main promise of parameter efficiency.
Second, as discussed, it is often desirable to discover a family of configurations with different trade-offs between performance and parameters for different application scenarios. As we will show, BO generalises elegantly to handle vector-valued objective functions and may generate a PF of configurations in a single run, whereas competing methods, such as supernet-based NAS methods, typically require a scalar objective function and are thus limited to discovering a single best-performing configuration (Eriksson et al., 2021; Izquierdo et al., 2021); this means that one typically needs to run the NAS pipeline multiple times for different cost budgets with these methods.
Lastly, while one of the main arguments favouring differentiable techniques is their lighter computational expense, as one only needs to train the supernet once rather than repeatedly training different candidate configurations, the sample-efficient nature of BO and the strong transferability of the discovered configurations also ensure that the computational cost of our proposed method remains tractable. As we will show in §4, while DARTS-based NAS is indeed a plausible approach for PEFT configuration search, our approach performs competitively with S3PET (Hu et al., 2022b), a DARTS-based method.
Adapting BO to the AUTOPEFT Task. Adapting BO to the high-dimensional and combinatorial AUTOPEFT search space is non-trivial. To address the challenges, we customise both components of BO; the overall pipeline is shown in Algorithm 1. Instead of a standard GP, we propose to use a Gaussian process with sparse axis-aligned subspaces (SAAS-GP) (Eriksson and Jankowiak, 2021) as the surrogate model. As an intuitive explanation, SAAS-GP places strong, sparsity-inducing priors on the GP hyperparameters to alleviate the difficulty of modelling high-dimensional data, by assuming that, despite the high nominal dimensionality, some search dimensions contribute much more significantly to the variation of the objective function than others. This assumption is shown to hold in the related problems of NAS in computer vision (Wan et al., 2022) and discrete prompt search in PLMs (Zhou et al., 2023), and we expect similar findings in our particular case.
For the acquisition function, we use the noisy expected hypervolume improvement (NEHVI) (Daulton et al., 2021) to handle the multi-objective setting: unlike the commonly used scalarisation approach that transforms the vector-valued objective function into a scalar weighted sum (which corresponds to a single point on the PF), NEHVI is capable of automatically exploring all parts of the PF in a single run. Lastly, we additionally use low-fidelity approximations, a popular low-cost performance estimation strategy in NAS (Elsken et al., 2019), to manage the search cost: at search time, instead of fine-tuning each candidate PEFT configuration in full, we only fine-tune with a much smaller number of iterations (5% of the full budget); this is possible as we are only interested in the relative ranking (rather than the performance itself) of the different configurations during search. Consistent with the NAS literature, we also find the low-fidelity estimate to provide a reliable ranking, with the best-performing configurations in low fidelity also performing the best under fine-tuning with the full number of iterations. As we will show in §5, using the low-fidelity search pipeline, in combination with the strong transferability of the discovered configurations, AUTOPEFT only incurs an additional, one-off 1.9% of the total GLUE fine-tuning cost, but delivers significant performance gains.
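For concreteness, the two BO components described above (a SAAS-GP surrogate per objective and a NEHVI acquisition function) could be instantiated in BoTorch, which we use in §4, roughly as follows. This is a simplified sketch under our own assumptions (independent SAAS GPs per objective, illustrative MCMC settings and reference point), not the released implementation, and it omits the configuration encoding, low-fidelity fine-tuning and the local acquisition search.

```python
from botorch.models import SaasFullyBayesianSingleTaskGP
from botorch.models.model_list_gp_regression import ModelListGP
from botorch.fit import fit_fully_bayesian_model_nuts
from botorch.acquisition.multi_objective import qNoisyExpectedHypervolumeImprovement

# train_x: encoded configurations (n x d, double precision recommended);
# train_obj: objectives to maximise, e.g. [accuracy, -num_params] (n x 2).
def fit_surrogate(train_x, train_obj):
    models = []
    for i in range(train_obj.shape[-1]):
        gp = SaasFullyBayesianSingleTaskGP(train_x, train_obj[:, i : i + 1])
        fit_fully_bayesian_model_nuts(gp, warmup_steps=256, num_samples=128, thinning=16)
        models.append(gp)
    return ModelListGP(*models)

def make_acquisition(model, train_x, ref_point):
    # ref_point: worst acceptable value per objective, e.g. [min_acc, -max_params].
    return qNoisyExpectedHypervolumeImprovement(
        model=model, ref_point=ref_point, X_baseline=train_x, prune_baseline=True
    )

# At each BO step, score a pool of candidate configurations (m x d) and pick the best:
# acq = make_acquisition(fit_surrogate(train_x, train_obj), train_x, ref_point=[0.5, -1e6])
# scores = acq(candidates.unsqueeze(-2))   # evaluate each candidate with q=1
# next_config = candidates[scores.argmax()]
```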

Related Work
PEFT Methods in NLP. Standard PEFT methods can be divided into two main groups (Pfeiffer et al., 2023). 1) Some methods fine-tune a small portion of the pretrained parameters (Zhao et al., 2020; Guo et al., 2021). For instance, Ben Zaken et al. (2022) propose to fine-tune the PLM's bias terms, while Sung et al. (2021) and Ansell et al. (2022) fine-tune sparse subnetworks within the original PLM for a particular task. 2) Other methods fine-tune an additional set of parameters (Liu et al., 2022). Since there is no interference with the pretrained parameters, this class of PEFT modules, besides offering strong task performance, is arguably more modular; we thus focus on this class of PEFT methods in this work. The original adapter modules (Houlsby et al., 2019; Pfeiffer et al., 2020b) have a bottleneck serial architecture which can be inserted into every Transformer layer; see Figure 2. LoRA (Hu et al., 2022a) assumes a low intrinsic rank of the target task and performs low-rank updates (Mahabadi et al., 2021). Li and Liang (2021) propose the Prefix-Tuning method, which prepends a learnable vector to the attention heads at each Transformer layer. Similarly, prompt-tuning (Lester et al., 2021) only prepends this vector to the input embedding. UniPELT (Mao et al., 2022) integrates multiple PEFT modules with a dynamic gating mechanism. He et al. (2022) provide a unified formulation of existing PEFT modules and propose a parallel adapter module, along with a combined 'Mix-and-Match Adapter (MAM)' architecture that blends parallel adapters and prefix-tuning. Wang et al. (2022) propose the mixture-of-adaptations (AdaMix) architecture with weight averaging over a mixture of adapters.
Optimising Parameter Efficiency in PEFT. Recent work further aims to optimise the parameter efficiency of existing PEFT modules while maintaining task performance. The standard approach is to insert (typically serial) adapters into all Transformer layers, which still requires a sizeable parameter budget. Rücklé et al. (2021) address this question by randomly dropping adapters from lower-level layers, displaying only a small decrease in task performance. Adaptable Adapters (AA) (Moosavi et al., 2022) generalise this idea by learning gates that switch adapters on or off in particular Transformer layers.

Neural Architecture Search (NAS) methods aim to automate the design of neural network architectures themselves, and NAS has seen great advances recently, with performance often surpassing human expert-designed architectures in various tasks (Zoph and Le, 2017; Ren et al., 2021; Elsken et al., 2019). Concerning NLP tasks and PEFT, Hu et al. (2022b) propose S3PET, which adapts Differentiable Architecture Search (DARTS) (Liu et al., 2019a) to learn the positions for inserting the PEFT modules. This work is closest in spirit to ours, and we compare to it empirically in §4. Conceptually, however, as discussed in detail in §2, we argue that our method offers a spectrum of advantages over S3PET and other related PEFT work, including but not limited to the ability to automatically discover a family of PEFT configurations across parameter budgets in a single run, better parallelisability, and memory efficiency. Other concurrent work (Valipour et al., 2023; Zhang et al., 2023) also approaches the same problem via dynamic budget allocation mechanisms on a single PEFT module within a limited search space. Nonetheless, this field still lacks a compact solution for automatically configuring a complex space of PEFT modules (Chen et al., 2023).

Experimental Setup
Evaluation Data. We follow prior PEFT research and base our evaluation on the standard and established GLUE and SuperGLUE benchmarks. For GLUE, we include 4 types of text classification tasks: linguistic acceptability (CoLA); similarity and paraphrase (STS-B, MRPC, QQP); sentiment analysis (SST-2); and natural language inference (RTE, QNLI, MNLI). We exclude WNLI following previous work (Houlsby et al., 2019; Mao et al., 2022). We also include CB, COPA, WiC, and BoolQ from SuperGLUE to further validate the transferability of AUTOPEFT-found configurations across different tasks and datasets.
Baselines. We compare the performance of the AUTOPEFT-found configurations to standard full model FT and to each individual PEFT module (SA, PA, PT) from the search space, used in the default setup from their respective original work. We also compare with the LoRA module to provide a comparison to low-rank decomposition methods. To compare with recent methods that also integrate multiple PEFT modules (see §3), we further include UniPELT and the MAM adapter in their default settings. We reproduce AdaMix for a comparison to a mixture of homogeneous adaptations. In ablations on insertion layers, we also include the Adaptable Adapter (AA) as a baseline that proposes a differentiable gate learning method to select the insertion layers for PEFT modules (i.e., originally serial adapters). On T5 models (Raffel et al., 2020a), we also compare against S3PET (Hu et al., 2022b), one of the works most similar to ours, which uses differentiable NAS for configuration search.
Implementation Details. Following previous work on the GLUE benchmark, we report the best GLUE dev set performance (Ben Zaken et al., 2022) and use 20 training epochs with an early stopping scheme of 10 epochs for all per-task experiments. We use AdapterHub (Pfeiffer et al., 2020a) as the codebase and conduct extensive experiments with the uncased BERT base (Devlin et al., 2019) as the main backbone model. We report main experiments with the mean and standard deviation over 5 different random seeds. Following Pfeiffer et al. (2020b), we use the recommended learning rate of 10^-4 for all PEFT experiments. We use a learning rate of 2×10^-5 for full model FT, following Mao et al. (2022). We use batch sizes of 32 and 16 for all BERT and RoBERTa experiments, respectively. The optimiser settings for each PEFT module follow the defaults in AdapterHub (Pfeiffer et al., 2020a). We implement the BO search algorithm in BoTorch (Balandat et al., 2020) and use the recommended settings from Eriksson and Jankowiak (2021) for the surrogate. For acquisition function optimisation, we use a local search method similar to previous literature with a similar setup (Wan et al., 2021; Eriksson et al., 2021): at each search iteration (after the initial randomly sampled points), we collect the Pareto-optimal architectures found up to this point. From this collection, we perform a local search by evaluating the acquisition function values of each architecture's neighbours and moving the current point to a neighbour with a higher acquisition function value; this process is repeated until convergence. Due to the relatively noisy nature of the problem, we use 100 random initialisation points for all experiments, followed by 100 BO iterations. We further show results using RoBERTa large (Liu et al., 2019b) in Table 5, with findings that are consistent with BERT base. In experiments with RoBERTa large as the underlying PLM, we report the RTE results with a learning rate of 2×10^-5 for AUTOPEFT_MRPC and AUTOPEFT_CoLA, and 10^-4 for AUTOPEFT_RTE. We use batch size 16 and a learning rate of 3×10^-4 for T5 base experiments with AUTOPEFT in the SAPA space, and 10^-5 for STS-B. We reproduce S3PET results with batch size 8 in the same experimental setup as AUTOPEFT.
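The local-search routine used to optimise the acquisition function over the discrete configuration space, as described above, can be sketched as follows. `acq_value` and `neighbours` are hypothetical helpers standing in for, respectively, the NEHVI evaluation and a one-step perturbation of a configuration vector (e.g., flipping one layer bit or moving one size by one grid step); the sketch follows the described greedy hill-climbing procedure rather than the released code.

```python
def hill_climb(start, acq_value, neighbours):
    """Greedy local search: move to a better neighbour until no improvement."""
    current, current_val = start, acq_value(start)
    improved = True
    while improved:
        improved = False
        for cand in neighbours(current):       # e.g. flip one layer bit or step one size
            val = acq_value(cand)
            if val > current_val:
                current, current_val, improved = cand, val, True
                break
    return current, current_val

def propose_next(pareto_configs, acq_value, neighbours):
    """Run local search from every Pareto-optimal configuration found so far
    and return the candidate with the highest acquisition value."""
    results = [hill_climb(cfg, acq_value, neighbours) for cfg in pareto_configs]
    return max(results, key=lambda r: r[1])[0]
```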

Results and Discussion
Discussion of Main Results. The main results on BERT are summarised in Table 1, where we evaluate the AUTOPEFT-found configurations searched on RTE, the most low-resource and challenging task, across the full GLUE suite. We further report selected GLUE tasks on T5 in Table 4 (where we also compare against S3PET) and on RoBERTa large in Table 5. For simplicity, we report in Table 1 a single configuration that leads to the highest task performance within a predefined, user-specified parameter budget from the discovered Pareto-optimal set, whereas the full Pareto-optimal set is evaluated in Figure 4. On BERT (Table 1), we find that, using only 0.76% of parameters, AUTOPEFT_RTE outperforms all the PEFT baselines (by more than 2% on RTE). The AUTOPEFT-found configuration also outperforms the full-model FT baseline on the RTE task by more than 1%. These results indicate the effectiveness of the AUTOPEFT framework in optimising both task performance and parameter efficiency. Transferring the RTE-based configurations to other tasks, we find that strong performance is maintained across the target tasks, with larger benefits on the medium-resource tasks (MRPC, STS-B, CoLA), while the configuration also remains competitive for higher-resource tasks (e.g., QQP, MNLI). Finally, we find the strength of AUTOPEFT to persist on RoBERTa and on T5 as a representative of the encoder-decoder model families. It is particularly noteworthy that, in addition to outperforming the baseline PEFT methods without configuration search, AUTOPEFT also performs competitively compared to S3PET with configuration search under a comparable parameter count, even though S3PET was exclusively developed and tested on the T5 search space, and even though the S3PET search space was designed with meticulous hand-tuning, where the authors manually excluded several building blocks that did not lead to empirical gains; this provides further empirical support for the strength of the BO-based search strategy described in §2.2.
Table 2 specifies the composition of the found configuration, indicating the exact task-active layers, while allocating more of the parameter budget to the efficient and effective PA module. On average, the AUTOPEFT_RTE configuration shows fine-tuning performance (83.17) comparable to FFT (83.15) while only updating 0.76% of parameters. With strong transferability across similar tasks, AUTOPEFT provides distinct advantages in parameter efficiency; the search algorithm itself, coupled with the transfer, becomes more sample-efficient within limited training resources.

Extending AUTOPEFT to More Tasks. We next 'stress-test' the ability of the AUTOPEFT-found configuration in a more challenging scenario, experimenting on a completely new set of dissimilar tasks (SuperGLUE; Table 3). With the AUTOPEFT-found off-the-shelf configuration, this requires no additional search cost and enables a more efficient and effective tuning approach for new tasks.

Per-Task Search. We further conduct full-resource per-task AUTOPEFT searches. While naturally more expensive, we argue this setup is useful if, for example, one is interested in finding the absolute best configuration for a particular task and where search cost is less of a concern. Due to computational constraints, we search per-task on RTE, MRPC, STS-B, and CoLA, then port the small set of best configurations to the remaining higher-resource tasks (SST-2, QNLI, QQP, MNLI). We observe consistent gains on all tasks we search on over the best-performing PEFT baselines, e.g., MRPC (87.16% for the best baseline vs. 87.45% for ours) and CoLA (60.13% vs. 60.92%), and also over the transferred configuration AUTOPEFT_RTE from Table 1. One interpretation is that while configurations are highly transferable, the optimal configurations may nonetheless differ slightly across tasks, such that while transferred AUTOPEFT configurations (e.g., the one reported in Table 1) perform well, searching per-task performs best. Crucially, we also find per-task AUTOPEFT in this setup to even outperform FFT, despite only using 1.4% of all parameters, except for the high-resource tasks where we mostly perform on par; this is consistent with our observation that, similar to the baselines, due to the richness of training resources, performance may be mostly saturated and PEFT methods often at most achieve on-par performance with FFT.

Analysing the 'Behaviour' of BO and the Discovered Configurations. Figure 7 shows the distribution of AUTOPEFT-found configurations when we conduct the search experiment on RTE. Recalling that the search strategy (§2.2) starts with random initialisation, we compare the behaviour of the random explorations and the BO-suggested configurations: whereas the random search baseline is purely exploratory and discovers less parameter-efficient configurations, BO succeeds in steering the discovered configurations towards regions with improved parameter efficiency. The superiority of BO over the random search baseline is further demonstrated quantitatively in Figure 8, where we compare the evolution of the hypervolume, which measures the size of the space enclosed by the Pareto front over a reference point (set to the nadir point of the optimisation trajectory) (Zitzler and Thiele, 1998). It is clear that as optimisation proceeds, BO finds a better Pareto set with a better trade-off between performance and cost in the end. BO eventually discovers a rich family of PEFT configurations across a wide range of parameters, whereas previous approaches typically fail to explore the entire PF. This is a critical strength motivating our BO search strategy. Figure 6, on the other hand, visualises the discovered sets in different tasks: we observe that within the Pareto-optimal configuration set of the same task, some layers are consistently enabled (e.g., Layer 2 in CoLA) whereas some are consistently disabled (e.g., Layer 1 across all tasks) even under very different cost budgets; this suggests that PEFT modules in different layers are not equally important, and by selectively enabling them, AUTOPEFT is capable of making better use of the parameter budget by allocating it to the more beneficial Transformer layers only. We observe that the unanimity of preference or disinclination towards certain layers extends even across tasks, which is unlikely to stem from randomness alone: for example, we found that Layers 2 and 10 are enabled in 71.2% and 69.2% of all Pareto-optimal configurations over all tasks, whereas Layers 1 and 12 are enabled only 7.7% and 13.4% of the time, respectively. We also observe that across all tasks, a common trend is that serial and prefix adapters are universally preferred in low-budget ranges, and parallel adapters are only enabled when we have a more lenient budget allowance; these commonalities in high-performing configurations may, to some extent, account for the strong transferability of the discovered configurations, as shown in Figure 5.
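For two objectives, the hypervolume used in Figure 8 reduces to the area dominated by the Pareto front above the reference point, which can be computed with a simple sweep. The sketch below assumes both objectives are to be maximised (e.g., accuracy and negated parameter count) and that the front is already mutually non-dominated; the toy numbers are illustrative only.

```python
from typing import List, Tuple

def hypervolume_2d(front: List[Tuple[float, float]], ref: Tuple[float, float]) -> float:
    """Hypervolume (area) dominated by a 2-D Pareto front w.r.t. a reference point.

    Both objectives are maximised; `front` must be mutually non-dominated.
    """
    pts = sorted(
        (p for p in front if p[0] > ref[0] and p[1] > ref[1]),
        key=lambda p: -p[1],  # descending in the 2nd objective
    )
    area = 0.0
    for i, (x, y) in enumerate(pts):
        next_y = pts[i + 1][1] if i + 1 < len(pts) else ref[1]
        area += (x - ref[0]) * (y - next_y)  # horizontal strip between y and next_y
    return area

# Toy example: a front over (accuracy, -params%) with a nadir reference point.
ref = (60.0, -5.0)
print(hypervolume_2d([(70.0, -1.0), (72.0, -2.0)], ref))  # larger is better
```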
Ablation of the Configuration Space. To provide a finer-grained analysis of the factors that bring positive impact to AUTOPEFT, we ablate the AUTOPEFT search space from the full configuration space: 1) to the basic enumeration of the bottleneck size D_SA of the SA only (the SA space); and 2) to a naïve baseline where, instead of searching over each dimension independently, we vary a single, common coefficient that generates a family of configurations of different sizes by scaling the largest PEFT configuration in our search space (SA-PA-PT) over D_SA, D_PA and L_PT. We then include the Transformer layer and the SA size in the search space (the SA-Layer space) to validate the usefulness of layer selection as one configuration dimension. We can then also expand the search space by adding another module (e.g., adding PA yields the SA-PA-Layer space). Figure 9 plots the performance over the ablated configuration spaces and different parameter budgets. Several key findings emerge. First, combining multiple single PEFT modules has a positive impact on AUTOPEFT in general (cf. full AUTOPEFT vs. SA-PA-Layer vs. SA-Layer). Second, simply scaling all search dimensions by a common scaling factor is suboptimal. This is likely because not all parameters are equally important, necessitating a configuration search. Relying on layer selection also brings benefits (cf. SA vs. SA-Layer). The comparison indicates that leaving out some Transformer layers while increasing the capacity of the PEFT module is a straightforward method to improve the parameter efficiency and task performance of the PEFT module within a fixed parameter budget. The ablation results also demonstrate that AUTOPEFT is search space-agnostic, capable of effectively operating over configuration spaces of different granularity.
Layer Selection. The ability to disable some PEFT layers altogether is a key novelty of the AUTOPEFT search space. To further compare different layer selection approaches, we conduct a controlled experiment with the SA module on BERT large (24 Transformer layers) under a predefined parameter budget. In Table 6, we compare against AdapterDrop, which simply drops the adapters from the first 11 layers while doubling their bottleneck sizes, and, within the same architecture, we also include the Adaptable Adapter with layers selected by switch learning (3 and 10 layers selected from the first 12 and the remaining 12 layers, respectively). We show that AUTOPEFT outperforms existing layer selection baselines while activating fewer PEFT layers, leading to better parameter efficiency (12.5% fewer parameters in relative terms) yet better performance. This indicates that selecting the best insertion layers is non-trivial, and that AUTOPEFT can efficiently learn the correlation between layers.
Conclusion

We proposed AUTOPEFT, a novel search framework for automatically configuring parameter-efficient fine-tuning (PEFT) modules of pretrained language models. AUTOPEFT features both a large, expressive, newly designed configuration search space and an effective search method based on Bayesian optimisation that discovers a Pareto-optimal set of novel PEFT configurations with promising performance-efficiency trade-offs. Empirically, we demonstrated that AUTOPEFT-discovered configurations transfer strongly across different GLUE and SuperGLUE tasks, outperforming various strong PEFT baselines and being competitive with full model fine-tuning.

Limitations and Future Work
AUTOPEFT search inevitably incurs a search cost, since it requires iterative optimisation at search time. However, we mitigate this by (i) using a low-fidelity proxy of 1-epoch training and (ii) leveraging strong transferability by generalising from low-resource and thus quick-to-train tasks. While the search itself can be seen as a one-time cost yielding a permanent, well-performing, and shareable configuration for particular tasks, we plan to delve deeper into further optimising the search cost in future work. Furthermore, while we conduct extensive experiments on a search space that contains three existing PEFT modules as building blocks, novel PEFT modules may emerge. However, the AUTOPEFT framework is general, so we can easily integrate such forthcoming modules. We defer thorough investigations to future work.

Figure 1 :
Figure 1: Performance of AUTOPEFT-discovered configurations (AutoPEFT & AutoPEFT (per-task); see details in Table 1) compared to other baseline PEFT methods (markers) and full model FT that updates 100% of parameters (dashed horizontal bar), averaged across 8 GLUE tasks. Our approach achieves the best trade-off between task performance and parameter efficiency.

Figure 3 :
Figure 3: Illustration of the Pareto-optimal search with multi-objective Bayesian optimisation (BO; §2.2): The BO agent trains on the vector representations of the evaluated configurations as inputs, and on their performance under a low-fidelity setup (e.g., accuracy, obtained by fine-tuning the language model with the PEFT configuration for a small number of iterations) and cost (e.g., number of parameters) as targets. The BO agent then iteratively suggests new configurations until convergence.

Figure 4 :
Figure 4: Pareto fronts of AUTOPEFT on four tasks compared to baselines on BERT base, over varying parameter budgets. We report the single-seed task score but otherwise follow the settings in Table 1.

Figure 5 :
Figure 5: Pairwise transferability study of AUTOPEFT-discovered configurations: each row (Ours_[task]) denotes the performance of the AUTOPEFT configuration searched on [task] (e.g., RTE), evaluated on the task itself and 3 other GLUE tasks. The results suggest that AUTOPEFT performance is largely robust to the choice of which task to search on.

Figure 6 :
Figure 6: Visualisation of the BO-discovered Pareto-optimal sets of configurations A* in different tasks (i.e., the configurations on the PFs in Figure 4), in ascending order of parameter budget. layer_i denotes the binary choice of whether the PEFT module is active in the i-th layer of the PLM. The final 3 columns denote D_SA, D_PA and L_PT, respectively, and feature a range of possible values from 0 to 768.

Figure 7 :
Figure 7: The distribution of the configurations discovered via BO (orange), described in §2.2, and random search (grey) using the same total number of evaluations (200). Both searches use the same 100 random initialisation points (blue) on RTE. Note that BO-generated configurations typically have much better parameter efficiency at similar accuracy.

Figure 8 :
Figure 8: The hypervolumes of the Pareto-optimal configurations discovered by BO (orange) and random search (grey) as a function of the number of configurations evaluated.

Figure 9 :
Figure 9: The performance of AUTOPEFT with ablations of the search space on RTE with BERT base. The SA results refer to the Pfeiffer adapter (Pfeiffer et al., 2020b) with an enumeration of its bottleneck size. The Scaling results refer to the PF where smaller configurations are obtained by simply scaling the largest configuration in A over all search dimensions. We report the PF of AUTOPEFT-found configurations, where SA-PA-PT-Layer forms the search space of AUTOPEFT.

Table 1 :
Results on the GLUE benchmark with BERT base (tasks are ranked in ascending order of training resources required, from left to right). For AUTOPEFT_RTE, we search on RTE with a low-fidelity proxy, training for 1 epoch per iteration, at a search cost of only 1.9% (in terms of additional fine-tuning steps) over the full GLUE experiment. We report the average number of fine-tuned parameters for per-task AUTOPEFT, where we conduct additional per-task searches on RTE, MRPC, STS-B, and CoLA, and take the best-found configurations for the remaining tasks. We report Spearman's correlation for STS-B, Matthews correlation for CoLA, and accuracy for all other tasks (matched accuracy for MNLI). The percentage of parameters is the ratio of the number of additional parameters to the number of pretrained parameters. We reproduce all baselines and report the mean and standard deviation of all results over 5 random seeds. The best, second-best, and third-best results are marked in bold and ranked by colour.

Table 2 :
Specification of the discovered configuration reported in Table 1 (AUTOPEFT RTE ) using BERT base .

Table 3 :
Results on SuperGLUE tasks with AUTOPEFT-discovered configurations searched on RTE, with BERT base as the underlying PLM. We split 10% of the training set off as the new validation set and report the transfer results of the AUTOPEFT_RTE-found configuration on the evaluation set over five random seeds.

Table 4 :
Experimental results on GLUE with T5 base. We report comparisons of in-task search performance and transfer performance between the architectures found by AUTOPEFT and the state-of-the-art baseline S3PET under a constrained parameter budget. Consistent with Table 1, we report AUTOPEFT and S3PET results searched on RTE in full-resource settings that are then transferred to all other included GLUE tasks.

Table 5 :
Results on GLUE with RoBERTa large. We report the full model fine-tuning† results from Liu et al. (2019b), with Pearson correlation for STS-B. We include the LoRA‡ module performance from Hu et al. (2022a). We exclude the QQP and MNLI tasks due to the high computational cost of RoBERTa large. Consistent with Table 1, we again report AUTOPEFT results searched on RTE in full-resource settings that are then transferred to all included GLUE tasks (AUTOPEFT_RTE), as well as per-task AUTOPEFT (AUTOPEFT_task Avg.), but on RoBERTa large.

Table 6 :
Comparing AUTOPEFT to layer selection baselines with the same parameter budget on BERT large. We report the Pfeiffer adapter for all 24 layers (Serial), specialised AdapterDrop (Rücklé et al., 2021) that inserts SA for the last 13 layers, and AA_uni (Moosavi et al., 2022) without its rational activation function with 13 selected layers (Adaptable Adapter). We run our AUTOPEFT under the comparable search space of 24 layers and approximately match the size of Serial.