Compressing Large-Scale Transformer-Based Models: A Case Study on BERT

Pre-trained Transformer-based models have achieved state-of-the-art performance for various Natural Language Processing (NLP) tasks. However, these models often have billions of parameters, and, thus, are too resource-hungry and computation-intensive to suit low-capability devices or applications with strict latency requirements. One potential remedy for this is model compression, which has attracted a lot of research attention. Here, we summarize the research in compressing Transformers, focusing on the especially popular BERT model. In particular, we survey the state of the art in compression for BERT, we clarify the current best practices for compressing large-scale Transformer models, and we provide insights into the workings of various methods. Our categorization and analysis also shed light on promising future research directions for achieving lightweight, accurate, and generic NLP models.


Introduction
Sentiment analysis, paraphrase detection, machine reading comprehension, question answering, text summarization: all these Natural Language Processing (NLP) tasks benefit from pre-training a large-scale generic model on an enormous corpus such as a Wikipedia dump and/or a book collection, and then fine-tuning for specific downstream tasks, as shown in Fig. 1. Earlier solutions following this methodology used recurrent neural networks (RNNs) as the generic base model, e.g., ULMFiT (Howard and Ruder, 2018) and ELMo (Peters et al., 2018). More recent methods mostly use the Transformer architecture (Vaswani et al., 2017), which relies heavily on the * * Both authors contributed equally to this research. attention mechanism, e.g., BERT (Devlin et al., 2019), GPT-2 (Radford et al., 2019), XLNet (Yang et al., 2019), MegatronLM (Shoeybi et al., 2019), Turing-NLG (Rosset, 2020), T5 (Raffel et al., 2020), and GPT-3 (Brown et al., 2020). These Transformers are powerful, e.g., BERT, when first released, improved the state of the art for eleven NLP tasks by sizable margins (Devlin et al., 2019). However, Transformers are also bulky and resource-hungry: for instance, GPT-3 (Brown et al., 2020), a recent large-scale Transformer, has over 175 billion parameters. Models of this size incur high memory consumption, computational overhead, and energy. The problem is exacerbated when we consider devices with lower capacity, e.g., smartphones, and applications with strict latency constraints, e.g., interactive chatbots.
To put things in perspective, a single training run for GPT-3 (Brown et al., 2020), one of the most powerful and heaviest Transformer-based models, trained on a total of 300 billion tokens, costs well above 12 million USD (Floridi and Chiriatti, 2020). Moreover, fine-tuning or even inference with such a model on a downstream task cannot be done on a GPU with 32GB memory, which is the capacity of Tesla V100, one of the most advanced data center GPUs. Instead it requires access to high-performance GPU or multi-core CPU clusters, which often means a need to access cloud computing with high computation density, such as the Google Cloud Platform (GCP), Microsoft Azure, Amazon Web Services (AWS), etc., and results in a high monetary cost (Floridi and Chiriatti, 2020).
One way to address this problem is through model compression, an intricate part of deep learning that has attracted attention from both researchers and practitioners. A recent study by  highlights the importance of first training over-parameterized models and then compressing them, instead of directly training smaller models, to reduce the performance errors. Although most methods in model compression were originally proposed for convolutional neural networks (CNNs), e.g., pruning, quantization, knowledge distillation, etc. (Cheng et al., 2017), many ideas are directly applicable to Transformers. There are also methods designed specifically for Transformers, e.g., attention head pruning, attention decomposition, replacing Transformer blocks with an RNN or a CNN, etc. (discussed in Section 3). Unlike CNNs, a Transformer model has a relatively complex architecture consisting of multiple parts such as embedding layers, selfattention, and feed-forward layers (details introduced in Section 2). Thus, the effectiveness of different compression methods can vary when applied to different parts of a Transformer model.
Several recent surveys have focused on pre-trained representations and large-scale Transformer-based models e.g., Qiu et al. (2020); ; . However, to the best of our knowledge, no comprehensive, systematic study has compared the effectiveness of different model compression techniques on Transformer-based large-scale NLP models, even though a variety of approaches for compressing such models have been proposed. Motivated by this, here we offer a thorough and in-depth comparative study on compressing Transformer-based NLP models, with a special focus on the widely used BERT (Devlin et al., 2019). Although the compression methods discussed here can be extended to Transformer-based decoders and multilingual Transformer models, we restrict our discussion to BERT in order to provide detailed insights into various methods.
Our study is timely, since (i) the use of Transformer-based BERT-like models has grown dramatically, as demonstrated by current leaders of various NLP tasks such as language understanding (Wang et al., 2018), machine reading comprehension (Rajpurkar et al., 2016(Rajpurkar et al., , 2018, machine translation (Machacek and Bojar, 2014), summarization (Narayan et al., 2018), etc.; (ii) many researchers are left behind as they do not have expensive GPUs (or a multi-GPU setup) with a large amount of GPU memory, and thus cannot finetune and use the large BERT model for relevant downstream tasks; and (iii) AI-powered devices such as smartphones would benefit tremendously from an on-board BERT-like model, but do not have the capability to run it. In addition to summarizing existing techniques and best practices for BERT compression, we point out several promising future directions of research for compressing large-scale Transformer-based models.

Breakdown & Analysis of BERT
Bidirectional Encoder Representations from Transformers, or BERT (Devlin et al., 2019), is a Transformer model (Vaswani et al., 2017) pretrained on large corpora from Wikipedia and the Bookcorpus (Zhu et al., 2015) using two training objectives: (i) Masked Language Model (MLM), which helps it learn the context in a sentence, and (ii) Next Sentence Prediction (NSP), from which it learns the relationship between two sentences. Subsequent Transformers have further improved the training objective in various ways (Lan et al., 2019;. In the following, we focus on the original BERT model. BERT decomposes its input sentence(s) into WordPiece tokens (Wu et al., 2016). Specifically, WordPiece tokenization helps improve the representation of the input vocabulary and reduce its size, by segmenting complex words into subwords. These subwords can even form new words not seen in the training samples, thus making the model more robust to out-of-vocabulary (OOV) words. A classification token ([CLS]) is inserted before the input, and the output corresponding to this token is used for classification tasks. For sentence pair tasks, the two sentences are packed together by inserting a separator token ([SEP]) between them.
BERT represents each WordPiece token with three vectors, namely its token, segment, and position embeddings. These embeddings are summed together and then passed through the main body of  the model, i.e., the Transformer backbone, which produces the output representations that are fed into the final, application-dependent layer, e.g., a classifier for sentiment analysis. As shown in Fig. 2, the Transformer backbone has multiple stacked encoder units, each with two major sub-units: a self-attention sub-unit and a feed forward network (FFN) sub-unit, both with residual connections. Each self-attention sub-unit consists of a multi-head self-attention layer, and fully-connected layers before and after it. An FFN sub-unit exclusively contains fully-connected layers. The architecture of BERT can be specified using three hyper-parameters: number of encoder units (L), size of the embedding vector (H), and number of attention heads in each self-attention layer (A). L and H determine the depth and the width of the model, whereas A is an internal hyper-parameter that affects the number of contextual relations that each encoder can focus on. The authors of BERT provide two pre-trained models: BERT BASE (L = 12; H = 768; A = 12) and BERT LARGE (L = 24; H = 1024; A = 16).
We conducted various experiments on BERT BASE model by running inference on a sentence of length 256 and then collected the results in Fig. 3. The top graph in the figure compares the model size as well as the theoretical computational requirements (measured in millions of FLOPs) of different parts of the model. The bottom two graphs track the model's run-time memory consumption as well as the inference latency on two representative hardware setups. We conducted experiments using Nvidia Titan X GPU with 12GB of video RAM and Intel Xeon E5-1620 CPU with 32 GB system memory, which is a commonly used server or workstation configuration. All data was collected using the PyTorch profiling tool.
Clearly, the parts consuming the most memory in terms of model size and executing the highest number of FLOPs are the FFN sub-units. The embedding layer is also a substantial part of the model size, due to the large vector size (H) used to represent each embedding vector. Note that the embedding layer has zero FLOPs, since it is a lookup table that involves no arithmetic computations at inference time. For the self-attention subunits, we further break down the costs into multihead self-attention layers and the linear (i.e., fullyconnected) layers before and after them. The multi-head self-attention does not have any learnable parameters; however, its computational cost is non-zero due to the dot products and the softmax operations.
The linear layers surrounding each attention layer incur additional memory and computational overhead, though it is relatively small compared to the FFN sub-units. Note that the input to the attention layer is divided among various heads, and thus each head operates in a lower-dimensional space (H/A). The linear layer before attention is roughly three times the size of that after it, since each attention has three inputs (key, value, and query) and only one output.
The theoretical computational overhead may differ from the actual inference cost at run-time, which depends on the hardware that the model runs on. As expected, when running the model on a GPU, the total run-time memory includes both memory on the GPU side and the CPU side, and it is greater than for a model running solely on CPU due to duplicate tensors present on both devices for faster processing on GPU.
The most notable difference between the theoretical analysis and the run-time measurements on a GPU is that the multi-head self-attention layers are significantly more costly in practice than in theory. This is because the operations in these layers are rather complex, and are implemented as several matrix transformations followed by a matrix multiplication and softmax. Furthermore, GPUs are designed to accelerate certain operations, and can thus implement linear layers faster and more efficiently than the more complex attention layers. When we compare the run-time performance on a CPU, where the hardware is not specialized for linear layer operations, the inference time as well as the memory consumption of all the linear layers shoots up more compared to the multi-head self-attention. Thus on a CPU, the

Embedding
Layer Linear before Attn.
Multi-Head Self-Attention Linear after Attn. behavior of run-time performance is similar to that of theoretical computations. The total execution time of a single example on a GPU (57.1 ms) is far superior as compared to a CPU (750.9 ms), as expected. The execution time of the embedding layer is largely independent of the hardware on which the model is executed (since it is just a table lookup) and it is relatively small compared to other layers. The FFN sub-units are the bottleneck of the whole model, which is consistent with the results from the theoretical analysis.

Compression Methods
Due to BERT's complex architecture, no existing compression method focuses on every aspect of the model like self-attention, linear layers, embedding size, model depth, etc. Instead, each compression technique applies to certain components of BERT. Below, we consider the compression methods that provide model size reduction and speedup at inference time, rather than the training procedure.

Quantization
Quantization refers to reducing the number of unique values required to represent the model weights, which in turn allows to represent them using fewer bits, to reduce the memory footprint, and to lower the precision of the numerical calculations. Quantization may even improve the runtime memory consumption as well as the inference speed when the underlying computational device is optimized to process lower-precision numerical values, e.g., tensor cores in newer generations of Nvidia GPUs. Programmable hardware such as FPGAs can also be specifically optimized for any bitwidth representation. Quantization of intermediate outputs and activations can further speed up the model execution (Boo and Sung, 2020). Quantization is generally applicable to all model weights as the BERT weights reside in fully-connected layers (i.e., the embedding layer, the linear layers, and the FFN sub-units), which have been shown to be quantizationfriendly (Hubara et al., 2017). The original BERT model provided by Google represents each weight by a 32-bit floating point number. A naïve approach is to simply truncate each weight to the target bitwidth, which often yields a sizable drop in accuracy as this forces certain weights to go through a severe drift in their value, known as quantization noise (Fan et al., 2021). A possible way around is to identify these weights and then not truncate them during the quantization step in order to retain the model accuracy. For example, Zadeh et al. (2020) assume Gaussian distribution in the weight matrix and identify outliers. Then, by not quantizing the outliers, they are able to perform post-training quantization without any retraining requirements.
A more common approach to retaining model accuracy is Quantization-Aware Training (QAT), which involves additional training steps to adjust the quantized weights. Fig. 4 shows an example of naïve linear quantization, quantization noise, and the importance of quantizationaware training. For BERT, QAT has been used to perform fixed-length integer quantization (Zafrir et al., 2019;Boo and Sung, 2020), Hessian-based mixed-precision quantization (Shen et al., 2020), adaptive floating-point quantization (Tambe et al., 2020), and noise-based quantization (Fan et al., 2021). Finally, it has been observed that the embedding layer is more sensitive to quantization than other encoder layers, and requires more bits to maintain the model accuracy (Shen et al., 2020).

Pruning
Pruning refers to identifying and removing redundant or less important weights and/or components, which sometimes even makes the model more robust and better-performing. Moreover, pruning is a commonly used method of exploring the lottery ticket hypothesis in neural networks (Frankle and Carbin, 2018), which has also been studied in the context of BERT Prasanna et al., 2020). Pruning methods for BERT largely fall into two categories.
Unstructured Pruning. Unstructured pruning, also known as sparse pruning, prunes individual weights by locating the set of least important weights in the model. The importance of weights can be judged by their absolute values, gradients, or a custom-designed measure-ment (Gordon et al., 2020;Mao et al., 2020;Guo et al., 2019;. Unstructured pruning could be effective for BERT, given the latter's massive fully-connected layers. Unstructured pruning methods include magnitude weight pruning (Gordon et al., 2020;Mao et al., 2020;, which simply removes weights close to zero, movement-based pruning Tambe et al., 2020), which removes weights moving towards zero during fine-tuning, and reweighted proximal pruning (RPP) (Guo et al., 2019), which uses iteratively reweighted 1 minimization followed by the proximal algorithm for decoupling pruning and error back-propagation. Since unstructured pruning considers each weight individually, the set of pruned weights can be arbitrary and irregular, which in turn might decrease the model size, but with negligible improvement in runtime memory or speed, unless executed on specialized hardware or with specialized processing libraries.
Structured Pruning. Unlike unstructured pruning, structured pruning focuses on pruning structured blocks of weights (Li et al., 2020a) or even complete architectural components in the BERT model, by reducing and simplifying certain numerical modules: • Attention head pruning. The self-attention layer incurs considerable computational overhead at inference time; yet, its importance has often been questioned (Kovaleva et al., 2019;Tay et al., 2020;Raganato et al., 2020). In fact, high accuracy is possible with only 1-2 attention heads per encoder unit, even though the original model has 16 attention heads (Michel et al., 2019). Randomly pruning attention heads during the training phase has been proposed, which can create a model robust to various numbers of attention heads, and a smaller model can be directly extracted for inference based on deployment requirements (Hou et al., 2020  Xu et al., 2020). As BERT contains residual connections for every sub-unit, using an identity prior to prune these layers has also been proposed . • Embedding size pruning. Similar to encoder unit pruning, we can reduce the size of the embedding vector (H) by pruning along the width of the model. Such a model can be obtained by either training with adaptive width, so that the model is robust to such pruning during inference (Hou et al., 2020), or by removing the least important feature dimensions iteratively (Khetan and Karnin, 2020;Prasanna et al., 2020;Tsai et al., 2020;.

Knowledge Distillation
Knowledge Distillation refers to training a smaller model (called the student) using outputs (from various intermediate functional components) of one or more larger pre-trained models (called the teachers). The flow of information can sometimes be through an intermediate model (commonly known as teaching assistants) (Ding and Yang, 2020;Sun et al., 2020b;Wang et al., 2020c). In the BERT model, there are multiple intermediate results that the student can learn from, such as the logits in the final layer, the outputs of the encoder units, and the attention maps. Moreover, there are multiple forms of loss functions adapted such as cross-entropy loss, KL divergence, MAE, etc. While knowledge distillation is most commonly used to train student models directly on task-specific data, recent results have shown that distillation during both pre-training and fine-tuning can help create better performing models (Song et al., 2020). An overview of various forms of knowledge distillation and student models is shown in Fig. 6. Based on what the student learns from the teacher, we categorize the existing methods as follows: Distillation from Output Logits. Similar to knowledge distillation for CNNs (Cheng et al., 2017), the student can directly learn from the output logits (i.e., from soft labels) of the final softmax layer in BERT. This is done to allow the student to better mimic the output of the teacher model, by replicating the probability distribution across various classes.
While knowledge distillation on output logits is most commonly used to train smaller BERT models (Sun et al., 2019;Sanh et al., 2019;Jiao et al., 2020;Zhao et al., 2019b;Turc et al., 2019;Cao et al., 2020;Sun et al., 2020b;Song et al., 2020;Mao et al., 2020;Ding and Yang, 2020;Noach and Goldberg, 2020), the student does not need to be a smaller version of BERT or even a Transformer, and can follow a completely different architecture. Below we describe the two commonly used replacements: • Replacing the Transformer with a BiLSTM, to create a lighter backbone. Recurrent models such as BiLSTMs process words sequentially instead of simultaneously attending to each word in the sentence like Transformers do, resulting in a smaller runtime memory requirement. Both can create bidirectional representations, and thus BiLSTMs can be considered a faster alternative to Transformers (Wasserblat et al., 2020). Compressing to a BiLSTM is typically done directly for a specific NLP task (Mukherjee and Awadallah, 2020). Since these …..

Final layer
Encoder unit …..  models are trained from scratch on the taskspecific dataset without any intermediate guidance, various methods have been proposed to create additional synthetic training data using rule-based data augmentation techniques (Tang et al., 2019b,a;Mukherjee and Awadallah, 2019) or to collect data from multiple tasks to train a single model (Liu et al., 2019a). • Replacing the Transformer with a CNN, to take advantage of massively parallel computations and improved inference speed (Chia et al., 2018). While it is theoretically possible to make the internal processing of an encoder parallel, where each parallel unit requires access to all the inputs from the previous layer as an encoder unit focuses on the global context, this setup is computationally intensive and cost-inefficient. Unlike Transformers, each CNN unit focuses on the local context only, and, unlike BiLSTMs, CNNs do not operate on the input sequentially, which makes it easier for them to divide the computation into small parallel units. It is possible to either completely replace the Transformer backbone with a deep CNN network (Chen et al.), or to replace only a few encoder units to balance performance and efficiency (Tian et al., 2019).

Distillation from output logits
Distillation from Encoder Outputs. Each encoder unit in a Transformer model can be viewed as a separate functional unit. Intuitively, the output tensors of such an encoder unit may contain meaningful semantic and contextual relationships between input tokens, leading to an improved representation. Following this idea, we can create a smaller model by learning from an encoder's outputs. The smaller model can have a reduced embedding size H, a smaller number of encoder units L, or a lighter alternative that replaces the Transformer backbone.
• Reducing H leads to more compact representations in the student (Zhao et al., 2019b;Sun et al., 2020b;Jiao et al., 2020;. One challenge is that the student cannot directly learn from the teacher's intermediate outputs, due to different sizes. To overcome this, the student also learns a transformation, which can be implemented by either down-projecting the teacher's outputs to a lower dimension or by upprojecting the student's outputs to the original dimension (Zhao et al., 2019b). Another possibility is to introduce these transformations directly into the student model, and later merge them with the existing linear layers to obtain the final smaller model . • Reducing L, which is the number of encoder units, forces each encoder unit in the student to learn from the behavior of a sequence of multiple encoder units in the teacher (Sun et al., 2019;Sanh et al., 2019;Sun et al., 2020b;Jiao et al., 2020;Zhao et al., 2019b;. Further analysis into various details of choosing which encoder units to use for distillation is provided by Sajjad et al. (2020). For example, preserving the bottom encoder units and aggressively distilling the top encoder units yields a better-performing student model, which indicates the importance of the bottom layers in the teacher model. While most existing methods create an injective mapping from the student encoder units to the teacher,  instead propose a way to build a many-to-many mapping for a better flow of information. One can also completely bypass the mapping by combining all outputs into one single representation vector (Sun et al., 2020a). • It is also possible to use encoder outputs to train student models that are not Transformers Awadallah, 2019, 2020;Tian et al., 2019). However, when the student model uses a completely different architecture, the flexibility of using internal representations is rather limited, and only the output from the last encoder unit is used for distillation.
Distillation from Attention Maps. Attention map refers to the softmax distribution output of the self-attention layers and indicates the contextual dependence between the input tokens. It has been proposed that attention maps in BERT can identify distinguishable linguistic relations, e.g., identical words across sentences, verbs and corresponding objects, or pronouns and corresponding nouns (Clark et al., 2019). These distributions are the only source of inter-dependency between input tokens in a Transformer model and thus by replicating these distributions, a student can also learn such linguistic relations (Sun et al., 2020b;Jiao et al., 2020;Mao et al., 2020;Tian et al., 2019;Noach and Goldberg, 2020).
A common method of distillation from attention maps is to directly minimize the difference between the teacher and the student multi-head self-attention outputs. Similar to distillation from encoder outputs, replicating attention maps also faces a choice of mapping between the teacher and the student, as each encoder unit has its own attention distribution. Previous work has also proposed replicating only the last attention map in the model to truly capture the contextual dependence (Wang et al., 2020c). One can attempt an even deeper distillation of information through intermediate attention outputs such as key, query and value matrices, individual attention head outputs, key-query and value-value matrix products, etc., to facilitate the flow of information (Wang et al., 2020c;Noach and Goldberg, 2020).

Matrix Decomposition
The computational overhead in BERT mainly consists of large matrix multiplications, both in the linear layers as well as in the attention heads. Thus, decomposing these matrices can significantly impact the computational requirement for such models.
Weight Matrix Decomposition. A possible way to reduce the computational overhead of the model can be through weight matrix factorization, which replaces the original A × B weight matrix with the product of two smaller matrices ( Out0 Attention decomposition into smaller groups Attention Decomposition. The importance of attention calculation over the entire sentence has been explored, revealing a large number of redundant computations (Tay et al., 2020;Cao et al., 2020). One way to resolve this issue is by calculating attention in smaller groups, by either binning them using spatial locality (Cao et al., 2020), magnitude-based locality (Kitaev et al., 2019), or an adaptive attention span (Tambe et al., 2020). Moreover, since the outputs are calculated independently, local attention methods also enable a higher degree of parallel processing and individual representations can be saved during inference for multiple uses. Fig. 7 shows an example of attention decomposition based on spatial locality.
It has been also proposed to reduce the computations required in the attention calculation by projecting the key-query matrix into a lower dimensionality (Wang et al., 2020b) or by only calculating the softmax of the top-k key-query product values in order to further highlight these relations (Zhao et al., 2019a). Since the multi-head self-attention layer does not contain weights, these methods only improve the runtime memory costs and execution speed, but not the model size.

Dynamic Inference Acceleration
Besides directly compressing the model, there are methods that focus on reducing computational overhead at inference time, catering to individual input examples and dynamically changing the amount of computation required. We have visualized various dynamic inference acceleration methods in Fig. 8. Early Exit Ramps. One way to speed up inference is to create intermediary exit points in the model. Since the classification layers are the least parameter-extensive part of BERT, separate classifiers can be trained for each encoder unit output. This allows the model to get dynamic inference time for various inputs. Training these classifiers can either be done from scratch Tambe et al., 2020) or through distilling the output from the final classifier .
Progressive Word Vector Elimination. Another way to accelerate inference is by reducing the number of words processed at each encoder level. Since we use only the final output corresponding to the [CLS] token (defined in Section 2) as a representation of the complete sentence, the information of the entire sentence must have fused into that one token. Goyal et al. (2020) observe that such a fusion cannot be sudden, and that it must happen progressively across various encoder levels. We can use this information to lighten the later encoder units by reducing the sentence length through word vector elimination at each step.

Other Methods
Besides the aforementioned categories, there are also several one-of-a-kind methods that have been shown to be effective for reducing the size and inference time of BERT-like models.
Parameter Sharing. ALBERT (Lan et al., 2019) follows the same architecture as BERT, with the important difference that in the former, weights are shared across all encoder units, which yields a significantly reduced memory footprint. Moreover, ALBERT has been shown to enable training larger and deeper models. For example, while BERT's performance peaks at BERT LARGE (performance of BERT XLARGE drops significantly), the performance of ALBERT keeps improving until the far larger ALBERT XXLARGE model (L = 12; H = 4096; A = 64).
Embedding Matrix Compression. The embedding matrix is the lookup table for the embedding layer, which is about 21% of the size of the complete BERT model. One way to compress the embedding matrix is by reducing the vocabulary size V , which is about 30k in the original BERT model. Recall from Section 2 that the vocabulary of BERT is learned using a WordPiece tokenizer. The WordPiece tokenizer relies on the vocabulary size to figure out the degree of fragmentation of words present in the input text. A large vocabulary size allows for better representation of rare words and for more adaptability to OOV words.
However, even with a 5k vocabulary size, 94% of the tokens created match with the tokens created using a 30k vocabulary size (Zhao et al., 2019b). This shows that the majority of words that appear frequently enough in text are covered even with such a small vocabulary size, and thus makes it reasonable to consider decreasing vocabulary size to compress the embedding matrix.
Another alternative is to replace the existing one-hot vector encoding with a "codebook"-based encoding, where each token is represented using multiple indices from the codebook. The final embedding of the token can be calculated by doing the sum of embeddings present at all these indices (Prakash et al., 2020).
Weight Squeezing. Weight Squeezing (Chumachenko et al., 2020) is a compression method similar to knowledge distillation, where the student learns from the teacher. However, instead of learning from intermediate outputs like in knowledge distillation, it's the weights of the teacher model that are mapped to the student through a learnable transformation, and thus the student learns its weights directly from the teacher.

Effectiveness of the Compression Methods
In this section, we compare the performance of several BERT compression techniques based on their model size and speedup, as well as their ac-curacy/F1 on various NLP tasks. We chose papers whose results are either on the Pareto frontier (Deb, 2014) or representative for each compression technique mentioned in the previous section.

Datasets and Evaluation Measures
From the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018) and the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016), we use the following most commonly reported tasks: MNLI and QQP for sentence pair classification, SST-2 for single sentence classification, and SQuAD v1.1 for machine reading comprehension. Following the official leaderboards, we report the accuracy for MNLI, SST-2 and QQP, and F1 score for SQuAD v1.1. We also report the absolute drop in accuracy with respect to BERT BASE , averaged across all tasks for which the authors have reported results, in an attempt to quantify their accuracy on a single scale. Furthermore, we report model speedup on both GPU and CPU devices collected directly from the original papers. For papers that provide their speedup, we also mention the target device on which the speedups are calculated. For papers which do not provide speedup values, we run their models on our own machine and perform inference on the complete MNLI-m test dataset (using a batch size of 1) with machine configurations as detailed in Section 2. We also provide the model size with and without the embedding matrix, since for certain application scenarios, where the memory constraints for model storage are not strict, the parameters of the embedding matrix can be ignored as it has negligible run-time cost (see Section 2). As no previous work in the literature reports a drop in runtime memory, and as many of the papers that we compare to here use probabilistic models that cannot be easily replicated without the authors releasing their code, we could not do direct runtime memory comparisons. Quantization and Pruning. Quantization is well-suited for BERT, and it is able to outperform other compression methods in terms of both model size and accuracy. As shown in Table 1, quantization can reduce the size of the BERT model to 15% and 10.2% of its original size, with accuracy drop of only 0.6% and 0.9%, respectively, across various tasks (Shen et al., 2020;Zadeh et al., 2020). This can be attributed to the fact that quantization is an architecture-invariant compression method, which means it only reduces the precision of the weights, but preserves all the original components and connections present in the model. Unstructured pruning also shows performance that is on par with other methods. We can see that unstructured pruning reduces the original BERT model to 67.6% of its original size, without any loss in accuracy, possibly due to the regularization effect caused by pruning (Guo et al., 2019). However, almost all existing work in unstructured pruning freezes the embedding matrix and focuses only on pruning weight matrices of the encoder. This makes extreme compression difficult, e.g., even with 3% weight density in encoders, the total model size still remains at 23.8% of its original size , and yields a sizable drop in accuracy/F1 (4.73% on average).

Comparison and Analysis
While both quantization and unstructured pruning reduce the model size significantly, none of them yields actual run-time speedups on a standard device. Instead, specialized hardware and/or libraries are required, which can do lower-bit arithmetic for quantization and an optimized implementation of sparse weight matrix multiplication for unstructured pruning. However, these methods can be easily combined with other compression methods as they are orthogonal from an implementation viewpoint. We discuss the performance of compounding multiple compression method together later in this section.
Structured Pruning. As discussed in Section 3, structured pruning removes architectural components from BERT, which can also be seen as reducing the hyper-parameters that govern the BERT architecture. While  pruned the encoder units (L) and reduced the model depth by half with an average accuracy drop of 1.0%, Khetan and Karnin (2020) took it a step further and systematically reduced both the depth (L) as well as the width (H, A) of the model, and were  Khetan and Karnin (2020) also show that reducing all hyper-parameters in harmony, instead of focusing on just one, helps achieve better performance.
Model-Agnostic Distillation. Applying distillation from output logits only allows model-agnostic compression and gives rise to LSTM/CNN-based student models. While some methods exist that do try to train a smaller BERT model (Song et al., 2020), this category is dominated by methods that aim to replace Transformers with lighter alternatives. It is also interesting to note that at a similar model size, Liu et al. (2019a) which uses a BiLSTM student model, yields significantly better speedup when compared to Song et al. (2020), which uses a Transformer-based student model. Chen et al. provides the fastest model in this category, using a NAS-based CNN model, with only 2.06% average drop in accuracy.
While these methods focused on achieving high compression ratio, they cause a sizable drop in accuracy. A possible explanation is that the total model size is not a true indicator of how powerful these compression methods are, because the majority of their model size is the embedding matrix. For example, while the total model size of Liu et al. (2019a) is 101 MB, only 11 MB is actually their BiLSTM model, and the remaining 90 MB are just the embedding matrix. Similarly to unstructured pruning, not focusing on the embedding matrix can hurt the deployment of such models on devices with strict memory constraints.
Distillation from Attention Maps. Wang et al. (2020c) were able to reduce the model to 60.7% its original size, with only 0.1% loss in accuracy on average, just by doing deep distillation on the attention layers. For the same student architecture, Sanh et al. (2019) used all other forms of distillation (i.e., output logits and encoder outputs) together and still faced an average accuracy loss of 1.73%. Clearly, the intermediate attention maps are an important distillation target.
Combining Multiple Distillations. Combining multiple distillation targets can help achieve an even better performing compressed model. Jiao et al. (2020) created a student model with smaller H and L hyperparameter values, and were able to compress the model size to 13.3% and achieved a 9.4x speedup on GPU (9.3x on CPU), while only facing a drop of 1.0% in accuracy. Zhao et al. (2019b) extended the same idea and created an extremely small BERT student model (1.6% of the original size and ∼ 25x faster) with H = 48 and vocabulary size |V | = 4928 (BERT BASE has H = 768 and |V | = 30522). The model lost 12.3% accuracy to make up for its size.
Matrix Decomposition and Dynamic Inference Acceleration. While weight matrix decomposition helps reduce the size of the weight matrices in BERT, they create deeper and fragmented models, which hurts the execution time (Noach and Goldberg, 2020). On the other hand, methods from the literature that implement faster attention and various forms of dynamic speedup do not change the model size, but instead provide faster inference. For example, the work by Cao et al. (2020) show that attention calculation across the complete sentence is not needed for the initial encoder layers, and they were able to achieve ∼ 3x run-time speedup with only 0.76% drop in accuracy. For applications where latency is the major constraint, such methods can help achieve the necessary speedup.
Structured Pruning vs. Distillation. While structured pruning attempts to iteratively prune the hyper-parameters of BERT, distillation starts with a smaller model and tries to train it using knowledge directly from the original BERT. However, both of them end up with a similar compressed model, and thus it is interesting to compare which path yields more promising results. As can be noted from Table 1, for the same compressed model with L = 6, the drop in accuracy for the model by  is smaller compared to that by Sanh et al. (2019). However, this is not a completely fair comparison, as Sanh et al.
(2019) does not use attention as a distillation target. When we compare other methods, we find that Jiao et al. (2020) was able to beat Khetan and Karnin (2020) in terms of both model size and accuracy. This shows that structured pruning outperforms student models trained using distillation only on encoder outputs and output logits, but fails against distillation on attention maps. This further indicates the importance of replicating attention maps in BERT.
Pruning with Distillation. Similar to combining multiple distillation methods, one can also combine pruning with distillation, as this can help guide the pruning towards removing less important connections. Mao et al. (2020) combined distillation with unstructured pruning, while Hou et al. (2020) combined distillation with structured pruning. When compared with only structured pruning (Khetan and Karnin, 2020), we see that Hou et al. (2020) achieved a smaller model size (12.4%), and a smaller accuracy drop (0.96%).
Quantization with Distillation. Similar to pruning, quantization is also orthogonal in implementation to distillation, and can together achieve better performance than either of them individually. Zadeh et al. (2020) attempted to quantize an already distilled BERT (Sanh et al., 2019) to 4 bits, thus reducing the model size from 60.2% to 7.5%, with an additional accuracy drop of only 0.9% (1.73% to 2.6%). Similarly, Sun et al. (2020b) attempted to quantize their model to 8 bits, which reduced their model size from 23% to 5.25%, with only a 0.07% additional drop in accuracy.
Compounding multiple methods together. As we have seen in this section, different methods of compression target different parts of the BERT architecture. Note that many of these methods are orthogonal in implementation, similar to the papers that we discussed on combining quantization and pruning with distillation, and thus it is possible to combine them. For example, Tambe et al. (2020) combined multiple forms of compression methods to create a truly deployable language model for edge devices. They combined parameter sharing, embedding matrix decomposition, unstructured movement pruning, adaptive floatingpoint quantization, adaptive attention span, dynamic inference speed with early exit ramps, and other hardware accelerations to suit their needs. However, as we noticed in this section, these particular methods can reduce the model size signif-icantly, but they cannot drastically speed up the model execution on standard devices. While the model size is dropped to only 1.3% of its original size, the speedup obtained on a standard GPU is only 1.83x, with an average drop of 1.53% in accuracy. With specialized accelerators, the authors were able to push the speedup eventually to 2.1x.

Practical Advice
Based on the results collected in this section, we attempt to give the readers some practical advice for specific applications: • Quantization and unstructured pruning can help reduce the model size, but do nothing to improve the runtime inference speed or the memory consumption, unless executed on specialized hardware or with specialized processing libraries. On the other hand, if executed on proper hardware, these methods can provide tremendous boost in terms of speed with negligible loss in performance (Zadeh et al., 2020;Tambe et al., 2020;Guo et al., 2019). Thus, it is important to recognize the target hardware device before using these compression methods. • Knowledge distillation has shown great affinity to a variety of student models and its orthogonal nature of implementation compared to other methods (Mao et al., 2020;Hou et al., 2020) means it is an important addition to any form of compression. More specifically, distillation from self-attention layers (if possible) is an integral part of Transformer compression (Wang et al., 2020c). • Alternatives such as BiLSTMs and CNNs have an additional advantage in terms of execution speed when compared to Transformers. Thus, for applications with strict latency constraints, replacing Transformers with alternative units is a better choice. Model execution can also be sped up using dynamic inference methods, as they can be incorporated into any student model with a skeleton similar to that of Transformers. • A major takeaway is the importance of compounding various compression methods together to achieve truly practical models for edge environment. The work of Tambe et al. (2020) is a good example of this, as it attempts to compress BERT, while simultaneously performing hardware optimizations in accordance with their chosen compression methods. Thus, combining compression methods that complement each other is better than compressing a single aspect of the model to its extreme.

Open Issues & Research Directions
From our analysis and comparison of various BERT compression methods, we conclude that traditional model compression methods such as quantization and pruning do show benefits for BERT. Techniques specific to BERT also yield competitive results, e.g., variants of knowledge distillation and methods that reduce architectural hyper-parameters. Such methods also offer insights into BERT's workings and the importance of various layers in its architecture. However, BERT compression is still in early stages and we see multiple avenues for future research: 1. A very prominent feature of most BERT compression methods is their coupled nature across various encoder units, as well as its inner architecture. However, some layers might be able to handle more compression. Methods compressing each layer independently (Khetan and Karnin, 2020;Tsai et al., 2020) have shown promising results but remain under-explored. 2. The very nature of the Transformer backbone that forces the model to be parameterheavy makes model compression for BERT more challenging. Existing work in replacing the Transformer backbone by Bi-LSTMs and CNNs has yielded extraordinary compression ratios, but with a sizable drop in accuracy. This suggests further exploration of more complicated variations of these models and hybrid Bi-LSTM/CNN/Transformer models to limit the loss in performance (Tian et al., 2019). 3. Many existing methods for BERT compression only work on specific parts of the model. However, we can combine such complementary methods to achieve better overall model compression performance. We have seen in Effectiveness of the Compression Methods that such compound compression methods perform better than their individual counterparts (Tambe et al., 2020;Hou et al., 2020), and thus more exploration in combining various existing methods is needed.