Abstract
Pretrained embeddings based on the Transformer architecture have taken the NLP community by storm. We show that they can mathematically be reframed as a sum of vector factors and showcase how to use this reframing to study the impact of each component. We provide evidence that multi-head attentions and feed-forwards are not equally useful in all downstream applications, as well as a quantitative overview of the effects of finetuning on the overall embedding space. This approach allows us to draw connections to a wide range of previous studies, from vector space anisotropy to attention weights.
1 Introduction
The Transformer architecture (Vaswani et al., 2017) has taken the NLP community by storm. Based on the attention mechanism (Bahdanau et al., 2015; Luong et al., 2015), it was shown to outperform recurrent architectures on a wide variety of tasks. Another step was taken with pretrained language models derived from this architecture (BERT, Devlin et al., 2019, among others): They now embody the default approach to a vast swath of NLP applications. Success breeds scrutiny, and the popularity of these models has fostered research in explainable NLP on their behavior and inner workings (Rogers et al., 2020).
In this paper, we develop a novel decomposition of Transformer output embeddings. Our approach consists in quantifying the contribution of each network submodule to the output contextual embedding, and grouping those into four terms: (i) what relates to the input for a given position, (ii) what pertains to feed-forward submodules, (iii) what corresponds to multi-head attention, and (iv) what is due to vector biases.
This allows us to investigate Transformer embeddings without relying on attention weights or treating the entire model as a black box, as is most often done in the literature. The usefulness of our method is demonstrated on BERT: Our case study yields enlightening connections to state-of-the-art work on Transformer explainability, evidence that multi-head attentions and feed-forwards are not equally useful in all downstream applications, as well as an overview of the effects of finetuning on the embedding space. We also provide a simple and intuitive measurement of the importance of any term in this decomposition with respect to the whole embedding.
2 Additive Structure in Transformers
Equation (1) provides interpretable and quantifiable terms that can explain the behavior of specific components of the Transformer architecture. More precisely, it characterizes what is the impact of adding another sublayer on top of what was previously computed: the terms in Equation (1) are defined as sums across (sub)layers; hence we can track how a given sublayer transforms its input, and show that this effect can be thought of as adding another vector to a previous sum. This layer-wise sum of submodule outputs also allows us to provide a first estimate of which parameters are most relevant to the overall embedding space: a submodule whose output is systematically negligible has its parameters set so that its influence on subsequent computations is minimal.
The formulation in Equation (1) more generally relies on the additive structure of Transformer embedding spaces. We start by reviewing the Transformer architecture in Section 2.1, before discussing our decomposition in greater detail in Section 2.2 and known limitations in Section 2.3.
2.1 Transformer Encoder Architecture
Let us start by characterizing the Transformer architecture of Vaswani et al. (2017), using the notation described in Table 1.
Table 1: Notation.
A | matrix
(A)t,· | tth row of A
a | (row) vector
a, α | scalars
W(M) | item linked to submodule M
a ⊕ b | concatenation of vectors a and b
a1 ⊕ a2 ⊕ ⋯ ⊕ an | iterated concatenation of n vectors
a ⊙ b | element-wise multiplication of a and b
a1 ⊙ a2 ⊙ ⋯ ⊙ an | iterated element-wise multiplication of n vectors
1n | vector with all n components set to 1
0m,n | null matrix of shape m × n
In | identity matrix of shape n × n
Transformers are often defined using three hyperparameters: the number of layers L, the dimensionality of the hidden representations d, and the number of attention heads H used in multi-head attention. Formally, a Transformer model is a stack of sublayers; a visual representation is shown in Figure 1. Two sublayers are stacked to form a single Transformer layer: The first corresponds to a multi-head attention mechanism (MHA), and the second to a feed-forward (FF). A Transformer with L layers thus contains Λ = 2L sublayers. In Figure 1, two sublayers (in blue) are grouped into one layer, and L layers are stacked one after the other.
Each sublayer is centered around a specific sublayer function. Sublayer functions map an input x to an output y, and can either be feed-forward submodules or multi-head attention submodules.
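As a point of reference, the sublayer arrangement described above can be sketched as follows. This is a minimal post-LN skeleton in the style of BERT with placeholder hyperparameters; it is not the authors' implementation, and the attention and feed-forward internals are standard PyTorch modules rather than the exact submodules analyzed in this paper.

```python
import torch.nn as nn

class SelfAttention(nn.Module):
    """Thin wrapper so nn.MultiheadAttention can be used as a sublayer function."""
    def __init__(self, d: int, n_heads: int):
        super().__init__()
        self.mha = nn.MultiheadAttention(d, n_heads, batch_first=True)

    def forward(self, x):
        out, _ = self.mha(x, x, x, need_weights=False)
        return out

class Sublayer(nn.Module):
    """Residual connection and post-LN wrapped around a sublayer function (MHA or FF)."""
    def __init__(self, fn: nn.Module, d: int):
        super().__init__()
        self.fn, self.ln = fn, nn.LayerNorm(d)

    def forward(self, x):
        return self.ln(x + self.fn(x))

def transformer_layer(d: int = 768, n_heads: int = 12, d_ff: int = 3072) -> nn.Module:
    """One Transformer layer = an MHA sublayer followed by a FF sublayer."""
    ff = nn.Sequential(nn.Linear(d, d_ff), nn.GELU(), nn.Linear(d_ff, d))
    return nn.Sequential(
        Sublayer(SelfAttention(d, n_heads), d),  # first sublayer
        Sublayer(ff, d),                         # second sublayer
    )
```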
The gain g and bias b(LN) are learned parameters with d components each. The input vector is z-scaled: the vector (1, ⋯, 1) scaled by the mean component value mt of the input is subtracted from it, and the result is divided by st, the standard deviation of the input's component values. As such, a LN performs a z-scaling, followed by the application of the gain g and the bias b(LN).
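For reference, a layer normalization consistent with this description can be written as follows (a reconstruction using standard LN notation; the paper's own equation, not reproduced in this extract, may index the terms differently):

```latex
\mathrm{LN}(\mathbf{x}_t) = \mathbf{g} \odot \frac{\mathbf{x}_t - m_t \mathbf{1}}{s_t} + \mathbf{b}^{(\mathrm{LN})},
\quad
m_t = \frac{1}{d} \sum_{i=1}^{d} (\mathbf{x}_t)_i,
\quad
s_t = \sqrt{\frac{1}{d} \sum_{i=1}^{d} \bigl((\mathbf{x}_t)_i - m_t\bigr)^2}
```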
To kick-start computations, a sequence of static vector representations x0,1…x0,n with d components each is fed into the first layer. This initial input corresponds to the sum of a static lookup word embedding and a positional encoding.1
2.2 Mathematical Re-framing
We now turn to the decomposition proposed in Equation (1): et = it + ft + ht + ct.2 We provide a derivation in Appendix A.
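As a sanity check of the kind mentioned in the Notes (components of attested embeddings and of the decomposition match up to ±10−7), one can verify numerically that the four terms sum back to the output embedding. The sketch below assumes the four term tensors have already been extracted (e.g., with forward hooks on each submodule); the extraction itself is not shown.

```python
import torch

def check_decomposition(e_t, i_t, f_t, h_t, c_t, atol: float = 1e-6):
    """Check that Equation (1) holds: e_t == i_t + f_t + h_t + c_t.

    All arguments are (sequence_length, d) tensors for one sentence; returns the
    largest absolute reconstruction error and whether it is within tolerance."""
    reconstruction = i_t + f_t + h_t + c_t
    max_err = (reconstruction - e_t).abs().max().item()
    return max_err, torch.allclose(reconstruction, e_t, atol=atol)
```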
2.3 Limitations of Equation (1)
The decomposition proposed in Equation (1) comes with a few caveats that are worth addressing explicitly. Most importantly, Equation (1) does not entail that the terms are independent from one another. For instance, the scaling factor systematically depends on the magnitude of earlier hidden representations. Equation (1) only stresses that a Transformer embedding can be decomposed as a sum of the outputs of its submodules: It does not fully disentangle computations. We leave the precise definition of computation disentanglement and its elaboration for the Transformer to future research, and focus here on the decomposition proposed in Equation (1).
In all, the major issue at hand is the ft term: It is the only term that cannot be derived as a linear composition of vectors, due to the non-linear function used in the FFs. Aside from the ft term, non-linear computations all devolve into scalar corrections (namely, the LN z-scaling factors sλ,t and mλ,t and the attention weights αl,h). As such, ft is the single bottleneck that prevents us from entirely decomposing a Transformer embedding as a linear combination of sub-terms.
As the non-linear functions used in Transformers are generally either ReLU or GELU, which both behave almost linearly for high enough input values, it is in principle possible for the FF submodules to be well approximated by purely linear transformations, depending on the exact set of parameters they converged to. This possibility is worth assessing. Here, we learn a least-squares linear regression mapping the z-scaled inputs of every FF to its corresponding z-scaled output. We use the BERT base uncased model of Devlin et al. (2019) and a random sample of 10,000 sentences from the Europarl English section (Koehn, 2005), or almost 900,000 word-piece tokens, and fit the regressions using all 900,000 embeddings.
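A minimal sketch of this check is given below, assuming the FF inputs and outputs have already been collected (e.g., with forward hooks); the exact preprocessing used in the paper may differ, and the per-token z-scaling here simply mirrors the LN z-scaling described in Section 2.1.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def z_scale(x: np.ndarray) -> np.ndarray:
    """Per-token z-scaling across the d components (as a LN would do, without gain/bias)."""
    return (x - x.mean(axis=-1, keepdims=True)) / x.std(axis=-1, keepdims=True)

def ff_linear_fit_r2(ff_inputs: np.ndarray, ff_outputs: np.ndarray) -> float:
    """Fit a least-squares linear map from one FF sublayer's z-scaled inputs to its
    z-scaled outputs, and return the r^2 of the fit (averaged over output dimensions).

    ff_inputs, ff_outputs: arrays of shape (n_tokens, d)."""
    x, y = z_scale(ff_inputs), z_scale(ff_outputs)
    reg = LinearRegression().fit(x, y)
    return r2_score(y, reg.predict(x))
```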
Figure 2 displays the quality of these linear approximations, as measured by an r2 score. We see some variation across layers but never observe a perfect fit: 30% to 60% of the observed variance is not explained by the linear map, suggesting that BERT actively exploits the non-linearity. That the model does not simply circumvent the non-linear function to adopt a linear behavior makes intuitive sense: Adding the feed-forward terms is what prevents the model from devolving into a sum of bag-of-words and static embeddings. While such approaches have been successful (Mikolov et al., 2013; Mitchell and Lapata, 2010), a non-linearity ought to make the model more expressive.
In all, the sanity check in Figure 2 highlights that the interpretation of the ft term is the major “black box” unanalyzable component remaining under Equation (1). As such, the recent interest in analyzing these modules (e.g., Geva et al., 2021; Zhao et al., 2021; Geva et al., 2022) is likely to have direct implications for the relevance of the present work. When adopting the linear decomposition approach we advocate, this problem can be further simplified: We only require a computationally efficient algorithm to map an input weighted sum of vectors through the non-linearity to an output weighted sum of vectors.4
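As an illustration of what such an algorithm could look like (a sketch, not the paper's proposal): because ReLU acts componentwise, its output on a sum of vectors can always be rewritten as a sum over the same terms by sharing a componentwise mask,

```latex
\mathrm{ReLU}\Bigl(\sum_i \mathbf{x}_i\Bigr)
  = \mathbf{m} \odot \sum_i \mathbf{x}_i
  = \sum_i \mathbf{m} \odot \mathbf{x}_i,
\qquad
(\mathbf{m})_j =
  \begin{cases}
    1 & \text{if } \bigl(\sum_i \mathbf{x}_i\bigr)_j > 0,\\
    0 & \text{otherwise;}
  \end{cases}
```

an analogous rescaling, (m)j = GELU(zj)/zj with z = Σi xi (for zj ≠ 0), applies to GELU. Note that the mask depends on the full sum, so the resulting per-term outputs are not independent of one another, echoing the caveat at the start of this section.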
Note also that previous research has stressed that Transformer layers exhibit a certain degree of commutativity (Zhao et al., 2021) and that additional computations can be injected between contiguous sublayers (Pfeiffer et al., 2020). This can be thought of as evidence pointing towards a certain independence of the computations done in each layer: If we can shuffle and add layers, then it seems reasonable to characterize sublayers based on what their outputs add to the total embedding, as we do in Equation (1).
Beyond these expectations, it remains to be seen whether our proposed methodology is of actual use, that is, whether it is conducive to further research. The remainder of this article presents some analyses that our decomposition enables us to conduct.5
3 Visualizing the Contents of Embeddings
One major question is that of the relative relevance of the different submodules of the architecture with respect to the overall output embedding. Studying the four terms it, ft, ht, and ct can prove helpful in this endeavor. Given that Equations (2) to (5) are defined as sums across layers or sublayers, it is straightforward to adapt them to derive the decomposition of intermediate representations. Hence, we can study how relevant each of the four terms is to intermediate representations, and plot how this relevance evolves across layers.
We use the same Europarl sample as in Section 2.3. We contrast embeddings from three related models: The BERT base uncased model and two variants fine-tuned on CONLL 2003 NER (Tjong Kim Sang and De Meulder, 2003)6 and on SQuAD v2 (Rajpurkar et al., 2018).7
Figure 3 summarizes the relative importance of the four terms of Eq. (1), as measured by the normalized dot-product defined in Eq. (6); ticks on the x-axis correspond to different layers. Figures 3a to 3c display the evolution of our proportion metric across layers for all three BERT models, and Figures 3d to 3f display how our normalized dot-product measurements correlate across pairs of models using Spearman’s ρ.8
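Equation (6) is not reproduced in this extract; the sketch below implements one plausible reading of a "normalized dot-product" importance measure, namely the scalar projection of each term onto the full embedding, which has the convenient property that the four proportions sum to 1 for every token.

```python
import numpy as np

def term_proportion(term: np.ndarray, embedding: np.ndarray) -> np.ndarray:
    """Normalized dot product of one term of Eq. (1) with the full output embedding.

    term, embedding: arrays of shape (n_tokens, d). Individual proportions may be
    negative or exceed 1, but they sum to 1 across the four terms of Eq. (1)."""
    return (np.einsum("td,td->t", term, embedding)
            / np.einsum("td,td->t", embedding, embedding))
```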
Looking at Figure 3a, we can make a few important observations. The input term it, which corresponds to a static embedding, initially dominates the full output but quickly decreases in prominence, until it reaches 0.045 at the last layer. This may explain why lower layers of Transformers generally give better performances on static word-type tasks (Vulić et al., 2020, among others). The ht term is not as prominent as one could expect from the vast literature that focuses on MHA. Its normalized dot-product is barely above what we observe for ct, and never averages above 0.3 in any layer. This can partly be attributed to the prominence of ft, whose normalized dot-product is 0.4 or above across most layers. As FF submodules are always the last component added to each hidden state, the sub-terms of ft go through fewer LNs than those of ht, and thus undergo fewer scalar multiplications, which likely affects their magnitude. Lastly, the term ct is far from negligible: At layer 11, it is the most prominent term, and it accounts for up to 23% of the output embedding. Note that ct defines a set of offsets embedded in a 2Λ-dimensional hyperplane (cf. Appendix B). In BERT base, 23% of the output can thus be expressed using a 50-dimensional vector, or 6.5% of the 768 dimensions of the model. This likely induces part of the anisotropy of Transformer embeddings (e.g., Ethayarajh, 2019; Timkey and van Schijndel, 2021), as the ct term pushes the embedding towards a specific region of the space.
The fine-tuned models in Figures 3b and 3c are found to assign a much lower proportion of the contextual embeddings to the it and ct terms. While ft seems to dominate the final embedding, the correlations in Figures 3d and 3e suggest that the ht terms are those that undergo the most modifications: Proportions assigned to the terms correlate with those of the non-finetuned model more strongly in lower layers than in higher layers (Figures 3d and 3e). The required adaptations seem task-specific, as the two fine-tuned models do not correlate highly with each other (Figure 3f). Lastly, updates in the NER model mostly impact layer 8 and upwards (Figure 3d), whereas the QA model (Figure 3e) sees important modifications to the ht term already at the first layer, suggesting that SQuAD requires more drastic adaptations than CONLL 2003.
4 The MLM Objective
An interesting follow-up question concerns which of the four terms allows us to retrieve the target word-piece. We consider two approaches: (a) using the actual output projection learned by the non-fine-tuned BERT model, or (b) learning a simple categorical regression for a specific term. We randomly select 15% of the word-pieces in our Europarl sample. As in the work of Devlin et al. (2019), 80% of these items are masked, 10% are replaced by a random word-piece, and 10% are left as is. The selected embeddings are then split into train (80%), validation (10%), and test (10%) sets.
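A minimal sketch of setting (b) is given below: a single linear layer trained with cross-entropy, that is, a categorical (multinomial logistic) regression from a given sum of terms to the masked word-piece identity. The full-batch loop, learning rate, and default epoch count are placeholders; the paper tunes hyperparameters with Bayes optimization and iterates 20 times over the train set (Appendix C). Setting (a) instead reuses BERT's own output projection.

```python
import torch
import torch.nn as nn

def train_mlm_probe(features: torch.Tensor, targets: torch.Tensor,
                    vocab_size: int, epochs: int = 20, lr: float = 1e-3) -> nn.Linear:
    """Learn a linear categorical regression from embedding terms to word-piece ids.

    features: (n_examples, d) tensor, e.g. i_t + h_t for the selected positions;
    targets:  (n_examples,) tensor of gold word-piece ids."""
    probe = nn.Linear(features.size(-1), vocab_size)
    optimizer = torch.optim.AdamW(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(probe(features), targets)
        loss.backward()
        optimizer.step()
    return probe
```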
Results are displayed in Table 2. The first row (“Default”) details predictions using the default output projection onto the vocabulary, that is, we test the performance of combinations of sub-terms under the circumstances encountered by the model during training.9 The rows below (“Learned”) correspond to learned linear projections; the row marked μ displays the average performance across all 5 runs. Columns display the results of using the sum of 1, 2, 3, or 4 of the terms it, ht, ft, and ct to derive representations; for example, the rightmost column corresponds to it + ht + ft + ct (i.e., the full embedding), whereas the leftmost corresponds to predicting based on it alone. Focusing on the default projection first, we see that it benefits from more extensive training: When using all four terms, it is almost 2% more accurate than a projection learned from scratch. On the other hand, learning a regression allows us to consider more specifically what can be retrieved from individual terms, as is apparent from the behavior of the ft term: When using the default output projection, we get 1.36% accuracy, whereas a learned regression yields 53.77%.
The default projection matrix is also highly dependent on the normalization offsets ct and the FF terms ft being added together: Removing the ct term from any experiment using ft is highly detrimental to accuracy, whereas combining the two produces the highest accuracy scores. Our logistic regressions show that most of this performance can be attributed to the ft term: A projection learned from the ft term alone already yields an accuracy of almost 54%, whereas a regression learned from ct only reaches a limited 9.72% on average. Interestingly, the latter is still above what one would observe if the model always predicted the most frequent word-piece (viz. the, 6% of the test targets): Even these semantically bare items can be exploited by a classifier. As ct is tied to the LN z-scaling, this suggests that the magnitude of Transformer embeddings is not wholly meaningless.
In all, do FFs make the model more effective? The ft term is necessary to achieve the highest accuracy on the training objective of BERT. On its own, it doesn’t achieve the highest performances: for that we also need to add the MHA outputs ht. However, the performances we can associate to ft on its own are higher than what we observe for ht, suggesting that FFs make the Transformer architecture more effective on the MLM objective. This result connects with the work of Geva et al. (2021, 2022) who argue that FFs update the distribution over the full vocabulary, hence it makes sense that ft would be most useful to the MLM task.
5 Lexical Contents and WSD
We now turn to how the vector spaces are organized, and which term yields the most linguistically appropriate space. We rely on word sense disambiguation (WSD), as distinct senses should yield different representations.
We consider an intrinsic KNN-based setup and an extrinsic probe-based setup. The former is inspired by Wiedemann et al. (2019): We assign to a target the most common label in its neighborhood, restricting neighborhoods to words with the same annotated lemma and using the 5 nearest neighbors under cosine distance. The latter is a 2-layer MLP similar to that of Du et al. (2019), where the first layer is shared across all items and the second layer is lemma-specific. We use the nltk SemCor dataset (Landes et al., 1998; Bird et al., 2009), with an 80%–10%–10% split. We drop monosemous and OOV lemmas and sum over word-pieces to convert them into single-word representations. Table 3 shows accuracy results. Selecting the most frequent sense would yield an accuracy of 57%; picking a sense at random, 24%. The terms it and ct struggle to outperform the former baseline: The corresponding KNN accuracy scores are lower, and the corresponding probe accuracy scores are barely above it.
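The intrinsic KNN setup can be sketched as follows; the data structures (arrays and lists of training vectors, lemmas, and sense labels) are hypothetical placeholders, and ties are broken arbitrarily.

```python
from collections import Counter
import numpy as np

def knn_wsd(query_vec: np.ndarray, lemma: str, train_vecs: np.ndarray,
            train_lemmas: list, train_senses: list, k: int = 5) -> str:
    """Predict the majority sense among the k nearest training tokens that share the
    target's lemma, using cosine distance between contextual embeddings."""
    idx = [i for i, l in enumerate(train_lemmas) if l == lemma]
    cands = train_vecs[idx]
    # Cosine similarity is the dot product of L2-normalized vectors.
    cands = cands / np.linalg.norm(cands, axis=1, keepdims=True)
    query = query_vec / np.linalg.norm(query_vec)
    top = np.argsort(-(cands @ query))[:k]
    return Counter(train_senses[idx[i]] for i in top).most_common(1)[0][0]
```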
Overall, the same picture emerges from the KNN setup and all 5 runs of the classifier setup. The ft term does not yield the highest performances in this experiment; instead, the ht term systematically dominates. Among single-term models, ht is ranked first and ft second. As for sums of two terms, the setups ranked 1st, 2nd, and 3rd are those that include ht; the setups ranked 3rd to 5th are those that include ft. Even more surprisingly, when summing three of the terms, the highest-ranked setup is the one that excludes ft, and the lowest-ranked corresponds to excluding ht. Removing ft systematically yields better performances than using the full embedding. This suggests that ft is not necessarily helpful to the final representation for WSD, which contrasts with what we observed for MLM, where ht was found to be less useful than ft.
One could argue that the predictions derived from the different sums of terms are intrinsically different, so that a purely quantitative ranking might not capture this important distinction. To verify whether this holds, we can look at the proportion of predictions that agree for any two models. Because our intent is to see what can be retrieved from specific sub-terms of the embedding, we focus solely on the most efficient classifiers across runs. This is summarized in Figure 4: An individual cell details the proportion of assigned labels shared by the models for that row and that column. In short, model predictions tend to overlap to a high degree. For both the KNN and classifier setups, the three models that make the most distinct predictions turn out to be those computed from the it term, the ct term, or their sum: that is, the models that struggle to perform better than the most-frequent-sense (MFS) baseline and are derived from static representations.
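The agreement matrices of Figure 4 can be computed as the pairwise proportion of test items for which two probes predict the same label; a minimal sketch follows (the model names are placeholders).

```python
import numpy as np

def overlap_matrix(predictions: dict) -> np.ndarray:
    """predictions maps a model name (e.g., 'h', 'i+c') to a 1-D array of predicted
    labels on the same test items; returns the matrix of pairwise agreement rates."""
    names = list(predictions)
    agreement = np.zeros((len(names), len(names)))
    for a, name_a in enumerate(names):
        for b, name_b in enumerate(names):
            agreement[a, b] = np.mean(predictions[name_a] == predictions[name_b])
    return agreement
```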
6 Effects of Fine-tuning and NER
Downstream applications can also be addressed through fine-tuning, that is, further training a model to derive better predictions on a narrower task. As we saw in Figures 3b and 3c, the modifications brought about by this second round of training are task-specific, meaning that an exhaustive experimental survey is out of our reach.
We consider the task of Named Entity Recognition, using the WNUT 2016 shared task dataset (Strauss et al., 2016). Using shallow probes, we contrast the performance of the non-finetuned BERT model with that of the aforementioned variant fine-tuned on the CONLL 2003 NER dataset.
Results are presented in Table 4. The very high variance we observe across runs is likely due to the smaller size of this dataset (46,469 training examples, compared to the 142,642 of Section 5 or the 107,815 of Section 4). Fine-tuning BERT on another NER dataset unsurprisingly has a systematic positive impact: Average performance jumps up by 5% or more. More interesting is the impact this fine-tuning has on the ft term: When used as sole input, its highest observed performance increases by over 8%, and similar improvements are observed consistently across all setups involving ft. Yet, the best average performances for the fine-tuned embeddings correspond to ht (39.28%), it + ht (39.21%), and it + ht + ct (39.06%), while in the base setting the highest average performances are reached with ht + ct (33.40%), it + ht + ct (33.25%), and ht (32.91%), suggesting that ft might be superfluous for this task.
We can also look at whether the highest-scoring classifiers across runs produce different outputs. Given the high class imbalance of the dataset at hand, we macro-average the prediction overlaps by label. The result is shown in Figure 5; the upper triangle details the behavior of the untuned model, and the lower triangle that of the NER-fine-tuned model. In this round of experiments, we see much more distinctly that the it model, the ct model, and the it + ct model behave markedly differently from the rest, with ct yielding the most distinct predictions. As for the NER-fine-tuned model (lower triangle), aside from the aforementioned static representations, most predictions display a much higher degree of overlap than what we observe for the non-finetuned model: Both FFs and MHAs are skewed towards producing outputs more adapted to NER tasks.
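For Figure 5, the strong class imbalance motivates macro-averaging the agreement over labels rather than over items; the sketch below groups items by gold label, which is one plausible reading of the macro-averaging described above.

```python
import numpy as np

def macro_overlap(pred_a: np.ndarray, pred_b: np.ndarray, gold: np.ndarray) -> float:
    """Agreement between two probes, macro-averaged over gold labels: compute the
    agreement rate within each gold label separately, then average across labels."""
    rates = [np.mean(pred_a[gold == label] == pred_b[gold == label])
             for label in np.unique(gold)]
    return float(np.mean(rates))
```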
7 Relevant Work
The derivation we provide in Section 2 ties in well with other studies setting out to explain how Transformer embedding spaces are structured (Voita et al., 2019; Mickus et al., 2020; Vázquez et al., 2021, among others) and, more broadly, how they behave (Rogers et al., 2020). For instance, lower layers tend to yield higher performance on surface tasks (e.g., predicting the presence of a word, Jawahar et al., 2019) or static benchmarks (e.g., analogy, Vulić et al., 2020): This ties in with the vanishing prominence of it across layers. Likewise, probe-based approaches that unearth a linear structure matching the syntactic structure of the input sentence (Raganato and Tiedemann, 2018; Hewitt and Manning, 2019, among others) can be construed as relying on the explicit linear dependence that we highlight here.
Another connection is with studies on embedding space anisotropy (Ethayarajh, 2019; Timkey and van Schijndel, 2021): Our derivation provides a means of circumscribing which neural components are likely to cause it. Also relevant is the study on sparsifying Transformer representations of Yun et al. (2021): The linearly dependent nature of Transformer embeddings has some implications when it comes to dictionary coding.
Also relevant are the works focusing on the interpretation of specific Transformer components, and feed-forward sublayers in particular (Geva et al., 2021; Zhao et al., 2021; Geva et al., 2022). Lastly, our approach provides some quantitative argument for the validity of attention-based studies (Serrano and Smith, 2019; Jain and Wallace, 2019; Wiegreffe and Pinter, 2019; Pruthi et al., 2020) and expands on earlier works looking beyond attention weights (Kobayashi et al., 2020).
8 Conclusions and Future Work
In this paper, we stress how Transformer embeddings can be decomposed linearly to describe the impact of each network component. We showcased how this additive structure can be used to investigate Transformers. Our approach suggests a less central place for attention-based studies: If multi-head attention only accounts for 30% of embeddings, can we possibly explain what Transformers do by looking solely at these submodules? The crux of our methodology lies in decomposing the output embedding by submodule instead of by layer or head. These approaches are not mutually exclusive (cf. Section 3), hence our approach can easily be combined with other probing protocols, providing the means to narrow in on specific network components.
The experiments we conducted in Sections 3 to 6 were designed to test whether our decomposition in Equation (1) could yield useful results; that is, as we put it in Section 2.3, whether this approach could be conducive to further research. We were able to use the proposed approach to draw insightful connections. The noticeable anisotropy of contextual embeddings can be connected to the prominent trace of the biases in the output embedding: As model biases make up an important part of the whole embedding, they push it towards a specific sub-region of the embedding space. The diminishing importance of it links back to earlier results on word-type semantic benchmarks. We also report novel findings, showcasing how some submodule outputs may be detrimental in specific scenarios: The output trace of the FF modules was found to be extremely useful for MLM, whereas the ht term was found to be crucial for WSD. Our methodology also allows for an overview of the impact of fine-tuning (cf. Section 6): It skews components towards more task-specific outputs, and its effects are especially noticeable in upper layers (Figures 3d and 3e).
Analyses in Sections 3 to 6 demonstrate the immediate insight that our Transformer decomposition can help achieve. This work therefore opens a number of research perspectives, of which we name three. First, as mentioned in Section 2.3, our approach can be extended further to more thoroughly disentangle computations. Second, while we focused here more on feed-forward and multi-head attention components, extracting the static component embeddings from it would allow for a principled comparison of contextual and static distributional semantics models. Last but not least, because our analysis highlights the different relative importance of Transformer components in different tasks, it can be used to help choose the most appropriate tools for further interpretation of trained models among the wealth of alternatives.
Acknowledgments
We are highly indebted to Marianne Clausel for her significant help with how best to present the mathematical aspects of this work. Our thanks also go to Aman Sinha, as well as three anonymous reviewers for their substantial comments towards bettering this work.
This work was supported by a public grant overseen by the French National Research Agency (ANR) as part of the “Investissements d’Avenir” program: Idex Lorraine Université d’Excellence (reference: ANR-15-IDEX-0004). We also acknowledge the support by the FoTran project, funded by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement n° 771113).
A Step-by-step Derivation of Eq. (1)
B Hyperplane Bounds of ct
C Computational Details
In Section 2.3, we use the default hyperparameters of scikit-learn (Pedregosa et al., 2011). In Section 4, we learn categorical regressions using an AdamW optimizer (Loshchilov and Hutter, 2019) and iterate 20 times over the train set; hyperparameters (learning rate, weight decay, dropout, and the β1 and β2 AdamW hyperparameters) are set using Bayes Optimization (Snoek et al., 2012), with 50 hyperparameter samples and accuracy as objective. In Section 5, learning rate, dropout, weight decay, β1 and β2, learning rate scheduling are selected with Bayes Optimization, using 100 samples and accuracy as objective. In Section 6, we learn shallow logistic regressions, setting hyperparameters with Bayes Optimization, using 100 samples and macro-f1 as the objective. Experiments were run on a 4GB NVIDIA GPU.
D Ethical Considerations
The offset method of Mikolov et al. (2013) is known to also model social stereotypes (Bolukbasi et al., 2016, among others). Some of the sub-representations in our decomposition may exhibit stronger biases than the whole embedding et, and can yield higher performances than focusing on the whole embedding (e.g., Table 3). This could provide an undesirable incentive to deploy NLP models with higher performances and stronger systemic biases.
Notes
We empirically verified that components from attested embeddings et and those derived from Eq. (1) are systematically equal up to ± 10−7.
In the case of relative positional embeddings applied to value projections (Shaw et al., 2018), it is rather straightforward to follow the same logic so as to include relative positional offset in the most appropriate term.
Code for our experiments is available at the following URL: https://github.com/TimotheeMickus/bert-splat.
Layer 0 is the layer normalization conducted before the first sublayer, hence ft and ht are undefined here.
We thank an anonymous reviewer for pointing out that the BERT model ties input and output embeddings; we leave investigating the implications of this fact for future work.
In the case of BERT, we also need to include a LN before the first layer, which is straightforward if we index it as λ = 0.
The edge case is taken to be the identity matrix Id, for notation simplicity.
References
Author notes
The work described in the present paper was conducted chiefly while at ATILF.