## Abstract

Pretrained embeddings based on the Transformer architecture have taken the NLP community by storm. We show that they can mathematically be reframed as a sum of vector factors and showcase how to use this reframing to study the impact of each component. We provide evidence that multi-head attentions and feed-forwards are not equally useful in all downstream applications, as well as a quantitative overview of the effects of finetuning on the overall embedding space. This approach allows us to draw connections to a wide range of previous studies, from vector space anisotropy to attention weights.

## 1 Introduction

The Transformer architecture (Vaswani et al., 2017) has taken the NLP community by storm. Based on the attention mechanism (Bahdanau et al., 2015; Luong et al., 2015), it was shown to outperform recurrent architectures on a wide variety of tasks. Another step was taken with pretrained language models derived from this architecture (BERT, Devlin et al., 2019, among others): they now embody the default approach to a vast swath of NLP applications. Success breeds scrutiny; likewise the popularity of these models has fostered research in explainable NLP interested in the behavior and explainability of pretrained language models (Rogers et al., 2020).

In this paper, we develop a novel decomposition of Transformer output embeddings. Our approach consists in quantifying the contribution of each network submodule to the output contextual embedding, and grouping those into four terms: (i) what relates to the input for a given position, (ii) what pertains to feed-forward submodules, (iii) what corresponds to multi-head attention, and (iv) what is due to vector biases.

This allows us to investigate Transformer embeddings without relying on attention weights or treating the entire model as a black box, as is most often done in the literature. The usefulness of our method is demonstrated on BERT: Our case study yields enlightening connections to state-of-the-art work on Transformer explainability, evidence that multi-head attentions and feed-forwards are not equally useful in all downstream applications, as well as an overview of the effects of finetuning on the embedding space. We also provide a simple and intuitive measurement of the importance of any term in this decomposition with respect to the whole embedding.

## 2 Additive Structure in Transformers

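The Transformer embedding **e**_{t} for a token *t* can be expressed as a sum of four *terms*:

$$\mathbf{e}_t = \mathbf{i}_t + \mathbf{f}_t + \mathbf{h}_t + \mathbf{c}_t \tag{1}$$

where **i**_{t} can be thought of as a classical static embedding, **f**_{t} and **h**_{t} are the cumulative contributions at every layer of the feed-forward submodules and the multi-head attentions (MHAs) respectively, and **c**_{t} corresponds to biases accumulated across the model.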

Equation (1) provides interpretable and quantifiable terms that can explain the behavior of specific components of the Transformer architecture. More precisely, it characterizes the impact of adding another sublayer on top of what was previously computed: the terms in Equation (1) are defined as sums across (sub)layers; hence we can track how a given sublayer transforms its input, and show that this effect can be thought of as adding another vector to a previous sum. This layer-wise sum of submodule outputs also allows us to provide a first estimate of which parameters are most relevant to the overall embedding space: a submodule whose output is systematically negligible has its parameters set so that its influence on subsequent computations is minimal.

The formulation in Equation (1) more generally relies on the additive structure of Transformer embedding spaces. We start by reviewing the Transformer architecture in Section 2.1, before discussing our decomposition in greater detail in Section 2.2 and known limitations in Section 2.3.

### 2.1 Transformer Encoder Architecture

Let’s start by characterizing the Transformer architecture of Vaswani et al. (2017) in the notation described in Table 1.

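| Notation | Meaning |
| --- | --- |
| **A** | matrix |
| (**A**)_{t,·} | *t*^{th} row of **A** |
| **a** | (row) vector |
| *a*, *α* | scalars |
| **W**^{(M)} | item linked to submodule M |
| **a** ⊕ **b** | concatenation of vectors **a** and **b** |
| $\bigoplus_n \mathbf{a}_n$ | **a**_{1} ⊕ **a**_{2} ⊕ ⋯ ⊕ **a**_{n} |
| **a** ⊙ **b** | element-wise multiplication of **a** and **b** |
| $\bigodot_n \mathbf{a}_n$ | **a**_{1} ⊙ **a**_{2} ⊙ ⋯ ⊙ **a**_{n} |
| $\vec{1}$ | vector with all components set to 1 |
| **0**_{m,n} | null matrix of shape *m* × *n* |
| **I**_{n} | identity matrix of shape *n* × *n* |

Table 1: Notation.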


Transformers are often defined using three hyperparameters: the number of layers *L*, the dimensionality of the hidden representations *d*, and *H*, the number of attention heads in multi-head attentions. Formally, a Transformer model is a stack of *sublayers*. A visual representation is shown in Figure 1. Two sublayers are stacked to form a single Transformer *layer*: The first corresponds to a multi-head attention mechanism (MHA), and the second to a feed-forward (FF). A Transformer with *L* layers contains Λ = 2*L* sublayers. In Figure 1 two sublayers (in blue) are grouped into one layer, and *L* layers are stacked one after the other.

Each sublayer is centered around a specific sublayer function. Sublayer functions map an input **x** to an output **y**, and can either be *feed-forward* submodules or *multi-head attention* submodules.

Feed-forward submodules (FFs) have the following structure:

$$\mathbf{y}_t^{(\mathrm{FF})} = \phi\left(\mathbf{x}_t \mathbf{W}^{(\mathrm{FF,I})} + \mathbf{b}^{(\mathrm{FF,I})}\right) \mathbf{W}^{(\mathrm{FF,O})} + \mathbf{b}^{(\mathrm{FF,O})}$$

where *ϕ* is a non-linear function, such as ReLU or GELU (Hendrycks and Gimpel, 2016). Here, ^{(…,I)} and ^{(…,O)} distinguish the input and output linear projections, and the index *t* corresponds to the token position. Input and output dimensions are equal, whereas the intermediary layer dimension (i.e., the size of the hidden representations to which the non-linear function *ϕ* will be applied) is larger, typically *b* = 1024 or 2048. In other words, **W**^{(FF,I)} is of shape *d* × *b*, **b**^{(FF,I)} of size *b*, **W**^{(FF,O)} is of shape *b* × *d*, and **b**^{(FF,O)} of size *d*.
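The feed-forward structure just described (input projection, non-linearity, output projection) can be sketched in a few lines of plain Python. This is only an illustration with toy sizes (*d* = 2, *b* = 3); the helper names `vec_mat` and `feed_forward` and all parameter values are made up for the example and do not come from any trained model.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def vec_mat(v, M):
    """Row vector v (length p) times matrix M (p rows, q columns)."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def feed_forward(x_t, W_in, b_in, W_out, b_out, phi=gelu):
    """Compute phi(x_t W_in + b_in) W_out + b_out for a single position t."""
    hidden = [phi(h + b) for h, b in zip(vec_mat(x_t, W_in), b_in)]
    return [y + b for y, b in zip(vec_mat(hidden, W_out), b_out)]

# Toy parameters (hypothetical values): d = 2, intermediate size b = 3.
x_t = [1.0, -2.0]
W_in = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]      # shape d x b
b_in = [0.0, 0.0, 0.0]                         # size b
W_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # shape b x d
b_out = [0.5, -0.5]                            # size d

y_gelu = feed_forward(x_t, W_in, b_in, W_out, b_out)
# With an identity "non-linearity", the FF collapses to an affine map.
y_identity = feed_forward(x_t, W_in, b_in, W_out, b_out, phi=lambda v: v)
```

Swapping `phi` for the identity makes the whole submodule affine, which is the degenerate case that Section 2.3 tests for empirically.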

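Multi-head attention mechanisms (MHAs) are concatenations of scaled-dot *attention heads*:

$$\mathbf{y}_t^{(\mathrm{MHA})} = \left( \bigoplus_{h=1}^{H} (\mathbf{A}_h)_{t,\cdot} \right) \mathbf{W}^{(\mathrm{MHA,O})} + \mathbf{b}^{(\mathrm{MHA,O})}$$

where (**A**_{h})_{t,·} is the *t*^{th} row vector of the following *n* × *d*/*H* matrix **A**_{h}:

$$\mathbf{A}_h = \mathrm{softmax}\left( \frac{\mathbf{Q}_h \mathbf{K}_h^{T}}{\sqrt{d/H}} \right) \mathbf{V}_h$$

with *h* an index tracking attention heads. The parameter matrix **W**^{(MHA,O)} is of shape *d* × *d*, and the bias **b**^{(MHA,O)} of size *d*. The queries **Q**_{h}, keys **K**_{h} and values **V**_{h} are simple linear projections of shape *n* × (*d*/*H*), computed from all inputs **x**_{1},…,**x**_{n}:

$$\begin{aligned}
(\mathbf{Q}_h)_{t,\cdot} &= \mathbf{x}_t \mathbf{W}_h^{(Q)} + \mathbf{b}_h^{(Q)} \\
(\mathbf{K}_h)_{t,\cdot} &= \mathbf{x}_t \mathbf{W}_h^{(K)} + \mathbf{b}_h^{(K)} \\
(\mathbf{V}_h)_{t,\cdot} &= \mathbf{x}_t \mathbf{W}_h^{(V)} + \mathbf{b}_h^{(V)}
\end{aligned}$$

where the weight matrices $\mathbf{W}_h^{(Q)}$, $\mathbf{W}_h^{(K)}$ and $\mathbf{W}_h^{(V)}$ are of shape *d* × (*d*/*H*), with *H* the number of attention heads, and biases $\mathbf{b}_h^{(Q)}$, $\mathbf{b}_h^{(K)}$ and $\mathbf{b}_h^{(V)}$ are of size *d*/*H*. This component is often analyzed in terms of *attention weights* *α*_{h}, which correspond to the softmax dot-product between keys and queries. In other words, the product $\mathrm{softmax}(\mathbf{Q}_h \mathbf{K}_h^{T} / \sqrt{d/H})$ can be thought of as an *n* × *n* matrix of weights in an average over the transformed input vectors $\mathbf{x}_{t'} \mathbf{W}_h^{(V)} + \mathbf{b}_h^{(V)}$ (Kobayashi et al., 2020, Eqs. (1) to (4)): multiplying these weights with the value projection **V**_{h} yields a weighted sum of value projections:

$$(\mathbf{A}_h)_{t,\cdot} = \sum_{t'=1}^{n} \alpha_{h,t,t'} (\mathbf{V}_h)_{t',\cdot}$$

where $\alpha_{h,t,t'}$ is the component at row *t* and column *t′* of this attention weights matrix.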
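The reading of an attention head as a weighted sum of value vectors can be checked on a toy example. The sketch below assumes a single head with *n* = 3 positions and *d*/*H* = 2; the matrices and the helper names `attention_weights` and `weighted_sum` are illustrative only. It also shows a consequence used later in the paper: since each row of weights sums to 1, the value bias can be pulled out of the weighted sum.

```python
import math

def softmax(scores):
    top = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention_weights(Q, K):
    """Row t holds alpha_{t,1..n}: the softmax of scaled query-key dot products."""
    scale = math.sqrt(len(Q[0]))           # sqrt(d / H), the per-head dimension
    return [softmax([dot(q, k) / scale for k in K]) for q in Q]

def weighted_sum(alpha, V):
    """(A)_t = sum over t' of alpha[t][t'] * V[t']."""
    return [[sum(a * v[j] for a, v in zip(row, V)) for j in range(len(V[0]))]
            for row in alpha]

# Hypothetical toy head: n = 3 positions, d / H = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[0.5, -0.5], [1.0, 0.0], [0.0, 2.0]]
V_unbiased = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # stands in for x_t' W^(V)
b_v = [0.5, -1.0]                                    # stands in for the bias b^(V)
V = [[v + b for v, b in zip(row, b_v)] for row in V_unbiased]

alpha = attention_weights(Q, K)
A = weighted_sum(alpha, V)
# Each row of alpha sums to 1, so the value bias factors out of the sum.
A_factored = [[a + b for a, b in zip(row, b_v)]
              for row in weighted_sum(alpha, V_unbiased)]
```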

Lastly, after each sublayer function *S*, a *residual connection* and a *layer normalization* (LN, Ba et al., 2016) are applied:

$$\mathbf{y}_t^{(\mathrm{LN})} = \mathbf{g} \odot \left[ \frac{S(\mathbf{x}_t) + \mathbf{x}_t - m_t \cdot \vec{1}}{s_t} \right] + \mathbf{b}^{(\mathrm{LN})}$$

The gain **g** and bias **b**^{(LN)} are learned parameters with *d* components each; $m_t \cdot \vec{1}$ is the vector (1, ⋯, 1) scaled by the mean component value *m*_{t} of the input vector $S(\mathbf{x}_t) + \mathbf{x}_t$; *s*_{t} is the standard deviation of the component values of this input. As such, a LN performs a *z*-scaling, followed by the application of the gain **g** and the bias **b**^{(LN)}.
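A small numerical sketch may help see why the LN preserves an additive structure. Under the definition above (no epsilon term, which real implementations usually add), the LN output can be rewritten as the sum of four separately rescaled pieces: the sublayer output, the residual input, the mean shift, and the LN bias. All values and names below are hypothetical.

```python
import math

def layer_norm(v, gain, bias):
    """z-scale v, then apply the gain and bias. Returns output, mean and std.
    No epsilon is added to the variance, following the formula in the text."""
    d = len(v)
    m = sum(v) / d
    s = math.sqrt(sum((x - m) ** 2 for x in v) / d)
    out = [g * (x - m) / s + b for x, g, b in zip(v, gain, bias)]
    return out, m, s

x_t = [0.3, -1.2, 0.8, 2.0]        # sublayer input (toy values)
S_x = [0.5, 0.1, -0.4, 0.2]        # hypothetical sublayer output S(x_t)
gain = [1.0, 0.9, 1.1, 1.2]
bias = [0.0, 0.1, -0.1, 0.05]

residual = [a + b for a, b in zip(S_x, x_t)]
y_t, m_t, s_t = layer_norm(residual, gain, bias)

# The same output, written as a sum of four separately scaled pieces.
scale = [g / s_t for g in gain]
pieces = {
    "sublayer": [sc * a for sc, a in zip(scale, S_x)],
    "input": [sc * a for sc, a in zip(scale, x_t)],
    "mean_shift": [-sc * m_t for sc in scale],
    "ln_bias": list(bias),
}
y_sum = [sum(vals) for vals in zip(*pieces.values())]
```

Note that the scaling factor `s_t` still depends on both `S_x` and `x_t`, so the pieces are additive but not independent of one another.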

To kick-start computations, a sequence of static vector representations **x**_{0,1}…**x**_{0,n} with *d* components each is fed into the first layer. This initial input corresponds to the sum of a static lookup word embedding and a positional encoding.^{1}

### 2.2 Mathematical Re-framing

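We now turn to the decomposition proposed in Equation (1): **e**_{t} = **i**_{t} + **f**_{t} + **h**_{t} + **c**_{t}.^{2} We provide a derivation in Appendix A.

The term **i**_{t} corresponds to the input embedding (i.e., the positional encoding, the input word-type embedding, and the segment encoding in BERT-like models), after having gone through all the LN gains and rescaling:

$$\mathbf{i}_t = \frac{\bigodot_{\lambda=1}^{\Lambda} \mathbf{g}_\lambda}{\prod_{\lambda=1}^{\Lambda} s_{\lambda,t}} \odot \mathbf{x}_{0,t} \tag{2}$$

where Λ = 2*L* ranges over all sublayers. Here, the **g**_{λ} correspond to the learned gain parameters of the LNs, whereas the *s*_{λ,t} scalars derive from the *z*-scaling performed in the *λ*^{th} LN, as defined above. The input **x**_{0,t} consists of the sum of a static lookup embedding and a positional encoding; as such, it resembles an uncontextualized embedding.

The term **f**_{t} is the sum of the outputs of the FF submodules. Submodule outputs pass through LNs of all the layers above, hence:

$$\mathbf{f}_t = \sum_{l=1}^{L} \frac{\bigodot_{\lambda=2l}^{\Lambda} \mathbf{g}_\lambda}{\prod_{\lambda=2l}^{\Lambda} s_{\lambda,t}} \odot \tilde{\mathbf{f}}_{l,t} \tag{3}$$

where $\tilde{\mathbf{f}}_{l,t} = \phi\left(\mathbf{x}_{l,t}^{(\mathrm{FF})} \mathbf{W}_l^{(\mathrm{FF,I})} + \mathbf{b}_l^{(\mathrm{FF,I})}\right) \mathbf{W}_l^{(\mathrm{FF,O})}$ is the unbiased output at the position *t* of the FF submodule for this layer *l*, with $\mathbf{x}_{l,t}^{(\mathrm{FF})}$ the input to that FF.

The term **h**_{t} corresponds to the sum across layers of each MHA output, having passed through the relevant LNs. As MHAs are entirely linear, we can further describe each output as a sum over all *H* heads of a weighted bag-of-words of the input representations to that submodule. Or:

$$\mathbf{h}_t = \sum_{l=1}^{L} \frac{\bigodot_{\lambda=2l-1}^{\Lambda} \mathbf{g}_\lambda}{\prod_{\lambda=2l-1}^{\Lambda} s_{\lambda,t}} \odot \left[ \sum_{h=1}^{H} \sum_{t'=1}^{n} \alpha_{l,h,t,t'} \, \mathbf{x}_{l,t'} \mathbf{Z}_{l,h} \right] \tag{4}$$

where $\mathbf{x}_{l,t'}$ is the input to the MHA of layer *l* at position *t′*, and **Z**_{l,h} corresponds to passing an input embedding through the unbiased values projection $\mathbf{W}_{l,h}^{(V)}$ of the head *h*, then projecting it from a *d*/*H*-dimensional subspace onto a *d*-dimensional space using a zero-padded identity matrix:

$$\mathbf{Z}_{l,h} = \mathbf{W}_{l,h}^{(V)} \mathbf{M}_h \mathbf{W}_l^{(\mathrm{MHA,O})}, \qquad \mathbf{M}_h = \begin{bmatrix} \mathbf{0}_{d/H,\,(h-1) \times d/H} & \mathbf{I}_{d/H} & \mathbf{0}_{d/H,\,(H-h) \times d/H} \end{bmatrix}$$

and finally passing it through the unbiased outer projection **W**_{l}^{(MHA,O)} of the relevant MHA.

In the last term **c**_{t}, we collect all the biases. We don’t expect these offsets to be meaningful but rather to depict a side-effect of the architecture:

$$\begin{aligned}
\mathbf{c}_t = {} & \sum_{\lambda=1}^{\Lambda} \frac{\bigodot_{\lambda'=\lambda+1}^{\Lambda} \mathbf{g}_{\lambda'}}{\prod_{\lambda'=\lambda+1}^{\Lambda} s_{\lambda',t}} \odot \left( \mathbf{b}_\lambda^{(\mathrm{LN})} - \frac{\mathbf{g}_\lambda}{s_{\lambda,t}} \odot \left( m_{\lambda,t} \cdot \vec{1} \right) \right) \\
& + \sum_{l=1}^{L} \frac{\bigodot_{\lambda=2l-1}^{\Lambda} \mathbf{g}_\lambda}{\prod_{\lambda=2l-1}^{\Lambda} s_{\lambda,t}} \odot \left[ \mathbf{b}_l^{(\mathrm{MHA,O})} + \left( \bigoplus_{h=1}^{H} \mathbf{b}_{l,h}^{(V)} \right) \mathbf{W}_l^{(\mathrm{MHA,O})} \right] \\
& + \sum_{l=1}^{L} \frac{\bigodot_{\lambda=2l}^{\Lambda} \mathbf{g}_\lambda}{\prod_{\lambda=2l}^{\Lambda} s_{\lambda,t}} \odot \mathbf{b}_l^{(\mathrm{FF,O})}
\end{aligned} \tag{5}$$

The term **c**_{t} includes the biases $\mathbf{b}_\lambda^{(\mathrm{LN})}$ and mean-shifts $m_{\lambda,t} \cdot \vec{1}$ of the LNs, the outer projection biases of the FF submodules $\mathbf{b}_l^{(\mathrm{FF,O})}$, the outer projection bias in each MHA submodule $\mathbf{b}_l^{(\mathrm{MHA,O})}$ and the value projection biases, mapped through the outer MHA projection $\left( \bigoplus_h \mathbf{b}_{l,h}^{(V)} \right) \mathbf{W}_l^{(\mathrm{MHA,O})}$.^{3}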
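The bookkeeping behind this decomposition can be sketched on a toy stack of sublayers. The code below is a simplified illustration, not the authors' implementation: it handles a single position, uses made-up stand-in functions (`toy_mha`, `toy_ff`) that return unbiased sublayer outputs, and therefore only collects LN biases and mean shifts in the bias term. It tracks the four terms through every residual connection and LN, so that they can be compared with the final hidden state.

```python
import math

def decompose(x0, sublayers):
    """Run a toy stack of sublayers on one position and track the four terms.

    sublayers: list of (kind, fn, gain, ln_bias), with kind "mha" or "ff";
    fn is a hypothetical stand-in mapping the current hidden vector to an
    unbiased sublayer output (a real MHA would also read other positions).
    """
    d = len(x0)
    x = list(x0)
    i, f, h, c = list(x0), [0.0] * d, [0.0] * d, [0.0] * d
    for kind, fn, gain, ln_bias in sublayers:
        out = fn(x)
        pre = [o + v for o, v in zip(out, x)]            # residual connection
        m = sum(pre) / d
        s = math.sqrt(sum((p - m) ** 2 for p in pre) / d)
        scale = [g / s for g in gain]                    # g_lambda / s_lambda,t
        x = [sc * (p - m) + b for sc, p, b in zip(scale, pre, ln_bias)]
        # The new sublayer output joins its term before the LN rescaling.
        if kind == "ff":
            f = [a + o for a, o in zip(f, out)]
        else:
            h = [a + o for a, o in zip(h, out)]
        i = [sc * a for sc, a in zip(scale, i)]
        f = [sc * a for sc, a in zip(scale, f)]
        h = [sc * a for sc, a in zip(scale, h)]
        # LN biases and mean shifts accumulate in c.
        c = [sc * (a - m) + b for sc, a, b in zip(scale, c, ln_bias)]
    return x, i, f, h, c

def toy_mha(v):                      # made-up stand-in for an unbiased MHA output
    mean = sum(v) / len(v)
    return [0.5 * mean + 0.1 * a for a in v]

def toy_ff(v):                       # made-up stand-in for an unbiased FF output
    return [math.tanh(a) for a in v]

gain = [1.0, 0.9, 1.1, 1.2]
ln_bias = [0.0, 0.1, -0.1, 0.05]
layers = [("mha", toy_mha, gain, ln_bias), ("ff", toy_ff, gain, ln_bias)] * 2  # L = 2

e_t, i_t, f_t, h_t, c_t = decompose([0.3, -1.2, 0.8, 2.0], layers)
rebuilt = [sum(parts) for parts in zip(i_t, f_t, h_t, c_t)]
```

Here `rebuilt` matches `e_t` up to floating-point error, which is the additive structure that Equation (1) states; intermediate decompositions are available by stopping the loop early.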

### 2.3 Limitations of Equation (1)

The decomposition proposed in Equation (1) comes with a few caveats that are worth addressing explicitly. Most importantly, Equation (1) does not entail that the terms are independent from one another. For instance, the scaling factor $1/\prod_{\lambda} s_{\lambda,t}$ systematically depends on the magnitude of earlier hidden representations. Equation (1) only stresses that a Transformer embedding can be decomposed as a sum of the outputs of its submodules: It does not fully disentangle computations. We leave the precise definition of computation disentanglement and its elaboration for the Transformer to future research, and focus here on the decomposition proposed in Equation (1).

In all, the major issue at hand is the **f**_{t} term: It is the only term that cannot be derived as a linear composition of vectors, due to the non-linear function used in the FFs. Aside from the **f**_{t} term, non-linear computations all devolve into scalar corrections (namely, the LN *z*-scaling factors *s*_{λ,t} and *m*_{λ,t} and the attention weights *α*_{l,h}). As such, **f**_{t} is the single bottleneck that prevents us from entirely decomposing a Transformer embedding as a linear combination of sub-terms.

As the non-linear functions used in Transformers are generally either ReLU or GELU, which both behave almost linearly for a high enough input value, it is in principle possible that the FF submodules can be approximated by a purely linear transformation, depending on the exact set of parameters they converged onto. It is worth assessing this possibility. Here, we learn a least-squares linear regression mapping the *z*-scaled inputs of every FF to its corresponding *z*-scaled output. We use the BERT base uncased model of Devlin et al. (2019) and a random sample of 10,000 sentences from the Europarl English section (Koehn, 2005), or almost 900,000 word-piece tokens, and fit the regressions using all 900,000 embeddings.

Figure 2 displays the quality of these linear approximations, as measured by an *r*^{2} score. We see some variation across layers but never observe a perfect fit: 30% to 60% of the observed variance is not explained by a linear map, suggesting BERT actively exploits the non-linearity. That the model doesn’t simply circumvent the non-linear function to adopt a linear behavior intuitively makes sense: Adding the feed-forward terms is what prevents the model from devolving into a sum of bag-of-words and static embeddings. While such approaches have been successful (Mikolov et al., 2013; Mitchell and Lapata, 2010), a non-linearity ought to make the model more expressive.
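To make the *r*^{2} measurement concrete, here is a deliberately simplified, one-dimensional sketch: it fits a least-squares line to GELU applied to standard-normal inputs and computes the *r*^{2} of that fit. This is not the regression on BERT activations described above (which maps *d*-dimensional *z*-scaled inputs to *d*-dimensional *z*-scaled outputs), the numbers it produces are unrelated to Figure 2, and the function names are illustrative.

```python
import math
import random

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def r2_of_linear_fit(xs, ys):
    """r^2 of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

rng = random.Random(0)                       # fixed seed for reproducibility
xs = [rng.gauss(0.0, 1.0) for _ in range(5000)]
r2_gelu = r2_of_linear_fit(xs, [gelu(x) for x in xs])          # imperfect fit
r2_affine = r2_of_linear_fit(xs, [2.0 * x + 1.0 for x in xs])  # perfect fit
```

An affine target is recovered exactly, whereas the GELU target leaves part of the variance unexplained; the share left unexplained is the quantity that the discussion above relies on.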

In all, the sanity check in Figure 2 highlights that the interpretation of the **f**_{t} term is the major “black box” unanalyzable component remaining under Equation (1). As such, the recent interest in analyzing these modules (e.g., Geva et al., 2021; Zhao et al., 2021; Geva et al., 2022) is likely to have direct implications for the relevance of the present work. When adopting the linear decomposition approach we advocate, this problem can be further simplified: We only require a computationally efficient algorithm to map an input weighted sum of vectors through the non-linearity to an output weighted sum of vectors.^{4}

Note also that previous research stressed that Transformer layers exhibit a certain degree of commutativity (Zhao et al., 2021) and that additional computation can be injected between contiguous sublayers (Pfeiffer et al., 2020). This can be thought of as evidence pointing towards a certain independence of the computations done in each layer: If we can shuffle and add layers, then it seems reasonable to characterize sublayers based on what their outputs add to the total embedding, as we do in Equation (1).

Beyond the expectations we may have, it remains to be seen whether our proposed methodology is of actual use, that is, whether it is conducive to further research. The remainder of this article presents some analyses that our decomposition enables us to conduct.^{5}

## 3 Visualizing the Contents of Embeddings

One major question is that of the relative relevance of the different submodules of the architecture with respect to the overall output embedding. Studying the four terms **i**_{t}, **f**_{t}, **h**_{t}, and **c**_{t} can prove helpful in this endeavor. Given that Equations (2) to (5) are defined as sums across layers or sublayers, it is straightforward to adapt them to derive the decomposition for intermediate representations. Hence, we can study how relevant each of the four terms is to intermediary representations, and plot how this relevance evolves across layers.

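To that end, we define an *importance metric* to compare one of the terms **t**_{t} to the total **e**_{t}. We require it to be sensitive to co-directionality (i.e., whether **t**_{t} and **e**_{t} have similar directions) and relative magnitude (whether **t**_{t} is a major component of **e**_{t}). A normalized dot-product of the form:

$$\mu(\mathbf{e}_t, \mathbf{t}_t) = \frac{\langle \mathbf{e}_t, \mathbf{t}_t \rangle}{\lVert \mathbf{e}_t \rVert_2^2} \tag{6}$$

satisfies both of these requirements. Since the dot-product distributes over addition and **e**_{t} is the sum of the four terms, the four scores sum to 1, so that each score can be read as the proportion of **e**_{t} attributable to the corresponding term.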
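As a quick illustration of such a normalized dot-product, the sketch below assumes the normalization is by the squared norm of the full embedding, and uses made-up three-dimensional vectors for the four terms; the function name `importance` is hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def importance(e, term):
    """Dot-product of the full embedding with a term, over the squared norm of e."""
    return dot(e, term) / dot(e, e)

# Made-up toy terms; a real model would have d = 768 or more components.
terms = {
    "i": [0.2, -0.1, 0.0],
    "f": [1.0, 0.5, -0.3],
    "h": [0.4, 0.4, 0.1],
    "c": [-0.1, 0.2, 0.2],
}
e_t = [sum(vals) for vals in zip(*terms.values())]   # e_t = i_t + f_t + h_t + c_t
scores = {name: importance(e_t, vec) for name, vec in terms.items()}
total = sum(scores.values())
```

With this normalization the four scores add up to 1, although an individual score can be negative when a term points away from the full embedding.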

We use the same Europarl sample as in Section 2.3. We contrast embeddings from three related models: The BERT base uncased model and fine-tuned variants on CONLL 2003 NER (Tjong Kim Sang and De Meulder, 2003)^{6} and SQuAD v2 (Rajpurkar et al., 2018).^{7}

Figure 3 summarizes the relative importance of the four terms of Eq. (1), as measured by the normalized dot-product defined in Eq. (6); ticks on the *x*-axis correspond to different layers. Figures 3a to 3c display the evolution of our proportion metric across layers for all three BERT models, and Figures 3d to 3f display how our normalized dot-product measurements correlate across pairs of models using Spearman’s *ρ*.^{8}

Looking at Figure 3a, we can make a few important observations. The input term **i**_{t}, which corresponds to a static embedding, initially dominates the full output, but quickly decreases in prominence, until it reaches 0.045 at the last layer. This should explain why lower layers of Transformers generally give better performances on static word-type tasks (Vulić et al., 2020, among others). The **h**_{t} term is not as prominent as one could expect from the vast literature that focuses on MHA. Its normalized dot-product is barely above what we observe for **c**_{t}, and never averages above 0.3 in any layer. This can be partly attributed to the prominence of **f**_{t} and its normalized dot-product of 0.4 or above across most layers. As FF submodules are always the last component added to each hidden state, the sub-terms of **f**_{t} go through fewer LNs than those of **h**_{t}, and thus undergo fewer scalar multiplications, which likely affects their magnitude. Lastly, the term **c**_{t} is far from negligible: At layer 11, it is the most prominent term, and in the output embedding it accounts for up to 23%. Note that **c**_{t} defines a set of offsets embedded in a 2Λ-dimensional hyperplane (cf. Appendix B). In BERT base, 23% of the output can be expressed using a 50-dimensional vector, or 6.5% of the 768 dimensions of the model. This likely induces part of the anisotropy of Transformer embeddings (e.g., Ethayarajh, 2019; Timkey and van Schijndel, 2021), as the **c**_{t} term pushes the embedding towards a specific region of the space.

The fine-tuned models in Figures 3b and 3c are found to attribute a much lower proportion of the contextual embeddings to the **i**_{t} and **c**_{t} terms. While **f**_{t} seems to dominate in the final embedding, looking at the correlations in Figures 3d and 3e suggests that the **h**_{t} terms are those that undergo the most modifications. Proportions assigned to the terms correlate with those assigned in the non-finetuned model more in the case of lower layers than higher layers (Figures 3d and 3e). The required adaptations seem task-specific as the two fine-tuned models do not correlate highly with each other (Figure 3f). Lastly, updates in the NER model impact mostly layer 8 and upwards (Figure 3d), whereas the QA model (Figure 3e) sees important modifications to the **h**_{t} term at the first layer, suggesting that SQuAD requires more drastic adaptations than CONLL 2003.

## 4 The MLM Objective

An interesting follow-up question concerns which of the four terms allow us to retrieve the target word-piece. We consider two approaches: (a) Using the actual projection learned by the non-fine-tuned BERT model, or (b) learning a simple categorical regression for a specific term. We randomly select 15% of the word-pieces in our Europarl sample. As in the work of Devlin et al. (2019), 80% of these items are masked, 10% are replaced by a random word-piece, and 10% are left as is. Selected embeddings are then split between train (80%), validation (10%), and test (10%).
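The selection and corruption procedure described above can be sketched as follows. This is a minimal illustration on a made-up token sequence, assuming word-pieces are selected independently with probability 0.15; the function names, the toy vocabulary, and the seed are all hypothetical, and the actual experiments operate on BERT embeddings rather than raw strings.

```python
import random

def corrupt_for_mlm(tokens, vocab, rng, select_rate=0.15):
    """Return the corrupted sequence and the list of selected positions."""
    corrupted = list(tokens)
    selected = []
    for pos in range(len(tokens)):
        if rng.random() >= select_rate:
            continue                                  # not selected as a target
        selected.append(pos)
        draw = rng.random()
        if draw < 0.8:
            corrupted[pos] = "[MASK]"                 # 80%: masked
        elif draw < 0.9:
            corrupted[pos] = rng.choice(vocab)        # 10%: random word-piece
        # remaining 10%: the word-piece is left as is
    return corrupted, selected

def split_80_10_10(items, rng):
    """Shuffle and split into train (80%), validation (10%) and test (10%)."""
    items = list(items)
    rng.shuffle(items)
    n_train = int(0.8 * len(items))
    n_val = int(0.1 * len(items))
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]

rng = random.Random(0)                                # fixed seed for reproducibility
vocab = ["the", "a", "house", "vote", "##s", "parliament"]   # toy vocabulary
tokens = [rng.choice(vocab) for _ in range(1000)]
corrupted, selected = corrupt_for_mlm(tokens, vocab, rng)
train, val, held_out = split_80_10_10(selected, rng)
```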

Results are displayed in Table 2. The first row (“Default”) details predictions using the default output projection on the vocabulary, that is, we test the performances of combinations of sub-terms under the circumstances encountered by the model during training.^{9} The rows below (“Learned”) correspond to learned linear projections; the row marked *μ* displays the average performance across all 5 runs. Columns display the results of using the sum of 1, 2, 3, or 4 of the terms **i**_{t}, **h**_{t}, **f**_{t} and **c**_{t} to derive representations; for example, the rightmost corresponds to **i**_{t} + **h**_{t} + **f**_{t} + **c**_{t} (i.e., the full embedding), whereas the leftmost corresponds to predicting based on **i**_{t} alone. Focusing on the default projection first, we see that it benefits from a more extensive training: When using all four terms, it is almost 2% more accurate than learning one from scratch. On the other hand, learning a regression allows us to consider more specifically what can be retrieved from individual terms, as is apparent from the behavior of the **f**_{t}: When using the default output projection, we get 1.36% accuracy, whereas a learned regression yields 53.77%.

The default projection matrix is also highly dependent on the normalization offsets **c**_{t} and the FF terms **f**_{t} being added together: Removing this **c**_{t} term from any experiment using **f**_{t} is highly detrimental to the accuracy. On the other hand, combining the two produces the highest accuracy scores. Our logistic regressions show that most of this performance can be imputed to the **f**_{t} term. Learning a projection from the **f**_{t} term already yields an accuracy of almost 54%. On the other hand, a regression learned from **c**_{t} only has a limited performance of 9.72% on average. Interestingly, this is still above what one would observe if the model always predicted the most frequent word-piece (viz. “the”, 6% of the test targets): even these very semantically bare items can be exploited by a classifier. As **c**_{t} is tied to the LN *z*-scaling, this suggests that the magnitude of Transformer embeddings is not wholly meaningless.

In all, do FFs make the model more effective? The **f**_{t} term is necessary to achieve the highest accuracy on the training objective of BERT. On its own, it doesn’t achieve the highest performances: for that we also need to add the MHA outputs **h**_{t}. However, the performances we can associate to **f**_{t} on its own are higher than what we observe for **h**_{t}, suggesting that FFs make the Transformer architecture more effective on the MLM objective. This result connects with the work of Geva et al. (2021, 2022) who argue that FFs update the distribution over the full vocabulary, hence it makes sense that **f**_{t} would be most useful to the MLM task.

## 5 Lexical Contents and WSD

We now turn to look at how the vector spaces are organized, and which term yields the most linguistically appropriate space. We rely on word sense disambiguation (WSD), as distinct senses should yield different representations.

We consider an intrinsic KNN-based setup and an extrinsic probe-based setup. The former is inspired by Wiedemann et al. (2019): We assign to a target the most common label in its neighborhood. We restrict neighborhoods to words with the same annotated lemma and use the 5 nearest neighbors using cosine distance. The latter is a 2-layer MLP similar to Du et al. (2019), where the first layer is shared for all items and the second layer is lemma-specific. We use the NLTK Semcor dataset (Landes et al., 1998; Bird et al., 2009), with an 80%–10%–10% split. We drop monosemous or OOV lemmas and sum over word-pieces to convert them into single word representations. Table 3 shows accuracy results. Selecting the most frequent sense would yield an accuracy of 57%; picking a sense at random, 24%. The terms **i**_{t} and **c**_{t} struggle to outperform the former baseline: relevant KNN accuracy scores are lower, and corresponding probe accuracy scores are barely above.
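The KNN setup can be sketched as below: neighbors are restricted to items with the same lemma, ranked by cosine distance, and the most common sense among the *k* nearest is returned. The two-dimensional vectors, sense labels, and function names are invented for the example (and the toy call uses *k* = 3 because there are too few items for *k* = 5); the actual experiments use term vectors extracted from BERT.

```python
import math
from collections import Counter

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def knn_sense(vector, lemma, labelled, k=5):
    """labelled: list of (vector, lemma, sense) triples with known senses."""
    same_lemma = [(cosine_distance(vector, vec), sense)
                  for vec, lem, sense in labelled if lem == lemma]
    if not same_lemma:
        return None                                   # lemma unseen in the labelled set
    same_lemma.sort(key=lambda pair: pair[0])         # nearest first
    votes = Counter(sense for _, sense in same_lemma[:k])
    return votes.most_common(1)[0][0]

# Made-up labelled items: (vector, lemma, sense).
labelled = [
    ([1.0, 0.1], "bank", "money"),
    ([0.9, 0.2], "bank", "money"),
    ([0.1, 1.0], "bank", "river"),
    ([0.2, 0.9], "bank", "river"),
    ([1.0, 0.0], "plant", "factory"),
]
pred_money = knn_sense([1.0, 0.0], "bank", labelled, k=3)
pred_river = knn_sense([0.0, 1.0], "bank", labelled, k=3)
pred_unknown = knn_sense([1.0, 0.0], "spring", labelled, k=3)
```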

Overall, the same picture emerges from the KNN setup and all 5 runs of the classifier setup. The **f**_{t} term does not yield the highest performances in our experiment; instead, the **h**_{t} term systematically dominates. In single term models, **h**_{t} is ranked first and **f**_{t} second. As for sums of two terms, the setups ranked 1^{st}, 2^{nd}, and 3^{rd} are those that include **h**_{t}; setups ranked 3^{rd} to 5^{th}, those that include **f**_{t}. Even more surprisingly, when summing three of the terms, the highest ranked setup is the one where we exclude **f**_{t}, and the lowest corresponds to excluding **h**_{t}. Removing **f**_{t} systematically yields better performances than using the full embedding. This suggests that **f**_{t} is not necessarily helpful to the final representation for WSD. This contrasts with what we observed for MLM, where **h**_{t} was found to be less useful than **f**_{t}.

One argument that could be made here would be to posit that the predictions derived from the different sums of terms are intrinsically different, hence a purely quantitative ranking might not capture this important distinction. To verify whether this holds, we can look at the proportion of predictions that agree for any two models. Because our intent is to see what can be retrieved from specific subterms of the embedding, we focus solely on the most efficient classifiers across runs. This is summarized in Figure 4: An individual cell will detail the proportion of the assigned labels shared by the models for that row and that column. In short, we see that model predictions tend to overlap to a high degree. For both KNN and classifier setups, the three models that appear to make the most distinct predictions turn out to be computed from the **i**_{t} term, the **c**_{t} term or their sum: that is, the models that struggle to perform better than the MFS baseline and are derived from static representations.
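The pairwise agreement underlying such a comparison is simply the proportion of test items on which two models assign the same label. A minimal sketch, with invented predictions for three hypothetical setups:

```python
def agreement(preds_a, preds_b):
    """Proportion of items on which two models assign the same label."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must have the same length")
    return sum(a == b for a, b in zip(preds_a, preds_b)) / len(preds_a)

# Invented predictions over four test items for three hypothetical setups.
predictions = {
    "i": ["s1", "s1", "s1", "s1"],
    "h": ["s1", "s2", "s1", "s3"],
    "f+h": ["s1", "s2", "s2", "s3"],
}
names = list(predictions)
# One cell per (row, column) pair, as in a symmetric overlap matrix.
overlap = {(r, c): agreement(predictions[r], predictions[c])
           for r in names for c in names}
```

In this toy case the two setups that include **h** agree with each other more often than either agrees with the static-like setup, which is the kind of pattern the overlap matrix is meant to expose.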

## 6 Effects of Fine-tuning and NER

Downstream application can also be achieved through fine-tuning, that is, restarting a model’s training to derive better predictions on a narrower task. As we saw from Figures 3b and 3c, the modifications brought about by this second round of training are task-specific, meaning that an exhaustive experimental survey is out of our reach.

We consider the task of Named Entity Recognition, using the WNUT 2016 shared task dataset (Strauss et al., 2016). We contrast the performance of the non-finetuned BERT model to that of the aforementioned variant fine-tuned on the CONLL 2003 NER dataset using shallow probes.

Results are presented in Table 4. The very high variance we observe across runs is likely due to the smaller size of this dataset (46,469 training examples, as compared to the 142,642 of Section 5 or the 107,815 in Section 4). Fine-tuning BERT on another NER dataset unsurprisingly has a systematic positive impact: Average performance jumps up by 5% or more. More interesting is the impact this fine-tuning has on the **f**_{t} term: When used as sole input, the highest observed performance increases by over 8%, and similar improvements are observed consistently across all setups involving **f**_{t}. Yet, the best average performances for fine-tuned and base embeddings correspond to **h**_{t} (39.28% in tuned), **i**_{t} + **h**_{t} (39.21%), and **i**_{t} + **h**_{t} + **c**_{t} (39.06%); in the base setting the highest average performances are reached with **h**_{t} + **c**_{t} (33.40%), **i**_{t} + **h**_{t} + **c**_{t} (33.25%) and **h**_{t} (32.91%), suggesting that **f**_{t} might be superfluous for this task.

We can also look at whether the highest-scoring classifiers across runs produce different outputs. Given the high class imbalance of the dataset at hand, we macro-average the prediction overlaps by label. The result is shown in Figure 5; the upper triangle details the behavior of the untuned model, and the lower triangle details that of the NER-fine-tuned model. In this round of experiments, we see much more distinctly that the **i**_{t} model, the **c**_{t} model, and the **i**_{t} + **c**_{t} model behave markedly differently from the rest, with **c**_{t} yielding the most distinct predictions. As for the NER-fine-tuned model (lower triangle), aside from the aforementioned static representations, most predictions display a degree of overlap much higher than what we observe for the non-finetuned model: Both FFs and MHAs are skewed towards producing outputs more adapted to NER tasks.
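One possible reading of a macro-averaged overlap is sketched below. It is an assumption on our part that items are grouped by their gold label: the text does not specify whether gold or predicted labels are used. The agreement rate is computed within each group and then averaged across labels, so that a frequent label such as "O" does not dominate the score. All data and names are invented.

```python
from collections import defaultdict

def macro_overlap(gold, preds_a, preds_b):
    """Average, over gold labels, of the per-label agreement rate of two models.
    Grouping by gold label is an assumption made for this sketch."""
    per_label = defaultdict(lambda: [0, 0])      # label -> [agreements, count]
    for g, a, b in zip(gold, preds_a, preds_b):
        per_label[g][0] += int(a == b)
        per_label[g][1] += 1
    return sum(agree / count for agree, count in per_label.values()) / len(per_label)

# Invented, heavily imbalanced toy data: eight "O" items, two "PER" items.
gold = ["O"] * 8 + ["PER", "PER"]
model_a = ["O"] * 8 + ["PER", "O"]
model_b = ["O"] * 8 + ["O", "O"]

micro = sum(a == b for a, b in zip(model_a, model_b)) / len(gold)
macro = macro_overlap(gold, model_a, model_b)
```

On this toy data the plain agreement rate is high because of the majority label, while the macro-averaged value is lower since the two models disagree on half of the rare label.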

## 7 Relevant Work

The derivation we provide in Section 2 ties in well with other studies setting out to explain how Transformer embedding spaces are structured (Voita et al., 2019; Mickus et al., 2020; Vázquez et al., 2021, among others) and more broadly how they behave (Rogers et al., 2020). For instance, lower layers tend to yield higher performance on surface tasks (e.g., predicting the presence of a word, Jawahar et al., 2019) or static benchmarks (e.g., analogy, Vulić et al., 2020): This ties in with the vanishing prominence of **i**_{t} across layers. Likewise, probe-based approaches to unearth a linear structure matching with the syntactic structure of the input sentence (Raganato and Tiedemann, 2018; Hewitt and Manning, 2019, among others) can be construed as relying on the explicit linear dependence that we highlight here.

Another connection is with studies on embedding space anisotropy (Ethayarajh, 2019; Timkey and van Schijndel, 2021): Our derivation provides a means of circumscribing which neural components are likely to cause it. Also relevant is the study on sparsifying Transformer representations of Yun et al. (2021): The linearly dependent nature of Transformer embeddings has some implications when it comes to dictionary coding.

Also relevant are the works focusing on the interpretation of specific Transformer components, and feed-forward sublayers in particular (Geva et al., 2021; Zhao et al., 2021; Geva et al., 2022). Lastly, our approach provides some quantitative argument for the validity of attention-based studies (Serrano and Smith, 2019; Jain and Wallace, 2019; Wiegreffe and Pinter, 2019; Pruthi et al., 2020) and expands on earlier works looking beyond attention weights (Kobayashi et al., 2020).

## 8 Conclusions and Future Work

In this paper, we stress how Transformer embeddings can be decomposed linearly to describe the impact of each network component. We showcased how this additive structure can be used to investigate Transformers. Our approach suggests a less central place for attention-based studies: If multi-head attention only accounts for 30% of embeddings, can we possibly explain what Transformers do by looking solely at these submodules? The crux of our methodology lies in that we decompose the output embedding by submodule instead of layer or head. These approaches are not mutually exclusive (cf. Section 3), hence our approach can easily be combined with other probing protocols, providing the means to narrow in on specific network components.

The experiments we have conducted in Sections 3 to 6 were designed so as to showcase whether our decomposition in Equation (1) could yield useful results, or, as we put it earlier in Section 2.3, whether this approach could be conducive to future research. We were able to use the proposed approach to draw insightful connections. The noticeable anisotropy of contextual embeddings can be connected to the prominent trace of the biases in the output embedding: As model biases make up an important part of the whole embedding, they push it towards a specific sub-region of the embedding. The diminishing importance of **i**_{t} links back to earlier results on word-type semantic benchmarks. We also report novel findings, showcasing how some submodule outputs may be detrimental in specific scenarios: The output trace of FF modules was found to be extremely useful for MLM, whereas the **h**_{t} term was found to be crucial for WSD. Our methodology also allows for an overview of the impact of finetuning (cf. Section 6): It skews components towards more task-specific outputs, and its effects are especially noticeable in upper layers (Figures 3d and 3e).

Analyses in Sections 3 to 6 demonstrate the immediate insight that our Transformer decomposition can help achieve. This work therefore opens a number of research perspectives, of which we name three. First, as mentioned in Section 2.3, our approach can be extended further to more thoroughly disentangle computations. Second, while we focused here more on feed-forward and multi-head attention components, extracting the static component embeddings from **i**_{t} would allow for a principled comparison of contextual and static distributional semantics models. Last but not least, because our analysis highlights the different relative importance of Transformer components in different tasks, it can be used to help choose the most appropriate tools for further interpretation of trained models among the wealth of alternatives.

## Acknowledgments

We are highly indebted to Marianne Clausel for her significant help with how best to present the mathematical aspects of this work. Our thanks also go to Aman Sinha, as well as three anonymous reviewers for their substantial comments towards bettering this work.

This work was supported by a public grant overseen by the French National Research Agency (ANR) as part of the “Investissements d’Avenir” program: Idex *Lorraine Université d’Excellence* (reference: ANR-15-IDEX-0004). We also acknowledge the support by the FoTran project, funded by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement n° 771113).

## A Step-by-step Derivation of Eq. (1)

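Given a Transformer made of a stack of *L* layers, each comprising two sublayers, we can treat a Transformer as a stack of Λ = 2*L* sublayers. For notation simplicity, we link the sublayer index *λ* to the layer index *l*: The first sublayer of layer *l* is the (2*l* − 1)^{th} sublayer, and the second is the (2*l*)^{th} sublayer.^{10} All sublayers include a residual connection before the final LN:

$$\mathbf{y}_{\lambda,t} = \mathbf{g}_{\lambda} \odot \frac{S_{\lambda}(\mathbf{x}) + \mathbf{x} - m_{\lambda,t} \cdot \mathbf{1}}{s_{\lambda,t}} + \mathbf{b}_{\lambda}^{(\mathrm{LN})}$$

where *m*_{λ,t} and *s*_{λ,t} are the mean and standard deviation computed by the LN. We can model the effects of the gain **g**_{λ} and the scaling 1/*s*_{λ,t} as the *d* × *d* square matrix:

$$\mathbf{T}_{\lambda} = \frac{1}{s_{\lambda,t}} \operatorname{diag}(\mathbf{g}_{\lambda})$$

which we use to rewrite a sublayer output **y**_{λ,t} as:

$$\mathbf{y}_{\lambda,t} = \left(S_{\lambda}(\mathbf{x}) + \mathbf{x} - m_{\lambda,t} \cdot \mathbf{1}\right) \mathbf{T}_{\lambda} + \mathbf{b}_{\lambda}^{(\mathrm{LN})}$$

We can then consider what happens in the next sublayer. We define **T**_{λ+1} as previously and remark that, as both **T**_{λ} and **T**_{λ+1} only contain diagonal entries:

$$\mathbf{T}_{\lambda} \mathbf{T}_{\lambda+1} = \frac{1}{s_{\lambda,t} \, s_{\lambda+1,t}} \operatorname{diag}(\mathbf{g}_{\lambda} \odot \mathbf{g}_{\lambda+1})$$

Let us now pass an input **x** through a complete layer, that is, through sublayers *λ* and *λ* + 1:

$$\mathbf{y}_{\lambda+1,t} = \left(S_{\lambda+1}(\mathbf{y}_{\lambda,t}) + \mathbf{y}_{\lambda,t} - m_{\lambda+1,t} \cdot \mathbf{1}\right) \mathbf{T}_{\lambda+1} + \mathbf{b}_{\lambda+1}^{(\mathrm{LN})}$$

Substituting in the expression for **y**_{λ} from above:

$$\begin{aligned} \mathbf{y}_{\lambda+1,t} ={}& S_{\lambda+1}(\mathbf{y}_{\lambda,t}) \mathbf{T}_{\lambda+1} + S_{\lambda}(\mathbf{x}) \mathbf{T}_{\lambda} \mathbf{T}_{\lambda+1} + \mathbf{x} \mathbf{T}_{\lambda} \mathbf{T}_{\lambda+1} \\ &- (m_{\lambda,t} \cdot \mathbf{1}) \mathbf{T}_{\lambda} \mathbf{T}_{\lambda+1} + \mathbf{b}_{\lambda}^{(\mathrm{LN})} \mathbf{T}_{\lambda+1} - (m_{\lambda+1,t} \cdot \mathbf{1}) \mathbf{T}_{\lambda+1} + \mathbf{b}_{\lambda+1}^{(\mathrm{LN})} \end{aligned}$$

As we are interested in the combined effects of a layer, we only consider the case where *S*_{λ} is a MHA mechanism and *S*_{λ+1} a FF. We start by reformulating the output of a MHA. Recall that attention heads can be seen as weighted sums of value vectors (Kobayashi et al., 2020). Due to the softmax normalization, attention weights *α*_{t,1}, …, *α*_{t,n} sum to 1 for any position *t*. Hence:

$$\sum_{t'=1}^{n} \alpha_{t,t'} \left(\mathbf{x}_{t'} \mathbf{W}_{h}^{(\mathrm{V})} + \mathbf{b}_{h}^{(\mathrm{V})}\right) = \left(\sum_{t'=1}^{n} \alpha_{t,t'} \, \mathbf{x}_{t'} \mathbf{W}_{h}^{(\mathrm{V})}\right) + \mathbf{b}_{h}^{(\mathrm{V})}$$

To account for all *H* heads in a MHA, we concatenate these head-specific sums and pass them through the output projection **W**^{(MHA,O)}. As such, we can denote the unbiased output of the MHA and the associated bias as:

$$\tilde{\mathbf{h}}_{l,t} = \sum_{h=1}^{H} \sum_{t'=1}^{n} \alpha_{l,h,t,t'} \, \mathbf{x}_{l,t'} \mathbf{Z}_{l,h}, \qquad \mathbf{b}_{l}^{(\mathrm{MHA})} = \mathbf{b}_{l}^{(\mathrm{MHA,O})} + \left(\bigoplus_{h=1}^{H} \mathbf{b}_{l,h}^{(\mathrm{V})}\right) \mathbf{W}_{l}^{(\mathrm{MHA,O})}$$

with **Z**_{l,h} as introduced in (4). By substituting the actual sublayer functions in our previous equation:

$$\begin{aligned} \mathbf{y}_{l,t} ={}& \tilde{\mathbf{f}}_{l,t} \mathbf{T}_{\lambda+1} + \mathbf{b}_{l}^{(\mathrm{FF,O})} \mathbf{T}_{\lambda+1} + \tilde{\mathbf{h}}_{l,t} \mathbf{T}_{\lambda} \mathbf{T}_{\lambda+1} + \mathbf{b}_{l}^{(\mathrm{MHA})} \mathbf{T}_{\lambda} \mathbf{T}_{\lambda+1} + \mathbf{x} \mathbf{T}_{\lambda} \mathbf{T}_{\lambda+1} \\ &- (m_{\lambda,t} \cdot \mathbf{1}) \mathbf{T}_{\lambda} \mathbf{T}_{\lambda+1} + \mathbf{b}_{\lambda}^{(\mathrm{LN})} \mathbf{T}_{\lambda+1} - (m_{\lambda+1,t} \cdot \mathbf{1}) \mathbf{T}_{\lambda+1} + \mathbf{b}_{\lambda+1}^{(\mathrm{LN})} \end{aligned}$$

Here, the output of the sublayer function at *λ* + 1 will correspond to the output of the FF for layer *l*, i.e., $\tilde{\mathbf{f}}_{l,t} + \mathbf{b}_{l}^{(\mathrm{FF,O})}$, and similarly the output for sublayer *λ* should be that of the MHA of layer *l*, or $\tilde{\mathbf{h}}_{l,t} + \mathbf{b}_{l}^{(\mathrm{MHA})}$. To match Eq. (1), rewrite as:

$$\begin{aligned} \mathbf{y}_{l,t} &= \mathbf{i}_{\lambda+1,t} + \mathbf{h}_{\lambda+1,t} + \mathbf{f}_{\lambda+1,t} + \mathbf{c}_{\lambda+1,t} \\ \mathbf{i}_{\lambda+1,t} &= \mathbf{x}_{\lambda,t} \mathbf{T}_{\lambda} \mathbf{T}_{\lambda+1} \\ \mathbf{h}_{\lambda+1,t} &= \tilde{\mathbf{h}}_{l,t} \mathbf{T}_{\lambda} \mathbf{T}_{\lambda+1} \\ \mathbf{f}_{\lambda+1,t} &= \tilde{\mathbf{f}}_{l,t} \mathbf{T}_{\lambda+1} \\ \mathbf{c}_{\lambda+1,t} &= \left(\mathbf{b}_{l}^{(\mathrm{MHA})} - m_{\lambda,t} \cdot \mathbf{1}\right) \mathbf{T}_{\lambda} \mathbf{T}_{\lambda+1} + \mathbf{b}_{\lambda}^{(\mathrm{LN})} \mathbf{T}_{\lambda+1} + \left(\mathbf{b}_{l}^{(\mathrm{FF,O})} - m_{\lambda+1,t} \cdot \mathbf{1}\right) \mathbf{T}_{\lambda+1} + \mathbf{b}_{\lambda+1}^{(\mathrm{LN})} \end{aligned}$$

where **x**_{λ,t} is the *t*^{th} input for sublayer *λ*; that is, the above characterizes the output of sublayer *λ* + 1 with respect to the input of sublayer *λ*. Passing the output **y**_{l,t} into the next layer *l* + 1 (i.e., through sublayers *λ* + 2 and *λ* + 3) then gives:

$$\begin{aligned} \mathbf{y}_{l+1,t} ={}& \mathbf{y}_{l,t} \mathbf{T}_{\lambda+2} \mathbf{T}_{\lambda+3} + \tilde{\mathbf{h}}_{l+1,t} \mathbf{T}_{\lambda+2} \mathbf{T}_{\lambda+3} + \tilde{\mathbf{f}}_{l+1,t} \mathbf{T}_{\lambda+3} \\ &+ \left(\mathbf{b}_{l+1}^{(\mathrm{MHA})} - m_{\lambda+2,t} \cdot \mathbf{1}\right) \mathbf{T}_{\lambda+2} \mathbf{T}_{\lambda+3} + \mathbf{b}_{\lambda+2}^{(\mathrm{LN})} \mathbf{T}_{\lambda+3} + \left(\mathbf{b}_{l+1}^{(\mathrm{FF,O})} - m_{\lambda+3,t} \cdot \mathbf{1}\right) \mathbf{T}_{\lambda+3} + \mathbf{b}_{\lambda+3}^{(\mathrm{LN})} \end{aligned}$$

This logic carries on across layers. Writing $\mathbf{b}_{\lambda'}^{(\mathrm{S})}$ for the bias of the sublayer function at *λ*′ (i.e., $\mathbf{b}_{l'}^{(\mathrm{MHA})}$ if *λ*′ = 2*l*′ − 1 and $\mathbf{b}_{l'}^{(\mathrm{FF,O})}$ if *λ*′ = 2*l*′), the output after *k* layers can be written for any *k* ≥ 1 as:^{11}

$$\begin{aligned} \mathbf{y}_{l+k-1,t} ={}& \mathbf{x}_{\lambda,t} \prod_{\lambda'=\lambda}^{\lambda+2k-1} \mathbf{T}_{\lambda'} + \sum_{l'=l}^{l+k-1} \tilde{\mathbf{h}}_{l',t} \prod_{\lambda'=2l'-1}^{\lambda+2k-1} \mathbf{T}_{\lambda'} + \sum_{l'=l}^{l+k-1} \tilde{\mathbf{f}}_{l',t} \prod_{\lambda'=2l'}^{\lambda+2k-1} \mathbf{T}_{\lambda'} \\ &+ \sum_{\lambda'=\lambda}^{\lambda+2k-1} \left[ \left(\mathbf{b}_{\lambda'}^{(\mathrm{S})} - m_{\lambda',t} \cdot \mathbf{1}\right) \prod_{\lambda''=\lambda'}^{\lambda+2k-1} \mathbf{T}_{\lambda''} + \mathbf{b}_{\lambda'}^{(\mathrm{LN})} \prod_{\lambda''=\lambda'+1}^{\lambda+2k-1} \mathbf{T}_{\lambda''} \right] \end{aligned}$$

Applying this to all Λ sublayers, starting from the initial input **x**_{0,t}, we obtain Eqs. (1) to (5).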
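The derivation for a single layer can be checked numerically. The following is a minimal sketch under stated assumptions, not the released implementation: a linear map stands in for the MHA and a one-layer tanh network for the FF (both hypothetical), a single position is considered, and the small epsilon of practical LN implementations is omitted. It verifies that the output of two consecutive sublayers equals the sum of an input term, a MHA term, a FF term, and a bias term, each rescaled by the diagonal matrices **T**.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy embedding size (assumption)
ones = np.ones(d)

def layer_norm(u, g, b):
    """LN of a single vector u (epsilon omitted for simplicity)."""
    return g * (u - u.mean()) / u.std() + b

def ln_matrix(u, g):
    """T = diag(g) / s, where s is the standard deviation of the LN input u."""
    return np.diag(g) / u.std()

# Hypothetical stand-ins for one layer; biases are kept apart from the
# unbiased sublayer outputs, as in the derivation.
W_mha, b_mha = rng.normal(size=(d, d)), rng.normal(size=d)
W_ff, b_ff = rng.normal(size=(d, d)), rng.normal(size=d)
g1, b_ln1 = rng.normal(size=d), rng.normal(size=d)
g2, b_ln2 = rng.normal(size=d), rng.normal(size=d)
x = rng.normal(size=d)

# Reference forward pass through the two sublayers of the layer.
h_tilde = x @ W_mha               # unbiased output of the MHA-like sublayer
u1 = h_tilde + b_mha + x          # residual connection
y1 = layer_norm(u1, g1, b_ln1)
f_tilde = np.tanh(y1 @ W_ff)      # unbiased output of the FF-like sublayer
u2 = f_tilde + b_ff + y1          # residual connection
y2 = layer_norm(u2, g2, b_ln2)

# Same computation, expressed as four additive terms.
T1, T2 = ln_matrix(u1, g1), ln_matrix(u2, g2)
i_term = x @ T1 @ T2
h_term = h_tilde @ T1 @ T2
f_term = f_tilde @ T2
c_term = ((b_mha - u1.mean() * ones) @ T1 @ T2 + b_ln1 @ T2
          + (b_ff - u2.mean() * ones) @ T2 + b_ln2)

reconstruction = i_term + h_term + f_term + c_term
max_abs_error = np.abs(reconstruction - y2).max()
print(f"max |reconstruction - y2| = {max_abs_error:.2e}")
```

The same check extends to deeper stacks by multiplying every existing term by the **T** matrices of the additional sublayers and adding the new sublayer outputs and biases.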

## B Hyperplane Bounds of **c**_{t}

The vectors **p**_{λ} and **q**_{λ} are constant across all inputs. Assuming their linear independence puts an upper bound of 2Λ vectors necessary to express **c**_{t}.
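The bound can be illustrated numerically on toy data. The sketch below assumes random stand-ins `P` and `Q` for the constant vectors **p**_{λ} and **q**_{λ}, together with random input-dependent scalars; none of these values come from a trained model. It checks that vectors built as scalar combinations of 2Λ constant vectors span a subspace of dimension at most 2Λ, however many inputs are considered.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy sizes (assumptions): embedding size, number of sublayers, number of inputs.
d, n_sublayers, n_inputs = 64, 6, 200

# Hypothetical stand-ins for the constant vectors p and q, one pair per sublayer.
P = rng.normal(size=(n_sublayers, d))
Q = rng.normal(size=(n_sublayers, d))

# Input-dependent scalars: one pair of coefficients per input and per sublayer.
a = rng.normal(size=(n_inputs, n_sublayers))
b = rng.normal(size=(n_inputs, n_sublayers))

# Each row is a toy bias term: a scalar combination of the constant vectors.
C = a @ P + b @ Q

rank = np.linalg.matrix_rank(C)
print(f"rank of the toy bias terms: {rank} (bound: {2 * n_sublayers}, d = {d})")
```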

## C Computational Details

In Section 2.3, we use the default hyperparameters of scikit-learn (Pedregosa et al., 2011). In Section 4, we learn categorical regressions using an AdamW optimizer (Loshchilov and Hutter, 2019) and iterate 20 times over the train set; hyperparameters (learning rate, weight decay, dropout, and the *β*_{1} and *β*_{2} AdamW hyperparameters) are set using Bayes Optimization (Snoek et al., 2012), with 50 hyperparameter samples and accuracy as the objective. In Section 5, learning rate, dropout, weight decay, *β*_{1} and *β*_{2}, and learning rate scheduling are selected with Bayes Optimization, using 100 samples and accuracy as the objective. In Section 6, we learn shallow logistic regressions, setting hyperparameters with Bayes Optimization, using 100 samples and macro-*f*_{1} as the objective. Experiments were run on a 4GB NVIDIA GPU.
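To make the role of the tuned hyperparameters concrete, the sketch below spells out a single AdamW update with decoupled weight decay, following the general form given by Loshchilov and Hutter (2019) with the schedule multiplier left out. It is an illustration only: the function name `adamw_step` is hypothetical, and the default values shown are common library defaults, not the values selected by the hyperparameter search.

```python
import numpy as np

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One AdamW update; t is the 1-based step count.

    Returns the updated weights and the updated moment estimates.
    """
    # Exponential moving averages of the gradient and of its square.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction for the zero-initialized moment estimates.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: applied to w directly, not added to the gradient.
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v

# Usage on a toy quadratic loss (assumption): loss(w) = ||w - target||^2.
target = np.array([0.5, 0.5])
w = np.array([1.0, -2.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 4):
    grad = 2 * (w - target)
    w, m, v = adamw_step(w, grad, m, v, t, lr=1e-2)
```

Here *β*_{1} and *β*_{2} control how quickly the two moving averages forget past gradients, while the weight decay shrinks the weights independently of the gradient statistics.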

## D Ethical Considerations

The offset method of Mikolov et al. (2013) is known to also model social stereotypes (Bolukbasi et al., 2016, among others). Some of the sub-representations of our decomposition may exhibit stronger biases than the whole embedding **e**_{t}, and can yield higher performances than focusing on the whole embedding (e.g., Table 3). This could provide an undesirable incentive to deploy NLP models with higher performances and stronger systemic biases.

## Notes

We empirically verified that components from attested embeddings **e**_{t} and those derived from Eq. (1) are systematically equal up to ± 10^{−7}.

In the case of relative positional embeddings applied to value projections (Shaw et al., 2018), it is rather straightforward to follow the same logic so as to include relative positional offset in the most appropriate term.

Code for our experiments is available at the following URL: https://github.com/TimotheeMickus/bert-splat.

Layer 0 is the layer normalization conducted before the first sublayer, hence **f**_{t} and **h**_{t} are undefined here.

We thank an anonymous reviewer for pointing out that the BERT model ties input and output embeddings; we leave investigating the implications of this fact for future work.

In the case of BERT, we also need to include a LN before the first layer, which is straightforward if we index it as *λ* = 0.

The edge case $\prod_{\lambda'=\lambda+1}^{\lambda} \mathbf{T}_{\lambda'}$ is taken to be the identity matrix **I**_{d}, for notation simplicity.

## References

## Author notes

The work described in the present paper was conducted chiefly while at ATILF.

Action Editor: Dani Yogatama