Pretrained embeddings based on the Transformer architecture have taken the NLP community by storm. We show that they can mathematically be reframed as a sum of vector factors and showcase how to use this reframing to study the impact of each component. We provide evidence that multi-head attentions and feed-forwards are not equally useful in all downstream applications, as well as a quantitative overview of the effects of finetuning on the overall embedding space. This approach allows us to draw connections to a wide range of previous studies, from vector space anisotropy to attention weights.

The Transformer architecture (Vaswani et al., 2017) has taken the NLP community by storm. Based on the attention mechanism (Bahdanau et al., 2015; Luong et al., 2015), it was shown to outperform recurrent architectures on a wide variety of tasks. Another step was taken with pretrained language models derived from this architecture (BERT, Devlin et al., 2019, among others): they now embody the default approach to a vast swath of NLP applications. Success breeds scrutiny, and the popularity of these models has fostered research in explainable NLP into the behavior of pretrained language models (Rogers et al., 2020).

In this paper, we develop a novel decomposition of Transformer output embeddings. Our approach consists in quantifying the contribution of each network submodule to the output contextual embedding, and grouping those into four terms: (i) what relates to the input for a given position, (ii) what pertains to feed-forward submodules, (iii) what corresponds to multi-head attention, and (iv) what is due to vector biases.

This allows us to investigate Transformer embeddings without relying on attention weights or treating the entire model as a black box, as is most often done in the literature. The usefulness of our method is demonstrated on BERT: Our case study yields enlightening connections to state-of-the-art work on Transformer explainability, evidence that multi-head attentions and feed-forwards are not equally useful in all downstream applications, as well as an overview of the effects of finetuning on the embedding space. We also provide a simple and intuitive measurement of the importance of any term in this decomposition with respect to the whole embedding.

We will provide insights on the Transformer architecture in Section 2, and showcase how these insights can translate into experimental investigations in Sections 3 to 6. We will conclude with connections to other relevant works in Section 7 and discuss future perspectives in Section 8.

We show that the Transformer embedding $\mathbf{e}_t$ for a token $t$ is a sum of four terms:
$$\mathbf{e}_t = \mathbf{i}_t + \mathbf{f}_t + \mathbf{h}_t + \mathbf{c}_t \tag{1}$$
where $\mathbf{i}_t$ can be thought of as a classical static embedding, $\mathbf{f}_t$ and $\mathbf{h}_t$ are the cumulative contributions at every layer of the feed-forward submodules and the MHAs respectively, and $\mathbf{c}_t$ corresponds to biases accumulated across the model.

Equation (1) provides interpretable and quantifiable terms that can explain the behavior of specific components of the Transformer architecture. More precisely, it characterizes what is the impact of adding another sublayer on top of what was previously computed: the terms in Equation (1) are defined as sums across (sub)layers; hence we can track how a given sublayer transforms its input, and show that this effect can be thought of as adding another vector to a previous sum. This layer-wise sum of submodule outputs also allows us to provide a first estimate of which parameters are most relevant to the overall embedding space: a submodule whose output is systematically negligible has its parameters set so that its influence on subsequent computations is minimal.

The formulation in Equation (1) more generally relies on the additive structure of Transformer embedding spaces. We start by reviewing the Transformer architecture in Section 2.1, before discussing our decomposition in greater detail in Section 2.2 and known limitations in Section 2.3.

2.1 Transformer Encoder Architecture

Let’s start by characterizing the Transformer architecture of Vaswani et al. (2017) in the notation described in Table 1.

Table 1: Notation.

$\mathbf{A}$ : matrix
$(\mathbf{A})_t$ : $t$th row of $\mathbf{A}$
$\mathbf{a}$ : (row) vector
$a, \alpha$ : scalars
$\mathbf{W}^{(M)}$ : item linked to submodule $M$
$\mathbf{a} \oplus \mathbf{b}$ : concatenation of vectors $\mathbf{a}$ and $\mathbf{b}$
$\bigoplus_n \mathbf{a}_n$ : $\mathbf{a}_1 \oplus \mathbf{a}_2 \oplus \cdots \oplus \mathbf{a}_n$
$\mathbf{a} \odot \mathbf{b}$ : element-wise multiplication of $\mathbf{a}$ and $\mathbf{b}$
$\bigodot_n \mathbf{a}_n$ : $\mathbf{a}_1 \odot \mathbf{a}_2 \odot \cdots \odot \mathbf{a}_n$
$\mathbf{1}$ : vector with all components set to 1
$\mathbf{0}_{m,n}$ : null matrix of shape $m \times n$
$\mathbf{I}_n$ : identity matrix of shape $n \times n$

Transformers are often defined using three hyperparameters: the number of layers L, the dimensionality of the hidden representations d, and H, the number of attention heads in multi-head attentions. Formally, a Transformer model is a stack of sublayers. A visual representation is shown in Figure 1. Two sublayers are stacked to form a single Transformer layer: The first corresponds to a multi-head attention mechanism (MHA), and the second to a feed-forward (FF). A Transformer with L layers contains Λ = 2L sublayers. In Figure 1 two sublayers (in blue) are grouped into one layer, and L layers are stacked one after the other.

Figure 1: Overview of a Transformer encoder.

Each sublayer is centered around a specific sublayer function. Sublayer functions map an input x to an output y, and can either be feed-forward submodules or multi-head attention submodules.

FFs are subnets of the form:
$$\mathrm{FF}(\mathbf{x}_t) = \phi\left(\mathbf{x}_t\mathbf{W}^{(\mathrm{FF},I)} + \mathbf{b}^{(\mathrm{FF},I)}\right)\mathbf{W}^{(\mathrm{FF},O)} + \mathbf{b}^{(\mathrm{FF},O)}$$
where $\phi$ is a non-linear function, such as ReLU or GELU (Hendrycks and Gimpel, 2016). Here, $(\dots,I)$ and $(\dots,O)$ distinguish the input and output linear projections, and the index $t$ corresponds to the token position. Input and output dimensions are equal, whereas the intermediary layer dimension $b$ (i.e., the size of the hidden representations to which the non-linear function $\phi$ is applied) is larger, typically $b = 1024$ or $2048$. In other words, $\mathbf{W}^{(\mathrm{FF},I)}$ is of shape $d \times b$, $\mathbf{b}^{(\mathrm{FF},I)}$ of size $b$, $\mathbf{W}^{(\mathrm{FF},O)}$ of shape $b \times d$, and $\mathbf{b}^{(\mathrm{FF},O)}$ of size $d$.
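To make the shapes concrete, here is a minimal PyTorch sketch of such a feed-forward sublayer function; the default dimensions and the choice of GELU are illustrative placeholders rather than the configuration of any specific checkpoint.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FF sublayer function: phi(x W_I + b_I) W_O + b_O."""

    def __init__(self, d: int = 768, b: int = 2048):
        super().__init__()
        self.inner = nn.Linear(d, b)  # W^(FF,I) and b^(FF,I)
        self.outer = nn.Linear(b, d)  # W^(FF,O) and b^(FF,O)
        self.phi = nn.GELU()          # the non-linearity phi

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (..., d); the same map is applied at every position t.
        return self.outer(self.phi(self.inner(x)))
```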
MHAs are concatenations of scaled-dot attention heads:
$$\mathrm{MHA}(\mathbf{x}_1, \dots, \mathbf{x}_n)_t = \left(\bigoplus_{h=1}^{H}(\mathbf{A}_h)_t\right)\mathbf{W}^{(\mathrm{MHA},O)} + \mathbf{b}^{(\mathrm{MHA},O)}$$
where $(\mathbf{A}_h)_t$ is the $t$th row vector of the following $n \times d/H$ matrix $\mathbf{A}_h$:
$$\mathbf{A}_h = \mathrm{softmax}\!\left(\frac{\mathbf{Q}_h\mathbf{K}_h^{T}}{\sqrt{d/H}}\right)\mathbf{V}_h$$
with $h$ an index tracking attention heads. The parameter matrix $\mathbf{W}^{(\mathrm{MHA},O)}$ is of shape $d \times d$, and the bias $\mathbf{b}^{(\mathrm{MHA},O)}$ of size $d$. The queries $\mathbf{Q}_h$, keys $\mathbf{K}_h$, and values $\mathbf{V}_h$ are simple linear projections of shape $n \times (d/H)$, computed from all inputs $\mathbf{x}_1, \dots, \mathbf{x}_n$:
$$(\mathbf{Q}_h)_t = \mathbf{x}_t\mathbf{W}_h^{(Q)} + \mathbf{b}_h^{(Q)}, \qquad (\mathbf{K}_h)_t = \mathbf{x}_t\mathbf{W}_h^{(K)} + \mathbf{b}_h^{(K)}, \qquad (\mathbf{V}_h)_t = \mathbf{x}_t\mathbf{W}_h^{(V)} + \mathbf{b}_h^{(V)}$$
where the weight matrices $\mathbf{W}_h^{(Q)}$, $\mathbf{W}_h^{(K)}$, and $\mathbf{W}_h^{(V)}$ are of shape $d \times (d/H)$, with $H$ the number of attention heads, and the biases $\mathbf{b}_h^{(Q)}$, $\mathbf{b}_h^{(K)}$, and $\mathbf{b}_h^{(V)}$ are of size $d/H$. This component is often analyzed in terms of attention weights $\alpha_h$, which correspond to the softmax of the scaled dot-product between keys and queries. In other words, the product $\mathrm{softmax}\!\left(\mathbf{Q}_h\mathbf{K}_h^{T}/\sqrt{d/H}\right)$ can be thought of as an $n \times n$ matrix of weights in an average over the transformed input vectors $\mathbf{x}_{t'}\mathbf{W}_h^{(V)} + \mathbf{b}_h^{(V)}$ (Kobayashi et al., 2020, Eqs. (1) to (4)): multiplying these weights with the value projection $\mathbf{V}_h$ yields a weighted sum of value projections:
$$(\mathbf{A}_h)_t = \sum_{t'=1}^{n}\alpha_{h,t,t'}\left(\mathbf{x}_{t'}\mathbf{W}_h^{(V)} + \mathbf{b}_h^{(V)}\right)$$
where $\alpha_{h,t,t'}$ is the component at row $t$ and column $t'$ of this attention weights matrix.
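This reading of a head as a weighted average of value projections can be sketched in a few lines of NumPy; the array names and shapes below are ours, chosen for illustration.

```python
import numpy as np

def softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_q, W_k, W_v, b_q, b_k, b_v):
    """One scaled-dot attention head A_h over inputs X of shape (n, d)."""
    Q, K, V = X @ W_q + b_q, X @ W_k + b_k, X @ W_v + b_v  # each of shape (n, d/H)
    alpha = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))        # (n, n) attention weights
    # Row t of the result is sum_{t'} alpha[t, t'] * (x_{t'} W_v + b_v),
    # i.e., a weighted average of value projections.
    return alpha @ V, alpha
```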
Lastly, after each sublayer function $\mathcal{S}$, a residual connection and a layer normalization (LN, Ba et al., 2016) are applied:
$$\mathrm{LN}\!\left(\mathcal{S}(\mathbf{x}_t) + \mathbf{x}_t\right) = \left(\frac{\mathcal{S}(\mathbf{x}_t) + \mathbf{x}_t - m_t\cdot\mathbf{1}}{s_t}\right)\odot\mathbf{g} + \mathbf{b}^{(\mathrm{LN})}$$

The gain $\mathbf{g}$ and bias $\mathbf{b}^{(\mathrm{LN})}$ are learned parameters with $d$ components each; $m_t\cdot\mathbf{1}$ is the vector $(1, \dots, 1)$ scaled by the mean component value $m_t$ of the input vector $\mathcal{S}(\mathbf{x}_t) + \mathbf{x}_t$; $s_t$ is the standard deviation of the component values of this input. As such, a LN performs a z-scaling, followed by the application of the gain $\mathbf{g}$ and the bias $\mathbf{b}^{(\mathrm{LN})}$.
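A minimal NumPy sketch of this residual-plus-LN step (real implementations also add a small epsilon to the denominator for numerical stability, omitted here):

```python
import numpy as np

def residual_ln(sublayer_out: np.ndarray, residual: np.ndarray,
                gain: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Residual connection followed by LN: z-scale S(x_t) + x_t, then apply g and b."""
    v = sublayer_out + residual      # S(x_t) + x_t
    m_t = v.mean()                   # mean component value
    s_t = v.std()                    # standard deviation of the components
    return ((v - m_t) / s_t) * gain + bias
```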

To kick-start computations, a sequence of static vector representations $\mathbf{x}_{0,1}, \dots, \mathbf{x}_{0,n}$ with $d$ components each is fed into the first layer. This initial input corresponds to the sum of a static lookup word embedding and a positional encoding.1
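Putting the pieces of this section together, the following hedged PyTorch sketch shows how the Λ = 2L sublayers are stacked; the hyperparameter defaults are illustrative, and the extra LN that BERT applies before the first sublayer (see footnote 1) is omitted.

```python
import torch
import torch.nn as nn

class EncoderSketch(nn.Module):
    """A stack of Lambda = 2L sublayers: each layer applies a MHA then a FF,
    each followed by a residual connection and a layer normalization."""

    def __init__(self, L: int = 12, d: int = 768, H: int = 12, b: int = 2048):
        super().__init__()
        self.mhas = nn.ModuleList(
            [nn.MultiheadAttention(d, H, batch_first=True) for _ in range(L)])
        self.ffs = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, b), nn.GELU(), nn.Linear(b, d)) for _ in range(L)])
        self.lns = nn.ModuleList([nn.LayerNorm(d) for _ in range(2 * L)])

    def forward(self, x0: torch.Tensor) -> torch.Tensor:
        # x0: (batch, n, d) static inputs (lookup embedding + positional encoding)
        x = x0
        for l, (mha, ff) in enumerate(zip(self.mhas, self.ffs)):
            attn, _ = mha(x, x, x)
            x = self.lns[2 * l](attn + x)        # sublayer 2l - 1: MHA, residual, LN
            x = self.lns[2 * l + 1](ff(x) + x)   # sublayer 2l: FF, residual, LN
        return x
```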

2.2 Mathematical Re-framing

We now turn to the decomposition proposed in Equation (1): $\mathbf{e}_t = \mathbf{i}_t + \mathbf{f}_t + \mathbf{h}_t + \mathbf{c}_t$.2 We provide a derivation in Appendix A.

The term $\mathbf{i}_t$ corresponds to the input embedding (i.e., the positional encoding, the input word-type embedding, and the segment encoding in BERT-like models), after having gone through all the LN gains and rescaling:
$$\mathbf{i}_t = \left(\bigodot_{\lambda=1}^{\Lambda}\frac{1}{s_{\lambda,t}}\mathbf{g}_\lambda\right)\odot\mathbf{x}_{0,t} \tag{2}$$
where $\lambda$ ranges over all $\Lambda = 2L$ sublayers. Here, the $\mathbf{g}_\lambda$ correspond to the learned gain parameters of the LNs, whereas the $s_{\lambda,t}$ scalars derive from the z-scaling performed in the $\lambda$th LN, as defined above. The input $\mathbf{x}_{0,t}$ consists of the sum of a static lookup embedding and a positional encoding; as such, it resembles an uncontextualized embedding.
The next two terms capture the outputs of specific submodules, either FFs or MHAs. As such, their importance and usefulness will differ from task to task. The term $\mathbf{f}_t$ is the sum of the outputs of the FF submodules. Submodule outputs pass through the LNs of all the layers above, hence:
$$\mathbf{f}_t = \sum_{l=1}^{L}\left(\bigodot_{\lambda=2l}^{\Lambda}\frac{1}{s_{\lambda,t}}\mathbf{g}_\lambda\right)\odot\tilde{\mathbf{f}}_{l,t} \tag{3}$$
where $\tilde{\mathbf{f}}_{l,t} = \phi\left(\mathbf{x}^{(\mathrm{FF})}_{l,t}\mathbf{W}_l^{(\mathrm{FF},I)} + \mathbf{b}_l^{(\mathrm{FF},I)}\right)\mathbf{W}_l^{(\mathrm{FF},O)}$ is the unbiased output at position $t$ of the FF submodule of layer $l$.
The term $\mathbf{h}_t$ corresponds to the sum across layers of each MHA output, having passed through the relevant LNs. As MHAs are entirely linear, we can further describe each output as a sum over all $H$ heads of a weighted bag-of-words of the input representations to that submodule. Or:
$$\mathbf{h}_t = \sum_{l=1}^{L}\left(\bigodot_{\lambda=2l-1}^{\Lambda}\frac{1}{s_{\lambda,t}}\mathbf{g}_\lambda\right)\odot\left(\sum_{h=1}^{H}\sum_{t'=1}^{n}\alpha_{l,h,t,t'}\,\mathbf{x}_{l,t'}\,\mathbf{Z}_{l,h}\right) \tag{4}$$
where $\mathbf{Z}_{l,h} = \mathbf{W}_{l,h}^{(V)}\mathbf{M}_h\mathbf{W}_l^{(\mathrm{MHA},O)}$ corresponds to passing an input embedding through the unbiased value projection $\mathbf{W}_{l,h}^{(V)}$ of head $h$, then projecting it from a $d/H$-dimensional subspace onto a $d$-dimensional space using a zero-padded identity matrix:
$$\mathbf{M}_h = \begin{pmatrix}\mathbf{0}_{d/H,\,(h-1)\cdot d/H} & \mathbf{I}_{d/H} & \mathbf{0}_{d/H,\,(H-h)\cdot d/H}\end{pmatrix}$$
and finally passing it through the unbiased outer projection $\mathbf{W}_l^{(\mathrm{MHA},O)}$ of the relevant MHA.
In the last term $\mathbf{c}_t$, we collect all the biases. We don't expect these offsets to be meaningful, but rather to depict a side-effect of the architecture:
$$\begin{aligned}\mathbf{c}_t ={}& \sum_{\lambda=1}^{\Lambda}\left(\bigodot_{\lambda'=\lambda+1}^{\Lambda}\frac{1}{s_{\lambda',t}}\mathbf{g}_{\lambda'}\right)\odot\left(\mathbf{b}_\lambda^{(\mathrm{LN})} - \frac{m_{\lambda,t}}{s_{\lambda,t}}\mathbf{g}_\lambda\right) + \sum_{l=1}^{L}\left(\bigodot_{\lambda=2l}^{\Lambda}\frac{1}{s_{\lambda,t}}\mathbf{g}_\lambda\right)\odot\mathbf{b}_l^{(\mathrm{FF},O)}\\ &+ \sum_{l=1}^{L}\left(\bigodot_{\lambda=2l-1}^{\Lambda}\frac{1}{s_{\lambda,t}}\mathbf{g}_\lambda\right)\odot\left(\left(\bigoplus_{h=1}^{H}\mathbf{b}_{l,h}^{(V)}\right)\mathbf{W}_l^{(\mathrm{MHA},O)} + \mathbf{b}_l^{(\mathrm{MHA},O)}\right)\end{aligned} \tag{5}$$
The concatenation $\bigoplus_h\mathbf{b}_{l,h}^{(V)}$ here is equivalent to a sum of zero-padded projections: $\sum_h\mathbf{b}_{l,h}^{(V)}\mathbf{M}_h$. This term $\mathbf{c}_t$ includes the biases $\mathbf{b}_\lambda^{(\mathrm{LN})}$ and mean-shifts $m_{\lambda,t}\cdot\mathbf{1}$ of the LNs, the outer projection biases of the FF submodules $\mathbf{b}_l^{(\mathrm{FF},O)}$, the outer projection bias in each MHA submodule $\mathbf{b}_l^{(\mathrm{MHA},O)}$, and the value projection biases, mapped through the outer MHA projection: $\left(\bigoplus_h\mathbf{b}_{l,h}^{(V)}\right)\mathbf{W}_l^{(\mathrm{MHA},O)}$.3
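As an illustration of the bookkeeping behind Equations (2) to (5), the toy sketch below (ours, not the code used in our experiments) propagates a running four-term decomposition through one residual-plus-LN step and checks that the updated terms still sum to the sublayer output.

```python
import numpy as np

def add_sublayer(terms: dict, key: str, unbiased_out: np.ndarray,
                 sublayer_bias: np.ndarray, gain: np.ndarray,
                 ln_bias: np.ndarray) -> dict:
    """Given terms {'i','f','h','c'} summing to the sublayer input x_t, return updated
    terms summing to LN(S(x_t) + x_t), where S(x_t) = unbiased_out + sublayer_bias.
    `key` is 'f' for a FF sublayer and 'h' for a MHA sublayer."""
    x = sum(terms.values())
    pre_ln = unbiased_out + sublayer_bias + x
    m, s = pre_ln.mean(), pre_ln.std()
    scale = gain / s                                    # g_lambda / s_lambda,t, elementwise
    new = {k: v * scale for k, v in terms.items()}      # previous terms pass through the LN
    new[key] = new[key] + unbiased_out * scale          # credit the submodule's unbiased output
    new["c"] = new["c"] + sublayer_bias * scale + ln_bias - (m / s) * gain  # collect biases
    return new

# Sanity check on random data: the four terms still sum to the LN output.
d = 8
rng = np.random.default_rng(0)
terms = {"i": rng.normal(size=d), "f": np.zeros(d), "h": np.zeros(d), "c": np.zeros(d)}
out, bias, g, b_ln = rng.normal(size=(4, d))
new = add_sublayer(terms, "h", out, bias, g, b_ln)
v = out + bias + sum(terms.values())
assert np.allclose(sum(new.values()), (v - v.mean()) / v.std() * g + b_ln)
```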

2.3 Limitations of Equation (1)

The decomposition proposed in Equation (1) comes with a few caveats that are worth addressing explicitly. Most importantly, Equation (1) does not entail that the terms are independent from one another. For instance, the scaling factor 1/sλ,t systematically depends on the magnitude of earlier hidden representations. Equation (1) only stresses that a Transformer embedding can be decomposed as a sum of the outputs of its submodules: It does not fully disentangle computations. We leave the precise definition of computation disentanglement and its elaboration for the Transformer to future research, and focus here on the decomposition proposed in Equation (1).

In all, the major issue at hand is the ft term: It is the only term that cannot be derived as a linear composition of vectors, due to the non-linear function used in the FFs. Aside from the ft term, non-linear computations all devolve into scalar corrections (namely, the LN z-scaling factors sλ,t and mλ,t and the attention weights αl,h). As such, ft is the single bottleneck that prevents us from entirely decomposing a Transformer embedding as a linear combination of sub-terms.

As the non-linear functions used in Transformers are generally either ReLU or GELU, which both behave almost linearly for a high enough input value, it is in principle possible that the FF submodules can be approximated by a purely linear transformation, depending on the exact set of parameters they converged onto. It is worth assessing this possibility. Here, we learn a least-squares linear regression mapping the z-scaled inputs of every FF to its corresponding z-scaled output. We use the BERT base uncased model of Devlin et al. (2019) and a random sample of 10,000 sentences from the Europarl English section (Koehn, 2005), or almost 900,000 word-piece tokens, and fit the regressions using all 900,000 embeddings.
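A minimal sketch of this sanity check with scikit-learn, assuming the z-scaled FF inputs and outputs of a given layer have already been extracted into arrays X and Y (these names are ours):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def ff_linear_fit_quality(X: np.ndarray, Y: np.ndarray) -> float:
    """Fit a least-squares linear map from FF inputs X to FF outputs Y
    (both of shape (n_tokens, d)) and report the r^2 of the approximation."""
    reg = LinearRegression().fit(X, Y)
    return r2_score(Y, reg.predict(X))
```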

Figure 2 displays the quality of these linear approximations, as measured by an $r^2$ score. We see some variation across layers but never observe a perfect fit: 30% to 60% of the observed variance is not explained by a linear map, suggesting BERT actively exploits the non-linearity. That the model does not simply circumvent the non-linear function to adopt a linear behavior intuitively makes sense: Adding the feed-forward terms is what prevents the model from devolving into a sum of bag-of-words and static embeddings. While such approaches have been successful (Mikolov et al., 2013; Mitchell and Lapata, 2010), a non-linearity ought to make the model more expressive.

Figure 2: Fitting the $\mathbf{f}_t$ term: $r^2$ across layers.

In all, the sanity check in Figure 2 highlights that the ft term remains the major “black box” unanalyzable component under Equation (1). As such, the recent interest in analyzing these modules (e.g., Geva et al., 2021; Zhao et al., 2021; Geva et al., 2022) is likely to have direct implications for the relevance of the present work. When adopting the linear decomposition approach we advocate, this problem can be further simplified: We only require a computationally efficient algorithm to map an input weighted sum of vectors through the non-linearity to an output weighted sum of vectors.4

Remark also that previous research has stressed that Transformer layers exhibit a certain degree of commutativity (Zhao et al., 2021) and that additional computation can be injected between contiguous sublayers (Pfeiffer et al., 2020). This can be thought of as evidence pointing towards a certain independence of the computations done in each layer: If we can shuffle and add layers, then it seems reasonable to characterize sublayers based on what their outputs add to the total embedding, as we do in Equation (1).

Beyond the expectations we may have, it remains to be seen whether our proposed methodology is of actual use, that is, whether it is conducive to further research. The remainder of this article presents some analyses that our decomposition enables us to conduct.5

One major question is that of the relative relevance of the different submodules of the architecture with respect to the overall output embedding. Studying the four terms it, ft, ht, and ct can prove helpful in this endeavor. Given that Equations (2) to (5) are defined as sums across layers or sublayers, it is straightforward to adapt them to derive the decomposition for intermediate representations. Hence, we can study how relevant each of the four terms is to intermediary representations, and plot how this relevance evolves across layers.

To that end, we propose an importance metric to compare one of the terms $\mathbf{t}_t$ to the total $\mathbf{e}_t$. We require it to be sensitive to co-directionality (i.e., whether $\mathbf{t}_t$ and $\mathbf{e}_t$ have similar directions) and relative magnitude (whether $\mathbf{t}_t$ is a major component of $\mathbf{e}_t$). A normalized dot-product of the form:
$$\frac{\mathbf{t}_t\,\mathbf{e}_t^{T}}{\|\mathbf{e}_t\|_2^{2}} \tag{6}$$
satisfies both of these requirements. As the dot-product distributes over addition (i.e., $\mathbf{a}\left(\sum_i\mathbf{b}_i\right)^{T} = \sum_i\mathbf{a}\,\mathbf{b}_i^{T}$) and the dot-product of a vector with itself is its magnitude squared (i.e., $\mathbf{a}\mathbf{a}^{T} = \|\mathbf{a}\|_2^{2}$):
$$\frac{\mathbf{i}_t\,\mathbf{e}_t^{T}}{\|\mathbf{e}_t\|_2^{2}} + \frac{\mathbf{f}_t\,\mathbf{e}_t^{T}}{\|\mathbf{e}_t\|_2^{2}} + \frac{\mathbf{h}_t\,\mathbf{e}_t^{T}}{\|\mathbf{e}_t\|_2^{2}} + \frac{\mathbf{c}_t\,\mathbf{e}_t^{T}}{\|\mathbf{e}_t\|_2^{2}} = \frac{\mathbf{e}_t\,\mathbf{e}_t^{T}}{\|\mathbf{e}_t\|_2^{2}} = 1$$
Hence this function intuitively measures the importance of a term relative to the total.
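In code, the metric of Equation (6) and its additivity are straightforward; a short NumPy sketch:

```python
import numpy as np

def importance(term: np.ndarray, total: np.ndarray) -> float:
    """Normalized dot-product of Eq. (6): share of `total` explained by `term`."""
    return float(term @ total) / float(total @ total)

# The four importances sum to 1 because i + f + h + c = e.
rng = np.random.default_rng(0)
i, f, h, c = rng.normal(size=(4, 768))
e = i + f + h + c
assert abs(sum(importance(t, e) for t in (i, f, h, c)) - 1.0) < 1e-9
```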

We use the same Europarl sample as in Section 2.3. We contrast embeddings from three related models: The BERT base uncased model and fine-tuned variants on CONLL 2003 NER (Tjong Kim Sang and De Meulder, 2003)6 and SQuAD v2 (Rajpurkar et al., 2018).7

Figure 3 summarizes the relative importance of the four terms of Eq. (1), as measured by the normalized dot-product defined in Eq. (6); ticks on the x-axis correspond to different layers. Figures 3a to 3c display the evolution of our proportion metric across layers for all three BERT models, and Figures 3d to 3f display how our normalized dot-product measurements correlate across pairs of models using Spearman’s ρ.8

Figure 3: Relative importance of main terms.

Looking at Figure 3a, we can make a few important observations. The input term it, which corresponds to a static embedding, initially dominates the full output, but quickly decreases in prominence, until it reaches 0.045 at the last layer. This should explain why lower layers of Transformers generally give better performances on static word-type tasks (Vulić et al., 2020, among others). The ht term is not as prominent as one could expect from the vast literature that focuses on MHA. Its normalized dot-product is barely above what we observe for ct, and never averages above 0.3 across any layer. This can partly be attributed to the prominence of ft, whose normalized dot-product is 0.4 or above across most layers. As FF submodules are always the last component added to each hidden state, the sub-terms of ft go through fewer LNs than those of ht, and thus undergo fewer scalar multiplications, which likely affects their magnitude. Lastly, the term ct is far from negligible: At layer 11, it is the most prominent term, and it accounts for up to 23% of the output embedding. Note that ct defines a set of offsets embedded in a 2Λ-dimensional hyperplane (cf. Appendix B). In BERT base, 23% of the output can be expressed using a 50-dimensional vector, or 6.5% of the 768 dimensions of the model. This likely induces part of the anisotropy of Transformer embeddings (e.g., Ethayarajh, 2019; Timkey and van Schijndel, 2021), as the ct term pushes the embedding towards a specific region of the space.

The fine-tuned models in Figures 3b and 3c are found to impart a much lower proportion of the contextual embeddings to the it and ct terms. While ft seems to dominate in the final embedding, looking at the correlations in Figures 3d and 3e suggests that the ht terms are those that undergo the most modifications. Proportions assigned to the terms correlate more strongly with those of the non-finetuned model in lower layers than in higher layers (Figures 3d and 3e). The required adaptations seem task-specific, as the two fine-tuned models do not correlate highly with each other (Figure 3f). Lastly, updates in the NER model mostly impact layer 8 and upwards (Figure 3d), whereas the QA model (Figure 3e) sees important modifications to the ht term at the first layer, suggesting that SQuAD requires more drastic adaptations than CONLL 2003.

An interesting follow-up question concerns which of the four terms allow us to retrieve the target word-piece. We consider two approaches: (a) Using the actual projection learned by the non-fine-tuned BERT model, or (b) learning a simple categorical regression for a specific term. We randomly select 15% of the word-pieces in our Europarl sample. As in the work of Devlin et al. (2019), 80% of these items are masked, 10% are replaced by a random word-piece, and 10% are left as is. Selected embeddings are then split between train (80%), validation (10%), and test (10%).
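The “Learned” setups reported below amount to a single linear map from a chosen combination of terms to the word-piece vocabulary, trained with a cross-entropy objective; the following sketch uses placeholder hyperparameters (the actual values were selected as described in our experimental setup) and full-batch updates for brevity.

```python
import torch
import torch.nn as nn

def train_term_probe(reps: torch.Tensor, targets: torch.Tensor,
                     vocab_size: int, epochs: int = 20, lr: float = 1e-3) -> nn.Linear:
    """Categorical regression from term representations (n, d) to word-piece ids (n,)."""
    probe = nn.Linear(reps.shape[-1], vocab_size)
    opt = torch.optim.AdamW(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(probe(reps), targets).backward()
        opt.step()
    return probe
```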

Results are displayed in Table 2. The first row (“Default”) details predictions using the default output projection on the vocabulary, that is, we test the performances of combinations of sub-terms under the circumstances encountered by the model during training.9 The rows below (“Learned”) correspond to learned linear projections; the row marked μ displays the average performance across all 5 runs. Columns display the results of using the sum of 1, 2, 3, or 4 of the terms it, ht, ft, and ct to derive representations; for example, the rightmost column corresponds to it +ht +ft +ct (i.e., the full embedding), whereas the leftmost corresponds to predicting based on it alone. Focusing on the default projection first, we see that it benefits from a more extensive training: When using all four terms, it is almost 2% more accurate than learning one from scratch. On the other hand, learning a regression allows us to consider more specifically what can be retrieved from individual terms, as is apparent from the behavior of ft: When using the default output projection, we get 1.36% accuracy, whereas a learned regression yields 53.77%.

Table 2: Masked language model accuracy (in %). Cells in underlined bold font indicate best performance per setup across runs. Cell color indicates the ranking of setups within a run. Rows marked μ contain average performance; rows marked σ contain the standard deviation across runs.

The default projection matrix is also highly dependent on the normalization offsets ct and the FF terms ft being added together: Removing this ct term from any experiment using ft is highly detrimental to the accuracy. On the other hand, combining the two produces the highest accuracy scores. Our logistic regressions show that most of this performance can be imputed to the ft term. Learning a projection from the ft term already yields an accuracy of almost 54%. On the other hand, a regression learned from ct only has a limited performance of 9.72% on average. Interestingly, this is still above what one would observe if the model always predicted the most frequent word-piece (viz. the, 6% of the test targets): even these very semantically bare items can be exploited by a classifier. As ct is tied to the LN z-scaling, this suggests that the magnitude of Transformer embeddings is not wholly meaningless.

In all, do FFs make the model more effective? The ft term is necessary to achieve the highest accuracy on the training objective of BERT. On its own, it doesn’t achieve the highest performances: for that we also need to add the MHA outputs ht. However, the performances we can associate to ft on its own are higher than what we observe for ht, suggesting that FFs make the Transformer architecture more effective on the MLM objective. This result connects with the work of Geva et al. (2021, 2022) who argue that FFs update the distribution over the full vocabulary, hence it makes sense that ft would be most useful to the MLM task.

We now turn to look at how the vector spaces are organized, and which term yields the most linguistically appropriate space. We rely on word sense disambiguation (WSD), as distinct senses should yield different representations.

We consider an intrinsic KNN-based setup and an extrinsic probe-based setup. The former is inspired by Wiedemann et al. (2019): We assign to a target the most common label in its neighborhood. We restrict neighborhoods to words with the same annotated lemma and use the 5 nearest neighbors according to cosine distance. The latter is a 2-layer MLP similar to Du et al. (2019), where the first layer is shared for all items and the second layer is lemma-specific. We use the nltk SemCor dataset (Landes et al., 1998; Bird et al., 2009), with an 80%–10%–10% split. We drop monosemous or OOV lemmas and sum over word-pieces to convert them into single word representations. Table 3 shows accuracy results. Selecting the most frequent sense would yield an accuracy of 57%; picking a sense at random, 24%. The terms it and ct struggle to outperform the former baseline: relevant KNN accuracy scores are lower, and corresponding probe accuracy scores are barely above.
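A sketch of the KNN setup under our assumptions about the data layout: each test embedding is compared only to training embeddings sharing its annotated lemma, using cosine similarity and k = 5; the function and variable names are ours.

```python
import numpy as np
from collections import Counter

def knn_wsd(query: np.ndarray, lemma: str, train_vecs: np.ndarray,
            train_lemmas: list, train_senses: list, k: int = 5) -> str:
    """Assign the majority sense among the k nearest same-lemma training embeddings."""
    idx = [i for i, lem in enumerate(train_lemmas) if lem == lemma]
    cands = train_vecs[idx]
    # cosine similarity (equivalent to ranking by cosine distance)
    sims = cands @ query / (np.linalg.norm(cands, axis=1) * np.linalg.norm(query))
    nearest = np.argsort(-sims)[:k]
    return Counter(train_senses[idx[i]] for i in nearest).most_common(1)[0][0]
```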

Table 3: Accuracy on SemCor WSD (in %).

Overall, the same picture emerges from the KNN setup and all 5 runs of the classifier setup. The ft term does not yield the highest performances in our experiment; instead, the ht term systematically dominates. In single-term models, ht is ranked first and ft second. As for sums of two terms, the setups ranked 1st, 2nd, and 3rd are those that include ht; setups ranked 3rd to 5th, those that include ft. Even more surprisingly, when summing three of the terms, the highest ranked setup is the one where we exclude ft, and the lowest corresponds to excluding ht. Removing ft systematically yields better performances than using the full embedding. This suggests that ft is not necessarily helpful to the final representation for WSD. This contrasts with what we observed for MLM, where ht was found to be less useful than ft.

One argument that could be made here is that the predictions derived from the different sums of terms are intrinsically different, hence a purely quantitative ranking might not capture this important distinction. To verify whether this holds, we can look at the proportion of predictions that agree for any two models. Because our intent is to see what can be retrieved from specific sub-terms of the embedding, we focus solely on the most efficient classifiers across runs. This is summarized in Figure 4: An individual cell details the proportion of the assigned labels shared by the models for that row and that column. In short, we see that model predictions overlap to a high degree. For both the KNN and classifier setups, the three models that appear to make the most distinct predictions turn out to be computed from the it term, the ct term, or their sum: that is, the models that struggle to perform better than the MFS baseline and are derived from static representations.
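The agreement reported in Figure 4 is simply the proportion of identical predictions for each pair of models; a minimal sketch:

```python
import numpy as np

def agreement_table(predictions: dict) -> dict:
    """Pairwise proportion of identical predictions; `predictions` maps a model
    name to an array of predicted labels over the same test items."""
    return {(a, b): float(np.mean(predictions[a] == predictions[b]))
            for a in predictions for b in predictions}
```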

Figure 4: Prediction agreement for WSD models (in %). Upper triangle: agreement for KNNs; lower triangle: for learned classifiers.

Downstream application can also be achieved through fine-tuning, that is, restarting a model's training to derive better predictions on a narrower task. As we saw from Figures 3b and 3c, the modifications brought about by this second round of training are task-specific, meaning that an exhaustive experimental survey is out of our reach.

We consider the task of Named Entity Recognition, using the WNUT 2016 shared task dataset (Strauss et al., 2016). We contrast the performance of the non-finetuned BERT model to that of the aforementioned variant fine-tuned on the CONLL 2003 NER dataset using shallow probes.

Results are presented in Table 4. The very high variance we observe across runs is likely due to the smaller size of this dataset (46,469 training examples, as compared to the 142,642 of Section 5 or the 107,815 in Section 4). Fine-tuning BERT on another NER dataset unsurprisingly has a systematic positive impact: Average performance jumps up by 5% or more. More interesting is the impact this fine-tuning has on the ft term: When used as sole input, the highest observed performance increases by over 8%, and similar improvements are observed consistently across all setups involving ft. Yet, the best average performances for fine-tuned embeddings correspond to ht (39.28%), it +ht (39.21%), and it +ht +ct (39.06%); in the base setting, the highest average performances are reached with ht +ct (33.40%), it +ht +ct (33.25%), and ht (32.91%), suggesting that ft might be superfluous for this task.

Table 4: Macro-f1 on WNUT 2016 (in %).

We can also look at whether the highest-scoring classifiers across runs produce different outputs. Given the high class imbalance of the dataset at hand, we macro-average the prediction overlaps by label. The result is shown in Figure 5; the upper triangle details the behavior of the untuned model, and the lower triangle details that of the NER-fine-tuned model. In this round of experiments, we see much more distinctly that the it model, the ct model, and the it +ct model behave markedly differently from the rest, with ct yielding the most distinct predictions. As for the NER-fine-tuned model (lower triangle), aside from the aforementioned static representations, most predictions display a degree of overlap much higher than what we observe for the non-finetuned model: Both FFs and MHAs are skewed towards producing outputs more adapted to NER tasks.

Figure 5: NER prediction agreement (macro-average, in %). Upper triangle: agreement for untuned models; lower triangle: for tuned models.

The derivation we provide in Section 2 ties in well with other studies setting out to explain how Transformer embedding spaces are structured (Voita et al., 2019; Mickus et al., 2020; Vázquez et al., 2021, among others) and more broadly how they behave (Rogers et al., 2020). For instance, lower layers tend to yield higher performance on surface tasks (e.g., predicting the presence of a word, Jawahar et al., 2019) or static benchmarks (e.g., analogy, Vulić et al., 2020): This ties in with the vanishing prominence of it across layers. Likewise, probe-based approaches that unearth a linear structure matching the syntactic structure of the input sentence (Raganato and Tiedemann, 2018; Hewitt and Manning, 2019, among others) can be construed as relying on the explicit linear dependence that we highlight here.

Another connection is with studies on embedding space anisotropy (Ethayarajh, 2019; Timkey and van Schijndel, 2021): Our derivation provides a means of circumscribing which neural components are likely to cause it. Also relevant is the study on sparsifying Transformer representations of Yun et al. (2021): The linearly dependent nature of Transformer embeddings has some implications when it comes to dictionary coding.

Also relevant are the works focusing on the interpretation of specific Transformer components, and feed-forward sublayers in particular (Geva et al., 2021; Zhao et al., 2021; Geva et al., 2022). Lastly, our approach provides some quantitative argument for the validity of attention-based studies (Serrano and Smith, 2019; Jain and Wallace, 2019; Wiegreffe and Pinter, 2019; Pruthi et al., 2020) and expands on earlier works looking beyond attention weights (Kobayashi et al., 2020).

In this paper, we stress how Transformer embeddings can be decomposed linearly to describe the impact of each network component. We showcased how this additive structure can be used to investigate Transformers. Our approach suggests a less central place for attention-based studies: If multi-head attention only accounts for 30% of embeddings, can we possibly explain what Transformers do by looking solely at these submodules? The crux of our methodology lies in that we decompose the output embedding by submodule instead of layer or head. These approaches are not mutually exclusive (cf. Section 3), hence our approach can easily be combined with other probing protocols, providing the means to narrow in on specific network components.

The experiments we have conducted in Sections 3 to 6 were designed so as to showcase whether our decomposition in Equation (1) could yield useful results, or, as we put it earlier in Section 2.3, whether this approach could be conducive to future research. We were able to use the proposed approach to draw insightful connections. The noticeable anisotropy of contextual embeddings can be connected to the prominent trace of the biases in the output embedding: As model biases make up an important part of the whole embedding, they push it towards a specific sub-region of the embedding space. The diminishing importance of it links back to earlier results on word-type semantic benchmarks. We also report novel findings, showcasing how some submodule outputs may be detrimental in specific scenarios: The output trace of FF modules was found to be extremely useful for MLM, whereas the ht term was found to be crucial for WSD. Our methodology also allows for an overview of the impact of finetuning (cf. Section 6): It skews components towards more task-specific outputs, and its effects are especially noticeable in upper layers (Figures 3d and 3e).

Analyses in Sections 3 to 6 demonstrate the immediate insight that our Transformer decomposition can help achieve. This work therefore opens a number of research perspectives, of which we name three. First, as mentioned in Section 2.3, our approach can be extended further to more thoroughly disentangle computations. Second, while we focused here more on feed-forward and multi-head attention components, extracting the static component embeddings from it would allow for a principled comparison of contextual and static distributional semantics models. Last but not least, because our analysis highlights the different relative importance of Transformer components in different tasks, it can be used to help choose the most appropriate tools for further interpretation of trained models among the wealth of alternatives.

We are highly indebted to Marianne Clausel for her significant help with how best to present the mathematical aspects of this work. Our thanks also go to Aman Sinha, as well as three anonymous reviewers for their substantial comments towards bettering this work.

This work was supported by a public grant overseen by the French National Research Agency (ANR) as part of the “Investissements d’Avenir” program: Idex Lorraine Université d’Excellence (reference: ANR-15-IDEX-0004). We also acknowledge the support by the FoTran project, funded by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement n° 771113).

Given that a Transformer model consists of a stack of L layers, each comprising two sublayers, we can treat a Transformer as a stack of Λ = 2L sublayers. For notation simplicity, we link the sublayer index λ to the layer index l: The first sublayer of layer l is the (2l − 1)th sublayer, and the second is the (2l)th sublayer.10 All sublayers include a residual connection before the final LN:
$$\mathbf{y}_{\lambda,t} = \mathrm{LN}_\lambda\!\left(\mathcal{S}_\lambda(\mathbf{x}_{\lambda,t}) + \mathbf{x}_{\lambda,t}\right) = \left(\frac{\mathcal{S}_\lambda(\mathbf{x}_{\lambda,t}) + \mathbf{x}_{\lambda,t} - m_{\lambda,t}\cdot\mathbf{1}}{s_{\lambda,t}}\right)\odot\mathbf{g}_\lambda + \mathbf{b}_\lambda^{(\mathrm{LN})}$$
We can model the effects of the gain $\mathbf{g}_\lambda$ and the scaling $1/s_{\lambda,t}$ as the $d \times d$ square matrix:
$$\mathbf{T}_\lambda = \frac{1}{s_{\lambda,t}}\,\mathrm{diag}(\mathbf{g}_\lambda)$$
which we use to rewrite a sublayer output $\mathbf{y}_{\lambda,t}$ as:
$$\mathbf{y}_{\lambda,t} = \left(\mathcal{S}_\lambda(\mathbf{x}_{\lambda,t}) + \mathbf{x}_{\lambda,t}\right)\mathbf{T}_\lambda + \mathbf{b}_\lambda^{(\mathrm{LN})} - \frac{m_{\lambda,t}}{s_{\lambda,t}}\mathbf{g}_\lambda$$
We can then consider what happens to this additive structure in the next sublayer. We first define $\mathbf{T}_{\lambda+1}$ as previously and remark that, as both $\mathbf{T}_\lambda$ and $\mathbf{T}_{\lambda+1}$ only contain diagonal entries:
$$\mathbf{T}_\lambda\mathbf{T}_{\lambda+1} = \frac{1}{s_{\lambda,t}\,s_{\lambda+1,t}}\,\mathrm{diag}\!\left(\mathbf{g}_\lambda\odot\mathbf{g}_{\lambda+1}\right)$$
This generalizes for any sequence of LNs as:
$$\prod_{\lambda'=\lambda}^{\Lambda}\mathbf{T}_{\lambda'} = \left(\prod_{\lambda'=\lambda}^{\Lambda}\frac{1}{s_{\lambda',t}}\right)\mathrm{diag}\!\left(\bigodot_{\lambda'=\lambda}^{\Lambda}\mathbf{g}_{\lambda'}\right)$$
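A small NumPy check of this composition rule, with made-up gains and scaling factors:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
g1, g2 = rng.random(d), rng.random(d)   # LN gains g_lambda and g_lambda+1
s1, s2 = 1.7, 0.9                       # per-token z-scaling factors (illustrative)
T1, T2 = np.diag(g1) / s1, np.diag(g2) / s2
# Products of diagonal matrices compose elementwise, as claimed above:
assert np.allclose(T1 @ T2, np.diag(g1 * g2) / (s1 * s2))
```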
Let us now pass the input through a complete layer, that is, through sublayers λ and λ + 1:
$$\mathbf{y}_{\lambda+1,t} = \left(\mathcal{S}_{\lambda+1}(\mathbf{y}_{\lambda,t}) + \mathbf{y}_{\lambda,t}\right)\mathbf{T}_{\lambda+1} + \mathbf{b}_{\lambda+1}^{(\mathrm{LN})} - \frac{m_{\lambda+1,t}}{s_{\lambda+1,t}}\mathbf{g}_{\lambda+1}$$
Substituting in the expression for $\mathbf{y}_{\lambda,t}$ from above:
$$\mathbf{y}_{\lambda+1,t} = \mathcal{S}_{\lambda+1}(\mathbf{y}_{\lambda,t})\mathbf{T}_{\lambda+1} + \left(\mathcal{S}_\lambda(\mathbf{x}_{\lambda,t}) + \mathbf{x}_{\lambda,t}\right)\mathbf{T}_\lambda\mathbf{T}_{\lambda+1} + \left(\mathbf{b}_\lambda^{(\mathrm{LN})} - \frac{m_{\lambda,t}}{s_{\lambda,t}}\mathbf{g}_\lambda\right)\mathbf{T}_{\lambda+1} + \mathbf{b}_{\lambda+1}^{(\mathrm{LN})} - \frac{m_{\lambda+1,t}}{s_{\lambda+1,t}}\mathbf{g}_{\lambda+1}$$
As we are interested in the combined effects of a layer, we only consider the case where $\mathcal{S}_\lambda$ is a MHA mechanism and $\mathcal{S}_{\lambda+1}$ a FF. We start by reformulating the output of a MHA. Recall that attention heads can be seen as weighted sums of value vectors (Kobayashi et al., 2020). Due to the softmax normalization, the attention weights $\alpha_{h,t,1}, \dots, \alpha_{h,t,n}$ sum to 1 for any position $t$. Hence:
$$(\mathbf{A}_h)_t = \sum_{t'=1}^{n}\alpha_{h,t,t'}\left(\mathbf{x}_{t'}\mathbf{W}_h^{(V)} + \mathbf{b}_h^{(V)}\right) = \left(\sum_{t'=1}^{n}\alpha_{h,t,t'}\,\mathbf{x}_{t'}\mathbf{W}_h^{(V)}\right) + \mathbf{b}_h^{(V)}$$
To account for all $H$ heads in a MHA, we concatenate these head-specific sums and pass them through the output projection $\mathbf{W}^{(\mathrm{MHA},O)}$. As such, we can denote the unbiased output of the MHA and the associated bias as:
$$\tilde{\mathbf{h}}_{l,t} = \sum_{h=1}^{H}\sum_{t'=1}^{n}\alpha_{l,h,t,t'}\,\mathbf{x}_{l,t'}\,\mathbf{Z}_{l,h}, \qquad \mathbf{b}_l^{(\mathrm{MHA})} = \left(\bigoplus_{h=1}^{H}\mathbf{b}_{l,h}^{(V)}\right)\mathbf{W}_l^{(\mathrm{MHA},O)} + \mathbf{b}_l^{(\mathrm{MHA},O)}$$
with $\mathbf{Z}_{l,h}$ as introduced in (4). By substituting the actual sublayer functions in our previous equation:
$$\mathbf{y}_{\lambda+1,t} = \left(\tilde{\mathbf{f}}_{l,t} + \mathbf{b}_l^{(\mathrm{FF},O)}\right)\mathbf{T}_{\lambda+1} + \left(\tilde{\mathbf{h}}_{l,t} + \mathbf{b}_l^{(\mathrm{MHA})} + \mathbf{x}_{\lambda,t}\right)\mathbf{T}_\lambda\mathbf{T}_{\lambda+1} + \left(\mathbf{b}_\lambda^{(\mathrm{LN})} - \frac{m_{\lambda,t}}{s_{\lambda,t}}\mathbf{g}_\lambda\right)\mathbf{T}_{\lambda+1} + \mathbf{b}_{\lambda+1}^{(\mathrm{LN})} - \frac{m_{\lambda+1,t}}{s_{\lambda+1,t}}\mathbf{g}_{\lambda+1}$$
Here, given that there is only one FF for this layer, the output of the sublayer function at λ + 1 corresponds to the output of the FF of layer l, i.e., $\tilde{\mathbf{f}}_{l,t} + \mathbf{b}_l^{(\mathrm{FF},O)}$, and similarly the output for sublayer λ is that of the MHA of layer l, or $\tilde{\mathbf{h}}_{l,t} + \mathbf{b}_l^{(\mathrm{MHA})}$. To match Eq. (1), we rewrite this as:
$$\mathbf{y}_{l,t} = \mathbf{x}_{\lambda,t}\mathbf{T}_\lambda\mathbf{T}_{\lambda+1} + \tilde{\mathbf{f}}_{l,t}\mathbf{T}_{\lambda+1} + \tilde{\mathbf{h}}_{l,t}\mathbf{T}_\lambda\mathbf{T}_{\lambda+1} + \left(\mathbf{b}_l^{(\mathrm{FF},O)}\mathbf{T}_{\lambda+1} + \mathbf{b}_l^{(\mathrm{MHA})}\mathbf{T}_\lambda\mathbf{T}_{\lambda+1} + \left(\mathbf{b}_\lambda^{(\mathrm{LN})} - \frac{m_{\lambda,t}}{s_{\lambda,t}}\mathbf{g}_\lambda\right)\mathbf{T}_{\lambda+1} + \mathbf{b}_{\lambda+1}^{(\mathrm{LN})} - \frac{m_{\lambda+1,t}}{s_{\lambda+1,t}}\mathbf{g}_{\lambda+1}\right)$$
where xλ,t is the tth input for sublayer λ; that is, the above characterizes the output of sublayer λ + 1 with respect to the input of sublayer λ. Passing the output yl,t into the next layer l + 1 (i.e., through sublayers λ + 2 and λ + 3) then gives:
This logic carries on across layers: Adding a layer corresponds to (i) mapping the existing terms through the two new LNs, (ii) adding new terms for the MHA and the FF, (iii) tallying up biases introduced in the current layer. Hence, the above generalizes to any number of layers k ≥ 1 as:11
Lastly, recall that by construction, we have:
$$\mathbf{x}_{\lambda+1,t} = \mathbf{y}_{\lambda,t}$$
By recurrence over all layers and providing the initial input x0,t, we obtain Eqs. (1) to (5).
We can re-write Eq. (5) to highlight that it is composed only of scalar multiplications applied to constant vectors. Let:
Using the above, Eq. (5) is equivalent to:
Note that pλ and qλ are constant across all inputs. Assuming their linear independence puts an upper bound of 2Λ vectors necessary to express ct.

In Section 2.3, we use the default hyperparameters of scikit-learn (Pedregosa et al., 2011). In Section 4, we learn categorical regressions using an AdamW optimizer (Loshchilov and Hutter, 2019) and iterate 20 times over the train set; hyperparameters (learning rate, weight decay, dropout, and the β1 and β2 AdamW hyperparameters) are set using Bayes Optimization (Snoek et al., 2012), with 50 hyperparameter samples and accuracy as objective. In Section 5, learning rate, dropout, weight decay, β1 and β2, learning rate scheduling are selected with Bayes Optimization, using 100 samples and accuracy as objective. In Section 6, we learn shallow logistic regressions, setting hyperparameters with Bayes Optimization, using 100 samples and macro-f1 as the objective. Experiments were run on a 4GB NVIDIA GPU.

The offset method of Mikolov et al. (2013) is known to also model social stereotypes (Bolukbasi et al., 2016, among others). Some of the sub-representations of our decomposition may exhibit stronger biases than the whole embedding et, and can yield higher performances than focusing on the whole embedding (e.g., Table 3). This could provide an undesirable incentive to deploy NLP models with higher performances and stronger systemic biases.

1 

In BERT (Devlin et al., 2019), additional terms to this static input encode the segment a token belongs to, and a LN is added before the very first sublayer. Other variants also encode positions by means of an offset in attention heads (Huang et al., 2018; Shaw et al., 2018).

2 

We empirically verified that components from attested embeddings et and those derived from Eq. (1) are systematically equal up to ± 10−7.

3 

In the case of relative positional embeddings applied to value projections (Shaw et al., 2018), it is rather straightforward to follow the same logic so as to include relative positional offset in the most appropriate term.

4 
One could simply treat the effect of a non-linear activation as if it were an offset. For instance, in the case of ReLU:
$$\mathrm{ReLU}(\mathbf{y}) = \mathbf{y} + \max(-\mathbf{y}, \mathbf{0})$$
5 

Code for our experiments is available at the following URL: https://github.com/TimotheeMickus/bert-splat.

8 

Layer 0 is the layer normalization conducted before the first sublayer, hence ft and ht are undefined here.

9 

We thank an anonymous reviewer for pointing out that the BERT model ties input and output embeddings; we leave investigating the implications of this fact for future work.

10 

In the case of BERT, we also need to include a LN before the first layer, which is straightforward if we index it as λ = 0.

11 

The edge case $\prod_{\lambda'=\lambda+1}^{\lambda}\mathbf{T}_{\lambda'}$ (an empty product) is taken to be the identity matrix $\mathbf{I}_d$, for notation simplicity.

References

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization.

Dzmitry Bahdanau, Kyung Hyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015.

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc.

Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T. Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Jiaju Du, Fanchao Qi, and Maosong Sun. 2019. Using BERT for word sense disambiguation. CoRR, abs/1909.08358.

Kawin Ethayarajh. 2019. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 55–65, Hong Kong, China. Association for Computational Linguistics.

Mor Geva, Avi Caciularu, Kevin Ro Wang, and Yoav Goldberg. 2022. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space.

Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. 2021. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (GELU). arXiv preprint arXiv:1606.08415.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138, Minneapolis, Minnesota. Association for Computational Linguistics.

Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Noam Shazeer, Curtis Hawthorne, Andrew M. Dai, Matthew D. Hoffman, and Douglas Eck. 2018. An improved relative self-attention mechanism for transformer with application to music generation. CoRR, abs/1809.04281.

Sarthak Jain and Byron C. Wallace. 2019. Attention is not explanation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3543–3556, Minneapolis, Minnesota. Association for Computational Linguistics.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3651–3657, Florence, Italy. Association for Computational Linguistics.

Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, and Kentaro Inui. 2020. Attention is not only a weight: Analyzing transformers with vector norms. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7057–7075, Online. Association for Computational Linguistics.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of Machine Translation Summit X: Papers, pages 79–86, Phuket, Thailand.

Shari Landes, Claudia Leacock, and Randee I. Tengi. 1998. Building semantic concordances. In WordNet: An Electronic Lexical Database, chapter 8, pages 199–216. Bradford Books.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In International Conference on Learning Representations.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal. Association for Computational Linguistics.

Timothee Mickus, Denis Paperno, Mathieu Constant, and Kees van Deemter. 2020. What do you mean, BERT? In Proceedings of the Society for Computation in Linguistics 2020, pages 279–290, New York, New York. Association for Computational Linguistics.

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, Georgia. Association for Computational Linguistics.

Jeff Mitchell and Mirella Lapata. 2010. Composition in distributional models of semantics. Cognitive Science, 34(8):1388–1429.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Jonas Pfeiffer, Andreas Rücklé, Clifton Poth, Aishwarya Kamath, Ivan Vulić, Sebastian Ruder, Kyunghyun Cho, and Iryna Gurevych. 2020. AdapterHub: A framework for adapting transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020): Systems Demonstrations, pages 46–54, Online. Association for Computational Linguistics.

Danish Pruthi, Mansi Gupta, Bhuwan Dhingra, Graham Neubig, and Zachary C. Lipton. 2020. Learning to deceive with attention-based explanations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4782–4793, Online. Association for Computational Linguistics.

Alessandro Raganato and Jörg Tiedemann. 2018. An analysis of encoder representations in transformer-based machine translation. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 287–297, Brussels, Belgium. Association for Computational Linguistics.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, Melbourne, Australia. Association for Computational Linguistics.

Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2020. A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics, 8:842–866.

Sofia Serrano and Noah A. Smith. 2019. Is attention interpretable? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2931–2951, Florence, Italy. Association for Computational Linguistics.

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. CoRR, abs/1803.02155.

Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. 2012. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc.

Benjamin Strauss, Bethany Toma, Alan Ritter, Marie-Catherine de Marneffe, and Wei Xu. 2016. Results of the WNUT16 named entity recognition shared task. In Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT), pages 138–144, Osaka, Japan. The COLING 2016 Organizing Committee.

William Timkey and Marten van Schijndel. 2021. All bark and no bite: Rogue dimensions in transformer language models obscure representational quality. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 4527–4546, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Raúl Vázquez, Hande Celikkanat, Mathias Creutz, and Jörg Tiedemann. 2021. On the differences between BERT and MT encoder spaces and how to address them in translation tasks. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop, pages 337–347, Online. Association for Computational Linguistics.

Elena Voita, Rico Sennrich, and Ivan Titov. 2019. The bottom-up evolution of representations in the transformer: A study with machine translation and language modeling objectives. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4396–4406, Hong Kong, China. Association for Computational Linguistics.

Ivan Vulić, Edoardo Maria Ponti, Robert Litschko, Goran Glavaš, and Anna Korhonen. 2020. Probing pretrained language models for lexical semantics. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7222–7240, Online. Association for Computational Linguistics.

Gregor Wiedemann, Steffen Remus, Avi Chawla, and Chris Biemann. 2019. Does BERT make any sense? Interpretable word sense disambiguation with contextualized embeddings. ArXiv, abs/1909.10430.

Sarah Wiegreffe and Yuval Pinter. 2019. Attention is not not explanation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 11–20, Hong Kong, China. Association for Computational Linguistics.

Zeyu Yun, Yubei Chen, Bruno Olshausen, and Yann LeCun. 2021. Transformer visualization via dictionary learning: Contextualized embedding as a linear superposition of transformer factors. In Proceedings of Deep Learning Inside Out (DeeLIO): The 2nd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pages 1–10, Online. Association for Computational Linguistics.

Sumu Zhao, Damián Pascual, Gino Brunner, and Roger Wattenhofer. 2021. Of non-linearity and commutativity in BERT. In 2021 International Joint Conference on Neural Networks (IJCNN), pages 1–8.

Author notes: * The work described in the present paper was conducted chiefly while at ATILF.
