Abstract
Probing classifiers have emerged as one of the prominent methodologies for interpreting and analyzing deep neural network models of natural language processing. The basic idea is simple—a classifier is trained to predict some linguistic property from a model’s representations—and has been used to examine a wide variety of models and properties. However, recent studies have demonstrated various methodological limitations of this approach. This squib critically reviews the probing classifiers framework, highlighting its promises, shortcomings, and advances.
1. Introduction
The opaqueness of deep neural network models of natural language processing (NLP) has spurred a line of research into interpreting and analyzing them. Analysis methods may aim to answer questions about a model’s structure or its decisions. For instance, one might ask which parts of a neural network model are responsible for certain linguistic properties, or which parts of the input led the model to make a certain decision. A common methodology to answer questions about the structure of models is to associate internal representations with external properties, by training a classifier on said representations that predicts a given property. This framework, known as probing classifiers, has emerged as a prominent analysis strategy in many studies of NLP models.1
Despite its apparent success, the probing classifiers paradigm is not without limitations. Critiques have been made about comparative baselines, metrics, the choice of classifier, and the correlational nature of the method. In this short article, we first define the probing classifiers framework, taking care to consider the various involved components. Then we summarize the framework’s shortcomings, as well as improvements and advances. This article provides a roadmap for NLP researchers who wish to examine probing classifiers more critically and highlights areas in need of additional research.
2. The Probing Classifiers Framework
On the surface, the probing classifiers idea seems straightforward. We take a model that was trained on some task, such as a language model. We generate representations using the model, and train another classifier that takes the representations and predicts some property. If the classifier performs well, we say that the model has learned information relevant for the property. However, upon closer inspection, it turns out that much more is involved here. To see this, we now define this framework a bit more formally.
Let us denote by f : x ↦ ŷ a model that maps input x to output ŷ. We call this model the original model. It is trained on some annotated dataset 𝒟O = {x(i), y(i)}, which we refer to as the original dataset. Its performance is evaluated by some measure, denoted Perf(f, 𝒟O). The function f is typically a deep neural network that generates intermediate representations of x; for example, fl(x) may denote the representation of x at layer l of f.2 A probing classifier g : fl(x) ↦ ẑ maps intermediate representations to some property z, which is typically a linguistic feature of interest. As a concrete example, f might be a sentiment analysis model, mapping a text x to a sentiment label y, while g might be a classifier mapping intermediate representations fl(x) to part-of-speech tags z. The classifier g is trained and evaluated on some annotated dataset 𝒟P = {x(i), z(i)}, and some performance measure Perf(g, f, 𝒟O, 𝒟P) (e.g., accuracy) is reported. Note that the performance measure depends on the probing classifier g and the probing dataset 𝒟P, as well as on the original model f and the original dataset 𝒟O.
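To make these components concrete, the following is a minimal sketch (in Python, using scikit-learn) of training a linear probe g on frozen layer-l representations and reporting its accuracy, that is, Perf(g, f, 𝒟O, 𝒟P). The helper get_layer_representations is hypothetical and stands in for running the original model f; the probing data is assumed to pair inputs with property labels z.

```python
# A minimal sketch of a probing classifier experiment. The helper
# `get_layer_representations` is hypothetical: it should run the frozen
# original model f on input x and return the layer-l vector f_l(x).
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def probe_layer(get_layer_representations, layer, train_data, test_data):
    """train_data / test_data: lists of (input, property_label) pairs from D_P."""
    X_train = [get_layer_representations(x, layer) for x, _ in train_data]
    z_train = [z for _, z in train_data]
    X_test = [get_layer_representations(x, layer) for x, _ in test_data]
    z_test = [z for _, z in test_data]

    g = LogisticRegression(max_iter=1000)              # a simple linear probe g
    g.fit(X_train, z_train)
    return accuracy_score(z_test, g.predict(X_test))   # Perf(g, f, D_O, D_P)
```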
From an information theoretic perspective, training the probing classifier g can be seen as estimating the mutual information between the intermediate representations fl(x) and the property z (Belinkov 2018, page 42; Pimentel et al. 2020b; Zhu and Rudzicz 2020), which we write I(z; h), where z is a random variable ranging over properties z and h is a random variable ranging over representations fl(x).
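The link between probe performance and mutual information can be made explicit with a short derivation (a sketch of the standard variational argument, not reproduced from the cited papers): the cross-entropy of any probe qθ(z | h) upper-bounds the conditional entropy H(z | h), so a better probe yields a tighter lower bound on I(z; h).

```latex
% Sketch: probe cross-entropy bounds the conditional entropy, hence
% a trained probe yields a lower bound on the mutual information.
\begin{aligned}
I(z; h) &= H(z) - H(z \mid h),\\
H(z \mid h) &\le \mathbb{E}_{(z,h)}\big[-\log q_\theta(z \mid h)\big]
  \quad \text{(cross-entropy of probe } q_\theta\text{)},\\
\Rightarrow \quad I(z; h) &\ge H(z) - \mathbb{E}_{(z,h)}\big[-\log q_\theta(z \mid h)\big].
\end{aligned}
```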
The above careful definition of the probing classifiers framework reveals that it is composed of multiple concepts and components, depicted in Figure 1. The choice of each such component, and the interactions between them, lead to non-trivial questions regarding the design and implementation of any probing classifier experiment. Before we turn to these considerations in Section 4, we briefly review some history and promises of probing classifiers in the next section.
3. Promises
Perhaps the first studies that can be cast in the framework of probing classifiers are by Köhn (2015) and Gupta et al. (2015), who trained classifiers on static word embeddings to predict various morphological, syntactic, and semantic properties. Their goal was to provide more nuanced evaluations of word embeddings than prior work, which only integrated them in downstream tasks. Other early work classified hidden states of a recurrent neural network machine translation system into morpho-syntactic properties (Shi, Padhi, and Knight 2016). That work was motivated by the end-to-end nature of the neural machine translation system, which, compared with a phrase- or syntax-based system, did not explicitly integrate such properties (hence the question: “What kind of syntactic information is learned, and how much?”). The framework took a more stable form in the work of several groups who studied sentence embeddings (Ettinger, Elgohary, and Resnik 2016; Adi et al. 2017; Conneau et al. 2018) and recurrent/recursive neural networks (Belinkov et al. 2017a; Hupkes, Veldhoen, and Zuidema 2018).3 The same idea was concurrently proposed for investigating computer vision models (Alain and Bengio 2016).
A main motivation in this body of work is the opacity of the representations.4 Compared with performance on downstream tasks, probing classifiers aim to provide more nuanced evaluations with respect to simple properties.5 Indeed, following the initial studies, a plethora of work has applied the framework to various models and properties, alleviating some of the opacity, at least in terms of properties encoded in the representations. See Belinkov and Glass (2019) for a comprehensive survey up to early 2019.6
However, what can be inferred from successful probing performance is less obvious. Good probing performance is often taken to indicate several potential situations: good quality of the representations with respect to the probing property,7 readability of information found in the representations,8 or its extractability.9 In contrast, low probing performance is taken to indicate that the probing property is not present in the representations or is not usable.10 Sometimes, good performance is taken to indicate how the original model achieves its behavior on the original task (Hupkes, Veldhoen, and Zuidema 2018). A linear probing classifier is thought to reveal features that are used by the original model, while a more complex probe “bears the risk that the classifier infers features that are not actually used by the network” (Hupkes, Veldhoen, and Zuidema 2018). Often, these different terms (quality, readability, usability, etc.) appear without precise definitions.
As we shall see, some of the above assumptions and conclusions are better accounted for than others by the probing classifiers paradigm. Indeed, the community has recently taken a more critical look at the methodology, which we turn to now.
4. Shortcomings and Advances
In light of the promises discussed above, this section reviews several limitations of the probing classifiers framework, as well as existing proposals for addressing them. We discuss comparisons and controls, how to choose the probing classifier, which causal claims can be made, the difference between datasets and tasks, and the need to define the probed properties. We formalize new additional components (Figure 1b) in a unified framework, along with the basic components (Figure 1a).
4.1 Comparisons and Controls
A first concern with the framework is how to interpret the results of a probing classifier experiment. Suppose we run such an experiment and obtain a performance of Perf(g, f, 𝒟O, 𝒟P) = 87.8. Is that a high or low number? What should we compare it to? Two natural points of comparison are baseline models, which bound probing performance from below, and upper bound or skyline models, which bound it from above.
Some studies compare with majority baselines (Belinkov et al. 2017a; Conneau et al. 2018) or with classifiers trained on representations that are thought to be simpler than what the original model f produces, such as static word embeddings (Belinkov et al. 2017a; Tenney et al. 2019). Others advocate for random baselines, training the classifier g on a randomized version of f (Conneau et al. 2018; Zhang and Bowman 2018; Tenney et al. 2019; Chrupała, Higy, and Alishahi 2020). These studies show that even random features capture significant information that can be decoded by the probing classifier, so performance obtained with learned features should be interpreted relative to such random baselines.
On the other hand, some studies compare Perf(g, f, 𝒟O, 𝒟P) to skylines or upper bounds, in an attempt to provide a point of comparison for how far probing performance is from the best achievable performance on the task of mapping x ↦ z. Examples include estimating human performance (Conneau et al. 2018), reporting the state of the art from the literature (Liu et al. 2019), or training a dedicated model to predict z from x, without restricting it to (frozen) representations from f (Belinkov et al. 2017b).
Others have proposed to design controls for possible confounders. Hewitt and Liang (2019) observe that the probing performance Perf(g, f, 𝒟O, 𝒟P) may tell us more about the probe g than about the model f. The probe g may memorize information from 𝒟P, rather than evaluate information found in the representations f(x). They design control tasks, which a probe can only solve by memorizing. In particular, they randomize the labels in 𝒟P, creating a new dataset 𝒟P,Rand. Then, they define selectivity as the difference between the probing performance on the probing task and the control task: Sel(g, f, 𝒟O, 𝒟P, 𝒟P,Rand) = Perf(g, f, 𝒟O, 𝒟P) − Perf(g, f, 𝒟O, 𝒟P,Rand). They show that probes may have high accuracy but low selectivity, and that linear probes tend to have high selectivity, while nonlinear probes tend to have low selectivity. This indicates that the high accuracy of nonlinear probes may come from memorization of surface patterns by the probe g, rather than from information captured in the representations fl(x). The control tasks introduced by Hewitt and Liang are particularly suited to word-level properties z, as they evaluate memorization of word types; it is less clear how to apply this idea more broadly, for instance to sentence-level properties.
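As an illustration, the following is a minimal sketch of computing selectivity for a word-level property, assuming word-level representation vectors, the corresponding word strings, and gold labels are already available (all names are illustrative, and the control labels are sampled uniformly here as a simplification).

```python
# A sketch of the selectivity control: compare probe accuracy on the real
# labels with accuracy on a control task whose labels are fixed per word type.
import random
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def selectivity(X_train, words_train, z_train, X_test, words_test, z_test, num_labels):
    def fit_and_score(labels_train, labels_test):
        g = LogisticRegression(max_iter=1000)
        g.fit(X_train, labels_train)
        return accuracy_score(labels_test, g.predict(X_test))

    # Control task: each word *type* gets a fixed random label (Hewitt and Liang
    # sample from the empirical label distribution; uniform sampling simplifies).
    types = set(words_train) | set(words_test)
    control_label = {w: random.randrange(num_labels) for w in types}
    z_train_rand = [control_label[w] for w in words_train]
    z_test_rand = [control_label[w] for w in words_test]

    probing_acc = fit_and_score(z_train, z_test)
    control_acc = fit_and_score(z_train_rand, z_test_rand)
    return probing_acc - control_acc          # Sel = probing accuracy - control accuracy
```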
Taking an information-theoretic perspective on probing, Pimentel et al. (2020b) proposed to use control functions instead of control tasks in order to compare probes. Their control function is any function applied to the representation, c : fl(x) ↦ c(fl(x)), and they compare the information gain, which is the difference in mutual information between the property z and the representation before and after applying the control function: 𝒢(z, h, c) = I(z; h) − I(z; c(h)). While Pimentel et al. (2020b) posit that their control function is a better criterion than the control tasks of Hewitt and Liang (2019), subsequent work showed that the two criteria are almost equivalent, both theoretically and empirically (Zhu and Rudzicz 2020).
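A rough way to operationalize this comparison is to approximate each mutual information term with a probe's cross-entropy, so that the H(z) terms cancel in the difference. The sketch below assumes matrix-shaped representations and an arbitrary control function c; all names are illustrative.

```python
# A sketch of an information-gain estimate: approximate I(z; h) and I(z; c(h))
# via probe cross-entropies; the entropy H(z) cancels in the gain.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def probe_cross_entropy(H_train, z_train, H_test, z_test, labels):
    # Assumes every label in `labels` occurs in the training split.
    g = LogisticRegression(max_iter=1000).fit(H_train, z_train)
    return log_loss(z_test, g.predict_proba(H_test), labels=labels)

def information_gain(H_train, z_train, H_test, z_test, control_fn, labels):
    """G(z, h, c) ≈ CE[probe on c(h)] − CE[probe on h] (in nats)."""
    ce_raw = probe_cross_entropy(H_train, z_train, H_test, z_test, labels)
    ce_ctrl = probe_cross_entropy(control_fn(H_train), z_train,
                                  control_fn(H_test), z_test, labels)
    return ce_ctrl - ce_raw
```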
Another kind of control is proposed by Ravichander, Belinkov, and Hovy (2021), who design control datasets, where the linguistic property z is not discriminative with respect to the original task of mapping x to y. That is, they modify 𝒟O and create a new dataset, 𝒟O,z, where all examples have the same value for property z. Intuitively, a model f trained on 𝒟O,z should not pick up information about z, since it is not useful for the task of f. They show that a probe g may learn to predict property z incidentally, even when it is not discriminative with respect to the original task of mapping x ↦ y, casting doubts on causal claims concerning the effect that a property encoded in the representation may have on the original task. While they create control datasets for probing sentence-level information, the same idea can be applied to word-level properties.
4.2 Which Classifier to Use?
Another concern is the choice of the probing classifier g: What should be its structure? What role does its expressivity play in drawing conclusions about the original model f?
Some studies advocate for using simple probes, such as linear classifiers (Alain and Bengio 2016; Hupkes, Veldhoen, and Zuidema 2018; Liu et al. 2019; Hall Maudslay et al. 2020). Somewhat anecdotally, a few studies observed better performance with more complex probes, but reported similar relative trends (Conneau et al. 2018; Belinkov 2018). That is, a ranking Perf(g, f1, 𝒟O, 𝒟P) > Perf(g, f2, 𝒟O, 𝒟P) of two representations f1(x) and f2(x) holds across different probes g. However, this pattern may be flipped under alternative measures, such as selectivity (Hewitt and Liang 2019).
Several studies considered the complexity of the probe g in more detail. Pimentel et al. (2020b) argue that, in order to give the best estimate about the information that model f has about property z, the most complex probe should be used. In a more practical view, Voita and Titov (2020) propose to measure both the performance of the probe g and its complexity, by estimating the minimum description length of the code required to transmit property z knowing the representations fl(x): MDL(g, f, 𝒟O, 𝒟P). Note that this measure again depends on the probe g, the model f, and their respective datasets 𝒟O and 𝒟P. They found that MDL provides more information about how a probe g works—for instance, by revealing differences in complexity of probes when performing control tasks from 𝒟P,Rand, as in Hewitt and Liang (2019). Pimentel et al. (2020a) argue that probing work should report the possible trade-offs between accuracy and complexity, along a range of probes g, and call for using probes that are both simple and accurate. While they study a number of linear and nonlinear multilayered perceptrons, one could extend this idea to other classes of probes. Indeed, Cao, Sanh, and Rush (2021) design a pruning-based probe, which learns a mask on weights of f and obtains a better accuracy–complexity trade-off than a nonlinear probe.
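To illustrate, here is a rough sketch of an online (prequential) coding estimate of description length in the spirit of Voita and Titov (2020): the first block of labels is transmitted with a uniform code, and each subsequent block is coded with a probe trained on everything transmitted so far. The block fractions, probe family, and variable names are illustrative, and the sketch assumes a reasonably large probing dataset in which all labels occur early.

```python
# A sketch of an online-coding MDL estimate: lower total codelength means
# the property is more easily extractable from the representations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def online_mdl(X, z, num_labels,
               fractions=(0.001, 0.002, 0.004, 0.008, 0.016,
                          0.032, 0.0625, 0.125, 0.25, 0.5, 1.0)):
    """X: (n, d) array of representations f_l(x); z: integer property labels."""
    n = len(z)
    cuts = [max(2, int(f * n)) for f in fractions]
    codelength = cuts[0] * np.log2(num_labels)       # uniform code for the first block
    for prev, cur in zip(cuts[:-1], cuts[1:]):
        if cur <= prev:
            continue
        g = LogisticRegression(max_iter=1000).fit(X[:prev], z[:prev])
        probs = g.predict_proba(X[prev:cur])
        # Cross-entropy of the next block, converted from nats to bits.
        # (Assumes all labels already occur in the first training block.)
        nats = log_loss(z[prev:cur], probs, labels=list(range(num_labels))) * (cur - prev)
        codelength += nats / np.log(2)
    return codelength                                # an MDL(g, f, D_O, D_P) estimate, in bits
```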
Another line of work proposes methods to extract linguistic information from a trained model without learning additional parameters. In particular, much work has used some sort of pairwise importance score between words in a sentence as a signal for inferring linguistic properties, either full syntactic parsing or more fine-grained properties such as coreference resolution. These scores may come from attention weights (Raganato and Tiedemann 2018; Clark et al. 2019; Mareček and Rosa 2019; Htut et al. 2019) or from distances between word representations, perhaps including perturbations of the input sentence (Wu et al. 2020). The pairwise scores can feed into some general parsing algorithm, such as the Chu–Liu/Edmonds algorithm (Chu and Liu 1965; Edmonds 1967). Alternatively, some work has used representational similarity analysis (Kriegeskorte, Mur, and Bandettini 2008) to measure similarity between word or sentence representations and syntactic properties, both local properties like determining a verb’s subject (Lepori and McCoy 2020) and more structured properties like inferring the full syntactic tree (Chrupała and Alishahi 2019). Also related is work on clustering representations with respect to a linguistic property and classifying by cluster assignment (Zhou and Srikumar 2021). This line of work can be seen as using a parameter-free probing classifier g: a linguistic property is inferred from internal model components (representations, attention weights), without needing to learn new parameters. Thus, such work avoids some of the issues about what the probe learns. Additionally, from the perspective of an accuracy–complexity trade-off, such work should perhaps be placed on the low end of the complexity axis, although the complexity of the parsing algorithm could also be taken into account.
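As a small illustration of a parameter-free probe, the sketch below correlates pairwise distances between a sentence's word representations with pairwise distances in its gold syntactic tree, in the spirit of representational similarity analysis. The inputs and names are illustrative, and computing the gold tree distances is assumed to happen elsewhere.

```python
# A sketch of a parameter-free, RSA-style probe: no classifier is trained;
# we simply correlate representation geometry with syntactic structure.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rsa_syntax_score(word_vectors, tree_distances):
    """word_vectors: (num_words, dim) array of f_l(x) for one sentence.
    tree_distances: (num_words, num_words) matrix of path lengths in the gold parse."""
    rep_dists = pdist(word_vectors, metric="cosine")   # condensed upper triangle, row-major
    iu = np.triu_indices(tree_distances.shape[0], k=1)
    tree_dists = tree_distances[iu]                    # matching upper triangle
    rho, _ = spearmanr(rep_dists, tree_dists)          # higher rho: geometry tracks syntax
    return rho
```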
4.3 Correlation vs. Causation
A main limitation of the probing classifier paradigm is the disconnect between the probing classifier g and the original model f. They are trained in two different steps, where f is trained once and only used to generate feature representations fl(x), which are fed into g. Once we have fl(x), we get a probing performance from g, which tells us something about the information in fl(x). However, in the process, we have forgotten about the original task assigned to f, which was to predict y. This raises an important question, which early work has largely taken for granted (Section 3): Does model f use the information discovered by probe g? In other words, the probing framework may indicate correlations between representations fl(x) and linguistic property z, but it does not tell us whether this property is involved in the predictions of f. Indeed, several studies have pointed out this limitation (Belinkov and Glass 2019), including reports on a mismatch between the performance of the probe, Perf(g, f, 𝒟O, 𝒟P), and the performance of the original model, Perf(f, 𝒟O) (Vanmassenhove, Du, and Way 2017). In contrast, Lovering et al. (2021) find that extractability of a property according to MDL(g, f, 𝒟O, 𝒟P) is correlated with f making predictions consistent with that property. Relatedly, Tamkin et al. (2020) find a discrepancy between features fl(x) obtaining high probing performance, Perf(g, f, 𝒟O, 𝒟P), and features identified as important when fine-tuning f while performing the probing task fl(x) ↦ z. They reveal this by randomizing the weights of specific layers when fine-tuning f, which can be seen as a kind of intervention.
Indeed, a number of studies have proposed improvements to the probing classifier paradigm, which aim to discover causal effects by intervening in representations of the model f. Giulianelli et al. (2018) use gradients from g to modify the representations in f and evaluate how this change affects both the probing performance and the original model performance. In their case, f is a language model and g predicts subject–verb number agreement. They find that their intervention increases probing performance, as may be expected. Interestingly, while in the general language modeling case the intervention has a small effect on the original model performance, Perf(f, 𝒟O), they find an increase in this performance on examples designed to assess number agreement. They conclude that probing classifiers can identify features that are actually used by the model. Tucker, Qian, and Levy (2021) also use probe gradients to update the representations fl(x) with respect to z, resulting in what they call counterfactual representations, and measure the effect on other properties. Similarly, Elazar et al. (2021) remove certain properties z (such as parts of speech or syntactic dependencies) from representations in f by repeatedly training (linear) probing classifiers g and projecting them out of the representation. This results in a modified representation f̃l(x), which has less information about z. They compare the probing performance to the performance on the original task (in their case, language modeling) after the removal of said features. They find that high probing performance Perf(g, f, 𝒟O, 𝒟P) does not necessarily entail a large drop in original task performance after their removal, that is, in Perf(f̃, 𝒟O). Thus, contrary to Giulianelli et al. (2018), they conclude that probing classifiers do not always identify features that are actually used by the model. In a similar vein, Feder et al. (2021) remove properties z from representations in f by training g adversarially. At the same time, another probing classifier gC is trained positively, aiming to control for properties zC that should not be removed from f. A major difference from standard probing classifiers work is the continued updating of f. They find that they can accurately estimate the effect of properties z on downstream tasks performed by f when it is fine-tuned.11
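For concreteness, the following is a simplified sketch of the repeated train-and-project removal step described above (a rendering of the nullspace-projection idea, not the authors' exact implementation; names and the number of rounds are illustrative).

```python
# A simplified sketch of removing a property z by repeatedly training a
# linear probe and projecting the representations onto its nullspace.
import numpy as np
from sklearn.linear_model import LogisticRegression

def remove_property(H, z, num_rounds=10):
    """H: (n, d) matrix of representations f_l(x); z: property labels."""
    H = H.copy()
    d = H.shape[1]
    P_total = np.eye(d)                              # accumulated projection
    for _ in range(num_rounds):
        g = LogisticRegression(max_iter=1000).fit(H, z)
        W = g.coef_                                  # probe weights, shape (num_classes or 1, d)
        # Projection onto the nullspace of W: I - W^T (W W^T)^+ W
        P_round = np.eye(d) - W.T @ np.linalg.pinv(W @ W.T) @ W
        H = H @ P_round                              # remove the directions the probe used
        P_total = P_total @ P_round
    return H, P_total                                # modified representations and projection
```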
4.4 Datasets vs. Tasks
The probing paradigm aims to study models performing some task (f : x ↦ ŷ) via a classifier performing another task (g : fl(x) ↦ ẑ). However, in practice these tasks are operationalized via finite datasets. Ravichander, Belinkov, and Hovy (2021) point out that datasets are imperfect proxies for tasks. Indeed, the effect of the choice of datasets—both the original dataset 𝒟O and the probing dataset 𝒟P—has not been widely studied. Furthermore, we ideally want to disentangle the role of each dataset from the role of the original model f and the probing classifier g. Unfortunately, models f tend to be trained on different datasets 𝒟O, so statements about models are confounded with properties of the datasets. Some prior work acknowledged that conclusions can only be made about the existing trained models, not about general architectures (Liu et al. 2019). However, in an ideal world, we would compare different architectures {fi} trained on the same dataset 𝒟O, or the same f trained on different datasets {𝒟O(i)}. Concerning the latter, Zhang et al. (2021) found that models require less data to encode syntactic and semantic properties than to encode commonsense knowledge. More such experiments are currently lacking.
The effect of the probing dataset 𝒟P—its size, composition, and so forth—is similarly not well studied. While some work reported results on multiple datasets when predicting the same property z (e.g., Belinkov et al. 2017a), more careful investigations are needed.
4.5 Properties Must Be Pre-defined
Finally, inherent to the probing classifier framework is the need to determine a property z to probe for. This limits the investigation in multiple ways: It constrains the work to existing annotated datasets, which are often limited to English and to certain properties. It also requires focusing on properties z that are thought a priori to be relevant to the task of mapping x ↦ y, potentially leading to biased conclusions. In an isolated effort to alleviate this limitation, Michael, Botha, and Tenney (2020) propose to learn latent clusters useful for predicting a property z. They discover clusters corresponding to known properties (such as personhood) as well as new categories, which are not usually annotated in common datasets. Still, probing classifiers are so far mainly useful when one has prior expectations about which properties z might be relevant with respect to a given task.
5. Summary
Given the various limitations discussed in this article, one might ask: What are probing classifiers good for? In line with the original motivation to alleviate the opacity of learned representations, work using probing classifiers has characterized them along a range of fine-grained properties. However, we have discussed several reservations regarding which insights can be drawn from a probing classifier experiment. Absolute claims about representation quality seem difficult to make. Yet recent improvements to the framework, such as better controls and metrics, allow us to make relative claims and answer questions like how extractable a property is from a representation. And causal approaches (Section 4.3) may reveal which properties are used by the original model.
One might hope that probing classifier experiments would suggest ways to improve the quality of the probed model or to direct it to be better tuned to some use or task. Presently, there are only a few such successful examples. For instance, earlier results showing that lower layers in language models focus on local phenomena while higher layers focus on global ones (using probing classifiers and other methods) motivated Cao et al. (2020) to decouple a question-answering model, such that lower layers process the question and the passage independently and higher layers process them jointly. An analysis of redundancy in language models (again using probing classifiers and other methods) motivated an efficient transfer-learning procedure (Dalvi et al. 2020). An analysis of phonetic information in layers of a speech recognition system (Belinkov and Glass 2017) partly motivated Krishna, Toshniwal, and Livescu (2019) to propose multitask learning with phonetic supervision on intermediate layers. Belinkov et al. (2020) discuss how their probing experiments can guide the selection of which machine translation models to use when translating specific languages. Finally, when considering using the representations for some downstream task, probing experiments can indicate which information is encoded in, or can easily be extracted from, these representations.
To conclude, our critical review of the probing classifiers framework reveals that it is more complicated than it may seem. When designing a probing classifier experiment, we advise researchers to take the various controls and alternative measures into account. Naturally, one should clearly define the original task/dataset/model and the probing task/dataset/classifier. It is important to set upper and lower bounds, and to consider proper controls, via either control tasks (for word-level properties) or control datasets (for sentence-level properties). Depending on one’s goals, one may want to measure the probe’s complexity (if ease of extractability is in question), report the accuracy–complexity trade-off (when designing new probes), or perform an intervention (to measure usage of information by the original model). When possible, using parameter-free probes may circumvent some of the challenges with parameterized probes. We do not argue that every study must perform all the various controls and report all the alternative measures summarized here. However, future work seeking to use probing classifiers would do well to take into account the complexity of the framework, its apparent shortcomings, and the available advances.
Acknowledgments
This research was supported by the Israel Science Foundation (grant no. 448/20) and by an Azrieli Foundation Early Career Faculty Fellowship.
Notes
2. We use fl(x) to refer more generally to any intermediate output of f when applied to x, so the framework includes analyses of other model components, such as attention weights (Clark et al. 2019).
5. “fine-grained measurement of some of the information encoded in sentence embeddings” (Adi et al. 2017); “simple linguistic properties of sentences” (Conneau et al. 2018); “assessing the specific semantic information that is being captured in sentence representations” (Ettinger, Elgohary, and Resnik 2016).
6. There have also been numerous other studies using the probing classifier framework as is. For a partial list, see https://github.com/boknilev/nlp-analysis-methods/issues/5. For recent analyses focusing on the BERT model (Devlin et al. 2019), see Rogers, Kovaleva, and Rumshisky (2020).
7. “evaluate the quality of the trained classifier on the given task as a proxy to the quality of the extracted representations” (Belinkov et al. 2017a).
8. “If the classifier succeeds, it means that the pre-trained encoder is storing readable tense information into the embeddings it creates” (Conneau et al. 2018).
9. “testing for extractability of semantic information by testing classification accuracy” (Ettinger, Elgohary, and Resnik 2016); “if a sequential model is computing certain information, or merely keeping track of it, it should be possible to extract this information from its internal state space” (Hupkes, Veldhoen, and Zuidema 2018).
10. “low accuracy suggests this information is not represented in the hidden state” (Hupkes, Veldhoen, and Zuidema 2018); “if we cannot train a classifier to predict some property of a sentence based on its vector representation, then this property is not encoded in the representation (or rather, not encoded in a useful way, considering how the representation is likely to be used)” (Adi et al. 2017).
Author notes
Supported by the Viterbi Fellowship in the Center for Computer Engineering at the Technion.