Probing Classifiers: Promises, Shortcomings, and Advances

Probing classifiers have emerged as one of the prominent methodologies for interpreting and analyzing deep neural network models of natural language processing. The basic idea is simple -- a classifier is trained to predict some linguistic property from a model's representations -- and has been used to examine a wide variety of models and properties. However, recent studies have demonstrated various methodological limitations of this approach. This article critically reviews the probing classifiers framework, highlighting their promises, shortcomings, and advances.

property. If the classifier performs well, we say that the model has learned information relevant for the property. However, upon closer inspection, it turns out that much more is involved here. To see this, we now define this framework a bit more formally.
Let us denote by f : x → ŷ a model that maps input x to output ŷ. We call this model the original model. It is trained on some annotated dataset D_O = {x^(i), y^(i)}, which we refer to as the original dataset. Its performance is evaluated by some measure, denoted PERF(f, D_O). The function f is typically a deep neural network that generates intermediate representations of x; for example, f_l(x) may denote the representation of x at layer l of f. 2 A probing classifier g : f_l(x) → ẑ maps intermediate representations to some property ẑ, which is typically some linguistic feature of interest. As a concrete example, f might be a sentiment analysis model, mapping a text x to a sentiment label y, while g might be a classifier mapping intermediate representations f_l(x) to part-of-speech tags z. The classifier g is trained and evaluated on some annotated dataset D_P = {x^(i), z^(i)}, and some performance measure PERF(g, f, D_O, D_P) (e.g., accuracy) is reported. Note that the performance measure depends on the probing classifier g and the probing dataset D_P, as well as on the original model f and the original dataset D_O.
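The setup can be made concrete with a minimal sketch. Everything in it is a hypothetical stand-in chosen for illustration: the function `f_l` plays the role of a frozen layer representation, a nearest-centroid classifier plays the role of the probe g, and the toy word lists play the role of D_P. A majority baseline is computed alongside the probing accuracy for comparison.

```python
from collections import Counter

def f_l(x):
    """Toy stand-in for the frozen representation f_l(x) of a word x.
    Hypothetical features: word length and whether the word ends in 's'."""
    return (float(len(x)), 1.0 if x.endswith("s") else 0.0)

class NearestCentroidProbe:
    """A very simple probe g: predict the label whose mean representation is closest."""

    def fit(self, reps, labels):
        counts = Counter(labels)
        sums = {}
        for h, z in zip(reps, labels):
            acc = sums.setdefault(z, [0.0] * len(h))
            for i, v in enumerate(h):
                acc[i] += v
        self.centroids = {z: [v / counts[z] for v in s] for z, s in sums.items()}
        return self

    def predict(self, reps):
        def sq_dist(h, c):
            return sum((a - b) ** 2 for a, b in zip(h, c))
        return [min(self.centroids, key=lambda z: sq_dist(h, self.centroids[z]))
                for h in reps]

def accuracy(pred, gold):
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

# Toy probing dataset D_P: (x, z) pairs, where z is a grammatical-number label.
train = [("cats", "PL"), ("dog", "SG"), ("birds", "PL"), ("tree", "SG")]
test = [("cars", "PL"), ("book", "SG"), ("houses", "PL"), ("pen", "SG")]

# f is never updated: it is only used to produce the features fed into g.
probe = NearestCentroidProbe().fit([f_l(x) for x, _ in train], [z for _, z in train])
perf = accuracy(probe.predict([f_l(x) for x, _ in test]), [z for _, z in test])

# Majority baseline: always predict the most frequent training label.
majority = Counter(z for _, z in train).most_common(1)[0][0]
baseline = accuracy([majority] * len(test), [z for _, z in test])
```

In a real experiment `f_l` would return hidden states of a trained network and the probe would usually be a linear or multi-layer classifier; the division of labor between the frozen f and the trained g is the same.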
From an information-theoretic perspective, training the probing classifier g can be seen as estimating the mutual information between the intermediate representations f_l(x) and the property z (Belinkov 2018, p. 42; Pimentel et al. 2020b; Zhu and Rudzicz 2020), which we write I(z; h), where z is a random variable ranging over properties z and h is a random variable ranging over representations f_l(x).
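One way to see the connection is through the standard decomposition of mutual information. The notation q_g for the probe's predictive distribution and H_g for its cross-entropy is introduced here for illustration only and is not taken from the text above. Since the cross-entropy of any probe upper-bounds the true conditional entropy, a trained probe yields a lower bound on I(z; h):

```latex
\begin{align}
I(z; h) &= H(z) - H(z \mid h), \\
H(z \mid h) &\le H_{g}(z \mid h) = -\mathbb{E}_{(z,h)}\bigl[\log q_g(z \mid h)\bigr], \\
I(z; h) &\ge H(z) - H_{g}(z \mid h).
\end{align}
```

The inequality holds for the true expectation; in practice both terms are estimated from finite samples, so the resulting number is an estimate of the bound rather than a guarantee.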
The above careful definition of the probing classifiers framework reveals that it is comprised of multiple concepts and components, depicted in Figure 1a. The choice of each such component, and the interactions between them, lead to non-trivial questions regarding the design and implementation of any probing classifier experiment. Before we turn to these considerations in Section 4, we briefly review some history and promises of probing classifiers in the next section.

Promises
Perhaps the first studies that can be cast in the framework of probing classifiers are by Köhn (2015) and Gupta et al. (2015), who trained classifiers on static word embeddings to predict various morphological, syntactic, and semantic properties. Their goal was to provide more nuanced evaluations of word embeddings than prior work, which only integrated them in downstream tasks. Other early work classified hidden states of a recurrent neural network machine translation system into morpho-syntactic properties (Shi, Padhi, and Knight 2016). The authors were motivated by the end-to-end nature of the neural machine translation system, which, compared to a phrase/syntax-based system, did not explicitly integrate such properties (so they ask: "What kind of syntactic information is learned, and how much?"). The framework took a more stable form in the work of several groups who studied sentence embeddings (Ettinger, Elgohary, and Resnik 2016; Adi et al. 2017; Conneau et al. 2018) and recurrent/recursive neural networks (Belinkov et al. 2017a; Hupkes, Veldhoen, and Zuidema 2018). 3 The same idea had been concurrently proposed for investigating computer vision models (Alain and Bengio 2016).
Figure 1b (Additional Components): control task (Hewitt and Liang 2019); control task dataset (Hewitt and Liang 2019); control dataset (Ravichander, Belinkov, and Hovy 2021); probing selectivity SEL(g, f, D_O, D_P, D_{P,Rand}) (Hewitt and Liang 2019); probe minimum description length (Voita and Titov 2020); representations of x from f, after an intervention.
A main motivation in this body of work is the opacity of the representations. 4 Compared to performance on downstream tasks, probing classifiers aim to provide more nuanced evaluations w.r.t. simple properties. 5 Indeed, following the initial studies, a plethora of work has applied the framework to various models and properties, alleviating some of the opacity, at least in terms of properties encoded in the representations. See Belinkov and Glass (2019) for a comprehensive survey up to early 2019. 6
However, what can be inferred from successful probing performance is less obvious. Good probing performance is often taken to indicate several potential situations: good quality of the representations w.r.t. the probing property, 7 readability of information found in the representations, 8 or its extractability. 9 In contrast, low probing performance is taken to indicate that the probing property is not present in the representations or is not usable. 10 Sometimes, good performance is taken to indicate how the original model achieves its behavior on the original task (Hupkes, Veldhoen, and Zuidema 2018). A linear probing classifier is thought to reveal features that are used by the original model, while a more complex probe "bears the risk that the classifier infers features that are not actually used by the network" (Hupkes, Veldhoen, and Zuidema 2018). Often, different terms (quality, readability, usability, etc.) are used abstractly, without precise definitions. As we shall see, some of the above assumptions and conclusions are better accounted for than others by the probing classifiers paradigm. Indeed, the community has recently taken a more critical look at the methodology, which we turn to now.
4 "little is known about the information that is captured by different sentence embedding learning mechanisms" (Adi et al. 2017); "a poor understanding of what they are capturing" (Conneau et al. 2018); "little is known about what and how much these models learn." (Belinkov et al. 2017a).
5 "fine-grained measurement of some of the information encoded in sentence embeddings" (Adi et al. 2017); "simple linguistic properties of sentences" (Conneau et al. 2018); "assessing the specific semantic information that is being captured in sentence representations" (Ettinger, Elgohary, and Resnik 2016).
6 There have also been numerous other studies using the probing classifier framework as is. For a partial list, see https://github.com/boknilev/nlp-analysis-methods/issues/5. For recent analyses focusing on the BERT model (Devlin et al. 2019), see Rogers, Kovaleva, and Rumshisky (2020).
7 "evaluate the quality of the trained classifier on the given task as a proxy to the quality of the extracted representations" (Belinkov et al. 2017a).
8 "If the classifier succeeds, it means that the pre-trained encoder is storing readable tense information into the embeddings it creates" (Conneau et al. 2018).
9 "testing for extractability of semantic information by testing classification accuracy.." (Ettinger, Elgohary, and Resnik 2016); "if a sequential model is computing certain information, or merely keeping track of it, it should be possible to extract this information from its internal state space" (Hupkes, Veldhoen, and Zuidema 2018).
10 "low accuracy suggests this information is not represented in the hidden state" (Hupkes, Veldhoen, and Zuidema 2018); "if we cannot train a classifier to predict some property of a sentence based on its vector representation, then this property is not encoded in the representation (or rather, not encoded in a useful way, considering how the representation is likely to be used)" (Adi et al. 2017).

Shortcomings and Advances
In light of the promises discussed above, this section reviews several limitations of the probing classifiers framework, as well as existing proposals for addressing them. We discuss comparisons and controls, how to choose the probing classifier, which causal claims can be made, the difference between datasets and tasks, and the need to define the probed properties. We formalize new additional components (Figure 1b) in a unified framework, along with the basic components (Figure 1a).

Comparisons and controls
A first concern with the framework is how to interpret the results of a probing classifier experiment. Suppose we run such an experiment and obtain a performance of PERF(g, f, D_O, D_P) = 87.8. Is that a high or a low number? What should we compare it to? We will denote a baseline model with f̲ and an upper bound or skyline model with f̄.
Some studies compare with majority baselines (Belinkov et al. 2017a; Conneau et al. 2018). Others have proposed to design controls for possible confounders. Hewitt and Liang (2019) observe that the probing performance PERF(g, f, D_O, D_P) may tell us more about the probe g than about the model f. The probe g may memorize information from D_P, rather than evaluate information found in representations f_l(x). They design control tasks, which a probe may only solve by memorizing. In particular, they randomize the labels in D_P, creating a new dataset D_{P,Rand}. Then, they define selectivity as the difference between the probing performance on the probing task and the control task: SEL(g, f, D_O, D_P, D_{P,Rand}) = PERF(g, f, D_O, D_P) − PERF(g, f, D_O, D_{P,Rand}). They show that probes may have high accuracy but low selectivity, and that linear probes tend to have high selectivity, while non-linear probes tend to have low selectivity. This indicates that high accuracy of non-linear probes may come from memorization of surface patterns by the probe g, rather than from information captured in the representations f_l(x). The control tasks introduced by Hewitt and Liang are particularly suited for word-level properties z, as they evaluate memorization of word types; it is less clear how to apply this idea more broadly, such as to sentence-level properties.
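A control task of this kind can be sketched in a few lines. The sketch below assigns one random label to each word type and labels every token by its type, so that a probe can only succeed by memorizing types; selectivity is then the probing accuracy minus the control-task accuracy. The helper names and the two accuracy values are hypothetical, and labels are drawn uniformly here as a simplification.

```python
import random

def make_control_labels(words, num_labels, seed=0):
    """Assign each word TYPE one random label, then label every token by its type."""
    rng = random.Random(seed)
    type_to_label = {}
    for w in words:
        if w not in type_to_label:
            type_to_label[w] = rng.randrange(num_labels)
    return [type_to_label[w] for w in words]

def selectivity(perf_task, perf_control):
    """Difference between accuracy on the probing task and on the control task."""
    return perf_task - perf_control

tokens = ["the", "cat", "saw", "the", "dog", "the", "cat"]
control = make_control_labels(tokens, num_labels=3)

# Hypothetical accuracies of the same probe on the real task and the control task.
sel = selectivity(perf_task=0.95, perf_control=0.70)
```

Because every occurrence of a word type receives the same control label, the control labels carry no linguistic signal, and any accuracy on them reflects the memorization capacity of the probe.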
Taking an information-theoretic perspective on probing, Pimentel et al. (2020b) proposed to use control functions instead of control tasks in order to compare probes. Their control function is any function applied to the representation, c : f_l(x) → c(f_l(x)), and they compare the information gain, which is the difference in mutual information between the property z and the representation before and after applying the control function: G(z, h, c) = I(z; h) − I(z; c(h)). While Pimentel et al. (2020b) posit that their control functions are a better criterion than the control tasks of Hewitt and Liang (2019), subsequent work showed that the two criteria are almost equivalent, both theoretically and empirically (Zhu and Rudzicz 2020).
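For discrete toy data, the information gain G(z, h, c) = I(z; h) − I(z; c(h)) can be computed directly from counts. The sketch below uses a plug-in estimate of mutual information; the toy representations (a word identity paired with a context bit) and the control function that keeps only the word identity are assumptions made for illustration. Real representations are continuous, so the mutual information has to be approximated with a trained probe instead.

```python
import math
from collections import Counter

def mutual_information(zs, hs):
    """Plug-in estimate of I(z; h) in bits from paired discrete samples."""
    n = len(zs)
    pz, ph, pzh = Counter(zs), Counter(hs), Counter(zip(zs, hs))
    return sum((c / n) * math.log2((c / n) / ((pz[z] / n) * (ph[h] / n)))
               for (z, h), c in pzh.items())

def information_gain(zs, hs, control_fn):
    """G(z, h, c) = I(z; h) - I(z; c(h))."""
    return mutual_information(zs, hs) - mutual_information(zs, [control_fn(h) for h in hs])

# Toy data: the property z depends only on the context bit, not on the word identity.
hs = [("bank", 0), ("bank", 1), ("run", 0), ("run", 1)]
zs = ["NOUN", "VERB", "NOUN", "VERB"]

# Control function c: discard the context and keep only the word identity.
gain = information_gain(zs, hs, control_fn=lambda h: h[0])
```

Here the full representation determines z (one bit of information), while the word identity alone tells us nothing about z, so the gain is the full one bit.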
Another kind of control is proposed by Ravichander, Belinkov, and Hovy (2021), who design control datasets, where the linguistic property z is not discriminative w.r.t. the original task of mapping x to y. That is, they modify D_O and create a new dataset, D_{O,z}, where all examples have the same value for property z. Intuitively, a model f trained on D_{O,z} should not pick up information about z, since it is not useful for the task of f. They show that a probe g may learn to predict property z incidentally, even when it is not discriminative w.r.t. the original task of mapping x → y, casting doubts on causal claims concerning the effect that a property encoded in the representation may have on the original task. While they create control datasets for probing sentence-level information, the same idea can be applied to word-level properties.
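Constructing such a control dataset amounts to filtering the original dataset so that the property takes a single value. The sketch below is a minimal illustration: the toy dataset, its labels, and the `tense` annotator are hypothetical.

```python
def make_control_dataset(dataset, z_of, fixed_value):
    """Keep only the examples whose property z equals fixed_value,
    so that z cannot help discriminate between the labels y."""
    return [(x, y) for x, y in dataset if z_of(x) == fixed_value]

def tense(x):
    """Hypothetical annotator for the property z (tense of the second word)."""
    return "past" if x.split()[1].endswith("ed") else "present"

# Toy original dataset D_O of (x, y) pairs.
D_O = [("she walked home", "entail"), ("she walks home", "neutral"),
       ("they played chess", "neutral"), ("they play chess", "entail")]

D_Oz = make_control_dataset(D_O, tense, "past")
```

A model trained on `D_Oz` sees only past-tense inputs, so tense is uninformative for its task; if a probe can still predict tense from its representations, that ability was acquired incidentally.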

Which classifier to use?
Another concern is the choice of the probing classifier g: What should be its structure? What role does its expressivity play in drawing conclusions about the original model f? Some studies advocate for using simple probes, such as linear classifiers (Alain and Bengio 2016; Hupkes, Veldhoen, and Zuidema 2018; Liu et al. 2019; Hall Maudslay et al. 2020). Somewhat anecdotally, a few studies observed better performance with more complex probes, but reported similar relative trends (Conneau et al. 2018; Belinkov 2018). That is, a ranking PERF(g, f_1, D_O, D_P) > PERF(g, f_2, D_O, D_P) of two representations f_1(x) and f_2(x) holds across different probes g. However, this pattern may be flipped under alternative measures, such as selectivity (Hewitt and Liang 2019).
Several studies considered the complexity of the probe g in more detail. Pimentel et al. (2020b) argue that, in order to give the best estimate of the information that model f has about property z, the most complex probe should be used. In a more practical view, Voita and Titov (2020) propose to measure both the performance of the probe g and its complexity, by estimating the minimum description length of the code required to transmit property z knowing the representations f_l(x): MDL(g, f, D_O, D_P). Note that this measure again depends on the probe g, the model f, and their respective datasets D_O and D_P. They found that MDL provides more information about how a probe g works, for instance by revealing differences in the complexity of probes when performing control tasks from D_{P,Rand}, as in Hewitt and Liang (2019). Pimentel et al. (2020a) argue that probing work should report the possible trade-offs between accuracy and complexity, along a range of probes g, and call for using probes that are both simple and accurate. While they study a number of linear and non-linear multi-layered perceptrons, one could extend this idea to other classes of probes. Indeed, Cao, Sanh, and Rush (2021) design a pruning-based probe, which learns a mask on weights of f and obtains a better accuracy-complexity trade-off than a non-linear probe.
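One way to estimate a description length is online (prequential) coding: the labels are transmitted block by block, the first block with a uniform code and each later block with a probe trained on all preceding blocks. The sketch below illustrates this idea under strong simplifications: the representations are discrete symbols, and the probe is an add-one smoothed count table. Both are toy assumptions, as are all function names.

```python
import math
from collections import Counter

def fit_count_probe(hs, zs, classes):
    """Toy probe: add-one smoothed estimate of p(z | h) for discrete h."""
    joint, marg = Counter(zip(hs, zs)), Counter(hs)
    k = len(classes)
    return lambda h, z: (joint[(h, z)] + 1) / (marg[h] + k)

def online_codelength(hs, zs, classes, boundaries):
    """Bits needed to transmit zs block by block.
    boundaries holds the cumulative block ends t_1 < ... < t_S = len(zs)."""
    bits = boundaries[0] * math.log2(len(classes))  # first block: uniform code
    for start, end in zip(boundaries, boundaries[1:]):
        p = fit_count_probe(hs[:start], zs[:start], classes)  # train on the prefix
        bits -= sum(math.log2(p(h, z)) for h, z in zip(hs[start:end], zs[start:end]))
    return bits

classes = ["A", "B"]
zs = ["A", "B"] * 8
informative = list(zs)           # toy representation that reveals z
uninformative = ["same"] * 16    # toy representation that carries nothing about z
bounds = [2, 4, 8, 16]

mdl_informative = online_codelength(informative, zs, classes, bounds)
mdl_uninformative = online_codelength(uninformative, zs, classes, bounds)
uniform_bits = len(zs) * math.log2(len(classes))
```

With the uninformative representation the probe never beats the uniform code, so the codelength equals the uniform cost; with the informative one, each block becomes cheaper as the probe sees more data, and the total codelength is shorter.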
Another line of work proposes methods to extract linguistic information from a trained model without learning additional parameters. In particular, much work has used some sort of pairwise importance score between words in a sentence as a signal for inferring linguistic properties, either full syntactic parsing or more fine-grained properties such as coreference resolution. These scores may come from attention weights or from other sources (Wu et al. 2020). The pairwise scores can feed into some general parsing algorithm, such as the Chu-Liu-Edmonds algorithm (Chu and Liu 1965; Edmonds 1967). Alternatively, some work has used representational similarity analysis (Kriegeskorte, Mur, and Bandettini 2008) to measure similarity between word or sentence representations and syntactic properties, both local properties like determining a verb's subject (Lepori and McCoy 2020) and more structured properties like inferring the full syntactic tree (Chrupała and Alishahi 2019). Also related is work on clustering representations w.r.t. a linguistic property and classifying by cluster assignment (Zhou and Srikumar 2021). This line of work can be seen as a parameter-less probing classifier g: a linguistic property is inferred from internal model components (representations, attention weights), without needing to learn new parameters. Thus, such work avoids some of the issues about what the probe learns. Additionally, from the perspective of an accuracy-complexity trade-off, such work should perhaps be placed on the low end of the complexity axis, although the complexity of the parsing algorithm could also be taken into account.
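Representational similarity analysis, for example, needs no trained parameters: it compares the pairwise similarity structure of the model's representations with that of a reference space. The sketch below is a simplified illustration that uses cosine similarity and Pearson correlation; both choices, and the toy vectors, are assumptions made here rather than the setup of any cited study.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def pairwise_similarities(vectors):
    """Upper triangle of the similarity matrix, flattened into a list."""
    n = len(vectors)
    return [cosine(vectors[i], vectors[j]) for i in range(n) for j in range(i + 1, n)]

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))

def rsa_score(model_reps, reference_reps):
    """Correlate the two similarity structures; no parameters are learned."""
    return pearson(pairwise_similarities(model_reps), pairwise_similarities(reference_reps))

# Hypothetical model representations of four items and a reference (syntactic) encoding.
model_reps = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
syntax_reps = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

score = rsa_score(model_reps, syntax_reps)
```

A score near 1 indicates that items that are similar in the reference space are also similar in the model's representation space.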

Correlation vs. causation
A main limitation of the probing classifier paradigm is the disconnect between the probing classifier g and the original model f. They are trained in two different steps, where f is trained once and only used to generate feature representations f_l(x), which are fed into g. Once we have f_l(x), we get a probing performance from g, which tells us something about the information in f_l(x). However, in the process, we have forgotten about the original task assigned to f, which was to predict y. This raises an important question, which early work has largely taken for granted (Section 3): Does model f use the information discovered by probe g? In other words, the probing framework may indicate correlations between representations f_l(x) and linguistic property z, but it does not tell us whether this property is involved in predictions of f. Indeed, several studies pointed out this limitation, including reports on a mismatch between performance of the probe, PERF(g, f, D_O, D_P), and performance of the original model, PERF(f, D_O) (Vanmassenhove, Du, and Way 2017). In contrast, Lovering et al. (2021) find that extractability of a property according to MDL(g, f, D_O, D_P) is correlated with f making predictions consistent with that property. Relatedly, Tamkin et al. (2020) find a discrepancy between features f_l(x) obtaining high probing performance, PERF(g, f, D_O, D_P), and features identified as important when fine-tuning f while performing the probing task f_l(x) → z. They reveal this by randomizing the weights of specific layers when fine-tuning f, which can be seen as a kind of intervention.
Indeed, a number of studies have proposed improvements to the probing classifier paradigm, which aim to discover causal effects by intervening in representations of the model f. Giulianelli et al. (2018) use gradients from g to modify the representations in f and evaluate how this change affects both the probing performance and the original model performance. In their case, f is a language model and g predicts subject-verb number agreement. They find that their intervention increases probing performance, as may be expected. Interestingly, while in the general language modeling case the intervention has a small effect on the original model performance, PERF(f, D_O), they find an increase in this performance on examples designed to assess number agreement. They conclude that probing classifiers can identify features that are actually used by the model. Tucker, Qian, and Levy (2021) also use probe gradients to update the representations f_l(x) w.r.t. z, resulting in what they call counterfactual representations, and measure the effect on other properties. Similarly, Elazar et al. (2021) remove certain properties z (such as parts of speech or syntactic dependencies) from representations in f by repeatedly training (linear) probing classifiers g and projecting them out of the representation. This results in a modified representation f̃_l(x), which has less information about z. They compare the probing performance to the performance on the original task (in their case, language modeling) after the removal of said features. They find that high probing performance PERF(g, f, D_O, D_P) does not necessarily entail a large drop in original task performance after their removal, that is, in PERF(f̃, D_O). Thus, contrary to Giulianelli et al. (2018), they conclude that probing classifiers do not always identify features that are actually used by the model. In a similar vein, Feder et al. (2021) remove properties z from representations in f by training g adversarially. At the same time, another probing classifier g_C is trained positively, aiming to control for properties z_C that should not be removed from f. A major difference from standard probing classifiers work is the continued updating of f. They find that they can accurately estimate the effect of properties z on downstream tasks performed by f when it is fine-tuned. 11
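The projection step behind removal-based interventions can be illustrated for a single linear probe direction. The sketch below removes from a representation its component along a probe's weight vector, so that this particular linear probe can no longer read anything from it. It shows one projection only, whereas the procedure described above repeats the train-and-project cycle; the weight and representation values are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_out(h, w):
    """Remove from h its component along w (projection onto the null space of w)."""
    scale = dot(h, w) / dot(w, w)
    return [a - scale * b for a, b in zip(h, w)]

w = [2.0, 0.0, 1.0]    # hypothetical weight vector of a trained linear probe g
h = [1.5, -0.3, 0.8]   # hypothetical representation f_l(x)

h_modified = project_out(h, w)
score_before = dot(h, w)
score_after = dot(h_modified, w)
```

After the projection the probe's score is zero for every input, so this probe can no longer distinguish values of z; comparing the original model's task performance on `h` and on `h_modified` then gives evidence about whether the removed direction was being used.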

Datasets vs. tasks
The probing paradigm aims to study models performing some task (f : x → ŷ) via a classifier performing another task (g : f_l(x) → ẑ). However, in practice these tasks are operationalized via finite datasets. Ravichander, Belinkov, and Hovy (2021) point out that datasets are imperfect proxies for tasks. Indeed, the effect of the choice of datasets, both the original dataset D_O and the probing dataset D_P, has not been widely studied. Furthermore, we ideally want to disentangle the role of each dataset from the role of the original model f and probing classifier g. Unfortunately, models f tend to be trained on different datasets D_O, making statements about models confounded with issues of datasets. Some prior work acknowledged that conclusions can only be made about the existing trained models, not about general architectures (Liu et al. 2019). However, in an ideal world, we would compare different architectures {f_i} trained on the same dataset D_O, or the same f trained on different datasets {D_O^i}. Concerning the latter, Zhang et al. (2021) found that models require less data to encode syntactic and semantic properties compared to commonsense knowledge. More such experiments are currently lacking.
11 Other studies that perform interventions to interpret NLP models without involving probing classifiers (e.g., Bau et al. 2019;Lakretz et al. 2019;Vig et al. 2020) are left out of the present scope.
The effect of the probing dataset D_P (its size, composition, etc.) is similarly not well studied. While some work reported results on multiple datasets when predicting the same property z (e.g., Belinkov et al. 2017a), more careful investigations are needed.

Properties must be pre-defined
Finally, inherent to the probing classifier framework is determining a property z to probe for. This limits the investigation in multiple ways: It constrains the work to existing annotated datasets, which are often limited to English and certain properties. It also requires focusing on properties z that are thought a priori to be relevant to the task of mapping x → y, potentially leading to biased conclusions. In an isolated effort to alleviate this limitation, Michael, Botha, and Tenney (2020) propose to learn latent clusters useful for predicting a property z. They discover clusters corresponding to known properties (such as personhood) as well as new categories, which are not usually annotated in common datasets. Still, probing classifiers are so far mainly useful when one has prior expectations about which properties z might be relevant w.r.t. a given task.

Summary
Given the various limitations discussed in this article, one might ask: What are probing classifiers good for? In line with the original motivation to alleviate the opacity of learned representations, work using probing classifiers has characterized them along a range of fine-grained properties. However, we have discussed several reservations regarding which insights can be drawn from a probing classifier experiment. Absolute claims about representation quality seem difficult to make. Yet recent improvements to the framework, such as better controls and metrics, allow us to make relative claims and answer questions like how extractable a property is from a representation. And causal approaches (Section 4.3) may reveal which properties are used by the original model.
One might hope that probing classifier experiments would suggest ways to improve the quality of the probed model or to direct it to be better tuned to some use or task. Presently, there are few such successful examples. For instance, earlier results showing that lower layers in language models focus on local phenomena while higher layers focus on global ones (using probing classifiers and other methods) motivated Cao et al. (2020) to decouple a question-answering model, such that lower layers process the question and the passage independently and higher layers process them jointly. An analysis of redundancy in language models (again using probing classifiers and other methods) motivated an efficient transfer-learning procedure. An analysis of phonetic information in layers of a speech recognition system (Belinkov and Glass 2017) partly motivated Krishna, Toshniwal, and Livescu (2019) to propose multi-task learning with phonetic supervision on intermediate layers. Probing experiments have also been discussed as a guide for selecting which machine translation models to use when translating specific languages. Finally, when considering using the representations for some downstream task, probing experiments can indicate which information is encoded in, or can easily be extracted from, these representations.
To conclude, our critical review of the probing classifiers framework reveals that it is more complicated than it may seem. When designing a probing classifier experiment, we advise researchers to take the various controls and alternative measures into account. Naturally, one should clearly define the original task/dataset/model and the probing task/dataset/classifier. It is important to set upper and lower bounds, and to consider proper controls, via either control tasks (for word-level properties) or control datasets (for sentence-level properties). Depending on goals, one may want to measure the probe's complexity (if ease of extractability is in question), report the accuracy-complexity trade-off (when designing new probes), or perform an intervention (to measure usage of information by the original model). When possible, using parameter-free probes may circumvent some of the challenges with parameterized probes. We do not argue that every study must perform all the various controls and report all the alternative measures summarized here. However, future work seeking to use probing classifiers would do well to take into account the complexity of the framework, its apparent shortcomings, and available advances.