## Abstract

We introduce a theoretical framework for understanding and predicting the complexity of sequence classification tasks, using a novel extension of the theory of Boolean function sensitivity. The sensitivity of a function, given a distribution over input sequences, quantifies the number of disjoint subsets of the input sequence that can each be individually changed to change the output. We argue that standard sequence classification methods are biased towards learning low-sensitivity functions, so that tasks requiring high sensitivity are more difficult. To that end, we show analytically that simple lexical classifiers can only express functions of bounded sensitivity, and we show empirically that low-sensitivity functions are easier to learn for LSTMs. We then estimate sensitivity on 15 NLP tasks, finding that sensitivity is higher on challenging tasks collected in GLUE than on simple text classification tasks, and that sensitivity predicts the performance both of simple lexical classifiers and of vanilla BiLSTMs without pretrained contextualized embeddings. Within a task, sensitivity predicts which inputs are hard for such simple models. Our results suggest that the success of massively pretrained contextual representations stems in part from the fact that they provide representations from which information can be extracted by low-sensitivity decoders.

## 1 Introduction

What makes some tasks harder and others easier for modern machine learning methods?^{1} In NLP, simple models based on lexical classifiers provide good performance on some tasks, while strong performance on other tasks has been attained only recently with massive pretrained models. However, there is no unified theoretical framework for understanding these difficulty differences between tasks, or what models might be more or less effective.

Existing complexity metrics provide limited practical insight. The Chomsky Hierarchy (Chomsky, 1956) is a prominent classification of formal languages by complexity, but it describes asymptotic worst-case complexity and does not provide a measure of how hard it is to achieve high accuracy on realistic task distributions. Kolmogorov complexity (Li and Vitányi, 1993) is uncomputable and becomes well-defined only in the asymptotic limit. Psycholinguistic complexity metrics such as surprisal (Hale, 2001) and dependency length (Gibson, 1998) only capture formal features of the input, without regard to the task.

We propose **sensitivity** as a theory of complexity for sequence classification tasks, that is, any task involving learning a function from sequences to labels. The sensitivity of a function, given a distribution over input sequences, quantifies the number of disjoint subsets of the input sequence that can each be individually changed in such a way as to change the output. Intuitively, high-sensitivity functions are complex because a single change in the input, in many different places, can completely change the output; low-sensitivity functions are simpler because the output is predictable from redundant information in many subsets of the input. We will argue that sensitivity predicts what tasks are easy or hard for modern machine learning methods to learn.

Our notion of sensitivity is grounded in a well-studied theory for Boolean functions (O’Donnell, 2014), which we generalize to natural language. Unlike measures like Kolmogorov complexity, sensitivity can be estimated on real datasets and single inputs without asymptotic approximations, only requiring a generalized language model such as XLNet (Yang et al., 2019) and a strong model of the task.

In this paper, we argue that sensitivity captures informal notions of complexity both at the level of architectures and on the level of tasks. First, we show that sensitivity quantifies architectural limitations and inductive biases of various machine learning architectures used in NLP, including both lexical classifiers and vanilla LSTMs without pretrained contextualized embeddings (Section 3). Second, in a survey of 15 major NLP tasks, we find that sensitivity quantitatively predicts how difficult a task is for simple lexical classifiers and neural models, both across tasks and across different inputs for a single task (Section 4). The validity of our methods for quantifying sensitivity is verified using human experiments in Section 5. Section 6 discusses the relationship of sensitivity to previous theories of complexity and brittleness in neural networks, and implications for NLP practice. Section 7 concludes.

## 2 Sensitivity

### 2.1 Analysis of Boolean Functions

The **sensitivity** of a Boolean function *f*: {−1, 1}^{n} → {−1, 1} at a bitstring *x* ∈ {−1, 1}^{n} is defined as:

$$s(f, x) := \left|\{\, i \in \{1, \dots, n\} : f(x) \neq f(x^{\oplus i}) \,\}\right| \tag{1}$$

where *x*^{⊕i} is the result of flipping the *i*-th bit of *x*. This describes how many bits of *x* can be flipped individually to change *f*, or equivalently, how many Hamming neighbors of *x* have the opposite value of *f*.

The highest possible sensitivity is attained by the Parity function *f*_{Parity}(*x*) := ∏_{i=1}^{n} *x*_{i}. Given a string of “1”s and “−1”s, this function counts whether the number of negative inputs is even (output +1) or odd (output −1). For instance, *f*_{Parity}(1, 1, 1) = *f*_{Parity}(1, −1, −1) = 1 and *f*_{Parity}(1, −1, 1) = *f*_{Parity}(−1, 1, 1) = −1. The function *f*_{Parity} has the property that flipping any individual bit flips the output. For instance, given the string “1 1 1”, changing any of the three input symbols to “−1” flips the parity of the string from +1 to −1. Therefore, for every bitstring *x* ∈ {−1, 1}^{n}, we have *s*(*f*_{Parity}, *x*) = *n*. It is impossible to approximate *f*_{Parity} beyond chance level with linear functions (Minsky and Papert, 1969), or with linear combinations of functions that contain nonlinear interactions between fewer than *n* input bits (O’Donnell, 2014). In this sense, the function *f*_{Parity} is maximally nonlinear. On the other hand, low-sensitivity functions can be approximated with linear functions or linear combinations of functions that each only combine a few input bits (O’Donnell, 2014, Thm. 2.38). Sensitivity also has close connections with other complexity measures such as decision tree depth (Nisan, 1991) and the degree of a Boolean function when written as a polynomial.
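As a minimal sketch of definition (1), the following Python snippet counts, for a given bitstring, the positions whose flip changes the output. The helper names (`flip`, `sensitivity`, `first_bit`) are ours and purely illustrative; `first_bit` is a hypothetical low-sensitivity contrast to Parity.

```python
from itertools import product
from math import prod

def flip(x, i):
    """Return a copy of the tuple x with the i-th entry negated."""
    return x[:i] + (-x[i],) + x[i + 1:]

def sensitivity(f, x):
    """Number of positions i for which flipping x[i] changes f(x)."""
    return sum(f(x) != f(flip(x, i)) for i in range(len(x)))

def parity(x):
    """Product of the entries: +1 for an even number of -1s, else -1."""
    return prod(x)

def first_bit(x):
    """A function that ignores everything except the first input."""
    return x[0]

n = 3
for x in product((-1, 1), repeat=n):
    # Parity is sensitive to every position, first_bit to exactly one.
    assert sensitivity(parity, x) == n
    assert sensitivity(first_bit, x) == 1
```

The loop checks every bitstring of length 3, which is feasible only because the input space is tiny; it is meant as an illustration of the definition, not as an estimation procedure.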

### 2.2 Application to sequence classification

We argue that this theory can be brought to bear to quantify the complexity of sequence classification tasks. In this setting, sensitivity measures the nonlinearity of the decision boundary. Low-sensitivity tasks are those where simple methods based on linear combinations of local features are most successful. For instance, low-sensitivity tasks can be solved by bag-of-words classifiers and linear classifiers based on *n*-gram features, which have bounded sensitivity (as we will make precise in Proposition 1 below). On the other hand, high-sensitivity tasks require more sophisticated methods. We expect that tasks that have proven empirically difficult in the literature, such as those requiring reasoning, correspond to those with high sensitivity, which means that changing different substrings in an input can easily flip the label (e.g., Entailment ⇒ NonEntailment).

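A first step is to generalize (1) beyond bitstrings. For functions *f*: Σ^{n} → ℝ over a finite alphabet Σ, sensitivity can be defined as follows when the inputs *X*_{i} are distributed independently and uniformly (rephrased based on O’Donnell, 2014, Def. 8.22):

$$s(f, x) := \sum_{i=1}^{n} \mathrm{Var}\left(f(X) \mid \forall j \neq i : X_j = x_j\right) \tag{2}$$

where the variance measures how much *f* varies across strings *X* ∈ Σ^{n} that agree with *x* on all except possibly the *i*-th input. Definition (2) reduces to (1) if Σ = {−1, 1} and *f*: {−1, 1}^{n} → {−1, 1}.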

More challenging is the fact that symbol sequences in language are not distributed uniformly. For example, in movie review sentiment classification, most inputs will sound like movie reviews (rather than tweets or Wikipedia articles), and almost all will respect the grammatical and statistical properties of the underlying language. When defining a generalization of *s*(*f*, *x*) to natural language, we want to focus on those strings *x* and their Hamming neighbors *x*′ that are typical instances of the problem. We next describe an adaptation of Equations (1) and (2) taking this into account.

### 2.3 Formal Definitions

In order to adapt the idea of sensitivity to the setting of NLP tasks, we introduce a generalized notion called block sensitivity. Block sensitivity is the maximum sensitivity over all possible partitions of the input bits. Block sensitivity has been studied for Boolean functions as an upper bound on (1) (Nisan, 1991; Bernasconi, 1996; Hatami et al., 2010); we construct a probabilistic version of this notion as a sensitivity measure appropriate to more sequence classification tasks.

Consider a set Σ (e.g., the words of a language), with an arbitrary distribution Π over the set Σ* of finite sequences of symbols from Σ. We formalize classification tasks as functions *f* : Σ* → [−1, 1].^{2} Such functions could be binary classifiers *f* mapping to {−1, 1}, or they could output a continuous score. We take the output space to be [−1, 1] instead of [0, 1] to make our definitions consistent with those from the analysis of Boolean functions.

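The **subset sensitivity** of the function *f*: Σ* → ℝ on the point *x* ∈ Σ^{n} and the set *P* ⊂ {1, …, *n*} is

$$s(f, x, P) := \mathrm{Var}\left(f(X) \mid X \in x^{\oplus P}\right) \tag{3}$$

where *X* is distributed according to Π, and *x*^{⊕P} denotes the set of all strings *x*′ that agree with *x* on all indices outside of *P*:

$$x^{\oplus P} := \{\, x' \in \Sigma^{n} : x'_j = x_j \text{ for all } j \notin P \,\} \tag{4}$$

If *P* is a singleton {*i*}, we recover the term inside the sum in (2): *s*(*f*, *x*) = ∑_{i=1}^{n} *s*(*f*, *x*, {*i*}).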

We illustrate this definition in Figure 1 with examples from the Stanford Sentiment Treebank (Socher et al., 2013). Here, the function *f* maps movie reviews to the probability that the review is positive, scaled to [−1, 1]. For each sentence, we select a singleton subset *P* and show 10 samples from Π, the distribution over possible substitutions. In Sentence 1, due to the positive adjectives in the context, the distribution is concentrated on positive adjectives, and so the sensitivity *s*(*f*, *x*, *P*) ≈ 0. In Sentence 2, both positive and negative adjectives are plausible substitutions, and *s*(*f*, *x*, *P*) ≈ 0.6.
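The following sketch shows how the subset sensitivity can be estimated from a set of sampled neighbors, as the variance of the task function over those samples. It rests on loud assumptions: `toy_sentiment` is a hypothetical lexicon scorer standing in for a task model, and the substitution lists are written by hand rather than sampled from a language model, so the numbers it produces are not those of Figure 1.

```python
def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def subset_sensitivity(f, neighbors):
    """Estimate s(f, x, P) as the variance of f over sampled strings
    that agree with x outside of P."""
    return variance([f(x_prime) for x_prime in neighbors])

# Hypothetical toy scorer in [-1, 1]: mean polarity of the known words.
POLARITY = {"gorgeous": 1.0, "witty": 1.0, "charming": 1.0,
            "dull": -1.0, "boring": -1.0}

def toy_sentiment(tokens):
    scores = [POLARITY[t] for t in tokens if t in POLARITY]
    return sum(scores) / len(scores) if scores else 0.0

def substitute(x, position, candidates):
    """Neighbors of x that differ only at `position` (a singleton P)."""
    return [x[:position] + [c] + x[position + 1:] for c in candidates]

# The context restricts the slot to positive adjectives: f does not vary.
constrained = substitute(["a", "gorgeous", "witty", "film"], 1,
                         ["gorgeous", "charming", "witty"])
# The context leaves the slot open: f varies with the substitution.
open_slot = substitute(["a", "dull", "film"], 1,
                       ["gorgeous", "dull", "boring", "witty"])

low = subset_sensitivity(toy_sentiment, constrained)
high = subset_sensitivity(toy_sentiment, open_slot)
```

In this toy setting `low` is zero because every substitution leaves the score unchanged, while `high` is large because the substitutions split evenly between the two polarities.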

This example shows how (3) differs from the vanilla definition (1) by accounting for the statistical dependencies between words in natural language: It takes into account that the choice of possible completions for a set *P* is often constrained by the context given by *x*. Inputs violating these statistical dependencies (e.g., ‘a *boring*, witty, seductive movie’ for Figure 1) are unlikely to occur in naturalistic input, and the behavior of *f* on such unlikely inputs may not impact the difficulty of representing *f* with high average fidelity. This motivates considering the variance of *f* over neighboring strings, instead of, say, the entire range of *f* over all possible neighboring strings.

We define the **block sensitivity** at *x* as an analogue to (1):

$$bs(f, x) := \max_{k,\; P_1 \dot{\cup} \dots \dot{\cup} P_k} \sum_{i=1}^{k} s(f, x, P_i) \tag{5}$$

where the maximum ranges over all partitions of {1, …, *n*} into disjoint subsets *P*_{1} $\dot{\cup}$ … $\dot{\cup}$ *P*_{k} ($\dot{\cup}$ denoting disjoint union). We recover the quantity *s*(*f*, *x*) in (1) and (2) by restricting subsets *P*_{i} to the singletons {*i*}; thus, we have

$$s(f, x) \leq bs(f, x) \tag{6}$$

Intuitively, *bs*(*f*, *x*) measures the following: Given an input *x*, how many disjoint subsequences can be changed individually so as to flip the label? The formal definition modifies this logic by considering, for each subsequence, not only whether changing it to flip the label is possible in principle, but also the probabilities of the different changes. A useful summary statistic is the **average block sensitivity**:

$$\widehat{bs}(f) := \mathbb{E}_{x \sim \Pi}\, bs(f, x) \tag{7}$$

#### Why Consider Subsets?

By considering subsets *P* instead of single indices *i*, block sensitivity takes into account that words are composed into phrases, and that changing a phrase might change the meaning when changing any individual word cannot. For instance, exchanging the entire phrase ‘a gorgeous, witty, seductive’ (see Figure 1) with something negative can make the review negative, whereas exchanging any of the individual adjectives cannot, due to the statistical dependencies between the different words. This definition also makes the sensitivity measure robust against tokenization: a more fine-grained tokenization (e.g., into characters) cannot decrease *bs*(*f*, *x*).
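To make the contrast between single indices and subsets concrete, here is a brute-force sketch of block sensitivity for very short Boolean inputs. It assumes the uniform distribution over {−1, 1}^n in place of Π and enumerates all set partitions, which is only feasible for tiny *n*; `both_negative` is a hypothetical function chosen so that no single flip changes the output at (1, 1) while a joint change of both positions can.

```python
from itertools import product
from math import prod

def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def partitions(items):
    """Yield every partition of the list `items` into non-empty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        yield [[first]] + part              # `first` forms its own block
        for j in range(len(part)):          # or joins an existing block
            yield part[:j] + [[first] + part[j]] + part[j + 1:]

def subset_sensitivity(f, x, block):
    """Variance of f over all strings agreeing with x outside `block`,
    under the uniform distribution on {-1, 1}^n."""
    values = []
    for fill in product((-1, 1), repeat=len(block)):
        y = list(x)
        for idx, v in zip(block, fill):
            y[idx] = v
        values.append(f(tuple(y)))
    return variance(values)

def block_sensitivity(f, x):
    """Maximum over partitions of the summed subset sensitivities."""
    return max(sum(subset_sensitivity(f, x, b) for b in part)
               for part in partitions(list(range(len(x)))))

def parity(x):
    return prod(x)

def both_negative(x):
    """-1 only if the first two inputs are both -1, else +1."""
    return -1 if x[0] == -1 and x[1] == -1 else 1

x = (1, 1)
singleton_sum = sum(subset_sensitivity(both_negative, x, [i]) for i in range(2))
bs_value = block_sensitivity(both_negative, x)
```

Here `singleton_sum` is zero while `bs_value` is positive, mirroring the case where exchanging a whole phrase changes the label although exchanging any single word does not.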

## 3 Sensitivity Bounds for NLP Methods

Many statistical NLP methods proposed over the past decades involve linear combinations of features that look at individual words or groups of a few words. Proposition 1 shows that such methods can only express functions of bounded block sensitivity, with an upper bound quadratic in the number *k* of inputs the model looks at simultaneously, independently of input length *n*.

**Proposition 1.** *Let f be any function* Σ* → ℝ *parameterized as follows:*

$$f(x) := h\left(\frac{1}{n} \sum_{i=1}^{n-k} f_{i,n}(x_i, \dots, x_{i+k-1})\right) \tag{8}$$

*where* *f*_{1,n}, …, *f*_{n−k,n} *are functions* Σ^{k} → ℝ^{d} *such that* max_{x∈Σ^{k}} ∥*f*_{i,n}(*x*)∥_{2} ≤ *C*, *and* *h*: ℝ^{d} → ℝ *is L-Lipschitz continuous. Then, independently of input length* *n*, *we have*

$$bs(f, x) \leq 2 L^2 C^2 k^2$$

*Proof.* Fix a partition *P*_{1} $\dot{\cup}$ … $\dot{\cup}$ *P*_{l} = {1, …, *n*}. Write *g*(*x*) for the average inside *h*(⋅) in (8). Changing inputs in *P*_{i} affects up to *k*|*P*_{i}| of the summands in *g*. The ℓ_{2} norm of the sum of these affected terms is bounded by $\frac{Ck|P_i|}{n}$, and thus

$$s(f, x, P_i) \leq \frac{2 L^2 C^2 k^2 |P_i|^2}{n^2}$$

As ∑_{i=1}^{l} |*P*_{i}|^{2} ≤ (∑_{i=1}^{l} |*P*_{i}|)^{2} = *n*^{2}, we find

$$\sum_{i=1}^{l} s(f, x, P_i) \leq 2 L^2 C^2 k^2$$

This result has direct bearing on a wide variety of methods used in NLP, such as averaging word embeddings to construct sentence embeddings (Wieting et al., 2016; Arora et al., 2017; Ethayarajh, 2018), CNNs (Kim, 2014) with average pooling, and log-linear models with *n*-gram features. The parameter *k* equals 1 for models averaging word embeddings, the kernel width for CNNs with average pooling, and *n* for models using *n*-gram features. *C* describes the norm of word embeddings, of the output of a CNN kernel, or of the weights of a linear model. Lipschitz functions *h* include the sigmoid function *σ* used in logistic regression and its generalization *softmax*, which are 1-Lipschitz, and feedforward networks with Lipschitz activations.
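As a numerical sanity check of the argument behind Proposition 1, the sketch below evaluates a toy bag-of-embeddings model (*k* = 1) and compares each subset sensitivity with a bound of the form 2*L*²*C*²*k*²|*P*|²/*n*². Everything here is an assumption made for illustration: the three-symbol alphabet, the 2-dimensional embeddings, the choice *h*(*v*) = tanh(*w* · *v*) with a unit-norm *w*, the uniform input distribution, and averaging over all *n* positions.

```python
import math
from itertools import combinations, product

# Toy 2-d embeddings (hypothetical); C bounds their Euclidean norm.
EMB = {"a": (1.0, 0.0), "b": (0.0, 1.0), "c": (-0.6, -0.8)}
C = max(math.hypot(*v) for v in EMB.values())
W = (0.6, 0.8)   # unit-norm weights, so h(v) = tanh(W . v) is 1-Lipschitz
L, K = 1.0, 1    # Lipschitz constant of h; window size k

def f(x):
    """Bag-of-embeddings classifier: h applied to the mean embedding."""
    n = len(x)
    mean = [sum(EMB[t][d] for t in x) / n for d in range(2)]
    return math.tanh(W[0] * mean[0] + W[1] * mean[1])

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def subset_sensitivity(x, block):
    """Variance of f over all strings agreeing with x outside `block`,
    with the symbols inside `block` drawn uniformly from the alphabet."""
    values = []
    for fill in product(EMB, repeat=len(block)):
        y = list(x)
        for idx, sym in zip(block, fill):
            y[idx] = sym
        values.append(f(y))
    return variance(values)

x = ("a", "b", "c", "a")
n = len(x)
worst_ratio = 0.0
for size in range(1, n + 1):
    for block in combinations(range(n), size):
        bound = 2 * L**2 * C**2 * K**2 * size**2 / n**2
        worst_ratio = max(worst_ratio, subset_sensitivity(x, block) / bound)
```

A value of `worst_ratio` at or below 1 means that no subset exceeded the per-subset bound in this toy instance; it is a check on one example, not a proof.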

RNNs and LSTMs (Hochreiter and Schmidhuber, 1997) can express functions of any sensitivity, such as *f*_{Parity}, because they can express all regular languages (Horne and Hush, 1994). On the other hand, transformers (Vaswani et al., 2017) have asymptotically bounded sensitivity as the input length *n* increases (Hahn, 2020, Lemma 5).

We show that even LSTMs have a learning bias towards low-sensitivity functions, despite their theoretical capacity to represent high-sensitivity functions. We consider functions *f* : {−1, 1}^{n} → ℝ where inputs are uniformly distributed over {−1, 1}^{n}. We first evaluated average block sensitivity both for randomly initialized LSTMs and for the uniform distribution over Boolean functions {−1, 1}^{7} → {−1, 1}. We constructed Boolean functions from a randomly initialized LSTM by obtaining a scalar output and making this a binary output *f* based on a threshold chosen to maximize Var(*f*). We initialized the LSTM’s weights uniformly from [−*d*^{−0.5}, *d*^{−0.5}] or from a Gaussian with *σ*^{2} = *d*^{−1}, where *d* is the number of hidden units. Results are shown in Figure 2, for *d* = 128 and *d* = 256. Random Boolean functions have block sensitivity tightly concentrated around ≈4.5, whereas the randomly initialized LSTMs consistently show lower block sensitivity. This suggests that low-sensitivity functions are ‘overrepresented’ in the LSTM parameter space, echoing a theoretical result for feedforward networks (Palma et al., 2019).

Second, we directly examined learnability on functions of different sensitivities. As randomly chosen functions have tightly clustered sensitivity, we sampled^{3} functions with a specific targeted average sensitivity *as*(*f*) = $\frac{1}{2^n}$ ∑_{x∈{−1,1}^{n}} *s*(*f*, *x*). We did this for sequence lengths *n* = 7, 10, 15. For each *i* = 1, …, *n*, we constructed five such functions, and then trained an LSTM (128 hidden units) for 10^{5} iterations with Adam (learning rate 0.003, batch size 32), and recorded average mean squared error after 10^{2}, 10^{3}, 10^{4}, 10^{5} training iterations. Training batches and test examples are sampled uniformly from the 2^{n} elements of {−1, 1}^{n}, without consideration of out-of-sample generalization. Results are shown in Figure 2. For *n* = 7, we arrange functions by $\widehat{bs}$(*f*); for *n* = 10, 15 we take *as*(*f*) instead as it can be computed efficiently and is strongly correlated with $\widehat{bs}$(*f*) at *n* = 7 (*R* = 0.95). Low-sensitivity functions are learned perfectly with fewer iterations, whereas high-sensitivity functions are not approximated much better than chance even after 10^{5} training iterations. We note that this is a result on the ability to simply *fit* a function of 2^{n} inputs, not the (harder) task of *generalizing* to unseen input.
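For reference, the average sensitivity *as*(*f*) used above can be computed exactly for small *n* by enumerating all 2^n inputs. The sketch below does this for three illustrative functions of our own choosing (`parity`, `majority`, `constant`); it does not reproduce the sampling procedure used to construct functions with a targeted sensitivity.

```python
from itertools import product
from math import prod

def sensitivity(f, x):
    """Number of single-bit flips of x that change f."""
    count = 0
    for i in range(len(x)):
        flipped = x[:i] + (-x[i],) + x[i + 1:]
        count += f(x) != f(flipped)
    return count

def average_sensitivity(f, n):
    """as(f): mean of s(f, x) over all 2^n inputs in {-1, 1}^n."""
    inputs = list(product((-1, 1), repeat=n))
    return sum(sensitivity(f, x) for x in inputs) / len(inputs)

def parity(x):
    return prod(x)

def majority(x):
    """+1 if most inputs are +1, else -1 (n is assumed odd)."""
    return 1 if sum(x) > 0 else -1

def constant(x):
    return 1

as_parity = average_sensitivity(parity, 3)
as_majority = average_sensitivity(majority, 3)
as_constant = average_sensitivity(constant, 3)
```

Parity attains the maximum value *n*, a constant function attains zero, and majority falls in between, which gives a feel for the range that a targeted average sensitivity can span.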

## 4 Sensitivity and Difficulty of NLP Tasks

In Section 3, we provided evidence that sensitivity describes how hard a function is to learn and represent for simple machine learning architectures that do not include pretrained contextual embeddings. In this section, we argue empirically that sensitivity is successful at capturing intuitive notions of task difficulty: Low-sensitivity tasks are those on which simple classifiers as described in Proposition 1, and vanilla LSTMs without pretraining, are relatively successful. More challenging tasks such as those collected in the GLUE suite (Wang et al., 2019b) have higher sensitivity.

Estimating block sensitivity (5) requires two ingredients: an estimate of the distributions of neighboring strings Π(*X*|*X* ∈ *x*^{⊕P}), and an estimate of *f* on this set. We approximate Π via a language model, and *f* via a trained model that is known to attain strong performance on the task. That is, we estimate the sensitivity of a task *f* by measuring the sensitivity of a model *f*′ that is known to provide close fit to *f* on the task’s input distribution. In Section 5, we report human annotation studies that justify this approximation.

### Sampling Neighboring Strings

For estimating Π(*X*|*X* ∈ *x*^{⊕P}), we leverage the ability of XLNet (Yang et al., 2019) and u-PMLM (Liao et al., 2020) to model prediction in any order. We use the pretrained xlnet-large-cased model provided in Wolf et al. (2019), and a pretrained u-PMLM model trained on the 1 Billion Word benchmark (Chelba et al., 2014). As these models take input on the level of subword tokenizations, we require all samples to consist of the same number of subword symbols as in the span covered by *P*. To enable meaningful comparison with traditional tokenization and with human intuitions, we only consider subsets *P* that respect whitespace. We take 10 samples for each *P*. For tasks with short inputs (text classification and CoLA), we finetune XLNet on the training set to produce completions more in line with the task-specific input distribution. Due to compute availability, we did not apply this procedure to other tasks. Finetuning XLNet slightly increased estimated sensitivity; as we applied it to those tasks expected to have low sensitivity, this procedure potentially makes comparison between tasks more conservative.

### Tasks

First, we consider four text classification tasks: movie review sentiment (MR, Pang and Lee, 2005), sentence subjectivity (SUBJ, Pang and Lee, 2004), customer reviews sentiment (CR, Hu and Liu, 2004), and opinion polarity (MPQA, Wiebe et al., 2005). On these tasks, low-sensitivity models such as CNNs are known to achieve good performance (Kim, 2014). To approximate the functions *f*, we finetune roberta.large.mnli using fairseq for each of the tasks using a single set of hyperparameters.

Second, we selected all tasks of the GLUE challenge suite (Wang et al., 2019b), designed to require a good amount of nontrivial language understanding. GLUE contains inference, similarity, and paraphrase tasks (MNLI, Williams et al. (2018); MRPC, Dolan and Brockett (2005); QNLI, Rajpurkar et al. (2016); QQP; STS-B, Cer et al. (2017); RTE, Dagan et al. (2009)), an NLI version of the Winograd schema challenge (Levesque et al., 2012), linguistic acceptability judgments (CoLA, Warstadt et al., 2019), and the Stanford sentiment treebank (SST-2, Socher et al., 2013). On many of these tasks, simple BOW baselines perform essentially at chance (Wang et al., 2019b). We obtain predictions by finetuning RoBERTa (roberta.large.mnli) using fairseq (Ott et al., 2019) with the provided hyperparameters.^{4} RoBERTa provides performance close to or exceeding estimated human performance on all GLUE tasks. For the Winograd schema challenge, we took the WSC version from SuperGLUE (Wang et al., 2019a) instead of the NLI reformulation (WNLI) used in GLUE; we used the pretrained model roberta.large.wsc. Unlike WNLI, WSC is a single-span task, reducing the number of subsets *P* considered.

Third, we considered sequence classification formulations of POS tagging and syntactic parsing. For 150 dev sentences in the English Web Treebank (Silveira et al., 2014), we considered the word at the median position of the sentence, and estimated sensitivity of identifying (1) its POS tag in the universal tagset (Petrov et al., 2012), (2) its Universal Dependencies label (Nivre et al., 2016), and (3) the relative position of its head, as an integer. All three tasks are formalized as multi-class classification problems. We estimated all three computations using the pretrained English dependency parser provided in Stanza (Qi et al., 2018; Qi et al., 2020).

Fourth, we considered two datasets probing syntactic knowledge of anaphor licensing (Marvin and Linzen, 2018; Hu et al., 2020), namely, tasks 248 and 260 in SyntaxGym (Gauthier et al., 2020). These tasks ask a model to choose a singular (*himself*) or plural (*themselves*) reflexive after a context where only one is grammatical, but identifying the right reflexive requires syntactic knowledge. We modeled *f* using the medium-sized GPT2 model (Radford et al., 2019). We chose this task because it could be formalized as a binary classification problem, and because GPT2 performed better on this task than on the more familiar subject-verb agreement (and on the feminine version with *herself*).

For each task, we estimated sensitivity for at least 150 dev examples, determined by compute availability. For the syntactic tasks, we estimated sensitivity on the full dataset, as language models are evaluated on these tasks without finetuning.

We considered continuous predictions in [−1, 1] for binary classification tasks, and in [−1, 1]^{d} for multiclass tasks with *d* classes, obtained from the sigmoid or softmax layer of the relevant models. For STS-B, we rescale continuous similarity scores to [−1, 1]. For parsing and WSC, we used the discrete output labels provided by the pretrained models, represented as one-hot vectors ∈ {−1, 1}^{d} or binary labels ∈ {−1, 1}. For multivariate output *f*(*x*) ∈ [−1, 1]^{d}, we define *s*(*f*, *x*, *P*) by computing it for each of the *d* coordinates of *f*(*x*), and taking the maximum value over these. The resulting sensitivity estimates describe the behavior of the coordinate of *f* that has the most nonlinear decision boundary around *x*.

### Lower Bound Approximation

Calculating block sensitivity (5) requires calculating the variance for each of the exponentially many subparts *P* of the input, intractable for all but short inputs. We restrict consideration to a polynomial number of subparts, thus obtaining a *lower bound* on full block sensitivity. We only consider (1) subsets of 1, …, 8 adjacent tokens, and (2) unions of sets {*x*_{in/7}, …, *x*_{(i+1)n/7−1}} for *i* = 1, …, 7. For the parsing tasks, we additionally consider all subsets in a window of 7 tokens around the relevant word. This bounds the number of subsets by 8*n* + 256, compared to 2^{n} for full block sensitivity.
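The restricted family of subsets can be enumerated as in the sketch below. The segment boundaries are an assumption on our part: we split the positions 0, …, *n* − 1 into seven contiguous segments with integer division and drop empty segments, which is one plausible reading of the index ranges given above. The function name `candidate_subsets` and its parameters are hypothetical.

```python
from itertools import combinations

def candidate_subsets(n, max_window=8, num_segments=7):
    """Enumerate the subsets P considered by the approximation:
    (1) windows of 1..max_window adjacent tokens, and
    (2) unions of contiguous segments that split the input into
        num_segments parts (boundary convention is an assumption)."""
    subsets = set()
    # (1) Windows of adjacent tokens: at most max_window * n of them.
    for width in range(1, max_window + 1):
        for start in range(0, n - width + 1):
            subsets.add(frozenset(range(start, start + width)))
    # (2) Unions of segments: fewer than 2 ** num_segments of them.
    bounds = [i * n // num_segments for i in range(num_segments + 1)]
    segments = [range(bounds[i], bounds[i + 1]) for i in range(num_segments)]
    segments = [seg for seg in segments if len(seg) > 0]
    for r in range(1, len(segments) + 1):
        for combo in combinations(segments, r):
            subsets.add(frozenset(i for seg in combo for i in seg))
    return subsets

sizes = {n: len(candidate_subsets(n)) for n in (5, 20, 40)}
```

The number of candidates grows linearly in *n* rather than exponentially, which is what makes the estimate tractable; choosing the best collection of disjoint candidates is a separate step that this sketch does not cover.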

### 4.1 Results

Across the 15 tasks, XLNet and u-PMLM yielded very similar estimates of average block sensitivity (*R* = 0.87, *p* = 7 ⋅ 10^{−6}). In Figure 3, we show block sensitivity across tasks as estimated by XLNet. The left panels show kernel density estimates of the distribution over *bs*(*f*, *x*) over the inputs *x* from the dev sets. The right panels show estimated average block sensitivity $\widehat{bs}$(*f*). Text classification tasks have low estimated block sensitivity, with *bs*(*f*, *x*) being concentrated on values lower than three. For the two syntactic tasks, sensitivity is slightly higher; in comparison to the text classification tasks, the histograms show that these tasks have no datapoints with very low sensitivity. For parsing, we see a substantial difference between POS tagging and relation labeling on the one hand, and head identification on the other hand. Identifying tags and relations has lower sensitivity comparable to text classification tasks, whereas identifying the relative position of the head has higher sensitivity. This makes sense: The relative position of the head is sensitive to intervening words that, while not changing the syntactic relation, change the numerical distance between head and dependent. Finally, for GLUE, we observe a wide range of sensitivity scores. SST-2, a sentiment analysis task, has sensitivity very similar to the (other) text classification tasks, as do STS-B (semantic similarity) and QQP (identifying redundant Quora questions). Other tasks show substantially higher scores; the highest estimated average block sensitivities are attained by RTE, MRPC, and WSC, three tasks designed to require nontrivial reasoning.

To provide insight into these results, we show examples from SST-2 and RTE, with samples from XLNet. In Figure 4, we show two examples from SST-2. The first example has low sensitivity, as our models find only one sensitive subset. On the second example, our models find three disjoint sensitive subsets, leading to higher sensitivity. In Figure 7, we show an example from RTE, consisting of a premise and a hypothesis. The models identify five highly sensitive subsequences, such that changing the input on any of these subsequences can flip the label from Entailment to NoEntailment.

#### Sensitivity and Sentence Length

Sensitivity might be higher on longer sentences, because they can be partitioned into more sets *P*. Does this explain away the differences between tasks? Figure 5 shows per-sentence sensitivity (estimated using XLNet) as a function of sentence length. The left panel compares sensitivity on simple text classification tasks and on CoLA, a GLUE task consisting of short sentences. For the simple text classification tasks, sensitivity increases sharply for very short sentences, but then plateaus. For CoLA, it increases with length. The right panel shows averaged values *bs*(*f*, *x*) across the tasks in each of the four categories. Again, sensitivity increases for GLUE and dependency parsing, while it plateaus for text classification. The two syntactic tasks consist of short and tightly controlled sentences; in relation to their lengths, their sensitivities are particularly high.

#### Average Block Sensitivity and Simple Models

Based on Section 3, we hypothesized that tasks with low sensitivity correspond to those for which bag-of-words models can meaningfully outperform the majority class baseline, and those on which vanilla LSTM models do best. In Figure 6, we plot average block sensitivity against error reduction (in % of previously misclassified examples) of a bag-of-embeddings (BoE) model,^{5} a vanilla BiLSTM,^{6} and RoBERTa against the majority class baseline, on the development sets. BoE instantiates the model described in Proposition 1 with *k* = 1; thus, we expect the top right of this graph to be empty for BoE: There can be no high-sensitivity task on which the BoE model provides strong quantitative performance. For both BoE and the vanilla BiLSTM, average sensitivity was negatively associated with error reduction (XLNet: *R* = −0.71, *p* = 0.001 for BoE; *R* = −0.82, *p* = 0.0002 for BiLSTM. u-PMLM: *R* = −0.66, *p* = 0.005 for BoE; *R* = −0.76, *p* = 0.002 for BiLSTM), while no association was observed for RoBERTa (XLNet: *R* = −0.05, *p* = 0.87; u-PMLM: *R* = −0.07, *p* = 0.84). We compared sensitivity as a predictor with label entropy, which showed little association with error reduction of either BoE or the vanilla BiLSTM (both *p* > 0.1).
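The error reduction plotted against sensitivity can be computed as in the following sketch. The formula is our reading of "error reduction (in % of previously misclassified examples)", namely the share of the majority baseline's errors that a model removes; the accuracies passed in below are illustrative placeholders, not results from this paper.

```python
def error_reduction(model_accuracy, baseline_accuracy):
    """Percentage of the baseline's errors that the model removes.

    Both arguments are accuracies in [0, 1]; the baseline is assumed
    to make at least one error (baseline_accuracy < 1).
    """
    baseline_error = 1.0 - baseline_accuracy
    model_error = 1.0 - model_accuracy
    return 100.0 * (baseline_error - model_error) / baseline_error

# Illustrative placeholder accuracies (not taken from the experiments).
halved = error_reduction(0.75, 0.5)      # model halves the baseline error
unchanged = error_reduction(0.5, 0.5)    # model is no better than baseline
```

Under this reading, a model at the level of the majority baseline scores zero and a perfect model scores 100, so tasks with different label balance become comparable.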

#### Which Inputs Have High Sensitivity?

We used the Stanford Sentiment Treebank (SST-2, Socher et al., 2013) to investigate which inputs have high sensitivity in sentiment classification. We extracted the 445 dev inputs for which we had estimated sensitivity (determined by compute availability). The dataset contains syntactic parses, with human sentiment annotation for each constituent. We hypothesized that inputs have high sensitivity when different constituents have different sentiment. We focus on estimates from XLNet for simplicity; results from u-PMLM are qualitatively identical. We measured the dispersion of sentiment labels over constituents by enumerating positive (+1) and negative (−1) labels of all constituents, and computing the standard deviation of this resulting distribution; this is 1 if as many constituents have positive sentiment as there are constituents with negative sentiment. Figure 8 (left) shows this dispersion measure as a function of sensitivity. High-sensitivity examples have higher dispersion. In a linear regression with dispersion and sentence length as predictors of sensitivity, dispersion was highly significant (*β* = 0.53, *p* < 1.95 ⋅ 10^{−10}), while length was not (*β* = −0.00, *p* = 0.49). This is illustrated by the examples in Figure 4 discussed above, where dispersion correlates with sensitivity: The first example has low block sensitivity (0.93) and low label dispersion (0.0); the sentence is labeled positive and no constituent is labeled negative. The second example has higher block sensitivity (1.88) and very high label dispersion (0.94): while the sentence is labeled positive, three constituents are labeled positive and five negative.

Second, we hypothesized that a BoE classifier and a vanilla LSTM perform better on low-sensitivity examples, whereas RoBERTa should provide better performance also on higher-sensitivity examples. This is confirmed by Figure 8 (right), where we show the accuracy of BoE, BiLSTM, and RoBERTa as a function of sensitivity. In a logistic regression with sensitivity and sentence length as predictors of BoE accuracy, sensitivity was again highly significant (*β* = −1.22, *p* = 4.1 ⋅ 10^{−10}). Findings were similar for the BiLSTM (*β* = −1.16, *p* = 1.41 ⋅ 10^{−9}). When predicting the accuracy of RoBERTa, there was still a measurable effect of sensitivity (*β* = −1.37, *p* = 1.6 ⋅ 10^{−5}), but overall Figure 8 shows that RoBERTa provides more accurate predictions on higher-sensitivity input. Sentence length was not a significant predictor for accuracy of any of the three models (all *p* > 0.05).

If we choose *s*(*f*, *x*) instead of *bs*(*f*, *x*), namely, restricting to singletons *P*, there is still a significant effect of *s*(*f*, *x*) on BoE accuracy (*β* = −1.24, *p* = 1.1 ⋅ 10^{−6}), but with inferior model fit compared to *bs*(*f*, *x*) (ΔDeviance = 20.0), confirming block sensitivity as the more appropriate difficulty measure for simple NLP models.

#### Role of Task Model

We have estimated sensitivity of GLUE and text classification tasks using a large pretrained transformer model (RoBERTa). What would happen if we used a model outside of the family of massive pretrained contextual embeddings? To answer this, we estimated *bs*(*f*, *x*) on SST-2 and RTE using the vanilla BiLSTM to represent *f*. On SST-2, sensitivity estimated with the BiLSTM correlated with sensitivity estimated with RoBERTa on those inputs where the BiLSTM provides correct predictions (*R* = 0.36, *p* = 2 ⋅ 10^{−11}), but not on those (typically higher-sensitivity ones) where its predictions are incorrect (*R* = 0.15, *p* = 0.21); a linear regression confirmed that RoBERTa’s sensitivity was more predictive of the BiLSTM’s sensitivity in those cases that the BiLSTM labeled correctly (*β* = 0.2, *p* = 0.004). On RTE (where the BiLSTM’s accuracy is at chance), the BiLSTM’s sensitivity was at a constant low value (≈0.5) for all inputs. This illustrates that automatic estimation of sensitivity requires a strong model that is able to achieve the sensitivity levels required by a task.

#### Role of Lower Bound Approximation

We evaluated the role of the lower bound approximation on 20 inputs from SST-2 of between 8 and 11 words each, long enough to make the approximation inexact but short enough to allow consideration of all 2^{n} subsets. We compared estimates of *bs*(*f*, *x*) based on the approximation (≤ 216 subsets) and the full power set (≤2^{11} = 2048 subsets). On average, the approximation decreased estimates of *bs*(*f*, *x*) from 1.59 to 1.35. However, the two estimates were almost perfectly correlated (*R* = 0.95, *p* < 10^{−10}). Even when restricting to singletons *P* (up to 11 subsets), the correlation remained high (*R* = 0.81, *p* < 0.0001). Thus, while the approximation may underestimate the numerical values of *bs*(*f*, *x*), it preserves the relative sensitivities of different inputs.
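The reason a restricted family of subsets can only lower the estimate can be illustrated with a small sketch. It uses the classical 0/1 notion of block sensitivity on bit strings (the largest number of disjoint blocks whose flip changes the output), not the distribution-weighted quantity of equation 5, and it restricts candidates to contiguous spans purely as a hypothetical stand-in for the approximation used in the experiments. The toy function `f` and all helper names are assumptions made for illustration.

```python
from functools import lru_cache
from itertools import combinations


def flip(x, block):
    """Return a copy of x with the bits at the positions in `block` flipped."""
    return tuple(1 - b if i in block else b for i, b in enumerate(x))


def max_disjoint(blocks, n):
    """Largest number of pairwise disjoint blocks (sets of positions < n)."""
    masks = [sum(1 << i for i in b) for b in blocks]

    @lru_cache(maxsize=None)
    def best(avail):
        if avail == 0:
            return 0
        low = avail & -avail                      # lowest still-available position
        result = best(avail & ~low)               # option 1: leave it unused
        for m in masks:
            if (m & low) and (m & ~avail) == 0:   # option 2: a block covering it
                result = max(result, 1 + best(avail & ~m))
        return result

    return best((1 << n) - 1)


def block_sensitivity(f, x, candidates):
    """Classical block sensitivity of f at x, using only the candidate blocks."""
    sensitive = [b for b in candidates if f(flip(x, b)) != f(x)]
    return max_disjoint(sensitive, len(x))


def all_subsets(n):
    return [frozenset(c) for k in range(1, n + 1) for c in combinations(range(n), k)]


def contiguous_spans(n):
    return [frozenset(range(i, j)) for i in range(n) for j in range(i + 1, n + 1)]


# Toy function: fires if positions 0 and 2 are both set, or positions 1 and 3 are.
f = lambda x: int((x[0] and x[2]) or (x[1] and x[3]))
x = (0, 0, 0, 0)

full = block_sensitivity(f, x, all_subsets(4))
restricted = block_sensitivity(f, x, contiguous_spans(4))
print(full, restricted)
```

Because every restricted family is contained in the power set, any packing of disjoint sensitive blocks found in the restricted family is also available in the full search, so the restricted value can never exceed the full one. In this toy case the two interleaved blocks {0, 2} and {1, 3} are not contiguous, so the restricted search misses them.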

## 5 Human Validation

In Section 4, we estimated the sensitivity of NLP tasks by plugging a model *f*′ of the task *f* into equation 5. This methodology requires that the model *f*′ provides good labels on the samples from *x*^{⊕P} obtained using the language models. As the language models only approximate the input distribution, their samples could fall outside of the data distribution on which *f*′ approximates the true task *f* at high accuracy. If this were the case, high estimated sensitivity on tasks such as RTE might reflect brittleness of large models rather than true high sensitivity. Here, we show that this is not the case: Reasoning tasks like RTE have higher sensitivity than text classification tasks like SST-2, even when using human labels.

### 5.1 Experiment 1: Validating Oracle Model

For 60 items from SST-2 and 30 items from RTE, we collected the subsets *P*_{1}, …, *P*_{k} achieving the maximum in (5), with 6 samples from XLNet for each subset (we collected fewer items from RTE because they typically have more sensitive subsets *P*_{i}, making annotation more expensive). We then recruited naive participants who labeled these samples; each sample was labeled by two or three annotators. In addition to the appropriate labels (“positive” and “negative” for SST-2, “entails” and “does not entail” for RTE), participants were also provided with a “makes no sense” option. We repeated the study for SST-2 both with and without finetuning.

The rate of “makes no sense” responses on SST-2 was 18% without finetuning and 11% with finetuning; it was 12% on RTE. The agreement between RoBERTa and the modal human label was 80% (without finetuning) and 85% (with finetuning) on SST-2, and 72% on RTE. For comparison, the average agreement between a single annotator and the modal label was 87%, 92%, and 79%, respectively. Interannotator agreement is below the human accuracies reported by Nangia and Bowman (2019); we note that the creators of RTE specifically excluded items where human annotators did not agree (Dagan et al., 2009) and that SST-2 excludes reviews labeled as neutral (Socher et al., 2013); we thus expect lower agreement on other strings from the same domain.

The key question is whether these levels of agreement guarantee consistent sensitivity estimates. Figure 9 (left) compares block sensitivity estimated using RoBERTa with values obtained by plugging in average human labels for the function *f*(⋅). On both SST-2 and RTE, the values are strongly correlated (SST-2 with and without finetuning both *R* = 0.85; RTE: *R* = 0.91; all *p* < 2.2 ⋅ 10^{−16}). On RTE, human estimates are numerically lower than automatic estimates, but the difference in average sensitivity between SST-2 and RTE was strongly replicated by the human estimates (*β* = 1.3, *p* = 1.3 ⋅ 10^{−14} in a linear regression). These results indicate that a strong model of a task leads to results similar to a human oracle. In particular, the qualitative difference in sensitivity between SST-2 and RTE is replicated when using human labels.

### 5.2 Experiment 2: Manual Approximation

Experiment 1 showed that human and model labels yield similar results in estimating sensitivity. However, we still relied on the subsets *P*_{i} generated by the models. Here, we show that sensitivity, both on the level of individual inputs and on the level of tasks, relates to human intuitions about the number of disjoint subsequences that can be changed to flip the label, which can be easily estimated without any model.

We asked 30 naive individuals to find disjoint subsets in inputs from SST-2 and RTE such that changing the words in any one of them would flip the label. Each participant worked on 30 items from one of the tasks. They rewrote sentences by clicking on words they wanted to replace and entering text replacing those. After submitting a rewrite, participants had the option of identifying another subset disjoint from the previously selected words. They changed at least one subset for every input, and were provided a bonus for every additional subset, incentivizing them to find as many disjoint subsequences as possible. For both SST-2 and RTE, participants were shown an example with instructions guiding them to change three disjoint subsequences. For RTE, we only allowed participants to modify the premise, as similar changes are often possible in premise and hypothesis.

We interpreted the number of disjoint subsets found by participants as a proxy for block sensitivity. This quantity is different from block sensitivity (5), as it does not weight the relative probabilities of all possible changes, and we thus do not expect the same numerical values for both quantities. An exact human estimate of block sensitivity would rely on asking humans both to create multiple samples from *x*^{⊕P} for different subsets *P* and to then label these, infeasible given the large number of possible subsets of each input. In contrast, the task described here only requires annotation from a few annotators for every input.

Figure 9 (right) shows the average number of changes made on each input, as a function of the sensitivity estimated by XLNet + RoBERTa. We conducted a mixed-effects Poisson regression of the number of changes made on the inputs, with random effects for items and subjects. Sensitivity predicted the number of changes (*β* = 0.061, *SE* = 0.02, *p* = 0.0023), and there were overall more changes for RTE than for SST-2 (*β* = 0.39, *SE* = 0.097, *p* = 6 ⋅ 10^{−5}). Input length was not predictive (*β* = −0.0015, *SE* = 0.002, *p* = 0.32). This result shows that a fully manual annotation task can approximate differences in sensitivity both between inputs (effect of sensitivity) and between tasks (effect of the contrast between RTE and SST-2).
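For readers less familiar with this type of analysis, one plausible way to write the regression is sketched below. The text only states that random effects for items and subjects were included; the random-intercept structure, the dummy coding of the task contrast, and the symbols *u*_{i}, *v*_{j} are assumptions made here for illustration, with *y*_{ij} denoting the number of changes made by participant *j* on input *x*_{i}.

```latex
\begin{align*}
y_{ij} &\sim \mathrm{Poisson}(\lambda_{ij}) \\
\log \lambda_{ij} &= \beta_0
  + \beta_{\mathrm{sens}}\, bs(f, x_i)
  + \beta_{\mathrm{task}}\, \mathbb{1}[\mathrm{task}(x_i) = \mathrm{RTE}]
  + \beta_{\mathrm{len}}\, |x_i|
  + u_i + v_j \\
u_i &\sim \mathcal{N}(0, \sigma_u^2), \qquad v_j \sim \mathcal{N}(0, \sigma_v^2)
\end{align*}
```

Under this reading, coefficients act multiplicatively on the expected count: a task coefficient of 0.39 would correspond to a factor of exp(0.39) ≈ 1.48, and a sensitivity coefficient of 0.061 to a factor of exp(0.061) ≈ 1.06 per unit of estimated sensitivity. A different contrast coding would change the interpretation of the task coefficient.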

## 6 Discussion

We have proposed sensitivity as a theoretical framework for studying the complexity of sequence classification tasks, arguing that it captures complexity both across several machine learning architectures (Section 3) and across NLP tasks (Section 4).

Prior work has studied the ability of RNNs and transformers to represent and learn languages in different classes of the Chomsky hierarchy (e.g., Merrill, 2019). Sensitivity is orthogonal to the Chomsky hierarchy: The maximally sensitive function *f*_{Parity} has a two-state finite automaton, but there are also low-sensitivity functions that are not even computable. Sensitivity is also distinct from Kolmogorov complexity and similar description length measures (Li and Vitányi, 1993): *f*_{Parity} has high sensitivity but very low description length. Whereas Kolmogorov complexity is uncomputable and can only be approximated asymptotically, sensitivity can be calculated for individual inputs, enabling us to explicitly evaluate it as a predictor of difficulty on NLP tasks.
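The contrast between automaton size and sensitivity can be checked directly on short bit strings. The sketch below uses the classical definition of sensitivity (the number of single positions whose flip changes the output) rather than the distribution-weighted variant used in our experiments; `parity_dfa`, `first_bit`, and `sensitivity` are illustrative names.

```python
from itertools import product


def parity_dfa(bits):
    """Two-state automaton for parity: the state is the running XOR."""
    state = 0
    for b in bits:
        state ^= b  # a 1 toggles the state, a 0 leaves it unchanged
    return state


def first_bit(bits):
    """A low-sensitivity function for comparison: it ignores all but one position."""
    return bits[0]


def sensitivity(f, x):
    """Number of single positions whose flip changes f(x)."""
    fx = f(x)
    return sum(f(x[:i] + (1 - x[i],) + x[i + 1:]) != fx for i in range(len(x)))


n = 6
inputs = list(product((0, 1), repeat=n))
parity_sens = [sensitivity(parity_dfa, x) for x in inputs]
first_bit_sens = [sensitivity(first_bit, x) for x in inputs]
print(min(parity_sens), max(parity_sens), max(first_bit_sens))
```

Every input of length *n* has sensitivity *n* under parity, the largest possible value, even though the recognizer needs only two states. Since classical block sensitivity is at least the sensitivity and at most *n*, it is also *n* here.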

### Implications for NLP Practice

Our results in Section 4 suggest that pretrained contextualized embeddings have been so successful in NLP because they make it possible to learn high-sensitivity functions with modest amounts of task-specific training data. We conjecture that, through large-scale pretraining, models implicitly learn high-sensitivity operations that are generally useful for language understanding. Finetuning such models for classification tasks (Howard and Ruder, 2018; Peters et al., 2018; Devlin et al., 2019) amounts to composing a high-sensitivity model with a low-sensitivity classifier. Some classical techniques can also be interpreted in this light, such as aligning parse trees (a potentially high-sensitivity computation) and extracting features from these alignments that then are fed into an SVM (a low-sensitivity classifier) as an approach to tasks like RTE (Dagan et al., 2009).

### Decision Boundaries in NLP

The decision boundaries of NLP models are commonly studied to understand their linguistic knowledge (e.g., Linzen et al., 2016; Marvin and Linzen, 2018; Futrell et al., 2019; Jeretic et al., 2020). Kaushik et al. (2020) and Gardner et al. (2020) propose to improve NLP models and their evaluation by specifically considering input pairs that differ in some part and in their (true) label. Datta et al. (2020) propose to quantify the difficulty of an input by the largest eigenvalue of the Fisher information matrix of a task model, finding that it predicts how sensitive classifiers are to word substitutions.

Sensitivity is different from widely studied phenomena of adversarial brittleness (Szegedy et al., 2014; Jia and Liang, 2017): The existence of adversarial examples typically means that natural examples have *some* neighbors, possibly *outside* of the input distribution, on which model output changes even though the true label does not. In contrast, high sensitivity means that there are *many* neighboring inputs *within* the data distribution on which the *true label* changes. Sensitivity may be related to the observation that models often rely on spurious statistical patterns, such as simple lexical correlates of the label observed in reading comprehension datasets (e.g., Kaushik and Lipton, 2018; Gururangan et al., 2018); we expect that such artifacts decrease task sensitivity as they make the gold labels correlated with the output of simple lexical classifiers. Similarly, if the premise alone is predictive of the label in an entailment task (Poliak et al., 2018), changing the hypothesis while staying within the task distribution is less likely to flip the label, again decreasing sensitivity.

### Inductive Biases in Neural Networks

There is empirical and theoretical evidence that the generalization capabilities of neural networks are in part due to a bias towards “simple” functions, with different formal notions of simplicity (e.g., Franco, 2006; Palma et al., 2019; Valle-Perez et al., 2019). A few studies explicitly propose notions similar to low sensitivity as describing simplicity (Franco, 2006; Palma et al., 2019; Novak et al., 2018). Relatedly, empirical work shows that neural networks learn low frequencies in the Fourier spectrum of functions first (Rahaman et al., 2019; Xu et al., 2019; Cao et al., 2019). As low average sensitivity corresponds to concentration of Fourier spectrum on low frequencies (O’Donnell, 2014, Prop. 3.2), this can be understood as a bias towards low sensitivity. One aspect distinguishing our results here from these prior studies is that we measure sensitivity of realistic functions arising as NLP tasks and on distributions reflecting the nontrivial statistics of natural language. Measuring sensitivity or Fourier spectra on other machine learning tasks is an interesting problem for future research.
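The link between average sensitivity and the Fourier spectrum can be verified by brute force on small Boolean functions. The sketch below assumes functions from {−1, 1}^{n} to {−1, 1} under the uniform distribution and checks the standard identity that average sensitivity equals the sum over subsets *S* of |*S*| times the squared Fourier coefficient; the two example functions and all helper names are illustrative choices.

```python
from itertools import product
from math import prod


def fourier_coefficients(f, n):
    """f_hat(S) = E_x[f(x) * prod_{i in S} x_i] for uniform x in {-1, 1}^n."""
    points = list(product((-1, 1), repeat=n))
    coeffs = {}
    for S in product((0, 1), repeat=n):  # S encoded as an indicator vector
        total = sum(f(x) * prod(xi for xi, s in zip(x, S) if s) for x in points)
        coeffs[S] = total / len(points)
    return coeffs


def average_sensitivity(f, n):
    """Expected number of coordinates whose sign flip changes f(x)."""
    points = list(product((-1, 1), repeat=n))
    flips = 0
    for x in points:
        for i in range(n):
            y = x[:i] + (-x[i],) + x[i + 1:]
            flips += f(x) != f(y)
    return flips / len(points)


def spectral_sum(f, n):
    """Sum over S of |S| * f_hat(S)^2."""
    return sum(sum(S) * c * c for S, c in fourier_coefficients(f, n).items())


n = 5
parity = lambda x: prod(x)                    # all weight on the degree-n term
majority = lambda x: 1 if sum(x) > 0 else -1  # more weight on low degrees

for name, f in [("parity", parity), ("majority", majority)]:
    print(name, average_sensitivity(f, n), round(spectral_sum(f, n), 6))
```

Parity places its entire spectrum on the single highest-degree coefficient and reaches the maximal average sensitivity *n*, whereas majority concentrates more of its weight on low degrees and has a correspondingly smaller value.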

## 7 Conclusion

We proposed block sensitivity as a complexity measure for functions from sequences to labels, applying the measure to quantify the complexity of sequence classification tasks in NLP. Block sensitivity generalizes well-understood complexity measures from the theory of Boolean functions to the setting of natural language. We showed both theoretically and empirically that low sensitivity characterizes tasks on which simple models without massive pretraining provide reasonable performance, and that, in such tasks, more difficult inputs correspond to those with high sensitivity. Our results show that pretrained contextual embeddings enable models to learn tasks with higher sensitivity, and suggest designing challenging tasks by maximizing sensitivity.

## Acknowledgments

We thank Judith Degen, Kawin Ethayarajh, Mike Frank, Noah Goodman, and the members of the Stanford NLP group for helpful discussion and feedback. We also thank the anonymous TACL reviewers for their insightful feedback that helped improve the paper. We are also grateful to Yi Liao for providing code and models for u-PMLM. This work was supported by NSF grant #1947307 to RF.

## Notes

^{1}

^{2}

For multi-class problems, we take a family of functions *f* corresponding to the classes; see Section 4.

^{3}

For each *i* = 1, …, *N*, we sampled functions *f* where the Fourier spectrum is entirely concentrated on degrees {*i* − 1, *i*, *i* + 1}. By O’Donnell (2014, Thm. 2.38), *as*(*f*) ≈ *i*.

^{4}

https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.glue.md, retrieved June 1, 2020.

^{6}

The syntax tasks have no training sets and we thus do not report BiLSTM results; we deduced necessarily at-chance performance for BoE from the design of the task. We excluded STS-B because it cannot be evaluated with accuracy.