Theoretical Limitations of Self-Attention in Neural Sequence Models

Transformers are emerging as the new workhorse of NLP, showing great success across tasks. Unlike LSTMs, transformers process input sequences entirely through self-attention. Previous work has suggested that the computational capabilities of self-attention to process hierarchical structures are limited. In this work, we mathematically investigate the computational power of self-attention to model formal languages. Across both soft and hard attention, we show strong theoretical limitations of the computational abilities of self-attention, finding that it cannot model periodic finite-state languages, nor hierarchical structure, unless the number of layers or heads increases with input length. These limitations seem surprising given the practical success of self-attention and the prominent role assigned to hierarchical structure in linguistics, suggesting that natural language can be approximated well with models that are too weak for the formal languages typically assumed in theoretical linguistics.


Introduction
Transformers are emerging as the new workhorse of NLP, achieving the state-of-the-art in tasks such as language modeling, machine translation, and creating pretrained contextualized word embeddings. Eschewing recurrent computations, transformers are entirely based on self-attention, performing their computations largely in parallel. This enables them to scale to very long sequences (Vaswani et al., 2017;Dai et al., 2019;Child et al., 2019). On the other hand, it has been suggested that this limits their expressiveness, as they cannot process input sequentially (Tran et al., 2018;Dehghani et al., 2019;Shen et al., 2018a;Chen et al., 2018;Hao et al., 2019). One aspect thought to be challenging for sequence models is hierarchical structure and recursion. Hierarchical structure is widely thought to be essential to modeling natural language, in particular its syntax (Everaert et al., 2015). Consequently, many researchers have studied the capability of recurrent neural network models to capture context-free languages (e.g., Kalinke and Lehmann, 1998;Gers and Schmidhuber, 2001;Grüning, 2006;Weiss et al., 2018;Sennhauser and Berwick, 2018;Korsky and Berwick, 2019) and linguistic phenomena involving hierarchical structure (e.g., Linzen et al., 2016;Gulordava et al., 2018). Some experimental evidence suggests that transformers might not be as strong as LSTMs at modeling hierarchical structure (Tran et al., 2018), though analysis studies have shown that transformer-based models encode a good amount of syntactic knowledge (e.g., Clark et al., 2019;Lin et al., 2019;Tenney et al., 2019).
In this work, we examine these questions from a theoretical perspective, asking whether models entirely based on self-attention are theoretically capable of modeling hierarchical structures involving unbounded recursion. Formally, we study their ability to perform two computations that are thought to be essential to hierarchical structure: First, their ability to correctly close brackets, a basic problem underlying all nonregular context-free languages and formalized by the DYCK language (Chomsky and Schützenberger, 1963). Second, their ability to evaluate iterated negation, a basic component of the task of evaluating logical formulas, amounting to evaluating the PARITY of bitstrings. We show that neither of these problems can be solved by transformers and similar models relying entirely on self-attention, unless the number or size of parameters increases with the input length. Besides representing basic building blocks of hierarchical structure, these languages also represent large classes of regular and context-free languages, meaning that our results carry over to classes of other formal languages. Our results therefore also yield more generally limitations on the ability of self-attention to model finite-state languages and context-free languages.
Although theoretical work has investigated the power of recurrent neural networks in depth (e.g., Siegelman and Sontag, 1995; Bengio et al., 1994; Weiss et al., 2018; Hardt, 2019; Merrill, 2019), the theoretical study of self-attention has begun only recently (Pérez et al., 2019; Hsieh et al., 2019). Our study provides the first theoretical results on limitations in the power of self-attention. We will provide results both for hard and soft attention settings, using different proof methods. Our results are strongest in the hard attention setting, holding without further assumptions on activation functions and parameter norms. In the soft attention setting, we still obtain results assuming smoothness of activation functions as used in practical implementations.
After discussing related work (Section 2), we introduce self-attention (Section 3) and two fundamental formal languages representing regular and context-free languages (Section 4). We then prove that self-attention cannot model these languages using either hard (Section 5) or soft (Section 6) attention. Finally, we discuss our results (Section 7).

Related Work
Prior Work on Self-Attention. Transformers were proposed by Vaswani et al. (2017); previous related work using self-attention includes Cheng et al. (2016), Parikh et al. (2016), Paulus et al. (2018), and Lin et al. (2017). It has been a recurrent suggestion in the literature that transformers, relying entirely on self-attention, are restricted computationally, as they cannot process their input sequentially. Dehghani et al. (2019) suggested that transformers cannot compute functions that require sequential processing of input, without providing further details or proofs. Similarly, Shen et al. (2018a), Chen et al. (2018), and Hao et al. (2019) have introduced extensions of transformers with recurrence, citing similar intuitions about limitations of transformers. Our results provide the first explicit formalization of these limitations.
A few studies have experimentally tested the abilities of transformers to learn structures. Most related to our work, Tran et al. (2018) compared the ability of transformers and LSTMs to learn hierarchical structure, specifically, English subject-verb agreement and evaluating logical formulas. Their experimental results suggested that LSTMs are better at learning hierarchical structure. Another study experimentally investigated the power of self-attention to extract word order information, finding differences between recurrent and self-attention models; however, these were modulated by the training objective. Lin et al. (2019) and Tenney et al. (2019) show that BERT (Devlin et al., 2019) encodes syntactic information.
Theoretical study of transformers was initiated by Pérez et al. (2019), who theoretically studied the ability of Seq2Seq transformers to emulate the computation of Turing machines. While we consider incremental modeling of sequences, where the number of computation steps is bounded by the input length n, they study the setting in which the transformer computes an unbounded number of autoregressive decoding steps, not bounded in the input length n. Even more recently, and more closely related to our interest here, Hsieh et al. (2019) studied the adversarial robustness of transformers. Although they focused on experiments on NLP tasks, they also provided a theoretical analysis, showing that a single self-attention layer with a single head will be robust against input perturbations, assuming that input embeddings are drawn uniformly from the unit sphere. One of our results, Lemma 5, can be seen as considerably widening the scope of their result, both by avoiding distributional assumptions, and by applying to transformers with arbitrary numbers of heads and layers.
Investigating the Power of Sequence Modeling Architectures. The computational power of recurrent neural networks has been a focus of study. A particular focus has been on their ability to learn non-regular context-free languages, thought to provide simple models of recursion and hierarchical structure as found in natural language.
A range of studies has experimentally examined the ability of recurrent networks to model counter languages such as $a^n b^n$ (Kalinke and Lehmann, 1998; Gers and Schmidhuber, 2001; Cartling, 2008; Weiss et al., 2018; Suzgun et al., 2019). Other work has experimentally studied the performance of recurrent architectures on learning to recognize well-bracketed strings, a similar but more challenging problem (Sennhauser and Berwick, 2018; Skachkova et al., 2018; Bernardy, 2018). Beyond modeling formal languages, another line of work has studied the ability of LSTMs to model hierarchical structure as occurring in realistic natural language data (Linzen et al., 2016; Gulordava et al., 2018).
Recently, Merrill (2019) and Korsky and Berwick (2019) theoretically studied several types of recurrent networks. Merrill (2019) showed that, in the finite precision setting, LSTMs recognize a subset of the counter languages, whereas GRUs and simple RNNs recognize regular languages. Korsky and Berwick (2019) showed, among other results, that arbitrary-precision RNNs can emulate pushdown automata, and can therefore recognize all deterministic context-free languages.
A related, though different, strand of research has investigated the power of neural networks to model Turing machines. A classical result (Siegelman and Sontag, 1995) states that, given unlimited computation time, recurrent networks can emulate the computation of Turing machines. Very recently, Pérez et al. (2019) have shown the same result for both (argmax-attention) Transformers and Neural GPUs. The crucial difference between these studies and studies of language recognition is that, in these studies, the networks are allowed to perform unbounded recurrent computations, arbitrarily longer than the input length.

Self-Attention
Here we define self-attention as used in Transformers, following Vaswani et al. (2017), with some changes in the notation to simplify arguments in our proofs. We have an input $x = x_1 \dots x_n$, where all $x_i$ come from some finite alphabet $V$, and $x_n$ is an end-of-sequence symbol. This input is then encoded into a sequence of input embeddings $v_1, \dots, v_n$ using some embedding map $V \to \mathbb{R}^k$. We furthermore have a sequence $p_1, p_2, \dots$ of positional embeddings $p_i \in \mathbb{R}^k$. These are independent of the input $x$, and can be computed through some predefined scheme, or could be learned for each position occurring in the training data (Vaswani et al., 2017). Input and positional embeddings are combined (e.g., via addition or concatenation) to vectors $y^{(0)}_i = f(v_i, p_i)$ ($i = 1, \dots, n$), which we will refer to as Layer 0.
A transformer has a fixed number $L$ of layers; the activations $y^{(k)}_i$ at position $i$ of the $k$-th layer ($k = 1, \dots, L$) are defined as follows. Each layer has a set of $H$ attention heads; we first compute attention scores for the $h$-th head:
$$a^{(k,h)}_{i,j} = f^{att}_{k,h}\left(y^{(k-1)}_i, y^{(k-1)}_j\right) \quad (1)$$
where $f^{att}_{k,h}$ combines the activations from the previous level into an attention score. This can be implemented, for example, using dot product or additive attention. Specifically, the implementation described by Vaswani et al. (2017, p. 5) linearly transforms the position-wise activations $y^{(k-1)}_i$ separately into 'query' vectors $Qy^{(k-1)}_i$ and 'key' vectors $Ky^{(k-1)}_i$ (for parameter matrices $Q$, $K$); the attention score $a^{(k,h)}_{i,j}$ is then a scaled dot product of the query $Qy^{(k-1)}_i$ and the key $Ky^{(k-1)}_j$.
The activation of the head is computed by weighting according to attention weights $\hat{a}^{(k,h)}_{i,j}$:
$$b_{i,k,h} = \sum_{j=1}^{n} \hat{a}^{(k,h)}_{i,j} \, y^{(k-1)}_j \quad (2)$$
We note that the implementation described by Vaswani et al. (2017) first linearly transforms the activations $y^{(k-1)}_j$ into 'value vectors' before multiplying with $\hat{a}^{(k,h)}_{i,j}$; this is mathematically equivalent to applying this linear transformation to $b_{i,k,h}$ as part of the map $f^{act}$ we describe below.
In the soft attention version, these weights $\hat{a}^{(k,h)}_{i,j}$ are obtained by the softmax operation:
$$\hat{a}^{(k,h)}_{i,\cdot} = \mathrm{softmax}\left(a^{(k,h)}_{i,\cdot}\right) \quad (3)$$
In the hard attention variant (Pérez et al., 2019), one takes the actual maximum attention values:
$$\hat{a}^{(k,h)}_{i,j} = \delta_{j, \arg\max_{j'} a^{(k,h)}_{i,j'}} \quad (4)$$
The per-position activations are then computed as
$$y^{(k)}_i = f^{act}\left(y^{(k-1)}_i, b_{i,k,1}, \dots, b_{i,k,H}\right) \quad (5)$$
where $f^{act}$ is implemented as a fully-connected feedforward network with a skip-connection (Vaswani et al., 2017).
Hard and Soft Attention. There is a choice between soft attention and hard attention (Shen et al., 2018b; Pérez et al., 2019). The one prior theoretical study of transformers (Pérez et al., 2019) assumes hard attention. In practice, soft attention is easier to train with gradient descent; however, analysis studies suggest that attention often concentrates on one or a few positions in trained transformer models (Voita et al., 2019; Clark et al., 2019) and that the most important heads are those that clearly focus on a few positions (Voita et al., 2019), suggesting that attention often behaves like hard attention in practice. We will examine both hard (Section 5) and soft (Section 6) attention.
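The difference between the two attention variants can be made concrete with a small sketch. It is only an illustration under simplifying assumptions: scores are plain dot products without the learned query and key matrices, ties in hard attention are broken in favor of the first maximal position, and all function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def soft_weights(scores):
    """Softmax over one row of attention scores (shifted for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hard_weights(scores):
    """One-hot weights on the first position with maximal score.
    Breaking ties by position is an assumption of this sketch."""
    j_star = max(range(len(scores)), key=lambda j: scores[j])
    return [1.0 if j == j_star else 0.0 for j in range(len(scores))]

def head_activation(y_prev, i, weights_fn):
    """b_i = sum_j a_hat[i][j] * y_prev[j], using dot-product scores."""
    scores = [dot(y_prev[i], y_j) for y_j in y_prev]
    w = weights_fn(scores)
    dim = len(y_prev[0])
    return [sum(w[j] * y_prev[j][d] for j in range(len(y_prev)))
            for d in range(dim)]

# Three positions with 2-dimensional layer-(k-1) activations.
y_prev = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
b_hard = head_activation(y_prev, 1, hard_weights)  # copies one position
b_soft = head_activation(y_prev, 1, soft_weights)  # mixes all positions
```

Under hard attention the head output is exactly one previous-layer activation, which is the property the combinatorial argument of Section 5 exploits; under soft attention every position contributes with nonzero weight.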
Formalizing Language Recognition. We consider the problem of language recognition, the task of classifying input strings as belonging to or not belonging to a formal language. Following Weiss et al. (2018), we formalize this as the sequence-to-sequence task of mapping words to labels 1 ('in the language') and 0 ('not in the language'). Following the construction of transformers in sequence-to-sequence tasks (Vaswani et al., 2017), we compute a softmax probability vector for this label from the last activation $y^{(L)}_n$, obtained after reading the end-of-sequence symbol.

Regular and Context-Free Languages
We will analyze the ability of transformers to recognize regular and context-free languages, using two prominent representatives.
PARITY is the set of bit strings such that the number of 1s is even. This is a very simple regular language, recognized by a finite-state automaton with two states. The regular languages form the lowest layer of the Chomsky hierarchy, and even simple RNNs can compute all regular languages. Within the regular languages, a particularly basic class is formed by the counter-free or star-free languages (McNaughton and Papert, 1971), which can be expressed by regular expressions using only union, complementation, and concatenation. In some sense, PARITY is the simplest non-counter-free, or periodic, regular language. This means, if transformers cannot compute PARITY, they cannot recognize (almost) any regular language that is not counter-free. In the context of natural language, PARITY naturally arises in the context of evaluating logical formulas: Evaluating iterated negations is tantamount to counting whether the number of nested negations is even or odd. If transformers cannot compute PARITY, they also cannot evaluate logical formulas accurately.
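The two-state automaton for PARITY mentioned above can be sketched directly. The function name `in_parity` and the string encoding of bit strings are choices made for this illustration only.

```python
def in_parity(bits: str) -> bool:
    """Two-state automaton: state 0 = even number of 1s seen so far,
    state 1 = odd number. Accept iff the run ends in state 0."""
    state = 0
    for ch in bits:
        if ch not in "01":
            raise ValueError("PARITY is defined over the alphabet {0, 1}")
        if ch == "1":
            state ^= 1  # a 1 toggles the state, a 0 leaves it unchanged
    return state == 0

def flip(bits: str, i: int) -> str:
    """Return bits with the symbol at 0-indexed position i flipped."""
    return bits[:i] + ("1" if bits[i] == "0" else "0") + bits[i + 1:]
```

Flipping any single position toggles membership, for example `in_parity("1001")` holds while `in_parity(flip("1001", 2))` does not. This maximal sensitivity to every input bit is what the later proofs rely on.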
2DYCK is the language of correctly bracketed words consisting of two types of brackets ('(', ')' and '[', ']'). This language is a very simple model of hierarchical structure. The Chomsky-Schützenberger theorem (Chomsky and Schützenberger, 1963) states that any context-free language arises from a variant of 2DYCK with multiple types of parentheses through intersection with a regular language and homomorphisms. Consequently, the ability of LSTMs to model languages such as 2DYCK has been an object of experimental study (Sennhauser and Berwick, 2018; Skachkova et al., 2018; Bernardy, 2018). Our theoretical results will show that transformers are strongly limited in their ability to model 2DYCK, including variants with fewer or more types of parentheses.
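For reference, membership in 2DYCK can be decided with a stack, which is the unbounded memory that distinguishes it from a regular language. The following is a minimal sketch; the name `in_2dyck` is hypothetical.

```python
CLOSER_TO_OPENER = {")": "(", "]": "["}

def in_2dyck(word: str) -> bool:
    """Stack-based recognizer for well-bracketed words over ( ) [ ]."""
    stack = []
    for ch in word:
        if ch in "([":
            stack.append(ch)
        elif ch in CLOSER_TO_OPENER:
            # A closing bracket must match the most recent open bracket.
            if not stack or stack.pop() != CLOSER_TO_OPENER[ch]:
                return False
        else:
            raise ValueError("2DYCK is defined over the alphabet ( ) [ ]")
    return not stack  # every opened bracket must have been closed
```

The stack depth is unbounded in the input length, for example on `"(" * m + ")" * m`.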

Results for Hard Attention
We will start our analysis with the study of hard attention (Pérez et al., 2019). We show that hard attention transformers cannot represent PARITY or 2DYCK. To keep the results maximally general, our analysis will use combinatorial arguments and make no assumption about, for example, activation functions and the norms of parameter matrices. In fact, we do not even assume that the internal position-wise representations $y^{(k)}_j$ in each layer are vector-valued, as opposed to, say, discrete structures.
We aim to prove that no hard-attention transformer is capable of representing PARITY or 2DYCK, by constructing, for any given candidate transformer model, a set of input words that this model will have to misclassify. The basic idea (see Figure 1) behind the proof is that, by fixing a small fraction of the input symbols in a particular way, we can "capture the attention" of the transformer in such a way that it ends up ignoring almost all remaining input symbols. This shows that the transformer could not have solved a problem such as PARITY, where every single input bit matters.
Figure 1: Iteratively reducing the layers of a transformer by fixing a few input symbols. (a) By applying a suitable input restriction, we fix a small number of input symbols, 'attracting' attention from the first layer to a few inputs. (b) After this step, Lemma 4 ensures that each activation in the first layer only depends on a small number of input symbols that it can attend to (solid connections), plus the input that feeds into it via a skip connection (dashed connections). (c) We again fix a few input symbols in such a way as to 'attract' attention of layer-2 heads to some layer-1 activations. As a result, each layer-2 activation only depends on a small number of layer-1 activations, again by Lemma 4. (d) After this step, each layer-1 activation only depends on a few inputs, and we can remove layer 1.
In order to formalize the idea of "fixing" a few input bits, we introduce the notion of input restrictions: An input restriction (short: restriction) $\rho$ is a family of maps $\rho_n : \{1, \dots, n\} \to \{*, 0, 1\}$ for $n \in \mathbb{N}$. An input restriction $\rho$ is applied to a transformer by fixing, when the input length is $n$, the input symbol $x_i$ to the value $\rho_n(i)$ whenever $\rho_n(i) \in \{0, 1\}$. The output value of the resulting transformer only depends on those inputs $x_i$ such that $\rho_n(i) = *$.
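As an illustration of this definition, a single map $\rho_n$ can be represented as a list over $\{*, 0, 1\}$. The representation and the helper names below are assumptions of this sketch, not part of the formal development.

```python
STAR = "*"

def free_positions(rho_n):
    """1-indexed positions i with rho_n(i) = *."""
    return [i + 1 for i, r in enumerate(rho_n) if r == STAR]

def apply_restriction(rho_n, free_bits):
    """Build the full input: fixed positions take their value from rho_n,
    unrestricted positions are filled from free_bits, left to right."""
    if len(free_bits) != len(free_positions(rho_n)):
        raise ValueError("need exactly one bit per unrestricted position")
    remaining = iter(free_bits)
    return [next(remaining) if r == STAR else r for r in rho_n]

# A restriction for input length n = 6 that leaves positions 1, 4, 5 free.
rho_6 = [STAR, 0, 1, STAR, STAR, 0]
x = apply_restriction(rho_6, [1, 1, 0])
```

A function of the full input, once restricted by `rho_6`, becomes a function of the three free bits only.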
The idea of using such input restrictions has been successful in the theory of Boolean circuits (Furst et al., 1984;Hastad et al., 1994). In particular, Furst et al. (1984) famously used it to prove that polynomial-size bounded-depth Boolean circuits with ∧, ∨, and ¬ gates cannot compute PARITY. We describe a new method to prove existence of suitable restrictions appropriate to transformers, as the proof approaches from the Boolean circuit literature do not seem to generalize to networks with real-valued activations.
The following result formalizes the claim that any transformer can be forced to ignore input bits by fixing some inputs in a particular way:
Theorem 1. Let any hard attention transformer be given, and let $C \in (0, 1)$. Then there is a restriction $\rho$ and an integer $c > 0$ such that $|\{i \le n : \rho_n(i) = *\}| \ge Cn$ (for all sufficiently large $n$) and such that the function computed by the transformer on the restricted input depends only on $\le c$ inputs, independent of input length $n$.
We first show how this entails that transformers do not recognize the two formal languages:
Corollary 2. Transformers with hard attention cannot model PARITY or 2DYCK.
Proof. For PARITY, after applying a restriction, the transformer's output depends on at most $c$ inputs. An input of sufficiently large size $n$ thus has unrestricted inputs that do not influence the output. But flipping a single input bit changes membership in PARITY, so the transformer's output cannot match membership in PARITY beyond chance for such $n$.
For 2DYCK, we show that hard attention transformers cannot even solve the simpler variant 1DYCK with a single bracket type ('(', ')'). We first restrict the first 0.2n input positions to '(' and the last 0.2n positions to ')'. After then applying the restriction provided by the theorem with C = 0.9, the resulting restricted input will still be compatible with both well-bracketed and non-well-bracketed inputs, but the prediction will depend only on a bounded number of positions. As the prediction depends on only a bounded number of positions, this shows the transformer could not recognize 1DYCK, and thus also not 2DYCK.
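The PARITY half of this argument can be replayed on a toy example. The sketch below assumes a stand-in `model` (any function of the input that reads only the positions in `relevant`); it is not a transformer, and all names are hypothetical.

```python
def parity(bits):
    """True iff the number of 1s is even."""
    return sum(bits) % 2 == 0

def misclassified_input(model, n, relevant):
    """Given a model whose output depends only on the 0-indexed positions
    in `relevant`, return an input of length n that it gets wrong."""
    ignored = next(i for i in range(n) if i not in relevant)
    x = [0] * n
    x_flipped = list(x)
    x_flipped[ignored] ^= 1
    # The model cannot see position `ignored`, so both inputs get one label...
    assert model(x) == model(x_flipped)
    # ...but exactly one of the two inputs is in PARITY.
    return x if model(x) != parity(x) else x_flipped

def toy_model(bits):
    """Stand-in for a restricted transformer: reads positions 0 and 1 only."""
    return (bits[0] + bits[1]) % 2 == 0

w = misclassified_input(toy_model, 8, {0, 1})
```

Whatever the model computes on its bounded set of relevant positions, one of the two inputs that differ only at an ignored position is misclassified.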
Discussion. It may be instructive to compare to similar languages that can be modeled by hard-attention transformers. First, $1^*$ (over the alphabet $\{0, 1\}$) is the regular language of words that have only ones and no zeroes; its minimal automaton has two states, like PARITY. A transformer can recognize this by having an attention head that attends to a position with zero input if it exists, and rejects if the head found such a position. Second, $a^n b^n$ is a very basic context-free language. It can be recognized using suitable positional embeddings by (1) having one head attend to the largest position $n$, (2) using this information to attend to any $b$ at position $< n/2$ or any $a$ at position $\ge n/2$. If such a symbol is found, the model rejects, else it accepts. A crucial difference between these languages and PARITY / 2DYCK is that fixing a few inputs in any part of an input string can easily force nonmembership, e.g., a single 0 for $1^*$, and an $a$ in the second half for $a^n b^n$. Therefore, such simple languages are immune to the depth reduction method, and indeed can be modeled perfectly with self-attention.
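The two recognition strategies just described can be mimicked in ordinary code, with each "head" replaced by a search for a single offending position. This is a sketch of the logic only, not a transformer implementation; positions are 0-indexed here, and the explicit even-length check is an assumption added for the sketch.

```python
def accepts_one_star(word: str) -> bool:
    """One 'head' looks for any position holding 0; reject if one exists."""
    return not any(ch == "0" for ch in word)

def accepts_anbn(word: str) -> bool:
    """Step (1): read off the length n. Step (2): look for a b in the
    first half or an a in the second half; reject if one exists."""
    n = len(word)
    if n % 2 == 1:  # assumption of this sketch: odd lengths are rejected
        return False
    half = n // 2
    for pos, ch in enumerate(word):
        if (pos < half and ch == "b") or (pos >= half and ch == "a"):
            return False
    return True
```

In both cases a single witness position suffices to reject, which is why fixing a few input symbols can force nonmembership and the depth reduction method gives no contradiction.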
In general, the depth reduction method applies to languages that are sufficiently sensitive: If, for some C ∈ (0, 1), fixing Cn input symbols cannot force a word to be inside or outside of the language, then hard-attention transformers cannot recognize this language. Sensitivity of functions has been studied in computational complexity (Boppana, 1997;Gopalan et al., 2016) and more recently linked to generalization in feedforward networks (De Palma et al., 2018). We intend to investigate these connections in future work.
Proof Idea of the Theorem. Our approach for proving Theorem 1 will be to construct input restrictions in a layerwise manner, starting from layer 1. In order for this construction to go through, the main challenge is to construct a suitable restriction at a given layer: As shown in Figure 2, this restriction should only affect a few input bits (about $(1 - C^{1/L})n$ many input bits), while forcing each attention head in the first layer to ignore all but $c$ input bits. Perhaps surprisingly, this is possible; the idea is to fix input bits that achieve high attention scores for several heads, so that input bits that cannot achieve such high attention scores will be ignored.
Once we have shown that such a restriction always exists, we can use this technique to iteratively remove layers, as illustrated in Figure 1: After we have applied the first such restriction, each of the heads in the first layer will only depend on a bounded number $c$ of input positions. In the second step, we apply the same argument to the heads in the second layer, so that each head in the second layer only depends on a bounded number $c'$ of heads in the first layer. After this step, we can collapse the first layer into a collection of feedforward networks that transform a bounded number $\le cc'$ of input positions into an activation $y^{(0)}_i$ of the lowest layer. After this step, the first layer has been entirely removed. Iterating this argument, we remove all layers until the prediction output only depends on a bounded number of input positions, bounded independently of input length.
Figure 2: Finding a good input restriction: (a) Every attention head in the first layer could potentially attend to any input bit. (b) Perhaps surprisingly, one can fix a small number of input bits in such a way that each layer-1 attention head can only possibly attend to $c$ (here, $c = 1$) inputs, and ignores all other inputs. Each activation vector $y^{(1)}_j$ in the first layer then only depends on the $H \cdot c$ inputs that its $H$ (here, $H = 1$) attention heads can attend to, plus the input $x_j$ that feeds into it via a skip-connection.
We now make these ideas formal. After the removal of the first layer of a transformer, the resulting structure is not a transformer any more, as each head in the lowest layer now depends on a combination of input positions. We introduce a technical definition to make this concept precise:
Definition 3. ($c$-Transformer). Let $c$ be a positive integer. A $c$-transformer with $L$ layers is one in which the layer-0 activations $y^{(0)}_j$ depend on the embeddings not just at one position $j$, but are a function of the embeddings at $\le c$ input positions:
$$y^{(0)}_j = f^{inp}_{n,j}\left((v_{i^{j,n}_1}, p_{i^{j,n}_1}), \dots, (v_{i^{j,n}_c}, p_{i^{j,n}_c})\right)$$
for some indices $i^{j,n}_s \in \{1, \dots, n\}$ ($s = 1, \dots, c$).
With this technical notion, we show that we can reduce layers, iteratively removing the lowest layer until no self-attention layer is left:
Lemma 4. (Depth Reduction Lemma). Let a $c$-transformer with $L$ layers be given, together with some restriction $\rho$ such that $|\{i \le n : \rho_n(i) = *\}| \ge Cn$ ($C \in (0, 1]$) for all sufficiently large $n$. Choose any $C' < C$. Then there is a restriction $\rho'$ such that
$$|\{i \le n : \rho'_n(i) = *\}| \ge C'n \quad (6)$$
for all sufficiently large $n$, and such that the resulting function is computed by a $(c \cdot (2^c kH + 1))$-transformer with $L - 1$ layers, for some integer $k$ (depending on $C'$), where $H \ge 1$ is the number of attention heads at each layer and position.
The lemma implies Theorem 1:
Proof of Theorem 1. The output of the transformer is determined by the last activation $y^{(L)}_n$. Apply the Depth Reduction Lemma iteratively, choosing the constants $C'$ in the lemma appropriately, until only the zero-th layer remains. Then, after applying the resulting restriction, the final activation $y^{(L)}_n$ is now computed by $y^{(0)}_n$, which is determined by a bounded number of input bits.
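To get a feel for the constant in Theorem 1, the recursion $c \mapsto c \cdot (2^c kH + 1)$ from the Depth Reduction Lemma can be iterated numerically. The sketch assumes, for simplicity, the same $k$ at every layer (in the lemma, $k$ depends on $C'$ and may differ between applications); `dependency_bound` is a hypothetical helper.

```python
def dependency_bound(num_layers: int, heads: int, k: int, c0: int = 1) -> int:
    """Iterate c -> c * (2**c * k * heads + 1) once per removed layer,
    starting from an ordinary transformer (a c0-transformer with c0 = 1).
    Assumes one fixed k for all layers, which is a simplification."""
    c = c0
    for _ in range(num_layers):
        c = c * (2 ** c * k * heads + 1)
    return c

one_layer = dependency_bound(num_layers=1, heads=1, k=1)
two_layers = dependency_bound(num_layers=2, heads=1, k=1)
```

The bound grows extremely fast with the number of layers, but it never involves the input length $n$, which is all the proof of Theorem 1 requires.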

Proving the Depth Reduction Lemma
In this section, we will prove the Depth Reduction Lemma. We construct the restrictions $\rho'_n$ separately for each $n$, on the basis of the given restriction $\rho_n$. In this process, we will only restrict additional bits, that is, the only case in which $\rho'_n(i)$ can be different from $\rho_n(i)$ is that $\rho'_n(i)$ may be 0 or 1 where $\rho_n(i)$ was $*$. The construction proceeds in three stages $\rho^{(1)}_n$, $\rho^{(2)}_n$, and $\rho^{(3)}_n = \rho'_n$, which all may restrict additional bits. At the end, we verify that the conclusion of the Depth Reduction Lemma is satisfied for the resulting restriction $\rho'_n$. Throughout the proof, we will need a few parameters independent of $n$: First, we need an integer $k$ that has to be sufficiently large for the proof to succeed, and will be fixed later in the proof. Second, we need parameters $\eta \in (0, \frac{1}{2})$, $q \in (0, 1)$ and $\delta > 0$; we will also fix the specific values later in the proof.
Stage 1. We start from $\rho_n$ and first modify it into a restriction $\rho^{(1)}_n$ such that each input bit serves as an input to at most $\frac{1}{\eta} c/C$ many different layer-0 heads, when applying $\rho^{(1)}_n$. Assume the number of input bits feeding into more than $\frac{1}{\eta} c/C$ different layer-0 activations is $\ge \eta Cn$. Then the number of pairs of input bits and depending layer-0 activations is $> \eta Cn \cdot \frac{1}{\eta} c/C = nc$. But there are at most $nc$ such pairs, because there are $n$ layer-0 activations, each of which depends on $\le c$ inputs. So the number of input bits with $> \frac{1}{\eta} c/C$ depending layer-0 heads is $\le \eta Cn$. We can obtain $\rho^{(1)}_n$ from $\rho_n$ by restricting these input bits to some fixed value in $\{0, 1\}$ (it doesn't matter which one), and the set $\{i \le n : \rho^{(1)}_n(i) = *\}$ still has at least $(1 - \eta)Cn$ elements, for all sufficiently large $n$.
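The counting argument of Stage 1 can be checked on a small dependency structure. In the sketch below, `deps[j]` lists the input positions that the layer-0 activation at position `j` depends on; this encoding and the name `overloaded_bits` are assumptions made for illustration.

```python
from collections import Counter

def overloaded_bits(deps, eta, C):
    """Return the input positions feeding into more than (1/eta) * c / C
    layer-0 activations, where c is the largest dependency set size."""
    n = len(deps)
    c = max(len(d) for d in deps)
    threshold = c / (eta * C)
    # fan_out[b] = number of layer-0 activations that read input bit b
    fan_out = Counter(bit for d in deps for bit in set(d))
    heavy = sorted(bit for bit, m in fan_out.items() if m > threshold)
    # Double counting: there are at most n * c (bit, activation) pairs,
    # so at most eta * C * n bits can exceed the threshold.
    assert len(heavy) <= eta * C * n
    return heavy

# Six activations, each reading two input bits; bit 0 is read by five.
deps = [[0, 1], [0, 2], [0, 3], [0, 4], [0, 5], [1, 2]]
heavy = overloaded_bits(deps, eta=0.5, C=1.0)
```

Here the threshold is $c/(\eta C) = 4$, so only bit 0 is restricted in Stage 1, well within the budget of $\eta Cn = 3$ bits.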
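Stage 2. We now describe the second stage. We write $(h, i)$ for a layer-1 attention head $h$ ($h = 1, \dots, H$) at position $i$ ($i = 1, \dots, n$). Fix such a head $(h, i)$. As $y^{(0)}_i$ depends on $\le c$ input bits, it can take on at most $2^c$ possible values. For each possible value $z$, and each position $j \in \{1, \dots, n\}$, we compute the maximum possible attention value that can be achieved for this pair:
$$\max_{x_1 \dots x_n} f^{att}_{1,h}\left(z, y^{(0)}_j\right) \quad (7)$$
considering only inputs $x_1 \dots x_n$ that are compatible with the restriction $\rho^{(1)}_n$ constructed at Stage 1. For each value $z$, we order the positions $\{1, \dots, n\}$ downwards by this value, obtaining a sequence $j^{(z)}_1, \dots, j^{(z)}_n$ for each layer-1 attention head $h$ at position $i$ and each possible value $z$ of $y^{(0)}_i$. For each layer-1 attention head and each $z$, we select a sequence $1 \le i^{(z)}_1 < i^{(z)}_2 < \dots < i^{(z)}_k \le n$ such that (1) for each $i^{(z)}_s$, there is at least one input $x_q$ that only feeds into the activation at position $j^{(z)}_{i^{(z)}_s}$ and no $j^{(z)}_{i^{(z)}_{s'}}$ ($s' \ne s$), and (2) $i^{(z)}_k$ is minimal, i.e., there is no subsequence with smaller $i^{(z)}_k$ that also satisfies condition (1). This construction is visualized in an example in Figure 3. Such a subsequence exists unless $n \le ck$, in which case the Depth Reduction Lemma is already satisfied for this input length $n$.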
If $z$ is a possible value of the activation $y^{(0)}_i$, then we say that a pair $((h, i), z)$, of a head $h$ at position $i$ and a possible value $z$ of $y^{(0)}_i$, is satisfied if one of the layer-0 activations $y^{(0)}_{j^{(z)}_{i^{(z)}_s}}$ ($s \in \{1, \dots, k\}$) is fixed by $\rho^{(1)}_n$ to the value achieving the maximum attention value (7). Also, we say that $(h, i)$ is satisfied if each $((h, i), z)$ is. The idea behind this definition is: If $((h, i), z)$ is satisfied, then there are at most $k$ different layer-0 heads that this head could attend to when applying $\rho'_n$, assuming that $y^{(0)}_i$ takes the value $z$. As a consequence, a satisfied head can only depend on $c \cdot (2^c k + 1)$ many input bits. Our aim will be to construct $\rho'_n$ so that each layer-1 head is satisfied.
Figure 3: Example of the subsequence construction. For each of the selected activations, there is at least one (in fact, two in the example) input bit (also marked in yellow and green) that feeds into this one and no other selected activation.
A layer-1 head $k$-depends on some input $x_i$ if $\rho^{(1)}_n(i) = *$ and $x_i$ appears as an input to some $j^{(z)}_r$ for $r \le i^{(z)}_k$, for some $z$. Because $i^{(z)}_k$ is minimal, a layer-1 head $k$-depends on an input if and only if that input appears as an input to some $j^{(z)}_{i^{(z)}_s}$ ($s \le k$). In particular, a layer-1 head $k$-depends only on $\le 2^c ck$ input variables. Two layer-1 heads are $k$-neighbors if some $j^{(z)}_{i^{(z)}_s}$ for one and $j^{(z')}_{i^{(z')}_{s'}}$ for the other both depend on some input bit $x_l$.
We will construct ρ ′ n using probabilistic arguments over randomly chosen restrictions. For this approach to succeed, we require a sufficient amount of independence between the activations of different heads in layer 1. We thus need to ensure that the number of k-neighbors of each head is bounded. Recall η ∈ (0, 1 2 ), and let H be the number of attention heads in each position of layer 1.
We modify $\rho^{(1)}_n$ into $\rho^{(2)}_n$ so that each layer-0 head has at most $2^c kH$ many $k$-depending unsatisfied layer-1 heads. Assume that indeed some layer-0 head has more than $2^c kH$ many $k$-depending unsatisfied layer-1 heads. By fixing $\le c$ input bits and appealing to the pigeonhole principle (the layer-0 head takes at most $2^c$ values), we can fix this head to a value that achieves the maximum attention value for more than $kH$ many of these $k$-depending layer-1 heads. Let $\rho^{(2)}_n$ be the restriction resulting from adding this to $\rho^{(1)}_n$. Once we have done this, $\{i \le n : \rho^{(2)}_n(i) = *\}$ still has at least $(1 - \eta)Cn - c$ elements, and more than $kH$ many additional pairs $((h, i), z)$ are now also satisfied. We then repeat the selection of the sequence $j^{(z)}_1, \dots, j^{(z)}_n$ (substituting $\rho^{(2)}_n$ for $\rho^{(1)}_n$ in the definition), and repeat the construction described here, to restrict additional input bits in $\rho^{(2)}_n$. We iterate this procedure until no layer-0 head has $> 2^c kH$ many $k$-depending unsatisfied layer-1 heads $(h, i)$. This procedure can be iterated at most until each layer-1 head is satisfied, that is, at most $\frac{2^c Hn}{kH} = \frac{2^c n}{k}$ times. Let $U$ be the number of times this procedure is iterated ($U \le \frac{2^c n}{k}$). At the end, $\{i \le n : \rho^{(2)}_n(i) = *\}$ has at least $(1 - \eta)Cn - cU \ge \left((1 - \eta)C - \frac{2^c c}{k}\right) n$ elements. By choosing $k$ so large that $\frac{2^c c}{k} \le \eta C$, we find that $\{i \le n : \rho^{(2)}_n(i) = *\}$ still has at least $(1 - 2\eta)Cn$ many elements. Once this is completed, each layer-0 head has at most $2^c kH$ many $k$-depending unsatisfied layer-1 heads. Thus each input bit now has at most $\frac{2^c}{\eta} kcH/C$ many $k$-depending unsatisfied layer-1 heads. Consequently, every unsatisfied layer-1 head has at most $f \le \frac{2^{2c}}{\eta} c^2 k^2 H/C$ many unsatisfied $k$-neighbors.

Stage 3. In order to construct the third and final restriction $\rho^{(3)}_n$, we apply the "probabilistic method": We define a probability distribution over restrictions $\rho^{(3)}_n$, and show that the probability assigned to restrictions of the type we require is strictly greater than zero, showing that such a restriction exists. For each input length $n$, define the distribution over restrictions $\rho^{(3)}_n$ that independently assigns to each input position $i \in \{1, \dots, n\}$ the symbol 1 or 0 with probability $q/2$ each ($q \in (0, 1)$ chosen later), and $*$ with probability $1 - q$. On those input bits where $\rho^{(2)}_n(i) \ne *$, we restrict this random restriction to agree with $\rho^{(2)}_n(i)$. For a layer-1 attention head $(h, i)$ and for each value $z$ (there are at most $2^c$ such), define $X^{(z)}_{i,h}$ to be the event that, for this head, none of $y^{(0)}_{j^{(z)}_{i^{(z)}_1}}, \dots, y^{(0)}_{j^{(z)}_{i^{(z)}_k}}$ are fixed to the value that produces the highest attention weight. Define $X_0$ to be the event that more than a fraction $(1 + \delta)q$ of the input bits that $\rho^{(2)}_n$ maps to $*$ are set to 0/1 by $\rho^{(3)}_n$ (where $\delta \in (0, 1)$, to be fixed later). Our goal will be to show that a nonzero amount of probability mass is assigned to restrictions $\rho'_n$ avoiding all events $\{X_0\} \cup \{X^{(z)}_{i,h} : i, h, z\}$. First, a Chernoff bound gives (Mitzenmacher and Upfal, 2017, Theorem 4.4)
$$P(X_0) \le \exp\left(-\delta^2 q (1 - 2\eta) Cn / 3\right) \quad (8)$$
since $\rho^{(2)}_n$ had $\ge (1 - 2\eta)Cn$ unrestricted input bits after Stage 2.
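The random restriction used in Stage 3 is easy to sample. The sketch below represents a restriction as a list over $\{*, 0, 1\}$ and uses a seeded generator for reproducibility; both choices, and the name `sample_stage3`, are assumptions of this illustration.

```python
import random

STAR = "*"

def sample_stage3(rho2, q, rng):
    """Sample rho^(3)_n: positions already fixed by rho2 keep their value;
    every free position becomes 0 or 1 with probability q/2 each and
    stays free with probability 1 - q, independently of the others."""
    out = []
    for r in rho2:
        if r != STAR:
            out.append(r)  # agree with the Stage 2 restriction
        else:
            u = rng.random()  # uniform in [0, 1)
            out.append(0 if u < q / 2 else 1 if u < q else STAR)
    return out

rng = random.Random(0)
rho2 = [0, STAR, 1, STAR] * 50  # 100 free positions out of 200
rho3 = sample_stage3(rho2, q=0.3, rng=rng)
newly_fixed = sum(1 for a, b in zip(rho2, rho3) if a == STAR and b != STAR)
```

The event $X_0$ corresponds to `newly_fixed` exceeding $(1 + \delta)q$ times the number of previously free positions; the Chernoff bound says this becomes exponentially unlikely as $n$ grows.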
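Second, we show that the probability of $X^{(z)}_{i,h}$ decays exponentially in $k$. If the head $(h, i)$ does not $k$-depend on any input under $\rho^{(2)}_n$, then $P(X^{(z)}_{i,h}) = 0$. Else, fixing $z$ for ease of notation, let $Y^t_{i,h}$ ($t = 1, \dots, k$) be the event that the layer-0 activation $y^{(0)}_{j_t}$ is not fixed to the value that produces the highest attention weight, for the given attention head $(h, i)$. Note that $X^{(z)}_{i,h} = \bigcap_t Y^t_{i,h}$. Any $Y^t_{i,h}$ can be statistically dependent on at most $c \cdot \frac{1}{\eta} c/C = \frac{1}{\eta} c^2/C$ other events $Y^{t'}_{i,h}$, because each input bit has at most $\frac{1}{\eta} c/C$ depending layer-0 heads. Therefore, there is a set of $\ge \frac{k}{\frac{1}{\eta} c^2/C}$ independent events among these. Call these $Y^{t_1}_{i,h}, Y^{t_2}_{i,h}, \dots$. Since each of them has probability at most $1 - (q/2)^c$, we obtain
$$P(X^{(z)}_{i,h}) \le \prod_s P(Y^{t_s}_{i,h}) \le \left(1 - (q/2)^c\right)^{\frac{k}{\frac{1}{\eta} c^2/C}} \quad (9)$$
for each $i = 1, 2, \dots, n$ and $h = 1, \dots, H$.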
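In order to conclude that there is a restriction $\rho^{(3)}_n$ avoiding all events $\{X_0\} \cup \{X^{(z)}_{i,h} : i, h, z\}$, we apply the Lovász Local Lemma (Mitzenmacher and Upfal, 2017, Theorem 6.17). Each event $X^{(z)}_{i,h}$ is statistically independent of the set $\{X^{(z')}_{j,h'} : \text{heads } (j, h') \text{ and } (i, h) \text{ are not } k\text{-neighbors}\}$. The complement of this set has cardinality $\le f = \frac{2^{2c}}{\eta} c^2 k^2 H/C$, as concluded at the end of Stage 2. Set $A := \frac{1}{k^2}$, $B := \frac{1}{2}$. By the Lovász Local Lemma, it is sufficient to show the following:
$$P(X^{(z)}_{i,h}) \le A (1 - B)(1 - A)^f \quad (10)$$
$$P(X_0) \le B (1 - A)^{2^c H n} \quad (11)$$
The Lovász Local Lemma then guarantees that there is some input restriction $\rho^{(3)}_n$ that avoids all events $\{X_0\} \cup \{X^{(z)}_{i,h} : i, h, z\}$. For (10), by (9), we need
$$D \le A^{1/k} (1 - B)^{1/k} (1 - A)^{f/k} \quad (12)$$
where $D = \left(1 - (q/2)^c\right)^{\frac{1}{\frac{1}{\eta} c^2/C}} \in (0, 1)$. For the first term on the right, $\lim_{k \to \infty} A^{1/k} = \lim_{k \to \infty} k^{-2/k} = 1$. Also, $\lim_{k \to \infty} (1 - B)^{1/k} = 1$ and $\lim_{k \to \infty} (1 - A)^{f/k} = \lim_{k \to \infty} \left(1 - \frac{1}{k^2}\right)^{Ek} = 1$ for $E := \frac{2^{2c}}{\eta} c^2 H/C$. So, if we choose $k$ large enough (independently of $n$), the RHS of (12) can be made arbitrarily close to 1, in particular, greater than $D$. In order to also satisfy (11), using the Chernoff bound on $P(X_0)$, we need
$$\exp\left(-\delta^2 q (1-2\eta) C n / 3\right) \le B (1 - A)^{2^c H n}$$
which holds for $n$, $k$ large enough (again, choosing $k$ independent of $n$). In conclusion, there exists, for each sufficiently large $n$, a restriction $\rho^{(3)}_n$ that avoids all events $\{X_0\} \cup \{X^{(z)}_{i,h} : i, h, z\}$. Since it avoids $X_0$, it satisfies $\#\{i \le n : \rho^{(3)}_n(i) = *\} \ge (1 - 2\eta)(1 - (1 + \delta)q) C n$ for all sufficiently large $n$. Then choose $\eta \in (0, \frac{1}{2})$ small, $q \in (0, 1)$, and $\delta > 0$ (such that $(1 + \delta)q \in (0, 1)$) in such a way as to achieve $(1 - 2\eta) \cdot (1 - (1 + \delta)q) = C'/C$.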
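The claim that a large enough $k$ satisfies the Local Lemma conditions can be checked numerically. The sketch below assumes the conditions take the usual asymmetric Local Lemma form, namely $D^k \le A(1-B)(1-A)^f$ with $A = 1/k^2$, $B = 1/2$, $f = Ek^2$, $D = (1-(q/2)^c)^{\eta C/c^2}$, together with $\exp(-\delta^2 q (1-2\eta)Cn/3) \le B(1-A)^{2^c Hn}$. That reading, the parameter values, and all function names are assumptions made for illustration.

```python
import math

def log_D(c, q, eta, C):
    """log of D = (1 - (q/2)^c)^(eta*C/c^2)."""
    return (eta * C / c ** 2) * math.log(1 - (q / 2) ** c)

def log_rhs_12(k, c, H, eta, C):
    """log of A^(1/k) * (1-B)^(1/k) * (1-A)^(f/k), with A = 1/k^2, B = 1/2, f = E*k^2."""
    E = 2 ** (2 * c) * c ** 2 * H / (eta * C)
    A, B = 1 / k ** 2, 0.5
    return (math.log(A) + math.log(1 - B)) / k + E * k * math.log1p(-A)

def smallest_k(c, H, eta, C, q, k_max=100000):
    """Smallest k >= 2 for which D <= RHS of the per-head condition, or None."""
    target = log_D(c, q, eta, C)
    for k in range(2, k_max + 1):
        if log_rhs_12(k, c, H, eta, C) >= target:
            return k
    return None

def condition_11(n, k, c, H, eta, C, q, delta):
    """Check exp(-delta^2 q (1-2 eta) C n / 3) <= B * (1-A)^(2^c H n) in log space."""
    lhs = -delta ** 2 * q * (1 - 2 * eta) * C * n / 3
    rhs = math.log(0.5) + 2 ** c * H * n * math.log1p(-1 / k ** 2)
    return lhs <= rhs

params = dict(c=1, H=1, eta=0.25, C=1.0, q=0.5)   # illustrative values only
k = smallest_k(**params)
print(k, condition_11(100, k, delta=0.5, **params))
```

Working in log space avoids underflow for large $n$; the search confirms that the threshold on $k$ does not depend on $n$, while the second condition only requires $n$ to be large enough.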

After applying $\rho^{(3)}_n$, every layer-1 head $b_{j,1,h}$ depends only on (1) the $c$ input bits feeding into $y^{(0)}_j$, and (2) the $\le c 2^c k$ input bits that the head $k$-depends on. Thus, each layer-1 activation $y^{(1)}_j$ only depends on $\le c \cdot (2^c kH + 1)$ input bits: There are $\le H \cdot c \cdot 2^c \cdot k$ input bits that the $H$ different attention heads $k$-depend on, plus a skip-connection from $y^{(0)}_j$, which itself depends on $\le c$ input bits. We can thus remove layer 0, convert layer-1 activations $y^{(1)}_j$ into layer-0 activations $y^{(0)}_j$, and obtain a $(c \cdot (2^c kH + 1))$-transformer performing the same computation as before when $\rho^{(3)}_n$ is applied. This concludes the proof of the Depth Reduction Lemma.
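The bookkeeping in this last step is simple arithmetic and can be sketched as follows. The function names are hypothetical, and reusing the same $k$ at every layer is a simplification made here for illustration; the proof chooses $k$ anew at each application of the lemma.

```python
def reduced_c(c, k, H):
    """Input bits a merged layer-0 activation may depend on after one depth reduction."""
    per_head = c * 2 ** c * k        # bits that a single head k-depends on
    return H * per_head + c          # H heads, plus the skip-connection from y_j^(0)

def bits_after(c, k, H, layers):
    """Iterate the map c -> c * (2^c * k * H + 1) once per removed layer."""
    for _ in range(layers):
        c = reduced_c(c, k, H)
    return c

print(reduced_c(1, 2, 1), bits_after(1, 2, 1, 2))
```

The resulting constant grows very quickly with the number of removed layers, but it never depends on the input length $n$, which is what the argument needs.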

Results for Soft Attention
In the previous section, we showed that transformers using hard attention are not able to recognize a range of core formal languages. In this section, we study soft attention. It turns out that proving limitations as strong as what we found in the hard attention setting would settle a major open problem in computational complexity, and may therefore be extremely hard to attain with currently available mathematical methods.³ This barrier prevents us from proving bounds on the accuracy that soft attention transformers can achieve; nevertheless, we will be able to prove limitations on the achievable cross-entropy in modeling distributions over the formal languages. We will use the smoothness of the operations used in transformers to show that any transformer, as inputs get longer, will not be able to robustly model such distributions. The idea behind the proof is that the impact of any single input symbol on the output of the transformer is small if the input is long:
Lemma 5. Let a soft attention transformer be given, and let $n$ be the input length. If we exchange one input symbol $x_i$ ($i < n$), then the change in the resulting activation $y^{(L)}_n$ at the decoder layer is bounded as $O(\frac{1}{n})$ with constants depending on the parameter matrices.
This contrasts with recurrent networks: Changing a single input can have nonnegligible impact on the final state even for very long input. For example, an RNN recognizing PARITY through a hidden state that encodes parity of the current prefix will flip its hidden state if a single input bit is flipped.
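The contrast can be made concrete with a toy experiment. The sketch below is an illustration under strong simplifying assumptions and not the construction analyzed in the lemma: it uses a single attention head with scalar embeddings (0 maps to -1, 1 maps to +1), a hypothetical weight `w`, and a two-state parity recurrence standing in for an RNN.

```python
import math
import random

def attention_readout(bits, w=1.0):
    """One soft-attention head evaluated at the last position (scalar embeddings)."""
    values = [1.0 if b else -1.0 for b in bits]     # hypothetical embedding of 0/1
    query = values[-1]
    logits = [w * query * v for v in values]        # logits stay within [-w, w]
    top = max(logits)
    exps = [math.exp(a - top) for a in logits]      # numerically stable softmax
    total = sum(exps)
    return sum(e * v for e, v in zip(exps, values)) / total

def rnn_parity(bits):
    """Recurrent state that stores the parity of the prefix read so far."""
    state = 0
    for b in bits:
        state ^= b
    return state

def flip_effect(n, seed=0):
    """Change in both readouts when the first of n input bits is flipped."""
    rng = random.Random(seed)
    bits = [rng.randint(0, 1) for _ in range(n)]
    flipped = [bits[0] ^ 1] + bits[1:]
    d_attn = abs(attention_readout(bits) - attention_readout(flipped))
    d_rnn = abs(rnn_parity(bits) - rnn_parity(flipped))
    return d_attn, d_rnn

for n in (10, 100, 1000, 10000):
    print(n, *flip_effect(n))
```

For this toy head with `w = 1`, the change in the attention readout is at most $3e^2/n$, since every unnormalized weight lies in $[e^{-1}, e]$, whereas the recurrent parity state always changes by exactly 1.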
Lemma 5 entails that, as inputs become longer, soft attention transformers cannot achieve good cross-entropies on prediction problems that are very sensitive to individual input symbols: A Lipschitz-continuous prediction function, such as a ReLU MLP with a softmax output, will not be able to make very different predictions for inputs that are encoded into similar activations $y^{(L)}_n$. To make all our assumptions explicit, we will assume the following setting, though the results do not depend on the specific details. For PARITY, we consider the distribution over bitstrings generated by a two-state automaton that, if the number of 1s emitted so far is even, terminates with probability $p$, and otherwise emits a 1 or 0 with equal probability each. Given a prefix of a string drawn from this distribution, we ask the transformer to predict the next symbol from Σ = {0, 1, ENDOFSEQUENCE}. Note that the next symbol can be ENDOFSEQUENCE if and only if the prefix has an even number of 1s. For 2DYCK, we follow the experimental study of Skachkova et al. (2018) and take the distribution generated by a PCFG that expands S → (S)S or S → [S]S with probability $p/2$ each, and S → ε with probability $1 - p$. We ask the model to predict the next character among Σ = {(, ), [, ], ENDOFSEQUENCE}.
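The two generative processes just described can be sketched as samplers. This is an illustrative reading of the setting, with hypothetical function names; the 2DYCK sampler simulates a leftmost derivation of the PCFG with an explicit stack of pending closing brackets, and it assumes $p < 1/2$ so that sampled words have finite expected length.

```python
import random

def sample_parity(p, rng):
    """Two-state automaton: may stop (probability p) only after an even number of 1s."""
    bits, ones_even = [], True
    while True:
        if ones_even and rng.random() < p:
            return bits                      # ENDOFSEQUENCE
        b = rng.randint(0, 1)                # otherwise emit 0 or 1 uniformly
        bits.append(b)
        if b:
            ones_even = not ones_even

PAIRS = {"(": ")", "[": "]"}

def sample_2dyck(p, rng):
    """Leftmost derivation of S -> (S)S | [S]S (p/2 each) | epsilon (1 - p)."""
    out, stack = [], []
    while True:
        if rng.random() < p:                 # expand S with an opening bracket
            opener = rng.choice("([")
            out.append(opener)
            stack.append(PAIRS[opener])
        elif stack:                          # S -> epsilon, then emit the pending closer
            out.append(stack.pop())
        else:                                # S -> epsilon at top level: word is complete
            return "".join(out)

def is_2dyck(word):
    """Stack-based membership check for well-bracketed words over ( ) [ ]."""
    stack = []
    for ch in word:
        if ch in PAIRS:
            stack.append(PAIRS[ch])
        elif not stack or stack.pop() != ch:
            return False
    return not stack

rng = random.Random(0)
print(sample_parity(0.2, rng), sample_2dyck(0.4, rng))
```

Every sampled bitstring has an even number of 1s, and every sampled bracket word passes `is_2dyck`, which matches the remark that ENDOFSEQUENCE is possible exactly after an even number of 1s.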
Theorem 6. Let a soft attention transformer be given for PARITY or 2DYCK. As $n \to \infty$, cross-entropy on predicting the next symbol converges to unigram chance level (PARITY), or is at least separated from the optimal cross-entropy by some constant ε > 0 (2DYCK).
Proof. First, let us consider PARITY. Exchanging a single bit flips membership in PARITY. Thus, for any $x \in$ PARITY, there is a string $x' \notin$ PARITY, differing only in one bit. As $x$ and $x'$ differ only in one bit, the transformer's output activations differ by $O(\frac{1}{n})$. Therefore, a Lipschitz-continuous prediction function cannot robustly assign different next-symbol probabilities after even and odd numbers of 1s, and cross-entropy will converge to unigram chance level.
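The size of the resulting loss can be computed exactly for the automaton described above. The sketch below takes the best parity-blind predictor as a stand-in for "unigram chance level", which is an assumption. It uses the fact that, once at least one symbol has been emitted, even and odd prefixes are equally likely among strings that have not yet terminated, because both states are entered with the same probability from either state. Function names are hypothetical.

```python
import math

def entropy(dist):
    """Shannon entropy in nats, skipping zero-probability outcomes."""
    return -sum(x * math.log(x) for x in dist if x > 0)

def parity_cross_entropies(p):
    """Return (optimal, parity-blind) next-symbol cross-entropy for prefixes of length >= 1."""
    even = [p, (1 - p) / 2, (1 - p) / 2]     # P(END), P(0), P(1) after an even number of 1s
    odd = [0.0, 0.5, 0.5]                    # END is impossible after an odd number of 1s
    blind = [(a + b) / 2 for a, b in zip(even, odd)]
    optimal = 0.5 * entropy(even) + 0.5 * entropy(odd)
    return optimal, entropy(blind)           # blind loss equals the entropy of the mixture

optimal, blind = parity_cross_entropies(0.1)
print(optimal, blind, blind - optimal)
```

The gap between the two values is strictly positive for every $p \in (0, 1)$ and is the extra loss paid by any predictor that cannot tell even prefixes from odd ones.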
For 2DYCK, consider a string $x$ of length $n$, known to be the prefix of a word generated by the PCFG. One can show that there is a constant $P_0 \in (0, 1)$ (dependent on $p$ but not $n$), such that $x$ both ends with a closing bracket, and is unbalanced, with probability $\ge P_0$.⁴ After such an $x$, the next symbol is a closing bracket with constant nonzero probability $(1 - p)$. If $x$ can be followed by, say, ')' but not ']', then there is a string $x'$, differing only in one input position, but for which the next symbol can be ']' but not ')'. As $x$ was assumed to end with a closing bracket, the exchanged symbol is not the last symbol of $x$, and thus the transformer's predictions on $x$ and $x'$ differ only by $O(\frac{1}{n})$. We can decompose the prediction task into predicting (1) whether an opening or closing bracket, or ENDOFSEQUENCE, follows, and (2) whether a round or square bracket follows, in case a bracket follows. The cross-entropy loss is the sum of the cross-entropies incurred on these two successive prediction tasks.
Therefore, when such a prefix $x$ is followed by the correct closing bracket, say ')', the model will incur, as $n \to \infty$, a cross-entropy loss on the second task of at least $\log 2$, reflecting at-chance performance in choosing between the two possible closing brackets. In contrast, optimal cross-entropy loss on (2) would be 0, as the bracket type (round or square) is actually fully determined by $x$. Thus, the overall cross-entropy on all prefixes $x$ of length $n$ is, asymptotically as $n \to \infty$, at least $P_0 \cdot (1 - p) \cdot \log 2 > 0$ more than the optimal cross-entropy.
⁴ One can show this formally using a Markov chain argument. Let the height $H(x)$ of a word $x$ be the number of opening brackets minus the number of closing brackets in $x$. When iteratively sampling a symbol sequence using a pushdown automaton for 2DYCK, the height $H_n$ of the prefix $x$ up to length $n$ forms a Markov chain taking values in $\mathbb{N}$. The prefix $x$ is unbalanced if and only if $H_n > 0$; this is always the case whenever $n$ is odd. Restricting to even $n$, the chain $\{H_n : n = 0, 2, 4, \dots\}$ is aperiodic and takes values in $\{0, 2, 4, \dots\}$. It is also positive recurrent, as words sampled from the PCFG have finite expected length (Skachkova et al., 2018, 2.2.1). Therefore, the Markov chain $\{H_n : n = 0, 2, 4, \dots\}$ converges to its stationary distribution (Mitzenmacher and Upfal, 2017), which, by positive recurrence, must assign some nonzero weight $\pi_{2i}$ to each height $2i$ ($i \ge 0$). Hence, even when $n$ is even, the prefix $x$ is unbalanced with nonzero probability $1 - \pi_0$ asymptotically independent of $n$. Also, since the transition probabilities $P(H_{n+1} | H_n)$ are independent of $n$, there is an asymptotically constant nonzero probability that $x_1 \dots x_{|x|-1}$ has height larger than $x$, i.e., the last bracket of $x$ is a closing one.
We proceed to proving Lemma 5.
Proof of Lemma 5. We compare the activations at the decoder layer for two inputs that only differ in the input at the $i$-th position. The relevant quantity is the norm of the difference of the input embeddings at this position. We show by induction over $k = 1, \dots, L$ that, for some $C > 0$ (chosen below), the differences between the resulting activations $y^{(k)}_j$, $y^{(k)\prime}_j$ are bounded as: […] Once we have shown this, we know that the influence of any individual input on the final prediction is $O(\frac{1}{n})$, with constants depending on the norms of parameter matrices and the number of layers. At this point, it is worth remarking that a key property of transformers for this proof is that the number $L$ of layers is bounded independently of the input length. A similar proof strategy can also be applied to other fixed-depth architectures that combine unboundedly many inputs in a smooth manner, such as 1D temporal convolutions with average pooling.
For $k = 0$, […]. […], where $C_{f^{act}} < \infty$ depends on the norms of the parameter matrices of $f^{act}$, which is implemented as a ReLU MLP (Vaswani et al., 2017). We write $F$ for this upper bound on $\|y^{(k)}_j\|_2$. Attention logits are bounded by $A := F^2 C_{f^{att}}$ in the case of dot-product attention, and $A := 2F C_{f^{att}}$ in the case of additive attention. Then any attention weight $\hat{a}_{j,i} = \exp(a_{j,i}) / \sum_{i'} \exp(a_{j,i'})$ is upper bounded by $\frac{\exp(A)}{\exp(A) + (n-1)\exp(-A)} \le \frac{\exp(2A)}{n-1}$. […] which, using the induction hypothesis, is at most: […] Plugging this into the definition of $y^{(k)}_j$ […]. First, if $j = i$, this is bounded by […] (as $n \to \infty$). […] This proves the inductive step for $k > 0$.
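The key quantitative ingredient, that bounded logits force every softmax weight to be of order $1/n$, can be checked numerically. The sketch below assumes logits confined to $[-A, A]$, in which case any single weight is at most $e^{A} / (e^{A} + (n-1)e^{-A}) \le e^{2A}/(n-1)$; the value of `A` and the function names are illustrative.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(a - top) for a in logits]
    total = sum(exps)
    return [e / total for e in exps]

def weight_bound(A, n):
    """Upper bound exp(2A) / (n - 1) on any attention weight when logits lie in [-A, A]."""
    return math.exp(2 * A) / (n - 1)

rng = random.Random(0)
A = 1.5
for n in (10, 100, 1000, 10000):
    random_logits = [rng.uniform(-A, A) for _ in range(n)]
    worst_case = [A] + [-A] * (n - 1)        # one logit as large, the rest as small as allowed
    print(n, max(softmax(random_logits)), max(softmax(worst_case)), weight_bound(A, n))
```

Even in the worst case, where one logit sits at $A$ and all others at $-A$, the largest weight decays like $1/n$ for fixed $A$, so no single position can dominate the average once the input is long.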

Discussion
We have shown that, even with infinite precision, transformers cannot robustly model non-counter-free regular languages, nor basic hierarchical structure. In the hard attention setting, our results hold independently of activation functions and the magnitude of the parameters, and show that no transformer can accurately classify strings as belonging to such languages. In the soft attention setting, our results are slightly less general, but still show that transformers cannot achieve perfect cross-entropies when modeling distributions over these formal languages.
Our results are asymptotic, in the sense that they show that any transformer will make mistakes on modeling PARITY and 2DYCK when the input is sufficiently long. A transformer may nonetheless be able to perform well on short inputs; indeed, given any bound $N$ on the input length, it is possible to construct a transformer that will achieve perfect accuracy or cross-entropy on all examples of length $n \le N$; our results show that the number of heads and layers, or the parameter norms, will have to increase with $N$. Practical implementations of transformers might thus be able to circumvent such asymptotic limitations by using large numbers of layers and heads, in relation to the sentence lengths typically occurring in natural language. Therefore, pending tighter nonasymptotic bounds, the results reported here need not constitute conclusive evidence for practical limitations of real-world NLP systems.
We believe that the most immediate implications of our results are theoretical in nature. They showcase mathematical techniques for analyzing the capabilities of self-attention, an architecture at the heart of recent advances in NLP. These tools provide theoretical understanding of differences between self-attention and theoretically more well-studied recurrent architectures: Recurrent networks such as LSTMs can perfectly emulate finite-state automata, and therefore can model any finite-state language with optimal cross-entropy, as long as the state transition and symbol emission distributions are Markovian. In particular, PARITY of i.i.d. bitstrings can be predicted with perfect accuracy and cross-entropy, independent of the input length. Furthermore, infinite-precision RNNs and LSTMs can model stacks (Tabor, 2000; Grüning, 2006; Kirov and Frank, 2012) and thus are theoretically capable of modeling 2DYCK and other deterministic context-free languages perfectly. The results presented here thus theoretically confirm the intuition that models entirely built on self-attention may have restricted expressivity when compared to recurrent architectures (Tran et al., 2018; Dehghani et al., 2019; Shen et al., 2018a; Chen et al., 2018; Hao et al., 2019). Complementing the asymptotic methods developed here with empirical studies or nonasymptotic extensions is an interesting avenue for future research.
While finite languages are sufficient to model language up to any finite bound on sequence length, it has typically been argued that asymptotically more powerful formalisms at the level of context-free grammars or beyond are necessary to properly capture generalizations about the syntax and meaning of natural language (e.g., Chomsky, 1957; Shieber, 1985). Our results entail that self-attention is limited in its ability to model context-free languages or evaluate logical formulas. In particular, self-attention cannot in general emulate stacks or arbitrary finite-state automata. Whether this hinders its capacity for syntactic generalization in practice is an interesting question; empirical research suggests that models with strong quantitative performance (both recurrent and transformer models) continue to struggle with syntactic generalization and that quantitative performance metrics such as perplexity can partly be dissociated from syntactic knowledge displayed on more challenging benchmarks (e.g., Kuncoro et al., 2018; Marvin and Linzen, 2018; Tran et al., 2018; McCoy et al., 2019).
Nonetheless, the success of transformers across NLP tasks suggests that many aspects of natural language can be modeled well with methods that are formally too weak for the formal languages typically assumed in theoretical linguistics. Beyond general limitations of asymptotic analysis, a possible reason for this phenomenon is that language uses recursive structure only in restricted ways due to cognitive factors. For instance, it has long been noted that center embeddings, syntactic structures exhibiting iterated bracketing, are very challenging for humans to process (Miller and Chomsky, 1963;Gibson and Thomas, 1999). Intriguingly, self-attention bears some resemblance to psycholinguistic models of memory in human sentence processing that assume that humans, while processing a word, attend to chunks that were stored in memory when processing some previous words (Lewis and Vasishth, 2005;Parker et al., 2017). Such processing models predict difficulty with center embedding because they cannot count brackets (Lewis and Vasishth, 2005), akin to what we have shown theoretically for neural network models based on self-attention.

Conclusion
We formally investigated the capabilities of self-attention in modeling regular languages and hierarchical structure. We showed that transformers cannot model periodic regular languages or basic recursion, either with hard or soft attention, and even if infinite precision is allowed. This entails that self-attention cannot in general emulate stacks or general finite-state automata. Our results theoretically confirm the idea that self-attention, by avoiding recurrence, has quite limited computational power.