Transformers have become a standard neural network architecture for many NLP problems, motivating theoretical analysis of their power in terms of formal languages. Recent work has shown that transformers with hard attention are quite limited in power (Hahn, 2020), as they can be simulated by constant-depth AND/OR circuits (Hao et al., 2022). However, hard attention is a strong assumption, which may complicate the relevance of these results in practice. In this work, we analyze the circuit complexity of transformers with saturated attention: a generalization of hard attention that more closely captures the attention patterns learnable in practical transformers. We first show that saturated transformers transcend the known limitations of hard-attention transformers. We then prove saturated transformers with floating-point values can be simulated by constant-depth threshold circuits, giving the class TC0 as an upper bound on the formal languages they recognize.

Opening the “black box” (Alishahi et al., 2020) of the representations within neural networks is an important step towards building systems with robust and interpretable behavior. In NLP, one part of this question is analyzing the languages that networks can model, and the mechanisms they use to represent linguistic structure and dependencies.

One path toward this goal is via formal analysis of specific network architectures (Merrill, 2021); for example, recurrent neural networks (RNNs). Due to their autoregressive formulation, formal linguistic analysis of RNNs has often characterized their power by relating them to automata-theoretic classes of formal languages (Weiss et al., 2018; Peng et al., 2018; Merrill, 2019, inter alia). Recently, however, RNNs have largely been overtaken in NLP by a new class of models: transformers (Vaswani et al., 2017). Transformers are not autoregressive, and therefore less naturally resemble automata, posing challenges to characterizing their linguistic capacity or inductive biases in the same terms as RNNs. Instead, some recent work has related them to circuit complexity classes, a direction that we continue to pursue in this paper. Drawing on classical circuit lower bound results, Hao et al. (2022) and Hahn (2020) derive theoretical limitations of transformers with hard attention, meaning the attention distributions focus all probability mass on one index. Together, their results show that AC0—the class of languages recognizable by constant-depth circuit families—upper bounds the formal languages hard-attention transformers can recognize.

However, hard attention is a strong assumption, making it unclear how these results transfer to practical transformers. For example, Bhattamishra et al. (2020) showed how transformers can solve synthetic counting tasks by using uniform attention patterns, which hard attention does not allow. Motivated by this potential disconnect between theory and practice, we aim to extend circuit-based analysis to transformers with saturated attention: a generalization of hard attention that has been argued to approximate attention patterns acquired through gradient descent (Merrill et al., 2021). Broadly speaking, saturated attention goes beyond hard attention in that it can “tie” across a subset of positions, rather than selecting just one position. The tied positions are then aggregated by averaging. Qualitatively, saturated attention heads can “count”: a capability observed in transformers in practice (Bhattamishra et al., 2020). Further, Merrill et al. (2021) show that transformer training dynamics lead attention heads in several pretrained transformers to approximate saturated attention. In summary, saturated attention strictly generalizes hard attention and should more closely reflect the attention patterns acquired in practical transformers.

Our main contributions are twofold. First, we show that saturated transformers can recognize languages outside AC0. Then, as depicted in Table 1, we prove that transformers with floating point activations and saturated attention can only recognize formal languages in the circuit complexity class TC0, constituting an upper bound for a more realistic model of transformers than past results with hard attention.

Table 1: 

Summary of combined results from Hao et al. (2022) and this paper. Each cell (α, D) characterizes the languages recognizable by transformers with attention function α and datatype D (floats F or rationals ℚ). AC0 and TC0 are circuit complexity classes, with AC0 ⊊ TC0. ALL is the set of all formal languages over alphabet {0,1}. See §4 for formal definitions. Out of these results, we view saturated attention with floats as the best model of practical transformers.

              | Float (F) | Rational (ℚ)
Hard (η)      | AC0       | AC0
Saturated (ζ) | TC0       | ALL

In §3, we formally define our model of the transformer, including defining saturated attention in contrast to hard attention. §4 introduces circuits in theoretical computer science and relevant complexity measures and classes for them.

In §5, we first briefly analyze saturated transformers with rational values where the embedding, scoring, and activation functions are allowed to be any size-preserving function. We find such transformers to be universally powerful. We also observe that when the positional embeddings are computed in time linear in the sequence length, saturated rational-valued transformers are exactly as powerful as the complexity class of their activation functions, because the full input sequence can be pooled to a single position, and an activation function can be used as an oracle over the full input sequence. However, this setup relies on the use of unrealistic embedding functions. To move to a more realistic model of computation, we then focus on saturated transformers whose values are restricted to be floats, which have a coarser granularity and, thus, cannot encode the full input sequence into a single position.

Building on results of Pérez et al. (2019), we demonstrate in §6 that saturated transformers with float activations transcend the theoretical limitations of hard-attention transformers. In particular, we will show that they can recognize the majority language, which lies outside AC0. We experimentally validate that transformers can learn to recognize the majority language. Taken together, these results suggest that the very weak characterization of hard-attention transformers does not hold in practice for saturated or soft attention.

In §7, we show that, on input sequences of length n, the size of each state vector in a transformer over floats is O(logn) bits, similar to saturated LSTMs (cf. Merrill, 2019). Thus, the full transformer state at any given layer has size O(nlogn), although each feedforward block can only locally access a small, O(logn) “piece”. Thus, while hierarchical representations can be implemented in a transformer (e.g., to process arbitrary-depth Dyck languages or reverse strings as in Weiss et al. [2021]), our result implies that they must be distributed in some way across n state vectors, rather than represented compactly within a single vector.

Finally, in §8, we use the bounded size of transformer representations to upper bound the formal languages that can be recognized by saturated transformers with floating-point values. In particular, we show that such transformers can be simulated by constant-depth threshold circuits, and thus only recognize languages in TC0. Informally, this suggests that moving from hard attention to saturated attention can be thought of as extending the implicit class of circuit gates available in the network to include threshold gates.

Our results make progress in the analysis of transformers by deriving upper bounds for a more realistic model of transformers than has previously been analyzed. RoBERTa, T5, and other pretrained transformers have been shown to be approximately saturated (Merrill et al., 2021), so our results imply that TC0 may be a meaningful upper bound on the computation expressible within such networks. Our analysis also motivates future work further refining the circuit characterization of saturated transformers, as well as comparing transformers with soft and saturated attention.

We will often use w to refer to a string over any generic alphabet Σ, that is, w ∈ Σ*. Semantically, w corresponds to the string a transformer receives as input. In contrast, we use x and other symbols to refer to binary strings in {0,1}*. These binary strings will represent intermediate values within the transformer computation, rather than the raw input to the transformer.

3.1 Datatypes

Under our model, all values in the transformer are binary strings. In order to compute self attention and other operations over binary strings, we need to define datatypes describing the semantics of these binary strings as numbers. We will describe a semantics for binary strings as integers, as often comes up in circuit complexity. We then extend this to rational numbers and floats, which are necessary for representing the division operations that occur in attention heads within transformers.

Unsigned Integers
We can interpret binary strings x ∈ {0,1}* as unsigned integers in the standard way; namely, the numerical value of x ∈ {0,1}^n is Σ_{i=1}^{n} 2^{n−i} x_i.
We allow standard integer operations like +, <. For example, 101 + 1 = 110.
Rationals
To interpret r ∈ {0,1}* as a rational number, we first view it as a sign bit s along with a tuple of two unsigned integer substrings ⟨p, q⟩.1 The numerical value represented by r is (−1)^s · p/q.
Let red(p, q) return ⟨s, t⟩ where s = p/gcd(p, q) and t = q/gcd(p, q). Then, we can define arithmetic operations over two rationals r = ⟨p, q⟩ and r′ = ⟨p′, q′⟩ in the standard way, for example, r +_ℚ r′ = red(pq′ + p′q, qq′) and r ·_ℚ r′ = red(pp′, qq′), with signs handled as usual.
Floats

We define floats F as the subset of the rationals where the denominator is constrained to be a power of 2.2 Multiplication and addition are defined as for ℚ, and are guaranteed to produce another float. Notably, division for floats is implemented by multiplying by an approximate multiplicative inverse, so it may be that (x /_F y) · y ≠ x. See Appendix A for a more formal discussion.
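The following is a minimal, informal sketch of such a datatype (not the paper's formal construction): a sign bit, an integer numerator, and a power-of-2 denominator stored as an exponent. Addition and multiplication stay within the type; division is deferred to the approximate inverse discussed in Appendix A. All names here (F, f_add, f_mul) are illustrative, and ordinary Python integers stand in for binary strings.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class F:
    """A float: value = (-1)**sign * num / 2**exp, with num and exp nonnegative ints."""
    sign: int
    num: int
    exp: int

    def value(self) -> float:
        return (-1) ** self.sign * self.num / 2 ** self.exp

def f_add(a: F, b: F) -> F:
    # Put both terms over the larger power-of-2 denominator, then add numerators.
    exp = max(a.exp, b.exp)
    na = (-1) ** a.sign * a.num * 2 ** (exp - a.exp)
    nb = (-1) ** b.sign * b.num * 2 ** (exp - b.exp)
    total = na + nb
    return F(0 if total >= 0 else 1, abs(total), exp)

def f_mul(a: F, b: F) -> F:
    # Numerators multiply; powers of 2 add, so the result is again a float.
    return F((a.sign + b.sign) % 2, a.num * b.num, a.exp + b.exp)

# 1/2 + 3/4 = 5/4 and (1/2) * (3/4) = 3/8, both representable as floats.
half, three_quarters = F(0, 1, 1), F(0, 3, 2)
assert f_add(half, three_quarters).value() == 1.25
assert f_mul(half, three_quarters).value() == 0.375
```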

In §5, we will study transformers over rational values. From §6 onwards, we will then take the values in transformers to be floats unless otherwise stated. Going forward, we will generally omit datatype subscripts from operations where they are clear from context. We will sometimes write D as a set in function signatures, for example, f : D^k → D^k. In this usage, it refers to the set {0,1}*, but it is often more intuitive to write the datatype shorthand (rather than {0,1}*) to hint at the intended semantics of the functional arguments.

Size of Binary Strings

Under our model, integers, rationals, and floats are all abstractions built out of binary strings. For any x ∈ {0,1}* (which can be interpreted semantically as an integer, float, or rational), we define its size |x| as the total length of x measured in bits. We imagine a tuple ⟨p, q⟩ is encoded by padding p and q to the same length with leading 0’s, and interleaving bits from each sequence. This means the size of a rational is 2·max(|p|, |q|) + 1. For example, the integer 2 takes 2 bits to specify, while the float 1/2 takes 5 bits (1 for the sign, 2 for the numerator, 2 for the denominator).
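As a small illustration of the encoding and size conventions just described (the helper names are our own, chosen for clarity):

```python
def encode_pair(p: str, q: str) -> str:
    """Interleave the bits of p and q after padding the shorter with leading 0's."""
    width = max(len(p), len(q))
    p, q = p.zfill(width), q.zfill(width)
    return "".join(a + b for a, b in zip(p, q))

def rational_size(p: str, q: str) -> int:
    """Size of a rational <p, q> with one sign bit: 2 * max(|p|, |q|) + 1."""
    return 1 + len(encode_pair(p, q))

# The float 1/2: numerator "1", denominator "10"; padded to width 2 and interleaved.
assert encode_pair("1", "10") == "0110"
assert rational_size("1", "10") == 5  # matches the 5-bit example above
```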

Size Preservation

We say that a function f : {0,1}* → {0,1}* is size-preserving iff there exist constants c, n such that for all inputs x with n ≤ |x|, |f(x)| ≤ c·|x|. Let 𝒫 be the set of size-preserving functions. While size-preserving functions are defined here over binary strings, they can be equivalently applied over integers, rationals, and floats, since these datatypes, as we have defined them, are just binary strings.

3.2 Transformers

We define the following general transformer model, which can be parameterized to use different types of attention patterns and whose internal functions (e.g., feedforward blocks) can be computed by different function classes.

Definition 1 (Transformer).

A transformer is a tuple ⟨Σ, D, α, L, H, ϕ, {s_{ℓ,h}}_{ℓ,h=1}^{L,H}, {f_ℓ}_{ℓ=1}^{L}⟩ where

  1. Σ is a finite input alphabet, that is, the set of token types in a formal language.

  2. D is a scalar datatype, that is, a semantics for interpreting binary strings as numbers. We will generally consider D=F.

  3. α is an attention function that maps a vector of attention scores in D^n (for any n) to a normalized probability distribution, also in D^n. In this paper we take α to be either hard (η) or saturated (ζ) attention; see §3.3.

  4. L is the number of layers.

  5. H is the number of heads.

  6. ϕ : Σ × ℕ → D^m is a position-aware embedding function that maps a token and position to a vector, where m is a multiple of H.

  7. For each ℓ, h, the function s_{ℓ,h} : D^m × D^m → D assigns attention scores to pairs of values.

  8. For each ℓ, the function f_ℓ : D^m × D^m → D^m maps a previous-layer value and attention head output to a new value vector.

On an input string w ∈ Σ^n, a transformer computes L layers of output sequences v_{ℓ,1}, ⋯, v_{ℓ,n} (for ℓ ≤ L), where each v_{ℓ,i} ∈ D^m. In the 0th layer, each token w_i and its position i are embedded into a value v_{0,i}. Subsequent layers aggregate information from the previous value sequence v_ℓ using a multi-head attention mechanism, and output a new value sequence v_{ℓ+1}. More formally, these layers are structured as follows:

  1. Embedding Layer: v_{0,i} = ϕ(w_i, i).

  2. Attention Head: Each of the H attention heads in layer ℓ+1 maps the full previous value sequence into a new value via s_{ℓ+1,h} and then applies the attention function α:
    a_{ℓ+1,h,i,j} = s_{ℓ+1,h}(v_{ℓ,i}, v_{ℓ,j}),
    b_{ℓ+1,h,i} = Σ_{j=1}^{n} α(a_{ℓ+1,h,i,:})_j · v_{ℓ,j}.
    Crucially, the semantics for addition and multiplication here (as well as in the computation of α) come from the datatype D.
  3. Activation Block:3 v_{ℓ+1,i} = f_{ℓ+1}(v_{ℓ,i}, b_{ℓ+1,:,i}).
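To make the layer computation above concrete, here is a minimal Python sketch of a single layer with one attention head, parameterized by an arbitrary attention function alpha (instantiated in §3.3); ordinary Python floats stand in for the datatype D, and all names are illustrative rather than drawn from any library.

```python
from typing import Callable, List, Sequence

Vector = List[float]  # stands in for a value in D^m

def layer_step(
    values: Sequence[Vector],                        # previous layer: v_{l,1}, ..., v_{l,n}
    score: Callable[[Vector, Vector], float],        # s_{l+1,h} for a single head
    alpha: Callable[[List[float]], List[float]],     # attention function (see Section 3.3)
    activation: Callable[[Vector, Vector], Vector],  # f_{l+1}
) -> List[Vector]:
    """Compute the next-layer values v_{l+1,1}, ..., v_{l+1,n} with one attention head."""
    n = len(values)
    m = len(values[0])
    new_values = []
    for i in range(n):
        # a_{l+1,h,i,j} = s_{l+1,h}(v_{l,i}, v_{l,j})
        scores = [score(values[i], values[j]) for j in range(n)]
        # Normalize the scores into attention weights using alpha.
        weights = alpha(scores)
        # b_{l+1,h,i} = sum_j alpha(a_{l+1,h,i,:})_j * v_{l,j}
        head_out = [sum(weights[j] * values[j][k] for j in range(n)) for k in range(m)]
        # v_{l+1,i} = f_{l+1}(v_{l,i}, b_{l+1,:,i})
        new_values.append(activation(values[i], head_out))
    return new_values
```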

3.3 Attention Functions

An attention function α maps a vector of scores a ∈ D^n to a probability distribution over {1, ⋯, n}. Specifically, we consider two attention functions: hard attention η(a) and saturated attention ζ(a).

Hard attention collapses the attention scores to a one-hot distribution with all mass concentrated at one index. Let M(a) = {i ∣ a_i = max_j a_j}.

Definition 2 (Hard attention).
Define hard attention η(a) as
η(a)_j = 1 if j = min M(a), and 0 otherwise.

In contrast, saturated attention spreads probability mass evenly across “tied” scores.

Definition 3 (Strong saturated attention; Merrill et al. 2021).
Define saturated attention ζ(a) as
ζ(a)_j = 1/|M(a)| if j ∈ M(a), and 0 otherwise.

Saturated attention can be derived by taking a large-norm limit of the network weights; see Merrill (2019) for a derivation. Saturated attention reduces to hard attention when |M(a)| = 1, and attends uniformly when |M(a)| = n. Both hard and uniform attention can be implemented with numerical stability, motivating weak saturated (or “uniform”) attention:

Definition 4 (Weak saturated attention).

Each head implements either hard attention (Definition 2) or the uniform pattern υ(a)_j = 1/n.

In general, we will use “saturated attention” to refer to strong saturated attention and provide upper bounds for this setting. On the other hand, our lower bounds only use weak saturated attention, thereby showing that even weak saturated attention is more powerful than hard attention.
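As a companion to Definitions 2–4, the following sketch implements hard, saturated, and uniform attention in ordinary Python (with ties detected by exact equality for simplicity; in the paper's model, equality is determined at the level of the underlying datatype). The function names are our own, and the hard-attention tie-break toward the leftmost index is one conventional choice.

```python
from typing import List

def tie_set(scores: List[float]) -> List[int]:
    """M(a): indices achieving the maximum score."""
    m = max(scores)
    return [j for j, a in enumerate(scores) if a == m]

def hard_attention(scores: List[float]) -> List[float]:
    """eta(a): all mass on one maximal index (here, the leftmost)."""
    j_star = min(tie_set(scores))
    return [1.0 if j == j_star else 0.0 for j in range(len(scores))]

def saturated_attention(scores: List[float]) -> List[float]:
    """zeta(a): mass spread evenly over the tied maximal indices."""
    ties = set(tie_set(scores))
    return [1.0 / len(ties) if j in ties else 0.0 for j in range(len(scores))]

def uniform_attention(scores: List[float]) -> List[float]:
    """upsilon(a): uniform weights 1/n, ignoring the scores."""
    n = len(scores)
    return [1.0 / n] * n

# Example: a two-way tie at indices 0 and 2 splits the mass evenly.
assert saturated_attention([3.0, 1.0, 3.0]) == [0.5, 0.0, 0.5]
assert hard_attention([3.0, 1.0, 3.0]) == [1.0, 0.0, 0.0]
```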

3.4 Language Recognition

Finally, we define language recognition for transformers.

Definition 5 (Language recognition).
Write v_{ℓ,i}(w) for the value of v_{ℓ,i} on input string w. A transformer recognizes a language ℒ ⊆ Σ* if there exists a D-valued affine transformation ⟨W, b⟩ such that, for all w ∈ Σ*,
W · v_{L,1}(w) + b > 0 ⟺ w ∈ ℒ.

This says that the decision problem of recognizing ℒ must be linearly separable using the first value in the last layer of the transformer. In practice, the first token in a transformer is often set to CLS, and its output can be passed to a classifier during finetuning (Devlin et al., 2019). This inspires Definition 5. There are other potential ways to define language recognition and generation for transformers (Hewitt et al., 2020; Yao et al., 2021), but they do not lead to meaningful differences for our purposes.

Finally, we define AHAT(D) as the set of languages recognizable by some saturated transformer over D, where the internal functions can be any size-preserving function.4

Definition 6.

Let AHAT(D) be the set of languages ℒ such that there exists a transformer ⟨Σ, D, ζ, L, H, ϕ, {s_{ℓ,h}}, {f_ℓ}⟩ that recognizes ℒ, where each ϕ, s_{ℓ,h}, f_ℓ ∈ 𝒫.5

We note that size preservation is a weak condition to assume about the internal functions in practical transformers: Because any linear-time- computable function is size-preserving, it is strictly weaker than assuming that the internal functions can be computed in linear time. To further justify this condition, we explicitly show in Appendix B that the component functions within transformers are size-preserving.

Circuit complexity is a branch of computational complexity theory that studies circuit families as a model of computation.6 Intuitively, circuits are useful for formally studying the types of computational problems that can be efficiently solved with parallelism, as the depth of a circuit corresponds to the runtime of a program on an idealized, fully parallel computer. We review background on circuits, circuit families, and relevant complexity measures and classes.

Circuits

For a fixed n, a circuit is a computation graph, where leaves correspond to input bits xi and their negations ¬xi, and the internal nodes are logic gates (typically ∧ and ∨), with one labeled as the output node. The gates can conventionally be taken to have either binary or unbounded fan-in. The circuit computes a function f : {0,1}n →{0,1} by substituting the input values into the leaf nodes, propagating the computation through the graph, and returning the value of the output node. Figure 1 shows an example circuit that takes inputs of length 5, and returns whether they contain the bigram 11.

Figure 1: 

A circuit that takes a string x ∈ {0,1}5 and returns whether it contains the bigram 11.
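As a concrete rendering of this kind of circuit, the sketch below evaluates an AND/OR circuit for the bigram-11 property on 5 input bits (our own illustrative Python encoding: one AND gate per adjacent pair of inputs, then a single unbounded fan-in OR gate).

```python
from typing import List

def contains_bigram_11(x: List[int]) -> int:
    """Evaluate a small AND/OR circuit: does the 5-bit input contain the bigram 11?"""
    assert len(x) == 5 and all(b in (0, 1) for b in x)
    # Layer 1: one AND gate per adjacent pair (x_i AND x_{i+1}).
    and_gates = [x[i] & x[i + 1] for i in range(4)]
    # Layer 2: a single unbounded fan-in OR gate over the AND gates.
    return int(any(and_gates))

assert contains_bigram_11([0, 1, 1, 0, 0]) == 1
assert contains_bigram_11([1, 0, 1, 0, 1]) == 0
```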

Circuit Families
A circuit family is an ordered set of circuits {C_n}_{n ∈ ℕ} where each circuit is identified with a particular input size n. We say a circuit family recognizes a formal language ℒ ⊆ {0,1}* iff, for all w ∈ {0,1}*,7
C_{|w|}(w) = 1 ⟺ w ∈ ℒ.
Circuit Complexity

Two important notions of complexity for a circuit are its size and depth. The size of a circuit is the number of gates. The depth is the longest path from an input node to the output node. For a circuit family, both quantities can be expressed as functions of the input size n. A circuit complexity class is a set of formal languages that can be recognized by circuit families of a certain size, depth, and set of gates. In particular, we will discuss the classes AC0 and TC0.

Definition 7.

AC0 is the set of languages ℒ ⊆{0,1}* such that there exists a circuit family recognizing ℒ with unbounded arity {∧,∨} gates, poly(n) size, and O(1) depth.

Intuitively, AC0 represents the class of problems that are highly parallelizable when the computational primitives are standard logic gates. In contrast, TC0 will also represent highly parallelizable computation, but when the gates are expanded to include threshold gates.

For a bitstring x ∈ {0,1}*, define the threshold gate θ_{≥k}(x) to return 1 iff at least k bits in x are 1, and define θ_{≤k}(x) analogously. For example, θ_{≥3}(110011) = 1.
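For concreteness, here is a sketch of the threshold-gate primitive (helper names are our own), including the single θ gate that decides the majority language discussed later in §6:

```python
from typing import List

def theta_geq(k: int, x: List[int]) -> int:
    """Threshold gate theta_{>=k}: 1 iff at least k of the input bits are 1."""
    return int(sum(x) >= k)

def majority_gate(x: List[int]) -> int:
    """#1(x) > #0(x) holds iff at least floor(n/2) + 1 bits are 1: one threshold gate."""
    return theta_geq(len(x) // 2 + 1, x)

assert theta_geq(3, [1, 1, 0, 0, 1, 1]) == 1  # the example above: theta_{>=3}(110011) = 1
assert majority_gate([1, 0, 1, 1, 0]) == 1
assert majority_gate([1, 0, 0, 1, 0]) == 0
```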

Definition 8.

TC0 is the set of languages ℒ ⊆{0,1}* such that there exists a circuit family recognizing ℒ with unbounded arity {∧,∨,θ} gates, poly(n) size, and O(1) depth.

It is known that AC0 ⊊ TC0 ⊆ NC1, where NC1 denotes the languages recognizable by O(log n)-depth circuits with bounded gate arity. Whether or not the latter containment between TC0 and NC1 is strict is an open question. Whereas parity and other basic regular languages are outside AC0 (Furst et al., 1981), TC0 contains parity, although it is unknown whether it contains all the regular languages. Between AC0 and TC0 lies the class ACC0 (Yao, 1990).

Uniformity
The circuit classes defined above (and which we will use in this paper) are non- uniform, meaning circuits for different input sizes are not constrained to have any relation to each other. Non-uniform circuit families can recognize some uncomputable languages, such as the language of strings 1k such that Turing machine k does not halt on the null input (cf. Arora and Barak, 2009). In contrast, the uniform variants of circuit families are constrained such that a log-space Turing machine must output a string encoding of circuit Cn on the input string 1n, forcing any language the circuit family can recognize to be computable. For these uniform classes (which we write with a u prefix), it is known that
uTC0 ⊆ L ⊆ P, where L and P denote the conventional complexity classes of log-space and polynomial-time decision problems. Thus, it is unknown whether uTC0 is restricted compared to general polynomial-time computation, but if we accept the common conjecture that one (if not all) of the above containments are strict, then uTC0 forms a restricted family of problems compared to P: intuitively, those that are more parallelizable than other problems in P.

We now begin our analysis of saturated transformers. Hao et al. (2022) and Hahn (2020) were able to give upper bounds on the power of hard attention without imposing any constraints on the embedding, scoring, and activation functions. The same will not be the case with saturated attention: any bounds on transformers will require leveraging some properties constraining their internal functions. One property we use will be size preservation. We will first show, though, that size preservation is not enough on its own: Deriving a nontrivial upper bound will depend on subtle assumptions about the transformer’s datatype.

With rational values and size-preserving internal functions, we will show saturated transformers can recognize any formal language, namely, the class ALL = {ℒ∣ℒ ⊆{0,1}*}. Our construction resembles the universal approximation construction of Yun et al. (2020), which relies on the ability of the transformer to uniquely encode the full input string into a single value vector. After the full sequence is encoded locally into a single vector, the activation block can be used as a black box to recognize any language.

Theorem 1.

AHAT(ℚ) = ALL.

Proof.
We construct a 1-layer rational-valued transformer with a single head to recognize every string w in any formal language ℒ ∈ ALL. We will omit ℓ, h subscripts. Let p_i denote the ith prime number. The embedding layer encodes each input token according to
ϕ(w_i, i) = w_i / p_i, that is, 1/p_i if w_i = 1 and 0 otherwise.
Since p_i ∼ i log i for large i by the prime number theorem (cf. Goldstein, 1973), the number of bits needed to represent the denominator of ϕ(w_i, i) is
O(log p_i) ⊆ O(log(i log i)) ⊆ O(log i).
Since i has size ⌈log i⌉, this implies ϕ is size-preserving.

Now, we define a single uniform attention head that sums across all i, outputting Σ_{w_i = 1} 1/p_i. The denominator q of this sum, in reduced form, is the product Π_{w_i = 1} p_i. Observe that w_i = 1 iff p_i divides q. Thus, we can define a function f′ that extracts the input sequence w from q by checking whether, for each i, p_i divides q. We let g be a function recognizing ℒ, and set f = g ∘ f′. The output of the transformer will now compute whether w ∈ ℒ, since f′ outputs an encoding of the original input sequence w, and g decides whether w ∈ ℒ. Note that any function solving a decision problem is size-preserving, hence f ∈ 𝒫.
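The prime-based encoding in this proof can be replayed numerically. The sketch below uses Python's exact Fraction type in place of the transformer's rational datatype, and the helper names are our own:

```python
from fractions import Fraction
from typing import List

def first_primes(n: int) -> List[int]:
    """The first n primes, by trial division (fine for a small illustration)."""
    primes: List[int] = []
    cand = 2
    while len(primes) < n:
        if all(cand % p for p in primes):
            primes.append(cand)
        cand += 1
    return primes

def encode(w: str) -> Fraction:
    """Sum of 1/p_i over positions i with w_i = 1; Fraction reduces like red(p, q)."""
    ps = first_primes(len(w))
    return sum((Fraction(1, p) for p, b in zip(ps, w) if b == "1"), Fraction(0))

def decode(denominator: int, n: int) -> str:
    """Recover w from the reduced denominator q: w_i = 1 iff p_i divides q."""
    ps = first_primes(n)
    return "".join("1" if denominator % p == 0 else "0" for p in ps)

w = "10110"
enc = encode(w)                       # 1/2 + 1/5 + 1/7 = 59/70, denominator 2*5*7
assert decode(enc.denominator, len(w)) == w
```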

Theorem 1 says that our transformer architecture parameterized with a rational datatype can recognize any formal language. But a construction of this form feels unrealistic for two reasons. First, it requires the embedding layer to implement an unconventional prime-based encoding scheme. Second, we are using the activation layer as a black box to recognize any language—even uncomputable ones! On the other hand, the feedforward subnetworks used in practice in transformers cannot even implement all computable functions when the weights are fixed independent of the sequence length n. We can get around both these issues by instead restricting the datatype to floats, which is the direction we will pursue in the remaining sections.8

5.1 Resource-Bounded Transformers

In Appendix C, we develop an alternate perspective on the universality of transformers, showing that, if the embedding function is allowed to be computed in time linear in the sequence length, then the transformer’s complexity is equivalent to its activation functions’ complexity.

Theorem 2 (Informal).

If ϕ can be any function computable in time linear in n, and the scoring and activation functions can be computed in T(m) time on inputs of size m with T(m) ≥ m, then the languages recognizable by the transformer are exactly those in TIME(T(m)).

Appendix C contains a formal statement and proof. For example, allowing polynomial-time functions inside the transformer implies that the transformer will recognize exactly the complexity class P. A major unrealism about this setup is the assumption that ϕ can be an arbitrary function computable in time linear in n, motivating our main results in a more constrained setting in §8.

5.2 Discussion

We are not stating the results in this section as evidence that practical transformers are capable of universal or arbitrary polynomial computation. Rather, the unnaturalness of these constructions (specifically, the prime numbers based position encoding) motivates us to slightly constrain our model of the transformer in a realistic way: We will switch the datatype from rationals to floats, because even using only simple uniform attention, a model with rationals and unconstrained internal functions is universal. We will soon see that this realistic constraint prevents universal simulation, and in fact bounds the capacity of the saturated transformer within TC0.

We now move to the setting of saturated transformers over floats. Hao et al. (2022) identified that hard-attention transformers can only recognize languages within AC0. In contrast, saturated transformers over floats can recognize the “majority” language maj, which is known to lie outside AC0 (Furst et al., 1981). Pérez et al. (2019, Prop. 3.3) show how maj can be recognized by transformers. In Theorem 3, we offer a simpler construction that leverages only a single uniform attention head, as opposed to the model of transformers they were considering. Thus, this construction is achievable with saturated attention.

Theorem 3.

AHAT(F) ⊈ AC0.

Proof.
Let #_σ(w) ∈ ℕ denote the number of σ tokens in string w ∈ {0,1}*. Let #(w) denote a count vector where each element corresponds to some σ ∈ {0,1}. We define maj as follows:
maj = {w ∈ {0,1}* ∣ #_1(w) > #_0(w)}.
We will construct a 1-layer transformer with a single head to recognize maj, omitting ℓ, h subscripts from s, f, x, b. Figure 2 gives the same construction in RASP (Weiss et al., 2021).
Let x_i = ϕ(w_i, i) be a 1-hot encoding of w_i. For all i, j, set s(x_i, x_j) = 1, resulting in a single head attending uniformly everywhere:
b_i = (1/n) Σ_{j=1}^{n} x_j = #(w)/n.
Finally, set f(b_i) to return whether #_1(w)/n > #_0(w)/n, which, for n > 0, is true iff w ∈ maj.
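The construction in this proof can be mimicked end to end in a few lines of Python (a sketch with our own helper names; ordinary Python arithmetic stands in for float operations, and the explicit weights 1/n play the role of the uniform head):

```python
from typing import List

def one_hot(token: str) -> List[float]:
    """phi(w_i, i): 1-hot encoding of the token (the position is unused here)."""
    return [1.0, 0.0] if token == "0" else [0.0, 1.0]

def recognize_maj(w: str) -> bool:
    """A 1-layer, 1-head recognizer for maj using a uniform attention pattern."""
    n = len(w)
    values = [one_hot(c) for c in w]
    # All scores are equal (s = 1), so saturated attention is uniform: weights 1/n.
    head = [sum(v[k] for v in values) / n for k in range(2)]  # (#0(w)/n, #1(w)/n)
    # f(b_i): compare the two fractions; a linear classifier can read this off.
    frac0, frac1 = head
    return frac1 > frac0

assert recognize_maj("110101")      # four 1s vs. two 0s
assert not recognize_maj("10100")   # two 1s vs. three 0s
```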

Figure 2: 

A program recognizing maj in RASP, a programming language designed to abstract away details of transformer computation (Weiss et al., 2021). frac0 and frac1 measure the fraction of input tokens that are 0 or 1, respectively. Then maj checks whether frac1 > frac0.

Notably, the construction in Theorem 3 is not just possible within our generalized transformer framework, but can also be implemented by the standard parameterization of ϕ, s, and f in real transformers (Vaswani et al., 2017). The uniform attention pattern can be implemented by setting all query and key attention parameters to 0. Then, we can use the affine transformation that aggregates the head outputs to compute the tuple ⟨#_1(w)/n, #_0(w)/n⟩.
This tuple is then passed through layer normalization (Ba et al., 2016), resulting in a new tuple ⟨t_1, t_2⟩. Crucially, t_1 > t_2 iff the same holds for the corresponding quantities in the original tuple. Thus, a linear classifier can decide whether t_1 > t_2 to successfully recognize the language, as per Definition 5.

6.1 Empirical Validation

In Figure 3, we show empirically that a 1-layer transformer can learn and generalize maj. This supports our argument that the theoretical limitations of hard-attention transformers do not apply to practical transformers. We train with three different types of positional encoding: none, meaning no positional information; learned, where each position gets a trainable embedding vector; and the sinusoidal scheme of Vaswani et al. (2017). The model with no positional embeddings generalizes the best, followed by the learned embeddings. It appears that while maj is within the capacity of the transformer, the standard sinusoidal positional embedding scheme provides the wrong inductive bias for learning it. This recalls the finding of Yao et al. (2021) that the choice of positional encodings seems to greatly impact the transformer’s ability to generalize formal language tasks to longer sequences.

Figure 3: 

In practice, transformers can learn the majority language (which lies outside AC0). We train 1-layer transformers on majority, where each line represents a different positional encoding scheme. Training string length was binomial with n = 100. Trained models were then evaluated on generalization sets with n ranging from 100 to 500. Mean length (x axis) is n/2.


The theoretical limits on hard-attention transformers were derived by Hao et al. (2022) by bounding the size in bits of the representation v_{ℓ,i} at each layer ℓ and position i. Specifically, they show that the value v_{ℓ,i} is representable in O(log n) bits on input sequences of length n. Thus, each value can only contain limited information about the input sequence, intuitively explaining their upper bound on hard-attention transformers. Inspired by their analysis, this section will show that, in a saturated transformer, each v_{ℓ,i} also has a size of O(log n) bits. Later, in §8, we will use this property to show that saturated-attention transformers are limited in the formal languages they can recognize.

7.1 Size of Float Sums

How many bits does it take to represent the value of an attention head within a saturated transformer? As a naive bound, the output of a saturated attention head is specified by a float for each of n values attended over from the last layer, which would take at least linearly many bits in n. However, this upper bound on its size is not tight. Instead, we will show that all head and activation values can be represented in O(logn) bits. Our analysis will rely heavily on the following lemma:

Lemma 1.

Let v_1, ⋯, v_n be a sequence of floats, each with size at most z. Then there exists c such that Σ_{i=1}^{n} v_i has size at most 4cz + 2 log n + 1.

Proof.
Let p_i, q_i denote the numerator and denominator of the float v_i, respectively. Similarly, let p_s, q_s be the numerator and denominator of the float s = Σ_{i=1}^{n} v_i. By assumption, there exists c such that each p_i, q_i has size ≤ cz for large enough n. We let p_max = max_i p_i and analogously for q_max. Because all q_i’s are powers of 2, the numerator p_s is
Σ_{i=1}^{n} p_i · (q_max / q_i) ≤ n · p_max · q_max,
which, represented as an integer, has size at most
log n + |p_max| + |q_max| ≤ 2cz + log n.
On the other hand, the denominator q_s = q_max, which has size ≤ cz. Therefore, the float representing the sum has size at most
2·max(|p_s|, |q_s|) + 1 ≤ 2(2cz + log n) + 1 = 4cz + 2 log n + 1,
which completes the proof.

In particular, we will use Lemma 1 to show that, when each of a sequence of n values has size O(logn), the sum will also have size O(logn).

7.2 Size of Transformer Values

We will now leverage Lemma 1 to show that the values are of bounded size in any transformer over floats with an elementwise-size-preserving attention function.

Definition 9.

A function α : D^n → D^n is elementwise-size-preserving if, for 1 ≤ i ≤ n, the function x_i ↦ α(x)_i is size-preserving (where x ∈ D^n).

Note that saturated attention satisfies this definition. We are ready to prove a theorem bounding the size of the representations in transformers with elementwise-size-preserving attention.

Theorem 4.

For any transformer over F with ϕ, s_{ℓ,h}, f_ℓ ∈ 𝒫 and α elementwise-size-preserving, for all ℓ ≤ L and i ≤ n, v_{ℓ,i} has size O(log n).

Proof.

By induction over ℓ. The proof follows the definition of transformer computation in §3.2.

Base Case

w_i ∈ Σ has size O(1), and i ∈ [n] has size O(log n). Since ϕ ∈ 𝒫, v_{0,i} = ϕ(w_i, i) has size O(log n) for all i.

Inductive Case
Assume v_{ℓ,i} has size O(log n). Since s_{ℓ+1,h} ∈ 𝒫, a_{ℓ+1,h,i,j} = s_{ℓ+1,h}(v_{ℓ,i}, v_{ℓ,j}) has size O(log n) for all i, j. Since α is elementwise-size-preserving, we can conclude that α(a_{ℓ+1,h,i,:})_j also has size O(log n) for all h, i, j. Multiplying two floats is size-preserving (cf. Appendix B), so α(a_{ℓ+1,h,i,:})_j · v_{ℓ,j} has size O(log n) for all h, i, j. We then apply Lemma 1 to conclude that b_{ℓ+1,h,i} has size O(log n), where, recall,
b_{ℓ+1,h,i} = Σ_{j=1}^{n} α(a_{ℓ+1,h,i,:})_j · v_{ℓ,j}.
Finally, computing v_{ℓ+1,i} = f_{ℓ+1}(v_{ℓ,i}, b_{ℓ+1,:,i}), we conclude that v_{ℓ+1,i} has size O(log n) for all i by size preservation.

Corollary 4.1.

For any saturated transformer over F with size-preserving internal functions, for all ℓ ≤ L and i ≤ n, v_{ℓ,i} has size O(log n).

Corollary 4.1 follows because saturated attention is elementwise-size-preserving. Softmax attention, on the other hand, is not guaranteed to fulfill this property, because it requires computing the exponential function. This technical challenge prevents generalizing our technique to soft attention.

7.3 Discussion

Similar to hard-attention transformers (Hao et al., 2022), the size of each vector representation in a saturated transformer over floats is O(logn). This is enough memory for individual vectors to “count”, a behavior that has been observed in both LSTMs (Weiss et al., 2018) and transformers (Bhattamishra et al., 2020). On the other hand, O(logn) space is not enough memory for individual vectors (for example, the CLS output) to encode arbitrarily large combinatorial objects like trees. However, transformers are not limited to computing in an “online” fashion where tokens are consumed sequentially, meaning that their effective state is n values of size O(logn). Notably, trees with n leaves can be encoded in a distributed fashion across n values of size O(logn). One construction for this is, at index i, to store wi and i, along with a pointer j to the parent. Since i,j can both be represented in logn bits, each vector uses only O(logn) bits.

Additionally, the O(log n) space bound has implications from the perspective of circuit complexity. While saturated attention cannot be simulated in AC0, we will show in §8 that saturated transformers over F can be simulated by TC0 circuits.

We have proved that each value vector in a saturated transformer over floats has O(logn) size. Now, we show how this implies saturated transformers can be simulated by TC0 circuits. Our results heavily leverage the following lemmas:

Lemma 2 (Hao et al. 2022).

Any function f : {0,1}^c → {0,1}^d can be computed by a Boolean circuit of depth 3 and size at most d(2^c + c + 1).

So that our results are self-contained, we reproduce a proof of this lemma in Appendix D. Applying Lemma 2 to a size-preserving function with at most c log n input bits immediately yields:

Corollary 2.1.

Any size-preserving function with at most c log n input bits can be computed by a Boolean circuit of depth 3 and polynomial size.

In other words, such functions can be computed with AC0 circuits. In addition, we will show that the sum of n floats of size at most clogn can be computed by TC0 circuits.

Lemma 3.

Let v_1, ⋯, v_n be a sequence of floats, each with size at most c log n for some c. Then the sum Σ_{i=1}^{n} v_i is computable by a threshold circuit of constant depth and polynomial size.

Proof.
Let the integers p_i, q_i be the numerator and denominator of v_i. We first compute q_max, the maximum q_i, using an AC0 circuit that compares all pairs q_i, q_j and returns the first q_i such that q_i ≥ q_j for all j. We then use the fact that multiplication and right shift (each q_i is a power of 2) are in TC0 in order to compute
r_i = p_i · (q_max / q_i)
in parallel for all i. Note that q_max and q_i are both powers of 2, so the division will be exact. Next, we leverage the fact that the sum of n integers of size O(log n) is in TC0 (Kayal, 2015) in order to compute the numerator of the sum, p′ = Σ_i r_i. We select the denominator as q′ = q_max. Finally, we can add an AC0 circuit that “reduces” the fraction by removing shared trailing zeros from p′, q′, which is possible by Corollary 2.1. Thus, we have constructed a TC0 circuit to compute the sum of n floats with size O(log n).
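Read sequentially rather than in parallel, the circuit in this proof computes the following (a Python sketch with our own names; each comment indicates which subcircuit would perform that step):

```python
from typing import List, Tuple

def sum_floats(floats: List[Tuple[int, int]]) -> Tuple[int, int]:
    """floats: list of (p_i, q_i) with each q_i a power of 2. Returns (p', q')."""
    # AC0 step: find q_max by pairwise comparisons.
    q_max = max(q for _, q in floats)
    # TC0 step: in parallel for each i, r_i = p_i * (q_max / q_i), an exact shift.
    r = [p * (q_max // q) for p, q in floats]
    # TC0 step: iterated addition of n integers of size O(log n).
    p_sum = sum(r)
    q_sum = q_max
    # AC0 step: "reduce" by stripping shared trailing zero bits from p' and q'.
    while p_sum != 0 and p_sum % 2 == 0 and q_sum % 2 == 0:
        p_sum //= 2
        q_sum //= 2
    return p_sum, q_sum

# (1/2) + (3/4) + (1/4) = 3/2
assert sum_floats([(1, 2), (3, 4), (1, 4)]) == (3, 2)
```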

We now construct a TC0 circuit that simulates a saturated transformer over floats.

Theorem 5.

AHAT(F) ⊆ TC0.

Proof.

For each n, we construct a TC0 circuit that simulates a saturated transformer on inputs of size n. We construct the circuit modularly, with one subcircuit for the attention mechanism, and another for the feedforward subnetwork.

Attention Head
Fix a single head in some layer. We will construct a TC0 subcircuit that simulates the attention mechanism at position i. The head attends over vectors v_1, ⋯, v_n. For all j, v_j has size O(log n) by Theorem 4. In parallel for each j, we compute the scores a_{i,j} = s(v_i, v_j) with an AC0 circuit by Corollary 2.1. We then compute a_{i,max} = max_j a_{i,j} with an AC0 circuit by comparing all a_{i,j} pairwise and selecting the first a_{i,k} such that a_{i,k} ≥ a_{i,j} for all j. We then compute “masked” values u_{i,j} for each j via an AC0 circuit by Lemma 2:
u_{i,j} = v_j if a_{i,j} = a_{i,max}, and 0 otherwise.
We then compute the sum s_i = Σ_{j=1}^{n} u_{i,j} by Lemma 3. By Lemma 1, s_i has size O(log n). Now, we similarly define
z_{i,j} = 1 if a_{i,j} = a_{i,max}, and 0 otherwise.
Using an analogous sum construction with z_{i,j} instead of u_{i,j}, we can use a TC0 circuit to compute |M(a)|: the number of j such that a_{i,j} = a_{i,max}. Finally, since dividing floats is in TC0 (cf. Appendix A), we can compute the head output as s_i / |M(a)|, which has size O(log n) by size preservation of division.
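Read sequentially, the attention-head subcircuit above corresponds to the following sketch (scalar values and our own names, for illustration only; the comments note which pieces are AC0 versus TC0 in the actual construction):

```python
from typing import List

def saturated_head(scores: List[float], values: List[float]) -> float:
    """Simulate one saturated attention head at a fixed position i.

    scores[j] plays the role of a_{i,j} = s(v_i, v_j); values[j] plays the role of v_j.
    """
    # AC0: a_{i,max} by pairwise comparison of the scores.
    a_max = max(scores)
    # AC0 (Lemma 2): masked values u_{i,j} and tie indicators z_{i,j}.
    u = [v if a == a_max else 0.0 for a, v in zip(scores, values)]
    z = [1 if a == a_max else 0 for a in scores]
    # TC0 (Lemma 3): the two sums s_i = sum_j u_{i,j} and |M(a)| = sum_j z_{i,j}.
    s_i = sum(u)
    m = sum(z)
    # TC0: float division for the final head output s_i / |M(a)|.
    return s_i / m

# Scores tie at positions 0 and 2, so the head averages values[0] and values[2].
assert saturated_head([5, 1, 5, 0], [10.0, 2.0, 4.0, 8.0]) == 7.0
```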
Feedforward

As input, f receives vi as well as H head outputs, all of which have size O(logn). As the total size of the input is O(logn), we can use Corollary 2.1 to compute the output of f with an AC0 circuit. The size of the output is O(logn) by size preservation of f. The same idea holds for ϕ as well as the linear classification head.

We have simulated each transformer component with a TC0 subcircuit, completing the proof.

8.1 Discussion

Recall that, over rationals, we found that size-preserving saturated transformers could recognize any language. In contrast, we have now shown that using floating-point representations places such transformers within TC0. In this paper, we have only considered non-uniform AC0 and TC0, as opposed to the uniform variants of these classes, which are more closely connected to familiar formal language classes like the regular and context-free languages (cf. Cojocaru, 2016; Mahajan, 2007). As transformers satisfy some intuitive notion of uniformity, an open question is whether saturated transformers also fall into uniform TC0.

Compared with hard attention, saturated attention adds theoretical power to transformers. We showed that saturated attention lets transformers recognize languages outside AC0, which is the upper bound with hard attention. Further, while saturated transformers with rational values and size-preserving internal functions can recognize any language, we characterize the limits of size-preserving saturated transformers with floats. Specifically, saturated transformers with float values fall in TC0, a more powerful circuit class than AC0. Thus, going from hard to saturated attention can be understood as augmenting the model with threshold gates. This illustrates one way that the circuit complexity paradigm characterizes the power of transformers. Going forward, there are many interesting open questions that circuit analysis can answer, such as comparing the power of saturated and soft attention, and refining existing upper bounds for transformers in terms of uniform circuit families.

Thanks to Yiding Hao, Dana Angluin, and Robert Frank for sharing an early draft of their work. We also appreciate helpful feedback from Dana Angluin, Matt Gardner, Yoav Goldberg, Michael Hahn, Kyle Richardson, and Roy Schwartz.

Let / be truncated division between integers. We divide a float by a positive integer p by defining an approximate multiplicative inverse p^{−1}: its numerator is 2^{|p|}/p and its denominator is 2^{|p|}. For division by a float ⟨p, q⟩, we simply apply the integer approach and then multiply by q. This yields numerator (2^{|p|}/p)·q and denominator 2^{|p|}.

The fact that float division is defined in terms of integer multiplication and division implies that it is size-preserving and can be simulated in TC0, which we use in §8.
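A sketch of the approximate inverse described above, for the integer-divisor case only (our own helper name; bit_length plays the role of |p|, and the float is represented as a numerator plus a power-of-2 exponent):

```python
from typing import Tuple

def approx_divide_by_int(num: int, exp_den: int, p: int) -> Tuple[int, int]:
    """Divide the float num / 2**exp_den by a positive integer p.

    Multiplies by the approximate inverse floor(2**|p| / p) / 2**|p|,
    so the result is again a float (power-of-2 denominator).
    """
    k = p.bit_length()                   # |p|
    inv_num = (2 ** k) // p              # truncated 2**|p| / p
    return num * inv_num, exp_den + k    # new numerator, new denominator exponent

# Dividing the float 3/2 by the integer 3 (exact answer 1/2 = 0.5):
num, exp = approx_divide_by_int(3, 1, 3)  # |3| = 2 bits, inverse = floor(4/3)/4 = 1/4
assert (num, exp) == (3, 3)               # 3/8 = 0.375: an approximation, not exact
```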

We justify that feedforward neural networks are size-preserving over floats. Feedforward neural networks are made up of a fixed (with respect to n) number of addition, multiplication, division, ReLU, and square root (for layer norm) operations. Therefore, it suffices to show that each of these operations is size-preserving over F.

For addition of two floats p_1/q_1 and p_2/q_2, the numerator is
p_1 · (q_max/q_1) + p_2 · (q_max/q_2) ≤ 2 · p_max · q_max,
which has size ≤ log 2 + |p_max| + |q_max| ≤ 2(|p_max| + |q_max|) for large enough input size.

For multiplication, the numerator is just p_1 · p_2, which has size ≤ 2|p_max|. Let the denominators be q_1 = 2^{k_1} and q_2 = 2^{k_2}. Then the denominator of the product is 2^{k_1 + k_2}, which has size ≤ 2|q_max|.

Division can be analyzed in terms of the approximate multiplicative inverse (Appendix A).9 Its numerator has size ≤ |p| + 1 + |q| ≤ 2(|p| + |q|) for large enough input size. The denominator has size ≤ |p| + 1 ≤ 2|p| for large enough input size.

Size preservation is trivially satisfied for ReLU, which cannot expand the size of the input.

To make layer norm work, we just need to analyze the square root, which we define in a truncated fashion over integers. The square root of a rational, then, simply takes the truncated square root of p and of q. We have that |⌊√p⌋| ≤ |p|, and analogously for q.

Size preservation is one way to characterize the constraints on transformers’ internal functions; a slightly different perspective is to fix ϕ and analyze how the language recognition abilities of the transformer change depending on the computational resources allotted to each s,h and f. In this section, we derive an alternate universality theorem in terms of time complexity classes. We will show that as long as ϕ is powerful enough, such transformers have equivalent time complexity to their activation functions.

Recall that a transformer is a tuple ⟨Σ, D, α, L, H, ϕ, {s_{ℓ,h}}, {f_ℓ}⟩. In contrast to AHAT(D) (cf. Definition 6), we will now work with a different class of transformer languages, AHAT(D, T(m)). We will allow the embedding functions to be computed in time linear in the sequence length, and explore the effect of varying the complexity of the other internal functions. Let FTIME(T(m)) be the set of functions computable by a Turing machine in T(m) time.10

Definition 10.

Let AHAT(D, T(m)) be the class of languages ℒ ⊆ Σ* such that there exists a transformer ⟨Σ, D, α, L, H, ϕ, {s_{ℓ,h}}, {f_ℓ}⟩ that recognizes ℒ, where ϕ runs in time linear in the sequence length n, and each s_{ℓ,h}, f_ℓ ∈ FTIME(T(m)).

For any T(m)m, we will show transformers AHAT(D,T(m)) have the complexity of their activation functions. Formally:

Theorem 2
(Formal version of Theorem 2). For D ∈ {F, ℚ} and T(m) ≥ m, AHAT(D, T(m)) = TIME(T(m)).

Proof.

First, observe that AHAT(D, T(m)) ⊆ TIME(T(m)), since the embedding function and saturated attention can be computed in time linear in the input sequence length, and the other internal functions can be computed in FTIME(T(m)) by construction.

We now show TIME(T(m)) ⊆ AHAT(D, T(m)). We adopt a 1-layer transformer construction, and thus omit ℓ, h subscripts. We define three components of the embedding function ϕ : Σ × ℕ → D^3:
Each of these components is computable in time linear in n. Define three heads b_{1,i}, b_{2,i}, b_{3,i}. Without loss of generality, consider b_{h,i} to act on ϕ(w_i, i)_h alone, rather than the full embedding vector. b_{1,i} is defined as a uniform head, while b_{2,i} and b_{3,i} are computed with s_h(v_i, v_j) = v_j. Thus,
Finally, we discuss how to set f to compute whether w ∈ ℒ. Let p be the function that extracts the numerator of a float or rational number, which is computable in O(m) time on a float of size m. Within f, we compute u = p(b_{1,i}). At this point, we proceed in two cases depending on the datatype D:

  1. Rationals: If D = ℚ, then u is the binary string w. Any ℒ ∈ TIME(T(m)) has an indicator function δ ∈ FTIME(T(m)), which we now apply to recognize whether w ∈ ℒ.

  2. Floats: If D = F, then u = (2^{|n|}/n) · w as in Appendix A. Therefore, in linear time, we compute w from u by dividing out the factor 2^{|n|}/n (truncated division is exact here), and feed w through δ as in the D = ℚ case.

So, TIME(T(m)) ⊆ AHAT(D, T(m)).

The proof for Lemma 2 largely follows the proof of a core lemma of Hao et al. (2022). We reproduce a slightly adapted version of their proof here, because their manuscript is not yet publicly available, and we wish for our paper to be self-contained.

Lemma 2.

Any function f : {0,1}^c → {0,1}^d can be computed by a Boolean circuit of depth 3 and size at most d(2^c + c + 1).

Proof.

The idea of the proof is to define d subcircuits of size at most 2^c + c + 1 that compute the d output bits of f in parallel. We will build a circuit that computes each output bit of f according to its representation in disjunctive normal form (DNF). We define a first layer of the circuit that computes the negation of each input, which takes c gates. The second layer then computes the value of each DNF term by computing a conjunction (∧ gate) over the corresponding literals or negated literals. Note that a formula of c variables has at most 2^c DNF terms. Finally, the third layer of the circuit computes a disjunction (∨ gate) over the values of all terms, yielding the output of f and adding a single gate. In summary, we have shown how to compute each output bit with a circuit of size at most 2^c + c + 1, which implies the full function f can be computed by a circuit of size at most d(2^c + c + 1).
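The gate counting in this proof can be checked concretely. The sketch below (our own helper names) enumerates the DNF terms for one output bit of an arbitrary function given by its truth table, tallies the gates, and verifies that evaluating the DNF reproduces the function:

```python
from itertools import product
from typing import Callable, List, Tuple

def dnf_terms(f: Callable[[Tuple[int, ...]], int], c: int) -> List[Tuple[int, ...]]:
    """One DNF term (AND gate) per input assignment on which f outputs 1."""
    return [x for x in product((0, 1), repeat=c) if f(x) == 1]

def eval_dnf(terms: List[Tuple[int, ...]], x: Tuple[int, ...]) -> int:
    """OR over terms; each term checks its literals/negated literals against x."""
    return int(any(all(xi == ti for xi, ti in zip(x, t)) for t in terms))

def gate_count_one_output(c: int, num_terms: int) -> int:
    """Layer 1: c NOT gates; layer 2: one AND gate per term; layer 3: one OR gate."""
    return c + num_terms + 1

def parity3(x: Tuple[int, ...]) -> int:
    return (x[0] + x[1] + x[2]) % 2

terms = dnf_terms(parity3, 3)
assert len(terms) <= 2 ** 3                                    # at most 2^c terms
assert gate_count_one_output(3, len(terms)) <= 2 ** 3 + 3 + 1  # <= 2^c + c + 1 gates
assert all(eval_dnf(terms, x) == parity3(x) for x in product((0, 1), repeat=3))
```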

1 

Under the hood, we imagine the pair p,q is encoded by padding p and q to the same length with 0’s and interweaving bits from each.

2 

More generally, the denominator may be taken to have a prime factorization of bounded length, although we work with the power of 2 definition, which is both simpler and closely resembles conventional floating point datatypes.

3 

Let V_{ℓ,h} be a head’s value matrix in the standard transformer parameterization. Then f_ℓ is computed by first multiplying each b_{ℓ,h,i} by V_{ℓ,h}, aggregating the multiple attention heads, and applying the feedforward subnetwork.

4 

The name AHAT stands for “averaging hard attention transformer”, and is taken from Hao et al. (2022).

5 

To apply size preservation to the embedding function ϕ, we consider the size of a token to be log(|Σ|).

6 

For more reference material on circuit complexity, we refer the reader to chapters 6 and 14 of Arora and Barak (2009) or chapters 1 and 2 of the Handbook of Theoretical Computer Science, Volume A (van Emde Boas, 1991; Johnson, 1991).

7 

Similarly, for any alphabet Σ and ℒ ⊆ Σ*, we interpret w_i as a one-hot vector over Σ and define the family to recognize ℒ iff, for all w ∈ Σ*, C_{|w|·|Σ|}(w) = 1 ⟺ w ∈ ℒ.

8 

It may also be possible to derive tighter bounds for rational-valued transformers by imposing stronger constraints on the internal functions. However, with floats, we will see that size preservation is sufficient to derive a tighter characterization of transformers’ power. We leave this alternate direction to future work.

9 

The exact multiplicative inverse ⟨p, q⟩ ↦ ⟨q, p⟩ over unconstrained rationals is also size-preserving. Thus, neural networks are size-preserving over both floats and rationals.

10 

We write FTIME(m) instead of the conventional FTIME(n) to avoid confusion with the sequence length n.

References

Afra Alishahi, Yonatan Belinkov, Grzegorz Chrupała, Dieuwke Hupkes, Yuval Pinter, and Hassan Sajjad, editors. 2020. Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP. Association for Computational Linguistics, Online.
Sanjeev Arora and Boaz Barak. 2009. Computational Complexity: A Modern Approach. Cambridge University Press.
Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. ArXiv, abs/1607.06450.
Satwik Bhattamishra, Kabir Ahuja, and Navin Goyal. 2020. On the ability and limitations of transformers to recognize formal languages. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7096–7116, Online. Association for Computational Linguistics.
Liliana Cojocaru. 2016. Advanced Studies on the Complexity of Formal Languages. Ph.D. thesis, University of Tampere.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Peter van Emde Boas. 1991. Machine Models and Simulations, chapter 1. MIT Press, Cambridge, MA, USA.
Merrick Furst, James B. Saxe, and Michael Sipser. 1981. Parity, circuits, and the polynomial-time hierarchy. In Proceedings of the 22nd Annual Symposium on Foundations of Computer Science, SFCS '81, pages 260–270, USA. IEEE Computer Society.
Larry J. Goldstein. 1973. A history of the prime number theorem. The American Mathematical Monthly, 80(6):599–615.
Michael Hahn. 2020. Theoretical limitations of self-attention in neural sequence models. Transactions of the Association for Computational Linguistics, 8:156–171.
Yiding Hao, Dana Angluin, and Robert Frank. 2022. Formal language recognition by hard attention transformers: Perspectives from circuit complexity.
John Hewitt, Michael Hahn, Surya Ganguli, Percy Liang, and Christopher D. Manning. 2020. RNNs can generate bounded hierarchical languages with optimal memory. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1978–2010, Online. Association for Computational Linguistics.
David S. Johnson. 1991. A Catalog of Complexity Classes, chapter 2. MIT Press, Cambridge, MA, USA.
Neeraj Kayal. 2015. Lecture notes for topics in complexity theory.
Meena Mahajan. 2007. Polynomial size log depth circuits: Between NC1 and AC1. Bulletin of the EATCS, 91:42–56.
William Merrill. 2019. Sequential neural networks as automata. In Proceedings of the Workshop on Deep Learning and Formal Languages: Building Bridges, pages 1–13, Florence. Association for Computational Linguistics.
William Merrill. 2021. Formal language theory meets modern NLP. ArXiv, abs/2102.10094.
William Merrill, Vivek Ramanujan, Yoav Goldberg, Roy Schwartz, and Noah A. Smith. 2021. Effects of parameter norm growth during transformer training: Inductive bias from gradient descent. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1766–1781, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Hao Peng, Roy Schwartz, Sam Thomson, and Noah A. Smith. 2018. Rational recurrences. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1203–1214, Brussels, Belgium. Association for Computational Linguistics.
Jorge Pérez, Javier Marinkovic, and Pablo Barceló. 2019. On the Turing completeness of modern neural network architectures. In International Conference on Learning Representations.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
Gail Weiss, Yoav Goldberg, and Eran Yahav. 2018. On the practical computational power of finite precision RNNs for language recognition. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 740–745, Melbourne, Australia. Association for Computational Linguistics.
Gail Weiss, Yoav Goldberg, and Eran Yahav. 2021. Thinking like transformers. ArXiv, abs/2106.06981.
Andrew C.-C. Yao. 1990. On ACC and threshold circuits. In Proceedings of the 31st Annual Symposium on Foundations of Computer Science, pages 619–627, vol. 2.
Shunyu Yao, Binghui Peng, Christos Papadimitriou, and Karthik Narasimhan. 2021. Self-attention networks can process bounded hierarchical languages. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3770–3785, Online. Association for Computational Linguistics.
Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank Reddi, and Sanjiv Kumar. 2020. Are transformers universal approximators of sequence-to-sequence functions? In International Conference on Learning Representations.
