The Parallelism Tradeoff: Limitations of Log-Precision Transformers

Despite their omnipresence in modern NLP, characterizing the computational power of transformer neural nets remains an interesting open question. We prove that transformers whose arithmetic precision is logarithmic in the number of input tokens (and whose feedforward nets are computable using space linear in their input) can be simulated by constant-depth logspace-uniform threshold circuits. This provides insight into the power of transformers using known results in complexity theory. For example, if L ≠ P (i.e., not all poly-time problems can be solved using logarithmic space), then transformers cannot even accurately solve linear equalities or check membership in an arbitrary context-free grammar with empty productions. Our result intuitively emerges from the transformer architecture's high parallelizability. We thus speculatively introduce the idea of a fundamental parallelism tradeoff: any model architecture as parallelizable as the transformer will obey limitations similar to it. Since parallelism is key to training models at massive scale, this suggests a potential inherent weakness of the scaling paradigm.


Introduction
This work aims to characterize the computational model implicit in transformer neural networks (Vaswani et al., 2017), which form the basis of recent breakthroughs in large language models such as BERT (Devlin et al., 2019), T5 (Raffel et al., 2020), and GPT-3 (Brown et al., 2020). What computational primitives can the transformer's components implement, and what problems can the full system solve in aggregate? These questions are important for interpreting transformers in a principled way, understanding potential limitations of their reasoning capabilities, and building trust in deployed transformer-based systems.
Early theoretical work on transformers established their Turing completeness, albeit with assumptions like infinite precision and arbitrarily powerful feedforward subnets (Pérez et al., 2019; Dehghani et al., 2019). On the other hand, a strand of more recent work uses techniques from circuit complexity theory to derive strong limitations on the types of problems transformers can solve given restrictions on the form of attention allowed in the transformer. Specifically, Hahn (2020) and Hao et al. (2022) showed transformers restricted to hard attention are very limited: they can only solve problems in a weak complexity class (non-uniform AC⁰) that doesn't even contain basic problems like majority of n bits. Merrill et al. (2022) extended this to a more general class of "saturated attention" transformers with a floating point datatype, and showed a larger class of problems (non-uniform TC⁰) as an upper bound. This motivates analyzing a setting that strikes a middle ground: Can we characterize transformers whose precision and feedforward nets' computational power are realistically bounded, but where attention is also realistically expressive?
An important practical limitation of these prior results is the "non-uniform" nature of the considered circuit classes, which makes these classes non-realizable and the findings difficult to interpret. This is because non-uniform AC⁰ and TC⁰, while highly limited in computation, also contain some problems that are not even decidable, i.e., for which there doesn't exist any exact algorithm. Thus, non-uniform classes cannot be directly compared with standard algorithmic complexity classes such as P, NP, etc. This motivates our second key question: Can we derive uniform upper bounds on transformers?
We show that one can achieve both of these goals by making the modest assumption that all values in the transformer have O(log n) precision (where n is the number of input tokens), and, similarly, that the transformer's subnetworks are computable in O(log n) space. Log precision is enough to represent the positional encodings at the input layer of the transformer, and to encode pointers to all other positions in the sequence at later transformer layers. Assuming log precision across all layers captures the idea that the hidden representations contain a constant number of hidden states whose precision (16 or 32 bits) is small relative to the length of the input (2048 in GPT-3). On long sequences, the precision will not be enough to losslessly encode the full input sequence into a single vector. Instead, the processing of the sequence must somehow be distributed in each layer and performed in parallel.
Upper Bound on Transformers. Our main contribution is proving that log-precision transformers can be simulated by uniform constant-depth threshold circuits. Thus, such transformers can only solve problems in uniform TC⁰. This characterization is strikingly weak compared to the Turing-completeness of infinite-precision transformers. Since we believe log precision is more realistic for practical transformers than infinite precision, these results point to the conclusion that transformers are not Turing-complete in practice.
In contrast to past results, our upper bound on transformers is a uniform circuit class, enabling direct comparison of log-precision transformers to many natural complexity classes.These connections reveal specific problems that define the upper limits of log-precision transformers' capabilities, as discussed further in §2.
Intuitively, our upper bound says that log-precision transformers are computationally shallow, and that this shallowness can be understood to emerge from their parallelizability. Transformers' inherent parallelism is useful for training them efficiently at massive scale, but may limit the complexity of the computations they can express. We introduce the term parallelism tradeoff to capture this idea, which represents a potential fundamental weakness of the current paradigm of scaling language models. Formally characterizing reasoning capabilities relevant to language models and understanding whether they likely fall outside upper bounds implied by the tradeoff would clarify the practical implications of this limitation of scaling.
It could also be that the limitations of parallelism are not a curse but a blessing, if they constrain the hypothesis space in a way useful for learning.We have no evidence that this is true, but mention it as an alternate interpretation of the results that could be clarified in future work.
Instruction Following and Advice Transformers. We also consider an instruction following setting (Brown et al., 2020) where the transformer is provided the description of a task along with an input on which to execute the instruction. We construct a practically parameterizable transformer that can execute instructions perfectly if they are provided in the form of TC⁰ circuits. This complements recent work that studies transformers' ability to follow other forms of instructions such as regular expressions (Finlayson et al., 2022).
Based on the fundamental property that transformers can correctly evaluate any given TC⁰ circuit on a given input, we introduce the notion of advice transformers akin to advice-taking Turing machines. We show that transformers can recognize any (non-uniform) TC⁰ language if provided appropriate poly-size advice.
In summary, our findings provide new insights on both the abilities and the limitations of transformers, and bring out bounded precision, threshold computations, and parallelism as key notions for understanding the implicit computational model of transformers in practice.
Roadmap. Before diving into technical details, we discuss in §2 the implications of our results on both fundamental as well as practical abilities of transformers. §3 provides a brief primer on circuits as a model of computation. It then discusses a way of serializing a circuit into a string; we later show how to generate such serializations using a resource-bounded algorithm, which is the key to proving containment of transformers in uniform circuit classes. §4 defines our formal model of bounded-precision transformers. §5 derives our first formal bound on log-precision transformers. This bound involves non-uniform circuit families, similar in spirit to prior results in this area. §6 proves our more technical main result: the first uniform circuit complexity upper bound for transformers (specifically, uniform TC⁰). Finally, §7 provides a lower bound on transformers, introduces the notion of an Advice Transformer, and connects these to the machine learning problems of Instruction Learning and Following.
Main Results and Their Implications

Before diving into technical details, we discuss the general implications of our findings on the abilities and limitations of transformers. We will focus here on our main result (Thm. 2), which shows that log-precision transformers are in the complexity class logspace-uniform TC⁰.

The Parallelism Tradeoff. One interpretation of complexity classes such as NC⁰, AC⁰, and TC⁰ is as sets of poly-time solvable problems that are parallelizable to a very high degree: they can be solved in parallel in constant time with enough parallel processors. This gives some intuitive explanation of our result: log-precision transformers end up in TC⁰ because they were designed to be highly parallelizable. Since parallelism is an important property of today's dominant paradigm of training models at massive scale, this points to the conclusion that any massively scaled-up model, transformer or otherwise, will likely obey restrictions similar to the ones derived here for log-precision transformers. There is thus an important tradeoff between the massive parallelizability of today's networks and their representation power.
What Transformers Can/Cannot Compute. Our result places log-precision transformers in the complexity class logspace-uniform TC⁰. This has immediate implications on the kinds of problems such transformers can and cannot accurately solve.
Consider any problem X that is complete for a complexity class C that contains logspace-uniform TC⁰. By definition of completeness, every problem log-precision transformers can solve perfectly is efficiently reducible to X and is thus no harder than X. This implies that, despite their massive size, the computation performed by such transformers is, for instance, no harder than solving basic L-complete problems like graph connectivity: the problem of checking whether there is a path between two nodes in an undirected graph (Lewis and Papadimitriou, 1982; Reingold, 2008).
By the same token, if C is strictly larger than logspace-uniform TC⁰, then such transformers cannot perfectly solve X. Thus, log-precision transformers cannot perfectly solve the following reasoning problems:
• Linear equalities: find x s.t. Ax = b
• Universal context-free recognition
• Propositional satisfiability (SAT)
• Horn-clause satisfiability (HORN-SAT)
• AI planning (Bylander, 1991)
• Permanent computation
Here, linear equalities, universal context-free recognition, and HORN-SAT are P-complete (Greenlaw et al., 1991), so the claim for them assumes logspace-uniform TC⁰ ≠ P; SAT and the permanent are complete for the presumably even larger classes NP and #P, respectively. This highlights the limits of practical transformers with limited-precision arithmetic, indicating that they are far from being universal or all-powerful as suggested by some prior studies.
One important caveat about these negative results is that they are asymptotic in nature: they apply for "large enough" input size n. It's possible for log-precision transformers to solve such problems easily when n is small. Further, these negative results are about exact solutions, but they often also extend beyond this when formal hardness-of-approximation results are known.
Limitations of Our Formal Model. Prior formal characterizations of transformers either make unrealistically strong assumptions (Pérez et al., 2019; Dehghani et al., 2019) or place unrealistic restrictions (Hahn, 2020; Hao et al., 2022; Merrill et al., 2022). In contrast, we make only one assumption: namely, all intermediate values in the transformer are limited to O(log n) bits, where n is the number of input tokens. We next discuss some implications of this assumption and what our findings mean for practical transformers.
As mentioned above, our bounds are asymptotic in nature and thus apply when n is sufficiently large. In practice, transformers use fixed precision at each computation node, which is more restrictive than O(log n) bits of precision that grow with the input sequence length n. However, this constant could be large and thus, for relatively small n, our results do not rule out practical transformers solving difficult problems. Our results, however, do show that as n grows sufficiently large, log-precision transformers are fundamentally limited to problems within TC⁰ and cannot accurately solve various commonly studied problems mentioned earlier under "What Transformers Can/Cannot Compute". Extending our analysis to small n will help close the gap to practice.
Our formal model is based on a binary classification view of transformers. However, our results apply directly to multi-class classification as well and can be extended to generation problems by viewing, for instance, next word prediction in NLP as a multi-class classification problem. That said, if the transformer decoder is allowed to condition on its previous output in a generation problem, then this would violate our formal setup.

Potential Applications
Extracting Circuits from Transformers. Elhage et al. (2021) propose extracting circuits that capture the computational structure of transformers. Our results suggest threshold circuit families are a good formalism for expressing mechanisms extracted from transformers. Constructively converting transformers to threshold circuits is beyond the scope of the current paper, although we hope to explore this in more detail in future work.
Testing Separation Candidates in Complexity Theory. Thm. 2 also motivates a paradigm for quickly testing complexity theory conjectures. If a problem is believed to separate TC⁰ and NC¹, a transformer can be trained on problem instances. If the transformer generalizes perfectly to harder instances than it was trained on, this gives an empirical hint that the problem is in TC⁰, providing evidence against the conjecture.

Circuit Computation
Let {0, 1}* be the set of finite binary strings. For x ∈ {0, 1}*, let |x| be its length. We refer to a function from {0, 1}* to {0, 1}* as a boolean function. Boolean functions can implement arithmetic operations if we define a semantics for binary strings as numbers. We will treat the intermediate values in a transformer as binary strings, and the internal operations as boolean functions.
Circuits are a model of computation for computing boolean functions of fixed-length binary strings. Formally, a circuit is a directed acyclic computation graph. The leaf nodes represent binary variables and their negations. The internal nodes represent functions in some set G, and the directed edges represent the flow of function outputs into inputs of other functions. One or more nodes in the circuit are marked such that their value is the output of the circuit.
Definition 1. For a set of functions G, a G-circuit is a directed acyclic computation graph where the internal nodes have labels from G.
Complexity Measures. The size of a circuit is the total number of gates in it, including negation. The depth of a circuit is the length of the longest path from any input node to any output node.
Circuit Families. A circuit family generalizes a circuit to take variable-length binary strings as input. Formally, a circuit family is a sequence of circuits C_n : {0, 1}^n → {0, 1} for n ∈ N. A circuit family implicitly recognizes a formal language defined as follows:
Definition 2. A circuit family {C_n}_{n∈N} recognizes the language L ⊆ {0, 1}* such that, for all w ∈ {0, 1}*, w ∈ L iff C_{|w|}(w) = 1.
We now define classes of languages by constraining the complexity of the circuit families needed to recognize them:
Definition 3. Let non-uniform AC⁰ be the set of L ⊆ {0, 1}* such that L is recognizable by a poly-size, constant-depth {¬, ∧, ∨}-circuit family.
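To make the circuit model concrete, the following Python sketch evaluates a {¬, ∧, ∨}-circuit presented as a topologically sorted gate list. The list encoding and function name are our own illustrative choices, distinct from the serialization format defined later in this section:

```python
# Evaluate a {NOT, AND, OR}-circuit given as a topologically sorted gate list.
# Each gate is (op, args): op in {"VAR", "NOT", "AND", "OR"}; for "VAR", args
# holds the index of an input bit; otherwise, args are indices of earlier gates.
def eval_circuit(gates, inputs):
    values = []
    for op, args in gates:
        if op == "VAR":
            values.append(inputs[args[0]])
        elif op == "NOT":
            values.append(1 - values[args[0]])
        elif op == "AND":
            values.append(int(all(values[a] for a in args)))
        elif op == "OR":
            values.append(int(any(values[a] for a in args)))
    return values[-1]  # by convention here, the last gate is the output

# x1 OR (NOT x2): two variable gates, a negation, and a disjunction
circuit = [("VAR", [0]), ("VAR", [1]), ("NOT", [1]), ("OR", [0, 2])]
```

Because the gate list is topologically sorted, a single left-to-right pass suffices: every argument index refers to an already-computed value.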
The gates ¬, ∧, and ∨ are all just special cases of thresholds, so we can imagine TC⁰ circuits to have access to these as well. Thus, TC⁰ circuits can implement AC⁰ circuits.
Circuit Serialization. We identify a circuit with its serialization in a formal language that identifies each node's label and adjacency list. We will adopt a specific grammar for concreteness, but our construction can be adapted to other string representations of circuits.
We define a circuit serialization as a traversal of a circuit ordered by some topological sort. In this serialization, leaf nodes (variables) are represented by the string X. An internal node (gate) is represented in Polish notation by the function it computes (AND, OR, or NOT) followed by a list of pointers to its arguments. Each argument &1^j of gate i encodes in unary a zero-indexed pointer to the j-th gate in the circuit, where j < i. The final node is interpreted as the circuit output.
To serialize {∧, ∨}-circuits, we use the following grammar, where the i parameter is passed through Gate[i] nonterminals to track the index of the gate in left-to-right order:
Circuit → Gate[0] Gate[1] ⋯ Gate[g]
Gate[i] → X | Op Arg[i]*
Op → AND | OR | NOT
Arg[i] → &1^j s.t. j < i
In the Arg[i] rule, we enforce that j < i so that arguments must be pointers to already defined gates. As an example of this serialization language, the circuit for x_1 ∨ ¬x_2 ∨ x_3 is represented as
X X X NOT &1 OR & &111 &11
By convention (cf. §3), negations in AC⁰ circuits are usually taken to occur at the beginning of the circuit, rather than after ∧ or ∨ nodes. Our serialization grammar does not enforce this property, but of course any circuit with this property can be serialized by our grammar.
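To make the serialization format concrete, here is a Python sketch that parses and evaluates strings in this language, assuming whitespace-separated tokens; the helper name is our own:

```python
# Parse and evaluate the unary-pointer serialization described above: "X" is a
# leaf (the next input variable, left to right); a gate is an operator
# (NOT/AND/OR) followed by arguments "&1^j", each a unary, zero-indexed
# pointer to an earlier gate j. The final node is the circuit output.
def eval_serialized(serialization, inputs):
    tokens = serialization.split()
    values, next_var = [], 0
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "X":
            values.append(inputs[next_var])
            next_var += 1
            i += 1
        else:
            args, i = [], i + 1
            while i < len(tokens) and tokens[i].startswith("&"):
                args.append(len(tokens[i]) - 1)  # "&1^j" decodes to pointer j
                i += 1
            vals = [values[j] for j in args]
            if tok == "NOT":
                values.append(1 - vals[0])
            elif tok == "AND":
                values.append(int(all(vals)))
            else:  # OR
                values.append(int(any(vals)))
    return values[-1]
```

On the example above, the gate OR & &111 &11 decodes its pointers to gates 0, 3, and 2, i.e., x_1, ¬x_2, and x_3.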
It is a bit more complicated to serialize threshold circuits. Formally, a threshold circuit serialization is generated by the following grammar:
Circuit → Gate[0] Gate[1] ⋯ Gate[g]
Gate[i] → X | Dir 1^k 0^(m−k) Arg[i]^m
Dir → <= | >=
Arg[i] → &1^j s.t. j < i
In the rewrite rule for Gate[i], m ∈ N is the arity of the gate, and k ≤ m is its threshold. The span 1^k after Dir can be interpreted semantically as a unary encoding of the parameter k for a threshold gate, padded by 0's to the number of total arguments of gate i. For simplicity, we imagine ¬ gates are represented as unary θ_{≤0} gates. Thus, the circuit θ_{≥1}(x_1, ¬x_2) would be represented as
X X <= 00 &1 >= 10 & &11
We say a threshold circuit serialization is in prefix form if all inputs (X) come before all threshold gates (<= or >=), as is the case in this example.
Uniformity. The circuit families we have defined above are non-uniform, meaning that we do not enforce that the circuits processing different input sizes must be related in any way. In degenerate cases, non-uniform circuit families can solve undecidable problems because they have infinite description length, making them a physically unrealizable model of computation. Complexity theorists have thus introduced uniform circuit families. Uniform circuit families are a realizable model of computation with relations to classes in computational complexity and formal language theory.
Intuitively, in a uniform circuit family, the circuits for different input sizes must be "somewhat similar" to each other. We formalize this (cf. Arora and Barak, 2009) by saying that there exists a resource-constrained Turing machine that maps the input 1^n to a serialization of circuit C_n.
Definition 5. A language L is (S(n), I(n))-space uniformly computable by a circuit model M iff there exists a Turing machine that, for all n ≥ 0, uses S(n) space to map 1^n to an M-circuit recognizing L on inputs of size I(n).
This notion of uniformity is more general than the standard notion in that the input size I(n) is a function of the problem complexity n.The reason for this is that we will apply uniformity to subcomputations with different input sizes I(n) within a larger computation of input size n.The standard notion of uniformity corresponds to I(n) = n.
Furthermore, we will refer to a circuit family as uniform if it is uniformly computable with S(n) = O(log n) and I(n) = n (cf. Arora and Barak, 2009). We can define uniform versions of AC⁰ and TC⁰ by adopting the previous definitions exactly, but also enforcing uniformity. For the rest of the paper, we will clarify whether we mean the uniform or non-uniform variant of TC⁰ when unclear from context, since both classes will come up.

Bounded-Precision Transformers
A transformer (Vaswani et al., 2017) is a neural network architecture made up of a constant number of transformer layers.A transformer layer is a module that computes self-attention over a sequence followed by an elementwise transformation of the output vectors.

Precision and Space
We will assume that each transformer is resource bounded in terms of the precision of each value it computes and, for some of our results, the space it uses for the computation of key operations such as embedding, attention, and activation. Specifically, we will assume precision p, i.e., the values at all layers, as well as the outputs of all key intermediate operations in it (attention, activation, arithmetic operators, etc.), are represented using p bits. This is a realistic assumption as, in practice, today's transformers are typically limited to the 64-bit precision of the underlying hardware. Formally, we define p-precision as follows:
Definition 6. A k-ary function f : {0, 1}* × ⋯ × {0, 1}* → {0, 1}* is p-precision if its inputs and output have size at most p bits, and f can be computed by a p-space-bounded Turing machine.
This says the sizes of the function's input and output are bounded by p. Similarly, the intermediate space used by the computation must also be bounded by p. Thus, higher-precision computations cannot somehow be hidden inside f.
Def. 6 naturally applies to functions with bounded arity k. We will also need to define p-precision for the summation operator in the transformer, which adds n different floats of size p. Adding n floats can blow up the precision needed to represent their sum. For example, imagine adding the floating points 1·2⁰ + 1·2^c. We obtain (2^c + 1)·2⁰, whose mantissa takes c + 1 bits to represent. In practice, computers do not preserve full precision in such situations: instead, small terms like 1·2⁰ are discarded. Thus, we define the transformer's addition operation ⊕ to be similarly approximate (and thus preserve precision); see §A.
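The following Python sketch illustrates this style of precision-preserving approximate addition on floats given as (mantissa, exponent) pairs. The exact rounding rule of ⊕ is specified in §A, so this is an illustration of the idea rather than that definition; it also assumes nonnegative mantissas:

```python
def approx_add(floats, p):
    """Approximately add floats given as (m, e) pairs meaning m * 2**e, with
    0 <= m < 2**p. Low-order bits of the exact sum are discarded until the
    result's mantissa fits back into p bits."""
    e_min = min(e for _, e in floats)
    total = sum(m << (e - e_min) for m, e in floats)  # exact integer sum
    e = e_min
    while total >= 2 ** p:  # renormalize: shift out low-order bits so the
        total >>= 1         # mantissa is at most p bits wide
        e += 1
    return total, e
```

With p = 4, adding 1·2⁰ and 1·2⁶ yields (8, 3), i.e., exactly 2⁶: the small term is discarded, just as in the motivating example above.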

Attention Heads
The core building block of a transformer is an attention head. We define this at a high level of abstraction as follows:
Definition 7. A p-precision attention head is specified by a binary p-precision similarity function s : {0, 1}^p × {0, 1}^p → {0, 1}^p.
Let h_1^ℓ, ..., h_n^ℓ ∈ {0, 1}^p be the input sequence to a p-precision attention head, and let ⊕ be approximate floating-point addition (§A). (Our proof also goes through if the transformer weights are integers, as is sometimes done; Dettmers et al., 2022.)
Definition 8. For all ℓ ≥ 0, a p-precision attention head H_h^{ℓ+1} computes a vector a_{ih}^{ℓ+1} ∈ {0, 1}^p via
a_{ih}^{ℓ+1} = ⊕_{j=1}^n (s(h_i^ℓ, h_j^ℓ) / Z_i) · h_j^ℓ, where Z_i = Σ_{j=1}^n s(h_i^ℓ, h_j^ℓ).
Standard transformer attention heads (Vaswani et al., 2017) are a special case of this definition where s is scaled dot-product similarity between keys and queries. Standard transformers also have a linear or affine value function applied to each h_j^ℓ in the sum over j. By its affineness, the value function can, without loss of generality, be removed from the attention head and considered to be part of the transformer layer (i.e., applied to the output of the attention head).
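The head's computation in Def. 8 can be sketched with exact arithmetic as follows; the definition instead uses p-bit values and the approximate sum ⊕, and leaves the similarity function s abstract, so this is an illustration only:

```python
def attention_head(H, s):
    """Compute a_i = sum_j (s(h_i, h_j) / Z_i) * h_j, with
    Z_i = sum_j s(h_i, h_j). H is a list of vectors (tuples of floats);
    s is a scalar similarity function on pairs of vectors."""
    n, d = len(H), len(H[0])
    out = []
    for i in range(n):
        scores = [s(H[i], H[j]) for j in range(n)]
        Z = sum(scores)  # normalizer Z_i
        out.append(tuple(sum(scores[j] / Z * H[j][k] for j in range(n))
                         for k in range(d)))
    return out
```

With a constant similarity function, every output position is the mean of the input vectors, matching the intuition that attention computes a weighted average.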

Transformer Layers
A p-precision transformer layer is then a tuple of heads and a function f used to combine them.

Definition 9 (p-precision transformer layer). A p-precision transformer layer is a tuple L^{ℓ+1} = ⟨H_1, ..., H_k, f⟩, where each H_h is a p-precision attention head and f : ({0, 1}^p)^{k+1} → {0, 1}^p is a p-precision function.
A p-precision transformer layer can be understood to define a sequence of vectors h_1^{ℓ+1}, ..., h_n^{ℓ+1} in terms of an input sequence of vectors h_1^ℓ, ..., h_n^ℓ (coming from the previous layer in the transformer) by first computing k attention heads in parallel and then combining their output using f. The first k inputs to f will correspond to the attention head outputs, and the additional input is the original input from the previous layer. Recall that a_{ih}^{ℓ+1} is the output of head H_h^{ℓ+1} on input h^ℓ at position i. The function computed by a transformer layer can be described formally as follows.
Definition 10 (Transformer layer computation). For ℓ ≥ 0, a p-precision transformer layer L^{ℓ+1} recurrently computes the output sequence h_1^{ℓ+1}, ..., h_n^{ℓ+1} as a function of the inputs h_1^ℓ, ..., h_n^ℓ, where, for 1 ≤ i ≤ n, the i-th component is computed according to
h_i^{ℓ+1} = f(a_{i1}^{ℓ+1}, ..., a_{ik}^{ℓ+1}, h_i^ℓ).
f can be understood to encapsulate layernorm, residual connections, and the feedforward sublayer of a standard transformer (Vaswani et al., 2017). h_i^ℓ is given to f to allow residual connections. As mentioned in §4.3, f can also encapsulate the value function for each head.

Transformer Encoder
Finally, we define a transformer of depth d as a cascade of d transformer layers:
Definition 11 (p-precision transformer). A p-precision transformer over alphabet Σ is a pair consisting of a p-precision position embedding function φ : Σ × N → {0, 1}^p and a d-tuple of p-precision transformer layers ⟨L^1, ..., L^d⟩.
For a position embedding function φ and w ∈ Σ^n, let φ(w) be the position-wise broadcasted embedding of w: for 1 ≤ i ≤ n, φ_i(w) := φ(w_i, i).
Definition 12 (Transformer computation). A transformer ⟨φ, L^1, ..., L^d⟩ computes the following function of a string w ∈ Σ*:
T(w) = (L^d ∘ L^{d−1} ∘ ⋯ ∘ L^1)(φ(w)).
We will use n to denote the length of w, and take the transformer's depth d to be fixed w.r.t. n.
The input to the transformer can thus be represented with N = n log|Σ| bits using a binary encoding for the vocabulary. The circuits we construct in subsequent sections to simulate transformers will also have input size N. We will assume transformers have log precision relative to the size of the input, specifically O(log N) precision. Since |Σ| is fixed (typically 30000 in practice), we will think in terms of O(log n) precision. Thus, by Def. 6, all of the intermediate functions of such transformers are computable in O(log n) space and output at most this many bits. Note that this is enough precision to represent positional encodings and for each position to point to a constant number of other values, but not enough precision for non-lossy pooling of the entire input into a single value.
Relationship to Practical Transformers. Our log-precision transformers do not enforce that s (Def. 7) and f (Def. 9) follow the transformer structure. However, a feedforward net whose primitive operations (e.g., scalar multiplication) are defined over O(log n)-size numbers can be computed in O(log n) space. Thus, bounded-precision practical transformers are a special case of our log-precision transformers. This makes our setup appropriate for proving upper bounds on transformers, which is our main contribution.

Log-Precision Transformers as Non-Uniform Threshold Circuits

We first show that log-precision transformers can be simulated by non-uniform threshold circuits, before presenting the more technical uniform version of the results in §6. The initial non-uniform result extends the findings of Merrill et al. (2022), who showed that saturated attention transformers can be simulated in TC⁰. Here, we remove the simplifying saturated attention assumption and other restrictions on the underlying datatype. Instead, we show that our log-precision assumption is enough to prove that a transformer can be simulated in TC⁰ with any attention function. Hao et al. (2022) observed that any boolean function of O(log n) bits can be computed by a poly(n)-size circuit. We extend this to m-bit outputs, which is both more convenient and more efficient than constructing m separate boolean circuits:
Lemma 1 (Extended from Hao et al., 2022). Let f : {0, 1}* → {0, 1}^m be a function. For all c ∈ R^+ and n ∈ N, there exists an AND/OR circuit of size at most n^c + c log n + m and depth 3 that computes f on inputs of size c log n.
Proof. Like Hao et al. (2022), we construct a circuit using a DNF representation of f on inputs of size c log n, except we use a combined DNF representation for all output bits of f. The DNF formula has at most 2^{c log n} = n^c terms. The circuit has a NOT gate for each input bit, an AND gate for each DNF term, and, for each of the m output bits, an OR gate combining the outputs of those AND gates (i.e., DNF terms) for which that bit is 1.
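This shared-DNF construction can be sketched directly; the sketch is exponential in the input size k, which is acceptable here precisely because k = c log n, so 2^k = n^c (the function name and table encoding are our own):

```python
from itertools import product

def dnf_circuit(f, k, m):
    """Build the shared-DNF table of Lem. 1 for f: {0,1}^k -> {0,1}^m: one AND
    term (minterm) per input assignment, and for each of the m output bits an
    OR over exactly those terms on which that bit of f is 1."""
    terms = list(product((0, 1), repeat=k))        # the 2^k AND gates
    or_args = [{t for t in terms if f(t)[b] == 1}  # one OR gate per output bit
               for b in range(m)]

    def circuit(x):
        # Exactly one minterm fires on input x, so each OR reduces to a lookup.
        return tuple(int(tuple(x) in or_args[b]) for b in range(m))

    return circuit
```

Note that the AND layer is shared across all m output bits, which is the efficiency gain over building m separate single-output DNF circuits.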
We now use Lem. 1 to prove the following non-uniform result. We note that the proof goes through even if the notion of p-precision (Def. 6) is relaxed to not require computability in space p. This requirement will, however, become important for our subsequent result in §6.
Theorem 1 (Non-uniform). Any c log n-precision depth-d transformer operating on inputs in Σ^n can be simulated by a threshold circuit family of depth 3 + (9 + 2d_⊕)d.
Proof. Let w ∈ Σ^n be the input of a c log n-precision transformer. We show by induction that we can construct a composition of constant-depth, poly-size threshold circuits to compute each layer of this transformer. Thus, any constant-depth transformer will be computable by a constant-depth threshold circuit.
In the base case of layer 0 and token i, we construct gates representing the constant i encoded in binary. We can then compute h_i^0 = φ(w_i, i) using Lem. 1, yielding a poly-size depth-3 circuit.
In the inductive case of computing layer h_i^{ℓ+1} for 1 ≤ ℓ + 1 ≤ d, we note that each vector h_i^ℓ output by layer ℓ has size (at most) c log n bits because of the log-precision assumption.
We first fix a head a_{ih}^{ℓ+1} (Def. 8) to simulate. Applying Lem. 1, we can compute s(h_i^ℓ, h_j^ℓ) with a poly-size depth-3 circuit, in parallel for all j.
Since n floats with c log n precision can be approximately added in TC⁰ (§A), we can construct a TC⁰ circuit of depth d_⊕ to compute Z_i. Since s(h_i^ℓ, h_j^ℓ), Z_i, and h_j^ℓ all have c log n bits, we can compute s(h_i^ℓ, h_j^ℓ)/Z_i · h_j^ℓ with a poly-size depth-3 circuit; we do this in parallel for all j. Next, we again use the fact that approximate addition of n floats is in TC⁰ to compute a_{ih}^{ℓ+1} as the approximate sum over j with a depth-d_⊕ circuit.
We now simulate layer ℓ + 1 (Def. 10) in terms of its constituent heads. Since all arguments of f have size c log n, we apply Lem. 1 to compute f with a poly-size depth-3 circuit, yielding h_i^{ℓ+1}. We repeat this in parallel for all i. This completes the inductive step: we compute all values in the (ℓ + 1)-st layer with a circuit depth of 9 + 2d_⊕. Aggregating the circuit over all d layers, the overall circuit depth is 3 + (9 + 2d_⊕)d.
Corollary 1.1 (Non-uniform). Any log-precision transformer can be simulated by a non-uniform TC⁰ circuit family.

Log-Precision Transformers as Uniform Threshold Circuits

We will now extend the argument from the last section to show that O(log n)-precision transformers can be simulated by uniform constant-depth threshold circuits by capitalizing on the assumption that φ, s, and f are log-precision, and thus can be computed in O(log n) space. The overall proof idea is similar, but due to the uniformity condition, the proof becomes substantially more technical. We must not just show the existence of a threshold circuit family computing a transformer, but also show that this circuit family can be generated by a log-space Turing machine. We first extend Lem. 1 to respect uniformity:
Lemma 2. Let f : {0, 1}* → {0, 1}^m be a linear-space computable function. There exists a Turing machine that, for all n ∈ N and c ∈ R^+, uses at most c log n + log m space to map input 1^n to a circuit of size at most n^c + c log n + m and depth 3 that computes f on inputs of size at most c log n.
Proof.We give the proof in the form of an algorithm to construct a circuit as a function of n and then justify its correctness and space complexity.
Algorithm. We first print 2c log n nodes representing unnegated and negated input nodes. Now, we need to show how to construct nodes corresponding to the n^c DNF terms. To this end, we loop over all possible inputs x ∈ {0, 1}^{c log n} by maintaining the c log n bit binary representation of x (initialized with 0^{c log n}) and incrementing it by 1 at each step of the loop. We create a new ∧ node i with c log n arguments, defined as follows. For j ∈ [c log n], we create an argument pointer to (unnegated) node j if x_j = 1 and to (negated) node c log n + j otherwise. Now, we construct nodes computing each of the m output nodes. We loop over k ∈ [m], constructing a single node for each k. We loop over all x ∈ {0, 1}^{c log n} analogously to above to construct a list of arguments. By our linear-space computability assumption and because x has c log n bits, we can compute f(x) as a subroutine in O(log n) space to obtain f_k(x). If f_k(x) = 1, we print the node for the ∧ term corresponding to x as an argument of node k.
Correctness. We show that this Turing machine maps input 1^n to a serialized circuit computing f on inputs of size c log n. The first layer simply produces unnegated and negated input values. The second layer then produces all possible DNF terms. Finally, node k of the third layer computes the disjunction over all terms x such that f_k(x) = 1. Thus, node k of the third layer computes f_k.
Log Space. To complete the proof, we justify that M uses O(log n + log m) space. Looping over x ∈ {0, 1}^{c log n} is accomplished by treating x as a binary number initialized to 0 and incrementing it at each step. Thus, the loop pointer for building the DNF terms takes c log n space to store. For building the m output nodes, we maintain a similar loop pointer as well as an index k ≤ m, taking c log n + log m space. Thus, the overall algorithm uses c log n + log m space.
Thus, M uses c log n + log m space to map 1^n to a circuit of size at most n^c + c log n + m and depth 3 that computes f on inputs of size c log n.
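For concreteness, the construction in this proof can be sketched in Python. Note the differences from the lemma: this sketch materializes the whole circuit as a list, whereas the Turing machine only streams it in O(log n) space, and the node encoding and function names below are our own illustrative choices.

```python
from itertools import product

def build_dnf_circuit(f, n_in, m_out):
    """Depth-3 circuit of Lem. 2: unnegated/negated literals, one AND
    (DNF) term per possible input x, and one OR node per output bit k
    collecting the terms x with f_k(x) = 1. Nodes appear in print order,
    so every node references only earlier nodes."""
    nodes = []
    for j in range(n_in):
        nodes.append(("in", j))        # unnegated input j
    for j in range(n_in):
        nodes.append(("not", j))       # negated input j
    inputs = list(product((0, 1), repeat=n_in))
    for x in inputs:                   # one AND term per assignment x
        nodes.append(("and", [j if x[j] else n_in + j for j in range(n_in)]))
    for k in range(m_out):             # OR over terms where f_k(x) = 1
        nodes.append(("or", [2 * n_in + t
                             for t, x in enumerate(inputs)
                             if f(list(x))[k]]))
    return nodes

def eval_circuit(nodes, m_out, x):
    """Evaluate the serialized circuit; the last m_out nodes are outputs."""
    vals = []
    for kind, arg in nodes:
        if kind == "in":
            vals.append(x[arg])
        elif kind == "not":
            vals.append(1 - x[arg])
        elif kind == "and":
            vals.append(int(all(vals[a] for a in arg)))
        else:  # "or"
            vals.append(int(any(vals[a] for a in arg)))
    return vals[-m_out:]
```

For f(x) = (x_0 XOR x_1, x_0 AND x_1) with n_in = 2, the circuit has 2·2 literal nodes, 2^2 AND terms, and 2 OR outputs, matching the size shape of the lemma.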
We can leverage this lemma to derive the uniform analog of Thm. 1, as follows.
Theorem 2 (Uniform, main result). Any c log n-precision depth-d transformer operating on inputs in Σ^n can be simulated by a logspace-uniform threshold circuit family of depth 3 + (9 + 2d_⊕)d.
Proof. We will provide a proof by induction over transformer layers that there is a Turing machine M operating in O(log n) space that, on input 1^n, outputs a circuit that simulates the transformer's computation on inputs of size n. This circuit is identical to the one in the proof of Thm. 1, and thus has the same circuit depth.
In the base case, we use log space to track a counter maintaining the current token i (between 1 and n) throughout the circuit construction. We construct gates encoding the constant i in binary. We can then apply Lem. 2 to construct a Turing machine that maps 1^n to a constant-depth threshold circuit computing h_i^0 = φ(w_i, i). In the inductive case, we assume we can output in O(log n) space a circuit computing every value h_i in the previous layer ℓ. We will show that we can, in O(log n) space, now output a circuit computing every value in layer ℓ + 1.
As in Thm. 1, we first fix a head a_{ih}^{ℓ+1} to simulate. Recall (Def. 8) that

a_{ih}^{ℓ+1} = Σ_{j=1}^n (s(h_i, h_j) / Z_i) · h_j, where Z_i = Σ_{j=1}^n s(h_i, h_j).

By Lem. 2, we can generate a depth-3 circuit of size at most z = n^{c′} + c′ log n + 1, where c′ = 2c (since the input to the function being simulated has size 2c log n), that computes s(h_i, h_j) for specific i, j. We do this sequentially for 1 ≤ j ≤ n and 1 ≤ h ≤ k, padding each circuit with unused nodes so that each one has size exactly z, with the z-th node corresponding to the output. Thus, the indices of the output nodes for these circuits will be w + z(jk + h) for 1 ≤ j ≤ n, where w is the index of the last output node h_n of the previous layer.
At this point, we use the fact that for p = c log n, the p-precision approximate sum of n p-precision numbers can be computed by a uniform threshold circuit (§A). We can thus use a Turing machine as a subroutine to generate, on input 1^n, k threshold circuits, each of size z′, computing an ⊕ gate over n items of precision p each. We set the inputs of circuit h to be the nodes w + z(jk + h) for 1 ≤ j ≤ n. By construction, this yields the normalizing constants Z_i = Σ_{j=1}^n s(h_i, h_j), whose value for head h is located at the node at index w + znk + z′h.
Using p-precision arithmetic operator circuits, we can now also generate a circuit to compute (s(h_i, h_j)/Z_i) · h_j for each 1 ≤ j ≤ n and 1 ≤ h ≤ k, by using index w + z(jk + h) as before for the value of s(h_i, h_j) and index w + znk + z′h for the normalizing constant Z_i of head h. Here too we use circuits of identical size z′′, making w + k(zn + z′ + z′′i) the index of the output nodes of these n circuits. Next, we again employ a ⊕ circuit of size z′, similar to the computation of Z_i, to compute the sum of these n values. Finally, we compute h_i^{ℓ+1} by applying f via Lem. 2.
Note that this requires keeping only ℓ, i, and n in memory, each of which takes O(log n) bits.
We repeat this process for all 1 ≤ i ≤ n to compute the entire layer ℓ + 1, which finishes the inductive step: if we can output a circuit computing layer ℓ in O(log n) space, then we can do the same for layer ℓ + 1.
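As a sanity check on what the constructed circuit computes, here is a plain-Python reference for one layer with scalar activations. This is our simplification: the real construction works over p-precision values with approximate ⊕ circuits, while this sketch uses floats.

```python
def simulate_layer(h_prev, s, f, k):
    """Reference semantics of the inductive step: at each position i,
    each of k heads computes sum_j (s(h_i, h_j) / Z_i) * h_j with
    Z_i = sum_j s(h_i, h_j), and the pooling function f combines the
    head outputs into the layer-(l+1) value at position i."""
    n = len(h_prev)
    out = []
    for i in range(n):
        heads = []
        for h in range(k):
            scores = [s(h_prev[i], h_prev[j], h) for j in range(n)]
            Z = sum(scores)  # normalizing constant Z_i for head h
            heads.append(sum(sc / Z * hj for sc, hj in zip(scores, h_prev)))
        out.append(f(h_prev[i], heads))
    return out
```

With a constant score function, each head reduces to uniform attention, i.e., the mean of the previous layer's values.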
Because the depth derived in Thm. 2 is constant with respect to n, it follows that:

Corollary 2.1 (Uniform, main result). Any log-precision transformer can be simulated by a uniform TC^0 circuit family.

Lower Bounds for Instruction Following and Advice Transformers

So far, we have shown that uniform TC^0 is an upper bound for log-precision transformers. Is this upper bound tight, i.e., also a lower bound? While we do not answer this question here, we address a related question as a first step: we construct a transformer that can evaluate TC^0 circuits on binary inputs, showing that transformers can compute any TC^0 function when their input is augmented with the right "instructions".
More formally, we consider the Circuit Value Problem (CVP) (Ladner, 1975), also referred to as the Circuit Evaluation Problem, where the input is a boolean circuit C and a string x ∈ {0, 1}^n, and the task is to return the value of C(x) ∈ {0, 1}. This problem is known to be complete for the class P under AC^0 reductions (Ladner, 1975). We will assume C is serialized as described in §3 and prove that log-precision transformers can evaluate any TC^0 circuit. Note that this is an extension of the typical CVP, since the circuit has threshold gates, not just standard AND/OR gates.
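To make the task concrete, here is a minimal evaluator for such threshold circuits. It operates on a structured node list rather than the token-level serialization of §3 (which is not reproduced in this section); the encoding below is our own illustrative choice.

```python
def eval_threshold_circuit(nodes, x):
    """Evaluate a threshold circuit on input bits x.

    `nodes` is topologically ordered: ("X", i) is an input node reading
    x[i]; (dir, k, args) is a threshold gate that outputs 1 iff the
    number of 1-valued arguments is >= k (dir == ">=") or <= k
    (dir == "<="). The last node is taken as the output C(x)."""
    vals = []
    for node in nodes:
        if node[0] == "X":
            vals.append(x[node[1]])
        else:
            d, k, args = node
            c = sum(vals[a] for a in args)  # count of true arguments
            vals.append(int(c >= k if d == ">=" else c <= k))
    return vals[-1]
```

For example, MAJORITY of three bits is a single >= gate with threshold 2 over the three input nodes, a function famously outside AC^0 but inside TC^0.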
It is known that LSTMs cannot evaluate boolean formulae (Merrill, 2020), a special case of the CVP. In contrast, we show that transformers can.
To demonstrate the practicality of our lower bound construction, we will not just prove the existence of transformers that can evaluate TC^0 circuits but also specify concrete choices for the positional embedding scheme and the class of attention functions that are sufficient to do so.
Fractional Positional Embeddings. For a vector x and scalar y, let ⟨x, y⟩ be the vector appending y onto x.^16 For σ ∈ Σ, let v(σ) be the one-hot embedding of σ into R^{|Σ|}. For w ∈ Σ* and i ≥ 1, the fractional positional embedding at token i is

φ(w_i, i) = ⟨v(w_i), i/n⟩.

Saturated Attention. We imagine s(h_i, h_j) is computed via saturated attention (cf. Merrill et al., 2022), which provides a simple model of the types of attention we can expect to be learned in transformers (Merrill et al., 2021). First, queries are computed as q_i = Q h_i, and then keys k_j = K h_j. Define the dot-product attention score σ_ij = q_i · k_j. We can then define saturated attention as

s(h_i, h_j) = 1 if σ_ij = max_{j′} σ_{ij′}, and 0 otherwise.

^16 I.e., ⟨x, y⟩_i = x_i for 1 ≤ i ≤ |x|, and y if i = |x| + 1.
After normalization, saturated attention creates a distribution that is uniform over a subset of positions. Thus, it is capable of parameterizing hard attention, uniform attention over the full sequence, and various attention patterns in between.
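A small NumPy sketch of saturated attention; the parameterization (activations stacked as rows, Q and K as plain matrices) is our own.

```python
import numpy as np

def saturated_attention(h, Q, K):
    """Saturated attention (cf. Merrill et al., 2022): each query puts
    uniform weight on the positions achieving the maximum score
    sigma_ij = q_i . k_j, and zero weight everywhere else."""
    q = h @ Q.T                       # queries q_i = Q h_i
    k = h @ K.T                       # keys    k_j = K h_j
    scores = q @ k.T                  # sigma_ij
    mask = scores == scores.max(axis=1, keepdims=True)
    return mask / mask.sum(axis=1, keepdims=True)  # rows are distributions
```

When one position uniquely maximizes the score, this is hard attention; when all scores tie, it is uniform attention over the full sequence.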
Simple Pooling Functions. For simplicity, we assume pooling functions f are thresholded linear functions of their inputs. Thus, they could be implemented by a feedforward neural net. Without loss of generality, we let attention heads have a value function, since it can be folded into the pooling function from the last layer (see §4).
Terminology. We use input node to mean a token of type X and gate node to mean a token of type Dir. We call a token of type & an argument.
We are now ready to present the main result. Our construction below is specific to circuits serialized in prefix form (see §3), but it can be extended to other serializations as well.
Lemma 3. For all d, there exists a transformer with fractional positional embeddings, saturated attention, thresholded linear pooling functions, and depth 2d that, for any threshold circuit C of depth d serialized in prefix form, maps input ⟨C, x⟩ to the value C(x).
Proof. We will construct a pair of transformer layers that evaluates all the nodes at depth ℓ in the threshold circuit, for any ℓ. It follows that a transformer of depth 2d can compute the value C(x).
Base Case: Input Nodes. We use an attention layer to attend uniformly over all positions, with a value function that returns 1 if w_i = X and 0 otherwise. This head computes #(X)/n, where #(X) is the number of occurrences of X in w. A second layer then, at input node i, computes the positional embedding of the token representing input value x_i. We attend to this position to retrieve x_i. After these layers, each input node i stores its value x_i. We also use the base-case layers to construct an attention head that, at the i-th node, counts the fraction of tokens (out of n) that are nodes to the left of the current node. Thus, the column corresponding to node i stores the value i/n.
At each gate node i, we use two more attention heads to find the index of the next & to its right and then count the fraction of tokens before it that are 1. This head thus computes k_i/m_i, where k_i is the threshold value of gate i and m_i is its arity.
Finally, using the first attention layer, we have each 1 token attend to the first argument symbol & to its left and retrieve its index p/n. Then, in the second attention layer, each argument attends uniformly over all nodes with value p/n. The net effect is for each argument to store j/n, i.e., the pointer it encodes in unary as &1^j.
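The quantities the base-case heads leave in each column can be mimicked with ordinary loops. This is an illustrative stand-in for the attention computation, under our reading of the &1^j pointer encoding, not a transformer.

```python
def base_case_quantities(tokens):
    """Compute, per column, stand-ins for the base-case head outputs:
    the fraction #(X)/n of input-node tokens, and for each argument
    token & beginning a block &1^j, the encoded pointer j/n."""
    n = len(tokens)
    frac_X = tokens.count("X") / n
    pointers = {}
    i = 0
    while i < n:
        if tokens[i] == "&":
            j = 0
            while i + 1 + j < n and tokens[i + 1 + j] == "1":
                j += 1                 # length of the unary run after &
            pointers[i] = j / n        # pointer j, stored as a fraction of n
            i += 1 + j
        else:
            i += 1
    return frac_X, pointers
```

Storing pointers as fractions of n is what lets a later head attend back to position j using a query of j/n against the fractional positional embeddings.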
Inductive Case: Gate Nodes. By our inductive assumption over prior layers, all tokens corresponding to circuit nodes at depth ≤ ℓ contain their appropriate value. We now construct two transformer layers to evaluate gate nodes at depth ℓ + 1.
In the first attention layer, each argument token attends to the closest gate node i to its left, which is the gate it belongs to. Recall from the base case that each argument token & already stores j/n, where j is the pointer value it encodes. Each argument token now attends with query j/n to retrieve from node j its already-computed value.
The second attention layer applies at gate nodes, not arguments. At gate i of arity m_i, we set the attention s(i, j) to indicate whether argument j belongs to gate node i, which holds for exactly m_i arguments. We set the attention value at argument j to be the binary value of node j, which was retrieved in the previous paragraph. Thus, the attention head computes c_i/m_i, where c_i is the number of arguments of node i that are 1. We repeat this for all gate nodes.
At this point, we have both the count of true inputs to gate node i (c_i/m_i) and, from the base case, the threshold parameter of gate i (k_i/m_i). Thresholding (c_i − k_i)/m_i at 0 allows us to decide, based on whether Dir is <= or >=, whether the current gate node should output a 0 or a 1. Repeating this for all gates at depth ℓ + 1 completes the inductive step: we can evaluate all gate nodes at this depth.
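The final decision at a gate reduces to a single comparison. A sketch of that step, with the counts passed explicitly and the fractions formed inside (mirroring the values stored in the column):

```python
def gate_value(direction, c_i, k_i, m_i):
    """Decide a gate's output from c_i/m_i (fraction of true arguments)
    and k_i/m_i (threshold fraction) by thresholding (c_i - k_i)/m_i at 0,
    with the comparison direction given by the gate's Dir token."""
    diff = c_i / m_i - k_i / m_i
    return int(diff >= 0) if direction == ">=" else int(diff <= 0)
```

Because both quantities are stored as fractions of the same arity m_i, the sign of their difference is exactly the sign of c_i − k_i, so no exact counts are needed.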

Instruction Following
CVP is closely related to instruction learning (Brown et al., 2020) and instruction following tasks (Finlayson et al., 2022). The latter task setup provides a transformer two inputs: a regular expression r as an "instruction", and z ∈ {0, 1}*. The goal of the task is to return whether z belongs to the regular language represented by r. Viewed from this lens, the circuit evaluation setup asks: can transformers follow instructions provided in the form of a circuit? As discussed below, our result says the answer is yes for all constant-depth threshold circuits. This, to the best of our knowledge, provides the first non-trivial lower bound for transformers in the instruction learning setting.
Formally, an instruction I is any description of a function f_I over {0, 1}*. We say a transformer correctly follows an instruction I if, for all x ∈ {0, 1}*, it correctly computes f_I(x) on input ⟨I, x⟩. A non-uniform instruction description is a family of length-specific descriptions {I_n}_{n=1}^∞. We say a transformer correctly follows a non-uniform instruction family {I_n} if, for all n and all x ∈ {0, 1}^n, it correctly computes f_{I_n}(x) on input ⟨I_n, x⟩. The non-uniform description {I_n} may take any form. When it forms a TC^0 circuit family, we refer to it as a TC^0 instruction description. Since Lem. 3 constructs a transformer that can evaluate any TC^0 circuit, it follows that:

Corollary 3.1. There exists a depth-2d transformer that can correctly follow any depth-d TC^0 instruction description.
Thus, transformers with simple position embeddings, attention, and pooling functions can simulate any instruction provided in the form of a TC^0 circuit. We note that while it is unknown whether the class of regular languages, considered by Finlayson et al. (2022), is contained in TC^0, the other direction is known: there are problems computable by TC^0 circuits that are not regular. These include problems involving counting and arithmetic, which are beyond regular languages. Our results thus expand the known kinds of instructions transformers are able to follow, at least with hand-constructed weights.

Advice Transformers
We can also view the circuit evaluation abilities of transformers (Lem. 3) from the lens of advice-taking Turing machines, which, in addition to their usual input, are also provided an input-length-dependent (but input-independent) advice string. For instance, P/poly is the class of problems decidable in polynomial time when the Turing machine is given an advice string of size polynomial in the input length (cf. Arora and Barak, 2009).
In the same vein, let T/poly be the class of log-precision, constant-depth transformers with polynomial advice strings. In other words, on an input of size n, we allow the transformer to receive an additional poly(n) bits of input that cannot depend on the standard input. Now let {C_n}_{n=1}^∞ be a circuit family demonstrating that a problem is in non-uniform TC^0. Then, by passing the description of C_n as advice for input length n, it immediately follows from Lem. 3 that advice transformers can simulate non-uniform TC^0:

Corollary 3.2. Non-uniform TC^0 ⊆ T/poly.
Since non-uniform TC^0 even contains some undecidable languages (Arora and Barak, 2009, Claim 6.8), T/poly is clearly a very powerful class and a strict superset of T, the class of decision problems recognized by transformers (which are all decidable). Thus, a problem in T/poly cannot always be solved by a transformer on its own. However, if given a description of how to solve it ("advice") in the form of a TC^0 circuit, our result shows that a transformer could solve that problem.
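Cor. 3.2 can be illustrated directly: the advice for length n is a description of the circuit C_n, and the "transformer" side of the simulation is just circuit evaluation. A toy sketch with a hypothetical gate-list encoding of our own (("X", i) reads input bit i; (dir, k, args) is a threshold gate over earlier nodes; the last node is the output):

```python
def run_with_advice(advice, x):
    """T/poly-style evaluation: `advice` maps each input length n to a
    description of circuit C_n; the model evaluates C_{len(x)} on x.
    The advice depends only on len(x), never on x itself."""
    nodes = advice[len(x)]
    vals = []
    for node in nodes:
        if node[0] == "X":
            vals.append(x[node[1]])
        else:
            d, k, args = node
            c = sum(vals[a] for a in args)  # count of true arguments
            vals.append(int(c >= k if d == ">=" else c <= k))
    return vals[-1]
```

Here the per-length circuits may be completely unrelated to one another, which is exactly the non-uniformity that makes T/poly so much stronger than transformers without advice.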

Conclusion
Answering two open questions from Merrill et al. (2022), we prove that log-precision transformers with any (including soft) attention can be simulated by uniform constant-depth threshold circuits. This establishes thresholded addition as a fundamental operation for understanding the computational model of transformers: any log-precision transformer can be re-expressed as a polynomial number of threshold gates stacked to a constant depth. This result also establishes potential limits on the computational power of log-precision transformers; e.g., if L ≠ P, transformers cannot compute all poly-time functions. They are certainly very far from being universal. The intuition at the heart of this result is that forcing a model to be highly parallelizable likely sacrifices its expressiveness. Since parallelism seems essential to pretraining any massive model at scale, any large language model, transformer or otherwise, may suffer from a similar tradeoff.