Abstract
As transformers have gained prominence in natural language processing, some researchers have investigated theoretically what problems they can and cannot solve, by treating problems as formal languages. Exploring such questions can help clarify the power of transformers relative to other models of computation, their fundamental capabilities and limits, and the impact of architectural choices. Work in this subarea has made considerable progress in recent years. Here, we undertake a comprehensive survey of this work, documenting the diverse assumptions that underlie different results and providing a unified framework for harmonizing seemingly contradictory findings.
1 Introduction
Transformers (Vaswani et al., 2017) have gained prominence in natural language processing (NLP), both in direct applications like machine translation and in pretrained models like BERT (Devlin et al., 2019) and GPT (Radford et al., 2018; Brown et al., 2020; OpenAI, 2023). Consequently, some researchers have sought to investigate their theoretical properties. Such studies can broadly be divided into studies of expressivity and trainability. While trainability is very important and the focus of much study (e.g., Bhattamishra et al., 2023; Allen-Zhu and Li, 2023), here we focus on expressivity, which is a prerequisite for trainability.
Studies of expressivity could be further divided into those from the perspectives of approximation theory and of formal language theory. The former (e.g., Yun et al., 2020; Sanford et al., 2023) investigates transformers as approximators of various classes of functions, along the lines of the universal approximation theorem for feedforward neural networks (Hornik et al., 1989; Cybenko, 1989). The latter, which is the subject of this survey, investigates transformers as recognizers or generators of formal languages—that is, the inputs or outputs are treated as sequences of discrete symbols from a finite alphabet, and crucially as sequences of unbounded length.
The core research question in this subarea is: How can we characterize the expressivity of transformers in relation to various formal models, such as automata, Boolean circuits, or formal logic? Applications of this subarea, which are not addressed by the papers surveyed here but could be by future work, would hopefully answer questions like:
What new transformer variants are suggested by formal models?
Do failure cases anticipated from formal models occur in practice?
What insights into the complexity of human language are offered by a characterization of transformer expressivity?
This paper provides a comprehensive survey of research in this subarea. Compared to the surveys of Ackerman and Cybenko (2020) and Merrill (2021, 2023), which cover convolutional neural networks (CNNs), RNNs, and transformers, this is a narrower, but deeper, survey on transformers only.
Interpreting theoretical transformer results is complex due to diverse assumptions. Many variants of transformers exist in practice, and even more have been proposed in theory. This diversity leads to varied, even seemingly contradictory, results. We set up a unified framework for talking about transformer variants (§4), and discuss how some of these variants compare to one another in expressivity.
We then provide background on various formal models that transformers have been compared with (§5). Then, in §6, we systematically survey current results in this literature, documenting their assumptions and claims in terms of the definitions of Sections 4 and 5.
2 Overview
Table 1 summarizes the results surveyed here. One way to classify them is into lower bounds (what transformers can do) and upper bounds (what transformers can’t do).
Table 1: Surveyed claims and their assumptions. Please see the main text for full details of assumptions.
Lower bound | Source | PE | Attention | Notes
---|---|---|---|---
∋Majority | Pérez et al. 2019 | none | average-hard |
∋Shuffle-Dyck-k | Bhattamishra et al. 2020a | none | softmax, future mask |
⊇SSCMs | Bhattamishra et al. 2020a | none | softmax, future mask |
∋Dyck-k | Yao et al. 2021 | i/n, i/n³, n | softmax & leftmost-hard |
⊇P | Pérez et al. 2021 | i, 1/i, 1/i² | average-hard | poly(n) steps
∋Parity | Chiang and Cholak 2022 | i/n, (−1)ⁱ | softmax |
⊇FOC[MOD; +] | Chiang et al. 2023 | sinusoidal | softmax |
⊇FO[Mon] | Barceló et al. 2024 | arbitrary | leftmost-hard |
⊇LTL+C[Mon] | Barceló et al. 2024 | arbitrary | average-hard |

Upper bound | Source | Precision | Attention | Notes
---|---|---|---|---
∌Parity, Dyck-1 | Hahn 2020 | ℝ | leftmost-hard |
∌Parity, Dyck-2 | Hahn 2020 | ℝ | softmax, future mask | Lipschitz, vanishing KL
⊆AC0 | Hao et al. 2022 | ℚ | leftmost-hard |
⊆TC0 | Merrill et al. 2022 | | average-hard |
⊆FOC[MOD; +] | Chiang et al. 2023 | O(1) | softmax |
⊆L-uniform TC0 | Merrill and Sabharwal 2023a | O(log n) | softmax |
⊆FOM[BIT] | Merrill and Sabharwal 2023b | O(log n) | softmax |
⊆L-uniform TC0 | Strobl 2023 | | average-hard |

Equivalent | Source | PE | Attention | Notes
---|---|---|---|---
=RE | Pérez et al. 2021 | i, 1/i, 1/i² | average-hard | unbounded steps
=FO | Angluin et al. 2023 | none | rightmost-hard, strict future mask |
=FO[MOD] | Angluin et al. 2023 | sinusoidal | rightmost-hard, strict future mask |
=FO[Mon] | Angluin et al. 2023 | arbitrary | rightmost-hard, strict future mask |
=P | Merrill and Sabharwal 2024 | none | average-hard, future mask | poly(n) steps
Much work on lower bounds has looked at automata like finite automata, counter machines, and Turing machines, all of which had been successfully related to RNNs before (Siegelmann and Sontag, 1995; Merrill, 2020). This wide diversity of machines is due to different variants of transformers, especially whether a transformer decoder is allowed to take a number of intermediate steps before outputting a decision (§4.3.4), which dramatically increases its power (§6.1).
By contrast, investigation of upper bounds has mainly focused on circuit complexity (§5.2), which had been successfully related to feedforward networks before (Parberry, 1994; Siu et al., 1995; Beiu and Taylor, 1996; Šíma and Orponen, 2003). This line of research began with restricted models of transformer encoders and progressed to increasingly realistic variants and tighter bounds. One way to restrict transformers is by discretizing the attention mechanism (§4.2.1); another is to limit the precision of number representations (§4.4).
More recent work has turned to formal logic (§5.3) as a way of characterizing the expressive power of transformers. The finer control afforded by logics opens the possibility for them to be used as upper bounds, lower bounds, or both.
3 Preliminaries
Sets
We denote by ℕ0 = {0,1,2,…} and ℕ =ℕ0 ∖{0} the set of natural numbers with and without 0, respectively. We write [n] = {0,1,2,…, n −1} for any n ∈ℕ. We write Σ for a finite alphabet, which, in NLP applications, is the set of words or subwords known to the model.
Vectors
We use d, d′, etc., for dimensionalities of vector spaces, lowercase bold letters (x, y,…) for vectors, and uppercase bold letters (X, Y,…) for matrices. For any vector x ∈ ℝd, we number its elements starting from 0. For i ∈ [d], we write xi or [x]i for the i-th component of x.
Sequences
For any set A, we write A* for the set of all finite sequences over A. We write the length of a sequence s ∈ A* as |s| and number its elements starting from 0; thus, s = s0s1⋯s|s|−1. We use the variable w for a string in Σ* and n for the length of w. For sequences in ℝ*, we use lowercase bold letters (x, y,…), and for sequences in (ℝd)*, we use the variable X.
A function f : A* → B* is length-preserving if |f(w)| = |w| for all w ∈ A*. For every function g : A → B, we denote its extension to sequences by g as well. That is, g : A* → B* is defined as follows: for all s ∈ A* and i ∈ [|s|], g(s)i = g(si).
Neural Networks
An affine transformation is a function L : ℝd → ℝd′ parameterized by weights WL ∈ ℝd′×d and bias bL ∈ ℝd′ such that for every x ∈ ℝd, L(x) = WLx + bL. We say that L is linear if bL = 0.
The activation functions we use are the rectified linear unit (ReLU) and the logistic sigmoid function σ(x) = 1/(1 + e−x).
4 Transformers
In this section, we define transformers and relevant variants, and how transformers are used to describe formal languages. For additional background on transformers (not in relation to formal languages), Huang et al. (2022) give a lucid commentary on the original paper, Phuong and Hutter (2022) give formal definitions and pseudocode, and Lin et al. (2022) survey many variants of transformers.
4.1 Input Layer
In theoretical constructions, the word embedding can be any computable function.
4.2 Hidden Layers
Each hidden layer is composed of the following sublayers:
a multi-head self-attention (§4.2.1) with d input/output dimensions, H heads, and dkv key/value dimensions per head;
a feed-forward network (§4.2.2) with d input/output dimensions and dff hidden dimensions; and
two layernorms (§4.2.3) with d dimensions.
We define each of these components below.
4.2.1 Attention
Attention was initially developed to facilitate retrieval of previously processed data from a variable-length history (Bahdanau et al., 2015). Transformers use a simple variant of attention known as scaled dot-product attention.
Scaled Dot-product Attention
Attention Masking
Multi-head Attention
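As an illustrative sketch only (the exact formulations vary across the papers surveyed, and all parameter names here are ours), scaled dot-product attention, future masking, and multi-head self-attention can be combined as follows:

```python
# A minimal numpy sketch of scaled dot-product attention, future masking, and
# multi-head self-attention. Parameter names (Wq, Wk, Wv, Wo) are illustrative.
import numpy as np

def softmax(scores):
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V, future_mask=False):
    dkv = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(dkv)        # (n, n) scaled dot-product scores
    if future_mask:                        # position i attends only to j <= i
        n = scores.shape[0]
        scores = np.where(np.tril(np.ones((n, n), dtype=bool)), scores, -np.inf)
    return softmax(scores) @ V             # (n, dkv) outputs

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, future_mask=False):
    # X: (n, d); Wq, Wk, Wv: (H, d, dkv); Wo: (H*dkv, d)
    heads = [attention(X @ Wq[h], X @ Wk[h], X @ Wv[h], future_mask)
             for h in range(Wq.shape[0])]
    return np.concatenate(heads, axis=-1) @ Wo  # (n, d)
```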
Hard Attention
By substituting, for the softmax in Eq. (4), a function that places all attention weight on the leftmost maximal attention score, or one that spreads it evenly over all maximal scores, we get leftmost-hard and average-hard attention, respectively. Leftmost-hard attention was previously called hard attention by Hahn (2020) and unique hard attention by Hao et al. (2022). One may also consider rightmost-hard attention, in which the rightmost maximal element is used. Average-hard attention was also called hard attention by Pérez et al. (2021) and saturated attention by Merrill et al. (2022), and has been argued to be a realistic approximation to how trained transformers behave in practice (Merrill et al., 2021).
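Under the same illustrative setup as the sketch above, the hard variants replace the softmax over attention scores as follows (again a sketch, not any surveyed paper's exact definition):

```python
import numpy as np

def leftmost_hard(scores):
    # all attention weight on the leftmost position with the maximal score
    w = np.zeros_like(scores, dtype=float)
    w[np.arange(scores.shape[0]), scores.argmax(axis=-1)] = 1.0
    return w

def rightmost_hard(scores):
    # all attention weight on the rightmost position with the maximal score
    n = scores.shape[-1]
    w = np.zeros_like(scores, dtype=float)
    w[np.arange(scores.shape[0]), n - 1 - scores[:, ::-1].argmax(axis=-1)] = 1.0
    return w

def average_hard(scores):
    # weight spread uniformly over all positions with the maximal score
    is_max = scores == scores.max(axis=-1, keepdims=True)
    return is_max / is_max.sum(axis=-1, keepdims=True)
```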
4.2.2 Feed-forward Networks
4.2.3 Layer Normalization
The original definition of layernorm (Ba et al., 2016) sets ε = 0, but, for numerical stability, all implementations we are aware of set ε > 0. Observe that layernorm is Lipschitz-continuous iff ε > 0.
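The following toy computation (ours, with made-up inputs) illustrates the role of ε: with ε = 0, an arbitrarily small perturbation of a constant input is always blown up to a unit-scale output, so the map has no finite Lipschitz constant.

```python
import numpy as np

def layernorm(x, gamma, beta, eps):
    mean, var = x.mean(), x.var()
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

gamma, beta = np.ones(2), np.zeros(2)
for delta in (1e-1, 1e-4, 1e-8):
    x = np.array([1.0, 1.0 + delta])          # nearly constant input
    print(delta, layernorm(x, gamma, beta, eps=0.0))
    # always prints roughly [-1, 1], no matter how small delta is
```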
4.3 Networks and Output Layers
We now define a complete transformer network.
4.3.1 Transformer Encoders
Chiang and Cholak (2022) also consider a requirement that an encoder accepts/rejects strings with bounded cross-entropy. That is, writing p(w) for the probability of acceptance that the encoder assigns to a string w, we say that the encoder recognizes a language L with cross-entropy at most η iff for all strings w, if w ∈ L then −log p(w) ≤ η, and if w ∉ L then −log(1 − p(w)) ≤ η.
We are aware of two choices for the distinguished position i. Most papers use the last position (i = n −1), but some (Chiang and Cholak, 2022; Chiang et al., 2023), inspired by binary classifiers based on BERT (Devlin et al., 2019), prepend a special symbol CLS at position 0 and use i = 0. While this is a minor difference, it should be noted that the guarantee of exactly one occurrence of CLS in the input can be useful in some constructions.
4.3.2 Transformer Decoders
While a decoder can be used to recognize strings similarly to an encoder, it can also be used to generate the entire string; at least two definitions have been given for this.
While not focusing on transformers, Lin et al. (2021) demonstrate limitations of autoregressive models for generation; for example, that there is a language L ∈ P that cannot be ε-generated in polynomial time for any ε > 0 if P≠NP.
4.3.3 Transformer Encoder–Decoders
A transformer encoder–decoder combines a transformer encoder and decoder, adding to each layer of the decoder an additional attention sublayer, known as cross attention, which attends to the output of the encoder. In the literature surveyed here, only the construction of Pérez et al. (2021) and related constructions (Bhattamishra et al., 2020b; Wei et al., 2022a) employ an encoder–decoder.
4.3.4 Intermediate Steps
When a transformer decoder or encoder–decoder is run as a language recognizer, it allows for the possibility of inserting a number of intermediate time steps between the end of the input string and the decision. The encoder–decoder models above do this, as do some decoder-only models (Feng et al., 2023; Merrill and Sabharwal, 2024). As we will see (§6.1), intermediate steps vastly increase the model’s power, which has also been observed in practice in the form of a “scratchpad” (Nye et al., 2021) or “chain of thought” (Wei et al., 2022b).
4.4 Uniformity and Precision
Although meaningful theoretical claims can be made about transformers for fixed-length strings (e.g., Yun et al., 2020), it is crucial when examining transformers as language recognizers to allow for unbounded string length. Fixing a maximum length makes all languages finite, collapsing many language classes into one.
It might be objected that considering unbounded lengths is too abstract, because in practice one can always fix a maximum length. But this maximum length, driven by practical needs, is growing steadily: for example, GPT-4 Turbo uses 128,000 tokens of context. At the same time, some theoretical findings surveyed here seem to have practical consequences for modest string lengths. For example, we will see that there are reasons to think that in theory, transformers cannot recognize Parity; in practice, they fail to learn Parity for strings with lengths in [2,50] (Bhattamishra et al., 2020a).
Numeric Precision
Transformers operate, in principle, on real numbers. While hard attention transformers could be defined using only rational numbers, even rational numbers can represent an arbitrary amount of information. With RNNs, the use of real or rational numbers has led to results that make them appear more powerful in theory than in practice (Siegelmann and Sontag, 1994, 1995; Weiss et al., 2018).
Consequently, many studies use limited-precision numbers. Some studies limit number representations to have O(1) bits, as floating-point numbers do in practice (Chiang et al., 2023). But Merrill and Sabharwal (2023b) argue that in O(1) precision, attention cannot attend uniformly to a string of sufficient length n, as the attention weights (α) would all round down to zero. So O(log n) bits of precision is a common choice (Yao et al., 2021; Merrill and Sabharwal, 2023a, b). Other choices are possible as well: Merrill and Sabharwal (2023a) restrict values to logarithmic-precision floating-point numbers.
Restricting intermediate activations to limited precision introduces many decisions about when and how rounding should take place, which can potentially affect expressivity. For example, when summing n numbers, one could round after each addition or only at the end of the summation. Better formalizing these decisions and their impact on expressivity is an area for future research.
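The following toy computation (ours; it uses fixed-point rounding purely for illustration, not any surveyed paper's precise model of precision) shows both issues at once: with too few bits, uniform attention weights 1/n round to zero, and rounding after every operation gives a different answer than rounding once at the end.

```python
import numpy as np

def round_to_bits(x, frac_bits):
    # round to a fixed-point value with frac_bits fractional bits
    return np.round(x * 2.0**frac_bits) / 2.0**frac_bits

n, frac_bits = 1000, 8                 # 1/n is smaller than the step 2**-8
weights = np.full(n, 1.0 / n)          # uniform attention over n positions

print(round_to_bits(weights.sum(), frac_bits))   # round at the end:  1.0
print(round_to_bits(weights, frac_bits).sum())   # round each weight: 0.0
```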
Parameters
A few constructions allow the parameters themselves to depend on n, which we consider to be a stronger dependence, because if these transformers were to be learned from data, different transformers would have to be learned for different maximum lengths. Finally, a few papers construct transformers in which d, and therefore the number of parameters, depends on n, which we consider to be stronger still.
4.5 Summary
In summary, transformers can vary in at least the following ways, any of which could a priori impact theoretical claims:
Architecture: encoder-only, decoder-only, or encoder–decoder
For encoders: definition of recognition
For decoders and encoder–decoders: definition of generation and how many intermediate steps
Position embedding (PE)
Attention pattern: leftmost-hard, rightmost-hard, average-hard, or softmax
Attention masking: none, future, or past
Layernorm: inclusion or omission, value of ε
Residual connections: pre-norm or post-norm
Precision: infinite, O(log n), or O(1)
Uniformity: whether parameter values or number of parameters depend on n.
5 Languages and Language Classes
Next, we present various formal models that transformers are compared to in the literature surveyed.
5.1 Automata and Classes L, NL, P
We assume familiarity with finite automata and Turing machines; for definitions, please see the textbook by Sipser (2013). Counter machines are automata with integer-valued registers (Fischer et al., 1968); they have been studied extensively in connection with LSTM RNNs (Weiss et al., 2018; Suzgun et al., 2019; Merrill, 2019, 2020).
5.2 Circuits and Classes AC0, ACC0, TC0, NC1
Circuits are a model of parallel computation particularly relevant to transformers. For more details, please see the textbook by Arora and Barak (2009).
Circuits operate on binary values. If we choose a fixed-length encoding of the symbols of Σ as strings of b bits, then a circuit can simulate input alphabet Σ by encoding the value of the i-th input symbol into positions ib to ib + (b −1). For the rest of this section, we assume Σ = {0,1}.
Circuits
A circuit C with input length n is a directed acyclic graph with n input vertices s1,…, sn and zero or more gate vertices, each labeled with a type NOT, AND, or OR. Input vertices have fan-in (in-degree) zero, NOT gates have fan-in one, and the fan-in of AND and OR gates can be either two or unbounded. One (input or gate) vertex t is designated the output of the circuit.
Given an input string w ∈{0,1}n, each input vertex si is assigned the value wi, and each gate vertex is assigned the value computed by applying the logical function corresponding to its type to the values assigned to its in-neighbors. The circuit computes the Boolean function C : {0,1}n → {0,1}, mapping each input string to the value assigned to t. The depth of C, denoted D(C), is the length of the longest directed path from any si to t. The size of C, denoted |C|, is the number of vertices in C.
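To make these definitions concrete, here is a small evaluation sketch; the encoding of circuits as Python lists is ours, chosen only for illustration.

```python
from typing import List, Tuple

Gate = Tuple[str, List[int]]   # (type, in-neighbors); inputs are ("INPUT", [i])

def eval_circuit(gates: List[Gate], output: int, w: str) -> bool:
    values = {}
    for v, (kind, ins) in enumerate(gates):       # assumes topological order
        if kind == "INPUT":
            values[v] = w[ins[0]] == "1"
        elif kind == "NOT":
            values[v] = not values[ins[0]]
        elif kind == "AND":
            values[v] = all(values[u] for u in ins)
        elif kind == "OR":
            values[v] = any(values[u] for u in ins)
    return values[output]

# a circuit with input length 2 computing XOR (Parity on two bits)
xor = [("INPUT", [0]), ("INPUT", [1]),     # vertices 0, 1
       ("NOT", [0]), ("NOT", [1]),         # vertices 2, 3
       ("AND", [0, 3]), ("AND", [1, 2]),   # vertices 4, 5
       ("OR", [4, 5])]                     # vertex 6, the output t
print([eval_circuit(xor, 6, w) for w in ("00", "01", "10", "11")])
# [False, True, True, False]
```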
Circuit Families
A circuit family is a sequence 𝒞 = {Cn}n∈ℕ such that for each n, Cn is a circuit with input length n. We treat 𝒞 as a function on {0,1}* as follows: for every w ∈{0,1}*, 𝒞(w) = C|w|(w). Then 𝒞 defines the language L(𝒞) = {w ∈ {0,1}* | 𝒞(w) = 1}, and we say that 𝒞 recognizes L(𝒞). The depth and size of 𝒞 are the functions n↦D(Cn) and n↦|Cn|.
Uniformity
As defined, a circuit family contains a different circuit for each length n, with no constraint on the relationship between the circuits. For example, let L be any unary language: L ⊆{1}*. For n ∈ℕ, if 1n∉L, define Cn to be a circuit for the constant 0 function (an OR gate with fan-in 0), and if 1n ∈ L, define Cn to be a circuit for the AND of all the inputs. Thus, every unary language, even an undecidable one, is recognized by a circuit family of size O(n) and depth O(1).
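A sketch of this construction, reusing the circuit encoding from the evaluation example above; the membership test for L is treated as an oracle and need not be computable.

```python
from typing import List, Tuple

Gate = Tuple[str, List[int]]

def circuit_for_length(n: int, one_n_in_L: bool) -> Tuple[List[Gate], int]:
    gates: List[Gate] = [("INPUT", [i]) for i in range(n)]
    if one_n_in_L:
        gates.append(("AND", list(range(n))))   # accepts exactly the string 1^n
    else:
        gates.append(("OR", []))                # fan-in-0 OR gate: constant 0
    return gates, len(gates) - 1                # (circuit, output vertex)
```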
A uniformity restriction on a circuit family {Cn}n∈ℕ requires that the task of constructing a description of the circuit Cn given input n be computable within some specified resource bound as a function of n, potentially making it comparable with classes defined by bounds on Turing machine time or space. Two such uniformity bounds are used in the work here: L and DLOGTIME. Because these bounds are very restrictive, a special representation of the circuit Cn is used, namely, the ability to answer queries about the type of a gate and about whether the output of one gate is an input to another gate.
We assume that the vertices of the circuit Cn are numbered from 0 to |Cn|−1. The direct connection language of a family of circuits 𝒞 is the set of all tuples encoding a length n, vertices i and j, and a type f, such that in Cn, vertex i has type f and there is an edge from vertex i to vertex j (Barrington et al., 1990). Given a computable function bounding the size |Cn| and access to a membership oracle for the direct connection language, for any n it is straightforward to write out the list of vertices, edges, and types in Cn.
Then a circuit family 𝒞 is L-uniform (resp., DLOGTIME-uniform) if there is a Turing machine that runs in logarithmic space (resp., deterministic logarithmic time) to decide membership in the direct connection language of 𝒞.
Circuit Complexity Classes
Circuit complexity classes classify circuit families and the languages they recognize based on uniformity, depth, size, fan-in bound, and the allowed gates. Since transformers have constant depth, circuit classes with constant depth are of particular interest; the classes that are used in the work we survey are:
AC0 contains those languages that can be recognized by families of circuits with unbounded fan-in, constant depth, and polynomial size.
ACC0 is like AC0, but also has gates that output 1 iff the inputs sum to 0 modulo some constant.
TC0 is like AC0, but also allows MAJORITY gates, which have unbounded fan-in and output 1 iff at least half of their inputs are 1.
NC1 is like AC0, but with fan-in at most 2 and depth in O(log n).
5.3 Logic
A formal language can also be defined as a set of finite strings that satisfy a closed formula of a logic. For more details, refer to Thomas (1997) or Straubing (1994).
In the first-order logic of strings, or FO, the formulas are the smallest set containing:
Variables x, y, and so on.
Atomic formulas Qa(x), x = y, x < y, where a ∈ Σ is a symbol and x, y are variables.
ϕ1 ∧ ϕ2, ϕ1 ∨ ϕ2, ¬ϕ1, where ϕ1 and ϕ2 are formulas.
∀x.ϕ, ∃x.ϕ, where x is a variable and ϕ is a formula.
Under the intended interpretation, variables stand for positions of a finite string w, and Qa(x) is true iff wx = a. For example, if Σ = {a, b}, the formula ¬∃x.∃y.(x < y ∧ Qb(x) ∧ Qa(y)) defines the regular language a*b*. The language defined by a closed formula ϕ consists of those strings that satisfy ϕ.
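A brute-force check (ours) of this example, evaluating the formula directly over positions of w and comparing against a regular-expression matcher:

```python
import re

def satisfies(w: str) -> bool:
    # evaluates ¬∃x.∃y.(x < y ∧ Qb(x) ∧ Qa(y)) on w
    n = len(w)
    return not any(w[x] == "b" and w[y] == "a"
                   for x in range(n) for y in range(x + 1, n))

for w in ("", "aab", "aaabbb", "abab", "ba"):
    assert satisfies(w) == bool(re.fullmatch(r"a*b*", w))
```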
The languages definable in FO are exactly the star-free languages (McNaughton and Papert, 1971). Other variants add more quantifiers: FOC adds counting quantifiers, which assert how many positions satisfy a formula, and FOM adds majority quantifiers, which assert that at least half of all positions satisfy a formula.
We are also interested in various sets of predicates on positions: MOD predicates, which test a position's value modulo a fixed constant; the BIT predicate, which tests whether a given bit of a position's binary representation is 1; Mon, the set of all monadic (one-argument) numerical predicates; and ARB, the set of all numerical predicates of any arity.
A logic extended with predicates is conventionally written with the predicates in square brackets; for example, we write FO[BIT] for first-order logic with the BIT predicate.
In linear temporal logic or LTL (Kamp, 1968), every formula implicitly depends on a single time (or position). There are atomic formulas Qa for every a ∈ Σ, the connectives ∧, ∨, and ¬, as well as operators since and until. The formula α since β is true iff β was true at some past time i and α was true from i to now (exclusive). LTL is equivalent to FO (Kamp, 1968).
5.4 Relationships
Figure 1, which depicts the relationships between the language classes defined above, shows that the classes defined by circuits/logics cut across the (perhaps more familiar) Chomsky hierarchy. In this figure and in this section, all circuit classes are understood to be DLOGTIME-uniform unless specified otherwise.
Figure 1: Relationship of some languages and language classes discussed in this paper (right) to the Chomsky hierarchy (left), assuming the conjectured separations between classes discussed in §5.4. Circuit classes are DLOGTIME-uniform.
5.4.1 Beyond AC0
The classic examples of languages not in AC0 are Parity and Majority. The language Parity ⊆{0,1}* contains all bit strings containing an odd number of 1’s, and Majority ⊆{0,1}* consists of all bit strings in which more than half of the bits are 1’s. Other problems in TC0 but not AC0 include sorting, integer multiplication (Chandra et al., 1984), and integer division (Hesse, 2001).
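For intuition about why these two languages are nonetheless in TC0, here is a sketch (ours) of a constant-depth computation built from unbounded fan-in threshold gates; a threshold gate can be simulated by a MAJORITY gate with some inputs fixed to constants.

```python
def threshold(bits, t):            # a single threshold gate: at least t ones
    return sum(bits) >= t

def exactly(bits, t):              # AND of a threshold gate and a negated one
    return threshold(bits, t) and not threshold(bits, t + 1)

def majority(w: str) -> bool:      # Majority is a single threshold gate
    bits = [c == "1" for c in w]
    return threshold(bits, len(bits) // 2 + 1)

def parity_tc0(w: str) -> bool:    # Parity: OR over all odd counts t
    bits = [c == "1" for c in w]
    return any(exactly(bits, t) for t in range(1, len(bits) + 1, 2))

print(majority("1011"), parity_tc0("1011"))   # True True
```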
Dyck Languages
The language Dyck-k for k > 0 is the language of strings over k pairs of parentheses that are correctly balanced and nested. If we write the i-th parenthesis pair as (i )i for each i ∈ [k], then Dyck-k is generated by the context-free grammar with the single nonterminal S and the productions S → (i S )i S for each i ∈ [k], together with S → ε. These languages are of interest because any context-free language can be obtained by applying a string homomorphism to the intersection of a Dyck language with a regular language (Chomsky and Schützenberger, 1963).
Some papers surveyed here consider variations on Dyck languages. The language Dyck-(k, D) for D > 0 is the subset of Dyck-k consisting of strings with maximum nesting depth D; it is a star-free regular language (and therefore in AC0).
The language Shuffle-Dyck-k is the set of strings over k pairs of parentheses in which, for each parenthesis pair, erasing the other types of parentheses leaves a correctly balanced and nested string. For example, [(()]) is in Shuffle-Dyck-2. If k > 1, Shuffle-Dyck-k is not context free.
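The following recognizers (ours, written only to make the definitions concrete) contrast the two languages: Dyck-k needs a stack, while Shuffle-Dyck-k needs only one counter per parenthesis pair.

```python
from typing import List, Tuple

Symbol = Tuple[int, str]   # (parenthesis pair i, "(" or ")")

def is_dyck(w: List[Symbol]) -> bool:
    stack = []
    for i, b in w:
        if b == "(":
            stack.append(i)
        elif not stack or stack.pop() != i:   # mismatched or unopened bracket
            return False
    return not stack

def is_shuffle_dyck(w: List[Symbol], k: int) -> bool:
    # for each pair independently, erasing the others must leave a balanced string:
    # each counter must stay nonnegative and end at 0
    counts = [0] * k
    for i, b in w:
        counts[i] += 1 if b == "(" else -1
        if counts[i] < 0:
            return False
    return all(c == 0 for c in counts)

# "[(()])" with pair 0 = "()" and pair 1 = "[]": in Shuffle-Dyck-2 but not Dyck-2
w = [(1, "("), (0, "("), (0, "("), (0, ")"), (1, ")"), (0, ")")]
print(is_dyck(w), is_shuffle_dyck(w, 2))   # False True
```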
5.4.2 Beyond TC0
As we will see (§6.3.2), some transformer variants lie within TC0. What problems lie beyond?
The Word Problem for Permutation Groups
A permutation of [k] is a bijection from [k] to [k], and Sk is the set of all permutations of [k]. Treating Sk as an alphabet and compositions of permutations as strings, we can define the language W(Sk) of compositions of permutations of [k] that equal the identity permutation. For example, in S3, the permutation (120) maps 0↦1, 1↦2, and 2↦0, so that W(S3) contains (120) ∘ (120) ∘ (120) but not (120) ∘ (120). These languages are easy for finite automata to recognize, but difficult with only fixed computation depth. Indeed, W(S5) is complete for NC1 under AC0 reductions (Barrington, 1989), so it is not in TC0, assuming that TC0 ≠ NC1 (as is widely believed). This makes it an example of a regular language that transformer encoders probably cannot recognize.
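A quick check (ours) of the example, composing permutations written in one-line notation:

```python
def compose(p, q):
    # (p ∘ q)(x) = p(q(x))
    return tuple(p[q[x]] for x in range(len(p)))

identity = (0, 1, 2)
p = (1, 2, 0)   # the permutation (120): 0 -> 1, 1 -> 2, 2 -> 0
print(compose(p, p) == identity)              # False: (120)∘(120) is not in W(S3)
print(compose(compose(p, p), p) == identity)  # True:  (120)∘(120)∘(120) is in W(S3)
```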
The languages W(Sk) have some relevance to natural language: they resemble expressions like the child of the enemy of Ann where the interpretation of the child of is (roughly) a permutation of possible referents (Paperno, 2022), and problems that have been used to benchmark transformers’ state-tracking abilities (Kim and Schuster, 2023).
Other Languages
that are widely believed to be not in TC0 include:
The language of closed Boolean formulas that are true (BFVP) is context-free but complete for NC1 under DLOGTIME reductions (Buss, 1987), so it is outside TC0 if TC0 ≠ NC1.
Undirected graph connectivity is L-complete under L-uniform NC1 reductions (Cook and McKenzie, 1987; Reingold, 2008), so it is outside L-uniform NC1 (and therefore outside TC0) if L-uniform NC1 ≠ L.
There is a context-free language LP that is NL-complete under L reductions (Sudborough, 1975), so it is outside L (and therefore outside NC1 and TC0) if L ≠ NL.
Solving systems of linear equalities and universal context-free grammar recognition are P-complete under L reductions (Jones and Laaser, 1976; Greenlaw et al., 1995), so they are outside TC0 if L ≠ P.
Matrix permanent is known to be outside of TC0 (Allender, 1999).
5.4.3 Circuits and Logics
DLOGTIME-uniform AC0 and TC0 are equivalent to FO[BIT] and FOM[BIT], respectively. There are many such equivalences between circuit classes and logics. As a rule of thumb, adding unbounded fan-in gates to a circuit family correlates with adding quantifiers to the corresponding logic, and increasing the degree of non-uniformity of a circuit family correlates with adding numerical predicates to the corresponding logic (Barrington and Immerman, 1994). For example, making AC0 and TC0 completely non-uniform corresponds to adding arbitrary numerical predicates (ARB) to FO and FOM, respectively (Immerman, 1997; Barrington et al., 1990).
As we will see below, circuits and logics have their advantages and disadvantages for capturing the expressivity of transformers. An advantage of the circuit approach is that they have a more transparent resemblance to transformers. Transformers are computations with bounded depth, so it’s not hard to see that they should be computable by circuit families with bounded depth (AC0 or TC0). On the other hand, an advantage of the logical approach is that if we seek an exact characterization of transformers, it can be easier in a logic to add or remove quantifiers or predicates, to limit quantifier depth or number of variables, to partition terms into different sorts, and so on, than to make adjustments to a circuit family.
6 Current Results
While this area of research still has many unresolved questions, the emerging picture has three levels of expressivity. At the upper end are decoders or encoder–decoders with intermediate steps; these are equivalent to Turing machines (§6.1). At the lower end are encoders with leftmost-hard or rightmost-hard attention; these can recognize only languages in AC0 (§6.2). In the middle are encoders with average-hard or softmax attention, which are the least well-understood but appear to lie between AC0 and TC0 (§6.3).
In this section, “transformer” refers to a transformer encoder unless otherwise indicated.
6.1 Decoders with Intermediate Steps
Pérez et al. (2021) consider transformer encoder–decoders with several modifications, including a PE with components i, 1/i, and 1/i² and average-hard attention.
As described above (§4.3.3), the decoder is allowed to run for arbitrarily many time steps until an acceptance criterion is met. Under these assumptions, transformer encoder–decoders can recognize any recursively enumerable language.3 This result uses arbitrary precision, but as a corollary, it shows that a T(n)-time-bounded Turing machine can be simulated in a transformer using logarithmic precision and O(T(n)) intermediate steps.
Bhattamishra et al. (2020b) provide a simpler proof of Pérez et al.’s result by reducing to an RNN and appealing to the construction of Siegelmann and Sontag (1995). They do this for two sets of assumptions. First,
The PE includes only i.
The self-attention sublayers are as above.
The FFNs use saturated linear activation functions, which clip their input to [0,1]: x ↦ max(0, min(1, x)).
Second, they show the same with no PE and standard dot-product attention with future masking.
Wei et al. (2022a) define a notion of statistically meaningful (SM) approximation and show that transformer encoder–decoders SM-approximate Turing machines. Both the decoder and Turing machine are limited to N time steps; additionally,
The PE can be an arbitrary computable function on [N].
Attention is average-hard.
The FFNs have three ReLU layers.
Feng et al. (2023) observe that the problems of evaluating arithmetic expressions or solving linear equations over ℤp are NC1-hard under DLOGTIME reductions, so (if TC0 ≠ NC1) they cannot be solved by log-precision transformer decoders without intermediate steps.4 Similarly, the universal recognition problem for CFGs is P-complete, so (if L ≠ P) it cannot be solved by log-precision transformer decoders without intermediate steps.
However, these problems can be solved by a transformer decoder using (a polynomial number of) intermediate steps. The decoder has GELU activations (Hendrycks and Gimpel, 2016) and a PE including i and, for linear equation solving, additional components depending on m, the number of variables. More generally, they define a class of dynamic-programming algorithms that these transformers can solve using intermediate steps. All these decoders have parameters that depend on n.
Merrill and Sabharwal (2024) show that a transformer decoder with logarithmic precision and O(T(n)) intermediate steps can simulate a Turing machine for T(n) steps; in particular, decoders with a polynomial number of intermediate steps recognize exactly the languages in P. The proof is similar to that of Pérez et al. (2021), but uses a standard definition of transformers without PEs, relying only on the mild assumption that the input string begins with BOS.
6.2 Leftmost-hard/Rightmost-hard Attention
Hahn (2020) shows that leftmost-hard attention transformers cannot recognize Parity or Dyck-1, using a variant of Furst et al.’s random restriction method for proving that Parity is outside of AC0.
Hao et al. (2022) show more generally that any language recognized by a transformer with leftmost-hard attention is in AC0. The proof gives a normal form for transformers with leftmost-hard attention and uses it to construct an AC0 circuit family. It uses the fact that only O(log n) bits of information are needed per position.
Barceló et al. (2024) give a lower bound on leftmost-hard-attention transformers with arbitrary PEs that depend only on a single position i and the length n, such as i and (−1)ⁱ. They show that these transformers can recognize any language definable in FO[Mon]. Their proof converts a FO[Mon] formula to LTL (§5.3), which is simulated in a transformer.
Angluin et al. (2023) exactly characterize rightmost-hard-attention transformers with strict future masking. Without PEs, these transformers recognize exactly the class of star-free languages, that is, languages definable in FO. With periodic PEs, they are exactly equivalent to FO[MOD], and with arbitrary PEs, they are exactly equivalent to FO[Mon]. Strict masking is important, as nonstrict masking is less expressive. They give two proofs of the star-free to transformer direction, one which goes through LTL (§5.3) and one which uses Krohn-Rhodes theory. These proofs use a Boolean-valued version of RASP (Weiss et al., 2021) as an intermediate representation.
6.3 Average-hard and Softmax Attention
Theoretical results on average-hard and softmax attention transformers have not yet clearly separated the two, so we treat them together. Both kinds of attention enable counting, which can be used to solve problems like Majority that are outside AC0. But these transformers are no more powerful than DLOGTIME-uniform TC0, implying that they likely cannot solve problems complete for NC1, L, and other classes believed to be above TC0 (§5.4).
6.3.1 Lower Bounds: Particular Languages
The languages Majority, Dyck-k, and Parity are all not in AC0, so are interesting test cases.
Pérez et al. (2019) prove that a transformer encoder–decoder with a trivial decoder and without any PE recognizes Majority; Merrill et al. (2022) prove the same for transformer encoders.
Bhattamishra et al. (2020a) prove that Shuffle-Dyck-k (which equals Dyck-1 when k = 1) is recognizable by a softmax-attention transformer with future masking, no PE, no layernorm, and no residual connections. Yao et al. (2021) show that a transformer decoder can generate Dyck-k using O(log n) precision, softmax and leftmost-hard attention, future masking, and a PE including i/n, i/n³, and n. They also give constructions for Dyck-(k, D).
Chiang and Cholak (2022) show that transformers whose PE includes i/n and (−1)ⁱ can recognize Parity.
On the other hand, Hahn (2020) shows that softmax attention transformers cannot generate Parity or Dyck-2 under the following two conditions:
all position-wise functions are Lipschitz-continuous, and
generation is defined using the KL divergence criterion in Eq. (5).
The apparent contradiction is resolved by considering the different assumptions underlying each result. Chiang and Cholak (2022) address this by giving two constructions corresponding to Hahn’s two conditions. The first has Lipschitz-continuous position-wise functions, but has high cross-entropy (§4.3.1); as a generator, it would not meet criterion (5). The second construction uses layernorm with ε = 0, which is not Lipschitz-continuous, but it has arbitrarily low cross-entropy.
A number of authors have tested empirically whether transformers can learn the above languages. Ebrahimi et al. (2020) find that they are competitive with LSTMs at learning Dyck-2 and Dyck-4, and that prepending a BOS symbol helps.
Bhattamishra et al. (2020a) train transformers with future masking and no PE on Dyck-1 and Shuffle-Dyck-k, finding near-perfect learning and length generalization. For the languages Dyck-(1, D) with learned or sinusoidal PEs, they find that the models do not generalize well for D > 1. Yao et al. (2021) then investigate Dyck-(k, D) for several values of k and D and several PEs. They report strong generalization only when using i/n for the PE, and posit that this is the key. It is hard, however, to directly compare the two results: Bhattamishra et al. (2020a) require correct prediction of the possible next symbols at each string prefix, while Yao et al. (2021) average over predictions of right brackets.
Delétang et al. (2023) study experimentally how well transformers (and other networks) learn tasks at various levels of the Chomsky hierarchy, including generalization to longer strings. They find that transformers learn Majority, but not Parity.
6.3.2 Upper Bounds: TC0
Merrill et al. (2022) prove an upper bound analogous to that of Hao et al. (2022), but for average-hard-attention transformers. They show that an average-hard-attention transformer whose activations are restricted to finite-precision (floating-point) numbers can be simulated in TC0. Strobl (2023) tightens this bound to L-uniform TC0.
Furthermore, Merrill and Sabharwal (2023a) show that softmax-attention, log-precision transformers are in L-uniform TC0, and then tighten this bound to DLOGTIME-uniform TC0 (Merrill and Sabharwal, 2023b). The proof constructs subroutines to answer queries about the types of nodes and connectivity of pairs of nodes in the computation graph of a transformer, and shows that these queries can be translated to queries for a TC0 circuit family with O(log n) time overhead.
6.3.3 Other Lower Bounds
In addition to explicit constructions for particular languages mentioned above, various lower bounds have been proven, which are quite diverse.
Counter Machines
Bhattamishra et al. (2020a), following Merrill et al. (2020), define a subclass of counter machines called simplified and stateless k-counter machines (SSCMs). These can update each counter based on the current input symbol, but have no state and cannot read the counters until the end of the string. They show that any SSCM can be converted to an equivalent transformer with future masking and no residual connections.
Finite Automata
Liu et al. (2023) study the ability of transformers with future masked attention to simulate deterministic finite automata (DFAs), in the sense of computing not only the same acceptance decision but also the same state sequence. Although a transformer with depth N can simulate a DFA for N timesteps, Liu et al. show how to construct lower-depth shortcuts for subclasses roughly corresponding to classes of regular languages in Figure 1. Though the parameters of these constructions depend on N, in the context of this survey, a noteworthy finding is that any regular language in ACC0 can be recognized up to length N by a transformer whose FFNs use sine activations and whose number of parameters is independent of N.
First-order Logic
Chiang et al. (2023) obtain both an upper and a lower bound by defining a logic FOC[MOD; +], which is first-order logic with counting quantifiers, using two sorts for positions and counts (Immerman, 1999, p. 185–187), where positions have the MOD predicate (but not < or =), and counts have <, +, and =, capturing the fact that transformers can add and compare activations, but not positions. They show that this logic is intermediate in expressivity between O(1)-precision and infinite-precision transformers. The lower-bound proof uses a normal form that eliminates quantifiers over counts and makes quantifiers over positions have depth 1; a perhaps surprising consequence is that O(1)-precision transformers are no more powerful than 2-layer uniform-attention transformers.
Temporal Logic
Barceló et al. (2024) show that average-hard-attention transformers with arbitrary PEs that depend only on a single position i and the length n, such as i and (−1)ⁱ, can recognize any language definable in LTL with counting operators, Presburger arithmetic on counts, and predicates in Mon.
Programming Languages
Weiss et al. (2021) introduce the RASP (Restricted Access Sequence Processing) language as an abstraction of transformers, discussing how its components relate to the transformer architecture. However, they do not prove any relationship. Lindner et al. (2023) present Tracr, a compiler from RASP programs to transformers. To do so, they impose some restrictions: a maximum input length, given at compile time; a mandatory BOS token; and the removal of selector composition, a RASP operation with no clear parallel in transformers. They rewrite several programs from Weiss et al. (2021) without this operation. In the other direction, Friedman et al. (2023) define a restricted class of transformers that can be learned and decompiled into RASP. Finally, Angluin et al. (2023) use a version of RASP restricted to Boolean values, and Zhou et al. (2023) use a restricted version of RASP to explore length generalization.
7 Conclusions
Out of the large body of research surveyed above, we highlight several conclusions:
Transformer decoders can use intermediate steps to simulate Turing machines; with unbounded steps, they are Turing-complete.
Regarding the expressivity of transformer encoders, circuit complexity and logic are especially promising frameworks.
Leftmost-hard-attention transformer encoders are in AC0 and cannot solve some intuitively easy problems, like Parity and Majority.
Softmax and average-hard attention give transformer encoders the ability to count. Still, they lie within TC0 and likely cannot solve problems like evaluating closed Boolean formulas.
Some open questions that we think should be priorities for future research are:
Some variants (PEs, average-hard vs. softmax attention, pre-norm vs. post-norm, the presence of BOS/EOS/CLS) appear to be instrumental in proofs reviewed here; can their effect on expressivity be clarified?
Can the expressivity of softmax-attention transformers be characterized more tightly or even exactly in terms of some logic?
Given the current practical importance of decoder-only transformers and chain-of-thought, what further insights can circuits or logic provide into transformer decoders?
We hope this paper can serve as a valuable resource for researchers pursuing these and other questions.
Acknowledgments
We would like to thank Frank Drewes, Jon Rawski, Ashish Sabharwal, and the anonymous reviewers as well as the TACL action editor for their valuable comments on earlier versions of this paper.
Notes
This differs from the original paper (Vaswani et al., 2017), which treats them as matrices in ℝn×d. Our notation aligns better with notation for formal languages and emphasizes the variability of the sequence length.
Pérez et al. (2021) define both Turing machines and encoder–decoders to halt only when accepting. The construction could easily be modified to capture decidable languages.
This uses the result of Merrill and Sabharwal (2023b), which would have to be adapted to transformer decoders, but this should be straightforward.
References
Author notes
Action Editor: Mark-Jan Nederhof