## Abstract

As transformers have gained prominence in natural language processing, some researchers have investigated theoretically what problems they can and cannot solve, by treating problems as *formal languages*. Exploring such questions can help clarify the power of transformers relative to other models of computation, their fundamental capabilities and limits, and the impact of architectural choices. Work in this subarea has made considerable progress in recent years. Here, we undertake a comprehensive survey of this work, documenting the diverse assumptions that underlie different results and providing a unified framework for harmonizing seemingly contradictory findings.

## 1 Introduction

Transformers (Vaswani et al., 2017) have gained prominence in natural language processing (NLP), both in direct applications like machine translation and in pretrained models like BERT (Devlin et al., 2019) and GPT (Radford et al., 2018; Brown et al., 2020; OpenAI, 2023). Consequently, some researchers have sought to investigate their theoretical properties. Such studies can broadly be divided into studies of *expressivity* and *trainability*. While trainability is very important and the focus of much study (e.g., Bhattamishra et al., 2023; Allen-Zhu and Li, 2023), here we focus on expressivity, which is a prerequisite for trainability.

Studies of expressivity can be further divided into those taking the perspective of approximation theory and those taking the perspective of formal language theory. The former (e.g., Yun et al., 2020; Sanford et al., 2023) investigates transformers as approximators of various classes of *functions*, along the lines of the universal approximation theorem for feedforward neural networks (Hornik et al., 1989; Cybenko, 1989). The latter, which is the subject of this survey, investigates transformers as recognizers or generators of *formal languages*—that is, the inputs or outputs are treated as sequences of discrete symbols from a finite alphabet, and crucially as sequences of unbounded length.

The core research question in this subarea is: *How can we characterize the expressivity of transformers in relation to various formal models, such as automata, Boolean circuits, or formal logic?* Applications of this subarea, which are not addressed by the papers surveyed here but could be by future work, would hopefully answer questions like:

- What new transformer variants are suggested by formal models?
- Do failure cases anticipated from formal models occur in practice?
- What insights into the complexity of human language are offered by a characterization of transformer expressivity?

This paper provides a comprehensive survey of research in this subarea. Compared to the surveys of Ackerman and Cybenko (2020) and Merrill (2021, 2023), which cover convolutional neural networks (CNNs), RNNs, and transformers, this is a narrower, but deeper, survey on transformers only.

Interpreting theoretical transformer results is complex due to diverse assumptions. Many variants of transformers exist in practice, and even more have been proposed in theory. This diversity leads to varied, even seemingly contradictory, results. We set up a unified framework for talking about transformer variants (§4), and discuss how some of these variants compare to one another in expressivity.

We then provide background on various formal models that transformers have been compared with (§5). Then, in §6, we systematically survey current results in this literature, documenting their assumptions and claims in terms of the definitions of Sections 4 and 5.

## 2 Overview

Table 1 summarizes the results surveyed here. One way to classify them is into *lower bounds* (what transformers *can* do) and *upper bounds* (what transformers *can’t* do).

| Lower bound | Source | PE | Attention | Notes |
|---|---|---|---|---|
| ∋ Majority | Pérez et al. (2019) | none | average-hard | |
| ∋ Shuffle-Dyck-*k* | Bhattamishra et al. (2020a) | none | softmax, future mask | |
| ⊇ SSCMs | Bhattamishra et al. (2020a) | none | softmax, future mask | |
| ∋ Dyck-*k* | Yao et al. (2021) | *i*/*n*, *i*/*n*^{3}, *n* | softmax & leftmost-hard | |
| ⊇ P | Pérez et al. (2021) | *i*, 1/*i*, 1/*i*^{2} | average-hard | poly(*n*) steps |
| ∋ Parity | Chiang and Cholak (2022) | *i*/*n*, (−1)^{i} | softmax | |
| ⊇ FOC[MOD; +] | Chiang et al. (2023) | sinusoidal | softmax | |
| ⊇ FO[Mon] | Barceló et al. (2024) | arbitrary | leftmost-hard | |
| ⊇ LTL+C[Mon] | Barceló et al. (2024) | arbitrary | average-hard | |

| Upper bound | Source | Precision | Attention | Notes |
|---|---|---|---|---|
| ∌ Parity, Dyck-1 | Hahn (2020) | ℝ | leftmost-hard | |
| ∌ Parity, Dyck-2 | Hahn (2020) | ℝ | softmax, future mask | ε_{N} > 0, vanishing KL |
| ⊆ AC^{0} | Hao et al. (2022) | ℚ | leftmost-hard | |
| ⊆ TC^{0} | Merrill et al. (2022) | F | average-hard | |
| ⊆ FOC[MOD; +] | Chiang et al. (2023) | O(1) | softmax | |
| ⊆ L-uniform TC^{0} | Merrill and Sabharwal (2023a) | O(log *n*) | softmax | |
| ⊆ FOM[BIT] | Merrill and Sabharwal (2023b) | O(log *n*) | softmax | |
| ⊆ L-uniform TC^{0} | Strobl (2023) | F | average-hard | |

| Equivalent | Source | PE | Attention | Notes |
|---|---|---|---|---|
| = RE | Pérez et al. (2021) | *i*, 1/*i*, 1/*i*^{2} | average-hard | unbounded steps |
| = FO | Angluin et al. (2023) | none | rightmost-hard, strict future mask | |
| = FO[MOD] | Angluin et al. (2023) | sinusoidal | rightmost-hard, strict future mask | |
| = FO[Mon] | Angluin et al. (2023) | arbitrary | rightmost-hard, strict future mask | |
| = P | Merrill and Sabharwal (2024) | none | average-hard, future mask | poly(*n*) steps |


Much work on lower bounds has looked at *automata* like finite automata, counter machines, and Turing machines, all of which had been successfully related to RNNs before (Siegelmann and Sontag, 1995; Merrill, 2020). This wide diversity of machines is due to different variants of transformers, especially whether a transformer decoder is allowed to take a number of intermediate steps before outputting a decision (§4.3.4), which dramatically increases its power (§6.1).

By contrast, investigation of upper bounds has mainly focused on *circuit complexity* (§5.2), which had been successfully related to feedforward networks before (Parberry, 1994; Siu et al., 1995; Beiu and Taylor, 1996; Šíma and Orponen, 2003). This line of research began with restricted models of transformer encoders and progressed to increasingly realistic variants and tighter bounds. One way to restrict transformers is by discretizing the attention mechanism (§4.2.1); another is to limit the precision of number representations (§4.4).

More recent work has turned to *formal logic* (§5.3) as a way of characterizing the expressive power of transformers. The finer control afforded by logics opens the possibility for them to be used as upper bounds, lower bounds, or both.

## 3 Preliminaries

### Sets

We write ℕ_{0} = {0,1,2,…} and ℕ = ℕ_{0} ∖ {0} for the sets of natural numbers with and without 0, respectively. We write [*n*] = {0,1,2,…, *n* −1} for any *n* ∈ ℕ. We write Σ for a finite alphabet, which, in NLP applications, is the set of words or subwords known to the model.

### Vectors

We use *d*, *d*′, etc., for dimensionalities of vector spaces, lowercase bold letters (**x**, **y**,…) for vectors, and uppercase bold letters (**X**, **Y**,…) for matrices. For any vector **x** ∈ℝ^{d}, we number its elements starting from 0. For *i* ∈ [*d*], we write **x**_{i} or [**x**]_{i} (not *x*_{i}) for the *i*-th component of **x**.

### Sequences

For any set *A*, we write *A*^{*} for the set of all finite sequences over *A*. We write the length of a sequence *s* ∈ *A*^{*} as |*s*| and number its elements starting from 0; thus, *s* = *s*_{0}*s*_{1}⋯*s*_{|s|−1}. We use the variable *w* for a string in Σ^{*} and *n* for the length of *w*. For sequences in ℝ^{*}, we use lowercase bold letters (**x**, **y**,…), and for sequences in (ℝ^{d})^{*}, we use the variable *X*.

A function $f\colon A^* \rightarrow B^*$ is *length-preserving* if |*f*(*w*)| = |*w*| for all *w* ∈ *A*^{*}. For every function $g\colon A \rightarrow B$, we denote its extension to sequences by *g* as well. That is, $g\colon A^* \rightarrow B^*$ is defined as follows: for all *s* ∈ *A*^{*} and *i* ∈ [|*s*|], *g*(*s*)_{i} = *g*(*s*_{i}).

### Neural Networks

An *affine transformation* is a function $L\colon \mathbb{R}^{d_{\text{in}}} \rightarrow \mathbb{R}^{d_{\text{out}}}$ parameterized by weights $\mathbf{W}_L \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$ and bias $\mathbf{b}_L \in \mathbb{R}^{d_{\text{out}}}$ such that for every $\mathbf{x} \in \mathbb{R}^{d_{\text{in}}}$, *L*(**x**) = **W**_{L}**x** +**b**_{L}. We say that *L* is *linear* if **b**_{L} = **0**.

The activation functions we use are the *rectified linear unit* (ReLU) $R(x) = \max(x, 0)$ and the logistic *sigmoid* function *σ*(*x*) = 1/(1 + *e*^{−x}).

The *softmax* function $\mathcal{S}\colon \mathbb{R}^* \rightarrow \mathbb{R}^*$ converts any sequence of reals into a probability distribution:

$$\mathcal{S}(\mathbf{s})_i = \frac{e^{\mathbf{s}_i}}{\sum_{j \in [|\mathbf{s}|]} e^{\mathbf{s}_j}} \qquad i \in [|\mathbf{s}|].$$
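As a concrete illustration (not part of the survey's formalism), softmax can be computed as below; subtracting the maximum before exponentiating is the standard trick for numerical stability and does not change the result:

```python
import math

def softmax(s):
    """Map a sequence of reals to a probability distribution."""
    m = max(s)                            # subtract max for numerical stability
    exps = [math.exp(x - m) for x in s]
    z = sum(exps)
    return [e / z for e in exps]

# A uniform input yields a uniform distribution.
assert softmax([0.0, 0.0]) == [0.5, 0.5]
```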

## 4 Transformers

In this section, we define transformers and relevant variants, and how transformers are used to describe formal languages. For additional background on transformers (not in relation to formal languages), Huang et al. (2022) give a lucid commentary on the original paper, Phuong and Hutter (2022) give formal definitions and pseudocode, and Lin et al. (2022) survey many variants of transformers.

### 4.1 Input Layer

The input layer of a transformer combines a *word embedding* $\mathrm{WE}\colon \Sigma \rightarrow \mathbb{R}^d$ and a *position(al) embedding* or *encoding* $\mathrm{PE}_n\colon [n] \rightarrow \mathbb{R}^d$ for *n* ∈ ℕ:

$$e(w)_i = \mathrm{WE}(w_i) + \mathrm{PE}_n(i) \qquad i \in [n].$$

In theoretical constructions, the word embedding can be any computable function.

### 4.2 Hidden Layers

A *transformer layer* is a length-preserving function $\mathcal{L}\colon (\mathbb{R}^d)^* \rightarrow (\mathbb{R}^d)^*$. There are two variants. The *post-norm* variant (Vaswani et al., 1017) is

$$\mathcal{L}(X) = \mathcal{N}_2(X' + \mathcal{F}(X')) \quad \text{where } X' = \mathcal{N}_1(X + \mathcal{A}(X)) \tag{1}$$

and the *pre-norm* variant (Wang et al., 2019) is

$$\mathcal{L}(X) = X' + \mathcal{F}(\mathcal{N}_2(X')) \quad \text{where } X' = X + \mathcal{A}(\mathcal{N}_1(X)) \tag{2}$$

where

- $\mathcal{A}$ is a multi-head self-attention with *d* input/output dimensions, *H* heads, and *d*_{kv} key/value dimensions per head;
- $\mathcal{F}$ is a feed-forward network (§4.2.2) with *d* input/output dimensions and *d*_{ff} hidden dimensions;
- $\mathcal{N}_1$ and $\mathcal{N}_2$ are layernorms with *d* dimensions.

We define each of these components below.

#### 4.2.1 Attention

Attention was initially developed to facilitate retrieval of previously processed data from a variable-length history (Bahdanau et al., 2015). Transformers use a simple variant of attention known as *scaled dot-product attention*.

##### Scaled Dot-product Attention

Scaled dot-product attention with *d* input/output dimensions and *d*_{kv} key/value dimensions is a function $\mathcal{A}\colon \mathbb{R}^d \times (\mathbb{R}^d)^* \rightarrow \mathbb{R}^d$ parameterized by linear transformations

$$L^{Q}_\mathcal{A}, L^{K}_\mathcal{A}\colon \mathbb{R}^d \rightarrow \mathbb{R}^{d_{kv}} \qquad L^{V}_\mathcal{A}\colon \mathbb{R}^d \rightarrow \mathbb{R}^{d_{kv}}$$

and defined for every **z** ∈ ℝ^{d}, *X* ∈ (ℝ^{d})^{*} (with |*X*| = *n*), and *j* ∈ [*n*] as

$$s(\mathbf{z}, X)_j = \frac{L^{Q}_\mathcal{A}(\mathbf{z}) \cdot L^{K}_\mathcal{A}(X_j)}{\sqrt{d_{kv}}} \tag{3}$$

$$\mathcal{A}(\mathbf{z}, X) = \sum_{j \in [n]} \mathcal{S}(s(\mathbf{z}, X))_j \, L^{V}_\mathcal{A}(X_j). \tag{4}$$

The vector **z** is the query, computed from the attention's *first* argument. In *cross*-attention, **z** is computed by the decoder while *X* is computed by the encoder. In *self*-attention, the two arguments are identical: $\mathcal{A}(X)_i = \mathcal{A}(X_i, X)$.

##### Attention Masking

In *future masked* (also known as *causally* masked) self-attention, a term *m*(*i*, *j*) is added to Eq. (3) to force every position to attend only to preceding positions:

$$m(i,j) = \begin{cases} 0 & \text{if } j \le i \\ -\infty & \text{otherwise.} \end{cases}$$

Some papers use *strict* future masking, that is, *m*(*i*, *j*) = 0 iff *j* < *i*, and occasionally *past* masking (*j* ≥ *i*) and strict past masking (*j* > *i*).
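A minimal sketch of scaled dot-product self-attention with optional future masking, in plain Python. As a simplifying assumption of this sketch (not part of the definition), the query/key/value projections are identity maps; `softmax` is as defined in §3:

```python
import math

def softmax(s):
    m = max(s)
    exps = [math.exp(x - m) for x in s]
    z = sum(exps)
    return [e / z for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def self_attention(X, future_mask=False):
    """Scaled dot-product self-attention over a sequence X of d-vectors.
    Query/key/value projections are identity maps for brevity."""
    d_kv = len(X[0])
    out = []
    for i, z in enumerate(X):             # z is the query at position i
        scores = []
        for j, x in enumerate(X):
            s = dot(z, x) / math.sqrt(d_kv)
            if future_mask and j > i:     # m(i, j) = -inf for j > i
                s = -math.inf
            scores.append(s)
        alpha = softmax(scores)           # exp(-inf) = 0: masked out
        out.append([sum(a * x[k] for a, x in zip(alpha, X))
                    for k in range(d_kv)])
    return out
```

With future masking, position 0 can only attend to itself, so its output equals its input: `self_attention([[1.0, 0.0], [0.0, 1.0]], future_mask=True)[0]` is `[1.0, 0.0]`.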

##### Multi-head Attention

Multi-head attention with *H* heads and *d*_{kv} key/value dimensions per head is the sum of *H* attentions with *d*_{kv} key/value dimensions:

$$\mathcal{A}(\mathbf{z}, X) = \sum_{h \in [H]} \mathcal{A}_h(\mathbf{z}, X)$$

where each head $\mathcal{A}_h$ has its own parameters and its output is mapped back into $\mathbb{R}^d$. This is equivalent to the more common formulation in which the head outputs are concatenated and passed through a single output projection $W^{O}_{\mathcal{A}}$.

##### Hard Attention

For any **s** ∈ ℝ^{*}, let $M(\mathbf{s}) = \{ i \in [|\mathbf{s}|] \mid \forall j \in [|\mathbf{s}|],\ \mathbf{s}_j \le \mathbf{s}_i \}$ be the set of indices of the maximal elements of **s**. In *leftmost-argmax*, the leftmost maximal element is used:

$$\mathcal{S}_h(\mathbf{s})_i = \mathbb{I}[i = \min M(\mathbf{s})]$$

whereas in *average-argmax* the maximal elements share weight equally:

$$\mathcal{S}_a(\mathbf{s})_i = \frac{\mathbb{I}[i \in M(\mathbf{s})]}{|M(\mathbf{s})|}.$$

By substituting $Sh$ or $Sa$ for $S$ in Eq. (4), we get *leftmost-hard* and *average-hard* attention, respectively. Leftmost-hard attention was previously called *hard* attention by Hahn (2020) and *unique hard* attention by Hao et al. (2022). One may also consider *rightmost-hard* attention, in which the rightmost maximal element is used. Average-hard attention was also called *hard* attention by Pérez et al. (2021) and *saturated* attention by Merrill et al. (2022), and has been argued to be a realistic approximation to how trained transformers behave in practice (Merrill et al., 2021).
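The two argmax variants can be transcribed directly from their definitions; the function names below are our own, chosen for the sketch:

```python
def leftmost_argmax(s):
    """S_h: all weight on the leftmost maximal element."""
    m = max(s)
    i = s.index(m)                        # index of the leftmost maximum
    return [1.0 if j == i else 0.0 for j in range(len(s))]

def average_argmax(s):
    """S_a: the maximal elements share weight equally."""
    m = max(s)
    M = [j for j, x in enumerate(s) if x == m]
    return [1.0 / len(M) if j in M else 0.0 for j in range(len(s))]

assert leftmost_argmax([1.0, 3.0, 3.0]) == [0.0, 1.0, 0.0]
assert average_argmax([1.0, 3.0, 3.0]) == [0.0, 0.5, 0.5]
```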

#### 4.2.2 Feed-forward Networks

A *feed-forward network* (FFN) with *d* input/output dimensions and *d*_{ff} hidden dimensions is a function $\mathcal{F}\colon \mathbb{R}^d \rightarrow \mathbb{R}^d$ parameterized by two affine transformations, $L^{1}_\mathcal{F}\colon \mathbb{R}^d \rightarrow \mathbb{R}^{d_{\text{ff}}}$ and $L^{2}_\mathcal{F}\colon \mathbb{R}^{d_{\text{ff}}} \rightarrow \mathbb{R}^d$, such that

$$\mathcal{F}(\mathbf{x}) = L^{2}_\mathcal{F}(R(L^{1}_\mathcal{F}(\mathbf{x})))$$

where $R$ is applied component-wise.

#### 4.2.3 Layer Normalization

A *d*-dimensional *layer normalization* (Ba et al., 2016), or *layernorm* for short, is a function $\mathcal{N}\colon \mathbb{R}^d \rightarrow \mathbb{R}^d$ parameterized by vectors $\gamma_\mathcal{N}, \beta_\mathcal{N} \in \mathbb{R}^d$ and scalar $\epsilon_\mathcal{N} \ge 0$:

$$\mathcal{N}(\mathbf{x}) = \gamma_\mathcal{N} \odot \frac{\mathbf{x} - \bar{\mathbf{x}}}{\sqrt{\mathrm{var}(\mathbf{x}) + \epsilon_\mathcal{N}}} + \beta_\mathcal{N}$$

where $\bar{\mathbf{x}}$ is the mean of the components of **x**, $\mathrm{var}(\mathbf{x})$ is their variance, and ⊙ is component-wise multiplication.

The original definition of layernorm (Ba et al., 2016) sets $\epsilon_\mathcal{N} = 0$, but, for numerical stability, all implementations we are aware of set $\epsilon_\mathcal{N} > 0$. Observe that $\mathcal{N}$ is Lipschitz-continuous iff $\epsilon_\mathcal{N} > 0$.
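A direct transcription of layernorm as a sketch; the stabilizer `eps` inside the square root plays the role of $\epsilon_\mathcal{N}$, and its default value here is an arbitrary choice for illustration:

```python
import math

def layernorm(x, gamma, beta, eps=1e-5):
    """d-dimensional layer normalization with stabilizer eps >= 0."""
    d = len(x)
    mean = sum(x) / d
    var = sum((v - mean) ** 2 for v in x) / d
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

# With gamma = 1 and beta = 0, the output has (near-)zero mean.
y = layernorm([1.0, 2.0, 3.0], gamma=[1.0] * 3, beta=[0.0] * 3)
assert abs(sum(y)) < 1e-9
```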

### 4.3 Networks and Output Layers

We now define a complete transformer network.

#### 4.3.1 Transformer Encoders

A *transformer encoder* is a length-preserving function $\mathcal{T}\colon \Sigma^* \rightarrow (\mathbb{R}^d)^*$ parameterized by the weights of an input layer *e* and *D* transformer layers $\mathcal{L}_1, \dots, \mathcal{L}_D$. A *post-norm* transformer encoder is

$$\mathcal{T}(w) = (\mathcal{L}_D \circ \dots \circ \mathcal{L}_1 \circ e)(w)$$

where each $\mathcal{L}_l$ is a post-norm layer (1) and ∘ is function composition. A *pre-norm* transformer encoder is additionally parameterized by the weights of a final layernorm $\mathcal{N}$ and is defined as

$$\mathcal{T}(w) = (\mathcal{N} \circ \mathcal{L}_D \circ \dots \circ \mathcal{L}_1 \circ e)(w)$$

where each $\mathcal{L}_l$ is a pre-norm layer (2).

The output of an encoder is a sequence of vectors in (ℝ^{d})^{*}. To use it as a language recognizer, we add an output layer that converts $\mathcal{T}(w)$ to a probability

$$\hat{p} = \sigma(\mathbf{w} \cdot \mathcal{T}(w)_i + b)$$

where **w** ∈ ℝ^{d}, *b* ∈ ℝ, and *i* is a distinguished position. The encoder accepts iff $\hat{p} \ge \frac{1}{2}$.

Chiang and Cholak (2022) also consider a requirement that an encoder accepts/rejects strings with bounded cross-entropy. That is, we say that an encoder recognizes a language *L* with cross-entropy at most *η* iff for all strings *w*, if *w* ∈ *L* then $-\log \hat{p} \le \eta$, and if *w* ∉ *L* then $-\log(1 - \hat{p}) \le \eta$.

We are aware of two choices for the distinguished position *i*. Most papers use the last position (*i* = *n* −1), but some (Chiang and Cholak, 2022; Chiang et al., 2023), inspired by binary classifiers based on BERT (Devlin et al., 2019), prepend a special symbol CLS at position 0 and use *i* = 0. While this is a minor difference, it should be noted that the guarantee of exactly one occurrence of CLS in the input can be useful in some constructions.

#### 4.3.2 Transformer Decoders

A *transformer decoder* is a transformer encoder $\mathcal{T}$ with future masking in its attention, typically used to generate rather than recognize strings. The input is the prefix of previously generated symbols, $w_{<t} = w_0 \cdots w_{t-1}$, and the output is a probability distribution $\hat{p}(w_t \mid w_{<t})$ over the next symbol:

$$\hat{p}(\cdot \mid w_{<t}) = \mathcal{S}(\mathbf{W}\, \mathcal{T}(w_{<t})_{t-1} + \mathbf{b})$$

where **W** ∈ ℝ^{|Σ|×d} and **b** ∈ ℝ^{|Σ|}. We assume *w*_{0} = BOS and every string ends with EOS, where BOS and EOS are special symbols that do not occur anywhere else. To sample a string, we first sample *w*_{1} from $\hat{p}(w_1 \mid \text{BOS})$, then, for each time step *t* > 1, sample *w*_{t} from $\hat{p}(w_t \mid w_{<t})$. The process stops when *w*_{t} = EOS. Because each sampled output symbol becomes part of the input at the next time step, this kind of model is called *autoregressive*.
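The autoregressive sampling loop can be sketched as follows. Here `next_distribution` is a stand-in for the decoder's output layer (not an API from any library), and the toy distribution at the end is purely illustrative:

```python
import random

BOS, EOS = "<bos>", "<eos>"

def sample_string(next_distribution, max_len=100, seed=0):
    """Sample w_1, w_2, ... autoregressively until EOS is produced."""
    rng = random.Random(seed)
    w = [BOS]
    while len(w) < max_len:
        symbols, probs = next_distribution(w)    # p-hat(. | w_<t)
        w.append(rng.choices(symbols, weights=probs)[0])
        if w[-1] == EOS:
            break
    return w

# Toy next-symbol distribution: always 50/50 between "a" and EOS.
toy = lambda prefix: (["a", EOS], [0.5, 0.5])
s = sample_string(toy)
assert s[0] == BOS and (s[-1] == EOS or len(s) == 100)
```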

While a decoder can be used to recognize strings similarly to an encoder, it can also be used to generate the entire string; at least two definitions have been given for this.

The first definition, used by Hahn (2020), assumes a true distribution *p*(*w*). For any length *t*, the KL divergence (relative entropy) of the model $\hat{p}(w)$ from the true distribution *p*(*w*), for predicting *w*_{t} conditioned on all previous words, is

$$\Delta_t[\hat{p} \,\|\, p] = \sum_{w_{<t} w_t} p(w_{<t} w_t) \log \frac{p(w_t \mid w_{<t})}{\hat{p}(w_t \mid w_{<t})}.$$

Hahn's results concern generation with *vanishing KL*, that is, $\lim_{t \to \infty} \Delta_t[\hat{p} \,\|\, p] = 0$.

Under the second definition (Yao et al., 2021), a transformer decoder *T* *ε-generates* *L* iff

$$L = \{ w \mid \text{for all } t \in [|w|],\ \hat{p}(w_t \mid w_{<t}) \ge \epsilon \}$$

and *T* generates a language *L* iff there exists an *ε* > 0 such that *T* *ε*-generates *L*. (This means that a transformer decoder may generate more than one language, depending on the *ε* chosen.) They also show that any *ε*-generator can be converted into a recognizer.

While not focusing on transformers, Lin et al. (2021) demonstrate limitations of autoregressive models for generation; for example, that there is a language *L* ∈ P that cannot be *ε*-generated in polynomial time for any *ε* > 0 if P≠NP.

#### 4.3.3 Transformer Encoder–Decoders

A *transformer encoder–decoder* combines a transformer encoder and decoder, adding to each layer of the decoder an additional attention sublayer, known as *cross attention*, which attends to the output of the encoder. In the literature surveyed here, only the construction of Pérez et al. (2021) and related constructions (Bhattamishra et al., 2020b; Wei et al., 2022a) employ an encoder–decoder.

#### 4.3.4 Intermediate Steps

When a transformer decoder or encoder–decoder is run as a language recognizer, it allows for the possibility of inserting a number of *intermediate* time steps between the end of the input string and the decision. The encoder–decoder models above do this, as do some decoder-only models (Feng et al., 2023; Merrill and Sabharwal, 2024). As we will see (§6.1), intermediate steps vastly increase the model’s power, which has also been observed in practice in the form of a “scratchpad” (Nye et al., 2021) or “chain of thought” (Wei et al., 2022b).

### 4.4 Uniformity and Precision

Although meaningful theoretical claims can be made about transformers for fixed-length strings (e.g., Yun et al., 2020), it is crucial when examining transformers as language recognizers to allow for unbounded string length. Fixing a maximum length makes all languages finite, collapsing many language classes into one.

It might be objected that considering unbounded lengths is too abstract, because in practice one can always fix a maximum length. But this maximum length, driven by practical needs, is growing steadily: for example, GPT-4 Turbo uses 128,000 tokens of context. At the same time, some theoretical findings surveyed here seem to have practical consequences for modest string lengths. For example, we will see that there are reasons to think that in theory, transformers cannot recognize Parity; in practice, they fail to learn Parity for strings with lengths in [2,50] (Bhattamishra et al., 2020a).

#### Numeric Precision

Transformers operate, in principle, on real numbers. While hard attention transformers could be defined using only rational numbers, even rational numbers can represent an arbitrary amount of information. With RNNs, the use of real or rational numbers has led to results that make them appear more powerful in theory than in practice (Siegelmann and Sontag, 1994, 1995; Weiss et al., 2018).

Consequently, many studies use limited-precision numbers. Some studies limit number representations to have *O*(1) bits, as floating-point numbers do in practice (Chiang et al., 2023). But Merrill and Sabharwal (2023b) argue that in *O*(1) precision, attention cannot attend uniformly to a string of sufficient length *n*, as the attention weights (*α*) would all round down to zero. So $O(\log n)$ bits of precision is a common choice (Yao et al., 2021; Merrill and Sabharwal, 2023a, b). Other choices are possible as well: Merrill and Sabharwal (2023a) use the set $F = \{ a/2^b \mid a \in \mathbb{Z}, b \in \mathbb{N} \}$.

Restricting intermediate activations to limited precision introduces many decisions about when and how rounding should take place, which can potentially affect expressivity. For example, when summing *n* numbers, one could round after each addition or only at the end of the summation. Better formalizing these decisions and their impact on expressivity is an area for future research.

#### Parameters

A few constructions allow the parameters themselves to depend on *n*, which we consider to be a stronger dependence, because if these transformers were to be learned from data, different transformers would have to be learned for different maximum lengths. Finally, a few papers construct transformers in which *d*, and therefore the number of parameters, depends on *n*, which we consider to be stronger still.

### 4.5 Summary

In summary, transformers can vary in at least the following ways, any of which could *a priori* impact theoretical claims:

- Architecture: encoder-only, decoder-only, or encoder–decoder
- For encoders: definition of recognition
- For decoders and encoder–decoders: definition of generation and how many intermediate steps
- Position embedding (PE)
- Attention pattern: leftmost-hard, rightmost-hard, average-hard, or softmax
- Attention masking: none, future, or past
- Layernorm: inclusion or omission, value of $\epsilon_\mathcal{N}$
- Residual connections: pre-norm or post-norm
- Precision: infinite, $O(\log n)$, or *O*(1)
- Uniformity: whether parameter values or number of parameters depend on *n*.

## 5 Languages and Language Classes

Next, we present various formal models that transformers are compared to in the literature surveyed.

### 5.1 Automata and Classes L, NL, P

We assume familiarity with finite automata and Turing machines; for definitions, please see the textbook by Sipser (2013). Counter machines are automata with integer-valued registers (Fischer et al., 1968); they have been studied extensively in connection with LSTM RNNs (Weiss et al., 2018; Suzgun et al., 2019; Merrill, 2019, 2020).

### 5.2 Circuits and Classes AC^{0}, ACC^{0}, TC^{0}, NC^{1}

Circuits are a model of parallel computation particularly relevant to transformers. For more details, please see the textbook by Arora and Barak (2009).

Circuits operate on binary values. If we choose a fixed-length encoding of the symbols of Σ as strings of $b = \lceil \log_2 |\Sigma| \rceil$ bits, then a circuit can simulate input alphabet Σ by encoding the value of the *i*-th input symbol into positions *ib* to *ib* + (*b* −1). For the rest of this section, we assume Σ = {0,1}.

#### Circuits

A *circuit* *C* with input length *n* is a directed acyclic graph with *n* *input* vertices *s*_{1},…, *s*_{n} and zero or more *gate* vertices, each labeled with a *type* NOT, AND, or OR. Input vertices have fan-in (in-degree) zero, NOT gates have fan-in one, and the fan-in of AND and OR gates can be either two or unbounded. One (input or gate) vertex *t* is designated the *output* of the circuit.

Given an input string *w* ∈{0,1}^{n}, each input vertex *s*_{i} is assigned the value *w*_{i}, and each gate vertex is assigned the value computed by applying the logical function corresponding to its type to the values assigned to its in-neighbors. The circuit computes the Boolean function $C:{0,1}n\u2192{0,1}$, mapping each input string to the value assigned to *t*. The *depth* of *C*, denoted *D*(*C*), is the length of the longest directed path from any *s*_{i} to *t*. The *size* of *C*, denoted |*C*|, is the number of vertices in *C*.
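Evaluating a circuit amounts to processing vertices in topological order; the explicit-list representation below is our own choice for this sketch, not a standard format:

```python
def eval_circuit(n, gates, output, w):
    """Evaluate a circuit on input string w of length n.
    gates: list of (gate_type, in_neighbors); vertices 0..n-1 are the
    inputs, and the k-th gate is vertex n+k. Gates must be listed in
    topological order (each gate's inputs precede it)."""
    assert len(w) == n
    val = [c == "1" for c in w]
    for gate_type, ins in gates:
        xs = [val[i] for i in ins]
        if gate_type == "NOT":
            val.append(not xs[0])
        elif gate_type == "AND":
            val.append(all(xs))
        elif gate_type == "OR":
            val.append(any(xs))
    return val[output]

# Circuit for x0 AND (NOT x1): vertex 2 = NOT x1, vertex 3 = AND(x0, v2).
gates = [("NOT", [1]), ("AND", [0, 2])]
assert eval_circuit(2, gates, 3, "10") is True
assert eval_circuit(2, gates, 3, "11") is False
```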

#### Circuit Families

A *circuit family* is a sequence $\mathcal{C} = \{C_n\}_{n \in \mathbb{N}}$ such that for each *n*, *C*_{n} is a circuit with input length *n*. We treat $\mathcal{C}$ as a function on {0,1}^{*} as follows: for every *w* ∈ {0,1}^{*}, $\mathcal{C}(w) = C_{|w|}(w)$. Then $\mathcal{C}$ defines the language $L(\mathcal{C}) = \{ w \in \{0,1\}^* \mid \mathcal{C}(w) = 1 \}$, and we say that $\mathcal{C}$ recognizes $L(\mathcal{C})$. The *depth* and *size* of $\mathcal{C}$ are the functions *n* ↦ *D*(*C*_{n}) and *n* ↦ |*C*_{n}|.

#### Uniformity

As defined, a circuit family contains a different circuit for each length *n*, with no constraint on the relationship between the circuits. For example, let *L* be any *unary* language: *L* ⊆{1}^{*}. For *n* ∈ℕ, if 1^{n}∉*L*, define *C*_{n} to be a circuit for the constant 0 function (an OR gate with fan-in 0), and if 1^{n} ∈ *L*, define *C*_{n} to be a circuit for the AND of all the inputs. Thus, every unary language, even an undecidable one, is recognized by a circuit family of size *O*(*n*) and depth *O*(1).

A uniformity restriction on a circuit family {*C*_{n}}_{n∈ℕ} requires that the task of constructing a description of the circuit *C*_{n} given input *n* be computable within some specified resource bound as a function of *n*, potentially making it comparable with classes defined by bounds on Turing machine time or space. Two such uniformity bounds are used in the work here: L and DLOGTIME. Because these bounds are very restrictive, a special representation of the circuit *C*_{n} is used, namely, the ability to answer queries of the type of a gate and whether the output of one gate is an input to another gate.

We assume that the vertices of the circuit *C*_{n} are numbered from 0 to |*C*_{n}|−1. The *direct connection language* of a family of circuits $\mathcal{C}$ is the set of all tuples ⟨*f*, *i*, *j*, 1^{n}⟩ such that in *C*_{n}, vertex *i* has type *f* and there is an edge from vertex *i* to vertex *j* (Barrington et al., 1990). Given a computable function bounding the size of $\mathcal{C}$ and access to a membership oracle for the direct connection language, for any *n* it is straightforward to write out the list of vertices, edges, and types in *C*_{n}.

Then a circuit family $C$ is L-*uniform* (resp., DLOGTIME-*uniform*) if there is a Turing machine that runs in logarithmic space (resp., deterministic logarithmic time) to decide membership in the direct connection language of $C$.

#### Circuit Complexity Classes

Circuit complexity classes classify circuit families and the languages they recognize based on uniformity, depth, size, fan-in bound, and the allowed gates. Since transformers have constant depth, circuit classes with constant depth are of particular interest; the classes that are used in the work we survey are:

- AC^{0} contains those languages that can be recognized by families of circuits with unbounded fan-in, constant depth, and polynomial size.
- ACC^{0} is like AC^{0}, but additionally allows gates that output 1 iff the inputs sum to 0 modulo some constant.
- TC^{0} is like AC^{0}, but additionally allows MAJORITY gates, which have unbounded fan-in and output 1 iff at least half of their inputs are 1.
- NC^{1} is like AC^{0}, but with fan-in at most 2 and depth in $O(\log n)$.

These classes are related by the inclusions AC^{0} ⊆ ACC^{0} ⊆ TC^{0} ⊆ NC^{1} ⊆ L.

### 5.3 Logic

A formal language can also be defined as a set of finite strings that satisfy a closed formula of a logic. For more details, refer to Thomas (1997) or Straubing (1994).

In the *first-order logic of strings*, or FO, the formulas are the smallest set containing:

- Variables *x*, *y*, and so on.
- Atomic formulas *Q*_{a}(*x*), *x* = *y*, *x* < *y*, where *a* ∈ Σ is a symbol and *x*, *y* are variables.
- *ϕ*_{1} ∧ *ϕ*_{2}, *ϕ*_{1} ∨ *ϕ*_{2}, *ϕ*_{1} → *ϕ*_{2}, ¬*ϕ*_{1}, where *ϕ*_{1} and *ϕ*_{2} are formulas.
- ∀*x*.*ϕ* and ∃*x*.*ϕ*, where *x* is a variable and *ϕ* is a formula.

Under the intended interpretation, variables stand for positions of a finite string *w*, and *Q*_{a}(*x*) is true iff *w*_{x} = *a*. For example, if Σ = {*a*, *b*}, the formula ∀*x*.∀*y*. *Q*_{a}(*x*) ∧ *Q*_{b}(*y*) → *x* < *y* defines the regular language *a*^{*}*b*^{*}. The language defined by a closed formula *ϕ* consists of those strings that satisfy *ϕ*.
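The correspondence between this example formula and *a*^{*}*b*^{*} can be checked by brute force over all pairs of positions; this evaluator handles only this one formula and is purely illustrative:

```python
def satisfies(w):
    """Check w against the formula: for all x, y, Qa(x) and Qb(y) imply x < y."""
    n = len(w)
    return all((not (w[x] == "a" and w[y] == "b")) or x < y
               for x in range(n) for y in range(n))

def in_a_star_b_star(w):
    """Direct check of membership in a*b*."""
    i = 0
    while i < len(w) and w[i] == "a":
        i += 1
    return all(c == "b" for c in w[i:])

# The two definitions agree on a handful of strings over {a, b}.
for w in ["", "a", "b", "aab", "aabbb", "abab", "ba"]:
    assert satisfies(w) == in_a_star_b_star(w)
```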

The languages definable in FO are exactly the *star-free* languages (McNaughton and Papert, 1971). Other variants add more quantifiers: for example, FOC adds counting quantifiers, and FOM adds majority quantifiers.

We are also interested in various sets of predicates:

- Modular predicates $\mathrm{MOD}^r_m(x)$, which hold iff *x* ≡ *r* (mod *m*) (Barrington et al., 1992).
- BIT(*x*, *y*), which holds iff the *y*-th bit of *x* is 1.
- Mon, the set of all predicates on one position, possibly depending on *n*.
- ARB, the set of all predicates on one or more positions.

A logic extended with predicates is conventionally written with the predicates in square brackets; for example, we write FO[BIT] for first-order logic with the BIT predicate.

In *linear temporal logic* or LTL (Kamp, 1968), every formula implicitly depends on a single time (or position). There are atomic formulas *Q*_{a} for every *a* ∈ Σ, the connectives ∧, ∨, and ¬, as well as operators **since** and **until**. The formula *α***since***β* is true iff *α* was true at some past time *i* and *β* was true from *i* to now (exclusive). LTL is equivalent to FO (Kamp, 1968).

### 5.4 Relationships

Figure 1, which depicts the relationships between the language classes defined above, shows that the classes defined by circuits/logics cut across the (perhaps more familiar) Chomsky hierarchy. In this figure and in this section, all circuit classes are understood to be DLOGTIME-uniform unless specified otherwise.

#### 5.4.1 Beyond AC^{0}

The classic examples of languages not in AC^{0} are Parity and Majority. The language Parity ⊆{0,1}^{*} contains all bit strings containing an odd number of 1’s, and Majority ⊆{0,1}^{*} consists of all bit strings in which more than half of the bits are 1’s. Other problems in TC^{0} but not AC^{0} include sorting, integer multiplication (Chandra et al., 1984), and integer division (Hesse, 2001).
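Membership in both languages is trivial to decide sequentially, as sketched below; the difficulty lies in recognizing them with constant-depth, polynomial-size circuits:

```python
def in_parity(w):
    """w is in Parity iff w contains an odd number of 1's."""
    return w.count("1") % 2 == 1

def in_majority(w):
    """w is in Majority iff more than half of the bits are 1's."""
    return w.count("1") * 2 > len(w)

assert in_parity("0111") and not in_parity("0110")
assert in_majority("0111") and not in_majority("0011")
```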

##### Dyck Languages

The language Dyck-*k* for *k* > 0 is the language of strings over *k* pairs of parentheses that are correctly balanced and nested. If we write the *i*-th parenthesis pair as (_{i} )_{i} for each *i* ∈ [*k*], then Dyck-*k* is generated by the context-free grammar $\{ S \rightarrow (_i\, S\, )_i\, S \mid i \in [k] \} \cup \{ S \rightarrow \varepsilon \}$. These languages are of interest because any context-free language can be obtained by applying a string homomorphism to the intersection of a Dyck language with a regular language (Chomsky and Schützenberger, 1963).

Some papers surveyed here consider variations on Dyck languages. The language Dyck-(*k*, *D*) for *D* > 0 is the subset of Dyck-*k* consisting of strings with maximum nesting depth *D*; it is a star-free regular language (and therefore in AC^{0}).

The language Shuffle-Dyck-*k* is the set of strings over *k* pairs of parentheses in which, for each parenthesis pair, erasing the other types of parentheses leaves a correctly balanced and nested string. For example, [(()]) is in Shuffle-Dyck-2. If *k* > 1, Shuffle-Dyck-*k* is not context free.
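Both language families have simple sequential deciders: a stack for Dyck-*k* and one counter per parenthesis pair for Shuffle-Dyck-*k*. An illustrative sketch for *k* = 2, writing the two pairs as `()` and `[]`:

```python
PAIRS = {"(": ")", "[": "]"}

def in_dyck(w):
    """Dyck-2: balanced and properly nested over both pairs."""
    stack = []
    for c in w:
        if c in PAIRS:
            stack.append(PAIRS[c])
        elif not stack or stack.pop() != c:
            return False
    return not stack

def in_shuffle_dyck(w):
    """Shuffle-Dyck-2: each pair balanced on its own, ignoring the other."""
    counts = {o: 0 for o in PAIRS}
    closer = {v: k for k, v in PAIRS.items()}
    for c in w:
        if c in PAIRS:
            counts[c] += 1
        else:
            counts[closer[c]] -= 1
            if counts[closer[c]] < 0:     # closed before opened
                return False
    return all(v == 0 for v in counts.values())

# The example from the text: [(()]) is in Shuffle-Dyck-2 but not Dyck-2.
assert in_shuffle_dyck("[(()])") and not in_dyck("[(()])")
```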

#### 5.4.2 Beyond TC^{0}

As we will see (§6.3.2), some transformer variants lie within TC^{0}. What problems lie beyond?

##### The Word Problem for Permutation Groups

A permutation of [*k*] is a bijection $\pi\colon [k] \rightarrow [k]$, and *S*_{k} is the set of all permutations of [*k*]. Treating *S*_{k} as an alphabet and compositions of permutations as strings, we can define the language W(*S*_{k}) of compositions of permutations of [*k*] that equal the identity permutation. For example, in *S*_{3}, the permutation (120) maps 0↦1, 1↦2, and 2↦0, so that W(*S*_{3}) contains (120) ∘ (120) ∘ (120) but not (120) ∘ (120). These languages are easy for finite automata to recognize, but difficult with only fixed computation depth. Indeed, W(*S*_{5}) is complete for NC^{1} under AC^{0} reductions (Barrington, 1989), so it is not in TC^{0}, assuming that TC^{0} ⊊ NC^{1} (as is widely believed). This makes it an example of a regular language that transformer encoders probably cannot recognize.

The languages W(*S*_{k}) have some relevance to natural language: they resemble expressions like *the child of the enemy of Ann* where the interpretation of *the child of* is (roughly) a permutation of possible referents (Paperno, 2022), and problems that have been used to benchmark transformers’ state-tracking abilities (Kim and Schuster, 2023).
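To make the word problem concrete, here is a minimal sketch (function names are ours) that represents a permutation of [*k*] as a tuple and checks whether a composition of permutations equals the identity:

```python
def compose(p, q):
    """Composition of permutations given as tuples: (p o q)(x) = p(q(x))."""
    return tuple(p[q[x]] for x in range(len(p)))

def in_word_problem(word, k):
    """Membership in W(S_k): does the sequence compose to the identity?"""
    acc = tuple(range(k))          # identity permutation
    for p in word:
        acc = compose(p, acc)
    return acc == tuple(range(k))
```

With `pi = (1, 2, 0)` (the 3-cycle 0↦1, 1↦2, 2↦0 from the example above), `[pi, pi, pi]` is in W(*S*_{3}) but `[pi, pi]` is not.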

##### Other Languages

Other languages that are widely believed to be outside TC^{0} include:

- The language of closed Boolean formulas that are true (BFVP) is context-free but complete for NC^{1} under DLOGTIME reductions (Buss, 1987), so it is outside TC^{0} if TC^{0} ⊊ NC^{1}.
- Undirected graph connectivity is L-complete under L-uniform NC^{1} reductions (Cook and McKenzie, 1987; Reingold, 2008), so it is outside L-uniform NC^{1} (and therefore outside TC^{0}) if L-uniform NC^{1} ⊊ L.
- There is a context-free language *L*_{P} that is NL-complete under L reductions (Sudborough, 1975), so it is outside L (and therefore outside NC^{1} and TC^{0}) if L ⊊ NL.
- Solving systems of linear equalities and universal context-free grammar recognition are P-complete under L reductions (Jones and Laaser, 1976; Greenlaw et al., 1995), so they are outside TC^{0} if L ⊊ P.
- Matrix permanent is known to be outside of TC^{0} (Allender, 1999).
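As an illustration of BFVP, the following recursive-descent evaluator decides it. The concrete syntax here ('0', '1', '!', '&', '|', parentheses) is an assumption for illustration, not the encoding used by Buss (1987); note that the sequential recursion is exactly the kind of computation a constant-depth threshold circuit is believed unable to mimic:

```python
def eval_bf(s, i=0):
    """Evaluate a closed Boolean formula starting at index i.

    Grammar (illustrative): F -> '0' | '1' | '!' F | '(' F ('&'|'|') F ')'.
    Returns (value, index just past the parsed formula)."""
    c = s[i]
    if c in "01":
        return c == "1", i + 1
    if c == "!":
        v, j = eval_bf(s, i + 1)
        return not v, j
    # otherwise c == "(": a binary connective
    left, j = eval_bf(s, i + 1)
    op = s[j]
    right, k = eval_bf(s, j + 1)
    # s[k] is the closing ")"
    return ((left and right) if op == "&" else (left or right)), k + 1
```

For example, `eval_bf("((1&0)|!0)")` returns `(True, 10)`.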

#### 5.4.3 Circuits and Logics

DLOGTIME-uniform AC^{0} and TC^{0} are equivalent to FO[BIT] and FOM[BIT], respectively. There are many such equivalences between circuit classes and logics. As a rule of thumb, adding unbounded fan-in gates to a circuit family correlates with adding quantifiers to the corresponding logic, and increasing the degree of non-uniformity of a circuit family correlates with adding numerical predicates to the corresponding logic (Barrington and Immerman, 1994). For example, making AC^{0} and TC^{0} completely non-uniform corresponds to adding arbitrary numerical predicates (ARB) to FO and FOM, respectively (Immerman, 1997; Barrington et al., 1990).

As we will see below, circuits and logics each have advantages and disadvantages for capturing the expressivity of transformers. An advantage of the circuit approach is that circuits bear a more transparent resemblance to transformers: transformers are computations of bounded depth, so it is not hard to see that they should be computable by circuit families of bounded depth (AC^{0} or TC^{0}). On the other hand, an advantage of the logical approach is that if we seek an exact characterization of transformers, it can be easier in a logic to add or remove quantifiers or predicates, to limit quantifier depth or the number of variables, to partition terms into different sorts, and so on, than to make the corresponding adjustments to a circuit family.

## 6 Current Results

While this area of research still has many unresolved questions, the emerging picture has three levels of expressivity. At the upper end are decoders or encoder–decoders with intermediate steps; these are equivalent to Turing machines (§6.1). At the lower end are encoders with leftmost-hard or rightmost-hard attention; these can recognize only languages in AC^{0} (§6.2). In the middle are encoders with average-hard or softmax attention, which are the least well-understood but appear to lie between AC^{0} and TC^{0} (§6.3).

In this section, “transformer” refers to a transformer encoder unless otherwise indicated.

### 6.1 Decoders with Intermediate Steps

Pérez et al. (2021) consider transformer encoder–decoders with several modifications to the standard architecture.

As described above (§4.3.3), the decoder is allowed to run for arbitrarily many time steps until an acceptance criterion is met. Under these assumptions, transformer encoder–decoders can recognize any recursively enumerable language.^{3} This result uses arbitrary precision, but as a corollary, it shows that a *T*(*n*)-time-bounded Turing machine can be simulated in a transformer using $O(\log T(n))$ precision and *O*(*T*(*n*)) intermediate steps.

Bhattamishra et al. (2020b) provide a simpler proof of Pérez et al.'s result by reducing to an RNN and appealing to the construction of Siegelmann and Sontag (1995). They do this for two sets of assumptions. First,

- The PE includes only *i*.
- The self-attention sublayers are as above.
- The FFNs use saturated linear activation functions: $\sigma(x) = \max(0, \min(1, x))$.

Second, they show the same with no PE and standard dot-product attention with future masking.

Wei et al. (2022a) define a notion of *statistically meaningful* (SM) approximation and show that transformer encoder–decoders SM-approximate Turing machines. Both the decoder and Turing machine are limited to *N* time steps; additionally,

- The PE can be an arbitrary computable function on [*N*].
- Attention is average-hard.
- The FFNs have three ReLU layers.

Feng et al. (2023) observe that the problems of evaluating arithmetic expressions or solving linear equations over ℤ_{p} are NC^{1}-hard under DLOGTIME reductions, so (if TC^{0} ⊊ NC^{1}) they cannot be solved by $O(\log n)$-precision transformer decoders without intermediate steps.^{4} Similarly, the universal recognition problem for CFGs is P-complete, so (if L ⊊ P) it cannot be solved by $O(\log n)$-precision transformer decoders without intermediate steps.

However, these problems can be solved by a transformer decoder using (a polynomial number of) intermediate steps. The decoder has GELU activations (Hendrycks and Gimpel, 2016) and PE including *i* and (for linear equation solving) $m^2 \sin \frac{2i\pi}{m}$ and $m^2 \cos \frac{2i\pi}{m}$, where *m* is the number of variables. More generally, they define a class of dynamic-programming problems that these transformers can solve using intermediate steps. All these decoders have parameters that depend on *n*.

Merrill and Sabharwal (2024) show that a transformer decoder with $O(\log(n + T(n)))$ precision and *O*(*T*(*n*)) intermediate steps can simulate a Turing machine for *T*(*n*) steps, and in particular, decoders with a polynomial number of intermediate steps recognize *exactly* the languages in P. The proof is similar to that of Pérez et al. (2021), but uses a standard definition of transformers without PEs, relying only on the mild assumption that the input string begins with BOS.

### 6.2 Leftmost-hard/Rightmost-hard Attention

Hahn (2020) shows that leftmost-hard attention transformers cannot recognize Parity or Dyck-1, using a variant of Furst et al.’s random restriction method for proving that Parity is outside of AC^{0}.

Hao et al. (2022) show more generally that any language recognized by a transformer with leftmost-hard attention is in AC^{0}. The proof gives a normal form for transformers with leftmost-hard attention and uses it to construct an AC^{0} circuit family. It uses the fact that only $O(\log n)$ bits of information are needed per position.

Barceló et al. (2024) give a lower bound on leftmost-hard-attention transformers with arbitrary PEs depending on a single position *i* and length *n*, including *i*, $\frac{1}{i+1}$, $(-1)^i$, $\cos \frac{\pi(1-2^{-i})}{10}$, and $\sin \frac{\pi(1-2^{-i})}{10}$. They show that these transformers can recognize any language definable in FO[Mon]. Their proof converts an FO[Mon] formula to LTL (§5.3), which is simulated in a transformer.

Angluin et al. (2023) exactly characterize rightmost-hard-attention transformers with strict future masking. Without PEs, these transformers recognize exactly the class of star-free languages, that is, languages definable in FO. With periodic PEs, they are exactly equivalent to FO[MOD], and with arbitrary PEs, they are exactly equivalent to FO[Mon]. Strict masking is important, as nonstrict masking is less expressive. They give two proofs of the star-free to transformer direction, one which goes through LTL (§5.3) and one which uses Krohn-Rhodes theory. These proofs use a Boolean-valued version of RASP (Weiss et al., 2021) as an intermediate representation.

### 6.3 Average-hard and Softmax Attention

Theoretical results on average-hard and softmax attention transformers have not yet clearly separated the two, so we treat them together. Both kinds of attention enable counting, which can be used to solve problems like Majority that are outside AC^{0}. But these transformers are no more powerful than DLOGTIME-uniform TC^{0}, implying that they likely cannot solve problems complete for NC^{1}, L, and other classes believed to be above TC^{0} (§5.4).

#### 6.3.1 Lower Bounds: Particular Languages

The languages Majority, Dyck-*k*, and Parity all lie outside AC^{0}, making them interesting test cases.

Pérez et al. (2019) prove that a transformer encoder–decoder with a trivial decoder and without any PE recognizes Majority; Merrill et al. (2022) prove the same for transformer encoders.
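A minimal numeric sketch of why averaging enables Majority (a simplification of the idea behind the cited constructions, not their actual parameterization): uniform attention over all positions computes the fraction of 1s, which a position-wise threshold then compares to 1/2:

```python
def majority_via_average_attention(bits):
    """Uniform (average-hard) attention gives every position weight 1/n,
    so the attention output is the mean of the 0/1 values; a threshold
    at 1/2 then decides Majority (for non-tied inputs)."""
    n = len(bits)
    weights = [1.0 / n] * n                      # uniform attention weights
    pooled = sum(w * b for w, b in zip(weights, bits))
    return pooled > 0.5
```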

Bhattamishra et al. (2020a) prove that Shuffle-Dyck-*k* (which equals Dyck-1 when *k* = 1) is recognizable by a soft-attention transformer with future masking, no PE, no layernorm, and no residual connections. Yao et al. (2021) show that a transformer decoder can generate Dyck-*k* using $O(\log n)$ precision, softmax and leftmost-hard attention, future masking, and a PE including *i*/*n*, *i*/*n*^{3}, and *n*. They also give constructions for Dyck-(*k*, *D*).

Chiang and Cholak (2022) show that transformers whose PE includes *i*/*n* and $(-1)^i = \cos i\pi$ can recognize Parity.

On the other hand, Hahn (2020) shows that softmax attention transformers cannot generate Parity or Dyck-2 under the following two conditions:

- all position-wise functions are Lipschitz-continuous, and
- generation is defined using the KL divergence criterion in Eq. (5).

The apparent contradiction is resolved by considering the different assumptions underlying each result. Chiang and Cholak (2022) address this by giving two constructions corresponding to Hahn's two conditions. The first has Lipschitz-continuous position-wise functions, but has high cross-entropy (§4.3.1); as a generator, it would not meet criterion (5). The second construction uses layernorm with $\epsilon = 0$, which is not Lipschitz-continuous, but it has arbitrarily low cross-entropy.

A number of authors have tested empirically whether transformers can learn the above languages. Ebrahimi et al. (2020) find that they are competitive with LSTMs at learning Dyck-2 and Dyck-4, and that prepending a BOS symbol helps.

Bhattamishra et al. (2020a) train transformers with future masking and no PE on Dyck-1 and Shuffle-Dyck-*k*, finding near-perfect learning and length generalization. For the languages Dyck-(1, *D*) with learned or sinusoidal PEs, they find that the models do not generalize well for *D* > 1. Yao et al. (2021) then investigate Dyck-(*k*, *D*) for several values of *k* and *D* and several PEs. They report strong generalization only when using *i*/*n* for the PE, and posit that this is the key. It is hard, however, to directly compare the two results: Bhattamishra et al. (2020a) require correct prediction of the possible next symbols at each string prefix, while Yao et al. (2021) average over predictions of right brackets.

Delétang et al. (2023) study experimentally how well transformers (and other networks) learn tasks at various levels of the Chomsky hierarchy, including generalization to longer strings. They find that transformers learn Majority, but not Parity.

#### 6.3.2 Upper Bounds: TC^{0}

Merrill et al. (2022) prove an upper bound analogous to that of Hao et al. (2022), but for average-hard-attention transformers: an average-hard-attention transformer whose activations are floating-point numbers ($\mathbb{F}$) can be simulated in TC^{0}. Strobl (2023) tightens this bound to L-uniform TC^{0}.

Furthermore, Merrill and Sabharwal (2023a) show that softmax attention, $O(\log n)$-precision transformers are in L-uniform TC^{0}, and then tighten this bound to DLOGTIME-uniform TC^{0} (Merrill and Sabharwal, 2023b). The proof constructs subroutines to answer queries about the types of nodes and connectivity of pairs of nodes in the computation graph of a transformer, and shows that these queries can be translated to queries for a TC^{0} circuit family with $O(\log n)$ time overhead.

#### 6.3.3 Other Lower Bounds

In addition to the explicit constructions for particular languages mentioned above, a variety of other lower bounds have been proven, using quite diverse methods.

##### Counter Machines

Bhattamishra et al. (2020a), following Merrill et al. (2020), define a subclass of counter machines called *simplified and stateless k-counter machines* (SSCMs). These can update each counter based on the current input symbol, but have no state and cannot read the counters until the end of the string. They show that any SSCM can be converted to an equivalent transformer with future masking and no residual connections.
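A minimal sketch of an SSCM (our own rendering of the definition): each input symbol adds a fixed integer vector to the counters, and acceptance is a predicate applied only to the final counter values:

```python
def run_sscm(s, update, accept):
    """Simplified and stateless k-counter machine: no states, no mid-string
    counter reads; each symbol c adds the vector update[c] to the counters,
    and accept() inspects the counters only at the end of the string."""
    k = len(next(iter(update.values())))
    counters = [0] * k
    for c in s:
        counters = [x + d for x, d in zip(counters, update[c])]
    return accept(counters)
```

For example, with `update = {"a": [1], "b": [-1]}` and acceptance "all counters zero", the machine recognizes the strings with equally many a's and b's.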

##### Finite Automata

Liu et al. (2023) study the ability of transformers with future masked attention to simulate deterministic finite automata (DFAs), in the sense of computing not only the same acceptance decision but also the same state sequence. Although a transformer with depth *N* can simulate a DFA for *N* timesteps, Liu et al. show how to construct lower-depth *shortcuts* for subclasses roughly corresponding to classes of regular languages in Figure 1. Though the parameters of these constructions depend on *N*, in the context of this survey, a noteworthy finding is that any regular language in ACC^{0} can be recognized up to length *N* by a transformer whose FFNs use sine activations and whose *number* of parameters is independent of *N*.
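The shortcut idea can be sketched as follows: each input symbol denotes a transition function on states, and prefix compositions of these functions form an associative product, which a parallel model can compute with a log-depth tree rather than *n* sequential steps. A sequential sketch of the same computation (names are ours):

```python
def state_sequence(word, delta, q0):
    """Simulate a DFA by composing transition functions.

    delta[c] is a dict state -> state. The prefix compositions computed
    below form an associative product, so a parallel model could compute
    them in O(log n) depth; here they are composed left to right."""
    states = sorted(delta[word[0]])
    acc = {q: q for q in states}              # identity on the state set
    prefixes = []
    for c in word:
        f = delta[c]
        acc = {q: f[acc[q]] for q in states}  # acc := f composed after acc
        prefixes.append(acc)
    # applying each prefix composition to q0 yields the full state sequence
    return [q0] + [p[q0] for p in prefixes]
```

With the parity DFA `delta = {"a": {0: 1, 1: 0}, "b": {0: 0, 1: 1}}`, `state_sequence("aab", delta, 0)` yields `[0, 1, 0, 0]`, matching step-by-step simulation.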

##### First-order Logic

Chiang et al. (2023) obtain both an upper and a lower bound by defining a logic FOC[MOD; +], which is first-order logic with counting quantifiers, using two sorts for positions and counts (Immerman, 1999, pp. 185–187), where positions have the MOD predicate (but not < or =), and counts have <, +, and =, capturing the fact that transformers can add and compare activations, but not positions. They show that this logic is intermediate in expressivity between *O*(1)-precision and infinite-precision transformers. The lower-bound proof uses a normal form that eliminates quantifiers over counts and restricts quantifiers over positions to depth 1; a perhaps surprising consequence is that *O*(1)-precision transformers are no more powerful than 2-layer uniform-attention transformers.

##### Temporal Logic

Barceló et al. (2024) show that average-hard-attention transformers with arbitrary PEs depending on a single position *i* and length *n*, including *i*, $\frac{1}{i+1}$, $(-1)^i$, $\cos \frac{\pi(1-2^{-i})}{10}$, and $\sin \frac{\pi(1-2^{-i})}{10}$, can recognize any language definable in LTL with counting operators, Presburger arithmetic on counts, and predicates in Mon.

##### Programming Languages

Weiss et al. (2021) introduce the RASP (Restricted Access Sequence Processing) language as an abstraction of transformers, discussing how its components relate to the transformer architecture. However, they do not prove any relationship. Lindner et al. (2023) present Tracr, a compiler from RASP programs to transformers. To do so, they impose some restrictions: a maximum input length, given at compile time; a mandatory BOS token; and the removal of *selector composition*, a RASP operation with no clear parallel in transformers. They rewrite several programs from Weiss et al. (2021) without this operation. In the other direction, Friedman et al. (2023) define a restricted class of transformers that can be learned and decompiled into RASP. Finally, Angluin et al. (2023) use a version of RASP restricted to Boolean values, and Zhou et al. (2023) use a restricted version of RASP to explore length generalization.
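RASP's two core primitives, *select* and *aggregate*, can be sketched in plain Python (a simplification of Weiss et al.'s definitions, with our own function names): select builds a Boolean attention matrix from a predicate on key/query pairs, and aggregate averages the selected values at each position:

```python
def select(keys, queries, predicate):
    """RASP-style select: Boolean matrix A with A[i][j] = predicate(keys[j], queries[i])."""
    return [[predicate(k, q) for k in keys] for q in queries]

def aggregate(matrix, values):
    """RASP-style aggregate: at each position, average the values at selected positions."""
    out = []
    for row in matrix:
        chosen = [v for selected, v in zip(row, values) if selected]
        out.append(sum(chosen) / len(chosen) if chosen else 0.0)
    return out
```

For example, selecting positions `k <= q` (a causal mask) and aggregating an indicator sequence computes, at each position, the running fraction of matching tokens so far.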

## 7 Conclusions

Out of the large body of research surveyed above, we highlight several conclusions:

- Transformer decoders can use intermediate steps to simulate Turing machines; with unbounded steps, they are Turing-complete.
- Regarding the expressivity of transformer encoders, circuit complexity and logic are especially promising frameworks.
- Leftmost-hard-attention transformer encoders are in AC^{0} and cannot solve some intuitively easy problems, like Parity and Majority.
- Softmax and average-hard attention give transformer encoders the ability to count. Still, they lie within TC^{0} and likely cannot solve problems like evaluating closed Boolean formulas.

Some open questions that we think should be priorities for future research are:

- Some variants (PEs, average-hard vs. softmax attention, pre-norm vs. post-norm, the presence of BOS/EOS/CLS) appear to be instrumental in proofs reviewed here; can their effect on expressivity be clarified?
- Can the expressivity of softmax-attention transformers be characterized more tightly or even exactly in terms of some logic?
- Given the current practical importance of decoder-only transformers and chain-of-thought, what further insights can circuits or logic provide into transformer decoders?

We hope this paper can serve as a valuable resource for researchers pursuing these and other questions.

## Acknowledgments

We would like to thank Frank Drewes, Jon Rawski, Ashish Sabharwal, and the anonymous reviewers as well as the TACL action editor for their valuable comments on earlier versions of this paper.

## Notes

This differs from the original paper (Vaswani et al., 2017), which treats them as matrices in ℝ^{n×d}. Our notation aligns better with notation for formal languages and emphasizes the variability of the sequence length.

Pérez et al. (2021) define both Turing machines and encoder–decoders to halt only when accepting. The construction could easily be modified to capture decidable languages.

This uses the result of Merrill and Sabharwal (2023b), which would have to be adapted to transformer decoders, but this should be straightforward.

## References


## Author notes

Action Editor: Mark-Jan Nederhof