What Formal Languages Can Transformers Express? A Survey

As transformers have gained prominence in natural language processing, some researchers have investigated theoretically what problems they can and cannot solve, by treating problems as formal languages. Exploring such questions can help clarify the power of transformers relative to other models of computation, their fundamental capabilities and limits, and the impact of architectural choices. Work in this subarea has made considerable progress in recent years. Here, we undertake a comprehensive survey of this work, documenting the diverse assumptions that underlie different results and providing a unified framework for harmonizing seemingly contradictory findings.


Introduction
Transformers (Vaswani et al., 2017) have gained prominence in natural language processing (NLP), both in direct applications like machine translation and in pretrained models like BERT (Devlin et al., 2019) and GPT (Radford et al., 2018; Brown et al., 2020; OpenAI, 2023). Consequently, some researchers have sought to investigate their theoretical properties. Such studies can broadly be divided into studies of expressivity and trainability. While trainability is very important and the focus of much study (e.g., Bhattamishra et al., 2023; Allen-Zhu and Li, 2023), here we focus on expressivity, which is a prerequisite for trainability.
Studies of expressivity could be further divided into those from the perspectives of approximation theory and of formal language theory. The former (e.g., Yun et al., 2020; Sanford et al., 2023) investigates transformers as approximators of various classes of functions, along the lines of the universal approximation theorem for feedforward neural networks (Hornik et al., 1989; Cybenko, 1989). The latter, which is the subject of this survey, investigates transformers as recognizers or generators of formal languages: that is, the inputs or outputs are treated as sequences of discrete symbols from a finite alphabet, and crucially as sequences of unbounded length.
The core research question in this subarea is: How can we characterize the expressivity of transformers in relation to various formal models, such as automata, Boolean circuits, or formal logic? Applications of this subarea, which are not addressed by the papers surveyed here but could be by future work, would hopefully answer questions like:
• What new transformer variants are suggested by formal models?
• Do failure cases anticipated from formal models occur in practice?
• What insights into the complexity of human language are offered by a characterization of transformer expressivity?
This paper provides a comprehensive survey of research in this subarea. Compared to the surveys of Ackerman and Cybenko (2020) and Merrill (2021, 2023), which cover convolutional neural networks (CNNs), RNNs, and transformers, this is a narrower, but deeper, survey on transformers only.
Interpreting theoretical transformer results is complex due to diverse assumptions. Many variants of transformers exist in practice, and even more have been proposed in theory. This diversity leads to varied, even seemingly contradictory, results. We set up a unified framework for talking about transformer variants (§4), and discuss how some of these variants compare to one another in expressivity.
We then provide background on various formal models that transformers have been compared with (§5). Then, in §6, we systematically survey current results in this literature, documenting their assumptions and claims in terms of the definitions of Sections 4 and 5. Much work on lower bounds has looked at automata like finite automata, counter machines, and Turing machines, all of which had been successfully related to RNNs before (Siegelmann and Sontag, 1995; Merrill, 2020). This wide diversity of machines is due to different variants of transformers, especially whether a transformer decoder is allowed to take a number of intermediate steps before outputting a decision (§4.3.4), which dramatically increases its power (§6.1).
By contrast, investigation of upper bounds has mainly focused on circuit complexity (§5.2), which had been successfully related to feedforward networks before (Parberry, 1994; Siu et al., 1995; Beiu and Taylor, 1996; Šíma and Orponen, 2003). This line of research began with restricted models of transformer encoders and progressed to increasingly realistic variants and tighter bounds. One way to restrict transformers is by discretizing the attention mechanism (§4.2.1); another is to limit the precision of number representations (§4.4).
More recent work has turned to formal logic (§5.3) as a way of characterizing the expressive power of transformers. The finer control afforded by logics opens the possibility for them to be used as upper bounds, lower bounds, or both.

Vectors
We use d, d′, etc., for dimensionalities of vector spaces, lowercase bold letters (x, y, ...) for vectors, and uppercase bold letters (X, Y, ...) for matrices. For any vector x ∈ R^d, we number its elements starting from 0. For i ∈ [d], we write x_i or [x]_i for the i-th component of x.
Sequences For any set A, we write A* for the set of all finite sequences over A. We write the length of a sequence s ∈ A* as |s| and number its elements starting from 0; thus, s = s_0 s_1 ⋯ s_{|s|−1}.
Neural networks An affine transformation is a function L : R^{d_in} → R^{d_out}, parameterized by a weight matrix W ∈ R^{d_out×d_in} and a bias vector b ∈ R^{d_out}, such that L(x) = Wx + b. The activation functions we use are the rectified linear unit (ReLU) R(x) = max(x, 0) and the logistic sigmoid function σ(x) = 1/(1 + e^{−x}).
The softmax function S : R* → R* converts any sequence of reals into a probability distribution:

[S(s)]_i = exp(s_i) / Σ_j exp(s_j).
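As a concrete reference, the definition above can be transcribed directly into Python (the function name and the max-subtraction stabilization trick are implementation details, not part of the formal definition):

```python
import math

def softmax(scores):
    """Convert a sequence of real scores into a probability distribution.
    Subtracting the max first is the standard numerical-stability trick;
    it does not change the result."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

probs = softmax([1.0, 1.0, 2.0])
assert abs(sum(probs) - 1.0) < 1e-9   # a probability distribution
assert probs[2] > probs[0] == probs[1]  # higher score, higher weight
```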

Transformers
In this section, we define transformers and relevant variants, and how transformers are used to describe formal languages. For additional background on transformers (not in relation to formal languages), Huang et al. (2022) give a lucid commentary on the original paper, Phuong and Hutter (2022) give formal definitions and pseudocode, and Lin et al. (2022) survey many variants of transformers.
Transformers are composed of an input layer (§4.1), one or more hidden layers (§4.2), and an output layer (§4.3). The inputs and outputs of the layers are sequences of vectors, which we treat as members of (R^d)*.

Input layer
Strings are initially mapped to sequences of vectors using a length-preserving function e : Σ* → (R^d)*, which is the sum of a word embedding and a position embedding (PE). In theoretical constructions, the word embedding can be any computable function.

Hidden layers
A transformer layer is a length-preserving function L : (R^d)* → (R^d)*. There are two variants. In the post-norm variant (Vaswani et al., 2017), layernorm is applied after each residual connection:

Y = N1(A(X) + X)    L(X) = N2(F(Y) + Y)    (1)

and in the pre-norm variant (Wang et al., 2019), it is applied before each sublayer:

Y = A(N1(X)) + X    L(X) = F(N2(Y)) + Y    (2)

where
• A is multi-head self attention (§4.2.1) with d input/output dimensions, h heads, and d_kv key/value dimensions per head,
• F is a feed-forward network (§4.2.2) with d input/output dimensions and d_ff hidden dimensions, and
• N1 and N2 are layernorms with d dimensions.
We define each of these components below.

Attention
Attention was initially developed to facilitate retrieval of previously processed data from a variable-length history (Bahdanau et al., 2015). Transformers use a simple variant of attention known as scaled dot-product attention.
Scaled dot-product attention with d input/output dimensions and d_kv key/value dimensions is a function A : R^d × (R^d)* → R^d parameterized by linear transformations W^Q, W^K, W^V : R^d → R^{d_kv} and W^O : R^{d_kv} → R^d, and defined for every query z ∈ R^d and sequence X = x_0 ⋯ x_{n−1} ∈ (R^d)* by

s_j = (W^Q z) · (W^K x_j) / √d_kv    (3)
α = S(s)    (4)
A(z, X) = W^O Σ_j α_j W^V x_j.

Attention masking In future masked (also known as causally masked) self attention, a term M(i, j) is added to Eq. (3) to force every position i to attend only to preceding positions: M(i, j) = 0 if j ≤ i and −∞ otherwise. Some papers use strict future masking, that is, M(i, j) = 0 iff j < i, and occasionally past masking (j ≥ i) and strict past masking (j > i).
Multi-head attention with h heads and d_kv key/value dimensions per head is the sum of h attentions, each with d_kv key/value dimensions:

A(z, X) = Σ_{k=1}^{h} A_k(z, X).

Multi-head self attention is defined analogously. This is equivalent to the original formulation, which concatenated the outputs of the heads and passed the result through a shared, larger, output transformation W^O_A.
Hard attention Some theoretical analyses simplify attention by replacing the softmax with variants that focus attention only on the position(s) with the maximum score, breaking ties in various ways. For any s ∈ R*, let M(s) = {i | ∀j, s_j ≤ s_i} be the set of indices of the maximal elements of s. In leftmost-argmax, the leftmost maximal element is used:

[S_h(s)]_i = I[i = min M(s)]

whereas in average-argmax the maximal elements share weight equally:

[S_a(s)]_i = I[i ∈ M(s)] / |M(s)|.

If softmax is thought of as a Boltzmann distribution, then average-argmax is its low-temperature limit. By substituting S_h or S_a for S in Eq. (4), we get leftmost-hard and average-hard attention, respectively. Leftmost-hard attention was previously called hard attention by Hahn (2020) and unique hard attention by Hao et al. (2022). One may also consider rightmost-hard attention, in which the rightmost maximal element is used. Average-hard attention was also called hard attention by Pérez et al. (2021) and saturated attention by Merrill et al. (2022), and has been argued to be a realistic approximation to how trained transformers behave in practice (Merrill et al., 2021).
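The tie-breaking behavior of the two argmax variants can be made concrete with a small Python sketch (function names are ours; a transformer would apply these to the attention scores in place of softmax):

```python
def leftmost_argmax(scores):
    """S_h: all weight on the leftmost maximal score."""
    m = max(scores)
    i = scores.index(m)  # list.index returns the leftmost occurrence
    return [1.0 if j == i else 0.0 for j in range(len(scores))]

def average_argmax(scores):
    """S_a: maximal scores share the weight equally."""
    m = max(scores)
    ties = [j for j, s in enumerate(scores) if s == m]
    w = 1.0 / len(ties)
    return [w if j in ties else 0.0 for j in range(len(scores))]

s = [0.5, 2.0, 2.0, 1.0]  # two tied maxima at positions 1 and 2
assert leftmost_argmax(s) == [0.0, 1.0, 0.0, 0.0]
assert average_argmax(s) == [0.0, 0.5, 0.5, 0.0]
```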

Feed-forward networks
Although feed-forward networks can take many forms, in the context of transformers, we use the following definition. A feed-forward network (FFN) with d input/output dimensions and d_ff hidden dimensions is a function F : R^d → R^d parameterized by two affine transformations, L1 : R^d → R^{d_ff} and L2 : R^{d_ff} → R^d, such that

F(x) = L2(R(L1(x)))

where the ReLU R is applied component-wise.
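The definition amounts to two affine maps with a component-wise ReLU in between. A minimal Python sketch (all names ours), using a tiny 1→2→1 network that computes |x| as ReLU(x) + ReLU(−x):

```python
def relu(v):
    return [max(x, 0.0) for x in v]

def affine(W, b, x):
    """Affine map L(x) = Wx + b, with W given as a list of rows."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def ffn(x, W1, b1, W2, b2):
    """F(x) = L2(ReLU(L1(x)))."""
    return affine(W2, b2, relu(affine(W1, b1, x)))

# |x| = ReLU(x) + ReLU(-x): hidden layer computes (x, -x)
W1, b1 = [[1.0], [-1.0]], [0.0, 0.0]
W2, b2 = [[1.0, 1.0]], [0.0]
assert ffn([3.0], W1, b1, W2, b2) == [3.0]
assert ffn([-2.0], W1, b1, W2, b2) == [2.0]
```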

Layer normalization
A d-dimensional layer normalization (Ba et al., 2016), or layernorm for short, is a function N : R^d → R^d parameterized by vectors γ, β ∈ R^d and a scalar ε ≥ 0:

N(x) = γ ⊙ (x − x̄) / √(var(x) + ε) + β

where ⊙ is component-wise multiplication, x̄ is the mean of the components of x, and var(x) is their variance. The original definition of layernorm (Ba et al., 2016) sets ε = 0, but, for numerical stability, all implementations we are aware of set ε > 0. Observe that N is Lipschitz-continuous iff ε > 0.
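A direct Python transcription of this definition (names are ours; eps plays the role of the scalar ε above):

```python
import math

def layernorm(x, gamma, beta, eps=1e-5):
    """N(x) = gamma * (x - mean) / sqrt(var + eps) + beta, component-wise."""
    d = len(x)
    mean = sum(x) / d
    var = sum((v - mean) ** 2 for v in x) / d
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

out = layernorm([1.0, 2.0, 3.0], gamma=[1.0, 1.0, 1.0], beta=[0.0, 0.0, 0.0])
assert out[1] == 0.0                   # the mean component maps to 0
assert abs(out[0] + out[2]) < 1e-9     # output is centered
```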

Networks and output layers
We now define a complete transformer network.

Transformer encoders
A transformer encoder is a length-preserving function T : Σ* → (R^d)* parameterized by the weights of an input layer e and D transformer layers L1, ..., LD. A post-norm transformer encoder is

T = LD ∘ ⋯ ∘ L1 ∘ e

where each Li is a post-norm layer (1) and ∘ is function composition. A pre-norm transformer encoder is additionally parameterized by the weights of a final layernorm N and is defined as

T = N ∘ LD ∘ ⋯ ∘ L1 ∘ e

where each Li is a pre-norm layer (2). The encoder's output is a sequence of vectors in (R^d)*. To use it as a language recognizer, we add an output layer that converts T(w) to a probability

p = σ(c · [T(w)]_i + b)

where c ∈ R^d, b ∈ R, and i is a distinguished position. The encoder accepts iff p ≥ 1/2. Chiang and Cholak (2022) also consider a requirement that an encoder accepts/rejects strings with bounded cross-entropy. That is, we say that an encoder recognizes a language L with cross-entropy at most η iff for all strings w, if w ∈ L then −log p ≤ η, and if w ∉ L then −log(1 − p) ≤ η.
We are aware of two choices for the distinguished position i. Most papers use the last position (i = n − 1), but some (Chiang and Cholak, 2022; Chiang et al., 2023), inspired by binary classifiers based on BERT (Devlin et al., 2019), prepend a special symbol CLS at position 0 and use i = 0. While this is a minor difference, it should be noted that the guarantee of exactly one occurrence of CLS in the input can be useful in some constructions.

Transformer decoders
A transformer decoder is a transformer encoder T with future masking in its attention, typically used to generate rather than recognize strings. The input is the prefix of previously-generated symbols, w_{<t} = w_0 ⋯ w_{t−1}, and the output is a probability distribution over the next symbol,

p(w_t | w_{<t}) = S(W [T(w_{<t})]_{t−1} + b)

where W ∈ R^{|Σ|×d} and b ∈ R^{|Σ|}. We assume w_0 = BOS and every string ends with EOS, where BOS and EOS are special symbols that do not occur anywhere else. To sample a string, we first sample w_1 from p(w_1 | BOS), then, for each time step t > 1, sample w_t from p(w_t | w_{<t}). The process stops when w_t = EOS. Because each sampled output symbol becomes part of the input at the next time step, this kind of model is called autoregressive.
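The autoregressive loop can be sketched as follows. Here the "decoder" is a toy stand-in function returning a next-symbol distribution given the prefix; a real transformer decoder would compute p(w_t | w_{<t}) from T(w_{<t}). For simplicity, the sketch decodes greedily instead of sampling:

```python
def toy_decoder(prefix):
    """Stand-in for a decoder: deterministically emits 'a', 'b', then EOS."""
    table = {1: {"a": 1.0}, 2: {"b": 1.0}}
    return table.get(len(prefix), {"EOS": 1.0})

def generate(decoder, max_len=10):
    prefix = ["BOS"]                      # w_0 = BOS
    while len(prefix) < max_len:
        dist = decoder(prefix)            # p(w_t | w_{<t})
        sym = max(dist, key=dist.get)     # greedy choice of next symbol
        if sym == "EOS":
            break
        prefix.append(sym)                # output becomes the next input
    return prefix[1:]

assert generate(toy_decoder) == ["a", "b"]
```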
While a decoder can be used to recognize strings similarly to an encoder, it can also be used to generate the entire string; at least two definitions have been given for this.
First, Hahn (2020) considers a weighted language as a distribution p̄(w) over strings. For any length n, he measures the KL divergence (relative entropy) of the model p from the true distribution p̄, for predicting w_t conditioned on all previous words, over strings of length n.
As Hahn's results are negative, he does not spell out a positive criterion, but he seems to implicitly require that this divergence vanish as n → ∞. Second, let us say that a transformer decoder ε-generates a language L iff L is exactly the set of strings that can be generated by choosing, at every time step, a next symbol whose probability is at least ε. Then Yao et al. (2021), following Hewitt et al. (2020), say that a transformer decoder generates a language L iff there exists an ε > 0 such that it ε-generates L. (This means that a transformer decoder may generate more than one language, depending on the ε chosen.) They also show that any ε-generator can be converted into a recognizer.
While not focusing on transformers, Lin et al. (2021) demonstrate limitations of autoregressive models for generation; for example, there is a language L ∈ P that cannot be ε-generated in polynomial time for any ε > 0, if P ≠ NP.

Transformer encoder-decoders
A transformer encoder-decoder combines a transformer encoder and decoder, adding to each layer of the decoder an additional attention sublayer, known as cross attention, which attends to the output of the encoder. In the literature surveyed here, only the construction of Pérez et al. (2021) and related constructions (Bhattamishra et al., 2020b; Wei et al., 2022a) employ an encoder-decoder.

Intermediate steps
When a transformer decoder or encoder-decoder is run as a language recognizer, it allows for the possibility of inserting a number of intermediate time steps between the end of the input string and the decision. The encoder-decoder models above do this, as do some decoder-only models (Feng et al., 2023; Merrill and Sabharwal, 2024). As we will see (§6.1), intermediate steps vastly increase the model's power, which has also been observed in practice in the form of a "scratchpad" (Nye et al., 2022) or "chain of thought" (Wei et al., 2022b).

Uniformity and precision
Although meaningful theoretical claims can be made about transformers for fixed-length strings (e.g., Yun et al., 2020), it is crucial when examining transformers as language recognizers to allow for unbounded string length. Fixing a maximum length makes all languages finite, collapsing many language classes into one.
It might be objected that considering unbounded lengths is too abstract, because in practice one can always fix a maximum length. But this maximum length, driven by practical needs, is growing steadily: for example, GPT-4 Turbo uses a context of 128,000 tokens. At the same time, some theoretical findings surveyed here seem to have practical consequences for modest string lengths. For example, we will see that there are reasons to think that, in theory, transformers cannot recognize PARITY; in practice, they fail to learn PARITY for strings with lengths in [2, 50] (Bhattamishra et al., 2020a).
Some theoretical studies of transformers do allow them to depend on the input length n. To borrow a term from circuit complexity (§5.2), they allow certain kinds of non-uniformity. As we have seen, some position embeddings (§4.1) depend on n. We discuss some other instances below.
Numeric precision Transformers operate, in principle, on real numbers. While hard attention transformers could be defined using only rational numbers, even rational numbers can represent an arbitrary amount of information. With RNNs, the use of real or rational numbers has led to results that make them appear more powerful in theory than in practice (Siegelmann and Sontag, 1994, 1995; Weiss et al., 2018).
Consequently, many studies use limited-precision numbers. Some studies limit number representations to have O(1) bits, as floating-point numbers do in practice (Chiang et al., 2023). But Merrill and Sabharwal (2023b) argue that with O(1) precision, attention cannot attend uniformly to a string of sufficient length n, as the attention weights (1/n) would all round down to zero. So O(log n) bits of precision is a common choice (Yao et al., 2021; Merrill and Sabharwal, 2023a,b); other choices are possible as well (Merrill and Sabharwal, 2023a). Restricting intermediate activations to limited precision introduces many decisions about when and how rounding should take place, which can potentially affect expressivity. For example, when summing n numbers, one could round after each addition or only at the end of the summation. Better formalizing these decisions and their impact on expressivity is an area for future research.
Parameters A few constructions allow the parameters themselves to depend on n, which we consider to be a stronger dependence, because if these transformers were to be learned from data, different transformers would have to be learned for different maximum lengths. Finally, a few papers construct transformers in which d, and therefore the number of parameters, depends on n, which we consider to be stronger still.

Summary
In summary, transformers can vary in at least the following ways, any of which could a priori impact theoretical claims:
• Architecture: encoder-only, decoder-only, or encoder-decoder.
• For encoders: the definition of recognition.
• Uniformity: whether parameter values or the number of parameters depend on n.

Languages and Language Classes
Next, we present various formal models that transformers are compared to in the literature surveyed.
The language classes L (languages decidable in O(log n) space) and P (languages decidable in polynomial time) are defined using deterministic Turing machines (with a read-only input tape and a read/write work tape). The class NL (languages decidable in nondeterministic O(log n) space) uses nondeterministic Turing machines. The class DLOGTIME (languages decidable in O(log n) time) uses random-access Turing machines (Barrington et al., 1990). It is known that L ⊆ NL ⊆ P, but none of these inclusions are known to be strict.

Circuits and classes AC
Circuits are a model of parallel computation particularly relevant to transformers. For more details, please see the textbook by Arora and Barak (2009).
Circuits operate on binary values. If we choose a fixed-length encoding of the symbols of Σ as strings of b = ⌈log₂ |Σ|⌉ bits, then a circuit can simulate an input alphabet Σ by encoding the value of the i-th input symbol into positions ib to ib + (b − 1). For the rest of this section, we assume Σ = {0, 1}.
Circuits A circuit C with input length n is a directed acyclic graph with n input vertices s_1, ..., s_n and zero or more gate vertices, each labeled with a type NOT, AND, or OR. Input vertices have fan-in (in-degree) zero, NOT gates have fan-in one, and the fan-in of AND and OR gates can be either two or unbounded. One (input or gate) vertex t is designated the output of the circuit.
Given an input string w ∈ {0, 1}^n, each input vertex s_i is assigned the value w_i, and each gate vertex is assigned the value computed by applying the logical function corresponding to its type to the values assigned to its in-neighbors. The circuit computes the boolean function C : {0, 1}^n → {0, 1}, mapping each input string to the value assigned to t. The depth of C, denoted D(C), is the length of the longest directed path from any s_i to t. The size of C, denoted |C|, is the number of vertices in C.
Uniformity As defined, a circuit family contains a different circuit for each length n, with no constraint on the relationship between the circuits. For example, let L be any unary language: L ⊆ {1}*. For n ∈ N, if 1^n ∉ L, define C_n to be a circuit for the constant 0 function (an OR gate with fan-in 0), and if 1^n ∈ L, define C_n to be a circuit for the AND of all the inputs. Thus, every unary language, even an undecidable one, is recognized by a circuit family of size O(n) and depth O(1).
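To fix intuitions, a circuit can be evaluated by visiting its vertices in topological order. A small Python sketch (the list representation and names are ours; the output is taken to be the last vertex):

```python
def eval_circuit(gates, inputs):
    """Evaluate a circuit given as a topologically ordered list of vertices.
    Each entry is (type, ins): 'IN' with an input index, or a gate type
    in {'NOT','AND','OR'} with a list of in-neighbor indices."""
    vals = []
    for typ, ins in gates:
        if typ == 'IN':
            vals.append(inputs[ins])
        elif typ == 'NOT':
            vals.append(1 - vals[ins[0]])
        elif typ == 'AND':
            vals.append(int(all(vals[i] for i in ins)))
        else:  # OR; note an OR with fan-in 0 is the constant 0
            vals.append(int(any(vals[i] for i in ins)))
    return vals[-1]  # last vertex is the designated output

# Circuit computing x0 AND (NOT x1)
gates = [('IN', 0), ('IN', 1), ('NOT', [1]), ('AND', [0, 2])]
assert eval_circuit(gates, [1, 0]) == 1
assert eval_circuit(gates, [1, 1]) == 0
```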

Circuit families
A circuit family is a sequence C = {C_n}_{n∈N} such that each C_n has input length n; it recognizes the language of strings w for which C_{|w|}(w) = 1. A uniformity restriction on a circuit family requires that the task of constructing a description of the circuit C_n given input n be computable within some specified resource bound as a function of n, potentially making it comparable with classes defined by bounds on Turing machine time or space. Two such uniformity bounds are used in the work here: L and DLOGTIME. Because these bounds are very restrictive, a special representation of the circuit C_n is used, namely, the ability to answer queries about the type of a gate and whether the output of one gate is an input to another gate.
We assume that the vertices of the circuit C_n are numbered from 0 to |C_n| − 1. The direct connection language of a family of circuits C is the set of all tuples ⟨t, u, v, 1^n⟩ such that in C_n, vertex u has type t and there is an edge from vertex u to vertex v (Barrington et al., 1990). Given a computable function bounding the size of C and access to a membership oracle for the direct connection language, for any n it is straightforward to write out the list of vertices, edges, and types in C_n.
Then a circuit family C is L-uniform (resp., DLOGTIME-uniform) if there is a Turing machine that runs in logarithmic space (resp., deterministic logarithmic time) and decides membership in the direct connection language of C.
Circuit complexity classes Circuit complexity classes classify circuit families and the languages they recognize based on uniformity, depth, size, fan-in bound, and the allowed gates. Since transformers have constant depth, circuit classes with constant depth are of particular interest; the classes that are used in the work we survey are:
• AC⁰ contains those languages that can be recognized by families of circuits with unbounded fan-in, constant depth, and polynomial size.
• ACC⁰ is like AC⁰, but also allows gates that output 1 iff their inputs sum to 0 modulo some constant.
• TC⁰ is like AC⁰, but also allows MAJORITY gates, which have unbounded fan-in and output 1 iff at least half of their inputs are 1.
• NC¹ is like AC⁰, but with fan-in at most 2 and depth in O(log n).
The known relationships between these classes are

AC⁰ ⊊ ACC⁰ ⊆ TC⁰ ⊆ NC¹

in the DLOGTIME-uniform, L-uniform, and nonuniform settings; moreover, L-uniform NC¹ ⊆ L.

Logic
A formal language can also be defined as a set of finite strings that satisfy a closed formula of a logic. For more details, refer to Thomas (1997) or Straubing (1994).
In the first-order logic of strings, or FO, the formulas are the smallest set containing:
• Variables x, y, and so on.
• Atomic formulas Q_a(x), x = y, and x < y, where a ∈ Σ and x, y are variables.
• φ ∧ ψ, φ ∨ ψ, and ¬φ, where φ and ψ are formulas.
• ∀x.φ and ∃x.φ, where x is a variable and φ is a formula.
Under the intended interpretation, variables stand for positions of a finite string w, and Q_a(i) is true iff w_i = a. For example, if Σ = {a, b}, the formula ∀x.∀y. Q_a(x) ∧ Q_b(y) → x < y defines the regular language a*b*. The language defined by a closed formula φ consists of those strings that satisfy φ. The languages definable in FO are exactly the star-free languages (McNaughton and Papert, 1971). Other variants add more quantifiers:
• FOC adds counting quantifiers ∃^{=x}y.φ, which hold iff there are exactly x values of y that make φ true (Barrington et al., 1990).
• FOM adds majority quantifiers My.φ, which hold iff at least half of the values of y make φ true (Barrington et al., 1990).
Logics can also be extended with numerical predicates:
• BIT(x, y), which holds iff the y-th bit of x is 1.
• Mon, the set of all predicates on one position, possibly depending on the string length n.
• ARB, the set of all predicates on one or more positions.
A logic extended with predicates is conventionally written with the predicates in square brackets; for example, we write FO[BIT] for first-order logic with the BIT predicate.
In linear temporal logic or LTL (Kamp, 1968), every formula implicitly depends on a single time (or position). There are atomic formulas Q_a for every a ∈ Σ, the connectives ∧, ∨, and ¬, as well as the operators since and until. The formula φ since ψ is true iff ψ was true at some past time i and φ was true from i to now (exclusive). LTL is equivalent to FO (Kamp, 1968).
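The FO example above, ∀x.∀y. Q_a(x) ∧ Q_b(y) → x < y defining a*b*, can be checked by brute force, turning each quantifier into a loop over positions. A Python sketch (function name ours):

```python
def satisfies_astar_bstar(w):
    """Brute-force evaluation of the FO sentence
    forall x. forall y. Q_a(x) and Q_b(y) -> x < y
    over a string w in {a,b}*: each quantifier becomes a loop."""
    n = len(w)
    return all(not (w[x] == 'a' and w[y] == 'b') or x < y
               for x in range(n) for y in range(n))

assert satisfies_astar_bstar("aabbb")   # all a's precede all b's
assert not satisfies_astar_bstar("aba") # an a follows a b
assert satisfies_astar_bstar("")        # vacuously true
```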

Relationships
Figure 1, which depicts the relationships between the language classes defined above, shows that the classes defined by circuits/logics cut across the (perhaps more familiar) Chomsky hierarchy. In this figure and in this section, all circuit classes are understood to be DLOGTIME-uniform unless specified otherwise.

Beyond AC⁰
The classic examples of languages not in AC⁰ are PARITY and MAJORITY. The language PARITY ⊆ {0, 1}* contains all bit strings containing an odd number of 1's, and MAJORITY ⊆ {0, 1}* consists of all bit strings in which more than half of the bits are 1's. Other problems in TC⁰ but not AC⁰ include sorting, integer multiplication (Chandra et al., 1984), and integer division (Hesse, 2001).
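Both languages have one-line characterizations; a Python sketch of the two membership tests (function names ours):

```python
def parity(w):
    """PARITY: does w in {0,1}* contain an odd number of 1's?"""
    return w.count('1') % 2 == 1

def majority(w):
    """MAJORITY: are more than half of the bits of w equal to 1?"""
    return 2 * w.count('1') > len(w)

assert parity("1011") and not parity("11")
assert majority("110") and not majority("1100")
```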

Dyck languages
The language DYCK-k for k > 0 is the language of strings over k pairs of parentheses that are correctly balanced and nested. If we write the i-th parenthesis pair as (_i )_i for each i ∈ [k], then DYCK-k is generated by the context-free grammar with rules S → (_i S )_i S for each i ∈ [k], together with S → ε. These languages are of interest because any context-free language can be obtained by applying a string homomorphism to the intersection of a Dyck language with a regular language (Chomsky and Schützenberger, 1963). Some papers surveyed here consider variations on Dyck languages. The language DYCK-(k, D) for D > 0 is the subset of DYCK-k consisting of strings with maximum nesting depth D; it is a star-free regular language (and therefore in AC⁰).
The language SHUFFLE-DYCK-k is the set of strings over k pairs of parentheses in which, for each parenthesis pair, erasing the other types of parentheses leaves a correctly balanced and nested string. For example, [(()]) is in SHUFFLE-DYCK-2. If k > 1, SHUFFLE-DYCK-k is not context-free.
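The two definitions can be contrasted with a small Python sketch for k = 2 (names and the choice of bracket symbols are ours): DYCK-k uses the usual stack algorithm, while SHUFFLE-DYCK-k only requires each pair to balance after erasing the others.

```python
PAIRS = {')': '(', ']': '['}  # closer -> opener, for k = 2

def is_dyck(w):
    """DYCK-k membership via the standard stack algorithm."""
    stack = []
    for c in w:
        if c in PAIRS.values():
            stack.append(c)
        elif not stack or stack.pop() != PAIRS[c]:
            return False
    return not stack

def is_shuffle_dyck(w):
    """Each pair must balance after erasing the other parenthesis types."""
    return all(is_dyck(''.join(c for c in w if c in (opener, closer)))
               for closer, opener in PAIRS.items())

assert is_dyck("([])") and not is_dyck("[(])")
assert is_shuffle_dyck("[(()])") and not is_dyck("[(()])")
```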

Beyond TC⁰
As we will see (§6.3.2), some transformer variants lie within TC⁰. What problems lie beyond?
The word problem for permutation groups A permutation of [k] is a bijection π : [k] → [k], and S_k is the set of all permutations of [k]. Treating S_k as an alphabet and compositions of permutations as strings, we can define the language W(S_k) of compositions of permutations of [k] that equal the identity permutation. For example, in S_3, the permutation (120) maps 0 ↦ 1, 1 ↦ 2, and 2 ↦ 0, so W(S_3) contains (120) ∘ (120) ∘ (120) but not (120) ∘ (120). These languages are easy for finite automata to recognize, but difficult with only fixed computation depth. Indeed, W(S_5) is complete for NC¹ under AC⁰ reductions (Barrington, 1989), so it is not in TC⁰, assuming that TC⁰ ⊊ NC¹ (as is widely believed). This makes it an example of a regular language that transformer encoders probably cannot recognize.
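The membership test itself is simple sequential state tracking, which is what makes the contrast with fixed-depth circuits instructive. A Python sketch (representation of permutations as tuples and function names are ours), using the 3-cycle from the example above:

```python
def compose(p, q):
    """Composition of permutations given as tuples: (p o q)(x) = p(q(x))."""
    return tuple(p[q[x]] for x in range(len(p)))

def in_word_problem(perms):
    """W(S_k) membership: does the sequence compose to the identity?"""
    k = len(perms[0])
    acc = tuple(range(k))  # identity permutation
    for p in perms:
        acc = compose(acc, p)
    return acc == tuple(range(k))

cycle = (1, 2, 0)  # the permutation mapping 0->1, 1->2, 2->0
assert in_word_problem([cycle, cycle, cycle])   # three 3-cycles = identity
assert not in_word_problem([cycle, cycle])
```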
The languages W(S_k) have some relevance to natural language: they resemble expressions like the child of the enemy of Ann, where the interpretation of the child of is (roughly) a permutation of possible referents (Paperno, 2022), and they resemble problems that have been used to benchmark transformers' state-tracking abilities (Kim and Schuster, 2023).
Other languages that are widely believed to be outside TC⁰ include:
• The language of closed Boolean formulas that are true (BFVP) is context-free but complete for NC¹ under DLOGTIME reductions (Buss, 1987), so it is outside TC⁰ if TC⁰ ⊊ NC¹.
• Undirected graph connectivity is L-complete under L-uniform NC¹ reductions (Cook and McKenzie, 1987; Reingold, 2008), so it is outside L-uniform NC¹ (and therefore outside TC⁰) if NC¹ ⊊ L.
• There is a context-free language that is NL-complete under L reductions (Sudborough, 1975), so it is outside L (and therefore outside NC¹ and TC⁰) if L ⊊ NL.
• Solving systems of linear equalities and universal context-free grammar recognition are P-complete under L reductions (Jones and Laaser, 1976; Greenlaw et al., 1995), so they are outside TC⁰ if L ⊊ P.
• The matrix permanent is known to be outside of TC⁰ (Allender, 1999).

Circuits and logics
DLOGTIME-uniform AC⁰ and TC⁰ are equivalent to FO[BIT] and FOM[BIT], respectively. There are many such equivalences between circuit classes and logics. As a rule of thumb, adding unbounded fan-in gates to a circuit family correlates with adding quantifiers to the corresponding logic, and increasing the degree of non-uniformity of a circuit family correlates with adding numerical predicates to the corresponding logic (Barrington and Immerman, 1994). For example, making AC⁰ and TC⁰ completely non-uniform corresponds to adding arbitrary numerical predicates (ARB) to FO and FOM, respectively (Immerman, 1997; Barrington et al., 1990).
As we will see below, circuits and logics have their advantages and disadvantages for capturing the expressivity of transformers. An advantage of circuits is their more transparent resemblance to transformers: transformers are computations with bounded depth, so it is not hard to see that they should be computable by circuit families with bounded depth (AC⁰ or TC⁰). On the other hand, an advantage of logic is that if we seek an exact characterization of transformers, it can be easier in a logic to add or remove quantifiers or predicates, to limit quantifier depth or the number of variables, to partition terms into different sorts, and so on, than to make the corresponding adjustments to a circuit family.

Current Results
While this area of research still has many unresolved questions, the emerging picture has three levels of expressivity. At the upper end are decoders or encoder-decoders with intermediate steps; these are equivalent to Turing machines (§6.1). At the lower end are encoders with leftmost-hard or rightmost-hard attention; these can recognize only languages in AC⁰ (§6.2). In the middle are encoders with average-hard or softmax attention, which are the least well-understood but appear to lie between AC⁰ and TC⁰ (§6.3).
In this section, "transformer" refers to a transformer encoder unless otherwise indicated.


Decoders with intermediate steps

Pérez et al. (2021) consider transformer encoder-decoders with several modifications:
• In self attention, Eq. (3) takes the negative absolute value of the dot-product, and Eq. (4) uses average-hard attention.
• The FFNs use sigmoids instead of ReLUs.
As described above (§4.3.3), the decoder is allowed to run for arbitrarily many time steps until an acceptance criterion is met. Under these assumptions, transformer encoder-decoders can recognize any recursively enumerable language.3 This result uses arbitrary precision, but as a corollary, they show that a T(n)-time-bounded Turing machine can be simulated in a transformer using O(log T(n)) precision and O(T(n)) intermediate steps. Bhattamishra et al. (2020b) provide a simpler proof of Pérez et al.'s result by reducing to an RNN and appealing to the construction of Siegelmann and Sontag (1995). They do this for two sets of assumptions. First,
• The PE includes only i.
• The self attention sublayers are as above.
3 Pérez et al. (2021) define both Turing machines and encoder-decoders to halt only when accepting. The construction could easily be modified to capture decidable languages.
Second, they show the same with no PE and standard dot-product attention with future masking. Wei et al. (2022a) define a notion of statistically meaningful (SM) approximation and show that transformer encoder-decoders SM-approximate Turing machines. Both the decoder and the Turing machine are limited to n time steps; additionally,
• The PE can be an arbitrary computable function on [n].
• Attention is average-hard.
• The FFNs have three ReLU layers.
Feng et al. (2023) observe that the problems of evaluating arithmetic expressions or solving linear equations over Z_p are NC¹-hard under DLOGTIME reductions, so (if TC⁰ ⊊ NC¹) they cannot be solved by O(log n)-precision transformer decoders without intermediate steps.4 Similarly, the universal recognition problem for CFGs is P-complete, so (if L ⊊ P) it cannot be solved by O(log n)-precision transformer decoders without intermediate steps.
However, these problems can be solved by a transformer decoder using (a polynomial number of) intermediate steps. The decoder has GELU activations (Hendrycks and Gimpel, 2016) and a PE that includes i and, for linear equation solving, trigonometric functions of i. More generally, they define a class of dynamic-programming algorithms that these transformers can solve using intermediate steps. All these decoders have parameters that depend on n. Merrill and Sabharwal (2024) show that a transformer decoder with O(log(n + T(n))) precision and O(T(n)) intermediate steps can simulate a Turing machine for T(n) steps; in particular, decoders with a polynomial number of intermediate steps recognize exactly the languages in P. The proof is similar to that of Pérez et al. (2021), but uses a standard definition of transformers without PEs, relying only on the mild assumption that the input string begins with BOS.
6.2 Leftmost-hard/rightmost-hard attention

Hahn (2020) shows that leftmost-hard attention transformers cannot recognize PARITY or DYCK-1, using a variant of Furst et al.'s random restriction method for proving that PARITY is outside AC 0 . Hao et al. (2022) show more generally that any language recognized by a transformer with leftmost-hard attention is in AC 0 . The proof gives a normal form for transformers with leftmost-hard attention and uses it to construct an AC 0 circuit family; it relies on the fact that only O(log n) bits of information are needed per position. Barceló et al. (2024) give a lower bound on leftmost-hard-attention transformers with arbitrary PEs depending on a single position i and the length n. They show that these transformers can recognize any language definable in FO[Mon]. Their proof converts an FO[Mon] formula to LTL ( §5.3), which is then simulated in a transformer. Angluin et al. (2023) exactly characterize rightmost-hard-attention transformers with strict future masking. Without PEs, these transformers recognize exactly the class of star-free languages, that is, the languages definable in FO. With periodic PEs, they are exactly equivalent to FO[MOD], and with arbitrary PEs, to FO[Mon]. Strict masking is important, as non-strict masking is less expressive. They give two proofs of the star-free-to-transformer direction, one going through LTL ( §5.3) and one using Krohn-Rhodes theory. Both proofs use a Boolean-valued version of RASP (Weiss et al., 2021) as an intermediate representation.

6.3 Average-hard and softmax attention
Theoretical results on average-hard and softmax attention transformers have not yet clearly separated the two, so we treat them together. Both kinds of attention enable counting, which can be used to solve problems like MAJORITY that are outside AC 0 . But these transformers are no more powerful than DLOGTIME-uniform TC 0 , implying that they likely cannot solve problems complete for NC 1 , L, and other classes believed to be above TC 0 ( §5.4).
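The connection between averaging and counting can be seen in a minimal sketch (ordinary Python standing in for one attention layer; the function names are ours): attending uniformly to all positions, with value 1 at symbol 1 and value 0 at symbol 0, yields the fraction of 1s, which a thresholded feedforward layer can compare to 1/2.

```python
# Sketch: uniform attention computes an average of value vectors, which
# is enough to count and hence to decide MAJORITY.

def uniform_attention(values):
    return sum(values) / len(values)   # average of scalar "values"

def majority_via_attention(w):
    values = [1.0 if c == "1" else 0.0 for c in w]  # value projection
    avg = uniform_attention(values)                 # = (number of 1s) / n
    return avg > 0.5                                # threshold "FFN"
```

Intuitively, a hard-attention head retrieves only a single position, which is consistent with the AC 0 upper bound above; averaging over all positions is what moves these models into TC 0 territory.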

Lower bounds: particular languages
The languages MAJORITY, DYCK-k, and PARITY all lie outside AC 0 , so they are interesting test cases. Pérez et al. (2019) prove that a transformer encoder-decoder with a trivial decoder and without any PE recognizes MAJORITY; Merrill et al. (2022) prove the same for transformer encoders. Bhattamishra et al. (2020a) prove that SHUFFLE-DYCK-k (which equals DYCK-1 when k = 1) is recognizable by a soft-attention transformer with future masking, no PE, no layernorm, and no residual connections. Yao et al. (2021) show that a transformer decoder can generate DYCK-k using O(log n) precision, softmax and leftmost-hard attention, future masking, and a PE including i/n, i/n 3 , and n. They also give constructions for DYCK-(k, D). Chiang and Cholak (2022) show that transformers whose PE includes i/n and (−1) i = cos iπ can recognize PARITY.
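For concreteness, here are reference recognizers for these test languages, written as ordinary Python rather than transformers (the bracket alphabets, and the convention that PARITY means an odd number of 1s, are our choices):

```python
# Reference recognizers for the test languages discussed above.

def majority(w):
    """MAJORITY: strings over {0,1} with more 1s than 0s."""
    return w.count("1") > w.count("0")

def parity(w):
    """PARITY: strings over {0,1} with an odd number of 1s."""
    return w.count("1") % 2 == 1

def dyck(w, k):
    """DYCK-k: balanced strings over k kinds of bracket pairs."""
    opens, closes = "([{<"[:k], ")]}>"[:k]
    stack = []
    for c in w:
        if c in opens:
            stack.append(opens.index(c))
        elif c in closes:
            if not stack or stack.pop() != closes.index(c):
                return False
        else:
            return False
    return not stack

def shuffle_dyck(w, k):
    """SHUFFLE-DYCK-k: each bracket kind balances independently, with no
    counter going negative; nesting across kinds is NOT checked."""
    opens, closes = "([{<"[:k], ")]}>"[:k]
    counts = [0] * k
    for c in w:
        if c in opens:
            counts[opens.index(c)] += 1
        elif c in closes:
            i = closes.index(c)
            counts[i] -= 1
            if counts[i] < 0:
                return False
        else:
            return False
    return all(v == 0 for v in counts)
```

Note that `([)]` is accepted by `shuffle_dyck` with k = 2 but rejected by `dyck`, illustrating how the two languages differ.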
On the other hand, Hahn (2020) shows that softmax-attention transformers cannot generate PARITY or DYCK-2 under the following two conditions: (1) all position-wise functions are Lipschitz-continuous, and (2) generation is defined using the KL divergence criterion in Eq. (5).
The apparent contradiction is resolved by considering the different assumptions underlying each result. Chiang and Cholak (2022) address this by giving two constructions corresponding to Hahn's two conditions. The first has Lipschitz-continuous position-wise functions but high cross-entropy ( §4.3.1); as a generator, it would not meet criterion (5). The second construction uses layernorm with ε = 0, which is not Lipschitz-continuous, but it has arbitrarily low cross-entropy.
A number of authors have tested empirically whether transformers can learn the above languages (Ebrahimi et al., 2020).

Merrill et al. (2022) prove an upper bound analogous to that of Hao et al. (2022), but for average-hard-attention transformers: an average-hard-attention transformer with activations in F can be simulated in TC 0 . Strobl (2023) tightens this bound to L-uniform TC 0 . Furthermore, Merrill and Sabharwal (2023a) show that softmax-attention, O(log n)-precision transformers are in L-uniform TC 0 , and then tighten this bound to DLOGTIME-uniform TC 0 (Merrill and Sabharwal, 2023b). The proof constructs subroutines to answer queries about the types of nodes and the connectivity of pairs of nodes in the computation graph of a transformer, and shows that these queries can be translated into queries for a TC 0 circuit family with O(log n) time overhead.
An upper bound of DLOGTIME-uniform TC 0 immediately implies an upper bound of FOM[BIT] (Merrill and Sabharwal, 2023b). Chiang et al. (2023) prove a tighter upper bound using a logic called FOC[MOD; +], but on transformers with O(1) precision. This result is discussed below.

Other lower bounds
In addition to the explicit constructions for particular languages mentioned above, various other lower bounds, quite diverse in nature, have been proven.
Counter machines Bhattamishra et al. (2020a), following Merrill et al. (2020), define a subclass of counter machines called simplified and stateless k-counter machines (SSCMs). These can update each counter based on the current input symbol, but have no state and cannot read the counters until the end of the string. They show that any SSCM can be converted to an equivalent transformer with future masking and no residual connections.
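A minimal sketch of our reading of the SSCM definition (the `same_ab` example is hypothetical, not one from the papers): each symbol applies a fixed increment to each counter, nothing depends on the counters mid-string, and acceptance is decided only from the final counter values.

```python
# Sketch of a simplified, stateless k-counter machine (SSCM).

def make_sscm(update, accept):
    """update: symbol -> tuple of per-counter increments;
    accept: final counter tuple -> bool (read only at string end)."""
    def run(w):
        k = len(next(iter(update.values())))
        counters = [0] * k
        for sym in w:
            # counter updates depend only on the current symbol
            for i, d in enumerate(update[sym]):
                counters[i] += d
        return accept(tuple(counters))
    return run

# Hypothetical example: equal numbers of a's and b's, via one counter.
same_ab = make_sscm({"a": (1,), "b": (-1,)}, lambda c: c == (0,))
```

Because the updates never consult the counters or any state, the per-position computation is position-wise and parallel, which is what makes the conversion to a transformer possible.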
Finite automata Liu et al. (2023) study the ability of transformers with future-masked attention to simulate deterministic finite automata (DFAs), in the sense of computing not only the same acceptance decision but also the same state sequence. Although a transformer with depth T can simulate a DFA for T time steps, Liu et al. show how to construct lower-depth shortcuts for subclasses roughly corresponding to the classes of regular languages in Fig. 1. Though the parameters of these constructions depend on T, a noteworthy finding in the context of this survey is that any regular language in ACC 0 can be recognized, up to length T, by a transformer whose FFNs use sine activations and whose number of parameters is independent of T.
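The shortcut constructions rest on the fact that DFA transition maps compose associatively: the map for a whole prefix can be built by a balanced, logarithmic-depth combination rather than one symbol at a time. A small sketch (ordinary Python; the two-state parity DFA is a hypothetical example):

```python
# Sketch: simulating a DFA by composing per-symbol transition maps in a
# balanced tree, the associativity insight behind low-depth shortcuts.

def transition_map(delta, sym, states):
    # the function q -> delta[(q, sym)], represented as a dict
    return {q: delta[(q, sym)] for q in states}

def compose(f, g):
    # apply f first, then g (reading f's symbols before g's)
    return {q: g[f[q]] for q in f}

def state_after(delta, start, w, states):
    maps = [transition_map(delta, c, states) for c in w]
    while len(maps) > 1:  # balanced (log-depth) reduction
        maps = [compose(maps[i], maps[i + 1]) if i + 1 < len(maps)
                else maps[i]
                for i in range(0, len(maps), 2)]
    return maps[0][start] if maps else start

# Hypothetical 2-state parity DFA over {0, 1}.
states = {"even", "odd"}
delta = {("even", "0"): "even", ("even", "1"): "odd",
         ("odd", "0"): "odd", ("odd", "1"): "even"}
```

Liu et al.'s actual constructions realize such combinations with attention; the point of the sketch is only that associativity is what allows depth smaller than the input length.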
First-order logic Chiang et al. (2023) obtain both an upper and a lower bound by defining a logic FOC[MOD; +], which is first-order logic with counting quantifiers, using two sorts for positions and counts (Immerman, 1999, pp. 185-187). Positions have the MOD predicates (but not < or =), while counts have <, +, and =, capturing the fact that transformers can add and compare activations, but not positions. They show that this logic is intermediate in expressivity between O(1)-precision and infinite-precision transformers. The lower-bound proof uses a normal form that eliminates quantifiers over counts and makes quantifiers over positions have depth 1; a perhaps surprising consequence is that O(1)-precision transformers are no more powerful than 2-layer uniform-attention transformers.

Temporal logic Barceló et al. (2024) show that average-hard-attention transformers, with arbitrary PEs depending on a single position i and the length n, can recognize any language definable in LTL with counting operators, Presburger arithmetic on counts, and predicates in Mon.
Programming languages Weiss et al. (2021) introduce the RASP (Restricted Access Sequence Processing) language as an abstraction of transformers, discussing how its components relate to the transformer architecture. However, they do not prove any formal relationship. Lindner et al. (2023) present Tracr, a compiler from RASP programs to transformers. To do so, they impose some restrictions: a maximum input length, given at compile time; a mandatory BOS token; and the removal of selector composition, a RASP operation with no clear parallel in transformers. They rewrite several programs from Weiss et al. (2021) without this operation. In the other direction, Friedman et al. (2023) define a restricted class of transformers that can be learned and decompiled into RASP. Finally, Angluin et al. (2023) use a version of RASP restricted to Boolean values, and Zhou et al. (2024) use a restricted version of RASP to explore length generalization.

Bhattamishra et al. (2020a) find that, with several PEs, the models do not generalize well for k > 1. Yao et al. (2021) then investigate DYCK-(k, D) for several values of k and D and several PEs. They report strong generalization only when using i/n for the PE, and posit that this is the key. It is hard, however, to directly compare the two results: Bhattamishra et al. (2020a) require correct prediction of the possible next symbols at each string prefix, while Yao et al. (2021) average over predictions of right brackets.

7 Conclusions

Table 1 summarizes the results surveyed here. One way to classify them is into lower bounds (what transformers can do) and upper bounds (what transformers cannot do).

Out of the large body of research surveyed above, we highlight several conclusions:

1. Transformer decoders can use intermediate steps to simulate Turing machines; with unbounded steps, they are Turing-complete.

2. Regarding the expressivity of transformer encoders, circuit complexity and logic are especially promising frameworks.

3. Leftmost-hard-attention transformer encoders are in AC 0 and cannot solve some intuitively easy problems, like PARITY and MAJORITY.

4. Softmax and average-hard attention give transformer encoders the ability to count. Still, they lie within TC 0 and likely cannot solve problems like evaluating closed Boolean formulas.

Some open questions that we think should be priorities for future research are:

5. Some variants (PEs, average-hard vs. softmax attention, pre-norm vs. post-norm, the presence of BOS/EOS/CLS) appear to be instrumental in the proofs reviewed here; can their effect on expressivity be clarified?

6. Can the expressivity of softmax-attention transformers be characterized more tightly, or even exactly, in terms of some logic?

7. Given the current practical importance of decoder-only transformers and chain-of-thought, what further insights can circuits or logic provide into transformer decoders?

We hope this paper can serve as a valuable resource for researchers pursuing these and other questions.

Figure 1: Relationship of some languages and language classes discussed in this paper (right) to the Chomsky hierarchy (left), assuming that TC 0 ⊊ NC 1 and L ⊊ NL. Circuit classes are DLOGTIME-uniform.

Table 1: Surveyed claims and their assumptions. Please see the main text for full details of assumptions.