Abstract
We study the sequence-to-sequence mapping capacity of transformers by relating them to finite transducers, and find that they can express surprisingly large classes of (total functional) transductions. We do so using variants of RASP, a programming language designed to help people “think like transformers,” as an intermediate representation. We extend the existing Boolean variant B-RASP to sequence-to-sequence transductions and show that it computes exactly the first-order rational transductions (such as string rotation). Then, we introduce two new extensions. B-RASP[pos] enables calculations on positions (such as copying the first half of a string) and contains all first-order regular transductions. S-RASP adds prefix sum, which enables additional arithmetic operations (such as squaring a string) and contains all first-order polyregular transductions. Finally, we show that masked average-hard attention transformers can simulate S-RASP.
1 Introduction
Transformers (Vaswani et al., 2017) have become a standard tool in natural language processing and vision tasks. They are primarily studied in terms of their expressivity (which functions they can or cannot compute) or learnability (which functions they can or cannot learn from examples). Much recent expressivity work views transformers as recognizers of formal languages, by comparing them to automata, circuits, or logic (Strobl et al., 2024). Here we take the more general view that they compute (total functional) transductions, or functions from strings to strings.
Transductions are a fundamental object in computer science, with a long history in linguistics and natural language processing (Mohri, 1997; Roark and Sproat, 2007). Many empirical tests of transformer reasoning ability use transductions to define algorithmic sequence generation tasks (e.g., Suzgun et al., 2023; Delétang et al., 2023) such as tracking shuffled objects, sorting strings, concatenating all k-th letters, or removing duplicates from a list.
This paper is the first theoretical analysis, to our knowledge, of transformers as transducers of formal languages (Figure 1). Previous work on transformers as recognizers showed that unique-hard attention transformers correspond to star-free regular languages (Yang et al., 2024); here, we prove the analogous result for transformers as transducers, that unique-hard attention transformers correspond to aperiodic rational transductions. We then study two superclasses of aperiodic rational transductions that are (also) analogous to star-free regular languages: aperiodic regular transductions (e.g., w ↦ w^R or w ↦ ww) and aperiodic polyregular transductions (e.g., w ↦ w^{|w|}). We prove unique-hard attention transformers cannot compute all of these, but average-hard attention transformers can.
Figure 1: Overview of results of this paper. Arrows denote inclusion; dashed arrows denote inclusions that are known from previous work. Slashed arrows denote non-inclusions. The columns, from left to right, are: (1) the hierarchy of aperiodic transductions; (2) RASP variants; (3) variants of transformer encoders.
To do this, we introduce two new variants of RASP (Weiss et al., 2021), a programming language designed to make it easier to write down the kinds of computations that transformers can perform. This makes our analysis simpler, more concise, and more interpretable than describing transformers directly using linear algebra. These variants, called B-RASP[pos] and S-RASP, compute more than just the aperiodic regular and aperiodic polyregular transductions, and are interesting in their own right.
2 Preliminaries
We write [n] for the set {0,…, n −1}. Fix finite input and output alphabets Σ and Γ. We sometimes use special symbols # and ⊣, which we assume do not belong to Σ or Γ. Let Σ* and Γ* be the sets of strings over Σ and Γ, respectively. The empty string is denoted ε. For any string w, we number its positions starting from 0, so w = a0⋯an−1. We write uv or u · v for the concatenation of strings u and v, and wR for the reverse of string w.
2.1 Transductions and Transducers
A transduction is a binary relation between strings in Σ* and strings in Γ*. Here we consider only total functional transductions, that is, functions f : Σ* → Γ*, and all of our transducers define total functional transductions.
A string homomorphism is a function f : Σ* → Γ* such that, for any strings u, v ∈ Σ*, we have f(uv) = f(u) f(v).
A deterministic finite transducer (DFT) is a tuple T = (Σ,Γ, Q, q0, δ) where
Σ and Γ are the input and output alphabets,
Q is the finite set of states,
q0 ∈ Q is the initial state,
δ : Q × (Σ ∪ {⊣}) → Γ* × Q is the transition function.
The transition function δ extends to strings as follows: δ(q, ε) = (ε, q), and for u ∈ Σ* and a ∈ Σ ∪ {⊣}, δ(q, ua) = (u′v′, s), where δ(q, u) = (u′, r) and δ(r, a) = (v′, s) for some r ∈ Q. Then for any w ∈ Σ*, we say that T transduces w to w′ iff δ(q0, w⊣) = (w′, r) for some r ∈ Q. We call a transduction sequential if it is definable by a DFT.
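To make this concrete, the following Python sketch implements a small DFT (an illustrative transducer of our own choosing, with hypothetical names) that deletes repeated adjacent symbols, and runs it using the extension of δ to strings defined above:

```python
# A minimal DFT sketch (an illustrative transducer of our own choosing): it deletes
# repeated adjacent symbols, e.g. "aabba" -> "aba".  States remember the previously
# read symbol; the end marker ⊣ produces no further output here.
def make_dedup_dft(alphabet):
    states = {None} | set(alphabet)
    delta = {}                              # (state, symbol) -> (output string, next state)
    for q in states:
        for a in alphabet:
            delta[(q, a)] = ("" if a == q else a, a)
        delta[(q, "⊣")] = ("", q)
    return delta

def run_dft(delta, q0, w):
    # The extension of delta to strings: process w followed by the end marker.
    q, out = q0, []
    for a in w + "⊣":
        o, q = delta[(q, a)]
        out.append(o)
    return "".join(out)

print(run_dft(make_dedup_dft("ab"), None, "aabba"))   # -> "aba"
```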
Next, we introduce several nested classes of transductions: rational, regular, and polyregular. We first give examples of transductions in these classes and informal descriptions of the classes in terms of various transducers.
In brief, a transduction is rational if it is definable by a nondeterministic transducer (the kind probably most familiar to NLP researchers, except we are assuming it is total functional). A transduction is regular if it is definable by a two-way transducer, which can be thought of as a rational transducer that can go back and forth on the input string, or a Turing machine with a read-only input tape and a write-only, one-way output tape.
The following transductions are regular but not rational:
- map-reverse: Reverse each substring between markers.
- map-duplicate: Duplicate each substring between markers.
A transduction is polyregular if it is definable by a pebble transducer, which is a two-way transducer augmented with a stack of up to k pebbles (Bojańczyk, 2022). It can push the current head position onto the stack and jump to the beginning of the string, and it can pop the top pebble from the stack and jump to that pebble’s position. It can read the symbol at every pebble, and it can compare the positions of the pebbles.
Next we restrict to the aperiodic subsets of these classes, and give formal definitions of these subclasses as composition closures of sets of prime transductions. We will use these definitions for the rest of the paper.
Let T be a deterministic finite automaton or transducer. For any input string w, there is a binary relation on states that holds between p and q iff δ(p, w) arrives at state q; if T is a DFT, this means that δ(p, w) = (w′, q) for some w′. Then T is aperiodic (or counter-free) if there is an N ≥ 0 (depending on T) such that for all strings w ∈ Σ* and all n ≥ N, the strings w^n and w^N induce the same relation.
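This condition can be tested directly from the definition. The following brute-force Python sketch (illustrative, with hypothetical helper names; strings w are only enumerated up to a small bound, so it is a sketch rather than a decision procedure) compares the relations induced by w^N and w^(N+1):

```python
from itertools import product

# Brute-force check of the aperiodicity condition for a DFA given as a dict
# delta: (state, symbol) -> state.
def relation(delta, states, w):
    # The relation induced by w: pairs (p, q) such that reading w from p ends in q.
    def run(p):
        for a in w:
            p = delta[(p, a)]
        return p
    return frozenset((p, run(p)) for p in states)

def is_aperiodic(delta, states, alphabet, max_len=4):
    N = len(states)   # for a fixed w, comparing w^N and w^(N+1) with N = |Q| suffices
    for length in range(1, max_len + 1):
        for w in product(alphabet, repeat=length):
            if relation(delta, states, w * N) != relation(delta, states, w * (N + 1)):
                return False
    return True

# Two "reset" letters: both induce constant maps, so the DFA is aperiodic.
reset = {(0, "a"): 0, (1, "a"): 0, (0, "b"): 1, (1, "b"): 1}
print(is_aperiodic(reset, [0, 1], "ab"))     # True
# A parity counter: 'a' swaps the two states, so a^n cycles and the DFA is not aperiodic.
parity = {(0, "a"): 1, (1, "a"): 0}
print(is_aperiodic(parity, [0, 1], "a"))     # False
```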
Aperiodic deterministic finite automata (DFAs) are equivalent to star-free regular expressions and first-order logic with order (Schützenberger, 1965; McNaughton and Papert, 1971). They are also equivalent to masked hard-attention transformers (Yang et al., 2024). We take this equivalence as our starting point.
Each of the classes of transductions described above has an aperiodic subclass.
Aperiodic sequential transductions (which include string homomorphisms) are those defined by aperiodic DFTs.
Aperiodic rational transductions are the composition closure of aperiodic sequential transductions and right-to-left aperiodic sequential transductions, that is, transductions that can be expressed as w ↦ f(w^R)^R, where f is aperiodic sequential.1
Aperiodic regular transductions are the composition closure of aperiodic sequential transductions and the transductions map-reverse and map-duplicate (Ex. 2.3).2
Aperiodic polyregular transductions (Bojańczyk, 2018, Def. 1.3) are the composition closure of aperiodic regular transductions and the transduction marked-square (Ex. 2.4).
2.2 Transformers
We assume familiarity with transformers (Vaswani et al., 2017) and describe a few concepts briefly. For more detailed definitions, please see the survey by Strobl et al. (2024).
In standard attention, attention weights are computed from attention scores using the softmax function. In average-hard attention (Pérez et al., 2021; Merrill et al., 2022), each position i attends to those positions j that maximize the score si, j. If there is more than one such position, attention is divided equally among them. In unique-hard attention (Hahn, 2020), exactly one maximal element receives attention. In leftmost hard attention, the leftmost maximum element is chosen, while in rightmost hard attention, the rightmost maximum element is chosen.
In this work, we use RASP (Weiss et al., 2021) as a proxy for transformers. Specifically, we use extensions of B-RASP, a version of RASP restricted to Boolean values (Yang et al., 2024). B-RASP is equivalent to masked hard-attention transformer encoders with leftmost and rightmost hard attention, and with strict future and past masking.
3 B-RASP and Unique Hard Attention Transformers
In this section, in order to facilitate our study of transformers and how they relate to classes of transductions, we modify the definition of B-RASP to compute transductions and use it to show that unique-hard attention transformers are equivalent to aperiodic rational transductions.
In Sections 4 and 5, we consider two extensions: B-RASP[pos] adds position information, and S-RASP also includes an operator for prefix sum. These have various correspondences both with more realistic transformers and with larger classes of transductions.
3.1 Definition and Examples
We give an example first, followed by a more systematic definition.
B-RASP computation for increment.
in | 0 | 1 | 0 | 1 | 1 |
not | 1 | 0 | 1 | 0 | 0 |
carry | 0 | 0 | 1 | 1 | 1 |
out | 0 | 1 | 1 | 0 | 0 |
We give a definition of B-RASP that is equivalent to that of Yang et al. (2024), extended to transductions. For now, we consider B-RASP programs for length-preserving transductions, and will later consider two schemes for defining non-length-preserving transductions as needed.
There are two types of values: Booleans from {⊤,⊥}, and symbols from a finite alphabet Δ. These are stored in vectors, which all share the same length n, mirroring the transformer encoders they are intended to model.
A B-RASP program receives an input string w = a0⋯an−1 represented as a symbol vector in, where in(i) = ai for i ∈ [n].
A B-RASP program is a sequence of definitions of the form P(i) = ρ, where P is a vector name, i is a variable name, and ρ is a right-hand side, to be defined below. The type of P is the type of ρ. No two definitions can have the same left-hand side.
Each definition has one of the following forms:
- Position-wise operations P(i) = e, where e is an expression such that FV(e) ⊆ {i}.
- Attention operations, which have one of two forms, a leftmost form and a rightmost form, each specified by a choice function, a mask predicate M(i, j), an attention predicate S(i, j), a value function V(j), and a default function D(i), where:
  - The choice function is either leftmost or rightmost.
  - M(i, j) is a mask predicate, one of
    no masking: M(i, j) = ⊤
    future masking: M(i, j) = (j < i) or M(i, j) = (j ≤ i)
    past masking: M(i, j) = (j > i) or M(i, j) = (j ≥ i).
  - S(i, j) is an attention predicate, given by a Boolean expression with FV(S(i, j)) ⊆ {i, j}.
  - V(j) is a value function, given by a Boolean or symbol expression with FV(V(j)) ⊆ {j}.
  - D(i) is a default function, given by a Boolean or symbol expression with FV(D(i)) ⊆ {i}.
The attention operation defines a new vector P, as follows. For each i ∈ [n], if the choice function is leftmost, then ji is the minimum j ∈ [n] such that M(i, j) = ⊤ and S(i, j) = ⊤, if any, and P(i) is set to the value V(ji); if there is no such j, then P(i) is set to the default value D(i). If the choice function is rightmost, then ji is instead the maximum such j, and P(i) is determined in the same way.
The output of a B-RASP program is given in a designated symbol vector out, which has the same form as the input vector in.
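The following Python sketch (our own rendering, with hypothetical names) mirrors this semantics for a single attention operation and illustrates it with a strictly future-masked rightmost attention:

```python
# Semantics of one B-RASP attention operation (sketch with hypothetical names).
# M and S are predicates over (i, j); V maps j to a value; D maps i to a default value.
def attention(n, choice, M, S, V, D):
    P = []
    for i in range(n):
        js = [j for j in range(n) if M(i, j) and S(i, j)]
        if js:
            ji = min(js) if choice == "leftmost" else max(js)
            P.append(V(ji))
        else:
            P.append(D(i))
    return P

# Example: copy the nearest '1' strictly to the left of each position, default '0'.
w = "01011"
prev_one = attention(
    n=len(w),
    choice="rightmost",
    M=lambda i, j: j < i,           # strict future masking
    S=lambda i, j: w[j] == "1",     # attend only to positions holding '1'
    V=lambda j: "1",
    D=lambda i: "0",
)
print(prev_one)    # ['0', '0', '1', '1', '1']
```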
3.2 Packed Outputs
So far, we have defined B-RASP to encompass only length-preserving transductions. But even some simple classes of transductions, like string homomorphisms, are not length-preserving.
To address this, we allow the program to output a vector containing strings up to some length k instead of a vector of symbols. For any finite alphabet A, let A^{≤k} denote the set of all strings over A of length at most k (including the empty string ε).
The input vector is still a vector of input symbols: a0a1⋯an−1, where ai ∈ Σ for i ∈ [n]. However, the output vector is a vector of symbols over the alphabet Γ^{≤k} for some k. The output vector is a k-packed representation of a string u if the concatenation of the strings at positions 0,…, n −1 is u. There may be many different k-packed representations of the same string. For an input string of length n, the output string has length at most kn. Packed outputs make it possible to compute any string homomorphism, as in the following example.
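For instance, a homomorphism can be computed by writing the (possibly empty) string h(in(i)) at each position. A minimal Python sketch of this packing scheme, for an illustrative homomorphism of our own choosing:

```python
# k-packed outputs (sketch): each position holds a string of length at most k over
# the output alphabet, and the output string is the concatenation of all positions.
h = {"a": "aa", "b": "", "c": "c"}        # an illustrative homomorphism with k = 2

def homomorphism_packed(w):
    packed = [h[a] for a in w]            # position-wise: out(i) = h(in(i))
    return packed, "".join(packed)        # the packed vector and the string it represents

print(homomorphism_packed("abca"))        # (['aa', '', 'c', 'aa'], 'aacaa')
```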
3.3 B-RASP Defines Exactly the Aperiodic Rational Transductions
Examples 3.1 and 3.2 show that B-RASP can compute some aperiodic rational transductions that are not sequential. The following theorem shows that B-RASP can compute only aperiodic rational transductions.
Any B-RASP program with packed outputs defines an aperiodic rational transduction.
Let 𝒫 be a B-RASP program. By Lemma 12 of Yang et al. (2024), 𝒫 can be rewritten so that every score predicate S(i, j) depends only on j. Denote the sequence of vectors of 𝒫 as P1,…, Pm, and treat the input vector in as P0. We prove by induction that the first k operations of 𝒫 can be converted to a composition of left-to-right and right-to-left aperiodic sequential transductions. The output of the composition is the sequence of (k + 1)-tuples (x_{0,0},…, x_{0,k}),…,(x_{n−1,0},…, x_{n−1,k}), where for i ∈ [n] and j ∈ [k + 1], we have x_{i,j} = Pj(i).
If k = 1, we just construct the identity transducer. If k > 1, assume that the first k − 1 operations have been converted to a composition of transductions. If Pk is a position-wise operation, it can be computed by a one-state DFT that appends the value of x_{i,k} = Pk(i) onto the end of the input k-tuple. The interesting cases are the attention operations.
The remaining attention case is the mirror image of the previous one.
For the converse, we need the following lemma.
If 𝒫 is a B-RASP program with packed outputs computing a transduction g, and f is an aperiodic sequential transduction, then there is a B-RASP program with packed outputs that computes f ∘ g.
We can adapt the proof of Lemma 19 of Yang et al. (2024). By the Krohn-Rhodes decomposition theorem for aperiodic sequential transductions (Pradic and Nguyễn, 2020, Thm. 4.8), f is equivalent to the sequential composition of finitely many two-state aperiodic DFTs. Hence, without loss of generality, we can assume that f is defined by a two-state aperiodic DFT T. This machine T is an identity–reset transducer, which means that for any symbol σ ∈ Σ, the state transformation either is the identity (maps both states to themselves) or resets to one of the states q (maps both states to q). For each state q of T, let Rq be the set of symbols that reset to q. Let R = ⋃q Rq and I = Σ ∖ R. Let q1 be the start state and q2 the other state. We write T(q, w) = w′ if δ(q, w) = (w′, q′) for some q′.
For any aperiodic rational transduction f : Σ* → Γ*, there is a B-RASP program with packed outputs that computes f.
The transduction f can be written as fR ∘ fL, where fL is an aperiodic sequential transduction and fR is a right-to-left aperiodic sequential transduction (Def. 2.7). The identity transduction can clearly be computed by a B-RASP program, and by Lem. 3.5 there is a B-RASP program computing fL. Finally, Lem. 3.5 can be easily modified to apply also to fR, using the mirror images of the vectors state_q and sym above.
3.4 Unique-hard Attention Transformers Compute Exactly the Aperiodic Rational Transductions
Yang et al. (2024) show that a B-RASP program can be simulated by a unique-hard attention transformer with no position information besides masking, and vice versa. With Thms. 1 and 2, this implies that masked unique-hard attention transformers with packed outputs can compute exactly the aperiodic rational transductions.
4 B-RASP with Positions
4.1 Definition
We extend B-RASP to B-RASP[pos], which adds a type nat for integers in [n], and vectors containing integers.
There is a pre-defined integer vector pos(i), whose value is simply i at every position i ∈ [n].
There are position-wise operations P(i) = e_nat, where FV(e_nat) ⊆ {i}. Addition and subtraction have their usual meaning, but values less than 0 are replaced by 0 and values greater than n − 1 are replaced by n − 1. (Since this is not associative, we fully parenthesize arithmetic expressions.)
There are position-wise operations P(i) = c_bool, where FV(c_bool) ⊆ {i}. The operators <, >, =, ≠, ≤, and ≥ have their usual meaning.
In B-RASP, S(i, j) was a Boolean expression (e_bool); in B-RASP[pos], it can be either a Boolean expression (e_bool) or, as a special case, V1(i) = V2(j), where V1 and V2 are previously defined integer vectors. We emphasize that only tests for equality are allowed (not, for example, V1(i) < V2(j)). This restriction is used in the transformer simulation in Section 5.5.
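A small Python sketch of the clamped position arithmetic (illustrative helper names; not B-RASP[pos] syntax):

```python
# Clamped position arithmetic in B-RASP[pos] (sketch): results below 0 become 0
# and results above n - 1 become n - 1.
def clamp(x, n):
    return max(0, min(n - 1, x))

def add(u, v, n):
    return [clamp(u[i] + v[i], n) for i in range(n)]

n = 5
pos = list(range(n))                 # the pre-defined vector pos(i) = i
print(add(pos, pos, n))              # [0, 2, 4, 4, 4] -- clamped at n - 1
# Clamping breaks associativity: with n = 5, (1 + 2) - 3 = 0 but 1 + (2 - 3) = 1,
# which is why arithmetic expressions are fully parenthesized.
```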
4.2 Examples
Informally, we omit a default value from a leftmost or rightmost operation if the operation is such that the default value will never be taken.
Example B-RASP[pos]computation for map-reverse. Details in Ex. 4.1.
in | input | | | a | b | | | c | d | e | | |
pos | position | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
prev | previous ‘|’ | 0 | 0 | 0 | 0 | 3 | 3 | 3 | 3 |
next | next ‘|’ | 3 | 3 | 3 | 7 | 7 | 7 | 7 | 0 |
src | source | 3 | 2 | 1 | 4 | 6 | 5 | 4 | 0 |
y1 | in(src(i)) | | | b | a | c | e | d | c | | |
out | output | | | b | a | | | e | d | c | | |
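The src vector above can be obtained with clamped position arithmetic alone: within a block delimited by separators, src(i) = prev(i) + next(i) − i mirrors i about the block. A plain Python re-implementation of this idea (for illustration, not B-RASP[pos] syntax):

```python
# map-reverse via the prev/next/src vectors of the table above.
def map_reverse(w, sep="|"):
    n = len(w)
    prev = [max((j for j in range(i) if w[j] == sep), default=0) for i in range(n)]
    nxt  = [min((j for j in range(i + 1, n) if w[j] == sep), default=0) for i in range(n)]
    out = []
    for i in range(n):
        if w[i] == sep:
            out.append(sep)
        else:
            src = max(0, min(n - 1, prev[i] + nxt[i] - i))   # mirror i within its block
            out.append(w[src])
    return "".join(out)

print(map_reverse("|ab|cde|"))   # "|ba|edc|"
```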
Example B-RASP[pos] computation for map-duplicate. Details in Ex. 4.2.
in | input | | | a | b | | | c | d | e | | |
pos | position | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
prev | previous ‘|’ | 0 | 0 | 0 | 0 | 3 | 3 | 3 | 3 |
next | next ‘|’ | 3 | 3 | 3 | 7 | 7 | 7 | 7 | 0 |
nowrap | 0 | 1 | 3 | 5 | 4 | 6 | 7 | 7 | |
wrap | 0 | 0 | 1 | 0 | 1 | 3 | 5 | 7 | |
src1 | left symbol | 0 | 1 | 1 | 5 | 4 | 6 | 5 | 7 |
src2 | right symbol | 1 | 2 | 2 | 6 | 5 | 4 | 6 | 7 |
out | output | | | ab | ab | | | cd | ec | de | | |
Example B-RASP[pos] computation for copy-first-half. Details in Ex. 4.3.
in | input | a | b | c | a | a | b | c | b | b |
pos | i | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
last | n −1 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 |
sum | pos(i) + pos(i) (clamped) | 0 | 2 | 4 | 6 | 8 | 8 | 8 | 8 | 8 |
out | output | a | b | c | a | ε | ε | ε | ε | ε |
The transduction copy-first-half is neither regular nor polyregular.
Both regular and polyregular transductions preserve regular languages under inverse (Bojańczyk, 2018, Thm. 1.7). The preimage of the regular language a* under copy-first-half is the set of words of the form a^n w where |w| ≤ n, which is not regular if |Σ| > 1, so the transduction copy-first-half is neither regular nor polyregular.3
Let m be a positive integer and define the transduction residues-mod-m to map any input a0a1⋯an−1 to the sequence b0b1⋯bn−1 where bi = i mod m. This transduction is rational but not aperiodic.
For any m, B-RASP[pos] can compute the transduction residues-mod-m.
4.3 Expressivity
B-RASP[pos] programs with packed outputs can compute all aperiodic regular transductions.
If f is an aperiodic regular transduction, then by Def. 2.7, it can be decomposed into a composition of transductions, each of which is (a) aperiodic sequential, (b) map-reverse, or (c) map-duplicate. We convert f to a B-RASP[pos] program by induction on the number of functions in the composition. Case (a) is the same as the proof of Lem. 3.5, mutatis mutandis. The following two lemmas handle the other two cases.
If 𝒫 is a B-RASP[pos] program with packed outputs computing a transduction g, then there is a B-RASP[pos] program with packed outputs that computes map-reverse ∘ g.
To see how this works, consider a packed symbol z(i). If it contains at least one separator, it is parsed into head, body, and tail as xyz. The correct output for position i is computed in sep(i), and consists of replacing x by the reverse of the tail of the closest left neighbor that has a separator, replacing y with map-reverse(y), and replacing z by the reverse of the head of the closest right neighbor that has a separator. If z(i) contains no separator, then it appears in a maximal subsequence w_0, w_1,…, w_{k−1} with no separator, say as w_ℓ, and the correct output for position i is computed in nosep, and consists of the reverse of w_{k−1−ℓ}.
If 𝒫 is a B-RASP[pos] program with packed outputs computing a transduction g, then there is a B-RASP[pos] program with packed outputs that computes map-duplicate ∘ g.
This computes in the vector sep the correct outputs for those positions i that have a separator in the input symbol z(i). The symbol is parsed into head, body, and tail as xyz, and the correct output is the concatenation of the tail of the preceding symbol, the strings x, map-duplicate(y), z, and the head of the following symbol. Note that map-duplicate is applied only to strings of bounded length.
On the other hand, every operation in B-RASP[pos] is computable by a family of AC0 circuits, that is, a family of Boolean circuits of constant depth and polynomial size (Hao et al., 2022), which implies that any transduction computable in B-RASP[pos] is computable in AC0.
5 S-RASP and Average Hard Attention Transformers
5.1 Definition
We further extend B-RASP[pos] to RASP with prefix sum (or S-RASP) by adding a prefix sum operation, which maps an integer vector V to the integer vector P with P(i) = Σ_{j ≤ i} V(j); as with the other arithmetic operations, values greater than n − 1 are replaced by n − 1.
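A Python sketch of the operation, together with one typical use: computing where the image of each input symbol ends under a homomorphism (cf. the lens and ends rows in the table for Ex. 5.4 below, where h(A) = aa, h(B) = ε, h(C) = ccd):

```python
# Prefix sum with clamping at n - 1 (sketch), and one typical use.
def prefix_sum(v, n):
    out, s = [], 0
    for x in v:
        s += x
        out.append(min(s, n - 1))
    return out

h = {"A": "aa", "B": "", "C": "ccd", "#": ""}
w = "ABBC##"
n = len(w)
lens = [len(h[a]) for a in w]      # [2, 0, 0, 3, 0, 0]
ends = prefix_sum(lens, n)         # [2, 2, 2, 5, 5, 5]  -> end of h(in(i)) in the output
print(lens, ends)
```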
5.2 Padded Inputs
We defined non-length-preserving transductions for B-RASP and B-RASP[pos] by employing the convention of packed outputs. However, for S-RASP, we introduce a simpler scheme: using only symbol, not string, vectors, while assuming that the input string is followed by padding symbols #, enough to accommodate the output string.
The input vector is a_0 a_1 ⋯ a_{ℓ−1} #^{n−ℓ}, where ℓ < n and a_i ∈ Σ for i ∈ [ℓ]. The output vector, similarly, is b_0 b_1 ⋯ b_{k−1} #^{n−k}, where k < n and b_i ∈ Γ for i ∈ [k].
With this input/output convention, padding symbols may be necessary to create enough positions to hold the output string. But in order to be able to prove closure under composition for transductions computable in B-RASP and its extensions, we allow additional padding symbols to be required. In particular, the program 𝒫 computes the transduction f iff there exists a nondecreasing function q, called the minimum vector length, such that for every input string w ∈ Σ^ℓ, we have q(ℓ) ≥ k = |f(w)|, and if 𝒫 is run on w · #^{n−ℓ}, where n > q(ℓ), then the output is f(w) · #^{n−k}. In all of the examples in this section, except marked-square, q is linear.
We could have used padded inputs with B-RASP programs, but it can be shown that programs would only be able to map input strings of length n to output strings of length at most n + k, for some constant k. Packed outputs give B-RASP the ability to define transductions with longer outputs, like string homomorphisms. However, the situation is exactly opposite with S-RASP. Packed outputs do not add any power to S-RASP, because “unpacking” a packed output into a vector of output symbols can be computed within S-RASP (Lem. 5.3). Moreover, packed outputs only allow transductions with linear growth, and, as we will see, S-RASP can define transductions with superlinear growth (Ex. 2.4).
5.3 Properties
If f1 and f2 are computable in S-RASP, then their composition f2 ∘ f1 is computable in S-RASP.
Let the S-RASP program 𝒫i compute the transduction fi with minimum vector length qi for i = 1, 2. Let 𝒫3 be the S-RASP program that consists of the operations of 𝒫1 followed by the operations of 𝒫2, where 𝒫1 is modified to output a fresh vector z (instead of out) and 𝒫2 is modified to take its input from z (instead of in). We can choose a nondecreasing function q3 such that q3(ℓ) ≥ max(q1(ℓ), q2(q1(ℓ))), so that q3 as a minimum vector length ensures that 𝒫3 correctly computes f2 ∘ f1.
For any string homomorphism h there exists an S-RASP program to compute h, with minimum vector length q(ℓ) = Kℓ, where K is the maximum of |h(σ)| over σ ∈ Σ.
5.4 Examples and Expressivity
Example S-RASP computation for a string homomorphism. Details in Ex. 5.4.
in | input | A | B | B | C | # | # |
pos | i | 0 | 1 | 2 | 3 | 4 | 5 |
lens | length of h(in(i)) | 2 | 0 | 0 | 3 | 0 | 0 |
ends | end of h(in(i)) | 2 | 2 | 2 | 5 | 5 | 5 |
starts | start of h(in(i)) | 0 | 2 | 2 | 2 | 5 | 5 |
sym0 | mark start | A | # | C | # | # | # |
sym1 | mark start+1 | # | A | # | C | # | # |
sym2 | mark start+2 | # | # | A | # | C | # |
out | output | a | a | c | c | d | # |
Example S-RASP computation for marked-square. Details in Ex. 5.5.
in | input | a | a | b | # | # | # | # | # | # | # | # | # | # | # |
pos | i | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
len | input length | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
inpos | i in input? | ⊤ | ⊤ | ⊤ | ⊥ | ⊥ | ⊥ | ⊥ | ⊥ | ⊥ | ⊥ | ⊥ | ⊥ | ⊥ | ⊥ |
glen | ℓg = group length (i > 0) | 0 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
mglen | prefix sum of glen | 0 | 4 | 8 | 12 | 13 | 13 | 13 | 13 | 13 | 13 | 13 | 13 | 13 | 13 |
starts | starts of groups | 0 | 4 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
isstart | is i in starts? | ⊤ | ⊥ | ⊥ | ⊥ | ⊤ | ⊥ | ⊥ | ⊥ | ⊤ | ⊥ | ⊥ | ⊥ | ⊥ | ⊥ |
isstartnum | isstart numeric | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
gnumber | group number | 1 | 1 | 1 | 1 | 2 | 2 | 2 | 2 | 3 | 3 | 3 | 3 | 3 | 3 |
gstart | start of i’s group | 0 | 0 | 0 | 0 | 4 | 4 | 4 | 4 | 8 | 8 | 8 | 8 | 8 | 8 |
src | i −gstart(i) −1 | 0 | 0 | 1 | 2 | 0 | 0 | 1 | 2 | 0 | 0 | 1 | 2 | 3 | 4 |
ismarked | is i marked? | ⊤ | ⊤ | ⊥ | ⊥ | ⊤ | ⊤ | ⊤ | ⊥ | ⊤ | ⊤ | ⊤ | ⊤ | ⊥ | ⊥ |
y1 | letters moved | a | a | a | b | a | a | a | b | a | a | a | b | # | # |
y2 | mark and add initial ‘|’s | | | A | a | b | | | A | A | b | | | A | A | B | # | # |
lastbar | i for last ‘|’ | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 12 |
out | output | | | A | a | b | | | A | A | b | | | A | A | B | | | # |
Every aperiodic polyregular transduction is computable in S-RASP.
By Def. 2.7, any aperiodic polyregular transduction can be decomposed into a composition of aperiodic regular transductions and marked-square. All aperiodic regular transductions are computable in B-RASP[pos] (Thm. 4.7), and their packed outputs can be unpacked in S-RASP (Lem. 5.3), so all aperiodic regular transductions are computable in S-RASP. Further, marked-square is computable in S-RASP (Ex. 5.5) and S-RASP is closed under composition (Lem. 5.2). Thus, S-RASP can compute all aperiodic polyregular transductions.
An example run is in Table 8.
Example S-RASP computation for majority-rules. Details in Ex. 5.7.
in | input | b | b | a | b | b | a | b | a | # | # |
pos | i | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
pa | count-left(a) | 0 | 0 | 1 | 1 | 1 | 2 | 2 | 3 | 3 | 3 |
na | count(a) | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
pb | count-left(b) | 1 | 2 | 2 | 3 | 4 | 4 | 5 | 5 | 5 | 5 |
nb | count(b) | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 |
out | output | b | b | b | b | b | b | b | b | # | # |
The transduction majority-rules is neither polyregular nor computable in B-RASP[pos].
Polyregular transductions preserve regular languages under inverse (Bojańczyk, 2018, Thm. 1.7). The preimage of the regular language a* under majority-rules is M = {w | w contains more a’s than b’s}, which is not regular, so majority-rules is not polyregular.
A circuit family computing majority-rules can be modified to decide M, which is not in AC0 (Furst et al., 1984). Thus the majority-rules transduction is not computable in B-RASP[pos].
Let m be a positive integer and define the transduction count-mod-m to map any input sequence a0a1⋯an−1 to the sequence b0b1⋯bn−1 where bi is the number of occurrences of a designated symbol a ∈ Σ among a0⋯ai, taken modulo m. This transduction is rational but not aperiodic; it is a generalization of the parity problem, which has been discussed at length elsewhere (Hahn, 2020; Chiang and Cholak, 2022).
For any m, S-RASP can compute the transduction count-mod-m.
On the other hand, because prefix sum can be simulated by a family of TC0 circuits (threshold circuits of constant depth and polynomial size), any transduction computable in S-RASP is in TC0.
5.5 Average-hard Attention Transformers
We prove the following connection between S-RASP programs and average hard attention transformers in Appendix B.
Any transduction computable by an S-RASP program is computable by a masked average-hard attention transformer encoder with a position encoding of i/n, (i/n)², and 1/(i + 2).
One consequence is the following result relating unique-hard and average-hard attention:
Any transduction computable by a masked unique-hard attention transformer encoder can be computed by a masked average-hard attention transformer encoder with a position encoding of i/n, (i/n)², and 1/(i + 2).
6 Conclusions
This is, to our knowledge, the first formal study of transformers for sequence-to-sequence transductions, using variants of RASP to connect classes of transformers to classes of transductions. We showed that unique-hard attention transformers and B-RASP compute precisely the class of aperiodic rational transductions; B-RASP[pos] strictly contains all aperiodic regular transductions; and average-hard attention transformers and S-RASP strictly contain all aperiodic polyregular transductions. Our finding that B-RASP[pos] and S-RASP can compute transductions outside the corresponding aperiodic class in the transduction hierarchy raises the question of fully characterizing their expressivity, a promising future research direction.
Acknowledgments
We thank Mikołaj Bojańczyk, Michaël Cadilhac, Lê Thành Dũng (Tito) Nguyễn, and the anonymous reviewers for their very helpful advice.
Notes
Nguyễn et al. (2023, fn. xii) characterize aperiodic rational transductions using just one aperiodic sequential transduction and one right-to-left aperiodic sequential transduction, and Filiot et al. (2016, Prop. 3) use a closely related characterization in terms of bimachines. Here we use a composition of any number of transductions, which is equivalent because aperiodic rational transductions are closed under composition (Carton and Dartois, 2015, Thm. 10).
This characterization is given by Nguyễn (2021, p. 15). It is also given by Bojańczyk and Stefański (2020, Thm. 18) for the more general setting of infinite alphabets constructed from atoms; our definition here corresponds to the special case of finite alphabets (that is, where the set of atoms is empty).
Thanks to an anonymous reviewer for suggesting this argument to us.
References
In the following appendices, we prove Thm. 5.11. Appendix A reviews the definition of average-hard attention transformers. Appendix B contains our main proof, while Appendix C contains another construction using a different position embedding. Appendix D compares some features of our simulation with other simulations.
A Average Hard Attention Transformers
We recall the definition of a transformer encoder with average-hard attention, also known as saturated attention (Yao et al., 2021; Hao et al., 2022; Barceló et al., 2024). Let d > 0 and n ≥ 0. An activation sequence is a sequence of vectors in ℝd, one for each string position. The positions are numbered −1, 0, 1,…, n −1. Position −1, called the default position, does not hold an input symbol and will be explained below. A transformer encoder is the composition of a constant number (independent of n) of layers, each of which maps an activation sequence u−1,…, un−1 to an activation sequence u−1′,…, un−1′.
There are two types of layers: (1) position-wise and (2) average hard attention. A position-wise layer computes a function ui′ = ui + f(ui) for all positions i, where f is a position-wise two-layer feed-forward network (FFN) with ReLU activations. An average hard attention layer is specified by three linear transformations Q, K, V : ℝd → ℝd. The dot product S(i, j) = (Q ui) · (K uj) is the attention score from position i to position j. For each position i, let Mi be the set of positions j that maximize S(i, j). Then ui′ = ui + (1/|Mi|) Σ_{j ∈ Mi} V uj. An average hard attention layer may be masked using strict or non-strict future masking, in which for each position i, only positions j < i or j ≤ i (respectively) are considered in the attention calculation. With strict future masking, the default position has nowhere to attend to, so the result is u−1′ = u−1.
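A numpy sketch of one average-hard attention layer as just defined (illustrative; it includes the default position and the two masking options):

```python
import numpy as np

# One average-hard attention layer (sketch): position i attends uniformly to the
# positions j that maximize the score (Q u_i) . (K u_j), optionally restricted by
# strict or non-strict future masking; the attended values are averaged and added
# to u_i (residual connection).
def avg_hard_attention(U, Q, K, V, masking=None):
    # U has shape (n + 1, d); row 0 plays the role of the default position -1.
    m, _ = U.shape
    scores = (U @ Q.T) @ (U @ K.T).T        # scores[i, j] = (Q u_i) . (K u_j)
    out = U.copy()
    for i in range(m):
        if masking == "strict":
            allowed = list(range(i))        # j < i
        elif masking == "nonstrict":
            allowed = list(range(i + 1))    # j <= i
        else:
            allowed = list(range(m))
        if not allowed:                     # default position under strict masking:
            continue                        # u'_{-1} = u_{-1}
        best = max(scores[i, j] for j in allowed)
        M_i = [j for j in allowed if scores[i, j] == best]
        out[i] = U[i] + np.mean([V @ U[j] for j in M_i], axis=0)
    return out

# Tiny usage example with identity maps.
d, n = 3, 4
U = np.arange((n + 1) * d, dtype=float).reshape(n + 1, d)
print(avg_hard_attention(U, np.eye(d), np.eye(d), np.eye(d), masking="strict"))
```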
B Simulating S-RASP
B.1 Overview of the Simulation
To define the computation of a transduction by a transformer, we need to specify how the input and output strings are represented in the initial and final activation sequences. If the input string is w = a0a1⋯aℓ−1, we let ai = # for i = ℓ,…, n −1.
Let Σ ∪{#} = {σ0,…, σk−1} be totally ordered. The first k coordinates of each activation vector hold the one-hot encoding of ai (or the zero vector at the default position). The representation of the output string is analogous, using the alphabet Γ ∪{#}.
We turn to how S-RASP programs may be simulated. Vectors of Boolean, symbol, and integer values in an S-RASP program are represented in one or more coordinates of an activation sequence in the transformer. Each operation of an S-RASP program computes a new vector of values, and is simulated by one or more transformer encoder layers which compute new values in one or more coordinates of the activation sequence. Assume that 𝒫 is an S-RASP program computing a transduction with minimum vector length q(ℓ), and that n > q(ℓ).
B.2 Representing S-RASP Vectors
Vectors of Booleans, symbols, and integers in the program are represented in the activation sequence of the transformer as follows.
Each Boolean vector v0, v1,…, vn−1 in 𝒫 is represented by one coordinate r of the transformer activation sequence u−1, u0,…, un−1, where for each i ∈ [n], ui[r] = 0 if vi = ⊥ and ui[r] = 1 if vi = ⊤. For the default position, u−1[r] = 0.
Let Δ = {δ0, δ1,…, δk−1} denote the finite set of all symbols that appear in any symbol vector in 𝒫. Each symbol vector v0, v1,…, vn−1 in 𝒫 is represented by |Δ| = k coordinates r0, r1,…, rk−1, which hold a one-hot representation of vi (or the zero vector at the default position).
Each integer vector v0, v1,…, vn−1 in the program is represented by a specified coordinate r in the transformer activation sequence, where for each i ∈ [n], ui[r] = vi/n. In the PE, the value of u−1[pos] is −1/n, but for other integer vectors we have u−1[r] = 0. We note that all of the representing values are less than or equal to 1.
B.3 Table Lookup
A key property of S-RASP is that every integer value computed in the program must be equal to some position index i ∈ [n]. We use this property to implement a table lookup operation.
For q ∈ [n], define fq(x) = 2qx − x². Then:
- fq(x) is uniquely maximized at x = q;
- if x ≠ q, then fq(q) − fq(x) ≥ 1.
This is a generalized version of a technique by Barceló et al. (2024). It can easily be shown by looking at the first and second derivatives of fq, and by comparing fq(q) with fq(q − 1) and fq(q + 1); indeed, fq(q) − fq(x) = (q − x)².
Fix an activation sequence u−1,…, un−1 and coordinates r, s, t such that ui[r] = ki/n, where each ki ∈ [n]. Then there is an average-hard attention layer that computes u−1′,…, un−1′, where ui′[t] = u_{ki}[s] for each i, and the other coordinates stay the same.
We remark that if ki ≥ n, the unique maximizing value of S(i, j) for j ∈ [−1, n −1] is j = n −1, so the attention layer in the proof above returns the value vn−1 for such positions i.
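A Python sketch of the table-lookup idea, assuming the quadratic score 2(ki/n)(j/n) − (j/n)², which is uniquely maximized at j = ki (our rendering of Lemmas B.1 and B.2):

```python
# Table lookup via average-hard attention (sketch): with k_i/n stored at position i,
# the score 2*(k_i/n)*(j/n) - (j/n)**2 = (k_i**2 - (k_i - j)**2) / n**2 is uniquely
# maximized at j = k_i, so position i retrieves the table entry at position k_i.
def table_lookup(n, k, table):
    result = []
    for i in range(n):
        j_star = max(range(n), key=lambda j: 2 * (k[i] / n) * (j / n) - (j / n) ** 2)
        result.append(table[j_star])        # j_star equals k[i]
    return result

n = 6
k = [3, 0, 5, 5, 1, 2]                          # the indices k_i, one per position
table = ["t0", "t1", "t2", "t3", "t4", "t5"]    # the entries u_j[s]
print(table_lookup(n, k, table))                # ['t3', 't0', 't5', 't5', 't1', 't2']
```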
B.4 Simulating S-RASP Operations
For each operation below, let u−1,…, un−1 be the input activation sequence, and let u−1′,…, un−1′ be the output activation sequence. If k, v1, v2, b, and t are S-RASP vectors, we also write k, v1, v2, b, and t, respectively, for the coordinates representing them in the transformer.
B.4.1 Position-wise Operations
Position-wise Boolean operations on Boolean vectors can be simulated exactly by position-wise FFNs, as shown by Yang et al. (2024). Position-wise operations on symbol values reduce to Boolean operations on Boolean values.
To simulate addition of two integer vectors, t(i) = v1(i) + v2(i), we first use a FFN to compute ui[v1] + ui[v2] = k/n, where k = v1(i) + v2(i). The result may exceed (n −1)/n, so we use table lookup (Lem. B.2) to map k/n to uk[pos]; this sets values larger than (n −1)/n to (n −1)/n. The result is stored in ui′[t]. Subtraction is similar, with ReLU ensuring the result is non-negative.
For position-wise comparison of integer vectors t(i) = (v1(i) ≤ v2(i)), we use a FFN to compute k/n = ReLU(ui[v1] − ui[v2]). We use table lookup to map k/n to uk[zero], which is 1 if ui[v1] − ui[v2] ≤ 0, and 0 otherwise. The other comparison operators are similar.
B.4.2 Prefix Sum
Next, we turn to the prefix sum operation t(i) = Σ_{j ≤ i} k(j). Assume that ui[k] = k(i)/n, where each k(i) is an integer in [n] and k(−1) = 0. Let pi ≥ 0 be the sum of k(−1), k(0),…, k(i), and let p̄i = min(pi, n − 1)/n; the values p̄i are the sequence of values to be computed and stored in coordinate t.
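A plain arithmetic sketch of the idea (the transformer realizes the rescaling with attention and table lookup rather than literal multiplication): non-strict future-masked average attention over positions −1,…, i with value k(j)/n yields pi/(n(i + 2)); rescaling by i + 2 (available through the 1/(i + 2) position encoding) and clamping via table lookup recovers min(pi, n − 1)/n.

```python
# Prefix sum via averaging (arithmetic sketch).
def prefix_sum_via_attention(k, n):
    # k = [k(-1), k(0), ..., k(n-1)] with k(-1) = 0; positions hold k(j)/n.
    vals = [x / n for x in k]
    out = []
    for i in range(n):
        avg = sum(vals[: i + 2]) / (i + 2)    # non-strict future-masked average attention
        p_i = round(avg * (i + 2) * n)        # rescale by i + 2 (via the 1/(i+2) encoding)
        out.append(min(p_i, n - 1) / n)       # clamp at n - 1 (via table lookup, Lem. B.2)
    return out

n = 6
print(prefix_sum_via_attention([0, 1, 0, 2, 0, 1, 3], n))
# [1/6, 1/6, 3/6, 3/6, 4/6, 5/6]: the prefix sums 1, 1, 3, 3, 4, 7 -> 7 clamped to 5
```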
B.4.3 Leftmost and Rightmost Attention
Maximization.
If S(i, j) is a Boolean combination of Boolean vectors, to ensure that attention from any position to the default position is 0, we let Sbase(i, j) = ¬default(j) ∧ S(i, j). This may be computed by dot product attention, as described by Yang et al. (2024).
Breaking Ties.
If Sbase(i, j) were used with average hard attention, then the activation values would be averaged for all the satisfying j. To ensure that the maximum satisfying position j has a unique maximum score, we break ties by adding or subtracting Stie(i, j). We must ensure that the values added or subtracted are smaller than the minimum difference between the values for satisfying and non-satisfying positions.
For a Boolean combination of Boolean vectors, let Stie(i, j) = j/(2n). Then under rightmost attention, the rightmost satisfying j has the highest attention score, which is at least 1, while every non-satisfying j has an attention score less than 1/2. Similarly for leftmost attention.
For an equality comparison v1(i) = v2(j), the difference between the maximum score attained and any other score is at least (1/n)² by Lem. B.1. So if we add or subtract values less than (1/n)², no non-equality score can exceed an equality score. This can be achieved by letting Stie(i, j) = j/(2n³). This is computable using dot product attention because j/n is in the PE for j and (1/n)² is in the PE for 1 and can be initially broadcast to all positions.
Default Values.
The term Sdef needs to give the default position an attention score strictly between the possible scores for satisfying and non-satisfying j.
For a Boolean combination of Boolean vectors, the maximum non-satisfying score is less than 1/2 and the minimum satisfying score is at least 1, so if we let Sdef(i) = 3/4, then the default position has an attention score of 3/4, so it will be the unique maximum in case there are no satisfying positions.
For an equality comparison of integer vectors, the maximum non-satisfying score is less than (v1(i)/n)² − (1/2)(1/n²), and the minimum satisfying score is at least (v1(i)/n)², so Sdef(i) = (v1(i)/n)² − (1/4)(1/n²) is strictly between these values. The value of (v1(i)/n)² may be obtained at position i using Lem. B.2 with index v1(i)/n and the posq coordinate of the PE.
Thus, the default position is selected when there is no j ∈ [n] satisfying the attention predicate; it remains to supply the default value. We use an attention layer with the attention score S′ given above and a value consisting of V(j) together with default(j). Let ji be the position that i attends to. Then we use a position-wise if/else operation that returns (the simulation of) D(i) if default(ji) = 1 and V(ji) otherwise. This concludes the proof of Theorem 5.11.
C An Alternate Position Encoding
The simulation of S-RASP via average hard attention transformers in Thm. 5.11 relies on three kinds of position encoding: i/n, (i/n)², and 1/(i + 2). In this section, we present evidence for the following.
Any transduction computable by an S-RASP program is computable by a masked average-hard attention transformer encoder with a position encoding of i/n.
First, 1/(i + 2) can be computed from i/n.
A transformer with positions i ∈ {−1, 0,…, n} and position encoding i/n can compute 1/(i + 2) at all positions i.
As observed by Merrill and Sabharwal (2024), a transformer can use the i/n encoding to uniquely identify the first position (−1) and compute 1/(i + 2) by using non-strict future masked attention with value 1 at that position and 0 elsewhere (0,…, n −1).
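A two-line Python sketch of this averaging argument (illustrative):

```python
# 1/(i + 2) from attention (sketch): attend with non-strict future masking and uniform
# scores, with value 1 at the default position -1 and value 0 everywhere else; the
# average over the i + 2 visible positions is exactly 1/(i + 2).
def one_over_i_plus_2(n):
    values = [1.0] + [0.0] * n
    return [sum(values[: i + 2]) / (i + 2) for i in range(n)]

print(one_over_i_plus_2(4))    # [0.5, 0.333..., 0.25, 0.2]
```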
In Thm. C.4 we show that the position encoding i/n and 1/(i + 2)² suffices for the simulation of S-RASP by a masked average hard attention transformer. Though it is unclear whether a transformer with position encoding i/n can compute 1/(i + 2)², we note the following.
A transformer with positions i ∈ {−1, 0,…, n} and position encoding i/n can compute 1/((i + 2)² − 1) at positions i < n.
Any transduction computable by an S-RASP program is computable by a masked average-hard attention transformer encoder with a position encoding of i/n and 1/(i + 2)².
The proof of this theorem closely follows the argument presented earlier for Thm. 5.11, except for the position encoding used. We will show how each use of (i/n)² in that original argument can be replaced with an equivalent use of 1/(i + 2)², which we assume to be stored in a coordinate called posiq (for “inverse quadratic”). We also assume that 1/(i + 2) is available by Prop. C.2.
The original proof uses the quadratic maximization in Lem. B.1, which we replace with the following: for q ∈ [n], the score gq(j) = 2/(n(j + 2)) − (q + 2)/(n(j + 2)²) is uniquely maximized over j ∈ [n] at j = q.
Consider the derivative, −2/(n(j + 2)²) + 2(q + 2)/(n(j + 2)³), whose only real-valued root is j = q. Furthermore, the derivative is positive for j < q and negative for j > q.
This score is a bilinear form that can be computed via average hard attention using query at position i and key at position j. In all our applications of this new score, we will ensure that q/n is available at position i. The 2/n term can also be computed at position i by attending uniformly (without masking) with value 2 at the first position and 0 elsewhere. There are three uses of posq in the original argument that we have to modify.
As in the original argument, there may be multiple matches and we thus need to break ties in favor of the leftmost or rightmost match. To this end, we observe that S(i, j) = 1/(n(ki + 2)) when kj = ki, and compare this to the maximum value of S(i, j) for kj ≠ ki, which is (ki + 4)/(n(ki + 3)²), attained at kj = ki + 1. Thus, the gap between the attention score when kj = ki versus the maximum possible when kj ≠ ki is 1/(n(ki + 2)(ki + 3)²). Since ki < n, this is lower bounded by 1/(n(n + 1)(n + 2)²) > g(n) where g(n) = 1/(20n⁴). As in the original argument, if we add or subtract from S(i, j) values less than g(n), no non-equality score can exceed the corresponding equality score. We achieve this by adding or subtracting the tie-breaking term g(n)j/(2n) = j/(40n⁵); the reason for using this specific tie-breaking term will become apparent when we discuss default values below. This term is computable by first computing 1/n⁴ at position i and then using dot product attention with j/n in the position encoding of j. In order to compute 1/n⁴, we can attend uniformly with only the first position having value 1/n (the rest having value 0) to obtain 1/n², and repeat this process twice more to obtain 1/n⁴. This finishes the updates needed for the simulation of leftmost and rightmost attention.
We address default values in a similar way as in the original proof. When it involves an equality comparison of integer vectors and rightmost attention, we observe that with the tie-breaking term g(n)j/(2n) discussed above, the gap between the matching attention score 1/(n(ki + 2)) and the maximum non-matching attention score for rightmost attention is at least g(n)/2. Hence, a default position value of 1/(n(ki + 2)) −g(n)/4 is strictly between these two values. Further, this default position value is computable at position i by the same arguments as above. We treat default values with leftmost attention analogously.
D Comparison with Other Simulations
In the prefix sum operation (1), the result at position i is s(i)/(i + 1), where s(i) is the prefix sum of v(i). The fact that the denominator of this expression varies with position is an obstacle to comparing or adding the values s(i) and s(j) at two different positions i and j. This problem is addressed by Yao et al. (2021) and Merrill and Sabharwal (2024) using a non-standard layer normalization operation to produce a vector representation of the quantities, which allows them to be compared for equality using dot product attention. Pérez et al. (2021) include 1/(i + 1) in their position embedding to enable the comparison; however, they compute attention scores using a non-standard scoring function in place of the standard dot product. The approach of the current paper is based on that of Barceló et al. (2024), who show how average hard attention can be used to compute the prefix sum of a 0/1 vector.