In this article, we mainly study the depth and width of autoencoders that use rectified linear unit (ReLU) activation functions. An autoencoder is a layered neural network consisting of an encoder, which compresses an input vector to a lower-dimensional vector, and a decoder, which transforms the low-dimensional vector back to the original input vector exactly (or approximately). In a previous study, Melkman et al. (2023) studied the depth and width of autoencoders using linear threshold activation functions with binary input and output vectors. We show that similar theoretical results hold if autoencoders using ReLU activation functions with real input and output vectors are used. Furthermore, we show that it is possible to compress input vectors to one-dimensional vectors using ReLU activation functions, whereas the size of compressed vectors is trivially Ω(log n) for autoencoders with linear threshold activation functions, where n is the number of input vectors. We also study the case of linear activation functions. The results suggest that the compressive power of autoencoders using linear activation functions is considerably limited compared with that of autoencoders using ReLU activation functions.

Over the past decade, we have seen rapid progress in both the development and the application of artificial neural network (ANN) technologies. Among various models of ANNs, much attention has recently been paid to autoencoders because of their power to generate new data. Indeed, autoencoders have been applied to various areas including image processing (Doersch, 2016), natural language processing (Tschannen et al., 2018), and drug discovery (Gómez-Bombarelli et al., 2018). An autoencoder is a layered neural network consisting of an encoder and a decoder, where the former transforms an input vector x to a low-dimensional vector z = f(x) and the latter transforms z to an output vector y = g(z), which should be the same as or similar to the input vector. Therefore, an autoencoder performs a kind of dimensionality reduction. The encoder and decoder functions, f and g, are usually obtained via unsupervised learning that minimizes the difference between input and output data by adjusting weights (and some additional parameters).

Although autoencoders have a long history (Ackley et al., 1985; Baldi & Hornik, 1989; Hinton & Salakhutdinov, 2006), how data are compressed via autoencoders is not yet very clear. Baldi and Hornik (1989) studied relations between principal component analysis (PCA) and autoencoders with one hidden layer. Hinton and Salakhutdinov (2006) conducted empirical studies on relations between the depth of autoencoders and the dimensionality reduction. The results suggest that deeper networks can produce lower reconstruction errors. Kärkkäinen and Hänninen (2023) also empirically studied relations between the depth and the dimensionality reduction using a variant model of autoencoders. The results suggest that deeper networks obtain lower autoencoding errors during the identification of the intrinsic dimension, but the detected dimension does not change compared to a shallow network. Recently, several analyses have been done on mutual information between layers in order to understand information flow in autoencoders (Lee & Jo, 2021; Tapia & Estévez, 2020; Yu & Príncipe, 2019). Baldi (2012) presented and studied a general framework on autoencoders with both linear and nonlinear activation functions. In particular, he showed that learning in the autoencoder with Boolean activation functions is NP-hard in general by a reduction from a clustering problem.

However, as far as we know, until recently no theoretical studies had been done on relations between the compressive power and the size of autoencoders, whereas extensive theoretical studies have been done on the representational power of deep neural networks (Delalleau & Bengio, 2011; Montufar et al., 2014; Vershynin, 2020; Yun et al., 2019). Recently, some theoretical studies have been done on the compressive power of autoencoders with linear threshold functions (Akutsu & Melkman, 2023; Melkman et al., 2023) with respect to the depth (number of layers) and width (number of nodes in a layer). However, linear threshold networks are not popular in recent studies on neural networks. Furthermore, linear threshold networks can handle only binary input and output, which is far from practical settings.

Motivated by this situation, in this article we study the compressive power of autoencoders with real input and output vectors using rectified linear unit (ReLU) functions as the activation functions, focusing on theoretical aspects. In order to clarify the superiority of ReLU functions over linear functions, we also study the compressive power of autoencoders with linear activation functions (not linear threshold activation functions).

The results are summarized in Table 1, where D and d denote the dimensions of the input and compressed vectors, respectively; n denotes the number of input vectors; and A is a matrix explained in theorem 2. Note that the number and dimension of the input vectors must be the same as those of the output vectors. Here, we first obtain some new results about autoencoders with linear activation functions. Then we modify theorems 12, 19, and 22 in Melkman et al. (2023) for ReLU functions and real input vectors. In addition, we modify theorem 1 in Zhang et al. (2017) so that the number of nodes in the middle layer decreases from n to 2√n. Based on the proof of theorem 3.1 in Yun et al. (2019), we design a four-layer ReLU neural network to reduce the number of nodes in the middle layer. The difference between our construction and theorem 3.1 is that hard-tanh activation functions are used in the proof of theorem 3.1, whereas we use ReLU activation functions. Finally, we modify the decoders in the proofs of theorems 19 and 22 in Melkman et al. (2023).

Table 1:

Summary of Results.

Result | Vector | Middle Layer (d) | Architecture | Type | Activation
Theorem 1 | real | n - 1 | D/d/D | Encoder/Decoder | Linear
Theorem 2 | real | d | D/d/D if rank(A) ⩽ d + 1 | Encoder/Decoder | Linear
Theorem 3 | real | 2√n | D/(D+d)/(D+d/2)/(3d/2)/d | Encoder | ReLU
(Theorem 12 (Melkman et al., 2023)) | binary | 2√n | D/(D+d/2)/d | Encoder | Threshold
Theorem 4 | real | 2√n | D/(D+d)/(D+d/2)/(3d/2)/d/(dD/2)/D | Encoder/Decoder | ReLU
(Theorem 19 (Melkman et al., 2023)) | binary | 2√n | D/(D+d/2)/d/(dD/2)/D | Encoder/Decoder | Threshold
Theorem 5 | real | 2 log n | Theorem 3 + 2d/d/4√n/2√n/D√n/D | Encoder/Decoder | ReLU
(Theorem 22 (Melkman et al., 2023)) | binary | 2 log n | Theorem 12 (Melkman et al., 2023) + d/2√n/D√n/D | Encoder/Decoder | Threshold
(Corollary 8 (Akutsu & Melkman, 2023)) | binary | ⌈log n⌉ | Theorem 12 (Melkman et al., 2023) + d/2√(nD)/√(nD)/D | Encoder/Decoder | Threshold
Proposition 2 | real | 1 | D/1/(2√n+1)/(√n+1)/3√n/2√n/D√n/D | Encoder/Decoder | ReLU
Theorem 7 | real | 2√n | D/d/d/1 | Memorizer | ReLU
(Theorem 1 (Zhang et al., 2017)) | real | n | D/d/1 | Memorizer | ReLU

Note: Theorems in parentheses are existing results.

Specifically, theorem 1 reveals that for any set of n = d + 1 vectors in D-dimensional Euclidean space, there is a three-layer perfect autoencoder with linear activation functions whose middle layer has d nodes. When the number of vectors in the given set is greater than d + 1, such an autoencoder may not exist, and theorem 2 gives conditions under which a three-layer perfect autoencoder with linear activation functions and a d-node middle layer exists.

Theorem 3 is obtained by modifying theorem 12 in Melkman et al. (2023). Compared with theorem 12, in theorem 3 we use ReLU activation functions so that the binary input vectors can be replaced by real input vectors; as a result, the number of layers of the designed ReLU neural network increases by 2 and the number of hidden nodes increases by D + 5d/2.

By adding a decoder to the encoder constructed by theorem 3, we obtain a seven-layer perfect autoencoder in theorem 4. It is worth noting that in the decoder part, we do not simply represent the threshold function of theorem 19 in Melkman et al. (2023) as three ReLU functions with two layers, but design some new ReLU activation functions. Therefore, compared with theorem 19 in Melkman et al. (2023), the numbers of nodes and layers in the decoder part do not increase.

In theorem 4, the size of the middle layer is 2√n. In theorem 5, we reduce the size of the middle layer from 2√n to 2 log n. Moreover, theorem 5 is obtained by modifying theorem 22 in Melkman et al. (2023). In proposition 2, we further reduce the size of the middle layer to 1 by using a real number as a compressed vector. The proof is obtained by simple modifications of those for theorems 3 and 4.

We also consider the number of nodes and layers needed to represent a memorizer, which realizes a given set of pairs of input vectors and their output values. Compared with theorem 1 in Zhang et al. (2017), in theorem 7 the number of nodes in the middle layer decreases to 2√n while the number of layers increases by 1. It is worth noting that the four-layer ReLU neural network we design in theorem 7 is based on the four-layer fully connected neural network of theorem 3.1 in Yun et al. (2019).1 However, theorem 3.1 mainly describes the design of a fully connected neural network using hard-tanh activation functions, whereas we need to design a ReLU neural network, so there are some differences in the design process, which we explain.

It is seen from Table 1 that there is a large difference in the number of nodes of the middle layer between the linear and ReLU autoencoders. Linear autoencoders need n - 1 nodes in the middle layer (in particular when rank(A) = n), whereas ReLU autoencoders need a much smaller number of nodes in the middle layer. This is a crucial limitation of linear autoencoders. However, ReLU autoencoders need more than three layers and a large number of nodes (e.g., D√n nodes) in some layers. This is a limitation of ReLU autoencoders. In addition, errors are not taken into account in either type of autoencoder, which is a limitation from a practical viewpoint.

In summary, the contribution of this article is to theoretically analyze the number of nodes and layers of autoencoders for real input and output vectors for the first time. Although we use the framework and some techniques introduced in Melkman et al. (2023), we introduce additional techniques in this article, and thus there exist substantial differences between the results of Melkman et al. (2023) and those of this article, as seen from Table 1. We also use some techniques introduced in Yun et al. (2019); however, we use them to show theorem 7, which can also serve as the decoder part of an autoencoder.

R denotes the set of real numbers, and R^n the set of n-dimensional real column vectors. For integers a and b with a < b, we denote [a : b] ≔ {a, a + 1, . . . , b}.

A function f: {0, 1}^h → {0, 1} is called a Boolean threshold function if it is represented as f(x) = 1 if a · x ⩾ θ and f(x) = 0 otherwise, for some (a, θ), where a ∈ R^h, θ ∈ R, and a · x denotes the inner product of the two vectors a and x. We also denote the same function f as [a · x ⩾ θ].
A function f: R^h → R is called a ReLU function if it is represented as f(x) = max(a · x + b, 0) for some (a, b), where a ∈ R^h and b ∈ R.
In this article, we consider only layered neural networks in which a linear, linear threshold, or ReLU function is assigned to each node except the input nodes. The nodes in a network are divided into L layers, and each node in the ith layer has inputs only from nodes in the (i - 1)th layer (i = 2, . . . , L). Then the states of the nodes in the ith layer can be represented as a W_i-dimensional vector, where W_i is the number of nodes in the ith layer and is called the width of the layer. A layered neural network is represented as y = f^(L-1)(f^(L-2)(⋯ f^(1)(x) ⋯)), where x and y are the input and output vectors, respectively, and f^(i) is the list of activation functions for the (i + 1)th layer. The first and Lth layers are called the input and output layers, respectively, and the corresponding nodes are called input and output nodes, respectively. When we consider autoencoders, one layer (the kth layer, where k ∈ {2, . . . , L - 1}) is specified as the middle layer, and the nodes in this layer are called the middle nodes. Then the middle vector z, the encoder f, and the decoder g are defined by z = f(x) and y = g(z), where f = f^(k-1) ∘ ⋯ ∘ f^(1) and g = f^(L-1) ∘ ⋯ ∘ f^(k).
Since we consider autoencoders, the input layer and the output layer have the same number of nodes, denoted by D. We use d to denote the number of nodes in the middle layer.
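As a concrete (purely illustrative) example of this setting, the following NumPy sketch represents a layered ReLU network as a list of weight-bias pairs; the encoder is the composition of the layers up to the middle layer and the decoder is the composition of the remaining layers. All names and values here are ours, not the article's.

```python
import numpy as np

def relu_layer(W, b, x):
    # One layer of ReLU nodes: node i computes max(W[i] . x + b[i], 0).
    return np.maximum(W @ x + b, 0.0)

def forward(layers, x, k):
    """Evaluate a layered ReLU network given as a list of (W, b) pairs.

    Counting the input layer as layer 1 (as in the text), layer k is the
    middle layer; the function returns the output vector y and the middle
    vector z, so the encoder f and decoder g are the maps x -> z and z -> y.
    """
    h, z = np.asarray(x, dtype=float), None
    for i, (W, b) in enumerate(layers, start=2):
        h = relu_layer(W, b, h)
        if i == k:
            z = h.copy()
    return h, z

# A tiny D = 3 -> d = 1 -> D = 3 network with random weights, for illustration only.
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(1, 3)), rng.normal(size=1)),
          (rng.normal(size=(3, 1)), rng.normal(size=3))]
y, z = forward(layers, [0.5, 1.0, 2.0], k=2)
```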

Let X_n = {x^0, . . . , x^{n-1}} be a set of n D-dimensional binary or real input vectors that are all different. We define perfect encoders, decoders, and autoencoders as follows (Melkman et al., 2023).

Definition 1.

A mapping f is called a perfect encoder for X_n if f(x^i) ≠ f(x^j) holds for all i ≠ j.

Definition 2.

A pair of mappings (f, g) is called a perfect autoencoder for X_n if g(f(x^i)) = x^i holds for all x^i ∈ X_n. Furthermore, g is called a perfect decoder.

In a word, a perfect encoder maps distinct input vectors into distinct middle vectors, a perfect decoder maps each middle vector back to the original input vector, and a perfect autoencoder maps each input vector to the original input vector via a distinct middle vector.

Note that a perfect decoder exists only if there exists a perfect autoencoder. Furthermore, it is easily seen from the definitions that if (f, g) is a perfect autoencoder, f is a perfect encoder.

Many of the results in Table 1 rely on the following proposition. Since it is mentioned in Zhang et al. (2017) without a proof, we give our own proof here:

Proposition 1.

For any set of D-dimensional distinct real vectors X = {x^0, . . . , x^{n-1}}, there exists a real vector a satisfying a · x^i ≠ a · x^j for all i ≠ j.

Proof.

We prove the proposition by mathematical induction on D.

In the case of D = 1, the claim trivially holds by letting a = [1]. Assume that the claim holds for all X in the case of D = d - 1. Let X = {x^0, . . . , x^{n-1}} be a set of d-dimensional distinct vectors. For each vector x = [x_0, . . . , x_{d-1}], let x̂ = [x_0, . . . , x_{d-2}] and X̂ = {x̂^0, . . . , x̂^{n-1}}. It should be noted that x̂^i = x̂^j may hold for some i ≠ j. From the induction hypothesis, we can assume that there exists a vector â = [a_0, . . . , a_{d-2}] satisfying â · x̂^i ≠ â · x̂^j for all x̂^i ≠ x̂^j. Here, we define a = [a_0, . . . , a_{d-2}, a_{d-1}] by using a sufficiently large real number a_{d-1} such that |a_{d-1} · (x^i_{d-1} - x^j_{d-1})| > |â · (x̂^h - x̂^k)| holds for any i, j, k, h such that x^i_{d-1} ≠ x^j_{d-1}. Then a · x^i ≠ a · x^j clearly holds for all i ≠ j.

Furthermore, we can assume without loss of generality (w.l.o.g.) that the x^i are reindexed so that

c_0 < c_1 < ⋯ < c_{n-1}, where c_i = a · x^i,    (2.1)

holds, and additionally that c_{-1} = c_0 - δ > 0 and c_n = c_{n-1} + δ hold for any δ > 0.

For the input vectors, we can assume w.l.o.g. that all elements of the real input vectors x^i are nonnegative, because this assumption can be satisfied by adding a sufficiently large constant to each element. For example, let z = min_{i ∈ [0:n-1], j ∈ [0:D-1]} x^i_j. If z < 0, all elements of the real input vectors x^i can be made nonnegative by adding any value not less than -z to each element, and this shift does not affect the constructions below.
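To make these assumptions concrete, the following NumPy sketch (our own illustrative code, not from the article) shifts the data so that all coordinates are nonnegative, draws a random direction a, checks that the projections a · x^i are distinct, and reindexes the vectors so that equation 2.1 holds.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))            # n = 8 distinct 5-dimensional toy vectors

# Shift all inputs so that every coordinate is nonnegative, as assumed in the text.
z = X.min()
if z < 0:
    X = X - z

# A random direction a gives distinct projections a . x^i with probability 1;
# this is one simple way to realize proposition 1 numerically (the proof itself
# uses an inductive construction instead).
a = rng.normal(size=5)
c = X @ a
assert len(np.unique(c)) == len(c), "projections collide; redraw a"

# Reindex the x^i so that c_0 < c_1 < ... < c_{n-1}, as in equation 2.1.
order = np.argsort(c)
X, c = X[order], c[order]
```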

In this section, we consider autoencoders using linear functions as the activation functions.

We begin with a simple example. Consider the case of D = 3, d = 1, and n = 2. Then X = {[x_0, y_0, z_0], [x_1, y_1, z_1]}. Let f([x_i, y_i, z_i]) = [x_i]. Let g([x_i]) = [x_i, ax_i + b, cx_i + d], where ax_i + b satisfies ax_0 + b = y_0 and ax_1 + b = y_1, and cx_i + d satisfies cx_0 + d = z_0 and cx_1 + d = z_1. Then (f, g) is a perfect autoencoder for most X (precisely, if det([x_0, 1; x_1, 1]) ≠ 0, where det(A) denotes the determinant of a matrix A).

For another example, consider the case of D = 4, d = 2, and n = 3. Then X = {[x_0, y_0, z_0, w_0], [x_1, y_1, z_1, w_1], [x_2, y_2, z_2, w_2]}.
Let f([x_i, y_i, z_i, w_i]) = [x_i, y_i]. Let g([x_i, y_i]) = [x_i, y_i, ax_i + by_i + c, dx_i + ey_i + f], where ax_i + by_i + c satisfies ax_0 + by_0 + c = z_0, ax_1 + by_1 + c = z_1, and ax_2 + by_2 + c = z_2, and dx_i + ey_i + f satisfies the analogous equations for w_i. Then (f, g) is a perfect autoencoder for most X (precisely, if det([x_0, y_0, 1; x_1, y_1, 1; x_2, y_2, 1]) ≠ 0).
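As a quick numerical check of this second example (our own code; the data values are arbitrary), the decoder coefficients can be obtained by solving a 3 × 3 linear system:

```python
import numpy as np

# Toy instance of the D = 4, d = 2, n = 3 example: rows are the vectors [x, y, z, w].
X = np.array([[0.0, 1.0, 2.0, 3.0],
              [1.0, 0.0, 1.0, 5.0],
              [2.0, 2.0, 0.0, 1.0]])

# Encoder: keep the first two coordinates.
Z = X[:, :2]

# Decoder: fit z = a*x + b*y + c and w = d*x + e*y + f by solving a 3x3 linear system.
M = np.column_stack([Z, np.ones(3)])        # rows [x_i, y_i, 1]
abc = np.linalg.solve(M, X[:, 2])           # coefficients (a, b, c)
deff = np.linalg.solve(M, X[:, 3])          # coefficients (d, e, f)

decoded = np.column_stack([Z, M @ abc, M @ deff])
assert np.allclose(decoded, X)              # perfect reconstruction when det(M) != 0
```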

By generalizing these simple examples, we have the following theorem.

Theorem 1.

Let X be a set of d + 1 vectors in D-dimensional Euclidean space, where D is any integer such that D > d. Then, for X, there exists a perfect autoencoder (f, g) with linear activation functions that has a middle layer (i.e., compressed layer) with d nodes (i.e., it performs dimensionality reduction to d dimensions).

Proof.

Suppose that a set of d + 1 vectors in D-dimensional Euclidean space is described as X = {x^0, x^1, . . . , x^d}, where x^i = [x^i_0, x^i_1, . . . , x^i_{D-1}], i ∈ [0 : d], and D > d.

Then a matrix A can be constructed as follows:

A = [ x^0_0  x^0_1  ⋯  x^0_{D-1}  1
      x^1_0  x^1_1  ⋯  x^1_{D-1}  1
      ⋮      ⋮          ⋮          ⋮
      x^d_0  x^d_1  ⋯  x^d_{D-1}  1 ].

Here, A is a (d + 1) × (D + 1) matrix and D > d, so rank(A) ⩽ d + 1; we assume w.l.o.g. that rank(A) = d_1 + 1 (d_1 ⩽ d).
Let x_j = [x^0_j, x^1_j, . . . , x^d_j]^T for j ∈ [0 : D - 1], let 1_{d+1} = [1, 1, . . . , 1]^T denote the (d + 1)-dimensional all-ones column vector, and let S = {x_0, x_1, . . . , x_{D-1}, 1_{d+1}}. According to the fact that the rank of the matrix A is the maximum number of linearly independent column vectors in A, we assume w.l.o.g. that a maximal linearly independent subset of S is {x_{α_1}, x_{α_2}, . . . , x_{α_{d_1}}, 1_{d+1}}, where 0 ⩽ α_1 < ⋯ < α_{d_1} ⩽ D - 1.

A maximal linearly independent subset of S can be obtained as follows:

  • Step 1: Initialize S̄ = {1_{d+1}}.

  • Step 2: Choose a column vector x_i ∈ S. If 1_{d+1} and x_i are linearly dependent, remove x_i from S, choose another column vector in S, and repeat this step. If 1_{d+1} and x_i are linearly independent, add x_i to the set S̄, that is, S̄ = {1_{d+1}, x_i}, and go to the next step.

  • Step 3: Let S̄ = {1_{d+1}, x_i}, where 1_{d+1} and x_i are linearly independent. Choose a column vector x_j ∈ S. If 1_{d+1}, x_i, and x_j are linearly dependent, remove x_j from S, choose another column vector in S, and repeat this step. If 1_{d+1}, x_i, and x_j are linearly independent, add x_j to the set S̄, that is, S̄ = {1_{d+1}, x_i, x_j}, and go to the next step.

  • Step 4: . . .

Continuing in this way, we finally find that S̄ is a maximal linearly independent subset of S. Here, 1_{d+1} ∈ S̄.

Then any other vector in S can be expressed as a linear combination of elements of the maximal linearly independent subset, so for any x_j ∈ S, we can always write

x_j = a^j_1 x_{α_1} + a^j_2 x_{α_2} + ⋯ + a^j_{d_1} x_{α_{d_1}} + a^j 1_{d+1},

where a^j_1, a^j_2, . . . , a^j_{d_1}, a^j ∈ R.
In particular, if d_1 < d, we add d - d_1 further vectors x_{α_{d_1+1}}, . . . , x_{α_d} from S ∖ {x_{α_1}, . . . , x_{α_{d_1}}, 1_{d+1}} to the maximal linearly independent subset (their coefficients can be taken to be zero). Then for any x_j ∈ S, we still have

x_j = a^j_1 x_{α_1} + a^j_2 x_{α_2} + ⋯ + a^j_d x_{α_d} + a^j 1_{d+1}.
Let

f([x^i_0, x^i_1, . . . , x^i_{D-1}]) = [x^i_{α_1}, x^i_{α_2}, . . . , x^i_{α_d}]

and

g([x^i_{α_1}, x^i_{α_2}, . . . , x^i_{α_d}]) = [y^i_0, y^i_1, . . . , y^i_{D-1}],

where y^i_j = a^j_1 x^i_{α_1} + a^j_2 x^i_{α_2} + ⋯ + a^j_d x^i_{α_d} + a^j, j ∈ [0 : D - 1], and i ∈ [0 : d]. Clearly, we have a^j_1 x^i_{α_1} + a^j_2 x^i_{α_2} + ⋯ + a^j_d x^i_{α_d} + a^j = x^i_j for j ∈ [0 : D - 1] and i ∈ [0 : d], that is, g([x^i_{α_1}, x^i_{α_2}, . . . , x^i_{α_d}]) = [x^i_0, x^i_1, . . . , x^i_{D-1}]. Hence, (f, g) is a perfect autoencoder for X.


Furthermore, for X with |X| > d + 1, we consider whether there is a perfect autoencoder with linear activation functions that has the compressed layer with d nodes and we have the following result:

Theorem 2.
Consider a set of c + 1 vectors X = {x^0, x^1, . . . , x^c} in D-dimensional Euclidean space, where x^i = [x^i_0, x^i_1, . . . , x^i_{D-1}], i ∈ [0 : c]. Let A be the (c + 1) × (D + 1) matrix whose ith row is [x^i_0, x^i_1, . . . , x^i_{D-1}, 1], defined analogously to the matrix in theorem 1.
There are two cases:
  • If rank(A) ⩽ d + 1, then there exists a perfect autoencoder with linear activation functions that has the compressed layer with d (d < c) nodes.

  • If rank(A) > d + 1, then there does not exist a perfect autoencoder with linear activation functions that has the compressed layer with d (d < c) nodes.

Proof.

When rank(A) ⩽ d + 1, a perfect autoencoder can be constructed as in the proof of theorem 1.

When rank(A) > d + 1, let x_j = [x^0_j, x^1_j, . . . , x^c_j]^T for j ∈ [0 : D - 1], let 1_{c+1} denote the (c + 1)-dimensional all-ones column vector, and let S = {x_0, x_1, . . . , x_{D-1}, 1_{c+1}}. Any basis of the span of S must contain more than d + 1 vectors. On the other hand, the outputs of a linear network whose middle layer has d nodes, together with 1_{c+1}, span a subspace of dimension at most d + 1. Hence, there does not exist a perfect autoencoder with linear activation functions that has the compressed layer with d nodes.
Remark 1.

According to theorems 1 and 2, we know that the number of nodes in the compression layer is at least rank(A) - 1 for the existence of a perfect autoencoder.

Remark 2.

Theorem 1 still holds for multilayer networks because composition of linear functions is a linear function.

Remark 3.
It is worth noting that for the above example, if x0 = x1, the method for the case of D = 3, d = 1, and n = 2 does not work. The reason is that in this case,
is not the maximal linearly independent subset of
More generally, if {x0, x1, . . . , xd-1, 1d+1} does not contain the maximal linearly independent subset of S, then in order to get a perfect autoencoder, we cannot design the encoder f as f(xi)=[x0i,x1i,...,xd-1i].
Remark 4.

According to theorem 2, we need to calculate the rank of the matrix A. For a large matrix A, it is not easy to obtain its rank directly, but we can use rank estimation methods, such as that of Ubaru and Saad (2016), to estimate it.
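The following NumPy sketch (our own illustrative code) puts theorems 1 and 2 together: it checks the rank condition and, when it holds, builds the coordinate-selecting encoder and the affine decoder used in the proofs. Here numpy.linalg.matrix_rank is used for simplicity; for very large matrices, the estimation methods mentioned above would be used instead.

```python
import numpy as np

def linear_autoencoder(X, d):
    """Sketch of the construction behind theorems 1 and 2 (our own code).

    X has shape (c+1, D); returns (encode, decode) or None if rank(A) > d + 1,
    in which case no perfect linear autoencoder with a d-node middle layer exists."""
    c1, D = X.shape
    A = np.hstack([X, np.ones((c1, 1))])        # the matrix A of theorem 2
    if np.linalg.matrix_rank(A) > d + 1:
        return None

    # Greedily pick up to d column indices of X so that the chosen columns,
    # together with the all-ones column, span the column space of A.
    chosen = []
    basis = np.ones((c1, 1))
    for j in range(D):
        cand = np.hstack([basis, X[:, [j]]])
        if np.linalg.matrix_rank(cand) > basis.shape[1] and len(chosen) < d:
            chosen.append(j)
            basis = cand

    # Decoder: each output coordinate is an affine function of the kept coordinates,
    # obtained by least squares (exact here because rank(A) <= d + 1).
    M = np.hstack([X[:, chosen], np.ones((c1, 1))])
    C, *_ = np.linalg.lstsq(M, X, rcond=None)

    encode = lambda x: x[chosen]
    decode = lambda z: np.append(z, 1.0) @ C
    return encode, decode

rng = np.random.default_rng(1)
B = rng.normal(size=(4, 6))
X = np.vstack([B, B[0] + B[1]])                 # 5 vectors in R^6 with rank(A) = 5
enc, dec = linear_autoencoder(X, d=4)
assert all(np.allclose(dec(enc(x)), x) for x in X)
```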

According to Kumano and Akutsu (2022), we can simulate a threshold function by using three ReLU functions arranged in two layers. Hence, theorems 12, 19, and 22 in Melkman et al. (2023) can be modified for ReLU functions such that binary input vectors are replaced by real input vectors, at the cost of increasing the number of layers by a small constant and the number of nodes in some layers by a constant factor (because one threshold function can be simulated using three ReLU functions in two layers). Note that the compressed layer still corresponds to binary vectors.
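To illustrate the idea, the following sketch shows one standard way to emulate a Boolean threshold gate [z ⩾ θ] with ReLU units (our own code, not necessarily the exact construction of Kumano & Akutsu, 2022). In a network, the final difference would itself be taken by a ReLU node in the next layer, which is harmless because the difference is always nonnegative; this matches the count of three ReLU functions in two layers.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def soft_threshold(z, theta, eps):
    # Equals 0 for z <= theta - eps and 1 for z >= theta + eps, ramping linearly
    # in between; the constructions below always choose eps small enough that no
    # input value falls inside the ramp, so the gate behaves exactly like [z >= theta].
    return relu((z - theta + eps) / (2 * eps)) - relu((z - theta - eps) / (2 * eps))

z = np.array([0.0, 0.9, 1.1, 5.0])
print(soft_threshold(z, theta=1.0, eps=0.05))   # -> [0. 0. 1. 1.]
```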

Specifically, given n different binary input vectors of dimension D, theorem 12 (Melkman et al., 2023) states that there is a three-layer network whose activation functions are Boolean threshold functions that maps these vectors to n different binary vectors of dimension 2√n using √n + D hidden nodes. Theorem 19 (Melkman et al., 2023) then states that there is a five-layer perfect Boolean threshold network autoencoder whose encoder is constructed based on theorem 12 and whose middle hidden layer has 2√n nodes. Further, theorem 22 (Melkman et al., 2023) states that there is a seven-layer perfect Boolean threshold network autoencoder; compared with the autoencoder constructed in theorem 19, the number of nodes in the middle hidden layer is reduced to 2 log n, although the number of layers is increased by 2.

As in Melkman et al. (2023), we define an r-dimensional binary vector h_i[r] = [h^i_0, . . . , h^i_{r-1}] by h^i_j = 1 for j ⩽ i and h^i_j = 0 for j > i.
We first modify theorem 12 in Melkman et al. (2023) for ReLU functions and real input vectors as follows:
Theorem 3.
Let r = √n. For any set of D-dimensional real vectors X = {x^0, x^1, . . . , x^{n-1}}, where x^i = [x^i_0, x^i_1, . . . , x^i_{D-1}] and i ∈ [0 : n - 1], there exists a five-layer ReLU neural network that maps x^i to (h_k[r], h_l[r]), where i = kr + l with i ∈ [0 : n - 1], k ∈ [0 : r - 1], and l ∈ [0 : r - 1].
The neural network has D input nodes x_j, j ∈ [0 : D - 1]; D + 2r nodes in the second layer, α^1_i, α^1_{r+i}, i ∈ [0 : r - 1], and α^2_j, j ∈ [0 : D - 1]; D + r nodes in the third layer, β^1_i, i ∈ [0 : r - 1], and β^2_j, j ∈ [0 : D - 1]; 3r nodes in the fourth layer, γ^1_i, i ∈ [0 : r - 1], and γ^2_i, γ^2_{r+i}, i ∈ [0 : r - 1]; and 2r output nodes, η^1_i, η^2_i, i ∈ [0 : r - 1] (see also Figure 1).
Figure 1:

Five-layer ReLU neural network. The circle nodes represent those carrying input and output information. Specifically, there are D input nodes x_j, j ∈ [0 : D - 1], and 2r output nodes η^1_i, η^2_i, i ∈ [0 : r - 1]. The arrows connecting the circle nodes show that the states of nodes in the ith (i = 2, 3, 4, 5) layer are determined based on the ReLU activation functions related to the states of some nodes in the (i - 1)th layer.

Proof.
First, the nodes α^2_j simply copy the input: α^2_j = x_j, j ∈ [0 : D - 1]. For any i ∈ [0 : r - 1], we choose s_i with c_{ir-1} < s_i < c_{ir}, and there exists an ε^1_i > 0 such that s_i - ε^1_i > c_{ir-1} and s_i + ε^1_i < c_{ir}. Let the ReLU activation function of α^1_i be
and the ReLU activation function of α^1_{r+i} be
In the third layer, let β^2_j = α^2_j, j ∈ [0 : D - 1], and
Here, it is easy to see that β^1 = h_k[r].
In the fourth layer, the node γ^1_i copies β^1_i, where i ∈ [0 : r - 1]. For any i ∈ [0 : r - 1], there exists an ε^2_i > 0 such that c_{kr+i} - 3ε^2_i > c_{kr+(i-1)}. Let the ReLU activation function of γ^2_i be
and the ReLU activation function of γ^2_{r+i} be
where t_i = c_i β^1_0 + (c_{r+i} - c_i) β^1_1 + ⋯ + (c_{(r-1)r+i} - c_{(r-2)r+i}) β^1_{r-1} - 2ε^2_i and i ∈ [0 : r - 1].

Finally, the output node η^1_i copies γ^1_i, where i ∈ [0 : r - 1]. The ReLU activation functions of the remaining output nodes η^2_i, i ∈ [0 : r - 1], are max(γ^2_i - γ^2_{r+i}, 0). Here, we can observe that η^2 = h_l[r].

Example 1.

Suppose that n = 16. Then we have r = √16 = 4, h_0[r] = [1, 0, 0, 0], h_1[r] = [1, 1, 0, 0], h_2[r] = [1, 1, 1, 0], and h_3[r] = [1, 1, 1, 1]. Furthermore, x^9 is mapped to (η^1, η^2) = (h_2[r], h_1[r]) because 9 = 2 · 4 + 1.
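The index bookkeeping in this example can be written out as follows (a short illustrative snippet of our own; the network itself produces these codes through the ReLU layers described above):

```python
def h(i, r):
    # The r-dimensional staircase vector h_i[r]: ones in positions 0..i, zeros afterward.
    return [1 if j <= i else 0 for j in range(r)]

def encode_index(i, r):
    # Split i = k*r + l and return the pair (h_k[r], h_l[r]) used as the code of x^i.
    k, l = divmod(i, r)
    return h(k, r), h(l, r)

# Reproduces example 1: with n = 16 and r = 4, x^9 is coded by (h_2[4], h_1[4]).
print(encode_index(9, 4))   # -> ([1, 1, 1, 0], [1, 1, 0, 0])
```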

By modifying theorem 19 in Melkman et al. (2023) for ReLU functions and real input vectors, we get the following theorem.

Theorem 4.
Let r = √n. For any set of D-dimensional real vectors X = {x^0, x^1, . . . , x^{n-1}}, where x^i = [x^i_0, x^i_1, . . . , x^i_{D-1}] and i ∈ [0 : n - 1], there exists a seven-layer perfect ReLU neural network autoencoder. There are 2r nodes in the middle layer, β^1_i, β^2_i, i ∈ [0 : r - 1]; rD nodes in the sixth layer, η_{i,j}, i ∈ [0 : r - 1], j ∈ [0 : D - 1]; and D output nodes, y_j, j ∈ [0 : D - 1].
Proof.

On top of the encoder constructed in the proof of theorem 3, we add a decoder that is a two-layer ReLU neural network.

Here, we need to design a decoder that outputs y = x^{kr+l} for an input (h_k[r], h_l[r]). Let (β^1, β^2) = (h_k[r], h_l[r]). Note that (β^1, β^2) corresponds to (η^1, η^2) in theorem 3.

First, we choose an ε, where
Equip the node η_{i,j}, i ∈ [0 : r - 1], j ∈ [0 : D - 1], with the ReLU activation function
where β^2_r = 0.
Note that when β^1 = h_k[r], we have
Furthermore, when β^2 = h_l[r], we have
Hence, the value of η_{i,j} is -x^{kr+l}_j + ε if i = l and 0 otherwise.
Finally, the ReLU activation function of the output node y_j is
where j ∈ [0 : D - 1].
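Since the explicit activation functions are not reproduced above, the following NumPy sketch shows a two-layer ReLU decoder of the same shape (rD hidden nodes plus D output nodes) built in the same spirit; it is our own illustrative construction, not a transcription of the proof.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def decode(beta1, beta2, X, r):
    """Recover x^{k*r+l} from the code (beta1, beta2) = (h_k[r], h_l[r]).

    X is the n x D data matrix with nonnegative entries bounded by M."""
    n, D = X.shape
    M = X.max() + 1.0
    b1 = np.asarray(beta1, dtype=float)
    b2 = np.asarray(beta2, dtype=float)
    # One-hot indicators of k and l, obtained as differences of consecutive
    # staircase entries (with an implicit trailing zero); both are affine in the code.
    t = b1 - np.append(b1[1:], 0.0)
    u = b2 - np.append(b2[1:], 0.0)
    eta = np.empty((r, D))
    for i in range(r):
        # Hidden node eta_{i,j}: sum_{k'} X[k'*r+i, j] * t_{k'} equals x^{k*r+i}_j and is
        # affine in beta1; subtracting M*(1 - u_i) zeroes the row out unless i = l.
        row = t @ X[i::r]
        eta[i] = relu(row - M * (1.0 - u[i]))
    # Output node y_j = max(sum_i eta_{i,j}, 0) = x^{k*r+l}_j.
    return relu(eta.sum(axis=0))

rng = np.random.default_rng(2)
r, D = 4, 3
X = rng.uniform(0.0, 2.0, size=(r * r, D))
h = lambda i: np.array([1.0 if j <= i else 0.0 for j in range(r)])
k, l = 2, 1
assert np.allclose(decode(h(k), h(l), X, r), X[k * r + l])
```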

In theorem 4, the size of the middle layer is 2√n. Next, we reduce the size of the middle layer from 2√n to 2 log n by increasing the number of layers.

Theorem 5.
Let r = √n. For any set of D-dimensional real vectors X = {x^0, x^1, . . . , x^{n-1}}, where x^i = [x^i_0, x^i_1, . . . , x^i_{D-1}] and i ∈ [0 : n - 1], there exists an 11-layer perfect ReLU neural network autoencoder. There are 2 log n nodes in the middle layer (see also Figure 2).
Figure 2:

Eleven-layer perfect ReLU neural network autoencoder (the first four layers are omitted). The circle nodes represent those carrying input and output information. Specifically, there are 2m nodes ξ^1_j, ξ^2_j, j ∈ [0 : m - 1], in the middle layer and D nodes y_j, j ∈ [0 : D - 1], in the output layer. The arrows connecting the circle nodes show that the states of nodes in the ith (i = 6, 7, . . . , 11) layer are determined based on the ReLU activation functions related to the states of some nodes in the (i - 1)th layer.

Proof.

The first five layers are constructed as in the proof of theorem 3, so the input x^i is mapped to (h_k[r], h_l[r]). Let (β^1, β^2) = (h_k[r], h_l[r]). For ease of exposition, we assume that r = 2^m.

On (β^1, β^2), the ReLU activation functions for γ^i_j, i = 1, 2, j ∈ [0 : m - 1], can be computed as follows:
and the ReLU activation functions for γ^i_{m+j}, i = 1, 2, j ∈ [0 : m - 1], can be computed as follows:

Let ξ^i_j = max(γ^i_j - γ^i_{m+j}, 0), where i = 1, 2 and j ∈ [0 : m - 1]. Then ζ^i = [ξ^i_0, . . . , ξ^i_{m-1}] gives a binary representation (in reverse order, least significant bit first) of the index encoded by β^i. For example, ζ^i = [1, 1, 0] for β^i = [1, 1, 1, 1, 0, 0, 0, 0], where n = 64 and r = 8.

In the next layer, let the ReLU activation functions for ρ^i_j, i = 1, 2, j ∈ [0 : r - 1], be
and the ReLU activation functions for ρ^i_{r+j}, i = 1, 2, j ∈ [0 : r - 1], be
where 0 < ε < 0.5.

Let μ^i_j = max(ρ^i_j - ρ^i_{r+j}, 0), where i = 1, 2 and j ∈ [0 : r - 1]. Here, we can find that μ^1 = β^1 and μ^2 = β^2.

Finally, the last two layers are constructed as shown in the proof of theorem 4.
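In other words, the layers γ, ξ convert the staircase code h_k[r] into the binary representation of k, and the layers ρ, μ convert it back. The arithmetic behind these two conversions is sketched below (our own illustrative code; the proof realizes the same maps with the ReLU activation functions above, which are not reproduced here).

```python
import numpy as np

def staircase_to_binary(beta, m):
    # beta = h_k[r] with r = 2**m; return the m-bit binary code of k,
    # least significant bit first (the "reverse order" mentioned in the text).
    k = int(np.sum(beta)) - 1
    return [(k >> j) & 1 for j in range(m)]

def binary_to_staircase(zeta, r):
    # Inverse map: recover h_k[r] from the m-bit code zeta of k.
    k = sum(bit << j for j, bit in enumerate(zeta))
    return [1 if j <= k else 0 for j in range(r)]

# Matches the example in the proof: n = 64, r = 8, beta^i = [1,1,1,1,0,0,0,0] = h_3[8].
beta = [1, 1, 1, 1, 0, 0, 0, 0]
zeta = staircase_to_binary(beta, m=3)
print(zeta)                          # -> [1, 1, 0]
print(binary_to_staircase(zeta, 8))  # -> [1, 1, 1, 1, 0, 0, 0, 0]
```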

In the above, the compressed layer corresponds to binary vectors. However, we can simply use a · x^i to compress a D-dimensional vector to a 1-dimensional vector (i.e., a scalar). As shown below, decoders can be developed for this compressed scalar value by modifying the decoders in the proofs of theorems 3 and 4.

Proposition 2.

Let r = √n. For any set of D-dimensional real vectors X = {x^0, x^1, . . . , x^{n-1}}, where x^i = [x^i_0, x^i_1, . . . , x^i_{D-1}] and i ∈ [0 : n - 1], there exists an eight-layer perfect ReLU neural network autoencoder.

The neural network has D input nodes x_j, j ∈ [0 : D - 1]; one node in the second layer, a · x; 2r + 1 nodes in the third layer, α^1_i, α^1_{r+i}, i ∈ [0 : r - 1], and α^2; r + 1 nodes in the fourth layer, β^1_i, i ∈ [0 : r - 1], and β^2; 3r nodes in the fifth layer, γ^1_i, i ∈ [0 : r - 1], and γ^2_i, γ^2_{r+i}, i ∈ [0 : r - 1]; 2r nodes in the sixth layer, η^1_i, η^2_i, i ∈ [0 : r - 1]; Dr nodes in the seventh layer, ξ_{i,j}, i ∈ [0 : r - 1], j ∈ [0 : D - 1]; and D output nodes, y_j, j ∈ [0 : D - 1].

Proof.

We have one node a · x in the second layer, which corresponds to an encoder.

The decoder is obtained by simple modifications of the proofs of theorems 3 and 4. We use the output of the second layer in place of a · x in the activation function for each α^1_i in the proof of theorem 3. Furthermore, we replace β^2 by a single node copying a · x, whose value is also used in place of a · β^2 in the proof of theorem 3. Then, as in the proof of theorem 4, we identify the nodes in (η^1, η^2) of the proof of theorem 3 with the nodes in (β^1, β^2) in the proof of theorem 4. It is straightforward to see that this modification gives a perfect autoencoder.

Next, we consider memorizers, that is, functions for finite input samples. The following theorem is shown in Zhang et al. (2017).

Theorem 6

(Zhang et al., 2017). There exists a three-layer neural network with ReLU activations and 2n+D weights that can represent any function on a sample of size n in D dimensions.

For ease of understanding, we explain this theorem briefly. Here, X = {x^0, x^1, . . . , x^{n-1}} denotes a set of n D-dimensional real input vectors that are all different, and Y = {y^0, y^1, . . . , y^{n-1}} denotes a set of n 1-dimensional real outputs such that y^i ⩾ 0 for all i.

Then there exist weights a ∈ R^D and b_0, b_1, . . . , b_{n-1} ∈ R such that b_0 < a · x^0 < b_1 < a · x^1 < ⋯ < b_{n-1} < a · x^{n-1}, and a three-layer neural network with ReLU activation functions can be designed based on these weights.
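A sketch of one way to realize such a memorizer, using the ordering condition above (our own code; Zhang et al., 2017, give the construction in full): the output is a weighted sum of ReLU units max(a · x - b_j, 0), and because the matrix of unit responses on the sample is lower triangular with positive diagonal, the output weights can be solved for exactly.

```python
import numpy as np

rng = np.random.default_rng(3)
n, D = 6, 4
X = rng.normal(size=(n, D))
y = rng.uniform(0.0, 1.0, size=n)             # arbitrary nonnegative targets

a = rng.normal(size=D)
order = np.argsort(X @ a)                     # sort so projections are increasing
X, y = X[order], y[order]
c = X @ a                                     # c_0 < c_1 < ... < c_{n-1}
b = np.concatenate([[c[0] - 1.0], (c[:-1] + c[1:]) / 2.0])   # b_0 < c_0 < b_1 < c_1 < ...

H = np.maximum(c[:, None] - b[None, :], 0.0)  # H[i, j] = max(a.x^i - b_j, 0), lower triangular
w = np.linalg.solve(H, y)                     # exact because the diagonal of H is positive

f = lambda x: np.maximum(x @ a - b, 0.0) @ w  # the three-layer ReLU memorizer
assert np.allclose([f(x) for x in X], y)
```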

Now, we consider reducing the number of nodes in the middle layer from O(n) to O(√n) (although the number of layers increases by one).

Theorem 7.
Let r = √n. For any set of D-dimensional real vectors X = {x^0, x^1, . . . , x^{n-1}} and any set Y = {y^0, y^1, . . . , y^{n-1}} such that y^i ∈ [-1, 1] for all i ∈ [0 : n - 1], there exists a four-layer ReLU neural network with 2r nodes in the second layer and 2r nodes in the third layer (see also Figure 3) whose output value satisfies σ = y^i + 1 for each input vector x^i.
Figure 3:

Four-layer ReLU neural network. The circle nodes represent those carrying input and output information. The arrows connecting the circle nodes show that the states of nodes in the ith (i = 2, 3, 4) layer are determined based on the ReLU activation functions related to the states of nodes in the (i - 1)th layer.

Proof.

Here, the four-layer ReLU neural network we design is based on the four-layer fully connected neural network of theorem 3.1 in Yun et al. (2019). Notably, the proof of theorem 3.1 (Yun et al., 2019) shows that there is a four-layer fully connected neural network using hard-tanh activation functions that can fit an arbitrary data set {(x^i, y^i)}_{i=1}^N, where all inputs x^i ∈ R^{d_x} are distinct and all y^i ∈ [-1, 1]. Although our construction is very similar to theirs, there are some differences because they use a neural network with hard-tanh activation functions. Therefore, in this proof, we mainly describe our construction of the ReLU neural network while briefly explaining the differences from that in Yun et al. (2019). The full proof is given in the online supplemental material.

For ease of exposition, we assume that r = √n is an integer and a multiple of 2. Divide the n input vectors into r groups of r vectors each.

First, let the ReLU activation functions for β^1_j, j ∈ [0 : r - 1], be
and the ReLU activation functions for β^2_j, j ∈ [0 : r - 1], be
where
where each c_i is a constant given in equation 2.1. Here, we should note that γ_j = β^1_j(x) - β^2_j(x) - 1 is similar to the output of the jth node of the first hidden layer, α^1_j(x), in theorem 3.1 (Yun et al., 2019). More specifically, when j is even, we have
When j is odd, we have
In the next layer, let the ReLU activation functions for ξ^1_k, k ∈ [0 : r - 1], be
and the ReLU activation functions for ξ^2_k, k ∈ [0 : r - 1], be
where w_k = [w_{k,0}, w_{k,1}, . . . , w_{k,r-1}] and b_k are solutions of the following equations:
under the condition that each element of w_k is sufficiently large negative (resp., positive) when k is even (resp., odd). The main difference from Yun et al. (2019) is that we define the set I_k, k ∈ [0 : r - 1], as
where i_{k,j}, j ∈ [0 : r - 1], are used to represent the elements of the set I_k, and i_{k,0} = k, i_{k,1} = 2r - 1 - k, . . . , i_{k,r-1} = r^2 - 1 - k.
Finally, in the last layer, let σ = max(Σ_{l=0}^{r-1} (ξ^1_l - ξ^2_l - 1), 0). Here, ξ^1_l(x) - ξ^2_l(x) - 1, l ∈ [0 : r - 1], is similar to the output of the lth node of the second hidden layer, α^2_l(x), in theorem 3.1, and for j ∈ [0 : r - 1] and k ∈ [0 : r - 1], we also have
Remark 5.
When y^i, i ∈ [0 : n - 1], is an arbitrary real number, we can also construct a four-layer ReLU neural network. First, let y_max = max_{i ∈ [0:n-1]} y^i and y_min = min_{i ∈ [0:n-1]} y^i.
By means of z^i = (y^i - y_mean)/(y_max - y_min), we can scale the elements y^i ∈ Y so that -1 ⩽ z^i ⩽ 1, and then use {z^0, z^1, . . . , z^{n-1}} as the outputs; we can then construct a four-layer ReLU neural network as shown in theorem 7. Only the last layer needs to be changed, so that the scaling is inverted.
Remark 6.

It is worth noting that theorem 1 in Zhang et al. (2017) (i.e., theorem 6 in this article) shows that there exists a two-layer neural network. Similarly, theorem 3.1 in Yun et al. (2019) shows that there is a three-layer hard-tanh, fully connected neural network. However, since here we regard the input layer as the first layer, to avoid confusion, the number of layers is increased by 1 when we describe the neural networks of theorems 1 and 3.1 in this article.

In order to test whether some of the autoencoder (memorizer) architectures obtained through our theoretical analyses can be used in the design of practical neural networks, we conducted computational experiments using neural networks in which the activation functions (i.e., their weight parameters) are learned from input and output data. All numerical experiments were conducted on a PC with a Xeon Gold 5222 CPU and an A100 GPU under Ubuntu 18.04.

First, we performed computational experiments on theorem 4, using 256 36-dimensional real vectors and 324 40-dimensional real vectors as input vectors, respectively. For the training of neural networks with gaussian error linear unit (GELU) activation functions (Dubey et al., 2022), we employed the Adam optimizer in PyTorch with a learning rate of 0.01 and 800 epochs. The specific process is as follows.

  • Step 1: Generate a neural network N with the architecture given in theorem 4.

  • Step 2: Randomly generate D-dimensional real vectors x^0, x^1, . . . , x^{n-1}, where each x^i_j ∈ [0, 2], i = 0, 1, . . . , n - 1, j = 0, 1, . . . , D - 1.

  • Step 3: Train N using x0, x1, . . . , xn-1 for both input and output data.

  • Step 4: Compute the training accuracy of the trained N using x^0, x^1, . . . , x^{n-1} as the input data. Assuming y^0, y^1, . . . , y^{n-1} are the corresponding output data, the training accuracy is defined as in equation 5.1, where I_{[-0.2,0.2]}(y^i_j - x^i_j) is an indicator function that takes value 1 if -0.2 ⩽ y^i_j - x^i_j ⩽ 0.2 and 0 otherwise.

We repeated this procedure 100 times to obtain the average training accuracy; the results are shown in Table 2.
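For concreteness, the following PyTorch sketch reflects our reading of this setup, with the layer widths of theorem 4 for D = 36, n = 256. Treating the network as a plain fully connected stack, omitting an activation on the output layer, and measuring per-element accuracy with tolerance 0.2 are our assumptions, since those details are not fully specified above.

```python
import torch
from torch import nn

D, n = 36, 256
d = 2 * int(n ** 0.5)                         # d = 2*sqrt(n) = 32
widths = [D, D + d, D + d // 2, 3 * d // 2, d, d * D // 2, D]

layers = []
for w_in, w_out in zip(widths[:-1], widths[1:]):
    layers += [nn.Linear(w_in, w_out), nn.GELU()]
net = nn.Sequential(*layers[:-1])             # no activation after the output layer (assumption)

X = 2.0 * torch.rand(n, D)                    # step 2: x^i_j drawn from [0, 2]
opt = torch.optim.Adam(net.parameters(), lr=0.01)
for _ in range(800):                          # step 3: 800 epochs of full-batch training
    opt.zero_grad()
    loss = nn.functional.mse_loss(net(X), X)
    loss.backward()
    opt.step()

# Step 4: per-element training accuracy with tolerance 0.2 (our reading of equation 5.1).
with torch.no_grad():
    acc = ((net(X) - X).abs() <= 0.2).float().mean().item()
print(acc)
```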

Table 2:

Results of Computational Experiments on Theorem 4.

Architecture | D | n | d | Training Accuracy
Theorem 4 | 36 | 256 | 32 | 0.9899
Theorem 4 | 40 | 324 | 36 | 0.9837

In the above experiment, we used the GELU function instead of the ReLU function, mainly because we found that the training accuracy obtained by training a neural network with ReLU functions was not ideal. We believe that the main reason for this phenomenon is that the neural network constructed by theorem 4 has a relatively large number of layers and relatively complex connections in each layer. The GELU function, often described as a smoother version of the ReLU function, provides smoother gradients during training, which helps maintain gradient flow in complex models and avoids problems such as vanishing or exploding gradients. In contrast, when the data range is small and the network hierarchy is complex, using the ReLU function may lead to numerical instability, thereby affecting the convergence and final performance of the model.

Subsequently, we also conducted experiments on theorem 7 using neural networks with ReLU functions; the process is similar to that described above. Here, we generate a neural network N with the architecture given in theorem 7 and randomly generate a data set {(x^i, y^i)}_{i=0}^{n-1}, where x^i ∈ R^D (each x^i_j ∈ [0, 1], i = 0, 1, . . . , n - 1, j = 0, 1, . . . , D - 1) is the input and y^i ∈ [-1, 1] is the output. We consider two cases: D = 36, n = 256 and D = 40, n = 324. We also repeated this procedure 100 times and recorded the average training accuracy in Table 3. Here, the training accuracy is defined as in equation 5.2, where ŷ^i is the output obtained by the trained neural network N using x^i as the input data.
Table 3:

Results of Computational Experiments on Theorem 7.

Architecture | D | n | d | Training Accuracy
Theorem 7 | 36 | 256 | 32 | 1.0000
Theorem 7 | 40 | 324 | 36 | 1.0000

The above experimental results indicate that our proposed architectures may provide useful insights for designing practical autoencoders (memorizers).

In this article, we studied relations between the compressed vectors and the depth and width (the number of nodes in a layer) of autoencoders with real input and output vectors using linear and ReLU activation functions, under the condition that the input and output vectors must be the same. The results on ReLU activation functions suggest that we can achieve the same compression ratio as in the case of binary input and output vectors (Melkman et al., 2023) using similar architectures. The results are interesting because real input and output vectors can be handled by replacing linear threshold activation functions with ReLU activation functions, with some modifications. Furthermore, the results on linear activation functions suggest that ReLU activation functions are much more powerful than linear activation functions with respect to autoencoders with real input and output vectors.

Although we have mainly given upper bounds on the depth and width, we have not shown any lower bounds for autoencoders using ReLU activation functions. Note that a lower bound of Ω(Dn/d) is known for the width of autoencoders using linear threshold activation functions (Akutsu & Melkman, 2023). However, the techniques used there heavily depend on properties of Boolean functions and thus cannot be applied to the case of ReLU activation functions. Thus, showing lower bounds on autoencoders with ReLU activation functions is left as future work. For the case of d = ⌈log n⌉, an O(√(Dn)) upper bound is shown for the width of autoencoders using linear threshold activation functions (see corollary 8 of Akutsu & Melkman, 2023), which is better than the O(D√n) upper bound shown in Melkman et al. (2023) and this article. Hence, development of an autoencoder with width O(√(Dn)) using ReLU activation functions is also left as future work. Furthermore, it is interesting to study whether it is possible to develop autoencoders with a smaller width using ReLU activation functions than those using linear threshold activation functions.

In our definition of the perfect autoencoder, we assumed that the input vectors must be the same as the output vectors. But in practical situations, the input and output vectors need not be the same but should be similar. For the case of autoencoders with linear threshold activation functions, it is shown that the width of the decoder part can be reduced by a constant factor if some Hamming distance error is allowed (Akutsu & Melkman, 2023). However, the techniques used in that study cannot be applied to real input and output vectors because the Hamming distance cannot be directly generalized to real vectors and the construction of autoencoders heavily depends on binary values. Therefore, conducting theoretical studies on autoencoders allowing errors between real input and output vectors is interesting and important future work.

Another drawback of this work is that the proposed design methods are somewhat ad hoc. However, most of the designed architectures are based on existing ones for binary vectors using linear threshold activation functions, and the main purpose of this article is to extend such existing ones to a more practical setting (i.e., real vectors using ReLU activation functions). Therefore, this work can be considered an important step toward understanding data compression mechanisms of autoencoders in more practical settings. Of course, many other practical activation functions are known (Apicella et al., 2021; Dubey et al., 2022). Therefore, important future work is to develop more general design methods that can be applied to many practical activation functions.

T.A. was partially supported by grants-in-aid 22H00532 and 22K19830 from JSPS, Japan; W.C. was partially supported by Hong Kong RGC GRF grant 17301519, IMR, and Hung Hing Ying Physical Sciences Research Fund, HKU.

1. In Yun et al. (2019), their network is stated as a three-layer network, excluding the input layer.

Ackley, D. H., Hinton, G. E., & Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9(1), 147-169.
Akutsu, T., & Melkman, A. A. (2023). On the size and width of the decoder of a Boolean threshold autoencoder. IEEE Transactions on Neural Networks and Learning Systems, PP(99), 1-8.
Apicella, A., Donnarumma, F., Isgró, F., & Prevete, R. (2021). A survey on modern trainable activation functions. Neural Networks, 138, 14-32.
Baldi, P. (2012). Autoencoders, unsupervised learning, and deep architectures. In Proceedings of ICML Workshop on Unsupervised and Transfer Learning, 27 (pp. 37-49).
Baldi, P., & Hornik, K. (1989). Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1), 53-58.
Delalleau, O., & Bengio, Y. (2011). Shallow vs. deep sum-product networks. In S. Hanson, J. Cowan, & C. Giles (Eds.), Advances in neural information processing systems, 24 (pp. 666-674). MIT Press.
Doersch, C. (2016). Tutorial on variational autoencoders.
Dubey, S. R., Singh, S. K., & Chaudhuri, B. B. (2022). Activation functions in deep learning: A comprehensive survey and benchmark. Neurocomputing, 503, 92-108.
Gómez-Bombarelli, R., Wei, J. N., Duvenaud, D., Hernández-Lobato, Sánchez-Lengeling, Sheberla, . . . Aspuru-Guzik, A. (2018). Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 4(2), 268-276.
Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504-507.
Kärkkäinen, T., & Hänninen, J. (2023). Additive autoencoder for dimension estimation. Neurocomputing, 551, 126520.
Kumano, S., & Akutsu, T. (2022). Comparison of the representational power of random forests, binary decision diagrams, and neural networks. Neural Computation, 34(4), 1019-1044.
Lee, S., & Jo, J. (2021). Information flows of diverse autoencoders. Entropy, 23(7), 862.
Melkman, A. A., Guo, S., Ching, W.-K., Liu, P., & Akutsu, T. (2023). On the compressive power of Boolean threshold autoencoders. IEEE Transactions on Neural Networks and Learning Systems, 34(2), 921-931.
Montufar, G. F., Pascanu, R., Cho, K., & Bengio, Y. (2014). On the number of linear regions of deep neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 27 (pp. 2924-2932). Curran.
Tapia, N. I., & Estévez, P. A. (2020). On the information plane of autoencoders. In Proceedings of the 2020 International Joint Conference on Neural Networks (pp. 1-8).
Tschannen, M., Bachem, O., & Lucic, M. (2018). Recent advances in autoencoder-based representation learning.
Ubaru, S., & Saad, Y. (2016). Fast methods for estimating the numerical rank of large matrices. In Proceedings of the 33rd International Conference on Machine Learning (pp. 468-477).
Vershynin, R. (2020). Memory capacity of neural networks with threshold and rectified linear unit activations. SIAM Journal on Mathematics of Data Science, 2(4), 1004-1033.
Yu, S., & Príncipe, J. C. (2019). Understanding autoencoders with information theoretic concepts. Neural Networks, 117, 104-123.
Yun, C., Sra, S., & Jadbabaie, A. (2019). Small ReLU networks are powerful memorizers: A tight analysis of memorization capacity. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, & R. Garnett (Eds.), Advances in neural information processing systems, 32 (pp. 15532-15543). Curran.
Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2017). Understanding deep learning requires rethinking generalization. In Proceedings of the International Conference on Learning Representations.
