Abstract
In this article, we mainly study the depth and width of autoencoders that use rectified linear unit (ReLU) activation functions. An autoencoder is a layered neural network consisting of an encoder, which compresses an input vector to a lower-dimensional vector, and a decoder, which transforms the low-dimensional vector back to the original input vector exactly (or approximately). In a previous study, Melkman et al. (2023) studied the depth and width of autoencoders using linear threshold activation functions with binary input and output vectors. We show that similar theoretical results hold if autoencoders using ReLU activation functions with real input and output vectors are used. Furthermore, we show that it is possible to compress input vectors to one-dimensional vectors using ReLU activation functions, whereas the size of compressed vectors is trivially Ω(log n) for autoencoders with linear threshold activation functions, where n is the number of input vectors. We also study the case of linear activation functions. The results suggest that the compressive power of autoencoders using linear activation functions is considerably limited compared with that of autoencoders using ReLU activation functions.
1 Introduction
Over the past decade, we have seen rapid progress in both the development and application of artificial neural network (ANN) technologies. Among the various models of ANNs, much attention has recently been paid to autoencoders because of their power to generate new data. Indeed, autoencoders have been applied to various areas, including image processing (Doersch, 2016), natural language processing (Tschannen et al., 2018), and drug discovery (Gómez-Bombarelli et al., 2018). An autoencoder is a layered neural network consisting of an encoder and a decoder, where the former transforms an input vector x to a low-dimensional vector z = f(x) and the latter transforms z to an output vector y = g(z), which should be the same as or similar to the input vector. Therefore, an autoencoder performs a kind of dimensionality reduction. The encoder and decoder functions, f and g, are usually obtained via unsupervised learning that minimizes the difference between the input and output data by adjusting weights (and some additional parameters).
Although autoencoders have a long history (Ackley et al., 1985; Baldi & Hornik, 1989; Hinton & Salakhutdinov, 2006), how data are compressed via autoencoders is not yet very clear. Baldi and Hornik (1989) studied relations between principal component analysis (PCA) and autoencoders with one hidden layer. Hinton and Salakhutdinov (2006) conducted empirical studies on relations between the depth of autoencoders and the dimensionality reduction. The results suggest that deeper networks can produce lower reconstruction errors. Kärkkäinen and Hänninen (2023) also empirically studied relations between the depth and the dimensionality reduction using a variant model of autoencoders. The results suggest that deeper networks obtain lower autoencoding errors during the identification of the intrinsic dimension, but the detected dimension does not change compared to a shallow network. Recently, several analyses have been done on mutual information between layers in order to understand information flow in autoencoders (Lee & Jo, 2021; Tapia & Estévez, 2020; Yu & Príncipe, 2019). Baldi (2012) presented and studied a general framework on autoencoders with both linear and nonlinear activation functions. In particular, he showed that learning in the autoencoder with Boolean activation functions is NP-hard in general by a reduction from a clustering problem.
However, as far as we know, no theoretical studies have been done on the relation between the compressive power and the size of autoencoders, whereas extensive theoretical studies have been done on the representational power of deep neural networks (Delalleau & Bengio, 2011; Montufar et al., 2014; Vershynin, 2020; Yun et al., 2019). Recently, some theoretical studies have examined the compressive power of autoencoders with linear threshold activation functions (Akutsu & Melkman, 2023; Melkman et al., 2023) with respect to the depth (number of layers) and width (number of nodes in a layer). However, linear threshold networks are not popular in recent studies on neural networks. Furthermore, linear threshold networks can only handle binary input and output, which is far from practical settings.
Motivated by this situation, in this article we study the compressive power of autoencoders with real input and output vectors using rectified linear unit (ReLU) functions as the activation functions, focusing on theoretical aspects. In order to clarify the superiority of ReLU functions over linear functions, we also study the compressive power of autoencoders with linear activation functions (not linear threshold activation functions).
The results are summarized in Table 1, where D and d denote the dimensions of the input and compressed vectors, respectively; n denotes the number of input vectors; and A is a matrix explained in theorem 2. Note that the number and dimensions of the input vectors must be the same as those of the output vectors. Here, we first obtain some new results on autoencoders with linear activation functions. Then we modify theorems 12, 19, and 22 in Melkman et al. (2023) for ReLU functions and real input vectors. In addition, we modify theorem 1 in Zhang et al. (2017) so that the number of nodes in the middle layer decreases from n to O(√n). Based on the proof of theorem 3.1 in Yun et al. (2019), we design a four-layer ReLU neural network to reduce the number of nodes in the middle layer. The difference between our construction and theorem 3.1 is that hard-tanh activation functions are used in the proof of theorem 3.1, whereas we use ReLU activation functions. Finally, we modify the decoders in the proofs of theorems 19 and 22 in Melkman et al. (2023).
Table 1: Summary of Results.
| | Vector | Middle Layer (d) | Architecture | Type | Activation |
|---|---|---|---|---|---|
| Theorem 1 | real | n - 1 | D/d/D | Encoder/Decoder | Linear |
| Theorem 2 | real | d | D/d/D if rank(A) ⩽ d + 1 | Encoder/Decoder | Linear |
| Theorem 3 | real | 2⌈√n⌉ | | Encoder | ReLU |
| (Theorem 12 (Melkman et al., 2023)) | binary | 2⌈√n⌉ | | Encoder | Threshold |
| Theorem 4 | real | 2⌈√n⌉ | | Encoder/Decoder | ReLU |
| (Theorem 19 (Melkman et al., 2023)) | binary | 2⌈√n⌉ | | Encoder/Decoder | Threshold |
| Theorem 5 | real | 2⌈log ⌈√n⌉⌉ | Theorem 2 | Encoder/Decoder | ReLU |
| (Theorem 22 (Melkman et al., 2023)) | binary | 2⌈log ⌈√n⌉⌉ | Theorem 12 (Melkman et al., 2023) | Encoder/Decoder | Threshold |
| (Corollary 8 (Akutsu & Melkman, 2023)) | binary | ⌈log n⌉ | Theorem 12 (Melkman et al., 2023) | Encoder/Decoder | Threshold |
| Proposition 2 | real | 1 | | Encoder/Decoder | ReLU |
| Theorem 7 | real | | D/d/d/1 | Memorizer | ReLU |
| (Theorem 1 (Zhang et al., 2017)) | real | n | D/d/1 | Memorizer | ReLU |
Note: Theorems in parentheses are existing results.
Specifically, theorem 1 reveals that when a set of n = d + 1 vectors in D-dimensional Euclidean space is given, there is a three-layer perfect autoencoder with linear activation functions whose middle layer has d nodes. When the given set contains more than d + 1 vectors, such an autoencoder may not exist; theorem 2 gives conditions for the existence of a three-layer perfect autoencoder with linear activation functions whose middle layer has d nodes.
Theorem 3 is obtained by modifying theorem 12 in Melkman et al. (2023). Compared with theorem 12, we apply ReLU activation functions in theorem 3 so that the binary input vectors can be replaced with real input vectors; as a result, the number of layers of the designed ReLU neural network increases by 2 and the number of hidden nodes also increases.
By adding a decoder to the encoder constructed in theorem 3, we obtain a seven-layer perfect autoencoder in theorem 4. It is worth noting that in the decoder part, we do not simply represent each threshold function of theorem 19 in Melkman et al. (2023) as three ReLU functions in two layers but instead design some new ReLU activation functions. Therefore, compared with theorem 19 in Melkman et al. (2023), the numbers of nodes and layers in the decoder part do not increase.
In theorem 4, the size of the middle layer is 2⌈√n⌉. In theorem 5, we reduce the size of the middle layer from 2⌈√n⌉ to 2⌈log ⌈√n⌉⌉. Moreover, theorem 5 is obtained by modifying theorem 22 in Melkman et al. (2023). In proposition 2, we further reduce the size of the middle layer to 1 by using a single real number as the compressed vector. The proof is obtained by simple modifications of those for theorems 3 and 4.
We also consider the number of nodes and layers needed to represent a memorizer, that is, a function that fits a given set of pairs of input vectors and their output values. Compared with theorem 1 in Zhang et al. (2017), in theorem 7 the number of nodes in the middle layer decreases to O(√n) and the number of layers increases by 1. It is worth noting that the four-layer ReLU neural network we design in theorem 7 is based on the four-layer fully connected neural network of theorem 3.1 in Yun et al. (2019).¹ However, theorem 3.1 describes the design of a fully connected neural network using hard-tanh activation functions, whereas we need to design a ReLU neural network, so there are some differences in the design process, and we explain these differences.
It is seen from Table 1 that there is a large difference in the number of middle-layer nodes between linear and ReLU autoencoders. Linear autoencoders need n - 1 nodes in the middle layer (in particular, when rank(A) = n), whereas ReLU autoencoders need a much smaller number of nodes in the middle layer. This is a crucial limitation of linear autoencoders. However, ReLU autoencoders need more than three layers and a large number of nodes in some layers, which is a limitation of ReLU autoencoders. In addition, errors are not taken into account in either type of autoencoder, which is a limitation from a practical viewpoint.
In summary, the contribution of this article is to theoretically analyze the number of nodes and layers of autoencoders for real input and output vectors for the first time. Although we use the framework and some techniques introduced in Melkman et al. (2023), we introduce additional techniques in this article, and thus there are substantial differences between the results of Melkman et al. (2023) and those of this article, as seen from Table 1. We also use some techniques introduced in Yun et al. (2019); however, we use ReLU activation functions to show theorem 7, which can also be used as a decoder part of an autoencoder.
2 Problem Definitions
R denotes the set of real numbers, and R^n denotes the set of n-dimensional real column vectors. For integers a and b with a < b, we denote [a : b] ≔ {a, a + 1, . . . , b}.
Let Xn = {x0, . . . , xn-1} be a set of n D-dimensional binary or real input vectors that are all different. We define perfect encoders, decoders, and autoencoders as follows (Melkman et al., 2023).
A mapping f from R^D to R^d is called a perfect encoder for Xn if f(xi) ≠ f(xj) holds for all i ≠ j.
A pair of mappings (f, g) with f from R^D to R^d and g from R^d to R^D is called a perfect autoencoder for Xn if g(f(xi)) = xi holds for all i ∈ [0 : n - 1]. Furthermore, g is called a perfect decoder.
In a word, a perfect encoder maps distinct input vectors into distinct middle vectors, a perfect decoder maps each middle vector back to the original input vector, and a perfect autoencoder maps each input vector to the original input vector via a distinct middle vector.
Note that a perfect decoder exists only if there exists a perfect autoencoder. Furthermore, it is easily seen from the definitions that if (f, g) is a perfect autoencoder, f is a perfect encoder.
Many of the results in Table 1 rely on the following proposition. Since it is mentioned in Zhang et al. (2017) without a proof, we give our own proof here:
Proposition 1. For any set X = {x0, . . . , xn-1} of n distinct D-dimensional real vectors, there exists a real vector a ∈ R^D satisfying a · xi ≠ a · xj for all i ≠ j.
We prove the proposition by mathematical induction on D.
In the case of D = 1, the claim trivially holds by letting a = [1]. Assume that the claim holds for all X in the case of D = d - 1. Let X = {x0, . . . , xn-1} be a set of d-dimensional distinct vectors. For each vector x = [x0, . . . , xd-1], let x′ = [x0, . . . , xd-2] denote the vector consisting of its first d - 1 elements, and let x′i denote the vector obtained from xi in this way. It should be noted that x′i = x′j may hold for some i ≠ j. From the induction hypothesis, we can assume that there exists a vector a′ = [a0, . . . , ad-2] satisfying a′ · x′i ≠ a′ · x′j for all x′i ≠ x′j. Here, we define a = [a0, . . . , ad-2, ad-1] by using a sufficiently large real number ad-1 such that |ad-1| · |xi,d-1 - xj,d-1| > |a′ · x′k - a′ · x′h| holds for any i, j, k, h such that xi,d-1 ≠ xj,d-1, where xi,d-1 denotes the last element of xi. Then a · xi ≠ a · xj clearly holds for all i ≠ j.
For the input vectors, we can assume w.l.o.g. that all elements of each real input vector xi are nonnegative, because this assumption can be satisfied by adding a sufficiently large constant to each element. For example, let z be the minimum element appearing in the input vectors. If z < 0, all elements of the input vectors can be made nonnegative by adding any value not less than -z to each element, and this value does not affect network performance.
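As a small illustration of this preprocessing (our own sketch, with hypothetical function and variable names; not part of the formal constructions), the following Python snippet shifts a set of input vectors so that all elements become nonnegative and then searches for a vector a whose dot products with the inputs are pairwise distinct; a randomly drawn a works with probability 1, so a few retries suffice.

```python
import numpy as np

def preprocess(X, rng=np.random.default_rng(0), tol=1e-9):
    """X: (n, D) array of distinct real input vectors.
    Returns a nonnegative copy of X and a vector a with pairwise-distinct projections a . x_i."""
    X = np.asarray(X, dtype=float)
    z = X.min()
    if z < 0:                       # add a constant so that every element becomes nonnegative
        X = X - z
    while True:                     # a random a gives distinct projections almost surely
        a = rng.normal(size=X.shape[1])
        p = X @ a
        offdiag = np.subtract.outer(p, p)[~np.eye(len(p), dtype=bool)]
        if np.min(np.abs(offdiag)) > tol:
            return X, a

X = np.array([[0.2, -1.0], [0.5, 0.3], [-0.7, 0.9]])
X_nonneg, a = preprocess(X)
print(np.sort(X_nonneg @ a))        # three distinct projection values
```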
3 Autoencoders with Linear Activation Functions
In this section, we consider autoencoders using linear functions as the activation functions.
We begin with a simple example. Consider the case of D = 3, d = 1, and n = 2. Then X = {[x0, y0, z0], [x1, y1, z1]}. Let f([xi, yi, zi]) = [xi]. Let g([xi]) = [xi, axi + b, cxi + d], where a and b satisfy ax0 + b = y0 and ax1 + b = y1, and c and d satisfy cx0 + d = z0 and cx1 + d = z1. Then (f, g) is a perfect autoencoder for most X (precisely, if x0 ≠ x1, so that the determinant of the 2 × 2 coefficient matrix [[x0, 1], [x1, 1]] of these linear systems is nonzero).
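To make the example concrete, the following sketch (our own illustration; the coefficient names follow the text) recovers the decoder coefficients by solving the two 2 × 2 linear systems, assuming x0 ≠ x1 so that the systems are nonsingular.

```python
import numpy as np

# Two 3-dimensional input vectors [x_i, y_i, z_i]; the encoder keeps only x_i.
X = np.array([[1.0, 2.0, 5.0],
              [3.0, 4.0, -1.0]])

def encode(v):
    return v[:1]                         # f([x, y, z]) = [x]

# Decoder g([x]) = [x, a*x + b, c*x + d]: solve a*x0 + b = y0, a*x1 + b = y1 (and similarly for c, d).
M = np.array([[X[0, 0], 1.0],
              [X[1, 0], 1.0]])           # nonsingular iff x0 != x1
a, b = np.linalg.solve(M, X[:, 1])
c, d = np.linalg.solve(M, X[:, 2])

def decode(z):
    x = z[0]
    return np.array([x, a * x + b, c * x + d])

for v in X:
    assert np.allclose(decode(encode(v)), v)   # perfect reconstruction of both vectors
```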
By generalizing these simple examples, we have the following theorem.
Theorem 1. Let X = {x0, x1, . . . , xd} be a set of d + 1 vectors in D-dimensional Euclidean space, where d is an integer with 1 ⩽ d < D. Then there exists a three-layer perfect autoencoder with linear activation functions that has a middle layer (i.e., compressed layer) with d nodes (i.e., it performs dimensionality reduction to d).
Suppose that the set of d + 1 vectors in D-dimensional Euclidean space is written as X = {x0, x1, . . . , xd}, where xi ∈ R^D and D > d.
A maximal linearly independent subset of S can be obtained as follows:
Step 1: Initialize S′ ≔ {1d+1}.

Step 2: Choose a column vector xi ∈ S. If 1d+1 and xi are linearly dependent, remove xi from S, choose another column vector in S, and repeat this step. If 1d+1 and xi are linearly independent, add xi to S′, that is, S′ ≔ S′ ∪ {xi}, and go to the next step.

Step 3: Let S′ = {1d+1, xi}, where 1d+1 and xi are linearly independent. Choose a column vector xj ∈ S. If 1d+1, xi, and xj are linearly dependent, remove xj from S, choose another column vector in S, and repeat this step. If 1d+1, xi, and xj are linearly independent, add xj to S′, that is, S′ ≔ S′ ∪ {xj}, and go to the next step.

Step 4: . . .

Continuing in this way, we finally obtain a maximal linearly independent subset S′ (containing 1d+1).
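The greedy procedure above amounts to repeated rank tests. The following sketch (our own, with hypothetical names) selects a maximal subset of a list of vectors that is linearly independent together with the all-ones vector.

```python
import numpy as np

def maximal_independent_subset(vectors, dim):
    """vectors: list of 1-D arrays of length dim.
    Returns indices of a maximal subset that is linearly independent together with 1_dim."""
    ones = np.ones(dim)
    chosen = [ones]                  # the all-ones vector is always kept
    indices = []
    for idx, v in enumerate(vectors):
        candidate = np.vstack(chosen + [v])
        if np.linalg.matrix_rank(candidate) == len(chosen) + 1:   # v adds a new direction
            chosen.append(v)
            indices.append(idx)
    return indices

rows = [np.array([1.0, 1.0, 1.0]),   # dependent with 1_3, skipped
        np.array([1.0, 2.0, 3.0]),
        np.array([2.0, 4.0, 6.0]),   # dependent with the previous choice, skipped
        np.array([0.0, 1.0, 5.0])]
print(maximal_independent_subset(rows, 3))   # [1, 3]
```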
For X, there exists a perfect autoencoder (f, g) with linear activation functions that has a middle layer (i.e., compressed layer) with d nodes (i.e., performing dimensionality reduction to d).
Furthermore, for X with |X| > d + 1, we consider whether there is a perfect autoencoder with linear activation functions that has the compressed layer with d nodes and we have the following result:
Theorem 2. If rank(A) ⩽ d + 1, then there exists a perfect autoencoder with linear activation functions that has the compressed layer with d nodes. If rank(A) > d + 1, then there does not exist a perfect autoencoder with linear activation functions that has the compressed layer with d nodes.
When rank(A) ⩽ d + 1, a perfect autoencoder can be constructed similarly to theorem 1.

According to theorems 1 and 2, the number of nodes in the compressed layer must be at least rank(A) - 1 for a perfect autoencoder to exist.

Theorem 1 still holds for multilayer networks because a composition of linear functions is again a linear function.

According to theorem 2, we need to compute the rank of the matrix A. For a large matrix A, it is not easy to obtain the rank directly, but rank estimation methods such as that of Ubaru and Saad (2016) can be used to estimate it.
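For reference, the exact rank can be computed with numpy's SVD-based matrix_rank; for very large matrices, one can instead estimate the rank cheaply from a random sketch, as in the toy example below (this sketch-based estimate is our own illustration and is not the method of Ubaru and Saad, 2016).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(800, 30)) @ rng.normal(size=(30, 1200))   # rank-30 matrix

exact = np.linalg.matrix_rank(A)                 # SVD-based, expensive for very large A

k = 60                                           # sketch size, should exceed the expected rank
Omega = rng.normal(size=(A.shape[1], k))
sketch_rank = np.linalg.matrix_rank(A @ Omega)   # cheap lower bound on rank(A), tight w.h.p.

print(exact, sketch_rank)                        # both 30 here
```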
4 Autoencoders with ReLU Functions for Real Vectors
According to Kumano and Akutsu (2022), a threshold function can be simulated by three ReLU functions in two layers. Hence, theorems 12, 19, and 22 in Melkman et al. (2023) can be modified for ReLU functions so that binary input vectors are replaced by real input vectors, while increasing the number of layers by a constant and increasing the number of nodes in some layers by a factor of two (because one threshold function can be simulated using three ReLU functions in two layers). Note that the compressed layer still corresponds to binary vectors.
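For intuition, the following sketch shows one simple way to reproduce a threshold gate with ReLU units under a margin assumption (this is our own simplified two-ReLU variant, not the exact three-ReLU construction of Kumano and Akutsu, 2022): whenever the weighted sum is at least the threshold, or at most the threshold minus eps, the difference of two scaled ReLUs equals the 0/1 threshold output exactly.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def threshold_via_relu(s, theta, eps=0.5):
    """Equals 1[s >= theta] whenever s >= theta or s <= theta - eps (e.g., integer sums with eps < 1)."""
    z = s - theta
    return relu(z / eps + 1.0) - relu(z / eps)

for s in [0, 1, 2, 3]:
    print(s, threshold_via_relu(s, theta=2))   # 0.0, 0.0, 1.0, 1.0
```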
Specifically, given n different binary input vectors of dimension D, theorem 12 (Melkman et al., 2023) states that there is a three-layer network whose activation functions are Boolean threshold functions that maps these vectors to n different binary vectors of dimension 2⌈√n⌉. Theorem 19 (Melkman et al., 2023) then states that there is a five-layer perfect Boolean threshold network autoencoder whose encoder is constructed based on theorem 12 and whose middle hidden layer has 2⌈√n⌉ nodes. Further, theorem 22 (Melkman et al., 2023) states that there is a seven-layer perfect Boolean threshold network autoencoder; compared with the autoencoder constructed in theorem 19, the number of nodes in the middle hidden layer is reduced to 2⌈log ⌈√n⌉⌉, although the number of layers is increased by 2.
Figure: Five-layer ReLU neural network. The circle nodes represent those carrying input and output information. Specifically, there are D input nodes xj, j ∈ [0 : D - 1], and 2r output nodes. The arrows connecting the circle nodes show that the states of the nodes in the ith (i = 2, 3, 4, 5) layer are determined by ReLU activation functions of the states of some nodes in the (i - 1)th layer.
Finally, the output nodes η1,i, i ∈ [0 : r - 1], copy the corresponding nodes of the previous layer, so that η1 = hk[r], and the remaining output nodes η2,i are computed by ReLU activation functions of the previous layer. Here, we can observe that η2 = hl[r].
Suppose that n = 16. Then we have r = 4, h0[r] = [1, 0, 0, 0], h1[r] = [1, 1, 0, 0], h2[r] = [1, 1, 1, 0], and h3[r] = [1, 1, 1, 1]. Furthermore, x9 is mapped to (η1, η2) = (h2[r], h1[r]) because 9 = 2 · 4 + 1.
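The code computed by the encoder is easy to check in plain Python (our own illustration of the target codes, not of the ReLU network itself): the index i is split as i = k · r + l with r = ⌈√n⌉, and k and l are written in the staircase code hk[r].

```python
import math

def h(k, r):
    """Staircase code of length r with k + 1 leading ones: h(0, r) = [1, 0, ..., 0]."""
    return [1 if j <= k else 0 for j in range(r)]

def encode_index(i, n):
    r = math.isqrt(n - 1) + 1        # r = ceil(sqrt(n)) for n >= 1
    k, l = divmod(i, r)
    return h(k, r), h(l, r)

print(encode_index(9, 16))           # ([1, 1, 1, 0], [1, 1, 0, 0]) since 9 = 2 * 4 + 1
```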
By modifying theorem 19 in Melkman et al. (2023) for ReLU functions and real input vectors, we get the following theorem.
On top of the encoder constructed in the proof of theorem 3, we add a decoder that is a two-layer ReLU neural network.
Here, we need to design a decoder that outputs y = xkr+l for an input (hk[r], hl[r]). Let (β1, β2) = (hk[r], hl[r]). Note that (β1, β2) corresponds to (η1, η2) in theorem 3.
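Functionally, the decoder must invert this code: from (β1, β2) it recovers k and l (the number of ones minus one in each part) and outputs xkr+l. The snippet below describes only this input-output behavior (the actual two-layer ReLU decoder in the proof is different).

```python
import numpy as np

def decode_reference(beta1, beta2, X):
    """Reference behavior of the decoder: map (h_k[r], h_l[r]) back to x_{k*r+l}."""
    r = len(beta1)
    k = int(np.sum(beta1)) - 1       # h_k[r] contains exactly k + 1 ones
    l = int(np.sum(beta2)) - 1
    return X[k * r + l]

X = np.arange(16 * 3).reshape(16, 3)           # 16 toy 3-dimensional "input vectors"
beta1, beta2 = [1, 1, 1, 0], [1, 1, 0, 0]      # the code of index 9 = 2 * 4 + 1
print(decode_reference(beta1, beta2, X))       # row 9 of X
```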
In theorem 4, the size of the middle layer is 2⌈√n⌉. Next, we reduce the size of the middle layer from 2⌈√n⌉ to 2⌈log ⌈√n⌉⌉ by increasing the number of layers.
Figure: Eleven-layer perfect ReLU neural network autoencoder (the first four layers are omitted). The circle nodes represent those carrying input and output information. Specifically, there are 2m nodes in the middle layer and D nodes yj, j ∈ [0 : D - 1], in the output layer. The arrows connecting the circle nodes show that the states of the nodes in the ith (i = 6, 7, . . . , 11) layer are determined by ReLU activation functions of the states of some nodes in the (i - 1)th layer.
The first five layers are constructed as in the proof of theorem 3, so the input xi is mapped to (hk[r], hl[r]). Let (β1, β2) = (hk[r], hl[r]). For ease of exposition, we assume that r = 2^m.
Let ζi, i = 1, 2, be the m-dimensional vector whose entries ζi,j, j ∈ [0 : m - 1], are computed from βi by ReLU activation functions. Then ζi gives a binary representation (in reverse order) of βi. For example, ζi = [1, 1, 0] for βi = [1, 1, 1, 1, 0, 0, 0, 0], where n = 64 and r = 8.
Let μi, i = 1, 2, be the r-dimensional vector whose entries μi,j, j ∈ [0 : r - 1], are computed from ζi by ReLU activation functions. Here, we can find that μ1 = β1 and μ2 = β2.
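In other words, the extra layers only re-encode the staircase codes: ζi is the binary expansion (least significant bit first) of the index encoded by βi, and μi expands it back. The round trip can be checked in plain Python (again, this is not the ReLU realization itself).

```python
def staircase_to_binary(beta, m):
    """beta = h_k[r] with r = 2**m; return the m bits of k, least significant bit first."""
    k = sum(beta) - 1
    return [(k >> j) & 1 for j in range(m)]

def binary_to_staircase(zeta, r):
    k = sum(bit << j for j, bit in enumerate(zeta))
    return [1 if j <= k else 0 for j in range(r)]

beta = [1, 1, 1, 1, 0, 0, 0, 0]                  # h_3[8], i.e., k = 3, with n = 64 and r = 8
zeta = staircase_to_binary(beta, m=3)
print(zeta)                                       # [1, 1, 0]
print(binary_to_staircase(zeta, r=8) == beta)     # True
```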
Finally, the last two layers are constructed as shown in the proof of theorem 4.
In the above, the compressed layer corresponds to binary vectors. However, we can simply use a · xi to compress a D-dimensional vector to a 1-dimensional vector (i.e., a scalar). As shown below, decoders can be developed for this compressed scalar value by modifying the decoders in the proofs of theorems 3 and 4.
Proposition 2. For any set Xn = {x0, . . . , xn-1} of n distinct D-dimensional real vectors, there exists an eight-layer perfect ReLU neural network autoencoder whose compressed layer consists of a single node.
The neural network has D input nodes in the first layer, one node in the second layer, and D output nodes in the eighth layer; the third through seventh layers are hidden layers.

The single node in the second layer computes a · x, which corresponds to the encoder.

The decoder is obtained by simple modifications of the constructions in the proofs of theorems 3 and 4. We use the output of the second layer in place of a · x in the activation functions of the nodes in the proof of theorem 3. Furthermore, we replace β2 by a single node copying a · x, whose value is also used in place of a · β2 in the proof of theorem 3. Then, as in the proof of theorem 4, we identify the nodes in (η1, η2) of the proof of theorem 3 with the nodes in (β1, β2) in the proof of theorem 4. It is straightforward to see that this modification gives a perfect autoencoder.
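Conceptually, proposition 2 replaces the pair (β1, β2) by the single scalar a · x: since the projections a · xi are pairwise distinct, the scalar already identifies the input, and the ReLU decoder reconstructs xi from it. The following sketch illustrates this identification numerically; it is an illustration of the idea, not the eight-layer network.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(10, 5))        # 10 nonnegative 5-dimensional inputs
a = rng.normal(size=5)                          # projections a @ x_i are distinct almost surely
codes = X @ a                                   # one-dimensional compressed representation

def decode_scalar(z, codes, X):
    """Reference behavior of the decoder: return the stored x_i whose code matches z."""
    i = int(np.argmin(np.abs(codes - z)))
    return X[i]

for x in X:
    assert np.allclose(decode_scalar(x @ a, codes, X), x)   # perfect reconstruction
```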
Next, we consider memorizers, that is, functions for finite input samples. The following theorem is shown in Zhang et al. (2017).
Theorem 6 (Zhang et al., 2017). There exists a three-layer neural network with ReLU activation functions and 2n + D weights that can represent any function on a sample of size n in D dimensions.
For ease of understanding, we briefly explain this theorem. Here, X = {x0, x1, . . . , xn-1} denotes a set of n D-dimensional real input vectors that are all different, and Y = {y0, y1, . . . , yn-1} denotes a set of n one-dimensional real outputs with yi ⩾ 0 for all i.
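The standard construction behind theorem 6 can be sketched in a few lines (our own variable names; a sketch of the usual argument, not necessarily identical to the construction in Zhang et al., 2017): project every input onto a line using a vector a as in proposition 1, sort the projected values, and place one ReLU kink per sample so that the resulting piecewise-linear function passes through every target yi. This uses n hidden units with the shared projection a (D weights), n biases, and n output weights, that is, 2n + D weights in total.

```python
import numpy as np

def fit_relu_memorizer(X, y, rng=np.random.default_rng(0)):
    """Return (a, c, w) such that sum_j w_j * relu(a @ x_i - c_j) = y_i for every sample."""
    a = rng.normal(size=X.shape[1])
    b = X @ a
    order = np.argsort(b)
    X, y, b = X[order], y[order], b[order]        # assumes the projections b are all distinct
    c = np.concatenate(([b[0] - 1.0], b[:-1]))    # kink j sits just before sample j
    M = np.maximum(b[:, None] - c[None, :], 0.0)  # lower-triangular with positive diagonal
    w = np.linalg.solve(M, y)
    return a, c, w

def predict(x, a, c, w):
    return float(w @ np.maximum(x @ a - c, 0.0))

rng = np.random.default_rng(3)
X = rng.normal(size=(8, 4))
y = rng.uniform(size=8)
a, c, w = fit_relu_memorizer(X, y)
print(max(abs(predict(x, a, c, w) - t) for x, t in zip(X, y)))   # tiny (numerical error only)
```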
Now, we consider reducing the number of nodes in the middle layer from O(n) to O(√n) (although the number of layers increases by one).
Figure: Four-layer ReLU neural network. The circle nodes represent those carrying input and output information. The arrows connecting the circle nodes show that the states of the nodes in the ith (i = 2, 3, 4) layer are determined by ReLU activation functions of the states of the nodes in the (i - 1)th layer.
Here, the four-layer ReLU neural network we design is based on the four-layer fully connected neural network of theorem 3.1 in Yun et al. (2019). Notably, the proof of theorem 3.1 (Yun et al., 2019) shows that there is a four-layer fully connected neural network using hard-tanh activation functions that can fit any data set {(xi, yi) : i ∈ [0 : n - 1]}, where all inputs xi are distinct and all yi ∈ [-1, 1]. Although our construction is very similar to theirs, there are some differences because they use a neural network with hard-tanh activation functions. Therefore, in this proof, we mainly describe our construction of the ReLU neural network and briefly explain the differences from that in Yun et al. (2019). The full proof is given in the online supplemental material.
For ease of exposition, we assume that n = r^2 and r is a multiple of 2. We divide the n input vectors into r groups with r vectors each.
It is worth noting that theorem 1 in Zhang et al. (2017) (i.e., theorem 6 in this article) shows that there exists a two-layer neural network. Similarly, theorem 3.1 in Yun et al. (2019) shows that there is a three-layer hard-tanh, fully connected neural network. However, since here we regard the input layer as the first layer, to avoid confusion, the number of layers is increased by 1 when we describe the neural networks of theorems 1 and 3.1 in this article.
5 Computational Experiments
In order to test whether some of the autoencoder (memorizer) architectures obtained through our theoretical analyses can be used in the design of practical neural networks, we conducted computational experiments using neural networks. Here, the weights of the activation functions are learned from the input and output data. All numerical experiments were conducted on a PC with a Xeon Gold 5222 CPU and an A100 GPU under Ubuntu 18.04.
First, we performed computational experiments on theorem 4, using 256 36-dimensional real vectors and 324 40-dimensional real vectors as input vectors, respectively. For the training of neural networks with gaussian error linear unit (GELU) activation functions (Dubey et al., 2022), we employed the Adam optimizer in PyTorch with a learning rate of 0.01 and 800 epochs. The specific procedure is as follows.
Step 1: Generate a neural network with the architecture given in theorem 4.
Step 2: Randomly generate n D-dimensional real vectors x0, x1, . . . , xn-1.
Step 3: Train the network using x0, x1, . . . , xn-1 as both the input and output data.
Step 4: Compute the training accuracy of the trained network using x0, x1, . . . , xn-1 as the input data. Letting y0, y1, . . . , yn-1 be the corresponding output data, the training accuracy is defined by equation 5.1 using an indicator function.
We repeated this procedure 100 times to obtain the average training accuracy; the results are shown in Table 2.
In the above experiment, we used the GELU function instead of the ReLU function, mainly because we found that the training accuracy obtained by training a neural network with ReLU functions was not ideal. We believe that the main reason for this phenomenon is that the neural network constructed by theorem 4 has a slightly large number of layers and relatively complex connection methods for each layer. Using the GELU function, described as a smoother version of the ReLU function, for neural network training can provide smoother gradients, which can help maintain gradient flow in complex models and avoid problems such as vanishing or exploding gradients. In contrast, when the data range is small and the network hierarchy is complex, using the ReLU function may lead to numerical instability, thereby affecting the convergence and final performance of the model.
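For reference, a minimal PyTorch sketch of this training procedure is given below. The layer sizes here are a generic stand-in with a d-dimensional bottleneck rather than the exact theorem 4 architecture, and the tolerance in the accuracy check is our own assumption; the learning rate, the number of epochs, and the use of GELU follow the description above.

```python
import torch
import torch.nn as nn

D, n, d = 36, 256, 32
torch.manual_seed(0)

X = torch.rand(n, D)                        # step 2: random nonnegative input vectors

# Step 1 (simplified): a stack of GELU layers with a d-dimensional bottleneck.
# This is a generic stand-in, not the exact layer sizes of the theorem 4 construction.
model = nn.Sequential(
    nn.Linear(D, 2 * d), nn.GELU(),
    nn.Linear(2 * d, d), nn.GELU(),         # compressed (middle) layer of width d
    nn.Linear(d, 2 * d), nn.GELU(),
    nn.Linear(2 * d, D),
)

opt = torch.optim.Adam(model.parameters(), lr=0.01)
for epoch in range(800):                    # step 3: train with input = output
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X), X)
    loss.backward()
    opt.step()

# Step 4: count entries reconstructed within a tolerance (the tolerance is our assumption).
with torch.no_grad():
    Y = model(X)
    accuracy = (torch.abs(Y - X) <= 0.01).float().mean().item()
print(accuracy)
```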
Table 2: Results of Computational Experiments on Theorem 7.
| Architecture | D | n | d | Training Accuracy |
|---|---|---|---|---|
| Theorem 7 | 36 | 256 | 32 | 1.0000 |
| Theorem 7 | 40 | 324 | 36 | 1.0000 |
The above experimental results indicate that our proposed architectures may provide useful insights for designing practical autoencoders (memorizers).
6 Discussion
In this article, we studied relations between the compressed vectors and the depth and width (the number of nodes in a layer) of autoencoders with real input and output vectors using linear and ReLU activation functions, under the condition that the input and output vectors must be the same. The results on ReLU activation functions suggest that we can achieve the same compression ratio as in the case of binary input and output vectors (Melkman et al., 2023) using similar architectures. The results are interesting because real input and output vectors can be handled by replacing linear threshold activation functions with ReLU activation functions, with some modifications. Furthermore, the results on linear activation functions suggest that ReLU activation functions are much more powerful than linear activation functions with respect to autoencoders with real input and output vectors.
Although we have mainly given upper bounds on the depth and width, we have not shown any lower bounds for autoencoders using ReLU activation functions. Note that a lower bound is known for the width of autoencoders using linear threshold activation functions (Akutsu & Melkman, 2023). However, the techniques used there heavily depend on properties of Boolean functions and thus cannot be applied to the case of ReLU activation functions. Thus, showing lower bounds for autoencoders with ReLU activation functions is left as future work. For the case of d = ⌈log n⌉, an upper bound on the width of autoencoders using linear threshold activation functions is shown in corollary 8 of Akutsu and Melkman (2023), which is better than the upper bounds shown in Melkman et al. (2023) and this article. Hence, the development of an autoencoder with such a width using ReLU activation functions is also left as future work. Furthermore, it is interesting to study whether it is possible to develop autoencoders using ReLU activation functions with a smaller width than those using linear threshold activation functions.
In our definition of the perfect autoencoder, we assumed that the input vectors must be the same as the output vectors. But in practical situations, the input and output vectors need not be the same but should be similar. For the case of autoencoders with linear threshold activation functions, it is shown that the width of the decoder part can be reduced by a constant factor if some Hamming distance error is allowed (Akutsu & Melkman, 2023). However, the techniques used in that study cannot be applied to real input and output vectors because the Hamming distance cannot be directly generalized to real vectors and the construction of autoencoders heavily depends on binary values. Therefore, conducting theoretical studies on autoencoders allowing errors between real input and output vectors is interesting and important future work.
Another drawback of this work is that the proposed design methods are somewhat ad hoc. However, most of the designed architectures are based on existing ones for binary vectors using linear threshold activation functions, and the main purpose of this article is to extend such existing ones to a more practical setting (i.e., real vectors using ReLU activation functions). Therefore, this work can be considered an important step toward understanding data compression mechanisms of autoencoders in more practical settings. Of course, many other practical activation functions are known (Apicella et al., 2021; Dubey et al., 2022). Therefore, important future work is to develop more general design methods that can be applied to many practical activation functions.
Acknowledgments
T.A. was partially supported by grants-in-aid 22H00532 and 22K19830 from JSPS, Japan; W.C. was partially supported by Hong Kong RGC GRF grant 17301519, IMR, and Hung Hing Ying Physical Sciences Research Fund, HKU.
Note
In Yun et al. (2019), their network is stated as a three-layer network excluding the input layer.