A new network with super-approximation power is introduced. This network is built with Floor ($\lfloor x\rfloor$) or ReLU ($\max\{0,x\}$) activation function in each neuron; hence, we call such networks Floor-ReLU networks. For any hyperparameters $N, L \in \mathbb{N}^+$, we show that Floor-ReLU networks with width $\max\{d,\,5N+13\}$ and depth $64dL+3$ can uniformly approximate a Hölder function $f$ on $[0,1]^d$ with an approximation error $3\lambda d^{\alpha/2} N^{-\alpha\sqrt{L}}$, where $\alpha\in(0,1]$ and $\lambda$ are the Hölder order and constant, respectively. More generally, for an arbitrary continuous function $f$ on $[0,1]^d$ with a modulus of continuity $\omega_f(\cdot)$, the constructive approximation rate is $\omega_f(\sqrt{d}\,N^{-\sqrt{L}}) + 2\omega_f(\sqrt{d})\,N^{-\sqrt{L}}$. As a consequence, this new class of networks overcomes the curse of dimensionality in approximation power when the variation of $\omega_f(r)$ as $r\to 0$ is moderate (e.g., $\omega_f(r)\lesssim r^{\alpha}$ for Hölder continuous functions), since the major term to be considered in our approximation rate is essentially $\sqrt{d}$ times a function of $N$ and $L$ independent of $d$ within the modulus of continuity.

1  Introduction

Recently, there has been a large number of successful real-world applications of deep neural networks in many fields of computer science and engineering, especially for large-scale and high-dimensional learning problems. Understanding the approximation capacity of deep neural networks has become a fundamental research direction for revealing the advantages of deep learning compared to traditional methods. This letter introduces new theories and network architectures achieving root exponential convergence and avoiding the curse of dimensionality simultaneously for (Hölder) continuous functions with an explicit error bound in deep network approximation, which might be two foundational laws supporting the application of deep network approximation in large-scale and high-dimensional problems. The approximation results here are quantitative and apply to networks with essentially arbitrary width and depth. These results suggest considering Floor-ReLU networks as a possible alternative to ReLU networks in deep learning.

Deep ReLU networks with width $\mathcal{O}(N)$ and depth $\mathcal{O}(L)$ can achieve the approximation rate $\mathcal{O}(N^{-L})$ for polynomials on $[0,1]^d$ (Lu, Shen, Yang, & Zhang, 2020), but this is not true for general functions: the (nearly) optimal approximation rates of deep ReLU networks for a Lipschitz continuous function and a $C^s$ function $f$ on $[0,1]^d$ are $\mathcal{O}(\sqrt{d}\,N^{-2/d}L^{-2/d})$ and $\mathcal{O}(\|f\|_{C^s}\,N^{-2s/d}L^{-2s/d})$ (Shen, Yang, & Zhang, 2020; Lu et al., 2020), respectively. The limitation of ReLU networks motivates us to explore other types of network architectures to answer our curiosity about deep networks: Do deep neural networks with arbitrary width $\mathcal{O}(N)$ and arbitrary depth $\mathcal{O}(L)$ admit an exponential approximation rate $\mathcal{O}\big(\omega_f(N^{-L^{\eta}})\big)$ for some constant $\eta>0$ for a generic continuous function $f$ on $[0,1]^d$ with a modulus of continuity $\omega_f(\cdot)$?

To answer this question, we introduce the Floor-ReLU network, a fully connected neural network (FNN) built with either Floor ($\lfloor x\rfloor$) or ReLU ($\max\{0,x\}$) activation function¹ in each neuron. Mathematically, if we let $N_0=d$, $N_{L+1}=1$, and $N_\ell$ be the number of neurons in the $\ell$-th hidden layer of a Floor-ReLU network for $\ell=1,2,\ldots,L$, then the architecture of this network with input $\boldsymbol{x}$ and output $\phi(\boldsymbol{x})$ can be described as
$$\boldsymbol{x} = \tilde{\boldsymbol{h}}_0 \;\xrightarrow{\;\boldsymbol{W}_0,\,\boldsymbol{b}_0\;}\; \boldsymbol{h}_1 \;\xrightarrow{\;\sigma\ \text{or}\ \lfloor\cdot\rfloor\;}\; \tilde{\boldsymbol{h}}_1\;\cdots\; \xrightarrow{\;\boldsymbol{W}_{L-1},\,\boldsymbol{b}_{L-1}\;}\; \boldsymbol{h}_L \;\xrightarrow{\;\sigma\ \text{or}\ \lfloor\cdot\rfloor\;}\; \tilde{\boldsymbol{h}}_L \;\xrightarrow{\;\boldsymbol{W}_L,\,\boldsymbol{b}_L\;}\; \boldsymbol{h}_{L+1} = \phi(\boldsymbol{x}),$$
where $\boldsymbol{W}_\ell\in\mathbb{R}^{N_{\ell+1}\times N_\ell}$, $\boldsymbol{b}_\ell\in\mathbb{R}^{N_{\ell+1}}$, $\boldsymbol{h}_{\ell+1} := \boldsymbol{W}_\ell\cdot\tilde{\boldsymbol{h}}_\ell + \boldsymbol{b}_\ell$ for $\ell=0,1,\ldots,L$, and $\tilde{h}_{\ell,n}$ is equal to $\sigma(h_{\ell,n})$ or $\lfloor h_{\ell,n}\rfloor$ for $\ell=1,2,\ldots,L$ and $n=1,2,\ldots,N_\ell$, where $\boldsymbol{h}_\ell=(h_{\ell,1},\ldots,h_{\ell,N_\ell})$ and $\tilde{\boldsymbol{h}}_\ell=(\tilde{h}_{\ell,1},\ldots,\tilde{h}_{\ell,N_\ell})$ for $\ell=1,2,\ldots,L$. (See Figure 1 for an example.)
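For concreteness, the forward pass above can be emulated in a few lines of NumPy. This is only an illustrative sketch, not the constructed networks of theorem 1: the per-neuron choice between $\sigma$ and $\lfloor\cdot\rfloor$ is encoded here by a boolean mask per hidden layer, and all weights, biases, and masks in the example are arbitrary.

```python
import numpy as np

def floor_relu_forward(x, weights, biases, use_floor):
    """Evaluate phi(x) for a Floor-ReLU network.

    weights[l], biases[l] give h_{l+1} = W_l @ h_tilde_l + b_l for l = 0, ..., L;
    use_floor[l][n] == True means neuron n of hidden layer l+1 applies Floor,
    otherwise it applies ReLU; the output layer has no activation.
    """
    h_tilde = np.asarray(x, dtype=float)
    L = len(weights) - 1                          # number of hidden layers
    for l in range(L + 1):
        h = weights[l] @ h_tilde + biases[l]      # affine transform
        if l == L:                                # output layer: return h_{L+1}
            return h
        h_tilde = np.where(use_floor[l], np.floor(h), np.maximum(0.0, h))

# A random Floor-ReLU network with width 5 and depth 2 (cf. Figure 1), d = 1.
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((5, 1)), rng.standard_normal((5, 5)), rng.standard_normal((1, 5))]
bs = [rng.standard_normal(5), rng.standard_normal(5), rng.standard_normal(1)]
masks = [rng.random(5) < 0.5, rng.random(5) < 0.5]
print(floor_relu_forward(np.array([0.3]), Ws, bs, masks))
```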
Figure 1: An example of a Floor-ReLU network with width 5 and depth 2.

In theorem 1, we show by construction that Floor-ReLU networks with width $\max\{d,\,5N+13\}$ and depth $64dL+3$ can uniformly approximate a continuous function $f$ on $[0,1]^d$ with a root exponential approximation rate² $\omega_f(\sqrt{d}\,N^{-\sqrt{L}}) + 2\omega_f(\sqrt{d})\,N^{-\sqrt{L}}$, where $\omega_f(\cdot)$ is the modulus of continuity defined as
$$\omega_f(r) := \sup\big\{|f(\boldsymbol{x})-f(\boldsymbol{y})| : \|\boldsymbol{x}-\boldsymbol{y}\|_2\le r,\ \boldsymbol{x},\boldsymbol{y}\in[0,1]^d\big\},\quad \text{for any } r\ge 0,$$
where $\|\boldsymbol{x}\|_2=\sqrt{x_1^2+x_2^2+\cdots+x_d^2}$ for any $\boldsymbol{x}=(x_1,x_2,\ldots,x_d)\in\mathbb{R}^d$.
Theorem 1.
Given any $N,L\in\mathbb{N}^+$ and an arbitrary continuous function $f$ on $[0,1]^d$, there exists a function $\phi$ implemented by a Floor-ReLU network with width $\max\{d,\,5N+13\}$ and depth $64dL+3$ such that
$$|\phi(\boldsymbol{x})-f(\boldsymbol{x})| \le \omega_f(\sqrt{d}\,N^{-\sqrt{L}}) + 2\omega_f(\sqrt{d})\,N^{-\sqrt{L}}, \quad \text{for any } \boldsymbol{x}\in[0,1]^d.$$

With theorem 1, we have an immediate corollary:

Corollary 1.
Given an arbitrary continuous function $f$ on $[0,1]^d$, there exists a function $\phi$ implemented by a Floor-ReLU network with width $\bar{N}$ and depth $\bar{L}$ such that
$$|\phi(\boldsymbol{x})-f(\boldsymbol{x})| \le \omega_f\Big(\sqrt{d}\,\big\lfloor\tfrac{\bar{N}-13}{5}\big\rfloor^{-\sqrt{\lfloor(\bar{L}-3)/(64d)\rfloor}}\Big) + 2\,\omega_f(\sqrt{d})\,\big\lfloor\tfrac{\bar{N}-13}{5}\big\rfloor^{-\sqrt{\lfloor(\bar{L}-3)/(64d)\rfloor}},$$
for any $\boldsymbol{x}\in[0,1]^d$ and $\bar{N},\bar{L}\in\mathbb{N}^+$ with $\bar{N}\ge\max\{d,18\}$ and $\bar{L}\ge 64d+3$.
In theorem 1, the rate in $\omega_f(\sqrt{d}\,N^{-\sqrt{L}})$ implicitly depends on $N$ and $L$ through the modulus of continuity of $f$, while the rate in $2\omega_f(\sqrt{d})\,N^{-\sqrt{L}}$ is explicit in $N$ and $L$. Simplifying the implicit approximation rate to make it explicitly depend on $N$ and $L$ is challenging in general. However, if $f$ is a Hölder continuous function on $[0,1]^d$ of order $\alpha\in(0,1]$ with a constant $\lambda$, that is, $f$ satisfying
$$|f(\boldsymbol{x})-f(\boldsymbol{y})| \le \lambda\,\|\boldsymbol{x}-\boldsymbol{y}\|_2^{\alpha}, \quad \text{for any } \boldsymbol{x},\boldsymbol{y}\in[0,1]^d,$$
(1.1)
then $\omega_f(r)\le\lambda r^{\alpha}$ for any $r\ge 0$. Therefore, in the case of Hölder continuous functions, the approximation rate is simplified to $3\lambda d^{\alpha/2} N^{-\alpha\sqrt{L}}$, as shown in the following corollary. In the special case of Lipschitz continuous functions with a Lipschitz constant $\lambda$, the approximation rate is simplified to $3\lambda\sqrt{d}\,N^{-\sqrt{L}}$.
Corollary 2.
Given any $N,L\in\mathbb{N}^+$ and a Hölder continuous function $f$ on $[0,1]^d$ of order $\alpha$ with a constant $\lambda$, there exists a function $\phi$ implemented by a Floor-ReLU network with width $\max\{d,\,5N+13\}$ and depth $64dL+3$ such that
$$|\phi(\boldsymbol{x})-f(\boldsymbol{x})| \le 3\lambda d^{\alpha/2} N^{-\alpha\sqrt{L}}, \quad \text{for any } \boldsymbol{x}\in[0,1]^d.$$
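As a quick numerical illustration of corollary 2 (with arbitrary example values of $\lambda$, $d$, $\alpha$, $N$, and $L$, not values taken from the text), the bound $3\lambda d^{\alpha/2}N^{-\alpha\sqrt{L}}$ decays rapidly as the depth budget $L$ grows while the width stays fixed:

```python
import math

def holder_bound(lam, d, alpha, N, L):
    # Corollary 2: |phi(x) - f(x)| <= 3 * lambda * d^(alpha/2) * N^(-alpha*sqrt(L))
    return 3.0 * lam * d ** (alpha / 2.0) * N ** (-alpha * math.sqrt(L))

# Lipschitz case (alpha = 1, lambda = 1) in dimension d = 10, width parameter N = 5:
for L in (4, 16, 64, 256):
    print(f"L = {L:4d}   bound = {holder_bound(1.0, 10, 1.0, 5, L):.3e}")
```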

First, theorem 1 and corollary 2 show that the approximation capacity of deep networks for continuous functions can be nearly exponentially improved by increasing the network depth, and the approximation error can be explicitly characterized in terms of the width $\mathcal{O}(N)$ and depth $\mathcal{O}(L)$. Second, this new class of networks overcomes the curse of dimensionality in the approximation power when the modulus of continuity is moderate, since the approximation order is essentially $\omega_f(\sqrt{d}\,N^{-\sqrt{L}})$. Finally, applying piecewise constant and integer-valued functions as activation functions and integer numbers as parameters has been explored in the study of quantized neural networks (Hubara, Courbariaux, Soudry, El-Yaniv, & Bengio, 2017; Yin et al., 2019; Bengio, Léonard, & Courville, 2013) with efficient training algorithms for low computational complexity (Wang et al., 2018). The floor function ($\lfloor x\rfloor$) is a piecewise constant function and can be easily implemented numerically at very little cost. Hence, the evaluation of the proposed network could be efficiently implemented in practical computation. Though there might not be an existing optimization algorithm to identify an approximant with the approximation rate in this letter, theorem 1 provides the accuracy that can be expected before a learning task is carried out and indicates how much current optimization algorithms could still be improved. Designing an efficient optimization algorithm for Floor-ReLU networks is left as future work, with several possible directions discussed later.

We remark that increased smoothness or regularity of the target function could improve our approximation rate, but at the cost of a large prefactor. For example, to attain better approximation rates for functions in $C^s([0,1]^d)$, it is common to use Taylor expansions and derivatives, tools that result in a large prefactor like $\mathcal{O}\big((s+1)^d\big)$ and are therefore subject to the curse of dimensionality. Furthermore, the prospective approximation rate obtained from smoothness is not attractive. For example, the prospective approximation rate would be $\mathcal{O}(N^{-s\sqrt{L}})$ if we use Floor-ReLU networks with width $\mathcal{O}(N)$ and depth $\mathcal{O}(L)$ to approximate functions in $C^s([0,1]^d)$. However, such a rate $\mathcal{O}(N^{-s\sqrt{L}}) = \mathcal{O}(N^{-\sqrt{s^2 L}})$ can already be attained by using Floor-ReLU networks with width $\mathcal{O}(N)$ and depth $\mathcal{O}(s^2 L)$ to approximate Lipschitz continuous functions. Hence, increasing the network depth can result in the same approximation rate for Lipschitz continuous functions as the rate for smooth functions.

The rest of this letter is organized as follows. In section 2, we discuss the application scope of our theory and compare related works in the literature. In section 3, we prove theorem 1 based on proposition 1. Next, this basic proposition is proved in section 4. Finally, we conclude in section 5.

2  Discussion

In this section, we discuss the application scope of our theory in machine learning and compare it with related works in the literature.

2.1  Application Scope of Our Theory in Machine Learning

In supervised learning, an unknown target function $f(\boldsymbol{x})$ defined on a domain $\Omega$ is learned through its finitely many samples $\{(\boldsymbol{x}_i, f(\boldsymbol{x}_i))\}_{i=1}^{n}$. If deep networks are applied in supervised learning, the following optimization problem is solved to identify a deep network $\phi(\boldsymbol{x};\boldsymbol{\theta}_{\mathcal{S}})$, with $\boldsymbol{\theta}_{\mathcal{S}}$ as the set of parameters, to infer $f(\boldsymbol{x})$ for unseen data samples $\boldsymbol{x}$:
$$\boldsymbol{\theta}_{\mathcal{S}} = \arg\min_{\boldsymbol{\theta}} R_{\mathcal{S}}(\boldsymbol{\theta}) := \arg\min_{\boldsymbol{\theta}} \frac{1}{n}\sum_{i=1}^{n}\ell\big(\phi(\boldsymbol{x}_i;\boldsymbol{\theta}),\, f(\boldsymbol{x}_i)\big),$$
(2.1)
with a loss function typically taken as $\ell(y,y') = \tfrac{1}{2}|y-y'|^2$. The inference error is usually measured by $R_{\mathcal{D}}(\boldsymbol{\theta}_{\mathcal{S}})$, where
$$R_{\mathcal{D}}(\boldsymbol{\theta}) := \mathbb{E}_{\boldsymbol{x}\sim U(\Omega)}\,\ell\big(\phi(\boldsymbol{x};\boldsymbol{\theta}),\, f(\boldsymbol{x})\big),$$
where the expectation is taken with respect to an unknown data distribution $U(\Omega)$ over $\Omega$.
Note that the best deep network to infer $f(\boldsymbol{x})$ is $\phi(\boldsymbol{x};\boldsymbol{\theta}_{\mathcal{D}})$ with $\boldsymbol{\theta}_{\mathcal{D}}$ given by
$$\boldsymbol{\theta}_{\mathcal{D}} = \arg\min_{\boldsymbol{\theta}} R_{\mathcal{D}}(\boldsymbol{\theta}).$$
The best possible inference error is $R_{\mathcal{D}}(\boldsymbol{\theta}_{\mathcal{D}})$. In real applications, $U(\Omega)$ is unknown, and only finitely many samples from this distribution are available. Hence, the empirical loss $R_{\mathcal{S}}(\boldsymbol{\theta})$ is minimized in the hope of obtaining $\phi(\boldsymbol{x};\boldsymbol{\theta}_{\mathcal{S}})$, instead of minimizing the population loss $R_{\mathcal{D}}(\boldsymbol{\theta})$ to obtain $\phi(\boldsymbol{x};\boldsymbol{\theta}_{\mathcal{D}})$. In practice, a numerical optimization method to solve equation 2.1 may result in a numerical solution (denoted as $\boldsymbol{\theta}_{\mathcal{N}}$) that may not be a global minimizer $\boldsymbol{\theta}_{\mathcal{S}}$. Therefore, the actual learned neural network to infer $f(\boldsymbol{x})$ is $\phi(\boldsymbol{x};\boldsymbol{\theta}_{\mathcal{N}})$, and the corresponding inference error is measured by $R_{\mathcal{D}}(\boldsymbol{\theta}_{\mathcal{N}})$.
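The following is a minimal sketch of the empirical risk $R_{\mathcal{S}}(\boldsymbol{\theta})$ in equation 2.1 with the square loss; the placeholder model and data below are ours, and no particular optimizer is implied, since standard gradient-based training does not apply directly to Floor-ReLU networks, as discussed later.

```python
import numpy as np

def square_loss(y_pred, y_true):
    # l(y, y') = 1/2 |y - y'|^2
    return 0.5 * (y_pred - y_true) ** 2

def empirical_risk(phi, theta, xs, ys):
    # R_S(theta) = (1/n) sum_i l(phi(x_i; theta), f(x_i))
    return float(np.mean([square_loss(phi(x, theta), y) for x, y in zip(xs, ys)]))

# Placeholder parametric model standing in for phi(x; theta):
phi = lambda x, theta: float(theta @ x)
xs = np.random.default_rng(1).random((100, 3))    # n = 100 samples in [0,1]^3
ys = xs.sum(axis=1)                               # target f(x) = x_1 + x_2 + x_3
print(empirical_risk(phi, np.ones(3), xs, ys))    # 0.0 for the exact parameters
```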
By the discussion, it is crucial to quantify $R_{\mathcal{D}}(\boldsymbol{\theta}_{\mathcal{N}})$ to see how good the learned neural network $\phi(\boldsymbol{x};\boldsymbol{\theta}_{\mathcal{N}})$ is, since $R_{\mathcal{D}}(\boldsymbol{\theta}_{\mathcal{N}})$ is the expected inference error over all possible data samples. Note that
$$\begin{aligned}
R_{\mathcal{D}}(\boldsymbol{\theta}_{\mathcal{N}}) &= [R_{\mathcal{D}}(\boldsymbol{\theta}_{\mathcal{N}})-R_{\mathcal{S}}(\boldsymbol{\theta}_{\mathcal{N}})] + [R_{\mathcal{S}}(\boldsymbol{\theta}_{\mathcal{N}})-R_{\mathcal{S}}(\boldsymbol{\theta}_{\mathcal{S}})] + [R_{\mathcal{S}}(\boldsymbol{\theta}_{\mathcal{S}})-R_{\mathcal{S}}(\boldsymbol{\theta}_{\mathcal{D}})] + [R_{\mathcal{S}}(\boldsymbol{\theta}_{\mathcal{D}})-R_{\mathcal{D}}(\boldsymbol{\theta}_{\mathcal{D}})] + R_{\mathcal{D}}(\boldsymbol{\theta}_{\mathcal{D}})\\
&\le R_{\mathcal{D}}(\boldsymbol{\theta}_{\mathcal{D}}) + [R_{\mathcal{S}}(\boldsymbol{\theta}_{\mathcal{N}})-R_{\mathcal{S}}(\boldsymbol{\theta}_{\mathcal{S}})] + [R_{\mathcal{D}}(\boldsymbol{\theta}_{\mathcal{N}})-R_{\mathcal{S}}(\boldsymbol{\theta}_{\mathcal{N}})] + [R_{\mathcal{S}}(\boldsymbol{\theta}_{\mathcal{D}})-R_{\mathcal{D}}(\boldsymbol{\theta}_{\mathcal{D}})],
\end{aligned}$$
(2.2)
where the inequality comes from the fact that $[R_{\mathcal{S}}(\boldsymbol{\theta}_{\mathcal{S}})-R_{\mathcal{S}}(\boldsymbol{\theta}_{\mathcal{D}})]\le 0$ since $\boldsymbol{\theta}_{\mathcal{S}}$ is a global minimizer of $R_{\mathcal{S}}(\boldsymbol{\theta})$. The constructive approximation established in this letter and in the literature provides an upper bound of $R_{\mathcal{D}}(\boldsymbol{\theta}_{\mathcal{D}})$ in terms of the network size, for example, in terms of the network width and depth or the number of parameters. The second term of equation 2.2 is bounded by the optimization error of the numerical algorithm applied to solve the empirical loss minimization problem in equation 2.1. If the numerical algorithm is able to find a global minimizer, the second term is equal to zero. The theoretical guarantee of the convergence of an optimization algorithm to a global minimizer $\boldsymbol{\theta}_{\mathcal{S}}$ and the characterization of the convergence belong to the optimization analysis of neural networks. The third and fourth terms of equation 2.2 are usually bounded in terms of the sample size $n$ and a certain norm of $\boldsymbol{\theta}_{\mathcal{N}}$ and $\boldsymbol{\theta}_{\mathcal{D}}$ (e.g., the $\ell^1$, $\ell^2$, or path norm), respectively. The study of the bounds for the third and fourth terms is referred to as the generalization error analysis of neural networks.

The approximation theory, optimization theory, and generalization theory form the three main theoretical aspects of deep learning with different emphases and challenges, which have motivated many separate research directions recently. Theorem 1 and corollary 2 provide an upper bound of RD(θD). This bound only depends on the given budget of neurons and layers of Floor-ReLU networks and on the modulus of continuity of the target function f. Hence, this bound is independent of the empirical loss minimization in equation 2.1 and the optimization algorithm used to compute the numerical solution of that equation. In other words, theorem 1 and corollary 2 quantify the approximation power of Floor-ReLU networks with a given size. Designing efficient optimization algorithms and analyzing the generalization bounds for Floor-ReLU networks are two other separate future directions. Although optimization algorithms and generalization analysis are not our focus in this letter, in the next two paragraphs, we discuss several possible research topics in these directions for our Floor-ReLU networks.

In this work, we have not analyzed the feasibility of optimization algorithms for the Floor-ReLU network. Typically, stochastic gradient descent (SGD) is applied to solve a network optimization problem. However, the Floor-ReLU network has piecewise constant activation functions, making standard SGD infeasible. There are two possible directions to solve the optimization problem for Floor-ReLU networks: (1) gradient-free optimization methods, such as the Nelder-Mead method (Nelder & Mead, 1965), genetic algorithm (Holland, 1992), simulated annealing (Kirkpatrick, Gelatt, & Vecchi, 1983), particle swarm optimization (Kennedy & Eberhart, 1995), and consensus-based optimization (Pinnau, Totzeck, Tse, & Martin, 2017; Carrillo, Jin, Li, & Zhu, 2019); and (2) applying optimization algorithms for quantized networks that also have piecewise constant activation functions (Lin, Lei, & Niu, 2019; Boo, Shin, & Sung, 2020; Bengio et al., 2013; Wang et al., 2018; Hubara et al., 2017; Yin et al., 2019). It would be interesting future work to explore efficient learning algorithms based on the Floor-ReLU network.

Generalization analysis of Floor-ReLU networks is also an interesting future direction. Previous works have shown the generalization power of ReLU networks for regression problems (Jacot, Gabriel, & Hongler, 2018; Cao & Gu, 2019; Chen, Cao, Zou, & Gu, 2019; Weinan, Ma, & Wu, 2019; Weinan & Wojtowytsch, 2020) and for solving partial differential equations (Berner et al., 2018; Luo & Yang, 2020). Regularization strategies for ReLU networks to guarantee good generalization capacity of deep learning have been proposed in Weinan, Ma, and Wu (2019) and Weinan and Wojtowytsch (2020). It is important to investigate the generalization capacity of our Floor-ReLU networks. In particular, it is of great interest to see whether problem-dependent regularization strategies exist that make the generalization error of our Floor-ReLU networks free of the curse of dimensionality.

2.2  Approximation Rates in $\mathcal{O}(N)$ and $\mathcal{O}(L)$ versus $\mathcal{O}(W)$

Characterizing deep network approximation in terms of the width $\mathcal{O}(N)$³ and depth $\mathcal{O}(L)$ simultaneously is fundamental and indispensable in realistic applications, while quantifying the deep network approximation based on the number of nonzero parameters $W$ is probably only of interest in theory as far as we know. Theorem 1 can provide practical guidance for choosing network sizes in realistic applications, while theories in terms of $W$ cannot tell how large a network should be to guarantee a target accuracy. The width and depth are the two most direct and amenable hyperparameters for choosing a specific network in a learning task, while the number of nonzero parameters $W$ is hardly controlled efficiently. Theories in terms of $W$ essentially have a single variable to control the network size in three types of structures: (1) fixing the width $N$ and varying the depth $L$; (2) fixing the depth $L$ and changing the width $N$; (3) letting both the width and depth be controlled by the same parameter, such as the target accuracy $\varepsilon$, in a specific way (e.g., $N$ is a polynomial of $\tfrac{1}{\varepsilon^d}$ and $L$ is a polynomial of $\log(\tfrac{1}{\varepsilon})$). Considering the nonuniqueness of structures realizing the same $W$, it is impractical to develop approximation rates in terms of $W$ covering all these structures. If one network structure has been chosen in a certain application, there might not be a known theory in terms of $W$ to quantify the performance of this structure. Finally, in terms of a full error analysis of deep learning including approximation theory, optimization theory, and generalization theory as illustrated in equation 2.2, the approximation error characterization in terms of width and depth is more useful than that in terms of the number of parameters, because almost all existing optimization and generalization analysis is based on depth and width instead of the number of parameters (Jacot, Gabriel, & Hongler, 2018; Cao & Gu, 2019; Chen, Cao, Zou, & Gu, 2019; Arora, Du, Hu, Li, & Wang, 2019; Allen-Zhu, Li, & Liang, 2019; Weinan et al., 2019; Weinan & Wojtowytsch, 2020; Ji & Telgarsky, 2020), to the best of our knowledge. Approximation results in terms of width and depth are thus more consistent with the optimization and generalization analysis tools needed to obtain a full error analysis in equation 2.2.

Most existing approximation theories for deep neural networks so far focus on the approximation rate in the number of parameters $W$ (Cybenko, 1989; Hornik, Stinchcombe, & White, 1989; Barron, 1993; Liang & Srikant, 2016; Yarotsky, 2017, 2018; Poggio, Mhaskar, Rosasco, Miranda, & Liao, 2017; Weinan & Wang, 2018; Petersen & Voigtlaender, 2018; Chui, Lin, & Zhou, 2018; Nakada & Imaizumi, 2019; Gribonval, Kutyniok, Nielsen, & Voigtlaender, 2019; Gühring, Kutyniok, & Petersen, 2019; Chen, Jiang, Liao, & Zhao, 2019; Li, Lin, & Shen, 2019; Suzuki, 2019; Bao et al., 2019; Opschoor, Schwab, & Zech, 2019; Yarotsky & Zhevnerchuk, 2019; Bölcskei, Grohs, Kutyniok, & Petersen, 2019; Montanelli & Du, 2019; Chen & Wu, 2019; Zhou, 2020; Montanelli & Yang, 2020; Montanelli, Yang, & Du, in press). From the point of view of theoretical difficulty, controlling two variables, $N$ and $L$, in our theory is more challenging than controlling one variable $W$ in the literature. In terms of mathematical logic, the characterization of deep network approximation in terms of $N$ and $L$ can provide an approximation rate in terms of $W$, while we are not aware of how to derive approximation rates in terms of arbitrary $N$ and $L$ given approximation rates in terms of $W$, since existing results in terms of $W$ are valid for specific network sizes with width and depth given as functions of $W$, without the freedom to take arbitrary values. Existing theories essentially have a single variable to control the network size in the three types of structures above. Let us use the first type of structure, which includes the best-known result for a nearly optimal approximation rate, $\mathcal{O}(\omega_f(W^{-2/d}))$, for continuous functions in terms of $W$ using ReLU networks (Yarotsky, 2018) and the best-known result, $\mathcal{O}(\exp(-c_{\alpha,d}\sqrt{W}))$, for Hölder continuous functions of order $\alpha$ using Sine-ReLU networks (Yarotsky & Zhevnerchuk, 2019), as an example to show how theorem 1 in terms of $N$ and $L$ can be applied to derive a better result in terms of $W$. One can apply theorem 1 in a similar way to obtain other corollaries for other types of structures in terms of $W$. The main idea is to specify the values of $N$ and $L$ in theorem 1 to obtain the desired corollary. For example, if we let the width parameter $N=2$ and the depth parameter $L=W$ in theorem 1, then the width is $\max\{d,23\}$, the depth is $64dW+3$, and the total number of parameters is bounded by $\mathcal{O}\big(\max\{d^2,23^2\}\,(64dW+3)\big) = \mathcal{O}(W)$. Therefore, we can prove corollary 3 for the approximation capacity of our Floor-ReLU networks in terms of the total number of parameters as follows:

Corollary 3.
Given any $W\in\mathbb{N}^+$ and a continuous function $f$ on $[0,1]^d$, there exists a function $\phi$ implemented by a Floor-ReLU network with $\mathcal{O}(W)$ nonzero parameters, width $\max\{d,23\}$, and depth $64dW+3$, such that
$$|\phi(\boldsymbol{x})-f(\boldsymbol{x})| \le \omega_f(\sqrt{d}\,2^{-\sqrt{W}}) + 2\omega_f(\sqrt{d})\,2^{-\sqrt{W}}, \quad \text{for any } \boldsymbol{x}\in[0,1]^d.$$

Corollary 3 achieves root exponential convergence without the curse of dimensionality in terms of the number of parameters $W$ with the help of Floor-ReLU networks. When only ReLU networks are used, the result in Yarotsky (2018) suffers from the curse and does not have any kind of exponential convergence. The result in Yarotsky and Zhevnerchuk (2019) with Sine-ReLU networks has root exponential convergence but has not excluded the possibility of the curse of dimensionality, as we shall discuss. Furthermore, corollary 3 works for generic continuous functions, while Yarotsky and Zhevnerchuk (2019) applies only to Hölder continuous functions.

2.3  Further Interpretation of Our Theory

In the interpretation of our theory, two more aspects are important to discuss. The first is whether it is possible to extend our theory to functions on a more general domain, for example, $[-M,M]^d$ for some $M>1$, because $M>1$ may cause an implicit curse of dimensionality in some existing theories, as we shall point out. The second is how bad the modulus of continuity can be, since it involves a high-dimensional function $f$ and may therefore lead to an implicit curse of dimensionality in our approximation rate.

First, theorem 1 can be easily generalized to $C([-M,M]^d)$ for any $M>0$. Let $\mathcal{L}$ be the linear map given by $\mathcal{L}(\boldsymbol{x}) = 2M(\boldsymbol{x}-\tfrac{1}{2})$. By theorem 1, for any $f\in C([-M,M]^d)$, there exists $\phi$ implemented by a Floor-ReLU network with width $\max\{d,\,5N+13\}$ and depth $64dL+3$ such that
$$|\phi(\boldsymbol{x}) - f\circ\mathcal{L}(\boldsymbol{x})| \le \omega_{f\circ\mathcal{L}}(\sqrt{d}\,N^{-\sqrt{L}}) + 2\,\omega_{f\circ\mathcal{L}}(\sqrt{d})\,N^{-\sqrt{L}}, \quad \text{for any } \boldsymbol{x}\in[0,1]^d.$$
It follows from $\boldsymbol{y}=\mathcal{L}(\boldsymbol{x})\in[-M,M]^d$ and $\omega_{f\circ\mathcal{L}}(r) = \omega_f^{[-M,M]^d}(2Mr)$ for any $r\ge 0$ that,⁴ for any $\boldsymbol{y}\in[-M,M]^d$,
$$\Big|\phi\big(\tfrac{\boldsymbol{y}+M}{2M}\big) - f(\boldsymbol{y})\Big| \le \omega_f^{[-M,M]^d}(2M\sqrt{d}\,N^{-\sqrt{L}}) + 2\,\omega_f^{[-M,M]^d}(2M\sqrt{d})\,N^{-\sqrt{L}}.$$
(2.3)
Hence, the size of the function domain $[-M,M]^d$ has only a mild influence on the approximation rate of our Floor-ReLU networks. Floor-ReLU networks can still avoid the curse of dimensionality and achieve root exponential convergence for continuous functions on $[-M,M]^d$ when $M>1$. For example, in the case of Hölder continuous functions of order $\alpha$ with a constant $\lambda$ on $[-M,M]^d$, our approximation rate becomes $3\lambda\,(2M\sqrt{d}\,N^{-\sqrt{L}})^{\alpha}$.
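A minimal sketch of this change of variables is given below; the callable phi_unit is a stand-in for the constructed approximant of $f\circ\mathcal{L}$ on $[0,1]^d$, not the actual network.

```python
import numpy as np

def extend_to_box(phi_unit, M):
    """Wrap an approximant built on [0,1]^d so it approximates f on [-M, M]^d,
    using y -> (y + M) / (2M), the inverse of L(x) = 2M(x - 1/2)."""
    return lambda y: phi_unit((np.asarray(y, dtype=float) + M) / (2.0 * M))

# Toy example with M = 1 and f(y) = y_1 + y_2, so that f o L(x) = sum(2x - 1):
phi_unit = lambda x: float(np.sum(2.0 * np.asarray(x) - 1.0))
f_box = extend_to_box(phi_unit, M=1.0)
print(f_box([0.7, -0.2]))   # approximately f((0.7, -0.2)) = 0.5
```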
Second, most interesting continuous functions in practice have a good modulus of continuity such that there is no implicit curse of dimensionality hiding in $\omega_f(\cdot)$. For example, we have discussed the case of Hölder continuous functions previously. We note that the class of Hölder continuous functions implicitly depends on $d$ through its definition in equation 1.1, but this dependence is moderate since the $\ell^2$-norm in the equation is the square root of a sum with $d$ terms. We now discuss several cases of $\omega_f(\cdot)$ for which we cannot achieve exponential convergence or cannot avoid the curse of dimensionality. The first example is $\omega_f(r) = \frac{1}{\ln(1/r)}$ for all small $r>0$, which leads to an approximation rate
$$3\big(\sqrt{L}\,\ln N - \tfrac{1}{2}\ln d\big)^{-1}, \quad \text{for large } N,L\in\mathbb{N}^+.$$
Apparently, this approximation rate still avoids the curse of dimensionality, but there is no exponential convergence, which has been canceled out by the "$\ln$" in $\omega_f(\cdot)$. The second example is $\omega_f(r) = \frac{1}{\ln^{1/d}(1/r)}$ for all small $r>0$, which leads to an approximation rate
$$3\big(\sqrt{L}\,\ln N - \tfrac{1}{2}\ln d\big)^{-1/d}, \quad \text{for large } N,L\in\mathbb{N}^+.$$
The power $\tfrac{1}{d}$ further weakens the approximation rate, and, hence, the curse of dimensionality occurs. The last example we discuss is $\omega_f(r) = r^{\alpha/d}$ for all small $r>0$, which results in the approximation rate
$$3\,d^{\frac{\alpha}{2d}}\,N^{-\frac{\alpha}{d}\sqrt{L}}, \quad \text{for large } N,L\in\mathbb{N}^+,$$
which achieves exponential convergence and avoids the curse of dimensionality when we use very deep networks with a fixed width. But if we fix the depth, there is no exponential convergence, and the curse occurs. Though we have provided several examples of immoderate $\omega_f(\cdot)$, to the best of our knowledge, we are not aware of practically useful continuous functions with an immoderate $\omega_f(\cdot)$.

2.4  Discussion of the Literature

The neural networks constructed here achieve exponential convergence without the curse of dimensionality simultaneously for a function class as general as (Hölder) continuous functions, while, to the best of our knowledge, most existing theories apply only to functions with an intrinsic low complexity. For example, the exponential convergence was studied for polynomials (Yarotsky, 2017; Montanelli et al., in press; Lu et al., 2020), smooth functions (Montanelli et al., in press; Liang & Srikant, 2016), analytic functions (Weinan & Wang, 2018), and functions admitting a holomorphic extension to a Bernstein polyellipse (Opschoor et al., 2019). For another example, no curse of dimensionality occurs, or the curse is lessened for Barron spaces (Barron, 1993; Weinan et al., 2019; Weinan & Wojtowytsch, 2020), Korobov spaces (Montanelli & Du, 2019), band-limited functions (Chen & Wu, 2019; Montanelli et al., in press), compositional functions (Poggio et al., 2017), and smooth functions (Yarotsky & Zhevnerchuk, 2019; Lu et al., 2020; Montanelli & Yang, 2020; Yang & Wang, 2020).

Our theory admits a neat and explicit approximation error bound. For example, our approximation rate in the case of Hölder continuous functions of order $\alpha$ with a constant $\lambda$ is $3\lambda d^{\alpha/2}N^{-\alpha\sqrt{L}}$, while the prefactor of most existing theories is unknown or grows exponentially in $d$. Our proof fully explores the advantage of the compositional structure and the nonlinearity of deep networks, while many existing theories were built on traditional approximation tools (e.g., polynomial approximation, multiresolution analysis, and Monte Carlo sampling), making it challenging for existing theories to obtain a neat and explicit error bound with an exponential convergence and without the curse of dimensionality.

Let us review existing work in more detail.

2.4.1  Curse of Dimensionality

The curse of dimensionality is the phenomenon that approximating a $d$-dimensional function using a certain parameterization method with a fixed target accuracy generally requires a large number of parameters that is exponential in $d$, and this expense quickly becomes unaffordable when $d$ is large. For example, traditional finite element methods with $W$ parameters can achieve an approximation accuracy $\mathcal{O}(W^{-1/d})$ with an explicit indicator of the curse, $\tfrac{1}{d}$, in the power of $W$. If an approximation rate has a constant independent of $W$ and exponential in $d$, the curse still occurs implicitly through this prefactor by definition. If the approximation rate has a prefactor $C_f$ depending on $f$, then the prefactor $C_f$ still depends on $d$ implicitly via $f$, and the curse implicitly occurs if $C_f$ grows exponentially when $d$ increases. Designing a parameterization method that can overcome the curse of dimensionality is an important research topic in approximation theory.

In Barron (1993) and its variants or generalizations (Weinan et al., 2019; Weinan & Wojtowytsch, 2020; Chen & Wu, 2019; Montanelli et al., in press), $d$-dimensional functions defined on a domain $\Omega\subseteq\mathbb{R}^d$ admitting an integral representation with an integrand that is a ridge function on $\tilde{\Omega}\subseteq\mathbb{R}^d$ with a variable coefficient were considered, for example,
$$f(\boldsymbol{x}) = \int_{\tilde{\Omega}} a(\boldsymbol{w})\,K(\boldsymbol{w}\cdot\boldsymbol{x})\,d\nu(\boldsymbol{w}),$$
(2.4)
where $\nu(\boldsymbol{w})$ is a Lebesgue measure in $\boldsymbol{w}$. $f(\boldsymbol{x})$ can be reformulated as the expectation of a high-dimensional random function when $\boldsymbol{w}$ is treated as a random variable. Then $f(\boldsymbol{x})$ can be approximated by the average of $W$ samples of the integrand, in the same spirit as the law of large numbers, with an approximation error essentially bounded by $\frac{C_f\sqrt{\mu(\Omega)}}{\sqrt{W}}$ measured in $L^2(\Omega,\mu)$ (see equation 6 of Barron, 1993), where $\mathcal{O}(W)$ is the total number of parameters in the network, $C_f$ is a $d$-dimensional integral with an integrand related to $f$, and $\mu(\Omega)$ is the Lebesgue measure of $\Omega$. As Barron (1993) pointed out, if $\Omega$ is not a unit domain in $\mathbb{R}^d$, $\mu(\Omega)$ would be exponential in $d$; it was also remarked there (p. 932ff.) that $C_f$ can often be exponentially large in $d$, and standard smoothness properties of $f$ alone are not enough to remove the exponential dependence of $C_f$ on $d$, though in a large number of examples $C_f$ is only moderately large. Therefore, the curse of dimensionality occurs unless $C_f$ and $\mu(\Omega)$ are not exponential in $d$. It was observed that if the error is measured in the sense of the mean squared error in machine learning, which is the square of the $L^2(\Omega,\mu)$ error averaged over $\mu(\Omega)$ and results in $\frac{C_f^2}{W}$, then the mean squared error has no curse of dimensionality as long as $C_f$ is not exponential in $d$ (Barron, 1993; Weinan et al., 2019; Weinan & Wojtowytsch, 2020).

In Montanelli and Du (2019), $d$-dimensional functions in the Korobov space are approximated by a linear combination of basis functions of a sparse grid, each of which is approximated by a ReLU network. Though the curse of dimensionality has been lessened, target functions have to be sufficiently smooth, and the approximation error still contains a factor that is exponential in $d$; that is, the curse still occurs. Other works (Yarotsky, 2017; Yarotsky & Zhevnerchuk, 2019; Lu et al., 2020; Yang & Wang, 2020) study the advantage of smoothness in network approximation. Polynomials are applied to approximate smooth functions, and ReLU networks are constructed to approximate polynomials. The use of smoothness can lessen the curse of dimensionality in the approximation rates in terms of network sizes, but it also results in a prefactor that is exponentially large in the dimension, which means that the curse still occurs implicitly.

The Kolmogorov-Arnold superposition theorem (KST) (Kolmogorov, 1956, 1957; Arnold, 1957) has also inspired a research direction of network approximation (Kůrková, 1992; Maiorov & Pinkus, 1999; Igelnik & Parikh, 2003; Montanelli & Yang, 2020) for continuous functions. Kůrková (1992) provided a quantitative approximation rate for networks with two hidden layers, but the number of neurons scales exponentially in the dimension, and the curse occurs. Maiorov and Pinkus (1999) relaxed the exact representation in KST to an approximation in the form of two-hidden-layer neural networks with a maximum width $6d+3$ and a single activation function. This powerful activation function is very complex, as described by its authors, and its numerical evaluation was not available until a more concrete algorithm was recently proposed in Guliyev and Ismailov (2018). Note that there is no available numerical algorithm in Maiorov and Pinkus (1999) and Guliyev and Ismailov (2018) to compute the whole networks proposed there. The difficulty is due to the fact that the construction of these networks relies on the outer univariate continuous functions of the KST. Though the existence of these outer functions can be shown by construction via a complicated iterative procedure in Braun and Griebel (2009), there is no existing numerical algorithm to evaluate them for a given target function, even if computation with arbitrary precision is assumed to be available. Therefore, the networks considered in Maiorov and Pinkus (1999) and Guliyev and Ismailov (2018) are similar to the original representation in KST in the sense that their existence is proved without an explicit way or numerical algorithm to construct them. Igelnik and Parikh (2003) and Montanelli and Yang (2020) apply cubic splines and piecewise linear functions to approximate the inner and outer functions of KST, resulting in cubic-spline and ReLU networks approximating continuous functions on $[0,1]^d$. Due to the pathological outer functions of KST, the approximation bounds still suffer from the curse of dimensionality unless the target functions are restricted to a small class with simple outer functions in the KST.

Recently, in Yarotsky and Zhevnerchuk (2019), Sine-ReLU networks have been applied to approximate Hölder continuous functions of order $\alpha$ on $[0,1]^d$ with an approximation accuracy $\varepsilon = \exp(-c_{\alpha,d}W^{1/2})$, where $W$ is the number of parameters in the network and $c_{\alpha,d}$ is a positive constant depending on $\alpha$ and $d$ only. Whether $c_{\alpha,d}$ exponentially depends on $d$ determines whether the curse of dimensionality exists for the Sine-ReLU networks, which is not answered in Yarotsky and Zhevnerchuk (2019) and is still an open question.

Finally, we discuss the curse of dimensionality in terms of the continuity of the weight selection as a map $\Sigma: C([0,1]^d)\to\mathbb{R}^W$. For a fixed network architecture with a fixed number of parameters $W$, let $g:\mathbb{R}^W\to C([0,1]^d)$ be the map realizing a DNN from a given set of parameters in $\mathbb{R}^W$ to a function in $C([0,1]^d)$. Suppose that there is a continuous map $\Sigma$ from the unit ball of the Sobolev space with smoothness $s$, denoted as $F_{s,d}$, to $\mathbb{R}^W$ such that $\|f - g(\Sigma(f))\|_{L^\infty}\le\varepsilon$ for all $f\in F_{s,d}$. Then $W\ge c\,\varepsilon^{-d/s}$ with some constant $c$ depending only on $s$. This conclusion is given in theorem 3 of Yarotsky (2017), which is a corollary of theorem 4.2 of Devore (1989) in a more general form. Intuitively, this conclusion means that any constructive approximation of $C([0,1]^d)$ by ReLU FNNs cannot enjoy a continuous weight selection property if the approximation rate is better than $c\,\varepsilon^{-d/s}$ (i.e., if accuracy $\varepsilon$ is achieved with fewer than $c\,\varepsilon^{-d/s}$ parameters); that is, the curse of dimensionality must occur for constructive approximation by ReLU FNNs with a continuous weight selection. Theorem 4.2 of Devore (1989) can also lead to a new corollary with a weight selection map $\Sigma: K_{s,d}\to\mathbb{R}^W$ (e.g., the constructive approximation of Floor-ReLU networks) and $g:\mathbb{R}^W\to L^\infty([0,1]^d)$ (e.g., the realization map of Floor-ReLU networks), where $K_{s,d}$ is the unit ball of $C^s([0,1]^d)$ with the Sobolev norm $\|\cdot\|_{W^{s,\infty}([0,1]^d)}$. This new corollary implies that the constructive approximation in this letter cannot enjoy continuous weight selection. However, theorem 4.2 of Devore (1989) is essentially a min-max criterion to evaluate weight selection maps maintaining continuity: the approximation error obtained by minimizing over all continuous selections $\Sigma$ and network realizations $g$ and maximizing over all target functions is bounded below by $\mathcal{O}(W^{-s/d})$. In the worst scenario, a continuous weight selection cannot enjoy an approximation rate beating the curse of dimensionality. However, theorem 4.2 of Devore (1989) has not excluded the possibility that most continuous functions of interest in practice may still enjoy a continuous weight selection without the curse of dimensionality.

2.4.2  Exponential Convergence

Exponential convergence refers to the situation in which the approximation error exponentially decays to zero when the number of parameters increases. Designing approximation tools with exponential convergence is another important topic in approximation theory. In the literature on deep network approximation, when the number of network parameters $W$ is a polynomial of $\mathcal{O}(\log(\tfrac{1}{\varepsilon}))$, the terminology "exponential convergence" was also used (Weinan & Wang, 2018; Yarotsky & Zhevnerchuk, 2019; Opschoor et al., 2019). The exponential convergence in this letter is root exponential, as in Yarotsky and Zhevnerchuk (2019); that is, $W = \mathcal{O}(\log^2(\tfrac{1}{\varepsilon}))$. The exponential convergence in other works is worse than root exponential.

In most cases, the approximation power used to achieve exponential approximation rates in existing works comes from traditional tools for approximating a small class of functions instead of taking advantage of the network structure itself. In Weinan and Wang (2018) and Opschoor et al. (2019), highly smooth functions are first approximated by a linear combination of special polynomials of high degrees (e.g., Chebyshev polynomials, Legendre polynomials) with an exponential approximation rate; that is, to achieve an $\varepsilon$-accuracy, a linear combination of only $\mathcal{O}\big(p(\log(\tfrac{1}{\varepsilon}))\big)$ polynomials is required, where $p$ is a polynomial whose degree may depend on the dimension $d$. Then each polynomial is approximated by a ReLU network with $\mathcal{O}(\log(\tfrac{1}{\varepsilon}))$ parameters. Finally, all ReLU networks are assembled to form a large network approximating the target function with an exponential approximation rate. As far as we know, the only existing work that achieves exponential convergence without taking advantage of special polynomials and smoothness is the Sine-ReLU network in Yarotsky and Zhevnerchuk (2019). We emphasize that the result in our letter applies to generic continuous functions, including the Hölder continuous functions considered in Yarotsky and Zhevnerchuk (2019).

3  Approximation of Continuous Functions

In this section, we introduce basic notations in section 3.1. Then we prove theorem 1 based on proposition 1, which we prove in section 4.

3.1  Notations

The main notations of this letter follow:

  • Vectors and matrices are denoted in bold. Standard vectorization is adopted in the matrix and vector computation. For example, adding a scalar and a vector means adding the scalar to each entry of the vector.

  • Let $\mathbb{N}^+$ denote the set containing all positive integers: $\mathbb{N}^+=\{1,2,3,\ldots\}$.

  • Let $\sigma:\mathbb{R}\to\mathbb{R}$ denote the rectified linear unit (ReLU): $\sigma(x)=\max\{0,x\}$. With a slight abuse of notation, we define $\sigma:\mathbb{R}^d\to\mathbb{R}^d$ as $\sigma(\boldsymbol{x})=\big(\max\{0,x_1\},\ldots,\max\{0,x_d\}\big)$ for any $\boldsymbol{x}=(x_1,\ldots,x_d)\in\mathbb{R}^d$.

  • The floor function (Floor) is defined as $\lfloor x\rfloor := \max\{n : n\le x,\ n\in\mathbb{Z}\}$ for any $x\in\mathbb{R}$.

  • For $\theta\in[0,1)$, suppose its binary representation is $\theta=\sum_{\ell=1}^{\infty}\theta_\ell 2^{-\ell}$ with $\theta_\ell\in\{0,1\}$. We introduce the special notation $\mathrm{bin}\,0.\theta_1\theta_2\cdots\theta_L$ to denote the $L$-term binary representation of $\theta$: $\mathrm{bin}\,0.\theta_1\theta_2\cdots\theta_L := \sum_{\ell=1}^{L}\theta_\ell 2^{-\ell}$.

  • The expression “a network with width N and depth L” means the maximum width of this network for all hidden layers is no more than N, and the number of hidden layers of this network is no more than L.

3.2  Proof of Theorem 1

Theorem 1 is an immediate consequence of Theorem 2:

Theorem 2.
Given any $N,L\in\mathbb{N}^+$ and an arbitrary continuous function $f$ on $[0,1]^d$, there exists a function $\phi$ implemented by a Floor-ReLU network with width $\max\{d,\,2N^2+5N\}$ and depth $7dL^2+3$ such that
$$|\phi(\boldsymbol{x})-f(\boldsymbol{x})| \le \omega_f(\sqrt{d}\,N^{-L}) + 2\omega_f(\sqrt{d})\,2^{-NL}, \quad \text{for any } \boldsymbol{x}\in[0,1]^d.$$

This theorem will be proved later in this section. Now we prove theorem 1 based on theorem 2.

Proof of Theorem 1.
Given any $N,L\in\mathbb{N}^+$, there exist $\tilde{N},\tilde{L}\in\mathbb{N}^+$ with $\tilde{N}\ge 2$ and $\tilde{L}\ge 3$ such that
$$(\tilde{N}-1)^2 \le N < \tilde{N}^2 \quad\text{and}\quad (\tilde{L}-1)^2 \le 4L < \tilde{L}^2.$$
By theorem 2, there exists a function $\phi$ implemented by a Floor-ReLU network with width $\max\{d,\,2\tilde{N}^2+5\tilde{N}\}$ and depth $7d\tilde{L}^2+3$ such that
$$|\phi(\boldsymbol{x})-f(\boldsymbol{x})| \le \omega_f(\sqrt{d}\,\tilde{N}^{-\tilde{L}}) + 2\omega_f(\sqrt{d})\,2^{-\tilde{N}\tilde{L}}, \quad\text{for any } \boldsymbol{x}\in[0,1]^d.$$
Note that
$$2^{-\tilde{N}\tilde{L}} \le \tilde{N}^{-\tilde{L}} = (\tilde{N}^2)^{-\frac{1}{2}\tilde{L}} \le N^{-\frac{1}{2}\sqrt{4L}} = N^{-\sqrt{L}}.$$
Then we have
$$|\phi(\boldsymbol{x})-f(\boldsymbol{x})| \le \omega_f(\sqrt{d}\,N^{-\sqrt{L}}) + 2\omega_f(\sqrt{d})\,N^{-\sqrt{L}}, \quad\text{for any } \boldsymbol{x}\in[0,1]^d.$$
For $\tilde{N},\tilde{L}\in\mathbb{N}^+$ with $\tilde{N}\ge 2$ and $\tilde{L}\ge 3$, we have
$$2\tilde{N}^2+5\tilde{N} \le 5(\tilde{N}-1)^2+13 \le 5N+13 \quad\text{and}\quad 7\tilde{L}^2 \le 16(\tilde{L}-1)^2 \le 64L.$$
Therefore, $\phi$ can be computed by a Floor-ReLU network with width $\max\{d,\,2\tilde{N}^2+5\tilde{N}\}\le\max\{d,\,5N+13\}$ and depth $7d\tilde{L}^2+3\le 64dL+3$, as desired.
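The bookkeeping in this proof is easy to check numerically. The sketch below uses one valid choice of $(\tilde{N},\tilde{L})$, namely $\tilde{N}=\lfloor\sqrt{N}\rfloor+1$ and $\tilde{L}=\lfloor 2\sqrt{L}\rfloor+1$; this particular formula is our own illustration, and any pair satisfying the two bracketing inequalities works.

```python
import math

def reduce_to_theorem2(N, L):
    # One valid choice with (Nt-1)^2 <= N < Nt^2 and (Lt-1)^2 <= 4L < Lt^2.
    Nt, Lt = math.isqrt(N) + 1, math.isqrt(4 * L) + 1
    assert Nt >= 2 and Lt >= 3
    assert (Nt - 1) ** 2 <= N < Nt ** 2 and (Lt - 1) ** 2 <= 4 * L < Lt ** 2
    return Nt, Lt

for N, L in [(1, 1), (3, 2), (10, 7), (25, 16)]:
    Nt, Lt = reduce_to_theorem2(N, L)
    # Error bookkeeping: 2^(-Nt*Lt) <= Nt^(-Lt) <= N^(-sqrt(L)).
    assert 2.0 ** (-Nt * Lt) <= Nt ** (-float(Lt)) <= N ** (-math.sqrt(L))
    # Size bookkeeping: the theorem 2 network fits theorem 1's width/depth budget.
    assert 2 * Nt ** 2 + 5 * Nt <= 5 * N + 13 and 7 * Lt ** 2 <= 64 * L
    print(N, L, "->", Nt, Lt)
```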

To prove theorem 2, we first present the proof sketch. We construct piecewise constant functions implemented by Floor-ReLU networks to approximate continuous functions. There are four key steps in our construction:

  1. Normalize $f$ as $\tilde{f}$ satisfying $\tilde{f}(\boldsymbol{x})\in[0,1]$ for any $\boldsymbol{x}\in[0,1]^d$, divide $[0,1]^d$ into a set of nonoverlapping cubes $\{Q_\beta\}_{\beta\in\{0,1,\ldots,K-1\}^d}$, and denote by $\boldsymbol{x}_\beta$ the vertex of $Q_\beta$ with minimum $\|\cdot\|_1$ norm, where $K$ is an integer determined later. See Figure 2 for illustrations of $Q_\beta$ and $\boldsymbol{x}_\beta$.
    Figure 2: Illustrations of $Q_\beta$ and $\boldsymbol{x}_\beta$ for $\beta\in\{0,1,\ldots,K-1\}^d$. (a) $K=4$, $d=1$. (b) $K=4$, $d=2$.

  2. Construct a Floor-ReLU subnetwork to implement a vector-valued function $\Phi_1:\mathbb{R}^d\to\mathbb{R}^d$ projecting the whole cube $Q_\beta$ to the index $\beta$ for each $\beta\in\{0,1,\ldots,K-1\}^d$, that is, $\Phi_1(\boldsymbol{x})=\beta$ for all $\boldsymbol{x}\in Q_\beta$.

  3. Construct a Floor-ReLU subnetwork to implement a function $\phi_2:\mathbb{R}^d\to\mathbb{R}$ mapping $\beta\in\{0,1,\ldots,K-1\}^d$ approximately to $\tilde{f}(\boldsymbol{x}_\beta)$ for each $\beta$, that is, $\phi_2(\beta)\approx\tilde{f}(\boldsymbol{x}_\beta)$. Then $\phi_2\circ\Phi_1(\boldsymbol{x}) = \phi_2(\beta)\approx\tilde{f}(\boldsymbol{x}_\beta)$ for any $\boldsymbol{x}\in Q_\beta$ and each $\beta\in\{0,1,\ldots,K-1\}^d$, implying that $\tilde{\phi} := \phi_2\circ\Phi_1$ approximates $\tilde{f}$ within an error $\mathcal{O}(\omega_f(1/K))$ on $[0,1]^d$.

  4. Rescale and shift $\tilde{\phi}$ to obtain the desired function $\phi$ approximating $f$ well, and determine the final Floor-ReLU network to implement $\phi$.

It is not difficult to construct Floor-ReLU networks with the desired width and depth to implement $\Phi_1$. The most technical part is the construction of a Floor-ReLU network with the desired width and depth computing $\phi_2$, which needs the following proposition based on the bit extraction technique introduced in Bartlett, Maiorov, and Meir (1998) and Harvey, Liaw, and Mehrabian (2017).

Proposition 1.
Given any $N,L\in\mathbb{N}^+$ and arbitrary $\theta_m\in\{0,1\}$ for $m=1,2,\ldots,N^L$, there exists a function $\phi$ computed by a Floor-ReLU network with width $2N+2$ and depth $7L-2$ such that
$$\phi(m) = \theta_m, \quad \text{for } m=1,2,\ldots,N^L.$$

The proof of this proposition is presented in section 4. By this proposition and the definition of VC-dimension (see Harvey et al., 2017), it is easy to show that the VC-dimension of Floor-ReLU networks with a constant width and depth $\mathcal{O}(L)$ has a lower bound $2^L$. Such a lower bound is much larger than $\mathcal{O}(L^2)$, which is a VC-dimension upper bound for ReLU networks with the same width and depth due to theorem 8 of Harvey et al. (2017). This means that Floor-ReLU networks are much more powerful than ReLU networks from the perspective of VC-dimension.

Based on the proof sketch stated just above, we are ready to give the detailed proof of theorem 2 following similar ideas as in our previous work (Shen et al., 2019; Shen et al., 2020; Lu et al., 2020). The main idea of our proof is to reduce high-dimensional approximation to one-dimensional approximation via a projection. The idea of projection was probably first used in well-established theories, such as KST (Kolmogorov superposition theorem) mentioned in section 2, where the approximant to high-dimensional functions is constructed by projecting high-dimensional data points to one-dimensional data points and then constructing one-dimensional approximants. There has been extensive research based on this idea, for example, references related to KST summarized in section 2, our previous work (Shen, Yang, & Zhang, 2019; Shen, Yang, & Zhang, 2020; Lu et al., 2020), and (Yarotsky & Zhevnerchuk, 2019). The key to a successful approximant is to construct one-dimensional approximants to deal with a large number of one-dimensional data points; in fact, the number of points is exponential in the dimension d.

Proof of Theorem 2.

The proof consists of four steps.

Step 1: Setup. We may assume that $f$ is not a constant function, since that case is trivial. Then $\omega_f(r)>0$ for any $r>0$. Clearly, $|f(\boldsymbol{x})-f(\boldsymbol{0})|\le\omega_f(\sqrt{d})$ for any $\boldsymbol{x}\in[0,1]^d$. Define
$$\tilde{f} := \frac{f - f(\boldsymbol{0}) + \omega_f(\sqrt{d})}{2\,\omega_f(\sqrt{d})}.$$
(3.1)
It follows that $\tilde{f}(\boldsymbol{x})\in[0,1]$ for any $\boldsymbol{x}\in[0,1]^d$.
Set $K=N^L$, $E_{K-1}=[\tfrac{K-1}{K},1]$, and $E_k=[\tfrac{k}{K},\tfrac{k+1}{K})$ for $k=0,1,\ldots,K-2$. Define $\boldsymbol{x}_\beta := \beta/K$ and
$$Q_\beta := \Big\{\boldsymbol{x}=(x_1,x_2,\ldots,x_d)\in\mathbb{R}^d : x_j\in E_{\beta_j}\ \text{for}\ j=1,2,\ldots,d\Big\},$$
for any $\beta=(\beta_1,\beta_2,\ldots,\beta_d)\in\{0,1,\ldots,K-1\}^d$. See Figure 2 for examples of $Q_\beta$ and $\boldsymbol{x}_\beta$ for $\beta\in\{0,1,\ldots,K-1\}^d$ with $K=4$ and $d=1,2$.
Step 2: Construct $\Phi_1$ mapping $\boldsymbol{x}\in Q_\beta$ to $\beta$. Define a step function $\phi_1$ as
$$\phi_1(x) := \big\lfloor -\sigma(-Kx+K-1)\big\rfloor + K-1, \quad \text{for any } x\in\mathbb{R}.^{5}$$
See Figure 3 for an example of $\phi_1$ when $K=4$. It follows from the definition of $\phi_1$ that
$$\phi_1(x) = k, \quad \text{if } x\in E_k, \quad \text{for } k=0,1,\ldots,K-1.$$
Figure 3: An illustration of $\phi_1$ on $[0,1]$ for the case $K=4$.

Define
$$\Phi_1(\boldsymbol{x}) := \big(\phi_1(x_1), \phi_1(x_2), \ldots, \phi_1(x_d)\big), \quad\text{for any } \boldsymbol{x}=(x_1,x_2,\ldots,x_d)\in\mathbb{R}^d.$$
Clearly, we have, for $\boldsymbol{x}\in Q_\beta$ and $\beta\in\{0,1,\ldots,K-1\}^d$,
$$\Phi_1(\boldsymbol{x}) = \big(\phi_1(x_1),\phi_1(x_2),\ldots,\phi_1(x_d)\big) = (\beta_1,\beta_2,\ldots,\beta_d) = \beta.$$
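The projection step can be verified directly in plain Python, using the Floor-and-ReLU formula for $\phi_1$ above with the illustrative value $K=4$:

```python
import math

K = 4
def phi1(x):
    # phi_1(x) = floor(-sigma(-K*x + K - 1)) + K - 1, with sigma = ReLU
    return math.floor(-max(0.0, -K * x + K - 1)) + K - 1

# phi_1 maps E_k = [k/K, (k+1)/K) (and E_{K-1} = [(K-1)/K, 1]) to k:
for k in range(K):
    for x in (k / K, k / K + 0.3 / K, (k + 1) / K - 1e-9):
        assert phi1(x) == k
assert phi1(1.0) == K - 1                 # the endpoint x = 1 lands in E_{K-1}

def Phi1(x):
    # Phi_1(x) = (phi_1(x_1), ..., phi_1(x_d)), applied coordinate-wise
    return tuple(phi1(xj) for xj in x)

print(Phi1((0.30, 0.99)))                 # the cube containing (0.30, 0.99) has index (1, 3)
```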
Step 3: Construct $\phi_2$ mapping $\beta\in\{0,1,\ldots,K-1\}^d$ approximately to $\tilde{f}(\boldsymbol{x}_\beta)$. Using the idea of $K$-ary representation, we define a linear function $\psi_1$ via
$$\psi_1(\boldsymbol{x}) := 1 + \sum_{j=1}^{d} x_j K^{j-1}, \quad\text{for any } \boldsymbol{x}=(x_1,x_2,\ldots,x_d)\in\mathbb{R}^d.$$
Then $\psi_1$ is a bijection from $\{0,1,\ldots,K-1\}^d$ to $\{1,2,\ldots,K^d\}$.
Given any $i\in\{1,2,\ldots,K^d\}$, there exists a unique $\beta\in\{0,1,\ldots,K-1\}^d$ such that $i=\psi_1(\beta)$. Then define
$$\xi_i := \tilde{f}(\boldsymbol{x}_\beta)\in[0,1], \quad\text{for } i=\psi_1(\beta) \text{ and } \beta\in\{0,1,\ldots,K-1\}^d,$$
where $\tilde{f}$ is the normalization of $f$ defined in equation 3.1. It follows that there exist $\xi_{i,j}\in\{0,1\}$ for $j=1,2,\ldots,NL$ such that
$$|\xi_i - \mathrm{bin}\,0.\xi_{i,1}\xi_{i,2}\cdots\xi_{i,NL}| \le 2^{-NL}, \quad\text{for } i=1,2,\ldots,K^d.$$
By $K^d = (N^L)^d = N^{dL}$ and proposition 1, there exists a function $\psi_{2,j}$ implemented by a Floor-ReLU network with width $2N+2$ and depth $7dL-2$, for each $j=1,2,\ldots,NL$, such that
$$\psi_{2,j}(i) = \xi_{i,j}, \quad\text{for } i=1,2,\ldots,K^d.$$
Define
$$\psi_2 := \sum_{j=1}^{NL} 2^{-j}\psi_{2,j} \quad\text{and}\quad \phi_2 := \psi_2\circ\psi_1.$$
Then, for $i=\psi_1(\beta)$ and $\beta\in\{0,1,\ldots,K-1\}^d$, we have
$$|\tilde{f}(\boldsymbol{x}_\beta)-\phi_2(\beta)| = |\tilde{f}(\boldsymbol{x}_\beta)-\psi_2(\psi_1(\beta))| = |\xi_i - \psi_2(i)| = \Big|\xi_i - \sum_{j=1}^{NL}2^{-j}\psi_{2,j}(i)\Big| = |\xi_i - \mathrm{bin}\,0.\xi_{i,1}\xi_{i,2}\cdots\xi_{i,NL}| \le 2^{-NL}.$$
(3.2)
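The two ingredients of step 3 that do not require a network, the $K$-ary index $\psi_1$ and the $NL$-bit truncation of $\xi_i$, can be checked directly; the bit-extracting subnetworks $\psi_{2,j}$ guaranteed by proposition 1 are emulated here simply by reading off binary digits, and all numerical values are illustrative.

```python
import itertools

K, d, NL = 4, 2, 6          # K = N^L cubes per axis; NL binary digits kept per xi_i

def psi1(beta):
    # psi_1(beta) = 1 + sum_j beta_j * K^(j-1): a bijection onto {1, ..., K^d}
    return 1 + sum(b * K ** j for j, b in enumerate(beta))

assert {psi1(b) for b in itertools.product(range(K), repeat=d)} == set(range(1, K ** d + 1))

def truncate_bits(xi, nbits):
    # bin 0.xi_{i,1} ... xi_{i,nbits}: keep the first nbits binary digits of xi
    return int(xi * 2 ** nbits) / 2 ** nbits

xi = 0.73125                                              # some xi_i = f_tilde(x_beta) in [0, 1]
assert abs(xi - truncate_bits(xi, NL)) <= 2 ** (-NL)      # error at most 2^(-NL)
print(truncate_bits(xi, NL))
```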
Step 4: Determine the final network to implement the desired function $\phi$. Define $\tilde{\phi} := \phi_2\circ\Phi_1$; that is, for any $\boldsymbol{x}=(x_1,x_2,\ldots,x_d)\in\mathbb{R}^d$,
$$\tilde{\phi}(\boldsymbol{x}) = \phi_2\circ\Phi_1(\boldsymbol{x}) = \phi_2\big(\phi_1(x_1),\phi_1(x_2),\ldots,\phi_1(x_d)\big).$$
Note that $\|\boldsymbol{x}-\boldsymbol{x}_\beta\|_2 \le \tfrac{\sqrt{d}}{K}$ for any $\boldsymbol{x}\in Q_\beta$ and $\beta\in\{0,1,\ldots,K-1\}^d$. Then we have, for any $\boldsymbol{x}\in Q_\beta$ and $\beta\in\{0,1,\ldots,K-1\}^d$,
$$|\tilde{f}(\boldsymbol{x})-\tilde{\phi}(\boldsymbol{x})| \le |\tilde{f}(\boldsymbol{x})-\tilde{f}(\boldsymbol{x}_\beta)| + |\tilde{f}(\boldsymbol{x}_\beta)-\tilde{\phi}(\boldsymbol{x})| \le \omega_{\tilde{f}}\big(\tfrac{\sqrt{d}}{K}\big) + |\tilde{f}(\boldsymbol{x}_\beta)-\phi_2(\Phi_1(\boldsymbol{x}))| = \omega_{\tilde{f}}\big(\tfrac{\sqrt{d}}{K}\big) + |\tilde{f}(\boldsymbol{x}_\beta)-\phi_2(\beta)| \le \omega_{\tilde{f}}\big(\tfrac{\sqrt{d}}{K}\big) + 2^{-NL},$$
where the last inequality comes from equation 3.2.
Note that $\boldsymbol{x}\in Q_\beta$ and $\beta\in\{0,1,\ldots,K-1\}^d$ are arbitrary. Since $[0,1]^d = \cup_{\beta\in\{0,1,\ldots,K-1\}^d} Q_\beta$, we have
$$|\tilde{f}(\boldsymbol{x})-\tilde{\phi}(\boldsymbol{x})| \le \omega_{\tilde{f}}\big(\tfrac{\sqrt{d}}{K}\big) + 2^{-NL}, \quad\text{for any } \boldsymbol{x}\in[0,1]^d.$$
Define
$$\phi := 2\,\omega_f(\sqrt{d})\,\tilde{\phi} + f(\boldsymbol{0}) - \omega_f(\sqrt{d}).$$
By $K=N^L$ and $\omega_f(r) = 2\,\omega_f(\sqrt{d})\,\omega_{\tilde{f}}(r)$ for any $r\ge 0$, we have, for any $\boldsymbol{x}\in[0,1]^d$,
$$|f(\boldsymbol{x})-\phi(\boldsymbol{x})| = 2\,\omega_f(\sqrt{d})\,|\tilde{f}(\boldsymbol{x})-\tilde{\phi}(\boldsymbol{x})| \le 2\,\omega_f(\sqrt{d})\Big(\omega_{\tilde{f}}\big(\tfrac{\sqrt{d}}{K}\big) + 2^{-NL}\Big) = \omega_f\big(\tfrac{\sqrt{d}}{K}\big) + 2\,\omega_f(\sqrt{d})\,2^{-NL} = \omega_f(\sqrt{d}\,N^{-L}) + 2\,\omega_f(\sqrt{d})\,2^{-NL}.$$
It remains to determine the width and depth of the Floor-ReLU network implementing $\phi$. Clearly, $\phi_2$ can be implemented by the architecture in Figure 4.
Figure 4: An illustration of the desired network architecture implementing $\phi_2 = \psi_2\circ\psi_1$ for any input $\beta\in\{0,1,\ldots,K-1\}^d$, where $i=\psi_1(\beta)$.

As we can see from Figure 4, $\phi_2$ can be implemented by a Floor-ReLU network with width $N(2N+2+3) = 2N^2+5N$ and depth $L(7dL-2+1)+2 = L(7dL-1)+2$. With the network architecture implementing $\phi_2$ in hand, $\tilde{\phi}$ can be implemented by the network architecture shown in Figure 5. Note that $\phi$ is defined via rescaling and shifting $\tilde{\phi}$. As shown in Figure 5, $\phi$ and $\tilde{\phi}$ can be implemented by a Floor-ReLU network with width $\max\{d,\,2N^2+5N\}$ and depth $1+1+L(7dL-1)+2 \le 7dL^2+3$.
Figure 5: An illustration of the network architecture implementing $\tilde{\phi} = \phi_2\circ\Phi_1$.

4  Proof of Proposition 1

The proof of proposition 1 mainly relies on the bit extraction technique. As we shall see, our key idea is to apply the Floor activation function to make bit extraction more powerful to reduce network sizes. In particular, Floor-ReLU networks can extract more bits than ReLU networks with the same network size.

We first establish a basic lemma to extract 1/N of the total bits of a binary number; the result is again stored in a binary number.

Lemma 1.
Given any $J,N\in\mathbb{N}^+$, there exists a function $\phi:\mathbb{R}^2\to\mathbb{R}$ that can be implemented by a Floor-ReLU network with width $2N$ and depth $4$ such that, for any $\theta_j\in\{0,1\}$, $j=1,\ldots,NJ$, we have
$$\phi\big(\mathrm{bin}\,0.\theta_1\cdots\theta_{NJ},\ n\big) = \mathrm{bin}\,0.\theta_{(n-1)J+1}\cdots\theta_{nJ}, \quad\text{for } n=1,2,\ldots,N.$$
Proof.
Given any $\theta_j\in\{0,1\}$ for $j=1,\ldots,NJ$, denote
$$s = \mathrm{bin}\,0.\theta_1\cdots\theta_{NJ} \quad\text{and}\quad s_n = \mathrm{bin}\,0.\theta_{(n-1)J+1}\cdots\theta_{nJ}, \quad\text{for } n=1,2,\ldots,N.$$
Then our goal is to construct a function $\phi:\mathbb{R}^2\to\mathbb{R}$, computed by a Floor-ReLU network with the desired width and depth, that satisfies
$$\phi(s,n) = s_n, \quad\text{for } n=1,2,\ldots,N.$$
Based on the properties of the binary representation, it is easy to check that
$$s_n = \big\lfloor 2^{nJ}s\big\rfloor/2^{J} - \big\lfloor 2^{(n-1)J}s\big\rfloor, \quad\text{for } n=1,2,\ldots,N.$$
(4.1)
Even with the above formula generating $s_1,s_2,\ldots,s_N$, it is still technical to construct a network outputting $s_n$ for a given index $n\in\{1,2,\ldots,N\}$.
Set $\delta=2^{-J}$ and define $g$ (see Figure 6) as
$$g(x) := \sigma\Big(\sigma(x) - \sigma\big(\tfrac{x+\delta-1}{\delta}\big)\Big), \quad\text{where } \sigma(x)=\max\{0,x\}.$$
Figure 6: An illustration of $g(x)=\sigma\big(\sigma(x)-\sigma(\tfrac{x+\delta-1}{\delta})\big)$, where $\sigma(x)=\max\{0,x\}$ is the ReLU activation function.

Since $s_n\in[0,1-\delta]$ for $n=1,2,\ldots,N$, we have
$$s_n = \sum_{k=1}^{N} g(s_k + k - n), \quad\text{for } n=1,2,\ldots,N.$$
(4.2)
As shown in Figure 7, the desired function $\phi$ can be computed by a Floor-ReLU network with width $2N$ and depth $4$. Moreover, it holds that
$$\phi(s,n) = s_n, \quad\text{for } n=1,2,\ldots,N.$$
Figure 7: An illustration of the desired network architecture implementing $\phi$ based on equations 4.1 and 4.2. We omit some ReLU ($\sigma$) activation functions when the inputs are obviously nonnegative. All parameters in this network are essentially determined by equations 4.1 and 4.2, which are valid no matter what $\theta_1,\ldots,\theta_{NJ}\in\{0,1\}$ are. Thus, the desired function $\phi$ implemented by this network is independent of $\theta_1,\ldots,\theta_{NJ}$.
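Equations 4.1 and 4.2 are the heart of lemma 1 and are easy to verify in plain Python; this only checks the formulas, not the network that implements them, and the bit pattern below is random.

```python
import math, random

N, J = 3, 4
delta = 2 ** (-J)
bits = [random.randint(0, 1) for _ in range(N * J)]             # theta_1, ..., theta_{NJ}
s = sum(b * 2 ** (-(j + 1)) for j, b in enumerate(bits))        # s = bin 0.theta_1 ... theta_{NJ}
s_true = [sum(b * 2 ** (-(k + 1)) for k, b in enumerate(bits[(n - 1) * J:n * J]))
          for n in range(1, N + 1)]                             # s_n = bin 0.theta_{(n-1)J+1} ... theta_{nJ}

# Equation 4.1: s_n = floor(2^{nJ} s) / 2^J - floor(2^{(n-1)J} s).
eq41 = [math.floor(2 ** (n * J) * s) / 2 ** J - math.floor(2 ** ((n - 1) * J) * s)
        for n in range(1, N + 1)]
assert eq41 == s_true

# Equation 4.2: s_n = sum_k g(s_k + k - n), with g built from three ReLUs.
g = lambda x: max(0.0, max(0.0, x) - max(0.0, (x + delta - 1) / delta))
eq42 = [sum(g(s_true[k - 1] + k - n) for k in range(1, N + 1)) for n in range(1, N + 1)]
assert eq42 == s_true
print(s_true)
```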

The next lemma constructs a Floor-ReLU network that can extract any bit from a binary representation according to a specific index.

Lemma 2.
Given any $N,L\in\mathbb{N}^+$, there exists a function $\phi:\mathbb{R}^2\to\mathbb{R}$ implemented by a Floor-ReLU network with width $2N+2$ and depth $7L-3$ such that, for any $\theta_m\in\{0,1\}$, $m=1,2,\ldots,N^L$, we have
$$\phi\big(\mathrm{bin}\,0.\theta_1\theta_2\cdots\theta_{N^L},\ m\big) = \theta_m, \quad\text{for } m=1,2,\ldots,N^L.$$
Proof.

The proof is based on repeated applications of lemma 1. Specifically, we inductively construct a sequence of functions $\phi_1,\phi_2,\ldots,\phi_L$ implemented by Floor-ReLU networks to satisfy the following two conditions for each $\ell\in\{1,2,\ldots,L\}$.

  1. $\phi_\ell:\mathbb{R}^2\to\mathbb{R}$ can be implemented by a Floor-ReLU network with width $2N+2$ and depth $7\ell-3$.

  2. For any $\theta_m\in\{0,1\}$, $m=1,2,\ldots,N^{\ell}$, we have
    $$\phi_\ell\big(\mathrm{bin}\,0.\theta_1\theta_2\cdots\theta_{N^{\ell}},\ m\big) = \mathrm{bin}\,0.\theta_m, \quad\text{for } m=1,2,\ldots,N^{\ell}.$$

First, consider the case $\ell=1$. By lemma 1 (set $J=1$ therein), there exists a function $\phi_1$ implemented by a Floor-ReLU network with width $2N\le 2N+2$ and depth $4=7\cdot 1-3$ such that, for any $\theta_m\in\{0,1\}$, $m=1,2,\ldots,N$, we have
$$\phi_1\big(\mathrm{bin}\,0.\theta_1\theta_2\cdots\theta_{N},\ m\big) = \mathrm{bin}\,0.\theta_m, \quad\text{for } m=1,2,\ldots,N.$$
It follows that conditions 1 and 2 hold for $\ell=1$.
Next, assume conditions 1 and 2 hold for $\ell=k$. We would like to construct $\phi_{k+1}$ to make conditions 1 and 2 true for $\ell=k+1$. By lemma 1 (set $J=N^k$ therein), there exists a function $\psi$ implemented by a Floor-ReLU network with width $2N$ and depth $4$ such that, for any $\theta_m\in\{0,1\}$, $m=1,2,\ldots,N^{k+1}$, we have
$$\psi\big(\mathrm{bin}\,0.\theta_1\cdots\theta_{N^{k+1}},\ n\big) = \mathrm{bin}\,0.\theta_{(n-1)N^k+1}\cdots\theta_{(n-1)N^k+N^k}, \quad\text{for } n=1,2,\ldots,N.$$
(4.3)
By the hypothesis of induction, we have
  • $\phi_k:\mathbb{R}^2\to\mathbb{R}$ can be implemented by a Floor-ReLU network with width $2N+2$ and depth $7k-3$.

  • For any $\theta_j\in\{0,1\}$, $j=1,2,\ldots,N^k$, we have
    $$\phi_k\big(\mathrm{bin}\,0.\theta_1\theta_2\cdots\theta_{N^k},\ j\big) = \mathrm{bin}\,0.\theta_j, \quad\text{for } j=1,2,\ldots,N^k.$$
    (4.4)
Given any $m\in\{1,2,\ldots,N^{k+1}\}$, there exist $n\in\{1,2,\ldots,N\}$ and $j\in\{1,2,\ldots,N^k\}$ such that $m=(n-1)N^k+j$, and such $n,j$ can be obtained by
$$n = \big\lfloor (m-1)/N^k\big\rfloor + 1 \quad\text{and}\quad j = m-(n-1)N^k.$$
(4.5)
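A quick sanity check of the index decomposition in equation 4.5, with illustrative values of $N$ and $k$:

```python
N, k = 3, 2                                   # so N^k = 9 and m ranges over 1, ..., N^(k+1)
for m in range(1, N ** (k + 1) + 1):
    n = (m - 1) // N ** k + 1                 # n = floor((m-1)/N^k) + 1
    j = m - (n - 1) * N ** k                  # j = m - (n-1) N^k
    assert 1 <= n <= N and 1 <= j <= N ** k and m == (n - 1) * N ** k + j
print("every m in {1, ..., N^(k+1)} decomposes uniquely as (n-1) N^k + j")
```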
Then the desired architecture of the Floor-ReLU network implementing $\phi_{k+1}$ is shown in Figure 8.
Figure 8: An illustration of the desired network architecture implementing $\phi_{k+1}$ based on equations 4.3 to 4.5. We omit ReLU ($\sigma$) for neurons with nonnegative inputs.

Note that $\psi$ can be computed by a Floor-ReLU network of width $2N$ and depth $4$. By Figure 8, we have

  • $\phi_{k+1}:\mathbb{R}^2\to\mathbb{R}$ can be implemented by a Floor-ReLU network with width $2N+2$ and depth $2+4+1+(7k-3)=7(k+1)-3$, which implies condition 1 for $\ell=k+1$.

  • For any $\theta_m\in\{0,1\}$, $m=1,2,\ldots,N^{k+1}$, we have
    $$\phi_{k+1}\big(\mathrm{bin}\,0.\theta_1\theta_2\cdots\theta_{N^{k+1}},\ m\big) = \mathrm{bin}\,0.\theta_m, \quad\text{for } m=1,2,\ldots,N^{k+1}.$$
    That is, condition 2 holds for $\ell=k+1$.

So we finish the process of induction.

By the principle of induction, there exists a function $\phi_L:\mathbb{R}^2\to\mathbb{R}$ such that

  • $\phi_L$ can be implemented by a Floor-ReLU network with width $2N+2$ and depth $7L-3$.

  • For any $\theta_m\in\{0,1\}$, $m=1,2,\ldots,N^L$, we have
    $$\phi_L\big(\mathrm{bin}\,0.\theta_1\theta_2\cdots\theta_{N^L},\ m\big) = \mathrm{bin}\,0.\theta_m, \quad\text{for } m=1,2,\ldots,N^L.$$

Finally, define $\phi := 2\phi_L$. Then $\phi$ can also be implemented by a Floor-ReLU network with width $2N+2$ and depth $7L-3$. Moreover, for any $\theta_m\in\{0,1\}$, $m=1,2,\ldots,N^L$, we have
$$\phi\big(\mathrm{bin}\,0.\theta_1\theta_2\cdots\theta_{N^L},\ m\big) = 2\cdot\phi_L\big(\mathrm{bin}\,0.\theta_1\theta_2\cdots\theta_{N^L},\ m\big) = 2\cdot\mathrm{bin}\,0.\theta_m = \theta_m,$$
for $m=1,2,\ldots,N^L$.
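The induction above can be emulated functionally in plain Python: at each level, the lemma 1 splitter (equation 4.1) keeps the block of $N^{k}$ bits containing position $m$, until a single bit remains, and the final doubling mirrors $\phi := 2\phi_L$. This is only an emulation of what the constructed network computes, not the network itself.

```python
import math

def extract_bit(s, m, N, L):
    """Emulate phi from lemma 2: return theta_m from s = bin 0.theta_1 ... theta_{N^L}."""
    for level in range(L, 0, -1):
        J = N ** (level - 1)                   # block length handled at this level
        n = (m - 1) // J + 1                   # which of the N blocks contains bit m (eq. 4.5)
        # lemma 1 splitter (equation 4.1): keep block n as a new binary fraction
        s = math.floor(2 ** (n * J) * s) / 2 ** J - math.floor(2 ** ((n - 1) * J) * s)
        m = m - (n - 1) * J                    # position of the bit inside the kept block
    return int(round(2 * s))                   # bin 0.theta_m -> theta_m

N, L = 2, 3
bits = [1, 0, 1, 1, 0, 0, 1, 0]                # theta_1, ..., theta_{N^L} with N^L = 8
s = sum(b * 2 ** (-(i + 1)) for i, b in enumerate(bits))
assert [extract_bit(s, m, N, L) for m in range(1, N ** L + 1)] == bits
print("bit extraction matches")
```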

With lemma 2 in hand, we are ready to prove proposition 1.

Proof of Proposition 1.
By lemma 2, there exists a function $\tilde{\phi}:\mathbb{R}^2\to\mathbb{R}$, computed by a Floor-ReLU network with a fixed architecture of width $2N+2$ and depth $7L-3$, such that, for any $z_m\in\{0,1\}$, $m=1,2,\ldots,N^L$, we have
$$\tilde{\phi}\big(\mathrm{bin}\,0.z_1z_2\cdots z_{N^L},\ m\big) = z_m, \quad\text{for } m=1,2,\ldots,N^L.$$
Based on the $\theta_m\in\{0,1\}$ for $m=1,2,\ldots,N^L$ given in proposition 1, we define the final function $\phi$ as
$$\phi(x) := \tilde{\phi}\big(\sigma(x\cdot 0 + \mathrm{bin}\,0.\theta_1\theta_2\cdots\theta_{N^L}),\ \sigma(x)\big), \quad\text{where } \sigma(x)=\max\{0,x\}.$$
Clearly, $\phi$ can be implemented by a Floor-ReLU network with width $2N+2$ and depth $(7L-3)+1 = 7L-2$. Moreover, we have, for any $m\in\{1,2,\ldots,N^L\}$,
$$\phi(m) = \tilde{\phi}\big(\sigma(m\cdot 0 + \mathrm{bin}\,0.\theta_1\theta_2\cdots\theta_{N^L}),\ \sigma(m)\big) = \tilde{\phi}\big(\mathrm{bin}\,0.\theta_1\theta_2\cdots\theta_{N^L},\ m\big) = \theta_m.$$

We point out that only the properties of Floor on $[0,\infty)$ are used in our proof. Thus, Floor can be replaced by the truncation function, which can be easily computed by truncating the decimal part.

5  Conclusion

This letter has introduced a theoretical framework to show that deep network approximation can achieve root exponential convergence and avoid the curse of dimensionality for approximating functions as general as (Hölder) continuous functions. Given a Lipschitz continuous function $f$ on $[0,1]^d$, it was shown by construction that Floor-ReLU networks with width $\max\{d,\,5N+13\}$ and depth $64dL+3$ can achieve a uniform approximation error bounded by $3\lambda\sqrt{d}\,N^{-\sqrt{L}}$, where $\lambda$ is the Lipschitz constant of $f$. More generally, for an arbitrary continuous function $f$ on $[0,1]^d$ with a modulus of continuity $\omega_f(\cdot)$, the approximation error is bounded by $\omega_f(\sqrt{d}\,N^{-\sqrt{L}}) + 2\omega_f(\sqrt{d})\,N^{-\sqrt{L}}$. Our results provide a theoretical lower bound of the power of deep network approximation. Whether this bound is achievable in actual computation relies on advanced algorithm design as a separate line of research.

Notes

1

Our results can be easily generalized to Ceiling-ReLU networks, namely, feedforward neural networks with either Ceiling ($\lceil x\rceil$) or ReLU ($\max\{0,x\}$) activation function in each neuron.

2

All of the exponential convergence in this letter is root exponential convergence. Nevertheless, after this introduction, for the convenience of presentation, we omit the prefix “root,” as in the literature.

3

For simplicity, we omit $\mathcal{O}(\cdot)$ in the following discussion.

4

For an arbitrary set $E\subseteq\mathbb{R}^d$, $\omega_f^{E}(r)$ is defined via $\omega_f^{E}(r) := \sup\{|f(\boldsymbol{x})-f(\boldsymbol{y})| : \|\boldsymbol{x}-\boldsymbol{y}\|_2\le r,\ \boldsymbol{x},\boldsymbol{y}\in E\}$, for any $r\ge 0$. As defined earlier, $\omega_f(r)$ is short for $\omega_f^{[0,1]^d}(r)$.

5

If we just defined $\phi_1(x)=\lfloor Kx\rfloor$, then $\phi_1(1)=K\neq K-1$ even though $1\in E_{K-1}$.

Acknowledgments

Z.S. is supported by the Tan Chin Tuan Centennial Professorship. H.Y. was partially supported by the U.S. National Science Foundation under award DMS-1945029.

References

Allen-Zhu, Z., Li, Y., & Liang, Y. (2019). Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv:1811.04918.
Arnold, V. I. (1957). On functions of three variables. Dokl. Akad. Nauk SSSR, 114(5), 679–681.
Arora, S., Du, S. S., Hu, W., Li, Z., & Wang, R. (2019). Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In Proceedings of the ICML.
Bao, C., Li, Q., Shen, Z., Tai, C., Wu, L., & Xiang, X. (2019). Approximation analysis of convolutional neural networks. Semantic Scholar ID:204762668.
Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3), 930–945.
Bartlett, P., Maiorov, V., & Meir, R. (1998). Almost linear VC-dimension bounds for piecewise polynomial networks. Neural Computation, 10, 217–223.
Bengio, Y., Léonard, N., & Courville, A. (2013). Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv:1308.3432.
Berner, J., Grohs, P., & Jentzen, A. (2018). Analysis of the generalization error: Empirical risk minimization over deep artificial neural networks overcomes the curse of dimensionality in the numerical approximation of Black-Scholes partial differential equations. CoRR, abs/1809.03062.
Bölcskei, H., Grohs, P., Kutyniok, G., & Petersen, P. (2019). Optimal approximation with sparsely connected deep neural networks. SIAM Journal on Mathematics of Data Science, 1(1), 8–45.
Boo, Y., Shin, S., & Sung, W. (2020). Quantized neural networks: Characterization and holistic optimization. arXiv:2006.00530.
Braun, J., & Griebel, M. (2009). On a constructive proof of Kolmogorov's superposition theorem. Constructive Approximation, 30, 653–675.
Cao, Y., & Gu, Q. (2019). Generalization bounds of stochastic gradient descent for wide and deep neural networks. CoRR, abs/1905.13210.
Carrillo, J. A. T., Jin, S., Li, L., & Zhu, Y. (2019). A consensus-based global optimization method for high dimensional machine learning problems. arXiv:1909.09249.
Chen, L., & Wu, C. (2019). A note on the expressive power of deep rectified linear unit networks in high-dimensional spaces. Mathematical Methods in the Applied Sciences, 42(9), 3400–3404.
Chen, M., Jiang, H., Liao, W., & Zhao, T. (2019). Efficient approximation of deep ReLU networks for functions on low dimensional manifolds. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, & R. Garnett (Eds.), Advances in neural information processing systems, 32 (pp. 8174–8184). Red Hook, NY: Curran.
Chen, Z., Cao, Y., Zou, D., & Gu, Q. (2019). How much over-parameterization is sufficient to learn deep ReLU networks? CoRR, arXiv:1911.12360.
Chui, C. K., Lin, S.-B., & Zhou, D.-X. (2018). Construction of neural networks for realization of localized deep learning. Frontiers in Applied Mathematics and Statistics, 4, 14.
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2, 303–314.
Devore, R. A. (1989). Optimal nonlinear approximation. Manuscripta Math., 63(4), 469–478.
Gribonval, R., Kutyniok, G., Nielsen, M., & Voigtlaender, F. (2019). Approximation spaces of deep neural networks. arXiv:1905.01208.
Gühring, I., Kutyniok, G., & Petersen, P. (2019). Error bounds for approximations with deep ReLU neural networks in $W^{s,p}$ norms. arXiv:1902.07896.
Guliyev, N. J., & Ismailov, V. E. (2018). Approximation capability of two hidden layer feedforward neural networks with fixed weights. Neurocomputing, 316, 262–269.
Harvey, N., Liaw, C., & Mehrabian, A. (2017). Nearly-tight VC-dimension bounds for piecewise linear neural networks. In S. Kale & O. Shamir (Eds.), Proceedings of Machine Learning Research (pp. 1064–1068). Amsterdam: PMLR.
Holland, J. H. (1992). Genetic algorithms. Scientific American, 267(1), 66–73.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359–366.
Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., & Bengio, Y. (2017). Quantized neural networks: Training neural networks with low precision weights and activations. J. Mach. Learn. Res., 18(1), 6869–6898.
Igelnik, B., & Parikh, N. (2003). Kolmogorov's spline network. IEEE Transactions on Neural Networks, 14(4), 725–733.
Jacot, A., Gabriel, F., & Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. CoRR, abs/1806.07572.
Ji, Z., & Telgarsky, M. (2020). Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks. arXiv:1909.12292.
Kennedy, J., & Eberhart, R. (1995). Particle swarm optimization. In Proceedings of the International Conference on Neural Networks (vol. 4, pp. 1942–1948). Red Hook, NY: Curran.
Kirkpatrick, S., Gelatt, C. D., & Vecchi, M. P. (1983). Optimization by simulated annealing. Science, 220(4598), 671–680.
Kolmogorov, A. N. (1956). On the representation of continuous functions of several variables by superposition of continuous functions of a smaller number of variables. Dokl. Akad. Nauk SSSR, 108, 179–182.
Kolmogorov, A. N. (1957). On the representation of continuous functions of several variables by superposition of continuous functions of one variable and addition. Dokl. Akad. Nauk SSSR, 114, 953–956.
Kůrková, V. (1992). Kolmogorov's theorem and multilayer neural networks. Neural Networks, 5(3), 501–506.
Li, Q., Lin, T., & Shen, Z. (2019). Deep learning via dynamical systems: An approximation perspective. arXiv:1912.10382.
Liang, S., & Srikant, R. (2016). Why deep neural networks? CoRR, abs/1610.04161.
Lin, Y., Lei, M., & Niu, L. (2019). Optimization strategies in quantized neural networks: A review. In Proceedings of the 2019 International Conference on Data Mining Workshops (pp. 385–390). Piscataway, NJ: IEEE.
Lu, J., Shen, Z., Yang, H., & Zhang, S. (2020). Deep network approximation for smooth functions. arXiv:2001.03040.
Luo, T., & Yang, H. (2020). Two-layer neural networks for partial differential equations: Optimization and generalization theory. arXiv:2006.15733.
Maiorov, V., & Pinkus, A. (1999). Lower bounds for approximation by MLP neural networks. Neurocomputing, 25(1), 81–91.
Montanelli, H., & Du, Q. (2019). New error bounds for deep ReLU networks using sparse grids. SIAM Journal on Mathematics of Data Science, 1(1), 78–92.
Montanelli, H., & Yang, H. (2020). Error bounds for deep ReLU networks using the Kolmogorov-Arnold superposition theorem. Neural Networks, 129, 1–6.
Montanelli, H., Yang, H., & Du, Q. (in press). Deep ReLU networks overcome the curse of dimensionality for bandlimited functions. Journal of Computational Mathematics.
Nakada, R., & Imaizumi, M. (2019). Adaptive approximation and estimation of deep neural network with intrinsic dimensionality. arXiv:1907.02177.
Nelder, J., & Mead, R. (1965). A simplex method for function minimization. Comput. J., 7, 308–313.
Opschoor, J. A. A., Schwab, C., & Zech, J. (2019). Exponential ReLU DNN expression of holomorphic maps in high dimension (Technical Report 2019-35). Seminar for Applied Mathematics, ETH Zürich, Switzerland. https://math.ethz.ch/sam/research/reports.html?id=839
Petersen, P., & Voigtlaender, F. (2018). Optimal approximation of piecewise smooth functions using deep ReLU neural networks. Neural Networks, 108, 296–330.
Pinnau, R., Totzeck, C., Tse, O., & Martin, S. (2017). A consensus-based model for global optimization and its mean-field limit. Mathematical Models and Methods in Applied Sciences, 27(1), 183–204.
Poggio, T., Mhaskar, H. N., Rosasco, L., Miranda, B., & Liao, Q. (2017). Why and when can deep—but not shallow—networks avoid the curse of dimensionality: A review. International Journal of Automation and Computing, 14, 503–519.
Shen, Z., Yang, H., & Zhang, S. (2019). Nonlinear approximation via compositions. Neural Networks, 119, 74–84.
Shen, Z., Yang, H., & Zhang, S. (2020). Deep network approximation characterized by number of neurons. Communications in Computational Physics, 28(5), 1768–1811.
Suzuki, T. (2019). Adaptivity of deep ReLU network for learning in Besov and mixed smooth Besov spaces: Optimal rate and curse of dimensionality. In Proceedings of the International Conference on Learning Representations.
Wang, P., Hu, Q., Zhang, Y., Zhang, C., Liu, Y., & Cheng, J. (2018). Two-step quantization for low-bit neural networks. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (pp. 4376–4384). Piscataway, NJ: IEEE.
Weinan, E., Ma, C., & Wu, L. (2019). A priori estimates of the population risk for two-layer neural networks. Communications in Mathematical Sciences, 17(5), 1407–1425.
Weinan, E., & Wang, Q. (2018). Exponential convergence of the deep neural network approximation for analytic functions. CoRR, abs/1807.00297.
Weinan, E., & Wojtowytsch, S. (2020). Representation formulas and pointwise properties for Barron functions. arXiv:2006.05982.
Yang, Y., & Wang, Y. (2020). Approximation in shift-invariant spaces with deep ReLU neural networks. arXiv:2005.11949.
Yarotsky, D. (2017). Error bounds for approximations with deep ReLU networks. Neural Networks, 94, 103–114.
Yarotsky, D. (2018). Optimal approximation of continuous functions by very deep ReLU networks. In S. Bubeck, V. Perchet, & P. Rigollet (Eds.), Proceedings of Machine Learning Research (pp. 639–649).
Yarotsky, D., & Zhevnerchuk, A. (2019). The phase diagram of approximation rates for deep neural networks. arXiv:1906.09477.
Yin, P., Lyu, J., Zhang, S., Osher, S., Qi, Y., & Xin, J. (2019). Understanding straight-through estimator in training activation quantized neural nets. arXiv:1903.05662.
Zhou, D.-X. (2020). Universality of deep convolutional neural networks. Applied and Computational Harmonic Analysis, 48(2), 787–794.