A new network with super-approximation power is introduced. This network is built with either the Floor ($\lfloor x\rfloor$) or the ReLU ($\max\{0,x\}$) activation function in each neuron; hence, we call such networks Floor-ReLU networks. For any hyperparameters $N\in\mathbb{N}^+$ and $L\in\mathbb{N}^+$, we show that Floor-ReLU networks with width $\max\{d,\,5N+13\}$ and depth $64dL+3$ can uniformly approximate a Hölder function $f$ on $[0,1]^d$ with an approximation error $3\lambda d^{\alpha/2}N^{-\alpha\sqrt{L}}$, where $\alpha\in(0,1]$ and $\lambda$ are the Hölder order and constant, respectively. More generally, for an arbitrary continuous function $f$ on $[0,1]^d$ with a modulus of continuity $\omega_f(\cdot)$, the constructive approximation rate is $\omega_f(\sqrt{d}\,N^{-\sqrt{L}})+2\omega_f(\sqrt{d})N^{-\sqrt{L}}$. As a consequence, this new class of networks overcomes the curse of dimensionality in approximation power when the variation of $\omega_f(r)$ as $r\to 0$ is moderate (e.g., $\omega_f(r)\lesssim r^\alpha$ for Hölder continuous functions), since the major term to be considered in our approximation rate is essentially $\sqrt{d}$ times a function of $N$ and $L$ independent of $d$ within the modulus of continuity.

## 1 Introduction

Recently, there have been a large number of successful real-world applications of deep neural networks in many fields of computer science and engineering, especially for large-scale and high-dimensional learning problems. Understanding the approximation capacity of deep neural networks has become a fundamental research direction for revealing the advantages of deep learning compared to traditional methods. This letter introduces new theories and network architectures achieving root exponential convergence and avoiding the curse of dimensionality simultaneously for (Hölder) continuous functions with an explicit error bound in deep network approximation, which might be two foundational laws supporting the application of deep network approximation in large-scale and high-dimensional problems. The approximation results here are quantitative and apply to networks with essentially arbitrary width and depth. These results suggest considering Floor-ReLU networks as a possible alternative to ReLU networks in deep learning.

Deep ReLU networks with width $O(N)$ and depth $O(L)$ can achieve the approximation rate $O(N^{-L})$ for polynomials on $[0,1]^d$ (Lu, Shen, Yang, & Zhang, 2020), but this is not true for general functions. For example, the (nearly) optimal approximation rates of deep ReLU networks for a Lipschitz continuous function and a $C^s$ function $f$ on $[0,1]^d$ are $O(\sqrt{d}\,N^{-2/d}L^{-2/d})$ and $O(\|f\|_{C^s}N^{-2s/d}L^{-2s/d})$ (Shen, Yang, & Zhang, 2020; Lu et al., 2020), respectively. The limitation of ReLU networks motivates us to explore other types of network architectures and to ask the following question: Do deep neural networks with arbitrary width $O(N)$ and arbitrary depth $O(L)$ admit an exponential approximation rate $O(\omega_f(N^{-L^\eta}))$ for some constant $\eta>0$ for a generic continuous function $f$ on $[0,1]^d$ with a modulus of continuity $\omega_f(\cdot)$?

in each neuron. Mathematically, if we let $N_0=d$, $N_{L+1}=1$, and $N_\ell$ be the number of neurons in the $\ell$th hidden layer of a Floor-ReLU network for $\ell=1,2,\ldots,L$, then the architecture of this network with input $\boldsymbol{x}$ and output $\varphi(\boldsymbol{x})$ can be described as

$\omega_f(\sqrt{d}\,N^{-\sqrt{L}})+2\omega_f(\sqrt{d})N^{-\sqrt{L}}$, where $\omega_f(\cdot)$ is the modulus of continuity defined as
$$\omega_f(r):=\sup\big\{|f(\boldsymbol{x})-f(\boldsymbol{y})|:\ \|\boldsymbol{x}-\boldsymbol{y}\|_2\le r,\ \boldsymbol{x},\boldsymbol{y}\in[0,1]^d\big\},\quad\text{for any }r\ge 0.$$

**Theorem 1.**

With theorem 1, we have an immediate corollary:

**Corollary 1.**

**Corollary 2.**

First, theorem 1 and corollary 2 show that the approximation capacity of deep networks for continuous functions can be nearly exponentially improved by increasing the network depth, and the approximation error can be explicitly characterized in terms of the width $O(N)$ and depth $O(L)$. Second, this new class of networks overcomes the curse of dimensionality in the approximation power when the modulus of continuity is moderate, since the approximation order is essentially $\omega_f(\sqrt{d}\,N^{-\sqrt{L}})$. Finally, applying piecewise constant and integer-valued functions as activation functions and integer numbers as parameters has been explored in the study of quantized neural networks (Hubara, Courbariaux, Soudry, El-Yaniv, & Bengio, 2017; Yin et al., 2019; Bengio, Léonard, & Courville, 2013) with efficient training algorithms for low computational complexity (Wang et al., 2018). The floor function ($\lfloor x\rfloor$) is a piecewise constant function and can be easily implemented numerically at very little cost; hence, the evaluation of the proposed network could be efficiently implemented in practical computation. Though there might not be an existing optimization algorithm to identify an approximant attaining the approximation rate in this letter, theorem 1 provides an expected accuracy before a learning task is conducted and quantifies how much current optimization algorithms could be improved. Designing an efficient optimization algorithm for Floor-ReLU networks is left as future work, with several possible directions discussed later.
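To make the Floor-ReLU architecture concrete, here is a minimal NumPy sketch of a forward pass. The layer sizes, weights, and the per-neuron assignment of Floor versus ReLU below are illustrative assumptions of ours, not the construction used in the proofs:

```python
import numpy as np

def floor_relu_layer(x, W, b, use_floor):
    """One hidden layer: an affine map followed, neuron by neuron, by either
    Floor or ReLU, selected by the boolean mask `use_floor`."""
    z = W @ x + b
    return np.where(use_floor, np.floor(z), np.maximum(z, 0.0))

def floor_relu_net(x, layers, out_W, out_b):
    """Evaluate a Floor-ReLU network: `layers` is a list of (W, b, use_floor)
    triples for the hidden layers; the output layer is affine."""
    h = np.asarray(x, dtype=float)
    for W, b, mask in layers:
        h = floor_relu_layer(h, W, b, mask)
    return out_W @ h + out_b

# Tiny illustrative network: input dimension 2, one hidden layer of width 2,
# the first neuron using Floor and the second using ReLU.
layers = [(np.eye(2), np.zeros(2), np.array([True, False]))]
y = floor_relu_net([1.7, -0.5], layers, np.ones((1, 2)), np.zeros(1))
```

Since the floor function is piecewise constant, such a forward pass costs no more than a standard ReLU pass, consistent with the practicality remark above.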

We remark that increased smoothness or regularity of the target function could improve our approximation rate, but at the cost of a large prefactor. For example, to attain better approximation rates for functions in $C^s([0,1]^d)$, it is common to use Taylor expansions and derivatives, tools that suffer from the curse of dimensionality and result in a large prefactor like $O((s+1)^d)$. Furthermore, the prospective approximation rate obtained from smoothness is not attractive. For example, the prospective approximation rate would be $O(N^{-s\sqrt{L}})$ if we use Floor-ReLU networks with width $O(N)$ and depth $O(L)$ to approximate functions in $C^s([0,1]^d)$. However, such a rate $O(N^{-s\sqrt{L}})=O(N^{-\sqrt{s^2L}})$ can already be attained by using Floor-ReLU networks with width $O(N)$ and depth $O(s^2L)$ to approximate Lipschitz continuous functions. Hence, increasing the network depth results in the same approximation rate for Lipschitz continuous functions as the rate for smooth functions.
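The depth-for-smoothness trade in the last sentence is elementary; a one-line check in our notation, with all constants suppressed:

```latex
% For f in C^s([0,1]^d), width O(N) and depth O(L) would prospectively give
% the rate N^{-s sqrt(L)}.  Since
\[
  s\sqrt{L} \;=\; \sqrt{s^{2}L},
  \qquad\text{hence}\qquad
  N^{-s\sqrt{L}} \;=\; N^{-\sqrt{s^{2}L}},
\]
% the same rate is already achieved by treating f merely as Lipschitz
% continuous and enlarging the depth from O(L) to O(s^2 L),
% with no smoothness assumption needed.
```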

## 2 Discussion

In this section, we discuss the application scope of our theory in machine learning and compare it with existing works.

### 2.1 Application Scope of Our Theory in Machine Learning

The approximation theory, optimization theory, and generalization theory form the three main theoretical aspects of deep learning, with different emphases and challenges that have motivated many separate research directions recently. Theorem 1 and corollary 2 provide an upper bound of $R_D(\theta_D)$. This bound depends only on the given budget of neurons and layers of Floor-ReLU networks and on the modulus of continuity of the target function $f$. Hence, this bound is independent of the empirical loss minimization in equation 2.1 and of the optimization algorithm used to compute the numerical solution of that equation. In other words, theorem 1 and corollary 2 quantify the approximation power of Floor-ReLU networks of a given size. Designing efficient optimization algorithms and analyzing the generalization bounds for Floor-ReLU networks are two separate future directions. Although optimization algorithms and generalization analysis are not our focus in this letter, in the next two paragraphs, we discuss several possible research topics in these directions for our Floor-ReLU networks.

In this work, we have not analyzed the feasibility of optimization algorithms for the Floor-ReLU network. Typically, stochastic gradient descent (SGD) is applied to solve a network optimization problem. However, the Floor-ReLU network has piecewise constant activation functions, making standard SGD infeasible. There are two possible directions to solve the optimization problem for Floor-ReLU networks: (1) gradient-free optimization methods, such as the Nelder-Mead method (Nelder & Mead, 1965), genetic algorithm (Holland, 1992), simulated annealing (Kirkpatrick, Gelatt, & Vecchi, 1983), particle swarm optimization (Kennedy & Eberhart, 1995), and consensus-based optimization (Pinnau, Totzeck, Tse, & Martin, 2017; Carrillo, Jin, Li, & Zhu, 2019); and (2) applying optimization algorithms for quantized networks that also have piecewise constant activation functions (Lin, Lei, & Niu, 2019; Boo, Shin, & Sung, 2020; Bengio et al., 2013; Wang et al., 2018; Hubara et al., 2017; Yin et al., 2019). It would be interesting future work to explore efficient learning algorithms based on the Floor-ReLU network.

Generalization analysis of Floor-ReLU networks is also an interesting future direction. Previous works have shown the generalization power of ReLU networks for regression problems (Jacot, Gabriel, & Hongler, 2018; Cao & Gu, 2019; Chen, Cao, Zou, & Gu, 2019; Weinan, Ma, & Wu, 2019; Weinan & Wojtowytsch, 2020) and for solving partial differential equations (Berner et al., 2018; Luo & Yang, 2020). Regularization strategies for ReLU networks that guarantee good generalization capacity of deep learning have been proposed in Weinan, Ma, and Wu (2019) and Weinan and Wojtowytsch (2020). It is important to investigate the generalization capacity of our Floor-ReLU networks. In particular, it is of great interest to see whether problem-dependent regularization strategies exist that make the generalization error of our Floor-ReLU networks free of the curse of dimensionality.

### 2.2 Approximation Rates in $O(N)$ and $O(L)$ versus $O(W)$

Characterizing deep network approximation in terms of the width $O(N)$ and depth $O(L)$ simultaneously is fundamental and indispensable in realistic applications, while quantifying deep network approximation based on the number of nonzero parameters $W$ is, as far as we know, probably only of interest in theory. Theorem 1 can provide practical guidance for choosing network sizes in realistic applications, while theories in terms of $W$ cannot tell how large a network should be to guarantee a target accuracy. The width and depth are the two most direct and amenable hyperparameters for choosing a specific network in a learning task, while the number of nonzero parameters $W$ is hardly controlled efficiently. Theories in terms of $W$ essentially have a single variable to control the network size in three types of structures: (1) fixing the width $N$ and varying the depth $L$; (2) fixing the depth $L$ and changing the width $N$; (3) letting both the width and depth be controlled by the same parameter, such as the target accuracy $\varepsilon$, in a specific way (e.g., $N$ is a polynomial of $\tfrac{1}{\varepsilon^d}$ and $L$ is a polynomial of $\log(\tfrac{1}{\varepsilon})$). Considering the nonuniqueness of structures realizing the same $W$, it is impractical to develop approximation rates in terms of $W$ covering all these structures. If one network structure has been chosen in a certain application, there might not be a known theory in terms of $W$ to quantify the performance of this structure.
Finally, in terms of full error analysis of deep learning including approximation theory, optimization theory, and generalization theory as illustrated in equation 2.2, the approximation error characterization in terms of width and depth is more useful than that in terms of the number of parameters, because almost all existing optimization and generalization analysis is based on depth and width instead of the number of parameters (Jacot, Gabriel, & Hongler, 2018; Cao & Gu, 2019; Chen, Cao, Zou, & Gu, 2019; Arora, Du, Hu, Li, & Wang, 2019; Allen-Zhu, Li, & Liang, 2019; Weinan et al., 2019; Weinan & Wojtowytsch, 2020; Ji & Telgarsky, 2020), to the best of our knowledge. Approximation results in terms of width and depth are more consistent with optimization and generalization analysis tools to obtain a full error analysis in equation 2.2.

Most existing approximation theories for deep neural networks so far focus on the approximation rate in the number of parameters $W$ (Cybenko, 1989; Hornik, Stinchcombe, & White, 1989; Barron, 1993; Liang & Srikant, 2016; Yarotsky, 2017, 2018; Poggio, Mhaskar, Rosasco, Miranda, & Liao, 2017; Weinan & Wang, 2018; Petersen & Voigtlaender, 2018; Chui, Lin, & Zhou, 2018; Nakada & Imaizumi, 2019; Gribonval, Kutyniok, Nielsen, & Voigtlaender, 2019; Gühring, Kutyniok, & Petersen, 2019; Chen, Jiang, Liao, & Zhao, 2019; Li, Lin, & Shen, 2019; Suzuki, 2019; Bao et al., 2019; Opschoor, Schwab, & Zech, 2019; Yarotsky & Zhevnerchuk, 2019; Bölcskei, Grohs, Kutyniok, & Petersen, 2019; Montanelli & Du, 2019; Chen & Wu, 2019; Zhou, 2020; Montanelli & Yang, 2020; Montanelli, Yang, & Du, in press). From the point of view of theoretical difficulty, controlling two variables, $N$ and $L$, in our theory is more challenging than controlling one variable $W$ in the literature. In terms of mathematical logic, the characterization of deep network approximation in terms of $N$ and $L$ can provide an approximation rate in terms of $W$, while we are not aware of how to derive approximation rates in terms of arbitrary $N$ and $L$ given approximation rates in terms of $W$, since existing results in terms of $W$ are valid for specific network sizes with width and depth as functions of $W$ without the degree of freedom to take arbitrary values. Existing theories essentially have a single variable to control the network size in three types of structures.
Let us use the first type of structure, which includes the best-known result for a nearly optimal approximation rate, $O(\omega_f(W^{-2/d}))$, for continuous functions in terms of $W$ using ReLU networks (Yarotsky, 2018), and the best-known result, $O(\exp(-c_{\alpha,d}W^{1/2}))$, for Hölder continuous functions of order $\alpha$ using Sine-ReLU networks (Yarotsky & Zhevnerchuk, 2019), as an example to show how theorem 1 in terms of $N$ and $L$ can be applied to obtain a better result in terms of $W$. One can apply theorem 1 in a similar way to obtain other corollaries for other types of structures in terms of $W$. The main idea is to specify the values of $N$ and $L$ in theorem 1 to show the desired corollary. For example, if we let the width parameter $N=2$ and the depth parameter $L=W$ in theorem 1, then the width is $\max\{d,23\}$, the depth is $64dW+3$, and the total number of parameters is bounded by $O\big(\max\{d^2,23^2\}(64dW+3)\big)=O(W)$. Therefore, we can prove corollary 3 for the approximation capacity of our Floor-ReLU networks in terms of the total number of parameters as follows:
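The width, depth, and parameter count in this paragraph can be checked with a few lines of arithmetic. The helper `param_upper_bound` below is our crude bound counting at most width$^2$ weights plus width biases per layer, under the width/depth formulas of theorem 1; the function names are ours:

```python
def floor_relu_size(d, N, L):
    """Width and depth of the Floor-ReLU network in theorem 1 for [0,1]^d."""
    width = max(d, 5 * N + 13)
    depth = 64 * d * L + 3
    return width, depth

def param_upper_bound(d, N, L):
    """Crude upper bound on the number of parameters: each of `depth` layers
    has at most width^2 weights plus width biases (input and output layers
    are no wider than the hidden layers here)."""
    width, depth = floor_relu_size(d, N, L)
    return depth * (width * width + width)

# Corollary 3 setting: N = 2 and L = W, so width = max(d, 23),
# depth = 64*d*W + 3, and the parameter bound is O(W) for fixed d.
d, W = 3, 100
width, depth = floor_relu_size(d, 2, W)
```

For fixed $d$ the bound is linear in $W$, which is exactly the $O(W)$ claim used to derive corollary 3.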

**Corollary 3.**

Corollary 3 achieves root exponential convergence without the curse of dimensionality in terms of the number of parameters $W$ with the help of Floor-ReLU networks. When only ReLU networks are used, the result in Yarotsky (2018) suffers from the curse and does not admit any kind of exponential convergence. The result in Yarotsky and Zhevnerchuk (2019) with Sine-ReLU networks has root exponential convergence but has not excluded the possibility of the curse of dimensionality, as we shall discuss. Furthermore, corollary 3 works for generic continuous functions, while Yarotsky and Zhevnerchuk (2019) applies only to Hölder continuous functions.

### 2.3 Further Interpretation of Our Theory

In the interpretation of our theory, two more aspects are important to discuss. The first is whether it is possible to extend our theory to functions on a more general domain, for example, $[-M,M]^d$ for some $M>1$, because $M>1$ may cause an implicit curse of dimensionality in some existing theories, as we shall point out. The second is how large the modulus of continuity can be, since it is associated with a high-dimensional function $f$ and may hide an implicit curse of dimensionality in our approximation rate.

for any $\boldsymbol{y}\in[-M,M]^d$,

### 2.4 Discussion of the Literature

The neural networks constructed here achieve exponential convergence without the curse of dimensionality simultaneously for a function class as general as (Hölder) continuous functions, while, to the best of our knowledge, most existing theories apply only to functions with intrinsically low complexity. For example, exponential convergence was studied for polynomials (Yarotsky, 2017; Montanelli et al., in press; Lu et al., 2020), smooth functions (Montanelli et al., in press; Liang & Srikant, 2016), analytic functions (Weinan & Wang, 2018), and functions admitting a holomorphic extension to a Bernstein polyellipse (Opschoor et al., 2019). For another example, no curse of dimensionality occurs, or the curse is lessened, for Barron spaces (Barron, 1993; Weinan et al., 2019; Weinan & Wojtowytsch, 2020), Korobov spaces (Montanelli & Du, 2019), band-limited functions (Chen & Wu, 2019; Montanelli et al., in press), compositional functions (Poggio et al., 2017), and smooth functions (Yarotsky & Zhevnerchuk, 2019; Lu et al., 2020; Montanelli & Yang, 2020; Yang & Wang, 2020).

Our theory admits a neat and explicit approximation error bound. For example, our approximation rate in the case of Hölder continuous functions of order $\alpha$ with a Hölder constant $\lambda$ is $3\lambda d^{\alpha/2}N^{-\alpha\sqrt{L}}$, while the prefactor of most existing theories is unknown or grows exponentially in $d$. Our proof fully explores the advantage of the compositional structure and the nonlinearity of deep networks, while many existing theories were built on traditional approximation tools (e.g., polynomial approximation, multiresolution analysis, and Monte Carlo sampling), making it challenging for existing theories to obtain a neat and explicit error bound with an exponential convergence and without the curse of dimensionality.

Let us review existing work in more detail.

#### 2.4.1 Curse of Dimensionality

The curse of dimensionality is the phenomenon that approximating a $d$-dimensional function with a fixed target accuracy by a certain parameterization method generally requires a number of parameters that is exponential in $d$, an expense that quickly becomes unaffordable as $d$ grows. For example, traditional finite element methods with $W$ parameters can achieve an approximation accuracy $O(W^{-1/d})$, with an explicit indicator of the curse, $\tfrac{1}{d}$, in the power of $W$. If an approximation rate has a constant prefactor independent of $W$ but exponential in $d$, the curse still occurs implicitly through this prefactor by definition. If the approximation rate has a prefactor $C_f$ depending on $f$, then $C_f$ still depends on $d$ implicitly via $f$, and the curse implicitly occurs if $C_f$ grows exponentially as $d$ increases. Designing a parameterization method that can overcome the curse of dimensionality is an important research topic in approximation theory.
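As a sanity check of the finite element example above, solving $W^{-1/d}=\varepsilon$ for $W$ (constants suppressed, function name ours) makes the exponential blow-up in $d$ explicit:

```python
def fem_parameters_needed(eps, d):
    """Parameters needed for accuracy eps under the rate O(W^(-1/d)),
    ignoring the constant: solving W^(-1/d) = eps gives W = eps^(-d)."""
    return eps ** (-d)

# For a fixed accuracy of 0.1, the cost grows like 10**d with the dimension d.
costs = [fem_parameters_needed(0.1, d) for d in (1, 2, 10)]
```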

In Montanelli and Du (2019), $d$-dimensional functions in the Korobov space are approximated by a linear combination of basis functions of a sparse grid, each of which is approximated by a ReLU network. Though the curse of dimensionality is lessened, target functions have to be sufficiently smooth, and the approximation error still contains a factor that is exponential in $d$; that is, the curse still occurs. Other works (Yarotsky, 2017; Yarotsky & Zhevnerchuk, 2019; Lu et al., 2020; Yang & Wang, 2020) study the advantage of smoothness in network approximation: polynomials are applied to approximate smooth functions, and ReLU networks are constructed to approximate polynomials. Exploiting smoothness can lessen the curse of dimensionality in the approximation rates in terms of network sizes, but it also results in a prefactor that is exponentially large in the dimension, which means that the curse still occurs implicitly.

The Kolmogorov-Arnold superposition theorem (KST) (Kolmogorov, 1956, 1957; Arnold, 1957) has also inspired a research direction of network approximation (Kůrková, 1992; Maiorov & Pinkus, 1999; Igelnik & Parikh, 2003; Montanelli & Yang, 2020) for continuous functions. Kůrková (1992) provided a quantitative approximation rate for networks with two hidden layers, but the number of neurons scales exponentially in the dimension and the curse occurs. Maiorov and Pinkus (1999) relaxed the exact representation in KST to an approximation in the form of two-hidden-layer neural networks with a maximum width $6d+3$ and a single activation function. This powerful activation function is, as its authors describe, very complex, and its numerical evaluation was not available until a more concrete algorithm was recently proposed in Guliyev and Ismailov (2018). Note that there is no available numerical algorithm in Maiorov and Pinkus (1999) and Guliyev and Ismailov (2018) to compute the whole networks proposed there. The difficulty is due to the fact that the construction of these networks relies on the outer univariate continuous function of the KST. Though the existence of these outer functions can be shown by construction via a complicated iterative procedure in Braun and Griebel (2009), there is not yet a numerical algorithm to evaluate them for a given target function, even if computation with arbitrary precision is assumed to be available. Therefore, the networks considered in Maiorov and Pinkus (1999) and Guliyev and Ismailov (2018) are similar to the original representation in KST in the sense that their existence is proved without an explicit way or numerical algorithm to construct them. Igelnik and Parikh (2003) and Montanelli and Yang (2020) apply cubic splines and piecewise linear functions to approximate the inner and outer functions of KST, resulting in cubic-spline and ReLU networks approximating continuous functions on $[0,1]^d$.
Due to the pathological outer functions of KST, the approximation bounds still suffer from the curse of dimensionality unless target functions are restricted to a small class of functions with simple outer functions in the KST.

Recently in Yarotsky and Zhevnerchuk (2019), Sine-ReLU networks have been applied to approximate Hölder continuous functions of order $\alpha$ on $[0,1]^d$ with an approximation accuracy $\varepsilon=\exp(-c_{\alpha,d}W^{1/2})$, where $W$ is the number of parameters in the network and $c_{\alpha,d}$ is a positive constant depending only on $\alpha$ and $d$. Whether $c_{\alpha,d}$ depends exponentially on $d$ determines whether the curse of dimensionality exists for the Sine-ReLU networks, which is not answered in Yarotsky and Zhevnerchuk (2019) and is still an open question.

Finally, we discuss the curse of dimensionality in terms of the continuity of the weight selection as a map $\Sigma:C([0,1]^d)\to\mathbb{R}^W$. For a fixed network architecture with a fixed number of parameters $W$, let $g:\mathbb{R}^W\to C([0,1]^d)$ be the map realizing a DNN from a given set of parameters in $\mathbb{R}^W$ to a function in $C([0,1]^d)$. Suppose that there is a continuous map $\Sigma$ from the unit ball of the Sobolev space with smoothness $s$, denoted as $F_{s,d}$, to $\mathbb{R}^W$ such that $\|f-g(\Sigma(f))\|_{L^\infty}\le\varepsilon$ for all $f\in F_{s,d}$. Then $W\ge c\varepsilon^{-d/s}$ with some constant $c$ depending only on $s$. This conclusion is given in theorem 3 of Yarotsky (2017), which is a corollary of theorem 4.2 of DeVore (1989) in a more general form. Intuitively, this conclusion means that any constructive approximation of ReLU FNNs to approximate $C([0,1]^d)$ cannot enjoy a continuous weight selection property if the approximation rate is better than $O(W^{-s/d})$; that is, the curse of dimensionality must occur for constructive approximation by ReLU FNNs with a continuous weight selection. Theorem 4.2 of DeVore (1989) can also lead to a new corollary with a weight selection map $\Sigma:K_{s,d}\to\mathbb{R}^W$ (e.g., the constructive approximation of Floor-ReLU networks) and $g:\mathbb{R}^W\to L^\infty([0,1]^d)$ (e.g., the realization map of Floor-ReLU networks), where $K_{s,d}$ is the unit ball of $C^s([0,1]^d)$ with the Sobolev norm $W^{s,\infty}([0,1]^d)$. This new corollary then implies that the constructive approximation in this letter cannot enjoy continuous weight selection. However, theorem 4.2 of DeVore (1989) is essentially a min-max criterion to evaluate weight selection maps maintaining continuity: the approximation error obtained by minimizing over all continuous selections $\Sigma$ and network realizations $g$ and maximizing over all target functions is bounded below by $O(W^{-s/d})$.
In the worst scenario, a continuous weight selection cannot enjoy an approximation rate beating the curse of dimensionality. However, theorem 4.2 of Devore (1989) has not excluded the possibility that most continuous functions of interest in practice may still enjoy a continuous weight selection without the curse of dimensionality.

#### 2.4.2 Exponential Convergence

Exponential convergence refers to the situation in which the approximation error decays exponentially to zero as the number of parameters increases. Designing approximation tools with exponential convergence is another important topic in approximation theory. In the literature on deep network approximation, the terminology "exponential convergence" has also been used when the number of network parameters $W$ is a polynomial of $O(\log(\tfrac{1}{\varepsilon}))$ (Weinan & Wang, 2018; Yarotsky & Zhevnerchuk, 2019; Opschoor et al., 2019). The exponential convergence in this letter is root exponential, as in Yarotsky and Zhevnerchuk (2019); that is, $W=O(\log^2(\tfrac{1}{\varepsilon}))$. The exponential convergence in the other works is worse than root exponential.

In most cases, the approximation power needed to achieve exponential approximation rates in existing works comes from traditional tools for approximating a small class of functions, instead of from the network structure itself. In Weinan and Wang (2018) and Opschoor et al. (2019), highly smooth functions are first approximated by a linear combination of special polynomials of high degree (e.g., Chebyshev polynomials, Legendre polynomials) with an exponential approximation rate; that is, to achieve an $\varepsilon$-accuracy, a linear combination of only $O(p(\log(\tfrac{1}{\varepsilon})))$ polynomials is required, where $p$ is a polynomial whose degree may depend on the dimension $d$. Then each polynomial is approximated by a ReLU network with $O(\log(\tfrac{1}{\varepsilon}))$ parameters. Finally, all ReLU networks are assembled to form a large network approximating the target function with an exponential approximation rate. As far as we know, the only existing work that achieves exponential convergence without taking advantage of special polynomials and smoothness is the Sine-ReLU network in Yarotsky and Zhevnerchuk (2019). We emphasize that the result in our letter applies to generic continuous functions, including the Hölder continuous functions considered in Yarotsky and Zhevnerchuk (2019).

## 3 Approximation of Continuous Functions

### 3.1 Notations

The main notations of this letter follow:

- •
Vectors and matrices are denoted in bold. Standard vectorization is adopted in the matrix and vector computation. For example, adding a scalar and a vector means adding the scalar to each entry of the vector.

- •
Let $\mathbb{N}^+$ denote the set of all positive integers: $\mathbb{N}^+=\{1,2,3,\ldots\}$.

- •
Let $\sigma:\mathbb{R}\to\mathbb{R}$ denote the rectified linear unit (ReLU): $\sigma(x)=\max\{0,x\}$. With a slight abuse of notation, we define $\sigma:\mathbb{R}^d\to\mathbb{R}^d$ as $\sigma(\boldsymbol{x})=\big(\max\{0,x_1\},\ldots,\max\{0,x_d\}\big)^{\mathsf{T}}$ for any $\boldsymbol{x}=(x_1,\ldots,x_d)\in\mathbb{R}^d$.

- •
The floor function (Floor) is defined as $\lfloor x\rfloor:=\max\{n:n\le x,\ n\in\mathbb{Z}\}$ for any $x\in\mathbb{R}$.

- •
For $\theta\in[0,1)$, suppose its binary representation is $\theta=\sum_{\ell=1}^{\infty}\theta_\ell 2^{-\ell}$ with $\theta_\ell\in\{0,1\}$. We introduce the special notation $\mathrm{bin}\,0.\theta_1\theta_2\cdots\theta_L$ to denote the $L$-term binary representation of $\theta$: $\mathrm{bin}\,0.\theta_1\theta_2\cdots\theta_L:=\sum_{\ell=1}^{L}\theta_\ell 2^{-\ell}$.

- •
The expression “a network with width $N$ and depth $L$” means the maximum width of this network for all hidden layers is no more than $N$, and the number of hidden layers of this network is no more than $L$.
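The $L$-term binary representation in the notation list can be produced with a single application of the floor function; a small sketch (the helper name is ours):

```python
import math

def bin_truncate(theta, L):
    """L-term binary representation bin 0.t1 t2 ... tL of theta in [0, 1):
    keep the first L bits and drop the rest, i.e. sum_{l=1}^{L} t_l * 2^(-l),
    computed as floor(theta * 2^L) / 2^L."""
    return math.floor(theta * 2 ** L) / 2 ** L

# 0.8125 = bin 0.1101, so its 2-term representation is bin 0.11 = 0.75.
```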

### 3.2 Proof of Theorem 1

Theorem 1 is an immediate consequence of theorem 2:

**Theorem 2.**

This theorem will be proved later in this section. Now we prove theorem 1 based on theorem 2.

**Proof of Theorem 1.**

To prove theorem 2, we first present the proof sketch. We construct piecewise constant functions implemented by Floor-ReLU networks to approximate continuous functions. There are four key steps in our construction:

- 1.
Normalize $f$ as $\tilde{f}$ satisfying $\tilde{f}(\boldsymbol{x})\in[0,1]$ for any $\boldsymbol{x}\in[0,1]^d$, divide $[0,1]^d$ into a set of nonoverlapping cubes $\{Q_{\boldsymbol{\beta}}\}_{\boldsymbol{\beta}\in\{0,1,\ldots,K-1\}^d}$, and denote $\boldsymbol{x}_{\boldsymbol{\beta}}$ as the vertex of $Q_{\boldsymbol{\beta}}$ with minimum $\|\cdot\|_1$ norm, where $K$ is an integer determined later. See Figure 2 for illustrations of $Q_{\boldsymbol{\beta}}$ and $\boldsymbol{x}_{\boldsymbol{\beta}}$.
- 2.
Construct a Floor-ReLU subnetwork to implement a vector-valued function $\Phi_1:\mathbb{R}^d\to\mathbb{R}^d$ projecting the whole cube $Q_{\boldsymbol{\beta}}$ to the index $\boldsymbol{\beta}$ for each $\boldsymbol{\beta}\in\{0,1,\ldots,K-1\}^d$; that is, $\Phi_1(\boldsymbol{x})=\boldsymbol{\beta}$ for all $\boldsymbol{x}\in Q_{\boldsymbol{\beta}}$.

- 3.
Construct a Floor-ReLU subnetwork to implement a function $\varphi_2:\mathbb{R}^d\to\mathbb{R}$ mapping $\boldsymbol{\beta}\in\{0,1,\ldots,K-1\}^d$ approximately to $\tilde{f}(\boldsymbol{x}_{\boldsymbol{\beta}})$ for each $\boldsymbol{\beta}$; that is, $\varphi_2(\boldsymbol{\beta})\approx\tilde{f}(\boldsymbol{x}_{\boldsymbol{\beta}})$. Then $\varphi_2\circ\Phi_1(\boldsymbol{x})=\varphi_2(\boldsymbol{\beta})\approx\tilde{f}(\boldsymbol{x}_{\boldsymbol{\beta}})$ for any $\boldsymbol{x}\in Q_{\boldsymbol{\beta}}$ and each $\boldsymbol{\beta}\in\{0,1,\ldots,K-1\}^d$, implying that $\tilde{\varphi}:=\varphi_2\circ\Phi_1$ approximates $\tilde{f}$ within an error $O(\omega_f(1/K))$ on $[0,1]^d$.

- 4.
Rescale and shift $\tilde{\varphi}$ to obtain the desired function $\varphi$ approximating $f$ well, and determine the final Floor-ReLU network to implement $\varphi$.
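The four steps above can be sketched in a few lines of NumPy if we realize $\Phi_1$ directly with the floor function and $\varphi_2$ as a lookup table of sampled values. This is an illustration of the resulting piecewise-constant approximant only, not the Floor-ReLU network construction of the proof; the toy target, $K$, and function names are our assumptions, and step 4's rescaling is omitted since the toy target already takes convenient values:

```python
import numpy as np

def build_piecewise_constant_approx(f, d, K):
    """Sketch of steps 1-3: sample f at the corners x_beta = beta/K of the
    cubes Q_beta and return phi(x) = table value at beta = Phi_1(x).
    Here Phi_1 is realized directly with floor; in the paper it is
    implemented by a Floor-ReLU subnetwork."""
    # Step 1: values f(x_beta) on the grid {0, 1/K, ..., (K-1)/K}^d.
    betas = np.stack(np.meshgrid(*[np.arange(K)] * d, indexing="ij"), axis=-1)
    table = np.apply_along_axis(lambda b: f(b / K), -1, betas)

    def phi(x):
        # Step 2: project x in Q_beta to its index beta (clipped at K - 1).
        beta = np.minimum(np.floor(np.asarray(x) * K).astype(int), K - 1)
        # Step 3: map beta to (an approximation of) f(x_beta).
        return table[tuple(beta)]

    return phi

# Toy usage: approximate f(x) = x1 + x2 on [0,1]^2 with K = 4.
phi = build_piecewise_constant_approx(lambda x: x[0] + x[1], d=2, K=4)
```

On each cube $Q_\beta$ the sketch returns the sampled value $f(\boldsymbol{x}_\beta)$, so its uniform error is controlled by the modulus of continuity over a cube, mirroring the $O(\omega_f(1/K))$ bound in step 3.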

It is not difficult to construct Floor-ReLU networks with the desired width and depth to implement $\Phi_1$. The most technical part is the construction of a Floor-ReLU network with the desired width and depth computing $\varphi_2$, which needs the following proposition based on the bit extraction technique introduced in Bartlett, Maiorov, and Meir (1998) and Harvey, Liaw, and Mehrabian (2017).

**Proposition 1.**

The proof of this proposition is presented in section 4. By this proposition and the definition of VC-dimension (see Harvey et al., 2017), it is easy to prove that the VC-dimension of Floor-ReLU networks with a constant width and depth $O(L)$ has a lower bound $2^L$. Such a lower bound is much larger than $O(L^2)$, which is a VC-dimension upper bound for ReLU networks with the same width and depth due to theorem 8 of Harvey et al. (2017). This means that Floor-ReLU networks are much more powerful than ReLU networks from the perspective of VC-dimension.

Based on the proof sketch stated just above, we are ready to give the detailed proof of theorem 2, following ideas similar to those in our previous work (Shen et al., 2019; Shen et al., 2020; Lu et al., 2020). The main idea of our proof is to reduce high-dimensional approximation to one-dimensional approximation via a projection. The idea of projection was probably first used in well-established theories such as the KST mentioned in section 2, where the approximant to high-dimensional functions is constructed by projecting high-dimensional data points to one-dimensional data points and then constructing one-dimensional approximants. There has been extensive research based on this idea, for example, the references related to KST summarized in section 2, our previous work (Shen, Yang, & Zhang, 2019, 2020; Lu et al., 2020), and Yarotsky and Zhevnerchuk (2019). The key to a successful approximant is to construct one-dimensional approximants that can deal with a large number of one-dimensional data points; in fact, the number of points is exponential in the dimension $d$.

**Proof of Theorem 2.**

The proof consists of four steps.

*Step 1: Setup.* Assume $f$ is not a constant function, since that case is trivial. Then $\omega_f(r)>0$ for any $r>0$. Clearly, $|f(\boldsymbol{x})-f(\boldsymbol{0})|\le\omega_f(\sqrt{d})$ for any $\boldsymbol{x}\in[0,1]^d$. Define

*Step 2: Construct $\Phi_1$ mapping $x\in Q_{\beta}$ to $\beta$.* Define a step function $\varphi_1$ as

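Numerically, $\varphi_1$ quantizes each coordinate to its cell index. A minimal sketch, assuming the form $\varphi_1(x)=\min\{\lfloor Kx\rfloor,\,K-1\}$ suggested by the endpoint correction in note 5 (this exact form is an assumption, not necessarily the definition used in the construction):

```python
import math

def phi1(x, K):
    """Step function mapping x in [k/K, (k+1)/K) to k, with the endpoint
    x = 1 corrected to K - 1 (cf. note 5: plain floor(K*x) would give K).
    Illustrative sketch only; the exact form in the paper may differ."""
    return min(math.floor(K * x), K - 1)

# Each coordinate of x in the cell Q_beta is mapped to the matching entry of beta.
K = 4
assert phi1(0.0, K) == 0
assert phi1(0.26, K) == 1   # 0.26 lies in [1/4, 2/4)
assert phi1(0.99, K) == 3
assert phi1(1.0, K) == K - 1  # endpoint handled by the min
```

Applying `phi1` coordinate-wise realizes the map $x\in Q_{\beta}\mapsto\beta$ described in step 2.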

*Step 3: Construct $\varphi_2$ mapping $\beta\in\{0,1,\ldots,K-1\}^d$ approximately to $\tilde{f}(x_{\beta})$.* Using the idea of $K$-ary representation, we define a linear function $\psi_1$ via
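The point of a $K$-ary representation is that a linear map can collapse the index vector $\beta$ into a single integer without collisions. A sketch of such a map (the digit ordering here is an assumption; the paper's $\psi_1$ may order digits differently):

```python
def psi1(beta, K):
    """Linear map sending beta in {0,...,K-1}^d to the integer whose
    base-K digits are the entries of beta. Distinct beta give distinct
    integers in {0, 1, ..., K^d - 1}, so no information is lost."""
    return sum(b * K**i for i, b in enumerate(beta))

K = 3
assert psi1((0, 0, 0), K) == 0
assert psi1((2, 1, 0), K) == 2 + 1 * 3        # = 5
assert psi1((2, 2, 2), K) == K**3 - 1         # = 26, the largest index
```

This injectivity is what lets the $d$-dimensional indexing problem be handed off to a one-dimensional function, at the cost of $K^d$ (exponentially many) one-dimensional data points, as noted above.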

*Step 4: Determine the final network to implement the desired function $\varphi$.* Define $\tilde{\varphi}:=\varphi_2\circ\Phi_1$; that is, for any $x=(x_1,x_2,\ldots,x_d)\in\mathbb{R}^d$,

## 4 Proof of Proposition 1

The proof of proposition 1 relies mainly on the bit extraction technique. As we shall see, our key idea is to apply the Floor activation function to make bit extraction more powerful, thereby reducing network sizes. In particular, Floor-ReLU networks can extract more bits than ReLU networks of the same size.

We first establish a basic lemma to extract $1/N$ of the total bits of a binary number; the result is again stored as a binary number.
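The mechanism behind such extraction can be sketched numerically: applying Floor after a shift, $\lfloor 2^{n}x\rfloor/2^{n}$ keeps the first $n$ bits of $x=\mathrm{bin}\,0.\theta_1\theta_2\cdots$, and subtraction recovers the remaining bits. A hedged sketch of this splitting step (function name and interface are hypothetical):

```python
import math

def split_bits(x, n):
    """Split a binary fraction x = bin 0.t1 t2 ... into its first n bits
    (still stored as a binary fraction) and the shifted remainder,
    using only Floor and affine operations."""
    head = math.floor(2**n * x) / 2**n   # bin 0.t1...tn
    tail = 2**n * (x - head)             # bin 0.t(n+1) t(n+2) ...
    return head, tail

x = int('101101', 2) / 2**6              # x = bin 0.101101
head, tail = split_bits(x, 3)
assert head == int('101', 2) / 2**3      # bin 0.101
assert tail == int('101', 2) / 2**3      # bin 0.101
```

Since the inputs are dyadic rationals, these floating-point computations are exact.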

**Lemma 1.**

**Proof.**

The next lemma constructs a Floor-ReLU network that can extract any bit from a binary representation according to a specific index.

**Lemma 2.**

**Proof.**

The proof is based on repeated applications of lemma 1. Specifically, we inductively construct a sequence of functions $\varphi_1,\varphi_2,\ldots,\varphi_L$, implemented by Floor-ReLU networks, satisfying the following two conditions for each $\ell\in\{1,2,\ldots,L\}$.

1. $\varphi_\ell:\mathbb{R}^2\to\mathbb{R}$ can be implemented by a Floor-ReLU network with width $2N+2$ and depth $7\ell-3$.

2. For any $\theta_m\in\{0,1\}$, $m=1,2,\ldots,N^{\ell}$, we have $\varphi_\ell(\mathrm{bin}\,0.\theta_1\theta_2\cdots\theta_{N^{\ell}},\,m)=\mathrm{bin}\,0.\theta_m$, for $m=1,2,\ldots,N^{\ell}$.

Suppose, as the induction hypothesis, that the two conditions hold for $\ell=k$:

- $\varphi_k:\mathbb{R}^2\to\mathbb{R}$ can be implemented by a Floor-ReLU network with width $2N+2$ and depth $7k-3$.

- For any $\theta_j\in\{0,1\}$, $j=1,2,\ldots,N^{k}$, we have $\varphi_k(\mathrm{bin}\,0.\theta_1\theta_2\cdots\theta_{N^{k}},\,j)=\mathrm{bin}\,0.\theta_j$, for $j=1,2,\ldots,N^{k}$. (4.4)

Note that $\psi $ can be computed by a Floor-ReLU network of width $2N$ and depth 4. By Figure 8, we have

- $\varphi_{k+1}:\mathbb{R}^2\to\mathbb{R}$ can be implemented by a Floor-ReLU network with width $2N+2$ and depth $2+4+1+(7k-3)=7(k+1)-3$, which implies condition 1 for $\ell=k+1$.

- For any $\theta_m\in\{0,1\}$, $m=1,2,\ldots,N^{k+1}$, we have $\varphi_{k+1}(\mathrm{bin}\,0.\theta_1\theta_2\cdots\theta_{N^{k+1}},\,m)=\mathrm{bin}\,0.\theta_m$, for $m=1,2,\ldots,N^{k+1}$. That is, condition 2 holds for $\ell=k+1$.

This finishes the induction.

By the principle of induction, there exists a function $\varphi_L:\mathbb{R}^2\to\mathbb{R}$ such that

- $\varphi_L$ can be implemented by a Floor-ReLU network with width $2N+2$ and depth $7L-3$.

- For any $\theta_m\in\{0,1\}$, $m=1,2,\ldots,N^{L}$, we have $\varphi_L(\mathrm{bin}\,0.\theta_1\theta_2\cdots\theta_{N^{L}},\,m)=\mathrm{bin}\,0.\theta_m$, for $m=1,2,\ldots,N^{L}$.
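Numerically, the input-output map realized by $\varphi_L$ amounts to extracting the $m$th bit of a binary fraction, which can be written with two Floor calls: $\theta_m=\lfloor 2^{m}x\rfloor-2\lfloor 2^{m-1}x\rfloor$. The following sketch checks this behavior (it mirrors what $\varphi_L$ computes, not the network construction itself):

```python
import math

def extract_bit(x, m):
    """Return bin 0.theta_m (i.e., theta_m / 2) for x = bin 0.theta_1 theta_2 ...,
    using only Floor and arithmetic: the m-th bit is
    floor(2^m * x) - 2 * floor(2^(m-1) * x)."""
    theta_m = math.floor(2**m * x) - 2 * math.floor(2**(m - 1) * x)
    return theta_m / 2

x = int('0110', 2) / 2**4        # x = bin 0.0110
assert extract_bit(x, 1) == 0.0  # theta_1 = 0
assert extract_bit(x, 2) == 0.5  # theta_2 = 1
assert extract_bit(x, 3) == 0.5  # theta_3 = 1
assert extract_bit(x, 4) == 0.0  # theta_4 = 0
```

The content of lemma 2 is that this map, for indices up to $N^{L}$, fits into a Floor-ReLU network whose depth grows only linearly in $L$.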

With lemma 2 in hand, we are ready to prove proposition 1.

**Proof of Proposition 1.**

We point out that only the properties of Floor on $[0,\infty)$ are used in our proof. Thus, Floor can be replaced by the truncation function, which is easily computed by discarding the decimal part.
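A quick check of this observation: floor and truncation agree on $[0,\infty)$ and differ only at negative non-integers, so nothing in the proof changes under the substitution.

```python
import math

# On [0, infinity), floor and truncation coincide.
for x in [0.0, 0.5, 1.0, 2.7, 100.99]:
    assert math.floor(x) == math.trunc(x)

# They differ only on negative non-integers, which the proof never uses.
assert math.floor(-0.5) == -1
assert math.trunc(-0.5) == 0
```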

## 5 Conclusion

This letter has introduced a theoretical framework showing that deep network approximation can achieve root exponential convergence and avoid the curse of dimensionality for approximating functions as general as (Hölder) continuous functions. Given a Lipschitz continuous function $f$ on $[0,1]^d$, it was shown by construction that Floor-ReLU networks with width $\max\{d,\,5N+13\}$ and depth $64dL+3$ can achieve a uniform approximation error bounded by $3\lambda\sqrt{d}\,N^{-L}$, where $\lambda$ is the Lipschitz constant of $f$. More generally, for an arbitrary continuous function $f$ on $[0,1]^d$ with a modulus of continuity $\omega_f(\cdot)$, the approximation error is bounded by $\omega_f(\sqrt{d}\,N^{-L})+2\omega_f(\sqrt{d})\,N^{-L}$. Our results provide a theoretical lower bound on the power of deep network approximation; whether this bound is achievable in actual computation relies on advanced algorithm design as a separate line of research.
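As a concrete reading of the Lipschitz-case bound $3\lambda\sqrt{d}\,N^{-L}$: for a fixed width parameter $N$, the error decays exponentially in the depth parameter $L$, while the dimension enters only through the mild factor $\sqrt{d}$. A short sanity check (the parameter values are hypothetical):

```python
import math

def error_bound(lam, d, N, L):
    """Uniform approximation error bound 3 * lambda * sqrt(d) * N**(-L)
    for a lambda-Lipschitz function on [0,1]^d, per the theorem above."""
    return 3 * lam * math.sqrt(d) * N ** (-L)

lam, d, N = 1.0, 100, 2
bounds = [error_bound(lam, d, N, L) for L in (1, 5, 10)]
assert bounds[0] == 15.0                    # 3 * 1 * 10 / 2
# Exponential decay in L, even with d = 100: no curse of dimensionality.
assert bounds[2] < bounds[1] < bounds[0]
```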

## Notes

^{1}

Our results can be easily generalized to Ceiling-ReLU networks, namely, feedforward neural networks with either the Ceiling ($\lceil x\rceil$) or ReLU ($\max\{0,x\}$) activation function in each neuron.

^{2}

All of the exponential convergence in this letter is root exponential convergence. Nevertheless, after this introduction, for the convenience of presentation, we omit the prefix “root,” as in the literature.

^{3}

For simplicity, we omit $O(\xb7)$ in the following discussion.

^{4}

For an arbitrary set $E\subseteq\mathbb{R}^d$, $\omega_f^E(r)$ is defined via $\omega_f^E(r):=\sup\{|f(x)-f(y)|:\|x-y\|_2\le r,\ x,y\in E\}$ for any $r\ge 0$. As defined earlier, $\omega_f(r)$ is short for $\omega_f^{[0,1]^d}(r)$.

^{5}

If we just defined $\varphi_1(x)=\lfloor Kx\rfloor$, then $\varphi_1(1)=K\neq K-1$ even though $1\in E_{K-1}$.

## Acknowledgments

Z.S. is supported by the Tan Chin Tuan Centennial Professorship. H.Y. was partially supported by the U.S. National Science Foundation under award DMS-1945029.

## References

*Learning and generalization in overparameterized neural networks, going beyond two layers*

*Dokl. Akad. Nauk SSSR*

*Proceedings of the ICML*

*Approximation analysis of convolutional neural networks*

*IEEE Transactions on Information Theory*

*Neural Computation*

*Estimating or propagating gradients through stochastic neurons for conditional computation.*

*SIAM Journal on Mathematics of Data Science*

*Quantized neural networks: Characterization and holistic optimization*

*Constructive Approximation*

*Generalization bounds of stochastic gradient descent for wide and deep neural networks.*

*A consensus-based global optimization method for high dimensional machine learning problems*

*Mathematical Methods in the Applied Sciences*

*Advances in neural information processing systems, 32*

*How much over-parameterization is sufficient to learn deep ReLU networks?*

*Frontiers in Applied Mathematics and Statistics*

*Mathematics of Control, Signals, and Systems*

*Manuscripta Math.*

*Approximation spaces of deep neural networks.*

*Error bounds for approximations with deep ReLU neural networks in W$s,p$ norms*. arXiv:1902.07896.

*Neurocomputing*

*Proceedings of Machine Learning Research*

*Scientific American*

*Neural Networks*

*J. Mach. Learn. Res.*

*IEEE Transactions on Neural Networks*

*Neural tangent kernel: Convergence and generalization in neural networks.*

*Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks*

*Proceedings of the International Conference on Neural Networks*

*Science*

*Dokl. Akad. Nauk SSSR*

*Dokl. Akad. Nauk SSSR*

*Neural Networks*

*Deep learning via dynamical systems: An approximation perspective.*

*Why deep neural networks?*

*Proceedings of the 2019 International Conference on Data Mining Workshops*

*Deep network approximation for smooth functions*

*Two-layer neural networks for partial differential equations: Optimization and generalization theory*

*Neurocomputing*

*SIAM Journal on Mathematics of Data Science*

*Neural Networks*

*Journal of Computational Mathematics*

*Adaptive approximation and estimation of deep neural network with intrinsic dimensionality*

*Comput. J.*

*Exponential ReLU DNN expression of holomorphic maps in high dimension*

*Neural Networks*

*Mathematical Models and Methods in Applied Sciences*

*International Journal of Automation and Computing*

*Neural Networks*

*Communications in Computational Physics*

*Proceedings of the International Conference on Learning Representations*

*Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition*

*Communications in Mathematical Sciences*

*Exponential convergence of the deep neural network approximation for analytic functions.*

*Representation formulas and pointwise properties for Barron functions*

*Approximation in shift-invariant spaces with deep ReLU neural networks.*

*Neural Networks*

*Proceedings of Machine Learning Research*

*The phase diagram of approximation rates for deep neural networks.*

*Understanding straight-through estimator in training activation quantized neural nets.*

*Applied and Computational Harmonic Analysis*