The possibility of approximating a continuous function on a compact subset of the real line by a feedforward single hidden layer neural network with a sigmoidal activation function has been studied in many papers. Such networks can approximate an arbitrary continuous function provided that an unlimited number of neurons in a hidden layer is permitted. In this note, we consider constructive approximation on any finite interval of ℝ by neural networks with only one neuron in the hidden layer. We construct algorithmically a smooth, sigmoidal, almost monotone activation function σ providing approximation to an arbitrary continuous function within any degree of accuracy. This algorithm is implemented in a computer program, which computes the value of σ at any reasonable point of the real axis.
Neural networks are being successfully applied across an extraordinary range of problem domains, in fields as diverse as computer science, finance, medicine, engineering, and physics. The main reason for such popularity is their ability to approximate arbitrary functions. Over the past 30 years, a number of results have been published showing that the artificial neural network called a feedforward network with one hidden layer can approximate arbitrarily well any continuous function of several real variables. These results play an important role in determining the boundaries of efficacy of the considered networks. But the proofs usually do not state how many neurons should be used in the hidden layer. The purpose of this note is to prove constructively that a neural network having only one neuron in its single hidden layer can approximate arbitrarily well all continuous functions defined on any compact subset of the real axis.
In approximation by neural networks, there are two main problems. The first is the density problem of determining the conditions under which an arbitrary target function can be approximated arbitrarily well by neural networks. The second problem, called the complexity problem, is to determine how many neurons in hidden layers are necessary to give a prescribed degree of approximation. This problem is almost the same as the problem of degree of approximation (see Barron, 1993; Cao, Xie, & Xu, 2008; Hahm & Hong, 1999). The possibility of approximating a continuous function on a compact subset of the real line (or n-dimensional space) by a single hidden layer neural network with a sigmoidal activation function has been well studied in a number of papers. Different methods were used. Carroll and Dickinson (1989) used the inverse Radon transformation to prove the universal approximation property of single hidden layer neural networks. Gallant and White (1988) constructed a specific continuous, nondecreasing sigmoidal function, called a cosine squasher, from which it was possible to obtain any Fourier series. Thus, their activation function had the density property. Cybenko (1989) and Funahashi (1989), independently of each other, established that feedforward neural networks with a continuous sigmoidal activation function can approximate any continuous function within any degree of accuracy on compact subsets of ℝ^n. Cybenko’s proof uses the functional analysis method, combining the Hahn–Banach theorem and the Riesz representation theorem, while Funahashi’s proof applies the result of Irie and Miyake (1988) on the integral representation of functions, using a kernel that can be expressed as a difference of two sigmoidal functions. Hornik, Stinchcombe, and White (1989) applied the Stone–Weierstrass theorem, using trigonometric functions.
Kůrková (1992) proved that staircase-like functions of any sigmoidal type can approximate continuous functions on any compact subset of the real line within an arbitrary accuracy. This is effectively used in Kůrková’s subsequent results, which show that a continuous multivariate function can be approximated arbitrarily well by two hidden layer neural networks with a sigmoidal activation function (see Kůrková, 1991, 1992).
Chen, Chen, and Liu (1992) extended the result of Cybenko by proving that any continuous function on a compact subset of ℝ^n can be approximated by a single hidden layer feedforward network with a bounded (not necessarily continuous) sigmoidal activation function. Almost the same result was independently obtained by Jones (1990).
Costarelli and Spigler (2013) reconsidered Cybenko’s approximation theorem and, for a given function f ∈ C[a, b], constructed certain sums of the form (1.1), which approximate f within any degree of accuracy. In their result, similar to that of Chen et al. (1992), σ is bounded and sigmoidal. Therefore, when σ ∈ C(ℝ), the result can be viewed as a density result in C[a, b] for the set of all functions of the form (1.1).
Chui and Li (1992) proved that a single hidden layer network with a continuous sigmoidal activation function having integer weights and thresholds can approximate an arbitrary continuous function on a compact subset of ℝ^n. Ito (1991) established a density result for continuous functions on a compact subset of ℝ^n by neural networks with a sigmoidal function having only unit weights. Density properties of a single hidden layer network with a restricted set of weights were also studied in other papers (for a detailed discussion, see Ismailov, 2012).
In many subsequent papers that dealt with the density problem, nonsigmoidal activation functions were allowed. Among them are the papers by Stinchcombe and White (1990), Cotter (1990), Hornik (1991), Mhaskar and Micchelli (1992), and other researchers. The most general result in this direction belongs to Leshno, Lin, Pinkus, and Schocken (1993). They proved that the necessary and sufficient condition for any continuous activation function to have the density property is that it not be a polynomial. (For a detailed discussion of most of the results in this section, see the review paper by Pinkus, 1999.)
It should be remarked that in all the works mentioned above, the number of neurons r in the hidden layer is not fixed. As such, to achieve a desired precision, one may take an excessive number of neurons. This, in turn, gives rise to the problem of complexity (see above).
Our approach to the problem of approximation by single hidden layer feedforward networks is different and quite simple. We consider networks (1.1) defined on ℝ with a limited number of neurons (r is fixed!) in a hidden layer and ask the following fair question: Is it possible to construct a well-behaved (i.e., sigmoidal, smooth, monotone) universal activation function providing approximation to arbitrary continuous functions on any compact set in ℝ within any degree of precision? We show that this is possible even in the case of a feedforward network with only one neuron in its hidden layer (i.e., in the case r = 1). The basic form of our theorem claims that there exists a smooth, sigmoidal, almost monotone activation function σ with the property: for each univariate continuous function f on the unit interval and any positive ε, one can choose three numbers c0, c1, and θ such that the function c0 + c1σ(x − θ) gives an ε-approximation to f. It should be remarked that we not only prove the existence result but also give an algorithm for constructing the universal sigmoidal function. For a wide class of Lipschitz continuous functions, we also give an algorithm for evaluating the numbers c0, c1, and θ.
The main theoretical idea behind the construction of σ is that polynomials with rational coefficients form a countable dense subset of C[0, 1]. Let u1, u2, ... be the sequence of these polynomials. By translating members of this sequence to the right 1, 3, 5, ... units, respectively, scaling them vertically, and adding offsets, we can construct a set of polynomials bounded (between 0 and 1) and almost monotone on the intervals [1, 2], [3, 4], [5, 6], ..., respectively. Now let s be a function defined on the union [1, 2] ∪ [3, 4] ∪ [5, 6] ∪ ... and coinciding on each interval with the corresponding polynomial. Then our σ can be obtained by a smooth extension of s to the whole real line in such a way that σ(x) → 0 as x → −∞ and σ(x) → 1 as x → +∞.
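The translate, scale, and offset step can be illustrated with a minimal sketch. It is not the authors' SageMath code: the names `horner` and `place_on_interval`, the band limits `lo` and `hi`, and the grid-based estimate of the range of the polynomial are all assumptions made for illustration.

```python
def horner(coeffs, t):
    """Evaluate u(t) = coeffs[0] + coeffs[1]*t + ... by Horner's rule."""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * t + c
    return acc


def place_on_interval(coeffs, m, lo, hi, samples=1001):
    """Return (s, a, b) with s(x) = a + b*u(x - (2m - 1)) for x in [2m - 1, 2m].

    The range of u on [0, 1] is estimated on a uniform grid (an assumption,
    not an exact computation) and mapped onto the middle half of [lo, hi],
    which leaves a margin for values between grid points.
    """
    values = [horner(coeffs, i / (samples - 1)) for i in range(samples)]
    u_min, u_max = min(values), max(values)
    margin = (hi - lo) / 4
    if u_max > u_min:
        b = (hi - lo - 2 * margin) / (u_max - u_min)  # vertical scale
        a = lo + margin - b * u_min                   # offset
    else:
        # Constant polynomial: any nonzero scale works; centre it in the band.
        b = 1.0
        a = (lo + hi) / 2 - u_min
    shift = 2 * m - 1  # translate u from [0, 1] to [2m - 1, 2m]
    return (lambda x: a + b * horner(coeffs, x - shift)), a, b
```

Because the map is affine with b ≠ 0, the original polynomial can be recovered from the placed copy as u(t) = (s(t + 2m − 1) − a) / b, which is the mechanism that lets a single shifted and rescaled copy of the activation function reproduce any polynomial in the sequence.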
For numerical experiments we used SageMath (Stein et al., 2015). We wrote code for creating the graph of σ and computing σ(x) at any reasonable x. (The code is available at http://sites.google.com/site/njguliyev/papers/sigmoidal.)
2 The Theoretical Result
We begin this section with the definition of a λ-increasing (λ-decreasing) function. Let λ be any nonnegative number. A real function f defined on (a, b) is called λ-increasing (λ-decreasing) if there exists an increasing (decreasing) function u: (a, b) → ℝ such that |f(x) − u(x)| ≤ λ, for all x ∈ (a, b). If u is strictly increasing (or strictly decreasing), then the above function f is called a λ-strictly increasing (or λ-strictly decreasing) function. Clearly, 0-monotonicity coincides with the usual concept of monotonicity, and a λ1-increasing function is λ2-increasing if λ1 ≤ λ2.
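To make the definition concrete, here is a small sketch that tests it on finitely many samples. It writes λ (`lam` in the code) for the nonnegative tolerance, reads "increasing" as nondecreasing, and only checks the sample points, so it is a numerical illustration rather than a proof. The function name `is_lambda_increasing` is hypothetical.

```python
from itertools import accumulate


def is_lambda_increasing(values, lam):
    """Check the definition on samples f(x_1), ..., f(x_n) with x_1 < ... < x_n.

    `values` must be a list (it is traversed twice). A nondecreasing u with
    |f - u| <= lam at every sample exists exactly when the running maximum
    of f - lam never exceeds f + lam.
    """
    if lam < 0:
        raise ValueError("lam must be nonnegative")
    running_max = accumulate((v - lam for v in values), max)
    return all(u <= v + lam for u, v in zip(running_max, values))
```

Equivalently, the largest drop f(x_i) − f(x_j) over pairs i < j must not exceed 2λ, and with λ = 0 the check reduces to ordinary monotonicity of the samples.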
The following theorem is valid.
Let α be any positive number. Divide the interval [α, +∞) into the segments [α, 2α], [2α, 3α], .... Let h(t) be any strictly increasing, infinitely differentiable function on [α, +∞) with the properties
1. 0 < h(t) < 1 for all t ∈ [α, +∞).
2. 1 − h(α) ≤ λ.
3. h(t) → 1, as t → +∞.
The existence of a strictly increasing smooth function satisfying these properties is easy to verify. Note that from conditions 1 to 3, it follows that any function σ(t) satisfying the inequality h(t) < σ(t) < 1 for all t ∈ [α, +∞) is λ-strictly increasing and σ(t) → 1, as t → +∞.
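As a sanity check on the existence claim, the sketch below gives one candidate lower-bound function. It assumes the notation α > 0 for the left endpoint and λ > 0 for the monotonicity tolerance, and it assumes the required properties are 0 < h(t) < 1, 1 − h(α) ≤ λ, and h(t) → 1 as t → +∞. The helper name `make_h` and this particular formula are illustrative choices, not necessarily the ones used in the construction.

```python
import math


def make_h(alpha, lam):
    """Return h(t) = 1 - min(1/2, lam) / (1 + log(t - alpha + 1)) for t >= alpha.

    h is smooth and strictly increasing on [alpha, +inf), stays in (0, 1),
    satisfies 1 - h(alpha) = min(1/2, lam) <= lam, and tends to 1.
    """
    if alpha <= 0 or lam <= 0:
        raise ValueError("alpha and lam must be positive")
    c = min(0.5, lam)

    def h(t):
        if t < alpha:
            raise ValueError("h is only defined for t >= alpha")
        return 1.0 - c / (1.0 + math.log(t - alpha + 1.0))

    return h
```

The cap at 1/2 keeps h(α) positive even when λ ≥ 1, and the logarithm makes h approach 1 slowly, which leaves room between h and 1 on every segment.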
At the second stage, we define σ on the intervals [2mα, (2m + 1)α], m = 1, 2, ..., so that it is in C^∞(ℝ) and satisfies inequality 2.2. Finally, in all of (−∞, α), we define σ while maintaining the strict monotonicity property and in such a way that σ(t) → 0 as t → −∞. We obtain from the properties of h and condition 2.2 that σ is a λ-strictly increasing function on the interval [α, +∞) and σ(t) → 1, as t → +∞. Note that the construction of a σ obeying all the above conditions is feasible. We show this in the next section.
The idea of using a limited number of neurons in hidden layers of a feedforward network was first implemented by Maiorov and Pinkus (1999). They proved the existence of a sigmoidal, strictly increasing, analytic activation function σ such that two hidden layer neural networks with this activation function and a fixed number of neurons in each hidden layer can approximate any continuous multivariate function over the unit cube in ℝ^n. Note that the result is of theoretical value; the authors do not suggest constructing and using their sigmoidal function. Using the techniques developed in Maiorov and Pinkus (1999), Ismailov showed theoretically that if we replace the demand of analyticity by smoothness and monotonicity by λ-monotonicity, then the number of neurons in hidden layers can be reduced substantially (see Ismailov, 2014). We stress again that in both papers, the algorithmic implementation of the obtained results is not discussed or illustrated by numerical examples.
In the next section, we propose an algorithm for computing the sigmoidal function σ at any point of the real axis. (The code of this algorithm is available at http://sites.google.com/site/njguliyev/papers/sigmoidal.) As examples, we include in this note the graph of σ (see Figure 1) and a numerical table (see Table 1) containing several computed values of this function.
3 Algorithmic Construction of the Universal Sigmoidal Function
3.1 Step 1
3.2 Step 2
3.3 Step 3
Enumerating the polynomials with rational coefficients. It is clear that every positive rational number determines a unique finite continued fraction [a0; a1, ..., al] with a0 ≥ 0, a1, ..., a(l−1) ≥ 1, and al ≥ 2.
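The continued fraction expansion mentioned here can be computed with the Euclidean algorithm. The following sketch uses Python's `fractions.Fraction` for exact arithmetic; the function names are hypothetical, and the code shows only the expansion and its inverse, not the full enumeration of polynomials.

```python
from fractions import Fraction


def continued_fraction(q):
    """Return the terms [a0, a1, ..., al] of the finite continued fraction of q > 0."""
    q = Fraction(q)
    if q <= 0:
        raise ValueError("q must be a positive rational number")
    terms = []
    n, d = q.numerator, q.denominator
    while d:
        a, r = divmod(n, d)  # one step of the Euclidean algorithm
        terms.append(a)
        n, d = d, r
    return terms


def from_continued_fraction(terms):
    """Rebuild the rational number a0 + 1/(a1 + 1/(... + 1/al)) exactly."""
    value = Fraction(terms[-1])
    for a in reversed(terms[:-1]):
        value = a + 1 / value
    return value
```

For example, 3/8 expands to the terms 0, 2, 1, 2. The Euclidean algorithm always ends with a last term of at least 2 unless q is an integer, in which case the expansion is the single term a0, and this is what makes the representation unique.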
3.4 Step 4
3.5 Step 5
3.6 Step 6
Step 6 completes the construction of the universal activation function σ, which satisfies theorem 1.
4 The Algorithm for Evaluating the Numbers c0, c1, and θ
Although theorem 1 is valid for all continuous functions, in practice it is quite difficult to calculate c0, c1, and θ in theorem 1 algorithmically for badly behaved continuous functions. The main difficulty arises in designing an efficient algorithm that constructs an approximating polynomial within any given degree of accuracy. But for certain large classes of well-behaved functions, the computation of the above numbers can be done. In this section, we show this for the class of Lipschitz continuous functions.
Assume that f is a Lipschitz continuous function on [a, b] with a Lipschitz constant L. In order to find the parameters c0, c1, and θ algorithmically, it is sufficient to perform the following steps.
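To see why the Lipschitz assumption makes the polynomial approximation step computable, consider the following sketch. It uses Bernstein polynomials as one standard option, which the text here does not confirm as the method actually used. It assumes f has already been rescaled to [0, 1] and that L is the Lipschitz constant of the rescaled function. The sketch relies on the classical bound |B_n f(x) − f(x)| ≤ L / (2√n); the function names are hypothetical.

```python
import math


def bernstein_degree(L, eps):
    """Return a degree n with L / (2*sqrt(n)) <= eps, sufficient for Lipschitz f on [0, 1]."""
    if L < 0 or eps <= 0:
        raise ValueError("need L >= 0 and eps > 0")
    return max(1, math.ceil((L / (2.0 * eps)) ** 2))


def bernstein(f, n):
    """Return the degree-n Bernstein polynomial of f on [0, 1] as a callable."""
    samples = [f(k / n) for k in range(n + 1)]

    def B(x):
        # B_n f(x) = sum_k f(k/n) * C(n, k) * x^k * (1 - x)^(n - k)
        return sum(
            samples[k] * math.comb(n, k) * x ** k * (1 - x) ** (n - k)
            for k in range(n + 1)
        )

    return B
```

The degree grows like (L / ε)², so the approximation is guaranteed but not efficient. Obtaining rational coefficients would additionally require sampling f at rational values or rounding, which this sketch does not attempt.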
4.1 Step 1
4.2 Step 2
4.3 Step 3
4.4 Step 4
4.5 Step 5
4.6 Step 6
4.7 Step 7
Evaluating the numbers c0, c1, and θ. In this step, we return to our original function f and calculate the numbers c0, c1, and θ (see theorem 1). The numbers c0 and c1 have been calculated above (they are the same for both g and f). In order to find θ, we use the formula .
Note that some computational difficulties may arise while implementing the above algorithm on standard computers. For some functions, the index m of a polynomial um in step 5 may be extraordinarily large. In this case, a computer is not capable of producing this number and hence cannot produce the numbers c0, c1, and θ.