Neural networks are versatile tools for computation, having the ability to approximate a broad range of functions. An important problem in the theory of deep neural networks is expressivity; that is, we want to understand the functions that are computable by a given network. We study real, infinitely differentiable (smooth) hierarchical functions implemented by feedforward neural networks via composing simpler functions in two cases: (1) each constituent function of the composition has fewer inputs than the resulting function and (2) constituent functions are in the more specific yet prevalent form of a nonlinear univariate function (e.g., tanh) applied to a linear multivariate function. We establish that in each of these regimes, there exist nontrivial algebraic partial differential equations (PDEs) that are satisfied by the computed functions. These PDEs are purely in terms of the partial derivatives and are dependent only on the topology of the network. Conversely, we conjecture that such PDE constraints, once accompanied by appropriate nonsingularity conditions and perhaps certain inequalities involving partial derivatives, guarantee that the smooth function under consideration can be represented by the network. The conjecture is verified in numerous examples, including the case of tree architectures, which are of neuroscientific interest. Our approach is a step toward formulating an algebraic description of functional spaces associated with specific neural networks, and may provide useful new tools for constructing neural networks.

1.1  Motivation

A central problem in the theory of deep neural networks is to understand the functions that can be computed by a particular architecture (Raghu, Poole, Kleinberg, Ganguli, & Dickstein, 2017; Poggio, Banburski, & Liao, 2019). Such functions are typically superpositions of simpler functions, that is, compositions of functions of fewer variables. This article aims to study superpositions of real smooth (i.e., infinitely differentiable or $C^\infty$) functions that are constructed hierarchically (see Figure 3). Our core thesis is that such functions (also referred to as hierarchical or compositional interchangeably) are constrained in the sense that they satisfy certain partial differential equations (PDEs). These PDEs are dependent only on the topology of the network and could be employed to characterize smooth functions computable by a given network.
Figure 1:

The architecture on the left (studied in example 1) can compute functions of the form $g(f(x,y),z)$ as in the middle. They involve the smaller class of functions of the form $g(w_3f(w_1x+w_2y+b_1)+w_4z+b_2)$ on the right.

Figure 2:

Implementations of superpositions of the form F(x,y,z)=g(f(x,y),h(x,z)) (studied in examples 2 and 7) by three-layer neural networks.

Figure 3:

The neural network on the left can compute the hierarchical function $F(x_1,x_2,x_3)=f^{(3)}_1\Big(f^{(2)}_1\big(f^{(1)}_1(x_1,x_2),f^{(1)}_2(x_2,x_3)\big),\,f^{(2)}_2\big(f^{(1)}_2(x_2,x_3),f^{(1)}_3(x_3,x_1)\big)\Big)$ once appropriate functions are assigned to its nodes as on the right.


1.1.1  Example 1

One of the simplest examples of a superposition is when a trivariate function is obtained from composing two bivariate functions; for instance, let us consider the composition
$$F(x,y,z)=g(f(x,y),z)$$
(1.1)
of functions $f=f(x,y)$ and $g=g(u,z)$ that can be computed by the network in Figure 1. Assuming that all functions appearing here are twice continuously differentiable (or $C^2$), the chain rule yields
$$F_x=g_uf_x,\quad F_y=g_uf_y.$$
If either $F_x$ or $F_y$, say the former, is nonzero, the equations above imply that the ratio between $F_y$ and $F_x$ is independent of $z$:
$$\frac{F_y}{F_x}=\frac{f_y}{f_x}.$$
(1.2)
Therefore, its derivative with respect to $z$ must be identically zero:
$$\left(\frac{F_y}{F_x}\right)_z=\frac{F_{yz}F_x-F_{xz}F_y}{(F_x)^2}=0.$$
(1.3)
This amounts to
$$F_{yz}F_x=F_{xz}F_y,$$
(1.4)
an equation that always holds for functions of form 1.1. Notice that one may readily exhibit functions that do not satisfy the necessary PDE constraint $F_{xz}F_y=F_{yz}F_x$ and so cannot be brought into form 1.1, for example,$^{1}$
$$xyz+x+y+z.$$
(1.5)
Conversely, if the constraint $F_{yz}F_x=F_{xz}F_y$ is satisfied and $F_x$ (or $F_y$) is nonzero, we can reverse this process to obtain a local expression of the form 1.1 for $F(x,y,z)$. By interpreting the constraint as the independence of $\frac{F_x}{F_y}$ of $z$, one can devise a function $f=f(x,y)$ whose ratio of partial derivatives coincides with $\frac{F_x}{F_y}$ (this is a calculus fact; see theorem 5). Now that equation 1.2 is satisfied, the gradient of $F$ may be written as
$$\nabla F=\begin{bmatrix}F_x\\F_y\\F_z\end{bmatrix}=\frac{F_x}{f_x}\begin{bmatrix}f_x\\f_y\\0\end{bmatrix}+F_z\begin{bmatrix}0\\0\\1\end{bmatrix},$$
that is, as a linear combination of gradients of $f(x,y)$ and $z$. This guarantees that $F(x,y,z)$ is (at least locally) a function of the latter two (see the discussion at the beginning of section 3). So there exists a bivariate function $g$ defined on a suitable domain with $F(x,y,z)=g(f(x,y),z)$. Later in the article, we generalize this toy example to a characterization of superpositions computed by tree architectures (see theorem 3).
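The necessity of constraint 1.4 can be sanity-checked numerically. The following is a minimal sketch, not part of the original derivation: it approximates the partial derivatives by central finite differences and evaluates the residual of constraint 1.4 for one superposition of form 1.1 and for the function 1.5. The helper name `mixed_residual`, the particular choices of f and g, the step size, and the evaluation point are all illustrative assumptions.

```python
import math


def mixed_residual(F, x, y, z, h=1e-3):
    """Approximate F_yz*F_x - F_xz*F_y at (x, y, z) by central differences."""
    Fx = (F(x + h, y, z) - F(x - h, y, z)) / (2 * h)
    Fy = (F(x, y + h, z) - F(x, y - h, z)) / (2 * h)
    Fxz = (F(x + h, y, z + h) - F(x + h, y, z - h)
           - F(x - h, y, z + h) + F(x - h, y, z - h)) / (4 * h * h)
    Fyz = (F(x, y + h, z + h) - F(x, y + h, z - h)
           - F(x, y - h, z + h) + F(x, y - h, z - h)) / (4 * h * h)
    return Fyz * Fx - Fxz * Fy


def composed(x, y, z):
    # A function of form 1.1 with the (assumed) choices
    # f(x, y) = sin(x) + x*y and g(u, z) = tanh(u*z) + u.
    u = math.sin(x) + x * y
    return math.tanh(u * z) + u


def counterexample(x, y, z):
    # The function 1.5; its exact residual is x - y.
    return x * y * z + x + y + z


point = (0.3, -0.7, 0.5)
r_composed = mixed_residual(composed, *point)        # close to zero
r_counter = mixed_residual(counterexample, *point)   # close to x - y = 1.0
```

For the function 1.5 the residual can be computed by hand as $x(yz+1)-y(xz+1)=x-y$, which is why the second value stays away from zero at the chosen point. A numerical check of this kind only probes the constraint at sampled points and is no substitute for the symbolic argument.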
Functions appearing in the context of neural networks are more specific than a general superposition such as equation 1.1; they are predominantly constructed by composing univariate nonlinear activation functions and multivariate linear functions defined by weights and biases. In the case of a trivariate function F(x,y,z), we should replace the representation g(f(x,y),z) studied so far with
$$F(x,y,z)=g(w_3f(w_1x+w_2y+b_1)+w_4z+b_2).$$
(1.6)
Assuming that activation functions $f$ and $g$ are differentiable, now new constraints of the form 1.3 are imposed. The ratio $\frac{F_y}{F_x}$ is equal to $\frac{w_2}{w_1}$, hence it is not only independent of $z$ as equation 1.3 suggests, but indeed a constant function. So we arrive at
$$\left(\frac{F_y}{F_x}\right)_x=\left(\frac{F_y}{F_x}\right)_y=\left(\frac{F_y}{F_x}\right)_z=0,$$
or, equivalently,
$$F_{xy}F_x=F_{xx}F_y,\quad F_{yy}F_x=F_{xy}F_y,\quad F_{yz}F_x=F_{xz}F_y.$$
Again, these equations characterize differentiable functions of the form 1.6; this is a special case of theorem 7 below.
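As a worked check of why the ratio is constant, the chain rule can be applied directly to equation 1.6. The shorthand $s$ and $t$ for the two pre-activations is introduced here only for illustration and does not appear in the original text.

```latex
\begin{align*}
s &:= w_1x+w_2y+b_1, \qquad t := w_3 f(s)+w_4 z+b_2,\\
F_x &= g'(t)\,w_3\,f'(s)\,w_1, \qquad
F_y = g'(t)\,w_3\,f'(s)\,w_2, \qquad
F_z = g'(t)\,w_4,\\
\frac{F_y}{F_x} &= \frac{w_2}{w_1} \quad \text{wherever } F_x\neq 0 .
\end{align*}
```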

1.1.2  Example 2

The preceding example dealt with compositions of functions with disjoint sets of variables, and this facilitated our calculations. But this is not the case for compositions constructed by most neural networks; for example, networks may be fully connected or may have repeated inputs. For instance, let us consider a superposition of the form
F(x,y,z)=g(f(x,y),h(x,z))
(1.7)
of functions f(x,y), h(x,z), and g(u,v) as implemented in Figure 2. Applying the chain rule tends to be more complicated than the case of equation 1.1 and results in identities
$$F_x=g_uf_x+g_vh_x,\quad F_y=g_uf_y,\quad F_z=g_vh_z.$$
(1.8)
Nevertheless, it is not hard to see that there are again (perhaps cumbersome) nontrivial PDE constraints imposed on the hierarchical function $F$, a fact that will be established generally in theorem 1. To elaborate, notice that identities in equation 1.8 together imply
$$F_x=A(x,y)F_y+B(x,z)F_z,$$
(1.9)
where $A:=\frac{f_x}{f_y}$ and $B:=\frac{h_x}{h_z}$ are independent of $z$ and $y$, respectively. Repeatedly differentiating this identity (if possible) with respect to $y,z$ results in linear dependence relations between partial derivatives of $F$ (and hence PDEs) since the number of partial derivatives of $F_x$ of order at most $n$ with respect to $y,z$ grows quadratically with $n$, while on the right-hand side, the number of possibilities for coefficients (partial derivatives of $A$ and $B$ with respect to $y$ and $z$, respectively) grows only linearly. Such dependencies could be encoded by the vanishing of determinants of suitable matrices formed by partial derivatives of $F$. In example 7, by pursuing the strategy just mentioned, we complete this treatment of superpositions 1.7 by deriving the corresponding characteristic PDEs that are necessary and (in a sense) sufficient conditions on $F$ that it be in the form of equation 1.7. Moreover, in order to be able to differentiate several times, we shall assume that all functions are smooth (or $C^\infty$) hereafter.
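The step from equation 1.8 to equation 1.9 can be spelled out as follows. This is a short sketch that assumes $f_y$ and $h_z$ do not vanish, so that division is allowed; that assumption is made here for illustration.

```latex
\begin{align*}
g_u &= \frac{F_y}{f_y}, \qquad g_v = \frac{F_z}{h_z}
  && \text{(from the last two identities in 1.8)},\\
F_x &= g_u f_x + g_v h_x
     = \frac{f_x}{f_y}\,F_y + \frac{h_x}{h_z}\,F_z
     = A(x,y)\,F_y + B(x,z)\,F_z .
\end{align*}
```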

1.2  Statements of Main Results

Fixing a neural network hierarchy for composing functions, we shall prove that once the constituent functions of corresponding superpositions have fewer inputs (lower arity), there exist universal algebraic partial differential equations (algebraic PDEs) that have these superpositions as their solutions. A conjecture, which we verify in several cases, states that such PDE constraints characterize a generic smooth superposition computable by the network. Here, genericity means a nonvanishing condition imposed on an algebraic expression of partial derivatives. Such a condition has already occurred in example 1 where, in the proof of the sufficiency of equation 1.4 for the existence of a representation of the form 1.1 for a function $F(x,y,z)$, we assumed either $F_x$ or $F_y$ is nonzero. Before proceeding with the statements of main results, we formally define some of the terms that have appeared so far.

Terminology

  • We take all neural networks to be feedforward. A feedforward neural network is an acyclic hierarchical layer-to-layer scheme of computation. We also include residual networks (ResNets) in this category: an identity function in a layer could be interpreted as a jump in layers. Tree architectures are recurring examples of this kind. We shall always assume that in the first layer, the inputs are labeled by (not necessarily distinct) labels chosen from coordinate functions $x_1,\dots,x_n$, and there is only one node in the output layer. Assigning functions to nodes in layers above the input layer implements a real scalar-valued function $F=F(x_1,\dots,x_n)$ as the superposition of functions appearing at nodes (see Figure 3).

  • In our setting, an algebraic PDE is a nontrivial polynomial relation such as
    $$\Phi\big(F_{x_1},\dots,F_{x_n},F_{x_1^2},F_{x_1x_2},\dots,F_{x^\alpha},\dots\big)=0$$
    (1.10)
    among the partial derivatives (up to a certain order) of a smooth function $F=F(x_1,\dots,x_n)$. Here, for a tuple $\alpha:=(\alpha_1,\dots,\alpha_n)$ of nonnegative integers, the partial derivative $\frac{\partial^{\alpha_1+\dots+\alpha_n}F}{\partial x_1^{\alpha_1}\cdots\partial x_n^{\alpha_n}}$ (which is of order $|\alpha|:=\alpha_1+\dots+\alpha_n$) is denoted by $F_{x^\alpha}$. For instance, asking for a polynomial expression of partial derivatives of $F$ to be constant amounts to $n$ algebraic PDEs given by setting the first-order partial derivatives of that expression with respect to $x_1,\dots,x_n$ to be zero.
  • A nonvanishing condition imposed on smooth functions $F=F(x_1,\dots,x_n)$ is asking for these functions not to satisfy a particular algebraic PDE, namely,
    $$\Psi\big(F_{x_1},\dots,F_{x_n},F_{x_1^2},F_{x_1x_2},\dots,F_{x^\alpha},\dots\big)\neq 0,$$
    (1.11)
    for a nonconstant polynomial $\Psi$. Such a condition could be deemed pointwise since if it holds at a point $p\in\mathbb{R}^n$, it persists throughout a small enough neighborhood. Moreover, equation 1.11 determines an open dense subset of the functional space; so, it is satisfied generically.

Theorem 1.

Let N be a feedforward neural network in which the number of inputs to each node is less than the total number of distinct inputs to the network. Superpositions of smooth functions computed by this network satisfy nontrivial constraints in the form of certain algebraic PDEs that are dependent only on the topology of N.

In the context of deep learning, the functions applied at each node are in the form of
$$\mathbf{y}\mapsto\sigma\big(\langle\mathbf{w},\mathbf{y}\rangle\big);$$
(1.12)
that is, they are obtained by applying an activation function $\sigma$ to a linear functional $\mathbf{y}\mapsto\langle\mathbf{w},\mathbf{y}\rangle$. Here, as usual, the bias term is absorbed into the weight vector. The bias term could also be excluded via composing $\sigma$ with a translation since throughout our discussion, the only requirement for a function $\sigma$ to be the activation function of a node is smoothness, and activation functions are allowed to vary from one node to another. In our setting, $\sigma$ in equation 1.12 could be a polynomial or a sigmoidal function such as hyperbolic tangent or logistic functions, but not ReLU or maxout activation functions. We shall study functions computable by neural networks as either superpositions of arbitrary smooth functions or as superpositions of functions of the form 1.12, which is a more limited regime. Indeed, the question of how well arbitrary compositional functions, which are the subject of theorem 1, may be approximated by a deep network has been studied in the literature (Mhaskar, Liao, & Poggio, 2017; Poggio, Mhaskar, Rosasco, Miranda, & Liao, 2017).

In order to guarantee the existence of PDE constraints for superpositions, theorem 1 assumes a condition on the topology of the network. However, theorem 2 states that by restricting the functions that can appear in the superposition, one can still obtain PDE constraints even for a fully connected multilayer perceptron:

Theorem 2.

Let N be an arbitrary feedforward neural network with at least two distinct inputs, with smooth functions of the form 1.12 applied at its nodes. Any function computed by this network satisfies nontrivial constraints in the form of certain algebraic PDEs that are dependent only on the topology of N.

1.2.1  Example 3

As the simplest example of PDE constraints imposed on compositions of functions of the form 1.12, recall that d'Alembert's solution to the wave equation,
$$u_{tt}=c^2u_{xx},$$
(1.13)
is famously given by superpositions of the form $f(x+ct)+g(x-ct)$. This function can be implemented by a network with two inputs $x,t$ and with one hidden layer in which the activation functions $f,g$ are applied (see Figure 4). Since we wish for a PDE that works for this architecture universally, we should get rid of $c$. The PDE 1.13 may be written as $\frac{u_{tt}}{u_{xx}}=c^2$; that is, the ratio $\frac{u_{tt}}{u_{xx}}$ must be constant. Hence, for our purposes, the wave equation should be written as $\left(\frac{u_{tt}}{u_{xx}}\right)_x=\left(\frac{u_{tt}}{u_{xx}}\right)_t=0$, or equivalently,
$$u_{xtt}u_{xx}-u_{tt}u_{xxx}=0,\quad u_{ttt}u_{xx}-u_{tt}u_{xxt}=0.$$
A crucial point to notice is that the constant $c^2$ is nonnegative; thus an inequality of the form $u_{xx}u_{tt}\geq 0$ or $\frac{u_{xx}}{u_{tt}}\geq 0$ is imposed as well. In example 11, we visit this network again and study functions of the form
F(x,t)=σ(a''f(ax+bt)+b''g(a'x+b't))
(1.14)
via a number of equalities and inequalities involving partial derivatives of F.
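For reference, the computation behind the d'Alembert discussion in example 3 can be written out in full. This is a routine chain-rule sketch added for illustration; it shows both the constancy of the ratio and the origin of the sign condition.

```latex
\begin{align*}
u(x,t) &= f(x+ct)+g(x-ct),\\
u_{xx} &= f''(x+ct)+g''(x-ct),\\
u_{tt} &= c^2 f''(x+ct)+c^2 g''(x-ct) = c^2\,u_{xx},\\
u_{xx}u_{tt} &= c^2\,(u_{xx})^2 \;\geq\; 0 .
\end{align*}
```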
Figure 4:

The neural network on the left can compute the function F(x,t)=σ(a''f(ax+bt)+b''g(a'x+b't)) once, as on the right, the activation functions σ,f,g and appropriate weights are assigned to the nodes. Such functions are the subject of examples 3 and 11.


The preceding example suggests that smooth functions implemented by a neural network may be required to obey a nontrivial algebraic partial differential inequality (algebraic PDI). So it is convenient to have the following setup of terminology.

Terminology

  • An algebraic PDI is an inequality of the form
    $$\Theta\big(F_{x_1},\dots,F_{x_n},F_{x_1^2},F_{x_1x_2},\dots,F_{x^\alpha},\dots\big)>0$$
    (1.15)
    involving partial derivatives (up to a certain order) where Θ is a real polynomial.

Remark 1.

Without any loss of generality, we assume that the PDIs are strict since a nonstrict one such as $\Theta\geq 0$ could be written as the union of $\Theta>0$ and the algebraic PDE $\Theta=0$.

Theorem 1 and example 1 deal with superpositions of arbitrary smooth functions while theorem 2 and example 3 are concerned with superpositions of a specific class of smooth functions, functions of the form 1.12. In view of the necessary PDE constraints in both situations, the following question then arises: Are there sufficient conditions in the form of algebraic PDEs and PDIs that guarantee a smooth function can be represented, at least locally, by the neural network in question?

Conjecture 1.

Let N be a feedforward neural network whose inputs are labeled by the coordinate functions $x_1,\dots,x_n$. Suppose we are working in the setting of one of theorems 1 or 2. Then there exist

  • finitely many nonvanishing conditions $\Big\{\Psi_i\big((F_{x^\alpha})_{|\alpha|\leq r}\big)\neq 0\Big\}_i$

  • finitely many algebraic PDEs $\Big\{\Phi_j\big((F_{x^\alpha})_{|\alpha|\leq r}\big)=0\Big\}_j$

  • finitely many algebraic PDIs $\Big\{\Theta_k\big((F_{x^\alpha})_{|\alpha|\leq r}\big)>0\Big\}_k$

with the following property: For any arbitrary point $p\in\mathbb{R}^n$, the space of smooth functions $F=F(x_1,\dots,x_n)$ defined in a vicinity$^{1}$ of $p$ that satisfy $\Psi_i\neq 0$ at $p$ and are computable by N (in the sense of the regime under consideration) is nonvacuous and is characterized by PDEs $\Phi_j=0$ and PDIs $\Theta_k>0$.

To motivate the conjecture, notice that it claims the existence of functionals
$$F\mapsto\Psi_i\big((F_{x^\alpha})_{|\alpha|\leq r}\big),\quad F\mapsto\Phi_j\big((F_{x^\alpha})_{|\alpha|\leq r}\big),\quad F\mapsto\Theta_k\big((F_{x^\alpha})_{|\alpha|\leq r}\big),$$
which are polynomial expressions of partial derivatives, and hence continuous in the $C^r$-norm,$^{2}$ such that in the space of functions computable by N, the open dense$^{3}$ subset given by $\{\Psi_i\neq 0\}_i$ can be described in terms of finitely many equations and inequalities as the locally closed subset $\{\Phi_j=0\}_j\cap\{\Theta_k>0\}_k$. (Also see corollary 1.) The use of the $C^r$-norm here is novel. For instance, with respect to $L^p$-norms, the space of functions computable by N lacks such a description and often has undesirable properties like nonclosedness (Petersen, Raslan, & Voigtlaender, 2020). Besides, describing the functional space associated with a neural network N with finitely many equations and inequalities also has an algebraic motivation: it is reminiscent of the notion of a semialgebraic set from real algebraic geometry. To elaborate, take the activation functions to be polynomials. Such neural networks have been studied in the literature (Du & Lee, 2018; Soltanolkotabi, Javanmard, & Lee, 2018; Venturi, Bandeira, & Bruna, 2018; Kileel, Trager, & Bruna, 2019). By bounding the degrees of constituent functions of superpositions computed by a polynomial neural network, the functional space formed by these superpositions sits inside a finite-dimensional ambient space of real polynomials and is hence finite-dimensional and amenable to techniques of algebraic geometry. One can, for instance, in each degree associate a functional variety to a neural network N whose dimension could be interpreted as a measure of expressive power (Kileel et al., 2019). Our approach to describing real functions computable by neural networks via PDEs and PDIs has ramifications for the study of polynomial neural networks as well.
Indeed, if $F=F(x_1,\dots,x_n)$ is a polynomial, an algebraic PDE of the form 1.10 translates to a polynomial equation of the coefficients of $F$, and the condition that an algebraic PDI such as equation 1.15 is valid throughout $\mathbb{R}^n$ can again be described via equations and inequalities involving the coefficients of $F$ (see examples 12 and 13). A notable feature here is the claim of the existence of a universal characterization dependent only on the architecture from which a description as a semialgebraic set could be read off in any degree.

Conjecture 1 is settled in Farhoodi, Filom, Jones, and Körding (2019) for trees (a particular type of architecture) with distinct inputs, a situation in which no PDI is required, and the inequalities should be taken to be trivial. Throughout the article, the conjecture above will be established for a number of architectures; in particular, we shall characterize tree functions (cf. theorems 3 and 4 below).

1.3  Related Work

There is an extensive literature on the expressive power of neural networks. Although shallow networks with sigmoidal activation functions can approximate any continuous function on compact sets (Cybenko, 1989; Hornik, Stinchcombe, & White, 1989; Hornik, 1991; Mhaskar, 1996), this cannot be achieved without the hidden layer getting exponentially large (Eldan & Shamir, 2016; Telgarsky, 2016; Mhaskar et al., 2017; Poggio et al., 2017). Many articles thus try to demonstrate how the expressive power is affected by depth. This line of research draws on a number of different scientific fields including algebraic topology (Bianchini & Scarselli, 2014), algebraic geometry (Kileel et al., 2019), dynamical systems (Chatziafratis, Nagarajan, Panageas, & Wang, 2019), tensor analysis (Cohen, Sharir, & Shashua, 2016), Vapnik–Chervonenkis theory (Bartlett, Maiorov, & Meir, 1999), and statistical physics (Lin, Tegmark, & Rolnick, 2017). One approach is to argue that deeper networks are able to approximate or represent functions of higher complexity after defining a “complexity measure” (Bianchini & Scarselli, 2014; Montufar, Pascanu, Cho, & Bengio, 2014; Poole, Lahiri, Raghu, Sohl-Dickstein, & Ganguli, 2016; Telgarsky, 2016; Raghu et al., 2017). Another approach more in line with this article is to use the “size” of an associated functional space as a measure of representation power. This point of view is adopted in Farhoodi et al. (2019) by enumerating Boolean functions, and in Kileel et al. (2019) by regarding dimensions of functional varieties as such a measure.

A central result in the mathematical study of superpositions of functions is the celebrated Kolmogorov-Arnold representation theorem (Kolmogorov, 1957), which resolves (in the context of continuous functions) the thirteenth problem on Hilbert's famous list of 23 major mathematical problems (Hilbert, 1902). The theorem states that every continuous function $F(x_1,\dots,x_n)$ on the closed unit cube may be written as
$$F(x_1,\dots,x_n)=\sum_{i=1}^{2n+1}f_i\Big(\sum_{j=1}^{n}\phi_{i,j}(x_j)\Big)$$
(1.16)
for suitable continuous univariate functions $f_i$, $\phi_{i,j}$ defined on the unit interval. (See Vituškin and Henkin, 1967, chap. 1, or Vituškin, 2004, for a historical account.) In more refined versions of this theorem (Sprecher, 1965; Lorentz, 1966), the outer functions $f_i$ are arranged to be the same, and the inner ones $\phi_{i,j}$ are taken to be in the form of $\lambda_j\phi_i$ with $\lambda_j$'s and $\phi_i$'s independent of $F$. Based on the existence of such an improved representation, Hecht-Nielsen argued that any continuous function $F$ can be implemented by a three-layer neural network whose weights and activation functions are determined by the representation (Hecht-Nielsen, 1987). On the other hand, it is well known that even when $F$ is smooth, one cannot arrange for functions appearing in representation 1.16 to be smooth (Vituškin, 1964). As a matter of fact, there exist continuously differentiable functions of three variables that cannot be represented as sums of superpositions of the form $g(f(x,y),z)$ with $f$ and $g$ being continuously differentiable as well (Vituškin, 1954), whereas in the continuous category, one can write any trivariate continuous function as a sum of nine superpositions of the form $g(f(x,y),z)$ (Arnold, 2009b). Due to this emergence of nondifferentiable functions, it has been argued that Kolmogorov-Arnold's theorem is not useful for obtaining exact representations of functions via networks (Girosi & Poggio, 1989), although it may be used for approximation (Kůrková, 1991, 1992). More on algorithmic aspects of the theorem and its applications to the network theory can be found in Brattka (2007).
Focusing on a superposition
$$F=f^{(L)}_1\Big(f^{(L-1)}_1\big(f^{(L-2)}_{a_1}(\dots),\dots\big),\dots,f^{(L-1)}_j\big(f^{(L-2)}_{a_j}(\dots),\dots\big),\dots,f^{(L-1)}_{N_{L-1}}\big(f^{(L-2)}_{a_{N_{L-1}}}(\dots),\dots\big)\Big)$$
(1.17)
of smooth functions (which can be computed by a neural network as in Figure 3), the chain rule provides descriptions for partial derivatives of $F$ in terms of partial derivatives of functions $f^{(i)}_j$ that constitute the superposition. The key insight behind the proof of theorem 1 is that when the constituent functions $f^{(i)}_j$ have fewer variables than $F$, one may eliminate the derivatives of the $f^{(i)}_j$'s to obtain relations among partial derivatives of $F$. This idea of elimination has been utilized in Buck (1981b) and Rubel (1981) to prove the existence of universal algebraic differential equations whose $C^\infty$ solutions are dense in the space of continuous functions. The fact that there will be constraints imposed on derivatives of a function $F$ that is written as a superposition of differentiable functions was employed by Hilbert himself to argue that certain analytic functions of three variables are not superpositions of analytic functions of two variables (Arnold, 2009a, p. 28), and by Ostrowski to exhibit an analytic bivariate function that cannot be represented as a superposition of univariate smooth functions and multivariate algebraic functions due to the fact that it does not satisfy any nontrivial algebraic PDE (Vituškin, 2004, p. 14; Ostrowski, 1920). The novelty of our approach is to adapt this point of view to demonstrate theoretical limitations of smooth functions that neural networks compute either as a superposition as in theorem 1 or as compositions of functions of the form 1.12 as in theorem 2, and to try to characterize these functions via calculating PDE constraints that are sufficient too (cf. conjecture 1). Furthermore, necessary PDE constraints enable us to easily exhibit functions that cannot be computed by a particular architecture; see example 1. This is reminiscent of the famous Minsky XOR Theorem (Minsky & Papert, 2017).
An interesting nonexample from the literature is F(x,y,z)=xy+yz+zx which cannot be written as a superposition of the form 1.7 even in the continuous category (Pólya & Szegö, 1945; Buck, 1979, 1981a; von Golitschek, 1980; Arnold, 2009a).

To the best of our knowledge, the closest mentions of a characterization of a class of superpositions by necessary and sufficient PDE constraints in the literature are papers (Buck, 1979, 1981a) by R. C. Buck. The first one (along with its earlier version, Buck, 1976) characterizes superpositions of the form $g(f(x,y),z)$ in a fashion similar to example 1. Also in those papers, superpositions such as $g(f(x,y),h(x,z))$ (which appeared in example 2) are discussed, although only the existence of necessary PDE constraints is shown; see Buck (1979, lemma 7) and Buck (1981a, p. 141). We exhibit a PDE characterization for superpositions of this form in example 7. These papers also characterize sufficiently differentiable nomographic functions of the form $\sigma(f(x)+g(y))$ and $\sigma(f(x)+g(y)+h(z))$.

A special class of neural network architectures is provided by rooted trees where any output of a layer is passed to exactly one node from one of the layers above (see Figure 8). Investigating functions computable by trees is of neuroscientific interest because the morphology of the dendrites of a neuron processes information through a tree that is often binary (Kollins & Davenport, 2005; Gillette & Ascoli, 2015). Assuming that the inputs to a tree are distinct, in our previous work (Farhoodi et al., 2019), we have completely characterized the corresponding superpositions through formulating necessary and sufficient PDE constraints, a result that answers conjecture 1 in the affirmative for such architectures.

Figure 5:

Theorems 3 and 4 impose constraints 1.18 and 1.19 for any three leaves $x_i$, $x_j$, and $x_k$. In the former theorem, the constraint should hold whenever (as on the left) there exists a rooted full subtree separating $x_i$ and $x_j$ from $x_k$, while in the latter theorem, the constraint is imposed for certain other triples as well (as on the right).

Figure 6:

Theorem 4 imposes constraint 1.20 for any four leaves $x_i,x_{i'}$ and $x_j,x_{j'}$ that belong to two different rooted full subtrees emanating from a node.

Remark 2.

The characterization suggested by the theorem below is a generalization of example 1 which was concerned with smooth superpositions of the form 1.1. The characterization of such superpositions as solutions of PDE 1.4 has also appeared in a paper (Buck, 1979) that we were not aware of while writing (Farhoodi et al., 2019).

Figure 7:

A multilayer neural network can be expanded to a tree. The figure is adapted from Farhoodi et al. (2019).

Figure 8:

A tree architecture and the relevant terminology.

Theorem 3
(Farhoodi et al., 2019). Let T be a rooted tree with $n$ leaves that are labeled by the coordinate functions $x_1,\dots,x_n$. Let $F=F(x_1,\dots,x_n)$ be a smooth function implemented on this tree. Then for any three leaves of T corresponding to variables $x_i,x_j,x_k$ of $F$ with the property that there is a (rooted full) subtree of T containing the leaves $x_i,x_j$ while missing the leaf $x_k$ (see Figure 5), $F$ must satisfy
$$F_{x_ix_k}F_{x_j}=F_{x_jx_k}F_{x_i}.$$
(1.18)
Conversely, a smooth function $F$ defined in a neighborhood of a point $p\in\mathbb{R}^n$ can be implemented by the tree T provided that equation 1.18 holds for any triple $(x_i,x_j,x_k)$ of its variables with the above property and, moreover, the nonvanishing condition below is satisfied:
  • For any leaf $x_i$ with siblings, either $F_{x_i}(p)\neq 0$ or there is a sibling leaf $x_{i'}$ with $F_{x_{i'}}(p)\neq 0$.

This theorem was formulated in Farhoodi et al. (2019) for binary trees and in the context of analytic functions (and also that of Boolean functions). Nevertheless, the proof carries over to the more general setting above. Below, we formulate the analogous characterization of functions that trees compute via composing functions of the form 1.12. Proofs of theorems 3 and 4 are presented in section 4.
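To make the combinatorial condition in theorem 3 concrete, the following minimal sketch enumerates the triples $(x_i,x_j,x_k)$ for which constraint 1.18 is imposed. It assumes that a tree is encoded as nested tuples with string leaf labels and that a rooted full subtree means a node together with all of its descendants; the encoding and the function names `leaf_set` and `constraint_triples` are illustrative choices, not notation from the article.

```python
from itertools import combinations


def leaf_set(tree):
    """Return the set of leaf labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    labels = set()
    for child in tree:
        labels |= leaf_set(child)
    return labels


def constraint_triples(tree):
    """Collect triples (xi, xj, xk) such that some subtree rooted at a node
    of `tree` contains the leaves xi and xj but not the leaf xk."""
    all_leaves = leaf_set(tree)
    triples = set()

    def visit(node):
        if isinstance(node, str):
            return
        inside = leaf_set(node)
        outside = all_leaves - inside
        # Each pair inside the subtree is paired with each leaf outside it.
        for xi, xj in combinations(sorted(inside), 2):
            for xk in outside:
                triples.add((xi, xj, xk))
        for child in node:
            visit(child)

    visit(tree)
    return triples


# The tree of example 1: F(x1, x2, x3) = g(f(x1, x2), x3).
small = constraint_triples((("x1", "x2"), "x3"))
# A deeper tree: F = g3(g2(g1(x1, x2), x3), x4).
deeper = constraint_triples(((("x1", "x2"), "x3"), "x4"))
```

For the tree of example 1, the only triple produced is `("x1", "x2", "x3")`, which corresponds to equation 1.4 with $(x,y,z)$ renamed. Each pair is listed once in sorted order since equation 1.18 is symmetric under exchanging $x_i$ and $x_j$.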

Theorem 4.

Let T be a rooted tree admitting $n$ leaves that are labeled by the coordinate functions $x_1,\dots,x_n$. We formulate the following constraints on smooth functions $F=F(x_1,\dots,x_n)$:

  • For any two leaves $x_i$ and $x_j$ of T, we have
    $$F_{x_ix_k}F_{x_j}=F_{x_jx_k}F_{x_i}$$
    (1.19)
    for any other leaf $x_k$ of T that is not a leaf of a (rooted full) subtree that has exactly one of $x_i$ or $x_j$ (see Figure 5). In particular, equation 1.19 holds for any $x_k$ if the leaves $x_i$ and $x_j$ are siblings, and for any $x_i$ and $x_j$ if the leaf $x_k$ is adjacent to the root of T.
  • For any two (rooted full) subtrees $T_1$ and $T_2$ that emanate from a node of T (see Figure 6), we have
    $$F_{x_i}F_{x_j}\big(F_{x_ix_{i'}x_{j'}}F_{x_j}+F_{x_ix_{i'}}F_{x_jx_{j'}}-F_{x_ix_{j'}}F_{x_jx_{i'}}-F_{x_i}F_{x_jx_{i'}x_{j'}}\big)=\big(F_{x_ix_{i'}}F_{x_j}-F_{x_i}F_{x_jx_{i'}}\big)\big(F_{x_ix_{j'}}F_{x_j}+F_{x_i}F_{x_jx_{j'}}\big)$$
    (1.20)
    if $x_i$, $x_{i'}$ are leaves of $T_1$ and $x_j$, $x_{j'}$ are leaves of $T_2$.

These constraints are satisfied if $F(x_1,\dots,x_n)$ is a superposition of functions of the form $\mathbf{y}\mapsto\sigma\big(\langle\mathbf{w},\mathbf{y}\rangle\big)$ according to the hierarchy provided by T. Conversely, a smooth function $F$ defined on an open box-like region$^{4}$ $B\subseteq\mathbb{R}^n$ can be written as such a superposition on $B$ provided that the constraints 1.19 and 1.20 formulated above hold and, moreover, the nonvanishing conditions below are satisfied throughout $B$:

  • For any leaf $x_i$ with siblings, either $F_{x_i}\neq 0$ or there is a sibling leaf $x_{i'}$ with $F_{x_{i'}}\neq 0$;

  • For any leaf $x_i$ without siblings, $F_{x_i}\neq 0$.

The constraints that appeared in theorems 3 and 4 may seem tedious, but they can be rewritten more conveniently once the intuition behind them is explained. Assuming that partial derivatives do not vanish (a nonvanishing condition) so that division is allowed, equations 1.18 and 1.19 may be written as
$$\left(\frac{F_{x_i}}{F_{x_j}}\right)_{x_k}=0\;\Leftrightarrow\;\left(\frac{F_{x_i}}{F_{x_k}}\right)_{x_j}=\left(\frac{F_{x_j}}{F_{x_k}}\right)_{x_i},$$
(1.21)
while equation 1.20 is
$$\left(\frac{\left(\frac{F_{x_i}}{F_{x_j}}\right)_{x_{i'}}}{\frac{F_{x_i}}{F_{x_j}}}\right)_{x_{j'}}=0.$$
(1.22)
Equation 1.21 simply states that the ratio $\frac{F_{x_i}}{F_{x_j}}$ is independent of $x_k$. Notice that in comparison with theorem 3, theorem 7 requires the equation $F_{x_ix_k}F_{x_j}=F_{x_jx_k}F_{x_i}$ to hold in greater generality and for more triples $(x_i,x_j,x_k)$ of leaves (see Figure 5).$^{5}$ The second simplified equation, 1.22, holds once the function $\frac{F_{x_i}}{F_{x_j}}$ of $(x_1,\dots,x_n)$ may be split into a product such as
$$q_1(\dots,x_i,\dots,x_{i'},\dots)\,q_2(\dots,x_j,\dots,x_{j'},\dots).$$
Lemma 4 discusses the necessity and sufficiency of these equations for the existence of such a splitting.
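The easy direction, that a product splitting implies equation 1.22, can be verified in a few lines. The sketch below assumes that $q_1$ and $q_2$ are nowhere zero and uses only the fact that $q_1$ does not depend on $x_{j'}$ and $q_2$ does not depend on $x_{i'}$; it is an illustration and not the proof of lemma 4.

```latex
\begin{align*}
\frac{F_{x_i}}{F_{x_j}} &= q_1\,q_2
  \quad\text{with } q_2 \text{ independent of } x_{i'} \text{ and } q_1 \text{ independent of } x_{j'},\\
\frac{\big(F_{x_i}/F_{x_j}\big)_{x_{i'}}}{F_{x_i}/F_{x_j}}
  &= \frac{(q_1)_{x_{i'}}\,q_2}{q_1\,q_2}
   = \frac{(q_1)_{x_{i'}}}{q_1},\\
\left(\frac{(q_1)_{x_{i'}}}{q_1}\right)_{x_{j'}} &= 0 .
\end{align*}
```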
Remark 3.

A significant feature of theorem 7 is that once the appropriate conditions are satisfied on a box-like domain, the smooth function under consideration may be written as a superposition of the desired form on the entirety of that domain. On the contrary, theorem 3 is local in nature.

Aside from neuroscientific interest, studying tree architectures is important also because any neural network can be expanded into a tree network with repeated inputs through a procedure called TENN (the Tree Expansion of the Neural Network; see Figure 7). Tree architectures with repeated inputs are relevant in the context of neuroscience too because the inputs to neurons may be repeated (Schneider-Mizell et al., 2016; Gerhard, Andrade, Fetter, Cardona, & Schneider-Mizell, 2017). We have already seen an example of a network along with its TENN in Figure 2. Both networks implement functions of the form F(x,y,z)=g(f(x,y),h(x,z)). Even for this simplest example of a tree architecture with repeated inputs, the derivation of characteristic PDEs is computationally involved and will be done in example 7. This verifies conjecture 1 for the tree that appeared in Figure 2.

1.4  Outline of the Article

Theorems 1 and 2 are proven in section 2, where it is established that in each setting, there are necessary PDE conditions for expressibility of smooth functions by a neural network. In section 3 we verify conjecture 1 in several examples by characterizing computable functions via PDE constraints that are necessary and (given certain nonvanishing conditions) sufficient. This starts by studying tree architectures in section 3.1. In example 7, we finish our treatment of a tree function with repeated inputs initiated in example 2; moreover, we present a number of examples to exhibit the key ideas of the proofs of theorems 3 and 4, which are concerned with tree functions with distinct inputs. The section then switches from trees to other neural networks in section 3.2 where, building on example 3, example 11 demonstrates why the characterization claimed by conjecture 1 involves inequalities. We end section 3 with a brief subsection on PDE constraints for polynomial neural networks. Examples in section 3.1 are generalized in the next section to a number of results establishing conjecture 1 for certain families of tree architectures: proofs of theorems 3 and 4 are presented in section 4. The last section is devoted to a few concluding remarks. There are two appendices discussing technical proofs of propositions and lemmas (appendix A) and the basic mathematical background on differential forms (appendix B).

The goal of the section is to prove theorems 1 and 2. Lemma 1 below is our main tool for establishing the existence of constraints:

Lemma 1.
Any collection $p_1(t_1,\dots,t_m),\dots,p_l(t_1,\dots,t_m)$ of polynomials in $m$ indeterminates is algebraically dependent provided that $l>m$. In other words, if $l>m$, there exists a nonconstant polynomial $\Phi=\Phi(s_1,\dots,s_l)$, dependent only on the coefficients of the $p_i$'s, for which
$$\Phi\left(p_1(t_1,\dots,t_m),\dots,p_l(t_1,\dots,t_m)\right)\equiv 0.$$

Proof.

For a positive integer $a$, there are precisely $\binom{a+l}{l}$ monomials such as $p_1^{a_1}\cdots p_l^{a_l}$ with their total degree $a_1+\dots+a_l$ not greater than $a$. But each of them is a polynomial of $t_1,\dots,t_m$ of total degree at most $ad$, where $d:=\max\{\deg p_1,\dots,\deg p_l\}$. For $a$ large enough, $\binom{a+l}{l}$ is greater than $\binom{ad+m}{m}$ because the degree of the former as a polynomial of $a$ is $l$, while the degree of the latter is $m$. For such an $a$, the number of monomials $p_1^{a_1}\cdots p_l^{a_l}$ is larger than the dimension of the space of polynomials of $t_1,\dots,t_m$ of total degree at most $ad$. Therefore, there exists a linear dependency among these monomials that amounts to a nontrivial polynomial relation among $p_1,\dots,p_l$.
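The counting step in this proof can be made concrete. The following is a minimal sketch, not part of the original argument: the helper name `smallest_degree` is hypothetical, and it simply searches for the first $a$ at which the number of monomials $\binom{a+l}{l}$ exceeds the dimension $\binom{ad+m}{m}$.

```python
from math import comb


def smallest_degree(l: int, m: int, d: int) -> int:
    """Smallest a with C(a+l, l) > C(a*d+m, m); requires l > m.

    l: number of polynomials, m: number of indeterminates,
    d: maximum degree among the polynomials (names follow the lemma).
    """
    if l <= m:
        raise ValueError("the counting argument needs l > m")
    a = 1
    # Left side grows like a**l, right side like a**m, so this terminates.
    while comb(a + l, l) <= comb(a * d + m, m):
        a += 1
    return a


# Three polynomials of degree at most 2 in two indeterminates.
a = smallest_degree(l=3, m=2, d=2)
print(a, comb(a + 3, 3), comb(2 * a + 2, 2))
```

For this choice the search stops at $a=8$, where 165 monomials live in a space of dimension 153, so a nontrivial relation of degree at most 8 must exist.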

Proof of Theorem 1.
Let $F=F(x_1,\dots,x_n)$ be a superposition of smooth functions
$$f_1^{(1)},\dots,f_{N_1}^{(1)};\ \dots;\ f_1^{(i)},\dots,f_{N_i}^{(i)};\ \dots;\ f_1^{(L)} \tag{2.1}$$
according to the hierarchy provided by $\mathcal{N}$, where $f_1^{(i)},\dots,f_{N_i}^{(i)}$ are the functions appearing at the neurons of the $i$th layer above the input layer (in the last layer, $f^{(L)}_{N_L=1}$ appears at the output neuron). The total number of these functions is $N:=N_1+\dots+N_L$, namely, the number of the neurons of the network. By the chain rule, any partial derivative $F_{x^{\alpha}}$ of the superposition may be described as a polynomial of partial derivatives of order not greater than $|\alpha|$ of functions that appeared in equation 2.1. These polynomials are determined solely by how neurons in consecutive layers are connected to each other, that is, the architecture. The function $F$ of $n$ variables admits $\binom{r+n}{n}-1$ partial derivatives (excluding the function itself) of order at most $r$, whereas the same number for any of the functions listed in equation 2.1 is at most $\binom{r+n-1}{n-1}-1$ because, by the hypothesis, each of them is dependent on fewer than $n$ variables. Denote the partial derivatives of order at most $r$ of functions $f_j^{(i)}$ (evaluated at appropriate points as required by the chain rule) by indeterminates $t_1,\dots,t_m$. Following the previous discussion, one has $m\le N\left(\binom{r+n-1}{n-1}-1\right)$. Hence, the chain rule describes the partial derivatives of order not greater than $r$ of $F$ as polynomials (dependent only on the architecture of $\mathcal{N}$) of $t_1,\dots,t_m$. Invoking lemma 1, the partial derivatives of $F$ are algebraically dependent once
$$\binom{r+n}{n}-1>N\left(\binom{r+n-1}{n-1}-1\right). \tag{2.2}$$
Indeed, the inequality holds for $r$ large enough since the left-hand side is a polynomial of degree $n$ of $r$, while the corresponding degree for the right-hand side is $n-1$.
Proof of Theorem 2.
In this case $F=F(x_1,\dots,x_n)$ is a superposition of functions of the form
$$\sigma_1^{(1)}\big(\langle w_1^{(1)},\cdot\rangle\big),\dots,\sigma_{N_1}^{(1)}\big(\langle w_{N_1}^{(1)},\cdot\rangle\big);\ \dots;\ \sigma_1^{(i)}\big(\langle w_1^{(i)},\cdot\rangle\big),\dots,\sigma_{N_i}^{(i)}\big(\langle w_{N_i}^{(i)},\cdot\rangle\big);\ \dots;\ \sigma_1^{(L)}\big(\langle w_1^{(L)},\cdot\rangle\big) \tag{2.3}$$
appearing at neurons. The $j$th neuron of the $i$th layer above the input layer ($1\le i\le L$, $1\le j\le N_i$) corresponds to the function $\sigma_j^{(i)}\big(\langle w_j^{(i)},\cdot\rangle\big)$, where a univariate smooth activation function $\sigma_j^{(i)}$ is applied to the inner product of the weight vector $w_j^{(i)}$ with the vector formed by the outputs of neurons in the previous layer that are connected to the neuron of the $i$th layer. We proceed as in the proof of theorem 1. The chain rule describes each partial derivative $F_{x^{\alpha}}$ as a polynomial, dependent only on the architecture, of components of vectors $w_j^{(i)}$ along with derivatives of functions $\sigma_j^{(i)}$ up to order at most $|\alpha|$ (each evaluated at an appropriate point). The total number of components of all weight vectors coincides with the total number of connections (edges of the underlying graph), and the number of the derivatives of activation functions is the number of neurons times $|\alpha|$. We denote the total number of connections and neurons by $C$ and $N$, respectively. There are $\binom{r+n}{n}-1$ partial derivatives $F_{x^{\alpha}}$ of order at most $r$ (i.e., $|\alpha|\le r$) of $F$ and, by the previous discussion, each of them may be written as a polynomial of $C+Nr$ quantities given by components of weight vectors and derivatives of activation functions. Lemma 1 implies that these partial derivatives of $F$ are algebraically dependent provided that
$$\binom{r+n}{n}-1>Nr+C, \tag{2.4}$$
an inequality that holds for sufficiently large $r$ as the degree of the left-hand side with respect to $r$ is $n>1$.
Corollary 1.

Let $\mathcal{N}$ be a feedforward neural network whose inputs are labeled by the coordinate functions $x_1,\dots,x_n$ and that satisfies the hypothesis of either theorem 1 or theorem 2. Define the positive integer $r$ as

  • $r=n\,(\#\text{neurons}-1)$ in the case of theorem 1

  • $r=\max\left\{n\left\lceil(\#\text{neurons})^{\frac{1}{n-1}}\right\rceil,\#\text{connections}\right\}+2$ in the case of theorem 2,

where $\#\text{connections}$ and $\#\text{neurons}$ are, respectively, the number of edges of the underlying graph of $\mathcal{N}$ and the number of its vertices above the input layer. Then the smooth functions $F=F(x_1,\dots,x_n)$ computable by $\mathcal{N}$ satisfy nontrivial algebraic partial differential equations of order $r$. In particular, the subspace formed by these functions lies in a subset of positive codimension, which is closed with respect to the $C^r$-norm.

Proof.
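One only needs to verify that for the values of $r$ provided by the corollary, the inequalities 2.2 and 2.4 are valid. The former holds if
$$\binom{r+n}{n}\bigg/\binom{r+n-1}{n-1}=\frac{r+n}{n}$$
is not smaller than $N$, that is, if $r\ge n(N-1)$. As for equation 2.4, notice that
$$\binom{r+n}{n}-1-Nr\ge\frac{r^n}{n!}-Nr=r\left(\frac{r^{n-1}}{n!}-N\right);$$
hence, it suffices to have $r\left(\frac{r^{n-1}}{n!}-N\right)>C$. This holds if $r>C$ and $\frac{r^{n-1}}{n!}-N\ge 1$. The latter inequality is valid once $r\ge n\,N^{\frac{1}{n-1}}+2$, since then:
$$\frac{r^{n-1}}{n!}=\left(\frac{r}{(n!)^{\frac{1}{n-1}}}\right)^{n-1}\ge\left(\frac{r}{n}\right)^{n-1}\ge\left(N^{\frac{1}{n-1}}+\frac{2}{n}\right)^{n-1}\ge N+\frac{2(n-1)}{n}N^{\frac{n-2}{n-1}}\ge N+1.$$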
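The two inequalities can be checked numerically for a given architecture. The sketch below is illustrative only: all function names are hypothetical, and rounding the root up to an integer in `order_theorem2` is an assumption made here so that $r$ is an integer.

```python
from math import comb


def holds_2_2(r: int, n: int, N: int) -> bool:
    """Inequality 2.2: C(r+n, n) - 1 > N * (C(r+n-1, n-1) - 1)."""
    return comb(r + n, n) - 1 > N * (comb(r + n - 1, n - 1) - 1)


def holds_2_4(r: int, n: int, N: int, C: int) -> bool:
    """Inequality 2.4: C(r+n, n) - 1 > N*r + C."""
    return comb(r + n, n) - 1 > N * r + C


def order_theorem1(n: int, N: int) -> int:
    """Order r = n * (N - 1) used in the first case of the corollary."""
    return n * (N - 1)


def ceil_root(N: int, k: int) -> int:
    """Smallest integer t with t**k >= N (exact integer arithmetic)."""
    t = 1
    while t ** k < N:
        t += 1
    return t


def order_theorem2(n: int, N: int, C: int) -> int:
    """Order for the second case; assumes n >= 2 and rounds the root up."""
    return max(n * ceil_root(N, n - 1), C) + 2


# n = 3 inputs and N = 3 neurons, as for the networks of Figure 2.
r1 = order_theorem1(3, 3)
print(r1, holds_2_2(r1, 3, 3), holds_2_2(r1 - 1, 3, 3))

# Same sizes with a hypothetical count of C = 6 connections.
r2 = order_theorem2(3, 3, 6)
print(r2, holds_2_4(r2, 3, 3, 6))
```

With three inputs and three neurons, inequality 2.2 holds at $r=6$ and fails at $r=5$, so the bound of the corollary happens to be sharp for this counting argument in that case.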
Remark 4.

It indeed follows from the arguments above that there is a multitude of algebraically independent PDE constraints. By a simple dimension count, this number is $\binom{r+n}{n}-1-N\left(\binom{r+n-1}{n-1}-1\right)$ in the first case of corollary 1 and $\binom{r+n}{n}-1-Nr$ in the second case.

Remark 5.

The approach here merely establishes the existence of nontrivial algebraic PDEs satisfied by the superpositions. These are not the simplest PDEs of this kind and hence are not the best candidates for the purpose of characterizing superpositions. For instance, for superpositions 1.7, which networks in Figure 2 implement, one has n=3 and #neurons=3. Corollary 1 thus guarantees that these superpositions satisfy a sixth-order PDE. But in example 7, we shall characterize them via two fourth-order PDEs (compare with Buck, 1979, lemma 7).

Remark 6.
Prevalent smooth activation functions such as the logistic function $\frac{1}{1+e^{-x}}$ or the hyperbolic tangent $\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}$ satisfy certain autonomous algebraic ODEs. Corollary 1 could be improved in such a setting. If each activation function $\sigma=\sigma(x)$ appearing in equation 2.3 satisfies a differential equation of the form
$$\frac{d^k\sigma}{dx^k}=p\left(\sigma,\frac{d\sigma}{dx},\dots,\frac{d^{k-1}\sigma}{dx^{k-1}}\right),$$
where $p$ is a polynomial, one can change equation 2.4 to $\binom{r+n}{n}-1>Nk_{\max}+C$, where $k_{\max}$ is the maximum order of ODEs that activation functions in equation 2.3 satisfy.
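For the two activation functions named in this remark, the ODEs in question are first order. The following sketch (assuming SymPy is available; the variable names are illustrative) checks the standard identities $\sigma'=1-\sigma^2$ for the hyperbolic tangent and $\sigma'=\sigma(1-\sigma)$ for the logistic function, so that $k_{\max}=1$ when only these activations are used.

```python
import sympy as sp

x = sp.symbols("x", real=True)

tanh = sp.tanh(x)
logistic = 1 / (1 + sp.exp(-x))

# Residuals of the candidate first-order autonomous ODEs sigma' = p(sigma).
tanh_residual = sp.simplify(sp.diff(tanh, x) - (1 - tanh**2))
logistic_residual = sp.simplify(sp.diff(logistic, x) - logistic * (1 - logistic))

print(tanh_residual, logistic_residual)
```

Both residuals simplify to zero, which is why every higher derivative of such an activation is a polynomial in the activation itself.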

This section examines several elementary examples demonstrating how one can derive a set of necessary or sufficient PDE constraints for an architecture. The desired PDEs should be universal, that is, purely in terms of the derivatives of the function $F$ that is to be implemented and not dependent on any weight vector, activation function, or a function of lower dimensionality that has appeared at a node. In this process, it is often necessary to express a smooth function in terms of other functions. If $k<n$ and $f(x_1,\dots,x_n)$ is written as $g(\xi_1,\dots,\xi_k)$ throughout an open neighborhood of a point $p\in\mathbb{R}^n$, where each $\xi_i=\xi_i(x_1,\dots,x_n)$ is a smooth function, the gradient of $f$ must be a linear combination of those of $\xi_1,\dots,\xi_k$ due to the chain rule. Conversely, if $\nabla f\in\mathrm{Span}\{\nabla\xi_1,\dots,\nabla\xi_k\}$ near $p$, then, by the inverse function theorem, one can extend $(\xi_1,\dots,\xi_k)$ to a coordinate system $(\xi_1,\dots,\xi_k;\xi_{k+1},\dots,\xi_n)$ on a small enough neighborhood of $p$ provided that $\nabla\xi_1(p),\dots,\nabla\xi_k(p)$ are linearly independent. In this coordinate system, the partial derivative $f_{\xi_i}$ vanishes for $k<i\le n$, which implies that $f$ can be expressed in terms of $\xi_1,\dots,\xi_k$ near $p$. Subtle mathematical issues arise if one wants to write $f$ as $g(\xi_1,\dots,\xi_k)$ on a larger domain containing $p$:

  • A $k$-tuple $(\xi_1,\dots,\xi_k)$ of smooth functions defined on an open subset $U$ of $\mathbb{R}^n$ whose gradient vector fields are linearly independent at all points cannot necessarily be extended to a coordinate system $(\xi_1,\dots,\xi_k;\xi_{k+1},\dots,\xi_n)$ for the whole $U$. As an example, consider $r=\sqrt{x^2+y^2}$, whose gradient is nonzero at any point of $\mathbb{R}^2-\{(0,0)\}$, but there is no smooth function $h:\mathbb{R}^2-\{(0,0)\}\to\mathbb{R}$ with $\nabla h\nparallel\nabla r$ throughout $\mathbb{R}^2-\{(0,0)\}$. Indeed, the level set $r=1$ is compact, so the restriction of $h$ to it achieves its absolute extrema, and at such points $\nabla h=\lambda\nabla r$ ($\lambda$ is the Lagrange multiplier).

  • Even if one has a coordinate system $(\xi_1,\dots,\xi_k;\xi_{k+1},\dots,\xi_n)$ on a connected open subset $U$ of $\mathbb{R}^n$, a smooth function $f:U\to\mathbb{R}$ with $f_{\xi_{k+1}},\dots,f_{\xi_n}\equiv 0$ cannot necessarily be written globally as $f=g(\xi_1,\dots,\xi_k)$. One example is the function
    $$f(x,y):=\begin{cases}0 & \text{if } x\le 0\\ e^{-\frac{1}{x}} & \text{if } x>0,\ y>0\\ -e^{-\frac{1}{x}} & \text{if } x>0,\ y<0\end{cases}$$
    defined on the open subset $\mathbb{R}^2-[0,\infty)\subset\mathbb{R}^2$ for which $f_y\equiv 0$. It may only locally be written as $f(x,y)=g(x)$; there is no function $g:\mathbb{R}\to\mathbb{R}$ with $f(x,y)=g(x)$ for all $(x,y)\in\mathbb{R}^2-[0,\infty)$. Defining $g(x_0)$ as the value of $f$ on the intersection of its domain with the vertical line $x=x_0$ does not work because, due to the shape of the domain, such intersections may be disconnected. Finally, notice that $f$, although smooth, is not analytic ($C^{\omega}$); indeed, examples of this kind do not exist in the analytic category.

This difficulty of needing a representation f=g(ξ1,,ξk) that remains valid not just near a point but over a larger domain comes up only in the proof of theorem 4 (see remark 3); the representations we work with in the rest of this section are all local. The assumption about the shape of the domain and the special form of functions 1.12 allows us to circumvent the difficulties just mentioned in the proof of theorem 4. Below we have two related lemmas that we use later.

Lemma 2.

Let $B$ and $T$ be a box-like region in $\mathbb{R}^n$ and a rooted tree with the coordinate functions $x_1,\dots,x_n$ labeling its leaves as in theorem 7. Suppose a smooth function $F=F(x_1,\dots,x_n)$ on $B$ is implemented on $T$ via assigning activation functions and weights to the nodes of $T$. If $F$ satisfies the nonvanishing conditions described at the end of theorem 7, then the level sets of $F$ are connected and $F$ can be extended to a coordinate system $(F,F_2,\dots,F_n)$ for $B$.

Lemma 3.

A smooth function $F(x_1,\dots,x_n)$ of the form $\sigma(a_1x_1+\dots+a_nx_n)$ satisfies $F_{x_ix_k}F_{x_j}=F_{x_jx_k}F_{x_i}$ for any $1\le i,j,k\le n$. Conversely, if $F$ has a first-order partial derivative $F_{x_j}$ that is nonzero throughout an open box-like region $B$ in its domain, each identity $F_{x_ix_k}F_{x_j}=F_{x_jx_k}F_{x_i}$ could be written as $\left(\frac{F_{x_i}}{F_{x_j}}\right)_{x_k}=0$; that is, for any $1\le i\le n$, the ratio $\frac{F_{x_i}}{F_{x_j}}$ should be constant on $B$, and such requirements guarantee that $F$ admits a representation of the form $\sigma(a_1x_1+\dots+a_nx_n)$ on $B$.

In view of the discussion so far, it is important to know when a smooth vector field,
$$V(x_1,\dots,x_n)=\left[V_1(x_1,\dots,x_n)\ \cdots\ V_n(x_1,\dots,x_n)\right]^{T}, \tag{3.1}$$
on an open subset $U\subseteq\mathbb{R}^n$ is locally given by a gradient. Clearly, a necessary condition is to have
$$(V_i)_{x_j}=(V_j)_{x_i}\quad\forall\, i,j\in\{1,\dots,n\}. \tag{3.2}$$
It is well known that if $U$ is simply connected, this condition is sufficient too and guarantees the existence of a smooth potential function $\xi$ on $U$ satisfying $\nabla\xi=V$ (Pugh, 2002). A succinct way of writing equation 3.2 is $d\omega=0$, where $\omega$ is defined as the differential form:
$$\omega:=V_1\,dx_1+\dots+V_n\,dx_n. \tag{3.3}$$
Here is a more subtle question also pertinent to our discussion: When may $V$ be rescaled to a gradient vector field? As the reader may recall from the elementary theory of differential equations, for a planar vector field, such a rescaling amounts to finding an integrating factor for the corresponding first-order ODE (Boyce & DiPrima, 2012). It turns out that the answer could again be encoded in terms of differential forms:
Theorem 5.

A smooth vector field $V$ is parallel to a gradient vector field near each point only if the corresponding differential 1-form $\omega$ satisfies $\omega\wedge d\omega=0$. Conversely, if $V$ is nonzero at a point $p\in\mathbb{R}^n$ in the vicinity of which $\omega\wedge d\omega=0$ holds, there exists a smooth function $\xi$ defined on a suitable open neighborhood of $p$ that satisfies $V\parallel\nabla\xi\neq 0$. In particular, in dimension 2, a nowhere vanishing vector field $V$ is locally parallel to a nowhere vanishing gradient vector field, while in dimension 3, that is the case if and only if $V\cdot\operatorname{curl}V=0$.

A proof and background on differential forms are provided in appendix B.
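The three-dimensional criterion is easy to test symbolically. The sketch below assumes SymPy; the fields `rescaled` and `twisted` and the helper names are illustrative choices, not taken from the article. The first field is a gradient multiplied by a nonconstant factor, the second is a standard field that is not parallel to any gradient.

```python
import sympy as sp

x, y, z = sp.symbols("x y z", real=True)


def grad(s):
    """Gradient of a scalar expression as a 3x1 column."""
    return sp.Matrix([sp.diff(s, v) for v in (x, y, z)])


def curl(V):
    """Curl of a 3x1 vector field in Cartesian coordinates."""
    return sp.Matrix([
        sp.diff(V[2], y) - sp.diff(V[1], z),
        sp.diff(V[0], z) - sp.diff(V[2], x),
        sp.diff(V[1], x) - sp.diff(V[0], y),
    ])


def obstruction(V):
    """V . curl V, which must vanish when V is parallel to a gradient."""
    return sp.expand(V.dot(curl(V)))


xi = x * y + z**2                 # hypothetical potential
mu = 1 + x**2 + y * z             # hypothetical rescaling factor
rescaled = mu * grad(xi)          # parallel to grad(xi) by construction
twisted = sp.Matrix([-y, x, 1])   # curl is (0, 0, 2)

print(obstruction(rescaled), obstruction(twisted))
```

The first value is zero, as the theorem requires, while the second equals 2, so no rescaling can turn `twisted` into a gradient.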

3.1  Trees with Four Inputs

We begin by formally defining the terms related to tree architectures (see Figure 8).

Terminology

  • A tree is a connected acyclic graph. Singling out a vertex as its root turns it into a directed acyclic graph in which each vertex has a unique predecessor/parent. We take all trees to be rooted. The following notions come up frequently:

    • Leaf: a vertex with no successor/child.

    • Node: a vertex that is not a leaf, that is, has children.

    • Sibling leaves: leaves with the same parent.

    • Subtree: all descendants of a vertex along with the vertex itself. Hence in our convention, all subtrees are full and rooted.

    To implement a function, the leaves pass the inputs to the functions assigned to the nodes. The final output is received from the root.

The first example of the section elucidates theorem 3.

3.1.1  Example 4

Let us characterize superpositions
$$F(x,y,z,w)=g(f(x,y),z,w)$$
of smooth functions $f,g$, which correspond to the first tree architecture in Figure 9. Necessary PDE constraints are more convenient to write for certain ratios. So to derive them, we assume for a moment that first-order partial derivatives of $F$ are nonzero, although by a simple continuity argument, the constraints will hold regardless. Computing the numerator and the denominator of $\frac{F_x}{F_y}$ via the chain rule indicates that this ratio coincides with $\frac{f_x}{f_y}$ and is, hence, independent of $z,w$. One thus obtains
$$\left(\frac{F_y}{F_x}\right)_z=0,\qquad\left(\frac{F_y}{F_x}\right)_w=0,$$
or, equivalently,
$$F_{yz}F_x=F_{xz}F_y,\qquad F_{yw}F_x=F_{xw}F_y.$$
Assuming $F_x\neq 0$, the preceding constraints are sufficient. The gradient $\nabla F$ is parallel with
$$\left[1\ \ \tfrac{F_y}{F_x}\ \ \tfrac{F_z}{F_x}\ \ \tfrac{F_w}{F_x}\right]^{T},$$
where the second entry $\frac{F_y}{F_x}$ is dependent only on $x$ and $y$ and thus may be written as $\frac{F_y}{F_x}=\frac{f_y}{f_x}$ for an appropriate bivariate function $f=f(x,y)$ defined throughout a small enough neighborhood of the point under consideration (at which $F_x$ is assumed to be nonzero). Such a function exists due to theorem 5. Now we have
$$\nabla F\parallel\left[1\ \ \tfrac{f_y}{f_x}\ \ 0\ \ 0\right]^{T}+\frac{F_z}{F_x}\left[0\ \ 0\ \ 1\ \ 0\right]^{T}+\frac{F_w}{F_x}\left[0\ \ 0\ \ 0\ \ 1\right]^{T}\in\mathrm{Span}\{\nabla f,\nabla z,\nabla w\},$$
which guarantees that $F(x,y,z,w)$ may be written as a function of $f(x,y),z,w$.
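As a quick sanity check of the two constraints, the sketch below (assuming SymPy) evaluates them on one hypothetical member of the class and on one function outside it. The particular choices of $f$, $g$, and the counterexample `H` are illustrative assumptions.

```python
import sympy as sp

x, y, z, w = sp.symbols("x y z w", real=True)


def constraints(F):
    """Left minus right sides of F_yz F_x = F_xz F_y and F_yw F_x = F_xw F_y."""
    Fx, Fy = sp.diff(F, x), sp.diff(F, y)
    return (
        sp.expand(sp.diff(F, y, z) * Fx - sp.diff(F, x, z) * Fy),
        sp.expand(sp.diff(F, y, w) * Fx - sp.diff(F, x, w) * Fy),
    )


# Hypothetical superposition g(f(x, y), z, w) with
# f = x**2*y + y**3 and g(u, z, w) = u*z + sin(u)*w + w**2.
f = x**2 * y + y**3
F = f * z + sp.sin(f) * w + w**2

# A function that mixes x with z and y with w, outside the class.
H = x * z + y * w

print(constraints(F), constraints(H))
```

Both residuals vanish for `F`, whereas the first one equals $-w$ for `H`, so `H` cannot be written as $g(f(x,y),z,w)$.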
Figure 9:

Two tree architectures with four distinct inputs. Examples 4, 5, and 6 characterize functions computable by them.

The next two examples serve as an invitation to the proof of theorem 4 in section 4 and are concerned with trees illustrated in Figure 9.

3.1.2  Example 5

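Let us study the example above in the regime of activation functions. The goal is to characterize functions of the form $F(x,y,z,w)=\sigma(\tau(ax+by)+cz+dw)$. The ratios $\frac{F_y}{F_x}$, $\frac{F_z}{F_w}$ must be constant, while $\frac{F_x}{F_z}$ and $\frac{F_x}{F_w}$ are dependent merely on $x,y$, as they are equal to $\frac{a}{c}\tau'(ax+by)$ and $\frac{a}{d}\tau'(ax+by)$, respectively. Equating the corresponding partial derivatives with zero, we obtain the following PDEs:
$$\begin{aligned}
&F_{xy}F_x=F_{xx}F_y,\quad F_{yy}F_x=F_{xy}F_y,\quad F_{yz}F_x=F_{xz}F_y,\quad F_{yw}F_x=F_{xw}F_y;\\
&F_{xz}F_w=F_{xw}F_z,\quad F_{yz}F_w=F_{yw}F_z,\quad F_{zz}F_w=F_{zw}F_z,\quad F_{zw}F_w=F_{ww}F_z;\\
&F_{xz}F_z=F_{zz}F_x,\quad F_{xw}F_z=F_{zw}F_x;\\
&F_{xz}F_w=F_{zw}F_x,\quad F_{xw}F_w=F_{ww}F_x.
\end{aligned}$$
One can easily verify that they always hold for functions of the form above. We claim that under the assumptions of $F_x\neq 0$ and $F_w\neq 0$, these conditions guarantee the existence of a local representation of the form $\sigma(\tau(ax+by)+cz+dw)$ of $F$. Denoting $\frac{F_x}{F_w}$ by $\beta(x,y)$ and the constant functions $\frac{F_y}{F_x}$ and $\frac{F_z}{F_w}$ by $c_1$ and $c_2$, respectively, we have
$$\nabla F=\left[F_x\ \ F_y\ \ F_z\ \ F_w\right]^{T}\parallel\left[\tfrac{F_x}{F_w}\ \ \tfrac{F_y}{F_w}\ \ \tfrac{F_z}{F_w}\ \ 1\right]^{T}=\left[\beta(x,y)\ \ c_1\beta(x,y)\ \ c_2\ \ 1\right]^{T}=\nabla\left(f(x,y)+c_2z+w\right),$$
where $\nabla f=\left[\beta(x,y)\ \ c_1\beta(x,y)\right]^{T}$. Such a potential function $f$ for $\left[\beta(x,y)\ \ c_1\beta(x,y)\right]^{T}=\left[\tfrac{F_x}{F_w}\ \ \tfrac{F_y}{F_w}\right]^{T}$ exists since
$$\left(\frac{F_x}{F_w}\right)_y=\left(\frac{F_y}{F_w}\right)_x\iff\left(\frac{F_y}{F_x}\right)_w=0,$$
and it must be in the form of $\tau(ax+by)$ as $\frac{f_y}{f_x}=c_1$ is constant (see lemma 3). Thus, $F$ is a function of $\tau(ax+by)+c_2z+w$ because the gradients are parallel.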

The next example is concerned with the symmetric tree in Figure 9. We shall need the following lemma:

Lemma 4.
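Suppose a smooth function $q=q\left(y_1^{(1)},\dots,y_{n_1}^{(1)};y_1^{(2)},\dots,y_{n_2}^{(2)}\right)$ is written as a product
$$q_1\left(y_1^{(1)},\dots,y_{n_1}^{(1)}\right)q_2\left(y_1^{(2)},\dots,y_{n_2}^{(2)}\right) \tag{3.4}$$
of smooth functions $q_1,q_2$. Then $q\,q_{y_a^{(1)}y_b^{(2)}}=q_{y_a^{(1)}}\,q_{y_b^{(2)}}$ for any $1\le a\le n_1$ and $1\le b\le n_2$. Conversely, for a smooth function $q$ defined on an open box-like region $B_1\times B_2\subseteq\mathbb{R}^{n_1}\times\mathbb{R}^{n_2}$, once $q$ is nonzero, these identities guarantee the existence of such a product representation on $B_1\times B_2$.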
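The identities of the lemma can be tested directly on small examples. The following sketch assumes SymPy; the helper `splitting_defects` and the two sample functions are hypothetical, chosen only to show one function that splits and one that does not.

```python
import sympy as sp

y1, y2, z1, z2 = sp.symbols("y1 y2 z1 z2", real=True)


def splitting_defects(q, first, second):
    """q * q_ab - q_a * q_b for every a in the first group, b in the second."""
    return [
        sp.expand(q * sp.diff(q, a, b) - sp.diff(q, a) * sp.diff(q, b))
        for a in first
        for b in second
    ]


# A product q1(y1, y2) * q2(z1, z2): every defect should vanish.
product = (1 + y1**2 + sp.sin(y2)) * sp.exp(z1) * (2 + sp.cos(z2))

# A function that couples y1 with z1 and hence does not split.
coupled = 1 + y1 * z1

print(splitting_defects(product, (y1, y2), (z1, z2)))
print(splitting_defects(coupled, (y1, y2), (z1, z2)))
```

All four defects vanish for `product`, while the first defect of `coupled` equals 1, reflecting that $\log q$ has a nonzero mixed derivative in that case.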

3.1.3  Example 6

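We aim to characterize smooth functions of four variables of the form $F(x,y,z,w)=\sigma(\tau_1(ax+by)+\tau_2(cz+dw))$. Assuming for a moment that all first-order partial derivatives are nonzero, the ratios $\frac{F_y}{F_x}$, $\frac{F_z}{F_w}$ must be constant, while $\frac{F_x}{F_w}$ is equal to $\frac{a\,\tau_1'(ax+by)}{d\,\tau_2'(cz+dw)}$ and hence (along with its constant multiples $\frac{F_x}{F_z}$, $\frac{F_y}{F_z}$, $\frac{F_y}{F_w}$) splits into a product of bivariate functions of $x,y$ and $z,w$, a requirement that by lemma 4 is equivalent to the following identities:
$$\begin{aligned}
\frac{F_x}{F_w}\left(\frac{F_x}{F_w}\right)_{xz}&=\left(\frac{F_x}{F_w}\right)_x\left(\frac{F_x}{F_w}\right)_z,&\quad\frac{F_x}{F_w}\left(\frac{F_x}{F_w}\right)_{xw}&=\left(\frac{F_x}{F_w}\right)_x\left(\frac{F_x}{F_w}\right)_w,\\
\frac{F_x}{F_w}\left(\frac{F_x}{F_w}\right)_{yz}&=\left(\frac{F_x}{F_w}\right)_y\left(\frac{F_x}{F_w}\right)_z,&\quad\frac{F_x}{F_w}\left(\frac{F_x}{F_w}\right)_{yw}&=\left(\frac{F_x}{F_w}\right)_y\left(\frac{F_x}{F_w}\right)_w.
\end{aligned}$$
After expanding and cross-multiplying, the identities above result in PDEs of the form 1.20 imposed on $F$ that hold for any smooth function of the form $F(x,y,z,w)=\sigma(\tau_1(ax+by)+\tau_2(cz+dw))$. Conversely, we claim that if $F_x\neq 0$ and $F_w\neq 0$, then the constraints we have noted guarantee that $F$ locally admits a representation of this form. Denoting the constants $\frac{F_y}{F_x}$ and $\frac{F_z}{F_w}$ by $c_1$ and $c_2$, respectively, and writing $\frac{F_x}{F_w}\neq 0$ in the split form $\frac{\beta(x,y)}{\gamma(z,w)}$, we obtain
$$\nabla F=\left[F_x\ \ F_y\ \ F_z\ \ F_w\right]^{T}\parallel\left[\tfrac{F_x}{F_w}\ \ \tfrac{F_y}{F_w}\ \ \tfrac{F_z}{F_w}\ \ 1\right]^{T}=\left[\tfrac{\beta(x,y)}{\gamma(z,w)}\ \ \tfrac{c_1\beta(x,y)}{\gamma(z,w)}\ \ c_2\ \ 1\right]^{T}\parallel\left[\beta(x,y)\ \ c_1\beta(x,y)\ \ c_2\gamma(z,w)\ \ \gamma(z,w)\right]^{T}.$$
We desire functions $f=f(x,y)$ and $g=g(z,w)$ with $\nabla f=\left[\beta(x,y)\ \ c_1\beta(x,y)\right]^{T}$ and $\nabla g=\left[c_2\gamma(z,w)\ \ \gamma(z,w)\right]^{T}$, because then $\nabla F\parallel\nabla(f(x,y)+g(z,w))$ and hence $F=\sigma(f(x,y)+g(z,w))$ for an appropriate $\sigma$. Notice that $f(x,y)$ and $g(z,w)$ are automatically in the forms of $\tau_1(ax+by)$ and $\tau_2(cz+dw)$ because $\frac{f_y}{f_x}=c_1$ and $\frac{g_z}{g_w}=c_2$ are constants (see lemma 3). To establish the existence of $f$ and $g$, one should verify the integrability conditions $\beta_y=c_1\beta_x$ and $c_2\gamma_w=\gamma_z$. We only verify the first one; the second one is similar. Notice that $\frac{F_y}{F_x}=c_1$ is constant, and $\frac{F_x}{F_w}=\frac{\beta(x,y)}{\gamma(z,w)}$ implies that $\beta_x=\beta\frac{\left(F_x/F_w\right)_x}{F_x/F_w}$ while $\beta_y=\beta\frac{\left(F_x/F_w\right)_y}{F_x/F_w}$. So the question is whether
$$\frac{F_y}{F_x}\left(\frac{F_x}{F_w}\right)_x=\left(\frac{F_y}{F_x}\cdot\frac{F_x}{F_w}\right)_x=\left(\frac{F_y}{F_w}\right)_x$$
and $\left(\frac{F_x}{F_w}\right)_y$ coincide, which is the case since $\left(\frac{F_y}{F_w}\right)_x=\left(\frac{F_x}{F_w}\right)_y$ can be rewritten as $\left(\frac{F_y}{F_x}\right)_w=0$.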
Remark 7.

Examples 5 and 6 demonstrate an interesting phenomenon: one can deduce nontrivial facts about the weights once a formula for the implemented function is available. In example 5, for a function $F(x,y,z,w)=\sigma(\tau(ax+by)+cz+dw)$, we have $\frac{F_y}{F_x}\equiv\frac{b}{a}$ and $\frac{F_z}{F_w}\equiv\frac{c}{d}$. The same identities are valid for functions of the form $F(x,y,z,w)=\sigma(\tau_1(ax+by)+\tau_2(cz+dw))$ in example 6.$^6$ This seems to be a direction worthy of study. In fact, there are papers discussing how a neural network may be “reverse-engineered” in the sense that the architecture of the network is determined from the knowledge of its outputs, or the weights and biases are recovered without the ordinary training process involving gradient descent algorithms (Fefferman & Markel, 1994; Dehmamy, Rohani, & Katsaggelos, 2019; Rolnick & Kording, 2019). In our approach, the weights appearing in a composition of functions of the form $y\mapsto\sigma(\langle w,y\rangle)$ could be described (up to scaling) in terms of partial derivatives of the resulting superposition.
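The recovery of weight ratios from partial derivatives can be illustrated on concrete networks. In the sketch below (assuming SymPy), the numerical weights and the activation functions tanh, sin, and arctan are arbitrary illustrative choices, and the helper `ratio` is hypothetical.

```python
import sympy as sp

x, y, z, w = sp.symbols("x y z w", real=True)
a, b, c, d = 2, 3, 5, 7   # hypothetical weights

# Example 5 form: sigma(tau(a x + b y) + c z + d w).
F5 = sp.tanh(sp.sin(a * x + b * y) + c * z + d * w)

# Example 6 form: sigma(tau1(a x + b y) + tau2(c z + d w)).
F6 = sp.tanh(sp.sin(a * x + b * y) + sp.atan(c * z + d * w))


def ratio(F, u, v):
    """Simplified quotient F_u / F_v of first-order partial derivatives."""
    return sp.simplify(sp.diff(F, u) / sp.diff(F, v))


for F in (F5, F6):
    print(ratio(F, y, x), ratio(F, z, w))
```

In both cases the quotients reduce to the constants $b/a$ and $c/d$, independently of the activation functions used.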

3.1.4  Example 7

Let us go back to example 2. In (Farhoodi et al., 2019, c, 7.2), a PDE constraint on functions of the form 1.7 is obtained via differentiating equation 1.9 several times and forming a matrix equation, which implies that a certain determinant of partial derivatives must vanish. The paper then raises the question of existence of PDE constraints that are both necessary and sufficient. The goal of this example is to derive such a characterization. Applying differentiation operators $\partial_y$, $\partial_z$, and $\partial_{yz}$ to equation 1.9 results in
$$\begin{bmatrix}F_y&F_z&0&0\\F_{yy}&F_{yz}&F_y&0\\F_{yz}&F_{zz}&0&F_z\\F_{yyz}&F_{yzz}&F_{yz}&F_{yz}\end{bmatrix}\begin{bmatrix}A\\B\\A_y\\B_z\end{bmatrix}=\begin{bmatrix}F_x\\F_{xy}\\F_{xz}\\F_{xyz}\end{bmatrix}.$$
If this matrix is nonsingular (a nonvanishing condition), Cramer's rule provides descriptions of $A,B$ in terms of partial derivatives of $F$, and then $A_z=B_y=0$ yield PDE constraints. Reversing this procedure, we show that these conditions are sufficient too. Let us assume that
$$\Psi:=\begin{vmatrix}F_y&F_z&0&0\\F_{yy}&F_{yz}&F_y&0\\F_{yz}&F_{zz}&0&F_z\\F_{yyz}&F_{yzz}&F_{yz}&F_{yz}\end{vmatrix}=(F_y)^2F_zF_{yzz}-(F_y)^2F_{yz}F_{zz}-F_y(F_z)^2F_{yyz}+(F_z)^2F_{yz}F_{yy}\neq 0. \tag{3.5}$$
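Notice that this condition is nonvacuous for functions $F(x,y,z)$ of the form 1.7 since they include all functions of the form $g(y,z)$. Then the linear system
$$\begin{bmatrix}F_x\\F_{xy}\\F_{xz}\\F_{xyz}\end{bmatrix}=\begin{bmatrix}F_y&F_z&0&0\\F_{yy}&F_{yz}&F_y&0\\F_{yz}&F_{zz}&0&F_z\\F_{yyz}&F_{yzz}&F_{yz}&F_{yz}\end{bmatrix}\begin{bmatrix}A\\B\\C\\D\end{bmatrix} \tag{3.6}$$
may be solved as
$$A=\frac{1}{\Psi}\begin{vmatrix}F_x&F_z&0&0\\F_{xy}&F_{yz}&F_y&0\\F_{xz}&F_{zz}&0&F_z\\F_{xyz}&F_{yzz}&F_{yz}&F_{yz}\end{vmatrix}=\frac{1}{\Psi}\Big[-F_y(F_z)^2F_{xyz}+F_yF_zF_{xz}F_{yz}+F_xF_yF_zF_{yzz}-F_xF_yF_{yz}F_{zz}+(F_z)^2F_{xy}F_{yz}-F_xF_z(F_{yz})^2\Big] \tag{3.7}$$
and
$$B=\frac{1}{\Psi}\begin{vmatrix}F_y&F_x&0&0\\F_{yy}&F_{xy}&F_y&0\\F_{yz}&F_{xz}&0&F_z\\F_{yyz}&F_{xyz}&F_{yz}&F_{yz}\end{vmatrix}=\frac{1}{\Psi}\Big[(F_y)^2F_zF_{xyz}-(F_y)^2F_{xz}F_{yz}-F_yF_zF_{xy}F_{yz}-F_xF_yF_zF_{yyz}+F_xF_y(F_{yz})^2+F_xF_zF_{yy}F_{yz}\Big], \tag{3.8}$$
where $\Psi$ is the determinant from equation 3.5. Denote the numerators of 3.7 and 3.8 by $\Psi_1$ and $\Psi_2$, respectively:
$$\begin{aligned}
\Psi_1&=-F_y(F_z)^2F_{xyz}+F_yF_zF_{xz}F_{yz}+F_xF_yF_zF_{yzz}-F_xF_yF_{yz}F_{zz}+(F_z)^2F_{xy}F_{yz}-F_xF_z(F_{yz})^2,\\
\Psi_2&=(F_y)^2F_zF_{xyz}-(F_y)^2F_{xz}F_{yz}-F_yF_zF_{xy}F_{yz}-F_xF_yF_zF_{yyz}+F_xF_y(F_{yz})^2+F_xF_zF_{yy}F_{yz}.
\end{aligned} \tag{3.9}$$
Requiring $A=\frac{\Psi_1}{\Psi}$ and $B=\frac{\Psi_2}{\Psi}$ to be independent of $z$ and $y$, respectively, amounts to
$$\Phi_1:=(\Psi_1)_z\Psi-\Psi_1\Psi_z=0,\qquad\Phi_2:=(\Psi_2)_y\Psi-\Psi_2\Psi_y=0. \tag{3.10}$$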
A simple continuity argument demonstrates that the constraints $\Phi_1=0$ and $\Phi_2=0$ above are necessary even if the determinant 3.5 vanishes: if $\Psi$ is identically zero on a neighborhood of a point $p\in\mathbb{R}^3$, the identities 3.10 obviously hold throughout that neighborhood. Another possibility is that $\Psi(p)=0$, but there is a sequence $\{p_n\}_n$ of nearby points with $p_n\to p$ and $\Psi(p_n)\neq 0$. Then the polynomial expressions $\Phi_1$, $\Phi_2$ of partial derivatives vanish at any $p_n$ and hence at $p$ by continuity.
To finish the verification of conjecture 1 for superpositions of the form 1.7, one should establish that PDEs $\Phi_1=0$, $\Phi_2=0$ from equation 3.10 are sufficient for the existence of such a representation provided that the nonvanishing condition $\Psi\neq 0$ from equation 3.5 holds. In that case, the functions $A$ and $B$ from equations 3.7 and 3.8 satisfy equation 1.9. According to theorem 5, there exist smooth locally defined $f(x,y)$ and $h(x,z)$ with $\frac{f_x}{f_y}=A(x,y)$ and $\frac{h_x}{h_z}=B(x,z)$. We have:
$$\nabla F=\begin{bmatrix}A(x,y)F_y+B(x,z)F_z\\F_y\\F_z\end{bmatrix}=F_y\begin{bmatrix}A(x,y)\\1\\0\end{bmatrix}+F_z\begin{bmatrix}B(x,z)\\0\\1\end{bmatrix}=\frac{F_y}{f_y}\begin{bmatrix}f_x\\f_y\\0\end{bmatrix}+\frac{F_z}{h_z}\begin{bmatrix}h_x\\0\\h_z\end{bmatrix}\in\mathrm{Span}\{\nabla f,\nabla h\};$$
hence, $F$ can be written as a function $g(f(x,y),h(x,z))$ of $f$ and $h$ for an appropriate $g$.
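The procedure of this example can be replayed symbolically on a concrete superposition. The sketch below assumes SymPy; the particular choice $f=xy$, $h=xz$, $g(u,v)=u^2v+v^2$ is a hypothetical test case, for which one expects $A=f_x/f_y=y/x$ and $B=h_x/h_z=z/x$.

```python
import sympy as sp

x, y, z = sp.symbols("x y z", real=True)

# Hypothetical F = g(f(x, y), h(x, z)) with f = x*y, h = x*z,
# and g(u, v) = u**2*v + v**2.
F = (x * y) ** 2 * (x * z) + (x * z) ** 2


def d(*vs):
    """Partial derivative of F with respect to the listed variables."""
    return sp.diff(F, *vs)


# Coefficient matrix and right-hand side of the linear system 3.6.
M = sp.Matrix([
    [d(y),       d(z),       0,       0],
    [d(y, y),    d(y, z),    d(y),    0],
    [d(y, z),    d(z, z),    0,       d(z)],
    [d(y, y, z), d(y, z, z), d(y, z), d(y, z)],
])
rhs = sp.Matrix([d(x), d(x, y), d(x, z), d(x, y, z)])


def replace_column(k):
    """Copy of M with column k replaced by rhs (Cramer's rule)."""
    Mk = M.copy()
    Mk[:, k] = rhs
    return Mk


Psi = sp.factor(M.det())
Psi1 = sp.factor(replace_column(0).det())
Psi2 = sp.factor(replace_column(1).det())

A = sp.cancel(Psi1 / Psi)
B = sp.cancel(Psi2 / Psi)

# The two characteristic PDEs from equation 3.10.
Phi1 = sp.expand(sp.diff(Psi1, z) * Psi - Psi1 * sp.diff(Psi, z))
Phi2 = sp.expand(sp.diff(Psi2, y) * Psi - Psi2 * sp.diff(Psi, y))

print(Psi, A, B, Phi1, Phi2)
```

For this test case the determinant is not identically zero, Cramer's rule returns the expected ratios, and both $\Phi_1$ and $\Phi_2$ vanish.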

3.1.5  Example 8

We now turn to the asymmetric tree with four repeated inputs in Figure 10 with the corresponding superpositions,
F(x,y,z)=g(x,f(h(x,y),z)).
(3.11)
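In our treatment here, the steps are reversible, and we hence derive PDE constraints that are simultaneously necessary and sufficient. The existence of a representation of the form 3.11 for $F(x,y,z)$ is equivalent to the existence of a locally defined coordinate system,
$$(\xi:=x,\ \zeta,\ \eta),$$
with respect to which $F_{\eta}=0$; moreover, $\zeta=\zeta(x,y,z)$ must be in the form of $f(h(x,y),z)$, which, according to example 1, is the case if and only if $\left(\frac{\zeta_y}{\zeta_x}\right)_z=0$. Here, we assume that $\zeta_x,\zeta_y\neq 0$ so that $\frac{\zeta_y}{\zeta_x}$ is well defined and $\nabla\xi,\nabla\zeta$ are linearly independent. We denote the preceding ratio by $\beta=\beta(x,y)\neq 0$. Conversely, theorem 5 guarantees that there exists $\zeta$ with $\frac{\zeta_y}{\zeta_x}=\beta$ for any smooth $\beta(x,y)$. The function $F$ could be locally written as a function of $\xi=x$ and $\zeta$ if and only if
$$\nabla F=\begin{bmatrix}F_x\\F_y\\F_z\end{bmatrix}\in\mathrm{Span}\left\{\nabla x,\ \nabla\zeta=\begin{bmatrix}\zeta_x\\ \zeta_y=\beta(x,y)\,\zeta_x\\ \zeta_z\end{bmatrix}\right\}.$$
Clearly, this occurs if and only if $\frac{F_z}{F_y}$ coincides with $\frac{\zeta_z}{\zeta_y}$. Therefore, one only needs to arrange for $\beta(x,y)$ so that the vector field
$$\frac{1}{\zeta_x}\nabla\zeta=\begin{bmatrix}1\\ \frac{\zeta_y}{\zeta_x}\\ \frac{\zeta_z}{\zeta_x}\end{bmatrix}=\begin{bmatrix}1\\ \beta(x,y)\\ \beta(x,y)\frac{F_z}{F_y}\end{bmatrix}$$
is parallel to a gradient vector field $\nabla\zeta$. That is, we want the vector field to be perpendicular to its curl (see theorem 5). We have:
$$\left(\frac{\partial}{\partial x}+\beta\frac{\partial}{\partial y}+\beta\frac{F_z}{F_y}\frac{\partial}{\partial z}\right)\cdot\operatorname{curl}\left(\frac{\partial}{\partial x}+\beta\frac{\partial}{\partial y}+\beta\frac{F_z}{F_y}\frac{\partial}{\partial z}\right)=\beta_y\frac{F_z}{F_y}+\beta\left(\frac{F_z}{F_y}\right)_y-\beta^2\left(\frac{F_z}{F_y}\right)_x.$$
The vanishing of the expression above results in a description of $\left(\frac{F_z}{F_y}\right)_x$ as the linear combination
$$\left(\frac{F_z}{F_y}\right)_x=\frac{\beta_y}{\beta^2}\,\frac{F_z}{F_y}+\frac{1}{\beta}\left(\frac{F_z}{F_y}\right)_y \tag{3.12}$$
whose coefficients $\frac{\beta_y}{\beta^2}=-\left(\frac{1}{\beta}\right)_y$ and $\frac{1}{\beta}$ are independent of $z$. Thus, we are in a situation similar to that of examples 2 and 7, where we encountered identity 1.9. The same idea used there could be applied again to obtain PDE constraints: Differentiating equation 3.12 with respect to $z$ results in a linear system:
$$\begin{bmatrix}\frac{F_z}{F_y}&\left(\frac{F_z}{F_y}\right)_y\\ \left(\frac{F_z}{F_y}\right)_z&\left(\frac{F_z}{F_y}\right)_{yz}\end{bmatrix}\begin{bmatrix}-\left(\frac{1}{\beta}\right)_y\\ \frac{1}{\beta}\end{bmatrix}=\begin{bmatrix}\left(\frac{F_z}{F_y}\right)_x\\ \left(\frac{F_z}{F_y}\right)_{xz}\end{bmatrix}.$$
Assuming the matrix above is nonsingular, Cramer's rule implies
$$-\left(\frac{1}{\beta}\right)_y=\frac{\begin{vmatrix}\left(\frac{F_z}{F_y}\right)_x&\left(\frac{F_z}{F_y}\right)_y\\ \left(\frac{F_z}{F_y}\right)_{xz}&\left(\frac{F_z}{F_y}\right)_{yz}\end{vmatrix}}{\begin{vmatrix}\frac{F_z}{F_y}&\left(\frac{F_z}{F_y}\right)_y\\ \left(\frac{F_z}{F_y}\right)_z&\left(\frac{F_z}{F_y}\right)_{yz}\end{vmatrix}},\qquad\frac{1}{\beta}=\frac{\begin{vmatrix}\frac{F_z}{F_y}&\left(\frac{F_z}{F_y}\right)_x\\ \left(\frac{F_z}{F_y}\right)_z&\left(\frac{F_z}{F_y}\right)_{xz}\end{vmatrix}}{\begin{vmatrix}\frac{F_z}{F_y}&\left(\frac{F_z}{F_y}\right)_y\\ \left(\frac{F_z}{F_y}\right)_z&\left(\frac{F_z}{F_y}\right)_{yz}\end{vmatrix}}. \tag{3.13}$$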
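We now arrive at the desired PDE characterization of superpositions 3.11. In each of the ratios of determinants appearing in equation 3.13, the numerator and denominator are in the form of polynomials of partial derivatives divided by $(F_y)^4$. So we introduce the following polynomial expressions:
$$\Psi_1=(F_y)^4\begin{vmatrix}\frac{F_z}{F_y}&\left(\frac{F_z}{F_y}\right)_y\\ \left(\frac{F_z}{F_y}\right)_z&\left(\frac{F_z}{F_y}\right)_{yz}\end{vmatrix},\quad\Psi_2=(F_y)^4\begin{vmatrix}\frac{F_z}{F_y}&\left(\frac{F_z}{F_y}\right)_x\\ \left(\frac{F_z}{F_y}\right)_z&\left(\frac{F_z}{F_y}\right)_{xz}\end{vmatrix},\quad\Psi_3=(F_y)^4\begin{vmatrix}\left(\frac{F_z}{F_y}\right)_x&\left(\frac{F_z}{F_y}\right)_y\\ \left(\frac{F_z}{F_y}\right)_{xz}&\left(\frac{F_z}{F_y}\right)_{yz}\end{vmatrix}. \tag{3.14}$$
Then in view of equation 3.13,
$$\frac{\Psi_2}{\Psi_1}=\frac{1}{\beta},\qquad\frac{\Psi_3}{\Psi_1}=-\left(\frac{1}{\beta}\right)_y. \tag{3.15}$$
Hence $\left(\frac{\Psi_2}{\Psi_1}\right)_y+\frac{\Psi_3}{\Psi_1}=0$; furthermore, $\left(\frac{\Psi_2}{\Psi_1}\right)_z=0$ since $\beta$ is independent of $z$:
$$\Phi_1:=\Psi_1(\Psi_2)_y-(\Psi_1)_y\Psi_2+\Psi_1\Psi_3=0,\qquad\Phi_2:=\Psi_1(\Psi_2)_z-(\Psi_1)_z\Psi_2=0. \tag{3.16}$$
Again as in example 7, a continuity argument implies that the algebraic PDEs above are necessary even when the denominator in equation 3.13 (i.e., $\Psi_1$) is zero. As for the nonvanishing conditions, in view of equations 3.14 and 3.15, we require $F_y$ to be nonzero as well as $\Psi_1$ and $\Psi_2$ (recall that $\beta\neq 0$):
$$\Psi_1\neq 0,\qquad\Psi_2\neq 0,\qquad F_y\neq 0. \tag{3.17}$$
It is easy to see that these conditions are not vacuous for functions of the form 3.11. If $F(x,y,z)=(xy)^2z+z^3$, neither $F_y$ nor the expression $\Psi_1$ or $\Psi_2$ is identically zero.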
Figure 10:

An asymmetric tree architecture that computes the superpositions of the form F(x,y,z)=g(x,f(h(x,y),z)). These are characterized in example 8.

In summary, a special case of conjecture 1 has been verified in this example. A function F=F(x,y,z) of the form 3.11 satisfies the constraints 3.16; conversely, a smooth function satisfying them along with the nonvanishing conditions 3.17 admits a local representation of that form.
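The characterization can be checked on the function $F(x,y,z)=(xy)^2z+z^3$ mentioned in this example. The sketch below assumes SymPy and reads $F$ as $g(x,f(h(x,y),z))$ with $h=xy$, $f(u,z)=u^2z+z^3$, and $g(x,v)=v$ (one possible decomposition, assumed here), so that $\beta=h_y/h_x=x/y$ and $\Psi_2/\Psi_1$ should equal $y/x$. The helper `det2` is hypothetical.

```python
import sympy as sp

x, y, z = sp.symbols("x y z", positive=True)

F = (x * y) ** 2 * z + z**3
Fy = sp.diff(F, y)
R = sp.diff(F, z) / Fy            # the ratio F_z / F_y


def det2(a, b, c, e):
    """Determinant of the 2x2 matrix [[a, b], [c, e]]."""
    return a * e - b * c


Rx, Ry, Rz = (sp.diff(R, v) for v in (x, y, z))
Rxz, Ryz = sp.diff(R, x, z), sp.diff(R, y, z)

# Polynomial expressions of equation 3.14.
Psi1 = sp.cancel(Fy**4 * det2(R, Ry, Rz, Ryz))
Psi2 = sp.cancel(Fy**4 * det2(R, Rx, Rz, Rxz))
Psi3 = sp.cancel(Fy**4 * det2(Rx, Ry, Rxz, Ryz))

# Characteristic PDEs of equation 3.16.
Phi1 = sp.expand(Psi1 * sp.diff(Psi2, y) - sp.diff(Psi1, y) * Psi2 + Psi1 * Psi3)
Phi2 = sp.expand(Psi1 * sp.diff(Psi2, z) - sp.diff(Psi1, z) * Psi2)

print(Psi1, Psi2, Psi3)
print(sp.cancel(Psi2 / Psi1), Phi1, Phi2)
```

Neither $\Psi_1$ nor $\Psi_2$ vanishes identically here, the quotient $\Psi_2/\Psi_1$ reduces to $y/x=1/\beta$, and both constraints of equation 3.16 hold.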

3.2  Examples of Functions Computed by Neural Networks

We now switch from trees to examples of PDE constraints for neural networks. The first two examples are concerned with the network illustrated on the left of Figure 11; this is a ResNet with two hidden layers that has x,y,z,w as its inputs. The functions it implements are in the form of
F(x,y,z,w)=g(h1(f(x,y),z),h2(f(x,y),w)),
(3.18)
where f and h1,h2 are the functions appearing in the hidden layers.
Figure 11:

The space of functions computed by the neural network on the left is strictly smaller than that of its TENN on the right. See example 9.

3.2.1  Example 9

On the right of Figure 11, the tree architecture corresponding to the neural network discussed above is illustrated. The functions implemented by this tree are in the form of
F(x,y,z,w)=g(h1(f1(x,y),z),h2(f2(x,y),w)),
(3.19)
which is a form more general than the form 3.18 of functions computable by the network. In fact, there are PDEs satisfied by functions of the form 3.18 that functions of the more general form 3.19 do not necessarily satisfy. To see this, observe that for a function $F(x,y,z,w)$ of the form 3.18, the ratio $\frac{F_y}{F_x}$ coincides with $\frac{f_y}{f_x}$ and is thus independent of $z$ and $w$, hence the PDEs $F_{xz}F_y=F_{yz}F_x$ and $F_{xw}F_y=F_{yw}F_x$. Neither of them holds for the function $F(x,y,z,w)=xyz+(x+y)w$, which is of the form 3.19. We deduce that the set of PDE constraints for a network may be strictly larger than that of the corresponding TENN.
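The failure of both PDEs for $F=xyz+(x+y)w$ can be confirmed by direct computation. The sketch below assumes SymPy; `shared_f` is a hypothetical function of the form 3.18 (with $f=xy$, $h_1(u,z)=\sin(uz)$, $h_2(u,w)=(u+w)^2$, and $g$ the sum), included for contrast.

```python
import sympy as sp

x, y, z, w = sp.symbols("x y z w", real=True)


def residuals(F):
    """Differences F_xz F_y - F_yz F_x and F_xw F_y - F_yw F_x."""
    Fx, Fy = sp.diff(F, x), sp.diff(F, y)
    return (
        sp.expand(sp.diff(F, x, z) * Fy - sp.diff(F, y, z) * Fx),
        sp.expand(sp.diff(F, x, w) * Fy - sp.diff(F, y, w) * Fx),
    )


tree_only = x * y * z + (x + y) * w            # of the form 3.19
shared_f = sp.sin(x * y * z) + (x * y + w) ** 2  # of the form 3.18

print(residuals(tree_only))
print(residuals(shared_f))
```

The residuals for `tree_only` are $w(y-x)$ and $z(x-y)$, both nonzero, while they vanish identically for `shared_f`.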

3.2.2  Example 10

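Here we briefly argue that conjecture 1 holds for the network in Figure 11 (which has two hidden layers). The goal is to obtain PDEs that, given suitable nonvacuous, nonvanishing conditions, characterize smooth functions $F(x,y,z,w)$ of the form 3.18. We seek a description of the form $g(F_1(x,y,z),F_2(x,y,w))$ of $F(x,y,z,w)$ where the trivariate functions $F_1(x,y,z)$ and $F_2(x,y,w)$ are superpositions $h_1(f(x,y),z)$ and $h_2(f(x,y),w)$ with the same bivariate function $f$ appearing in both of them. Invoking the logic that has been used repeatedly in section 3.1, $\nabla F$ must be a linear combination of $\nabla F_1$ and $\nabla F_2$. Following example 1, the only restriction on the latter two gradients is
$$\nabla F_1\parallel\begin{bmatrix}1\\ \frac{(F_1)_y}{(F_1)_x}=\frac{f_y}{f_x}\\ \alpha(x,y,z):=\frac{(F_1)_z}{(F_1)_x}\\ 0\end{bmatrix},\qquad\nabla F_2\parallel\begin{bmatrix}1\\ \frac{(F_2)_y}{(F_2)_x}=\frac{f_y}{f_x}\\ 0\\ \beta(x,y,w):=\frac{(F_2)_w}{(F_2)_x}\end{bmatrix};$$
and as observed in example 9, the ratio $\frac{f_y}{f_x}$ coincides with $\frac{F_y}{F_x}$. Thus, the existence of a representation of the form 3.18 is equivalent to the existence of a linear relation such as
$$\begin{bmatrix}F_x\\F_y\\F_z\\F_w\end{bmatrix}=\frac{F_z}{\alpha}\begin{bmatrix}1\\ \frac{F_y}{F_x}\\ \alpha(x,y,z)\\ 0\end{bmatrix}+\frac{F_w}{\beta}\begin{bmatrix}1\\ \frac{F_y}{F_x}\\ 0\\ \beta(x,y,w)\end{bmatrix}.$$
This amounts to the equation
$$F_z\left(\frac{1}{\alpha}\right)+F_w\left(\frac{1}{\beta}\right)=F_x.$$
Now the idea of examples 2 and 7 applies. As $\left(\frac{1}{\alpha}\right)_w=0$ and $\left(\frac{1}{\beta}\right)_z=0$, applying the operators $\partial_z$, $\partial_w$, and $\partial_z\partial_w$ to the last equation results in a linear system with four equations and four unknowns: $\frac{1}{\alpha}$, $\frac{1}{\beta}$, $\left(\frac{1}{\alpha}\right)_z$, and $\left(\frac{1}{\beta}\right)_w$. If nonsingular (a nonvanishing condition), the system may be solved to obtain expressions purely in terms of partial derivatives of $F$ for the aforementioned unknowns. Now $\left(\frac{1}{\alpha}\right)_w=0$ and $\left(\frac{1}{\beta}\right)_z=0$, along with the equations $F_{xz}F_y=F_{yz}F_x$, $F_{xw}F_y=F_{yw}F_x$ from example 9, yield four algebraic PDEs characterizing superpositions 3.18.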

The final example of this section finishes example 3 from section 1.

3.2.3  Example 11

We go back to example 3 to study PDEs and PDIs satisfied by functions of the form 1.14. Absorbing a'',b'' into inner functions, we can focus on the simpler form:
F(x,t)=σ(f(ax+bt)+g(a'x+b't)).
(3.20)
Let us for the time being forget about the outer activation function σ. Consider functions such as
G(x,t)=f(ax+bt)+g(a'x+b't).
Smooth functions of this form constitute solutions of a second-order linear homogeneous PDE with constant coefficients
$$UG_{xx}+VG_{xt}+WG_{tt}=0, \tag{3.21}$$
where $(a,b)$ and $(a',b')$ satisfy
$$UA^2+VAB+WB^2=0. \tag{3.22}$$
The reason is that when $(a,b)$ and $(a',b')$ satisfy equation 3.22, the differential operator $U\partial_{xx}+V\partial_{xt}+W\partial_{tt}$ can be factorized as
$$(b\,\partial_x-a\,\partial_t)(b'\,\partial_x-a'\,\partial_t)$$
to a composition of operators that annihilate the linear forms $ax+bt$ and $a'x+b't$. If $(a,b)$ and $(a',b')$ are not multiples of each other, they constitute a new coordinate system $(ax+bt,a'x+b't)$ in which the mixed partial derivatives of $G$ all vanish; so, at least locally, $G$ must be a sum of univariate functions of $ax+bt$ and $a'x+b't$.$^7$ We conclude that assuming $V^2-4UW>0$, functions of the form $G(x,t)=f(ax+bt)+g(a'x+b't)$ may be identified with solutions of PDEs of the form 3.21. As in example 1, we desire algebraic PDEs purely in terms of $G$ and without constants $U$, $V$, and $W$. One way to do so is to differentiate equation 3.21 further, for instance:
$$UG_{xxx}+VG_{xxt}+WG_{xtt}=0. \tag{3.23}$$
Notice that equations 3.21 and 3.23 could be interpreted as (U,V,W) being perpendicular to (Gxx,Gxt,Gtt) and (Gxxx,Gxxt,Gxtt). Thus, the cross-product
(GxtGxtt-GttGxxt,GttGxxx-GxxGxtt,GxxGxxt-GxtGxxx)
of the latter two vectors must be parallel to a constant vector. Under the nonvanishing condition that one of the entries of the cross-product, say the last one, is nonzero, the constancy may be thought of as ratios of the other two components to the last one being constants. The result is a characterization (in the vein of conjecture 1) of functions G of the form f(ax+bt)+g(a'x+b't), which are subjected to GxxGxxt-GxtGxxx0 and
GxtGxtt-GttGxxtGxxGxxt-GxtGxxxandGttGxxx-GxxGxttGxxGxxt-GxtGxxxareconstants,(GttGxxx-GxxGxtt)2>4(GxtGxtt-GttGxxt)(GxxGxxt-GxtGxxx).
(3.24)
Notice that the PDI is not redundant here. For a solution G=G(x,t) of Laplace's equation, the fractions from the first line of equation 3.24 are constants, while on the second line, the left-hand side of the inequality is zero but its right-hand side is 4(GxtGxtt-GttGxxt)20.
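The characterization in equation 3.24 can be illustrated on a concrete function. The sketch below assumes the choices $(a,b)=(1,2)$, $(a',b')=(1,-1)$, $f=\sin$, and $g=\exp$, none of which come from the article, and uses SymPy to compute the two ratios of cross-product components together with the quantity appearing in the inequality.

```python
import sympy as sp

x, t = sp.symbols("x t")

# Assumed example: (a, b) = (1, 2), (a', b') = (1, -1), f = sin, g = exp.
G = sp.sin(x + 2 * t) + sp.exp(x - t)

Gxx, Gxt, Gtt = sp.diff(G, x, x), sp.diff(G, x, t), sp.diff(G, t, t)
Gxxx, Gxxt, Gxtt = (sp.diff(d, x) for d in (Gxx, Gxt, Gtt))

# Cross product of (Gxx, Gxt, Gtt) and (Gxxx, Gxxt, Gxtt).
c1 = Gxt * Gxtt - Gtt * Gxxt
c2 = Gtt * Gxxx - Gxx * Gxtt
c3 = Gxx * Gxxt - Gxt * Gxxx

r1 = sp.cancel(c1 / c3)   # plays the role of U / W
r2 = sp.cancel(c2 / c3)   # plays the role of V / W

# The PDI in equation 3.24, divided by c3**2, reads r2**2 - 4*r1 > 0.
print(r1, r2, r2**2 - 4 * r1)
```

For this choice, the ratios are the constants $-2$ and $-1$, in agreement with $(U,V,W)$ being proportional to $(bb',-(ab'+a'b),aa')=(-2,-1,1)$.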
Composing $G$ with $\sigma$ makes the derivation of PDEs and PDIs imposed on functions of the form 3.20 even more cumbersome. We provide only a sketch. Under the assumption that the gradient of $F=\sigma\circ G$ is nonzero, the univariate function $\sigma$ admits a local inverse $\tau$. Applying the chain rule to $G=\tau\circ F$ yields
$$G_{xx}=\tau''(F)(F_x)^2+\tau'(F)F_{xx},\quad G_{xt}=\tau''(F)F_xF_t+\tau'(F)F_{xt},\quad G_{tt}=\tau''(F)(F_t)^2+\tau'(F)F_{tt}.$$
Plugging them in the PDE 3.21 that $G$ satisfies results in
$$\tau''(F)\bigl(U(F_x)^2+VF_xF_t+W(F_t)^2\bigr)+\tau'(F)\bigl(UF_{xx}+VF_{xt}+WF_{tt}\bigr)=0,$$
or, equivalently,
$$\frac{UF_{xx}+VF_{xt}+WF_{tt}}{U(F_x)^2+VF_xF_t+W(F_t)^2}=-\frac{\tau''(F)}{\tau'(F)}=-\left(\frac{\tau''}{\tau'}\right)(F). \tag{3.25}$$
It suffices for the ratio $\frac{UF_{xx}+VF_{xt}+WF_{tt}}{U(F_x)^2+VF_xF_t+W(F_t)^2}$ to be a function of $F$ such as $\nu(F)$ since then $\tau$ may be recovered as $\tau=\int e^{-\int\nu}$. Following the discussion at the beginning of section 3, this is equivalent to
$$\nabla\left(\frac{UF_{xx}+VF_{xt}+WF_{tt}}{U(F_x)^2+VF_xF_t+W(F_t)^2}\right)\parallel\nabla F.$$
This amounts to an identity of the form
$$\Phi_1(F)U^2+\Phi_2(F)V^2+\Phi_3(F)W^2+\Phi_4(F)UV+\Phi_5(F)VW+\Phi_6(F)UW=0,$$
where the $\Phi_i(F)$'s are complicated nonconstant polynomial expressions of partial derivatives of $F$. In the same way that the parameters $U$, $V$, and $W$ in PDE 3.21 were eliminated to arrive at equation 3.24, one may solve the homogeneous linear system consisting of the identity above and its derivatives in order to derive a six-dimensional vector,
$$\bigl(\Xi_1(F),\Xi_2(F),\Xi_3(F),\Xi_4(F),\Xi_5(F),\Xi_6(F)\bigr) \tag{3.26}$$
of rational expressions of partial derivatives of $F$ parallel to the constant vector
$$(U^2,V^2,W^2,UV,VW,UW). \tag{3.27}$$
The parallelism amounts to a number of PDEs, for example, $\Xi_1(F)\,\Xi_2(F)=\Xi_4(F)^2$, and the ratios $\frac{\Xi_i(F)}{\Xi_j(F)}$ must be constant because they coincide with the ratios of components of equation 3.27. Moreover, $V^2-4UW>0$ implies $\left(\frac{V^2}{UW}-4\right)\frac{V^2}{UW}\geq 0$. Replacing with the corresponding ratios of components of equation 3.26, we obtain the PDI
$$\left(\frac{\Xi_2(F)}{\Xi_6(F)}-4\right)\frac{\Xi_2(F)}{\Xi_6(F)}\geq 0,$$
which must be satisfied by any function of the form 3.20.
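Equation 3.25 can be checked numerically on an example. The sketch below is only an illustration under assumed choices: $\sigma=\exp$ (so that $\tau=\log$ and $-\tau''(F)/\tau'(F)=1/F$), $(a,b)=(1,2)$, $(a',b')=(1,-1)$, $f=\sin$, and $g(s)=s^3$. The evaluation points are arbitrary and were picked so that the denominator of the ratio is nonzero.

```python
import sympy as sp

x, t = sp.symbols("x t")

# Assumed example: sigma = exp (tau = log), f = sin, g(s) = s**3.
G = sp.sin(x + 2 * t) + (x - t) ** 3
F = sp.exp(G)

# (U, V, W) proportional to (b*b', -(a*b' + a'*b), a*a') for (1, 2) and (1, -1).
U, V, W = -2, -1, 1

num = U * sp.diff(F, x, x) + V * sp.diff(F, x, t) + W * sp.diff(F, t, t)
den = (U * sp.diff(F, x) ** 2
       + V * sp.diff(F, x) * sp.diff(F, t)
       + W * sp.diff(F, t) ** 2)
ratio = num / den

# For tau = log the right-hand side of 3.25 is 1 / F, so ratio * F should be 1.
points = [{x: 0.3, t: 0.7}, {x: -1.1, t: 0.4}]
checks = [float((ratio * F).subs(p)) for p in points]
print(checks)
```

A numerical spot check is used here instead of symbolic simplification to keep the example robust; it does not replace the argument in the text.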

3.3  Examples of Polynomial Neural Networks

The superpositions we study in this section are constructed out of polynomials. Again, there are two different regimes to discuss: composing general polynomial functions of low dimensionality or composing polynomials of arbitrary dimensionality but in the simpler form of $y\mapsto\sigma(\langle w,y\rangle)$ where the activation function $\sigma$ is a polynomial of a single variable. The latter regime deals with polynomial neural networks. Different aspects of such networks have been studied in the literature (Du & Lee, 2018; Soltanolkotabi et al., 2018; Venturi et al., 2018; Kileel et al., 2019). In the spirit of this article, we are interested in the spaces formed by such polynomial superpositions. Bounding the total degree of polynomials from above, these functional spaces are subsets of an ambient polynomial space, say, the space $\mathrm{Poly}_{d,n}$ of real polynomials $P(x_1,\dots,x_n)$ of total degree at most $d$, which is an affine space of dimension $\binom{d+n}{n}$. By writing a polynomial $P(x_1,\dots,x_n)$ of degree $d$ as
$$P(x_1,x_2,\dots,x_n)=\sum_{\substack{a_1,a_2,\dots,a_n\geq 0\\ a_1+a_2+\cdots+a_n\leq d}}c_{a_1,a_2,\dots,a_n}\,x_1^{a_1}x_2^{a_2}\cdots x_n^{a_n}, \tag{3.28}$$
the coefficients $c_{a_1,a_2,\dots,a_n}$ provide a natural coordinate system on $\mathrm{Poly}_{d,n}$. Associated with a neural network $N$ that receives $x_1,\dots,x_n$ as its inputs, there are polynomial functional spaces for any degree $d$ that lie in the ambient space $\mathrm{Poly}_{d,n}$:
  1. The subset $F_d(N)$ of $\mathrm{Poly}_{d,n}$ consisting of polynomials $P(x_1,\dots,x_n)$ of total degree at most $d$ that can be computed by $N$ via assigning real polynomial functions to its neurons

  2. The smaller subset $F_d^{\mathrm{act}}(N)$ of $\mathrm{Poly}_{d,n}$ consisting of polynomials $P(x_1,\dots,x_n)$ of total degree at most $d$ that can be computed by $N$ via assigning real polynomials of the form $y\mapsto\sigma(\langle w,y\rangle)$ to the neurons where $\sigma$ is a polynomial activation function

In general, subsets $F_d(N)$ and $F_d^{\mathrm{act}}(N)$ of $\mathrm{Poly}_{d,n}$ are not closed in the algebraic sense (see remark 8). Therefore, one may consider their Zariski closures $V_d(N)$ and $V_d^{\mathrm{act}}(N)$, that is, the smallest subsets defined as zero loci of polynomial equations that contain them. We shall call $V_d(N)$ and $V_d^{\mathrm{act}}(N)$ the functional varieties associated with $N$. Each of the subsets $V_d(N)$ and $V_d^{\mathrm{act}}(N)$ of $\mathrm{Poly}_{d,n}$ could be described with finitely many polynomial equations in terms of the $c_{a_1,a_2,\dots,a_n}$'s. The PDE constraints from section 2 provide nontrivial examples of equations satisfied on the functional varieties: in any degree $d$, substituting equation 3.28 in an algebraic PDE that smooth functions computed by $N$ must obey results in equations in terms of the coefficients that are satisfied at any point of $F_d(N)$ or $F_d^{\mathrm{act}}(N)$ and hence at the points of $V_d(N)$ or $V_d^{\mathrm{act}}(N)$. This will be demonstrated in example 12 and results in the following corollary to theorems 1 and 2.

Corollary 2.

Let $N$ be a neural network whose inputs are labeled by the coordinate functions $x_1,\dots,x_n$. Then there exist nontrivial polynomials on affine spaces $\mathrm{Poly}_{d,n}$ that are dependent only on the topology of $N$ and become zero on functional varieties $V_d^{\mathrm{act}}(N)\subset\mathrm{Poly}_{d,n}$. The same holds for functional varieties $V_d(N)$ provided that the number of inputs to each neuron of $N$ is less than $n$.

Proof.
The proof immediately follows from theorem 2 (in the case of $V_d^{\mathrm{act}}(N)$) and from theorem 1 (in the case of $V_d(N)$). Substituting a polynomial $P(x_1,\dots,x_n)$ in a PDE constraint
$$\Phi\bigl(P_{x_1},\dots,P_{x_n},P_{x_1^2},P_{x_1x_2},\dots,P_{x^{\alpha}},\dots\bigr)=0$$
that these theorems suggest for $N$ and equating the coefficient of a monomial $x_1^{a_1}x_2^{a_2}\cdots x_n^{a_n}$ with zero results in a polynomial equation in ambient polynomial spaces that must be satisfied on the associated functional varieties.

3.3.1  Example 12

Let $N$ be a rooted tree $T$ with distinct inputs $x_1,\dots,x_n$. Constraints of the form $F_{x_ix_k}F_{x_j}=F_{x_jx_k}F_{x_i}$ are not only necessary conditions for a smooth function $F=F(x_1,\dots,x_n)$ to be computable by $T$; by virtue of theorem 3, they are also sufficient for the existence of a local representation of $F$ on $T$ if suitable nonvanishing conditions are satisfied. An interesting feature of this setting is that when $F$ is a polynomial $P=P(x_1,\dots,x_n)$, one can relax the nonvanishing conditions; and $P$ actually admits a global representation as a composition of polynomials if it satisfies the characteristic PDEs (Farhoodi et al., 2019, proposition 4). The basic idea is that if $P$ is locally written as a superposition of smooth functions according to the hierarchy provided by $T$, then comparing the Taylor series shows that the constituent parts of the superposition could be chosen to be polynomials as well. Now $P$ and such a polynomial superposition must be the same since they agree on a nonempty open set. Consequently, each $F_d(N)$ coincides with its closure $V_d(N)$ and can be described by equations of the form $P_{x_ix_k}P_{x_j}=P_{x_jx_k}P_{x_i}$ in the polynomial space. Substituting an expression of the form
$$P(x_1,\dots,x_n)=\sum_{a_1,\dots,a_n\geq 0}c_{a_1,\dots,a_n}\,x_1^{a_1}\cdots x_n^{a_n}$$
in $P_{x_ix_k}P_{x_j}-P_{x_jx_k}P_{x_i}=0$ and equating the coefficient of a monomial $x_1^{a_1}\cdots x_n^{a_n}$ with zero yields
$$\sum_{\substack{a_i'+a_i''=a_i+1,\ a_j'+a_j''=a_j+1,\ a_k'+a_k''=a_k+1\\ a_s'+a_s''=a_s\ \forall s\in\{1,\dots,n\}-\{i,j,k\}}}a_k'\bigl(a_i'a_j''-a_j'a_i''\bigr)\,c_{a_1',\dots,a_n'}\,c_{a_1'',\dots,a_n''}=0. \tag{3.29}$$
We deduce that equations 3.29 written for $a_1,\dots,a_n\geq 0$ and for triples $(i,j,k)$ with the property that $x_k$ is separated from $x_i$ and $x_j$ by a subtree of $T$ (as in theorem 3) describe the functional varieties associated with $T$. In a given degree $d$, to obtain equations describing $T$ in $\mathrm{Poly}_{d,n}$, one should set any $c_{b_1,\dots,b_n}$ with $b_1+\cdots+b_n>d$ to be zero in equation 3.29. No such coefficient occurs if $d\geq a_1+\cdots+a_n+3$, and thus for $d$ large enough, equation 3.29 defines an equation in $\mathrm{Poly}_{d,n}$ as is.
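The passage from a PDE constraint to polynomial equations in the coefficients can be made concrete in a small case. The following SymPy sketch assumes $n=3$, $d=2$, and the tree $g(f(x_1,x_2),x_3)$, so that the only constraint is the one for $(i,j,k)=(1,2,3)$; the helper names `coordinates` and `residuals` and the two test polynomials are hypothetical and chosen for illustration only.

```python
import itertools
import sympy as sp

x1, x2, x3 = sp.symbols("x1 x2 x3")
gens = (x1, x2, x3)
d = 2

# Exponent vectors of total degree <= d: the coordinates c_a on Poly_{d,3}.
exps = [e for e in itertools.product(range(d + 1), repeat=3) if sum(e) <= d]
c = {e: sp.Symbol("c_%d%d%d" % e) for e in exps}
P = sum(c[e] * x1**e[0] * x2**e[1] * x3**e[2] for e in exps)

# Tree g(f(x1, x2), x3): the constraint is P_{x1 x3} P_{x2} - P_{x2 x3} P_{x1} = 0.
pde = sp.diff(P, x1, x3) * sp.diff(P, x2) - sp.diff(P, x2, x3) * sp.diff(P, x1)
equations = sp.Poly(sp.expand(pde), *gens).coeffs()   # polynomials in the c_a


def coordinates(Q):
    """Coefficient assignment {c_a: value} of a concrete polynomial Q."""
    poly = sp.Poly(Q, *gens)
    return {c[e]: poly.nth(*e) for e in exps}


def residuals(Q):
    """Values of the coefficient equations at the point of Poly_{d,3} given by Q."""
    vals = coordinates(Q)
    return [sp.expand(eq.subs(vals)) for eq in equations]


f = x1 + 2 * x2
on_tree = sp.expand(f * x3 + f)   # g(u, x3) = u*x3 + u, computable by the tree
off_tree = x1 * x3 + x2           # does not satisfy the constraint
print(len(exps), residuals(on_tree), residuals(off_tree))
```

Here the number of coordinates is $\binom{d+n}{n}=\binom{5}{3}=10$. All equations vanish at the polynomial computed by the tree, while at least one is nonzero at the second polynomial.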

Similarly, theorem 7 can be used to write equations for $F_d^{\mathrm{act}}(N)=V_d^{\mathrm{act}}(N)$. In that situation, a new family of equations corresponding to equation 1.20 emerges that are expected to be extremely complicated.

3.3.2  Example 13

Let $N$ be the neural network appearing in Figure 4. The functional space $F_d^{\mathrm{act}}(N)$ is formed by polynomials $P(x,t)$ of total degree at most $d$ that are in the form of $\sigma(f(ax+bt)+g(a'x+b't))$. By examining the Taylor expansions, it is not hard to see that if $P(x,t)$ is written in this form for univariate smooth functions $\sigma$, $f$, and $g$, then these functions could be chosen to be polynomials. Therefore, in any degree $d$, our characterization of superpositions of this form in example 11 in terms of PDEs and PDIs results in polynomial equations and inequalities that describe a Zariski open subset of $F_d^{\mathrm{act}}(N)$, which is the complement of the locus where the nonvanishing conditions fail. The inequalities disappear after taking the closure, so $V_d^{\mathrm{act}}(N)$ is strictly larger than $F_d^{\mathrm{act}}(N)$ here.

Remark 8.
The emergence of inequalities in describing the functional spaces, as observed in example 13, is not surprising due to the Tarski–Seidenberg theorem (see Coste, 2000), which implies that the image of a polynomial map between real varieties (i.e., a map whose components are polynomials) is semialgebraic; that is, it could be described as a union of finitely many sets defined by polynomial equations and inequalities. To elaborate, fix a neural network architecture $N$. Composing polynomials of bounded degrees according to the hierarchy provided by $N$ yields polynomial superpositions lying in $F_D(N)$ for $D$ sufficiently large. The composition thus amounts to a map
$$\mathrm{Poly}_{d_1,n_1}\times\cdots\times\mathrm{Poly}_{d_N,n_N}\to\mathrm{Poly}_{D,n},$$
where, on the left-hand side, the polynomials assigned to the neurons of $N$ appear, and $D\gg d_1,\dots,d_N$. The image, a subset of $F_D(N)$, is semialgebraic and thus admits a description in terms of finitely many polynomial equations and inequalities. The same logic applies to the regime of activation functions too; the map just mentioned must be replaced with
$$\mathbb{R}^{C}\times\mathrm{Poly}_{d_1,1}\times\cdots\times\mathrm{Poly}_{d_N,1}\to\mathrm{Poly}_{D,n},$$
whose image lies in $F_D^{\mathrm{act}}(N)$, and its domain is the Cartesian product of spaces of polynomial activation functions assigned to the neurons by the space $\mathbb{R}^{C}$ of weights assigned to the connections of the network.
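To make the composition map tangible, the sketch below writes it out in coordinates for a very small case. It assumes the tree $g(f(x,y),z)$ with a degree-one inner polynomial and a particular three-term outer polynomial; these choices and the symbol names are hypothetical. Each coefficient of the composed polynomial is itself a polynomial in the coefficients assigned to the neurons, which is what makes the Tarski–Seidenberg theorem applicable.

```python
import sympy as sp

x, y, z, u = sp.symbols("x y z u")
a0, a1, a2, b1, b2, b3 = sp.symbols("a0 a1 a2 b1 b2 b3")

# Hypothetical neuron polynomials for the tree g(f(x, y), z).
f = a0 + a1 * x + a2 * y              # inner neuron, a point of Poly_{1,2}
g = b1 * u + b2 * z + b3 * u * z      # outer neuron g(u, z), a point of Poly_{2,2}

# Composition, viewed as a polynomial in x, y, z with coefficients in the a's and b's.
P = sp.Poly(sp.expand(g.subs(u, f)), x, y, z)

# Coordinates of the image point in Poly_{2,3}: {exponent vector: coefficient}.
composition_map = dict(P.terms())
print(composition_map)
```

For instance, the coefficient of $xz$ is $a_1b_3$ and the coefficient of $z$ is $a_0b_3+b_2$.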

Building on the examples of the previous section, we prove theorems 3 and 4. This will establish conjecture 1 for tree architectures with distinct inputs.

Proof of Theorem 3.
The necessity of the constraints from equation 1.18 follows from example 1. As demonstrated in Figure 12, picking three of the variables $x_i=x$, $x_j=y$, and $x_k=z$, where the former two are separated from the latter by a subtree, and taking the rest of the variables to be constant, we obtain a superposition of the form $F(x,y,z)=g(f(x,y),z)$ studied in example 1; it should satisfy $F_{xz}F_y=F_{yz}F_x$ or, equivalently, equation 1.18.
Figure 12:

The necessity of constraint 1.18 in theorem 3 follows from the case of trivariate tree functions discussed in example 1. Choosing three of the variables (red leaves) and fixing the rest (gray leaves) results in a superposition of the form g(f(x,y),z) that must obey constraint 1.4.

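We induct on the number of variables, which coincides with the number of leaves, to prove the sufficiency of constraint 1.18 and the nonvanishing conditions in theorem 3 for the existence of a local implementation, in the form of a superposition of functions of lower arity, on the tree architecture in hand. Consider a rooted tree $T$ with $n$ leaves labeled by the coordinate functions $x_1,\dots,x_n$. The inductive step is illustrated in Figure 13. Removing the root results in a number of smaller trees $T_1,\dots,T_l$ and a number of single vertices⁸ corresponding to the leaves adjacent to the root of $T$. By renumbering $x_1,\dots,x_n$ one may write the leaves as
$$x_1,\dots,x_{m_1};\ x_{m_1+1},\dots,x_{m_1+m_2};\ \dots;\ x_{m_1+\cdots+m_{l-1}+1},\dots,x_{m_1+\cdots+m_l};\ x_{m_1+\cdots+m_l+1};\ \dots;\ x_n, \tag{4.1}$$
where $x_{m_1+\cdots+m_{s-1}+1},\dots,x_{m_1+\cdots+m_{s-1}+m_s}$ $(1\leq s\leq l)$ are the leaves of the subtree $T_s$ while $x_{m_1+\cdots+m_l+1}$ through $x_n$ are the leaves adjacent to the root of $T$. The goal is to write $F(x_1,\dots,x_n)$ as
$$g\bigl(G_1(x_1,\dots,x_{m_1}),\dots,G_l(x_{m_1+\cdots+m_{l-1}+1},\dots,x_{m_1+\cdots+m_l}),x_{m_1+\cdots+m_l+1},\dots,x_n\bigr), \tag{4.2}$$
where each smooth function,
$$G_s(x_{m_1+\cdots+m_{s-1}+1},\dots,x_{m_1+\cdots+m_{s-1}+m_s}),$$
satisfies the constraints coming from $T_s$ and thus, by invoking the induction hypothesis, is computable by the tree $T_s$. Following the discussion before theorem 5, it suffices to express $\nabla F$ as a linear combination of the gradients $\nabla G_1,\dots,\nabla G_l,\nabla x_{m_1+\cdots+m_l+1},\dots,\nabla x_n$. The nonvanishing conditions in theorem 3 require the first-order partial derivative with respect to at least one of the leaves of each $T_s$ to be nonzero; we may assume $F_{x_{m_1+\cdots+m_{s-1}+1}}\neq 0$ without any loss of generality. We should have
$$\begin{aligned}\nabla F&=\begin{bmatrix}F_{x_1}&\cdots&F_{x_{m_1}}&\cdots&F_{x_{m_1+\cdots+m_{l-1}+1}}&\cdots&F_{x_{m_1+\cdots+m_l}}&F_{x_{m_1+\cdots+m_l+1}}&\cdots&F_{x_n}\end{bmatrix}^{T}\\ &=\sum_{s=1}^{l}F_{x_{m_1+\cdots+m_{s-1}+1}}\Bigl[\underbrace{0\ \cdots\ 0}_{m_1+\cdots+m_{s-1}}\ \ 1\ \ \tfrac{F_{x_{m_1+\cdots+m_{s-1}+2}}}{F_{x_{m_1+\cdots+m_{s-1}+1}}}\ \cdots\ \tfrac{F_{x_{m_1+\cdots+m_{s-1}+m_s}}}{F_{x_{m_1+\cdots+m_{s-1}+1}}}\ \ \underbrace{0\ \cdots\ 0}_{n-(m_1+\cdots+m_s)}\Bigr]^{T}\\ &\quad+F_{x_{m_1+\cdots+m_l+1}}\nabla x_{m_1+\cdots+m_l+1}+\cdots+F_{x_n}\nabla x_n\\ &\in\operatorname{Span}\bigl\{\nabla G_1(x_1,\dots,x_{m_1}),\dots,\nabla G_l(x_{m_1+\cdots+m_{l-1}+1},\dots,x_{m_1+\cdots+m_l}),\nabla x_{m_1+\cdots+m_l+1},\dots,\nabla x_n\bigr\}.\end{aligned}$$
In the expressions above, the vector $\Bigl[1\ \ \tfrac{F_{x_{m_1+\cdots+m_{s-1}+2}}}{F_{x_{m_1+\cdots+m_{s-1}+1}}}\ \cdots\ \tfrac{F_{x_{m_1+\cdots+m_{s-1}+m_s}}}{F_{x_{m_1+\cdots+m_{s-1}+1}}}\Bigr]^{T}$ (which is of size $m_s$) is dependent only on the variables $x_{m_1+\cdots+m_{s-1}+1},\dots,x_{m_1+\cdots+m_s}$, which are the leaves of $T_s$: any other leaf $x_k$ is separated from them by the subtree $T_s$ of $T$, and, hence, for any leaf $x_i$ with $m_1+\cdots+m_{s-1}<i\leq m_1+\cdots+m_s$, we have $\Bigl(\tfrac{F_{x_i}}{F_{x_{m_1+\cdots+m_{s-1}+1}}}\Bigr)_{x_k}=0$ due to the simplified form 1.21 of equation 1.18. To finish the proof, one should establish the existence of functions $G_s(x_{m_1+\cdots+m_{s-1}+1},\dots,x_{m_1+\cdots+m_s})$ appearing in equation 4.2; that is, $\Bigl[1\ \ \tfrac{F_{x_{m_1+\cdots+m_{s-1}+2}}}{F_{x_{m_1+\cdots+m_{s-1}+1}}}\ \cdots\ \tfrac{F_{x_{m_1+\cdots+m_{s-1}+m_s}}}{F_{x_{m_1+\cdots+m_{s-1}+1}}}\Bigr]^{T}$ should be shown to be parallel to a gradient vector field $\nabla G_s$. Notice that the induction hypothesis would be applicable to $G_s$ since any ratio $\frac{(G_s)_{x_i}}{(G_s)_{x_j}}$ of partial derivatives is the same as the corresponding ratio of partial derivatives of $F$. Invoking theorem 5, to prove the existence of $G_s$, we should verify that the 1-form
$$\omega_s:=\sum_{i=m_1+\cdots+m_{s-1}+1}^{m_1+\cdots+m_{s-1}+m_s}\frac{F_{x_i}}{F_{x_{m_1+\cdots+m_{s-1}+1}}}\,dx_i\qquad(1\leq s\leq l)$$
satisfies $\omega_s\wedge d\omega_s=0$. We finish the proof by showing this in the case of $s=1$; other cases are completely similar. We have
$$\begin{aligned}\omega_1\wedge d\omega_1&=\Bigl(\sum_{i=1}^{m_1}\frac{F_{x_i}}{F_{x_1}}\,dx_i\Bigr)\wedge\Bigl(\sum_{j=1}^{m_1}d\Bigl(\frac{F_{x_j}}{F_{x_1}}\Bigr)\wedge dx_j\Bigr)=\Bigl(\sum_{i=1}^{m_1}\frac{F_{x_i}}{F_{x_1}}\,dx_i\Bigr)\wedge\Bigl(\sum_{j=1}^{m_1}\sum_{k=1}^{m_1}\Bigl(\frac{F_{x_j}}{F_{x_1}}\Bigr)_{x_k}dx_k\wedge dx_j\Bigr)\\ &=\sum_{i,j,k\in\{1,\dots,m_1\}}\Bigl[\frac{F_{x_i}F_{x_jx_k}}{(F_{x_1})^2}-\frac{F_{x_i}F_{x_j}F_{x_1x_k}}{(F_{x_1})^3}\Bigr]dx_i\wedge dx_k\wedge dx_j\\ &=\Bigl(\sum_{i=1}^{m_1}\frac{F_{x_i}}{(F_{x_1})^2}\,dx_i\Bigr)\wedge\Bigl(\sum_{j,k\in\{1,\dots,m_1\}}F_{x_jx_k}\,dx_k\wedge dx_j\Bigr)+\Bigl(\sum_{i,j\in\{1,\dots,m_1\}}F_{x_i}F_{x_j}\,dx_i\wedge dx_j\Bigr)\wedge\Bigl(\sum_{k=1}^{m_1}\frac{F_{x_1x_k}}{(F_{x_1})^3}\,dx_k\Bigr).\end{aligned}$$
The last two terms are zero because, in the parentheses, the 2-forms
$$\sum_{j,k\in\{1,\dots,m_1\}}F_{x_jx_k}\,dx_j\wedge dx_k,\qquad\sum_{i,j\in\{1,\dots,m_1\}}F_{x_i}F_{x_j}\,dx_i\wedge dx_j,$$
are zero since interchanging $j$ and $k$ or $i$ and $j$ in the summations results in the opposite of the original differential form.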
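The vanishing of $\omega_1\wedge d\omega_1$ can be confirmed symbolically in the case of three leaves in one subtree, where a 1-form $\omega=a_1dx_1+a_2dx_2+a_3dx_3$ satisfies $\omega\wedge d\omega=(a\cdot\operatorname{curl}a)\,dx_1\wedge dx_2\wedge dx_3$. The function $F$ in this minimal SymPy sketch is an arbitrary assumed example and is not taken from the article.

```python
import sympy as sp

x1, x2, x3 = sp.symbols("x1 x2 x3")

# Assumed smooth function; F_{x1} is nonzero away from a lower-dimensional locus.
F = x1 * x2 * x3 + x1**2 + x3**3
F1 = sp.diff(F, x1)

# Coefficients of omega = sum_i (F_{x_i} / F_{x_1}) dx_i.
a = [sp.diff(F, v) / F1 for v in (x1, x2, x3)]

# In three variables, omega ^ d(omega) = (a . curl a) dx1 ^ dx2 ^ dx3.
curl = [
    sp.diff(a[2], x2) - sp.diff(a[1], x3),
    sp.diff(a[0], x3) - sp.diff(a[2], x1),
    sp.diff(a[1], x1) - sp.diff(a[0], x2),
]
wedge = sp.simplify(sum(ai * ci for ai, ci in zip(a, curl)))
print(wedge)
```

The result vanishes for any smooth $F$ with $F_{x_1}\neq 0$, since $\omega$ is a multiple of $dF$ restricted to the variables of the subtree.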
Figure 13:

The inductive step in the proof of theorem 3. The removal of the root of T results in a number of smaller rooted trees along with single vertices that were the leaves adjacent to the root of T (if any).

Remark 9.

The formulation of theorem 3 in Farhoodi et al. (2019) is concerned with analytic functions and binary trees. The proof presented above follows the same inductive procedure but utilizes theorem 5 instead of Taylor expansions. Of course, theorem 5 remains valid in the analytic category, so the tree representation of $F$ constructed in the proof here consists of analytic functions if $F$ is analytic. An advantage of working with analytic functions is that in certain cases, the nonvanishing conditions may be relaxed. For instance, if in example 1 the function $F(x,y,z)$ satisfying equation 1.4 is analytic, it admits a local representation of the form 1.1, while if $F$ is only smooth, at least one of the conditions $F_x\neq 0$ or $F_y\neq 0$ is required. (See Farhoodi et al., 2019, sec. 5.1 and 5.3, for details.)

Proof of Theorem 4.
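Establishing the necessity of constraints 1.19 and 1.20 is straightforward. An implementation of a smooth function $F=F(x_1,\dots,x_n)$ on the tree $T$ is in a form such as
$$\sigma\Bigl(\cdots\bigl(\tilde{w}.\tilde{\sigma}\bigl(w_1.\tau_1\bigl(\tilde{w}_1.\tilde{\tau}_1(cx_i+\cdots)+\tilde{\tilde{w}}_1.\tilde{\tilde{\tau}}_1(c'x_{i'}+\cdots)+\cdots\bigr)+w_2.\tau_2\bigl(\tilde{w}_2.\tilde{\tau}_2(dx_j+\cdots)+\tilde{\tilde{w}}_2.\tilde{\tilde{\tau}}_2(d'x_{j'}+\cdots)+\cdots\bigr)+w_3.\tau_3(\cdots)+\cdots\bigr)+\cdots\bigr)\cdots\Bigr) \tag{4.3}$$
for appropriate activation functions and weights. In the expression above, variables $x_s$ appearing in
$$\tilde{\sigma}\bigl(w_1.\tau_1\bigl(\tilde{w}_1.\tilde{\tau}_1(cx_i+\cdots)+\tilde{\tilde{w}}_1.\tilde{\tilde{\tau}}_1(c'x_{i'}+\cdots)+\cdots\bigr)+w_2.\tau_2\bigl(\tilde{w}_2.\tilde{\tau}_2(dx_j+\cdots)+\tilde{\tilde{w}}_2.\tilde{\tilde{\tau}}_2(d'x_{j'}+\cdots)+\cdots\bigr)+w_3.\tau_3(\cdots)+\cdots\bigr)$$
are the leaves of the smallest (full) subtree of $T$ in which both $x_i$ and $x_j$ appear as leaves. Denoting this subtree by $\tilde{T}$, the activation function applied at the root of $\tilde{T}$ is $\tilde{\sigma}$, and the subtrees emanating from the root of $\tilde{T}$, which we write as $\tilde{T}_1,\tilde{T}_2,\tilde{T}_3,\dots$, have $\tau_1,\tau_2,\tau_3,\dots$ assigned to their roots. Here, $\tilde{T}_1$ and $\tilde{T}_2$ contain $x_i$ and $x_j$, respectively, and are the largest (full) subtrees that have exactly one of $x_i$ and $x_j$. To verify equation 1.19, notice that $\frac{F_{x_i}}{F_{x_j}}$ is proportional to
$$\frac{\tau_1'\bigl(\tilde{w}_1.\tilde{\tau}_1(cx_i+\cdots)+\tilde{\tilde{w}}_1.\tilde{\tilde{\tau}}_1(c'x_{i'}+\cdots)+\cdots\bigr)\,\tilde{\tau}_1'(cx_i+\cdots)}{\tau_2'\bigl(\tilde{w}_2.\tilde{\tau}_2(dx_j+\cdots)+\tilde{\tilde{w}}_2.\tilde{\tilde{\tau}}_2(d'x_{j'}+\cdots)+\cdots\bigr)\,\tilde{\tau}_2'(dx_j+\cdots)} \tag{4.4}$$
with the constant of proportionality being a quotient of two products of certain weights of the network. The ratio 4.4 is dependent only on those variables that appear as leaves of $\tilde{T}_1$ and $\tilde{T}_2$, so
$$\Bigl(\frac{F_{x_i}}{F_{x_j}}\Bigr)_{x_k}=0\iff F_{x_ix_k}F_{x_j}=F_{x_jx_k}F_{x_i}$$
unless there is a subtree of $T$ containing the leaf $x_k$ and exactly one of $x_i$ or $x_j$ (which will necessarily be a subtree of $\tilde{T}_1$ or $\tilde{T}_2$). Before switching to constraint 1.20, we point out that the description of $F$ in equation 4.3 assumes that the leaves $x_i$ and $x_j$ are not siblings. If they are, $F$ may be written as
$$\sigma\Bigl(\cdots\tilde{w}.\tilde{\sigma}\bigl(w.\tau(cx_i+dx_j+\cdots)+\cdots\bigr)+\cdots\Bigr),$$
in which case, $\frac{F_{x_i}}{F_{x_j}}=\frac{c}{d}$ is a constant and hence equation 1.21 holds for all $1\leq k\leq n$. To finish the proof of necessity of the constraints introduced in theorem 7, consider the fraction from equation 4.4, which is a multiple of $\frac{F_{x_i}}{F_{x_j}}$. This has a description as a product of a function of $x_i,x_{i'},\dots$ (leaves of $\tilde{T}_1$) by a function of $x_j,x_{j'},\dots$ (leaves of $\tilde{T}_2$). Lemma 4 now implies that for any leaf $x_{i'}$ of $\tilde{T}_1$ and any leaf $x_{j'}$ of $\tilde{T}_2$,
$$\left(\frac{\bigl(F_{x_i}/F_{x_j}\bigr)_{x_{i'}}}{F_{x_i}/F_{x_j}}\right)_{x_{j'}}=0;$$
hence the simplified form, equation 1.22, of equation 1.20.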

We induct on the number of leaves to prove the sufficiency of constraints 1.19 and 1.20 (accompanied by suitable nonvanishing conditions) for the existence of a tree implementation of a smooth function $F=F(x_1,\dots,x_n)$ as a composition of functions of the form 1.12. Given a rooted tree $T$ with $n$ leaves labeled by $x_1,\dots,x_n$, the inductive step has two cases demonstrated in Figures 14 and 15:

  • There are leaves, say, $x_{m+1},\dots,x_n$, directly adjacent to the root of $T$; their removal results in a smaller tree $T'$ with leaves $x_1,\dots,x_m$ (see Figure 14). The goal is to write $F(x_1,\dots,x_n)$ as
    $$\sigma\bigl(G(x_1,\dots,x_m)+c_{m+1}x_{m+1}+\cdots+c_nx_n\bigr), \tag{4.5}$$
    with $G$ satisfying appropriate constraints that, invoking the induction hypothesis, guarantee that $G$ is computable by $T'$.
    Figure 14:

    The first case of the inductive step in the proof of theorem 4. The removal of the leaves directly connected to the root of T results in a smaller rooted tree.

    Figure 15:

    The second case of the inductive step in the proof of theorem 4. There is no leaf directly connected to the root of T. Separating one of the rooted subtrees adjacent to the root results in two smaller rooted trees.

  • There is no leaf adjacent to the root of $T$, but there are smaller subtrees. Denote one of them by $T_2$ and its leaves by $x_{m+1},\dots,x_n$. Removing this subtree results in a smaller tree $T_1$ with leaves $x_1,\dots,x_m$ (see Figure 15). The goal is to write $F(x_1,\dots,x_n)$ as
    $$\sigma\bigl(G_1(x_1,\dots,x_m)+G_2(x_{m+1},\dots,x_n)\bigr), \tag{4.6}$$
    with $G_1$ and $G_2$ satisfying constraints corresponding to $T_1$ and $T_2$, so that they may be implemented on these trees by invoking the induction hypothesis.

Following the discussion at the beginning of section 3, $F$ may be locally written as a function of another function with nonzero gradient if the gradients are parallel. This idea has been frequently used so far, but there is a twist here: we want such a description of $F$ to persist on the box-like region $B$ that is the domain of $F$. Lemma 2 resolves this issue. The tree function in the argument of $\sigma$ in either of equation 4.5 or 4.6, which here we denote by $\tilde{F}$, shall be constructed below by invoking the induction hypothesis, so $\tilde{F}$ is defined at every point of $B$. Besides, our description of $\nabla\tilde{F}$ below (cf. equations 4.7 and 4.9) readily indicates that just like $F$, it satisfies the nonvanishing conditions of theorem 7. Applying lemma 2 to $\tilde{F}$, any level set $\{x\in B\mid\tilde{F}(x)=c\}$ is connected, and $\tilde{F}$ can be extended to a coordinate system $(\tilde{F},F_2,\dots,F_n)$ for $B$. Thus, $F$, whose partial derivatives with respect to other coordinate functions vanish, realizes precisely one value on any coordinate hypersurface $\{x\in B\mid\tilde{F}(x)=c\}$. Setting $\sigma(c)$ to be the aforementioned value of $F$ defines a function $\sigma$ with $F=\sigma(\tilde{F})$. After this discussion on the domain of definition of the desired representation of $F$, we proceed with constructing $\tilde{F}=\tilde{F}(x_1,\dots,x_n)$ as either $G(x_1,\dots,x_m)+c_{m+1}x_{m+1}+\cdots+c_nx_n$ in the case of equation 4.5 or as $G_1(x_1,\dots,x_m)+G_2(x_{m+1},\dots,x_n)$ in the case of equation 4.6.

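In the case of equation 4.5, assuming that, as theorem 7 requires, one of the partial derivatives $F_{x_{m+1}},\dots,F_{x_n}$, for example $F_{x_n}$, is nonzero, we should have
$$\begin{aligned}\nabla F&=\begin{bmatrix}F_{x_1}&\cdots&F_{x_m}&F_{x_{m+1}}&\cdots&F_{x_{n-1}}&F_{x_n}\end{bmatrix}^{T}\parallel\begin{bmatrix}\frac{F_{x_1}}{F_{x_n}}&\cdots&\frac{F_{x_m}}{F_{x_n}}&\frac{F_{x_{m+1}}}{F_{x_n}}&\cdots&\frac{F_{x_{n-1}}}{F_{x_n}}&1\end{bmatrix}^{T}\\ &=\begin{bmatrix}G_{x_1}&\cdots&G_{x_m}&c_{m+1}&\cdots&c_{n-1}&1\end{bmatrix}^{T}=\nabla\bigl(G(x_1,\dots,x_m)+c_{m+1}x_{m+1}+\cdots+c_{n-1}x_{n-1}+x_n\bigr).\end{aligned} \tag{4.7}$$
Here, each ratio $\frac{F_{x_j}}{F_{x_n}}$ where $m<j\leq n$ must be a constant, which we denote by $c_j$, due to the simplified form, equation 1.21, of equation 1.19: the only (full) subtree of $T$ containing either $x_j$ or $x_n$ is the whole tree since these leaves are adjacent to the root of $T$. On the other hand, $\begin{bmatrix}\frac{F_{x_1}}{F_{x_n}}&\cdots&\frac{F_{x_m}}{F_{x_n}}\end{bmatrix}^{T}$ appearing in equation 4.7 is a gradient vector field of the form $\nabla G(x_1,\dots,x_m)$, again as a by-product of equations 1.19 and 1.21: each ratio $\frac{F_{x_i}}{F_{x_n}}$ where $1\leq i\leq m$ is independent of $x_{m+1},\dots,x_n$ by the same reasoning as above; and this vector function of $(x_1,\dots,x_m)$ is integrable because for any $1\leq i,i'\leq m$,
$$\Bigl(\frac{F_{x_i}}{F_{x_n}}\Bigr)_{x_{i'}}=\Bigl(\frac{F_{x_{i'}}}{F_{x_n}}\Bigr)_{x_i}\iff F_{x_ix_n}F_{x_{i'}}=F_{x_{i'}x_n}F_{x_i}.$$
Hence, such a $G(x_1,\dots,x_m)$ exists; moreover, it satisfies constraints from the induction hypothesis since any ratio $\frac{G_{x_j}}{G_{x_{j'}}}$ coincides with the corresponding ratio of partial derivatives of $F$, a function assumed to satisfy equations 1.21 and 1.22.
Next, in the second case of the inductive step, let us turn to equation 4.6. The nonvanishing conditions of theorem 7 require a partial derivative among $F_{x_1},\dots,F_{x_m}$ and also a partial derivative among $F_{x_{m+1}},\dots,F_{x_n}$ to be nonzero. Without any loss of generality, we assume $F_{x_1}\neq 0$ and $F_{x_n}\neq 0$. We want to apply lemma 4 to split the ratio $\frac{F_{x_1}}{F_{x_n}}\neq 0$ as
$$\frac{F_{x_1}}{F_{x_n}}=\beta(x_1,\dots,x_m)\,\frac{1}{\gamma(x_{m+1},\dots,x_n)}=\frac{\beta(x_1,\dots,x_m)}{\gamma(x_{m+1},\dots,x_n)}. \tag{4.8}$$
To do so, it needs to be checked that
$$\left(\frac{\bigl(F_{x_1}/F_{x_n}\bigr)_{x_i}}{F_{x_1}/F_{x_n}}\right)_{x_j}=0$$
for any two indices $1\leq i\leq m$ and $m<j\leq n$. This is the content of equation 1.20, or its simplified form, equation 1.22, when $x_i$ belongs to the same maximal subtree of $T$ adjacent to the root that has $x_1$. It holds for other choices of $x_i\in\{x_1,\dots,x_m\}$ too since in that situation, by the simplified form, equation 1.21, of equation 1.19, the derivative $\Bigl(\frac{F_{x_1}}{F_{x_n}}\Bigr)_{x_i}$ must be zero because $x_1$, $x_i$, and $x_n$ belong to different maximal subtrees of $T$. Next, the gradient of $F$ could be written as
$$\begin{aligned}\nabla F&=\begin{bmatrix}F_{x_1}&F_{x_2}&\cdots&F_{x_m}&F_{x_{m+1}}&\cdots&F_{x_{n-1}}&F_{x_n}\end{bmatrix}^{T}\parallel\begin{bmatrix}\frac{F_{x_1}}{F_{x_n}}&\frac{F_{x_2}}{F_{x_n}}&\cdots&\frac{F_{x_m}}{F_{x_n}}&\frac{F_{x_{m+1}}}{F_{x_n}}&\cdots&\frac{F_{x_{n-1}}}{F_{x_n}}&1\end{bmatrix}^{T}\\ &=\begin{bmatrix}\frac{F_{x_1}}{F_{x_n}}&\frac{F_{x_2}}{F_{x_1}}.\frac{F_{x_1}}{F_{x_n}}&\cdots&\frac{F_{x_m}}{F_{x_1}}.\frac{F_{x_1}}{F_{x_n}}&\frac{F_{x_{m+1}}}{F_{x_n}}&\cdots&\frac{F_{x_{n-1}}}{F_{x_n}}&1\end{bmatrix}^{T}\\ &=\frac{F_{x_1}}{F_{x_n}}\Bigl[1\ \ \tfrac{F_{x_2}}{F_{x_1}}\ \cdots\ \tfrac{F_{x_m}}{F_{x_1}}\ \ \underbrace{0\ \cdots\ 0}_{n-m}\Bigr]^{T}+\Bigl[\underbrace{0\ \cdots\ 0}_{m}\ \ \tfrac{F_{x_{m+1}}}{F_{x_n}}\ \cdots\ \tfrac{F_{x_{n-1}}}{F_{x_n}}\ \ 1\Bigr]^{T}.\end{aligned}$$
Combining with equation 4.8:
$$\nabla F\parallel\Bigl[\beta(x_1,\dots,x_m)\ \ \beta(x_1,\dots,x_m).\tfrac{F_{x_2}}{F_{x_1}}\ \cdots\ \beta(x_1,\dots,x_m).\tfrac{F_{x_m}}{F_{x_1}}\ \ \underbrace{0\ \cdots\ 0}_{n-m}\Bigr]^{T}+\Bigl[\underbrace{0\ \cdots\ 0}_{m}\ \ \gamma(x_{m+1},\dots,x_n).\tfrac{F_{x_{m+1}}}{F_{x_n}}\ \cdots\ \gamma(x_{m+1},\dots,x_n).\tfrac{F_{x_{n-1}}}{F_{x_n}}\ \ \gamma(x_{m+1},\dots,x_n)\Bigr]^{T}. \tag{4.9}$$
To establish equation 4.6, it suffices to argue that the vectors on the right-hand side are in the form of $\nabla G_1$ and $\nabla G_2$ for suitable functions $G_1(x_1,\dots,x_m)$ and $G_2(x_{m+1},\dots,x_n)$, to which the induction hypothesis can be applied by the same logic as before. Notice that the first one is dependent only on $x_1,\dots,x_m$, while the second one is dependent only on $x_{m+1},\dots,x_n$, again by equations 1.19 and 1.21. For any $1\leq i\leq m$ and $m<j\leq n$, we have $\Bigl(\frac{F_{x_i}}{F_{x_1}}\Bigr)_{x_j}=0$ (respectively, $\Bigl(\frac{F_{x_j}}{F_{x_n}}\Bigr)_{x_i}=0$) since there is no subtree of $T$ that has only one of $x_1$ and $x_i$ (resp. only one of $x_n$ and $x_j$) and also $x_j$ (resp. also $x_i$). We finish the proof by verifying the corresponding integrability conditions,
$$\Bigl(\beta\,\frac{F_{x_i}}{F_{x_1}}\Bigr)_{x_{i'}}=\Bigl(\beta\,\frac{F_{x_{i'}}}{F_{x_1}}\Bigr)_{x_i},\qquad\Bigl(\gamma\,\frac{F_{x_j}}{F_{x_n}}\Bigr)_{x_{j'}}=\Bigl(\gamma\,\frac{F_{x_{j'}}}{F_{x_n}}\Bigr)_{x_j},$$
for any $1\leq i,i'\leq m$ and $m<j,j'\leq n$. In view of equation 4.8, one can change $\beta$ and $\gamma$ above to $\frac{F_{x_1}}{F_{x_n}}$ or $\frac{F_{x_n}}{F_{x_1}}$, respectively, and write the desired identities as the new ones,
$$\Bigl(\frac{F_{x_1}}{F_{x_n}}\,\frac{F_{x_i}}{F_{x_1}}\Bigr)_{x_{i'}}=\Bigl(\frac{F_{x_1}}{F_{x_n}}\,\frac{F_{x_{i'}}}{F_{x_1}}\Bigr)_{x_i},\qquad\Bigl(\frac{F_{x_n}}{F_{x_1}}\,\frac{F_{x_j}}{F_{x_n}}\Bigr)_{x_{j'}}=\Bigl(\frac{F_{x_n}}{F_{x_1}}\,\frac{F_{x_{j'}}}{F_{x_n}}\Bigr)_{x_j},$$
which hold due to equation 1.21.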
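The two kinds of constraints used in this proof can be observed on a small example. The following SymPy sketch assumes a tree whose root has two subtrees, $T_1$ with sibling leaves $x_1,x_2$ and $T_2$ with sibling leaves $x_3,x_4$, and assumes the activation functions $\tanh$, $\sin$, and $\exp$ with arbitrarily chosen weights; none of these choices come from the article.

```python
import sympy as sp

x1, x2, x3, x4 = sp.symbols("x1 x2 x3 x4")

# Assumed tree function of the form 4.6: F = sigma(G1(x1, x2) + G2(x3, x4)).
G1 = sp.sin(x1 + 2 * x2)
G2 = sp.exp(x3 - x4)
F = sp.tanh(G1 + G2)

# Sibling leaves x1, x2: the ratio F_{x1} / F_{x2} is a constant.
sibling_ratio = sp.simplify(sp.diff(F, x1) / sp.diff(F, x2))

# Leaves in different subtrees: q = F_{x1} / F_{x4} splits as beta(x1, x2) / gamma(x3, x4).
# Lemma 4 detects this through (q_{x_i} / q)_{x_j} = 0 with x_i in T1 and x_j in T2.
q = sp.simplify(sp.diff(F, x1) / sp.diff(F, x4))
split_test = sp.simplify(sp.diff(sp.diff(q, x2) / q, x3))
print(sibling_ratio, split_test)
```

The outer activation cancels from every ratio of first-order partial derivatives, which is why the constraints do not depend on the choice of $\sigma$.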
Remark 10.

As mentioned in remark 3, working with functions of the form 1.12 in theorem 7 rather than general smooth functions has the advantage of enabling us to determine a domain on which a superposition representation exists. In contrast, the sufficiency part of theorem 3 is a local statement since it relies on the implicit function theorem. It is possible to say something nontrivial about the domains when functions are furthermore analytic. This is because the implicit function theorem holds in the analytic category as well (Krantz & Parks, 2002, sec. 6.1) where lower bounds on the domain of validity of the theorem exist in the literature (Chang, He, & Prabhu, 2003).

In this article, we proposed a systematic method for studying smooth real-valued functions constructed as compositions of other smooth functions that are either of lower arity or in the form of a univariate activation function applied to a linear combination of inputs. We established that any such smooth superposition must satisfy nontrivial constraints in the form of algebraic PDEs, which are dependent only on the hierarchy of composition or, equivalently, only on the topology of the neural network that produces superpositions of this type. We conjectured that there always exist characteristic PDEs that also provide sufficient conditions for a generic smooth function to be expressible by the feedforward neural network in question. The genericity is to avoid singular cases and is captured by nonvanishing conditions that require certain polynomial functions of partial derivatives to be nonzero. We observed that there are also situations where nontrivial algebraic inequalities involving partial derivatives (PDIs) are imposed on the hierarchical functions. In summary, the conjecture aims to describe generic smooth functions computable by a neural network with finitely many universal conditions of the form $\Phi\neq 0$, $\Psi=0$, and $\Theta>0$, where $\Phi$, $\Psi$, and $\Theta$ are polynomial expressions of the partial derivatives and are dependent only on the architecture of the network, not on any tunable parameter or any activation function used in the network. This is reminiscent of the notion of a semialgebraic set from real algebraic geometry. Indeed, in the case of compositions of polynomial functions or functions computed by polynomial neural networks, the PDE constraints yield equations for the corresponding functional variety in an ambient space of polynomials of a prescribed degree.

The conjecture was verified in several cases, most importantly, for tree architectures with distinct inputs where, in each regime, we explicitly exhibited a PDE characterization of functions computable by a tree network. Examples of tree architectures with repeated inputs were addressed as well. The proofs were mathematical in nature and relied on classical results of multivariable analysis.

The article moreover highlights the differences between the two regimes mentioned at the beginning: the hierarchical functions constructed out of composing functions of lower dimensionality and the hierarchical functions that are compositions of functions of the form $y\mapsto\sigma(\langle w,y\rangle)$. The former functions appear more often in the mathematical literature on the Kolmogorov-Arnold representation theorem, while the latter are ubiquitous in deep learning. The special form of functions $y\mapsto\sigma(\langle w,y\rangle)$ requires more PDE constraints to be imposed on their compositions, whereas their mild nonlinearity is beneficial in terms of ascertaining the domain on which a claimed compositional representation exists.

Our approach for describing the functional spaces associated with feedforward neural networks is of natural interest in the study of expressivity of neural networks and could lead to new complexity measures. We believe that the point of view adopted here is novel and might shed light on a number of practical problems such as comparison of architectures and reverse-engineering deep networks.

Proof of Lemma 2.
We first prove that F can be extended to a coordinate system on the entirety of the box-like region B, which we shall write as I1××In. As in the proof of theorem 3, we group the variables x1,,xn according to the maximal subtrees of T in which they appear:
x1,,xm1;xm1+1,,xm1+m2;;xm1++ml-1+1,,xm1++ml;xm1++ml+1;;xn,
where, denoting the subtrees emanating from the root of T by T1,,Tl, for any 1sl the leaves of Ts are labeled by xm1++ms-1+1,,xm1++ms-1+ms; and xm1++ml+1,,xn represent the leaves that are directly connected to the root (if any); see Figure 13. Among the variables labeling the leaves of T1, there should exist one with respect to which the first-order partial derivative of F is not zero. Without any loss of generality, we may assume that Fx10 at any point of B. Hence, the Jacobian of the map (F,x2,,xn):BRn is always invertible. To prove that the map provides a coordinate system, we just need to show that it is injective. Keeping x2,,xn constant and varying x1, we obtain a univariate function of x1 on the interval I1 whose derivative is always nonzero and is hence injective.
Next, to prove that the level sets of F:BR are connected, notice that F admits a representation
F(x1,,xn)=σw1G1++wlGl+wm1++ml+1'xm1++ml+1++wn'xn
(A.1)
where Gs=Gs(xm1++ms-1+1,,xm1++ms-1+ms) is the tree function that Ts computes by receiving xm1++ms-1+1,,xm1++ms-1+ms from its leaves; σ is the activation function assigned to the root of T; and w1,,wl, wm1++ml+1',,wn' are the weights appearing at the root. A simple application of the chain rule implies that G1, which is a function implemented on the tree T1, satisfies the nonvanishing hypotheses of theorem 7 on the box-like region I1××Im1; moreover, the derivative of σ is nonzero at any point of its domain9 because otherwise there exists a point of pB at which Fxi(p)=0 for any leaf xi. By the same logic, the weight w1 must be nonzero because otherwise all first-order partial derivatives with respect to the variables appearing in T1 are identically zero. We now show that an arbitrary level set Lc:={xB|F(x)=c} is connected. Given the representation (see equation A.1) of F, the level set is empty if σ does not attain the value c. Otherwise, σ attains c at a unique point σ-1(c) of its domain. So one may rewrite the equation F(x1,,xn)=c as
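$$G_1=-\frac{w_2}{w_1}G_2-\dots-\frac{w_l}{w_1}G_l-\frac{w'_{m_1+\dots+m_l+1}}{w_1}x_{m_1+\dots+m_l+1}-\dots-\frac{w'_n}{w_1}x_n+\frac{1}{w_1}\sigma^{-1}(c). \tag{A.2}$$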
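The left-hand side of equation A.2 is a function of $x_1,\dots,x_{m_1}$, while its right-hand side, which we denote by $\tilde{G}$, is a function of $x_{m_1+1},\dots,x_n$. Therefore, the level set $L_c$ is the preimage of
$$\big\{(y,\tilde{x})\in\mathbb{R}\times I_{m_1+1}\times\dots\times I_n\;\big|\;y=\tilde{G}(\tilde{x})\big\} \tag{A.3}$$
under the map
$$\pi:B=(I_1\times\dots\times I_{m_1})\times(I_{m_1+1}\times\dots\times I_n)\to\mathbb{R}\times I_{m_1+1}\times\dots\times I_n,\qquad(x_1,\dots,x_{m_1};\tilde{x})\mapsto\big(G_1(x_1,\dots,x_{m_1}),\tilde{x}\big). \tag{A.4}$$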
The following simple fact can now be invoked: Let $\pi:X\to Y$ be a continuous map of topological spaces that takes open sets to open sets and has connected level sets. Then the preimage of any connected subset of $Y$ under $\pi$ is connected. Here, $L_c$ is the preimage of the set from equation A.3, which is connected since it is the graph of a continuous function, under the map $\pi$ defined in equation A.4, which is open because the scalar-valued function $G_1$ is open: its gradient never vanishes. Therefore, the connectedness of the level sets of $F$ is implied by the connectedness of the level sets of $\pi$. A level set of the map in equation A.4 can be identified with a level set of its first component $G_1$. Consequently, we have reduced to the same problem for the function $G_1$, which is implemented on the smaller tree $T_1$. Therefore, an inductive argument yields the connectedness of the level sets of $F$. It only remains to check the base case of a tree whose leaves are directly connected to the root. In that setting, $F(x_1,\dots,x_n)$ is of the form $\sigma(a_1x_1+\dots+a_nx_n)$ (the family of functions that lemma 3 is concerned with). By repeating the argument used before, the activation function $\sigma$ is injective. Hence, a level set $F(x)=c$ is the intersection of the hyperplane $a_1x_1+\dots+a_nx_n=\sigma^{-1}(c)$ with the box-like region $B$. Such an intersection is convex and thus connected.
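
The passage from equation A.1 to equation A.2 can be checked numerically on a small example. The sketch below is only an illustration under stated assumptions: the tree shape (two subtrees plus one leaf attached to the root), the weights, the evaluation point, and the choice $\sigma=\tanh$ at every node are hypothetical and do not come from the text. It picks a point $x$, sets $c=F(x)$, and confirms that $G_1$ at that point equals the right-hand side $\tilde{G}$ of equation A.2.

```python
import math

# Hypothetical tree (all weights are illustrative assumptions): the root has two
# subtrees T1, T2 with leaves (x1, x2) and (x3, x4), plus the leaf x5 attached directly.
w1, w2, w5p = 1.5, -0.8, 0.6

def G1(x1, x2):
    """Function computed by the subtree T1."""
    return math.tanh(0.9 * x1 + 0.4 * x2)

def G2(x3, x4):
    """Function computed by the subtree T2."""
    return math.tanh(1.1 * x3 - 0.3 * x4)

def F(x1, x2, x3, x4, x5):
    """Equation A.1 with sigma = tanh at the root."""
    return math.tanh(w1 * G1(x1, x2) + w2 * G2(x3, x4) + w5p * x5)

def G_tilde(x3, x4, x5, c):
    """Right-hand side of equation A.2; math.atanh plays the role of sigma^{-1}."""
    return -(w2 / w1) * G2(x3, x4) - (w5p / w1) * x5 + math.atanh(c) / w1

x = (0.2, -0.5, 0.7, 0.1, -0.3)
c = F(*x)  # the level on which x lies
gap = G1(x[0], x[1]) - G_tilde(x[2], x[3], x[4], c)
print("G1 minus right-hand side of A.2:", gap)
```

The division by `w1` is where the nonvanishing of $w_1$ is used, and `math.atanh` is well defined on the range of `tanh` because the root activation is injective.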
Proof of Lemma 3.
The necessity of the conditions $F_{x_ix_k}F_{x_j}=F_{x_jx_k}F_{x_i}$ follows from a simple computation. For the other direction, suppose $F_{x_j}\neq 0$ throughout an open box-like region $B\subseteq\mathbb{R}^n$ and any ratio $\frac{F_{x_i}}{F_{x_j}}$ is constant on $B$. Denoting it by $a_i$, we obtain numbers $a_1,\dots,a_n$ with $a_j=1$. They form a vector $[a_1\;\cdots\;a_n]^{\mathrm{T}}$ parallel to $\nabla F$. Thus, $F$ could have a nonzero first-order partial derivative only with respect to the first member of the coordinate system
$$(a_1x_1+\dots+a_nx_n,\,x_1,\dots,x_{j-1},x_{j+1},\dots,x_n)$$
for $B$. The coordinate hypersurfaces are connected since they are intersections of hyperplanes in $\mathbb{R}^n$ with the convex region $B$. This fact enables us to deduce that $F$ can be written as a function of $a_1x_1+\dots+a_nx_n$ globally.
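
As a numerical sanity check of the necessity direction, the following sketch estimates partial derivatives by central finite differences and evaluates the residual $F_{x_ix_k}F_{x_j}-F_{x_jx_k}F_{x_i}$. The choice $\sigma=\tanh$, the weights `a`, the evaluation point, and the step size `h` are illustrative assumptions rather than anything prescribed by the text. A function that is not of the form $\sigma(a_1x_1+\dots+a_nx_n)$ is included for contrast.

```python
import math

def partial(f, x, i, h=1e-4):
    """Central finite-difference estimate of the first partial derivative f_{x_i} at x."""
    xp, xm = list(x), list(x)
    xp[i] += h
    xm[i] -= h
    return (f(xp) - f(xm)) / (2 * h)

def second_partial(f, x, i, k, h=1e-4):
    """Estimate of the second partial derivative f_{x_i x_k} at x (nested central differences)."""
    return partial(lambda y: partial(f, y, i, h), x, k, h)

def lemma3_residual(f, x, i, j, k):
    """F_{x_i x_k} F_{x_j} - F_{x_j x_k} F_{x_i}, which should vanish for sigma(a . x)."""
    return (second_partial(f, x, i, k) * partial(f, x, j)
            - second_partial(f, x, j, k) * partial(f, x, i))

# Illustrative (assumed) weights and evaluation point.
a = [0.7, -1.3, 0.4]
point = [0.3, 0.5, -0.2]

def ridge(x):
    """sigma(a . x) with sigma = tanh."""
    return math.tanh(sum(ai * xi for ai, xi in zip(a, x)))

def non_ridge(x):
    """x1*x2 + x3, which is not a function of a single linear form."""
    return x[0] * x[1] + x[2]

n = len(a)
ridge_max = max(abs(lemma3_residual(ridge, point, i, j, k))
                for i in range(n) for j in range(n) for k in range(n))
non_ridge_value = lemma3_residual(non_ridge, point, 0, 2, 1)

print("largest residual for the ridge function:", ridge_max)
print("residual for x1*x2 + x3 with (i, j, k) = (1, 3, 2):", non_ridge_value)
```

For the ridge function the residual is zero up to discretization and rounding error, whereas for $x_1x_2+x_3$ the triple $(i,j,k)=(1,3,2)$ gives $F_{x_1x_2}F_{x_3}-F_{x_3x_2}F_{x_1}=1$.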
Proof of Lemma 4.
For a function $q=q_1q_2$ such as equation 3.4, equalities of the form $q\,q_{y_a^{(1)}y_b^{(2)}}=q_{y_a^{(1)}}q_{y_b^{(2)}}$ hold since both sides coincide with $q_1q_2\,(q_1)_{y_a^{(1)}}(q_2)_{y_b^{(2)}}$. For the other direction, let $q=q\big(y_1^{(1)},\dots,y_{n_1}^{(1)};y_1^{(2)},\dots,y_{n_2}^{(2)}\big)$ be a smooth function on an open box-like region $B_1\times B_2\subseteq\mathbb{R}^{n_1}\times\mathbb{R}^{n_2}$ that satisfies $q\,q_{y_a^{(1)}y_b^{(2)}}=q_{y_a^{(1)}}q_{y_b^{(2)}}$ for any $1\le a\le n_1$ and $1\le b\le n_2$, and never vanishes. So $q$ is either always positive or always negative. One may assume the former by replacing $q$ with $-q$ if necessary. Hence, we can define a new function $p:=\ln(q)$ by taking the logarithm. We have
$$p_{y_a^{(1)}y_b^{(2)}}=\left(\frac{q_{y_a^{(1)}}}{q}\right)_{y_b^{(2)}}=\frac{q\,q_{y_a^{(1)}y_b^{(2)}}-q_{y_a^{(1)}}q_{y_b^{(2)}}}{q^2}=0.$$
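It suffices to show that this vanishing of mixed partial derivatives allows us to write $p\big(y_1^{(1)},\dots,y_{n_1}^{(1)};y_1^{(2)},\dots,y_{n_2}^{(2)}\big)$ as $p_1\big(y_1^{(1)},\dots,y_{n_1}^{(1)}\big)+p_2\big(y_1^{(2)},\dots,y_{n_2}^{(2)}\big)$ since then exponentiating yields $q_1$ and $q_2$ as $e^{p_1}$ and $e^{p_2}$, respectively. The domain of $p$ is a box-like region of the form
$$B_1\times B_2=\prod_{a=1}^{n_1}I_a^{(1)}\times\prod_{b=1}^{n_2}I_b^{(2)}.$$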
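Picking an arbitrary point $z_1^{(1)}\in I_1^{(1)}$, the fundamental theorem of calculus implies
$$p\big(y_1^{(1)},\dots,y_{n_1}^{(1)};y_1^{(2)},\dots,y_{n_2}^{(2)}\big)=\int_{z_1^{(1)}}^{y_1^{(1)}}p_{y_1^{(1)}}\big(s_1^{(1)},y_2^{(1)},\dots,y_{n_1}^{(1)};y_1^{(2)},\dots,y_{n_2}^{(2)}\big)\,ds_1^{(1)}+p\big(z_1^{(1)},y_2^{(1)},\dots,y_{n_1}^{(1)};y_1^{(2)},\dots,y_{n_2}^{(2)}\big).$$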
On the right-hand side, the integral is dependent only on $y_1^{(1)},\dots,y_{n_1}^{(1)}$ because the partial derivatives of the integrand with respect to $y_1^{(2)},\dots,y_{n_2}^{(2)}$ are all identically zero. The second term, $p\big(z_1^{(1)},y_2^{(1)},\dots,y_{n_1}^{(1)};y_1^{(2)},\dots,y_{n_2}^{(2)}\big)$, is a function on the smaller box-like region
$$\prod_{a=2}^{n_1}I_a^{(1)}\times\prod_{b=1}^{n_2}I_b^{(2)}$$
in $\mathbb{R}^{n_1-1}\times\mathbb{R}^{n_2}$ and thus, proceeding inductively, can be brought into the appropriate summation form.
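
Once $p=\ln q$ is known to split as $p_1+p_2$ on a box, the factors can be read off from $q$ itself by fixing a base point $(z^{(1)},z^{(2)})$: one may take $q_1(y^{(1)})=q(y^{(1)},z^{(2)})$ and $q_2(y^{(2)})=q(z^{(1)},y^{(2)})/q(z^{(1)},z^{(2)})$. The sketch below illustrates this on two example functions; the functions, the base point, and the sample points are assumptions chosen for illustration, and the helper name `factor_from_base_point` is hypothetical.

```python
import math

def factor_from_base_point(q, z1, z2):
    """Return (q1, q2) built from q alone, using the base point (z1, z2).

    If q(y1, y2) = f(y1) * g(y2) with q nonvanishing, then
    q1(y1) = q(y1, z2) and q2(y2) = q(z1, y2) / q(z1, z2) satisfy q = q1 * q2.
    """
    q_base = q(z1, z2)
    q1 = lambda y1: q(y1, z2)
    q2 = lambda y2: q(z1, y2) / q_base
    return q1, q2

# Illustrative (assumed) examples; y1 and y2 are tuples holding the two blocks of variables.
def separable(y1, y2):
    return (1.0 + y1[0] ** 2 + y1[1] ** 2) * math.exp(y2[0] - 0.5 * y2[1])

def coupled(y1, y2):
    # ln(q) = y1[0] * y2[0] has a nonzero mixed partial derivative.
    return math.exp(y1[0] * y2[0])

samples = [((0.3, -1.2), (0.8, 0.1)), ((-0.7, 0.4), (-0.2, 1.5))]
z1, z2 = (0.0, 0.0), (0.0, 0.0)

def max_gap(q):
    """Largest deviation between q and the candidate product q1 * q2 over the samples."""
    q1, q2 = factor_from_base_point(q, z1, z2)
    return max(abs(q(y1, y2) - q1(y1) * q2(y2)) for y1, y2 in samples)

print("separable q:", max_gap(separable))
print("coupled q:  ", max_gap(coupled))
```

The first function is recovered up to rounding error, while the second, whose logarithm has a nonvanishing mixed partial derivative, is not reproduced by any such product.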

Differential forms are ubiquitous objects in differential geometry and tensor calculus. We only need the theory of differential forms on open domains in Euclidean spaces. Theorem 5 (which has been used several times throughout the article, for example, in the proof of theorem 3) is formulated in terms of differential forms. This appendix provides the necessary background for understanding the theorem and its proof.

We begin with a very brief account of the local theory of differential forms. (For a detailed treatment, see Pugh, 2002, chap. 5.) Let $U$ be an open subset of $\mathbb{R}^n$. A differential $k$-form $\omega$ on $U$ assigns a scalar to any $k$-tuple of tangent vectors at a point $p$ of $U$. This assignment, denoted by $\omega_p$, must be multilinear and alternating. We say $\omega$ is smooth (resp. analytic) if $\omega_p$ varies smoothly (resp. analytically) with $p$. In other words, feeding $\omega$ with $k$ smooth (resp. analytic) vector fields $V_1,\dots,V_k$ on $U$ results in a function $\omega(V_1,\dots,V_k):U\to\mathbb{R}$ that is smooth (resp. analytic). We next exhibit an expression for $\omega$. Consider the standard basis $\left(\frac{\partial}{\partial x_1},\dots,\frac{\partial}{\partial x_n}\right)$ of vector fields on $U$, where $\frac{\partial}{\partial x_i}$ assigns $e_i$ to each point. The dual basis is denoted by $(dx_1,\dots,dx_n)$, which at each point yields the dual of the standard basis $(e_1,\dots,e_n)$ for $\mathbb{R}^n$, that is, $dx_i\left(\frac{\partial}{\partial x_j}\right)\equiv\delta_{ij}$. Each of $dx_1,\dots,dx_n$ is a 1-form on $U$, and any $k$-form $\omega$ can be written in terms of them:
$$\omega=\sum_{1\le i_1<\dots<i_k\le n}f_{i_1\dots i_k}\,dx_{i_1}\wedge\dots\wedge dx_{i_k}. \tag{B.1}$$
Here, each coefficient $f_{i_1\dots i_k}$ is a (smooth or analytic according to the context) function $U\to\mathbb{R}$. In front of it, $dx_{i_1}\wedge\dots\wedge dx_{i_k}$ appears, which is a $k$-form satisfying $dx_{i_1}\wedge\dots\wedge dx_{i_k}\left(\frac{\partial}{\partial x_{i_1}},\dots,\frac{\partial}{\partial x_{i_k}}\right)=1$. This is constructed by the operation of the exterior product (also called the wedge product) from multilinear algebra. The exterior product is an associative and distributive linear operation that out of $k_i$-tensors $\tau_i$ ($1\le i\le l$) constructs an alternating $(k_1+\dots+k_l)$-tensor $\tau_1\wedge\dots\wedge\tau_l$. This product is anti-commutative, for example, $dx_i\wedge dx_j=-dx_j\wedge dx_i$; this is the reason that in equation B.1, the indices are taken to be strictly ascending.
Another operation in the realm of differential forms is exterior differentiation. For the $k$-form $\omega$ from equation B.1, its exterior derivative $d\omega$ is a $(k+1)$-form defined as
$$d\omega:=\sum_{1\le i_1<\dots<i_k\le n}df_{i_1\dots i_k}\wedge dx_{i_1}\wedge\dots\wedge dx_{i_k},$$
where the exterior derivative of a function $f$ is defined as
$$df:=\sum_{i=1}^{n}f_{x_i}\,dx_i. \tag{B.2}$$
Notice that the 1-form $df$ is the dual of the gradient vector field
$$\nabla f=\sum_{i=1}^{n}f_{x_i}\frac{\partial}{\partial x_i}. \tag{B.3}$$

B.1  Example 14

In dimension 3, the exterior differentiation encapsulates the familiar vector calculus operators curl and divergence. Consider the vector field
$$V(x,y,z)=[V_1(x,y,z),V_2(x,y,z),V_3(x,y,z)]^{\mathrm{T}}.$$
The exterior derivatives
$$d(V_1\,dx+V_2\,dy+V_3\,dz)=\big((V_3)_y-(V_2)_z\big)\,dy\wedge dz+\big((V_1)_z-(V_3)_x\big)\,dz\wedge dx+\big((V_2)_x-(V_1)_y\big)\,dx\wedge dy$$
and
$$d(V_1\,dy\wedge dz+V_2\,dz\wedge dx+V_3\,dx\wedge dy)=\big((V_1)_x+(V_2)_y+(V_3)_z\big)\,dx\wedge dy\wedge dz,$$
respectively, have $\operatorname{curl}V$ and $\operatorname{div}V$ as their coefficients. In fact, there is a general Stokes formula for differential forms that recovers the Kelvin–Stokes theorem and the divergence theorem as special cases. Finally, we point out that the familiar identities $\operatorname{curl}\circ\nabla=0$ and $\operatorname{div}\circ\operatorname{curl}=0$ are instances of the general property $d\circ d=0$ of the exterior differentiation.
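
The correspondence in this example can be illustrated numerically. The sketch below reads off the coefficients of the two exterior derivatives by central finite differences and then checks that $\operatorname{div}\circ\operatorname{curl}$ vanishes up to rounding error. The vector field, the evaluation point, and the step size `h` are assumptions chosen for illustration only.

```python
import math

def partial(g, p, i, h=1e-3):
    """Central finite-difference estimate of the partial derivative of g along axis i at p."""
    pp, pm = list(p), list(p)
    pp[i] += h
    pm[i] -= h
    return (g(pp) - g(pm)) / (2 * h)

def curl(V):
    """Coefficients of d(V1 dx + V2 dy + V3 dz) on (dy^dz, dz^dx, dx^dy), as functions of p."""
    V1, V2, V3 = V
    return (lambda p: partial(V3, p, 1) - partial(V2, p, 2),
            lambda p: partial(V1, p, 2) - partial(V3, p, 0),
            lambda p: partial(V2, p, 0) - partial(V1, p, 1))

def div(V):
    """Coefficient of d(V1 dy^dz + V2 dz^dx + V3 dx^dy) on dx^dy^dz, as a function of p."""
    V1, V2, V3 = V
    return lambda p: partial(V1, p, 0) + partial(V2, p, 1) + partial(V3, p, 2)

# Illustrative (assumed) smooth vector field and evaluation point.
V = (lambda p: math.sin(p[1] * p[2]),
     lambda p: p[0] * math.exp(p[2]),
     lambda p: p[0] ** 2 * p[1])
point = [0.4, -0.3, 0.9]

curl_at_point = [component(point) for component in curl(V)]
div_curl_at_point = div(curl(V))(point)

print("curl V at the point:", curl_at_point)
print("div(curl V) at the point:", div_curl_at_point)
```

Because the same step size is used along every axis, the discrete difference operators commute, so the computed value of $\operatorname{div}(\operatorname{curl}V)$ cancels down to rounding error, mirroring $d\circ d=0$.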

B.2  Example 15

As mentioned in the previous example, the outcome of twice applying the exterior differentiation operator to a form is always zero. This is an extremely important property that leads to the definitions of closed and exact differential forms. A $k$-form $\omega$ on an open subset $U$ of $\mathbb{R}^n$ is called closed if $d\omega=0$. This holds if $\omega$ is of the form $\omega=d\alpha$ for a $(k-1)$-form $\alpha$ on $U$. Such forms are called exact. The space of closed forms may be strictly larger than the space of exact forms; the difference of these spaces can be used to measure the topological complexity of $U$. If $U$ is an open box-like region, every closed form on it is exact. But, for instance, the 1-form $\omega=-\frac{y}{x^2+y^2}\,dx+\frac{x}{x^2+y^2}\,dy$ on $\mathbb{R}^2-\{(0,0)\}$ is closed, yet it cannot be written as $d\alpha$ for any smooth function $\alpha:\mathbb{R}^2-\{(0,0)\}\to\mathbb{R}$. This brings us to a famous fact from multivariable calculus that we have used several times (e.g., in the proof of theorem 4). A necessary condition for a vector field $V=\sum_{i=1}^{n}V_i\frac{\partial}{\partial x_i}$ on an open subset $U$ of $\mathbb{R}^n$ to be a gradient vector field is $(V_i)_{x_j}=(V_j)_{x_i}$ for any $1\le i,j\le n$. When this condition holds, the vector field $V$ may be written as $\nabla f$ near each point of $U$; it is globally of the form $\nabla f$ for a function $f:U\to\mathbb{R}$ when $U$ is simply connected. In view of equations B.2 and B.3, one may rephrase this fact as: Closed 1-forms on $U$ are exact whenever $U$ is simply connected.
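
The punctured-plane example can be probed numerically. In the sketch below, a finite-difference estimate of $Q_x-P_y$ (the coefficient of $d\omega$ on $dx\wedge dy$) is close to zero at a sample point, while the integral of $\omega$ over the unit circle is close to $2\pi$; an exact form $d\alpha$ would integrate to zero over any closed loop. The sample point, step size, and number of quadrature steps are illustrative assumptions.

```python
import math

def omega(x, y):
    """Coefficients (P, Q) of the 1-form P dx + Q dy = (-y dx + x dy) / (x^2 + y^2)."""
    r2 = x * x + y * y
    return -y / r2, x / r2

def closedness_defect(x, y, h=1e-4):
    """Finite-difference estimate of Q_x - P_y, the coefficient of d(omega) on dx^dy."""
    Qx = (omega(x + h, y)[1] - omega(x - h, y)[1]) / (2 * h)
    Py = (omega(x, y + h)[0] - omega(x, y - h)[0]) / (2 * h)
    return Qx - Py

def circle_integral(radius=1.0, steps=2000):
    """Midpoint-rule integral of omega over the counterclockwise circle of the given radius."""
    total = 0.0
    dt = 2 * math.pi / steps
    for s in range(steps):
        t = (s + 0.5) * dt
        x, y = radius * math.cos(t), radius * math.sin(t)
        dx, dy = -radius * math.sin(t) * dt, radius * math.cos(t) * dt
        P, Q = omega(x, y)
        total += P * dx + Q * dy
    return total

defect = closedness_defect(0.8, -0.6)
loop = circle_integral()
print("Q_x - P_y at (0.8, -0.6):", defect)
print("integral over the unit circle:", loop, "compared with 2*pi =", 2 * math.pi)
```

The nonzero loop integral reflects the fact that the punctured plane is not simply connected.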

Proof of Theorem 5.

Near a point $p\in\mathbb{R}^n$ at which $V(p)\neq 0$, we seek a locally defined function $\xi$ with $V\parallel\nabla\xi\neq 0$. Recall that if $q\in\mathbb{R}^n$ is a regular point of $\xi$, then near $q$, the level set of $\xi$ passing through $q$ is an $(n-1)$-dimensional submanifold of $\mathbb{R}^n$ to which the gradient vector field, $\nabla\xi\neq 0$, is perpendicular. As we want the gradient to be parallel to the vector field $V$, the equivalent characterization in terms of the 1-form $\omega$, which is the dual of $V$ (cf. equations 3.1 and 3.3), asserts that $\omega$ is zero at any vector tangent to the level set. So the tangent space to the level set at the point $q$ could be described as $\{v\in\mathbb{R}^n\mid\omega_q(v)=0\}$. As $q$ varies near $p$, these $(n-1)$-dimensional subspaces of $\mathbb{R}^n$ vary smoothly. In differential geometry, such a higher-dimensional version of a vector field is called a distribution, and the property that these subspaces are locally given by tangent spaces to a family of submanifolds (the level sets here) is called integrability. The seminal Frobenius theorem (Narasimhan, 1968, theorem 2.11.11) implies that the distribution defined by a nowhere vanishing 1-form $\omega$ is integrable if and only if $\omega\wedge d\omega=0$.
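
In dimension 3, for $\omega=V_1\,dx+V_2\,dy+V_3\,dz$, the 3-form $\omega\wedge d\omega$ equals $(V\cdot\operatorname{curl}V)\,dx\wedge dy\wedge dz$, so the integrability criterion can be tested numerically. The sketch below, using central finite differences, compares a field parallel to a gradient with a field that is not; both fields, the evaluation point, and the step size are illustrative assumptions, and the helper name `frobenius_coefficient` is hypothetical.

```python
import math

def partial(g, p, i, h=1e-4):
    """Central finite-difference estimate of the partial derivative of g along axis i at p."""
    pp, pm = list(p), list(p)
    pp[i] += h
    pm[i] -= h
    return (g(pp) - g(pm)) / (2 * h)

def frobenius_coefficient(V, p):
    """Coefficient of omega ^ d(omega) on dx^dy^dz for omega = V1 dx + V2 dy + V3 dz.

    In dimension 3 this coefficient equals the dot product of V with curl V.
    """
    V1, V2, V3 = V
    curl = (partial(V3, p, 1) - partial(V2, p, 2),
            partial(V1, p, 2) - partial(V3, p, 0),
            partial(V2, p, 0) - partial(V1, p, 1))
    return sum(Vi(p) * ci for Vi, ci in zip(V, curl))

# Assumed example 1: V = exp(x) * grad(xi) with xi = x*y + z, hence parallel to a gradient.
parallel_to_gradient = (lambda p: math.exp(p[0]) * p[1],
                        lambda p: math.exp(p[0]) * p[0],
                        lambda p: math.exp(p[0]))

# Assumed example 2: V = (-y, x, 1), dual to the 1-form dz + x dy - y dx.
twisted = (lambda p: -p[1],
           lambda p: p[0],
           lambda p: 1.0)

point = [0.5, -0.4, 0.3]
integrable_value = frobenius_coefficient(parallel_to_gradient, point)
twisted_value = frobenius_coefficient(twisted, point)
print("V parallel to a gradient:", integrable_value)
print("V = (-y, x, 1):", twisted_value)
```

The first value vanishes up to discretization error, consistent with the level sets of $\xi=xy+z$ integrating the distribution. For the second field, $\operatorname{curl}V=(0,0,2)$ and $V\cdot\operatorname{curl}V=2$, so no function $\xi$ with $\nabla\xi\parallel V$ exists even locally.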

1. To be mathematically precise, the open neighborhood of $p$ on which $F$ admits a compositional representation in the desired form may be dependent on $F$ and $p$. So conjecture 1 is local in nature and must be understood as a statement about function germs.

2. Convergence in the $C^r$-norm is defined as the uniform convergence of the function and its partial derivatives up to order $r$.

3. In conjecture 1, the subset cut off by equations $\Psi_i=0$ is meager: It is a closed and (due to the term nonvacuous appearing in the conjecture) proper subset of the space of functions computable by $\mathcal{N}$, and a function implemented by $\mathcal{N}$ at which a $\Psi_i$ vanishes could be perturbed to another computable function at which all of the $\Psi_i$ are nonzero.

4. An open box-like region in $\mathbb{R}^n$ is a product $I_1\times\dots\times I_n$ of open intervals.

5. A piece of terminology introduced in Farhoodi et al. (2019) may be illuminating here. A member of a triple $(x_i,x_j,x_k)$ of (not necessarily distinct) leaves of $T$ is called the outsider of the triple if there is a (rooted full) subtree of $T$ that misses it but has the other two members. Theorem 3 imposes $F_{x_ix_k}F_{x_j}=F_{x_jx_k}F_{x_i}$ whenever $x_k$ is the outsider, while theorem 4 imposes the constraint whenever $x_i$ and $x_j$ are not outsiders.

6. Notice that this is the best one can hope to recover because through scaling the weights and inversely scaling the inputs of activation functions, the function $F$ could also be written as $\sigma(\tilde{\tau}(\lambda ax+\lambda by)+cz+dw)$ or $\sigma(\tilde{\tau_1}(\lambda ax+\lambda by)+\tau_2(cz+dw))$, where $\tilde{\tau}(y):=\tau\left(\frac{y}{\lambda}\right)$ and $\tilde{\tau_1}(y):=\tau_1\left(\frac{y}{\lambda}\right)$. Thus, the other ratios $\frac{a}{c}$ and $\frac{b}{d}$ are completely arbitrary.

7. Compare with the proof of lemma 4 in appendix A.

8. A single vertex is not considered to be a rooted tree in our convention.

9. As the vector $(x_1,\dots,x_n)$ of inputs varies in the box-like region $B$, the inputs to each node form an interval on which the corresponding activation function is defined.

Arnold, V. I. (2009a). On the representation of functions of several variables as a superposition of functions of a smaller number of variables. In V. I. Arnold, Collected works: Representations of functions, celestial mechanics and KAM theory, 1957–1965 (pp. 25–46). Berlin: Springer.
Arnold, V. I. (2009b). Representation of continuous functions of three variables by the superposition of continuous functions of two variables. In V. I. Arnold, Collected works: Representations of functions, celestial mechanics and KAM theory, 1957–1965 (pp. 47–133). Berlin: Springer.
Bartlett, P. L., Maiorov, V., & Meir, R. (1999). Almost linear VC dimension bounds for piecewise polynomial networks. In S. Solla, T. Leen, & K. R. Müller (Eds.), Advances in neural information processing systems, 11 (pp. 190–196). Cambridge, MA: MIT Press.
Bianchini, M., & Scarselli, F. (2014). On the complexity of neural network classifiers: A comparison between shallow and deep architectures. IEEE Transactions on Neural Networks and Learning Systems, 25(8), 1553–1565.
Boyce, W. E., & DiPrima, R. C. (2012). Elementary differential equations (10th ed.). Hoboken, NJ: Wiley.
Brattka, V. (2007). From Hilbert's 13th problem to the theory of neural networks: Constructive aspects of Kolmogorov's superposition theorem. In E. Charpentier, A. Lesne, & N. K. Nikolski (Eds.), Kolmogorov's heritage in mathematics (pp. 253–280). Berlin: Springer.
Buck, R. C. (1976). Approximate complexity and functional representation (Tech. Rep.). Madison: Mathematics Research Center, University of Wisconsin, Madison. https://apps.dtic.mil/dtic/tr/fulltext/u2/a031972.pdf
Buck, R. C. (1979). Approximate complexity and functional representation. J. Math. Anal. Appl., 70(1), 280–298.
Buck, R. C. (1981a). Characterization of classes of functions. Amer. Math. Monthly, 88(2), 139–142.
Buck, R. C. (1981b). The solutions to a smooth PDE can be dense in C(I). J. Differential Equations, 41(2), 239–244.
Chang, H.-C., He, W., & Prabhu, N. (2003). The analytic domain in the implicit function theorem. J. Inequal. Pure Appl. Math., 4(1), art.
Chatziafratis, V., Nagarajan, S. G., Panageas, I., & Wang, X. (2019). Depth-width trade-offs for ReLU networks via Sharkovsky's theorem. arXiv:1912.04378v1.
Cohen, N., Sharir, O., & Shashua, A. (2016). On the expressive power of deep learning: A tensor analysis. In Proceedings of the Conference on Learning Theory (pp. 698–728).
Coste, M. (2000). An introduction to semialgebraic geometry. Citeseer.
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4), 303–314.
Dehmamy, N., Rohani, N., & Katsaggelos, A. K. (2019). Direct estimation of weights and efficient training of deep neural networks without SGD. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 3232–3236). Piscataway, NJ: IEEE.
Du, S. S., & Lee, J. D. (2018). On the power of over-parametrization in neural networks with quadratic activation. arXiv:1803.01206v2.
Eldan, R., & Shamir, O. (2016). The power of depth for feedforward neural networks. In Proceedings of the Conference on Learning Theory (pp. 907–940).
Farhoodi, R., Filom, K., Jones, I. S., & Körding, K. P. (2019). On functions computed on trees. Neural Computation, 31, 2075–2137.
Fefferman, C., & Markel, S. (1994). Recovering a feed-forward net from its output. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 335–342). Cambridge, MA: MIT Press.
Gerhard, S., Andrade, I., Fetter, R. D., Cardona, A., & Schneider-Mizell, C. M. (2017). Conserved neural circuit structure across Drosophila larval development revealed by comparative connectomics. eLife, 6, e29089.
Gillette, T. A., & Ascoli, G. A. (2015). Topological characterization of neuronal arbor morphology via sequence representation: I-motif analysis. BMC Bioinformatics, 16(1), 216.
Girosi, F., & Poggio, T. (1989). Representation properties of networks: Kolmogorov's theorem is irrelevant. Neural Computation, 1(4), 465–469.
Hecht-Nielsen, R. (1987). Kolmogorov's mapping neural network existence theorem. In Proceedings of the International Conference on Neural Networks (vol. 3, pp. 11–14). Piscataway, NJ: IEEE.
Hilbert, D. (1902). Mathematical problems. Bulletin of the American Mathematical Society, 8(10), 437–479.
Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2), 251–257.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359–366.
Kileel, J., Trager, M., & Bruna, J. (2019). On the expressive power of deep polynomial neural networks. arXiv:1905.12207v1.
Kollins, K. M., & Davenport, R. W. (2005). Branching morphogenesis in vertebrate neurons. In K. Kollins & R. Davenport (Eds.), Branching morphogenesis (pp. 8–65). Berlin: Springer.
Kolmogorov, A. N. (1957). On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition. Dokl. Akad. Nauk SSSR, 114, 953–956.
Krantz, S. G., & Parks, H. R. (2002). The implicit function theorem. Boston: Birkhäuser.
Kůrková, V. (1991). Kolmogorov's theorem is relevant. Neural Computation, 3(4), 617–622.
Kůrková, V. (1992). Kolmogorov's theorem and multilayer neural networks. Neural Networks, 5(3), 501–506.
Lin, H. W., Tegmark, M., & Rolnick, D. (2017). Why does deep and cheap learning work so well? Journal of Statistical Physics, 168(6), 1223–1247.
Lorentz, G. G. (1966). Approximation of functions. New York: Holt.
Mhaskar, H. (1996). Neural networks for optimal approximation of smooth and analytic functions. Neural Computation, 8(1), 164–177.
Mhaskar, H., Liao, Q., & Poggio, T. (2017). When and why are deep networks better than shallow ones? In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence. Palo Alto: AAAI.
Minsky, M., & Papert, S. A. (2017). Perceptrons: An introduction to computational geometry. Cambridge, MA: MIT Press.
Montufar, G. F., Pascanu, R., Cho, K., & Bengio, Y. (2014). On the number of linear regions of deep neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 27 (pp. 2924–2932). Red Hook, NY: Curran.
Narasimhan, R. (1968). Analysis on real and complex manifolds. Amsterdam: North-Holland.
Ostrowski, A. (1920). Über Dirichletsche Reihen und algebraische Differentialgleichungen. Mathematische Zeitschrift, 8(3), 241–298.
Petersen, P., Raslan, M., & Voigtlaender, F. (2020). Topological properties of the set of functions generated by neural networks of fixed size. Foundations of Computational Mathematics, 21, 375–444.
Poggio, T., Banburski, A., & Liao, Q. (2019). Theoretical issues in deep networks: Approximation, optimization and generalization. arXiv:1908.09375v1.
Poggio, T., Mhaskar, H., Rosasco, L., Miranda, B., & Liao, Q. (2017). Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review. International Journal of Automation and Computing, 14(5), 503–519.
Pólya, G., & Szegö, G. (1945). Aufgaben und Lehrsätze aus der Analysis. New York: Dover.
Poole, B., Lahiri, S., Raghu, M., Sohl-Dickstein, J., & Ganguli, S. (2016). Exponential expressivity in deep neural networks through transient chaos. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in neural information processing systems, 29 (pp. 3360–3368). Red Hook, NY: Curran.
Pugh, C. C. (2002). Real mathematical analysis. New York: Springer-Verlag.
Raghu, M., Poole, B., Kleinberg, J., Ganguli, S., & Dickstein, J. S. (2017). On the expressive power of deep neural networks. In Proceedings of the 34th International Conference on Machine Learning, 70 (pp. 2847–2854).
Rolnick, D., & Kording, K. P. (2019, October). Reverse-engineering deep ReLU networks. arXiv:1910.00744v2.
Rubel, L. A. (1981). A universal differential equation. Bull. Amer. Math. Soc. (N.S.), 4(3), 345–349.
Schneider-Mizell, C. M., Gerhard, S., Longair, M., Kazimiers, T., Li, F., Zwart, M. F., … Cardona, A. (2016). Quantitative neuroanatomy for connectomics in Drosophila. eLife, 5, e12059.
Soltanolkotabi, M., Javanmard, A., & Lee, J. D. (2018). Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Transactions on Information Theory, 65(2), 742–769.
Sprecher, D. A. (1965). On the structure of continuous functions of several variables. Trans. Amer. Math. Soc., 115, 340–355.
Telgarsky, M. (2016). Benefits of depth in neural networks. arXiv:1602.04485v2.
Venturi, L., Bandeira, A. S., & Bruna, J. (2018). Spurious valleys in two-layer neural network optimization landscapes. arXiv:1802.06384v3.
Vituškin, A. G. (1954). On Hilbert's thirteenth problem. Doklady Akad. Nauk SSSR (N.S.), 95, 701–704.
Vituškin, A. G. (1964). A proof of the existence of analytic functions of several variables not representable by linear superpositions of continuously differentiable functions of fewer variables. Dokl. Akad. Nauk SSSR, 156, 1258–1261.
Vituškin, A. G. (2004). On Hilbert's thirteenth problem and related questions. Russian Mathematical Surveys, 59(1), 11.
Vituškin, A. G., & Henkin, G. M. (1967). Linear superpositions of functions. Uspehi Mat. Nauk, 22(133), 77–124.
von Golitschek, M. (1980). Remarks on functional representation. In Approximation theory, III (Proc. Conf., Univ. Texas, Austin, Tex., 1980) (pp. 429–434). New York: Academic Press.