## Abstract

Neural networks are versatile tools for computation, having the ability to approximate a broad range of functions. An important problem in the theory of deep neural networks is expressivity; that is, we want to understand the functions that are computable by a given network. We study real, infinitely differentiable (smooth) hierarchical functions implemented by feedforward neural networks via composing simpler functions in two cases: (1) each constituent function of the composition has fewer inputs than the resulting function and (2) constituent functions are in the more specific yet prevalent form of a nonlinear univariate function (e.g., tanh) applied to a linear multivariate function. We establish that in each of these regimes, there exist nontrivial algebraic partial differential equations (PDEs) that are satisfied by the computed functions. These PDEs are purely in terms of the partial derivatives and are dependent only on the topology of the network. Conversely, we conjecture that such PDE constraints, once accompanied by appropriate nonsingularity conditions and perhaps certain inequalities involving partial derivatives, guarantee that the smooth function under consideration can be represented by the network. The conjecture is verified in numerous examples, including the case of tree architectures, which are of neuroscientific interest. Our approach is a step toward formulating an algebraic description of functional spaces associated with specific neural networks, and may provide useful new tools for constructing neural networks.

## 1 Introduction

### 1.1 Motivation

Functions implemented by neural networks via composition (we use the terms *hierarchical* or *compositional* interchangeably) are constrained in the sense that they satisfy certain **p**artial **d**ifferential **e**quations (**PDE**s). These PDEs are dependent only on the topology of the network and could be employed to characterize smooth functions computable by a given network.

#### 1.1.1 Example 1


*tree architectures* (see theorem 3).
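Example 1 concerns superpositions of the form $F(x,y,z)=g(f(x,y),z)$ (the representation 1.1 discussed in section 1.2), whose necessary PDE constraint, equation 1.4, can be written as $F_xF_{yz}=F_yF_{xz}$, that is, $\partial_z(F_x/F_y)=0$. A quick symbolic spot check, with sample choices of $f$ and $g$ that are our own, not from the text:

```python
# Spot check of the necessary PDE for F(x, y, z) = g(f(x, y), z):
# since F_x / F_y = f_x / f_y does not depend on z, the chain rule forces
# F_x * F_yz - F_y * F_xz = 0 for every choice of f and g.
import sympy as sp

x, y, z = sp.symbols('x y z')
f = x*y + sp.sin(x)              # an arbitrary inner function f(x, y)
F = sp.exp(f) * z + f**2         # g(u, z) = exp(u)*z + u^2 composed with f

constraint = sp.diff(F, x)*sp.diff(F, y, z) - sp.diff(F, y)*sp.diff(F, x, z)
assert sp.simplify(constraint) == 0
```

The converse direction, that this constraint (plus a nonvanishing condition) suffices for a local representation of the form 1.1, is the sufficiency statement recalled in section 1.2.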

#### 1.1.2 Example 2

### 1.2 Statements of Main Results

Fixing a neural network hierarchy for composing functions, we shall prove that once the constituent functions of the corresponding superpositions have fewer inputs (lower arity), there exist universal **algebraic p**artial **d**ifferential **e**quations (**algebraic PDE**s) that have these superpositions as their solutions. A conjecture, which we verify in several cases, states that such PDE constraints characterize a generic smooth superposition computable by the network. Here, genericity means a nonvanishing condition imposed on an algebraic expression of partial derivatives. Such a condition has already occurred in example 1, where in the proof of the sufficiency of equation 1.4 for the existence of a representation of the form 1.1 for a function $F(x,y,z)$, we assumed either $F_x$ or $F_y$ is nonzero. Before proceeding with the statements of the main results, we formally define some of the terms that have appeared so far.

**Terminology**

- We take all neural networks to be feedforward. A **feedforward neural network** is an acyclic hierarchical layer-to-layer scheme of computation. We also include **res**idual **net**works (**ResNet**s) in this category: an identity function in a layer could be interpreted as a jump in layers. Tree architectures are recurring examples of this kind. We shall always assume that in the first layer, the inputs are labeled by (not necessarily distinct) labels chosen from the coordinate functions $x_1,\dots,x_n$, and there is only one node in the output layer. Assigning functions to nodes in layers above the input layer implements a real scalar-valued function $F=F(x_1,\dots,x_n)$ as the superposition of functions appearing at nodes (see Figure 3).
- In our setting, an **algebraic PDE** is a nontrivial polynomial relation
  $$\Phi\left(F_{x_1},\dots,F_{x_n},F_{x_1^2},F_{x_1x_2},\dots,F_{\mathbf{x}^\alpha},\dots\right)=0\tag{1.10}$$
  among the partial derivatives (up to a certain order) of a smooth function $F=F(x_1,\dots,x_n)$. Here, for a tuple $\alpha:=(\alpha_1,\dots,\alpha_n)$ of nonnegative integers, the partial derivative $\frac{\partial^{\alpha_1+\cdots+\alpha_n}F}{\partial x_1^{\alpha_1}\cdots\partial x_n^{\alpha_n}}$ (which is of order $|\alpha|:=\alpha_1+\cdots+\alpha_n$) is denoted by $F_{\mathbf{x}^\alpha}$. For instance, asking for a polynomial expression of partial derivatives of $F$ to be constant amounts to $n$ algebraic PDEs given by setting the first-order partial derivatives of that expression with respect to $x_1,\dots,x_n$ to be zero.
- A **nonvanishing condition** imposed on smooth functions $F=F(x_1,\dots,x_n)$ asks for these functions not to satisfy a particular algebraic PDE, namely,
  $$\Psi\left(F_{x_1},\dots,F_{x_n},F_{x_1^2},F_{x_1x_2},\dots,F_{\mathbf{x}^\alpha},\dots\right)\neq 0\tag{1.11}$$
  for a nonconstant polynomial $\Psi$. Such a condition could be deemed pointwise since if it holds at a point $p\in\mathbb{R}^n$, it persists throughout a small enough neighborhood. Moreover, equation 1.11 determines an open dense subset of the functional space; so it is satisfied generically.

Let $N$ be a feedforward neural network in which the number of inputs to each node is less than the total number of distinct inputs to the network. Superpositions of smooth functions computed by this network satisfy nontrivial constraints in the form of certain algebraic PDEs that are dependent only on the topology of $N$.

In order to guarantee the existence of PDE constraints for superpositions, theorem 1 assumes a condition on the topology of the network. However, theorem 2 states that by restricting the functions that can appear in the superposition, one can still obtain PDE constraints even for a fully connected multilayer perceptron:

Let $N$ be an arbitrary feedforward neural network with at least two distinct inputs, with smooth functions of the form 1.12 applied at its nodes. Any function computed by this network satisfies nontrivial constraints in the form of certain algebraic PDEs that are dependent only on the topology of $N$.

#### 1.2.1 Example 3

The preceding example suggests that smooth functions implemented by a neural network may be required to obey a nontrivial **algebraic p**artial **d**ifferential **i**nequality (**algebraic PDI**). It is thus convenient to set up the following terminology.

**Terminology**

- An **algebraic PDI** is an inequality of the form
  $$\Theta\left(F_{x_1},\dots,F_{x_n},F_{x_1^2},F_{x_1x_2},\dots,F_{\mathbf{x}^\alpha},\dots\right)>0\tag{1.15}$$
  involving partial derivatives (up to a certain order), where $\Theta$ is a real polynomial.

Without any loss of generality, we assume that the PDIs are strict, since a nonstrict one such as $\Theta\geq 0$ could be written as the union of $\Theta>0$ and the algebraic PDE $\Theta=0$.

Theorem 1 and example 1 deal with superpositions of arbitrary smooth functions while theorem 2 and example 3 are concerned with superpositions of a specific class of smooth functions, functions of the form 1.12. In view of the necessary PDE constraints in both situations, the following question then arises: Are there sufficient conditions in the form of algebraic PDEs and PDIs that guarantee a smooth function can be represented, at least locally, by the neural network in question?

*Let $N$ be a feedforward neural network whose inputs are labeled by the coordinate functions $x_1,\dots,x_n$. Suppose we are working in the setting of one of theorems 1 or 2. Then there exist*

- *finitely many nonvanishing conditions* $\left\{\Psi_i\left(\{F_{\mathbf{x}^\alpha}\}_{|\alpha|\leq r}\right)\neq 0\right\}_i$,
- *finitely many algebraic PDEs* $\left\{\Phi_j\left(\{F_{\mathbf{x}^\alpha}\}_{|\alpha|\leq r}\right)=0\right\}_j$,
- *finitely many algebraic PDIs* $\left\{\Theta_k\left(\{F_{\mathbf{x}^\alpha}\}_{|\alpha|\leq r}\right)>0\right\}_k$,

*with the following property: for any arbitrary point $p\in\mathbb{R}^n$, the space of smooth functions $F=F(x_1,\dots,x_n)$ defined in a vicinity of $p$ that satisfy $\Psi_i\neq 0$ at $p$ and are computable by $N$ (in the sense of the regime under consideration) is nonvacuous and is characterized by the PDEs $\Phi_j=0$ and PDIs $\Theta_k>0$.*

The conjecture in particular provides nonvanishing conditions such that, in the space of functions computable by $N$, the open dense subset given by $\{\Psi_i\neq 0\}_i$ can be described in terms of finitely many equations and inequalities as the locally closed subset $\{\Phi_j=0\}_j\cap\{\Theta_k>0\}_k$. (Also see corollary 1.) The usage of the $C^r$-norm here is novel. For instance, with respect to $L^p$-norms, the space of functions computable by $N$ lacks such a description and often has undesirable properties like nonclosedness (Petersen, Raslan, & Voigtlaender, 2020). Besides, describing the functional space associated with a neural network $N$ with finitely many equations and inequalities also has an algebraic motivation: it is reminiscent of the notion of a *semialgebraic set* from real algebraic geometry. To elaborate, take the activation functions to be polynomials. Such neural networks have been studied in the literature (Du & Lee, 2018; Soltanolkotabi, Javanmard, & Lee, 2018; Venturi, Bandeira, & Bruna, 2018; Kileel, Trager, & Bruna, 2019). By bounding the degrees of the constituent functions of the superpositions computed by a polynomial neural network, the functional space formed by these superpositions sits inside a finite-dimensional ambient space of real polynomials and is hence amenable to techniques of algebraic geometry. One can, for instance, in each degree associate a functional variety to a neural network $N$ whose dimension could be interpreted as a measure of expressive power (Kileel et al., 2019). Our approach to describing real functions computable by neural networks via PDEs and PDIs has ramifications for the study of polynomial neural networks as well. Indeed, if $F=F(x_1,\dots,x_n)$ is a polynomial, an algebraic PDE of the form 1.10 translates to a polynomial equation in the coefficients of $F$, and the condition that an algebraic PDI such as equation 1.15 is valid throughout $\mathbb{R}^n$ can again be described via equations and inequalities involving the coefficients of $F$ (see examples 12 and 13). A notable feature here is the claim of the existence of a universal characterization dependent only on the architecture, from which a description as a semialgebraic set could be read off in any degree.

Conjecture 1 is settled in Farhoodi, Filom, Jones, and Körding (2019) for trees (a particular type of architecture) with distinct inputs, a situation in which no PDI is required and the inequalities should be taken to be trivial. Throughout the article, the conjecture above will be established for a number of architectures; in particular, we shall characterize tree functions (cf. theorems 3 and 4 below).

### 1.3 Related Work

There is an extensive literature on the expressive power of neural networks. Although shallow networks with sigmoidal activation functions can approximate any continuous function on compact sets (Cybenko, 1989; Hornik, Stinchcombe, & White, 1989; Hornik, 1991; Mhaskar, 1996), this cannot be achieved without the hidden layer getting exponentially large (Eldan & Shamir, 2016; Telgarsky, 2016; Mhaskar et al., 2017; Poggio et al., 2017). Many articles thus try to demonstrate how the expressive power is affected by depth. This line of research draws on a number of different scientific fields including algebraic topology (Bianchini & Scarselli, 2014), algebraic geometry (Kileel et al., 2019), dynamical systems (Chatziafratis, Nagarajan, Panageas, & Wang, 2019), tensor analysis (Cohen, Sharir, & Shashua, 2016), Vapnik–Chervonenkis theory (Bartlett, Maiorov, & Meir, 1999), and statistical physics (Lin, Tegmark, & Rolnick, 2017). One approach is to argue that deeper networks are able to approximate or represent functions of higher complexity after defining a “complexity measure” (Bianchini & Scarselli, 2014; Montufar, Pascanu, Cho, & Bengio, 2014; Poole, Lahiri, Raghu, Sohl-Dickstein, & Ganguli, 2016; Telgarsky, 2016; Raghu et al., 2017). Another approach more in line with this article is to use the “size” of an associated functional space as a measure of representation power. This point of view is adopted in Farhoodi et al. (2019) by enumerating Boolean functions, and in Kileel et al. (2019) by regarding dimensions of functional varieties as such a measure.

To the best of our knowledge, the closest mentions of a characterization of a class of superpositions by necessary and sufficient PDE constraints in the literature are papers (Buck, 1979, 1981a) by R. C. Buck. The first one (along with its earlier version, Buck, 1976) characterizes superpositions of the form $g(f(x,y),z)$ in a similar fashion as example 1. Also in those papers, superpositions such as $g(f(x,y),h(x,z))$ (which appeared in example 2) are discussed although only the existence of necessary PDE constraints is shown; see (Buck, 1979, lemma 7), and (Buck, 1981a, p. 141). We exhibit a PDE characterization for superpositions of this form in example 7. These papers also characterize sufficiently differentiable *nomographic* functions of the form $\sigma (f(x)+g(y))$ and $\sigma (f(x)+g(y)+h(z))$.

A special class of neural network architectures is provided by rooted trees where any output of a layer is passed to exactly one node from one of the layers above (see Figure 8). Investigating functions computable by trees is of neuroscientific interest because the morphology of the dendrites of a neuron processes information through a tree that is often binary (Kollins & Davenport, 2005; Gillette & Ascoli, 2015). Assuming that the inputs to a tree are distinct, in our previous work (Farhoodi et al., 2019), we have completely characterized the corresponding superpositions through formulating necessary and sufficient PDE constraints; a result that answers conjecture 1 in positive for such architectures.

The characterization suggested by the theorem below is a generalization of example 1, which was concerned with smooth superpositions of the form 1.1. The characterization of such superpositions as solutions of PDE 1.4 has also appeared in a paper (Buck, 1979) that we were not aware of while writing Farhoodi et al. (2019).

For any leaf $x_i$ with siblings, either $F_{x_i}(p)\neq 0$ or there is a sibling leaf $x_{i'}$ with $F_{x_{i'}}(p)\neq 0$.

This theorem was formulated in Farhoodi et al. (2019) for binary trees and in the context of analytic functions (and also that of Boolean functions). Nevertheless, the proof carries over to the more general setting above. Below, we formulate the analogous characterization of functions that trees compute via composing functions of the form 1.12. Proofs of theorems 3 and 4 are presented in section 4.

Let $T$ be a rooted tree admitting $n$ leaves that are labeled by the coordinate functions $x_1,\dots,x_n$. We formulate the following constraints on smooth functions $F=F(x_1,\dots,x_n)$:

- For any two leaves $x_i$ and $x_j$ of $T$, we have
  $$F_{x_ix_k}F_{x_j}=F_{x_jx_k}F_{x_i}\tag{1.19}$$
  for any other leaf $x_k$ of $T$ that is not a leaf of a (rooted full) subtree that has exactly one of $x_i$ or $x_j$ (see Figure 5). In particular, equation 1.19 holds for any $x_k$ if the leaves $x_i$ and $x_j$ are siblings, and for any $x_i$ and $x_j$ if the leaf $x_k$ is adjacent to the root of $T$.
- For any two (rooted full) subtrees $T_1$ and $T_2$ that emanate from a node of $T$ (see Figure 6), we have
  $$F_{x_i}F_{x_j}\left(F_{x_ix_{i'}x_{j'}}F_{x_j}+F_{x_ix_{i'}}F_{x_jx_{j'}}-F_{x_ix_{j'}}F_{x_jx_{i'}}-F_{x_i}F_{x_jx_{i'}x_{j'}}\right)=\left(F_{x_ix_{i'}}F_{x_j}-F_{x_i}F_{x_jx_{i'}}\right)\left(F_{x_ix_{j'}}F_{x_j}+F_{x_i}F_{x_jx_{j'}}\right)\tag{1.20}$$
  if $x_i$, $x_{i'}$ are leaves of $T_1$ and $x_j$, $x_{j'}$ are leaves of $T_2$.

These constraints are satisfied if $F(x_1,\dots,x_n)$ is a superposition of functions of the form $y\mapsto\sigma(\langle w,y\rangle)$ according to the hierarchy provided by $T$. Conversely, a smooth function $F$ defined on an open box-like region $B\subseteq\mathbb{R}^n$ can be written as such a superposition on $B$ provided that the constraints 1.19 and 1.20 formulated above hold and, moreover, the nonvanishing conditions below are satisfied throughout $B$:

For any leaf $x_i$ with siblings, either $F_{x_i}\neq 0$ or there is a sibling leaf $x_{i'}$ with $F_{x_{i'}}\neq 0$;

For any leaf $x_i$ without siblings, $F_{x_i}\neq 0$.
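As a sanity check of the necessity direction, the sketch below (our own; the inner and outer polynomial activations are arbitrary sample choices) verifies constraint 1.19, in the form $F_{x_ix_k}F_{x_j}=F_{x_jx_k}F_{x_i}$, and constraint 1.20 for a function computed on the four-leaf tree with sibling pairs $\{x,y\}$ and $\{z,w\}$:

```python
# Symbolic spot check of constraints 1.19 and 1.20 for the tree function
# F = sigma(tau1(a*x + b*y) + tau2(c*z + d*w)); the polynomial choices of
# sigma, tau1, tau2 below are our own samples, not from the paper.
import sympy as sp

x, y, z, w = sp.symbols('x y z w')
u = (2*x + 3*y)**2          # tau1 applied to 2x + 3y
v = (z - w)**2              # tau2 applied to z - w
F = (u + v) + (u + v)**2    # sigma(s) = s + s^2

D = lambda h, *vs: sp.diff(h, *vs)

# Constraint 1.19 with the sibling leaves x, y and another leaf, e.g. z:
assert sp.expand(D(F, x, z)*D(F, y) - D(F, y, z)*D(F, x)) == 0

# Constraint 1.20 with x_i = x, x_i' = y in T1 and x_j = z, x_j' = w in T2:
lhs = D(F, x)*D(F, z)*(D(F, x, y, w)*D(F, z) + D(F, x, y)*D(F, z, w)
                       - D(F, x, w)*D(F, z, y) - D(F, x)*D(F, z, y, w))
rhs = (D(F, x, y)*D(F, z) - D(F, x)*D(F, z, y)) \
      * (D(F, x, w)*D(F, z) + D(F, x)*D(F, z, w))
assert sp.expand(lhs - rhs) == 0
```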

The second simplified equation, 1.22, holds once the function $F_{x_i}/F_{x_j}$ of $(x_1,\dots,x_n)$ may be split into a product such as

Aside from neuroscientific interest, studying tree architectures is important also because any neural network can be expanded into a tree network with repeated inputs through a procedure called **TENN** (the **T**ree **E**xpansion of the **N**eural **N**etwork; see Figure 7). Tree architectures with repeated inputs are relevant in the context of neuroscience too because the inputs to neurons may be repeated (Schneider-Mizell et al., 2016; Gerhard, Andrade, Fetter, Cardona, & Schneider-Mizell, 2017). We have already seen an example of a network along with its TENN in Figure 2. Both networks implement functions of the form $F(x,y,z)=g(f(x,y),h(x,z))$. Even for this simplest example of a tree architecture with repeated inputs, the derivation of characteristic PDEs is computationally involved and will be done in example 7. This verifies conjecture 1 for the tree that appeared in Figure 2.

### 1.4 Outline of the Article

Theorems 1 and 2 are proven in section 2, where it is established that in each setting there are necessary PDE conditions for expressibility of smooth functions by a neural network. In section 3 we verify conjecture 1 in several examples by characterizing computable functions via PDE constraints that are necessary and (given certain nonvanishing conditions) sufficient. This starts by studying tree architectures in section 3.1. In example 7, we finish our treatment of a tree function with repeated inputs initiated in example 2; moreover, we present a number of examples to exhibit the key ideas of the proofs of theorems 3 and 4, which are concerned with tree functions with distinct inputs. The section then proceeds with switching from trees to other neural networks in section 3.2, where, building on example 3, example 11 demonstrates why the characterization claimed by conjecture 1 involves inequalities. We end section 3 with a brief subsection on PDE constraints for polynomial neural networks. Examples in section 3.1 are generalized in the next section to a number of results establishing conjecture 1 for certain families of tree architectures: proofs of theorems 3 and 4 are presented in section 4. The last section is devoted to a few concluding remarks. There are two appendices discussing technical proofs of propositions and lemmas (appendix A), and the basic mathematical background on differential forms (appendix B).

## 2 Existence of PDE Constraints

The goal of the section is to prove theorems 1 and 2. Lemma 1 below is our main tool for establishing the existence of constraints:

For a positive integer $a$, there are precisely $\binom{a+l}{l}$ monomials such as $p_1^{a_1}\cdots p_l^{a_l}$ with their total degree $a_1+\cdots+a_l$ not greater than $a$. But each of them is a polynomial of $t_1,\dots,t_m$ of total degree at most $ad$, where $d:=\max\{\deg p_1,\dots,\deg p_l\}$. For $a$ large enough, $\binom{a+l}{l}$ is greater than $\binom{ad+m}{m}$ because the degree of the former as a polynomial of $a$ is $l$, while the degree of the latter is $m$. For such an $a$, the number of monomials $p_1^{a_1}\cdots p_l^{a_l}$ is larger than the dimension of the space of polynomials of $t_1,\dots,t_m$ of total degree at most $ad$. Therefore, there exists a linear dependency among these monomials that amounts to a nontrivial polynomial relation among $p_1,\dots,p_l$.
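The counting argument can be made concrete. The sketch below (our own illustration, with $l=2$ polynomials $p_1=t^2$, $p_2=t^3$ in $m=1$ variable) builds the monomials $p_1^{a_1}p_2^{a_2}$ of total degree at most $a$, stacks their coefficient vectors, and extracts a nontrivial relation, a scalar multiple of the familiar $p_1^3-p_2^2=0$, from a linear dependency:

```python
# Illustration of lemma 1's counting argument (our own, not from the paper):
# with more monomials p1^a1 * p2^a2 (a1 + a2 <= a) than the dimension of the
# space of polynomials in t that contains them, a linear dependency, i.e. a
# polynomial relation among p1 and p2, must exist.
import sympy as sp

t = sp.symbols('t')
p1, p2 = t**2, t**3          # l = 2 polynomials in m = 1 variable, d = 3

a = 3                        # total degree bound on the monomials p1^a1 * p2^a2
monomials = [(a1, a2) for a1 in range(a + 1) for a2 in range(a + 1 - a1)]
polys = [sp.Poly((p1**a1) * (p2**a2), t) for a1, a2 in monomials]

# Stack the coefficient vectors (each row is one monomial, expanded in t)
# and look for a dependency among the rows.
maxdeg = max(P.degree() for P in polys)
M = sp.Matrix([[P.coeff_monomial(t**k) for k in range(maxdeg + 1)] for P in polys])
null = M.T.nullspace()       # nonempty: rows for p1^3 and p2^2 both equal t^6

# Any nullspace vector encodes a nontrivial polynomial relation among p1, p2.
v = null[0]
relation = sum(c * p1**a1 * p2**a2 for c, (a1, a2) in zip(v, monomials))
assert sp.expand(relation) == 0
```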

Let $N$ be a feedforward neural network whose inputs are labeled by the coordinate functions $x_1,\dots,x_n$ and which satisfies the hypothesis of either of theorems 1 or 2. Define the positive integer $r$ as

- $r=n\left(\#\text{neurons}-1\right)$ in the case of theorem 1,
- $r=\max\left(\left\lfloor n\left(\#\text{neurons}\right)^{\frac{1}{n-1}}\right\rfloor,\ \#\text{connections}+2\right)$ in the case of theorem 2,

where $\#\text{connections}$ and $\#\text{neurons}$ are, respectively, the number of edges of the underlying graph of $N$ and the number of its vertices above the input layer. Then the smooth functions $F=F(x_1,\dots,x_n)$ computable by $N$ satisfy nontrivial algebraic partial differential equations of order $r$. In particular, the subspace formed by these functions lies in a subset of positive codimension, which is closed with respect to the $C^r$-norm.

It indeed follows from the arguments above that there is a multitude of algebraically independent PDE constraints. By a simple dimension count, this number is $\binom{r+n}{n}-1-N\left(\binom{r+n-1}{n-1}-1\right)$ in the first case of corollary 1 and $\binom{r+n}{n}-1-Nr$ in the second case.

The approach here merely establishes the existence of nontrivial algebraic PDEs satisfied by the superpositions. These are not the simplest PDEs of this kind and hence are not the best candidates for the purpose of characterizing superpositions. For instance, for the superpositions 1.7, which the networks in Figure 2 implement, one has $n=3$ and $\#\text{neurons}=3$. Corollary 1 thus guarantees that these superpositions satisfy a sixth-order PDE. But in example 7, we shall characterize them via two fourth-order PDEs (compare with Buck, 1979, lemma 7).

## 3 Toy Examples

This section examines several elementary examples demonstrating how one can derive a set of necessary or sufficient PDE constraints for an architecture. The desired PDEs should be universal, that is, purely in terms of the derivatives of the function $F$ that is to be implemented and not dependent on any weight vector, activation function, or function of lower dimensionality that appears at a node. In this process, it is often necessary to express a smooth function in terms of other functions. If $k<n$ and $f(x_1,\dots,x_n)$ is written as $g(\xi_1,\dots,\xi_k)$ throughout an open neighborhood of a point $p\in\mathbb{R}^n$, where each $\xi_i=\xi_i(x_1,\dots,x_n)$ is a smooth function, the gradient of $f$ must be a linear combination of those of $\xi_1,\dots,\xi_k$ due to the chain rule. Conversely, if $\nabla f\in\mathrm{Span}\{\nabla\xi_1,\dots,\nabla\xi_k\}$ near $p$, by the inverse function theorem one can extend $(\xi_1,\dots,\xi_k)$ to a coordinate system $(\xi_1,\dots,\xi_k;\xi_{k+1},\dots,\xi_n)$ on a small enough neighborhood of $p$, provided that $\nabla\xi_1(p),\dots,\nabla\xi_k(p)$ are linearly independent; in this coordinate system the partial derivative $f_{\xi_i}$ vanishes for $k<i\leq n$, which implies that $f$ can be expressed in terms of $\xi_1,\dots,\xi_k$ near $p$. Subtle mathematical issues arise if one wants to write $f$ as $g(\xi_1,\dots,\xi_k)$ on a larger domain containing $p$:
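The chain-rule direction of this discussion is easy to check symbolically. In the sketch below (our own, with $n=3$, $k=2$, and arbitrary sample choices of $\xi_1$, $\xi_2$, and the outer function), the matrix whose rows are $\nabla f$, $\nabla\xi_1$, $\nabla\xi_2$ is singular everywhere:

```python
# If f = g(xi1, xi2), then grad f lies in Span{grad xi1, grad xi2}, so the
# 3x3 matrix with rows grad f, grad xi1, grad xi2 has zero determinant.
import sympy as sp

x, y, z = sp.symbols('x y z')
xi1, xi2 = x*y + z, x - z            # two "inner" functions (k = 2 < n = 3)
f = sp.sin(xi1) + xi2**3             # f = g(xi1, xi2) with g(u, v) = sin(u) + v^3

grad = lambda h: [sp.diff(h, v) for v in (x, y, z)]
M = sp.Matrix([grad(f), grad(xi1), grad(xi2)])
assert sp.simplify(M.det()) == 0     # gradients are linearly dependent everywhere
```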

- A $k$-tuple $(\xi_1,\dots,\xi_k)$ of smooth functions defined on an open subset $U$ of $\mathbb{R}^n$ whose gradient vector fields are linearly independent at all points cannot necessarily be extended to a coordinate system $(\xi_1,\dots,\xi_k;\xi_{k+1},\dots,\xi_n)$ for the whole of $U$. As an example, consider $r=x^2+y^2$, whose gradient is nonzero at any point of $\mathbb{R}^2\setminus\{(0,0)\}$, but there is no smooth function $h:\mathbb{R}^2\setminus\{(0,0)\}\to\mathbb{R}$ with $\nabla h\nparallel\nabla r$ throughout $\mathbb{R}^2\setminus\{(0,0)\}$: the level set $r=1$ is compact, so the restriction of $h$ to it achieves its absolute extrema, and at such points $\nabla h=\lambda\nabla r$ ($\lambda$ is the Lagrange multiplier).

- Even if one has a coordinate system $(\xi_1,\dots,\xi_k;\xi_{k+1},\dots,\xi_n)$ on a connected open subset $U$ of $\mathbb{R}^n$, a smooth function $f:U\to\mathbb{R}$ with $f_{\xi_{k+1}},\dots,f_{\xi_n}\equiv 0$ cannot necessarily be written globally as $f=g(\xi_1,\dots,\xi_k)$. One example is the function
  $$f(x,y):=\begin{cases}0 & \text{if } x\leq 0,\\ e^{-\frac{1}{x}} & \text{if } x>0,\ y>0,\\ -e^{-\frac{1}{x}} & \text{if } x>0,\ y<0,\end{cases}$$
  defined on the open subset $\mathbb{R}^2\setminus\big([0,\infty)\times\{0\}\big)\subset\mathbb{R}^2$, for which $f_y\equiv 0$. It may only locally be written as $f(x,y)=g(x)$; there is no function $g:\mathbb{R}\to\mathbb{R}$ with $f(x,y)=g(x)$ for all $(x,y)\in\mathbb{R}^2\setminus\big([0,\infty)\times\{0\}\big)$. Defining $g(x_0)$ as the value of $f$ on the intersection of its domain with the vertical line $x=x_0$ does not work because, due to the shape of the domain, such intersections may be disconnected. Finally, notice that $f$, although smooth, is not analytic ($C^\omega$); indeed, examples of this kind do not exist in the analytic category.
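A short numerical illustration of this counterexample (assuming the intended piecewise formula $f=\pm e^{-1/x}$ for $x>0$ and $f=0$ for $x\leq 0$): $f_y$ vanishes identically on the slit plane, yet $f$ takes different values at $(1,1)$ and $(1,-1)$, so no single $g$ with $f(x,y)=g(x)$ can work globally.

```python
# The pathological function above, in code: it is locally independent of y,
# but the two components of the domain over x > 0 carry opposite signs.
import math

def f(x, y):
    # defined on R^2 minus the nonnegative x-axis
    if x <= 0:
        return 0.0
    return math.exp(-1.0/x) if y > 0 else -math.exp(-1.0/x)

assert f(-1.0, 0.5) == f(-1.0, -0.5) == 0.0   # single value left of the slit
assert f(1.0, 1.0) == -f(1.0, -1.0) != 0.0    # opposite branches for x > 0
```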

This difficulty of needing a representation $f=g(\xi_1,\dots,\xi_k)$ that remains valid not just near a point but over a larger domain comes up only in the proof of theorem 4 (see remark 3); the representations we work with in the rest of this section are all local. The assumption about the shape of the domain and the special form of functions 1.12 allow us to circumvent the difficulties just mentioned in the proof of theorem 4. Below are two related lemmas that we use later.

Let $B$ and $T$ be a box-like region in $\mathbb{R}^n$ and a rooted tree with the coordinate functions $x_1,\dots,x_n$ labeling its leaves as in theorem 7. Suppose a smooth function $F=F(x_1,\dots,x_n)$ on $B$ is implemented on $T$ via assigning activation functions and weights to the nodes of $T$. If $F$ satisfies the nonvanishing conditions described at the end of theorem 7, then the level sets of $F$ are connected and $F$ can be extended to a coordinate system $(F,F_2,\dots,F_n)$ for $B$.

A smooth function $F(x_1,\dots,x_n)$ of the form $\sigma(a_1x_1+\cdots+a_nx_n)$ satisfies $F_{x_ix_k}F_{x_j}=F_{x_jx_k}F_{x_i}$ for any $1\leq i,j,k\leq n$. Conversely, if $F$ has a first-order partial derivative $F_{x_j}$ that is nonzero throughout an open box-like region $B$ in its domain, each identity $F_{x_ix_k}F_{x_j}=F_{x_jx_k}F_{x_i}$ could be written as $\left(\frac{F_{x_i}}{F_{x_j}}\right)_{x_k}=0$; that is, for any $1\leq i\leq n$, the ratio $\frac{F_{x_i}}{F_{x_j}}$ should be constant on $B$, and such requirements guarantee that $F$ admits a representation of the form $\sigma(a_1x_1+\cdots+a_nx_n)$ on $B$.
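This lemma is straightforward to verify symbolically. The sketch below (our own, with $\sigma=\tanh$ and an arbitrary choice of weights) checks the identities $F_{x_ix_k}F_{x_j}=F_{x_jx_k}F_{x_i}$ for all index triples and recovers a weight ratio from first-order partials:

```python
# Sanity check: F = sigma(a1*x1 + a2*x2 + a3*x3) satisfies
# F_{xi xk} F_{xj} = F_{xj xk} F_{xi}, and the ratios of first-order
# partials recover the weights up to scaling.
import sympy as sp

x1, x2, x3 = sp.symbols('x1 x2 x3')
a = (2, 3, -1)                                   # sample weights (ours)
F = sp.tanh(a[0]*x1 + a[1]*x2 + a[2]*x3)

X = (x1, x2, x3)
for xi in X:
    for xj in X:
        for xk in X:
            lhs = sp.diff(F, xi, xk) * sp.diff(F, xj)
            rhs = sp.diff(F, xj, xk) * sp.diff(F, xi)
            assert sp.simplify(lhs - rhs) == 0

# The weights are recoverable up to scale: F_{x2} / F_{x1} = a2 / a1.
assert sp.simplify(sp.diff(F, x2) / sp.diff(F, x1)) == sp.Rational(3, 2)
```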

The criterion of being parallel to a gradient vector field may be expressed in terms of the corresponding *differential form*:

A smooth vector field $V$ is parallel to a gradient vector field near each point only if the corresponding differential 1-form $\omega$ satisfies $\omega\wedge d\omega=0$. Conversely, if $V$ is nonzero at a point $p\in\mathbb{R}^n$ in the vicinity of which $\omega\wedge d\omega=0$ holds, there exists a smooth function $\xi$ defined on a suitable open neighborhood of $p$ that satisfies $V\parallel\nabla\xi\neq 0$. In particular, in dimension 2, a nowhere-vanishing vector field $V$ is locally parallel to a nowhere-vanishing gradient vector field, while in dimension 3, that is the case if and only if $V\cdot\operatorname{curl}V=0$.
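In dimension 3, the criterion $V\cdot\operatorname{curl}V=0$ can be spot-checked directly; the fields below are our own sample choices:

```python
# A field parallel to a gradient, V = mu * grad(xi), satisfies V . curl V = 0,
# while a generic field need not.
import sympy as sp

x, y, z = sp.symbols('x y z')

def curl(V):
    return sp.Matrix([sp.diff(V[2], y) - sp.diff(V[1], z),
                      sp.diff(V[0], z) - sp.diff(V[2], x),
                      sp.diff(V[1], x) - sp.diff(V[0], y)])

xi = x**2 + sp.sin(y)*z                  # sample potential
mu = 1 + z**2                            # nonvanishing multiplier
V = mu * sp.Matrix([sp.diff(xi, v) for v in (x, y, z)])
assert sp.simplify(V.dot(curl(V))) == 0  # integrable: locally parallel to a gradient

W = sp.Matrix([y, z, x])                 # here W . curl W = -(x + y + z), not 0
assert sp.simplify(W.dot(curl(W))) != 0
```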

A proof and background on differential forms are provided in appendix B.

### 3.1 Trees with Four Inputs

We begin by formally defining the terms related to tree architectures (see Figure 8).

**Terminology**

A **tree** is a connected acyclic graph. Singling out a vertex as its root turns it into a directed acyclic graph in which each vertex has a unique predecessor/parent. We take all trees to be rooted. The following notions come up frequently:

- **Leaf**: a vertex with no successor/child.
- **Node**: a vertex that is not a leaf, that is, one with children.
- **Sibling leaves**: leaves with the same parent.
- **Subtree**: all descendants of a vertex along with the vertex itself. Hence, in our convention, all subtrees are full and rooted.

To implement a function, the leaves pass the inputs to the functions assigned to the nodes. The final output is received from the root.

The first example of the section elucidates theorem 3.

#### 3.1.1 Example 4

#### 3.1.2 Example 5

The next example is concerned with the symmetric tree in Figure 9. We shall need the following lemma:

#### 3.1.3 Example 6

Examples 5 and 6 demonstrate an interesting phenomenon: one can deduce nontrivial facts about the weights once a formula for the implemented function is available. In example 5, for a function $F(x,y,z,w)=\sigma(\tau(ax+by)+cz+dw)$, we have $\frac{F_y}{F_x}\equiv\frac{b}{a}$ and $\frac{F_z}{F_w}\equiv\frac{c}{d}$. The same identities are valid for functions of the form $F(x,y,z,w)=\sigma(\tau_1(ax+by)+\tau_2(cz+dw))$ in example 6. This seems to be a direction worthy of study. In fact, there are papers discussing how a neural network may be “reverse-engineered” in the sense that the architecture of the network is determined from the knowledge of its outputs, or the weights and biases are recovered without the ordinary training process involving gradient descent algorithms (Fefferman & Markel, 1994; Dehmamy, Rohani, & Katsaggelos, 2019; Rolnick & Kording, 2019). In our approach, the weights appearing in a composition of functions of the form $y\mapsto\sigma(\langle w,y\rangle)$ could be described (up to scaling) in terms of partial derivatives of the resulting superposition.
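The identities $F_y/F_x\equiv b/a$ and $F_z/F_w\equiv c/d$ quoted from example 5 can be confirmed symbolically; the activation functions and weights below are arbitrary sample choices of ours:

```python
# Recovering weight ratios (up to scaling) from partial derivatives of
# F = sigma(tau(a*x + b*y) + c*z + d*w).
import sympy as sp

x, y, z, w = sp.symbols('x y z w')
a, b, c, d = 2, 5, 3, 7                          # sample weights (ours)
F = sp.tanh(sp.sin(a*x + b*y) + c*z + d*w)       # sigma = tanh, tau = sin

assert sp.simplify(sp.diff(F, y) / sp.diff(F, x)) == sp.Rational(b, a)
assert sp.simplify(sp.diff(F, z) / sp.diff(F, w)) == sp.Rational(c, d)
```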

#### 3.1.4 Example 7

#### 3.1.5 Example 8

### 3.2 Examples of Functions Computed by Neural Networks

#### 3.2.1 Example 9

#### 3.2.2 Example 10

#### 3.2.3 Example 11

We conclude that, assuming $V^2-4UW>0$, functions of the form $G(x,t)=f(ax+bt)+g(a'x+b't)$ may be identified with solutions of PDEs of the form 3.21. As in example 1, we desire algebraic PDEs purely in terms of $F$ and without the constants $U$, $V$, and $W$. One way to do so is to differentiate equation 3.21 further, for instance:
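Equation 3.21 is plausibly the second-order linear PDE $UG_{xx}+VG_{xt}+WG_{tt}=0$ with constant coefficients; under that reading (an assumption of ours), it arises from the factored operator $(b\partial_x-a\partial_t)(b'\partial_x-a'\partial_t)$, so that $U=bb'$, $V=-(ab'+a'b)$, $W=aa'$, and $V^2-4UW=(ab'-a'b)^2>0$ whenever the two directions are distinct. A spot check with sample weights and sample $f$, $g$ of our own:

```python
# G(x, t) = f(a*x + b*t) + g(a'*x + b'*t) is annihilated by the operator
# (b*Dx - a*Dt)(b'*Dx - a'*Dt), i.e. U*G_xx + V*G_xt + W*G_tt = 0 with
# U = b*b', V = -(a*b' + a'*b), W = a*a'; discriminant (a*b' - a'*b)^2 > 0.
import sympy as sp

x, t = sp.symbols('x t')
a, b, ap, bp = 1, 2, 3, -1                      # sample weights (ours)
G = sp.exp(a*x + b*t) + sp.cos(ap*x + bp*t)     # sample f and g

U, V, W = b*bp, -(a*bp + ap*b), a*ap
pde = U*sp.diff(G, x, x) + V*sp.diff(G, x, t) + W*sp.diff(G, t, t)
assert sp.simplify(pde) == 0
assert V**2 - 4*U*W > 0
```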

### 3.3 Examples of Polynomial Neural Networks

The subset $F_d(N)$ of $\mathrm{Poly}_{d,n}$ consisting of polynomials $P(x_1,\dots,x_n)$ of total degree at most $d$ that can be computed by $N$ via assigning real polynomial functions to its neurons

The smaller subset $F_d^{\mathrm{act}}(N)$ of $\mathrm{Poly}_{d,n}$ consisting of polynomials $P(x_1,\dots,x_n)$ of total degree at most $d$ that can be computed by $N$ via assigning real polynomials of the form $y\mapsto\sigma(\langle w,y\rangle)$ to the neurons, where $\sigma$ is a polynomial activation function

In general, the subsets $F_d(N)$ and $F_d^{\mathrm{act}}(N)$ of $\mathrm{Poly}_{d,n}$ are not closed in the algebraic sense (see remark 8). Therefore, one may consider their Zariski closures $V_d(N)$ and $V_d^{\mathrm{act}}(N)$, that is, the smallest subsets defined as zero loci of polynomial equations that contain them. We shall call $V_d(N)$ and $V_d^{\mathrm{act}}(N)$ the *functional varieties* associated with $N$. Each of the subsets $V_d(N)$ and $V_d^{\mathrm{act}}(N)$ of $\mathrm{Poly}_{d,n}$ could be described with finitely many polynomial equations in terms of the coefficients $c_{a_1,a_2,\dots,a_n}$. The PDE constraints from section 2 provide nontrivial examples of equations satisfied on the functional varieties: in any degree $d$, substituting equation 3.28 into an algebraic PDE that smooth functions computed by $N$ must obey results in equations in terms of the coefficients that are satisfied at any point of $F_d(N)$ or $F_d^{\mathrm{act}}(N)$ and hence at the points of $V_d(N)$ or $V_d^{\mathrm{act}}(N)$. This will be demonstrated in example 12 and results in the following corollary to theorems 1 and 2.

Let $N$ be a neural network whose inputs are labeled by the coordinate functions $x_1,\dots,x_n$. Then there exist nontrivial polynomials on the affine spaces $\mathrm{Poly}_{d,n}$ that are dependent only on the topology of $N$ and become zero on the functional varieties $V_d^{\mathrm{act}}(N)\subset\mathrm{Poly}_{d,n}$. The same holds for the functional varieties $V_d(N)$ provided that the number of inputs to each neuron of $N$ is less than $n$.
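To illustrate how such polynomials on $\mathrm{Poly}_{d,n}$ arise, the sketch below (our own; it uses the constraint $F_xF_{yz}=F_yF_{xz}$ for networks computing $g(f(x,y),z)$, as in example 1, rather than the networks of example 12) substitutes a generic quadratic and reads off equations in its coefficients:

```python
# A PDE constraint becomes polynomial equations on the coefficient space:
# impose F_x * F_yz - F_y * F_xz = 0 on a generic quadratic in (x, y, z)
# and collect the resulting equations in its coefficients c_i.
import sympy as sp

x, y, z = sp.symbols('x y z')
c = sp.symbols('c0:10')
monos = [1, x, y, z, x*y, x*z, y*z, x**2, y**2, z**2]
P = sum(ci*m for ci, m in zip(c, monos))

expr = sp.expand(sp.diff(P, x)*sp.diff(P, y, z) - sp.diff(P, y)*sp.diff(P, x, z))
eqs = sp.Poly(expr, x, y, z).coeffs()     # polynomial equations in the c_i

# e.g. P = (x + y)**2 + z, computable as g(f(x, y), z), satisfies them all:
sol = {c[0]: 0, c[1]: 0, c[2]: 0, c[3]: 1, c[4]: 2, c[5]: 0, c[6]: 0,
       c[7]: 1, c[8]: 1, c[9]: 0}
assert all(sp.simplify(e.subs(sol)) == 0 for e in eqs)
```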

#### 3.3.1 Example 12

#### 3.3.2 Example 13

Let $N$ be the neural network appearing in Figure 4. The functional space $F_d^{\mathrm{act}}(N)$ is formed by polynomials $P(x,t)$ of total degree at most $d$ that are in the form of $\sigma(f(ax+bt)+g(a'x+b't))$. By examining the Taylor expansions, it is not hard to see that if $P(x,t)$ is written in this form for univariate smooth functions $\sigma$, $f$, and $g$, then these functions could be chosen to be polynomials. Therefore, in any degree $d$, our characterization of superpositions of this form in example 11 in terms of PDEs and PDIs results in polynomial equations and inequalities that describe a Zariski-open subset of $F_d^{\mathrm{act}}(N)$, namely the complement of the locus where the nonvanishing conditions fail. The inequalities disappear after taking the closure, so $V_d^{\mathrm{act}}(N)$ is strictly larger than $F_d^{\mathrm{act}}(N)$ here.

## 4 PDE Characterization of Tree Functions

Building on the examples of the previous section, we prove theorems 3 and 4. This will establish conjecture 1 for tree architectures with distinct inputs.

corresponding to the leaves adjacent to the root of $T$. By renumbering $x_1,\dots,x_n$, one may write the leaves as

The formulation of theorem 3 in Farhoodi et al. (2019) is concerned with analytic functions and binary trees. The proof presented above follows the same inductive procedure but utilizes theorem 5 instead of Taylor expansions. Of course, theorem 5 remains valid in the analytic category, so the tree representation of $F$ constructed in the proof here consists of analytic functions if $F$ is analytic. An advantage of working with analytic functions is that in certain cases the nonvanishing conditions may be relaxed. For instance, if in example 1 the function $F(x,y,z)$ satisfying equation 1.4 is analytic, it admits a local representation of the form 1.1, while if $F$ is only smooth, at least one of the conditions $F_x\neq 0$ or $F_y\neq 0$ is required. (See Farhoodi et al., 2019, sec. 5.1 and 5.3, for details.)

We induct on the number of leaves to prove the sufficiency of constraints 1.19 and 1.20 (accompanied by suitable nonvanishing conditions) for the existence of a tree implementation of a smooth function $F=F(x_1,\dots,x_n)$ as a composition of functions of the form 1.12. Given a rooted tree $T$ with $n$ leaves labeled by $x_1,\dots,x_n$, the inductive step has two cases, demonstrated in Figures 14 and 15:

- There are leaves, say $x_{m+1},\dots,x_n$, directly adjacent to the root of $T$; their removal results in a smaller tree $T'$ with leaves $x_1,\dots,x_m$ (see Figure 14). The goal is to write $F(x_1,\dots,x_n)$ as
  $$\sigma\big(G(x_1,\dots,x_m)+c_{m+1}x_{m+1}+\cdots+c_nx_n\big),\tag{4.5}$$
  with $G$ satisfying appropriate constraints that, invoking the induction hypothesis, guarantee that $G$ is computable by $T'$.
- There is no leaf adjacent to the root of $T$, but there are smaller subtrees. Denote one of them by $T_2$ and its leaves by $x_{m+1},\dots,x_n$. Removing this subtree results in a smaller tree $T_1$ with leaves $x_1,\dots,x_m$ (see Figure 15). The goal is to write $F(x_1,\dots,x_n)$ as
  $$\sigma\big(G_1(x_1,\dots,x_m)+G_2(x_{m+1},\dots,x_n)\big),\tag{4.6}$$
  with $G_1$ and $G_2$ satisfying constraints corresponding to $T_1$ and $T_2$, so that both may be implemented on these trees by invoking the induction hypothesis.
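The two cases of the inductive step can be mirrored in a small recursive evaluator. This is only an illustrative sketch (the tree encoding and names are hypothetical): a leaf adjacent to the root contributes a linear term $c_ix_i$ as in equation 4.5, while a subtree contributes a nested tree function as in equation 4.6:

```python
import math

# A rooted tree is either ('leaf', i), returning the input x_i, or
# ('node', sigma, children, weights), applying the activation sigma to a
# weighted sum of its children's values.

def evaluate(node, x):
    if node[0] == 'leaf':
        return x[node[1]]
    _, sigma, children, weights = node
    return sigma(sum(w * evaluate(c, x) for c, w in zip(children, weights)))

# sigma(G(x1, x2) + c3 * x3): the case of equation 4.5, with one leaf
# (x3) adjacent to the root and G itself a smaller tree function.
tree = ('node', math.tanh,
        [('node', math.tanh, [('leaf', 0), ('leaf', 1)], [1.0, 2.0]),
         ('leaf', 2)],
        [1.0, 0.5])

print(evaluate(tree, [0.1, 0.2, 0.3]))
```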

Following the discussion at the beginning of section 3, $F$ may be locally written as a function of another function with nonzero gradient if the gradients are parallel. This idea has been used frequently so far, but there is a twist here: we want such a description of $F$ to persist on the box-like region $B$ that is the domain of $F$. Lemma 2 resolves this issue. The tree function in the argument of $\sigma$ in either of equations 4.5 or 4.6, which here we denote by $\tilde F$, shall be constructed below by invoking the induction hypothesis, so $\tilde F$ is defined at every point of $B$. Besides, our description of $\nabla\tilde F$ below (cf. equations 4.7 and 4.9) readily indicates that, just like $F$, it satisfies the nonvanishing conditions of theorem 7. Applying lemma 2 to $\tilde F$, any level set $\{x\in B \mid \tilde F(x)=c\}$ is connected, and $\tilde F$ can be extended to a coordinate system $(\tilde F,F_2,\dots,F_n)$ for $B$. Thus $F$, whose partial derivatives with respect to the other coordinate functions vanish, realizes precisely one value on any coordinate hypersurface $\{x\in B \mid \tilde F(x)=c\}$. Setting $\sigma(c)$ to be the aforementioned value of $F$ defines a function $\sigma$ with $F=\sigma(\tilde F)$. After this discussion on the domain of definition of the desired representation of $F$, we proceed with constructing $\tilde F=\tilde F(x_1,\dots,x_n)$ as either $G(x_1,\dots,x_m)+c_{m+1}x_{m+1}+\cdots+c_nx_n$ in the case of equation 4.5 or $G_1(x_1,\dots,x_m)+G_2(x_{m+1},\dots,x_n)$ in the case of equation 4.6.

As mentioned in remark 3, working with functions of the form 1.12 in theorem 7 rather than general smooth functions has the advantage of enabling us to determine a domain on which a superposition representation exists. In contrast, the sufficiency part of theorem 3 is a local statement since it relies on the implicit function theorem. It is possible to say something nontrivial about the domains when the functions are furthermore analytic: the implicit function theorem holds in the analytic category as well (Krantz & Parks, 2002, sec. 6.1), and lower bounds on its domain of validity exist in the literature (Chang, He, & Prabhu, 2003).

## 5 Conclusion

In this article, we proposed a systematic method for studying smooth real-valued functions constructed as compositions of other smooth functions that are either of lower arity or in the form of a univariate activation function applied to a linear combination of inputs. We established that any such smooth superposition must satisfy nontrivial constraints in the form of algebraic PDEs, which are dependent only on the hierarchy of composition or, equivalently, only on the topology of the neural network that produces superpositions of this type. We conjectured that there always exist characteristic PDEs that also provide sufficient conditions for a generic smooth function to be expressible by the feedforward neural network in question. The genericity is to avoid singular cases and is captured by nonvanishing conditions that require certain polynomial functions of partial derivatives to be nonzero. We observed that there are also situations where nontrivial algebraic inequalities involving partial derivatives (PDIs) are imposed on the hierarchical functions. In summary, the conjecture aims to describe generic smooth functions computable by a neural network with finitely many universal conditions of the form $\Phi \u22600$, $\Psi =0$, and $\Theta >0$, where $\Phi $, $\Psi $, and $\Theta $ are polynomial expressions of the partial derivatives and are dependent only on the architecture of the network, not on any tunable parameter or any activation function used in the network. This is reminiscent of the notion of a semialgebraic set from real algebraic geometry. Indeed, in the case of compositions of polynomial functions or functions computed by polynomial neural networks, the PDE constraints yield equations for the corresponding functional variety in an ambient space of polynomials of a prescribed degree.

The conjecture was verified in several cases, most importantly, for tree architectures with distinct inputs where, in each regime, we explicitly exhibited a PDE characterization of functions computable by a tree network. Examples of tree architectures with repeated inputs were addressed as well. The proofs were mathematical in nature and relied on classical results of multivariable analysis.

The article moreover highlights the differences between the two regimes mentioned at the beginning: hierarchical functions constructed by composing functions of lower dimensionality, and hierarchical functions that are compositions of functions of the form $y\mapsto\sigma(\langle w,y\rangle)$. The former appear more often in the mathematical literature on the Kolmogorov-Arnold representation theorem, while the latter are ubiquitous in deep learning. The special form of the functions $y\mapsto\sigma(\langle w,y\rangle)$ requires more PDE constraints to be imposed on their compositions, whereas their mild nonlinearity is beneficial in terms of ascertaining the domain on which a claimed compositional representation exists.
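For concreteness, here is a minimal numpy sketch of the second regime (all weights are arbitrary illustrative values): each unit computes $y\mapsto\sigma(\langle w,y\rangle)$, and a feedforward network is the layerwise composition of such units.

```python
import numpy as np

def layer(W, y, sigma=np.tanh):
    # Each row w of W produces one unit sigma(<w, y>).
    return sigma(W @ y)

rng = np.random.default_rng(0)
x = rng.normal(size=4)                    # network input
h = layer(rng.normal(size=(3, 4)), x)     # hidden layer: 3 units
out = layer(rng.normal(size=(1, 3)), h)   # output unit
print(out.shape)  # (1,)
```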

Our approach for describing the functional spaces associated with feedforward neural networks is of natural interest in the study of expressivity of neural networks and could lead to new complexity measures. We believe that the point of view adopted here is novel and might shed light on a number of practical problems such as comparison of architectures and reverse-engineering deep networks.

## Appendix A: Technical Proofs

^{9}because otherwise there exists a point $p\in B$ at which $F_{x_i}(p)=0$ for every leaf $x_i$. By the same logic, the weight $w_1$ must be nonzero because otherwise all first-order partial derivatives with respect to the variables appearing in $T_1$ would be identically zero. We now show that an arbitrary level set $L_c:=\{x\in B \mid F(x)=c\}$ is connected. Given the representation (see equation A.1) of $F$, the level set is empty if $\sigma$ does not attain the value $c$. Otherwise, $\sigma$ attains $c$ at a unique point $\sigma^{-1}(c)$ of its domain. So one may rewrite the equation $F(x_1,\dots,x_n)=c$ as

*Let $\pi:X\to Y$ be a continuous map of topological spaces that takes open sets to open sets and has connected level sets. Then the preimage of any connected subset of $Y$ under $\pi$ is connected.* Here, $L_c$ is the preimage under the map $\pi$ defined in equation A.4 of the set from equation A.3, which is connected since it is the graph of a continuous function; the map $\pi$ is open because the scalar-valued function $G_1$ is: its gradient never vanishes. Therefore, the connectedness of the level sets of $F$ is implied by the connectedness of the level sets of $\pi$. A level set of the map in equation A.4 may be identified with a level set of its first component $G_1$. Consequently, we have reduced to the same problem for the function $G_1$, which is implemented on the smaller tree $T_1$, and an inductive argument yields the connectedness of the level sets of $F$. It only remains to check the base case of a tree whose leaves are directly connected to the root. In that setting, $F(x_1,\dots,x_n)$ is of the form $\sigma(a_1x_1+\cdots+a_nx_n)$ (the family of functions that lemma 3 is concerned with). By repeating the argument used before, the activation function $\sigma$ is injective. Hence, a level set $F(x)=c$ is the intersection of the hyperplane $a_1x_1+\cdots+a_nx_n=\sigma^{-1}(c)$ with the box-like region $B$. Such an intersection is convex and thus connected.

## Appendix B: Differential Forms

Differential forms are ubiquitous objects in differential geometry and tensor calculus. We only need the theory of differential forms on open domains in Euclidean spaces. Theorem 5 (which has been used several times throughout the article, for example, in the proof of theorem 3) is formulated in terms of differential forms. This appendix provides the necessary background for understanding the theorem and its proof.

A *differential $k$-form* $\omega$ on $U$ assigns a scalar to any $k$-tuple of tangent vectors at a point $p$ of $U$. This assignment, denoted by $\omega_p$, must be multilinear and alternating. We say $\omega$ is smooth (resp. analytic) if $\omega_p$ varies smoothly (resp. analytically) with $p$. In other words, feeding $\omega$ with $k$ smooth (resp. analytic) vector fields $V_1,\dots,V_k$ on $U$ results in a function $\omega(V_1,\dots,V_k):U\to\mathbb{R}$ that is smooth (resp. analytic). We next exhibit an expression for $\omega$. Consider the standard basis $\frac{\partial}{\partial x_1},\dots,\frac{\partial}{\partial x_n}$ of vector fields on $U$, where $\frac{\partial}{\partial x_i}$ assigns $e_i$ to each point. The dual basis is denoted by $(dx_1,\dots,dx_n)$; at each point it yields the dual of the standard basis $(e_1,\dots,e_n)$ of $\mathbb{R}^n$, that is, $dx_i\big(\frac{\partial}{\partial x_j}\big)=\delta_{ij}$. Each of $dx_1,\dots,dx_n$ is a 1-form on $U$, and any $k$-form $\omega$ can be written in terms of them:

### B.1 Example 14

### B.2 Example 15

As mentioned in the previous example, applying the exterior differentiation operator twice to a form always yields zero. This extremely important property leads to the definitions of closed and exact differential forms. A $k$-form $\omega$ on an open subset $U$ of $\mathbb{R}^n$ is called closed if $d\omega=0$. This holds in particular if $\omega$ is of the form $\omega=d\alpha$ for a $(k-1)$-form $\alpha$ on $U$; such forms are called exact. The space of closed forms may be strictly larger than the space of exact forms; the difference between these spaces can be used to measure the topological complexity of $U$. If $U$ is an open box-like region, every closed form on it is exact. But, for instance, the 1-form $\omega=\frac{-y}{x^2+y^2}\,dx+\frac{x}{x^2+y^2}\,dy$ on $\mathbb{R}^2\setminus\{(0,0)\}$ is closed while it cannot be written as $d\alpha$ for any smooth function $\alpha:\mathbb{R}^2\setminus\{(0,0)\}\to\mathbb{R}$. This brings us to a famous fact from multivariable calculus that we have used several times (e.g., in the proof of theorem 4). A necessary condition for a vector field $V=\sum_{i=1}^n V_i\frac{\partial}{\partial x_i}$ on an open subset $U$ of $\mathbb{R}^n$ to be a gradient vector field is $(V_i)_{x_j}=(V_j)_{x_i}$ for all $1\le i,j\le n$. Near each point of $U$, such a vector field $V$ may be written as $\nabla f$; it is globally of the form $\nabla f$ for a function $f:U\to\mathbb{R}$ when $U$ is simply connected. In view of equations B.2 and B.3, one may rephrase this fact as: *Closed 1-forms on $U$ are exact if and only if $U$ is simply connected.*
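Both assertions above — that the winding form is closed and that exact forms are automatically closed — reduce to equalities of partial derivatives and can be checked directly with sympy (a sketch; the potential $f$ below is an arbitrary hypothetical choice):

```python
import sympy as sp

x, y = sp.symbols('x y')

# Closedness of a 1-form P dx + Q dy on the plane means Q_x - P_y = 0.
P = -y / (x**2 + y**2)   # the winding form on R^2 minus the origin
Q = x / (x**2 + y**2)
closed = sp.simplify(sp.diff(Q, x) - sp.diff(P, y))
print(closed)  # 0: the winding form is closed

# Exact 1-forms df = f_x dx + f_y dy are closed since mixed partials agree.
f = sp.exp(x) * sp.sin(y) + x**2 * y     # hypothetical potential
mixed = sp.simplify(sp.diff(f, x, y) - sp.diff(f, y, x))
print(mixed)  # 0
```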

Near a point $p\in\mathbb{R}^n$ at which $V(p)\neq 0$, we seek a locally defined function $\xi$ with $V\parallel\nabla\xi\neq 0$. Recall that if $q\in\mathbb{R}^n$ is a regular point of $\xi$, then near $q$, the level set of $\xi$ passing through $q$ is an $(n-1)$-dimensional submanifold of $\mathbb{R}^n$ to which the gradient vector field $\nabla\xi\neq 0$ is perpendicular. As we want the gradient to be parallel to the vector field $V$, the equivalent characterization in terms of the 1-form $\omega$ dual to $V$ (cf. equations 3.1 and 3.3) asserts that $\omega$ vanishes on any vector tangent to the level set. So the tangent space to the level set at the point $q$ can be described as $\{v\in\mathbb{R}^n \mid \omega_q(v)=0\}$. As $q$ varies near $p$, these $(n-1)$-dimensional subspaces of $\mathbb{R}^n$ vary smoothly. In differential geometry, such a higher-dimensional version of a vector field is called a *distribution*, and the property that these subspaces are locally given by tangent spaces to a family of submanifolds (the level sets here) is called *integrability*. The seminal Frobenius theorem (Narasimhan, 1968, theorem 2.11.11) implies that the distribution defined by a nowhere vanishing 1-form $\omega$ is integrable if and only if $\omega\wedge d\omega=0$.
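On $\mathbb{R}^3$, writing $\omega=P\,dx+Q\,dy+R\,dz$ for the form dual to $V=(P,Q,R)$, the coefficient of $\omega\wedge d\omega$ is the triple product $V\cdot(\nabla\times V)$. A sympy sketch, with hypothetical choices of $\xi$ and a nonvanishing factor $\mu$, confirming that a field parallel to a gradient satisfies the Frobenius condition:

```python
import sympy as sp

x, y, z = sp.symbols('x y z')

xi = x * y + sp.sin(z)       # hypothetical potential xi
mu = 1 + x**2 + z**2         # hypothetical nonvanishing scale factor
P, Q, R = [mu * sp.diff(xi, v) for v in (x, y, z)]  # V = mu * grad(xi)

curl = [sp.diff(R, y) - sp.diff(Q, z),
        sp.diff(P, z) - sp.diff(R, x),
        sp.diff(Q, x) - sp.diff(P, y)]

# omega ^ d(omega) = (V . curl V) dx^dy^dz for the dual 1-form omega
helicity = sum(a * b for a, b in zip((P, Q, R), curl))
print(sp.simplify(helicity))  # 0
```

The cancellation reflects the identity $\mu\nabla\xi\cdot(\nabla\mu\times\nabla\xi)=0$, a triple product with a repeated factor.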

## Notes

^{1}

To be mathematically precise, the open neighborhood of $p$ on which $F$ admits a compositional representation in the desired form may be dependent on $F$ and $p$. So conjecture 1 is local in nature and must be understood as a statement about function germs.

^{2}

Convergence in the $C^r$-norm is defined as the uniform convergence of the function and its partial derivatives up to order $r$.

^{3}

In conjecture 1, the subset cut off by the equations $\Psi_i=0$ is meager: it is a closed and (due to the term *nonvacuous* appearing in the conjecture) proper subset of the space of functions computable by $N$, and a function implemented by $N$ at which some $\Psi_i$ vanishes can be perturbed to another computable function at which all of the $\Psi_i$'s are nonzero.

^{4}

An open box-like region in $\mathbb{R}^n$ is a product $I_1\times\cdots\times I_n$ of open intervals.

^{5}

A piece of terminology introduced in Farhoodi et al. (2019) may be illuminating here. A member of a triple $(x_i,x_j,x_k)$ of (not necessarily distinct) leaves of $T$ is called the *outsider* of the triple if there is a (rooted full) subtree of $T$ that misses it but has the other two members. Theorem 3 imposes $F_{x_ix_k}F_{x_j}=F_{x_jx_k}F_{x_i}$ whenever $x_k$ is the outsider, while theorem 4 imposes the constraint whenever $x_i$ and $x_j$ are not outsiders.

^{6}

Notice that this is the best one can hope to recover because, through scaling the weights and inversely scaling the inputs of the activation functions, the function $F$ could also be written as $\sigma(\tilde\tau(\lambda ax+\lambda by)+cz+dw)$ or $\sigma(\tilde\tau_1(\lambda ax+\lambda by)+\tau_2(cz+dw))$, where $\tilde\tau(y):=\tau\big(\frac{y}{\lambda}\big)$ and $\tilde\tau_1(y):=\tau_1\big(\frac{y}{\lambda}\big)$. Thus, the other ratios $\frac{a}{c}$ and $\frac{b}{d}$ are completely arbitrary.

^{8}

A single vertex is not considered to be a rooted tree in our convention.

^{9}

As the vector $(x_1,\dots,x_n)$ of inputs varies in the box-like region $B$, the inputs to each node form an interval on which the corresponding activation function is defined.

## References

*Collected works: Representations of functions, celestial mechanics and KAM theory, 1957–1965*

*Collected works: Representations of functions, celestial mechanics and KAM theory, 1957–1965*

*Advances in neural information processing systems, 11*

*IEEE Transactions on Neural Networks and Learning Systems*

*Elementary differential equations*

*Kolmogorov's heritage in mathematics*

*Approximate complexity and functional representation*

*J. Math. Anal. Appl.*

*Amer. Math. Monthly*

*J. Differential Equations*

*J. Inequal. Pure Appl. Math.*

*Depth-width trade-offs for ReLU networks via Sharkovsky's theorem*

*Proceedings of the Conference on Learning Theory*

*An introduction to semialgebraic geometry*

*Mathematics of Control, Signals and Systems*

*Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing*

*On the power of over-parametrization in neural networks with quadratic activation*

*Proceedings of the Conference on Learning Theory*

*Neural Computation*

*Advances in neural information processing systems*

*eLife*

*BMC Bioinformatics*

*Neural Computation*

*Proceedings of the International Conference on Neural Networks*

*Bulletin of the American Mathematical Society*

*Neural Networks*

*Neural Networks*

*On the expressive power of deep polynomial neural networks*

*Branching morphogenesis*

*Dokl. Akad. Nauk SSSR*

*The implicit function theorem*

*Neural Computation*

*Neural Networks*

*Journal of Statistical Physics*

*Approximation of functions*

*Neural Computation*

*Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence.*

*Perceptrons: An introduction to computational geometry*

*Advances in neural information processing systems, 27*

*Analysis on real and complex manifolds*

*Mathematische Zeitschrift*

*Foundations of Computational Mathematics*

*Theoretical issues in deep networks: Approximation, optimization and generalization.*

*International Journal of Automation and Computing*

*Aufgaben und Lehrsätze aus der Analysis*

*Advances in neural information processing systems*

*Real mathematical analysis*

*Proceedings of the 34th International Conference on Machine Learning*

*Reverse-engineering deep ReLU networks.*

*Bull. Amer. Math. Soc. (N.S.)*

*eLife*

*IEEE Transactions on Information Theory*

*Trans. Amer. Math. Soc.*

*Benefits of depth in neural networks.*

*Spurious valleys in two-layer neural network optimization landscapes.*

*Doklady Akad. Nauk SSSR (N.S.)*

*Dokl. Akad. Nauk SSSR*

*Russian Mathematical Surveys*

*Uspehi Mat. Nauk*

*Approximation theory, III (Proc. Conf., Univ. Texas, Austin, Tex., 1980)*