Abstract
Neural networks are versatile tools for computation, having the ability to approximate a broad range of functions. An important problem in the theory of deep neural networks is expressivity; that is, we want to understand the functions that are computable by a given network. We study real, infinitely differentiable (smooth) hierarchical functions implemented by feedforward neural networks via composing simpler functions in two cases: (1) each constituent function of the composition has fewer inputs than the resulting function and (2) constituent functions are in the more specific yet prevalent form of a nonlinear univariate function (e.g., tanh) applied to a linear multivariate function. We establish that in each of these regimes, there exist nontrivial algebraic partial differential equations (PDEs) that are satisfied by the computed functions. These PDEs are purely in terms of the partial derivatives and are dependent only on the topology of the network. Conversely, we conjecture that such PDE constraints, once accompanied by appropriate nonsingularity conditions and perhaps certain inequalities involving partial derivatives, guarantee that the smooth function under consideration can be represented by the network. The conjecture is verified in numerous examples, including the case of tree architectures, which are of neuroscientific interest. Our approach is a step toward formulating an algebraic description of functional spaces associated with specific neural networks, and may provide useful new tools for constructing neural networks.
1 Introduction
1.1 Motivation
1.1.1 Example 1
1.1.2 Example 2
1.2 Statements of Main Results
Fixing a neural network hierarchy for composing functions, we shall prove that when the constituent functions of the corresponding superpositions have fewer inputs (lower arity), there exist universal algebraic partial differential equations (algebraic PDEs) that have these superpositions as their solutions. A conjecture, which we verify in several cases, states that such PDE constraints characterize a generic smooth superposition computable by the network. Here, genericity means a nonvanishing condition imposed on an algebraic expression of partial derivatives. Such a condition has already appeared in example 1, where, in the proof of the sufficiency of equation 1.4 for the existence of a representation of the form 1.1 for a function, we assumed that at least one of two expressions is nonzero. Before proceeding with the statements of the main results, we formally define some of the terms that have appeared so far.
Terminology
We take all neural networks to be feedforward. A feedforward neural network is an acyclic, hierarchical, layer-to-layer scheme of computation. We also include residual networks (ResNets) in this category: an identity function in a layer can be interpreted as a jump over that layer. Tree architectures are recurring examples of this kind. We shall always assume that in the first layer, the inputs are labeled by (not necessarily distinct) labels chosen from the coordinate functions , and there is only one node in the output layer. Assigning functions to the nodes in layers above the input layer implements a real scalar-valued function as the superposition of the functions appearing at the nodes (see Figure 3).
- In our setting, an algebraic PDE is a nontrivial polynomial relation, such as equation 1.10, among the partial derivatives (up to a certain order) of a smooth function . Here, for a tuple of nonnegative integers, the partial derivative (which is of order ) is denoted by . For instance, asking for a polynomial expression of partial derivatives of to be constant amounts to algebraic PDEs given by setting the first-order partial derivatives of that expression with respect to to be zero.
- A nonvanishing condition imposed on smooth functions is asking for these functions not to satisfy a particular algebraic PDE, namely equation 1.11, for a nonconstant polynomial . Such a condition could be deemed pointwise since if it holds at a point , it persists throughout a small enough neighborhood. Moreover, equation 1.11 determines an open dense subset of the functional space; so it is satisfied generically.
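To make these notions concrete, here is a minimal SymPy sketch. It assumes, as a stand-in for example 1 and equations 1.1 and 1.4, the composite form F(x, y, z) = g(f(x, y), z) and the constraint F_x F_yz - F_y F_xz = 0; the point is that the algebraic PDE holds identically, no matter which smooth constituent functions are chosen.

```python
import sympy as sp

x, y, z = sp.symbols('x y z')
f = sp.Function('f')   # arbitrary smooth inner function of (x, y)
g = sp.Function('g')   # arbitrary smooth outer function of two arguments

# A hierarchical function of the assumed form F(x, y, z) = g(f(x, y), z).
F = g(f(x, y), z)

# Candidate algebraic PDE: F_x * F_yz - F_y * F_xz = 0.
constraint = (sp.diff(F, x) * sp.diff(F, y, z)
              - sp.diff(F, y) * sp.diff(F, x, z))

# The constraint vanishes identically, independently of f and g.
print(sp.simplify(constraint))  # -> 0
```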
Let be a feedforward neural network in which the number of inputs to each node is less than the total number of distinct inputs to the network. Superpositions of smooth functions computed by this network satisfy nontrivial constraints in the form of certain algebraic PDEs that are dependent only on the topology of .
In order to guarantee the existence of PDE constraints for superpositions, theorem 1 assumes a condition on the topology of the network. However, theorem 2 states that by restricting the functions that can appear in the superposition, one can still obtain PDE constraints even for a fully connected multilayer perceptron:
Let be an arbitrary feedforward neural network with at least two distinct inputs, with smooth functions of the form 1.12 applied at its nodes. Any function computed by this network satisfies nontrivial constraints in the form of certain algebraic PDEs that are dependent only on the topology of .
1.2.1 Example 3
The preceding example suggests that smooth functions implemented by a neural network may be required to obey a nontrivial algebraic partial differential inequality (algebraic PDI). So it is convenient to set up the following terminology.
Terminology
- An algebraic PDI is an inequality of the form given in equation 1.15, involving partial derivatives (up to a certain order), where is a real polynomial.
Without loss of generality, we assume that the PDIs are strict, since a nonstrict one such as could be written as the union of and the algebraic PDE .
Theorem 1 and example 1 deal with superpositions of arbitrary smooth functions while theorem 2 and example 3 are concerned with superpositions of a specific class of smooth functions, functions of the form 1.12. In view of the necessary PDE constraints in both situations, the following question then arises: Are there sufficient conditions in the form of algebraic PDEs and PDIs that guarantee a smooth function can be represented, at least locally, by the neural network in question?
Let be a feedforward neural network whose inputs are labeled by the coordinate functions . Suppose we are working in the setting of theorem 1 or theorem 2. Then there exist
- finitely many nonvanishing conditions
- finitely many algebraic PDEs
- finitely many algebraic PDIs
with the following property: For any arbitrary point , the space of smooth functions defined in a vicinity1 of that satisfy at and are computable by (in the sense of the regime under consideration) is nonvacuous and is characterized by PDEs and PDIs .
Conjecture 1 is settled in Farhoodi, Filom, Jones, and Körding (2019) for trees (a particular type of architecture) with distinct inputs, a situation in which no PDI is required and the inequalities should be taken to be trivial. Throughout the article, the conjecture above will be established for a number of architectures; in particular, we shall characterize tree functions (cf. theorems 3 and 4 below).
1.3 Related Work
There is an extensive literature on the expressive power of neural networks. Although shallow networks with sigmoidal activation functions can approximate any continuous function on compact sets (Cybenko, 1989; Hornik, Stinchcombe, & White, 1989; Hornik, 1991; Mhaskar, 1996), this cannot be achieved without the hidden layer getting exponentially large (Eldan & Shamir, 2016; Telgarsky, 2016; Mhaskar et al., 2017; Poggio et al., 2017). Many articles thus try to demonstrate how the expressive power is affected by depth. This line of research draws on a number of different scientific fields including algebraic topology (Bianchini & Scarselli, 2014), algebraic geometry (Kileel et al., 2019), dynamical systems (Chatziafratis, Nagarajan, Panageas, & Wang, 2019), tensor analysis (Cohen, Sharir, & Shashua, 2016), Vapnik–Chervonenkis theory (Bartlett, Maiorov, & Meir, 1999), and statistical physics (Lin, Tegmark, & Rolnick, 2017). One approach is to argue that deeper networks are able to approximate or represent functions of higher complexity after defining a “complexity measure” (Bianchini & Scarselli, 2014; Montufar, Pascanu, Cho, & Bengio, 2014; Poole, Lahiri, Raghu, Sohl-Dickstein, & Ganguli, 2016; Telgarsky, 2016; Raghu et al., 2017). Another approach more in line with this article is to use the “size” of an associated functional space as a measure of representation power. This point of view is adopted in Farhoodi et al. (2019) by enumerating Boolean functions, and in Kileel et al. (2019) by regarding dimensions of functional varieties as such a measure.
To the best of our knowledge, the closest mentions of a characterization of a class of superpositions by necessary and sufficient PDE constraints in the literature are the papers by R. C. Buck (Buck, 1979, 1981a). The first one (along with its earlier version, Buck, 1976) characterizes superpositions of the form in a fashion similar to example 1. Superpositions such as (which appeared in example 2) are also discussed in those papers, although only the existence of necessary PDE constraints is shown; see Buck (1979, lemma 7) and Buck (1981a, p. 141). We exhibit a PDE characterization for superpositions of this form in example 7. These papers also characterize sufficiently differentiable nomographic functions of the form and .
A special class of neural network architectures is provided by rooted trees, where any output of a layer is passed to exactly one node in one of the layers above (see Figure 8). Investigating functions computable by trees is of neuroscientific interest because the morphology of the dendrites of a neuron processes information through a tree that is often binary (Kollins & Davenport, 2005; Gillette & Ascoli, 2015). Assuming that the inputs to a tree are distinct, in our previous work (Farhoodi et al., 2019) we completely characterized the corresponding superpositions by formulating necessary and sufficient PDE constraints, a result that answers conjecture 1 in the positive for such architectures.
The characterization suggested by the theorem below is a generalization of example 1, which was concerned with smooth superpositions of the form 1.1. The characterization of such superpositions as solutions of PDE 1.4 also appeared in a paper (Buck, 1979) that we were not aware of while writing Farhoodi et al. (2019).
For any leaf with siblings either or there is a sibling leaf with .
This theorem was formulated in Farhoodi et al. (2019) for binary trees and in the context of analytic functions (and also that of Boolean functions). Nevertheless, the proof carries over to the more general setting above. Below, we formulate the analogous characterization of functions that trees compute via composing functions of the form 1.12. Proofs of theorems 3 and 4 are presented in section 4.
Let be a rooted tree admitting leaves that are labeled by the coordinate functions . We formulate the following constraints on smooth functions :
- For any two (rooted full) subtrees that emanate from a node of the tree (see Figure 6), we have equation 1.20 if the first pair of variables labels leaves of one subtree and the second pair labels leaves of the other.
These constraints are satisfied if is a superposition of functions of the form according to the hierarchy provided by . Conversely, a smooth function defined on an open box-like region4 can be written as such a superposition on provided that the constraints 1.19 and 1.20 formulated above hold and, moreover, the nonvanishing conditions below are satisfied throughout :
For any leaf with siblings either or there is a sibling leaf with ;
For any leaf without siblings .
Aside from neuroscientific interest, studying tree architectures is also important because any neural network can be expanded into a tree network with repeated inputs through a procedure called TENN (the Tree Expansion of the Neural Network; see Figure 7). Tree architectures with repeated inputs are relevant in the context of neuroscience too because the inputs to neurons may be repeated (Schneider-Mizell et al., 2016; Gerhard, Andrade, Fetter, Cardona, & Schneider-Mizell, 2017). We have already seen an example of a network along with its TENN in Figure 2. Both networks implement functions of the form . Even for this simplest example of a tree architecture with repeated inputs, the derivation of characteristic PDEs is computationally involved and will be done in example 7. This verifies conjecture 1 for the tree that appeared in Figure 2.
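For readers who prefer an algorithmic picture, the following sketch illustrates the idea of a tree expansion. The dictionary encoding of the network and the node names are our own hypothetical choices (not the construction of Figure 7); shared nodes are simply duplicated along every path, which is why inputs may repeat in the resulting tree.

```python
# Sketch of a tree expansion of a feedforward DAG (TENN-style idea).
# Hypothetical encoding: each node maps to the list of its inputs
# (children); leaves (network inputs) do not appear as keys.
from typing import Dict, List, Union

Tree = Union[str, tuple]

def tree_expand(dag: Dict[str, List[str]], node: str) -> Tree:
    """Recursively unfold `node`; shared nodes are duplicated, so the
    result is a tree whose leaves may repeat input labels."""
    children = dag.get(node, [])
    if not children:           # an input (leaf) of the network
        return node
    return (node, [tree_expand(dag, c) for c in children])

if __name__ == "__main__":
    # A small network in which the hidden node h1 feeds two nodes,
    # so its inputs x1, x2 get repeated in the expansion.
    dag = {
        "out": ["h1", "h2"],
        "h1":  ["x1", "x2"],
        "h2":  ["h1", "x3"],
    }
    print(tree_expand(dag, "out"))
    # ('out', [('h1', ['x1', 'x2']), ('h2', [('h1', ['x1', 'x2']), 'x3'])])
```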
1.4 Outline of the Article
Theorems 1 and 2 are proven in section 2, where it is established that in each setting, there are necessary PDE conditions for expressibility of smooth functions by a neural network. In section 3 we verify conjecture 1 in several examples by characterizing computable functions via PDE constraints that are necessary and (given certain nonvanishing conditions) sufficient. This starts by studying tree architectures in section 3.1. In example 7, we finish our treatment of a tree function with repeated inputs initiated in example 2; moreover, we present a number of examples to exhibit the key ideas of the proofs of theorems 3 and 4, which are concerned with tree functions with distinct inputs. The section then switches from trees to other neural networks in section 3.2, where, building on example 3, example 11 demonstrates why the characterization claimed by conjecture 1 involves inequalities. We end section 3 with a brief subsection on PDE constraints for polynomial neural networks. Examples in section 3.1 are generalized in the next section to a number of results establishing conjecture 1 for certain families of tree architectures: proofs of theorems 3 and 4 are presented in section 4. The last section is devoted to a few concluding remarks. There are two appendices discussing technical proofs of propositions and lemmas (appendix A) and the basic mathematical background on differential forms (appendix B).
2 Existence of PDE Constraints
The goal of the section is to prove theorems 1 and 2. Lemma 1 below is our main tool for establishing the existence of constraints:
For a positive integer , there are precisely monomials such as with their total degree not greater than . But each of them is a polynomial of of total degree at most where . For large enough, is greater than because the degree of the former as a polynomial of is , while the degree of the latter is . For such an , the number of monomials is larger than the dimension of the space of polynomials of of total degree at most . Therefore, there exists a linear dependency among these monomials that amounts to a nontrivial polynomial relation among .
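The counting behind this argument can be illustrated numerically. The sketch below uses purely illustrative values for the number of formal variables and the degree-inflation factor (they are assumptions, not the quantities defined in the lemma); it merely shows that a count growing like a higher power of the degree bound eventually exceeds one growing like a lower power, which forces a linear dependency.

```python
from math import comb

def n_monomials(num_vars: int, max_degree: int) -> int:
    """Number of monomials of total degree <= max_degree in num_vars variables."""
    return comb(num_vars + max_degree, num_vars)

# Hypothetical counts: N formal variables (partial derivatives of the
# superposition) versus M "free" quantities (derivatives of the constituent
# functions), with the substitution multiplying degrees by at most c.
N, M, c = 10, 6, 3          # assumed illustrative values, with N > M
for s in range(1, 60):
    lhs = n_monomials(N, s)          # monomials in the N derivatives, degree <= s
    rhs = n_monomials(M, c * s)      # dimension of degree <= c*s polynomials in M quantities
    if lhs > rhs:
        print(f"degree bound s = {s}: {lhs} monomials > {rhs} dimensions "
              f"-> a nontrivial linear relation must exist")
        break
```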
Let be a feedforward neural network whose inputs are labeled by the coordinate functions and which satisfies the hypothesis of either theorem 1 or theorem 2. Define the positive integer as
where and are, respectively, the number of edges of the underlying graph of and the number of its vertices above the input layer. Then the smooth functions computable by satisfy nontrivial algebraic partial differential equations of order . In particular, the subspace formed by these functions lies in a subset of positive codimension, which is closed with respect to the -norm.
It indeed follows from the arguments above that there is a multitude of algebraically independent PDE constraints. By a simple dimension count, this number is in the first case of corollary 1 and in the second case.
The approach here merely establishes the existence of nontrivial algebraic PDEs satisfied by the superpositions. These are not the simplest PDEs of this kind and hence are not the best candidates for the purpose of characterizing superpositions. For instance, for superpositions 1.7, which networks in Figure 2 implement, one has and . Corollary 1 thus guarantees that these superpositions satisfy a sixth-order PDE. But in example 7, we shall characterize them via two fourth-order PDEs (compare with Buck, 1979, lemma 7).
3 Toy Examples
This section examines several elementary examples demonstrating how one can derive a set of necessary or sufficient PDE constraints for an architecture. The desired PDEs should be universal, that is, purely in terms of the derivatives of the function that is to be implemented and not dependent on any weight vector, activation function, or a function of lower dimensionality that has appeared at a node. In this process, it is often necessary to express a smooth function in terms of other functions. If and is written as throughout an open neighborhood of a point where each is a smooth function, the gradient of must be a linear combination of those of due to the chain rule. Conversely, if near , by the inverse function theorem, one can extend to a coordinate system on a small enough neighborhood of provided that are linearly independent; a coordinate system in which the partial derivative vanishes for ; the fact that implies can be expressed in terms of near . Subtle mathematical issues arise if one wants to write as on a larger domain containing :
A -tuple of smooth functions defined on an open subset of whose gradient vector fields are linearly independent at all points cannot necessarily be extended to a coordinate system for the whole . As an example, consider whose gradient is nonzero at any point of , but there is no smooth function with throughout . The level set is compact, and so the restriction of to it achieves its absolute extrema, and at such points ( is the Lagrange multiplier).
- Even if one has a coordinate system on a connected open subset of , a smooth function with cannot necessarily be written globally as . One example is the function defined on the open subset for which . It may only locally be written as ; there is no function with for all . Defining as the value of on the intersection of its domain with the vertical line does not work because, due to the shape of the domain, such intersections may be disconnected. Finally, notice that , although smooth, is not analytic; indeed, examples of this kind do not exist in the analytic category.
This difficulty of needing a representation that remains valid not just near a point but over a larger domain comes up only in the proof of theorem 4 (see remark 3); the representations we work with in the rest of this section are all local. The assumption about the shape of the domain and the special form of functions 1.12 allow us to circumvent the difficulties just mentioned in the proof of theorem 4. Below are two related lemmas that we use later.
Let and be a box-like region in and a rooted tree with the coordinate functions labeling its leaves as in theorem 7. Suppose a smooth function on is implemented on via assigning activation functions and weights to the nodes of . If satisfies the nonvanishing conditions described at the end of theorem 7, then the level sets of are connected and can be extended to a coordinate system for .
A smooth function of the form satisfies for any . Conversely, if has a first-order partial derivative which is nonzero throughout an open box-like region in its domain, each identity could be written as ; that is, for any , the ratio should be constant on , and such requirements guarantee that admits a representation of the form on .
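A minimal SymPy sketch of the first claim, assuming the lemma concerns functions of the form sigma(w_1 x_1 + ... + w_n x_n): the ratio of any two first-order partial derivatives is the constant ratio of the corresponding weights, so all of its partial derivatives vanish.

```python
import sympy as sp

x1, x2, x3, w1, w2, w3 = sp.symbols('x1 x2 x3 w1 w2 w3')
sigma = sp.Function('sigma')   # an arbitrary smooth univariate activation

# Assumed form of the lemma: F = sigma(w1*x1 + w2*x2 + w3*x3).
F = sigma(w1*x1 + w2*x2 + w3*x3)

# Each ratio of first-order partials is a constant (a ratio of weights),
# so all of its partial derivatives vanish.
ratio = sp.diff(F, x1) / sp.diff(F, x2)
print(sp.simplify(ratio))                 # -> w1/w2
print(sp.simplify(sp.diff(ratio, x3)))    # -> 0
```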
A smooth vector field is parallel to a gradient vector field near each point only if the corresponding differential 1-form satisfies . Conversely, if is nonzero at a point in the vicinity of which holds, there exists a smooth function defined on a suitable open neighborhood of that satisfies . In particular, in dimension 2, a nowhere vanishing vector field is locally parallel to a nowhere vanishing gradient vector field, while in dimension 3, that is the case if and only if .
A proof and background on differential forms are provided in appendix B.
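In dimension 3, writing omega = P dx + Q dy + R dz for the vector field X = (P, Q, R), the coefficient of omega ∧ d(omega) is X · (curl X). The sketch below checks this obstruction symbolically on two illustrative fields of our own choosing: one that is a multiple of a gradient (obstruction zero) and one dual to a contact form (obstruction nonzero).

```python
import sympy as sp

x, y, z = sp.symbols('x y z')

def curl(X):
    P, Q, R = X
    return (sp.diff(R, y) - sp.diff(Q, z),
            sp.diff(P, z) - sp.diff(R, x),
            sp.diff(Q, x) - sp.diff(P, y))

def integrability_obstruction(X):
    """X . (curl X); in dimension 3 this is the coefficient of
    omega ^ d(omega) for the dual 1-form omega."""
    C = curl(X)
    return sp.simplify(sum(a * b for a, b in zip(X, C)))

# A field parallel to a gradient: X = z * grad(x**2 + y**2).
X_parallel = (2*x*z, 2*y*z, sp.Integer(0))
print(integrability_obstruction(X_parallel))   # -> 0 (locally a multiple of a gradient)

# The field dual to the contact form dz - y dx is not integrable.
X_contact = (-y, sp.Integer(0), sp.Integer(1))
print(integrability_obstruction(X_contact))    # -> 1 (nonzero: no such gradient exists)
```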
3.1 Trees with Four Inputs
We begin by formally defining the terms related to tree architectures (see Figure 8).
Terminology
A tree is a connected acyclic graph. Singling out a vertex as its root turns it into a directed acyclic graph in which every vertex other than the root has a unique predecessor/parent. We take all trees to be rooted. The following notions come up frequently:
Leaf: a vertex with no successor/child.
Node: a vertex that is not a leaf, that is, has children.
Sibling leaves: leaves with the same parent.
Subtree: all descendants of a vertex along with the vertex itself. Hence in our convention, all subtrees are full and rooted.
To implement a function, the leaves pass the inputs to the functions assigned to the nodes. The final output is received from the root.
The first example of the section elucidates theorem 3.
3.1.1 Example 4
3.1.2 Example 5
The next example is concerned with the symmetric tree in Figure 9. We shall need the following lemma:
3.1.3 Example 6
Examples 5 and 6 demonstrate an interesting phenomenon: one can deduce nontrivial facts about the weights once a formula for the implemented function is available. In example 5, for a function , we have and . The same identities are valid for functions of the form in example 6. This seems to be a direction worthy of study. In fact, there are papers discussing how a neural network may be “reverse-engineered” in the sense that the architecture of the network is determined from the knowledge of its outputs, or the weights and biases are recovered without the ordinary training process involving gradient descent algorithms (Fefferman & Markel, 1994; Dehmamy, Rohani, & Katsaggelos, 2019; Rolnick & Kording, 2019). In our approach, the weights appearing in a composition of functions of the form could be described (up to scaling) in terms of partial derivatives of the resulting superposition.
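As a small numerical sketch of this observation, consider the hypothetical two-layer tanh tree below (the architecture and all weight values are our own, not an example from the text). By the chain rule, the ratio of partial derivatives of the superposition with respect to two sibling inputs recovers the ratio of the corresponding first-layer weights, consistent with the "up to scaling" caveat above.

```python
import numpy as np

# Hypothetical tree function with first-layer weights (w1, w2) and (w3, w4):
# F(x) = tanh(a * tanh(w1*x1 + w2*x2) + b * tanh(w3*x3 + w4*x4)).
w1, w2, w3, w4, a, b = 1.3, -0.7, 0.4, 2.1, 0.9, -1.5

def F(x):
    return np.tanh(a * np.tanh(w1*x[0] + w2*x[1]) + b * np.tanh(w3*x[2] + w4*x[3]))

def partial(f, x, i, h=1e-6):
    """Central finite-difference estimate of the i-th partial derivative."""
    e = np.zeros_like(x); e[i] = h
    return (f(x + e) - f(x - e)) / (2 * h)

x0 = np.array([0.2, -0.1, 0.5, 0.3])
# By the chain rule, dF/dx1 : dF/dx2 = w1 : w2 regardless of the outer layers.
print(partial(F, x0, 0) / partial(F, x0, 1), w1 / w2)   # both approximately -1.857
```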
3.1.4 Example 7
3.1.5 Example 8
3.2 Examples of Functions Computed by Neural Networks
3.2.1 Example 9
3.2.2 Example 10
3.2.3 Example 11
3.3 Examples of Polynomial Neural Networks
The subset of consisting of polynomials of total degree at most that can be computed by via assigning real polynomial functions to its neurons
The smaller subset of consisting of polynomials of total degree at most that can be computed by via assigning real polynomials of the form to the neurons where is a polynomial activation function
In general, subsets and of are not closed in the algebraic sense (see remark 8). Therefore, one may consider their Zariski closures and , that is, the smallest subsets defined as zero loci of polynomial equations that contain them. We shall call and the functional varieties associated with . Each of the subsets and of could be described with finitely many polynomial equations in terms of 's. The PDE constraints from section 2 provide nontrivial examples of equations satisfied on the functional varieties: In any degree , substituting equation 3.28 in an algebraic PDE that smooth functions computed by must obey results in equations in terms of the coefficients that are satisfied at any point of or and hence at the points of or . This will be demonstrated in example 12 and results in the following corollary to theorems 1 and 2.
Let be a neural network whose inputs are labeled by the coordinate functions . Then there exist nontrivial polynomials on affine spaces that are dependent only on the topology of and become zero on functional varieties . The same holds for functional varieties provided that the number of inputs to each neuron of is less than .
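To illustrate how a PDE constraint descends to polynomial equations on coefficients, the following SymPy sketch uses a generic quadratic in three variables as a stand-in for equation 3.28 and reuses the assumed constraint F_x F_yz - F_y F_xz = 0 from the sketch in section 1.2; collecting coefficients of the resulting expression yields polynomial equations in the c_i that hold on the corresponding functional variety.

```python
import sympy as sp

x, y, z = sp.symbols('x y z')

# A generic polynomial of total degree <= 2 in three variables (stand-in for eq. 3.28).
monoms = [x**i * y**j * z**k for i in range(3) for j in range(3) for k in range(3)
          if i + j + k <= 2]
coeffs = sp.symbols(f'c0:{len(monoms)}')
F = sum(c * m for c, m in zip(coeffs, monoms))

# Substitute into the (assumed) PDE constraint F_x * F_yz - F_y * F_xz = 0
# and collect the coefficient of every monomial in x, y, z: each one is a
# polynomial equation in the c_i satisfied on the functional variety.
constraint = sp.expand(sp.diff(F, x) * sp.diff(F, y, z) - sp.diff(F, y) * sp.diff(F, x, z))
equations = sp.Poly(constraint, x, y, z).coeffs()
for eq in equations:
    print(sp.factor(eq), '= 0')
```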
3.3.1 Example 12
3.3.2 Example 13
Let be the neural network appearing in Figure 4. The functional space is formed by polynomials of total degree at most that are in the form of . By examining the Taylor expansions, it is not hard to see that if is written in this form for univariate smooth functions , , and , then these functions could be chosen to be polynomials. Therefore, in any degree , our characterization of superpositions of this form in example 11 in terms of PDEs and PDIs results in polynomial equations and inequalities that describe a Zariski open subset of which is the complement of the locus where the nonvanishing conditions fail. The inequalities disappear after taking the closure, so is strictly larger than here.
4 PDE Characterization of Tree Functions
Building on the examples of the previous section, we prove theorems 3 and 4. This will establish conjecture 1 for tree architectures with distinct inputs.
The formulation of theorem 3 in Farhoodi et al. (2019) is concerned with analytic functions and binary trees. The proof presented above follows the same inductive procedure but utilizes theorem 5 instead of Taylor expansions. Of course, theorem 5 remains valid in the analytic category, so the tree representation of constructed in the proof here consists of analytic functions if is analytic. An advantage of working with analytic functions is that in certain cases, the nonvanishing conditions may be relaxed. For instance, if in example 1 the function satisfying equation 1.4 is analytic, it admits a local representation of the form 1.1, while if is only smooth, at least one of the conditions of is required. (See Farhoodi et al., 2019, sec. 5.1 and 5.3, for details.)
We induct on the number of leaves to prove the sufficiency of constraints 1.19 and 1.20 (accompanied by suitable nonvanishing conditions) for the existence of a tree implementation of a smooth function as a composition of functions of the form 1.12. Given a rooted tree with leaves labeled by , the inductive step has two cases demonstrated in Figures 14 and 15:
- There are leaves directly adjacent to the root of the tree; their removal results in a smaller tree (see Figure 14). The goal is to write the function as in equation 4.5, with the constituent functions satisfying appropriate constraints that, invoking the induction hypothesis, guarantee computability by the smaller tree.
- There is no leaf adjacent to the root of the tree, but there are smaller subtrees. Single out one of them and denote its leaves accordingly. Removing this subtree results in a smaller tree (see Figure 15). The goal is to write the function as in equation 4.6, with the two constituent functions satisfying constraints corresponding to the subtree and the remaining smaller tree, and hence implementable on these trees by invoking the induction hypothesis.
Following the discussion at the beginning of section 3, may be locally written as a function of another function with nonzero gradient if the gradients are parallel. This idea has been frequently used so far, but there is a twist here: we want such a description of to persist on the box-like region that is the domain of . Lemma 2 resolves this issue. The tree function in the argument of in either of equation 4.5 or 4.6, which here we denote by , shall be constructed below by invoking the induction hypothesis, so is defined at every point of . Besides, our description of below (cf. equations 4.7 and 4.9) readily indicates that just like , it satisfies the nonvanishing conditions of theorem 7. Applying lemma 2 to , any level set is connected, and can be extended to a coordinate system for . Thus, , whose partial derivatives with respect to other coordinate functions vanish, realizes precisely one value on any coordinate hypersurface . Setting to be the aforementioned value of defines a function with . After this discussion on the domain of definition of the desired representation of , we proceed with constructing as either in the case of equation 4.5 or as in the case of equation 4.6.
As mentioned in remark 3, working with functions of the form 1.12 in theorem 7 rather than general smooth functions has the advantage of enabling us to determine a domain on which a superposition representation exists. In contrast, the sufficiency part of theorem 3 is a local statement since it relies on the implicit function theorem. It is possible to say something nontrivial about the domains when functions are furthermore analytic. This is because the implicit function theorem holds in the analytic category as well (Krantz & Parks, 2002, sec. 6.1) where lower bounds on the domain of validity of the theorem exist in the literature (Chang, He, & Prabhu, 2003).
5 Conclusion
In this article, we proposed a systematic method for studying smooth real-valued functions constructed as compositions of other smooth functions that are either of lower arity or in the form of a univariate activation function applied to a linear combination of inputs. We established that any such smooth superposition must satisfy nontrivial constraints in the form of algebraic PDEs, which are dependent only on the hierarchy of composition or, equivalently, only on the topology of the neural network that produces superpositions of this type. We conjectured that there always exist characteristic PDEs that also provide sufficient conditions for a generic smooth function to be expressible by the feedforward neural network in question. The genericity is to avoid singular cases and is captured by nonvanishing conditions that require certain polynomial functions of partial derivatives to be nonzero. We observed that there are also situations where nontrivial algebraic inequalities involving partial derivatives (PDIs) are imposed on the hierarchical functions. In summary, the conjecture aims to describe generic smooth functions computable by a neural network with finitely many universal conditions of the form , , and , where , , and are polynomial expressions of the partial derivatives and are dependent only on the architecture of the network, not on any tunable parameter or any activation function used in the network. This is reminiscent of the notion of a semialgebraic set from real algebraic geometry. Indeed, in the case of compositions of polynomial functions or functions computed by polynomial neural networks, the PDE constraints yield equations for the corresponding functional variety in an ambient space of polynomials of a prescribed degree.
The conjecture was verified in several cases, most importantly, for tree architectures with distinct inputs where, in each regime, we explicitly exhibited a PDE characterization of functions computable by a tree network. Examples of tree architectures with repeated inputs were addressed as well. The proofs were mathematical in nature and relied on classical results of multivariable analysis.
The article moreover highlights the differences between the two regimes mentioned at the beginning: the hierarchical functions constructed out of composing functions of lower dimensionality and the hierarchical functions that are compositions of functions of the form . The former functions appear more often in the mathematical literature on the Kolmogorov-Arnold representation theorem, while the latter are ubiquitous in deep learning. The special form of functions requires more PDE constraints to be imposed on their compositions, whereas their mild nonlinearity is beneficial in terms of ascertaining the domain on which a claimed compositional representation exists.
Our approach for describing the functional spaces associated with feedforward neural networks is of natural interest in the study of expressivity of neural networks and could lead to new complexity measures. We believe that the point of view adopted here is novel and might shed light on a number of practical problems such as comparison of architectures and reverse-engineering deep networks.
Appendix A: Technical Proofs
Appendix B: Differential Forms
Differential forms are ubiquitous objects in differential geometry and tensor calculus. We only need the theory of differential forms on open domains in Euclidean spaces. Theorem 5 (which has been used several times throughout the article, for example, in the proof of theorem 3) is formulated in terms of differential forms. This appendix provides the necessary background for understanding the theorem and its proof.
B.1 Example 14
B.2 Example 15
As mentioned in the previous example, the outcome of twice applying the exterior differentiation operator to a form is always zero. This is an extremely important property that leads to the definitions of closed and exact differential forms. A -form on an open subset of is called closed if . This holds if is in the form of for a -form on . Such forms are called exact. The space of closed forms may be strictly larger than the space of exact forms; the difference of these spaces can be used to measure the topological complexity of . If is an open box-like region, every closed form on it is exact. But, for instance, the 1-form on is closed while it may not be written as for any smooth function . This brings us to a famous fact from multivariable calculus that we have used several times (e.g., in the proof of theorem 4). A necessary condition for a vector field on an open subset of to be a gradient vector field is for any . Near each point of , the vector field may be written as ; it is globally in the form of for a function when is simply connected. In view of equations B.2 and B.3, one may rephrase this fact as: Closed 1-forms on are exact if and only if is simply connected.
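The standard example presumably intended here is the winding form omega = (x dy - y dx)/(x^2 + y^2) on the punctured plane; assuming that, the sketch below verifies symbolically that omega is closed and that its integral over the unit circle equals 2*pi, which rules out exactness.

```python
import sympy as sp

x, y, t = sp.symbols('x y t')

# The (assumed) winding 1-form omega = P dx + Q dy on R^2 minus the origin.
P = -y / (x**2 + y**2)
Q =  x / (x**2 + y**2)

# Closedness: d(omega) = (Q_x - P_y) dx ^ dy vanishes.
print(sp.simplify(sp.diff(Q, x) - sp.diff(P, y)))        # -> 0

# Non-exactness: the integral over the unit circle is 2*pi, not 0.
cx, cy = sp.cos(t), sp.sin(t)
integrand = (P.subs({x: cx, y: cy}) * sp.diff(cx, t)
             + Q.subs({x: cx, y: cy}) * sp.diff(cy, t))
print(sp.integrate(sp.simplify(integrand), (t, 0, 2*sp.pi)))  # -> 2*pi
```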
Near a point at which , we seek a locally defined function with . Recall that if is a regular point of , then near , the level set of passing through is an -dimensional submanifold of to which the gradient vector field, , is perpendicular. As we want the gradient to be parallel to the vector field , the equivalent characterization in terms of the 1-form , which is the dual of (cf. equations 3.1 and 3.3), asserts that is zero at any vector tangent to the level set. So the tangent space to the level set at the point could be described as . As varies near , these -dimensional subspaces of vary smoothly. In differential geometry, such a higher-dimensional version of a vector field is called a distribution, and the property that these subspaces are locally given by tangent spaces to a family of submanifolds (the level sets here) is called integrability. The seminal Frobenius theorem (Narasimhan, 1968, theorem 2.11.11) implies that the distribution defined by a nowhere vanishing 1-form is integrable if and only if .
Notes
To be mathematically precise, the open neighborhood of on which admits a compositional representation in the desired form may be dependent on and . So conjecture 1 is local in nature and must be understood as a statement about function germs.
Convergence in the -norm is defined as the uniform convergence of the function and its partial derivatives up to order .
In conjecture 1, the subset cut off by equations is meager: It is a closed and (due to the term nonvacuous appearing in the conjecture) proper subset of the space of functions computable by , and a function implemented by at which a vanishes could be perturbed to another computable function at which all of 's are nonzero.
An open box-like region in is a product of open intervals.
A piece of terminology introduced in Farhoodi et al. (2019) may be illuminating here. A member of a triple of (not necessarily distinct) leaves of is called the outsider of the triple if there is a (rooted full) subtree of that misses it but has the other two members. Theorem 3 imposes whenever is the outsider, while theorem 4 imposes the constraint whenever and are not outsiders.
Notice that this is the best one can hope to recover because through scaling the weights and inversely scaling the inputs of activation functions, the function could also be written as or where and . Thus, the other ratios and are completely arbitrary.
A single vertex is not considered to be a rooted tree in our convention.
As the vector of inputs varies in the box-like region , the inputs to each node form an interval on which the corresponding activation function is defined.