Abstract
We present an investigation of threshold circuits and other discretized neural networks in terms of four computational resources—size (the number of gates), depth (the number of layers), weight (weight resolution), and energy—where the energy is a complexity measure inspired by sparse coding and is defined as the maximum number of gates outputting nonzero values, taken over all input assignments. As our main result, we prove that if a threshold circuit $C$ of size $s$, depth $d$, energy $e$, and weight $w$ computes a Boolean function $f$ (i.e., a classification task) of $n$ variables, then $\log \mathrm{rk}(f)$ is bounded above by a product of linear factors in $d$ and $e$ and logarithmic factors in $s$ and $w$, regardless of the algorithm employed by $C$ to compute $f$, where $\mathrm{rk}(f)$ is a parameter determined solely by the scale of $f$ and defined as the maximum rank of a communication matrix of $f$ taken over all possible partitions of the $n$ input variables. For example, for a Boolean function $f$ whose communication matrix has rank exponential in $n$, the bound has a left-hand side that is linear in $n$, while its right-hand side is bounded by the product of the logarithmic factors of $s$ and $w$ and the linear factors of $d$ and $e$. If we view the logarithmic terms as having a negligible impact on the bound, our result implies a trade-off between depth and energy: $n$ needs to be smaller than the product of $d$ and $e$. For other neural network models, such as discretized ReLU circuits and discretized sigmoid circuits, we prove that a similar trade-off holds. Thus, our results indicate that increasing depth linearly enhances the capability of neural networks to acquire sparse representations when there are hardware constraints on the number of neurons and the weight resolution.
1 Introduction
Nervous systems receive an abundance of environmental stimuli and encode them within neural networks, in which the internal representations formed play a crucial role in precise neural information processing. More formally, DiCarlo and Cox (2007) regarded an internal representation as a firing pattern, that is, a vector in a very high-dimensional space in which each axis is one neuron’s activity and the dimensionality equals the number of neurons (e.g., approximately 1 million) in a feedforward neural network. They then argued that, for object recognition tasks, a representation is good if, for a given pair of images that are hard to distinguish in the input space, there exist representations that are easy to separate by simple classifiers such as linear classifiers.
However, there is a severe constraint on forming useful internal representations. While the total energy resources supplied to the brain are limited, high energetic costs are incurred by both the resting and the signaling of neurons (Attwell & Laughlin, 2001; Lennie, 2003). Niven and Laughlin (2008) argued that selective pressures both to improve behavioral performance and to reduce energy consumption can affect all levels of components and mechanisms in the brain. Thus, in addition to developing subcellular structures and individual neurons, neural networks are likely to develop useful and energy-efficient internal representations.
Sparse coding is a strategy in which a relatively small number of neurons out of a large population are simultaneously active, and it is considered a plausible principle for constructing internal representations in the brain. Sparse coding reconciles representational capacity with energy expenditure (Földiák, 2003; Levy & Baxter, 1996; Olshausen & Field, 2004) and has been observed experimentally in various sensory systems (see, for example, Barth & Poulet, 2012; Shoham et al., 2006), including the visual cortex, where a representation consisting of a few active neurons conveys enough information to reconstruct or classify natural images (Tang et al., 2018; Yoshida & Ohki, 2020).
Because sparse coding restricts the available internal representations of a neural network, it limits its representational power. Can we quantitatively evaluate the effects of a sparse coding strategy on neural information processing?
Uchizawa et al. (2008) sought to address this question from the viewpoint of circuit complexity. Let $\mathcal{C}$ be a class of feedforward circuits modeling neural networks. In a typical circuit complexity argument, we introduce a complexity measure for Boolean functions and show that any Boolean function computable by a circuit $C \in \mathcal{C}$ has low complexity, bounded by the computational resources available to $C$. If a Boolean function $f$ has inherently high complexity relative to its scale, we can conclude that any circuit in $\mathcal{C}$ requires a sufficiently large amount of resources to compute $f$. A Boolean function can model a binary classification task, so this implies that neural networks cannot construct good internal representations for the task if the computational resources are limited or, equivalently, if the scale of the task is too large.
More formally, Uchizawa et al. (2008) employed threshold circuits as a model for neural networks, where a threshold circuit is a feedforward logic circuit whose basic computational element computes a linear threshold function (McCulloch & Pitts, 1943; Minsky & Papert, 1988; Parberry, 1994; Rosenblatt, 1958; Siu & Bruck, 1991; Siu et al., 1995; Siu & Roychowdhury, 1994). Size, depth, and weight are computational resources that have been extensively studied. For a threshold circuit $C$, the size is defined as the number of gates in $C$, the depth is the number of layers of $C$, and the weight is the degree of resolution of the weights among the gates. These resources are likely to be bounded in neural networks in the brain: the number of neurons in the brain is clearly limited, and the number of neurons performing a particular task could be further restricted. Depth is related to the reaction time required to perform tasks; therefore, low depth values are preferred. A single synaptic weight may take an analog value; however, it is unlikely that a neuron has infinitely high resolution against neuronal noise; hence, a neuron may have a bounded degree of resolution. In addition, inspired by sparse coding, Uchizawa et al. (2008) introduced a new complexity measure called energy complexity, where the energy of a circuit is defined as the maximum number of internal gates outputting nonzero values, taken over all input assignments to the circuit. Studies on the energy complexity of other types of logic circuits have also been reported (Dinesh et al., 2020; Kasim-zade, 1992; Silva & Souza, 2022; Sun et al., 2019; Vaintsvaig, 1961).
Uchizawa et al. (2008) then showed that energy-bounded threshold circuits have a certain computational power by observing that threshold circuits can simulate linear decision trees, where a linear decision tree is a binary decision tree in which the query at each internal node is given by a linear threshold function. In particular, they proved that any linear decision tree with $\ell$ leaves can be simulated by a threshold circuit of size $O(\ell)$ and energy $O(\log \ell)$. Hence, any linear decision tree with polynomially many leaves can be simulated by a threshold circuit of polynomial size and energy only $O(\log n)$, where $n$ is the number of input variables.
Following Uchizawa et al. (2008), a sequence of papers established relations among other resources such as size, depth, and weight (Maniwa et al., 2018; Suzuki et al., 2011, 2013; Uchizawa & Takimoto, 2008; Uchizawa, 2020, 2014; Uchizawa et al., 2011). In particular, Uchizawa and Takimoto (2008) showed that any threshold circuit $C$ of constant depth and sublinear energy requires exponential size if $C$ computes a Boolean function of high bounded-error communication complexity, such as the inner product modulo 2. Even for functions of low communication complexity, an exponential lower bound on the size is known for constant-depth threshold circuits: any threshold circuit $C$ of constant depth and sublogarithmic energy requires exponential size if $C$ computes the parity function (Uchizawa, 2020). These lower bounds are exponential even though both the inner product modulo 2 and parity are computable by linear-size, constant-depth, and linear-energy threshold circuits. They imply that energy is strongly related to the representational power of threshold circuits and is an important computational resource. However, these lower bounds break down when we consider threshold circuits of larger depth and energy, say, nonconstant depth and sublinear energy.
Here, we provide a more refined relation among size, depth, energy, and weight. Our main result is formulated as follows. Let $f$ be a Boolean function of $n$ variables, and let $(A, B)$ be a binary partition of the input variables. Then we can express any assignment $x$ as a pair $(x_A, x_B)$. A communication matrix $M_{f,(A,B)}$ is a $2^{|A|} \times 2^{|B|}$ matrix, where each row (resp., each column) is indexed by an assignment $x_A$ (resp., $x_B$), and the entry $M_{f,(A,B)}[x_A, x_B]$ is defined to be the output of $f$ given $x_A$ and $x_B$. We denote by $\mathrm{rk}(f)$ the maximum rank of $M_{f,(A,B)}$ over the real numbers, where the maximum is taken over all partitions $(A, B)$ of the input variables. We establish the following relation.
The theorem also improves known lower bounds for threshold circuits. By rearranging equation 1.2, we obtain a lower bound on the size $s$ that is exponential in $n$ whenever $\log \mathrm{rk}(f)$ is linear in $n$, both $d$ and $e$ are sublinear, and $w$ is subexponential. For example, an exponential lower bound is obtained even for threshold circuits of sublinear depth, sublinear energy, and subexponential weight. Such lower bounds hold, in particular, for the inner product modulo 2 and the equality function because their communication matrices have rank $2^{\Omega(n)}$ over the real numbers. Comparing the lower bound of Uchizawa and Takimoto (2008) with ours, our lower bound is meaningful only for subexponential weight but affords a two-fold improvement: the lower bound remains exponential even if the depth is sublinear, and it applies to Boolean functions under a much weaker condition, namely, that the communication matrix has rank $2^{\Omega(n)}$ over the real numbers. Threshold circuits have received considerable attention in circuit complexity, and several lower-bound arguments have been developed for threshold circuits under restrictions on computational resources including size, depth, energy, and weight (Amano, 2020; Amano & Maruoka, 2005; Chen et al., 2018; Hajnal et al., 1993; Håstad & Goldmann, 1991; Impagliazzo et al., 1997; Kane & Williams, 2016; Nisan, 1993; Razborov & Sherstov, 2010; Sherstov, 2007; Uchizawa, 2020; Uchizawa & Takimoto, 2008; Uchizawa et al., 2011). However, these arguments are designed for constant-depth threshold circuits and cannot provide meaningful bounds when the depth is not constant. In particular, the inner product modulo 2 and the equality function are computable by polynomial-size, constant-depth threshold circuits. Thus, directly applying known techniques is unlikely to yield our lower bound.
To complement theorem 1, we show that the relation is tight up to a constant factor when the product of the depth and energy is small.
Apart from threshold circuits, we consider other well-studied models of neural networks in which the activation function and weights are discretized (e.g., discretized sigmoid and ReLU circuits). Size, depth, energy, and weight are also important parameters for artificial neural networks. Size and depth are central topics in the study of the success of deep learning. Energy is related to important techniques in deep learning such as the sparse autoencoder (Ng, 2011). Weight resolution is closely related to chip resources in neuromorphic hardware systems (Pfeil et al., 2012); accordingly, quantization schemes have received considerable attention (Courbariaux et al., 2015; Hubara et al., 2018).
For discretized circuits, we also show that there exists a similar trade-off to the one for threshold circuits. For example, the following proposition can be obtained for discretized sigmoid circuits:
Therefore, artificial neural networks obtained through machine learning may face a challenge in acquiring good internal representations: when hardware constraints are imposed on the number of neurons and the weight resolution and the depth is fixed in advance, the learning algorithm may force the resulting network to have sparse activity. Further, because depth and energy enter the bound as linear factors while size enters only logarithmically, increasing the depth by a factor of $k$ is comparable to raising the size to the $k$th power. Thus, increasing the depth could significantly increase the expressive power of neural networks and can evidently aid them in acquiring sparse activity. Consequently, our bound may afford insight into the reason for the success of deep learning.
The remainder of the article is organized as follows. In section 2, we formally introduce the terminology and notation used in the rest of the article. In section 3, we present our main lower-bound result for threshold circuits together with a corresponding bound for discretized circuits. In section 4, we show the tightness of the lower bound. In section 5, we conclude with some remarks.
2 Preliminaries
For an integer $n$, we denote by $[n]$ the set $\{1, 2, \dots, n\}$. For a finite set $I$ of integers, we denote by $\{0,1\}^I$ the set of Boolean assignments, where each assignment $x \in \{0,1\}^I$ consists of $|I|$ elements, each of which is indexed by an element of $I$. The base of the logarithm is two unless stated otherwise. In section 2.1, we define terms concerning threshold circuits and discretized circuits. In section 2.2, we define the communication matrix together with some related terms and summarize some facts.
2.1 Circuit Model
2.1.1 Threshold Circuits
A threshold circuit $C$ is a combinatorial circuit consisting of threshold gates and is expressed by a directed acyclic graph. The nodes of in-degree 0 correspond to the input variables, and the other nodes correspond to gates. Let $G$ be the set of gates in $C$. For each gate $g \in G$, the level of $g$, denoted by $\mathrm{lev}(g)$, is defined as the length of a longest path from an input variable to $g$ in the underlying graph of $C$. For each level $\ell$, we define $G_\ell$ as the set of gates in the $\ell$th level: $G_\ell = \{ g \in G \mid \mathrm{lev}(g) = \ell \}$. We denote by $g_{\mathrm{out}}$ the unique output gate, which is a linear classifier separating the internal representations given by the gates in the lower levels (possibly together with the input variables). We say that $C$ computes a Boolean function $f$ if $C(x) = f(x)$ for every assignment $x$. Although the inputs to a gate $g$ in $C$ may be not only the input variables but also the outputs of gates in the lower levels, we write $g(x)$ for the output of $g$ on an assignment $x$ because $x$ inductively determines the output of $g$.
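To make the circuit model concrete, the following Python sketch (ours, not code from the article) evaluates a small threshold circuit gate by gate in topological order and measures its size, energy (the maximum number of gates outputting one over all input assignments, as defined in section 1), and weight by brute force. The encoding of a gate as input weights, gate weights, and a threshold, the convention that a gate fires when its weighted sum reaches the threshold, the toy circuit itself, and the reading of weight as the maximum absolute weight are illustrative assumptions of this sketch.

```python
from itertools import product

# A gate is (input_weights, gate_weights, threshold); by our convention it
# outputs 1 when its weighted sum reaches the threshold, and 0 otherwise.
def gate_output(gate, x, gate_vals):
    w_in, w_g, t = gate
    s = sum(w * v for w, v in zip(w_in, x)) + sum(w * v for w, v in zip(w_g, gate_vals))
    return 1 if s >= t else 0

def evaluate(gates, x):
    """Evaluate the gates in topological order and return all gate outputs."""
    vals = []
    for gate in gates:
        vals.append(gate_output(gate, x, vals))
    return vals

# A toy 2-input circuit computing XOR with three gates (the last one is the
# output gate): g1 = OR and g2 = AND at level 1, the output gate at level 2.
gates = [
    ([1, 1], [], 1),        # g1 fires iff x1 + x2 >= 1
    ([1, 1], [], 2),        # g2 fires iff x1 + x2 >= 2
    ([0, 0], [1, -1], 1),   # output fires iff g1 - g2 >= 1, i.e., XOR
]

n, size, depth = 2, len(gates), 2
# Energy: maximum number of gates outputting one, over all input assignments.
energy = max(sum(evaluate(gates, x)) for x in product([0, 1], repeat=n))
# Weight: maximum absolute weight in the circuit (one common convention).
weight = max(abs(w) for g in gates for w in list(g[0]) + list(g[1]))
print(f"size={size} depth={depth} energy={energy} weight={weight}")
for x in product([0, 1], repeat=n):
    print(x, "->", evaluate(gates, x)[-1])
```

On this toy circuit the brute force reports energy 2, since at most two of the three gates fire on any input assignment.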
2.1.2 Discretized Circuits
Let $\phi$ be an activation function, and let $D$ be a discretizer that maps a real number to a number representable with a given bit width. We define a discretized activation function $\hat{\phi}$ as the composition of $D$ and $\phi$, that is, $\hat{\phi}(z) = D(\phi(z))$ for any number $z$. We say that $\hat{\phi}$ has a silent range for an interval $I$ if $\hat{\phi}(z) = 0$ whenever $z \in I$ and $\hat{\phi}(z) \neq 0$ otherwise. For example, if we use the ReLU function as the activation function $\phi$, then $\hat{\phi}$ has a silent range for $(-\infty, 0]$ for any discretizer $D$. If we use the sigmoid function as the activation function $\phi$ and a linear partition as the discretizer $D$, then $\hat{\phi}$ has a silent range for an interval $(-\infty, \tau)$, where the threshold $\tau$ is an expression, involving the natural logarithm, that is determined by the bit width of $D$.
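As a concrete, purely illustrative reading of these definitions, the sketch below discretizes the ReLU and sigmoid outputs with a uniform (linear-partition) discretizer of bit width $K$ that rounds to the nearest representable value and then locates the silent ranges numerically. The bit width, the rounding convention, and the resulting boundary value are assumptions of this sketch rather than specifications taken from the article.

```python
import math

K = 4                   # assumed bit width of the discretizer
STEP = 1.0 / (2 ** K)   # uniform grid with step 1 / 2^K

def discretize(y):
    """Round y to the nearest multiple of STEP (one possible discretizer D)."""
    return round(y / STEP) * STEP

def d_relu(z):
    return discretize(max(0.0, z))

def d_sigmoid(z):
    return discretize(1.0 / (1.0 + math.exp(-z)))

# The discretized ReLU is silent on (-inf, 0] for any discretizer, because the
# ReLU itself is already zero there.
assert all(d_relu(z) == 0.0 for z in (-5.0, -1.0, -0.01, 0.0))

# With round-to-nearest, the discretized sigmoid outputs zero roughly when
# sigmoid(z) < STEP / 2, i.e., when z < -ln(2^(K + 1) - 1).
tau = -math.log(2 ** (K + 1) - 1)
assert d_sigmoid(tau - 1e-3) == 0.0 and d_sigmoid(tau + 1e-3) > 0.0
print(f"silent range of the discretized sigmoid: (-inf, {tau:.4f})")
```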
We define the weight of $C$ in terms of the bit width that may be needed to represent a potential value invoked by a single input of a gate in $C$.
2.2 Communication Matrix and Its Rank
Let $f$ be a Boolean function of $n$ variables. For a partition $(A, B)$ of the input variables, we can view $f$ as a function of the pair $(x_A, x_B)$. We define the communication matrix $M_{f,(A,B)}$ as the $2^{|A|} \times 2^{|B|}$ matrix in which each row and column is indexed by an assignment $x_A \in \{0,1\}^A$ and $x_B \in \{0,1\}^B$, respectively, and each entry is defined as $M_{f,(A,B)}[x_A, x_B] = f(x_A, x_B)$. For $R_A \subseteq \{0,1\}^A$ and $R_B \subseteq \{0,1\}^B$, we call $R_A \times R_B$ a combinatorial rectangle and say that $R_A \times R_B$ is monochromatic with respect to $f$ if $f$ is constant on $R_A \times R_B$. If a circuit $C$ computes $f$, we may write $M_{C,(A,B)}$ instead of $M_{f,(A,B)}$. Figure 1a shows an example of a communication matrix, and Figure 1b shows a monochromatic combinatorial rectangle with respect to the same function.
For and , . Thus, .
and .
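To make these definitions concrete, the following sketch (ours; it uses NumPy and is feasible only for toy values of $n$) builds the communication matrix of a Boolean function for a given partition and computes $\mathrm{rk}(f)$ by maximizing the real rank over all nontrivial partitions; the example function `ip` and the brute-force maximization are assumptions of this illustration, not part of the article.

```python
from itertools import product, combinations
import numpy as np

def comm_matrix(f, n, A):
    """Communication matrix of f for the partition (A, B) of {0, ..., n-1}."""
    B = [i for i in range(n) if i not in A]
    rows = list(product([0, 1], repeat=len(A)))
    cols = list(product([0, 1], repeat=len(B)))
    M = np.zeros((len(rows), len(cols)), dtype=int)
    for r, xa in enumerate(rows):
        for c, xb in enumerate(cols):
            x = [0] * n
            for i, v in zip(A, xa):
                x[i] = v
            for i, v in zip(B, xb):
                x[i] = v
            M[r, c] = f(x)
    return M

def rk(f, n):
    """Maximum rank over the reals, taken over all nontrivial partitions (A, B)."""
    best = 0
    for k in range(1, n):
        for A in combinations(range(n), k):
            best = max(best, np.linalg.matrix_rank(comm_matrix(f, n, list(A))))
    return best

# Inner product modulo 2 on n = 4 variables.
def ip(x):
    n = len(x)
    return sum(x[i] * x[i + n // 2] for i in range(n // 2)) % 2

print(rk(ip, 4))   # prints 3, that is, 2^(n/2) - 1 for this toy size
```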
We also use well-known facts on the rank. Let $M$ and $N$ be two matrices of the same dimensions. We denote by $M + N$ the sum of $M$ and $N$, and by $M \circ N$ the Hadamard (entrywise) product of $M$ and $N$.
For two matrices, $M$ and $N$, of the same dimensions, we have
- $\mathrm{rank}(M + N) \le \mathrm{rank}(M) + \mathrm{rank}(N)$,
- $\mathrm{rank}(M \circ N) \le \mathrm{rank}(M) \cdot \mathrm{rank}(N)$.
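These two inequalities are standard facts of linear algebra; the following quick numerical check (ours, not part of the article) verifies them on random 0/1 matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(100):
    A = rng.integers(0, 2, size=(8, 8))
    B = rng.integers(0, 2, size=(8, 8))
    rA = np.linalg.matrix_rank(A)
    rB = np.linalg.matrix_rank(B)
    # Subadditivity under the entrywise sum and submultiplicativity under the
    # Hadamard (entrywise) product.
    assert np.linalg.matrix_rank(A + B) <= rA + rB
    assert np.linalg.matrix_rank(A * B) <= rA * rB
print("rank inequalities hold on all sampled matrices")
```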
3 Trade-Off for Threshold Circuits
In this section, we provide our main results showing the relationship among the four resources and the resulting trade-off.
Let $C$ be a threshold circuit computing a Boolean function $f$ of $n$ variables. We prove the theorem by showing that, for any partition of the input variables, we can express the communication matrix $M_C$ as a sum of matrices, each of which corresponds to an internal representation that arises in $C$. Since $C$ has bounded energy, the number of internal representations is also bounded. We then show, by the inclusion-exclusion principle, that each matrix corresponding to an internal representation has bounded rank. Thus, fact 1 implies the theorem.
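One way to make the first two steps of this outline explicit is sketched below, where $P$ denotes the set of internal representations (firing patterns of the $s$ gates) that actually arise in $C$ under the given partition, and $M_p$ is the 0/1 matrix whose $(x_A, x_B)$ entry is one exactly when $C$ outputs one on $(x_A, x_B)$ and produces pattern $p$. This notation is ours, and bounding each individual $\operatorname{rank}(M_p)$ via inclusion-exclusion is the part carried out in the actual proof.

```latex
% requires amsmath
\[
  M_{C,(A,B)} \;=\; \sum_{p \in P} M_p ,
  \qquad
  \operatorname{rank}\bigl(M_{C,(A,B)}\bigr) \;\le\; \sum_{p \in P} \operatorname{rank}(M_p),
  \qquad
  |P| \;\le\; \sum_{i=0}^{e} \binom{s}{i} .
\]
```

The last inequality uses only the energy bound: on every input at most $e$ of the $s$ gates output one, so at most $\sum_{i=0}^{e} \binom{s}{i}$ distinct firing patterns can occur.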
We can also obtain a similar relationship for discretized circuits, as follows.
We establish the trade-off by showing that any discretized circuit can be simulated using a threshold circuit with a moderate increase in size, depth, energy, and weight. Theorem 5 then implies the claim. The detailed proof is included in the supplemental appendix.
4 Tightness of the Trade-Off
In this section, we show that the trade-off given in Theorem 5 is tight if the depth and energy are small.
4.1 Definitions
compose a partition of .
for every .
- For every assignment ,
We say that a set $\mathcal{G}$ of threshold gates sharing input variables is a neural set, and a neural set is selective if at most one of its gates outputs one for any input assignment. A selective neural set $\mathcal{G}$ computes a Boolean function $f$ if, for every assignment on which $f$ outputs zero, no gate in $\mathcal{G}$ outputs one, while for every assignment on which $f$ outputs one, exactly one gate in $\mathcal{G}$ outputs one. We define the size of $\mathcal{G}$ as the number of gates in $\mathcal{G}$ and the weight of $\mathcal{G}$ as the maximum absolute value of a weight appearing among its gates. Below, we assume that $\mathcal{G}$ does not contain a threshold gate computing a constant function.
Since any conjunction of literals can be computed by a threshold gate, we can obtain, by a DNF-like construction, a selective neural set of exponential size that computes any given Boolean function (see example 7 and theorem 2.3 in Uchizawa, 2014).
For any Boolean function $f$ of $n$ variables, there exists a selective neural set of size at most $2^n$ and weight one that computes $f$.
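As an illustration of the DNF-like construction behind this lemma (rendered in code by us: one threshold gate per satisfying assignment, with weight $+1$ on the variables that are one in that assignment, $-1$ on the others, and threshold equal to the number of ones), the sketch below builds a selective neural set for a small function and checks both selectivity and correctness; the specific weight scheme is our assumption, chosen so that each gate fires only on its own assignment.

```python
from itertools import product

def make_selective_set(f, n):
    """One threshold gate per satisfying assignment of f; weights are +/-1."""
    gates = []
    for a in product([0, 1], repeat=n):
        if f(a):
            weights = [1 if ai == 1 else -1 for ai in a]
            threshold = sum(a)   # reached only when the input equals a exactly
            gates.append((weights, threshold))
    return gates

def fires(gate, x):
    weights, threshold = gate
    return 1 if sum(w * xi for w, xi in zip(weights, x)) >= threshold else 0

# Example: the parity of three variables.
def parity3(x):
    return sum(x) % 2

gates = make_selective_set(parity3, 3)
for x in product([0, 1], repeat=3):
    outputs = [fires(g, x) for g in gates]
    # Selectivity and correctness: exactly one gate fires iff parity3(x) = 1,
    # and no gate fires otherwise.
    assert sum(outputs) == parity3(x)
print("selective neural set of size", len(gates), "with weights in {-1, +1}")
```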
4.2 Upper Bounds
The following lemma shows that we can construct threshold circuits of small energy for piecewise functions.
We construct the desired threshold circuit $C$ by arranging and connecting selective neural sets, so that $C$ has a simple layered structure consisting of these selective neural sets. After completing the construction of $C$, we show that $C$ computes $f$ and then evaluate its size, depth, energy, and weight.
We now show that $C$ computes $f$. By construction, the following claim is easy to verify:
If every gate at the lower levels outputs zero, the output of the gate is identical to that of its counterpart in the corresponding selective neural set, and hence the claim holds. Otherwise, there is a gate outputting one at a lower level. Since the gate receives the output of that lower-level gate, the corresponding value is added to its potential. Since the gate receives only a bounded number of other inputs, whose weights have bounded absolute values, its potential is now below its threshold.
Suppose $f(x) = 0$. In this case, for every level, the function computed by the corresponding selective neural set evaluates to zero on $x$. Therefore, claim 3 implies that no gate in $C$ outputs one.
Suppose $f(x) = 1$. In this case, there exists a level at which the function computed by the corresponding selective neural set evaluates to one on $x$, while the functions at the lower levels evaluate to zero. Since that selective neural set computes its function, claim 3 implies that there exists a gate at that level outputting one, which implies that $C(x) = 1$.
Finally, we evaluate the size, depth, energy, and weight of $C$. Since $C$ contains a bounded number of gates at each level, together with one additional gate corresponding to the output gate, the stated bound on the size follows. Because the gates of each selective neural set are placed at their own level, the level of the output gate, and hence the depth of $C$, is as stated. Claim 3 implies that if there is a gate outputting one at some level, then no gate at a higher level outputs one. In addition, since each neural set is selective, at most one of its gates outputs one on any input. Together, these observations yield the stated bound on the energy of $C$. Finally, every connection in $C$ has bounded weight, which yields the stated bound on the weight of $C$.
Clearly, is a piecewise function, and so the lemma gives our upper bound for .
For simplicity, we consider the case where is a multiple of . It suffices to show that is -piecewise and computable by a neural set of size and weight .
Suppose . There are two cases: and for every .
We can also obtain a similar proposition for .
5 Conclusion
We have proved that a threshold circuit can compute only Boolean functions whose communication matrices have rank bounded by a product of logarithmic factors of the size and weight and linear factors of the depth and energy. This bound implies a trade-off between depth and energy if we view the logarithmic terms as having negligible impact. We have also proved that a similar trade-off exists for discretized circuits, which suggests that increasing the depth linearly improves the ability of neural networks to decrease the number of neurons outputting nonzero values, subject to hardware constraints on the number of neurons and the weight resolution.
where the expectation is taken over the input assignments. They then showed that the average energy complexity can be bounded by the entropy of the internal representations. It would be interesting to know whether a trade-off exists with regard to the average energy complexity. For circuit complexity, our trade-off implies a lower bound on the energy of polynomial-size, constant-depth, and polynomial-weight threshold circuits computing Boolean functions of high rank. It would also be interesting to ask whether there exists a Boolean function that needs linear or even superlinear energy for polynomial-size threshold circuits.
Since we have simplified and ignored many aspects of neural computation, our results are not enough to fully explain the representational power of neural networks in the brain. However, circuit complexity arguments can potentially aid in devising a plausible principle behind neural computation. In addition to the three-level approach to understanding brain computation (the computational, algorithmic, and implementation levels) proposed by Marr (1982), Valiant (2014) added the requirement that such an understanding incorporate the quantitative constraints faced by the cortex. Circuit complexity arguments can provide quantitative constraints through complexity measures. Further, Maass et al. (2019) identified the difficulty of uncovering a neural algorithm employed by the brain: because its hardware can be extremely adapted to the task, the algorithm effectively vanishes into the network. Even if its precise structure, connectivity, and vast array of numerical parameters were known in the minutest detail, extracting an algorithm implemented in the network would still be difficult. A trade-off does not provide a description of an explicit neural algorithm, but it can afford insights relevant to formulating computational principles because its argument necessarily concerns every algorithm that a theoretical model of a neural network can implement.
Acknowledgments
The preliminary version of our article was presented at MFCS2023 (Uchizawa & Abe, 2023). We thank the anonymous reviewers of MFCS2023 for their careful reading and helpful comments. We also thank the anonymous reviewers of Neural Computation for constructive suggestions greatly improving the presentation and organization. This work was supported by JSPS KAKENHI grant JP22K11897.