We analyze the joint probability distribution on the lengths of the vectors of hidden variables in different layers of a fully connected deep network, when the weights and biases are chosen randomly according to gaussian distributions. We show that if the activation function $\phi$ satisfies a minimal set of assumptions, satisfied by all activation functions that we know of that are used in practice, then, as the width of the network gets large, the “length process” converges in probability to a length map that is determined as a simple function of the variances of the random weights and biases and the activation function $\phi$. We also show that this convergence may fail for $\phi$ that violate our assumptions. We show how to use this analysis to choose the variance of weight initialization, depending on the activation function, so that hidden variables maintain a consistent scale throughout the network.
The size of the weights of a deep network must be managed delicately. If they are too large, signals blow up as they travel through the network, leading to numerical problems, and if they are too small, the signals fade away. The practical state of the art in deep learning made a significant step forward due to schemes for initializing the weights that aimed in different ways at maintaining roughly the same scale for the hidden variables before and after a layer (LeCun, Bottou, Orr, & Müller, 1998; Glorot & Bengio, 2010). Later work (He, Zhang, Ren, & Sun, 2015; Poole, Lahiri, Raghu, Sohl-Dickstein, & Ganguli, 2016; Daniely, Frostig, & Singer, 2016) took into account the effect of the nonlinearities on the length dynamics of a deep network, informing initialization policies in a more refined way.
An influential theoretical analysis (Poole et al., 2016) considered whether signals tend to blow up or fade away as they propagate through a fully connected network with the same activation function at each hidden node. For a given input, they studied the probability distribution over the lengths of the vectors of hidden variables when the weights between nodes are chosen from a zero-mean gaussian with variance $\sigma_w^2/N$ (where $N$ is the width of the network) and where the biases are chosen from a zero-mean distribution with variance $\sigma_b^2$. They argued that in a fully connected network, as the width $N$ of the network approaches infinity, the (suitably normalized) lengths of the hidden layers approach a sequence of values, one for each layer, and characterized this length map as a function of $\sigma_w$, $\sigma_b$, and the activation function $\phi$. This analysis has since been widely used (Schoenholz, Gilmer, Ganguli, & Sohl-Dickstein, 2016; Yang & Schoenholz, 2017; Pennington, Schoenholz, & Ganguli, 2017; Lee, Bahri, Novak, Schoenholz, Pennington, & Sohl-Dickstein, 2018; Xiao, Bahri, Sohl-Dickstein, Schoenholz, & Pennington, 2018; Chen, Pennington, & Schoenholz, 2018; Pennington, Schoenholz, & Ganguli, 2018; Hayou, Doucet, & Rousseau, 2018).
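For reference, the length map of Poole et al. (2016) evolves, for $\ell \ge 2$, according to the recursion below (written in the notation adopted throughout this letter, with $q^{\ell}$ denoting the limiting normalized squared length at layer $\ell$; the first value $q^{1}$ is determined by the input):
$$q^{\ell} = \sigma_b^2 + \sigma_w^2\, \mathbb{E}_{z \sim \mathcal{N}(0,1)}\!\left[\phi\!\left(\sqrt{q^{\ell-1}}\, z\right)^{2}\right].$$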
Poole et al. (2016) claimed that their analysis holds for arbitrary nonlinearities $\phi$. In contrast, we show that for arbitrarily small, positive $\sigma_w$, even if $\sigma_b = 0$, there is a $\phi$ for which the distribution of values of each of the hidden nodes in the second layer diverges as $N$ gets large. For finite $N$, each node has a Cauchy distribution, which already has infinite variance, and as $N$ gets large, the scale parameter of the Cauchy distribution gets larger, leading to divergence. We also show that the hidden variables in the second layer may not be independent, even for commonly used $\phi$ like the ReLU, contradicting a claim that is part of the analysis of Poole et al. (2016).
These observations, together with the wide use of the length map from Poole et al. (2016), motivate the search for a new analysis. This letter provides such an analysis for activation functions $\phi$ that satisfy the following properties: (1) the restriction of $\phi$ to any finite interval is bounded, (2) as $|x|$ gets large, $|\phi(x)| = \exp(o(x^2))$, and (3) $\phi$ is measurable.1 We refer to such $\phi$ as permissible. Note that conditions (1) and (3) hold for any nondecreasing $\phi$.
We show that for all permissible $\phi$ and all $\sigma_w$ and $\sigma_b$, as $N$ gets large, the length process converges in probability to the length map described in Poole et al. (2016).
Section 5 describes some simulation experiments verifying some of the findings of the letter and illustrating the dependence among the values of the hidden nodes.
Section 6 describes one way to use our analysis to choose the variance of the weights depending on the activation function so that signals neither blow up nor vanish as computation flows through a wide and deep network.
Our analysis of the convergence of the length map borrows ideas from Daniely et al. (2016), who studied the properties of the mapping from inputs to hidden representations resulting from random gaussian initialization. Their theory applies in the case of activation functions with certain smoothness properties and to a wide variety of architectures. Informally, they showed that after random initialization, for wide networks, it is likely that the kernel associated with a feature map computed by the network closely approximates a fixed kernel. Our analysis treats a wider variety of values of $\sigma_w$ and $\sigma_b$ and uses weaker assumptions on $\phi$. Motivated by Bayesian goals as in the work of Neal (1996), Matthews, Rowland, Hron, Turner, & Ghahramani (2018) performed an analysis in a related setting, characterizing the distribution of kernels arising from a random initialization. Their analysis used a “linear envelope” condition on $\phi$ that is stronger than the assumption used here. Alternative but related uses of theory to guide the choice of weight variances may be found in Schoenholz et al. (2016) and Pennington et al. (2017). Hanin (2018) studied the effect of the widths of layers and the depth of a fully connected network on the size of the input-output Jacobian in the case of ReLU activations.
For $n \in \mathbb{N}$, we use $[n]$ to denote the set $\{1, \ldots, n\}$. If $W$ is a tensor, then for an index $\ell$, let $W^{\ell}$ be the matrix obtained by fixing the first index of $W$ to $\ell$, and define $b^{\ell}$, and so on, analogously.
2.2 The Finite Case
We study the process arising from fixing an arbitrary input $x$ and choosing the parameters independently at random; here $h^{1} = W^{1} x + b^{1}$ and $h^{\ell} = W^{\ell} \phi(h^{\ell-1}) + b^{\ell}$ for $\ell \ge 2$. The entries of each $W^{\ell}$ are sampled from $\mathcal{N}(0, \sigma_w^2/N)$ and the entries of each $b^{\ell}$ are from $\mathcal{N}(0, \sigma_b^2)$. For each $\ell$, define $q^{\ell}_N = \frac{1}{N} \sum_{i=1}^{N} (h^{\ell}_i)^2$.
Note that for all $\ell$, all the components of $h^{\ell}$ are identically distributed, as are all the components of $\phi(h^{\ell})$.
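As a concrete illustration, here is a minimal sketch of this finite-width process (not the paper's code; the input, depth, and variances below are placeholder values):

```python
import numpy as np

def length_process(x, phi, depth, sigma_w, sigma_b, rng):
    """Simulate one draw of the normalized lengths q^1_N, ..., q^depth_N.

    Weights are N(0, sigma_w^2 / N) and biases are N(0, sigma_b^2); phi is
    applied elementwise between layers, with all layers of width N = len(x).
    """
    N = x.shape[0]
    a, qs = x, []
    for _ in range(depth):
        W = rng.normal(0.0, sigma_w / np.sqrt(N), size=(N, N))
        b = rng.normal(0.0, sigma_b, size=N)
        h = W @ a + b               # preactivations h^l
        qs.append(np.mean(h ** 2))  # q^l_N = ||h^l||^2 / N
        a = phi(h)                  # activations feeding the next layer
    return qs

rng = np.random.default_rng(0)
x = rng.normal(size=1000)           # placeholder input
print(length_process(x, np.tanh, depth=5, sigma_w=1.3, sigma_b=0.1, rng=rng))
```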
2.3 The Wide-Network Limit
2.4 Total Variation Distance
If $P$ and $Q$ are probability distributions, then $d_{TV}(P, Q) = \sup_{A} |P(A) - Q(A)|$, and if $f$ and $g$ are their densities, $d_{TV}(P, Q) = \frac{1}{2} \int |f(x) - g(x)|\, dx$.
2.5 Permissible Activation Functions
An activation function $\phi$ is permissible if
The restriction of $\phi$ to any finite interval is bounded.
$|\phi(x)| = \exp(o(x^2))$ as $|x|$ gets large.2
If $\phi$ is permissible, then for all positive constants $c$, the function $g$ defined by $g(x) = \phi(x)^2 e^{-c x^2}$ is integrable.
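To see why, note that the growth condition implies that for any $c > 0$ there is an $x_0$ such that $\phi(x)^2 \le e^{c x^2 / 2}$ whenever $|x| > x_0$ (a sketch of the argument, using the form of $g$ given above), so that
$$\int_{|x| > x_0} \phi(x)^2 e^{-c x^2}\, dx \;\le\; \int_{|x| > x_0} e^{-c x^2 / 2}\, dx \;<\; \infty,$$
while boundedness of $\phi$ on the finite interval $[-x_0, x_0]$ takes care of the rest of the integral.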
2.6 Length Map
3 Some Surprising Behaviors
In this section, we show that for some activation functions, the probability distribution of hidden nodes can have some surprising properties.
3.1 Failure to Converge
We show that the probability distribution of the hidden variables may not converge. Our proof refers to the Cauchy distribution.
A distribution over the reals that, for $x_0 \in \mathbb{R}$ and $\gamma > 0$, has a density given by $p(x) = \frac{1}{\pi \gamma \left(1 + \left(\frac{x - x_0}{\gamma}\right)^2\right)}$ is a Cauchy distribution, denoted $\mathrm{Cauchy}(x_0, \gamma)$. $\mathrm{Cauchy}(0, 1)$ is the standard Cauchy distribution.
(Hazewinkel, 2013). If $X_1, \ldots, X_n$ are independent and identically distributed (i.i.d.) random variables with a Cauchy distribution, then $\frac{1}{n} \sum_{i=1}^{n} X_i$ has the same distribution.
(Lupton, 1993). If $X$ and $Y$ are independent zero-mean normally distributed random variables with the same variance, then $X/Y$ has the standard Cauchy distribution.
The following shows that there is a $\phi$ such that the limiting distribution of the hidden nodes is not defined. It contradicts a claim made on line 7 of section A.1 of Poole et al. (2016).
For any input function with a bounded range, there is an activation function $\phi$ such that for every $\sigma_w > 0$, if $\sigma_b = 0$, then (a) for finite $N$, each hidden node in the second layer has infinite variance, and (b) the distribution of each such node diverges as $N$ goes to infinity.
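The following sketch illustrates the phenomenon with $\phi(x) = 1/x$ as the anti-concentrating activation (an illustrative choice; the construction used in the proof may differ), $\sigma_b = 0$, and the all-ones input: each second-layer preactivation is a sum of $N$ ratios of independent gaussians, and its spread grows with $N$ even though each summand is scaled by $\sigma_w / \sqrt{N}$.

```python
import numpy as np

def second_layer_node(N, sigma_w, rng, trials=5000):
    """Sample one second-layer preactivation at width N, 'trials' times, with
    phi(x) = 1/x and sigma_b = 0. With input x = (1, ..., 1), the first-layer
    preactivations are i.i.d. N(0, sigma_w^2), so we sample them directly
    rather than forming W^1 explicitly."""
    h1 = rng.normal(0.0, sigma_w, size=(trials, N))                # first-layer preactivations
    w2 = rng.normal(0.0, sigma_w / np.sqrt(N), size=(trials, N))   # one row of W^2 per trial
    return np.sum(w2 / h1, axis=1)                                 # h_1^(2) = <row of W^2, phi(h^1)>

rng = np.random.default_rng(0)
for N in [10, 100, 1000]:
    s = second_layer_node(N, sigma_w=1.0, rng=rng)
    iqr = np.subtract(*np.percentile(s, [75, 25]))  # robust spread; the variance is infinite
    print(f"N={N:5d}  interquartile range of h_1^(2) ~ {iqr:.2f}")
```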
The following contradicts a claim made on line 8 of section A.1 of Poole et al. (2016).
If $\phi$ is either the ReLU or the Heaviside function, then for every $N \ge 2$, $\sigma_w > 0$, and $\sigma_b \ge 0$, the components of $h^{2}$ are not independent.
We will show that the covariance between $(h^{2}_1)^2$ and $(h^{2}_2)^2$ is nonzero, which will imply that $h^{2}_1$ and $h^{2}_2$ are not independent.
Because each component of $h^{1}$ is the dot product of $x$ with an independent row of $W^{1}$ plus an independent component of $b^{1}$, the components of $h^{1}$ are independent, and since $\phi$ is applied componentwise, this implies that the components of $\phi(h^{1})$ are independent. Since each row of $W^{1}$ and each component of the bias vector has the same distribution, $h^{1}$ is i.i.d.
Now, we calculate the difference using equation 3.2 for the Heaviside and ReLU functions.
Suppose $\phi$ is the Heaviside function, that is, $\phi$ is the indicator function for $(0, \infty)$. In this case, since the components of $h^{1}$ are symmetric about 0, the distribution of each $\phi(h^{1}_i)$ is uniform over $\{0, 1\}$. Thus, $\mathbb{E}[\phi(h^{1}_i)^2] = \mathbb{E}[\phi(h^{1}_i)] = 1/2$, and so equation 3.2 gives
Note that, informally, the degree of dependence between pairs of hidden nodes established in the proof of theorem 7 approaches 0 as $N$ gets large. On the other hand, the number of dependent pairs of hidden nodes is $\binom{N}{2}$.
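A quick Monte Carlo check of this dependence in the ReLU case (a sketch with illustrative parameters, not the paper's experiment): the covariance between the squares of two second-layer preactivations, estimated over many independent initializations, is clearly positive at small widths.

```python
import numpy as np

def second_layer_pair(N, sigma_w, trials, rng):
    """Return 'trials' samples of (h_1^(2), h_2^(2)) for a ReLU network
    with sigma_b = 0 and input x = (1, ..., 1)."""
    # First-layer preactivations: i.i.d. N(0, sigma_w^2 * ||x||^2 / N) = N(0, sigma_w^2).
    h1 = rng.normal(0.0, sigma_w, size=(trials, N))
    a1 = np.maximum(h1, 0.0)                                       # ReLU activations
    W2 = rng.normal(0.0, sigma_w / np.sqrt(N), size=(trials, 2, N))
    return np.einsum("tkn,tn->tk", W2, a1)                         # two second-layer nodes per trial

rng = np.random.default_rng(0)
h2 = second_layer_pair(N=5, sigma_w=np.sqrt(2.0), trials=200_000, rng=rng)
sq = h2 ** 2
cov = np.mean(sq[:, 0] * sq[:, 1]) - np.mean(sq[:, 0]) * np.mean(sq[:, 1])
print(f"estimated Cov((h_1^(2))^2, (h_2^(2))^2) = {cov:.3f}")      # noticeably > 0
```

Rerunning the sketch with larger N shows the estimated covariance shrinking, consistent with the remark above.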
3.3 Undefined Length Map
Here, we show, informally, that for $\phi$ at the boundary of the second condition in the definition of permissibility, the recursive formula defining the length map breaks down. Roughly speaking, this condition cannot be relaxed.
For any $c > 0$, if $\phi$ is defined by $\phi(x) = \exp(c x^2)$, then even if all components of all inputs lie in a fixed bounded interval, there exists a $\sigma_w$ such that $q^{\ell}$ is undefined for all $\ell \ge 2$.
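A short calculation (using the recursion defining the length map and the boundary activation $\phi(x) = \exp(c x^2)$ above) shows where the breakdown occurs. The gaussian expectation appearing in the recursion, written with $z \sim \mathcal{N}(0, q)$ rather than the unit-variance form, is
$$\mathbb{E}_{z \sim \mathcal{N}(0, q)}\!\left[\phi(z)^2\right] = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2 \pi q}} \exp\!\left(2 c z^2 - \frac{z^2}{2 q}\right) dz,$$
which is infinite as soon as $2c \ge \frac{1}{2q}$, that is, as soon as $q \ge \frac{1}{4c}$. Choosing $\sigma_w$ large enough that the previous layer's length reaches this threshold leaves the next value of the length map undefined.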
4 Convergence in Probability
In this section, we show that the length process converges in probability to the length map from Poole et al. (2016).
For any permissible $\phi$, any $\sigma_w$ and $\sigma_b$, any depth $D$, and any $\epsilon, \delta > 0$, there is an $N_0$ such that for all $N \ge N_0$, with probability at least $1 - \delta$, for all $\ell \in [D]$, we have $|q^{\ell}_N - q^{\ell}| \le \epsilon$.
Before proving theorem 9, we establish some lemmas. Our proof will use the weak law of large numbers.
In order to divide our analysis into cases, we need the following lemma, whose proof is in appendix B.
If $\phi$ is permissible and not zero almost everywhere (a.e.), then for all $\sigma_w > 0$ and $\sigma_b \ge 0$, for all $\ell$, $q^{\ell} > 0$ and $q^{\ell} < \infty$.
We will also need a lemma that shows that small changes in $q^{\ell-1}$ lead to small changes in $q^{\ell}$.
The following technical lemma, which shows that tail bounds hold uniformly over different choices of $q$, is proved in appendix C.
If is permissible, for all , for all , there is an such that for all , and
Armed with these lemmas, we are ready to prove theorem 9.
First, if $\phi$ is zero a.e. or if $\sigma_w = 0$, theorem 9 follows directly from lemma 10, together with a union bound over the layers. Assume for the rest of the proof that $\phi$ is nonzero on a set of positive measure and that $\sigma_w > 0$, so that $q^{\ell} > 0$ and $q^{\ell} < \infty$ for all $\ell$.
For each , define
Our proof of theorem 9 is by induction. The inductive hypothesis is that for any there is an such that if , then, with probability , for all , and, for all , and .
The base case holds because the first value of the length map is defined to be the limit of the corresponding normalized squared length as $N$ goes to infinity.
For the induction step, choose , and . (Note that these choices are without loss of generality.) Let take a value that will be described later, using quantities from the analysis. By the inductive hypothesis, whatever the value of , there is an such that if , then with probability , for all , we have and . Thus, to establish the inductive step, it suffices to show that after conditioning on the random choices before the th layer, if , there is an such that if , then with probability at least with respect only to the random choices of and , that and . Given such an , the inductive step can be satisfied by letting be the maximum of and .
Let us do that. To simplify the notation, for the rest of the proof of the inductive step, let us condition on outcomes of the layers before layer $\ell$; all expectations and probabilities will concern the randomness only in the $\ell$th layer. Let us further assume that the bounds provided by the inductive hypothesis hold.
Recall that $h^{\ell} = W^{\ell} \phi(h^{\ell-1}) + b^{\ell}$. Since the values of $h^{\ell-1}$ have been fixed by conditioning, each component of $h^{\ell}$ is obtained by taking the dot product of $\phi(h^{\ell-1})$ with an independent row of $W^{\ell}$ and adding an independent component of $b^{\ell}$. Thus, conditioned on $h^{\ell-1}$, we have that the components of $h^{\ell}$ are independent. Also, since $h^{\ell-1}$ is fixed by conditioning, each component of $h^{\ell}$ has an identical gaussian distribution.
Since each component of $W^{\ell}$ and $b^{\ell}$ has zero mean, each component of $h^{\ell}$ has zero mean.
Recall that is an average of identically distributed random variables with a mean between 0 and (which is therefore finite) and is an average of identically distributed random variables, each with mean between 0 and . Applying the weak law of large numbers (see lemma 10), there is an such that if , with probability at least , both and hold, which in turn implies and , completing the proof of the inductive step, and therefore the proof of theorem 9.
Our first experiment fixed , , , .
For each $N$, we initialized the weights 100 times and plotted the histograms of all of the resulting values of the second-layer hidden nodes, along with the Cauchy density from the proof of proposition 1, with its scale parameter estimated from the data (see Figure 1). Consistent with the theory, the distribution fits the data well.
To illustrate the fact that the values in the second hidden layer are not independent, for a fixed $N$ and the parameters otherwise as in the other experiment, we plotted histograms of the values seen in the second layer for nine random initializations of the weights in Figure 2. When some of the values in the first hidden layer have unusually small magnitude, the values in the second hidden layer tend, coordinately, to be large. Note that this is consistent with theorem 9, which establishes convergence in probability for permissible $\phi$, since the $\phi$ used in this experiment is not permissible.
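For contrast, the following sketch (with illustrative parameters, not those of the paper's experiments) compares the empirical lengths to the length map for a permissible activation, tanh; as theorem 9 predicts, the agreement improves as $N$ grows.

```python
import numpy as np

def length_map(q_in, phi, depth, sigma_w, sigma_b, n_mc=200_000, seed=0):
    """Length map of Poole et al. (2016): q^1 = sigma_w^2 * q_in + sigma_b^2, and
    q^l = sigma_b^2 + sigma_w^2 * E[phi(sqrt(q^{l-1}) z)^2] with z ~ N(0,1) for l >= 2.
    The gaussian expectation is estimated by Monte Carlo."""
    z = np.random.default_rng(seed).normal(size=n_mc)
    q = sigma_w**2 * q_in + sigma_b**2
    qs = [q]
    for _ in range(depth - 1):
        q = sigma_b**2 + sigma_w**2 * np.mean(phi(np.sqrt(q) * z) ** 2)
        qs.append(q)
    return qs

def empirical_lengths(N, phi, depth, sigma_w, sigma_b, rng):
    """One draw of the width-N length process, with an input of squared length N."""
    x = np.ones(N)                            # so that q_in = ||x||^2 / N = 1
    a, qs = x, []
    for _ in range(depth):
        W = rng.normal(0.0, sigma_w / np.sqrt(N), size=(N, N))
        b = rng.normal(0.0, sigma_b, size=N)
        h = W @ a + b
        qs.append(np.mean(h ** 2))            # empirical q^l_N
        a = phi(h)
    return qs

rng = np.random.default_rng(1)
target = length_map(1.0, np.tanh, depth=4, sigma_w=1.5, sigma_b=0.2)
for N in [50, 500, 5000]:
    emp = empirical_lengths(N, np.tanh, depth=4, sigma_w=1.5, sigma_b=0.2, rng=rng)
    print(f"N={N:5d}  empirical {np.round(emp, 3)}   length map {np.round(target, 3)}")
```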
6 Maintaining Unit Scale
(Table: for each activation function, the input variance and the corresponding weight variance chosen to maintain that scale.)
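A sketch of how such weight variances can be computed (assuming the fixed-point criterion $\sigma_w^2\, \mathbb{E}_{z \sim \mathcal{N}(0,1)}[\phi(z)^2] + \sigma_b^2 = 1$ for unit incoming length; the particular activations listed and the Monte Carlo estimation are illustrative):

```python
import numpy as np

def unit_scale_weight_variance(phi, sigma_b=0.0, n_mc=1_000_000, seed=0):
    """Weight variance sigma_w^2 such that an incoming length of 1 is mapped back
    to 1 by the length map: sigma_w^2 * E[phi(z)^2] + sigma_b^2 = 1, z ~ N(0,1)."""
    z = np.random.default_rng(seed).normal(size=n_mc)
    second_moment = np.mean(phi(z) ** 2)       # E[phi(z)^2] under N(0,1)
    return (1.0 - sigma_b**2) / second_moment

activations = {
    "identity": lambda z: z,                   # gives sigma_w^2 = 1
    "ReLU": lambda z: np.maximum(z, 0.0),      # gives sigma_w^2 = 2
    "tanh": np.tanh,
}
for name, phi in activations.items():
    print(f"{name:8s}  sigma_w^2 ~ {unit_scale_weight_variance(phi):.3f}")
```

For the ReLU with $\sigma_b = 0$, this recovers the familiar choice $\sigma_w^2 = 2$ of He et al. (2015).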
We have given a rigorous analysis of the limiting value of the distribution of the lengths of the vectors of hidden nodes in a fully connected deep network and described how to choose the variance of the weights at initialization using this analysis for various commonly used activation functions. Our analysis can be easily applied to other activation functions.
As in earlier work, our analysis concerned a limit in which the input grows along with the hidden layers. This simplifies the analysis, but it appears not to be difficult to remove this assumption (see Matthews et al., 2018).
Analysis of the length map in the case of ReLU activations was an important component of recent analyses of the convergence of deep network training (Zou, Cao, Zhou, & Gu, 2018; Allen-Zhu, Li, & Song, 2019). A nonasymptotic refinement of our analysis would be a step toward generalizing those results to more general activation functions.
Appendix A: Proof of Lemma 2
Appendix B: Proof of Lemma 11
The proof is by induction. The base case holds since we have assumed that .
To prove the inductive step, we need the following lemma.
If $\phi$ is not zero a.e., then for all $q > 0$, $\mathbb{E}_{z \sim \mathcal{N}(0, q)}[\phi(z)^2] > 0$.
Returning to the proof of lemma 11, by the inductive hypothesis, , which, since , implies . Applying lemma 9 yields .
Appendix C: Proof of Lemma 14
Here $o(x^2)$ denotes any function of $x$ that grows strictly more slowly than $x^2$, such as $x^{c}$ for $c < 2$.
This condition may be expanded as follows: $\lim_{x \to \infty} \frac{\ln |\phi(x)|}{x^2} = 0$ and $\lim_{x \to -\infty} \frac{\ln |\phi(x)|}{x^2} = 0$.
We thank Ben Poole, Sam Schoenholz, and Jascha Sohl-Dickstein for valuable conversations and Jascha and anonymous reviewers for their helpful comments on earlier versions of this letter.
The authors are ordered alphabetically.