Abstract

We analyze the joint probability distribution of the lengths of the vectors of hidden variables in different layers of a fully connected deep network, when the weights and biases are chosen randomly according to gaussian distributions. We show that if the activation function φ satisfies a minimal set of assumptions, satisfied by every activation function that we know to be used in practice, then, as the width of the network gets large, the “length process” converges in probability to a length map that is determined as a simple function of the variances of the random weights and biases and the activation function φ. We also show that this convergence may fail for φ that violate our assumptions. We show how to use this analysis to choose the variance of the weight initialization, depending on the activation function, so that hidden variables maintain a consistent scale throughout the network.

1  Introduction

The size of the weights of a deep network must be managed delicately. If they are too large, signals blow up as they travel through the network, leading to numerical problems, and if they are too small, the signals fade away. The practical state of the art in deep learning made a significant step forward due to schemes for initializing the weights that aimed in different ways at maintaining roughly the same scale for the hidden variables before and after a layer (LeCun, Bottou, Orr, & Müller, 1998; Glorot & Bengio, 2010). Later work (He, Zhang, Ren, & Sun, 2015; Poole, Lahiri, Raghu, Sohl-Dickstein, & Ganguli, 2016; Daniely, Frostig, & Singer, 2016) took into account the effect of the nonlinearities on the length dynamics of a deep network, informing initialization policies in a more refined way.

An influential theoretical analysis (Poole et al., 2016) considered whether signals tend to blow up or fade away as they propagate through a fully connected network with the same activation function φ at each hidden node. For a given input, they studied the probability distribution over the lengths of the vectors of hidden variables when the weights between nodes are chosen from a zero-mean gaussian with variance $\sigma_w^2/N$ and where the biases are chosen from a zero-mean distribution with variance $\sigma_b^2$. They argued that in a fully connected network, as the width of the network approaches infinity, the (suitably normalized) lengths of the hidden layers approach a sequence of values, one for each layer, and characterized this length map as a function of φ, $\sigma_w$, and $\sigma_b$. This analysis has since been widely used (Schoenholz, Gilmer, Ganguli, & Sohl-Dickstein, 2016; Yang & Schoenholz, 2017; Pennington, Schoenholz, & Ganguli, 2017; Lee, Bahri, Novak, Schoenholz, Pennington, & Sohl-Dickstein, 2018; Xiao, Bahri, Sohl-Dickstein, Schoenholz, & Pennington, 2018; Chen, Pennington, & Schoenholz, 2018; Pennington, Schoenholz, & Ganguli, 2018; Hayou, Doucet, & Rousseau, 2018).

Poole et al. (2016) claimed that their analysis holds for arbitrary nonlinearities φ. In contrast, we show that for $\varphi(z) = 1/z$, for arbitrarily small, positive $\sigma_w$, even if $\sigma_b = 0$, the distribution of values of each of the hidden nodes in the second layer diverges as $N$ gets large. For finite $N$, each node has a Cauchy distribution, which already has infinite variance, and as $N$ gets large, the scale parameter of the Cauchy distribution gets larger, leading to divergence. We also show that the hidden variables in the second layer may not be independent, even for commonly used φ like the ReLU, contradicting a claim that is part of the analysis of Poole et al. (2016).

These observations, together with the wide use of the length map from Poole et al. (2016), motivate the search for a new analysis. This letter provides such an analysis for activation functions φ that satisfy the following properties: (1) the restriction of φ to any finite interval is bounded, (2) $|\varphi(z)| = \exp(o(z^2))$ as $|z|$ gets large, and (3) φ is measurable (see note 1). We refer to such φ as permissible. Note that conditions (1) and (3) hold for any nondecreasing φ.

We show that for all permissible φ and all σw and σb, as N gets large, the length process converges in probability to the length map described in Poole et al. (2016).

Section 5 describes some simulation experiments verifying some of the findings of the letter and illustrating the dependence among the values of the hidden nodes.

Section 6 describes one way to use our analysis to choose the variance of the weights depending on the activation function so that signals neither blow up nor vanish as computation flows through a wide and deep network.

Our analysis of the convergence of the length map borrows ideas from Daniely et al. (2016), who studied the properties of the mapping from inputs to hidden representations resulting from random gaussian initialization. Their theory applies in the case of activation functions with certain smoothness properties and to a wide variety of architectures. Informally, they showed that after random initialization, for wide networks, it is likely that the kernel associated with a feature map computed by the network closely approximates a fixed kernel. Our analysis treats a wider variety of values of $\sigma_w$ and $\sigma_b$ and uses weaker assumptions on φ. Motivated by Bayesian goals, as in the work of Neal (1996), Matthews, Rowland, Hron, Turner, and Ghahramani (2018) performed an analysis in a related setting, characterizing the distribution of kernels arising from a random initialization. Their analysis used a “linear envelope” condition on φ that is stronger than the assumption used here. Alternative but related uses of theory to guide the choice of weight variances may be found in Schoenholz et al. (2016) and Pennington et al. (2017). Hanin (2018) studied the effect of the widths of layers and the depth of a fully connected network on the size of the input-output Jacobian in the case of ReLU activations.

2  Preliminaries

2.1  Notation

For $n \in \mathbb{N}$, we use $[n]$ to denote the set $\{1, 2, \ldots, n\}$. If $T$ is an $n \times m \times p$ tensor, then for $i \in [n]$, let $T_{i,:,:}$ be the matrix $A$ such that $A_{j,k} = T_{i,j,k}$, and define $T_{i,j,:}$, and so on, analogously.

2.2  The Finite Case

Consider a deep, fully connected width-$N$ network with $D$ layers. Let $W \in \mathbb{R}^{D \times N \times N}$ and $b \in \mathbb{R}^{D \times N}$. An activation function φ maps $\mathbb{R}$ to $\mathbb{R}$; we will also use φ to denote the function from $\mathbb{R}^N$ to $\mathbb{R}^N$ obtained by applying φ componentwise. Computation of the neural activity vectors $x_{0,:}, \ldots, x_{D,:} \in \mathbb{R}^N$ and preactivations $h_{1,:}, \ldots, h_{D,:} \in \mathbb{R}^N$ proceeds in the standard way as follows:
$$h_{\ell,:} = W_{\ell,:,:}\, x_{\ell-1,:} + b_{\ell,:}, \qquad x_{\ell,:} = \varphi(h_{\ell,:}), \qquad \text{for } \ell = 1, \ldots, D.$$

We study the process arising from fixing an arbitrary input $x_{0,:} \in \mathbb{R}^N$ and choosing the parameters independently at random. The entries of $W$ are sampled from $\mathrm{Gauss}(0, \sigma_w^2/N)$ and the entries of $b$ from $\mathrm{Gauss}(0, \sigma_b^2)$. For each $\ell \in \{1, \ldots, D\}$, define $q_\ell = \frac{1}{N} \sum_{i=1}^N h_{\ell,i}^2$.
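For concreteness, the following NumPy sketch simulates the finite-width process just described and records the length process $q_1, \ldots, q_D$; the function name and the use of NumPy are implementation choices for illustration only, not part of the analysis.

```python
import numpy as np

def length_process(x0, D, sigma_w, sigma_b, phi, rng=None):
    """Sample a random width-N, depth-D fully connected network and return
    the length process q_1, ..., q_D for the input x0."""
    rng = np.random.default_rng() if rng is None else rng
    N = x0.shape[0]
    x, qs = x0, []
    for _ in range(D):
        W = rng.normal(0.0, sigma_w / np.sqrt(N), size=(N, N))  # entries have variance sigma_w^2 / N
        b = rng.normal(0.0, sigma_b, size=N)                    # entries have variance sigma_b^2
        h = W @ x + b
        qs.append(float(np.mean(h ** 2)))                       # q_l = (1/N) sum_i h_{l,i}^2
        x = phi(h)
    return qs

# Example: ReLU with sigma_w^2 = 2, sigma_b = 0, and x_0 = (1, ..., 1);
# every q_l concentrates near 2 when N is large.
print(length_process(np.ones(1000), D=5, sigma_w=np.sqrt(2.0), sigma_b=0.0,
                     phi=lambda z: np.maximum(z, 0.0)))
```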

Note that for all $\ell \geq 1$, all the components of $h_{\ell,:}$ and $x_{\ell,:}$ are identically distributed.

2.3  The Wide-Network Limit

For the purpose of defining a limit, assume that, for a fixed, arbitrary function $\chi : \mathbb{N} \to \mathbb{R}$ and finite $N$, we have $x_{0,:} = (\chi(1), \ldots, \chi(N))$. We also assume that
$$\lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^N \chi(i)^2$$
exists and is nonzero. For $\ell > 0$, let $\underline{x}_\ell$ be a random variable whose distribution is the limit of the distribution of $x_{\ell,1}$ as $N$ goes to infinity, if this limit exists (in the sense of convergence in distribution). Define $\underline{h}_\ell$ and $\underline{q}_\ell$ similarly.

2.4  Total Variation Distance

If $P$ and $Q$ are probability distributions, then $d_{TV}(P, Q) = \sup_E |P(E) - Q(E)|$, and if $p$ and $q$ are their densities, $d_{TV}(P, Q) = \frac{1}{2} \int |p(x) - q(x)|\, dx$.
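As an illustration (ours, for exposition only), the second formula can be evaluated numerically on a grid for two zero-mean gaussians:

```python
import numpy as np

def tv_gauss(sigma1, sigma2, half_width=20.0, n=200001):
    """Approximate d_TV(Gauss(0, sigma1^2), Gauss(0, sigma2^2)) as (1/2) int |p - q|."""
    z, dz = np.linspace(-half_width, half_width, n, retstep=True)
    p = np.exp(-z**2 / (2 * sigma1**2)) / (sigma1 * np.sqrt(2 * np.pi))
    q = np.exp(-z**2 / (2 * sigma2**2)) / (sigma2 * np.sqrt(2 * np.pi))
    return 0.5 * np.sum(np.abs(p - q)) * dz

print(tv_gauss(1.0, 1.1))  # about 0.05 for these parameters
```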

2.5  Permissible Activation Functions

Definition 1.

An activation function φ is permissible if

  • The restriction of φ to any finite interval is bounded.

  • $|\varphi(x)| = \exp(o(x^2))$ as $|x|$ gets large (see note 2).

  • φ is measurable.

Conditions (2) and (3) ensure that a key integral can be computed. The proof of lemma 1 is in appendix A.

Lemma 1.

If φ is permissible, then for all positive constants $c$, the function $g$ defined by $g(x) = \varphi(cx)^2 \exp(-x^2/2)$ is integrable.

2.6  Length Map

Next we recall the definition of a length map from Poole et al. (2016). We will prove that the length process converges to this length map. Define $\tilde{q}_1, \ldots, \tilde{q}_D$ and $\tilde{r}_0, \ldots, \tilde{r}_D$ recursively as follows. First, $\tilde{r}_0 = \lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^N x_{0,i}^2$. Then, for $\ell > 0$,
$$\tilde{q}_\ell = \sigma_w^2\, \tilde{r}_{\ell-1} + \sigma_b^2$$
and
$$\tilde{r}_\ell = \mathbb{E}_{z \sim \mathrm{Gauss}(0,1)}\!\left[\varphi(\sqrt{\tilde{q}_\ell}\, z)^2\right].$$
If φ is permissible, then since $\varphi(cz)^2 \exp(-z^2/2)$ is integrable for all $c$, we have that $\tilde{q}_1, \ldots, \tilde{q}_D, \tilde{r}_0, \ldots, \tilde{r}_D$ are well-defined finite real numbers.
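The recursion is easy to evaluate numerically. The sketch below is for illustration only; Gauss-Hermite quadrature is one convenient way (an implementation choice, not part of the analysis) to estimate $\mathbb{E}_{z \sim \mathrm{Gauss}(0,1)}[\varphi(\sqrt{q}\, z)^2]$.

```python
import numpy as np

def length_map(r0, D, sigma_w, sigma_b, phi, n_nodes=100):
    """Compute q~_1,...,q~_D and r~_1,...,r~_D, estimating
    E_{z ~ Gauss(0,1)}[phi(sqrt(q) z)^2] by Gauss-Hermite quadrature."""
    t, w = np.polynomial.hermite.hermgauss(n_nodes)   # nodes/weights for the weight exp(-t^2)
    z, w = np.sqrt(2.0) * t, w / np.sqrt(np.pi)       # substitute z = sqrt(2) t for the standard normal
    r, qs, rs = r0, [], []
    for _ in range(D):
        q = sigma_w**2 * r + sigma_b**2
        r = float(np.sum(w * phi(np.sqrt(q) * z) ** 2))
        qs.append(q)
        rs.append(r)
    return qs, rs

# ReLU with sigma_w = 1, sigma_b = 0, r~_0 = 1: q~_l = 2^(1-l), so the lengths fade.
qs, _ = length_map(1.0, 5, sigma_w=1.0, sigma_b=0.0, phi=lambda z: np.maximum(z, 0.0))
print(qs)   # approximately [1.0, 0.5, 0.25, 0.125, 0.0625]
```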

3  Some Surprising Behaviors

In this section, we show that for some activation functions, the probability distribution of hidden nodes can have some surprising properties.

3.1  Failure to Converge

We show that the probability distribution of the hidden variables may not converge. Our proof refers to the Cauchy distribution.

Definition 2.

A distribution over the reals that, for $x_0 \in \mathbb{R}$ and $\gamma > 0$, has a density $f$ given by $f(x) = \frac{1}{\pi \gamma \left(1 + \left(\frac{x - x_0}{\gamma}\right)^2\right)}$ is a Cauchy distribution, denoted by $\mathrm{Cauchy}(x_0, \gamma)$. $\mathrm{Cauchy}(0, 1)$ is the standard Cauchy distribution.

Lemma 2

(Hazewinkel, 2013). If $X_1, \ldots, X_n$ are independent and identically distributed (i.i.d.) random variables with a Cauchy distribution, then $\frac{1}{n} \sum_{i=1}^n X_i$ has the same distribution.

Lemma 3

(Lupton, 1993). If U and V are zero-mean normally distributed random variables with the same variance, then U/V has the standard Cauchy distribution.
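A quick simulation (for illustration only) of lemma 3: the sample quartiles of the ratio of two independent standard normals are close to $\pm 1$, the quartiles of the standard Cauchy distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
u, v = rng.standard_normal((2, 1_000_000))
ratio = u / v
# The standard Cauchy CDF is 1/2 + arctan(x)/pi, so its quartiles are -1 and 1.
print(np.quantile(ratio, [0.25, 0.75]))   # approximately [-1, 1]
```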

The following shows that there is a φ such that the limiting $\underline{h}_2$ is not defined. It contradicts a claim made on line 7 of section A.1 of Poole et al. (2016).

Proposition 1.

For any input function χ with range $\{-1, 1\}$, there is an activation function φ such that for every $\sigma_w > 0$, if $\sigma_b = 0$, then (a) for finite $N$, $h_{2,1}$ has infinite variance, and (b) the distribution of $h_{2,1}$ diverges as $N$ goes to infinity.

Proof.
Consider φ defined by
$$\varphi(y) = \begin{cases} 1/y & \text{if } y \neq 0 \\ 0 & \text{if } y = 0. \end{cases}$$
Fix a value of $N$ and $\sigma_w > 0$, and take $\sigma_b = 0$. Each component of $h_{1,:}$ is a sum of zero-mean gaussians with variance $\sigma_w^2/N$. Thus, for all $i$, $h_{1,i} \sim \mathrm{Gauss}(0, \sigma_w^2)$. Now, almost surely,
$$h_{2,1} = \sum_{j=1}^N W_{2,1,j}\, \varphi(h_{1,j}) = \sum_{j=1}^N W_{2,1,j} / h_{1,j}.$$
By lemma 3, for each $j$, $W_{2,1,j}/h_{1,j}$ has a Cauchy distribution, and since
$$N W_{2,1,1}, \ldots, N W_{2,1,N} \sim \mathrm{Gauss}(0, N \sigma_w^2),$$
recalling that $h_{1,1}, \ldots, h_{1,N} \sim \mathrm{Gauss}(0, \sigma_w^2)$, we have that
$$N W_{2,1,1}/h_{1,1}, \ldots, N W_{2,1,N}/h_{1,N}$$
are i.i.d. $\mathrm{Cauchy}(0, \sqrt{N})$ (indeed, $\sqrt{N}\, W_{2,1,j}$ has the same variance as $h_{1,j}$, so $\sqrt{N}\, W_{2,1,j}/h_{1,j}$ is standard Cauchy by lemma 3, and multiplying by $\sqrt{N}$ scales this to $\mathrm{Cauchy}(0, \sqrt{N})$). Applying lemma 2,
$$h_{2,1} = \sum_{j=1}^N W_{2,1,j}\, \varphi(h_{1,j}) = \frac{1}{N} \sum_{j=1}^N N W_{2,1,j}\, \varphi(h_{1,j})$$
is also $\mathrm{Cauchy}(0, \sqrt{N})$.
So, for all $N$, $h_{2,1}$ is $\mathrm{Cauchy}(0, \sqrt{N})$. Suppose that $h_{2,1}$ converged in distribution to some distribution $P$. Since the CDF of $P$ can have at most countably many discontinuities, we can cover the real line by a countable set of finite-length intervals $[a_1, b_1], [a_2, b_2], \ldots$ whose end points are points of continuity for $P$. Since $\mathrm{Cauchy}(0, \sqrt{N})$ converges to $P$ in distribution, for any $i$,
$$P([a_i, b_i]) \leq \lim_{N \to \infty} \frac{|b_i - a_i|}{\pi \sqrt{N}} = 0.$$
Thus, the probability assigned by $P$ to the entire real line is 0, a contradiction.

3.2  Independence

The following contradicts a claim made on line 8 of section A.1 of Poole et al. (2016).

Theorem 1.

If φ is either the ReLU or the Heaviside function, then for every $\sigma_w > 0$, $\sigma_b \geq 0$, and $N \geq 2$, the components $(h_{2,1}, \ldots, h_{2,N})$ are not independent.

Proof.

We will show that $\mathbb{E}[h_{2,1}^2 h_{2,2}^2] \neq \mathbb{E}[h_{2,1}^2]\, \mathbb{E}[h_{2,2}^2]$, which will imply that $h_{2,1}$ and $h_{2,2}$ are not independent.

Because each component of $h_{1,:}$ is the dot product of $x_{0,:}$ with an independent row of $W_{1,:,:}$ plus an independent component of $b_{1,:}$, the components of $h_{1,:}$ are independent, and since $x_{1,:} = \varphi(h_{1,:})$, this implies that the components of $x_{1,:}$ are independent. Since each row of $W_{1,:,:}$ and each component of the bias vector has the same distribution, the components of $x_{1,:}$ are i.i.d.

We have
We have
$$\mathbb{E}[h_{2,1}^2] = \mathbb{E}\!\left[\Big(\sum_{i \in [N]} W_{2,1,i}\, x_{1,i} + b_{2,1}\Big)^2\right] = \sum_{(i,j) \in [N]^2} \mathbb{E}\!\left[W_{2,1,i} W_{2,1,j}\, x_{1,i} x_{1,j}\right] + 2 \sum_{i \in [N]} \mathbb{E}\!\left[W_{2,1,i}\, x_{1,i}\, b_{2,1}\right] + \mathbb{E}\!\left[b_{2,1}^2\right].$$
The components of $W_{2,:,:}$ and $x_{1,:}$, along with $b_{2,1}$, are mutually independent, so terms in the double sum with $i \neq j$ have zero expectation, and $\mathbb{E}[h_{2,1}^2] = \sum_{i \in [N]} \mathbb{E}[W_{2,1,i}^2]\, \mathbb{E}[x_{1,i}^2] + \mathbb{E}[b_{2,1}^2]$. For a random variable $x$ with the same distribution as the components of $x_{1,:}$, this implies
$$\mathbb{E}[h_{2,1}^2] = \sigma_w^2\, \mathbb{E}[x^2] + \sigma_b^2.$$
(3.1)
Similarly,
$$\begin{aligned}
\mathbb{E}[h_{2,1}^2 h_{2,2}^2]
&= \mathbb{E}\!\left[\Big(\sum_{i \in [N]} W_{2,1,i}\, x_{1,i} + b_{2,1}\Big)^2 \Big(\sum_{i \in [N]} W_{2,2,i}\, x_{1,i} + b_{2,2}\Big)^2\right] \\
&= \sum_{(i,j,r,s) \in [N]^4} \mathbb{E}[W_{2,1,i} W_{2,1,j} W_{2,2,r} W_{2,2,s}\, x_{1,i} x_{1,j} x_{1,r} x_{1,s}]
 + 2 \sum_{(i,j,r) \in [N]^3} \mathbb{E}[W_{2,1,i} W_{2,1,j} W_{2,2,r}\, x_{1,i} x_{1,j} x_{1,r}\, b_{2,2}] \\
&\quad + 2 \sum_{(i,r,s) \in [N]^3} \mathbb{E}[W_{2,1,i} W_{2,2,r} W_{2,2,s}\, x_{1,i} x_{1,r} x_{1,s}\, b_{2,1}]
 + 4 \sum_{(i,r) \in [N]^2} \mathbb{E}[W_{2,1,i} W_{2,2,r}\, x_{1,i} x_{1,r}\, b_{2,1} b_{2,2}] \\
&\quad + \sum_{(i,j) \in [N]^2} \mathbb{E}[W_{2,1,i} W_{2,1,j}\, x_{1,i} x_{1,j}\, b_{2,2}^2]
 + \sum_{(r,s) \in [N]^2} \mathbb{E}[W_{2,2,r} W_{2,2,s}\, x_{1,r} x_{1,s}\, b_{2,1}^2]
 + 2 \sum_{i \in [N]} \mathbb{E}[W_{2,1,i}\, x_{1,i}\, b_{2,1} b_{2,2}^2]
 + 2 \sum_{r \in [N]} \mathbb{E}[W_{2,2,r}\, x_{1,r}\, b_{2,1}^2 b_{2,2}]
 + \mathbb{E}[b_{2,1}^2 b_{2,2}^2] \\
&= \sum_{\substack{(i,r) \in [N]^2 \\ i \neq r}} \mathbb{E}[W_{2,1,i}^2 W_{2,2,r}^2]\, \mathbb{E}[x_{1,i}^2]\, \mathbb{E}[x_{1,r}^2]
 + \sum_{i \in [N]} \mathbb{E}[W_{2,1,i}^2 W_{2,2,i}^2]\, \mathbb{E}[x_{1,i}^4]
 + \sum_{i \in [N]} \mathbb{E}[W_{2,1,i}^2]\, \mathbb{E}[x_{1,i}^2]\, \mathbb{E}[b_{2,2}^2]
 + \sum_{r \in [N]} \mathbb{E}[W_{2,2,r}^2]\, \mathbb{E}[x_{1,r}^2]\, \mathbb{E}[b_{2,1}^2]
 + \mathbb{E}[b_{2,1}^2 b_{2,2}^2] \\
&= \frac{(N^2 - N)\, \sigma_w^4\, \mathbb{E}[x^2]^2}{N^2} + \frac{N \sigma_w^4\, \mathbb{E}[x^4]}{N^2} + \frac{2 N \sigma_w^2\, \mathbb{E}[x^2]\, \sigma_b^2}{N} + \sigma_b^4 \\
&= \sigma_w^4\, \mathbb{E}[x^2]^2 + \frac{\sigma_w^4 \left(\mathbb{E}[x^4] - \mathbb{E}[x^2]^2\right)}{N} + 2 \sigma_w^2 \sigma_b^2\, \mathbb{E}[x^2] + \sigma_b^4.
\end{aligned}$$
Putting this together with equation 3.1, we have
$$\mathbb{E}[h_{2,1}^2 h_{2,2}^2] - \mathbb{E}[h_{2,1}^2]\, \mathbb{E}[h_{2,2}^2] = \frac{\sigma_w^4 \left(\mathbb{E}[x^4] - \mathbb{E}[x^2]^2\right)}{N}.$$
(3.2)

Now, we calculate the difference using equation 3.2 for the Heaviside and ReLU functions.

Suppose φ is the Heaviside function, that is, $\varphi(z)$ is the indicator function of $z > 0$. In this case, since the components of $h_{1,:}$ are symmetric about 0, the distribution of $x_{1,:}$ is uniform over $\{0, 1\}^N$. Thus, $\mathbb{E}[x^4] = \mathbb{E}[x^2] = 1/2$, and so equation 3.2 gives $\mathbb{E}[h_{2,1}^2 h_{2,2}^2] - \mathbb{E}[h_{2,1}^2]\, \mathbb{E}[h_{2,2}^2] = \frac{\sigma_w^4}{4N} \neq 0$.

Next, we consider the case that φ is the ReLU. Recalling that for all $i$, $h_{1,i} \sim \mathrm{Gauss}(0, \sigma_w^2)$, we have $\mathbb{E}[x^2] = \frac{1}{\sqrt{2\pi \sigma_w^2}} \int_0^{\infty} z^2 \exp\left(-\frac{z^2}{2\sigma_w^2}\right) dz$. By symmetry this is $\frac{1}{2}\, \mathbb{E}_{z \sim \mathrm{Gauss}(0, \sigma_w^2)}[z^2] = \sigma_w^2/2$. Similarly, $\mathbb{E}[x^4] = \frac{1}{2}\, \mathbb{E}_{z \sim \mathrm{Gauss}(0, \sigma_w^2)}[z^4] = \frac{3\sigma_w^4}{2}$. Plugging these into equation 3.2, we get that when φ is the ReLU,
$$\mathbb{E}[h_{2,1}^2 h_{2,2}^2] - \mathbb{E}[h_{2,1}^2]\, \mathbb{E}[h_{2,2}^2] = \frac{\sigma_w^4 \left(\frac{3}{2}\sigma_w^4 - \frac{1}{4}\sigma_w^4\right)}{N} = \frac{5 \sigma_w^8}{4N} > 0,$$
completing the proof.

Note that, informally, the degree of dependence between pairs of hidden nodes established in the proof of theorem 1 approaches 0 as $N$ gets large. On the other hand, the number of dependent pairs of hidden nodes is $\Omega(N^2)$.
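Equation 3.2 can be checked by simulation. The sketch below (for illustration only; the helper name and parameters are ours) estimates the covariance of $h_{2,1}^2$ and $h_{2,2}^2$ by Monte Carlo for the input $x_{0,:} = (1, \ldots, 1)$ and compares it with the ReLU prediction $5\sigma_w^8/(4N)$.

```python
import numpy as np

def second_layer_cov(N, sigma_w, sigma_b, phi, trials=500_000, seed=0):
    """Monte Carlo estimate of E[h_{2,1}^2 h_{2,2}^2] - E[h_{2,1}^2] E[h_{2,2}^2]
    for the input x_{0,:} = (1, ..., 1); the components of h_{1,:} are then
    independent with distribution Gauss(0, sigma_w^2 + sigma_b^2)."""
    rng = np.random.default_rng(seed)
    h1 = rng.normal(0.0, np.sqrt(sigma_w**2 + sigma_b**2), size=(trials, N))
    x1 = phi(h1)
    W2 = rng.normal(0.0, sigma_w / np.sqrt(N), size=(trials, 2, N))
    b2 = rng.normal(0.0, sigma_b, size=(trials, 2))
    h2 = np.einsum('tij,tj->ti', W2, x1) + b2      # the first two preactivations of layer 2
    s = h2 ** 2
    return float(np.mean(s[:, 0] * s[:, 1]) - np.mean(s[:, 0]) * np.mean(s[:, 1]))

# ReLU, sigma_w = 1, sigma_b = 0: equation 3.2 predicts 5 sigma_w^8 / (4N).
N = 10
est = second_layer_cov(N, sigma_w=1.0, sigma_b=0.0, phi=lambda z: np.maximum(z, 0.0))
print(est, 5 / (4 * N))   # the two numbers should be close
```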

3.3  Undefined Length Map

Here, we show, informally, that for φ at the boundary of the second condition in the definition of permissibility, the recursive formula defining the length map $\tilde{q}_\ell$ breaks down. Roughly, this condition cannot be relaxed.

Proposition 2.

For any $\alpha > 0$, if φ is defined by $\varphi(x) = \exp(\alpha x^2)$, then even if all components of all inputs are in $\{-1, 1\}$, there exist $\sigma_w, \sigma_b$ such that $\tilde{q}_\ell$ and $\tilde{r}_\ell$ are undefined for all $\ell \geq 2$.

Proof.
Suppose $\sigma_w^2 + \sigma_b^2 = \frac{1}{4\alpha}$. Then $\tilde{q}_1 = \frac{1}{4\alpha}$, so that
$$\tilde{r}_1 = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \varphi(\sqrt{\tilde{q}_1}\, z)^2 \exp\left(-\frac{z^2}{2}\right) dz = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \exp(2\alpha \tilde{q}_1 z^2) \exp\left(-\frac{z^2}{2}\right) dz = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \exp\left(\frac{z^2}{2}\right) \exp\left(-\frac{z^2}{2}\right) dz = \infty,$$
and downstream values of $\tilde{q}_\ell$ and $\tilde{r}_\ell$ are undefined.

4  Convergence in Probability

In this section, we show that the length process $q_1, \ldots, q_D$ converges in probability to the length map $\tilde{q}_1, \ldots, \tilde{q}_D$ from Poole et al. (2016).

Theorem 2.

For any permissible φ, any $\sigma_w, \sigma_b \geq 0$, any depth $D$, and any $\epsilon, \delta > 0$, there is an $N_0$ such that for all $N \geq N_0$, with probability at least $1 - \delta$, for all $\ell \in [D]$, we have $|q_\ell - \tilde{q}_\ell| \leq \epsilon$.

Before proving theorem 2, we establish some lemmas. Our proof will use the weak law of large numbers.

Lemma 4
(Feller, 2008). For any random variable $X$ with a finite expectation and any $\epsilon, \delta > 0$, there is an $N_0$ such that for all $N \geq N_0$, if $X_1, \ldots, X_N$ are i.i.d. with the same distribution as $X$, then
$$\Pr\left[\,\left|\mathbb{E}[X] - \frac{1}{N} \sum_{i=1}^N X_i\right| > \epsilon\right] \leq \delta.$$

In order to divide our analysis into cases, we need the following lemma, whose proof is in appendix B.

Lemma 5.

If φ is permissible and not zero almost everywhere (a.e.), then for all $\sigma_w > 0$ and all $\ell$, $\tilde{q}_\ell > 0$ and $\tilde{r}_\ell > 0$.

We will also need a lemma that shows that small changes in σ lead to small changes in Gauss(0,σ2).

Lemma 6
(see Klartag, 2007). There is an absolute constant C such that for all σ1,σ2>0,
$$d_{TV}\!\left(\mathrm{Gauss}(0, \sigma_1^2), \mathrm{Gauss}(0, \sigma_2^2)\right) \leq \frac{C\, |\sigma_1 - \sigma_2|}{\sigma_1}.$$

The following technical lemma, which shows that tail bounds hold uniformly over different choices of q, is proved in appendix C.

Lemma 7.

If φ is permissible, then for all $0 < r \leq s$ and all $\beta > 0$, there is an $a \geq 0$ such that for all $q \in [r, s]$, $\int_a^{\infty} \varphi(\sqrt{q}\, z)^2 \exp(-z^2/2)\, dz \leq \beta$ and $\int_{-\infty}^{-a} \varphi(\sqrt{q}\, z)^2 \exp(-z^2/2)\, dz \leq \beta$.

Armed with these lemmas, we are ready to prove theorem 2.

Proof.

First, if φ is zero a.e. or if $\sigma_w = 0$, theorem 2 follows directly from lemma 4, together with a union bound over the layers. Assume for the rest of the proof that φ is nonzero on a set of positive measure and that $\sigma_w > 0$, so that, by lemma 5, $\tilde{q}_\ell > 0$ and $\tilde{r}_\ell > 0$ for all $\ell$.

For each $\ell \in \{0, \ldots, D\}$, define $r_\ell = \frac{1}{N} \sum_{i=1}^N x_{\ell,i}^2$.

Our proof of theorem 2 is by induction. The inductive hypothesis is that for any $\epsilon, \delta > 0$ there is an $N_0$ such that if $N \geq N_0$, then, with probability at least $1 - \delta$, for all $\ell' \in \{1, \ldots, \ell\}$, $|q_{\ell'} - \tilde{q}_{\ell'}| \leq \epsilon$ and, for all $\ell' \in \{0, \ldots, \ell\}$, $|r_{\ell'} - \tilde{r}_{\ell'}| \leq \epsilon$.

The base case, where $\ell = 0$, holds because $\tilde{r}_0$ is defined to be the limit of $r_0$ as $N$ goes to infinity.

For the induction step, choose $\ell > 0$, $0 < \epsilon < \min\{\tilde{q}_\ell/4, \tilde{r}_\ell\}$, and $0 < \delta \leq 1/2$. (Note that these choices are without loss of generality.) Let $\epsilon' \in (0, \epsilon)$ take a value that will be described later, using quantities from the analysis. By the inductive hypothesis, whatever the value of $\epsilon'$, there is an $N_0'$ such that if $N \geq N_0'$, then with probability at least $1 - \delta/2$, for all $\ell' \leq \ell - 1$, we have $|q_{\ell'} - \tilde{q}_{\ell'}| \leq \epsilon'$ and $|r_{\ell'} - \tilde{r}_{\ell'}| \leq \epsilon'$. Thus, to establish the inductive step, it suffices to show that, after conditioning on the random choices before the $\ell$th layer, if $|r_{\ell-1} - \tilde{r}_{\ell-1}| \leq \epsilon'$, there is an $N_\ell$ such that if $N \geq N_\ell$, then with probability at least $1 - \delta/2$, with respect only to the random choices of $W_{\ell,:,:}$ and $b_{\ell,:}$, we have $|q_\ell - \tilde{q}_\ell| \leq \epsilon$ and $|r_\ell - \tilde{r}_\ell| \leq \epsilon$. Given such an $N_\ell$, the inductive step can be satisfied by letting $N_0$ be the maximum of $N_0'$ and $N_\ell$.

Let us do that. To simplify the notation, for the rest of the proof of the inductive step, let us condition on the outcomes of the layers before layer $\ell$; all expectations and probabilities will concern the randomness only in the $\ell$th layer. Let us further assume that $|r_{\ell-1} - \tilde{r}_{\ell-1}| \leq \epsilon'$.

Recall that $q_\ell = \frac{1}{N} \sum_{i=1}^N h_{\ell,i}^2$. Since the values of $h_{\ell-1,1}, \ldots, h_{\ell-1,N}$ have been fixed by conditioning, each component $h_{\ell,i}$ is obtained by taking the dot product of $x_{\ell-1,:} = \varphi(h_{\ell-1,:})$ with $W_{\ell,i,:}$ and adding an independent $b_{\ell,i}$. Thus, conditioned on $h_{\ell-1,1}, \ldots, h_{\ell-1,N}$, we have that $h_{\ell,1}, \ldots, h_{\ell,N}$ are independent. Also, since $x_{\ell-1,:}$ is fixed by conditioning, the $h_{\ell,i}$ have an identical gaussian distribution.

Since each component of $W$ and $b$ has zero mean, each $h_{\ell,i}$ has zero mean.

Choose an arbitrary $i \in [N]$. Since $x_{\ell-1,:}$ is fixed by conditioning and $W_{\ell,i,1}, \ldots, W_{\ell,i,N}$ and $b_{\ell,i}$ are independent,
$$\mathbb{E}[q_\ell] = \mathbb{E}[h_{\ell,i}^2] = \sigma_b^2 + \frac{\sigma_w^2}{N} \sum_j x_{\ell-1,j}^2 = \sigma_b^2 + \sigma_w^2\, r_{\ell-1} \stackrel{\mathrm{def}}{=} \bar{q}_\ell.$$
(4.1)
We wish to emphasize that $\bar{q}_\ell$ is determined as a function of random outcomes before the $\ell$th layer, and is thus a fixed, nonrandom quantity with respect to the randomization of the $\ell$th layer. By the inductive hypothesis, we have
$$|\mathbb{E}[q_\ell] - \tilde{q}_\ell| = |\mathbb{E}[h_{\ell,i}^2] - \tilde{q}_\ell| = |\bar{q}_\ell - \tilde{q}_\ell| = \sigma_w^2\, |r_{\ell-1} - \tilde{r}_{\ell-1}| \leq \epsilon' \sigma_w^2.$$
(4.2)
The key consequence of this might be paraphrased by saying that to establish the portion of the inductive step regarding $q_\ell$, it suffices for $q_\ell$ to be close to its mean. Now, we want to prove something similar for $r_\ell$. We have
$$\mathbb{E}[r_\ell] = \frac{1}{N} \sum_{i=1}^N \mathbb{E}[x_{\ell,i}^2] = \frac{1}{N} \sum_{i=1}^N \mathbb{E}[\varphi(h_{\ell,i})^2] = \mathbb{E}[\varphi(h_{\ell,1})^2],$$
since, recalling that we have conditioned on previous layers, $h_{\ell,1}, \ldots, h_{\ell,N}$ are i.i.d. Since $h_{\ell,i} \sim \mathrm{Gauss}(0, \bar{q}_\ell)$, we have
$$\mathbb{E}[r_\ell] = \mathbb{E}_{z \sim \mathrm{Gauss}(0, \bar{q}_\ell)}[\varphi(z)^2] = \mathbb{E}_{z \sim \mathrm{Gauss}(0,1)}[\varphi(\sqrt{\bar{q}_\ell}\, z)^2] = \frac{1}{\sqrt{2\pi}} \int \varphi(\sqrt{\bar{q}_\ell}\, z)^2 \exp(-z^2/2)\, dz,$$
which gives
$$|\mathbb{E}[r_\ell] - \tilde{r}_\ell| \leq \left|\mathbb{E}_{z \sim \mathrm{Gauss}(0, \bar{q}_\ell)}[\varphi(z)^2] - \mathbb{E}_{z \sim \mathrm{Gauss}(0, \tilde{q}_\ell)}[\varphi(z)^2]\right|.$$
Since $|\bar{q}_\ell - \tilde{q}_\ell| \leq \epsilon' \sigma_w^2$, and we may choose $\epsilon'$ to ensure $\epsilon' \leq \frac{\tilde{q}_\ell}{2\sigma_w^2}$, we have $\tilde{q}_\ell/2 \leq \bar{q}_\ell \leq 2\tilde{q}_\ell$.
For $\beta > 0$ and $\kappa \in (0, 1/2)$ to be named later, by lemma 7, we can choose $a$ such that for all $q \in [\tilde{q}_\ell/2, 2\tilde{q}_\ell]$,
$$\int_{-\infty}^{-a} \varphi(\sqrt{q}\, z)^2 \exp(-z^2/2)\, dz \leq \beta/2 \qquad \text{and} \qquad \int_a^{\infty} \varphi(\sqrt{q}\, z)^2 \exp(-z^2/2)\, dz \leq \beta/2$$
and $\frac{1}{\sqrt{2\pi q}} \int_{-a}^{a} \exp\left(-\frac{z^2}{2q}\right) dz \geq 1 - \kappa$. Choose such an $a$.
We claim that $\left|\int_{-a}^{a} \varphi(\sqrt{q}\, z)^2 \exp(-z^2/2)\, dz - \int_{-\infty}^{\infty} \varphi(\sqrt{q}\, z)^2 \exp(-z^2/2)\, dz\right| \leq \beta$ for all $\tilde{q}_\ell/2 \leq q \leq 2\tilde{q}_\ell$. Choose such a $q$. We have
$$\left|\int_{-a}^{a} \varphi(\sqrt{q}\, z)^2 \exp(-z^2/2)\, dz - \int_{-\infty}^{\infty} \varphi(\sqrt{q}\, z)^2 \exp(-z^2/2)\, dz\right| = \int_{-\infty}^{-a} \varphi(\sqrt{q}\, z)^2 \exp(-z^2/2)\, dz + \int_{a}^{\infty} \varphi(\sqrt{q}\, z)^2 \exp(-z^2/2)\, dz \leq 2 \max\left\{\int_{-\infty}^{-a} \varphi(\sqrt{q}\, z)^2 \exp(-z^2/2)\, dz,\ \int_{a}^{\infty} \varphi(\sqrt{q}\, z)^2 \exp(-z^2/2)\, dz\right\} \leq \beta.$$
So now we are trying to bound
$$\left|\int_{-a}^{a} \varphi(\sqrt{\bar{q}_\ell}\, z)^2 \exp(-z^2/2)\, dz - \int_{-a}^{a} \varphi(\sqrt{\tilde{q}_\ell}\, z)^2 \exp(-z^2/2)\, dz\right|$$
using $\tilde{q}_\ell/2 \leq \bar{q}_\ell \leq 2\tilde{q}_\ell$.
Using changes of variables, we have
$$\left|\int_{-a}^{a} \varphi(\sqrt{\bar{q}_\ell}\, z)^2 \exp(-z^2/2)\, dz - \int_{-a}^{a} \varphi(\sqrt{\tilde{q}_\ell}\, z)^2 \exp(-z^2/2)\, dz\right| = \left|\frac{1}{\sqrt{\bar{q}_\ell}} \int_{-a\sqrt{\bar{q}_\ell}}^{a\sqrt{\bar{q}_\ell}} \varphi(z)^2 \exp\left(-\frac{z^2}{2\bar{q}_\ell}\right) dz - \frac{1}{\sqrt{\tilde{q}_\ell}} \int_{-a\sqrt{\tilde{q}_\ell}}^{a\sqrt{\tilde{q}_\ell}} \varphi(z)^2 \exp\left(-\frac{z^2}{2\tilde{q}_\ell}\right) dz\right|.$$
Since φ is permissible, $\varphi^2$ is bounded on $[-a\sqrt{2\tilde{q}_\ell}, a\sqrt{2\tilde{q}_\ell}]$. If $P$ is the distribution obtained by conditioning $\mathrm{Gauss}(0, \bar{q}_\ell)$ on $[-a\sqrt{\bar{q}_\ell}, a\sqrt{\bar{q}_\ell}]$, and $\tilde{P}$ by conditioning $\mathrm{Gauss}(0, \tilde{q}_\ell)$ on $[-a\sqrt{\tilde{q}_\ell}, a\sqrt{\tilde{q}_\ell}]$, then if $M = \sqrt{2\pi}\, \sup_{z \in [-a\sqrt{2\tilde{q}_\ell}, a\sqrt{2\tilde{q}_\ell}]} \varphi(z)^2$, since $\bar{q}_\ell \leq 2\tilde{q}_\ell$,
$$\left|\frac{1}{\sqrt{\bar{q}_\ell}} \int_{-a\sqrt{\bar{q}_\ell}}^{a\sqrt{\bar{q}_\ell}} \varphi(z)^2 \exp\left(-\frac{z^2}{2\bar{q}_\ell}\right) dz - \frac{1}{\sqrt{\tilde{q}_\ell}} \int_{-a\sqrt{\tilde{q}_\ell}}^{a\sqrt{\tilde{q}_\ell}} \varphi(z)^2 \exp\left(-\frac{z^2}{2\tilde{q}_\ell}\right) dz\right| \leq M\, d_{TV}(P, \tilde{P}).$$
But for $\kappa < 1/2$, conditioning on an event of probability at least $1 - \kappa$ changes a distribution by total variation distance at most $2\kappa$. Therefore, applying lemma 6 along with the fact that $|\bar{q}_\ell - \tilde{q}_\ell| \leq \epsilon' \sigma_w^2$, for the constant $C$ from lemma 6, we get
$$d_{TV}(P, \tilde{P}) \leq 4\kappa + d_{TV}\!\left(\mathrm{Gauss}(0, \bar{q}_\ell), \mathrm{Gauss}(0, \tilde{q}_\ell)\right) \leq 4\kappa + \frac{C\, \big|\sqrt{\bar{q}_\ell} - \sqrt{\tilde{q}_\ell}\big|}{\sqrt{\tilde{q}_\ell}} = 4\kappa + \frac{C\, |\bar{q}_\ell - \tilde{q}_\ell|}{\big(\sqrt{\bar{q}_\ell} + \sqrt{\tilde{q}_\ell}\big)\sqrt{\tilde{q}_\ell}} \leq 4\kappa + \frac{C \epsilon' \sigma_w^2}{\tilde{q}_\ell}.$$
Tracing back, we have
$$\left|\int_{-a}^{a} \varphi(\sqrt{\bar{q}_\ell}\, z)^2 \exp(-z^2/2)\, dz - \int_{-a}^{a} \varphi(\sqrt{\tilde{q}_\ell}\, z)^2 \exp(-z^2/2)\, dz\right| \leq M\left(4\kappa + \frac{C \epsilon' \sigma_w^2}{\tilde{q}_\ell}\right),$$
which implies
$$|\mathbb{E}[r_\ell] - \tilde{r}_\ell| \leq \left|\int_{-\infty}^{\infty} \varphi(\sqrt{\bar{q}_\ell}\, z)^2 \exp(-z^2/2)\, dz - \int_{-\infty}^{\infty} \varphi(\sqrt{\tilde{q}_\ell}\, z)^2 \exp(-z^2/2)\, dz\right| \leq M\left(4\kappa + \frac{C \epsilon' \sigma_w^2}{\tilde{q}_\ell}\right) + 2\beta.$$
If $\kappa = \min\left\{\frac{\epsilon}{24M}, \frac{1}{3}\right\}$, $\beta = \frac{\epsilon}{12}$, and $\epsilon' = \min\left\{\frac{\epsilon}{2}, \frac{\epsilon}{2\sigma_w^2}, \frac{\tilde{q}_\ell}{2\sigma_w^2}, \frac{\tilde{q}_\ell\, \epsilon}{6 C M \sigma_w^2}\right\}$, this implies $|\mathbb{E}[r_\ell] - \tilde{r}_\ell| \leq \epsilon/2$.

Recall that $q_\ell$ is an average of $N$ identically distributed random variables with a mean between 0 and $2\tilde{q}_\ell$ (which is therefore finite) and $r_\ell$ is an average of $N$ identically distributed random variables, each with mean between 0 and $\tilde{r}_\ell + \epsilon/2 \leq 2\tilde{r}_\ell$. Applying the weak law of large numbers (see lemma 4), there is an $N_\ell$ such that if $N \geq N_\ell$, with probability at least $1 - \delta/2$, both $|q_\ell - \mathbb{E}[q_\ell]| \leq \epsilon/2$ and $|r_\ell - \mathbb{E}[r_\ell]| \leq \epsilon/2$ hold, which in turn implies $|q_\ell - \tilde{q}_\ell| \leq \epsilon$ and $|r_\ell - \tilde{r}_\ell| \leq \epsilon$, completing the proof of the inductive step, and therefore the proof of theorem 2.
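As an informal numerical check of theorem 2 (for illustration only, reusing the `length_process` sketch from section 2.2), one can watch $q_\ell$ concentrate around $\tilde{q}_\ell$ as $N$ grows; with the ReLU, $\sigma_w^2 = 2$, $\sigma_b = 0$, and $r_0 = 1/2$, the length map is $\tilde{q}_\ell = 1$ for every layer.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)
for N in (10, 100, 1000, 10000):
    x0 = np.full(N, np.sqrt(0.5))       # r_0 = 0.5, so q~_l = 2 * 0.5 = 1 for all l
    q = length_process(x0, D=4, sigma_w=np.sqrt(2.0), sigma_b=0.0, phi=relu)
    print(N, np.round(q, 3))            # the fluctuations around 1 shrink as N grows
```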

5  Experiments

Our first experiment fixed $x_{0,:} = (1, \ldots, 1)$, $\sigma_w = 1$, $\sigma_b = 0$, and $\varphi(z) = 1/z$.

For each $N \in \{10, 100, 1000\}$, we initialized the weights 100 times and plotted the histograms of all of the values of $h_{2,:}$, along with the $\mathrm{Cauchy}(0, \sqrt{N})$ distribution from the proof of proposition 1 and $\mathrm{Gauss}(0, \sigma^2)$ for σ estimated from the data (see Figure 1). Consistent with the theory, the $\mathrm{Cauchy}(0, \sqrt{N})$ distribution fits the data well.
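This experiment is straightforward to reproduce. The sketch below is our own reconstruction for illustration (not the code used for Figure 1); it pools the values of $h_{2,:}$ over 100 initializations and checks that the empirical median of $|h_{2,i}|$ is close to $\sqrt{N}$, the median absolute value of a $\mathrm{Cauchy}(0, \sqrt{N})$ random variable.

```python
import numpy as np

def second_layer_values(N, trials=100, seed=0):
    """Pooled values of h_{2,:} for phi(y) = 1/y, x0 = (1,...,1), sigma_w = 1, sigma_b = 0."""
    rng = np.random.default_rng(seed)
    x0, out = np.ones(N), []
    for _ in range(trials):
        W1 = rng.normal(0.0, 1.0 / np.sqrt(N), size=(N, N))
        h1 = W1 @ x0
        x1 = 1.0 / h1                    # phi(y) = 1/y; h1 has no zero entries almost surely
        W2 = rng.normal(0.0, 1.0 / np.sqrt(N), size=(N, N))
        out.append(W2 @ x1)
    return np.concatenate(out)

N = 1000
h2 = second_layer_values(N)
# For X ~ Cauchy(0, gamma), the median of |X| is gamma, so we expect about sqrt(N) here.
print(np.median(np.abs(h2)), np.sqrt(N))
```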

Figure 1:

Histograms of $h_{2,:}$, pooled over 100 random initializations, for $N \in \{10, 100, 1000\}$, along with $\mathrm{Cauchy}(0, \sqrt{N})$ (shown in red) and $\mathrm{Gauss}(0, \sigma^2)$ for σ estimated from the data (shown in green). When we pool over multiple random initializations of the weights, the distribution of the activations matches the Cauchy distribution, not the gaussian.

To illustrate the fact that the values in the second hidden layer are not independent, for $N = 1000$ and the parameters otherwise as in the first experiment, we plotted histograms of the values seen in the second layer for nine random initializations of the weights in Figure 2. When some of the values in the first hidden layer have unusually small magnitude, the values in the second hidden layer tend to be large together. Note that this is consistent with theorem 2, which establishes convergence in probability for permissible φ, since the φ used in this experiment is not permissible.

Figure 2:

Histograms of $h_{2,:}$ for nine random weight initializations. Plotting activations separately for different random initializations reveals the dependence among the activations in a layer.

6  Maintaining Unit Scale

In this section, we describe one use of our analysis to guide the design of initialization variances. Our analysis shows that $q_1 \approx \sigma_w^2 r_0 + \sigma_b^2$ and
$$q_{\ell+1} \approx \sigma_w^2\, \mathbb{E}_{z \sim \mathrm{Gauss}(0,1)}\!\left[\varphi(\sqrt{q_\ell}\, z)^2\right] + \sigma_b^2.$$
If we achieve
$$1 = \sigma_w^2 r_0 + \sigma_b^2$$
and
$$1 = \sigma_w^2\, \mathbb{E}_{z \sim \mathrm{Gauss}(0,1)}[\varphi(z)^2] + \sigma_b^2,$$
this will promote $q_1 \approx 1$, $q_2 \approx 1$, $q_3 \approx 1$, and so on. Setting
$$\sigma_w^2 = 1/\mathbb{E}_{z \sim \mathrm{Gauss}(0,1)}[\varphi(z)^2], \qquad r_0 = \mathbb{E}_{z \sim \mathrm{Gauss}(0,1)}[\varphi(z)^2], \qquad \sigma_b^2 = 0$$
satisfies both. These values are collected for some common activation functions in Table 1.
Table 1:
Choices of Input Variance ($r_0$) and Weight Variance ($\sigma_w^2$) That Theory Suggests Will Promote an Invariant That the Preactivations Maintain a Constant Scale as Computation Flows through the Network.

Activation      Input Variance ($r_0$)     Weight Variance ($\sigma_w^2$)
Identity        1                          1
ReLU            1/2                        2
Heaviside       1/2                        2
Exponential     $e^2$                      $1/e^2$
Tanh            0.394                      2.53
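The entries of Table 1 follow from $r_0 = \mathbb{E}_{z \sim \mathrm{Gauss}(0,1)}[\varphi(z)^2]$ and $\sigma_w^2 = 1/r_0$, and can be recomputed with the following sketch (for illustration only; the quadrature is an implementation choice).

```python
import numpy as np

def init_constants(phi, n_nodes=100):
    """Return r_0 = E_{z ~ Gauss(0,1)}[phi(z)^2] and sigma_w^2 = 1 / r_0."""
    t, w = np.polynomial.hermite.hermgauss(n_nodes)   # even n_nodes: no node at z = 0
    z, w = np.sqrt(2.0) * t, w / np.sqrt(np.pi)
    r0 = float(np.sum(w * phi(z) ** 2))
    return r0, 1.0 / r0

activations = [("identity", lambda z: z),
               ("ReLU", lambda z: np.maximum(z, 0.0)),
               ("Heaviside", lambda z: (z > 0).astype(float)),
               ("exponential", np.exp),
               ("tanh", np.tanh)]
for name, phi in activations:
    r0, sw2 = init_constants(phi)
    print(f"{name:12s} r0 = {r0:.3f}   sigma_w^2 = {sw2:.3f}")
# identity: 1, 1; ReLU and Heaviside: 0.5, 2; exponential: e^2, e^-2; tanh: ~0.394, ~2.53
```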

7  Conclusion

We have given a rigorous analysis of the limiting value of the distribution of the lengths of the vectors of hidden nodes in a fully connected deep network and described how to choose the variance of the weights at initialization using this analysis for various commonly used activation functions. Our analysis can be easily applied to other activation functions.

As in earlier work, our analysis concerned a limit in which the input grows along with the hidden layers. This simplifies the analysis, but it appears not to be difficult to remove this assumption (see Matthews et al., 2018).

After publication of some of this work in preliminary form (Long & Sedghi, 2019), elements of its analysis were used in Novak et al. (2019).

Analysis of the length map in the case of ReLU activations was an important component of recent analyses of the convergence of deep network training (Zou, Cao, Zhou, & Gu, 2018; Allen-Zhu, Li, & Song, 2019). A nonasymptotic refinement of our analysis would be a step toward generalizing those results to more general activation functions.

Appendix A:  Proof of Lemma 1

Choose $c > 0$. Since $\limsup_{x \to \infty} \frac{\log|\varphi(x)|}{x^2} = 0$ and $\limsup_{x \to -\infty} \frac{\log|\varphi(x)|}{x^2} = 0$, we also have $\limsup_{x \to \infty} \frac{\log|\varphi(cx)|}{x^2} = 0$ and $\limsup_{x \to -\infty} \frac{\log|\varphi(cx)|}{x^2} = 0$. Thus, there is an $a$ such that for all $x \notin [-a, a]$, $\frac{\log|\varphi(cx)|}{x^2} \leq \frac{1}{8}$, which implies $\varphi(cx)^2 \leq \exp\left(\frac{x^2}{4}\right)$. Since φ is permissible, it is bounded on $[-a, a]$. Thus, we have
$$\int_{-\infty}^{\infty} \varphi(cx)^2 \exp(-x^2/2)\, dx = \int_{-\infty}^{-a} \varphi(cx)^2 \exp(-x^2/2)\, dx + \int_{-a}^{a} \varphi(cx)^2 \exp(-x^2/2)\, dx + \int_{a}^{\infty} \varphi(cx)^2 \exp(-x^2/2)\, dx \leq \int_{-\infty}^{-a} \exp(-x^2/4)\, dx + \sup_{x \in [-a,a]} \varphi(cx)^2 \int_{-a}^{a} \exp(-x^2/2)\, dx + \int_{a}^{\infty} \exp(-x^2/4)\, dx < \infty.$$

Appendix B:  Proof of Lemma 5

The proof is by induction. The base case holds since we have assumed that $\tilde{r}_0 > 0$.

To prove the inductive step, we need the following lemma.

Lemma 8.

If φ is not zero a.e., then for all $c > 0$, $\mathbb{E}_{z \sim \mathrm{Gauss}(0,1)}[\varphi(cz)^2] > 0$.

Proof.
If μ is the Lebesgue measure, since
$$\mu(\{x \in \mathbb{R} : \varphi^2(cx) > 0\}) = \lim_{n \to \infty} \mu(\{x : \varphi^2(cx) > 1/n\} \cap [-n, n]) > 0,$$
there exists $n$ such that $\mu(\{x : \varphi^2(cx) > 1/n\} \cap [-n, n]) > 0$. For such an $n$, we have
$$\mathbb{E}_{z \sim \mathrm{Gauss}(0,1)}[\varphi(cz)^2] \geq \frac{1}{n\sqrt{2\pi}}\, e^{-n^2/2}\, \mu(\{x : \varphi^2(cx) > 1/n\} \cap [-n, n]) > 0.$$

Returning to the proof of lemma 5, by the inductive hypothesis, $\tilde{r}_{\ell-1} > 0$, which, since $\sigma_w > 0$, implies $\tilde{q}_\ell > 0$. Applying lemma 8 yields $\tilde{r}_\ell > 0$.

Appendix C:  Proof of Lemma 7

Since $\limsup_{x \to \infty} \frac{\log|\varphi(x)|}{x^2} = 0$, there is a $b$ such that, for all $x \geq b$, $\frac{\log|\varphi(x)|}{x^2} \leq \frac{1}{8s}$, which implies $\varphi(x)^2 \leq \exp\left(\frac{x^2}{4s}\right)$. Now, choose $q \in [r, s]$. For $a = b/\sqrt{r}$, we then have
$$\int_a^{\infty} \varphi(\sqrt{q}\, x)^2 \exp(-x^2/2)\, dx = \frac{1}{\sqrt{q}} \int_{a\sqrt{q}}^{\infty} \varphi(z)^2 \exp\left(-\frac{z^2}{2q}\right) dz \leq \frac{1}{\sqrt{q}} \int_{a\sqrt{q}}^{\infty} \exp\left(\frac{z^2}{4s}\right) \exp\left(-\frac{z^2}{2q}\right) dz \leq \frac{1}{\sqrt{q}} \int_{a\sqrt{q}}^{\infty} \exp\left(-\frac{z^2}{4q}\right) dz \leq \frac{1}{\sqrt{q}} \int_{b}^{\infty} \exp\left(-\frac{z^2}{4q}\right) dz.$$
By increasing $b$ if necessary, we can ensure $\frac{1}{\sqrt{q}} \int_b^{\infty} \exp\left(-\frac{z^2}{4q}\right) dz \leq \beta$, which then gives $\int_a^{\infty} \varphi(\sqrt{q}\, x)^2 \exp(-x^2/2)\, dx \leq \beta$. A symmetric argument yields
$$\int_{-\infty}^{-a} \varphi(\sqrt{q}\, x)^2 \exp(-x^2/2)\, dx \leq \beta.$$

Notes

1

Here $o(z^2)$ denotes any function of $z$ that grows strictly more slowly than $z^2$, such as $z^{2-\epsilon}$ for $\epsilon > 0$.

2

This condition may be expanded as follows: $\limsup_{x \to \infty} \frac{\log|\varphi(x)|}{x^2} = 0$ and $\limsup_{x \to -\infty} \frac{\log|\varphi(x)|}{x^2} = 0$.

Acknowledgments

We thank Ben Poole, Sam Schoenholz, and Jascha Sohl-Dickstein for valuable conversations and Jascha and anonymous reviewers for their helpful comments on earlier versions of this letter.

References

Allen-Zhu, Z., Li, Y., & Song, Z. (2019). A convergence theory for deep learning via over-parameterization. In Proceedings of the International Conference on Machine Learning (pp. 242-252).
Chen, M., Pennington, J., & Schoenholz, S. S. (2018). Dynamical isometry and a mean field theory of RNNs: Gating enables signal propagation in recurrent neural networks. arXiv:1806.05394.
Daniely, A., Frostig, R., & Singer, Y. (2016). Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in neural information processing systems, 29 (pp. 2253-2261). Red Hook, NY: Curran.
Feller, W. (2008). An introduction to probability theory and its applications. Hoboken, NJ: Wiley.
Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (pp. 249-256).
Hanin, B. (2018). Which neural net architectures give rise to exploding and vanishing gradients? In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Advances in neural information processing systems, 31 (pp. 580-589). Red Hook, NY: Curran.
Hayou, S., Doucet, A., & Rousseau, J. (2018). On the selection of initialization and activation function for deep neural networks. arXiv:1805.08266.
Hazewinkel, M. (2013). Cauchy distribution. In M. Hazewinkel (Ed.), Encyclopaedia of mathematics (Vol. 6). New York: Springer Science & Business Media.
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision (pp. 1026-1034). Piscataway, NJ: IEEE.
Klartag, B. (2007). A central limit theorem for convex sets. Inventiones Mathematicae, 168(1), 91-131.
LeCun, Y. A., Bottou, L., Orr, G. B., & Müller, K. (1998). Efficient backprop. In G. Montavon, G. Orr, & K.-R. Müller (Eds.), Neural networks: Tricks of the trade. Berlin: Springer.
Lee, J., Bahri, Y., Novak, R., Schoenholz, S. S., Pennington, J., & Sohl-Dickstein, J. (2018). Deep neural networks as gaussian processes. In Proceedings of the International Conference on Learning Representations. OpenReview.
Long, P. M., & Sedghi, H. (2019). On the effect of the activation function on the distribution of hidden nodes in a deep network. arXiv:1901.02104.
Lupton, R. (1993). Statistics in theory and practice. Princeton, NJ: Princeton University Press.
Matthews, A. G. d. G., Rowland, M., Hron, J., Turner, R. E., & Ghahramani, Z. (2018). Gaussian process behaviour in wide deep neural networks. arXiv:1804.11271.
Neal, R. M. (1996). Bayesian learning for neural networks. New York: Springer Science & Business Media.
Novak, R., Xiao, L., Bahri, Y., Lee, J., Yang, G., Hron, J., … Sohl-Dickstein, J. (2019). Bayesian deep convolutional networks with many channels are gaussian processes. In Proceedings of the International Conference on Learning Representations.
Pennington, J., Schoenholz, S., & Ganguli, S. (2017). Resurrecting the sigmoid in deep learning through dynamical isometry: Theory and practice. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in neural information processing systems, 30 (pp. 4785-4795). Red Hook, NY: Curran.
Pennington, J., Schoenholz, S. S., & Ganguli, S. (2018). The emergence of spectral universality in deep networks. arXiv:1802.09979.
Poole, B., Lahiri, S., Raghu, M., Sohl-Dickstein, J., & Ganguli, S. (2016). Exponential expressivity in deep neural networks through transient chaos. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in neural information processing systems, 29 (pp. 3360-3368). Red Hook, NY: Curran.
Schoenholz, S. S., Gilmer, J., Ganguli, S., & Sohl-Dickstein, J. (2016). Deep information propagation. arXiv:1611.01232.
Xiao, L., Bahri, Y., Sohl-Dickstein, J., Schoenholz, S. S., & Pennington, J. (2018). Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks. arXiv:1806.05393.
Yang, G., & Schoenholz, S. (2017). Mean field residual networks: On the edge of chaos. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in neural information processing systems, 30 (pp. 7103-7114). Red Hook, NY: Curran.
Zou, D., Cao, Y., Zhou, D., & Gu, Q. (2018). Stochastic gradient descent optimizes over-parameterized deep ReLU networks. CoRR, abs/1811.08888.

Author notes

The authors are ordered alphabetically.