## Abstract

We analyze the joint probability distribution on the lengths of the vectors of hidden variables in different layers of a fully connected deep network, when the weights and biases are chosen randomly according to gaussian distributions. We show that if the activation function $\phi$ satisfies a minimal set of assumptions, satisfied by all activation functions that we know to be used in practice, then, as the width of the network gets large, the "length process" converges in probability to a length map that is determined as a simple function of the variances of the random weights and biases and the activation function $\phi$. We also show that this convergence may fail for $\phi$ that violate our assumptions. We show how to use this analysis to choose the variance of weight initialization, depending on the activation function, so that hidden variables maintain a consistent scale throughout the network.

## 1 Introduction

The size of the weights of a deep network must be managed delicately. If they are too large, signals blow up as they travel through the network, leading to numerical problems, and if they are too small, the signals fade away. The practical state of the art in deep learning made a significant step forward due to schemes for initializing the weights that aimed in different ways at maintaining roughly the same scale for the hidden variables before and after a layer (LeCun, Bottou, Orr, & Müller, 1998; Glorot & Bengio, 2010). Later work (He, Zhang, Ren, & Sun, 2015; Poole, Lahiri, Raghu, Sohl-Dickstein, & Ganguli, 2016; Daniely, Frostig, & Singer, 2016) took into account the effect of the nonlinearities on the length dynamics of a deep network, informing initialization policies in a more refined way.

An influential theoretical analysis (Poole et al., 2016) considered whether signals tend to blow up or fade away as they propagate through a fully connected network with the same activation function $\phi$ at each hidden node. For a given input, they studied the probability distribution over the lengths of the vectors of hidden variables when the weights between nodes are chosen from a zero-mean gaussian with variance $\sigma_w^2/N$ and where the biases are chosen from a zero-mean distribution with variance $\sigma_b^2$. They argued that in a fully connected network, as the width of the network approaches infinity, the (suitably normalized) lengths of the hidden layers approach a sequence of values, one for each layer, and characterized this length map as a function of $\phi$, $\sigma_w$, and $\sigma_b$. This analysis has since been widely used (Schoenholz, Gilmer, Ganguli, & Sohl-Dickstein, 2016; Yang & Schoenholz, 2017; Pennington, Schoenholz, & Ganguli, 2017; Lee, Bahri, Novak, Schoenholz, Pennington, & Sohl-Dickstein, 2018; Xiao, Bahri, Sohl-Dickstein, Schoenholz, & Pennington, 2018; Chen, Pennington, & Schoenholz, 2018; Pennington, Schoenholz, & Ganguli, 2018; Hayou, Doucet, & Rousseau, 2018).

Poole et al. (2016) claimed that their analysis holds for arbitrary nonlinearities $\phi$. In contrast, we show that for arbitrarily small, positive $\sigma_w$, even if $\sigma_b = 0$, for $\phi(z) = 1/z$, the distribution of values of each of the hidden nodes in the second layer diverges as $N$ gets large. For finite $N$, each node has a Cauchy distribution, which already has infinite variance, and as $N$ gets large, the scale parameter of the Cauchy distribution gets larger, leading to divergence. We also show that the hidden variables in the second layer may not be independent, even for commonly used $\phi$ like the ReLU, contradicting a claim that is part of the analysis of Poole et al. (2016).

These observations, together with the wide use of the length map from Poole et al. (2016), motivate the search for a new analysis. This letter provides such an analysis for activation functions $\phi$ that satisfy the following properties: (1) the restriction of $\phi$ to any finite interval is bounded, (2) as $z$ gets large, $|\phi(z)| \leq \exp(o(z^2))$, and (3) $\phi$ is measurable.^{1} We refer to such $\phi$ as permissible. Note that conditions (1) and (3) hold for any nondecreasing $\phi$.

We show that for all permissible $\phi$ and all $\sigma_w$ and $\sigma_b$, as $N$ gets large, the length process converges in probability to the length map described in Poole et al. (2016).
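To make this concrete, the length map can be computed by iterating a recursion of the form used by Poole et al. (2016), $\tilde{q}_\ell = \sigma_w^2 \tilde{r}_{\ell-1} + \sigma_b^2$ and $\tilde{r}_\ell = E_{z \sim \mathrm{Gauss}(0,1)}[\phi(\sqrt{\tilde{q}_\ell}\,z)^2]$, and compared with the empirical length process of a single randomly initialized network. The following is a minimal numerical sketch of our own (not code from the letter); the width, depth, variances, and the choice $\phi = \tanh$ are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

N, D = 2000, 5      # width and depth (arbitrary choices)
sw, sb = 1.5, 0.3   # sigma_w and sigma_b (arbitrary choices)
phi = np.tanh

# Length map: q~_l = sw^2 r~_{l-1} + sb^2, r~_l = E[phi(sqrt(q~_l) z)^2],
# with the expectation over z ~ Gauss(0, 1) taken by Monte Carlo.
z = rng.standard_normal(1_000_000)
x0 = np.ones(N)           # fixed input with r_0 = 1
r_map = np.mean(x0 ** 2)  # r~_0
q_map = []
for _ in range(D):
    q = sw ** 2 * r_map + sb ** 2
    r_map = np.mean(phi(np.sqrt(q) * z) ** 2)
    q_map.append(q)

# Length process: one random network, entries of W drawn from
# Gauss(0, sw^2 / N) and biases from Gauss(0, sb^2).
x = x0.copy()
q_emp = []
for _ in range(D):
    W = rng.normal(0.0, sw / np.sqrt(N), size=(N, N))
    b = rng.normal(0.0, sb, size=N)
    h = W @ x + b
    q_emp.append(np.mean(h ** 2))
    x = phi(h)

gap = max(abs(qe - qm) for qe, qm in zip(q_emp, q_map))
```

With $N = 2000$, the empirical $q_\ell$ typically tracks $\tilde{q}_\ell$ to within a few percent at every layer, and increasing $N$ tightens the agreement, as the convergence analysis of section 4 predicts.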

Section 5 describes some simulation experiments verifying some of the findings of the letter and illustrating the dependence among the values of the hidden nodes.

Section 6 describes one way to use our analysis to choose the variance of the weights depending on the activation function so that signals neither blow up nor vanish as computation flows through a wide and deep network.

Our analysis of the convergence of the length map borrows ideas from Daniely et al. (2016), who studied the properties of the mapping from inputs to hidden representations resulting from random gaussian initialization. Their theory applies in the case of activation functions with certain smoothness properties and to a wide variety of architectures. Informally, they showed that after random initialization, for wide networks, it is likely that the kernel associated with a feature map computed by the network closely approximates a fixed kernel. Our analysis treats a wider variety of values of $\sigma_w$ and $\sigma_b$ and uses weaker assumptions on $\phi$. Motivated by Bayesian goals as in the work of Neal (1996), Matthews, Rowland, Hron, Turner, and Ghahramani (2018) performed an analysis in a related setting, characterizing the distribution of kernels arising from a random initialization. Their analysis used a "linear envelope" condition on $\phi$ that is stronger than the assumption used here. Alternative but related uses of theory to guide the choice of weight variances may be found in Schoenholz et al. (2016) and Pennington et al. (2017). Hanin (2018) studied the effect of the widths of layers and the depth of a fully connected network on the size of the input-output Jacobian in the case of ReLU activations.

## 2 Preliminaries

### 2.1 Notation

For $n \in \mathbb{N}$, we use $[n]$ to denote the set $\{1, 2, \ldots, n\}$. If $T$ is an $n \times m \times p$ tensor, then for $i \in [n]$, let $T_{i,:,:}$ be the matrix $A$ such that $A_{j,k} = T_{i,j,k}$, and define $T_{i,j,:}$, and so on, analogously.

### 2.2 The Finite Case

We study the process arising from fixing an arbitrary input $x_{0,:} \in \mathbb{R}^N$ and choosing the parameters independently at random. The entries of $W$ are sampled from $\mathrm{Gauss}(0, \sigma_w^2/N)$ and the entries of $b$ are from $\mathrm{Gauss}(0, \sigma_b^2)$. For each $\ell \in \{0, \ldots, D\}$, define $q_\ell = \frac{1}{N} \sum_{i=1}^N h_{\ell,i}^2$.

Note that for all $\ell \geq 1$, all the components of $h_{\ell,:}$ and $x_{\ell,:}$ are identically distributed.

### 2.3 The Wide-Network Limit

### 2.4 Total Variation Distance

If $P$ and $Q$ are probability distributions, then $d_{TV}(P, Q) = \sup_E |P(E) - Q(E)|$, and if $p$ and $q$ are their densities, $d_{TV}(P, Q) = \frac{1}{2} \int |p(x) - q(x)| \, dx$.
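As a concrete check of the equivalence of the two formulas (our own illustrative example, not part of the letter's analysis), the total variation distance between $\mathrm{Gauss}(0,1)$ and $\mathrm{Gauss}(1,1)$ can be computed numerically from the density form and compared with the known closed form $2\Phi(\delta/2) - 1$ for equal-variance gaussians with mean shift $\delta$:

```python
import math
import numpy as np

# Densities of Gauss(mu, 1).
def gauss_pdf(x, mu):
    return np.exp(-(x - mu) ** 2 / 2) / math.sqrt(2 * math.pi)

# d_TV(P, Q) = (1/2) * integral of |p(x) - q(x)| dx, by a Riemann sum.
x = np.linspace(-10.0, 11.0, 200_001)
dx = x[1] - x[0]
tv_numeric = 0.5 * np.sum(np.abs(gauss_pdf(x, 0.0) - gauss_pdf(x, 1.0))) * dx

# Closed form for a unit mean shift: d_TV = 2 * Phi(1/2) - 1,
# where Phi is the standard normal c.d.f.
phi_cdf = 0.5 * (1 + math.erf(0.5 / math.sqrt(2)))
tv_closed = 2 * phi_cdf - 1
```

The two values agree to the accuracy of the discretization, and both lie strictly between 0 and 1, as a total variation distance must.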

### 2.5 Permissible Activation Functions

An activation function $\phi$ is *permissible* if:

1. The restriction of $\phi$ to any finite interval is bounded.

2. $|\phi(x)| = \exp(o(x^2))$ as $|x|$ gets large.^{2}

3. $\phi$ is measurable.

Conditions (2) and (3) ensure that a key integral can be computed. The proof of lemma 2 is in appendix A.

If $\phi$ is permissible, then for all positive constants $c$, the function $g$ defined by $g(x) = \phi(cx)^2 \exp(-x^2/2)$ is integrable.

### 2.6 Length Map

## 3 Some Surprising Behaviors

In this section, we show that for some activation functions, the probability distribution of hidden nodes can have some surprising properties.

### 3.1 Failure to Converge

We show that the probability distribution of the hidden variables may not converge. Our proof refers to the Cauchy distribution.

A distribution over the reals that, for $x_0 \in \mathbb{R}$ and $\gamma > 0$, has a density $f$ given by $f(x) = \frac{1}{\pi \gamma \left( 1 + \left( \frac{x - x_0}{\gamma} \right)^2 \right)}$ is a *Cauchy distribution*, denoted by $\mathrm{Cauchy}(x_0, \gamma)$. $\mathrm{Cauchy}(0, 1)$ is the *standard Cauchy distribution*.

(Hazewinkel, 2013). If $X_1, \ldots, X_n$ are independent and identically distributed (i.i.d.) random variables with a Cauchy distribution, then $\frac{1}{n} \sum_{i=1}^n X_i$ has the same distribution.

(Lupton, 1993). If $U$ and $V$ are independent zero-mean normally distributed random variables with the same variance, then $U/V$ has the standard Cauchy distribution.
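Both lemmas are easy to check by simulation. Below is a minimal sketch of our own (sample sizes are arbitrary), using the fact that for $X \sim \mathrm{Cauchy}(0, \gamma)$, the median of $|X|$ equals $\gamma$, so sample medians give a scale estimate that, unlike the sample mean, is stable:

```python
import numpy as np

rng = np.random.default_rng(0)
M = 200_000  # number of independent trials (arbitrary)

# Ratio of two independent standard normals: standard Cauchy,
# so the median of the absolute value should be close to 1.
u = rng.standard_normal(M)
v = rng.standard_normal(M)
med_ratio = np.median(np.abs(u / v))

# Average of n i.i.d. standard Cauchy variables: still standard Cauchy,
# so the median absolute value does not shrink as n grows.
n = 50
means = rng.standard_cauchy(size=(M, n)).mean(axis=1)
med_mean = np.median(np.abs(means))
```

Both medians come out near 1: averaging 50 Cauchy variables does not concentrate them at all, in sharp contrast to the law-of-large-numbers behavior of finite-variance averages.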

The following shows that there is a $\phi$ such that the limiting distribution of $h_{2,1}$ is not defined. It contradicts a claim made on line 7 of section A.1 of Poole et al. (2016).

For any input function $\chi$ with range $\{-1, 1\}$, there is an activation function $\phi$ such that for every $\sigma_w > 0$, if $\sigma_b = 0$, then (a) for finite $N$, $h_{2,1}$ has infinite variance, and (b) $h_{2,1}$ diverges as $N$ goes to infinity.

By lemma 5, for each $j$, $W_{2,1,j}/h_{1,j}$ has a Cauchy distribution, and since these ratios are i.i.d., by lemma 4, their sum $h_{2,1}$ also has a Cauchy distribution, with a scale parameter that grows with $N$.

### 3.2 Independence

The following contradicts a claim made on line 8 of section A.1 of Poole et al. (2016).

If $\phi$ is either the ReLU or the Heaviside function, then for every $\sigma_w > 0$, $\sigma_b \geq 0$, and $N \geq 2$, $(h_{2,1}, \ldots, h_{2,N})$ are not independent.

We will show that $E[h_{2,1}^2 h_{2,2}^2] \neq E[h_{2,1}^2] E[h_{2,2}^2]$, which will imply that $h_{2,1}$ and $h_{2,2}$ are not independent.

Because each component of $h_{1,:}$ is the dot product of $x_{0,:}$ with an independent row of $W_{1,:,:}$ plus an independent component of $b_{1,:}$, the components of $h_{1,:}$ are independent, and since $x_{1,:} = \phi(h_{1,:})$, this implies that the components of $x_{1,:}$ are independent. Since each row of $W_{1,:,:}$ and each component of the bias vector has the same distribution, $x_{1,:}$ is i.i.d.

Now, we calculate the difference using equation 3.2 for the Heaviside and ReLU functions.

Suppose $\phi$ is the Heaviside function, that is, $\phi(z)$ is the indicator function for $z > 0$. In this case, since the components of $h_{1,:}$ are symmetric about 0, the distribution of $x_{1,:}$ is uniform over $\{0, 1\}^N$. Thus, $E[x^4] = E[x^2] = 1/2$, and so equation 3.2 gives $E[h_{2,1}^2 h_{2,2}^2] - E[h_{2,1}^2] E[h_{2,2}^2] = \frac{3 \sigma_w^4}{4N} \neq 0.$
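The nonzero difference is easy to observe by Monte Carlo. The following sketch is our own illustration (the sample size and the choices $\sigma_w = 1$, $\sigma_b = 0$, $N = 2$ are arbitrary); since the components of $x_{1,:}$ are uniform over $\{0,1\}^N$ for the Heaviside activation, we can sample them directly:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, sw = 400_000, 2, 1.0  # trials, width, sigma_w (arbitrary; sigma_b = 0)

# x_{1,:} is uniform over {0,1}^N for the Heaviside activation.
x1 = rng.integers(0, 2, size=(M, N)).astype(float)

# Two independent rows of W_2, entries drawn from Gauss(0, sw^2 / N).
w1 = rng.normal(0.0, sw / np.sqrt(N), size=(M, N))
w2 = rng.normal(0.0, sw / np.sqrt(N), size=(M, N))
h21 = (w1 * x1).sum(axis=1)
h22 = (w2 * x1).sum(axis=1)

# E[h21^2 h22^2] - E[h21^2] E[h22^2]: nonzero, so h21, h22 are dependent.
diff = np.mean(h21 ** 2 * h22 ** 2) - np.mean(h21 ** 2) * np.mean(h22 ** 2)
```

The estimated difference is clearly positive and of order $1/N$, consistent with the observation below that the dependence between pairs of hidden nodes weakens as $N$ grows.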

Note that, informally, the degree of dependence between pairs of hidden nodes established in the proof of theorem 7 approaches 0 as $N$ gets large. On the other hand, the number of dependent pairs of hidden nodes is $\Omega(N^2)$.

### 3.3 Undefined Length Map

Here, we show, informally, that for $\phi$ at the boundary of the second condition in the definition of permissibility, the recursive formula defining the length map $\tilde{q}_\ell$ breaks down. Roughly, this condition cannot be relaxed.

For any $\alpha > 0$, if $\phi$ is defined by $\phi(x) = \exp(\alpha x^2)$, then even if all components of all inputs are in $\{-1, 1\}$, there exist $\sigma_w, \sigma_b$ such that $\tilde{q}_\ell, \tilde{r}_\ell$ are undefined for all $\ell \geq 2$.
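Numerically, the breakdown shows up in the integral defining the length map. The sketch below (our own illustration; the constants are arbitrary) compares truncated integrals of $\phi(x)^2 e^{-x^2/2}$ for $\phi(x) = \exp(x^2)$, where the truncated integrals blow up as the cutoff grows, with $\phi(x) = \exp(|x|^{3/2})$, which satisfies the growth condition in the definition of permissibility and gives a convergent integral:

```python
import numpy as np

def truncated_integral(log_integrand, a, dx=0.001):
    """Riemann sum of exp(log_integrand(x)) over [-a, a]."""
    x = np.arange(-a, a, dx)
    return np.sum(np.exp(log_integrand(x))) * dx

# phi(x) = exp(x^2): integrand exp(2 x^2 - x^2 / 2) = exp(1.5 x^2), divergent.
bad = lambda x: 1.5 * x ** 2
# phi(x) = exp(|x|^{3/2}): integrand exp(2 |x|^{1.5} - x^2 / 2), integrable.
ok = lambda x: 2 * np.abs(x) ** 1.5 - x ** 2 / 2

grow = truncated_integral(bad, 8.0) / truncated_integral(bad, 4.0)
settle = truncated_integral(ok, 40.0) / truncated_integral(ok, 20.0)
```

Doubling the cutoff multiplies the first integral by an astronomical factor, while the second is already stable to many digits, matching lemma 2 and the proposition above.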

## 4 Convergence in Probability

In this section, we show that the length process $q_0, \ldots, q_D$ converges in probability to the length map $\tilde{q}_0, \ldots, \tilde{q}_D$ from Poole et al. (2016).

For any permissible $\phi$, any $\sigma_w, \sigma_b \geq 0$, any depth $D$, and any $\epsilon, \delta > 0$, there is an $N_0$ such that for all $N \geq N_0$, with probability $1 - \delta$, for all $\ell \in [D]$, we have $|q_\ell - \tilde{q}_\ell| \leq \epsilon$.

Before proving theorem 9, we establish some lemmas. Our proof will use the weak law of large numbers.

In order to divide our analysis into cases, we need the following lemma, whose proof is in appendix B.

If $\phi$ is permissible and not zero almost everywhere (a.e.), then for all $\sigma_w > 0$ and all $\ell$, $\tilde{q}_\ell > 0$ and $\tilde{r}_\ell > 0$.

We will also need a lemma that shows that small changes in $\sigma$ lead to small changes in $\mathrm{Gauss}(0, \sigma^2)$.

The following technical lemma, which shows that tail bounds hold uniformly over different choices of $q$, is proved in appendix C.

If $\phi$ is permissible, then for all $0 < r \leq s$ and all $\beta > 0$, there is an $a \geq 0$ such that for all $q \in [r, s]$, $\int_a^\infty \phi(qz)^2 \exp(-z^2/2) \, dz \leq \beta$ and $\int_{-\infty}^{-a} \phi(qz)^2 \exp(-z^2/2) \, dz \leq \beta$.

Armed with these lemmas, we are ready to prove theorem 9.

First, if $\phi$ is zero a.e. or if $\sigma_w = 0$, theorem 9 follows directly from lemma 10, together with a union bound over the layers. Assume for the rest of the proof that $\phi(x)$ is nonzero on a set of positive measure and that $\sigma_w > 0$, so that $\tilde{q}_\ell > 0$ and $\tilde{r}_\ell > 0$ for all $\ell$.

For each $\ell \in \{0, \ldots, D\}$, define $r_\ell = \frac{1}{N} \sum_{i=1}^N x_{\ell,i}^2$.

Our proof of theorem 9 is by induction. The inductive hypothesis is that for any $\epsilon, \delta > 0$ there is an $N_0$ such that if $N \geq N_0$, then, with probability $1 - \delta$, for all $\ell' \in \{1, \ldots, \ell\}$, $|q_{\ell'} - \tilde{q}_{\ell'}| \leq \epsilon$ and, for all $\ell' \in \{0, \ldots, \ell\}$, $|r_{\ell'} - \tilde{r}_{\ell'}| \leq \epsilon$.

The base case, where $\ell = 0$, holds because $\tilde{r}_0$ is defined to be the limit of $r_0$ as $N$ goes to infinity.

For the induction step, choose $\ell > 0$, $0 < \epsilon < \min\{\tilde{q}_\ell/4, \tilde{r}_\ell\}$, and $0 < \delta \leq 1/2$. (Note that these choices are without loss of generality.) Let $\epsilon' \in (0, \epsilon)$ take a value that will be described later, using quantities from the analysis. By the inductive hypothesis, whatever the value of $\epsilon'$, there is an $N_0'$ such that if $N \geq N_0'$, then with probability $1 - \delta/2$, for all $\ell' \leq \ell - 1$, we have $|q_{\ell'} - \tilde{q}_{\ell'}| \leq \epsilon'$ and $|r_{\ell'} - \tilde{r}_{\ell'}| \leq \epsilon'$. Thus, to establish the inductive step, it suffices to show that, after conditioning on the random choices before the $\ell$th layer, if $|r_{\ell-1} - \tilde{r}_{\ell-1}| \leq \epsilon'$, there is an $N_\ell$ such that if $N \geq N_\ell$, then with probability at least $1 - \delta/2$ with respect only to the random choices of $W_{\ell,:,:}$ and $b_{\ell,:}$, we have $|q_\ell - \tilde{q}_\ell| \leq \epsilon$ and $|r_\ell - \tilde{r}_\ell| \leq \epsilon$. Given such an $N_\ell$, the inductive step can be satisfied by letting $N_0$ be the maximum of $N_0'$ and $N_\ell$.

Let us do that. To simplify the notation, for the rest of the proof of the inductive step, let us condition on the outcomes of the layers before layer $\ell$; all expectations and probabilities will concern the randomness only in the $\ell$th layer. Let us further assume that $|r_{\ell-1} - \tilde{r}_{\ell-1}| \leq \epsilon'$.

Recall that $q_\ell = \frac{1}{N} \sum_{i=1}^N h_{\ell,i}^2$. Since the values of $h_{\ell-1,1}, \ldots, h_{\ell-1,N}$ have been fixed by conditioning, each component $h_{\ell,i}$ is obtained by taking the dot product of $x_{\ell-1,:} = \phi(h_{\ell-1,:})$ with $W_{\ell,i,:}$ and adding an independent $b_{\ell,i}$. Thus, conditioned on $h_{\ell-1,1}, \ldots, h_{\ell-1,N}$, we have that $h_{\ell,1}, \ldots, h_{\ell,N}$ are independent. Also, since $x_{\ell-1,:}$ is fixed by conditioning, each $h_{\ell,i}$ has an identical gaussian distribution.

Since each component of $W$ and $b$ has zero mean, each $h_{\ell,i}$ has zero mean.

By lemma 14, we can choose $a$ such that for all $q \in [\tilde{q}_\ell/2, 2\tilde{q}_\ell]$, the contribution to the relevant integrals of the tails beyond $[-a, a]$ is small. Applying lemma 12, along with the fact that $|\bar{q}_\ell - \tilde{q}_\ell| \leq \epsilon' \sigma_w^2$, for the constant $C$ from lemma 12, we get that, for a suitable choice of $\epsilon'$, $E[q_\ell]$ is within $\epsilon/2$ of $\tilde{q}_\ell$ and $E[r_\ell]$ is within $\epsilon/2$ of $\tilde{r}_\ell$.

Recall that $q_\ell$ is an average of $N$ identically distributed random variables with a mean between 0 and $2\tilde{q}_\ell$ (which is therefore finite), and $r_\ell$ is an average of $N$ identically distributed random variables, each with mean between 0 and $\tilde{r}_\ell + \epsilon/2 \leq 2\tilde{r}_\ell$. Applying the weak law of large numbers (see lemma 10), there is an $N_\ell$ such that if $N \geq N_\ell$, with probability at least $1 - \delta/2$, both $|q_\ell - E[q_\ell]| \leq \epsilon/2$ and $|r_\ell - E[r_\ell]| \leq \epsilon/2$ hold, which in turn implies $|q_\ell - \tilde{q}_\ell| \leq \epsilon$ and $|r_\ell - \tilde{r}_\ell| \leq \epsilon$, completing the proof of the inductive step, and therefore the proof of theorem 9. $\square$

## 5 Experiments

Our first experiment fixed $x_{0,:} = (1, \ldots, 1)$, $\sigma_w = 1$, $\sigma_b = 0$, and $\phi(z) = 1/z$.

For each $N \in \{10, 100, 1000\}$, we initialized the weights 100 times and plotted the histograms of all of the values of $h_{2,:}$, along with the $\mathrm{Cauchy}(0, N)$ distribution from the proof of proposition 1 and $\mathrm{Gauss}(0, \sigma^2)$ for $\sigma$ estimated from the data (see Figure 1). Consistent with the theory, the $\mathrm{Cauchy}(0, N)$ distribution fits the data well.
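The divergence can also be observed without fitting histograms. The sketch below is our own illustration (trial counts are arbitrary): with $x_{0,:} = (1, \ldots, 1)$, $\sigma_w = 1$, and $\sigma_b = 0$, each $h_{1,j}$ is $\mathrm{Gauss}(0,1)$ and $h_{2,1} = \sum_j W_{2,1,j} / h_{1,j}$, so the typical magnitude of $h_{2,1}$, measured by its median absolute value, should grow with $N$ rather than stabilize:

```python
import numpy as np

rng = np.random.default_rng(0)

def median_abs_h2(N, trials=4_000):
    """Median of |h_{2,1}| for phi(z) = 1/z, sigma_w = 1, sigma_b = 0."""
    h1 = rng.standard_normal((trials, N))                 # h_{1,j} ~ Gauss(0, 1)
    w2 = rng.normal(0.0, 1.0 / np.sqrt(N), (trials, N))   # W_{2,1,j}
    h2 = (w2 / h1).sum(axis=1)                            # h_{2,1} = sum_j W / h
    return np.median(np.abs(h2))

scale_small, scale_large = median_abs_h2(400), median_abs_h2(1600)
```

Quadrupling the width visibly inflates the typical magnitude of $h_{2,1}$, consistent with a Cauchy scale parameter that grows with $N$ and with the divergence asserted by proposition 1.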

To illustrate the fact that the values in the second hidden layer are not independent, for $N = 1000$ and the parameters otherwise as in the first experiment, we plotted histograms of the values seen in the second layer for nine random initializations of the weights in Figure 2. When some of the values in the first hidden layer have unusually small magnitude, the values in the second hidden layer tend to be large together. Note that this is consistent with theorem 9, which establishes convergence in probability for permissible $\phi$, since the $\phi$ used in this experiment is not permissible.

## 6 Maintaining Unit Scale
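One simple recipe in this spirit (a sketch under the stated assumptions, not necessarily the letter's exact prescription): with $\sigma_b = 0$, choose $\sigma_w^2 = 1 / E_{z \sim \mathrm{Gauss}(0,1)}[\phi(z)^2]$, so that $\tilde{q} = 1$ is a fixed point of the length-map recursion $\tilde{q}_\ell = \sigma_w^2 E_z[\phi(\sqrt{\tilde{q}_{\ell-1}}\, z)^2]$. For the ReLU, $E[\phi(z)^2] = 1/2$, recovering the familiar $\sigma_w^2 = 2$ of He et al. (2015). The choice can be computed numerically for any permissible $\phi$:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(2_000_000)  # Monte Carlo sample for the expectation

def sigma_w_squared(phi):
    """sigma_w^2 making q~ = 1 a fixed point of the length map (sigma_b = 0)."""
    return 1.0 / np.mean(phi(z) ** 2)

relu = lambda t: np.maximum(t, 0.0)
sw2_relu = sigma_w_squared(relu)    # expected to be close to 2
sw2_tanh = sigma_w_squared(np.tanh)

# Check the fixed point: one step of the length map starting from q~ = 1.
def length_map_step(q, sw2, phi):
    return sw2 * np.mean(phi(np.sqrt(q) * z) ** 2)

residual = abs(length_map_step(1.0, sw2_relu, relu) - 1.0)
```

For the ReLU, which is positively homogeneous, the resulting length map is linear, so every $\tilde{q}$ is preserved, not just $\tilde{q} = 1$; for a saturating $\phi$ like tanh, $\tilde{q} = 1$ is a fixed point by construction, and the required $\sigma_w^2$ comes out larger to compensate for the saturation.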

## 7 Conclusion

We have given a rigorous analysis of the limiting value of the distribution of the lengths of the vectors of hidden nodes in a fully connected deep network and described how to choose the variance of the weights at initialization using this analysis for various commonly used activation functions. Our analysis can be easily applied to other activation functions.

As in earlier work, our analysis concerned a limit in which the input grows along with the hidden layers. This simplifies the analysis, but it appears not to be difficult to remove this assumption (see Matthews et al., 2018).

After publication of some of this work in preliminary form (Long & Sedghi, 2019), elements of its analysis were used in Novak et al. (2019).

Analysis of the length map in the case of ReLU activations was an important component of recent analyses of the convergence of deep network training (Zou, Cao, Zhou, & Gu, 2018; Allen-Zhu, Li, & Song, 2019). A nonasymptotic refinement of our analysis would be a step toward generalizing those results to more general activation functions.

## Appendix A: Proof of Lemma 2

## Appendix B: Proof of Lemma 11

The proof is by induction. The base case holds since we have assumed that $\tilde{r}_0 > 0$.

To prove the inductive step, we need the following lemma.

If $\phi$ is not zero a.e., then for all $c > 0$, $E_{z \sim \mathrm{Gauss}(0,1)}[\phi(cz)^2] > 0$.

Returning to the proof of lemma 11: by the inductive hypothesis, $\tilde{r}_{\ell-1} > 0$, which, since $\sigma_w > 0$, implies $\tilde{q}_\ell > 0$. Applying the preceding lemma yields $\tilde{r}_\ell > 0$.

## Appendix C: Proof of Lemma 14

## Notes

^{1}

Here $o(z^2)$ denotes any function of $z$ that grows strictly more slowly than $z^2$, such as $z^{2-\epsilon}$ for $\epsilon > 0$.

^{2}

This condition may be expanded as follows: $\limsup_{x \to \infty} \frac{\log |\phi(x)|}{x^2} = 0$ and $\limsup_{x \to -\infty} \frac{\log |\phi(x)|}{x^2} = 0$.

## Acknowledgments

We thank Ben Poole, Sam Schoenholz, and Jascha Sohl-Dickstein for valuable conversations and Jascha and anonymous reviewers for their helpful comments on earlier versions of this letter.

## References

## Author notes

The authors are ordered alphabetically.