The graph convolutional network (GCN) is a powerful deep model for graph data. However, the explainability of GCN remains a difficult problem, since the training behavior of graph neural networks is hard to describe. In this work, we show that for a GCN with a wide hidden feature dimension, the output for a semisupervised problem can be described by a simple differential equation. In addition, the dynamics of the output are governed by the graph convolutional neural tangent kernel (GCNTK), which is stable as the width of the hidden features tends to infinity. The solution of node classification in a semisupervised problem can thus be explained directly by this differential equation. Experiments on several toy models confirm the consistency between the GCNTK model and GCN.

Graph neural networks (GNNs) are widely used for non-Euclidean data (Wu et al., 2021). A typical kind of GNN, the graph convolutional network (GCN; Kipf & Welling, 2016; Wu et al., 2019), achieves strong performance in several fields, including society and transportation (Zhou et al., 2020; Wu et al., 2021). Therefore, many researchers focus on theoretical aspects of GNNs, especially GCN, including expressive power (Xu, Hu, Leskovec, & Jegelka, 2018; Loukas, 2020; Chen, Villar, Chen, & Bruna, 2019) and generalization capability (Garg, Jegelka, & Jaakkola, 2020; Scarselli, Tsoi, & Hagenbuchner, 2018; Xu et al., 2020). The explainability of GNNs, which studies the underlying relationships behind predictions, has also attracted broad attention (Yuan, Yu, Gui, & Ji, 2020). Since many GNNs are proposed without accompanying explanations, they are treated as black boxes and cannot be trusted in critical applications concerning privacy and safety. Therefore, it is necessary to develop explanation techniques that uncover the causal relationships behind GNN predictions.

Some methods explore GNNs' explainability by identifying important nodes related to the prediction. Gradient- and feature-based methods (Baldassarre & Azizpour, 2019; Pope, Kolouri, Rostami, Martin, & Hoffmann, 2019) and perturbation-based methods (Ying, Bourgeois, You, Zitnik, & Leskovec, 2019; Luo et al., 2020; Schlichtkrull et al., 2020; Funke et al., 2021) employ different model metrics, such as gradients or robustness to perturbation, to indicate the importance of different nodes, while decomposition methods (Schwarzenberg, Hübner, Harbecke, Alt, & Hennig, 2019; Schnake et al., 2020) and surrogate methods (Huang et al., 2020; Vu & Thai, 2020; Zhang et al., 2021) decompose GNNs or fit a surrogate model to simplify and explain them. However, most of this work seeks to understand GNN predictions through post hoc explanation, which means the explanations cannot be used to predict GNN output before training.

In this article, we focus on GCN behavior on the node classification problem (Kipf & Welling, 2016) and aim at an interpretation of GCN that can predict its output before training. Since the objective function for training GCN is often nonconvex, it is hard to analyze GCN behavior directly. Recently, the neural tangent kernel (NTK; Bartlett, Helmbold, & Long, 2018; Arora et al., 2019; Jacot, Gabriel, & Hongler, 2018) has been proposed to analyze deep neural networks, including GNNs, from different perspectives. Du, Hou, et al. (2019) use the aggregation and combination formulation of infinitely wide GNNs to define the graph NTK (GNTK) at the graph level, which can predict the results of graph-level classification problems. Following the definition of the GNTK, Huang et al. (2021) define an NTK of GCN at the node level and focus on the trainability of ultrawide GCNs. However, neither work can be applied directly to analyze node classification problems based on the GNTK.

Here, we establish a GCN tangent kernel (GCNTK) in matrix form and use it to analyze the learning dynamics of wide GCNs under gradient descent for node classification problems. The GCNTK can also predict the test nodes' labels and helps to explain the importance of different training nodes for the prediction.

We summarize our contributions as follows:

  • The convergence of GCNTK. Since the input and output of GCN are in matrix form, the formula of the gaussian process for GCN (GCNGP) is complex. We first give the explicit formulas for the GCNGP and GCNTK and demonstrate that the GCNTK converges to a fixed form as the GCN's layer width tends to infinity.

  • The convergence of training loss and stability of GCNTK. We prove that the loss restricted to the training data set tends to zero as the width of the parameter matrices tends to infinity, and that the GCNTK remains fixed during the training procedure.

  • Predictions of the test nodes' labels based on linear dynamics. We formally obtain the predictions of the test nodes' labels for infinite-width GCN. We find that the solution of the semisupervised problem mainly depends on the ratio of the kernel restricted to the training and test data sets. The predictions on the test nodes and the impact of the training nodes can be interpreted well by the GCNTK.

2.1  GNNs and GCNs

GNNs learn task-specific node/edge/graph representations via hierarchical iterative operators and obtain great success in graph learning tasks. A classical GNN consists of aggregation and combination operators, which gather information from neighbors iteratively. GCN, a typical GNN, defines the convolution in the spectral domain and applies a filter operator to the feature components. Its aggregation function can be treated as a weighted summation of neighbors' features. In particular, the GCN proposed in Kipf and Welling (2016) uses a 1-localized ChebNet to define the convolution, which yields the model in equation 3.1 with the bias term and obtains significant advantages on node classification problems.

2.2  Explainability of GNNs

The explainability of GNNs can be explored by identifying important nodes related to the GNN's prediction. For example, SA and guided BP (Baldassarre & Azizpour, 2019) use squared gradient values as the importance and contributions of different nodes, while Grad-CAM (Pope et al., 2019) maps the final layer back to the input node space to generate importance scores. GNN perturbation methods, including GNNExplainer (Ying et al., 2019), PGExplainer (Luo et al., 2020), GraphMask (Schlichtkrull, De Cao, & Titov, 2020), and ZORRO (Funke, Khosla, & Anand, 2021), consider the influence of node perturbations on predictions. Surrogate and decomposition models, including GraphLime (Huang et al., 2020), PGM-explainer (Vu & Thai, 2020), RelEx (Zhang, Defazio, & Ramesh, 2021), LRP (Baldassarre & Azizpour, 2019), and GNN-LRP (Schnake et al., 2020), explain GNNs by employing a simple and interpretable surrogate model to approximate the predictions. However, most of these models explain the GNN after the fact, and the GNN remains a black box to some extent: they can provide an explanation of the GNN's output but cannot predict the results before training.

Starting from the spectral GCN, a classical and special GNN, this article addresses GCN's interpretability and analyzes the causal relationship between the predictions and the training nodes, which helps to predict GCN's output beforehand.

2.3  Neural Tangent Kernel

Based on the gaussian process (GP; Neal, 1995; Lee et al., 2018; de G. Matthews, Rowland, Hron, Turner, & Ghahramani, 2018) property of deep neural networks, Jacot et al. (2018) introduce the neural tangent kernel and describe the exact dynamics of a fully connected network's output under gradient flow training in the overparameterized regime. This initial work has been followed by a series of studies, including more precise descriptions (Arora et al., 2019; Lee et al., 2019), generalizations to different initializations (Liu, Zhu, & Belkin, 2020; Sohl-Dickstein, Novak, Schoenholz, & Lee, 2020), and NTKs for different network architectures (Arora et al., 2019; Li et al., 2021; Luo, Xu, Ma, & Zhang, 2021).

As for graph deep learning, Du, Hou, et al. (2019) considered graph-level supervised problems and defined the graph NTK (GNTK) at the graph level based on GNN's aggregation function, which can be used to predict the label of an unlabeled graph. Building on Du, Hou, et al. (2019), Huang et al. (2021) defined a GNTK for GCN and studied how to deepen GCNs. Since the GNTK formulation in Huang et al. (2021) is based only on the training nodes, the predictions for the test nodes cannot be obtained directly. In this article, we derive a matrix form of the GCN tangent kernel (GCNTK) and provide an explicit form to predict the behavior of the test nodes directly.

Let G = (V, E, X) be a graph with N vertices, where V = {v_1, v_2, ..., v_N} represents the N vertices or nodes. The adjacency matrix A = (a_{ij})_{N×N} represents the link relation. X ∈ ℝ^{N×d_0} is the node feature matrix. A typical semisupervised problem defined on a graph is node classification. Assume that there exist k classes of nodes and that each node v_i in graph G has an associated one-hot label y_i ∈ ℝ^k. However, only some of the nodes in V, denoted by V_train, are annotated. The task of node classification is to predict the labels of the unlabeled nodes in V_test.

First, we give some notation in matrix form. Assume that the index sets of the training and test sets are I_train = {i | v_i ∈ V_train} and I_test = {i | v_i ∈ V_test}, respectively. For an arbitrary matrix X = (X_{ij})_{N×d}, denote the ith row of X by X_{i,:} and the jth column by X_{:,j}. Let I_train ∈ ℝ^{N×N} be the diagonal matrix whose diagonal entries are 1 at the training indices and 0 elsewhere, with a similar definition for I_test. X_train = I_train X ∈ ℝ^{N×d} represents X restricted to the training set by rows, with a similar definition for X_test. For an arbitrary matrix A ∈ ℝ^{N×N}, A_{train,test} = I_train A I_test represents A restricted to the training indices by rows and the test indices by columns.
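To make this notation concrete, the following NumPy sketch builds the diagonal restriction matrices and the corresponding slices for a small hypothetical graph; the graph size, index sets, and random features are illustrative assumptions rather than anything taken from the paper.

```python
import numpy as np

# Hypothetical toy sizes: N nodes, d-dimensional features.
N, d = 6, 4
rng = np.random.default_rng(0)

X = rng.normal(size=(N, d))                  # node feature matrix X in R^{N x d}
A = (rng.random((N, N)) < 0.4).astype(float)
A = np.maximum(A, A.T)                       # symmetric adjacency for illustration

train_idx = [0, 1, 2]                        # I_train as an index set
test_idx = [3, 4, 5]                         # I_test as an index set

# I_train as the diagonal 0/1 matrix used in the paper's notation.
I_train = np.diag(np.isin(np.arange(N), train_idx).astype(float))
I_test = np.eye(N) - I_train

X_train = I_train @ X                        # rows outside the training set are zeroed
A_train_test = I_train @ A @ I_test          # A restricted to training rows, test columns
```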

A GCN f : ℝ^{N×d_0} → ℝ^{N×d_{L+1}} is a nonlinear transform defined by H^{L+1} = f(X^0), which can be expressed in the recurrent form,
(3.1)
with parameters
(3.2)
where ϕ represents the activation function, X^l ∈ ℝ^{N×d_l} is the layer-l output, and X^0 = X ∈ ℝ^{N×d_0} is the initial node feature matrix of G. W^{l+1} ∈ ℝ^{d_l×d_{l+1}} and B^{l+1} ∈ ℝ^{N×d_{l+1}}. The entries ω^l_{ij}, β^l_{ij} are trainable variables drawn independent and identically distributed (i.i.d.) from a standard gaussian, ω^l_{ij}, β^l_{ij} ~ N(0, 1), at initialization. The weight and bias variances σ_ω and σ_b are predefined constants. The parameterization, equation 3.2, is nonstandard, and we refer to it as "NTK parameterization" (Jacot et al., 2018; Arora et al., 2019; Liu et al., 2020; Littwin, Galanti, Wolf, & Yang, 2020). Unlike the standard parameterization (Sohl-Dickstein, Novak, Schoenholz, & Lee, 2020; Franceschi et al., 2021), which leads to a divergent NTK, the NTK parameterization has a width-dependent scaling factor 1/√d_{l−1} in each layer and thus normalizes the backward dynamics, giving a convergent NTK.
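Because the bodies of equations 3.1 and 3.2 are not reproduced above, the sketch below assumes a common NTK-parameterized GCN layer of the form H^{l+1} = (σ_ω/√d_l) A X^l W^{l+1} + σ_b B^{l+1} with X^{l+1} = ϕ(H^{l+1}) and no activation after the last layer; the adjacency, sizes, and activation are illustrative choices, not the paper's exact model.

```python
import numpy as np

def gcn_forward(A, X0, Ws, Bs, sigma_w=1.0, sigma_b=0.1, phi=np.tanh):
    """Forward pass of a GCN under an assumed NTK parameterization.

    Assumed recursion: H^{l+1} = (sigma_w / sqrt(d_l)) * A @ X^l @ W^{l+1}
                                 + sigma_b * B^{l+1},   X^{l+1} = phi(H^{l+1}),
    with the activation omitted after the final layer.
    """
    X = X0
    last = len(Ws) - 1
    for l, (W, B) in enumerate(zip(Ws, Bs)):
        d_prev = X.shape[1]
        H = (sigma_w / np.sqrt(d_prev)) * A @ X @ W + sigma_b * B
        X = phi(H) if l < last else H
    return X                                   # f(X) in R^{N x d_{L+1}}

# Parameters drawn i.i.d. from N(0, 1), as in the NTK parameterization.
rng = np.random.default_rng(0)
N, d0, d_hidden, k = 6, 4, 64, 2
A = np.eye(N)                                  # placeholder adjacency
X0 = rng.normal(size=(N, d0))
dims = [d0, d_hidden, d_hidden, k]
Ws = [rng.normal(size=(dims[i], dims[i + 1])) for i in range(len(dims) - 1)]
Bs = [rng.normal(size=(N, dims[i + 1])) for i in range(len(dims) - 1)]
out = gcn_forward(A, X0, Ws, Bs)
```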

Define θ^l ≡ vec({ω^l_{ij}, β^l_{ij}}), the column-wise stacked ((N + d_{l−1}) d_l) × 1 vector of all parameters in layer l. Let θ = vec(∪_{l=1}^{L+1} θ^l) denote all the network parameters and θ^{≤l_0} = vec(∪_{l=1}^{l_0} θ^l), with a similar definition for θ^{>l_0}. Let θ_t denote the network parameters at time t, with θ_0 their initial values. Combined with the current parameters θ_t, the output of the GCN is denoted by the function f(X, θ_t) = H^{L+1}(X, θ_t) ∈ ℝ^{N×d_{L+1}}, where d_{L+1} = k is the number of node classes. To simplify the notation, we write f(X) = f(X, θ) and f_t(X) = f(X, θ_t).

We use the loss L on the labeled nodes to learn θ by gradient descent (GD),
(3.3)
where f(X)_{i,:} represents the ith row of the GCN output.
Then the square loss function, equation 3.3, can be written as
(3.4)
Here Y ∈ ℝ^{N×k} is the matrix of the N labels in one-hot representation. Although the labels of the test nodes, Y_test, cannot be obtained, the loss restricted to the training set can be represented directly.
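As a small illustration of restricting the loss to the labeled nodes, the sketch below evaluates a squared loss over V_train only; the 1/2 factor and the boolean-mask slicing are assumptions, since equation 3.4 is not reproduced above.

```python
import numpy as np

def training_loss(f_X, Y, train_mask):
    """Squared loss restricted to the labeled (training) nodes.

    f_X: GCN output in R^{N x k}; Y: one-hot labels in R^{N x k};
    train_mask: boolean vector of length N marking V_train.
    """
    diff = (f_X - Y)[train_mask]
    return 0.5 * np.sum(diff ** 2)   # the exact constant in eq. 3.4 may differ
```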
Next, since the gradient flow of GCNs involves gradients with respect to matrices, we give the relevant definitions as follows. The vectorization of a matrix X is
(3.5)
We define the derivative of a matrix F ∈ ℝ^{p×q} with respect to a matrix X ∈ ℝ^{m×n} through their vector representations as
(3.6)
Let η be the learning rate of GD. Applying gradient flow to the GCN, the optimization procedure can be written as
(3.7)
Let ḟ_t(X) = ḟ_t(vec(X)) ∈ ℝ^{N d_{L+1} × 1} be the vector representation; then
(3.8)
Definition 1.
Similar to the definition of NTK in Jacot et al. (2018), the GCNTK Θt is defined by
(3.9)
Then equation 3.8 can be written as
(3.10)
Here ∇_{f_t(X)} L ∈ ℝ^{N d_{L+1} × 1} is the gradient of the loss with respect to the output matrix, and ∇_θ f_t(X) ∈ ℝ^{|θ| × N d_{L+1}} is the gradient of the output with respect to θ at time t.

Equations 3.7 and 3.8 are two differential equations that describe the evolution of parameters and output, respectively. The behavior of output depends on the time-dependent GCNTK Θt. However, Θt depends on the random draw of θt, which makes the differential equations hard to solve. This article assumes that the width tends to infinity to obtain the convergence of Θt.
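To make definition 1 concrete at finite width, the sketch below numerically estimates the empirical kernel Θ_t, assuming the standard form Θ_t = ∇_θ f_t(X)^T ∇_θ f_t(X), that is, J J^T for the Jacobian J of the flattened output with respect to the parameters; the finite-difference Jacobian and the flattening convention are illustrative assumptions, and this is the finite-width kernel, not the limiting kernel of theorem 1.

```python
import numpy as np

def empirical_gcntk(f, theta, eps=1e-4):
    """Finite-difference estimate of Theta_t = J J^T at parameter vector theta.

    f: callable mapping a flat parameter vector to the flattened output
       vec(f_t(X)) in R^{N * d_{L+1}};  theta: flat parameter vector.
    """
    out0 = f(theta)
    J = np.zeros((out0.size, theta.size))
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        J[:, j] = (f(theta + step) - f(theta - step)) / (2.0 * eps)
    return J @ J.T          # kernel in R^{(N d_{L+1}) x (N d_{L+1})}
```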

4.1  Convergence of GCNTK with Respect to Width d

The GNTK (Du, Hou, et al., 2019; Huang et al., 2021) gives an NTK formula for GNNs based on individual nodes and the dynamic behavior on the training set. However, for the semisupervised problem on a graph, the convergence of the GCNTK in matrix form has not been studied. Since the formula for the GCNTK involves complicated matrix gradient computations, it has not previously been given explicitly. Therefore, we first focus on the convergence of the GCNTK.

Theorem 1.
For a GCN defined by equation 3.1 under the NTK parameterization, the GCNTK Θ_0 at initialization, defined by equation 3.9, converges in probability to a deterministic limiting kernel
(4.1)
as the layer widths d_1, d_2, ..., d_L tend to infinity.

The detailed proofs for the multioutput GPs and the NTK are in appendix A. We give a proof sketch as follows.

In order to prove theorem 1, denote by A ⊗ B the Kronecker product of A and B. We first show that the output of the GCN at each layer is also in correspondence with a certain class of multioutput GPs defined by
(4.2)
where
(4.3)
with base case
(4.4)
According to equation 4.2, we have the convergence of GCNTK:
(4.5)
Here
(4.6)
where
(4.7)
with
(4.8)
Here ⊙ is the Hadamard product and ϕ̇ is the derivative of ϕ.
Remark 1.

Jacot et al. (2018) and Yang (2019) show that the NTK of fully connected networks at initialization, Θ_0, converges to a deterministic kernel Θ based on the GP property. Theorem 1 shows that the GCNTK also converges at initialization. Compared with the results in Du, Hou, et al. (2019), which define the GNTK from a single-node perspective, this article defines the GCNTK in matrix form, which is suitable for further analysis of node classification problems.

4.2  The Behavior of GCNTK with Respect to Time t

Furthermore, we focus on the behavior of GCN and the GCNTK during the training procedure. We use the GCNTK to provide a simple proof of the convergence of GCN under gradient descent on the training nodes. The stability of the GCNTK and of the parameters during training is guaranteed in our proof. Since Lee et al. (2019) show that convergence under the NTK and the standard parameterization can be obtained by similar proof procedures, we prove convergence under the standard parameterization. Compared with the NTK parameterization, the standard parameterization is the common choice in implementations of GCN (Kipf & Welling, 2016; Wu et al., 2019) and is defined as
(4.9)
and
(4.10)
Different from the NTK parameterization, the NTK kernel under the standard parameterization (Sohl-Dickstein et al., 2020; Park, Sohl-Dickstein, Le, & Smith, 2019) is defined as
(4.11)
Theorem 1 has shown that Θ exists. Next, we prove that Θ_t is invariant with respect to t as the width tends to infinity. Before that, we state some assumptions.
  1. For the GCN model defined in equations 4.9 and 4.10, the widths of all hidden layers are identical: d_1 = d_2 = ... = d_L = d.

  2. The analytic GCNTK kernel Θ in equation 4.11 is of full rank and positive.

  3. The activation function ϕ is Lipschitz continuous and smooth, satisfying
    (4.12)
    (4.13)

Assumption 1 is easy to satisfy in the infinite-width limit. Assumption 2 holds since the NTK kernel is a product of derivative terms. Common activation functions such as ReLU and sigmoid satisfy assumption 3.

Under the setting of the node classification problem, we have the parameter update formula,
(4.14)
and the gradient flow equation is
(4.15)
Using the following shorthand,
(4.16)
we have
(4.17)
or
(4.18)
Then the behavior of GCN satisfies
(4.19)
Note that J(θ_t) J(θ_t)^T is tightly connected with Θ. Denote Θ_train = Θ · (I ⊗ I_train). It is easy to prove that the eigenvalues of Θ_train are equal to those of Θ except for some zeros. Define
(4.20)
We first show that the Jacobian is locally Lipschitz, where the Lipschitz constant K depends on the parameters A, L, C_1, C_2, and d.
Lemma 1.
There exist K > 0 and N such that for every d ≥ N, the following holds with high probability over random initialization:
(4.21)
where
(4.22)

The proof of lemma 1 is in section B.3 in online appendix B. It shows that in the neighborhood of the initialization θ_0, the Jacobian J(θ) is Lipschitz continuous when d is large enough.

Then for the gradient descent and gradient flow, we have the following main results:

Theorem 2.
Assume assumptions 1, 2, and 3 hold. For δ_0 > 0 and η_0 < η_critical, there exist R_0 > 0, N ∈ ℕ, and K > 1 such that for every d ≥ N, the following holds with probability at least 1 − δ_0 over random initialization when applying gradient descent with learning rate η = η_0/d:
(4.23)
and
(4.24)
Theorem 3.
Assume assumptions 1, 2, and 3 hold. For δ_0 > 0 and η_0 < η_critical, there exist R_0 > 0, N ∈ ℕ, and K > 1 such that for every d ≥ N, the following holds with probability at least 1 − δ_0 over random initialization when applying gradient flow with learning rate η = η_0/d:
(4.25)
and
(4.26)
Remark 2.

Lee et al. (2019), Du, Lee, Li, Wang, and Zhai (2019), and Allen-Zhu, Li, and Song (2019) show the convergence of overparameterized, fully connected networks and the stability of the NTK, while theorems 2 and 3 extend the convergence and stability of the GCNTK to a semisupervised problem. Since the GCNTK converges at initialization when the width tends to infinity, it also remains fixed during training. Different from the supervised problem, the convergence rate of GCN for a semisupervised problem depends on the smallest eigenvalue of the GCNTK restricted to the training nodes. Therefore, for the same graph, training a GCN with fewer training nodes is in general faster. The proof is in appendix B.

4.3  GCN Explanation Based on Training Dynamics Equation

Under the infinite width assumption, GCNTK remains fixed, and the training dynamics can be computed by the evolution equation 3.8. We can obtain the explanation of GCN with infinite width based on the solution of the training dynamics equation.

Theorem 4.
For a graph convolutional network defined by equation 3.1, in the limit as the layer widths d_1, d_2, ..., d_L tend to infinity, the output of the GCN is
(4.27)
where Θ_train = Θ · (I ⊗ I_train).
The solution of equation 3.4 is
(4.28)
The slice of Θ restricted to the rows of the test set and the columns of the training set is Θ_{test,train}. Then we have
(4.29)
Similarly, from
(4.30)
we have
(4.31)
Theorem 4 shows that the output labels of the test nodes are influenced by two factors of the GCNTK, Θ_{test,train} and Θ_{train,train}. As t → ∞, the output label is a linear combination of the labels of the training nodes. Define
(4.32)
then M describes the contribution of the labeled nodes. Therefore, according to equation 4.31, the influence of the training nodes on the test nodes can be obtained easily.
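Equations 4.27 to 4.32 are not reproduced above, so the sketch below assumes the familiar kernel-regression limit in which the test outputs converge to Θ_{test,train} Θ_{train,train}^{-1} Y_train; this matches the statement that the prediction is a linear combination of the training labels with coefficient matrix M, but the exact closed form and the node-level view of the kernel are assumptions.

```python
import numpy as np

def contribution_ratio(Theta, train_idx, test_idx):
    """Contribution ratio M of the training nodes to the test predictions.

    Assumes the t -> infinity limit M = Theta[test, train] @ inv(Theta[train, train]),
    with Theta the node-level kernel (one block of the Kronecker structure).
    """
    T_tt = Theta[np.ix_(train_idx, train_idx)]
    T_st = Theta[np.ix_(test_idx, train_idx)]
    return T_st @ np.linalg.inv(T_tt)

def predict_test_labels(M, Y_train):
    """Predicted test outputs as a linear combination of the training labels."""
    return M @ Y_train      # each row can be argmax-ed into a class
```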

Little previous work has provided beforehand explanations of GNNs, and to our knowledge, there is no beforehand explanation model for the node classification problem. In this article, we therefore compare only the outputs of GCN and GCNTK and verify the consistency of their predictions. We use synthetic data sets to obtain some conclusions that are implicit in the classical GCN.

  • Synthetic data sets: In order to verify the prediction correctness of theorem 4, we build several subgraphs with a few nodes that represent common patterns in various graph data sets. All the nodes in these graphs except one are labeled red or green. We use GCN and GCNTK, respectively, to predict the label (red or green) of the unlabeled (blue) node based on the labeled nodes, and we obtain the contributions of those labeled nodes.

  • GCN model: We train the node classification model on different types of graphs using the classical GCN of Kipf and Welling (2016), where the feature matrix X is set to the identity matrix, and we predict the label of only one node (in blue).

5.1  Experiments on the Star Graph

Theorem 4 shows that the predicted label is influenced only by the contributions of the labeled nodes, where the contribution ratio is M defined by equation 4.32. For instance, if there is no link between the test nodes and the training nodes, then Θ_{test,train} = 0. As a result, no training node has an effect on the test nodes.

5.2  Common Patterns with Four Nodes

In this section, we compute the contribution ratio M of all the patterns with four nodes in Figures 1 and 2. As shown in Table 1, the labeled nodes make different contributions to the prediction of node 0. The ratio displays the importance of each node for the predicted node. A positive or negative ratio for a training node means that the predicted node tends to take the same or a different label from that node, and the absolute value of the ratio measures the importance of the node. A small worked example of reading off a prediction from M follows.
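The sketch below shows one way to read a row of Table 1: sum the entries of M by the color of the corresponding training node and pick the color with the larger total. The color assignment of the three training nodes is hypothetical, since the pattern thumbnails are not reproduced here.

```python
import numpy as np

M = np.array([-0.34, 0.74, 0.60])        # one contribution ratio from Table 1
colors = ["red", "green", "red"]         # hypothetical labels of training nodes 1-3

# Total contribution toward each color; node 0 follows the largest total.
score = {c: sum(m for m, col in zip(M, colors) if col == c) for c in set(colors)}
prediction = max(score, key=score.get)
print(score, prediction)                 # e.g., {'red': 0.26, 'green': 0.74} green
```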
Figure 1: Using GCNTK to predict the color of node 0, the GCNTK ratio M = (0.33, 0.33, 0.33). Then node 0 in the left, the middle, and the right is predicted using equation 4.31 as red, red, and green, respectively, which is completely consistent with the result of GCN.
Table 1: The Contribution Ratio M of Different Graphs.

Number  M                     Number  M
        (0.38, 0.31, 0.31)            (0, 1, 0)
        (0.33, 0.33, 0.33)            (0, 0, 1)
        (0.10, 0.45, 0.45)            (0.33, 0.33, 0.33)
        (-0.34, 0.74, 0.60)           (0.29, 0.42, 0.29)
        (1.5, -0.85, 0.35)            (-0.16, 1.32, -0.16)
Figure 2: Different patterns with four nodes. Red nodes are training nodes with labels, and the blue node is the unlabeled node to be predicted.

In Figures 2c and 2f, all the labeled nodes make the same positive contribution to the predicted node. Therefore, the color held by the largest number of those nodes decides the color of node 0. In addition, in Figures 2g, 2e, and 2h, different nodes make different positive contributions to node 0, and in Figures 2g, 2i, and 2j, some nodes even make negative contributions to node 0. We find that node 2 in Figure 2g, node 1 in Figure 2i, and node 2 in Figure 2j make the greatest contribution to the predicted node 0.

5.3  A Special Case

In this section, we display an interesting experiment that is inconsistent with the intuition that similar neighbors should have the same labels. The graph in Figure 3 shows that nodes of different colors are connected with each other, while nodes of the same color are not. We then use the labeled nodes 1 to 7 to predict the unlabeled node 0.
Figure 3: Node 0 is connected with all the green nodes, but GCN still predicts node 0 as red. GCNTK can be used to explain the results. Since the contribution ratio of the training nodes is M = (0, 0, 0, 0, 0, 0.5, 0.5), only red nodes contribute to node 0's label.

Graph convolutional networks (GCNs) are widely studied and perform well on node classification problems. However, the causal relationship underlying the predictions is unknown, which restricts their application in areas involving security and privacy. To interpret GCN, we assume that the GCN has infinite width. We define the GCNTK to analyze the GCN training procedure and predict its output. First, we prove that the GCNTK converges and is stable as the width tends to infinity. Then we find that for a GCN with infinite width, the output values of the unlabeled nodes can be predicted by a linear combination of the training nodes' labels. The coefficients of the training nodes can be computed by the GCNTK and indicate the importance of the training nodes to the unlabeled nodes. Finally, we conduct experiments on synthetic data sets, including common patterns in small model graphs, to demonstrate the effectiveness of the GCNTK.

Similar to Lee et al. (2018) and Arora et al. (2019), the GCNGP can be written as a recursive formula. Let the activation function ϕ be a differentiable function. Let X and X' be two inputs in ℝ^{N×d_0}. Denote H^l = [H^l_1, H^l_2, ..., H^l_{d_l}], where
(A.1)
is a column vector. We have
(A.2)
Based on the central limit theorem, Lee et al. (2018) show that for each h^{l+1}_{ij}, i = 1, ..., N, j = 1, ..., d_{l+1}, we have h^{l+1}_{ij} ~ GP(0, K_i). Note that the gaussian processes h^{l+1}_{ij} have the same kernel K_i and are independent with respect to j. We need to establish the relation of H^{l+1}_j with respect to its column elements.
According to the intrinsic coregionalization model in multioutput gaussian process theory, we have
(A.3)
where K^l(X, X') = cov(H^l_j(X), H^l_j(X')) is a matrix representing the covariance across the different dimensions. For H_j,
(A.4)
Then we write K^l(X, X') in matrix form based on equation A.4,
(A.5)
with base case
(A.6)
Let
(A.7)
then
(A.8)
and
(A.9)
When d_1, d_2, ..., d_{l−1} → ∞, this term at initialization converges to I_{d_l} ⊗ K^l(X, X'). Using H^l_0(X) to represent the initial value of H^l(X) and denoting D^{l−1}(X) = diag(ϕ̇(H^{l−1}_0(X))), we have
(A.10)
Assume Θ^{l−1}(X, X') = I_{d_{l−1}} ⊗ Θ̃^{l−1}(X, X'); then
(A.11)
Denote
(A.12)
Therefore,
(A.13)

In this section, we give a simple proof of the global convergence of the GCNTK restricted to the training data set under gradient descent and gradient flow. With only a subtle difference from the NTK initialization, we give the proof based on the standard parameterization. The GCN is generated under the standard parameterization by equations 4.9 and 4.10.

B.1  Proof of Theorem 2

Since the parameters at initialization are randomly generated, there exist R_0 and n_0 such that for every d > n_0, the following holds with probability at least (1 − δ_0/10) over random initialization:
(B.1)
Let C = 3KR_0/λ_min in lemma 1. There exists a large n_1 > n_0 such that for every d > n_1, equations 4.21 and B.1 hold with probability at least (1 − δ_0/5) over random initialization. The case t = 0 holds trivially, and we assume equation 4.21 holds at step t. By induction, we can obtain the second formula in equation 4.23,
(B.2)
(B.3)
Therefore, the second formula of equation 4.23 is satisfied at step t + 1. Since
(B.4)
θ_{t+1} ∈ B(θ_0, C d^{−1/2}).
(B.5)
where θ̃_t ∈ B(θ_0, C d^{−1/2}) is a linear interpolation between θ_t and θ_{t+1}.
(B.6)
then
(B.7)
Since η_0 < 2/(λ_min + λ_max), we have
(B.8)
Because Θ_0 → Θ, there exists n_3 such that for every d > n_3,
(B.9)
(B.10)
Choose d ≥ (18 K^3 R_0 / λ_min)^2:
(B.11)
Therefore,
(B.12)
We obtain the convergence of Θt:
(B.13)

B.2  Proof of Theorem 3

There exist R_0 and d_0 such that for every d > d_0, the following holds with probability at least (1 − δ_0/10) over random initialization:
(B.14)
Let C = 3KR_0/λ_min. Using the same arguments, one can show that there exists n_2 such that for every d > n_2, with probability at least (1 − δ_0/10),
(B.15)
Let
(B.16)
We claim that t_1 = ∞. If not, then for all t < t_1, θ_t ∈ B(θ_0, C d^{−1/2}), and
thus,
(B.17)
(B.18)
Note that
(B.19)
(B.20)
This contradicts the definition of t_1, and thus t_1 = ∞.

B.3  Proof of Lemma 1

Theorem 5
(Corollary 5.35 in Vershynin, 2010). Let P = P_{N,n} be an N × n random matrix whose entries are independent standard normal random variables. Then for every t ≥ 0, with probability at least 1 − 2 exp(−t²/2), one has
√N − √n − t ≤ s_min(P) ≤ s_max(P) ≤ √N + √n + t.
Let θ = {W, B}, θ̃ = {W̃, B̃}, and θ_0 = {W_0, B_0}. Therefore, choosing a suitable t and a large d, with high probability over random initialization we have
(B.21)
For l ≥ 1, denote
(B.22)
Assume that σ_ω and σ_b are of the same order. Then, according to the definition of the GCN and the assumptions on ϕ, ‖ϕ(H^l(X, θ))‖_2 can be controlled by O(∏_{i=1}^{l} C_1 ‖W^i‖_2 ‖A‖_2), and ‖δ^l(X, θ)‖_2 can be controlled by O(∏_{i=l}^{L} C_2 ‖W^i‖_2 ‖A‖_2), where C_1 and C_2 are the Lipschitz and smoothness constants. Therefore, there exists a constant K_1, depending on σ_ω², σ_b², N, and L, such that for l = 1, ..., L,
(B.23)
With high probability over random initialization, we have
(B.24)
(B.25)
Set K_2 = max(2(L+1)(‖A‖_2² K_1²) K_1² Nk, 3(L+1)‖A‖ K_1⁴ Nk); then
(B.26)

This work was supported by the National Natural Science Foundation of China (grant 61977065) and the National Key Basic Research Program (grant 2020YFA0713504).

Allen-Zhu, Z., Li, Y., & Song, Z. (2019). A convergence theory for deep learning via over-parameterization. In Proceedings of the International Conference on Machine Learning (pp. 242–252).
Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R., & Wang, R. (2019). On exact computation with an infinitely wide neural net. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, & R. Garnett (Eds.), Advances in neural information processing systems, 32 (pp. 8139–8148). Red Hook, NY: Curran.
Baldassarre, F., & Azizpour, H. (2019). Explainability techniques for graph convolutional networks. arXiv:1905.13686.
Bartlett, P., Helmbold, D., & Long, P. (2018). Gradient descent with identity initialization efficiently learns positive definite linear transformations by deep residual networks. In Proceedings of the International Conference on Machine Learning (pp. 521–530).
Chen, Z., Villar, S., Chen, L., & Bruna, J. (2019). On the equivalence between graph isomorphism testing and function approximation with GNNs. arXiv:1905.12560.
de G. Matthews, A. G., Rowland, M., Hron, J., Turner, R. E., & Ghahramani, Z. (2018). Gaussian process behaviour in wide deep neural networks. In Proceedings of the International Conference on Learning Representations.
Du, S., Lee, J., Li, H., Wang, L., & Zhai, X. (2019). Gradient descent finds global minima of deep neural networks. In Proceedings of the International Conference on Machine Learning (pp. 1675–1685).
Du, S. S., Hou, K., Salakhutdinov, R. R., Poczos, B., Wang, R., & Xu, K. (2019). Graph neural tangent kernel: Fusing graph neural networks with graph kernels. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, & R. Garnett (Eds.), Advances in neural information processing systems, 32 (pp. 5723–5733). Red Hook, NY: Curran.
Franceschi, J.-Y., de Bézenac, E., Ayed, I., Chen, M., Lamprier, S., & Gallinari, P. (2021). A neural tangent kernel perspective of GANs. arXiv:2106.05566v4.
Funke, T., Khosla, M., & Anand, A. (2021). Hard masking for explaining graph neural networks. https://openreview.net/forum?id=uDN8pRAdsoC
Garg, V., Jegelka, S., & Jaakkola, T. (2020). Generalization and representational limits of graph neural networks. In Proceedings of the International Conference on Machine Learning (pp. 3419–3430).
Huang, Q., Yamada, M., Tian, Y., Singh, D., Yin, D., & Chang, Y. (2020). GraphLime: Local interpretable model explanations for graph neural networks. arXiv:2001.06216.
Huang, W., Li, Y., Du, W., Da Xu, R. Y., Yin, J., Chen, L., & Zhang, M. (2021). Towards deepening graph neural networks: A GNTK-based optimization perspective. arXiv:2103.03113.
Jacot, A., Gabriel, F., & Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, & R. Garnett (Eds.), Advances in neural information processing systems, 32 (pp. 8580–8589). Red Hook, NY: Curran.
Kipf, T. N., & Welling, M. (2016). Semi-supervised classification with graph convolutional networks. arXiv:1609.02907.
Lee, J., Bahri, Y., Novak, R., Schoenholz, S. S., Pennington, J., & Sohl-Dickstein, J. (2018). Deep neural networks as gaussian processes. In Proceedings of the International Conference on Learning Representations.
Lee, J., Xiao, L., Schoenholz, S., Bahri, Y., Novak, R., Sohl-Dickstein, J., & Pennington, J. (2019). Wide neural networks of any depth evolve as linear models under gradient descent. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, & R. Garnett (Eds.), Advances in neural information processing systems, 32 (pp. 8572–8583). Red Hook, NY: Curran.
Li, M. B., Nica, M., & Roy, D. M. (2021). The future is log-Gaussian: ResNets and their infinite-depth-and-width limit at initialization. arXiv:2106.04013.
Littwin, E., Galanti, T., Wolf, L., & Yang, G. (2020). On infinite-width hypernetworks. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, & H. Lin (Eds.), Advances in neural information processing systems, 33. Red Hook, NY: Curran.
Liu, C., Zhu, L., & Belkin, M. (2020). On the linearity of large non-linear models: When and why the tangent kernel is constant. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, & H. Lin (Eds.), Advances in neural information processing systems, 33. Red Hook, NY: Curran.
Loukas, A. (2020). How hard is it to distinguish graphs with graph neural networks? arXiv:2005.06649.
Luo, D., Cheng, W., Xu, D., Yu, W., Zong, B., Chen, H., & Zhang, X. (2020). Parameterized explainer for graph neural network. arXiv:2011.04573.
Luo, T., Xu, Z.-Q. J., Ma, Z., & Zhang, Y. (2021). Phase diagram for two-layer ReLU neural networks at infinite-width limit. Journal of Machine Learning Research, 22(71), 1–47.
Neal, R. M. (1995). Bayesian learning for neural networks. Lecture Notes in Computer Science 118. Berlin: Springer.
Park, D., Sohl-Dickstein, J., Le, Q., & Smith, S. (2019). The effect of network width on stochastic gradient descent and generalization: An empirical study. In Proceedings of the International Conference on Machine Learning (pp. 5042–5051).
Pope, P. E., Kolouri, S., Rostami, M., Martin, C. E., & Hoffmann, H. (2019). Explainability methods for graph convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10772–10781). Piscataway, NJ: IEEE.
Scarselli, F., Tsoi, A. C., & Hagenbuchner, M. (2018). The Vapnik–Chervonenkis dimension of graph and recursive neural networks. Neural Networks, 108, 248–259.
Schlichtkrull, M. S., De Cao, N., & Titov, I. (2020). Interpreting graph neural networks for NLP with differentiable edge masking. arXiv:2010.00577.
Schnake, T., Eberle, O., Lederer, J., Nakajima, S., Schütt, K. T., Müller, K.-R., & Montavon, G. (2020). Higher-order explanations of graph neural networks via relevant walks. arXiv:2006.03589.
Schwarzenberg, R., Hübner, M., Harbecke, D., Alt, C., & Hennig, L. (2019). Layer-wise relevance visualization in convolutional text graph classifiers. arXiv:1909.10911.
Sohl-Dickstein, J., Novak, R., Schoenholz, S. S., & Lee, J. (2020). On the infinite width limit of neural networks with a standard parameterization. arXiv:2001.07301.
Vershynin, R. (2010). Introduction to the non-asymptotic analysis of random matrices. arXiv:1011.3027.
Vu, M. N., & Thai, M. T. (2020). PGM-explainer: Probabilistic graphical model explanations for graph neural networks. arXiv:2010.05788.
Wu, F., Souza, A., Zhang, T., Fifty, C., Yu, T., & Weinberger, K. (2019). Simplifying graph convolutional networks. In Proceedings of the International Conference on Machine Learning (pp. 6861–6871).
Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., & Yu, P. S. (2021). A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 32(1), 4–24.
Xu, K., Hu, W., Leskovec, J., & Jegelka, S. (2018). How powerful are graph neural networks? arXiv:1810.00826.
Xu, K., Zhang, M., Li, J., Du, S. S., Kawarabayashi, K.-i., & Jegelka, S. (2020). How neural networks extrapolate: From feedforward to graph neural networks. arXiv:2009.11848.
Yang, G. (2019). Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv:1902.04760.
Ying, R., Bourgeois, D., You, J., Zitnik, M., & Leskovec, J. (2019). GNNExplainer: Generating explanations for graph neural networks. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, & R. Garnett (Eds.), Advances in neural information processing systems, 32, 9240. Red Hook, NY: Curran.
Yuan, H., Yu, H., Gui, S., & Ji, S. (2020). Explainability in graph neural networks: A taxonomic survey. arXiv:2012.15445.
Zhang, Y., Defazio, D., & Ramesh, A. (2021). RelEx: A model-agnostic relational model explainer. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society (pp. 1042–1049). New York: ACM.
Zhou, J., Cui, G., Hu, S., Zhang, Z., Yang, C., Liu, Z., . . . Sun, M. (2020). Graph neural networks: A review of methods and applications. AI Open, 1, 57–81. https://www.sciencedirect.com/science/article/pii/S2666651021000012