## Abstract

The graph convolutional network (GCN) is a powerful deep model for graph data. However, the explainability of GCNs remains a difficult problem, since the training behavior of graph neural networks is hard to describe. In this work, we show that for a GCN with wide hidden feature dimensions, the output for a semisupervised problem can be described by a simple differential equation. Moreover, the dynamics of the output are governed by the graph convolutional neural tangent kernel (GCNTK), which is stable as the width of the hidden features tends to infinity. The solution of node classification for a semisupervised problem can then be explained directly by the differential equation. Experiments on toy models confirm the consistency between the GCNTK model and GCN.

## 1 Introduction

Graph neural networks (GNNs) are widely used to deal with non-Euclidean data (Wu et al., 2021). A typical kind of GNN, the graph convolutional network (GCN; Kipf & Welling, 2016; Wu et al., 2019), achieves great performance in several fields, including society and transportation (Zhou et al., 2020; Wu et al., 2021). Therefore, many researchers focus on theoretical aspects of GNNs, especially GCNs, including expressive power (Xu, Hu, Leskovec, & Jegelka, 2018; Loukas, 2020; Chen, Villar, Chen, & Bruna, 2019) and generalization capability (Garg, Jegelka, & Jaakkola, 2020; Scarselli, Tsoi, & Hagenbuchner, 2018; Xu et al., 2020). The explainability of GNNs, which concerns the underlying relationships behind predictions, has also attracted broad attention (Yuan, Yu, Gui, & Ji, 2020). Since many GNNs are proposed without accompanying explanations, they are treated as black boxes and cannot be trusted in critical applications concerning privacy and safety. It is therefore necessary to develop explanation techniques to study the causal relationships behind GNN predictions.

Some methods explore GNNs' explainability by identifying the nodes important to their predictions. Gradient- and feature-based methods (Baldassarre & Azizpour, 2019; Pope, Kolouri, Rostami, Martin, & Hoffmann, 2019) and perturbation-based methods (Ying, Bourgeois, You, Zitnik, & Leskovec, 2019; Luo et al., 2020; Schlichtkrull et al., 2020; Funke et al., 2021) employ different model metrics, such as gradients or robustness to perturbations, to indicate the importance of different nodes, while decomposition methods (Schwarzenberg, Hübner, Harbecke, Alt, & Hennig, 2019; Schnake et al., 2020) and surrogate methods (Huang et al., 2020; Vu & Thai, 2020; Zhang et al., 2021) decompose GNNs or find a surrogate model to simplify and explain them. However, most of this work seeks to understand GNN predictions through post hoc explanation, which means the explanations cannot be used to predict GNN output before training.

In this article, we focus on GCN behavior on the node classification problem (Kipf & Welling, 2016) and try to interpret GCN in a way that can predict its output before training. Since the objective function for training a GCN is often nonconvex, it is hard to analyze GCN behavior directly. Recently, the neural tangent kernel (NTK; Bartlett, Helmbold, & Long, 2018; Arora et al., 2019; Jacot, Gabriel, & Hongler, 2018) has been proposed to analyze deep neural networks, including GNNs, from different perspectives. Du, Hou, et al. (2019) use the aggregation and combination formulation of infinitely wide GNNs to define the graph NTK (GNTK) at graph level, which can predict the results of graph-level classification problems. Following the definition of the GNTK, Huang et al. (2021) define an NTK of GCN at node level and focus on the trainability of ultrawide GCNs. However, neither of these works can be applied to analyze node classification problems directly based on the GNTK.

Here, we establish a GCN tangent kernel (GCNTK) in matrix form and use it to analyze the learning dynamics of wide GCNs under gradient descent for node classification problems. The GCNTK can also predict the test nodes' labels and helps explain the importance of different training nodes for prediction.

We summarize our contributions as follows:

- *The convergence of GCNTK*. Since the input and output of GCN are in matrix form, the formula of the gaussian process for GCN (GCNGP) is complex. We first give explicit formulas for the GCNGP and GCNTK and demonstrate that the GCNTK converges to a fixed form as the GCN's layer width tends to infinity.
- *The convergence of training loss and stability of GCNTK*. We prove that the loss constrained on the training data set tends to zero as the width of the parameter matrices tends to infinity, and that the GCNTK remains fixed during the training procedure.
- *Predictions for the test nodes' labels based on linear dynamics*. We formally obtain the predictions of the test nodes' labels with an infinite-width GCN. We find that the solution of the semisupervised problem mainly depends on the ratio of the kernel restricted to the training and test data sets. The prediction on the test nodes and the impact of the training nodes can be interpreted well by the GCNTK.

## 2 Related Work

### 2.1 GNNs and GCNs

GNNs learn task-specific node/edge/graph representations via hierarchical iterative operators and achieve great success in graph learning tasks. A classical GNN consists of aggregation and combination operators, which gather information from neighbors iteratively. GCN, a typical GNN, defines the convolution in the spectral domain and applies a filter operator to the feature components. Its aggregation function can be treated as a weighted summation of neighbors' features. In particular, the GCN proposed in Kipf and Welling (2016) uses a 1-localized ChebNet to define the convolution and obtains the model in equation 3.1 with a bias term, which has significant advantages in dealing with node classification problems.
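As a concrete illustration of this propagation rule, the sketch below implements a single GCN layer with the symmetrically normalized adjacency of Kipf and Welling (2016), including self-loops. The helper name `gcn_layer` and the choice of `tanh` as the activation are ours, for illustration only:

```python
import numpy as np

def gcn_layer(A, H, W, act=np.tanh):
    """One GCN layer H' = act(A_hat @ H @ W), where A_hat is the
    symmetrically normalized adjacency with self-loops:
    A_hat = D^{-1/2} (A + I) D^{-1/2}.
    Shapes: A is (N, N), H is (N, d_in), W is (d_in, d_out)."""
    A_tilde = A + np.eye(A.shape[0])           # add self-loops
    deg = A_tilde.sum(axis=1)                  # degrees of A + I
    D_inv_sqrt = np.diag(deg ** -0.5)
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt  # weighted neighbor averaging
    return act(A_hat @ H @ W)
```

Each output row is thus a nonlinear transform of a degree-weighted average over the node's neighborhood, which is exactly the "weighted summation of neighbors' features" view above.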

### 2.2 Explainability of GNNs

The explainability of GNNs can be explored by identifying the nodes important to a GNN's prediction. For example, SA and guided BP (Baldassarre & Azizpour, 2019) use squared gradient values as the importance and contribution of different nodes, while Grad-CAM (Pope et al., 2019) maps the final layer back to the input node space to generate importance scores. GNN perturbation methods, including GNNExplainer (Ying et al., 2019), PGExplainer (Luo et al., 2020), GraphMask (Schlichtkrull, De Cao, & Titov, 2020), and ZORRO (Funke, Khosla, & Anand, 2021), consider the influence of node perturbations on predictions. Surrogate and decomposition models, including GraphLime (Huang et al., 2020), PGM-Explainer (Vu & Thai, 2020), RelEx (Zhang, Defazio, & Ramesh, 2021), LRP (Baldassarre & Azizpour, 2019), and GNN-LRP (Schnake et al., 2020), explain GNNs by employing a simple and interpretable surrogate model to approximate the predictions. However, most of these models explain the GNN after the fact, so the GNN remains a black box to some extent: they can explain a GNN's output but cannot predict the results before training.

Starting from the spectral GCN, a classical and special GNN, this article addresses GCN's interpretability and analyzes the causal relationship between its predictions and the training nodes, which helps to predict GCN's output beforehand.

### 2.3 Neural Tangent Kernel

Based on the gaussian process (GP; Neal, 1995; Lee et al., 2018; de G. Matthews, Rowland, Hron, Turner, & Ghahramani, 2018) property of deep neural networks, Jacot et al. (2018) introduce the neural tangent kernel and describe the exact dynamics of a fully connected network's output under gradient-flow training in the overparameterized regime. This initial work has been followed by a series of studies, including more precise descriptions of the dynamics (Arora et al., 2019; Lee et al., 2019), generalizations to different initializations (Liu, Zhu, & Belkin, 2020; Sohl-Dickstein, Novak, Schoenholz, & Lee, 2020), and NTKs for different neural network structures (Arora et al., 2019; Li et al., 2021; Luo, Xu, Ma, & Zhang, 2021).

As for graph deep learning, Du, Hou, et al. (2019) considered graph-level supervised problems and defined the graph NTK (GNTK) at graph level based on GNNs' aggregation functions, which can be used to make predictions for an unlabeled graph. Building on Du, Hou, et al. (2019), Huang et al. (2021) defined a GNTK for GCN and studied how to deepen GCNs. Since the GNTK formulation in Huang et al. (2021) is based only on the training nodes, predictions for test nodes cannot be obtained directly. In this article, we derive a matrix form of the GCN tangent kernel (GCNTK) and provide an explicit form to predict the behavior of test nodes directly.

## 3 Preliminaries

Let $G=(V,E,X)$ be a graph with $N$ vertices, where $V=\{v_1,v_2,\ldots,v_N\}$ is the set of vertices or nodes. The adjacency matrix $A=(a_{ij})_{N\times N}$ represents the link relations, and $X\in\mathbb{R}^{N\times d_0}$ is the node feature matrix. A typical semisupervised problem defined on a graph is node classification. Assume that there exist $k$ classes of nodes and that each node $v_i$ in $G$ has an associated one-hot label $y_i\in\mathbb{R}^k$. However, only some of the nodes in $V$, denoted by $V_{train}$, are annotated. The task of node classification is to predict the labels of the unlabeled nodes in $V_{test}$.

First, we give some notation in matrix form. Let the index sets of the training and test sets be $I_{train}=\{i\,|\,v_i\in V_{train}\}$ and $I_{test}=\{i\,|\,v_i\in V_{test}\}$, respectively. For an arbitrary matrix $X=(X_{ij})_{N\times d}$, denote the $i$th row of $X$ by $X_{i,:}$ and the $j$th column by $X_{:,j}$. Let $\mathbf{I}_{train}\in\mathbb{R}^{N\times N}$ be the diagonal matrix whose diagonal entries are 1 at the training indices and 0 elsewhere, with $\mathbf{I}_{test}$ defined similarly. Then $X_{train}=\mathbf{I}_{train}X\in\mathbb{R}^{N\times d}$ represents $X$ restricted to the training set by rows, with $X_{test}$ defined similarly. For an arbitrary matrix $A\in\mathbb{R}^{N\times N}$, $A_{train,test}=\mathbf{I}_{train}A\,\mathbf{I}_{test}$ represents $A$ restricted to the training indices by rows and the test indices by columns.
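The restriction notation above can be illustrated numerically; the 5-node size and the particular index sets below are arbitrary examples, not from the paper:

```python
import numpy as np

N = 5
train_idx = [0, 1, 2]  # hypothetical training indices
test_idx = [3, 4]      # hypothetical test indices

# Diagonal indicator matrices I_train and I_test from the notation above.
I_train = np.diag([1.0 if i in train_idx else 0.0 for i in range(N)])
I_test = np.diag([1.0 if i in test_idx else 0.0 for i in range(N)])

A = np.arange(N * N, dtype=float).reshape(N, N)
# Rows restricted to the training set, columns to the test set;
# all other entries are zeroed out, but the N x N shape is kept.
A_train_test = I_train @ A @ I_test
```

Keeping the full $N\times N$ shape (rather than slicing out a submatrix) is what lets the later dynamics equations be written uniformly over all nodes.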

Define $\theta^l \equiv \mathrm{vec}(\{\omega^l_{ij},\beta^l_{ij}\})$, the column-wise stacked $(N+d_{l-1})d_l\times 1$ vector of all parameters in layer $l$. Let $\theta=\mathrm{vec}(\cup_{l=1}^{L+1}\theta^l)$ denote all the network parameters and $\theta^{\le l_0}=\mathrm{vec}(\cup_{l=1}^{l_0}\theta^l)$, with a similar definition for $\theta^{\ge l_0}$. Let $\theta_t$ denote the network parameters at time $t$, with $\theta_0$ the initial values. Combined with the current parameters $\theta_t$, the output of GCN is denoted by the function $f(X,\theta_t)=H^{L+1}(X,\theta_t)\in\mathbb{R}^{N\times d_{L+1}}$, where $d_{L+1}=k$ is the number of node classes. To simplify notation, we write $f(X)=f(X,\theta)$ and $f_t(X)=f(X,\theta_t)$.

Equations 3.7 and 3.8 are two differential equations that describe the evolution of the parameters and the output, respectively. The behavior of the output depends on the time-dependent GCNTK $\Theta_t$. However, $\Theta_t$ depends on the random draw of $\theta_t$, which makes the differential equations hard to solve. This article assumes that the width tends to infinity in order to obtain the convergence of $\Theta_t$.
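In the infinite-width limit, where the kernel no longer changes in time, the output equation restricted to the training set reduces (for a squared-error loss, which we assume here) to the linear ODE $\dot f_t = -\Theta\,(f_t - y)$, which has a closed-form solution. The sketch below, with our own helper name `ntk_gradient_flow`, illustrates this limiting dynamics for a fixed symmetric positive-semidefinite kernel; it is an illustration of the limit, not the paper's finite-width equation 3.8:

```python
import numpy as np

def ntk_gradient_flow(Theta, f0, y, t):
    """Closed-form solution of df/dt = -Theta @ (f - y) for a constant
    symmetric PSD kernel Theta:  f_t = y + exp(-t * Theta) @ (f0 - y).
    The matrix exponential is computed via eigendecomposition."""
    lam, Q = np.linalg.eigh(Theta)              # Theta = Q diag(lam) Q^T
    decay = Q @ np.diag(np.exp(-t * lam)) @ Q.T # exp(-t * Theta)
    return y + decay @ (f0 - y)
```

The slowest-decaying mode corresponds to the smallest eigenvalue of the kernel on the training set, which is the quantity that governs the convergence rate discussed in section 4.2.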

## 4 Main Results

### 4.1 Convergence of GCNTK with Respect to Width $d$

The GNTK (Du, Hou, et al., 2019; Huang et al., 2021) gives an NTK formula for GNNs based on distinct nodes and the dynamic behavior on the training set. However, for the semisupervised problem on a graph, the convergence of the GCNTK in matrix form has not been studied. Since the formula for the GCNTK involves complex matrix gradient computations, it has not previously been given exactly. Therefore, we first focus on the convergence of the GCNTK.

The proof details of multi-output GPs and NTK are in appendix A. We give the proof sketch as follows.

Jacot et al. (2018) and Yang (2019) show that the NTK of fully connected networks at initialization, $\Theta_0$, converges to a deterministic kernel $\Theta$ based on the GP property. Theorem 1 shows that the GCNTK also converges at initialization. Compared with Du, Hou, et al. (2019), who define the GNTK from a single-node perspective, this article first defines the GCNTK in matrix form, which is suitable for further analysis of node classification problems.

### 4.2 The Behavior of GCNTK with Respect to Time $t$

For the GCN model defined in equations 4.9 and 4.10, the widths of all hidden layers are identical: $d_1=d_2=\cdots=d_L=d$.

The analytic GCNTK kernel $\Theta$ in equation 4.11 is of full rank and positive definite.

- The activation function $\varphi$ is Lipschitz continuous and smooth, satisfying
$$|\varphi(0)|,\ \|\varphi'\|_\infty,\ \sup_{x\neq\tilde x}\frac{|\varphi'(x)-\varphi'(\tilde x)|}{|x-\tilde x|}=C_2<\infty, \tag{4.12}$$
$$\sup_{x\neq\tilde x}\frac{|\varphi(x)-\varphi(\tilde x)|}{|x-\tilde x|}=C_1<\infty. \tag{4.13}$$

Assumption 1 is easy to satisfy in the limiting condition. Assumption 2 holds since the NTK kernel is a product of derivatives. Common activation functions like ReLU and sigmoid satisfy assumption 3.

The proof of lemma 1 is in section B.3 of online appendix B. It shows that in the neighborhood of the initialization $\theta_0$, the Jacobian $J(\theta)$ is Lipschitz continuous when $d$ is large enough.

Then for the gradient descent and gradient flow, we have the following main results:

Lee et al. (2019), Du, Lee, Li, Wang, and Zhai (2019), and Allen-Zhu, Li, and Song (2019) show the convergence of overparameterized fully connected networks and the stability of the NTK, while theorems 2 and 3 extend these convergence and stability results to the GCNTK for a semisupervised problem. As the GCNTK converges at initialization when the width tends to infinity, it also remains fixed during training. Unlike the supervised problem, the convergence rate of GCN for a semisupervised problem depends on the smallest eigenvalue of the GCNTK constrained on the training nodes. Therefore, for the same graph, training a GCN with fewer training nodes is generally faster. The proof is in appendix B.

### 4.3 GCN Explanation Based on Training Dynamics Equation

Under the infinite-width assumption, the GCNTK remains fixed, and the training dynamics can be computed from the evolution equation 3.8. We can then obtain an explanation of the infinite-width GCN based on the solution of the training dynamics equation.
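The limiting solution can be sketched as kernel regression. Under the assumptions that the loss is squared error and the contribution of the initial output is dropped, the converged test prediction is a linear combination of the training labels, with coefficient matrix $M=\Theta_{test,train}\Theta_{train,train}^{-1}$ (the ratio of kernel blocks described in the contributions above). The helper name `predict_test_labels` is ours, for illustration:

```python
import numpy as np

def predict_test_labels(Theta, train_idx, test_idx, y_train):
    """At convergence of the linearized dynamics (squared-error loss,
    initial-output term omitted), the test output is
        Theta_test,train @ inv(Theta_train,train) @ y_train.
    Row i of M gives each training node's contribution to test node i."""
    K_tt = Theta[np.ix_(train_idx, train_idx)]  # kernel block on training nodes
    K_st = Theta[np.ix_(test_idx, train_idx)]   # test-by-train kernel block
    M = K_st @ np.linalg.inv(K_tt)
    return M @ y_train, M
```

In particular, if $\Theta_{test,train}=0$, then $M=0$ and the prediction is unaffected by any training label, whatever the labels are.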

## 5 Experiments

Little previous work has provided beforehand explanations of GNNs, and to our knowledge, no beforehand-explanation model has investigated the node classification problem. In this article, we therefore compare only the outputs of GCN and GCNTK and verify the consistency of their predictions. We use synthetic data sets to draw conclusions that are implicit in the classical GCN.

- *Synthetic data sets*: To verify the prediction correctness in theorem 4, we build subgraphs with a few nodes that exhibit common patterns in various graph data sets. All the nodes in these graphs except one are labeled red or green. We use GCN and GCNTK, respectively, to predict the label (red or green) of the unlabeled (blue) node based on the labeled nodes, and we obtain the contribution of those labeled nodes.
- *GCN model*: We train the node classification model on different types of graphs using the classical GCN of Kipf and Welling (2016), where the feature matrix $X$ is set to the identity matrix, and we predict the label of only one node (in blue).

### 5.1 Experiments on the Star Graph

Theorem 4 shows that the predicted label is influenced only by the contributions of the labeled nodes, where the contribution ratio is the $M$ defined by equation 4.32. For instance, if there is no link between the test nodes and the training nodes, then $\Theta_{test,train}=0$. As a result, no training node has an effect on the test nodes.
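To make the no-link case concrete, the following toy computation uses a simple surrogate kernel $\Theta=\hat A\hat A^{\top}$ built from the normalized adjacency. This surrogate stands in for the full GCNTK of equation 4.11, which we do not reproduce here; the qualitative point is the same: an isolated test node receives zero contribution from every training node.

```python
import numpy as np

def normalized_adj(A):
    """D^{-1/2} (A + I) D^{-1/2}, the normalized adjacency with self-loops."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(d ** -0.5)
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

# 4-node example: node 0 links to nodes 1 and 2; node 3 is isolated.
A = np.zeros((4, 4))
A[0, 1] = A[1, 0] = A[0, 2] = A[2, 0] = 1.0

A_hat = normalized_adj(A)
Theta = A_hat @ A_hat.T  # hypothetical surrogate kernel, NOT the full GCNTK

train, test = [0, 1, 2], [3]
M = Theta[np.ix_(test, train)] @ np.linalg.inv(Theta[np.ix_(train, train)])
# Theta_test,train = 0 here, so every contribution in M is zero.
```

Changing the labels of nodes 0 to 2 therefore cannot change the prediction for node 3 under this kernel, matching the theorem's statement for the disconnected case.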

### 5.2 Common Patterns with Four Nodes

| Number | $M$ | Number | $M$ |
|---|---|---|---|
| a | (0.38, 0.31, 0.31) | b | (0, 1, 0) |
| c | (0.33, 0.33, 0.33) | d | (0, 0, 1) |
| e | (0.10, 0.45, 0.45) | f | (0.33, 0.33, 0.33) |
| g | ($-$0.34, 0.74, 0.60) | h | (0.29, 0.42, 0.29) |
| i | (1.5, $-$0.85, 0.35) | j | ($-$0.16, 1.32, $-$0.16) |


In Figures 2c and 2f, all the labeled nodes make the same positive contribution to the predicted node. Therefore, the majority color among those nodes decides the color of node 0. In addition, in Figures 2e, 2g, and 2h, different nodes make different contributions to node 0, and in Figures 2g, 2i, and 2j, some nodes even make negative contributions to node 0. We find that node 2 in Figure 2g, node 1 in Figure 2i, and node 2 in Figure 2j make the greatest contributions to the predicted node 0.

### 5.3 A Special Case

## 6 Conclusion

Graph convolutional networks (GCNs) are widely studied and perform well on node classification problems. However, the causal relationship behind their predictions is unknown, which restricts their application in areas such as security and privacy. To interpret GCN, we assume that the GCN has infinite width. We define the GCNTK to analyze the GCN training procedure and predict its output. First, we prove that the GCNTK converges and is stable as the width tends to infinity. Then we show that for a GCN with infinite width, the output values of the unlabeled nodes can be predicted by a linear combination of the training nodes' labels. The coefficients of the training nodes can be computed from the GCNTK and indicate the importance of the training nodes to the unlabeled nodes. Finally, we conduct experiments on synthetic data sets, including common patterns in small model graphs, to demonstrate the effectiveness of the GCNTK.

## Appendix A: Computing GCNGP and GCNTK

## Appendix B: Convergence of GCNTK to Its Linearization, and Stability of GCNTK

In this section, we give a simple proof of the global convergence of GCN restricted to the training data set under gradient descent and gradient flow. With a subtle difference from the NTK parameterization, we give the proof based on the standard parameterization. The GCN is generated with standard parameterization by equations 4.9 and 4.10.

### B.1 Proof of Theorem 3

### B.2 Proof of Theorem 4

### B.3 Proof of Lemma 1

## Acknowledgments

This work was supported by the National Natural Science Foundation of China (grant 61977065) and the National Key Basic Research Program (grant 2020YFA0713504).

## References

*Proceedings of the International Conference on Machine Learning*

*Advances in neural information processing systems, 32*

*Explainability techniques for graph convolutional networks*

*Proceedings of the International Conference on Machine Learning*

*On the equivalence between graph isomorphism testing and function approximation with GNNs.*

*Proceedings of the International Conference on Learning Representations.*

*Proceedings of the International Conference on Machine Learning*

*Advances in neural information processing systems*

*A neural tangent kernel perspective of GANs.*

*Hard masking for explaining graph neural networks.*

*Proceedings of the International Conference on Machine Learning*

*Local interpretable model explanations for graph neural networks.*

*Towards deepening graph neural networks: A GNTK-based optimization perspective.*

*Advances in neural information processing systems, 32*

*Semi-supervised classification with graph convolutional networks.*

*Proceedings of the International Conference on Learning Representations.*

*Advances in neural information processing systems*

*The future is log-gaussian: ResNets and their infinite-depth-and-width limit at initialization*

*Advances in neural information processing systems*

*Advances in neural information processing systems*

*How hard is to distinguish graphs with graph neural networks?*

*Parameterized explainer for graph neural network.*

*Journal of Machine Learning Research*

*Bayesian learning for neural networks*

*Proceedings of the International Conference on Machine Learning*

*Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*

*Neural Networks*

*Interpreting graph neural networks for NLP with differentiable edge masking*

*Higher-order explanations of graph neural networks via relevant walks*

*Layer-wise relevance visualization in convolutional text graph classifiers*

*On the infinite width limit of neural networks with a standard parameterization.*

*Introduction to the non-asymptotic analysis of random matrices*

*PGM-explainer: Probabilistic graphical model explanations for graph neural networks*

*Proceedings of the International Conference on Machine Learning*

*IEEE Transactions on Neural Networks and Learning Systems*

*How powerful are graph neural networks?*

*How neural networks extrapolate: From feedforward to graph neural networks*

*Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation.*

*Advances in neural information processing systems*

*Explainability in graph neural networks: A taxonomic survey.*

*Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society*

*AI Open*