Abstract
The graph convolutional network (GCN) is a powerful deep model for graph data. However, the explainability of GCNs remains a difficult problem, since the training behavior of graph neural networks is hard to describe. In this work, we show that for a GCN with a wide hidden feature dimension, the output on a semisupervised problem can be described by a simple differential equation. Moreover, the output dynamics are governed by the graph convolutional neural tangent kernel (GCNTK), which is stable as the width of the hidden features tends to infinity. The solution of the node classification problem can then be explained directly by the differential equation for the semisupervised problem. Experiments on toy models confirm the consistency between the GCNTK model and the GCN.
1 Introduction
Graph neural networks (GNNs) are widely used for non-Euclidean data (Wu et al., 2021). A typical kind of GNN, the graph convolutional network (GCN; Kipf & Welling, 2016; Wu et al., 2019), achieves strong performance in several fields, including social networks and transportation (Zhou et al., 2020; Wu et al., 2021). Therefore, many researchers have focused on theoretical aspects of GNNs, especially GCNs, including expressive power (Xu, Hu, Leskovec, & Jegelka, 2018; Loukas, 2020; Chen, Villar, Chen, & Bruna, 2019) and generalization capability (Garg, Jegelka, & Jaakkola, 2020; Scarselli, Tsoi, & Hagenbuchner, 2018; Xu et al., 2020). The explainability of GNNs, which concerns the underlying relationships behind predictions, has also attracted broad attention (Yuan, Yu, Gui, & Ji, 2020). Since many GNNs are proposed without explanation, they are treated as black boxes and cannot be trusted in critical applications involving privacy and safety. It is therefore necessary to develop explanation techniques that reveal the causal relationships behind GNN predictions.
Some methods explore GNNs' explainability by identifying the nodes important to a prediction. Gradient- and feature-based methods (Baldassarre & Azizpour, 2019; Pope, Kolouri, Rostami, Martin, & Hoffmann, 2019) and perturbation-based methods (Ying, Bourgeois, You, Zitnik, & Leskovec, 2019; Luo et al., 2020; Schlichtkrull et al., 2020; Funke et al., 2021) employ different model metrics, such as gradients or robustness to perturbations, to indicate the importance of different nodes, while decomposition methods (Schwarzenberg, Hübner, Harbecke, Alt, & Hennig, 2019; Schnake et al., 2020) and surrogate methods (Huang et al., 2020; Vu & Thai, 2020; Zhang et al., 2021) decompose GNNs or fit a surrogate model to simplify and explain them. However, most of this work seeks to understand GNN predictions by post hoc explanation, which means the explanations cannot be used to predict the GNN output before training.
In this article, we focus on GCN behavior on the node classification problem (Kipf & Welling, 2016) and aim to interpret GCNs in a way that can predict their output before training. Since the objective function for training a GCN is often nonconvex, it is hard to analyze GCN behavior directly. Recently, the neural tangent kernel (NTK; Bartlett, Helmbold, & Long, 2018; Arora et al., 2019; Jacot, Gabriel, & Hongler, 2018) has been proposed to analyze deep neural networks, including GNNs, from different perspectives. Du, Hou, et al. (2019) use the aggregation and combination formulation of infinitely wide GNNs to define a graph NTK (GNTK) at the graph level, which can predict the results of graph-level classification problems. Following the definition of the GNTK, Huang et al. (2021) define an NTK of GCN at the node level and focus on the trainability of ultrawide GCNs. However, neither work can be applied directly to analyze node classification problems based on the GNTK.
Here, we establish a GCN tangent kernel (GCNTK) in matrix form and use it to analyze the learning dynamics of wide GCNs under gradient descent for node classification problems. The GCNTK can also predict the labels of test nodes and helps explain the importance of different training nodes for the prediction.
We summarize our contributions as follows:
The convergence of GCNTK. Since the input and output of a GCN are in matrix form, the formula of the gaussian process for GCN (GCNGP) is complex. We first give explicit formulas for the GCNGP and GCNTK and show that the GCNTK converges to a fixed form as the GCN's layer width tends to infinity.
The convergence of training loss and stability of GCNTK. We prove that the loss constrained on the training data set tends to zero as the width of the parameter matrices tends to infinity, and that the GCNTK remains fixed during the training procedure.
Predictions for the test nodes' labels based on linear dynamics. We formally obtain the predictions of the test nodes' labels for an infinite-width GCN. We find that the solution of the semisupervised problem mainly depends on the ratio of the kernel restricted to the training and test data sets. The prediction on the test nodes and the impact of the training nodes can be interpreted directly through the GCNTK.
2 Related Work
2.1 GNNs and GCNs
GNNs learn task-specific node, edge, or graph representations via hierarchical iterative operators and have achieved great success in graph learning tasks. A classical GNN consists of aggregation and combination operators, which iteratively gather information from neighbors. A GCN, as a typical GNN, defines the convolution in the spectral domain and applies a filter operator to the feature components. Its aggregation function can be treated as a weighted summation of neighbors' features. In particular, the GCN proposed in Kipf and Welling (2016) uses a 1-localized ChebNet to define the convolution and obtains the model in equation 3.1 with a bias term, which has significant advantages for node classification problems.
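For concreteness (equation 3.1 is not reproduced here), the following is a minimal numpy sketch of the standard Kipf-Welling propagation rule $H^{(l+1)} = \sigma\big(\hat{A} H^{(l)} W^{(l)}\big)$ with $\hat{A} = \tilde{D}^{-1/2}(A + I)\tilde{D}^{-1/2}$; the bias term is omitted and the function names are illustrative:

```python
import numpy as np

def normalized_adjacency(A):
    """Symmetrically normalized adjacency with self-loops: D~^{-1/2} (A + I) D~^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)                 # degrees including self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def gcn_layer(A_hat, H, W, activation=np.tanh):
    """One propagation step: aggregate neighbor features, then apply a linear map."""
    return activation(A_hat @ H @ W)
```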
2.2 Explainability of GNNs
The explainability of GNNs can be explored by identifying the nodes important to a GNN's prediction. For example, SA and guided BP (Baldassarre & Azizpour, 2019) use squared gradient values as the importance and contributions of different nodes, while Grad-CAM (Pope et al., 2019) maps the final layer back to the input node space to generate importance scores. GNN perturbation methods, including GNNExplainer (Ying et al., 2019), PGExplainer (Luo et al., 2020), GraphMask (Schlichtkrull, De Cao, & Titov, 2020), and ZORRO (Funke, Khosla, & Anand, 2021), consider the influence of node perturbations on predictions. Surrogate and decomposition models, including GraphLime (Huang et al., 2020), PGM-explainer (Vu & Thai, 2020), Relex (Zhang, Defazio, & Ramesh, 2021), LRP (Baldassarre & Azizpour, 2019), and GNN-LRP (Schnake et al., 2020), explain GNNs by employing a simple, interpretable surrogate model to approximate the predictions. However, most of these models explain the GNN after the fact, and the GNN remains a black box to some extent: they can explain the GNN's output but cannot predict the results before training.
Starting from a spectral GCN, a classical and special GNN, this article addresses GCN's interpretability and analyzes the causal relationship between the predictions and the training nodes, which helps predict GCN's output before training.
2.3 Neural Tangent Kernel
Based on the gaussian process (GP; Neal, 1995; Lee et al., 2018; de G. Matthews, Rowland, Hron, Turner, & Ghahramani, 2018) property of deep neural networks, Jacot et al. (2018) introduce the neural tangent kernel and describe the exact dynamics of a fully connected network's output under gradient flow training in the overparameterized regime. This initial work has been followed by a series of studies, including more precise characterizations (Arora et al., 2019; Lee et al., 2019), generalizations to different initializations (Liu, Zhu, & Belkin, 2020; Sohl-Dickstein, Novak, Schoenholz, & Lee, 2020), and NTKs for different network architectures (Arora et al., 2019; Li et al., 2021; Luo, Xu, Ma, & Zhang, 2021).
For graph deep learning, Du, Hou, et al. (2019) considered graph-level supervised problems and defined the graph NTK (GNTK) at the graph level based on the GNN's aggregation function, which can be used to predict the label of an unlabeled graph. Building on Du, Hou, et al. (2019), Huang et al. (2021) defined a GNTK for GCN and studied how to deepen GCNs. Since the GNTK formulation in Huang et al. (2021) is based on the training nodes, the predictions for test nodes cannot be obtained directly. In this article, we derive a matrix form of the GCN tangent kernel (GCNTK) and provide an explicit form to predict the behavior of test nodes directly.
3 Preliminaries
Let $\mathcal{G} = (V, E)$ be a graph with $N$ vertices, where $V$ represents the set of vertices or nodes. The adjacency matrix $A \in \mathbb{R}^{N \times N}$ represents the link relation, and $X \in \mathbb{R}^{N \times d}$ is the node feature matrix. A typical semisupervised problem defined on the graph $\mathcal{G}$ is node classification. Assume that there exist $C$ classes of nodes and that each node $i$ in $\mathcal{G}$ has an associated one-hot label $y_i \in \{0, 1\}^{C}$. However, only some of the nodes in $V$, denoted by $V_l$, are annotated. The task of node classification is to predict the labels of the unlabeled nodes in $V_u = V \setminus V_l$.
First, we give some notation in matrix form. Assume that the index sets of the training and test nodes are $T$ and $S$, respectively. For an arbitrary matrix $M$, denote the $i$th row of $M$ by $M_{i,:}$ and the $j$th column by $M_{:,j}$. Let $D_T$ be the diagonal matrix whose diagonal entries are 1 on the training indices and 0 elsewhere, with a similar definition for $D_S$. $M_T$ represents the restriction of $M$ to the training set by rows, with a similar definition for $M_S$. Finally, $M_{TS}$ represents the restriction of $M$ to the training indices by rows and the test indices by columns.
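A short numpy illustration of this restriction notation (the matrix and index sets below are hypothetical):

```python
import numpy as np

N = 5
T, S = [0, 1, 2], [3, 4]                          # hypothetical training / test indices
M = np.arange(N * N, dtype=float).reshape(N, N)   # an arbitrary matrix

D_T = np.diag([1.0 if i in T else 0.0 for i in range(N)])  # diagonal training mask
M_T = M[T, :]                                      # rows restricted to the training set
M_TS = M[np.ix_(T, S)]                             # training rows, test columns
```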
Define $\theta^{(l)} = \operatorname{vec}(W^{(l)})$, the column-wise stacked vector of all parameters in layer $l$. Let $\theta = \big(\theta^{(1)}, \ldots, \theta^{(L)}\big)$ denote all the network parameters, with similar definitions for the parameters of subsets of layers. Let $\theta_t$ denote the network parameters at time $t$, and let $\theta_0$ represent the initial values. Combined with the current parameters $\theta_t$, the output of the GCN is denoted by the function $f(\theta_t, X) \in \mathbb{R}^{N \times C}$, where $C$ represents the number of node classes. To simplify the representation, we write $f_t = f(\theta_t, X)$ and denote its restrictions to the training and test rows by $f_{t,T}$ and $f_{t,S}$.
Equations 3.7 and 3.8 are two differential equations that describe the evolution of the parameters and of the output, respectively. The behavior of the output depends on the time-dependent GCNTK $\hat{\Theta}_t$. However, $\hat{\Theta}_t$ depends on the random draw of $\theta_0$, which makes the differential equations hard to solve. This article assumes that the width tends to infinity in order to obtain the convergence of $\hat{\Theta}_t$.
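For reference, these equations presumably take the standard NTK form under gradient flow with learning rate $\eta$ and loss $\mathcal{L}$ (a sketch of the usual derivation, not a verbatim copy of equations 3.7 and 3.8):

$$
\frac{d\theta_t}{dt} = -\eta\, \nabla_\theta f(\theta_t, X)^{\top}\, \nabla_{f}\mathcal{L}\big(f(\theta_t, X)\big),
\qquad
\frac{d f(\theta_t, X)}{dt} = -\eta\, \hat{\Theta}_t\, \nabla_{f}\mathcal{L}\big(f(\theta_t, X)\big),
$$

where $\hat{\Theta}_t = \nabla_\theta f(\theta_t, X)\, \nabla_\theta f(\theta_t, X)^{\top}$ is the empirical tangent kernel at time $t$; the second equation follows from the first by the chain rule.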
4 Main Results
4.1 Convergence of GCNTK with Respect to Width
The GNTK (Du, Hou, et al., 2019; Huang et al., 2021) gives an NTK formula for GNNs based on individual nodes and on the dynamics over the training set. However, for the semisupervised problem on a graph, the convergence of the GCNTK in matrix form has not been studied, and because its formula involves intricate matrix gradient computations, it has not been given explicitly. Therefore, we first focus on the convergence of the GCNTK.
The proof details for the multi-output GP and the NTK are in appendix A. We give a proof sketch as follows.
Jacot et al. (2018) and Yang (2019) show that the NTK of a fully connected network converges at initialization to a deterministic kernel based on the GP property. Theorem 1 shows that the GCNTK also converges at initialization. Compared with Du, Hou, et al. (2019), which defines the GNTK from a single-node perspective, this article is the first to define the GCNTK in matrix form, which is suitable for the further analysis of node classification problems.
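As a numerical illustration of this convergence (a minimal sketch, not the paper's construction: the two-layer architecture, the NTK-style $1/\sqrt{\text{width}}$ output scaling, and the finite-difference Jacobian below are all illustrative choices), one can estimate the empirical kernel $\hat{\Theta}_0 = J(\theta_0)J(\theta_0)^{\top}$ for increasing widths and check that it varies less and less across random initializations:

```python
import numpy as np

def normalized_adjacency(A):
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)
    return np.diag(1.0 / np.sqrt(d)) @ A_tilde @ np.diag(1.0 / np.sqrt(d))

def gcn_output(theta, A_hat, X, width):
    """Two-layer GCN with one output per node and 1/sqrt(width) output scaling."""
    d = X.shape[1]
    W1 = theta[: d * width].reshape(d, width)
    W2 = theta[d * width:].reshape(width, 1)
    H = np.maximum(A_hat @ X @ W1, 0.0)               # ReLU hidden features
    return (A_hat @ H @ W2).ravel() / np.sqrt(width)  # one logit per node

def empirical_gcntk(theta, A_hat, X, width, eps=1e-4):
    """Theta_hat = J J^T, with J estimated by central finite differences."""
    n_out = gcn_output(theta, A_hat, X, width).size
    J = np.zeros((n_out, theta.size))
    for p in range(theta.size):
        e = np.zeros_like(theta); e[p] = eps
        J[:, p] = (gcn_output(theta + e, A_hat, X, width)
                   - gcn_output(theta - e, A_hat, X, width)) / (2 * eps)
    return J @ J.T

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0], [1, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, 0]], float)
A_hat, X = normalized_adjacency(A), np.eye(4)
for width in (8, 64, 512):
    n_params = X.shape[1] * width + width
    k1, k2 = (empirical_gcntk(rng.standard_normal(n_params), A_hat, X, width)
              for _ in range(2))
    print(width, np.linalg.norm(k1 - k2) / np.linalg.norm(k1))
```

The printed relative difference between the two randomly initialized kernels should shrink as the width grows, which is the qualitative content of theorem 1.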
4.2 The Behavior of GCNTK with Respect to Time
Assumption 1 is easy to satisfy in the limit. Assumption 2 holds since the NTK is a product of derivatives. Common activation functions such as ReLU and sigmoid satisfy assumption 3.
The proof of lemma 1 is in section B.3 in online appendix B. It shows that in a neighborhood of the initialization $\theta_0$, the Jacobian is Lipschitz continuous when the width is large enough.
Then, for gradient descent and gradient flow, we have the following main results:
Lee et al. (2019), Du, Lee, Li, Wang, and Zhai (2019), and Allen-Zhu, Li, and Song (2019) show the convergence of overparameterized, fully connected networks and the stability of the NTK, while theorems 2 and 3 extend the convergence and stability results to the GCNTK for a semisupervised problem. As the GCNTK converges at initialization when the width tends to infinity, it also remains fixed during training. Unlike the supervised problem, the convergence rate of the GCN for a semisupervised problem depends on the smallest eigenvalue of the GCNTK constrained on the training nodes. Therefore, for the same graph, training a GCN with fewer training nodes is generally faster. The proof is in appendix B.
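To make the dependence on the smallest eigenvalue explicit, recall the standard linearized-dynamics argument (a sketch under the fixed-kernel and squared-loss assumptions, in the notation above; the constants in the paper's theorems may differ). For training labels $Y_T$,

$$
f_{t,T} - Y_T = e^{-\eta \hat{\Theta}_{TT} t}\big(f_{0,T} - Y_T\big),
\qquad
\big\|f_{t,T} - Y_T\big\| \le e^{-\eta \lambda_{\min}(\hat{\Theta}_{TT})\, t}\,\big\|f_{0,T} - Y_T\big\|,
$$

so the training error decays at a rate governed by $\lambda_{\min}(\hat{\Theta}_{TT})$. By eigenvalue interlacing, a smaller principal block of a symmetric kernel has a smallest eigenvalue at least as large as that of the full block, which is consistent with the observation that training with fewer labeled nodes is generally faster.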
4.3 GCN Explanation Based on Training Dynamics Equation
Under the infinite-width assumption, the GCNTK remains fixed, and the training dynamics can be computed from the evolution equation 3.8. We can thus obtain the explanation of an infinite-width GCN from the solution of the training dynamics equation.
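Concretely, under squared loss and the fixed-kernel assumption, the linear dynamics admit the standard kernel-regression solution (a sketch in the notation above; the paper's exact expression, for example equation 4.32, may differ in constants and normalization). As $t \to \infty$,

$$
f_{\infty, S} \;=\; f_{0,S} + \hat{\Theta}_{ST}\, \hat{\Theta}_{TT}^{-1}\,\big(Y_T - f_{0,T}\big)
\;\approx\; \hat{\Theta}_{ST}\, \hat{\Theta}_{TT}^{-1}\, Y_T ,
$$

so the test outputs are, up to the initial-output term, a linear combination of the training labels, and the rows of $C = \hat{\Theta}_{ST}\hat{\Theta}_{TT}^{-1}$ can be read as contribution coefficients of each training node to each test node.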
5 Experiments
Little previous work has provided explanations of GNNs before training, and to our knowledge, no such model addresses the node classification problem. In this article, we therefore compare only the outputs of the GCN and the GCNTK and verify the consistency of their predictions. We use synthetic data sets to draw conclusions that are implicit in the classical GCN.
Synthetic data sets: To verify the prediction correctness in theorem 4, we build several subgraphs with a few nodes that represent common patterns in various graph data sets. All the nodes in these graphs except one are labeled red or green. We use the GCN and the GCNTK, respectively, to predict the label (red or green) of the unlabeled (blue) node based on the labeled nodes and obtain the contribution of each labeled node.
GCN model: We train the node classification model on different types of graphs using the classical GCN of Kipf and Welling (2016), with the feature matrix set to the identity matrix. We predict the label of only one node (shown in blue).
5.1 Experiments on the Star Graph
Theorem 4 shows that the predicted label of the test node is influenced only through the contributions of the training nodes, where the contribution ratio is defined by equation 4.32. For instance, if there is no link between the test node and the training nodes, the contribution of every training node is zero. As a result, no training node has an effect on the predicted node.
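To illustrate how such contribution coefficients can be computed numerically, the sketch below uses a one-layer linear GCN $f = \hat{A}XW$ as a toy surrogate, for which the tangent kernel is exactly $\hat{\Theta} = (\hat{A}X)(\hat{A}X)^{\top}$; the star graph, the choice of test node, and the labels are hypothetical, and this is not the paper's exact GCNTK or equation 4.32:

```python
import numpy as np

def normalized_adjacency(A):
    """A_hat = D~^{-1/2} (A + I) D~^{-1/2}, as in Kipf and Welling (2016)."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

# Star graph on 4 nodes: node 0 is the (unlabeled) center, nodes 1-3 are labeled leaves.
A = np.zeros((4, 4))
A[0, 1:] = A[1:, 0] = 1.0
X = np.eye(4)                                 # identity features, as in the experiments
A_hat = normalized_adjacency(A)

Theta = (A_hat @ X) @ (A_hat @ X).T           # tangent kernel of the linear surrogate
S, T = [0], [1, 2, 3]                         # test node and training nodes
C = Theta[np.ix_(S, T)] @ np.linalg.inv(Theta[np.ix_(T, T)])
print("contribution of each training node:", C)

Y_T = np.array([[1, 0], [1, 0], [0, 1]], float)  # nodes 1, 2 red; node 3 green (one-hot)
print("predicted score for node 0:", C @ Y_T)    # weighted vote of the training labels
```

The predicted label of node 0 is then the class with the largest score, that is, a contribution-weighted vote over the training labels.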
5.2 Common Patterns with Four Nodes
| Figure | Contribution ratio (nodes 1, 2, 3) | Figure | Contribution ratio (nodes 1, 2, 3) |
|---|---|---|---|
| a | (0.38, 0.31, 0.31) | b | (0, 1, 0) |
| c | (0.33, 0.33, 0.33) | d | (0, 0, 1) |
| e | (0.10, 0.45, 0.45) | f | (0.33, 0.33, 0.33) |
| g | (0.34, 0.74, 0.60) | h | (0.29, 0.42, 0.29) |
| i | (1.5, 0.85, 0.35) | j | (0.16, 1.32, 0.16) |
In Figures 2c and 2f, all the labeled nodes make the same positive contribution to the predicted node. Therefore, the majority color among the labeled nodes decides the color of node 0. In Figures 2g, 2e, and 2h, different nodes make different positive contributions to node 0, and in Figures 2g, 2i, and 2j, some nodes even make negative contributions. We find that node 2 in Figure 2g, node 1 in Figure 2i, and node 2 in Figure 2j make the greatest contributions to the predicted node 0.
5.3 A Special Case
6 Conclusion
Graph convolutional networks (GCNs) are widely studied and perform well on node classification problems. However, the causal relationship underlying their predictions is unknown, which restricts their application in areas concerning security and privacy. To interpret the GCN, we assume that it has infinite width. We define the GCNTK to analyze the GCN training procedure and predict its output. First, we prove that the GCNTK converges and remains stable as the width tends to infinity. Then we show that for a GCN with infinite width, the output values of the unlabeled nodes can be predicted by a linear combination of the training nodes' labels. The coefficients of the training nodes can be computed from the GCNTK and indicate the importance of the training nodes to the unlabeled nodes. Finally, we conduct experiments on synthetic data sets, including common patterns in small model graphs, to demonstrate the effectiveness of the GCNTK.
Appendix A: Computing GCNGP and GCNTK
Appendix B: Convergence of GCNTK to Its Linearization, and Stability of GCNTK
In this section, we give a simple proof of the global convergence on the training data set under gradient descent and gradient flow, together with the stability of the GCNTK. With a subtle difference from the NTK initialization, we give the proof based on standard parameterization. The GCN is generated with standard parameterization by equations 4.9 and 4.10.
B.1 Proof of Theorem 3
B.2 Proof of Theorem 4
B.3 Proof of Lemma 1
Acknowledgments
This work was supported by the National Natural Science Foundation of China (grant 61977065) and the National Key Basic Research Program (grant 2020YFA0713504).