## Abstract

Data samples collected for training machine learning models are typically assumed to be independent and identically distributed (i.i.d.). Recent research has demonstrated that this assumption can be problematic, as it oversimplifies the manifold of structured data. This has motivated different research areas such as data poisoning, model improvement, and explanation of machine learning models. In this work, we study the influence of a sample on determining the intrinsic topological features of its underlying manifold. We propose the Shapley homology framework, which provides a quantitative metric for the influence of a sample on the homology of a simplicial complex. Our proposed framework consists of two main parts: homology analysis, in which we compute the Betti number of the target topological space, and Shapley value calculation, in which we decompose the topological features of a complex built from data points into contributions of individual points. By interpreting the influence as a probability measure, we further define an entropy that reflects the complexity of the data manifold. Furthermore, we provide a preliminary discussion of the connection of Shapley homology to the Vapnik-Chervonenkis dimension. Empirical studies show that, when the zero-dimensional Shapley homology is used on neighboring graphs, samples with higher influence scores have a greater impact on the accuracy of neural networks that determine graph connectivity, and that regular grammars with higher entropy values are more difficult to learn.

## 1  Introduction

In much machine learning research, it is common practice to assume that the training samples are independent and identically distributed (i.i.d.). As such, samples are implicitly regarded as having equal influence on determining the performance of a machine learning model. Recently, limitations of this assumption have been explored. For example, a recent study (Koh & Liang, 2017) showed that certain training samples can have a significant influence on a model's decisions for certain testing samples. This effect has motivated research on model interpretation (Gunning, 2017; Koh & Liang, 2017; Yeh, Kim, Yen, & Ravikumar, 2018), model and algorithm improvement (Lee & Yoo, 2019; Wang, Zhu, Torralba, & Efros, 2018; Ren, Zeng, Yang, & Urtasun, 2018), and poisoning attacks (Wang & Chaudhuri, 2018; Koh & Liang, 2017; Chen, Liu, Li, Lu, & Song, 2017).

Many of these methods study the sample influence problem by using neural networks. Specifically, it is common to adopt a neural network (either a target model or a surrogate) to identify samples that are deemed as more influential (Koh & Liang, 2017; Wang, Zhu, et al., 2018). However, as Chen et al. (2017) showed in a poisoning attack scenario, a limited number of poisoning samples can be effective in applying a backdoor attack in a model-agnostic manner. This motivates work that studies the intrinsic properties of data so that one can develop countermeasures to methods that defeat learning models. Specifically, by representing the underlying space of data as a topological space, we study the sample influence problem from a topological perspective. The belief that topological features of the data space better represent its intrinsic properties has attracted recent research on establishing relationships between machine learning and topological data analysis (TDA; Chazal & Michel, 2017; Hofer, Kwitt, Niethammer, & Uhl, 2017; Carlsson & Gabrielsson, 2018).

In this work, we consider TDA as a complementary approach to other approaches that study sample influence. We propose a framework for decomposing the topological features of a complex built from data points into contributions of individual points. We interpret the decomposed value for each point as its influence on the topological features of the complex. We then calculate that influence as Shapley values (Narahari, 2012). Recent research (Chen, Song, Wainwright, & Jordan, 2018; Lundberg & Lee, 2017; Datta, Sen, & Zick, 2016) has proposed similar approaches to quantify feature influence. By interpreting the influence as a probability measure defined on a data space, we also calculate the entropy that describes the complexity of this space. Under our framework, we devise an algorithm for calculating the influence of data samples and the entropy for any data set.

We perform both analytical and empirical studies on several sets of data samples from two families of structured data: graphs and strings. Specifically, we generate random graphs using the Erdős-Rényi model and binary strings with regular grammars. In both cases, we use neural networks as verification tools for our analysis of sample influence. Explicitly, we employ a feedforward network trained to determine the connectivity of the generated graphs and a recurrent network trained to recognize the generated strings. Our results show that samples identified by our method as having more significant influence indeed have more impact on the connectivity of their underlying graphs and that grammars with higher complexity are more difficult to learn.

## 2  Related Work

### 2.1  Sample Influence

Recent research has focused on how to explore and exploit influential samples. For example, the observation that a model's performance in the testing phase can be significantly affected by a small number of training samples has motivated research on poisoning attacks (Wang & Chaudhuri, 2018; Koh & Liang, 2017; Chen et al., 2017). In this case, a limited number of corrupted training examples are injected to degrade a target model's performance. An alternative thread of research exploits this effect to improve the generalization accuracy of learning models and the efficiency of learning algorithms. Specifically, influential samples can be identified via learning and then used for enhancing a model's performance on imbalanced and corrupted data (Ren et al., 2018; Lee & Yoo, 2019). They can also be synthesized to represent a much larger set of samples, thus accelerating the learning process (Wang, Zhu, et al., 2018). In addition, influential samples can bring explainability to deep learning models by identifying representative samples or prototypes used in decision making (Yeh et al., 2018; Anirudh, Thiagarajan, Sridhar, & Bremer, 2017).

Typical approaches measure the influence of a data point by decomposing the target function's value over all of the data points according to their contributions (Koh & Liang, 2017). More precisely, they observe the unit change in the value of the target function caused by this point. In most cases, the target function is the output of a trained model. We follow the same approach but change the decomposition rule and, in particular, the target function. In our case, instead of building a model-dependent framework, we reveal the internal dependencies of data in a metric space.

### 2.2  Topological Data Analysis

Currently, the most widely used tool in topological data analysis (TDA) is persistent homology (Chazal & Michel, 2017), which is an algebraic method for measuring the topological features of shapes and functions. This method provides a multiscale analysis of the underlying geometric structures represented by persistence bar codes or diagrams. Most of the previous research using persistent homology has focused on the classification task by identifying manifolds constructed by data points (Carlsson, Ishkhanov, De Silva, & Zomorodian, 2008; Li, Ovsjanikov, & Chazal, 2014; Turner, Mukherjee, & Boyer, 2014) and mapping the topological signatures to machine learning representations (Carlsson & Gabrielsson, 2018; Bubenik, 2015). Unlike previous work, our proposed Shapley homology emphasizes revealing the influence of a vertex on the topological features of a space from a microscopic viewpoint, and then assembles these individual influences into an entropy measure that reflects the macroscopic features of the space formed by all vertices.

## 3  Shapley Homology

Here, we introduce the relevant concepts that are necessary for our Shapley homology framework. We then provide the definitions of sample influence and entropy.

### 3.1  Čech Complex and Homology Group

#### 3.1.1  Čech Complex

In topological analysis, the study of a set of data points is typically based on simplicial complexes, which provide more abstract generalizations of neighboring graphs that describe the structural relationships between data points. Here, we are particularly interested in the Čech complex, an abstract simplicial complex. More formally, the definition of the Čech complex is as follows.

Definition 1

(Čech Complex). The Čech complex $C_r(X)$ is the intersection complex, or nerve, of the set of balls of radius $r$ centered at points in $X$.

In particular, given a finite point cloud $X$ in a metric space and a radius $r>0$, the Čech complex $C_r(X)$ can be constructed by first taking the points in $X$ as the vertex set of $C_r(X)$. Then, for each subset $\sigma\subset X$, let $\sigma\in C_r(X)$ if the set of $r$-balls centered at the points of $\sigma$ has a nonempty intersection. Note that the Čech complex has the property that a simplicial complex constructed with a smaller $r$ is a subcomplex of one constructed with a larger $r$. In this way, we can obtain a filtration.
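
For the case studies later in this letter, the construction reduces to a neighborhood graph: two vertices share an edge whenever their distance is at most the radius. The following is a minimal sketch of that 1-skeleton construction (the function name and the toy metric are ours, not from the original):

```python
from itertools import combinations

def neighborhood_graph(points, dist, r):
    """1-skeleton of the complex: an edge joins two points whose
    distance under `dist` is at most the radius r."""
    return [(i, j) for i, j in combinations(range(len(points)), 2)
            if dist(points[i], points[j]) <= r]

# Toy point cloud on the real line with the absolute-difference metric.
pts = [0.0, 0.5, 3.0]
print(neighborhood_graph(pts, lambda a, b: abs(a - b), r=1.0))  # [(0, 1)]
```

Higher-dimensional simplices of the Čech complex would require checking common intersections of the $r$-balls; for the zero-dimensional analysis below, the edges alone suffice.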

#### 3.1.2  Homology

The topological features of a data space are typically formalized and studied through homology, a classical concept in algebraic topology (Chazal & Michel, 2017). In the following, we briefly introduce homology to the extent necessary for understanding its role in our framework.

Definition 2
(Homology Group). A chain complex $(A_\bullet, d_\bullet)$ is a sequence of abelian groups or modules connected by homomorphisms $d_n: A_n \to A_{n-1}$, such that the composition of any two consecutive maps is the zero map. The $k$-th homology group is the group of cycles modulo boundaries in degree $k$, that is,
$H_k := \ker d_k / \operatorname{im} d_{k+1},$
(3.1)
where $\ker$ and $\operatorname{im}$ denote the kernel and image of the homomorphism, respectively.

Generally, the $k$-th homology is a quotient group that indicates the $k$-dimensional independent features of the space $X$. In particular, when $k=0$, we have the following proposition (Hatcher, 2005):

Proposition 1.

For any space $X$, $H0(X)$ is a direct sum of abelian groups, one for each path component of $X$.

In particular, when a complex takes the form of a neighboring graph, the zero-dimensional homology corresponds to the connected subgraphs of this graph. As for the one-dimensional homology $H_1$, its rank represents the number of one-dimensional holes in the given manifold.

### 3.2  Sample Influence and Entropy

Given a point cloud represented by a Čech complex, we study the influence of each data point on the topological features of this complex. This influence of a data point can be further interpreted as the probability that a unit change of the topological feature is caused by this point. More formally, denote a data set containing $n$ samples by $X=\{x_1,\ldots,x_n\}$; then we have the following definition of sample influence.

Definition 3

(Sample Influence). Given a discrete space $(X,\Sigma)$, the influence of any sample (or subset) $\{x\}\subset X$ is a probability measure $\mu$.

We then define an entropy that shares a similarity to the entropy in information theory; both reflect the number of internal configurations or arrangements of a system.

Definition 4

(Entropy). Given a probability measure $\mu$, the entropy of a data set $X$ is defined as $H(X)=-\sum_{i=1}^{n}\mu(x_i)\log\mu(x_i)$.
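
As a quick illustration, definition 4 can be computed directly from an influence measure; this sketch (the function name and input format are ours) assumes the values already form a probability measure:

```python
import math

def entropy(mu):
    """Entropy of definition 4; `mu` maps each sample to its influence,
    with values assumed nonnegative and summing to 1."""
    return -sum(p * math.log(p) for p in mu.values() if p > 0)

# A uniform measure over M samples yields log M, as for grammar g2 later.
M = 8
print(entropy({i: 1 / M for i in range(M)}))  # log 8 ≈ 2.0794
```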

Newly introduced data points may change the number of boundaries needed to separate the data points in the given metric space. Moreover, each of the newly added data points will have a different influence on the resulting number. By interpreting the influence as a probability measure, our defined entropy quantifies the number of boundaries in a data set and thus implies the capacity that a neural network needs to learn this data set.

### 3.3  The Framework of Shapley Homology

Here we propose a framework (depicted in Figure 1) of the Shapley homology in order to study sample influence. Specifically, in Figure 1, we provide an example for investigating a specific topological feature—the Betti number (Rote & Vegter, 2006)—of a topological space. As will become clear, the Betti number can be used to quantify the homology group of topological space according to its connectivity.
Figure 1:

Framework of Shapley homology for influence analysis.

Definition 5
(Betti Number). Given a nonnegative integer $k$, the $k$-th Betti number $\beta_k(X)$ of the space $X$ is defined as the rank of the abelian group $H_k(X)$, that is,
$\beta_k(X)=\operatorname{rank}(H_k(X)).$
(3.2)

Following the definition, provided there are no special structures (such as a real projective space) in the complex, the Betti number $\beta_k(X)$ indicates the number of summands in the direct sum of abelian groups of the $k$-th homology group. In other words, the $k$-th Betti number refers to the number of $k$-dimensional voids on a topological surface.

While the Betti number indicates only an overall feature of a complex, we need to further distribute this number to each vertex of the complex as its influence score. Recent research (Chen et al., 2018) on interpreting neural networks has introduced the Shapley value from cooperative game theory to decompose a classification score that a neural network produces for a specific sample into contributions of individual features, measuring their importance in rendering this classification result. Inspired by this line of work, we also employ the Shapley value as the influence score for each vertex. However, it should be noted that for a fixed $k$, the Betti number $\beta_k$ does not satisfy the monotonicity property required by Shapley values. That is, the Betti number of $X_1$ is not necessarily smaller than that of $X_2$ when $X_1$ is a subcomplex of $X_2$. As a result, we cannot adopt the standard formulation for calculating Shapley values directly. Here we use the following variant formulation for calculating Shapley values for a Čech complex:
$s(x_i)=\sum_{C\subseteq X\setminus\{x_i\}}\frac{|C|!\,(n-|C|-1)!}{n!}\bigl|\beta(C\cup\{x_i\})-\beta(C)\bigr|.$
(3.3)
It is important to note that in our equation 3.3, we use the absolute value to resolve the monotonicity issue.
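
Equation 3.3 can be sketched directly by enumerating subsets; this is exponential in $n$ and meant only for small examples. In this sketch (function names are ours), `betti` is a caller-supplied function returning the Betti number of a subcomplex, with $\beta(\emptyset)=0$:

```python
from itertools import combinations
from math import factorial

def shapley_scores(vertices, betti):
    """Distribute a Betti number over vertices via equation 3.3,
    using the absolute marginal change |beta(C ∪ {x}) - beta(C)|."""
    n = len(vertices)
    scores = {}
    for x in vertices:
        rest = [v for v in vertices if v != x]
        s = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for C in combinations(rest, size):
                s += weight * abs(betti(set(C) | {x}) - betti(set(C)))
        scores[x] = s
    return scores

def beta0_g3(S):
    """beta_0 of the induced subgraph for the three-vertex complex of
    grammar g3 (section 4.1), whose only edge joins vertices 2 and 3."""
    if not S:
        return 0
    return len(S) - (1 if {2, 3} <= S else 0)

print(shapley_scores([1, 2, 3], beta0_g3))  # ≈ {1: 1.0, 2: 0.5, 3: 0.5}
```

Normalizing these scores recovers the influences $\mu(1)=0.5$ and $\mu(2)=\mu(3)=0.25$ reported in section 4.1.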

It is clear that equation 3.3 satisfies the symmetry axiom, whereas it does not satisfy the other Shapley axioms, including the linearity and carrier axioms (Narahari, 2012). Nonetheless, our formulation still measures the marginal effect of a vertex. Moreover, as we mainly focus on measuring topological features, both decreases and increases in the Betti number caused by a vertex are crucial for determining its influence. Furthermore, since our entropy is symmetry invariant, its value remains the same under a group action on the vertices (Conrad, 2008).

The discussion indicates that the influence of a data sample can be regarded as a function of the radius $r$, the Betti number $\beta_k$, the size of the data set containing this sample, and the topological space constructed on the chosen metric. Unlike persistent homology (Edelsbrunner & Harer, 2008), which studies the topological features of data space at varying “resolution” $r$, our analysis takes a more static view of the topological features of a complex built with a fixed $r$. As such, our analysis can be viewed as taking a slice from the filtration used for persistent homology. In the following section, we propose an algorithm for calculating the influence scores of data points in a Čech complex constructed with specified values of $r$ and $k$.

## 4  Algorithm Design and Case Study

By proposition 1 and definition 5, when the Betti number $\beta_0$ is regarded as a quantitative indicator, it equals the number of connected components of a complex. In this case, the task of calculating $\beta_0$ of a Čech complex is equivalent to calculating the number of connected components of a graph. This enables us to compute the Laplacian matrix $L$ of the graph and then apply the following proposition (Marsden, 2013).

Proposition 2.

A graph $G$ has $m$ connected components if and only if the algebraic multiplicity of 0 in the Laplacian is $m$.

With proposition 2, we can see that the Betti number $\beta_0(X)$ is equal to the number of zeros in the spectrum of the corresponding Laplacian matrix $L_X$. As such, we propose algorithm 1 to calculate the influence score and the entropy under the setting of $k=0$ with $r$ fixed to a constant. It is evident that our proposed algorithm 1 always produces a probability measure. A more detailed discussion of the choice of $k$ and $r$ is provided in section 5.
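
The following is a sketch of how we read algorithm 1 for $k=0$ and a fixed $r$ (the function names are ours, and the original pseudocode is not reproduced here): compute $\beta_0$ of every subcomplex as the multiplicity of the zero eigenvalue of its Laplacian (proposition 2), decompose via equation 3.3, and normalize the scores into a probability measure.

```python
from itertools import combinations
from math import factorial, log

import numpy as np

def betti0(adj, subset):
    """beta_0 of the subgraph induced by `subset`: the multiplicity of
    the eigenvalue 0 of its Laplacian (proposition 2), up to tolerance."""
    idx = sorted(subset)
    if not idx:
        return 0
    A = adj[np.ix_(idx, idx)]
    L = np.diag(A.sum(axis=1)) - A
    return int(np.sum(np.abs(np.linalg.eigvalsh(L)) < 1e-8))

def influence_and_entropy(adj):
    """Influence scores (a probability measure) and entropy for the
    complex whose 1-skeleton is the 0/1 adjacency matrix `adj`."""
    n = adj.shape[0]
    s = np.zeros(n)
    for x in range(n):
        rest = [v for v in range(n) if v != x]
        for size in range(n):
            w = factorial(size) * factorial(n - size - 1) / factorial(n)
            for C in combinations(rest, size):
                s[x] += w * abs(betti0(adj, set(C) | {x}) - betti0(adj, C))
    mu = s / s.sum()  # normalize the Shapley scores into a probability measure
    return mu, -sum(p * log(p) for p in mu if p > 0)

# Complex of grammar g3 (section 4.1): vertex 0 isolated, edge between 1 and 2.
adj = np.array([[0, 0, 0], [0, 0, 1], [0, 1, 0]], dtype=float)
mu, H = influence_and_entropy(adj)
print(mu, H)  # ≈ [0.5, 0.25, 0.25] and (3/2) log 2
```

The exponential subset enumeration matches the complexity analysis below; this sketch is only practical for small complexes.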

Note that this algorithm is specifically designed for the zero-dimensional homology group, a special case in our Shapley homology framework. In this case, the metric computed in algorithm 1 can be interpreted as the betweenness centrality (Newman, 2005) of a graph. However, it is important to note that betweenness centrality does not have a natural correspondence to any topological feature, whereas the metric derived for the zero-dimensional homology group precisely matches the Betti number. Furthermore, while the connection with betweenness centrality provides conceptual convenience, it should not obscure the generality of our framework, which extends to homology groups of higher dimensions.

Here we provide a simple time-complexity analysis of our algorithm and propose several possible directions for accelerating it in future work. The algorithm can be decomposed into three parts: (1) complex construction, (2) calculation of the graph spectrum, and (3) assignment of the Shapley value. For the first step, we need to calculate pairwise distances to form the adjacency matrix, which has complexity $O(n^2)$. Second, we need to compute the spectrum of all subcomplexes; since each Laplacian decomposition costs $O(n^3)$, the total complexity is $O(n^3 2^n)$. For the third step, we sum all the marginal utilities for one sample, which costs $O(2^n)$; computing the influence scores for all samples therefore costs $O(n 2^n)$. Based on the above analysis, the overall complexity of our algorithm is $O(n^3 2^n)$. Clearly, the second and third steps contribute most to the complexity. To alleviate the computational burden of the second step, future work will consider approaches that approximate the spectrum of a graph (Cohen-Steiner, Kong, Sohler, & Valiant, 2018). For the third step, existing approximation algorithms, such as C-Shapley and L-Shapley (Chen et al., 2018), can be considered for approximating the Shapley value using local topological properties.

We next provide a set of case studies on several different graphs with representative topological structures. In particular, we study four types of graphs representing the space of binary strings generated by four regular grammars. We select these grammars for their simplicity of demonstration. A brief introduction to the selected grammars is provided in Table 1. Since we deal with binary strings, we specify the distance metric used in these studies as the edit distance (De la Higuera, 2010) and set the radius $r=1$. Also, we set the length $N$ of generated strings to fixed values as specified in Table 1.
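
The edit distance here is the standard Levenshtein distance; a routine dynamic-programming sketch:

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b, computed with the
    usual row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute ca -> cb
        prev = cur
    return prev[-1]

# Strings of grammar g3 with N = 4: only 0000 and 0001 are within radius 1.
print(edit_distance("0000", "0001"), edit_distance("1111", "0001"))  # 1 3
```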

Table 1:
Example Grammars and Their Associated Entropy Values.

| $g$ | Description | Entropy |
| --- | --- | --- |
| $g_1$ | $1^*$ | $0.00$ |
| $g_2$ | Even number of 0s and even number of 1s. | $\log M$ |
| $g_3$ | $1^* + 0^*(1+0)$ | $3\log 2/2$ |
| $g_4$ | An odd number of consecutive 1s is always followed by an even number of consecutive 0s. | $\approx 2.30$ ($N=4$) |

Furthermore, we generalize our proposed framework to six other special sets of graphs and provide analytical results for their vertices' influence and entropy values. These graphs are selected because they represent a set of simple complexes that can be used to build up more complicated topological structures.

### 4.1  A Case Study on Regular Grammars

#### 4.1.1  Simple Examples

Here we calculate and analyze the first three grammars shown in Table 1 as they have simple yet different topological features. As we can see from Table 1, for the first grammar, given a fixed $N$, there exists only one string defined by this grammar. As a result, the influence of this sample is 1, and the entropy $H(g1)=0$.

For the second grammar, it is clear that any pair of valid strings has an edit distance larger than 1. In this case, the complex formed with radius $r=1$ consists of disjoint points. Assuming that there exist $M$ valid strings of length $N$ defined by this grammar, all strings have the same influence $1/M$, and the entropy is $H(g_2)=\log M$. For the third grammar, when the length $N$ of its associated strings is larger than 3, the set $g_3$ of these strings can be expressed as $g_3=\{1^N, 0^N, 0^{N-1}1\}$, denoted as $g_3=\{1,2,3\}$ for notational simplicity. We depict the complex for the case $r=1$ in Figure 2. According to proposition 2, we then have the following Betti numbers $\beta_0$ of each subcomplex:
$\beta_0(G_1)=\beta_0(G_2)=\beta_0(G_3)=\beta_0(G_{2,3})=1,\quad \beta_0(G_{1,2})=\beta_0(G_{1,3})=\beta_0(G_{1,2,3})=2,$
where $G_S$ ($S\subseteq\{1,2,3\}$) denotes the subgraph of $G_{1,2,3}$ formed by the vertex set $S$. According to equation 3.3 and algorithm 1, we have $\mu(1)=0.5$, $\mu(2)=\mu(3)=0.25$, and finally the entropy is $H(g_3)=\frac{3}{2}\log 2$.
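
The Betti numbers above can be checked mechanically with a simple component count over the induced subgraphs, since the only $r=1$ edge joins $0^N$ and $0^{N-1}1$. A small union-find sketch (our own helper, not the paper's algorithm):

```python
def beta0(vertices, edges):
    """Number of connected components of the subgraph induced by
    `vertices` (equivalently, its zero-dimensional Betti number)."""
    parent = {v: v for v in vertices}
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v
    for u, w in edges:
        if u in parent and w in parent:
            parent[find(u)] = find(w)
    return len({find(v) for v in vertices})

edges = [(2, 3)]  # vertices 1 = 1^N, 2 = 0^N, 3 = 0^{N-1}1
print([beta0(S, edges) for S in ({1}, {2, 3}, {1, 2}, {1, 2, 3})])  # [1, 1, 2, 2]
```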
Figure 2:

A simple illustration of the Shapley homology framework on grammar $1^*+0^*(0+1)$.

#### 4.1.2  A Complicated Example

The fourth grammar $g_4$ shown in Table 1 is more complicated than the three grammars above. In particular, letting $N=4$ and $r=1$ or $r=2$, we illustrate the results in Figure 3. The entropy is $H(g_4)=2.292$ when $r=1$ and $H(g_4)=2.302$ when $r=2$. Beyond the analytical results presented here, we further demonstrate the difference between this grammar and the first two grammars in a set of empirical studies in section 6.2.
Figure 3:

The Čech complex, influence scores, and entropy of the fourth grammar with $N=4$.

### 4.2  A Case Study on Special Graphs

In this section, we apply our framework to six special families of graphs, which are shown with examples in Table 2. Due to space constraints, we omit the detailed calculation of the influence scores and entropy values for these graphs. With the analytical results shown in Table 2, it is easy to derive the following two corollaries:

Table 2:
Six Special Sets of Graphs and Their Values of Influence for the Vertices.

| Example | Shapley Value | Influence Score |
| --- | --- | --- |
| Complete graph $K_n$ | $1/n$ | $1/n$ |
| Cycle $C_n$ | $2/3-1/n$ | $1/n$ |
| Wheel graph $W_n$ | Periphery: $1/3-1/(n(n-1))$; Center: $(n^2-7n+18)/(6n)$ | Periphery: $\frac{2(n^2-n-3)}{3(n-1)(n^2-3n+4)}$; Center: $\frac{n^2-7n+18}{3(n^2-3n+4)}$ |
| Star $S_{n-1}$ | Periphery: $1/2$; Center: $(n^2-3n+4)/(2n)$ | Periphery: $\frac{n}{2(n^2-2n+2)}$; Center: $\frac{n^2-3n+4}{2(n^2-2n+2)}$ |
| Path graph $P_n$ | Ends: $1/2$; Middle: $2/3$ | Ends: $\frac{3}{2(2n-1)}$; Middle: $\frac{2}{2n-1}$ |
| Complete bipartite graph $K_{m,n}$ | $m$ side: $\frac{n(n-1)}{m(m+1)(m+n)}+\frac{1}{n+1}$; $n$ side: $\frac{m(m-1)}{n(n+1)(m+n)}+\frac{1}{m+1}$ | $m$ side: $\frac{m^3+n^3+m^2n+mn+m^2-n}{m(m+n)(2m^2+2n^2+m+n-mn-1)}$; $n$ side: $\frac{m^3+n^3+n^2m+mn+n^2-m}{n(m+n)(2m^2+2n^2+m+n-mn-1)}$ |

Corollary 1.

$H(K_n)=H(C_n)>H(W_n)>H(S_{n-1})$ for $n>5$, where $K_n$ denotes the complete graph, $C_n$ the cycle graph, $W_n$ the wheel graph, and $S_{n-1}$ the star graph.

Corollary 2.

Suppose that $m+n$ is constant; then the value of $H(K_{m,n})$ decreases as $|m-n|$ increases, where $K_{m,n}$ denotes the complete bipartite graph.
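
Both corollaries can be spot-checked numerically from the closed-form influence scores in Table 2 (the helper names are ours; we rely on the table's wheel, star, and complete bipartite entries, with a uniform measure $1/n$ for $K_n$ and $C_n$):

```python
from math import log

def H(mu):
    """Entropy of an influence measure given as a list of probabilities."""
    return -sum(p * log(p) for p in mu if p > 0)

def entropy_wheel(n):   # W_n: n - 1 periphery vertices plus a center
    periph = 2 * (n**2 - n - 3) / (3 * (n - 1) * (n**2 - 3*n + 4))
    center = (n**2 - 7*n + 18) / (3 * (n**2 - 3*n + 4))
    return H([periph] * (n - 1) + [center])

def entropy_star(n):    # S_{n-1}: n - 1 periphery vertices plus a center
    periph = n / (2 * (n**2 - 2*n + 2))
    center = (n**2 - 3*n + 4) / (2 * (n**2 - 2*n + 2))
    return H([periph] * (n - 1) + [center])

def entropy_bipartite(m, n):  # K_{m,n}
    d = (m + n) * (2*m**2 + 2*n**2 + m + n - m*n - 1)
    fm = (m**3 + n**3 + m**2*n + m*n + m**2 - n) / (m * d)
    fn = (m**3 + n**3 + n**2*m + m*n + n**2 - m) / (n * d)
    return H([fm] * m + [fn] * n)

# Corollary 1: H(K_n) = H(C_n) = log n > H(W_n) > H(S_{n-1}) for n > 5.
for n in range(6, 12):
    assert log(n) > entropy_wheel(n) > entropy_star(n)

# Corollary 2: with m + n = 10 fixed, H(K_{m,n}) drops as |m - n| grows.
vals = [entropy_bipartite(m, 10 - m) for m in (5, 6, 7, 8)]
assert all(a > b for a, b in zip(vals, vals[1:]))
print([round(v, 3) for v in vals])
```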

## 5  Extended Discussion of Shapley Homology

As previously introduced in section 3.3, our proposed influence score can be viewed as a function of the parameters $k$ and $r$. Here we extend our study from the previous case of $k=0$ to the cases $k>0$ and show that adopting the zero-dimensional homology offers more meaningful and practical implications. We also discuss the effect of selecting different radii $r$ on the topological features we study. More important, we show that our framework provides a more general view of sample influence and can be further adapted to capture the dynamical features of a topological space.

### 5.1  Discussion of the Choice of $k$

We first consider the case $k=1$. In this case, the Betti number $\beta_1$ represents the number of one-dimensional holes in a given manifold. Take the third grammar introduced in section 4.1 as an example. It is clear that when using the one-dimensional homology, every subcomplex has $\beta_1$ equal to zero. This is precisely the case in which our algorithm cannot generate a probability measure. In a more general case, it is easy to show from homology theory (Schapira, 2001) that once every subcomplex has Betti number $\beta_{\hat{K}}=0$ for some $\hat{K}$, then for any $k$-dimensional homology with $k>\hat{K}$, our framework cannot produce a probability measure. When we adopt a $k$-dimensional homology with $k$ larger than two, our framework will highlight the effect of voids being constructed or destroyed in the manifold by introducing a data point. However, the voids in a high-dimensional manifold can be very abstract and difficult to understand intuitively.

Another difficulty of adopting a homology of higher dimension is due to practical concerns. More specifically, since it is challenging to calculate the homology when $k$ is large, one may need to apply tools such as the Euler characteristic or Mayer-Vietoris sequences (Hatcher, 2005), which are beyond the scope of this letter.

### 5.2  Discussion of the Choice of $r$

The radius $r$ plays a critical role in building a manifold. For example, when $r$ is sufficiently small, the resulting complex contains only discrete data points with equal influence in our framework. When $r$ is sufficiently large, the complex is contractible, which indicates that there is no difference between the influence of data points contained in this complex. Both extreme cases produce equal importance for the data points in a data set and correspond to the i.i.d. assumption of sample distribution that is commonly assumed in current machine learning practice. In this sense, our framework provides a general abstraction of sample influence.

In practice, selecting a proper $r$ can be very difficult. This explains the use of persistent homology (Edelsbrunner & Harer, 2008), which studies the evolution of topological features of a space as $r$ dynamically changes. It is not difficult to extend our framework to dynamically generate a series of probability measures for the influence and the corresponding entropy values for a data set. However, a sequence of probability measures with various radii has an unclear physical meaning. In this way, the radius should depend on the specific application and should be predefined as the resolution of the problem of interest.

### 5.3  Remarks on Related Definitions of Entropy

Different notions of entropy have been proposed. Here we briefly revisit several representative definitions and compare them with our entropy definition, which quantifies the number of boundaries in a data set and therefore determines the capacity of a neural network that is needed to learn this data set (Khoury & Hadfield-Menell, 2018).

#### 5.3.1  Graph Entropy

An important property of graph entropy (Rezaei, 2013) is monotonicity, since it is defined based on mutual information. Specifically, the entropy of a subgraph is smaller than that of the whole graph on the same vertex set. In our case, by considering the complex as a graph, our entropy is defined to capture the geometric properties (such as the symmetry invariance mentioned in section 3.3) of a graph. More specifically, our entropy measure focuses on the variation of the topological features when a graph is changed. As such, our definition also covers variations that may violate the monotonicity and subadditivity properties.

#### 5.3.2  Entropy for Grammatical Learning

This entropy (Wang, Zhang, Ororbia, Xing, et al., 2018) is defined to reflect the balance between the populations of strings accepted and rejected by a particular grammar. Under this definition, the entropy of a certain grammar is equal to that of the complement of that grammar. This contradicts the intuition that a grammar accepting a larger number of strings is more likely to have a higher entropy value. Our entropy instead is defined to capture the intrinsic properties of a set of samples rather than the difference between different sets of samples. In this sense, our entropy is more likely to assign a higher value to a set of samples with larger cardinality.

#### 5.3.3  Entropy in Symbolic Dynamics

This type of entropy (Williams, 2004) is defined to reflect the cardinality of a shift space, which can be regarded as a more general notion than a regular grammar. It implicitly assumes that any shift contained in a shift space has equal influence. This contrasts with our case, in which we define the entropy to describe the complexity of a topological space containing vertices with different influences. As such, our entropy provides a finer-grained description of a topological space.

### 5.4  Connection to the VC Dimension

Here we provide a preliminary discussion of the connection between our proposed Shapley homology and the Vapnik-Chervonenkis (VC) dimension (Vapnik, 2000). Recall that the VC dimension essentially reflects the complexity of a space of functions by measuring the cardinality of the largest set of samples that can be “shattered” by functions in this space. From a similar point of view, we expect that the topology of the data space is also critical in evaluating the complexity that real-world learning algorithms face in learning a given set of data.

Note that the proposed approach has a close connection to statistical learning theory. An obvious interpretation of our complexity is that it can be viewed as analogous to sample complexity (Hanneke, 2016), which is closely related to the VC dimension. However, here we point out a limitation of the complexity specified by the VC dimension, which is part of the motivation of this work. Specifically, given a certain data space, only a hypothesis space of models with sufficiently large VC dimension can shatter this data space. For such a hypothesis space, we can further use sample complexity to specify the number of samples required to achieve certain PAC learning criteria. However, we argue that different shatterings of the data space lead to different levels of complexity (or different entropy values in our terms). Instead of focusing only on the maximally shattered data space, we argue that in practice, when building a learning model, different shatterings should be treated differently. To better explain this effect, we take regular grammars as an example. One can consider a certain binary regular grammar as a certain configuration in the hypothesis space. In other words, given strings of fixed length $N$, a binary regular grammar explicitly splits all $2^N$ strings into a set of accepted strings and a set of rejected strings. Since such a grammar is equivalent to a deterministic finite automaton (DFA), if we regard this DFA as a classification model, it has a certain VC dimension (Ishigami & Tani, 1997). Indeed, this effect is shown in the experiments on grammar learning in section 6. In particular, the different levels of difficulty demonstrated in learning different regular grammars indicate that different grammars (or different ways of shattering the data space) should not be treated as equal.

## 6  Experiments

In this section, we first demonstrate the results of using algorithm 1 to identify influential nodes in random graphs. We then evaluate several data sets generated by regular grammars to determine whether the data sets assigned higher entropy values by our algorithm are more challenging for neural networks to learn. The settings of all parameters in the experiments are provided in appendix B (Table 4).

### 6.1  Graph Classification

For these experiments, we first constructed the learning models and data sets of random graphs following Dai et al. (2018). We adopted a similar setting in order to evaluate and verify the influence of individual nodes in a graph. To avoid evaluation results biased by a particular synthetic data set, we performed 20 experimental trials, varying the probability used in the Erdős-Rényi random graph model while generating the same number of random graphs. More specifically, in each trial we generated a set of 6000 undirected random graphs from three classes (2000 per class) and split the generated set into a training set and a testing set with a ratio of 9 to 1.
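The generation step can be sketched with the standard library alone; `erdos_renyi`, `num_components`, and the rejection-sampling loop in `sample_graph` are our own illustrative names, not the paper's code:

```python
import random

def erdos_renyi(n, p, rng):
    """Undirected Erdos-Renyi graph G(n, p) as an adjacency-set dictionary."""
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj

def num_components(adj):
    """Count connected components by depth-first search."""
    seen, count = set(), 0
    for start in adj:
        if start in seen:
            continue
        count += 1
        stack = [start]
        while stack:
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            stack.extend(adj[u] - seen)
    return count

def sample_graph(n, p, target, rng, max_tries=10000):
    """Rejection-sample a graph whose class label (component count) is `target`."""
    for _ in range(max_tries):
        g = erdos_renyi(n, p, rng)
        if num_components(g) == target:
            return g
    raise RuntimeError("no graph found; adjust p")

g = sample_graph(n=10, p=0.2, target=2, rng=random.Random(0))
assert num_components(g) == 2
```

Varying `p` over a range (0.02 to 0.21 in Table 4) controls how often graphs with one, two, or three components appear.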

We constructed the learning models based on structure2vec (Dai, Dai, & Song, 2016) for the task of determining the number of connected components (up to three) of a given graph. The constructed models serve as verification tools for a fidelity test. The idea is to mask nodes that our algorithm assigns different influence scores and to examine how much the classification results for the graph are affected. Similar fidelity tests have been widely adopted in previous research on feature influence (Ribeiro, Singh, & Guestrin, 2016; Lundberg & Lee, 2017; Chen et al., 2018). In each trial, we first trained a neural network on the training set with random initialization. For each graph in the testing set, we computed the influence score of each node. We then generated two data sets, $D_{\text{top}}$ and $D_{\text{bottom}}$, by masking out the top-$L$ and bottom-$L$ nodes of each graph, as ranked by their influence scores. Figure 4 shows an example of a clean graph and its top-1 and bottom-1 masked versions. We also constructed a third data set, $D_{\text{rand}}$, by randomly masking $L$ nodes of each testing graph. The evaluation results were obtained by comparing the classification performance of our trained models on these three data sets with that achieved on the original testing sets.
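The masking step of the fidelity test can be sketched as follows. `mask_nodes` and the toy influence scores are hypothetical stand-ins (the paper obtains the scores from the Shapley homology algorithm):

```python
import random

def mask_nodes(adj, nodes):
    """Return a copy of the graph with all edges of the given nodes removed
    (the nodes stay in the graph as isolated vertices)."""
    masked = {u: set(vs) for u, vs in adj.items()}
    for v in nodes:
        for u in masked[v]:
            masked[u].discard(v)
        masked[v] = set()
    return masked

def masked_datasets(adj, influence, L, rng):
    """Build the three masked variants of one graph used in the fidelity test."""
    ranked = sorted(adj, key=lambda v: influence[v], reverse=True)
    d_top = mask_nodes(adj, ranked[:L])                   # most influential nodes
    d_bottom = mask_nodes(adj, ranked[-L:])               # least influential nodes
    d_rand = mask_nodes(adj, rng.sample(sorted(adj), L))  # random baseline
    return d_top, d_bottom, d_rand

# A path 0-1-2 where the middle node carries the highest influence score.
adj = {0: {1}, 1: {0, 2}, 2: {1}}
influence = {0: 0.25, 1: 0.5, 2: 0.25}
d_top, d_bottom, d_rand = masked_datasets(adj, influence, 1, random.Random(0))
assert d_top[1] == set() and 1 not in d_top[0]  # masking node 1 disconnects the path
assert d_bottom[1] in ({0}, {2})                # masking an end keeps one edge
```

Masking the top-1 node changes the component count of the example graph, while masking the bottom-1 node does not, which is exactly the contrast the fidelity test measures.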
Figure 4:

Example graphs with classification results for each graph from left to right: 1, 2, and 1. This indicates the number of connected components recognized by a neural network on these graphs.


In Figure 5, the results from all trials fall within the shaded area of each line plot. The accuracy value indexed by $L=0$ is the average accuracy of our models on the clean testing sets over all trials. It is clear from Figure 5 that the influence score calculated by our algorithm effectively indicates the impact of a node on determining the connectivity of the graph containing it. In addition, as we increase $L$, the accuracy of our models on $D_{\text{top}}$, $D_{\text{bottom}}$, and $D_{\text{rand}}$ degrades to different degrees. In particular, the accuracy obtained on $D_{\text{top}}$ and $D_{\text{bottom}}$ shows the largest and smallest degradation, respectively. The result for $D_{\text{top}}$ is surprising in that, even on these simple synthetic graphs, the robustness of a neural network model is far from satisfactory. Specifically, similar to the results shown by Dai et al. (2018), masking the top-1 influential node in each testing graph brings the accuracy of a neural network down to 40% to 50%.
Figure 5:

Accuracy of neural networks obtained on data sets with varying scales of manipulations.


When we focus only on the vertex with the highest influence score, the experiment can be taken as a sanity check. However, as shown in Figure 5, when we evaluate the effect of manipulating vertices identified as having different levels of influence on graph classification performance, our framework also provides a quantitative evaluation of the importance of different vertices, which goes beyond a sanity check.

### 6.2  Grammar Recognition

In this set of experiments, we used the three Tomita grammars introduced in Table 1 because of their simplicity and wide adoption in grammar learning research (Weiss, Goldberg, & Yahav, 2017; Wang, Zhang, Ororbia II, Xing et al., 2018; Wang, Zhang, Liu, & Giles, 2018). Table 3 shows, for each grammar, its entropy value under different radii as calculated by our algorithm. We set the radius to discrete integers since we apply the edit distance to regular grammars. Note that applying the Shapley homology framework to a large-scale data set can be computationally intractable, since our algorithm has exponential time complexity. We therefore apply an approximation: we compute the spectrum of $K$ subcomplexes instead of all subcomplexes, where $K$ is a fixed number. More specifically, when computing the influence of a specific sample, we randomly selected $K$ subcomplexes containing that sample and used the same formula to obtain its Shapley value.
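A sketch of the sampling approximation: the paper averages over $K$ randomly chosen subcomplexes containing the target sample; the permutation-sampling estimator below is an equivalent Monte Carlo form, assuming for illustration that the characteristic function is the zeroth Betti number of the induced subgraph with marginal contributions taken in absolute value (a choice that reproduces the closed-form values in Table 2):

```python
import random

def betti0(adj, subset):
    """Zeroth Betti number (number of connected components) of the subgraph
    induced by `subset`."""
    subset, seen, count = set(subset), set(), 0
    for s in subset:
        if s in seen:
            continue
        count += 1
        stack = [s]
        while stack:
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            stack.extend((adj[u] & subset) - seen)
    return count

def shapley_betti0_mc(adj, player, K, rng):
    """Monte Carlo Shapley estimate for one vertex: average absolute change in
    Betti-0 when the vertex joins a random prefix of a random vertex ordering."""
    nodes = list(adj)
    total = 0.0
    for _ in range(K):
        order = nodes[:]
        rng.shuffle(order)
        before = set(order[:order.index(player)])
        total += abs(betti0(adj, before | {player}) - betti0(adj, before))
    return total / K

# Path graph 0-1-2: the exact values are 1/2 for the ends and 2/3 for the middle.
path = {0: {1}, 1: {0, 2}, 2: {1}}
est = shapley_betti0_mc(path, 1, K=5000, rng=random.Random(0))
assert abs(est - 2 / 3) < 0.05
```

Each sampled ordering costs two Betti-number evaluations, so the cost grows linearly in $K$ rather than exponentially in the number of samples.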

Table 3:
Entropy of Different Grammars with Various Radii ($K=100$).

| | Grammar 1 | Grammar 2 | Grammar 4 |
|---|---|---|---|
| $r=1$ | 2.38 | 5.67 | 5.07 |
| $r=2$ | 2.47 | 5.65 | 5.08 |
| $r=3$ | 2.46 | 5.65 | 5.07 |

We used these grammars to generate three sets of binary strings with lengths ranging from 2 to 13 and split the data set of each grammar into a training set and a testing set with a ratio of 7 to 3. We then trained several recurrent networks, namely, the simple recurrent network (SRN; Elman, 1990), the gated recurrent unit network (GRU; Cho, Van Merriënboer, Bahdanau, & Bengio, 2014), and the long short-term memory network (LSTM; Hochreiter & Schmidhuber, 1997), on a binary classification task for the data set generated by each grammar. Essentially, the task is to train a recurrent network to correctly accept strings that are accepted by a given regular grammar and reject strings that are rejected by that grammar. For each type of recurrent network, we used the same number of parameters across all three grammars to avoid any bias from differing model capacity. We trained each type of RNN on each grammar for 20 trials. In each trial, we randomly split the training and testing sets and randomly initialized the model. The results are shown in Figure 6, in which the results from all trials fall within the shaded area associated with each plot.
Figure 6:

Training performance for different recurrent networks on grammars 1, 2, and 4.


In Figures 6a, 6b, and 6c, we see that for the first and fourth grammars, which have lower entropy values, the learning process converges much faster and more consistently than for the second grammar, which has the highest entropy value. This effect holds for all types of recurrent networks we evaluated. To better illustrate the difference in the difficulty of learning the first and fourth grammars, we provide in Figures 6d, 6e, and 6f a zoomed view of each plot in the top row of Figure 6. While the learning process of all models converges within 10 epochs for both grammars, it is still clear that learning is slower for the fourth grammar. These results agree with both our entropy analysis of these grammars and intuition. Specifically, the second grammar defines two sets of strings with equal cardinality when the string length is even. In this case, flipping any binary digit of a string to its opposite (e.g., flipping a 0 to a 1 or vice versa) converts a valid or invalid string into a string with the opposite label. This implies that a model must pay equal attention to every string to learn the underlying grammar, which corresponds to our analytical result that, in the data space defined by the second grammar, each string sample has equal influence on the topological features of this space.
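The bit-flip property of the second grammar can be checked exhaustively for a fixed even length. Here `grammar2_accepts` is our stand-in for grammar 2 (Tomita 5), assuming the even-number-of-0s and even-number-of-1s definition; for even-length strings the two parities coincide, so a single flip toggles both and hence the label:

```python
from itertools import product

def grammar2_accepts(s):
    """Stand-in for grammar 2 (Tomita 5): accept binary strings containing
    an even number of 0s and an even number of 1s."""
    return s.count("0") % 2 == 0 and s.count("1") % 2 == 0

N = 6  # even length, as in the paper's argument
strings = ["".join(bits) for bits in product("01", repeat=N)]
accepted = [s for s in strings if grammar2_accepts(s)]
assert len(accepted) == len(strings) // 2  # the two classes have equal size

for s in strings:
    for i in range(N):
        flipped = s[:i] + ("1" if s[i] == "0" else "0") + s[i + 1:]
        # every single-bit flip sends a string to the opposite class
        assert grammar2_accepts(s) != grammar2_accepts(flipped)
```

For odd lengths the parities of the 0- and 1-counts differ, so this exact one-flip property is specific to even-length strings, as the text notes.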

## 7  Conclusion

In this work, we proposed the Shapley homology framework to study the influence of a sample on the topological features of a data space and its associated entropy. This provides an understanding of the intrinsic properties of both individual samples and the entire data set. We also designed an algorithm for decomposing the zeroth Betti number using cooperative game theory and provided analytical results for several representative topological spaces. Furthermore, we empirically verified our results with two carefully designed experiments. We showed that data points identified by our algorithm as having larger influence on the topological features of their underlying space also have greater impact on the accuracy of neural networks in determining the connectivity of the underlying graphs. We also showed that a regular grammar with higher entropy is more difficult for neural networks to learn.

## Notes

1. The Laplacian matrix $L$ is defined to be $D-A$, where $D$ and $A$ denote the degree matrix and adjacency matrix, respectively.

2. Grammars 1, 2, and 4 are selected from the set of Tomita grammars (Tomita, 1982); their indices among the Tomita grammars are 1, 5, and 3, respectively.

3. Only the calculation of the Shapley value is given; the influence score can be easily obtained by normalizing the Shapley value.

## Appendix A:  Proof of the Results for the Six Special Sets of Graphs

Here we provide proofs for the results shown in Table 2 (see note 3). With these results, one can easily derive corollaries using basic inequalities.

1. The complete graph and cycles: For the complete graph and for cycles, the results presented in Table 2 are immediate, since both graphs are invariant under vertex permutations.

2. The wheel graph: For the wheel graph shown in Table 2, we have
$$\text{Periphery: } \frac{1}{N!}\left[\sum_{k=0}^{N-4}\binom{N-4}{k}\,k!\,(N-k-1)! + \sum_{k=0}^{N-5}\binom{N-4}{k}\,(k+2)!\,(N-k-3)!\right],$$
$$\text{Center: } \frac{1}{N} + \frac{1}{N!}\sum_{m=2}^{N-3}\sum_{k=2}^{l}T(N-1,k,m)\,(k-1)\,m!\,(N-m-1)!,$$
where $l=\min(m,N-m-1)$ and $T(N,k,m)=\binom{N}{m}\binom{m}{k}\binom{N-m-1}{k-1}$; simplifying these formulas gives the results in Table 2.
3. The star graph: For the star graph, we have
$$\text{Periphery: } \frac{1}{N!}\sum_{k=0}^{N-2}\binom{N-2}{k}\,k!\,(N-k-1)! = \frac{1}{2},$$
$$\text{Center: } \frac{1}{N} + \frac{1}{N!}\sum_{k=2}^{N-1}\binom{N-1}{k}\,(k-1)\,k!\,(N-k-1)! = \frac{N^2-3N+4}{2N}.$$
4. The path graph: For the path graph, we have
$$\text{Ends: } \frac{1}{N!}\sum_{k=0}^{N-2}\binom{N-2}{k}\,k!\,(N-k-1)! = \frac{1}{2},$$
$$\text{Middle: } \frac{2}{N!}\sum_{k=0}^{N-3}\binom{N-3}{k}\,k!\,(N-k-1)! = \frac{2}{3}.$$
5. The complete bipartite graph: For the complete bipartite graph $K_{m,n}$, we have
$$m\text{ side: } \frac{1}{(m+n)!}\left[\sum_{k=0}^{m-1}\binom{m-1}{k}\,k!\,(m+n-k-1)! + \sum_{k=2}^{n}\binom{n}{k}\,(k-1)\,k!\,(m+n-k-1)!\right],$$
$$n\text{ side: } \frac{1}{(m+n)!}\left[\sum_{k=0}^{n-1}\binom{n-1}{k}\,k!\,(m+n-k-1)! + \sum_{k=2}^{m}\binom{m}{k}\,(k-1)\,k!\,(m+n-k-1)!\right],$$
which, after simplification, gives the results in Table 2.
The above derivation uses the following facts, which can be verified by elementary combinatorics:
$$\frac{1}{N!}\sum_{k=0}^{N-m}\binom{N-m}{k}\,k!\,(N-k-1)! = \frac{1}{m},$$
$$\frac{1}{(m+n)!}\sum_{k=2}^{m}\binom{m}{k}\,(k-1)\,k!\,(m+n-k-1)! = \frac{m(m-1)}{n(n+1)(m+n)},$$
$$\frac{1}{N!}\sum_{m=2}^{N-3}\sum_{k=2}^{l}T(N-1,k,m)\,(k-1)\,m!\,(N-m-1)! = \frac{(N-3)(N-4)}{6N},$$
where $l=\min(m,N-m-1)$ and $T(N,k,m)=\binom{N}{m}\binom{m}{k}\binom{N-m-1}{k-1}$.
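The closed-form entries for the path and star graphs can be cross-checked by brute force. The sketch below assumes, for illustration, that the characteristic function is the zeroth Betti number of the induced subgraph with marginal contributions taken in absolute value, a choice that reproduces the table's values:

```python
from itertools import permutations
from math import factorial

def betti0(adj, subset):
    """Number of connected components of the induced subgraph."""
    subset, seen, count = set(subset), set(), 0
    for s in subset:
        if s in seen:
            continue
        count += 1
        stack = [s]
        while stack:
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            stack.extend((adj[u] & subset) - seen)
    return count

def exact_shapley(adj, player):
    """Exact Shapley value of one vertex, averaging |marginal change in
    Betti-0| over all vertex orderings."""
    nodes = list(adj)
    total = 0
    for order in permutations(nodes):
        before = set(order[:order.index(player)])
        total += abs(betti0(adj, before | {player}) - betti0(adj, before))
    return total / factorial(len(nodes))

# Path graph on 5 vertices: ends -> 1/2, middle vertices -> 2/3.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
assert abs(exact_shapley(path, 0) - 1 / 2) < 1e-12
assert abs(exact_shapley(path, 2) - 2 / 3) < 1e-12

# Star graph on N=5 vertices: center -> (N^2-3N+4)/(2N), periphery -> 1/2.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
N = 5
assert abs(exact_shapley(star, 0) - (N * N - 3 * N + 4) / (2 * N)) < 1e-12
assert abs(exact_shapley(star, 1) - 1 / 2) < 1e-12
```

Enumerating all $N!$ orderings is only feasible for small graphs, which is precisely why the sampled approximation of section 6.2 is needed at scale.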

## Appendix B:  Experiment and Model Settings

Table 4 provides all the parameters set in the experiments.

Table 4:
Parameter Settings for Neural Network Models and Data Sets.

Graph experiment:

| Model setting | Value |
|---|---|
| struct2vec hidden units | 32 |
| Neural network hidden units | 16 |
| Optimizer | RMSprop |

| Data set setting | Value |
|---|---|
| Total graphs | 6000 |
| Training (per class) | 5400 (1800) |
| Testing (per class) | 600 (200) |
| Edge probability | 0.02–0.21 |
| Nodes per graph | 8–14 |

Grammar experiment:

| Model | Hidden units (total parameters) |
|---|---|
| SRN | 200 (40800) |
| GRU | 80 (19920) |
| LSTM | 80 (19920) |
| Optimizer | RMSprop |

| Data set setting | G1 | G2 | G4 |
|---|---|---|---|
| Total strings | 8677 | 8188 | 8188 |
| Training | 6073 | 5730 | 5730 |
| Testing | 2604 | 2458 | 2458 |
| String length | 2–13 | 2–13 | 2–13 |

## Acknowledgments

We gratefully acknowledge useful comments from the referees and Christopher Griffin.

## References

Anirudh, R., Thiagarajan, J. J., Sridhar, R., & Bremer, T. (2017). Influential sample selection: A graph signal processing approach. arXiv:1711.05407.

Bubenik, P. (2015). Statistical topological data analysis using persistence landscapes. Journal of Machine Learning Research, 16(1), 77–102.

Carlsson, G., & Gabrielsson, R. B. (2018). Topological approaches to deep learning. CoRR, abs/1811.01122.

Carlsson, G., Ishkhanov, T., De Silva, V., & Zomorodian, A. (2008). On the local behavior of spaces of natural images. International Journal of Computer Vision, 76(1), 1–12.

Chazal, F., & Michel, B. (2017). An introduction to topological data analysis: Fundamental and practical aspects for data scientists. CoRR, abs/1710.04019.

Chen, J., Song, L., Wainwright, M. J., & Jordan, M. I. (2018). L-Shapley and C-Shapley: Efficient model interpretation for structured data. CoRR, abs/1808.02610.

Chen, X., Liu, C., Li, B., Lu, K., & Song, D. (2017). Targeted backdoor attacks on deep learning systems using data poisoning. CoRR, abs/1712.05526.

Cho, K., Van Merriënboer, B., Bahdanau, D., & Bengio, Y. (2014). On the properties of neural machine translation: Encoder-decoder approaches. In Proceedings of the Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation (pp. 103–111). Stroudsburg, PA: Association for Computational Linguistics.

Cohen-Steiner, D., Kong, W., Sohler, C., & Valiant, G. (2018). Approximating the spectrum of a graph. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1263–1271). New York: ACM.

Dai, H., Dai, B., & Song, L. (2016). Discriminative embeddings of latent variable models for structured data. arXiv:1603.05629.

Dai, H., Li, H., Tian, T., Huang, X., Wang, L., Zhu, J., & Song, L. (2018). Adversarial attack on graph structured data. arXiv:1806.02371.

Datta, A., Sen, S., & Zick, Y. (2016). Algorithmic transparency via quantitative input influence: Theory and experiments with learning systems. In Proceedings of the IEEE Symposium on Security and Privacy (pp. 598–617). Piscataway, NJ: IEEE.

De la Higuera, C. (2010). Grammatical inference: Learning automata and grammars. Cambridge: Cambridge University Press.

Edelsbrunner, H., & Harer, J. (2008). Persistent homology: A survey. Contemporary Mathematics, 453, 257–282.

Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179–211.

Gunning, D. (2017). Explainable artificial intelligence (XAI). Arlington, VA: Defense Advanced Research Projects Agency.

Hanneke, S. (2016). The optimal sample complexity of PAC learning. Journal of Machine Learning Research, 17(1), 1319–1333.

Hatcher, A. (2005). Algebraic topology. Cambridge: Cambridge University Press.

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.

Hofer, C., Kwitt, R., Niethammer, M., & Uhl, A. (2017). Deep learning with topological signatures. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in neural information processing systems, 30 (pp. 1633–1643). Red Hook, NY: Curran.

Ishigami, Y., & Tani, S. (1997). VC-dimensions of finite automata and commutative finite automata with k letters and n states. Discrete Applied Mathematics, 74(2), 123–134.

Khoury, M., & Hadfield-Menell, D. (2018). On the geometry of adversarial examples. CoRR, abs/1811.00525.

Koh, P. W., & Liang, P. (2017). Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning (pp. 1885–1894).

Lee, D., & Yoo, C. D. (2019). Learning to augment influential data. https://openreview.net/forum?id=BygIV2CcKm

Li, C., Ovsjanikov, M., & Chazal, F. (2014). Persistence-based structural recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1995–2002). Piscataway, NJ: IEEE.

Lundberg, S. M., & Lee, S. (2017). A unified approach to interpreting model predictions. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in neural information processing systems, 30 (pp. 4768–4777). Red Hook, NY: Curran.

Marsden, A. (2013). Eigenvalues of the Laplacian and their relationship to the connectedness of a graph. University of Chicago lecture note.

Newman, M. E. (2005). A measure of betweenness centrality based on random walks. Social Networks, 27(1), 39–54.

Ren, M., Zeng, W., Yang, B., & Urtasun, R. (2018). Learning to reweight examples for robust deep learning. In Proceedings of the 35th International Conference on Machine Learning (pp. 4331–4340).

Rezaei, S. S. C. (2013). Entropy and graphs. arXiv:1311.5632.

Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1135–1144). New York: ACM.

Rote, G., & Vegter, G. (2006). Computational topology: An introduction. In J.-D. Boissonnat & M. Teillaud (Eds.), Effective computational geometry for curves and surfaces (pp. 277–312). Berlin: Springer.

Schapira, P. (2001). Categories and homological algebra. Société mathématique de France.

Tomita, M. (1982). Dynamic construction of finite-state automata from examples using hill-climbing. In Proceedings of the Fourth Annual Conference of the Cognitive Science Society (pp. 105–108).

Turner, K., Mukherjee, S., & Boyer, D. M. (2014). Persistent homology transform for modeling shapes and surfaces. Information and Inference: A Journal of the IMA, 3(4), 310–344.

Vapnik, V. N. (2000). The nature of statistical learning theory (2nd ed.). Berlin: Springer.

Wang, Q., Zhang, K., Liu, X., & Giles, C. L. (2018). Verification of recurrent neural networks through rule extraction. arXiv:1811.06029.

Wang, Q., Zhang, K., Ororbia II, A. G., Xing, X., Liu, X., & Giles, C. L. (2018). A comparative study of rule extraction for recurrent neural networks. arXiv:1801.05420.

Wang, Q., Zhang, K., Ororbia II, A. G., Xing, X., Liu, X., & Giles, C. L. (2018). An empirical evaluation of rule extraction from recurrent neural networks. Neural Computation, 30(9), 2568–2591.

Wang, T., Zhu, J., Torralba, A., & Efros, A. A. (2018). Dataset distillation. CoRR, abs/1811.10959.

Wang, Y., & Chaudhuri, K. (2018). Data poisoning attacks against online learning. CoRR, abs/1808.08994.

Weiss, G., Goldberg, Y., & Yahav, E. (2017). Extracting automata from recurrent neural networks using queries and counterexamples. arXiv:1711.09576.

Williams, S. G. (2004). Introduction to symbolic dynamics. Proceedings of Symposia in Applied Mathematics, 60, 1–12.

Yeh, C., Kim, J. S., Yen, I. E., & Ravikumar, P. (2018). Representer point selection for explaining deep neural networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Advances in neural information processing systems, 31 (pp. 9311–9321). Red Hook, NY: Curran.