## Abstract

A cycle in a brain network is a subset of a connected component with redundant additional connections. If there are many cycles in a connected component, the connected component is more densely connected. Whereas the number of connected components represents the integration of the brain network, the number of cycles represents how strong the integration is. However, it is unclear how to perform statistical inference on the number of cycles in the brain network. In this study, we present a new statistical inference framework for determining the significance of the number of cycles through the Kolmogorov-Smirnov (KS) distance, which was recently introduced to measure the similarity between networks across different filtration values by using the zeroth Betti number. In this paper, we show how to extend the method to the first Betti number, which measures the number of cycles. The performance analysis was conducted using random network simulations with ground truths. By using a twin imaging study, which provides biological ground truth, the methods are applied in determining if the number of cycles is a statistically significant heritable network feature in the resting-state functional connectivity in 217 twins obtained from the Human Connectome Project. The MATLAB codes as well as the connectivity matrices used in generating results are provided at http://www.stat.wisc.edu/~mchung/TDA.

## Author Summary

In this paper, we propose a new topological distance based on the Kolmogorov-Smirnov (KS) distance that is adapted for brain networks, and compare it against other topological network distances including the Gromov-Hausdorff (GH) distance. The KS-distance was recently introduced to measure the similarity between networks across different filtration values by using the zeroth Betti number, which measures the number of connected components. In this paper, we show how to extend the method to the first Betti number, which measures the number of cycles. The performance analysis was conducted using random network simulations with ground truths. Using a twin imaging study, which provides biological ground truth (of network differences), we demonstrate that the KS-distances on the zeroth and first Betti numbers have the ability to determine heritability.

## INTRODUCTION

The modular structure and connected components are the fundamental topological features of a brain network. Brain networks with a higher number of connected components have many disjointed clusters, and the transfer of information will likely be impeded. Modular structures are often studied through the Q-modularity in graph theory (Meunier, Lambiotte, Fornito, Ersche, & Bullmore, 2009; Newman, Barabasi, & Watts, 2006) and the zeroth Betti number in persistent homology (Carlsson & Memoli, 2008; Carlsson & Mémoli, 2010; Chung, Vilalta-Gil, Lee, Rathouz, Lahey, & Zald, 2017b; Chung, Luo, Leow, Adluru, Alexander, Richard, & Goldsmith, 2018b; Lee, Chung, & Lee, 2014).

Persistent homology provides a coherent framework for obtaining higher order topological features beyond modular structures (Edelsbrunner & Harer, 2008; Zomorodian & Carlsson, 2005). A brain network can be treated as the 1-skeleton of a simplicial complex, where the 0-dimensional hole is the connected component, and the 1-dimensional hole is a cycle. The number of *k*-dimensional holes is called the *k*-th Betti number and denoted as *β*_{k} (Lee et al., 2014; Lee, Chung, Kang, Choi, Kim, & Lee, 2018; Petri, Expert, Turkheimer, Carhart-Harris, Nutt, Hellyer, & Vaccarino, 2014; Sizemore, Giusti, Kahn, Vettel, Betzel, & Bassett, 2018). In this study, we will study higher order topological changes of brain networks using cycles. The cycle structure in networks is important for information propagation, redundancy, and feedback loops (Lind, Gonzalez, & Herrmann, 2005). If a cycle exists in the network, information can be delivered using two different redundant paths and interpreted as redundant connections. Alternately, it can be viewed as diffusing the spread of information and creating information bottlenecks (Tarjan, 1972).

Although cycles in a network have been widely studied in graph theory, especially in path analysis, they are rarely used in brain network analysis (Sporns, 2003; Sporns, Tononi, & Edelman, 2000). Existing graph analysis packages such as Brain Connectivity (http://sites.google.com/site/bctnet) do not provide any tools related to cycles. Traditionally, cycles are often computed using the brute-force depth-first search algorithm (Tarjan, 1972). In standard graph theoretic approaches, graph theory features are measured mainly by determining the difference in graph theory features such as assortativity, betweenness centrality, small-worldness, and network homogeneity (Bullmore & Sporns, 2009; Rubinov & Sporns, 2010; Rubinov, Knock, Stam, Micheloyannis, Harris, Williams, & Breakspear, 2009; Uddin, Kelly, Biswal, Margulies, Shehzad, Shaw, Ghaffari, Rotrosen, Adler, Castellanos, & Milham, 2008). Comparison of graph theory features appears to reveal changes of structural or functional connectivity associated with different clinical populations (Rubinov & Sporns, 2010). Since weighted brain networks are difficult to interpret and visualize, they are often turned into binary networks by thresholding edge weights (He, Chen, & Evans, 2008; Wijk, Stam, & Daffertshofer, 2010). However, the thresholds for the edge weights are often chosen arbitrarily and produce results that could alter the network topology and thus make comparisons difficult. To obtain the proper optimal threshold where comparisons can be made, the multiple comparison correction over every possible edge has been proposed (Rubinov et al., 2009; Wijk et al., 2010). However, the resulting binary graph is extremely sensitive depending on the chosen *p* value or threshold value. Others tried to control the sparsity of edges in the network in obtaining the binary network (Achard & Bullmore, 2007; Bassett, 2006; He et al., 2008; Lee, Kang, Chung, Kim, & Lee, 2012; Wijk et al., 2010). 
However, one then encounters the problem of choosing the sparsity parameter. Thus, existing methods for binarizing weighted networks cannot escape the inherent problem of arbitrary thresholding.

There are currently no widely accepted criteria for thresholding networks. Instead of trying to find an optimal threshold that gives rise to a single network that may not be suitable for comparing clinical populations, cognitive conditions, or different studies, *why not use each network produced from every threshold?* Motivated by this simple question, a new multiscale hierarchical network modeling framework based on persistent homology has been proposed (Cassidy, Rae, & Solo, 2015; Chung, Hanson, Lee, Adluru, Alexander, Davidson, & Pollak, 2013; Giusti, Pastalkova, Curto, & Itskov, 2015; Lee, Chung, Kang, Kim, & Lee, 2011a, 2011b; Lee et al., 2012; Petri, Scolamiero, Donato, & Vaccarino, 2013; Petri et al., 2014; Sizemore, Giusti, & Bassett, 2016; Sizemore et al., 2018; Stolz, Harrington, & Porter, 2017). Persistent homology, a branch of computational topology (Carlsson & Memoli, 2008; Edelsbrunner & Harer, 2008; Edelsbrunner, Letscher, & Zomorodian, 2000), provides a more coherent mathematical framework for measuring network distance than the conventional method of simply taking the difference between graph theoretic features or the norm of the connectivity matrices. Instead of looking at networks at a fixed scale, as is usually done in many standard brain network analyses, persistent homology observes the changes of topological features of the network over multiple resolutions and scales (Edelsbrunner & Harer, 2008; Horak, Maletić, & Rajković, 2009; Zomorodian & Carlsson, 2005). In doing so, it reveals the most persistent topological features that are robust under noise perturbations. This robustness in performance under different scales is needed for most network distances that are parameter and scale dependent.

In persistent homology–based brain network analysis, instead of analyzing networks at one fixed threshold that may not be optimal, we build the collection of nested networks over every possible threshold by using the *graph filtration*, a persistent homological construct (Chung et al., 2013; Lee et al., 2011a, 2012). The graph filtration is a threshold-free framework for analyzing a family of graphs but requires hierarchically building specific nested subgraph structures. The graph filtration shares similarities to the existing multithresholding or multiresolution network models that use many different arbitrary thresholds or scales (Achard, Salvador, Whitcher, Suckling, & Bullmore, 2006; He et al., 2008; Kim, Adluru, Chung, Okonkwo, Johnson, Bendlin, & Singh, 2015; Lee et al., 2012; Supekar, Menon, Rubin, Musen, & Greicius, 2008). Such approaches are mainly used to visually display the dynamic pattern of how graph theoretic features change over different thresholds, and the pattern of change is rarely quantified. Persistent homology can be used to quantify such dynamic patterns in a more coherent mathematical framework. Recently, various persistent homological network approaches have been proposed. In Giusti et al. (2015) and Sizemore et al. (2016, 2018), graph filtration was developed on cliques. In Petri et al. (2013), weighted clique rank homology was developed. In Petri et al. (2014), the concept of homological scaffolds was developed and applied to the resting-state fMRI.

In persistent homology, there are various metrics that have been proposed to measure similarity and distances, including the bottleneck, Gromov-Hausdorff (GH), and Wasserstein distances (Chazal, Cohen-Steiner, Guibas, Mémoli, & Oudot, 2009; Kerber, Morozov, & Nigmetov, 2017; Tuzhilin, 2016), the complex vector method (Di Fabio & Ferri, 2015), and the persistence kernel (Ibanez-Marcelo, Campioni, Manzoni, Santarcangelo, & Petri, 2018a; Ibanez-Marcelo, Campioni, Phinyomark, Petri, & Santarcangelo, 2018b; Kusano, Hiraoka, & Fukumizu, 2016). Among them, the bottleneck and GH distances are possibly the two most popular distances that were originally used to measure distance between two metric spaces (Tuzhilin, 2016). They were later adapted to measure distances in persistent homology, dendrograms (Carlsson & Memoli, 2008; Carlsson & Mémoli, 2010; Chazal et al., 2009), and brain networks (Lee et al., 2011b, 2012). The probability distributions of bottleneck and GH-distances are unknown. Thus, the statistical inference on them can only be done through resampling techniques such as permutations (Lee et al., 2012; Lee, Kang, Chung, Lim, Kim, & Lee, 2017), which often cause serious computational bottlenecks for large-scale networks.

To bypass the computational bottleneck associated with resampling large-scale networks, the Kolmogorov-Smirnov (KS) distance was introduced (Chung et al., 2013; Lee et al., 2017). The advantage of using the KS-distance is that it gives results that are easier to interpret than those obtained from less intuitive distances from persistent homology. Furthermore, because of its simplicity in construction, it is possible to determine its probability distribution exactly without resampling (Chung et al., 2017b). However, the KS-distance has only been applied to the number of connected components *β*_{0}, and it is unclear how to apply it to the number of cycles *β*_{1} in graphs and networks. In this paper, for the first time, we show how to extend the KS-distance by performing statistical inference on *β*_{1}. This is achieved by establishing the monotonic property of the number of cycles over graph filtration. The monotonicity is then used in constructing the KS-distance for topologically differentiating two networks. Subsequently, the method is applied to the large-scale resting-state twin fMRI study in determining the heritability of the number of cycles.

## CORRELATION BRAIN NETWORK

The edge weight, which measures the strength of a connection, is usually given by a similarity measure between the observed data on the nodes in brain networks. Various similarity measures have been proposed. The correlation or mutual information between measurements for the biological or metabolic network and the frequency of contact between actors for the social network have been used as edge weights (Bassett, Meyer-Lindenberg, Achard, Duke, & Bullmore, 2006; Bien & Tibshirani, 2011; Li, Liu, Li, Qin, Li, Yu, & Jiang, 2009; McIntosh & Gonzalez-Lima, 1994; Newman & Watts, 1999; Song, Havlin, & Makse, 2005). In particular, the Pearson correlation has been most widely used as edge weights in functional brain network modeling.

Consider a weighted network with node set *V* = {1, …, *p*} and edge weights *w* = (*w*_{ij}) between nodes *i* and *j*. Let **x**_{j} = (*x*_{1j}, ⋯, *x*_{nj})^{⊤} ∈ ℝ^{n} be the *n* × 1 measurement vector on node *j*. Let us center and normalize the data **x**_{j} such that

$$\sum_{i=1}^{n} x_{ij} = 0, \qquad \mathbf{x}_j^\top \mathbf{x}_j = 1.$$

Then *ρ*_{ij} = **x**_{i}^{⊤}**x**_{j} is the Pearson correlation between **x**_{i} and **x**_{j} (Chung, Hanson, Ye, Davidson, & Pollak, 2015). Note that correlations are invariant under scale and translations. Naturally, we are interested in using correlations or their simple functions such as

$$w_{ij} = \sqrt{1 - \rho_{ij}} \tag{1}$$

as edge weights. The edge weight in Equation 1 satisfies the triangle inequality *w*_{ij} ≤ *w*_{ik} + *w*_{kj} and other metric properties (Chung, Lee, Solo, Davidson, & Pollak, 2017a). Having metric distances facilitates more mathematically coherent interpretation of brain networks and offers many nice mathematical properties. With such edge weight *w*, 𝒳 = (*V*, *w*) forms a metric space. In the simulation studies in this paper, Equation 1 is used as the edge weights.
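The normalization step can be illustrated with a minimal Python sketch. It centers and scales each measurement vector so that the inner product of two vectors is their Pearson correlation, and then forms an edge weight. The helper names `normalize` and `edge_weight`, the toy vectors, and the choice of the weight √(1 − *ρ*_{ij}) are assumptions made for illustration, not code from the study.

```python
import math

def normalize(x):
    """Center x to zero mean and scale it to unit Euclidean norm."""
    mean = sum(x) / len(x)
    centered = [v - mean for v in x]
    norm = math.sqrt(sum(v * v for v in centered))
    return [v / norm for v in centered]

def edge_weight(xi, xj):
    """Return (rho, w): Pearson correlation and the assumed weight sqrt(1 - rho)."""
    zi, zj = normalize(xi), normalize(xj)
    rho = sum(a * b for a, b in zip(zi, zj))
    # The clamp guards against tiny negative values caused by rounding.
    return rho, math.sqrt(max(0.0, 1.0 - rho))

# Toy measurement vectors with n = 4 (made-up values).
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [2.0, 1.0, 4.0, 3.0]
x3 = [4.0, 1.0, 3.0, 2.0]

rho12, w12 = edge_weight(x1, x2)
_, w13 = edge_weight(x1, x3)
_, w23 = edge_weight(x2, x3)

# Rescaling and shifting x2 should leave the correlation unchanged.
rho12_rescaled, _ = edge_weight(x1, [10.0 * v + 3.0 for v in x2])
```

Because the normalized vectors have unit norm, this weight is proportional to the Euclidean distance between them, which is why the triangle inequality holds in the sketch.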

## GRAPH FILTRATION

All topological network distances that will be introduced in later sections are based on filtrations on graphs by thresholding edge weights.

Given a weighted network 𝒳 = (*V*, *w*) with positive edge weights *w* = (*w*_{ij}), the binary network 𝒳_{ϵ} = (*V*, *w*_{ϵ}) is a graph consisting of the node set *V* and the binary edge weights *w*_{ϵ} = (*w*_{ij,ϵ}) given by

$$w_{ij,\epsilon} = \begin{cases} 1 & \text{if } w_{ij} > \epsilon, \\ 0 & \text{otherwise.} \end{cases}$$

Any edge weight less than or equal to *ϵ* is made into zero while edge weights larger than *ϵ* are made into one. Lee et al. (2011b) define the binary graphs by thresholding above, that is, *w*_{ij,ϵ} = 1 if *w*_{ij} ≤ *ϵ*, which is consistent with the definition of the Rips filtration. However, in brain imaging, the higher value of *w*_{ij} indicates stronger connectivity. Thus, we are thresholding below and leave out stronger connections (Chung et al., 2013).

Note that *w*_{ϵ} is the adjacency matrix of 𝒳_{ϵ}, which is a simplicial complex consisting of 0-simplices (nodes) and 1-simplices (edges) (Ghrist, 2008). By increasing the filtration value *ϵ*, we are deleting more edges, so the size of the edge set decreases. Thus, the binary network satisfies the monotonic subset property

$$\mathcal{X}_{\epsilon_0} \supset \mathcal{X}_{\epsilon_1} \supset \mathcal{X}_{\epsilon_2} \supset \cdots$$

for any *ϵ*_{0} ≤ *ϵ*_{1} ≤ *ϵ*_{2} ⋯. Equivalently, we also have

$$w_{ij,\epsilon_0} \geq w_{ij,\epsilon_1} \geq w_{ij,\epsilon_2} \geq \cdots$$

for every pair of nodes *i* and *j*. The sequence of such nested graphs is called the graph filtration. Note that 𝒳_{0} is the complete graph and 𝒳_{∞} is the node set *V*. For a graph with *p* nodes, the maximum number of edges is (*p*^{2} − *p*)/2, which is obtained in a complete graph. If we order the edge weights in increasing order, we have the sorted edge weights:

$$0 = w_{(0)} < w_{(1)} < w_{(2)} < \cdots < w_{(q)},$$

where *q* ≤ (*p*^{2} − *p*)/2. The subscript _{( )} denotes the order statistic. Hence, we simply construct the graph filtration at the edge weights

$$\mathcal{X}_{w_{(0)}} \supsetneq \mathcal{X}_{w_{(1)}} \supsetneq \cdots \supsetneq \mathcal{X}_{w_{(q)}}. \tag{2}$$

The condition of having unique edge weights is not restrictive in practice. Assuming edge weights to follow some continuous distribution, the probability of any two edges being equal is zero. The finiteness and uniqueness of the filtration levels over finite graphs are intuitively clear by themselves and are implicitly assumed in software packages such as javaPlex (Adams, Tausz, & Vejdemo-Johansson, 2014).
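As a rough illustration of the construction, the following Python sketch thresholds a small weighted network at zero and at each of its sorted edge weights, keeping an edge only when its weight exceeds the filtration value. The function names and the 4-node weight matrix `W` are hypothetical and chosen only to make the nesting visible.

```python
def binary_network(w, eps):
    """Adjacency matrix of the binary network: edge (i, j) is kept iff w[i][j] > eps."""
    p = len(w)
    return [[1 if i != j and w[i][j] > eps else 0 for j in range(p)]
            for i in range(p)]

def filtration_values(w):
    """Return 0 followed by the sorted unique edge weights."""
    p = len(w)
    weights = {w[i][j] for i in range(p) for j in range(i + 1, p)}
    return [0] + sorted(weights)

def edge_count(adj):
    """Number of undirected edges in a symmetric 0/1 adjacency matrix."""
    return sum(map(sum, adj)) // 2

# Toy symmetric weight matrix on p = 4 nodes with unique positive weights.
W = [[0, 1, 2, 3],
     [1, 0, 4, 5],
     [2, 4, 0, 6],
     [3, 5, 6, 0]]

levels = filtration_values(W)
networks = [binary_network(W, eps) for eps in levels]
counts = [edge_count(adj) for adj in networks]
```

With unique weights, each step of the filtration removes exactly one edge, so the sketch goes from the complete graph down to the bare node set in one plus the number of edges levels.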

## BETTI NUMBERS

In persistent homology, the *k*-th Betti number is often referred to as the number of *k*-dimensional holes (Lee et al., 2014; Petri et al., 2014; Sizemore et al., 2018). In the network setting, the zeroth Betti number is the number of connected components and the first Betti number is the number of cycles. During graph filtration, we can show that *β*_{0} and *β*_{1} change monotonically. Although this is not true in general (Bobrowski & Kahle, 2014), over the graph filtration (2), *β*_{0} is monotonically nondecreasing and *β*_{1} is monotonically nonincreasing.

Theorem 1. In a graph, Betti numbers *β*_{0} and *β*_{1} are monotone over graph filtration on edge weights.

*Proof*. Under graph filtration (2), the edges are deleted one at a time. Since an edge has only two end points, the deletion of an edge splits a connected component into at most two. Thus, the number of connected components (*β*_{0}) never decreases, and it increases by at most one at each deletion. The Euler characteristic *χ* of the graph is given by (Adler, Bobrowski, Borman, Subag, & Weinberger, 2010)

$$\chi = \beta_0 - \beta_1 = p - q,$$

where *p* and *q* are the number of nodes and edges respectively. Thus,

$$\beta_1 = \beta_0 - p + q.$$

Note that *p* is fixed over the filtration, but *q* decreases by one at each deletion while *β*_{0} increases by at most one. Hence, *β*_{1} never increases, and it decreases by at most one.

Theorem 1 is related to the incremental Betti number computation over a simplicial complex (Boissonnat & Teillaud, 2006). Once we compute *β*_{0}, *β*_{1} is simply given by *β*_{0} − *p* + *q* without additional computation. For the computation of *β*_{0}, it is not necessary to perform graph filtration for infinitely many possible filtration values. The maximum possible number of filtration levels needed for computing *β*_{0} is one plus the number of unique edge weights. In the case of trees, *β*_{0} is given exactly.

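Theorem 2. For a tree 𝒳 = (*V*, *w*) with *p* ≥ 2 nodes and unique positive edge weights *w*_{(1)} < *w*_{(2)} < ⋯ < *w*_{(p−1)}, *β*_{0} over graph filtration (2) is given by

$$\beta_0(\mathcal{X}_{w_{(i)}}) = i + 1, \qquad i = 0, 1, \ldots, p - 1,$$

where *w*_{(0)} = 0.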

The proof is given in Chung et al. (2015). Note that a tree with *p* nodes has *p* − 1 edges. For a graph that is not a tree, it may not be possible to represent *β*_{0} analytically over a filtration as in Theorem 2. In general, *β*_{0} can be numerically computed using the single linkage dendrogram (SLD) (Lee et al., 2012), the Dulmage-Mendelsohn decomposition (Chung, Adluru, Dalton, Alexander, & Davidson, 2011; Pothen & Fan, 1990), or the simplicial complex method (Carlsson & Memoli, 2008; de Silva & Ghrist, 2007; Edelsbrunner, Letscher, & Zomorodian, 2002). In this study, we computed *β*_{0} over filtration by using the Dulmage-Mendelsohn decomposition.
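The relation *β*_{1} = *β*_{0} − *p* + *q* can be checked numerically with a short Python sketch. Note that it counts connected components with a union-find structure rather than the Dulmage-Mendelsohn decomposition used in the study; the function name `betti_numbers` and the toy weight matrix are hypothetical.

```python
def betti_numbers(w, eps):
    """Return (beta0, beta1) of the binary network that keeps edges with weight > eps."""
    p = len(w)
    parent = list(range(p))

    def find(a):
        # Follow parent pointers to the component root, halving the path as we go.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    q = 0  # number of surviving edges
    for i in range(p):
        for j in range(i + 1, p):
            if w[i][j] > eps:
                q += 1
                parent[find(i)] = find(j)
    beta0 = len({find(i) for i in range(p)})
    beta1 = beta0 - p + q  # Euler characteristic: beta0 - beta1 = p - q
    return beta0, beta1

# Toy symmetric weight matrix on p = 4 nodes with unique positive weights.
W = [[0, 1, 2, 3],
     [1, 0, 4, 5],
     [2, 4, 0, 6],
     [3, 5, 6, 0]]

levels = [0, 1, 2, 3, 4, 5, 6]
beta0_curve = [betti_numbers(W, eps)[0] for eps in levels]
beta1_curve = [betti_numbers(W, eps)[1] for eps in levels]
```

On this toy network the *β*_{0} curve never decreases and the *β*_{1} curve never increases, and each changes by at most one per deleted edge, as Theorem 1 states.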

## SINGLE LINKAGE CLUSTERING

The *β*_{0} computation is related to single linkage clustering and dendrogram construction (Carlsson, 2009; Carlsson, De Silva, & Morozov, 2009; Carlsson, Singh, & Zomorodian, 2009b; Chowdhury & Mémoli, 2016; Khalid, Kim, Chung, Ye, & Jeon, 2014). In single linkage clustering, the single linkage distance (SLD) *s*_{ij} between the closest nodes in the two disjoint connected components **R**_{1} and **R**_{2} is given by

$$s_{ij} = \min_{k \in \mathbf{R}_1,\, l \in \mathbf{R}_2} w_{kl}.$$

Every edge connecting a node in **R**_{1} to a node in **R**_{2} has the same SLD. The SLD is then used to construct the single linkage matrix (SLM) *S* = (*s*_{ij}) (Figure 1). The SLM shows how connected components are merged locally and can be used in constructing a dendrogram over filtration. If the single linkage distance *s*_{ij} is at least the current filtration value *ϵ*_{k} but smaller than the next filtration value *ϵ*_{k+1}, that is, *ϵ*_{k} ≤ *s*_{ij} < *ϵ*_{k+1}, then components **R**_{1} and **R**_{2} will be connected at the next filtration value *ϵ*_{k+1}. The sequence of how components are merged during the graph filtration is identical to the sequence of the merging in the dendrogram construction (Lee et al., 2012). By tracing how the connected components are merged, we can compute *β*_{0}. In the single linkage clustering, instead of deleting edges, we are connecting nodes over increasing edge weights.

The SLM is an *ultrametric*, which is a metric space satisfying the stronger triangle inequality *s*_{ij} ≤ max(*s*_{ik}, *s*_{kj}) (Carlsson & Mémoli, 2010). Thus the dendrogram can be represented as an ultrametric space 𝒟 = (*V*, *S*), which is again a metric space. In persistent homology, the Gromov-Hausdorff (GH) distance has been mainly used in quantifying the dendrogram shape differences (Carlsson & Mémoli, 2010; Chung et al., 2017a; Lee et al., 2011b). The GH-distance between dendrograms 𝒟^{1} = (*V*, *S*^{1}) and 𝒟^{2} = (*V*, *S*^{2}) with SLM *S*^{1} = (*s*^{1}_{ij}) and *S*^{2} = (*s*^{2}_{ij}) is given, up to a normalizing constant, by the largest difference between the corresponding SLDs:

$$\max_{\forall i,j} \left| s^1_{ij} - s^2_{ij} \right|.$$
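A minimal Python sketch of the single linkage matrix follows. It assumes that the SLD between two nodes equals the smallest possible value of the largest edge weight along any path joining them, and computes it with a Floyd-Warshall style update. The comparison of two SLMs uses the raw maximum entrywise difference; scaling conventions for the GH-distance differ, so the constant is an assumption. The function names and both toy matrices are hypothetical.

```python
def single_linkage_matrix(w):
    """Single linkage distances: minimum over paths of the largest edge weight on the path."""
    p = len(w)
    s = [[0 if i == j else w[i][j] for j in range(p)] for i in range(p)]
    for k in range(p):
        for i in range(p):
            for j in range(p):
                via_k = max(s[i][k], s[k][j])  # cost of routing i -> k -> j
                if via_k < s[i][j]:
                    s[i][j] = via_k
    return s

def max_slm_difference(s1, s2):
    """Largest entrywise difference between two SLMs on the same node set."""
    p = len(s1)
    return max(abs(s1[i][j] - s2[i][j]) for i in range(p) for j in range(p))

# Two toy networks on the same p = 4 nodes.
W1 = [[0, 1, 2, 3],
      [1, 0, 4, 5],
      [2, 4, 0, 6],
      [3, 5, 6, 0]]
W2 = [[0, 6, 5, 4],
      [6, 0, 3, 2],
      [5, 3, 0, 1],
      [4, 2, 1, 0]]

S1 = single_linkage_matrix(W1)
S2 = single_linkage_matrix(W2)
distance = max_slm_difference(S1, S2)
```

Because the nodes are matched across networks, no search over node correspondences is needed, which is what makes this distance cheap to evaluate here.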

## BOTTLENECK DISTANCE

Although the bottleneck distance is one of the most popular distances in persistent homology, it is *rarely* used for brain networks. In persistent homology, the topology of underlying data can be represented by the birth and death of topological features, such as the number of connected components or cycles (Carlsson, Ishkhanov, De Silva, & Zomorodian, 2008). During the filtration, these topological features appear and disappear. If a topological feature appears at the threshold *ξ* and disappears at *τ*, it can be encoded into a point (*ξ*, *τ*) (0 ≤ *ξ* ≤ *τ* < ∞) in ℝ^{2}. If *m* connected components or cycles appear during the filtration of a network 𝒳 = (*V*, *w*), the homology group can be represented by a point set

$$\mathcal{P}(\mathcal{X}) = \{(\xi_1, \tau_1), \ldots, (\xi_m, \tau_m)\},$$

which is called the persistence diagram (PD). Given two networks 𝒳^{1} = (*V*^{1}, *w*^{1}) with *m* features and 𝒳^{2} = (*V*^{2}, *w*^{2}) with *n* features, the bottleneck distance between their PDs 𝒫(𝒳^{1}) and 𝒫(𝒳^{2}) is

$$D_B(\mathcal{P}(\mathcal{X}^1), \mathcal{P}(\mathcal{X}^2)) = \inf_{\gamma} \sup_{1 \le i \le m} \left\| t_i^1 - \gamma(t_i^1) \right\|_\infty,$$

where *t*^{1}_{i} = (*ξ*^{1}_{i}, *τ*^{1}_{i}) ∈ 𝒫(𝒳^{1}) and *γ* is a bijection from 𝒫(𝒳^{1}) to 𝒫(𝒳^{2}). The infimum is taken over all possible bijections. If *t*^{2}_{j} = (*ξ*^{2}_{j}, *τ*^{2}_{j}) = *γ*(*t*^{1}_{i}) for some *i* and *j*, the *L*_{∞}-norm is given by

$$\left\| t_i^1 - \gamma(t_i^1) \right\|_\infty = \max\left( |\xi_i^1 - \xi_j^2|, \; |\tau_i^1 - \tau_j^2| \right).$$

This definition assumes *m* = *n* so that the bijection *γ* exists. Suppose two networks share the same node set, that is, *V*^{1} = *V*^{2}, with *p* nodes and the same number of *q* unique edge weights. If the graph filtration is performed on two networks, the number of connected components and cycles that appear and disappear during the filtration is *p* and 1 − *p* + *q*, respectively. Thus, their persistence diagrams always have the same number of points. The bijection *γ* is determined by the bipartite graph matching algorithm (Cohen-Steiner et al., 2007; Edelsbrunner & Harer, 2008).

If *m* ≠ *n*, there is no one-to-one correspondence between two PDs. Then, auxiliary points ((*ξ* + *τ*)/2, (*ξ* + *τ*)/2), obtained by orthogonally projecting the points (*ξ*, *τ*) onto the diagonal line *ξ* = *τ*, in 𝒫(𝒳^{1}) and 𝒫(𝒳^{2}) are added to 𝒫(𝒳^{2}) and 𝒫(𝒳^{1}), respectively, to make the identical number of points in PDs.

The bottleneck distance does not directly measure the distance between two metric spaces 𝒳^{1} = (*V*^{1}, *w*^{1}) and 𝒳^{2} = (*V*^{2}, *w*^{2}), but measures the distance between their corresponding persistence diagrams 𝒫(𝒳^{1}) and 𝒫(𝒳^{2}). In practice, the bottleneck distance has often been used since it is a lower bound on the GH-distance and it is easier to compute (Chazal et al., 2009). Since the brain regions that form the network nodes are matched across the networks through predefined parcellations in brain network studies, the GH-distance can be computed easily. Thus, in this study, we will only use the GH-distance and not show the result of the bottleneck distance in the simulation study.
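For small diagrams with the same number of points, the infimum over bijections can be evaluated by brute force. The Python sketch below enumerates every bijection instead of using a bipartite matching algorithm, so it is only practical for a handful of points; the function name and the two toy persistence diagrams are hypothetical.

```python
from itertools import permutations

def bottleneck_distance(pd1, pd2):
    """Brute-force bottleneck distance between two equally sized persistence diagrams."""
    if len(pd1) != len(pd2):
        raise ValueError("this sketch assumes diagrams with the same number of points")
    best = float("inf")
    for perm in permutations(range(len(pd2))):
        # L-infinity cost of the worst matched pair under this bijection.
        cost = max(
            (max(abs(a[0] - pd2[j][0]), abs(a[1] - pd2[j][1]))
             for a, j in zip(pd1, perm)),
            default=0.0,
        )
        best = min(best, cost)
    return best

# Toy diagrams given as (birth, death) pairs.
pd_a = [(0.0, 1.0), (0.0, 2.0)]
pd_b = [(0.0, 2.5), (0.0, 1.5)]
d_ab = bottleneck_distance(pd_a, pd_b)
```

The number of bijections grows factorially, which is why matching algorithms are used in practice.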

## PERMUTATION TEST ON NETWORK DISTANCES

Statistical inference on network distances can be done using resampling techniques such as the permutation test (Chung et al., 2013; Efron, 1982; Lee et al., 2012). The permutation test is perhaps the most widely used nonparametric test procedure in the sciences (Chung et al., 2017b; Nichols & Holmes, 2002; Thompson, Cannon, Narr, van Erp, Poutanen, Huttunen, Lonnqvist, Standertskjold-Nordenstam, Kaprio, & Khaledy, 2001; Zalesky, Fornito, Harding, Cocchi, Yücel, Pantelis, & Bullmore, 2010). It is known as the exact test in brain imaging since the distribution of the test statistic under the null hypothesis can be exactly computed if we can calculate all possible values of the test statistic under every possible permutation.

Suppose there are *m* measurements in Group 1 on node set *V* of size *p*. Denote the data matrix as **X**^{1}_{m×p}. The edge weights of Group 1 are given by *f*(**X**^{1}) for some function *f* and the metric space is given by 𝒳^{1} = (*V*, *f*(**X**^{1})). Suppose there are *n* measurements in Group 2 on the identical node set *V*. Denote the data matrix as **X**^{2}_{n×p} and the corresponding metric space as 𝒳^{2} = (*V*, *f*(**X**^{2})). We test the statistical significance of network distance *D*(𝒳^{1}, 𝒳^{2}) under the null hypothesis *H*_{0}:

$$H_0 : D(\mathcal{X}^1, \mathcal{X}^2) = 0.$$

Under *H*_{0}, one can concatenate the data matrices,

$$\mathbf{X} = \begin{pmatrix} \mathbf{X}^1 \\ \mathbf{X}^2 \end{pmatrix},$$

and then permute the rows of **X** in the symmetric group of degree *m* + *n*, that is, *S*_{m+n} (Kondor, Howard, & Jebara, 2007). Denote the *i*-th permuted data matrix as **X**_{σ(i)} = (*x*_{σ(i),j}), where *σ* ∈ *S*_{m+n}. Then we split **X**_{σ(i)} into submatrices such that

$$\mathbf{X}_{\sigma(i)} = \begin{pmatrix} \mathbf{X}^1_{\sigma(i)} \\ \mathbf{X}^2_{\sigma(i)} \end{pmatrix},$$

where **X**^{1}_{σ(i)} and **X**^{2}_{σ(i)} are of sizes *m* × *p* and *n* × *p* respectively. Let 𝒳^{1}_{σ(i)} = (*V*, *f*(**X**^{1}_{σ(i)})) and 𝒳^{2}_{σ(i)} = (*V*, *f*(**X**^{2}_{σ(i)})) be weighted networks where the rows of the data matrices are permuted across the groups. Then we have distance *D*(𝒳^{1}_{σ(i)}, 𝒳^{2}_{σ(i)}) for each permutation. The fraction of permuted distances *D*(𝒳^{1}_{σ(i)}, 𝒳^{2}_{σ(i)}) that are larger than the observed distance *D*(𝒳^{1}, 𝒳^{2}) gives the estimate for the *p* value.
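The procedure can be sketched in Python for a sample small enough to enumerate. Since the order of rows within a group does not change the distance, the sketch enumerates the distinct splits of the pooled rows into groups of sizes *m* and *n* rather than all (*m* + *n*)! orderings. The `distance` argument stands in for any network distance; the scalar toy data and the mean-difference distance are hypothetical placeholders, and counting ties with `>=` is a design choice of this sketch.

```python
from itertools import combinations

def permutation_p_value(group1, group2, distance):
    """Exact permutation p value by enumerating all splits of the pooled sample."""
    pooled = list(group1) + list(group2)
    m = len(group1)
    observed = distance(group1, group2)
    count = total = 0
    for idx in combinations(range(len(pooled)), m):
        chosen = set(idx)
        g1 = [pooled[i] for i in idx]
        g2 = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Ties are counted, so the observed split always contributes.
        if distance(g1, g2) >= observed:
            count += 1
    return count / total, total

def mean_difference(g1, g2):
    """Placeholder distance: absolute difference of group means."""
    return abs(sum(g1) / len(g1) - sum(g2) / len(g2))

group_a = [1, 2, 3, 4, 5]
group_b = [6, 7, 8, 9, 10]
p_value, n_splits = permutation_p_value(group_a, group_b, mean_difference)
```

With five observations per group there are only 252 splits, so the enumeration is immediate; the cost grows combinatorially with the sample size.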

Unfortunately, generating every possible permutation for whole images is still extremely time consuming even for a modest sample size. The number of permutations increases exponentially with the sample size, and it is impractical to generate every possible permutation. In the permutation test, only a small fraction of possible permutations is generated, and the statistical significance is computed approximately. In most studies, well under 1% of the total permutations were used, mainly due to the computational bottleneck of generating permutations (Thompson et al., 2001; Zalesky et al., 2010). In Zalesky et al. (2010), 5,000 permutations out of $\binom{27}{12}$ = 17,383,860 possible permutations (0.029%) were used. In Thompson et al. (2001), 1 million permutations out of $\binom{40}{20}$ possible permutations (0.0007%) were generated using a supercomputer. In our study, we have 131 MZ and 77 DZ twins. The possible number of permutations is $\binom{208}{77}$. This number is so large that we cannot represent it exactly in computing systems such as MATLAB and R. Even 1% of $\binom{208}{77}$ is about 1.96 × 10^{56}, which is still astronomically large and beyond the computing capability of most computers. On the other hand, the proposed KS-distance method accounts for all possible permutations combinatorially and completely bypasses the computational bottleneck. The computational cost of the KS-distance is negligible, and the computation is done in a few seconds. Furthermore, the method computes *p* values exactly rather than approximately.
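The permutation counts quoted above can be checked with exact integer arithmetic. The short Python sketch below uses `math.comb`, which returns arbitrary-precision integers; the variable names are chosen here for illustration.

```python
import math

# Number of ways to split the pooled sample into the two groups.
zalesky_total = math.comb(27, 12)
thompson_total = math.comb(40, 20)
twin_total = math.comb(208, 77)

# Fraction of all permutations actually sampled in each earlier study.
zalesky_fraction = 5_000 / zalesky_total
thompson_fraction = 1_000_000 / thompson_total

print(f"Zalesky et al.: {zalesky_total:,} permutations, sampled {zalesky_fraction:.4%}")
print(f"Thompson et al.: {thompson_total:,} permutations, sampled {thompson_fraction:.5%}")
print(f"Twin study: about {float(twin_total):.2e} permutations")
```

Python integers do not overflow, so the twin-study count is exact here; converting it to a float is only for display.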

## KOLMOGOROV-SMIRNOV DISTANCE

Recently, the Kolmogorov-Smirnov (KS) distance has been successfully applied in quantifying the change of *β*_{0} number over graph filtration as a way to quantify brain networks without thresholding (Chung et al., 2017a, 2017b). The main advantage of the method is that it avoids using the computationally costly and time consuming permutation test for large-scale networks. In this paper, we show how to apply KS-distance in quantifying the change of the *β*_{1} number over graph filtration as well.

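Given two networks 𝒳^{1} = (*V*, *w*^{1}) and 𝒳^{2} = (*V*, *w*^{2}), the KS-distances between 𝒳^{1} and 𝒳^{2} for the Betti numbers *β*_{0} and *β*_{1} are defined as (Chung et al., 2013; Lee et al., 2017):

$$D_{KS}^{j}(\mathcal{X}^1, \mathcal{X}^2) = \sup_{\epsilon \ge 0} \left| \beta_j(\mathcal{X}^1_\epsilon) - \beta_j(\mathcal{X}^2_\epsilon) \right|, \qquad j = 0, 1,$$

where *β*_{j}(𝒳^{i}_{ϵ}) is the *j*-th Betti number for the binary network 𝒳^{i}_{ϵ}. The distance *D*_{KS} can be discretely approximated using a finite number of filtration values *ϵ*_{1}, …, *ϵ*_{q}:

$$D_q = \sup_{1 \le k \le q} \left| \beta_j(\mathcal{X}^1_{\epsilon_k}) - \beta_j(\mathcal{X}^2_{\epsilon_k}) \right|.$$

If we choose *q* such that *ϵ*_{k} are all the sorted edge weights, then *D*_{KS} = *D*_{q}. This is possible since there are at most *p*(*p* − 1)/2 unique edges in a graph with *p* nodes and the monotone function changes discretely but *not continuously*. In practice, *ϵ*_{k} may be chosen uniformly or a divide-and-conquer strategy can be used to adaptively grid the filtration values. Then the probability distribution of *D*_{q} can be computed exactly by combinatorial means.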
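As a minimal sketch of the discrete distance, the Python code below evaluates the Betti numbers of two networks on the pooled set of their edge weights (plus zero) and takes the largest absolute difference. Since the Betti curves only change at edge weights, this grid is assumed to be sufficient. The union-find based `betti` helper, the function names, and the two toy networks are hypothetical.

```python
def betti(w, eps):
    """Return (beta0, beta1) of the binary network that keeps edges with weight > eps."""
    p = len(w)
    parent = list(range(p))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    q = 0
    for i in range(p):
        for j in range(i + 1, p):
            if w[i][j] > eps:
                q += 1
                parent[find(i)] = find(j)
    beta0 = len({find(i) for i in range(p)})
    return beta0, beta0 - p + q

def ks_distance(w1, w2, j):
    """Largest gap between the j-th Betti curves of two networks on the same nodes."""
    p = len(w1)
    grid = {0}
    for w in (w1, w2):
        grid.update(w[a][b] for a in range(p) for b in range(a + 1, p))
    return max(abs(betti(w1, eps)[j] - betti(w2, eps)[j]) for eps in sorted(grid))

# Toy networks: W2 has every edge weight of W1 doubled.
W1 = [[0, 1, 2, 3],
      [1, 0, 4, 5],
      [2, 4, 0, 6],
      [3, 5, 6, 0]]
W2 = [[2 * v for v in row] for row in W1]

d0 = ks_distance(W1, W2, 0)  # distance on connected components
d1 = ks_distance(W1, W2, 1)  # distance on cycles
```

Doubling the weights delays every edge deletion, so the two Betti curves have the same shape but are shifted along the filtration, and the distance picks up the largest vertical gap.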

Theorem 3.

$$P(D_q \ge d) = 1 - \frac{A_{q,q}}{\binom{2q}{q}},$$

where *A*_{u,v} satisfies *A*_{u,v} = *A*_{u−1,v} + *A*_{u,v−1} with the boundary condition *A*_{0,v} = *A*_{u,0} = 1 within the band |*u* − *v*| < *d* and the initial condition *A*_{0,0} = 0 for *u*, *v* ≥ 1.

The proof is given in Chung et al. (2017b).

For example, *P*(*D*_{3} ≥ 2) is computed sequentially as follows (Figure 2). We start with the bottom left corner *A*_{0,0} = 0 and move right or up toward the upper right corner *A*_{3,3}:

$$A_{1,0} = A_{0,1} = 1 \;\to\; A_{1,1} = A_{1,0} + A_{0,1} = 2 \;\to\; \cdots \;\to\; A_{3,3} = A_{3,2} + A_{2,3} = 8.$$

The probability is then *P*(*D*_{3} ≥ 2) = 1 − 8/$\binom{6}{3}$ = 0.6. The computational complexity of the combinatorial inference is 𝒪(*q* log *q*) for sorting and 𝒪(*q*^{2}) for computing *A*_{q,q} in the grid while the permutation test requires exponential run time.
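The grid recursion can be sketched in a few lines of Python. The code fills *A*_{u,v} row by row, leaving entries outside the band |*u* − *v*| < *d* at zero, and then forms the tail probability as one minus *A*_{q,q} divided by the central binomial coefficient. The function name `ks_tail_probability` is hypothetical, and the tail formula is assumed to match the example *P*(*D*_{3} ≥ 2) = 0.6 in the text.

```python
import math

def ks_tail_probability(q, d):
    """P(D_q >= d) from the lattice recursion restricted to the band |u - v| < d."""
    A = [[0] * (q + 1) for _ in range(q + 1)]
    for u in range(q + 1):
        for v in range(q + 1):
            if (u, v) == (0, 0) or abs(u - v) >= d:
                continue  # A[0][0] = 0, and entries outside the band stay 0
            if u == 0 or v == 0:
                A[u][v] = 1  # boundary condition inside the band
            else:
                A[u][v] = A[u - 1][v] + A[u][v - 1]
    return 1 - A[q][q] / math.comb(2 * q, q)

example = ks_tail_probability(3, 2)
```

The double loop visits (*q* + 1)^{2} grid cells, which matches the quadratic cost stated above.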

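When *q* is too large, it may not be possible to represent and compute $\binom{2q}{q}$ in all the digits. For large *q*, use the asymptotic probability distribution of *D*_{q} given by Chung et al. (2017b):

$$\lim_{q \to \infty} P\left( D_q / \sqrt{2q} \ge d \right) = 2 \sum_{i=1}^{\infty} (-1)^{i-1} e^{-2 i^2 d^2}.$$

The *p* value of the test statistic under the null is then computed as

$$p\text{-value} = 2 e^{-2 d_o^2} - 2 e^{-8 d_o^2} + 2 e^{-18 d_o^2} - \cdots,$$

where *d*_{o} is the least integer greater than *D*_{q}/$\sqrt{2q}$ in the data.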
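For large *q*, the tail probability can be approximated by truncating an alternating series. The Python sketch below assumes the limiting law is the standard Kolmogorov distribution, with terms (−1)^{i−1} exp(−2*i*²*d*²), applied to the scaled statistic *D*_{q}/√(2*q*). The function name, the truncation at 100 terms, and the clipping to [0, 1] are choices made for this illustration.

```python
import math

def ks_asymptotic_p_value(d, terms=100):
    """Truncated Kolmogorov series 2 * sum_{i>=1} (-1)^(i-1) * exp(-2 * i^2 * d^2)."""
    if d <= 0:
        return 1.0
    total = sum((-1) ** (i - 1) * math.exp(-2.0 * i * i * d * d)
                for i in range(1, terms + 1))
    # Clip to [0, 1]: the truncated series can stray slightly for very small d.
    return min(1.0, max(0.0, 2.0 * total))

p_at_1 = ks_asymptotic_p_value(1.0)
p_at_2 = ks_asymptotic_p_value(2.0)
```

The terms decay so quickly that a few of them already give several correct digits for moderate values of the scaled statistic.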

## COMPARISONS

Six network distances (*L*_{1}, *L*_{2}, *L*_{∞}, GH and KS on *β*_{0} and *β*_{1}) were compared in simulation studies. For the review of various brain network distances, refer to Chung et al. (2017a). We also used the popular Q-modularity function for community detection in graph theory (Girvan & Newman, 2002; Meunier et al., 2009; Newman et al., 2006). The difference in Q-modularity functions was used as the distance measure. The simulations below were independently performed 100 times. We used *p* = 20, 100, 500 nodes and *n* = 5 images in each group, which made the number of permutations exactly $\binom{5+5}{5}$ = 252 (Figure 3). The small number of permutations enables us to compare the performance of distances exactly. Throughout the simulations, *σ* = 0.1 was universally used as network variability.

The data vector **x**_{i} at node *i* was simulated as independent and identically distributed multivariate normal across *i*, that is, **x**_{i} ∼ *N*(0, *I*_{n}) with the *n* by *n* identity matrix *I*_{n} as the covariance matrix. This gives the correlation matrix *C*^{1} = (*c*^{1}_{ij}) = (*corr*(**x**_{i}, **x**_{j})). The edge weights were given by $\sqrt{1 - c^1_{ij}}$. The data vector **y**_{i} at node *i* that produced node dependency was simulated by adding additional dependency to **x**_{i} through a hierarchical linear model or mixed-effect model (Pinheiro & Bates, 2002; Snijders, Spreen, & Zwaagstra, 1995). This is a standard simulation technique for introducing dependency structures in random simulations. The hierarchical linear model enables us to explicitly model the data vector at each node and simulate the amount of dependency between nodes, providing detailed control over the topological structures in the correlation matrices. Data vector **y**_{i} at node *i* will be simulated using **x**_{i} as follows.

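$$\begin{aligned} \mathbf{y}_1, \ldots, \mathbf{y}_c &= \mathbf{x}_1 + N(0, \sigma^2 I_n), \\ \mathbf{y}_{c+1}, \ldots, \mathbf{y}_{2c} &= \mathbf{x}_{c+1} + N(0, \sigma^2 I_n), \\ &\;\;\vdots \\ \mathbf{y}_{c(k-1)+1}, \ldots, \mathbf{y}_{ck} &= \mathbf{x}_{c(k-1)+1} + N(0, \sigma^2 I_n), \end{aligned}$$

so that the nodes form *k* modules of *c* nodes each. Here *c* = *p*/*k* = 10, 5, 4, 2 and *k* = *p*/*c* = 2, 4, 5, 10 are used (Figure 3). Subsequently, we have the correlation matrix *C*^{2} = (*c*^{2}_{ij}) = (*corr*(**y**_{i}, **y**_{j})) and the subsequent edge weights $\sqrt{1 - c^2_{ij}}$.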

### No Network Difference

It was expected that there was no network difference between networks generated using the same parameters and initial data vectors **x**_{i} in the above model. For example, Figure 3 shows two simulated networks generated with the same parameters *k* = 4, 10. We compared networks with the same parameter *k*: 4 vs. 4, 5 vs. 5 and 10 vs. 10. We should not be able to detect network differences in these cases. The performance results were given in terms of the false positive error rate computed as the fraction of simulations that gave a *p* value below 0.05 (Table 1). For all the distances except the KS-distance, the permutation test was used. Since there were five samples in each group, the total number of permutations was $\binom{10}{5}$ = 252, making the permutation test exact and the comparisons accurate. All the distances performed very well including Q-modularity. The KS-distance was overly sensitive and produced up to 7% false positives. However, for a test at the 0.05 level, a 5% chance of producing false positives is expected. Thus, the KS-distance produces false positives at a rate only 2 percentage points above the expected error rate.

| *p* = 20 | L_{1} | L_{2} | L_{∞} | GH | KS (β_{0}) | KS (β_{1}) | Q |
|---|---|---|---|---|---|---|---|
| 4 vs. 4 | 0.00 | 0.00 | 0.00 | 0.00 | 0.04 | 0.01 | 0.05 |
| 5 vs. 5 | 0.00 | 0.00 | 0.00 | 0.00 | 0.07 | 0.01 | 0.06 |
| 10 vs. 10 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.04 |
| 4 vs. 5 | 0.63 | 0.40 | 0.33 | 0.15 | 0.27 | 0.06 | 0.90 |
| 2 vs. 4 | 0.71 | 0.48 | 0.42 | 0.53 | 0.18 | 0.00 | 0.95 |
| 5 vs. 10 | 0.94 | 0.80 | 0.78 | 0.72 | 0.44 | 0.24 | 0.96 |


The *p* = 20 simulation might be too small a network to extract the topologically distinct features that are used in topological distances. Thus, we increased the number of nodes to *p* = 100 (Table 2). All the network distances except KS-distances performed reasonably well. KS-distances seem to be overly sensitive to slight topological changes in the large topological structures that were present in the *k* = 2, 4, 5 cases. As *k* increases, KS-distances seem to perform reasonably well.

| *p* = 100 | L_{1} | L_{2} | L_{∞} | GH | KS (β_{0}) | KS (β_{1}) | Q |
|---|---|---|---|---|---|---|---|
| 4 vs. 4 | 0.00 | 0.00 | 0.00 | 0.00 | 0.26 | 0.54 | 0.03 |
| 5 vs. 5 | 0.00 | 0.00 | 0.00 | 0.00 | 0.14 | 0.43 | 0.05 |
| 10 vs. 10 | 0.00 | 0.00 | 0.00 | 0.00 | 0.05 | 0.05 | 0.05 |
| 4 vs. 5 | 0.51 | 0.37 | 0.35 | 0.16 | 0.11 | 0.00 | 0.93 |
| 2 vs. 4 | 0.66 | 0.45 | 0.57 | 0.61 | 0.03 | 0.00 | 0.91 |
| 5 vs. 10 | 0.94 | 0.86 | 0.79 | 0.72 | 0.11 | 0.00 | 0.98 |


### Network Differences

We generated networks with parameters *k* = 2, 4, 5, 10 in the *p* = 20 node simulation (Figure 3). Since the topological structures were different, the distances are expected to differentiate the networks. The performance results are given in terms of the false negative error rate, computed as the fraction of simulations that gave a *p* value above 0.05 (Table 1). All the distances including Q-modularity performed badly, although KS-distance performed the best. Since graph theory features are *not explicitly* designed to measure network distances, they do not usually perform well when there are large topological differences.

We increased the number of nodes to *p* = 100. All the network distances, including Q-modularity, still performed badly, except for KS-distances (Table 2). KS-distance on the number of cycles seems to be the best network distance to use when there are network topology differences, although it has a tendency to produce false positives when there is no difference.

In terms of computation, distance methods based on the permutation test took about 950 seconds (16 minutes) for 100 nodes, while the KS-like test procedure took only about 20 seconds on a computer. The results given in Tables 1–3 may change slightly if different random networks are generated. We also performed the simulation study on 500 nodes to see the effect of increased network size (Table 3). The proposed KS-distances on both *β*_{0} and *β*_{1} do not necessarily perform well in the case of no network differences. Again, the KS-distance is too sensitive and detects minute network differences. On the other hand, in the case of actual network differences, the KS-distances perform exceptionally well compared with the other network distances.

| *p* = 500 | L_{1} | L_{2} | L_{∞} | GH | KS (β_{0}) | KS (β_{1}) | Q |
|---|---|---|---|---|---|---|---|
| 4 vs. 4 | 0.04 | 0.05 | 0.06 | 0.08 | 0.20 | 0.26 | 0.02 |
| 5 vs. 5 | 0.00 | 0.00 | 0.00 | 0.00 | 0.13 | 0.20 | 0.02 |
| 10 vs. 10 | 0.00 | 0.00 | 0.00 | 0.00 | 0.06 | 0.18 | 0.05 |
| 4 vs. 5 | 0.20 | 0.20 | 0.20 | 0.20 | 0.11 | 0.00 | 0.20 |
| 2 vs. 4 | 0.14 | 0.11 | 0.14 | 0.12 | 0.00 | 0.00 | 0.17 |
| 5 vs. 10 | 0.20 | 0.18 | 0.19 | 0.16 | 0.00 | 0.00 | 0.20 |


## APPLICATION

As an application, we show how to apply KS-distances in understanding heritability of brain networks. Because of their unique relationship, twin imaging studies allow researchers to examine genetic and environmental influences easily *in vivo* (Blokland, McMahon, Thompson, Martin, de Zubicaray, & Wright, 2011; Chiang, McMahon, de Zubicaray, Martin, Hickie, Toga, Wright, & Thompson, 2011; Glahn, Winkler, Kochunov, Almasy, Duggirala, Carless, Curran, Olvera, Laird, Smith, Beckmann, Fox, & Blangero, 2010; McKay, Knowles, Winkler, Sprooten, Kochunov, Olvera, Curran, Kent Jr., Carless, Göring, Dyer, Duggirala, Almasy, Fox, Blangero, & Glahn, 2014; Smit, Stam, Posthuma, Boomsma, & De Geus, 2008). Monozygotic (MZ) twins share 100% of genes, whereas dizygotic (DZ) twins share 50% of genes (Chung et al., 2017b). The difference between MZ and DZ twins measures the degree of genetic and environmental influence. Twin imaging studies are very useful for understanding the extent to which brain networks are influenced by genetic factors. This information can then be later used to develop better ways to prevent and treat disorders and maladaptive behaviors.

### Dataset and Image Preprocessing

We used the resting-state fMRI of 271 twin pairs from the Human Connectome Project (Van Essen, Ugurbil, Auerbach, Barch, Behrens, Bucholz, Chang, Chen, Corbetta, & Curtiss, 2012). Out of the total of 271 twin pairs, we used only the genetically confirmed 131 MZ twin pairs (age 29.3 ± 3.3 years, 56M/75F) and 77 same-sex DZ twin pairs (age 29.1 ± 3.5 years, 30M/47F) in this study. Since the discrepancy between self-reported and genotype-verified zygosity was fairly high at 13% of all the available data, 19 MZ and 19 DZ twin pairs that did not have genotyping were excluded. We additionally excluded 35 twin pairs with missing fMRI data.

fMRI were collected on a customized Siemens 3T Connectome Skyra scanner, using a gradient-echo-planar imaging (EPI) sequence with multiband factor = 8, TR = 720 ms, TE = 33.1 ms, flip angle = 52°, 104 × 90 (RO×PE) matrix size, 72 slices, and 2-mm isotropic voxels; 1,200 volumes were obtained over a 14 min, 33 sec scanning session. The fMRI data underwent spatial and temporal preprocessing, including motion and physiological noise removal (Smith et al., 2013). Using the resting-state fMRI, we employed the Automated Anatomical Labeling (AAL) brain template to parcellate the brain volume into 116 regions (Tzourio-Mazoyer, Landeau, Papathanassiou, Crivello, Etard, Delcroix, Mazoyer, & Joliot, 2002). The fMRI signals were then averaged across voxels in each brain region for each subject. The averaged fMRI signal in each parcellation was then temporally smoothed using the cosine series representation as follows (Chung, Adluru, Lee, Lazar, Lainhart, & Alexander, 2010; Gritsenko, Lindquist, Kirk, & Chung, 2018).

For the averaged fMRI signal *ζ*_{i}(*t*) in the *i*-th parcellation at time *t*, we scaled it to fit to the unit interval [0, 1]. We then subtracted its mean over time $\int_{0}^{1}$ *ζ*_{i}(*t*) *dt*. The resulting scaled and translated time series was represented as

$$\zeta_{i}(t) = \sum_{l=0}^{k} c_{li}\,\psi_{l}(t),$$

where *ψ*_{0}(*t*) = 1, *ψ*_{l}(*t*) = $\sqrt{2}$ cos(*lπt*) were cosine basis functions and *c*_{li} were coefficients estimated in the least squares fashion. For our study, *k* = 119 was used such that the fMRI were compressed into 10% of the original data size; the *k* = 119 expansion increased the signal-to-noise ratio (SNR), as measured by the ratio of variabilities, by 81% on average over all 116 brain regions and 416 subjects, that is, SNR = 1.81. The resulting real-valued Fourier coefficient vector **c**_{i} = (*c*_{0i}, *c*_{1i}, ⋯, *c*_{ki}) was then used to represent the fMRI in each parcellation as 120 features in the spectral domain.
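The cosine series fit can be sketched in a few lines. The sketch below is illustrative and assumes that the volumes are sampled at the midpoints *t*_{j} = (*j* + 0.5)/*n* of a uniform grid on [0, 1]. On that grid the basis functions *ψ*_{l} are exactly orthogonal for *l* < *n*, so the least squares coefficients reduce to simple averages and no linear solver is needed. The function names are hypothetical, and the sampling grid used in the actual study may differ.

```python
import math


def cosine_basis(l, t):
    """psi_0(t) = 1 and psi_l(t) = sqrt(2) * cos(l * pi * t) for l >= 1."""
    return 1.0 if l == 0 else math.sqrt(2.0) * math.cos(l * math.pi * t)


def fit_cosine_series(signal, k):
    """Least squares cosine coefficients c_0, ..., c_k of a mean-centered signal.

    Assumes samples at the midpoints t_j = (j + 0.5) / n of a uniform grid
    on [0, 1], where the basis is orthogonal, so c_l = mean(signal * psi_l).
    """
    n = len(signal)
    if k >= n:
        raise ValueError("degree k must be smaller than the number of samples")
    mean = sum(signal) / n
    centered = [s - mean for s in signal]
    times = [(j + 0.5) / n for j in range(n)]
    return [
        sum(z * cosine_basis(l, t) for z, t in zip(centered, times)) / n
        for l in range(k + 1)
    ]


# Synthetic example: 1,200 volumes compressed into 120 coefficients.
n_volumes = 1200
grid = [(j + 0.5) / n_volumes for j in range(n_volumes)]
signal = [3.0 + 0.5 * cosine_basis(1, t) - 0.25 * cosine_basis(3, t) for t in grid]
coefficients = fit_cosine_series(signal, k=119)
```

Because the mean is removed before fitting, the zeroth coefficient is numerically zero, and the remaining coefficients recover the amplitudes used to build the synthetic signal.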

### Twin Correlations

The connectivity matrix *C* = (*c*_{ij}) was computed by correlating 120 features in the spectral domain. Between the *i*- and *j*-th parcellations, the connectivity was measured by correlating **c**_{i} and **c**_{j} over 120 features, that is, *c*_{ij} = *corr*(**c**_{i}, **c**_{j}). From the individual correlation matrices *C*, we computed pairwise twin correlations in each group at the edge level. The resulting group level twin correlation matrices *C*_{MZ} = ($c_{ij}^{MZ}$) and *C*_{DZ} = ($c_{ij}^{DZ}$) are nonsymmetric cross-correlation matrices. Since there is no preference in the order of twins, we symmetrize them by

$$C_{MZ} \leftarrow \frac{C_{MZ} + C_{MZ}^{\top}}{2}, \qquad C_{DZ} \leftarrow \frac{C_{DZ} + C_{DZ}^{\top}}{2}.$$

The additive genetic factor *A* and the common environmental factor *C* for each twin type are related as

$$corr(c_{ij}^{MZ}) = A + C, \qquad (5)$$

$$corr(c_{ij}^{DZ}) = A/2 + C, \qquad (6)$$

where *corr*($c_{ij}^{MZ}$) and *corr*($c_{ij}^{DZ}$) are the pairwise correlations within MZ and same-sex DZ twins at the edge between *i* and *j*. Solving Equation 5 and Equation 6, we obtain the additive genetic factor, that is, HI, given by

$$\mathrm{HI} = 2\,[corr(c_{ij}^{MZ}) - corr(c_{ij}^{DZ})].$$

The network differences between MZ and DZ twins are considered to be mainly attributable to heritability and can be used to determine the statistical significance of HI (Chung et al., 2017, 2018). The KS-distance was computed by taking 1 − *C*_{MZ} and 1 − *C*_{DZ} as edge weights.
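As an illustration of the edge-level heritability computation, the sketch below symmetrizes two twin cross-correlation matrices and applies Falconer's formula HI = 2(corr_MZ − corr_DZ), which is what the standard ACE relations corr_MZ = *A* + *C* and corr_DZ = *A*/2 + *C* give when solved for *A*. Treating the paper's HI as Falconer's formula is an assumption here, and the function names are hypothetical.

```python
def symmetrize(matrix):
    """Average a square matrix with its transpose."""
    p = len(matrix)
    return [[(matrix[i][j] + matrix[j][i]) / 2.0 for j in range(p)] for i in range(p)]


def heritability_index(c_mz, c_dz):
    """Edge-wise HI = 2 * (corr_MZ - corr_DZ) after symmetrization.

    Assumes the ACE relations corr_MZ = A + C and corr_DZ = A / 2 + C,
    so that HI equals the additive genetic factor A.
    """
    mz, dz = symmetrize(c_mz), symmetrize(c_dz)
    p = len(mz)
    return [[2.0 * (mz[i][j] - dz[i][j]) for j in range(p)] for i in range(p)]


# Toy 2 x 2 cross-correlation matrices (made-up numbers, not study data).
c_mz = [[1.0, 0.8], [0.6, 1.0]]
c_dz = [[1.0, 0.4], [0.2, 1.0]]
hi = heritability_index(c_mz, c_dz)
```

In this toy example the symmetrized MZ and DZ correlations at the single edge are 0.7 and 0.3, giving HI of about 0.8. Sampling noise can push the raw value outside [0, 1], so an application may want to clip it.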

In most brain imaging studies, 5,000–1,000,000 permutations are often used, which usually puts the total number of generated permutations at less than 0.01% to 1% of all possible permutations. In Zalesky et al. (2010), 5,000 permutations out of a possible $\binom{27}{12}$ = 17,383,860 permutations (0.029%) were used. In Thompson et al. (2001), for instance, 1 million permutations out of $\binom{40}{20}$ possible permutations (0.0007%) were generated using a supercomputer. In Lee et al. (2017), 5,000 permutations out of a possible $\binom{33}{10}$ = 92,561,040 permutations (0.005%) were used. Since we have 131 MZ and 77 DZ pairs, the total number of possible permutations is $\binom{271}{131}$, which is larger than 10^{80}. Even if we generate only 0.01% of the 10^{80} possible permutations, 10^{76} permutations are still too many for most desktop computers. Thus, we choose the KS-distance for measuring the network distance. Although the probability distribution of the KS-distance is actually based on the permutation test, the probability is computed combinatorially, bypassing the need for resampling. In our study, the KS-distance took only a few seconds to compute the *p* value.
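The binomial counts in this paragraph can be checked with exact integer arithmetic. The sketch below uses a hypothetical helper, `sampled_percentage`, that reports what percentage of all possible group relabelings a given number of sampled permutations covers.

```python
from math import comb


def sampled_percentage(n_sampled, n_total, n_group):
    """Return (percentage of relabelings sampled, number of possible relabelings)."""
    n_possible = comb(n_total, n_group)
    return 100.0 * n_sampled / n_possible, n_possible


zalesky_pct, zalesky_all = sampled_percentage(5_000, 27, 12)
thompson_pct, thompson_all = sampled_percentage(1_000_000, 40, 20)
lee_pct, lee_all = sampled_percentage(5_000, 33, 10)

# Python integers have arbitrary precision, so the twin-study count is exact.
twin_all = comb(271, 131)
exceeds_1e80 = twin_all > 10**80
```

The three percentages come out at roughly 0.029, 0.0007, and 0.005, and `exceeds_1e80` is true, which is why exhaustive or even densely sampled permutation testing is out of reach for the twin data.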

### Results

We used *β*_{0} and *β*_{1} in computing KS-distances. Let *ϕ* ∘ *C*_{MZ} = (*ϕ*($c_{ij}^{MZ}$)) and *ϕ* ∘ *C*_{DZ} = (*ϕ*($c_{ij}^{DZ}$)) for some monotone function *ϕ*. Then the KS-distance between *C*_{MZ} and *C*_{DZ} is equivalent to the KS-distance between 1 − *C*_{MZ} and 1 − *C*_{DZ}, as well as between *ϕ* ∘ (1 − *C*_{MZ}) and *ϕ* ∘ (1 − *C*_{DZ}). Thus, we simply built filtrations over *C*_{MZ} and *C*_{DZ} and computed the KS-distance without using the square root of 1 − correlation. We used 101 filtration values between 0 and 1 at 0.01 increments (Figure 4). This gives a reasonably accurate estimate of the maximum gap in the *β*_{i}-plots between the twins (Figure 5). For *β*_{0}-plots, the maximum gap is 82, which gives a *p* value smaller than 10^{−24}. For *β*_{1}-plots, the maximum gap is 3,647, which gives a *p* value smaller than 10^{−32}. At the same correlation value, MZ twins are more connected than DZ twins. MZ twins also have more cycles than DZ twins. Such huge topological differences are attributed to heritability.
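The maximum gap between Betti plots can be sketched as follows. For a graph, *β*_{0} is the number of connected components and *β*_{1} = (number of edges) − (number of nodes) + *β*_{0} is the number of independent cycles. The sketch assumes that an edge is kept when its correlation exceeds the filtration value, and it uses a simple union-find structure. The function names are hypothetical, and the authors' MATLAB code may be organized differently.

```python
def betti_numbers(weights, threshold):
    """Return (beta0, beta1) of the graph keeping edges with weight > threshold."""
    p = len(weights)
    parent = list(range(p))

    def find(a):
        # Union-find root lookup with path halving.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    components, edges = p, 0
    for i in range(p):
        for j in range(i + 1, p):
            if weights[i][j] > threshold:
                edges += 1
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_i] = root_j
                    components -= 1
    # Euler characteristic of a graph: nodes - edges = beta0 - beta1.
    return components, edges - p + components


def max_betti_gaps(weights_1, weights_2, thresholds):
    """Largest absolute difference in beta0 and in beta1 over all thresholds."""
    gap0 = gap1 = 0
    for eps in thresholds:
        b0_a, b1_a = betti_numbers(weights_1, eps)
        b0_b, b1_b = betti_numbers(weights_2, eps)
        gap0 = max(gap0, abs(b0_a - b0_b))
        gap1 = max(gap1, abs(b1_a - b1_b))
    return gap0, gap1


# 101 filtration values between 0 and 1 at 0.01 increments.
thresholds = [i / 100 for i in range(101)]
strong = [[1.0, 0.9, 0.9], [0.9, 1.0, 0.9], [0.9, 0.9, 1.0]]
weak = [[1.0, 0.2, 0.2], [0.2, 1.0, 0.2], [0.2, 0.2, 1.0]]
gaps = max_betti_gaps(strong, weak, thresholds)
```

In this toy example the strongly correlated triangle stays connected and keeps its single cycle over a range of thresholds where the weakly correlated one has already fallen apart into three isolated nodes, so the gaps are 2 for *β*_{0} and 1 for *β*_{1}.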

Figure 6, which displays the HI thresholded at 100% heritability, shows that MZ twins are far more similar than DZ twins in many connections, suggesting that genes influence the development of these connections. The most heritable connections include the left frontal gyrus, left and right middle frontal gyri, left superior frontal gyrus, left parahippocampal gyrus, left and right thalami, and left and right caudate nuclei, among many other regions. Most regions overlap with highly heritable regions observed in other twin brain-imaging studies (Fan, Fossella, Sommer, Wu, & Posner, 2003; Glahn et al., 2010; Gritsenko et al., 2018). Moreover, the findings here are somewhat consistent with a previous diffusion tensor imaging study on twins from our group (Chung, Luo, Adluru, Alexander, Richard, & Goldsmith, 2018a; Chung et al., 2018b), showing that many regions of both resting-state functional and structural connections are heritable at the same time. The left and right caudate nuclei are identified as the most heritable hub nodes in our study.

The MATLAB codes for the simulation study as well as the connectivity matrices *C*_{MZ} and *C*_{DZ} used in generating results are given at http://www.stat.wisc.edu/~mchung/TDA.

## DISCUSSION

### The Limitation of KS-distances

Currently, KS-distance is applied to the Betti numbers *β*_{0} and *β*_{1} separately. It may be possible to construct a new topological distance that uses a combination of both *β*_{0} and *β*_{1} and is topologically more sensitive. One possible approach is to use the convex combination *α*$D_{KS}^{0}$ + (1 − *α*)$D_{KS}^{1}$, where $D_{KS}^{i}$ is the KS-distance for *β*_{i} and 0 ≤ *α* ≤ 1. This is beyond the scope of this paper and is left as a future study.

### Other Network Distances

The network distances used in this study are not just any other distances but metrics. Since there are almost infinitely many possible similarity measures and distances we can use in networks, the performance of the chosen distance is important in discrimination tasks, which we have shown in simulation studies. The determination of the optimal distance is related to *metric learning*, an area of supervised machine learning in which the goal is to learn from data an optimal similarity function that measures how similar two objects are (Ktena, Parisot, Ferrante, Rajchl, Lee, Glocker, & Rueckert, 2018; Lowe, 1995). This is left as a future study.

### Computational Issues

The total number of permutations in permuting two groups of size *q* each is $\binom{2q}{q} \sim 4^{q}/\sqrt{\pi q}$. Even for a small *q* = 10, more than tens of thousands of permutations are needed for an accurate approximation of the *p* value. The main advantage of KS-distance over all other distance measures is that it avoids numerically performing the permutation test and avoids generating tens of thousands of permutations. Although the probability distribution of the KS-distance is actually based on the permutation test, the probability is computed combinatorially. We believe that it is possible to develop similar theoretical results for other distance measures and come up with a method for avoiding a resampling-based method for statistical inference.
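To illustrate what computing the probability combinatorially can look like, the sketch below implements the classical exact permutation distribution of the two-sample KS statistic by counting monotone lattice paths. It assumes two step functions with *q* unit jumps each and asks how many of the $\binom{2q}{q}$ interleavings keep the gap between them strictly below *d*. The function names are hypothetical, and this is a sketch of the general technique rather than the authors' implementation. A second helper evaluates the Stirling approximation $4^{q}/\sqrt{\pi q}$ of the central binomial coefficient.

```python
import math
from math import comb


def ks_exact_pvalue(q, d):
    """P(max gap >= d) over all comb(2q, q) equally likely interleavings.

    A[u][v] counts monotone lattice paths from (0, 0) to (u, v) that stay
    strictly inside the band |u - v| < d; paths leaving the band count as 0.
    """
    if d <= 0:
        return 1.0
    A = [[0] * (q + 1) for _ in range(q + 1)]
    A[0][0] = 1
    for u in range(q + 1):
        for v in range(q + 1):
            if (u == 0 and v == 0) or abs(u - v) >= d:
                continue
            from_left = A[u - 1][v] if u > 0 else 0
            from_below = A[u][v - 1] if v > 0 else 0
            A[u][v] = from_left + from_below
    return 1.0 - A[q][q] / comb(2 * q, q)


def central_binomial_stirling(q):
    """Stirling approximation 4^q / sqrt(pi * q) of comb(2q, q)."""
    return 4.0**q / math.sqrt(math.pi * q)


p_small = ks_exact_pvalue(q=2, d=2)
approx_q10 = central_binomial_stirling(10)
exact_q10 = comb(20, 10)
```

For *q* = 2 and *d* = 2, four of the six paths stay inside the band, so the *p* value is 1/3. The recursion costs on the order of *q*² additions instead of enumerating $\binom{2q}{q}$ relabelings. Very small *p* values lose precision in the final floating-point subtraction, so exact rational arithmetic would be preferable in that regime.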

## AUTHOR CONTRIBUTIONS

Moo Chung: Conceptualization; Data curation; Formal analysis; Funding acquisition; Investigation; Methodology; Project administration; Resources; Software; Supervision; Validation; Visualization; Writing - Original Draft; Writing - Review & Editing. Hyekyoung Lee: Investigation; Methodology; Validation; Visualization; Writing - Original Draft. Alex DiChristofano: Investigation. Hernando Ombao: Writing - Review & Editing. Victor Solo: Conceptualization; Methodology; Writing - Review & Editing.

## FUNDING INFORMATION

Moo Chung, National Institutes of Health (http://dx.doi.org/10.13039/100000002), Award ID: EB022856. Hyekyoung Lee, National Research Foundation of Korea (http://dx.doi.org/10.13039/501100003725), Award ID: NRF-2016R1D1A1B03935463.

## ACKNOWLEDGMENTS

We thank Yuan Wang of University of South Carolina, Peter Bubenik of University of Florida, Bala Krishnamoorthy of Washington State University, Dustin Pluta of University of California-Irvine, Alex Leow of University of Illinois-Chicago, and Martin Lindquist of Johns Hopkins University for valuable discussions. We also thank Andrey Gritsenko and Gregory Kirk of University of Wisconsin-Madison for logistic support and image preprocessing help.

## TECHNICAL TERMS

- Persistent homology:
A topological data analysis technique for computing topological features at different spatial resolutions.

- Graph filtration:
A collection of nested graphs.

- Metric space:
A set with a metric defined on the set.

- Permutation test:
Determines the statistical significance by calculating all possible values of the test statistic under all possible rearrangements of the samples.

- Kolmogorov-Smirnov (KS) distance:
A distance between the empirical distributions of two samples.

- Mixed-effect model:
A model with both fixed and random effect terms.

- Heritability index:
A number between 0 and 1 that measures the amount of genetic contribution.

- Betti-plots:
Displays the change of Betti numbers over filtration values.

## Author notes

Competing Interests: The authors have declared that no competing interests exist.

Handling Editor: Paul Expert