Summarizing large-scale directed graphs into small-scale representations is a useful but less-studied problem setting. Conventional clustering approaches, based on Min-Cut-style criteria, compress both the vertices and edges of the graph into communities, which leads to a loss of directed-edge information. On the other hand, compressing the vertices while preserving the directed-edge information provides a way to learn a small-scale representation of a directed graph. The reconstruction error, which measures how much edge information is preserved by the summarized graph, can be used to learn such a representation. Compared to the original graphs, the summarized graphs are easier to analyze and can extract group-level features, useful for efficient interventions on population behavior. In this letter, we present a model, based on minimizing reconstruction error with nonnegative constraints, which relates to a Max-Cut criterion and simultaneously identifies the compressed nodes and the directed compressed relations between these nodes. A multiplicative update algorithm with column-wise normalization is proposed. We further provide theoretical results on the identifiability of the model and the convergence of the proposed algorithms. Experiments demonstrate the accuracy and robustness of the proposed method.

## 1 Introduction

In directed graphs, it is important to understand the influence between vertices, which is represented by the directed edges. Investigating the influence structure in graphs has become an evolving research field that attracts wide attention from scientific communities including the social sciences (Tang, Sun, Wang, & Yang, 2009; Li, Zhang, & Huang, 2018; Mehmood, Barbieri, Bonchi, & Ukkonen, 2013), economics (Spirtes, 2005; Jackson, 2011), and the ecological sciences (Pavlopoulos et al., 2011; Delmas et al., 2019). In large-scale, densely connected directed graphs, finding an efficient way to compress vertices and summarize the directed influence between them is not only useful for visualizing complicated networks but also crucial for extracting group-level features for further analysis, such as profiling or intervention.

Conventional graph clustering methods group the densely connected vertices into the same community on undirected graphs (Fortunato, 2010; Schaeffer, 2007; Shi & Malik, 2000). Directed graph clustering is commonly based on symmetrized undirected graphs (Malliaros & Vazirgiannis, 2013). However, the recovered communities do not preserve much of the edge information since the communities themselves are sparsely connected. Hence, effective reconstruction of the original graph from the summarized graph is a meaningful task that enjoys applications in graph compression (Dhabu, Deshpande, & Vishwakarma, 2013; Dhulipala et al., 2016), graph sampling (Orbanz, 2017; Leskovec & Faloutsos, 2006) and others.

For example, in a large-scale social network, individual-level connections are hard to analyze and contain a fair amount of noise, so it is complicated to directly extract group-level features and interpret the influence structure of the graph. In social network analysis, for instance, the key opinion leaders (KOL) (Valente & Pumpuang, 2007; Nisbet & Kotcher, 2009) with common features may also share a similar influence structure. Such information is important for understanding opinion diffusion within the network, as well as for implementing interventions for various purposes such as marketing (Chaney, 2001) or polling (Zhou, Zeng, & Zhang, 2009; Thomson et al., 1998). Moreover, extracting these features from the KOL within a group may also enable us to analyze the fairness of a certain process and perform de-biasing actions when necessary.

Previous works have considered related problems in undirected graph settings (Shahaf et al., 2013; Navlakha, Rastogi, & Shrivastava, 2008), which aim to define compressed nodes that preserve particular structures. The graph compression literature (Maneth & Peternek, 2015; Fan, Li, Wang, & Wu, 2012; Dhulipala et al., 2016) is also related, although its goal is to minimize storage space, irrespective of preserving feature patterns of the graph. In addition, another line of related work, under the theme of influence maximization (Li, Fan, Wang, & Tan, 2018), studies the directed influence of a set of vertices on the rest of the network.

In our setting, we would like to extract sets of vertices, each becoming a compressed node, such that the influence between vertices is maximally preserved by the directed summarized graph. Previous work such as flow-based graph summarization (Shi et al., 2016) or graph de-densification (Maccioni & Abadi, 2016) addressed a similar problem based on directed influence. Though these works deal with directed graphs, the directions of summarized nodes are defined from different domains so that the algorithms essentially apply to symmetrized undirected graphs. In this work, we present a novel criterion that is specifically applicable to directed graphs, exploiting the asymmetric information of the directed edges and preserving the influence as much as possible.

This letter is organized as follows. In section 2, we introduce notations and the problem setting. In section 3, we present our learning objective and propose the structured nonnegative matrix factorization (StNMF) algorithm to solve the problem. In section 4, we provide theoretical results on the reconstruction error, identifiability, and convergence of the algorithm. In section 5, we discuss related work. In sections 6 and 7, we experimentally demonstrate the usefulness of the proposed method, and we conclude in section 8.

## 2 Preliminaries and Problem Formulation

In this work, we focus on simple directed graphs, which exclude self-loops and multiple edges. We use "graph" to refer to a directed graph when there is no ambiguity. A positive value in the adjacency matrix represents an out-edge; we may use a negative value to represent an in-edge. The inhibition type of directed relations, in which an out-edge has a negative influence, is outside the scope of this letter. We continue by defining some preliminary concepts.

### 2.1 Notations and Definitions

Denote a directed graph with node set $V$ and directed-edge set $E$ by $G=(V,E)$, and denote a summarized directed graph with compressed node set $C$ and directed relation set $R$ by $H=(C,R)$. In this work, both $G$ and $H$ are simple. We distinguish the terms used for the two graphs as shown in Table 1. A node-compression is a function $\varphi_V: V \to C$ that assigns a vertex $x_i \in V$ to a compressed node $c_I \in C$. In this work, $\varphi_V$ is surjective. Note that, as opposed to the graph partition problem, we do not require every vertex to belong to a compressed node; that is, the union of the members of the identified compressed nodes need not be the full vertex set. An edge-compression is a function $\varphi_E: E \to R$. We say an edge-compression $\varphi_E$ is induced from a node-compression $\varphi_V$ if $\varphi_V(x_i)=\varphi_V(x_{i'})$ and $\varphi_V(x_j)=\varphi_V(x_{j'})$ imply $\varphi_E(e_{ij})=\varphi_E(e_{i'j'})$, $\forall i,j,i',j'$; that is, edges between vertices assigned to the same pair of compressed nodes admit the same compressed relation. Hence, we can write $\varphi_E(e_{ij})=r_{IJ}$, $\forall \varphi_V(x_i)=c_I, \varphi_V(x_j)=c_J$. In this work, we only consider edge-compressions induced from node-compressions.
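As a minimal illustration (the vertex labels, compressed-node labels, and helper name are hypothetical, not from the letter), an induced edge-compression can be sketched as:

```python
def induced_edge_compression(phi_V, edges):
    """Map each directed edge (i, j) to the compressed relation
    (phi_V[i], phi_V[j]) induced by the node-compression phi_V.
    Vertices not assigned to any compressed node are skipped."""
    return {(i, j): (phi_V[i], phi_V[j])
            for (i, j) in edges if i in phi_V and j in phi_V}

# Vertices 0, 1 are assigned to compressed node "c1"; 2, 3 to "c2".
phi_V = {0: "c1", 1: "c1", 2: "c2", 3: "c2"}
edges = [(0, 2), (1, 3), (0, 3)]
phi_E = induced_edge_compression(phi_V, edges)
```

All three edges land on the same compressed relation `("c1", "c2")`, matching the induced-compression condition above.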

| Graph | Nodes | Edges/Relations |
|---|---|---|
| original graph $G$ | vertices $x_i \in V$ | edges $e_{ij} \in E$ |
| summarized graph $H$ | compressed nodes $c_I \in C$ | compressed relations $r_{IJ} \in R$ |


Denote by $A \in \mathbb{R}^{n \times n}$ the asymmetric adjacency matrix of a directed graph, with $A_{ij}=1$ if there is a directed edge from $x_i$ to $x_j$ and $A_{ij}=0$ otherwise. Denote by $T \in \mathbb{R}^{n \times n}$ the skew-symmetric adjacency matrix of a directed graph, with $T_{ij}=1$ and $T_{ji}=-1$ if there is a directed edge from $x_i$ to $x_j$, and $T_{ij}=0$ if there is no edge between $x_i$ and $x_j$. We say a directed graph is connected if its undirected skeleton is connected.
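A sketch of the two adjacency representations (assuming, as in the influence-preserving setting below, that at most one of the two directions is present between any pair of vertices, so the skew-symmetric form is well defined):

```python
import numpy as np

def adjacency_matrices(n, edges):
    """Build the asymmetric adjacency A (A_ij = 1 for an edge i -> j)
    and the skew-symmetric adjacency T (T_ij = 1, T_ji = -1)."""
    A = np.zeros((n, n))
    for i, j in edges:
        A[i, j] = 1.0
    T = A - A.T  # valid when no pair is connected in both directions
    return A, T

A, T = adjacency_matrices(3, [(0, 1), (1, 2)])
```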

### 2.2 Influence-Preserving Criteria

We use the term *influence* to describe the information carried by directed edges. By choosing different losses $\ell$, the reconstruction error describes different types of influence-preserving criteria. We say a graph has an exact influence-preserving structure (IPS) if the relevant reconstruction error $L_0=0$.

Choosing $\ell_d(e_{ij}, r_{IJ}) = 1 - \mathbb{1}\{\mathrm{sign}(e_{ij}) = \mathrm{sign}(r_{IJ})\}$, $\forall i \neq j$,^{1} we describe the reconstruction by recovering the direction of each edge. We say a graph summarization has an exact direction influence-preserving structure (D-IPS) if there exists a summarization such that the reconstruction error based on $\ell_d$ is 0.

In a weighted graph, we may preserve not only the directional information of the edges but also the weight information. Hence, we may choose the squared loss between edges and compressed relations: $\ell_w(e_{ij}, r_{IJ}) = (e_{ij} - r_{IJ})^2$, $\forall i \neq j$. We say a graph summarization has an exact weight influence-preserving structure (W-IPS) if there exists a summarization such that the reconstruction error based on $\ell_w$ is 0. Note that D-IPS is a special case of W-IPS: for a uniformly weighted graph, an exact W-IPS is equivalent to an exact D-IPS.

When a graph does not have an exact IPS, as is common in practice, we would like to simultaneously learn a node-compression $\varphi_V$ and an edge-compression $\varphi_E$ such that the corresponding summarization minimizes the relevant reconstruction error $L_0$.
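The two criteria above can be sketched as follows (a hypothetical helper, not from the letter; `edges` maps directed edges to weights, and `r` holds the compressed relation weights):

```python
import numpy as np

def reconstruction_error(edges, phi_V, r, loss="d"):
    """L0 with the direction loss ell_d or the squared weight loss ell_w."""
    err = 0.0
    for (i, j), w in edges.items():
        r_IJ = r.get((phi_V[i], phi_V[j]), 0.0)
        if loss == "d":   # ell_d: cost 1 unless the signs (directions) agree
            err += 0.0 if np.sign(w) == np.sign(r_IJ) else 1.0
        else:             # ell_w: squared difference of weights
            err += (w - r_IJ) ** 2
    return err

phi_V = {0: "c1", 1: "c1", 2: "c2", 3: "c2"}
edges = {(0, 2): 1.0, (1, 3): 1.0}
r = {("c1", "c2"): 1.0}
```

Here both edges agree with $r_{c_1 c_2}$, so the summarization has an exact D-IPS (and, being uniformly weighted, an exact W-IPS).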

## 3 Learning the Influence-Preserving Summarization

In this section, we present the formulation of influence-preserving summarization as a constrained supervised learning objective based on the reconstruction error. Our “labels” can be seen as the compressed relations. We then present the algorithm to solve the constrained optimization problem.

### 3.1 The Constrained Supervised Learning Objective

We start with defining our learning objective based on the reconstruction loss with $\u2113w$ and derive the factorization model as our constrained optimization objective.

#### 3.1.1 The IPS-Based Objective

**Lemma 1.**

The proof proceeds by basic linear algebra and can be found in appendix A. Note that $R$ is an asymmetric adjacency matrix representing the compressed relations in the summarized graph. Since the summarized graph is assumed to be simple, at most one of $r_{IJ}$ and $r_{JI}$ is nonzero for all $I \neq J$, and $R_{II}=0$, $\forall I$, which implies the constraint $R_{IJ}R_{JI}=0$, $\forall I,J$.

#### 3.1.2 Continuous Relaxation

We show in theorem 2 below that the objectives in equations 3.4 and 3.5 admit the same solution, up to permutation, in the exact D-IPS case. Although the asymmetric matrix $A$ is useful for deriving the identifiability result, the skew-symmetric matrix $T$ is easier to optimize with and more robust in noisy cases, as the model explicitly penalizes reversely directed noise edges. Further details are discussed in appendix B. In the rest of the letter, we use the skew-symmetric matrix $T$ to represent graphs.

#### 3.1.3 Positive Values Identify the Compressed Nodes

We show in section 4.1 that under the exact D-IPS, the factor $U$ is nonnegative, and its positive entries correctly identify the compressed nodes. Hence, we propose a nonnegativity-constrained factorization model for more efficient node-compression identification in the presence of noise:

### 3.2 Learning Algorithms

This optimization objective can be solved by gradient methods with projection onto the Stiefel manifold, as discussed in Hirayama, Hyvärinen, Kiviniemi, Kawanabe, and Yamashita (2016) and Edelman, Arias, and Smith (1998). However, the projection-based algorithm is very sensitive to initialization. Instead, we propose a multiplicative update scheme, $U \leftarrow U \odot \frac{[\nabla_U L]^-}{[\nabla_U L]^+}$, modified from Ding et al. (2006, 2010) and Lee and Seung (2001), where $X^+$ and $X^-$ denote the positive and negative parts of a matrix $X$, respectively. The modification not only allows imposing the specific skew-symmetric structure of $S$ and the orthogonality constraints but also gives more stable results. This leads to our proposed structured nonnegative matrix factorization (StNMF) in algorithm 1.
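A generic sketch of the sign-split multiplicative step, illustrated here on a plain Frobenius NMF subproblem rather than the full StNMF objective (the gradient is supplied already split into its nonnegative parts, the usual way such updates are implemented so the denominator never vanishes):

```python
import numpy as np

def multiplicative_step(U, grad_pos, grad_neg, eps=1e-12):
    """U <- U * [grad]^- / [grad]^+, where grad = grad_pos - grad_neg
    is a split of the gradient into nonnegative parts. Nonnegative U
    stays nonnegative, and zero entries stay zero (zero locking)."""
    return U * grad_neg / (grad_pos + eps)

# Toy subproblem: min_U ||X - U V||_F^2 with V fixed;
# grad_U = U V V^T - X V^T, both terms elementwise nonnegative here.
rng = np.random.default_rng(0)
X, V = rng.random((6, 4)), rng.random((2, 4))
U = rng.random((6, 2))
U[0, 0] = 0.0                      # stays locked at zero
f0 = np.linalg.norm(X - U @ V)
for _ in range(100):
    U = multiplicative_step(U, U @ V @ V.T, X @ V.T)
f1 = np.linalg.norm(X - U @ V)
```

The zero-locking property visible here is what lets the skew-symmetric pattern of $S$ survive the updates in algorithm 1.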

The nonnegative matrix $U$ can be effectively initialized via nonnegative SVD (Boutsidis & Gallopoulos, 2008). $S$ can be initialized by any $k \times k$ skew-symmetric matrix; for instance, when $k=2$, we set the initial $S = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}$. The algorithm exploits the zero-locking property of the multiplicative update scheme so that the desired structure of $S$ is preserved throughout the updates with such initialization. In the fixed regularization scheme, different choices of $\Lambda$ lead to different locally optimal solutions, depending on the user's preference. For instance, a user who wants a strictly nonoverlapping compressed node set may set the magnitude of the off-diagonal terms large to emphasize orthogonality; a user interested in the weight assignment between compressed nodes may set the diagonal terms relatively large to encourage unit-length columns. A full discussion of this topic is beyond the scope of this letter.

It is important to note that the directed compressed relations can be read off from $S$, which represents the skew-symmetric (weighted) adjacency matrix of the summarized graph $H=(C,R)$. Hence, our model is able to simultaneously identify the node-compression and the edge-compression, thus the summarized graph $H$.

We can also optimize $\Lambda$ using the Karush-Kuhn-Tucker (KKT) complementarity condition (Kuhn & Tucker, 1951) and set $\Lambda = U^\top Q - P = (U^\top Q)^+ + P^- - (U^\top Q)^- - P^+$, where $P = S^\top U^\top U S$ and $Q = T^\top U S$. The derivation can be found in section 4.2. In addition, algorithm 1 does not guarantee a tightly bounded norm of the column vectors in $U$, and when the adaptive regularizer is used, the optimization trajectory is not monotonically nonincreasing. Hence we impose a column-wise normalization step in algorithm 2 to alleviate this problem. Denote by $D_{UTU}$ the diagonal matrix whose $i$th diagonal entry is the Frobenius norm of the $i$th column of the matrix $UTU$.
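The column-wise normalization step can be sketched as plain unit-norm column scaling (a simplified toy version; algorithm 2 forms the norms via the diagonal matrix $D_{UTU}$, which this sketch does not reproduce):

```python
import numpy as np

def columnwise_normalize(U):
    """Scale each column of U to unit Euclidean norm (zero columns kept)."""
    norms = np.linalg.norm(U, axis=0)
    norms[norms == 0.0] = 1.0  # leave all-zero columns untouched
    return U / norms

Un = columnwise_normalize(np.array([[3.0, 0.0], [4.0, 2.0]]))
```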

### 3.3 Multiple Network Summarization: The Tensor Case

We can also extend our model to extract community structure from a series of relational network matrices stacked into a tensor $\mathcal{T} \in \mathbb{R}^{n \times n \times t}$. Such a tensor can arise in various scenarios, for example, the same relationship network observed over different time periods, or multiple relationships among the same set of users. The second scenario is commonly observed in social networks: different types of opinions between users or friends, such as "like," "support," and "disagree," form such a set of networks.

Using algorithm 3, we directly obtain a paired vector $v, w$ for cluster assignment. Note that without imposing $\eta_k > 0$, $\forall k$, in the objective, the directed relationships can differ across relationship networks: the $k$th relationship has direction $V \to W$ if $\eta_k > 0$ and $W \to V$ if $\eta_k < 0$. This is useful when we assume no prior information about these networks, since a recovered reversed direction may correspond to the semantic opposite of a relationship. For example, in social network analysis we may observe networks representing "like" and "dispute," where opposite directions between communities may be expected; thus, we may not want to constrain the sign of $\eta_k$. On the other hand, in some settings we can impose $\eta_k > 0$, $\forall k$, for unidirectional relationships between communities across all observed networks. For instance, in an opinion diffusion social network, we may want to find a community with influence on another community across different social issues (with each issue represented as a slice of the directed network tensor); assuming $\eta_k > 0$, $\forall k$, preserves these influence structures and enables us to conclude an overall directed relationship $V \to W$.
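Reading off the per-network directions from the signs of the coefficients can be sketched as (the $\eta_k$ values here are made up for illustration):

```python
import numpy as np

eta = np.array([0.9, -0.3, 0.5])  # hypothetical fitted eta_k values
# eta_k > 0: the k-th relationship runs V -> W; eta_k < 0: W -> V.
directions = ["V->W" if e > 0 else "W->V" for e in eta]
```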

## 4 Theoretical Analysis

In this section, we present theoretical results for the identification of nonnegative models and analysis of structured nonnegative matrix factorization (StNMF).

### 4.1 Identifiability Analysis

**Theorem 1**

(exact D-IPS identification). Let $A$ be the asymmetric adjacency matrix of a directed graph with the exact D-IPS with $k$ compressed nodes. Assume that each submatrix between compressed nodes has a distinct leading singular value with geometric multiplicity one. Then the optimization problem in equation 3.4 has a unique solution $U \in \mathbb{R}^{n \times k}$ such that $U \ge 0$, and the positive part of each column vector of $U$ identifies a compressed node.

The proof technique extends the $k=2$ case, which applies the Perron-Frobenius theorem to the rearranged block matrix. Details can be found in appendix A. For graphs with more than one connected component, the most strongly connected component is identified first, and subsequent components can be identified via the deflation methods discussed in Hyvärinen et al. (2016) and Hirayama et al. (2016), which are beyond the scope of this letter.

**Lemma 2.**

If a directed graph with asymmetric adjacency matrix $A$ has the exact D-IPS, then $A$ can be divided into block submatrices according to the compressed nodes such that if a block $\tilde{A}_{IJ}$ is nonzero, its blockwise transpose $\tilde{A}_{JI}$ is a zero matrix, and the diagonal blocks are zero matrices.

**Proof of Lemma 2.**

By definition of the exact D-IPS, the edges between any two compressed nodes all have the same direction, and there are no edges within a compressed node. The result follows since the summarized graph is simple.

**Theorem 2**

(equivalence of the decompositions of $A$ and $T$). Let $A$ and $T$ be the asymmetric and skew-symmetric adjacency matrices, respectively, of a directed graph with the exact D-IPS. The optimal solutions $U$ for $L_3$ and $L_4$ are the same up to permutation.

Detailed proofs are in appendix A.

**Corollary 1.**

Let $T$ be the skew-symmetric adjacency matrix of a directed graph with the exact D-IPS. Assume that each submatrix between compressed nodes has geometric multiplicity two. Then the optimization problem with loss $L_4$ has a unique solution such that $U \ge 0$, and the positive part of each column vector of $U$ identifies a community.

The corollary follows from combining theorems 1 and 2.

### 4.2 Convergence Analysis

**Lemma 3.**

The proof is based on pairing symmetric terms, and details can be found in appendix A.

**Lemma 4.**

is an auxiliary function of equation 4.2.

**Theorem 3.**

The update rule described in algorithm 1 is nonincreasing and converges to the stationary point of objective in equation 3.7.

The proof of theorem 3 is in appendix A. The proof technique is based on carefully chosen symmetric matrices, which applies for fixed $\Lambda$ since it is a symmetric matrix. With the adaptively chosen $\Lambda$ in each step, as in equation A.2, the columns of $U$ do not have a fixed norm and $P^+ + \Lambda$ is no longer symmetric, making the proof technique of theorem 3 inapplicable. However, with the proposed column-wise normalization scheme, the algorithm is shown to be monotonically nonincreasing and convergent to a stationary point.

**Theorem 4.**

The update rule in algorithm 2 is nonincreasing and convergent.

The proof proceeds by constructing a symmetric matrix based on the normalized column vectors and applying the proof techniques of theorem 3. Detailed derivations are in appendix A.

## 5 Related Work

### 5.1 Graph Summarization

With increasing graph size, it is of interest to summarize graph structures to facilitate the understanding of network data and their meaning, which can be useful for data mining tasks. Graph summarization procedures are application-dependent and vary by goal, ranging from compressing data volume and preserving distributional properties of graphs to maintaining specific patterns on graphs. Therefore, various goal-oriented graph summarization methods have been developed to tackle the tasks of supervised edge-classification (van Leeuwen, Vreeken, & Siebes, 2006), unsupervised edge-clustering (Cilibrasi & Vitányi, 2005; Liu, Shah, & Koutra, 2015), outlier detection (Akoglu, Tong, Vreeken, & Faloutsos, 2012; Akoglu et al., 2015), and pattern set mining (Vreeken, Van Leeuwen, & Siebes, 2011). Beyond static networks, dynamic networks, usually represented in the form of tensors, have been studied for graph summarization tasks (Xu, Kliger, & Hero, 2011). In our problem setting, we aim to simultaneously compress the nodes and edges in the form of Figure 2, which differs essentially from existing graph summarization goals or settings. A more closely related class of graph summarization goals can be viewed as influence-based summarization (Mehmood et al., 2013; Purohit, Prakash, Kang, Zhang, & Subrahmanian, 2014; Shi, Tong, Tang, & Lin, 2014, 2015; Shi et al., 2016), which aims to identify a description of influence propagation in graphs. Hence, we develop nonnegative theory for our goal and employ nonnegativity-based factorization methods to solve the problem. A comprehensive survey of relevant summarization techniques can be found in Liu, Safavi, Dighe, and Koutra (2018).

### 5.2 Nonnegative Matrix Factorization (NMF)

Because our graph summarization goal reduces, by theorem 2, to a nonnegative decomposition of the skew-symmetric adjacency matrix, the algorithm developed for the task is closely related to NMF. NMF algorithms (Lee & Seung, 2001) extract features represented by nonnegative values, minimizing the objective $\|X - WH\|_F^2$ or $\sum_{i,j}\bigl(X_{ij}\log\frac{X_{ij}}{(WH)_{ij}} - X_{ij} + (WH)_{ij}\bigr)$, where $X, W, H \ge 0$ are all nonnegative matrices, with $W$ and $H$ being low rank. Subsequently, variants of NMF have been developed to tackle practical problems, including real data with nonnegative features (Ding et al., 2010), where $W \ge 0$ and the entries of $H$ may have both signs, and features with orthogonality constraints (Ding et al., 2006; Edelman et al., 1998; Hirayama et al., 2016), where the columns of $W$ are orthogonal. Implementation techniques such as initialization (Boutsidis & Gallopoulos, 2008) have also been investigated. Because directed graphs involve skew-symmetric matrices, no prior work has explicitly addressed this problem. Our proposed computational scheme, StNMF, though based on the NMF framework, handles this previously unaddressed setting, for which we provide useful insights. The column-wise normalization procedure was first mentioned in Ding et al. (2010); we prove that it works in our setting. More details are in section B.3.
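The two classical NMF objectives mentioned above can be written as a short sketch (with a small `eps` guarding the logarithm in the generalized KL divergence):

```python
import numpy as np

def nmf_objectives(X, W, H, eps=1e-12):
    """Frobenius and generalized KL objectives for X ~ W H."""
    WH = W @ H
    fro = np.linalg.norm(X - WH) ** 2
    kl = np.sum(X * np.log((X + eps) / (WH + eps)) - X + WH)
    return fro, kl

W = np.array([[1.0], [2.0]])
H = np.array([[1.0, 3.0]])
fro, kl = nmf_objectives(W @ H, W, H)  # exact rank-1 factorization
```

Both objectives vanish at an exact factorization, which is the base case the multiplicative updates are built around.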

### 5.3 Tensor Decomposition

Feature extraction from data in the form of higher-order arrays (tensors) has been explored via tensor decomposition (Kolda & Bader, 2009). The first exploration can be traced back to understanding the multiway interaction of dissolved organic matter fluorescence (Stedmon & Bro, 2008) via parallel factorization, or PARAFAC (Bro, 1997), a generalization of PCA methods to higher-order tensor data. PARAFAC features may be nontrivial to obtain; thus, alternating least-squares-based algorithms have been developed to compute PARAFAC efficiently (Comon, Luciani, & De Almeida, 2009). Moreover, an improved version that automatically determines the appropriate number of components (i.e., the analog of the rank in the matrix decomposition case) has been developed (Bro & Kiers, 2003). Subsequently, a more efficient method to extract features from tensors, higher-order singular value decomposition (HOSVD; De Lathauwer, De Moor, & Vandewalle, 2000), has been developed, which has better rank properties than PARAFAC. Nonnegative versions of tensor decomposition methods (Cichocki, Zdunek, Phan, & Amari, 2009; Shashua & Hazan, 2005; Zhou et al., 2009) have also been developed. In section 3.3, we also deal with multiple network data in the form of a three-dimensional tensor. Because our summarization goal is based on the number of compressed nodes, we need to explicitly control the rank of the tensor approximations, for which nonnegative tensor factorization methods are not entirely applicable. Instead, we adapt the proxy variable strategy from Hirayama et al. (2016) and Hyvärinen et al. (2016), which iteratively optimizes the desired objective.

## 6 Simulation Results

In this section, we apply the proposed graph summarization methods to synthetically generated directed graphs and compare them with summarization methods for undirected graphs, as well as with conventional clustering methods such as spectral clustering and normalized cut.

### 6.1 Simulation on Single Network

- **Fix-StNMF**: the fixed regularization scheme in algorithm 1, with $\Lambda$ chosen as a scalar times the all-one matrix;
- **Adaptive-StNMF**: the adaptive scheme described in algorithm 2;

with the competing approaches including

#### 6.1.1 Larger-Scale Networks

We further demonstrate our method on larger-scale networks. As our identifiability result (theorem 1) suggests, the nonnegative values identify the compressed nodes. We investigate node assignment accuracy for networks of up to 100,000 nodes. Fix-StNMF and Adaptive-StNMF are compared with the existing methods Undirected, WNCut, and Spectral, as above. Node assignment accuracy (mean followed by standard deviation) under different noise settings is reported in Table 2, where noise levels are shown as direction/background noise. From the table we see that, for all network sizes, Adaptive-StNMF correctly assigns the vertices when the noise level is minimal (left panel of Table 2), which no other method achieves. Moreover, on large-scale networks Adaptive-StNMF recovers the compressed nodes with high accuracy, outperforming the competing methods (middle and right columns of Table 2).

| (accuracy) | n=1k (0.0/0.1) | n=10k (0.0/0.1) | n=100k (0.0/0.1) | n=1k (0.5/0.7) | n=10k (0.5/0.7) | n=100k (0.5/0.7) | n=1k (0.9/1.0) | n=10k (0.9/1.0) | n=100k (0.9/1.0) |
|---|---|---|---|---|---|---|---|---|---|
| Adaptive-StNMF | 1. (7e-16) | 1. (1e-16) | 0.99 (1e-4) | 0.99 (5e-4) | 0.99 (1e-6) | 0.99 (9e-4) | 0.79 (7e-6) | 0.86 (1e-5) | 0.82 (2e-4) |
| Fixed-StNMF | 0.71 (5e-2) | 0.71 (1e-1) | 0.65 (1e-2) | 0.76 (8e-2) | 0.65 (5e-3) | 0.74 (7e-2) | 0.69 (8e-2) | 0.72 (7e-2) | 0.70 (5e-2) |
| Undirected | 0.55 (1e-2) | 0.34 (2e-3) | 0.34 (3e-3) | 0.36 (7e-3) | 0.34 (3e-3) | 0.34 (2e-3) | 0.32 (1e-2) | 0.34 (3e-3) | 0.34 (4e-3) |
| WNCut | 0.53 (1e-1) | 0.54 (2e-2) | 0.52 (8e-2) | 0.39 (3e-2) | 0.37 (9e-3) | 0.37 (2e-3) | 0.35 (6e-2) | 0.37 (9e-3) | 0.34 (6e-3) |
| Spectral | 0.89 (9e-2) | 0.82 (1e-1) | 0.87 (1e-1) | 0.71 (2e-2) | 0.63 (2e-2) | 0.54 (1e-1) | 0.36 (1e-1) | 0.32 (2e-1) | 0.38 (2e-2) |


Notes: The first number shows the mean accuracy, and the number in parentheses shows its standard deviation over 100 trials. Noise level $l_1/l_2$ denotes that the network is simulated with direction noise at level $l_1$ and background noise at level $l_2$.

### 6.2 Simulations on Multiple Networks: The Tensor Case

In the second set of simulations, we consider the case where multiple networks with the same underlying summarization structure are observed, as discussed in section 3.3. We construct network tensors of size $2000 \times 2000 \times 10$, where each frontal slice of the tensor is a skew-symmetric matrix representing one weighted, directed relationship network in the multiple-network setting. We use criteria similar to those in the matrix case above. The 10 networks are constructed from the same connectivity structure with different noise edges. We then run Tensor-StNMF, the procedure presented in algorithm 3, and compare it with baseline approaches: Tensor-SVD (or HOSVD), that is, SVD on the matricized tensor to obtain the leading singular vectors in each direction (De Lathauwer et al., 2000), followed by projection onto the Stiefel manifold; and Nonnegative PARAFAC, the parallel factorization procedure for the nonnegative case proposed in Shashua and Hazan (2005). We also compare the objective values against the unconstrained optimum, used as a benchmark and denoted Unconstrained.

## 7 Real Network Applications

### 7.1 Seventh-Grade Student Network

| Questions | Relations | $r_k$ |
|---|---|---|
| Who do you get along with? | Character | 0.746 |
| Who are your best friends? | Emotional | 0.398 |
| Who you prefer to work with? | Working | 0.532 |


### 7.2 MEG Data on Visual Task

For a more complicated and interesting case, we study the real network data from the visual92 MEG data set (Cichy, Pantazis, & Oliva, 2014), recorded with a 306-channel Neuromag Vectorview system while both natural and artificial images are presented in one-minute cycles. We extract the 204 gradiometer channels from the total of 306 channels and perform a series of preprocessing steps, including bandpass filtering, ICA artifact removal, and Morlet transformation, to produce sensible and usable signals. The data set consists of 26 minutes of recordings, where the first and last minutes have no images present. Thus, we treat each minute with an image present as a network observation and construct 24 skew-symmetric matrices to form a tensor. We construct the skew-symmetric matrices $T_k$ using three measures to build the directed networks:

## 8 Conclusion

In this work, we propose a new framework to summarize directed graphs based on a criterion that maximally preserves directed-edge information. Our criterion is related to the reconstruction error from the summarized graph to the original graph. We provide theoretical results showing that the novel learning criterion, by preserving the directed-edge information of the original graph, corresponds to nonnegative terms in the factorization of adjacency matrices. This inspires us to formulate an optimization problem with a nonnegative constraint, and we propose a nonnegative algorithm to learn such a graph summarization.

## Appendix A: Additional Theorems and Proofs

### A.1 Proof of Lemma 1

**Proof of Lemma 1.**

**Proposition 1**

(Perron-Frobenius). Suppose $M \in \mathbb{R}^{n \times n}$ is a nonnegative square matrix that is irreducible. Then:

- 1.
$M$ has a positive real eigenvalue $\lambda_{\max}$, such that all other eigenvalues of $M$ satisfy $|\lambda| \le \lambda_{\max}$ (if $M$ is primitive, $|\lambda| < \lambda_{\max}$).

- 2.
$\lambda_{\max}$ has algebraic and geometric multiplicity 1 and a positive eigenvector $x > 0$ (called a Perron vector).

- 3.
Any nonnegative eigenvector is a multiple of $x$.

**Proof of Proposition 1.**

$M$ is an irreducible nonnegative square matrix, so there exists $k \in \mathbb{N}_+$ such that $P = (I+M)^k > 0$. Indeed, $(I+M)^k = I + kM + \binom{k}{2}M^2 + \cdots + M^k$, and by irreducibility and nonnegativity, for large enough $k$ the expansion fills in all $n^2$ entries with positive numbers. Hence $P$ is primitive. We also have $TP = PT$.

Write $T$ for $M$ and define $L(z) = \max\{s \in \mathbb{R} : sz \le Tz\}$ for nonnegative $z \neq 0$. For all $r > 0$, we have $L(rz) = L(z)$ by definition, so $L(z)$ depends only on the ray along $z$.

We write $v \le w$ between vectors to mean $v_i \le w_i$ for all $i$, and similarly for $<$. For $v \le w$ with $v \neq w$, we have $Pv < Pw$, since $P(w-v) \ge 0$ and $P(w-v) \neq 0$.

If $sz \le Tz$ for a scalar $s$, then $Psz \le PTz = TPz$, which implies $s(Pz) \le T(Pz)$. Thus, $L(Pz) \ge L(z)$.

If $L(z)z \neq Tz$, then $L(z)Pz < TPz$, which implies $L(z) < L(Pz)$ unless $z$ is an eigenvector ($Tz = L(z)z$). Hence, a positive $z$ maximizing $L$ must be an eigenvector.

Consider the image of $C$ under $P$. It is compact, being the image of a compact set under a continuous map, and every element of $P(C)$ has all components strictly positive, as $P > 0$. Hence, $L$ is continuous on $P(C)$ and achieves a maximum value on $P(C)$. Since $L(z) \le L(Pz)$, this is in fact the maximum value of $L$ on all of $Q$, which establishes the existence of a maximum eigenvalue. Since $L(Pz) > L(z)$ unless $z$ is an eigenvector of $T$, the maximum $L_{\max}$ is achieved at an eigenvector of $T$; call it $x$, with $x > 0$ and eigenvalue $L_{\max}$. Since $Tx > 0$ and $Tx = L_{\max}x$, we have $L_{\max} > 0$.

Let $y$ be any other eigenvector of $T$, with eigenvalue $\lambda$, so that $\lambda y_i = \sum_j T_{ij} y_j$. As $T \ge 0$, we have $|\lambda||y_i| \le \sum_j T_{ij}|y_j|$; thus, $|\lambda||y| \le T|y|$, and $|\lambda| \le L(|y|) \le L_{\max}$ by the definition of $L$. Writing $\lambda_{\max} = L_{\max}$, we conclude that $|\lambda| \le \lambda_{\max}$. Note that $\lambda_{\max} = 0$ would make $T$ nilpotent, contradicting irreducibility; thus, $\lambda_{\max} > 0$.

Finally, suppose there exists a nontrivial nonnegative eigenvector $y \ge 0$ that is not a multiple of $x$. Since $\lambda_{\max}$ has geometric multiplicity 1, $y^\top x = 0$. However, $x > 0$ and $y \ge 0$ with $y^\top x = 0$ imply $y = 0$, a contradiction.
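The conclusions of proposition 1 can be checked numerically. The following sketch verifies the dominant real eigenvalue and the strictly positive Perron vector for a randomly generated positive (hence irreducible and primitive) matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
# A strictly positive matrix is irreducible and primitive.
M = rng.random((5, 5)) + 0.1

eigvals, eigvecs = np.linalg.eig(M)
i_max = np.argmax(eigvals.real)
lam_max = eigvals[i_max]
others = np.delete(eigvals, i_max)

# Scale the leading eigenvector to a positive representative of its ray.
x = eigvecs[:, i_max].real
x = x / x.sum()

assert abs(lam_max.imag) < 1e-10            # lambda_max is real
assert np.all(np.abs(others) < lam_max.real)  # strictly dominant (primitive case)
assert np.all(x > 0)                        # Perron vector is positive
```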

### A.2 Proof of Theorem 2

**Proof of Theorem 2.**

We write $A$ in block form, where the blocks are grouped by compressed-node assignment. Using the fact that $\mathrm{tr}(AA) = 0 = \mathrm{tr}(RR)$ for a simple graph, and that $U^\top A U$ has the same zero/nonzero positions as $R$ for the exact D-IPS, we have $\mathrm{tr}((A - URU^\top)(A - URU^\top)) = \mathrm{tr}(AA - 2U^\top A U R + RR) = 0$ and

**Theorem 5**

(Bipartite Identification). Let $A$ be the asymmetric adjacency matrix of an exact D-IPS with two compressed nodes. The SVD of $A$ has unique leading left and right singular vectors $v, w \ge 0$, and the positive parts of $v, w$ identify the two compressed nodes.

**Proof of Theorem 5.**

For the exact D-IPS with two compressed nodes, we can always rearrange the vertices such that $A = \begin{pmatrix} 0 & \tilde{A} \\ 0 & 0 \end{pmatrix}$. The matrices $\tilde{A}^\top \tilde{A}$ and $\tilde{A}\tilde{A}^\top$ represent the “in-out” and “out-in” two-step transitions. As the two compressed nodes are connected, any vertex can reach any other vertex in the same compressed node via the two-step transition. Hence, $\tilde{A}^\top \tilde{A}$ and $\tilde{A}\tilde{A}^\top$ are both primitive. By the Perron-Frobenius theorem, each has a unique real, positive leading eigenvector. Padded with zeros, the leading eigenvectors of $\tilde{A}^\top \tilde{A}$ and $\tilde{A}\tilde{A}^\top$ are unique and nonnegative, and their nonzero entries correspond to the compressed-node assignment.
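The identification in theorem 5 can be illustrated numerically. The following sketch builds a small exact D-IPS with two compressed nodes (an all-ones bipartite block, chosen for simplicity) and recovers the two vertex groups from the supports of the leading singular vectors:

```python
import numpy as np

# Exact D-IPS with two compressed nodes: every edge goes from the first
# 4 vertices (senders) to the last 6 (receivers), so A = [[0, Ã], [0, 0]].
A = np.zeros((10, 10))
A[:4, 4:] = 1.0

U, s, Vt = np.linalg.svd(A)
u, v = U[:, 0], Vt[0]
# The SVD sign is arbitrary; fix it so the leading vectors are nonnegative.
u = u * np.sign(u.sum())
v = v * np.sign(v.sum())

senders = np.where(u > 1e-8)[0]    # support of the left singular vector
receivers = np.where(v > 1e-8)[0]  # support of the right singular vector
print(senders, receivers)          # [0 1 2 3] [4 5 6 7 8 9]
```

The supports of the sign-fixed leading singular vectors recover exactly the two compressed nodes, as the theorem states.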

**Proof of Theorem 1.**

Rearrange the indices according to the compressed nodes, and denote the block submatrix between $C_I$ and $C_J$ by $\tilde{A}_{IJ} \in \mathbb{R}^{|C_I| \times |C_J|}$. Write $\bar{A}_{IJ} \in \mathbb{R}^{n \times n}$ for the zero-padded matrix of $\tilde{A}_{IJ}$. The zero-padded vector for compressed node $C_I$, denoted by $u_I$, is the vector with nonzero $i$th entries for $x_i \in C_I$ and zeros otherwise. Write each column $u_{:I'}$ of $U$ as a linear combination of zero-padded vectors, $u_{:I'} = \sum_I \eta_{II'} u^I_{:I'}$, where $\sum_I \eta_{II'}^2 = 1$. We write the nonzero part of $u_I \in \mathbb{R}^n$ as $u_{II} \in \mathbb{R}^{|C_I|}$, which is a unit vector. The optimization objective in equation 3.4 can then be written as

We know that $\tilde{A}_{IJ}$ is primitive for all $I, J \in [k]$. Applying the Perron-Frobenius theorem (proposition 1), $u^I_{II} > 0$ and $u_{:I} \ge 0$, where the positive part identifies some compressed node $C_I$. As the compressed-node blocks need not have an order, the solution is unique only up to a permutation of blocks.

**Proposition 2**

(Proposition 6 in Ding et al., 2006). For any symmetric matrices $A \in \mathbb{R}^{n \times n}_{\ge 0}$, $B \in \mathbb{R}^{k \times k}_{\ge 0}$, and any $S, S' \in \mathbb{R}^{n \times k}_{\ge 0}$, the following inequality holds: $\sum_{i,p} \frac{(AS'B)_{ip} S_{ip}^2}{S'_{ip}} \ge \mathrm{tr}(S^\top A S B)$.
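As a numerical sanity check of this inequality (not a proof), with randomly drawn nonnegative matrices:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 6, 3
A = rng.random((n, n))
A = A + A.T                    # symmetric nonnegative
B = rng.random((k, k))
B = B + B.T                    # symmetric nonnegative
S = rng.random((n, k))
Sp = rng.random((n, k)) + 0.1  # S', bounded away from zero

# sum_{i,p} (A S' B)_{ip} S_{ip}^2 / S'_{ip}
lhs = np.sum((A @ Sp @ B) * S**2 / Sp)
# tr(S^T A S B)
rhs = np.trace(S.T @ A @ S @ B)
assert lhs >= rhs
```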

**Proof of Proposition 2.**

**Proposition 3.**

For any matrices $B \in \mathbb{R}^{k \times k}_{\ge 0}$ and $S, S' \in \mathbb{R}^{n \times k}_{\ge 0}$, with $B$ symmetric, the following inequality holds: $\sum_{i,p} \frac{(S'B)_{ip} S_{ip}^2}{S'_{ip}} \ge \mathrm{tr}(SBS^\top)$.

**Proof of Proposition 3.**

**Proof of Lemma 3.**

By proposition 3, we have $\mathrm{tr}(U(P^+ + \Lambda)U^\top) \le \sum_{ij} \frac{[U'(P^+ + \Lambda)]_{ij} U_{ij}^2}{U'_{ij}}$, since $P^+$ and $\Lambda$ are both symmetric matrices. Using $a \le \frac{a^2 + b^2}{2b}$, we have $\mathrm{tr}(Q^- U^\top) \le \sum_{ij} [Q^-]_{ij} \frac{U_{ij}^2 + U_{ij}'^2}{2U'_{ij}}$. $Z(U, U')$ reaches the lower bound $L_3$ when $U = U'$. Hence, $Z(U, U')$ is an auxiliary function.

### A.3 Proof of Theorem 3

**Proof of Theorem 3.**

We take the derivative of $Z(U, U')$ with respect to $U_{ij}$:

As the objective is bounded below and the iterative procedure is monotonically nonincreasing, the algorithm finds a local minimum of the objective function, following the argument in Ding et al. (2006).

**Lemma 5.**

Let $U \in \mathbb{R}^{n \times k}$ be an orthogonal matrix such that $U^\top U = I_k$, and let $U' \in \mathbb{R}^{n \times k}_{\ge 0}$ be a nonnegative matrix with unit-norm columns. Let $G \in \mathbb{R}^{k \times k}$ be a nonnegative matrix. Then $\mathrm{tr}(U^\top U G) \le \mathrm{tr}(U'^\top U' G)$.

**Proof of Lemma 5.**

Since $U'$ is nonnegative with unit-norm columns, we can write $U'^\top U' = I_k + E$ for some nonnegative matrix $E$. Since $E$ and $G$ are nonnegative, $\mathrm{tr}(U'^\top U' G) = \mathrm{tr}(I_k G + EG) \ge \mathrm{tr}(I_k G) = \mathrm{tr}(U^\top U G)$.
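A quick numerical check of lemma 5 with randomly drawn matrices (sampling $U'$ nonnegative with unit-norm columns):

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 7, 3
# U with orthonormal columns, so that U^T U = I_k.
U, _ = np.linalg.qr(rng.standard_normal((n, k)))
# U' nonnegative with unit-norm columns.
Up = rng.random((n, k))
Up = Up / np.linalg.norm(Up, axis=0, keepdims=True)
G = rng.random((k, k))  # nonnegative

lhs = np.trace(U.T @ U @ G)    # equals tr(G)
rhs = np.trace(Up.T @ Up @ G)  # tr(G) plus a nonnegative term tr(EG)
assert lhs <= rhs + 1e-9
```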

**Lemma 6.**

$U^\top Q = U^\top T^\top U S$ is symmetric under the update rule of algorithm 2.

**Proof of Lemma 6.**

Under the update rule in algorithm 2, as $U$ is column-wise normalized, $U^\top T^\top U = S^\top$. Hence, $U^\top Q = S^\top S$ is symmetric.

It is worth noting that the original scheme proposed in Ding et al. (2006), without normalization, does not have this property. Let $D$ be the diagonal matrix of column norms of $U$, so that $U = \tilde{U}D$ with $\tilde{U}$ column-wise normalized. Then the update gives $\tilde{S} = \tilde{U}^\top T \tilde{U}$, whereas $S = U^\top T U = D\tilde{S}D$. Hence $U^\top Q = D\tilde{S}^\top D\tilde{S}$ is not necessarily symmetric, which violates the auxiliary function formulation.
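The symmetry in lemma 6 can be checked numerically for a column-normalized factor, assuming $S = U^\top T U$ and $Q = T^\top U S$ as in the lemma:

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 8, 3
T = rng.random((n, n))  # directed adjacency, not symmetric
U = rng.random((n, k))
# Column-wise normalization step (step 4 of algorithm 2).
U = U / np.linalg.norm(U, axis=0, keepdims=True)

S = U.T @ T @ U   # compressed relations, assumed update form
Q = T.T @ U @ S   # assumed definition of Q, as used in lemma 6

M = U.T @ Q       # equals U^T T^T U S = S^T S
assert np.allclose(M, M.T)
```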

**Proof of Lemma 4.**

By lemma 5, since the normalization step gives $U$ unit-norm column vectors, we have $\mathrm{tr}(U'^\top U' U^\top Q^-) \ge \mathrm{tr}(U^\top Q^-)$ and

By lemma 6, we can apply proposition 3 and have

### A.4 Proof of Theorem 4

**Proof of Theorem 4.**

Differentiate equation 4.3 with respect to $U_{ij}$:

## Appendix B: Additional Experiments and Illustrations

### B.1 Noise Analysis

Another type of noise is noise within clusters, referred to as intra-cluster noise, shown in Figure 7c. This noise type is more common in real-world scenarios. For instance, in a political opinion diffusion network, despite the strong influence from one community to another, there might still be some directed influence within groups. We refer to these two types of noise as in-network noise.

Furthermore, we also consider the case that even though the compressed nodes and edges correctly cover the information between communities, there could be missing information that the summarization cannot track; that is, there exists a vertex set $U \neq \emptyset$ such that $U \cap (V \cup W) = \emptyset$, where $V$ and $W$ denote two compressed nodes. Edges between $V \cup W$ and $U$ are also considered noise, shown in Figure 7d. We call this type background noise.

To see the effect of the different noise types, we investigate the compressed node recoveries without constraint, that is, spectral methods, in the following settings.

#### B.1.1 Negative Values in Singular Vectors

In Figure 8, we show the proportion of negative entries in the singular vectors under different levels of reverse directional noise between the two connected bipartite structures, for different sizes of bipartite components. We show the case of two compressed nodes, $u$s and $v$s, followed by the numbers 30, 50, and 100 for the size of the compressed node. We plot the ratio between the Euclidean norm of the negative entries and that of the positive entries; the $x$-axis is the percentage of reversed edges. From the plot, we see that the singular vectors are fairly robust to reverse directional noise: there is a nonnegligible portion of negative entries only when more than 20% of the edges are in the wrong direction. Moreover, the larger the components, the more robust the singular vectors are to negative values. These observations validate our use of the positive entries to identify bipartite community structure in the presence of reverse directional noise.
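This experiment can be reproduced in miniature. The following sketch (with component size 30 and 10% reversed edges, encoded as $-1$ entries in the bipartite block, an assumed encoding) computes the negative-to-positive norm ratio of the leading left singular vector:

```python
import numpy as np

rng = np.random.default_rng(5)

def neg_to_pos_ratio(D):
    """Norm ratio of negative to positive entries of the leading left
    singular vector of D, sign-fixed so the positive mass dominates."""
    U, _, _ = np.linalg.svd(D)
    u = U[:, 0] * np.sign(U[:, 0].sum())
    return np.linalg.norm(np.minimum(u, 0)) / np.linalg.norm(np.maximum(u, 0))

m = 30                                   # size of each bipartite component
clean = np.ones((m, m))                  # all edges correctly directed
noisy = clean.copy()
noisy[rng.random((m, m)) < 0.10] = -1.0  # reverse 10% of the edges

ratio_clean = neg_to_pos_ratio(clean)
ratio_noisy = neg_to_pos_ratio(noisy)
assert ratio_clean == 0.0
assert ratio_noisy < 0.5  # negatives stay small well below the 20% regime
```

With 10% reversed edges the leading singular vector remains dominated by its positive entries, consistent with the robustness observed in Figure 8.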

### B.2 The Adjacency Matrices

### B.3 Illustration of the Normalization Step

In this section, we visually show the effect of the normalization step in the adaptive algorithm: step 4 in algorithm 2. We showed theoretically that with the normalization step, the objective value is monotonically nonincreasing under algorithm 2. We emphasize that the normalization step is crucial for this monotonic property, and the proof provides an additional intuition: the algorithm with adaptive regularization (such as the one proposed in Ding et al., 2006) without normalization exhibits oscillatory behavior.

## Note

^{1}

An absent edge does not have the same sign as a directed edge: $\mathrm{sign}(0) \neq \mathrm{sign}(p)$ for all $p \neq 0$.

## Acknowledgments

W.X. was supported by the Gatsby Charitable Foundation. A.H. was supported by a fellowship from CIFAR and by the DATAIA convergence institute as part of the Programme d'Investissement d'Avenir (ANR-17-CONV-0003) operated by Inria. M.S. was supported by the International Research Center for Neurointelligence (WPI-IRCN) at the University of Tokyo Institutes for Advanced Study.

## References

*Data Mining and Knowledge Discovery*

*Proceedings of the 21st ACM International Conference on Information and Knowledge Management*

*Pattern Recognition*

*Chemometrics and Intelligent Laboratory Systems*

*Journal of Chemometrics*

*Marketing Intelligence and Planning*

*Nonnegative matrix and tensor factorizations: Applications to exploratory multi-way data analysis and blind source separation*

*Nature Neuroscience*

*IEEE Transactions on Information theory*

*Journal of Chemometrics*

*SIAM Journal on Matrix Analysis and Applications*

*Biological Reviews*

*International Journal of Advanced Computer Science and Applications*

*Compressing graphs and indexes with recursive graph bisection.*

*IEEE Transactions on Pattern Analysis and Machine Intelligence*

*45–55.*

*Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*

*Canadian Journal of Mathematics*

*SIAM Journal on Matrix Analysis and Applications*

*Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data*

*Physics Reports*

*PLOS One*

*Neural Computation*

*Journal of Machine Learning Research*

*Handbook of social economics*

*SIAM Review*

*Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability*

*Advances in neural information processing systems*

*Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery And Data Mining*

*Engineering*

*IEEE Transactions on Knowledge and Data Engineering*

*ACM Computing Surveys*

*An empirical comparison of the summarization power of graph clustering methods*

*Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*

*Physics Reports*

*A survey on methods and systems for graph compression.*

*Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases*

*Proceedings of the 2007 SIAM International Conference on Data Mining*

*Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data*

*Science Communication*

*Clinical Neurophysiology*

*Subsampling large graphs and invariance in networks*

*BioData Mining*

*Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*

*Computer Science Review*

*Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*

*Proceedings of the 22nd International Conference on Machine Learning*

*IEEE Transactions on Pattern Analysis and Machine Intelligence*

*Proceedings of the 2016 IEEE 32nd International Conference on Data Engineering*

*2014 IEEE International Conference on Data Mining*

*IEEE Transactions on Knowledge and Data Engineering*

*Journal of Economic Methodology*

*Limnology and Oceanography: Methods*

*Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*

*Cochrane Library*

*Health Education and Behavior*

*Proceedings of the European Conference on Principles of Data Mining and Knowledge Discovery*

*Representing classroom social structure*

*Data Mining and Knowledge Discovery*

*Proceedings of the International Conference on Social Computing, Behavioral-Cultural Modeling, and Prediction*

*Proceedings of the 2009 IEEE International Conference on Intelligence and Security Informatics*