Summarizing large-scale directed graphs into small-scale representations is a useful but less-studied problem setting. Conventional clustering approaches, based on Min-Cut-style criteria, compress both the vertices and edges of the graph into communities, which leads to a loss of directed-edge information. On the other hand, compressing the vertices while preserving the directed-edge information provides a way to learn a small-scale representation of a directed graph. The reconstruction error, which measures how much edge information is preserved by the summarized graph, can be used to learn such representations. Compared with the original graphs, the summarized graphs are easier to analyze and are capable of extracting group-level features, which are useful for efficient interventions on population behavior. In this letter, we present a model, based on minimizing reconstruction error with nonnegative constraints, that relates to a Max-Cut criterion and simultaneously identifies the compressed nodes and the directed compressed relations between these nodes. A multiplicative update algorithm with column-wise normalization is proposed. We further provide theoretical results on the identifiability of the model and the convergence of the proposed algorithms. Experiments are conducted to demonstrate the accuracy and robustness of the proposed method.

1  Introduction

In directed graphs, it is important to understand the influence between vertices, which is represented by the directed edges. Investigating the influence structure of graphs has become an evolving research field that attracts wide attention from scientific communities, including the social sciences (Tang, Sun, Wang, & Yang, 2009; Li, Zhang, & Huang, 2018; Mehmood, Barbieri, Bonchi, & Ukkonen, 2013), economics (Spirtes, 2005; Jackson, 2011), and the ecological sciences (Pavlopoulos et al., 2011; Delmas et al., 2019). In large-scale, densely connected directed graphs, finding an efficient way to compress the vertices and summarize the directed influence between them is not only useful for visualizing complicated networks but also crucial for extracting group-level features for further analysis, such as profiling or intervention.

Conventional graph clustering methods group densely connected vertices into the same community on undirected graphs (Fortunato, 2010; Schaeffer, 2007; Shi & Malik, 2000), and directed graph clustering is commonly based on symmetrized undirected graphs (Malliaros & Vazirgiannis, 2013). However, the recovered communities do not preserve much of the edge information, since the communities themselves are only sparsely connected to each other. Hence, effective reconstruction of the original graph from the summarized graph is a meaningful task that enjoys applications in graph compression (Dhabu, Deshpande, & Vishwakarma, 2013; Dhulipala et al., 2016), graph sampling (Orbanz, 2017; Leskovec & Faloutsos, 2006), and others.

For example, in a large-scale social network, individual-level connections are hard to analyze and contain a fair amount of noise, and it is complicated to directly extract group-level features and interpret the influence structure of the graph. In social network analysis, for instance, the key opinion leaders (KOL) (Valente & Pumpuang, 2007; Nisbet & Kotcher, 2009) with common features may also share a similar influence structure. Such information is important for understanding opinion diffusion within the network, as well as for implementing interventions for various purposes, such as marketing (Chaney, 2001) or pooling (Zhou, Zeng, & Zhang, 2009; Thomson et al., 1998). Moreover, extracting these features from the KOL within a group may also enable us to analyze the fairness of a certain process and perform de-biasing actions when necessary.

Previous works have considered related problems in undirected graph settings (Shahaf et al., 2013; Navlakha, Rastogi, & Shrivastava, 2008), which aim to define compressed nodes that preserve particular structures. The graph compression literature (Maneth & Peternek, 2015; Fan, Li, Wang, & Wu, 2012; Dhulipala et al., 2016) is also related, though its goal is to minimize storage space, irrespective of preserving the feature patterns of the graph. In addition, another line of related work, under the theme of influence maximization (Li, Fan, Wang, & Tan, 2018), studies the directed influence of a set of vertices on the rest of the network.

In our setting, we would like to extract sets of vertices, each becoming a compressed node, such that the influence between vertices is maximally preserved by the directed summarized graph. Previous works such as flow-based graph summarization (Shi et al., 2016) and graph de-densification (Maccioni & Abadi, 2016) addressed similar problems based on directed influence. Though these works deal with directed graphs, the directions in the summarized graph are defined from other domain information, so the algorithms essentially apply to symmetrized undirected graphs. In this work, we present a novel criterion that is specifically applicable to directed graphs, exploiting the asymmetric information of the directed edges and preserving the influence as much as possible.

In directed networks, summarization is harder, as there are both edge weights and edge directions to be summarized. To effectively summarize directed graphs, we focus on the reconstruction error from the summarized graph to the original graph. Directed graph summarization is also more useful than the undirected case; for instance, with well-defined directed causal edges, the summarization can help approximate causal information between compressed nodes. Conventional clustering or dimensionality reduction methods, which rely on Max-Flow/Min-Cut-style criteria, compress vertices without considering the preservation of edge information. These methods are unable to perform such summarization, as illustrated in Figure 1, since their objective is to minimize the connections between compressed nodes, which results in a large reconstruction error and thus undesired groupings. Our proposed objective is closely related to, but essentially different from, such schemes, as we try to maximize the "Cut" to preserve the directed-edge information. Various discrete optimization schemes, such as the Dulmage-Mendelsohn decomposition (Dulmage & Mendelsohn, 1958), can also find a good summarization in the noiseless case, although they are less accurate and harder to implement when the noise level is high. On the other hand, our proposed model not only works well in the noiseless case but is also more robust in the presence of noise. In addition, the proposed criterion is extendable to the multiple-networks problem. It is not uncommon, in the real world, for multiple relational networks to be observed on the same set of users. When the users are assumed to behave similarly across different relations, the different networks will share the same node compression, while the edge compression may or may not vary.
Figure 1: An illustrative example of the differences between clustering and summarization.

This letter is organized as follows. In section 2, we introduce the notation and the problem setting. In section 3, we present our learning objective and propose the structured nonnegative matrix factorization (StNMF) algorithm to solve the problem. In section 4, we provide theoretical results on the reconstruction error, the identifiability, and the convergence of the algorithm. In section 5, we discuss related work. In sections 6 and 7, we experimentally demonstrate the usefulness of the proposed method, and we conclude in section 8.

2  Preliminaries and Problem Formulation

In this work, we focus on simple directed graphs, which exclude self-loops and multiple edges. We use "graph" to refer to a directed graph when there is no ambiguity. A positive value in the adjacency matrix represents an out-edge, and we may use a negative value to represent an in-edge. The inhibition type of directed relation, where an out-edge has a negative influence, is out of the scope of this letter. We continue by defining some preliminary concepts.

2.1  Notations and Definitions

Denote a directed graph with node-set $V$ and directed-edge set $E$ by $G=(V,E)$, and denote by $H=(C,R)$ a summarized directed graph with compressed node-set $C$ and directed relation set $R$. In this work, both $G$ and $H$ are simple. We distinguish the terms in both graphs as shown in Table 1. A node-compression is a function $\phi_V: V\to C$ that assigns a vertex $x_i\in V$ to a compressed node $c_I\in C$. In this work, $\phi_V$ is surjective. Note that we do not require all vertices to belong to an associated compressed node, as opposed to the graph partition problem; that is, the union of the members of the identified compressed nodes need not be the full set of vertices. An edge-compression is a function $\phi_E: E\to R$. We say an edge-compression $\phi_E$ is induced from a node-compression $\phi_V$ if $\phi_V(x_i)=\phi_V(x_{i'})$ and $\phi_V(x_j)=\phi_V(x_{j'})$ imply $\phi_E(e_{ij})=\phi_E(e_{i'j'})$ for all $i,j,i',j'$; that is, vertices assigned to the same compressed nodes admit the same compressed relation. Hence, we can write $\phi_E(e_{ij})=r_{IJ}$ whenever $\phi_V(x_i)=c_I$ and $\phi_V(x_j)=c_J$. In this work, we only consider edge-compressions induced from node-compressions.

Table 1:

Term Comparison between the Original Graph and the Summarized Graph.

original graph $G$:      vertices $x_i \in V$            edges $e_{ij} \in E$
summarized graph $H$:    compressed nodes $c_I \in C$    compressed relations $r_{IJ} \in R$

A graph summarization, based on the compressions $\phi_V$ and $\phi_E$, refers to the map $\Phi$ from the original directed graph $G$ to the summarized graph $H$, such that $\Phi(G,\phi_V,\phi_E)=H$ (see Figure 2). A graph summarization of size $k$, $\Phi_k$, is a constrained mapping with $|C|=k\le|V|$. In practice, we would like $k\ll|V|$. When $k=|V|$, the summarization is trivial, as the original graph always gives reconstruction error zero.
Figure 2: Summarization $\Phi$ based on the compressions $(\phi_V,\phi_E)$.

Denote by $A\in\mathbb{R}^{n\times n}$ the asymmetric adjacency matrix of a directed graph, such that $A_{ij}=1$ if there is a directed edge from $x_i$ to $x_j$ and $A_{ij}=0$ otherwise. Denote by $T\in\mathbb{R}^{n\times n}$ the skew-symmetric adjacency matrix of a directed graph, where $T_{ij}=1$ and $T_{ji}=-1$ if there is a directed edge from $x_i$ to $x_j$, and $T_{ij}=0$ if there is no edge between $x_i$ and $x_j$. We say a directed graph is connected if its undirected skeleton is connected.
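To make the two representations concrete, here is a minimal sketch (the function name and the toy edge list are ours) that builds both matrices from a directed edge list; note that the $T$ definition above assumes at most one direction per vertex pair, so reciprocal edges would cancel in $A-A^\top$.

```python
import numpy as np

def adjacency_matrices(n, edges):
    """Build the asymmetric (A) and skew-symmetric (T) adjacency matrices
    of a simple directed graph from a list of directed edges (i, j)."""
    A = np.zeros((n, n))
    for i, j in edges:
        A[i, j] = 1.0            # out-edge from x_i to x_j
    T = A - A.T                  # T_ij = 1 and T_ji = -1 for an edge i -> j
    return A, T

# A 4-vertex toy graph with an exact D-IPS: {x_0, x_1} -> {x_2, x_3}.
A, T = adjacency_matrices(4, [(0, 2), (0, 3), (1, 2), (1, 3)])
```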

2.2  Influence-Preserving Criteria

Consider the performance measure of our graph summarization problem. The quality of a summarization can be measured by how much the directed-edge information can be recovered from the summarized graph via the reconstruction error,
$$\mathcal{L}_0(G,\phi_V,\phi_E)=\sum_{I,J}\ \sum_{x_i\in c_I,\,x_j\in c_J}\ell(e_{ij},r_{IJ}),$$
(2.1)
where $\ell$ is some nonnegative loss measure. We use the term influence to describe the information in directed edges. By choosing different losses $\ell$, the reconstruction error describes different types of influence-preserving criteria. We say a graph has an exact influence-preserving structure (IPS) if the relevant reconstruction error $\mathcal{L}_0=0$.

Choosing $\ell_d(e_{ij},r_{IJ})=1-\mathbb{1}\{\mathrm{sign}(e_{ij})=\mathrm{sign}(r_{IJ})\}$, $\forall i\ne j$, we describe the reconstruction by recovering the directed-edge directions. We say a graph summarization has an exact direction influence-preserving structure (D-IPS) if there exists a summarization such that the reconstruction error based on $\ell_d$ is 0.

In a weighted graph, we may preserve not only the directional information of the edges but also the weight information. Hence, we may choose the square loss between edges and compressed relations: $\ell_w(e_{ij},r_{IJ})=(e_{ij}-r_{IJ})^2$, $\forall i\ne j$. We say a graph summarization has an exact weight influence-preserving structure (W-IPS) if there exists a summarization such that the reconstruction error based on $\ell_w$ is 0. Note that D-IPS is a special case of W-IPS: for a uniformly weighted graph, an exact W-IPS is equivalent to an exact D-IPS.

When a graph does not have an exact IPS, as is commonly observed in practice, we would like to simultaneously learn a node-compression $\phi_V$ and an edge-compression $\phi_E$ such that the corresponding summarization minimizes the relevant reconstruction error $\mathcal{L}_0$.
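As a concrete illustration of equation 2.1 with the square loss $\ell_w$, the following sketch (the naming is ours; `phi_V` is a partial vertex-to-compressed-node map) evaluates the reconstruction error of a candidate summarization; it returns 0 exactly for an exact W-IPS.

```python
import numpy as np

def reconstruction_error(A, phi_V, R):
    """Reconstruction error L0 (eq. 2.1) with the square loss l_w.
    phi_V: dict mapping vertex index -> compressed-node index; vertices
    absent from phi_V are not covered by the summary and are skipped."""
    err, n = 0.0, A.shape[0]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            I, J = phi_V.get(i), phi_V.get(j)
            if I is None or J is None:
                continue
            err += (A[i, j] - R[I, J]) ** 2    # l_w(e_ij, r_IJ)
    return err

A = np.zeros((4, 4))
A[[0, 0, 1, 1], [2, 3, 2, 3]] = 1.0            # exact D-IPS: {0,1} -> {2,3}
R = np.array([[0.0, 1.0],
              [0.0, 0.0]])                     # single compressed relation c_0 -> c_1
print(reconstruction_error(A, {0: 0, 1: 0, 2: 1, 3: 1}, R))  # 0.0
```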

3  Learning the Influence-Preserving Summarization

In this section, we present the formulation of influence-preserving summarization as a constrained supervised learning objective based on the reconstruction error. Our “labels” can be seen as the compressed relations. We then present the algorithm to solve the constrained optimization problem.

3.1  The Constrained Supervised Learning Objective

We start by defining our learning objective based on the reconstruction loss with $\ell_w$ and derive the factorization model as our constrained optimization objective.

3.1.1  The IPS-Based Objective

Our objective is to seek a graph summarization of size $k$ that minimizes the reconstruction error (which corresponds to $\ell_w$ for the rest of the letter). Denote the node-compression $\phi_V$ by the assignment matrix $U\in\{0,1\}^{n\times k}$, whose $I$th column vector $u_{:I}\in\{0,1\}^{n\times1}$ represents the elements in compressed node $c_I$: $u_{iI}=1\Leftrightarrow x_i\in C_I$. Denote the edge-compression by the relationship matrix $R\in\mathbb{R}^{k\times k}$, where $r_{IJ}$ represents the compressed relation from $c_I$ to $c_J$. Since the summarized graph is assumed to be simple, $R$ is an asymmetric adjacency matrix. Given the weighted asymmetric adjacency matrix $A$ and a graph summarization represented by $U$ and $R$, the objective based on the loss measure $\ell_w(A_{ij},r_{IJ})=(A_{ij}-r_{IJ})^2\mathbb{1}\{x_i\in C_I\}\mathbb{1}\{x_j\in C_J\}$ can be written as
$$\mathcal{L}_1(A,U,R)=\sum_{I,J}\sum_{i,j}(A_{ij}-r_{IJ})^2u_{iI}u_{jJ}.$$
(3.1)
However, without a constraint on the number of compressed nodes allowed, the objective in equation 3.1 will take $k=|V|$, and zero reconstruction error can always be achieved. To avoid this, we impose a constraint on the size of the summarized graph so that $k\ll|V|$. With such a constraint, this objective may still identify compressed nodes containing less relevant elements. To address this problem, we propose a normalized version of the objective in equation 3.1,
$$\mathcal{L}_2(A,U,R)=\sum_{I,J}\frac{1}{|C_I||C_J|}\sum_{i,j}(A_{ij}-r_{IJ})^2u_{iI}u_{jJ},$$
(3.2)
which corresponds to a normalized loss measure: $\ell_w(A_{ij},r_{IJ})=(A_{ij}-r_{IJ})^2\frac{\mathbb{1}\{x_i\in C_I\}\mathbb{1}\{x_j\in C_J\}}{|C_I||C_J|}$. We further assume the compressed nodes do not overlap, which corresponds to the orthogonality constraints $u_{:I}^\top u_{:J}=0,\ \forall I\ne J$.
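For concreteness, a small helper (our naming) builds the binary assignment matrix $U$ of equation 3.1 and the column-normalized counterpart used in the factorization form of lemma 1; non-overlapping compressed nodes make the normalized columns orthonormal.

```python
import numpy as np

def assignment_matrix(n, members):
    """members[I] lists the vertices of compressed node C_I; vertices may
    be left unassigned.  Returns binary U and its column-normalized form."""
    U = np.zeros((n, len(members)))
    for I, verts in enumerate(members):
        U[verts, I] = 1.0
    norm = np.sqrt(np.maximum(U.sum(axis=0), 1.0))  # |C_I|^(1/2) per column
    return U, U / norm

U, Un = assignment_matrix(5, [[0, 1], [2, 3]])      # vertex 4 unassigned
print(Un.T @ Un)                                    # identity: orthonormal columns
```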
Lemma 1.
The objective in equation 3.2 has the factorization form
$$\mathcal{L}_2(A,U,R)=\|A-URU^\top\|_F^2,\quad\text{s.t. }U^\top U=I_k;\ R_{IJ}R_{JI}=0,\ \forall I,J.$$
(3.3)

The proof proceeds by basic linear algebra and can be found in appendix A. Note that $R$ is an asymmetric adjacency matrix representing the compressed relations in the summarized graph. Since the summarized graph is assumed to be simple, at most one of $r_{IJ}$ and $r_{JI}$ is nonzero for $I\ne J$ and $R_{II}=0$ for all $I$, which implies the constraint $R_{IJ}R_{JI}=0,\ \forall I,J$.

3.1.2  Continuous Relaxation

The normalized objective in equation 3.2 is an NP-hard discrete problem, similar to the discrete cluster assignment problem. Continuous relaxation, as proposed in Shi and Malik (2000) and Meilă and Pentney (2007), is a way to approximately solve such problems. Here, we propose a continuous relaxation of the factorization model in equation 3.2,
$$\mathcal{L}_3(A,U,R)=\|A-URU^\top\|_F^2\quad\text{s.t. }U^\top U=I_k;\ R_{IJ}R_{JI}=0,\ \forall I,J,$$
(3.4)
where $U\in\mathbb{R}^{n\times k}$ and $R\in\mathbb{R}^{k\times k}$.
Due to the constraint on $R$, such a constrained objective is not easy to solve, as we do not assume structures on the summarized graph. This issue can be alleviated by modeling the structure via the skew-symmetric adjacency matrix $T$. The corresponding factorization becomes
$$\mathcal{L}_4(T;U,S)=\|T-USU^\top\|_F^2\quad\text{s.t. }U^\top U=I_k,$$
(3.5)
where $U\in\mathbb{R}^{n\times k}$ and $S\in\mathbb{R}^{k\times k}$ is skew-symmetric. Equation 3.5 exploits the skew-symmetric structure and is easier to solve. We show in theorem 2 below that the objectives in equations 3.4 and 3.5 admit the same solution, up to permutation, in the exact D-IPS case. Although the asymmetric matrix $A$ is useful for deriving the identifiability result, the skew-symmetric matrix $T$ is easier to work with and more robust in noisy cases, as the model explicitly penalizes reversely directed noise edges. Further details are discussed in appendix B. In the rest of the letter, we use the skew-symmetric matrix $T$ to represent graphs.

3.1.3  Positive Values Identify the Compressed Nodes

With the factorization model in equation 3.5, we further show in theorem 2 in section 4.1 that, under the exact D-IPS, the factor $U$ is nonnegative and its positive entries correctly identify the compressed nodes. Hence, we propose a nonnegatively constrained factorization model for more efficient node-compression identification in the presence of noise:
$$\mathcal{L}_5(T;U,S)=\|T-USU^\top\|_F^2\quad\text{s.t. }U\ge0,\ U^\top U=I_k.$$
(3.6)
Nonnegative and orthogonal constraints are studied in Ding, Li, Peng, and Park (2006), although our matrix of interest $T$ is itself not nonnegative. Such a factorization problem is referred to as semi-nonnegative factorization and is studied in Ding, Li, and Jordan (2010). Building on this previous work, we develop the following algorithm to solve the optimization problem in equation 3.6.

3.2  Learning Algorithms

With the nonnegative and orthogonal constraints on $U$, the model in equation 3.6 can be written as a regularized version of orthogonality-constrained nonnegative matrix tri-factorization,
$$\mathcal{L}_6(T;U,S,\Lambda)=\|T-USU^\top\|^2+\mathrm{tr}\big(\Lambda(U^\top U-I)\big)\quad\text{s.t. }U\ge0,$$
(3.7)
where the regularization parameter Λ is a symmetric matrix. It is also related to semi-nonnegative matrix factorization since T itself is not a nonnegative matrix.

This optimization objective can be solved by gradient methods with projection onto the Stiefel manifold, as discussed in Hirayama, Hyvärinen, Kiviniemi, Kawanabe, and Yamashita (2016) and Edelman, Arias, and Smith (1998). However, the projection-based algorithm is very sensitive to initialization. Instead, we propose a multiplicative update scheme, $U\leftarrow U\odot\frac{[\nabla_U\mathcal{L}]^-}{[\nabla_U\mathcal{L}]^+}$, modified from Ding et al. (2006, 2010) and Lee and Seung (2001). We use $X^+=\max(X,0)$ and $X^-=\max(-X,0)$ to denote the positive and negative parts of a matrix $X$, so that $X=X^+-X^-$. The modification not only allows imposing the specific skew-symmetric structure of $S$ and the orthogonality constraints but also gives more stable results. This leads to our proposed structured nonnegative matrix factorization (StNMF) in algorithm 1.

Algorithm 1: StNMF (multiplicative updates with fixed regularization $\Lambda$).
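The following is a minimal NumPy sketch of algorithm 1, assuming the update rule derived in appendix A, $U_{ij}\leftarrow U_{ij}[Q^++UP^-]_{ij}/[U(P^++\Lambda)+Q^-]_{ij}$ with $Q=TUS$ and $P=SU^\top US$, together with an $S$-update $S=U^\top TU$ that is valid when $U$ is near-orthonormal; the |Gaussian| initialization replaces the nonnegative SVD used in the letter.

```python
import numpy as np

def pos(X): return np.maximum(X, 0.0)    # X^+
def neg(X): return np.maximum(-X, 0.0)   # X^-, so X = X^+ - X^-

def stnmf_fixed(T, k, Lam, n_iter=500, eps=1e-12, seed=0):
    """Sketch of algorithm 1 (StNMF with a fixed symmetric regularizer Lam;
    the letter assumes Lam + P^+ >= 0, hence the clipped denominator)."""
    rng = np.random.default_rng(seed)
    n = T.shape[0]
    U = np.abs(rng.standard_normal((n, k)))   # nonnegative init
    S = np.triu(np.ones((k, k)), 1)           # any skew-symmetric init
    S = S - S.T
    for _ in range(n_iter):
        Q = T @ U @ S                         # Q = TUS
        P = S @ (U.T @ U) @ S                 # P = S U'U S (symmetric)
        den = U @ (pos(P) + Lam) + neg(Q)
        U *= (pos(Q) + U @ neg(P)) / np.maximum(den, eps)
        S = U.T @ T @ U                       # skew-symmetric since T is
    return U, S
```

For example, `stnmf_fixed(T, 2, 0.5 * np.ones((2, 2)))` can be run on the toy D-IPS matrix from the sketch in section 2.1.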

The nonnegative matrix $U$ can be effectively initialized via nonnegative SVD (Boutsidis & Gallopoulos, 2008). $S$ can be initialized by any $k\times k$ skew-symmetric matrix; for instance, when $k=2$, we set the initial $S=\begin{pmatrix}0&1\\-1&0\end{pmatrix}$. The algorithm exploits the zero-locking property of the multiplicative update scheme, so the desired structure of $S$ is preserved throughout the updates with such an initialization. In the fixed regularization scheme, a different choice of $\Lambda$ results in a different locally optimal solution, and the choice depends on the user's preference. For instance, a user who wants a strictly nonoverlapping compressed node-set may set the magnitude of the off-diagonal terms to be large to emphasize orthogonality; a user interested in the weight assignment between compressed nodes may set the diagonal terms relatively larger to ensure unit-length vectors. A fuller discussion of this topic is out of the scope of this letter.

It is important to note that the directed compressed relations can be read off from S, which represents the skew-symmetric (weighted) adjacency matrix of the summarized graph H=(C,R). Hence, our model is able to simultaneously identify the node-compression and the edge-compression, thus the summarized graph H.

We can also optimize $\Lambda$ using the Karush-Kuhn-Tucker (KKT) complementary condition (Kuhn & Tucker, 1951) and set $\Lambda=U^\top Q-P=U^\top Q^++P^--U^\top Q^--P^+$, where $P=SU^\top US$ and $Q=TUS$. The derivation can be found in section 4.2. In addition, algorithm 1 does not guarantee a tightly bounded norm for the column vectors of $U$, and when the adaptive regularizer is used, the optimization trajectory is not monotonically nonincreasing. Hence, we impose a column-wise normalization step in algorithm 2 to alleviate this problem. Denote by $D_U$ the diagonal matrix whose $i$th diagonal entry is the norm of the $i$th column of $U$ (used in the column-wise normalization $U\leftarrow UD_U^{-1}$).

Algorithm 2: Adaptive-StNMF (adaptive $\Lambda$ with column-wise normalization).
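A corresponding sketch of algorithm 2, with $\Lambda$ set from equation A.2 at every step and the columns of $U$ renormalized; `pos`/`neg` are as in the previous sketch, and the clipped denominator again encodes the assumption $\Lambda+P^+\ge0$.

```python
def stnmf_adaptive(T, k, n_iter=500, eps=1e-12, seed=0):
    """Sketch of algorithm 2 (adaptive Lambda + column-wise normalization)."""
    rng = np.random.default_rng(seed)
    n = T.shape[0]
    U = np.abs(rng.standard_normal((n, k)))
    S = np.triu(np.ones((k, k)), 1)
    S = S - S.T
    for _ in range(n_iter):
        Q = T @ U @ S
        P = S @ (U.T @ U) @ S
        Lam = U.T @ pos(Q) + neg(P) - U.T @ neg(Q) - pos(P)   # eq. A.2
        den = U @ (pos(P) + Lam) + neg(Q)
        U *= (pos(Q) + U @ neg(P)) / np.maximum(den, eps)
        U /= np.linalg.norm(U, axis=0, keepdims=True) + eps   # column normalization
        S = U.T @ T @ U                                       # after normalizing
    return U, S
```

Normalizing before recomputing $S$ keeps $U^\top TU=S$, which is the property lemma 6 relies on.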

3.3  Multiple Network Summarization: The Tensor Case

We can also extend our model to extract community structure from a series of relational network matrices stacked into a tensor, $\mathcal{T}\in\mathbb{R}^{n\times n\times t}$. Such a tensor arises in various scenarios, for example, the same relationship network observed over different time periods, or multiple relationships between the same set of users. The second scenario is commonly observed in social networks; for instance, different types of opinions between users or friends, such as "like," "support," and "disagree," form such a set of networks.

Denote by $T_k$ the $k$th slice of the tensor $\mathcal{T}$. We extract a common community structure, where the edge-compression matrix $S_k$ is allowed to vary with $k$:
$$\mathcal{L}_7(\mathcal{T};U,\{S_k\},\Lambda)=\sum_k\|T_k-US_kU^\top\|^2\quad\text{s.t. }U\ge0,\ U^\top U=I,\ S_k=-S_k^\top.$$
(3.8)
With the relevant constraints, applying the strategy in Hirayama et al. (2016) and Hyvärinen, Hirayama, Kiviniemi, and Kawanabe (2016), the objective can be written as
$$\max_U\ \sum_k\mathrm{tr}\big(U^\top T_kU\big)^2,\quad\text{s.t. }U\ge0,\ U^\top U=I.$$
(3.9)
To solve the problem, we may introduce the auxiliary variables $r_k=\mathrm{tr}(U^\top T_kU)$ and apply the StNMF algorithm iteratively on the sum of the reweighted matrices, $M=\sum_k r_kT_k$, until convergence. Given the factor matrix $U^{(i)}$ from the previous iteration, the auxiliary variables and the reweighted matrix are computed as
$$r_k^{(i)}=\mathrm{tr}\big(U^{(i)\top}T_kU^{(i)}\big),\qquad M^{(i)}=\sum_k r_k^{(i)}T_k.$$
Applying the soft constraint on orthogonality presented in equation 3.7, the objective in equation 3.8 at a particular iteration can be written as
$$\mathcal{L}_6(M^{(i)};U,S,\Lambda)=\|M^{(i)}-USU^\top\|^2+\mathrm{tr}\big(\Lambda(U^\top U-I)\big)\quad\text{s.t. }U\ge0.$$
(3.10)
In each iteration, we solve
$$U^{(i+1)},S^{(i+1)}=\arg\min_{U,S}\ \mathcal{L}_6(M^{(i)};U,S,\Lambda)$$
using StNMF, which is presented in algorithm 3.
Algorithm 3: Tensor-StNMF (iteratively reweighted StNMF for multiple networks).
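A sketch of algorithm 3 built on `stnmf_adaptive` above. One caveat on our reading: taken literally, $r_k=\mathrm{tr}(U^\top T_kU)$ vanishes identically for skew-symmetric $T_k$, so the sketch uses the alignment $r_k=\langle S,U^\top T_kU\rangle$ with the current compressed-relation pattern $S$ as the reweighting; treat this as an assumption, not the letter's exact definition.

```python
def tensor_stnmf(Ts, k, n_outer=20):
    """Sketch of algorithm 3 (Tensor-StNMF): reweight the slices by r_k,
    refactorize the reweighted sum M = sum_k r_k T_k, and repeat."""
    U, S = stnmf_adaptive(sum(Ts), k)              # warm start on the plain sum
    for _ in range(n_outer):
        r = np.array([np.sum(S * (U.T @ Tk @ U)) for Tk in Ts])
        M = sum(rk * Tk for rk, Tk in zip(r, Ts))  # still skew-symmetric
        U, S = stnmf_adaptive(M, k)
    return U, S, r
```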

Using algorithm 3, we directly obtain the paired vectors $v,w$ for the cluster assignment. Note that without imposing $\eta_k>0,\ \forall k$ in the objective, the directed relationships can differ across the relationship networks: the $k$th relationship has direction $V\to W$ if $\eta_k>0$ and $W\to V$ if $\eta_k<0$. This is useful as we do not assume prior information about these networks. Recovering a reversed direction may correspond to the semantic opposite of a relationship; for example, in social network analysis, we may observe networks representing like and dispute, where opposite directions between communities may be expected, so we may not want to constrain the sign of $\eta_k$. On the other hand, in some settings we can impose $\eta_k>0,\ \forall k$ for unidirectional relationships between communities across all observed networks. For instance, in an opinion diffusion social network, we may want to find a community with influence on another community across different social issues (with each issue represented as a slice of the directed network tensor); assuming $\eta_k>0,\ \forall k$ preserves these influence structures. Such an assumption enables us to conclude an overall directed relationship $V\to W$.

4  Theoretical Analysis

In this section, we present theoretical results for the identification of nonnegative models and analysis of structured nonnegative matrix factorization (StNMF).

4.1  Identifiability Analysis

Theorem 1

(exact D-IPS Identification). Let $A$ be an asymmetric adjacency matrix of a directed graph with the exact D-IPS and $k$ compressed nodes. Assume that each submatrix between compressed nodes has a distinct leading singular value with geometric multiplicity one. The optimization problem in equation 3.4 has a unique solution $U\in\mathbb{R}^{n\times k}$ such that $U\ge0$, and the positive part of each column vector of $U$ identifies a compressed node.

The proof technique extends the $k=2$ case in theorem 5, which applies the Perron-Frobenius theorem to the rearranged block matrix. Details can be found in appendix A. For graphs with more than one connected component, the most strongly connected component will be identified first, and the consecutive components can be identified via the deflation methods discussed in Hyvärinen et al. (2016) and Hirayama et al. (2016), which are out of the scope of this letter.

Lemma 2.

If a directed graph with asymmetric adjacency matrix $A$ has the exact D-IPS, then $A$ can be divided into block submatrices according to the compressed nodes, such that if a block $\tilde A_{IJ}$ is nonzero, its block-wise transpose $\tilde A_{JI}$ is a zero matrix, and the diagonal blocks are zero matrices.

Proof of Lemma 2.

By the definition of the exact D-IPS, the directions of the edges between two compressed nodes are all the same, and there are no links within compressed nodes. The result follows since the summarized graph is simple.

Theorem 2

(equivalence of the decompositions of $A$ and $T$). Let $A$ and $T$ be the asymmetric and skew-symmetric adjacency matrices, respectively, of a directed graph with the exact D-IPS. The optimal solutions $U$ for $\mathcal{L}_3$ and $\mathcal{L}_4$ are the same up to permutation.

Detailed proofs are in appendix A.

Corollary 1.

Let $T$ be the skew-symmetric adjacency matrix of a directed graph with the exact D-IPS. Assume that each submatrix between compressed nodes has geometric multiplicity two. The optimization problem with loss $\mathcal{L}_4$ has a unique solution such that $U\ge0$, and the positive part of each column vector of $U$ identifies communities.

The corollary follows from combining theorems 1 and 2.

4.2  Convergence Analysis

In this section, we analyze the convergence of the fixed regularization scheme in algorithm 1 and of the adaptive regularization scheme with column-wise normalization in algorithm 2. Optimizing the objective in equation 3.7 for fixed $\Lambda$ can be written as
$$\min_{U\ge0}\mathcal{L}_6(T;U,S,\Lambda)=\min_{U\ge0}\mathrm{tr}\big({-2}U^\top TUS+USU^\top USU^\top+U\Lambda U^\top\big).$$
(4.1)
Lemma 3.
Let $Q=TUS$ and $P=SU^\top US$.
$$Z(U,U')=\mathrm{tr}\big({-2}Q^{+\top}U-U'P^-U^\top\big)+\sum_{ij}\frac{[U'(P^++\Lambda)]_{ij}U_{ij}^2}{U'_{ij}}+2\sum_{ij}Q^-_{ij}\frac{U_{ij}^2+U_{ij}'^2}{2U'_{ij}}$$
is an auxiliary function of equation 4.1.

The proof is based on pairing symmetric terms, and details can be found in appendix A.

Lemma 4.
When choosing $\Lambda=U^\top Q^++P^--U^\top Q^--P^+$, the objective in equation 4.1 becomes
$$\min_{U\ge0}\mathcal{L}_7(T;U,S,\Lambda)=\min_{U\ge0}\mathrm{tr}\big({-2}U^\top Q^+-U^\top UP^-+U^\top U(U^\top Q^++P^--U^\top Q^-)+2U^\top Q^-\big),$$
(4.2)
then
$$Z(U,U')=\mathrm{tr}\big({-2}Q^{+\top}U-U^\top UP^-\big)+\sum_{ij}\frac{\big[U'(U'^\top Q^++P^-)\big]_{ij}U_{ij}^2+\big[Q^-U'^\top U'\big]_{ij}U_{ij}'^2}{U'_{ij}}$$
(4.3)

is an auxiliary function of equation 4.2.

Theorem 3.

The update rule described in algorithm 1 is nonincreasing and converges to a stationary point of the objective in equation 3.7.

The proof of theorem 3 is in appendix A. The proof technique is based on carefully chosen symmetric matrices and applies for fixed $\Lambda$, as $\Lambda$ is then a symmetric matrix. With the adaptively chosen $\Lambda$ at each step, as in equation A.2, the columns of $U$ do not have a fixed norm and $P^++\Lambda$ is no longer symmetric, making the proof technique for theorem 3 inapplicable. However, with the proposed column-wise normalization scheme, the algorithm is shown to be monotonically nonincreasing and convergent to a stationary point.

Theorem 4.

The update rule in algorithm 2 is nonincreasing and convergent.

The proof is by constructing a symmetric matrix based on the normalized column vectors and applying the proof techniques of theorem 3. Detailed derivations are in appendix A.

5  Related Work

5.1  Graph Summarization

With increasing graph size, it is of interest to summarize graph structures to facilitate the understanding of network data and their meaning, which can be useful for data mining tasks. Graph summarization procedures are application dependent, with goals ranging from compressing the data volume and preserving distributional properties of graphs to maintaining specific patterns on graphs. Therefore, various goal-oriented graph summarization methods have been developed to tackle the tasks of supervised edge classification (van Leeuwen, Vreeken, & Siebes, 2006), unsupervised edge clustering (Cilibrasi & Vitányi, 2005; Liu, Shah, & Koutra, 2015), outlier detection (Akoglu, Tong, Vreeken, & Faloutsos, 2012; Akoglu et al., 2015), and pattern set mining (Vreeken, Van Leeuwen, & Siebes, 2011). Beyond static networks, dynamic networks, usually represented in the form of tensors, have also been studied for graph summarization tasks (Xu, Kliger, & Hero, 2011). In our problem setting, we aim to simultaneously compress the nodes and edges in the form of Figure 2, which is essentially different from the existing graph summarization goals and settings. A more closely related class of graph summarization goals can be viewed as influence-based summarization (Mehmood et al., 2013; Purohit, Prakash, Kang, Zhang, & Subrahmanian, 2014; Shi, Tong, Tang, & Lin, 2014, 2015; Shi et al., 2016), which aims to identify a description of influence propagation in graphs. Hence, we develop the nonnegative theory to deal with our goal and employ nonnegativity-based factorization methods to solve the problem. A comprehensive survey of relevant summarization techniques can be found in Liu, Safavi, Dighe, and Koutra (2018).

5.2  Nonnegative Matrix Factorization (NMF)

Because, by theorem 2, our graph summarization goal reduces to identifying nonnegative factors in the decomposition of the skew-symmetric adjacency matrix, the algorithm developed for the task is closely related to NMF. NMF algorithms have been developed (Lee & Seung, 2001) to extract features represented by nonnegative values, minimizing the objective $\|X-WH\|_F^2$ or $\sum_{i,j}\big(X_{ij}\log\frac{X_{ij}}{(WH)_{ij}}-X_{ij}+(WH)_{ij}\big)$, where $X,W,H\ge0$ are all nonnegative matrices, with $W$ and $H$ being low rank. Subsequently, variants of NMF have been developed to tackle practical problems, including real data with nonnegative features (Ding et al., 2010), where $W\ge0$ and the entries of $H$ may have both signs, and features with orthogonality constraints (Ding et al., 2006; Edelman et al., 1998; Hirayama et al., 2016), where the columns of $W$ are orthogonal. Implementation techniques such as initialization (Boutsidis & Gallopoulos, 2008) have also been investigated. As directed graphs lead to skew-symmetric matrices, no prior work has explicitly addressed this setting. Our proposed computational scheme, StNMF, though based on the framework derived from NMF, is useful in this particular unaddressed setting, where we provide useful insights. The column-wise normalization procedure was first mentioned in Ding et al. (2010), and we prove that it works in our setting. More details are in section B.3.

5.3  Tensor Decomposition

Feature extraction from data in the form of higher-order arrays (or tensors) has been explored via tensor decomposition (Kolda & Bader, 2009). The first explorations can be traced back to understanding the multiway interactions of dissolved organic matter fluorescence (Stedmon & Bro, 2008) via parallel factorization, or PARAFAC (Bro, 1997), a generalization of PCA methods to higher-order tensor data. PARAFAC features may be nontrivial to obtain; thus, alternating least-squares-based algorithms have been developed to compute PARAFAC efficiently (Comon, Luciani, & De Almeida, 2009). Moreover, an improved version that automatically determines the appropriate number of components (i.e., the rank in the matrix decomposition case) has been developed (Bro & Kiers, 2003). Subsequently, a more efficient method to extract features from tensors, higher-order singular value decomposition (HOSVD; De Lathauwer, De Moor, & Vandewalle, 2000), has been developed, which has better rank properties than PARAFAC. Nonnegative versions of tensor decomposition methods (Cichocki, Zdunek, Phan, & Amari, 2009; Shashua & Hazan, 2005; Zhou et al., 2009) have also been developed. In section 3.3, we also deal with multiple network data in the form of a three-dimensional tensor. Because our summarization goal is based on the number of compressed nodes, we need to explicitly control the rank of the tensor approximations, where nonnegative tensor factorization methods are not entirely applicable. Instead, we adapt the proxy variable strategy from Hirayama et al. (2016) and Hyvärinen et al. (2016) that iteratively optimizes the desired objective.

6  Simulation Results

In this section, we apply the graph summarization methods to synthetically generated directed graphs and compare them with summarization methods on undirected cases, as well as conventional clustering methods such as spectral methods or normalized cut methods.

6.1  Simulation on Single Network

We first study the case when a single network is observed. We simulate graphs of different sizes at different noise levels with known compressed nodes. The noise level is measured by the conventional signal-to-noise ratio (SNR). Specifically, the background noise is defined as:
$$\gamma_b=\frac{\sum_{(i,j)\notin\text{D-IPS}}|e_{ij}|}{\sum_{(i,j)\in\text{D-IPS}}|e_{ij}|},$$
and the direction noise is defined as the ratio
$$\gamma_d=\frac{\sum_{(i,j)\in\text{D-IPS},\,e_{ij}<0}|e_{ij}|}{\sum_{(i,j)\in\text{D-IPS},\,e_{ij}\ge0}e_{ij}}<1.$$
The direction noise is less than 1: if the edges considered as noise were to exceed the edges considered as signal, the "correct" direction would need to flip, and the corresponding reciprocal would then be considered the noise level. We compare our proposed algorithms:
  • Fixed-StNMF: The fixed regularization scheme in algorithm 1, where $\Lambda$ is chosen as a scalar times the all-ones matrix.

  • Adaptive-StNMF: The adaptive scheme described in algorithm 2;

with the competing approaches including

  • Undirected: The graph summarization scheme using the undirected skeleton, similar to Hirayama et al. (2016)

  • WNCut: The weighted normalized cut scheme (Meilă & Pentney, 2007) for directed graph

  • Spectral: The clustering method using normalized Laplacian (Shi & Malik, 2000)

Since we simulate the network data from a known model in these synthetic cases, we use the node assignment accuracy of the extracted compressed nodes as the performance measure. From the simulation results, we see that at low noise levels, Adaptive-StNMF correctly finds the compressed node assignment, which validates our theoretical claims, as shown in the first row of Figure 3. The competing methods are unable to accurately assign the vertices to the desired groups. Moreover, when the noise level increases, both proposed methods, Adaptive-StNMF and Fixed-StNMF, perform better than the competitors, achieving higher accuracy, as shown in the second row of Figure 3. The low accuracy of the clustering methods is expected, as they do not maximize the desired objectives. Moreover, we see that Fixed-StNMF is worse than the adaptive version, as we deliberately chose $\Lambda$ to be a fixed all-ones matrix, for which the algorithm does not necessarily converge to the most useful local optimum; this shows that the learning accuracy is also sensitive to the choice of regularization parameters. Further experiments are discussed in appendix B.
Figure 3: Node assignment accuracy at different noise levels (shown in the subpanel titles).

6.1.1  Larger-Scale Networks

We further demonstrate our method on larger-scale networks. As our identifiability theorem 2 suggests, the nonnegative values identify the compressed nodes. We investigate the node assignment accuracy for networks of up to 100,000 nodes. Fixed-StNMF and Adaptive-StNMF are compared with the existing methods Undirected, WNCut, and Spectral, as above. The node assignment accuracy (mean followed by standard deviation) for the different noise settings is reported in Table 2; noise levels are shown as direction/background noise. From the table, we can see that, for all network sizes, Adaptive-StNMF correctly assigns the vertices when the noise level is minimal (left columns of Table 2), which could not be achieved by the other methods. Moreover, in large-scale networks, Adaptive-StNMF recovers the compressed nodes with high accuracy, outperforming the competing methods (middle and right columns of Table 2).

Table 2:

Node Assignment Accuracy for Large-Scale Networks.

                 Noise level: 0.0/0.1                      Noise level: 0.5/0.7                      Noise level: 0.9/1.0
(accuracy)       n = 1k       n = 10k      n = 100k        n = 1k       n = 10k      n = 100k        n = 1k       n = 10k      n = 100k
Adaptive-StNMF   1. (7e-16)   1. (1e-16)   0.99 (1e-4)     0.99 (5e-4)  0.99 (1e-6)  0.99 (9e-4)     0.79 (7e-6)  0.86 (1e-5)  0.82 (2e-4)
Fixed-StNMF      0.71 (5e-2)  0.71 (1e-1)  0.65 (1e-2)     0.76 (8e-2)  0.65 (5e-3)  0.74 (7e-2)     0.69 (8e-2)  0.72 (7e-2)  0.70 (5e-2)
Undirected       0.55 (1e-2)  0.34 (2e-3)  0.34 (3e-3)     0.36 (7e-3)  0.34 (3e-3)  0.34 (2e-3)     0.32 (1e-2)  0.34 (3e-3)  0.34 (4e-3)
WNCut            0.53 (1e-1)  0.54 (2e-2)  0.52 (8e-2)     0.39 (3e-2)  0.37 (9e-3)  0.37 (2e-3)     0.35 (6e-2)  0.37 (9e-3)  0.34 (6e-3)
Spectral         0.89 (9e-2)  0.82 (1e-1)  0.87 (1e-1)     0.71 (2e-2)  0.63 (2e-2)  0.54 (1e-1)     0.36 (1e-1)  0.32 (2e-1)  0.38 (2e-2)

Notes: The first number shows the mean accuracy, and the number in parentheses shows its standard deviation from 100 trials. Noise level: L1/L2 denotes that the network is simulated with direction noise at L1 and background noise at L2.

6.2  Simulations on Multiple Networks: The Tensor Case

In the second set of simulations, we consider the case where multiple networks with the same underlying summarization structure are observed, as discussed in section 3.3. We construct network tensors of size $2000\times2000\times10$, where each frontal slice of the tensor is a skew-symmetric matrix representing a particular weighted, directed relationship network in the multiple-network setting. We use criteria similar to those in the matrix case above. The 10 networks share the same connected structure but differ in their noise edges. We then perform Tensor-StNMF, the procedure presented in algorithm 3, and compare it with baseline approaches, including Tensor-SVD (or HOSVD), that is, SVD on the reshaped tensor matrix to obtain the first singular vectors in each direction (De Lathauwer et al., 2000), followed by projection onto the Stiefel manifold, and Nonnegative PARAFAC, the parallel-factorization procedure for nonnegative cases proposed in Shashua and Hazan (2005). We also compare the optimized objective values against the unconstrained optimum, used as a benchmark and denoted by Unconstrained.
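For reference, here is a hypothetical generator for such a tensor (the parameters and noise mechanics are ours, not the exact simulation protocol of this section): each slice shares one planted two-block D-IPS and receives independent reversed and background noise edges.

```python
import numpy as np

def noisy_dips_tensor(n=200, t=10, flip=0.1, background=0.01, seed=0):
    """t skew-symmetric slices sharing a planted D-IPS block V -> W."""
    rng = np.random.default_rng(seed)
    half = n // 2
    slices = []
    for _ in range(t):
        A = np.zeros((n, n))
        A[:half, half:] = rng.random((half, n - half)) < 0.5          # signal V -> W
        A[half:, :half] = rng.random((n - half, half)) < 0.5 * flip   # reversed noise
        A[rng.random((n, n)) < background] = 1.0                      # background noise
        np.fill_diagonal(A, 0.0)
        slices.append(A - A.T)
    return slices

# Ts = noisy_dips_tensor(); U, S, r = tensor_stnmf(Ts, k=2)
```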

From the results, we see that Tensor-StNMF performs better in terms of accuracy (Figure 4a). Though its average score is better at low noise levels, Tensor-StNMF produces objective values comparable to the baseline methods, especially Nonnegative PARAFAC, when the noise is higher. The result in Figure 4b also shows that the optimized objective values for Tensor-StNMF and Nonnegative PARAFAC are comparable, which is expected.
Figure 4: Tensor decomposition results.

7  Real Network Applications

7.1  Seventh-Grade Student Network

We study our proposed method on the multiple-network data of the seventh-grade students' opinion data set (Vickers & Chan, 1981). Three directed networks were constructed in response to the questions shown in Table 3. Setting the number of compressed nodes to $k=2$ and applying Tensor-StNMF in algorithm 3, we obtained two communities with the node assignment weights shown in Figure 5, where the directed connections between the two communities run from the vertices compressed in red to the vertices compressed in blue. Interestingly, our detected communities largely characterize gender, since the first 13 students are boys and the rest are girls. These results show a directed trend where boys are more likely to prefer girls for working and playing. The auxiliary weights $r_k$ in Table 3 show that this is strongest in choosing whom they get along with, followed by choosing a working partner. In Figure 5, we visualize the character relationship network, with blue and red denoting the identified compressed nodes and yellow denoting students who belong to neither compressed node. We use blue squares for girls in the compressed node with more boys and red squares for boys in the compressed node with more girls. We see that the blue and red vertices are highly connected, and that the edges going into the red compressed node come predominantly from the blue compressed node, while the edges going into the blue compressed node come from all vertices. The yellow vertices, which belong to neither compressed node, are more sparsely connected in the network.
Figure 5: Seventh-grade students' opinion data set.

Table 3:

Seventh-Grade Students: Relational Networks and Corresponding Weights.

Question                           Relation     $r_k$
Who do you get along with?         Character    0.746
Who are your best friends?         Emotional    0.398
Who do you prefer to work with?    Working      0.532

7.2  MEG Data on Visual Task

For a more complicated and interesting case, we study real network data from the visual92 MEG data set (Cichy, Pantazis, & Oliva, 2014), recorded using a 306-channel Neuromag Vectorview system while natural and artificial images were presented in one-minute cycles. We extract the 204 gradiometer channels from the total of 306 channels and perform a series of preprocessing steps, including bandpass filtering, ICA artifact removal, and Morlet transformation, to produce sensible and usable signals. The data set consists of 26 minutes of recordings, where the first and last minutes have no images present. Thus, we consider each minute with an image present as a network observation and construct 24 skew-symmetric matrices to form a tensor. We construct the skew-symmetric matrices $T_k$ using three measures (a sketch of the spectral constructions follows the list):

  1. The imaginary part of spectral connectivity (Nolte et al., 2004)

  2. The sign of the imaginary part of spectral connectivity

  3. Pairwise LiNGAM (Hyvärinen & Smith, 2013)
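A simplified sketch of the first two constructions (the actual pipeline of Nolte et al., 2004 band-limits, tapers, and averages over epochs; the function and variable names here are ours): the imaginary part of a cross-spectral matrix is antisymmetric by construction, so it directly yields a skew-symmetric network.

```python
import numpy as np

def imag_spectral_network(X, signed=False):
    """X: (channels, samples) array of filtered signals.  Returns the
    imaginary part of the cross-spectrum, summed over frequency bins;
    since the cross-spectral matrix is Hermitian, imag(S) is
    antisymmetric, i.e., a skew-symmetric network T (measure 1).
    signed=True returns its sign (measure 2)."""
    F = np.fft.rfft(X, axis=1)        # per-channel spectra
    S = F @ F.conj().T                # cross-spectral matrix
    T = np.imag(S)                    # imag(S_ij) = -imag(S_ji)
    return np.sign(T) if signed else T
```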

We apply the StNMF algorithm and compare the node compression results in Figure 6. We see that the first community pair identified from the three different directed-network constructions is very similar (the red part of the topographical map represents $V$ and the blue part represents $W$, with the directed edge acting from $V\to W$). The $V$ clusters are concentrated in the left part of the occipital lobe area, which implies that the primary visual cortex is the source that propagated signals during the recording. The consistency of the results may suggest that the right visual field is closer to the stimuli. In addition, we found that the summarization from the Pairwise LiNGAM (Hyvärinen & Smith, 2013) network construction demonstrates the pattern most clearly. In further investigation, it will be useful to examine subsequent pairs of information flow to see whether similar patterns occur in the other hemisphere.
Figure 6: Compressed nodes extracted from various visual92 MEG networks. The compressed edge direction is from red to blue.

8  Conclusion

In this work, we propose a new framework to summarize directed graphs, based on a criterion that maximally preserves the directed-edge information. Our criterion is related to the reconstruction error from the summarized graph to the original graph. We provide theoretical results showing that the novel learning criterion, which preserves the directed-edge information of the original graph, corresponds to nonnegative terms in the factorization of adjacency matrices. This inspires us to formulate an optimization problem with a nonnegative constraint, and we propose a nonnegative algorithm to learn such graph summarizations.

Appendix A: Additional Theorems and Proofs

A.1  Proof of Lemma 1

Proof of Lemma 1.
Normalized by the size of the compressed node, each assignment vector has unit length. Expanding each term in equation 3.1, we have
$$\mathrm{LHS}=\sum_{I,J}\sum_{i,j}\big(A_{ij}^2u_{iI}u_{jJ}-2A_{ij}r_{IJ}u_{iI}u_{jJ}+r_{IJ}^2u_{iI}u_{jJ}\big).$$
Summing the first term over the indices $I,J$ gives $\sum_{i,j}A_{ij}^2=\mathrm{tr}(A^\top A)$, a constant independent of $R$ and $U$. Summing the second term over $I,J$ gives $\sum_{i,j}A_{ij}(URU^\top)_{ij}=\mathrm{tr}(A^\top URU^\top)$. For the third term, since $\sum_{i,j}u_{iI}u_{jJ}=1$ in the normalized setting, summing over the indices $i,j$ yields $\sum_{I,J}r_{IJ}^2=\mathrm{tr}(R^\top R)$. Writing
$$\mathrm{RHS}=\mathrm{tr}\big({-2}A^\top URU^\top+R^\top R\big),$$
the result follows. Because the summarized graph is simple, the constraint on $R$ in the factorization model is imposed.
Proposition 1

(Perron-Frobenius). Suppose $M\in\mathbb{R}^{n\times n}$ is a nonnegative square matrix that is irreducible. Then:

  1. $M$ has a positive real eigenvalue $\lambda_{\max}$ such that all other eigenvalues of $M$ satisfy $|\lambda|\le\lambda_{\max}$ (if $M$ is primitive, $|\lambda|<\lambda_{\max}$).

  2. $\lambda_{\max}$ has algebraic and geometric multiplicity 1 and a positive eigenvector $x>0$ (called a Perron vector).

  3. Any nonnegative eigenvector is a multiple of $x$.

Proof of Proposition 1.

$M$ is an irreducible nonnegative square matrix, so there exists $k\in\mathbb{N}^+$ such that $P=(I+M)^k>0$: expanding $(I+M)^k=\sum_{i=0}^{k}\binom{k}{i}M^i$, irreducibility and nonnegativity imply that, for large enough $k$, the expansion fills all $n^2$ entries with positive numbers. Hence $P$ is primitive. We also have $MP=PM$.

Let $\mathcal{Q}$ be the positive orthant and $\mathcal{C}$ be the intersection of the surface of the unit sphere with the positive orthant. For $z\in\mathcal{Q}$, define the function
$$L(z)=\max\{s: sz\le Mz\}=\min_{1\le i\le n,\ z_i>0}\frac{(Mz)_i}{z_i}.$$

For $r>0$, we have $L(rz)=L(z)$ by definition, so $L(z)$ depends only on the ray along $z$.

We write $v\le w$ between vectors to mean $v_i\le w_i$ for all $i$; a similar definition applies for $<$. For $v\le w$ with $v\ne w$, we have $Pv<Pw$, since $P(w-v)\ge0$ and $P(w-v)\ne0$.

If $sz\le Mz$ for a scalar $s$, then $Psz\le PMz=MPz$, which implies $s(Pz)\le M(Pz)$. Thus, $L(Pz)\ge L(z)$.

If $L(z)z\ne Mz$, then $L(z)Pz<MPz$. This implies $L(z)<L(Pz)$ unless $z$ is an eigenvector ($Mz=L(z)z$). Hence, a positive $z$ is an eigenvector when $L(z)$ is maximized.

Consider the image of $\mathcal{C}$ under $P$. It is compact, being the image of a compact set under a continuous map, and all elements of $P(\mathcal{C})$ have strictly positive components, as $P>0$. Hence $L$ is continuous on $P(\mathcal{C})$ and achieves a maximum value there. Since $L(z)\le L(Pz)$, this is in fact the maximum value of $L$ on all of $\mathcal{Q}$, which implies the existence of a maximum eigenvalue. Since $L(Pz)>L(z)$ unless $z$ is an eigenvector of $M$, $L_{\max}$ is achieved at an eigenvector of $M$; call it $x$, with $x>0$ and $L_{\max}$ as the eigenvalue. Since $Mx>0$ and $Mx=L_{\max}x$, we have $L_{\max}>0$.

Let $y$ be any other eigenvector of $M$, with eigenvalue $\lambda$, so $\lambda y_i=\sum_j M_{ij}y_j$. As $M\ge0$, we have $|\lambda||y_i|\le\sum_j M_{ij}|y_j|$; thus $|\lambda||y|\le M|y|$, and $|\lambda|\le L(|y|)\le L_{\max}$ by the definition of $L$. Writing $\lambda_{\max}=L_{\max}$, we see that $|\lambda|\le\lambda_{\max}$. Note that if $\lambda_{\max}=0$, then $M$ is nilpotent; thus, we have $\lambda_{\max}>0$.

Consider the rate of change of the characteristic polynomial of the matrix $M$:
$$\frac{d}{d\lambda}\det(\lambda I-M)=\sum_i\det\big(\lambda I-M_{(i)}\big),$$
where $M_{(i)}$ is the matrix $M$ with the $i$th row and column deleted. Each of the matrices $\lambda_{\max}I-M_{(i)}$ has a strictly positive determinant, which shows that the derivative of the characteristic polynomial of $M$ is nonzero at $\lambda_{\max}$; therefore the algebraic multiplicity, and hence the geometric multiplicity, of $\lambda_{\max}$ is one.

Suppose there exists another nontrivial nonnegative eigenvector $y\ge0$ that is not a multiple of $x$. Since $\lambda_{\max}$ has geometric multiplicity 1, $y^\top x=0$. However, $x>0$ and $y^\top x=0$ imply $y=0$, a contradiction.

A.2  Proof of Theorem 2

Proof of Theorem 2.
By lemma 2, we can write $A$ in block form, with the blocks grouped by the compressed node assignment. Using the fact that $\mathrm{tr}(AA)=0=\mathrm{tr}(RR)$ for a simple graph, and that $U^\top AU$ has the same zero/nonzero positions as $R$ under the exact D-IPS, we have $\mathrm{tr}\big((A-URU^\top)(A-URU^\top)\big)=\mathrm{tr}(AA)-2\,\mathrm{tr}(U^\top AUR)+\mathrm{tr}(RR)=0$ and
$$\|T-USU^\top\|^2=\big\|A-URU^\top-(A-URU^\top)^\top\big\|^2=2\|A-URU^\top\|^2+0.$$
Hence, both objectives solve the same problem.
Theorem 5

(Bipartite Identification). Let $A$ be the asymmetric adjacency matrix of a graph with the exact D-IPS and two compressed nodes. The SVD of $A$ has unique leading left and right singular vectors $v,w\ge0$, and the positive parts of $v,w$ identify the two compressed nodes.

Proof of Theorem 5.

For the exact D-IPS with two compressed nodes, we can always rearrange the vertices such that $A=\begin{pmatrix}0&\tilde A\\0&0\end{pmatrix}$. $\tilde A\tilde A^\top$ and $\tilde A^\top\tilde A$ represent the "in-out" and "out-in" two-step transitions. As the two compressed nodes are connected, any vertex can reach any other vertex in the same compressed node via the two-step transitions. Hence, $\tilde A\tilde A^\top$ and $\tilde A^\top\tilde A$ are both primitive. Using the Perron-Frobenius theorem, each has a unique real, positive leading eigenvector. Padded with zeros, the leading eigenvectors of $\tilde A\tilde A^\top$ and $\tilde A^\top\tilde A$ are unique and nonnegative, and their nonzero entries correspond to the compressed node assignment.

Proof of Theorem 1.
This proof is based on proposition 1 and lemma 2. Rearrange the indices according to the compressed nodes, and denote the block submatrix between $C_I$ and $C_J$ by $\tilde A_{IJ}\in\mathbb{R}^{|C_I|\times|C_J|}$. Write $\bar A_{IJ}\in\mathbb{R}^{n\times n}$ for the zero-padded matrix of $\tilde A_{IJ}$. Write the block-$I$ component of each column $u_{:I'}$ of $U$ as $\eta_{II'}u^I_{I'}$, where $u^I_{I'}\in\mathbb{R}^{|C_I|}$ is a unit vector and $\sum_I\eta_{II'}^2=1$. The optimization objective in equation 3.4 can be written as
$$-2\sum_{I',J'}r_{I'J'}\,u_{:I'}^\top\Big(\sum_{I,J}\bar A_{IJ}\Big)u_{:J'}+\sum_{I',J'}r_{I'J'}^2.$$
Differentiating with regard to $r_{I'J'}$, the optimum is $r_{I'J'}=u_{:I'}^\top(\sum_{I,J}\bar A_{IJ})u_{:J'}$. The optimization objective then becomes $\max_U\sum_{I',J'}\big(u_{:I'}^\top(\sum_{I,J}\bar A_{IJ})u_{:J'}\big)^2$, which can be simplified as $\sum_{I',J'}\big(\sum_{I,J}\eta_{II'}\eta_{JJ'}w^{IJ}_{I'J'}\big)^2$, where $w^{IJ}_{I'J'}=u^{I\top}_{I'}\tilde A_{IJ}u^J_{J'}$. Since $u^I_{I'},u^J_{J'}$ are unit vectors, $\max_{I',J'}w^{IJ}_{I'J'}\le\lambda_{IJ}$, where $\lambda_{IJ}$ is the leading singular value of $\tilde A_{IJ}$. Due to the unit-norm constraint, we have the objective
$$\sum_{I',J'}\Big(u_{:I'}^\top\sum_{I,J}\bar A_{IJ}\,u_{:J'}\Big)^2\le\sum_{IJ}\lambda_{IJ}^2,$$
where the equality holds when $u^I_{I}$, $u^J_{J}$ are the leading left and right singular vectors of $\tilde A_{IJ}$ and $\eta_{II'}=\mathbb{1}\{I=I'\}$. By theorem 5, the two-step transition matrices of $\tilde A_{IJ}$ are primitive for all $I,J\in[k]$. Applying the Perron-Frobenius theorem in proposition 1, $u^I_I>0$ and $u_{:I}\ge0$, where the positive part identifies some compressed node $C_I$. As the compressed node blocks do not need to have an order, the solution is unique only up to the permutation of blocks.
Proposition 2

(Proposition 6 in Ding et al., 2006). For any symmetric nonnegative matrices $A\in\mathbb{R}^{n\times n}_{\ge0}$, $B\in\mathbb{R}^{k\times k}_{\ge0}$ and any $S,S'\in\mathbb{R}^{n\times k}_{\ge0}$, the following inequality holds: $\sum_{i,p}\frac{(AS'B)_{ip}S_{ip}^2}{S'_{ip}}\ge\mathrm{tr}(S^\top ASB)$.

Proof of Proposition 2.
Write $S_{ip}=S'_{ip}a_{ip}$. Then
$$\sum_{i,p}\frac{(AS'B)_{ip}S_{ip}^2}{S'_{ip}}-\mathrm{tr}(S^\top ASB)=\sum_{i,k,l,p}A_{ik}S'_{kl}B_{lp}S'_{ip}\big(a_{ip}^2-a_{ip}a_{kl}\big)=\sum_{i,k,l,p}\tfrac12A_{ik}S'_{kl}B_{lp}S'_{ip}\big(a_{ip}^2+a_{kl}^2-2a_{ip}a_{kl}\big)\ge0,$$
as $A$ and $B$ are symmetric and nonnegative.
Proposition 3.

For any matrices $B\in\mathbb{R}^{k\times k}_{\ge0}$ and $S,S'\in\mathbb{R}^{n\times k}_{\ge0}$, where $B$ is symmetric, the following inequality holds: $\sum_{i,p}\frac{(S'B)_{ip}S_{ip}^2}{S'_{ip}}\ge\mathrm{tr}(SBS^\top)$.

Proof of Proposition 3.
Similar to the proof above, we write $S_{ip}=S'_{ip}a_{ip}$. Then
$$\sum_{i,p}\frac{(S'B)_{ip}S_{ip}^2}{S'_{ip}}-\mathrm{tr}(SBS^\top)=\sum_{i,k,p}B_{ik}S'_{pk}S'_{ip}\big(a_{ip}^2-a_{ip}a_{kp}\big)=\sum_{i,k,p}\tfrac12B_{ik}S'_{pk}S'_{ip}\big(a_{ip}^2+a_{kp}^2-2a_{ip}a_{kp}\big)\ge0,$$
as $B$ is symmetric and nonnegative.
Proof of Lemma 3.
Write $Q=TUS$ and $P=SU^\top US$. Since neither $Q$ nor $P$ is a nonnegative matrix in general, the optimization objective $\mathcal{L}_6$ in equation 4.1 can be written as
$$\mathcal{L}_6(T;U,S,\Lambda)=\mathrm{tr}\big({-2}U^\top Q^+-U^\top UP^-+U^\top U(P^++\Lambda)+2U^\top Q^-\big)$$
for $U\ge0$. From proposition 2, we have $\mathrm{tr}\big(U(P^++\Lambda)U^\top\big)\le\sum_{ij}\frac{[U'(P^++\Lambda)]_{ij}U_{ij}^2}{U'_{ij}}$, since $P^+$ and $\Lambda$ are both symmetric matrices. Using $a\le\frac{a^2+b^2}{2b}$, we have $\mathrm{tr}(Q^{-\top}U)\le\sum_{ij}[Q^-]_{ij}\frac{U_{ij}^2+U_{ij}'^2}{2U'_{ij}}$. $Z(U,U')$ reaches its lower bound $\mathcal{L}_6$ when $U=U'$. Hence, $Z(U,U')$ is an auxiliary function.

A.3  Proof of Theorem 3

Proof of Theorem 3.
Using the auxiliary function in lemma 3, we take the derivative of $Z(U,U')$ with regard to $U_{ij}$:
$$\frac{\partial Z(U,U')}{\partial U_{ij}}=2\big[{-Q^+}-U'P^-\big]_{ij}+2\big[U'(P^++\Lambda)+Q^-\big]_{ij}\frac{U_{ij}}{U'_{ij}}=0.$$
Solving for the stationary point, we have the update rule for $U$ as stated in algorithm 1:
$$U_{ij}=U'_{ij}\frac{[Q^++U'P^-]_{ij}}{[U'(P^++\Lambda)+Q^-]_{ij}}.$$
Since the update of $S$ is independent of $\Lambda$, it can be readily adapted from theorem 11 of Ding et al. (2006). As the objective is bounded below and the iterative procedure is monotonically nonincreasing, the algorithm finds a local minimum of the objective function.
Lemma 5.

Let $U\in\mathbb{R}^{n\times k}$ be an orthogonal matrix such that $U^\top U=I_k$, and let $U'\in\mathbb{R}^{n\times k}$ be a nonnegative matrix with unit column vectors. Let $G\in\mathbb{R}^{k\times k}$ be a nonnegative matrix. Then $\mathrm{tr}(U^\top UG)\le\mathrm{tr}(U'^\top U'G)$.

Proof of Lemma 5.

Write $U'^\top U'=I_k+E$ for some nonnegative matrix $E$; the columns of $U'$ have unit length, so the diagonal entries are 1 and the off-diagonal entries are nonnegative. Since $E$ and $G$ are nonnegative, $\mathrm{tr}(U'^\top U'G)=\mathrm{tr}(I_kG+EG)\ge\mathrm{tr}(I_kG)=\mathrm{tr}(U^\top UG)$.

Lemma 6.

$U^\top Q=U^\top TUS$ is symmetric under the update rule of algorithm 2.

Proof of Lemma 6.

Under the update rule in algorithm 2, as $U$ is column-wise normalized, $U^\top TU=S$. Hence, $U^\top Q=SS$, which is symmetric since $(SS)^\top=S^\top S^\top=(-S)(-S)=SS$.

It is worth noting that the original scheme proposed in Ding et al. (2006), without normalization, does not have this property. Assume the column norms of $U$ form the diagonal matrix $D$, with the normalized $\tilde U$ satisfying $\tilde UD=U$. Then the update is $\tilde S=\tilde U^\top T\tilde U$, whereas $S=U^\top TU=D\tilde SD$. Hence, $U^\top Q=D\tilde SD\tilde S$ is not necessarily symmetric, which violates the auxiliary function formulation.

Proof of Lemma 4.
This proof uses lemma 5. Due to the normalization step, the factor $U$ has unit-norm column vectors. Hence, $\mathrm{tr}(U'^\top U'U^\top Q^-)\ge\mathrm{tr}(U^\top Q^-)$ and
$$\mathrm{tr}\big({-2}U^\top Q^+-U^\top UP^-+U^\top U(U^\top Q^++P^--U^\top Q^-)+2U'^\top U'U^\top Q^-\big)$$
is an upper bound for equation 4.2, where the equality holds when $U$ is an orthogonal matrix. As $U^\top Q$ is symmetric by lemma 6, we can apply proposition 2 to obtain
$$\mathrm{tr}\big(U^\top U(U^\top Q^++P^--U^\top Q^-)\big)\le\sum_{ij}\frac{\big[U'(U'^\top Q^++P^--U'^\top Q^-)\big]_{ij}U_{ij}^2}{U'_{ij}}.$$
We also have
$$\mathrm{tr}\big(U^\top Q^-U'^\top U'\big)\le\sum_{ij}\frac{U_{ij}^2+U_{ij}'^2}{2U'_{ij}}\big(U'U'^\top Q^-\big)_{ij}.$$
Combining both terms, the result follows.
We assume $\Lambda+P^+\ge0$. The KKT complementary condition for the orthogonality-constrained case can be applied to choose the optimal regularization term $\Lambda$. The KKT condition reads
$$2\big[{-Q^+}-UP^-+UP^++Q^-+U\Lambda\big]_{ij}U_{ij}=0.$$
(A.1)
For the diagonal terms, we sum over $j$ in equation A.1 to obtain $[-U^\top Q^+-U^\top UP^-+U^\top UP^++U^\top Q^-+U^\top U\Lambda]_{ii}=0$, which implies $\Lambda_{kk}=[U^\top Q^++P^--P^+-U^\top Q^-]_{kk}$. For the off-diagonal terms, the stationarity condition gives $\sum_kU_{ik}[\Lambda+P]_{kj}=Q_{ij}$; multiplying by $U_{ip}$ and summing over $i$ on both sides gives $[\Lambda+P]_{pj}=[U^\top Q]_{pj}$. Hence we have
$$\Lambda=U^\top Q-P=U^\top Q^++P^--U^\top Q^--P^+$$
(A.2)
with $\Lambda+P^+\ge0$.

A.4  Proof of Theorem 4

Proof of Theorem 4.
Applying the KKT condition and choosing the adaptive $\Lambda=U^\top Q-P$, the objective takes the form in equation 4.2, which is bounded by equation 4.3 in lemma 4. Differentiating equation 4.3 with regard to $U_{ij}$,
$$2\big[{-Q^+}-U'P^-\big]_{ij}+2\big[U'(U'^\top Q^++P^--U'^\top Q^-)+Q^-U'^\top U'\big]_{ij}\frac{U_{ij}}{U'_{ij}}=0.$$
Solving for the stationary point, we have the update rule for $U$ as stated in algorithm 2:
$$U_{ij}=U'_{ij}\frac{[Q^++U'P^-]_{ij}}{[U'(P^++\Lambda)+Q^-]_{ij}}.$$
Since the $U$ factor here does not have unit-norm columns, we explicitly normalize $U$ and update $S=U^\top TU$ after the normalization. With the normalization step, the optimization scheme in algorithm 2 is nonincreasing even for the adaptive regularization scheme. Since the objective is bounded below, it converges to a stationary point.

Appendix B: Additional Experiments and Illustrations

B.1  Noise Analysis

To study the effect of the decomposition in the presence of noise, we define three types of noise that deviate from the desired IPS structure (the noiseless case is shown in Figure 7a). We first consider the situation where the undirected skeleton of the observed network still preserves the IPS structure: in the two-compressed-node IPS case, the undirected graph is bipartite. The first type of noise arises when the unidirectional condition is violated: the directed edges are no longer preserved by a single compressed edge, as shown in Figure 7b. We consider the direction with the minority of edges as the reverse direction. Such "noise" is commonly observed, for instance, in the task-query network within a company: requests are generally sent between different departments, following a particular flow of the production sequence, from upstream departments to downstream departments. However, sometimes feedback may be required, which creates this type of reverse directional noise.
Figure 7: Types of noise in directed graphs.

Another type of noise is that within clusters, which is referred to as intra-cluster noise, shown in Figure 7c. This noise type is more common in a real-world scenario. For instance, in the political opinion diffusion network, despite the strong influence from one community to another, there might still be some directed influence within groups. We refer to these two types of noise as in-network noise.

Furthermore, we also consider the case where, even though the compressed nodes and edges correctly cover the information between edges, there could be missing information that the summarization cannot track; that is, there exists a vertex set $\mathcal{U}$ such that $\mathcal{U}\cap(V\cup W)=\emptyset$, where $V$ and $W$ denote the two compressed nodes. Edges between vertices of $V\cup W$ and $\mathcal{U}$ are also considered noise, as shown in Figure 7d. We call this type the background noise.

To see the effect of the different noise types, we investigate the compressed node recoveries without constraint, that is, spectral methods, in the following settings.

B.1.1  Negative Values in Singular Vectors

For reverse directional noise, we examine the negative values in the singular vectors to better understand theorem 2. In Figure 8, we show the proportion of negative values in the singular vectors under different levels of reverse directional noise between the two connected bipartite structures and for different sizes of the bipartite components. We show the two-compressed-node case: the labels u and v, followed by the numbers 30, 50, and 100, indicate the singular vector and the size of the compressed node. We plot the ratio between the Euclidean norm of the negative entries and that of the positive entries; the x-axis is the percentage of reversed edges. From the plot, we see that the singular vectors are fairly robust to reverse directional noise: a nonnegligible portion of negative terms appears only when more than 20% of the edges are in the wrong direction. Moreover, the larger the components, the more robust the singular vectors are to negative values. These observations validate our use of positive values to identify bipartite community structure in the presence of reverse directional noise.
Figure 8: Negative values in singular vectors, with the presence of directional noise.
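This measurement can be sketched as follows, under the assumption that the spectral method operates on the skew-symmetric representation $T = A - A^\top$; ips_with_noise is the toy generator above, and the sign convention for the singular vectors is ours.

```python
import numpy as np

def neg_pos_ratio(A, k=2):
    """Ratio of the Euclidean norm of negative to positive entries in
    the k leading left singular vectors of T = A - A^T."""
    T = A - A.T
    Uk = np.linalg.svd(T)[0][:, :k]
    Uk *= np.sign(Uk.sum(axis=0))   # resolve the sign indeterminacy
    return (np.linalg.norm(np.minimum(Uk, 0))
            / np.linalg.norm(np.maximum(Uk, 0)))

for rev in (0.0, 0.1, 0.2, 0.3):
    A = ips_with_noise(nv=50, nw=50, nb=0, rev=rev)
    print(f"reverse rate {rev:.1f}: ratio {neg_pos_ratio(A):.3f}")
```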

B.2  The Adjacency Matrices

In the main text, we briefly discussed the advantage of using the skew-symmetric adjacency matrix T over the nonsymmetric adjacency matrix A. In Figure 9, we show the clustering accuracy obtained from the nonsymmetric matrix A and the skew-symmetric matrix T at various noise levels. From the plot, we see that as the noise increases, the clustering accuracy obtained from T decays with a gentler slope than that obtained from A. This implies that the decomposition of T is more robust to the presence of noise, which makes it more useful in real-world network analysis.
Figure 9: Analysis of adjacency matrix.
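The comparison behind Figure 9 can be sketched as follows; the spectral assignment rule and the noise model are our illustrative choices, not the exact experimental pipeline.

```python
import numpy as np

def spectral_labels(M, k=2):
    """Assign each vertex to the leading singular vector with the
    largest-magnitude entry (a simple spectral recovery rule)."""
    U = np.linalg.svd(M)[0][:, :k]
    return np.argmax(np.abs(U), axis=1)

def accuracy(pred, truth):
    """Two-cluster accuracy, up to label permutation."""
    acc = np.mean(pred == truth)
    return max(acc, 1 - acc)

truth = np.r_[np.zeros(50), np.ones(50)]
for intra in (0.0, 0.1, 0.2, 0.4):
    A = ips_with_noise(nv=50, nw=50, nb=0, intra=intra)
    T = A - A.T
    print(intra, accuracy(spectral_labels(A), truth),
          accuracy(spectral_labels(T), truth))
```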

B.3  Illustration of the Normalization Step

In this section, we visually show the effect of the normalization step in the adaptive algorithm: step 4 in algorithm 2. We showed theoretically that with the normalization step, the objective value is monotonically nonincreasing under algorithm 2. We emphasize that the normalization step is crucial for the monotonicity property, and we provide here an intuition complementary to the proof. In essence, an algorithm with adaptive regularization (such as the one proposed in Ding et al., 2006) but without normalization exhibits oscillatory behavior.

Recall the $L_6$ objective in equation 3.7:
$$L_6(T; U, S, \Lambda) = \underbrace{\|T - USU^\top\|^2}_{\text{Frobenius loss}} + \underbrace{\operatorname{tr}\!\left(\Lambda(U^\top U - I)\right)}_{\text{orthogonality loss}} \quad \text{s.t. } U \ge 0,$$
(B.1)
where the two loss components are the Frobenius loss, which tracks how good the factorization is in terms of the Frobenius norm, and the orthogonality loss, which tracks how far the factor $U$ is from the orthogonality condition, that is, from the Stiefel manifold.
From Figure 10a, we see that without normalization, the objective value is not monotonically decreasing over iterations. From the plot, we also observe that the fluctuation of the objective value comes mainly from the regularizer. The oscillation implies that the adaptive algorithm greedily optimizes the Frobenius loss when the factor is close to the Stiefel manifold and penalizes the orthogonality condition when the factor is far from the Stiefel manifold, which is expected under the "adaptive" scheme. If we look at the loss every two steps, it appears monotonic, an observation that motivated our normalization remedy. When normalization is applied, as shown in Figure 10b, both parts of the objective are smooth. As the regularizer is monotonically decreasing, the solution gets closer and closer to the constraint set, giving rise to a slight increase in the Frobenius loss, which is expected. Overall, the objective value is observed to be nonincreasing, which validates the convergence analysis.
Figure 10: Normalization step analysis. Blue line: the Frobenius norm part of objective; red line: the regularizer; yellow line: the total objective value.
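The two curves in Figure 10 can be reproduced by logging both parts of the objective at every iteration. A minimal diagnostic sketch follows; we track $\|U^\top U - I\|_F^2$ as a surrogate for the orthogonality loss, since $\Lambda$ is chosen adaptively, and it plugs into the update loop of the summarize_digraph sketch given after the proof of theorem 4.

```python
import numpy as np

def l6_diagnostics(T, U):
    """Frobenius loss and orthogonality residual of the current factor."""
    S = U.T @ T @ U
    frob = np.linalg.norm(T - U @ S @ U.T) ** 2
    orth = np.linalg.norm(U.T @ U - np.eye(U.shape[1])) ** 2
    return frob, orth

# Example: call l6_diagnostics(T, U) once per iteration inside the
# update loop and plot (frob, orth, frob + orth) against the iteration
# index to obtain Figure 10-style curves.
```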

Note

1. An absent edge does not have the same sign as a directed edge: $\operatorname{sign}(0) \neq \operatorname{sign}(p)$ for $p \neq 0$.

Acknowledgments

W.X. was supported by the Gatsby Charitable Foundation. A.H. was supported by a fellowship from CIFAR and by the DATAIA convergence institute as part of the Programme d'Investissement d'Avenir (ANR-17-CONV-0003) operated by Inria. M.S. was supported by the International Research Center for Neurointelligence (WPI-IRCN) at the University of Tokyo Institutes for Advanced Study.

References

Akoglu, L., Tong, H., & Koutra, D. (2015). Graph based anomaly detection and description: A survey. Data Mining and Knowledge Discovery, 29(3), 626–688.

Akoglu, L., Tong, H., Vreeken, J., & Faloutsos, C. (2012). Fast and reliable anomaly detection in categorical data. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management (pp. 415–424). New York: ACM.

Boutsidis, C., & Gallopoulos, E. (2008). SVD based initialization: A head start for nonnegative matrix factorization. Pattern Recognition, 41(4), 1350–1362.

Bro, R. (1997). PARAFAC. Tutorial and applications. Chemometrics and Intelligent Laboratory Systems, 38(2), 149–171.

Bro, R., & Kiers, H. A. (2003). A new efficient method for determining the number of components in PARAFAC models. Journal of Chemometrics, 17(5), 274–286.

Chaney, I. M. (2001). Opinion leaders as a segment for marketing communications. Marketing Intelligence and Planning, 19(5), 302–308.

Cichocki, A., Zdunek, R., Phan, A. H., & Amari, S.-i. (2009). Nonnegative matrix and tensor factorizations: Applications to exploratory multi-way data analysis and blind source separation. Hoboken, NJ: Wiley.

Cichy, R. M., Pantazis, D., & Oliva, A. (2014). Resolving human object recognition in space and time. Nature Neuroscience, 17(3), 455.

Cilibrasi, R., & Vitányi, P. M. (2005). Clustering by compression. IEEE Transactions on Information Theory, 51(4), 1523–1545.

Comon, P., Luciani, X., & De Almeida, A. L. (2009). Tensor decompositions, alternating least squares and other tales. Journal of Chemometrics, 23(7–8), 393–405.

De Lathauwer, L., De Moor, B., & Vandewalle, J. (2000). A multilinear singular value decomposition. SIAM Journal on Matrix Analysis and Applications, 21(4), 1253–1278.

Delmas, E., Besson, M., Brice, M.-H., Burkle, L. A., Dalla Riva, G. V., Fortin, M.-J., & Poisot, T. (2019). Analysing ecological networks of species interactions. Biological Reviews, 94(1), 16–36.

Dhabu, M., Deshpande, P., & Vishwakarma, S. (2013). Partition based graph compression. International Journal of Advanced Computer Science and Applications, 4(9).

Dhulipala, L., Kabiljo, I., Karrer, B., Ottaviano, G., Pupyrev, S., & Shalita, A. (2016). Compressing graphs and indexes with recursive graph bisection. arXiv:1602.08820.

Ding, C. H., Li, T., & Jordan, M. I. (2010). Convex and semi-nonnegative matrix factorizations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(1), 45–55.

Ding, C., Li, T., Peng, W., & Park, H. (2006). Orthogonal nonnegative matrix t-factorizations for clustering. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 126–135). New York: ACM.

Dulmage, A. L., & Mendelsohn, N. S. (1958). Coverings of bipartite graphs. Canadian Journal of Mathematics, 10, 517–534.

Edelman, A., Arias, T. A., & Smith, S. T. (1998). The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2), 303–353.

Fan, W., Li, J., Wang, X., & Wu, Y. (2012). Query preserving graph compression. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (pp. 157–168). New York: ACM.

Fortunato, S. (2010). Community detection in graphs. Physics Reports, 486(3–5), 75–174.

Hirayama, J.-i., Hyvärinen, A., Kiviniemi, V., Kawanabe, M., & Yamashita, O. (2016). Characterizing variability of modular brain connectivity with constrained principal component analysis. PLOS One, 11(12), e0168180.

Hyvärinen, A., Hirayama, J.-i., Kiviniemi, V., & Kawanabe, M. (2016). Orthogonal connectivity factorization: Interpretable decomposition of variability in correlation matrices. Neural Computation, 28(3), 445–484.

Hyvärinen, A., & Smith, S. M. (2013). Pairwise likelihood ratios for estimation of non-Gaussian structural equation models. Journal of Machine Learning Research, 14, 111–152.

Jackson, M. O. (2011). An overview of social networks and economic applications. In J. Benhabib, A. Bisin, & M. Jackson (Eds.), Handbook of social economics, 1 (pp. 511–585). Amsterdam: Elsevier.

Kolda, T. G., & Bader, B. W. (2009). Tensor decompositions and applications. SIAM Review, 51(3), 455–500.

Kuhn, H., & Tucker, A. (1951). Nonlinear programming. In Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability (pp. 481–492). Berkeley: University of California Press.

Lee, D. D., & Seung, H. S. (2001). Algorithms for non-negative matrix factorization. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 556–562). Cambridge, MA: MIT Press.

Leskovec, J., & Faloutsos, C. (2006). Sampling from large graphs. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 631–636). New York: ACM.

Li, K., Zhang, L., & Huang, H. (2018). Social influence analysis: Models, methods, and evaluation. Engineering, 4(1), 40–46.

Li, Y., Fan, J., Wang, Y., & Tan, K.-L. (2018). Influence maximization on social graphs: A survey. IEEE Transactions on Knowledge and Data Engineering, 30(10), 1852–1872.

Liu, Y., Safavi, T., Dighe, A., & Koutra, D. (2018). Graph summarization methods and applications: A survey. ACM Computing Surveys, 51(3), 62.

Liu, Y., Shah, N., & Koutra, D. (2015). An empirical comparison of the summarization power of graph clustering methods. arXiv:1511.06820.

Maccioni, A., & Abadi, D. J. (2016). Scalable pattern matching over compressed graphs via dedensification. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1755–1764). New York: ACM.

Malliaros, F. D., & Vazirgiannis, M. (2013). Clustering and community detection in directed networks: A survey. Physics Reports, 533(4), 95–142.

Maneth, S., & Peternek, F. (2015). A survey on methods and systems for graph compression. arXiv:1504.00616.

Mehmood, Y., Barbieri, N., Bonchi, F., & Ukkonen, A. (2013). CSI: Community-level social influence analysis. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases (pp. 48–63). Berlin: Springer.

Meilă, M., & Pentney, W. (2007). Clustering by weighted cuts in directed graphs. In Proceedings of the 2007 SIAM International Conference on Data Mining (pp. 135–144). Philadelphia: SIAM.

Navlakha, S., Rastogi, R., & Shrivastava, N. (2008). Graph summarization with bounded error. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (pp. 419–432). New York: ACM.

Nisbet, M. C., & Kotcher, J. E. (2009). A two-step flow of influence? Opinion-leader campaigns on climate change. Science Communication, 30(3), 328–354.

Nolte, G., Bai, O., Wheaton, L., Mari, Z., Vorbach, S., & Hallett, M. (2004). Identifying true brain interaction from EEG data using the imaginary part of coherency. Clinical Neurophysiology, 115(10), 2292–2307.

Orbanz, P. (2017). Subsampling large graphs and invariance in networks. arXiv:1710.04217.

Pavlopoulos, G. A., Secrier, M., Moschopoulos, C. N., Soldatos, T. G., Kossida, S., Aerts, J., … Bagos, P. G. (2011). Using graph theory to analyze biological networks. BioData Mining, 4(1), 10.

Purohit, M., Prakash, B. A., Kang, C., Zhang, Y., & Subrahmanian, V. (2014). Fast influence-based coarsening for large networks. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1296–1305). New York: ACM.

Schaeffer, S. E. (2007). Graph clustering. Computer Science Review, 1(1), 27–64.

Shahaf, D., Yang, J., Suen, C., Jacobs, J., Wang, H., & Leskovec, J. (2013). Information cartography: Creating zoomable, large-scale maps of information. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1097–1105). New York: ACM.

Shashua, A., & Hazan, T. (2005). Non-negative tensor factorization with applications to statistics and computer vision. In Proceedings of the 22nd International Conference on Machine Learning (pp. 792–799). New York: ACM.

Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 888–905.

Shi, L., Sun, S., Xuan, Y., Su, Y., Tong, H., Ma, S., & Chen, Y. (2016). Topic: Toward perfect influence graph summarization. In Proceedings of the 2016 IEEE 32nd International Conference on Data Engineering (pp. 1074–1085). Piscataway, NJ: IEEE.

Shi, L., Tong, H., Tang, J., & Lin, C. (2014). Flow-based influence graph visual summarization. In 2014 IEEE International Conference on Data Mining (pp. 983–988). Piscataway, NJ: IEEE.

Shi, L., Tong, H., Tang, J., & Lin, C. (2015). Vegas: Visual influence graph summarization on citation networks. IEEE Transactions on Knowledge and Data Engineering, 27(12), 3417–3431.

Spirtes, P. (2005). Graphical models, causal inference, and econometric models. Journal of Economic Methodology, 12(1), 3–34.

Stedmon, C. A., & Bro, R. (2008). Characterizing dissolved organic matter fluorescence with parallel factor analysis: A tutorial. Limnology and Oceanography: Methods, 6(11), 572–579.

Tang, J., Sun, J., Wang, C., & Yang, Z. (2009). Social influence analysis in large-scale networks. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 807–816). New York: ACM.

Thomson, M., Oxman, A., Haynes, R., Davis, D., Freemantle, N., & Harvey, E. (1998). Local opinion leaders to improve health professional practice and health care outcomes. Cochrane Library, 3.

Valente, T. W., & Pumpuang, P. (2007). Identifying opinion leaders to promote behavior change. Health Education and Behavior, 34(6), 881–896.

van Leeuwen, M., Vreeken, J., & Siebes, A. (2006). Compression picks item sets that matter. In Proceedings of the European Conference on Principles of Data Mining and Knowledge Discovery (pp. 585–592). Berlin: Springer.

Vickers, M., & Chan, S. (1981). Representing classroom social structure. Melbourne: Victoria Institute of Secondary Education.

Vreeken, J., Van Leeuwen, M., & Siebes, A. (2011). KRIMP: Mining item sets that compress. Data Mining and Knowledge Discovery, 23(1), 169–214.

Xu, K. S., Kliger, M., & Hero, A. O. (2011). Tracking communities in dynamic social networks. In Proceedings of the International Conference on Social Computing, Behavioral-Cultural Modeling, and Prediction (pp. 219–226). Berlin: Springer.

Zhou, H., Zeng, D., & Zhang, C. (2009). Finding leaders from opinion networks. In Proceedings of the 2009 IEEE International Conference on Intelligence and Security Informatics (pp. 266–268). Piscataway, NJ: IEEE.