A graph is a mathematical representation of a set of variables where some pairs of the variables are connected by edges. Common examples of graphs are railroads, the Internet, and neural networks. It is both theoretically and practically important to estimate the intensity of direct connections between variables. In this study, a problem of estimating the intrinsic graph structure from observed data is considered. The observed data in this study are a matrix with elements representing dependency between nodes in the graph. The dependency represents more than direct connections because it includes influences of various paths. For example, each element of the observed matrix represents a co-occurrence of events at two nodes or a correlation of variables corresponding to two nodes. In this setting, spurious correlations make the estimation of direct connection difficult. To alleviate this difficulty, a digraph Laplacian is used for characterizing a graph. A generative model of this observed matrix is proposed, and a parameter estimation algorithm for the model is also introduced. The notable advantage of the proposed method is its ability to deal with directed graphs, while conventional graph structure estimation methods such as covariance selections are applicable only to undirected graphs. The algorithm is experimentally shown to be able to identify the intrinsic graph structure.
A graph is a mathematical structure that represents the relationship of data. Typical examples of graph structures are railroads, the Internet, transmission networks, and neural networks. In recent years, pattern recognition and data mining based on graph structure have been studied. For example, the graph structure was used by Bunke and Riesen (2011) for document analysis, Conte, Foggia, Sansone, and Vento (2004) for biometric identification, and Washio and Motoda (2003) for data mining.
A graph is composed of nodes (vertices) and edges (arcs), where a connection between a pair of nodes is expressed by an edge. Consider a case of stock prices; nodes correspond to companies, and edges correspond to direct connectivity between companies. Influences of stock prices can transfer through several companies. As another example, in the case of neural networks, nodes correspond to nerve cells (neurons), and edges correspond to direct connectivity between neurons. If coincident multineuronal firings are observed, correlations between these observed neurons are stronger than others. However, these neurons may not be connected directly, and the observed coincident firings may be a result of influences from other neurons. This effect is called a pseudo-correlation, and it is problematic in data analysis.
There are two major graph estimation problems. One is a problem of estimating the importance of nodes when connectivities of whole edges are known—for example, web page ranking problems such as PageRank (Page, Brin, Motowani, & Winograd, 1997) and HITS (Kleinberg, 1999). The other problem is estimating edges based on certain statistics, which represents dependency between nodes—for example, cell-signaling data analysis (Friedman, Hastie, & Tibshirani, 2007). We deal with the latter problem in this letter. There are many different kinds of statistics for representing the relationship between nodes; therefore, we collectively refer to these statistics as dependency.
In this study, based on the random walk model on graphs, we propose a method that can extract only important connections in graphs without any tuning parameters. A random walk model is often used in graph analyses. For example, Ribeiro and Towsley (2010) proposed using random walk models for sampling nodes and edges from a given graph to calculate network characteristics such as degree distribution. Sarma, Gollapudi, and Panigrahy (2011) proposed to approximate the importance index of the node in this PageRank (Page et al., 1997) scheme by using random walk modeling for a given graph. Although a random walk model is used for sampling from a fixed graph structure, to the best of our knowledge, there has been no attempt to use a random walk model on a graph to estimate intrinsic graph structure. Although the proposed framework of graph structure estimation is different from SICS, we compare the ability of our method to several SICS methods, which are actively studied and considered to offer state-of-the-art accuracy.
The remainder of this letter is organized as follows. In section 2, we define the notion of graph structures and represent dependency as the total intensity of directed and undirected connections between nodes. Section 3 describes the digraph Laplacian that is a simple matrix function characterizing a graph and describes information transitions on a graph. In section 4, an algorithm for estimating graph structure is proposed. Section 5 shows the comparison of graph structure estimation with our method and existing methods by using both synthetic and real-world data sets. The last section, a conclusion, provides examples of future research on this field of study.
2. Relevant Background and Problem Setting
In this section, we explain the notion of graph structures (Chung, 1997) and introduce the framework for estimating graph structure. Let V be a set of nodes and E be a set of edges that connects nodes i and j. A graph is characterized by a pair (V, E). There are two types of graphs: undirected and directed. When we are interested in only connectivity between nodes, we use undirected graphs, and when we are interested in the directions of edges, we use directed graphs.
Let be a connectivity matrix with , , which represents the connection intensity from nodes i to j, and can be zero if there is no edge between i and j. For the example of stock prices, represents direct connectivity between companies. We can deal with both undirected and directed graphs, but in this study, we mainly consider the directed graph, that is, generally . In our representation, diagonal elements are always set to 0, that is, we assume does not include self-loops. We emphasize that the connectivity in this study indicates the direction and intensity of information transition via the edge, unlike graphical models (Pearl, 2000) in which edges represent causality.
3. Model of Graph Structure
We propose using the exponential map as the function f in equation 2.1 for theoretical and computational simplicity. Then we can simply obtain the inverse of f as the matrix logarithm. In this section, we explain digraph Laplacian (Li & Zhang, 2010) for considering the exponential map of directed graphs. We explain the notion of a random walk on a graph characterized by the digraph Laplacian and represent information propagation on a graph by a random walk.
3.1. Digraph Laplacian.
3.2. Random Walk on Graph.
3.3. Observed Data Model.
4. Algorithm for Estimating Graph Structure
In this section, we show experimental results of graph structure estimation. In section 5.1, we focus on an estimate of undirected graphs. We compare results of the proposed method with results of SICS methods. Since generative models of the proposed framework and SICS framework are different, we estimate graph structures from observed data according to each generative model. Observed data are generated from the random walk in section 5.1.1. This condition is faithful to the proposed framework. Observed data are generated by the multidimensional gaussian distribution in section 5.1.2. This condition is faithful to the SICS framework. In section 5.2, we estimate directed graph structures using the proposed method only because SICS methods cannot be applied to directed graph structure estimation. In section 5.3, we apply the proposed method and SICS methods to real-world financial data sets and compare the results of these methods. As representative SICS methods, we adopt glasso (Friedman et al., 2007), QUIC (Hsieh et al., 2011), and LOREC (Luo, 2011). We use StARS (stability approach to regularization selection; Liu, Roeder, & Wasserman, 2010) to find an appropriate regularization parameter in SICS methods.
5.1. Artificial Data of Undirected Graphs.
First, we estimate symmetric graph structures using both the proposed method and SICS methods and compare their performances. We set the number of nodes as n=20.
5.1.1. Data in Accordance with the Random Walk.
In Figure 6, we can see that the accuracies of SICS methods decrease as the densities increase. The proposed method is shown to obtain high accuracies in most cases.
In Figure 7, to see the effect of the regularization parameters in SICS methods, we show the average estimation accuracies by the proposed method and the glasso method with various regularization parameters. The vertical axis and horizontal axis of the graphs indicate accuracy and regularization parameters , respectively. In this study, we fixed the density of to 0.2, but we observed a similar tendency for other densities. We prepared two levels of noise: 0.006 and 0.06. The proposed method has no tuning parameter in contrast to SICS methods. Figure 7 shows that the misspecification of the regularization parameter of glasso leads to inaccurate estimation results. In this study, the proposed method achieved almost the same accuracy as the optimal case of glasso.
5.1.2. Data Generated from the Multidimensional Gaussian Distribution.
From Figure 8a, we see that when the data are generated from a multivariate gaussian with the inverse covariance matrix , the proposed method accurately estimates the ground-truth graph. From Figure 8b, we can see that the accuracies of every method decrease as the densities increase. However, the accuracy decrease of the proposed method is slower than others and retains more than 60% accuracy even when the density reached 0.6.
There are two possible reasons for the better accuracies achieved by the proposed method shown in Figures 6 and 8. The first is the essential difference of the underlying models of the proposed and SICS frameworks. While the SICS framework considers only the second-order statistics, the proposed framework takes higher-order statistics into account, as shown in equations 3.10 and 3.11. The second reason is the difference of the regularization methods. We adopt StARS for defining the regularization parameter in SICS methods, while algorithm 2 has a mechanism to automatically determine an appropriate sparsity based on the fitness of the model.
5.2. Artificial Data of Directed Graphs.
In this section, we estimate directed graphs represented by asymmetric adjacency matrices . Since SICS methods cannot be used for estimating directed graphs, we use only the proposed method. We prepare several with densities 0.2, 0.4, 0.6, and randomly generate , from a uniform distribution over [0, 1]. We set the number of nodes to n=20. We generate observed data T according to equation 3.13 in the same manner as the experiments of the undirected graph. We show the relationship between accuracy, defined by equation 5.4, and the size of observed noise defined, by equation 5.5, in Figure 9. As shown in Figures 6 and 9, the proposed method is able to estimate the directed graph structure the same as the undirected case. The accuracy of estimation deteriorates with the increase of the density of the ground-truth graph and the size of noise. This tendency is observed in the experiments with both directed and undirected graphs.
Sections 5.1 and 5.2 are summarized as follows. We performed experiments with artificial data and showed that the proposed method is able to estimate the graph structures more accurately than the conventional SICS methods in most cases. The proposed method, which has no tuning parameter, achieved comparable or superior accuracies to the finely tuned glasso. Also, the proposed method is applicable to directed graphs, while the conventional SICS methods are applicable only to undirected graphs.
5.3. Real-World Data.
In this section, we apply the proposed method and SICS methods to analyze a set of time-series data of stock prices downloaded from Yahoo! finance.3 First, we consider an undirected graph with nodes corresponding to companies to compare the results of the proposed method with the results of SICS methods. Next, we consider a directed graph to analyze the relationship of the companies in more detail by the proposed method. The data set is composed of the daily closing prices of 10 companies from November 11, 2010, to July 11, 2011, which is equivalent to 158 business days. We show the data in Figure 10. We take an interest in how the relationship of companies changed around the day of the Great East Japan Earthquake on March 11, 2011.
5.3.1. Undirected Graphs.
The edges of the graph represent a certain kind of relationship of stock prices between companies. For example, if stock prices of companies i and j behave similarly, it is natural to consider that there is an edge between these companies. We estimate the undirected connectivity matrix from an observation. Let be a stock price of company i at time t. Since we focus on the existence of connections and do not consider whether connections are positive or negative, we consider an absolute value of relative change rate Ri(t)=|(Si(t+1)−Si(t))|/Si(t). We assume that virtual random walkers in the stock market have uptick and downturn factors. They transit between nodes according to the rate determined by the intensity of the connection, and we assume the rise and decline of stock prices of companies can be explained by the transition of these random walkers.
Ten large electronic companies (Sony, Panasonic, Sharp, Toshiba, Google, IBM, Apple, Microsoft, LG, and Samsung) are chosen and their daily closing stock prices Si(t) before and after the Great East Japan Earthquake (November 11, 2010–March 11, 2011 and March 11, 2011–July 11, 2011) are used to construct observation matrices. Figures 11 and 12 are the estimated graph structures before and after the earthquake, respectively.
From Figures 11 and 12, only a few edges were detected by SICS methods. The sparseness of SICS methods is decided by StARS. From the results of the SICS methods, we cannot see any changes in the relationship between companies around the day of the earthquake. The proposed method outperformed SICS methods on artificial data sets; therefore, the result of the proposed method can be considered more reliable. In addition, the proposed method reveals an interesting relationship between companies. From Figure 11a, we can see that only the proposed method extracts some mutual connections between Japanese companies (Sony, Panasonic, and Toshiba) and U.S. companies (Apple, IBM, and Google). From Figure 12a, we can see that Japanese and U.S. companies are mutually connected to each other, but companies in these two countries become isolated after the earthquake occurred. From this result, it is conceivable that the stock prices of Japanese companies showed similar behavior, and the behavior was distinct from those of U.S. companies.
5.3.2. Directed Graphs.
We show the estimated directed graph structures in Figure 13. Figures 13a and 13b have different connectivity from the structures shown in Figures 11a and 12a. Comparing the estimated structures corresponding to after the earthquake shown in Figures 13b and 12a, connectivity ignoring directions is similar, but directed dependencies between companies can be seen from Figure 13b.
In this study, we proposed a method for estimating graph structures using dependency between nodes. Since only the intrinsic connectivity of a graph is important in many cases, we simply represented a graph structure using an adjacency matrix and modeled the dependency between nodes using a probability matrix of transition on a graph characterized by a digraph Laplacian. In SICS methods such as glasso, we must tune a regularization parameter, whereas there is no need to tune any parameter in our proposed method. SICS methods are scalable and faster than the proposed method. However, when the size of graphs is of the same order to those as in section 5, the proposed method is shown experimentally to be comparable to or more accurate in estimating the intrinsic graph structure as compared to SICS methods with parameters optimized by StARS. Furthermore, the proposed method is applicable to estimating asymmetric graphs, while the conventional SICS methods are applicable only to symmetric graphs. As shown in the real-world experiment, the proposed method is able to capture an asymmetric structure that was not extracted by SICS methods, which assume symmetric structures only.
Since we restricted the value of each element of the connectivity to be 0 or 1, our conjecture is that our model based on the random walk on the graph is identifiable. Proving strict identifiability of the proposed model remains an important factor for future work. Also, the proposed general algorithm 1 does not have a theoretical guarantee of convergence to the ground-truth solution because it estimates the graph structure in a greedy manner. However, from Figure 8a, we see that the estimation accuracy moves close to 100% as the size of the data increases.
The proposed method requires O(n2) iterations for estimating the graph structure with n nodes. Although this order of computational complexity is linear in the number of unknown variables , each iteration involves optimization with respect to and , and to the matrix logarithm and exponential. To reduce the computational cost, it would be helpful to estimate the number of possible edges m. The major computational issue in the proposed method is calculating the matrix logarithm in each iteration, which requires at least the computational cost of an order higher than O(n5) (Roch & Villard, 1996), where n is the number of edges. The stochastic optimization approach and approximation algorithm would be applicable for reducing the computational cost. It is also possible to use another model for f in which the inverse map can be efficiently calculated. Improvement of computational efficiency for the application of large-scale graphs is an important future study to be conducted.
We designed our algorithm for estimating a -valued connectivity matrix to focus on estimating the existence of edges and obtaining intrinsic connectivity. The proposed method can be extended to estimate a real-valued matrix by omitting the thresholding operation in equation 4.5. This can make the model more flexible, and we would be able to obtain more detailed graph structure, but would need an additional tuning parameter for controlling the sparsity . In addition, for simplicity, the observation model in equation 3.8 is assumed to be a gaussian, although it is possible to replace it with another density function.
When we apply the proposed method to graph structure estimation, the most important thing is how to define the observed matrix T characterizing the dependency of nodes. In contrast to conventional Bayesian graph estimation methods based on linear structure equation models (e.g., Shimizu, Hoyer, Hyvärinen, & Kerminen, 2006), our proposed method assumes a random information transition on a graph and estimates its structure based on the matrix T, which represents dependency between nodes. However, given a set of raw observations, the estimation results vary according to how the observation matrix T is generated from the raw data. In section 5, we have shown an example of graph structure estimation using a set of stock data in which the observation matrix is defined as a correlation matrix of stock prices. Further investigation will be needed for defining the observed matrix.
Finally, in the field of neuroscience, multielectrode recording has enabled us to monitor large numbers of neurons simultaneously. Analysis of neural activities by the proposed method, specifically before and after learning, could provide visualization on the change of the network structure. We are now analyzing extracellularly obtained spike data sets by our proposed method to estimate the directed connectivity between neurons.
We express special thanks to the editor and reviewers whose comments led to valuable improvements of the manuscript. H.H. is supported by JSPS KAKENHI grant number 25870811, and S.A. and N.M. are supported by JSPS KAKENHI grant number 25120011. M.T. is supported by the Innovates Center of Research Excellence grant (iCORE SCH001) and Natural Sciences and Engineering Research Council of Canada grant (NSERC DG 386522-2010).
Later in this study, we introduce the random walk model. In the model, a walker is regarded as an information carrier on the transit network represented by a graph.
Technically, this operation is to ensure that we can calculate the inverse of the matrix log function in algorithm 2, and we set r=1 when performing experiments.