Abstract

LiNGAM has been successfully applied to some real-world causal discovery problems. Nevertheless, causal sufficiency is assumed; that is, there is no latent confounder of the observations, which may be unrealistic for real-world problems. Taking latent confounders into consideration will improve the reliability and accuracy of estimation of the real causal structures. In this letter, we investigate a model called linear nongaussian acyclic models in the presence of latent gaussian confounders (LiNGAM-GC), which can be seen as a special case of lvLiNGAM. This model includes latent confounders, which are assumed to be independently gaussian distributed and statistically independent of the disturbances. To tackle the causal discovery problem of this model, we first propose a pairwise cumulant-based measure of causal direction for cause-effect pairs. We prove that in spite of the presence of latent gaussian confounders, the causal direction of the observed cause-effect pair can be identified under the mild condition that the disturbances are simultaneously supergaussian or subgaussian. We propose a simple and efficient method to detect violations of this condition. We then extend our work to multivariate causal network discovery. Specifically, we propose algorithms to estimate the causal network structure, including the causal ordering and causal strengths, using an iterative root-finding-removing scheme based on the pairwise measure. To address the redundant-edge problem due to the finite sample size, we develop an efficient bootstrapping-based pruning algorithm. Experiments on synthetic data and real-world data have been conducted to show the applicability of our model and the effectiveness of our proposed algorithms.

1.  Introduction

Causal discovery has received extensive attention in recent years. Much empirical work has been devoted to discovering the causal mechanisms behind natural and social phenomena. Conducting controlled experiments is an effective way to discover causal mechanisms. However, controlled experiments are in general expensive, time-consuming, practically infeasible, or ethically prohibited. For these reasons, causal discovery from nonexperimental, passively observed data becomes a great necessity. Models such as structural equation models (SEMs) and Bayesian networks (BNs) have been proposed to explain data-generating mechanisms and have been widely applied in social science, econometrics, and medical science (Pearl, 2000; Spirtes, Glymour, & Scheines, 2000). However, conventional methods such as BNs assume that all disturbances are gaussian distributed, and hence only second-order statistics are employed to infer the causal structure. In general, such methods can obtain only a class of models that are equivalent in their conditional correlation structure (Shimizu, Hoyer, Hyvärinen, & Kerminen, 2006). The full structure, including the causal ordering and causal strengths, cannot be identified without prior knowledge in most cases (Sogawa et al., 2011). Recently it has been shown that by exploiting the nongaussianity of the disturbances, the causal structure can be fully and uniquely determined. Shimizu et al. (2006) proposed the linear nongaussian acyclic model (LiNGAM) and showed that the full causal structure can be identified by independent component analysis (ICA) (Comon, 1994; Hyvärinen & Oja, 2000). However, as most ICA algorithms rely on gradient-based iterative search, suboptimality cannot be avoided, and a correct causal ordering cannot be guaranteed by ICA-based LiNGAM algorithms. The Direct-LiNGAM framework was proposed later to tackle this problem (Shimizu et al., 2011).
LiNGAM has been widely used to discover the underlying causal structures of many real-world problems (Sogawa et al., 2011; Kawahara, Shimizu, & Washio, 2011; Moneta, Entner, Hoyer, & Coad, 2010; Zhang & Chan, 2006). The advantages of LiNGAM over conventional methods are that (1) a full and unique causal structure can be identified, provided that all assumptions hold and the sample size of the observations is sufficiently large, compared with BNs, which can only obtain a class of Markov equivalent models; (2) no prior knowledge of the network structure is needed; and (3) compared to BNs, which may involve large numbers of conditional independent tests, the computational complexity of LiNGAM is much lower.

In spite of these advantages, LiNGAM has the same shortcoming as conventional methods such as BNs: they all assume causal sufficiency (Shimizu et al., 2006). Causal sufficiency is a basic assumption of many causal models, including LiNGAM and BNs. It assumes that for a set S of observed variables, every common direct cause (confounder) of any subset (with two or more variables) of S is also in S. This assumption is strong and may not hold in the real world. For example, in addition to mutual influences among stocks, the prices of stocks are also affected by a number of external factors: interest rates, inflation rates, oil prices, gold prices, and so on. In general, for a set of observed variables, some confounders may be hidden or latent. This situation is closely related to the factor model widely used in economics and finance (Zhang & Chan, 2009; Chan, 2002; Chan & Cha, 2001; Cha & Chan, 2000). The factor model assumes no cause-and-effect relationships among the observed variables; instead, all observed variables are assumed to be effects of a possibly small number of latent uncorrelated or independent factors. Causal models and factor models each have their own advantages. A causal model can be used to estimate the causal relationships among the observed variables, which are quite common in the real world. The factor model takes into consideration latent external confounders outside the system, for example, interest rates, oil prices, and gold prices outside the stock market. Henao and Winther (2011) considered a model involving both a linear directed acyclic graph and a factor model. In that model, each variable is affected by other variables, by a driving signal of its own, and by some latent variables. The authors proposed a hierarchical Bayesian framework to estimate the model and a quantitative method for model comparison. (See Henao & Winther, 2011, for more details.)
Thus, a more general and realistic model should be able to describe the causal relationship among the observed variables as well as the effects from the latent confounders outside the system. In this letter, we discard the causal sufficiency assumption made in BNs and LiNGAM and allow the presence of latent confounders.

We deal with a more important and challenging problem: causal discovery for linear nongaussian acyclic models in the presence of latent gaussian confounders (LiNGAM-GC). In LiNGAM-GC, we allow the presence of latent confounders. In order to make the problem tractable, we approximate these latent confounders by gaussian variables that are mutually independent and independent of the disturbances.

Figure 1 shows an illustrative example of LiNGAM-GC with two observed variables. x and y are observed variables whose causal direction we aim to discover. e1 and e2 are independent nongaussian disturbances. c is the latent confounder, which is introduced in our model to represent the ensemble effect of many latent independent factors fi. λ1, λ2, and ρ are causal strengths. Due to the extra dependence introduced by c, the causal discovery problem for x and y becomes much more challenging. Previous methods such as LiNGAM and Direct-LiNGAM may obtain incorrect results because of the latent confounder. In this letter, we propose and rigorously analyze a new cumulant-based pairwise measure of causal direction. We prove that under mild conditions, the causal direction of the model given in Figure 1 can be correctly identified by simply investigating the sign of the measure R̂: if R̂ > 0, x → y is concluded; on the contrary, if R̂ < 0, y → x is concluded. We also extend our work to multivariate causal network discovery using the Direct-LiNGAM framework: we first identify the exogenous variable (root) of the network and then regress all other variables on the exogenous variable. We repeat these steps for the regression residues until the whole causal ordering has been identified.
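As a concrete illustration of the Figure 1 model, the NumPy sketch below generates data from the pairwise model with hypothetical strengths (λ1 = 0.8, λ2 = 0.6 and, for emphasis, ρ = 0 are our illustrative choices, not values from the letter) and shows that the latent gaussian confounder alone induces correlation between x and y:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical coefficients for the Figure 1 model; rho = 0 means
# there is no direct edge x -> y at all.
lam1, lam2, rho = 0.8, 0.6, 0.0

e1 = rng.laplace(size=n)          # supergaussian disturbances
e2 = rng.laplace(size=n)
c = rng.normal(size=n)            # latent gaussian confounder

x = e1 + lam1 * c
y = rho * x + e2 + lam2 * c

# Even with rho = 0, the confounder makes x and y correlated.
r = np.corrcoef(x, y)[0, 1]
print(round(r, 2))
```

Any method that reads this correlation as evidence of a direct causal edge would be misled, which is exactly the difficulty LiNGAM-GC addresses.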

Figure 1:

Pairwise LiNGAM-GC (causal-effect pair).


Note that our model LiNGAM-GC can be seen as a variant of latent variable LiNGAM (lvLiNGAM) (Hoyer, Shimizu, Kerminen, & Palviainen, 2008), which we discuss in detail in section 2. The main difference is that LiNGAM-GC assumes gaussian latent confounders, whereas lvLiNGAM assumes nongaussian confounders. Although lvLiNGAM is a more general model, its huge computational complexity makes it practically infeasible: lvLiNGAM can be used to find the causal structure for only a very small number of variables. The gaussian approximation, in contrast, makes the problem fully tractable. In fact, our method developed based on this approximation is computationally efficient and can be applied to large causal network discovery.

The rest of this letter is organized as follows. In section 2, we review recent work concerning latent confounders. In section 3, we define our linear nongaussian acyclic models in the presence of latent gaussian confounders in detail. In section 4, we propose and rigorously analyze a new cumulant-based measure for pairwise causal discovery of LiNGAM-GC. In section 5, the main contribution of this letter, we extend our work to causal discovery for LiNGAM-GC with more than two observed variables. In section 6, we report experiments on synthetic and real-world data conducted to show the effectiveness of our proposed methods. We conclude in section 7.

2.  Related Work Concerning Latent Confounders

Several recent papers concern latent confounders (Hoyer et al., 2008; Janzing, Peters, Mooij, & Schölkopf, 2009; Kawahara, Bollen, Shimizu, & Washio, 2010; Entner & Hoyer, 2011). In Hoyer et al. (2008), as illustrated by the following equations, latent confounders are involved in the model
x = Ax + Bc + e,
2.1
where x is the observation vector, e is the disturbance vector, and c is the latent confounder vector; A collects the causal strengths among the observed variables, and B is the factor-loading matrix of the latent confounders. Hoyer et al. treated the causal discovery problem for LiNGAM in the presence of latent confounders as an overcomplete ICA problem and developed algorithms to derive canonical models based on the estimated mixing matrix. However, estimation of the mixing matrix for the overcomplete ICA problem is an ill-posed and challenging problem. Indeed, the performance of existing methods for this problem is far from satisfactory, especially when the dimensionality of the problem is high. In addition, ICA-based methods may get stuck in local optima, which leads to incorrect estimation of the causal ordering.
Janzing et al. (2009) proposed the confounders with additive noise (CAN) model,
X = u(T) + N_X,
Y = v(T) + N_Y,
2.2
where X and Y are the observations, N_X and N_Y are statistically independent disturbances, T is the latent confounder, and u and v are certain nonlinear functions. They showed that under certain conditions, the confounder T is identifiable. Note that in this model, they assumed that there is no direct edge between X and Y and that the variances of N_X and N_Y are relatively small compared to the effect of the confounder T. However, we are more interested in a general model where there is a direct edge between X and Y, and N_X and N_Y are not necessarily small. In this general model, we aim to estimate the direction of the edge between X and Y.

Kawahara et al. (2010) proposed the GroupLiNGAM model. In this model, latent confounders are allowed to be present but are restricted to certain subsets of the observed variables. The authors mainly focused on finding the causal ordering of the subsets. This was achieved by iteratively finding an exogenous subset and removing it from the remaining observed variables until the entire causal ordering of the subsets was identified. However, in GroupLiNGAM, the causal structures within the confounded subsets remain unidentified. Hence, the confounding problem is only partially solved in this work.

Entner and Hoyer (2011) proposed a method to infer a partial causal structure of the observed variables based on lvLiNGAM; their method discovers all the unconfounded supersets and then estimates the total causal effects for each pair within them. The basic idea is that confounding can be detected in nongaussian settings. The algorithm first finds all the unconfounded causal pairs and then, for each confounded pair, finds a maximal set S containing the common causes of the pair such that the pair is unconfounded given S. Their method has been proved to be sound and complete. However, similar to Kawahara et al. (2010), this work can obtain only partial information about the causal network. If the latent confounders affect most of the observed variables, which is quite common in real-world phenomena, it is very likely that the method returns an empty unconfounded superset.

3.  Linear Nongaussian Acyclic Models in the Presence of Latent Gaussian Confounders

In this section, we propose our model: linear nongaussian acyclic models in the presence of latent gaussian confounders (LiNGAM-GC).

Suppose we have the observation vector generated by LiNGAM-GC defined as follows:
x = Ax + Bc + e,
3.1
where A is an n × n matrix collecting the causal strengths among the n observed variables, e is the vector of disturbances, c is a vector collecting the q latent confounders, and B is an n × q matrix collecting the causal strengths of the confounders on the observed variables. We make the following assumptions about the model:
Assumption 1. 

The causal model is acyclic; that is, the causal model can be represented by a directed acyclic graph (DAG) or, equivalently, matrix A can be permuted to a strictly lower triangular matrix by simultaneous row and column permutations. Here, strictly means that the diagonal elements of the lower triangular matrix are exactly zero.

Assumption 2. 

The components of the disturbance vector e are mutually independent and nongaussian distributed with zero means.

Figure 2 gives an illustrative example of LiNGAM-GC. In the causal network described in the figure, if the confounders c1 and c2 are absent, the model reduces to LiNGAM, of which the causal structure can be estimated by LiNGAM algorithms (Shimizu et al., 2006, 2011). However, due to the extra dependence introduced by the latent confounders c1 and c2, the causal discovery becomes a challenging task. In order to make the problem tractable, we make another assumption on the model:

Figure 2:

LiNGAM in the presence of latent gaussian confounders.


Assumption 3. 

The latent confounders ci are independently gaussian distributed with zero means and are independent of the disturbances ej.

Here, we approximate the distributions of the latent confounders by gaussian distributions. The advantage of this approximation is that it makes the model fully identifiable and practically tractable. The interpretation of this approximation is based on the central limit theorem (CLT), as is traditional in statistics and related areas. As illustrated by Figure 1, the disturbances ei can be regarded as local factors that describe the intrinsic behavior of the observed variables and cannot be regarded as external noise (Henao & Winther, 2011). In many real-world phenomena, the ei tend to be more nongaussian. Besides the intrinsic disturbances, the observed variables x and y are also affected by a large number of global factors (macroeconomic factors) fi. We introduce another variable c to represent the ensemble effect of these global factors. By the CLT, we can roughly say that the sum of a large number of independent variables tends to be, or is at least close to, gaussian distributed. Consequently, we approximate the latent confounders as gaussian distributed in order to make the problem tractable. Our method proposed in this letter is developed based on this approximation. The performance of the algorithm depends on the extent to which our assumptions on the model hold. Thus, this method may not be applicable when the gaussian assumption on the latent confounders is strongly violated. This includes cases in which only some variables of a network are observable but the unobserved variables have effects on the observed variables. In such a scenario, our method may not be suitable, as the latent gaussian confounder assumption is easily violated.
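The CLT argument above can be checked numerically. In the sketch below (the number of factors and the loadings are hypothetical choices of ours), each individual global factor is strongly subgaussian, yet the excess kurtosis of their weighted sum c is close to the gaussian value of zero:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 200_000, 50   # sample size, number of independent global factors

def kurt(v):
    # Excess kurtosis E[v^4] - 3 E[v^2]^2 of a (centered) sample.
    v = v - v.mean()
    return np.mean(v**4) - 3 * np.mean(v**2)**2

# Each factor is uniform, hence strongly subgaussian (excess kurtosis
# -1.2 after standardization); c is their weighted sum.
f = rng.uniform(-1, 1, size=(m, n))
w = rng.normal(size=m)            # hypothetical loadings
c = (w @ f)
c = c / c.std()

single = f[0] / f[0].std()
print(round(kurt(single), 2), round(kurt(c), 2))
```

The first value is far from zero while the second is nearly zero, which is the sense in which the ensemble confounder is "close to gaussian."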

Now we can reformulate equation 3.1 as follows:
(I − A)x = e + Bc,
where I is an identity matrix. Letting H = (I − A)^{-1} and n = HBc, we have
x = He + n.
3.2
Note that the model described by equation 3.2 is exactly a noisy ICA model. Some work has been devoted to noisy ICA. One promising approach is a bias-removal technique (Hyvärinen, 1999). Hyvärinen (1999) proposed solving the noisy ICA problem using gaussian moments. However, this approach requires prior knowledge of the covariance matrix of the gaussian noise. In our problem, the covariance matrix of the noise term n depends on the covariance matrix of the latent confounders c, of which we obviously have no prior knowledge. Hence, the gaussian moments–based methods are not applicable to our problem. Attias (1999) proposed independent factor analysis (IFA) and used a gaussian mixture model (GMM) to approximate the densities of the disturbances. All parameters of the IFA model were learned by the expectation-maximization (EM) algorithm. However, the number of parameters to learn is large, especially when the dimensionality of the problem is high. The algorithm easily converges to local optima, such that the estimated mixing matrix always leads to an incorrect causal ordering when applied to our problem. Furthermore, its computational cost is huge, which makes it inapplicable to large-scale causal network discovery. Inferring the causal structure of LiNGAM-GC is a challenging problem. To the best of our knowledge, no reliable method has been proposed to estimate the causal structure of LiNGAM-GC. In the following sections, we propose a new cumulant-based measure to tackle this problem.

4.  Causal Discovery for LiNGAM-GC with Two Variables

In this section, we focus on inferring the causal direction of a cause-and-effect pair in the presence of latent gaussian confounders: LiNGAM-GC with two observed variables. As graphically represented by Figure 1, suppose that the real data-generating mechanism is described by the following equations:
x = e1 + λ1c,
y = ρx + e2 + λ2c.
4.1
We have only observations x and y and no prior knowledge about causal direction and strength. Our task is to discover the causal direction based on pure observations. Before we propose our new cumulant-based measure, we introduce the cumulant-based measure that Hyvärinen and Smith (2013) proposed to discover the causal structure of LiNGAM.

4.1.  Cumulant-Based Measure by Aapo Hyvärinen.

The cumulant-based measure was proposed for LiNGAM; no latent confounder is present in the model. Suppose we have observed two random variables x and y with zero means, generated by a reduced model described by equations 4.1 with λ1 = 0 and λ2 = 0. Denote by x̃ and ỹ the normalized x and y with unit variances, and by ρ̃ the corresponding correlation coefficient of x̃ and ỹ. It is easy to see that |ρ̃| < 1, an important working condition for this cumulant-based measure. The cumulant-based measure that Hyvärinen and Smith (2013) proposed is defined as
R̃ = sign(kurt(x̃)) ρ̂ Ê[x̃^3 ỹ − x̃ ỹ^3],
4.2
where Ê[·] denotes the sample average, ρ̂ is the estimated correlation coefficient, and kurt(x̃) is the kurtosis of x̃ defined as kurt(x̃) = E[x̃^4] − 3(E[x̃^2])^2. According to theorem 1 in Hyvärinen and Smith (2013), we have the following causal inference rule:
x → y if R̃ > 0,    y → x if R̃ < 0.
4.3
This cumulant-based measure works well when no latent confounder is present. Unfortunately, if we simply apply it to LiNGAM-GC, we may get misleading results. There are two causes of failure when applying this measure to LiNGAM-GC:
  • Second-order statistics cannot eliminate the effect of a latent confounder when we estimate ρ̃. Using second-order statistics, the estimate is ρ̂ = Ê[x̃ỹ] = (ρσx^2 + λ1λ2σc^2)/(σxσy), where σx, σy, and σc denote the standard deviations of x, y, and c. That is, ρ̂ = ρσx/σy + λ1λ2σc^2/(σxσy). We can see that ρ̂ is strongly influenced by the confounding term λ1λ2σc^2; ρ̂ equals the true normalized strength ρσx/σy if and only if λ1 = 0 or λ2 = 0, that is, there is no confounding effect. Due to the confounding effect, this estimation bias may lead to severe problems.

  • |ρ̃| < 1 is a very important working condition for the cumulant-based measure that Hyvärinen and Smith (2013) proposed. However, simply normalizing x and y to unit variance cannot guarantee |ρ̃| < 1. Lemma 1 shows the condition under which |ρ̃| < 1.
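The first failure mode can be demonstrated numerically. In the sketch below (λ1 = λ2 = 1 and ρ = 0.3 are hypothetical strengths of ours), the second-order estimate of the normalized causal strength is visibly biased away from the true value by the confounding term λ1λ2σc²:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400_000

# Hypothetical pairwise LiNGAM-GC data: x -> y with a gaussian confounder.
lam1, lam2, rho = 1.0, 1.0, 0.3
e1, e2 = rng.laplace(size=n), rng.laplace(size=n)
c = rng.normal(size=n)
x = e1 + lam1 * c
y = rho * x + e2 + lam2 * c

xt, yt = x / x.std(), y / y.std()
rho_hat = np.mean(xt * yt)             # second-order (correlation) estimate
rho_true = rho * x.std() / y.std()     # true strength between xt and yt

print(round(rho_hat, 3), round(rho_true, 3))
```

Here the correlation-based estimate is roughly twice the true normalized strength, purely because of the confounder.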

Lemma 1. 

Assume that the observed variables x and y are generated according to equation 4.1 and fulfill σ2^2 + λ2^2σc^2 + 2ρλ1λ2σc^2 > 0, where σ2^2 and σc^2 are the variances of e2 and c, respectively. Denote by x̃ and ỹ the normalized x and y with unit variances; then the causal strength ρ̃ between x̃ and ỹ has the property that |ρ̃| < 1.

Proof. 
Let σx, σy, σ1, σ2, and σc be the standard deviations of x, y, e1, e2, and c, respectively. We have x̃ = x/σx and ỹ = y/σy:
ỹ = ρ(σx/σy)x̃ + (e2 + λ2c)/σy.
Since x = e1 + λ1c, we have
σy^2 = ρ^2σx^2 + σ2^2 + λ2^2σc^2 + 2ρλ1λ2σc^2.
We have ρ̃ = ρσx/σy:
ρ̃^2 = ρ^2σx^2 / (ρ^2σx^2 + σ2^2 + λ2^2σc^2 + 2ρλ1λ2σc^2).
If σ2^2 + λ2^2σc^2 + 2ρλ1λ2σc^2 > 0, it is easy to see that ρ̃^2 < 1 and |ρ̃| < 1.

4.2.  New Cumulant-Based Measure for LiNGAM-GC.

In order to tackle the causal discovery problem for a cause-and-effect pair described by equation 4.1, we propose our new cumulant-based measure and causal inference rule, which we partially discussed in our previous work (Chen & Chan, 2012). The causal inference rule is simple and similar to that proposed in Hyvärinen and Smith (2013). In order to make this letter self-contained, we introduce it in this section.

First, we propose a standardization method to achieve |ρ̃| ≤ 1. This is an important step and working condition for our algorithm. We have shown that simply normalizing x and y to unit variance cannot guarantee |ρ̃| < 1 due to the extra dependence introduced by the latent confounder c. In order to eliminate the effect of the latent confounder c, we need to use higher-order statistics. In this letter, we propose normalization to unit absolute kurtosis, where kurt(x) and kurt(y) are the kurtoses of x and y, respectively. We have
kurt(x) = kurt(e1) + λ1^4 kurt(c) = kurt(e1),
since c is gaussian distributed such that kurt(c) = 0. Similarly, we have
kurt(y) = ρ^4 kurt(e1) + kurt(e2).
We propose the following standardization method:
x̃ = x/|kurt(x)|^{1/4},    ỹ = y/|kurt(y)|^{1/4}.
4.4
Lemma 2 shows that under mild conditions, normalizing x and y to unit absolute kurtosis leads to |ρ̃| < 1:
Lemma 2. 

Assume that the kurtoses of e1 and e2 have the same sign; that is, e1 and e2 are simultaneously supergaussian or subgaussian. Denote by x̃ and ỹ the normalized x and y with unit absolute kurtosis according to equation 4.4. Then the causal strength ρ̃ between x̃ and ỹ has the property that |ρ̃| < 1.

Proof. 
Since x̃ = x/|kurt(x)|^{1/4} and ỹ = y/|kurt(y)|^{1/4}, we have
ρ̃ = ρ |kurt(x)|^{1/4} / |kurt(y)|^{1/4}.
We also have
ρ̃^4 = ρ^4 |kurt(e1)| / |ρ^4 kurt(e1) + kurt(e2)|.
Since e1 and e2 are simultaneously supergaussian or subgaussian, |ρ^4 kurt(e1) + kurt(e2)| = ρ^4 |kurt(e1)| + |kurt(e2)| > ρ^4 |kurt(e1)|, so we have |ρ̃| < 1.
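The contrast between the two standardizations can be illustrated numerically. The coefficients below are deliberately chosen (and entirely hypothetical) so that the condition of lemma 1 fails while both disturbances are supergaussian; unit-variance scaling then yields |ρ̃| > 1, whereas unit-absolute-kurtosis scaling keeps |ρ̃| < 1:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400_000

def kurt(v):
    return np.mean(v**4) - 3 * np.mean(v**2)**2

# Hypothetical coefficients: the strong negative cross term 2*rho*lam1*lam2
# violates lemma 1's condition, yet e1, e2 are both supergaussian (lemma 2).
lam1, lam2, rho = 3.0, 1.0, -1.0
e1, e2 = rng.laplace(size=n), rng.laplace(size=n)
c = rng.normal(size=n)
x = e1 + lam1 * c
y = rho * x + e2 + lam2 * c

rho_var = rho * x.std() / y.std()                       # unit-variance scaling
rho_kur = rho * (abs(kurt(x)) / abs(kurt(y))) ** 0.25   # unit-|kurtosis| scaling

print(abs(rho_var) > 1, abs(rho_kur) < 1)
```

Because the gaussian confounder contributes to the variances but not to the kurtoses, only the kurtosis-based scaling recovers the working condition |ρ̃| < 1 here.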

Regarding the condition in lemma 2 that e1 and e2 are simultaneously supergaussian or subgaussian: this assumption is not restrictive, as it has been reported that the innovation processes of many real signals are almost invariably supergaussian even if the signals themselves are subgaussian (Valpola, Harva, & Karhunen, 2004). We also develop a simple method to detect violations of the simultaneously super- or subgaussian assumption in section 4.4.

For convenience, in the rest of this section, we use the notations x, y, c, λ1, λ2, and ρ, but we assume that x and y are standardized such that |ρ| ≤ 1 (either by normalizing to unit variance or to unit absolute kurtosis, depending on which condition in lemma 1 or 2 holds). Inspired by Hyvärinen and Smith (2013), we give the following cumulant-based measure, which can be used to determine the causal direction for pairwise LiNGAM-GC.

Define the 4th-order cross-cumulants of x and y as follows:
C_xy = cum(x, x, x, y) = E[x^3y] − 3E[xy]E[x^2],
C_yx = cum(x, y, y, y) = E[xy^3] − 3E[xy]E[y^2].
Define the new cumulant-based measure as
R̂ = Ĉ_xy^2 − Ĉ_yx^2,
4.5
where Ĉ_xy and Ĉ_yx are the sample estimates of C_xy and C_yx, with Ê[·] denoting the sample average. We have the following theorem:
Theorem 1. 
If the true causal direction is x → y, we have
R = C_xy^2 − C_yx^2 = ρ^2 kurt(e1)^2 (1 − ρ^4) ≥ 0,
4.6
where kurt(e1) is the kurtosis of e1. If the true causal direction is y → x, we have
R = C_xy^2 − C_yx^2 = −ρ^2 kurt(e2)^2 (1 − ρ^4) ≤ 0.
4.7
Proof. 
Consider the 4th-order cross-cumulant
cum(x, x, x, y) = E[x^3y] − 3E[xy]E[x^2].
4.8
If x → y, the observed x and y fulfill the following generating mechanism:
x = e1 + λ1c,
y = ρx + e2 + λ2c.
According to the properties of cumulants (Hyvärinen & Smith, 2013):
  • For random variables u, v and constants c1, c2, we have
    cum(c1u, c1u, c1u, c2v) = c1^3 c2 cum(u, u, u, v).
  • If any of the random variables u, v, p, and q is independent of the remaining variables, we have
    cum(u, v, p, q) = 0.
  • If u and v are statistically independent and with zero means, we have
    cum(u + v, u + v, u + v, u + v) = cum(u, u, u, u) + cum(v, v, v, v).
We have:
C_xy = ρ cum(x, x, x, x) + cum(x, x, x, e2) + λ2 cum(x, x, x, c) = ρ(kurt(e1) + λ1^4 kurt(c)),
where the second term vanishes by independence and the third term reduces to cumulants of c alone. Since we assume that the latent confounder c is gaussian distributed, we have kurt(c) = 0. Hence we have
C_xy = ρ kurt(e1).
Likewise,
C_yx = cum(x, y, y, y) = ρ^3 kurt(e1).
We have:
R = C_xy^2 − C_yx^2 = ρ^2 kurt(e1)^2 (1 − ρ^4).
Likewise, if y → x, the observed x and y are generated according to
y = e2 + λ2c,
x = ρy + e1 + λ1c.
We have
C_xy = ρ^3 kurt(e2),    C_yx = ρ kurt(e2),    R = −ρ^2 kurt(e2)^2 (1 − ρ^4).
We assume that x and y are standardized such that |ρ| ≤ 1 (by either normalization to unit variance or unit absolute kurtosis, depending on which condition holds). Hence, if x → y, we have
R = ρ^2 kurt(e1)^2 (1 − ρ^4) ≥ 0.
Otherwise, if y → x, we have
R = −ρ^2 kurt(e2)^2 (1 − ρ^4) ≤ 0.
Thus, the cumulant-based measure enables us to determine the causal direction between the observed x and y simply by investigating the sign of R̂.
Algorithm 1 uses the proposed cumulant-based measure to identify the causal direction of a cause-and-effect pair.
Algorithm 1: Identify the causal direction of a cause-and-effect pair.
1. Standardize x and y to unit absolute kurtosis according to equation 4.4.
2. Estimate the cumulant-based measure R̂ in equation 4.5.
3. If R̂ > 0, conclude x → y; otherwise, conclude y → x.
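A NumPy sketch of algorithm 1, tested on data with a hypothetical strength of 0.5 and a gaussian confounder, might read:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 400_000

def kurt(v):
    return np.mean(v**4) - 3 * np.mean(v**2)**2

def cum4(a, b):
    # 4th-order cross-cumulant cum(a, a, a, b) for zero-mean a, b.
    return np.mean(a**3 * b) - 3 * np.mean(a * b) * np.mean(a**2)

def lingam_gc_pair(x, y):
    # Algorithm 1 sketch: standardize to unit absolute kurtosis, then
    # R > 0 suggests x -> y and R < 0 suggests y -> x.
    x = x - x.mean()
    y = y - y.mean()
    x = x / abs(kurt(x)) ** 0.25
    y = y / abs(kurt(y)) ** 0.25
    R = cum4(x, y) ** 2 - cum4(y, x) ** 2
    return "x->y" if R > 0 else "y->x"

# Ground truth x -> y despite a gaussian confounder (hypothetical strengths).
e1, e2 = rng.laplace(size=n), rng.laplace(size=n)
c = rng.normal(size=n)
x = e1 + 0.8 * c
y = 0.5 * x + e2 + 0.7 * c

print(lingam_gc_pair(x, y))
```

Because the gaussian confounder contributes nothing to the fourth-order cumulants, the sign of R is driven by the nongaussian disturbances alone.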

4.3.  Causal Strength.

We have developed a new cumulant-based measure to infer the causal direction for cause-and-effect pairs. We still have another major task: estimating the causal strength ρ between x and y. This problem is easy to solve once we can distinguish the cause from the effect. The basic idea is to use higher-order statistics. In section 4.2, we showed that C_xy = ρ kurt(e1) if the causal direction is x → y. Since kurt(x) = kurt(e1), we have
C_xy = ρ kurt(x).
Now we can estimate the causal strength simply by
ρ̂ = Ĉ_xy / kurt(x).
4.9
Similarly, if the causal direction is y → x, we can estimate ρ by
ρ̂ = Ĉ_yx / kurt(y).
4.10
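The following sketch contrasts the estimator of equation 4.9 with ordinary least squares on confounded data (the strengths are hypothetical choices of ours); the higher-order estimate is unaffected by the gaussian confounder, while the second-order one is biased:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 400_000

def kurt(v):
    return np.mean(v**4) - 3 * np.mean(v**2)**2

def cum4(a, b):
    # 4th-order cross-cumulant cum(a, a, a, b) for zero-mean a, b.
    return np.mean(a**3 * b) - 3 * np.mean(a * b) * np.mean(a**2)

# x -> y with hypothetical strength 0.6 and a gaussian confounder.
rho_true = 0.6
e1, e2 = rng.laplace(size=n), rng.laplace(size=n)
c = rng.normal(size=n)
x = e1 + 1.0 * c
y = rho_true * x + e2 + 1.0 * c

rho_hat = cum4(x, y) / kurt(x)            # eq. 4.9 (higher-order)
rho_ols = np.mean(x * y) / np.mean(x**2)  # least squares, biased by c

print(round(rho_hat, 2), round(rho_ols, 2))
```

On this data the cumulant-based estimate stays near 0.6 while the least-squares coefficient is pulled well above it by the confounder.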

4.4.  Detecting the Violation of the Simultaneously Supergaussian or Subgaussian Assumption.

Lemma 2 shows that if the simultaneously super/subgaussian assumption holds, then by simply normalizing the observations to unit absolute kurtosis, we can guarantee that the normalized causal strength ρ̃ between the standardized x̃ and ỹ fulfills |ρ̃| < 1. This property is of great importance for our new cumulant-based measure proposed in section 4.2. However, it is not possible to tell whether the simultaneously super/subgaussian assumption holds for observed real-world data before we construct the causal relationship. In this section, we propose a simple way to detect violations of the simultaneously super/subgaussian assumption.

Without loss of generality, suppose we have observations x and y generated by equation 4.1. Hence, the ground truth is x → y. If the simultaneously super/subgaussian assumption is violated, again without loss of generality, suppose kurt(e1) > 0 and kurt(e2) < 0. We consider the following two cases:

  • kurt(x) > 0 and kurt(y) < 0.

  • kurt(x) > 0 and kurt(y) > 0.

The first case of violation is easy to detect. We just need to estimate the kurtoses of the observations. If the estimated kurtoses are of opposite signs, we can conclude that the kurtoses of the disturbances e1 and e2 are of opposite signs. The difficulty lies in the second case, where even though kurt(e2) < 0, kurt(y) = ρ^4 kurt(e1) + kurt(e2) > 0. To tackle this difficulty, we propose the following measure of violation (MoV) to detect a violation,
MoV = kurt(ŷ − ρ̂x̂),
4.11
where x̂ and ŷ are the cause and effect estimated by our cumulant-based measure, respectively, and ρ̂ is the estimated causal strength between x̂ and ŷ.
First, if the simultaneously super/subgaussian assumption is not violated, then according to theorem 1, we know that x̂ = x, ŷ = y, and ρ̂ = ρ. We have
MoV = kurt(y − ρx) = kurt(e2 + λ2c) = kurt(e2).
In this case, sign(MoV) > 0, which is the same as sign(kurt(x)).
If the simultaneously super/subgaussian assumption is violated, the cumulant-based measure will consider the converse direction as the causal direction, that is, x̂ = y and ŷ = x, and
ρ̂ = Ĉ_yx / kurt(y).
We have
MoV = kurt(x − ρ̂y) = (1 − ρ̂ρ)^4 kurt(e1) + ρ̂^4 kurt(e2).
As we consider the case where kurt(x) = kurt(e1) > 0 and kurt(e2) < 0 but kurt(y) > 0, it can be shown that sign(MoV) < 0, which is opposite to sign(kurt(x)). This simple measure can likewise be applied to detect violations where kurt(e1) < 0 and kurt(e2) > 0. Generally, we can detect violations of the simultaneously super/subgaussian assumption by algorithm 2.
Algorithm 2: Detect violations of the simultaneously super/subgaussian assumption.
1. Estimate kurt(x) and kurt(y). If they have opposite signs, report a violation.
2. Otherwise, estimate the causal direction and the causal strength ρ̂ using algorithm 1 and equation 4.9 or 4.10.
3. Compute MoV according to equation 4.11. If sign(MoV) ≠ sign(kurt(x̂)), report a violation.
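A compact sketch of algorithm 2 (function and variable names are ours) flags the first case from the observed kurtoses and the second case from the sign of the MoV:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 400_000

def kurt(v):
    return np.mean(v**4) - 3 * np.mean(v**2)**2

def cum4(a, b):
    return np.mean(a**3 * b) - 3 * np.mean(a * b) * np.mean(a**2)

def violated(x, y):
    # Case 1: observed kurtoses of opposite signs.
    if np.sign(kurt(x)) != np.sign(kurt(y)):
        return True
    # Case 2: estimate direction and strength, then compare the sign of
    # MoV (kurtosis of the effect's estimated disturbance, eq. 4.11)
    # with the sign of the estimated cause's kurtosis.
    xs = x / abs(kurt(x)) ** 0.25
    ys = y / abs(kurt(y)) ** 0.25
    R = cum4(xs, ys) ** 2 - cum4(ys, xs) ** 2
    cause, effect = (x, y) if R > 0 else (y, x)
    rho = cum4(cause, effect) / kurt(cause)
    mov = kurt(effect - rho * cause)
    return bool(np.sign(mov) != np.sign(kurt(cause)))

e1 = rng.laplace(size=n)                  # supergaussian disturbance
c = rng.normal(size=n)
x = e1 + 0.8 * c

y_ok = 0.5 * x + rng.laplace(size=n) + 0.7 * c          # assumption holds
y_bad = 0.3 * x + rng.uniform(-3, 3, size=n) + 0.7 * c  # kurt(e2) < 0

print(violated(x, y_ok), violated(x, y_bad))
```

In the second data set the subgaussian disturbance makes the observed kurtoses disagree in sign, so the violation is caught already at step 1.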

5.  Causal Discovery for LiNGAM-GC with More Than Two Variables

5.1.  LiNGAM-GC with More Than Two Variables.

In this section, we focus on causal discovery for LiNGAM-GC with more than two observed variables. Suppose the observations are generated according to the following multivariate LiNGAM-GC model:
x = Ax + Bc + e.
We can use our new cumulant-based measure to estimate the LiNGAM-GC model with more than two observed variables using the Direct-LiNGAM framework (Shimizu et al., 2011; Hyvärinen & Smith, 2013). The basic idea is that we first compute the cumulant-based measure R̂(xi, xj) for all pairs of variables and store it as the (i, j) element of a matrix Mn. We show later in this section that for an exogenous variable xr (a variable with no observed parent in the model), all elements in the rth row of Mn are nonnegative, neglecting random errors due to the finite sample size. Once we have found the exogenous variable xr, we can regress all the other variables on xr and obtain the regression residues. We repeat the same steps by computing the cumulant-based measure matrix Mn−1 for the regression residues and then find the variable that has no observed parent except possibly the variable found in the previous step (Hyvärinen & Smith, 2013). This process is repeated successively until the whole causal ordering is identified.

We need to compute the cumulant-based measure for all pairs of variables to find the exogenous variable. In order to apply the new cumulant-based measure R̂, we first need to standardize the observations x. As in the pairwise case, we can standardize the observations to unit variance or to unit absolute kurtosis according to the scenario. The following two lemmas show under what conditions these two standardization methods lead to |γ̃i| < 1, where γ̃i denotes the total causal strength between the standardized exogenous variable and the standardized x̃i.

For the rest of this section, without loss of generality, we assume that A is already permuted to be strictly lower triangular, so that x1 is the exogenous variable. Formulate the generating equation as
(I − A)x = e + Bc,
where I is an identity matrix. Letting H = (I − A)^{-1} and G = HB, we have
x = He + Gc.
5.1
Consider the following cause-and-effect pair:
x1 = e1 + Σk g1k ck,
xi = γi x1 + Σj>1 hij ej + Σk g̃ik ck,
5.2
where γi = hi1 is the total causal strength from x1 to xi, hij and gik are elements of H and G, and g̃ik = gik − γi g1k. We assume that if x1 is an ancestor of xi, then the total causal strength from x1 to xi is not cancelled due to multiple paths, that is, γi ≠ 0.
Lemma 3. 

Assume that the variances of the disturbances ej and of the latent confounders ck fulfill γi^2 var(x1) < var(xi). Denote by x̃1 and x̃i the normalized x1 and xi with unit variances. Then the causal strength γ̃i between x̃1 and x̃i has the property that |γ̃i| < 1.

Proof. 

The proof of this lemma is similar to that of lemma 1. We skip the proof here.

From lemma 3, we can see that simply normalizing the observed variables to unit variance may not guarantee |γ̃i| < 1, since the required condition is quite complicated and may not hold in reality. In order to eliminate the effects of the confounders ck, we have to exploit higher-order statistics, as we did in the pairwise case: we standardize the observed variables to unit absolute kurtosis. The kurtoses of x1 and xi are
kurt(x1) = kurt(e1),
kurt(xi) = γi^4 kurt(e1) + Σj>1 hij^4 kurt(ej),
since the gaussian confounders contribute nothing to the kurtoses. We standardize x1 and xi as follows: x̃1 = x1/|kurt(x1)|^{1/4} and x̃i = xi/|kurt(xi)|^{1/4}. We have the following lemma:
Lemma 4. 

Assume that the kurtoses of the ej (j = 1, …, n) have the same sign: the ej are simultaneously supergaussian or subgaussian. Denote by x̃1 and x̃i the normalized x1 and xi with unit absolute kurtosis. Then the causal strength γ̃i between x̃1 and x̃i has the property that |γ̃i| < 1.

Proof. 
Since x̃1 = x1/|kurt(x1)|^{1/4} and x̃i = xi/|kurt(xi)|^{1/4}, we have:
γ̃i = γi |kurt(x1)|^{1/4} / |kurt(xi)|^{1/4}.
We have
γ̃i^4 = γi^4 |kurt(e1)| / |γi^4 kurt(e1) + Σj>1 hij^4 kurt(ej)|.
Since all kurt(ej) have the same sign, it is easy to see that |γ̃i| < 1.

Without loss of generality, in the rest of this section, we assume that the exogenous variable is xr, and we further assume that the observed variables are standardized such that the total causal strength γ̃j between the exogenous variable x̃r and any other variable x̃j has the property that |γ̃j| ≤ 1.

The following theorem provides theoretical support for our method in finding the exogenous variable using the pairwise cumulant-based measure:

Theorem 2. 

Assume that the observed data x strictly follow LiNGAM-GC (see equation 3.1) and that the sample size is infinite. We compute the cumulant-based measure for every ordered pair of variables (xi, xj) and store it as the (i, j)th element of a matrix M. Then a variable xr is the exogenous variable if and only if the elements of the rth row of matrix M are all nonnegative.

Proof. 
Assume that xr is exogenous: xr = er + Σk b̃rk ck, where b̃rk is the (r, k)th element of B̃. According to the model assumption and equation 3.1, er is independent of the disturbances of all the other variables. Following a derivation similar to that in the proof of theorem 1, we have

formula

Since we assume that x is standardized (see lemmas 3 and 4), the total causal strength from xr to each xj is at most 1 in absolute value. We have M(r, j) ≥ 0 for all j ≠ r.

Assume instead that xr is not exogenous; that is, xr has at least one observed ancestor. Among these ancestors, there must be an exogenous variable due to the acyclicity assumption of the model. Denote by xe the ancestor that is exogenous. Since xe is an ancestor of xr, our assumption implies that the total causal strength from xe to xr is nonzero, so that M(r, e) < 0. Hence, there must be at least one negative element in the rth row of matrix M.

Once we have identified the exogenous variable xr, we can remove its effect from the other variables xj, j ≠ r, by regressing xr out of each xj, where, as we have shown, the regression coefficient equals the causal strength and can be consistently estimated despite the latent gaussian confounders. We can then repeat the steps on the regression residues of the remaining variables x(r) until the whole causal ordering is identified. The algorithm for LiNGAM-GC with more than two variables is summarized in algorithm 3.
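The root finding-removing scheme can be sketched as follows. This is our own illustration, not the letter's algorithm 3 verbatim: it assumes a pairwise measure of the form ĉ(x,y)[ĉ(x,y) − ĉ(y,x)] built from fourth-order cross-cumulants (which vanish for gaussian confounders), normalizes to unit variance inside the root search, and estimates each causal strength by the cumulant ratio cum(xr, xr, xr, xj)/kurt(xr).

```python
import numpy as np

def cross_cum(x, y):
    # Fourth-order cross-cumulant cum(x, x, x, y) of zero-mean samples:
    # E[x^3 y] - 3 E[x^2] E[x y].
    return np.mean(x**3 * y) - 3.0 * np.mean(x**2) * np.mean(x * y)

def find_root(X):
    # Row r of the measure matrix M is all nonnegative iff x_r is exogenous
    # (theorem 2); with finite samples, take the row with the largest minimum.
    Z = (X - X.mean(1, keepdims=True)) / X.std(1, keepdims=True)
    d = Z.shape[0]
    M = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            if i != j:
                cij, cji = cross_cum(Z[i], Z[j]), cross_cum(Z[j], Z[i])
                M[i, j] = cij * (cij - cji)   # assumed cumulant-based measure
    return int(np.argmax(M.min(axis=1)))

def causal_order(X):
    # Iterative root finding-removing: locate the exogenous variable,
    # regress its estimated effect out of the rest, and repeat.
    X = X - X.mean(1, keepdims=True)
    data = {i: X[i].copy() for i in range(X.shape[0])}
    remaining, order = list(data), []
    while remaining:
        r = remaining[find_root(np.vstack([data[i] for i in remaining]))]
        xr = data[r]
        k4 = cross_cum(xr, xr)                # fourth cumulant of the root
        for j in remaining:
            if j != r:
                # Cumulant-ratio strength estimate: immune to additive
                # gaussian confounders, whose fourth cumulants vanish.
                data[j] = data[j] - (cross_cum(xr, data[j]) / k4) * xr
        order.append(r)
        remaining.remove(r)
    return order
```

On a synthetic chain with supergaussian disturbances and a gaussian confounder, the recovered order should match the generating order given a sufficiently large sample.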
formula

5.2.  Sparse Causal Network.

In the real world, a causal network is likely to be sparse: many elements of the causal strength matrix A are zeros. Now consider our proposed cumulant-based measure. Theorem 1 shows that if there is no direct edge between x and y (that is, the causal strength between them is zero), then the measure is exactly zero, provided the sample size is infinite. However, since we are given a finite sample, the empirical cumulant-based measure computed from the observations is not exactly zero but a number with a small absolute value.

Due to the finite sample size, the estimated causal network will contain many redundant edges with weak causal strengths. One possible solution is to set a threshold and discard the edge between x and y whenever the absolute value of the measure falls below it. The problem is that without prior knowledge of the causal strengths, it is difficult to choose an appropriate threshold. A more sophisticated approach is to develop a hypothesis test of whether the measure is significantly different from zero. We adopt this second approach and consider the following null hypothesis:

H0 5.1. There is no direct edge between the observed x and y.

The basic idea is that if zero falls outside the 95% or 98% confidence interval of the distribution of the measure, we reject the null hypothesis; otherwise, we fail to reject it. However, it is very difficult to derive the asymptotic distribution of the measure under the null hypothesis. In this letter, we instead build an empirical distribution of the measure using bootstrapping. The algorithm for pruning redundant edges of the obtained causal network is summarized in algorithm 4.
formula

6.  Experiment

In this section, we conduct a series of experiments to show the effectiveness of our proposed algorithms.

6.1.  Synthetic Data: Cause-and-Effect Pair.

In this experiment, we compare our proposed cumulant-based measure under two normalization methods, unit variance (LiNGAM-GC-UV) and unit absolute kurtosis (LiNGAM-GC-UK), with ICA-LiNGAM (http://www.cs.helsinki.fi/group/neuroinf/lingam/), Direct-LiNGAM (http://www.ar.sanken.osaka-u.ac.jp/~inazumi/dlingam.html), and the cumulant-based measure (C-M) of Hyvärinen and Smith (2013) in identifying the causal directions of cause-and-effect pairs synthesized strictly according to the LiNGAM-GC model. Note that ICA-LiNGAM, Direct-LiNGAM, and C-M were proposed for LiNGAM under the assumption of no latent confounder. The purpose of this experiment is to show that latent confounders can be problematic if they are not taken into account. The observed cause-and-effect pairs for this experiment are generated according to the following equations:
formula
6.1
In this experiment, e1 and e2 are supergaussian and are generated by ei=sign(ni)|ni|2, where the ni are standard gaussian distributed. We further normalize the ei to unit variance. In the first part of this experiment, c strictly follows a gaussian distribution with zero mean. In order to investigate how the causal strength and the confounder's standard deviation affect the accuracies of the different algorithms, we conduct a series of experiments over a grid of their values. Note that this experimental setting fulfills the conditions of lemmas 1 and 2; hence, both normalization methods are applicable. For each parameter setting, we randomly generate 100 data sets with a sample size of 5000 and randomly swap x and y. We use the different algorithms to find the causal directions of these cause-and-effect pairs. The percentages of correctly identified data sets for the different algorithms are shown in Figures 3 and 4.
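A generator for such pairs might look as follows; the coefficient names and default values (rho, b1, b2, sigma_c) are illustrative placeholders, not the letter's exact settings.

```python
import numpy as np

def gen_pair(n, rho=0.5, b1=0.3, b2=0.3, sigma_c=1.0, seed=0):
    # Cause-and-effect pair with a latent gaussian confounder c
    # (coefficient names are our own):
    #   x = e1 + b1*c,   y = rho*x + e2 + b2*c
    rng = np.random.default_rng(seed)
    n1, n2 = rng.standard_normal(n), rng.standard_normal(n)
    e1 = np.sign(n1) * np.abs(n1) ** 2      # supergaussian disturbances
    e2 = np.sign(n2) * np.abs(n2) ** 2
    e1, e2 = e1 / e1.std(), e2 / e2.std()   # normalized to unit variance
    c = sigma_c * rng.standard_normal(n)    # latent gaussian confounder
    x = e1 + b1 * c
    y = rho * x + e2 + b2 * c
    return x, y
```

With these defaults, Var(x) = 1 + b1^2 sigma_c^2 and x inherits the strong positive excess kurtosis of e1.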
Figure 3:

Results of cause-and-effect pairs.

Figure 4:

Results of cause-and-effect pairs.

The experimental results suggest that LiNGAM-GC-UV and LiNGAM-GC-UK have the best performance across the different scenarios. Although wrong decisions still occur when the causal strength is small, this is due to the finite sample size; if the sample size is large enough, both LiNGAM-GC-UV and LiNGAM-GC-UK are expected to achieve perfect performance. From Figures 3 and 4, we find that C-M performs quite well when the causal strength is positive but poorly when it is negative. This observation confirms our theoretical analysis of the C-M algorithm in the scenario of latent gaussian confounders in section 4. The performance of ICA-LiNGAM and Direct-LiNGAM depends on the variance of the latent confounder: when the effect of the latent confounder is strong enough, the performance of both algorithms degenerates dramatically. Compared to ICA-LiNGAM, Direct-LiNGAM is more sensitive to the latent confounder. This is possibly because Direct-LiNGAM tests the independence of the potential exogenous variable and the regression residue; due to the latent confounder, these are dependent if only second-order statistics are employed for the regression. ICA-based methods, in contrast, search for the direction that maximizes the independence of the separated components during the iterative search.

In the previous experiment, the latent confounder c strictly follows a gaussian distribution. In order to investigate the robustness of our algorithms against the gaussian assumption, we conduct the following experiment, in which c is not strictly gaussian but the sum of a number of independent factors fk, as illustrated in Figure 1. The disturbances ei are generated by ei=sign(ni)|ni|2, as in the last experiment. The factors fk are mutually independent and generated by fk=sign(nk)|nk|r, where nk is a gaussian random variable with zero mean and standard deviation uniformly randomly selected from [1, 1.4], and r is uniformly randomly selected from [1.4, 1.6]. We randomly generate 100 data sets according to equation 6.1 with a sample size of 5000 and randomly swap x and y. We use our algorithms LiNGAM-GC-UV and LiNGAM-GC-UK to find the causal directions of these cause-and-effect pairs. The percentages of correctly identified data sets for these two algorithms and the kurtosis of the latent confounder are shown in Figure 5. The results show that as the number of independent factors fk increases, the kurtosis of the latent confounder c decreases and c becomes more and more gaussian, which confirms the central limit theorem; meanwhile, the performance of our algorithms improves significantly. Interestingly, our cumulant-based measure is robust against the gaussian assumption on the latent confounder c: even when the number of factors fk increases to 20, the kurtosis of c is still 0.1528, so c is supergaussian, yet the performance of our algorithms is close to 100%. From this experiment, it appears that although all the theory developed in this letter rests on the gaussian assumption on the latent confounders, our methods are robust against violations of this assumption.

Figure 5:

Robustness against gaussian assumption.

6.2.  Synthetic Data: Detecting Violations of the Simultaneously Super/Subgaussian Assumption.

In order to show the effectiveness of algorithm 2 in detecting violations of the simultaneously super/subgaussian assumption, we generate the observations x and y according to equation 6.1. We consider two scenarios. In the first, we fix e1=sign(n1)|n1|2, so that e1 is supergaussian, and set e2=sign(n2)|n2|r, changing r from 0.2 to 1.8 with a step size of 0.2. In the second scenario, we fix e1=sign(n1)|n1|0.2, so that e1 is subgaussian, and again set e2=sign(n2)|n2|r, changing r from 0.2 to 1.8 with a step size of 0.2. In both scenarios, we generate the observations x and y with sample sizes of 2000, 4000, …, 10,000. Once the observations are generated, we apply algorithm 2 to detect violations of the simultaneously super/subgaussian assumption. We run the experiment 1000 times and count the percentage of detected violations. The experimental results are shown in Figures 6 and 7.

Figure 6:

Detection of violation (e1 supergaussian).

Figure 7:

Detection of violation (e1 subgaussian).

As e2=sign(n2)|n2|r, e2 is subgaussian when 0<r<1, gaussian when r=1, and supergaussian when r>1. From Figure 6, where e1 is supergaussian, we can see that when 0<r<1, and especially when r is small so that e2 is clearly subgaussian, algorithm 2 detects almost all the violations. When r>1, so that e2 is supergaussian, the percentage of detected violations decreases rapidly to zero, since for r>1 the simultaneously super/subgaussian assumption holds. We can also see that with more observed samples, the performance of our proposed algorithm gets closer to the ideal performance curve (a step function whose value is 1 when r<1 and 0 when r>1).
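The effect of the exponent r on the gaussianity of e2 = sign(n2)|n2|^r can be checked numerically with a short sketch of our own:

```python
import numpy as np

def excess_kurt(v):
    # Normalized excess kurtosis E[v^4]/E[v^2]^2 - 3 of a centered sample.
    v = v - v.mean()
    return np.mean(v**4) / np.mean(v**2) ** 2 - 3.0

rng = np.random.default_rng(1)
n = rng.standard_normal(200_000)
for r in (0.2, 0.6, 1.0, 1.4, 1.8):
    e2 = np.sign(n) * np.abs(n) ** r
    # r < 1: negative excess kurtosis (subgaussian); r = 1: ~0 (gaussian);
    # r > 1: positive excess kurtosis (supergaussian).
    print(r, excess_kurt(e2))
```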

A similar conclusion can be drawn from Figure 7.

6.3.  Real-World Data: Cause-and-Effect Pairs.

In order to show the applicability of our proposed methods to real-world data, we compare the performances of the different methods on the real-world cause-and-effect pairs (http://webdav.tuebingen.mpg.de/cause-effect/). We use 42 pairs from this data set.3 As the computational time of Direct-LiNGAM increases dramatically with sample size, we use at most 1000 samples for each cause-and-effect pair. The performances of the different algorithms are given in Table 1.

Table 1:
Percentage of Recovering the True Causal Direction in 42 Real-World Cause-and-Effect Pairs.

Algorithm      LiNGAM-GC-UV  LiNGAM-GC-UK  C-M      ICA-LiNGAM  Direct-LiNGAM  IGCI
Accuracy       83.33%        82.35%        82.05%   76.16%      64.29%         57.14%
Decision rate  100%          80.95%        92.86%   100%        100%           100%

We use algorithm 2 to detect violations of the simultaneously super/subgaussian assumption for LiNGAM-GC-UK and find that 80.95% of the cause-and-effect pairs pass the detection. For C-M, we select the pairs where the kurtoses of x and y have the same sign. For LiNGAM-GC-UV, LiNGAM-ICA, DirectLiNGAM, and IGCI, we do not have a method for detecting violations of their assumptions; hence, we use all the cause-and-effect pairs. Table 1 shows that for causal discovery on real-world data, our proposed LiNGAM-GC-UV and C-M have the best performance, followed by LiNGAM-GC-UK. LiNGAM-GC-UV performs better than LiNGAM-GC-UK, possibly because LiNGAM-GC-UK normalizes to unit absolute kurtosis, which requires a larger sample size to obtain consistent estimates. LiNGAM-ICA has an accuracy of 76.16%, followed by DirectLiNGAM and IGCI (Daniušis et al., 2010). From this experiment, we learn that our proposed cumulant-based method can be used for causal discovery in real-world problems and that it is more accurate and reliable than methods that do not take latent confounders into consideration, such as LiNGAM-ICA, Direct-LiNGAM, and IGCI.

6.4.  Synthetic Data: Causal Network Discovery.

In this experiment, we apply three algorithms, LiNGAM-GC-UK, LiNGAM-ICA, and C-M, to discover the causal ordering of a LiNGAM-GC network with five variables and five latent confounders. The observations are generated according to the following settings:
formula
The matrix A is a strictly lower triangular matrix with elements aij uniformly distributed in [0.1, 0.3] or [−0.3, −0.1]. B is a matrix with elements bij uniformly distributed in [0.1, 0.3]. The disturbances e are generated following the same method as in the previous experiments and normalized to unit variance. The latent confounders c are independent standard gaussian. We randomly permute the generated observations and use the three algorithms to discover the causal ordering. In order to investigate the effect of sample size on the performance of the three algorithms, we conduct the experiment with sample sizes ranging from 500 to 5000. In each experiment, we run 300 independent trials and count the percentage of data sets in which (1) the first variable of the causal ordering is identified, (2) the first two variables of the causal ordering are identified, and (3) the whole causal ordering is identified. The results are summarized in Tables 2 to 4.
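A generator following these settings might be sketched as follows (our own illustration; the seed handling and helper names are assumptions):

```python
import numpy as np

def gen_network(n, d=5, k=5, seed=0):
    # x = Ax + Bc + e  =>  x = (I - A)^{-1}(e + Bc), with A strictly lower
    # triangular (|a_ij| uniform in [0.1, 0.3], random sign), b_ij uniform
    # in [0.1, 0.3], unit-variance supergaussian disturbances, and
    # standard gaussian confounders.
    rng = np.random.default_rng(seed)
    A = np.tril(rng.uniform(0.1, 0.3, (d, d)) * rng.choice([-1.0, 1.0], (d, d)), -1)
    B = rng.uniform(0.1, 0.3, (d, k))
    g = rng.standard_normal((d, n))
    e = np.sign(g) * np.abs(g) ** 2
    e /= e.std(axis=1, keepdims=True)        # normalize disturbances
    c = rng.standard_normal((k, n))          # latent gaussian confounders
    x = np.linalg.solve(np.eye(d) - A, e + B @ c)
    return x, A, B
```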
Table 2:
Percentage of the First Variable Found in 300 Synthetic Data Sets.

Sample Size   LiNGAM-GC-UK   C-M      LiNGAM-ICA
500           72.67%         60.67%   85.00%
1000          81.67%         62.00%   90.33%
1500          86.67%         59.00%   89.00%
2000          87.00%         64.33%   88.33%
2500          93.67%         65.00%   89.00%
3000          93.33%         64.67%   90.00%
3500          94.00%         63.00%   91.33%
4000          93.67%         62.67%   89.33%
4500          95.67%         65.00%   90.33%
5000          96.33%         62.67%   93.33%
Table 3:
Percentage of the First Two Variables Found in 300 Synthetic Data Sets.

Sample Size   LiNGAM-GC-UK   C-M      LiNGAM-ICA
500           52.00%         40.67%   75.00%
1000          70.00%         43.33%   81.67%
1500          79.00%         42.00%   81.00%
2000          79.67%         48.67%   79.33%
2500          88.00%         42.67%   84.67%
3000          88.67%         49.67%   82.33%
3500          88.67%         49.00%   82.33%
4000          90.33%         49.00%   83.00%
4500          93.00%         52.33%   84.67%
5000          93.00%         48.33%   86.33%

From Tables 2, 3, and 4, we can see that LiNGAM-ICA has the best performance for small sample sizes, followed by LiNGAM-GC-UK and then C-M. The reason is that the negentropy-based ICA algorithm is able to find the rotation that maximizes the independence of the separated components, while LiNGAM-GC-UK and C-M are based on fourth-order cumulants, whose sample averages deviate far from their true values when the sample size is small. Given a sufficiently large number of samples, however, LiNGAM-GC-UK has the best performance among the three algorithms. From Table 2, we can see that the performances of LiNGAM-GC-UK and LiNGAM-ICA are close in finding the first variable of the causal network, but in finding the first two variables, and even more so the whole causal ordering, LiNGAM-GC-UK performs much better. From Table 4, we can see that LiNGAM-GC-UK achieves an accuracy of 91.67% at a sample size of 5000, which is much higher than the 72.67% of LiNGAM-ICA and the 41.00% of C-M. We also find that the C-M algorithm is more sensitive to the latent confounders than LiNGAM-ICA.

Table 4:
Percentage of the Entire Causal Ordering Found in 300 Synthetic Data Sets.

Sample Size   LiNGAM-GC-UK   C-M      LiNGAM-ICA
500           37.00%         25.67%   57.00%
1000          57.67%         34.00%   69.67%
1500          68.33%         31.00%   67.67%
2000          73.67%         40.00%   69.33%
2500          83.00%         33.00%   71.33%
3000          84.33%         41.67%   71.33%
3500          86.33%         39.33%   72.67%
4000          88.67%         40.67%   72.67%
4500          90.33%         44.00%   74.33%
5000          91.67%         41.00%   72.67%

6.5.  Synthetic Data: Sparse Causal Network.

In the preceding experiments, we compared the performance of LiNGAM-GC-UK with that of LiNGAM-ICA and C-M, which assume that there is no latent confounder of the observed variables; those experiments show how problematic it is when latent confounders are ignored. In the following experiment, we compare our algorithm LiNGAM-GC-UK with pairwise lvLiNGAM (Entner & Hoyer, 2011), which takes latent confounders into consideration. Pairwise lvLiNGAM estimates the causal structure of the unconfounded subset and hence discovers a partial causal structure of the observed variables. The experimental setting is as follows. The disturbances ei are generated as in the previous synthetic experiments, and the confounders cj are standard gaussian. The causal strength matrix A is strictly lower triangular and sparse, with about 80% of its entries being zeros. The factor loading matrix B of the confounders on the observed variables is also sparse; the number of nonzero entries in each column of B is constrained to be at least 2 and at most one-third of the number of observed variables. The sample size is 2000. Once the data are generated, we permute them into a random order and use LiNGAM-GC-UK and pairwise lvLiNGAM to discover the causal structure. For LiNGAM-GC-UK, we use algorithm 4 to remove redundant edges, with the following parameters for the pruning algorithm: the number of bootstraps is 500, the bootstrap length is half the sample size, and we use a 98% confidence interval. For pairwise lvLiNGAM, we use the default settings. The results, averaged over 100 independent trials, are summarized in Table 5.

Table 5:
Sparse Causal Network Discovery: LiNGAM-GC-UK versus Pairwise lvLiNGAM.

                                     CORR     TPR      FPR
7 observed, 3 latent:
  LiNGAM-GC-UK                       0.9341   0.8243   0.0337
  pairwise lvLiNGAM                  0.9815   0.3249   0
11 observed, 4 latent:
  LiNGAM-GC-UK                       0.9043   0.7532   0.0276
  pairwise lvLiNGAM                  0.9940   0.1646   0
15 observed, 5 latent:
  LiNGAM-GC-UK                       0.9017   0.7775   0.0250
  pairwise lvLiNGAM                  0.9850   0.0870   0
20 observed, 8 latent:
  LiNGAM-GC-UK                       0.8710   0.7540   0.0168
  pairwise lvLiNGAM                  0.9946   0.0280   0

In Table 5, CORR is the correlation between the estimated causal strengths and the true ones. TPR, the true positive rate, is the probability that an edge that really exists is discovered; FPR, the false positive rate, is the probability that a faulty edge that does not exist is reported. From Table 5, we can see that both algorithms perform very well in correlation: the correlation between the estimated causal strengths and the true ones is around 0.9 or higher, with pairwise lvLiNGAM performing slightly better than LiNGAM-GC-UK. This is reasonable, because pairwise lvLiNGAM discovers only a partial causal structure while LiNGAM-GC-UK discovers the whole structure, and since LiNGAM-GC adopts the DirectLiNGAM framework, errors unavoidably accumulate. Nevertheless, LiNGAM-GC-UK discovers most of the existing edges, while pairwise lvLiNGAM discovers only a very small portion of them, especially as the number of latent confounders increases. From the table, we can also see that pairwise lvLiNGAM never creates a redundant edge and that the probability of LiNGAM-GC-UK creating redundant edges is very small given a sufficiently large sample size.
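For concreteness, TPR and FPR over an estimated causal strength matrix can be computed as follows (our sketch; the tolerance handling is an assumption):

```python
import numpy as np

def edge_rates(A_true, A_est, tol=1e-12):
    # TPR: fraction of truly existing edges that are discovered;
    # FPR: fraction of absent (off-diagonal) edges wrongly reported.
    true_e = np.abs(A_true) > tol
    est_e = np.abs(A_est) > tol
    off = ~np.eye(A_true.shape[0], dtype=bool)
    tpr = (true_e & est_e).sum() / max(true_e.sum(), 1)
    fpr = (est_e & ~true_e & off).sum() / max((~true_e & off).sum(), 1)
    return tpr, fpr
```

For a 3-variable network with true edges (1,0) and (2,0) and an estimate containing (1,0) and (2,1), this gives TPR = 1/2 and FPR = 1/4.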

6.6.  Real-World Data: Causal Discovery for the Hong Kong Stock Market.

In this section, we aim to discover the causal relationships among eight stocks selected from the Hong Kong stock market. (The data set was downloaded from Yahoo Hong Kong Finance: http://hk.finance.yahoo.com/stock/index.php.) The selected stocks are constituents of the Hang Seng Index (HSI):

  • X1: Cheung Kong (0001.HK)
  • X2: CLP Holdings (0002.HK)
  • X3: HSBC Holdings (0005.HK)
  • X4: Hang Seng Bank (0011.HK)
  • X5: Henderson Land (0012.HK)
  • X6: Hutchison (0013.HK)
  • X7: SHK PPT (0016.HK)
  • X8: Bank of E Asia (0023.HK)

Besides the mutual influences among the stocks, stock prices are also affected by a number of external factors, such as interest rates, oil prices, and gold prices. It is therefore unlikely that the causal sufficiency assumption holds for the stock market. To verify this hypothesis, we first apply pairwise lvLiNGAM (Entner & Hoyer, 2011) to the selected data set and observe that it returns an empty unconfounded subset, meaning that the causal sufficiency assumption does not hold and, more important, that all the selected stocks are affected by confounders.

In this experiment, we use our proposed LiNGAM-GC algorithms to discover the causal relationships among the daily returns of the selected eight stocks from March 21, 2008, to April 3, 2012. For the days when stock prices are missing, we simply use linear interpolation to fill in the prices. Let Xit be the adjusted closing price of the ith stock on day t; the daily return Rit is then computed from the adjusted closing prices of consecutive days. We assume that the stock returns can be modeled by LiNGAM-GC,
formula
We are interested in discovering the causal structure matrix A in order to learn how the stocks influence one another. We apply our multivariate LiNGAM-GC (algorithm 3, with the two normalization methods UV and UK) and the pruning algorithm (algorithm 4, with 3000 bootstraps and a bootstrap sample size of 900) to the returns x(t). Note that we also run algorithm 2 within LiNGAM-GC-UK and observe that this data set passes the detection, suggesting that simultaneous supergaussianity is common in finance. We also apply C-M, LiNGAM-ICA, and DirectLiNGAM to the same data set.
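A sketch of the return preprocessing follows (ours, not the letter's code; it assumes log returns, one common convention, since the letter's exact return formula is not reproduced here):

```python
import numpy as np

def daily_returns(prices):
    # prices: (n_stocks, n_days) adjusted closing prices X_it; NaNs mark
    # missing days and are filled by linear interpolation, as in the text.
    p = np.array(prices, dtype=float)
    for row in p:
        missing = np.isnan(row)
        if missing.any():
            row[missing] = np.interp(np.flatnonzero(missing),
                                     np.flatnonzero(~missing), row[~missing])
    # Log returns (an assumption: the letter computes the daily return
    # R_it from X_it and X_i,t-1).
    return np.log(p[:, 1:] / p[:, :-1])
```

Because log returns telescope, the returns over a window sum to the log price ratio of its endpoints regardless of how interior gaps were interpolated.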

In order to compare the different models obtained by these methods, we use the following two performance indices: the mean square error (MSE) and the mean square fourth-order cross-cumulant (MSCC). Let r = (I − A)x, where A is the causal strength matrix discovered by the compared method.

  • MSE measures the magnitude of the residue r, that is, how well the model fits the data in the second-order sense.
  • MSCC is the mean square fourth-order cross-cumulant of the normalized residue, where the residue is normalized to unit standard deviation.

MSCC is used to measure the high-order dependencies among the residue r. Since our model allows the presence of latent gaussian confounders, we choose the fourth-order cross-cumulant, which is immune to latent gaussian confounders. If the model fit the data set ideally, MSCC would be 0; in other words, the smaller the MSCC, the better the model A.
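One plausible reading of the MSCC index can be sketched as follows, using the zero-mean identity cum(ri, ri, rj, rj) = E[ri^2 rj^2] − E[ri^2]E[rj^2] − 2E[ri rj]^2 over all pairs i < j of the normalized residues (our illustration; the letter's exact index set may differ):

```python
import numpy as np

def mscc(r):
    # Mean square fourth-order cross-cumulant of the residues r
    # (n_vars, n_samples), each normalized to unit standard deviation.
    # All fourth-order cumulants of gaussian variables vanish, so shared
    # gaussian confounding does not inflate this index.
    r = r - r.mean(axis=1, keepdims=True)
    r = r / r.std(axis=1, keepdims=True)
    d = r.shape[0]
    total, cnt = 0.0, 0
    for i in range(d):
        for j in range(i + 1, d):
            c = (np.mean(r[i] ** 2 * r[j] ** 2)
                 - np.mean(r[i] ** 2) * np.mean(r[j] ** 2)
                 - 2.0 * np.mean(r[i] * r[j]) ** 2)
            total += c * c
            cnt += 1
    return total / cnt
```

Independent residues give an MSCC near zero, while nongaussian dependence among the residues drives it up.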

Table 6 shows the performance of the compared algorithms. The MSEs of the compared algorithms are quite close, but LiNGAM-GC-UK and LiNGAM-GC-UV have the smallest MSCC, which means that the residue r obtained by these two algorithms is more independent; that is, our proposed model better describes this real-world data set.

Table 6:
Model Comparison.

Algorithm       MSE     MSCC
LiNGAM-GC-UK            6.1156
LiNGAM-GC-UV            4.6454
C-M                     11.1184
LiNGAM-ICA              14.9365
DirectLiNGAM            8.4913

The causal networks discovered by LiNGAM-GC-UK and LiNGAM-GC-UV are shown in Figures 8 and 9.

Figure 8:

Causal network of the Hong Kong stock market: LiNGAM-GC-UK.

Figure 9:

Causal network of the Hong Kong stock market: LiNGAM-GC-UV.

From the networks discovered by LiNGAM-GC, we have the following interesting findings, similar to those of Zhang and Chan (2008):

  • Stocks belonging to the same subindex tend to be causally related. For example, in the causal networks discovered by LiNGAM-GC-UK and LiNGAM-GC-UV, x3 (HSBC Holdings), x4 (Hang Seng Bank), and x8 (Bank of E Asia) belong to the Hang Seng Finance subindex. x2 (CLP Holdings) is the only selected stock that belongs to the Utility subindex. x5 (Henderson Land), x7 (SHK PPT), and x1 (Cheung Kong) belong to the Property subindex, and x6 (Hutchison) belongs to the Commerce and Industry subindex.

  • Bank stocks are the causes in the causal network. In Figures 8 and 9, HSBC Holdings and Hang Seng Bank are the roots of the causal network, and Bank of E Asia is affected only by another bank stock, Hang Seng Bank. HSBC Holdings, Hang Seng Bank, and Bank of E Asia are in fact the three largest banks in Hong Kong.

  • If there is a holding relationship between two companies, there is possibly a causal relationship between their stocks, with the holding company possibly being the effect. For example, Cheung Kong holds 49.97% of Hutchison, and we find an edge from Hutchison to Cheung Kong in our causal network.

  • Real estate stocks are more likely to be affected by other stocks; Cheung Kong, Henderson Land, and SHK PPT, for example, are the three largest real estate companies in Hong Kong.

Note that for real-world problems it is usually very difficult to obtain the ground truth, but the causal networks discovered by our method yield findings that are consistent with our empirical knowledge; further interpretation is left for future work. In contrast, the above observations are not significant in the networks obtained by C-M, LiNGAM-ICA, and Direct-LiNGAM.

7.  Conclusion

In this letter, we investigated a causal model, the linear nongaussian acyclic model in the presence of latent gaussian confounders (LiNGAM-GC), which is a special case of lvLiNGAM. By allowing the presence of latent confounders, this model is expected to give more accurate descriptions of real-world phenomena; to our knowledge, it is one of the few models that take latent confounders into consideration. To tackle causal discovery for this model, we proposed and rigorously analyzed a new cumulant-based measure to infer the causal direction of cause-and-effect pairs under certain conditions, and we developed a simple and efficient method to detect violations of those conditions. We also extended our previous work to multivariate causal network discovery and proposed a pruning method to estimate sparse causal networks. A series of experiments shows the effectiveness of our model and cumulant-based measure. Future work will focus on relaxing the assumptions made here and on developing robust estimators of higher-order cumulants for small sample sizes.

Acknowledgments

The work described in this letter was partially supported by a grant from the Research Grants Council of the Hong Kong Special Administration Region, China.

References

Attias, H. (1999). Independent factor analysis. Neural Computation, 11(4), 803–851.

Cha, S., & Chan, L. (2000). Applying independent component analysis to factor model in finance. In Proceedings of the Intelligent Data Engineering and Automated Learning IDEAL 2000: Data Mining, Financial Engineering, and Intelligent Agents (pp. 161–173). Berlin: Springer.

Chan, L. (2002). The prediction performance of independent factor models. In Proceedings of the 2002 International Joint Conference on Neural Networks (Vol. 3, pp. 2515–2520). Piscataway, NJ: IEEE.

Chan, L., & Cha, S. (2001). Selection of independent factor model in finance. In Proceedings of the 3rd International Conference on Independent Component Analysis and Blind Signal Separation. Berlin: Springer-Verlag.

Chen, Z., & Chan, L. (2012). Causal discovery for linear non-gaussian acyclic models in the presence of latent gaussian confounders. In Latent Variable Analysis and Signal Separation (pp. 17–24). New York: Springer-Verlag.

Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36(3), 287–314.

Daniušis, P., Janzing, D., Mooij, J., Zscheischler, J., Steudel, B., Zhang, K., & Schölkopf, B. (2010). Inferring deterministic causal relations. In Proceedings of the Twenty-Sixth Annual Conference on Uncertainty in Artificial Intelligence (UAI-10) (pp. 143–150). Corvallis, OR: AUAI Press.

Entner, D., & Hoyer, P. (2011). Discovering unconfounded causal relationships using linear non-gaussian models. In New Frontiers in Artificial Intelligence (pp. 181–195). New York: Springer-Verlag.

Henao, R., & Winther, O. (2011). Sparse linear identifiable multivariate modeling. Journal of Machine Learning Research, 12, 863–905.

Hoyer, P., Shimizu, S., Kerminen, A., & Palviainen, M. (2008). Estimation of causal effects using linear non-gaussian causal models with hidden variables. International Journal of Approximate Reasoning, 49(2), 362–378.

Hyvärinen, A. (1999). Gaussian moments for noisy independent component analysis. IEEE Signal Processing Letters, 6(6), 145–147.

Hyvärinen, A., & Oja, E. (2000). Independent component analysis: Algorithms and applications. Neural Networks, 13(4–5), 411–430.

Hyvärinen, A., & Smith, S. M. (2013). Pairwise likelihood ratios for estimation of non-gaussian structural equation models. Journal of Machine Learning Research, 14, 111–152.

Janzing, D., Peters, J., Mooij, J., & Schölkopf, B. (2009). Identifying confounders using additive noise models. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (pp. 249–257). Corvallis, OR: AUAI Press.

Kawahara, Y., Bollen, K., Shimizu, S., & Washio, T. (2010). GroupLiNGAM: Linear non-gaussian acyclic models for sets of variables. arXiv preprint arXiv:1006.5041.

Kawahara, Y., Shimizu, S., & Washio, T. (2011). Analyzing relationships among ARMA processes based on non-gaussianity of external influences. Neurocomputing, 74, 2212–2221.

Moneta, A., Entner, D., Hoyer, P., & Coad, A. (2010). Causal inference by independent component analysis with applications to micro- and macroeconomic data. Jena Economic Research Papers, 2010-031.

Pearl, J. (2000). Causality: Models, reasoning, and inference. Cambridge: Cambridge University Press.

Shimizu, S., Hoyer, P., Hyvärinen, A., & Kerminen, A. (2006). A linear non-gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7, 2003–2030.

Shimizu, S., Inazumi, T., Sogawa, Y., Hyvärinen, A., Kawahara, Y., Washio, T., et al. (2011). DirectLiNGAM: A direct method for learning a linear non-gaussian structural equation model. Journal of Machine Learning Research, 12, 1225–1248.

Sogawa, Y., Shimizu, S., Shimamura, T., Hyvärinen, A., Washio, T., & Imoto, S. (2011). Estimating exogenous variables in data with more variables than observations. Neural Networks, 24, 875–880.

Spirtes, P., Glymour, C., & Scheines, R. (2000). Causation, prediction, and search. Cambridge, MA: MIT Press.

Valpola, H., Harva, M., & Karhunen, J. (2004). Hierarchical models of variance sources. Signal Processing, 84(2), 267–282.

Zhang, K., & Chan, L. (2006). Extensions of ICA for causality discovery in the Hong Kong stock market. In Lecture Notes in Computer Science: Vol. 4234. Neural Information Processing (pp. 400–409). New York: Springer-Verlag.

Zhang, K., & Chan, L. (2008). Minimal nonlinear distortion principle for nonlinear independent component analysis. Journal of Machine Learning Research, 9, 2455–2487.

Zhang, K., & Chan, L. (2009). Efficient factor GARCH models and factor-DCC models. Quantitative Finance, 9(1), 71–91.

Notes

1

Some preliminary results were presented in Chen and Chan (2012).

2

We thank an anonymous reviewer for pointing out this problem.

3

We select data sets where the relationships are close to linear. We apply a simple preprocessing to pair 75 to make the relation more linear and use the two processed pairs instead of the original one.