## Abstract

LiNGAM has been successfully applied to some real-world causal discovery problems. Nevertheless, it assumes causal sufficiency; that is, there is no latent confounder of the observations, which may be unrealistic for real-world problems. Taking latent confounders into consideration will improve the reliability and accuracy of estimates of the real causal structures. In this letter, we investigate a model called linear nongaussian acyclic models in the presence of latent gaussian confounders (LiNGAM-GC), which can be seen as a special case of lvLiNGAM. This model includes latent confounders, which are assumed to be independent gaussian distributed and statistically independent of the disturbances. To tackle the causal discovery problem for this model, we first propose a pairwise cumulant-based measure of causal direction for cause-effect pairs. We prove that in spite of the presence of latent gaussian confounders, the causal direction of an observed cause-effect pair can be identified under the mild condition that the disturbances are simultaneously supergaussian or subgaussian, and we propose a simple and efficient method to detect violations of this condition. We then extend our work to multivariate causal network discovery. Specifically, we propose algorithms to estimate the causal network structure, including the causal ordering and causal strengths, using an iterative root finding-removing scheme based on the pairwise measure. To address the redundant-edge problem caused by the finite sample size, we develop an efficient bootstrapping-based pruning algorithm. Experiments on synthetic and real-world data show the applicability of our model and the effectiveness of the proposed algorithms.

## 1. Introduction

Causal discovery has received extensive attention in recent years. Much empirical work has been devoted to discovering the causal mechanisms behind natural and social phenomena. Conducting controlled experiments is an effective way to discover causal mechanisms. However, controlled experiments are in general expensive, time-consuming, practically infeasible, or ethically prohibited. For these reasons, causal discovery from observational (nonexperimental) data becomes a great necessity. Models such as structural equation models (SEMs) and Bayesian networks (BNs) have been proposed to explain data-generating mechanisms and have been widely applied in social science, econometrics, and medical science (Pearl, 2000; Spirtes, Glymour, & Scheines, 2000). However, conventional methods such as BNs assume that all disturbances are gaussian distributed, and hence only second-order statistics are employed to infer the causal structure. In general, such methods can obtain only a class of models that are equivalent in their conditional correlation structure (Shimizu, Hoyer, Hyvärinen, & Kerminen, 2006). The full structure, including the causal ordering and causal strengths, cannot be identified without prior knowledge in most cases (Sogawa et al., 2011). Recently it has been shown that by exploiting the nongaussianity of the disturbances, the causal structure can be fully and uniquely determined. Shimizu et al. (2006) proposed the linear nongaussian acyclic model (LiNGAM) and showed that the full causal structure can be identified by independent component analysis (ICA) (Comon, 1994; Hyvärinen & Oja, 2000). However, as most ICA algorithms use gradient descent-based iterative search, suboptimality cannot be avoided, and a correct causal ordering cannot be guaranteed by ICA-based LiNGAM algorithms. The Direct-LiNGAM framework was proposed later to tackle this problem (Shimizu et al., 2011).
LiNGAM has been widely used to discover the underlying causal structures of many real-world problems (Sogawa et al., 2011; Kawahara, Shimizu, & Washio, 2011; Moneta, Entner, Hoyer, & Coad, 2010; Zhang & Chan, 2006). The advantages of LiNGAM over conventional methods are that (1) a full and unique causal structure can be identified, provided that all assumptions hold and the sample size of the observations is sufficiently large, compared with BNs, which can only obtain a class of Markov equivalent models; (2) no prior knowledge of the network structure is needed; and (3) compared to BNs, which may involve large numbers of conditional independent tests, the computational complexity of LiNGAM is much lower.

In spite of these advantages, LiNGAM has the same shortcoming as other conventional
methods such as BNs: they all assume causal sufficiency (Shimizu et al., 2006). Causal sufficiency is a basic assumption
for many causal models, including LiNGAM and BNs. It assumes that for a set **S** of observed variables, every common direct cause (confounder) of
a subset (with two or more variables) in **S** is also in **S**. This assumption is strong and may not hold in the real world.
For example, in addition to mutual influences among stocks, the prices of stocks are
also affected by a number of external factors: interest rate, inflation rate, oil
price, gold price, and so on. In general, for a set of observed variables, some
confounders may be hidden or latent. The latter is closely related to the widely
used factor model in economics and finance (Zhang & Chan, 2009; Chan, 2002;
Chan & Cha, 2001; Cha & Chan, 2000). The factor model assumes no
cause-and-effect relationships among the observed variables, but all observed
variables are assumed as the effects of possibly a small number of latent
uncorrelated or independent factors. Causal models and factor models have their own
advantages. A causal model can be used to estimate the causal relationships among the
observed variables, which are quite common in the real world. The factor model takes
into consideration latent external confounders outside the system, for example,
interest rates, oil prices, and gold prices outside the stock market. Henao and
Winther (2011) considered a model involving both a linear directed acyclic graph and
a factor model, in which each variable is affected by other variables, a driving
signal of its own, and some latent variables. The authors proposed a hierarchical
Bayesian framework to estimate the model and a quantitative method for model
comparison (see Henao & Winther, 2011, for more details). Thus, a more
general and realistic model should be able to describe the causal relationship among
the observed variables as well as the effects from the latent confounders outside
the system. In this letter, we discard the causal sufficiency assumption made in BNs
and LiNGAM and allow the presence of latent confounders.

We deal with a more important and challenging problem: causal discovery for linear nongaussian acyclic models in the presence of latent gaussian confounders (LiNGAM-GC). In LiNGAM-GC, we allow the presence of latent confounders. In order to make the problem tractable, we approximate these latent confounders by gaussian variables that are mutually independent and independent of the disturbances.

Figure 1 shows an illustrative example of LiNGAM-GC with two observed variables. *x* and *y* are the observed variables whose causal direction we aim to discover, *e*_{1} and *e*_{2} are independent nongaussian disturbances, and *c* is the latent confounder, introduced in our model to represent the ensemble effect of the many latent independent factors *f _{i}*; the remaining parameters in the figure are the causal strengths of *x* on *y* and of *c* on each observed variable. Due to the extra dependence introduced by *c*, the causal discovery problem for *x* and *y* becomes much more challenging. Previous methods such as LiNGAM and Direct-LiNGAM may obtain incorrect results because of the latent confounder. In this letter, we propose and rigorously analyze a new cumulant-based pairwise measure of the causal direction. We prove that under mild conditions, the causal direction of the model given in Figure 1 can be correctly identified simply by investigating the sign of the measure: if it is positive, *x* → *y* is concluded; on the contrary, if it is negative, *y* → *x* is concluded. We also extend our work to multivariate causal network discovery using the Direct-LiNGAM framework: we first identify the exogenous variable (root) of the network and then regress all other variables on the exogenous variable. We repeat these steps for the regression residues until the whole causal ordering has been identified.

Note that our model, LiNGAM-GC, can be seen as a variant of latent variable LiNGAM (lvLiNGAM) (Hoyer, Shimizu, Kerminen, & Palviainen, 2008), which we discuss in detail in section 2. The main difference between LiNGAM-GC and lvLiNGAM is that our model assumes gaussian latent confounders, whereas lvLiNGAM allows nongaussian ones. Although lvLiNGAM is the more general model, its huge computational complexity makes it practically infeasible: it can be used to find the causal structure of only a very small number of variables. The gaussian approximation, in contrast, makes the problem fully tractable. Indeed, the method we develop based on this approximation is computationally efficient and can be applied to large causal network discovery.

The rest of this letter is organized as follows. In section 2, we review some related recent work concerning latent
confounders. In section 3, we define our linear
nongaussian acyclic models in the presence of latent confounders in detail. In
section 4, we propose and rigorously analyze a
new cumulant-based measure for pairwise causal discovery of LiNGAM-GC.^{1} In section 5, the main contribution of
this letter, we extend our work to causal discovery for LiNGAM-GC with more than two
observed variables. In section 6, we present experiments on synthetic and real-world
data that show the effectiveness of our proposed methods. We conclude in section 7.

## 2. Related Work Concerning Latent Confounders

Hoyer, Shimizu, Kerminen, and Palviainen (2008) proposed the latent variable LiNGAM (lvLiNGAM) model, in which **x** is the observation vector, **e** is the disturbance vector, **c** is the latent confounder vector, and **A** and **B** are the factor-loading matrices. Hoyer et al. treated the causal discovery problem for LiNGAM in the presence of latent confounders as an overcomplete ICA problem and developed algorithms to derive canonical models based on the estimated mixing matrix. However, estimating the mixing matrix of an overcomplete ICA problem is ill posed and challenging. The performance of existing methods for this problem is far from satisfactory, especially when the dimensionality of the problem is high. In addition, ICA-based methods may get stuck in local optima, which leads to incorrect estimation of the causal ordering.

A second line of work considered a model in which *X* and *Y* are the observations, *N*_{X} and *N*_{Y} are statistically independent disturbances, *T* is the latent confounder, and *u* and *v* are certain nonlinear functions. It was shown that under certain conditions, the confounder *T* is identifiable. Note that this model assumes that there is no direct edge between *X* and *Y* and that the variances of *N*_{X} and *N*_{Y} are relatively small compared to the effect of the confounder *T*. However, we are more interested in a general model where there is a direct edge between *X* and *Y*, and *N*_{X} and *N*_{Y} are not necessarily small. In this general model, we aim to estimate the direction of the edge between *X* and *Y*.

Kawahara et al. (2010) proposed the GroupLiNGAM model. In this model, latent confounders are allowed but are restricted within certain subsets of the observed variables. They mainly focused on finding the causal ordering of the subsets. This was achieved by iteratively finding the exogenous subsets and removing them from the remaining subsets until the entire causal ordering of the subsets was identified. However, in GroupLiNGAM, the causal structures within the confounded subsets remain unidentified. Hence, the confounding problem is only partially solved in this work.

Entner and Hoyer (2011) proposed a method to
infer the partial causal structure of the observed variables based on lvLiNGAM;
their method discovers all the unconfounded supersets and then estimates the total
causal effect for each pair in them. The basic idea is that confounding can be
detected in nongaussian settings. The algorithm first finds all the unconfounded
causal pairs and then, for each confounded pair, finds a maximal set **S** containing
the common causes of the pair such that the pair given **S** is unconfounded. Their
method has been proved to be sound and complete. However, similar to Kawahara et al.
(2010), this work can obtain only partial information about the causal network. If
the latent confounders affect most of the observed variables, which is quite common
in real-world phenomena, it is very likely that the method returns an empty
unconfounded superset.

## 3. Linear Nongaussian Acyclic Models in the Presence of Latent Gaussian Confounders

In this section, we propose our model: linear nongaussian acyclic models in the presence of latent gaussian confounders (LiNGAM-GC).

Our model is **x** = **Ax** + **Bc** + **e** (equation 3.1), where **A** is a matrix collecting the causal strengths among the observed variables, **e** is the vector of disturbances, **c** is a vector collecting the latent confounders, and **B** is a matrix collecting the causal strengths of the confounders on the observed variables. We make the following assumptions about the model:

The causal model is acyclic; that is, it can be represented by a directed acyclic
graph (DAG) or, equivalently, matrix **A** can be permuted to a strictly lower
triangular matrix by simultaneous row and column permutations. Here, strictly means
that the diagonal elements of the lower triangular matrix are exactly zero.

The components of the disturbance vector **e** are mutually independent
and nongaussian distributed with zero means.

Figure 2 gives an illustrative example of
LiNGAM-GC. In the causal network described in the figure, if the confounders *c*_{1} and *c*_{2} are absent, the model reduces to LiNGAM, of which the causal structure
can be estimated by LiNGAM algorithms (Shimizu et al., 2006, 2011).
However, due to the extra dependence introduced by the latent confounders *c*_{1} and *c*_{2}, the causal discovery becomes a challenging task. In order to make the
problem tractable, we make another assumption on the model:

The latent confounders *c _{i}* are independent gaussian distributed with zero means and independent of the disturbances *e _{j}*.

Here, we approximate the distributions of the latent confounders by gaussian distributions. The advantage of this approximation is that it makes the model fully identifiable and practically tractable. The interpretation of this approximation is based on the central limit theorem (CLT), as is traditional in statistics and related areas. As illustrated by Figure 1, the disturbances *e _{i}* can be regarded as local factors that describe the intrinsic behavior of the observed variables and cannot be regarded as external noise (Henao & Winther, 2011). In many real-world phenomena, the *e _{i}* tend to be more nongaussian. Besides the intrinsic disturbances, the observed variables *x* and *y* are also affected by a large number of global factors (macroeconomic factors) *f _{i}*. We introduce another variable *c* to represent the ensemble effect of these global factors. By the CLT, we can roughly say that the sum of a large number of independent variables tends to be, or at least to be close to, gaussian distributed. Consequently, we approximate the latent confounders as gaussian distributed in order to make the problem tractable. Our method is developed based on this approximation, and its performance depends on the extent to which the model assumptions hold. Thus, the method may not be applicable when the gaussian assumption on the latent confounders is strongly violated. This includes cases in which only some variables in a network are observable but the unobserved variables affect the observed ones; in this scenario, our method may not be suitable, as the latent gaussian confounder assumption is easily violated.


Equation 3.1 can be rewritten as **x** = (**I** − **A**)^{−1}**e** + (**I** − **A**)^{−1}**Bc**, where **I** is an identity matrix. Note that this model (equation 3.2) is exactly a noisy ICA model. Some work has been devoted to noisy ICA. One promising approach is a bias removal technique (Hyvärinen, 1999): Hyvärinen (1999) proposed solving the noisy ICA problem using gaussian moments. However, this approach requires prior knowledge of the covariance matrix of the gaussian noise. In our problem, that covariance matrix depends on the covariance matrix of the latent confounders **c**, of which we obviously have no prior knowledge. Hence, gaussian moments–based methods are not applicable to our problem. Attias (1999) proposed independent factor analysis (IFA), using a gaussian mixture model (GMM) to approximate the densities of the disturbances; all parameters of the IFA model are learned by the expectation-maximization (EM) algorithm. However, the number of parameters to learn is large, especially when the dimensionality of the problem is high. The algorithm easily converges to local optima, so the estimated mixing matrix often leads to an incorrect causal ordering when applied to our problem. Furthermore, the computational cost is huge, which makes it inapplicable to large-scale causal network discovery. Inferring the causal structure of LiNGAM-GC is a challenging problem. To the best of our knowledge, no reliable method has been proposed to estimate the causal structure of LiNGAM-GC. In the following sections, we propose a new cumulant-based measure to tackle this problem.

## 4. Causal Discovery for LiNGAM-GC with Two Variables

Suppose we are given two observed variables *x* and *y* and no prior knowledge about the causal direction or strength. Our task is to discover the causal direction from the observations alone. Before proposing our new cumulant-based measure, we introduce the cumulant-based measure that Hyvärinen and Smith (2013) proposed to discover the causal structure of LiNGAM.

### 4.1. Cumulant-Based Measure by Aapo Hyvärinen.

Consider two observed variables *x* and *y* with zero means, generated by the reduced model described by equations 4.1 with the confounder terms set to zero. Denote by *x̃* and *ỹ* the normalized *x* and *y* with unit variances and by *ρ̃* the corresponding correlation coefficient of *x̃* and *ỹ*. It is easy to see that |*ρ̃*| < 1, an important working condition for this cumulant-based measure. The cumulant-based measure that Hyvärinen and Smith (2013) proposed combines sample averages of higher-order moments of *x̃* and *ỹ* with the estimated correlation coefficient and the kurtosis of *x̃*. According to theorem 1 in Hyvärinen and Smith (2013), the causal direction can then be read off from the sign of the measure. This cumulant-based measure works well in the scenario of no latent confounders. Unfortunately, if we simply apply it to LiNGAM-GC, we may get misleading results. There are two causes of failure when applying this measure to LiNGAM-GC:

- •
Second-order statistics cannot eliminate the effect of a latent confounder when we estimate the correlation coefficient: the estimate is strongly influenced by the confounder and is unbiased if and only if the confounder has no effect on at least one of the two variables, that is, there is no confounding effect. Due to the confounding effect, this estimation bias may lead to severe problems.

- •
|*ρ̃*| < 1 is a very important working condition for the cumulant-based measure that Hyvärinen and Smith (2013) proposed. However, simply normalizing *x* and *y* to unit variance cannot guarantee this. Lemma 1 shows the condition under which it holds.

*Assume that the observed variables x and y are generated according to equation 4.1 and that the variances of e_{2} and c fulfill the stated bound. Denote by x̃ and ỹ the normalized x and y with unit variances; then the causal strength between x̃ and ỹ has absolute value less than 1.*
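The estimation bias described in the first bullet can be seen numerically. The following minimal sketch (variable names and parameter values are ours, chosen for illustration) shows that with a gaussian confounder loading on both variables, the sample correlation coefficient no longer reflects the confounder-free value:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100000
# disturbances; gaussianity is irrelevant for this second-order point
e1, e2 = rng.standard_normal((2, n))
c = rng.standard_normal(n)  # latent gaussian confounder

# confounder-free pair: x -> y with causal strength 0.5
x0 = e1
y0 = 0.5 * x0 + e2

# same mechanism plus the confounder loading on both variables
x = e1 + c
y = 0.5 * x + e2 + c

r_clean = np.corrcoef(x0, y0)[0, 1]
r_conf = np.corrcoef(x, y)[0, 1]
# r_conf is inflated: second-order statistics cannot separate the
# causal strength from the confounding effect
```

Here the confounder-free correlation is 0.5/√1.25 ≈ 0.447, while the confounded pair shows a substantially larger correlation, illustrating the bias.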

### 4.2. New Cumulant-Based Measure for LiNGAM-GC.

In order to tackle the causal discovery problem for a causal-effect pair described by equation 4.1, we propose our new cumulant-based measure and causal inference rule, which we partially discussed in our previous work (Chen & Chan, 2012). The causal inference rule is simple and similar to the causal inference rule proposed in Hyvärinen and Smith (2013). In order to make this letter self-contained, we introduce it in this section.

As shown in lemma 1, simply normalizing *x* and *y* to unit variance cannot in general guarantee that the causal strength has absolute value less than 1, because of the extra dependence introduced by the latent confounder *c*. In order to eliminate the effect of *c*, we need to use higher-order statistics. In this letter, we propose a standardization method: normalization to unit absolute kurtosis. Define *x̃* = *x*/|*kurt*(*x*)|^{1/4} and *ỹ* = *y*/|*kurt*(*y*)|^{1/4}, where *kurt*(*x*) and *kurt*(*y*) are the kurtoses of *x* and *y*, respectively. Since *c* is gaussian distributed, *kurt*(*c*) = 0, so the kurtosis of each observed variable is contributed only by the nongaussian disturbances. Lemma 2 shows that under mild conditions, normalizing *x* and *y* to unit absolute kurtosis leads to a causal strength of absolute value less than 1:

*Assume that the kurtoses of e_{1} and of e_{2} have the same sign: e_{1} and e_{2} are simultaneously supergaussian or subgaussian. Denote by x̃ and ỹ the normalized x and y with unit absolute kurtosis according to equation 4.4. Then the causal strength between x̃ and ỹ has absolute value less than 1.*

Regarding the condition in lemma 2 that *e*_{1} and *e*_{2} are simultaneously supergaussian or subgaussian, this assumption is
not restrictive, as it has been reported that the innovation processes of many
real signals are almost invariably supergaussian even if the signals themselves
are subgaussian (Valpola, Harva, & Karhunen, 2004). We also develop a simple method
to detect violations of the simultaneously super- or subgaussian assumption in
section 4.4.

For convenience, in the rest of this section, we use the notations *x*, *y*, and *c*, but we assume that *x* and *y* are standardized such that the causal strength has absolute value less than 1 (either by normalizing to unit variance or to unit absolute kurtosis, depending on whether the condition in lemma 1 or lemma 2 holds). Inspired by Hyvärinen and Smith (2013), we give the following cumulant-based measure, which can be used to determine the causal direction for pairwise LiNGAM-GC.

Suppose *x* and *y* fulfill the generating mechanism *x* → *y* of equation 4.1. By the properties of cumulants (Hyvärinen & Smith, 2013), namely multilinearity in each argument and the vanishing of cross-cumulants of independent variables, and since *c* is gaussian distributed so that *kurt*(*c*) = 0, the confounder drops out of the fourth-order cross-cumulants of *x* and *y*. Evaluating the measure under *x* → *y* therefore yields a positive value. Likewise, if the observed *x* and *y* are generated with the opposite direction *y* → *x*, the same evaluation yields a negative value. Here we assume that *x* and *y* are standardized such that the causal strength has absolute value less than 1 (by normalization either to unit variance or to unit absolute kurtosis, depending on which condition holds). Thus, the cumulant-based measure enables us to determine the causal direction between the observed *x* and *y* simply by investigating its sign.
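To make the decision rule concrete, the following sketch implements one plausible instantiation of the pairwise measure: after normalizing both variables to unit absolute kurtosis, it contrasts the squared fourth-order cross-cumulants cum(x,x,x,y)² − cum(x,y,y,y)². The function names, the exact form of the statistic, and the ±1 return convention are ours, not the letter's:

```python
import numpy as np

def kurt(v):
    # excess kurtosis of a zero-mean sample: E[v^4] - 3 E[v^2]^2
    v = v - v.mean()
    return np.mean(v**4) - 3 * np.mean(v**2) ** 2

def cum4(a, b):
    # fourth-order cross-cumulant cum(a, a, a, b) for zero-mean samples;
    # a gaussian confounder contributes nothing to this quantity
    a = a - a.mean()
    b = b - b.mean()
    return np.mean(a**3 * b) - 3 * np.mean(a * b) * np.mean(a**2)

def pairwise_direction(x, y):
    """Return (+1, measure) if x -> y is inferred, (-1, measure) if y -> x."""
    # standardize to unit absolute kurtosis (section 4.2)
    x = x / np.abs(kurt(x)) ** 0.25
    y = y / np.abs(kurt(y)) ** 0.25
    # squared-cumulant contrast: positive exactly when x -> y
    measure = cum4(x, y) ** 2 - cum4(y, x) ** 2
    return (1 if measure > 0 else -1), measure
```

Under *x* → *y* with standardized causal strength ρ̃, this contrast evaluates to ρ̃²(1 − ρ̃⁴) times the squared (unit-magnitude) kurtosis of the cause, which is positive whenever |ρ̃| < 1, and the sign flips when the arguments are swapped.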

### 4.3. Causal Strength.

*x*and

*y*. This problem is easy to solve once we can distinguish the cause from the effect. The basic idea is to use higher-order statistics. In section 4.2, we showed that if the causal direction is . Since , we have Now we can estimate the causal strength simply by Similarly, if the causal direction is , we can estimate by
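A minimal sketch of this estimator (function names are ours): since *cum*(*x*, *x*, *x*, *x*) is simply *kurt*(*x*), the strength is a ratio of two fourth-order cumulants, both free of the gaussian confounder's influence.

```python
import numpy as np

def cum4(a, b):
    # fourth-order cross-cumulant cum(a, a, a, b) for zero-mean samples
    a = a - a.mean()
    b = b - b.mean()
    return np.mean(a**3 * b) - 3 * np.mean(a * b) * np.mean(a**2)

def causal_strength(cause, effect):
    # cum(x, x, x, y) = rho * kurt(x) and cum4(x, x) = kurt(x), so the
    # ratio recovers rho despite the latent gaussian confounder
    return cum4(cause, effect) / cum4(cause, cause)
```

Note that this assumes the cause has been identified first (e.g., by the pairwise measure) and that its disturbance is nongaussian, so the denominator is nonzero.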

### 4.4. Detecting the Violation of the Simultaneously Supergaussian or Subgaussian Assumption.

Lemma 2 shows that if the simultaneously super/subgaussian assumption holds, by simply normalizing the observations to unit absolute kurtosis, we can guarantee that the normalized causal strength between the standardized and fulfills . This property is of great importance for our new cumulant-based measure proposed in section 4.2. However, it is not possible to tell if this simultaneously super/subgaussian assumption holds for the observed real-world data before we construct the causal relationship. In this section, we propose a simple way to detect the violation of the simultaneously super/subgaussian assumption.

Without loss of generality, suppose we have observations *x* and *y* generated by equation 4.1; hence, the ground truth is *x* → *y*. If the simultaneously super/subgaussian assumption is violated, again without loss of generality, suppose *kurt*(*e*_{1})>0 and *kurt*(*e*_{2})<0. We consider the following two cases:

- •
*kurt*(*x*)>0 and*kurt*(*y*)<0. - •
*kurt*(*x*)>0 and*kurt*(*y*)>0.

In the first case, the violation is directly visible from the observations, since the kurtoses of *x* and *y* reveal that the kurtoses of *e*_{1} and *e*_{2} are of opposite signs. The difficulty lies in the second case, where *kurt*(*y*) > 0 even though *kurt*(*e*_{2}) < 0. To tackle this difficulty, we propose the following measure of violation (MoV): the residual obtained by removing the estimated cause from the estimated effect, MoV = *ŷ* − ρ̂*x̂*, where *x̂* and *ŷ* are the cause and effect estimated by our cumulant-based measure, respectively, and ρ̂ is the estimated causal strength between *x̂* and *ŷ*.

In the second case, where *kurt*(*x*) = *kurt*(*e*_{1}) > 0 and *kurt*(*e*_{2}) < 0 but *kurt*(*y*) > 0, it is easy to see that sign(*kurt*(*MoV*)) < 0, which is opposite to sign(*kurt*(*x*)). The same simple measure can be applied to detect the violation where *kurt*(*e*_{1}) < 0 and *kurt*(*e*_{2}) > 0. Generally, we can detect a violation of the super/subgaussian assumption by algorithm 2.
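The detection step can be sketched as follows (names and the boolean return are ours, a simplified stand-in for algorithm 2): the kurtosis of the residual recovers *kurt*(*e*_{2}), because the gaussian confounder contributes nothing to it, so a sign mismatch with the kurtosis of the estimated cause flags the violation.

```python
import numpy as np

def kurt(v):
    # excess kurtosis: E[v^4] - 3 E[v^2]^2 for a zero-mean sample
    v = v - v.mean()
    return np.mean(v**4) - 3 * np.mean(v**2) ** 2

def violation_detected(cause, effect, strength):
    # MoV = effect - strength * cause recovers the effect's disturbance
    # (plus gaussian terms, which do not move the kurtosis), so its
    # kurtosis sign should match that of the cause's disturbance
    mov = effect - strength * cause
    return np.sign(kurt(mov)) != np.sign(kurt(cause))
```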

## 5. Causal Discovery for LiNGAM-GC with More Than Two Variables

### 5.1. LiNGAM-GC with More Than Two Variables.

We compute the pairwise cumulant-based measure for every ordered pair of variables and store it as the (*i*, *j*) element of a matrix **M**_{n}. We show later in this section that for an exogenous variable *x _{r}* (an exogenous variable is a variable with no observed parent in the model), all elements in the *r*th row of **M**_{n} are nonnegative, neglecting random errors due to the finite sample size. Once we have found the exogenous variable *x _{r}*, we can regress all the other variables on *x _{r}* and obtain the regression residues. We repeat the same steps by computing the cumulant-based measure matrix **M**_{n−1} for the regression residues and then find the variable that has no observed parents except possibly the variable found in the previous step (Hyvärinen & Smith, 2013). This process is repeated successively until the whole causal ordering is identified.

We need to compute the cumulant-based measure for all different pairs of variables
to find the exogenous variable. In order to apply the new cumulant-based measure, we
first need to standardize the observations **x**. As in the pairwise case, we can
standardize the observations to unit variance or to unit absolute kurtosis according
to the scenario. The following two lemmas show under what conditions these two
standardization methods lead to a causal strength of absolute value less than 1
between the standardized exogenous variable and each other standardized variable.

Without loss of generality, assume that **A** is already permuted to be a strictly lower triangular matrix, and formulate the generating equation as **x** = (**I** − **A**)^{−1}(**Bc** + **e**), where **I** is an identity matrix. Consider the cause-and-effect pair (*x*_{1}, *x _{i}*). We assume that if *x*_{1} is an ancestor of *x _{i}*, then the total causal strength from *x*_{1} to *x _{i}* is not cancelled out by multiple paths, that is, it is nonzero.

*Assume that the variances of the disturbances e_{i} and of the latent confounders c_{k} fulfill the stated bound. Denote by x̃_{1} and x̃_{i} the normalized x_{1} and x_{i} with unit variances. Then the causal strength between x̃_{1} and x̃_{i} has absolute value less than 1.*

The proof of this lemma is similar to that of lemma 1; we skip it here.

To eliminate the effect of the latent confounders *c _{k}*, we have to exploit higher-order statistics, as in the pairwise case, by standardizing the observed variables to unit absolute kurtosis: *x̃*_{1} = *x*_{1}/|*kurt*(*x*_{1})|^{1/4} and *x̃ _{i}* = *x _{i}*/|*kurt*(*x _{i}*)|^{1/4}. We have the following lemma:

*Assume that the kurtoses of the e_{j} have the same sign: the e_{j} are simultaneously supergaussian or subgaussian. Denote by x̃_{1} and x̃_{i} the normalized x_{1} and x_{i} with unit absolute kurtosis. Then the causal strength between x̃_{1} and x̃_{i} has absolute value less than 1.*

Without loss of generality, in the rest of this section, we assume that the
exogenous variable is *x _{r}* and that the observed variables are standardized
such that the total causal strength between the exogenous variable *x _{r}* and
each other variable *x _{j}* has absolute value less than 1.

*Assume that the observed data **x** strictly follow LiNGAM-GC (see equation 3.1) and that the sample size is infinite. We compute the cumulant-based measure for all different pairs of variables and store it as the (i, j)th element of a matrix **M**. Then a variable x_{r} is the exogenous variable if and only if the elements of the rth row of matrix **M** are all nonnegative.*

First, assume that *x _{r}* is exogenous, and consider *m _{rk}*, the (*r*, *k*)th element of **M**. According to the model assumptions and equation 3.1, *x _{r}* is not caused by any other observed variable. Following a derivation similar to the proof of theorem 1, and since the observed variables are standardized (see lemmas 3 and 4), we have *m _{rk}* ≥ 0 for all *k*.

Conversely, assume that *x _{r}* is not exogenous, that is, *x _{r}* has at least one observed ancestor. Among all its ancestors, there must be an exogenous variable, due to the acyclicity of the model; denote this exogenous ancestor by *x _{e}*. By our assumption that the total causal strength from *x _{e}* to *x _{r}* is not cancelled out by multiple paths, the measure for the pair (*x _{r}*, *x _{e}*) is negative. Hence, there must be at least one negative element in the *r*th row of matrix **M**.

Once the exogenous variable *x _{r}* has been identified, we can remove its effect from the other variables *x _{j}* by regressing each of them on *x _{r}*, obtaining the regression residues **x**^{(r)}, which, as we have shown, again satisfy the conditions of our analysis. We can repeat the steps for the regression residues **x**^{(r)} of the remaining variables until the whole causal ordering is identified. The algorithm for LiNGAM-GC with more than two variables is summarized in algorithm 3.
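The root finding-removing scheme can be sketched as follows (a simplified stand-in for algorithm 3; the particular statistic, the finite-sample root selection by the largest row minimum, and all names are ours):

```python
import numpy as np

def kurt(v):
    v = v - v.mean()
    return np.mean(v**4) - 3 * np.mean(v**2) ** 2

def cum4(a, b):
    a = a - a.mean()
    b = b - b.mean()
    return np.mean(a**3 * b) - 3 * np.mean(a * b) * np.mean(a**2)

def pairwise_measure(x, y):
    # unit-absolute-kurtosis standardization, then squared-cumulant contrast
    x = x / np.abs(kurt(x)) ** 0.25
    y = y / np.abs(kurt(y)) ** 0.25
    return cum4(x, y) ** 2 - cum4(y, x) ** 2

def causal_ordering(X):
    """X: (samples, variables) array. Returns an estimated causal ordering."""
    remaining = list(range(X.shape[1]))
    X = X.copy()
    order = []
    while len(remaining) > 1:
        # root = variable whose row of pairwise measures is nonnegative;
        # with finite samples, take the row with the largest minimum
        scores = []
        for i in remaining:
            row = [pairwise_measure(X[:, i], X[:, j])
                   for j in remaining if j != i]
            scores.append(min(row))
        root = remaining[int(np.argmax(scores))]
        order.append(root)
        remaining.remove(root)
        # remove the root's effect from the others by least-squares regression
        r = X[:, root] - X[:, root].mean()
        for j in remaining:
            xj = X[:, j]
            X[:, j] = xj - (np.dot(xj - xj.mean(), r) / np.dot(r, r)) * r
    order.append(remaining[0])
    return order
```

The bootstrapping-based pruning of redundant edges described in section 5.2 would be applied on top of the ordering returned here.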

### 5.2. Sparse Causal Network.

In the real world, a causal network is more likely to be sparse; many elements of
the causal strength matrix **A** are zeros. Now we look at our proposed new
cumulant-based measure. Theorem 1 shows that if there is no direct edge between
*x* and *y*, that is, the causal strength is zero, then the measure is exactly zero,
provided that we have infinite sample size. However, since we are given a finite
sample, the empirical cumulant-based measure computed from the observations is not
exactly zero but a number with small absolute value.

Due to the effect of finite sample size, the causal network will contain many
redundant edges with weak causal strengths. One possible solution to this
problem is to set a threshold for the cumulant-based measure: if its absolute value
falls below the threshold, we discard the edge between *x* and *y*. However, without
any prior knowledge of the causal strengths, it is difficult to choose an
appropriate value for the threshold. A more sophisticated approach is to develop a
hypothesis test of whether the measure is significantly different from zero. We
adopt the second approach and consider the following null hypothesis:

**H0 5.1.** There is no direct edge between the observed *x* and *y*; that is, the causal strength between them is zero.
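One way to test this null hypothesis is a percentile bootstrap on the empirical measure, a simplified sketch of the bootstrapping-based pruning idea (the statistic, the interval rule, and all names are ours): resample the observations with replacement, recompute the measure, and keep the edge only if the resulting interval excludes zero.

```python
import numpy as np

def kurt(v):
    v = v - v.mean()
    return np.mean(v**4) - 3 * np.mean(v**2) ** 2

def cum4(a, b):
    a = a - a.mean()
    b = b - b.mean()
    return np.mean(a**3 * b) - 3 * np.mean(a * b) * np.mean(a**2)

def pairwise_measure(x, y):
    x = x / np.abs(kurt(x)) ** 0.25
    y = y / np.abs(kurt(y)) ** 0.25
    return cum4(x, y) ** 2 - cum4(y, x) ** 2

def edge_significant(x, y, n_boot=200, alpha=0.05, rng=None):
    # percentile bootstrap of the measure under resampling; discard the
    # edge if the (1 - alpha) interval contains zero (H0 5.1)
    rng = np.random.default_rng(rng)
    n = len(x)
    stats = [pairwise_measure(x[idx], y[idx])
             for idx in (rng.integers(0, n, n) for _ in range(n_boot))]
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return not (lo <= 0.0 <= hi)
```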

## 6. Experiment

In this section, we conduct a series of experiments to show the effectiveness of our proposed algorithms.

### 6.1. Synthetic Data: Cause-and-Effect Pair.

The disturbances *e*_{1} and *e*_{2} are supergaussian and are generated by *e _{i}* = *sign*(*n _{i}*)|*n _{i}*|^{2}, where the *n _{i}* are standard gaussian distributed. We further normalize the *e _{i}* to unit variance. In the first part of this experiment, *c* strictly follows a gaussian distribution with zero mean. We fix the remaining model parameters and, in order to investigate how the causal strength and the standard deviation of the confounder affect the accuracies of the different algorithms, conduct a series of experiments over a range of these two quantities. Note that this experimental setting fulfills the conditions of lemmas 1 and 2; hence, both normalization methods lead to a causal strength of absolute value less than 1. For each parameter setting, we randomly generate 100 data sets with a sample size of 5000 and randomly swap *x* and *y*. We use the different algorithms to find the causal directions of these cause-and-effect pairs. The percentages of correctly identified data sets for the different algorithms are shown in Figures 3 and 4.
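The data-generating step above can be sketched as follows (parameter names are ours; the disturbance transform sign(n)|n|² matches the setup described in the text):

```python
import numpy as np

def make_pair(n, rho=0.8, l1=0.5, l2=0.5, seed=0):
    """Cause-effect pair x -> y with a latent gaussian confounder c."""
    rng = np.random.default_rng(seed)
    n1, n2 = rng.standard_normal((2, n))
    # supergaussian disturbances e_i = sign(n_i)|n_i|^2, unit variance
    e1 = np.sign(n1) * n1**2
    e2 = np.sign(n2) * n2**2
    e1 /= e1.std()
    e2 /= e2.std()
    c = rng.standard_normal(n)  # gaussian confounder
    x = e1 + l1 * c
    y = rho * x + e2 + l2 * c
    return x, y
```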

The experimental results suggest that LiNGAM-GC-UV and LiNGAM-GC-UK have the best performances across the different scenarios. Wrong decisions still occur in the case of small causal strength, but this is due to the finite sample size effect: if the sample size is large enough, both LiNGAM-GC-UV and LiNGAM-GC-UK are expected to achieve perfect performance. From Figures 3 and 4, we find that C-M performs quite well in the case of positive causal strength but poorly in the case of negative causal strength. This observation confirms our theoretical analysis of the C-M algorithm in the scenario of latent gaussian confounders in section 4. The performances of ICA-LiNGAM and Direct-LiNGAM depend on the variance of the latent confounder: when the effect of the latent confounder is strong enough, the performance of both algorithms degrades dramatically. Compared to ICA-LiNGAM, Direct-LiNGAM is more sensitive to the latent confounder. This is possibly because Direct-LiNGAM tests the independence of the potential exogenous variable and the regression residue; due to the latent confounder, these are dependent when only second-order statistics are employed for the regression. ICA-based methods, in contrast, search for the direction that maximizes the independence of the separated components during the iterative search.

In the experiment above, the latent confounder *c* strictly follows a gaussian distribution. In order to investigate the robustness of our algorithms against the gaussian assumption, we conduct the following experiment, where *c* is not strictly gaussian but the sum of a number of independent factors *f*_{k}, as illustrated in Figure 1. The experimental setting is as follows: *e*_{i} are generated by *e*_{i}=sign(*n*_{i})|*n*_{i}|^{2} as in the last experiment, and the remaining model parameters are fixed.

The factors *f*_{k} are mutually independent and generated by *f*_{k}=sign(*n*_{k})|*n*_{k}|^{r}, where *n*_{k} is a gaussian random variable with zero mean and standard deviation uniformly randomly selected from [1, 1.4], and *r* is uniformly randomly selected from [1.4, 1.6]. We randomly generate 100 data sets according to equation 6.1 with a sample size of 5000 and randomly swap *x* and *y*. We use our algorithms LiNGAM-GC-UV and LiNGAM-GC-UK to find the causal directions of these cause-and-effect pairs. The percentages of correctly identified data sets for these two algorithms and the kurtosis of the latent confounder are shown in Figure 5. The results show that as the number of independent factors *f*_{k} increases, the kurtosis of the latent confounder *c* decreases and *c* becomes more and more gaussian, which is consistent with the central limit theorem. Meanwhile, the performance of our algorithms improves significantly. Interestingly, our cumulant-based measure is robust against the gaussian assumption on the latent confounder *c*: even when the number of factors *f*_{k} increases to 20, the kurtosis of *c* is still 0.1528, so *c* remains supergaussian, yet the performances of our algorithms are close to 100%. This experiment suggests that although all the theory developed in this letter is based on the gaussian assumption on the latent confounders, our methods are robust against violations of this assumption.
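The construction of the near-gaussian confounder can be sketched as follows (Python/NumPy; function names are ours). Summing more independent supergaussian factors drives the excess kurtosis of *c* toward zero, in line with the central limit theorem.

```python
import numpy as np

def excess_kurtosis(v):
    """Empirical excess kurtosis (zero for a gaussian)."""
    v = (v - v.mean()) / v.std()
    return np.mean(v**4) - 3.0

def summed_confounder(num_factors, size, seed=None):
    """c = sum_k f_k with f_k = sign(n_k)|n_k|^{r_k}, where n_k is gaussian
    with standard deviation drawn from [1, 1.4] and r_k drawn from
    [1.4, 1.6], as in the experiment."""
    rng = np.random.default_rng(seed)
    c = np.zeros(size)
    for _ in range(num_factors):
        sigma = rng.uniform(1.0, 1.4)
        r = rng.uniform(1.4, 1.6)
        n = sigma * rng.standard_normal(size)
        c += np.sign(n) * np.abs(n) ** r
    return c
```

For a single factor the excess kurtosis is markedly positive; with 20 factors it shrinks toward zero while typically remaining slightly supergaussian.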

### 6.2. Synthetic Data: Detecting the Violation of Simultaneously Super/Subgaussian Assumption.

In order to show the effectiveness of algorithm 2 in detecting violations of the simultaneously super/subgaussian assumption, we generate the observations *x* and *y* according to equation 6.1 with fixed model parameters. We consider two scenarios. In the first, *e*_{1}=sign(*n*_{1})|*n*_{1}|^{2}, such that *e*_{1} is supergaussian, and *e*_{2}=sign(*n*_{2})|*n*_{2}|^{r}; we change *r* from 0.2 to 1.8 with a step size of 0.2. In the second scenario, *e*_{1}=sign(*n*_{1})|*n*_{1}|^{0.2}, such that *e*_{1} is subgaussian, and *e*_{2}=sign(*n*_{2})|*n*_{2}|^{r}; again we change *r* from 0.2 to 1.8 with a step size of 0.2. In both scenarios, we generate the observations *x* and *y* with sample sizes of 2000, 4000, …, 10,000. Once the observations are generated, we apply algorithm 2 to detect violations of the simultaneously super/subgaussian assumption. We run the experiment 1000 times and count the percentage of detected violations. The experimental results are shown in Figures 6 and 7.

As *e*_{2}=sign(*n*_{2})|*n*_{2}|^{r}, when 0<*r*<1, *e*_{2} is subgaussian; when *r*=1, *e*_{2} is gaussian; and when *r*>1, *e*_{2} is supergaussian. From Figure 6, where *e*_{1} is supergaussian, we can see that when 0<*r*<1, and especially when *r* is small, which means that *e*_{2} is subgaussian, algorithm 2 detects almost all the violations. When *r*>1, which means that *e*_{2} is supergaussian, the percentage of detected violations decreases rapidly to zero, since when *r*>1 the simultaneously super/subgaussian assumption holds. We can also see that with more observed samples, the performance of our proposed algorithm gets closer to the ideal performance curve (a step function whose value is 1 when *r*<1 and 0 when *r*>1).

A similar conclusion can be drawn from Figure 7.
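The relationship between the exponent *r* and the gaussianity class of *e*_{2} can be verified empirically, as in the following Python/NumPy sketch. It checks the sign of the excess kurtosis, which is a simplified proxy for (not a reproduction of) the detection performed by algorithm 2.

```python
import numpy as np

def excess_kurtosis(v):
    """Empirical excess kurtosis (zero for a gaussian)."""
    v = (v - v.mean()) / v.std()
    return np.mean(v**4) - 3.0

def power_noise(r, size, seed=None):
    """e = sign(n)|n|^r with n standard gaussian."""
    rng = np.random.default_rng(seed)
    n = rng.standard_normal(size)
    return np.sign(n) * np.abs(n) ** r

# Sub-, near-, and supergaussian regimes as r crosses 1:
kurts = {r: excess_kurtosis(power_noise(r, 200000, seed=0))
         for r in (0.2, 1.0, 1.8)}
```

Empirically, the excess kurtosis is clearly negative for r = 0.2, near zero for r = 1, and clearly positive for r = 1.8, matching the classification used above.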

### 6.3. Real-World Data: Cause-and-Effect Pairs.

In order to show the applicability of our proposed methods to real-world data, we compare the performances of different methods on the real-world cause-and-effect pairs (http://webdav.tuebingen.mpg.de/cause-effect/). We use 42 pairs from this data set.^{3} As the computational time of Direct-LiNGAM increases dramatically for large sample sizes, we use at most 1000 samples for each cause-and-effect pair. The performances of the different algorithms are given in Table 1.

| Algorithm | LiNGAM-GC-UV | LiNGAM-GC-UK | C-M | ICA-LiNGAM | Direct-LiNGAM | IGCI |
|---|---|---|---|---|---|---|
| Accuracy | 83.33% | 82.35% | 82.05% | 76.16% | 64.29% | 57.14% |
| Decision rate | 100% | 80.95% | 92.86% | 100% | 100% | 100% |


We use algorithm 2 to detect violations of the simultaneously super/subgaussian assumption for LiNGAM-GC-UK and find that 80.95% of the cause-and-effect pairs pass the detection. For C-M, we select the pairs where the kurtoses of *x* and *y* have the same sign. For LiNGAM-GC-UV, LiNGAM-ICA, Direct-LiNGAM, and IGCI, we do not have a detection method for violations of their assumptions; hence, we use all the cause-and-effect pairs. Table 1 shows that for causal discovery on real-world data, our proposed LiNGAM-GC-UV has the best performance, followed by LiNGAM-GC-UK and C-M. LiNGAM-GC-UV performs better than LiNGAM-GC-UK, possibly because LiNGAM-GC-UK uses normalization to unit absolute kurtosis, which requires a larger sample size to obtain consistent estimates. LiNGAM-ICA has an accuracy of 76.16%, followed by Direct-LiNGAM and IGCI (Daniušis et al., 2010). From this experiment, we learn that our proposed cumulant-based method can be used for causal discovery in real-world problems and is more accurate and reliable than methods that do not take latent confounders into consideration, such as LiNGAM-ICA, Direct-LiNGAM, and IGCI.

### 6.4. Synthetic Data: Causal Network Discovery.

**A** is a strictly lower triangular matrix with elements *a*_{ij} uniformly distributed in [0.1, 0.3] or [−0.3, −0.1]. **B** is a matrix with elements *b*_{ij} uniformly distributed in [0.1, 0.3]. The disturbances **e** are generated following the same method as in the previous experiments and normalized to unit variance. The latent confounders **c** are independent standard gaussian. We randomly permute the generated observations and use three algorithms to discover the causal ordering of the observations. In order to investigate the effect of sample size on the performance of the three algorithms, we conduct the experiment with sample sizes ranging from 500 to 5000. For each sample size, we run 300 independent trials and count the percentage of data sets with (1) the first variable in the causal ordering identified, (2) the first two variables in the causal ordering identified, and (3) the whole causal ordering identified, for the three algorithms. The results are summarized in Tables 2 to 4.
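The data-generating process just described can be sketched as follows (Python/NumPy; the function name and the parameters p, q, n are ours). Since **A** is strictly lower triangular, the model is acyclic and x = (I − A)^{−1}(B c + e).

```python
import numpy as np

def generate_network_data(p, q, n, seed=None):
    """Sample n observations of x = A x + B c + e, i.e.
    x = (I - A)^{-1} (B c + e), with A strictly lower triangular."""
    rng = np.random.default_rng(seed)
    # entries in [0.1, 0.3] with random sign, then keep the strictly
    # lower-triangular part
    A = rng.uniform(0.1, 0.3, (p, p)) * rng.choice([-1.0, 1.0], (p, p))
    A = np.tril(A, k=-1)
    B = rng.uniform(0.1, 0.3, (p, q))         # confounder loadings
    c = rng.standard_normal((q, n))           # latent gaussian confounders
    g = rng.standard_normal((p, n))
    e = np.sign(g) * np.abs(g) ** 2           # supergaussian disturbances
    e /= e.std(axis=1, keepdims=True)         # normalize to unit variance
    x = np.linalg.solve(np.eye(p) - A, B @ c + e)
    return x, A, B
```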

| Sample Size | LiNGAM-GC-UK | C-M | LiNGAM-ICA |
|---|---|---|---|
| 500 | 72.67% | 60.67% | 85.00% |
| 1000 | 81.67% | 62.00% | 90.33% |
| 1500 | 86.67% | 59.00% | 89.00% |
| 2000 | 87.00% | 64.33% | 88.33% |
| 2500 | 93.67% | 65.00% | 89.00% |
| 3000 | 93.33% | 64.67% | 90.00% |
| 3500 | 94.00% | 63.00% | 91.33% |
| 4000 | 93.67% | 62.67% | 89.33% |
| 4500 | 95.67% | 65.00% | 90.33% |
| 5000 | 96.33% | 62.67% | 93.33% |


| Sample Size | LiNGAM-GC-UK | C-M | LiNGAM-ICA |
|---|---|---|---|
| 500 | 52.00% | 40.67% | 75.00% |
| 1000 | 70.00% | 43.33% | 81.67% |
| 1500 | 79.00% | 42.00% | 81.00% |
| 2000 | 79.67% | 48.67% | 79.33% |
| 2500 | 88.00% | 42.67% | 84.67% |
| 3000 | 88.67% | 49.67% | 82.33% |
| 3500 | 88.67% | 49.00% | 82.33% |
| 4000 | 90.33% | 49.00% | 83.00% |
| 4500 | 93.00% | 52.33% | 84.67% |
| 5000 | 93.00% | 48.33% | 86.33% |


From Tables 2, 3, and 4, we can see that LiNGAM-ICA has the best performance in the small-sample-size scenario, followed by LiNGAM-GC-UK and then C-M. The reason is that the negentropy-based ICA algorithm is able to find the rotation that maximizes the independence of the separated components, while LiNGAM-GC-UK and C-M are based on fourth-order cumulants, whose sample averages deviate far from the true values when the sample size is small. However, given a sufficiently large number of samples, LiNGAM-GC-UK has the best performance among the three algorithms. From Table 2, we can see that the performances of LiNGAM-GC-UK and LiNGAM-ICA are close to each other in finding the first variable of the causal network, but when it comes to finding the first two variables or even the whole causal ordering, LiNGAM-GC-UK performs much better. From Table 4, we can see that when the sample size is 5000, LiNGAM-GC-UK achieves 91.67% accuracy, much higher than the 72.67% of LiNGAM-ICA and the 41.00% of C-M. We also find that the C-M algorithm is more sensitive to the latent confounders than LiNGAM-ICA.

| Sample Size | LiNGAM-GC-UK | C-M | LiNGAM-ICA |
|---|---|---|---|
| 500 | 37.00% | 25.67% | 57.00% |
| 1000 | 57.67% | 34.00% | 69.67% |
| 1500 | 68.33% | 31.00% | 67.67% |
| 2000 | 73.67% | 40.00% | 69.33% |
| 2500 | 83.00% | 33.00% | 71.33% |
| 3000 | 84.33% | 41.67% | 71.33% |
| 3500 | 86.33% | 39.33% | 72.67% |
| 4000 | 88.67% | 40.67% | 72.67% |
| 4500 | 90.33% | 44.00% | 74.33% |
| 5000 | 91.67% | 41.00% | 72.67% |


### 6.5. Synthetic Data: Sparse Causal Network.

In the previous experiments, we compared the performance of LiNGAM-GC-UK with LiNGAM-ICA and C-M. LiNGAM-ICA and C-M assume that there is no latent confounder of the observed variables, and those experiments show how problematic it is to ignore latent confounders in the inference. In the following experiment, we compare the performance of our algorithm LiNGAM-GC-UK with pairwise lvLiNGAM (Entner & Hoyer, 2011), which takes latent confounders into consideration. Pairwise lvLiNGAM estimates the causal structure of the unconfounded subset of variables and hence discovers a partial causal structure of the observed variables. The experimental setting is as follows. The disturbances *e*_{i} are generated as in the previous synthetic experiments, and the confounders *c*_{j} are standard gaussian. The causal strength matrix **A** is strictly lower triangular and sparse: about 80% of its entries are zeros. The factor loading matrix **B** of the confounders on the observed variables is also sparse; the number of nonzero entries in each column of **B** is constrained to be at least 2 and at most one-third of the number of observed variables. The sample size is 2000. Once the data are generated, we permute them into a random order. We use LiNGAM-GC-UK and pairwise lvLiNGAM to discover the causal structure. For LiNGAM-GC-UK, we use algorithm 4 to remove redundant edges, with the following parameters for the pruning algorithm: the number of bootstraps is 500, the bootstrap length is half the sample size, and we use a 98% confidence interval. For pairwise lvLiNGAM, we use the default settings. The results, averaged over 100 independent trials, are summarized in Table 5.
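The pruning step can be illustrated with the following Python/NumPy sketch. Here `estimate_strength` is a hypothetical stand-in for the causal-strength estimator of the main algorithm; an edge is kept only when the bootstrap confidence interval of its strength excludes zero, using the parameters quoted above (500 bootstraps, bootstrap length half the sample size, 98% confidence).

```python
import numpy as np

def keep_edge(x, y, estimate_strength, n_boot=500, conf=0.98, seed=0):
    """Bootstrap pruning: keep the edge x -> y only if zero lies outside
    the `conf` percentile interval of the resampled strength estimates."""
    rng = np.random.default_rng(seed)
    n = len(x)
    m = n // 2                                 # bootstrap length
    vals = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, m)
        vals[b] = estimate_strength(x[idx], y[idx])
    lo, hi = np.quantile(vals, [(1 - conf) / 2, (1 + conf) / 2])
    return not (lo <= 0.0 <= hi)

def ols_slope(x, y):
    """Illustrative strength estimator (ordinary least-squares slope);
    not the estimator used in the letter."""
    x = x - x.mean()
    return np.dot(x, y - y.mean()) / np.dot(x, x)
```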

| Network | Algorithm | CORR | TPR | FPR |
|---|---|---|---|---|
| 7 observed, 3 latent | LiNGAM-GC-UK | 0.9341 | 0.8243 | 0.0337 |
| 7 observed, 3 latent | pairwise lvLiNGAM | 0.9815 | 0.3249 | 0 |
| 11 observed, 4 latent | LiNGAM-GC-UK | 0.9043 | 0.7532 | 0.0276 |
| 11 observed, 4 latent | pairwise lvLiNGAM | 0.9940 | 0.1646 | 0 |
| 15 observed, 5 latent | LiNGAM-GC-UK | 0.9017 | 0.7775 | 0.0250 |
| 15 observed, 5 latent | pairwise lvLiNGAM | 0.9850 | 0.0870 | 0 |
| 20 observed, 8 latent | LiNGAM-GC-UK | 0.8710 | 0.7540 | 0.0168 |
| 20 observed, 8 latent | pairwise lvLiNGAM | 0.9946 | 0.0280 | 0 |


In Table 5, CORR is the correlation between the estimated causal strengths and the true ones; TPR is the true positive rate, the probability that an algorithm discovers an edge that really exists; and FPR is the false positive rate, the probability that an algorithm reports an edge that does not exist. From Table 5, we can see that both algorithms perform very well in terms of correlation: the correlation between the estimated causal strengths and the true ones is around 0.9 or higher, with pairwise lvLiNGAM performing slightly better than LiNGAM-GC-UK. This is reasonable because pairwise lvLiNGAM discovers only the partial causal structure, while LiNGAM-GC-UK discovers the whole structure; since LiNGAM-GC adopts the DirectLiNGAM framework, errors are unavoidably accumulated. Nevertheless, we can see that LiNGAM-GC-UK discovers most of the existing edges, while pairwise lvLiNGAM discovers only a very small portion of the edges, especially as the number of latent confounders increases. From the table, we can also see that pairwise lvLiNGAM never creates a redundant edge and that the probability that LiNGAM-GC-UK creates redundant edges is very small given a sufficiently large sample size.
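The three indices can be computed as in the following sketch (Python/NumPy). The exact conventions, such as whether CORR is taken over the true edges only and how the diagonal is handled, are our assumptions.

```python
import numpy as np

def edge_metrics(A_true, A_est, tol=1e-12):
    """CORR, TPR, and FPR between a true and an estimated causal strength
    matrix. Diagonal entries are excluded throughout, and CORR is computed
    over the true edges (assumed conventions)."""
    p = A_true.shape[0]
    off_diag = ~np.eye(p, dtype=bool)
    true_edges = (np.abs(A_true) > tol) & off_diag
    est_edges = (np.abs(A_est) > tol) & off_diag
    tpr = np.sum(true_edges & est_edges) / max(true_edges.sum(), 1)
    non_edges = off_diag & ~true_edges
    fpr = np.sum(non_edges & est_edges) / max(non_edges.sum(), 1)
    corr = np.corrcoef(A_true[true_edges], A_est[true_edges])[0, 1]
    return corr, tpr, fpr
```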

### 6.6. Real-World Data: Causal Discovery for the Hong Kong Stock Market.

In this section, we aim to discover the causal relationship among eight stocks selected from the Hong Kong stock market. (The data set is downloaded from Yahoo Hong Kong Finance: http://hk.finance.yahoo.com/stock/index.php.) The selected stocks are the constituents of Hang Seng Index (HSI):

- *X*_{1}: Cheung Kong (0001.HK)
- *X*_{2}: CLP Holdings (0002.HK)
- *X*_{3}: HSBC Holdings (0005.HK)
- *X*_{4}: Hang Seng Bank (0011.HK)
- *X*_{5}: Henderson Land (0012.HK)
- *X*_{6}: Hutchison (0013.HK)
- *X*_{7}: SHK PPT (0016.HK)
- *X*_{8}: Bank of E Asia (0023.HK)

Besides the mutual influences among the stocks, stock prices are also affected by a number of external factors, such as interest rates, oil prices, and gold prices. It is therefore unlikely that the causal sufficiency assumption holds for the stock market. To verify this hypothesis, we first apply pairwise lvLiNGAM (Entner & Hoyer, 2011) to the selected data set and observe that it returns an empty unconfounded subset, meaning that the causal sufficiency assumption does not hold and, more importantly, that all the selected stocks are affected by confounders.

Let *X*_{it} be the adjusted closing price of the *i*th stock on day *t*; the daily return *x*_{i}(*t*) is then obtained from the relative change of consecutive closing prices. We assume that the stock returns can be modeled by LiNGAM-GC. We are interested in discovering the causal structure matrix **A** in order to learn how the stocks influence one another. We apply our multivariate LiNGAM-GC (algorithm 3, using the two different normalization methods UV and UK) and the pruning algorithm (algorithm 4), with the number of bootstraps set to 3000 and the bootstrap sample size set to 900, to the returns **x**(*t*). Note that we also combine algorithm 2 with LiNGAM-GC-UK and observe that this data set passes the detection, suggesting that simultaneous supergaussianity is common in finance. We also apply C-M, LiNGAM-ICA, and Direct-LiNGAM to the same data set.
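For completeness, a minimal sketch of the return computation (Python/NumPy; we assume the standard relative daily return, as the exact formula is not quoted here):

```python
import numpy as np

def daily_returns(prices):
    """Relative daily return x_i(t) = (X_{i,t} - X_{i,t-1}) / X_{i,t-1},
    for `prices` of shape (num_stocks, num_days) holding adjusted closing
    prices. (Assumed definition; the letter's exact formula is not quoted.)"""
    prices = np.asarray(prices, dtype=float)
    return np.diff(prices, axis=1) / prices[:, :-1]
```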

In order to compare different models obtained by the methods, we use the
following two performance indices: mean square error (MSE) and mean square
fourth-order cross-cumulant (MSCC). Let **r**=(**I**−**A**)**x**,
where **A** is the discovered causal strength matrix obtained by the
compared methods.

- MSE measures the average squared magnitude of the residues **r**.
- MSCC is the mean square fourth-order cross-cumulant of the normalized residues, where each residue is normalized to unit standard deviation.

MSCC measures the high-order dependencies among the residues **r**. Since our model allows the presence of latent gaussian confounders, we choose the fourth-order cross-cumulant, which is immune to latent gaussian confounders. If the model fits the data set ideally, then MSCC=0; in other words, the smaller the MSCC, the better the model **A**.
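One way to realize the MSCC index is sketched below (Python/NumPy). We assume it averages squared fourth-order cross-cumulants of the form cum(r_i, r_i, r_i, r_j) over ordered pairs i ≠ j; the letter's exact set of cross-cumulants is not reproduced.

```python
import numpy as np

def cum4_xxxy(a, b):
    """Fourth-order cross-cumulant cum(a, a, a, b) for zero-mean series;
    it vanishes when a and b are independent or jointly gaussian."""
    return np.mean(a**3 * b) - 3.0 * np.mean(a * b) * np.mean(a**2)

def mscc(r):
    """Mean square fourth-order cross-cumulant of the normalized residues
    r (shape: num_series x num_samples). Assumed form of the index."""
    r = (r - r.mean(axis=1, keepdims=True)) / r.std(axis=1, keepdims=True)
    p = r.shape[0]
    vals = [cum4_xxxy(r[i], r[j]) ** 2
            for i in range(p) for j in range(p) if i != j]
    return float(np.mean(vals))
```

Independent residues give an MSCC near zero; residues sharing a nongaussian component give a clearly larger value.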

Table 6 shows the performances of the compared algorithms. We can see that the MSEs of the compared algorithms are quite close, while LiNGAM-GC-UK and LiNGAM-GC-UV have the smallest MSCC, which means that the residues **r** obtained by these two algorithms are closer to being mutually independent. That is, our proposed model better describes this real-world data set.

| Algorithm | MSE | MSCC |
|---|---|---|
| LiNGAM-GC-UK | 6.1156 | |
| LiNGAM-GC-UV | 4.6454 | |
| C-M | 11.1184 | |
| LiNGAM-ICA | 14.9365 | |
| DirectLiNGAM | 8.4913 | |


From the networks discovered by LiNGAM-GC, we have the following interesting findings, similar to those of Zhang and Chan (2008):

- Stocks belonging to the same subindex tend to be causally related. For example, in the causal networks discovered by LiNGAM-GC-UK and LiNGAM-GC-UV, *x*_{3} (HSBC Holdings), *x*_{4} (Hang Seng Bank), and *x*_{8} (Bank of E Asia) belong to the Hang Seng subindex in Finance. *x*_{2} (CLP Holdings) is the only selected stock that belongs to the subindex in Utility. *x*_{5} (Henderson Land), *x*_{7} (SHK PPT), and *x*_{1} (Cheung Kong) belong to the subindex in Property. *x*_{6} (Hutchison) belongs to the subindex in Commerce and Industry.
- Bank stocks are the causes in the causal network. In Figures 8 and 9, HSBC Holdings and Hang Seng Bank are the roots of the causal network. Bank of E Asia is affected only by another bank stock, Hang Seng Bank. In fact, HSBC Holdings, Hang Seng Bank, and Bank of E Asia are the three largest banks in Hong Kong.

- If there exists a holding relationship between two companies, there is possibly a causal relationship between their stocks, and the holding company is possibly the effect. For example, Cheung Kong holds 49.97% of Hutchison, and we find an edge from Hutchison to Cheung Kong in our causal network.

- Real estate stocks are more likely to be affected by other stocks; Cheung Kong, Henderson Land, and SHK PPT, the three largest real estate companies in Hong Kong, are examples.

Note that for real-world problems it is usually very difficult to obtain the ground truth, but the causal networks discovered by our method yield findings that are consistent with our empirical knowledge. Further interpretation is left for future work. In contrast, the above observations are not significant in the networks obtained by C-M, LiNGAM-ICA, and Direct-LiNGAM.

## 7. Conclusion

A causal model, the linear nongaussian acyclic model in the presence of latent gaussian confounders (LiNGAM-GC), a special case of lvLiNGAM, is investigated in this letter. By allowing the presence of latent confounders, this model is expected to give more accurate descriptions of real-world phenomena. To our knowledge, our model is one of the few that take latent confounders into consideration. To tackle the causal discovery problem of this model, we propose and rigorously analyze a new cumulant-based measure to infer the causal direction of cause-and-effect pairs under certain conditions, and we develop a simple and efficient method to detect violations of these conditions. We also extend our work to multivariate causal network discovery and propose a pruning method to estimate sparse causal networks. A series of experiments shows the effectiveness of our model and cumulant-based measure. Future work will focus on relaxing the assumptions made here and on developing robust estimation of high-order cumulants for small sample sizes.

## Acknowledgments

The work described in this letter was partially supported by a grant from the Research Grants Council of the Hong Kong Special Administration Region, China.

## References

## Notes

^{1}

Some preliminary results were presented in Chen and Chan (2012).

^{2}

We thank an anonymous reviewer for pointing out this problem.

^{3}

We select data sets where the relationships are close to linear. We apply simple preprocessing to pair 75 to make the relation more linear and use the two processed pairs instead of the original one.