Abstract

Causal discovery via the asymmetry between the cause and the effect has proved to be a promising way to infer the causal direction from observations. The basic idea is to assume that the mechanism generating the cause distribution p(x) and the mechanism generating the conditional distribution p(y|x) correspond to two independent natural processes, so that p(x) and p(y|x) fulfill some sort of independence condition. However, in many situations, this independence condition does not hold in the anticausal direction: if we consider p(x, y) as generated via p(y)p(x|y), there are usually some contrived mutual adjustments between p(y) and p(x|y). This kind of asymmetry can be exploited to identify the causal direction. Based on this postulate, in this letter, we define an uncorrelatedness criterion between p(x) and p(y|x) and, based on this uncorrelatedness, show an asymmetry between the cause and the effect: a certain complexity metric on p(x) and p(y|x) is less than the corresponding complexity metric on p(y) and p(x|y). We propose a Hilbert space embedding-based method, EMD (an abbreviation for EMbeDding), to calculate the complexity metric and show that this method preserves the relative magnitude of the complexity metric. Based on the complexity metric, we propose an efficient kernel-based algorithm for causal discovery. The contribution of this letter is threefold: the method allows a general transformation from the cause to the effect that incorporates the noise effect; it is applicable to both one-dimensional and high-dimensional data; and it can be used to infer the causal ordering for multiple variables. Extensive experiments on simulated and real-world data are conducted to show the effectiveness of the proposed method.

1.  Introduction

Inferring the causal relationships among observed variables is an important and challenging problem in many research fields, including philosophy and the natural and social sciences. One effective approach is to conduct randomized controlled experiments. Such experiments, however, are usually expensive and sometimes ethically prohibited or practically infeasible. Causal discovery from purely observational data is an alternative with great potential; it has therefore attracted considerable attention in the past decade, and various methods have been proposed (Pearl, 2000; Spirtes, Glymour, & Scheines, 2000; Shimizu, Hoyer, Hyvärinen, & Kerminen, 2006; Hoyer, Janzing, Mooij, Peters, & Schölkopf, 2008; Zhang & Hyvärinen, 2009; Janzing et al., 2012). For example, constraint-based methods (Pearl, 2000; Spirtes et al., 2000) exploit the causal Markov condition and have been widely used in the social sciences, medical science, and bioinformatics. However, these methods recover only the Markov equivalence class and fail to tell the causal direction for cause-and-effect pairs. Other methods, such as the linear non-gaussian acyclic model (LiNGAM) (Shimizu et al., 2006, 2011) and additive noise models (ANM) (Hoyer, Janzing et al., 2008), assume additive noise models, which represent an effect as a linear or nonlinear function of its parents with additive independent noise. The post-nonlinear (PNL) causal model (Zhang & Hyvärinen, 2009) further incorporates the nonlinear sensor or measurement distortion that is often encountered in practice. The advantage of this category of methods is that, under some conditions on the function and on the distributions of the cause and the noise, the causal direction is generally identifiable; that is, the model holds for at most one direction. The weakness is that the model assumptions may be unrealistic for real-world data. For example, LiNGAM assumes linearity of the causal mechanism, which might be too restrictive in the real world, and the method may report misleading results if the noise effect is not additive in the real data-generating mechanism. Another limitation of ANM and PNL is that they are usually difficult to generalize to multivariate causal network discovery problems.

Recently another view of causal discovery has been proposed by exploiting the asymmetry between the cause and the effect (Janzing & Schölkopf, 2010). Suppose x causes y.1 It has been postulated that the mechanism generating the cause distribution p(x) and that generating p(y|x) are due to two independent natural processes; p(x) and p(y|x) fulfill some sort of independence criterion, but this independence is generally violated for p(y) and p(x|y). This kind of asymmetry has been shown to be equivalent to the asymmetry that the factorization of the joint distribution p(x, y) into p(y|x)p(x) yields simpler terms than the factorization into p(x|y)p(y), where “simpler” is in terms of Kolmogorov complexity. This postulate is appealing but has generally remained conceptual; some theoretical and practical problems hinder its practical use. For example, we need a precise definition of the independence between p(x) and p(y|x), we need to justify the violation of the independence criterion between p(x|y) and p(y), and we need an efficient quantitative method to measure the independence or dependence. These problems remain open and challenging.

Some existing methods, such as information geometric causal inference (IGCI) (Daniušis et al., 2010; Janzing et al., 2012) and the linear trace method (LTr) (Janzing, Hoyer, & Schölkopf, 2010; Zscheischler, Janzing, & Zhang, 2011), have been proposed by exploiting the previously mentioned postulate. IGCI tackles the one-dimensional deterministic case, where the effect y is a nonlinear deterministic function of the cause x: y = f(x). Denote by g the inverse function of f, so that we can represent x by x = g(y). IGCI assumes that the density of the cause and the nonlinear function f fulfill an independence criterion defined as the uncorrelatedness between log f′(x) and p(x), and it is shown that, generally, this kind of independence in the causal direction leads to a positive correlation between log g′(y) and p(y). Based on this asymmetry, an entropy-based method was proposed to infer the causal direction, and the method achieves very good performance. However, IGCI addresses the noiseless case, which is usually unrealistic in real-world problems, and the presence of noise might deteriorate its performance. Furthermore, IGCI is not applicable to high-dimensional observations such as data in climate research and neuroscience, nor can it be used to obtain the causal graph for multiple variables.

The linear trace (LTr) method was recently developed to tackle the causal discovery problem for two linearly coupled high-dimensional observations x and y, where y = Ax and A is an m × n matrix. To make the causal direction identifiable, it is assumed that A and Σ_x, the covariance matrix of x, are generated independently and thus fulfill the trace condition τ_m(A Σ_x Aᵀ) = τ_n(Σ_x) τ_m(A Aᵀ), where τ_k denotes the normalized trace tr(·)/k (Janzing et al., 2010; Zscheischler et al., 2011). They showed that this trace condition is violated for the anticausal direction, consequently making it possible to identify the causal direction. However, the weaknesses of LTr are that it can be used only for high-dimensional data and that it assumes a linear deterministic relationship between the effect and the cause without considering the noise effect.

Motivated by this asymmetry-based postulate, in this letter, we propose a general method for causal discovery problems. We follow the same logic: p(x) and p(y|x) are generated by two independent natural mechanisms such that they fulfill some sort of independence criterion, but this independence criterion is strongly violated in the anticausal direction. We propose an independence criterion for p(x) and p(y|x) and show that this independence leads to an asymmetry between the cause and the effect. Specifically, we show that the complexity metric computed for the causal factorization p(y|x)p(x) is no greater than that computed for the anticausal factorization p(x|y)p(y), where the complexity metric measures the distance of a (marginal or conditional) distribution to a reference density. This complexity measure is theoretically sound but difficult to compute, especially for conditional distributions. To tackle this problem, we apply the Hilbert space embedding method (Gretton, Borgwardt, Rasch, Schölkopf, & Smola, 2012; Song, Huang, Smola, & Fukumizu, 2009): we embed the (marginal and conditional) densities of the observed variables and the densities of the reference distribution as points or operators in a reproducing kernel Hilbert space (RKHS). We show that the distance between two density functions p1(x) and p2(x) (defined as the integral of the squared difference of the two density functions over a certain domain) is proportional to the distance between the two embedded points μ[p1] and μ[p2] in the RKHS. Consequently, we can compute the complexity metric using kernel embeddings, and we propose a kernel-based method to infer the causal directions of cause-effect pairs.

The contributions of our kernel-based method are threefold. First, we do not assume any specific generating model for x and y: our method allows a general transformation from the cause to the effect that incorporates the noise effect. For example, the proposed method can be used for linearly or nonlinearly coupled cause-and-effect pairs, and it is applicable to both additive and multiplicative noise effects. Second, our method can be applied to both one-dimensional and high-dimensional data, exploiting the benefits of the kernel method. Finally, our method can easily be used to derive the causal ordering for multiple variables (three or more observed variables). This problem is important because people are often more interested in finding the causal ordering of multiple variables, but it has remained challenging. To verify the strengths of our proposed method, we conduct a series of experiments on synthetic and real-world data.

The rest of this letter is structured as follows. In section 2, we introduce the idea behind our proposed method, including the independence criterion and the asymmetry between the cause and the effect. In section 3, we introduce the reproducing kernel Hilbert space embedding and show how we can compute the complexity metric using kernel embeddings, and we also propose a method to sort a group of variables according to their causal ordering. In sections 4 and 5, we conduct a series of experiments on synthetic and real-world data. In section 6, we draw conclusions on our work.

2.  Cause-and-Effect Asymmetry

In this section, we propose an independence criterion between p(x) and p(y|x) and then derive an asymmetry between the cause and the effect based on the independence criterion in the causal direction. Before we present our main results, we give the following problem description.

2.1.  Problem Description.

Suppose we are given a pair of observed variables x and y with a joint distribution p(x, y), where x is n-dimensional and y is m-dimensional. Suppose the ground truth is that x is the cause and y is the effect (i.e., x → y), but we have no prior knowledge of the causal direction. Based purely on the observations, we would like to infer which of the two models, x → y or y → x, is more plausible. In this letter, we focus on the causal sufficiency case (Pearl, 2000); that is, there is no latent confounder (common cause) of the observed variables. Latent confounders might be present in many real-world phenomena, but they are usually difficult to detect, making it a great challenge to identify the causal direction; dealing with latent confounders is beyond the scope of this letter. Interested readers can refer to Chen and Chan (2013), Entner and Hoyer (2011), Hoyer, Shimizu, Kerminen, and Palviainen (2008), and Kawahara, Bollen, Shimizu, and Washio (2010). Furthermore, we also exclude the case where the causal mechanism is bidirectional, that is, where both variables affect each other simultaneously. Even with these assumptions, the problem is still challenging, as we have no prior information beyond the observations. To make the causal direction identifiable, we need to find a certain kind of asymmetry between the cause and the effect. In the following section, we show that if we accept a certain kind of independence between p(x) and p(y|x), we can justify that there is indeed some asymmetry between the cause and the effect.

2.2.  Asymmetry Between the Cause and the Effect.

To identify the causal direction given our observations, our basic idea is that p(x) and p(y|x) fulfill some sort of independence criterion and this independence criterion is usually violated for p(y) and p(x|y).

Suppose x and y take values in the domains C_x and C_y, respectively. If the relation between x and y is stochastic, we define h_1(x) and h_2(x) in terms of the conditional density p(y|x); if the relation between x and y is deterministic, that is, y = f(x), we define them in terms of the function f:

Definition 1. 
We say that the conditional probability distribution p(y|x) that maps the cause to the effect and the marginal distribution of the cause p(x) are independent if they fulfill the following uncorrelatedness condition:
$$\int_{C_x} h_u(x)\, p(x)\, dx \;=\; \frac{1}{|C_x|}\int_{C_x} h_u(x)\, dx \int_{C_x} p(x)\, dx, \qquad (2.1)$$
for u = 1, 2, given any fixed y, where |C_x| denotes the volume of the domain C_x.
The interpretation of the uncorrelatedness condition is that if we treat h_u(x) and p(x) as two random variables given any fixed y, the uncorrelatedness condition is equivalent to
$$\mathrm{Cov}_{x \sim U}\big(h_u(x),\, p(x)\big) = 0, \qquad (2.2)$$
where U is a uniform distribution on the domain C_x. Interestingly, it can be shown that if the uncorrelatedness condition holds in the causal direction, we can justify that there is indeed some asymmetry between the cause and the effect. Before we show this asymmetry, we propose the following measure, which enables us to quantify the complexities of the joint distribution factorizations corresponding to both directions:
Definition 2. 
The divergence of a probability density p1(x) with respect to a reference density p2(x) on the domain C is defined as
$$D(p_1 \,\|\, p_2) \;=\; \int_{C} \big(p_1(x) - p_2(x)\big)^2\, dx. \qquad (2.3)$$

If we choose a simple density such as the uniform distribution as the reference density, the divergence reflects the complexity of the given density to some extent. Note that the divergence defined here plays a role similar to that of the KL divergence, but it is much easier to manipulate, as we show later.
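As a purely illustrative numerical check (not part of the original derivation), the divergence in equation 2.3 can be approximated on a grid; the two example densities and the unit-interval domain below are our own choices.

```python
import numpy as np

# Hypothetical numerical check of the divergence in equation 2.3:
# D(p1 || p2) = integral of (p1(x) - p2(x))^2 dx, here on the domain [0, 1]
# with a uniform reference density. A strongly peaked density is "farther"
# from the uniform reference, so its divergence (complexity) is larger.
xs = np.linspace(0.0, 1.0, 2001)
dx = xs[1] - xs[0]

uniform = np.ones_like(xs)                      # reference density u(x) = 1 on [0, 1]
flat = 1.0 + 0.1 * np.sin(2 * np.pi * xs)       # nearly uniform density
peaked = np.exp(-0.5 * ((xs - 0.5) / 0.05) ** 2)
peaked /= np.trapz(peaked, xs)                  # normalize so it integrates to 1

def divergence(p1, p2, dx):
    """Squared-difference divergence: integral of (p1(x) - p2(x))^2 dx."""
    return np.sum((p1 - p2) ** 2) * dx

print(divergence(flat, uniform, dx))    # small: close to the reference
print(divergence(peaked, uniform, dx))  # large: far from the reference
```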

Based on the independence assumption on the causal direction, we show the asymmetry for the causal and anticausal directions in the following theorem:

Theorem 1. 
Given two uniform distributions u_x and u_y as reference distributions on the domains C_x and C_y, respectively, if the uncorrelatedness conditions defined in equation 2.1 hold, we have
formula
2.4
where
formula
2.5
Proof. 

Note that all integrations in the following derivation are performed over the domains C_x and C_y.

For the causal direction, it is straightforward to verify that
formula
formula
However, for the anticausal direction,
formula

This asymmetry can be used for causal discovery only if it vanishes when x and y are statistically independent. The following corollary confirms this property:

Corollary 1. 

If x and y are statistically independent, the asymmetry vanishes; that is, the complexity metrics for the two directions coincide.

Proof. 
If x and y are statistically independent, we have p(y|x)=p(y). According to Theorem 1, we have
formula

Note that whenever the two complexity metrics differ, there is an asymmetry between the cause and the effect. This asymmetry enables us to develop a computational method to infer the causal direction. In the following sections, we show that, under some mild conditions, the complexity metric in the causal direction is generally the smaller of the two.

Note that IGCI (Janzing et al., 2012) is innovative and inspiring, and our work follows the same line of research by making use of a certain type of independence between the distribution of the cause and the conditional distribution of the effect given the cause. The main difference is that IGCI uses the KL divergence to measure the distance of a density to a reference density, which is difficult to manipulate, whereas our proposed method uses a squared difference that turns out to be easy to calculate in an RKHS. (We show this property in a later section.) Furthermore, IGCI justifies only the one-dimensional nonlinear deterministic case, which seems restrictive. Our method, in contrast, allows a very general transformation from the (high-dimensional) cause to the (high-dimensional) effect and thus applies to a much wider range of causal discovery problems.

2.3.  Analysis of the Sign of the Asymmetry.

We treat p(y|x) and p(x) as two independent random processes indexed by x and assume that, given any fixed y, the process p(y|x) has a mean u(y) > 0 and a finite standard deviation, and the process p(x) has a mean 1 and a finite standard deviation. Specifically, we have
formula
2.6
We assume that u(y) and the standard deviations introduced above are all integrable in the domain C_y. We also assume that u(y) and the standard deviation of p(y|x) are two independent processes indexed by y, so that the corresponding expectations factorize. Similarly,
formula
2.7
We have
formula
and
formula
Consequently, we have
formula
In the following, we show that this difference is nonnegative.
Since the relevant function is convex on its domain, according to Jensen's inequality, we have
formula
We have
formula
The last inequality holds according to the Cauchy-Schwarz inequality. Consequently, the difference above is nonnegative, and thus the complexity metric in the anticausal direction is at least as large as that in the causal direction.

2.4.  What Value Does the Asymmetry Take On in Some Special Cases?

In this section, we analyze some special cases where y is generated from x by specific models, and we look into the sign of the asymmetry for such models.

2.4.1.  Deterministic Case.

The first case we analyze is the deterministic case, where y = f(x). Figure 1 gives an example of this case.

Figure 1: Deterministic case.

In this case, the relevant quantities can be written directly in terms of f. Thus,
formula
Supposing f(x) = x^n, we obtain
formula
From this result, we can see that the sign of the asymmetry depends on whether n = 1, n > 1, or n < 1.

2.4.2.  Linear Nongaussian Case.

Suppose the cause x and the noise are both uniformly distributed, with support [0, 1] (see Figure 2), and y is a linear function of x plus the noise. Let us examine the asymmetry in this case:
Figure 2: Linear nongaussian: uniform cause and noise.

formula
Note that the relevant moments can be computed directly from the uniform densities, from which the sign of the asymmetry in this case follows.

2.4.3.  Linear Gaussian Case.

Now we consider the linear gaussian case, where y is a linear function of x plus independent gaussian noise and x itself is gaussian. By symmetry, x can then also be written as a linear function of y plus independent gaussian noise. It is easy to show that the resulting conditional densities are uncorrelated with p(x) and p(y), respectively, so that the two directions yield equal complexity metrics. In summary, due to the symmetry of the linear gaussian case, the causal direction cannot be identified.

3.  Causal Discovery via Reproducing Kernel Hilbert Space Embeddings

In section 2, we showed that if a certain kind of independence criterion is assumed between p(x) and p(y|x), then this independence leads to the asymmetry between the cause and the effect, which enables us to discover the causal direction. In this section, we propose our causal discovery method via reproducing kernel Hilbert space embeddings. The basic idea is to map the densities of the variables and the reference densities to RKHS where the complexity metrics can be easily calculated. Then we show that the RKHS embedding preserves the relative magnitude of the complexity metrics. By comparing the complexity metrics, we can infer the causal direction of a cause-and-effect pair. We propose a method to estimate the complexity metrics using the empirical observations and a kernel-based method to infer the causal direction of two variables. This method is also extended to solve the causal ordering problem for multiple variables.

3.1.  Causal Inference Rule Based on the Asymmetry.

Based on the analysis in section 2, we know that generally there is an asymmetry between the cause and the effect if the independence criterion holds for the causal direction.

Let D_{x→y} denote the complexity metric for the factorization p(y|x)p(x) (the divergence between p_{xy} and p_{y|x}u_x), and let D_{y→x} denote the complexity metric for the factorization p(x|y)p(y) (the divergence between p_{xy} and p_{x|y}u_y). We then have the following causal inference rule:

  • If D_{x→y} < D_{y→x}, then x → y is concluded.

  • Else if D_{x→y} > D_{y→x}, then y → x is concluded.

  • Otherwise, we have no conclusion on the causal direction.

Although this causal inference rule seems appealing, it might be difficult to use in practice. There are some major obstacles: the densities, especially the conditional densities, are hard to estimate, and the required integration might not be tractable. To overcome these difficulties, we propose mapping the densities to an RKHS and manipulating them via their kernel embeddings.

3.2.  Measuring the Density Divergence by Hilbert Space Embeddings.

In section 2, we showed that by assuming a certain independence criterion between p(x) and p(y|x), we obtain the asymmetry between the causal direction and the anticausal direction. This asymmetry can be detected by the complexity metrics of the distributions with respect to the uniform reference density.

The original definition of the divergence in equation 2.3 is conceptually appealing but practically difficult to compute because it involves the density functions, whose estimation from data is challenging, especially for conditional densities and multivariate distributions. To tackle this difficulty, we apply the reproducing kernel Hilbert space embedding method. The basic idea, illustrated in Figure 3, is to embed the original densities (marginal and conditional distributions) as elements of an RKHS. Manipulations or operations on probability densities (e.g., the two-sample test) can then be performed on the embeddings instead of the density functions. By exploiting this property, we can calculate the complexity metric in the RKHS, which turns out to be very convenient using kernel methods. In the rest of this section, we focus on how to embed marginal and conditional probability distributions into the RKHS and justify that the distance between two embeddings is proportional to the divergence of the two original densities, a very important working condition for our causal inference rule.

Figure 3: Complexity metric computation in RKHS.

First, we introduce the kernel embeddings of distributions.

3.2.1.  Hilbert Space Embeddings.

When using Hilbert space embeddings, we represent a probability distribution by an element in RKHS.

An RKHS H on the domain C_x with a kernel k is a Hilbert space of functions such that its dot product satisfies the reproducing property,
$$\langle f, k(x, \cdot) \rangle_{\mathcal H} = f(x). \qquad (3.1)$$
The function k(x, ·) can also be viewed as the feature vector, denoted by φ(x), of x in the RKHS. The advantage of mapping x to the RKHS is that we can compute the inner product of feature vectors using the kernel function in the original space, ⟨φ(x), φ(x′)⟩ = k(x, x′).

3.2.2.  Embeddings of Marginal Distributions.

The embeddings of marginal and conditional distributions play central roles in our proposed method. First, we focus on the embedding of a marginal distribution. The embedding of a probability density p(x) is defined as
$$\mu_x \;:=\; \mathbb{E}_x[\varphi(x)] \;=\; \int_{C_x} \varphi(x)\, p(x)\, dx. \qquad (3.2)$$
It has been shown that if E_x[k(x, x)] < ∞, μ_x is guaranteed to be an element of the RKHS. The Hilbert space embedding of p(x) enjoys two attractive properties. First, if the kernel is characteristic, the mapping from p(x) to μ_x is injective, which means that different distributions are mapped to different points in the RKHS. An example of a characteristic kernel is the gaussian kernel. Second, the empirical estimate converges to μ_x quickly as the sample size increases.
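The following is a minimal sketch of how the squared RKHS distance between two empirical mean embeddings can be computed from samples, assuming a gaussian kernel; the helper names, the kernel width, and the example distributions are our own choices, not the letter's.

```python
import numpy as np

def gaussian_gram(A, B, width):
    """Gram matrix k(a, b) = exp(-||a - b||^2 / (2 * width^2)) for row vectors."""
    sq = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * width ** 2))

def embedding_distance_sq(X1, X2, width):
    """Squared RKHS distance between the empirical mean embeddings of two samples.

    ||mu1_hat - mu2_hat||^2 = mean K(X1,X1) + mean K(X2,X2) - 2 mean K(X1,X2),
    which is the (biased) squared MMD estimate of Gretton et al. (2012).
    """
    return (gaussian_gram(X1, X1, width).mean()
            + gaussian_gram(X2, X2, width).mean()
            - 2 * gaussian_gram(X1, X2, width).mean())

rng = np.random.default_rng(0)
X1 = rng.normal(0.0, 1.0, size=(300, 1))      # samples from p1
X2 = rng.uniform(-2.0, 2.0, size=(300, 1))    # samples from a uniform reference
print(embedding_distance_sq(X1, X2, width=1.0))
```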

3.2.3.  Embeddings of Conditional Distributions.

According to the definition of the marginal distribution embedding, we have
$$\mu_{y|x} \;:=\; \mathbb{E}_{y|x}[\varphi(y)] \;=\; \int_{C_y} \varphi(y)\, p(y|x)\, dy. \qquad (3.3)$$
Rather than estimating μ_{y|x} directly, which could be challenging, one can define an operator U_{y|x} such that μ_{y|x} = U_{y|x} φ(x). It has been shown that the operator defined as
$$\mathcal U_{y|x} \;=\; \mathcal C_{yx}\, \mathcal C_{xx}^{-1}, \qquad (3.4)$$
where C_{yx} is the cross-covariance operator defined as
$$\mathcal C_{yx} \;=\; \mathbb{E}_{xy}\big[\varphi(y) \otimes \varphi(x)\big], \qquad (3.5)$$
enjoys this property.
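For concreteness, here is a sketch of the standard regularized finite-sample estimator of a conditional mean embedding (in the spirit of Song et al., 2009); the kernel widths, regularization constant, and toy data are illustrative assumptions rather than the letter's settings.

```python
import numpy as np

def gaussian_gram(A, B, width):
    sq = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * width ** 2))

def conditional_embedding_weights(X, x_query, width, lam):
    """Weights beta such that mu_hat_{y|x_query} = sum_i beta_i * phi(y_i).

    Standard regularized estimator: beta = (K_X + lam * L * I)^{-1} k_X(x_query).
    """
    L = X.shape[0]
    K_X = gaussian_gram(X, X, width)
    k_x = gaussian_gram(X, x_query, width)          # shape (L, 1)
    return np.linalg.solve(K_X + lam * L * np.eye(L), k_x)

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(400, 1))
Y = np.sin(3 * X) + 0.1 * rng.normal(size=(400, 1))  # y depends on x

beta = conditional_embedding_weights(X, np.array([[0.5]]), width=0.3, lam=1e-3)
# Evaluating the embedding against phi(y*) approximates E[k(y, y*) | x = 0.5]:
y_star = np.array([[np.sin(1.5)]])
print((gaussian_gram(Y, y_star, 0.5).T @ beta).item())
```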

Suppose we have two distinct densities, p1(x) and p2(x); we can compute the distance between the two densities as D(p1 ∥ p2) = ∫ (p1(x) − p2(x))² dx. If we embed both densities into the RKHS and represent them as μ[p1] and μ[p2], we can compute the distance between the two embeddings as ∥μ[p1] − μ[p2]∥². Now the question is what the relationship is between these two quantities: does the Hilbert space embedding preserve the relative magnitudes of the complexity metrics, that is, does a larger density divergence always correspond to a larger embedding distance? We answer this question in the following section.

3.3.  Relation Between the Density Divergence and the Embedding Distance.

In order to use Hilbert space embeddings, we should be clear about the relationship between D(p1 ∥ p2) and ∥μ[p1] − μ[p2]∥². The following theorem shows that these two terms are proportional. Before we present our theorem, we introduce the concept of free independence (Voiculescu, 1997; Voiculescu, Dykema, & Nica, 1992; Zscheischler et al., 2011).

Definition 3 
(Voiculescu, 1997; Voiculescu et al., 1992). Let 𝒜 be an algebra and τ a linear functional on 𝒜 with τ(1) = 1. Then A, B ∈ 𝒜 are called free if
$$\tau\big(p_1(A)\, q_1(B)\, p_2(A)\, q_2(B) \cdots\big) = 0 \qquad (3.6)$$
for polynomials p_i, q_i, whenever τ(p_i(A)) = τ(q_i(B)) = 0.

It follows that if A and B are free independent, then τ(AB) = τ(A)τ(B) (Voiculescu, 1997; Voiculescu et al., 1992).

Theorem 2. 
For any two probability densities p1(x) and p2(x), denote by D(p1 ∥ p2) their divergence, which we assume to be finite, and by μ[p1] and μ[p2] the Hilbert space embeddings of p1(x) and p2(x), respectively. Assume that p1 and p2 can be expanded over the feature functions with coefficient vectors α and β. We also assume that the relevant Gram (covariance) matrix and the matrix formed from the coefficients are free independent in the sense of definition 3 (Zscheischler et al., 2011). We have
$$\big\| \mu[p_1] - \mu[p_2] \big\|^2 \;\propto\; D(p_1 \,\|\, p_2). \qquad (3.7)$$
Proof. 
Suppose the Hilbert space induced by the kernel k(x, y) is rich enough to contain p1 and p2, so that both densities can be written as expansions over the feature functions with coefficient vectors α and β, respectively. We have
formula
Similarly, we have .
formula
We also have
formula
Thus, we have
formula
Consequently, we have
formula

This theorem is important for our causal inference rule: if the kernel embedding method did not preserve the relative magnitudes of the complexity metrics, we could not use the embedding distance in place of the density divergence. Once this relation has been established, we can develop a sample-based causal inference rule, which we introduce in the following sections.

3.4.  Embedding of the Reference Density.

In our complexity metric, we need to calculate the distance between p_{yx} and p_{y|x}u_x. It is easy to obtain the embedding of p_{yx}, while it is not clear how to obtain the embedding of p_{y|x}u_x. In this section, we show how the embedding of p_{y|x}u_x can be obtained from the conditional embedding of p_{y|x} and the marginal embedding of u_x.

We use Hilbert space embeddings to represent the densities. Denote by μ[p_{xy}], μ[p_{y|x}u_x], μ[u_xu_y], μ[u_yu_x], and μ[p_{x|y}u_y] the kernel embeddings of p_{xy}, p_{y|x}u_x, u_xu_y, u_yu_x, and p_{x|y}u_y, respectively.

Denote by U_{y|x} and U_{x|y} the conditional embedding operators for the two directions. We have the following lemma:

Lemma 1. 
formula
3.8
Proof. 
According to the definition,
formula
The proof is similar for the backward direction.

3.5.  Empirical Estimations.

In this section, we show how to estimate the complexity metrics from the observations. Let x and y be the sample matrices of the variables x and y, respectively. Suppose x̃ and ỹ are sample matrices of randomly generated variables following uniform distributions, with the same dimensions as x and y, respectively. Denote by Φ_Y the feature matrix of y and by Φ_X the feature matrix of x, and denote by Φ_{X̃} and Φ_{Ỹ} the feature matrices of x̃ and ỹ.

We define the following measure for the causal direction:
formula
3.9
formula
3.10
We have
formula
Four terms involved in the causal inference rule can be estimated as follows:
formula
3.11
formula
3.12
formula
3.13
formula
3.14
where L is the sample size; λ is the regularization parameter; I is the identity matrix; K_X is the kernel Gram matrix on the samples x, and K_Y is the kernel Gram matrix on the samples y; K_{X̃} is the kernel Gram matrix on the samples x̃, and similarly, K_{Ỹ} is the kernel Gram matrix on the samples ỹ; K_{XX̃} is the cross-kernel matrix between the samples x and x̃, and K_{YỸ} is the cross-kernel matrix between the samples y and ỹ. We can see that the complexity metrics can be easily calculated using kernel methods.
We summarize the resulting procedure as algorithm 1 for causal discovery via RKHS embeddings: estimate the four embedding terms from the samples, compare the two directional complexity metrics, and output the direction with the smaller metric.
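Because equations 3.11 to 3.14 are not reproduced here, the following sketch is only one plausible reading of algorithm 1: it estimates the squared embedding distance between the joint density and each factorization with a uniform reference, using standard conditional-embedding weights, and compares the two directional scores. The function names, kernel widths, regularization constant, and example data are ours.

```python
import numpy as np

def gaussian_gram(A, B, width):
    sq = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * width ** 2))

def directional_score(X, Y, Xu, width_x, width_y, lam):
    """Estimate || mu[p_xy] - mu[p_{y|x} u_x] ||^2 in the tensor-product RKHS.

    X, Y: (L, d) sample matrices; Xu: samples from a uniform reference on the
    range of X. Column j of W = (K_X + lam*L*I)^{-1} K(X, Xu) gives the
    conditional-embedding weights for mu_hat_{Y | Xu_j}.
    """
    L = X.shape[0]
    K_X = gaussian_gram(X, X, width_x)
    K_Y = gaussian_gram(Y, Y, width_y)
    K_Xu = gaussian_gram(Xu, Xu, width_x)
    K_XXu = gaussian_gram(X, Xu, width_x)
    W = np.linalg.solve(K_X + lam * L * np.eye(L), K_XXu)
    term_joint = np.sum(K_X * K_Y) / L ** 2              # <mu[p_xy], mu[p_xy]>
    term_cross = np.sum(K_XXu * (K_Y @ W)) / L ** 2      # <mu[p_xy], mu[p_{y|x}u_x]>
    term_ref = np.sum(K_Xu * (W.T @ K_Y @ W)) / L ** 2   # <mu[p_{y|x}u_x], same>
    return term_joint - 2 * term_cross + term_ref

def emd_decide(X, Y, width_x=1.0, width_y=1.0, lam=1e-3, seed=0):
    """Return 'x->y', 'y->x', or 'undecided' by comparing the two scores."""
    rng = np.random.default_rng(seed)
    Xu = rng.uniform(X.min(0), X.max(0), size=X.shape)   # uniform reference for x
    Yu = rng.uniform(Y.min(0), Y.max(0), size=Y.shape)   # uniform reference for y
    d_xy = directional_score(X, Y, Xu, width_x, width_y, lam)
    d_yx = directional_score(Y, X, Yu, width_y, width_x, lam)
    if d_xy < d_yx:
        return "x->y"
    if d_xy > d_yx:
        return "y->x"
    return "undecided"

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=(500, 1))
y = np.tanh(2 * x) + 0.2 * rng.normal(size=(500, 1))
print(emd_decide(x, y))
```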

3.6.  Sorting a Group of Variables.

In this section, we extend our work to discover the causal ordering for a group of variables that forms a directed acyclic graph (DAG). That is, we can sort this group of variables according to their causal ordering in such a way that no later variable is the cause of earlier variables. This is an important but challenging problem. The main challenges come from the complex relations among the variables and the computational complexity due to the size of the group. In this section, we propose the following algorithm to sort a group of causally correlated variables.

Denote by G the set of variables to be sorted, by S the set of sorted variables (initially the empty set), and by R = G \ S the set of remaining variables. To infer the whole causal ordering of multiple variables, we exploit the acyclicity property in an iterative way. Specifically, we find the root of the causal graph, add the root to S, and update R = G \ S. Then we continue to find the root variable in R, add this second root to S, and update R again. This procedure is repeated until we obtain the whole causal ordering of the set of variables.

Now we focus on how to find the root. Initially we have S = ∅ and R = G. To find a root of the variables in R, for any variable x_i ∈ R, we split the group into two subsets, {x_i} and A_i = R \ {x_i}. We compute the complexity metrics for the candidate directions x_i → A_i and select the variable with the minimal complexity metric as the root of the variables in R. Once we have identified the root, we add it to S and update R = G \ S. Then we continue to find the root for the variables in R. Following the same procedure, for any variable x_i ∈ R, we split R into {x_i} and A_i; we then combine A_i with S to obtain an augmented set, compute the complexity metrics as before, and pick the variable with the minimal complexity metric to add to S, updating R accordingly. We continue in this way until we obtain the full causal ordering of the variables in G. A more sophisticated way of picking a root is, at each step, to compute the p-value for the hypothesis that the candidate direction yields the smaller complexity metric and then select the variable with the highest p-value.
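A compact sketch of this root-finding loop, assuming a pairwise directional score function is available (for example, a wrapper around directional_score from the previous sketch); the augmentation of later steps with the already-sorted variables is handled only in a simplified form here.

```python
import numpy as np

def causal_ordering(data, score):
    """Sort variables by iteratively extracting a root, as in section 3.6.

    data: dict mapping variable names to (L, d) sample matrices.
    score(X, Y): complexity metric for the direction X -> Y (smaller means
    more plausible as causal direction). At each step the candidate with the
    smallest metric toward the remaining variables is taken as the next root;
    the refinement of grouping already-sorted variables with the candidate in
    later steps is omitted for brevity.
    """
    ordering, remaining = [], list(data)
    while len(remaining) > 1:
        scores = {}
        for v in remaining:
            rest = np.hstack([data[u] for u in remaining if u != v])
            scores[v] = score(data[v], rest)
        root = min(scores, key=scores.get)
        ordering.append(root)
        remaining.remove(root)
    return ordering + remaining

# Usage sketch: wrap a pairwise directional complexity estimator, e.g.
#   score = lambda X, Y: directional_score(X, Y, uniform_like(X), 1.0, 1.0, 1e-3)
# where directional_score is from the EMD sketch above and uniform_like is a
# hypothetical helper that draws a uniform reference sample on the range of X.
```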

4.  Simulations

As shown in the previous sections, our proposed EMD method is able to discover causal directions for both one-dimensional and high-dimensional cause-and-effect pairs. Furthermore, it is applicable to sorting a group of variables according to their causal ordering. In order to investigate the practical performance of EMD on these causal discovery problems, in this section, we conduct a series of experiments on synthetic data.

4.1.  Independent Pairs.

In this section, we investigate whether the proposed method could give faulty results if the pairs are statistically independent. Generally, as shown in corollary 1, if two variables are independent, the two directional complexity metrics coincide. However, due to the finite sample size, the estimated metrics typically differ by a small nonzero amount. In order to solve this problem, we use the bootstrap method. We use a resampling technique to generate N replicates of the original cause-and-effect pairs, and for each pair, we calculate the difference between the two directional complexity metrics and test whether the difference is significantly different from zero at a preset significance level. If the difference is not significantly different from zero, we conclude that there is no causal relationship between the two variables.
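A sketch of one way to implement this bootstrap check; the confidence-interval construction and the default significance level are our assumptions, since the exact test statistic is not spelled out here.

```python
import numpy as np

def bootstrap_direction_test(x, y, score, n_boot=100, alpha=0.05, seed=0):
    """Resampling check of whether the two directional scores really differ.

    score(a, b) is a directional complexity metric (smaller = more causal).
    We resample the pairs with replacement, recompute the score difference,
    and declare "no causal conclusion" when the bootstrap distribution of the
    difference is not significantly away from zero.
    """
    rng = np.random.default_rng(seed)
    L = x.shape[0]
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, L, size=L)          # resample the pairs jointly
        diffs.append(score(x[idx], y[idx]) - score(y[idx], x[idx]))
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    if lo <= 0.0 <= hi:
        return "no causal conclusion"
    return "x->y" if np.mean(diffs) < 0 else "y->x"
```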

In this experiment, we first consider the one-dimensional case. The experimental setting is as follows. We generate two independent one-dimensional random variables with gaussian, supergaussian, or subgaussian distributions. We use the bootstrap-based method to identify the causal direction and calculate the percentage of cases in which our method reports no causal relation between the two variables. We also investigate the performance for different sample sizes from 100 to 500. The results are illustrated in Table 1.

Table 1:
Accuracy of Correctly Identified One-Dimensional Independent Pairs.

                          Sample Size
Distribution       100     200     300     400     500
Gaussian           99%     98%     96%     96%     100%
Supergaussian      98%     99%     100%    100%    100%
Subgaussian        100%    100%    97%     98%     96%

In the second part of this experiment, we generate high-dimensional independent pairs with dimension equal to 5. We generate the pair by x = A z_x and y = B z_y, where z_x and z_y are isotropic gaussian, and we investigate two cases. In the first case, A and B are two identity matrices, so that x and y are both isotropic gaussian. In the second case, both A and B are random matrices with appropriate dimensionality, so that x and y are nonisotropic gaussian. As in the first experiment, we calculate the accuracy of our method. The result is presented in Table 2.

Table 2:
Accuracy of Correctly Identified High-Dimensional Independent Pairs.

                              Sample Size
Distribution             100     200     300     400     500
Isotropic gaussian       98%     94%     96%     97%     99%
Nonisotropic gaussian    92%     89%     93%     92%     89%

From Tables 1 and 2, we can see that when two random variables are independent, our proposed method correctly reports the absence of a causal relation between them. Thus, the false-positive rate is small.

4.2.  One-Dimensional Causal Pairs.

In this section, we use our proposed EMD to discover the causal directions for one-dimensional cause-and-effect pairs. First, we draw samples of the cause x from three densities, denoted p_1(x), p_2(x), and p_3(x).

These three densities are unimodal, bimodal, and trimodal distributions, respectively. We use these densities in order to investigate the performances of compared methods against different cause distributions with different complexities. We investigate the following four causal mechanisms mapping x to y:

  • M1: y = 0.8x + n

  • M2:

  • M3:

  • M4: y = atan(x)^3 + n,

where n follows a generalized normal distribution with shape parameter 10; thus, n is subgaussian. These four mapping mechanisms involve linear and nonlinear functions as well as additive and multiplicative noise effects. Figure 4 shows plots of the different combinations of distributions and mapping mechanisms.

Figure 4: Plots of different distributions and mechanisms.

We generate 100 replicates of the cause-and-effect pairs with a sample size of 500 and use EMD, LiNGAM, and IGCI to infer the causal direction. For EMD, we use a gaussian kernel whose width is set according to d_M, the median of the distances among all input patterns. For LiNGAM and IGCI, we use the default settings.
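For reference, the median distance d_M used to set the kernel width can be computed as follows (the common median heuristic); how the width is scaled relative to d_M in these experiments is not reproduced here, so this only shows the d_M computation itself.

```python
import numpy as np

def median_heuristic_width(X):
    """d_M: median of pairwise Euclidean distances among the input patterns."""
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(-1))
    return np.median(dists[np.triu_indices(len(X), k=1)])
```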

The accuracies of all algorithms are presented in Table 3. From the results in the table, IGCI and LiNGAM perform poorly in certain scenarios. For example, IGCI performs poorly when the relationship between the cause and the effect is close to linear. LiNGAM fails to discover the correct causal direction when the relationship between the cause and the effect is strongly nonlinear and the noise effect is multiplicative. EMD generally has the best overall performance, although it does not perform well in scenarios (p1(x), M1) and (p1(x), M4). One possible explanation is that p1(x) and the mechanisms M1 and M4 are too simple, so that the cause-and-effect asymmetry is not significant enough to detect with finite samples.

Table 3:
Results of One-Dimensional Cause-and-Effect Pairs.

                          Mechanism
         Algorithm    M1      M2      M3      M4
p1(x)    EMD          42%     100%    100%    47%
         IGCI         20%     100%    100%    14%
         LiNGAM       100%    47%     24%     100%
p2(x)    EMD          98%     100%    100%    98%
         IGCI         68%     100%    100%    78%
         LiNGAM       100%    52%     6%      100%
p3(x)    EMD          100%    100%    100%    99%
         IGCI         34%     98%     97%     51%
         LiNGAM       100%    71%     0%      100%

4.3.  Robustness Against the Parameter Selections.

The selection of parameters, including the regularization parameter and the kernel width, is a common problem for all kernel methods. Some work (Gretton, Sejdinovic et al., 2012) has discussed the related problem of selecting an optimal combination of kernels, but it remains challenging to extend that framework to optimization over the parameters of a single kernel.

In our EMD method, the regularization parameter and the kernel width should be selected carefully. In order to investigate the robustness of EMD against parameter selection, we conduct the following experiment.

We use the same experimental setting as the previous experiment but focus only on the first mapping mechanism from x to y. In the first part of the experiment, we investigate the robustness against the regularization parameter: we fix the kernel width, vary the regularization parameter over several orders of magnitude of the form 10^a, and calculate the accuracy of EMD as in the previous experiment. In the second part of this experiment, we fix the regularization parameter and use different kernel widths: d_M and several multiples of it. The experimental results are shown in Figure 5.

Figure 5: Robustness against parameter selection.

From Figure 5, we can see that we should carefully select the regularization parameter for EMD. If the regularization parameter is too large, EMD might not be able to detect the nongeneric dependence between the conditional probability density that maps the effect to the cause and the marginal probability density of the effect. However, if it is too small, there is a high possibility that the model overfits the data, and consequently there are dependencies between the marginal and the conditional probability densities for both the causal and the anticausal directions. Thus, we should choose an appropriate value for the regularization parameter carefully. In contrast, the method seems quite robust against the selection of the kernel width: as shown in the figure, EMD almost perfectly identifies the causal direction for the 100 randomly generated cause-and-effect pairs.

4.4.  High-Dimensional Cause-Effect Pairs.

In this experiment, we focus on causal discovery for high-dimensional data. We follow experimental settings similar to those of Chen, Zhang, and Chan (2013). We generate the cause x by x = Az, where the density p(z) is selected from the densities defined in the previous experiment. We map x to y by y_i = f(u_i + d_i), where u = Bx and d is a random shift, and we use three different nonlinear functions for f.
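A sketch of this data-generating process, with tanh standing in for the unspecified nonlinear functions f and a uniform latent density standing in for p(z); both are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, L = 5, 500

A = rng.normal(size=(dim, dim))          # mixing matrix for the cause
B = rng.normal(size=(dim, dim))          # linear map inside the mechanism
d = rng.normal(size=(1, dim))            # random shift
z = rng.uniform(-1, 1, size=(L, dim))    # latent source (placeholder for p(z))

x = z @ A.T                              # cause: x = A z (row-wise samples)
u = x @ B.T                              # u = B x
y = np.tanh(u + d)                       # effect: y_i = f(u_i + d_i), f illustrative
# Stochastic variant: add independent standard gaussian noise to the effect.
y_noisy = y + rng.normal(size=y.shape)
```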

First, we investigate the deterministic case. We compare the performances of EMD, LTr, and the kernelized trace method (KTr) (Chen et al., 2013) with dimension 5 and a sample size of 500. For EMD and KTr, we use a kernel width set as in the previous experiments. The results are summarized in Figure 6.

Figure 6: Accuracies of the compared methods on high-dimensional cause-and-effect pairs: deterministic case.

KTr has the best overall performance, followed by EMD. KTr is a kernel-based method proposed for tackling the causal discovery problem, especially for the high-dimensional cause-effect pairs, which are nonlinearly and deterministically coupled. LTr does not perform satisfactorily since the model assumption is strongly violated. Although KTr outperforms EMD in the deterministic case, the noise effect is always encountered in the real world; thus, we compare their performances in the noisy case.

We then investigate the stochastic case, in which independent standard gaussian noise is added to the effect. We again compare the performance of the three algorithms. The results are summarized in Figure 7.

Figure 7: Accuracies of compared methods on high-dimensional cause-and-effect pairs: the stochastic case.

From Figure 7, we see that EMD has the best overall performance, which shows that it can be generalized very well to high-dimensional causal discovery problems with strong noise effects. KTr has the worst performance since it assumes no noise is present in the data-generating mechanism.

From these results, we can see that EMD allows a general transformation from the cause to the effect, making it suitable for real-world problems.

4.5.  Discovering the Causal Ordering for Multiple Variables.

In this section, we use our proposed EMD method to discover the total causal ordering of a set of variables that form a directed acyclic graph (DAG). We generate a directed acyclic graph over four variables with the ground-truth causal ordering x1 < x2 < x3 < x4 in the following way.

We use an artificial neural network with five neurons in the hidden layer and a nonlinear activation function to map the cause to the effect; denote this mapping by effect = N(cause). We generate each effect as x_i = N(pa_i) + e_i, where pa_i denotes the parents of x_i and e_i is independent supergaussian noise.
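A sketch of the chain-graph generator described above; the tanh activation and the Laplace choice of supergaussian noise are illustrative assumptions, not the original settings.

```python
import numpy as np

rng = np.random.default_rng(0)
L, hidden = 500, 5

def random_mlp(inp):
    """One-hidden-layer network with 5 tanh units and random weights."""
    d = inp.shape[1]
    W1, b1 = rng.normal(size=(d, hidden)), rng.normal(size=hidden)
    W2, b2 = rng.normal(size=(hidden, 1)), rng.normal()
    return np.tanh(inp @ W1 + b1) @ W2 + b2

def supergaussian_noise(size):
    # Laplace noise as one convenient supergaussian choice (illustrative).
    return rng.laplace(scale=0.3, size=size)

# Chain graph x1 -> x2 -> x3 -> x4: each variable is a nonlinear function of
# its single parent plus independent supergaussian noise.
x1 = rng.uniform(-1, 1, size=(L, 1))
x2 = random_mlp(x1) + supergaussian_noise((L, 1))
x3 = random_mlp(x2) + supergaussian_noise((L, 1))
x4 = random_mlp(x3) + supergaussian_noise((L, 1))
```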

In the first case, we generate a chain graph where each variable (except the root) has one and only one direct parent. In the second case, we generate a fully connected graph where there is a directed edge between any two variables. Figures 8 and 9 show these two different topologies of the causal graphs.

Figure 8: Chain graph.

Figure 9: Fully connected graph.

We conduct 100 independent trials with a sample size of 500 and use EMD and LiNGAM to discover the total ordering of the generated set of variables. We compare the accuracies of EMD and LiNGAM in correctly identifying (1) the whole causal ordering, (2) the first variable of the graph, and (3) the first two variables of the graph. Figure 10 shows the experimental results.

Figure 10: Accuracies of compared methods on sorting four variables.

From the results, we can see that our proposed EMD-based sorting algorithm has good performance on both graphs, especially in finding the root of the graph (almost 100%). However, LiNGAM achieves less than 50% accuracy in finding the full causal ordering, possibly due to underfitting of the model. From Figure 11, we can see that the relations among the variables are nonlinear and complicated; ignoring the nonlinear effect might lead to incorrect results. In this experimental setting, we use a neural network to map the cause to the effect, which is a very general transformation. EMD still successfully discovers the causal ordering, and thus we expect it to have good performance on real-world data as well. In the following section, we apply EMD to discover the causal directions for real-world cause-and-effect pairs and to sort a group of real-world variables.

Figure 11: Plots of four variables.

5.  Real-World Data

5.1.  Cause-and-Effect Pairs.

In order to verify the applicability of our proposed method to real-world problems, we use the widely used cause-and-effect pairs.2 There are 79 pairs, each with a confidence rating for the ground-truth causal direction. We use EMD, LiNGAM, IGCI, ANM, PNL-MLP, and GPI (Mooij, Stegle, Janzing, Zhang, & Schölkopf, 2010) to discover the causal direction of each pair.3 We use the resampling method to generate 100 replicates and calculate the accuracies of all methods. The results are presented in Figure 12. This task is challenging: for real-world data, the mapping mechanism from the cause to the effect is complicated, especially with large amounts of noise. Nevertheless, EMD still performs well, with an average accuracy close to 75% and a highest accuracy of 86%. IGCI and GPI also perform quite well, followed by PNL-MLP and ANM. LiNGAM performs the worst, which might be due to its model restriction.

Figure 12: Accuracies of compared methods on real-world cause-and-effect pairs.

5.2.  Sorting Stock Indices in Asia, Europe, and the United States.

In order to verify the effectiveness of our proposed sorting algorithm, we use stock indices from Asia, Europe, and the United States. Due to the time difference and the efficient market hypothesis, we expect the following ground truth for the causal ordering: Asia → Europe → United States. We collect three different stock indices for each region. For each experiment, we pick one stock index from each region and use EMD and LiNGAM to sort the three indices; thus, there are 3 × 3 × 3 = 27 cases. Note that the causal relations among stock indices in different regions are more likely to be nonlinear, and thus methods such as ANM and IGCI seem more appropriate than LiNGAM. However, such methods, including ANM and IGCI, are currently used mainly for causal discovery in the two-variable case, and extending ANM and IGCI to multivariate causal discovery is nontrivial. Although some work has been devoted to multivariate causal discovery (e.g., Mooij, Janzing, Peters, & Schölkopf, 2009; Zhang & Hyvärinen, 2009), it is far from satisfactory. Thus, in this experiment, we compare only the performance of EMD with LiNGAM. The experimental results are shown in Tables 4 to 6.

Table 4:
Stock Indices Sorting.

Algorithm    EMD       LiNGAM
Accuracy     66.67%    0%

Table 5:
Stock Indices Sorting.

Algorithm    EMD       LiNGAM
Accuracy     88.89%    0%

Table 6:
Stock Indices Sorting.

Algorithm    EMD       LiNGAM
Accuracy     88.89%    44.44%

From Tables 4 to 6, we can see that EMD gives quite encouraging performance in sorting the stock indices of different regions. The overall accuracy is 81.48%, much higher than the overall accuracy of LiNGAM (14.81%). Note that we also conduct the experiment using vector-valued data, since we want to infer the causal relations among the three vector-valued series for Asia, Europe, and the United States. We apply EMD to these data, and the result is Asia → Europe → United States, which is consistent with the results in Tables 4 to 6. From this real-world experiment, we can see that the data-generating mechanism might not be simply linear. A linear structural equation model (of which LiNGAM is a special case) might not be able to correctly describe the data-generating mechanism, and model misspecification might lead to incorrect conclusions. EMD, however, does not assume any specific model for the data-generating mechanism, which enables it to deal with real-world data whose generating process is not clear to us.

6.  Discussion and Conclusion

In this letter, based on certain independence assumptions for the causal direction, we prove the asymmetry between the causal direction and the anticausal direction. By exploiting this asymmetry, we propose a reproducing kernel Hilbert space embedding-based method to solve a general causal discovery problem. The strengths of our proposed method are that it requires no specific assumption on the generating mechanism from the cause to the effect and that it can be applied to both one-dimensional and high-dimensional, linearly or nonlinearly coupled cause-and-effect pairs. By embedding the data into an RKHS, we avoid explicitly estimating the probability densities, especially the conditional probability densities, which could be very challenging for high-dimensional data. Furthermore, the method can be applied to sorting a set of variables according to their causal ordering. Extensive experiments have been conducted, and the results are very encouraging. Nevertheless, several questions remain for future work. What is the interpretation of the independence criterion? How can we test whether it holds in real-world data, or to what extent it holds? How can we reduce the computational complexity, especially for the sorting algorithm? These problems are the focus of our future work.

Acknowledgments

The work described in this letter was partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China.

References

Chen, Z., & Chan, L. (2013). Causality in linear nongaussian acyclic models in the presence of latent gaussian confounders. Neural Computation, 25(6), 1605–1641.

Chen, Z., Zhang, K., & Chan, L. (2013). Nonlinear causal discovery for high dimensional data: A kernelized trace method. In Proceedings of the IEEE International Conference on Data Mining. Piscataway, NJ: IEEE.

Daniušis, P., Janzing, D., Mooij, J., Zscheischler, J., Steudel, B., Zhang, K., & Schölkopf, B. (2010). Inferring deterministic causal relations. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence (pp. 143–150). Corvallis, OR: AUAI Press.

Entner, D., & Hoyer, P. (2011). Discovering unconfounded causal relationships using linear non-gaussian models. In New Frontiers in Artificial Intelligence (pp. 181–195). New York: Springer.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., & Smola, A. (2012). A kernel two-sample test. Journal of Machine Learning Research, 13, 723–773.

Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., & Sriperumbudur, B. K. (2012). Optimal kernel choice for large-scale two-sample tests. In F. Pereira, C. J. C. Burges, L. Bottou, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 25 (pp. 1214–1222). Red Hook, NY: Curran.

Hoyer, P. O., Janzing, D., Mooij, J. M., Peters, J., & Schölkopf, B. (2008). Nonlinear causal discovery with additive noise models. In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.), Advances in neural information processing systems, 21 (pp. 689–696). Cambridge, MA: MIT Press.

Hoyer, P., Shimizu, S., Kerminen, A., & Palviainen, M. (2008). Estimation of causal effects using linear non-gaussian causal models with hidden variables. International Journal of Approximate Reasoning, 49(2), 362–378.

Janzing, D., Hoyer, P. O., & Schölkopf, B. (2010). Telling cause from effect based on high-dimensional observations. In Proceedings of the 27th International Conference on Machine Learning (pp. 479–486). Madison, WI: International Machine Learning Society.

Janzing, D., Mooij, J., Zhang, K., Lemeire, J., Zscheischler, J., Daniušis, P., … Schölkopf, B. (2012). Information-geometric approach to inferring causal directions. Artificial Intelligence, 182, 1–31.

Janzing, D., & Schölkopf, B. (2010). Causal inference using the algorithmic Markov condition. IEEE Transactions on Information Theory, 56(10), 5168–5194.

Kawahara, Y., Bollen, K., Shimizu, S., & Washio, T. (2010). GroupLiNGAM: Linear non-gaussian acyclic models for sets of variables. arXiv preprint arXiv:1006.5041.

Mooij, J., Janzing, D., Peters, J., & Schölkopf, B. (2009). Regression by dependence minimization and its application to causal inference in additive noise models. In Proceedings of the 26th Annual International Conference on Machine Learning (pp. 745–752). New York: ACM Press.

Mooij, J. M., Stegle, O., Janzing, D., Zhang, K., & Schölkopf, B. (2010). Probabilistic latent variable models for distinguishing between cause and effect. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel, & A. Culotta (Eds.), Advances in neural information processing systems, 23 (pp. 1687–1695). Red Hook, NY: Curran.

Pearl, J. (2000). Causality: Models, reasoning, and inference. Cambridge: Cambridge University Press.

Shimizu, S., Hoyer, P., Hyvärinen, A., & Kerminen, A. (2006). A linear non-gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7, 2003–2030.

Shimizu, S., Inazumi, T., Sogawa, Y., Hyvärinen, A., Kawahara, Y., Washio, T., … Bollen, K. (2011). DirectLiNGAM: A direct method for learning a linear non-gaussian structural equation model. Journal of Machine Learning Research, 12, 1225–1248.

Song, L., Huang, J., Smola, A., & Fukumizu, K. (2009). Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proceedings of the 26th Annual International Conference on Machine Learning (pp. 961–968). New York: ACM Press.

Spirtes, P., Glymour, C., & Scheines, R. (2000). Causation, prediction, and search. Cambridge, MA: MIT Press.

Voiculescu, D. V. (1997). Free probability theory. New York: AMS Press.

Voiculescu, D. V., Dykema, K. J., & Nica, A. (1992). Free random variables. Providence, RI: American Mathematical Society.

Zhang, K., & Hyvärinen, A. (2009). On the identifiability of the post-nonlinear causal model. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (pp. 647–655). Corvallis, OR: AUAI Press.

Zscheischler, J., Janzing, D., & Zhang, K. (2011). Testing whether linear equations are causal: A free probability theory approach. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence (pp. 839–846). Corvallis, OR: AUAI Press.

Notes

1

In this letter, we denote by x a scalar random variable, by a boldface x a vector of random variables, and by the corresponding sample matrix the collection of its realizations.

3

ANM, PNL-MLP and GPI are time-consuming, so we run them just once.