Abstract

In this article, we deal with the problem of inferring causal directions when the data are on a discrete domain. By considering the distribution of the cause P(X) and the conditional distribution mapping cause to effect P(Y|X) as independent random variables, we propose to infer the causal direction by comparing the distance correlation between P(X) and P(Y|X) with the distance correlation between P(Y) and P(X|Y). We infer that X causes Y if the dependence coefficient between P(X) and P(Y|X) is the smaller one. Experiments are performed to show the performance of the proposed method.

1  Introduction

Inferring the causal direction between two variables from observational data has become an active research topic. Additive noise models (ANMs) (Zhang & Hyvärinen, 2009a, 2009b; Shimizu, Hoyer, Hyvärinen, & Kerminen, 2006; Hoyer, Janzing, Mooij, Peters, & Schölkopf, 2009; Peters, Janzing, & Schölkopf, 2011; Hoyer & Hyttinen, 2009; Shimizu et al., 2011; Hyvärinen & Smith, 2013; Hyvärinen, Zhang, Shimizu, & Hoyer, 2010; Hyttinen, Eberhardt, & Hoyer, 2012; Mooij, Janzing, Heskes, & Schölkopf, 2011) are preliminary attempts to solve this problem. They assume that the effect is governed by the cause and an additive noise, and causal inference is done by finding the direction that admits such a model. Recently, under another view of exploiting the asymmetry between cause and effect, the linear trace method (LTr) (Zscheischler, Janzing, & Zhang, 2011; Janzing, Hoyer, & Schölkopf, 2010) and information-geometric causal inference (IGCI) (Janzing et al., 2012) have been proposed. Suppose X is the cause and Y is the effect. Based on the postulate that the generation of P(Y|X) is independent of that of P(X) (Janzing & Schölkopf, 2010; Lemeire & Janzing, 2013; Schölkopf et al., 2012), LTr suggests that the trace condition is fulfilled in the causal direction and violated in the anticausal direction, and IGCI shows that the density of the cause and the log slope of the function transforming cause to effect are uncorrelated, while the density of the effect and the log slope of the inverse function are positively correlated. By assessing these so-called cause-effect asymmetries, one can determine the causal direction. A kernel method using the framework of IGCI to deal with high-dimensional variables was later developed (Chen, Zhang, Chan, & Schölkopf, 2014), and nonlinear extensions of the trace method have been presented (Chen, Zhang, & Chan, 2013).

In some situations, the variables of interest are on discrete domains, and researchers have adapted additive noise models to discrete data for causal inference (Peters, Janzing, & Schölkopf, 2010, 2011). Given observations of the variable pair, they do regressions in both directions and test the independence between the residuals and the regressors. The direction that admits an additive noise model is inferred as the causal direction. However, ANMs may not be suitable for modeling discrete variables in some situations; for example, it is not natural to adopt ANMs to model categorical variables. Methods with wider applicability would be valuable for causal inference on discrete data.

Motivated by the postulate that the generation of P(Y|X) is independent of that of P(X), we suggest that the pair (P(x), P(Y|x)) is an observation of a pair of random variables that are independent of each other (here P(x) refers to the probability at one specified point x). To infer the causal direction, we calculate the dependence coefficients between P(X) and P(Y|X) and between P(Y) and P(X|Y). The direction that induces the smaller correlation is then inferred as the causal direction. Without a functional causal model assumption, our method has wider applicability than traditional ANMs. Various experiments are conducted to demonstrate the performance of the proposed method.

This article is organized as follows. Section 2 defines the problem. Section 3 presents the causal inference principle. Section 4 gives the detailed causal inference method. Section 5 shows the experiments, and section 6 concludes.

2  Problem Description

Suppose we have a pair of observed variables X and Y with supports $\mathcal{X}$ and $\mathcal{Y}$, respectively. X is the cause and Y is the effect, but we do not have prior knowledge about the causal direction. We assume that they are on discrete domains (for continuous variables, we can perform discretization) and that there are no latent confounders. We want to identify the causal direction: “X causes Y” or “Y causes X.” For clarity, we list the symbols that appear in the following sections in Table 1. Since we constrain the variables to a finite range, P(X,Y) and P(Y|X) can be written as matrices, and P(Y|x) is a vector in the matrix P(Y|X). We use these representations in the rest of the article.

Table 1:
Lookup Table.

Symbol           Description
$\mathcal{X}$    Support of variable X
$\mathcal{Y}$    Support of variable Y
P(X)             Distribution of X
P(x)             Probability of X = x
P(X, Y)          Joint distribution of (X, Y)
P(Y | X)         Conditional distribution of Y given X
P(Y | x)         Conditional distribution of Y given X = x
|·|              Cardinality of a set
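To make these representations concrete, the following sketch (not the authors' code; the function names estimate_joint and factorize are ours) shows how the joint matrix P(X, Y), the marginal vector P(X), and the conditional matrix P(Y|X) can be estimated from integer-coded samples.

```python
import numpy as np

def estimate_joint(x, y, nx, ny):
    """Estimate the joint distribution matrix P(X, Y) from paired integer samples."""
    joint = np.zeros((nx, ny))
    np.add.at(joint, (x, y), 1.0)                    # count co-occurrences of (x, y)
    return joint / joint.sum()                       # normalize the counts to a distribution

def factorize(joint):
    """Factorize P(X, Y) into the marginal P(X) and the conditional P(Y | X)."""
    p_x = joint.sum(axis=1)                          # vector of length |X|
    cond = joint / np.maximum(p_x[:, None], 1e-12)   # row x holds the vector P(Y | x)
    return p_x, cond
```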

3  Causal Inference Principle

In this section, we present the principle that we are using for causal inference on discrete data. We start with the basic idea.

3.1  Basic Idea

The basic idea is to consider the pair (P(x), P(Y|x)) as a realization of a variable pair, where one variable is one-dimensional, the other is high-dimensional, and the two are independent of each other. See Figure 1 for an example.

Figure 1: P(x) and P(Y|x) as random variables.

Figure 1 shows an example of P(X) and P(Y|X). Suppose $\mathcal{X} = \{x_1, \dots, x_M\}$ and $\mathcal{Y} = \{y_1, \dots, y_L\}$ (here, $M = \lvert\mathcal{X}\rvert$ and $L = \lvert\mathcal{Y}\rvert$). P(X) is a vector in $\mathbb{R}^M$, and P(Y|X) is a matrix in $\mathbb{R}^{M \times L}$ whose rows are the vectors P(Y|x). The highlighted grids (red bars) are a pair (P(x), P(Y|x)). Consider P(x) and P(Y|x) as two independent random variables. The generation of P(X) and P(Y|X) is done by drawing realizations at each possible value of x (shifting the red bars from right to left), so we have M realizations. We formalize this in postulate 1:

Postulate 1.

P(x) and P(Y|x) are both random variables taking realizations at different x, and P(x) is independent of P(Y|x).

With this postulate in mind, one can seek properties induced by it for causal discovery. We discuss this in the next section.

3.2  Distance Correlation

If we want to characterize the dependence between P(x) and P(Y|x), one measurement is the correlation. However, P(Y|x) is a high-dimensional random vector, and adopting a traditional dependence coefficient such as the Pearson correlation would cause a certain estimation bias when the sample size is not large. Moreover, it would be useful if independence between the variables corresponded to zero correlation, which is not the case for traditional correlations. Here we propose to use the distance correlation (Székely, Rizzo, & Bakirov, 2007) as the dependence measurement.

Distance correlation is a measure of dependence between two random variables (one-dimensional or high-dimensional). Suppose we have two random variables $u \in \mathbb{R}^p$ and $v \in \mathbb{R}^q$ with characteristic functions $f_u$ and $f_v$, respectively, and joint characteristic function $f_{u,v}$. The distance covariance is defined as follows.

Definition 1.
The distance covariance between two random variables u and v is
$$
\mathcal{V}^2(u, v) = \big\lVert f_{u,v}(t, s) - f_u(t)\, f_v(s) \big\rVert^2
= \frac{1}{c_p c_q} \int_{\mathbb{R}^{p+q}} \frac{\lvert f_{u,v}(t, s) - f_u(t)\, f_v(s) \rvert^2}{\lvert t \rvert_p^{1+p}\, \lvert s \rvert_q^{1+q}} \, dt \, ds .
\tag{3.1}
$$

Here $\lVert \cdot \rVert$ refers to the weighted $L_2$ norm with weight function $w(t, s) = \big(c_p c_q\, \lvert t \rvert_p^{1+p}\, \lvert s \rvert_q^{1+q}\big)^{-1}$, where $c_p$ and $c_q$ are constants depending only on the dimensions, and the distance variance is defined similarly as $\mathcal{V}^2(u) = \mathcal{V}^2(u, u)$ (Székely et al., 2007). The distance correlation is then defined as follows.

Definition 2.
The distance correlation between u and v is
$$
\mathcal{R}^2(u, v) = \frac{\mathcal{V}^2(u, v)}{\sqrt{\mathcal{V}^2(u, u)\, \mathcal{V}^2(v, v)}}, \qquad \mathcal{V}^2(u, u)\, \mathcal{V}^2(v, v) > 0,
\tag{3.2}
$$
and $\mathcal{R}(u, v) = 0$ if $\mathcal{V}^2(u, u) = 0$ or $\mathcal{V}^2(v, v) = 0$.
The distance correlation is one approach to measuring dependence. There are other methods, such as mutual information and kernel independence measures (Gretton, Herbrich, Smola, Bousquet, & Schölkopf, 2005). However, mutual information is hard to estimate from a finite sample, and kernel methods involve several parameters (kernel functions, kernel widths) that are not easy to choose, so we use the distance correlation in this article. We now discuss how to estimate the distance correlation empirically from data (Székely et al., 2007). Suppose we have n observations $(u_k, v_k)$, $k = 1, \dots, n$, of the two random variables. For the variables u and v, we can construct
$$
a_{kl} = \lvert u_k - u_l \rvert_p, \qquad k, l = 1, \dots, n,
\tag{3.3}
$$
$$
b_{kl} = \lvert v_k - v_l \rvert_q, \qquad k, l = 1, \dots, n,
\tag{3.4}
$$
and then we construct matrices A and B with entries
$$
A_{kl} = a_{kl} - \bar{a}_{k \cdot} - \bar{a}_{\cdot l} + \bar{a}_{\cdot \cdot},
\tag{3.5}
$$
$$
B_{kl} = b_{kl} - \bar{b}_{k \cdot} - \bar{b}_{\cdot l} + \bar{b}_{\cdot \cdot},
\tag{3.6}
$$
where $\bar{a}_{k \cdot}$, $\bar{a}_{\cdot l}$, and $\bar{a}_{\cdot \cdot}$ denote the k-th row mean, the l-th column mean, and the grand mean of the matrix $(a_{kl})$, respectively (and similarly for b).
Then we can estimate the empirical distance covariance as follows (Székely et al., 2007).
Definition 3.
The empirical distance covariance is
$$
\mathcal{V}_n^2(u, v) = \frac{1}{n^2} \sum_{k,l=1}^{n} A_{kl} B_{kl} .
\tag{3.7}
$$
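The empirical estimator of equations 3.3 to 3.7 is straightforward to implement; a minimal sketch is given below (our own helper names, assuming numpy), where each input is an array with one observation per row and 1-D inputs are treated as scalar observations.

```python
import numpy as np

def _centered_distances(samples):
    """Pairwise Euclidean distances (eqs. 3.3-3.4), double-centered as in eqs. 3.5-3.6."""
    s = np.asarray(samples, dtype=float)
    if s.ndim == 1:                                   # one scalar observation per entry
        s = s[:, None]
    d = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=-1)
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()

def dcov2(u, v):
    """Empirical squared distance covariance (eq. 3.7)."""
    a, b = _centered_distances(u), _centered_distances(v)
    return (a * b).mean()

def dcor(u, v):
    """Empirical distance correlation (eq. 3.2); defined as 0 if a distance variance vanishes."""
    denom = np.sqrt(dcov2(u, u) * dcov2(v, v))
    return 0.0 if denom == 0 else float(np.sqrt(max(dcov2(u, v), 0.0) / denom))
```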

We can estimate the empirical distance correlation by plugging the empirical distance covariances into equation 3.2. The distance correlation has the property that $\mathcal{R}(u, v) = 0$ if and only if u and v are independent. We show in the next section how this property helps to identify the causal direction.

3.3  Inferring Causal Directions

In this section, we discuss how to infer the causal direction. Suppose we have the joint distribution P(X,Y) of the variable pair. We can factorize it in two directions and get (P(X), P(Y|X)) and (P(Y), P(X|Y)). Each of them is a random variable pair. We define the dependence measurements of the two factorizations as follows:

Definition 4.
The dependence measurements are defined as
$$
d_{X \to Y} = \mathcal{R}\big(P(X),\, P(Y \mid X)\big),
\tag{3.8}
$$
$$
d_{Y \to X} = \mathcal{R}\big(P(Y),\, P(X \mid Y)\big).
\tag{3.9}
$$
If postulate 1 is accepted, then in the causal direction the distance correlation between P(X) and P(Y|X) reaches its lower bound,
$$
d_{X \to Y} = 0 .
\tag{3.10}
$$
Since the distance correlation is nonnegative, in the anticausal direction we have
$$
d_{Y \to X} \ge 0 = d_{X \to Y},
\tag{3.11}
$$
and we obtain the causal inference principle
$$
\text{infer } X \to Y \ \text{ if } \ d_{X \to Y} < d_{Y \to X}, \qquad \text{infer } Y \to X \ \text{ if } \ d_{Y \to X} < d_{X \to Y}.
\tag{3.12}
$$
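As an illustration, here is a sketch of how $d_{X \to Y}$ and $d_{Y \to X}$ can be computed from a joint distribution matrix, reusing the factorize and dcor helpers from the sketches in sections 2 and 3.2 (the function name is ours):

```python
def dependence_coefficients(joint):
    """Estimate d_{X->Y} and d_{Y->X} (eqs. 3.8-3.9) from the joint distribution matrix."""
    p_x, p_y_given_x = factorize(joint)      # (P(X), P(Y | X)): |X| realizations
    p_y, p_x_given_y = factorize(joint.T)    # (P(Y), P(X | Y)): |Y| realizations
    d_xy = dcor(p_x, p_y_given_x)            # distance correlation of the pairs (P(x), P(Y | x))
    d_yx = dcor(p_y, p_x_given_y)            # distance correlation of the pairs (P(y), P(X | y))
    return d_xy, d_yx
```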
Intuitively, in the causal direction, the dependence coefficient between the marginal distribution and the conditional distribution is smaller than in the anticausal direction. Note that the domain size should be reasonably large to generate reliable statistics, since the domain size determines the number of realizations available for estimating the distance correlations. We give the detailed causal inference method in the next section.

4  Causal Inference Method

In this section we give a causal inference method that identifies the causal direction by estimating the distance correlations. If X causes Y, then $d_{X \to Y}$ should be smaller than $d_{Y \to X}$. However, estimating the coefficients from samples may induce random errors, so we introduce a threshold ε: the two coefficients are considered significantly different if their difference is larger than ε, in which case we decide the causal direction; otherwise we stay undecided. We detail the inference method in algorithm 1.

Algorithm 1: Causal Inference by Estimating Distance Correlations (DC).
Input: samples of the pair (X, Y); threshold ε.
Output: the inferred causal direction, or "undecided."
1. Estimate the joint distribution P(X, Y) from the samples.
2. Factorize P(X, Y) into (P(X), P(Y|X)) and (P(Y), P(X|Y)), and estimate $d_{X \to Y}$ and $d_{Y \to X}$.
3. If $d_{Y \to X} - d_{X \to Y} > ε$, infer "X causes Y"; if $d_{X \to Y} - d_{Y \to X} > ε$, infer "Y causes X"; otherwise, remain undecided.

From algorithm 1, it is clear that our method identifies the cause and the effect by factorizing the joint distribution in two directions and comparing the dependence coefficients $d_{X \to Y}$ and $d_{Y \to X}$. The direction with the smaller distance correlation (between the marginal distribution and the conditional distribution) is inferred as the causal direction. We call this method causal inference by estimating distance correlations (DC). Next, we analyze the computational cost of our method. Suppose the sample size is n. The time for constructing the matrix recording the joint distribution is O(n), and the times for calculating $d_{X \to Y}$ and $d_{Y \to X}$ depend only on the domain sizes $\lvert\mathcal{X}\rvert$ and $\lvert\mathcal{Y}\rvert$, not on n. The total time is therefore O(n) plus a term independent of the sample size, so the method has low computational complexity (linear with respect to the sample size). This is verified in the experiments.
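A minimal sketch of the resulting decision rule, again reusing the helpers defined in the earlier sketches (the function name and the default eps are ours):

```python
def infer_direction(x, y, nx, ny, eps=0.0):
    """DC decision rule: compare the two distance correlations against the threshold eps."""
    joint = estimate_joint(x, y, nx, ny)          # one O(n) pass over the samples
    d_xy, d_yx = dependence_coefficients(joint)   # cost depends on the domain sizes, not on n
    if d_yx - d_xy > eps:
        return "X causes Y"
    if d_xy - d_yx > eps:
        return "Y causes X"
    return "undecided"
```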

5  Experiments

In this section, we test the performance of our method (DC). The algorithm we compare with is discrete regression (DR) (Peters et al., 2011). We perform experiments under various settings. Section 5.1 shows the performance of DC on identifying ANMs with different noise ranges. Section 5.2 presents the performance of DC when the distribution of the cause and the conditional distribution mapping cause to effect are randomly generated. Section 5.3 tests the efficiency of the algorithms. Section 5.4 discusses the choice of the threshold parameter ε. Section 5.5 shows the performance of DC at different decision rates. In section 5.6, we apply DC to real-world cause-effect pairs (with discretization) to show its capability of solving practical problems.

5.1  Additive Noise Models

We first evaluate the accuracy of DC on identifying ANMs with different ranges of the noise term N. The model is written as
$$
Y = f(X) + N ,
\tag{5.1}
$$
with the noise N independent of the cause X. The function f is constructed as a random mapping from the domain of X to the domain of Y. Suppose the support of the noise is $\mathcal{N}$. The noise domain is chosen as:
  1. .

  2. .

  3. .

  4. .

The probability distribution of the cause X is chosen by randomly generating a vector (of length $\lvert\mathcal{X}\rvert$) with integer entries drawn from a fixed range and normalizing it to unit sum. We generate the probability distribution of the noise N in the same way. In each trial, the algorithms are forced to make a decision. For each noise setting, we randomly generate 500 functions, so we have 500 additive noise models per setting; the domain of the effect can differ across models due to the randomness of the mappings. We then sample 200, 300, 500, 1000, 2000, and 4000 points from each model and apply DC and DR to the samples. The plots showing the accuracies of the algorithms are given in Figure 2.
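The following sketch illustrates this simulation protocol under stated assumptions: the integer range used for the unnormalized weights and the codomain of the random mapping f are not specified in the text, so placeholder choices are used, and all names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_pmf(size, low=1, high=10):
    """Random distribution: integer weights normalized to unit sum (the range is assumed)."""
    w = rng.integers(low, high + 1, size=size).astype(float)
    return w / w.sum()

def sample_anm(n, nx, noise_values):
    """Draw n points from Y = f(X) + N with a random mapping f and random P(X), P(N)."""
    f = rng.integers(0, nx, size=nx)                  # random mapping on the domain of X (codomain assumed)
    x = rng.choice(nx, size=n, p=random_pmf(nx))      # sample the cause
    noise = rng.choice(noise_values, size=n, p=random_pmf(len(noise_values)))
    return x, f[x] + noise                            # the effect Y = f(X) + N
```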

Figure 2: Accuracy of the algorithms versus sample size for the different noise settings.

From Figure 2, one can see that DR performs slightly better than DC when the noise range is small. For example, when the sample size is 4000, the accuracy of DR is 0.82, while that of DC is 0.78. We observe that ANMs with a small noise range can sometimes yield small distance correlations in both directions, and in these situations the decision made by DC is close to a random guess. DC performs better than DR when the noise range is larger; its accuracy reaches around 0.9 when the sample size is large. DC still does not correctly identify all models, because the difference between the estimated distance correlations is sometimes small due to random estimation errors, which can make the decision wrong.

5.2  Models with Randomly Generated P(X) and P(Y|X)

We now test the algorithms on models in which P(X) and P(Y|X) are randomly generated. To be specific, we generate P(X) using the method in section 5.1. We then generate a set of distributions on the domain of Y as a reference set, and for each x, we generate P(Y|x) by randomly taking one of the distributions in the reference set. We choose the domain size to be

  1. .

  2. .

  3. .

  4. .

For each setting, we generate 500 models. For each model, we sample 200, 300, 500, 1000, 2000, 4000 points and apply DC and DR to them. The performance is shown in Figure 3.
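A sketch of this generation scheme follows, reusing random_pmf and rng from the sketch in section 5.1; the size of the reference set is not given in the text, so it appears here as an assumed parameter.

```python
def sample_random_model(n, nx, ny, n_reference=20):
    """Random P(X), with each P(Y | x) drawn from a reference set of distributions on Y's domain."""
    reference = [random_pmf(ny) for _ in range(n_reference)]       # reference set (size assumed)
    cond = np.vstack([reference[rng.integers(n_reference)] for _ in range(nx)])
    x = rng.choice(nx, size=n, p=random_pmf(nx))                   # sample the cause
    y = np.array([rng.choice(ny, p=cond[xi]) for xi in x])         # sample the effect point by point
    return x, y
```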

Figure 3: Accuracy of the algorithms versus sample size for the different domain sizes.

DR's performance in these scenarios is unsatisfactory because the models do not admit an additive noise model in either direction, so DR often resorts to a random guess. DC's performance is satisfactory when the sample size is large. This shows that DC works in these scenarios while DR does not.

5.3  On Efficiency

This section investigates the efficiency of the algorithms. We use the same experimental settings as in section 5.2 (settings 2 and 4). For each sample size, we run the algorithms 100 times and record the total running time (in seconds). The records are shown in Figure 4.

Figure 4: Running time of the algorithms versus sample size for the different domain sizes.

Figure 4 shows that DC is considerably more efficient than DR. For example, at a sample size of 1000 in one of the settings, DR takes around 400 seconds to finish the experiments, while DC takes only 2 seconds. This is because DR iteratively searches the entire domain for a function that yields minimum dependence between the residuals and the regressors, which can be time-consuming in practice.

5.4  On the Parameter ε

In sections 5.1 to 5.3, DC is forced to make a decision in each trial (ε = 0). In this section, we examine the influence of ε on the performance of DC, which may help in setting its value in practice. We use the same experimental setting as in section 5.2, with two choices of domain size, and fix the sample size at 4000. We choose the parameter ε to be:

  1. .

  2. .

  3. .

For each setting, we generate 500 models and apply DC to them. The proportions of correctly identified models, wrongly identified models, and nonidentified models are shown in Figure 5.

Figure 5: Proportions of models versus threshold for the different domain sizes.

The proportion of nonidentified models becomes large when the threshold reaches 0.1 because DC becomes conservative in this situation. The proportions of correctly identified and wrongly identified models both decrease as the threshold gets larger. However, the accuracy (the number of correctly identified models divided by the total number of correctly and wrongly identified models) increases. This is reasonable, since the decisions made by DC under a higher threshold are more reliable. Based on the plotted results, we observe that the accuracy and decision rate of DC are acceptable when ε is around 0.05, and we suggest 0.05 as a reasonable choice for the parameter.

5.5  On Decision Rates

In our algorithm, the threshold parameter ε controls the decision rate of DC: as ε increases from 0 to 1, the decision rate decreases from 100% to 0%. In this section, we study the influence of this parameter on the performance of DC. We follow the experimental setting in section 5.2 (with one choice of domain size), fix the sample size at 5000, and vary the decision rate by changing ε from 0 to 1. The plots showing the percentage of correct decisions versus the decision rate are in Figure 6.

Figure 6: Percentage of correct decisions versus decision rate.

As expected, the percentage of correct decisions decreases as the decision rate increases. For example, the percentage of correct decisions is 100% when the decision rate is less than 20%, but it drops to 77% when the decision rate is 100%. This is acceptable, since the decisions are more reliable when the algorithm decides only under a higher threshold, that is, on pairs with a larger gap between the estimated distance correlations.

5.6  On Real-World Data

We apply DC to real-world benchmark cause-effect pairs.1 The data set contains records of 88 cause-effect pairs. We exclude the pairs numbered 17, 44, 45, 52, 53, 54, 55, 68, 71, and 75 because they are either multivariate causal pairs or pairs that cannot be fitted into memory by our algorithm. We apply DC, DR (Peters et al., 2011), IGCI (Janzing et al., 2012), LiNGAM (Shimizu et al., 2006), ANM (Hoyer et al., 2009), PNL (Zhang & Hyvärinen, 2009b), and CURE (Sgouritsa, Janzing, Hennig, & Schölkopf, 2015). The last five methods can be applied to the data directly, while DC and DR are applied to discretized data. To make the variables discrete, we process them as follows: for a variable X, if the maximum absolute value of all observations is less than 1, we process it by one discretization rule; otherwise we process it by another.2 For each pair, we generate 50 replicates using resampling techniques and then apply the algorithms to the causal pairs. The box plots showing the accuracies are in Figure 7.
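The exact discretization formulas and resampling scheme are not reproduced here; as a rough stand-in for the resampling step, a simple bootstrap over the observations of a pair could look like the following sketch (our own names and choices).

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_replicates(x, y, n_replicates=50):
    """Yield resampled copies of a cause-effect pair (sampling indices with replacement)."""
    x, y = np.asarray(x), np.asarray(y)
    for _ in range(n_replicates):
        idx = rng.integers(0, len(x), size=len(x))    # resample indices with replacement
        yield x[idx], y[idx]
```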

Figure 7: Accuracy of the algorithms on 78 real-world causal pairs.

Determining causal directions on real-world data is challenging because the causal mechanisms are often complex and the data records can be noisy (Pearl, 2000; Spirtes, Glymour, & Scheines, 2000). Nevertheless, Figure 7 shows that DC performs satisfactorily on this task. The average accuracy of DC is around 72%, the highest among all the algorithms. The ANM-based methods (DR, LiNGAM, ANM) do not perform well, possibly because the assumptions of additive noise models restrict their applicability.

6  Conclusion

In this article, we deal with the causal inference problem on discrete data. We consider the distribution of the cause and the conditional distribution mapping cause to effect as independent random variables, and we propose to discover the causal direction by estimating and comparing distance correlations. Encouraging experimental results are reported, which suggests that inferring the causal direction using the independence postulate is a promising research direction. In the future, we will try to extend this method to deal with high-dimensional data.

Acknowledgments

We thank the editor and the anonymous reviewers for helpful comments. The work described in this article was partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China.

References

Chen, Z., Zhang, K., & Chan, L. (2013). Nonlinear causal discovery for high dimensional data: A kernelized trace method. In Proceedings of the IEEE 13th International Conference on Data Mining (pp. 1003–1008). Piscataway, NJ: IEEE.

Chen, Z., Zhang, K., Chan, L., & Schölkopf, B. (2014). Causal discovery via reproducing kernel Hilbert space embeddings. Neural Computation, 26(7), 1484–1517.

Gretton, A., Herbrich, R., Smola, A., Bousquet, O., & Schölkopf, B. (2005). Kernel methods for measuring independence. Journal of Machine Learning Research, 6, 2075–2129.

Hoyer, P. O., & Hyttinen, A. (2009). Bayesian discovery of linear acyclic causal models. In Proceedings of the 25th International Conference on Uncertainty in Artificial Intelligence (pp. 240–248). Corvallis, OR: AUAI Press.

Hoyer, P. O., Janzing, D., Mooij, J. M., Peters, J. R., & Schölkopf, B. (2009). Nonlinear causal discovery with additive noise models. In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.), Advances in neural information processing systems, 21 (pp. 689–696). Cambridge, MA: MIT Press.

Hyttinen, A., Eberhardt, F., & Hoyer, P. O. (2012). Learning linear cyclic causal models with latent variables. Journal of Machine Learning Research, 13, 3387–3439.

Hyvärinen, A., & Smith, S. M. (2013). Pairwise likelihood ratios for estimation of non-Gaussian structural equation models. Journal of Machine Learning Research, 14, 111–152.

Hyvärinen, A., Zhang, K., Shimizu, S., & Hoyer, P. O. (2010). Estimation of a structural vector autoregression model using non-Gaussianity. Journal of Machine Learning Research, 11, 1709–1731.

Janzing, D., Hoyer, P., & Schölkopf, B. (2010). Telling cause from effect based on high-dimensional observations. In Proceedings of the 27th International Conference on Machine Learning (pp. 479–486). Madison, WI: Omnipress.

Janzing, D., Mooij, J., Zhang, K., Lemeire, J., Zscheischler, J., Daniušis, P., … Schölkopf, B. (2012). Information-geometric approach to inferring causal directions. Artificial Intelligence, 182, 1–31.

Janzing, D., & Schölkopf, B. (2010). Causal inference using the algorithmic Markov condition. IEEE Transactions on Information Theory, 56(10), 5168–5194.

Lemeire, J., & Janzing, D. (2013). Replacing causal faithfulness with algorithmic independence of conditionals. Minds and Machines, 23(2), 227–249.

Mooij, J. M., Janzing, D., Heskes, T., & Schölkopf, B. (2011). On causal discovery with cyclic additive noise models. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 24 (pp. 639–647). Red Hook, NY: Curran.

Pearl, J. (2000). Causality: Models, reasoning and inference. Cambridge: Cambridge University Press.

Peters, J., Janzing, D., & Schölkopf, B. (2010). Identifying cause and effect on discrete data using additive noise models. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (pp. 597–604). JMLR.

Peters, J., Janzing, D., & Schölkopf, B. (2011). Causal inference on discrete data using additive noise models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(12), 2436–2450.

Schölkopf, B., Janzing, D., Peters, J., Sgouritsa, E., Zhang, K., & Mooij, J. (2012). On causal and anticausal learning. In Proceedings of the 29th International Conference on Machine Learning (pp. 1–8). Madison, WI: Omnipress.

Sgouritsa, E., Janzing, D., Hennig, P., & Schölkopf, B. (2015). Inference of cause and effect with unsupervised inverse regression. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (pp. 847–855). JMLR.

Shimizu, S., Hoyer, P. O., Hyvärinen, A., & Kerminen, A. (2006). A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7, 2003–2030.

Shimizu, S., Inazumi, T., Sogawa, Y., Hyvärinen, A., Kawahara, Y., Washio, T., … Bollen, K. (2011). DirectLiNGAM: A direct method for learning a linear non-Gaussian structural equation model. Journal of Machine Learning Research, 12, 1225–1248.

Spirtes, P., Glymour, C. N., & Scheines, R. (2000). Causation, prediction, and search. Cambridge, MA: MIT Press.

Székely, G. J., Rizzo, M. L., & Bakirov, N. K. (2007). Measuring and testing dependence by correlation of distances. Annals of Statistics, 35(6), 2769–2794.

Zhang, K., & Hyvärinen, A. (2009a). Causality discovery with additive disturbances: An information-theoretical perspective. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (pp. 570–585). New York: Springer.

Zhang, K., & Hyvärinen, A. (2009b). On the identifiability of the post-nonlinear causal model. In Proceedings of the 25th International Conference on Uncertainty in Artificial Intelligence (pp. 647–655). Corvallis, OR: AUAI Press.

Zscheischler, J., Janzing, D., & Zhang, K. (2011). Testing whether linear equations are causal: A free probability theory approach. In Proceedings of the 27th International Conference on Uncertainty in Artificial Intelligence (pp. 839–847). Corvallis, OR: AUAI Press.

Notes

2. For pairs 65, 66, and 67, which contain stock returns, we process the variables by a separate rule.