## Abstract

In this article, we deal with the problem of inferring causal directions when the data are on a discrete domain. By considering the distribution of the cause and the conditional distribution mapping cause to effect as independent random variables, we propose to infer the causal direction by comparing the distance correlation between $P(X)$ and $P(Y \mid X)$ with the distance correlation between $P(Y)$ and $P(X \mid Y)$. We infer that *X* causes *Y* if the dependence coefficient between $P(X)$ and $P(Y \mid X)$ is the smaller of the two. Experiments are performed to show the performance of the proposed method.

## 1 Introduction

Inferring the causal direction between two variables from observational data has become an active research topic. Additive noise models (ANMs) (Zhang & Hyvärinen, 2009a, 2009b; Shimizu, Hoyer, Hyvärinen, & Kerminen, 2006; Hoyer, Janzing, Mooij, Peters, & Schölkopf, 2009; Peters, Janzing, & Schölkopf, 2011; Hoyer & Hyttinen, 2009; Shimizu et al., 2011; Hyvärinen & Smith, 2013; Hyvärinen, Zhang, Shimizu, & Hoyer, 2010; Hyttinen, Eberhardt, & Hoyer, 2012; Mooij, Janzing, Heskes, & Schölkopf, 2011) were early attempts to solve this problem. They assume that the effect is governed by the cause and an additive noise, and causal inference is done by finding the direction that admits such a model. Recently, under another view of exploiting the asymmetry between cause and effect, the linear trace method (LTr) (Zscheischler, Janzing, & Zhang, 2011; Janzing, Hoyer, & Schölkopf, 2010) and information-geometric causal inference (IGCI) (Janzing et al., 2012) have been proposed. Suppose *X* is the cause and *Y* is the effect. Based on the postulate that the generation of $P(X)$ is independent of that of $P(Y \mid X)$ (Janzing & Schölkopf, 2010; Lemeire & Janzing, 2013; Schölkopf et al., 2012), LTr suggests that the trace condition is fulfilled in the causal direction while violated in the anticausal direction, and IGCI shows that the density of the cause and the log slope of the function transforming cause to effect are uncorrelated, while the density of the effect and the log slope of the inverse of the function are positively correlated. By exploiting these so-called cause-effect asymmetries, one can determine the causal direction. Subsequently, a kernel method using the IGCI framework to deal with high-dimensional variables was developed (Chen, Zhang, Chan, & Schölkopf, 2014), and nonlinear extensions of the trace method were presented (Chen, Zhang, & Chan, 2013).

In some situations, the variables of interest are on discrete domains, and researchers have adapted additive noise models to discrete data for causal inference (Peters, Janzing, & Schölkopf, 2010, 2011). Given observations of the variable pair, they perform regressions in both directions and test the independence between the residuals and the regressors. The direction that admits an additive noise model is inferred as the causal direction. However, ANMs may not be suitable for modeling discrete variables in some situations. For example, it is not natural to use ANMs to model categorical variables. Methods with wider applicability would be valuable for causal inference on discrete data.

Motivated by the postulate that the generation of $P(X)$ is independent of that of $P(Y \mid X)$, we suggest that each pair $(P(x), P(Y \mid x))$ is an observation of a pair of variables that are independent of each other (here $P(x)$ refers to the probability at one specified point $x$). To infer the causal direction, we calculate the dependence coefficients between $P(X)$ and $P(Y \mid X)$ and between $P(Y)$ and $P(X \mid Y)$. Then the direction that induces the smaller correlation is inferred as the causal direction. Without a functional causal model assumption, our method has wider applicability than traditional ANMs. Various experiments are conducted to demonstrate the performance of the proposed method.

## 2 Problem Description

Suppose we have two observed variables *X* and *Y* with supports $\mathcal{X}$ and $\mathcal{Y}$, respectively. *X* is the cause and *Y* is the effect, but we do not have prior knowledge about the causal direction. We assume that the variables are on discrete domains (a continuous variable can be discretized) and that there are no latent confounders. We want to identify the causal direction: “*X* causes *Y*” or “*Y* causes *X*.” For clarity, we list the symbols used in the following sections in Table 1. Since we constrain the variables to a finite range, *P*(*X*,*Y*) and $P(Y \mid X)$ can be written as matrices, and $P(Y \mid x)$ is a vector in the matrix $P(Y \mid X)$. We use these representations in the rest of the article.

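Table 1: List of symbols.

| Symbol | Description |
|---|---|
| $\mathcal{X}$ | Support of variable *X* |
| $\mathcal{Y}$ | Support of variable *Y* |
| $P(X)$ | Distribution of *X* |
| $P(x)$ | Probability of $X = x$ |
| $P(X, Y)$ | Joint distribution of $(X, Y)$ |
| $P(Y \mid X)$ | Conditional distribution of *Y* given *X* |
| $P(Y \mid x)$ | Conditional distribution of *Y* given $X = x$ |
| $\lvert \cdot \rvert$ | Cardinality of a set |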
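
As a hedged illustration of the matrix representation, the sketch below estimates the joint distribution from paired samples and factorizes it in the direction from *X* to *Y*. The function names `joint_matrix` and `factorize`, the layout (rows indexed by values of *Y*, columns by values of *X*), and the toy samples are assumptions made for this example; they are not taken from the original method description.

```python
import numpy as np

def joint_matrix(xs, ys):
    """Empirical joint distribution; entry [j, i] estimates P(X = x_i, Y = y_j)."""
    _, x_idx = np.unique(xs, return_inverse=True)
    _, y_idx = np.unique(ys, return_inverse=True)
    counts = np.zeros((y_idx.max() + 1, x_idx.max() + 1))
    np.add.at(counts, (y_idx, x_idx), 1.0)  # accumulate co-occurrence counts
    return counts / counts.sum()

def factorize(joint):
    """Return P(X) as a vector and P(Y|X) as a matrix whose column i is P(Y | x_i)."""
    p_x = joint.sum(axis=0)        # marginal over the rows (values of Y)
    p_y_given_x = joint / p_x      # divides column i by P(x_i)
    return p_x, p_y_given_x

# Toy data (assumed): six paired observations.
xs = [0, 0, 1, 1, 1, 2]
ys = [0, 1, 1, 1, 0, 0]
joint = joint_matrix(xs, ys)
p_x, p_y_given_x = factorize(joint)
```

Only observed values enter the supports, so every marginal probability is positive and the column-wise division is well defined.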


## 3 Causal Inference Principle

In this section, we present the principle that we are using for causal inference on discrete data. We start with the basic idea.

### 3.1 Basic Idea

The basic idea is to consider each pair $(P(x), P(Y \mid x))$ as a realization of a variable pair whose two components (one is one-dimensional and the other is high-dimensional) are independent of each other. See Figure 1 for an example.

Figure 1 shows an example of $P(X)$ and $P(Y \mid X)$. Suppose $|\mathcal{X}| = M$ and $|\mathcal{Y}| = N$. Then $P(X)$ is a vector in $\mathbb{R}^{M}$, and $P(Y \mid X)$ is a matrix in $\mathbb{R}^{N \times M}$. The highlighted grids (red bars) are a pair $(P(x), P(Y \mid x))$. Consider $P(x)$ and $P(Y \mid x)$ as two independent random variables. The pairs are generated by drawing a realization at each possible value of *x* (shifting the red bars from right to left), which gives $M$ realizations. We formalize this in the following postulate:

$P(x)$ and $P(Y \mid x)$ are both random variables taking realizations at different *x*, and $P(x)$ is independent of $P(Y \mid x)$.

Once this postulate is in mind, one could seek properties induced by it for causal discovery. We discuss this in the next section.

### 3.2 Distance Correlation

If we want to characterize the dependence between $P(x)$ and $P(Y \mid x)$, one measurement is the correlation. However, $P(Y \mid x)$ is a high-dimensional random vector. Adopting traditional dependence coefficients such as the Pearson correlation would cause a certain estimation bias when the sample size is not large. Moreover, it would be useful if a correlation of 0 implied independence between the variables, which is not true of traditional correlations. Here we propose to use distance correlation (Székely, Rizzo, & Bakirov, 2007) as the dependence measurement.

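Distance correlation is a measurement of dependence between two random variables (one-dimensional or high-dimensional). Suppose we have two random variables $U \in \mathbb{R}^{p}$ and $V \in \mathbb{R}^{q}$, with characteristic functions $f_U$ and $f_V$, respectively. Their joint characteristic function is $f_{U,V}$. The distance covariance is defined as:

$$\operatorname{dCov}^2(U, V) = \lVert f_{U,V}(t, s) - f_U(t)\, f_V(s) \rVert^2 .$$

Here $\lVert \cdot \rVert$ refers to the weighted *L*_{2} norm, and similarly we can define the distance variance $\operatorname{dVar}^2(U) = \operatorname{dCov}^2(U, U)$ (Székely et al., 2007). Then the distance correlation is defined as

$$\operatorname{dCor}(U, V) = \frac{\operatorname{dCov}(U, V)}{\sqrt{\operatorname{dVar}(U)\, \operatorname{dVar}(V)}}$$

when $\operatorname{dVar}(U)\operatorname{dVar}(V) > 0$, and as 0 otherwise.

Suppose we have *n* observations of the two random variables, $\{(u_k, v_k)\}_{k=1}^{n}$. For the variables $U$ and $V$, we can construct the pairwise distances $a_{kl} = \lVert u_k - u_l \rVert$ and $b_{kl} = \lVert v_k - v_l \rVert$, and then we construct matrices *A* and *B* with these entries:

$$A_{kl} = a_{kl} - \bar{a}_{k\cdot} - \bar{a}_{\cdot l} + \bar{a}_{\cdot\cdot}, \qquad B_{kl} = b_{kl} - \bar{b}_{k\cdot} - \bar{b}_{\cdot l} + \bar{b}_{\cdot\cdot},$$

where $\bar{a}_{k\cdot}$, $\bar{a}_{\cdot l}$, and $\bar{a}_{\cdot\cdot}$ denote the row mean, the column mean, and the grand mean of the distances $a_{kl}$ (and likewise for $b_{kl}$). Then we can estimate the empirical distance covariance as follows (Székely et al., 2007):

$$\operatorname{dCov}_n^2(U, V) = \frac{1}{n^2} \sum_{k,l=1}^{n} A_{kl} B_{kl} .$$

We can estimate the empirical distance correlation using the empirical distance covariance. The distance correlation has the property that $\operatorname{dCor}(U, V) = 0$ implies independence between $U$ and $V$. We show in the next section that this helps to identify the causal direction.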
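
The empirical estimator of Székely et al. (2007) can be sketched in a few lines. This is a minimal illustration, assuming NumPy and Euclidean distances; the function names `double_centered` and `distance_correlation` are chosen for this example and are not from the original article.

```python
import numpy as np

def double_centered(points):
    """Pairwise Euclidean distances with row, column, and grand means removed."""
    pts = np.asarray(points, dtype=float)
    pts = pts.reshape(len(pts), -1)  # one row per observation
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()

def distance_correlation(u, v):
    """Empirical distance correlation between n paired observations u and v."""
    a = double_centered(u)
    b = double_centered(v)
    dcov2 = (a * b).mean()                              # squared distance covariance
    denom = np.sqrt((a * a).mean() * (b * b).mean())    # product of distance variances
    if denom <= 0:
        return 0.0  # one of the samples is constant
    return float(np.sqrt(max(dcov2, 0.0) / denom))

# Usage: a scalar sample against a vector-valued sample.
u = [0.1, 0.4, 0.2, 0.3]
v = [[0.5, 0.5], [0.9, 0.1], [0.6, 0.4], [0.7, 0.3]]
r = distance_correlation(u, v)
```

Because only pairwise distances enter the estimate, the two samples may have different dimensions, which is what allows a scalar such as $P(x)$ to be compared with a vector such as $P(Y \mid x)$.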

### 3.3 Inferring Causal Directions

In this section, we discuss how to infer the causal direction. Suppose we have the joint distribution of the variable pair as *P*(*X*,*Y*). We are able to factorize it in two directions and get $P(X)P(Y \mid X)$ and $P(Y)P(X \mid Y)$. Each factorization yields a pair of random variables. We define their dependence measurements as

$$\mathcal{D}_{X \to Y} = \operatorname{dCor}\big(P(x), P(Y \mid x)\big), \qquad \mathcal{D}_{Y \to X} = \operatorname{dCor}\big(P(y), P(X \mid y)\big),$$

where $\operatorname{dCor}$ denotes the distance correlation of section 3.2. If the independence postulate of section 3.1 is accepted, then in the causal direction, the distance correlation between $P(x)$ and $P(Y \mid x)$ reaches the lower bound:

$$\mathcal{D}_{X \to Y} = 0 .$$

Since the distance correlation is nonnegative, in the anticausal direction we have

$$\mathcal{D}_{Y \to X} \ge 0 ,$$

and we obtain the causal inference principle

$$\mathcal{D}_{X \to Y} \le \mathcal{D}_{Y \to X} .$$

Intuitively, in the causal direction, we get a smaller dependence coefficient between the marginal distribution and the conditional distribution than in the anticausal direction. Note that the domain size should be reasonably large to generate reliable statistics. We give the detailed causal inference method in the next section.

## 4 Causal Inference Method

In this section, we give a causal inference method that identifies the causal direction by estimating the distance correlations. If *X* causes *Y*, the distance correlation $\mathcal{D}_{X \to Y}$ between $P(X)$ and $P(Y \mid X)$ should be smaller than the distance correlation $\mathcal{D}_{Y \to X}$ between $P(Y)$ and $P(X \mid Y)$. However, estimating the coefficients from samples may induce random errors, so we introduce a threshold. The two coefficients are considered significantly different if their difference is larger than the threshold, in which case we decide the causal direction; otherwise, we stay undecided. We detail the inference method in algorithm 1.
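
The procedure just described can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' algorithm 1: the joint distribution is given as a matrix with rows indexed by values of *Y* and columns by values of *X*, all marginal probabilities are positive, and the names `dcor`, `infer_direction`, and `threshold` are chosen for this example.

```python
import numpy as np

def dcor(u, v):
    """Empirical distance correlation between paired realizations u and v."""
    def centered(points):
        pts = np.asarray(points, dtype=float)
        pts = pts.reshape(len(pts), -1)  # one row per realization
        d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
        return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()
    a, b = centered(u), centered(v)
    denom = np.sqrt((a * a).mean() * (b * b).mean())
    return float(np.sqrt(max((a * b).mean(), 0.0) / denom)) if denom > 0 else 0.0

def infer_direction(joint, threshold=0.0):
    """Compare the distance correlations of the two factorizations of `joint`."""
    joint = np.asarray(joint, dtype=float)   # joint[j, i] = P(X = x_i, Y = y_j)
    p_x = joint.sum(axis=0)
    p_y = joint.sum(axis=1)
    p_y_given_x = joint / p_x                # column i is P(Y | x_i); assumes p_x > 0
    p_x_given_y = joint / p_y[:, None]       # row j is P(X | y_j); assumes p_y > 0
    d_xy = dcor(p_x, p_y_given_x.T)          # one realization per value of X
    d_yx = dcor(p_y, p_x_given_y)            # one realization per value of Y
    if d_yx - d_xy > threshold:
        return "X->Y"
    if d_xy - d_yx > threshold:
        return "Y->X"
    return "undecided"

# Usage on an arbitrary 4 x 5 joint distribution (assumed data).
rng = np.random.default_rng(0)
joint = rng.random((4, 5))
joint /= joint.sum()
decision = infer_direction(joint, threshold=0.05)
```

With a threshold of 0, the sketch returns a direction whenever the two estimates differ at all; larger thresholds trade decisions for reliability.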

From algorithm 1, it is clear that our method identifies the cause and the effect by factorizing the joint distribution in two directions and comparing the dependence coefficients ($\mathcal{D}_{X \to Y}$ and $\mathcal{D}_{Y \to X}$, the distance correlations in the two directions). The direction with the smaller distance correlation (between the marginal distribution and the conditional distribution) is inferred as the causal direction. We call it causal inference by estimating distance correlations (DC). Next, we analyze the computational cost of our method. Suppose the sample size is *n*. Constructing the matrix that records the joint distribution takes time linear in *n*, while calculating $\mathcal{D}_{X \to Y}$ and $\mathcal{D}_{Y \to X}$ from that matrix takes time that depends only on the domain sizes $|\mathcal{X}|$ and $|\mathcal{Y}|$. The method is therefore of low computational complexity (linear with respect to the sample size). This is verified in the experiments.

## 5 Experiments

In this section, we test the performance of our method (DC). The compared algorithm is discrete regression (DR) (Peters et al., 2011). We perform experiments under various settings. Section 5.1 shows the performance of DC on identifying ANMs under different noise settings. Section 5.2 presents the performance of DC when the distribution of the cause and the conditional distribution mapping cause to effect are randomly generated. Section 5.3 tests the efficiency of the algorithms. Section 5.4 discusses the choice of the threshold parameter. Section 5.5 shows the performance of DC at different decision rates. In section 5.6, we apply DC to real-world cause-effect pairs (with discretization) to show its capability in solving practical problems.

### 5.1 Additive Noise Models

The probability distributions of the cause *X* are chosen by randomly generating a vector (length ) with each entry being an integer between and normalizing it to unit sum. We generate the probability distributions of the noise *N* using the same way. In each trial, the algorithms are forced to make a decision. For each noise setting, we randomly generate 500 functions. Thus, we have 500 additive noise models. for each model could be different due to the randomness of the mappings. Then we sample 200, 300, 500, 1000, 2000, 4000 points for each model and apply DC and DR to the samples. The plots showing the accuracies of the algorithms are given in Figure 2.
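
The data-generating procedure for one additive noise model can be sketched as below. This is an illustration only: the integer range used for the random weights (`max_weight`), the noise support, the way the random function `f` is drawn, and all function names are assumptions made for the example.

```python
import numpy as np

def random_distribution(size, rng, max_weight=10):
    """Random integer weights (assumed range 1..max_weight) normalized to unit sum."""
    weights = rng.integers(1, max_weight + 1, size=size)
    return weights / weights.sum()

def sample_anm(n, x_size, noise_support, rng):
    """Draw n samples from Y = f(X) + N with a randomly generated discrete function f."""
    p_x = random_distribution(x_size, rng)
    p_n = random_distribution(len(noise_support), rng)
    f = rng.integers(0, x_size, size=x_size)      # random mapping on the support of X (assumed form)
    x = rng.choice(x_size, size=n, p=p_x)         # cause
    noise = rng.choice(noise_support, size=n, p=p_n)
    y = f[x] + noise                              # effect
    return x, y, f

# Usage: one model, 500 samples, noise supported on {-1, 0, 1}.
rng = np.random.default_rng(0)
x, y, f = sample_anm(500, 6, [-1, 0, 1], rng)
```

Repeating the call with fresh random functions gives a collection of models on which the two algorithms can be compared.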

From Figure 2, one can see that DR performs slightly better than DC when . For example, when the sample size is 4000, the accuracy of DR is 0.82, while that of DC is 0.78. We observe that ANMs with small could sometimes yield small distance correlations in both directions. In these situations, the decision made by DC is close to a random guess. DC performs better than DR when is larger. The accuracies of DC become around 0.9 when the sample size is large. But DC does not correctly identify all models. This is because the difference between the estimated distance correlations is sometimes small due to estimation random errors, and this may make the decision wrong.

### 5.2 Models with Randomly Generated $P(X)$ and $P(Y \mid X)$

We now test the algorithms on models in which $P(X)$ and $P(Y \mid X)$ are randomly generated. Specifically, we generate $P(X)$ using the method in section 5.1. Then we generate a set of distributions on $\mathcal{Y}$ as a reference set. For each $x \in \mathcal{X}$, we generate $P(Y \mid x)$ by randomly taking one of the distributions in the reference set. We consider four settings of the domain size (settings 1 to 4).

DR’s performance in these scenarios is unsatisfactory. This is because DR often makes a random guess since the models do not satisfy the ANM in either direction. DC’s performance is satisfactory when the sample size is larger. This shows that DC works in these scenarios while DR does not.
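
A hedged sketch of this generating scheme is given below. The domain sizes, the size of the reference set, the weight range, and the function names are assumptions for illustration; they are not the settings used in the experiments.

```python
import numpy as np

def random_distribution(size, rng, max_weight=10):
    """Random integer weights (assumed range 1..max_weight) normalized to unit sum."""
    weights = rng.integers(1, max_weight + 1, size=size)
    return weights / weights.sum()

def random_model(x_size, y_size, n_reference, rng):
    """Random P(X), plus P(Y|X) whose columns are drawn from a reference set."""
    p_x = random_distribution(x_size, rng)
    reference = np.stack([random_distribution(y_size, rng) for _ in range(n_reference)])
    picks = rng.integers(0, n_reference, size=x_size)  # one reference distribution per x
    p_y_given_x = reference[picks].T                   # column i is P(Y | x_i)
    return p_x, p_y_given_x

def sample_pairs(p_x, p_y_given_x, n, rng):
    """Draw n pairs (x, y): x from P(X), then y from P(Y | x) by inverse-CDF sampling."""
    x = rng.choice(len(p_x), size=n, p=p_x)
    cdf = np.cumsum(p_y_given_x, axis=0)               # cumulative distribution per column
    u = rng.random(n)
    y = (u[None, :] > cdf[:, x]).sum(axis=0)           # index of the first CDF entry >= u
    return x, np.minimum(y, p_y_given_x.shape[0] - 1)  # guard against rounding at the top

# Usage with assumed sizes.
rng = np.random.default_rng(1)
p_x, p_y_given_x = random_model(5, 4, 3, rng)
x, y = sample_pairs(p_x, p_y_given_x, 1000, rng)
```

Models generated this way generally do not admit an additive noise model in either direction, which is the situation these experiments are designed to probe.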

### 5.3 On Efficiency

This section investigates the efficiency of the algorithms. We use the same experimental setting as in section 5.2 (settings 2 and 4). For each sample size, we run the algorithms 100 times and record their total running time (in seconds). The records are shown in Figure 4.

Figure 4 shows that DC has a higher efficiency than DR. For example, when and sample size is 1000, DR uses around 400 seconds to finish the experiments, while DC uses only 2 seconds. This is because DR searches the entire domain iteratively to find a function that yields minimum dependence between residuals and regressors, which could be time-consuming in practice.

### 5.4 On the Threshold Parameter

In sections 5.1 to 5.3, DC is forced to make a decision at each trial (the threshold is 0). In this section, we examine the influence of the threshold on the performance of DC. This may help to set its value in practice. We use the same experimental setting as in section 5.2, with the sample size fixed at 4000, and test several values of the threshold.

The proportion of nonidentified models becomes large when the threshold is 0.1 because DC becomes conservative in this situation. The proportions of correctly identified models and of wrongly identified models both decrease as the threshold gets larger. However, the accuracy (the number of correctly identified models divided by the number of correctly identified models plus the number of wrongly identified models) increases. This is reasonable since the decisions made by DC under a higher threshold are more reliable. Based on the plotted results, we observe that the accuracy and decision rate of DC are acceptable at a threshold of 0.05, and we suggest this value as a reasonable choice for the parameter.
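
The quantities discussed here can be computed as in the sketch below. It assumes that, for each model, the difference between the anticausal and the causal distance correlation is available, so that a positive difference above the threshold is a correct decision. The function name `summarize_decisions` and the toy differences are hypothetical.

```python
def summarize_decisions(differences, threshold):
    """Proportions of correct, wrong, and undecided models, plus accuracy among decisions.

    differences[k] is (anticausal distance correlation) - (causal distance correlation)
    for model k, so a value above the threshold is a correct decision.
    """
    total = len(differences)
    correct = sum(1 for d in differences if d > threshold)
    wrong = sum(1 for d in differences if -d > threshold)
    decided = correct + wrong
    return {
        "correct": correct / total,
        "wrong": wrong / total,
        "undecided": (total - decided) / total,
        "decision_rate": decided / total,
        # Accuracy is defined only when at least one decision is made.
        "accuracy": correct / decided if decided else None,
    }

# Toy differences (assumed values, not experimental results).
diffs = [0.3, 0.08, -0.02, -0.2, 0.01]
forced = summarize_decisions(diffs, threshold=0.0)
cautious = summarize_decisions(diffs, threshold=0.05)
```

On these toy values, raising the threshold lowers the decision rate while the accuracy among the remaining decisions goes up, which mirrors the trade-off described above.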

### 5.5 On Decision Rates

In our algorithm, the threshold parameter controls the decision rate of DC. In other words, if we increase the threshold (from 0 to 1), the decision rate decreases (from 100% to 0%). In this section, we study the influence of the parameter on the performance of DC. We follow the experimental setting in section 5.2, fix the sample size at 5000, and vary the decision rate (by changing the threshold from 0 to 1). The plots showing the percentage of correct decisions versus the decision rate are in Figure 6.

As expected, the percentage of correct decisions decreases as the decision rate increases. For example, the percentage of correct decisions is 100% when the decision rate is less than 20%, but it becomes 77% when the decision rate is 100%. This is acceptable since a decision is more reliable when the algorithm makes it under a higher threshold.

### 5.6 On Real-World Data

We apply DC to real-world benchmark cause-effect pairs.^{1} The data set contains records of 88 cause-effect pairs. We exclude the pairs numbered 17, 44, 45, 52, 53, 54, 55, 68, 71, and 75 because they are either multivariate causal pairs or pairs that cannot be fitted into memory using our algorithm. We apply DC, DR (Peters et al., 2011), IGCI (Janzing et al., 2012), LiNGAM (Shimizu et al., 2006), ANM (Hoyer et al., 2009), PNL (Zhang & Hyvärinen, 2009b), and CURE (Sgouritsa, Janzing, Hennig, & Schölkopf, 2015). The last five methods can be directly applied to the data. DC and DR could be applied to discretized data. To make the variables discrete, we process them using the following method. For a variable *X*, if the maximum absolute value of all observations is less than 1, we process it by ; else we process it by .^{2} For each pair, we generate 50 replicates using resampling techniques and then apply the algorithms to the causal pairs. The box plots showing the accuracies are in Figure 7.

Determining causal directions on real-world data is challenging since the causal mechanisms are often complex and the data records could be noisy (Pearl, 2000; Spirtes, Glymour, & Scheines, 2000). However, Figure 7 shows that DC has satisfactory performance on this task. The average accuracy of DC is around 72%, which is the highest among all algorithms. ANM-based methods (DR, LiNGAM, ANM) do not perform well. This may be because the assumptions of additive noise models restrict their applicability.

## 6 Conclusion

In this article, we deal with the causal inference problem on discrete data. We consider the distribution of the cause and the conditional distribution mapping cause to effect as independent random variables, and propose to discover the causal direction via estimating and comparing the distance correlations. Encouraging experimental results are reported. This shows that inferring the causal direction using the independence postulate is a promising research direction. In the future, we will try to extend this method to deal with high-dimensional data.

## Acknowledgments

We thank the editor and the anonymous reviewers for helpful comments. The work described in this article was partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China.

## References

## Notes

^{1} Available at https://webdav.tuebingen.mpg.de/cause-effect/.

^{2}

For pairs 65, 66, 67 which contain stock returns, we process the variables by