Estimating conditional dependence between two random variables given the knowledge of a third is essential in neuroscientific applications for understanding the causal architecture of a distributed network. However, existing methods of assessing conditional dependence, such as conditional mutual information, are computationally expensive, involve free parameters, and are difficult to interpret in the context of realizations. In this letter, we discuss a novel approach to this problem and develop a computationally simple and parameter-free estimator. The difference between the proposed approach and existing ones is that the former expresses conditional dependence in terms of a finite set of realizations, whereas the latter express it in terms of random variables, which are not available in practice. We call this approach conditional association, since it is based on a generalization of the concept of association to arbitrary metric spaces. We also discuss a novel and computationally efficient approach to generating surrogate data for evaluating the significance of the acquired association value.
The problem of assessing conditional dependence between two random variables given the knowledge of a third random variable is essential in many scientific problems, for example, in evaluating effective connectivity of a neuronal network (Quinn, Coleman, Kiyavash, & Hatsopoulos, 2011) or assessing causal influence in a biological network (Lozano, Naoki, Liu, & Rosset, 2009). However, the available methods for assessing conditional dependence (Schreiber, 2000; Fukumizu, Gretton, Sun, & Schölkopf, 2008; Su & White, 2008) suffer from several computational drawbacks. First, they are computationally expensive, requiring O(n^2) to O(n^3) time, where n is the sample size. Second, they almost always require selecting several free parameters, such as a kernel, the corresponding kernel size, and, often, a regularization parameter, and selecting the best values of these parameters remains an open problem. Third, these measures often assess only conditional independence rather than conditional dependence; that is, they are zero if and only if conditional independence is satisfied, but they do not address when and how conditional dependence increases, decreases, or becomes maximum. Fourth, they explore conditional dependence in the context of the random variables, but it remains unclear, given a finite set of realizations, what property of this set makes the realizations conditionally dependent. These issues limit the applicability of these measures to practical problems, where sample size or dimensionality, or both, can be high; where merely knowing that the random variables are conditionally independent is not enough; where one has access to only a set of realizations rather than the random variables; and where simpler approaches such as linear Granger causality and partial correlation (PC) remain dominant due to their inherent simplicity (Dhamala, Rangarajan, & Ding, 2008).
Assessing conditional dependence is particularly essential in estimating causal flow in the sense of Granger (Quinn et al., 2011). It is well known that assessing Granger noncausality, that is, one time series X not having any causal influence on another time series Y, is equivalent to finding whether the present value of Y is conditionally independent of the past values of X given the past values of Y alone (Diks & Panchenko, 2006). However, this approach does not address the issue that if X indeed causes Y, then how does one quantify the strength of causation? This issue is usually addressed in terms of the conditional mutual information (CMI), an established measure of conditional dependence (Joe, 1989; Schreiber, 2000). But conditional mutual information is very difficult to estimate, and the properties satisfied by the measure itself are not inherited by its finite sample estimator. To elaborate, consider the random variables (X, Y, Z) with joint probability law P(X, Y, Z). It is clear that CMI is minimum when equation 1.1 is satisfied, whereas it is maximum when there exists a functional relationship between X and Y for every Z = z (Joe, 1989). Now consider a finite set of realizations from this probability law. Given these samples, CMI is estimated consistently by estimating the Radon-Nikodym derivative using adaptive binning (Pérez-Cruz, 2008) or kernel smoothing (Joe, 1989). But the resulting estimators neither provide the same understanding of when the estimated value is minimum or maximum, and under what circumstances it increases or decreases, nor do they inherit the desired properties of the measure (e.g., they can be negative and are not invariant to one-to-one transformations).
Therefore, there is a clear mismatch between what a measure intends to quantify and what an estimator is able to capture. Viewed slightly differently, although the meaning of the measure is transparent in the context of the random variables (i.e., the probability law), it remains unclear what attribute makes them conditionally dependent from the perspective of the realizations.
In this letter, we address the following question: Given a set of realizations {(x_i, y_i, z_i)} from a random variable triplet (X, Y, Z), what makes the realizations {x_i} conditionally dependent on the realizations {y_i} given the realizations {z_i}? Earlier work in statistics by pioneers such as Galton, Pearson, Spearman, and Kendall takes a similar approach to estimating statistical dependence: rather than starting with a statistical measure and deriving an appropriate estimator, these authors start with the realizations and provide an intuitive explanation of what dependence should imply (Spearman, 1904; Kendall, 1938). While the existing approaches to assessing conditional dependence follow the former conceptual framework, we follow the latter, since in practice we have access to only a finite set of realizations rather than the underlying random variables, and since it is rather difficult to materialize the desired properties and understanding of a measure in an estimator. To achieve this, we generalize the concept of association to arbitrary metric spaces and then extend it to introduce the concept of conditional association.1 The proposed approach not only provides an intuitive view of what conditional dependence is and how it changes, but also culminates in a parameter-free and relatively simple estimator, thus becoming an excellent alternative to the state-of-the-art methods. Another advantage of this approach is that it is defined only in terms of pairwise distances between the realizations and is therefore applicable to exotic metric spaces such as non-Euclidean or infinite-dimensional spaces, for example, the space of spike trains (Seth et al., 2010).
Although conditional association provides a clear interpretation of what conditional dependence implies, the significance of the acquired value remains to be investigated, especially for small sample sizes. Since explicitly computing the asymptotic null distribution in this context is computationally intense, and the theoretical asymptotic null distribution is often violated for finite samples, we consider generating surrogate data to estimate the null distribution.2 Unfortunately, generating surrogate data to simulate conditional independence remains an open area of research (Seth & Príncipe, 2012). In this letter, therefore, we also introduce a novel scheme for generating surrogate data simulating conditional independence. Two attractive properties of the proposed approach are that it involves only one free parameter and that it resamples the original data, thus providing scope for reusing the computations involved in estimating the original conditional dependence value when estimating the surrogate values. However, in its present form, this approach is applicable only to Euclidean spaces, where the Lebesgue measure is defined. Therefore, we limit ourselves here to Euclidean spaces.
The rest of the letter is organized as follows. In section 2, we provide a brief overview of the existing literature on measures of conditional dependence and point out their weaknesses. In section 3, we start with a brief overview of the concept of association and then discuss how this concept can be generalized and extended to address the notion of conditional association. In section 4, we propose a novel scheme for generating surrogate data for evaluating the significance of conditional association. In section 5, we apply the proposed method to synthetic and real data to provide more insight into it. Finally, in section 6, we conclude with a brief summary of the proposed work and some guidelines for future work.
Recently, the problem of assessing conditional dependence has also been addressed in the context of kernel-based learning, where Fukumizu et al. (2008) proposed the squared Hilbert-Schmidt norm of the normalized conditional cross-covariance operator as a measure of conditional dependence (HSNCIC). This measure can be consistently estimated as HSNCIC_n = Tr[R_(XZ) R_(YZ) − 2 R_(XZ) R_(YZ) R_Z + R_(XZ) R_Z R_(YZ) R_Z], where R_U = K_U (K_U + n ε_n I)^{−1}, K_U is a Gram matrix computed with a strictly positive-definite (spd) kernel, I is the identity matrix, Tr denotes the trace of a matrix, and ε_n is a regularization parameter that depends on n. The advantage of this approach is that, unlike CM, HD, or CMI, the use of an spd kernel allows the measure to be defined on any arbitrary set of random variables that might not take values in a Euclidean space. However, there are two major drawbacks: it requires selecting one more free parameter (the regularization), and it is computationally more involved than CM and HD, taking O(n^3) time compared to O(n^2) for the other methods. The free parameters can be chosen by matching the bootstrap variance of the estimator with the theoretical variance (Fukumizu et al., 2008), but this is computationally involved, especially in conjunction with a permutation test and the high computational complexity of the estimator. Moreover, defining an appropriate strictly positive-definite kernel over an abstract space remains an active area of research, further limiting the utility of this approach.
These existing approaches (not including CM) are mathematically elegant as measures of conditional (in)dependence: the measures are zero if and only if conditional independence is satisfied among a random variable triplet; they are maximum when one variable is a function of the other given any value of the third variable; and their respective estimators are consistent, that is, they reach their theoretical values as the number of realizations tends to infinity. Nevertheless, they do not address what conditional dependence implies in terms of a finite set of realizations: how the estimated value increases or decreases and when it reaches its maximum. In section 3, we address this issue, which leads to an alternative understanding of conditional dependence in the context of a finite set of realizations and provides an opportunity to design simpler estimators that are parameter free, can be applied to any metric space (not just Euclidean), and can effectively quantify conditional dependence, as we demonstrate with both real and synthetic data.
Before proceeding, we briefly discuss the notion of association in statistics. Given realizations from two real-valued random variables (X, Y), they are said to be associated if large realizations of X are associated with large realizations of Y. The most widely used measure of association is the correlation coefficient, defined as ρ = E[(X − E[X])(Y − E[Y])] / (σ_X σ_Y). However, this measure captures only linear relationships between two random variables, since the realizations of X and Y are compared in absolute terms. This idea was generalized by Spearman (1904), who proposed using the correlation between the ranks of the realizations as a measure of association. Working with ranks, that is, with relative rather than absolute values, allows Spearman's coefficient to capture a monotonic relationship rather than just a linear one. Kendall (1938), on the other hand, proposed a measure of association using the ideas of concordance and discordance. Two pairs of realizations, (x_i, y_i) and (x_j, y_j), are said to be concordant if (x_i − x_j) and (y_i − y_j) have the same sign; otherwise, they are said to be discordant. Kendall defined a measure of association as the difference between the number of concordant and discordant pairs normalized by the total number of pairs. Evidently, this idea captures the same attribute: whether relatively large realizations of X are associated with relatively large realizations of Y.
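Kendall's construction is concrete enough to sketch directly; the following naive O(n^2) implementation (ties simply skipped) illustrates the pairwise counting:

```python
def kendall_tau(x, y):
    """Kendall's tau: (concordant - discordant) / total pairs.

    A pair (i, j) is concordant when (x[i] - x[j]) and (y[i] - y[j])
    have the same sign, and discordant when the signs differ.
    Tied pairs are simply skipped in this naive sketch.
    """
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    total = n * (n - 1) // 2
    return (concordant - discordant) / total

# A perfectly monotonic (but nonlinear) relationship gives tau = 1.
print(kendall_tau([1, 2, 3, 4, 5], [1, 4, 9, 16, 25]))  # → 1.0
```

Note that a monotonic but nonlinear relationship, which the linear correlation would underestimate, still yields the maximal value here.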
3.1. Generalized Association.
The idea of association is defined only on the real line, where the product (xy) and the ordering (x < y) are well defined, whereas in practice, one often encounters more exotic random variables (e.g., vectors). Therefore, we generalize this idea to metric spaces by defining association in the following way: given two random variables (X, Y) taking values in metric spaces (𝒳, d_X) and (𝒴, d_Y), Y is associated with X if close realization pairs of Y are associated with close realization pairs of X, where closeness is defined in terms of the respective metrics of the two spaces in which the random variables lie. In other words, if two realizations are close in (𝒳, d_X), then the corresponding realizations are close in (𝒴, d_Y). Notice that by construction, this concept is valid in any metric space, not just Euclidean. However, in this letter, we explore this concept only in Euclidean space. We call this approach generalized association.
To quantify this notion of association, we follow this algorithm under the assumption that no two realizations share the same distance from a third realization. Given realizations {(x_i, y_i)}_{i=1}^n:
For all i, repeat the following.
Find the realization x_j closest to x_i in terms of d_X, that is, j = argmin_{k ≠ i} d_X(x_i, x_k).
Find the rank r_i of y_j in terms of d_Y, that is, r_i = #{k ≠ i : d_Y(y_i, y_k) ≤ d_Y(y_i, y_j)}.
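The steps above can be sketched as follows. The summary statistic used here, the mean normalized rank (1 under perfect alignment, roughly 0.5 at chance), is an assumed choice of normalization rather than necessarily the letter's exact one, and the l2 metric is assumed for both domains:

```python
import numpy as np

def _pairwise(a):
    """l2 distance matrix; accepts 1-D or (n, d) arrays."""
    a = np.asarray(a, dtype=float)
    if a.ndim == 1:
        a = a[:, None]
    diff = a[:, None, :] - a[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

def gma(x, y):
    """Generalized association of Y with X (asymmetric), sketched from
    the algorithm in the text: for each i, find x_j closest to x_i,
    then rank y_j by its distance from y_i.  The mean normalized rank
    below is an assumed summary (1 = perfect alignment, ~0.5 = chance).
    Assumes no ties in pairwise distances, as in the text."""
    dx, dy = _pairwise(x), _pairwise(y)
    n = len(dx)
    ranks = np.empty(n)
    for i in range(n):
        dxi = dx[i].copy(); dxi[i] = np.inf
        j = int(np.argmin(dxi))            # closest point in the X domain
        dyi = dy[i].copy(); dyi[i] = np.inf
        # rank of y_j among the distances from y_i (1 = closest)
        ranks[i] = 1 + np.sum(dyi < dyi[j])
    return float(np.mean((n - ranks) / (n - 1)))
```

With y identical to x, every nearest neighbor carries rank 1 and the score is exactly 1; with independent x and y, the ranks are roughly uniform and the score hovers near 0.5.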
Notice that the concept of generalized association is asymmetric: GMA(X, Y) ≠ GMA(Y, X) in general. The intuition behind an asymmetric measure of dependence fits the regression scenario very well, where the regression function Y = f(X) could be noninvertible: although Y can be completely determined by X, the converse is not true. Also, the value of GMA depends on the choice of the metrics d_X and d_Y. This is an undesirable but not surprising fact, since any estimator of dependence, such as an estimator of mutual information (MI), usually requires choosing suitable metrics (Kraskov, Stögbauer, & Grassberger, 2004; Pérez-Cruz, 2008). However, this selection is often overlooked in the context of a Euclidean space, where the default Euclidean norm, the l2-norm, is used in the estimators.
3.2. Conditional Association.
This approach, however, requires designing a metric on the joint domain of (X, Z) from the metrics d_X and d_Z, and it is not clear how to make these metrics compatible. For example, they can be combined as d_XZ = (d_X^2 + d_Z^2)^{1/2}, as in Euclidean space. But a mere scaling of one of the two metrics can suppress the contribution of the other in the combined metric. This is undesirable, and it prevents the final estimator from being invariant to simple scaling of the individual domains. This, however, is not an issue of conditional association in particular but of any estimator of conditional dependence, such as CMI and HSNCIC. Therefore, to evaluate GMA((X, Z), Y), we consider a slightly different approach. Instead of finding the closest point in terms of a joint metric, we do it in terms of the relative positions of the realizations in the individual domains. To elaborate, we first compute the ranks of the realizations from x_i using d_X and the ranks of the realizations from z_i using d_Z, and then find the closest point as the one minimizing the sum of the two ranks. The intuition behind this approach originates from the understanding that a point in the joint domain should be close if it is close in both individual domains. If there are ties in the combined ranks, we resolve them by the ranks of X, since we are interested in knowing how important this variable is in the presence of Z.
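A sketch of this rank-combined search follows, with MCA taken as the difference GMA((X, Z), Y) − GMA(Z, Y); that definition matches the two examples discussed in the text but is our assumption about the exact form, as is the mean-normalized-rank summary:

```python
import numpy as np

def _dist(a):
    """l2 distance matrix; accepts 1-D or (n, d) arrays."""
    a = np.asarray(a, dtype=float)
    if a.ndim == 1:
        a = a[:, None]
    d = a[:, None, :] - a[None, :, :]
    return np.sqrt((d ** 2).sum(-1))

def _ranks_from(d, i):
    """Rank of every realization by distance from realization i
    (1 = closest); realization i itself gets rank +inf."""
    di = d[i].copy(); di[i] = np.inf
    r = np.empty(len(di))
    r[np.argsort(di)] = np.arange(1, len(di) + 1)
    r[i] = np.inf
    return r

def _mean_norm_rank(closest, dy):
    """Mean normalized Y-rank of the chosen closest points."""
    n = len(dy)
    score = 0.0
    for i, j in enumerate(closest):
        dyi = dy[i].copy(); dyi[i] = np.inf
        rank = 1 + np.sum(dyi < dyi[j])
        score += (n - rank) / (n - 1)
    return score / n

def mca(x, y, z):
    """Conditional association of Y with X given Z, sketched as the
    change in association when X is added to Z.  The closest point in
    the joint (X, Z) domain is found by the combined ranks, ties broken
    by the X ranks, as described in the text.  Taking MCA as
    GMA((X, Z), Y) - GMA(Z, Y) is an assumption consistent with the
    examples, not necessarily the letter's exact definition."""
    dx, dy, dz = _dist(x), _dist(y), _dist(z)
    n = len(dx)
    closest_z, closest_xz = [], []
    for i in range(n):
        rx, rz = _ranks_from(dx, i), _ranks_from(dz, i)
        closest_z.append(int(np.argmin(rz)))
        combined = rx + rz
        # lexsort: primary key combined rank, ties broken by X rank
        closest_xz.append(int(np.lexsort((rx, combined))[0]))
    return _mean_norm_rank(closest_xz, dy) - _mean_norm_rank(closest_z, dy)
```

Under this sketch, an irrelevant X pulls the value below zero (it only disrupts the Z-based neighbors), while an X that aligns perfectly with Y pushes it above zero.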
It is evident that this approach does not suffer from the compatibility issue and that it is invariant to a class of transformations of the individual domains, such as scaling on the real line. Also, it preserves our intuition of conditional dependence. To elaborate, let us consider two simple examples. First, consider X1 = V1, Y1 = U1, Z1 = U1, where U1 and V1 are independent. Then GMA(Z1, Y1) = 1, whereas GMA((X1, Z1), Y1) < GMA(Z1, Y1), since the arbitrary ranks of X1 affect the combined ranking and disrupt the perfect alignment (closest point remains closest) between Z1 and Y1. Therefore, MCA(X1, Y1; Z1) < 0. Notice that as a special case, MCA(X, X; X) = 0 (Dawid, 1998). Next, consider X2 = U2, Y2 = U2, Z2 = V2, where U2 and V2 are independent random variables. Then GMA(Z2, Y2) stays at chance level, whereas GMA((X2, Z2), Y2) > GMA(Z2, Y2), since the perfect alignment (closest point remains closest) of X2 helps bring the combined ranks closer to aligning with Y2; therefore, MCA(X2, Y2; Z2) > 0.
4. Surrogate Data
Although conditional association, or any other measure of conditional dependence, returns a value, say q, the significance of this value remains obscure, since a large value can result from the presence of conditional dependence or simply from a lack of evidence, that is, an insufficient number of realizations. Therefore, to remove the effect of small sample size, it is essential to judge the significance of this value in a context where conditional dependence is absent. This can be achieved by generating surrogate data sets simulating conditional independence and observing the values of the measure on these surrogate data sets. Let q_n be the original value estimated from the realizations and {q*_t}_{t=1}^T be the surrogate values estimated from the surrogate data sets. Then we can consider q_n to be significant if (1/T) Σ_{t=1}^T 1[q*_t ≥ q_n] ≤ α, where α is sufficiently close to zero.
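The decision rule reads directly as code; `q_n` stands for whatever estimator value is being tested:

```python
import numpy as np

def is_significant(q_n, surrogate_values, alpha=0.05):
    """Declare the estimated value q_n significant if the fraction of
    surrogate values that reach or exceed it is at most alpha,
    i.e. (1/T) * sum_t 1[q_t* >= q_n] <= alpha."""
    surrogate_values = np.asarray(surrogate_values, dtype=float)
    p = np.mean(surrogate_values >= q_n)
    return p <= alpha

# A value far in the right tail of the surrogate distribution is
# significant; one inside the bulk of the distribution is not.
surr = [0.01, 0.02, -0.01, 0.03, 0.00]
print(is_significant(0.5, surr))    # → True
print(is_significant(0.015, surr))  # → False
```

The quantity `p` is simply an empirical (one-sided) p-value estimated from the T surrogates.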
Generating surrogate data simulating conditional independence, however, is not trivial. Two popular approaches for generating surrogate data have been proposed by Paparoditis and Politis (2000) and Diks and DeGoede (2001). Given realizations from (X, Y, Z), the first approach generates realizations (x*_i, y*_i, z*_i), representing (X*, Y*, Z*), such that X* and Y* are conditionally independent given Z*, while the pairwise distributions of (X*, Z*) and (Y*, Z*) match those of (X, Z) and (Y, Z).6 This is done by sampling from the conditional distributions f_{X|Z} and f_{Y|Z}, respectively. However, these distributions are estimated using Parzen's approach and require selecting an appropriate resampling width, which becomes difficult in higher dimensions. The second approach, on the other hand, generates realizations such that Y* is independent of (X*, Z*). This is done by simply permuting the realizations of Y with respect to the realizations of (X, Z). Although this approach is simple and does not involve any free parameter, independence of Y and (X, Z) is only a sufficient condition for conditional independence, not a necessary one.
Here, we discuss a different approach for generating surrogate data by modifying the first approach. We follow the suggestion in Paparoditis and Politis (2000) in the sense that we first generate samples from f_Z(z) and then from f_{X|Z}(x|z) and f_{Y|Z}(y|z), respectively. However, we make the following modifications to render this approach computationally more attractive in the context of the permutation test. First, we assume that the estimated densities exist only at the sample locations; second, we use a nearest neighbor–based approach to estimate the conditional density functions at these locations, as described in section 2 in the context of estimating CMI; and third, we reuse the realizations {z_i}, {x_i}, and {y_i} as realizations from f_Z(z), f_{X|Z}(x|z), and f_{Y|Z}(y|z), respectively. Reusing the original data is computationally advantageous in the context of a permutation test, since the computation involved in computing the true conditional dependence value, including the distance matrix or the kernel Gram matrix, can be reused to compute the surrogate values. Before discussing other aspects, we present the algorithm in detail.
Consider that we have realizations {(u_i, v_i)}_{i=1}^n from a joint distribution f_UV(u, v). Then, using the definition of the conditional density function, we get f_{U|V}(u_i | v_j) ∝ f_UV(u_i, v_j), since f_V(v_j) is a constant normalizing factor. Therefore, we can estimate f_{U|V}(u_i | v_j) following equation 2.2, where we need to specify a neighborhood parameter k. Given the estimate, we can then sample from this density function (after normalizing), assuming the density function exists over only the realization values {u_i}. Based on this approach, we follow three simple steps to generate surrogate data: for each i, (1) assign z*_i = z_i, (2) sample x*_i from the ensemble {x_j} with probability proportional to the estimated f_{X|Z}(x_j | z_i), and (3) sample y*_i from the ensemble {y_j} with probability proportional to the estimated f_{Y|Z}(y_j | z_i).
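A minimal sketch of the three steps, approximating the kNN conditional density weights by uniform weights over the k nearest neighbors of z_i (with k = √n, as suggested later); the letter's exact kNN weighting may differ, and x and y are taken one-dimensional for brevity:

```python
import numpy as np

def generate_surrogate(x, y, z, rng=None):
    """One surrogate data set simulating conditional independence of
    X and Y given Z, following the three steps in the text:
      (1) z*_i = z_i,
      (2) resample x*_i from the x-values of z_i's k nearest
          z-neighbors (a uniform-weight stand-in for f_{X|Z}),
      (3) resample y*_i likewise for f_{Y|Z}.
    Uniform neighbor weights are a simplifying assumption."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    z = np.asarray(z, dtype=float)
    zz = z[:, None] if z.ndim == 1 else z
    n = len(zz)
    k = max(1, int(np.sqrt(n)))
    dz = np.sqrt(((zz[:, None, :] - zz[None, :, :]) ** 2).sum(-1))
    x_s = np.empty(n)
    y_s = np.empty(n)
    for i in range(n):
        nbrs = np.argsort(dz[i])[:k]      # k nearest z-neighbors of z_i
        x_s[i] = x[rng.choice(nbrs)]      # step 2
        y_s[i] = y[rng.choice(nbrs)]      # step 3
    return x_s, y_s, z.copy()             # step 1: z* = z
```

Note that each surrogate resamples only the original data points, so distance matrices computed once for the original data can in principle be reindexed rather than recomputed.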
Although this process is simple, it involves a free parameter k, which, in some sense, controls the smoothness of the estimated density function. We set k = √n, a popular choice in nearest neighbor–based density estimation (Pérez-Cruz, 2008). This choice may not be optimal, but we empirically show that it works well in practice. Also, notice that, in some sense, this approach can be understood as the first approach with a variable kernel size, and its computational complexity is higher than that of the second approach.
4.1. Sanity Check.
To demonstrate the validity of the proposed measure and the surrogate data generation technique, we consider the following two examples.
Conditionally dependent but independent variables. Consider three independent normally distributed random variables X, Y, and ε, that is, X, Y, ε ~ N(0, 1), and a third random variable Z = X + Y + γε, where γ > 0. Here Y and X are independent, but they are conditionally dependent given the event Z = z, since then Y = z − X − γε; that is, the value of Y can be partially determined from the value of X and vice versa. Therefore, we should expect that GMA(X, Y) stays at chance level and MCA(X, Y; Z) > 0. We call this example ExCoDe.
Conditionally independent but dependent variables. Consider three independent normally distributed random variables Z, ε1, and ε2, that is, Z, ε1, ε2 ~ N(0, 1), and construct two random variables X = γZ + ε1 and Y = γZ + ε2, where γ > 0. Here X and Y are dependent, since both originate from Z and are corrupted by two independent noises. However, these two random variables are conditionally independent given the event Z = z, since then X = γz + ε1 and Y = γz + ε2, which are independent by construction. Therefore, we should expect that GMA(X, Y) > 0.5 and MCA(X, Y; Z) ≈ 0. We call this example ExCoIn.
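The two constructions can be generated as follows; the exact placements of the noise scale γ (Z = X + Y + γε for ExCoDe; X = γZ + ε1, Y = γZ + ε2 for ExCoIn) are our plausible reading of the text, not a verbatim reproduction:

```python
import numpy as np

def excode(n, gamma, rng=None):
    """ExCoDe: X and Y independent, but conditionally dependent given
    Z.  Z = X + Y + gamma * eps is an assumed construction, with gamma
    controlling the noise level relative to the signal."""
    rng = np.random.default_rng(rng)
    x, y, eps = rng.normal(size=(3, n))
    return x, y, x + y + gamma * eps

def excoin(n, gamma, rng=None):
    """ExCoIn: X and Y dependent (both driven by Z), but conditionally
    independent given Z.  X = gamma*Z + eps1 and Y = gamma*Z + eps2 is
    again an assumed construction; for large gamma the (X, Z) and
    (Y, Z) laws become nearly singular, as the text notes."""
    rng = np.random.default_rng(rng)
    z, e1, e2 = rng.normal(size=(3, n))
    return gamma * z + e1, gamma * z + e2, z
```

A quick sanity check with partial correlation: for ExCoDe the residuals of X and Y after regressing out Z are correlated (population partial correlation −γ-dependent, e.g. −0.5 at γ = 1), while the plain correlation of X and Y is near zero; ExCoIn shows the reverse pattern.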
Notice that the quality of the surrogate data should be judged by a measure of conditional dependence, while the significance of the measured value itself is judged by the surrogate data. Therefore, to evaluate the quality of the surrogate data, we rely on a simpler and more established measure of conditional dependence, the partial correlation (PC), along with the methods described in this letter: MCA, CMI, and HSNCIC. PC can be reliably applied in these two examples since the joint probability law for both examples is gaussian. For CMI, we use the nearest neighbor–based estimator described earlier, whereas for HSNCIC, we use a gaussian kernel with the kernel size set to the median of the intersample distances and the regularization value set to 1/n. We use 1000 sets of 500 realizations to estimate the true values and T = 100 surrogates to judge the significance of these values. In Figure 1, we show the performance of these methods.
We observe that for both examples, the surrogate values are concentrated around zero for PC and MCA, which is promising. However, the distributions of the surrogate values for CMI and HSNCIC are biased, a common observation for finite sample estimation. For ExCoDe, the true values are much higher than the surrogate values, which indicates the presence of conditional dependence, whereas for ExCoIn, the true values are almost identically distributed as the surrogate values, which is expected. Also, for ExCoDe, we observe a monotonic increase in the estimated values, a desired characteristic of a measure of conditional dependence.
For ExCoDe, γ controls the difficulty of the problem in terms of signal-to-noise ratio; that is, a higher γ injects more noise (γε) relative to the signal (X), making it difficult to observe the contribution of X on Y given Z. The proportion of significant values out of all trials indicates how accurately the methods have assessed conditional dependence. We observe that PC achieves the best performance in precisely identifying conditional dependence. The performance of PC is justified since the original realizations are gaussian in nature. The performances of CMI and HSNCIC are better than that of MCA; however, this performance is achieved through proper selection of the parameter values and at the expense of higher computational cost. Also, we observe that CMI tends to wrongly establish conditional dependence in both examples.
Although the true and the surrogate values of the measures for ExCoDe follow the desired pattern, that is, the true values monotonically increase with increasing γ while the surrogate values maintain a steady low, we do not observe the same effect for ExCoIn. Notice that the surrogate values of PC become larger for large γ. A possible reason is that for large γ, the joint distributions of (X, Z) and (Y, Z) are almost singular and thus difficult to estimate. Therefore, it is possible that the surrogate data generated from these distributions are not accurate, and therefore not gaussian, thus distorting the values returned by PC. HSNCIC demonstrates a different behavior: although it maintains a similar surrogate value distribution over different γ, the true value estimated by this measure monotonically decreases with increasing γ, forcing it to miss detecting conditional dependence. A probable cause of this observation is, again, the choice of free parameters. Since for large γ the probability law becomes narrower, HSNCIC perhaps requires a smaller kernel size than the selected one for proper estimation.
In summary, this experiment shows that, first, the surrogate data generated by the proposed method may not be accurate with respect to all measures of conditional dependence, but they are still sufficiently reliable for assessing significance in the context of MCA, and second, the existing measures of conditional dependence rely on the accurate choice of parameters to make a proper decision (i.e., to avoid misleading assessment of significance).
In this section, we apply the proposed approach to several real and synthetic data sets over varying sample sizes and dimensionality for assessing causal strength in the sense of Granger (1980), and we discuss its pros and cons. The other available methods of conditional dependence are usually applied to small-scale problems due to their computational load and the inherent difficulty of choosing appropriate values of the free parameters (Fukumizu et al., 2008; Sun, 2008; Seth & Príncipe, 2012).
5.1. Conditional Granger Causality.
The issue of causal influence between two time series can be trivially extended to multivariate time series involving three or more time series, where it is often desired to separate a direct cause from an indirect one, that is, to judge whether the time series X_t and Y_t are causally connected given a third time series Z_t. X is said to cause Y given Z if the past values of X contain additional information about the present value Y_t of Y that is not contained in the past values of Y and Z. In terms of conditional independence, X not causing Y given Z implies that the present value Y_t of Y is conditionally independent of the past values of X given the past values of Y and Z.
For our experiments, we generate multivariate time series with two to five component series. To separate a direct cause from an indirect cause, we quantify the conditional causal influence; that is, we quantify the causal influence of W_i on W_j by the conditional association of Y = W_j(t) with X = [W_i(t − 1), …, W_i(t − L)] given Z, where Z collects the L past values of W_j (and of any additional series to be conditioned on), and L is the number of past values that we condition on. Since we are working in Euclidean space, we use the l2 distance as the metric for all three spaces. For each experiment, we use T = 100 trials to evaluate the significance of the acquired values and present our results over 128 repetitions of the same experiment. For all the simulation results, we observe a common feature: the performance of MCA in terms of providing significant values improves with increasing sample size n and degrades with increasing embedding dimension L. This is expected, since L controls the dimensionality, and the higher the value of L, the more difficult and sparse the problem becomes.
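Assembling the (X, Y, Z) triplet from two series is mechanical; the helper below (our naming) builds the lagged matrices for the bivariate case:

```python
import numpy as np

def granger_variables(w_i, w_j, L):
    """Build the (X, Y, Z) triplet for testing whether series w_i
    causes w_j in the sense of Granger, conditioned on L past values:
      Y = w_j(t),
      X = [w_i(t-1), ..., w_i(t-L)],
      Z = [w_j(t-1), ..., w_j(t-L)].
    Returns arrays of shape (n-L,), (n-L, L), and (n-L, L)."""
    w_i = np.asarray(w_i, dtype=float)
    w_j = np.asarray(w_j, dtype=float)
    n = len(w_j)
    Y = w_j[L:]
    X = np.column_stack([w_i[L - l: n - l] for l in range(1, L + 1)])
    Z = np.column_stack([w_j[L - l: n - l] for l in range(1, L + 1)])
    return X, Y, Z
```

The resulting X, Y, and Z can be fed directly to any estimator of conditional association or conditional dependence; conditioning on further series amounts to stacking their lagged matrices onto Z.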
5.2. Linear System.
5.3. Nonlinear System.
5.4. Varying Coupling Strength.
5.5. Multivariate Time Series.
5.6. Heart Rate and Respiration Force.
Next, we apply MCA to the bivariate time series of heart rate (H) and respiration force (R) of a patient with sleep apnea. These data were acquired from the Santa Fe Institute time series competition.8 Normally, respiration force has a causal influence on heart rate; for a patient with sleep apnea, however, this causal direction is reversed. Therefore, for this data set, a strong causal influence is observed from heart rate to respiration force, whereas only a weak influence is observed in the other direction (Schreiber, 2000). We randomly select segments of length n = 480, 720, 960 from the time series and use L = 32, 40, 48, 56, 64, 72. Since the time series is sampled at 2 Hz, these values are equivalent to n = 4, 6, 8 minutes and L = 16, 20, 24, 28, 32, 36 seconds. We observe from Figure 6 that MCA strongly supports the causal influence of heart rate on respiration force.
In this letter, we introduced the concept of conditional association as a substitute for conditional dependence. The major difference between the proposed approach and existing ones is that it explores the concept of conditional dependence in the context of the realizations rather than random variables or the probability law. Unlike available measures of association, the proposed approach is parameter free and relatively easy to compute, thus making it an excellent tool for inferring causal influence among random variables and stochastic processes. We have also introduced a novel scheme for generating surrogate data to evaluate the significance of the acquired value, which is attractive since it resamples the original data, allowing the computations involved in computing the conditional measure to be reused to compute surrogate values.
Although the initial results of these two approaches are very promising, a few aspects require further investigation:
We have observed that the proposed approach usually provides less statistical power than PC in the context of a gaussian probability law. Although this is not a fundamental drawback of the method, it is certainly an undesirable property. Therefore, more sophisticated measures of association should be investigated to improve the performance of the proposed approach in terms of statistical power.
We have observed that the quality of the surrogate values generated by the proposed method is somewhat poor when the probability law is close to degenerate. This issue should be explored in more detail. Also, the full extent of the effect of the free neighborhood parameter remains to be explored.
We have explored the measure of conditional association only in the context of the realizations. However, a corresponding population version of this approach would be interesting to investigate in order to establish whether conditional association is a necessary and sufficient condition for conditional independence.
Finally, we have noted that the proposed approach can be applied to any metric space. However, in this letter, we have restricted ourselves to Euclidean space, partly for simplicity of understanding and partly due to the unavailability of a surrogate data generation technique beyond Euclidean spaces. It would be interesting to apply this approach to more abstract spaces to fully understand its capabilities and limitations.
This project is partially supported by the NSF grant IIS-096419. We thank Austin Brockmeier for many insightful discussions.
Notice that the work presented in this letter is not the same as in Holland and Rosenbaum (1986) where the authors have explored a different concept under the same name.
Our objective is not to design a test for conditional independence, but merely to observe if the acquired conditional association value is significant enough to be considered a sign of conditional dependence. Testing conditional independence requires a measure of conditional independence (Seth & Príncipe, 2012), and we have yet to find a formal derivation that the measure of conditional association satisfies the necessary properties.
The usual definition of the conditional CDF is F_{Y|X}(y | x) = P(Y ≤ y | X = x).
It is not a strict measure of conditional independence since it is only a necessary condition, but not an if and only if condition.
The expression X =^d Y implies that the random variables X and Y follow the same distribution.
It has been established that the asymptotic null distribution of (smoothed) measures of conditional independence is usually independent of certain aspects of the dependence structure. However, a similar result for finite samples and arbitrary dimension is usually unavailable in the literature.