## Abstract

Accurate causal inference among time series helps to better understand the interactive scheme behind the temporal variables. For time series analysis, an unavoidable issue is the existence of time lag among different temporal variables. That is, past evidence takes some time to cause a future effect instead of producing an immediate response. To model this process, existing approaches commonly adopt a prefixed time window to define the lag. However, in many real-world applications, this parameter may vary among different time series, and it is hard to predefine with a fixed value. In this letter, we propose to learn the causal relations as well as the lag among different time series simultaneously from data. Specifically, we develop a probabilistic decomposed slab-and-spike (DSS) model to perform the inference by applying a pair of decomposed slab-and-spike variables to the model coefficients, where the first variable is used to estimate the causal relationship and the second one captures the lag information among different temporal variables. For parameter inference, we propose an efficient expectation propagation (EP) algorithm to solve the DSS model. Experimental results on both synthetic and real-world problems demonstrate the effectiveness of the proposed method. The revealed time lag is well validated by domain knowledge in the real-world applications.

## 1 Introduction

Temporal causal inference aims to reveal the causal relationship among temporal variables and time series. Much effort has gone into learning an accurate temporal causal inference model for various applications, such as gene regulatory network discovery (Friedman, 2004; Liu, Niculescu-Mizil, Lozano, & Lu, 2011), social network analysis (Goldenberg & Moore, 2004), indictor analysis in business (Arnold, Liu, & Abe, 2007), climate data study (Chen, Liu, Liu, & Carbonell, 2010), and traffic analysis (Han, Song, Cong, & Xie, 2012).

For temporal causal modeling, an important factor is the time lag, which indicates how long one must look into the past to make accurate predictions of a current event. A real-world example is provided in Figure 1. Figure 1a shows the structure of a highway network; the red circle indicates a destination, and the blue circles denote the starting points. The traffic flow originating from the blue circles goes through the network, at the cost of some travel time, and finally arrives at the destination. Here, the travel time introduces a time lag between the time series recorded at the source points and the time series recorded at the destination. This lag information is captured in Figures 1b to 1d. Figure 1b records the number of vehicles from the source point A arriving at the destination between 10:00 and 10:15. We observe that these vehicles depart from A earlier: 9:15 to 10:00. Treating 15 minutes as one unit, the time lag from A to the destination is thus a value within a range measured in these units. Similarly, Figures 1c and 1d record the lag information from starting points B and C to the destination, respectively. Commonly, the lag between a pair of time series can be described as a value in a range (e.g., the values in Figure 1). Hence, an accurate estimation of the range for each pair of time series can bring a better understanding of the causal relations and more accurate prediction performance.

Most of the existing causal inference methods treat the lag and the time window length as constants instead of variables before learning. For example, previous approaches (Bahadori, Liu, & Xing, 2013; Cheng, Bahadori, & Liu, 2014; Lebre, Becq, Devaux, Stumpf, & Lelandais, 2010) fix the time window in advance, which assumes that the current event is affected only by the most recent history, and choose the window length either by cross-validation or as a predefined value. As we can see from the previous example, such a setting will prevent the model from detecting the true time lags and causal relationships among temporal variables.

In this letter, we propose to learn the causal relationship and the time lag among different temporal variables at the same time. Specifically, we propose a probabilistic decomposed slab-and-spike (DSS) model in which we separate the inference by decomposing the slab-and-spike variable into a product of two latent variables: one designed for estimating the causal relationship among different temporal variables and the other for capturing the optimal time lag among these variables in the form of a range. For parameter inference, we propose an efficient expectation propagation algorithm to solve the DSS model. Experimental evaluations are performed on both synthetic and real-world problems, and the results demonstrate the effectiveness of the DSS model and show that the recovered time lags match the domain knowledge well.

The rest of the letter is organized as follows. Related work is discussed in section 2. We describe our model in section 3; then we give the detailed inference in section 4. We show experimental results on two data sets in section 5. We summarize the letter and conclude with future work in section 6.

## 2 Related Work

There has been a rich literature on temporal causal learning. Since the seminal work of Granger causality (Granger, 1969), where a statistical hypothesis test was proposed to determine whether one time series is useful in predicting another, many approaches have been developed for inferring the causal relationship among temporal variables. For example, the exhaustive graphical Granger method (Arnold, Liu, & Abe, 2007) applied the Granger causality test to each pair of temporal variables and established a causal relation graph for the variables; the vector autoregressive (VAR) model (Gilbert, 1995; Hatemi-J, 2004) fitted a linear regression by placing the current value of all the temporal variables as the response while treating the past values as covariates; the LASSO Granger method (Arnold et al., 2007; Qiu, Liu, Subrahmanya, & Li, 2012) employed sparsity-inducing regularization to learn sparse causal relations among the variables. The SIN method (Arnold et al., 2007; Drton & Perlman, 2008) adopted a notion of the gaussian graphical model to test the causal relationship among temporal variables, and a greedy extension of this idea to deal with large-scale graphs was proposed in Han et al. (2012). Causal inference models utilizing hidden Markov random field regression were discussed in Liu, Kalagnanam, and Johnsen (2009) and Liu, Niculescu-Mizil, Lozano, and Lu (2010, 2011). Other causal inference models that aim to analyze irregular time series (Qiu et al., 2012; Bahadori & Liu, 2012), evolving relational graphs (Chen et al., 2010), and the switch of roles between cause and effect (Cheng et al., 2014) have been studied recently as well. Despite this rich literature, when considering the time lag, most of these methods prefix its value as a constant before learning, leading to inaccurate estimates of the causal relations.
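For concreteness, the pairwise Granger causality test underlying these graphical methods can be sketched as follows. This is a standard bivariate F-test; the function name, the simulated data, and the chosen lag are illustrative, not taken from any of the cited works:

```python
import numpy as np
from scipy import stats

def granger_test(x, y, max_lag):
    """F-test: do past values of x improve prediction of y
    beyond y's own history? (bivariate Granger causality)"""
    T = len(y)
    # Align the response with lagged covariates for t = max_lag .. T-1.
    Y = y[max_lag:]
    own = np.column_stack([y[max_lag - k:T - k] for k in range(1, max_lag + 1)])
    cross = np.column_stack([x[max_lag - k:T - k] for k in range(1, max_lag + 1)])
    restricted = np.column_stack([np.ones(len(Y)), own])      # y's history only
    full = np.column_stack([restricted, cross])               # plus x's history
    rss_r = np.sum((Y - restricted @ np.linalg.lstsq(restricted, Y, rcond=None)[0]) ** 2)
    rss_f = np.sum((Y - full @ np.linalg.lstsq(full, Y, rcond=None)[0]) ** 2)
    df1, df2 = max_lag, len(Y) - full.shape[1]
    F = ((rss_r - rss_f) / df1) / (rss_f / df2)
    return F, stats.f.sf(F, df1, df2)                         # F statistic, p-value

# Illustrative data: y is driven by x at a lag of 2 steps.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = np.zeros(500)
y[2:] = 0.8 * x[:-2] + 0.1 * rng.normal(size=498)
F, p = granger_test(x, y, max_lag=3)
```

The exhaustive graphical Granger method applies exactly this kind of test to every ordered pair of series and keeps an edge whenever the p-value falls below a chosen threshold.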

Very few studies have tried to learn the time lag from data. Dhurandhar (2010) proposed estimating the maximum lag for a grouped graphical Granger model. Knapp and Carter (1976) developed a maximum likelihood estimator to determine the time delay in correlation analysis. Unfortunately, these works focus only on inferring the lag; they cannot infer the causal relationship at the same time. In contrast, we aim to learn the causal relationship and the lag among temporal variables simultaneously from data.

## 3 The Decomposed Slab-and-Spike Model

Since we aim to learn the causal relationship and the time lag between the causes and the target simultaneously, we introduce two binary latent variables, denoted $z_i$ and $s_{ij}$, where $z_i$ controls the causal relationship between the $i$th candidate cause $x_i$ and the target $y$, and $s_{ij}$ contains the lag information. Specifically, $z_i = 1$ indicates that the variable $x_i$ is possibly causally related to $y$; otherwise, if $z_i = 0$, $x_i$ is not responsible for $y$.

On the other hand, $s_{ij}$ determines whether a specific time point lies in the lag interval of the $i$th variable, where $j$ indexes the $j$th time point in the time window of variable $x_i$. That is, when $s_{ij} = 1$, the $j$th time point is regarded as belonging to the lag interval of variable $x_i$; hence, $x_i$ at that time point could be a cause of $y$, and the corresponding coefficient may be nonzero. Otherwise, when $s_{ij} = 0$, the $j$th time point is outside the lag interval and is removed from the model by forcing the corresponding coefficient to zero.

Recall that we consider the lag in the form of an interval $[a_i, b_i]$; therefore, we need to encode the two endpoints $a_i$ and $b_i$ in the latent variables $s_{ij}$. To achieve this, we introduce two new variables parameterizing the endpoints and reformulate $a_i$ and $b_i$ in terms of them, so that the lag interval can be represented as the set of time points $j$ with $a_i \le j \le b_i$, that is, exactly those $j$ for which $s_{ij} = 1$.
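As a toy numerical illustration of this decomposition (with hypothetical sizes and hyperparameters, not the letter's exact formulation), each effective coefficient is the product of a causal indicator, a lag-membership indicator, and a gaussian slab value:

```python
import numpy as np

rng = np.random.default_rng(1)
n_vars, window = 5, 10          # hypothetical numbers of variables and time points

# z[i]: 1 if variable i is a cause of the target (slab), else 0 (spike).
z = rng.binomial(1, 0.5, size=n_vars)

# s[i, j]: 1 if time point j falls inside the lag interval [a_i, b_i].
a = rng.integers(0, 5, size=n_vars)
b = a + rng.integers(1, 5, size=n_vars)
j = np.arange(window)
s = ((j >= a[:, None]) & (j <= b[:, None])).astype(int)

# Slab coefficients are gaussian; the effective weight is the product,
# so w[i, j] is nonzero only when variable i is causal AND j is in its lag.
beta = rng.normal(size=(n_vars, window))
w = z[:, None] * s * beta
```

Under this construction, a whole row of `w` vanishes when `z[i] = 0`, and within a causal row only the entries inside the learned lag interval survive, which is precisely the sparsity pattern the DSS model is designed to recover.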

Based on equation 3.9, Figures 2a and 2b show example plots of the probability density under different choices of hyperparameters. The spike-and-slab prior is a mixture of a gaussian distribution (the slab) and a point probability mass at zero (the spike); the spike is displayed as an upward-pointing arrow whose height is the probability mass of the prior at zero. From Figure 2, we observe that the proposed spike-and-slab prior can clearly discriminate between the coefficients modeled by the slab, whose probability density is barely changed, and the coefficients described by the spike, whose posterior has significantly different peaks around zero.

In many real-world applications, we may have a priori knowledge about the causal relationship and the time lag. For example, in traffic and climate analysis, time series collected from spatially close locations are more likely to be causally related, and the time lag between them is relatively small compared with that between distant locations (see the example in Figure 1). Hence, we may choose the hyperparameters of the priors to match this knowledge and facilitate the learning, as we do in our experiments.

### 3.1 Prediction and Dependence

A practical difficulty is that the exact computation of equations 3.10 to 3.12 is intractable for typical learning problems. All of these expressions involve sums whose number of terms grows exponentially, so they must be approximated in practice. As an efficient alternative, we employ expectation propagation (EP), a method for fast, approximate inference in Bayesian models.

## 4 Parameter Inference

In this section, we propose an efficient parameter inference algorithm to solve the DSS model.

EP is a deterministic method for carrying out approximate Bayesian inference. EP approximates the posterior distribution of the parameters of interest using a simpler parametric distribution $q$. The form of $q$ is chosen so that the integrals required to calculate expected values and marginal distributions with respect to $q$ can be obtained analytically in closed form. EP fixes the parameters of $q$ to approximately minimize the Kullback-Leibler divergence between the exact posterior and $q$ (Hernández-Lobato et al., 2013).
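At the heart of each EP update is moment matching: the KL-optimal gaussian approximation to a tilted distribution is the gaussian with the same first two moments. As a toy illustration of this step (not the DSS updates themselves), consider approximating the product of a gaussian and a positivity constraint:

```python
import numpy as np
from scipy import stats

# Toy EP-style moment matching: approximate p(w) ∝ N(w; m, v) * 1[w > 0]
# with a gaussian q(w) sharing its mean and variance (the gaussian that
# minimizes KL(p || q) matches the first two moments of p).
m, v = 0.0, 1.0
alpha = (0.0 - m) / np.sqrt(v)               # standardized truncation point
tilted = stats.truncnorm(a=alpha, b=np.inf, loc=m, scale=np.sqrt(v))
q_mean, q_var = tilted.mean(), tilted.var()

# For m = 0, v = 1 (a half-normal), the matched moments have closed forms:
# mean = sqrt(2/pi) ≈ 0.798, variance = 1 - 2/pi ≈ 0.363.
```

In the full algorithm, each intractable factor of the DSS posterior is replaced by an approximate factor, and updates of this moment-matching form are applied against the cavity distribution until convergence.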

### 4.1 Approximation of the Posterior

### 4.2 EP Update Operations

Next, we perform the expectation propagation updates. Taking one approximate factor as an example, we first compute the parameters of the marginal distributions of the latent variables under the cavity distribution. These parameters are obtained from the rules for the quotient between gaussian distributions and the quotient between Bernoulli distributions (Hernández-Lobato, 2009). Once the cavity distribution is computed, we find the corresponding approximate factor that minimizes the KL divergence to the tilted distribution. After all the approximate factors have been updated in parallel, the posterior approximation is recomputed as the product of all the approximate factors. Similar calculation processes can be found in Hernández-Lobato et al. (2013) and Hernández-Lobato (2009). Computations for the remaining factors can be derived similarly. The procedure is summarized in algorithm 1.

## 5 Experiments

In this section, we present empirical results on both synthetic and real-world data sets. We compare the proposed DSS model with state-of-the-art temporal causal inference methods, including the Granger LASSO (L(G); Arnold et al., 2007), the standard LASSO (L(S); Lee, Lee, Abbeel, & Ng, 2006), the SIN Granger method (Drton & Perlman, 2008), the TLHL model (Zhou et al., 2015), and the VAR model (Gilbert, 1995). Specifically, the standard LASSO does not consider a time lag interval; it looks back only one time unit. The Granger LASSO and SIN Granger methods are given a fixed time window as the lag interval. The TLHL model treats lags as adaptive parameters, but it assumes the regression coefficients follow an exponential form. The VAR model generalizes the univariate AR model to multiple time series. For the LASSO-based methods, we choose the regularization parameters from a candidate set by cross-validation, following Arnold et al. (2007).
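For reference, the Granger LASSO baseline amounts to fitting a sparse regression on a lagged design matrix. The sketch below is illustrative (the helper name, data sizes, and the use of scikit-learn's `LassoCV` for the cross-validated regularization parameter are our choices, not the letter's exact setup):

```python
import numpy as np
from sklearn.linear_model import LassoCV

def lagged_design(X, max_lag):
    """Stack the past `max_lag` values of every series as covariates."""
    T, p = X.shape
    cols = [X[max_lag - k:T - k, i] for i in range(p) for k in range(1, max_lag + 1)]
    return np.column_stack(cols)          # shape (T - max_lag, p * max_lag)

# Illustrative data: y is driven by series 0 at a lag of 2 steps.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = np.zeros(300)
y[2:] = 1.5 * X[:-2, 0] + 0.1 * rng.normal(size=298)

max_lag = 3
D = lagged_design(X, max_lag)
model = LassoCV(cv=5).fit(D, y[max_lag:])
coef = model.coef_.reshape(4, max_lag)    # row i = coefficients on lags of series i
```

A nonzero row of `coef` is then read as a Granger-causal edge from that series to the target; note that the window `max_lag` is fixed in advance, which is exactly the limitation the DSS model removes.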

### 5.1 Synthetic Data

#### 5.1.1 Setting

We first evaluate the various methods on synthetic data, fixing the number of samples, the number of variables, and the time window length in advance. The predictors are generated from the standard normal distribution. To simulate the causal relationship, we set the response variable to be affected by the first eight predictors. The corresponding probability matrix is set as shown in Figure 3a.

We select the regularization parameter of the standard LASSO by cross-validation, choosing the optimal value from a candidate set. The selection for the Granger LASSO is done in the same way.

#### 5.1.2 Evaluation

#### 5.1.3 Results

We evaluate the estimated parameters of the DSS model. Figures 3a and 3b show the true probability matrix and the one learned by the DSS model, respectively. Comparing the two panels, we can see that the learned probability matrix is close to the true matrix, especially for the time points where the probability equals 1. Thus, we can conclude that the learned probabilities clearly recover the true ones, which demonstrates that the proposed DSS model can accurately estimate the lag intervals for different covariates.

Moreover, we can obtain the causal relations by observing the probability that each coefficient is different from zero. Figure 4 reports the learned probabilities. From the results, we observe that they match well the ground truth that the first eight variables are the causal factors of the response.

Figure 5 shows the effect of the time window length, which we vary from 1 to 25 while evaluating the F1 measures of the obtained lags. It is clear that once the window is larger than the maximum lag defined over all the features (i.e., 20), the F1 score reaches a high value around 0.8 and generally remains stable. This observation is reasonable since the considered time window should be larger than the existing maximum lag so that all the lags in the variables can be accurately discovered. Moreover, the results also imply that we do not need a very large window.

Next, we compare the DSS model with the other methods in terms of learning accuracy. The F1 score of the causal relationship recovery is provided in Figure 6. From the result, we can conclude that the DSS model is much more stable than all the other methods. Because the standard LASSO with a fixed time lag considers only one lag value, fluctuations in the lag cannot be captured; thus, its performance is worse.

Finally, based on the recovered dependency relationships, we can evaluate the various models in terms of prediction performance. We use the MAPE to evaluate the performance of different methods. Figure 7 provides the prediction results. We see that our model achieves the lowest prediction errors in terms of MAPE.
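The MAPE itself is straightforward; a minimal sketch, assuming strictly nonzero ground-truth values:

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, averaged over all test points.
    Assumes every ground-truth value is nonzero."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true))

# Example: errors of 50% and 25% average to a MAPE of 0.375.
err = mape([2.0, 4.0], [1.0, 5.0])
```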

### 5.2 Traffic Data

In this section, we evaluate the proposed methods on a real-world traffic data set studied by Han et al. (2012).

#### 5.2.1 Setting

The features in this traffic data set are observations collected from sensors located at stations in a traffic network. The observation at each station is the number of vehicles that pass during each 15-minute interval. The data are collected from records over 92 days at 236 traffic stations, so 236 time series are observed in this data set. The data values are normalized to [0, 1]. The task here is to infer the causal relationship among the different time series during two periods in each day: 7:00 a.m. to noon and midnight to 7:00 a.m. We select one month's data, with 600 samples, as the training data.

In this traffic data set, there exists prior knowledge on the lag intervals. When a vehicle passes a traffic station, the time is recorded; as a consequence, we can compute the average travel time between any two locations. This travel time can be viewed as a rough guess of the lag, and we use it to define the hyperparameters in the DSS model. Specifically, we set the corresponding hyperparameter to the average travel time.
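A hypothetical sketch of this step, where the travel times are made-up measurements and the half-width of the prior interval is an assumed hyperparameter (the letter does not specify either):

```python
import numpy as np

# Assumed average travel times (minutes) from three source stations to the
# target; in practice these come from the recorded vehicle passage times.
travel_minutes = np.array([22.0, 47.0, 95.0])
unit = 15.0                                  # one time unit = 15 minutes

# Center the prior lag of each source station at its travel time in units.
prior_center = np.rint(travel_minutes / unit).astype(int)

# Assumed prior uncertainty: allow +/- 2 units around the center,
# clipped below at zero.
prior_halfwidth = 2
prior_interval = np.stack([np.maximum(prior_center - prior_halfwidth, 0),
                           prior_center + prior_halfwidth], axis=1)
```

The model then refines these rough intervals during inference rather than treating them as fixed lags.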

#### 5.2.2 Results and Discussion

Table 1 shows the predicted mean absolute percentage error for all the methods. From the results, we see that the DSS method outperforms the other methods. This demonstrates that learning both the causal relations and the lag intervals simultaneously from data can significantly improve prediction performance compared to the methods that fix the length of the lag.

| | L(S) | VAR | L(G) | TLHL | DSS |
|---|---|---|---|---|---|
| MAPE | 0.35 | 0.22 | 0.30 | 0.23 | 0.12 |


We also show the detailed causal relations that the DSS model learned from 7:00 a.m. to noon and from midnight to 5:00 a.m. in Figures 8a and 8b, respectively. In the two figures, the red circle indicates the target station (destination), and the blue circles indicate the source stations (starting points). We observe that from 7:00 a.m. to noon, the causally related source stations are mostly located close to the target station. However, from midnight to 5:00 a.m. (late at night), the causally related source stations change dramatically: they are generally distant from the target station.

This interesting phenomenon is well explained by our domain experts. A large number of trucks travel through this traffic network late at night every day, and most of the trucks travel a long distance for business reasons. In contrast, during the day, most of the traffic consists of private cars, which usually travel a very short distance. This knowledge explains the results in Figures 8a and 8b well and demonstrates the effectiveness of the proposed method.

The lag intervals that the DSS model learned are shown in Figure 9. Most of the ground-truth travel time values lie in the learned lag intervals, which implies that the inferred lag intervals are very accurate. For example, from Figure 9a, we can see that the lag interval of station 10 is [9, 15] while that of station 1 is [2, 8], which is consistent with the fact that the distance between station 10 and the target station is much greater than that between station 1 and the target station.

### 5.3 Air Quality Data

In this section, we apply our method to air quality data analysis. The problem of air quality has attracted increasing interest in recent years (Intergovernmental Panel on Climate Change, 2007). In this application, one of the most important challenges is to discover the dispersion path of air pollution (Seinfeld & Pandis, 2016), and learning the spatial causal relationships is very helpful for interpreting and understanding the spreading pattern of the pollution. We test all the methods on a real-world air quality data set to evaluate their effectiveness.

#### 5.3.1 Data

The air quality data are collected from 14 air quality monitoring stations at different locations in Beijing, China. Each station measures hourly PM2.5 data from December 1, 2014, to March 31, 2015. We extract the period from February 20, 2015, at 7:00 a.m. to February 22 at 7:00 p.m. from the entire time series, since the air is severely polluted during this interval. We choose the time series from four sensors located in four important areas as the target time series and then determine their causal factors from the time series collected by all the sensors separately.

#### 5.3.2 Results and Discussion

Figure 11 shows the recovered causal relationships and the estimated lag intervals for the four sensors. We find that the estimated causal relationships are consistent with the direction of the pollution diffusion as revealed in Figure 10. From the results, we can infer that this pollution incident originated southwest of Beijing. As our domain experts acknowledged, a number of chemical factories are located in these areas, and this evidence provides a reasonable explanation for our results.

In addition, we compare the prediction error with the other methods. Table 2 shows the MAPE of all methods on this data set. From the table, we see that our model obtains the lowest prediction error. Thus, we can conclude that causal inference with time lag helps to provide more accurate prediction of future effects.

## 6 Conclusion

Time lag is an important parameter in causal relationship learning. In this letter, we proposed a probabilistic decomposed slab-and-spike model to simultaneously estimate the causal relationship and the time lag interval within a unified framework. We developed an expectation propagation algorithm for the parameter inference. We evaluated our model on one synthetic data set and two real-world data sets. Experimental results demonstrate the effectiveness of the proposed model in learning the causal relationship as well as the time lag. Importantly, the results conducted on the real data sets are well interpreted by the domain experts.

It will be interesting to extend the static DSS model to learn a dynamic model, where the causal relationship and lags are time varying over the entire time series. We are also interested in learning the causal relationship and the time lag between the covariates and multiple responses at the same time.

## Acknowledgments

This work was supported by the National Natural Science Foundation of China (61572041) and the Beijing Natural Science Foundation (4152023).

## References

Intergovernmental Panel on Climate Change. (2007). *Fourth Assessment Report: Climate Change 2007* (Working Paper 1325).