## Abstract

In neuroscience, data are typically generated from neural network activity. The resulting time series represent measurements from spatially distributed subsystems with complex interactions, weakly coupled to a high-dimensional global system. We present a statistical framework to estimate the direction of information flow and its delay in measurements from systems of this type. Informed by differential topology, gaussian process regression is employed to reconstruct measurements of putative driving systems from measurements of the driven systems. These reconstructions serve to estimate the delay of the interaction by means of an analytical criterion developed for this purpose. The model accounts for a range of possible sources of uncertainty, including temporally evolving intrinsic noise, while assuming complex nonlinear dependencies. Furthermore, we show that if information flow is delayed, this approach also allows for inference in strong coupling scenarios of systems exhibiting synchronization phenomena. The validity of the method is demonstrated with a variety of delay-coupled chaotic oscillators. In addition, we show that these results seamlessly transfer to local field potentials in cat visual cortex.

## 1 Introduction

Dependencies in time series often arise from interactions of underlying complex systems, which may be delayed and nonlinear. Determining the direction of information flow, as well as the delay of these interactions, is of high interest for many practitioners and presents a difficult task in data analysis. For example, in neuroscience, the cells in each brain area form a complex dynamical system that we try to understand through electrophysiological recordings. The task of understanding each brain area is complicated by the fact that different brain areas interact, presumably nonlinearly. It is therefore necessary that signals from different brain areas are recorded simultaneously and that the direction and time course of information flow between brain areas can be analyzed even in cases of nonlinear and delayed coupling.

Different methodological approaches exist to determine the causality of interactions hidden in time series data. Among the oldest approaches is Granger causality (Granger, 1969), which formalizes the problem in terms of stochastic processes. Although not originally intended for dynamical systems, applications of Granger causality in neuroscience have become increasingly popular. While the method can be extended to estimate delays (directed coherence phase; Witham, Wang, & Baker, 2010), it is limited to linear statistical models. Furthermore, due to limitations of Granger causality applied to coupled dynamical systems, it may result in false positives, as Sugihara et al. (2012) pointed out. As a remedy, they introduce, in the context of species population dynamics, a neighborhood method based on reconstruction. The neighborhood method, however, is not designed to account for delays, which play a minor role in population dynamics, and is therefore unsuitable for neuroscience. It is mainly applicable to weakly coupled systems away from regimes of synchronization. In an alternative approach, Nolte et al. (2004) investigated delayed interactions of EEG signals using imaginary coherency. Yet another method used in neuroscience is transfer entropy (Vicente, Wibral, Lindner, & Pipa, 2011). This approach, which is based on information theory, does not presume linearity. A recent application of the method (Wibral et al., 2013) also allows for estimation of delays and can be understood as an information-theoretic variant of Granger causality.

However, none of these methods directly addresses all of the challenges that neural time series pose. Data are usually composed of local measurements from spatially distributed subsystems, such as local field potentials from electrode recordings, and distant brain regions communicate with significant delays. The interactions of different neuronal populations may be highly nonlinear. Furthermore, synchronization and phase-locking phenomena seem to be omnipresent (Uhlhaas et al., 2009). To understand communication patterns and ultimately the information processing of the brain, typical analyses of neuronal data try to identify the information flow between areas and modules. However, due to the coupling, individual areas can never be considered autonomous subsystems. Moreover, one can always assume possibly very high-dimensional influences from unobserved brain areas as a source of intrinsic noise, dynamically altering the temporal evolution of their targets. Finally, measurements are altered by unobserved neuronal activity and noise. These challenges require a statistical model designed to infer the direction of information flow and its delay, taking into account that the data are noisy and were generated by an unknown and not fully observed coupled dynamical system.

In a first step, our method implements the theoretical framework suggested by Ørstavik and Stark (1998) in a side remark. It assumes measurements to be taken from lower-dimensional local subsystems, weakly coupled to a high-dimensional global system that is unobserved. This framework uses Stark’s skew-product embedding theory (Stark, 1999) that provides conditions under which a local lower-dimensional driver may be reconstructed from measurements of the driven system. Note that implicitly, skew-product embeddings are also the theoretical foundation of Sugihara’s approach (Sugihara et al., 2012) that was briefly mentioned above. We extend this theory to a statistical model that aims to recover the measurements of a putative driver from measurements of the driven system alone, while both are subject to additional higher-dimensional forcing. As a result, the detectability of information flow is formalized in terms of reconstructibility from measurements, which will be asymmetric in case the underlying information flow is unidirectional. The interactions between time series are modeled by Volterra series and fit using gaussian process regression. The gaussian process regression allows us to use finite and infinite Volterra series and switch easily between parametric and nonparametric versions of the model, where the latter are preferred when possible.

In a second step, we estimate possible delays in the interaction. The stochastic model provides reconstructions of measurements of the putative driver at different delays, leading to reconstruction-error graphs for which the theory predicts the lowest error in certain reconstructible areas relative to the delay of the interaction. On this basis, an analytical criterion, specifically designed for this purpose, allows us to determine the onset of reconstructibility in time. A point estimator for the interaction delay is found, together with an area of confidence. The criterion is tailored to work in situations where the dynamical system under consideration is not invertible in time. This occurs, for example, in delay-coupled systems given by retarded functional differential equations, which we have to consider as a default scenario in our domain of application.

The outline of this article is given as follows. First, we provide a description of the proposed framework with respect to its application. This description is based on intuitive arguments, while a more detailed mathematical derivation is given in the appendix. The latter contains an account of the various concepts from differential topology that afford reconstruction of dynamical systems from measurements and provides a full derivation of the gaussian processes that are used in the regression.

Second, we aim to establish the validity of the method by application to synthetic data generated from different delay-coupled chaotic systems with intrinsic noise given by Wiener processes. As measurement time series, we use linear combinations of all coordinate functions of the subsystems, which appears to be closest to the case of local field potentials or other types of recordings encountered in neuroscience. The synthetic data sets include standard examples such as coupled logistic maps with intrinsic noise, which we use to demonstrate full reconstructibility of a target driver time series in a situation where we can have full knowledge of the stochastic driver. A nonstandard example is given by a weakly coupled Lorenz-Rössler system, which illustrates the applicability of the method in case of systems operating on very different timescales. This is followed by a Rössler-Lorenz system in generalized synchronization. Here we show that the special consideration of delays in the interaction makes inference possible even if the systems are very strongly coupled. Coupled Mackey-Glass systems provide examples of noninvertible systems where each subsystem is in itself delay coupled.

The final application of the method is to neurophysiological data—local field potentials from anesthetized cat visual cortices stimulated by visual gratings. We show that the resulting reconstruction-error graphs across delays feature the same interpretable patterns previously introduced on the synthetic data sets. Furthermore, our method demonstrates a surprisingly rich insight into the different types of interactions, as well as the most informative parts of the trials in relation to stimulus onset. We present a resulting connectivity graph for the electrodes that appears to be physiologically plausible. In addition, estimated delays with confidence bounds are reported for inter- and intra-area interactions.

## 2 Methods

In this section, we describe the framework to estimate delays and the directedness of information flow with intuitive arguments. To this end, we provide an overview in three steps. We then discuss details of the statistical model. Finally, we analyze reconstruction error graphs across delays as supplied by the statistical model and develop a criterion to estimate the interaction delays.

A more rigorous mathematical derivation is provided in the appendix, which is organized as follows. First, an introduction to the concepts from differential topology that present the theoretical foundation is provided. Second, the statistical model that uses gaussian process regression with a Volterra series operator is introduced and adapted to the problem at hand.

### 2.1 Overview

In this overview we highlight the main aspects of the proposed framework in three steps, as shown in Figure 1. The first step provides an intuition for the detection of directedness of information flow. Measurements from experiments provide time series data for analysis. In our framework we assume time series to be measurements from complex systems. Consider, for example, a Rössler and a Lorenz system, both three-dimensional and chaotic. The left-hand side of Figure 1.1 shows how their three-dimensional states evolve in time as trajectories on complex manifolds.

On the right-hand side, the Lorenz system receives additional weak input from the Rössler system at each point in time. This situation is formalized by a weak unidirectional coupling term (see equation 3.3). That is, at each point in time, the driven system receives and accommodates additional information about a particular state of the driver. The driven Lorenz system accommodates the additional information geometrically by warping the state manifold on which it evolves. Consequently, if the geometric information of the driven system is accessible, then so is the driver and its state information that was conveyed via the coupling. The driven system thus contains information about the state of the driver, entangled with its own dynamics; how the driver can be reconstructed without detailed knowledge of the driven system's dynamics is not immediately clear and is the subject of step 2. This reconstructibility of state information is asymmetric since the coupling is unidirectional. In comparison to the uncoupled scenario, the Rössler driver system looks unchanged on the right-hand side of Figure 1, panel 1, since it carries no additional information about the driven system that may be reconstructed. This asymmetry in reconstructibility can be used for inference regarding the direction of information flow. Note, however, that the asymmetry breaks down in case of strong driver forcing. In such a case, the driven system may lose all autonomous dynamics and become a mere nonlinear transform of the driver, a phenomenon also known as generalized synchronization (Kocarev & Parlitz, 1996). If synchronized, the driven system may be predicted from the driver state if the corresponding transform is known.

In the second step, corresponding to Figure 1, panel 2, we explain how the reconstruction is practically achieved by a statistical model as a function of measurements. In practice, there is no direct access to the underlying systems. Instead, the systems are only indirectly accessible via measurement time series. As a motivating example from neuroscience, consider local field potentials. These are electrical measurements from comparatively small neuronal subpopulations, and one may be interested in determining whether and how information is exchanged between these subpopulations. In Figure 1, panel 2, two such local neuronal populations are indicated by the stylized networks in light and dark brown, respectively. Their measurements are represented symbolically by time series *y* for the driver in this setup and *x* for the driven population.

In addition, we assume that there is high-dimensional extrinsic forcing to both subpopulations since they are embedded in the global cortical network. The extrinsic forcing is indicated by ω_x and ω_y. These are random variables since the high-dimensional global system is unobservable and too big to be reconstructed explicitly. We thus have explicitly observed measurements of two lower-dimensional local subsystems, weakly coupled to a high-dimensional global system via the stochastic forcing ω_x and ω_y.

Following embedding theory (detailed in the appendix), one collects *d* consecutive samples of a measurement time series into a reconstruction vector, where *d* is proportional to the intrinsic dimensionality of the reconstructed system. If *d* is large enough, there is a one-to-one correspondence of each reconstruction vector with a particular state of the underlying dynamical system. In case this correspondence also preserves topological and differential structure, it is called an embedding. Analogous to Figure 1, panel 1, one could use a sequence of such vectors, obtained from a sliding window of size *d*, to plot trajectories in *d*-dimensional state space.
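The sliding-window construction can be sketched in a few lines. This is a minimal illustration, assuming consecutive (lag-1) samples; the function name is ours, not part of the original formulation:

```python
import numpy as np

def delay_embed(x, d, lag=1):
    """Return reconstruction vectors as rows: row i holds the samples
    x[i], x[i+lag], ..., x[i+(d-1)*lag] of the measurement series."""
    n = len(x) - (d - 1) * lag
    if n <= 0:
        raise ValueError("time series too short for this d and lag")
    return np.stack([x[k:k + n] for k in range(0, d * lag, lag)], axis=1)

x = np.arange(10.0)        # toy measurement series
X = delay_embed(x, d=3)    # 8 reconstruction vectors of dimension 3
```

Each row of `X` is one point of the reconstructed trajectory in *d*-dimensional state space.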

As a result, the geometric information of the underlying dynamical system is accessible. Guided by the intuition established in step 1, we see now that a reconstruction vector obtained from measurements of the driven system will grant access to the state information of the driver that was conveyed by the coupling. Under relatively mild assumptions (detailed in the appendix), embedding theory ensures that there exists a function *F* that maps measurements *x* of the driven system to measurements *y* of the driver, but not the other way around. The crucial step in showing that this mapping exists is to note a one-to-one correspondence, that is, an invertible mapping, between the reconstruction vector and the state of the joint coupled dynamical system of driver and driven system. One can then project this joint system onto the driver and from there, using the measurement function, to the measurements *y*. In other words, we know there exists a function *F* that realizes y = F(x). The basis for our method is that we can nonparametrically estimate this function *F* even if we do not know its precise functional form. We have thus shifted the reconstructibility criterion of step 1 from underlying unobserved systems to observed time series and can now try to estimate *F* via nonparametric regression, for example, using gaussian processes.

There is one additional complication. Recall the stochastic forcing representing putative high-dimensional driver input that is not reconstructible. As outlined in detail in the appendix, the stochastic terms will modify the mapping *F* nonlinearly. We can make this dependency explicit by additional arguments to *F* and write symbolically y = F(x, ω_x, ω_y) (see equation A.9 in the appendix for the correct formalization). In practice, the stochastic terms are unknown and have to be dealt with at the level of statistical inference. (A more detailed account of this is given again in section 2.2.) However, if the stochastic forcing is known, it may be supplied explicitly in the form of additional covariates for the model. We demonstrate this in section 3 with a practical example of synthetic data generated from coupled logistic maps. In this case, the reconstruction of the driver time series may become perfect, thus providing strong empirical confirmation for the existence of *F* and the power of the resulting statistical model.

In the third step, corresponding to Figure 1, panel 3, we discuss the procedure to estimate interaction delays. The main idea is simple. Referring back to step 1 and the right-hand side of Figure 1, panel 1, recall that the driven system may accommodate driver state information conveyed via the coupling of the two systems in its warped state manifold. If the coupling, and thus the information flow, from driver to driven system is delayed by τ, the driven system will at time *t* accommodate state information of the driver from time *t* − τ. Knowledge of the driven system may thus allow reconstruction of past driver states. In step 2, the reconstruction was made explicit as a function *F* between the measurement time series *x* and *y*. Its argument is a reconstruction vector corresponding to a particular state of the driven system at time *t*, say. The target will be a single measurement of the driver time series; however, since the delay τ is unknown, the method proceeds to fit the model multiple times for a set of candidate delays. For each candidate delay, a shifted version of the target time series is reconstructed given the covariate vectors obtained in the fashion of a sliding window. This is indicated by the yellow arrows in the lower part of Figure 1, panel 3. For each reconstruction of the target time series, the normalized root-mean-squared error (NRMSE) is computed. The NRMSE values can be plotted against the candidate delays in a reconstruction-error graph (REG). In the upper part of Figure 1, panel 3, we show an idealized version of such a REG, where the *x*-axis is mirrored to align candidate delays with the time series below.

As discussed in more detail in the appendix, the REG will usually not feature a single minimum at the candidate delay sampled closest to the true τ but be minimal for a whole set of candidate delays, corresponding to the size *d* of the sampling program. This is owed to the fact that the one-to-one correspondence discussed in step 2 between a reconstruction vector and a particular system state, as learned by our mapping *F*, can often be uniquely related forward and backward in time to all samples in the sampling program. How far this is possible will depend on whether the temporal evolution of the dynamical system is time invertible and how uniquely one state can be related to different past states. In general, we therefore expect to see a full sink in the REG. However, depending on the true delay τ, this sink will have a particular offset in time that is informative about τ. In the following section about estimating interaction delays, we present an analytical criterion that operates on the REGs to determine this offset and use it to compute a point estimator for τ as well as confidence bounds.

In contrast, if *F* is fit in the acausal direction, one would expect a flat REG close to 1. That is, assuming *y* represents measurements from a driver that forces another system, measured by *x*, with a delay τ, we expect that fitting *F* under the acausal assumption x = F(y) leads to a flat REG close to 1, whereas fitting *F* under the causal assumption y = F(x) should yield a sink in the corresponding REG with temporal offset τ.
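The REG logic above can be illustrated with a deliberately simplified toy: a white-noise driver *y* forcing *x* through a *linear* delayed coupling, reconstructed with plain ridge regression instead of the gaussian process model used in the paper. All names (`delay_embed`-style windows, `reg`, `tau`) are ours; only the shape of the resulting REG, a sink starting at the true delay, carries over to the actual method:

```python
import numpy as np

# Toy driver y (white noise) forcing x with a delay of tau = 2 steps.
rng = np.random.default_rng(1)
tau, N, d = 2, 600, 3
y = rng.standard_normal(N)
x = np.zeros(N)
for t in range(tau, N):
    x[t] = 0.5 * x[t - 1] + 0.8 * y[t - tau]

# Covariate vectors: sliding windows (x[t-2], x[t-1], x[t]).
X = np.stack([x[i:N - d + 1 + i] for i in range(d)], axis=1)
t_end = np.arange(d - 1, N)        # time index of each window's last sample

reg = []                           # reconstruction-error graph
for delta in range(7):             # candidate delays
    keep = t_end - delta >= 0
    Z, target = X[keep], y[t_end[keep] - delta]
    w = np.linalg.solve(Z.T @ Z + 1e-8 * np.eye(d), Z.T @ target)
    err = target - Z @ w
    reg.append(float(np.sqrt(np.mean(err ** 2)) / np.std(target)))
```

Here `reg[tau]` is near 0, since y[t − tau] is an exact function of the covariate window, while candidate delays outside the reconstructible area stay near the upper bound of 1, as predicted for a flat REG.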

### 2.2 Statistical Model

In this section we discuss the statistical model that computes estimators of the measurement time series of putative drivers via the function *F* introduced in section 2.1. We formalize the model as gaussian process regression using Bayesian inference.

This is adequate in the context of time series analysis and allows us to deal elegantly with the uncertainty in *F*, the uncertainty introduced by ω_x and ω_y (see Figure 1.2), as well as additional measurement noise. Although the use of gaussian process regression and its close relative, kernel regression, is usually associated with nonparametric models, formalizing a parametric regression problem as a gaussian process, regardless of how the regression is computed numerically, is also beneficial, in particular with respect to estimating hyperparameters. Where possible, we try to employ nonparametric models. However, since they are prone to overfitting, the possibility of reducing model complexity and resorting to a parametric variant is desirable if the data do not offer the required amount of information. At the same time, the model has to be able to approximate arbitrary continuous functions *F*. Both requirements are met by a Volterra series representation of *F*.

The term *Volterra series* was coined in the context of functional analysis (Volterra, 1915). In general, it can loosely be thought of as a Taylor series expansion of *F* whose polynomial terms are rewritten in a certain coordinate representation. The result is a linear form with respect to its parameterization, a desirable property in a statistical model because it simplifies computation. For the discrete case, with covariate vector (x_1, …, x_d), the model can be stated in the polynomial form

$$y = h_0 + \sum_{i=1}^{d} h_1(i)\,x_i + \sum_{i=1}^{d}\sum_{j=i}^{d} h_2(i,j)\,x_i x_j + \cdots$$

The model can be used in its full series expansion or can be truncated at a particular order *n* (*h_k* ≡ 0 for *k* > *n*) to reduce model complexity (equation 2.2 shows summands up to order 2). The *h_k* are kernel functions or, rather, coefficients, which are subject to uncertainty in a finite parametric statistical model.

Based on the results from embedding theory, we cannot assume that *F* is differentiable; nevertheless, it is easy to show that the Volterra series can approximate arbitrary continuous *F* of the required type even if a parametric model with finite but arbitrary order *n* is considered. We give a simple proof of this fact in the corresponding part of the appendix. This is very important for the method developed here: only then can a lack of reconstructibility be interpreted as a true lack of information. Furthermore, being able to truncate the series expansion at a particular order allows one to control model complexity via the nonlinearity of the series, which helps deal with overfitting in case relevant information is represented less than optimally in the data. Whenever possible, however, the full series expansion will be employed as a nonparametric model.
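The key property of a truncated Volterra expansion, linearity in the coefficients *h*, can be made concrete by writing out the order-2 design matrix; the helper name below is ours, chosen for illustration:

```python
import numpy as np
from itertools import combinations_with_replacement

def volterra_design(X, order=2):
    """Expand each reconstruction vector (row of X) into the summands of a
    Volterra series truncated at `order`: a constant h0 term, linear terms,
    and pairwise products for order 2. Fitting coefficients h against this
    matrix is then an ordinary linear regression."""
    n, d = X.shape
    cols = [np.ones(n)]                                  # order 0
    cols += [X[:, i] for i in range(d)]                  # order 1
    if order >= 2:                                       # order 2
        cols += [X[:, i] * X[:, j]
                 for i, j in combinations_with_replacement(range(d), 2)]
    return np.stack(cols, axis=1)

Phi = volterra_design(np.array([[1.0, 2.0]]))
# columns: 1, x1, x2, x1*x1, x1*x2, x2*x2  ->  [1, 1, 2, 1, 2, 4]
```

Truncating at a lower order simply drops columns, which is how model complexity is reduced in the parametric variant.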

So far, we have identified a number of sources of uncertainty that arise when trying to reconstruct a putative driver measurement time series. These include measurement noise, hidden stochastic drivers, as well as the particular functional form of the reconstruction map *F*. In a Bayesian approach, this uncertainty is summarized in a probability distribution. If the task is to predict a target time series *y* in terms of a covariate time series *x*, the uncertainty in the prediction of unseen data y∗ given covariate time series x∗ is summarized in a predictive distribution p(y∗ | x∗, x, y, H, θ). The predictive distribution incorporates knowledge of the given data (x, y) to infer y∗ given x∗. We denote by *H* the particular modeling assumptions that have to be made (e.g., model order, reconstruction dimension *d*, sampling program) and by θ the corresponding hyperparameters. The covariance structure of this distribution represents the model uncertainty and supplies confidence intervals.

The inference step is now realized by computing the leave-one-out (LOO) predictive distribution, which means each data point of the target time series *y* is reconstructed conditional on all other data points, as detailed in section 2.3. In general, the inference step in a setting like this would be an induction, realized within the Bayesian paradigm by computing a predictive distribution of unseen future data, conditional on an observed data set. In contrast, our objective is not to predict future data but to assess how well a target time series can be reconstructed from a second covariate time series. The reconstruction task can, however, be interpreted as predicting each target data point individually, with information from the covariate time series, as if it were unobserved, conditional on the remaining target data. Alternatively, one could interpret this as a cross-validation procedure and note that it allows one to always use all available information for the reconstruction of each data point in the target time series. Using gaussian process regression affords efficient analytical computation of the LOO predictive distribution in one shot.
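The "one shot" computation rests on a standard identity for gaussian process regression: all LOO predictive means and variances follow from a single inversion of the noisy kernel matrix (Rasmussen & Williams, 2006, ch. 5). A minimal sketch, with an RBF kernel chosen purely for illustration:

```python
import numpy as np

def rbf_kernel(a, b, ls=0.3):
    """Squared-exponential kernel on 1-D inputs with lengthscale ls."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

def gp_loo(K, y, noise=1e-3):
    """All LOO predictive means and variances from one inverse of
    K + noise*I: mu_i = y_i - [Kinv y]_i / [Kinv]_ii, var_i = 1/[Kinv]_ii."""
    Kinv = np.linalg.inv(K + noise * np.eye(len(y)))
    var = 1.0 / np.diag(Kinv)
    mu = y - (Kinv @ y) * var
    return mu, var

t = np.linspace(0.0, 1.0, 8)
y = np.sin(2.0 * np.pi * t)
mu, var = gp_loo(rbf_kernel(t, t), y)
```

The identity is exact, so `mu[i]` agrees with the brute-force approach of refitting the GP with point *i* held out, at a fraction of the cost.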

To be able to assess the goodness of the resulting reconstruction, a point estimator for the target time series is provided by the expected value of the predictive distribution (see equation A.15 in the appendix). Determining hyperparameters such as θ is more involved and has to be achieved numerically for practical reasons. In contrast, we deal with the random variates ω_x and ω_y in a Bayesian fashion by integrating them out of the predictive distribution with an analytical approximation. All mathematical details, as well as a full derivation of the gaussian processes and their covariance structures, can be found in the corresponding section of the appendix.

Finally, we discuss a preprocessing step necessary in time series analysis to avoid a particular type of overfitting. In time series that represent measurements of systems evolving continuously in time, subsequent samples tend to be highly correlated. The higher the sampling resolution, the smaller the difference between neighboring sample points. If the time series are in this sense oversampled, there is a strong possibility of overfitting. Unless the hyperparameters are determined on a separate data set, this may occur even in the LOO cross-validation scheme we employ here. This is a result of equation A.15, which provides the point estimator for the regression target time series. This point estimator (i.e., the expected value of the predictive distribution) is essentially only a product of a so-called smoother matrix with the data vector *y*. The latter contains in this situation redundant information in the individual samples on a short timescale, whereas more complex dynamic features on longer timescales may be underrepresented in *y*. In particular, for each target *y_j*, a near-identical covariate in close temporal proximity may be found in *y*. Modeling the *y_j* on a short timescale therefore becomes trivial and does not reflect functional dependencies on the longer, more relevant timescales that one is really interested in.

To counter this problem, the time series have to be downsampled to a relevant timescale. A criterion for establishing a relevant timescale is to compute the auto mutual information of the time series. This is also a standard procedure to determine proper lags between samples in reconstruction vectors for embeddings (compare with the sampling program in definition 1 in the appendix). In practice, it is desirable to jointly downsample the time series until the mutual information between neighboring samples is less than 1. If this is not possible, one seeks convergence of the mutual information as a function of the lag between subsequent samples. If the two time series evolve on very different timescales, it may happen that the covariate time series is still oversampled compared to the target time series. In this case, poor model performance may result if the dimensionality *d* of the covariate vector is chosen too low. The elements of the covariate vector need to span a relevant part of the time series with respect to the timescale; otherwise geometric information is squashed in the reconstruction. We discuss such a situation in section 3. The solution is either to increase *d* or to introduce additional lags between the samples of the covariate vector.
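One way to operationalize this criterion is a histogram estimate of the mutual information between a series and its lagged copy. The 16-bin estimator below is a common generic choice, not the paper's specific implementation:

```python
import numpy as np

def auto_mutual_information(x, lag, bins=16):
    """Histogram estimate (in bits) of the mutual information between
    x[t] and x[t + lag]; high values indicate redundant (oversampled)
    neighboring samples."""
    counts, _, _ = np.histogram2d(x[:-lag], x[lag:], bins=bins)
    pxy = counts / counts.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px[:, None] * py[None, :])[nz])))

rng = np.random.default_rng(0)
noise = rng.standard_normal(20000)        # no temporal structure
eps = rng.standard_normal(20000)
ar = np.zeros(20000)                      # strongly autocorrelated AR(1)
for t in range(1, 20000):
    ar[t] = 0.95 * ar[t - 1] + eps[t]
mi_noise = auto_mutual_information(noise, lag=1)
mi_ar = auto_mutual_information(ar, lag=1)
```

For the white-noise series the lag-1 mutual information is near zero (only estimator bias), while the oversampled AR(1) series shows large redundancy between neighbors and would call for downsampling or larger lags.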

### 2.3 Estimation of Interaction Delays

At this point, a statistical model has been devised that allows the reconstruction of a putative driver measurement time series out of another time series representing the driven system. The model is defined under the assumption that both measured systems are lower-dimensional local subsystems, which are weakly coupled to a higher-dimensional global system that is not reconstructible from the data. The statistical model accounts for uncertainty pertaining to measurement noise, as well as dynamically evolving intrinsic noise resulting from the hidden global drive. This formalizes the situation that one encounters in neuroscientific data, such as local field potentials, which represent measurements from local neuronal subpopulations, embedded in a global brain network that can be treated as practically infinite dimensional.

In a first step toward estimating the delay, for a particular choice of sampling program, we create reconstructions of the putative driver measurement time series for a candidate set of negative delays by which we shift the target series. Figure 2A shows the situation for a particular covariate reconstruction vector corresponding to time *t*. If the true interaction delay is τ, the onset of reconstructibility lies at the target sample *y*(*t* − τ). We can assume *y*(*t* − τ) is always reconstructible, since it relates to the last temporal index in the sampling program. This fact is discussed in detail in the appendix in the context of definition 1.

As a brief summary, we note that the dynamical systems that are usually considered in embedding theory admit a temporal evolution that is invertible in time. Given a certain initial condition, past and future of the system are therefore uniquely determined. As a result, given a reconstruction vector, system states corresponding to all individual sampling points in the sampling program are in principle reconstructible. In case of systems that are not invertible in time, this is no longer true. This case is important to consider since delay-coupled systems, such as spatially distributed neural networks, are in general not time invertible (the time-reversed system would be acausal). Takens (2002) showed that if the temporal evolution of a dynamical system is not invertible, it may not even admit embedding. However, he proved that the system state corresponding to the last sample in the reconstruction vector is in general reconstructible. We stress this fact once more: the system state corresponding to the end of the reconstruction vector is always reconstructible, and we will exploit this fact here for inference. We must note, however, that this is a conjecture insofar as the corresponding proof in Takens (2002) does not explicitly treat the skew-product scenario we consider here. Nevertheless, the simulation results presented in the next section strongly suggest that this is a valid generalization.

We associate *y*(*t* − τ) with the onset of reconstructibility, which is informative about the delay of the interaction. Note that if the reconstruction dimension *d* exceeds what the true intrinsic dimensionality of the underlying product manifold requires, the reconstructibility will extend to the preceding samples of the target series. The reconstructibility of these preceding samples *y_j* depends on how uniquely the inverse temporal mapping of the underlying system can be inferred from the data. In any case, the last reconstructible point precedes the onset by the temporal span of the covariate vector and is associated with its first element. We call the temporal indices between onset and this last point the reconstructible area, marked by the dashed-yellow rectangle in Figure 2A.

The reconstructible area corresponds to the information about the driver present in the covariate measurements of the driven system by way of actual information exchange via the delayed coupling. Outside the reconstructible area, the candidate reconstructions of the statistical model become predictions backward or forward in time, respectively, which are necessarily based on the closest temporal state reconstructible with the given covariate vector. Due to the global stochastic drivers, such predictions are subject to further uncertainty, in addition to the uncertainty entering the reconstructions themselves. We can thus expect the lowest variance in candidate reconstructions within the reconstructible area. Moreover, predictability is likely to decline rapidly away from the closest system state that is actually reconstructible. The variance will therefore be our criterion to evaluate reconstructibility.

First, for each sample *y _{j}* in the delay-shifted target time series, the LOO point estimator from equation A.15 is computed conditional on the data set where the *j*th sample is removed. This represents the LOO validation set. To check for overfitting of the model, it is common practice to compare performance on the validation set against performance on the training set. The latter amounts to computing the predictive distribution conditional on the full data set. Second, as a measure of performance, we compute the mean-squared error between the candidate reconstruction and the target. This yields a LOO cross-validation estimator of the variance which is asymptotically correct (Konishi & Kitagawa, 2008) and will be lowest for candidate delays within the reconstructible area. To be able to compare the reconstructibility in one causal direction against another, we report the normalized root-mean-squared error (NRMSE), the root-mean-squared error divided by the standard deviation of the target time series *y*. Note that an NRMSE of 1 results when the candidate reconstruction is constant at the mean of the target time series. We therefore choose 1 as the upper bound for reconstructibility, in which case the reconstruction is as good as fitting a horizontal line to the data, indicating a lack of interesting functional dependencies.
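The NRMSE and its upper bound can be verified in a few lines; this is a minimal sketch with our own function name, not the article's implementation:

```python
import math

def nrmse(target, reconstruction):
    """Root-mean-squared error between target and candidate reconstruction,
    normalized by the standard deviation of the target series."""
    n = len(target)
    mean = sum(target) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in target) / n)
    mse = sum((t - r) ** 2 for t, r in zip(target, reconstruction)) / n
    return math.sqrt(mse) / sd

# reconstructing the target by its own mean ("fitting a horizontal line")
# yields an NRMSE of exactly 1, the upper bound for reconstructibility
y = [0.3, 1.2, -0.7, 0.9, 0.1]
m = sum(y) / len(y)
print(nrmse(y, [m] * len(y)))  # → 1.0
```

The mean predictor reproduces the upper bound exactly because numerator and denominator then coincide, while a perfect reconstruction yields 0.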

Since, by assumption 3, the predictive process is approximately gaussian, one way to quantify the variability of the NRMSE is to resample the full predictive process using the covariance structure resulting from equation A.15. For each resample *b*, we can compute the NRMSE and derive percentiles for this statistic from its resulting empirical distribution. However, recall from the previous section that the assumption of normality in the predictive process is only approximate. While this does not pose difficulties for confidence intervals of the individual point estimates, the nonlinear NRMSE statistic presents a problem, and we found that the resulting empirical distribution often yields confidence intervals that do not contain the point estimate. To counter this problem, we compute the nonnormal distribution of residuals in equation 2.5 empirically from the data and employ bootstrapping (Efron & Tibshirani, 1994) to generate sample sets that yield a distribution for the NRMSE. Confidence intervals are computed conservatively from the percentiles at the 99% level. Where computationally feasible, the resulting confidence intervals may be improved by the method of Efron (2003), which employs an additional correction for deviations from normality.
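A minimal sketch of the residual-resampling bootstrap for NRMSE confidence intervals; the percentile scheme and all names below are our simplifications (the article additionally corrects for deviations from normality where feasible):

```python
import math, random

def nrmse(target, recon):
    n = len(target)
    mu = sum(target) / n
    sd = math.sqrt(sum((v - mu) ** 2 for v in target) / n)
    return math.sqrt(sum((t - r) ** 2 for t, r in zip(target, recon)) / n) / sd

def bootstrap_nrmse_ci(target, recon, n_boot=2000, level=0.99, seed=0):
    """Percentile-bootstrap confidence interval for the NRMSE.

    Residuals are resampled empirically (no normality assumed), new
    candidate reconstructions are formed, and percentiles of the resulting
    NRMSE distribution give a conservative interval.
    """
    rng = random.Random(seed)
    residuals = [t - r for t, r in zip(target, recon)]
    stats = []
    for _ in range(n_boot):
        boot = [t - rng.choice(residuals) for t in target]  # resampled reconstruction
        stats.append(nrmse(target, boot))
    stats.sort()
    lo = stats[int((1 - level) / 2 * n_boot)]
    hi = stats[int((1 + level) / 2 * n_boot) - 1]
    return lo, hi

target = [0.1, 0.5, -0.3, 0.8, 0.2, -0.1, 0.4, 0.0]
recon  = [0.2, 0.4, -0.2, 0.7, 0.1,  0.0, 0.5, 0.1]
lo, hi = bootstrap_nrmse_ci(target, recon)
```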

We can now associate an NRMSE with each reconstruction of the full target time series of a putative driver at a particular candidate delay. Plotting the NRMSE against candidate delays yields a reconstruction error graph (REG), as shown schematically in Figure 2B. Estimating the delay of an interaction is based on such an REG and consequently treated as simultaneous inference on multiple candidate delays and their respective model fits. The model selection process takes this into account and is based on the full REG a particular model produces. While the proper Bayesian treatment would effectively result in a type of model averaging, the computation usually runs into numerical issues, and the averaged model in our setting will be dominated by very few model candidates. It is therefore prudent to simply select the best model for inference. The main model characteristics that have to be determined are the reconstruction dimension *d* and the model order *n* in equation 2.2. The selection process is straightforward. In principle, there is no upper bound for *d*, but the size of the data set will enforce practical limits. We found that the global minimum in the REG often decreases suddenly with increasing *d*, and equally so for different candidate delays. Increasing the reconstruction dimension further trades model complexity for validation set performance at some point, since fewer and fewer data for inference remain. Regarding the model order, we found that truncated models of order 3 or higher in most cases behave similarly to the full nonparametric series expansion. If overfitting is an issue with a particular data set, it will usually become apparent at order 3 already. As a standard approach, we therefore test truncated models of low order as well as the full expansion and select the best according to the resulting REG. In the REG, the depth of the sink is the main criterion, together with the steepness at the onset of reconstructibility, which is the most important feature for estimating the delay, as we describe below.
Overfitting tends to manifest in a shallower sink for the validation set and a sharp increase of the NRMSE at the borders and outside the reconstructible area where the model has no direct information for reconstruction available. The increase in the NRMSE is the result of overregularization from increasingly extreme hyperparameters. As such, the model selection process is in itself already highly informative about the reconstructible area in a REG. In the remainder of this article, we report results only for the best model unless stated otherwise.

Ideally, one would expect an REG across candidate delays as shown by the dark brown graph in Figure 2B, where the reconstructible area corresponds to a sink. The lowest NRMSE need not be at either the beginning or the end of the reconstructible area and is therefore not a good indicator for the delay. In fact, the sink need not even be flat but may contain additional structure, in particular if the timescales of the unobserved systems differ substantially. Furthermore, we cannot in general expect full reconstructibility for all candidate delays in the reconstructible area. The only reliable indicator of the true interaction delay is therefore the onset of reconstructibility in the REG.

A further problem is the fact that the time series have been sampled down substantially. In order to achieve sub-sample accuracy in the delay estimation, the REG has to be interpolated in a first step. In a second step, the curvature and slope of the resulting interpolated graph are used to determine the onset of the sink. Consequently, the model used for interpolation has to exhibit a certain smoothness and be at least twice differentiable. It is important, where possible, to avoid overfitting in order to yield a simple sink shape. In our case, samples of the REG will be coarsely spaced across delays, and overfitting would lead to extreme slope and curvature between samples that interfere with the delay estimation.

The best interpolation results were achieved by fitting a cubic spline function (de Boor, 2001) as a gaussian process parameterized by time (see equation A.11). The spline function is a linear combination of basis splines and is fully determined in terms of a knot sequence that defines the starting points of the basis splines' compact supports. As knot sequence, we choose the given samples of the REG. For interpolation, we compute *m _{y}* (see equation A.15) on a very fine grid across delays, conditional on the linear interpolation of the REG samples on this grid. The latter enforces a type of regularization between the actual samples, which asserts a certain well-behavedness of the graph in these underspecified areas. This dramatically reduces overfitting and noninformative curvature in the interpolation. We interpolate both the confidence intervals and the REG itself in this fashion. If, on the other hand, the REG is already sampled at a fine resolution, a better approach would be to use a gaussian process with a stationary covariance function whose length-scale *l* controls how strongly the model interpolates.
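For illustration, a plain natural cubic spline (rather than the article's gaussian process formulation of equation A.11) already provides the required twice-differentiable interpolant; this sketch takes the coarse REG samples as knots:

```python
from bisect import bisect_right

def cubic_spline(xs, ys):
    """Natural cubic spline through the knots (xs, ys): a smooth, twice-
    differentiable interpolant, as required for the slope- and curvature-
    based delay criterion."""
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    # right-hand side of the tridiagonal system for the second derivatives
    alpha = [0.0] + [3 * ((ys[i + 1] - ys[i]) / h[i] - (ys[i] - ys[i - 1]) / h[i - 1])
                     for i in range(1, n)] + [0.0]
    l, mu, z = [1.0] * (n + 1), [0.0] * (n + 1), [0.0] * (n + 1)
    for i in range(1, n):                      # forward elimination
        l[i] = 2 * (xs[i + 1] - xs[i - 1]) - h[i - 1] * mu[i - 1]
        mu[i] = h[i] / l[i]
        z[i] = (alpha[i] - h[i - 1] * z[i - 1]) / l[i]
    c = [0.0] * (n + 1)
    b, d = [0.0] * n, [0.0] * n
    for i in range(n - 1, -1, -1):             # back substitution
        c[i] = z[i] - mu[i] * c[i + 1]
        b[i] = (ys[i + 1] - ys[i]) / h[i] - h[i] * (c[i + 1] + 2 * c[i]) / 3
        d[i] = (c[i + 1] - c[i]) / (3 * h[i])
    def f(x):
        i = min(max(bisect_right(xs, x) - 1, 0), n - 1)
        dx = x - xs[i]
        return ys[i] + b[i] * dx + c[i] * dx ** 2 + d[i] * dx ** 3
    return f

# coarse REG samples (toy values) interpolated on an arbitrarily fine grid
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [1.0, 0.98, 0.4, 0.35, 0.9]
s = cubic_spline(xs, ys)
fine = [s(i * 0.01) for i in range(201)]
```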

The interpolated graph *I* is a smooth function defined on an arbitrarily fine timescale, as given by the dark brown graph in Figure 2B. To determine the onset of reconstructibility and an area of confidence for the true delay, we first compute numerically the derivative of *I* and choose as left boundary for the area of confidence its global minimum. This corresponds to the highest rate of negative change in the REG and reliably marks the onset of the sink if one exists. To check for existence, it is sufficient to verify that the first NRMSE significantly larger than the minimum of the REG lies to the right of this boundary.

The right boundary is determined from the curvature of *I*, where a parameter *c* controls the sensitivity to curvature; we set it to 20 to be able to deal with shallow sinks. The resulting criterion achieves robustness against uninformative structure of *I* within the sink. Finally, a simple point estimator for the true delay is obtained from these boundaries, as given in equation 2.9. This estimator works very well in practice and allows for fully automatic determination of interaction delays. If the reconstructible area is marked by a shallow sink, the area of confidence will be larger; the point estimator 2.9, however, may still yield accurate results. If the sink has a steep onset, the method is very precise. Most problematic are cases where the corners of the sink are less well pronounced and the curvature relaxes only very slowly. In this case, and for all computations described above, the area of confidence provides an upper bound for the delay under consideration. Figure 2B summarizes and illustrates the approach schematically for the hypothetical ideal case.
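The left-boundary criterion can be sketched numerically as follows; this simplification uses finite differences on a toy step-shaped REG and omits the curvature-based right boundary:

```python
def sink_onset(delays, reg_values):
    """Estimate the onset of reconstructibility as the delay with the
    steepest negative slope of the (interpolated) REG, that is, the left
    boundary of the area of confidence."""
    # central finite differences for the first derivative of the REG
    deriv = [(reg_values[i + 1] - reg_values[i - 1]) /
             (delays[i + 1] - delays[i - 1]) for i in range(1, len(delays) - 1)]
    i_min = min(range(len(deriv)), key=deriv.__getitem__) + 1  # global minimum
    return delays[i_min]

# toy REG: flat baseline at 1.0 with a sink around delay 5
delays = [i * 0.1 for i in range(100)]
reg = [1.0 if t < 5 or t > 7 else 0.3 for t in delays]
onset = sink_onset(delays, reg)
```

On this toy graph the estimated onset lies at the left edge of the sink, near delay 5.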

To determine the preferred direction of information flow, we compute the REG in both possible directions of information flow, determine the minimum NRMSE across delays for each, and check whether they are significantly different by means of the bootstrapped 99% confidence intervals. Further details are discussed using concrete examples in section 2.4.
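The direction decision can be sketched as a comparison of bootstrapped intervals; the decision rule and names below are our illustrative simplification:

```python
def preferred_direction(ci_a, ci_b):
    """Compare bootstrapped confidence intervals of the minimum NRMSE in the
    two coupling directions; a direction is preferred only if its interval
    lies entirely below the other's (intervals given as (low, high) tuples)."""
    if ci_a[1] < ci_b[0]:
        return "a_drives"  # direction A reconstructs significantly better
    if ci_b[1] < ci_a[0]:
        return "b_drives"
    return "undecided"     # overlapping intervals: no significant difference

print(preferred_direction((0.35, 0.45), (0.80, 0.95)))  # → a_drives
```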

### 2.4 Experimental Procedures

Experimental data presented here were obtained from one adult cat. All procedures in this study were approved by the responsible government body (Regierungspräsidium Darmstadt) in accordance with the guidelines of German law for the protection of animals. Details about the experimental protocol can be found in Wunderle, Eriksson, and Schmidt (2013). In brief, the animal was initially anesthetized with a mixture of ketamine/medetomidine (10 mg/kg and 0.02 mg/kg) and subsequently maintained by artificial ventilation with O2/N2O (30%/70%) supplemented with isoflurane (1.5–0.7%). Analgesia was ensured by continuous intravenous infusion of sufentanil (4 µg/kg/h) together with a Ringer solution. Two small craniotomies were made over visual areas 17 and 21a according to stereotactic coordinates. Through each craniotomy, a single-shank multicontact laminar probe (1 MΩ, NeuroNexus, MI, USA) was implanted into the brain tissue in order to record from all cortical layers. The recorded signals were bandpass-filtered (0.7–300 Hz) and sampled at 1024 Hz to obtain the LFP signal. After the surgical procedures, the animal was stimulated with gratings drifting in eight directions (45° steps) presented on a 22-inch LCD monitor (Samsung SyncMaster 2233RZ) in front of the cat. The gratings were at full contrast, and the temporal (1.8 Hz) and spatial frequency (0.4 cy/deg) were chosen to evoke responses in both areas 17 and 21a.

## 3 Results

We have established the validity of the method in practice on different delay-coupled chaotic systems. These involve systems that are time discrete or continuous, irreversible in time, governed by retarded functional differential equations, exhibiting generalized synchronization, as well as total systems whose coupled subsystems operate on very different timescales. The synthetic data thus provide a rich testbed of potential difficulties in application. All of these systems are designed according to Figure 10, with additional stochastic driver input to both subsystems. For time-continuous systems, the latter is given by a Wiener process with additional volatility, which we set as high as possible without catastrophically altering the system manifolds in the numerical solution.

To provide additional illustration and a comparison to standard methodology, we show the time-shifted cross-correlation graphs for all synthetic example systems in section A.5.

In a final step, we applied the method to real data composed of LFPs from cat visual cortex. Here we show that interpretable patterns very similar to the synthetic case are produced by the method, which suggest a connectivity structure that appears to be physiologically plausible.

### 3.1 Logistic Maps

In a first step, we consider a simple discrete system to practically demonstrate the soundness of the functional mapping established in equation A.9. To this end, we present results where the stochastic drivers are unknown and treated with Bayesian statistics according to equation A.25, as well as results where the stochastic drivers were supplied as covariates for the mapping according to equation A.9.

Here, *y* is driving *x* via a coupling term in the definition of *x*. Both subsystems are chaotic and show sensitive dependence on their respective stochastic drivers, which are part of the intrinsic temporal evolution of the system. Although the stochastic forcing is given by a linear term, its effect is nonlinearly amplified through the forward mapping. Examples of the resulting time series are shown in Figure 3D. Logistic maps are examples of endomorphisms since the forward mapping is in general not invertible. As was discussed before, reconstructibility may therefore not extend to the full reconstructible area. We fitted the full nonparametric statistical model according to equation A.15 with prior covariance matrix A.24 under both coupling hypotheses and computed REGs as described in the previous section. Note that under the hypothesis that *x* drives *y*, *x* is assumed to be the driver and thus the target of the reconstruction, whereas *y* provides covariates. In this simple example, a low reconstruction dimension is sufficient for the covariate vector. The REGs are reported for the LOO predictive distribution (dark brown, LOO CV) and the predictive distribution conditional on all available data (light brown, training) to indicate overfitting. In addition, we also report the REG for a separate validation set of the same size (brown, validation), which should alleviate any concerns regarding the ability of the complex nonparametric model to generalize. Confidence intervals (shaded area) were computed for the LOO estimator only. Figure 3 shows the results for a small data set where each time series consisted of 1000 samples.
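For concreteness, a delay-coupled pair of stochastically forced logistic maps of the kind described can be simulated as follows; all parameter values here are illustrative assumptions, not those used in the article:

```python
import random

def coupled_logistic(n, delay=5, a=3.8, b=3.7, coupling=0.3,
                     noise=0.01, seed=1):
    """Simulate a pair of logistic maps where y drives x through a delayed
    coupling term, with small stochastic forcing (all parameter values are
    illustrative, not the article's)."""
    rng = random.Random(seed)
    x, y = [rng.random()], [rng.random()]
    for t in range(n - 1):
        yd = y[t - delay] if t >= delay else y[0]   # delayed driver state
        xn = (1 - coupling) * a * x[t] * (1 - x[t]) + coupling * b * yd * (1 - yd)
        yn = b * y[t] * (1 - y[t])
        # stochastic forcing enters linearly but is nonlinearly amplified by
        # the next forward mapping; clipping keeps the maps bounded in [0, 1]
        xn = min(max(xn + noise * rng.uniform(-1, 1), 0.0), 1.0)
        yn = min(max(yn + noise * rng.uniform(-1, 1), 0.0), 1.0)
        x.append(xn)
        y.append(yn)
    return x, y

x, y = coupled_logistic(1000)
```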

In Figure 3A, the reconstructibility is shown under the correct hypothesis for the case where the stochastic driver is unknown and integrated out of the model. The lowest reconstruction error is achieved at the true delay. Reconstructibility is confined to this delay, which corresponds to the last coordinate of the covariate vector, in agreement with the theoretical reconstructibility of endomorphisms discussed beforehand. Figure 3B shows the reconstructibility of the same model under the wrong coupling hypothesis, which hardly ever deviates from the baseline NRMSE. The direction of interaction, as well as the delay, is thus clearly discernible on this data set.

In Figure 3C, the reconstruction was repeated as in panel A; this time, however, the stochastic drivers were supplied explicitly as covariates for functional mapping A.9. The results are highly interesting. As can be seen, reconstructibility improved only within the reconstructible area. At candidate delays where information about the corresponding driver state is not actually available through the covariate vector, the inclusion of the noise terms into the functional model may even lead to a deterioration of performance. At the true delay, the additional driver information leads to a practically perfect reconstruction of *y*, which beautifully demonstrates the validity of the theoretical approach behind mapping A.9 in practice.

### 3.2 Lorenz-Rössler System

Note that for a purely deterministic delay system, the solver *dde23* (Shampine & Thompson, 2001) would be a much better choice.

As time series, we consider linear combinations of the coordinates of the two subsystems, reminiscent of the contributions from different spatially distributed components to an LFP, sampled as shown in Figure 4D. It can be seen that the natural oscillations of the driven Rössler system (*y*) are much simpler and occur on a much slower timescale than those of the Lorenz driver (*x*) with its two-lobed attractor geometry. The analysis of time series evolving on different timescales is challenging, and we have designed this particular example to demonstrate the reconstructibility of a fast, complex signal from a slower oscillatory response.

Optimal results are reported for a third-order model. Although the intrinsic dimensionality of the skew-product manifold described by equation 3.2 is likely less than 6, a larger reconstruction dimension *d* is necessary to capture enough geometric information of the response system *y* with the covariate vector. We use this data set to show how different lags between candidate delays of the REG can affect the delay estimation process. In Figure 4A we show the interpolated REG under the correct hypothesis with candidate delay reconstructions given at a step size of 0.5, marked by dots. In comparison, Figure 4C shows the REG sampled with a smaller step size of 0.1. Although both point estimators for the delay yield the correct result, the analytical area of confidence is much wider in panel A, indicating a larger uncertainty associated with the estimator as a result of a less well-pronounced sink in the interpolated REG. The finer step size of candidate reconstructions in Figure 4C provides more accurate information for the delay estimation process and consequently yields a very narrow area of confidence. The reconstructible area is determined by the time series sampling rate and the choice of the reconstruction dimension. For this system, the sink in the interpolated REG fully covers the reconstructible area and has sharply defined corners. Despite the fact that there is substantial oscillatory and trend-like uninformative structure within the sink, the analytical criteria for establishing the area of confidence prove to be robust and capture the informative geometric features, as is readily verified by visual inspection. In contrast, Figure 4B shows that under the acausal hypothesis, no significant reconstruction is possible at any of the considered candidate delays.

### 3.3 Rössler-Lorenz System

Here, the Rössler system *x* is the driver, and its intrinsic timescale is set such that it oscillates with a similar frequency as the driven Lorenz system *y*. The stochastic drive is once again a Wiener process with additional volatility, at a numerical solution step size of 0.0001.

Pyragas (1996) has shown for the case without stochastic forcing and delay that for sufficiently strong coupling, the two systems enter a regime of generalized synchronization (GS) (Kocarev & Parlitz, 1996). Adding a delay is compatible with the definition of GS, and the stochastic forcing does not catastrophically alter the system manifolds but acts merely as a perturbation. We set the coupling strength such that the systems will be in a perturbed regime of GS. Time series are once more sampled as linear combinations of the coordinate functions of the subsystems, as shown in Figure 5D.

This data set poses a twofold challenge. First, the two subsystems show very similar oscillatory behavior. As a result of the (perturbed) GS regime, we can assume the oscillators will exhibit more or less rigid phase-locking behavior. This means a certain functional dependency will be present across any candidate delay in the analysis. Second, by definition of GS, the driven system is fully predictable from knowledge of the driver alone, although this transformation may be complex. The dynamics of the full skew-product system have collapsed onto a common synchronization manifold. As a result, the causal structure of the interaction may be masked since predictability in this sense cannot be distinguished from reconstructibility. If the interaction is delayed, however, one may still be able to discern a sink in the interpolated REG that is informative. Furthermore, prediction works only forward in time, whereas reconstruction works backward. As a result, among the set of candidate delays in the REG, an optimally predictable state (that is, in the acausal reconstruction direction) of the driven system can only ever be found at candidate delay 0, and only if the true interaction delay is greater than zero. The larger the delay is, the more the interaction and its delay will be distinguishable. It is also our impression that a statistical model in general provides better functional reconstructions than predictions, since in reconstructions, driver information is actually present in the temporal evolution of the driven time series.

In Figure 5 we present results for models with the optimal reconstruction dimension. We consider the true hypothesis first, shown in Figures 5A and 5C. Here, overfitting becomes apparent at model order 3, as can be seen in Figure 5C. The shape of the resulting sink is not very regular, the corners are not well pronounced, and the curvature hardly declines inside the sink. As a result, the delay estimation is slightly off. Figure 5A shows, in contrast, a second-order model, where the sink is very regular with less curvature and the model does not suffer from overfitting. Consequently, the delay estimation of the second-order model is more accurate and yields, in fact, the correct value. Note that in this case, the addition of further candidate delays to the REG at a finer resolution will not improve the shape of the sink. Its shallowness is owed to the simpler oscillatory behavior of the two systems and their phase locking, which improve system predictability. This also leads to a much lower baseline reconstructibility that is now apparent even under the acausal hypothesis in Figure 5B. Here, the baseline predictability at candidate delay 0 is comparable to the baseline reconstructibility outside the sink in Figure 5A. Note that the mere presence of a sink in the REG with significant temporal offset under one coupling hypothesis is highly informative about the direction of interaction.

We conclude that if delays are involved in the interaction, our method may still be able to reliably discern coupling scenarios shrouded in weaker forms of synchrony and phase locking.

### 3.4 Mackey-Glass System

The two systems oscillate on very similar timescales and, as a result of the coupling, may have entered a weak form of synchronization, albeit perturbed by the stochastic forcing. We therefore again expect a certain baseline reconstructibility across candidate delays. Figure 6 shows the results for the optimal reconstruction dimension. At the given sampling rate, the hypothetical reconstructible area would have size 5. Under the correct hypothesis, the reconstructibility of a third-order model in Figure 6A is compared against a model employing the full nonparametric series expansion in Figure 6C. Both agree in their delay estimate. The REG sink resulting from the nonparametric model, however, has less curvature and more pronounced corners, leading to a smaller area of confidence and a correct point estimator. As expected, a certain baseline reconstructibility is apparent under the acausal hypothesis in Figure 6B but does not impede inference regarding the direction of the interaction. It is noteworthy that the sinks in the REGs of Figures 6A and 6C barely cover half of the reconstructible area, a result of the temporal dependencies in equation 3.4. The onset of reconstructibility nonetheless affords reliable delay estimation, which shows that functional mapping A.9 extends to retarded systems if the intrinsic dimensionality of the attractor manifolds allows reconstruction.

### 3.5 Local Field Potentials of Cat Visual Areas

In this section, we report results that our method yielded on an actual neuroscientific data set. The data consist of local field potential recordings measured at eight different electrodes in cat visual areas. Four electrodes were placed in area 17, homologue to V1 in primates; another four were placed in area 21a, a homologue to V4 (Peters & Payne, 1993). In each area, the electrodes were chosen to correspond to a particular layer of the cortex, as indicated in Figure 8. The laminar position of the recording electrode is an important aspect, because information flow within and between areas follows a specific pattern along the layers of the cortex (Douglas & Martin, 2004). At the time of the analysis, the cortical layers from which individual electrodes had been recording were unknown to the data analysts. The cat was under general anesthesia during the recordings, and visual stimulation was performed by drifting gratings of different orientations and directions. Our goal was to test the method on an actual data set, compare results with the patterns yielded on the synthetic data, and create a connectivity graph of the eight electrodes. We were not interested in answering a particular neuroscientific question at this point.

The data were recorded at a sampling rate of 1024 Hz. The time series appeared not to show substantial differences in response to different stimulus orientations, so we picked one orientation at random for analysis. This resulted in 50 trials, during each of which the stimulus presentation lasted for 2035 samples. In terms of preprocessing, filtering the time series with a Parzen window, the size of which was subject to model selection, improved reconstructibility slightly. Furthermore, the data had to be sampled down substantially, by a step size of 15, for the mutual information between neighboring samples in the time series to reduce to a level adequate for statistical modeling. The statistical model was always conditioned on all 50 trials; the covariate vectors, however, were constructed respecting trial boundaries for each trial individually. For larger reconstruction dimensions, a covariate vector then spans an area of more than 300 ms. Since the covariate vectors are generated for each trial individually, this means the first 300 ms of each trial directly after stimulus onset are lost for analysis, since the regression target is always associated with the end of the covariate vector to guarantee reconstructibility, as discussed earlier.

This turned out to be a problem. Figure 7 shows the first 2000 samples after stimulus onset of an exemplary LFP time series of a single trial from the data set. The time series is obviously nonstationary. In particular, in the first half, the direct stimulus response is characterized by slow oscillations with large amplitudes that are markedly different from the statistics in the second half of the series. They also grossly distort the NRMSE statistic, which is normalized by the time series standard deviation and assumes stationarity. The direct stimulus response is therefore often routinely discarded in analyses and deemed to be the result of uninformative transient dynamics.

We checked this assumption by performing pairwise reconstructions for all channels, 56 REGs in total, based on different temporal windows within the individual trials. The analysis revealed that the samples directly after stimulus onset are by far the most informative, whereas REGs conditional on samples from the second half of the stimulus presentation revealed no information about interaction delays at all. As a result, we based all subsequent analyses on the first 1000 samples after stimulus onset only, such that the statistics were approximately stationary. Furthermore, in light of this discovery, the loss of several hundred milliseconds of information right after stimulus onset due to the regression setup became unacceptable. To improve this situation, given a model with reconstruction dimension *d*, we added a zero-padding to the beginning of each trial time series. Consequently, the first *i* samples in each trial, where *i* < *d*, are now included in the model, albeit with suboptimal reconstruction dimension *i*. Nonetheless, this yielded more pronounced sinks in the REGs and improved the delay estimation process substantially.
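The zero-padding workaround can be sketched as follows; the function and its index bookkeeping are our illustrative assumptions:

```python
def trial_covariates(trials, d):
    """Build covariate vectors per trial with zero-padding at trial onset,
    so that the early post-stimulus samples are not lost to the embedding
    (a sketch of the padding workaround described in the text)."""
    X, end_idx = [], []
    for k, trial in enumerate(trials):
        padded = [0.0] * (d - 1) + list(trial)   # zeros before stimulus onset
        for t in range(len(trial)):
            X.append(padded[t : t + d])          # window ending at original sample t
            end_idx.append((k, t))               # (trial, sample) the target is tied to
    return X, end_idx

# two toy trials of 5 samples each, reconstruction dimension 3
X, end_idx = trial_covariates([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]], d=3)
```

In the actual model, the regression target for each `(trial, sample)` index is the delay-shifted sample of the putative driver series, not the series providing the covariates; the early windows simply carry leading zeros instead of being discarded.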

We report our final results for a model of order 2 on time series sampled down to steps of 15.36 ms. Higher-order models were too susceptible to overfitting. Figure 8 summarizes the results in a connectivity diagram. Within each area, one is likely to find strongly synchronized populations, putative mutual coupling motifs, and short interaction delays. In contrast, the inter-area connectivity is characterized by long interaction delays, up to an estimate of 136 ms, and the network motif appears to correspond to a feedforward structure from lower to higher visual areas.

The statistical model is highly informative about the underlying interaction. In Figure 8, we distinguish four types. The first one we labeled strong synchrony with linear dependency. This interaction was found between channels 1 and 2 in area 17 and between channels 6 and 8 in area 21a. Interestingly, both pairs consist of one channel located in the superficial and the other in the granular layer. Corresponding REGs are shown in Figure 9. The left plot shows the reconstruction under the hypothesis that channel 1 drives channel 2; that is, channel 1 is the target of the reconstruction, and channel 2 provides covariates. Reconstructibility under both hypotheses does not differ significantly. In both directions, there is no temporal offset for the sink in the REG, which fully extends to the maximally reconstructible area and has sharply pronounced corners. Moreover, reconstructibility with a second-order model is not significantly better than with a first-order model. The first-order model is basically only a linear time-invariant filter. In addition, only for candidate delays outside the reconstructible area does the NRMSE increase significantly. This suggests that the underlying systems are strongly synchronized and most likely merely locked to oscillations induced by the visual gratings. The latter hypothesis is supported by the abrupt increase of the NRMSE beyond the reconstructible area, indicating a lack of predictability from covariate information. The measurements therefore provide no information about interactions that could be exploited for further inference.

The second type of interaction we discovered appears to be a slightly weaker kind of synchrony characterized by nonlinear dependencies. Channels 7 and 8 provide an example; their REGs are shown in Figure 9. Although similar at first glance to the situation of channels 1 and 2, reconstructibility declines significantly on reducing both order and reconstruction dimension *d* of the model, and the sink is slightly shallower. This indicates a nontrivial kind of information exchange between the underlying neuronal populations, albeit without a discernible delay.

The third type of interaction that can be found in the data admits a discernible direction of information flow. We show exemplary REGs of channels 4 and 8 in Figure 9, which are the first to exhibit an asymmetry in reconstructibility. On the left-hand side, under the hypothesis that information flows from the lower to the higher visual area, the full extent of a sink is apparent with a large offset. The estimated delay is about 100 ms. Moreover, reconstructibility is significantly better than under the counterhypothesis on the right-hand side. Both the shape and the asymmetry of the REGs are strongly reminiscent of the situation that arises in a regime of generalized synchronization resulting from unidirectional coupling, which we discussed previously in the context of the Rössler-Lorenz system (compare with Figure 5). The example channel 4 was located in the white matter below area 17, whereas channel 8 was located in the granular layer of the downstream area 21a. The white matter consists of fiber tracts projecting to and from other areas, and evoked potentials can easily be recorded there. Accordingly, our method reveals a strong feedforward flow of information from the lower area 17 to the major recipient layer (granular layer 4) in the downstream area 21a.

The fourth type of interaction depicted in Figure 8 is labeled nonlinear correlation. These are cases where ghosts of sinks appear in the corresponding REGs, which are, however, not statistically significant. This may be indicative of a situation where the statistical model cannot capture enough information from the data to yield a more pronounced sink. Furthermore, it is worth noting that this kind of interaction was found between the channel located in the superficial layers of area 17 and all the other channels in the downstream area 21a. The superficial layers exhibit a high degree of recurrent connections. It may be that the computations performed there are more complex and are not captured by the initial evoked response used here for the reconstruction. We distinguish this from scenarios where the REGs show no patterns at all, such that their deviation from 1.0 may rather be explained as a baseline due to oscillatory activity. These cases are left unmarked in the connectivity diagram.

A last finding we report here pertains to putative mutual coupling scenarios and occurred only intra-area. An example is given by channels 2 and 4, with REGs shown in the last row of Figure 9. Reconstructibility is not significantly better in one direction than in the other, although this is a close call. Both REGs, however, exhibit significantly pronounced sinks, clearly indicating delayed information flow in both directions. We hypothesize that this pattern might correspond to a situation of weak mutual coupling, a scenario we have not discussed so far for reasons explained in section 4.

## 4 Discussion

We have presented a statistical method that estimates the direction of information flow and its delay in situations, motivated by neuroscientific applications, where data consist of local measurements from spatially distributed subsystems that are part of a larger global system. The situation is formalized by assuming that the local subsystems are low-dimensional and deterministic, delay-coupled to each other, and receive additional stochastic forcing from the surrounding global system. The latter is, by assumption, practically infinite-dimensional. This corresponds, for example, to the situation encountered with local field potentials, which provide recordings of local neuronal subpopulations embedded in an extremely high-dimensional global brain network.

The method formalizes directed information flow in terms of reconstructibility, where the latter is based on theory from differential topology that is concerned with embeddings of manifolds, similar to the approach by Sugihara et al. (2012) in the context of population dynamics. Given the particular domain of application in neuroscience, a statistical model formalizes the different sources of uncertainty pertaining to the reconstruction. While the functional dependencies, resulting from embedding theoretic considerations, are realized via discrete Volterra series operators, the full statistical model is given by an approximate gaussian process. Under different hypotheses regarding the coupling direction, the statistical model yields interpolated reconstruction error graphs across candidate delays that are informative about the true delay of the interaction. We propose an analytical criterion, informed by differential topology, which provides a point estimator for the delay, together with an area of confidence.

The validity of the method was established on a variety of delay-coupled chaotic systems that exhibit nonlinear oscillatory behavior, as well as synchronization and phase-locking phenomena that relate well to the character of neuroscientific data. While complexity in actual neural networks likely arises in the first place from the sheer number of interacting subsystems, chaotic systems achieve a similar level of complexity due to their intrinsic chaoticity. The latter admits a low-dimensional deterministic description that retains analytical tractability of the systems, making them ideal candidates for testing purposes. Our method proved to be versatile in the analysis of systems operating on very different timescales, was able to carry out inference in the presence of phase locking and generalized synchronization phenomena, and allowed inference on data from retarded functional systems that are likely to be encountered in the neuroscientific context.

Furthermore, we demonstrated that the applicability and interpretability of the method seamlessly transferred to a real data set of local field potentials from cat visual cortex. In addition, the statistical model allowed us to differentiate between different types of interaction, accounting for putative synchronization phenomena as well as scenarios of delayed and asymmetric information flow. The results we reported were based on 50 trials. The method as such, however, in principle also supports single-trial analysis. In addition, it can be restricted to particular temporal windows of a time series to deal with nonstationarities and achieve temporal resolution in the analysis. In this context, our results revealed that the slow oscillatory activity directly after stimulus onset, which is often discarded as uninformative transient response, is in fact highly informative about the interaction as well as its delay.

While the resulting connectivity (see Figure 8) appears to be physiologically plausible, the estimated delays are surprising because they substantially exceed expectations derived from anatomy and physiology (Bastos, Vezoli, & Fries, 2015). Future work will need to reconcile these divergent pieces of evidence. At this point, we note that given the nature of the communication between neuronal populations, it is probably not adequate to speak of a single delayed interaction. Rather, one would again expect many different delay-coupled local interactions. Consequently, we interpret the estimated delays more as a major trend with respect to the partially reconstructed dynamics. The final results on the LFP data set were reported for a model with a low reconstruction dimension. Given that LFPs may record activity from more than 10,000 neurons, this number seems very low. Although higher *d* did not substantially improve reconstructibility in this case, the maximal dimension that can be investigated is severely limited by the intricate temporal dependencies on the stimulus within individual trials. As a result, it may be that information about additional shorter delays is not accessible in the coarser reconstruction used in the analysis presented here. This does not exclude the possibility of additional faster interactions. In this regard, we hypothesize that a large area of confidence associated with an estimated delay may also be indicative of a broader range of delayed interactions and thus represents an essential part of the inference. Despite the low reconstruction dimension used in this analysis, it is worth pointing out that one definite advantage of the statistical modeling approach over other methodologies is that it could in principle afford a dimensionality of several thousand if the data can support it.

On the other hand, we have shown that substantial information is already reconstructible at low reconstruction dimensions, comparable to the results on low-dimensional synthetic data sets. This is most likely owed to the fact that the dynamics of neuronal subpopulations are largely characterized by concerted oscillatory activity and an abundance of reported synchronization and phase-locking phenomena (Harris, 2005; Fries, 2009). The latter arise as a result of the high interconnectedness within populations. The more concerted the joint activity, the more the intrinsic dimensionality of the joint state-space manifold on which the neuronal population evolves temporally collapses. This in turn renders the dynamics more susceptible to reconstruction. At the same time, however, synchronization obstructs inference based on asymmetry in reconstructibility. It is therefore important that the methodology can account for this to a certain degree and that a suitable trade-off in spatial scale is found.

We would like to address an important issue concerning the formalization of causality in the proposed method. The informative criterion regarding a directed causal interaction is the (inverse) reconstructibility of the putative driver. The theory asserts the existence of functional mapping, equation A.9, in this case, which we try to estimate from the data by means of a statistical model. However, the existence of a functional dependency between two time series is in itself not necessarily indicative of an underlying dynamic interaction. A simple example that comes to mind is a sine and a cosine time series, which are related by a trivial phase offset. This may give rise to concerns regarding false positives. The complete lack of reconstructibility under the wrong hypothesis in the synthetic examples provided earlier, in particular for the simple discrete system of logistic maps, should alleviate these concerns. The biggest challenge in this regard is indeed phase-locked or synchronized oscillatory signals, as discussed with the example of the Rössler-Lorenz and Mackey-Glass systems. We have shown that the immediate consequence of phase locking is a baseline reconstructibility across delays, which is easily established and does not detract from the existence of informative sink structures in the REGs. Since we have defined a concise domain of application where measurements are taken from subsystems embedded in high-dimensional global systems, we can always expect a certain complexity in the measurement time series. This could be, for example, a variation in the amplitudes of oscillatory signals, which we also encountered in a preliminary analysis of epilepsy data where the LFPs supposedly document extremely synchronized cortical activity. We therefore think that false positives are not a major concern for neuroscientific data sets.
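The sine-cosine example above can be made concrete with a short numerical sketch (illustrative only; the variable names and the simple linear reconstruction are our own, not part of the proposed method): a trivial phase offset yields near-perfect reconstruction in both directions, so a functional dependency alone cannot establish a direction of information flow.

```python
import numpy as np

t = np.arange(0, 200, 0.1)
x, y = np.cos(t), np.sin(t)          # related by a trivial phase offset

def delay_vectors(s, d):
    """Stack d consecutive samples of s into reconstruction vectors."""
    return np.column_stack([s[i:len(s) - d + 1 + i] for i in range(d)])

def nrmse(source, target, d=3):
    """Normalized RMSE of linearly reconstructing `target` from
    delay vectors of `source`."""
    Z = delay_vectors(source, d)
    tgt = target[d - 1:]
    w, *_ = np.linalg.lstsq(Z, tgt, rcond=None)
    return np.sqrt(np.mean((tgt - Z @ w) ** 2)) / np.std(tgt)

# Near-perfect reconstruction in BOTH directions: a functional dependency
# exists, yet there is no underlying dynamic interaction or direction.
print(nrmse(x, y), nrmse(y, x))
```

The symmetry of the two errors, and the absence of any sink structure across delays, is exactly what distinguishes such trivial dependencies from a directed interaction.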

The reconstruction approach does, however, require a degree of autonomous intrinsic activity of the driven system to establish directedness. In strongly synchronized states, this is no longer possible. As discussed using the example of channels 1 and 2 (see Figure 9) of the cat LFP data, such states are easily identified by properties of the functional dependency supplied by the statistical model. We assume here that intrinsic dynamics are absent and the two signals are strongly locked to the visual stimulus. In general, if the coupling is weak enough that autonomous intrinsic activity is still present in the driven subsystem, the method will be robust against common drive and yield symmetric REGs with high NRMSE values.

On a related note, we mention that the analysis of identical systems is not a problem in itself. However, identical systems are more likely to synchronize completely and thus defy analysis. Preliminary studies of delay-coupled identical Rössler systems (not shown) exhibiting anticipating synchronization (Voss, 2000) revealed that the presence of additional stochastic intrinsic forcing is beneficial in this situation. In the presence of noise, the systems are phase-locked with highly correlated amplitudes, but complete synchronization is never actually established over extended periods of time. These perturbations away from synchrony render the systems susceptible to inference once more. Noise may therefore have beneficial effects for causal inference, although high noise levels lead to a deterioration of statistical modeling performance, in particular for fully nonparametric models. At the same time, the statistical nature of the proposed framework explicitly accounts for noise and can even deal with asymmetric noise scenarios. This is due to the fact that irrespective of the causal hypothesis regarding the direction of information flow, noise from both the driver and the driven system is always accounted for nonlinearly in the statistical model of mapping, equation A.9.

In application to neuroscientific data such as EEG, the biggest concern is with regard to overlapping measurements. If the domains of different measurement functions overlap, the same underlying system will be reconstructible, with the trivial result that both measurements are mutually reconstructible. This is a problem that has to be addressed at the level of experimental design, and great care has to be taken to exclude this possibility. Counter to intuition, an application of preprocessing techniques such as independent component analysis (ICA), popular in neuroscience, would deteriorate the situation beyond repair. Since each IC is a linear combination of the original measurement signals, it can be understood as a virtual measurement of a dynamical system constituted by the original signals. Consequently, each IC has perfect knowledge of the full underlying measurement system, such that it should be possible to mutually reconstruct any pair of ICs. Preliminary results on ICs from EEG data support this hypothesis and yielded pairwise reconstructibility that can only be described as suspicious.

The final point we address is the evidence of mutually delayed coupling scenarios in the cat LFP data, as reported with respect to channels 2 and 4 in Figure 9. For several reasons, we have not discussed mutual coupling scenarios. First, embedding theory does not at this point provide a rigorous treatment of the skew-product scenario for mutually coupled systems, as opposed to the unidirectional coupling case. Stark (1999) conjectured, however, that the theory could in principle be extended to account for mutual coupling. Second, our own preliminary results on mutually delay-coupled systems indicate that the delay estimation approach works in principle if the coupling is weak enough. Given the shape of the sinks involved in the reconstructions of channels 2 and 4, we feel confident that the delay estimate is correct. In stronger coupling scenarios, which are most likely characterized by generalized synchronization, however, we came across less expected results where delays vanished in one direction. We therefore decided that this scenario warrants further study and should be accompanied by theoretical research based on differential topology.

### Appendix: Theoretical Background

#### A.1 Embedding Theory

The method presented here is built on a theoretical framework given by concepts from differential topology. As practitioners in this field explicitly state (see Huke, 2006), these concepts are complicated and usually accompanied by a long tail of other theoretical dependencies (an introduction can be found in Hirsch, 2012, and the appendix of Broer and Takens, 2010), which makes them a challenge even for seasoned theoreticians. Our goal in this section is nonetheless to introduce some key concepts that are necessary to understand later parts of the proposed method and to explain the common thread by which they are connected.

The problem that differential topology solves for the practitioner is that of reconstructing a system that is observed only indirectly via real-valued measurements. Consider, for example, local field potentials (LFPs) from electrode recordings in cortex. These yield a time series measurement of the unobserved neuronal network activity contributing to the LFPs. Aeyels (1981) was one of the first to work on this topic and provides the most intuitive access. He considered time-continuous dynamical systems given by vector fields defined on a differentiable manifold *M* with *f* dimensions. Each vector field admits a diffeomorphism $\phi_t \colon M \to M$, which takes an initial condition $m \in M$ and maps it forward in time to $\phi_t(m)$. Thus, $\phi_t$ defines a temporal evolution of the dynamical system and corresponding trajectories on *M*. Measurements, such as LFPs, are defined as continuous functions $x \colon M \to \mathbb{R}$. As a function of time, the system is observed only indirectly via the measurements $x(\phi_t(m))$, which constitute the observed time series. Suppose the measurements were sampled at a set of *d* points $t_1 < t_2 < \cdots < t_d$ along an interval of length *T*. This set $P = \{t_1, \ldots, t_d\}$ is called a sample program.
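As a minimal illustration of the sample program idea (a sketch under our own naming conventions, not code from the method): each admissible time index of a measurement time series is mapped to a vector of *d* measurements taken along the sample program.

```python
import numpy as np

def reconstruction_vectors(x, d, step=1):
    """For each admissible time index, collect the d measurements taken
    at the points of the sample program (spacing `step`), yielding one
    d-dimensional reconstruction vector per index."""
    n = len(x) - (d - 1) * step
    return np.column_stack([x[j * step : j * step + n] for j in range(d)])

# Hypothetical scalar measurement time series of an unobserved system:
t = np.linspace(0, 30, 600)
x = np.sin(t) + 0.5 * np.sin(np.sqrt(2.0) * t)

R = reconstruction_vectors(x, d=5)
print(R.shape)   # (596, 5): one 5-dimensional vector per retained index
```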

A system is called *P*-observable if for each pair $m, n \in M$ with $m \neq n$ there is a $t_i \in P$ such that $x(\phi_{t_i}(m)) \neq x(\phi_{t_i}(n))$. In this case, the map

$$\mathrm{Rec}_d \colon M \to \mathbb{R}^d, \quad m \mapsto \big(x(\phi_{t_1}(m)), \ldots, x(\phi_{t_d}(m))\big),$$

taking *m* into the set of measurements defined by *P*, is bijective onto its image. $\mathrm{Rec}_d$ is called a reconstruction map. If $m \neq n$, the vectors $\mathrm{Rec}_d(m)$ and $\mathrm{Rec}_d(n)$ differ in at least one coordinate, allowing one to distinguish between *m* and *n* in measurement. Aeyels showed that given an almost arbitrary vector field, it is a generic property of measurement functions *x* that the associated reconstruction map $\mathrm{Rec}_d$ is bijective if $d \geq 2f + 1$. *Genericness* is defined here in terms of topological concepts that lie outside the scope of this introduction. As a result, the temporal evolution of *m* on *M* becomes accessible via the reconstruction vectors corresponding in time.

For purposes of statistical modeling, this level of description is quite sufficient. In general, however, it is natural to also demand differentiability of the reconstruction map, such that its image is a submanifold of $\mathbb{R}^d$. In this case, the reconstruction map is called an embedding and also preserves the smoothness properties of *M*. In turn, an embedding affords the investigation of topological invariants and further properties of the dynamical system in measurement. Takens (1981) showed in a contemporaneous piece of work that the reconstruction map is generically an embedding if $d \geq 2f + 1$, together with stronger statements of genericness.

With respect to our final goal of estimating interaction delays, the following is important to note. A diffeomorphism is invertible and defines the temporal evolution of a dynamical system both forward and backward in time. As a result, given a reconstruction vector, all system states corresponding to the sample program are in principle reconstructible (the relative positions of the $t_i$ are not material for Aeyels's proof as long as they can be uniquely associated with *m* via the temporal evolution). In the case of endomorphisms, that is, systems not invertible in time, this is no longer true. Endomorphisms are important for us to consider since delay-coupled systems, such as spatially distributed neural networks, are in general not time invertible (the time-reversed system would be acausal). In later work, Takens (2002) showed that endomorphisms may not even admit embeddings. However, in terms of the language introduced previously, he proved the genericness of the observability of the system state $\phi_{t_d}(m)$ at the end of the sample program, given the reconstruction vector, if $d \geq 2f + 1$. Hence, the system state corresponding to the end of the sample program is always reconstructible, and we will exploit this fact in later stages of the method proposed here.

At the same time, for the purpose of statistical modeling, the restrictions may be less severe in practice. A loss of differentiability at some points, or even a finite number of self-intersections of the image of the reconstruction map, may still admit substantial reconstructibility with regard to certain functional mappings that will be introduced later. Also, Sauer, Yorke, and Casdagli (1991) proved that application of linear time-invariant filters to the measurements does not destroy embeddings. This result is highly important for neuroscientific data, which are often already filtered within the recording device. Sauer et al. also conjectured that different measurement functions may be combined in a reconstruction map, which was later proven by Deyle and Sugihara (2011).

The scenario most important for our purposes is the skew-product scenario studied by Stark (1999): an autonomous driver system operating on a manifold *N* with dimension *g*, and a nonautonomous driven system operating on *M* with dimension *f*, where the driver forces the temporal evolution of the driven system. Stark showed that given real-valued measurements of states of the driven system alone, the corresponding reconstruction map generically embeds the full product manifold in case $d \geq 2(f + g) + 1$. This means *N* is reconstructible via measurements of *M* alone. While this result dramatically improves chances at predicting a single time series, causal analyses in the Granger framework may yield distorted outcomes (see Sugihara et al., 2012, for a concrete example). However, it also gives rise to an asymmetry in reconstructibility, which can be exploited to formalize causal interactions. Unless the subsystems are in generalized synchronization, in which case the dynamics of the product system collapse onto a synchronization manifold, measurements of the driver manifold *N* will not allow a reconstruction of *M*, since there is no information about *M* present in the temporal evolution of *N*. In case of synchronization, this approach becomes problematic, since knowledge of *N* would by definition allow a perfect prediction of *M* (Kocarev & Parlitz, 1996).

With respect to our purpose of delay estimation, we note that the individual timescales of the two systems are not material for the skew-product embedding, which is formulated in terms of indexed sequences of points on the manifolds. Also, it is not required that the temporal evolution of the full product system, equation A.2, is a diffeomorphism; this restriction pertains only to the driver and the driven dynamics individually. As a result, the theory can in principle account for delay-coupling scenarios. Denote by $n_t$ the driver system state at time *t*, given an initial condition *n*. A coupling with delay $\delta$ would induce forward dynamics on *M* with *M*-intrinsic time index *i* as $m_{i+1} = \phi(m_i, n_{i-\delta})$. Conversely, for a given measurement function and a corresponding reconstruction vector, we would expect to be able to reconstruct the mixed temporal state of the full product system on $N \times M$. Given a measurement function *y* on *N*, this mapping could yield, for example,

$$y_{i-\delta} = F(x_{i-d+1}, \ldots, x_i). \qquad \text{(A.3)}$$

Note that *F* is a mapping between the explicitly observed measurement time series, allowing one to reconstruct the indexed measurements $y_{i-\delta}$ using reconstruction vectors composed of the $x_i$. Furthermore, the reconstruction is backward in time if delays are involved, which employs the strongest possible constraint that causal interactions can exhibit: a cause-effect relationship can extend only forward in time; consequently, reconstruction is possible only backward in time.

The map *F* is a continuous functional mapping if it exists, and our goal will be to estimate it with a statistical model directly from any given measurement time series. It is not sufficient, however, to assume the data stem from a single autonomous skew-product system. When considering, for example, LFP recordings, measurement time series corresponding to different recording sites will contain information from local neuronal subpopulations that may interact with each other. In any case, each subpopulation is likely to receive massive modulatory influence from other areas of the brain. Such additional drivers may be too high-dimensional for practical purposes to be reconstructed from the data. Nonetheless, they will dynamically alter the temporal evolution of any measured subpopulation. At the very least, one has to consider this as a form of intrinsic noise, which, in contrast to simple measurement noise, is part of the system dynamics.

Such unresolved influences can be treated by fixing a particular orbit of the driver system through an initial condition $n \in N$. Redefining the reconstruction map fiber-wise with respect to this orbit, Stark proved that it is generically an embedding for typical *n* if $d \geq 2f + 1$. This result also holds if the driver system is stochastic, with shift map dynamics on bi-infinite sequences (Stark, Broomhead, Davies, & Huke, 2003; Huke, 2006), in which case the driver is effectively infinite dimensional. Measurement noise can be accounted for analogously. With applications in neuroscience in mind, Ørstavik and Stark (1998) have proposed using such a framework of stochastic forcing in situations where the measurements are presumably taken from a lower-dimensional local subsystem, weakly coupled to a high-dimensional global system. If the global system, such as the neural network in the brain, is high-dimensional enough, one can view it in this context as practically stochastic. In turn, any locally measured neuronal subpopulation may be regarded as a stochastically forced low-dimensional deterministic system.

This formalization extends the functional mapping *F*, equation A.3, in the following way. Given system A.2, assume finite measurement time series of the driver and the driven system. Now extend the product system by a stochastic forcing component taking values in a topological space. The latter can be treated as a stochastic dynamical system by introducing shift map dynamics on a bi-infinite sequence $\omega = (\ldots, \omega_{-1}, \omega_0, \omega_1, \ldots)$, where the shift map is defined by $\sigma(\omega)_i = \omega_{i+1}$. Likewise, extend the driver system. We have disentangled the temporal indices of the two deterministic systems here and assume a temporal offset $\delta$ to account for interaction delays. The stochastically forced systems can then be viewed as parameterized forward mappings.

The temporal evolution on *M* is now twice parameterized, by the driver state and by the intrinsic noise. For functional mapping, equation A.3, these dependencies on the unknown drivers can be made explicit, analogous to the discussion of the NARMA model in Stark et al. (2003). This yields the bundle reconstruction

$$y_{i-\delta} = F(x_{i-d+1}, \ldots, x_i, \omega_{i-\delta}, \eta_i), \qquad \text{(A.9)}$$

where $\omega$ and $\eta$ denote the intrinsic noise sequences of driver and driven system, respectively. If these are taken to be one-dimensional, we can immediately include them as random variables in a statistical model. A proper treatment of this source of uncertainty with Bayesian techniques will be discussed after the statistical model has been introduced. In section 3, we demonstrate the soundness of the mapping *F* using coupled chaotic logistic maps with intrinsic noise. It is shown that in case the noise sequences $\omega$ and $\eta$ are observed and included in *F*, the time series reconstruction becomes exact. Figure 10 summarizes and illustrates the approach graphically.
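The role of observed intrinsic noise in the bundle reconstruction can be illustrated with a toy analogue (our own simplified construction with a linear driven system and least squares, not the coupled logistic maps of section 3): when the intrinsic noise sequence of the driven system is included as a covariate, the reconstruction of the driver measurement becomes exact.

```python
import numpy as np

rng = np.random.default_rng(2)
T, delta = 2000, 2

# Toy analogue of the bundle reconstruction: logistic driver y, driven
# system x with intrinsic noise eps.
eps = 0.02 * rng.standard_normal(T)
y = np.empty(T)
y[0] = 0.3
for t in range(T - 1):
    y[t + 1] = 4.0 * y[t] * (1.0 - y[t])
x = np.zeros(T)
for t in range(delta, T - 1):
    x[t + 1] = 0.4 * x[t] + 0.6 * y[t - delta] + eps[t]

i = np.arange(50, T - 1)
tgt = y[i - delta]

def fit_nrmse(Z):
    """NRMSE of the best linear reconstruction of tgt from covariates Z."""
    w, *_ = np.linalg.lstsq(Z, tgt, rcond=None)
    return np.sqrt(np.mean((tgt - Z @ w) ** 2)) / np.std(tgt)

without_noise = fit_nrmse(np.column_stack([x[i], x[i + 1]]))
with_noise = fit_nrmse(np.column_stack([x[i], x[i + 1], eps[i]]))
print(with_noise < without_noise)   # True: observed noise makes it exact
```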

#### A.2 Discrete Volterra Series Operator

In order to derive point estimators for the measurement time series, the function *F* has to be approximated by a model. That *F* is continuous exhausts our prior knowledge about its functional form. A choice that may be called canonical in this situation is a discrete Volterra series operator. Note that the term *operator* here is borrowed from functional analysis and has nothing to do with filters or other engineering concepts. The term *Volterra series* was coined in the context of functional analysis, due to Volterra (1915). In general, it can loosely be thought of as a Taylor series expansion of *F* whose polynomial terms are rewritten in a certain coordinate representation. The result is a linear form with respect to its parameterization, which is a desirable property in a statistical model because it simplifies computation. Although we cannot assume that *F* is differentiable and can be expressed by a Taylor series expansion, one can show that the discrete Volterra series operator can nonetheless approximate arbitrary continuous *F* of the required type. This can be seen as follows.
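For concreteness, a truncated discrete Volterra series operator can be realized by enumerating all monomials of the reconstruction vector up to a given degree (a sketch; function and variable names are our own, not the paper's notation):

```python
import numpy as np
from itertools import combinations_with_replacement

def volterra_features(z, order):
    """All monomials of the coordinates of z up to the given degree,
    including the constant term: the summands of a truncated discrete
    Volterra series operator."""
    feats = [1.0]
    for p in range(1, order + 1):
        for idx in combinations_with_replacement(range(len(z)), p):
            feats.append(float(np.prod(z[list(idx)])))
    return np.array(feats)

z = np.array([0.5, -1.0, 2.0])
phi = volterra_features(z, order=2)
# The model output is a weighted sum h . phi: linear in the coefficients
# h, although nonlinear in the reconstruction vector z.
print(phi.shape)   # (10,): 1 constant + 3 linear + 6 quadratic terms
```

The linearity in the coefficients is what makes this representation convenient for the gaussian process treatment below.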

The domain of *F* is a subset $X \subset \mathbb{R}^d$, defined by the image of the reconstruction map (see section A.1). *X* is compact, since the reconstruction map is continuous and its domain is assumed to be a compact manifold.

A result from functional analysis, the Stone-Weierstrass theorem (Werner, 2011), now states the following. Suppose $\mathcal{A}$ is an algebra of continuous, real-valued functions on the compact Hausdorff space *X* that separates points of *X* and contains the constant functions. Then for all $\epsilon > 0$ and any continuous real-valued function *F* on *X*, a function $p \in \mathcal{A}$ exists such that $|F(x) - p(x)| < \epsilon$ for all $x \in X$. This means a polynomial *p* can be found that approximates the desired function *F* arbitrarily exactly.

Due to the constant term $h_0$, polynomials of the type of equation 2.2 contain the constant functions. The separation property demands that for $x, y \in X$ with $x \neq y$, there exists a *p* such that $p(x) \neq p(y)$. Without loss of generality, assume *x* and *y* differ in the *j*th coordinate. Then choose *p* to be the linear monomial in that coordinate: if we let the first-order coefficient equal 1 for index *j*, and 0 otherwise, the resulting *p* will have the desired property, since $p(x) = x_j \neq y_j = p(y)$. Furthermore, it is a standard result from commutative algebra (see Roman, 2007; Artin & A'Campo, 1998) that the elements of the form of equation 2.2 generate an algebra (where only finitely many of the coefficients $h_i$ differ from 0). It is in fact the graded algebra of homogeneous polynomials, the elements of which are equipped with an infinite-dimensional vector space structure and the standard product of polynomials defined on elements of this space. Moreover, this algebra can be defined for symmetric as well as nonsymmetric polynomials.

As a result, we may invoke the Stone-Weierstrass theorem, and the desired approximation properties of the discrete Volterra series operator obtain.

#### A.3 Gaussian Process Regression

The statistical model treats the uncertainty about *F* as uncertainty in the function itself, rather than withdrawing to uncertainty in a finite parameterization of *F*. A gaussian process is a system of random variables indexed by a linearly ordered set, such that any finite number of samples are jointly normally distributed (Hida & Hitsuda, 2007). A gaussian process thus defines a distribution over functions and is completely specified by a mean function and a covariance function. In terms of a gaussian process, we can define the functional model as a zero-mean process over the covariates with covariance function $k$, where $k$ evaluated at pairs of covariates yields $K_{ij}$, coordinate $(i, j)$ of covariance matrix *K*.

One can now derive all distributions necessary to formalize the uncertainty associated with the data. The details of these derivations can be found, for example, in Rasmussen and Williams (2006). The covariance matrix *K _{y}* corresponding to the observed process *y* is given by

$$K_y = K + \sigma^2 I,$$

with $\sigma^2$ the variance of additive observation noise.

To compute estimators of the target time series, we now only need to specify the covariance function $k(z, z')$ corresponding to the Volterra series model in equation 2.2. This is a straightforward derivation using equations 2.2 and A.11. (For details and a more general discussion of Volterra theory in polynomial regression models, see Franz & Schölkopf, 2006.) In a first step, we consider the uncertainty in the coefficients, represented by the $h_i$ of equation 2.2, for the case of a parametric model with a finite truncated series expansion. The formulation chosen here structures the coefficients in such a way that the degree of the monomial being weighted is apparent. This is a notational convenience and otherwise irrelevant. That is, one may equally think of each summand in equation 2.2 as being weighted by one particular coefficient at this point, which happens to correspond to an element in the image of exactly one $h_i$.

Once again, it is sensible to make the prior assumption that any finite set of these coefficients is distributed normally with mean zero and a diagonal covariance. This basically says that for each pair of coefficients *j*,*k* (weighting the *j*th and *k*th summand in equation 2.2, respectively), we assume they may be positive or negative, finite, and each coefficient can in principle vary independently of the other coefficients, which is a plausible modeling assumption.

Denote by *X* the matrix in which each row contains the summands from equation 2.2 that correspond to the elements of one covariate. That is, the *j*th row of *X* corresponds to the covariate given by the *j*th reconstruction vector and contains the associated monomials. Note that this formulation is not symmetric, which means that both terms of a mixed product, for example, $z_1 z_2$ and $z_2 z_1$, occur. This has no impact on the generality of the proof of the universal approximation property of equation 2.2 with respect to continuous functions. The latter can be formulated with respect to a symmetric, as well as a nonsymmetric, algebra of homogeneous polynomials.

To arrive at a nonparametric version of this covariance matrix, that is, to consider the limit of an infinite series expansion, one can proceed as follows. The model in equation 2.2 is simply a polynomial. Interpreted as a Volterra series operator, however, it represents a reparameterized Taylor series expansion. If the Taylor series is considered an infinite series expansion, it is clear from basic analytical arguments that in order for the series to converge, the sequence of coefficients weighting the summands has to be a null sequence and converge to zero.

This prior knowledge can be encoded in the variances assigned to the coefficients: denote by $\sigma_p^2$ the variance of the coefficients weighting monomials of degree *p*, and let the sequence of variances decay with *p*. Defining a new covariance function accordingly, it follows for the gaussian process in *F* that the covariance of function values is a weighted sum over degrees of inner products of the covariates, with weights given by a diagonal matrix whose diagonal entries are given by equation A.21. As a result, denoting once more by kernel matrix *K* the *F*-covariance matrix, the covariance function takes the form

$$k(z, z') = \sum_{p=0}^{P} \sigma_p^2 \, (z^\top z')^p.$$

In this expression, we can now consider a nonparametric model (i.e., a full series expansion) obtained by letting $P \to \infty$, such that the series converges for a proper null sequence $(\sigma_p^2)$. This last expression thus corresponds to a model for a full Taylor series expansion around the origin (also called a Maclaurin series), given a proper encoding of prior knowledge about the coefficient sequence from analytical arguments. The model thus represents an analytic function whose domain is an open subset around the origin. Polynomials are naturally analytic functions; it is therefore not hard to extend the universal approximation property with respect to continuous functions from polynomials to analytic functions as defined above. (For a different derivation of this model and a more rigorous proof of the approximation property in the context of kernel regression, see Steinwart, 2002.)

Next, we have to deal with the unknown noise covariates from equation 2.3, which enter the model alongside the known covariates. The obvious strategy is to integrate the unknowns out of the prior process, equation A.12, and perform the inference step in the predictive distribution with the resulting process. However, the resulting process is no longer gaussian. Although the unknowns have been assigned a normal prior, they enter via *K _{y}* nonlinearly into the prior process likelihood, equation A.13. As a result, the analytic derivation of the predictive distribution in equation A.15 no longer applies.

In the context of noisy covariates in regression, however, Girard, Rasmussen, Quinonero-Candela, and Murray-Smith (2003) suggest a gaussian approximation: expectation and covariance of the noise-driven process are computed explicitly while a normal distribution is still assumed. They show that although the actual distribution, as approximated by Markov chain Monte Carlo (MCMC) methods, is far from normal, the expected value of the gaussian approximation to the predictive distribution yields a comparable predictor, and the predictive variance captures the model uncertainty quite well, allowing for reasonable posterior density credible sets.
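The character of this approximation can be illustrated with a toy pushforward. The sketch below uses a hypothetical nonlinear predictive mean (a stand-in for the fitted GP predictor, not the paper's model) and a normally distributed noisy covariate; matching the first two moments gives the gaussian approximation, while the sampled distribution itself is visibly non-gaussian:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for a fitted GP's posterior-mean predictor;
# any nonlinear predictive mean would serve for this illustration.
def predictive_mean(x):
    return np.sin(3.0 * x)

# Noisy test covariate x* ~ N(u, s^2), as considered by Girard et al. (2003).
u, s = 0.4, 0.3
f = predictive_mean(rng.normal(u, s, size=200_000))

# Gaussian approximation: match the first two moments of the pushforward.
approx_mean, approx_var = f.mean(), f.var()

# The exact pushforward is far from normal: its skewness is clearly
# nonzero, whereas any gaussian has skewness exactly zero.
skew = float(((f - approx_mean) ** 3).mean() / approx_var**1.5)
assert abs(skew) > 0.1
```

Despite the mismatch in shape, the moment-matched mean and variance remain useful summaries, which is the practical point of the approximation.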

This amounts to *L*_{2}-regularization (see Rasmussen & Williams, 2006; Hoerl & Kennard, 1970) and helps to prevent overfitting. With this newly defined prior process, one computes the predictive distribution in gaussian approximation in the same way as described before, according to equation A.15.
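The correspondence between a diagonal term in the kernel matrix and *L*_{2}-regularization can be verified directly. The sketch below (our own illustration, with synthetic data) checks that ridge regression and a linear-kernel GP predictive mean with diagonal noise coincide via the push-through identity:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))                       # training covariates
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=30)
lam = 0.5                                          # noise variance = ridge penalty
x_star = rng.normal(size=4)

# Ridge (L2-regularized) regression weights and prediction.
w = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)
ridge_pred = x_star @ w

# GP predictive mean, linear kernel K = X X^T plus a diagonal noise term.
K = X @ X.T
k_star = X @ x_star
gp_pred = k_star @ np.linalg.solve(K + lam * np.eye(30), y)

# The diagonal term in the kernel matrix acts exactly as L2-regularization.
assert abs(ridge_pred - gp_pred) < 1e-8
```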

In a fully Bayesian treatment, the hyperparameters would have to be assigned (noninformative) priors and be integrated out of the distributions relevant for inference or model selection as well. In general, however, it is not possible to remove the dependence on both hyperparameters and model parameterization. Instead, explicit values for the hyperparameters may be estimated by maximizing, for example, the marginal likelihood (see equation A.13) or the predictive distribution in a leave-one-out cross-validation scheme (Sundararajan & Keerthi, 2001). We employ the latter, which is less prone to overfitting and less dependent on model specifications.
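For GPs, the leave-one-out predictive moments need not be computed by refitting: they have the well-known closed form of Sundararajan and Keerthi (2001). The sketch below uses an illustrative squared-exponential kernel and synthetic targets (not the paper's kernel) and checks the closed form against an explicit refit:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 25
x = np.linspace(0.0, 1.0, n)

# Illustrative squared-exponential kernel plus a diagonal noise term.
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.1**2) + 0.05 * np.eye(n)
y = rng.normal(size=n)  # synthetic placeholder targets

# Closed-form LOO predictive moments (Sundararajan & Keerthi, 2001):
# mu_i = y_i - [K^{-1} y]_i / [K^{-1}]_{ii},  var_i = 1 / [K^{-1}]_{ii}.
Kinv = np.linalg.inv(K)
loo_mean = y - (Kinv @ y) / np.diag(Kinv)
loo_var = 1.0 / np.diag(Kinv)

# Check index 0 against an explicit refit that excludes point 0.
idx = np.arange(1, n)
mean0 = K[0, idx] @ np.linalg.solve(K[np.ix_(idx, idx)], y[idx])
assert abs(mean0 - loo_mean[0]) < 1e-8
```

Summing the resulting log predictive probabilities over all points gives the cross-validation objective to be maximized over the hyperparameters.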

The remaining free parameters of our statistical model, indicated by *H* in the conditioning part of the predictive distribution, pertain to *d*, the size of the reconstruction space, and the order of the Volterra series expansion. These have to be determined in a model selection process, where for each choice of *H*, the hyperparameters are estimated individually.
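The model selection loop can be sketched as a grid search over *H*. The following is a minimal illustration with synthetic data: a hypothetical delay embedding of dimension *d*, a polynomial kernel of the given order, and the LOO log predictive probability as score (here with fixed hyperparameters, whereas the full scheme re-estimates them for each *H*):

```python
import numpy as np

def delay_embed(series, d, tau=1):
    """Delay-coordinate reconstruction vectors of dimension d."""
    n = len(series) - (d - 1) * tau
    return np.column_stack([series[i * tau : i * tau + n] for i in range(d)])

def loo_log_score(X, y, order, noise=0.1):
    """Leave-one-out log predictive probability of a polynomial-kernel GP."""
    K = (1.0 + X @ X.T) ** order + noise * np.eye(len(y))
    Kinv = np.linalg.inv(K)
    mu = y - (Kinv @ y) / np.diag(Kinv)      # closed-form LOO means
    var = 1.0 / np.diag(Kinv)                # closed-form LOO variances
    return float(np.sum(-0.5 * np.log(2 * np.pi * var)
                        - 0.5 * (y - mu) ** 2 / var))

rng = np.random.default_rng(4)
s = np.sin(0.3 * np.arange(120)) + 0.05 * rng.normal(size=120)

# Grid over the free model parameters H = (d, expansion order).
scores = {}
for d in (2, 3, 4):
    embed = delay_embed(s, d)
    X, y = embed[:-1], s[d:]                 # predict one step ahead
    for order in (1, 2, 3):
        scores[(d, order)] = loo_log_score(X, y, order)
best_H = max(scores, key=scores.get)
```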

#### A.4 Treatment of Stochastic Driver Input in the Statistical Model under the Bayesian Paradigm

#### A.5 Cross-Correlation Functions

The cross-correlation functions are illustrated in Figure 11.

## Acknowledgments

We acknowledge and thank Andreas Wilmer for his contributions at an early stage of this work. J.S. thanks Hazem Toutounji and Phillip Benner for many invaluable discussions.