Many natural systems, especially biological ones, exhibit complex multivariate nonlinear dynamical behaviors that can be hard to capture by linear autoregressive models. On the other hand, generic nonlinear models such as deep recurrent neural networks often require large amounts of training data, not always available in domains such as brain imaging; also, they often lack interpretability. Domain knowledge about the types of dynamics typically observed in such systems, such as a certain type of dynamical systems models, could complement purely data-driven techniques by providing a good prior. In this work, we consider a class of ordinary differential equation (ODE) models known as van der Pol (VDP) oscil lators and evaluate their ability to capture a low-dimensional representation of neural activity measured by different brain imaging modalities, such as calcium imaging (CaI) and fMRI, in different living organisms: larval zebrafish, rat, and human. We develop a novel and efficient approach to the nontrivial problem of parameters estimation for a network of coupled dynamical systems from multivariate data and demonstrate that the resulting VDP models are both accurate and interpretable, as VDP's coupling matrix reveals anatomically meaningful excitatory and inhibitory interactions across different brain subsystems. VDP outperforms linear autoregressive models (VAR) in terms of both the data fit accuracy and the quality of insight provided by the coupling matrices and often tends to generalize better to unseen data when predicting future brain activity, being comparable to and sometimes better than the recurrent neural networks (LSTMs). Finally, we demonstrate that our (generative) VDP model can also serve as a data-augmentation tool leading to marked improvements in predictive accuracy of recurrent neural networks. Thus, our work contributes to both basic and applied dimensions of neuroimaging: gaining scientific insights and improving brain-based predictive models, an area of potentially high practical importance in clinical diagnosis and neurotechnology.
The brain is arguably one of the most complex dynamical systems known to date, and the ultimate goal of neuroscience is to understand how the brain functions and the corresponding behavior of animals and humans that emerges from intricate nonlinear interactions across billions of neurons. Various neuroimaging modalities provide a rich source of brain activity data, from multielectrode recordings (Schwarz et al., 2014) and functional magnetic resonance imaging (fMRI; Feinberg et al., 2010), to recently developed brain-wide calcium imaging (CaI; Ahrens et al., 2012), which provides a unique perspective on neural function because it allows recording the levels of calcium at an unprecedented subneuronal spatial resolution across entire vertebrate brains, and at temporal resolutions that are commensurate with the timescale of calcium dynamics and neuronal activity (Ahrens, Orger, Robson, Li, & Keller, 2013b). However, given such data, the challenge is to learn interpretable, domain-specific nonlinear dynamical models able to capture well the full-brain neural activity and potentially facilitate accurate predictive modeling, an important area with many practical applications, for example, improving medical diagnosis of mental disorders, epileptic seizure prediction, and brain-computer interfaces (Brunton & Beyeler, 2019; Bzdok & Ioannidis, 2019; Miotto, Wang, Wang, Jiang, & Dudley, 2018; Shoeibi et al., 2020; Zhang et al., 2020).
Note that classical linear vector-autoregressive (VAR) models are often unable to capture complex nonlinear brain dynamics, as confirmed by our experiments. On the other hand, general-purpose nonlinear dynamical models, such as deep recurrent neural networks (RNNs), are not easily interpretable from a neuroscience perspective and tend to require much larger amounts of training data than those typically available in modern neuroimaging experiments. This work proposes a novel, hybrid approach that is shown to overcome these limitations, as it provides an interpretable dynamical model that not only fits the data well, but also allows for generalization to unseen data, making accurate (short-term) predictions of future brain activity.1
Brain activity is generated by coupled dynamics of nonlinear units (neurons), which can exhibit diverse nonlinear behaviors. For instance, it can be oscillatory or even chaotic (Korn & Faure, 2003), with sharp phase transitions between different states. Some of the simplest models that can capture these behaviors, and have been extensively used to model neural dynamics (Ermentrout & Terman, 2010), are relaxation oscillators. It is also highly relevant to neuroscience, as a special case of the FitzHugh-Nagumo model, which is a simplification of the classical Hodgkin-Huxley model of activation and deactivation dynamics of a spiking neuron (FitzHugh, 1961; Izhikevich, 2007). VDP can also model neural populations through its approximation of Wilson-Cowan dynamics (Destexhe & Sejnowski, 2009; Kawahara, 1980).
In this letter, we model the brain dynamics in the dimensionality-reduced space, using a set of top SVD (or ICA) components, or modes. Note that on top of the fact that VDP dynamics is commonly used to model the local neural activity, recent experimental and theoretical studies demonstrate that VDP is also a plausible assumption for modeling such modes. Specifically, the results in Alonso et al. (2019) and Seung, Lee, Reis, and Tank (2000) support the idea that (basal) brain activity is poised at the boundary between stability and instability, a condition that allows for significant coding advantages. And as it was recently shown (Moirogiannis, Hayton, & Magnasco, 2019; Moirogiannis, Piro, & Magnasco, 2017) that under generic assumptions of neuronal connectivity, as one or two modes become unstable, the dynamics on the resulting center manifold is described by a VDP-like equation, or more specifically, a dynamic equation in which the even nonlinearities are absent. It is likely that this property persists as the dimensionality of the manifold increases slightly.
Given the above considerations, we propose to use a set of coupled VDP oscillators to reproduce the (dimensionality-reduced) dynamics derived from different brain imaging data sets. To this end, we need to estimate the states and structure parameters of the coupled VDP model from noisy data, with model nonlinearities and unobserved variables. As it turns out, this task is quite nontrivial and not easily amenable to straightforward approaches such as random sampling in the parameter space. Therefore, we had to explore a variety of approaches and developed a framework combining heuristic stochastic search with deterministic optimization, which allowed us to achieve an accurate data fit using the VDP model on three different neuroimaging data sets. Furthermore, the VDP coupling matrix can be used to analyze positive and negative interactions across different brain areas (spatial components) to obtain neuroscientific insights. Finally, we also use the VDP model for predicting future brain activity (generalization) and obtain good short-term prediction results, often outperforming the VAR model and comparable to or better than LSTM. Furthermore, the best generalization accuracy is achieved by a hybrid method we call VanDeePol, which uses VDP learned from relatively limited training data to simulate large amounts of pretraining data for LSTM, pulling it toward specific nonlinear dynamics, and then fine-tuning it on limited-size real data. We demonstrate that VanDeePol consistently improves LSTM performance on both data sets and both for short- and long-term predictions. Moreover, data augmentation with VDP simulations boosts LSTM performance significantly higher than just using noisy versions of training data.
Importantly, our method hinges on the ability to reasonably fit a simple but rich dynamical system to real data and leverage it to produce unlimited amounts of simulated data for pretraining (data augmentation). The key to this success is achieving an accurate fit of the dynamical system to the training data for high (in the 0.8–0.95 range) correlations between the observed and predicted time series. We show that this offers considerable advantages and largely improves an LSTM's predictive performance. Beyond fit and predictive power, our VDP-based method also provides a novel approach for discovering brain circuits' functional connectivity, since the VDP's coupling matrix can be interpreted as an encoding of positive and negative interactions, of various strength, across different brain areas (spatial components). As we show in the case of the zebrafish data set, this corresponds to a meaningful anatomical structure, suggesting our approach may provide important mechanistic insight, in addition to predictive power, in the study of complex systems.
To summarize, this work demonstrates that it is possible to (1) accurately capture the (dimensionality-reduced) whole-brain activity using VDP models; (2) obtain anatomically plausible interpretation of VDP connectivity, useful for gaining neuroscientific insight; (3) generalize to the unseen future brain activity data; (4) leverage VDP as an unlimited source of simulated data for augmenting a relatively small brain-imaging data set; and (5) using augmented data for pretraining deep recurrent networks, achieve an enormous improvement in the brain activity prediction as compared to both VDP and recurrent networks taken separately. Note that VDP considerably outperforms the linear autoregressive model (VAR) in terms of data fit, interpretability, and predictive accuracy, justifying the added complexity of training such models on brain imaging data. Moreover, our approach is reasonably robust in the presence of non-VDP dynamics, as shown in our synthetic experiments. VDP represents one out of many possible approaches to modeling neural activity. Thus, we are not claiming that the true underlying neural dynamics must always follow VDP, but rather demonstrate that VDP can be an effective, interpretable tool for brain imaging data analysis.
Most of the recent work on calcium imaging is focused on identifying the sources (neurons and axonal or dendritic processes) and denoising/deconvolving the neural activity from calcium signals (Apthorpe et al., 2016; Giovannucci et al., 2017; Inan, Erdogdu, & Schnitzer, 2017; Speiser et al., 2017). Moreover, while recovery of action potential timing and circuit structure and dynamics using CaI modality of recording has an established history in the field (Peterlin, Kozloski, Mao, Tsiola, & Yuste, 2000), it was not until recently that this problem was approached at whole brain scales (Betzel, 2020; Naumann et al., 2016; Vladimirov et al., 2018), which is necessary in order to gain scientific insight into the macroscale brain function, but at the same time poses computational and algorithmic challenges. It is in this direction that our work attempts to make a contribution.
2.1.1 Zebrafish CaI Data
In Ahrens et al. (2013b), light-sheet microscopy was used to record the neural activity of a whole brain of the larval zebrafish at resting state, reported by a genetically encoded calcium marker, in vivo and at 0.8 Hz sampling rate. From the publicly available data (Ahrens et al., 2013a) it is possible to obtain a movie of 500 frames with a 2D collapsed view of 80% of the approximately 40,000 to 100,000 neurons in the brain, with a resolution of 400 by 250 pixels (approximately 600 by 270 microns).
For example, spatial components show pronounced but nonoverlapping forebrain island-like structures, often with lateral symmetry. Moreover, spatial components 4 to 6 include the hindbrain oscillator (seen in the right panels). The corresponding temporal components are dominated by oscillatory activity, consistent with the physiology of the hindbrain oscillator described in Ahrens et al. (2013b).
2.1.2 Rat CaI Data
We considered another calcium imaging data set, of a rat's visual cortex: conventional cytosolic GCaMP6f data in V1 of a rat looking at oriented gratings. After several preprocessing steps, described in appendix A, we obtained 8 data sets, over 182 individual neural cells, with 276 time points each and a time resolution of 0.22 seconds. The top 9 SVD components were extracted for each data set.
2.1.3 Human fMRI
We also tested our approach on resting-state human functional MRI (fMRI), using data from 10 healthy control subjects, obtained from the Track-On HD data set (Kloeppel, Gregory, & Scheller, 2015). In this case, instead of SVD, a conventional 20-components independent component analysis (ICA) was performed on the data, after which the most meaningful 10 components were selected, containing 160 time points each, with a time resolution of 3 seconds.
2.3 Parameter Estimation
States and parameter estimation for systems of ordinary differential equations (ODEs) from data is a problem of significant interest in various research communities, ranging from physics and biology to control theory, machine learning and optimization (Lillacci & Khammash, 2010; Rackauckas et al., 2020; Sabatier, 2000; van Leeuwen & Herrmann, 2015). (See appendix B for more detail on the related work in this area.) This problem can be quite challenging due to the high computational cost of the numerical integration involved in evaluating each candidate set of parameters, and to the very large size of the search space, especially for high-dimensional ODEs.
Note that before exploring more advanced methods, we first attempted to estimate the free parameters of the van der Pol model using a simple random sampling approach. However, despite experimenting with random sampling quite extensively, we could not obtain satisfactory results in terms of the accuracy of the model's fit to the training data; indeed, the parameter estimation problem turned out to be quite nontrivial (e.g., the model's fit, measured via correlation between the true and predicted time series, was quite unsatisfactory, mostly in the 0.4–0.5 range). We later developed a more sophisticated (heuristic) stochastic optimization procedure, which improved the results considerably, leading to an increase of approximately 10% to 15% in the mean Pearson correlation and more than a 40% decrease in the mean RMSE of the data fits.
Moreover, aiming to further improve the data fit, as well as the generalization ability of our models (discussed later), we continued to explore other approaches and, as a result, developed a hybrid method that combined the stochastic search with a deterministic optimization algorithm, which, consistent with the literature (Gábor and Banga, 2015; Rodriguez-Fernandez, Egea, & Banga, 2006), provided a further increase in the quality of the model fits to our training data: more than a 50% increase in the mean Pearson correlation (mainly in 0.7–0.8 range) and a decrease of more than 70% in the mean RMSE with respect to the random sampling.
For the deterministic optimization, we used fast optimization techniques available for nonlinear Kalman smoothing to fully minimize over the states for each update of the unknown parameters. The algorithm can be understood in the framework of recent results on the variable projection optimization method (Golub & Pereyra, 2003). Finally, in our hybrid approach, the stochastic search and ODE-constrained optimization based on variable-projection (VP) (discussed in the next two sections), were executed iteratively, in an alternating fashion, as summarized in algorithm 1.
2.3.1 Stochastic Search
Our stochastic search method is essentially a random walk in both the parameter and the hidden state space, which accepts candidate random steps if it improves a certain fitness function and discards it otherwise. Specifically, we use the following fitness function: , where and are the Pearson correlation and coefficient of determination of the observable state, respectively, and is a weight hyperparameter. We alternate stochastic search with the deterministic optimization procedure, described in the next section, either until the convergence or until the maximum number of iterations is reached. (For details on the algorithm, see section C.1.)
2.3.2 ODE-Constrained Inference
When the parameters and in equation 2.1 are known, inferring the hidden components from observations is a nonlinear Kalman smoothing problem (Kalman & Bucy, 1961; Kalman, 1960). We develop a method to find the hidden variables () from the observed ones () for given parameter settings and to learn unknown parameter settings themselves. We discretize the ODE model in equation 2.1, and formulate a joint inference problem for the state-space and parameters that is informed by noisy direct observations of some states; and constrained by the discretized dynamics.
2.3.3 Inference for a Single Oscillator
The extension of the ODE-constrained inference to the multiple oscillators case is relatively straightforward and is presented in the section C.2.
Note that an accurate fit to the training data alone is an insufficient criterion of the model's validity due to potential overfitting issue. The ultimate validation of a model's ability to accurately capture underlying mechanisms behind the data generation is its generalization, that is, the prediction quality on unseen data drawn from the same data distribution. In this work, we evaluated generalization of candidate dynamical models as measured by the predictive accuracy of future brain activity. Several models besides VDP were used as baselines, including linear vector autoregressive (VAR) model and recurrent neural networks (LSTM), as well as a hybrid method combining VDP-based data augmentation with LSTM, as described below.
Given a VDP model (parameters and initial hidden states) estimated on training time interval, we can integrate it not only to fit the training time series, but also to predict the future time points, immediately following the training interval, since we have an estimate of the hidden state variable at the last training time point. Note, however, that the model is not trivially applicable for predicting an arbitrary interval in the remote future, since its hidden state would be unknown. (For further information, see section D.1.)
The simplest baseline for time-series forecasting is the standard linear vector autoregressive (VAR) model (Lütkepohl, 2005). At a given time , given an matrix of previous observations () over observed temporal components (i.e., ), the task is to (jointly) predict at the next time step for all considered temporal components. To predict multiple steps ahead, we proceed recursively, using the previous predictions as new inputs. We experimented with several values of and selected , based on Akaike and Bayesian information criterions.
We also compared the VDP model prediction with several standard time-series forecasting approaches, including recurrent neural networks, such as the LSTM model (Hochreiter & Schmidhuber, 1997). We used a two-layer LSTM network, trained in Keras using RMSprop with dropout (see section D.2 for details on the network's architecture and hyperparameters). The task was to jointly predict all the observed states at a certain time point based on their values at previous time points, just as in the case of VAR. In order to be comparing models that use the same amount of information from the time series' history, we provided LSTM with the same previous time points as in the case of VAR model. The prediction of multiple steps ahead was also performed in the same recursive way as it was done for VAR.
van der Poldeep neural networks. Training LSTMs often requires a large number of training samples, which was not the case in our data, so some data augmentation could help. A simple, and often used, approach is to generate noisy versions of the training data. However, a more effective alternative, as we show below, is to use the VDP model as a possible “ground-truth” dynamical model for simulating (unlimited) additional training data. The resulting hybrid approach starts with estimating VDP on training data, then using it to simulate the data for pretraining LSTM, and fine-tuning LSTM on real training data; we refer to this combination of van der Pol with deep learning as VanDeePol.
Note that pretraining on VDP-simulated data can serve as a regularizer (prior) in the absence of large training data sets; notice that estimating VDP from smaller data might be easier than training LSTM since VDP has many fewer parameters to train, but hopefully captures some domain knowledge (e.g., certain characteristics of neuronal activation dynamics or anatomy), while LSTM is a general-purpose model. Implementation details are in section D.3.
3.1 Model Fit Evaluation
We applied the proposed approaches on the two calcium imaging data sets discussed earlier, zebrafish whole-brain and rat's visual cortex calcium imaging, as well as on human fMRI data, while evaluating the following key aspects: (1) VDP model fit on training data; (2) predictive accuracy when forecasting time series using VDP, LSTM, and VanDeePol, with linear VAR used as a baseline, and another baseline, representing LSTM pretrained with a noisy version of training data (see section D.4 for details); and (3) interpreting VDP's coupling (interaction) matrix. Zebrafish data were split into five training segments, 100 time points each; the next 20 time points for each training segment were used for testing (forecasting). For the rat data, we used each of the eight data subsets as a training segment, splitting it into 100 training and 176 testing points. For human fMRI we used 10 data sets, corresponding to different subjects, each of which were split into the first 100 time points for training and the remaining 60 for testing.
Figure 3 compares the performance of several predictive methods: VAR (green), VDP (violet), LSTM (blue), VanDeePol (orange), a baseline data-augmented LSTM –pretrained with noisy versions of training data– (red), as well as the limit case of random prediction, drawn from each training set (black).
We train the models on 100 consecutive points of each data segment (5 for zebrafish, 8 for rat and 10 for human), and then used the next 20 points (zebrafish), next 176 points (rat), or next 60 points (human) for testing/prediction, as follows. First, we use each trained model for a short-term prediction task: to predict sequentially the first nine points of the test segments (adjacent to the training segments) in the following recursive way. Given the last six points of the training data, we predict the next, unobserved one (the first point in test segment), then shift by one point forward the six-point input window, using our recent prediction of a current data point as a part of the six-point input segment for predicting the next point, and so on, never using the observed data points from the test data.
We also evaluate the models on the long-term prediction task that involves all future test points. Namely, we repeat the short-term prediction procedure described above for a window of nine test points, which, in turn, are shifted forward one point at a time, until we reach the end of each test segment. Each of the shifts consists in a short-time prediction itself, starting its prediction process with six ground truth data points and advancing in the same recursive manner as explained above, without using more observed data points but the initial six. Note that unlike VAR and LSTM models, VDP model cannot be straightforwardly applied for a long-term prediction, as it will require adding an inference mechanism for the hidden states corresponding to the future points beyond the last one in the training segment. Thus, the VDP plot is omitted from the corresponding long-term comparisons.
We compute the median Pearson correlation between the true and predicted time series and the median root-means-square error (RMSE), over all training/test splits (i.e., 5 for zebrafish, 8 for rat, and 10 for human fMRI), all considered SVD components and all testing intervals (in case of short term, we only had one test interval). The shaded area around each curve represents the standard error. Note that for long-term prediction, having larger test intervals yields considerable lower variance (narrower error bars) for rat (176 data points) and human (60 data points) as compared to zebrafish (20 data points). All predictions are normalized by the mean (across components) of the standard deviations (across time) of the training data for each case.
We can see in the first two rows of Figure 3 that for zebrafish and rat CaI, in general VanDeePol (orange) significantly outperforms the other methods or, in the worst cases, it is equivalently good. On the other hand, VDP (purple) presents a good performance in terms of correlation, though it is outperformed by LSTM in the second half of the prediction window in the case of zebrafish data set. However VDP seems to have a relatively high RMSE, which could be due to considerable initial distances between modeled and true time series: despite the model generally having high overall fit correlation (see Figure 2), note that the error for the last point of the training set (initial condition for VDP predictions) might be significant. Nonetheless, the difference in performance between pretraining LSTM with noisy versions of the training data (red) and pretraining them with van der Pol models, which is essentially our VanDeePol method (orange), is noteworthy. Hence it is possible to think that VDP successfully learned some meaningful information about the underling dynamics (reflected in its high correlation predictive performance), which was transferred to LSTM by imposing a prior and resulted in an improved performance.
It is notable that for human fMRI short-term predictions, all methods perform similarly in terms of correlation (see Figure 3i) and even VAR outperform the rest of them in terms of RMSE (see Figure 3j) up to three-steps-ahead predictions, after which VAR predictions degrade severely, as in the other cases. This can be understood when considering that the fMRI signal is relatively slow due to the following reasons: (1) The hemodynamic response function (HRF) that is convolving the underlying neural activity in fMRI has a typical timescale of approximately 10 seconds. Since the time resolution is 3 s, this means real fluctuations are not expected to appear in fewer than three of four time points. (2) fMRI noise is known to exhibit strong autocorrelation (Arbabshirani et al., 2014). (3) Data have been bandpass-filtered during preprocessing (0.01–0.16 Hz) to remove drifts and high-frequency noise. While this is supposed to increase neural signal to noise and is compatible with the HRF natural filtering mentioned above, it is not expected to see fast changes by design.
However, the picture of human fMRI long-term predictions (see Figures 3k and 3l) is very different from their respective short-term counterparts but, at the same time, more similar to their zebrafish and rat equivalents. VanDeePol clearly performs better than VAR for a large margin, which could mean that while VAR may perform better for very short-term predictions, the van der Pol model might be actually capturing additional features of the underlying dynamics, which, when combined with LSTM, enables VanDeePol to perform well even for predictions far from the training set.
Once we have estimated VDP parameters, besides using them for the forecast of future activity, we can go back to the neural basis and try to learn some neuroscientific insights. Recalling that parameter in equation 2.1 can be interpreted as effective (directed) connectivity between the neural populations captured by each spatial component of the singular value decomposition, we attempt to use the gained information to infer the underlying functional connectivity of the brain. However, note that we do not have distinct neuron populations independently identified in each spatial component; they are all present in all spatial components, only that differently weighted (see Figure 1). We next explain the method we developed to disentangle the information we gained about the complex nonlinear brain dynamics into simple pixel-to-pixel directed connectivity. Figure 4 demonstrates an example of our connectivity analysis applied to the case of the CaI zebrafish data set.
First, we serialized each spatial component as a vector and computed the outer product for each spatial component pair, scaled by the product of the square roots of each corresponding singular value. The resulting matrices each represented a kind of pixel-to-pixel connection strength for that spatial component pair. To compute the final pixel-to-pixel connectivity matrix, we summed the weight matrices () for all VDP models generated as described before. Each element of the resulting summed matrix was then multiplied by the corresponding pixel-to-pixel connections strength matrix for that spatial component pair. The resulting weighted matrices were then summed to create the final pixel-to-pixel connectivity matrix. For visualization only, the top 200 positive weights were rendered as pixel-to-pixel excitatory edges (see Figure 4b) and the lowest negative 200 as inhibitory edges (see Figure 4c).
The results show connections between brain areas that are consistent with known anatomical demarcations in the larval zebra fish brain (Ahrens et al., 2013b), as well as functional relationships among forebrain, midbrain, and hindbrain. The lateralization of certain connections is intriguing and warrants further analysis, as does the apparent involvement of both visual areas and motor generating areas in the final network. Finally, a competitive recurrent network is implicated for the fish forebrain intraconnectivity.
Our next question was to find out how the connectivity inferred by our nonlinear coupled VDP model would compare with the connectivity obtained by a simple linear autoregressive model. For such comparison, we use the same VAR(6) model utilized as baseline for the prediction task, which provides six matrices, one per time lag. In order to be equivalent to what we did for VDP, we have to define a reduction method to get a single matrix out of the six. Since there is no clear right choice for that method, we tested three methods:
1st Lag: Keep only the matrix corresponding to the first lag (immediate previous time step).
Mean: Take the mean across the six matrices.
Max abs: For each element, select the value with the highest absolute value across the six matrices.
Figure 5 presents the resulting connectivity networks when using VAR matrices instead of van der Pol's for the analysis. The pixel-to-pixel connectivities inferred by our method using the VDP approach and VAR approach are quantitatively (see appendix G) and qualitatively different, with the VDP connectivity having much more explanatory value within the domains of systems neuroscience and neuroanatomy. Specifically, in the VAR approach, inferred inhibitory and excitatory networks are highly overlapping. In other words, the VAR approach fails to identify the most fundamental observation of neuroanatomy across organisms: that inhibition and excitation play different roles in the brain's networks and therefore create distinct anatomical network connections and motifs. Inhibitory and excitatory networks in the brain have very different structures, and only VDP recapitulates this observation in its inferred networks.
When considering these differences between inferred excitatory and inhibitory networks from VDP, it is first striking that connections within and from the telencephalon of the forebrain (left structure) are highly divergent between the two networks. Reciprocal excitatory connections between telencephalon and the optic tectum (middle structure) and between telencephalon and hindbrain (right structure) support top-down processing with feedback within the those systems controlling animal orientation (the optic tectum) and swimming movement (the hindbrain). In contrast, the three-hub looping inhibitory network in the forebrain implies a network similar to what is referred to as “winnerless” in the neuroscientific literature (Laurent et al., 2001).
Excitatory connections between the optic tectum and the hindbrain, on the other hand, are quite sparse, while inhibitory connections between these structures on the left side are robust, possibly indicating their role in swim centers of the hindbrain cancelling self-generated movement artifacts from in the optic tectum. In this way, possible runaway feedback orientation signals to the optic tectum from sensory centers (e.g., self-generated visual, auditory, or lateral line signals) may be averted and quenched by the system.
Finally, the specificity in the VDP connectivity with which the left optic tectum inhibits only one of the three inhibitory nodes in the telencephalon and excites only one pixel in the right optic tectum is remarkable and suggests very interesting functional convergence was ongoing at the time the movie was made. Such specificity and convergence are wholly absent from the VAR inferred connectivity.
Motivated by the challenging problem of modeling nonlinear dynamics of brain activations, we propose a novel approach for learning a network of nonlinear ordinary differential equations as parameterized coupled van der Pol (VDP) oscillators and demonstrate its successful application to modeling neural activity as measured by calcium imaging data in two organisms (zebrafish and rat), as well as in human fMRI. Our work is the first, to the best of our knowledge, to demonstrate that a coupled VDP model can be fit to multivariate brain imaging data with the level of accuracy reported, achieve a reasonable predictive performance in time-series forecasting setting, while also providing insight into excitatory and inhibitory interactions across different brain subsystems. Furthermore, we show that VDP can serve as a generative model for data augmentation and regularization (imposing domain-specific prior) in order to boost the predictive performance of deep neural networks. Indeed, our hybrid approach combining VDP with recurrent neural networks significantly outperforms the predictive accuracy of both methods when used independently.
4.2 Limitations and Perspectives
Our work opens a door to several future directions, born out of its success and its limitations. Here we focused primarily on investigating the feasibility of VDP in terms of accurate modeling of neuroimaging data, as well as its applicability for data augmentation, inference of connectivity, and improving neural activity prediction. While the proposed parameter estimation method demonstrates a clear promise of the overall approach, developing even better and more efficient parameter estimation techniques is a promising avenue for further research. For instance, our data dimensionality reduction approach and parameter estimation procedure for the dynamical system model are independent; while this may avoid circularity issues, we would like to develop a more satisfactory method that integrates both into a simultaneous optimization of the dimensionality-reduced space and the identification of the model in such space. For similar reasons, the hybrid model we present here suggests that another comprehensive solution would include a deep learning approach in which the activation units are replaced by low-dimensional nonlinear units with richer dynamic profiles.
Exploring richer dynamics is clearly another promising path for future studies. Here, the particular application to neural data motivated the use of the VDP model because of the well-established research showing it is a good model for local neural activity. However, we expect that the overall approach can generalize to other domains with similar local models (e.g., ecology Bjørnstad, 2015). Since VDP represents one out of several dynamical models that can be potentially highly relevant for modeling neural activity, exploring other neural population dynamics models is also an important direction to investigate. Moreover, while we focused on oscillatory modes, future attempts need to also integrate the arrhythmic broadband part of neural activity (Chaudhuri, He, & Wang, 2018), as well as longer timescales (Raut, Snyder, & Raichle, 2020).
In summary, our results demonstrate that coupled nonlinear oscillator models can accurately capture brain dynamics across multiple species and imaging modalities, while providing insights into functional connectivity. Moreover, the generative ability of such dynamical models can significantly boost the predictive accuracy of deep learning methods on relatively small neuroimaging data sets, opening new venues for data augmentation in spatiotemporal medical imaging. Thus, our work contributes to both the fundamental problem of gaining neuroscientific insights and the practical problem of building accurate, predictive models from neuroimaging data, an active research area with many potential applications, from clinical diagnosis to neurotechnology.
Appendix A: Data Sets and Preprocessing
A.1 Zebrafish Data Set
We worked with the dorsal view of the the publicly available data set (video) (Ahrens et al., 2013a). We first filtered out all pixels with near-zero standard deviation by setting their values to zero. Next, a singular value decomposition (SVD) was performed as shown in Figure 6.
We worked with the top six SVD components (excluding the very first, the mean), which explained 95% of the variance of the data. There was a portion of the data, in the middle of the video, associated with a sudden burst in the activity of almost the whole brain (at second 18 in the video); we excluded this unique event from the training and test data. Given the remaining data, five segments of 120 temporal points were extracted for all temporal components; the first 100 were used for training and the last 20 for testing.
A.2 Rat Data Set
Eight square-wave, drifting gratings (spatial frequency, 0.1 cycles/degree; speed, 10 deg/sec) were presented to an awake, head-fixed rat while imaging cells with conventional cytosolic GCaMP6f in visual area LM, with a sampling rate of 44.69 Hz. Stimuli were shown for 2 seconds per trial, with an intertrial interval of 4 seconds, during which time a gray screen was shown. All gratings were presented within a circular aperture of 35 degrees of visual angle. Grating directions were selected at 45 degree intervals.
We used eight different realizations of this experiment. Each data set consisted of 182 time series of length 2860, each corresponding to an individual cell. We smoothed the data with a Savitzky-Golay filter (window length = 51, order of the polynomial 3) and afterward downsampled it by a factor of 10 in order to approximately match the characteristic timescale of the above zebrafish data set dynamics. Next, we discarded the first 10 out of 286 time points to exclude some potential initial perturbations due to the measurement process; this resulted into the final 276 time steps 182 cells, in each of eight data sets.
Finally, an SVD decomposition was performed as described above for zebrafish data, using each cell instead of a pixel. In this case, we selected the top nine SVD components (as usual, excluding the very first, that is, the mean). As in the zebrafish case, the first 100 time steps of each temporal component were used for training; the remaining 176 points were used for testing.
A.3 Human fMRI
We used resting-state fMRI data from 10 healthy control subjects, obtained from the Track-On HD data set (Kloeppel et al., 2015). Data have been bandpass-filtered (0.01–0.16 Hz) to remove drifts and high-frequency noise. Next, a standard 20-component ICA was performed, after which the most meaningful 10 components were used. Each subject data set contained 160 time points (with a time resolution of 3 seconds), using the first 100 time steps for training and the remaining for testing.
Appendix B: Related Work: ODE Parameter Estimation
States and parameter estimation for systems of ordinary differential equations (ODEs) from data is a problem of significant interest in various research communities, ranging from physics and biology to control theory, machine learning, and optimization (Lillacci & Khammash, 2010; Rackauckas et al., 2020; Sabatier, 2000; van Leeuwen & Herrmann, 2015). This problem can be quite challenging due to the high computational cost of the numerical integration involved in evaluating each candidate set of parameters and to the very large size of the search space, especially for high-dimensional ODEs. These difficulties have inspired a range of approaches. Many traditional methods avoid optimization by using the unscented Kalman filter (Quach, Brunel, & d'Alché-Buc, 2007; Sitz, Schwarz, Kurths, & Voss, 2002; Voss, Timmer, & Kurths, 2004) or other derivative-free dynamic inference methods (Havlicek, Friston, Jan, Brazdil, & Calhoun, 2011). However, derivative-free methods have limitations because there are no convergence criteria or disciplined way to iterate them to improve estimates. Optimization-based approaches for fitting parameters and dynamics are discussed by Gábor and Banga (2015), who formulate parameter identification under dynamic constraints as an ODE-constrained optimization problem. We take a similar view and use recent insights into variable projection to develop an efficient optimization algorithm for state and parameter estimation of the van der Pol model (VDP), where state vectors include unobserved (hidden) variables. The work of Gábor and Banga (2015) is focused on global strategies (e.g., multiple restarts of well-known methods); our contribution is to develop an efficient local technique for an inexact VDP formulation.
For fixed parameter values, the state estimation of the VDP model becomes a nonlinear Kalman smoothing problem (Kalman & Bucy, 1961; Kalman, 1960). Optimization-based approaches with nonlinear and nongaussian models require iterative optimization techniques; see, for example, the survey of Aravkin, Burke, Ljung, Lozano, and Pillonetto (2017). Dynamical modeling was applied to nonlinear systems early on by Anderson and Moore (1979) and Mortensen (1968). More recently, the optimization perspective on Kalman smoothing has enabled further extensions, including inference for systems with unknown parameters (Bell, 2000), systems with constraints (Bell, Burke, & Pillonetto, 2009), and systems affected by outlier measurements for both linear (Cipra & Romera, 1997; Durovic & Kovachevic, 1999; Meinhold & Singpurwalla, 1989) and nonlinear (Aravkin, Bell, Burke, & Pillonetto, 2011; Aravkin, Burke, & Pillonetto, 2014) models.
Appendix C: Estimation of van der Pol Parameters
C.1 Stochastic Search
Our stochastic search algorithm performs a random walk in the space of parameters and hidden-variable states, using some fitness function described below to evaluate each candidate point and accepting the steps that improve the data fit while discarding the ones that do not. The method described in the following paragraphs was implemented in Julia (Bezanson, Edelman, Karpinski, & Shah, 2017) using the package DifferentialEquations.jl (Rackauckas & Nie, 2017); its pseudocode is shown in algorithm 2.
The search begins with an initial guess for parameters (starting with the same value for all oscillators), a zero connectivity matrix , and a random guess for the initial condition of the hidden variables, for , which we refer to as . At every stochastic search step, these parameters and initial conditions change in a certain stochastic way, and our differential equations with the new parameters are integrated; if the solution fits the data better then the previous best one, according to a fitness function , the step is accepted, that is, the new parameters and initial conditions replace the previous best set.
In the first stage of the algorithm (for a certain number of iterations ), only the parameters and initial conditions are changed, while keeping zero. In each step, one oscillator is randomly chosen; subsequently its parameters and initial conditions from both its observable and hidden variables, and , are changed with a gaussian random step. The initial conditions of the observable variables are box-projected at each step, so that they do not depart too much from measured initial conditions. In order to better escape from local minima, we sometimes also make special steps, in which all oscillators are changed jointly, and with larger variance than normally.
After the initial iterations, the matrix also starts to change at every stochastic search step. Unlike the rest of the parameters, all the elements of change at every step, although the variances of their random steps are one order of magnitude smaller. However, similar to the other parameters, larger variance steps are sporadically taken for .
Stochastic search is performed for a fixed number of steps, and then we switch to another optimization procedure, which can be either VP or another stochastic search with different parameter. It turns out that modifying when using the second stochastic search helps significantly to escape bad local minima and to accelerate the estimation of parameters. Hence, the stochastic search is implemented, changing the parameter from the fitness function in an alternating fashion. As can be seen in the following simple pseudocode, for some cycles of the stochastic search, it uses only the Pearson correlation, and for some other cycles, it uses a combination between the correlation and .
C.2 ODE-Constrained Inference: An Extension to m Coupled Oscillators
Appendix D: Prediction Methods
D.1 Predictions from the van der Pol Model
The parameters estimation algorithm is run 10 times for each data set, producing different results due to the stochastic nature of the method. After each global iteration of the algorithm, the parameters are saved in order to afterward select the top 50 most correlated models to the data, thus obtaining 500 different well-fitted models. In order to generate the van der Pol prediction, we continue integrating the equations, with the obtained parameters and hidden states, further the training range, and average all predictions.
D.2 LSTM Architecture and Hyperparameters
Our LSTM networks were implemented in Keras and contained two layers, 128 units in each layer, with 0.3 dropout, followed by a fully connected layer and linear activation; we used the mean squared error and the optimizer RMSProp with learning rate of . The normalization was the same as in the estimation of van der Pol parameters: all components were normalized by the mean of their training set standard deviations.
We used LSTM for multivariate time-series prediction: based on the values of consecutive times from the SVD time components (), the task is to predict its immediately next values (). To perform multiple steps-ahead predictions, it proceeds iteratively, in a recursive way, using the previous predictions as new inputs. In this way, only the first temporal data points from the ground truth are used (in order to start the forecast), and the predicted values are subsequently incorporated to move forward in time. In our experiments, several values of were tried, and was selected in order to match the lag order of VAR and, hence, compare equivalent models in terms of information used about the history of each time series.
LSTMs, and deep learning in general, require a huge amount of training data for them to learn well. In this case, the zebrafish data set was composed of several hundred time points. On the other hand, given our prior knowledge of the data, presenting a nonlinear dynamic behavior of a certain type (i.e., van der Pol with specific parameters), one can hypothesize that providing LSTM with information about such domain-specific dynamics can potentially improve its performance.
Thus, we propose a simple approach for providing general-purpose LSTM with prior information about the domain-specific dynamics, a data-augmentation approach that pretrains LSTM on a large amount of simulated data obtained from a fitted van der Pol model, before fine-tuning LSTM on a relatively small amount of available real data. Such pretraining on the data simulated from our analytical model serves as a regularizer in the absence of large training data sets, biasing LSTM toward the type of dynamics we expect in the data. We label this hybrid method VanDeePol.
For each training set, the parameter estimation algorithm is run 10 times, saving the results after each global iteration for selecting afterward the top 50 fits with higher Pearson correlation. Integrating the model equations with the selected parameters and initial conditions, and extending the integration 10 time steps further than the training limit, we obtain 500 time series of shape 1106 (zebrafish), 1109 (rat), or 11010 (human) for each training set. Note that the extra 10 time steps included are already predictions from the van der Pol models, so in some way, we are incorporating the VDP predictions into LSTM. The first 109 time steps of each of these artificial data sets would be used for the pretraining of the LSTM, while the last steps would be used as validation in order to decide when to stop this process. After the pretraining, a fine-tuning is done with the real training data corresponding to each segment, again employing the last time points as validation for the early-stopping.
D.4 Data Augmentation with Noisy Versions of Training Data
The simple method of augmenting training data with noisy versions of itself was used as a baseline for data augmentation. Attempting to get approximately the same number of total training samples for each data set as in the case of data augmentation with van der Pol, 550 noisy versions of each data set were created. Additive white gaussian noise was used, with a variance of 20% of each component standard deviation (across time). So for each training set, 550 time series of shape 1006 (zebrafish), 1009 (rat), or 11010 (human) were generated. Considering that the task was to predict the next time step (jointly for all components) based on the previous six time steps, we had 550 94 pairs of (X,y). Similar to data augmentation with VDP, the last pair of (X,y) from each of the 550 time series was used as validation for stopping the training. Likewise, after pretraining with noisy training data, fine-tuning was performed using the real training data and the last time point as validation.
D.5 LSTM without Data Augmentation
For the cases in which only the real (measured) data were used for training the LSTMs (associated with blue curves in Figure 3), conditions and settings were exactly the same as for the fine-tuning after each of the above discussed pretrainings with synthetic data sets.
Appendix E: Experiments on Simulated Data
In order to investigate on the ability of the parameter estimation method, van der Pol model, and VAR models to fit and infer connectivities from different dynamical systems, we conducted the following experiments.
Ten coupled oscillators, each of them with two types of state variables, were simulated with a sparsity of 0.6 in their connectivity. The equations were integrated for 10 units of time, saving the states at every 0.1, resulting in 100-long time series. The first type of variables from each oscillator were considered as the observed ones and fitted to the first type of variables from van der Pol, similar to what is done for the neural data. The parameter estimation procedure is run once, the 300 best fits are selected based on their Pearson correlations, and their corresponding parameters are averaged to get the final estimation. Finally, the estimated connectivities from the fitted VDP models are compared to the ground truth connectivities.
Besides the ability to fit the data, we are interested in investigating how well the connectivities underlying each time series are recovered. For comparison with a simpler method, we used the VAR(6) linear model, the same that was used as a baseline for the time series prediction task. Each VAR(6) model provides six matrices, corresponding to the previous six steps considered for predicting the next one. We have to define a reduction method to obtain a single matrix in order to compare it with the ground truth connectivity. Since there is no clear right choice for that reduction method, we tested three different methods:
1st lag: Keep only the matrix corresponding to the first lag (immediate previous time step).
Mean: Take the mean across the six matrices.
Max abs: For each element, select the value with highest absolute value across the six matrices.
Figures 9 and 10 display violin plots with the distributions of correct guesses for all the methods, along with the equivalent when using random normal matrices. For both the van der Pol and Hopf cases, the inference via the VDP method is significantly better than the VAR methods.
As seen in Figures 7 and 8, the fits are very good for both VDP and Hopf cases. However, when the underlying dynamics is VDP, we can infer the direction of the couplings significantly better, as Figures 9 and 10 present. This result emphasizes the relevance of using a good model. In this letter, we are not claiming that the VDP is a universal dynamics approximator but that given some particular characteristics of the neural dynamics, when projected onto a lower-dimensional phase space, their modes can be approximated by a VDP-like dynamical system.
Appendix F: Differences between VDP and VAR Methods
F.1 Connectivity Patterns
We compare the W matrices obtained by VDP with those obtained with VAR methods (see Figure 11). Beyond the overall magnitude of connectivity values, we observe an aggregation of the weights on the top SVD component for the VAR methods but not for VDP.
F.2 Stability of the Fit
While the bottom row of Figure 11 already shows a difference in the standard variation of connectivity measures across different time windows, we also compared the stability of W matrices fits, normalized by the average connectivity (see Figure 12). Despite its stochastic nature, the VDP variability across time windows was lower than the VAR deterministic methods (VDP versus VAR6/1st lag: d 0.58 [1.19, 0.03], ; VDP versus VAR6/Max: d 0.63 [1.24, 0.03], ; VDP versus VAR mean: d 0.50 [1.11, 0.10], E-03).
Appendix G: Further Study on Pixel-to-Pixel Connectivities
To study further the influence of the mode-to-mode connectivity matrices (W, inferred by VDP or VAR) on the final pixel-to-pixel connectivity, we performed the following experiment. We computed the pixel-to-pixel connectivity while shuffling all the elements of each W matrix used. In this way all the statistical properties of the estimated Ws were preserved but any possible structure dissolved. In order to have a quantitative estimation of the difference between the final connectivities obtained by VDP, VAR, and shuffling the VDP W matrices, we examined the L1 distances between their vectorized full pixel-to-pixel associated matrices (avoiding any thresholding), which we will refer as C, C, and C. Specifically, we computed 1000 realizations of C, measured L1 pairwise distances between all of them, and compared them with the distances between C and C to each C through Cohen's d effect sizes and Mann–Whitney U tests:
VDP: d 157.33 [156.67, 158.00] ( 0.0001, Bonferroni corrected)
VAR 1st lag: d 53.47 [53.03, 53.90] ( 0.0001, Bonferroni corrected)
VAR max abs: d 684.09 [682.73, 685.45] ( 0.0001, Bonferroni corrected)
VAR mean: d 46.32 [45.90, 46.73] ( 0.0001, Bonferroni corrected)
In the connectivity analysis, we sum the contributions of different connectivity matrices obtained by independent parameter estimation runs, which yield different outcomes due to its stochastic nature and nonconvexity of the problem. This could be seen as averaging across different local minima solutions, which in a general nonconvex optimization problem would be ill founded. However, based on the construction of the model and our interpretation of the matrix, each of which elements would represent an average coupling strength between the neural populations demarcated by the th and th SVD spatial components, we hypothesize that even though we would find multiple sets of parameters corresponding to possibly equally good fits, statistically there would be some with more predominance than others, so that after the numerous summations or averages (also across different time windows), some components would vanish and some would prevail. In fact, from the experiment described in this section, we obtained that the mean of the absolute values of C is 6.90 0.87 while the corresponding to VDP, VAR 1st lag, VAR max abs and VAR mean are 194.58, 71.44, 811.73, and 63.47, respectively. It is then possible to interpret that the nonshuffled connectivities got less vanishing components across the whole connectivity inference process. Nevertheless, the investigation on how to best estimate the model parameters and the inference of the pixel-to-pixel connectivity will be matter of contining investigation.
Appendix H: Additional Fitting Plots
It is important to note that good generalization to unseen data is an important indicator that a model is likely to robustly capture brain dynamics, as compared to a model that works well only on the same data it was trained on, potentially overfitting noise patterns unrelated to the true underlying brain activity mechanisms, and thus failing to generalize to unseen data. While improving model generalization is traditionally the main focus of the machine learning field, it is sometimes overlooked in neuroimaging literature. Thus, our focus here is on all three core aspects of building good brain models: data fit, interpretability and predictive accuracy (generalization).
The Lipschitz constant of the gradient of stays bounded as , which is clearly false for .
We thank Juliana Rhee for sharing the rat calcium imaging data set used in this work. Computations were performed using DIRAC High-Performance Cluster, at the Physics Department of FCEyN, UBA, which we gratefully acknowledge. This research has been supported by UBA (UBACyT 20020170100482BA) and ANPCyT (PICT 2015-3824). We also thank the reviewers for their insightful comments and constructive help.