## Abstract

We investigate the utility of Wishart processes (WPs) for estimating time-varying functional connectivity (TVFC), that is, changes in functional coupling measured as the correlation between brain region activity in functional magnetic resonance imaging (fMRI). The WP is a stochastic process on covariance matrices that can model dynamic covariances between time series, which makes it a natural fit to this task. Recent advances in scalable approximate inference techniques and the availability of robust open-source libraries have rendered the WP practically viable for fMRI applications. We introduce a comprehensive benchmarking framework to assess WP performance compared with a selection of established TVFC estimation methods. The framework comprises simulations with specified ground-truth covariance structures, a subject phenotype prediction task, a test-retest study, a brain state analysis, an external stimulus prediction task, and a novel data-driven imputation benchmark. The WP performed competitively across all the benchmarks. It outperformed a sliding window (SW) approach with adaptive cross-validated window lengths and a dynamic conditional correlation (DCC)-multivariate generalized autoregressive conditional heteroskedasticity (MGARCH) baseline on the external stimulus prediction task, while being less prone to false positives in the TVFC null models.

## 1 Introduction

The human brain is a dynamic system of interacting regions. Higher-order cognition and complex behavior arise from the spatiotemporal integration, reorganization, and segregation of brain region activity (Bressler & Menon, 2010; Deco et al., 2011; Shine et al., 2016). This intrinsic functional architecture or “connectome” exhibits complex and modular system structures (Deco et al., 2015). Understanding the dynamics of brain region interconnectivity and its changes, for example, during disease, development, or learning, is a critical goal in modern neuroscience.

Functional connectivity (FC) is the study of temporal coactivation across distinct brain regions. This is inferred from the blood-oxygen-level-dependent (BOLD) neural activity signal detected using functional magnetic resonance imaging (fMRI) (Bijsterbosch et al., 2021). These connectivity maps can be used to “fingerprint” subjects, and many inter-individual differences have been identified (Finn et al., 2015; Liégeois et al., 2019). FC is characterized by statistical dependencies (i.e., correlations) between neural activity in brain regions, often assuming a stationary covariance structure (Wang et al., 2014; Zalesky et al., 2012). However, this assumption neglects the complex temporal structure of interregional relationships (Hutchison et al., 2013; Lurie et al., 2020). Therefore, the study of time-varying FC (TVFC) has become increasingly popular (Cohen, 2018; Vidaurre et al., 2021). Instead of estimating a single correlation matrix for an entire scan, correlation is estimated as a function of time during a scan.

Approaches for estimating TVFC include sliding window (SW) methods (Chang & Glover, 2010; Sakoğlu et al., 2010), multiplication of temporal derivatives (Shine et al., 2015), phase coherence models (Glerean et al., 2012), multivariate generalized autoregressive conditional heteroskedasticity (MGARCH) processes (Hakimdavoodi & Amirmazlaghani, 2020; Lindquist et al., 2014), customized Bayesian models (Ebrahimi et al., 2023; Li, Pluta, et al., 2019; Taghia et al., 2017; Warnick et al., 2018), wavelet-based methods (Zhang et al., 2016), and phase synchronization models (Demirtaş et al., 2016; Varela et al., 2001). Another popular approach is to use state-based models, such as the Hidden Markov Model (HMM). Instead of estimating a full covariance structure, each scanning volume is assigned to one of several canonical correlation matrices or “brain states” (Ahrends et al., 2022; Vidaurre et al., 2018). However, all the aforementioned methods exhibit a large degree of variation in their TVFC estimates (Foti & Fox, 2019; Lurie et al., 2020).

This study proposes to estimate TVFC using Wishart processes (WPs). This stochastic process model learns a distribution over (covariance) matrix-valued functions. This is analogous to the Gaussian process (GP) that learns a distribution over scalar-valued functions (Rasmussen & Williams, 2005). Therefore, it is a natural candidate for TVFC modeling. It has recently become possible to adopt this class of models in neuroimaging, thanks to advances in approximate inference routines and easy-to-use computational libraries (Heaukulani & van der Wilk, 2019; Matthews et al., 2017). We discuss the motivation, assumptions, and background of this model; how to construct it; how to employ Bayesian inference to fit its parameters; and its potential advantages and limitations.

Evaluating TVFC estimation methods remains difficult because the ground-truth covariance structure in real data is unobserved. Therefore, estimation methods coexist, despite the large degree of variation in their estimates. To evaluate the application of WPs to covariance estimation between fMRI time series, we designed a comprehensive collection of benchmarks that serve as proxies for the direct observation of true TVFC. Each benchmark comprises a competitive prediction task in which the WP is compared with other TVFC estimation methods (Leenings et al., 2022; Yarkoni & Westfall, 2017). We study established prediction tasks of three kinds: simulation benchmarks, in which time series are generated using prespecified underlying covariance structures (Hindriks et al., 2016; Thompson et al., 2018) and the task is to “reconstruct” this ground-truth covariance structure from the simulated observations; resting-state fMRI (rs-fMRI) benchmarks, in which we study the predictive power of estimated TVFC for subject measures, as well as its test-retest robustness; and task-based fMRI (tb-fMRI) benchmarks, in which the goal is to predict the presence of an external stimulus (Gonzalez-Castillo & Bandettini, 2018; Sahib et al., 2018). Additionally, we introduce a data-driven imputation benchmark, in which the prediction task is to infer TVFC at a left-out test set of observations using observations from the remaining training dataset.

In Section 2, we describe the existing TVFC estimation methods that serve as baselines for comparison with the WP. These include the static FC (sFC) estimate, an improved version of the SW approach using cross-validated window lengths (SW-CV), and the dynamic conditional correlation (DCC) MGARCH model. Next, we discuss how we constructed the WP and inferred its model parameters. We then discuss the benchmarks that were studied. After presenting the results in Section 3, we summarize the insights into WP performance across these benchmarks in Section 4. We discuss limitations and provide promising avenues for future work to address these.

## 2 Methods

In this section, we describe methods for time-varying covariance estimation, how to extract features from estimated covariance structures, and how to compare estimation method performance through benchmarking.

### 2.1 Baseline TVFC estimation methods

#### 2.1.1 Static functional connectivity (sFC)

Static connectivity estimates assume that the covariance between the random variables of interest does not change across any arbitrary scanning session length. The multivariate BOLD observations at time step (i.e., scanning volume) $1 \leq n \leq N$ for activity time series from $D$ regions (or *nodes*) are denoted as $y_n \in \mathbb{R}^D$. As sample means are assumed to be zero, the (unbiased) covariance matrix estimator is

$$\hat{\Sigma} = \frac{1}{N} \sum_{n=1}^{N} y_n y_n^\top.$$

The respective (Pearson) correlation matrix is obtained by dividing each covariance matrix element by the product of the standard deviations of the respective variables.
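For illustration, this covariance-to-correlation conversion can be sketched in a few lines of NumPy (not the original analysis code; `cov_to_corr` and `static_fc` are hypothetical helper names):

```python
import numpy as np

def cov_to_corr(cov):
    """Divide each covariance element by the product of the
    standard deviations of the respective variables."""
    std = np.sqrt(np.diag(cov))
    return cov / np.outer(std, std)

def static_fc(y):
    """sFC (Pearson correlation) estimate for a (N, D) array of
    BOLD observations, with sample means assumed to be zero."""
    return cov_to_corr(y.T @ y / y.shape[0])
```
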

#### 2.1.2 Sliding window (SW) functional connectivity

SW methods slide a time window of a certain length $\omega$ and shape (typically rectangular or Gaussian) across observations. The covariance is estimated within this window for each step (Shakil et al., 2016). The window length is typically between 30 and 60 seconds (Shirer et al., 2012). In this study, we implemented a rectangular window with a step size of a single volume. Time series were high-pass filtered (5th-order Butterworth) to remove the frequency components below $1/\omega$ prior to TVFC estimation (Leonardi & Van De Ville, 2015). Zeros were padded to the start and end of the time series to allow for estimation around those locations.

Window lengths affect the estimated TVFC. Small windows can detect fast-changing aspects of the data but often return spurious covariance, whereas large windows are more robust to outliers but are unable to detect faster dynamics. Real datasets contain unknown temporal structures, and each scan may require a different window length. Strategies exist to determine the window length a priori or to circumvent the choice of window length (Nielsen et al., 2017; Vergara et al., 2019; Yaesoubi et al., 2018), such as choosing a window length to avoid $1/\omega$ exceeding the Nyquist frequency (Zalesky et al., 2014). However, a consensus strategy for selecting this hyperparameter is lacking. In this study, we posited that good methods for inferring TVFC will also be able to predict TVFC at unseen time points. We implemented a data-driven approach for optimal a priori window length selection using cross-validation (Fig. 1). Evaluation data points were held out one at a time from the time series as a validation set. The average log likelihood of these observations under a zero-mean multivariate Gaussian distribution was then computed as

$$\mathcal{L}(\omega) = \frac{1}{N_e} \sum_{n=1}^{N_e} \log \mathcal{N}\left(y_n \mid 0, \hat{\Sigma}_n(\omega)\right),$$

where $1 \leq n \leq N_e$ indexes an evaluation data point, $N_e$ denotes the number of evaluation points (equal to the total number of time steps $N$ minus the maximum proposal window length $\omega_{max}$), and $\hat{\Sigma}_n(\omega)$ is the covariance matrix estimate for all observations in the surrounding window of length $\omega$ (*excluding* the evaluation data point). This procedure was repeated for a range of proposed window lengths. The optimal window length $\omega^*$ maximizes the log likelihood:

$$\omega^* = \underset{\omega_{min} \,\leq\, \omega \,\leq\, \omega_{max}}{\arg\max}\; \mathcal{L}(\omega).$$

The minimum and maximum window lengths to explore ($\omega_{min}$ and $\omega_{max}$) were set to 20 (Leonardi & Van De Ville, 2015) and 180 seconds, respectively. For example, assuming a repetition time (TR) of 2 seconds, a window would include 10–90 data points. We empirically verified that this approach consistently outperformed hard-coded window lengths (see Supplementary Materials Sections S.1 and S.7).
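The selection procedure can be sketched as follows, in a minimal NumPy/SciPy version under simplifying assumptions (rectangular window, no filtering or padding, a margin of $\omega_{max}$ at both ends; `select_window_length` is a hypothetical helper, not the original implementation):

```python
import numpy as np
from scipy.stats import multivariate_normal

def select_window_length(y, window_lengths):
    """Sketch of cross-validated window-length selection: for each proposed
    length, estimate the covariance around each evaluation point (excluding
    the point itself) and score the held-out observation under a zero-mean
    Gaussian; return the length with the highest average log likelihood."""
    n, d = y.shape
    w_max = max(window_lengths)
    best_w, best_ll = None, -np.inf
    for w in window_lengths:
        half = w // 2
        lls = []
        for t in range(w_max, n - w_max):  # margin avoids edge effects
            idx = [i for i in range(t - half, t + half + 1) if i != t]
            window = y[idx]
            cov = window.T @ window / len(idx) + 1e-6 * np.eye(d)  # jitter
            lls.append(multivariate_normal.logpdf(y[t], mean=np.zeros(d), cov=cov))
        if np.mean(lls) > best_ll:
            best_w, best_ll = w, np.mean(lls)
    return best_w
```
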

#### 2.1.3 MGARCH: Dynamic conditional correlation (DCC)

The MGARCH framework, which is popular in both econometrics and neuroimaging, describes a family of models that considers a zero-mean vector stochastic process $y_n \in \mathbb{R}^D$ with a time-varying covariance structure:

$$y_n = \Sigma_n^{1/2} \eta_n,$$

where $\eta_n$ is an i.i.d. white noise vector process. The variances of the individual time series are assumed to follow a vector autoregression process; therefore, the dynamic covariance process $\Sigma_n$ is conditioned on all observations up until time $n - 1$. In this study, we considered the DCC-GARCH model (Engle, 2002). Specifically, we implemented a first-order DCC(1,1) model as commonly used in FC studies (Choe et al., 2017; Lindquist et al., 2014). Parameter inference starts by fitting a univariate GARCH model (Bollerslev, 1986; Engle, 1982) to each time series individually, for which we used a first-order GARCH(1,1) process, meaning that each time step considers only the GARCH and ARCH terms from the immediately preceding time step.

When estimating the covariance structure between a set of $D > 2$ time series, we could either model time series jointly or loop over all pairs of time series, training $D(D-1)/2$ models and thus learning more parameters. Whether different parameters should be learned for each node pair (or *edge*) depends on whether we expect significantly different dynamic characteristics between edges. In FC studies, running DCC in a pairwise manner is common (Choe et al., 2017). However, DCC models were designed to model all dimensions jointly. The optimal implementation remains unclear. Therefore, we included both: throughout this paper, “DCC-J” indicates a (multivariate) jointly trained model and “DCC-BL” refers to bivariate loop (pairwise) training.
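The bookkeeping of the bivariate loop can be illustrated as follows; the GARCH fitting itself (handled by a statistics package in practice) is omitted, and `edge_pairs` is a hypothetical helper:

```python
from itertools import combinations

def edge_pairs(d):
    """Enumerate the D(D-1)/2 node pairs (edges) over which a
    bivariate-loop model such as DCC-BL trains one model each."""
    return list(combinations(range(d), 2))
```
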

### 2.2 The Wishart process

#### 2.2.1 Model construction

We constructed the WP as a single statistical model that describes the time-varying covariance structures between time series. To start, the Wishart distribution defines a probability density function over positive definite matrices $\Sigma$:

$$p(\Sigma) = \frac{1}{Z} \, |\Sigma|^{(\nu - D - 1)/2} \exp\left(-\frac{1}{2} \mathrm{tr}\left(V^{-1} \Sigma\right)\right),$$

where $V$ is a $D \times D$ positive definite scale matrix, $\nu \geq D$ is the number of degrees of freedom, and the normalization constant is $Z = 2^{\nu D / 2} |V|^{\nu / 2} \Gamma_D(\nu / 2)$, where $\Gamma_D(\cdot)$ is the multivariate gamma function. This distribution is a multivariate extension of the chi-squared distribution and is used extensively in the estimation of covariance matrices in multivariate statistics. Crucially, we can construct a Wishart-distributed random (matrix-valued) variable from a collection of i.i.d. zero-mean multivariate Gaussian random variables. The sum of the outer products of such variables is Wishart distributed:

$$\Sigma = \sum_{i=1}^{\nu} u_i u_i^\top \sim \mathcal{W}_D(V, \nu),$$

where $u_i$ are i.i.d. $\mathcal{N}(0, V)$-distributed, $D$-dimensional random variables, and $\mathcal{W}_D(V, \nu)$ denotes a Wishart distribution with a scale matrix $V$ and $\nu$ degrees of freedom.
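This sum-of-outer-products construction is easy to verify numerically. The sketch below (illustrative values only, not the original analysis code) draws $\nu$ i.i.d. $\mathcal{N}(0, V)$ vectors to form one Wishart sample, then checks that the empirical mean of many such draws approaches $\nu V$, the known mean of $\mathcal{W}_D(V, \nu)$:

```python
import numpy as np

rng = np.random.default_rng(0)
D = nu = 3
V = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.3],
              [0.0, 0.3, 1.5]])

# One Wishart draw: the sum of outer products of nu i.i.d. N(0, V) vectors.
u = rng.multivariate_normal(np.zeros(D), V, size=nu)   # (nu, D)
sigma = u.T @ u                                        # sum_i u_i u_i^T

# Empirical check: the mean of W_D(V, nu) draws is nu * V.
n_samples = 20000
u_all = rng.multivariate_normal(np.zeros(D), V, size=(n_samples, nu))
empirical_mean = np.einsum('sni,snj->sij', u_all, u_all).mean(axis=0)
```
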

The WP is a continuous-time stochastic process on symmetric positive definite matrices, which makes it suitable for modeling covariance matrices (Bru, 1991; Wilson & Ghahramani, 2011). We let $Y := (y_n, 1 \leq n \leq N)$ denote a sequence of BOLD measurements in $\mathbb{R}^D$, where $N$ is the number of scan volumes and $D$ is the number of time series. The index locations $X := (x_n, 1 \leq n \leq N)$ in $\mathbb{R}$ denote the times at which the measurements $y_n$ are observed, considered on a fixed interval of $[0, 1]$ during training and prediction. These index locations were univariate in our case, but could be multivariate if other covariates, such as head motion, were included. We chose the conditional likelihood of observations to be a multivariate Gaussian:

$$y_n \sim \mathcal{N}(\mu_n, \Sigma_n),$$

where $\Sigma_n$ is a $D \times D$ covariance matrix and $\mu_n = 0$ in our context. The random and unobserved process $\Sigma_1, \Sigma_2, \ldots, \Sigma_N$ constitutes TVFC. Analogous to constructing a Wishart-distributed random variable from a collection of Gaussian random variables, we constructed the Wishart *process* from i.i.d. collections of Gaussian *processes*. Let

$$f_{d,k} \sim \mathcal{GP}\left(0, K(\cdot, \cdot; \theta)\right)$$

for $1 \leq d \leq D$ and $1 \leq k \leq \nu$, where $K(\cdot, \cdot; \theta)$ denotes a kernel function. Let $F_{n,d,k} := f_{d,k}(x_n)$ denote the evaluation of the GP at $x_n$. We wrote $F_n$ for the aggregate $D \times \nu$ matrix $(F_{n,d,k})_{1 \leq d \leq D,\, 1 \leq k \leq \nu}$, which has entries $F_{n,d,k}$ indexed by $d$ and $k$. Analogous to Eq. (2.6), we construct

$$\Sigma_n = A F_n F_n^\top A^\top$$

as a Wishart-distributed random matrix at time $1 \leq n \leq N$. This construction allows us to query covariance matrices continuously throughout an fMRI scan, because the underlying GPs can be queried at any time. It also handles missing data naturally because data are not expected in a grid-like pattern. Matrix $A \in \mathbb{R}^{D \times D}$ is restricted such that the scale matrix $AA^\top$ is positive definite, and it is trained as part of the inference routine. This scale matrix provides the expected value of $\Sigma_n$ for all $n$. The log likelihood from Eq. (2.7) with this construction of $\Sigma_n$ plugged in is

$$\log p(Y \mid F) = \sum_{n=1}^{N} \log \mathcal{N}\left(y_n \mid 0,\; A F_n F_n^\top A^\top\right).$$

Heaukulani and van der Wilk (2019) suggested adding white noise for robust inference, which we validated empirically. The covariance matrix construction is updated to

$$\Sigma_n = A F_n F_n^\top A^\top + \Lambda,$$

with additive noise matrix $\Lambda$: a diagonal $D \times D$ matrix with positive entries, whose values are trained with the rest of the parameters (off-diagonal values remain zero).
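A minimal numerical sketch of this construction (with arbitrary stand-in values for $F_n$, $A$, and $\Lambda$, not the trained model) shows that the resulting matrix is a valid covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
D = nu = 4
F_n = rng.standard_normal((D, nu))   # stand-in for the GP draws at time n
A = np.tril(rng.standard_normal((D, D)))
np.fill_diagonal(A, np.abs(np.diag(A)) + 0.1)  # lower-triangular A => A A^T positive definite
lam = np.full(D, 1e-3)               # positive diagonal entries of Lambda

# Sigma_n = A F_n F_n^T A^T + Lambda
sigma_n = A @ F_n @ F_n.T @ A.T + np.diag(lam)
```
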

#### 2.2.2 Bayesian inference

After defining our model, we needed a procedure to fit it. Doing so requires computing $p(Y) = \int p(Y \mid F)\, p(F)\, dF$ in the posterior $p(F \mid Y) = p(Y, F) / p(Y)$, which is intractable. Therefore, we used variational inference, a technique that approximates the probability density through optimization (Blei et al., 2017). This contrasts with Markov chain Monte Carlo methods, which are often employed in such situations. Variational inference posits a family of distributions $q(F)$ over the latent variables and then finds the member of that family that is closest in terms of Kullback-Leibler (KL) divergence to the target distribution (the true posterior). Importantly, model parameters and covariance structure aspects are learned directly from the data instead of being set as fixed hyperparameters representing the practitioner's inductive biases.

To make our WP models more computationally viable, we used *sparse* underlying variational GPs as WP building blocks. The sparse variant introduces the concept of inducing points: learned auxiliary data points (Bauer et al., 2016; Titsias, 2009). The number of inducing points at locations $Z := (z_m, 1 \leq m \leq M)$ is smaller than the number of observations, that is, $M < N$. We write $U_{m,d,k} := f_{d,k}(z_m)$ for sparse GP evaluations. We chose a fully factorized Gaussian prior over $U_{d,k}$. We collectively denote $U_{d,k} := (U_{m,d,k}, 1 \leq m \leq M)$ and chose its variational approximation as

$$q(U_{d,k}) = \mathcal{N}\left(\mu_{d,k}, S_{d,k}\right)$$

for variational parameters $\mu_{d,k} \in \mathbb{R}^M$ and $S_{d,k} \in \mathbb{R}^{M \times M}$, a real, symmetric, positive definite matrix. For inference, we iteratively maximized the evidence lower bound (ELBO)

$$\mathcal{L}_{\mathrm{ELBO}} = \mathbb{E}_{q(F)}\left[\log p(Y \mid F)\right] - \sum_{d=1}^{D} \sum_{k=1}^{\nu} \mathrm{KL}\left(q(U_{d,k}) \,\Vert\, p(U_{d,k})\right)$$

as our objective function using gradient descent, where (approximate) gradients were computed based on samples (Monte Carlo estimates) of our objective function. Intuitively, the first ELBO term is a likelihood (data fit) term and the second is a regularizing term.

#### 2.2.3 Implementation

The properties of the underlying GPs are determined by a kernel function, for which we used a (stationary, isotropic) Matérn 5/2 kernel

$$K(x, x'; \theta) = \sigma^2 \left(1 + \sqrt{5}\, r + \frac{5}{3} r^2\right) \exp\left(-\sqrt{5}\, r\right),$$

with $r = \Vert x - x' \Vert / l$, where $l$ and $\sigma^2$ are the kernel length scale and variance parameters, respectively. The length scale parameter is easily interpretable: it defines how quickly the GP (and, therefore, TVFC as a WP) can change over time, by determining how much weight the model places on observations farther from the evaluation index location. Therefore, we hypothesized that this parameter may be analogous to the window length in SW (see Supplementary Materials Section S.7).
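A direct transcription of this kernel (with hypothetical parameter defaults; not the library implementation used in the study) is:

```python
import numpy as np

def matern52(x, x_prime, lengthscale=1.0, variance=1.0):
    """Stationary, isotropic Matern 5/2 kernel:
    k(r) = sigma^2 (1 + sqrt(5) r + 5/3 r^2) exp(-sqrt(5) r),
    with r = |x - x'| / l for scalar inputs."""
    r = np.abs(x - x_prime) / lengthscale
    return variance * (1.0 + np.sqrt(5.0) * r + 5.0 / 3.0 * r ** 2) * np.exp(-np.sqrt(5.0) * r)
```
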

Parameters were updated using gradient descent with the Adam optimizer (Kingma & Ba, 2015). The estimated parameters include the variational parameters, kernel hyperparameters, the lower Cholesky decomposition $A$ of the scale matrix, and the diagonal values of the additive noise matrix $\Lambda $. We set $\nu =D$ and $M=200$. All figures include the 95% confidence interval of the mean estimate plus/minus two standard deviations, based on 3,000 samples from the posterior. Although this model is complex in its description, and thus introduces a complexity penalty, its *implementation* is lean and robust, relying heavily on open-source libraries. Further, while there is no theoretical limit to the number of time series modeled, the computational demand for training this model on high-dimensional data (over 100 time series) is high, taking days or even weeks for a single scan. In practice, this means researchers working with fMRI may need access to sizeable compute clusters (e.g., parallelizing training bivariate models for each edge) or GPUs. Alternatively, they may run a dimensionality reduction algorithm before fitting a WP to their data.

### 2.3 Benchmarking TVFC estimation methods

#### 2.3.1 Simulation benchmarks

We considered simulated data from seven underlying deterministic synthetic covariance structures (Fig. 2): **null** covariance (time series are uncorrelated), static **constant** covariance of $\sigma_{ij} = 0.8$, two **periodic** covariance structures (a **slow** and a **fast** oscillating sine wave defined by one and three periods, respectively) that model transient changes in coupling, **stepwise** covariance that models two large change points in covariance, a covariance structure that mimics a series of **state transitions** (Thompson et al., 2018), and a covariance structure inspired by a **boxcar** external task design. For each of these, we sampled the observations $y_n \in \mathbb{R}^D$ at time steps $1 \leq n \leq N$ from a zero-mean Gaussian distribution:

$$y_n \sim \mathcal{N}\left(0, \Sigma_n\right)$$

for the respective covariance matrices. We tested on pairwise (i.e., bivariate) and trivariate datasets. The covariance structure for generating bivariate data was

$$\Sigma_n = \begin{pmatrix} 1 & \sigma(n) \\ \sigma(n) & 1 \end{pmatrix},$$

where the covariance term $\sigma(n)$ varied with time according to the synthetic covariance structures (Fig. 2). For the trivariate data, both a *dense*, fully covarying covariance structure

$$\Sigma_n = \begin{pmatrix} 1 & \sigma(n) & \sigma(n) \\ \sigma(n) & 1 & \sigma(n) \\ \sigma(n) & \sigma(n) & 1 \end{pmatrix}$$

and a *sparse* version where only the first two time series are correlated

$$\Sigma_n = \begin{pmatrix} 1 & \sigma(n) & 0 \\ \sigma(n) & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$

were implemented. The range of $\sigma(n)$ was updated to $[-0.5, 1]$ to maintain positive semidefiniteness of the covariance matrices in the dense case. The time series were individually normalized to have a mean of zero and a unit standard deviation. Model performance was averaged across $T = 200$ trials. We report the results for $N = 400$ scanning volumes. The results for higher dimensions can be found in Supplementary Materials Section S.2.
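The choice of $[-0.5, 1]$ can be verified numerically: the dense equicorrelation matrix has eigenvalues $1 + 2\sigma$ and (twice) $1 - \sigma$, so it is positive semidefinite exactly when $-0.5 \leq \sigma \leq 1$. A small check, assuming the dense trivariate structure described above:

```python
import numpy as np

def dense_cov(sigma):
    """Dense trivariate covariance with a shared off-diagonal term sigma(n)."""
    return np.array([[1.0, sigma, sigma],
                     [sigma, 1.0, sigma],
                     [sigma, sigma, 1.0]])

def is_psd(mat, tol=1e-9):
    """Positive semidefiniteness check via the eigenvalue spectrum."""
    return bool(np.all(np.linalg.eigvalsh(mat) >= -tol))
```
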

Noise with a signal-to-noise ratio (SNR) of 2 was added to these time series. The noisy $y_n^*$ was constructed as a linear mix of the noiseless signal $y_n$ (Eq. 2.15) and an (independent) noise time series $\epsilon_n$:

$$y_n^* = \sqrt{\alpha}\, y_n + \sqrt{1 - \alpha}\, \epsilon_n,$$

where $0 < \alpha < 1$ and the SNR is equal to $\alpha / (1 - \alpha)$. Noisy time series more realistically mimic fMRI data, especially when the noise exhibits properties found in fMRI data (Deco et al., 2009; Welvaert & Rosseel, 2013). Therefore, we used data from the Human Connectome Project (HCP) (Elam et al., 2021; Smith, Beckmann, et al., 2013) as the noise source $\epsilon_n$. A detailed description of the HCP data can be found in Appendix A.2. The time series obtained from the HCP were representative of an independent component analysis (ICA) component of whole-brain activity. For each $D$-dimensional set of simulated time series (i.e., individually for each covariance structure and trial), we selected the time series from $D$ randomly selected ICA components from randomly selected scans of *different* HCP subjects. As they were obtained from different subjects, these noise time series could be assumed to be uncorrelated. This ensured that no additional and adversarial (i.e., other than the predetermined “ground truth”) covariance structure was introduced. These noise time series were preferred over other types of noise, such as white noise, because they preserve realistic fMRI data characteristics such as spatial and temporal autocorrelation (Arbabshirani et al., 2014; Honari et al., 2019; Keilholz et al., 2017). The addition of noise affects the ground-truth variances and covariances, which were updated accordingly (Appendix A.1). The root mean square error (RMSE) between the estimated and ground-truth correlations served as the performance metric (Lindquist et al., 2014; Wilson & Ghahramani, 2011). For the trivariate cases, we computed the RMSE across all elements of the full correlation matrices.
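The mixing step can be sketched as follows, assuming unit-variance signal and noise so that the variance ratio equals $\alpha / (1 - \alpha)$ (a common convention; `mix_noise` is a hypothetical helper, not the original code):

```python
import numpy as np

def mix_noise(y, eps, snr=2.0):
    """Mix a unit-variance signal y with independent unit-variance noise eps
    such that the resulting variance ratio equals snr = alpha / (1 - alpha)."""
    alpha = snr / (1.0 + snr)  # solve snr = alpha / (1 - alpha) for alpha
    return np.sqrt(alpha) * y + np.sqrt(1.0 - alpha) * eps

rng = np.random.default_rng(0)
y = rng.standard_normal(100_000)
eps = rng.standard_normal(100_000)
y_star = mix_noise(y, eps, snr=2.0)  # total variance remains (close to) 1
```
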

#### 2.3.2 Resting-state fMRI benchmarks

We studied the prediction of subject measures, test–retest robustness, and a brain state analysis using HCP rs-fMRI data. Each activity time series in this dataset represented an ICA component of whole-brain activity. For interpretable features, we computed edgewise (i.e., independently for each connection between nodes) summary measures of TVFC estimates across time: the mean (analogous to sFC estimates; considered the connection “strength”)

$$\bar{\Sigma}_{i,j} = \frac{1}{N} \sum_{n=1}^{N} \Sigma_{n,i,j},$$

the variance (Choe et al., 2017; Hutchison et al., 2013), sometimes interpreted as connectivity “stability” or “flexibility” (Allen et al., 2014)

$$\mathrm{var}\left(\Sigma_{i,j}\right) = \frac{1}{N} \sum_{n=1}^{N} \left(\Sigma_{n,i,j} - \bar{\Sigma}_{i,j}\right)^2,$$

and we proposed an additional summary measure, the TVFC *rate-of-change*, defined as the mean absolute relative difference between subsequent time steps

$$\mathrm{RoC}_{i,j} = \frac{1}{N - 1} \sum_{n=2}^{N} \left| \frac{\Sigma_{n,i,j} - \Sigma_{n-1,i,j}}{\Sigma_{n-1,i,j}} \right|,$$

for the edge between nodes $1 \leq i \leq D$ and $1 \leq j \leq D$. The latter summary measure captured how smoothly the estimated TVFC changed over time.
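The three summary measures can be computed in a few lines (a NumPy sketch over an $(N, D, D)$ array of correlation matrices; the rate-of-change assumes nonzero estimates, and `tvfc_summary_measures` is a hypothetical helper name):

```python
import numpy as np

def tvfc_summary_measures(corr):
    """Edgewise summary measures of a TVFC estimate.
    corr: (N, D, D) array of per-time-step correlation matrices.
    Returns the mean, variance, and rate-of-change, each of shape (D, D)."""
    mean = corr.mean(axis=0)
    var = corr.var(axis=0)
    # mean absolute relative difference between subsequent time steps
    rate_of_change = np.mean(np.abs((corr[1:] - corr[:-1]) / corr[:-1]), axis=0)
    return mean, var, rate_of_change
```
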

The estimated TVFC was related to subject measures using a morphometricity analysis (Sabuncu et al., 2016); however, instead of anatomical variation between subjects, we studied variation in TVFC. We posited the linear mixed effects model

$$y = X\beta + a + \epsilon,$$

where $y$ is a column vector of length equal to the number of subjects studied, containing some quantitative subject measure, $X$ is a design matrix of nuisance variables (covariates) weighted by $\beta$, $a \sim \mathcal{N}(0, \sigma_a^2 K_a)$ is a random effects vector, and $\epsilon \sim \mathcal{N}(0, \sigma_e^2 I)$ is a noise vector. Entries in the symmetric matrix $K_a$ encoded how globally “similar” the TVFC estimates between the respective subjects were. This similarity was defined using the three summary measures (including two dynamic measures) of estimated whole-brain TVFC, taking the lower triangular values as a vector of size $D(D-1)/2$. The similarity “distance” between subject vectors was defined by the (nonlinear) Gaussian kernel $k(x, x') = \exp\left(-(x - x')^2\right)$. Morphometricity was computed as the proportion of phenotypic variation that could be explained by intersubject variation:

$$m^2 = \frac{\sigma_a^2}{\sigma_y^2},$$

with $\sigma_y^2$ the phenotypic variance. Parameters $\sigma_a^2$ and $\sigma_e^2$ were estimated using the restricted maximum likelihood method (Harville, 1977). The analysis was repeated for 15 subject measures, including age, gender, and cognitive task scores (Kong et al., 2019; Li, Kong, et al., 2019). Age and gender were included as nuisance variables (when they were not predicted).

Further, we compared the robustness of the TVFC estimates across scans from the same subject (Abrol et al., 2016, 2017; Choe et al., 2017; Elliott et al., 2020; Noble et al., 2019). The test–retest reliability of TVFC summary measures was computed as intraclass correlation coefficients (ICCs), individually for each edge (Chen et al., 2018; Shrout & Fleiss, 1979):

$$\mathrm{ICC} = \frac{\sigma_X^2}{\sigma_X^2 + \sigma_U^2},$$

with $\sigma_X^2$ the between-subject and $\sigma_U^2$ the within-subject variance. We computed the image intraclass correlation coefficient (I2C2) as a whole-brain reliability assessment (Shou et al., 2013):

$$\mathrm{I2C2} = \frac{\mathrm{tr}(K_X)}{\mathrm{tr}(K_X) + \mathrm{tr}(K_U)},$$

with $K_X \in \mathbb{R}^{D(D-1)/2 \times D(D-1)/2}$ the between-subject covariance of the “true” FC and $K_U \in \mathbb{R}^{D(D-1)/2 \times D(D-1)/2}$ the covariance of the measurement error vector (the difference between the measurement and the “true” FC). These matrices were computed using method of moments estimators (Carroll et al., 2006; Shou et al., 2013). Values for both ICC and I2C2 lie between 0 and 1, where 0 represents complete retest independence and 1 represents perfect measurement reliability.
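For the edgewise reliability, a one-way random-effects ICC(1,1) estimator can be sketched as follows (one standard variant of the between-subject variance share $\sigma_X^2 / (\sigma_X^2 + \sigma_U^2)$; the paper's exact estimator may differ):

```python
import numpy as np

def icc_1_1(x):
    """One-way random-effects ICC(1,1) from a (subjects, sessions) array,
    using ANOVA mean squares as variance-component estimates."""
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)
    msb = k * np.sum((row_means - grand) ** 2) / (n - 1)         # between subjects
    msw = np.sum((x - row_means[:, None]) ** 2) / (n * (k - 1))  # within subjects
    return (msb - msw) / (msb + (k - 1) * msw)
```
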

Finally, TVFC estimates are commonly characterized as switching between recurring FC correlation matrices or “brain states” (Keilholz et al., 2017; Kringelbach & Deco, 2020). At any point in time, a subject’s functional organization is defined by one of $k$ brain states, between which the subject transitions during the scanning session. In this study, for each session separately, all estimated correlation matrices for all $N$ time steps and all subjects were concatenated, and brain states were extracted using the $k$-means algorithm (Lloyd, 1982). We extracted $k = 3$ brain states, following Choe et al. (2017). We also computed the number of brain state change points (i.e., switches between states within a scan). The brain state analysis was not considered a benchmark for method selection, because it was unclear how to determine the superior TVFC estimation method from this analysis.
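The state extraction step can be sketched with a small Lloyd's $k$-means implementation over vectorized correlation matrices (a self-contained illustration with deterministic farthest-point initialization, not the original pipeline):

```python
import numpy as np

def extract_brain_states(vecs, k=3, n_iter=100):
    """Lloyd's k-means on (N, P) vectorized correlation matrices: each
    scanning volume is assigned to one of k canonical 'brain states'."""
    # farthest-point initialization keeps this sketch deterministic
    centroids = [vecs[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(vecs - c, axis=1) for c in centroids], axis=0)
        centroids.append(vecs[d.argmax()])
    centroids = np.array(centroids)
    for _ in range(n_iter):
        labels = np.linalg.norm(vecs[:, None, :] - centroids[None], axis=2).argmin(axis=1)
        new_centroids = np.array([vecs[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

def count_change_points(labels):
    """Number of switches between brain states within a scan."""
    return int(np.sum(labels[1:] != labels[:-1]))
```
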

#### 2.3.3 Task-based fMRI benchmarks

We assessed which method’s TVFC estimates could best predict an external task stimulus of brain activity, using an elegantly simple visual task experiment from the Rockland dataset (Nooner et al., 2012). More details on this dataset and the task paradigm can be found in Appendix A.3. Predicting the presence or absence of the visual stimulus was framed using a generalized linear model (GLM)

$$y_g = X^\top \beta + e_g,$$

where $y_g \in \mathbb{R}^N$ is the estimated TVFC for a given method, run independently for all $1 \leq g \leq G$ edges of interest (i.e., the connectivity between the primary visual cortex (V1) and five other ROIs: V2, V3, V4, mPFC, and M1), $X \in \mathbb{R}^{R \times N}$ is the design matrix with $R = 3$ regressors (a single experimental regressor and two nuisance regressors) weighted by $\beta \in \mathbb{R}^R$, and $e_g \in \mathbb{R}^N$ are the error terms. The experimental regressor was the boxcar (block design) model of the visual stimulus convolved with a hemodynamic response function (HRF) (Glover, 1999). The two nuisance regressors were first-order polynomials, where the polynomial (drift) order was determined as the TR multiplied by $N / 150$ (Worsley et al., 2002). We inferred $\beta$ from the observations, using ordinary least squares (OLS) to minimize the error terms (Worsley & Friston, 1995). The magnitudes of the $\beta$ parameters were then used to determine which TVFC estimates best captured the presence of the visual stimulus.
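The OLS fit of such a GLM can be sketched as follows. All regressor values below are synthetic stand-ins (the HRF convolution is omitted for brevity, and the block length, drift, and coefficients are illustrative, not the Rockland design):

```python
import numpy as np

# Design matrix X (R x N): one experimental regressor plus two nuisance terms.
N = 200
boxcar = ((np.arange(N) // 20) % 2).astype(float)  # on/off stimulus blocks
t = np.linspace(-1.0, 1.0, N)
X = np.vstack([boxcar, np.ones(N), t])             # (R=3, N)

# Synthetic "TVFC" series that follows the stimulus plus a drift and noise.
rng = np.random.default_rng(2)
y_g = 0.8 * boxcar + 0.1 + 0.05 * t + 0.01 * rng.standard_normal(N)

# OLS solution of y_g = X^T beta + e_g, minimizing the error terms.
beta, *_ = np.linalg.lstsq(X.T, y_g, rcond=None)
```
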

#### 2.3.4 The imputation benchmark

Our proposed imputation benchmark is a data-driven benchmark for comparing TVFC estimation methods that uses only the observed time series. Again, we posited that good TVFC estimation methods can predict TVFC at unseen time points. Training and test sets were defined as data points taken from the time series under a leave-every-other-out (LEOO) split, resulting in equally sized training and test sets (Fig. 3). Each candidate method was tasked with estimating the covariance structure at the test locations using only observations from the training set. Except for the WP, for which interpolation follows naturally, the test-location covariance matrix estimates were elementwise and linearly interpolated between the estimates at the enclosing training locations. The performance metric was the mean log likelihood of the test set observations under a zero-mean Gaussian distribution defined by the estimated covariance matrices. We applied this benchmark to all datasets included in this study.
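The LEOO split and the elementwise linear interpolation used for the non-WP baselines can be sketched as follows (hypothetical helper names, not the original implementation):

```python
import numpy as np

def leoo_split(n):
    """Leave-every-other-out split: even-indexed volumes form the
    training set and odd-indexed volumes form the test set."""
    idx = np.arange(n)
    return idx[::2], idx[1::2]

def interpolate_covariances(train_idx, train_covs, test_idx):
    """Elementwise linear interpolation of (N_train, D, D) covariance
    estimates at held-out test locations."""
    n_test, d = len(test_idx), train_covs.shape[1]
    out = np.empty((n_test, d, d))
    for i in range(d):
        for j in range(d):
            out[:, i, j] = np.interp(test_idx, train_idx, train_covs[:, i, j])
    return out
```
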

## 3 Results

### 3.1 Simulation benchmarks

First, we investigated the ability of the TVFC estimation methods to recover known covariance structures from simulated multivariate time series. Simulation benchmarks address a significant obstacle in assessing TVFC in fMRI data: the lack of ground-truth knowledge of the covariance structure. We tested against a battery of synthetic covariance structures that could plausibly be encountered in an fMRI scan (Fig. 2): **null** covariance, **constant** static covariance, **periodic (slow and fast)** covariance structures of an oscillating sine wave (slow, low frequency—one period; fast, high frequency—three periods), **stepwise** covariance that models two large change points, a covariance structure that mimics a series of **state transitions** (Thompson et al., 2018), and a covariance structure inspired by a blocked-design **boxcar** tb-fMRI study that mimics the presence and absence of an external stimulus, as modeled by convolving a square wave stimulus time series with an HRF.

To investigate the qualitative differences between the estimation methods, the bivariate TVFC estimates for a single trial for all covariance structures are shown in Figure 4. We quantified the performance by computing the RMSE between the estimated and ground-truth correlation terms (i.e., the off-diagonal term; Figure 5, top row). Differences between the methods were statistically tested using two-tailed *t*-tests with Bonferroni correction. Of note is the spurious covariance detected by SW-CV in the static **null** and **constant** cases (Hindriks et al., 2016; Lindquist et al., 2014). The WP and DCC methods performed comparably to sFC, with the SW-CV performing worse than the WP ($t=\u221222.00$, $p=9.41\xd710\u221271$, and $t=\u22129.67$, $p=5.3\xd710\u221220$, for the null and constant covariance structures, respectively). For the two **periodic** (**slow** and **fast**) covariance structures, the smooth nature of the WP was particularly suitable, in contrast to the DCC estimates, where large, spurious jumps were observed ($t=\u221225.87$, $p=2.9\xd710\u221287$, and $t=\u221234.76$, $p=1.2\xd710\u2212122$, for the slow and fast periodic covariance structures, respectively). However, the WP estimates were too smooth to capture the sudden changes in the **stepwise** covariance structure. The SW-CV estimates suffered from a similar problem, where static parts require longer windows, but the change points require shorter windows. While DCC could model the change points, it returned spurious covariance estimates in the static parts of the time series. All methods failed to reconstruct the **state transition** covariance structure, as evidenced by the fact that none of the methods were able to significantly outperform the sFC estimate (Fig. 5, top row). This was interpreted as none of the methods was sensitive enough to detect subtle yet sudden changes in the covariance structure. 
All methods could model the **boxcar** covariance structure; however, performance was reduced when noise was added. The SW-CV performed better than the WP on this dynamic covariance structure ($t=13.77$, $p=1.4\times10^{-35}$).
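The RMSE criterion above amounts to comparing an estimated off-diagonal correlation time series against the ground truth, element by element. A minimal sketch, using a hypothetical 200-volume oscillating ground truth and a static (sFC-style) estimate:

```python
import numpy as np

def tvfc_rmse(estimated_corr: np.ndarray, true_corr: np.ndarray) -> float:
    """RMSE between estimated and ground-truth correlation time series."""
    return float(np.sqrt(np.mean((estimated_corr - true_corr) ** 2)))

# Example: a static estimate against a slowly oscillating ground truth.
t = np.linspace(0.0, 1.0, 200)
true_rho = 0.8 * np.sin(2.0 * np.pi * t)
static_rho = np.full_like(true_rho, true_rho.mean())
print(tvfc_rmse(static_rho, true_rho))  # RMSE of the sFC-style baseline
```

A static estimate necessarily incurs a large RMSE on a dynamic structure, which is the effect Figure 5 quantifies.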

We undertook further benchmarking in the trivariate case (Fig. 5, middle and bottom row). Both a dense version, where all time series edges share the same correlation structure, and a sparse case, where the third time series is uncorrelated with the first two time series, were implemented. Competitive performance, measured as the RMSE between all elements of the estimated and ground-truth correlation matrices, was observed for WP estimates. Similar to the bivariate case, SW-CV estimates returned a time-varying structure when the ground-truth covariance was static (e.g., comparing with the WP on the null covariance structure; $t=-5.04$, $p=.003$, and $t=-5.06$, $p=.003$, for the dense and sparse cases, respectively).

The comparison between the dense and sparse cases was of interest. Whereas the pairwise DCC-BL and the jointly trained DCC-J performed similarly for the *dense* case, DCC-BL performed better in the *sparse* case for the covariance structures with time-varying structure (e.g., comparing these two on the slow periodic covariance structure; $t=-0.14$, $p=1.00$, and $t=9.66$, $p=1.5\times10^{-8}$, for the dense and sparse cases, respectively). The SW-CV estimates performed worse in the sparse case relative to the WP (e.g., on the fast periodic covariance structure; $t=-1.46$, $p=1.00$, and $t=-7.76$, $p=3.8\times10^{-7}$, for the dense and sparse cases, respectively). This suggests that SW-CV may also benefit from learning different parameters (window lengths in this case) for each edge, instead of a single globally optimal window length.

### 3.2 Resting-state fMRI benchmarks

Under real-world conditions, the ground-truth covariance structure between brain components is unknown, making the assessment of TVFC estimation methods difficult. Next, we framed indirect prediction tasks as proxies to shed light on which method’s TVFC estimates were closer to the ground truth. We analyzed the time-varying covariance between ICA-identified networks for 812 subjects from the HCP. The estimates for several edges (i.e., covariances between two network time series) of a single scan of a single subject are shown in Figure 6. These estimates varied radically among the methods considered, underscoring the importance of the choice of estimation method. Estimates for training DCC in a joint (“-J”) or pairwise (“-BL”) manner strongly overlapped; therefore, the results for DCC-BL are omitted. The characteristics and differences in the estimates could be reasonably captured by three TVFC summary measures: mean, variance, and rate-of-change (Fig. 7). To illustrate, DCC-J estimates changed rapidly within a narrow range of correlation estimates, which was captured by the low variance and high rate-of-change.
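The three summary measures can be computed directly from an edge's estimated correlation time series. A sketch, in which the rate-of-change is taken as the mean absolute first difference between consecutive volumes (one plausible definition; the exact formulation used in the analysis may differ):

```python
import numpy as np

def summary_measures(corr_ts: np.ndarray) -> dict:
    """Mean, variance, and rate-of-change of one edge's estimated
    correlation time series (shape: number of volumes)."""
    return {
        "mean": float(np.mean(corr_ts)),
        "variance": float(np.var(corr_ts)),
        "rate_of_change": float(np.mean(np.abs(np.diff(corr_ts)))),
    }
```

Under these definitions, a rapidly fluctuating estimate confined to a narrow band (as described for DCC-J) yields a low variance but a high rate-of-change.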

TVFC estimates were used to predict subject measures using a morphometricity analysis (Fig. 8) (Sabuncu et al., 2016). Strong heterogeneity was observed across the estimation methods and the predictive power of subject measures. For example, for the subject measure of Sustained Attention Sensitivity, none of the variance across all TVFC summary measures and estimation methods could be explained, indicating that the signatures of this task may not be captured by FC. Subject age was better predicted by the mean and variance of TVFC estimates than by the rate-of-change. To illustrate, Hutchison and Morton (2015) found that FC variability over the length of a scan correlates positively with age. We found that the WP and SW-CV methods, but not DCC-J, were predictive of age. The only significant differences in the average variance explained across all subject measures (the rightmost column in Fig. 8) were the comparison between the mean of DCC-J estimates and the sFC estimates ($t=-2.17$, $p=3.8\times10^{-2}$) and for the variance summary measure between WP and DCC-J ($t=4.17$, $p=2.7\times10^{-4}$) and between DCC-J and SW-CV ($t=-5.48$, $p=1.0\times10^{-5}$). This means that the WP and SW-CV methods exhibited similar performance. The complete lack of predictive power of the variance of DCC-J estimates was particularly noteworthy. Another popular way to frame this type of prediction task is as an out-of-sample predictive modeling study (Dhamala et al., 2021; Greene et al., 2018; Smith et al., 2015; Zamani Esfahlani et al., 2022). Such an analysis has been added to Supplementary Materials Section S.4 for comparison. Our findings replicate those of prior studies, and the relative performance of the estimation methods is similar. However, DCC-J performed relatively better in the out-of-sample prediction task.

The test–retest benchmark examined the reliability of TVFC estimates across four scans of the same subjects, separately for each TVFC summary measure. The edgewise ICC scores are shown in Figure 9A. The mean of TVFC estimates was more robust than their dynamic summary measures (variance and rate-of-change). Some edges were consistently more robust than others, irrespective of the estimation method. The robustness of mean TVFC was similar across estimation methods (no significant differences were found), but the variance of DCC-J estimates was more robust compared with the other two methods (*t*-test between the mean ICC score over all edges, $p=4.63\times10^{-17}$ and $p=6.01\times10^{-14}$, respectively, in comparison with the WP and SW-CV scores), replicating Choe et al. (2017). This finding was perhaps unexpected, as the variance of DCC-J estimates was previously found to have no predictive power of subject measures. Further, the robustness of the rate-of-change of the SW-CV estimates was higher than that of the other two methods ($p=9.71\times10^{-5}$ and $p=1.18\times10^{-2}$, in comparison with the WP and DCC-J scores, respectively). The I2C2 scores for whole-brain reliability of the TVFC summary measures were in broad agreement with the ICC scores (Fig. 9B). For perspective, the ICC scores for the learned WP kernel parameters were 0.29 for the kernel variance and 0.48 for the kernel length scales (Eq. 2.14).
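For reference, a minimal one-way random-effects ICC for a subjects-by-sessions score matrix. This is one common ICC variant, shown as a sketch; the edgewise scores reported here may rely on a different formulation (e.g., as implemented in *pingouin*).

```python
import numpy as np

def icc_1(scores: np.ndarray) -> float:
    """One-way random-effects ICC(1) for a (subjects x sessions) array:
    between-subject variance relative to total variance."""
    n, k = scores.shape
    grand = scores.mean()
    subj_means = scores.mean(axis=1)
    # Between-subject and within-subject mean squares.
    msb = k * np.sum((subj_means - grand) ** 2) / (n - 1)
    msw = np.sum((scores - subj_means[:, None]) ** 2) / (n * (k - 1))
    return float((msb - msw) / (msb + (k - 1) * msw))
```

A summary measure that is identical across a subject's four scans but differs between subjects yields an ICC of 1.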

The $k=3$ extracted brain states from the WP, DCC-J, and SW-CV estimates are shown in Supplementary Materials Section S.5. Apart from the extracted brain states, we were interested in the dynamics of the transitions between brain states. The number of brain state change points for each method for each subject is shown in Figure 10. The median number of brain state switches during a scanning session was one, zero, and nine for the WP, DCC-J, and SW-CV methods, respectively. Intuitively, given that SW-CV estimates have a higher rate-of-change (Fig. 7), it estimates more brain state switches than the WP.
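The brain state analysis can be sketched as k-means clustering of the per-volume correlation matrices followed by counting label change points. This is a simplified sketch; the exact clustering configuration used for the reported results may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

def count_state_switches(corr_matrices: np.ndarray, k: int = 3, seed: int = 0) -> int:
    """Cluster per-volume (D x D) correlation matrices into k brain states
    and count the number of state change points over the scan."""
    n = corr_matrices.shape[0]
    flat = corr_matrices.reshape(n, -1)  # vectorize each matrix
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(flat)
    return int(np.sum(labels[1:] != labels[:-1]))
```

Smoother TVFC estimates (e.g., from the WP) produce fewer label switches than estimates with a high rate-of-change, matching the pattern in Figure 10.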

### 3.3 Task-based fMRI benchmarks

We used a tb-fMRI benchmark to investigate whether estimates of the dynamic coupling (TVFC) of V1 with five other brain regions could predict the presence of an external visual stimulus. Similar to the rs-fMRI case, ground-truth knowledge of the true underlying covariance structures was absent. However, in contrast to rs-fMRI, external stimulation can shape the coupling of regional brain activity in a more predictable and controllable manner. We observed (as expected) that the correlation between V1 and the other regions decreased as we moved up and away from the visual cortex hierarchy (Di et al., 2015): connectivity strength with V2 was the highest, followed by V3, V4, and mPFC, and was lowest with M1 (see Supplementary Materials Section S.6).

Next, as we expected that this external stimulus would induce periodic covariance between brain regions, we investigated the predictive utility of each covariance structure estimate for recovering the stimulus waveform. The relationship between the external stimulus and TVFC estimates is plotted in Figure 11, where the trend and offset were removed from the estimates and the estimates were averaged across subjects (Fig. 11A). The extracted $\beta$ parameters from the GLM are shown to quantify this relationship (Fig. 11B). Most of the edges were anticorrelated with the external stimulus, as evidenced by the negative learned values. The WP model had the largest (absolute) estimated task-related $\beta$ parameters and thus best captured the external task. The within-visual cortex edges had the least predictive power, and the V1-mPFC and V1-M1 edges had the most. This may indicate that connectivity between distant brain regions is more informative of external task conditions, which may be related to the stronger connectivity between visual regions (Supplementary Fig. S11B). Additional coupling effects in these edges may be more difficult to detect than in those with lower baseline connectivity.
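Per edge, the GLM step reduces to estimating the stimulus regressor's coefficient by least squares. A sketch with an intercept term (the detrending applied to the actual estimates is omitted here):

```python
import numpy as np

def stimulus_beta(tvfc_edge: np.ndarray, stimulus: np.ndarray) -> float:
    """Least-squares beta for the (HRF-convolved) stimulus regressor in a
    GLM with an intercept, fit to one edge's TVFC estimate."""
    X = np.column_stack([np.ones_like(stimulus), stimulus])
    beta = np.linalg.lstsq(X, tvfc_edge, rcond=None)[0]
    return float(beta[1])
```

A negative beta, as found for most edges, indicates that the edge's estimated correlation drops when the stimulus is present.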

### 3.4 Imputation benchmarks

A major difficulty in evaluating TVFC estimation methods is the lack of ground-truth knowledge of the covariance structures underlying observed time series. The imputation benchmark proposed here used observed time series directly as a ground truth to circumvent this issue. Data were split into equally sized training and test sets by removing every other scanning volume. Performance was assessed as the ability to predict unobserved test volumes from observed training volumes. Across the different datasets, the results on this benchmark corresponded to the performance metrics computed *with* knowledge of a ground truth or a proxy thereof.

For the simulated data, we examined two cardinal cases using bivariate noiseless data. For the null covariance case (Fig. 12A), all methods performed similarly well on this benchmark (e.g., *t*-test between WP and sFC; $p=.47$). However, for a nonstatic covariance structure, such as the slowly oscillating periodic structure (Fig. 12B), we found that the sFC estimates performed much worse on the imputation benchmark than the other methods (e.g., *t*-test between WP and sFC; $p=1.75\times10^{-106}$). These findings correspond to the prior results shown for the simulation benchmarks, where the prespecified ground-truth covariance structure was reconstructed.

Next, we ran the imputation benchmark on rs-fMRI data from the HCP. The WP and SW-CV methods outperformed DCC-J and sFC on the whole-brain (all edges) imputation benchmark (Fig. 13). DCC-J estimates performed worse than sFC estimates, indicating a poor model fit. This supports the interpretation that the high test–retest scores for its variance result from numerical instability rather than from having modeled the data well.

There may be certain edges where WP performance was similar to that of the static approach (i.e., static edges), and some where it outperformed it (i.e., dynamic edges). Therefore, we also ran the imputation benchmark individually for each edge. These results (averaged over all subjects) are shown in Figure 14. Certain edges had high performance on this imputation benchmark, which may be because they were static and thus easier to fit. For example, test likelihoods were high for one of the FP(CL)-FP edges, which can be explained by the low variance and rate-of-change of this edge (see Fig. 7). However, we were interested in comparing estimation methods. The WP outperformed sFC more strongly for certain edges. This may be because these edges changed more over time, as indicated by the higher variance and rate-of-change summary measures for these edges (see Fig. 7).

Applying the imputation benchmark to the tb-fMRI Rockland data showed strong performance for the WP and weak performance for DCC-J (Fig. 15). SW-CV outperformed sFC, but only with a small effect size, suggesting that it captured the dynamics in these data only partially.

## 4 Discussion

We investigated the utility of WPs as a natural method for TVFC estimation. Further, we introduced a data-driven method to determine the optimal window length in SW approaches. Method comparison and selection are intrinsically difficult because of the lack of access to the true covariance structure underlying neural activity. We proposed a range of prediction tasks to benchmark TVFC estimation methods and proposed a data-driven imputation benchmark. We found that the WP performed competitively.

We simulated BOLD time series representing putative covariance structures present in fMRI. We showed that the WP does not produce spurious estimates of covariance dynamics when the true covariance is static, while we replicated this well-known failure mode of the SW and DCC approaches (Lindquist et al., 2014; Lurie et al., 2020). This may be due to the WP’s strong prior over static covariance structures; its estimates will be static unless provided with sufficient evidence to the contrary. Further, SW methods estimate each edge in a pairwise fashion, viewing each pair of time series independently of all others, and thus have less inherent protection against false-positive estimation. The WP estimates outperformed the other methods in the case of a smoothly changing covariance structure. However, the WP was not particularly well suited for modeling covariance structures with sudden changes in connectivity strength. This could be attributed to the smoothness of the underlying GPs. The other methods considered (DCC and SW-CV) also failed in such cases. We found that the WP performed competitively when scaling to three dimensions, especially when the edges between the simulated time series had distinct covariance structures. Nonetheless, simulation studies have limited value as it remains unclear what covariance structures are present in real fMRI data.

We studied benchmarks using rs-fMRI data from the HCP. We found large qualitative differences in the TVFC estimates among the different estimation methods. We related summary measures of TVFC estimates with subject measures through a morphometricity analysis. The results for sFC (as a reference null model; the alternative hypothesis that FC is static) and the first summary measure (mean across time) of TVFC estimates replicated those of a previous study (Li, Kong, et al., 2019). The time-varying summary measures (variance and rate-of-change) contained information about subject measures, which justifies the study of TVFC beyond sFC. However, these simple summary measures may be limited characterizations of the dynamics of TVFC, and a more sophisticated mapping between TVFC estimates and subject measures may lead to an improved benchmark. WP and SW-CV estimates had the highest predictive power. However, they performed differently for different subject measures, which may indicate the capture of distinct elements of covariance structures. In terms of test–retest robustness, all methods performed similarly, but the variance of the DCC estimates and the rate-of-change of the SW-CV estimates were more robust across scanning sessions. However, the high robustness of the DCC estimates may be because the raw numerical values of the variance of the DCC estimates were close to zero. The test–retest measure may be unstable in cases where both within-subject and between-subject variances are small. Lastly, the brain state analysis demonstrated that the choice of TVFC estimation method impacts the number of brain state switches during a scan.

We used a tb-fMRI dataset from the Rockland sample with an external visual task that induced a covariance structure in the participants. The ability of covariance estimates to detect putative changes in interregion correlations induced by the stimulus was determined by regressing the stimulus onto the TVFC estimates. The WP estimates recovered the external stimulus better compared with the other methods, as witnessed by the larger learned stimulus weights in a GLM setup. This outperformance could be attributed to the small number of scan volumes: Bayesian methods generally perform well in small samples because they use priors, DCC is known to struggle in small samples, and SW-CV may not have seen enough data to find a reasonable window length and is even more sensitive to outliers in such regimes. The highest predictive performance was found for the V1-M1 and V1-mPFC edges. V1 and M1 are anatomically farther apart than V1 and the other visual areas. They underlie functionally distinct processes, which may explain their anticorrelated interaction (Fox et al., 2005; Murphy et al., 2009). This may explain why the V1-M1 edge was most (anti)correlated with the external task conditions. It is possible that visual stimulus onset disrupts coactivity between V1 and M1 (Di et al., 2015). Further, Figure S11B indicates that the overall FC between V1 and the other visual areas was much stronger than with the mPFC and M1. The baseline strong static correlation between V1 and other visual areas may have masked any potential dynamic changes in connectivity induced by the stimulus. The same could be argued for the mPFC, potentially to a lesser degree. To investigate TVFC estimation beyond the simple external stimulus studied here, other tb-fMRI benchmarks with richer task structures should be included (Xie et al., 2019).

We found a correspondence between the estimation method performance on each benchmark (simulations, rs-fMRI, and tb-fMRI) and their respective performance on the respective imputation benchmarks. In other words, the WP performed well on the imputation benchmark when it did so on the matched benchmark, and vice versa. This marks the imputation benchmark's promise for realistic situations where no ground-truth covariance structure or proxies thereof (e.g., concurrent information, expert labeling, or additional collected participant information) are available.

Based on the simulation benchmark results, we posit that a method’s outperformance of the sFC estimate can be considered an indication of how time-varying the covariance structure is (but only if the method has successfully modeled the data). The TVFC estimation method performance was similar to that of sFC when either there was little or no dynamic structure present, or the methods failed to model it. However, the TVFC estimation methods outperformed sFC when there was a dynamic structure (if the method had successfully modeled it). In other words, we posit that outperformance over the sFC estimate is a proxy for a statistical test of whether there is any time-varying structure (Hindriks et al., 2016; Zalesky & Breakspear, 2015) and can be considered a null model study (Liégeois et al., 2021; Miller et al., 2018; Novelli & Razi, 2022). The imputation benchmark finding is consistent with this interpretation. Outperformance of the sFC estimate on this benchmark can also be considered a null model study. Therefore, practitioners are recommended to conduct this study prior to any further analysis, not only when in doubt about how to estimate TVFC, but also when in doubt about whether their data contain a strong dynamic component.

In addition to competitive benchmarking performance, the WP offers qualitative advantages, which may also explain its competitive performance. First, unlike SW, it is a *model-based* approach, which can increase predictive power (Bassett et al., 2011), allows easier specification of inductive biases (Foti & Fox, 2019), and supports uncertainty modeling (Kudela et al., 2017). In general, a model that makes fewer assumptions about the data and extracts relevant hyperparameters automatically is preferred.

### 4.1 Limitations and future work

We must still consider the possibility of covariance structures in real data being organized as point processes of change points, where connectivity can suddenly and drastically change. All the considered models performed poorly on such data, for example, the simulated “state transition” dataset. State-based models such as HMMs are expected to perform better on such covariance structures. Change point detection algorithms can help determine whether such sudden changes exist (Cribben et al., 2012; Robinson et al., 2010). Within the WP framework, change point kernels can be included in models (Saatçi et al., 2010; Wilson, 2013). A key question is whether state transitions are expected in the covariance structures of real data. If methods that are better suited for detecting states outperform those that cannot capture change points well on real data benchmarks designed for this purpose, this would provide evidence for their existence.

Another limitation is that the utility of (rs-fMRI) test–retest studies is questionable. The classical measurement error framework assumes an immutable subject-specific property, whereas in FC studies, at least part of the functional architecture is expected to be different in different sessions, and thus less robust than more stable characteristics such as anatomical properties. One estimation method could extract subject-specific traits, such as participant age, whereas another could extract participant states, such as stress during the scan. It is not clear whether a TVFC estimation method that extracts stable subject- or session-specific information is preferred. This makes it a poorly framed exercise, and its interpretation in relation to the optimal method choice is unclear. Further, viewing this as a prediction problem, the test–retest problem can be stated as follows: given a subject's first scan, how well can we predict which scan is that subject's subsequent scan? In this light, the method with the best test–retest score from *any* model feature can be considered stronger, such as the WP’s learned kernel parameters, for which we found high ICC scores. Although test–retest reliability may be desired, optimizing it for its own sake overlooks the goal of TVFC estimation. It has recently been argued that (behaviorally) *predictive* connectomes are more important than *reliable* connectomes (Finn & Rosenberg, 2021).

The WP implemented in this study can be augmented in several ways. First, one of the disadvantages of this model is its computational cost (Meng et al., 2023). To mitigate the computational cost due to large values of $N$, innovations in sparse GPs that remove the need for inducing points can be ported to WP models (Adam et al., 2020; Wilkinson et al., 2021). However, this may be more relevant in applications to EEG data, where the number of time steps is much larger. The model scales in computational time with respect to $D$ as $O(D^3)$, owing to the required inversion of a $D \times D$ matrix. A low rank (or *factored*) implementation can help in situations where $D$ is large (Fox et al., 2015; Heaukulani & van der Wilk, 2019). This assigns each time series in a weighted manner to $K \ll D$ clusters, and only learns the correlation structure between these clusters. Apart from reducing the number of model parameters, this can be considered as a dimensionality reduction step that posits an inherent network organization onto the brain. Second, while assuming the mean function $\mu_n$ to be zero(s) is fair if we subtract the empirical mean from the data, models that include an estimate of the mean function may yield better covariance estimates (Lan et al., 2020). Finally, by specifying the similarity between observations, the kernel choice imposes assumptions (inductive bias) on the behavior of the observed time series, such as covariance structure smoothness (Foti & Fox, 2019; Fyshe et al., 2012). Expressive kernels can be designed to incorporate domain knowledge (Gönen & Alpaydin, 2011). For example, a periodic kernel could model TVFC where scanning volumes at a certain periodic interval are more related, which could be useful in the case of periodic task stimuli. Although our simple kernel choice in this work resembles a tapered SW approach, more complex kernels can introduce structures that are unattainable by other methods.
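As an illustration of such a kernel, the standard periodic (exponential-sine-squared) kernel can be written out directly; the hyperparameter names below are generic, not tied to any particular library.

```python
import numpy as np

def periodic_kernel(t1, t2, variance=1.0, lengthscale=1.0, period=1.0):
    """Standard periodic kernel on scalar time inputs: volumes separated
    by a multiple of `period` are maximally similar."""
    d = np.abs(np.subtract.outer(t1, t2))
    return variance * np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / lengthscale ** 2)
```

With `period` set to the task block length, two volumes one stimulus cycle apart receive the maximum similarity, encoding the expected periodic task structure as an inductive bias.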

Having a comprehensive benchmarking framework is crucial because it allows for rapid iterative improvements and verification of model adjustments. Additional benchmarks will increase robustness (e.g., across data collection routines, scanners, and demographics) and deepen our knowledge of the utility of estimation methods. As a priority, a carefully tailored prediction task could help assess the degree to which brain states are artifacts of the estimation method choice or valid constructs that accurately describe underlying brain dynamics. Finally, we must accept that perfect benchmarking is unattainable in practice and that a single estimation method will not outperform in all configurations. One way to increase the robustness of our scientific conclusions is to formalize the TVFC estimation method choice (and other choices) using a multiverse analysis framework (Bell & Kampman, 2021; Steegen et al., 2016). Such approaches have recently been demonstrated in neuroimaging data processing pipeline choices (Dafflon et al., 2022) and for exploring machine learning model hyperparameter spaces (Bell et al., 2022). This views each decision “universe” in parallel and repeats an analysis in each universe. If a conclusion holds in most universes, it can be considered more robust.

## 5 Conclusions

This study assessed the utility of WP models for TVFC estimation. We also proposed a more principled method for running SW by cross-validating its window length. We carefully considered how to compare TVFC estimation methods by designing a comprehensive suite of benchmarks, including simulations, subject measure prediction, test–retest studies, brain state analyses, and external task prediction. A novel imputation benchmark based on cross-validation that can be run on any dataset was introduced. We showed how the WP performed competitively in relation to baselines, including both DCC and SW approaches, outperforming them especially on the external task prediction benchmark while estimating fewer false positives in the TVFC null model. Finally, we recommend running the imputation benchmark on any newly collected dataset to determine the optimal TVFC estimation method.

## Data and Code Availability

Researchers can use the pip-installable, open-source Python package *FCEst* (https://github.com/OnnoKampman/FCEst) for TVFC estimation using WPs and the baselines discussed. Code for reproducing the benchmarking and their results can be found at https://github.com/OnnoKampman/FCEst-benchmarking. We used open-source Python libraries *nilearn* (Abraham et al., 2014), *pingouin* (Vallat, 2018), and *GPflow* (Matthews et al., 2017) and the *R* package *rmgarch* (Galanos, 2022). Human Connectome Project data are available at https://db.humanconnectome.org/ (requiring free registration). Rockland data are available at http://fcon_1000.projects.nitrc.org/indi/pro/nki.html (requiring free registration).

## Author Contributions

O.P.K. carried out the conceptualization, data curation, formal analysis, investigation, methodology, software, validation, visualization, writing, and editing. J.Z. carried out the conceptualization, data curation, formal analysis, methodology, validation, writing, and editing. S.A. carried out the conceptualization, data curation, formal analysis, methodology, software, validation, and editing. M.v.d.W. carried out the conceptualization, methodology, software, supervision, and editing. Z.K. carried out the conceptualization, funding acquisition, project administration, resources, supervision, and editing.

## Funding

This work was supported by grants to Z.K. from the Wellcome Trust (grants 205067/Z/16/Z, 223131/Z/21/Z, 221633/Z/20/Z), the Royal Society (INF\R2\202107), the Biotechnology and Biological Sciences Research Council (grants H012508 and BB/P021255/1). This work was further supported by the European Union’s Horizon 2020 Marie Skłodowska-Curie Actions Initial Training Network (H2020-MSCA-ITN-2017) “DyViTo: Dynamics in Vision and Touch” (grant agreement 765121).

## Declaration of Competing Interest

None.

## Ethics Statement

This work used open-access data from the HCP and the NKI-Rockland sample. No identifying information related to participants from either dataset has been published or shared.

## Acknowledgements

We thank Gustavo Deco, Emmanuel Stamatakis, Creighton Heaukulani, and Vasilis Karlaftis for their helpful feedback. We would also like to thank the reviewers for their helpful comments. Data were provided by the Human Connectome Project, WU-Minn Consortium (Principal Investigators: David Van Essen and Kamil Ugurbil; 1U54MH091657) funded by the 16 NIH Institutes and Centers that support the NIH Blueprint for Neuroscience Research; and by the McDonnell Center for Systems Neuroscience at Washington University. For the purpose of open access, the authors have applied for a CC BY public copyright license to any Author Accepted Manuscript version arising from this submission.

## Supplementary Materials

Supplementary material for this article is available with the online version here: https://doi.org/10.1162/imag_a_00184.

## References

## Appendix

### A.1. Updating ground-truth covariances after noise addition

Although the distribution of the HCP time series used as noise $\epsilon$ in the simulation benchmarks is unknown, we know that $E[\epsilon]=0$ and $E[\epsilon^2]=1$. First, we note that

$$y_n^* = \alpha y_n + \epsilon, \qquad E[y_n^*] = \alpha E[y_n] + E[\epsilon] = 0,$$

so that the variance of the new (noisy) time series is

$$\sigma^2_{y_n^*} = E\big[(\alpha y_n + \epsilon)^2\big] = \alpha^2 \sigma^2_{y_n} + E[\epsilon^2] = \alpha^2 + 1.$$

Regarding the covariance terms, all cross-terms except within $y_n$ are zero; therefore, we are simply left with $\sigma_{y_{n,1}^* y_{n,2}^*} = \alpha^2 \sigma_{y_{n,1} y_{n,2}}$. To maintain unit time series variances, the new (noisy) signal is

$$\tilde{y}_n = \frac{\alpha y_n + \epsilon}{\sqrt{\alpha^2 + 1}},$$

so our new ground-truth covariance terms can be analytically updated as

$$\sigma_{\tilde{y}_{n,1} \tilde{y}_{n,2}} = \frac{\alpha^2}{\alpha^2 + 1}\, \sigma_{y_{n,1} y_{n,2}}.$$
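Under the mixing assumption used above (noisy signal $\alpha y_n + \epsilon$ rescaled to unit variance), the updated covariance can be checked numerically. A Monte Carlo sketch with hypothetical values of $\alpha$ and the ground-truth correlation:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, rho, n = 1.5, 0.6, 200_000  # hypothetical mixing weight and correlation

# Unit-variance signal pair with correlation rho, plus unit-variance noise.
y = rng.multivariate_normal(np.zeros(2), np.array([[1.0, rho], [rho, 1.0]]), size=n)
eps = rng.standard_normal((n, 2))

# Mix and rescale so that the noisy series keeps unit variance.
y_noisy = (alpha * y + eps) / np.sqrt(alpha**2 + 1.0)

empirical = np.corrcoef(y_noisy.T)[0, 1]
predicted = alpha**2 * rho / (alpha**2 + 1.0)
```

The empirical correlation of the noisy pair matches the analytically updated ground truth up to sampling error.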

### A.2. Human connectome project (HCP) data description

HCP participants were young adults (twins and nontwin siblings), aged 22–35 years. We used the data from the HCP S1200 public release (published in March 2017), from the 812 subjects with four available scans processed using improved fMRI image reconstruction (“recon2”). These scans were acquired with a 3T Siemens connectome-Skyra using a multiband sequence. The readily available $D=15$ dimensional ICA-based node time series were used. For interpretability, each ICA component was mapped to a well-known functional network (FN) (Giorgio et al., 2018). Three neuroimaging researchers visually inspected and manually assigned each ICA component to 1 of 10 networks that resemble task-driven FNs (Fox & Raichle, 2007; Laird et al., 2011; Smith et al., 2009). The four scans were divided into two immediately consecutive scans on two separate days (scans 1A, 1B, 2A, and 2B). Scans were acquired with a TR of 0.72 seconds and a voxel size of 2 mm isotropic. Each scan contained $N=1,200$ images and was 15 minutes long. The data had been preprocessed using a (minimal) preprocessing pipeline (Glasser et al., 2013, 2016; Smith, Vidaurre, et al., 2013). This minimal preprocessing pipeline included motion and distortion correction, and anatomical registration to standard MNI space. Noise components had been filtered out during preprocessing using ICA+FIX (Griffanti et al., 2014; Salimi-Khorshidi et al., 2014). Further global signal regression was not performed. All time series had been demeaned, with variance normalized (Beckmann & Smith, 2004).

### A.3. Rockland data description

The Rockland dataset contained fMRI scans of subjects (aged 18–35 years) alternating between fixating in the dark and viewing a checkerboard visual pattern (both stimulus and rest periods lasting 20 seconds). We used the dataset with a TR of 0.645 seconds ($N=240$). We only used data from the first baseline visit (coded as “ses-BAS1”). Data preprocessing was performed in FSL (Jenkinson et al., 2012) consisting of brain extraction and field-of-view fixation, data registration, motion correction, smoothing (5 mm Gaussian kernel), and detrending (high-pass, 0.01 Hz). No initial volumes were excluded from the scans. Finally, we denoised the data using ICA-AROMA (Pruim et al., 2015). The data were not prewhitened to ensure preservation of the visual task. Subjects with a correlation between their V1 BOLD time series and the stimulus time series (convolved with an HRF) of less than 0.4 were discarded, resulting in 286 subjects. Such low correlations indicated that the expected BOLD signature in V1 was not observed (e.g., owing to the subjects closing their eyes). This visual task is simpler than those recorded under the umbrella of the HCP, which makes it better suited for our benchmarking aims (Barch et al., 2013). We extracted BOLD activations from visual areas V1, V2, V3, V4, the medial prefrontal cortex (mPFC), and the primary motor cortex (M1). A plot of the time series, aligned across subjects, is shown in Figure S11A. As motivation for using this design, Di et al. (2015) showed that FC is not static across such visual stimuli and that FC between higher and lower visual regions is affected in a consistent manner with stimulus onset and offset.