Test-retest reliability of regression dynamic causal modeling

Abstract Regression dynamic causal modeling (rDCM) is a novel and computationally highly efficient method for inferring effective connectivity at the whole-brain level. While face and construct validity of rDCM have already been demonstrated, here we assessed its test-retest reliability—a test-theoretical property of particular importance for clinical applications—together with group-level consistency of connection-specific estimates and consistency of whole-brain connectivity patterns over sessions. Using the Human Connectome Project dataset for eight different paradigms (tasks and rest) and two different parcellation schemes, we found that rDCM provided highly consistent connectivity estimates at the group level across sessions. Second, while test-retest reliability was limited when averaging over all connections (range of mean intraclass correlation coefficient 0.24–0.42 over tasks), reliability increased with connection strength, with stronger connections showing good to excellent test-retest reliability. Third, whole-brain connectivity patterns by rDCM allowed for identifying individual participants with high (and in some cases perfect) accuracy. Comparing the test-retest reliability of rDCM connectivity estimates with measures of functional connectivity, rDCM performed favorably—particularly when focusing on strong connections. Generally, for all methods and metrics, task-based connectivity estimates showed greater reliability than those from the resting state. Our results underscore the potential of rDCM for human connectomics and clinical applications.

For clinical applications, computational methods for assessing functional integration in large-scale (whole-brain) networks of individual patients have great potential. In order to leverage this potential, candidate methods need to fulfill several criteria, including (a) computational efficiency (allowing assessment of large-scale networks with hundreds of nodes, within clinically acceptable time frames), (b) reliability (construct and test-retest), and (c) predictive validity (with regard to specific clinical questions).
Regression dynamic causal modeling (rDCM) is a generative model of fMRI data that was developed with these objectives in mind (Frässle, Lomakina, Kasper, et al., 2018; Frässle, Lomakina, Razi, et al., 2017). It represents a novel variant of DCM for fMRI (Friston et al., 2003) that scales gracefully to very large networks including hundreds of nodes, enabling whole-brain effective connectivity analyses within time frames of minutes to hours. Furthermore, the model can utilize structural connectivity information to constrain inference on directed functional interactions or, where no such information is available, infer optimally sparse representations of whole-brain connectivity patterns. For rDCM, we have recently demonstrated the face validity of the approach in comprehensive simulation studies for both task-based (Frässle, Lomakina, Kasper, et al., 2018; Frässle, Lomakina, Razi, et al., 2017) and resting-state fMRI data. Furthermore, we have demonstrated its construct validity in application to fMRI data from a simple hand movement paradigm (Frässle, Manjaly, et al., 2021), as well as to resting-state fMRI data. These studies have provided promising results and suggest that rDCM might enable the construction of clinically useful "computational assays" in psychiatry and/or neurology. However, test-retest reliability of rDCM has not been assessed so far.
Test-retest reliability represents an important test-theoretical property that quantifies the stability of estimates over time at the individual-subject level. It thus has particular relevance for clinical tests that require repeated assessments, such as monitoring treatment response over time. Test-retest reliability has already been assessed for classical variants of DCM for fMRI (Almgren et al., 2018; Frässle, Paulus, et al., 2016; Frässle, Stephan, et al., 2015; Rowe et al., 2010; Schuyler et al., 2010). Overall, these studies have reported good reproducibility of DCM for fMRI across sessions, although detailed work has stressed avoidance of local extrema during optimization and the choice of the prior distributions as important factors for achieving good test-retest reliability (Frässle, Stephan, et al., 2015).
Dynamic causal modeling: A generative model of effective (directed) connectivity based on neuroimaging data.
Generative model: Describes the putative processes by which data were generated. Specified by the joint probability density over model parameters and data.
Effective connectivity: Effective connectivity refers to the directed influences that one neuronal population exerts on another neuronal population.
Test-retest reliability: Test-theoretical property that refers to the consistency of a test over time, performed under identical conditions in the same subject.
Distribution: Refers to the probability density function of a continuous random variable.
While test-retest reliability has been investigated for classical DCM for fMRI, it has not been tested for rDCM so far. Here, we assess the test-retest reliability of rDCM for inferring effective connectivity from task-based as well as resting-state fMRI data, applying the model to multiple datasets over time, acquired under the same conditions in the same participants. In addition, using the same data, we examined the consistency of group-level estimates of connectivity (referred to as "consistency" below). This metric is complementary to test-retest reliability, which focuses on the stability of individual estimates over time. To this end, we made use of the comprehensive test-retest dataset from the Human Connectome Project (HCP; Van Essen et al., 2013).

Analysis Plan
All analyses reported in this paper were prespecified in an analysis plan that was timestamped prior to the analyses (https://gitlab.ethz.ch/tnu/analysis-plans/fraessle_hcp_test_retest).

Regression Dynamic Causal Modeling
General overview. Regression DCM (rDCM) is a novel variant of DCM for fMRI that enables effective connectivity analyses in whole-brain networks (Frässle, Lomakina, Kasper, et al., 2018; Frässle, Lomakina, Razi, et al., 2017). Its computational efficiency is achieved by several modifications and simplifications of the original DCM framework. These include (a) translating state and observation equations of a linear DCM from time to frequency domain, (b) replacing the nonlinear hemodynamic model with a linear hemodynamic response function (HRF), (c) applying a mean field approximation across regions (i.e., parameters targeting different regions are assumed to be independent), and (d) specifying conjugate priors on neuronal (i.e., connectivity and driving input) parameters and noise precision. These modifications reformulate a linear DCM in the time domain as a Bayesian linear regression in the frequency domain, resulting in the following likelihood function:
$$
\begin{aligned}
p(Y_r \mid \theta_r, \tau_r, X) &= \mathcal{N}\left(Y_r;\, X\theta_r,\, \tau_r^{-1} I_{N \times N}\right) \\
Y_r &= \left(e^{2\pi i \frac{m}{N}} - 1\right) \frac{\hat{y}_r}{T} \\
X &= \left[\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_R, \hat{h}\hat{u}_1, \hat{h}\hat{u}_2, \ldots, \hat{h}\hat{u}_K\right] \\
\theta_r &= \left[a_{r,1}, a_{r,2}, \ldots, a_{r,R}, c_{r,1}, c_{r,2}, \ldots, c_{r,K}\right]
\end{aligned}
\tag{1}
$$
Here, $Y_r$ is the dependent variable in region $r$ that is explained as a linear mixture of afferent connections from other regions and direct (driving) inputs. Specifically, $Y_r$ is the Fourier transformation of the temporal derivative of the measured signal in region $r$. Furthermore, $y_r$ represents the measured BOLD signal in region $r$, $X$ is the design matrix (comprising a set of regressors and explanatory variables), $u_k$ is the $k$th experimental input, and the hat symbol denotes the discrete Fourier transform (DFT). Additionally, $\theta_r$ represents the parameter vector comprising all afferent connections $a_{r,1}, \ldots, a_{r,R}$ and all driving input parameters $c_{r,1}, \ldots, c_{r,K}$ targeting region $r$. Finally, $\tau_r$ denotes the noise precision parameter for region $r$ and $I_{N \times N}$ is the identity matrix (where $N$ denotes the number of data points).
Choosing appropriate priors on the parameters and hyperparameters in Equation 1 (see Frässle, Lomakina, Razi, et al., 2017) results in a generative model that can be used for inference on the directed connection strengths and inputs.
Bayesian statistics: Theory based on Bayes theorem, which provides a recipe for optimally combining prior and new information in a probabilistic way.
Linear regression: Statistical approach that attempts to model the linear relationship between a scalar response and one or more explanatory variables.
Under this formulation, inference can be done very efficiently by (iteratively) executing a set of analytical variational Bayes (VB) update equations of the sufficient statistics of the posterior density. In addition, one can derive an expression for the negative (variational) free energy (Friston et al., 2007). The negative free energy represents a lower bound approximation to the log model evidence that accounts for both model accuracy and complexity. Hence, the negative free energy offers a sensible metric for scoring model goodness and is frequently used for comparing competing hypotheses (Bishop, 2006). We have recently further augmented rDCM by introducing sparsity constraints as feature selectors into the likelihood of the model in order to automatically prune fully connected network structures (Frässle, Lomakina, Kasper, et al., 2018). A comprehensive description of the generative model of rDCM, including the mathematical details of the neuronal state equation, can be found elsewhere (Frässle, Lomakina, Kasper, et al., 2018; Frässle, Lomakina, Razi, et al., 2017).
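To illustrate the kind of iterative variational Bayes updates referred to above, the following is a minimal sketch of Bayesian linear regression with a Gaussian prior on the weights and a Gamma prior on the noise precision. It is not the rDCM toolbox implementation: it uses real-valued data for simplicity (rDCM operates on Fourier coefficients), and the function name `vb_linear_regression` as well as the default prior values are illustrative assumptions.

```python
import numpy as np


def vb_linear_regression(X, y, mu0, Sigma0, a0=2.0, b0=1.0, n_iter=50):
    """Variational Bayes for y = X @ theta + noise with unknown noise precision tau.

    Priors (illustrative): theta ~ N(mu0, Sigma0), tau ~ Gamma(shape=a0, rate=b0).
    Returns the posterior mean and covariance of theta and the posterior
    shape and rate of tau.
    """
    n_obs = X.shape[0]
    Sigma0_inv = np.linalg.inv(Sigma0)
    XtX, Xty = X.T @ X, X.T @ y

    a = a0 + n_obs / 2.0  # the shape update does not change across iterations
    b = b0
    for _ in range(n_iter):
        tau_mean = a / b  # expected precision under q(tau)
        # Update q(theta) given the current expected precision
        Sigma = np.linalg.inv(tau_mean * XtX + Sigma0_inv)
        mu = Sigma @ (tau_mean * Xty + Sigma0_inv @ mu0)
        # Update q(tau) given the current q(theta)
        resid = y - X @ mu
        b = b0 + 0.5 * (resid @ resid + np.trace(XtX @ Sigma))
    return mu, Sigma, a, b
```

Because both updates are available in closed form, each iteration only requires a small matrix inversion per region, which is one way to see why this class of model can be inverted quickly.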

Dataset
Participants. We used the publicly available fMRI data provided by the Human Connectome Project (HCP; Van Essen et al., 2013), specifically, all fMRI datasets from the HCP S1200 data release for which test and retest sessions are available. In total, this included 45 participants (31 females, 14 males). However, not all participants performed all paradigms twice. Hence, we excluded participants, for each paradigm individually, if not all their data from the test and retest session of the particular paradigm were available. The experimental protocol of the HCP was in compliance with the Declaration of Helsinki and was approved by the Institutional Review Board at Washington University in St. Louis (IRB #20120436). Informed consent was obtained from all participants prior to the experiment and all open-access data were deidentified. Permission to use the open-access data for the present study was obtained from the HCP, abiding by the Data Use Terms (https://www.humanconnectome.org/data/data-use-terms).
Data acquisition. The HCP dataset comprises fMRI data acquired during the "resting state" (i.e., unconstrained cognition in the absence of experimental manipulations). During the resting-state measurement, participants were asked to keep their eyes open and to fixate on a crosshair projected on a screen. Furthermore, the HCP dataset comprises fMRI data acquired during several cognitive tasks, including (a) working memory, (b) gambling, (c) motor, (d) language, (e) social cognition, (f) relational processing, and (g) emotional processing. For the resting state, a total of four measurements are available per session (i.e., test or retest) that differ in the phase encoding direction during oblique axial acquisitions. Specifically, two resting-state measurements are available with phase encoding in right-to-left (RL) and two in left-to-right (LR) direction. Similarly, for each task, two measurements are available (i.e., one per phase encoding direction) per session.
Preprocessing. Preprocessing of the data was already performed by the HCP consortium, and preprocessed files are released alongside the raw data. Here, we made use of the minimally preprocessed fMRI data. The minimal preprocessing pipeline uses different tools from various freely available software packages such as FSL (Jenkinson et al., 2012), FreeSurfer (Dale et al., 1999), and the HCP Workbench (Marcus et al., 2013) in order to accomplish several tasks, including spatial artifact/distortion removal, realignment, surface generation, cross-modal registration, and alignment to standard space (MNI). For the resting-state fMRI (rs-fMRI) data, additional preprocessing steps were performed to remove noise from the data. Specifically, the preprocessing of the rs-fMRI data made use of MELODIC as part of a single-subject spatial ICA decomposition. The resulting components were classified as signal or noise by FIX (Salimi-Khorshidi et al., 2014) and a cleaned version of the data is provided. The final preprocessed versions of both rs-fMRI and task data were then stored using the HCP-internal CIFTI file format and the associated grayordinates spatial coordinate system. For comprehensive information on the individual preprocessing steps that were performed on both the HCP resting-state and task-based fMRI data, please refer to the manual (see above) or Glasser et al. (2013).
Time series extraction. To extract BOLD signal time series for the subsequent rDCM analyses, we made use of two different whole-brain parcellation schemes. This allowed us to assess the robustness of our estimates of test-retest reliability and group-level consistency to the choice of parcellation scheme. First, we made use of the Human Connectome Project parcellation (HCP MMP 1.0), also known as the Glasser parcellation. HCP MMP 1.0 represents a very detailed cortical in vivo parcellation, consisting of 360 regions that were defined based on combined information on cortical architecture (e.g., relative cortical myelin content, cortical thickness), connectivity, and topography within some areas (e.g., the map of visual space in visual cortex). Second, we made use of the Schaefer 400-node parcellation (Schaefer et al., 2018), which rests on a gradient-weighted Markov random field model that integrates local gradient approaches (i.e., transient changes in functional connectivity patterns) and global similarity approaches (clustering of homogeneous/similar functional connectivity patterns, regardless of spatial proximity). Using task-based and resting-state fMRI, the authors derive parcellations of the human brain at various degrees of granularity and demonstrate that these parcels represent subcomponents of global brain networks identified by Yeo et al. (2011). The Schaefer parcellation is optimized to align with both task-based and resting-state fMRI, and has been found to demonstrate improved homogeneity within parcels relative to alternative parcellations (Schaefer et al., 2018).
For each of the considered whole-brain parcellation schemes, we extracted the BOLD signal time series of all regions using dedicated HCP tools for CIFTI files. Specifically, we used the command -cifti-parcellate from the HCP Workbench tool wb_command (for further information, see https://www.humanconnectome.org/software/workbench-command/-cifti-parcellate). The command takes the dense time series data (which is the CIFTI format in which the HCP fMRI data are stored) and a *.dlabel file (which contains the parcellation) and extracts average BOLD signal time series from each region.
rDCM analysis. The extracted BOLD signal time series then entered whole-brain effective connectivity analyses using rDCM. Since neither the Glasser atlas nor the Schaefer atlas is accompanied by an anatomical connectome that could inform the network architecture of the whole-brain models (i.e., the presence or absence of endogenous connections in rDCM; the A-matrix), we assumed a fully (all-to-all) connected network. Furthermore, the input (C) matrix was defined according to data type: (a) for the resting-state fMRI datasets, no driving inputs are available and the C-matrix was set to all-zeros, and (b) for the task-based fMRI datasets, a full C-matrix was assumed.
For this setting, two different variants of rDCM were employed. First, using the fully connected network architecture, the strength of each connection and driving input was inferred using the classical implementation of rDCM (Frässle, Lomakina, Razi, et al., 2017). This yielded a total of at least (a) 129,600 free parameters for the models based on the Glasser atlas (including 129,240 connectivity parameters, 360 inhibitory self-connections, and, for the task-based fMRI datasets, a task-dependent number of driving input parameters), and (b) 160,000 free parameters for the models based on the Schaefer atlas (including 159,600 connectivity parameters, 400 inhibitory self-connections, and, for the task-based fMRI datasets, a task-dependent number of driving input parameters).
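The network specification and the parameter counts given above can be checked with a short sketch. The function names, the Boolean mask representation, and the example numbers of driving inputs are illustrative assumptions and do not reflect the data structures of the rDCM toolbox.

```python
import numpy as np


def network_spec(n_regions, n_inputs, resting_state):
    """Boolean masks for a fully connected A-matrix and the C-matrix."""
    # All-to-all connectivity, including the self-connections on the diagonal
    A = np.ones((n_regions, n_regions), dtype=bool)
    if resting_state:
        # No driving inputs are available at rest: all-zero C-matrix
        C = np.zeros((n_regions, n_inputs), dtype=bool)
    else:
        # Task data: every input may drive every region
        C = np.ones((n_regions, n_inputs), dtype=bool)
    return A, C


def count_parameters(A, C):
    """Number of free parameters implied by the masks."""
    n_self = int(np.count_nonzero(np.diag(A)))
    n_between = int(np.count_nonzero(A)) - n_self
    return {
        "between_region": n_between,
        "self": n_self,
        "driving": int(np.count_nonzero(C)),
    }
```

For a 360-region atlas this gives 129,240 between-region connections plus 360 self-connections, and for a 400-region atlas 159,600 plus 400, in line with the totals of 129,600 and 160,000 stated in the text.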
In a second step, we utilized the sparsity constraints embedded in rDCM to automatically prune both connections and, for the task-based fMRI data, driving inputs (Frässle, Lomakina, Kasper, et al., 2018). In brief, this is achieved by introducing binary indicator variables as feature selectors into the likelihood function where each indicator variable determines whether a specific connectivity parameter is present. This resulted in the same number of neuronal parameters (i.e., connectivity, inhibitory self-connection, and driving input parameters) as mentioned above, plus the same number of binary indicator parameters. Notably, a Bernoulli prior is specified on the binary indicator variables, where the Bernoulli distribution is parameterized by a single parameter $p_0^i$. Hence, $p_0^i$ represents a hyperparameter of the model and encodes the a priori belief about the network's degree of sparseness. Since exact a priori knowledge about the degree of sparseness of the networks is not available here, we followed the procedure described in Frässle, Lomakina, Kasper, et al. (2018), using a line-search procedure to determine the value of $p_0^i$ that resulted in the highest negative free energy. More specifically, for each participant, we systematically varied $p_0^i$ within a range of 0.3 to 0.9 in steps of 0.1 and performed model inversion for each $p_0^i$ value. The optimal $p_0^i$ value was then determined for each participant by selecting the one that yielded the highest negative free energy. This yielded individual sparse effective connectivity patterns where some connections are absent (pruned away) and thus take a value of 0, whereas other connections remain present and thus take a nonzero connection strength.
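The line search over the Bernoulli prior hyperparameter described above can be sketched as follows. The callable `invert_model` is a hypothetical stand-in for a model inversion that returns the negative free energy for a given hyperparameter value; it is not a function of the rDCM toolbox.

```python
import numpy as np


def select_sparsity_prior(invert_model, p0_grid=None):
    """Pick the prior value with the highest negative free energy.

    invert_model: hypothetical callable mapping a prior value p0 to the
                  negative free energy of the inverted model.
    p0_grid:      candidate values; defaults to 0.3, 0.4, ..., 0.9.
    """
    if p0_grid is None:
        # linspace avoids the floating-point drift of repeated 0.1 steps
        p0_grid = np.round(np.linspace(0.3, 0.9, 7), 1)
    free_energies = np.array([invert_model(p0) for p0 in p0_grid])
    best = int(np.argmax(free_energies))
    return float(p0_grid[best]), float(free_energies[best])
```

In the analysis described here, this selection would be repeated for each participant, so that different participants can end up with different degrees of sparsity.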
Model inversion: Refers to the process by which the posterior distribution over the model parameters of a generative model is computed.
For either of the two rDCM variants, the whole-brain models were fitted to the extracted BOLD signal time series by making use of the standard routines and prior settings implemented in the rDCM toolbox. Specifically, whole-brain models were inverted by utilizing the main routine tapas_rdcm_estimate.m from the rDCM toolbox as implemented in TAPAS (Frässle, Aponte, et al., 2021), which is freely available as open-source software (https://www.translationalneuromodeling.org/tapas).
Group-level consistency and test-retest reliability of individual connection strengths. First, we investigated the across-session consistency of whole-brain effective connectivity patterns at the group level. To this end, for each endogenous connection and driving input, we computed the mean (across all participants) and then assessed the Pearson correlation between group-level parameter estimates from Session 1 ("test") and Session 2 ("retest"). Significance was determined at an alpha level of 0.05, corrected for multiple comparisons (i.e., number of paradigms) using Bonferroni correction. Hence, correlations with a p value smaller than 0.00625 (i.e., 0.05/8) were deemed significant. These analysis steps were performed for both (a) rDCM with fixed network architecture, as well as (b) rDCM with sparsity constraints. Note that we computed the group-level effective connectivity patterns as the simple arithmetic mean across participants; however, other approaches are possible as well, such as computing group-level parameters using a parametric empirical Bayesian (PEB) approach (Friston, Litvak, et al., 2016).
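The group-level consistency measure described above amounts to averaging parameter estimates across participants within each session and correlating the two group-mean vectors. A minimal sketch, assuming the estimates are stored as arrays of shape (participants, parameters); the function and constant names are illustrative.

```python
import numpy as np

N_PARADIGMS = 8
ALPHA_BONFERRONI = 0.05 / N_PARADIGMS  # threshold of 0.00625 stated in the text


def group_consistency(params_test, params_retest):
    """Pearson correlation between group-mean parameter vectors of two sessions.

    Both inputs have shape (n_participants, n_parameters).
    """
    mean_test = np.asarray(params_test, dtype=float).mean(axis=0)
    mean_retest = np.asarray(params_retest, dtype=float).mean(axis=0)
    return float(np.corrcoef(mean_test, mean_retest)[0, 1])
```

Averaging across participants suppresses session-specific noise in individual estimates, which is why this measure can be high even when the reliability of individual estimates is modest.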
Second, we assessed the test-retest reliability of the whole-brain effective connectivity patterns, that is, the stability of rDCM parameter estimates at the individual-subject level. To this end, an intraclass correlation coefficient (ICC) was computed for each connection. Specifically, we utilized the ICC(3, 1) type (Shrout & Fleiss, 1979), quantifying the ICC as a ratio between within-subject variability across the two sessions ($\sigma_w^2$) and between-subject variability ($\sigma_b^2$):
$$
\mathrm{ICC}(3, 1) = \frac{\sigma_b^2 - \sigma_w^2}{\sigma_b^2 + \sigma_w^2}
\tag{2}
$$
ICC(3, 1) values range from −1 to 1. According to conventional interpretations of ICC values, test-retest reliability is classified as "poor" for ICC < 0.4, as "fair" for 0.4 ≤ ICC < 0.6, as "good" for 0.6 ≤ ICC < 0.75, and as "excellent" for ICC ≥ 0.75 (Cicchetti, 2001).
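As an illustration of how ICC(3, 1) can be computed for a single connection, the following sketch uses the mean-square formulation of Shrout and Fleiss (1979), ICC(3, 1) = (BMS − EMS) / (BMS + (k − 1) EMS), where BMS is the between-subjects mean square, EMS the residual mean square, and k the number of sessions. Identifying BMS with the between-subject variability and EMS with the within-subject variability for k = 2 is an assumption made here for illustration. The function names are hypothetical; the category boundaries follow the conventions quoted above.

```python
import numpy as np


def icc_3_1(scores):
    """ICC(3,1) for an array of shape (n_subjects, k_sessions)."""
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    grand = scores.mean()
    # Two-way ANOVA decomposition of the total sum of squares
    ss_total = ((scores - grand) ** 2).sum()
    ss_subjects = k * ((scores.mean(axis=1) - grand) ** 2).sum()
    ss_sessions = n * ((scores.mean(axis=0) - grand) ** 2).sum()
    bms = ss_subjects / (n - 1)  # between-subjects mean square
    ems = (ss_total - ss_subjects - ss_sessions) / ((n - 1) * (k - 1))  # residual
    return (bms - ems) / (bms + (k - 1) * ems)


def classify_icc(icc):
    """Conventional labels for ICC values (Cicchetti, 2001)."""
    if icc < 0.4:
        return "poor"
    if icc < 0.6:
        return "fair"
    if icc < 0.75:
        return "good"
    return "excellent"
```

Because session effects are removed before the residual is computed, a constant shift of all estimates between test and retest does not lower ICC(3, 1); only changes in the rank ordering or relative spacing of participants do.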
Based on the parameter-wise ICC values, different analyses were performed. First, the distribution of ICC values across all connections was inspected and the mean of the distribution was used to quantify the average test-retest reliability of rDCM when considering all connections. Second, reliability was tested as a function of connection strength. This was motivated by the hypothesis that reliability should be lower for connections that are weak (close to 0) and are thus unlikely to represent a meaningful effect that would be consistently present across sessions. Conversely, strong connections (both inhibitory and excitatory) should be more likely to represent meaningful effects and should thus have a greater probability to be conserved across sessions. This hypothesis was tested using two different analyses: (a) We computed the correlation between absolute parameter strengths and ICC values. (b) We restricted the test-retest reliability analyses only to parameters that were significantly different from 0 (as assessed using one-sample t tests and Bonferroni correction for the multiple comparisons). Furthermore, for the connectivity parameters, we also further restricted the analysis to the top 1,000 connections (i.e., the connections with the largest absolute weights).
Inter-session consistency of whole-brain effective connectivity patterns. In a final analysis, we tested how consistent the entire effective connectivity profiles were across the two sessions. This analysis follows previous work demonstrating that individual subjects can be identified by their unique functional connectivity profiles derived from fMRI data (Finn et al., 2015). Here, we asked whether the whole-brain connectivity profile of individual participants during the first session ("test") could be used to identify them from the set of all effective connectivity profiles obtained from the second session ("retest"). To this end, we computed for each participant in Session 1 the similarity between his/her connectivity matrix and the connectivity matrices of all participants in Session 2. The predicted identity was that with the highest similarity score. Following Finn et al. (2015), similarity was defined as the Pearson correlation between two vectors of connectivity estimates taken from the participant's adjacency matrix from Session 1 and all adjacency matrices from Session 2. Repeating this procedure for each participant in Session 1 allows us to construct a confusion matrix from which the identification accuracy can be computed. To account for order effects, we performed the same analysis in the opposite direction, testing whether a connectivity profile from the second session could be used to identify a given individual from the set of all effective connectivity profiles obtained from the first session.
To assess statistical significance of the identification accuracy, we performed permutation testing. An empirical null distribution of the identification accuracy was computed by randomly permuting the participant labels of the session to be predicted and repeating the entire prediction procedure described above, using 1,000 permutations. The p value was then computed as the rank of the original identification accuracy in the distribution of permutation-based identification accuracies, divided by the total number of permutations.
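The identification procedure and the permutation test described above can be sketched as follows, assuming each session's connectivity estimates are vectorized into an array of shape (participants, connections) with rows in the same participant order. The function names are illustrative, and computing the p value as the share of permutations that reach at least the observed accuracy is one possible reading of the rank-based definition given in the text.

```python
import numpy as np


def identification_accuracy(profiles_a, profiles_b):
    """Share of session-A participants whose best session-B match is their own profile."""
    n = profiles_a.shape[0]
    # Cross-session Pearson correlations: rows index session A, columns session B
    similarity = np.corrcoef(profiles_a, profiles_b)[:n, n:]
    predicted = similarity.argmax(axis=1)
    return float(np.mean(predicted == np.arange(n)))


def permutation_p_value(profiles_a, profiles_b, n_perm=1000, seed=0):
    """Observed accuracy and permutation-based p value."""
    rng = np.random.default_rng(seed)
    observed = identification_accuracy(profiles_a, profiles_b)
    n = profiles_b.shape[0]
    # Shuffling the rows of the session to be predicted breaks the label pairing
    null = np.array([
        identification_accuracy(profiles_a, profiles_b[rng.permutation(n)])
        for _ in range(n_perm)
    ])
    return observed, float(np.mean(null >= observed))
```

Calling the functions with the two sessions swapped gives the analysis in the opposite direction, which is how order effects are addressed in the text.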

RESULTS
In the following, we first present our findings on group-level consistency and test-retest reliability of individual connection strength estimates. Subsequently, we report the inter-session consistency of whole-brain effective connectivity patterns. In either case, we present results obtained using both "classical" rDCM (with a fixed network architecture) and "sparse" rDCM (with sparsity constraints and thus variable network architecture). All results are compared with functional connectivity estimates (Pearson correlation coefficients and L1-regularized partial correlations).

Group-Level Consistency of Connection Strengths Across Sessions
Regression DCM with fixed (fully connected) network architecture. Group-level estimates of individual connections were highly consistent across the two sessions, independently of the paradigm (i.e., task-fMRI, rs-fMRI) and whole-brain parcellation scheme. More specifically, for the Glasser atlas, Pearson correlations (r) for the connectivity parameter estimates ranged from 0.92 for the emotional processing task to 0.97 for the language task. For the driving input parameter estimates, Pearson correlations varied more strongly across the different paradigms and ranged from 0.37 for the emotional processing task to 0.98 for the social cognition task. For the Schaefer atlas, we found virtually identical results. A comprehensive list of all results from the group-level consistency analysis is provided in Table 1.
Regression DCM with sparsity constraints. In a second step, we assessed the across-session consistency of estimated connection strengths using rDCM with embedded sparsity constraints. Overall, we found group-level consistency of sparse rDCM to be comparable to rDCM with fixed network architecture for all paradigms except for the resting state. More specifically, for resting-state fMRI data, rDCM with sparsity constraints performed considerably worse (r = 0.62) than classical rDCM (r = 0.96); see Table 1. For all task-based datasets, consistency only slightly decreased for rDCM with sparsity constraints. Interestingly, for the driving input parameter estimates, rDCM with sparsity constraints performed comparably to rDCM with fixed network architecture and, in fact, in half of the cases outperformed the latter. For the Schaefer atlas, we again found results to be virtually identical.
Comparison to functional connectivity. In a next step, we compared the group-level consistency of rDCM (both with fixed [fully connected] network architecture and sparsity constraints) with the group-level consistency of functional connectivity estimates that are frequently used for human connectomics and network neuroscience. Specifically, we assessed group-level consistency for functional connectivity estimates based on Pearson's correlation coefficients (for a full connectivity matrix) and L1-regularized partial correlations (for sparsity constraints), respectively.
Table 1. Across-session consistency of group-level model parameter estimates for rDCM and functional connectivity. Consistency of parameter estimates in terms of the Pearson correlation coefficient between group-level (i.e., averaged across participants) estimates of Session 1 ("test") and Session 2 ("retest"). Group-level consistencies are reported for the connectivity and driving input parameters of rDCM (middle) as well as for the functional connectivity estimates (right). For both methods, results are shown for all HCP paradigms as well as for the two whole-brain parcellation schemes (i.e., Glasser, Schaefer). Furthermore, results are reported for two different "modes" of estimation (see main text for details): (a) fixed network architecture (i.e., classical rDCM and Pearson correlation coefficient), and (b) sparsity constraints (i.e., sparse rDCM and L1-regularized partial correlations). All correlations were significant at a significance threshold of p < 0.05 (Bonferroni-corrected for multiple comparisons).
In brief, group-level Pearson correlations were highly consistent across the two sessions, regardless of the paradigm (i.e., task-fMRI, rs-fMRI) and whole-brain parcellation scheme. More specifically, for the Glasser atlas, group-level consistency for Pearson correlation coefficients ranged from 0.89 for the emotional processing task to 0.95 for the resting state (see Table 1). Hence, we found the group-level consistency for Pearson correlations to be somewhat lower than for rDCM. More specifically, we found differences to range between 0.01 and 0.06 (all in favor of rDCM), which was highly significant (p < 0.001) given the high degrees of freedom (i.e., number of connectivity parameters). For L1-regularized partial correlations, group-level consistency ranged from 0.91 for the emotional processing task to 0.98 for the resting state. Here, the values were generally very similar to sparse rDCM, except for the resting-state dataset where L1-regularized partial correlations showed greater consistency. Except for the resting state, we found differences between sparse rDCM and L1-regularized partial correlations to range between 0.01 and 0.05 (in favor of one or the other), which was again highly significant (p < 0.001) given the high degrees of freedom. As for the rDCM analysis, we found functional connectivity results for the Schaefer atlas to be virtually identical to the ones for the Glasser atlas.

Test-Retest Reliability
Regression DCM with fixed (fully connected) network architecture. In a second analysis, we assessed the test-retest reliability of estimates of individual connection strengths by rDCM, computing the ICC (Shrout & Fleiss, 1979) for each connection. Here, we report the results for the Glasser atlas; again, the results for the Schaefer atlas are virtually identical and are reported in the Supporting Information.
Overall, when considering all model parameters, test-retest reliability of model parameter estimates from rDCM was relatively low (Figure 1B, left). More specifically, for the connectivity parameters, on average test-retest reliability ranged from poor for the resting state (mean ICC = 0.24, 95% confidence interval (CI) = [−0.18, 0.59]) to fair for the social cognition task (mean ICC = 0.42 [−0.07, 0.75]) when considering all connections. Similarly, for the intrinsic self-connections (i.e., the diagonal of the A-matrix), on average test-retest reliability ranged from poor for the resting state (mean ICC = 0.33 [−0.05, 0.63]) to fair for the social cognition task (mean ICC = 0.41 [−0.15, 0.77]); hence, no systematic differences were observed for the two types of connectivity parameters. Finally, for the driving input parameters, test-retest reliability ranged from poor for the emotional processing task (mean ICC = 0.08 [−0.43, 0.54]) to fair for the social cognition task (mean ICC = 0.42 [−0.03, 0.73]). Importantly, this includes weak connections and driving inputs that may not represent meaningful effects, but may be driven by noise. In a next step, we therefore tested whether stronger parameters tended to be more reliable.
Focusing only on connections that deviated significantly from zero (p < 0.05, Bonferroni-corrected for multiple comparisons), we observed a clear increase in reliability (Figure 1B, middle). While reliability of the significant connections inferred from resting-state fMRI data was still poor on average (mean ICC = 0.32 [−0.10, 0.64]), reliability was considerably higher for task-based fMRI data (e.g., mean ICC = 0.62 [0.10, 0.88] for the emotional processing task). The same pattern could be observed for the significant driving inputs (although somewhat less strongly). Finally, when restricting our reliability analysis even further to the top 1,000 connections (i.e., the connections with the highest absolute connection strengths), we found the shift towards higher reliability to be even more pronounced (Figure 1B, right). Specifically, we found reliability to range on average from fair for the resting state (mean ICC = 0.45 [−0.02, 0.76]) to near excellent for the emotional processing task (mean ICC = 0.74 [0.45, 0.89]). A comprehensive list of all results from the test-retest reliability analysis is provided in Table 2.

Connectomics: Refers to the study of connectomes, which represent comprehensive maps of (anatomical or functional) connections within the nervous system.

Figure 1. Test-retest reliability of regression DCM for a fixed network architecture. (A) Methodological overview. Resting-state and task-based fMRI data from the Human Connectome Project (HCP) are used for the analysis. Region-wise BOLD signal time series were extracted from a whole-brain parcellation scheme (e.g., the Glasser atlas) and whole-brain effective connectivity was inferred using rDCM. The rDCM parameter estimates were then analyzed with regard to group-level consistency and test-retest reliability. (B) Estimates of the probability density functions (using the nonparametric kernel smoothing of fitdist.m implemented in MATLAB) of the connection-wise intraclass correlation coefficient (ICC) for the resting state and all 7 tasks (i.e., emotional processing, gambling, language, motor, relational processing, social cognition, and working memory) for the Glasser atlas (see Supporting Information Figure S1 for the respective results of the Schaefer atlas). Results are shown when considering all connections (left), significant connections (middle), and the top 1,000 connections (right). (C) Mean (averaged across all paradigms) test-retest reliability for all connections (top, left) as well as how often (i.e., in how many paradigms) a connection showed excellent reliability (bottom, left). Mean test-retest reliability projected onto the cortical surface (top, middle) and the cortical location of all regions that are linked via connections that show excellent reliability in all 8 paradigms (bottom, middle). Connectogram showing the connections with excellent reliability in all 8 paradigms (right). The connectogram was produced using Circos (publicly available at https://circos.ca/software/).

Table 2. Test-retest reliability of model parameter estimates for regression DCM and functional connectivity. Test-retest reliability of parameter estimates was assessed in terms of the intraclass correlation coefficient (ICC) between estimates of Session 1 ("test") and Session 2 ("retest") for a fixed (full) network architecture (i.e., classical rDCM and Pearson correlation coefficient). Here, we report the mean (averaged across parameters) ICC value and 95% confidence interval (CI). Averaging of the connection-wise ICC values as well as computing the 95% CI was achieved by (a) transforming connection-wise ICC values to z-space using Fisher z-transformation, (b) computing mean as well as lower and upper bound of the 95% CI in z-space, and finally (c) back-transforming estimates to r-space. Test-retest reliability is reported for the connectivity and driving input parameter estimates of rDCM (middle) as well as for the functional connectivity estimates (right). For both methods, results are shown for all HCP paradigms for the Glasser atlas (see Supporting Information Table S1 for the respective results of the Schaefer atlas). Furthermore, results are shown for (a) all parameters (top row), (b) significant parameters (middle row), and (c) top 1,000 parameters (bottom row).
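The averaging procedure described in the caption of Table 2 (Fisher z-transformation, summary statistics in z-space, back-transformation to r-space) can be sketched as follows. The caption does not state how the bounds of the 95% CI were obtained in z-space; the sketch assumes mean ± 1.96 standard deviations of the z-values across parameters, which is only one possible reading. The function name, the critical value argument, and the clipping constant are likewise illustrative assumptions.

```python
import math
import statistics


def fisher_mean_ci(icc_values, z_crit=1.96, eps=1e-6):
    """Average ICC values in Fisher z-space and back-transform to r-space.

    Assumption: the interval is mean +/- z_crit * SD of the z-values.
    Values are clipped to (-1, 1) because atanh is undefined at +/-1.
    Requires at least two values.
    """
    clipped = [max(-1.0 + eps, min(1.0 - eps, v)) for v in icc_values]
    z = [math.atanh(v) for v in clipped]            # (a) transform to z-space
    z_mean = statistics.fmean(z)                    # (b) mean in z-space
    z_sd = statistics.stdev(z)                      #     spread in z-space
    lower = math.tanh(z_mean - z_crit * z_sd)       # (c) back-transform bounds
    upper = math.tanh(z_mean + z_crit * z_sd)
    return math.tanh(z_mean), lower, upper
```

Averaging in z-space rather than r-space avoids the bias that arises because correlation-like coefficients are bounded and their sampling distribution is skewed near ±1.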
In a post hoc analysis, we inspected which connections were most reliable across the different HCP paradigms. The mean ICC values (averaged across all eight paradigms) revealed a notable pattern of connections that were consistently reliable across paradigms (Figure 1C, left). In particular, when inspecting connections that showed excellent reliability (i.e., ICC > 0.75) in all eight paradigms, we found these connections to primarily link regions such as areas a9-46v, a47r, p47r, and p10p near the frontal pole, AVI and FOP5 in the anterior insula and the frontal operculum, respectively, as well as TE1m and TE2a on the lateral surface of the temporal lobe (Figure 1C, bottom left). These regions map well onto components of the multiple-demands network, which is characterized by showing consistent activation for a number of different cognitive tasks (Assem et al., 2020; Fedorenko et al., 2013).
These results illustrate that stronger connections (both inhibitory and excitatory) inferred by rDCM are more reliable across sessions and, in fact, often achieve good to excellent test-retest reliability (i.e., ICC > 0.6). This is confirmed when directly testing the correlation between the absolute mean (i.e., averaged across all participants) parameter strength and the ICC value of the parameter estimate, both for connectivity parameters (for all paradigms: r ≥ 0.26, all p < 0.001) and driving input parameters, although this was more variable for the latter (range: r = −0.04, p = 0.29 to r = 0.40, p < 0.001).
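The relation between parameter strength and reliability tested above can be illustrated with a short sketch. It takes a participants-by-parameters array of estimates, forms the absolute value of the across-participant mean for each parameter, and correlates it with the parameter-wise ICC values. Which session's estimates enter the mean is not stated in this section, so the input is left generic; all function and variable names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def strength_reliability_correlation(estimates, icc_values):
    """Correlate |mean strength| with ICC across parameters.

    estimates[s][c] is the estimate of parameter c for participant s;
    icc_values[c] is the ICC of parameter c.
    """
    n_sub = len(estimates)
    abs_mean = [
        abs(sum(row[c] for row in estimates) / n_sub)
        for c in range(len(icc_values))
    ]
    return pearson_r(abs_mean, icc_values)
```

Taking the absolute value after averaging treats strongly inhibitory and strongly excitatory connections alike, in line with the statement that the effect holds for both signs.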
As suggested by one of our reviewers, we repeated the above correlation analysis, but now testing for an association between the ICC value of the parameter estimate and the mean (i.e., averaged across all participants) posterior precision of the parameter. In brief, we found the correlation between ICC value and average posterior precision to be significant (for all paradigms: r ≥ 0.17, all p < 0.001). However, this correlation was consistently (across all paradigms) lower than the correlation between ICC value and absolute mean connection strength. For the driving input parameters, this was more variable, showing higher correlation between ICC value and average posterior precision for some paradigms but weaker correlation for other paradigms (range: r = −0.08, p < 0.001 to r = 0.59, p < 0.001).
Regression DCM with sparsity constraints. In a second step, we assessed the test-retest reliability of connectivity estimates obtained using rDCM with embedded sparsity constraints. Overall, the test-retest reliability of parameter estimates from sparse rDCM was lower than for rDCM with fixed network architecture.
When considering all connections, test-retest reliability was on average poor for all paradigms (Figure 2A, left). More specifically, for the connectivity parameters, test-retest reliability ranged from mean ICC = 0.02 [−0.28, 0.33] for the resting state to mean ICC = 0.34 [−0.09, 0.66] for the motor task when considering all connections. Again, we found the test-retest reliability of the intrinsic self-connections to be comparable to the (between-region) connections, ranging from mean ICC = 0.06 [−0.28, 0.39] for the resting state to mean ICC = 0.39 [−0.31, 0.82] for the emotional processing task. Similarly, for the driving input parameters, test-retest reliability ranged from poor for the motor task (mean ICC = 0.11 [−0.24, 0.44]) to fair for the relational processing task (mean ICC = 0.40 [−0.16, 0.76]).
In a next step, we again tested whether stronger connections were more reliable. Focusing only on connections that deviated significantly from zero (p < 0.05, Bonferroni-corrected), we again observed a shift towards higher reliability (Figure 2A, middle), although less pronounced than for rDCM with fixed network architecture. For sparse rDCM, reliability of the significant connectivity parameters ranged on average from poor for the resting state (mean ICC = 0.16 [−0.32, 0.57]) to fair for the emotional processing task (mean ICC = 0.44 [−0.18, 0.81]). The same pattern could be observed for the significant driving input estimates. Finally, when restricting our reliability analysis even further to the top 1,000 connections, we found the shift towards higher reliability to be even more pronounced, with one exception: the resting state (Figure 2A, right). Specifically, even for the top 1,000 connections, we found poor reliability for the resting state (mean ICC = 0.05 [−0.29, 0.38]), whereas for all task-based fMRI datasets, test-retest reliability was considerably increased when considering only the top 1,000 connections (e.g., mean ICC = 0.66 [0.25, 0.86] for the emotional processing task). A comprehensive list of all results from the test-retest reliability analysis for sparse rDCM is provided in Table 3.

In a post hoc analysis, we again inspected which connections were most reliable across the different HCP paradigms. Inspecting the mean (averaged across all paradigms) ICC values revealed a similar pattern for sparse rDCM as observed above for classical rDCM, although with somewhat lower mean ICC values (Figure 2B, left). For example, no connections were found that showed excellent reliability in all eight paradigms. However, when inspecting those connections that showed excellent reliability in at least six of the eight paradigms, we observed a pattern that was highly consistent with the one obtained using rDCM with fixed network architecture (see above). Specifically, these connections again primarily linked regions that had previously been identified with the multiple-demands network, such as areas p10p near the frontal pole, AVI and FOP5 in the anterior insula and the frontal operculum, respectively, as well as TE1m and TE2a on the lateral surface of the temporal lobe (Figure 2B, bottom left).

Figure 2. (A) Estimates of the probability density functions (using the nonparametric kernel smoothing of fitdist.m implemented in MATLAB) of the connection-wise intraclass correlation coefficient (ICC) for the resting state and all 7 tasks (i.e., emotional processing, gambling, language, motor, relational processing, social cognition, and working memory) for the Glasser atlas (see Supporting Information Figure S2 for the respective results of the Schaefer atlas). Results are shown when considering all connections (left), significant connections (middle), and the top 1,000 connections (right). (B) Mean (averaged across all paradigms) test-retest reliability for all connections (top, left) as well as how often (i.e., in how many paradigms) a connection showed excellent reliability (bottom, left). Mean test-retest reliability projected onto the cortical surface (top, middle) and the cortical location of all regions that are linked via connections that show excellent reliability in at least 6 paradigms (bottom, middle). Connectogram showing the connections with excellent reliability in at least 6 paradigms (right). The connectogram was produced using Circos (publicly available at https://circos.ca/software/).

Table 3. Test-retest reliability of model parameter estimates for regression DCM and functional connectivity (sparsity constraints). Test-retest reliability of parameter estimates was assessed in terms of the intraclass correlation coefficient (ICC) between estimates of Session 1 ("test") and Session 2 ("retest") for sparsity constraints (i.e., rDCM with sparsity constraints and L1-regularized partial correlations). Here, we report the mean (averaged across parameters) ICC value and 95% confidence interval (CI). Averaging of the connection-wise ICC values as well as computing the 95% CI was done in z-space (see caption of Table 2 for details). Test-retest reliability is reported for the connectivity and driving input estimates of rDCM (middle) as well as for the functional connectivity estimates (right). For both methods, results are shown for all HCP paradigms for the Glasser atlas (see Supporting Information Table S2 for the respective results of the Schaefer atlas). Furthermore, results are shown for (a) all parameters (top row), (b) significant parameters (middle row), and (c) top 1,000 parameters (bottom row).
Again, these results illustrate that stronger parameters (both inhibitory and excitatory) are more reliable than weaker parameters. This observation was confirmed when explicitly testing the correlation between the mean (i.e., averaged across all participants) absolute parameter strength and the ICC values of the parameter estimate, both for connectivity strengths (resting state: r = 0.01, p < 0.001; for all task paradigms: r ≥ 0.18, p < 0.001) and for driving inputs, although this was again more variable for the latter (range: r = 0.09, p = 0.01 to r = 0.39, p < 0.001).
Furthermore, following the suggestion by one of our reviewers, we also tested for an association between the ICC value and the mean (i.e., averaged across all participants) posterior precision of the parameter. These results were highly consistent with the results obtained for rDCM with fixed network architecture. More precisely, for the connectivity parameters, we found the correlation between ICC value and average posterior precision to be significant for all task paradigms (r ≥ 0.06, all p < 0.001). However, the correlation became marginally negative for the resting state (r = −0.01, p = 0.001). Furthermore, this correlation was consistently (across all paradigms) lower than the correlation between ICC value and absolute mean connection strength. For the driving input parameters, the constellation was more variable, showing higher correlation between ICC value and average posterior precision for some paradigms but weaker correlation for other paradigms (range: r = 0.03, p = 0.113 to r = 0.50, p < 0.001).
In summary, our results indicate that, for the present datasets, connectivity estimates obtained using sparse rDCM were less reliable than those obtained using rDCM with fixed network architecture (see the Discussion section for potential explanations). For resting-state data, test-retest reliability of sparse rDCM was poor, even when focusing on strong connections. Conversely, for the driving input estimates, test-retest reliability was comparable across the two rDCM variants.
Comparison to functional connectivity. For comparison with rDCM, we investigated the test-retest reliability of functional connectivity estimates obtained using Pearson correlations and L1-regularized partial correlations.
First, we compared results from rDCM with fixed network architecture to Pearson correlations (Figure 3A). We found that the two methods showed similar test-retest reliability when considering all model parameters (Figure 3A, left). Specifically, test-retest reliability of Pearson correlations ranged from mean ICC = 0.16 [−0.25, 0.53] for the resting state to mean ICC = 0.38 [−0.06, 0.69] for the language task. Interestingly, when focusing on stronger connections, Pearson correlations did not show the same improvement previously observed for rDCM; instead, test-retest reliability remained mostly poor (or fair at best). More specifically, when focusing only on significant parameter estimates, reliability ranged from mean ICC = 0.14 [−0.31, 0.54] for the resting state to mean ICC = 0.42 [−0.15, 0.78] for the language task (Figure 3A, middle). Similarly, when restricting the analysis to the top 1,000 connections, reliability ranged from mean ICC = 0.22 [−0.36, 0.68] for the resting state to mean ICC = 0.45 [−0.38, 0.88] for the working-memory task (Figure 3A, right). A comprehensive list of all results from the test-retest reliability analysis is provided in Table 2 (right column).
Second, we compared sparse rDCM to L1-regularized partial correlations (Figure 3B). Interestingly, we found test-retest reliability of L1-regularized partial correlations to be on average close to zero for all paradigms when considering all connectivity parameters (Figure 3B, left). Specifically, test-retest reliability ranged from mean ICC = 0.07 [−0.41, 0.52] for the relational processing task to mean ICC = 0.14 [−0.38, 0.60] for the resting state. While this improved when focusing on stronger connections, test-retest reliability remained relatively low for L1-regularized partial correlations and, in most cases, worse than for sparse rDCM. More specifically, when focusing only on significant parameters, reliability ranged from mean ICC = 0.30 [−0.07, 0.59] for the emotional processing task to mean ICC = 0.50 [0.02, 0.79] for the resting state (Figure 3B, middle). Similarly, when restricting the analysis to the top 1,000 connections, reliability ranged from mean ICC = 0.31 [−0.10, 0.63] for the emotional processing task to mean ICC = 0.55 [0.11, 0.81] for the resting state (Figure 3B, right). A comprehensive list of all results from the test-retest reliability analysis is provided in Table 3 (right column).

Table 4. Across-session consistency of connectivity profiles for regression DCM and functional connectivity. Consistency of the entire connectivity profile across the two sessions. Identification accuracies are reported for predicting identity in Session 2 from Session 1 (top) and vice versa (bottom). Results are reported for rDCM (middle) and for functional connectivity estimates (right). For both methods, results are shown for all HCP paradigms as well as for the two whole-brain parcellation schemes (i.e., Glasser, Schaefer). Furthermore, results are reported for two different "modes" of estimation (see main text for details): (a) fixed network architecture (i.e., classical rDCM and Pearson correlation coefficient), and (b) sparsity constraints (i.e., rDCM with sparsity constraints and L1-regularized partial correlations).

Similarity Analysis: Inter-session Consistency of Whole-Brain Effective Connectivity Patterns

Regression DCM with fixed (fully connected) network architecture. In a final analysis, we shifted the focus from reliability of separate connections to the consistency of the whole-brain effective connectivity profile across time. To this end, we asked whether the effective connectivity profile of an individual person obtained in one session could be used to identify this individual from the set of all effective connectivity profiles obtained in another session. This analysis follows previous work demonstrating that functional connectivity profiles derived from fMRI data enable the identification of individual subjects (Finn et al., 2015).
First, we assessed identification accuracies for the whole-brain effective connectivity patterns inferred using rDCM with fixed network architecture (chance level: 1/N_sub × 100%, ranging from 2.3% to 2.4%, depending on the number of subjects available in each task). Overall, entire effective connectivity profiles were highly consistent across the two sessions and enabled identification of individual participants with high accuracies. More specifically, when predicting identity in Session 2 from Session 1 (S1 ➔ S2), identification accuracies ranged from 80.5% (33/41) for the emotional processing task to 100% (44/44) for the social cognition task. Similarly, when predicting identity in Session 1 from Session 2 (S2 ➔ S1), identification accuracies ranged from 78.0% (32/41) for the emotional processing task to 100% (44/44) for the social cognition task. Results were almost identical for the Schaefer parcellation. All of the identification accuracies were statistically significant at p < 0.05 (Bonferroni-corrected for multiple comparisons), as assessed using permutation testing (see the Methods section). A comprehensive list of all identification accuracies is provided in Table 4 (middle column, top).
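The identification procedure can be sketched as follows: each participant's profile from one session is compared with all profiles from the other session, and the predicted identity is the participant with the most similar profile. The similarity measure is not restated in this section; the sketch assumes the Pearson correlation between vectorized connectivity profiles, in the spirit of Finn et al. (2015). All function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def identification_accuracy(database, targets):
    """Fraction of targets whose best match in the database is the same person.

    database[i] and targets[i] are the vectorized connectivity profiles of
    participant i in the two sessions (assumed similarity: Pearson r).
    """
    hits = 0
    for i, target in enumerate(targets):
        sims = [pearson_r(target, ref) for ref in database]
        predicted = max(range(len(sims)), key=sims.__getitem__)
        if predicted == i:
            hits += 1
    return hits / len(targets)


def chance_level_percent(n_sub):
    """Chance level of the identification, 1/N_sub x 100%."""
    return 100.0 / n_sub
```

Calling `identification_accuracy(session1, session2)` corresponds to predicting S1 ➔ S2 (Session 1 as database, Session 2 as targets); swapping the arguments gives S2 ➔ S1. With 41 and 44 participants, `chance_level_percent` yields roughly 2.4% and 2.3%, respectively.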
Regression DCM with sparsity constraints. Second, identification accuracies were assessed for sparse rDCM. Again, the sparse whole-brain effective connectivity profiles were highly consistent across the two sessions and allowed identification of individual participants with high accuracies, with the notable exception of the resting state. More specifically, for the resting state, identification accuracies were around 50% (i.e., 47.6% when predicting S1 ➔ S2, and 54.8% when predicting S2 ➔ S1); please see Table 4 for details. For task-based data, identification accuracies were considerably higher. Specifically, when predicting S1 ➔ S2, identification accuracies ranged from 78.0% (32/41) for the emotional processing task to 100% (44/44) for the social cognition task. Similarly, when predicting S2 ➔ S1, identification accuracies ranged from 80.5% (33/41) for the emotional processing task to 100% (44/44) for the social cognition task.

Comparison to functional connectivity. Finally, we compared identification accuracies between rDCM and functional connectivity estimates obtained using Pearson correlation and L1-regularized partial correlations. Overall, we found that functional connectivity profiles also enabled identification of individual participants with high accuracies (Table 4). There was one notable exception: connectivity during the resting state, as characterized by Pearson correlation coefficients. More specifically, for this setting, identification accuracies were 31.0% (13/42) when predicting S1 ➔ S2, and 21.4% (9/42) when predicting S2 ➔ S1. This is in contrast to previous reports by Finn et al. (2015); for potential explanations of these inconsistencies, please see the Discussion section. For all other settings, identification accuracies of functional connectivity profiles were high and even surpassed those reported for rDCM in some cases, particularly when using sparsity constraints. Again, all identification accuracies (even for the resting state in combination with Pearson correlations) were statistically significant at p < 0.05 (Bonferroni-corrected), as assessed using permutation testing.
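The permutation test itself is described in the Methods section and is not reproduced here. As a rough illustration of the general idea only, the sketch below builds a null distribution of identification accuracies by shuffling the identity labels of the targets and counts how often the shuffled accuracy reaches the observed one. The input format (a precomputed similarity matrix), the number of permutations, the p-value formula with the +1 correction, and all names are assumptions and may differ from the procedure used in the study.

```python
import random


def accuracy_from_similarity(sim, labels):
    """sim[i][j]: similarity of target i to database profile j;
    labels[i]: database index regarded as the true identity of target i."""
    hits = 0
    for i, row in enumerate(sim):
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == labels[i]:
            hits += 1
    return hits / len(sim)


def permutation_p_value(sim, n_perm=1000, seed=0):
    """Observed identification accuracy and its permutation p-value.

    Assumption: the null distribution is generated by randomly reassigning
    identities to the targets; p = (count + 1) / (n_perm + 1).
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    true_labels = list(range(len(sim)))
    observed = accuracy_from_similarity(sim, true_labels)
    count = 0
    for _ in range(n_perm):
        shuffled = true_labels[:]
        rng.shuffle(shuffled)
        if accuracy_from_similarity(sim, shuffled) >= observed:
            count += 1
    return observed, (count + 1) / (n_perm + 1)
```

With several paradigms tested, the resulting p-values would additionally need to be compared against a Bonferroni-corrected threshold, as stated in the text.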

DISCUSSION
In this paper, we assessed the test-retest reliability and group-level consistency of connection strengths inferred from fMRI data using rDCM (Frässle, Lomakina, Kasper, et al., 2018; Frässle, Lomakina, Razi, et al., 2017). First, using two different whole-brain parcellations, we demonstrated that rDCM provides highly consistent parameter estimates at the group level across two sessions of the HCP dataset, regardless of the exact paradigm. Second, we found, on average, relatively low test-retest reliability when considering all connections. However, stronger connections were more reliable, with many strong connections displaying good to excellent test-retest reliability (ICC ≥ 0.6); see Table 2. When comparing this to the test-retest reliability of measures of functional connectivity, rDCM performed favorably, in particular when focusing on strong connections (see Figure 3). While these observations hold for both variants of rDCM, we found test-retest reliability to be considerably higher for rDCM with fixed network architecture as compared with rDCM with sparsity constraints.
The increase in reliability with higher connection strengths is worth emphasizing. For example, when restricting the analysis to the top 1,000 connections, we found for all task-based datasets on average good test-retest reliability (see Table 2). This suggests that those connections representing meaningful effects can be reliably inferred using rDCM. These observations are consistent with previous analyses of test-retest reliability in the context of classical DCM for fMRI. For instance, Frässle, Paulus, et al. (2016) assessed test-retest reliability of effective connectivity in small (six-region) networks of the core face perception system. While finding fair to good reliability of parameter estimates on average, they observed a similar trend of increased reliability for larger parameter estimates. Our results are also in line with other reports on the test-retest reliability of classical DCM (Frässle, Stephan, et al., 2015; Rowe et al., 2010; Schuyler et al., 2010) and spectral DCM (Almgren et al., 2018), all conducted in the context of much smaller networks than the ones considered here. Furthermore, the observed increase in test-retest reliability with connection strength is not exclusive to DCMs. For instance, a similar increase of test-retest reliability with effect size has also been observed in conventional fMRI analyses (Caceres et al., 2009).
Interestingly, this pattern of increased test-retest reliability for stronger connections was less pronounced for functional connectivity estimates (Figure 3). Test-retest reliability estimates based on Pearson correlations and L1-regularized partial correlations also showed an increase of ICC values for greater connection strength, in line with previous studies of functional connectivity (for a review, see Noble et al., 2019). However, this increase was only moderate and the average test-retest reliability remained poor to fair, even for the strong connections.
With regard to the test-retest reliability of rDCM, two further observations are worth highlighting. First, we found connectivity estimates from task-based fMRI data to be consistently more reliable than those from resting-state fMRI data. This is remarkable given that resting-state measurements were considerably longer than task measurements, with longer scanning sessions typically being associated with increased reliability (Birn et al., 2013; Noble et al., 2017). More specifically, while (per session) approximately 1 hr of resting-state fMRI data were collected (combined across the phase-encoding directions), task-based fMRI data comprised just a couple of minutes. Despite these very short scanning sessions, task-based fMRI exhibited superior reliability compared with resting-state data. These observations are in line with previous reports demonstrating higher test-retest reliability for functional connectivity patterns derived from task-based as compared with resting-state fMRI data (Noble et al., 2019; Wang et al., 2017). Furthermore, our results are also consistent with findings suggesting that connectivity patterns derived from task-based fMRI are more predictive of individual traits (Greene et al., 2020; Greene et al., 2018). This indicates that, despite its patient-friendly nature, the resting state may not be ideally suited for clinical settings, since its test-retest reliability is considerably lower than that of task-based fMRI, even at much longer scanning times.
Second, we found connectivity estimates by rDCM to be more reliable when assuming a fixed (fully connected) network architecture as compared with relying on embedded sparsity constraints. This was surprising given that sparsity constraints prevent overfitting and should thus increase generalizability of parameter estimates. Having said this, previous simulations have shown that rDCM with sparsity constraints is even more demanding in terms of data quality than rDCM with fixed network architecture (Frässle, Lomakina, Kasper, et al., 2018). More specifically, we have demonstrated that for low signal-to-noise ratio (SNR) or long repetition time (TR) settings, rDCM with sparsity constraints tends to yield overly sparse connectivity matrices that result from a propensity to prune existing connections (Frässle, Lomakina, Kasper, et al., 2018). This may explain the diminished test-retest reliability observed in the current study, in the sense that a weak connection may be pruned in one session but retained in the other.
Finally, moving from assessments of individual connections to whole-brain patterns, we demonstrate that the entire connectivity profile (i.e., the whole-brain "connectivity fingerprint") of individuals is highly consistent across the two sessions, for both effective (rDCM) and functional connectivity measures. We show that, in many cases, it is possible to identify an individual among all participants with close to perfect accuracy based on the inferred connectivity pattern. This is consistent with a previous study demonstrating the identifiability of single subjects from functional connectivity measures (Finn et al., 2015), as well as similar reports (Cole et al., 2014; Horien et al., 2019; Noble et al., 2017; Pannunzi et al., 2017; Smith et al., 2009). Interestingly, we found that one particular combination (i.e., resting state and Pearson correlations) yielded relatively low (yet still significant) identification accuracies. This is in contrast to the previous report by Finn et al. (2015). These differences may be due to a number of reasons, including differences in (a) the exact dataset, (b) preprocessing strategy, or (c) whole-brain parcellation scheme. Despite this discrepancy, our results support the idea that individual participants may possess a unique whole-brain connectivity profile for a given cognitive context. This underscores the exciting opportunities of whole-brain connectivity assessments for studying individual variability of brain networks and how this relates to cognitive phenotypes in health and disease.
Importantly, we show that all three metrics considered (group-level consistency and test-retest reliability of individual connections, as well as whole-brain connectivity profiles) are almost identical for two state-of-the-art parcellation schemes, that is, the Glasser parcellation (HCP MMP 1.0) and the Schaefer 400-node parcellation (Schaefer et al., 2018). This is important because inference on the organizational principles of the brain has been shown to depend on the exact parcellation scheme utilized for defining the nodes of the network (Fornito et al., 2010; Fornito et al., 2016). Consequently, it is critical to verify that any conclusions drawn from connectivity estimates are not dependent on this particular choice. Here, we demonstrate that the reliability and consistency of whole-brain effective connectivity estimates obtained using rDCM (as well as those for functional connectivity measures) generalize across the two parcellation schemes. Notably, these two parcellation schemes focus on the cortex and do not cover the cerebellum and subcortical regions. The latter structures, in particular subcortical regions, are usually characterized by diminished signal-to-noise ratio of the fMRI signal. Hence, it remains to be tested whether the reliability results reported here generalize to parcellation schemes that include subcortical structures, like the Automated Anatomical Labeling (AAL) atlas (Tzourio-Mazoyer et al., 2002).
Our findings have important implications for the fields of human connectomics and network neuroscience in general, as well as for the clinically oriented disciplines of computational psychiatry and computational neurology in particular. Especially for the latter two, test-retest reliability of a computational model is important for its clinical utility, particularly when longitudinal measurements are required (e.g., monitoring of treatment response). Here, we showed that rDCM provides good test-retest reliability when focusing on strong connections and enables identification of individual participants with high accuracy based on the entire connectivity profile. Importantly, rDCM shows high reliability even for very short scanning sessions of 3-4 min when working with task-based fMRI data. This is important for potential clinical applications.
In summary, our systematic analyses indicate that, in many constellations, rDCM exhibits good properties with regard to group-level consistency and test-retest reliability of connections, as well as the inter-session consistency of whole-brain connectivity patterns. This complements previous methodological assessments of face and construct validity of rDCM (Frässle, Lomakina, Kasper, et al., 2018; Frässle, Lomakina, Razi, et al., 2017; Frässle, Manjaly, et al., 2021) and underscores its potential for clinical applications. Its ability to obtain reliable estimates of directed whole-brain connectivity may enable the construction of computational assays for identifying pathophysiological mechanisms and for predictions about individual treatment responses or clinical trajectories (Frässle, Marquand, et al., 2020), a possibility that we will examine in future studies.

CODE AND DATA AVAILABILITY
A MATLAB implementation of the regression dynamic causal modeling (rDCM) approach is available as open-source code in the Translational Algorithms for Psychiatry-Advancing Science (TAPAS) software package (https://www.translationalneuromodeling.org/tapas). Furthermore, we will publish the code for the analysis as well as the source data files for figures and tables online as part of an online repository that conforms to the FAIR (Findable, Accessible, Interoperable, and Reusable) data principles (https://gitlab.ethz.ch/tnu/code/fraessleetal_rdcm_test_retest). Additionally, the raw data are openly available from the HCP website, which also conforms to the FAIR principles.