Inferring network properties from time series using transfer entropy and mutual information: Validation of multivariate versus bivariate approaches

Functional and effective networks inferred from time series are at the core of network neuroscience. Interpreting properties of these networks requires inferred network models to reflect key underlying structural features. However, even a few spurious links can severely distort network measures, posing a challenge for functional connectomes. We study the extent to which micro- and macroscopic properties of underlying networks can be inferred by algorithms based on mutual information and bivariate/multivariate transfer entropy. The validation is performed on two macaque connectomes and on synthetic networks with various topologies (regular lattice, small-world, random, scale-free, modular). Simulations are based on a neural mass model and on autoregressive dynamics (employing Gaussian estimators for direct comparison to functional connectivity and Granger causality). We find that multivariate transfer entropy captures key properties of all network structures for longer time series. Bivariate methods can achieve higher recall (sensitivity) for shorter time series but are unable to control false positives (lower specificity) as available data increases. This leads to overestimated clustering, small-world, and rich-club coefficients, underestimated shortest path lengths and hub centrality, and fattened degree distribution tails. Caution should therefore be used when interpreting network properties of functional connectomes obtained via correlation or pairwise statistical dependence measures, rather than more holistic (yet data-hungry) multivariate models.


INTRODUCTION
Networks inferred via multivariate TE are proven to converge to the underlying structure for sufficiently long time series (Runge, 2018; Sun et al., 2015). In gaining an understanding of these multivariate effective connectivity inference algorithms, it is important to validate that they perform to this expectation where it is applicable, and to investigate how that performance varies with factors such as sample size. In doing so, we also address the recent call for more extensive model diversity in testing multivariate algorithms: "To avoid biased conclusions, a large number of different randomly selected connectivity structures should be tested [including link density as well as properties such as small-worldness]" (Runge, 2018, p. 13).
Outside of these idealized circumstances, though, we can no longer make a clear general hypothesis about how the effective network models are expected to reflect the underlying structure; yet a successful validation gives confidence that the directed statistical relationships they represent remain accurate as an effective network model at the microscale. Furthermore, it is desirable for not only effective networks but also functional networks to recognize important features of the underlying network structure: to reliably track overall regime changes in the macroscopic structure, and to reflect the mesoscopic properties of distinctive nodes (or groups of nodes) in the structure. This desire applies whether or not the inference is made under idealized conditions.
This motivates our validation study, which is primarily conducted under idealized conditions as above and based on synthetic datasets involving ground truth networks of 100-200 nodes with different topologies, from regular lattice to small-world, random, scale-free, and modular. Many of these structural properties are incorporated in the macaque connectomes analyzed in the last section. At the macroscale, we measure several fundamental and widely used properties, including shortest path length, clustering coefficient, small-world coefficient, betweenness centrality, and features of the degree distributions (Rubinov & Sporns, 2010). These properties of the inferred network models are compared with those of the real underlying structural networks in order to validate and benchmark different inference algorithms in terms of their ability to capture the key properties of the underlying topologies. At the microscale, the performance is assessed in terms of precision, recall, and specificity of the inferred model in classifying the links of the underlying structural network. As above, while we do not expect all approaches to strictly capture the microscale features, these results help to explain their performance at the macroscale.
For most of our experiments, the time series of node activity on these networks are generated by vector autoregressive (VAR) dynamics, with linearly coupled nodes and Gaussian noise. Both the VAR process and the inference algorithms are described in detail in the Methods section, where we also discuss how MI and the magnitude of Pearson correlation are equivalent for stationary VAR processes. This implies that the undirected networks obtained via the bivariate MI algorithm are equivalent to the widely employed undirected functional networks obtained via correlation, extending the implications of our results beyond information-theoretic methods. Further, our results based on TE extend to Granger causality, which is equivalent to TE for stationary VAR processes (Barnett, Barrett, & Seth, 2009). Networks inferred using bivariate TE are typically referred to as directed functional networks, to emphasize the directed nature of their links. The extension to multivariate TE for effective network inference can also be viewed as an extension to multivariate Granger causality for the stationary VAR processes here.
We find that multivariate TE performs better on all network topologies at all scales, for longer time series. Bivariate methods can achieve better recall with limited amounts of data (shorter time series) in some circumstances, but their precision and ability to control false positives are neither consistent nor predictable a priori. On the other hand, thanks to recent statistical improvements, multivariate TE guarantees high specificity and precision regardless of the amount of data available, and its recall steadily increases with more data. We discuss how the mesoscopic properties of the underlying structural network, particularly its network motifs, can influence the precision and recall of the model at the microscale. In turn, we show how the performance at the microscale affects the inferred network properties at the macroscale. We observe that bivariate methods are often unable to capture the most distinctive topological features of the networks under study (including path length, clustering, degree distribution, and modularity), largely because of their inability to control false positives at the microscale.
Our final section moves beyond validation under idealized conditions, extending the experiments to time series from a neural mass model run on the CoCoMac connectome. Although complexities such as nonlinear interactions and subsampled time series somewhat reduce the performance of the methods, the superior performance of multivariate TE aligns with the results from the validation in idealized conditions. While further and more wide-ranging experiments are required, our experiments provide substantial evidence for the validity of multivariate TE in providing effective network models, which retain meaningful network insights in more realistic conditions.

Generating Dynamics on Networks
Two models are used to generate time series dynamics on networks of coupled variables. Vector autoregressive processes are employed for validation studies under idealized conditions, and a neural mass model is used on the weighted CoCoMac connectome as a final investigation going beyond these conditions. The simulation and analysis pipeline is illustrated in Figure 1.

Networks of linearly coupled Gaussian variables.
We simulate discrete-time, stationary, first-order vector autoregressive (VAR) processes on underlying structural networks of N nodes. A VAR process is described by the recurrence relation

Z(t) = Z(t-1) C + ε(t),    (1)

where Z(t) is a row vector and Z_i(t) is the activity of node i at time t. The Gaussian noise ε(t) is spatially and serially uncorrelated, with standard deviation θ = 0.1. The N × N weighted adjacency matrix C = [C_ij] describes the network structure, where C_ij is the weight of the directed connection from node i to node j. These dynamics are generated on various network topologies, as detailed in the following sections. The choice of the weights C_ij (also detailed in the following sections) guarantees the stability of the system, which is a sufficient condition for stationarity (Atay & Karabacak, 2006). Since stationary VAR processes have multivariate Gaussian distributions, the information-theoretic measures we use can be directly related to Pearson correlation and Granger causality (Granger, 1969). The simple VAR dynamics are chosen as the primary model for our validation studies, instead of nonlinear alternatives, because the main goal is not to prove the superiority of nonlinear dependence measures on nonlinear systems, which has been shown elsewhere (Novelli et al., 2019). Rather, we aim to show that, even on linearly coupled Gaussian variables (perfectly suitable to be studied via cross-correlations), multivariate approaches are better able to infer the macroscopic network properties. In addition, the VAR dynamics are amenable to investigation using the faster Gaussian estimator for the information-theoretic measures, allowing us to carry out more extensive simulations over a wider range of parameters.
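As a concrete illustration (a sketch, not the simulation code used in this study), the following simulates a first-order VAR process following the recurrence Z(t) = Z(t-1)C + ε(t) on a directed ring network with self-loops; the ring parameters anticipate the Watts-Strogatz setup described later, and the seed is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Directed ring of N nodes: each node receives links from two neighbours
# on each side plus a self-loop, all with weight 0.15 (illustrative values
# matching the Watts-Strogatz setup described in Numerical Simulations).
N, T, w = 100, 10_000, 0.15
C = np.zeros((N, N))
for i in range(N):
    C[i, i] = w                   # self-loop C_ii
    for off in (-2, -1, 1, 2):    # two neighbours on each side
        C[(i + off) % N, i] = w   # C[i, j] = weight of link i -> j

# First-order VAR: Z(t) = Z(t-1) C + eps(t), with i.i.d. Gaussian noise
# of standard deviation theta = 0.1 (spatially and serially uncorrelated).
theta = 0.1
Z = np.zeros((T, N))
for t in range(1, T):
    Z[t] = Z[t - 1] @ C + theta * rng.standard_normal(N)
```

The total in-weight per node is 5 × 0.15 = 0.75 < 1, which keeps the system stable and hence stationary.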
Neural mass model on the CoCoMac connectome.
To provide an extension beyond the linear VAR dynamics, neural activity in various brain regions is modeled (following Shine, Aburn, Breakspear, & Poldrack, 2018; Li et al., 2019) as an oscillating two-dimensional neural mass model derived by mode decomposition from the FitzHugh-Nagumo single neuron model (FitzHugh, 1961). As previously presented (Li et al., 2019), the CoCoMac connectome (Kötter, 2004) is used to provide directed coupling between 76 regions, with axonal time delays between these regions based on the length of fiber tracts as estimated by diffusion spectrum imaging (Sanz Leon et al., 2013). Data provided by Li et al. (2019) was simulated using the open-source framework The Virtual Brain (Sanz Leon et al., 2013), using code implementing the model, freely available at https://github.com/macshine/gain_topology.
Langevin equations (Equation 2) specify the neural mass model, via the dynamics of the local mean membrane potential (V) and the slow recovery variable (W) at each regional node i. In these equations, ξ_i and η_i are independent standard Wiener noises, and I_i is the synaptic current, with C_ji indicating the connection weight from j to i and incorporating the time delays τ_ji from j to i (estimated as described above). The CoCoMac connectome network contains 1,560 directed connections (including 66 self-links), with τ_ji on non-self links having an average of 19.8 ms (standard deviation 8.32 ms). The membrane potentials V_i are converted to normalized firing rates S_i via a sigmoid activation function, with parameter m = 1.5 chosen to align the sigmoid with its typical input. The parameters for gain σ = 0.5 (in Equation 4) and excitability γ = 0.3 (in Equation 2) are selected to simulate activity in the integrated regime of dynamics identified in these prior studies.
Finally, the time series of membrane voltage V_i(t) (originally obtained with a 0.5-ms temporal resolution via stochastic Heun integration) are subsampled at 15 ms (selected as half the median time for the autocorrelation functions to decay to 1/e).

Network Inference Algorithms
As illustrated in Figure 1, three algorithms are employed to infer network models from the time series, using the IDTxl Python package (the Information Dynamics Toolkit xl, an open-source package for network inference, freely available on GitHub at https://github.com/pwollstadt/IDTxl):

Bivariate mutual information for functional connectivity.
Mutual information (MI) is computed between all pairs of nodes independently, in a bivariate fashion, and only the measurements that pass a strict statistical significance test (described below) are interpreted as undirected links.
MI is a measure of statistical dependence between random variables (Cover & Thomas, 2005), introduced by Shannon in laying the foundations of information theory (Shannon, 1948). Formally, the MI between two continuous random variables X and Y with joint probability density function μ(x, y) and marginal densities μ_X(x) and μ_Y(y) is defined as

I(X; Y) = ∫∫ μ(x, y) log [ μ(x, y) / (μ_X(x) μ_Y(y)) ] dx dy,

where the integral is taken over the set of pairs (x, y) such that μ(x, y) > 0. The strength of MI lies in its model-free nature, meaning that it does not require any assumptions on the distribution of the variables (e.g., Gaussianity). Being able to capture nonlinear relationships, MI is typically presented as a generalized version of the Pearson correlation coefficient. However, for the VAR processes considered here (Equation 1), which have stationary multivariate Gaussian distributions, the MI between two variables X and Y is completely determined by the magnitude of their Pearson correlation coefficient ρ (Cover & Thomas, 2005):

I(X; Y) = -(1/2) ln(1 - ρ²).

Crucially, this one-to-one relationship between MI and the absolute value of ρ (for VAR processes) implies that the networks inferred via the bivariate MI algorithm are equivalent to the functional networks obtained via cross-correlation, which are widely employed in neuroscience. This equivalence persists whenever a Gaussian estimator for MI is used (which models the processes as VAR), even for nonlinear dynamics, as is the case in our experiments. Differences may lie in how the raw MI values are transformed into a network structure: early approaches often used a fixed threshold aimed at obtaining a prescribed link density, while the bivariate MI algorithm used here adopts an adaptive threshold (different for each link) to meet a desired statistical significance level.
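This equivalence can be checked numerically. The sketch below (illustrative only; correlation value and seed are arbitrary) draws two correlated Gaussian variables and compares the Gaussian MI estimate -(1/2) ln(1 - ρ̂²) against the analytical value:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two jointly Gaussian variables with known correlation rho.
rho, T = 0.6, 200_000
x = rng.standard_normal(T)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(T)

# Gaussian MI estimator: I(X;Y) = -1/2 * ln(1 - rho_hat^2)
rho_hat = np.corrcoef(x, y)[0, 1]
mi_hat = -0.5 * np.log(1 - rho_hat**2)

# Analytical MI for the true correlation.
mi_true = -0.5 * np.log(1 - rho**2)
```

For ρ = 0.6 the analytical MI is about 0.22 nats, and the estimate converges to it as T grows.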
The statistical significance is computed via null hypothesis testing to reflect the probability of observing a larger MI from the same samples if their temporal relationship were destroyed (the p value is obtained from a chi-square test, as summarized in Lizier, 2014). The critical level for statistical significance is set to α = 0.01/N, where N is the network size. This produces a Bonferroni correction for the inference of parent nodes for each target (i.e., for each target, there is a 0.01 chance under the null hypothesis that at least one spurious parent node is selected, assuming independent sources). A Bonferroni correction is a strategy for controlling the error rate when multiple comparisons are performed: the significance threshold is divided by the number of comparisons.
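As a hedged sketch of this procedure (IDTxl's implementation differs in detail), the following uses the asymptotic null distribution 2T·Î ~ χ²(1) for the Gaussian MI estimator (the chi-square form summarized in Lizier, 2014), with a Bonferroni-corrected threshold:

```python
import math
import numpy as np

rng = np.random.default_rng(2)

def gaussian_mi_pvalue(x, y):
    """p value for H0: I(X;Y) = 0, using the asymptotic result that
    2*T*I_hat ~ chi-square(1) for the Gaussian MI estimator."""
    T = len(x)
    rho = np.corrcoef(x, y)[0, 1]
    mi = -0.5 * np.log(1 - rho**2)
    stat = 2 * T * mi
    return math.erfc(math.sqrt(stat / 2))  # chi2(1) survival function

# Bonferroni-corrected critical level for a network of N nodes.
N = 100
alpha = 0.01 / N

# Independent series: MI should not be deemed significant ...
p_null = gaussian_mi_pvalue(rng.standard_normal(5000), rng.standard_normal(5000))
# ... while strongly coupled series should pass the test easily.
x = rng.standard_normal(5000)
p_dep = gaussian_mi_pvalue(x, 0.5 * x + rng.standard_normal(5000))
```

A link is inferred only when the p value falls below α = 0.01/N.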

Bivariate transfer entropy for directed functional connectivity.
Transfer entropy (TE) is computed between all pairs of nodes independently, in a bivariate fashion, and only the measurements that pass a strict statistical significance test (described below) are interpreted as links. TE is a model-free measure of statistical dependence between random variables (Schreiber, 2000); however, unlike MI and cross-correlation, it is a directed rather than symmetric measure (i.e., the TE from a source node X to a target node Y is not necessarily the same as the TE from Y to X), and it specifically considers information about the dynamic state updates of the target Y. Thus, employing TE has the advantage of generating directed networks and providing a more detailed model of the dynamics of the system under investigation. Formally, the TE from a source stochastic process X to a target process Y is defined as (Schreiber, 2000)

T_{X→Y}(t) := I(X_{t-1}; Y_t | Y_{<t}),

where I(X_{t-1}; Y_t | Y_{<t}) is the conditional mutual information (Cover & Thomas, 2005) between the previous sample X_{t-1} of the source and the next sample Y_t of the target, in the context of (the vector of) the target's past values Y_{<t}. The directed and dynamic nature of TE derives specifically from taking the past of the target into account when measuring the lagged statistical dependence between X and Y.
In practice, in order to estimate the TE from time series, Y_{<t} is usually constructed as an embedding vector (Takens, 1981), and a maximum lag must be specified to build a finite embedding of the target's past (using either uniformly or nonuniformly spaced variables; Faes, Nollo, & Porta, 2011; Kugiumtzis, 2013; Vlachos & Kugiumtzis, 2010). Here, using the IDTxl Python package, a nonuniform embedding of the target's past Y_{<t} is built via iterative greedy selection of statistically significant elements of Y_{<t} (as detailed in the next section). A nonuniform embedding is a set of nonuniformly spaced time lags that capture the underlying state of the process, akin to a Takens' embedding. While similar multivariate embeddings and lags larger than one time step can in principle be used for the source X, here only a single sample at lag 1 is used for the source in the VAR experiments, in alignment with the VAR dynamics.
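For the Gaussian estimator, the TE with a single past sample of source and target reduces to half the Granger-causality log-ratio of residual variances. A minimal illustrative implementation (not IDTxl itself; the coupled test system is an assumption for demonstration):

```python
import numpy as np

rng = np.random.default_rng(3)

def gaussian_te(source, target):
    """Bivariate TE with one past sample each (lag 1), via the Gaussian
    estimator: TE = 1/2 * ln(var_reduced / var_full), the Granger form."""
    y_t, y_past, x_past = target[1:], target[:-1], source[:-1]
    # Reduced model: regress Y_t on Y_{t-1} only.
    A = np.column_stack([np.ones_like(y_past), y_past])
    res_red = y_t - A @ np.linalg.lstsq(A, y_t, rcond=None)[0]
    # Full model: regress Y_t on Y_{t-1} and X_{t-1}.
    B = np.column_stack([A, x_past])
    res_full = y_t - B @ np.linalg.lstsq(B, y_t, rcond=None)[0]
    return 0.5 * np.log(res_red.var() / res_full.var())

# X drives Y with a one-step delay, but not vice versa.
T = 20_000
x = rng.standard_normal(T)
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.4 * y[t - 1] + 0.5 * x[t - 1] + 0.1 * rng.standard_normal()

te_xy = gaussian_te(x, y)  # large: X -> Y coupling exists
te_yx = gaussian_te(y, x)  # near zero: no Y -> X coupling
```

The asymmetry te_xy ≫ te_yx illustrates the directed nature of the measure.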
Analogously to MI, the bivariate TE values are transformed into a directed network structure by testing their statistical significance. This is computed (using a theoretical null distribution, as summarized in Lizier, 2014) to reflect the probability of observing a larger TE from the same samples if the source samples were temporally decoupled from the target and its past. A critical level of α = 0.01/N is used, as for MI.

Multivariate transfer entropy for effective connectivity.
Different from the bivariate approaches above, the multivariate TE approach does not consider pairs of nodes in isolation, but focuses on modeling the dynamic updates of each target process in the system by selecting a minimal set of sources that collectively contribute to the computation of the target's next state. More formally, for each target Y, this method aims to identify the minimal set of sources X_{<t} that maximizes the collective TE to Y, defined as

T_{X→Y}(t) := I(X_{<t}; Y_t | Y_{<t}).

The multivariate TE network inference algorithm is described in full in Novelli et al. (2019), synthesizing together components of Faes et al. (2011), Lizier & Rubinov (2012), Montalto, Faes, & Marinazzo (2014), Sun et al. (2015), and Vlachos & Kugiumtzis (2010). A greedy approach is used to iteratively select the candidate variables to add to X_{<t} from a candidate set {X_{i,t-l}} of lagged variables from the past of each source X_i ∈ Z \ {Y}, up to some maximum lag l ≤ L (L = 1 for the VAR model and L = 4 for the neural mass model used here). Testing the statistical significance of the conditional mutual information for a new candidate source to be included at each step, conditioning on the previously selected sources, provides an adaptive stopping condition for the greedy algorithm. The family-wise error rate for each target is set to α = 0.01 using the max statistic (Novelli et al., 2019), meaning that for each target there is a 0.01 chance that, under the null hypothesis, at least one spurious parent node is selected. The conditioning on previously selected sources serves to prevent the selection of spurious sources that are correlated with true sources because of common-driver as well as pathway or chain effects (referred to as holding redundant information only; Lizier & Rubinov, 2012; Stramaglia, Cortes, & Marinazzo, 2014).
Such conditioning also enables capturing multivariate or synergistic effects on the target that cannot be detected by examining individual sources in isolation (Lizier & Rubinov, 2012; Stramaglia et al., 2014). Furthermore, in contrast to always conditioning on all potential sources, conditioning only on previously selected sources defers overly high-dimensional analysis until it is genuinely required, buying statistical sensitivity ("recall"; Lizier & Rubinov, 2012; Novelli et al., 2019). Every node is studied as a target (in parallel on a computing cluster, using IDTxl; Wollstadt et al., 2019) and the results are then combined into a directed network describing the information flows in the system. Similar to the bivariate TE discussed above, a nonuniform embedding of the target's past Y_{<t} is built first (Vlachos & Kugiumtzis, 2010), before the second step of selecting sources via the same iterative greedy algorithm (Novelli et al., 2019). While multiple past samples of any given source can be considered (e.g., as has been done in Novelli et al., 2019), only one past value is examined here (L = 1) for the VAR experiments, in line with their known structure in Equation 1 and in order to focus on network structure effects only.
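The greedy selection can be sketched in a few lines. The toy below estimates conditional MI via Gaussian residual variances and uses a fixed gain threshold as the stopping condition (IDTxl instead uses permutation-based significance tests with the max statistic); the test system, variable names, and threshold are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

def resid_var(y, Z):
    """Variance of y after regressing out the columns in Z (plus intercept)."""
    A = np.column_stack([np.ones(len(y))] + list(Z))
    return (y - A @ np.linalg.lstsq(A, y, rcond=None)[0]).var()

def greedy_sources(candidates, y_t, y_past, threshold=0.01):
    """Toy greedy selection: repeatedly add the candidate with the largest
    conditional MI given the target's past and the already-selected sources;
    stop when the best gain falls below a fixed threshold (a stand-in for
    IDTxl's permutation-based stopping condition)."""
    selected, cond = [], [y_past]
    while True:
        names = [name for name, c in candidates if name not in selected]
        gains = [0.5 * np.log(resid_var(y_t, cond) /
                              resid_var(y_t, cond + [c]))
                 for name, c in candidates if name not in selected]
        if not gains or max(gains) < threshold:
            return selected
        best = names[int(np.argmax(gains))]
        selected.append(best)
        cond.append(dict(candidates)[best])

# Three sources; only X1 and X2 actually drive Y. X3 is a noisy copy of X1,
# so a bivariate test would flag it, but conditioning screens it out.
T = 20_000
x1, x2 = rng.standard_normal(T), rng.standard_normal(T)
x3 = x1 + 0.5 * rng.standard_normal(T)
y = np.zeros(T)
for t in range(1, T):
    y[t] = (0.3 * y[t - 1] + 0.5 * x1[t - 1] + 0.5 * x2[t - 1]
            + 0.1 * rng.standard_normal())

cands = [("X1", x1[:-1]), ("X2", x2[:-1]), ("X3", x3[:-1])]
parents = greedy_sources(cands, y[1:], y[:-1])
```

Here the redundant source X3 is rejected because, once X1 is in the conditioning set, it provides no additional information about the target.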
Given that TE and Granger causality are equivalent for Gaussian variables (Barnett et al., 2009), using the Gaussian estimator with the multivariate TE algorithm can be viewed as extending Granger causality in the same multivariate/greedy fashion.

Evaluation Metrics
At the microscale (individual links), the network inference performance is evaluated against the known underlying network structure as a binary classification task, using standard statistics based on the number of true positives (TP, i.e., correctly classified existing links), false positives (FP, i.e., absent links falsely classified as existing), true negatives (TN, i.e., correctly classified absent links), and false negatives (FN, i.e., existing links falsely classified as absent). The following standard statistics are employed in the evaluation:

precision = TP / (TP + FP),
recall = TP / (TP + FN),
specificity = TN / (TN + FP).

Intuitively, the precision measures how often an inferred link is actually present in the underlying structure, the recall measures the proportion of true links that are detected, and the specificity measures the proportion of absent links that are correctly not inferred. For a properly controlled family-wise error rate (α), the expected specificity is 1 − α.
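These statistics can be computed directly from the ground-truth and inferred adjacency matrices; a minimal sketch (excluding self-links, an assumption for illustration):

```python
import numpy as np

def classification_metrics(true_adj, inferred_adj):
    """Precision, recall, and specificity of an inferred binary adjacency
    matrix against the ground-truth structural network (off-diagonal only)."""
    mask = ~np.eye(true_adj.shape[0], dtype=bool)   # ignore self-links
    t, p = true_adj[mask].astype(bool), inferred_adj[mask].astype(bool)
    tp = np.sum(t & p)    # true positives
    fp = np.sum(~t & p)   # false positives
    tn = np.sum(~t & ~p)  # true negatives
    fn = np.sum(t & ~p)   # false negatives
    return tp / (tp + fp), tp / (tp + fn), tn / (tn + fp)

# Worked example: 4-node ring; one link missed, one spurious link added.
true_net = np.array([[0, 1, 0, 0],
                     [0, 0, 1, 0],
                     [0, 0, 0, 1],
                     [1, 0, 0, 0]])
inferred = np.array([[0, 1, 0, 0],
                     [0, 0, 1, 0],
                     [0, 0, 0, 0],   # missed the link 2 -> 3
                     [1, 1, 0, 0]])  # spurious link 3 -> 1

precision, recall, specificity = classification_metrics(true_net, inferred)
```

With 3 of 4 true links recovered and 1 spurious link among 8 absent ones, this gives precision 0.75, recall 0.75, and specificity 0.875.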
At the macroscale, the performance is evaluated in terms of the accuracy in measuring network properties of interest on the inferred network, as compared with their real values when measured on the underlying structural network. These properties include the following:

Characteristic path length. The average shortest distance between all pairs of nodes (Rubinov & Sporns, 2010). A shorter average value is typically interpreted as an indication of the efficiency of the network in propagating information. The characteristic path length is only well defined for connected networks, a problem that is avoided by construction in this study by only generating connected networks. This limitation could alternatively be overcome by replacing the characteristic path length with the analogous global efficiency (Latora & Marchiori, 2001), reported in the Supporting Information for completeness.

Clustering coefficient. The clustering coefficient of a node is the fraction of the node's neighbours that are also neighbours of each other (Fagiolo, 2007; Watts & Strogatz, 1998). The mean clustering coefficient hence reflects the average prevalence of clustered connectivity around individual nodes.

Small-worldness coefficient. Small-world networks are formally defined as networks that are significantly more clustered than random networks, yet have approximately the same characteristic path length as random networks (Watts & Strogatz, 1998). The small-worldness coefficient was proposed by Humphries and Gurney (2008) to capture this effect in the following single statistic (although improvements on the original measure have recently been suggested; Neal, 2017; Zanin, 2015):

S = (C / C_rand) / (L / L_rand).    (9)

In Equation 9, C and L respectively denote the average clustering coefficient and the characteristic path length; analogously, C_rand and L_rand denote the average clustering coefficient and the characteristic path length of Erdős-Rényi networks having the same size and number of links as the network under study.
Degree distribution. The probability distribution of the in- and out-degree over the whole network (where the in- and out-degree of a node are defined as the number of its incoming and outgoing connections, respectively).

Betweenness centrality. The fraction of all shortest paths in the network that pass through a given node (Rubinov & Sporns, 2010), excluding paths that start and end on the given node.

Modularity. A measure of the separation of a network into specific groups (or modules), defined as the fraction of the edges that fall within the given groups minus the expected fraction if the edges were distributed at random (Newman & Girvan, 2004).

Rich-club coefficient. The "rich-club" phenomenon refers to the tendency of hubs (nodes with high degree) to form tightly interconnected communities (Colizza, Flammini, Serrano, & Vespignani, 2006). The rich-club coefficient was first proposed by Zhou and Mondragon (2004) to quantify this effect:

φ(k) = 2 E_{>k} / (N_{>k} (N_{>k} − 1)),

where E_{>k} denotes the number of edges among the N_{>k} nodes having degree higher than a given value k.
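As an illustration of the rich-club formula above, a minimal implementation for undirected networks (the toy network and the choice k = 1 are illustrative assumptions):

```python
import numpy as np

def rich_club(adj, k):
    """Rich-club coefficient phi(k) = 2*E_{>k} / (N_{>k} * (N_{>k} - 1))
    for an undirected network given as a symmetric 0/1 adjacency matrix."""
    deg = adj.sum(axis=1)
    rich = np.where(deg > k)[0]          # nodes with degree > k
    n = len(rich)
    if n < 2:
        return float("nan")
    e = adj[np.ix_(rich, rich)].sum() / 2  # edges among the rich nodes
    return 2 * e / (n * (n - 1))

# Toy example: a fully connected triangle of hubs, each with one extra spoke.
A = np.zeros((6, 6), dtype=int)
for i, j in [(0, 1), (0, 2), (1, 2), (0, 3), (1, 4), (2, 5)]:
    A[i, j] = A[j, i] = 1

phi = rich_club(A, k=1)  # the hubs (degree > 1) form a complete subgraph
```

Here the three hubs are fully interconnected, so φ(1) reaches its maximum value of 1.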

Numerical Simulations
The first experiment is aimed at testing the robustness of the three inference algorithms with respect to vast changes in network structure. The Watts-Strogatz model is used to generate a spectrum of topologies, ranging from regular lattices to random networks (similar to Erdős-Rényi networks, although not equivalent; Maier, 2019) through a small-world transition (Watts & Strogatz, 1998). Each simulation starts with a directed ring network of 100 nodes with uniform link weights C_ij = C_ii = 0.15 and fixed in-degree d_in = 4 (i.e., each node is linked to two neighbors on each side, as well as to itself via a self-loop). The source of each link is then rewired with a given probability p ∈ [0, 1], so as to change the overall network topology while keeping the in-degree of each node fixed. Only rewiring attempts that keep the network connected are accepted, in order to allow the measurement of the average shortest path length. The simulations for each p are repeated 10 times on different network realizations and with random initial conditions.
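The in-degree-preserving rewiring can be sketched as follows (an illustrative reimplementation: the connectedness check used in the paper is omitted here, and a rewired link is always moved rather than optionally kept):

```python
import random

def rewired_ring(n=100, p=0.1, seed=0):
    """Directed Watts-Strogatz-style ring: each node receives links from two
    neighbours on each side; each link's SOURCE is rewired with probability p,
    keeping every node's in-degree fixed at 4 (self-loops are left untouched).
    Returns a dict mapping each node to the set of its in-neighbours."""
    rng = random.Random(seed)
    in_links = {j: {(j + off) % n for off in (-2, -1, 1, 2)} for j in range(n)}
    for j in range(n):
        for src in list(in_links[j]):
            if rng.random() < p:
                new = rng.randrange(n)
                # avoid self-loops and duplicate in-links
                while new == j or new in in_links[j]:
                    new = rng.randrange(n)
                in_links[j].remove(src)
                in_links[j].add(new)
    return in_links

net = rewired_ring()
```

Because only the source end of each link is redrawn, every node's in-degree stays exactly 4 for any p, while the out-degrees become heterogeneous as p grows.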

Results
At the microscale, the performance is evaluated in terms of precision, recall, and specificity in the classification of the links (present or absent) in the inferred network compared with the underlying structural network. In the case of bivariate MI, each undirected link in the inferred network is represented as two directed links in opposite directions. For longer time series of 10,000 samples, multivariate TE is the most accurate method, achieving optimal performance according to all metrics on all the network topologies generated by the Watts-Strogatz rewiring model (Figure 2, right column). Bivariate TE also achieves nearly optimal recall and high specificity on all topologies; however, despite the strict statistical significance level, the precision is significantly lower on lattice-like topologies (low rewiring probability) than on random ones (high rewiring probability). The opposite trend is shown by the bivariate MI algorithm, whose precision and recall drastically decrease with increasing rewiring probability. As expected, the recall of all methods decreases when shorter time series of 1,000 samples are provided (Figure 2, left column). However, the recall for multivariate TE is consistent across topologies, while it decreases with higher rewiring probability when bivariate methods are used. This results in the bivariate TE having larger recall for lattice-like topologies, while multivariate TE has larger recall than bivariate for more random topologies (i.e., for a rewiring probability larger than p = 0.2). A further interesting effect is that bivariate TE attains better precision on shorter time series than on longer ones.
At the macroscale, the three algorithms are tested on their ability to accurately measure three fundamental network properties relevant through the small-world transition, using the longer time series of 10,000 samples. Multivariate TE is able to closely approximate the real shortest path length on all the network topologies generated by the Watts-Strogatz rewiring model, while the bivariate MI and TE algorithms produce significant underestimates, particularly on lattice-like topologies (Figure 3). Similarly, multivariate TE is able to closely match the real mean clustering on all the network topologies generated by the Watts-Strogatz rewiring model, while the bivariate MI and TE algorithms consistently overestimate it (Figure 4). The related measure of local efficiency (Latora & Marchiori, 2001) is reported in the Supporting Information.
Given the above results on the characteristic path length and the mean clustering coefficient, it is not surprising that the bivariate MI and TE algorithms significantly overestimate the real small-worldness coefficient, while the multivariate TE method produces accurate estimates on all the network topologies generated by the Watts-Strogatz rewiring model (Figure 5). Equivalent results are found if the alternative measures of "small-world index" (Neal, 2017) or "double-graph normalized index" (Telesford, Joyce, Hayasaka, Burdette, & Laurienti, 2011) are computed instead (not shown).

Discussion
At the microscale, the results concerning the bivariate TE can be explained in light of the recent theoretical derivation of TE from network motifs for VAR dynamics (Novelli, Atay, Jost, & Lizier, 2020). For a fixed in-degree, the TE decreases with the rewiring probability, making it harder for candidate links to pass the statistical significance tests when only short time series are available. This explains why the recall for the bivariate TE slightly drops with higher rewiring probability for T = 1,000 (Figure 2). We can speculate that a similar mechanism could be responsible for the more drastic drop in the recall for the bivariate MI (via evidence from derivations of covariances from network structure for similar processes; Pernice, Staude, Cardanobile, & Rotter, 2011; Schwarze & Porter, 2020). The fact that bivariate TE is larger for regular lattice structures also explains why its recall is slightly higher than for multivariate TE here: the redundancy between close sources that elevates their bivariate TE is explicitly conditioned out of the multivariate TE for secondary sources. On the other hand, as the rewiring increases, the higher recall for multivariate TE must be due to this method capturing synergistic effects that (more disparate) multiple sources have on the target, which the bivariate method does not.

Figure 2. Performance as a function of the rewiring probability in Watts-Strogatz ring networks (N = 100 nodes). Multivariate TE is consistent across network structure and guarantees high precision regardless of the amount of data available (third row). Bivariate TE (second row) can have better recall than multivariate TE for shorter time series (T = 1,000, left column), but its precision is not consistent (it drops when T = 10,000, right column) and the optimal time series length cannot be determined a priori. Bivariate MI (first row) has lower precision and recall than the TE-based methods, for both T = 1,000 and T = 10,000. For each value of the rewiring probability, the results for 10 simulations on different networks are presented (low-opacity markers) in addition to the mean values (solid markers).

Figure 3. Characteristic path length as a function of the rewiring probability in Watts-Strogatz ring networks (N = 100 nodes and T = 10,000 time samples). Multivariate TE closely approximates the characteristic path length of the real topologies (ground truth). On the other hand, bivariate MI and TE produce significant underestimates due to spurious links creating shortcuts across the network, particularly on lattice-like topologies (low rewiring probability). The results for 10 simulations on different network realizations are presented (low-opacity markers) in addition to the mean values (solid markers).
Comparing the results between shorter and longer time series raises another question: Why is the precision of the bivariate TE worse for longer time series than for shorter ones, especially for lattice-like topologies? More complex motifs involving common parents and multiple walks, which are more prevalent in regular lattice topologies, can result in nonzero TE on spurious links. These indirect effects are typically weak; however, for long enough time series, the low TE values can be distinguished from noise and thus pass the statistical significance tests. The resulting spurious links (false positives) decrease the precision and the specificity as the time series length is increased, with the effect being stronger in regular lattice topologies. In other words, the Bonferroni correction of the statistical significance level (i.e., dividing α by the network size N) does not result in a well-calibrated test for bivariate inference methods, because the sources are correlated and the tests on them are not independent. The differences in the specificity on the plots are subtle because the networks are sparse; however, they manifest in large differences in the precision. Crucially, this effect is not seen for the multivariate TE, which maintains specificity consistent with the requested α for all topologies and time series lengths. Thus, the lower recall achieved by multivariate TE on regular lattice networks for short time series (compared with bivariate TE) can be viewed as a compromise to control the specificity in a consistent fashion. A compelling argument in favor of controlling the specificity is provided by Zalesky et al. (2016, p. 407), who conclude that "specificity is at least twice as important as sensitivity [i.e., recall] when estimating key properties of brain networks, including topological measures of network clustering, network efficiency and network modularity."
Unfortunately, there is currently no consistent a priori way (and no reasonable candidate) to determine the optimal time series length for bivariate TE to attain high precision.
Moving to the macroscale results, it is clear that the ability to control the false positives while building connectomes is a crucial prerequisite for the application of complex network measures. Adding only a few spurious links leads to significant underestimation of the average shortest path length, an effect that has previously been reported for lattice-like networks using MI (Bialonski, Horstmann, & Lehnertz, 2010) and extended here to TE and across a range of topologies (Figure 3). Together with the clustering coefficient, the shortest path length is a defining feature of small-world networks. Although evidence of small-world properties of functional networks obtained from fMRI recordings has been provided in several studies (e.g., van den Heuvel, Stam, Boersma, & Hulshoff Pol, 2008), whether or not the brain is a small-world network is still being debated (Hilgetag & Goulas, 2015; Papo, Zanin, Martínez, & Buldú, 2016). Following Papo et al. (2016), the question addressed here is of a pragmatic rather than an ontological nature: Independently of whether the brain is a small-world network, to what extent can neuroscientists using standard system-level neuroimaging techniques interpret the small-world construct in the context of functional brain networks? An indication that the interpretation is problematic was provided by Hlinka, Hartman, and Paluš (2012), who showed that functional connectivity matrices of randomly coupled autoregressive processes show small-world properties. The effect is due to intrinsic properties of correlation rather than just to the finite sample size problem or spatial oversampling.
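The shortcut effect described above is easy to reproduce numerically. The following minimal sketch (our own illustration, not the paper's code; all function names are ours) builds a ring lattice and shows how adding just a few random "spurious" links shrinks the characteristic path length computed by breadth-first search:

```python
import random
from collections import deque

def ring_lattice(n, k):
    """Undirected ring lattice: each node linked to its k nearest neighbours on each side."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(1, k + 1):
            adj[i].add((i + j) % n)
            adj[(i + j) % n].add(i)
    return adj

def characteristic_path_length(adj):
    """Mean shortest-path length over all ordered node pairs (BFS from every node)."""
    n = len(adj)
    total = 0
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
    return total / (n * (n - 1))

random.seed(0)
net = ring_lattice(100, 2)
L_true = characteristic_path_length(net)
# Add a handful of spurious "inferred" links acting as shortcuts across the ring.
for _ in range(5):
    a, b = random.sample(range(100), 2)
    net[a].add(b)
    net[b].add(a)
L_spurious = characteristic_path_length(net)
print(L_true, L_spurious)  # the few spurious links noticeably shrink the path length
```

This mirrors the underestimation reported in Figure 3: on lattice-like topologies, even a small number of false positives provides long-range shortcuts.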
Specifically, correlation has a transitivity property: For any node X with neighbors Y and Z (and respective correlations ρ_XY and ρ_XZ), a lower bound can be derived for the correlation ρ_YZ between the neighbors (Langford, Schwertman, & Owens, 2001). In particular, a strong positive correlation within two of the pairs implies a positive correlation within the third pair: ρ_XY² + ρ_XZ² > 1 implies ρ_YZ > 0 (Langford et al., 2001). The problem was further investigated by Zalesky, Fornito, and Bullmore (2012), who showed that functional connectivity matrices of independent processes also exhibit small-world properties and that, in practice, the correlation between neighbors is much higher than the theoretical lower bound in Equation 11. These considerations on correlation extend to bivariate MI, given the one-to-one relationship between MI and the absolute value of Pearson's correlation coefficient for the Gaussian variables considered in this study (see the Bivariate mutual information for functional connectivity section in the Methods Section). This transitivity property results in more triangular cliques in functional networks, that is, an inflated clustering coefficient across the whole spectrum of networks in Figure 4. Together with the underestimate of the shortest path length discussed above, the outcome is an overestimate of the small-worldness coefficient (Figure 5). As shown, the limitations of bivariate methods can be overcome by multivariate TE, to a large degree for shorter time series and certainly when sufficiently long time series are available.
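Both the transitivity bound and the Gaussian MI relation can be checked numerically. The sketch below (our own illustration, with hypothetical helper names) evaluates the Langford et al. lower bound ρ_XY ρ_XZ − sqrt((1 − ρ_XY²)(1 − ρ_XZ²)) and the Gaussian identity MI = −(1/2) ln(1 − ρ²), then verifies on simulated data that two processes driven only by a common neighbor X are nevertheless correlated well above the bound:

```python
import numpy as np

def langford_lower_bound(rho_xy, rho_xz):
    """Lower bound on rho_YZ given rho_XY and rho_XZ (Langford, Schwertman, & Owens, 2001)."""
    return rho_xy * rho_xz - np.sqrt((1 - rho_xy**2) * (1 - rho_xz**2))

def gaussian_mi(rho):
    """Mutual information (in nats) of two jointly Gaussian variables with correlation rho."""
    return -0.5 * np.log(1 - rho**2)

# Strong correlations of X with both Y and Z force Y and Z to be positively correlated:
print(langford_lower_bound(0.8, 0.8))  # 0.28 > 0, since 0.8^2 + 0.8^2 > 1
print(langford_lower_bound(0.5, 0.5))  # negative: no transitivity guarantee here

# Empirical check: Y and Z driven only by the common neighbour X are still correlated,
# and the observed correlation exceeds the theoretical lower bound.
rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)
y = 0.8 * x + 0.6 * rng.standard_normal(100_000)  # corr(X, Y) = 0.8 by construction
z = 0.8 * x + 0.6 * rng.standard_normal(100_000)  # corr(X, Z) = 0.8 by construction
rho_yz = np.corrcoef(y, z)[0, 1]
print(rho_yz)  # about 0.64, well above the 0.28 bound
```

This illustrates why, as Zalesky et al. (2012) observed, the correlation between neighbors in practice sits far above the theoretical minimum, inflating triangle counts.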

Numerical Simulations
The linear preferential attachment algorithm without attractiveness (Barabási & Albert, 1999) is used to generate undirected scale-free networks of N = 200 nodes. Starting with two connected nodes, a new node is added at each iteration and linked bidirectionally to two existing nodes, selected with probability proportional to their current degree (via linear preferential attachment). This preferential mechanism makes high-degree nodes more likely to be selected and further increase their degree, a positive feedback loop that generates few highly connected hubs and many low-degree nodes. The resulting density is approximately 4/N = 0.02 (average in- and out-degrees being approximately 4), with hubs having degrees up to around 50 here. A constant uniform link weight C_XY = C_XX = 0.1 is assigned to all the links, achieving strong coupling but ensuring the stationarity of the VAR dynamics. For robustness, each simulation is repeated 10 times on different network realizations and with random initial conditions.
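A minimal sketch of this generation procedure (our own illustration, not the authors' code) uses the standard repeated-node-list trick, in which each node appears in a pool once per unit of degree, so that uniform sampling from the pool implements linear preferential attachment:

```python
import random

def preferential_attachment(n, seed=0):
    """Barabasi-Albert-style growth as described in the text: start from two
    connected nodes; each new node attaches bidirectionally to two distinct
    existing nodes chosen with probability proportional to current degree."""
    rng = random.Random(seed)
    edges = {(0, 1)}
    # Pool of node labels, one occurrence per unit of degree.
    targets_pool = [0, 1]
    for new in range(2, n):
        chosen = set()
        while len(chosen) < 2:  # resample until two distinct targets are found
            chosen.add(rng.choice(targets_pool))
        for t in chosen:
            edges.add((min(new, t), max(new, t)))
            targets_pool.extend([new, t])  # both endpoints gain one unit of degree
    return edges

edges = preferential_attachment(200)
degree = {}
for a, b in edges:
    degree[a] = degree.get(a, 0) + 1
    degree[b] = degree.get(b, 0) + 1
print(len(edges), max(degree.values()))  # 397 undirected links; a few high-degree hubs
```

With N = 200 this yields 1 + 2 × 198 = 397 undirected links, that is, density ≈ 0.02 and mean degree ≈ 4, matching the figures quoted above; the constant weight C_XY = 0.1 would then be assigned to every link.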

Results
At the microscale, the performance is evaluated in terms of precision and recall in the classification of the links. The outcome is qualitatively similar to the small-world case presented above: For longer time series (10,000 samples), multivariate TE is the most accurate method, achieving optimal performance according to all metrics (Figure 6, right column). Bivariate TE also achieves optimal recall; however, despite the strict statistical significance level, its precision is significantly lower than that of multivariate TE. The bivariate MI algorithm scores comparatively very poorly in terms of both precision and recall (< 40% on average). As expected, the recall of all methods decreases when shorter time series of 1,000 samples are provided (Figure 6, left column). Once more, bivariate TE attains better precision on shorter time series than on longer ones, and for these networks attains slightly better recall than multivariate TE on the shorter time series.
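For reference, the microscale metrics can be computed directly from the sets of true and inferred directed links, as in the following sketch (hypothetical function names; self-links are excluded from the candidate pairs):

```python
def link_classification_metrics(true_links, inferred_links, n):
    """Precision, recall, and specificity over all directed non-self node pairs.
    Links are (source, target) tuples; n is the number of nodes."""
    tp = len(true_links & inferred_links)       # correctly inferred links
    fp = len(inferred_links - true_links)       # spurious links
    fn = len(true_links - inferred_links)       # missed links
    negatives = n * (n - 1) - len(true_links)   # absent links in the ground truth
    tn = negatives - fp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return precision, recall, specificity

# Toy example on 4 nodes: one true link missed, one spurious link inferred.
true = {(0, 1), (1, 2), (2, 0)}
inferred = {(0, 1), (1, 2), (0, 2)}
print(link_classification_metrics(true, inferred, 4))  # precision 2/3, recall 2/3, specificity 8/9
```

As the toy example shows, in a sparse network a single false positive barely moves the specificity (8/9) while markedly lowering the precision (2/3), which is exactly the effect described in the discussion of the Bonferroni-corrected tests above.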
At the macroscale, the three algorithms are tested on their ability to accurately measure several relevant properties of scale-free networks. It is well known that the degree distribution of networks generated via this preferential attachment algorithm follows a power law, with theoretical exponent β = 3 in the limit of large networks (Barabási & Albert, 1999). Fitting power laws to empirical data requires some caution, such as adopting a logarithmic binning scheme (Virkar & Clauset, 2014), and the dedicated powerlaw Python package is employed for this purpose (Alstott, Bullmore, & Plenz, 2014). For sufficiently long time series (T = 10,000 in this study), multivariate TE is able to accurately recover the in-degrees of the nodes in our scale-free networks, while the bivariate MI and TE algorithms produce significant overestimates (Figure 7). As a consequence, the (absolute value of the) exponent of the fitted power law is underestimated by the latter methods, as shown in Figure 8.

Figure 6. Precision (top row) and recall (bottom row) in scale-free networks obtained via preferential attachment (N = 200 nodes). Multivariate TE guarantees high precision regardless of the amount of data available (T = 1,000 in the left column and T = 10,000 in the right column). Bivariate TE can achieve slightly better recall than multivariate TE for shorter time series (bottom left panel) but its precision drops substantially for longer time series (top right panel) and the optimal time series length cannot be determined a priori. Bivariate MI has lower precision and recall than TE-based methods, for both T = 1,000 and T = 10,000. The box-whiskers plots summarize the results over 10 simulations on different network realizations, with median values indicated in color.

Figure 7. Inferred vs. real in-degree in scale-free networks obtained via preferential attachment (N = 200 nodes and T = 10,000 time samples). Multivariate TE is the only algorithm able to preserve the in-degrees of the nodes as compared to their value in the real networks (ground truth). The dashed black line represents the identity between real and inferred values. Surprisingly, bivariate methods can inflate the in-degree of nonhubs by over 1 order of magnitude, making hubs less distinguishable. The results are collected over 10 simulations on different network realizations.
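The paper performs the fitting with the powerlaw package; as an independent minimal sketch (not the package's method, and names are ours), the continuous maximum-likelihood estimate of the exponent above a chosen cutoff k_min can be written as:

```python
import math
import random

def powerlaw_mle_exponent(degrees, k_min=1):
    """Continuous maximum-likelihood estimate of the power-law exponent
    (Clauset-style): beta = 1 + n / sum(ln(k / k_min)) over degrees >= k_min."""
    tail = [k for k in degrees if k >= k_min]
    return 1.0 + len(tail) / sum(math.log(k / k_min) for k in tail)

# Sanity check on synthetic samples drawn from a power law with beta = 3
# via inverse-transform sampling: x = (1 - u) ** (-1 / (beta - 1)).
rng = random.Random(0)
beta = 3.0
samples = [(1 - rng.random()) ** (-1.0 / (beta - 1)) for _ in range(50_000)]
print(powerlaw_mle_exponent(samples, k_min=1))  # close to the true exponent 3
```

For real degree data, a discrete fit with a principled choice of k_min (as the powerlaw package provides) is preferable; this sketch only illustrates the estimation principle behind the exponents reported in Figure 8.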
Hubs are a key feature of scale-free networks and have high betweenness centrality, since most shortest paths pass through them. However, their centrality is highly underestimated by bivariate methods, often making it indistinguishable from the centrality of peripheral nodes (Figure 9). As in the small-world case, multivariate TE is able to very closely approximate the real mean clustering coefficient, while the bivariate MI and TE algorithms consistently overestimate it (Figure 10). The related measure of local efficiency (Latora & Marchiori, 2001) is reported in the Supporting Information.

Figure 8. The best power-law distribution fit for the real networks (ground truth) and the inferred network models are plotted with dashed lines (decay exponents reported in the legend). Despite the finite-size effect due to the small network size, multivariate TE is able to approximate the theoretical power-law decay exponent β = 3 and to match the power-law fit to the real in-degree distribution (β = 3.1). On the other hand, bivariate TE and MI underestimate the absolute value of the exponent. The results are collected over 10 simulations on different network realizations.
A closer examination of the clustering of individual nodes (instead of the average) reveals that low clustering values are consistently overestimated by bivariate methods, while high clustering values are underestimated (Supporting Information). Finally, bivariate methods overestimate the rich-club coefficient (Supporting Information).
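The clustering coefficient examined above can be computed by counting links among each node's neighbors. A minimal undirected sketch (our own illustration; the directed variant used for inferred networks may differ):

```python
def average_clustering(adj):
    """Mean local clustering coefficient: for each node, the fraction of its
    neighbour pairs that are themselves linked (undirected adjacency given
    as {node: set_of_neighbours}). Nodes with degree < 2 contribute 0."""
    coeffs = []
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            coeffs.append(0.0)
            continue
        # Count linked pairs among the neighbours (each pair once).
        links = sum(1 for u in nbrs for v in nbrs if u < v and v in adj[u])
        coeffs.append(2.0 * links / (k * (k - 1)))
    return sum(coeffs) / len(coeffs)

# Toy example: a triangle (nodes 0, 1, 2) plus a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(average_clustering(adj))  # (1 + 1 + 1/3 + 0) / 4 ~= 0.583
```

Each spurious transitive link closes additional triangles, which is why the bivariate methods systematically push this quantity upward.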

Figure 10. Inferred vs. real average clustering coefficient in scale-free networks obtained via preferential attachment (N = 200 nodes and T = 10,000 time samples). Multivariate TE is the only algorithm able to preserve the average clustering of the real networks (ground truth), while bivariate TE and MI consistently overestimate it. The box-whiskers plots summarize the results over 10 simulations, with median values indicated in color.

Discussion
Echoing the discussion of small-world networks above, the ability to control the false positives while building connectomes-exhibited only by multivariate TE-is also crucial for correctly identifying fundamental features of scale-free networks, such as the power-law degree distribution and the presence of hub nodes. Hubs are characterized by high degree and betweenness centrality. Unfortunately, the centrality of hubs is not robust with respect to false positives: The addition of spurious links causes strong underestimates of the betweenness centrality of real hubs, since additional links provide alternative shortest paths. For bivariate TE, the effect is so prominent that the inferred centrality of real hubs can be indistinguishable from the centrality of peripheral nodes, as shown in Figure 9. The in-degree is in principle more robust with respect to false positives; however, bivariate methods infer so many spurious incoming links into nonhubs that these become as connected as (or more connected than) the real hubs are inferred to be (Figure 7). Taken together, these effects on the in-degree and centrality greatly hinder the identification of real hubs when bivariate MI or TE is employed. The inflation of the in-degree of peripheral nodes also fattens the tail of the in-degree distribution (Figure 8), resulting in an underestimate of the exponent of the fitted power law with respect to the theoretical value β = 3 (Barabási & Albert, 1999). This has severe implications for the synthetic networks used in this study, erroneously providing evidence against the simple preferential attachment algorithm used to generate them. The third distinct characteristic of these networks is their low average clustering, which is also induced by the preferential attachment algorithm, whereby each new node is only connected to two existing ones. However, bivariate methods fail to capture this feature, producing a strong overestimate of the average clustering coefficient (Figure 10).
This can be attributed to the transitivity property of Pearson's correlation, which produces overabundant triangular cliques in functional networks (as previously discussed). Given the significant biases affecting all the distinctive properties of scale-free networks-in addition to the small-world networks presented above-it is evident that great caution should be used when applying bivariate inference methods (cross-correlation, MI, TE) to draw conclusions as to the topological properties of real-world networks. In contrast, again, the multivariate TE was demonstrated to produce network models with microscopic and macroscopic topological properties consistent with those of the underlying structural scale-free networks.

Numerical Simulations
In order to study the performance of the three inference algorithms on modular topologies, networks of 100 nodes are generated and equally partitioned into five groups of 20. Initially, each node is directly linked to 10 random targets within its own group, such that the five communities are completely disconnected. The initial density is thus 50% within each group and 10% overall. Link targets are then gradually rewired from within to between groups, weakening the modular structure but preserving the overall density and keeping the out-degrees fixed. Eventually, the concepts of "within" and "between" groups are no longer meaningful: the links are equally distributed and the topology resembles a random Erdős-Rényi network of equal overall density. This happens when the rewiring is so prevalent that only 2 links are left within the initial groups and 8 out of 10 links are formed between them (for each node). Going even further, when all 10 links are formed between the initial groups and none within, the network becomes multipartite, that is, the nodes are partitioned into five independent sets having no internal connections. A constant uniform link weight C_XY = C_XX = 0.08 is assigned to all the links, achieving strong coupling but ensuring the stationarity of the VAR dynamics.
Each simulation is repeated 10 times on different network realizations and with random initial conditions.
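A sketch of this construction (ours, with hypothetical parameter names) builds each rewiring level directly rather than incrementally rewiring an existing network; this yields the same within/between densities and fixed out-degrees described above:

```python
import random

def modular_network(n_groups=5, group_size=20, out_degree=10, n_between=0, seed=0):
    """Directed modular network: each node sends `out_degree` links,
    `n_between` of which target nodes outside its own group (the rest stay
    within the group). n_between = 0 gives five disconnected communities."""
    rng = random.Random(seed)
    n = n_groups * group_size

    def group(i):
        return i // group_size

    links = set()
    for i in range(n):
        within = [j for j in range(n) if group(j) == group(i) and j != i]
        between = [j for j in range(n) if group(j) != group(i)]
        targets = (rng.sample(within, out_degree - n_between)
                   + rng.sample(between, n_between))
        for j in targets:
            links.add((i, j))
    return links

links = modular_network(n_between=2)  # weakly rewired: 8 within, 2 between per node
n_within = sum(1 for i, j in links if i // 20 == j // 20)
print(len(links), n_within)  # 1000 directed links in total, 800 of them within groups
```

Varying n_between from 0 to 10 sweeps from the fully modular case to the multipartite extreme, matching the horizontal axes of Figures 11 and 12.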

Results
At the microscale, we find that bivariate MI and TE infer more spurious links within the initial groups than between them for smaller between-group densities (Figure 11, left column). As the between-group density increases, though, we find more spurious links between the initial groups than within them. The normalized false positive rate is also significantly higher within groups for smaller between-group densities (right column); under this normalization, however, the false positive rates within and between groups become comparable as the between-group density increases. The number of false positives produced by multivariate TE is comparatively negligible.
At the mesoscale, the modularity of the partition corresponding to the five disconnected communities is maximal in the absence of rewiring and decreases as more and more links are formed between groups rather than within them (Figure 12). Bivariate and multivariate TE produce accurate estimates of the real modularity, while bivariate MI often underestimates it, particularly for shorter time series (T = 1,000) and intermediate between-group densities.
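The modularity reported here can be computed for a directed network and a fixed partition as follows (a minimal sketch in the Leicht-Newman directed form; function names are ours and may differ from the authors' implementation):

```python
def directed_modularity(links, communities):
    """Directed modularity of a given node partition (Leicht & Newman form):
    Q = (1/m) * sum_ij [A_ij - k_out_i * k_in_j / m] * delta(c_i, c_j),
    with links as a set of (source, target) pairs and communities as {node: label}."""
    m = len(links)
    k_out, k_in = {}, {}
    for i, j in links:
        k_out[i] = k_out.get(i, 0) + 1
        k_in[j] = k_in.get(j, 0) + 1
    # Observed term: fraction of links falling within communities.
    intra = sum(1 for i, j in links if communities[i] == communities[j]) / m
    # Null-model term: expected intra-community fraction given the degrees.
    nodes = list(communities)
    expected = sum(
        k_out.get(i, 0) * k_in.get(j, 0)
        for i in nodes for j in nodes
        if communities[i] == communities[j]
    ) / m**2
    return intra - expected

# Two perfect modules, each a directed 2-cycle: maximal modularity for 2 groups.
links = {(0, 1), (1, 0), (2, 3), (3, 2)}
comms = {0: "a", 1: "a", 2: "b", 3: "b"}
print(directed_modularity(links, comms))  # 1 - 0.5 = 0.5
```

Spurious between-group links lower the observed intra-community fraction, which is the mechanism behind the modularity underestimate by bivariate MI in Figure 12.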

Discussion
Our results on modular networks confirm and extend previous findings on correlation-based functional connectivity, stating that "false positives occur more prevalently between network modules than within them, and the spurious inter-modular connections have a dramatic impact on network topology" (Zalesky et al., 2016, p. 407). Indeed, the left column of Figure 11 shows that bivariate MI and TE infer a larger number of false positives between the initial groups than within them, once we have a midrange between-group link density in the underlying structure (which induces the transitive relationships). However, the same does not apply to the false positive rate (i.e., the normalized number of false positives) shown in the right column of Figure 11: Where an edge does not actually exist, it is more likely to be inferred if it is within rather than across groups (for up to midrange between-group link densities). As such, the higher number of false positives between modules is mostly due to the larger number of potential spurious links available between different communities compared with those within them. Nonetheless, the key message is that the modular structure (at the mesoscale level) affects the performance of bivariate algorithms in inferring single links (at the microscale level). This provides further empirical evidence for the theoretical finding that bivariate TE-despite being a pairwise measure-does not depend solely on the directed link weight between a single pair of nodes, but on the larger network structure they are embedded in, via mesoscopic network motifs. In particular, the abundance of specific "clustered motifs" in modular structures increases the bivariate TE, making links within each group easier to detect but also increasing the false positive rate within modules.
Other studies have also related correlation-based functional connectivity to specific structural features, such as search information, path transitivity (Goni et al., 2014), and topological similarity (Bettinardi et al., 2017).
The underestimate of the modularity of the initial partition by bivariate MI (Figure 12, bottom panel) is a direct result of these higher numbers of spurious between-group links. This has important implications for the identification of the modules, since a lower score makes this partition less likely to be deemed optimal by popular greedy modularity maximization algorithms (Blondel, Guillaume, Lambiotte, & Lefebvre, 2008). We speculate that the spurious intermodular links would also hinder the identification of the modules when alternative approaches for community detection are employed (a thorough comparison of which is beyond the scope of this study).

Figure 11. False positives per node (left column) and false positive rate (right column) in modular networks of N = 100 nodes and T = 10,000 time samples. Each node is connected to 10 neighbors and the horizontal axis represents the number of links between groups (in the two extreme cases all 10 links are formed exclusively within each group or exclusively between groups). Bivariate MI and TE infer more spurious links (left column) within the initial groups than between them for smaller between-group densities; the comparison then reverses for larger between-group densities. The false positive rate (i.e., the normalized number of false positives; right column) is also higher within groups for smaller between-group densities, while the rate for between groups becomes comparable (instead of larger) as between-group density increases. The number of false positives produced by multivariate TE is comparatively negligible. The results for 10 simulations on different networks are presented (low-opacity markers) in addition to the mean values (solid markers).

Figure 12. Each node is initially connected to 10 neighbours within the same group via directed links. As link targets are gradually rewired from within to between groups, the modularity of the initial partition linearly decreases. The horizontal axis represents the number of links between groups for each node (in the two extreme cases all 10 links are formed exclusively within each group or exclusively between groups). Bivariate and multivariate TE produce accurate estimates of the real modularity, while bivariate MI often underestimates it, for both shorter and longer time series (T = 1,000 in the top panel and T = 10,000 in the bottom one). The results for 10 simulations on different networks are presented (low-opacity markers) in addition to the mean values (solid markers).

MACAQUE CONNECTOME
Finally, the three inference algorithms are compared on two real macaque brain connectomes, using both linear VAR dynamics and a nonlinear neural mass model (pipeline illustrated in Figure 1).

Linear VAR dynamics.
As a final validation study under idealized conditions, the linear VAR dynamics in Equation 1 is run on the connectome obtained via tract-tracing by Young (1993). This directed network consists of 71 nodes and 746 links (15% density) and incorporates multiple properties investigated in the previous sections, including a small-world topology and the presence of hubs and modules. The scaling of the performance is studied as a function of the cross-coupling strength (i.e., the sum of incoming link weights into each node, denoted as C_in and formally defined as C_in = Σ_X C_XY for each node Y). The coupling C_in is varied in the [0.3, 0.7] range, making C_XY constant for each parent X of a given Y to achieve this, and the self-link weights are kept constant at C_XX = 0.2 to ensure the stationarity of the VAR dynamics. For robustness, each simulation is repeated 10 times with random initial conditions.

Nonlinear neural mass model.
As a final experiment, we provide an initial investigation of whether the insights from the previous validation studies extend beyond the idealized conditions used there. Specifically, in moving towards a more realistic setting, neural mass model dynamics are simulated on the CoCoMac connectome, as described in the Methods Section. This network structure contains 76 nodes with 1,560 directed connections (27% density), which are weighted and have experimentally estimated coupling delays. Importantly, by incorporating nonlinear coupling, coupling delays, a distribution of coupling weights, and subsampling, this last study drops many of the simplifying assumptions made using the VAR dynamics in the previous sections. The linear Gaussian estimator is retained for our information-theoretic measures despite the nonlinear interactions here, so as to remain consistent with the previous studies. Dropping the assumption of sampling at the real causal process resolution adds a particular challenge, and is often encountered in practice in modalities with low temporal resolution. To handle the variation in coupling delays, we consider sources at lags up to L = 4 time steps (60 ms) here. The longest time series analyzed (30,000 samples) corresponds to 7.5 min at one sample per 15 ms.
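For reference, the linear dynamics used in the first of these two studies (Equation 1) can be sketched as a first-order VAR with Gaussian innovations. The coupling matrix below is a toy example in the spirit of the weighting scheme described above (constant self-weight, cross-coupling C_in split evenly over each node's parents), not the actual connectome:

```python
import numpy as np

def simulate_var(C, T, seed=0):
    """First-order VAR: x[t] = C @ x[t-1] + standard Gaussian noise.
    A sketch of the paper's Equation 1; C[y, x] is the weight of link x -> y."""
    rng = np.random.default_rng(seed)
    n = C.shape[0]
    x = np.zeros((T, n))
    for t in range(1, T):
        x[t] = C @ x[t - 1] + rng.standard_normal(n)
    return x

# Toy 3-node example: self-weight 0.2, total incoming cross-coupling C_in = 0.4
# split evenly over each node's parents (hypothetical values for illustration).
n = 3
C = 0.2 * np.eye(n)
C[1, 0] = 0.4       # node 0 drives node 1
C[2, 0] = 0.2       # nodes 0 and 1 jointly drive node 2
C[2, 1] = 0.2
assert np.all(np.abs(np.linalg.eigvals(C)) < 1)  # stationarity check on C
x = simulate_var(C, T=10_000)
print(x.shape)  # (10000, 3)
```

Stationarity requires the spectral radius of C to stay below 1, which is why the paper keeps the self-weights and the total cross-coupling C_in bounded.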

Results
At the microscale, the results for the linear and nonlinear dynamics (Figure 13 and Figure 14) are complementary and summarize the main findings presented so far. There exists a window, characterized by low cross-coupling strength and short time series, where bivariate TE attains similar or better performance compared with multivariate TE in terms of recall, specificity, and precision. For stronger coupling or longer time series, the recall of all methods increases, but the precision and specificity of the bivariate methods substantially drop while those of multivariate TE remain consistently high.
An intuitive visual representation of how these differences in precision, recall, and specificity affect the macroscopic inferred network is provided in Figure 15, where the inferred adjacency matrices are displayed beside the real connectome, with different colors indicating which links are correctly/incorrectly inferred or missed by each method.
The macroscale results (in terms of local and global efficiency measures) are reported in the Supporting Information.

Discussion
Interestingly, Figures 13 and 14 show how similar outcomes are produced by either stronger coupling (link weights) or longer time series. An explanation is readily available in the simple case of VAR dynamics: The bivariate TE on spurious links is typically lower than the TE on real links, and it increases with the coupling strength. (Bivariate TE can be analytically derived from the network structure under the assumption of VAR dynamics, and spurious links are associated with larger motifs, e.g., longer chains, contributing to the TE at lower orders of magnitude.) Therefore, spurious links can only pass statistical significance tests when sufficiently long time series are available (in order for their weak TE values to be distinguished from noise); for the same reason, for shorter time series, spurious links can only be detected in the presence of strong enough coupling. Unfortunately, for real datasets, there is no consistent a priori way to determine the optimal window of time series lengths for bivariate TE, before the increasing false positive rate degrades precision and specificity.

Figure 13. Performance as a function of coupling weight in a real macaque connectome with 71 nodes and linear VAR dynamics. Multivariate TE (third row) guarantees high specificity (adequate control of the false positive rate) regardless of the cross-coupling strength and time series length (T = 1,000 in the left column and T = 10,000 in the right column). Bivariate TE (second row) attains similar performance to multivariate TE for low cross-coupling strength and short time series, according to all metrics. For stronger coupling or longer time series, the recall of all methods increases, but the precision and specificity of the bivariate methods substantially drop. For each value of the cross-coupling weights, the results for 10 simulations from random initial conditions are presented (low-opacity markers) in addition to the mean values (solid markers).

Figure 14. Performance as a function of the number of time series samples (T) in the CoCoMac connectome with 76 nodes and nonlinear dynamics (neural mass model). Similarly to the linear VAR dynamics case presented in the previous figure, multivariate TE (third row) has lower recall than bivariate methods, particularly for shorter time series. More importantly though, it guarantees higher and more consistent precision and specificity, testifying to a more effective control of false positives. This advantage becomes increasingly important as longer time series are provided. The results for all simulations are presented (low-opacity markers) in addition to the mean values (solid markers).
It is crucial to note that, despite moving beyond the idealized conditions used for validation with the VAR model, the qualitative differences between the inference algorithms remain unchanged in the neural mass model study. That is, multivariate TE attains higher and more consistent precision and specificity than bivariate methods, testifying to a more effective control of false positives that enables more faithful representation of macroscale network features-an advantage that becomes increasingly important as longer time series are provided. Indeed, on a more intuitive level, Figure 15 provides immediate visual evidence of the importance of a reliable method for controlling the false positive rate. The large number of spurious interhemispheric links produced by bivariate methods can hinder the identification of important macroscopic features, starting from the very presence of the two hemispheres themselves and extending to other fundamental network properties (as shown for local and global efficiency measures in the Supporting Information). This issue becomes particularly problematic for longer time series, as in the case of T = 30, 000 time samples shown in Figure 15.
With that said, the precision and specificity of the multivariate TE for this more realistic study in Figure 14 are noticeably lower compared with the previous experiments in idealized conditions, such as that shown in Figure 13. Specifically, the specificity is lower than would be expected from the proven well-controlled false positive rate under idealized conditions. This is potentially due to a number of factors in this study, including the subsampling, nonlinear dynamics, strong autocorrelation, and to a lesser extent coupling delays. While we retained the linear Gaussian estimator to be consistent with the previous experiments, we have previously demonstrated substantial performance enhancements for the multivariate TE algorithm when using a nonlinear estimator for studying nonlinear dynamics (Novelli et al., 2019). That could be expected to improve performance here as well. Similarly, we have recently demonstrated approaches to rigorously control the known inflation of false positive rates for MI and TE because of autocorrelation (Cliff et al., 2020), although this is not expected to have as dramatic an effect in this study because of the selection of the subsampling time via the autocorrelation time. The subsampling itself though (by a factor of 30 on the original time series) is likely to have had a significant impact on performance. This is because subsampling obscures our view of the dynamics at the real interaction scale, and prevents us from properly conditioning on the past of the target (which is known to inflate the false positive rate; Wibral et al., 2011). While initial studies have suggested some level of robustness of TE to subsampling (Lizier et al., 2011), a more systematic analysis would be important future work to properly understand its effect.

CONCLUSION
We have sought to evaluate how well network models produced via bivariate and multivariate network inference methods capture features of underlying structural topologies. As outlined in the Introduction, these inference techniques seek to infer a network model of the relationships between the nodes in a system, and are not necessarily designed or expected to replicate the underlying structural topology in general. Our primary focus, however, was in evaluating the techniques under specific idealized conditions under which effective network models are proven to converge to the underlying structure. The focus on such conditions is important because they provide assumptions under which our evaluation becomes a validation study. The performance of these methods was evaluated at both the microscopic and the macroscopic scales of the network. While we may not expect the same performance in identifying links at the microscopic scale, we should expect all of the methods to identify relevant macroscopic features in the underlying network structure, as well as distinctive nodes or groups of nodes.
For longer time series, multivariate TE performs better on all network topologies (lattice-like, small-world, scale-free, modular, and the real macaque connectome). This enhanced performance is very clear at the microscale of single links, achieving high precision and recall, and consequently at the macroscale of network properties, accurately reflecting the key summary statistics of the ground truth networks used for validation.
Bivariate methods (directed and undirected) can exhibit higher recall (or sensitivity) for shorter time series for certain underlying topologies; however, as available data increase, they are unable to control false positives (that is, they have lower specificity). While decreasing statistical significance thresholds (critical α levels) for inferring links is a common strategy to reduce false positives, the bivariate measures simply cannot match the sensitivity of the multivariate approach at the same specificity (compare the precisions at the same recall level in Figure 6 as an example, or refer to the sample ROC curve in the Supporting Information for a more extensive comparison). At the macroscale, the comparatively larger number of false positives leads to overestimated clustering, small-world, and rich-club coefficients, underestimated shortest path lengths and hub centrality, and fattened degree distribution tails. The changes in these measures are partly due to the aforementioned transitivity property of bivariate measures (implying that false positives are often "close" to real links in the network) and partly due to the higher density; untangling these effects is a topic for future work. In any case, caution should be used when interpreting network properties of functional connectomes obtained via correlation or pairwise statistical dependence measures. Their use is advisable only when the limited amount of data does not allow the use of the more sophisticated, and more accurate, multivariate TE, which more faithfully tracks trends in the underlying structural topology. Further research is required to reliably identify, a priori, situations where bivariate TE will exhibit higher precision and recall (particularly in terms of time series length), as there is no clear candidate approach for doing so at present.
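The direction of these biases can be made concrete with a small toy example of our own (not drawn from the study's data): adding a single "transitive" false positive to an undirected path graph closes a triangle, inflating the average clustering coefficient while shrinking the characteristic path length.

```python
from collections import deque

def avg_clustering(adj):
    """Average local clustering coefficient of an undirected graph,
    given as a dict mapping node -> set of neighbours."""
    total = 0.0
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue
        # count links among the neighbours of v (each pair once)
        links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)

def char_path_length(adj):
    """Average shortest path length over all ordered node pairs,
    via breadth-first search from each node (connected graph assumed)."""
    total = pairs = 0
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        for t in adj:
            if t != s:
                total += dist[t]
                pairs += 1
    return total / pairs

# Ground-truth structure: a 4-node path 0-1-2-3 (no triangles).
true_net = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
# Inferred network: one transitive false positive, 0-2, closes a triangle.
inferred = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
```

In this toy case the single spurious link raises the average clustering coefficient from 0 to 7/12 ≈ 0.58 and lowers the characteristic path length from 10/6 ≈ 1.67 to 8/6 ≈ 1.33, matching the trends reported above.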
As the methods currently stand, the critical strength of the multivariate approach lies in its ability to control the false positive rate at the requested level.
Our evaluation of the inference techniques under idealized conditions considered several of the highest profile complex network topologies: lattice-like, small-world, scale-free, modular, and a mix of their features in a real macaque connectome. This complements previous work (Novelli et al., 2019; Sun et al., 2015) at a similar scale, which evaluated performance on random network structures and incorporated a study of the effect of network size and linear-versus-nonlinear dynamics and estimators. Obviously, we have only scratched the surface of the myriad combinations of network parameters that could be investigated, which could include larger variations in degree, distributions on edge weights, super- or sublinear preferential attachment, nonuniform module sizes or cross-module connection probabilities, and could also incorporate experiments across other types of dynamics. Thus far, our conclusions on how the multivariate TE approach performs against the bivariate measures have been consistent across the variety of structures, and while it would be interesting to see how other variations in structure affect performance, we expect the general conclusions of the comparison between approaches to remain similar.
Of course, there is a computational tradeoff, with the multivariate TE algorithm requiring a factor of O(d) more run-time than the bivariate approach (where d is the average inferred in-degree). The run-time complexity is analyzed in detail and benchmarked for both linear and nonlinear estimators on similar-scale experiments in Novelli et al. (2019; Supporting Information). Our experiments here on 10,000 time samples for up to 200 nodes, with the more efficient linear estimator, took less than 2 hr (single core) on average per target on the same hardware. Given the availability of parallel computing to analyze targets simultaneously, we believe the increase in run-time is justified by the performance increase demonstrated here.
Beyond idealized conditions, effective network inference techniques are not guaranteed to converge in this manner to an underlying structure. This can be for many reasons, including hidden nodes (or a lack of full observability), nonstationarity or short sample size, or subsampling obscuring the scale of interaction. Yet our final experiment (examining time series dynamics of a neural mass model on the 76-node CoCoMac connectome) extended our investigations into the domain beyond idealized conditions and also demonstrated superior performance of the multivariate TE, aligning with the validation studies in idealized conditions. Importantly, this included visually revealing the characteristic hemispheric macroscopic structure of this connectome. With that said, the performance of multivariate TE in this example was certainly reduced in comparison to our experiments under idealized conditions. This appears to be due to various factors discussed in that section, including the use of a linear estimator on nonlinear dynamics as well as the effect of subsampling. There is substantial scope for further study to understand the performance of inference techniques under nonideal conditions, and how the effective network models they infer are related to the underlying structure. This will involve further experiments on realistic neural dynamics; systematic study of the effect of subsampling in network inference (building on existing studies for Granger causality; Barnett & Seth, 2017); and assessing the ability of inference algorithms to capture key network features when only a subset of nodes is observed (i.e., in the presence of hidden nodes).
Finally, while we focus on functional brain networks, our conclusions and methods also apply to anatomical brain networks in which connectivity is measured using correlation in cortical thickness or volume (He, Chen, & Evans, 2007). Beyond neuroscience, they also extend to metabolite, protein and gene correlation networks (Gillis & Pavlidis, 2011; a similar validation study using synthetic networks was carried out in gene regulatory networks using bivariate MI and TE; Budden & Crampin, 2016).