## Abstract

Functional and effective networks inferred from time series are at the core of network neuroscience. Interpreting properties of these networks requires inferred network models to reflect key underlying structural features. However, even a few spurious links can severely distort network measures, posing a challenge for functional connectomes. We study the extent to which micro- and macroscopic properties of underlying networks can be inferred by algorithms based on mutual information and bivariate/multivariate transfer entropy. The validation is performed on two macaque connectomes and on synthetic networks with various topologies (regular lattice, small-world, random, scale-free, modular). Simulations are based on a neural mass model and on autoregressive dynamics (employing Gaussian estimators for direct comparison to functional connectivity and Granger causality). We find that multivariate transfer entropy captures key properties of all network structures for longer time series. Bivariate methods can achieve higher recall (sensitivity) for shorter time series but are unable to control false positives (lower specificity) as available data increases. This leads to overestimated clustering, small-world, and rich-club coefficients, underestimated shortest path lengths and hub centrality, and fattened degree distribution tails. Caution should therefore be used when interpreting network properties of functional connectomes obtained via correlation or pairwise statistical dependence measures, rather than more holistic (yet data-hungry) multivariate models.

## Author Summary

We compare bivariate and multivariate methods for inferring networks from time series, which are generated using a neural mass model and autoregressive dynamics. We assess their ability to reproduce key properties of the underlying structural network. Validation is performed on two macaque connectomes and on synthetic networks with various topologies (regular lattice, small-world, random, scale-free, modular). Even a few spurious links can severely bias key network properties. Multivariate transfer entropy performs best on all topologies for longer time series.

## INTRODUCTION

Functional and effective network inference in neuroscience typically involves preprocessing the data, defining the parcellation, extracting the time series, inferring the links in the model network, and measuring network properties, for example, to compare patients and controls or to predict phenotype (Bassett & Sporns, 2017; Fornito, Zalesky, & Bullmore, 2016). Each step in the pipeline requires making modeling and analysis choices, whose influence on the final results is the subject of ongoing research (Aquino, Fulcher, Parkes, Sabaroedin, & Fornito, 2020; Cliff, Novelli, Fulcher, Shine, & Lizier, 2020; Zalesky et al., 2016). As part of this effort, we study how the choice of different inference algorithms affects the properties of the resulting network model in comparison to the underlying structural network, and whether the ability to accurately reflect these properties changes across different underlying structural networks.

The structure (or topology) of a network can be described at multiple scales (Bassett & Sporns, 2017): from the microscopic (individual links), to the mesoscopic (modules and motifs) and the macroscopic (summary statistics, such as average shortest path length and measures of small-worldness; Rubinov & Sporns, 2010). At each scale, the structure is associated with development, aging, cognition, and neuropsychiatric diseases (Xia et al., 2020). Previous studies have assessed the performance of different network inference algorithms in identifying the structural links at the microscale (Kim, Rogers, Sun, & Bollt, 2016; Novelli, Wollstadt, Mediano, Wibral, & Lizier, 2019; Razi, Kahan, Rees, & Friston, 2015; Runge, Nowack, Kretschmer, Flaxman, & Sejdinovic, 2018; Sun, Taylor, & Bollt, 2015). The goal of this work is to extend the assessment to all scales and to a variety of topologies, across a range of related network inference algorithms. We link the performance at the microscale to the resulting network properties at the macroscale, and describe how this changes as a function of the overall topology.

We compare bivariate and multivariate approaches for inferring network models, employing statistical dependence measures
based on information theory (Shannon, 1948). These approaches include *functional network
inference*, which produces models of networks of pairwise or bivariate
statistical relationships between nodes, and can either quantify undirected
statistical dependence, in the case of mutual information (MI; Cover &
Thomas, 2005), or directed dependence, in
the case of transfer entropy (TE; Bossomaier, Barnett, Harré, &
Lizier, 2016; Schreiber, 2000). These approaches also include *effective
network inference*, which is intended to produce the simplest possible
circuit models that explain the observed responses (Aertsen, Gerstein, Habib,
& Palm, 1989). In this class, we
evaluate the use of multivariate TE, which, in contrast to the bivariate approaches,
aims to minimize spurious links and infer minimal models of the parent *sets* for each target node in the network.

All of these inference techniques seek to infer a network *model* of
the relationships between the nodes in a system. Different methods capture different
aspects of these relationships and don’t necessarily seek to replicate the
underlying structural topology, nor do we expect them to *in general* (particularly in neuroimaging experiments, where aspects of the structure may be
expressed more or less or not at all, depending on the cognitive task). In spite of
that, in this paper we do seek to evaluate and indeed validate these methods in
inferring microscopic, mesoscopic, and macroscopic features of the underlying
network structure. Crucially, we perform this validation under idealized conditions (including full observability, stationarity, no subsampling, and so forth) that allow us to establish the hypothesis that effective networks should be not just complementary to the structural network but converge to it under these conditions, as the available data increase. Indeed, under these idealized conditions
(specifically in the absence of hidden nodes, and other simplifying assumptions,
including stationarity), effective networks inferred via multivariate TE are proven
to converge to the underlying structure for sufficiently long time series (Runge, 2018; Sun et al., 2015). In gaining an understanding of these multivariate
effective connectivity inference algorithms, it is important to validate that they
perform to that expectation where it is applicable, and investigate how that
performance varies with respect to factors such as sample size, and so on. In doing
so, we also address the recent call for more extensive model diversity in testing
multivariate algorithms: “To avoid biased conclusions, a large number of
different randomly selected connectivity structures should be tested [including link
density as well as properties such as small-worldness]” (Runge, 2018, p. 13).

Outside of these idealized circumstances though, we can no longer make a clear general hypothesis on how the effective network models are expected to reflect the underlying structure, yet a successful validation gives confidence that the directed statistical relationships they represent remain accurate as an effective network model at the microscale. Furthermore, it is at least desirable for not only effective networks but also functional networks to recognize important features in the underlying network structure: to track overall regime changes in the macroscopic structure reliably, and to reflect the mesoscopic properties of distinctive nodes (or groups of nodes) in the structure. The desire for recognition of important features in the network is applicable whether the inference is made under idealized conditions or not.

This motivates our validation study, which is primarily conducted under idealized conditions as above and based on synthetic datasets involving ground truth networks of 100–200 nodes with different topologies, from regular lattice to small-world, random, scale-free, and modular. Many of these structural properties are incorporated in the macaque connectomes analyzed in the last section. At the macroscale, we measure several fundamental and widely used properties, including shortest path length, clustering coefficient, small-world coefficient, betweenness centrality, and features of the degree distributions (Rubinov & Sporns, 2010). These properties of the inferred network models are compared with those of the real underlying structural networks in order to validate and benchmark different inference algorithms in terms of their ability to capture the key properties of the underlying topologies. At the microscale, the performance is assessed in terms of precision, recall, and specificity of the inferred model in classifying the links of the underlying structural network. As above, while we do not expect all approaches to strictly capture the microscale features, these results help to explain their performance at the macroscale.

For most of our experiments, the time series of node activity on these networks are generated by vector autoregressive (VAR) dynamics, with linearly coupled nodes and Gaussian noise. Both the VAR process and the inference algorithms are described in detail in the Methods section, where we also discuss how MI and the magnitude of Pearson correlation are equivalent for stationary VAR processes. This implies that the undirected networks obtained via the bivariate MI algorithm are equivalent to the widely employed undirected functional networks obtained via correlation, extending the implications of our results beyond information-theoretic methods. Further, our results based on TE extend to Granger causality, which is equivalent to TE for stationary VAR processes (Barnett, Barrett, & Seth, 2009). Networks inferred using bivariate TE are typically referred to as directed functional networks, to emphasize the directed nature of their links. The extension to multivariate TE for effective network inference can also be viewed as an extension to multivariate Granger causality for the stationary VAR processes here.

We find that multivariate TE performs better on all network topologies at all scales, for longer time series. Bivariate methods can achieve better recall with limited amounts of data (shorter time series) in some circumstances, but the precision and the ability to control false positives are neither consistent nor predictable a priori. On the other hand, thanks to recent statistical improvements, multivariate TE guarantees high specificity and precision regardless of the amount of data available, and the recall steadily increases with more data. We discuss how the mesoscopic properties of the underlying structural network—particularly the network motifs—can influence the precision and recall of the model at the microscale. In turn, we show how the performance at the microscale affects the inferred network properties at the macroscale. We observe that bivariate methods are often unable to capture the most distinctive topological features of the networks under study (including path length, clustering, degree distribution, and modularity), largely because of their inability to control false positives at the microscale.

Our final section moves beyond the validation under idealized conditions to extend
the experiments to time series from a neural mass model employed on the CoCoMac
connectome. Although the incorporation of complexities such as nonlinear
interactions and subsampled time series does somewhat reduce the performance of the
methods, the superior performance of multivariate TE aligns with the results above
from the validation in idealized conditions. While further and more wide-ranging
experiments are required, our experiments provide substantial evidence for the
validity of multivariate TE in providing effective network *models*,
which still retain meaningful network insights in more realistic conditions.

## METHODS

### Generating Dynamics on Networks

Two models are used to generate time series dynamics on networks of coupled variables. Vector autoregressive processes are employed for validation studies under idealized conditions, and a neural mass model is used on the weighted CoCoMac connectome as a final investigation going beyond these conditions. The simulation and analysis pipeline is illustrated in Figure 1.

#### Networks of linearly coupled Gaussian variables.

Let us consider a network of *N* nodes. A VAR process is described by the recurrence relation

$$\mathbf{Z}_t = \mathbf{Z}_{t-1}\, C + \boldsymbol{\varepsilon}_t, \qquad (1)$$

where **Z**_{t} = (*Z*_{1}(*t*), …, *Z*_{N}(*t*)) is a row vector and *Z*_{i}(*t*) is the activity of node *i* at time *t*. The Gaussian noise **ε**_{t} is spatially and serially uncorrelated, with standard deviation *θ* = 0.1. The *N* × *N* weighted adjacency matrix *C* = [*C*_{ij}] describes the network structure, where *C*_{ij} is the weight of the directed connection from node *i* to node *j*. These dynamics are generated on various network topologies, as detailed in the following sections. The choice of the weights *C*_{ij} guarantees the stability of the system, which is a sufficient condition for stationarity (Atay & Karabacak, 2006). Since stationary VAR processes have multivariate Gaussian distributions, the information-theoretic measures we use can be directly related to Pearson correlation and Granger causality (Granger, 1969). The simple VAR dynamics are chosen as the primary model for our validation studies instead of nonlinear alternatives because the main goal is not to prove the superiority of nonlinear dependence measures on nonlinear systems, which has been shown elsewhere (Novelli et al., 2019). We rather aim to show that, even on linearly coupled Gaussian variables—perfectly suitable to be studied via cross-correlations—multivariate approaches are better able to infer the macroscopic network properties. In addition, the VAR dynamics are amenable to be investigated using the faster Gaussian estimator for the information-theoretic measures, allowing us to carry out more extensive simulations over a wider range of parameters.

#### Neural mass model on the CoCoMac connectome.

To provide an extension beyond the linear VAR dynamics, neural activity in various brain regions is modeled (following Shine, Aburn, Breakspear, & Poldrack, 2018; Li et al., 2019) as an oscillating two-dimensional neural mass model derived by mode decomposition from the Fitzhugh-Nagumo single neuron model (FitzHugh, 1961). As previously presented (Li et al., 2019; Shine et al., 2018), the CoCoMac connectome (Kötter, 2004) is used to provide directed coupling between 76 regions, with axonal time delays between these regions based on the length of fiber tracts as estimated by diffusion spectrum imaging (Sanz Leon et al., 2013). The data from Li et al. (2019) and Shine et al. (2018) were simulated using the open-source framework The Virtual Brain (Sanz Leon et al., 2013), with code implementing the model freely available at https://github.com/macshine/gain_topology (Shine, 2018).

Coupled stochastic differential equations (following FitzHugh, 1961; Shine et al., 2018) describe the temporal evolution of the fast membrane potential (*V*) and the slow recovery variable (*W*) at each regional node *i*, where *ξ*_{i} and *η*_{i} are independent standard Wiener noises and *I*_{i} is the synaptic current

$$I_i(t) = \sum_{j} C_{ji}\, S_j(t - \tau_{ji}),$$

with *C*_{ji} indicating the connection weight from *j* to *i* and incorporating time delays *τ*_{ji} from *j* to *i* (estimated as described above). The CoCoMac connectome network contains 1,560 directed connections (including 66 self-links), with *τ*_{ji} on non-self links having an average of 19.8 ms (standard deviation 8.32 ms). The membrane potentials *V*_{i} are converted to normalized firing rates *S*_{i} via a sigmoid activation function, with midpoint *m* = 1.5 chosen to align the sigmoid with its typical input. The parameters for gain *σ* = 0.5 (in the sigmoid activation function) and excitability *γ* = 0.3 (in the membrane potential equation) are selected to simulate activity in the integrated regime of dynamics identified by Shine et al. (2018).

Finally, the time series of membrane voltage *V*_{i}(*t*)
(originally obtained with a 0.5-ms temporal resolution via stochastic Heun
integration) are subsampled at 15 ms (selected as half the median time for
the autocorrelation functions to decay to 1/*e*).
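As an illustration of this subsampling criterion, the autocorrelation decay time can be estimated as below. This is our own helper, demonstrated on an AR(1) surrogate rather than on the neural mass output:

```python
import numpy as np

def autocorr_decay_time(x, dt, max_lag=200):
    """Return the first lag (in time units of dt) at which the
    autocorrelation function of x drops below 1/e."""
    x = x - x.mean()
    var = np.dot(x, x) / len(x)
    for lag in range(1, max_lag):
        ac = np.dot(x[:-lag], x[lag:]) / (len(x) * var)
        if ac < 1.0 / np.e:
            return lag * dt
    raise ValueError("autocorrelation did not decay within max_lag")

# AR(1) surrogate with coefficient a: the autocorrelation a**k falls below
# 1/e at roughly -1/ln(a) steps (about 9.5 steps for a = 0.9).
rng = np.random.default_rng(0)
a, T = 0.9, 50_000
x = np.zeros(T)
for t in range(1, T):
    x[t] = a * x[t - 1] + rng.normal()

tau = autocorr_decay_time(x, dt=0.5)  # roughly 5 time units
```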

### Network Inference Algorithms

#### Bivariate mutual information for functional connectivity.

Mutual information (MI) is computed between all pairs of nodes independently,
in a *bivariate* fashion, and only the measurements that pass
a strict statistical significance test (described below) are interpreted as
undirected links.

The MI between two continuous random variables *X* and *Y* with joint probability density function *μ*(*x*, *y*) and marginal densities *μ*_{X}(*x*) and *μ*_{Y}(*y*) is defined as

$$I(X;Y) = \int \mu(x, y)\, \log \frac{\mu(x, y)}{\mu_X(x)\, \mu_Y(y)}\; dx\, dy,$$

where the integral is taken over all pairs (*x*, *y*) such that *μ*(*x*, *y*) > 0. The strength of MI lies in its model-free nature, meaning that it doesn't require any assumptions on the distribution of the variables (e.g., Gaussian). Being able to capture nonlinear relationships, MI is typically presented as a generalized version of the Pearson correlation coefficient. However, for the VAR processes considered here (Equation 1) with stationary multivariate Gaussian distributions, the MI between two variables *X* and *Y* is completely determined by the magnitude of their Pearson correlation coefficient *ρ* (Cover & Thomas, 2005):

$$I(X;Y) = -\frac{1}{2} \log\!\left(1 - \rho^2\right).$$

This direct relationship between MI and *ρ* (for VAR processes) implies that the networks inferred via the bivariate MI algorithm are equivalent to the functional networks obtained via cross-correlation, which are widely employed in neuroscience. This equivalence persists whenever a Gaussian estimator for MI is used (which models the processes as VAR), even for nonlinear dynamics, as is the case in our experiments. Differences may lie in how the raw MI values are transformed into a network structure. Early approaches often used a fixed threshold aimed at obtaining a prescribed link density, while the bivariate MI algorithm used here adopts an adaptive threshold (different for each link) to meet a desired statistical significance level. The statistical significance is computed via null hypothesis testing to reflect the probability of observing a larger MI from the same samples if their temporal relationship were destroyed (the *p* value is obtained from a chi-square test, as summarized in Lizier, 2014). The critical level for statistical significance is set to *α* = 0.01/*N*, where *N* is the network size. This produces a Bonferroni correction for the inference of parent nodes for each target (i.e., for each target, there is a 0.01 chance under the null hypothesis that at least one spurious parent node is selected, assuming independent sources).
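A minimal sketch of this test for a single pair of nodes, assuming the Gaussian estimator and the standard likelihood-ratio chi-square approximation (2*nI* is asymptotically χ²(1) under independence); the function names are ours and this is not the IDTxl implementation:

```python
import numpy as np
from scipy import stats

def gaussian_mi(x, y):
    """MI between two Gaussian variables, determined by their Pearson correlation."""
    rho = np.corrcoef(x, y)[0, 1]
    return -0.5 * np.log(1.0 - rho ** 2)

def mi_link(x, y, n_nodes, alpha=0.01):
    """Test a single undirected link at the Bonferroni-corrected level
    alpha / n_nodes.  Under the null hypothesis of independence, 2 * n * MI
    is asymptotically chi-square with 1 degree of freedom."""
    n = len(x)
    mi = gaussian_mi(x, y)
    p = stats.chi2.sf(2 * n * mi, df=1)
    return mi, p < alpha / n_nodes

rng = np.random.default_rng(1)
x = rng.normal(size=5000)
y = 0.5 * x + rng.normal(size=5000)  # genuinely coupled to x
z = rng.normal(size=5000)            # independent of x

mi_xy, link_xy = mi_link(x, y, n_nodes=100)  # link accepted
mi_xz, link_xz = mi_link(x, z, n_nodes=100)  # link rejected
```

Note the adaptive nature of the threshold: the same critical level translates into a different MI cutoff for each pair, depending on the sample size and null distribution.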

#### Bivariate transfer entropy for directed functional connectivity.

Transfer entropy (TE) is computed between all pairs of nodes independently,
in a *bivariate* fashion, and only the measurements that pass
a strict statistical significance test (described below) are interpreted as
links.

Unlike MI, TE is an inherently directed, asymmetric measure (the TE from a source node *X* to a target node *Y* is not necessarily the same as the TE from *Y* to *X*), and specifically considers information about the dynamic state updates of the target *Y*. Thus, employing TE has the advantage of generating directed networks and providing a more detailed model of the dynamics of the system under investigation. Formally, the TE from a source stochastic process *X* to a target process *Y* is defined as (Schreiber, 2000)

$$TE_{X \to Y} = I(X_{t-1};\, Y_t \mid Y_{<t}),$$

where *I*(*X*_{t−1}; *Y*_{t} | *Y*_{<t}) is the conditional mutual information (Cover & Thomas, 2005) between the previous sample *X*_{t−1} of the source and the next sample *Y*_{t} of the target, in the context of (the vector of) the target's past values *Y*_{<t}. The directed and dynamic nature of TE derives specifically from taking the past of the target into account when measuring the lagged statistical dependence between *X* and *Y*.

In practice, in order to estimate the TE from the time series, *Y*_{<t} is usually constructed as an embedding vector (Takens, 1981), and a maximum lag must be specified to
build a finite embedding of the target’s past (using either uniformly
or nonuniformly spaced variables; Faes, Nollo, & Porta, 2011; Kugiumtzis, 2013; Vlachos & Kugiumtzis, 2010). Here, using the IDTxl Python
package (Wollstadt et al., 2019), a nonuniform embedding of the
target’s past *Y*_{<t} is built via iterative greedy selection of statistically significant
elements of *Y*_{<t} (as
detailed in the next section). While similar multivariate embeddings and
lags larger than one time step can be used for the source *X* in principle, here only a single sample at lag 1 for the source is used for
the VAR model in alignment with its dynamics.

Analogously to the MI, the bivariate TE values are transformed into a
directed network structure by testing their statistical significance. This
is computed (using a theoretical null distribution, as summarized in Lizier, 2014) to reflect the
probability of observing a larger TE from the same samples if the source
samples were temporally decoupled from the target and its past. A critical
level of *α* = 0.01/*N* is used as per
MI.

Bivariate TE has found wide application in studies of directed functional connectivity (e.g., Honey, Kotter, Breakspear, & Sporns, 2007; Lizier, Heinzle, Horstmann, Haynes, & Prokopenko, 2011; Marinazzo, Wu, Pellicoro, Angelini, & Stramaglia, 2012; Orlandi, Stetter, Soriano, Geisel, & Battaglia, 2014; Stetter, Battaglia, Soriano, & Geisel, 2012; Wibral et al., 2011). Importantly, TE and Granger causality are equivalent for Gaussian variables (Barnett et al., 2009), which applies to the VAR processes considered here (Equation 1), suggesting that a Gaussian estimator for TE be employed (Bossomaier et al., 2016). The networks inferred via the bivariate TE algorithm with this estimator are equivalent (for any dynamics) to those obtained via Granger causality, which is also widely employed in neuroscience.
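Under the Gaussian assumption, this equivalence yields a particularly simple estimator: the TE is half the log-ratio of the residual variances of two nested linear regressions. The following is our own minimal sketch (a uniform target embedding of *k* past samples and a single source sample at lag 1), not the IDTxl implementation:

```python
import numpy as np

def gaussian_te(source, target, k=1):
    """Bivariate TE via the Granger-causality equivalence for Gaussian
    variables (Barnett et al., 2009): regress Y_t on its own past, then on
    its past plus X_{t-1}, and compare residual variances."""
    T = len(target)
    Yt = target[k:]
    Ypast = np.column_stack([target[k - 1 - l: T - 1 - l] for l in range(k)])
    Xpast = source[k - 1: T - 1]

    def resid_var(y, regressors):
        A = np.column_stack([np.ones(len(y)), regressors])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        return np.var(y - A @ beta)

    reduced = resid_var(Yt, Ypast)                         # target past only
    full = resid_var(Yt, np.column_stack([Ypast, Xpast]))  # ... plus X_{t-1}
    return 0.5 * np.log(reduced / full)

# Two-node VAR example: X drives Y, but not vice versa.
rng = np.random.default_rng(0)
T = 10_000
x = np.zeros(T); y = np.zeros(T)
for t in range(1, T):
    x[t] = 0.5 * x[t - 1] + rng.normal(scale=0.1)
    y[t] = 0.5 * y[t - 1] + 0.4 * x[t - 1] + rng.normal(scale=0.1)

te_xy = gaussian_te(x, y)  # substantially positive
te_yx = gaussian_te(y, x)  # near zero
```

In the full pipeline, such raw values would then be subjected to the significance test described above rather than thresholded directly.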

#### Multivariate transfer entropy for effective connectivity.

In contrast to the bivariate approaches, the multivariate TE algorithm aims at inferring, for each target node, the *minimal set* of sources that collectively contribute to the computation of the target's next state. More formally, for each target *Y*, this method aims at identifying the minimal set of sources *X*_{<t} that maximize the collective TE to *Y*, defined as

$$TE_{\mathbf{X} \to Y} = I(\mathbf{X}_{<t};\, Y_t \mid Y_{<t}).$$

The multivariate TE network inference algorithm is described in full in
Novelli et al. (2019), synthesizing
together components of Faes et al. (2011), Lizier and Rubinov (2012), Montalto, Faes, and Marinazzo (2014), Sun et al. (2015), and Vlachos and Kugiumtzis (2010). A greedy approach is used to
iteratively select the candidate variables to add to *X*_{<t} from a candidate set {*X*_{i, t−l}} of lagged
variables from the past of each source *X*_{i} ∈ **Z** ∖{*Y*}, up to some maximum lag *l* ≤ *L* (*L* = 1
for the VAR model and *L* = 4 for the neural mass model used
here). Testing the statistical significance of the conditional mutual
information for a new candidate source to be included at each
step—conditioning on the previously selected sources—provides
an adaptive stopping condition for the greedy algorithm. The family-wise
error rate for each target is set to *α* = 0.01 using
the max statistic (Novelli et al., 2019), meaning that for each target there is a 0.01 chance that,
under the null hypothesis, at least one spurious parent node is selected.
The conditioning on previously selected sources serves to prevent the
selection of spurious sources that are correlated with true sources because
of common driver as well as pathway or chain effects (referred to as holding
redundant information only; Lizier & Rubinov, 2012; Stramaglia, Cortes, & Marinazzo, 2014). Such conditioning also
enables capturing multivariate or synergistic effects on the target that
cannot be detected by examining individual sources in isolation (Lizier
& Rubinov, 2012; Stramaglia
et al., 2014). Furthermore, in
contrast to always conditioning on all potential sources, by conditioning *only* on previously selected sources the iterative
approach defers overly high-dimensional analysis until it is genuinely
required, buying statistical sensitivity (“recall”; Lizier
& Rubinov, 2012; Novelli et
al., 2019). Every node is studied
as a target (in parallel on a computing cluster, using IDTxl; Wollstadt et
al., 2019) and the results are then
combined into a directed network describing the information flows in the
system. Similar to the bivariate TE discussed above, a nonuniform embedding
of the target’s past *Y*_{<t} is built first (Vlachos & Kugiumtzis, 2010), before the second step of selecting
sources via the same iterative greedy algorithm (Novelli et al., 2019). While multiple past samples of
any given source can be considered (e.g., as has been done in Novelli et al., 2019), only one past value is
examined here (*L* = 1) for the VAR experiments in line with
their known structure in Equation
1 and in order to focus on network structure effects only.

Given that TE and Granger causality are equivalent for Gaussian variables (Barnett et al., 2009), using the Gaussian estimator with the multivariate TE algorithm can be viewed as extending Granger causality in the same multivariate/greedy fashion.
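The greedy selection can be sketched as follows for the Gaussian case with *L* = 1. This is a deliberately simplified illustration: it uses a fixed target embedding of one past sample and a chi-square test per candidate, whereas IDTxl builds a nonuniform embedding and controls the family-wise error rate with the max-statistic permutation test; all function names are ours.

```python
import numpy as np
from scipy import stats

def gaussian_cmi(x, y, cond):
    """Conditional MI I(x; y | cond) for Gaussian variables, via the partial
    correlation of the residuals of x and y regressed on cond."""
    A = np.column_stack([np.ones(len(x)), cond])
    rx = x - A @ np.linalg.lstsq(A, x, rcond=None)[0]
    ry = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    r = np.corrcoef(rx, ry)[0, 1]
    return -0.5 * np.log(1.0 - r ** 2)

def greedy_parents(Z, target, alpha=0.01):
    """Greedily select parents of one target (all sources at lag 1),
    conditioning each new candidate on the target's past and the parents
    already selected; stop when the best remaining candidate is no longer
    significant (adaptive stopping condition)."""
    T, N = Z.shape
    Yt = Z[1:, target]
    cond = Z[:-1, [target]]                  # target's own past (length-1 embedding)
    candidates = [i for i in range(N) if i != target]
    parents = []
    while candidates:
        cmis = [gaussian_cmi(Z[:-1, i], Yt, cond) for i in candidates]
        best = int(np.argmax(cmis))
        p = stats.chi2.sf(2 * (T - 1) * cmis[best], df=1)
        if p >= alpha:
            break
        parents.append(candidates.pop(best))
        cond = np.column_stack([cond, Z[:-1, [parents[-1]]]])
    return sorted(parents)

# Example: 5-node VAR in which node 0 has true parents {1, 2}; node 3 shares
# the driver 1, so a bivariate analysis might flag it, but the conditioning
# on previously selected parents should exclude it.
rng = np.random.default_rng(2)
N, T = 5, 10_000
C = np.zeros((N, N))
np.fill_diagonal(C, 0.3)
C[1, 0] = C[2, 0] = C[1, 3] = 0.4
Z = np.zeros((T, N))
for t in range(1, T):
    Z[t] = Z[t - 1] @ C + rng.normal(scale=0.1, size=N)

parents = greedy_parents(Z, target=0)  # expected to contain 1 and 2
```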

### Evaluation Metrics

At the microscale (individual links), the network inference performance is
evaluated against the known underlying network structure as a binary
classification task, using standard statistics based on the number of *true positives* (TP, i.e., correctly classified existing
links), *false positives* (FP, i.e., absent links falsely
classified as existing), *true negatives* (TN, i.e., correctly
classified absent links), and *false negatives* (FN, i.e.,
existing links falsely classified as absent). The following standard statistics
are employed in the evaluation:

- **Precision** = *TP*/(*TP* + *FP*);
- **Recall (true positive rate)** = *TP*/(*TP* + *FN*);
- **Specificity (true negative rate)** = *TN*/(*TN* + *FP*);
- **False positive rate** = *FP*/(*FP* + *TN*).

Intuitively, the precision measures how often an inferred link is actually
present in the underlying structure, the recall measures the proportion of true
links that are detected, and the specificity measures the proportion of absent
links that are correctly not inferred. For a properly controlled family-wise
error rate (*α*), the expected specificity is 1 − *α*.
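These statistics map directly onto a comparison of adjacency matrices; a small self-contained sketch (self-links are excluded here for simplicity):

```python
import numpy as np

def link_metrics(A_true, A_inf):
    """Precision, recall, specificity, and false positive rate for
    directed-link classification, comparing an inferred adjacency matrix
    against the true one (off-diagonal entries only)."""
    off = ~np.eye(A_true.shape[0], dtype=bool)
    t = A_true[off].astype(bool)
    p = A_inf[off].astype(bool)
    TP = np.sum(t & p); FP = np.sum(~t & p)
    TN = np.sum(~t & ~p); FN = np.sum(t & ~p)
    return {"precision": TP / (TP + FP),
            "recall": TP / (TP + FN),
            "specificity": TN / (TN + FP),
            "fpr": FP / (FP + TN)}

# Toy example: a 3-node cycle, inferred with one false positive and one miss.
A_true = np.array([[0, 1, 0],
                   [0, 0, 1],
                   [1, 0, 0]])
A_inf = np.array([[0, 1, 1],
                  [0, 0, 1],
                  [0, 0, 0]])
m = link_metrics(A_true, A_inf)  # precision = recall = specificity = 2/3
```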

At the macroscale, the performance is evaluated in terms of the accuracy in measuring network properties of interest on the inferred network, as compared with their real values when measured on the underlying structural network. These properties include the following:

- **Characteristic path length.** The average shortest distance between all pairs of nodes (Rubinov & Sporns, 2010). A shorter average value is typically interpreted as an indication of the efficiency of the network in propagating information. The characteristic path length is only well defined for connected networks—a problem that is avoided by construction in this study by only generating connected networks. This limitation could be alternatively overcome by replacing the characteristic path length with the analogous global efficiency (Latora & Marchiori, 2001), reported in the Supporting Information for completeness.
- **Clustering coefficient.** The clustering coefficient of a node is the fraction of the node's neighbours that are also neighbours of each other (Fagiolo, 2007; Watts & Strogatz, 1998). The mean clustering coefficient hence reflects the average prevalence of clustered connectivity around individual nodes.
- **Small-worldness coefficient.** Small-world networks are formally defined as networks that are significantly more clustered than random networks, yet have approximately the same characteristic path length as random networks (Watts & Strogatz, 1998). The small-worldness coefficient was proposed by Humphries and Gurney (2008) to capture this effect in the following single statistic (although improvements on the original measure have recently been suggested; Neal, 2017; Zanin, 2015):

  $$\sigma = \frac{C / C_{\mathrm{rand}}}{L / L_{\mathrm{rand}}}. \qquad (9)$$

  In Equation 9, *C* and *L* respectively denote the average clustering coefficient and the characteristic path length of the network under study; analogously, *C*_{rand} and *L*_{rand} denote the average clustering coefficient and the characteristic path length of Erdős-Rényi networks having the same size and number of links as the network under study.
- **Degree distribution.** The probability distribution of the in- and out-degree over the whole network (where the in- and out-degree of a node are defined as the number of its incoming and outgoing connections, respectively).
- **Betweenness centrality.** The fraction of all shortest paths in the network that pass through a given node (Rubinov & Sporns, 2010), excluding paths that start and end on the given node.
- **Modularity.** A measure of the separation of a network into specific groups (or modules), defined as the fraction of the edges that fall within the given groups minus the expected fraction if the edges were distributed at random (Newman & Girvan, 2004).
- **Rich-club coefficient.** The "rich-club" phenomenon refers to the tendency of hubs (nodes with high degree) to form tightly interconnected communities (Colizza, Flammini, Serrano, & Vespignani, 2006). The rich-club coefficient was first proposed by Zhou and Mondragon (2004) to quantify this effect:

  $$\varphi(k) = \frac{2\, E_{>k}}{N_{>k}\,(N_{>k} - 1)}, \qquad (10)$$

  where *E*_{>k} denotes the number of edges among the *N*_{>k} nodes having degree higher than a given value *k*.
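The first two properties can be computed directly from a binary adjacency matrix. The following self-contained sketch implements the undirected definitions (the directed generalizations used in the study are analogous):

```python
import numpy as np
from collections import deque

def mean_clustering(A):
    """Mean clustering coefficient of an undirected 0/1 adjacency matrix:
    for each node, the fraction of its neighbour pairs that are linked."""
    coeffs = []
    for i in range(A.shape[0]):
        nbrs = np.flatnonzero(A[i])
        k = len(nbrs)
        if k < 2:
            coeffs.append(0.0)
            continue
        links = A[np.ix_(nbrs, nbrs)].sum() / 2      # links among neighbours
        coeffs.append(links / (k * (k - 1) / 2))
    return float(np.mean(coeffs))

def characteristic_path_length(A):
    """Average shortest path length over all ordered node pairs, via
    breadth-first search from each node (assumes a connected network)."""
    N = A.shape[0]
    total = 0
    for s in range(N):
        dist = np.full(N, -1)
        dist[s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in np.flatnonzero(A[u]):
                if dist[v] < 0:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += dist.sum()
    return total / (N * (N - 1))

# Ring lattice of 20 nodes with two neighbours per side: the clustering
# coefficient is analytically 0.5 for this lattice.
N = 20
A = np.zeros((N, N), dtype=int)
for i in range(N):
    for off in (1, 2):
        A[i, (i + off) % N] = A[i, (i - off) % N] = 1

C = mean_clustering(A)            # 0.5 for this lattice
L = characteristic_path_length(A)
```

Dividing each quantity by its average over matched Erdős-Rényi graphs then gives the small-worldness coefficient of Equation 9.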

## SMALL-WORLD NETWORKS

### Numerical Simulations

The first experiment is aimed at testing the robustness of the three inference
algorithms with respect to vast changes in network structure. The Watts-Strogatz
model is used to generate a spectrum of topologies, ranging from regular
lattices to random networks (similar to Erdős-Rényi networks,
although not equivalent; Maier, 2019)
through a small-world transition (Watts & Strogatz, 1998). Each simulation starts with a directed ring
network of 100 nodes with uniform link weights *C*_{ij} = *C*_{ii} = 0.15 and fixed
in-degree *d*_{in} = 4 (i.e., each node is linked to two
neighbors on each side, as well as to itself via a self-loop). The source of
each link is then rewired with a given probability *p* ∈
[0, 1], so as to change the overall network topology while keeping the in-degree
of each node fixed. Only rewiring attempts that keep the network connected are
accepted, in order to allow the measurement of the average shortest path length.
The simulations for each *p* are repeated 10 times on different
network realizations and with random initial conditions.
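This construction can be sketched as follows. The implementation below is our own illustration (it interprets "connected" as weakly connected), not the generator used in the study:

```python
import numpy as np
from collections import deque

def is_weakly_connected(A):
    """BFS on the undirected version of A."""
    N = A.shape[0]
    U = (A + A.T) > 0
    seen = np.zeros(N, dtype=bool)
    seen[0] = True
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for v in np.flatnonzero(U[u]):
            if not seen[v]:
                seen[v] = True
                queue.append(v)
    return bool(seen.all())

def directed_watts_strogatz(N=100, p=0.1, w=0.15, seed=0, max_tries=50):
    """Directed ring with in-degree 4 (two neighbours per side) plus
    self-loops; the *source* of each non-self link is rewired with
    probability p, keeping every node's in-degree fixed and accepting only
    rewirings that keep the network connected."""
    rng = np.random.default_rng(seed)
    A = np.zeros((N, N))
    for j in range(N):
        A[j, j] = w                                   # self-loop
        for off in (1, 2):
            A[(j - off) % N, j] = A[(j + off) % N, j] = w
    for j in range(N):                                # incoming links of node j
        for i in np.flatnonzero(A[:, j]):
            if i == j or rng.random() >= p:
                continue                              # keep self-loops as-is
            for _ in range(max_tries):
                new = int(rng.integers(N))
                if new == j or A[new, j] > 0:         # no self-link, no duplicate
                    continue
                A[i, j], A[new, j] = 0.0, w           # tentative rewiring
                if is_weakly_connected(A):
                    break                             # accept
                A[i, j], A[new, j] = w, 0.0           # revert and retry
    return A

A = directed_watts_strogatz()
```

Because only the source of each link is moved, every column of the adjacency matrix retains exactly five nonzero entries (in-degree 4 plus the self-loop) for any rewiring probability.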

### Results

At the microscale, the performance is evaluated in terms of precision, recall,
and specificity in the classification of the links (present or absent) in the
inferred network compared with the underlying structural network. In the case of
bivariate MI, each undirected link in the inferred network is represented as two
directed links in opposite directions. For longer time series of 10,000 samples,
multivariate TE is the most accurate method, achieving optimal performance
according to all metrics on all the network topologies generated by the
Watts-Strogatz rewiring model (Figure 2,
right column). Bivariate TE also achieves nearly optimal recall and high
specificity on all topologies; however, despite the strict statistical
significance level, the precision is significantly lower on lattice-like
topologies (low rewiring probability) than on random ones (high rewiring
probability). The opposite trend is shown by the bivariate MI algorithm, whose
precision and recall drastically decrease with increasing rewiring probability.
As expected, the recall of all methods decreases when shorter time series of
1,000 samples are provided (Figure 2, left
column). However, the recall for multivariate TE is consistent across
topologies, while it decreases with higher rewiring probability when bivariate
methods are used. This results in the bivariate TE having larger recall for
lattice-like topologies, while multivariate TE has larger recall than bivariate
for more random topologies (i.e., for a rewiring probability larger than *p* = 0.2). A further interesting effect is that bivariate TE
attains better precision on shorter time series than on longer ones.
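For reference, the link-classification metrics used throughout can be computed directly from binary adjacency matrices; in this sketch, self-links are excluded from the scoring (an assumption of ours, made for concreteness):

```python
import numpy as np

def link_metrics(true_adj, inferred_adj):
    """Precision, recall (sensitivity), and specificity over directed links,
    comparing an inferred adjacency matrix against the structural one.
    The diagonal (self-links) is excluded by assumption."""
    off_diag = ~np.eye(true_adj.shape[0], dtype=bool)
    t = true_adj[off_diag].astype(bool)
    i = inferred_adj[off_diag].astype(bool)
    tp = int(np.sum(t & i))    # correctly inferred links
    fp = int(np.sum(~t & i))   # spurious links (false positives)
    fn = int(np.sum(t & ~i))   # missed links (false negatives)
    tn = int(np.sum(~t & ~i))  # correctly absent links
    return {"precision": tp / (tp + fp),
            "recall": tp / (tp + fn),
            "specificity": tn / (tn + fp)}
```

For bivariate MI, each undirected inferred link would first be expanded into two opposite directed links, as described above.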

At the macroscale, the three algorithms are tested on their ability to accurately measure three fundamental network properties relevant through the small-world transition, using the longer time series of 10,000 samples. Multivariate TE closely approximates the real shortest path length on all the network topologies generated by the Watts-Strogatz rewiring model, while the bivariate MI and TE algorithms produce significant underestimates, particularly on lattice-like topologies (Figure 3). Similarly, multivariate TE closely matches the real mean clustering on all topologies, while the bivariate MI and TE algorithms consistently overestimate it (Figure 4). The related measure of local efficiency (Latora & Marchiori, 2001) is reported in the Supporting Information.

Given the above results on the characteristic path length and the mean clustering coefficient, it is not surprising that the bivariate MI and TE algorithms significantly overestimate the real small-worldness coefficient, while the multivariate TE method produces accurate estimates on all the network topologies generated by the Watts-Strogatz rewiring model (Figure 5). Equivalent results are found if the alternative measures of “small-world index” (Neal, 2017) or “double-graph normalized index” (Telesford, Joyce, Hayasaka, Burdette, & Laurienti, 2011) are computed instead (not shown).
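As a sketch of how these macroscale summaries fit together, the small-worldness coefficient can be computed as the clustering ratio over the path-length ratio relative to degree-matched random references; this is one of the standard definitions, and the alternative indices cited above differ mainly in the choice of reference:

```python
import networkx as nx

def small_worldness(g, n_ref=10, seed=0):
    """sigma = (C / C_rand) / (L / L_rand) for an undirected connected graph.
    Random references preserve the degree sequence via double-edge swaps
    (a common, not unique, choice); disconnected references are skipped so
    that L_rand stays well defined."""
    C = nx.average_clustering(g)
    L = nx.average_shortest_path_length(g)
    C_rand, L_rand = [], []
    for k in range(n_ref):
        r = g.copy()
        nx.double_edge_swap(r, nswap=5 * r.number_of_edges(),
                            max_tries=500 * r.number_of_edges(), seed=seed + k)
        if nx.is_connected(r):
            C_rand.append(nx.average_clustering(r))
            L_rand.append(nx.average_shortest_path_length(r))
    return (C / (sum(C_rand) / len(C_rand))) / (L / (sum(L_rand) / len(L_rand)))
```

Values well above 1 indicate small-world structure (clustering much higher than random, path length comparable to random).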

### Discussion

At the microscale, the results concerning the bivariate TE can be explained in
the light of the recent theoretical derivation of TE from network motifs for VAR
dynamics (Novelli, Atay, Jost, & Lizier, 2020). For a fixed in-degree, the TE decreases with
the rewiring probability, making it harder for candidate links to pass the
statistical significance tests when only short time series are available. This
explains why the recall for the bivariate TE slightly drops with higher rewiring
probability for *T* = 1000 (Figure 2). We can speculate that a similar mechanism could be responsible
for the more drastic drop in the recall for the bivariate MI (via evidence from
derivations for covariances from network structure for similar processes;
Pernice, Staude, Cardanobile, & Rotter, 2011; Schwarze & Porter, 2020). The fact that bivariate TE is larger for regular lattice
structures also explains why its recall is slightly higher than for multivariate
TE here: The redundancy between close sources that elevates their bivariate TE
is explicitly conditioned out of the multivariate TE for secondary sources. On
the other hand, as the rewiring increases, the higher recall for multivariate TE
must be due to this method capturing synergistic effects that (more disparate)
multiple sources have on the target, which the bivariate method does not.

Comparing the results between shorter and longer time series raises another
question: Why is the precision of the bivariate TE worse for longer time series
than for shorter ones, especially for lattice-like topologies? More complex
motifs involving common parents and multiple walks, which are more prevalent in
regular lattice topologies, can result in nonzero TE on spurious links. These
indirect effects are typically weak; however, for long enough time series, the
low TE values can be distinguished from noise and thus pass the statistical
significance tests. The resulting spurious links (false positives) decrease the
precision and the specificity as the time series length is increased, with the
effect being stronger in regular lattice topologies. In other words, the
Bonferroni correction of the statistical significance level (i.e., dividing *α* by the network size *N*) does not
result in a well-calibrated test for bivariate inference methods—the
sources are correlated, and the tests on them are not independent. The
differences in the specificity on the plots are subtle because the networks are
sparse; however, they manifest in large differences in the precision. Crucially,
this effect is not seen for the multivariate TE, which maintains specificity
consistent with the requested *α* for all topologies and
time series lengths. Thus, lower recall achieved by multivariate TE on regular
lattice networks for short time series (compared with bivariate TE) can be
viewed as a compromise to control the specificity in a consistent fashion. A
compelling argument in favor of controlling the specificity is provided by
Zalesky et al. (2016, p. 407), who
conclude that “specificity is at least twice as important as sensitivity
[i.e., recall] when estimating key properties of brain networks, including
topological measures of network clustering, network efficiency and network
modularity.” Unfortunately, there is currently no consistent a priori way
(and no reasonable candidate) to determine the optimal time series length for
bivariate TE to attain high precision.

At the macroscale, the overestimation of the clustering coefficient by the bivariate methods can be traced back to the transitivity of Pearson’s correlation. Given a process *X* with neighbors *Y* and *Z* (and respective correlations *ρ*_{XY} and *ρ*_{XZ}), a lower bound can be derived for the correlation between the neighbors (Langford, Schwertman, & Owens, 2001):

$$\rho_{YZ} \geq \rho_{XY}\,\rho_{XZ} - \sqrt{\left(1 - \rho_{XY}^{2}\right)\left(1 - \rho_{XZ}^{2}\right)}. \qquad (11)$$

In particular, a strong positive correlation between two pairs of them implies a positive correlation within the third pair: *ρ*_{XY}² + *ρ*_{XZ}² > 1 implies *ρ*_{YZ} > 0 (Langford et al., 2001). The
problem was further investigated by Zalesky, Fornito, and Bullmore (2012), who showed that functional
connectivity matrices of independent processes also exhibit small-world
properties and that—in practice—the correlation between neighbors
is much higher than the theoretical lower bound in Equation 11. These considerations on
correlation extend to bivariate MI, given the one-to-one relationship between MI
and the absolute value of Pearson’s correlation coefficient for the
Gaussian variables considered in this study (see the *Bivariate mutual information for functional
connectivity* section in the Methods Section). This
transitivity property results in more triangular cliques in functional networks,
that is, an inflated clustering coefficient across the whole spectrum of
networks in Figure 4. Together with the
underestimate of the shortest path length discussed above, the outcome is an
overestimate of the small-worldness coefficient (Figure 5). As shown, the limitations of bivariate methods can be
overcome by multivariate TE, to a large degree for shorter time series and
certainly when sufficiently long time series are available.
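The transitivity argument, together with the Gaussian MI-correlation correspondence it rests on, can be illustrated numerically (the coefficients below are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.standard_normal(n)
# Y and Z are both neighbors of X but share no direct link.
y = 0.8 * x + 0.6 * rng.standard_normal(n)   # unit variance, rho_XY = 0.8
z = 0.8 * x + 0.6 * rng.standard_normal(n)   # unit variance, rho_XZ = 0.8

r_xy = np.corrcoef(x, y)[0, 1]
r_xz = np.corrcoef(x, z)[0, 1]
r_yz = np.corrcoef(y, z)[0, 1]

# Langford et al. (2001) lower bound on the neighbor-neighbor correlation:
bound = r_xy * r_xz - np.sqrt((1 - r_xy**2) * (1 - r_xz**2))
# For Gaussians, MI is a monotonic function of |rho|, so a bivariate MI
# test will tend to infer the spurious link Y-Z as well:
mi_yz = -0.5 * np.log(1 - r_yz**2)
```

Here *ρ*_{XY}² + *ρ*_{XZ}² = 1.28 > 1, so *Y* and *Z* are guaranteed to be positively correlated: the empirical *ρ*_{YZ} is 0.64 in expectation, well above the lower bound of 0.28.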

## SCALE-FREE NETWORKS

### Numerical Simulations

The linear preferential attachment algorithm without attractiveness
(Barabási & Albert, 1999)
is used to generate undirected scale-free networks of *N* = 200
nodes. Starting with two connected nodes, a new node is added at each iteration
and linked bidirectionally to two existing nodes, selected with probability
proportional to their current degree (via linear preferential attachment). This
preferential mechanism makes high-degree nodes more likely to be selected and
further increase their degree—a positive feedback loop that generates few
highly connected hubs and many low-degree nodes. The resulting density is
approximately 4/*N* = 0.02 (average in- and out-degrees being
approximately 4), with hubs having degrees up to around 50 here. A constant
uniform link weight *C*_{XY} = *C*_{XX} = 0.1 is assigned to all
the links, achieving strong coupling but ensuring the stationarity of the VAR
dynamics. For robustness, each simulation is repeated 10 times on different
network realizations and with random initial conditions.
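A sketch of this construction using `networkx` (whose `barabasi_albert_graph` uses a slightly different seed configuration than the two-node start described above, which does not affect the asymptotic behavior):

```python
import networkx as nx

def weighted_scale_free(n=200, m=2, w=0.1, seed=0):
    """Undirected linear preferential attachment, with each undirected link
    expanded into a bidirectional pair of weighted links, plus self-links,
    mirroring the VAR coupling described in the text."""
    ba = nx.barabasi_albert_graph(n, m, seed=seed)
    g = nx.DiGraph()
    g.add_nodes_from(ba.nodes())
    for u, v in ba.edges():
        g.add_edge(u, v, weight=w)
        g.add_edge(v, u, weight=w)
    for v in g.nodes():
        g.add_edge(v, v, weight=w)       # self-links C_XX = 0.1
    return g

g = weighted_scale_free()
# Density excluding self-links is ~4/N = 0.02 (average degree ~4).
density = (g.number_of_edges() - g.number_of_nodes()) / (200 * 199)
```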

### Results

At the microscale, the performance is evaluated in terms of precision and recall in the classification of the links. The outcome is qualitatively similar to the small-world case presented above: For longer time series (10,000 samples), multivariate TE is the most accurate method, achieving optimal performance according to all metrics (Figure 6, right column). Bivariate TE also achieves optimal recall; however, despite the strict statistical significance level, its precision is significantly lower than that of multivariate TE. The bivariate MI algorithm performs comparatively poorly in terms of both precision and recall (< 40% on average). As expected, the recall of all methods decreases when shorter time series of 1,000 samples are provided (Figure 6, left column). Once more, bivariate TE attains better precision on shorter time series than on longer ones, and for these networks attains slightly better recall than multivariate TE on the shorter time series.

At the macroscale, the three algorithms are tested on their ability to accurately
measure several relevant properties of scale-free networks. It is well known
that the degree distribution of networks generated via this preferential
attachment algorithm follows a power-law, with theoretical exponent *β* = 3 in the limit of large networks
(Barabási & Albert, 1999).
Fitting power-laws to empirical data requires some caution, such as adopting a
logarithmic binning scheme (Virkar & Clauset, 2014), and the dedicated *powerlaw* Python package is employed for this purpose (Alstott, Bullmore, & Plenz, 2014). For sufficiently long time
series (*T* = 10,000 in this study), multivariate TE is able to
accurately recover the in-degrees of the nodes in our scale-free networks, while
the bivariate MI and TE algorithms produce significant overestimates (Figure 7). As a consequence, the (absolute
value of the) exponent of the fitted power-law is underestimated by the latter
methods, as shown in Figure 8.
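The *powerlaw* package automates the careful parts of the fit (notably the choice of lower cutoff); its core, however, is the discrete maximum-likelihood estimate of the exponent, which can be sketched directly. The approximation below follows the approach popularized by Clauset and colleagues, with a fixed cutoff for simplicity rather than the package's automatic selection:

```python
import networkx as nx
import numpy as np

def mle_powerlaw_exponent(degrees, d_min=2):
    """Approximate discrete MLE of the power-law exponent:
    alpha = 1 + n / sum(ln(d_i / (d_min - 1/2))), over degrees >= d_min."""
    d = np.asarray([x for x in degrees if x >= d_min], dtype=float)
    return 1.0 + len(d) / float(np.sum(np.log(d / (d_min - 0.5))))

g = nx.barabasi_albert_graph(200, m=2, seed=0)
alpha = mle_powerlaw_exponent([deg for _, deg in g.degree()])
# alpha should lie in the vicinity of the theoretical beta = 3
# (finite-size effects shift it for small networks).
```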

Hubs are a key feature of scale-free networks and have high betweenness centrality, since most shortest paths pass through them. However, their centrality is highly underestimated by bivariate methods, often making it indistinguishable from the centrality of peripheral nodes (Figure 9).
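The fragility of hub betweenness to false positives can be demonstrated with a toy experiment (our own illustration, with an arbitrary number of spurious links):

```python
import random

import networkx as nx

g = nx.barabasi_albert_graph(200, m=2, seed=0)
hub = max(g.nodes(), key=g.degree)           # highest-degree node
bc_before = nx.betweenness_centrality(g)[hub]

# Add 100 spurious links uniformly at random, mimicking false positives.
rng = random.Random(0)
noisy = g.copy()
while noisy.number_of_edges() < g.number_of_edges() + 100:
    u, v = rng.sample(sorted(noisy.nodes()), 2)
    noisy.add_edge(u, v)
bc_after = nx.betweenness_centrality(noisy)[hub]
# Spurious links provide alternative shortest paths, typically
# deflating the hub's betweenness centrality.
```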

As in the small-world case, multivariate TE is able to very closely approximate the real mean clustering coefficient, while the bivariate MI and TE algorithms consistently overestimate it (Figure 10). The related measure of local efficiency (Latora & Marchiori, 2001) is reported in the Supporting Information.

A closer examination of the clustering of individual nodes (instead of the average) reveals that low clustering values are consistently overestimated by bivariate methods, while high clustering values are underestimated (Supporting Information). Finally, bivariate methods overestimate the rich-club coefficient (Supporting Information).

### Discussion

Echoing the discussion of small-world networks above, the ability to control the
false positives while building connectomes—exhibited only by multivariate
TE—is also crucial for correctly identifying fundamental features of
scale-free networks, such as the power-law degree distribution and the presence
of hub nodes. Hubs are characterized by high degree and betweenness centrality.
Unfortunately, the centrality of hubs is not robust with respect to false
positives: The addition of spurious links causes strong underestimates
of the betweenness centrality of real hubs, since additional links provide
alternative shortest paths. For bivariate TE, the effect is so prominent that
the inferred centrality of real hubs can be indistinguishable from the
centrality of peripheral nodes, as shown in Figure 9. The in-degree is in principle more robust with respect to false
positives; however, bivariate methods infer so many spurious incoming links into
nonhubs that they become as connected as (or more connected than) the real hubs are inferred
to be (Figure 7). Taken together, these
effects on the in-degree and centrality greatly hinder the identification of
real hubs when bivariate MI or TE are employed. The inflation of the in-degree
of peripheral nodes also fattens the tail of the in-degree distribution (Figure 8), resulting in an underestimate of
the exponent of the fitted power-law with respect to the theoretical value *β* = 3 (Barabási & Albert, 1999). This has severe implications for
the synthetic networks used in this study, erroneously providing evidence
against the simple preferential attachment algorithm used to generate them. The
third distinct characteristic of these networks is their low average clustering,
which is also induced by the preferential attachment algorithm, whereby each new
node is only connected to two existing ones. However, bivariate methods fail to
capture this feature, producing a strong overestimate of the average clustering
coefficient (Figure 10). This can be
attributed to the transitivity property of Pearson’s correlation, which
produces overabundant triangular cliques in functional networks (as previously
discussed). Given the significant biases affecting all the distinctive
properties of scale-free networks—in addition to the small-world networks
presented above—it is evident that great caution should be used when
applying bivariate inference methods (cross-correlation, MI, TE) to draw
conclusions as to topological properties of real-world networks. In contrast,
again, the multivariate TE was demonstrated to produce network models with
microscopic and macroscopic topological properties consistent with those of the
underlying structural scale-free networks.

## MODULAR NETWORKS

### Numerical Simulations

In order to study the performance of the three inference algorithms on modular
topologies, networks of 100 nodes are generated and equally partitioned into
five groups of 20. Initially, each node is directly linked to 10 random targets
within its own group, such that the five communities are completely
disconnected. The initial density is thus 50% within each group and
10% overall. Link targets are then gradually rewired from within to
between groups, weakening the modular structure but preserving the overall
density and keeping the out-degrees fixed. Eventually, the concepts of
“within” and “between” groups are no longer
meaningful—the links are equally distributed and the topology resembles a
random Erdős-Rényi network of equal overall density. This happens
when the rewiring is so prevalent that only 2 links are left within the initial
groups and 8 out of 10 links are formed between them (for each node). Going even
further, when all 10 links are formed between the initial groups and none
within, the network becomes multipartite, that is, the nodes are partitioned
into five independent sets having no internal connections. A constant uniform
link weight *C*_{XY} = *C*_{XX} = 0.08 is assigned to all
the links, achieving strong coupling but ensuring the stationarity of the VAR
dynamics. Each simulation is repeated 10 times on different network realizations
and with random initial conditions.
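The construction and gradual rewiring can be sketched as follows, parameterized directly by the number of between-group links per node (an equivalent way of scanning the between-group density):

```python
import random

import networkx as nx

def modular_network(n=100, groups=5, out_deg=10, n_between=0, w=0.08, seed=0):
    """Each node sends `out_deg` links to random targets: `n_between` of them
    outside its own group and the rest within, keeping out-degrees and the
    10% overall density fixed. n_between = 0 gives five disconnected
    communities; n_between = 8 resembles an Erdos-Renyi network."""
    rng = random.Random(seed)
    size = n // groups
    g = nx.DiGraph()
    g.add_nodes_from(range(n))
    for v in range(n):
        grp = v // size
        inside = [t for t in range(grp * size, (grp + 1) * size) if t != v]
        outside = [t for t in range(n) if t // size != grp]
        for t in (rng.sample(inside, out_deg - n_between)
                  + rng.sample(outside, n_between)):
            g.add_edge(v, t, weight=w)
        g.add_edge(v, v, weight=w)       # self-link C_XX = 0.08
    return g
```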

### Results

At the microscale, we find that bivariate MI and TE infer more spurious links within the initial groups than between them for smaller between-group densities (Figure 11, left column). As the between-group density increases, though, we find more spurious links between the initial groups than within them. The normalized false positive rate is also significantly higher within groups for smaller between-group densities (right column); after normalization, however, the false positive rate becomes comparable between and within groups as the between-group density increases. The number of false positives produced by multivariate TE is comparatively negligible.

At the mesoscale, the modularity of the partition corresponding to the five
disconnected communities is maximal in the absence of rewiring and decreases as
more and more links are formed between groups rather than within them (Figure 12). Bivariate and multivariate TE
produce accurate estimates of the real modularity, while bivariate MI often
underestimates it, particularly for shorter time series (*T* =
1,000) and intermediate between-group densities.
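The mesoscale evaluation amounts to scoring the planted partition with Newman's modularity, for example as follows (a sketch on the undirected, unweighted view of the network, with self-links removed):

```python
import networkx as nx

def planted_partition_modularity(g, groups=5):
    """Modularity of the planted equal-size partition, computed on the
    undirected, unweighted view of the network without self-links."""
    n = g.number_of_nodes()
    size = n // groups
    partition = [set(range(k * size, (k + 1) * size)) for k in range(groups)]
    ug = nx.Graph(g)                     # collapse direction and weights
    ug.remove_edges_from(nx.selfloop_edges(ug))
    return nx.algorithms.community.modularity(ug, partition)
```

For five fully disconnected equal communities the score is 1 − 5 · (1/5)² = 0.8, the maximum attainable for this partition, and it decreases toward zero as links are rewired between groups.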

### Discussion

Our results on modular networks confirm and extend previous findings on
correlation-based functional connectivity, stating that “false positives
occur more prevalently between network modules than within them, and the
spurious inter-modular connections have a dramatic impact on network
topology” (Zalesky et al., 2016, p. 407). Indeed, the left column of Figure 11 shows that bivariate MI and TE infer a larger *number* of false
positives between the initial groups than within them, once we have a midrange
between-group link density in the underlying structure (which induces the
transitive relationships). However, the same does not apply to the false
positive *rate* (i.e., the normalized number of false positives)
shown in the right column of Figure 11:
Where an edge does not actually exist, it is more likely to be inferred if it is
within rather than across groups (for up to midrange between-group link
densities). As such, the higher number of false positives between modules is
mostly due to the larger number of potential spurious links available between
different communities compared with those within them. Nonetheless, the key
message is that the modular structure (at the mesoscale level) affects the
performance of bivariate algorithms in inferring single links (at the microscale
level). This provides further empirical evidence for the theoretical finding
that bivariate TE—despite being a pairwise measure—does not depend
solely on the directed link weight between a single pair of nodes, but on the
larger network structure they are embedded in, via the mesoscopic network motifs
(Novelli et al., 2020). In particular,
the abundance of specific “clustered motifs” in modular structure
increases the bivariate TE, making links within each group easier to detect but
also increasing the false positive rate within modules. Other studies have
also related correlation-based functional connectivity to specific
structural features, such as search information, path transitivity (Goni et al., 2014), and topological similarity
(Bettinardi et al., 2017).

The underestimate of the modularity of the initial partition by bivariate MI (Figure 12, bottom panel) is a direct result of these higher numbers of spurious between-group links. This has important implications for the identification of the modules, since a lower score makes this partition less likely to be deemed optimal by popular greedy modularity maximization algorithms (Blondel, Guillaume, Lambiotte, & Lefebvre, 2008). We speculate that the spurious intermodular links would also hinder the identification of the modules when alternative approaches for community detection are employed (a thorough comparison of which is beyond the scope of this study).

## MACAQUE CONNECTOME

Finally, the three inference algorithms are compared on two real macaque brain connectomes, using both linear VAR dynamics and a nonlinear neural mass model (pipeline illustrated in Figure 1).

### Numerical Simulations

#### Linear VAR dynamics.

As a final validation study under idealized conditions, the linear VAR
dynamics in Equation 1 is run
on the connectome obtained via tract-tracing by Young (1993). This directed network consists of 71 nodes
and 746 links (15% density) and incorporates multiple properties
investigated in the previous sections, including a small-world topology and
the presence of hubs and modules. The scaling of the performance is studied
as a function of the cross-coupling strength (i.e., the sum of incoming link
weights into each node, denoted as *C*_{in} and
formally defined as *C*_{in} =
∑_{X}*C*_{XY} for each node *Y*). The coupling *C*_{in} is
varied in the [0.3, 0.7] range, making *C*_{XY} constant for each
parent *X* for a given *Y* to achieve this,
and the self-link weights are kept constant at *C*_{XX} = 0.2 to ensure the
stationarity of the VAR dynamics. For robustness, each simulation is
repeated 10 times with random initial conditions.
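A sketch of the VAR simulation with the coupling normalization described above (the three-node adjacency matrix is a hypothetical stand-in for the 71-node connectome):

```python
import numpy as np

def simulate_var(C, T=10_000, seed=0):
    """Order-1 VAR process (Equation 1, up to notation): each node is a
    noisy weighted sum of its parents' previous values. C[i, j] is the
    weight of the directed link i -> j; innovations are unit-variance
    Gaussian."""
    rng = np.random.default_rng(seed)
    z = np.zeros((T, C.shape[0]))
    for t in range(1, T):
        z[t] = C.T @ z[t - 1] + rng.standard_normal(C.shape[0])
    return z

# Hypothetical 3-node example of the normalization: each parent of node Y
# receives weight C_in / d_in(Y), with self-links fixed at C_XX = 0.2.
adj = np.array([[0., 1., 1.],
                [1., 0., 1.],
                [0., 1., 0.]])
c_in, c_self = 0.5, 0.2
C = adj * (c_in / np.maximum(adj.sum(axis=0), 1))
np.fill_diagonal(C, c_self)
ts = simulate_var(C, T=2_000, seed=1)
```

With *C*_{in} + *C*_{XX} < 1 and nonnegative weights, the spectral radius stays below 1, so the dynamics remain stationary, as required in the text.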

#### Nonlinear neural mass model.

As a final experiment, we provide an initial investigation of whether the
insights from the previous validation studies extend beyond the idealized
conditions there. Specifically, in moving towards a more realistic setting,
neural mass model dynamics are simulated on the CoCoMac connectome, as
described in the Methods Section. This network structure contains 76 nodes
with 1,560 directed connections (27% density), which are weighted and
have experimentally estimated coupling delays. Importantly, by incorporating
nonlinear coupling, coupling delays, a distribution of coupling weights, and
subsampling, this last study drops many of the simplifying assumptions made
using the VAR dynamics in the previous sections. The linear Gaussian
estimator is retained for our information-theoretic measures despite the
nonlinear interactions here, so as to remain consistent with the previous
studies. Dropping the assumption of sampling at the real causal process
resolution adds a particular challenge, and is often encountered in practice
in modalities with low temporal resolution. To handle the variation in
coupling delays, we consider sources at lags up to *L* = 4
time steps (60 ms) here. The longest time series analyzed (30,000 samples)
corresponds to 7.5 min at one sample per 15 ms.

### Results

At the microscale, the results for the linear and nonlinear dynamics (Figure 13 and Figure 14) are complementary and summarize the main findings presented so far. There exists a window—characterized by low cross-coupling strength and short time series—where bivariate TE attains similar or better performance compared with multivariate TE in terms of recall, specificity, and precision. For stronger coupling or longer time series, the recall of all methods increases, but the precision and specificity of the bivariate methods substantially drop while those of multivariate TE remain consistently high.

An intuitive visual representation of how these differences in precision, recall, and specificity affect the macroscopic inferred network is provided in Figure 15, where the inferred adjacency matrices are displayed beside the real connectome, with different colors indicating which links are correctly/incorrectly inferred or missed by each method.

The macroscale results (in terms of local and global efficiency measures) are reported in the Supporting Information.

### Discussion

Interestingly, Figures 13 and 14 show how similar outcomes are produced by either stronger coupling (link weights) or longer time series. An explanation is readily available in the simple case of VAR dynamics: The bivariate TE on spurious links is typically lower than the TE on real links, and it increases with the coupling strength. Bivariate TE can be analytically derived from the network structure under the assumption of VAR dynamics, and spurious links are associated with larger motifs (e.g., longer chains), which contribute to the TE at lower orders of magnitude (Novelli et al., 2020). Therefore, spurious links can only pass statistical significance tests when sufficiently long time series are available (in order for their weak TE values to be distinguished from noise); for the same reason, for shorter time series, spurious links can only be detected in the presence of strong enough coupling. Unfortunately, for real datasets, there is no consistent a priori way to determine the optimal window of time series lengths for bivariate TE, before the increasing false positive rate degrades precision and specificity.

It is crucial to note that, despite moving beyond the idealized conditions used
for validation with the VAR model, the qualitative differences between the
inference algorithms remain unchanged in the neural mass model study. That is,
multivariate TE attains higher and more consistent precision and specificity
than bivariate methods, testifying to a more effective control of false
positives that enables more faithful representation of macroscale network
features—an advantage that becomes increasingly important as longer time
series are provided. Indeed, on a more intuitive level, Figure 15 provides immediate visual evidence of the
importance of a reliable method for controlling the false positive rate. The
large number of spurious interhemispheric links produced by bivariate methods
can hinder the identification of important macroscopic features, starting from
the very presence of the two hemispheres themselves and extending to other
fundamental network properties (as shown for local and global efficiency
measures in the Supporting Information).
This issue becomes particularly problematic for longer time series, as in the
case of *T* = 30,000 time samples shown in Figure 15.

With that said, the precision and specificity of the multivariate TE for this more realistic study in Figure 14 are noticeably lower compared with the previous experiments in idealized conditions, such as that shown in Figure 13. Specifically, the specificity is lower than would be expected from the proven well-controlled false positive rate under idealized conditions. This is potentially due to a number of factors in this study, including the subsampling, nonlinear dynamics, strong autocorrelation, and to a lesser extent coupling delays. While we retained the linear Gaussian estimator to be consistent with the previous experiments, we have previously demonstrated substantial performance enhancements for the multivariate TE algorithm when using a nonlinear estimator for studying nonlinear dynamics (Novelli et al., 2019). That could be expected to improve performance here as well. Similarly, we have recently demonstrated approaches to rigorously control the known inflation of false positive rates for MI and TE because of autocorrelation (Cliff et al., 2020), although this is not expected to have as dramatic an effect in this study because of the selection of subsampling time via the autocorrelation time. The subsampling itself though (by a factor of 30 on the original time series) is likely to have had a significant impact on performance. This is because subsampling obscures our view of the dynamics at the real interaction scale, and prevents us from properly conditioning on the past of the target (which is known to inflate the false positive rate; Wibral et al., 2011). While initial studies have suggested some level of robustness of TE to subsampling (Lizier et al., 2011), a more systematic analysis would be important future work to properly understand its effect.

## CONCLUSION

We have sought to evaluate how well network models produced via bivariate and
multivariate network inference methods capture features of underlying structural
topologies. As outlined in the Introduction, these inference techniques seek to
infer a network *model* of the relationships between the nodes in a
system, and are not necessarily designed or expected to replicate the underlying
structural topology *in general*. Our primary focus, however, was in
evaluating the techniques under specific idealized conditions under which effective
network models are proven to converge to the underlying structure. The focus on such
conditions is important because they provide assumptions under which our evaluation
becomes a validation study. The performance of these methods was evaluated at both
the microscopic and the macroscopic scales of the network. While we may not expect
the same performance in identifying links at the microscopic scale, we should expect
all of the methods to identify relevant macroscopic features in the underlying
network structure, as well as distinctive nodes or groups of nodes.

For longer time series, multivariate TE performs better on all network topologies (lattice-like, small-world, scale-free, modular, and the real macaque connectome). This enhanced performance is very clear at the microscale of single links, achieving high precision and recall, and consequently at the macroscale of network properties, accurately reflecting the key summary statistics of the ground truth networks used for validation.

Bivariate methods (directed and undirected) can exhibit higher recall (or
sensitivity) for shorter time series for certain underlying topologies; however, as
available data increase, they are unable to control false positives (that is, they
have lower specificity). While decreasing statistical significance thresholds
(critical *α* levels) for inferring links is a common strategy
to reduce false positives, the bivariate measures simply cannot match the
sensitivity of the multivariate approach at the same specificity (compare the
precisions for the same recall level in Figure 6 as
an example, or refer to the sample ROC curve in the Supporting Information for a more extensive comparison). At the
macroscale, the comparatively larger number of false positives leads to
overestimated clustering, small-world, and rich-club coefficients, underestimated
shortest path lengths and hub centrality, and fattened degree distribution tails.
The changes in these measures are partly due to the aforementioned transitivity
property for bivariate measures (implying that false positives are often
“close” to real links in the network), and partly due to higher
density; untangling these effects is a topic for future work. In any case, caution
should therefore be used when *interpreting* network properties of
functional connectomes obtained via correlation or pairwise statistical dependence
measures. Their use is advisable only when the limited amount of data does not
allow the use of the more sophisticated and more accurate multivariate TE, which
more faithfully tracks trends in underlying structural topology. Further research is
required to try to reliably identify—a priori—situations where
bivariate TE will exhibit higher precision and recall (particularly in terms of time
series length), as there is no clear candidate approach to do so at present. In the
current status quo, the critical strength of the multivariate approach lies in its
ability to appropriately control the false positive rate to meet the requested
values.

Our evaluation of the inference techniques under idealized conditions considered several of the highest profile complex network topologies: lattice-like, small-world, scale-free, modular, and a mix of their features in a real macaque connectome. This complements previous work (Novelli et al., 2019; Sun et al., 2015) at a similar scale, which evaluated performance on random network structures and incorporated a study of the effect of network size and linear-versus-nonlinear dynamics and estimators. Obviously, we have only scratched the surface of examining the effects of the myriad combinations of network parameters that could be investigated, which could include larger variations in degree, distributions on edge weights, super or sublinear preferential attachment, nonuniform module sizes, or cross-module connection probabilities, and could also incorporate experiments across other types of dynamics. Thus far, our conclusions on how the multivariate TE approach performs against the bivariate measures were consistent across the variety of structures, and while it would be interesting to see how other variations in structure affect the performance, we do expect the general conclusions on the comparison between approaches to remain similar.

Of course, there is a computational tradeoff: the runtime of the multivariate TE
algorithm is a factor of 𝒪(*d*) longer than that of the bivariate approach
(where *d* is the average inferred in-degree). The runtime complexity is analyzed
in detail and benchmarked for both linear and nonlinear estimators on experiments
at a similar scale in Novelli et al. (2019; Supporting Information). Our
experiments here on 10,000 time samples for up to 200 nodes, with the more
efficient linear estimator, took less than 2 hr (single core) on average per
target on the same hardware. Given the availability of parallel computing to
analyze targets simultaneously, we believe the increased runtime is justified by
the performance gains demonstrated here.
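
For concreteness, the linear (Gaussian) TE estimator referred to above reduces to a log-ratio of residual variances between nested autoregressive models, as in Granger causality. The following is a minimal illustrative sketch with numpy, restricted to lag-1 embeddings (the function name and setup are our own, not the IDTxl API):

```python
import numpy as np

def gaussian_te(source, target, lag=1):
    """Bivariate transfer entropy under a Gaussian model (lag-1
    embedding): 0.5 * log of the ratio of residual variances of the
    restricted (target past only) vs. full (target + source past)
    regressions. Equals half the Granger causality for linear systems."""
    y_now, y_past, x_past = target[lag:], target[:-lag], source[:-lag]
    A = np.column_stack([np.ones_like(y_past), y_past])  # restricted model
    B = np.column_stack([A, x_past])                     # full model
    res_r = y_now - A @ np.linalg.lstsq(A, y_now, rcond=None)[0]
    res_f = y_now - B @ np.linalg.lstsq(B, y_now, rcond=None)[0]
    return 0.5 * np.log(res_r.var() / res_f.var())

# Coupled AR(1) pair: x drives y with coefficient 0.4
rng = np.random.default_rng(42)
T = 5000
x, y = np.zeros(T), np.zeros(T)
for t in range(1, T):
    x[t] = 0.5 * x[t - 1] + rng.standard_normal()
    y[t] = 0.5 * y[t - 1] + 0.4 * x[t - 1] + rng.standard_normal()
# TE in the coupled direction dominates the reverse direction
```

The multivariate algorithm repeats such nested-model comparisons while greedily growing the conditioning set for each target, which is the source of the 𝒪(*d*) runtime factor.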

Beyond idealized conditions, effective network inference techniques are not guaranteed to converge in this manner to an underlying structure. This can be for many reasons, including hidden nodes (or lack of full observability), nonstationarity, short sample size, or subsampling obscuring the scale of interaction. Yet our final experiment (examining time series dynamics of a neural mass model on the 76-node CoCoMac connectome) extended our investigations into the domain beyond idealized conditions and again demonstrated superior performance of the multivariate TE, aligning with the validation studies in idealized conditions. Importantly, this included visually revealing the characteristic hemispheric macroscopic structure of this connectome. That said, the performance of multivariate TE in this example was certainly reduced in comparison to our experiments under idealized conditions. This appears to be due to various factors discussed in that section, including the use of a linear estimator on nonlinear dynamics as well as the effect of subsampling. There is substantial scope for further study of the performance of inference techniques under nonideal conditions, and of how the effective network models they infer relate to the underlying structure. This will involve further experiments on realistic neural dynamics, systematic study of the effect of subsampling on network inference (building on existing studies for Granger causality; Barnett & Seth, 2017), and an assessment of the ability of inference algorithms to capture key network features when only a subset of nodes is observed (i.e., in the presence of hidden nodes).

Finally, while we focus on functional brain networks, our conclusions and methods also apply to anatomical brain networks in which connectivity is measured using correlations in cortical thickness or volume (He, Chen, & Evans, 2007). Beyond neuroscience, they also extend to metabolite, protein, and gene correlation networks (Gillis & Pavlidis, 2011); a similar validation study using synthetic networks was carried out for gene regulatory networks using bivariate MI and TE (Budden & Crampin, 2016).

## ACKNOWLEDGMENTS

The authors thank Mac Shine and Daniele Marinazzo for useful discussions. The authors acknowledge the Sydney Informatics Hub and the University of Sydney’s high-performance computing cluster Artemis for providing the high-performance computing resources that have contributed to the research results reported within this paper. The authors thank Matthew Aburn for providing time-series data simulated from the neural mass model on the CoCoMac connectome from Li et al. (2019) and Shine et al. (2018).

## SUPPORTING INFORMATION

Supporting Information for this article is available at https://doi.org/10.1162/netn_a_00178. The network inference algorithms described in this paper are implemented in the open-source Python software package IDTxl (Wollstadt et al., 2019), which is freely available on GitHub (https://github.com/pwollstadt/IDTxl). The code used for the systematic exploration of network structures and inference methods is also publicly available (Novelli, 2020; https://github.com/LNov/infonet).

## AUTHOR CONTRIBUTIONS

Leonardo Novelli: Conceptualization; Data curation; Formal analysis; Investigation; Methodology; Project administration; Software; Validation; Visualization; Writing – original draft. Joseph Lizier: Conceptualization; Funding acquisition; Investigation; Supervision; Writing – review & editing.

## FUNDING INFORMATION

Joseph Lizier, Australian Research Council, Award ID: DECRA Fellowship grant DE160100630. Joseph Lizier, University of Sydney, Award ID: Research Accelerator (SOAR) prize program.

## TECHNICAL TERMS

- Brain parcellation:
The subdivision of the brain into “parcels” or regions of interest, serving as nodes in network-based models of the brain.

- Network topology:
The structure of the network describing the connections between its elements.

- Multivariate network inference:
Considering multiple time series at once (as opposed to bivariate inference, which considers each pair of nodes in isolation).

- IDTxl:
The Information Dynamics Toolkit xl is an open-source Python package for network inference, freely available on GitHub (https://github.com/pwollstadt/IDTxl).

- Bonferroni correction:
A strategy for controlling the error rate when multiple comparisons are performed. The significance threshold is divided by the number of comparisons.

- Nonuniform embedding:
A set of nonuniformly spaced time lags that capture the underlying state of the process, akin to a Takens’ embedding.
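
To make the Bonferroni correction above concrete: when a network inference run performs many link tests, the per-comparison significance level is simply the overall level divided by the number of comparisons. A toy sketch (the function name and example figures are illustrative, not the IDTxl implementation):

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-comparison significance level under a Bonferroni correction."""
    return alpha / n_tests

# e.g. testing every directed link among N = 10 nodes: N * (N - 1) = 90 tests
threshold = bonferroni_threshold(alpha=0.05, n_tests=90)
```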

## REFERENCES


## Author notes

Competing Interests: The authors have declared that no competing interests exist.

Handling Editor: Daniele Marinazzo
