## Abstract

The analysis of the community architecture in functional brain networks has revealed important relations between specific behavioral patterns and characteristic features of the associated functional organization. Numerous studies have assessed changes in functional communities during different states of awareness, learning, information processing, and various behavioral patterns. The robustness of detected communities within a network has been an often-discussed topic in complex systems research. However, our knowledge regarding the intersubject stability of functional communities in the human brain while performing different tasks is still lacking. In this study, we examined the variability of functional communities in weighted undirected graphs based on fMRI recordings of healthy participants across three conditions: the resting state, syllable production as a simple vocal motor task, and meaningful speech production representing a complex behavioral pattern with cognitive involvement. On the basis of the constructed empirical networks, we simulated a large cohort of artificial graphs and performed a leave-one-out stability analysis to assess the sensitivity of communities in the group-averaged networks with respect to perturbations in the averaging cohort. We found that the stability of partitions derived from group-averaged networks depended on task complexity. The determined community architecture in mean networks reflected within-behavior network stability and between-behavior flexibility of the human functional connectome. The sensitivity of functional communities increased from rest to syllable production to speaking, which suggests that the approximation quality of the community structure in the average network to reflect individual per-participant partitions depends on task complexity.

## INTRODUCTION

Considerable advancements in imaging technology drastically improved the quality and level of detail of neuroimaging data sets collected over the past years (Van Essen et al., 2013). Although this development presented obvious advantages for knowledge generation, it also came with a number of problems arising in the analysis of complex high-dimensional data. Thus, many fMRI studies started utilizing abstract mathematical concepts rooted in graph theory (Bondy & Murty, 1976) to represent and subsequently analyze brain activity data in terms of large-scale graphs (Wang et al., 2014; van den Heuvel & Sporns, 2011; Lynall et al., 2010; Bassett & Bullmore, 2006; Humphries, Gurney, & Prescott, 2006). The mathematical concept of a graph, which represents a collection of nodes and edges, proved to be particularly well suited to comprehensively assess the whole-brain functional connectome (De Vico Fallani, Richiardi, Chavez, & Achard, 2014; Bullmore & Sporns, 2009). The analysis of the community architecture within these graphs revealed a number of previously unknown brain activity patterns, such as the context-dependent global reorganization of brain activity not necessarily restricted to underlying white matter connections (Andric & Hasson, 2015), the predictive power of nodal assignments to functional modules for learning performance (Bassett et al., 2011), the estimation of cognitive performance based on the modular organization of a cognitive control network at rest (Stevens, Tappon, Garg, & Fair, 2012), and the modular disorganization and a breakdown of normal network architecture preceding an epileptic seizure (Fuertinger, Simonyan, Sperling, Sharan, & Hamzei-Sichani, 2016). The inherent structure of these functional patterns most likely would have remained undetected without using the theoretical underpinning of graph partitions.

However, finding a graph's community structure as well as analyzing and comparing network partitions present several technical challenges itself (Sporns & Betzel, 2015). One of the most widely used approaches to partition a network into communities, or modules, is the technique of modularity maximization (Porter, Onnela, & Mucha, 2009; Newman & Girvan, 2004). Maximizing a network's Newman modularity score is a computationally efficient way to simultaneously determine both the number of nodal clusters and their members (Newman, 2006). However, modularity maximization also experiences two well-known conceptual problems: resolution limit (Fortunato & Barthelemy, 2007), which, under certain conditions, prevents the detection of small communities in a graph, and module degeneracy (Good, de Montjoye, & Clauset, 2010), caused by numerous different network partitions with near-optimal modularity scores. The latter aspect is closely related to the problem of comparing network partitions and assessing their quality in reflecting the “true” underlying community structure of a graph. These methodological issues, together with the fact that neuroimaging studies traditionally focus on group-level inferences, are likely reasons for many graph-theory-based functional imaging studies to center on analyzing the community structure of group-averaged networks (Bertolero, Yeo, & D'Esposito, 2015; Betzel et al., 2015; Godwin, Barry, & Marois, 2015; Onoda & Yamaguchi, 2013). A few studies have demonstrated pronounced effects of intersubject variability on functional coupling patterns, particularly during complex cognitive tasks (Miranda-Dominguez et al., 2014; Grabner et al., 2007; Newman, Carpenter, Varma, & Just, 2003; Rypma & D'Esposito, 1999). However, the impact of individual variability in functional network architecture on the community layout of group-level mean networks across different experimental conditions is still largely unknown.

In this study, we analyzed the variability of functional network communities across 27 healthy adult participants with respect to a group-wide reference given by the modular decomposition of the group-averaged networks. We assessed three different conditions: the resting state as a baseline status, syllable production representing a simple vocal motor task with minimal cognitive involvement, and the production of meaningful sentences as a highly complex speech motor behavior. Furthermore, we performed an in-depth assessment of the effects of perturbations in the averaging process on the architecture of the mean networks by simulating 100 graphs per condition and conducted a comprehensive leave-one-out analysis of the stability of mean network communities. Using different variations of community detection algorithms, some studies have examined the robustness of communities within a single network by probing well-studied prototype graphs, such as the karate club network (Karrer, Levina, & Newman, 2008; Gfeller, Chappelier, & De Los Rios, 2005; Wu & Huberman, 2004; Zachary, 1977). The aim of this study was to analyze the within-group stability and sensitivity of functional communities with respect to the experimental condition. We hypothesized that the community architecture was both specific and sensitive to each condition and driven by its complexity.

## METHODS

### Ethics Statement

All participants provided written informed consent before study participation, which was approved by the institutional review boards of the Icahn School of Medicine at Mount Sinai and the National Institute of Neurological Disorders and Stroke, National Institutes of Health.

### Data Acquisition

Twenty-seven right-handed monolingual English-speaking healthy participants (18 women, 9 men; mean age = 52.2 years, *SD =* 11.3 years) without any history of neurological, psychiatric, or laryngeal disorders were recruited for this study.

Brain images were acquired on a 3-T GE scanner (Milwaukee, WI). Whole-brain resting-state fMRI images were acquired before the task-production fMRI within the same scanning session. Participants were instructed to rest without specific thoughts, with their eyes closed in an environment with dimmed light. Participants were continuously monitored during scanning and debriefed after the scan; no participant reported falling asleep during the scanning session. Functional images were acquired with a gradient-weighted EPI pulse sequence (repetition time = 2 sec, 150 contiguous volumes, echo time [TE] = 30 msec, flip angle [FA] = 90°, 33 slices with 3.75-mm in-plane resolution, slice thickness = 4 mm). Syllable and sentence production fMRI data were acquired with a gradient-weighted EPI pulse sequence (total repetition time = 10.6 sec, comprising 3.6-sec delay for stimulus presentation, 5-sec delay for task production, and 2-sec image acquisition; TE = 30 msec; FA = 90°; 36 slices with 3.75-mm in-plane resolution; slice thickness = 4 mm) using a BOLD contrast and an event-related sparse-sampling design. As a simple vocal motor task, all participants produced the same syllables /ii-iiiʔi-iʔi/, which is composed of the vowel /i/ (as in “beet”) and the glottal stop /ʔ/ followed again by /i/, which maximized vocal fold adduction while minimizing orofacial articulation. As a more cognitively and linguistically complex speech motor task, all participants produced the same 10 meaningful, grammatically correct English sentences (e.g., “We are always away,” “Tom is in the army”), one per acquisition volume (Fuertinger, Horwitz, & Simonyan, 2015). The speech production task was designed to encompass higher-order cognitive and linguistic processing, whereas syllable production represented a simple vocal motor task that was relevant to speaking but had minimal cognitive and linguistic involvement (Fuertinger et al., 2015; Simonyan & Fuertinger, 2015; Simonyan, Herscovitch, & Horwitz, 2013). All sample stimuli were acoustically presented one at a time and performed by the participant one per acquisition volume. Thirty-six trials per each task (syllable or sentence production) and 24 resting fixations as a baseline were acquired over the five scanning sessions in each participant with the tasks pseudorandomized between sessions and participants. A high-resolution T1-weighted image was acquired as an anatomical reference using a magnetization-prepared rapid acquisition gradient echo sequence (inversion time = 450 msec, TE = 3.0 msec, FA = 10, 128 slices with 1.2-mm thickness).

### Construction of Functional Networks

Functional images were preprocessed using AFNI software (Cox, 1996) following a standard analysis pipeline as described earlier (Fuertinger et al., 2015). A whole-brain parcellation based on the cytoarchitectonic maximum probability maps and macrolabel atlas (Eickhoff et al., 2005) was used to extract regionally averaged task-production and resting-state BOLD fMRI residual time series for all participants (Figure 1A). The whole-brain parcellation was composed of *n* = 212 ROIs, including 142 cortical, 36 subcortical, and 34 cerebellar areas.

The interaction of ROIs during each condition (i.e., rest, syllable, and speech) was estimated by computing the normalized mutual information (NMI; Strehl & Ghosh, 2002) for each pair of regional time series. Normalizing the classic mutual information (Cover & Thomas, 1991) *I*(*x*,*y*) of two random variables *x* and *y* by the geometric mean of the associated Shannon entropies (Shannon, 1948) *H*(*x*) and *H*(*y*), respectively, resulted in a metric, which assumed values close to 0 in case of statistical independence, whereas pairwise mutual dependence yielded NMI coefficients close to 1. Hence, NMI values were (unlike the classic mutual information) bounded from above by 1, allowing for quantitative comparisons across different data sets.

Following Sturges' (1926) rule, we used eight bins to discretize probability distributions for computing NMI coefficients, which resulted in an estimated inflation of the mutual information because of sample size by less than 0.24 bits (Steuer, Kurths, Daub, Weise, & Selbig, 2002). This finding indicates high approximation quality of the empirical mutual information values within each condition with negligible influence of data acquisition differences. It is, therefore, unlikely that the observed differences in the examined static functional networks were driven by variations in length of task and rest data acquisition.

Symmetric *N* × *N* matrices were constructed by arranging the computed pairwise NMI coefficients by region. Because of the absence of negative values, any NMI matrix can be readily interpreted as the connectivity matrix of a weighted undirected graph with ROIs representing nodes and pairwise NMI values between ROIs serving as edge weights.

In contrast, classic Pearson correlation matrices are not necessarily nonnegative, and this has to be accounted for in a graph theoretical framework. Negative values may be discarded by either considering only their absolute magnitudes or removing the corresponding edges from the graph entirely, resulting in a loss or potential corruption of the underlying correlative information. Instead, the positive and negative components of a graph may be analyzed separately, although this approach might result in unconnected subgraphs. Alternatively, classic network metrics may be redefined and extended to account for negative edge weights; however, the potential existence of negative cycles proved to be especially problematic for measures relying on the notion of paths through a graph (Fakcharoenphol & Rao, 2006). To circumvent these issues and their inherent impact on network architecture, we chose to use the NMI as a statistical dependence metric to construct functional networks. In this manner, three groups of functional networks were constructed, (*F*_{k}^{r})_{k ≥ 1}, (*F*_{k}^{syl})_{k ≥ 1}, and (*F*_{k}^{sp})_{k ≥ 1} (with *k* denoting the participant number), corresponding to the resting state, syllable production, and speaking, respectively (Figure 1B).

In a second step, we stimulated 300 networks, including 100 for each resting (*G*_{k}^{r})_{k ≥ 1}, syllable production (*G*_{k}^{syl})_{k ≥ 1}, and speaking (*G*_{k}^{sp})_{k ≥ 1} conditions, by imposing a Gaussian random graph model on randomly sampled empirical networks from the corresponding groups (Figure 1B; Wang et al., 2011; Ghosh, Rho, McIntosh, Kotter, & Jirsa, 2008). Specifically, let *F* be a fixed but arbitrary empirical network, and let *F*(*m*,*n*) denote the edge weight between nodes *m* and *n*. Given *F*, a simulated network *G* is obtained by imposing a Gaussian null graph with independent, identically distributed edges of zero mean and constant variance given by the scaled median of edge weights within each group (rest, syllable, and speech). Then, the expected edge weight between nodes *m* and *n* in *G* is given by *E*(*G*(*m*,*n*)|*s*_{1},…,*s*_{N}) = (*s*_{m} + *s*_{n})/(*N* − 2) − 2*S*/[(*N* − 1)(*N* − 2)] for *m* ≠ *n* and *E*(*G*(*m*,*n*)|*s*_{1},…,*s*_{N}) = 0; otherwise, where *s*_{m} = Σ_{n}*F*(*m*,*n*) denotes the strength of node *m* in *F* and *S* = ½ Σ_{m}*s*_{m} is the total nodal strength in *F* (Chang, Leahy, & Pantazis, 2012). Imposing a Gaussian graph on *F* as a null model to generate *G* ensures that the basic topology of the empirical network *F*, including its degree distribution, is preserved, while simultaneously introducing stochastic variability in the simulated network cohort (Chang et al., 2012). Consequently, this strategy injects stochastic alterations in the modular architecture of *F* to construct a simulated graph *G* of comparable but randomly adapted modular architecture.

### Network Preprocessing

We will use the shorthand notation (*F*_{k})_{k ≥ 1} and (*G*_{k})_{k ≥ 1} to refer to all (rest, syllable, speech) empirical and simulated networks, respectively. The connection density of both (*F*_{k})_{k ≥ 1} and (*G*_{k})_{k ≥ 1} was calculated in every graph as the number of actual edges divided by the maximum possible number of edges in the graph. Thus, the density of an undirected graph with *N* nodes is given by 2*K*/(*N*^{2} − *N*), with *K* denoting the actual edge count in the network. All networks (empirical and simulated) showed a density ≥ 99%, indicating that most graphs were almost fully connected. To allow an assessment of edge clustering and connection patterns, a density-dependent iterative thresholding approach was applied to every graph. Edges with weight less than a subsequently incremented percentage of the maximum weight in the graph (starting with 1%) were removed from the network. The iteration was stopped if the removal of an edge would fully disconnect a node from the rest of the network. It has been shown that network architecture becomes increasingly random above a density of 50% (Lynall et al., 2010; Humphries et al., 2006); thus, the connection density of all graphs was reduced to the within-group minimum density, which was calculated as the lowest common density to which every network in the group could be thresholded without disconnecting a node. In this way, empirical graphs were reduced to 65% (rest), 63% (syllable), and 50% (speech) densities, whereas the thresholding of simulated networks resulted in densities of 60% (rest), 68% (syllable), and 47% (speech). Given the size of the considered networks (*N* > 200), the obtained within-group minimum densities were comparable across conditions (van Wijk, Stam, & Daffertshofer, 2010).

One of the most widely used ways to represent group effects in graph theoretical assessments of neuroimaging data is to consider group-averaged networks (Ginestet, Fournel, & Simmons, 2014), which we thus used as group-specific points of reference in our subsequent topological graph analysis. Group-averaged networks were computed by calculating mean edge weights within each condition (rest, syllable, and speech) in both empirical and simulated networks (Figure 1C). We denote group-averaged networks by the subscript “0” (*F*_{0}^{r}, *G*_{0}^{r}, etc.) and drop the superscript “r,” “syl,” and “sp” to indicate mean networks of each condition, namely, the resting state, syllable production, and speaking (*F*_{0} and *G*_{0}). To further assess the effects of perturbations in the averaging set on the topology of the mean network, we implemented a leave-one-out analysis paradigm in the simulated network cohort, which was composed of the most samples per group (rest, syllable, and speech). Although an *n*-fold cross-validation approach would have been computationally more efficient, it was a priori not clear how to choose the number of withheld samples *n* appropriately to not introduce an estimation bias (Hastie, Tibshirani, & Friedman, 2009). Hence, we performed a leave-one-out validation using a large cohort of simulated networks to minimize any undue influence of a single empirical subject on the community layout in the group average. Thus, we computed 100 leave-one-out mean networks per condition (*G*_{0\k}^{r})_{k ≥ 1}, (*G*_{0\k}^{syl})_{k ≥ 1}, and (*G*_{0\k}^{sp})_{k ≥ 1}, corresponding to the resting state, syllable, and speech production, respectively, by averaging across the graphs (*G*_{j}^{r}), (*G*_{j}^{syl}), and (*G*_{j}^{sp}), respectively, for *j* = 1,…,100, *j* ≠ *k* (Figure 1B).

### Analysis of Network Topology

The architecture of each network was analyzed by estimating its optimal modular decomposition. The terms “community” and “module” will be used interchangeably in this article to denote groups of nodes that form densely interconnected subnetworks within a graph. Most community detection strategies that are used to analyze functional brain networks yield partitions, in which each node is unambiguously assigned to exactly one module (Sporns & Betzel, 2015). In line with this, we defined a modular decomposition of a graph to be a partition that divided a network into nonoverlapping nodal communities. One of the most widely used techniques to compute a network's modular decomposition consists of maximizing a graph's Newman modularity score *Q* (Newman & Girvan, 2004). The modularity score of a network is a statistic that quantifies the degree to which a network can be divided into nodal groups that show a higher internal connection density than would be expected by chance. Thus, maximizing *Q* ideally uncovers the partition with the lowest between-module and densest within-group connectivity, which thus best approximates the network's inherent community structure.

However, the global null model used to predict between-group connectivity in the calculation of *Q* is based on the assumption that every node can establish connections to any other node in the network. As a consequence, merging two small but clearly delineated communities in a sufficiently large network might increase the value of *Q* (Fortunato & Barthelemy, 2007). To counteract this effect known as the resolution limit of modularity, we used a multiresolution approach (Bassett et al., 2013) for community detection. A resolution parameter γ was introduced that scaled the null model in the calculation of *Q*, so that smaller communities were detected for γ > 1, whereas 0 < γ ≤ 1 resulted in larger modules. To balance modularity resolution for each of the six groups (rest, syllable, and speech conditions in the empirical and simulated network cohorts), the optimal modular decomposition of each group-averaged network was approximated by maximizing *Q*(γ) for γ in [0.5, 2.0] (30 values in increments of 0.05). Simultaneously, we computed a modularity score *Q*^{rnd}(γ) for each value of γ as the average modularity across 20 random networks with the same degree distribution as the corresponding group-averaged graph. Maximizing the quality function *f*(γ) = *Q*(γ) − *Q*^{rnd}(γ), that is, the difference between each network's actual score and a corresponding random graph's modularity score, favored the associated modular architecture that deviated the most from a partition expected by chance under the null model. To reduce computational load and ensure that the scale of partitions was consistent within each group, the maximizer γ = argmax *f*(γ) for the respective mean network was used as the resolution parameter for all graphs in the underlying group (Figure 1D).

Another methodological drawback of modularity maximization is rooted in its nature as global optimization problem. With increasing network size and community count, the modularity score *Q* is characterized by numerous local maxima, causing a degeneracy of near-optimal but dissimilar network partitions (Good et al., 2010). To address this phenomenon, we approximated the optimal modular decomposition of a graph using consensus communities, based on clustering a partition association matrix (Lancichinetti & Fortunato, 2012). A heuristic optimization strategy based on a multi-iterative generalization of the Louvain community detection algorithm (Rubinov & Sporns, 2010; Blondel, Guillaume, Lambiotte, & Lefebvre, 2008) was used to maximize *Q*, given the calculated group-specific resolution parameters. In this manner, 100 near-optimal modular decompositions were calculated for every network, which formed the basis of an association matrix *A*, with elements *A*_{ij} representing the number of times nodes *i* and *j* were assigned to the same community, weighted by each partition's quality as quantified by its modularity score *Q*. To eliminate small entries of *A* and thus improve clustering results, we applied an automated thresholding strategy (Bassett et al., 2013). The cutoff threshold τ was defined as the maximal element across 10,000 null association matrices constructed based on random permutations of the previously calculated 100 partitions. Every element *A*_{ij} < τ was set to zero, and the resulting association matrix was partitioned again 100 times using the Louvain algorithm, generating a new association matrix. If necessary, this step was repeated until partitions converged to a single-limit partition, which eventually represented the final consensus decomposition of the network.

Because *Q* can be maximized even for synthetic graphs with random topology, the successful computation of a network partition does not necessarily imply that a graph is really modular. To assess whether the computed consensus communities were indicative of a truly modular architecture of the underlying networks, we compared the partition of every empirical and simulated group-averaged network, *F*_{0} and *G*_{0}, respectively, to random graphs with identical degree distributions. As a first step, the *Q* modularity of each network was transformed to a *z* score *Q*_{z} by subtracting the average modularity of 100 corresponding random graphs μ(*Q*_{k}^{rnd}) and dividing the result by the corresponding standard deviation σ(*Q*_{k}^{rnd}). Thus, *Q*_{z} > 2 would suggest a significantly higher modularity of a group-averaged network than expected by chance. However, it has been shown that some highly modular networks fail to exhibit a significant *Q*_{z}, whereas the Newman modularity of other computer-generated graphs (Girvan & Newman, 2002), which lack any community structure, is much greater than expected by chance (Karrer et al., 2008). Thus, we used robustness testing as an additional means to ensure the quality of the computed consensus partitions. Although the community layout of a truly modular network is usually stable against small perturbations of its link structure, rewiring even a few edges in a random graph generally results in substantial changes in its modular decomposition (Karrer et al., 2008; Gfeller et al., 2005). Thus, similar community layouts of original and perturbed networks are suggestive of a robust modular makeup of a graph, whereas community dynamics that resemble random networks suggest the lack of any natural underlying nodal clustering within a graph. Following Karrer et al. (2008), both empirical and simulated mean networks, *F*_{0} and *G*_{0}, respectively, were randomized with increasing rewiring probability α (40 steps from 0 to 1). For each value of α, perturbed graphs *F*_{0}(α) and *G*_{0}(α) were computed by averaging across 100 rewired networks (*F*_{0}^{k}(α)) and (*G*_{0}^{k}(α)), respectively. Random graphs *R*_{0} were constructed by averaging across 100 candidate null model networks with the same degree distribution as *F*_{0} and *G*_{0}, respectively. The constructed random graphs were similarly perturbed, where, for each value of α, the network *R*_{0}(α) was determined by averaging across 100 rewired random networks (*R*_{0}^{k}(α)). The similarity of community layouts in two graphs *T* and *T*′ was quantified by introducing a partition distance *p*_{d} defined as the normalized variation of information between the partitions *P* and *P*′ of *T* and *T*′, respectively, that is, *p*_{d}(*T*,*T*′) = (*H*(*P*) + *H*(*P*′) − 2*I*(*P*,*P*′))/log(*N*) (Meila, 2007), where *H* and *I* denote the Shannon entropy and mutual information, respectively. Thus, *p*_{d}(*T*,*T*′) = 0 indicates identical partitions, whereas *p*_{d}(*T*,*T*′) = 1 points to maximally different community layouts. If the evolution of *p*_{d}(*F*_{0}, *F*_{0}(α)) and *p*_{d}(*G*_{0}, *G*_{0}(α)) departed significantly from the perturbation dynamics of the corresponding null models *p*_{d}(*R*_{0}, *R*_{0}(α)), it would thus suggest a clear nonrandom community architecture of the networks *F*_{0} and *G*_{0}. Because of its high computational demand, a robustness analysis was only performed for the group-averaged networks *F*_{0} and *G*_{0}, which were the a priori focus of analysis in this study.

### Statistical Evaluation and Computational Environment

To assess whether deviations in the partitions of the group-averaged networks from the modular decompositions of the underlying cohort were related to the observed condition (i.e., rest, syllable, and speech) or were identified by chance only, we performed a linear mixed effect model analysis (Bates, Machler, Bolker, & Walker, 2015). Three linear models with condition as fixed effect and an intercept for subjects or leave-one-out averages as random effect were fit to the data given by the partition distances *p*_{d}(*F*_{0},*F*_{k}), *p*_{d}(*G*_{0}, *G*_{k}), and *p*_{d}(*G*_{0}, *G*_{0\k}), respectively. We used likelihood ratio tests at a Bonferroni-corrected *p* ≤ .05 to assess whether the full models with the observed condition (rest, syllable, and speech) as the main effect differed significantly from a null model without the effect. The constructed models were subjected to simultaneous two-sided all-to-all general linear hypothesis tests in a Tukey contrast at *p* ≤ .05 adjusted for multiple comparisons to determine the statistical significance of differences in reference-to-subject partition distances across conditions.

Network processing and visualization codes were written in Python using the open-source packages NumPy, SciPy (van der Walt, Colbert, & Varoquaux, 2011), and Matplotlib (Hunter, 2007). Community analyses were performed in MATLAB (The MathWorks, Inc., Natick, MA) using the Brian Connectivity Toolbox (Rubinov & Sporns, 2010). The statistical analysis was done in R (R-Core-Team, 2016) using the packages lme4 (Bates et al., 2015) and multcomp (Hothorn, Bretz, & Westfall, 2008). Three-dimensional renderings of the networks embedded in reference brain models were generated using the BrainNet Viewer (Xia, Wang, & He, 2013). Network processing and visualization codes used in this study are made available at research.mssm.edu/simonyanlab/analytical-tools/.

### Alternative Strategies for Network Construction

To ensure that our results were not biased by differences in the sampling rate between resting state and task production, we sparse-sampled the original resting-state time series to match the sparse-sampling design of the task production scans. We applied the same community analysis pipeline to the networks based on this new set of sparse-sampled resting-state data and compared our results with the findings reported in the main study. We found that all networks showed identical within-group partition sensitivity (empirical networks: *p*_{d}(*F*_{0}, *F*_{k}) = 0.35 ± 0.03, artificial networks: *p*_{d}(*G*_{0}, *G*_{k}) = 0.36 ± 0.03, leave-one-out averages: *p*_{d}(*G*_{0}, *G*_{0\k}) = 0.01 ± 0.01). We subsequently constructed linear mixed effect models using partition data from sparse-sampled and original resting-state networks with data source (sparse sampled vs. original) as fixed effect and an intercept for subjects/leave-one-out averages as random effects. General linear hypothesis tests showed no statistically significant differences between sparse-sampled and original resting-state partitions (empirical networks: *p* = .7, artificial networks: *p* = .6, leave-one-out averages: *p* = .9). This additional analysis demonstrated that the sparse-sampling sequence during task production did not differentially influence our data in terms of mapping network communities.

To illustrate that the validity of our findings did not depend on our choice of the NMI as the statistical dependence metric, we performed a second set of numerical experiments, in which we based the constructed networks on pairwise zero-lag Pearson correlation coefficients. We first subjected the Pearson networks to the same relative thresholding strategy as the NMI graphs by iteratively eliminating edges with low weights while keeping all networks connected (i.e., all nodes must have degree ≥ 1 after thresholding). This reduced the densities of empirical networks to 65% for resting state, 85% for syllable production, and 72% for speech production and similarly for artificial networks to 67% for resting state, 86% for syllable production, and 74% for speech production. To further lower network densities to the levels of the NMI graphs, we used a maximum spanning tree approach. For each graph, we computed a maximum spanning tree and populated it with the strongest connections found in the original networks until a target density was met (Hidalgo, Klinger, Barabasi, & Hausmann, 2007). To keep the results consistent, we used the densities of the NMI networks as target values. Because of the presence of negative weights in the graphs, we used an adapted implementation of the Louvain algorithm to perform the community detection analysis (Traag & Bruggeman, 2009).

Compared with the findings reported below for the NMI graphs, the results obtained using these Pearson networks were highly similar (empirical graphs [rest/syllable/speech]: *Q*_{z} = 39.1/50.0/57.4, *p*_{d}(*F*_{0}, *F*_{k}) = 0.33 ± 0.03/0.37 ± 0.05/0.37 ± 0.03; artificial graphs [rest/syllable/speech]: *Q*_{z} = 56.1/36.1/47.0, *p*_{d}(*G*_{0}, *G*_{k}) = 0.37 ± 0.03/0.37 ± 0.05/0.39 ± 0.03; *p*_{d}(*G*_{0}, *G*_{0\k}) = 0.02 ± 0.04/0.09 ± 0.06/0.12 ± 0.06). Likelihood ratio tests of linear mixed effect models constructed based on empirical as well as artificial graphs confirmed that deviations of per-subject partitions from the group-averaged reference were condition related (both *p*s < .0007) and not by chance, similar to our main finding in NMI networks. Within-group stability of mean network partitions showed identical characteristics as in NMI networks. Deviations of empirical per-subject network communities from the group-averaged partition showed significant deviations between resting state and task production (both *p*s < .003) but not across tasks (speech vs. syllable: *p* = .9). Finally, leave-one-out partitions also exhibited an increase in community sensitivity with increasing task complexity (all *p*s < 10^{−5}) similar to NMI networks.

In conclusion, these data demonstrated that our main findings below are not influenced by our choice of the mutual information as the statistical dependence measure. However, using the NMI instead of Pearson correlation coefficients ensured the reliability of our results as it simultaneously accounted for nonlinear co-dependencies in the data. Moreover, the use of a maximum spanning tree approach with Pearson's networks may yield ambiguous results because a network has a unique maximum spanning tree if and only each edge weight in the graph is unique (Kepner & Gilbert, 2011). This condition may be violated by most functional networks, yielding numerous plausible maximum spanning trees per graph. Thus, although our data showed similarities between NMI and Pearson networks, using NMI networks is overall more robust for providing reliable outcome, as described below.

## RESULTS

### Empirical Networks

As a first step in our analysis, we performed a detailed assessment of the community architecture in empirical large-scale functional networks based on pairwise NMI values between regionally averaged BOLD time series. A comparative assessment of the modularity score of the empirical group-averaged networks *F*_{0} and corresponding null model graphs yielded γ = 1 as the resolution parameter, for which *Q* was maximally different from *Q*^{rnd} in all conditions (Figure 2A). Thus, γ = 1 was used in all following modularity maximization calculations. The threshold values in the subsequent consensus partitionings were chosen to be 0.0625 for *F*_{0}^{r} and *F*_{0}^{syl} and 0.044 for *F*_{0}^{sp}, which were based on the respective maxima in the corresponding null association matrices.

A likelihood ratio test of the constructed linear mixed effect model against an associated null model confirmed that deviations of per-subject partitions from the group-averaged reference were indeed condition related (*p* = 3.2 × 10^{−5}) and not by chance.

An in-depth assessment of the modularity score *F*_{0} and the results of robustness testing confirmed the highly modular structure of all empirical mean networks (all *Q*_{z} ≥ 32.8; see Table 1). Compared with equivalent random graphs, the perturbation curves *p*_{d}(*F*_{0}, *F*_{0}(α)) of all empirical functional networks increased markedly slower and deviated significantly from the corresponding null models, indicating that the detected communities were relatively robust against perturbations, unlike modules detected in the respective random graphs (Figure 2B).

. | Rest
. | Syllable
. | Speech
. |
---|---|---|---|

Empirical networks | |||

Q_{z} | 53.5 | 32.8 | 47.2 |

p_{d}(F_{0}, F_{k}) | 0.35 ± 0.03 | 0.39 ± 0.03 | 0.4 ± 0.04 |

Artificial networks | |||

Q_{z} | 59.0 | 27.1 | 51.5 |

p_{d}(G_{0}, G_{k}) | 0.36 ± 0.03 | 0.4 ± 0.03 | 0.4 ± 0.04 |

p_{d}(G_{0}, G_{0\k}) | 0.01 ± 0.02 | 0.05 ± 0.04 | 0.07 ± 0.04 |

. | Rest
. | Syllable
. | Speech
. |
---|---|---|---|

Empirical networks | |||

Q_{z} | 53.5 | 32.8 | 47.2 |

p_{d}(F_{0}, F_{k}) | 0.35 ± 0.03 | 0.39 ± 0.03 | 0.4 ± 0.04 |

Artificial networks | |||

Q_{z} | 59.0 | 27.1 | 51.5 |

p_{d}(G_{0}, G_{k}) | 0.36 ± 0.03 | 0.4 ± 0.03 | 0.4 ± 0.04 |

p_{d}(G_{0}, G_{0\k}) | 0.01 ± 0.02 | 0.05 ± 0.04 | 0.07 ± 0.04 |

Calculated *Q* modularity *z* scores and within-group partition distances between per-subject networks/leave-one-out samples and the respective group-averaged reference graphs in empirical and artificial network cohorts across the resting state, syllable production, and speaking.

The resting-state network *F*_{0}^{r} was composed of four modules, including frontoparietal (red), thalamic–BG (green), central–temporal (purple), and occipital–cerebellar (orange; Figure 3A-I). Individual per-subject networks (*F*_{k}^{r})_{k ≥ 1} showed a similar modular makeup that closely resembled the group-averaged partition (*p*_{d}(*F*_{0}^{r}, *F*_{k}^{r}) = 0.35 ± 0.03; see Table 1).

Compared with the resting state, syllable production was characterized by significantly larger deviations of per-subject partitions from the group-averaged reference (*p*_{d}(*F*_{0}^{syl}, *F*_{k}^{syl}) = 0.39 ± 0.03, *p* = .002). The community architecture of the mean network *F*_{0}^{syl} was composed of four modules, including mainly left-lateralized (red), mainly right-lateralized (purple), thalamic–BG (green), and cerebellar (orange; Figure 3B-I).

Similar to syllable networks, speech production was characterized by pronounced changes in the layout of per-subject communities as compared with the modular structure of the mean network (*p*_{d}(*F*_{0}^{sp}, *F*_{k}^{sp}) = 0.4 ± 0.04; see Table 1). These deviations were significantly larger than the corresponding resting-state variations (*p* < 10^{−4}) but comparable with the observed modular intersubject variability during syllable production (*p* = .93). The mean network *F*_{0}^{sp} consisted of five communities, including two mainly left- and right-lateralized central–temporal (red and purple, respectively), thalamic–BG (green), fronto-occipital (orange), and cerebellar (blue; Figure 3C-I).

The low standard deviations of *p*_{d}(*F*_{0},*F*_{k}) yielded very small maximal standard errors of the average subject-to-reference partition distances across all conditions (*SE*_{r} = 0.007, *SE*_{syl} = 0.007, *SE*_{sp} = 0.009), indicating that the chosen sample size of 100 simulated graphs (*G*_{k})_{k ≥ 1} was adequate to perform a modular stability analysis.

### Simulated Networks

On the basis of the empirical per-subject networks, 100 artificial graphs were constructed per condition, namely, resting state, meaningless syllable production, and meaningful speaking. Analogous to the analysis performed for the empirical networks, the mean networks *G*_{0} were used to algorithmically determine condition-specific resolution parameters and consensus partitioning thresholds. As for the empirical networks, γ = 1 maximized the difference between *Q* and *Q*^{rnd} in all groups (Figure 2A), whereas the consensus thresholds were determined to be 0.046, 0.044, and 0.0625 for *G*_{0}^{r}, *G*_{0}^{syl}, and *G*_{0}^{sp}, respectively. Deviations in the partitions of the individual 100 simulated networks from the corresponding group-averaged references were related to the analyzed conditions (rest, syllable, and speech) as revealed by a likelihood ratio test of the associated linear mixed effect model against a corresponding null model (*p* < 10^{−6}), suggesting that the basic topological properties of the empirical networks were preserved in the simulated graphs. This observation was further confirmed when assessing the quality of the modular structure in the group-averaged networks. All graphs *G*_{0} showed a normalized modularity score significantly greater than two (*Q*_{z} ≥ 27.1; see Table 1) as well as smoothly inclining perturbation curves *p*_{d}(*G*_{0}, *G*_{0}(α)) that differed markedly from corresponding random networks (Figure 2B).

The modular decomposition of the simulated resting-state network *G*_{0}^{r} closely resembled the community architecture of the corresponding empirical graph (*p*_{d}(*F*_{0}^{r}, *G*_{0}^{r}) = 0.14); however, an additional fifth module (blue) was detected in *G*_{0}^{r} (Figure 3A-III). Notably, this new module recruited nodes exclusively from the thalamic–BG module in the empirical network (green in Figure 3A-I) and thus was composed of mostly subcortical nodes, including the subdivisions of the bilateral amygdala and thalamus, nucleus accumbens, and hippocampus. Thus, the additional fifth module in *G*_{0}^{r} took the position of the thalamic–BG community (green) in *F*_{0}^{r}, whereas some remaining subcortical regions migrated from the central–temporal module (purple) in *F*_{0}^{r} to the thalamic–BG community (green) in *G*_{0}^{r} (Figure 3A-II). A within-group analysis revealed that the community architecture of the generated 100 simulated graphs was similar to the respective group-averaged partition (*p*_{d}(*G*_{0}^{r}, *G*_{k}^{r}) = 0.36 ± 0.03; Table 1), consistent with the findings for the empirical networks.

The community architecture of the simulated syllable production network *G*_{0}^{syl} closely approximated the partition of the empirical network (*p*_{d}(*F*_{0}^{syl}, *G*_{0}^{syl}) = 0.15), showing four modules of almost identical structure as their respective empirical counterparts (Figure 3B-III). Interestingly, the most pronounced community membership fluctuations were found in the frontal lobe with nine nodes (subdivsions of the bilateral orbital and frontal gyri) migrating from the left-lateralized module in the empirical graph *F*_{0}^{syl} (red) to the cerebellar module in the simulated network *G*_{0}^{syl} (orange; Figure 3B-II). In accordance with the results obtained for the empirical networks, the constructed 100 syllable production graphs showed significantly larger deviations from the respective group-averaged reference than the simulated resting-state networks (*p*_{d}(*G*_{0}^{syl}, *G*_{k}^{syl}) = 0.4 ± 0.03, *p* < 10^{−4}).

The within-group analysis of simulated speech production graphs revealed a similar pattern with deviations of individual network partitions from the group-averaged reference, which were significantly more pronounced than for the simulated resting-state graphs (*p*_{d}(*G*_{0}^{sp}, *G*_{k}^{sp}) = 0.4 ± 0.04, *p* < 10^{−4}) but comparable with the simulated syllable production networks (*p* = .6). Remarkably, the community architecture of the empirical and simulated speech production graphs, *F*_{0}^{sp} and *G*_{0}^{sp}, respectively, was very similar, showing nodal communities of almost equal topology (*p*_{d}(*F*_{0}^{sp}, *G*_{0}^{sp}) = 0.06; Figure 3C-III). Both networks were composed of five modules, which differed only by isolated nodes on module boundaries changing their affiliation from *F*_{0}^{sp} to *G*_{0}^{sp} (Figure 3C-II). The thalamic–BG module in *G*_{0}^{sp} showed most deviations from its empirical counterpart by recruiting some additional subcortical regions (right nucleus accumbens, external globus pallidus). However, this module was also the largest in both networks, comprising 26% and 28% of all nodes in *F*_{0}^{sp} and *G*_{0}^{sp}, respectively.

To further assess the impact of intersubject variability on the partitions of the group-averaged graphs, we performed a leave-one-out stability analysis using the 100 simulated networks constructed for each condition. Group-averaged resting-state network partitions were characterized by the highest robustness against perturbations within the sample cohort (*p*_{d}(*G*_{0}^{r}, *G*_{0/k}^{r}) = 0.01 ± 0.02; Figure 4D and E, Table 1) and exhibited a relatively stable community architecture across all leave-one-out networks (Figure 4A). The mean syllable production network was significantly more sensitive to variations in the underlying cohort than the resting-state graph (*p*_{d}(*G*_{0}^{syl}, *G*_{0\k}^{syl}) = 0.05 ± 0.04, *p* < 10^{−6}; Figure 4D), with nodal community affiliations fluctuating notably between leave-one-out graphs (Figure 4B). Compared with the resting state and syllable production, communities of the group-averaged speech production network *G*_{0}^{sp} showed the highest variability across the testing set (*p*_{d}(*G*_{0}^{sp}, *G*_{0\k}^{sp}) = 0.07 ± 0.04; Figure 4D). The modular makeup of (*G*_{0\k}^{sp})_{k ≥ 1} changed drastically from sample to sample (Figure 4C), resulting in significantly higher deviations of leave-one-out network partitions from the group-averaged reference as compared with rest and syllable production (all *p*s ≤ 7.1 × 10^{−6}).

## DISCUSSION

We performed a stability analysis of nodal communities in large-scale functional brain networks during three conditions of increasing complexity: the resting state, production of meaningless syllables, and production of meaningful speech. Our findings indicate that task-related intersubject variability impacts mean network community structure and its aptitude to represent the modular makeup of the underlying condition. Pronounced interindividual differences in functional connectivity as usually observed during task conditions (Hermundstad et al., 2013) are potentially confounding within-group stability of task-based functional modules.

The calibration of the resolution parameter used in the Louvain algorithm resulted in γ = 1 for all group-averaged empirical networks *F*_{0}, which implies that the scale of functional modules was balanced across all examined conditions. The same result was also obtained for the simulated networks *G*_{0}, demonstrating that the resolution of empirical communities was preserved in the simulated graphs. These findings indicate that the determined differences in community stability within empirical (*F*_{k})_{k ≥ 1}, simulated (*G*_{k})_{k ≥ 1}, and leave-one-out (*G*_{0\k})_{k ≥ 1} networks were a function of the assessed condition and likely not caused by variations in the resolution of the modularity function.

Empirical networks *F*_{0} were characterized by a highly modular structure as shown by robustness testing (Figure 2B), which confirmed the nonrandom nature of the computed partitions. The follow-up assessment of the modular architecture in the simulated network cohort (*G*_{k})_{k ≥ 1} further confirmed that the employed strategy for generating these graphs indeed preserved the topological characteristics seen in the empirical networks. Imposing Gaussian random graphs on the per-subject networks (*F*_{k})_{k ≥ 1} to construct (*G*_{k})_{k ≥ 1} introduced perturbations in the group-averaged graphs *G*_{0}, which in turn led to changes in their respective modular architecture as compared with the empirical networks *F*_{0}. Thus, assessing the migration of nodes from communities in *F*_{0} to modules in the simulated group-averaged networks *G*_{0} (Figure 3-II) represents a module perturbation analysis by itself. The subsequently performed leave-one-out analysis of the generated networks further quantified the task-dependent sensitivity of functional communities.

### Topological Characteristics of Resting-State Networks

Resting-state communities in *F*_{0}^{r} showed a spatial distribution that was balanced across hemispheres with no evident lateralization pattern (Figure 3A-I). Compared with both tasks, resting-state community showed more prominent differences between empirical *F*_{0}^{r} and simulated *G*_{0}^{r} networks (Figure 3A), with an additional fifth module formed in *G*_{0}^{r} compared with the four modules in *F*_{0}^{r}. The fifth module in *G*_{0}^{r} (blue; Figure 3A-III) was mainly composed of subcortical nodes migrating the central–temporal module of *F*_{0}^{r} (purple). The layout of resting-state modules with respect to communities in *G*_{0}^{r} showed the highest similarity across leave-one-out averages (*G*_{0\k}^{r})_{k ≥ 1} compared with the two analyzed tasks (Figure 4D and E), indicating that subject-specific contributions in the construction of the group-averaged network *G*_{0}^{r} did not noticeably influence its community structure. This finding motivates the hypothesis that the modular decomposition of a group-averaged functional resting-state network is well suited to approximate the community layouts in the underlying network cohort. Clearly, this proposition premises a sufficient number of graphs to construct the average network. However, the empirical networks analyzed here showed a very small maximal standard error of average subject-to-reference partition distances across all conditions (maximal *SE* = 0.009), suggesting that the validity of the stated conjecture does not require an unrealistically large sample size.

### Topological Characteristics of Task-Production Networks

The architecture of empirical speech and syllable production networks was shaped by two communities with clearly delineated hemispheric dominance (red and purple; Figure 3A-I and B-I), which in total comprised 51% (*F*_{0}^{syl}) and 37% (*F*_{0}^{sp}) of all nodes in the respective graphs. Both syllable and speech production graphs were characterized by a markedly lower similarity of per-subject network partitions to the group-averaged reference as compared with the resting state. In other words, functional communities exhibited higher within-group variability while performing the two tasks than during the resting state. Given that network partitions (in every *F*_{0} as well as in all *F*_{k}) were calculated in terms of consensus communities, the observed increase in topological variability of modules was thus most likely a consequence of individual task-related functional adaptations.

Interestingly, although the simulated and empirical networks were very similar across all conditions (*p*_{d}(*F*_{0}, *G*_{0}) ≤ 0.15), confirming the robustness of the detected communities with respect to perturbations of edges (Figure 2B), the simulated speech production network *G*_{0}^{sp} exhibited the smallest deviation from the corresponding empirical graph *F*_{0}^{sp} (Figure 3C). On the other hand, similar to the associated empirical networks, simulated speech production communities were characterized by large within-group variations with respect to the reference partition of the mean network and showed the highest sensitivity to leave-one-out sampling (Figure 4). This observation may be indicative of two effects. First, within-group variations present in the empirical cohort (*F*_{k}^{sp})_{k ≥ 1} were conserved by the randomization process that generated the simulated graphs (*G*_{k}^{sp})_{k ≥ 1}. Second, the perturbations introduced in (*F*_{k}^{sp})_{k ≥ 1} to construct (*G*_{k}^{sp})_{k ≥ 1} had a negligible effect on the layout of functional communities in the mean network *G*_{0}^{sp} and were much less pronounced than edge-weight deviations because of intersubject variability. Thus, community structure translated well from empirical *F*_{0}^{sp} to simulated group-averaged *G*_{0}^{sp} despite pronounced within-group variability in both cohorts (*F*_{k}^{sp})_{k ≥ 1} and (*G*_{k}^{sp})_{k ≥ 1}.

By comparison, the community structure of the simulated syllable production network *G*_{0}^{syl} deviated noticeably from the empirical graph *F*_{0}^{syl}. Particularly, regions in the pFC (orange; Figure 3B), which have been shown to play a key role in the cognitive aspects of speech control (Burton, Locasto, Krebs-Noble, & Gullapalli, 2005; Giraud et al., 2004; Benson et al., 2001; Muller et al., 1997) and which are of less importance for the processing of meaningless syllables (Fuertinger et al., 2015), exhibited differing community assignments in *G*_{0}^{syl} as compared with *F*_{0}^{syl}. Thus, like in the resting state, modular deviations in the simulated graph *G*_{0}^{sp} from the community layout in *F*_{0}^{sp} were not evenly distributed across the entire network but predominantly affected specific, weakly integrated functional components of the graphs.

Whereas simulated resting-state networks (*G*_{k}^{r})_{k ≥ 1} closely followed the community architecture of the respective group-averaged reference graph *G*_{0}^{r}, the partitions of syllable (*G*_{k}^{syl})_{k ≥ 1} and speech production (*G*_{k}^{sp})_{k ≥ 1} networks showed significantly higher within-group variability with respect to the corresponding mean graphs. Single-subject changes in the computation of the analyzed task-related *G*_{0}^{syl} and *G*_{0}^{sp} networks altered their modular topology significantly more as compared with the resting state (both *p*s < 10^{−5}), with communities in the speech production network *G*_{0}^{sp} showing the highest sensitivity to perturbations in cohort averaging, as revealed by leave-one-out sampling (Figure 4D and E).

Although recent studies reported that resting-state functional architecture was related to anatomical connectivity patterns, with functional modules corresponding to dense anatomical connections (Shih et al., 2015; Shen et al., 2012; Wang, Sporns, & Burkhalter, 2012), task production induces a behaviorally driven reconfiguration of the functional architecture of the brain on a global scale (Fuertinger et al., 2015, 2016; Stanley, Dagenbach, Lyday, Burdette, & Laurienti, 2014) and may therefore be more sensitive to intersubject variability. Thus, the observed high within-group stability of resting-state functional partitions as compared with task-related modular layouts might be a consequence of the underlying neuroanatomical architecture. An interesting question for future studies might be what the functional architecture of the brain may look like during human multitasking, that is, in participants performing more than one task at the same time.

### Conclusions

We analyzed the sensitivity of community architecture in large-scale functional brain networks with respect to task complexity. A randomization strategy was used to simulate artificial subjects that conserved the modular layout of the original empirical graphs while simultaneously introducing enough randomness to ensure sufficient variability across networks for studying topological stability. We demonstrated that resting-state functional network partitions were characterized by low intersubject variability with respect to a group-averaged reference, whereas speech and syllable production graph architectures exhibited significantly larger within-group deviations from the community layout of the associated mean networks. Thus, the sensitivity of nodal modular topology with respect to group averaging increased from rest to syllable production to speaking, which suggests that the approximation quality of the community structure in the average network to reflect individual per-subject partitions depends on task complexity.

## Acknowledgments

This work was supported by the grants from the National Institute on Deafness and Other Communication Disorders (R01DC012545) and the National Institute of Neurological Disorders and Stroke (R01NS088160) to K. S.

Reprint requests should be sent to Kristina Simonyan, Department of Neurology, One Gustave L. Levy Place, Box 1137, Icahn School of Medicine at Mount Sinai, New York, NY 10029, or via e-mail: kristina.simonyan@mssm.edu.