## Abstract

Clusterwise inference is a popular approach in neuroimaging to increase sensitivity, but most existing methods are currently restricted to the General Linear Model (GLM) for testing mean parameters. Statistical methods for testing *variance components*, which are critical in neuroimaging studies that involve estimation of narrow-sense heritability or test-retest reliability, remain underdeveloped due to methodological and computational challenges, potentially leading to low power. We propose a fast and powerful test for variance components called CLEAN-V (**CLEAN** for testing **V**ariance components). CLEAN-V models the global spatial dependence structure of imaging data and computes a locally powerful variance component test statistic by data-adaptively pooling neighborhood information. Correction for multiple comparisons is achieved by permutation to control the family-wise error rate (FWER). Through analysis of task-functional magnetic resonance imaging (fMRI) data from the Human Connectome Project across five tasks and through comprehensive data-driven simulations, we show that CLEAN-V outperforms existing methods in detecting test-retest reliability and narrow-sense heritability with significantly improved power, with the detected areas aligning with activation maps. The computational efficiency of CLEAN-V further underscores its practical utility, and it is available as an R package.

## 1 Introduction

Imaging biomarkers obtained by magnetic resonance imaging (MRI) are critical to the understanding of brain functions and structures and their relationship to the diagnosis of brain disorders. Among these, functional MRI (fMRI) has been increasingly used to measure brain activation in response to experimental tasks to understand human cognition and behavior. However, recent studies using large neuroimaging databases (e.g., the Human Connectome Project (HCP) and the Adolescent Brain Cognitive Development (ABCD) study) have shown poor test-retest reliability of task-fMRI measures in most brain regions (Elliott et al., 2020; Kennedy et al., 2022). Since test-retest reliability provides a measure of the consistency and stability of data obtained repeatedly over time, the lack of reliability of task-fMRI measures has raised concerns about current efforts using fMRI data to find brain-behavior associations (Marek et al., 2022). Blokland et al. (2011), with 319 twins in their study, identified several brain regions with heritable fMRI activation, but only with a loose cluster-defining threshold (uncorrected voxel-wise $p<0.05$), which is prone to inflating false positives. These recent findings in neuroimaging motivate a need for a powerful statistical method for neuroimaging studies that involve test-retest reliability or heritability.

Conducting inference for test-retest reliability and heritability is fundamentally a variance component problem in statistics (Lindquist et al., 2012) because both measures can be expressed in terms of random effects that model dependencies between observed data (images). However, most existing variance component methods in the neuroimaging literature estimate and test each variance component separately (e.g., massive univariate analysis). In contrast to testing mean parameters, testing variance component parameters is more challenging due to model specification and the intensive computation needed to address spatial dependence structures. Recently, Risk and Zhu (2021) showed that spatial modeling of variance components leads to improved heritability estimation. However, their method suffers from high computational costs, and conducting statistical inference with it is not straightforward, as the authors noted.

We hypothesize that leveraging spatial dependencies in testing for test-retest reliability or heritability would increase the replicability of identified regions (Noble et al., 2021). Spatial-extent inference has shown its utility in increasing sensitivity in neuroimaging compared with simple massive univariate analysis. In the context of the General Linear Model (GLM), clusterwise inference improves sensitivity through the convolution of univariate test statistics in the spatial domain, exemplified by threshold-free cluster enhancement (TFCE) (Smith & Nichols, 2009). Park et al. (2021) showed that combining test statistics across neighboring vertices yielded high statistical power in linear mixed models as well. Also, spatial Bayesian GLMs that parametrically model spatial dependencies of the mean parameters have resulted in more accurate and reliable estimates (Mejia et al., 2020; Spencer et al., 2022). Bernal-Rusiel et al. (2013) and Risk et al. (2016) showed that modeling the local spatial dependence of the residual noise term with a Gaussian process increases statistical power. More recently, the CLEAN method proposed by Park and Fiecas (2022) showed that an additional gain in sensitivity is achieved in GLMs by a cluster-enhanced statistic, constructed from multivariate test statistics after applying a spatial Gaussian process to the residual noise. This idea has been extended to CLEAN-R (Weinstein et al., 2022) for testing and localizing correspondence between two imaging modalities by modeling modality-specific spatial autocorrelation and constructing cluster-enhanced test statistics for correlation parameters. CLEAN and CLEAN-R highlight the importance of using both aspects of spatial dependence to improve sensitivity.

The aim of this paper is to extend spatial-extent inference to testing variance components, which has been underdeveloped in neuroimaging due to methodological and computational challenges. To overcome the limitations of massive univariate analysis in variance component testing, we propose a new powerful method called CLEAN-V (**CLEAN** for testing **V**ariance components) that tests and localizes variance components at the vertex level. CLEAN-V shares the same underlying mechanisms as CLEAN (Park & Fiecas, 2022) and CLEAN-R (Weinstein et al., 2022) for improving statistical power: (i) explicit spatial autocorrelation modeling of neuroimaging data, (ii) effective spatial enhancement of test statistics, and (iii) a fast resampling procedure for statistical inference. However, CLEAN and CLEAN-R were primarily developed for testing GLM parameters or intermodal correlations, and their direct application to testing variance components in test-retest reliability and heritability studies is limited. The proposed method therefore bridges this gap and provides a robust solution for testing and localizing variance components in neuroimaging studies.

The rest of the paper is organized as follows. In Section 2, we describe the details of CLEAN-V and compare it with existing variance component testing methods. In Section 3, we compare the performance of CLEAN-V with existing methods through comprehensive data-driven simulation studies, and we apply CLEAN-V to multiple tasks from the Human Connectome Project (HCP) to compare statistically significant test-retest reliability and heritability regions across experimental tasks. We conclude with a discussion in Section 4.

## 2 Methods

### 2.1 Notation and model specifications

We consider a dataset consisting of subject-level continuous neuroimaging data. In this paper, we focus on activation levels in task-fMRI measured on the cortical surface, although other neuroimaging data types mapped onto the cortex, such as cortical thickness and cerebral blood flow, are also applicable. Task activation for each individual at each location is obtained by applying a GLM that regresses the observed blood oxygenation level-dependent (BOLD) signal on the convolution of the stimulus function with the hemodynamic response function (Lindquist, 2008; Monti, 2011).

Consider $N$ total images, where each image has $V$ vertices. Note that $N$ does not necessarily denote the number of subjects in reliability studies, where two or more scans are obtained from each subject. Let $i=1,\dots,N$ index images and $v=1,\dots,V$ index vertices on the cortical surface. The observed imaging data $y(v)=(y_1(v),y_2(v),\dots,y_N(v))'$ are modeled by

$$y(v) = X\beta(v) + r(v) + \epsilon(v), \qquad r(v) \sim N(0,\, \theta^2(v)K), \tag{1}$$

where $r(v)=(r_1(v),\dots,r_N(v))'$ and $\epsilon(v)=(\epsilon_1(v),\dots,\epsilon_N(v))'$ are assumed to be independent of each other. Please refer to Figure 1 for an illustration of the model specification. Here, $X$ is an $N\times p$ nuisance covariate matrix (e.g., age): each row of the matrix ($x_i'$) is the $p$-dimensional nuisance covariate vector for the $i$th image. Under this model, $r(v)$ characterizes the dependence among all $N$ images at vertex $v$ (e.g., vertex-level test-retest reliability or heritability) through $K$, and the strength of this dependence is captured by $\theta^2(v)$. Without any truly reliable (or heritable) features, we have $\theta^2(v)=0$ for all $v$, and the images are considered independent.

In this model, $\Sigma$, the covariance of the $i$th residual image $\epsilon_i=(\epsilon_i(1),\dots,\epsilon_i(V))'$, characterizes both spatial and non-spatial variations within images. We use $b_i=(b_i(1),\dots,b_i(V))'$ to characterize the variation due to spatial autocorrelation, whose covariance is modeled by $\sigma^2\Phi(\varphi,D)$ with a scale parameter $\sigma^2$ and a predefined spatial autocorrelation function (SACF) with a parameter $\varphi>0$ and a $V\times V$ pairwise geodesic distance matrix $D$. We also use $\delta_i=(\delta_i(1),\dots,\delta_i(V))'$ to characterize the non-spatial variation, or white noise, with covariance $\tau^2 I_V$. The assumption that $b_i$ and $\delta_i$ are independent leads to $\Sigma=\sigma^2\Phi(\varphi,D)+\tau^2 I_V$, which follows from $\epsilon_i=b_i+\delta_i$ with

$$b_i \sim N(0,\, \sigma^2\Phi(\varphi,D)), \qquad \delta_i \sim N(0,\, \tau^2 I_V).$$

Following the exploratory data analyses of Park and Fiecas (2022) and Weinstein et al. (2022), we assume the exponential SACF for modeling the contrast of parameter estimates in task-fMRI. Thus, the $(v,v^\star)$th element of $\Phi(\varphi,D)$ can be expressed as $\exp(-\varphi\, d_{v,v^\star})$, where $d_{v,v^\star}$ is the geodesic distance between vertices $v$ and $v^\star$.
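As a toy illustration, the exponential SACF can be evaluated directly from a geodesic distance matrix. The sketch below uses a hypothetical 3-vertex distance matrix and our own function name; it is not the package implementation.

```python
import numpy as np

def exponential_sacf(D, phi):
    """Exponential SACF: the (v, v*) entry is exp(-phi * d_{v,v*})."""
    return np.exp(-phi * np.asarray(D))

# Hypothetical geodesic distances (in mm) among three cortical vertices.
D = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.5],
              [2.0, 1.5, 0.0]])
Phi = exponential_sacf(D, phi=0.5)  # diagonal is exactly 1; entries decay with distance
```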

From the model specification, the null hypothesis of interest is

$$H_0:\ \theta^2(1)=\theta^2(2)=\cdots=\theta^2(V)=0, \tag{2}$$

which implies that there are no between-image dependencies. The model is flexible in specifying reliability and heritability structures through the dependence matrix $K$. In testing for test-retest reliability, the $(i,i^\star)$th element of $K$ is 1 if images $i$ and $i^\star$ are from the same subject and 0 otherwise. In testing for narrow-sense heritability in twin studies, the $(i,i^\star)$th element of $K$ is 1 if subjects $i$ and $i^\star$ are monozygotic twins, 0.5 if they are dizygotic twins, and 0 otherwise. In larger pedigree studies, $K$ becomes a genetic relationship matrix.
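To make the specification of $K$ concrete, the sketch below builds the dependence matrix for a test-retest design and for a twin design (with the diagonal set to 1, i.e., each image with itself). Function names and the zygosity encoding are our own illustrative choices.

```python
import numpy as np

def reliability_K(subject_ids):
    """Test-retest K: 1 if two images come from the same subject, 0 otherwise."""
    s = np.asarray(subject_ids)
    return (s[:, None] == s[None, :]).astype(float)

def twin_K(pair_ids, zygosity):
    """Twin-study K: off-diagonal entries are 1 for MZ co-twins, 0.5 for DZ
    co-twins, and 0 for unrelated images; the diagonal is set to 1."""
    p = np.asarray(pair_ids)
    coef = np.where(np.asarray(zygosity) == "MZ", 1.0, 0.5)
    K = (p[:, None] == p[None, :]) * coef[None, :]  # zygosity is shared within a pair
    np.fill_diagonal(K, 1.0)
    return K
```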

### 2.2 Existing approaches

Existing approaches to testing variance components are mostly massive univariate analyses, in which spatial dependencies of the imaging features are not considered. Specifically, the $b(v)$ terms (spatial variations) are dropped, and the model reduces to a massive vertex-wise model as

$$y(v) = X\beta(v) + r(v) + \delta(v), \qquad r(v)\sim N(0,\, \theta^2(v)K), \quad \delta(v)\sim N(0,\, \tau^2(v)I_N), \tag{3}$$

where the null hypothesis remains the same as (2). Note that, from this vertex-level model, narrow-sense heritability is defined as the variance due to additive genetic effects divided by the total variance, and test-retest reliability is defined as an intraclass correlation coefficient (ICC); depending on the study design, both lead to the same parametrization, $\theta^2(v)/(\theta^2(v)+\tau^2(v))$.

Ganjgahi et al. (2015) used the massive univariate approach to carry out voxel-wise heritability inference and estimation with the likelihood ratio test (LRT), Wald test, and score test. Their results showed that the score test, the most computationally efficient of the three, produced more conservative inferences. They argued that this is likely due to the untenable 50:50 mixture approximation: although all of the tests share a common asymptotic null distribution, a 50:50 mixture of $\chi^2_1$ and $\chi^2_0$ (i.e., a point mass at 0), the approximation may not hold in finite samples.

Ge et al. (2015) proposed massively expedited genome-wide heritability analysis (MEGHA) by fitting model (3) to all vertices with brain structure measures. They used a scaled variance-component score test statistic. The variance-component score test statistic is also known as the sequence kernel association test (SKAT) statistic with a linear kernel in genome-wide association studies (GWASs) (Wu et al., 2011). SKAT is a computationally efficient method for testing associations with genetic variants because the computation of the test statistic only requires parameter estimates obtained under the null model. The statistic has the form

$$U(v) = \{y(v)-X\hat\beta(v)\}'\, K\, \{y(v)-X\hat\beta(v)\}, \tag{4}$$

where $\hat\beta(v)$ is estimated under the null hypothesis (2). Wu et al. (2011) showed that the SKAT statistic follows a mixture of chi-square distributions under $H_0$, where the mixture weights are determined by the eigenvalues of $QKQ'$, with $Q$ a Cholesky factor of $P=\hat\tau^2(I-X(X'X)^{-1}X')$ satisfying $P=QQ'$. Wu et al. (2011) also showed that it is the locally most powerful test, compared to the log-likelihood ratio test.
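As a sketch, the variance-component score statistic at a single vertex can be computed as a quadratic form of the OLS null-model residuals in $K$; we show the unscaled version here, and the toy design and numbers are hypothetical.

```python
import numpy as np

def skat_statistic(y, X, K):
    """Quadratic form of the OLS null-model residuals in the dependence
    matrix K; only null-model estimates are needed."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    return float(resid @ K @ resid)

# Toy example: intercept-only design with two hypothetical test-retest pairs.
y = np.array([1.0, 2.0, 3.0, 4.0])
X = np.ones((4, 1))
K = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]], dtype=float)
U = skat_statistic(y, X, K)  # residuals are (-1.5, -0.5, 0.5, 1.5), so U = 8
```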

### 2.3 Proposed method: CLEAN-V (“CLEAN” for testing “V”ariance components)

From the spatial model specified in Equation (1), the score-based variance component test statistic for vertex $v$ is

$$U(v) = \{y(v) - X\hat\beta(v) - \hat b(v)\}'\, K\, \{y(v) - X\hat\beta(v) - \hat b(v)\}, \tag{5}$$

with $\hat\beta(v)$ and $\hat b(v)=(\hat b_1(v),\dots,\hat b_N(v))'$ estimated under $H_0$. Importantly, fitting the null model without any $\theta^2(v)$ is equivalent to fitting CLEAN for the GLM (Park & Fiecas, 2022) because between-image dependencies are modeled by $r(v)$ only. Therefore, we first use ordinary least squares to estimate $\beta(v)$ for each $v$ separately and use the residual images $\hat\epsilon_i$ to obtain consistent estimates of $\sigma^2$, $\varphi$, and $\tau^2$ via covariance regression analysis (Park & Fiecas, 2022; Zou et al., 2017) by minimizing

$$\sum_{i=1}^{N} \left\| \hat\epsilon_i\hat\epsilon_i' - \sigma^2\Phi(\varphi,D) - \tau^2 I_V \right\|_F^2, \tag{6}$$

where $\|\cdot\|_F^2$ is the squared Frobenius norm of a matrix. We refer to Appendix A of Park and Fiecas (2022) for the detailed estimation procedure. Then, we use the parameter estimates to construct $\hat\Sigma=\hat\sigma^2\Phi(\hat\varphi,D)+\hat\tau^2 I_V$ and obtain $\hat b_i=(\hat b_i(1),\dots,\hat b_i(V))'=\hat E[b_i\,|\,y_i]$ under the null model. It can also be shown that $U(v)$ is a score-based variance component test statistic under the working (spatial) covariance matrix, which implies that $U(v)$ from Equation (5) also follows a mixture of chi-square distributions (Supplementary Material A).
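The estimation step can be sketched as follows: for a fixed $\varphi$, minimizing the Frobenius objective in Equation (6) over $(\sigma^2,\tau^2)$ is a two-parameter linear least-squares problem, so a simple grid search over $\varphi$ suffices for illustration. This is our own dense-matrix simplification, not the package's estimation routine.

```python
import numpy as np

def fit_covariance(E, D, phi_grid):
    """Method-of-moments covariance regression (sketch). E is (N, V) with
    residual images as rows; for each phi on a grid, solve the 2x2 normal
    equations for (sigma2, tau2) and keep the phi with the smallest loss."""
    N, V = E.shape
    S = E.T @ E / N                      # average of eps_i eps_i'
    I = np.eye(V)
    best = None
    for phi in phi_grid:
        Phi = np.exp(-phi * D)
        # Frobenius inner products: project S onto span{Phi, I}.
        A = np.array([[np.sum(Phi * Phi), np.trace(Phi)],
                      [np.trace(Phi),     float(V)]])
        b = np.array([np.sum(S * Phi), np.trace(S)])
        sigma2, tau2 = np.linalg.solve(A, b)
        loss = np.sum((S - sigma2 * Phi - tau2 * I) ** 2)
        if best is None or loss < best[0]:
            best = (loss, float(sigma2), float(tau2), phi)
    _, sigma2, tau2, phi = best
    return sigma2, tau2, phi
```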

We construct CLEAN-type cluster-enhanced test statistics in CLEAN-V for their power and easy implementation on the mesh surface. In contrast to other types of cluster enhancement (e.g., TFCE (Smith & Nichols, 2009)), the CLEAN-type cluster enhancement is constructed by aggregating test statistics from the local neighborhood (Park & Fiecas, 2022). To construct cluster-enhanced test statistics using $U(v)$ in Equation (5), we first define $N_r(v)$ as the collection of all vertices within a radius-$r$ neighborhood of a central vertex $v$ (i.e., $\{v^\star : d_{v,v^\star}\le r\}$). Then, we use all the $U(v^\star)$ from vertices $v^\star$ within $N_r(v)$ to construct a cluster-enhanced test statistic for vertex $v$. For a radius $r$, we compute the standardized sum of the variance-component score test statistics within $N_r(v)$:

$$T_r(v) = \frac{\sum_{v^\star\in N_r(v)} U(v^\star) - \hat{E}_{H_0}\big(\sum_{v^\star\in N_r(v)} U(v^\star)\big)}{\sqrt{\widehat{\mathrm{Var}}_{H_0}\big(\sum_{v^\star\in N_r(v)} U(v^\star)\big)}}. \tag{7}$$

In practice, we obtain the null distribution of $\sum_{v^\star\in N_r(v)} U(v^\star)$ by permutation and use the empirical permuted test statistics to estimate $\hat E_{H_0}\big(\sum_{v^\star\in N_r(v)} U(v^\star)\big)$ and $\widehat{\mathrm{Var}}_{H_0}\big(\sum_{v^\star\in N_r(v)} U(v^\star)\big)$ by the sample mean and sample variance (Section 2.4). Similar to how we used the quadratic form to show that $U(v^\star)$ follows a mixture of chi-square distributions, $\sum_{v^\star\in N_r(v)} U(v^\star)$ also follows a mixture of chi-square distributions under the null hypothesis (Supplementary Material A), and such a mixture can be approximated by a normal distribution by the Lyapunov central limit theorem (Supplementary Material A). Therefore, after the standardization in (7), $T_r(v)$ approximately follows the standard normal distribution for each $r$ and $v$. This transformation maps every test statistic onto the same standard normal scale, which allows for data-adaptive testing based on the true areas of signals. Figure 2 provides an illustration.

Statistically, the way CLEAN-V achieves high power aligns with CLEAN (Park & Fiecas, 2022), developed for the GLM. In CLEAN, $T_r(v)$ is defined by assuming that the regression coefficients of interest in $N_r(v)$ are the same (Park & Fiecas, 2022). Similarly, it can be shown that $T_r(v)$ in CLEAN-V is constructed by assuming that the $\theta^2(v^\star)$s in $N_r(v)$ are the same (Supplementary Material B). Since the SKAT statistic is known to be locally most powerful, combining test statistics within a neighborhood strengthens the evidence against the null hypothesis when the strength of the signal remains consistent within the neighborhood. This property, along with Supplementary Material B, makes the proposed CLEAN-type enhancement particularly powerful in CLEAN-V. Li et al. (2022) used a similar approach, the quadratic scan (Q-SCAN) statistic, to identify the sizes and locations of signal regions in genome-wide association studies.

In CLEAN-V, we build a map of test statistics by adopting an adaptive method to benefit from the “ideal” local neighborhood $N_r(v)$ for each $v$. Since the best neighborhood size may differ across vertices, we do not fix the radius $r$. Instead, we consider different clusters by choosing different radii, from $r_0$ up to $r_{\max}$, and take the maximum $T_r(v)$ as the adaptive cluster-enhanced test statistic for vertex $v$:

$$T(v) = \max_{r\in\{r_0,\,r_1,\dots,\,r_{\max}\}} T_r(v). \tag{8}$$

In this way, we can find a nearly optimal cluster size for each vertex, which is useful because the true areas of signals are unknown *a priori*. Following Park and Fiecas (2022) and Weinstein et al. (2022), we consider $r_1=1$ mm, $r_2=2$ mm, $\dots$, $r_{\max}=20$ mm as the default in this paper, although $r_{\max}$ can be chosen flexibly based on research questions and contexts. The radius $r_0=0$ mm is also included in the radii set, which guarantees that the original vertex-level test statistics are considered in the maximization.
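The neighborhood aggregation in (7) and the maximization over radii can be sketched jointly, with permutation-based sample moments standing in for $\hat E_{H_0}$ and $\widehat{\mathrm{Var}}_{H_0}$ (Section 2.4). Array names and shapes are our own, and the dense neighborhood matrix is an illustrative simplification.

```python
import numpy as np

def clean_v_statistics(U, U_perm, D, radii):
    """Adaptive cluster-enhanced statistics (sketch).
    U      : (V,) observed vertex-wise score statistics
    U_perm : (B, V) score statistics from B permutations (B assumed large
             enough that the permutation standard deviations are positive)
    D      : (V, V) geodesic distance matrix; radii includes 0
    Returns T : (V,) with T(v) = max over radii of the standardized sums."""
    V = U.shape[0]
    T = np.full(V, -np.inf)
    for r in radii:
        M = (D <= r).astype(float)      # M[v, v*] = 1 if v* is in N_r(v)
        S = M @ U                       # neighborhood sums of U
        S_perm = U_perm @ M.T           # (B, V) permuted neighborhood sums
        mu = S_perm.mean(axis=0)        # permutation estimate of E_H0
        sd = S_perm.std(axis=0, ddof=1) # permutation estimate of sqrt(Var_H0)
        T = np.maximum(T, (S - mu) / sd)
    return T
```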

### 2.4 FWER control with permutation

A critical value $t_\alpha$ is needed to control the family-wise error rate (FWER) at level $\alpha$. We consider the test statistic for $H_0$ as

$$T = \max_{v=1,\dots,V} T(v). \tag{9}$$

Permutation allows us to generate an empirical distribution of the test statistic under the null hypothesis. In our setting, this is achieved by shuffling $\{(y_i, x_i)\}_{i=1}^N$ while keeping the dependence matrix $K$ unchanged. Note that this is equivalent to shuffling the $\hat\delta_i$ from the null model across images, which in practice does not require refitting the mean model or the covariance model (Equation (6)). For the $b$th permutation ($b=1,\dots,B$), we construct the permuted variance-component test statistics $U^{(b)}(v)$ and then use Equation (7) to construct the permuted cluster-enhanced test statistics $T_r^{(b)}(v)$; $T^{(b)}(v)$ and $T^{(b)}$ follow Equations (8) and (9). Finally, we construct the empirical null distribution from the set $\{T^{(1)},\dots,T^{(B)}\}$ and select its $(1-\alpha)$th quantile as $t_\alpha$. Statistical significance is achieved at the $\alpha\times 100\%$ FWER level when $T>t_\alpha$.
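Given the observed $T(v)$ and permuted $T^{(b)}(v)$, the permutation-based global test and the localization rule of Section 2.5 can be summarized in a few lines (a sketch with our own naming, not the package code):

```python
import numpy as np

def clean_v_inference(T_obs, T_perm, alpha=0.05):
    """FWER-controlled global test and localization (sketch).
    T_obs : (V,) observed adaptive statistics T(v)
    T_perm: (B, V) permuted statistics T^(b)(v)"""
    null_max = T_perm.max(axis=1)                  # T^(b) = max_v T^(b)(v)
    t_alpha = float(np.quantile(null_max, 1 - alpha))
    reject_global = bool(T_obs.max() > t_alpha)    # global test of H0
    significant = np.where(T_obs > t_alpha)[0]     # localization (Section 2.5)
    return reject_global, significant, t_alpha
```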

### 2.5 Localizing areas of significance

The set of vertices $v$ with statistically significant $\theta^2(v)>0$ is identified as those whose $T(v)$ exceeds $t_\alpha$.

### 2.6 Notes on the computational efficiency

Considering the complicated covariance structure implied by Equation (1), CLEAN-V reduces the computational cost dramatically because it uses score-based test statistics that require estimating $\sigma^2$, $\tau^2$, and $\varphi$ only. Moreover, the permutation does not require re-estimating $\sigma^2$, $\tau^2$, and $\varphi$; it requires only matrix multiplications from the null model. We note that other methods, such as the likelihood ratio test or the Wald test used by Ganjgahi et al. (2015), would require the estimation of $\theta^2(1),\dots,\theta^2(V),\sigma^2,\tau^2,\varphi$ in the original data and in each permutation when extended to spatial autocorrelation modeling, which would be computationally infeasible. In addition, in CLEAN-V, the cost of estimating $\sigma^2$, $\tau^2$, and $\varphi$ under the null model is reduced by using a method-of-moments approach (Equation (6); Zou et al., 2017), and the computation of $\hat\Sigma^{-1}$ for large $V$ is accelerated by the nearest-neighbor Gaussian process approximation (Datta et al., 2016).

## 3 Data analysis and data-driven simulations

### 3.1 Data preparation

We collected task-fMRI data from the Human Connectome Project (1200 Subjects Data Release). We focus on five experimental tasks: relational processing, emotional processing, social cognition, language processing, and gambling decision-making. The specific contrasts used for each task are summarized in Table 1; Barch et al. (2013) provide a more detailed description of each task. More than 1000 subjects in total completed the tasks, and among these, 44 subjects completed the tasks twice. Additionally, there are 171 twin pairs ($N=342$ images), of which 68 pairs are dizygotic and 103 pairs are monozygotic; these include 80 and 116 female twins, respectively. The average ages of the dizygotic and monozygotic twins are 29.16 (SD: 3.42) and 29.29 (SD: 3.37) years, respectively. The task-fMRI data were processed through the HCP’s minimal surface processing pipeline with 2 mm surface smoothing (Glasser et al., 2013). We used the R package ciftiTools (Pham et al., 2022) to resample each hemisphere from approximately 32000 vertices to 10242 vertices, resulting in 9394 and 9398 cortical vertices in the left and right hemispheres, respectively, after excluding 848 and 844 medial-wall vertices in each hemisphere.

**Table 1.** Tasks and contrasts.

| Task | Contrast |
|---|---|
| Emotional processing | Face vs. shape |
| Social cognition | Social vs. random |
| Relational processing | Relational vs. match |
| Language | Story vs. math |
| Gambling | Reward vs. punishment |


### 3.2 Competitors

In our data analysis and simulations, we compared CLEAN-V to three competitors, all of which can be considered simplified versions of CLEAN-V. Specifically,

- **CLEAN-V without spatial correlation**: constructs cluster-enhanced test statistics $T(v)$ based on the model without spatial autocorrelation (Equation (3)). For a fair comparison, we used the same neighborhood sets as CLEAN-V to construct cluster-enhanced test statistics.
- **CLEAN-V without cluster enhancement**: models the spatial autocorrelation of the noise following Equation (1) but does not construct cluster-enhanced statistics. It is equivalent to using only $r=0$ for $T_r(v)$ (i.e., the vertex itself).
- **Massive univariate analysis**: neither models spatial autocorrelation nor constructs cluster-enhanced test statistics, which is equivalent to MEGHA (Ge et al., 2015).

All of these competitors used the same permutation step outlined in Section 2.4. We included *CLEAN-V without spatial correlation* and *CLEAN-V without cluster enhancement* to separately examine the effect of modeling the spatial structure of features and the impact of the proposed cluster enhancement.

### 3.3 Data analysis

#### 3.3.1 Test-retest reliability and heritability for five different cognition tasks

We first fitted CLEAN-V to (i) the 44 subjects with test-retest images to obtain the test-retest reliability map and (ii) the 171 twin pairs to obtain the heritability map, with the FWER controlled at 5%. These maps are shown in Figure 3(a). CLEAN-V detected wide areas of statistically significant test-retest reliability on the brain surface for most of the tasks, although the areas of significance differed across tasks. The language task showed strong evidence of test-retest reliability, whereas the gambling task showed only tiny and sparse areas of reliability. The five tasks did share a common feature: the precentral and postcentral gyri showed low or no test-retest reliability.

We also provide estimates of test-retest reliability and narrow-sense heritability (Fig. 3(b)) for comparison. The effect sizes computed by the massive univariate method were small and task-specific, as expected. However, CLEAN-V detected test-retest reliability and narrow-sense heritability with high significance even when the estimates were small, and the detected regions corresponded well to the effect size estimates. Also, by qualitatively comparing the test-retest reliability maps to the activation maps (Cohen’s $d$; Fig. 3(c)), we found a strong correspondence between the strength of reliability and activation (Hawco et al., 2021). Specifically, the language task showed a strong activation map, which agrees with our results: the strongest evidence of test-retest reliability came from the language task. The weak activation in the gambling task also agrees with its test-retest reliability map from CLEAN-V.

#### 3.3.2 Localization results from CLEAN-V and its competitors

Figure 4 shows the binarized localization results of CLEAN-V and its competitors for both test-retest reliability and narrow-sense heritability in the emotional processing task. CLEAN-V without cluster enhancement produced almost the same result as the massive univariate analysis, with a very limited number of vertices detected. Compared to these two methods, CLEAN-V without spatial correlation detected notably more regions, which shows that the proposed cluster enhancement improves signal detection. CLEAN-V detected most of the areas for both reliability and heritability, which validates the benefits of combining cluster-enhanced test statistics with spatial autocorrelation modeling. The analysis results for the other tasks are provided in Supplementary Figures S1 and S2 in Supplementary Material C; the performances were analogous to the emotional processing task, with CLEAN-V the most sensitive method and massive univariate analysis the least sensitive.

#### 3.3.3 Comparisons to massive univariate analysis with smoothed images

In this section, we also conducted massive univariate analyses (MEGHA) with different levels of surface smoothing (5 mm, 8 mm, 10 mm). We compared them to CLEAN-V applied to the original data with $r_{\max}=0$ mm, 5 mm, 8 mm, and 10 mm, although we acknowledge that the smoothing kernel width in massive univariate analysis and $r_{\max}$ in CLEAN-V are not directly comparable. From Figure 5, massive univariate analysis applied to presmoothed data increased the number of detected vertices, and the regions detected by CLEAN-V also grew consistently as $r_{\max}$ increased. At the same time, even a small $r_{\max}$ (e.g., 5 mm) appeared to outperform massive univariate analyses at all smoothing levels. Since Gaussian smoothing typically involves larger support regions, these results demonstrate that CLEAN-V, which uses fixed radii to aggregate test statistics within a bounded spatial domain, is more effective in localizing signal regions of test-retest reliability or heritability, without relying on presmoothed images.

A cautionary note is warranted. Both image smoothing and cluster enhancement involve a tradeoff between sensitivity and specificity. Therefore, just as choosing the optimal smoothing level is an open question, choosing an optimal $r_{\max}$ in CLEAN-V (as in CLEAN and CLEAN-R) is an open question that should be addressed carefully based on expected signal-to-noise ratios, sample sizes, and research purposes.

### 3.4 Data-driven simulations

#### 3.4.1 Setup

Motivated by existing efforts in the neuroimaging literature (Geuter et al., 2018; Park & Fiecas, 2022; Weinstein et al., 2021, 2022), we used task-fMRI data from the HCP to design data-driven simulations that evaluate the performance of CLEAN-V against the competitors. This aligns with the approach of Eklund et al. (2016), which used real data (resting-state fMRI) to evaluate the false positive rates of different statistical methods for task-fMRI. This approach is also particularly useful for evaluating the robustness of CLEAN-V under potential model misspecification (in the covariance structure), because CLEAN-V assumes stationarity and isotropy of the underlying spatial autocorrelation through the exponential SACF, which may not hold in real data. For our data-driven simulations, we used images from the emotional processing task to construct “null” data and “signal” data with different signal-to-noise ratios (SNRs), as described below.

#### 3.4.2 Simulating null data

The null data consist of independent images only. Among all subjects in the HCP, we considered only test images from non-twin subjects to construct “pseudo” test-retest pairs or twin pairs. For example, a set of test-retest images for a subject was generated by randomly sampling two test images from different, independent (non-twin) subjects. Since there are no true test-retest pairs or twins in the null data, we expect no test-retest reliability or narrow-sense heritability.

#### 3.4.3 Simulating test-retest data with different signal-to-noise ratios

We fixed the number of subjects at 20 ($N=40$ images) throughout the simulation studies for test-retest reliability, and we varied the proportion of subjects whose test and retest images were sampled (without replacement) from the 44 subjects available from the HCP. The remaining subjects’ images were sampled from test images of different subjects from the HCP. We expect a stronger SNR when more true test-retest pairs are included. These evaluations are shown in Section 3.5.1.
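The resampling scheme above can be sketched as follows; the function and pool names are hypothetical, and image data are represented only by (scan, subject) index pairs rather than actual images.

```python
import numpy as np

def simulate_pairs(n_subjects, prop_true, true_pool, indep_pool, rng):
    """Data-driven simulation sketch: a proportion prop_true of the subjects
    get a true (test, retest) pair drawn without replacement from true_pool;
    the rest are 'pseudo' pairs of test images from two different
    independent subjects drawn from indep_pool."""
    n_true = int(round(prop_true * n_subjects))
    true_ids = rng.choice(true_pool, size=n_true, replace=False)
    pairs = [(("test", s), ("retest", s)) for s in true_ids]
    indep_ids = rng.choice(indep_pool, size=2 * (n_subjects - n_true), replace=False)
    for a, b in zip(indep_ids[::2], indep_ids[1::2]):
        pairs.append((("test", a), ("test", b)))  # pseudo pair: no true reliability
    return pairs
```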

#### 3.4.4 Simulating twins data with different signal-to-noise ratios

The simulation process for the twins data is the same as for the test-retest data except for the larger sample size: we fixed the number of twin pairs at 60 ($N=120$ images) because the overall effect size of narrow-sense heritability is smaller than that of test-retest reliability. We varied the proportion of true twin pairs whose images were sampled from the 171 twin pairs; the remaining image pairs were sampled from different, unrelated subjects. These evaluations are also shown in Section 3.5.1.

### 3.5 Data-driven simulation results

In each scenario, we generated 1000 sets of simulated data to evaluate performances.

#### 3.5.1 Power analysis and family-wise error rate

Figure 6(a) shows the empirical FWER, demonstrating that CLEAN-V and its competitors controlled the FWER accurately at 0.05 under different values of $r_{\max}$ for testing both test-retest reliability and narrow-sense heritability. We also evaluated the score test of Ganjgahi et al. (2015), which showed an inflated FWER (Supplementary Material D); therefore, their method was not used for power comparisons.

Figure 6(b) shows the power of the four methods. The empirical power is calculated as the number of rejections of the null hypothesis divided by the number of simulations (1000). CLEAN-V’s power was uniformly higher than that of the competitors, and the power of CLEAN-V without spatial correlation was substantially higher than that of the massive univariate analysis. For test-retest reliability, however, CLEAN-V without cluster enhancement showed power similar to the massive univariate analysis, and even lower power when the signal-to-noise ratio was high. Notably, CLEAN-V achieved strong power (80%) with only 3 test-retest pairs, and nearly 100% power with 5 pairs. For narrow-sense heritability, CLEAN-V also achieved the highest power: it reached 80% power with 35 twin pairs and 100% power with 40 twin pairs.

#### 3.5.2 Reproducibility

We computed vertex-level reproducibility from the simulation results. Specifically, for each vertex, we calculated the proportion of the 1000 simulations in which the vertex was identified. Figure 7(a) shows the reproducibility maps for three settings for both test-retest reliability and narrow-sense heritability. CLEAN-V clearly outperformed its competitors under all settings. CLEAN-V and CLEAN-V without spatial correlation identified substantial signal regions for test-retest reliability when sufficient test-retest pairs and twins were included, with CLEAN-V better than CLEAN-V without spatial correlation in all cases. Regardless of the number of test-retest pairs and twin pairs included in the simulated samples, massive univariate analysis and CLEAN-V without cluster enhancement rarely detected the signals of test-retest reliability and heritability. Because the signal region of heritability for the emotional task is relatively small, both CLEAN-V and CLEAN-V without spatial correlation identified small signal regions; nevertheless, CLEAN-V identified more signal regions with higher reproducibility.

Figure 7(b) shows the proportion of vertices whose reproducibility exceeded 80% or 50% for each approach. CLEAN-V outperformed the competitors in all scenarios. When 20 test-retest pairs were included for the test-retest reliability analysis (all test-retest data), 68% and 77% of vertices from CLEAN-V exceeded 80% and 50% reproducibility, respectively, better than CLEAN-V without spatial correlation (48% and 61%) and considerably higher than CLEAN-V without cluster enhancement and massive univariate analysis (0% and 0.1% for both). For the narrow-sense heritability analysis with 60 twin pairs, 9% and 5% of vertices from CLEAN-V exceeded 50% and 80% reproducibility, respectively, again notably higher than CLEAN-V without spatial correlation (5% and 3%) and the other two methods (0% for both). CLEAN-V also significantly outperformed the others in the remaining settings.
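The two reproducibility summaries above (the per-vertex map and the proportion of vertices exceeding a threshold) can be sketched from a simulations-by-vertices detection matrix; the function names here are illustrative, not from the CLEAN package:

```python
import numpy as np

def reproducibility_map(detections):
    """Per-vertex proportion of simulations in which the vertex was
    identified. `detections` is a boolean array of shape
    (n_simulations, n_vertices)."""
    return np.asarray(detections, dtype=float).mean(axis=0)

def prop_vertices_above(repro, threshold):
    """Fraction of vertices whose reproducibility exceeds `threshold`."""
    return float((np.asarray(repro) > threshold).mean())
```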

## 4 Discussion

In this paper, we proposed a new method, CLEAN-V, for statistical inference and localization of test-retest reliability and narrow-sense heritability in task-based fMRI. Our method directly adjusts for global spatial autocorrelation, constructs spatially enhanced test statistics in a locally powerful way, and controls the FWER efficiently by permutation. We showed that CLEAN-V controls the family-wise error rate while achieving high power.

Our results showed that both cluster enhancement and spatial autocorrelation modeling are necessary to increase sensitivity in testing variance components, consistent with previous findings (Park & Fiecas, 2022; Weinstein et al., 2022). This was supported by comparing CLEAN-V to simpler methods: massive univariate analysis, CLEAN-V without spatial correlation, and CLEAN-V without cluster enhancement. The comparisons showed no power improvement when only spatial dependence is modeled (without cluster enhancement), whereas cluster enhancement with univariate statistics increased statistical power by adaptively pooling information from neighboring vertices. Combining the two modeling strategies (cluster enhancement and spatial autocorrelation modeling), CLEAN-V reliably detected test-retest reliability even when the proportion of test-retest data in the sample was very small (15%), and achieved nearly perfect power when the proportion increased to 25%. Considering that the number of images used in test-retest reliability or heritability studies is usually small $(N<100)$, CLEAN-V provides a powerful and sensitive approach that partially alleviates such concerns.

In the analysis of five experimental tasks in the HCP task-fMRI data, CLEAN-V distinguished the strength of test-retest reliability and narrow-sense heritability across the brain surface and across tasks. We also found a strong relationship between test-retest reliability and activation by comparing CLEAN-V’s test-retest maps to activation maps. CLEAN-V thus offers a way to explore the correspondence between reliable regions and activated regions.

A few points merit further discussion from the perspectives of methodology and implementation. First, by combining the score-based variance component test with permutations, CLEAN-V is much more computationally efficient than classical likelihood-based tests: constructing the test statistics and finding the FWER-controlled threshold require fitting the null model only once instead of refitting numerous models. This computational efficiency speaks to CLEAN-V’s practical utility. Second, the methodological developments in this paper rely on large-sample theory (i.e., asymptotics), although our data analyses show that CLEAN-V performs well in small samples. Further methodological research establishing the joint central limit theorem of the test statistics (under both the null and alternative hypotheses) would strengthen the validity of CLEAN-V; we leave this as future work.
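The permutation step above follows the standard max-statistic recipe for FWER control: permute the data under the null, record the maximum vertex-level statistic per permutation, and take the $(1-\alpha)$ quantile of those maxima as the significance threshold. A generic sketch (not the CLEAN-V implementation; `stat_fn` is a user-supplied function returning vertex-level statistics):

```python
import numpy as np

def fwer_threshold(stat_fn, data, n_perm=5000, alpha=0.05, seed=None):
    """Max-statistic permutation threshold controlling the FWER.

    For each permutation, rows of `data` are shuffled (exchangeable
    under the null) and the familywise maximum of the vertex-level
    statistics is recorded; the (1 - alpha) quantile of these maxima
    is the threshold. No model refitting is needed per permutation."""
    rng = np.random.default_rng(seed)
    max_null = np.empty(n_perm)
    for b in range(n_perm):
        perm = rng.permutation(data)          # shuffle rows under the null
        max_null[b] = np.max(stat_fn(perm))   # familywise maximum
    return float(np.quantile(max_null, 1.0 - alpha))
```

A vertex is then declared significant whenever its observed statistic exceeds this single threshold, which controls the probability of any false positive across vertices.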

This paper focused primarily on analyzing task-fMRI data, but CLEAN-V applies broadly to many neuroimaging data types (e.g., cortical thickness) where “borrowing” information across spatial domains is expected to improve power. A comprehensive exploratory analysis by Weinstein et al. (2022) also shows that an exponential spatial autocorrelation function (SACF) is reasonable for many imaging modalities mapped onto the cortical surface. While applying the same assumption to volumetric (3D) imaging data could be questioned, we showed that a simplified version of CLEAN (“CLEAN-V without spatial correlation”), which can be adopted easily, still yields a dramatic increase in power and sensitivity over massive univariate analysis.

Lastly, CLEAN-V provides useful insight into recent concerns about low test-retest reliability or heritability in neuroimaging studies. Although CLEAN-V does not provide effect size estimates, the fact that accounting for “spatial variations” increased power suggests that the current definitions of test-retest reliability and heritability in neuroimaging may be too simple to characterize all sources of variation, yielding a limited understanding of variance components and suboptimal estimation. For example, Risk and Zhu (2021) showed that spatial modeling of heritability provides better estimates than mass univariate analysis. Extending CLEAN-V from the “inference” problem to the “estimation” problem is worth further investigation, which we leave as future work.

## Data and Code Availability

Without parallel computing, running CLEAN-V on 342 images (each containing 10,000 vertices) with 5000 permutations took less than 10 minutes on a laptop, supporting the practical utility of the method. The software is publicly available at https://github.com/junjypark/CLEAN; further optimization with parallel computing is under development.

## Ethics Statement

Secondary data analysis of the Human Connectome Project (HCP) data used in this paper follows the HCP Open Access Data Use Terms and the HCP Restricted Data Use Terms.

## Author Contributions

R.P.: Methodology, Data preparation, Formal analysis, Validation, Visualization, Software, and Writing—original draft. E.W.D.: Data preparation, Validation. C.H.: Validation, Writing—review & editing. N.R.: Supervision, Writing—review & editing. A.N.V.: Supervision. J.Y.P.: Conceptualization, Methodology, Validation, Writing—review & editing, Funding acquisition, and Supervision.

## Declaration of Competing Interest

None.

## Acknowledgements

We would like to thank reviewers for their helpful comments and suggestions. E.W.D. receives funding from the Brain and Behavior Research Foundation Young Investigator Award (NARSAD), the Canadian Institutes of Health Research. C.H. receives research funding from the National Institute of Mental Health (NIMH), the Brain and Behavior Research Foundation, the University of Toronto, and the Centre for Addiction and Mental Health (CAMH) Foundation. N.R. receives funding from the Natural Sciences and Engineering Research Council of Canada (RGPIN-2020-05897). A.N.V. receives funding from the National Institute of Mental Health (1/3R01MH102324 & 1/5R01MH114970), Canadian Institutes of Health Research, Canada Foundation for Innovation, CAMH Foundation, and the University of Toronto. J.Y.P. receives funding from the Natural Sciences and Engineering Research Council of Canada (RGPIN-2022-04831), and the University of Toronto’s Data Science Institute, McLaughlin Centre, and Connaught Fund.

Data were provided by the Human Connectome Project, WU-Minn Consortium (Principal Investigators: David Van Essen and Kamil Ugurbil; 1U54MH091657) funded by the 16 NIH Institutes and Centers that support the NIH Blueprint for Neuroscience Research; and by the McDonnell Center for Systems Neuroscience at Washington University. This research was supported in part by NIH grants R01EB016061 and R01EB026549 from the National Institute of Biomedical Imaging and Bioengineering and R01 MH116026 from the National Institute of Mental Health.

## Supplementary Materials

Supplementary material for this article is available with the online version here: https://doi.org/10.1162/imag_a_00058.