Abstract
Being able to aggregate results from many acceptable data analysis pipelines (multiverse analyses) is a desirable feature in almost all aspects of imaging neuroscience. This is because multiple noise sources may contaminate the acquired imaging data, and different pipelines will attenuate or remove those noise sources differentially. Here, we used multiple preprocessing pipelines that are known to significantly impact the final results and conclusions of Positron Emission Tomography (PET) neuroimaging studies. We developed conceptual and practical tools for statistical analyses that aggregate pipeline results and a new sensitivity analysis for testing hypotheses across pipelines, such as “no effect across all pipelines” or “at least one pipeline with no effect”. The proposed framework is generic and can be applied to any multiverse scenario. Code to reproduce all analyses and figures is openly available, including a step-by-step tutorial, so other researchers can carry out their own multiverse analysis.
1 Introduction
Modern neuroimaging techniques have provided unique opportunities to measure complex signaling pathways in the living human brain, with the goal of identifying reliable biomarkers of disease states and treatment outcomes (Poldrack et al., 2020). Data arising from state-of-the-art neuroimaging techniques are, however, often contaminated with noise confounds such as motion-related artefacts, affecting both the spatial and temporal correlation structure of the data (Nørgaard, Ozenne, et al., 2019). Carefully designed preprocessing steps have been developed to remove unwanted noise sources, but in the absence of a “ground truth” it remains a major challenge to evaluate the impact of preprocessing choices on subsequent statistical analyses and results. Over time, preprocessing pipelines (i.e., sets of preprocessing steps) have become more complex and flexible, and this increase in researcher degrees of freedom (the exploration of which is termed multiverse analysis) has consistently been shown to affect the outcomes of neuroimaging studies (Botvinik-Nezer et al., 2020; Carp, 2012; Nørgaard, Ganz, et al., 2019). To date, the most common approach in the neuroimaging field is to use a single pipeline and not consider the heterogeneity of preprocessing choices. This approach not only ignores the multitude of possible results, but is also likely sub-optimal because the best pipeline is, more often than not, study-, population-, or even subject-dependent (Churchill et al., 2015; Nørgaard, Ganz, et al., 2019). More concerning, neuroscientists might be tempted to “tune” the pipeline in order to obtain the most satisfying results. This will generally lead to spurious and non-reproducible findings, since the variability induced by the choice of pipeline is then conditioned on the desired outcome. However, since it is neither realistic nor optimal to move toward a single unified preprocessing pipeline, there is an urgent need for a statistical framework allowing results from many preprocessing pipelines to be explored in a principled way.
The aim of this work is, thus, to provide a statistical framework that can aggregate the evidence from multiverse analyses to produce conclusions robust to the choice of the pipeline. More specifically, the present paper proposes a statistical sensitivity analysis providing:
- (i)
visualizations of the heterogeneity of several preprocessing pipelines
- (ii)
estimation of a global effect across all preprocessing pipelines
- (iii)
quantification of the proportion of pipelines with evidence for an effect
- (iv)
a statistical framework for testing hypotheses across pipelines, such as “no effect across all pipelines” or “at least one pipeline with no effect”
In this work, as an exemplary case, we use Positron Emission Tomography (PET) neuroimaging data to carry out a real-world multiverse analysis. However, our framework is generic and may be used in any multiverse scenario. The corresponding software is available as an R package called LMMstar, freely accessible on CRAN, and our GitHub repository contains the code and data needed to reproduce all simulations and figures. Finally, we also provide a step-by-step tutorial on GitHub so other researchers can carry out these analyses using their own data.
2 Materials and Experimental Settings
2.1 Data
We used two different data sources in our analyses: in silico and real data. For the in silico data, different noise structures were chosen to reflect how different configurations of pipelines may interact with the distribution of the data, and the sample size was varied to encompass small- to larger-scale clinical studies. In the real data analysis, we mimicked how real neuroimaging studies compare an intervention to a reference measurement. Pipelines were selected independently of the intervention data, based on the healthy/placebo arm of Frokjaer et al. (2015) and on the results of Nørgaard, Ganz, et al. (2019), which investigated the impact of preprocessing pipelines. The proposed sensitivity analysis was illustrated on the intervention arm of the study, where the follow-up value was compared to a reference value taken from a normative serotonergic atlas (Beliveau et al., 2017).
To simulate in silico data, we considered the simple case of a single brain measurement ($Y$), with a single binary exposure ($X$) following a Bernoulli distribution with parameter 0.5 (i.e., two balanced groups) and no covariates. Latent values for the brain measurement were simulated using a normal distribution with variance 1 and mean $\beta$ times the exposure value, where $\beta = 0$ (null hypothesis) or $\beta > 0$ (alternative hypothesis). The observed outcome was simulated for each pipeline by adding pipeline-specific noise to the latent value. This noise was simulated using a multivariate normal distribution with mean 0 and a variance-covariance structure depending on the scenario (see Supplementary Material A for details). In scenario 1, we simulated many pipelines ($P = 20$) with correlated homoscedastic noise; in scenario 2, a few pipelines ($P = 6$) with uncorrelated heteroscedastic noise; and in scenario 3, many pipelines ($P = 20$) with correlated heteroscedastic noise. Scenarios 2 and 3 included one pipeline with high signal-to-noise ratio (SNR), that is, low variance, and many pipelines with low SNR, that is, high variance. Additional simulations with non-normally distributed noise (scenarios 4 and 5) can be found in Supplementary Material E, investigating the impact of outliers (Student’s $t$-distributed noise) and skewed data (half-normal noise). The sample size was varied from $n = 10$ to $n = 500$ in each group, such that the smallest sample size was well below the number of parameters (20 mean parameters and 210 variance-covariance parameters) in scenarios 1 and 3, and the asymptotic regime was reached for the largest sample size. We generated 10,000 datasets per scenario and sample size—this provides sufficient precision about the mean, standard deviation, and rejection rate to neglect the Monte Carlo uncertainty. These data were used to assess how well the large-sample properties of the procedure (bias, variance, type 1 error control) carry over to finite samples.
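As a concrete illustration, the R sketch below generates scenario-1-style data (a correlated block of pipelines plus independent ones); the effect size `beta`, block size `n.block`, and correlation `rho` are illustrative assumptions, not the exact values of Supplementary Material A.

```r
library(mvtnorm) # for rmvnorm()

## Minimal sketch of the in silico data generation (scenario-1 style).
simData <- function(n, P = 20, n.block = 15, beta = 0.25, rho = 0.5) {
  X <- rep(0:1, each = n)                         # binary exposure, balanced groups
  latent <- rnorm(2 * n, mean = beta * X, sd = 1) # latent brain measurement
  Sigma <- diag(P)                                # pipeline-specific noise:
  Sigma[1:n.block, 1:n.block] <- rho              # ... correlated block
  diag(Sigma) <- 1                                # ... unit variances
  noise <- rmvnorm(2 * n, sigma = Sigma)
  list(Y = latent + noise, X = X)                 # (2n x P) outcomes + exposure
}
```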
To illustrate the use of the proposed sensitivity analysis on real data, we utilize neuroimaging results from a placebo-controlled, double-blinded, clinical study (Frokjaer et al., 2015). The study was approved by the ethics committee for the capital region of Copenhagen (protocol-ID: H-2-2010-108) and registered as a clinical trial on www.clinicaltrials.gov under the trial ID NCT02661789. All subjects provided written informed consent prior to participation, in accordance with the Declaration of Helsinki II. The aim of the study was to assess the association between the emergence of depressive symptoms and the change in cerebral serotonin transporter (SERT) availability following a hormonal treatment. Raw data are available from the CIMBI database (https://nru.dk/index.php?view=category&id=128) upon request under GDPR law, and derived data are available on GitHub. The data consist of structural Magnetic Resonance Imaging (MRI) and Positron Emission Tomography (PET) imaging data for healthy females who underwent a baseline scan, received either placebo or a GnRHa implant intervention, and participated in a follow-up scan. SERT availability estimates were modeled and extracted for each subject for 28 subcortical and cortical regions (see Nørgaard, Ganz, et al., 2019 for details) and averaged across hemispheres, producing a final sample of 14 regions per subject and pipeline. These regions (amygdala, thalamus, putamen, caudate, anterior cingulate cortex, hippocampus, orbital frontal cortex, superior frontal cortex, occipital cortex, superior temporal gyrus, insula, inferior temporal gyrus, parietal cortex, and entorhinal cortex) were chosen because they cover the entire brain, and many are target regions in published SERT PET studies. No covariates were considered.
2.2 Multiverse analyses of real data
Five preprocessing steps were used to analyze the data and estimate the SERT availability (outcome measure). These steps comprise motion correction (with/without), co-registration (4 options), delineation of volumes of interest (3 options), partial volume correction (4 options), and kinetic modeling for quantification of SERT availability (4 options: MRTM, SRTM, Non-invasive Logan, and MRTM2). More information about the preprocessing choices and software can be found in Nørgaard, Ganz, et al. (2019). The combination of individual preprocessing steps leads to 2 × 4 × 3 × 4 × 4 = 384 possible combinations.
2.3 Notation and assumptions
We now introduce the notation that will be used to describe the proposed sensitivity analysis (see Supplementary Material F for a summary). We are interested in relating brain measurements ($Y$) to exposures or treatments ($X$), accounting for covariates ($Z$). We consider a set of $P$ pipelines used to preprocess the neuroimaging data. For a given pipeline $p \in \{1, \dots, P\}$, we fit a statistical model with parameters $\Theta_p$ that relates $Y_p$ ($Y$ processed by pipeline $p$), $X$, and $Z$. We then obtain from this model an estimate of the effect of interest (denoted $\hat{\beta}_p$). In our real-life example, we use, for each pipeline, a paired t-test to compare the observed change in SERT availability to an atlas value, so $\hat{\beta}_p = \hat{\mu}_p$ with standard error $\hat{\sigma}_p / \sqrt{n}$ for $p \in \{1, \dots, P\}$, where $\hat{\mu}_p$ is the empirical mean and $\hat{\sigma}^2_p$ the empirical variance of the change in SERT availability (processed with pipeline $p$) between baseline and follow-up.
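In code, this per-pipeline analysis could look like the following sketch, where `dBP` is a hypothetical n × P matrix whose columns contain, for each pipeline, the per-subject changes in SERT availability relative to the atlas value:

```r
## One paired t-test per pipeline (column of dBP), collecting the
## estimate, standard error, and p-value for the downstream analysis.
fitPipelines <- function(dBP) {
  out <- t(apply(dBP, 2, function(y) {
    tt <- t.test(y)  # H0: mean change equal to 0
    c(estimate = unname(tt$estimate), se = unname(tt$stderr),
      p.value = tt$p.value)
  }))
  out
}
```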
We make the following working assumptions. First, the observed data correspond to independent and identically distributed replicates of a random vector $O$ gathering the brain measurements from all pipelines, the exposure, and the covariates. Second, we have chosen a set of reasonable pipelines, meaning that the estimated effects found in the follow-up statistical analysis will converge to the right value as the sample size increases. This set can include pipelines distorting the signal (e.g., adding a fixed value) if that has no consequence, asymptotically, on the estimated effect (the mean change is not biased, as the added value cancels out when subtracting the baseline value from the follow-up value). Finally, when considering asymptotic results, we will consider a fixed number of pipelines and let the sample size increase to infinity.
3 Proposed Sensitivity Analysis
To be able to draw conclusions across pipelines, we not only need the result of each pipeline but also some information about how they relate. If all pipelines were equally reliable and equally different, we would weight each pipeline equally. If there exists one independent pipeline and a block of correlated pipelines, all equally reliable, then we would weight the independent pipeline more than the others. By treating pipelines as black boxes, we can investigate their relation in terms of each observation’s influence on the estimated effects across pipelines, $\hat{\beta} = (\hat{\beta}_1, \dots, \hat{\beta}_P)$. This relation is fully characterized by the joint distribution of the effects. Once estimated, we can extract summaries of this distribution, for example, an average value, and carry out statistical tests, for example, testing the compatibility between the observations and the joint distribution that would have been observed under a specific hypothesis.
3.1 Estimating the joint distribution across pipelines
The joint distribution could be obtained using a multivariate model, for example, modeling data from all pipelines at once using a mixed model. Because of the complexity of the dependency among pipelines, this is, however, rarely feasible with the available sample size. Instead, and this matches common practice, we perform the same analysis separately for each pipeline and obtain a vector of estimated associations $\hat{\beta} = (\hat{\beta}_1, \dots, \hat{\beta}_P)$ with their standard errors. Using tools from semi-parametric theory (see Kennedy, 2017 and Tsiatis, 2006 for more details), we can approximate the influence of each observation $O_i$ on the estimate by a random variable called the influence function, denoted $\mathrm{IF}_p$ for pipeline $p$, and satisfying:
$$\hat{\beta}_p = \beta_p + \frac{1}{n} \sum_{i=1}^{n} \mathrm{IF}_p(O_i) + \varepsilon_{p,n},$$
where $\varepsilon_{p,n}$ denotes a residual term that converges toward zero in probability (faster than $1/\sqrt{n}$) as the sample size tends to infinity. As shown in Supplementary Material B, $\mathrm{IF}_p$ has a simple expression when $\hat{\beta}_p$ is an empirical mean or an element of a maximum likelihood (ML) estimator. Since this decomposition applies to all pipelines, we get from the multivariate central limit theorem that the joint distribution of the estimates is asymptotically multivariate normal. It has mean $\beta = (\beta_1, \dots, \beta_P)$ and its variance-covariance matrix, denoted $\Sigma$, is the same as that of $(\mathrm{IF}_1, \dots, \mathrm{IF}_P)$ divided by $n$. Note that with a limited number of observations, typically when $n \leq P$, the estimated variance-covariance matrix based on the estimated influence functions is not guaranteed to be positive definite.
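For the paired t-test example, where each $\hat{\beta}_p$ is an empirical mean, the influence function is simply the centered observation, and $\hat{\Sigma}$ reduces to the empirical covariance of the data divided by $n$. A minimal sketch, reusing the hypothetical `dBP` matrix from above:

```r
## Joint distribution of the pipeline estimates via influence functions.
jointDist <- function(dBP) {
  n <- nrow(dBP)
  beta.hat <- colMeans(dBP)              # per-pipeline estimates
  IF <- sweep(dBP, 2, beta.hat)          # estimated influence functions
  Sigma <- crossprod(IF) / ((n - 1) * n) # variance-covariance of beta.hat
  list(estimate = beta.hat, Sigma = Sigma)
}
```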
3.2 Testing hypotheses across pipelines
The global null hypothesis “no effect across all pipelines” can be tested using a max-test approach: more extreme realizations correspond to larger values of the maximum statistic $t^{\max} = \max_{p \in \{1, \dots, P\}} |t_p|$, where $|\cdot|$ denotes the absolute value and $t_p = \hat{\beta}_p / \hat{\sigma}_{\hat{\beta}_p}$. A p-value may, therefore, be computed by integrating the joint density under the null hypothesis outside of the domain $[-t^{\max}; t^{\max}]^{\otimes P}$ (see Supplementary Fig. B in Supplementary Material C). Here, we use the notation $[a; b]^{\otimes P}$ to represent the Cartesian product of $P$ copies of the interval $[a; b]$. The value $c_{1-\alpha}$, such that the integral outside $[-c_{1-\alpha}; c_{1-\alpha}]^{\otimes P}$ equals $\alpha$, provides a critical threshold for the estimated test statistics. This threshold can also be used to derive confidence intervals.
The null hypothesis “at least one pipeline with no effect” corresponds to an intersection-union test. As such, it can be rejected at level $\alpha$ if and only if all the un-adjusted p-values relative to each pipeline are below $\alpha$, or, equivalently, if the largest un-adjusted p-value is below $\alpha$. A sketch of both tests is given below.
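The following sketch implements both tests under the asymptotic normal approximation, using the mvtnorm package to integrate the joint Gaussian density (the `estimate`/`Sigma` inputs match the output of the hypothetical `jointDist()` above):

```r
library(mvtnorm) # pmvnorm(), qmvnorm()

## Max-test of "no effect across all pipelines".
maxTest <- function(estimate, Sigma, alpha = 0.05) {
  P <- length(estimate)
  t.stat <- estimate / sqrt(diag(Sigma))
  R <- cov2cor(Sigma)            # correlation between the t-statistics
  t.max <- max(abs(t.stat))
  ## p-value: probability mass outside [-t.max, t.max]^P under the null
  p.global <- 1 - pmvnorm(lower = rep(-t.max, P), upper = rep(t.max, P),
                          sigma = R)
  ## critical threshold: mass outside [-c, c]^P equals alpha
  c.alpha <- qmvnorm(1 - alpha, tail = "both.tails", sigma = R)$quantile
  list(p.value = as.numeric(p.global), critical = c.alpha)
}

## Intersection-union test of "at least one pipeline with no effect":
## reject iff even the largest un-adjusted p-value is below alpha.
iuTest <- function(p.values, alpha = 0.05) max(p.values) < alpha
```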
The proportion of pipelines where there is evidence for an effect can be estimated as
$$\hat{\pi} = \frac{1}{P} \sum_{p=1}^{P} \mathbb{1}\left( |t_p| > z_{1-\alpha/2} \right),$$
where $\mathbb{1}$ denotes the indicator function and $z_{1-\alpha/2}$ the quantile used for statistical significance. One drawback of this non-parametric estimator is that it is a non-smooth function of the test statistics, making the associated uncertainty difficult to evaluate. Instead, one can use a parametric approach, assuming normally distributed test statistics:
$$\tilde{\pi} = \frac{1}{P} \sum_{p=1}^{P} \left[ 1 - \Phi\left( z_{1-\alpha/2} - t_p \right) + \Phi\left( -z_{1-\alpha/2} - t_p \right) \right],$$
which is a smooth (but complex) function of the model parameters. Here, $\Phi$ refers to the cumulative distribution function of a standard normal distribution. The uncertainty about the estimator $\tilde{\pi}$ can, therefore, be derived using a non-parametric bootstrap or a delta method, viewing $\tilde{\pi}$ as a function of $\hat{\Theta} = (\hat{\Theta}_1, \dots, \hat{\Theta}_P)$, the set of parameters of the statistical models across pipelines.
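A minimal sketch of the two proportion estimators, assuming the equations above (the exact form used by LMMstar may differ):

```r
## Proportion of pipelines with evidence for an effect.
propPipelines <- function(estimate, Sigma, alpha = 0.05) {
  t.stat <- estimate / sqrt(diag(Sigma))
  z <- qnorm(1 - alpha / 2)
  c(indicator = mean(abs(t.stat) > z),                            # non-smooth
    smooth = mean(1 - pnorm(z - t.stat) + pnorm(-z - t.stat)))    # parametric
}
```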
3.3 Visualizing the heterogeneity across pipelines
In section 3.1, it was stated that the estimated associations are, asymptotically, normally distributed. They can, therefore, be summarized by their expectation and variance-covariance matrix (i.e., standard errors and correlation matrix). In this work, we suggest two graphical displays to visualize the heterogeneity of the results across pipelines:
- A heatmap of the estimated correlation among estimates, obtained by converting $\hat{\Sigma}$ into a correlation matrix. Re-ordering the pipelines may be useful to better visualize blocks of pipelines that are particularly correlated; this can, for instance, be performed by converting the correlation matrix to a dissimilarity matrix and then using hierarchical clustering.
- A forest plot displaying, for each pipeline, $\hat{\beta}_p$ and its confidence interval, possibly using the previously established re-ordering (see the sketch below).
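A base-R sketch of both displays, reusing the hypothetical `dBP` matrix and `jointDist()` sketch from above:

```r
fit <- jointDist(dBP)
R <- cov2cor(fit$Sigma)                 # correlation between estimates

## Heatmap with hierarchical-clustering re-ordering.
ord <- hclust(as.dist(1 - R))$order     # dissimilarity = 1 - correlation
image(R[ord, ord], zlim = c(-1, 1), main = "Correlation between pipelines")

## Forest plot: estimates with 95% confidence intervals.
se <- sqrt(diag(fit$Sigma))
plot(fit$estimate[ord], seq_along(ord), xlab = "Estimate", ylab = "Pipeline")
segments(fit$estimate[ord] - qnorm(0.975) * se[ord], seq_along(ord),
         fit$estimate[ord] + qnorm(0.975) * se[ord], seq_along(ord))
```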
3.4 Estimating a global effect across pipelines
Several methods can be used for estimating a global effect across pipelines. A naive method would be to compute the mean of the estimated associations:
$$\hat{\beta}^{\text{average}} = \frac{1}{P} \sum_{p=1}^{P} \hat{\beta}_p.$$
This estimator will, however, not be efficient if some pipelines lead to more precise estimates, that is, if the diagonal elements of $\Sigma$ are not all equal. Intuitively, we would like to pool the estimates with weights inversely proportional to the standard errors, such that we put more weight on precise estimates:
$$\hat{\beta}^{\text{pool-se}} = \sum_{p=1}^{P} \frac{1 / \hat{\sigma}_{\hat{\beta}_p}}{\sum_{q=1}^{P} 1 / \hat{\sigma}_{\hat{\beta}_q}} \, \hat{\beta}_p.$$
However, we also need to take into account the correlation between estimates. Indeed, perfectly correlated estimates should weigh as if there was only one estimate. To do so, we can use the following GLS estimator of the global effect:
$$\hat{\beta}^{\text{GLS}} = \frac{\mathbb{1}^{\top} \hat{\Sigma}^{-1} \hat{\beta}}{\mathbb{1}^{\top} \hat{\Sigma}^{-1} \mathbb{1}}, \quad (1)$$
where $\mathbb{1}$ is a column vector filled with ones and $\hat{\beta} = (\hat{\beta}_1, \dots, \hat{\beta}_P)^{\top}$. Indeed, equation 1 can be shown to be equivalent to performing a spectral decomposition of $\hat{\Sigma}$ and using the eigenvectors to combine the estimates into independent components that can be pooled according to weights proportional to the eigenvalues (Supplementary Material D1). This is used when $\hat{\Sigma}$ is singular to compute $\hat{\beta}^{\text{GLS}}$, by restricting the spectral decomposition to the eigenvalues above a given threshold. In the simple case where the brain measurements are jointly normally distributed, the exposure is binary, and there are no missing data, the GLS estimator can be shown to be asymptotically efficient (Supplementary Material D2).
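A minimal sketch of the three pooled estimators (average, pool-se, and GLS), assuming a well-conditioned `Sigma` (the eigenvalue-truncation fallback for singular matrices is omitted):

```r
poolEstimators <- function(estimate, Sigma) {
  P <- length(estimate)
  se <- sqrt(diag(Sigma))
  w.se <- (1 / se) / sum(1 / se)        # weights inversely prop. to the SEs
  iS1 <- solve(Sigma, rep(1, P))        # Sigma^{-1} %*% 1
  w.gls <- iS1 / sum(iS1)               # GLS weights (sum to 1)
  c(average = mean(estimate),
    pool.se = sum(w.se * estimate),
    gls     = sum(w.gls * estimate),
    gls.se  = sqrt(1 / sum(iS1)))       # SE of the GLS estimate
}
```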
Cox et al. (2006) studied a similar estimator and found that, under unequal variances, the global estimate can fall outside of the range of the (pipeline-specific) estimates. As a remedy, we propose a constrained GLS estimator, denoted $\hat{\beta}^{\text{GLS-con}}$, which, in addition to having weights that sum up to 1, constrains the weight of each estimate to be at most 1 in absolute value:
$$w^{\text{GLS-con}} = \lambda \, w^{\text{GLS}} + (1 - \lambda) \, \frac{\mathbb{1}}{P},$$
where $\lambda \in [0, 1]$ is chosen to satisfy the constraint (Supplementary Material D3). These constrained weights can be seen as a regularization of the GLS weights (first term) toward the weights of the average (second term) with regularization coefficient $\lambda$. This regularization coefficient is 1 when the largest GLS weight is at most 1 in absolute value ($\max_p |w_p^{\text{GLS}}| \leq 1$) and decreases toward 0 as the largest weight increases toward infinity.
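A sketch of the constrained weights, solving for λ numerically under the assumption of the regularization form above:

```r
## Constrained GLS: shrink the GLS weights toward 1/P until no weight
## exceeds 1 in absolute value.
poolConstrainedGLS <- function(estimate, Sigma) {
  P <- length(estimate)
  iS1 <- solve(Sigma, rep(1, P))
  w.gls <- iS1 / sum(iS1)
  excess <- function(l) max(abs(l * w.gls + (1 - l) / P)) - 1
  lambda <- if (excess(1) <= 0) 1 else uniroot(excess, c(0, 1))$root
  w <- lambda * w.gls + (1 - lambda) / P
  c(estimate = sum(w * estimate),
    se = sqrt(drop(t(w) %*% Sigma %*% w)))  # neglects weight uncertainty
}
```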
Note that if some of the pipelines induce bias, the previous approaches will propagate this bias and therefore be unsatisfactory. Systematic differences between pipelines can be investigated by comparing the estimates between pipelines, for example, computing $\hat{\beta}_p - \hat{\beta}_q$, and using $\hat{\Sigma}$ to obtain the corresponding uncertainty: $\hat{\sigma}^2_{\hat{\beta}_p - \hat{\beta}_q} = \hat{\Sigma}_{pp} + \hat{\Sigma}_{qq} - 2 \hat{\Sigma}_{pq}$.
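For instance, a two-pipeline contrast and its standard error could be computed as follows (indices `p` and `q` are illustrative):

```r
## Difference between two pipeline estimates and its standard error.
pipelineDiff <- function(estimate, Sigma, p, q) {
  d <- estimate[p] - estimate[q]
  se <- sqrt(Sigma[p, p] + Sigma[q, q] - 2 * Sigma[p, q])
  c(difference = d, se = se, z = d / se)
}
```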
4 Results
4.1 Simulated multiverse analysis for normally distributed data
We compare the performance of the proposed estimators when using unbiased pipelines on simulated data. For each simulated dataset, scenario, and sample size, we computed the four previously described estimators of the global effect: the average, pool-se, GLS, and constrained GLS estimators. P-values for the pool-se, GLS, and constrained GLS estimators were computed neglecting the uncertainty of the weights.
4.1.1 Weights (global effect)
Based on the variance-covariance structure of each scenario, we can compute how the average, pool-se, and GLS (or constrained GLS) estimators would weight the results from each pipeline if the variance-covariance matrix of the pipeline estimates were known (Fig. 1). In scenario 1, the average and pool-se estimators would provide equal weight to all pipelines, with a weight of 5%, while the GLS estimator would weight the correlated pipelines by 1.88% and the uncorrelated pipelines by 14.37% (in this paragraph, all weights are rounded for readability). In scenario 2, the average estimator weights all pipelines equally by 16.67%, while the precision-based estimators would use the following weights: 18.75%, 52.51%, 10.94%, 7.72%, 5.97%, and 4.10%, favoring the high-SNR pipeline. In scenario 3, the average estimator would weight all pipelines equally by 5%; the pool-se estimator would weight each correlated pipeline by 5.17% and the remaining pipelines by 14.48% (high-SNR pipeline), 3.02%, 2.13%, 1.65%, and 1.13%; while the GLS estimator would weight the correlated pipelines by 1.65% and the remaining pipelines by 48.61% (high-SNR pipeline), 10.13%, 7.15%, 5.52%, and 3.80%.
Fig. 1. Large-sample weights used by the average, pool-se, and GLS estimators (GLS or constrained GLS) to combine the pipeline-specific estimates in scenario 3. Weights relative to the correlated pipelines are shown in shades of gray (first 15 blocks); weights relative to the independent pipelines are shown using rainbow colors (last 5 blocks). In the legend, the variance of a given pipeline effect is indicated in parentheses.
4.1.2 Point estimate (global effect)
The average estimate of the four global effect estimators (average, pool-se, GLS, constrained GLS) was close to the true value for all sample sizes and all scenarios; the largest discrepancy was 0.015 (n=10 per group, scenario 3, alternative hypothesis).
4.1.3 Point estimate (proportion)
Under the null, the estimated proportion of pipelines with evidence for an effect ranged between 5.35% and 6.79% based on the usual (indicator-based) estimator and between 0.50% and 1.71% based on the proposed (smooth) estimator. The former showed no obvious trend with sample size, whereas the latter showed a small decrease with the sample size (upper panel of Supplementary Fig. C). Under the alternative, the estimated proportion of pipelines dramatically increased with the sample size for both estimators (upper panel of Fig. 3): typically from 10% or below with n=10 per group to 99% in scenario 1, 65% in scenario 2, and around 80% in scenario 3 with n=500 per group. The usual estimator exhibited a smaller value at low sample sizes but a steeper increase compared to the proposed estimator.
4.1.4 Variance (global effect)
Similar results were obtained under the null and the alternative hypothesis, so we will only discuss the former case (upper panels of Fig. 2). In scenario 1, the average and pool-se estimators showed similar variability, whereas the variability of the GLS estimator was higher in very small samples (standard deviation: +156% and +11% for the two smallest sample sizes) and lower for larger sample sizes (e.g., -11% for n=500). In scenario 2, the pool-se and GLS estimators showed similar variability (slightly higher for the GLS estimator in low sample sizes and slightly lower afterward), both smaller than the variability of the average estimator, for example, -9.93% for pool-se and -24.86% for GLS in large samples. Results in scenario 3 were similar to scenario 1, up to a larger decrease in variability in large samples for the GLS estimator (-27.75% for n=500). The variability of the constrained GLS estimator was similar to that of the GLS estimator, except in small samples where it was smaller and closer to that of the average estimator.
Fig. 2. Upper panels: empirical standard deviation of the estimated common effect for each estimator, scenario, and sample size under the null hypothesis. For large samples, the green line (GLS) is covered by the blue line (constrained GLS) in all scenarios. In scenarios 1 and 3, the red line (average) is covered by the yellow line (pool-se). In scenario 1, the GLS estimator has the highest standard deviation in small samples, and the GLS and constrained GLS estimators have the lowest standard deviation in large samples (approx. 12% lower compared to the average). Lower panels: rejection rate under the null hypothesis (i.e., type 1 error). Shaded areas represent the Monte Carlo uncertainty.
4.1.5 Variance (proportion)
The proposed estimator had a lower variability compared to the usual estimator in all scenarios and across sample sizes under the alternative hypothesis (except in scenario 1 in small samples): the reduction in standard deviation ranged from 5.67% to 36.94% (lower panels of Fig. 3).
Fig. 3. Empirical mean (upper panels) and standard deviation (lower panels) of the two proportion estimators (indicator-based and smooth) for each scenario and sample size under the alternative hypothesis. Shaded areas represent the Monte Carlo uncertainty.
4.1.6 Type 1 error (global effect)
Across all scenarios, the type 1 errors of the average and pool-se estimators were well controlled, except in very small samples where small deviations from the nominal level were observed (maximum of 6.31% and 7.36%, respectively; lower panels of Fig. 2). When neglecting the uncertainty about the weights, the type 1 error of the GLS and constrained GLS estimators was only controlled for large samples; large type 1 error rates were found in very small samples (e.g., 22.02% for the GLS estimator with n=10 per group).
4.2 Simulated multiverse analysis for non-normal distributions
While the previous section evaluated the performance of the sensitivity analysis assuming ground-truth data generated from a multivariate normal distribution, this assumption is often violated in real neuroimaging data (especially patient data). To address this issue, we expanded the simulation scenarios to include non-normal data (scenarios 4 and 5), allowing us to understand how well the sensitivity analysis works, how these data distributions interact with sample size, and what additional constraints are necessary to govern the usage of the sensitivity analysis. More specifically, we created two additional simulations largely corresponding to scenario 3 (many pipelines with correlated heteroscedastic noise), but with different noise distributions. Scenario 4 used noise following a Student’s t-distribution with 5 degrees of freedom, resulting in a distribution with heavy tails to reflect the inclusion of outliers. Scenario 5 used noise following a half-normal distribution, resulting in a heavily skewed non-negative distribution with most of the density centered around 0, as illustrated in the sketch below. Figure 4 highlights the differences between scenarios 3, 4, and 5 across different noise levels. Additional simulation details for scenarios 4 and 5 can be found in Supplementary Material E.
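The three noise marginals can be generated with base R as follows (unit scale parameters are illustrative assumptions):

```r
n <- 1e4
noise.gauss <- rnorm(n)       # scenario 3: normal noise
noise.t     <- rt(n, df = 5)  # scenario 4: heavy tails (outliers)
noise.hn    <- abs(rnorm(n))  # scenario 5: half-normal (skewed, non-negative)
```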
Fig. 4. Density of the (marginal) distribution of the noise (left panels) and of the observed values (right panels) for three pipelines. Colors refer to different scenarios, that is, different noise distributions.
Notably, the results for these additional scenarios were similar to those of the multivariate normal cases in terms of bias, variability, and rejection rate (Supplementary Fig. E). While these additional scenarios may not fully represent real-world data, they show the robustness of the proposed sensitivity analysis to assumptions about the residual distribution. Intuitively, the estimators benefit from the law of large numbers, that is, for large enough sample sizes they will behave as expected. Worse performance at low sample sizes with non-Gaussian noise distributions is expected, which is in line with some of the simulation results (e.g., bias in scenario 5, Supplementary Fig. D). The difference between the Gaussian and non-Gaussian cases is not striking, though, and factors such as sample size seem to be driving the behavior of the estimators.
4.3 Application to real-world multiverse analysis
We illustrate the statistical sensitivity analysis with the real data described in Section 2.1, in which the SERT availability was assessed after a drug intervention and compared to normative values. We start by studying the behavior of the four statistical estimators (pooled GLS, pooled constrained GLS, pooled average, and pooled SE) for estimating a common effect across pipelines, first for a single subcortical region, the amygdala, reported as a forest plot in Figure 5 (left panel). The dashed vertical line in Figure 5 represents the normative value, and each horizontal error bar represents the estimated effect (mean and 95% CI) for a given pipeline or estimator. Across a reasonable set of preprocessing pipelines, three of the eight selected pipelines reject the null hypothesis (as indicated by their CI not overlapping the normative value), with estimated percent differences between groups ranging between -9% (pipeline 6) and 2.5% (pipeline 1). The pooled constrained GLS, pooled SE, and pooled average estimators all fail to reject a common effect across pipelines, whereas the GLS estimator rejects the null hypothesis. However, when inspecting the pipeline weights of the GLS estimator for these data, it assigned a very high weight to four pipelines (i.e., weights above 1 in absolute value), leading to an unreliable estimate, as illustrated by the large standard deviation found in the simulation study for low sample sizes (Fig. 2). The constrained GLS estimator did not exhibit this problem and had weights between -0.79 and 1. We also visualize the four global effect estimators across the nine cortical regions (Figure 6). Here, we can observe a similar pattern: the pooled GLS estimator exhibited the largest deviations from the expected value based on the SERT normative values, whereas the pooled constrained GLS estimator was close to the pooled SE and pooled average.
Fig. 5. Left panel: forest plot of the estimated SERT availability in the amygdala for the intervention group (point and full line) versus the normative value (dashed line) for each pipeline and the four proposed pooled estimators. Right panel: correlation of the estimated SERT availability between pipelines. Pipeline 1: with motion correction (MC), boundary-based registration (BBR) using the time-weighted average PET image (twa), and MRTM2 as the kinetic modeling choice to estimate SERT availability. Pipeline 2: MC, normalized mutual information registration (NMI) using twa, and MRTM2. Pipeline 3: MC, BBR using the average PET image, and MRTM2. Pipeline 4: MC, NMI using the average PET image, and MRTM2. Pipeline 5: MC, BBR_twa, and MRTM. Pipeline 6: no motion correction (nMC), BBR_twa, and MRTM. Pipeline 7: nMC, BBR_twa, and MRTM2. Pipeline 8: MC, BBR_twa, and SRTM.
Fig. 6. Estimated difference (intervention - normative values) in SERT availability across pipelines for the four global effect estimators, displayed on the common inflated FreeSurfer surface (left hemisphere; lateral view, upper; medial view, lower). Note that only the nine cortical regions of interest are visualized.
Figure 5 (right panel) shows a heatmap of the estimated correlations across preprocessing pipelines, ranging from 0.68 to 1. Pipelines 1-4 show a very high correlation with each other (only varying the registration choices), whereas pipelines 5, 6, and 8 are equally correlated with each other (different kinetic models), and pipeline 7 is somewhat in between (no motion correction). The heatmap captures important differences in the correlation structure between pipelines, suggesting that not all pipelines perform similarly, with moderate levels of unexplained variance. Pipelines 5 to 8 exhibit numerically smaller correlations compared to pipelines 1 to 4 (and similar variances), which explains why the GLS estimators produce estimates with lower numerical values compared to pooled estimators ignoring the correlation, as the GLS estimators assign more weight to the last four pipelines.
Notably, no correction for multiple comparisons was carried out across regions and pipelines at this point. The rationale for omitting this step (which should otherwise always be carried out) is that we wanted to make our analysis as comparable as possible to the Neuroimaging Analysis Replication and Prediction Study (Botvinik-Nezer et al., 2020), where each participating institution analyzed the data using their own established pipeline and tested only a single region in a hypothesis-driven fashion. Future work should address how adequate correction for multiple comparisons should be carried out across multiple estimators and pipelines.
5 Discussion
The proposed statistical framework leverages information about the joint distribution of the test statistics across pipelines to provide insights into the impact of preprocessing strategies. In particular, we introduce a new estimator of the proportion of pipelines with evidence for an effect (the smooth estimator) and a new estimator of the global effect across pipelines (the constrained GLS estimator), able to pool correlated and heteroscedastic estimates while exhibiting reasonable small-sample properties. This information can complement a standard forest plot, as exemplified in Figure 5, where we added a heatmap of the correlation between pipelines.
The simulation results show that while the GLS estimator outperforms standard pooling techniques in large samples, it is not suited for the commonly observed case in neuroimaging studies of a small N (number of subjects) relative to a large P (number of pipelines). The proposed alternative, the constrained GLS estimator, can be seen as regularizing the GLS estimator toward the empirical average in small samples. Even though this considerably improves the small-sample performance of the estimator, other regularization approaches may lead to further gains. An alternative approach could be to regularize the estimated variance-covariance matrix between pipeline-specific estimates, for example, using the graphical lasso. However, this is challenging since the correlation structure among pipelines is typically complex and non-sparse.
The proposed estimator of the proportion of pipelines (the smooth estimator) showed lower variability in the simulation studies, which should benefit statistical power. It is also intuitively appealing, as it is less sensitive to the threshold chosen for statistical significance than counting the proportion of pipelines for which the p-value was below, say, 0.05.
In the real-world application, we observed that all estimators could be readily applied, and three of them performed as expected based on the results from the individual pipelines. Only the GLS estimator performed differently and was the only one rejecting the null hypothesis, hinting at an effect across pipelines. This is, though, due to the estimator not being able to estimate the weights properly in small sample sizes. Since neuroimaging studies in general, and PET studies in particular, rarely go beyond moderate sample sizes (Szucs & Ioannidis, 2020), other estimators should be used. We recommend using the average or the constrained GLS estimator: the latter when pipelines are not similarly related and the sample size is moderate to large, otherwise the former.
The framework that we are proposing is not without limitations. Fast and accurate estimation of the statistical uncertainty is still being investigated. One approach, neglecting the uncertainty of the weights, is numerically fast but unreliable in small samples or with a large number of pipelines for the GLS and constrained GLS estimators. For the moment, resampling methods such as the non-parametric bootstrap or permutation tests, while more computationally demanding, are recommended.
The underlying assumption when combining results across pipelines in our analysis is that all pipelines are unbiased and positively correlated (i.e., they all estimate the “same” effect). This can, however, not always be guaranteed. For example, in the well-known NARPS analysis (Botvinik-Nezer et al., 2020), the effect detected by different groups was, in a few cases, in the opposite direction (which is not typical of multiverse analyses). Alternative approaches, for example, STAPLE (Dafflon et al., 2022), could be used to reduce the bias of the pooled estimators by assuming a majority of unbiased pipelines or by identifying clusters of pipelines and pooling pipeline-specific estimates within clusters. Finally, our proposed framework currently only considers a single region at a time and not multiple regions of interest jointly. Previous work has shown that optimal preprocessing strategies for single regions exist (Nørgaard, Ganz, et al., 2019), but also that multiple regions of interest are correlated within-subject (Eklund et al., 2016; Ganz et al., 2020). Therefore, future work is needed to jointly investigate and summarize the complex and nested dependence structure across multiple regions, and how it interacts with multiple preprocessing strategies.
6 Conclusion
In this work, we have developed a statistical sensitivity analysis that can quantify the impact of different preprocessing choices on subsequent statistical analyses. As has been reported in previous studies, we observe that the influence of preprocessing pipelines on subsequent statistical analyses can be quite large. Hence, we provide tools for statistical analyses that can aggregate multiple analyses of the same data. We introduced a new pooling estimator across pipelines (the constrained GLS estimator), but we also discuss how to estimate the proportion of pipelines with evidence for an effect and how to test hypotheses across pipelines, such as “no effect across all pipelines” or “at least one pipeline with no effect”. Code to reproduce all simulations and figures, including a step-by-step tutorial, is available at https://github.com/openneuropet/multiverse_tools/tree/main.
Ethics
The authors ensure that the work submitted is original and all authors are credited appropriately and have agreed to the journal’s open access, ethical, and data-sharing policies. All clinical data used in this study was ethically acquired following relevant ethical and legal guidelines. There are no conflicts of interest with respect to the work presented in this article.
Data and Code Availability
The script to re-create the in silico data is available on GitHub (https://github.com/openneuropet/multiverse_tools/tree/main/sensitivity_analysis-simulation). The data for the real-world example originate from a clinical study by Frokjaer et al. (2015). The study was approved by the ethics committee for the capital region of Copenhagen (protocol-ID: H-2-2010-108) and registered as a clinical trial on www.clinicaltrials.gov under the trial ID NCT02661789. All subjects provided written informed consent prior to participation, in accordance with the Declaration of Helsinki II. Data are available from the CIMBI database (https://nru.dk/index.php?view=category&id=128) upon request.
Code implementing the proposed methods is available on CRAN in the LMMstar package (https://cran.r-project.org/web/packages/LMMstar/index.html). Code for the simulation study and the application is available on GitHub (https://github.com/openneuropet/multiverse_tools/tree/main/sensitivity_analysis-simulation).
Author Contributions
All authors participated in the conceptualization and first drafting of the manuscript. M.N. performed the data curation. B.O. developed the methodology and wrote the software. M.N., C.P., and M.G.B. participated in the validation and formal analysis. B.O. and M.G.B. added visualizations. All authors read and approved the final manuscript.
Funding
We gratefully acknowledge funding through the BRAIN Initiative grant #1R24MH120004-01A1 and the Novo Nordisk Foundation grant #N20OC0063277, as well as the Lundbeck Foundation grant BrainDrugs #R279-2018-1145.
Declaration of Competing Interest
There are no competing interests with respect to the presented work.
Supplementary Materials
Supplementary material for this article is available with the online version here: https://doi.org/10.1162/imag_a_00523.
References