Abstract
The fields of neuroscience and psychology are currently in the midst of a so-called reproducibility crisis, with growing concerns regarding a history of weak effect sizes and low statistical power in much of the research published in these fields over the last few decades. Whilst the traditional approach for addressing this criticism has been to increase participant sample sizes, there are many research contexts in which the number of trials per participant may be of equal importance. The present study aimed to compare the relative importance of participants and trials in the detection of phase-dependent phenomena, which are measured across a range of neuroscientific contexts (e.g., neural oscillations, non-invasive brain stimulation). This was achieved within a simulated environment in which the strength of this phase dependency could be manipulated for two types of outcome variables: one with normally distributed residuals (idealistic) and one comparable with motor-evoked potentials (an MEP-like variable). We compared statistical power across thousands of simulated experiments with the same total number of sessions per experiment but different proportions of participants and sessions per participant (30 participants × 1 session, 15 participants × 2 sessions, and 10 participants × 3 sessions), with the trials being pooled across sessions for each participant. These simulations were performed for both outcome variables (idealistic and MEP-like) and four different effect sizes (0.075—“weak,” 0.1—“moderate,” 0.125—“strong,” 0.15—“very strong”), as well as separate control scenarios with no true effect. Across all scenarios with (true) discoverable effects, and for both outcome types, there was a statistical benefit for experiments maximising the number of trials rather than the number of participants (i.e., it was always beneficial to recruit fewer participants but have them complete more trials). These findings emphasise the importance of obtaining sufficient individual-level data rather than simply increasing the number of participants.
1 Introduction
Neuroscience and psychology studies often yield statistical powers well below the typically desired level of 80%, with one review reporting a range of only 8–30% (Button et al., 2013). This has led to concern regarding the reproducibility of findings reported in these fields, the most damning evidence of which comes from the Open Science Collaboration (Open Science Collaboration, 2015), which involved 100 replications of published psychology experiments and found only 36% of those replications were successful. Whilst there are many factors that can influence replicability, in terms of both experimental design and statistical analysis, one of the most commonly raised concerns is the small sample sizes that have traditionally been employed by many studies in these fields (Turner et al., 2018). Conventional considerations for maximising statistical power focus on increasing the participant sample size, since recruiting a greater number of participants will invariably aid the detection of an experimental effect (Minarik et al., 2016; Mitra et al., 2019). In many research contexts, however, there may be value in considering the number of trials within an experimental paradigm (Baker et al., 2021; Normand, 2016; Smith & Little, 2018; Zoefel et al., 2019).
The relative importance of each of these factors for any given experiment is dependent on the amount of variability that is expected to be present both within and between participants (Baker et al., 2021; Grice et al., 2017; Rouder & Haaf, 2018; Xu et al., 2018). For example, if a lot of variability is expected across trials for a given individual (i.e., when the measured variable is dynamic/unstable), there is obvious value in collecting a sufficient amount of data from each individual so that the individual-level measures are accurate, rather than simply recruiting more individuals. In this scenario, it is important to remember that even with a sufficiently large participant sample size, the group-level analyses are unlikely to be meaningful if the measures at the individual level are inaccurate (Normand, 2016). However, if there is little variability expected across trials (i.e., when the measured variable is static/stable), then there is less value in collecting a large number of trials for each participant, and resources are better allocated towards increasing the participant sample size.
There are many branches of neuroscience research that are particularly prone to considerable within- and between-subjects variability, one example of which is the field of non-invasive brain stimulation (NIBS). NIBS methods are promising tools for both studying and modulating brain function (Begemann et al., 2020; de Boer et al., 2021; Lewis et al., 2016; Vosskuhl et al., 2018) that have received increasing interest in recent years for their potential uses in both psychiatry (Elyamany et al., 2021; Piccoli et al., 2022; Vicario et al., 2019) and neurorehabilitation (Evancho et al., 2023; Qi et al., 2023; Yang et al., 2024). Unfortunately, however, studies involving NIBS have traditionally employed relatively small sample sizes (Button et al., 2013; Minarik et al., 2016; Mitra et al., 2019), which, when combined with the aforementioned within- and between-subjects variability, can lead to suboptimal analyses, inconsistent findings across studies, and ultimately criticism regarding the efficacy of these stimulation techniques (Lafon et al., 2017; Vöröslakos et al., 2018).
Most of this criticism has historically been directed towards low participant sample sizes (Mitra et al., 2019); however, as mentioned earlier, it has recently been suggested that in many cases, the trial sample size may be of equal, if not greater, importance (Baker et al., 2021; Normand, 2016; Rouder & Haaf, 2018; Smith & Little, 2018; Xu et al., 2018). For example, a recent simulation study by Zoefel et al. (2019) found that the most important of their experimental parameters for detecting phasic modulation of brain activity was indeed the number of trials per participant; they, therefore, suggested that future studies should employ experimental designs with a relatively high number of trials. The authors also found that the most important of their neural parameters for detecting phasic effects was the hypothesised effect size. Therefore, optimising the trial sample size would be of particular importance for studies investigating subtle (weak) phasic effects.
This brings us to transcranial alternating current stimulation (tACS), which is a form of NIBS that involves the application of a weak alternating electric current across the scalp at the same frequency as a particular neural oscillation in order to influence the underlying oscillatory activity (Bland & Sale, 2019). Converging evidence suggests that tACS can effectively modulate oscillatory brain activity in a frequency- and phase-dependent manner by entraining endogenous oscillations to match the frequency and phase of the exogenous stimulation (for review, see Wischnewski et al. (2023)). However, because tACS only probabilistically influences the spike timing of neuronal populations rather than directly causing those neurons to depolarise (Elyamany et al., 2021; Huang et al., 2017; Opitz et al., 2016; Vosskuhl et al., 2018), this phasic entrainment is often reported to be relatively weak (Gundlach et al., 2016; Neuling et al., 2012; Riecke, Formisano, et al., 2015; Riecke et al., 2018; Riecke, Sack, et al., 2015; Riecke & Zoefel, 2018; Wilsch et al., 2018; Zoefel et al., 2018).
One approach for investigating phasic entrainment by tACS is to collect transcranial magnetic stimulation (TMS)-induced motor-evoked potentials (MEPs) at different phases of tACS, referred to as phase-dependent TMS (Fehér et al., 2017, 2022; Nakazono et al., 2021; Raco et al., 2016; Schaworonkow et al., 2018, 2019; Schilberg et al., 2018; Zrenner et al., 2018, 2022). This allows corticospinal excitability to be probed across tACS phase without the interference of tACS artefacts, which are a major contaminant of neurophysiological recordings from electro-/magnetoencephalography (Kasten & Herrmann, 2019; Noury et al., 2016; Noury & Siegel, 2017). This technical advantage does come at a cost, however, as MEP amplitudes often exhibit considerable within-participant variability across trials (Capaday, 2021; Janssens & Sack, 2021) due to a complex combination of physiological (Gandevia & Rothwell, 1987; Hashimoto & Rothwell, 1999; Niyazov et al., 2005; Vidaurre et al., 2017; Zalesky et al., 2014) and experimental factors (Grey & van de Ruit, 2017). The combination of small effect sizes for tACS and high inter-trial variability for TMS, therefore, means that the MEP sample size needs to be sufficiently large in order to detect any modulation of the MEP amplitudes with respect to tACS phase. Crucially, however, the maximum number of MEPs that can be obtained in a single experiment session is limited by several practical factors, such as the session length, the charge time between TMS pulses, and the gradual build-up of the TMS device’s temperature (which can eventually cause the device to overheat).
In a recent human study using slow-wave tACS (Geffen et al., 2021), we acquired a total of 240 MEPs within the tACS protocol of each session. Since these MEPs were acquired across 4 different epochs of the tACS protocol (early online, late online, early offline, and late offline), this gave us only 60 MEPs per epoch, per participant. Although we employed a participant sample size of 30, roughly a third greater than the mean participant sample size from a recent meta-analysis of NIBS studies (~22; Mitra et al., 2019), the trial sample size could be considered low (e.g., the lowest number of trials in the simulations of Zoefel et al. (2019) was 192). This low trial sample size may ultimately limit the interpretation of these results, since it cannot be confirmed whether the lack of phasic effects was due to insufficient statistical power. Might these effects have been detectable had we instead recruited just 10 participants and had them return for three sessions?
One possible solution to this limit in MEP acquisition is to perform multiple sessions for each participant and then pool the trial data across sessions. However, given that most studies have resource and/or time constraints that limit their experimental hours, this approach would come at the cost of participant sample size. Although the simulations performed by Zoefel et al. (2019) have demonstrated that trial sample size is a crucial experimental parameter for detecting phasic effects, those simulations kept participant sample size constant for all experiments. The relationship between participant and trial sample size for detecting phasic effects thus remains to be established, and it is unclear whether the gain in statistical power from increasing the trial sample size would outweigh the loss of statistical power from decreasing the participant sample size. Therefore, we chose to perform a simulation study to quantify the statistical powers of experiments that have the same total number of sessions per experiment but different proportions of participants and number of sessions per participant. In this manner, the theoretical burden to the researcher (in terms of experimental hours) is matched in each scenario: either a greater number of individuals each contribute fewer experimental sessions, or fewer individuals each contribute multiple sessions (i.e., a greater number of trials for a given participant pooled across sessions).
2 Methodology
2.1 Simulation scenarios
All simulations were performed in MATLAB (R2020b). For each scenario, we defined the type of data (“idealistic” or “MEP-like”), mean effect size (0.075—“weak,” 0.1—“moderate,” 0.125—“strong,” or 0.15—“very strong”), number of participants (30, 15, or 10), sessions per participant (1, 2, or 3 sessions for 30, 15, and 10 participants, respectively), trials per session (60), number of experiments (1000), and the relative degree of between- and within-subjects variability (low, medium, or high).
The total number of sessions and the trials per session were chosen based on the participant and trial sample sizes used in our recent slow-wave tACS study (see above; Geffen et al., 2021). The effect sizes do not reflect a traditional effect size value such as a Cohen’s d value, but rather determine either the amplitude of the base sine function (in the case of the idealistic data) or the variance of a normal distribution around the base sine function (in the case of the MEP-like data). The mean values for each effect size category were chosen from pilot testing to provide sinusoidal data with a good spread of statistical power: from weak effects (e.g., 20% power) to strong effects (e.g., 90% power). As a negative control, we performed a separate test with no effect of phase for any of the sessions.
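To make these scenario definitions concrete, the sketch below shows one way a single scenario might be parameterised in MATLAB. The field names and the specific values are illustrative assumptions for this example and do not necessarily match the code in the accompanying OSF repository.

```matlab
% Illustrative scenario definition (field names and values are examples only,
% not necessarily those used in the published OSF code).
scenario = struct( ...
    'dataType',       'MEP-like', ...  % 'idealistic' or 'MEP-like'
    'meanEffectSize', 0.1, ...         % 0.075, 0.1, 0.125, or 0.15 (0 for the no-effect control)
    'nParticipants',  15, ...          % 30, 15, or 10
    'nSessions',      2, ...           % 1, 2, or 3 sessions per participant (30 sessions in total)
    'nTrials',        60, ...          % trials per session
    'nExperiments',   1000, ...        % simulated experiments per scenario
    'betweenVar',     'medium', ...    % between-subjects variability: 'low', 'medium', or 'high'
    'withinVar',      'medium');       % within-subjects variability: 'low', 'medium', or 'high'
```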
2.2 Determining effect sizes for each session
For each scenario, separate effect sizes were first generated for each participant that ranged around the mean effect size for the chosen effect size category (i.e., 0.075—“weak,” 0.1—“moderate,” 0.125—“strong,” or 0.15—“very strong”). The range for the participant effect sizes was set to either ±20%, 60%, or 100% around this mean value for low, medium, and high between-subjects variability, respectively. The effect sizes for each participant were then jittered slightly between their individual sessions by either ±10%, 20%, or 30% around the value from their first session for low, medium, and high within-subjects variability, respectively.
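A minimal sketch of this two-stage sampling is given below. The stated ranges are taken from the description above, but the sampling distribution within those ranges is not specified here, so uniform sampling is assumed for illustration; the variable names are likewise hypothetical.

```matlab
% Sketch: per-participant effect sizes drawn uniformly within +/-60% of the
% scenario mean (medium between-subjects variability), then jittered within
% +/-20% of each participant's first-session value for subsequent sessions
% (medium within-subjects variability). Uniform sampling is an assumption.
meanES        = 0.1;    % 'moderate' effect size category
betweenRange  = 0.60;   % low/medium/high = 0.20/0.60/1.00
withinRange   = 0.20;   % low/medium/high = 0.10/0.20/0.30
nParticipants = 15;
nSessions     = 2;

effectSizes = zeros(nParticipants, nSessions);
for p = 1:nParticipants
    es1 = meanES * (1 + betweenRange * (2*rand - 1));   % first-session value
    effectSizes(p, 1) = es1;
    for s = 2:nSessions
        effectSizes(p, s) = es1 * (1 + withinRange * (2*rand - 1));
    end
end
```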
2.3 Generating sinusoidal data
The effect sizes for each session were then used to generate sinusoidal data with continuous (i.e., randomly sampled) phase values (see Fig. 1). For the idealistic data (Fig. 1A), the effect size value scales the sine function to peak at the specified amplitude, with the data points being normally distributed around the base sine function. For the MEP-like data (Fig. 1B), however, the effect size scales the variance of a normal distribution around the base sine function rather than scaling the amplitude of the sine function itself, since the MEP-like data explicitly violate the assumptions of homoscedasticity and normality. The distribution for the MEP-like data is then folded to create a positive skew. Like real MEP amplitudes, these data points therefore could not be negative; instead, they became larger and more variable around the “peak” of tACS.
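The sketch below illustrates one plausible implementation of these two generative processes for a single session. The baseline noise levels (noiseSD, baseSD) are assumed values introduced for illustration, and the exact parameterisation in the published code may differ.

```matlab
% Sketch of data generation for one session (noise parameters are assumptions).
nTrials = 60;
es      = 0.1;                          % effect size for this session
phases  = 2*pi*rand(nTrials, 1);        % continuous, randomly sampled phases (radians)

% Idealistic data: the effect size sets the amplitude of the underlying sine,
% and residuals are normally distributed (homoscedastic) around it.
noiseSD    = 1;                                         % assumed residual SD
idealistic = es * sin(phases) + noiseSD * randn(nTrials, 1);

% MEP-like data: the effect size scales the spread of a normal distribution
% around the sine, and the distribution is folded (absolute value) so that
% values are non-negative, positively skewed, and become larger and more
% variable around the tACS "peak".
baseSD  = 0.5;                                          % assumed baseline spread
mepLike = abs((baseSD + es * (1 + sin(phases))) .* randn(nTrials, 1));
```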
2.4 Analysing the simulated data
The idealistic/MEP-like data and the phase values were then pooled across the individual sessions for each participant, and a linear regression permutation analysis was performed to calculate the individual p-values for each participant (Bland & Sale, 2019; Zoefel et al., 2019). The linear regression approach outlined in Bland and Sale (2019) was found to be the most sensitive when paired with permutation analysis (Zoefel et al., 2019). Baker et al. (2021) also found that many paradigms in which the dependent variable is derived by model fitting show continual improvements in power as the number of trials increases, rather than reaching an asymptote beyond which further trials provide no additional improvement in statistical power, as was the case for paradigms in which the dependent variable was derived by other means. Furthermore, by model fitting at the individual level, the individual becomes the replication unit instead of the group (i.e., each participant can be viewed as an independent replication of the experiment; Smith & Little, 2018).
For this linear regression analysis, an ideal (best-fitting) sinusoidal model is first fitted to each simulated participant’s data points based on their corresponding phase value as described in Bland and Sale (2019). The data points are then shuffled with respect to their phases for a total of 10000 permutations per participant and new sinusoidal models are fitted to the shuffled data. The true and shuffled sinusoidal model amplitudes are then compared, with the individual p-values representing the proportion of shuffled model amplitudes that exceeded the true model amplitudes for each participant. Because the permutation procedure disrupts any phasic effects that may be present, the shuffled model amplitudes should be small (i.e., closer to zero) and thus, the shuffled data act as a negative control for the “true” data. The group p-value for the experiment was then obtained by combining the individual p-values using Fisher’s method (Fisher, 1992; Zoefel et al., 2019).
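The sketch below illustrates this analysis pipeline on a small synthetic data set. A linear fit with sine and cosine regressors is used here as one standard way of obtaining the best-fitting sinusoid; the exact fitting procedure in the published code (following Bland & Sale, 2019) may differ in detail, and the data-generation values are placeholders.

```matlab
% Small synthetic data set (placeholder values) for illustrating the analysis.
nParticipants = 10; nTrialsTotal = 180; nPerms = 10000;
data = cell(nParticipants, 1); phase = cell(nParticipants, 1);
for p = 1:nParticipants
    phase{p} = 2*pi*rand(nTrialsTotal, 1);
    data{p}  = 0.1*sin(phase{p}) + randn(nTrialsTotal, 1);
end

% Individual-level permutation test: compare the amplitude of the sinusoid
% fitted to the true data against amplitudes fitted to phase-shuffled data.
pInd = zeros(nParticipants, 1);
for p = 1:nParticipants
    y = data{p};
    X = [ones(nTrialsTotal, 1) sin(phase{p}) cos(phase{p})];
    b = X \ y;
    trueAmp = hypot(b(2), b(3));                     % amplitude of the fitted sinusoid
    permAmp = zeros(nPerms, 1);
    for k = 1:nPerms
        bPerm      = X \ y(randperm(nTrialsTotal));  % shuffle data with respect to phase
        permAmp(k) = hypot(bPerm(2), bPerm(3));
    end
    pInd(p) = mean(permAmp >= trueAmp);              % individual p-value
end

% Fisher's method: combine individual p-values into a group p-value
% (chi2cdf requires the Statistics and Machine Learning Toolbox).
chi2stat = -2 * sum(log(pInd));
pGroup   = 1 - chi2cdf(chi2stat, 2 * nParticipants);
```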
2.5 Calculating statistical power for each scenario
This process was repeated for a total of 1000 experiments per scenario, and the statistical power for the scenario was calculated by dividing the number of experiments with a significant group p-value (i.e., p < .05) by the total number of experiments. For each combination of effect size and data type, the powers were averaged across the different degrees of between-/within-subjects variability to form the power values for the primary analyses, whilst the complete power values before averaging have been included as Supplementary Material.
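As a simple sketch (with placeholder inputs), the power calculation and the subsequent averaging step amount to the following:

```matlab
% Sketch of the power calculation (placeholder inputs).
alpha     = 0.05;
pGroupAll = rand(1000, 1);                 % placeholder: group p-values from 1000 experiments
powerScenario = mean(pGroupAll < alpha);   % proportion of significant experiments

% Primary power value: average across the nine variability conditions
% (3 between-subjects x 3 within-subjects levels).
powerPerCondition = rand(3, 3);            % placeholder: power for each variability combination
powerPrimary      = mean(powerPerCondition(:));
```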
2.6 Comparing statistical powers between scenarios
To determine whether the proportion of participants vs. trials significantly influenced the predicted powers, chi-square goodness-of-fit tests were performed for each combination of data type (idealistic or MEP-like) and effect size (no effect, weak, moderate, strong, or very strong).
To determine whether any combination of parameters was susceptible to over- or under-sensitivity (i.e., if any of the powers for the no effect condition were significantly greater or lower than the nominal false-positive rate of .05), a binomial test was performed to establish the minimum and maximum thresholds for false-positive experiments in the no effect simulations.
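The sketch below shows how these two checks might be implemented. The chi-square example uses, for illustration, observed counts corresponding to the idealistic “weak” powers in Table 1 scaled to 1000 experiments; the binomial p-values are computed one-sided in each direction, which is one plausible reading of the procedure described above (chi2cdf and binocdf require the Statistics and Machine Learning Toolbox).

```matlab
% Chi-square goodness-of-fit across the three participant/session splits,
% using the idealistic "weak" powers from Table 1 scaled to 1000 experiments
% as example observed counts.
obsSig = [130 176 221];                           % significant experiments per design
expSig = repmat(mean(obsSig), 1, numel(obsSig));  % equal counts expected under H0
chi2   = sum((obsSig - expSig).^2 ./ expSig);     % ~23.57 for these counts
pChi   = 1 - chi2cdf(chi2, numel(obsSig) - 1);    % df = 2

% Binomial check of the no-effect simulations: one-sided p-values for an
% observed count k of false-positive experiments out of n = 1000, against the
% nominal false-positive rate of .05 (k = 48 is an arbitrary example).
k = 48; n = 1000; nominal = 0.05;
pUnder = binocdf(k, n, nominal);                  % P(X <= k): under-sensitivity
pOver  = 1 - binocdf(k - 1, n, nominal);          % P(X >= k): over-sensitivity
```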
3 Results
The power values for each scenario (i.e., the proportion of significant experiments for each combination of effect size and number of participants/trials) are summarised in Table 1 and Figure 2. The different proportions of number of participants and trials (i.e., 30 participants × 1 session, 15 participants × 2 sessions, and 10 participants × 3 sessions) were then compared using chi-square goodness-of-fit tests (separately for each effect size and data type), which revealed significant improvements in statistical power as the number of trials per participant increased (i.e., improvements that outweighed the corresponding decreases in the number of participants). These improvements in power were present across all effect sizes, except for the control simulations with no effect, for both the idealistic (χ2 = 0.94, 23.57, 51.76, 73.88, and 54.18 for no effect, weak, moderate, strong, and very strong, respectively; p = .624 for no effect and p < .001 for all other effect sizes) and MEP-like data (χ2 = 0.05, 62.25, 86.53, 42.81, and 11.58, respectively; p = .974 for no effect, p < .001 for weak, moderate, and strong effects, and p = .003 for very strong effect).
Table 1.

Idealistic | No effect | Weak (0.075) | Moderate (0.1) | Strong (0.125) | Very strong (0.15)
---|---|---|---|---|---
30 participants × 1 session | .048 | .13 | .225 | .358 | .541
15 participants × 2 sessions | .053 | .176 | .338 | .542 | .731
10 participants × 3 sessions | .058 | .221 | .406 | .626 | .807
MEP-like | No effect | Weak (0.075) | Moderate (0.1) | Strong (0.125) | Very strong (0.15)
---|---|---|---|---|---
30 participants × 1 session | .049 | .196 | .381 | .634 | .836
15 participants × 2 sessions | .051 | .317 | .587 | .823 | .948
10 participants × 3 sessions | .049 | .387 | .683 | .881 | .973
Each value represents the mean statistical power (i.e., the proportion of simulated experiments with a significant group p-value) from nine simulations with varying degrees of between- and within-subjects variability, each simulation comprising 1000 simulated experiments. Each experiment consisted of 30 sessions (60 trials each), which were divided into either 1, 2, or 3 sessions per participant, with the trials then being pooled across sessions for each participant. For all effect sizes (except for the control simulations with no effect), simulations with fewer participants but more sessions per participant showed significantly greater statistical powers compared with experiments with more participants but fewer sessions per participant. These improvements in power were present for both the idealistic (chi-square goodness-of-fit tests; χ2 = 0.94, 23.57, 51.76, 73.88, and 54.18 for no effect, weak, moderate, strong, and very strong, respectively; p = .624 for no effect and p < .001 for all other effect sizes) and MEP-like data (χ2 = 0.05, 62.25, 86.53, 42.81, and 11.58, respectively; p = .974 for no effect, p < .001 for weak, moderate, and strong effects, and p = .003 for very strong effect).
The sensitivity of each combination of parameters to false positives was assessed using a binomial test, which found that the minimum threshold for under-sensitivity was 39 out of 1000 experiments (.039), with p = .064. The maximum threshold for over-sensitivity was 61 out of 1000 experiments (.061), again with p = .064. All of the powers from the no effect simulations fell within these thresholds, and thus, no combination of parameters was deemed to be over- or under-sensitive.
4 Discussion
Neuroscience and, in particular, NIBS studies are often criticised for their relatively low statistical powers (Button et al., 2013), which have traditionally been attributed to low participant sample sizes (Lafon et al., 2017; Minarik et al., 2016; Mitra et al., 2019; Vöröslakos et al., 2018). However, it has recently been suggested that in many cases, such as when detecting phasic effects on brain activity, the trial sample size for each participant may be of equal, if not greater, importance than the participant sample size itself (Baker et al., 2021; Grice et al., 2017; Normand, 2016; Rouder & Haaf, 2018; Smith & Little, 2018; Xu et al., 2018; Zoefel et al., 2019). The present simulation study aimed to directly compare the relative importance of participant sample size and trial sample size per participant for detecting phasic effects of NIBS via a linear regression permutation analysis. To this end, we compared the statistical powers of simulated experiments with the same number of total experiment sessions (30) but different proportions of participants and number of sessions per participant (30 participants × 1 session, 15 participants × 2 sessions, and 10 participants × 3 sessions), with the trials being pooled across sessions for each participant. These simulations were performed for two types of outcome variables (idealistic and MEP-like) and four different effect sizes (0.075—weak, 0.1—moderate, 0.125—strong, 0.15—very strong), as well as a separate control with no true effect. The chi-square goodness-of-fit tests revealed that for both data types and all effect sizes (except for the control simulations with no effect), experiments with fewer participants but more sessions (i.e., more trials) per participant showed significantly greater statistical powers compared with experiments with more participants but fewer sessions per participant, supporting our initial hypothesis. Further, the binomial test confirmed that no combination of parameters was susceptible to over- or under-sensitivity.
In the case of the idealistic data, the benefit of trials over participants appears to increase as the effect size increases. This is particularly relevant in the context of NIBS research, which has a history of weak effect sizes that may still be prone to type II errors even with optimised numbers of trials and participants (Button et al., 2013; Gundlach et al., 2016; Neuling et al., 2012; Riecke, Formisano, et al., 2015; Riecke et al., 2018; Riecke, Sack, et al., 2015; Riecke & Zoefel, 2018; Wilsch et al., 2018; Zoefel et al., 2018). Despite this, our results clearly suggest that the benefits of trials over participants remain significant even at weaker effect sizes. For the MEP-like data, however, the benefit appears to be of a similar magnitude irrespective of effect size. Therefore, whatever capacity tACS has to modulate MEP amplitudes in a phasic manner, there is a benefit to sampling more trials rather than more participants.
There are several factors to consider when designing an experiment involving multiple sessions per participant, since this design can introduce some practical issues. The first, and perhaps most impactful, of these issues is dropout/attrition: partially completed datasets have to be abandoned if participants withdraw from the study before completing all of their sessions. It is also important to consider the length of time between consecutive sessions, both for the participants’ safety and to minimise any carryover effects between sessions (Alharbi et al., 2017; Brunoni & Fregni, 2011). On a similar note, if the experiment involves a task where performance is quantitatively assessed, it is important to consider the possibility of training effects that may occur as the participant gains more practice with the task over repeated sessions. Finally, researchers should strive to minimise variation in any other controllable factors between each participant’s sessions (time of day, caffeine intake, etc.). The severity of these challenges only worsens as the number of sessions per participant increases, and so researchers should consider what the ideal number of sessions per participant would be to minimise any practical issues whilst still achieving the desired trial sample size. In some cases, it is possible to achieve a larger number of trials by increasing the length of each session, thus reducing the number of sessions needed to achieve the desired trial sample size. However, this too involves some practical issues that need to be considered, such as participant and/or experimenter fatigue resulting in low-quality data.
Despite the challenges associated with performing multiple sessions per participant, the results of these simulations suggest that the increases in statistical power are worth the small cost of these additional challenges. Furthermore, performing multiple sessions per participant also offers some practical advantages. For example, the setup time at the start of each session is generally reduced after a participant has completed at least one session, since the participant is already familiar with the procedure and equipment. Another advantage in subsequent sessions that is more specific to TMS is that it is easier to determine both the location of the participant’s “hot-spot” (Rossini et al., 1994) for the targeted muscle and the stimulation intensity required to consistently induce MEPs in that muscle that are around the desired baseline amplitude (e.g., 1 mV; Cuypers et al., 2014; Ogata et al., 2019; Thies et al., 2018).
It is important to note that the simulations performed in this study assume that, for each participant, the effect sizes are approximately similar across their sessions (though this variability was manipulated across several plausible ranges for completeness). Further, there may also be differences in the quality of the data between sessions, and the impact of this on statistical power was not directly assessed in the current study. It is worth noting, however, that whilst consistency across sessions may be relevant for assessing the amplitude of a sinusoidal effect, it is less relevant for the frequency of the effect, since phase is defined relative to the oscillatory cycle and is therefore comparable across frequencies. This means that individual sessions can still be pooled by phase even if there is a slight difference in the frequency of the sinusoidal effect between sessions.
5 Conclusions
The results of this simulation study highlight the importance of trial sample size for detecting phasic effects on brain activity. Our findings suggest that if there are limitations to the number of trials that can be obtained in a single experiment session (such as for phase-dependent TMS), conducting repeated experiment sessions on a smaller number of participants can be a useful strategy that allows researchers to obtain sufficiently large trial sample sizes and ensure accurate estimation and model fitting at the individual level. Although our simulation was originally designed in the context of our recent human tACS study (Geffen et al., 2021), effects of oscillatory phase can be found across a wide range of biomedical research fields, including but not limited to chronobiology (Kuhlman et al., 2018), biochemistry (Uhlén & Fritz, 2010), and cardiology (Tiwari et al., 2021). Further, the considerations regarding trial sample size vs. participant sample size are applicable to a wide variety of non-oscillatory experimental approaches in human neuroscience, psychology, and physiology. We, therefore, invite researchers to utilise the simulation code provided to aid in estimating statistical power for their own experimental designs using different proportions of participants and trials/sessions.
Data and Code Availability
The MATLAB code used to perform the simulations in this study is available at https://osf.io/vwysk/ (https://doi.org/10.17605/OSF.IO/VWYSK).
Author Contributions
A.G.: Conceptualisation, Methodology, Software, Formal Analysis, Investigation, Data Curation, Writing—Original Draft, Visualisation. N.B.: Conceptualisation, Methodology, Software, Data Curation, Writing—Review & Editing, Visualisation, Supervision. M.V.S.: Conceptualisation, Methodology, Validation, Resources, Writing—Review & Editing, Project Administration, Funding Acquisition, Supervision.
Funding
This work was supported by the US Office of Naval Research Global [grant number N62909-17-1-2139] awarded to M.V.S. The funding body had no involvement in the study design; the collection, analysis, and interpretation of data; the writing of the report; or the decision to submit the article for publication.
Declaration of Competing Interest
The authors have no relevant financial or non-financial interests to disclose.
Acknowledgements
We would like to thank Simoné Reinders as her honours thesis was a major source of inspiration in the design of this study.
Supplementary Materials
Supplementary material for this article is available with the online version here: https://doi.org/10.1162/imag_a_00345.