## Abstract

Normative modelling is an emerging technique for parsing heterogeneity in clinical cohorts. It can be implemented in practice using hierarchical Bayesian regression, which provides an elegant probabilistic solution to handle site variation in a federated learning framework. However, applications of this method to date have employed a Gaussian assumption, which may be restrictive in some applications. We have extended the hierarchical Bayesian regression framework to flexibly model non-Gaussian data with heteroskedastic skewness and kurtosis. To this end, we employ a flexible distribution from the sinh-arcsinh (SHASH) family, and introduce a novel reparameterisation and a Markov chain Monte Carlo sampling approach to perform inference in this model. Using a large neuroimaging dataset collected at 82 different sites, we show that the results achieved with this extension are equivalent to or better than a warped Bayesian linear regression baseline model on most datasets, while providing better control over the parameters governing the shape of the distributions that the approach is able to model. We also demonstrate that the attained flexibility is useful for accurately modelling highly nonlinear relationships between aging and imaging derived phenotypes, which shows that the extension is important for pushing the field of normative modelling forward. All methods described here are available in the open-source pcntoolkit.

## 1 Introduction

Brain disorders affect millions of people worldwide, and have a large impact on the quality of life of the patient and the people surrounding them. Correct treatment in the form of medication or therapy can help reduce symptoms. However, studies to understand mechanisms or develop treatments for these disorders are usually performed in a case-control framework, which provides group-level inferences in which homogeneity within clinical cohorts and clinical groups is presumed. These studies aim to detect differences between the clinical and the control (i.e., healthy) groups and are therefore limited to providing inferences about group averages (e.g., the “average patient”). However, it has been argued that the individual deviations from the groups’ means contain the most important information for stratification (Marquand, Rezek, et al., 2016; Marquand, Wolfers, et al., 2016). An improved understanding of the heterogeneity within clinical cohorts may ultimately lead to better diagnosis and treatment.

Normative modelling provides a framework for understanding heterogeneity within clinical cohorts (Marquand, Rezek, et al., 2016). A normative model aims to provide estimates of centiles of variation of a specific set of phenotypes over the full range of the clinical covariates, that is, across the lifespan, much like growth charts in pediatric medicine. Centiles of variation of imaging derived phenotypes (IDPs) are estimated, and subjects are assigned a point somewhere along the estimated centiles. The assumption is that healthy subjects typically land in areas of higher density than non-healthy subjects. The mean or “average” subject is thus also a meaningful concept in normative models, but contrary to case-control studies, the deviations from that mean are interpreted as anomalies in brain structure or function, and are seen as valuable resources for further downstream processing, such as analysis of the underlying mechanisms of brain disorders.

In probabilistic modelling, a distinction is made between epistemic and aleatoric uncertainty (Kendall & Gal, 2017). The former is due to uncertainty about a parameter or phenomenon, and generally becomes smaller as more data are collected; the latter is uncertainty due to the variation that is naturally present in the measured phenomenon. In normative modelling, we aim to capture and separate these forms of uncertainty to the highest possible degree. Separating all sources of aleatoric and epistemic uncertainty allows adequately controlling for nuisance variation, and guiding further decisions by relevant (biological) variation only.

As in many other fields, data availability in neuroscience has increased dramatically in recent years, and that trend will likely continue. Medical data, however, are often subject to sharing restrictions, and there is no guarantee that all data on which a model is to be trained can be centrally located. The question of how to do normative modelling on large, distributed datasets is thus a natural one, and it comes with a number of challenges. First, the method must be able to handle large amounts of data. Gaussian processes would be viable candidates for normative modelling, were it not for their time complexity, which scales cubically with the number of input data points (Marquand, Rezek, et al., 2016). Second, distributed datasets inevitably come with sources of nuisance variance: every scan site has its own protocols, every scanner has its own unique noise characteristics, and there may be unknown sources of noise present in specific sites as well. In addition, it is well known that there are sex-related differences in neurobiological phenotypes. Hierarchical Bayesian regression (HBR) can deal with these so-called batch effects in data by assuming a shared prior over batch-specific model parameters, as we describe in Appendix B. Third, neuroimaging data are complex and provide rich information about the individual from which they were derived, so these data need to be treated with care. Permission to share personal data is not always available, and transportation of the data is therefore not always possible. This requires federated learning: a distributed learning approach that does not require transportation of the data, only of the inferred model parameters (Kia et al., 2021; Li et al., 2020).
In this study, we employ and extend the hierarchical Bayesian regression (HBR) approach for normative modelling (Gelman & Rubin, 1992; Kia et al., 2020, 2021) to non-Gaussian IDPs. The method involves defining a generative model and inferring the posterior distribution over the model parameters using Markov chain Monte Carlo (MCMC) sampling. By absorbing and summarising the information contained in the training data into learned posteriors, HBR supports federated learning: all information required to extend or adapt the model is contained in the model itself, eliminating the need for data transportation when using or even extending the model.

The current implementation of HBR assumes that the residuals follow a Gaussian distribution. However, the variability in many imaging phenotypes cannot be properly described by a Gaussian distribution, as they are skewed, kurtotic, or both. The issue of non-Gaussianity in neurological phenotypes has partially been addressed in Fraza et al. (2021). This method accounts for non-Gaussianity by applying a monotonic warping function to the data (Snelson et al., 2003), which yields a closed form for the marginal likelihood that can be maximised using standard optimisation techniques (Dinga et al., 2021). Here, we utilise the same warping function (the sinh-arcsinh transformation; Appendix A), but in a fully Bayesian approach, that is, in the HBR framework, yielding full posterior distributions for all parameters. Applying the sinh-arcsinh transformation to a standard Gaussian distribution yields the SHASH distribution (Jones & Pewsey, 2009), detailed in Section 2.1: a flexible distribution with separate parameters roughly corresponding to skew and kurtosis. Here, we apply the HBR framework to a canonical variant of the SHASH distribution and introduce a novel reparameterisation that substantially reduces the dependency between the parameters, which is crucial for efficient sampling. We assess its performance on some known problematic imaging phenotypes, and show that the HBR framework with a SHASH likelihood is able to model phenotypes in a way that was previously impossible.

The contributions of this work are: (i) a reparameterisation of the SHASH distribution and an MCMC sampling approach to infer this model; (ii) a thorough analysis of the HBR method with a SHASH likelihood in comparison with a warped Bayesian linear regression as a baseline method (Fraza et al., 2021); (iii) an extensive evaluation of these methods on a large multi-site neuroimaging dataset containing estimates of cortical thickness and subcortical volume; and (iv) an extension to the existing HBR implementation in the pcntoolkit.

The paper is structured as follows: we first provide a theoretical background of HBR, MCMC sampling, and the SHASH distribution in Section 2, where the problems of using the SHASH distribution in a sampling context will also become clear. We introduce a reparameterisation that addresses these problems in Section 2.1.1. A framework for concisely describing common variations of HBR models is introduced in Appendix B. We then evaluate our approach using several experiments with a large neuroimaging dataset in Section 3, where we show that the proposed method performs on par with or better than existing methods. Next, we apply the new HBR models to an even more non-linear dataset (Zabihi et al., 2022), where we show a clear advantage over existing methods. A discussion of the advantages and limitations of our approach follows in Section 4, and we summarise all our findings in Section 5.

## 2 Methods

At its heart, normative modelling involves fitting a probabilistic regression model to predict centiles of variation in a biological or psychometric response variable as a function of a set of clinically relevant covariates. The covariates are often taken to be demographic variables such as age and sex, but this is not necessarily the case and many other mappings are possible (e.g., using cognitive scores to predict brain activity). Early reports used classical Gaussian process regression (Marquand, Rezek, et al., 2016; Ziegler et al., 2014), which is appealing due to its Bayesian non-parametric nature, exact inference, and flexibility in modelling non-linearity with a relatively small number of parameters. However, the exact Gaussian process approach used in these works has two main limitations: its extremely poor computational scaling to large numbers of data points (owing largely to the cubic cost of solving the GPR predictive equations), and the fact that the exact formulation (i.e., the one with an analytical solution) is restricted to modelling Gaussian centiles. While generalisations are possible within the Gaussian process literature to address these problems, these typically have other shortcomings. For example, modelling non-Gaussian noise distributions usually requires approximations which further increase computational complexity. The result is that other approaches with better computational scaling have been proposed in the neuroimaging literature. These are reviewed in Marquand et al. (2019) and include quantile regression and hierarchical linear regression techniques. More recently, the field has increasingly recognised the importance of modelling non-Gaussianity in the response variables, while using a distributional form to increase the precision with which outer centiles can be estimated (Borghi et al., 2006).
There are two main approaches that have been proposed to solve this problem in the neuroimaging literature, namely likelihood warping (Fraza et al., 2021) and generalised additive models for location, scale, and shape (GAMLSS) (Dinga et al., 2021). While these approaches are both capable of flexibly modelling non-Gaussianity and site effects, both have potential shortcomings: GAMLSS is not probabilistic due to the heavy regularisation penalties that are applied to constrain the flexible smoothers underlying the approach. Likelihood warping is probabilistic in nature, but still does not model the uncertainty associated with shape parameters and does not offer completely flexible control over the nature of the modelled distribution. Neither approach supports fully decentralised federated learning at training time although ad-hoc solutions have been proposed to allow model parameters to be transferred to a new site at test time. The approach proposed in this paper attempts to address these shortcomings.

The HBR framework is a Bayesian modelling framework that assumes a generative model over the response variables $Y$, in the form of a likelihood $\mathcal{L}$ and a prior distribution $p_i$ for each parameter $\theta_i$ of the likelihood. A general model can be expressed as follows:

$$Y \sim \mathcal{L}(\theta_1, \ldots, \theta_P), \qquad \theta_i \sim p_i(\theta_i), \quad i \in \{1, \ldots, P\}$$

where $P$ is the number of parameters of the likelihood. One of the main objectives in Bayesian modelling is to retrieve the posterior, which is the distribution over the model parameters given the data, $p(\theta \mid Y)$. Bayes’ rule gives us an expression for the posterior, $p(\theta \mid Y) = p(Y \mid \theta)\,p(\theta)/p(Y)$, but this expression can only be evaluated analytically in special cases where the likelihood is conjugate to the prior, which is not the case in HBR. The posterior is therefore approximated by MCMC sampling, which we discuss in Appendix D. Assuming the response variable follows a Gaussian distribution, we substitute the Gaussian $N$ for $\mathcal{L}$, and our parameters $\theta$ become the mean $\mu$ and variance $\sigma^2$. The general model described in Equation 1 is fully capable of supporting non-Gaussian likelihoods, like the family of SHASH distributions, which is detailed in Section 2.1. The way we model each $p_i(\theta_i)$ depends on our further assumptions about the data. Typically $\mu$, and optionally also $\sigma$, are taken to be linear functions of a set of covariates, which in this example are clinical or demographic variables. This is further detailed in Appendix B, where we also introduce a framework for model specification.
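The hierarchical structure that HBR exploits for batch effects can be illustrated with a hypothetical toy sketch (plain NumPy, not pcntoolkit code): site-specific intercepts are drawn from a shared hyper-prior, so sites partially pool information instead of being modelled independently. All names and constants here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sketch of a hierarchical generative process: site-specific
# intercepts are drawn from a shared hyper-prior, so small sites can borrow
# statistical strength from the group-level distribution.
n_sites, n_per_site = 5, 1000
mu_hyper, sigma_hyper = 0.0, 1.0              # shared prior over site intercepts
slope, noise_sd = 0.05, 0.5                   # illustrative fixed effect and noise level

site_intercepts = rng.normal(mu_hyper, sigma_hyper, size=n_sites)
age = rng.uniform(20, 80, size=(n_sites, n_per_site))
y = site_intercepts[:, None] + slope * age + rng.normal(0.0, noise_sd, size=age.shape)
```

In the real model, the posterior over the hyper-parameters is what gets transferred between sites in the federated setting, rather than the data themselves.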

### 2.1 The SHASH distribution

To accurately fit normative models on non-Gaussian-distributed response variables, one could substitute a family of flexible distributions for $\mathcal{L}$ in Equation 1. This family must then contain positively as well as negatively skewed members, leptokurtic as well as platykurtic members, preferably with smooth transitions between them. All of these requirements are fulfilled by the family of SHASH distributions ($S$). Figures 1a and 1b show the flexibility of $S$ in modelling various distributional forms.

To generate samples from a SHASH distribution, one applies an inverse sinh-arcsinh transformation to samples from a standard Gaussian (Jones & Pewsey, 2009). The sinh-arcsinh transformation $\xi$ and its inverse $\xi^{-1}$ are defined as:

$$\xi_{\epsilon,\delta}(x) = \sinh\!\left(\delta \sinh^{-1}(x) - \epsilon\right), \qquad \xi^{-1}_{\epsilon,\delta}(x) = \sinh\!\left(\frac{\sinh^{-1}(x) + \epsilon}{\delta}\right)$$

where $\epsilon \in \mathbb{R}$ and $\delta \in \mathbb{R}^{+}$ govern the shape of the resulting distribution. Now, assuming a random variable $Z$ follows a standard Gaussian, we can construct samples $X$ that follow a SHASH distribution by applying $\xi^{-1}_{\epsilon,\delta}$ to $Z$^{1}:

$$X = \xi^{-1}_{\epsilon,\delta}(Z)$$

For the remainder of Section 2, $Z$ and $z$ will indicate Gaussian distributed samples, and $X$ and $x$ will indicate SHASH distributed samples.
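A minimal sketch of the transformation pair in NumPy may help fix the convention; `xi` follows the form $\sinh(\delta \sinh^{-1}(x) - \epsilon)$ that is consistent with the definition of $C_{\epsilon,\delta}$ used in the density. The function names are ours, not pcntoolkit API.

```python
import numpy as np

def xi(x, eps, delta):
    # sinh-arcsinh transformation: maps SHASH-domain values to the Gaussian domain
    return np.sinh(delta * np.arcsinh(x) - eps)

def xi_inv(z, eps, delta):
    # inverse transformation: maps standard-Gaussian samples to SHASH samples
    return np.sinh((np.arcsinh(z) + eps) / delta)

rng = np.random.default_rng(1)
z = rng.standard_normal(100_000)
x = xi_inv(z, 0.5, 1.2)   # SHASH-distributed samples with eps=0.5, delta=1.2
```

Applying `xi` to `x` recovers the Gaussian samples exactly, since the two functions are inverses.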

To derive the density of this SHASH distribution, we apply a change of variables: the density of the transformed samples is the density of the original samples multiplied by the Jacobian determinant of the inverse transformation (Bishop, 2006). This requires that the Jacobian determinant exists and is nonzero, which holds for any transformation that is a diffeomorphism^{2}. Because the transformation $\xi$ is a diffeomorphism, we obtain the density of $S$ by a change of variables: we multiply the density of the samples in the Gaussian domain, $\phi$ (which we find by applying $\xi$), by the derivative of $\xi_{\epsilon,\delta}$.

$$S(x;\, \epsilon, \delta) = \frac{\delta\, C_{\epsilon,\delta}(x)}{\sqrt{1 + x^2}}\; \frac{1}{\sqrt{2\pi}}\; \exp\!\left(-\frac{\xi_{\epsilon,\delta}(x)^2}{2}\right)$$

where $C_{\epsilon,\delta}(x) = \cosh\!\left(\delta \sinh^{-1}(x) - \epsilon\right)$. Figures 1 and 2 illustrate the effect of the parameters $\epsilon$ and $\delta$ on the shape of the SHASH density. We clearly see that $\epsilon$ and $\delta$ modulate the skew and kurtosis, respectively.

A constraint on the parameter $\delta$ can be derived from this result. A probability density must be non-negative everywhere. The only term in Equation 3 that can possibly be negative is the left fraction, because the middle fraction is a positive constant and the right exponential is strictly positive for all real inputs^{3}. The left fraction is negative if exactly one of $\delta$ or $C_{\epsilon,\delta}$ is negative, and since $\cosh(\alpha) > 0$ for all $\alpha \in \mathbb{R}$, the only way to obtain a negative density is to have $\delta < 0$. From this we derive the constraint $\delta > 0$.
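As a sanity check on the change-of-variables result, the density can be evaluated numerically and shown to integrate to 1. This is a sketch under the same transformation convention as the text; `shash_pdf` is an illustrative name, not a pcntoolkit function.

```python
import numpy as np

def shash_pdf(x, eps, delta):
    # change of variables: standard-normal density at xi(x), times the
    # derivative of xi, with xi(x) = sinh(delta * arcsinh(x) - eps)
    arg = delta * np.arcsinh(x) - eps
    jac = delta * np.cosh(arg) / np.sqrt(1.0 + x**2)   # d/dx xi(x)
    return jac * np.exp(-0.5 * np.sinh(arg)**2) / np.sqrt(2.0 * np.pi)

# crude numerical check that the density normalises to 1
grid = np.linspace(-50.0, 50.0, 400_001)
dx = grid[1] - grid[0]
mass = shash_pdf(grid, eps=0.5, delta=1.2).sum() * dx
```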

#### 2.1.1 Reparameterisation

The SHASH distribution we have seen so far has two parameters, $\epsilon$ and $\delta$, but additional location and scale parameters are necessary to achieve the desired flexibility. Jones and Pewsey (2009) suggest adding scale and location by multiplication with $\sigma$ and addition of $\mu$. The resulting distribution is known as the SHASH$_o$ or $S_o$ distribution.

Assuming a random variable $Z$ follows a standard Gaussian, we can construct samples $\Omega$ that follow an $S_o$ distribution by applying $\xi^{-1}_{\epsilon,\delta}$ to $Z$, multiplying by $\sigma$, and adding $\mu$:

$$\Omega = \mu + \sigma\, \xi^{-1}_{\epsilon,\delta}(Z)$$

Equivalently, because of Equation 2, we can write:

$$\Omega = \mu + \sigma X$$

And because we already know the density of $X$ to be $S$, we can use the chain rule in conjunction with the change of variables to arrive at the density of $S_o$:

$$S_o(x;\, \mu, \sigma, \epsilon, \delta) = \frac{1}{\sigma}\, S\!\left(\frac{x - \mu}{\sigma};\, \epsilon, \delta\right)$$

where $\epsilon$ and $\delta$ are parameters that govern the higher-order moments of the distribution (explained below). While we believe this family is flexible enough for the purposes of modelling a wide variety of neuroimaging phenotypes, inferring the parameters of this distribution is highly challenging because of their strong correlations^{4}. Figures 1a and 2a illustrate the correlation between $\mu$ and $\epsilon$: clearly, $\epsilon$ controls both the skew and the mean, but $\mu$ also controls the mean. Similarly, in Figures 1b and 2b, we see that $\delta$ controls the kurtosis as well as the variance, while $\sigma$ also controls the variance. Writing down the first two moments of the $S_o$ distribution makes this correlation explicit. Because the samples $\Omega$ from SHASH$_o$ are simply scaled and shifted samples $X$ from SHASH (see Equation 5), we can express the moments of $\Omega$ in terms of the moments of $X$ as:

$$\mathbb{E}[\Omega] = \mu + \sigma\, \mathbb{E}[X], \qquad \operatorname{var}[\Omega] = \sigma^2 \operatorname{var}[X] = \sigma^2 \left(\mathbb{E}[X^2] - \mathbb{E}[X]^2\right)$$

To break the correlation between $\mu$ and $\epsilon$ and between $\sigma$ and $\delta$, we propose to use the expressions for $\mathbb{E}[X]$ and $\mathbb{E}[X^2]$ provided by Jones and Pewsey (2009) to derive a new density. They give an analytical expression for the $r$’th non-central moment, given here in Equation 8. For derivations of these quantities, please refer to Jones and Pewsey (2009).

$$m_{\epsilon,\delta}(r) = \mathbb{E}\!\left[X^r\right] = \frac{1}{2^r} \sum_{k=0}^{r} \binom{r}{k} (-1)^{k}\, e^{(r - 2k)\epsilon/\delta}\, P_{(r - 2k)/\delta}$$

for $r \in \mathbb{N}$, where

$$P_q = \frac{e^{1/4}}{\sqrt{8\pi}} \left[K_{(q+1)/2}\!\left(\tfrac{1}{4}\right) + K_{(q-1)/2}\!\left(\tfrac{1}{4}\right)\right]$$

and $K$ is the modified Bessel function of the second kind (Bowman, 1939).
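The moment expressions can be checked against Monte Carlo estimates. The sketch below assumes the binomial-expansion form of the $r$’th moment and uses SciPy’s `kv` for the modified Bessel function; the helper names are ours, not from the paper’s code.

```python
import numpy as np
from scipy.special import binom, kv   # kv: modified Bessel function of the second kind

def P(q):
    # Bessel-function term appearing in the moment expression
    return (np.exp(0.25) / np.sqrt(8.0 * np.pi)) * (kv((q + 1) / 2, 0.25) + kv((q - 1) / 2, 0.25))

def shash_moment(r, eps, delta):
    # r'th non-central SHASH moment via the binomial expansion of sinh(.)**r
    ks = np.arange(r + 1)
    terms = binom(r, ks) * (-1.0) ** ks * np.exp((r - 2 * ks) * eps / delta) * P((r - 2 * ks) / delta)
    return terms.sum() / 2.0**r

rng = np.random.default_rng(2)
z = rng.standard_normal(1_000_000)
x = np.sinh((np.arcsinh(z) + 0.5) / 1.2)   # SHASH(eps=0.5, delta=1.2) samples
m1, m2 = shash_moment(1, 0.5, 1.2), shash_moment(2, 0.5, 1.2)
```

For $\epsilon = 0$, $\delta = 1$, the SHASH reduces to a standard Gaussian, so the first two moments should be 0 and 1, which is a quick way to validate the implementation.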

The solution proposed here is to first apply a standardising shift and scale to the SHASH samples, such that their mean and variance are 0 and 1, respectively. Then, we apply the location and scale parameters $\mu$ and $\sigma$ as before to obtain samples $B$. Because the moments of $S$ are all obtainable via Equation 8, this is a simple operation. We define the newly designed transformation as $\lambda$ and the corresponding density as SHASH$_b$ or $S_b$:

$$B = \lambda_{\epsilon,\delta,\mu,\sigma}(Z) = \mu + \sigma\, \frac{\xi^{-1}_{\epsilon,\delta}(Z) - m_{\epsilon,\delta}(1)}{\eta_{\epsilon,\delta}}$$

where $\eta^2_{\epsilon,\delta}$ is the central variance given by $\eta^2_{\epsilon,\delta} = m_{\epsilon,\delta}(2) - \left(m_{\epsilon,\delta}(1)\right)^2$. We can again use the chain rule in conjunction with the change of variables to arrive at the density of $S_b$:

$$S_b(x;\, \mu, \sigma, \epsilon, \delta) = \frac{\eta_{\epsilon,\delta}}{\sigma}\, S\!\left(m_{\epsilon,\delta}(1) + \eta_{\epsilon,\delta}\, \frac{x - \mu}{\sigma};\, \epsilon, \delta\right)$$

Writing down the first and second moments of $B$, the expectations that appeared in Equations 6 and 7 now cancel with the moments that we added in Equation 10, and the correlation breaks completely:

$$\mathbb{E}[B] = \mu, \qquad \operatorname{var}[B] = \sigma^2$$

The new parameterisation has nicely interpretable parameters for location and scale, which $S_o$ does not. Compare Figures 1a and 1b with 1c and 1d: the densities in the two rightmost figures all have zero mean and unit variance. This may not seem obvious for the density with $\delta = 0.4$ in Figure 1d, but the tail behaviour responsible for this becomes clearer from the inset. In Appendices G and H, the empirical moments of the $S_o$ and $S_b$ distributions are plotted over a range of parameters. It is important to note that the $S_b$ family is isomorphic to $S_o$: they model exactly the same set of distributions, just under a different parameterisation.
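The standardisation underlying $S_b$ can be verified empirically: shifting and scaling SHASH samples by their analytic mean and standard deviation, then applying $\mu$ and $\sigma$, should yield samples with mean $\mu$ and standard deviation $\sigma$. A sketch, with the closed-form first two moments taken from Jones and Pewsey (2009) (treat the exact constants as an assumption):

```python
import numpy as np
from scipy.special import kv

def P(q):
    # Bessel-function term from the closed-form SHASH moments
    return (np.exp(0.25) / np.sqrt(8.0 * np.pi)) * (kv((q + 1) / 2, 0.25) + kv((q - 1) / 2, 0.25))

eps, delta, mu, sigma = 0.5, 1.2, 3.0, 2.0

# analytic first two non-central moments of SHASH, and the central sd eta
m1 = np.sinh(eps / delta) * P(1.0 / delta)
m2 = 0.5 * (np.cosh(2.0 * eps / delta) * P(2.0 / delta) - 1.0)
eta = np.sqrt(m2 - m1**2)

rng = np.random.default_rng(3)
x = np.sinh((np.arcsinh(rng.standard_normal(1_000_000)) + eps) / delta)
b = mu + sigma * (x - m1) / eta   # SHASH_b samples: standardise, then shift and scale
```

This is exactly why $\mu$ and $\epsilon$ (and $\sigma$ and $\delta$) decorrelate: the location and scale of $B$ no longer depend on $\epsilon$ and $\delta$ at all.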

#### 2.1.2 Priors on $Sb$ parameters

The $S_b$ reparameterisation greatly reduces the problem of correlation in parameter space, but it comes at a cost. Some of the reparameterisation terms become extremely large very quickly, and may therefore lead to numerical instability. This may be addressed by appropriately setting the priors for the $S_b$ likelihood. Comparing Figures 2b and 2d, the warp in the low domain of $\delta$ is apparent. The stretching effect of the change of variables is enormous in those regions, because it has to compensate for the flattening effect of $\delta$ visible in Figure 1b in order to standardise the SHASH distribution. The curve in the line associated with $\delta = 0.1$ is barely visible in this range. The variance of $X_{\epsilon,\delta}$ as a function of $\delta$ is shown on a double log scale in Figure 3. Notice that small differences in the lower domain of $\delta$ amount to substantial differences in the variance. The enormous magnitude of $\frac{d}{d\delta} m_{\epsilon,\delta}(2)$ in those regions is mainly due to the behaviour of the Bessel function $K_n(z)$ in Equation 9, and to a lesser degree due to the fraction in the exponent in Equation 8. In practical applications, values of $\delta < 0.3$ are thus associated with numerical instability, and sampling in those regions should be avoided, at the cost of losing some expressivity. In our models, we enforce this constraint by first enforcing positivity of the samples with the softplus function and then adding a small constant of 0.3. The softplus function is defined as:

$$\operatorname{softplus}(x) = \log\left(1 + e^{x}\right)$$

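A sketch of the constraint as we use it, combining the scaled softplus of Table 1 with the added constant; the function names are illustrative:

```python
import numpy as np

def softplus(x):
    # numerically stable log(1 + exp(x))
    return np.logaddexp(0.0, x)

def constrain_delta(delta_raw):
    # positivity via a softplus scaled by 10 (near-identity between 0 and 1),
    # then the constant 0.3 added to keep the sampler out of the unstable region
    return softplus(10.0 * delta_raw) / 10.0 + 0.3

d = constrain_delta(np.array([-5.0, 0.0, 1.0]))   # all values end up at or above 0.3
```

The factor 10 makes the map behave almost linearly for positive inputs, so the sampled value retains an interpretable scale after the transformation.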
### 2.2 Experiments

#### 2.2.1 Dataset 1: Lifespan normative modelling of image derived phenotypes

To allow a meaningful comparison with current methods, we adapt and use the data from Rutherford et al. (2022). This is a large neuroimaging dataset collected from 82 different scanners, containing 58,834 control subjects and 2,925 patient subjects^{5}. Measures of cortical thickness and subcortical volume were extracted with FreeSurfer version 6.0, using image derived phenotypes derived either from the Destrieux cortical parcellation or from the subcortical segmentation. Furthermore, the different UK Biobank datasets (Sudlow et al., 2015) were merged: based on prior work (Fraza et al., 2021; Kia et al., 2022), we have observed that site effects in the UKB cohort are minimal for these FreeSurfer measures derived from the different scanners used (due to careful harmonisation of acquisition parameters and scanning hardware across this cohort), so we can safely consider them to be drawn from a single site. Counts per site and sex can be found in Appendix F. For full details about the data, we refer to Rutherford et al. (2022). Most data in this dataset are publicly available, but for several of the clinical samples, we do not have explicit permission for data sharing. Such permission was not obtained for the data from the University of Michigan, and these data were therefore not included in this study (n = 1394: 1149 controls, 245 patients).

The family of SHASH distributions has the flexibility to model skewed and kurtotic distributions, but also has the standard Gaussian as a central member. We pick some “hard” (i.e., difficult) phenotypes, the age-conditioned distributions of which are not described well by a Gaussian, and some “easy” phenotypes, the age-conditioned distributions of which are described well by a Gaussian. We aim to model both easy and hard phenotypes with the same model, using the same hyperparameters. The selected hard phenotypes are: right cerebellum white matter, because it is slightly skewed; right lateral ventricle, because it is moderately skewed and heteroskedastic; and white matter hypointensities, because it is severely skewed and heteroskedastic. For easy phenotypes, we selected the estimated total intracranial volume, the thickness of the Jensen sulcus, and the volume of the brain stem. These “easy” phenotypes do not show a lot of skew or kurtosis, but the mean of the thickness of the Jensen sulcus shows an interesting trajectory (see Fig. I.26). The distributions of the different phenotypes are shown in Figure 6 and in Appendix I. In some instances, we used the following abbreviations for the “hard” phenotypes: Right-Cerebellum-White-Matter: RCWM, Right-Lateral-Ventricle: RLV, WM-hypointensities: WMH; and similarly for the “easy” phenotypes: EstimatedTotalIntraCranialVol: ETICV, rh_S_interm_prim-Jensen_thickness: RIPJ, Brain-Stem: BS.

A stratified 10-fold split was made using the scikit-learn package (Pedregosa et al., 2011); to this end, we filtered out data from sites with fewer than 10 subjects of either sex. This ensures that the test folds contain at least one unique sample for each batch effect. As preprocessing, all data were feature-wise normalised by subtracting the mean and dividing by the standard deviation. Normalisation is not a necessity, but it allowed the same hyperparameters to be used for different response variables, which would otherwise not be possible because the scales of the response variables differ by several orders of magnitude. Sampling performance could be hampered if the prior distribution differs greatly from the likelihood.
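The preprocessing step amounts to column-wise z-scoring, which can be sketched as follows (toy data; in practice the statistics would be computed on the training fold only):

```python
import numpy as np

rng = np.random.default_rng(4)
Y = rng.gamma(shape=2.0, scale=3.0, size=(500, 6))   # toy response matrix (subjects x phenotypes)

# feature-wise normalisation: subtract the mean and divide by the standard
# deviation of each column
mean, std = Y.mean(axis=0), Y.std(axis=0)
Y_norm = (Y - mean) / std
```

Note that z-scoring only rescales the data; it does not remove skew or kurtosis, which is exactly why the SHASH likelihood remains necessary after normalisation.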

#### 2.2.2 Batch effects

Age and sex were used as covariates, and site was used as a batch effect. The HBR framework supports multidimensional batch effects, allowing both sex and site to be used as batch effects and allowing site effects to be modelled in a single-step regression together with effects of interest. However, there is little reason to assume that sex-related differences vary much between sites, and such a multidimensional batch effect results in a larger number of model parameters. In practice, we saw no difference in performance between models that used sex as a batch effect and those that used sex as a covariate. For further discussion, see Section 4.

#### 2.2.3 Dataset 2: Modelling latent representations derived from an autoencoder

As a second, more challenging test of the flexibility of our model, we fit a normative model on a highly non-linear representation of brain data, derived from the latent representation learned by a convolutional autoencoder. Full details are provided in Zabihi et al. (2022), but briefly: data derived from the UK Biobank sample (n = 20,781, 47% female) were compressed to a 2-dimensional latent representation using a pipeline consisting of a conditional convolutional autoencoder with a 100-dimensional latent space (Kingma & Welling, 2013), followed by a UMAP dimensionality reduction (McInnes et al., 2018). The two-step pipeline was constructed because the intermediate representation produced by the autoencoder would otherwise not be reflective of age and sex; it would only reconstruct the data. In this case, it was desirable to have a latent representation conditioned on these variables. The resulting representation followed a distribution with a nonlinear trajectory with respect to age in the mean, variance, and skew. A stratified train/test (75%/25%) split was made at the raw data level, specifically not at the 100-dimensional latent representation (see Fig. 9). This means that the same training data were used for training the convolutional autoencoder as well as the UMAP embedding, and all of the test data were held out from training. Following the rationale described above, no batch effects were modelled. Again, we used age and sex as covariates.

#### 2.2.4 MCMC samplers

We employed a No-U-Turn sampler for inference, implemented within PyMC version 5.4.1, using two chains per model. Chains were run for 1500 iterations, with 500 samples used as burn-in. Convergence checks were performed using the $\hat{R}$ statistic. Further details about the theory behind MCMC inference are provided in Appendix D. All scripts necessary for running the analyses conducted in this paper are freely available via GitHub.
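PyMC computes $\hat{R}$ automatically, but the basic (non-split) form of the statistic is simple enough to sketch in NumPy, which also illustrates what a failed convergence check looks like. This is an illustrative implementation, not the exact diagnostic used by PyMC (which uses a rank-normalised split variant):

```python
import numpy as np

def rhat(chains):
    # basic (non-split) Gelman-Rubin potential scale reduction factor;
    # chains has shape (n_chains, n_draws)
    m, n = chains.shape
    W = chains.var(axis=1, ddof=1).mean()      # mean within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)    # between-chain variance
    var_plus = (n - 1) / n * W + B / n
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(5)
good = rng.standard_normal((2, 1500))          # two well-mixed chains
bad = good + np.array([[0.0], [5.0]])          # chains stuck around different modes
```

Well-mixed chains yield $\hat{R}$ very close to 1, while chains exploring different modes inflate the between-chain variance and push $\hat{R}$ well above the usual acceptance threshold of about 1.01.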

#### 2.2.5 Ethics statement

All participants provided written informed consent for their data to be used, and all studies were approved by the institutional review boards at the respective institutions.

## 3 Results

In Section 3.1, the convergence of the MCMC samplers utilised by the HBR models is assessed.

We evaluate four variants of the HBR method. The first uses a simple Gaussian likelihood, where $\mu$ and $\sigma$ are modelled exactly like $\mu$ and $\sigma$ in Appendix B.3; this model is indicated with an $N$. Model $N$ was previously the limit of what could be done within HBR in the pcntoolkit. We also evaluate a model with an $S_o$ likelihood and two variants of the $S_b$ likelihood, which differ only in the way $\epsilon$ and $\delta$ are modelled. In the first model, $S_{b1}$, both $\epsilon$ and $\delta$ assume constant values throughout the full range of covariates. In model $S_{b2}$, both $\epsilon$ and $\delta$ are linear regressions on the design matrix $\Phi$, which in our case contains a 5-knot B-spline basis expansion (Boor, 1978) of the age covariate, concatenated with the age and sex covariates. These models are formalised in Table 1, using the notation detailed in Appendix B. Details about the priors can be found in Table 2. Full specifications of the generative models can be found in Appendix C.

| $N$ | $S_{o1}$ | $S_{b1}$ | $S_{b2}$ |
|---|---|---|---|
| $\mu_n \sim \mathcal{M}_{3.1.2b}$ | $\mu_n \sim \mathcal{M}_{3.1.2b}$ | $\mu_n \sim \mathcal{M}_{3.1.2b}$ | $\mu_n \sim \mathcal{M}_{3.1.2b}$ |
| $\sigma_n^{\pm} \sim \mathcal{M}_{3.1.1}$ | $\sigma_n^{\pm} \sim \mathcal{M}_{3.1.1}$ | $\sigma_n^{\pm} \sim \mathcal{M}_{3.1.1}$ | $\sigma_n^{\pm} \sim \mathcal{M}_{3.1.1}$ |
| $\sigma_n = \mathrm{softplus}(\sigma_n^{\pm})$ | $\sigma_n = \mathrm{softplus}(\sigma_n^{\pm})$ | $\sigma_n = \mathrm{softplus}(\sigma_n^{\pm})$ | $\sigma_n = \mathrm{softplus}(\sigma_n^{\pm})$ |
| $y_n \sim N(\mu_n, \sigma_n)$ | $\epsilon \sim \mathcal{M}_{1}$, $\delta^{\pm} \sim \mathcal{M}_{1}$ | $\epsilon \sim \mathcal{M}_{1}$, $\delta^{\pm} \sim \mathcal{M}_{1}$ | $\epsilon \sim \mathcal{M}_{3.1.1}$, $\delta^{\pm} \sim \mathcal{M}_{3.1.1}$ |
| | $\delta = \mathrm{softplus}(10 \cdot \delta^{\pm})/10$ | $\delta = \mathrm{softplus}(10 \cdot \delta^{\pm})/10$ | $\delta = \mathrm{softplus}(10 \cdot \delta^{\pm})/10$ |
| | $y_n \sim S_o(\mu_n, \sigma_n, \epsilon, \delta + 0.3)$ | $y_n \sim S_b(\mu_n, \sigma_n, \epsilon, \delta + 0.3)$ | $y_n \sim S_b(\mu_n, \sigma_n, \epsilon, \delta + 0.3)$ |


For all $n$, $n \in \{1, ..., N\}$. The softplus function was scaled down by a factor of 10 to force more linear behaviour between 0 and 1. More details about the priors can be found in Table 2. The notation $\mathcal{M}a.b.c$ is used to denote different generative models, the details of which can be found in Appendix C.
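The scaled softplus used in the tables can be sketched as follows (a minimal illustration with numpy, not the pcntoolkit implementation; the function name is ours). Scaling the input up and the output down by the same factor sharpens the softplus so that it behaves almost linearly on $[0, 1]$ while remaining positive:

```python
import numpy as np

def scaled_softplus(x, scale=10.0):
    """Softplus scaled by `scale` so it behaves more linearly between 0 and 1."""
    return np.log1p(np.exp(scale * x)) / scale

x = np.array([-1.0, 0.0, 0.5, 1.0, 2.0])
print(scaled_softplus(x))        # closely tracks max(0, x) away from 0
print(scaled_softplus(x) + 0.3)  # delta is additionally offset by 0.3 to avoid the unstable low range
```

For positive inputs the result is nearly the identity, which is why the prior modes discussed below survive the transformation approximately unchanged.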

| Parameter | Prior | $N$ | $So1$ | $Sb1$ | $Sb2$ |
|---|---|---|---|---|---|
| $w_\mu$ | $N(w_\mu \mid 0_D, I_D)$ | X | X | X | X |
| $\mu_{\tau_\mu}$ | $N(\mu_{\tau_\mu} \mid 0, 1)$ | X | X | X | X |
| $\sigma_{\tau_\mu}$ | $N^+(\sigma_{\tau_\mu} \mid 1)$ | X | X | X | X |
| $\nu_{\tau_\mu}$ | $N(\nu_{\tau_\mu} \mid 0, 1)$ | X | X | X | X |
| $w_\sigma^{\pm}$ | $N(w_\sigma^{\pm} \mid 0_D, I_D)$ | X | X | X | X |
| $\tau_\sigma^{\pm}$ | $N(\tau_\sigma^{\pm} \mid 1, 1)$ | X | X | X | X |
| $\epsilon$ | $N(\epsilon \mid 0, 1)$ | | X | X | |
| $w_\epsilon$ | $N(w_\epsilon \mid 0_D, 0.2 \cdot I_D)$ | | | | X |
| $\tau_\epsilon$ | $N(\tau_\epsilon \mid 0, 0.2)$ | | | | X |
| $\delta^{\pm}$ | $N(\delta^{\pm} \mid 1, 1)$ | | X | X | |
| $w_\delta$ | $N(w_\delta \mid 0_D, 0.2 \cdot I_D)$ | | | | X |
| $\tau_\delta$ | $N(\tau_\delta \mid 1, 0.3)$ | | | | X |


Note that parameters with a $\pm$ superscript are mapped by a softplus function post-sampling to enforce positivity, as displayed in Table 1. $N^+$ indicates the positive Normal distribution. $0_D$ and $I_D$ denote a $D$-dimensional zero vector and identity matrix, respectively. Central values were chosen for most priors, with $\delta$ being a notable exception. As discussed before, a low $\delta$ leads to numerical issues, and by setting up the priors like this, we aim to push the sampler away from the low domain. Recall that we also add a constant of 0.3 to the result of applying the softplus function to $\delta^{\pm}$. The prior on $\tau_\sigma$ is centred at 1 because, due to standardisation, we expect $\sigma$ to lie close to 1 on average, and since the softplus approximates the identity function for positive values, the mode of the transformed value should also lie close to 1. We set the prior on the weights of $\mu$ a little wider than the others because we found that this improved the regression; making the priors any wider than this did not lead to significantly different results.

In Section 3.2, we compare different types of HBR normative models with one another and with warped Bayesian linear regression (W-BLR; Fraza et al., 2021), which is used as a baseline. The models used in Section 3.3 are the same as in Section 3.2, just applied to different data.

Note that the results in Section 3.2 are based on MCMC estimates of z-scores (see Appendix E). The results in Appendix C are based on MAP estimates, which were chosen because they are less computationally demanding and because, in practice, the MCMC results were only marginally better.

A main objective in normative modelling is to map observations to z-scores in such a way that healthy variation within those z-scores follows a Gaussian distribution, and that patients fall in the outer centiles of variation. Gaussianity is assessed by the third and fourth moments (skew and kurtosis) of the predicted z-scores. The qq-plots, together with a visual inspection of the regression curves, further help assess the centiles.

### 3.1 Convergence analysis

Here we perform a simple analysis of the samplers by analysing the trajectory of the Gelman-Rubin, or $\hat{R}$, statistic (Brooks & Gelman, 1998), which can be thought of as the ratio between the inter-chain and the intra-chain variance. The $\hat{R}$ statistic is 1 (one) when those are equal, which is a good indicator that the sampler has converged. In practice, a rule of thumb is that an $\hat{R}$ lower than 1.1 is no reason to worry, and an $\hat{R}$ below 1.05 indicates good convergence. To compute the $\hat{R}$ statistic, at least two chains need to be sampled. We sampled two chains of 1500 samples using the pcntoolkit; the first 500 samples were used for tuning and discarded, and the last 1000 samples of each chain were stored and used for further analysis.
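The statistic can be computed from stored chains as follows. This is a minimal numpy sketch of the classic Gelman-Rubin formula (without the split-chain refinement used by modern samplers; in practice, libraries such as arviz report $\hat{R}$ automatically):

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor R-hat for an (m, n) array of m chains.

    Compares between-chain and within-chain variance; values near 1
    indicate convergence (rule of thumb: < 1.1 acceptable, < 1.05 good).
    """
    m, n = chains.shape
    within = np.mean(np.var(chains, axis=1, ddof=1))
    between_over_n = np.var(np.mean(chains, axis=1), ddof=1)
    var_plus = (n - 1) / n * within + between_over_n
    return np.sqrt(var_plus / within)

rng = np.random.default_rng(0)
mixed = rng.normal(size=(2, 1000))            # two well-mixed chains
stuck = np.stack([rng.normal(0, 1, 1000),
                  rng.normal(3, 1, 1000)])    # chains stuck in different modes
print(gelman_rubin(mixed))   # close to 1
print(gelman_rubin(stuck))   # well above 1.1
```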

In Figure 4, the $\hat{R}$ statistics of $\epsilon$ (unbroken line) and $\delta$ (broken line) are displayed as a function of chain length. The results are averaged over the 10 folds. Overall, the convergence of the $So1$, $Sb1$, and $Sb2$ models is similar on these phenotypes. Except for the WM-hypointensities phenotype, which is very hard to model, all samplers converge with $\hat{R}$ very close to 1 within at most 300 iterations.

Since WM-hypointensities is a strictly positive measure, we also evaluate the convergence of the samplers on a log-transformed version of this phenotype, which shows much better convergence.

Note also that the $So$ sampler did not converge for the most difficult phenotype reported in Section 3.3, so we use the $Sb$ parameterisation in that case.

### 3.2 Goodness-of-fit analysis

First, we discuss how the HBR with the $Sb$ likelihoods compare to the W-BLR method, and the HBR with the Gaussian likelihood. We assess (i) how well the predicted z-scores follow a Gaussian distribution, by looking at the higher-order moments of the z-scores and at qq-plots; (ii) how good the fit of the percentile lines is visually, by superimposing them on the data; and (iii) how well the models have captured the batch effects, by looking at classification performance. Specifically: for any combination of two sites, we computed the area-under-the-curve (AUC) metric (Hastie et al., 2009) by using the z-scores as classification thresholds, and the batch-indices as labels.

Table 3 forms the first basis of our discussion. For convenience, it is useful to juxtapose Figure 6 or Appendix I, which visualise the phenotypes discussed here. The skew and kurtosis are the third and fourth standardised moments, respectively (Casella & Berger, 2021). As the name suggests, skew describes how much the data are skewed about the mean. The kurtosis measures the tailedness of the distribution. A spike distribution has a kurtosis of 0, and a standard Gaussian has a kurtosis of 3. For this reason, the kurtosis values reported here are reduced by 3, which is known as the *excess kurtosis*. For both measures, the ideal result is thus 0, which suggests a symmetric distribution with tails as slim as those of a Gaussian.
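Both moments can be computed directly from predicted z-scores; a minimal sketch with scipy on synthetic data (the `fisher=True` flag subtracts 3, yielding the excess kurtosis reported in Table 3):

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(42)
z = rng.normal(size=100_000)            # ideal z-scores: standard Gaussian
print(skew(z))                          # close to 0
print(kurtosis(z, fisher=True))         # excess kurtosis, close to 0

skewed = np.exp(z)                      # log-normal: heavy right tail
print(skew(skewed), kurtosis(skewed, fisher=True))  # both clearly positive
```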

| Feature | W-BLR (skew) | $N$ (skew) | $So$ (skew) | $Sb1$ (skew) | $Sb2$ (skew) | W-BLR (ex. kurt.) | $N$ (ex. kurt.) | $So$ (ex. kurt.) | $Sb1$ (ex. kurt.) | $Sb2$ (ex. kurt.) |
|---|---|---|---|---|---|---|---|---|---|---|
| RCWM | 0.06 | 1.34 | 0.14 | 0.14 | 0.14 | 0.57 | 6.05 | 0.50 | 0.50 | 0.50 |
| RLV | 0.32 | 2.74 | 0.17 | 0.17 | 0.13 | 0.89 | 26.39 | 1.40 | 1.08 | 0.56 |
| WMH | 0.34 | 6.39 | 0.33 | 0.34 | 0.33 | 0.50 | 79.03 | 1.54 | 1.52 | 1.39 |
| ETICV | -0.04 | 0.04 | -0.01 | -0.01 | -0.01 | 0.71 | 1.34 | 0.32 | 0.32 | 0.30 |
| RIPJ | 0.03 | 0.29 | 0.0 | 0.0 | 0.0 | 0.30 | 1.17 | 0.12 | 0.12 | 0.12 |
| BS | -0.0 | 0.31 | 0.02 | 0.02 | 0.03 | 0.52 | 0.91 | 0.25 | 0.25 | 0.24 |
| logWMH | 0.04 | 0.89 | 0.05 | 0.05 | 0.06 | 0.16 | 2.07 | 0.16 | 0.16 | 0.12 |


Smaller is better. Note that the most direct comparison is between W-BLR, $So$, and $Sb1$. The bold figures on the left of the vertical line show the best-performing method among these methods. The $Sb2$ models are more flexible by construction and have an additional random effect; phenotypes that are bold on the right side of the line are those that benefit from this additional flexibility. "Hard" phenotypes: RCWM, RLV, (log)WMH. "Easy" phenotypes: ETICV, RIPJ, BS. Abbreviations: RCWM: Right-Cerebellum-White-Matter; RLV: Right-Lateral-Ventricle; WMH: White-Matter-Hypointensities; ETICV: Estimated Total Intracranial Volume; RIPJ: thickness of the right sulcus intermedius primus of Jensen; BS: Brain Stem; logWMH: log-transformed White-Matter-Hypointensities. See also Section 2.2.

The best results (those closest to 0) are printed in bold font, but note that the most direct comparison is between W-BLR, $So$, and $Sb1$. The bold figures on the left of the vertical line that separates W-BLR, $So$, and $Sb1$ from $Sb2$ show the best-performing method among these. The $Sb2$ models are more flexible by construction and have an additional random effect; phenotypes (rows) that are bold on the right side of that vertical line are those that benefit from this additional flexibility. For most phenotypes, $So$ or $Sb1$ gives the best result, with a few benefiting from the additional flexibility introduced by $Sb2$, although the difference with other methods is sometimes small. The two moments also need to be weighted differently. For instance, for the Brain-Stem phenotype, W-BLR has only a marginally smaller skew than $So$, but $So$, $Sb1$, and $Sb2$ have a substantially smaller kurtosis. It is important to remember that the kurtosis is a sum of fourth powers while the skew is a sum of third powers, so deviations generally have a larger effect on the kurtosis than on the skew. For the WM-hypointensities phenotype, the W-BLR method performs on par with the $S$ models in terms of skew, and better in terms of kurtosis, indicating a better fit on the phenotypes that are also most difficult to model. W-BLR has inherent batch effects in all aspects of the predicted percentiles, possibly explaining why it fits better to these hard phenotypes; this is further discussed in Section 3.2.1. We emphasise that the comparison is not completely straightforward, the HBR models being more conservative, as described in detail in Section 3.2.1. Despite that, the numbers indicate a fit that is overall better than W-BLR.

Figure 5 further illustrates the good performance of the $So$ and $Sb$ models relative to the $N$ model. The three hard phenotypes are not modelled well at all by the $N$ model. Overall, the $So$ models perform comparably to W-BLR, except for Right-Lateral-Ventricle, where the W-BLR line looks slightly better. Note that this figure shows only a randomly selected cross-validation fold, whereas the statistics in Table 3 are averaged across all folds. Note also that the log transform does not by itself adequately model non-Gaussianity in the WM-hypointensities phenotype, indicating that while it addresses non-negativity, it is still necessary to combine this transformation with a principled approach for modelling non-Gaussianity, such as afforded by the SHASH distribution.

Figure 6 visualises the centiles of the fitted HBR models on hard phenotypes: Right-Lateral-Ventricle, WM-hypointensities, and the log-transformed WM-hypointensities. For brevity, we show only the $Sb2$ model; the fitted centiles for all models and phenotypes (i.e., including $So$ and $Sb1$) are shown in the appendix. Overall, this figure shows that the non-Gaussian models fit the data better. Focusing on area A, the W-BLR and $Sb$ models fit the data well, but the fitted centiles of the Gaussian model (b) cannot accurately model the centiles in the lower range, and many extend implausibly into the negative range (as also in e). Focusing on area B, and realising that this phenotype is a volume and can therefore never be negative, the problem is obvious. Focusing on area C, we see that the W-BLR model must severely warp the input space to be able to accurately model the data at this point. For models (d) and (f), the 0.1th fitted centile is still estimated to run into the negative domain, towards the range of the data points, meaning that the data are not modelled accurately in those regions. As noted above, this motivates applying the log transformation to the data before fitting, which enforces positivity (g-i) and prevents the fitted centiles from becoming negative toward the limits of the range of the input variables. However, it is still necessary to model non-Gaussianity: focusing on area D, we see that the Gaussian model applied to the log-transformed data has a poor fit in these regions, while the other two models perform well. These visualisations indicate that the SHASH transformation fits the data better and accommodates site effects better than a Gaussian model.

Another important aspect of the HBR method is the way it deals with batch effects. In the W-BLR method, one-hot encoded^{6} batch vectors are appended to the design matrix (effectively treating site as a fixed effect), which causes the batch effects to directly or indirectly affect the mean, variance, skew, and kurtosis. In HBR, by contrast, batch effects only affect those aspects of the likelihood for which the controlling parameter is explicitly given a random effect. By having a batch-specific offset in $\mu$, sampled as a deviation from a group mean, the HBR models' predictions are ultimately indistinguishable between batches. To assess the degree to which the predictions are indistinguishable between batches, we compare Area Under the Receiver-Operating-Characteristic curve (AUC) scores (Hastie et al., 2009), which vary between 0 and 1, in a 1v1 setting. Only site was used as a batch effect in our models, so the batch effects correspond exactly to the site ids. For each combination of two sites, we compute the AUC score using the sklearn.metrics.roc_auc_score method, using the binary class labels as y_true and the z-score predictions as y_pred. Because an AUC score of 0.5 indicates perfect indistinguishability, we plot the absolute deviation from the 0.5 mark in Figure 7.
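The pairwise procedure can be sketched as follows. The z-scores and site ids below are synthetic stand-ins for the model's predicted z-scores and the site labels of the held-out subjects; with random data the mean deviation from 0.5 should be small, mirroring the quantity plotted in Figure 7:

```python
from itertools import combinations

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
z_scores = rng.normal(size=300)           # stand-in for predicted z-scores
site_ids = rng.integers(0, 3, size=300)   # stand-in for site labels

deviations = []
for a, b in combinations(np.unique(site_ids), 2):
    mask = (site_ids == a) | (site_ids == b)
    # z-scores as classification scores, site membership as the label
    auc = roc_auc_score((site_ids[mask] == b).astype(int), z_scores[mask])
    deviations.append(abs(auc - 0.5))     # 0 means the sites are indistinguishable

print(np.mean(deviations))
```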

#### 3.2.1 A subtle difference between HBR and W-BLR

The comparison between HBR and W-BLR seems to indicate that W-BLR should be preferred in some cases where the modelled response variable takes on an extreme shape. However, the W-BLR and HBR methods differ in the order in which they apply certain transformations, leading to a subtle difference.

The W-BLR method applies the sinh-arcsinh transformation after the affine transformation for mean and variance, whereas the HBR method applies the sinh-arcsinh transformation before the affine transformation for $\mu$ and $\sigma$. Because $\mu$ and $\sigma$ potentially contain batch effects, and because in W-BLR $\mu$ influences the shape of the distribution (as illustrated in Fig. 8), the random effects in W-BLR indirectly influence the skew and the kurtosis. The better fit of W-BLR on the Right-Lateral-Ventricle and WM-hypointensities data in Table 3 may partly be due to this flexibility. We assume that in most cases, batch effects are not expected in the variance, skew, and kurtosis as much as they are in the mean. Especially for features with a variance that is small relative to the mean, a small batch effect in the mean would not constitute a large scaling of the variance. Since the skew and kurtosis are central measures, meaning that they are computed on mean-centred data divided by the variance, they are not influenced by scaling or shifting transformations at all. Phenotypes derived from images acquired at a single specific scanner may be overestimated by some constant amount, but it is less likely that the shape of the trajectory of the feature depends much on the scan location itself (although see Section 4 for further discussion of this assumption). Nonetheless, a latent correlation between the site and the measurement could be present. In contrast, the HBR method provides better control because it allows batch effects to be modelled optionally, but not necessarily (see Appendix B). With W-BLR, the random effects are encoded as covariates and may therefore influence all parameters of the likelihood. The results in Table 3 should be seen in light of this subtle difference.
The W-BLR model may outperform HBR on some phenotypes, but the results listed under HBR come from more conservative models that can be controlled better, in the sense that the shape parameters in HBR are implicitly regularised by the prior, whereas the shape parameters estimated by W-BLR are effectively unconstrained and can more easily overfit the data.

| | Mean | Variance | Skew | Kurtosis |
|---|---|---|---|---|
| $So(\mu, \sigma, \epsilon, \delta)$ | -0.24 | 2.4 | -0.96 | 0.9 |
| $So(\mu + 0.5, \sigma, \epsilon, \delta)$ | 0.26 | 2.4 | -0.96 | 0.9 |
| $\xi_{\epsilon,\delta}^{-1}(N(\mu, \sigma))$ | -0.05 | 1.27 | -1.34 | 2.76 |
| $\xi_{\epsilon,\delta}^{-1}(N(\mu + 0.5, \sigma))$ | 0.37 | 0.91 | -1.31 | 3.37 |


The first two rows differ only in the mean, by the exact amount by which $\mu $ was increased, whereas the bottom two rows differ entirely. This illustrates that in W-BLR, a change in $\mu $ changes the entire shape of the distribution.
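The order-of-operations difference can be reproduced numerically. The sketch below assumes the inverse sinh-arcsinh convention $\xi^{-1}_{\epsilon,\delta}(z) = \sinh\big((\operatorname{arcsinh}(z) + \epsilon)/\delta\big)$ and illustrative parameter values; it shows that adding 0.5 to $\mu$ is a pure shift under the $So$ construction, but changes the whole shape under the W-BLR-style construction:

```python
import numpy as np

# Assumed convention for the inverse sinh-arcsinh transform:
# xi_inv(z) = sinh((arcsinh(z) + eps) / delta)
def xi_inv(z, eps, delta):
    return np.sinh((np.arcsinh(z) + eps) / delta)

rng = np.random.default_rng(0)
z = rng.normal(size=200_000)
eps, delta, mu, sigma = -0.5, 0.8, 0.0, 1.0

# HBR-style So: affine transformation AFTER the shape transform
so = mu + sigma * xi_inv(z, eps, delta)
so_shift = (mu + 0.5) + sigma * xi_inv(z, eps, delta)

# W-BLR-style: affine transformation BEFORE the shape transform
wblr = xi_inv(mu + sigma * z, eps, delta)
wblr_shift = xi_inv((mu + 0.5) + sigma * z, eps, delta)

# In the So case, the +0.5 is a pure shift: the pointwise difference is constant.
print((so_shift - so).std())      # ~0
# In the W-BLR-style case, the difference varies: the whole shape changes.
print((wblr_shift - wblr).std())  # clearly > 0
```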

### 3.3 Fitting highly nonlinear data with $Sb$

To demonstrate the importance of the flexibility of the $Sb$ likelihood, we take as an example a recent study in which problematic data emerged from a dimensionality reduction (see Fig. 9).

In Zabihi et al. (2022), fMRI data were compressed using an autoencoder (Kingma & Welling, 2013), and then further compressed using UMAP dimensionality reduction (McInnes et al., 2018). The result was a 2D latent representation of brain activation patterns that followed a somewhat problematic, highly nonlinear distribution. Positively skewed at the lower end and negatively skewed at the higher end of the age range, this representation cannot be modelled by HBR with a Gaussian likelihood, and W-BLR is also not designed to deal with these kinds of effects. With $Sb1$, we run into the limits of that model as well, because it assumes a fixed skew and kurtosis throughout the age range. With model $Sb2$, we have the flexibility to model heterogeneous skew and kurtosis, and Figure 9 shows that this flexibility is essential for modelling this phenotype.

### 3.4 Computational complexity

We provide an indication of the computational complexity in terms of the time required to estimate the different variants of our models. Specifically, for the lifespan normative models presented in Section 2.2 above, the mean [stdev] time required to draw 1000 samples after 500 burn-in samples, averaged over 10 folds and specified as hours:minutes, was: $N$ = 2:19 [0:36], $So$ = 6:29 [1:00], $Sb1$ = 6:54 [0:31], and $Sb2$ = 27:26 [1:27]. Another important question in the use of MCMC samplers is determining the number of samples to acquire (i.e., the chain length). We provide a simple evaluation of whether 1000 post-burn-in samples are sufficient: we compare the empirical mean and variance of different parameters across chains of 500, 1000, and 1500 post-burn-in samples, and show Monte Carlo estimates for a representative set of parameters in Figure J.36. The parameters are effectively identical, with perhaps a slight reduction in the Monte Carlo estimates of the variance for some parameters in the short chains. Overall, this provides confidence that the chain length we employ is sufficient for reliable inference.

## 4 Discussion

In this work, we proposed a flexible method for modelling non-Gaussianity in normative models based on the SHASH distribution. We proposed a novel reparameterisation and developed an efficient MCMC sampling approach for these models. We first applied this method to a large neuroimaging dataset and showed competitive performance with respect to competing methods, while providing full Bayesian inference and full distributions over all model parameters. Then, we applied this method to a highly challenging problem where we aim to model a set of learned parameters from a highly non-linear model (derived from an autoencoder). In this case the classical methods are not able to satisfactorily model the distribution of the brain phenotype, while the proposed method can.

Normative modelling aims to find centiles of healthy variation in neuroimaging phenotypes, much like growth charts in pediatric medicine. Many of those phenotypes follow complex nonlinear, asymmetric distributions, so for a normative model to be generally applicable, it has to be able to model features like skewness and kurtosis. Our extension builds upon the HBR framework for normative modelling introduced by Kia et al. (2021), which infers a distribution over model parameters given a prior distribution over the parameters and a likelihood over the data. The original paper reported only results obtained with a Gaussian likelihood, but the framework is flexible enough to support any parameterised likelihood. Here, we adapted the HBR framework to use a different one, namely the SHASH likelihood, which has additional parameters roughly modelling skew and kurtosis.

The basic SHASH distribution in Equation 3 has no parameters for location and scale, but those can be added, resulting in the SHASH$_o$ ($So$) distribution, as in Equation 5. We also introduce a reparameterisation called the SHASH$_b$ ($Sb$) distribution, which aims to remove the severe correlations by applying an affine transformation for location and scale to a standardised SHASH distribution.

Our results on a large neuroimaging dataset show that the $Sb$ reparameterisation in some cases converges better and more reliably than the $So$ distribution. For example, in the analysis of the highly challenging data shown in Figure 9, the $So$ model did not converge. However, for most phenotypes, the convergence is similar to the original variant. A more important benefit of this reparameterisation is that the parameters of the distribution are more interpretable, as shown schematically in Figure 1 and empirically in Figure 2. We note that the $Sb$ distribution is isomorphic to $So$ (the two can model exactly the same set of distributions), which in itself gives no preference for one over the other in the context of HBR, and we observed similar sampling performance (at least in terms of convergence) for both. The $So$ parameterisation could nonetheless be preferred over $Sb$ in some circumstances (e.g., for "easy" phenotypes with near-Gaussian shapes) if parameter interpretation is not of primary importance, although the latter achieved slightly better performance in our experiments. As noted above, however, the $So$ model did not converge for the most difficult nonlinear phenotype we considered, in which case $Sb$ is preferred.

We showed that HBR with an $Sb$ likelihood performs equivalently or in some cases slightly better than Warped Bayesian Linear Regression (W-BLR) (Fraza et al., 2021) on most datasets, in terms of Gaussianity of the predicted deviation scores, although W-BLR predictions are better on specific pathological phenotypes. We argue that the W-BLR method fits better to that data due to the fact that batch effects are effectively encoded as a fixed effect and can therefore potentially affect all parameters of the likelihood, which is not always desired (see Fig. 8). The W-BLR method has no option to disable this flexibility, as it is inherent to the method. This is why we argue that the HBR method should be preferred as it supports enabling or disabling random effects in each parameter of the likelihood as a modelling choice. Moreover, the W-BLR method is unable to model the most complex phenotype we evaluated (i.e., derived from the latent representation of an autoencoder) because it does not provide the ability to sufficiently control the skewness and kurtosis across the range of the covariates. In other words, HBR provides more flexible parametric control over the shape of the distributions that the approach is able to model.

We have shown that HBR with $Sb$ likelihood is able to model highly non-Gaussian distributions with intricate trajectories in several parameters in a way that is impossible to replicate with existing techniques. This indicates that HBR will possibly play an important role in pushing the field of Normative modelling forward.

The computational complexity of the likelihood may be the most important limitation, especially for the $Sb2$ variant. It must be noted that the computation time is significantly reduced if fewer batch effects are present. We hope that future releases of software packages like pymc will help reduce the required computational time.

Lastly, all our work is freely available to be used and extended, in the open-source pcntoolkit package.

### 4.1 Future work

#### 4.1.1 Transfer learning

In the current work, we have not applied the method in a transfer learning setting, but the methods for doing so are described in Kia et al. (2021) and the updates we provide to the pcntoolkit package support this. For transferring a learned model to a new site, one would have to create a factorised approximation of the posterior by using the MCMC samples retrieved earlier. Restricting ourselves to a known distributional form, this can easily be done by standard optimisation techniques on the log likelihood. The factorised posterior is then fixed, and can be used as an informed prior for MCMC sampling with data from a new site. The samples retrieved in this manner can then be used for approximation of z-scores or any other downstream analysis.
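The factorised approximation described above can be sketched as follows. This is a hypothetical illustration, not the pcntoolkit API: the parameter names and the restriction to Gaussian factors are our assumptions, and the (loc, scale) pairs would serve as informed priors when sampling with data from a new site:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for stored MCMC samples from the original model fit
posterior_samples = {
    "mu_intercept": rng.normal(1.2, 0.05, size=2000),
    "log_sigma": rng.normal(-0.3, 0.10, size=2000),
}

# Factorised Gaussian approximation: fit an independent Normal to each
# parameter's samples (equivalent to maximum likelihood for a Gaussian)
informed_prior = {
    name: (samples.mean(), samples.std(ddof=1))   # (loc, scale)
    for name, samples in posterior_samples.items()
}
print(informed_prior)
```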

#### 4.1.2 Variational inference

We have shown the value of the $So$ and $Sb$ likelihoods in the context of MCMC sampling. The reparameterisation removes much of the correlation in the posterior that is problematic for sampling methods, leading to more stable results. Nonetheless, the MCMC sampling procedure is still very computationally demanding, especially on large datasets with many batch effects. A variational approach (Blei et al., 2017) may now be a viable solution to this problem. In the HBR context, the $Sb$ reparameterisation may be particularly beneficial in this regard because it removes dependencies between the parameters, such that the posterior can be approximated by a factorised distribution. Variational inference is a fast approximation, and MCMC is a slow but (asymptotically) exact method. We believe that in some cases a fast approximation suffices, so variational approximation could be a suitable extension to the present work.

#### 4.1.3 Generalisations to more expressive noise distributions

The extension presented here is just the tip of the iceberg of hierarchical Bayesian modelling. It was developed for modelling skewness and kurtosis, and while we showed that this can fit a wide range of distributions, the flexibility of the SHASH distribution is still limited. Specifically, the limitations in tail thickness can easily be read from Figure 1d. Mixture models could further increase the expressive power of the HBR method. One could design a mixture of a single SHASH$_b$ (or simply a Gaussian, which might be easier to fit) and two Gamma distributions, one left-tailed and one right-tailed, conveniently located so as to act as surrogate tails. The central $Sb$ component would be dedicated to learning the shape of the bulk of the data, while the Gamma distributions absorb the largest outliers. With the current version of the pcntoolkit, this extension is within reach, but we leave it for future work.

#### 4.1.4 Batch effects

Multidimensional batch effects are currently implemented as a $D$-dimensional matrix in the pcntoolkit, containing one value per combination of batch effects. The number of values grows exponentially with the number of batch effects. An alternative approach is to model a separate vector for each batch effect, and to learn parameters linking the vectors together. So instead of having $\theta_{i,j} \sim N(\mu_{i,j}, \sigma)$, one could have $\theta_{i,j} \sim N(\alpha\theta^1_i + \beta\theta^2_j + \gamma\theta^1_i\theta^2_j, \sigma)$. The number of required linking parameters grows as $n(n+1)/2 \in O(n^2)$ in the number of batch effects $n$, much better than the current $O(d^n)$, where $d$ is the number of unique values per batch effect. We expect that the correlations between batch effects are not strong enough to require the full grid, and we believe the most important effects in the batches can be modelled this way. In addition, throughout this work we have assumed that the batch effects are principally evident in the mean, and observed empirically that this provided reasonable performance. It is conceivable, however, that batch effects could also be evident in the variance. While the original Gaussian HBR method can easily handle batch effects in the variance, the use of the SHASH distribution complicates this somewhat; while we could add batch effects to the variance as well, this might interact with the shape and/or heteroskedasticity of the resulting model. We therefore leave this for future work.
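The parameter-count argument can be made concrete with a small sketch (function names are ours; the linking-parameter count of 3 corresponds to the illustrative $\alpha$, $\beta$, $\gamma$ above):

```python
def full_grid_params(d1, d2):
    """Current grid-style approach: one value per combination of levels."""
    return d1 * d2

def linked_vector_params(d1, d2, n_link=3):
    """Sketched alternative: one vector per batch effect plus linking coefficients."""
    return d1 + d2 + n_link

# e.g. 82 sites x 2 sexes
print(full_grid_params(82, 2))      # 164 grid values
print(linked_vector_params(82, 2))  # 87 values
```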

## 5 Conclusion

We have extended a method for federated normative modelling of neuroimaging data to support non-Gaussian attributes such as skew and kurtosis. To this end, we have introduced a 4-parameter reparameterisation of the SHASH distribution, along with an accompanying software implementation. We have demonstrated that our method performs equivalently or in some cases slightly better than a baseline method on simple and complex data, thereby showing that this method can push the field of normative modelling forward.

## Data and Code Availability

Most data used in this manuscript are publicly available and can be downloaded from the repositories providing the data. See Rutherford et al. (2022) for details. Data from the “Thematically Organised Psychosis” sample is available under reasonable request by contacting the responsible PIs (L.T.W. and O.A.A.). A software implementation of the tools presented here is provided in the pcntoolkit package and the scripts for running the experiments are also provided online via GitHub.

## Author Contributions

A.A.A.d.B., J.M.M.B.: primary analysis, drafting and revising the manuscript. S.M.K.: conceptualisation, supervision, analysis, and drafting and revising the manuscript. S.R., M.Z., C.F., and P.B.: analysis, drafting and revising the manuscript. L.T.W., O.A.A.: provision of data, drafting and revising the manuscript. M.H., C.F.B., and A.M.: conceptualisation, supervision, analysis, and drafting and revising the manuscript.

## Declaration of Competing Interest

C.F.B. is director and shareholder of SBGNeuro Ltd. The other authors report no conflicts of interest.

## References

## Appendix A. Sinh-Arcsinh Transformation

## Appendix B. Model Specification

Every phenotype has its own defining characteristics. Some may be severely skewed, while others appear perfectly Gaussian. We may see a strong correlation between age and variance (heteroskedasticity) in one, while another retains a consistent variance throughout the entire age spectrum. Batch effects like sex-specific differences or site-specific noise characteristics may or may not be present in the data. Characteristics like these can be learned by a sufficiently flexible model. However, Occam's razor dictates that we should prefer a simple model over a complex model if they perform equally well. A model is therefore optimally defined if it has enough flexibility to capture the parameters of interest, whilst being constrained in every other way. An ideal modelling framework would allow specifying constraints on all aspects of the model and be flexible enough to support a wide range of models, while still remaining concise. Here, a model definition framework is proposed that allows us to concisely write down a wide range of different models. This framework will simplify the later sections of this report, and it aligns with the implementation in the pcntoolkit software package. The graphical notation used here is slightly adapted from Bishop (2006), chapter 8: we indicate the range and name of the iterator variables more explicitly in the plates, i.e., $N$ becomes $n \in \{1,...,N\}$. For the remainder of this section, we will uphold the nomenclature in Table B.5. Without loss of generality, let us now focus on modelling a single parameter, $\theta$. For distributions that take multiple parameters, which includes almost any distribution of interest, we simply apply what follows to each parameter separately.

| Variable | Interpretation |
|---|---|
| $N \in \mathbb{N}$ | The number of observations (subjects, rows in the table) |
| $d$ | The number of clinical covariates |
| $X \in \mathbb{R}^{N \times d}$ | Clinical covariates |
| $Y \in \mathbb{R}^{N \times 1}$ | Response variable |
| $B$ | The number of unique combinations of batch effects; e.g., with 2 sexes and 7 sites, $B = 2 \times 7 = 14$ |
| $Z \in \mathbb{N}^{N \times 1}$ | A list of $N$ indices indicating which combination of batch effects applies to each observation, so every $z_n \in \{1,...,B\}$. Not to be confused with the $z$-score. |
| $D$ | The dimensionality of the basis expansion |
| $\Phi \in \mathbb{R}^{N \times dD}$ | A design matrix derived from $X$; e.g., a polynomial basis expansion, a B-spline basis expansion, or simply $X$ itself |


### Appendix B.1. Constant $\theta $

Here, we discuss simple models where $\theta$ is constant over the whole range of $X$, apart from batch effects. *Model* $\mathcal{M}_1$. Let $\mathcal{M}_1$ represent the case where $\theta$ depends on neither $X$ nor $Z$, i.e., it takes a single value for all $y_n$. The only dependency of $\theta$ in $\mathcal{M}_1$ is on the (set of) hyperparameters $\alpha$ through its prior distribution $p$:

Figure B.10 shows the corresponding graphical model.

#### Appendix B.1.1 Constant $\theta $ with batch effects

Combinations of site, sex, and possibly other factors define distinct groups called *batches*. Here, we use the term *batch effects* to indicate possibly unknown differences between known and distinct batches present in the data. HBR can deal with these batch effects in a principled way by modelling batch-specific parameters as deviations from a learned group mean. This allows information from all batches to flow into the hyperpriors, which, in turn, influence the estimated individual parameters for each batch. The models $\mathcal{M}_{2a}$ and $\mathcal{M}_{2b}$ encode this.

*Model* $\mathcal{M}_{2a}$. The next simplest case is model $\mathcal{M}_{2a}$, in which $\theta$ depends on $Z$ but not on $X$. This encodes the belief that the parameter is constant for all $x$, but that it may be influenced by batch effects. We sample the batch-specific parameters ($\theta_b$) as Gaussian deviations from a group mean ($\mu_\theta$) with a group variance ($\sigma_\theta^2$), so that the sites can learn from each other. Additionally, under an appropriate prior, the group variance acts as a regulariser.

*Model* $\mathcal{M}_{2b}$. The phenomenon known as the funnel is an issue that commonly afflicts hierarchical models, and it may occur in model $\mathcal{M}_{2a}$. Briefly, when the parameter $\sigma_\theta$ is small but we cannot determine $\theta$ with enough confidence, the posterior distribution attains wide support in the domain of $\sigma_\theta$, including very small values. If we sample in regions where $\sigma_\theta$ is very small, the values $\theta_b$ all collapse to $\mu_\theta$, and sampling of $\theta$ becomes very hard: most proposed samples of $\theta$ will be rejected, since the proposal distribution is tuned for a larger step size. For a more thorough explanation, see Betancourt and Girolami (2015). In situations where we fear a funnel may occur, we apply the following reparameterisation as a remedy: the batch-specific deviation from the group mean is determined by sampling the group standard deviation ($\sigma_\theta$) from a positive distribution, and then multiplying it by an *offset* ($\nu_b$) sampled from a central member of a symmetric distribution (such as the standard Gaussian) for each $b \in \{1,\dots,B\}$. Sampling like this is known as a non-centred sampling approach, as opposed to the centred approach in model $\mathcal{M}_{2a}$. Here, we propose model $\mathcal{M}_{2b}$:
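The two parameterisations define the same marginal distribution over $\theta_b$; in the actual model, $\mu_\theta$ and $\sigma_\theta$ are themselves sampled, which is what makes the non-centred version easier for MCMC. A minimal numpy sketch (all numeric values here are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
mu_theta, sigma_theta, B = 1.5, 0.3, 7   # illustrative group mean/SD, 7 batches
n_draws = 200_000

# Centred parameterisation (model M2a): draw theta_b directly.
theta_centred = rng.normal(mu_theta, sigma_theta, size=(n_draws, B))

# Non-centred parameterisation (model M2b): draw a standard-Gaussian offset
# nu_b and transform it deterministically into theta_b.
nu = rng.normal(0.0, 1.0, size=(n_draws, B))
theta_noncentred = mu_theta + sigma_theta * nu
```

Because the sampler only ever explores the standard-Gaussian offsets, the problematic geometry near $\sigma_\theta \approx 0$ is moved out of the sampled space.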

### Appendix B.2 $\theta $ as a function of $X$

More complex models can be defined if we let $\theta$ follow a linear relationship with the basis expansion $\Phi(X)$ through a set of weights $w \in \mathbb{R}^{dD}$ and an intercept $\tau \in \mathbb{R}$. We sample $w$ and $\tau$ from their own respective distributions, re-using the definitions from Appendix B.1.

*Model* $\mathcal{M}_{3.*.*}$. Since $w$ and $\tau$ are both constant parameters with possible batch effects, we can model them exactly like we did $\theta$ in $\mathcal{M}_1$, or $\theta_{z_n}$ in models $\mathcal{M}_{2a}$ and $\mathcal{M}_{2b}$. The definitions there are general enough to allow modelling $w$ as distributed according to a multidimensional density.

The subscript of these linear models consists of three terms, separated by periods: $\mathcal{M}_{3.*.*}$. The first term is a 3, signifying that this is a linear model. The second and third terms indicate the model types used for $w$ and $\tau$, respectively. For instance, $\mathcal{M}_{3.1.1}$ is a model where $\theta$ is a linear function of $\Phi$ through $w$ and $\tau$, which are both constants like $\theta$ in $\mathcal{M}_1$. Model $\mathcal{M}_{3.2a.2b}$ encodes the model with a linear $\theta$, where $w$ and $\tau$ both have unique values for each batch, with the former modelled by the centred approach of model $\mathcal{M}_{2a}$, and the latter by the non-centred approach of model $\mathcal{M}_{2b}$. Both $w$ and $\tau$ can thus be modelled in three distinct ways, allowing us to define a total of nine different models with linear $\theta$. In total, $\theta$ can thus be modelled in 12 ways: three in which it is constant, and nine in which it has a linear dependence on $\Phi$.
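The counting argument can be made concrete by enumerating the model labels (a sketch; the strings are just the model subscripts, not pcntoolkit identifiers):

```python
from itertools import product

base = ["1", "2a", "2b"]                                 # ways to model a constant parameter
linear = [f"3.{w}.{t}" for w, t in product(base, base)]  # nine linear variants
models = base + linear                                   # 3 constant + 9 linear = 12 in total
```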

### Appendix B.3 Hypothetical application

The proposed framework allows us to concisely specify how parameters are modelled.

*Scenario.* A pattern that is commonly observed in neurological phenotypes is the following: As the age of the participant ($X$) increases, the phenotype ($Y$) increases first (until around 25), and then starts to decrease. However, the variance in the phenotype increases with age as well. The data show some positive skew, and may also be slightly kurtotic. Furthermore, the data are collected from various sites, so site effects ($Z_1$) are expected, and sex-related differences ($Z_2$) need to be accounted for as well.

*Modelling.* Clearly, the parameter that controls the mean should be a function of age, but the non-monotonic behaviour of the mean guides us away from a directly linear approach. Rather, we would use a polynomial basis expansion, or a regression spline (de Boor, 1978). We usually prefer the regression spline basis over a simple polynomial basis, because it does not induce global curvature (Fraza et al., 2021). A Gaussian distribution would not be able to capture the observed skew, so the choice of likelihood function falls on the SHASH$_b$, parameterised by $\mu$, $\sigma$, $\epsilon$, and $\delta$. Model $\mathcal{M}_3$ is a fitting choice for both $\mu$ and $\sigma$, as they both appear to be related to age. For $\mu$, we expect that the batch effects do not affect the trajectory of the phenotype, but mainly influence its magnitude. Therefore, the weights $w$ do not need to be modelled with batch effects, but $\tau$ does. The expected batch effects are not very large, so the non-centred modelling approach ($\mathcal{M}_{2b}$) is a good choice. We therefore select model $\mathcal{M}_{3.1.2b}$ for $\mu$. We do not expect batch effects in $\sigma$, but it does clearly have a relation with age. We can safely choose model $\mathcal{M}_{3.1.1}$ for $\sigma$. The last two parameters, $\epsilon$ and $\delta$, are both assumed to be constant across sites and age, so they are both modelled by $\mathcal{M}_1$. To enforce positivity of both $\sigma$ and $\delta$, we apply a softplus transformation to the outputs of the corresponding models, and to fulfil the constraint discussed in Section 2.1.2, we add a constant to $\delta$.
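As a small illustration of the positivity constraint, the softplus transformation $\mathrm{softplus}(x) = \log(1 + e^x)$ maps any real model output to a strictly positive value. The raw values and the added constant below are arbitrary, chosen only for the sketch:

```python
import numpy as np

def softplus(x):
    # log(1 + exp(x)), computed stably; strictly positive for all real x
    return np.logaddexp(0.0, x)

# Hypothetical unconstrained model outputs for sigma and delta.
raw_sigma, raw_delta = -0.2, -1.3
sigma = softplus(raw_sigma)        # enforces sigma > 0
delta = softplus(raw_delta) + 0.3  # the added constant (0.3 here) is illustrative
```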

We now have:

## Appendix C Generative Models

In all the generative models in this section, $D$ is the dimensionality of the basis expansion $\varphi $. In our models, this is taken to be 7. We repeat some earlier equations for quick reference.

where $K$ is a modified Bessel function of the second kind.

### Appendix C.1. Model $N$

### Appendix C.2. Model $So$

where

### Appendix C.3. Model $Sb1$

where

### Appendix C.4 Model $Sb2$

where

## Appendix D MCMC Sampling In A Nutshell

Uncertainty quantification should ideally account for all uncertainty in the model parameters, but the relevant quantities are often hard to compute exactly. MCMC approximates them by sampling from the desired distribution. Given a generative model like the ones above, Bayes' rule gives us an expression for the posterior over our model parameters:

which we then use to compute the expectation of a function $f$:

For instance, in normative modelling, $f$ could return the deviation score of some data point. The expectation above gives us an estimate in which all the uncertainty in the model parameters is captured. Unfortunately, the integral in Equation D.2 is intractable, and the denominator in Equation D.1, $p(D)$, contains an intractable integral as well:

In practice, it is impossible to determine $p(\theta |D)$ exactly, unless the prior is conjugate to the likelihood. MCMC samplers use a number of different methods to retrieve pseudo-random samples from $p(\theta |D)$, without ever needing to analyse the integral in Equation D.3. With enough of these samples in hand, we use the following result:

to approximate our desired expectation.
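A minimal sketch of this Monte Carlo approximation, using samples from a known Gaussian "posterior" in place of actual MCMC output (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for S posterior samples theta_s from an MCMC sampler; here we
# draw them from a known Gaussian "posterior" purely for illustration.
S = 100_000
theta_samples = rng.normal(2.0, 0.5, size=S)

def f(theta):
    return theta ** 2  # any function of the parameters, e.g. a deviation score

# Monte Carlo estimate: E[f(theta)] ~= (1/S) * sum_s f(theta_s)
estimate = f(theta_samples).mean()
# Closed form for this Gaussian: E[theta^2] = mu^2 + sigma^2 = 4.25
```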

## Appendix E Maximum-a-Posteriori (MAP) Approximation

Instead of running a full MCMC estimation, which is computationally expensive, an alternative is to find the MAP estimate and compute the desired function at that single point.

In other words, the MAP estimate is simply the point in parameter space that maximises the posterior. It can be found using gradient-based methods. Using the MAP estimate is not the most principled method, because it does not account for the uncertainty in the parameters. For instance, the MAP estimate will always return a single $\theta$, although the posterior could be multimodal, or contain large flat optima, which would justify multiple valid $\theta$'s. However, the computational gains may be worth the loss of a representation of uncertainty.
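A toy illustration of MAP estimation by gradient ascent on the log posterior. The Gaussian likelihood and prior below are assumptions chosen so the mode has a closed form, not the models used in this paper:

```python
import numpy as np

# Toy conjugate model: unit-variance Gaussian likelihood, N(0, 10) prior on the mean.
rng = np.random.default_rng(2)
data = rng.normal(3.0, 1.0, size=50)

def grad_log_posterior(mu):
    # d/dmu [ -0.5 * sum((x - mu)^2) - 0.5 * mu^2 / 10 ]
    return np.sum(data - mu) - mu / 10.0

# Simple gradient ascent to find the posterior mode.
mu = 0.0
for _ in range(2000):
    mu += 1e-3 * grad_log_posterior(mu)

# Closed form for this toy model: mu_map = sum(data) / (n + 1/10)
```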

## Appendix F Data Counts

| Site | # Females | # Males | Mean age |
|---|---|---|---|
| ABCD_01 | 199 | 189 | 9.89 |
| ABCD_02 | 252 | 290 | 10.07 |
| ABCD_03 | 268 | 301 | 9.88 |
| ABCD_04 | 307 | 324 | 9.81 |
| ABCD_05 | 178 | 167 | 9.89 |
| ABCD_06 | 286 | 278 | 9.94 |
| ABCD_07 | 153 | 172 | 9.87 |
| ABCD_08 | 160 | 176 | 9.95 |
| ABCD_09 | 200 | 207 | 9.96 |
| ABCD_10 | 280 | 295 | 9.86 |
| ABCD_11 | 206 | 208 | 9.81 |
| ABCD_12 | 77 | 84 | 9.88 |
| ABCD_13 | 277 | 278 | 9.81 |
| ABCD_14 | 267 | 316 | 10.19 |
| ABCD_15 | 178 | 217 | 9.90 |
| ABCD_16 | 414 | 507 | 9.90 |
| ABCD_17 | 267 | 290 | 9.81 |
| ABCD_18 | 160 | 181 | 9.90 |
| ABCD_19 | 272 | 262 | 10.06 |
| ABCD_20 | 320 | 320 | 10.05 |
| ABCD_21 | 222 | 269 | 9.91 |
| ABIDE_GU | 27 | 27 | 10.42 |
| ABIDE_KKI | 65 | 122 | 10.29 |
| ABIDE_NYU | 28 | 107 | 14.40 |
| ABIDE_USM | 3 | 56 | 22.07 |
| ADD200_KKI | 27 | 34 | 10.25 |
| ADD200_NYU | 20 | 18 | 10.21 |
| AOMIC_1000 | 483 | 445 | 22.84 |
| AOMIC_PIPO2 | 120 | 89 | 22.18 |
| ATV | 17 | 60 | 22.67 |
| CIN | 22 | 44 | 47.89 |
| CMI_CBIC | 73 | 126 | 11.19 |
| CMI_RU | 125 | 251 | 10.53 |
| CMI_SI | 38 | 68 | 11.01 |
| CNP-35343.0 | 43 | 47 | 31.94 |
| CNP-35426.0 | 11 | 9 | 29.50 |
| COI | 78 | 46 | 51.86 |
| HCP_A_MGH | 86 | 85 | 59.75 |
| HCP_A_UCLA | 71 | 53 | 53.32 |
| HCP_A_UM | 120 | 84 | 61.59 |
| HCP_A_WU | 112 | 66 | 58.83 |
| HCP_D_MGH | 109 | 107 | 13.78 |
| HCP_D_UCLA | 62 | 65 | 14.13 |
| HCP_D_UM | 85 | 71 | 13.26 |
| HCP_D_WU | 75 | 79 | 13.97 |
| HCP_EP_BWH | 2 | 6 | 23.09 |
| HCP_EP_IU | 11 | 14 | 23.90 |
| HCP_EP_MGH | 3 | 8 | 27.61 |
| HCP_EP_McL | 4 | 9 | 25.15 |
| HKH | 17 | 12 | 45.41 |
| HRC | 36 | 13 | 41.69 |
| HUH | 38 | 29 | 34.74 |
| KCL | 25 | 16 | 34.12 |
| KTT | 41 | 87 | 31.03 |
| KUT | 66 | 93 | 36.50 |
| Oasis2 | 127 | 58 | 76.88 |
| Oasis3 | 631 | 921 | 69.78 |
| SWA | 15 | 85 | 28.48 |
| SWU_SLIM_ses1 | 318 | 230 | 20.07 |
| UCDavis | 62 | 74 | 3.12 |
| UTO | 106 | 96 | 35.19 |
| Cam | 329 | 318 | 54.19 |
| Delta | 18 | 31 | 50.44 |
| ds001734 | 60 | 48 | 25.54 |
| ds002236 | 38 | 48 | 11.49 |
| ds002330 | 37 | 29 | 26.62 |
| ds002345 | 131 | 76 | 21.69 |
| ds002731 | 28 | 31 | 21.25 |
| ds002837 | 42 | 44 | 26.73 |
| hcp_ya | 606 | 507 | 28.80 |
| Ixi | 313 | 245 | 48.71 |
| Nki | 308 | 174 | 42.63 |
| Pnc | 701 | 677 | 14.21 |
| Top | 134 | 158 | 34.57 |
| ukb-11025.0 | 12944 | 12042 | 62.98 |
| ukb-11027.0 | 5438 | 4534 | 64.39 |


| Site | # Females | # Males | Mean age |
|---|---|---|---|
| ABIDE_GU | 8 | 39 | 10.94 |
| ABIDE_KKI | 19 | 58 | 10.22 |
| ABIDE_NYU | 16 | 111 | 12.83 |
| ABIDE_USM | 2 | 72 | 21.60 |
| ADD200_KKI | 10 | 12 | 10.21 |
| ADD200_NYU | 14 | 38 | 10.05 |
| CIN | 28 | 45 | 56.80 |
| CNP-35343.0 | 30 | 33 | 35.33 |
| CNP-35426.0 | 15 | 44 | 34.62 |
| COI | 40 | 29 | 45.07 |
| HCP_EP_BWH | 8 | 15 | 22.38 |
| HCP_EP_IU | 22 | 37 | 22.96 |
| HCP_EP_MGH | 3 | 7 | 20.31 |
| HCP_EP_McL | 15 | 16 | 23.56 |
| HKH | 13 | 20 | 44.81 |
| HRC | 10 | 6 | 40.50 |
| HUH | 25 | 32 | 43.33 |
| KCL | 75 | 29 | 31.44 |
| KTT | 21 | 26 | 37.89 |
| KUT | 30 | 31 | 41.70 |
| Oasis2 | 82 | 97 | 76.94 |
| Oasis3 | 158 | 123 | 75.37 |
| SWA | 19 | 115 | 33.61 |
| UTO | 54 | 95 | 35.61 |
| Delta | 142 | 160 | 41.25 |
| Top | 254 | 277 | 32.42 |


| Group | # Females | # Males | Mean age |
|---|---|---|---|
| AD | 142 | 94 | 75.75 |
| ADHD | 40 | 71 | 17.60 |
| ASD | 61 | 389 | 19.08 |
| BD | 130 | 106 | 34.67 |
| Converted | 24 | 13 | 79.75 |
| Demented | 58 | 84 | 76.21 |
| MCI | 16 | 29 | 73.40 |
| MDD | 260 | 239 | 40.26 |
| MDD_motar | 30 | 21 | 35.50 |
| Other | 15 | 26 | 34.21 |
| Pain | 25 | 39 | 55.28 |
| Patient | 72 | 98 | 30.01 |
| Patient_motar | 32 | 38 | 40.20 |
| SZ | 131 | 225 | 34.85 |
| Stroke | 3 | 6 | 67.66 |
| Otherpsychosis | 32 | 52 | 29.76 |
| Schizoaffective | 23 | 10 | 35.69 |
| Schizophreniform | 11 | 12 | 25.43 |


## Appendix G Empirical Moments of SHASH$_o$

## Appendix H. Empirical Moments of SHASH$_b$

## Appendix I Percentiles of All Features, By All HBR Models

## Appendix J AUC Scores On All Phenotypes

^{1}

The transformation of the Gaussian samples Z via the sinh-arcsinh transform only relates to the construction of the SHASH distribution; it is unrelated to z-scoring the data prior to analysis. Nevertheless, we did standardise all variables used in the analysis before fitting, since this can improve the performance of the samplers.
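For illustration, sampling from a SHASH distribution by transforming standard-Gaussian samples can be sketched as follows. This follows the sinh-arcsinh construction of Jones and Pewsey (2009); the parameter values are arbitrary, and the sign/scaling conventions here are an assumption rather than the exact parameterisation used in the paper:

```python
import numpy as np

rng = np.random.default_rng(3)

def shash_sample(epsilon, delta, size):
    # Draw Z ~ N(0, 1) and apply the inverse sinh-arcsinh transform:
    # X = sinh((arcsinh(Z) + epsilon) / delta).
    z = rng.normal(size=size)
    return np.sinh((np.arcsinh(z) + epsilon) / delta)

# epsilon controls skew, delta controls tail weight;
# (epsilon, delta) = (0, 1) recovers the standard Gaussian.
x = shash_sample(0.0, 1.0, 100_000)
```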

^{2}

A differentiable function with a known and differentiable inverse.

^{3}

We have only real inputs because y is real and the cosh and sinh^{−1} functions map real numbers to real numbers.

^{4}

Correlations like this cause problems for inference in HBR because of its reliance on MCMC sampling. In MCMC sampling, an algorithm traverses a path of pseudo-random points in parameter space, storing visited points as samples. Under the conditions that (i) the Markov chain has the posterior distribution as its stationary distribution, and (ii) detailed balance is satisfied, the samples that are retrieved follow the posterior distribution over model parameters, and can thus be used to approximate distributions. Correlations in the parameter space inhibit the sampler from efficiently exploring the parameter space, resulting in samples that do not follow the posterior very well, or do not converge to the target density at all.

^{5}

Note that this differs slightly from the sample size in Rutherford et al. because we removed several small clinical samples for which permission for sharing was not available.

^{6}

A one-hot encoding of an N-ary class label is a vector with a single ‘1’ at the index of the class that it labels, and zeros everywhere else.
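As a small numpy illustration of this encoding (the labels are arbitrary):

```python
import numpy as np

labels = np.array([0, 2, 1, 2])         # class labels for four observations, 3 classes
one_hot = np.eye(3, dtype=int)[labels]  # one row per label, a single 1 per row
```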

## Author notes

Note on the article history: This article was received originally at *Neuroimage* 4 September 2022 and transferred to *Imaging Neuroscience* 8 June 2023.