FastSurfer-HypVINN: Automated sub-segmentation of the hypothalamus and adjacent structures on high-resolutional brain MRI

Abstract The hypothalamus plays a crucial role in the regulation of a broad range of physiological, behavioral, and cognitive functions. However, despite its importance, only a few small-scale neuroimaging studies have investigated its substructures, likely due to the lack of fully automated segmentation tools to address scalability and reproducibility issues of manual segmentation. While the only previous attempt to automatically sub-segment the hypothalamus with a neural network showed promise for 1.0 mm isotropic T1-weighted (T1w) magnetic resonance imaging (MRI), there is a need for an automated tool that also sub-segments high-resolutional (HiRes) MR scans, as they are becoming widely available, and that includes structural detail from multi-modal MRI. We therefore introduce a novel, fast, and fully automated deep-learning method named HypVINN for sub-segmentation of the hypothalamus and adjacent structures on 0.8 mm isotropic T1w and T2w brain MR images that is robust to missing modalities. We extensively validate our model with respect to segmentation accuracy, generalizability, in-session test-retest reliability, and sensitivity to replicate hypothalamic volume effects (e.g., sex differences). The proposed method exhibits high segmentation performance both for standalone T1w images and for T1w/T2w image pairs. Even with the additional capability to accept flexible inputs, our model matches or exceeds the performance of state-of-the-art methods with fixed inputs. We further demonstrate the generalizability of our method in experiments with 1.0 mm MR scans from both the Rhineland Study and the UK Biobank, an independent dataset never encountered during training, with different acquisition parameters and demographics. Finally, HypVINN can perform the segmentation in less than a minute (on a graphical processing unit, GPU) and will be available in the open source FastSurfer neuroimaging software suite, offering a validated, efficient, and scalable solution for evaluating imaging-derived phenotypes of the hypothalamus.

*Correspondence to: Martin Reuter (martin.reuter[at]dzne.de).


Motivation
The hypothalamus consists of a group of interconnected neuronal nuclei located at the base of the brain [1]. It is the body's principal homeostatic center and plays a crucial role in the regulation of a broad range of physiological, behavioral, and cognitive functions, both through direct control of endocrine and autonomic nervous system outflow, as well as through extensive projections to cortical and limbic regions [1]. Neuropathological studies have demonstrated extensive involvement of the hypothalamus in a range of neurodegenerative diseases, including Alzheimer's disease [2,3], Parkinson's disease [4], Huntington's disease [5], frontotemporal dementia, and amyotrophic lateral sclerosis [6,7]. However, the association between hypothalamic integrity and physiological, behavioral, and cognitive outcomes has not been studied in large clinical or population-based studies for lack of a reliable high-throughput automatic imaging procedure.
The majority of studies on hypothalamic imaging-derived phenotypes use manual annotations of magnetic resonance imaging (MRI) scans as the gold standard. Manual segmentation of the hypothalamus and its substructures is commonly done on T1-weighted images [8,9]. Nonetheless, the use of multi-modal structural information during the manual annotation process has also been proposed, especially to increase the visibility of the lateral hypothalamus boundaries [6,10].
These multi-modal protocols recommend segmenting the hypothalamus using simultaneous visualization of registered T1-weighted (T1w) and T2-weighted (T2w) MR images. Manual delineation of the hypothalamus, however, is a very time-consuming process that relies highly on the user's expertise due to the small size and low boundary MR contrast of the hypothalamus region, regardless of the available MRI modalities.
Automated methods have been proposed to segment the whole hypothalamus [11-15] and its sub-regions [16] quickly and reliably. However, even though automated tools are available, they only focus on segmenting 1.0 mm isotropic T1w scans, ignoring the detailed structural information available in sub-millimeter resolution datasets. High-resolutional (HiRes) MR scans are becoming more common across studies (even in clinical settings) due to rapid advancements in MR technology (e.g., accelerated acquisition schemes) and are increasingly employed as the new standard for large studies (e.g., the Rhineland Study [17,18], Human Connectome Project (HCP) datasets [19-21], Autism Brain Imaging Data Exchange II (ABIDE-II) [22], TRACK-PD [23]). Thus, the need for neuroimaging tools that can handle sub-millimeter resolutions (e.g., 0.8 mm isotropic) has increased.
Moreover, current automated hypothalamic segmentation methods have neglected the inclusion of multi-modal structural information. One reason for this is that simultaneous access to T1w and T2w images is not always possible due to constraints in scanning time or poor image quality in one of the modalities caused by reduced image resolution or acquisition artefacts. Therefore, the introduction of an accurate automated method for segmenting hypothalamic structures on high-resolutional T1w and T2w MRI scans that is also robust to missing modalities is of significant interest to clinicians and researchers.

Related work
Automated hypothalamic segmentation methods utilizing multi-atlas-based techniques [11,12] were proposed initially. However, these methods are slow and demand considerable computational resources. Newer techniques such as fully convolutional neural networks (F-CNNs) can tremendously speed up computation time by utilizing graphical processing units (GPUs) and have become the preferred method for solving supervised semantic segmentation problems in the medical computer vision community [24-31].
Hypothalamus segmentation using F-CNNs has mainly focused on identifying the hypothalamus as one whole structure in the brain [13-15]. Recently, Billot et al. [16] proposed a method to segment five sub-regions of the hypothalamus using an encoder-decoder 3D F-CNN with extensive data augmentation. They followed the hypothalamic parcellation protocol introduced by Makris et al. [9] on standard 1.0 mm isotropic resolution T1w images. Their proposed method illustrates the capability of F-CNNs to segment hypothalamic compartments, with promising results on datasets acquired at 1.0 mm isotropic resolution [16,32]. However, F-CNNs are known to have issues generalizing to resolutions that differ from the training one [30,33,34], rendering HiRes images out-of-distribution and unsuitable for methods designed for lower resolutions. A common approach to this problem is to down-sample the input image to the desired lower resolution in a pre-processing step [15,16,24]. This process, however, reduces image details and information, forfeiting the investment already made when acquiring the higher resolution in the first place. Furthermore, HiRes information could help address inter-class inconsistencies between voxels at a local and global level and alleviate the partial volume effect problem [35].
HiRes segmentation of brain structures has mostly been tackled by training with manual annotations created at the desired resolution [25,30,36,37] or by training models on 1.0 mm data with scale augmentations, an established deep-learning technique to improve the generalizability of a model. Recently, models capable of segmenting scans at different resolutions have been introduced. Billot et al. [38,39] proposed SynthSeg, a technique for generating segmentations at a fixed resolution (1.0 mm) regardless of the resolution of the input scan, which is interpolated to the fixed resolution in a pre-processing step.
During training, SynthSeg relies on a generative model that produces "unrealistic synthetic images" [38]. These synthetic images are created from ground-truth label maps at the pre-defined fixed resolution. This approach simulates domain variability by incorporating multiple random parameters into the generator, covering spatial, intensity, contrast, and resolution variability. While this provides input flexibility, the model's output remains confined to the fixed resolution.
Before SynthSeg, we introduced the Voxel-Size Independent Neural Network (VINN) for resolution-independent segmentation tasks [34]. The VINN approach enables training and inference on images at multiple resolutions within a single network. In brief, instead of interpolating input images, VINN integrates the resolution change into the network, replacing a regular scale transition with an interpolation layer that maps the latent space at native input resolution to a pre-defined internal resolution in the lower layers of the network and vice-versa. As a result, rich HiRes information is retained without image or label interpolation, and segmentations are provided at the desired native input resolution.
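To make the mechanism concrete, the following is a minimal PyTorch sketch of such a network-integrated resolution-normalization; the module name, the bilinear interpolation mode, and the tensor sizes are illustrative assumptions rather than the FastSurfer implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResolutionNormalization(nn.Module):
    """Sketch of VINN's network-integrated resolution-normalization:
    feature maps at native resolution are interpolated to a fixed internal
    base resolution (1.0 mm), instead of resampling the input image."""

    def __init__(self, base_res_mm: float = 1.0):
        super().__init__()
        self.base_res_mm = base_res_mm

    def forward(self, feats: torch.Tensor, native_res_mm: float) -> torch.Tensor:
        # A 0.8 mm voxel covers less space than a 1.0 mm voxel, so the
        # feature map is shrunk by 0.8/1.0 to reach the internal grid.
        scale = native_res_mm / self.base_res_mm
        return F.interpolate(feats, scale_factor=scale,
                             mode="bilinear", align_corners=False)

# Example: a 2D feature map from a 0.8 mm scan mapped to the 1.0 mm grid.
norm = ResolutionNormalization(base_res_mm=1.0)
feats = torch.randn(1, 64, 320, 320)       # N x C x H x W at 0.8 mm
inner = norm(feats, native_res_mm=0.8)     # -> torch.Size([1, 64, 256, 256])
```

The inverse mapping (internal grid back to native resolution) follows the same pattern with the reciprocal scale factor in the decoder.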
Finally, as has already been shown in manual segmentation of hypothalamic structures, exclusively utilizing T1w images as input forfeits the significant potential presented by the inclusion of multi-modal information (T1w and T2w) [6,10].
Common multi-modal F-CNN architectures, however, require all input modalities to always be present. The absence of any modality introduces a computational bias that the network is not trained to handle. To overcome missing modalities, proposed solutions include training a specific network for each input combination or providing the segmentation model with a synthesized version of the unavailable modalities [40,41]. Alternatively, training networks with synthetic image contrasts has also been suggested [38,42]. Even though these techniques have shown promising results, a more suitable model should be capable of extracting the most salient information for solving the given task from the available modalities, without the need for artificial images or multiple modality-specific networks. With this in mind, shared latent space models were introduced for the challenging task of multi-modal brain tumor segmentation [43-45]. This approach first translates the modalities into independent latent spaces; afterwards, the modalities' embedded information is merged inside the network into a shared latent representation. The shared latent space is then forwarded to the remaining network to solve the desired task. At inference time, the shared representation is computed from the available modalities, thus being robust to all input-modality combinations (i.e., hetero-modal) included in training.
To address the missing-modality challenge in a HiRes scenario, we incorporate the shared latent space concept into our voxel-size independent network (VINN). The resulting hetero-modal VINN (HM-VINN) introduces a fusion module that linearly combines the modalities inside the network. After passing the available scans through separate modality-specific convolutional blocks, the network weighs and merges the feature maps based on the best available information using a learnable weighted sum. As the output of the fusion module is normalized, a missing modality can be handled by assigning zero to its respective weight.

Contribution
To our knowledge, we are the first to tackle automated hetero-modal sub-segmentation of the hypothalamus and adjacent structures on high-resolutional brain MRI. The contributions of this work are the following: Firstly, we introduce a new hypothalamic labeling protocol adapted to the higher spatial resolution offered by 3T 0.8 mm isotropic MR images. The proposed protocol presents a more fine-grained parcellation of the hypothalamus and includes usually ignored brain structures, such as the hypophysis, epiphysis, optic nerve, optic chiasm, and optic tract, as illustrated in Figure 1. Secondly, we present HypVINN, a novel automated hypothalamic parcellation tool with a hetero-modal VINN (HM-VINN) architecture at its core, providing a solution to the multi-resolution and missing-modality challenges in a single model. We extensively show that the model's input flexibility does not compromise performance compared to state-of-the-art methods with fixed inputs in terms of segmentation accuracy, test-retest reliability, and generalizability. Moreover, our method replicates hypothalamic volume effects (e.g., age and sex) on subsets of the 0.8 mm (HiRes) Rhineland Study (n=463) and the 1.0 mm UK Biobank (n=535) [46,47]. Last but not least, and to the benefit of the research community, we will integrate the HypVINN tool into the user-friendly, open-source FastSurfer framework [24], available at https://github.com/Deep-MI/FastSurfer (code will be released upon acceptance).

Datasets
We used MR images from two population studies, namely the Rhineland Study (RS) [17,18] and the UK Biobank (UKB) [46,47], with resolutions of 0.8 mm (HiRes) and 1.0 mm, respectively. Participants from both studies gave written informed consent in accordance with the ethical guidelines of the individual studies. Furthermore, ethics approvals and regulations can be accessed on their respective webpages. For this work, we compiled four distinct datasets from the population studies: a manually annotated dataset (from RS), a generalizability dataset (from RS and UKB), a test-retest dataset (from RS), and a case-study dataset (from RS and UKB). The manually annotated dataset (referred to as the "in-house dataset") was initially split into two non-overlapping sets, one for training and validation, and the other for testing. The remaining datasets were exclusively used for evaluations to assess different aspects of our hetero-modal method.
The Rhineland Study is an ongoing population-based cohort study located in Bonn, Germany, which enrolls participants aged 30 years and above (www.rheinland-studie.de). MR scans were collected at two different sites using identical 3T Siemens MAGNETOM Prisma MRI scanners equipped with 64-channel head-neck coils. The core MRI acquisition protocol for every participant in the Rhineland Study includes the following MR contrasts: T1w, T2w, FLAIR, diffusion-weighted, susceptibility-weighted, resting-state functional, and abdominal Dixon MRI, with a total net scan time of around 45 minutes. Furthermore, an optional extra acquisition time (maximum 10 minutes) is available for a free protocol.
This paper utilized the 0.8 mm isotropic T1w and T2w MR scans. The T1 protocol consists of a multi-echo magnetization prepared rapid gradient echo (MPRAGE) sequence [48] with 2D acceleration [49], while the T2 protocol uses a 3D Turbo-Spin-Echo (TSE) sequence with variable flip angles [50].
Both sequences also utilize elliptical sampling [51] and parallel imaging (PI) [52] to expedite the imaging process. For this work, all protocol versions from the Rhineland Study were considered; sequence parameters are presented in Appendix Table A1.
We compiled the Rhineland Study datasets by first randomly selecting a subset (n=534) of participants with available T1w and T2w scans from sex and age strata to ensure a balanced population distribution. The sample has a mean age of 54.9 years (range 30 to 95), and 59.4% were women. We then further assigned participants to the in-house dataset and all its subsequent splits, adhering to the age- and sex-stratification scheme.
All T2w scans were registered to their corresponding T1w scan using FreeSurfer's mri_robust_register tool [53].
MRI scans of the in-house training and testing dataset (n=50) were manually annotated by an experienced rater and split into training/validation (n=44) and testing (n=6) sets. The training data was further split into four groups for cross-validation. Finally, the testing data was manually annotated a second time by our main rater to evaluate intra-rater variability. The rater was blinded to the scans' identities to avoid bias and an overestimation of performance.
For evaluating within-session test-retest reliability, we utilized the RS subset (n=21) with two in-session T1w scans. The additional scan for these participants was acquired during the time slot allocated for a free protocol within the Rhineland Study's MRI acquisition protocol. Due to the time constraint of the free protocol, a second T2w scan could not be acquired. Before starting the free protocol, participants were asked to move their head inside the head-neck coil. It is important to note that the T1w scans were not acquired back-to-back, but with a time gap of almost 30 minutes.
The MRI scans of the remaining participants (n=463) were compiled into the RS case-study dataset to evaluate the sensitivity to known hypothalamic volume effects (e.g., age and sex).
For a detailed description of the population characteristics of all the aforementioned RS subsets, see Appendix Tables A2 and A3.
We used data from the UK Biobank study to test the generalizability of our method to isotropic 1.0 mm scans from an unseen cohort with different acquisition parameters. An initial subset (n=544) of random participants was selected from sex and age strata to ensure a balanced population distribution.
The chosen sample has a mean age of 58.7 years (range 45 to 82) and consists of 52.6% women. Subsequently, the scans of nine random participants were manually labeled by our expert rater to evaluate segmentation accuracy at 1.0 mm (generalizability dataset). The remaining UKB participants (n=535, UKB case-study dataset) were used in the hypothalamic volume effects sensitivity analysis. A summary of the population characteristics of the UKB subsets is presented in Appendix Table A4.

Manual reference standard
An experienced rater manually annotated the sub-regions of the hypothalamus and adjacent structures on registered T1w and T2w images, except for the UK Biobank cases, where only T1w scans were available. The annotation was performed using Freeview, a visualization tool of FreeSurfer [54,55], which allows simultaneous viewing of the available modalities. Summarizing the labeling process, the borders of the unilateral hypothalamus were defined as follows [9]: a) anteriorly: the coronal plane passing through the most rostral tip of the anterior commissure and containing the optic chiasm; b) posteriorly: the coronal plane through the most caudal tip of the mammillary bodies; c) superiorly: the third ventricle with the diencephalic fissure; d) inferiorly: the junction to the optic chiasm rostrally and the hemispheric margin more caudally; e) medially: the wall of the third ventricle and the interhemispheric fissure; and f) laterally: rostrally the medial border of the optic tract, and more caudally the internal capsule, globus pallidus, and cerebral peduncle.
A detailed definition of the segmentation procedure for all the different substructures is provided in Appendix C. Adjacent small hypothalamic nuclei were grouped into subunits according to Table 1. An example of the manual segmentation scheme is illustrated in Figure 1, and an overview of all twenty-four segmented structures is presented in Table 1.

Hetero-modal segmentation network -HM-VINN
To accurately segment the hypothalamic sub-regions and adjacent structures, we employ VINN [34] as the foundation for our network design. VINN is a resolution-independent extension of the successful multi-network approach FastSurferCNN [24,30,31]. Both methods are 2.5D approaches, i.e., they aggregate the predictions of three 2D F-CNNs (one per anatomical view) with multi-slice input [24]. The F-CNNs follow a UNet-type layout with an encoder and a decoder arm of five competitive dense blocks (CDBs), separated by an additional bottleneck CDB (see Figure 2). In FastSurferCNN, all scale transitions between the CDBs are implemented via fixed-scale down- or up-sampling operations (i.e., (un)pooling). VINN, on the other hand, replaces the first and last scale transition with a flexible network-integrated resolution-normalization. Here, the native image resolution is explicitly integrated into the network and utilized to interpolate the feature maps to a common pre-defined network base resolution (1.0 mm). In turn, network capacity in the inner layers is available for the segmentation task, while voxel-size-dependent information is retained outside of them. Lastly, the view-aggregation step ensembles the resulting probability maps through a weighted average (axial = 0.4, coronal = 0.4, and sagittal = 0.2). The weight of the sagittal predictions is reduced compared to the other views, as left and right hemisphere labels are unified into one due to missing lateralization information in the sagittal view [24]. For the current segmentation task, we also unify lateralized structure labels into one for the sagittal view, consequently reducing the number of classes in the sagittal F-CNN from 24 to 15. Therefore, the VINN view-aggregation weighting scheme is also suitable for our application.
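A minimal sketch of this view-aggregation step is given below, assuming the sagittal probabilities have already been mapped back from the 15 unified classes to the full 24 lateralized classes; function and variable names are hypothetical.

```python
import torch

def view_aggregation(p_axial, p_coronal, p_sagittal,
                     weights=(0.4, 0.4, 0.2)):
    """Ensemble per-view class probability maps via a weighted average.
    Each p_* tensor has shape (num_classes, D, H, W); the sagittal map
    is assumed to already cover the full lateralized label set."""
    w_ax, w_cor, w_sag = weights
    return w_ax * p_axial + w_cor * p_coronal + w_sag * p_sagittal

# Hypothetical usage: final hard labels via argmax over the class axis.
probs = view_aggregation(torch.rand(24, 160, 160, 160),
                         torch.rand(24, 160, 160, 160),
                         torch.rand(24, 160, 160, 160))
labels = probs.argmax(dim=0)
```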
In this work, we extend VINN to a hetero-modal segmentation scenario (referred to as HM-VINN) by embedding the input modalities into a shared latent space [43-45]. Following this direction, we modify the standard F-CNNs from VINN to initially process T1w and T2w images independently of each other by replacing the first encoder CDB with modality-specific CDBs (Figure 2, T1-CDB* and T2-CDB*). After this independent stage, the feature maps are merged inside the network by a fusion module and fed into the following convolutional pipeline.
The implemented fusion module weights and merges the feature maps from the T1 and T2 branches based on the best available information using a learnable weighted sum. Let us denote the output feature map of T1-CDB* as $F_{T1} \in \mathbb{R}^{C \times H \times W}$ and the T2-CDB* output as $F_{T2} \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$, and $W$ represent the channel, height, and width dimensions, respectively. The output of the fusion module ($F_{fused}$) is then

$$F_{fused} = \frac{W_{T1}\,F_{T1} + W_{T2}\,F_{T2}}{W_{T1} + W_{T2}}, \tag{1}$$

where $W_{T1}$ and $W_{T2}$ are global learnable scalar parameters, both initialized at 0.5. The introduction of $W_{T1}$ and $W_{T2}$ allows the network to gradually learn the importance of each modality.
If a modality is more informative, its feature maps will have a higher weight. Additionally, as the output of the fusion module is normalized, missing one modality can be tackled by assigning zero to its respective weight. Thus, the fused features are identical to the encoder block output of the existing modality.
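A possible PyTorch realization of this fusion module, following the behavior stated above (learnable scalar weights, normalization by their sum, and zeroing the weight of a missing modality), is sketched below; the epsilon guard and the availability flags are our own illustrative choices.

```python
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    """Sketch of the learnable weighted-sum fusion (Eq. 1): modality
    feature maps are merged with global scalar weights normalized by
    their sum, so zeroing one weight returns the other branch exactly."""

    def __init__(self, init_weight: float = 0.5, eps: float = 1e-8):
        super().__init__()
        self.w_t1 = nn.Parameter(torch.tensor(init_weight))
        self.w_t2 = nn.Parameter(torch.tensor(init_weight))
        self.eps = eps

    def forward(self, f_t1: torch.Tensor, f_t2: torch.Tensor,
                has_t1: bool = True, has_t2: bool = True) -> torch.Tensor:
        # A missing modality is handled by forcing its weight to zero.
        w1 = self.w_t1 if has_t1 else torch.zeros_like(self.w_t1)
        w2 = self.w_t2 if has_t2 else torch.zeros_like(self.w_t2)
        return (w1 * f_t1 + w2 * f_t2) / (w1 + w2 + self.eps)

# With has_t2=False the output equals f_t1 (up to eps), matching the text.
fuse = FusionModule()
f1, f2 = torch.randn(2, 1, 64, 256, 256)
out = fuse(f_t1=f1, f_t2=f2, has_t2=False)
```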
In detail, all three F-CNNs follow the above-mentioned layout, and the CDB layout is kept as described in Figure 2: each CDB is composed of four sequences of parametric rectified linear unit (PReLU), convolution (Conv), and batch normalization (BN); in the modality-specific CDBs, the first PReLU is replaced with a BN to normalize the inputs.

Hetero-modal training procedure
Introducing additional variation through data augmentation during training helps neural networks become more robust. Here, we make HM-VINN robust to missing modalities by randomly dropping either the T1w or the T2w image of a given training example, with a uniform distribution over all input combinations (modality dropout). The modality weights in the fusion module are adjusted as follows: i) when both modalities are available, the network automatically assigns the weights (see Eq. 1); ii) if a modality is dropped, its corresponding fusion weight is set to zero, as described in the previous section. By starting this modality dropout procedure only after ten epochs, the proposed training procedure first establishes general segmentation capabilities (with all modalities available) before pivoting to more difficult scenarios with different combinations of missing modalities.
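A minimal sketch of this modality-dropout scheme follows; the function signature and the zero-filling of the dropped input are assumptions for illustration, with the availability flags intended to drive the fusion weights as described above.

```python
import random
import torch

def modality_dropout(t1, t2, epoch, start_epoch=10):
    """Sketch of the hetero-modal training augmentation: after a warm-up
    phase with both modalities, each training example is assigned one of
    the input combinations {T1+T2, T1 only, T2 only} uniformly at random.
    Returns the (possibly zeroed) inputs plus availability flags."""
    if epoch < start_epoch:
        return t1, t2, True, True
    combo = random.choice(["t1_t2", "t1", "t2"])
    if combo == "t1":
        return t1, torch.zeros_like(t2), True, False
    if combo == "t2":
        return torch.zeros_like(t1), t2, False, True
    return t1, t2, True, True
```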

Model learning
All F-CNNs are implemented in PyTorch [56] using a docker container [57]. Independent models for the axial, coronal, and sagittal views are trained for 100 epochs with a batch size of 16 using two NVIDIA Tesla V100 GPUs with 32 GB RAM, a weight decay of $10^{-4}$, and an initial learning rate of 0.05, which is decreased to 0.005 after 70 epochs. The networks are trained by optimizing a combined loss function of a median-frequency-weighted cross-entropy loss and a Dice loss [27]. This loss function encourages correct segmentation along anatomical boundaries and counters class imbalance by increasing the weights of less frequent classes.
To increase the generalizability of our model, we apply several spatial and intensity data augmentations during training. Spatial augmentations of the input images are limited to random affine transformations such as translation (range: −15 mm to 15 mm), rotation (range: −10° to 10°), and uniform scaling (factor: 0.85 to 1.15) [60]. Furthermore, we include internal scale augmentations of the feature maps, as introduced by FastSurferVINN, to improve segmentation performance [34]. Intensity augmentations are carried out to address two challenges: 1) intensity inhomogeneities due to scan parameters [60], and 2) artefacts introduced by defacing algorithms in regions of interest (e.g., the optic region). The first problem is tackled by applying a random bias field transformation [61,62] to the input images (coefficients range: −0.5 to 0.5). For the second issue, we improve the network's robustness to defaced scans by including scans with and without facial features in the training set; the modified scans are created with three common open-source defacing algorithms, including PyDeface [63]. In contrast to all above-mentioned transformations, defacing is performed statically before training ("offline") due to the high computation time of defacing a scan (more than 1 minute per method).
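The text does not name the library used to realize these augmentations; purely as an illustration, the stated ranges could be expressed with the TorchIO library roughly as follows (the library choice and file paths are assumptions).

```python
import torchio as tio  # hypothetical library choice

# Spatial and intensity augmentations mirroring the stated ranges:
# translation +/-15 mm, rotation +/-10 degrees, scaling 0.85-1.15,
# and a random bias field with coefficients in [-0.5, 0.5].
augment = tio.Compose([
    tio.RandomAffine(scales=(0.85, 1.15), degrees=10, translation=15),
    tio.RandomBiasField(coefficients=0.5),
])

subject = tio.Subject(
    t1=tio.ScalarImage("t1.nii.gz"),   # placeholder file paths
    t2=tio.ScalarImage("t2.nii.gz"),
)
augmented = augment(subject)
```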

Evaluation metrics
We compute three standard segmentation metrics (Dice similarity coefficient, volume similarity, and Hausdorff distance) to assess the similarity between the predicted label maps and the manual annotations [65]. We first evaluate the Dice similarity coefficient (Dice) [66,67], as it provides a spatial overlap consensus. Let $M$ (manual annotation) and $P$ (prediction) denote binary label maps; then Dice is defined as

$$\mathrm{Dice}(M, P) = \frac{2\,|M \cap P|}{|M| + |P|},$$

where $|M \cap P|$ represents the number of common elements (intersection) and $|M|$ and $|P|$ the number of elements in each label map. Dice values range from 0 to 1, and a higher Dice represents better segmentation agreement. Afterwards, we compute the volume similarity (VS), as volume measurements are usually the desired image-derived phenotype for downstream statistical analysis. VS is defined as

$$\mathrm{VS}(M, P) = 1 - \frac{\big|\,|M| - |P|\,\big|}{|M| + |P|}.$$

VS has the same range as Dice; however, it can reach its maximum value even when the spatial overlap is zero, as this metric does not consider spatial localization information. Additionally, a spatial distance-based metric is used to evaluate the quality of the segmentation boundary delineation (contour). Here, we use the 95% Hausdorff distance (HD95), a variant of the Hausdorff distance (HD) that is less sensitive to outliers [68]. HD95 takes the 95th percentile of the ordered distance measures instead of their maximum and is defined as

$$\mathrm{HD95}(M, P) = \max\Big(P_{95}\big\{\min_{p \in P} d(m, p) : m \in M\big\},\; P_{95}\big\{\min_{m \in M} d(p, m) : p \in P\big\}\Big),$$

where $d$ is the Euclidean distance and $P_{95}$ denotes the 95th percentile. In contrast to Dice and VS, HD95 is a dissimilarity metric, so a smaller value indicates better boundary delineation, with a value of zero being the minimum (perfect match).
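For reference, a NumPy/SciPy sketch of the three metrics as defined above is given below; the 0.8 mm default voxel spacing and the mask-based distance computation (distances to the nearest foreground voxel of the other map) are our own illustrative choices.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def dice(m: np.ndarray, p: np.ndarray) -> float:
    """Dice similarity coefficient between binary label maps."""
    inter = np.logical_and(m, p).sum()
    return 2.0 * inter / (m.sum() + p.sum())

def volume_similarity(m: np.ndarray, p: np.ndarray) -> float:
    """Volume similarity: 1 - | |M| - |P| | / (|M| + |P|)."""
    return 1.0 - abs(m.sum() - p.sum()) / (m.sum() + p.sum())

def hd95(m: np.ndarray, p: np.ndarray, spacing=(0.8, 0.8, 0.8)) -> float:
    """95th-percentile Hausdorff distance (in mm) between binary maps."""
    # EDT of the complement gives, for every voxel, the distance to the
    # nearest foreground voxel of that map; sampling it at the other
    # map's foreground yields the directed distance sets.
    d_to_p = distance_transform_edt(~p.astype(bool), sampling=spacing)
    d_to_m = distance_transform_edt(~m.astype(bool), sampling=spacing)
    d_mp = d_to_p[m.astype(bool)]   # directed distances M -> P
    d_pm = d_to_m[p.astype(bool)]   # directed distances P -> M
    return max(np.percentile(d_mp, 95), np.percentile(d_pm, 95))
```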
Finally, statistically significant differences in segmentation performance are confirmed throughout this work by a non-parametric paired two-sided Wilcoxon signed-rank test [69] after correcting for multiple testing using the Bonferroni correction (referred to as corrected p).
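As an illustration, such a test could be run per structure with SciPy as sketched below; the helper name and the base alpha of 0.05 are assumptions.

```python
from scipy.stats import wilcoxon

def paired_test(scores_a, scores_b, n_comparisons: int):
    """Two-sided Wilcoxon signed-rank test on paired per-subject scores,
    with a Bonferroni-adjusted significance threshold."""
    stat, p = wilcoxon(scores_a, scores_b, alternative="two-sided")
    alpha_corrected = 0.05 / n_comparisons
    return p, p < alpha_corrected
```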
For assessing the test-retest reliability of predicted volume measurements between two repeated scans of the same participant, we use the intra-class correlation coefficient (ICC). The ICC is a commonly used metric to assess the degree of agreement and correlation between measurements. ICC values range from 0 to 1, with higher values representing better reliability. Here, we compute the two-way fixed, absolute-agreement, single-measures ICC with a 95% confidence interval (ICC(A,1)) [70].
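A compact sketch of ICC(A,1) following the McGraw and Wong formulation is shown below; the synthetic example data are purely illustrative.

```python
import numpy as np

def icc_a1(x: np.ndarray) -> float:
    """ICC(A,1) after McGraw & Wong: two-way model, absolute agreement,
    single measures. x has shape (n_subjects, k_measurements), e.g. the
    per-structure volumes of the two in-session scans."""
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)
    col_means = x.mean(axis=0)
    # Mean squares of the two-way ANOVA decomposition.
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)   # subjects
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)   # scans
    sse = np.sum((x - row_means[:, None] - col_means[None, :] + grand) ** 2)
    mse = sse / ((n - 1) * (k - 1))                        # residual
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Example: volumes (mm^3) of one structure for 21 subjects, scans 1 and 2.
rng = np.random.default_rng(0)
v1 = rng.normal(1200, 100, size=21)
v2 = v1 + rng.normal(0, 10, size=21)
print(icc_a1(np.column_stack([v1, v2])))  # close to 1 for reliable volumes
```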

Experiments and results
This section is divided into four parts with the aim to thoroughly validate our hetero-modal method for segmentation of hypothalamic sub-regions and adjacent structures (referred to as HypVINN). The HypVINN model is composed of the HM-VINN architecture and the learning strategies introduced in Sections 2.3.2 and 2.3.3. i) Initially, we evaluate the segmentation accuracy of HypVINN's predictions against manual annotations; for this purpose, we benchmark the network based on its performance on the unseen test-set against multi- and uni-modal models, including the only other contemporary method for hypothalamus parcellation (Section 3.1.1), and against manual rater variability (Section 3.1.2). ii) We assess the generalizability of our method to a different image resolution, 1.0 mm isotropic MRI scans (Section 3.2). iii) We test the reliability of the predicted volumes in a within-session test-retest scenario (Section 3.3). iv) Finally, we measure the sensitivity of the proposed pipeline to replicate known hypothalamic volume effects with respect to age and sex (Section 3.4). To ensure that all experiments are carried out under the same testing conditions, all inference analyses are evaluated in a docker container with a 12 GB NVIDIA Titan V GPU. Model inference can also run on the CPU at reduced speed.

Accuracy
In this section, we benchmark and evaluate the accuracy of the hetero-modal HypVINN. All implemented networks are trained using the scheme described in Section 2.3.3.
To show a proof-of-concept for our proposed HypVINN in segmenting hypothalamic sub-regions and adjacent structures with missing input modalities, we benchmark our method against segmentation scenarios where all modalities are always available (i.e., uni-modal and multi-modal models). For this purpose, we implement the classic VINN with three different inputs: i) only T1w (T1-VINN), ii) only T2w (T2-VINN), and iii) both T1w and T2w (MM-VINN); we additionally compare against a re-trained version of the 3D-UNet for hypothalamus sub-segmentation proposed by Billot et al. [16]. Finally, performance is assessed on the unseen test-set by the three metrics (Dice, HD95, and VS).

Comparison with the state-of-the-art
In Table 2, we present the similarity scores for the global segmentation performance on all evaluation metrics as well as significance indicators (corrected p < 0.002). Here, we observe that HypVINN performs as well as the modality-specific models. In the T1w-only inference scenario, T1-VINN outperforms HypVINN in Dice and HD95; however, there is no statistical difference between them. On the other hand, when only a T2w scan is provided, both HypVINN and the specialized T2w model yield the lowest performance of all input variations (see Appendix Figure A2). Thus, performance is mainly driven by the T1-derived information, with T2w being only a support modality. For this reason, in the remaining experiments, we only use a T2w image in combination with a T1w image and not as a standalone modality.

Intra-rater reproducibility
In this section, we compare the performance of the automated methods against our main rater's variability (i.e., intra-rater variability). The intra-rater variability puts the accuracy results into context, as it can be seen as the ideal performance of an automated method. We assess this variability by computing the similarity between the two sets of manual segmentations of the main rater on the in-house test-set. Note that, in contrast to Section 3.1, all models are retrained on the full training dataset. The testing set is still unseen for these models and is only used for the final performance evaluation. These "final" models are additionally used for the generalizability (Section 3.2), reliability (Section 3.3), and sensitivity (Section 3.4) analyses.
In Figure 3, we present box plots for the three accuracy metrics (Dice, VS, and HD95) on the test-set for the three major regions (hypothalamic, optic, and others; see Section 2.2). We observe that our main rater has an overall good intra-rater agreement. The intra-rater scores outperform all the implemented automated methods in Dice and VS, with statistically significant differences for the structures of the hypothalamic region (corrected p < 0.004). Moreover, the intra-rater HD95 results for the hypothalamic region are significantly better than those of the 3D model. On the other hand, MM-VINN and HypVINN outperform the intra-rater results in recognizing tissue boundaries (HD95), even if no statistical significance can be inferred from the statistical test. We additionally observe that manually replicating the boundary outline of the structures in the others and optic regions is more challenging. Furthermore, we visually notice that all automated methods generate predictions similar to the manual ones, with the most considerable discrepancies in low-contrast boundaries such as those of the medial and lateral hypothalamus.

Generalizability
In this section, we evaluate the robustness of the proposed hetero-modal model (HypVINN) when generalizing to brain MRI scans with a different image resolution (1.0 mm isotropic) than the training one (0.8 mm isotropic). For this purpose, we utilize the MRI scans from the Rhineland Study (RS) in-house test-set (n=6) and the manually annotated random subset (n=9) of the UK Biobank (UKB) dataset (see Section 2.1). For the Rhineland Study, as the MR scans and the respective ground truth are at 0.8 mm isotropic resolution, we down-sample the pre-registered T1w and T2w scans from their native resolution to the desired 1.0 mm isotropic resolution. After the 1.0 mm scans are processed by the segmentation model, the resulting probability maps (i.e., soft labels) are up-sampled to the original 0.8 mm resolution before the hard labels are generated. This strategy prevents the downsampling of the manual labels to 1.0 mm, which would introduce interpolation artefacts that could decrease accuracy along boundaries, thereby impacting the analysis. On the other hand, no resampling is needed for the UK Biobank scans, as this dataset is acquired and labeled at 1.0 mm resolution. However, multi-modal evaluation is not possible for this dataset, as T2w scans are not available. Therefore, we limit the generalizability analysis on the UK Biobank dataset to the performance of the standalone T1w input models.
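A minimal sketch of this soft-label resampling strategy is given below; the trilinear interpolation mode and the tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def predict_at_native_resolution(probs_1mm: torch.Tensor,
                                 native_shape: tuple) -> torch.Tensor:
    """Sketch of the 1.0 mm RS evaluation strategy: class probability
    maps (soft labels) predicted on the down-sampled 1.0 mm scan are
    up-sampled back to the native 0.8 mm grid before taking the argmax,
    so the manual labels are never resampled."""
    # probs_1mm: (1, num_classes, D, H, W) on the 1.0 mm grid.
    probs_native = F.interpolate(probs_1mm, size=native_shape,
                                 mode="trilinear", align_corners=False)
    return probs_native.argmax(dim=1)  # hard labels at 0.8 mm

hard = predict_at_native_resolution(torch.rand(1, 24, 208, 208, 208),
                                    native_shape=(260, 260, 260))
```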
Finally, generalizability performance is assessed by the three similarity metrics (Dice, HD95, and VS) at the native resolution of the corresponding manual reference, except for the volume similarity of the 1.0 mm Rhineland Study predictions: VS does not require spatial overlap between label maps and can thus be computed without resampling to the same resolution.
Henschel et al. [34] demonstrated the generalizability capabilities of VINN, HM-VINN's parent architecture, to handle unseen resolutions. Their results, however, were achieved by utilizing multi-resolution data during training, which is a different scenario from ours, where only 0.8 mm data is available for training. Therefore, here we further compare the generalizability capabilities of our HM-VINN architecture for segmenting 1.0 mm MR scans against F-CNNs without resolution-independence mechanisms (HM-CNN). We implement HM-CNN by removing the flexible network-integrated resolution-normalization step inside HM-VINN and replacing it with a fixed scale transition. Moreover, for the current segmentation task, the proposed HypVINN model is composed of the HM-VINN architecture with external scale augmentation (exSA) during the learning process. To isolate the contributions of the different resolution-independence schemes, we train the HM-CNN with and without external scale augmentation. Furthermore, a comparison baseline of HM-VINN without external scale augmentation is also implemented. For a fair comparison, all benchmarked networks are trained using the same procedure as the "final" HypVINN (see Section 3.1.2). We limit this analysis to T1-input models only, as T1 is the primary MRI sequence for our segmentation task. Finally, in order to validate the robustness of HypVINN in both inference scenarios, we compare our method against the modality-specific models from the previous section (i.e., T1-VINN, MM-VINN, and 3D-UNet).
In Figures 5 and 6, we present the generalizability results for the segmentation evaluation metrics in the hypothalamic, optic, and others regions for both datasets. For the first comparison analysis (Figure 5), the inclusion of exSA during training in our HM-VINN architecture (proposed HypVINN, Figure 5, blue) shows better segmentation performance than the comparative baseline without exSA (Figure 5, purple) in all regions and metrics for both datasets, except for HD95 in the optic structures for UKB. Furthermore, the difference in performance is statistically significant (corrected p < 0.004) in RS for the hypothalamic (all metrics) and optic regions (Dice and HD95), and in UKB for the others (Dice and HD95) and optic regions (Dice and VS). Additionally, when exSA is included without resolution-independence (HM-CNN +exSA), we observe that the proposed HypVINN (HM-VINN +exSA) still yields more accurate segmentation scores, with statistical significance in RS for Dice and VS in all regions, and in UKB for Dice (others and optic), VS (all regions), and HD95 (others). On the other hand, we do not see statistical differences between HM-CNN +exSA and HM-VINN without exSA, except in RS for Dice in the others and optic regions and in UKB for HD95 in the hypothalamic region. Lastly, as expected, the vanilla HM-CNN (no exSA or resolution-independence) fails on both datasets for all regions, showcasing the expected generalizability issues of a standalone F-CNN for out-of-distribution resolutions.
Analyzing the generalizability results between input modalities, we observe that even though the models have not been trained at 1.0 mm resolution, they generalize remarkably well, as illustrated in Figures 6 and 7. For RS, no significant differences are found between the 2.5D models, except for the optic region, where both multi-modal models outperform the T1-input HypVINN with statistical significance (corrected p < 0.008; metric significance: Dice and VS for both methods, and HD95 only for MM-VINN). In UKB scans, no statistically significant differences are observed between the T1-input HypVINN and T1-VINN (corrected p > 0.008). Finally, when comparing against the 3D-UNet (which has been trained with external scale augmentation), the 2.5D models show significantly better Dice scores in RS for the hypothalamic and optic regions (corrected p < 0.004). For UKB, the 2.5D models significantly outperform the 3D-UNet in Dice and HD95 for the hypothalamic and others regions (corrected p < 0.004).

Test-retest reliability
Assuming that brain anatomy does not change within a single scan session, we evaluate the agreement between the volume estimates of the two repeated T1w scans. All implemented methods achieve excellent test-retest reliability, as can be seen in Appendix Figure A3. Furthermore, all implemented methods perform equally well for VS in all regions (VS > 0.98). Finally, we observe a statistically significant difference for the structures of the others region between HypVINN with multi-modal input and T1-VINN (VS: 0.9960 vs. 0.9927, corrected p < 0.05).

Sensitivity to age and sex effects
Previous studies have shown that men have a larger hypothalamus volume than women, at a global level [71] but also at a sub-unit level [9,12]. Therefore, in this section, we aim to use the automated hypothalamic volume estimates to replicate these findings and explore volume-age correlations in a general population, representing a feasible scenario in which our method will be used as the post-processing analysis pipeline. To this end, we process the T1w scans from the Rhineland Study (n=463) and UK Biobank (n=535) case-study datasets (see Section 2.1) with our proposed HypVINN. To further evaluate the robustness of our hetero-modal model to handle different modalities, we also assess the effects in the Rhineland cases when both pre-registered T1w and T2w scans are available at inference. Ideally, the direction of the effects should not be modified by the input scenario (only T1w, or T1w & T2w). We note that a joint T1w & T2w analysis in the UK Biobank is not possible due to the absence of T2w scans. All generated predictions are visually inspected by an experienced rater. A total of six participants from the Rhineland Study (RS) and fifteen participants from the UK Biobank (UKB) are excluded due to segmentation errors. Volume effects are then assessed for each structure with a linear regression model (volume ∼ age + sex + eTIV + T1_seq + T2_seq), where eTIV denotes the estimated total intracranial volume and T1_seq/T2_seq the MRI sequence versions. All statistical analyses are performed in R [72], and eTIV estimations are computed using FreeSurfer [73]. It is important to note that automated segmentations can be carried out without bias-field-corrected scans; here, we correct the bias field in a pre-processing step primarily for the partial volume estimation, which is a post-processing step to the segmentation.
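Although the analyses in this work are performed in R, an equivalent model specification can be sketched in Python with statsmodels as follows; the column names and the input file are hypothetical.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical columns: one row per participant with the predicted volume
# of a structure, age, sex, eTIV, and MRI sequence version covariates.
df = pd.read_csv("hypothalamus_volumes.csv")  # hypothetical file

# Linear model mirroring the covariate set described above; the study
# itself ran these analyses in R.
model = smf.ols("volume ~ age + sex + eTIV + T1_seq + T2_seq", data=df).fit()
print(model.summary())
```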
The predicted volumes for the total hypothalamus are in line with the results of smaller studies [6,8,9,14,74] with a similar global anatomical definition (from 910 mm³ to 1580 mm³), as can be seen in Figure 8a. For the sub-regions, we observe that the tubular region is the smallest segmented hypothalamic structure (≈45.9 mm³) and the posterior hypothalamus the biggest one (≈379.3 mm³). However, a direct comparison of the size of our hypothalamic sub-regions with other studies is not possible due to the different segmentation protocols.
For both the RS and UKB subsets, the total hypothalamus volume significantly decreased (p < 0.001) with age (see Figure 8b). An exception are the middle structures (medial and lateral hypothalamus), whose volumes are positively correlated with age. However, this positive correlation across all middle structures is not observed in the UKB, where a significant increase is not found for the lateral hypothalamus.
Furthermore, all structures, independent of the dataset and except for the medial hypothalamus in UKB, show statistically significant sex differences (p < 0.05) even after correcting for head size, with men having larger hypothalamic volumes than women (see Figure 8c). These results are in line with previous findings [9,12,71]. Moreover, as expected, all inferred volumes are positively associated with eTIV (p < 0.01).
Independent of the provided MRI input, the age and sex effects on hypothalamic volume estimates in the Rhineland Study obtained with our HypVINN exhibit the same directional trends. Moreover, even though HypVINN is trained with all RS sequence versions, we observe differences between sequences; however, none of them are significant (p > 0.05). Nevertheless, controlling for MRI sequence in any downstream statistical analysis is recommended when image biomarkers are obtained from multiple MRI sequences.
From the visual quality assessment, we observe that our tool performs very well on the two different datasets; examples of correct segmentations for four random male participants of different ages are shown in Figure 9. For the failing cases, we note that segmentation errors are mainly present when there is a severe deformation of the third ventricle (see Appendix Figure A4).

Discussion
In this paper, we present the first hetero-modal model for automated sub-segmentation of the hypothalamus and adjacent structures on high-resolutional brain MRI. Firstly, we introduce a different segmentation protocol of the hypothalamus compared to the one proposed by Makris et al. [9]; therefore, we re-train the only other contemporary method for hypothalamus sub-segmentation of 1 mm T1w images [16] for comparison. The parcellation method of Makris et al. was developed for in-vivo semi-automatic hypothalamic segmentation using 1.5T isotropic 1 mm MR images and was therefore necessarily less detailed than the one presented in this work. In general, we define the boundaries of the hypothalamus as a whole according to the same anatomical definitions and landmarks used by them. Yet, for sub-segmentation of the different hypothalamic sub-regions, we use a more fine-grained approach to take optimal advantage of the higher spatial resolution offered by the available 3T 0.8 mm isotropic MR images.
Consequently, our approach results in the sub-segmentation of more hypothalamic structures, as detailed in Table 1. For example, whereas both the posterior hypothalamus and the mammillary bodies were included under the label "posterior hypothalamus" in the parcellation scheme of Makris et al., our method provides separate volumetric estimates for each of these structures, which is of clinical relevance given that they operate in a functionally independent manner. Another noteworthy difference between the two parcellation schemes concerns the subdivision of the medial part of the hypothalamus: in contrast to Makris et al., who subdivided this region into a superior and an inferior tuberal region, we follow the more conventional neuroanatomical subdivision of this region into the medial and the lateral hypothalamus (using the fornix as the boundary between these two structures) and a tubular region. For the tubular region, we group the tuberomammillary region, the median eminence, and the arcuate nucleus. Again, we opt for this approach to gain more detailed anatomical information about the various substructures of the hypothalamus. In addition, our method also provides automatic segmentation of several other important structures in the vicinity of the hypothalamus, for which, until now, no automated segmentation procedure has been available. Notably, these adjacent hypothalamic structures include the hypophysis (i.e., the pituitary gland), which is the body's principal and most versatile endocrine gland, responsible for the central regulation of most other endocrine tissues throughout the body; the epiphysis, the site where the "sleep hormone" melatonin is synthesized; as well as all major structures of the central optic system, including the optic nerves, the optic chiasm, and the optic tracts.
Despite the small size of the different sub-structures and their low contrast on MR images, our novel deep-learning technique (HypVINN) can accurately segment all twenty-four structures even when input modalities are missing at inference time.
HypVINN performs as well as state-of-the-art modality-specific F-CNNs. Passing a T2w scan as standalone input to HypVINN or to a specialized T2w model generates the lowest performance of all input variations (see Section 3.1.1). For our hetero-modal model, the difference in contribution between T1- and T2-derived information is quantifiable in the modality weights of the fusion module, with the weight of the T1 branch ($W_{T1}$) tripling the T2 one. Thus, an available T1w scan is more important for the current segmentation task than a T2w scan.
Nonetheless, we demonstrate that including a T2w scan can still be beneficial for some structures, as models with multi-modal information generally yield better segmentation performance.
Unequal performance between inference setups (i.e., available input modalities) has also been reported for other hetero-modal deep-learning segmentation tasks, with better results achieved when the primary modality was available [43-45]. In our case, the preference for the T1 modality could be explained by the inherent modality bias of the manual annotation process. Our labeling protocol is mainly performed on the T1w scans, and the T2w scans are only used as a support modality, as most anatomical boundaries are visible in T1. Hence, evaluating segmentation performance with the current manual labels is not entirely neutral across the various inference configurations. A fairer evaluation would require training and validation using manual annotations explicitly tailored to a structure's visible anatomical characteristics in each input combination. However, generating $2^m - 1$ manual labels per participant, where $m$ represents the number of modalities, is not feasible, as creating manual annotations for a single configuration is already expensive and time-consuming. Therefore, based on our findings, we recommend utilizing a T2w scan accompanied by a T1w scan (i.e., as multi-modal input), and not as a standalone input, for the current segmentation task.
Our hetero-modal model, when a T1w image is included, exhibits segmentation performance in the range of the main rater's variability (see Section 3.1.2). The intra-rater variability can be seen as the ideal performance of the automated method, as we use manually annotated labels from the main rater to train our F-CNNs; it is therefore challenging for an automated approach to outperform the intra-rater scores. Considering this, the accuracy in the hypothalamic region of our hetero-modal model and all benchmark methods is lower than the intra-rater agreement on all evaluation metrics. Yet, the underperformance in this region can also be attributed to the low MR contrast between neighboring structures, especially for the medial and lateral hypothalamus. Nonetheless, the segmentation results are on par with other deep-learning techniques on similar brain segmentation tasks (i.e., small structure size and low contrast across anatomical boundaries) [16,30].
HypVINN not only performs well on segmenting isotropic 0.8 mm T1w and T2w MR scans, but it also exhibits generalizability to isotropic 1 mm MR scans from the Rhineland Study and the UK Biobank dataset (see Section 3.2). We demonstrate that utilizing the resolution-independence mechanism performs as well as external scale augmentation for handling unseen resolutions when training with a single (0.8 mm) resolution.
Furthermore, we show that resolution-independence combined with external scale augmentation (proposed) outperforms all other comparative baselines. It is important to note that our HypVINN could be retrained with multi-resolution data to support a wider range of resolutions and to fully exploit the advantages of using a voxel-size independent F-CNN [34].
Furthermore, HypVINN performs equally well as the modality-specific models on both 1 mm datasets. As expected, performance on the Rhineland Study data is higher than on the UK Biobank. The UK Biobank dataset consists of scans from a different cohort acquired with a different MRI acquisition protocol; due to these dissimilarities, segmentation performance is not directly comparable. Nevertheless, the proposed HypVINN generalizes quite well to this external dataset. Finally, even though our model supports both 0.8 mm and 1 mm resolutions, we recommend processing 0.8 mm MR scans at their native resolution to obtain more detailed and precise predictions by leveraging the additional information present in the higher resolution. Note that our proposed model also shows promising results on the high-resolutional MRI scans from the Human Connectome Project (HCP) young adult and lifespan pilot project datasets [19-21]; see Appendix Figure A5 for prediction examples of our tool on HCP scans.
Throughout this work, we compare our HypVINN against the re-trained version of the 3D-UNet with extensive data augmentations proposed by Billot et al. [16] for hypothalamus sub-segmentation. Our results demonstrate that our method not only outperforms the 3D-UNet in terms of segmentation accuracy (see Sections 3.1.1 and 3.1.2) but also exhibits better generalizability across both comparative datasets (see Section 3.2). Additionally, the training process for the 3D-UNet using the authors' released implementation and recommended training parameters takes approximately 100 hours per model using the GPU setup described in Section 2.3.3. In contrast, back-to-back training of the three F-CNNs that compose our HypVINN takes around 19 hours (roughly 6 hours per F-CNN). Therefore, besides outperforming the contemporary method, our approach can be (re)trained more efficiently with a lower carbon footprint.
As demonstrated on the Rhineland Study data, all automated methods exhibit excellent test-retest agreement between in-session volume estimates (see Section 3.3). Additionally, our HypVINN shows high robustness and generalizability across the general population of the Rhineland Study and UK Biobank case-study datasets, with only twenty-one cases (2.10%) across the two datasets being excluded from the age and sex analysis due to segmentation errors (see Section 3.4). The most common factor for our pipeline to fail is a severe deformation of the third ventricle (i.e., out-of-distribution cases), which generates unclear hypothalamic boundaries, as illustrated in Appendix Figure A4. Therefore, careful inspection is recommended when using our tool in aging populations and clinical cohorts, as the prevalence of large ventricles increases with age and certain diseases (e.g., Alzheimer's disease, Parkinson's disease, etc.). We recommend visually inspecting the predictions from scans with pathological changes and from volumetric outliers within the cohort before including them in any downstream analysis, particularly outliers in the third ventricle and medial/lateral hypothalamus. Although volumetric outlier detection can help identify predictions with significant failures, more robust quality control tools are desirable. However, developing these tools is outside this paper's scope and will be future work.
Lastly, since HypVINN is based on deep learning, it can easily be fine-tuned to be more robust to out-of-distribution cases by retraining with manual annotations created on participants with low segmentation quality.
In line with previous studies on smaller datasets [9,12,71], we also find that the volume of the total hypothalamus is larger in men compared to women. However, our analyses in two substantially larger population-based cohorts reveal that the volumes of virtually all hypothalamic substructures are significantly larger in men, independent of head size. Our findings thus warrant further detailed association studies to investigate the clinical relevance of these pronounced sex differences in the human hypothalamus. On the other hand, the age effects derived from small-scale studies present inconsistent results for the different hypothalamic substructures, except for the total hypothalamus, whose volume decreases with age [6,9,16,71]. Our method's total hypothalamic volume estimates also replicate this negative correlation with age. Furthermore, although most hypothalamic regions atrophy with increasing age, the volume of the middle/tuberal region of the hypothalamus significantly increases with age. This finding is novel and could imply that specific hypothalamus regions are resistant to age-associated atrophy. Indeed, the paraventricular nucleus contained within the medial hypothalamic region exhibits striking stability in terms of neuronal numbers, both with age and in the context of common neurodegenerative diseases such as Alzheimer's disease [75]. These findings thus underscore the need for further large-scale studies into the differential effects of age on the different hypothalamic substructures.
In conclusion, we demonstrate that HypVINN can successfully identify the desired structures with similar or better performance than state-of-the-art modality-specific models in terms of segmentation accuracy, generalizability, and test-retest reliability. Furthermore, the fact that HypVINN replicates previous age and sex findings on large unseen subsets of the Rhineland Study and the UK Biobank corroborates the stability and sensitivity of our method. Moreover, our hypothalamic sub-segmentation tool generates accurate segmentations regardless of whether both T1w and T2w images are available or just a single T1w image. However, utilizing both modalities results in slightly improved segmentation outcomes.
Overall, we introduce HypVINN, the first hetero-modal deep-learning method for hypothalamic sub-segmentation and segmentation of other adjacent structures, such as the hypophysis, epiphysis, and the major structures of the central optic system.
The proposed method offers a more detailed parcellation of the hypothalamus compared to the only other contemporary automated method [16]. Additionally, it can generate accurate segmentations from T1w and T2w MR images at isotropic 0.8 mm or 1 mm resolutions. Finally, HypVINN will be incorporated into the FastSurfer neuroimaging software suite, thus providing an easy-to-use alternative for a more reliable assessment of hypothalamic imaging-derived phenotypes.

Data and code availability statement
This work uses MRI data from the Rhineland Study and the UK Biobank. The Rhineland Study data is not publicly available because of data protection regulations. However, access can be provided to scientists in accordance with the Rhineland Study's Data Use and Access Policy. Requests to access the data should be directed to Dr. Monique Breteler at RS-DUAC@dzne.de. UK Biobank data are available through a procedure described at http://www.ukbiobank.ac.uk/using-the-resource/.
The method presented in this article will be made publicly available on GitHub (https://github.com/Deep-MI/FastSurfer) upon acceptance.

Table A1: Sequence parameters for the T1-weighted and T2-weighted versions in the Rhineland Study. To date, there have been two versions of the T1w sequence (T1w-a, T1w-b) and four versions of the T2w sequence (T2w-a to T2w-d); care was taken to preserve the image contrast between versions for both sequences.

Figure A5: Prediction examples on the Human Connectome Project (HCP) young adult and lifespan pilot project datasets [19-21] from our proposed HypVINN with multi-modal input (MM) for six random participants. We observe that our tool shows promising results at both available HCP resolutions (0.7 mm and 0.8 mm). Furthermore, our tool seems to generalize well across age categories inside the training age range (the training data started at age 30). However, all the above observations are only qualitative, and no segmentation accuracy metrics can be computed, as manual annotations are unavailable for this dataset. Note: T1w, T2w, and HypVINN outcomes are presented for each participant. Furthermore, in each participant's row, the first three images display the different hypothalamic structures in the coronal view, and the remaining image shows the structures in the axial view. The color lookup table for all visible structures is presented on the right.

Figure 1: T1-weighted (T1w) and T2-weighted (T2w) images and ground truth (GT) from two participants. The proposed manual segmentation scheme is composed of twenty-four structures divided into three major regions: 1) hypothalamic (anterior, middle, and posterior), 2) optic, and 3) others. The color lookup table* for all structures is presented on the left, and a detailed overview of the three regions is presented in Table 1. *Structures are not visible in the presented snapshots.

Figure 2: Hetero-Modal VINN (HM-VINN) architecture in HypVINN. Input modalities are first independently processed by modality-specific competitive dense blocks (T1-CDB* and T2-CDB*). Afterward, modality-specific feature maps are merged inside the network by our proposed fusion module (light green) to create a shared latent space. During inference time, the shared latent space can be computed over the available modalities and fed into the remaining network. Furthermore, HM-VINN incorporates flexible transitions in the first and last scale transition by utilizing the network-integrated resolution-normalization (green). Each CDB is composed of four sequences of parametric rectified linear unit (PReLU), convolution (Conv), and batch normalization (BN). In the modality-specific CDBs and the second encoder block (CDB*), the first PReLU is replaced with a BN to normalize the inputs.


Figure 3: Segmentation performance comparison on the in-house test-set between manual intra-rater scores and our proposed HypVINN and benchmark F-CNNs. HypVINN (dark red and dark blue) produces results comparable to the manual intra-rater agreement (gray). Note: similarity scores are presented for the hypothalamic, others, and optic regions. Additionally, a letter directly on top of a box plot indicates which other models the model significantly outperforms (paired Wilcoxon signed-rank test, corrected significance per region: hypothalamic p < 0.004; others p < 0.007; optic p < 0.008).
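The significance letters in these captions rest on paired model comparisons. The following minimal sketch, with hypothetical Dice scores, shows how such a paired Wilcoxon signed-rank test against a corrected threshold can be computed:

```python
from scipy.stats import wilcoxon

# per-subject Dice scores of two models on the same test cases
# (numbers are hypothetical, purely for illustration)
dice_hypvinn  = [0.89, 0.91, 0.88, 0.90, 0.92, 0.87]
dice_baseline = [0.86, 0.90, 0.85, 0.88, 0.91, 0.84]

stat, p = wilcoxon(dice_hypvinn, dice_baseline)  # paired signed-rank test
print(p < 0.004)  # compare against the corrected alpha for the hypothalamic region
```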
We evaluate the ability of our hetero-modal model (HypVINN) to generalize to brain MRI scans with a different image resolution (1.0 mm isotropic) than the training one (0.8 mm isotropic). For this purpose, we utilize the MRI scans from the Rhineland Study (RS) in-house test-set (n=6) and a random, manually annotated subset (n=9) of the UK Biobank (UKB) dataset (see Section 2.1). For the Rhineland Study, as the MR scans and respective ground truth are at 0.8 mm isotropic resolution, we down-sample the pre-registered T1w and T2w scans from their native resolution to the desired 1.0 mm isotropic resolution. After the 1.0 mm scans are processed by the segmentation model, the resulting probability maps (i.e., soft labels) are up-sampled to the original 0.8 mm resolution before hard labels are generated.
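A minimal sketch of this resolution round-trip (down-sample, segment, up-sample soft labels, derive hard labels) follows. It assumes a 3D volume and a generic model for simplicity, although HypVINN itself operates on 2.5D slices; the helper is hypothetical and not part of FastSurfer.

```python
import torch
import torch.nn.functional as F


def segment_via_1mm(model, vol_08mm):
    """Down-sample a 0.8 mm volume to 1.0 mm, segment, up-sample the
    soft labels, and derive hard labels on the original grid."""
    # vol_08mm: (1, 1, D, H, W) tensor on a 0.8 mm isotropic grid
    vol_10mm = F.interpolate(vol_08mm, scale_factor=0.8, mode="trilinear",
                             align_corners=False)  # 0.8/1.0 voxel-size ratio
    with torch.no_grad():
        probs_10mm = torch.softmax(model(vol_10mm), dim=1)  # (1, C, d, h, w)
    # up-sample probability maps (soft labels) back to the 0.8 mm grid
    probs_08mm = F.interpolate(probs_10mm, size=vol_08mm.shape[2:],
                               mode="trilinear", align_corners=False)
    return probs_08mm.argmax(dim=1)  # hard labels at 0.8 mm
```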
The inclusion of external scale augmentation (exSA) during training of our HM-VINN architecture (proposed HypVINN, Figure 5, blue) yields better segmentation performance than the comparative baseline without exSA (Figure 5, purple).

Figure 5: Retrospective benchmarking of single-resolution (0.8 mm) trained networks segmenting 1.0 mm T1w MR scans from the Rhineland Study and UK Biobank. Our proposed approach (HypVINN), consisting of the HM-VINN architecture plus external scale augmentation (+exSA, blue), outperforms the other comparison baselines in both manually labeled datasets. The HypVINN used in this analysis is the one trained in Section 3.1.2. Note: similarity scores are presented for the hypothalamic, others, and optic regions. Additionally, a letter directly on top of a box plot indicates which other models the model significantly outperforms (paired Wilcoxon signed-rank test, corrected significance per region: hypothalamic p < 0.004; others p < 0.007; optic p < 0.008).

Figure 6: Segmentation performance comparison between our proposed HypVINN, with multi-modal input (MM) and uni-modal T1 input (T1), and the modality-specific models for segmenting 1.0 mm MR scans from the Rhineland Study and UK Biobank. HypVINN (dark red and dark blue) generalizes remarkably well to 1.0 mm MR scans independent of the provided MRI input. The F-CNN models used in this analysis are the ones trained in Section 3.1.2. Note: similarity scores are presented for the hypothalamic, others, and optic regions. Additionally, a letter directly on top of a box plot indicates which other models the model significantly outperforms (paired Wilcoxon signed-rank test, corrected significance per region: hypothalamic p < 0.004; others p < 0.007; optic p < 0.008).
We process the T1w scans from the Rhineland Study (n=463) and UK Biobank (n=535) case-study datasets (see Section 2.1) with our proposed HypVINN. To further evaluate the robustness of our hetero-modal model to different modalities, we also assess the effects in the Rhineland cases when both pre-registered T1w and T2w scans are available at inference. Ideally, the direction of the effects should not change with the input scenario (only T1w, or T1w & T2w). We note that a joint T1w & T2w analysis in the UK Biobank is not possible due to the absence of T2w scans. All generated predictions are visually inspected by an experienced rater. A total of six participants from the Rhineland Study (RS) and fifteen participants from the UK Biobank (UKB) are excluded after this quality control.

Figure 7: Segmentation examples in the coronal view from our proposed HypVINN with T1 input, together with the manual ground truth (GT), for one labeled 1.0 mm scan from the UK Biobank and one 1.0 mm scan from the Rhineland Study unseen test-set. Even though our proposed method is not trained with 1.0 mm scans, it generates accurate predictions at this resolution. Note: the color scheme for the visible structures is presented on the right.
Hypothalamic volume is negatively associated with age (Figure 8b). This negative association is also observed in the subregions, except for the middle structures (e.g., the tuberal region).

Figure 8: Hypothalamic volume estimates (a) and volume associations with age (b) and sex (c) for HypVINN in participants from the Rhineland Study (n=457) and UK Biobank (n=520). Age and sex effects on hypothalamic volume estimates in the Rhineland Study follow the same direction regardless of the provided MRI input. Furthermore, our model replicates previous sex-difference findings in both datasets, corroborating the stability and sensitivity of our method. Note: *effects are obtained after accounting for head size (eTIV) and modality sequence (Rhineland Study only).
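As an illustration of how such adjusted effects can be estimated, the sketch below regresses hypothalamic volume on age and sex while accounting for eTIV. The file and column names are hypothetical, and the model omits the modality-sequence covariate used for the Rhineland Study.

```python
import pandas as pd
import statsmodels.formula.api as smf

# hypothetical per-participant table with columns 'volume' (hypothalamus,
# from HypVINN), 'age', 'sex', and 'etiv' (head size)
df = pd.read_csv("hypvinn_volumes.csv")
fit = smf.ols("volume ~ age + C(sex) + etiv", data=df).fit()
print(fit.summary())  # signs of the age/sex coefficients give the effect directions
```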
The most common reason for exclusion is an unclear boundary of the hypothalamus due to severe enlargement of the third ventricle, as illustrated in Appendix Figure A4.

Figure 9: Examples of correct predictions in the Rhineland Study (a-b) and UK Biobank (c-d) from our proposed HypVINN with multi-modal [MM] or T1w-only [T1] input for four unseen, randomly selected male participants of different ages. Note: for each participant, the T1w, T2w (Rhineland Study participants only), and HypVINN outcomes are presented. Furthermore, in each participant's row, the first three images display the different hypothalamic structures in the coronal view, and the remaining three images show all remaining structures in the axial view. The color lookup table for all visible structures is presented on the right.

Santiago Estrada: Methodology, Software, Validation, Formal analysis, Investigation, Conceptualization, Writing - original draft, Writing - review & editing, Visualization. David Kügler: Conceptualization, Methodology, Validation, Data Curation, Writing - original draft, Writing - review & editing. Emad Bahrami: Investigation, Software, Validation, Data Curation, Writing - original draft. Peng Xu: Data Curation, Investigation, Validation, Writing - original draft. Dilshad Mousa: Data Curation, Investigation. Monique M. B. Breteler: Supervision, Funding acquisition, Resources, Writing - review & editing. N. Ahmad Aziz: Conceptualization, Validation, Resources, Writing - original draft, Writing - review & editing, Supervision, Funding acquisition. Martin Reuter: Conceptualization, Validation, Resources, Writing - original draft, Writing - review & editing, Supervision, Project administration, Funding acquisition.

Acknowledgment
We would like to thank the Rhineland Study group for supporting the data acquisition and management. This work was supported by DZNE institutional funds, by the Federal Ministry of Education and Research of Germany (031L0206, 01GQ1801), the Chan Zuckerberg Initiative (Project FastSurfer), the Helmholtz-AI project DeGen (ZT-I-PF-5-078), an Alzheimer's Association Research Grant (Award Number: AARG-19-616534), and the NIH (R01 LM012719, R01 AG064027, R56 MH121426, and P41 EB030006). Peng Xu was supported by a scholarship from the China Scholarship Council, and N. Ahmad Aziz was supported by a European Research Council Starting Grant (Number: 101041677). This research has been conducted using the UK Biobank Resource under Application Number 82056. Data in the appendix were also provided in part by the publicly available Human Connectome Project (HCP), WU-Minn Consortium (Principal Investigators: David Van Essen and Kamil Ugurbil; 1U54MH091657), funded by the 16 NIH Institutes and Centers that support the NIH Blueprint for Neuroscience Research, and by the McDonnell Center for Systems Neuroscience at Washington University.

Appendix A.

Figure A2: T1-block learnable modality weight during training. The T1-block weight has a much higher value (≈ 0.75) than the T2-block weight (≈ 0.25) in HypVINN's fusion module, starting from the early training steps in all four cross-validation training splits (i.e., S1, S2, S3, and S4). Thus, performance is mainly driven by the T1-derived information, with T2w being only a support modality.

Figure A5: Examples of correct predictions in the Human Connectome Project (HCP) young adults (HCP-YA, A-C) and HCP lifespan pilot project (HCP-LPP, D-E) datasets [19-21] from our proposed HypVINN with multi-modal input (MM) for six random participants. Our tool shows promising results at both available HCP resolutions (0.7 mm and 0.8 mm). Furthermore, it seems to generalize well across age categories within the training age range (training data started at age 30). However, these observations are only qualitative; no segmentation accuracy metrics can be computed, as manual annotations are unavailable for this dataset. Note: T1w, T2w, and HypVINN outcomes are presented for each participant. Furthermore, in each participant's row, the first three images display the different hypothalamic structures in the coronal view, and the remaining image shows the structures in the axial view. The color lookup table for all visible structures is presented on the right.

Table 1: Summary of the hypothalamic sub-regions and adjacent structures included in the proposed labeling scheme, with their corresponding names, anatomical designations, and regions.

Table 2: Mean (and standard deviation) segmentation performance of the cross-validated F-CNN models on the unseen test-set. The proposed hetero-modal HypVINN performs as well as the modality-specific models. Furthermore, HypVINN with multi-modal and standalone T1w input outperforms the 3D-UNet proposed by Billot et al. [16], the only other contemporary method for hypothalamus parcellation. Note: the statistical significance column (Signif.) indicates which other models the model outperforms (Wilcoxon signed-rank test, corrected p < 0.002).
iii) T1w & T2w (multi-modal (MM)-VINN). For the multi-modal model, the input passed to the network consists of a multi-channel image created by stacking T1w and T2w image slices on top of each other (see the sketch after this paragraph). Additionally, we compare our HypVINN against the method proposed by Billot et al. [16], a 3D-UNet with extensive data augmentation for hypothalamic sub-segmentation on T1w images. A direct comparison of our predicted outcomes with the results of the already-trained model from Billot et al. is not possible, as our annotation protocol segments more structures and uses a different hypothalamic parcellation. Therefore, we utilize the implementation provided by the authors to retrain their T1w model from scratch with our manual annotations. Note that we do not fine-tune the implementation from Billot et al., and any optimization of their tool is outside this paper's scope. Furthermore, all comparative VINN baselines follow the same 2.5D scheme mentioned in Section 2.3.1, and inference in HypVINN is done per input combination. The difference between the results in the following two sections lies in the data used for training: for Section 3.1.1 and Table 2, all networks are trained in a 4-fold cross-validation scheme to also generate validation performance on the holdout validation split (see Appendix B for ablation results); for all other results, we use the full training set (n=44).
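A minimal sketch of the multi-channel input construction referenced above; the seven-slice 2.5D stack follows FastSurfer's convention and, like the tensor sizes, is an assumption here.

```python
import torch

# pre-registered T1w and T2w 2.5D stacks (seven neighboring slices each;
# sizes are illustrative)
t1_stack = torch.rand(1, 7, 256, 256)
t2_stack = torch.rand(1, 7, 256, 256)

# the MM model receives both modalities concatenated along the channel axis
mm_input = torch.cat([t1_stack, t2_stack], dim=1)  # shape: (1, 14, 256, 256)
```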

Table A2: Demographics of the Rhineland Study participants for all datasets. Descriptive data were expressed as mean (SD) or count (percentage) for continuous and categorical variables, respectively. Inter-group differences were compared with Student's t-test for continuous variables and with Pearson's chi-square test for categorical variables.

Table A3: Demographics for the training and testing in-house datasets. Descriptive data were expressed as mean (SD) or count (percentage) for continuous and categorical variables, respectively. Inter-group differences were compared with Student's t-test for continuous variables and with Pearson's chi-square test for categorical variables.

Table A4: Demographics of the UK Biobank participants for all datasets. Descriptive data were expressed as mean (SD) or count (percentage) for continuous and categorical variables, respectively. Inter-group differences were compared with Student's t-test for continuous variables and with Pearson's chi-square test for categorical variables.
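For completeness, a minimal sketch of the two statistical comparisons named in these table captions, using hypothetical data:

```python
from scipy.stats import ttest_ind, chi2_contingency

# continuous variable (e.g., age): Student's t-test between two groups
age_group1 = [54.2, 61.0, 48.5, 70.1, 57.3]   # hypothetical values
age_group2 = [52.3, 66.4, 49.9, 68.7, 58.8]
t_stat, p_age = ttest_ind(age_group1, age_group2)

# categorical variable (e.g., sex): Pearson's chi-square on a 2x2 count table
counts = [[30, 20],   # group 1: female, male (hypothetical counts)
          [25, 25]]   # group 2
chi2, p_sex, dof, expected = chi2_contingency(counts)
```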

Table B1: Fusion module weighting-scheme optimization: mean (and standard deviation) segmentation performance metrics per input modality for the ablative hetero-modal VINN (HM-VINN) architectures on the validation set. Global weights outperform per-channel weights in all comparative metrics and all inference scenarios. Note: the statistical significance column (Signif.) indicates which other models the model outperforms (Wilcoxon signed-rank test, corrected p < 0.002).