Deep learning models have achieved remarkable success in segmenting brain white matter lesions in multiple sclerosis (MS), becoming integral to both research and clinical workflows. While brain lesions have gained significant attention in MS research, the involvement of spinal cord lesions in MS is relatively understudied. This is largely owing to the variability in spinal cord magnetic resonance imaging (MRI) acquisition protocols, high individual anatomical differences, the complex morphology and size of spinal cord lesions, and, lastly, the scarcity of labeled datasets required to develop robust segmentation tools. As a result, automatic segmentation of spinal cord MS lesions remains a significant challenge. Although some segmentation tools exist for spinal cord lesions, most have been developed using sagittal T2-weighted (T2w) sequences and focus primarily on the cervical spine. With the growing importance of spinal cord imaging in MS, axial T2w scans are becoming increasingly relevant due to their superior sensitivity in detecting lesions compared to sagittal acquisition protocols. However, most existing segmentation methods struggle to generalize to axial sequences due to differences in image characteristics caused by the highly anisotropic spinal cord scans. To address these challenges, we developed a robust, open-source lesion segmentation tool tailored specifically to axial T2w scans covering the whole spinal cord. We investigated key factors influencing lesion segmentation, including the impact of stitching together individually acquired spinal regions, of straightening the spinal cord, and of using 2D versus 3D convolutional neural networks (CNNs). Drawing on these insights, we trained a multi-center model on an extensive dataset of 582 MS patients, comprising 2,167 scans in total. We empirically evaluated the model’s segmentation performance across spinal segments and lesion sizes. Our model significantly outperforms the current state-of-the-art methods, providing consistent segmentation across the cervical, thoracic, and lumbar regions. To support the broader research community, we integrate our model into the widely used Spinal Cord Toolbox (v7.0 and above), making it accessible via the command sct_deepseg lesion_ms_axial_t2 -i <path-to-image.nii.gz>.

Multiple sclerosis (MS) is a complex autoimmune disease affecting the central nervous system, including the brain and spinal cord (Jakimovski et al., 2024). With over 2.8 million people impacted worldwide, MS is a leading cause of severe neurological disability in young adults (Walton et al., 2020). The disease is characterized by inflammatory demyelinating lesions, which are visible as hyperintensities on T2-weighted (T2w) MRI scans. Consequently, MRI has become the primary tool for diagnosing (Thompson et al., 2017) and monitoring MS (M. A. Rocca et al., 2024). However, most MRI studies focus on the brain and the measurement of atrophy in its anatomical regions (Calabrese et al., 2010; Kidd et al., 1999; Lucchinetti et al., 2011), despite spinal cord lesions being integral to MS diagnosis (Kearney et al., 2015; Thompson et al., 2017) and contributing significantly to disability (Bussas et al., 2022; Lauerer et al., 2024; M. Rocca et al., 2022).

Spinal cord MRI poses unique challenges due to its small size, tubular structure, proximity to moving organs, and the heterogeneous composition of surrounding tissues (Jasperse, 2024). These factors make lesion detection and manual segmentation cumbersome, time-consuming, and prone to variability (Gros et al., 2018). Existing open-source segmentation methods (De Leener et al., 2014, 2015; Gros et al., 2018; Naga Karthik et al., 2024) trained on specific regions of the spine fail to generalize under shifts in the data distribution caused by highly anisotropic axial scans. Furthermore, methods developed for segmenting brain lesions (Gentile et al., 2023; Mendelsohn et al., 2023; Schmidt et al., 2012; Shiee et al., 2010; Wiltgen et al., 2024) do not necessarily generalize to those of the spinal cord. In this study, we adopt an approach that simultaneously segments the spinal cord and lesions, focusing on axial T2w scans. Axial scans are superior for lesion detectability and can enhance diagnostic certainty in differentiating MS from other diseases, especially when high sensitivity for spinal cord lesions is required to confirm findings from sagittal scans or to detect small lesions potentially missed on sagittal sequences (Breckwoldt et al., 2017; Galler et al., 2016; Kerbrat et al., 2020; Weier et al., 2012). Such protocols are increasingly applied in clinical practice (Kearney et al., 2015), although they require extended scanning times and, usually, acquisition in three chunks covering the cervical, upper thoracic, and thoracolumbar part of the spine; for brevity, we will refer to the latter two simply as thoracic and lumbar (part of the spine).

Automated spinal cord lesion segmentation presents several additional challenges. The appearance of lesions, along with their ambiguous boundaries due to partial volume effects in anisotropic scans, poses a significant challenge to the generalizability of segmentation models. Moreover, available training data (i.e., images with manually annotated reference standard lesion masks) are sparse, with a particular shortage for the thoraco-lumbar spinal cord. This is critical because, although most lesions develop in the cervical cord, a considerable proportion occurs in the thoracic spinal cord (Eden et al., 2019; Hua et al., 2015; Poulsen et al., 2021). Even in the most caudal part of the spinal cord (i.e., in the conus medullaris), lesions can occur and be decisive for the individual patient regarding disability and differential diagnosis (Dubey et al., 2019; Mariano et al., 2021). In addition, it is unclear which data preprocessing and training strategies lead to the optimal delineation of lesions. For instance, given the similar lesion intensities within axial slices across vertebral levels, does segmentation performance increase with additional training data from other (possibly even distant) spinal levels? We hypothesize that training on whole-spine scans obtained by stitching individual chunks and training a 3D model that leverages additional context surrounding the cord could provide more insight into this question. Further, how does the curvature of the spinal cord impact segmentation performance? Can straightening, a preprocessing strategy that virtually erects the spinal cord into a vertical column (De Leener, Mangeat, et al., 2017), simplify the lesion segmentation task? Our hypothesis here is that straightening could improve segmentation performance by removing curvature-related variance in the spinal cord. Lastly, does segmenting both the spinal cord and the lesions simultaneously result in better performing models than those trained to segment only the lesions? Echoing the conclusion of a previous study on segmenting brain tumors (Isensee et al., 2020), we hypothesize that treating the spinal cord and lesion masks as overlapping regions could result in better gradients during training and improve segmentation performance overall.

Lastly, given that previous studies have primarily focused on the automatic segmentation of MS lesions in the cervical or cervico-thoracic spine (Eden et al., 2019; Gros et al., 2018), we systematically study the impact of the above data preprocessing and training strategies to develop a robust deep learning-based method for the automatic segmentation of MS lesions in axial T2w scans covering the entire spine. Specifically, we:

  1. Use a region-based approach to simultaneously segment MS lesions and the spinal cord. We improve upon the performance of the existing tools (Bédard et al., 2025; Gros et al., 2018), particularly in the lower-thoracic and lumbar spines.

  2. Evaluate preprocessing methods such as stitching individual chunks versus training on raw chunks and the effects of straightening versus no straightening.

  3. Compare patch-wise hybrid 2D/3D and slice-by-slice 2D convolutional kernels for training spinal cord lesion segmentation models.

  4. Investigate generalizability across various sites and imaging protocols for spinal cord lesion segmentation.

2.1 MRI data and ground truth

MRI data were retrospectively collected from four different sites, encompassing both cross-sectional and longitudinal datasets with partial and full spinal cord coverage, from a total of 582 patients. Data collection followed institutional review board approval and was performed in accordance with the Declaration of Helsinki.

The longitudinal dataset consisted of 317 patients acquired at a single site, namely the Klinikum Rechts der Isar, Technical University of Munich, Munich, Germany (referred to as the TUM dataset). Each patient underwent two MRI examinations (sessions), with each session containing 3 individual chunks of highly anisotropic axial T2-weighted scans covering the cervical, thoracic, and lumbar spine, respectively. The chunks are numbered 1-3 in the cranio-caudal direction for each spinal region. Participants at this site spanned a large variety of clinical phenotypes and conditions, namely, clinically isolated syndrome (CIS; n=11), radiologically isolated syndrome (RIS; n=1), primary progressive MS (PPMS; n=11), secondary progressive MS (SPMS; n=14), relapsing-remitting MS (RRMS; n=277), and unknown (n=3). The dataset comprises patients with spinal cord lesions and those without lesions at one or both longitudinal time points. The mean time interval between baseline and follow-up scan was 3.1±2.6 years, with an average age of 36.1±10.4 years at baseline. Of the 317 patients, 209 were female and 108 male.

The ground-truth (GT) spinal cord and lesion masks were manually annotated by two experienced raters (SR and RW). Annotations were guided by an iterative pre-labeling scheme, which involved an initial training phase where a segmentation network was trained on a manually labeled subset of the cohort. Intermediate segmentations were obtained from this trained network followed by manual corrections by raters SR and RW. This pre-labeling approach significantly expedited the annotation process, enabling the inclusion of a large number of patients and whole spine scans, which would have been practically infeasible with manual annotation alone.

Starting from the raw chunks of the individual spinal regions in the native space (Chunks Native), Figure 1A shows the dataset characteristics of the different preprocessed variants of the TUM dataset obtained after stitching and straightening the chunks. More details are given in Section 2.4 and Figure 2.

Fig. 1.

Study flowchart. (A) Characteristics of the TUM dataset along with its preprocessed variants. (B) Characteristics of data from different sites and the respective train/test splits. n denotes the number of patients, and nvol denotes the total number of images.

Fig. 2.

Illustration of the dataset versions and respective lesion appearance. Each image shows the middle sagittal slice of the axial T2w sequence from a representative subject in the dataset. Lesions, highlighted in red, represent the maximum intensity projection of the lesion mask on the sagittal slice, allowing visualization of all lesions along the whole spinal cord. (A) Middle sagittal slice of a stack of three axial T2w acquisitions covering the cervical, thoracic, and lumbar regions of the spine. (B) Stitched whole-spine scan derived from the respective chunks. (C) Axial slice of lesions in each chunk. (D) Straightened version of the individual chunks, where the curvature of the cord was removed using sct_straighten_spinalcord. (E) Stitched configuration of the previously straightened chunks. (F) Axial slice of lesions in the straightened chunk. In both stitched and straightened variants, the lesion characteristics remain consistent, indicating a (desired) minimal impact on lesion morphology.


For the cross-sectional datasets, 153 patients were scanned at the NYU Langone Medical Center, New York, USA, 80 subjects were scanned at the Brigham and Women’s Hospital, Harvard Medical School, Boston, USA, and 32 were scanned at the Zuckerberg San Francisco General Hospital, San Francisco, California, USA. The scans from all three sites primarily covered the cervical or cervico-thoracic spine, with only one of these chunks per patient. The lesion masks were annotated manually by raters at the respective sites. The spinal cord masks were generated with sct_deepseg_sc from the Spinal Cord Toolbox (De Leener, Lévy, et al., 2017), followed by manual corrections wherever necessary. Clinical data were not available for the MS patients across the sites. Following our experiments with the chunks from the TUM dataset, the scans from these three sites were primarily used to improve and test the segmentation model’s generalization capabilities. Figure 1B shows how the data from different sites are combined for developing the lesion segmentation model. Table 1 shows a detailed overview of the image characteristics and acquisition parameters from all sites in this study.

Table 1.

Overview of image characteristics and acquisition parameters from each site.

Site (number of scans); scanner (n): TR (ms, median), TE (ms, median); resolution (mm3, min/max across scans)

Klinikum Rechts der Isar, Technical University of Munich, Munich, Germany (1999 scans)
  Siemens 1.5T (n=14): TR 6065, TE 104
  Siemens 3T (n=290): TR 7070, TE 107
  Philips 1.5T (n=5): TR 3034, TE 100
  Philips 3T (n=1687): TR 4435, TE 90
  GE 1.5T (n=3): TR 6280, TE 114
  Resolution: min 0.19×0.19×1.95, max 0.78×0.78×6.83

NYU Langone Medical Center, New York, USA (209 scans)
  Siemens 3T (n=209): TR 4000, TE 107
  Resolution: min 0.56×0.56×3.3, max 0.78×0.78×11.91

Brigham and Women’s Hospital, Harvard Medical School, Boston, USA (80 scans)
  Siemens 3T (n=80): TR 5070, TE 101
  Resolution: min 0.56×0.56×3, max 0.7×0.7×3

Zuckerberg San Francisco General Hospital, San Francisco, California, USA (32 scans)
  GE 3T (n=32): TR 3516, TE 72
  Resolution: min 0.29×0.29×3, max 0.7×0.7×6

TR and TE represent the median repetition and echo times, respectively.

2.2 Preprocessing framework

2.2.1 Stitching

In clinical practice, full spinal cord scans are impractical due to long acquisition times. An alternative is to stitch individual segments into whole-spine scans. In this paper, we refer to stitching as the process of aligning and merging multiple image segments or chunks, in this case, axial T2-weighted (T2w) scans acquired during the same session, into a single image. It must be noted that chunks may also be registered to the template individually; however, it is preferred to directly register the stitched image as this simplifies the computation of lesion morphometrics by eliminating the need to reconcile overlapping regions or risk double-counting that can occur when multiple chunks are processed separately (Graf, Möller, et al., 2024).

To compare whether the model benefits from the extended context around the spinal cord in stitched whole-scans, we employed the stitching algorithm implemented in (Graf, Platzek, et al., 2024; Lavdas et al., 2019). This algorithm has four main steps: (i) the corner points are computed for each chunk, (ii) using the affine matrix of each chunk, corner points are rotated to find the smallest volume bounding box that fits all the chunks, resulting in optimal rotation and spacing parameters, (iii) the chunks are resampled to the new (common) image space and occupancy maps containing overlapping regions between chunks are stored, and (iv) the resampled chunks are blended using ramp-based interpolation with a weighted averaging function to ensure smooth transitions between chunks. Manual corrections are applied to the stitched mask, if required, to reduce interpolation artifacts. Figure 2A-C shows the result of the stitching algorithm and the subsequent appearance of lesions in the whole-spine scan.
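
To make step (iv) concrete, the following minimal NumPy sketch cross-fades two chunks that have already been resampled to a common grid (steps i-iii). It is an illustration of ramp-based blending in general, not the actual implementation from Graf, Platzek, et al. (2024); the function name and the assumption that voxels outside each chunk's field of view are zero are ours.

```python
import numpy as np

def blend_chunks(chunk_a, chunk_b, axis=2):
    """Cross-fade two chunks already resampled to the same grid.

    Both arrays have identical shape and are assumed to be zero outside their
    original field of view. Non-overlapping voxels are copied; in the overlap,
    intensities are blended with a linear ramp along `axis` (chunk_a fades out
    cranio-caudally while chunk_b fades in).
    """
    mask_a, mask_b = chunk_a != 0, chunk_b != 0
    overlap = mask_a & mask_b

    out = np.where(mask_a, chunk_a, chunk_b).astype(np.float32)
    if not overlap.any():
        return out

    # Slices (along `axis`) spanned by the overlap region
    other_axes = tuple(i for i in range(3) if i != axis)
    idx = np.where(overlap.any(axis=other_axes))[0]
    lo, hi = idx.min(), idx.max()

    # Linear ramp: 0 at the first overlapping slice, 1 at the last one
    ramp = np.zeros(chunk_a.shape[axis], dtype=np.float32)
    ramp[lo:hi + 1] = np.linspace(0.0, 1.0, hi - lo + 1)
    ramp[hi + 1:] = 1.0
    shape = [1, 1, 1]
    shape[axis] = chunk_a.shape[axis]
    w = ramp.reshape(shape)

    # Weighted average only where both chunks contribute
    blended = (1.0 - w) * chunk_a + w * chunk_b
    out[overlap] = blended[overlap]
    return out
```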

While the stitching procedure described in (Graf, Platzek, et al., 2024) includes a registration-based alignment alternative, we chose not to use it in this study, instead relying on the global alignment of chunks within the scanner coordinate space. This approach could theoretically introduce misalignment due to patient movement, similar to intra-scan motion. However, our quality checks showed that misalignment was very rare. When it did occur, it was typically in non-overlapping regions where empty slices resulted from suboptimal MRI planning. Moreover, given the minimal chunk overlap and the inherent challenges of registering axial T2w spinal cord images, such as highly anisotropic resolution, partial volume effects, and the lack of distinct anatomical landmarks for segmentation-based registration, we found that a registration-based approach introduced additional alignment errors and was ultimately inferior in our stitching experiments.

2.2.2 Straightening

In the analysis of brain MS lesions, MRI scans of the brain are typically registered to the MNI anatomical template using rigid registration (Carass et al., 2017; Wiltgen et al., 2024). Following this approach of analyzing the lesions in a template space, one would have to register the individual chunks to, for example, the PAM50 template (De Leener et al., 2018). As it is difficult to obtain the vertebral levels in axial scans with thick sagittal slices, we opted for a simpler approach involving straightening of the spinal cord (De Leener, Mangeat, et al., 2017). Straightening essentially eliminates all curvature-related variance in the spinal cord, offering a simpler alternative to template-based registration and facilitating cohort-level analysis. The straightening procedure using Spinal Cord Toolbox (SCT) (De Leener, Lévy, et al., 2017) is described as follows: Given an axial T2w scan (chunk or stitched), we applied sct_straighten_spinalcord using the spinal cord segmentation mask as reference to obtain the straightened cord along with the warping field (warp_curve2straight). Then, the output warping field was applied using sct_apply_transfo with linear interpolation to straighten both the GT spinal cord and the lesion masks. Lastly, both the straightened GT masks were binarized using a threshold of 0.5. Figure 2D-F shows the straightened individual chunks, straightened whole-spine scans and the resulting appearance of lesions across spinal regions, with a reduced field of view resulting from the straightening algorithm.
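
The straightening steps described above can be condensed into the following sketch, which wraps the SCT command-line calls with Python's subprocess module. File names are illustrative, and the exact flag names should be verified against the SCT version in use.

```python
import subprocess

def run(cmd):
    """Run an SCT command and fail loudly if it errors."""
    subprocess.run(cmd, check=True)

# 1) Straighten the cord using the spinal cord mask as the centerline reference;
#    this also writes warp_curve2straight.nii.gz / warp_straight2curve.nii.gz.
run(["sct_straighten_spinalcord", "-i", "chunk_T2w.nii.gz",
     "-s", "chunk_T2w_sc-mask.nii.gz", "-o", "chunk_T2w_straight.nii.gz"])

# 2) Bring the GT spinal cord and lesion masks into the straightened space
for mask in ["chunk_T2w_sc-mask.nii.gz", "chunk_T2w_lesion-mask.nii.gz"]:
    out = mask.replace(".nii.gz", "_straight.nii.gz")
    run(["sct_apply_transfo", "-i", mask, "-d", "chunk_T2w_straight.nii.gz",
         "-w", "warp_curve2straight.nii.gz", "-x", "linear", "-o", out])
    # 3) Re-binarize after linear interpolation (threshold 0.5)
    run(["sct_maths", "-i", out, "-bin", "0.5", "-o", out])
```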

2.3 Training protocol

The continuing dominance of nnUNet (Isensee et al., 2021, 2024) across several open-source segmentation challenges has shown that a well-tuned convolutional neural network (CNN) architecture is robust and continues to achieve state-of-the-art results over novel transformer-based architectures (Hatamizadeh et al., 2021; Karthik et al., 2024; Ma et al., 2024). Based on this rationale, we used nnUNetV2 (Isensee et al., 2021) as the segmentation framework. Specifically, we used nnUNet’s region-based training strategy, where the model treats the masks of the spinal cord and lesions as overlapping regions and segments them simultaneously. The rationale for this approach is two-fold: (i) combining the label of a larger object (i.e., the spinal cord) with a smaller one (i.e., the lesion) might help obtain better gradients for effective learning early during training, and (ii) it has been shown that directly optimizing for overlapping regions forming a hierarchy of sub-structures (e.g., lesions within the spinal cord) results in better performing models than optimizing for each object as an independent class (Isensee et al., 2020). In practice, this reframes the two-class segmentation task (with independent spinal cord and lesion masks) as two binary segmentation tasks (where the first class is the union of the spinal cord and lesion masks and the second class is the lesion mask), optimized using the sigmoid activation with a compound Dice and binary cross-entropy loss. While region-based training requires both spinal cord and lesion masks as inputs, the rise of robust spinal cord segmentation tools (Bédard et al., 2025; Gros et al., 2018; Naga Karthik et al., 2024) across various MRI contrasts and pathologies enables obtaining spinal cord segmentation masks without significant manual annotation costs.
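
The following sketch illustrates how two independent masks are recast as the overlapping regions used for region-based training (region 1: spinal cord or lesion; region 2: lesion only), each then optimized as a binary sigmoid channel with a compound Dice + binary cross-entropy loss. The helper function is illustrative and not part of nnUNet's code.

```python
import numpy as np

def to_regions(cord_mask, lesion_mask):
    """Recast independent cord/lesion masks as two overlapping binary regions.

    Region 1: spinal cord OR lesion (lesions lie within the cord, so this is
              simply the full cord extent).
    Region 2: lesion only.
    Each region becomes its own binary (sigmoid) output channel instead of a
    mutually exclusive softmax class.
    """
    region_cord = ((cord_mask > 0) | (lesion_mask > 0)).astype(np.uint8)
    region_lesion = (lesion_mask > 0).astype(np.uint8)
    return np.stack([region_cord, region_lesion], axis=0)  # shape (2, X, Y, Z)
```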

We used the 2d and 3d_fullres models in their default configurations as automatically set by nnUNet. The 2D model employs 2D convolutional kernels throughout all stages of the U-Net architecture. For the 3d_fullres model, however, the default configuration resulted in hybrid 2D/3D kernels given the highly anisotropic nature of our dataset. In particular, out of the 6 layers in the encoder, the 1st layer used a 3×3×1 kernel with stride 1×1×1 (i.e., no downsampling/pooling), the 2nd and 3rd layers used 3×3×1 kernels with a stride of 2×2×1 (i.e., in-plane downsampling only), and the 4th, 5th, and 6th layers used 3×3×3 kernels with a stride of 2×2×2. Note that the first 3 layers have a singleton kernel dimension along the slice axis, and downsampling/pooling along that axis is only performed once the slice thickness lies within a factor of 2 of the in-plane resolution. Therefore, despite being a 3D model (i.e., trained on 3D input patches), the actual configuration consists of a mix of 2D and 3D convolutional kernels.
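
The hybrid kernel plan can be summarized as follows; this is a simplified restatement of the configuration nnUNet produced for our data, not the planner code itself.

```python
# Simplified restatement of the encoder plan nnUNet produced for our
# anisotropic axial T2w data (not the actual nnUNet planner code).
encoder_plan = [
    # (kernel size,  stride)
    ((3, 3, 1), (1, 1, 1)),  # stage 1: 2D kernel, no downsampling
    ((3, 3, 1), (2, 2, 1)),  # stage 2: in-plane downsampling only
    ((3, 3, 1), (2, 2, 1)),  # stage 3: in-plane downsampling only
    ((3, 3, 3), (2, 2, 2)),  # stage 4: full 3D kernel and pooling
    ((3, 3, 3), (2, 2, 2)),  # stage 5
    ((3, 3, 3), (2, 2, 2)),  # stage 6
]

def pool_out_of_plane(in_plane_mm, slice_thickness_mm):
    """Pool along the slice axis only once its spacing is within a factor
    of 2 of the in-plane resolution (the rule described above)."""
    return slice_thickness_mm / in_plane_mm <= 2.0
```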

Default data augmentation transforms applied by nnUNet were used, namely, random rotation, scaling, mirroring, Gaussian noise addition, Gaussian blurring, adjusting image brightness and contrast, low-resolution simulation, and gamma transformation. All scans were converted to RPI orientation and preprocessed using Z-score normalization. For each of the four preprocessed datasets, 2D and 3D models were trained using three-fold cross validation resulting in a total of 24 models. All models were trained for 1000 epochs, with a batch size of 2, using the stochastic gradient descent optimizer with a polynomial learning rate scheduler.

While nnUNet is known for automatically configuring the patch sizes of the images used during training based on the hardware capacity, its default patch sizes could be sub-optimal depending on the size of the input images. Specifically, in the case of the stitched datasets, the automatically-configured patch sizes for 2D/3D-hybrid models were defined to be half the size of the median shape of the images from the training set. This means that the model effectively received only half of the total length of the (stitched) spinal cord in the S-I plane as a 3D input patch during training. To evaluate whether fitting the whole context of the cord within a single input patch would improve the segmentation performance, we also trained a model on a patch size covering the entire S-I plane, meaning that the model has access to the whole context of the spinal cord in the sagittal plane.

All segmentation networks were trained with nnUNet v 2.4.1 using Python 3.10.13 and Pytorch 2.2.1. Training both 2D and 3D models (with hybrid 2D/3D kernels) on any given dataset variant took approximately 1 day on a single NVIDIA A100 GPU with 40 GB memory.

2.4 Experimental design

Experiments were designed following two major themes: (i) investigating the effects of data preprocessing (namely, stitching and straightening the chunks) on lesion segmentation using the TUM dataset, and building on this (ii) leveraging data from additional sites to obtain a robust and generalizable segmentation model for MS lesions in axial T2w scans.

Regarding the first theme, we created four versions of the TUM training dataset (see Fig. 1A) to evaluate how the different formats of the input scans affected the downstream lesion segmentation performance.

  1. chunks native, the original dataset containing the individual chunks of axial T2w scans corresponding to cervical, thoracic, and lumbar regions of the spine,

  2. stitched native, consisting of a single scan per patient with the individual cervical, thoracic and lumbar chunks stitched together to form the whole spinal cord,

  3. chunks straightened, where the individual chunks in their native space are straightened using the spinal cord segmentation masks and the model is trained in the straightened image space, and,

  4. stitched straightened, where the stitched images in their native space are straightened using the whole-spine cord segmentation masks.

We trained 2D and 3D models on each of the four dataset variants, resulting in a total of 8 models to investigate the impact of input spatial context. As mentioned in the previous section, the 3D models used a mix of 2D and 3D convolutional kernels owing to the dataset characteristics. Please refer to Table A1 in Appendix A for a detailed comparison of training configurations. Only anisotropic, axially acquired T2-weighted scans were used in our experiments. Furthermore, because of the evolution in lesion appearance between sessions, the images from both sessions were treated as independent inputs for training the segmentation model. The TUM dataset was split patient-wise with an 80/20% train/test ratio, ensuring that the chunks of a given patient belonged either to the training set or the testing set but not to both. For the chunks native dataset, this resulted in a total of 1522 individual chunks from 254 patients for training/validation and 380 chunks from 63 patients for testing. For the stitched native dataset, this translated into 508 stitched whole-spine scans from 254 patients in the training/validation set and 126 scans from 63 patients in the testing set.
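
The patient-wise split can be reproduced with a grouped splitter such as scikit-learn's GroupShuffleSplit, which keeps all chunks and sessions of a patient on the same side of the split; the dataframe columns and file name below are illustrative.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# One row per image (chunk or stitched scan), with a 'patient_id' column
df = pd.read_csv("tum_dataset_index.csv")  # illustrative file name

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(df, groups=df["patient_id"]))

train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]
# All chunks/sessions of a given patient end up exclusively in one split
assert not set(train_df["patient_id"]) & set(test_df["patient_id"])
```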

As for the second theme focusing on robustness and generalization, all 153 patients’ chunks from the NYU site were added to the training set while keeping the BWH and UCSF sites unseen, in order to test the model’s generalization to out-of-distribution data. Such a training and evaluation approach partly overlaps with a lifelong learning scenario where the segmentation model is continually enriched with data from new sites by adding them to the current training set and testing on data from unseen sites (Naga Karthik et al., 2022). Figure 3 shows the distribution of lesions in terms of the lesion volume across train and test splits for each site.

Fig. 3.

Lesion distribution plot across train and test sites computed from the ground truth lesion masks. Outliers with lesion volumes exceeding 400 mm3 were excluded for visualization purposes only (not for model training). All sites and sets follow a similar distribution of lesion count over size, validating a meaningful train/test split for the experiments and training of the final model.


2.5 Evaluation protocol and metrics

2.5.1 Spinal cord segmentation

We compared our models on the TUM test set with three other open-source methods available in SCT: (i) sct_propseg (De Leener et al., 2014), (ii) sct_deepseg_sc (Gros et al., 2018), and (iii) the recently proposed contrast-agnostic segmentation model (Bédard et al., 2025). The test split was done at the patient level (and not at the chunk level), ensuring that the axial scans of a particular patient strictly belonged only to the testing set, resulting in a total of 126 test sessions from 63 patients. The performance of these models was then evaluated independently on each of the three chunks (i.e., cervical, thoracic, and lumbar) from the TUM dataset.

2.5.2 Lesion segmentation

We created three test sets: the TUM test set (n=63 patients) and the chunks from the BWH (n=80) and UCSF (n=32) sites, which were kept unseen during training, to evaluate the model’s performance on out-of-distribution data. First, to determine the dataset variant and training strategy that lead to the best performance, 8 models (4 dataset variants, 2 models each after combining the results from 3 folds) were compared. Then, the best of the 8 models (based on the Dice score) and the model trained on chunks from two sites, along with sct_deepseg_lesion (Gros et al., 2018), were evaluated on the TUM, BWH, and UCSF test sets.

2.5.3 Metrics

While the ANIMA toolbox (Commowick et al., 2021), specifically animaSegPerfAnalyzer, is widely used for the evaluation of brain MS lesion segmentation models, it fails to output appropriate metrics in the special case where the ground-truth mask contains no lesions, irrespective of whether the model predicts false positive lesions or not. Skipping such cases in the evaluation of spinal cord MS lesions would result in biased, higher metric scores by not accounting for the false positive rate of the segmentation models. To address these limitations, we used the open-source MetricsReloaded package (Maier-Hein et al., 2024; Reinke et al., 2023), designed to mitigate several shortcomings of existing open-source toolboxes. In the context of this study, when the GT and the automatic prediction are both empty (i.e., contain no lesions), the voxel-wise Dice is set to 1, rewarding the model for correctly predicting the absence of lesions. When either of them is non-empty, the respective false-positive and false-negative scores are computed according to their formal definitions. As for metrics, for spinal cord segmentation, we computed the voxel-wise Dice score, normalized surface distance (NSD), and relative volume error (RVE). For lesion segmentation, we reported the voxel-wise Dice score, NSD, the average absolute difference between the number of predicted and GT lesions, and the lesion-wise F1 score, where a prediction was considered a true positive if it showed at least a 10% overlap with the GT lesion.
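
The two conventions above, the empty-mask Dice rule and the 10% overlap criterion for lesion-wise matching, can be operationalized as in the sketch below using SciPy connected components. The exact matching rules follow MetricsReloaded's implementation; this is only a simplified illustration.

```python
import numpy as np
from scipy import ndimage

def dice(gt, pred):
    """Voxel-wise Dice; returns 1.0 when both masks are empty (no lesions)."""
    gt, pred = gt.astype(bool), pred.astype(bool)
    if not gt.any() and not pred.any():
        return 1.0
    return 2.0 * (gt & pred).sum() / (gt.sum() + pred.sum())

def lesionwise_counts(gt, pred, min_overlap=0.1):
    """Lesion-wise TP/FP/FN: a GT lesion counts as detected if at least
    `min_overlap` (10%) of its voxels are covered by the prediction."""
    gt_lbl, n_gt = ndimage.label(gt)
    pred_lbl, n_pred = ndimage.label(pred)
    tp = 0
    for k in range(1, n_gt + 1):
        lesion = gt_lbl == k
        if (lesion & (pred_lbl > 0)).sum() / lesion.sum() >= min_overlap:
            tp += 1
    fn = n_gt - tp
    fp = sum(1 for k in range(1, n_pred + 1)
             if not ((pred_lbl == k) & (gt_lbl > 0)).any())
    return tp, fp, fn  # lesion-wise F1 = 2*tp / (2*tp + fp + fn)
```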

To ensure a fair comparison between the models trained on the four variants of the TUM dataset, we used the following post-processing strategies before computing the metrics:

  1. Chunks Native: For each patient, the 3 chunks corresponding to the cervical, thoracic, and lumbar spine were stacked together to obtain a single prediction for a given patient (instead of having 3 lesion prediction masks per patient). As the sizes of the chunks differed depending on the spinal region, an appropriate amount of padding was applied to the chunks to enable stacking. Figure 4 illustrates our concept of stacking, Algorithm 1 describes it in practice, and a simplified sketch is given after this list.

  2. Stitched Native: No post-processing was done in this case, as we already have a single prediction per patient.

  3. Chunks Straightened: To ensure that all predictions are evaluated in the same space, the straightened chunks were brought back to the image space by applying the inverse warping field. This was followed by stacking the chunks (now in the native space) to obtain a single prediction per patient.

  4. Stitched Straightened: Similar to the straightening of chunks, the straightened stitched images were transformed back to the native space by applying the inverse warping field.
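
Below is a simplified sketch of the stacking step (items 1 and 3): chunk-wise masks are zero-padded in-plane to a common size and concatenated cranio-caudally so that metrics are computed once per patient. Array handling is deliberately simplified (for instance, it assumes all chunks are already in the same orientation), so it is not a drop-in replacement for Algorithm 1.

```python
import numpy as np

def stack_chunks(masks, si_axis=2):
    """Zero-pad chunk masks to a common in-plane size and concatenate them
    cranio-caudally so each patient yields a single mask for evaluation."""
    # Target in-plane size = largest extent over all chunks (per axis)
    in_plane = [ax for ax in range(3) if ax != si_axis]
    target = {ax: max(m.shape[ax] for m in masks) for ax in in_plane}

    padded = []
    for m in masks:
        pad = [(0, 0)] * 3
        for ax in in_plane:
            extra = target[ax] - m.shape[ax]
            pad[ax] = (extra // 2, extra - extra // 2)  # symmetric zero-padding
        padded.append(np.pad(m, pad))
    return np.concatenate(padded, axis=si_axis)

# Usage: one stacked prediction and one stacked GT per patient, e.g.
# stacked_pred = stack_chunks([pred_cervical, pred_thoracic, pred_lumbar])
# stacked_gt   = stack_chunks([gt_cervical, gt_thoracic, gt_lumbar])
```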

Fig. 4.

Toy example illustrating the difference in computing the Recall score with and without stacking. When computed chunk-wise (A, B), Recall_chunk-1 = 1/2 = 0.5, Recall_chunk-2 = 1, and Recall_chunk-3 = 1, so the average Recall across chunks is Recall_average = (0.5 + 1 + 1)/3 = 5/6 ≈ 0.833. In contrast, the Recall score on the stacked chunks (C) is Recall_stacked = 3/4 = 0.75. This illustrates that computing Recall on individual chunks and averaging the results may yield a different value than computing it on stacked chunks. Although the correct computation of metrics can be achieved without stacking through a straightforward mathematical approach, we opt for stacking as it simplifies the handling of prediction and GT masks and the metrics computations with the MetricsReloaded toolbox.


Given the difference in the magnitude of F1-scores computed on individual and stacked chunks, all metrics described in Section 3 were computed on stacked chunks. Notably, the process of stitching chunks can lead to the interpolation of lesions in overlapping areas, while the process of stacking may result in double-counting these lesions. To assess whether this has a significant impact on our dataset, we analyzed our TUM test set. Among a total of 126 stitched scans (or respectively 378 chunked scans), we observed 450 lesions in the chunked dataset and 448 in the stitched dataset, a difference that can be attributed to a single patient. Given the minimal frequency and negligible impact of this discrepancy, we decided not to account for this effect in our analysis.

2.6 Statistical analysis

Statistical analysis was performed using the SciPy Python library v 1.9.1 (Virtanen et al., 2019). Data normality was tested using D’Agostino and Pearson’s normality test. Group-wise comparison between the 2D and 3D models for each of the four variants of the TUM dataset (thus resulting in 8 models to compare) was performed using the non-parametric Kruskal-Wallis H-test. We chose the Kruskal-Wallis H-test because of the non-normal distribution of our data (as concluded from the normality test) and because we were comparing more than two groups. Post hoc pairwise tests for multiple comparisons were performed using Dunn’s test with Holm correction.
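
A sketch of this pipeline is shown below. The normality and Kruskal-Wallis tests are available in SciPy; Dunn's test with Holm correction is not, so the scikit-posthocs package is shown here as one possible implementation (an assumption, not necessarily the exact tooling used).

```python
from scipy import stats
import scikit_posthocs as sp  # assumption: one way to run Dunn's test in Python

# dice_by_model: dict mapping model name -> list of per-image Dice scores
def compare_models(dice_by_model):
    # 1) Normality check per group (D'Agostino and Pearson's test)
    for name, scores in dice_by_model.items():
        _, p_norm = stats.normaltest(scores)
        print(f"{name}: normality p = {p_norm:.3g}")

    # 2) Non-parametric group-wise comparison across all models
    h_stat, p_kw = stats.kruskal(*dice_by_model.values())
    print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_kw:.3g}")

    # 3) Post hoc pairwise comparisons (Dunn's test, Holm correction)
    pairwise_p = sp.posthoc_dunn(list(dice_by_model.values()), p_adjust="holm")
    return pairwise_p
```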

This section compares the results of our proposed model for spinal cord segmentation against various existing methods (Section 3.1). Next, we examine how spinal cord straightening affects the performance of the lesion segmentation model (Section 3.2). We then demonstrate the model’s capability to detect lesions of varying sizes and evaluate its performance in detecting lesions across different chunks (Section 3.3). Finally, we identify the best model based on these findings and enhance it further by incorporating additional data from external sites (Section 3.4).

3.1 Spinal cord segmentation

Figure 5 compares the performance of 2D/3D models trained on chunks/stitched images respectively with state-of-the-art methods for spinal cord segmentation evaluated independently across different spinal regions (cervical, thoracic, and lumbar). Interestingly, spinal cord segmentations from sct_propseg and sct_deepseg_sc vary drastically across chunks, with the lowest performance on chunk-3 covering thoracolumbar levels. The contrast-agnostic v2.52 model (Bédard et al., 2025) reduces the gap in segmentation accuracy across chunks with a notable improvement in the lumbar spine segmentation. However, the models trained with TUM data (chunk-wise or stitched) perform significantly better than other methods for our test set.

Fig. 5.

Raincloud plots comparing the (A) Dice scores (best: 1; worst: 0) and (B) relative volume error (in %, best: 0%) across various spinal cord segmentation methods evaluated independently on the three chunks. The proposed Chunks and Stitched models significantly outperform the rest of the methods across all chunks. Distributions adjacent to the boxplots illustrate the spread of predictions; the overlaid marker represents the mean value over 3 folds. *** P<.001 (group-wise Kruskal-Wallis H-test followed by post-hoc pairwise Dunn’s test for multiple comparisons). Statistically significant pair-wise differences were found throughout when comparing each existing method with each of the proposed models.

Algorithm 1: Evaluation method for chunk lesion masks (the pseudocode is provided as a figure in the original article).

3.2 Impact of cord curvature and spatial context

In this section, we compare the lesion segmentation performance across the different variants of the TUM dataset described in Section 2.4. Figure 6 shows multiple raincloud plots comparing the lesion segmentation performance of models trained on individual axial slices (2D) and three-dimensional image patches (3D). Training a 2D model resulted in a higher Dice and F1-score compared to the 3D models in all datasets except for the stitched native variant.

Fig. 6.

Raincloud plots comparing test lesion segmentation performance in terms of voxel-wise Dice (A) and lesion-wise F1-score (B) across 2D/3D models trained on 4 variants of the TUM dataset. One scatter point represents the model prediction on a single image from a given fold. Points at Dice = 1 represent scans with no lesions in the GT mask that have an empty prediction from the model. Distributions illustrate the spread of predictions; the overlaid marker represents the mean value over 3 folds. *P<.05, **P<.01, ***P<.001 (Group-wise Kruskal-Wallis H-test followed by post-hoc Dunn’s test adjusted for multiple comparisons with Holm correction).


As mentioned in Section 2.3, along with default patch sizes defined by nnUNet, we also experimented with modified patch sizes covering the entire length of the cord within a single input patch to see if extended patch sizes would improve segmentation performance. Interestingly, we found no statistically significant differences between the 3D models trained with default patch sizes and the models trained with a full S-I coverage. Hence, we proceeded with the default patch sizes for all models.

Considering the role of spinal cord curvature in lesion segmentation, we observed that the models trained on straightened cords performed slightly worse than the models trained in the native space of the input scans: in 2D, the median Dice score was 0.72 (chunks native) vs. 0.69 (chunks straightened), and in 3D, 0.72 (stitched native) vs. 0.53 (stitched straightened). This is also apparent from the wider distribution of the Dice scores of predictions (i.e., scatter points) in Figure 6A. In a between-dataset pairwise comparison of Dice scores, the 2D chunks native model significantly outperformed the 2D stitched native (**P<.01), 3D chunks straightened (***P<.001), and 2D/3D stitched straightened models (***P<.001). Likewise, for lesion-wise F1-scores, the 2D chunks native model showed significant differences with the 3D chunks straightened (***P<.001) and 3D stitched straightened models (***P<.001).

3.3 Lesion distribution, frequency, and detection across chunks

Considering the median Dice and F1-scores, the Chunks Native 2D model ranks higher than the straightened counterparts and is on par with the models trained on stitched scans. In other words, training directly on chunks is a simple strategy that does not perform worse than the stitched models without the need for any stitching procedure. As this presents a simple and scalable solution, we proceeded with the 2D model and present the results from our subsequent experiments evaluating lesion detection in this section.

Figure 7 shows the lesion detection rates across various categories of lesion sizes. The model’s performance increases with increasing lesion size, detecting even small lesions between 10 and 50 mm3 at a rate of about 60%. We observed that lesion detection becomes challenging as lesions get even smaller.
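
The detection rates in Figure 7 can be thought of as the fraction of GT lesions with at least 10% overlap with the prediction, computed per volume bin; the sketch below illustrates this, with placeholder bin edges rather than the exact categories of Figure 7.

```python
import numpy as np
from scipy import ndimage

def detection_rate_by_size(gt, pred, voxel_volume_mm3,
                           bins=(0, 10, 50, 200, np.inf)):
    """Fraction of GT lesions detected (>=10% overlap) per volume bin.
    Bin edges here are placeholders, not the exact categories of Figure 7."""
    gt_lbl, n_gt = ndimage.label(gt)
    volumes, detected = [], []
    for k in range(1, n_gt + 1):
        lesion = gt_lbl == k
        volumes.append(lesion.sum() * voxel_volume_mm3)
        detected.append((lesion & (pred > 0)).sum() / lesion.sum() >= 0.10)
    volumes, detected = np.array(volumes), np.array(detected)
    rates = {}
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (volumes >= lo) & (volumes < hi)
        rates[f"[{lo}, {hi}) mm^3"] = detected[in_bin].mean() if in_bin.any() else np.nan
    return rates
```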

Fig. 7.

Lesion detection rates across four volumetric lesion categories for the native chunks dataset using the 2D model. While the model struggles to segment very small lesions, which are often difficult to distinguish from image artifacts and contribute to high inter-rater variability, it demonstrates high detection rates for medium and large lesions.


To evaluate whether our model is consistent in detecting lesions occurring at various spinal cord regions, we plotted the distribution of lesion sizes and the corresponding lesion detection rate across spinal cord chunks in Figure 8. As expected, we observed a decreasing trend in the frequency of lesions going from the cervical to thoracolumbar spine. Our model achieved similar lesion detection rates in cervical and cervicothoracic spines (left and middle), with a slight decrease in the thoracolumbar spine (right).

Fig. 8.

Histograms of spinal cord lesions across all TUM test set chunks, showing higher lesion prevalence in cervical spine (chunk-1), but similar trends in terms of lesion volume distributions across chunks (2-3). Note the different scaling of the y-axes. Lesion detection rates across the different chunks for the TUM test set slightly decrease when traversing the spine cranio-caudally.


3.4 Generalizability across sites and protocols

We have now established that the model trained on chunks performs at least on par with the stitched models without any explicit preprocessing (namely, stitching and straightening), with similar lesion detection rates across chunks. Following the experimental design, we added axial T2w chunks from an external site (NYU) and evaluated whether the performance of the model improves after training on two sites (TUM and NYU). We tested the model in two ways: (i) on an in-distribution test set (i.e., TUM) to evaluate whether the model’s performance drifted after the addition of an external site, and (ii) on unseen (out-of-distribution) data from two additional sites (BWH and UCSF), acquired using different scanners and protocols unseen during training.

Table 2 and Figure 9 compare three models qualitatively and quantitatively: (i) sct_deepseg_lesion, (ii) the model trained on the TUM dataset’s chunks (i.e., single site), and (iii) the model trained on two sites. As data from BWH and UCSF were acquired at different sites and with different acquisition parameters, the results on these sites show the model’s robustness to out-of-distribution data. Both variants of the models trained on chunks performed significantly better than sct_deepseg_lesion (Gros et al., 2018), which tends to miss lesions (shown with yellow arrows in Fig. 9), resulting in more false negatives. Importantly, adding data from the NYU site did not degrade the model’s performance on the original TUM test set; rather, it showed a slight improvement across all metrics. As for the evaluation metrics on the unseen sites, the model trained on two sites performed better than the single-site model, although no statistically significant difference was found. A comparison with seg_ms_lesion, a contemporary work accessible in SCT, is presented in Appendix A.2.

Table 2.

Quantitative comparison of the models trained on chunks on in-distribution (TUM) and out-of-distribution test sites (BWH and UCSF).

Columns per metric: sct_deepseg_lesion | ChunksNative SingleSite | ChunksNative TwoSites

TUM (n=126)
  Dice: 0.33 ± 0.31 | 0.60 ± 0.34 | 0.62 ± 0.33
  NSD: 0.37 ± 0.31 | 0.65 ± 0.37 | 0.67 ± 0.36
  PPV_L: 0.48 ± 0.38 | 0.60 ± 0.37 | 0.63 ± 0.36
  F1Score_L: 0.42 ± 0.34 | 0.64 ± 0.36 | 0.66 ± 0.34
  |n_ref^L - n_pred^L|: 1.93 ± 1.87 | 1.86 ± 1.89 | 1.62 ± 1.81

BWH (n=80)
  Dice: 0.28 ± 0.30 | 0.45 ± 0.26 | 0.51 ± 0.26
  NSD: 0.34 ± 0.31 | 0.60 ± 0.32 | 0.65 ± 0.30
  PPV_L: 0.61 ± 0.43 | 0.62 ± 0.35 | 0.72 ± 0.34
  F1Score_L: 0.40 ± 0.37 | 0.55 ± 0.32 | 0.62 ± 0.32
  |n_ref^L - n_pred^L|: 3.45 ± 4.22 | 3.27 ± 3.85 | 3.00 ± 3.82

UCSF (n=32)
  Dice: 0.15 ± 0.16 | 0.52 ± 0.18 | 0.53 ± 0.18
  NSD: 0.24 ± 0.25 | 0.69 ± 0.22 | 0.71 ± 0.21
  PPV_L: 0.52 ± 0.46 | 0.72 ± 0.29 | 0.80 ± 0.22
  F1Score_L: 0.25 ± 0.26 | 0.68 ± 0.23 | 0.69 ± 0.23
  |n_ref^L - n_pred^L|: 3.22 ± 2.88 | 2.22 ± 1.81 | 1.69 ± 2.01

Number of images for each test site is given in brackets.

Note: Data are means ± standard deviations. Dice, NSD, PPV_L, and F1Score_L are best at 1.0; |n_ref^L - n_pred^L| is best at 0.0. In the original table, bold marks the best value per metric, which here corresponds to the ChunksNative TwoSites column throughout.

 Statistically significant compared to DeepSegLesion (P<.001).

Fig. 9.

Qualitative comparison of sct_deepseg_lesion, ChunksNative SingleSite, and ChunksNative TwoSites 2D models across three patients from each test site. sct_deepseg_lesion demonstrates low sensitivity to MS lesions, failing to segment lesions in a few patients (yellow arrows).


In this study, we developed an automatic tool for the joint segmentation of the spinal cord and MS lesions in axial T2w MRI scans using a region-based training approach. Starting with a comprehensive comparison between chunks, stitched, and straightened variants of our dataset, we showed that training a 2D model directly on the raw chunks in the native space achieved the best performance, presenting a simple and scalable solution. With respect to spinal cord segmentation, our model improved over the existing methods on the lumbar spine, whereas, for lesions, the proposed model detected up to 90% of lesions larger than 50 mm3, while struggling to detect small lesions reliably. Given that MS lesions manifest along the entire length of the spinal cord, our model could detect lesions across the entire spinal axis, with slightly lower detection capabilities in the thoracolumbar regions. We summarize our key insights and perspectives on spinal cord and MS lesion segmentation in this section.

4.1 Spinal cord segmentation

All benchmark methods (namely, sct_propseg, sct_deepseg_sc and contrast_agnostic) showed a decrease in performance towards the lower thoracic and lumbar spinal cord. This is not surprising as these models were trained on datasets that predominantly comprised cervical and cervicothoracic chunks. In contrast, given that the proposed model was trained exclusively on chunks covering the whole spine, it achieved consistent segmentations throughout various regions of the cord. Particularly, as the current state-of-the-art (Bédard et al., 2025) struggles to segment the thoracolumbar spinal cord, integrating our model into SCT will provide a novel solution for the segmentation of thoracolumbar spine axial T2w scans. Another likely reason for the better performance of our model is that the benchmark methods were not trained on the TUM data, leading to an expected drop in performance when evaluated on out-of-distribution data.

4.2 Lesion segmentation

4.2.1 Spinal cord straightening

Straightening the spinal cord negatively impacted lesion segmentation performance. At first glance, this may seem surprising, given that straightening eliminates curvature-related morphological variations across individuals. However, the straightening procedure introduces heavy deformations in the vertebrae and the discs surrounding the spinal cord, possibly blurring anatomically meaningful information; in addition, small hyperintensities on the concave side of the cord shrink, while lesions on the convex side are enlarged. We hypothesize that these non-linear deformations negatively impacted the performance of the models. Furthermore, the resampling involved in the straightening procedure could degrade the image signal through interpolation, further affecting performance. Additionally, for models trained in the native space, lesion segmentation can be performed using the raw chunks as input, whereas segmenting lesions on straightened scans requires spinal cord masks a priori at inference.

In addition to straightening the spinal cord, we could have registered the images to an anatomical template (e.g., PAM50 (De Leener et al., 2018)) and subsequently trained a segmentation model in the template space to further reduce the inter-patient variability. Similar approaches involving the MNI template have been reported in brain MS lesion segmentation studies (Carass et al., 2017; Wiltgen et al., 2024). However, registration to the PAM50 template requires robust identification of the vertebral levels, which was difficult to achieve here due to the thick slices in the sagittal plane. While several deep learning-based methods exist for vertebral labeling in CT scans (thanks to open-source, labeled datasets (Sekuboyina et al., 2021)), it remains a challenge in MRI scans. Existing approaches are limited to fixed fields of view and spinal region-specific models (i.e., cervical/lumbar) and do not provide vertebral labels for whole-spine images (Azad et al., 2021; Moller et al., 2024). On the other hand, TotalSpineSeg (Warszawer et al., 2024), a recently introduced model, obtains robust vertebral labels on a wide variety of contrasts. Using vertebral labels from TotalSpineSeg and registering to the PAM50 template, future work could explore lesion segmentation directly in the template space, thus simplifying the tasks of lesion mapping and computing lesion morphometrics.

4.2.2 Chunks vs. Stitched whole-spine scans

Recent studies highlight the significant impact of MS lesions at cervical, cervicothoracic, and thoracolumbar levels (Bussas et al., 2022; Eden et al., 2019; Hua et al., 2015; Kearney et al., 2015; Poulsen et al., 2021) and recommend extended axial coverage of the spinal cord (Breckwoldt et al., 2017; Galler et al., 2016). However, as acquiring full spinal cord scans suffers from long acquisition times, stitching independently acquired segments offers a simpler method to obtain whole-spine scans. We observed that training models on whole-spine scans did not result in statistically significant performance improvements over training on individual chunks, nor did our additional experiments with extended patch sizes covering the entire spinal cord (in the S-I plane). This suggests two things: (i) the model learns to be sensitive to hyperintensities in the cord without requiring any additional 3D patch-wise or length-wise contextual information, (ii) on a practical note, one can simplify the lesion segmentation task on axial T2w scans by training on raw chunks (i.e., without stitching). Notably, this approach is more scalable as many acquisition protocols do not cover the whole spinal cord.

The stitching procedure requires resampling both individual chunks and their ground truth masks to a common resolution. While the additional interpolation may affect boundary-sensitive metrics such as the Dice score, our post-stitching quality control ensured that small lesions or lesions in the overlapping regions between chunks are preserved, thus not affecting lesion sensitivity or false positive rates.

4.2.3 Modeling spatial context with 2D and 3D convolutional kernels

Spinal cord lesions, which often appear as elongated, blob-like structures, frequently span multiple axial slices and vertebral levels (see Fig. 2). Existing studies (Gros et al., 2018; Walsh et al., 2024) have reported results exclusively from 3D models, motivating the comparison of 2D and 3D models in this study. Given the highly anisotropic axial scans, the 3D models were trained with a combination of 2D kernels with singleton dimensions and fully 3D kernels at different stages of the encoder to balance the in-plane and out-of-plane dimensions. Our experiments revealed that 2D-only models often outperformed the hybrid 2D/3D models, or at least performed on par. This suggests that the additional spatial context provided by 3D patches did not add significant value. This is evident in Table 2, where sct_deepseg_lesion (Gros et al., 2018), a 3D model trained on patch sizes of 48×48×48, performed significantly worse than our proposed 2D models, even though it was trained on a subset of axial scans. In contrast, pure 2D models effectively mitigate the destabilizing effect of partial volume averaging by training slice-by-slice on high-resolution axial slices, and any variations in image intensities across vertebral levels may have a less significant impact on 2D kernels than on 3D architectures.

4.2.4 Lesion detection across spinal segments

Consistent with previous studies (Bussas et al., 2022; Waldman et al., 2024), we observed a progressively decreasing occurrence of lesions from the cervical to the thoracolumbar spine within the TUM cohort. Notably, our analysis revealed that the lesion detection rate remains consistent across spinal segments. This suggests that detection performance is influenced more by lesion size than by spinal level, highlighting the need to improve the segmentation of smaller lesions, for example, by adding reliable training samples for the size categories the model currently struggles with.
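As an illustration of this stratified analysis, the hypothetical sketch below computes lesion-wise detection rates per spinal segment and per size bin; it assumes each ground-truth lesion has already been matched against the predictions, and the data frame contents are made up.

```python
# Hypothetical sketch (not the actual analysis code): lesion-wise detection rate
# per spinal segment and per size bin, given pre-computed lesion matches.
import pandas as pd

lesions = pd.DataFrame({
    "segment": ["cervical", "cervical", "thoracic", "thoracic", "lumbar"],
    "volume_mm3": [35.2, 8.1, 120.5, 12.9, 48.0],
    "detected": [True, False, True, True, True],
})

# Detection rate per spinal segment.
print(lesions.groupby("segment")["detected"].mean())

# Detection rate per lesion-size bin (thresholds from the text: 10 and 50 mm^3).
size_bins = pd.cut(lesions["volume_mm3"], bins=[0, 10, 50, float("inf")],
                   labels=["<10", "10-50", ">50"])
print(lesions.groupby(size_bins, observed=True)["detected"].mean())
```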

In particular, our results demonstrate that the model struggles to reliably segment spinal cord lesions smaller than 10 mm3. Existing studies have also reported high inter-rater variability among expert raters for such small lesions (Walsh et al., 2023). More importantly, our model improves substantially with increasing lesion size, achieving approximately 60% accuracy for lesions between 10 and 50 mm3. Although these lesions are relatively small, they are clinically significant (Bussas et al., 2022). Recent studies have shown that using synthetically generated lesions as an additional data augmentation method improves performance over traditional augmentation strategies (Basaran et al., 2022; Zhang et al., 2021). Applying such methods to synthetically generate small lesions could help improve our model’s performance on this size range.
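For reference, per-lesion volumes can be derived from a binary mask by connected-component labeling, as in the sketch below (an assumed implementation, not our exact analysis code).

```python
# Sketch of how per-lesion volumes can be extracted from a binary mask: label
# connected components and convert voxel counts to mm^3 using the voxel size.
import numpy as np
from scipy import ndimage

def lesion_volumes_mm3(mask: np.ndarray, voxel_size_mm: tuple) -> list:
    """Return the volume (mm^3) of each connected lesion in a binary mask."""
    labels, n_lesions = ndimage.label(mask)
    voxel_volume = float(np.prod(voxel_size_mm))
    return [float((labels == i).sum()) * voxel_volume for i in range(1, n_lesions + 1)]

# Example with anisotropic voxels (0.34 x 0.34 mm in-plane, 5 mm slice thickness).
mask = np.zeros((64, 64, 10), dtype=np.uint8)
mask[10:14, 10:14, 2:4] = 1                          # a single 32-voxel lesion
print(lesion_volumes_mm3(mask, (0.34, 0.34, 5.0)))   # [~18.5] -> the 10-50 mm^3 bin
```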

4.2.5 Accounting for patients with no lesions in evaluation metrics

In MS, distinguishing patients with and without spinal cord lesions is crucial (Lauerer et al., 2024; M. A. Rocca et al., 2024). Ideally, the training data should contain a small proportion of patients with no lesions (or even healthy controls), and the evaluation metrics should account for empty ground-truth masks. These considerations help the model distinguish between MS lesions and common spinal cord artifacts (which appear as abnormal hyperintensities). The ANIMA toolbox, commonly used in lesion segmentation challenges (Commowick et al., 2016, 2021), skips empty lesion masks, preventing a proper estimation of false positive rates. In contrast, MetricsReloaded (Maier-Hein et al., 2024) assigns a Dice score of 1 when both the GT and the prediction are empty and 0 when only one of them is empty, ensuring that empty cases are accounted for. While this may skew the Dice score, it is justified given the importance of correctly predicting empty masks. Thus, we used MetricsReloaded in this work and introduced an evaluation approach that stacks individual chunks before computing metrics, allowing a fair comparison between models trained on chunks vs. whole-spine scans.
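The following snippet is a simplified stand-in for this convention (the actual evaluation used the MetricsReloaded package, not this function): Dice is set to 1 when both masks are empty and to 0 when exactly one of them is empty.

```python
# Simplified stand-in for the empty-mask convention described above.
import numpy as np

def dice_with_empty_handling(gt: np.ndarray, pred: np.ndarray) -> float:
    gt, pred = gt.astype(bool), pred.astype(bool)
    if not gt.any() and not pred.any():
        return 1.0  # correctly predicted "no lesion"
    if not gt.any() or not pred.any():
        return 0.0  # false positives on a lesion-free scan, or all lesions missed
    return 2.0 * np.logical_and(gt, pred).sum() / (gt.sum() + pred.sum())

# Chunks of the same patient can be stacked before computing metrics, so that
# chunk-wise and whole-spine models are compared on the same set of voxels.
gt_whole = np.concatenate([np.zeros((4, 4, 3)), np.ones((4, 4, 2))], axis=2)
pred_whole = np.concatenate([np.zeros((4, 4, 3)), np.ones((4, 4, 2))], axis=2)
print(dice_with_empty_handling(gt_whole, pred_whole))  # 1.0
```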

4.2.6 Inter-rater variability and pre-labeling strategies

An important aspect to consider is the use of different (pre-)labeling strategies. The TUM masks were pre-segmented with manual labels and an iterative approach employing nnUNetV2, whereas, in the data from NYU, UCSF, and BWH, lesions were labeled entirely manually by different raters, and spinal cord masks were pre-labeled with SCT (Gros et al., 2018) and subsequently corrected. This variability contributed positively to the generalizability of our proposed models.

4.2.7 Robustness to distribution shifts

Directly adding raw chunks from the NYU site further highlighted the practicality of training on chunks. The two-site model outperformed the single-site model on the TUM test set and on two unseen external datasets, benefiting from the greater diversity in scanners and acquisition protocols. This improved robustness makes it more suitable for integration into SCT (De Leener, Lévy, et al., 2017).

4.2.8 Limitations

Our work has several limitations: (1) Only axial T2w scans were considered in this work. Although our training and test datasets include axially acquired scans with a wide range of resolutions, we did not evaluate our model on high-resolution axial or isotropic scans falling outside the resolution range of our datasets. (2) The proposed model is constrained to the segmentation of lesions in T2w scans. Incorporating a variety of orientations and contrasts in the training set would improve its performance on several contrasts (Bédard et al., 2025). However, this generalization may come at the cost of reduced accuracy for specific contrasts and orientations. Furthermore, clinical protocols frequently include complementary scans in various orientations (Clara Weyer, 2021; Wattjes et al., 2021), such as axial and sagittal T2w, which offer additional information that our current models are unable to use for their predictions. Leveraging multiple contrasts or views is technically challenging, as current deep learning models typically require the images to be registered, aligned, and stacked as multiple channels over the same field of view. Enhancing models to effectively leverage such complementary information, for example, by super-resolving the scans first, is an ongoing area of research (McGinnis et al., 2023). (3) nnUNet’s automatic configuration of patches, kernels, and other training hyper-parameters may still be suboptimal depending on the task. Although we experimented with minor adjustments to the initial parameters (detailed in Section 2.3), further optimization is still an open area for investigation. Our comparison of training plans for the same model across different dataset configurations revealed that nnUNetV2 adapts settings such as patch size and stride based on the data. While this demonstrates nnUNetV2’s ability to tailor its plans to the data, these adjustments can also affect key variables, such as lesion sensitivity, which we aim to optimize. Given the vast number of possible configurations, a comprehensive ablation study would be computationally intensive and beyond the scope of this study. We acknowledge this limitation and emphasize the need for further exploration in future work. (4) The performance of our model is limited by the inter-rater variability in the GT masks, as they were obtained independently at each site with different annotation strategies. Understanding the effect of inter-rater agreement in MS lesion segmentation is an active area of research (Walsh et al., 2023); our study does not explore this in greater detail. Lastly, the model was developed using data from a private cohort. While this limits the reproducibility of our study, the model is made publicly available as part of SCT, thus establishing a baseline for future studies in MS lesion segmentation in axial T2w scans.

We developed an automatic tool for the joint segmentation of the spinal cord and MS lesions in axial T2w MRI scans covering the entire spine. We evaluated both patch-wise 3D and slice-by-slice 2D training strategies applied to individual chunks, stitched whole-spine scans, and straightened spinal cords to identify the combination that yields the best lesion segmentation performance. Our experiments demonstrated that slice-by-slice 2D training on raw chunks achieved the highest segmentation accuracy. This approach is simple and scalable, as it requires neither stitching nor straightening. To highlight the practicality of our findings, we further improved the proposed model by training and evaluating on chunks from external sites, as these are easily available and need not cover the entire spine. The model performed well on unseen data and improved slightly after extending the training data, suggesting good generalizability. To facilitate broader use, we open-sourced the code and integrated the model into SCT (v7.0 and above).

The data used in this study are private and could be made available upon reasonable request. In line with open science principles and to facilitate reproducibility, all code, including scripts for preprocessing, training, and generating plots, is open-source and can be found at https://github.com/ivadomed/model-seg-ms-axial-t2w. Additionally, the segmentation model is accessible via SCT (v7.0 and above) using sct_deepseg lesion_ms_axial_t2 -i <path-to-image.nii.gz>.

E.N.K.: data curation, formal analysis, investigation, methodology, visualization, and writing (original draft, review & editing). J.M.: data curation, formal analysis, investigation, methodology, visualization, and writing (original draft, review & editing). R.W.: data curation, segmentation. S.R.: data curation, segmentation. R.G.: methodology, writing (review & editing). J.V.: methodology, writing (review & editing). M.L.: data curation, writing (review & editing). P.L.B.: data curation, methodology, and writing (review & editing). J.T., R.B., S.T., T.S., A.B., C.Z., and B.H.: data curation. D.R.: conceptualization, supervision, writing (review & editing). B.W.: conceptualization, data curation, methodology, supervision, and writing (review & editing). J.S.K.: conceptualization, data curation, funding acquisition, investigation, methodology, supervision, and writing (review & editing). J.C.A.: conceptualization, data curation, funding acquisition, investigation, methodology, supervision, and writing (review & editing). M.M.: conceptualization, data curation, funding acquisition, investigation, methodology, supervision, and writing (review & editing).

E.N.K. is supported by the Fonds de Recherche du Québec Nature and Technologie (FRQNT) Doctoral Training Scholarship and DAAD (German Academic Exchange Service) Short-term Research Grant. J.M., M.M., and J.S.K. are supported by the Bavarian State Ministry for Science and Art (Collaborative Bilateral Research Program Bavaria – Québec: AI in medicine, grant F.4-V0134.K5.1/86/34). R.G. and J.S.K. are supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (101045128-iBack-epic-ERC2021-COG). J.V. received funding from the European Union’s Horizon Europe research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 101107932. J.C.A. is supported by the Canada Research Chair in Quantitative Magnetic Resonance Imaging [CRC-2020-00179], the Canadian Institute of Health Research [PJT-190258], the Canada Foundation for Innovation [32454, 34824], the Fonds de Recherche du Québec - Santé [322736, 324636], the Natural Sciences and Engineering Research Council of Canada [RGPIN-2019-07244], the Canada First Research Excellence Fund (IVADO and TransMedTech), the Courtois NeuroMod project, the Quebec BioImaging Network [5886, 35450], INSPIRED (Spinal Research, UK; Wings for Life, Austria; Craig H. Neilsen Foundation, USA), and the Mila - Tech Transfer Funding Program. The authors thank the Digital Research Alliance of Canada for the compute resources used in this work.

E.N.K., J.M., R.W., S.R., R.G., J.V., P.L.B., M.L., J.T., S.T., T.S., A.B., C.Z., B.H., J.S.K., D.R., J.C.A., and M.M. have no known competing financial interests to declare. R.B. has received speaking honoraria from EMD Serono and Sanofi and research support from Bristol Myers Squibb, EMD Serono, and Novartis. B.W. has received speaker honoraria from Philips and Novartis (unrelated to this study).

We thank Mathieu Guay-Paquet and Joshua Newton for their assistance with dataset management and for their contributions to implementing the sct_qc tool. We extend our gratitude to Erbil’s for the consistent quality of their vegan döners, which made the completion of this project feasible.

References

Azad, R., Rouhier, L., & Cohen-Adad, J. (2021). Stacked hourglass network with a multi-level attention mechanism: Where to look for intervertebral disc labeling. arXiv. https://arxiv.org/abs/2108.06554

Basaran, B. D., Qiao, M., Matthews, P. M., & Bai, W. (2022). Subject-specific lesion generation and pseudo-healthy synthesis for multiple sclerosis brain images. arXiv. https://arxiv.org/abs/2208.02135

Bédard, S., Karthik, E. N., Tsagkas, C., Pravatà, E., Granziera, C., Smith, A., Weber II, K. A., & Cohen-Adad, J. (2025). Towards contrast-agnostic soft segmentation of the spinal cord. Medical Image Analysis, 101(4), 103473. https://doi.org/10.1016/j.media.2025.103473

Breckwoldt, M. O., Gradl, J., Hähnel, S., Hielscher, T., Wildemann, B., Diem, R., Platten, M., Wick, W., Heiland, S., & Bendszus, M. (2017). Increasing the sensitivity of MRI for the detection of multiple sclerosis lesions by long axial coverage of the spinal cord: A prospective study in 119 patients. Journal of Neurology, 264, 341–349. https://doi.org/10.1007/s00415-016-8353-3

Bussas, M., El Husseini, M., Harabacz, L., Pineker, V., Grahl, S., Pongratz, V., Berthele, A., Riederer, I., Zimmer, C., Hemmer, B., Kirschke, J. S., & Mühlau, M. (2022). Multiple sclerosis lesions and atrophy in the spinal cord: Distribution across vertebral levels and correlation with disability. NeuroImage: Clinical, 34, 103006. https://doi.org/10.1016/j.nicl.2022.103006

Calabrese, M., Filippi, M., & Gallo, P. (2010). Cortical lesions in multiple sclerosis. Nature Reviews Neurology, 6(8), 438–444. https://doi.org/10.1038/nrneurol.2010.93

Carass, A., Roy, S., Jog, A., Cuzzocreo, J. L., Magrath, E., Gherman, A., Button, J., Nguyen, J., Prados, F., Sudre, C. H., Jorge Cardoso, M., Cawley, N., Ciccarelli, O., Wheeler-Kingshott, C. A., Ourselin, S., Catanese, L., Deshpande, H., Maurel, P., Commowick, O., … Pham, D. L. (2017). Longitudinal multiple sclerosis lesion segmentation: Resource and challenge. NeuroImage, 148, 77–102. https://doi.org/10.1016/j.neuroimage.2016.12.064

Clara Weyer. (2021). Automated segmentation of the spine in multiple sclerosis (MS) patients with the Spinal Cord Toolbox (SCT). Klinikum rechts der Isar, Technische Universität München. https://doi.org/10.15407/jnpae2007.03.030

Commowick, O., Cervenansky, F., & Ameli, R. (2016). MSSEG challenge proceedings: Multiple sclerosis lesions segmentation challenge using a data management and processing infrastructure. MICCAI. https://doi.org/10.1016/j.neuroimage.2021.118589

Commowick, O., Cervenansky, F., Cotton, F., & Dojat, M. (2021). MSSEG-2 challenge proceedings: Multiple sclerosis new lesions segmentation challenge using a data management and processing infrastructure. MICCAI 2021 - 24th International Conference on Medical Image Computing and Computer Assisted Intervention, 1–118. https://doi.org/10.1016/j.neuroimage.2021.118589

De Leener, B., Cohen-Adad, J., & Kadoury, S. (2015). Automatic segmentation of the spinal cord and spinal canal coupled with vertebral labeling. IEEE Transactions on Medical Imaging, 34(8), 1705–1718. https://doi.org/10.1109/tmi.2015.2437192

De Leener, B., Fonov, V. S., Collins, D. L., Callot, V., Stikov, N., & Cohen-Adad, J. (2018). PAM50: Unbiased multimodal template of the brainstem and spinal cord aligned with the ICBM152 space. NeuroImage, 165, 170–179. https://doi.org/10.1016/j.neuroimage.2017.10.041

De Leener, B., Kadoury, S., & Cohen-Adad, J. (2014). Robust, accurate and fast automatic segmentation of the spinal cord. NeuroImage, 98, 528–536. https://doi.org/10.1016/j.neuroimage.2014.04.051

De Leener, B., Lévy, S., Dupont, S. M., Fonov, V. S., Stikov, N., Collins, D. L., Callot, V., & Cohen-Adad, J. (2017). SCT: Spinal Cord Toolbox, an open-source software for processing spinal cord MRI data. NeuroImage, 145, 24–43. https://doi.org/10.1016/j.neuroimage.2016.10.009

De Leener, B., Mangeat, G., Dupont, S., Martin, A. R., Callot, V., Stikov, N., Fehlings, M. G., & Cohen-Adad, J. (2017). Topologically preserving straightening of spinal cord MRI. Journal of Magnetic Resonance Imaging, 46(4), 1209–1219. https://doi.org/10.1002/jmri.25622

Dubey, D., Pittock, S. J., Krecke, K. N., Morris, P. P., Sechi, E., Zalewski, N. L., Weinshenker, B. G., Shosha, E., Lucchinetti, C. F., Fryer, J. P., Lopez-Chiriboga, A. S., Chen, J. C., Jitprapaikulsan, J., McKeon, A., Gadoth, A., Keegan, B. M., Tillema, J.-M., Naddaf, E., Patterson, M. C., … Flanagan, E. P. (2019). Clinical, radiologic, and prognostic features of myelitis associated with myelin oligodendrocyte glycoprotein autoantibody. JAMA Neurology, 76(3), 301–309. https://doi.org/10.1001/jamaneurol.2018.4053

Eden, D., Gros, C., Badji, A., Dupont, S. M., De Leener, B., Maranzano, J., Zhuoquiong, R., Liu, Y., Granberg, T., Ouellette, R., Stawiarz, L., Hillert, J., Talbott, J., Bannier, E., Kerbrat, A., Edan, G., Labauge, P., Callot, V., Pelletier, J., … Cohen-Adad, J. (2019). Spatial distribution of multiple sclerosis lesions in the cervical spinal cord. Brain, 142(3), 633–646. https://doi.org/10.1093/brain/awy352

Galler, S., Stellmann, J.-P., Young, K., Kutzner, D., Heesen, C., Fiehler, J., & Siemonsen, S. (2016). Improved lesion detection by using axial T2-weighted MRI with full spinal cord coverage in multiple sclerosis. American Journal of Neuroradiology, 37(5), 963–969. https://doi.org/10.3174/ajnr.a4638

Gentile, G., Jenkinson, M., Griffanti, L., Luchetti, L., Leoncini, M., Inderyas, M., Mortilla, M., Cortese, R., De Stefano, N., & Battaglini, M. (2023). BIANCA-MS: An optimized tool for automated multiple sclerosis lesion segmentation. Human Brain Mapping, 44(14), 4893–4913. https://doi.org/10.1002/hbm.26424

Graf, R., Möller, H., McGinnis, J., Rühling, S., Weihrauch, M., Atad, M., Shit, S., Menze, B., Mühlau, M., Paetzold, J. C., Rueckert, D., & Kirschke, J. (2024). Modeling the acquisition shift between axial and sagittal MRI for diffusion superresolution to enable axial spine segmentation. In N. Burgos, C. Petitjean, M. Vakalopoulou, S. Christodoulidis, P. Coupe, H. Delingette, C. Lartizien, & D. Mateus (Eds.), Proceedings of the 7th International Conference on Medical Imaging with Deep Learning (pp. 520–537, Vol. 250). PMLR. https://proceedings.mlr.press/v250/graf24a.html

Graf, R., Platzek, P.-S., Riedel, E., Kim, S. H., Lenhart, N., Ramschütz, C., Paprottka, K., Kertels, O., Möller, H., Atad, M., Bülow, R., Werner, N., Völzke, H., Schmidt, C., Wiestler, B., Paetzold, J., Rueckert, D., & Kirschke, J. (2024). Generating synthetic high-resolution spinal STIR and T1w images from T2w FSE and low-resolution axial Dixon. European Radiology, 35, 1761–1771. https://doi.org/10.1007/s00330-024-11047-1

Gros, C., De Leener, B., Badji, A., Maranzano, J., Eden, D., Dupont, S., Talbott, J., Zhuoquiong, R., Liu, Y., Granberg, T., Ouellette, R., Tachibana, Y., Hori, M., Kamiya, K., Chougar, L., Stawiarz, L., Hillert, J., Bannier, E., Kerbrat, A., & Cohen-Adad, J. (2018). Automatic segmentation of the spinal cord and intramedullary multiple sclerosis lesions with convolutional neural networks. NeuroImage, 184, 901–915. https://doi.org/10.1016/j.neuroimage.2018.09.081

Hatamizadeh, A., Nath, V., Tang, Y., Yang, D., Roth, H. R., & Xu, D. (2021). Swin UNETR: Swin transformers for semantic segmentation of brain tumors in MRI images. International MICCAI Brainlesion Workshop, 272–284. https://doi.org/10.1007/978-3-031-08999-2_22

Hua, L., Donlon, S., Sobhanian, M., Portner, S., & Okuda, D. (2015). Thoracic spinal cord lesions are influenced by the degree of cervical spine involvement in multiple sclerosis. Spinal Cord, 53(7), 520–525. https://doi.org/10.1038/sc.2014.238

Isensee, F., Jaeger, P. F., Full, P. M., Vollmuth, P., & Maier-Hein, K. H. (2020). nnU-Net for brain tumor segmentation. arXiv. https://arxiv.org/abs/2011.00848

Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: A self-configuring method for deep learning-based biomedical image segmentation. Nature Methods, 18(2), 203–211. https://doi.org/10.1038/s41592-020-01008-z

Isensee, F., Wald, T., Ulrich, C., Baumgartner, M., Roy, S., Maier-Hein, K., & Jaeger, P. F. (2024). nnU-Net revisited: A call for rigorous validation in 3D medical image segmentation. arXiv. https://doi.org/10.1007/978-3-658-47422-5_32

Jakimovski, D., Bittner, S., Zivadinov, R., Morrow, S. A., Benedict, R. H., Zipp, F., & Weinstock-Guttman, B. (2024). Multiple sclerosis. Lancet, 403(10422), 183–202. https://doi.org/10.1016/s0140-6736(23)01473-3

Jasperse, B. (2024). Spinal cord imaging in multiple sclerosis and related disorders. Neuroimaging Clinics of North America, 34(3), 385–398. https://doi.org/10.1016/j.nic.2024.03.011

Karthik, E. N., Bedard, S., Valosek, J., Chandar, S., & Cohen-Adad, J. (2024). Contrast-agnostic spinal cord segmentation: A comparative study of ConvNets and vision transformers. Medical Imaging with Deep Learning. https://openreview.net/forum?id=n6D25aqdV3 https://doi.org/10.58530/2023/1382

Kearney, H., Miller, D. H., & Ciccarelli, O. (2015). Spinal cord MRI in multiple sclerosis—Diagnostic, prognostic and clinical value. Nature Reviews Neurology, 11(6), 327–338. https://doi.org/10.1038/nrneurol.2015.80

Kerbrat, A., Gros, C., Badji, A., Bannier, E., Galassi, F., Combès, B., Chouteau, R., Labauge, P., Ayrignac, X., Carra-Dalliere, C., Maranzano, J., Granberg, T., Ouellette, R., Stawiarz, L., Hillert, J., Talbott, J., Tachibana, Y., Hori, M., Kamiya, K., & Cohen-Adad, J. (2020). Multiple sclerosis lesions in motor tracts from brain to cervical cord: Spatial distribution and correlation with disability. Brain, 143(7), 2089–2105. https://doi.org/10.1093/brain/awaa162

Kidd, D., Barkhof, F., McConnell, R., Algra, P., Allen, I., & Revesz, T. (1999). Cortical lesions in multiple sclerosis. Brain, 122(1), 17–26. https://doi.org/10.1093/brain/122.1.17

Lauerer, M., McGinnis, J., Bussas, M., Husseini, M., Pongratz, V., Engl, C., Wuschek, A., Berthele, A., Riederer, I., Kirschke, J., Zimmer, C., Hemmer, B., & Mühlau, M. (2024). Prognostic value of spinal cord lesion measures in early relapsing-remitting multiple sclerosis. Journal of Neurology, Neurosurgery & Psychiatry, 95(1), 37–43. https://doi.org/10.1136/jnnp-2023-331799

Lavdas, I., Glocker, B., Rueckert, D., Taylor, S., Aboagye, E., & Rockall, A. (2019). Machine learning in whole-body MRI: Experiences and challenges from an applied study using multicentre data. Clinical Radiology, 74(5), 346–356. https://doi.org/10.1016/j.crad.2019.01.012

Lucchinetti, C., Popescu, B., Bunyan, R., Moll, N., Roemer, S., Lassmann, H., Brück, W., Parisi, J., Scheithauer, B., Giannini, C., Weigand, S., Mandrekar, J., & Ransohoff, R. (2011). Inflammatory cortical demyelination in early multiple sclerosis. New England Journal of Medicine, 365, 2188–2197. https://doi.org/10.1056/NEJMoa1100648

Ma, J., Li, F., & Wang, B. (2024). U-Mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv preprint arXiv:2401.04722. https://doi.org/10.20944/preprints202411.2377.v1

Maier-Hein, L., Reinke, A., Godau, P., Tizabi, M. D., Buettner, F., Christodoulou, E., Glocker, B., Isensee, F., Kleesiek, J., Kozubek, M., Reyes, M., Riegler, M., Wiesenfarth, M., Kavur, A., Sudre, C., Baumgartner, M., Eisenmann, M., Heckmann-Nötzel, D., Rädsch, T., & Jaeger, P. (2024). Metrics reloaded: Recommendations for image analysis validation. Nature Methods, 21, 195–212. https://doi.org/10.1038/s41592-023-02151-z

Mariano, R., Messina, S., Roca-Fernandez, A., Leite, M. I., Kong, Y., & Palace, J. A. (2021). Quantitative spinal cord MRI in MOG-antibody disease, neuromyelitis optica and multiple sclerosis. Brain, 144(1), 198–212. https://doi.org/10.1093/brain/awaa347

McGinnis, J., Shit, S., Li, H., Sideri-Lampretsa, V., Graf, R. N., Dannecker, M., Pan, J.-Y., Ans’o, N. S., Muhlau, M., Kirschke, J. S., Rueckert, D., & Wiestler, B. (2023). Single-subject multi-contrast MRI super-resolution via implicit neural representations. International Conference on Medical Image Computing and Computer-Assisted Intervention. https://doi.org/10.1007/978-3-031-43993-3_17

Mendelsohn, Z., Pemberton, H. G., Gray, J., Goodkin, O., Carrasco, F. P., Scheel, M., Nawabi, J., & Barkhof, F. (2023). Commercial volumetric MRI reporting tools in multiple sclerosis: A systematic review of the evidence. Neuroradiology, 65(1), 5–24. https://doi.org/10.1007/s00234-022-03074-w

Moller, H. K., Graf, R., Schmitt, J., Keinert, B., Atad, M., Sekuboyina, A. K., Streckenbach, F., Schon, H., Kofler, F., Kroencke, T. J., Bette, S., Willich, S. N., Keil, T., Niendorf, T., Pischon, T., Endemann, B., Menze, B. H., Rueckert, D., & Kirschke, J. S. (2024). SPINEPS—Automatic whole spine segmentation of T2-weighted MR images using a two-phase approach to multi-class semantic and instance segmentation. European Radiology, 35, 1178–1189. https://doi.org/10.1007/s00330-024-11155-y

Naga Karthik, E., Kerbrat, A., Labauge, P., Granberg, T., Talbott, J., Reich, D. S., Filippi, M., Bakshi, R., Callot, V., Chandar, S., & Cohen-Adad, J. (2022). Segmentation of multiple sclerosis lesion across hospitals: Learn continually or train from scratch? MedNeurIPS: Medical Imaging Meets NeurIPS Workshop. https://arxiv.org/pdf/2210.15091.pdf

Naga Karthik, E., Valošek, J., Smith, A. C., Pfyffer, D., Schading-Sassenhausen, S., Farner, L., Weber, K. A., Freund, P., & Cohen-Adad, J. (2024). SCIseg: Automatic segmentation of intramedullary lesions in spinal cord injury on T2-weighted MRI scans. Radiology: Artificial Intelligence, 7(1), e240005. https://doi.org/10.1148/ryai.240005

Poulsen, E. N., Olsson, A., Gustavsen, S., Langkilde, A. R., Oturai, A. B., & Carlsen, J. F. (2021). MRI of the entire spinal cord—Worth the while or waste of time? A retrospective study of 74 patients with multiple sclerosis. Diagnostics, 11(8), 1424. https://doi.org/10.3390/diagnostics11081424

Reinke, A., Tizabi, M. D., Baumgartner, M., Eisenmann, M., Heckmann-Nötzel, D., Kavur, A. E., Rädsch, T., Sudre, C. H., Ación, L., Antonelli, M., Arbel, T., Bakas, S., Benis, A., Buettner, F., Cardoso, M. J., Cheplygina, V., Chen, J., Christodoulou, E., Cimini, B. A., Maier-Hein, L. (2023). Understanding metric-related pitfalls in image analysis validation. Nature Methods, 21, 182–194. https://doi.org/10.1038/s41592-023-02150-0

Rocca, M., Valsasina, P., Meani, A., Gobbi, C., Zecca, C., Barkhof, F., Schoonheim, M., Strijbis, E., Vrenken, H., Gallo, A., Bisecco, A., Ciccarelli, O., Yiannakas, M., Rovira, A., Sastre-Garriga, J., Palace, J., Matthews, L., Gass, A., Eisele, P., & Filippi, M. (2022). Spinal cord lesions and brain grey matter atrophy independently predict clinical worsening in definite multiple sclerosis: A 5-year, multicentre study. Journal of Neurology, Neurosurgery & Psychiatry, 94, 10–18. https://doi.org/10.1136/jnnp-2022-329854

Rocca, M. A., Preziosa, P., Barkhof, F., Brownlee, W. J., Calabrese, M., Stefano, N. D., Granziera, C., Ropele, S., Toosy, A. T., Vidal-Jordana, Á., Filippo, M. D., & Filippi, M. (2024). Current and future role of MRI in the diagnosis and prognosis of multiple sclerosis. The Lancet Regional Health - Europe, 44, 100978. https://doi.org/10.1016/j.lanepe.2024.100978

Schmidt, P., Gaser, C., Arsic, M., Buck, D., Förschler, A., Berthele, A., Hoshi, M., Ilg, R., Schmid, V. J., Zimmer, C., Hemmer, B., & Mühlau, M. (2012). An automated tool for detection of FLAIR-hyperintense white-matter lesions in multiple sclerosis. NeuroImage, 59(4), 3774–3783. https://doi.org/10.1016/j.neuroimage.2011.11.032

Sekuboyina, A., Husseini, M. E., Bayat, A., Löffler, M., Liebl, H., Li, H., Tetteh, G., Kukačka, J., Payer, C., Štern, D., Urschler, M., Chen, M., Cheng, D., Lessmann, N., Hu, Y., Wang, T., Yang, D., Xu, D., Ambellan, F., … Kirschke, J. S. (2021). VerSe: A vertebrae labelling and segmentation benchmark for multi-detector CT images. Medical Image Analysis, 73, 102166. https://doi.org/10.1016/j.media.2021.102166

Shiee, N., Bazin, P.-L., Ozturk, A., Reich, D. S., Calabresi, P. A., & Pham, D. L. (2010). A topology-preserving approach to the segmentation of brain images with multiple sclerosis lesions. NeuroImage, 49(2), 1524–1535. https://doi.org/10.1016/j.neuroimage.2009.09.005

Thompson, A. J., Banwell, B. L., Barkhof, F., Carroll, W. M., Coetzee, T., Comi, G., Correale, J., Fazekas, F., Filippi, M., Freedman, M. S., Fujihara, K., Galetta, S. L., Hartung, H. P., Kappos, L., Lublin, F. D., Marrie, R. A., Miller, A. E., Miller, D. H., Montalban, X., … Cohen, J. A. (2017). Diagnosis of multiple sclerosis: 2017 revisions of the McDonald criteria. The Lancet Neurology, 17, 162–173. https://doi.org/10.1016/s1474-4422(17)30470-2

Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S. J., Brett, M., Wilson, J., Millman, K. J., Mayorov, N., Nelson, A. R. J., Jones, E., Kern, R., Larson, E., Vázquez-Baeza, Y. (2019). SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17, 261–272. https://doi.org/10.1038/s41592-020-0772-5

Waldman, A. D., Catania, C., Pisa, M., Jenkinson, M., Lenardo, M. J., & DeLuca, G. C. (2024). The prevalence and topography of spinal cord demyelination in multiple sclerosis: A retrospective study. Acta Neuropathologica, 147(1), 51. https://doi.org/10.1007/s00401-024-02700-6

Walsh, R., Gaubert, M., Meurée, C., Hussein, B. R., Kerbrat, A., Casey, R., Combès, B., & Galassi, F. (2024). Multi-sequence learning for multiple sclerosis lesion segmentation in spinal cord MRI. International Conference on Medical Image Computing and Computer-Assisted Intervention, 478–487. https://doi.org/10.1007/978-3-031-72114-4_46

Walsh, R., Meurée, C., Kerbrat, A., Masson, A., Hussein, B. R., Gaubert, M., Galassi, F., & Combés, B. (2023). Expert variability and deep learning performance in spinal cord lesion segmentation for multiple sclerosis patients. 2023 IEEE 36th International Symposium on Computer-Based Medical Systems (CBMS), 463–470. https://doi.org/10.1109/cbms58004.2023.00263

Walton, C., King, R., Rechtman, L., Kaye, W. E., Leray, E., Marrie, R. A., Robertson, N. P., Rocca, N. L., Uitdehaag, B. M. J., van der Mei, I., Wallin, M. T., Helme, A., Napier, C. A., Rijke, N., & Baneke, P. (2020). Rising prevalence of multiple sclerosis worldwide: Insights from the Atlas of MS, third edition. Multiple Sclerosis (Houndmills, Basingstoke, England), 26, 1816–1821. https://doi.org/10.1177/1352458520970841

Warszawer, Y., Molinier, N., Eshaghi, A., & Cohen-Adad, J. (2024). Neuropoly/totalspineseg: TotalSpineSeg-r20241115 (Version r20241115). Zenodo. https://doi.org/10.5281/zenodo.14181502

Wattjes, M. P., Ciccarelli, O., Reich, D. S., Banwell, B. L., Stefano, N. D., Enzinger, C., Fazekas, F., Filippi, M., Frederiksen, J. L., Gasperini, C., Hacohen, Y., Kappos, L., Li, D. K., Mankad, K., Montalban, X., Newsome, S. D., Oh, J., Palace, J. A., Rocca, M. A., … Rovira, À. (2021). 2021 MAGNIMS–CMSC–NAIMS consensus recommendations on the use of MRI in patients with multiple sclerosis. The Lancet Neurology, 20, 653–670. https://doi.org/10.1016/s1474-4422(21)00095-8

Weier, K., Mazraeh, J., Naegelin, Y., Thoeni, A., Hirsch, J. G., Fabbro, T., Bruni, N., Duyar, H., Bendfeldt, K., Radue, E.-W., Kappos, L., & Gass, A. (2012). Biplanar MRI for the assessment of the spinal cord in multiple sclerosis. Multiple Sclerosis Journal, 18, 1560–1569. https://doi.org/10.1177/1352458512442754

Wiltgen, T., McGinnis, J., Schlaeger, S., Voon, C., Berthele, A., Bischl, D., Grundl, L., Will, N., Metz, M., Schinz, D., Sepp, D., Prucker, P., Schmitz-Koep, B., Zimmer, C., Menze, B. H., Rueckert, D., Hemmer, B., Kirschke, J. S., Mühlau, M., & Wiestler, B. (2024). LST-AI: A deep learning ensemble for accurate MS lesion segmentation. NeuroImage: Clinical, 42. https://doi.org/10.1016/j.nicl.2024.103611

Zhang, X., Liu, C., Ou, N. J., Zeng, X., Xiong, X., Yu, Y., Liu, Z., & Ye, C. (2021). CarveMix: A simple data augmentation method for brain lesion segmentation. NeuroImage, 271, 120041. https://doi.org/10.1016/j.neuroimage.2023.120041

Appendix A

A.1. Experimental details

Table A1 shows the training configurations of the 2D and 3D models for each dataset variant of the TUM dataset. Note that in all 3D models, the convolutional kernels contain a mix of 3D kernels with a singleton dimension (effectively applied as 2D kernels on the in-plane dimensions) and full 3×3×3 kernels.

Table A1.

Overview of training configurations and hyperparameters used for 2D and 3D models trained on the TUM dataset.

Parameters | Chunks native 2d | Chunks native 3d_fullres | Stitched native 2d | Stitched native 3d_fullres | Chunks straightened 2d | Chunks straightened 3d_fullres | Stitched straightened 2d | Stitched straightened 3d_fullres
Batch size | 22 | 14 | 18 | 2 | 65 | 2 | 65 | 2
Patch size | (384, 384) | (24, 256, 256) | (512, 448) | (48, 224, 160) | (224, 224) | (32, 224, 224) | (224, 224) | (44, 192, 192)
Spacing | (0.34, 0.34) | (5.0, 0.34, 0.34) | (0.34, 0.34) | (5, 0.34, 0.34) | (0.34, 0.34) | (5, 0.34, 0.34) | (0.34, 0.34) | (5, 0.34, 0.34)
Stages | 7 | 7 | 7 | 7 | 6 | 6 | 6 | 6
Features per stage | 32, 64, 128, 256, 512, 512, 512 | 32, 64, 128, 256, 320, 320, 320 | 32, 64, 128, 256, 512, 512, 512 | 32, 64, 128, 256, 320, 320, 320 | 32, 64, 128, 256, 512, 512 | 32, 64, 128, 256, 320, 320 | 32, 64, 128, 256, 512, 512 | 32, 64, 128, 256, 320, 320
Kernel sizes per stage | (3, 3) × 7 | (1, 3, 3) × 3, then (3, 3, 3) × 4 | (3, 3) × 7 | (1, 3, 3) × 3, then (3, 3, 3) × 4 | (3, 3) × 6 | (1, 3, 3) × 3, then (3, 3, 3) × 3 | (3, 3) × 6 | (1, 3, 3) × 3, then (3, 3, 3) × 3
Strides per stage | (1, 1), then (2, 2) × 6 | (1, 1, 1), (1, 2, 2) × 3, (2, 2, 2) × 2, (1, 2, 2) | (1, 1), then (2, 2) × 6 | (1, 1, 1), (1, 2, 2) × 3, (2, 2, 2) × 2, (2, 1, 1) | (1, 1), then (2, 2) × 5 | (1, 1, 1), (1, 2, 2) × 3, (2, 2, 2) × 2 | (1, 1), then (2, 2) × 5 | (1, 1, 1), (1, 2, 2) × 3, (2, 2, 2) × 2

A.2 Comparison with seg_ms_lesion

We present a comparison with seg_ms_lesion (a contemporary work available in SCT v6.5) in Table A2. However, this comparison with our models is heavily biased for the following reasons: (i) seg_ms_lesion is trained on scans from all centers (TUM, BWH, UCSF), unlike our models, where the SingleSite model used TUM data and the TwoSite model used TUM+NYU data; (ii) the dataset split of the seg_ms_lesion model shows train/test leakage, meaning that chunks of a given patient were split across the train and test sets, thus corrupting the test set; and (iii) most importantly, out of the 126 (TUM), 80 (BWH), and 32 (UCSF) scans used to evaluate our models, 97 (TUM), 58 (BWH), and 17 (UCSF) scans were included in the training set of the seg_ms_lesion model. For an unbiased evaluation on the remaining scans unique to the test sets of all models, we refer the reader to Table A3 below.

Table A2.

Quantitative comparison of the models trained on chunks on TUM, BWH, and UCSF sites.

Test site | Metric | sct_deepseg_lesion | seg_ms_lesion (SCT v6.5) | ChunksNative SingleSite | ChunksNative TwoSites
TUM (n=126) | Dice | 0.33 ± 0.31 | 0.53 ± 0.34 | 0.60 ± 0.34 | 0.62 ± 0.33
TUM (n=126) | NSD | 0.37 ± 0.31 | 0.61 ± 0.39 | 0.65 ± 0.37 | 0.67 ± 0.36
TUM (n=126) | PPV_L | 0.48 ± 0.38 | 0.63 ± 0.42 | 0.60 ± 0.37 | 0.63 ± 0.36
TUM (n=126) | F1Score_L | 0.42 ± 0.34 | 0.63 ± 0.40 | 0.64 ± 0.36 | 0.66 ± 0.34
TUM (n=126) | |n_ref^L - n_pred^L| | 1.93 ± 1.87 | 2.29 ± 0.14 | 1.86 ± 1.89 | 1.62 ± 1.81
BWH (n=80) | Dice | 0.28 ± 0.30 | 0.53 ± 0.27 | 0.45 ± 0.26 | 0.51 ± 0.26
BWH (n=80) | NSD | 0.34 ± 0.31 | 0.68 ± 0.33 | 0.60 ± 0.32 | 0.65 ± 0.30
BWH (n=80) | PPV_L | 0.61 ± 0.43 | 0.74 ± 0.36 | 0.62 ± 0.35 | 0.72 ± 0.34
BWH (n=80) | F1Score_L | 0.40 ± 0.37 | 0.62 ± 0.34 | 0.55 ± 0.32 | 0.62 ± 0.32
BWH (n=80) | |n_ref^L - n_pred^L| | 3.45 ± 4.22 | 1.39 ± 2.56 | 3.27 ± 3.85 | 3.00 ± 3.82
UCSF (n=32) | Dice | 0.15 ± 0.16 | 0.59 ± 0.14 | 0.52 ± 0.18 | 0.53 ± 0.18
UCSF (n=32) | NSD | 0.24 ± 0.25 | 0.77 ± 0.15 | 0.69 ± 0.22 | 0.71 ± 0.21
UCSF (n=32) | PPV_L | 0.52 ± 0.46 | 0.86 ± 0.18 | 0.72 ± 0.29 | 0.80 ± 0.22
UCSF (n=32) | F1Score_L | 0.25 ± 0.26 | 0.74 ± 0.18 | 0.68 ± 0.23 | 0.69 ± 0.23
UCSF (n=32) | |n_ref^L - n_pred^L| | 3.22 ± 2.88 | 1.22 ± 1.01 | 2.22 ± 1.81 | 1.69 ± 2.01

Number of images for each test site is given in brackets.

Note: Data are means ± standard deviations. The best value is 0.0 for |n_ref^L - n_pred^L| and 1.0 for Dice, NSD, and F1Score_L. Bold represents best performance.

Statistically significant compared to sct_deepseg_lesion (P < .001).

As noted in Section 3.4, test patient overlap in the seg_ms_lesion training set prevents a fair comparison. Unlike the original manuscript, we report results only on test scans unique to each model. With 97 (TUM), 58 (BWH), and 17 (UCSF) cases in seg_ms_lesion’s training, we recomputed metrics on the remaining 29 (TUM), 22 (BWH), and 15 (UCSF) scans, as shown below. Please note the different number of test cases compared to Table A2.

Table A3.

Comparison of the models on test sets from TUM, BWH, and UCSF sites.

Test site | Metric | ChunksNative SingleSite | ChunksNative TwoSites | sct_deepseg_lesion | seg_ms_lesion (SCT v6.5)
TUM (n=29) | Dice | 0.54 ± 0.46 | 0.54 ± 0.47 | 0.35 ± 0.46 | 0.16 ± 0.30
TUM (n=29) | NSD | 0.28 ± 0.39 | 0.27 ± 0.39 | 0.10 ± 0.26 | 0.15 ± 0.30
TUM (n=29) | PPV_L | 0.52 ± 0.47 | 0.51 ± 0.46 | 0.32 ± 0.45 | 0.14 ± 0.28
TUM (n=29) | F1Score_L | 0.54 ± 0.46 | 0.53 ± 0.46 | 0.33 ± 0.45 | 0.16 ± 0.29
TUM (n=29) | |n_ref^L - n_pred^L| | 1.36 ± 2.00 | 1.25 ± 2.17 | 1.39 ± 1.42 | 3.43 ± 2.20
BWH (n=22) | Dice | 0.27 ± 0.28 | 0.37 ± 0.35 | 0.35 ± 0.43 | 0.28 ± 0.30
BWH (n=22) | NSD | 0.36 ± 0.38 | 0.41 ± 0.39 | 0.17 ± 0.22 | 0.37 ± 0.40
BWH (n=22) | PPV_L | 0.30 ± 0.34 | 0.39 ± 0.38 | 0.49 ± 0.47 | 0.34 ± 0.39
BWH (n=22) | F1Score_L | 0.33 ± 0.35 | 0.42 ± 0.39 | 0.38 ± 0.43 | 0.35 ± 0.39
BWH (n=22) | |n_ref^L - n_pred^L| | 3.50 ± 3.17 | 2.73 ± 2.43 | 2.14 ± 2.05 | 2.50 ± 1.77
UCSF (n=15) | Dice | 0.49 ± 0.18 | 0.50 ± 0.19 | 0.13 ± 0.15 | 0.60 ± 0.14
UCSF (n=15) | NSD | 0.67 ± 0.22 | 0.69 ± 0.23 | 0.19 ± 0.22 | 0.78 ± 0.14
UCSF (n=15) | PPV_L | 0.72 ± 0.28 | 0.74 ± 0.30 | 0.46 ± 0.44 | 0.82 ± 0.20
UCSF (n=15) | F1Score_L | 0.62 ± 0.26 | 0.64 ± 0.29 | 0.22 ± 0.25 | 0.79 ± 0.14
UCSF (n=15) | |n_ref^L - n_pred^L| | 2.07 ± 1.83 | 1.67 ± 1.45 | 2.67 ± 2.94 | 1.53 ± 1.46

Number of images for each test site is given in parentheses. Note: Data are means ± standard deviations. The best value is 0.0 for |n_ref^L - n_pred^L| and 1.0 for Dice, NSD, and F1Score_L. Bold represents best performance.

For the TUM test set, seg_ms_lesion shows a large performance gap compared to our ChunksNative models, despite both being trained on TUM data. We attribute this to the ChunksNative models’ focus on axial T2w scans, which allows better lesion feature learning than the multi-contrast, multi-view seg_ms_lesion model. Notably, 20 of the 29 cases in the reduced test set had no lesions, likely contributing to the seg_ms_lesion model’s high false-positive rate.

For BWH, all models show comparable performance, with ChunksNative TwoSites excelling in most metrics except the lesion count difference |n_ref^L - n_pred^L| and the lesion-wise PPV (PPV_L). The diverse contrasts and views in seg_ms_lesion’s training set may trade accuracy for generalizability, leading to a lower Dice score. As with TUM, 10 of the 22 BWH test cases have empty lesion masks, challenging the highly sensitive seg_ms_lesion model. Moreover, it is noteworthy that BWH constitutes an in-distribution test site for seg_ms_lesion, while it is out-of-distribution for the ChunksNative models.

For UCSF, where all test cases have lesions, seg_ms_lesion outperforms all other models. This is expected since (i) UCSF is out-of-distribution for our models, leading to a generalization gap, and (ii) it provides an ideal test scenario for seg_ms_lesion: an in-distribution set with high lesion prevalence.
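For completeness, the filtering step underlying Table A3 can be summarized by the following sketch; the subject identifiers are placeholders, not the actual patient lists.

```python
# Sketch of the filtering step behind Table A3 (placeholder IDs only): keep only
# test subjects that never appeared in the seg_ms_lesion training set before
# recomputing the metrics.
test_subjects = {"sub-001", "sub-002", "sub-003", "sub-004"}
seg_ms_lesion_train_subjects = {"sub-002", "sub-004"}

unbiased_test_subjects = sorted(test_subjects - seg_ms_lesion_train_subjects)
print(unbiased_test_subjects)  # ['sub-001', 'sub-003']
```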

1

By preprocessing, we refer to the transformations performed on the image before it is fed to a DL-based segmentation pipeline. Note that this is different from the online preprocessing (i.e., resampling and normalization) typically performed in DL training pipelines.

Author notes

* Shared first authorship - authors contributed equally

Shared last authorship - equal supervision

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International (CC BY 4.0) license, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. For a full description of the license, please visit https://creativecommons.org/licenses/by/4.0/legalcode.