Abstract
Reconstructing the cortex from longitudinal magnetic resonance imaging (MRI) is indispensable for analyzing morphological alterations in the human brain. Despite the recent advancement of cortical surface reconstruction with deep learning, challenges arising from longitudinal data are still persistent. Especially the lack of strong spatiotemporal point correspondence between highly convoluted brain surfaces hinders downstream analyses, as local morphology is not directly comparable if the anatomical location is not matched precisely. To address this issue, we present V2C-Long, the first dedicated deep learning-based cortex reconstruction method for longitudinal MRI. V2C-Long exhibits strong inherent spatiotemporal correspondence across subjects and visits, thereby reducing the need for surface-based post-processing. We establish this correspondence directly during the reconstruction via the composition of two deep template-deformation networks and innovative aggregation of within-subject templates in mesh space. We validate V2C-Long on two large neuroimaging studies, focusing on surface accuracy, consistency, generalization, test-retest reliability, and sensitivity. The results reveal a substantial improvement in longitudinal consistency and accuracy compared to existing methods. In addition, we demonstrate stronger evidence for longitudinal cortical atrophy in Alzheimer’s disease than longitudinal FreeSurfer.
1 Introduction
The structure of the cortical gray matter, that is, the thin and tightly folded sheet of neural tissue confined by inner white matter (WM) and outer pial surfaces, can be observed from structural magnetic resonance imaging (MRI). It has direct implications for brain functionality (Pang et al., 2023), and local structural measurements like cortical thickness or curvature are important biomarkers to understand brain development (Mills & Tamnes, 2014) and to monitor the progression of brain disorders (Bachmann et al., 2023; Risacher et al., 2010; Schwarz et al., 2016). For accurate in vivo measurements, triangular meshes representing the WM and pial surfaces are typically extracted from the MR image since they are less susceptible to partial-volume effects than voxel-based segmentation (Dale et al., 1999). The surface-based representation enhances the sensitivity to subtle changes in cortical morphology, which is crucial for detecting early signs of neurodegenerative diseases and monitoring disease progression. Moreover, the high spatial resolution of triangular meshes allows for high-resolution mapping of cortical features, enabling researchers and clinicians to pinpoint specific regions of interest with great precision. Recently, the reconstruction of cortical surfaces has been accelerated from hours to seconds with deep-learning models that can be run on the latest generation of graphics processing units (GPUs) (Gopinath et al., 2021; Hoopes et al., 2022; Lebrat et al., 2021; Ma, Li, Robinson, et al., 2023; Santa Cruz et al., 2021).
These neural networks focus on cross-sectional data, but studying within-subject changes in aging, disease, and treatment relies on longitudinal data. Processing longitudinal neuroimaging data requires dedicated tools to reduce bias and increase the sensitivity to subtle differences in follow-up visits, typically far below the image resolution of 1 mm (Reuter et al., 2012). To this end, unbiased within-subject templates and homologous points, that is, corresponding anatomical locations, must be estimated accurately to make a local comparison of brain morphometry possible (Reuter & Fischl, 2011). More precisely, within-subject templates condense multiple longitudinal snapshots of an individual’s brain anatomy into a single geometric model for better within-subject consistency and higher sensitivity to slight morphological alterations. This is especially important for adult brains, where changes are more subtle than in early developmental stages (Bethlehem et al., 2022). Within-subject templates further serve as a common ground for visualizations and spatiotemporal statistics. The established approach for longitudinal cortical surface reconstruction, for example, as implemented in FreeSurfer (Fischl, 2012) and illustrated in Supplementary Figure S2, is to perform a group-wise registration of all of a subject’s visits, extract cortical surfaces from the obtained template image, and use these within-subject templates to initialize the reconstruction at each visit. However, this approach suffers from several flaws. First, although correspondences between the subject’s visits are established, the within-subject templates are not directly comparable across subjects. This complicates downstream applications, for example, longitudinal group comparisons. In addition, adding a new scan requires re-running the entire pipeline, starting from the group-wise registration, which is inefficient and, therefore, time- and resource-consuming. Finally, the group-wise registration is prone to bias introduced by variations in voxel intensity or registration asymmetry (Reuter & Fischl, 2011), challenges that may be overcome with advanced neural deformation networks.
In this work, we introduce V2C-Long—a novel cortical reconstruction framework dedicated to longitudinal MRI. V2C-Long establishes strong spatiotemporal correspondence of surface points, that is, within subjects and across subjects, inherently during the reconstruction as illustrated in Figure 1. The inherent correspondence eliminates the need for spherical inflation to create correspondences within and across subjects. To achieve this, we propose a two-stage approach that builds upon recent advances in template-based cortical surface reconstruction. First, we train a V2C-Flow model (Bongratz et al., 2024) to deform the FsAverage population template (Fischl, Sereno, Tootell, et al., 1999) to cortical surfaces and leverage the inherently established spatial correspondence to compute within-subject templates directly on the surface level. We then incorporate the within-subject templates into a second deep template-deformation network, thereby establishing strong temporal correspondence among multiple cortical surfaces from an individual. The learning-based approach, together with the parallel execution on the latest generation of GPUs, allows us to compute application-ready longitudinal sequences of cortical surfaces within seconds. Moreover, as the input and output data formats are compatible with existing neuroimaging tools, V2C-Long can be readily integrated into existing image-processing pipelines. We evaluate V2C-Long and derived cortical thickness measurements thoroughly using two large neuroimaging studies, ADNI and OASIS, and a test-retest database. We show that V2C-Long compares favorably with existing methods regarding surface accuracy, longitudinal consistency, and test-retest reliability. In addition, we perform a downstream analysis of longitudinal cortical thickness in Alzheimer’s disease, observing more significant evidence of group differences in V2C-Long compared to the best alternative approaches.
V2C-Long inherently establishes spatiotemporal correspondence of cortical surfaces during the reconstruction. This renders all reconstructed surfaces, that is, within a certain subject and across subjects, directly comparable on the vertex level.
V2C-Long inherently establishes spatiotemporal correspondence of cortical surfaces during the reconstruction. This renders all reconstructed surfaces, that is, within a certain subject and across subjects, directly comparable on the vertex level.
2 Related Work
Originally, the task of cortical surface reconstruction was tackled by a sophisticated combination of segmentation and mesh-extraction algorithms (Dale et al., 1999; MacDonald et al., 2000; Mangin et al., 1995), with runtimes in the order of hours for a single scan. Recently, deep learning-based cortex reconstruction methods have emerged, allowing for fast surface extraction within seconds on the latest GPUs. These methods can be categorized into implicit, segmentation-based, and template-based approaches, depending on the input and output of the utilized neural network. Implicit (Gopinath et al., 2021; Santa Cruz et al., 2021) and segmentation-based approaches (Henschel et al., 2020; Ma, Li, Robinson, et al., 2023) first compute an implicit representation of the surfaces, that is, a signed distance function or occupancy grid, respectively, which is then converted to an explicit mesh with marching cubes (Lorensen & Cline, 1987) or similar algorithms. However, applying marching cubes precludes a direct comparison of the surface meshes on a per-vertex basis and the resulting surfaces come without correspondence. In general, cortical surfaces can still be registered post hoc, but this requires intricate surface inflation and mapping to an icosphere, inevitably introducing area and shape distortions (Robinson et al., 2018; Suliman et al., 2022; Yeo et al., 2010).
Although implicit deep learning-based methods offer a considerable speed-up compared to traditional neuroimaging processing, explicit template-deformation methods (Bongratz et al., 2024; Hoopes et al., 2022; Lebrat et al., 2021; Ma, Li, Kyriakopoulou, et al., 2023; Rickmann et al., 2023; Santa Cruz et al., 2022; Wickramasinghe et al., 2020) are usually fastest. These methods do not require a conversion of surface representations. Instead, they follow the idea of deformable contours (Kass et al., 1988; Mangin et al., 1995) in that they take a generic, topologically correct shape template as input and deform it to individual three-dimensional shape contours based on latent features extracted from the input image. Apart from the fast processing, this approach bears the potential to establish correspondences between the template and the predicted surfaces through the deformation, enabling direct cross-sectional analyses and brain atlas propagation (Bongratz et al., 2024; Rickmann et al., 2023). Different from spherical surface registration, the corresponding points lie directly on the folded cortical surfaces and not on an icosphere. In V2C-Flow (Bongratz et al., 2024) and CorticalFlow (Lebrat et al., 2021)/CorticalFlow++ (Santa Cruz et al., 2022), these correspondences are learned in an unsupervised manner with the Chamfer loss and regularization terms. The Chamfer loss does not require ground-truth correspondences during training. V2CC (Rickmann et al., 2023), on the other hand, takes a supervised approach and computes an L1 loss between the predicted vertices and registered and re-sampled reference surfaces. Lastly, the loss from TopoFit (Hoopes et al., 2022) can be considered to lie in between the Chamfer and L1 loss. More precisely, TopoFit restricts the Chamfer loss to a certain neighborhood based on the adjacency of vertices in the FsAverage template; again, this approach requires registered and re-sampled reference surfaces.
For longitudinal cortex analysis, dedicated computational pipelines (Ashburner & Ridgway, 2013; Li et al., 2012; Reuter et al., 2012) were developed and implemented into widely used neuroimaging tools like FreeSurfer (Fischl, 2012) and CAT12 (Gaser et al., 2024). The common approach is to estimate a subject-specific mean or median image via group-wise registration as an intermediate step to obtain within-subject template surfaces. These within-subject templates are then warped to the contours of the individual scans. Thereby, the vertices within a longitudinal sequence are matched, whereas the within-subject templates from different individuals are not comparable on the vertex level. Hence, error-prone and time-consuming surface inflation, registration, and re-sampling are inevitable for longitudinal group analyses in these approaches (Fischl, Sereno, & Dale, 1999).
More broadly, deep-learning methods also exist that learn abstract latent representations from longitudinal images (Kim & Sabuncu, 2023; Ren et al., 2022). These methods, however, are not designed to work with non-Euclidean representations such as brain surfaces. A notable exception is the recent work by Zhao et al. (2024), which addresses longitudinal cortical parcellation but, unfortunately, is limited to a canonical spherical representation of cortical surfaces and, hence, relies on an accurate reconstruction in the first place.
3 Materials and Methods
Figure 2 depicts the architecture of V2C-Long. As input, V2C-Long takes a sequence of MR images from a certain subject. As output, WM and pial surfaces with spatial, that is, across subjects, and temporal, that is, within a subject, correspondence of anatomical surface points are computed for each input image, cf. Figure 1. We describe the architecture of the neural network, the within-subject template creation, and the within-subject template deformation in Sections 3.3 to 3.5, respectively.
Architecture of V2C-Long. From a sequence of 3D brain MRI scans, V2C-Long computes a within-subject template, enriches it with vertex features from the template-creation model, and deforms it to cortical surfaces with strong spatiotemporal point correspondence.
Architecture of V2C-Long. From a sequence of 3D brain MRI scans, V2C-Long computes a within-subject template, enriches it with vertex features from the template-creation model, and deforms it to cortical surfaces with strong spatiotemporal point correspondence.
3.1 Data and preprocessing
We obtained data for the preparation of this article from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (http://adni.loni.usc.edu), containing subjects with Alzheimer’s disease, mild cognitive impairment, and cognitively normal individuals. We used longitudinal T1w MRI scans (1.5T and 3T), split the images on the subject level, and stratified our splits according to sex, age, initial diagnosis, and number of visits per subject ( visits, ). This resulted in 3,745, 594, and 1,094 scans for training, validation, and testing, respectively. Further, we used 288 scans (100 subjects) from the longitudinal OASIS-3 (LaMontagne et al., 2019) study to evaluate the generalization of the trained models to an external data set. As a reference standard, we used surfaces reconstructed by FreeSurfer v7.2 (Fischl, 2012), cross-sectional processing, which has become a standard in the field (Hoopes et al., 2022; Lebrat et al., 2021; Ma, Li, Robinson, et al., 2023; Santa Cruz et al., 2021). As model input, we provided the orig.mgz files, mapped to MNI152 standard space (1 mm, voxels) by affine registration (Modat et al., 2014). We also used a public test-retest (TRT) dataset (Maclaren et al., 2014) that contains 120 T1w MRI scans (three subjects, each scanned twice for 20 consecutive days) to assess the in-session test-retest reliability of our model.
3.2 Image and surface representation
Throughout this work, we represent an MR image as a three-dimensional tensor of height , width , and depth , containing a single channel of gray values. A set of MR images from a certain subject is given as an ordered sequence . After the initial scan , which does not have a particular role and is simply the first scan in the sequence, each subject can have an arbitrary number of follow-up scans. Slightly abusing nomenclature, we will also refer to scans from certain visits, that is, at certain timepoints, simply as visits for the sake of compactness.
The tissue boundaries of one cerebral hemisphere, namely the inner WM surface, which divides white and gray matter, and the outer pial surface, which divides gray matter and cerebrospinal fluid, are considered to be two-dimensional manifolds embedded in three-dimensional Euclidean space. We represent the surfaces as closed triangular meshes , where each mesh consists of vertices and faces , storing the indices to vertices. Implicitly, the faces also define a set of edges that connect vertices pairwise. From a sequence of visits from subject , we aim to reconstruct corresponding cortical surfaces , with fixed connectivity, that is, . Although there is no order among faces or vertices per se, we adopt an arbitrary but fixed order in our implementation to represent fixed mesh connectivity and precise anatomical locations in the brain. We denote the within-subject mesh template of subject as and population templates simply as . Slightly abusing the notation, we will refer to the within-subject template meshes and the within-subject template’s vertices equally with . The vertices define the mesh entirely, given that the connectivity is pre-defined.
3.3 Model architecture and implementation
Our V2C-Long framework builds upon the V2C-Flow model (Bongratz et al., 2024). Briefly, V2C-Flow learns to deform the generic FsAverage population template (Fischl, Sereno, Tootell, et al., 1999) to individual cortical surfaces, thereby establishing correspondences among all reconstructed surfaces and to the template. In contrast to other recent cortical surface reconstruction methods, V2C-Flow incorporates virtual edges between WM and pial surfaces to mitigate their intersection. V2C-Flow employs a UNet (Ronneberger et al., 2015)-like convolutional neural network (CNN) to extract multi-resolution image features from the input image, and it maps them onto the template vertices by trilinear interpolation. From these vertex features, a graph neural network (GNN) (Morris et al., 2019) computes a displacement field that is integrated numerically, deforming the template to the brain contours. Formally, the deformation is described by the initial value problem
where is the input image, is the population template, and are the parameters of the CNN and GNN. To solve Equation (1), we use an Euler integration scheme with five integration steps (step size 0.2) as in the original work. The final reconstruction is given by .
In V2C-Long, we employ two consecutive V2C-Flow models for within-subject template creation (cf. Section 3.4) and within-subject template deformation (cf. Section 3.5). Each V2C-Flow model is trained jointly with a combination of voxel- (cross-entropy) and mesh-loss (curvature-weighted Chamfer, edge, and normal consistency) functions for WM and pial surfaces at once. The loss functions and their weighting factors are described in detail in Bongratz et al. (2024). During training, we sample 100,000 points randomly per WM/pial surface from the reference meshes, obtained from FreeSurfer (v7.2), as the target for the curvature-weighted Chamfer loss. Our implementation is based on Python (v3.9), PyTorch (v1.10.0), Cuda (v11.3), PyTorch3d (v0.6.1), and automatic mixed precision. The code and models will be available at https://github.com/ai-med/Vox2Cortex.
3.4 Within-subject template creation
In contrast to existing image-based within-subject template creation (Ashburner & Ridgway, 2013; Li et al., 2012; Reuter et al., 2012), we present a new approach to obtain these templates directly in mesh space, that is, from the vertices that define the meshes together with the faces. More precisely, we propose to leverage V2C-Flow and, for a certain subject , compute the within-subject template from the mean vertex locations in a sequence of visits:
Theorem 1.Equation (2)solves the initial value problemin a first order approximation, whereis the mean deformation field across all of the subject’s visits.
We prove this statement in Appendix A. Theorem 1 implies that the template we obtain from Equation (2) is equivalent to solving Equation (3) on the subject level after aggregating all of the subject’s flow fields into a joint deformation , following the standard definition of deformable anatomical templates (Grenander & Miller, 1998). The prerequisite for the mean aggregation is corresponding vertices, which we get from V2C-Flow (Bongratz et al., 2024). See Supplementary Figures S5 and S6 for an illustration of vertex correspondences in V2C-Flow/V2C-Long. Since the sample mean is an unbiased estimator, the within-subject templates obtained from Equation (2) are not biased toward any visit. Nonetheless, including all available visits in the aggregation is vital to avoid asymmetries, for example, toward earlier visits if later ones were omitted.
To enrich the within-subject templates with additional information about an individual’s cortical geometry, we concatenate the vertex features extracted by the two graph neural network (GNN) blocks in V2C-Flow (obtained right before the output layers that compute the deformation) to the vertex coordinates. These features (of dimension ) contain geometric information beyond bare vertex coordinates as they guide the mesh flow. Formally, we compute the mean in Equation (2) from the resulting generalized vertices, which encompass both the spatial coordinates and the associated vertex features.
3.5 Within-subject template deformation
In a second step, we deform the within-subject template again to the contours of each visit to improve the temporal consistency of reconstructed surfaces compared to the initial reconstruction. To this end, we train a second V2C-Flow model with subject-specific input templates, that is,
Importantly, we provide the template only as a starting point, but the temporal deformation is not constrained further to avoid over-regularization (Reuter & Fischl, 2011). To speed up the training process in this second stage of V2C-Long, we provide the weights of the previously trained template-creation model as an initialization (except for input layers due to the additional vertex features).
3.6 Evaluation metrics
We use two sets of metrics to evaluate the quality and consistency of the predicted meshes. We provide the respective definitions in the following.
3.6.1 Surface quality metrics
The average symmetric surface distance (ASSD) and symmetric Hausdorff distance (HD) measure the concordance of two contours in average (ASSD) and in the worst case (HD). For better robustness, we use a percentile-based version of the HD (). To compute distances between surfaces, we sample points uniformly from the predicted and reference meshes, and , respectively, yielding point sets and . From the point sets and the meshes, we can compute point-to-surface distances in a bilateral manner. As a measure of topological accuracy, we also compute the number of self-intersecting faces (SIF) from the predicted meshes.
ASSD. Formally, the ASSD computes as
where is the Euclidean point-to-face distance between a point p and the closest point in the mesh . In our implementation, we use
. Similarly, the symmetric HD computes as
To make the HD robust to outliers, a percentile can be selected ( corresponds to the standard Hausdorff distance which is easily dominated by a single vertex).
SIF. Ideally, cortical surfaces should be represented by watertight 2-manifolds, that is, closed surfaces without holes, handles, or self-intersections. While the template-deformation approach in V2C-Long circumvents holes and handles entirely, self-intersections can still occur. We compute the number of self-intersecting faces (SIF), that is, the ratio of faces intersecting with another face from the same mesh, using PyMeshLab (v2022.2).
3.6.2 Surface consistency metrics
We use the following metrics to evaluate the test-retest reliability, that is, the consistency of reconstructions from images acquired within a short time, and the longitudinal consistency, that is, the consistency of reconstructions with longer time periods between the visits.
MCVar and CThVar. We quantify point-wise longitudinal consistency via the variance in derived morphological descriptors, namely cortical curvature (mean curvature variance, MCVar) (Meyer et al., 2003) and thickness (cortical thickness variance, CThVar) (Fischl & Dale, 2000). The rationale is that local variations in curvature (the cortical sheet is tightly folded) and thickness are usually greater than anatomical alterations over time (Fischl & Dale, 2000). Hence, the variance in these measures should be smaller with better alignment (albeit not necessarily zero). Formally, we compute the unbiased sample variance
where is the discrete mean curvature associated with a vertex . The CThVar is defined analogously. Note that the curvature can be computed individually for WM and pial surfaces, whereas only a single CThVar value is obtained from both surfaces. We compute a robust score on the subject level by taking the median across all vertices. For estimating cortical thickness, we follow the bilateral method proposed by Fischl and Dale (2000). We calculate the shortest distance from a vertex on the WM surface to the pial surface, and vice versa, and then take the average of these distances at each vertex.
ParcF1. To further assess the longitudinal consistency on the region level, we map the Destrieux atlas (Destrieux et al., 2010), a fine-grained cortical atlas dividing the cortex into gyri and sulci, from FreeSurfer’s FsAverage to the predicted surfaces (assigning a certain class to each vertex). Given the respective class labels, we measure the overlap, that is, the consistency, of a region in the longitudinal sequence by pairwise comparison of nearest-neighbor vertex classes. On the surface level, the F1 score is weighted according to the different region sizes.
3.7 Vertex-wise linear mixed effects regression
To compare the pattern of longitudinal cortical thickness in two diagnostic groups, we employ a vertex-wise linear mixed effects (LME) regression model (Bernal-Rusiel et al., 2013; Wachinger et al., 2016). This model is defined as
where, for subject , the age at initial visit , the time from initial to follow-up visit , and the stable diagnosis are considered. It allows for individual intercepts and slopes via random effects regression coefficients , . The fixed, non-individual regression coefficients model the global effects on cortical thickness. We use the LME implementation in Statsmodels (v0.14.1).
4 Results
In this section, we report results in terms of longitudinal consistency (Section 4.2), accuracy (Section 4.4), and in-session test-retest reliability (Section 4.3) using the metrics described in Section 3.6. In addition, we provide an ablation study of V2C-Long in Section 4.5. Finally, we replicate differences in longitudinal cortical thickness between patients diagnosed with Alzheimer’s disease and a healthy control population using the LME regression model described in Section 3.7.
4.1 Experimental setting
We compare V2C-Long to the longitudinal FreeSurfer processing (FS-Long) (Reuter et al., 2012). Additionally, we include recent deep learning-based cortex reconstruction methods. Namely, we implemented TopoFit* (Hoopes et al., 2022), CF++† (Santa Cruz et al., 2022), V2C-Flow (Bongratz et al., 2024), and V2CC (Rickmann et al., 2023) consistently for the right hemisphere based on the original repositories. Although these methods are not optimized for longitudinal data as thoroughly as FS-Long and V2C-Long, they are crucial baselines due to their proven accuracy and architectural similarity. We used the raw output of all reconstruction methods without further surface-based post-processing, as this is most comparable to V2C-Long. Nevertheless, we also consider spherical registration-based post-processing as another baseline. Specifically, we applied the spherical registration implemented in FreeSurfer (FS-Reg) (Fischl, 2012), with FsAverage as the registration target and barycentric interpolation of vertex coordinates, to V2C-Flow surfaces. We refer to this approach as V2C-Flow/FS-Reg. Results for a V2C-Long model trained on both hemispheres are in the Supplementary Material. We trained all deep-learning methods in the same setting using our longitudinal ADNI training set and Nvidia A100 GPUs with 40GB VRAM. We consistently selected the model with the lowest reconstruction error (ASSD) on the validation set.
4.2 Longitudinal consistency
Table 1 shows the longitudinal consistency metrics of all implemented methods for inner (WM) and outer (pial) cortical surfaces. Corresponding plots are in Supplementary Figure S4. We observe a substantial improvement with V2C-Long over all baseline methods—including FS-Long on the internal (ADNI) and external (OASIS) test sets. The variance in discrete mean curvature (MCVar) is reduced by up to 61% (pial, ADNI) compared to the closest competitor, V2CC, and by up to 67% (pial, OASIS) compared to FS-Long. Regarding cortical thickness, V2C-Long achieves an improvement of 24% (ADNI) and 30% (OASIS) over the best alternative method. On the region level (ParcF1), V2C-Long is again at the forefront of the best methods, only beaten by FS-Long on OASIS WM surfaces by 0.001. The spherical registration improves the longitudinal consistency compared to the raw output of V2C-Flow. However, V2C-Long remains significantly superior across both datasets.
Consistency metrics by surface and method for the ADNI and OASIS test sets.
. | . | WM surface . | Pial surface . | Both . | ||
---|---|---|---|---|---|---|
. | Model . | MCVar . | ParcF1 . | MCVar . | ParcF1 . | CThVar . |
ADNI | V2C-Long | 0.020 ± 0.020 | 0.971 ± 0.015 | 0.012 ± 0.015 | 0.966 ± 0.017 | 0.016 ± 0.010 |
V2C-Flow (Bongratz et al., 2024) | 0.060 ± 0.023 | 0.924 ± 0.030 | 0.047 ± 0.018 | 0.920 ± 0.028 | 0.038 ± 0.027 | |
V2C-F./FS-Reg | 0.039 ± 0.014 | 0.934 ± 0.053 | 0.033 ± 0.012 | 0.927 ± 0.051 | 0.028 ± 0.029 | |
V2CC (Rickmann et al., 2023) | 0.032 ± 0.009 | 0.958 ± 0.016 | 0.028 ± 0.009 | 0.945 ± 0.018 | 0.021 ± 0.009 | |
CF++ (Santa Cruz et al., 2022)a | 0.046 ± 0.017 | – | 0.046 ± 0.017 | – | 0.034 ± 0.027 | |
TopoFit (Hoopes et al., 2022) | 0.049 ± 0.018 | 0.912 ± 0.050 | 0.050 ± 0.020 | 0.903 ± 0.050 | 0.031 ± 0.014 | |
FS-Long (Reuter et al., 2012) | 0.032 ± 0.013 | 0.968 ± 0.050 | 0.031 ± 0.015 | 0.953 ± 0.050 | 0.032 ± 0.019 | |
OASIS | V2C-Long | 0.021 ± 0.007 | 0.963 ± 0.016 | 0.013 ± 0.005 | 0.953 ± 0.019 | 0.016 ± 0.007 |
V2C-Flow (Bongratz et al., 2024) | 0.064 ± 0.021 | 0.911 ± 0.020 | 0.050 ± 0.017 | 0.904 ± 0.022 | 0.036 ± 0.015 | |
V2C-F./FS-Reg | 0.038 ± 0.012 | 0.929 ± 0.018 | 0.033 ± 0.011 | 0.917 ± 0.020 | 0.025 ± 0.011 | |
V2CC (Rickmann et al., 2023) | 0.035 ± 0.010 | 0.947 ± 0.017 | 0.034 ± 0.011 | 0.929 ± 0.021 | 0.023 ± 0.010 | |
CF++ (Santa Cruz et al., 2022)a | 0.050 ± 0.018 | – | 0.051 ± 0.019 | – | 0.037 ± 0.018 | |
TopoFit (Hoopes et al., 2022) | 0.060 ± 0.019 | 0.885 ± 0.023 | 0.067 ± 0.024 | 0.873 ± 0.025 | 0.036 ± 0.014 | |
FS-Long (Reuter et al., 2012) | 0.033 ± 0.013 | 0.964 ± 0.017 | 0.040 ± 0.020 | 0.941 ± 0.021 | 0.030 ± 0.013 |
. | . | WM surface . | Pial surface . | Both . | ||
---|---|---|---|---|---|---|
. | Model . | MCVar . | ParcF1 . | MCVar . | ParcF1 . | CThVar . |
ADNI | V2C-Long | 0.020 ± 0.020 | 0.971 ± 0.015 | 0.012 ± 0.015 | 0.966 ± 0.017 | 0.016 ± 0.010 |
V2C-Flow (Bongratz et al., 2024) | 0.060 ± 0.023 | 0.924 ± 0.030 | 0.047 ± 0.018 | 0.920 ± 0.028 | 0.038 ± 0.027 | |
V2C-F./FS-Reg | 0.039 ± 0.014 | 0.934 ± 0.053 | 0.033 ± 0.012 | 0.927 ± 0.051 | 0.028 ± 0.029 | |
V2CC (Rickmann et al., 2023) | 0.032 ± 0.009 | 0.958 ± 0.016 | 0.028 ± 0.009 | 0.945 ± 0.018 | 0.021 ± 0.009 | |
CF++ (Santa Cruz et al., 2022)a | 0.046 ± 0.017 | – | 0.046 ± 0.017 | – | 0.034 ± 0.027 | |
TopoFit (Hoopes et al., 2022) | 0.049 ± 0.018 | 0.912 ± 0.050 | 0.050 ± 0.020 | 0.903 ± 0.050 | 0.031 ± 0.014 | |
FS-Long (Reuter et al., 2012) | 0.032 ± 0.013 | 0.968 ± 0.050 | 0.031 ± 0.015 | 0.953 ± 0.050 | 0.032 ± 0.019 | |
OASIS | V2C-Long | 0.021 ± 0.007 | 0.963 ± 0.016 | 0.013 ± 0.005 | 0.953 ± 0.019 | 0.016 ± 0.007 |
V2C-Flow (Bongratz et al., 2024) | 0.064 ± 0.021 | 0.911 ± 0.020 | 0.050 ± 0.017 | 0.904 ± 0.022 | 0.036 ± 0.015 | |
V2C-F./FS-Reg | 0.038 ± 0.012 | 0.929 ± 0.018 | 0.033 ± 0.011 | 0.917 ± 0.020 | 0.025 ± 0.011 | |
V2CC (Rickmann et al., 2023) | 0.035 ± 0.010 | 0.947 ± 0.017 | 0.034 ± 0.011 | 0.929 ± 0.021 | 0.023 ± 0.010 | |
CF++ (Santa Cruz et al., 2022)a | 0.050 ± 0.018 | – | 0.051 ± 0.019 | – | 0.037 ± 0.018 | |
TopoFit (Hoopes et al., 2022) | 0.060 ± 0.019 | 0.885 ± 0.023 | 0.067 ± 0.024 | 0.873 ± 0.025 | 0.036 ± 0.014 | |
FS-Long (Reuter et al., 2012) | 0.033 ± 0.013 | 0.964 ± 0.017 | 0.040 ± 0.020 | 0.941 ± 0.021 | 0.030 ± 0.013 |
Values are mean ± SD over all subjects in the respective dataset and the best results are highlighted.
Computation of the ParcF1 score is not possible for CF++ due to the custom template for which no atlas is available.
Figure 3 shows the longitudinal consistency for individual Destrieux atlas regions. In the polar chart, each angle corresponds to a certain region, and the radial distance represents its F1 score, that is, the longitudinal consistency. For V2C-Long, we plot the region-wise consistency on inflated brain surfaces. We connect three of them, that is, sulcus intermedius primus of Jensen (S_interm_prim-Jensen), posterior-ventral cingulate gyrus (G_cingul-Post-ventral), and short insular gyri (G_insular_short) to the corresponding kinks in the polar chart. These regions are either comparably small (S_interm_prim-Jensen), located in the medial area of the cortex (G_cingul-Post-ventral), or deep in the insular cortex (G_insular_short), which makes them particularly difficult to reconstruct for all implemented methods. The polar chart further reveals that the consistency in V2C-Long is superior to all other methods for most (62/75) of the sulcal and gyral regions, followed by FS-Long (13/75).
Region-wise consistency of reconstructed surfaces based on the Destrieux atlas (mean of WM and pial surfaces, ADNI test set). (a) Lateral view of per-region ParcF1 scores (higher is better) from V2C-Long. (b) Comparison of ParcF1 scores for all 75 regions from different methods. (c) Medial view of per-region ParcF1 scores from V2C-Long. We also indicate the location and name of three regions in the atlas and the polar chart. A list of all Destrieux regions in the order of the plot (counter-clockwise) is in the Supplementary Material. For plotting, we used Pyvista (v0.35.2) and Plotly (v5.22.0).
Region-wise consistency of reconstructed surfaces based on the Destrieux atlas (mean of WM and pial surfaces, ADNI test set). (a) Lateral view of per-region ParcF1 scores (higher is better) from V2C-Long. (b) Comparison of ParcF1 scores for all 75 regions from different methods. (c) Medial view of per-region ParcF1 scores from V2C-Long. We also indicate the location and name of three regions in the atlas and the polar chart. A list of all Destrieux regions in the order of the plot (counter-clockwise) is in the Supplementary Material. For plotting, we used Pyvista (v0.35.2) and Plotly (v5.22.0).
In Figure 4, we plot the MCVar per vertex for CF++, V2CC, FS-Long, and V2C-Long in average over the ADNI test set. Note that FS-Long curvature measures need to be registered and re-sampled for vertex-wise comparison across subjects. All other methods directly permit averaging per-vertex values on the group level. We observe that V2C-Long achieves high consistency (i.e., low MCVar) across the entire cortex and prevents inconsistencies in the occipital lobe produced by CF++ and FS-Long. V2CC and FS-Long further have considerable longitudinal inconsistencies in pre- and postcentral gyri (WM surface), again alleviated by V2C-Long. Inconsistencies around the cingulate cortex and the medial wall (labeled as “unknown” in the Desikan-Killany atlas (Desikan et al., 2006)), stretching in parts into parahippocampal, entorhinal, and insular areas, are highlighted in all methods. Yet, they are less pronounced in V2C-Long, especially on the pial surface.
Vertex-wise longitudinal mean curvature variance (MCVar, lower is better) for CF++, V2CC, FS-Long, and V2C-Long. We show the mean over all subjects in our ADNI test set from lateral and medial views. We use the same exemplary individual anatomy for each plot to avoid visual differences due to the different templates in CF++ and the other methods.
Vertex-wise longitudinal mean curvature variance (MCVar, lower is better) for CF++, V2CC, FS-Long, and V2C-Long. We show the mean over all subjects in our ADNI test set from lateral and medial views. We use the same exemplary individual anatomy for each plot to avoid visual differences due to the different templates in CF++ and the other methods.
Figure 5 shows superimposed and color-coded pial surfaces from two visits of an individual from our ADNI test set. We recognize that, qualitatively, FS-Long and V2C-Long are the only methods that match individual triangles from the two different visits. The cross-sectional methods (TopoFit, CF++, V2C-Flow, and V2CC), on the other hand, failed to provide qualitatively consistent meshes across time.
Cortical surfaces from longitudinal methods (FS-Long and V2C-Long) are consistent across time—a property fulfilled by none of the recent cross-sectional methods (TopoFit, CF++, V2C-Flow, V2CC). We depict a right-hemisphere region of reconstructed pial surfaces from two visits (red and blue, respectively) of a subject in our ADNI test set. Ideally, colored triangles match.
Cortical surfaces from longitudinal methods (FS-Long and V2C-Long) are consistent across time—a property fulfilled by none of the recent cross-sectional methods (TopoFit, CF++, V2C-Flow, V2CC). We depict a right-hemisphere region of reconstructed pial surfaces from two visits (red and blue, respectively) of a subject in our ADNI test set. Ideally, colored triangles match.
4.3 Test-retest reliability
In Table 2, we validate the in-session test-retest reliability, that is, the consistency of reconstructions from images acquired within a short time and in a highly consistent setting, using the TRT dataset. In contrast to the evaluation based on ADNI and OASIS in the previous section, the number of subjects in the TRT dataset is much smaller (), and the number of visits is much higher (40 per subject). Here, we limit the comparison to the best methods from Table 1, that is, V2C-Long, V2CC, and FS-Long. The results are largely consistent with the observations on ADNI and OASIS, despite the different nature of the TRT data. V2C-Long achieves the best scores in all metrics except for the parcellation consistency (ParcF1) on WM surfaces, where FS-Long is again superior by 0.001. The largest improvement is achieved for the curvature variance (MCVar) on WM surfaces, where V2C-Long reduces the inconsistencies by half compared to V2CC and FS-Long.
Test-retest reliability of V2C-Long compared to V2CC and FS-Long.
. | . | WM surface . | Pial surface . | Both . | ||
---|---|---|---|---|---|---|
. | Method . | MCVar . | ParcF1 . | MCVar . | ParcF1 . | CThVar . |
TRT | V2C-Long | 0.020 ± 0.002 | 0.982 ± 0.001 | 0.017 ± 0.001 | 0.966 ± 0.003 | 0.020 ± 0.001 |
V2CC (Rickmann et al., 2023) | 0.040 ± 0.006 | 0.966 ± 0.004 | 0.043 ± 0.008 | 0.947 ± 0.002 | 0.031 ± 0.003 | |
FS-Long (Reuter et al., 2012) | 0.041 ± 0.004 | 0.983 ± 0.001 | 0.123 ± 0.023 | 0.947 ± 0.004 | 0.036 ± 0.000 |
. | . | WM surface . | Pial surface . | Both . | ||
---|---|---|---|---|---|---|
. | Method . | MCVar . | ParcF1 . | MCVar . | ParcF1 . | CThVar . |
TRT | V2C-Long | 0.020 ± 0.002 | 0.982 ± 0.001 | 0.017 ± 0.001 | 0.966 ± 0.003 | 0.020 ± 0.001 |
V2CC (Rickmann et al., 2023) | 0.040 ± 0.006 | 0.966 ± 0.004 | 0.043 ± 0.008 | 0.947 ± 0.002 | 0.031 ± 0.003 | |
FS-Long (Reuter et al., 2012) | 0.041 ± 0.004 | 0.983 ± 0.001 | 0.123 ± 0.023 | 0.947 ± 0.004 | 0.036 ± 0.000 |
Values are mean ± SD over the three subjects in the TRT dataset. The best results are highlighted.
4.4 Reconstruction accuracy and diagnostic differentiation
In Table 3, we report the reconstruction accuracy of all implemented methods. We compute the ASSD, , and to cross-sectional FreeSurfer surfaces, the silver standard in the field (Hoopes et al., 2022; Lebrat et al., 2021; Ma, Li, Robinson, et al., 2023; Santa Cruz et al., 2021). These results are based on our ADNI and OASIS test sets, separated by WM and pial surfaces. All methods achieve excellent accuracy with an average mostly below 0.5 mm, that is, half the image resolution, and an ASSD of 0.2 mm and below. Although the margins are comparably small, V2C-Long still yields the highest accuracy among the deep learning-based methods in all metrics on both datasets, except for the 99-percentile Hausdorff distance, where CF++ is superior. Notably, the spherical registration reduces the accuracy of V2C-Flow surfaces, for example, by 0.017 mm (ASSD) for WM surfaces on ADNI. We also include FS-Long in Table 3 for completeness. However, FS-Long intrinsically uses FreeSurfer methods (not just for training but also to process test scans); hence, the comparison to learning-based methods is not entirely fair.
Reconstruction metrics by surface and method for the ADNI and OASIS test sets.
. | . | WM surface . | Pial surface . | ||||
---|---|---|---|---|---|---|---|
. | Model . | ASSD . | . | . | ASSD . | . | . |
ADNI | V2C-Long | 0.177 ± 0.161 | 0.410 ± 0.756 | 1.115 ± 1.329 | 0.174 ± 0.161 | 0.409 ± 0.731 | 1.374 ± 1.308 |
V2C-Flow | 0.186 ± 0.116 | 0.427 ± 0.644 | 1.168 ± 1.429 | 0.180 ± 0.109 | 0.419 ± 0.566 | 1.392 ± 1.342 | |
V2C-F./FS-Reg | 0.203 ± 0.137 | 0.458 ± 0.705 | 1.214 ± 1.298 | 0.204 ± 0.123 | 0.471 ± 0.659 | 1.520 ± 1.237 | |
V2CC | 0.220 ± 0.036 | 0.492 ± 0.089 | 1.469 ± 0.465 | 0.243 ± 0.040 | 0.558 ± 0.104 | 1.798 ± 0.431 | |
CF++ | 0.214 ± 0.140 | 0.493 ± 0.812 | 1.262 ± 1.467 | 0.191 ± 0.129 | 0.445 ± 0.802 | 1.343 ± 1.419 | |
TopoFit | 0.194 ± 0.033 | 0.440 ± 0.088 | 1.252 ± 0.389 | 0.211 ± 0.038 | 0.459 ± 0.087 | 1.501 ± 0.408 | |
FS-Long | 0.151 ± 0.089 | 0.317 ± 0.187 | 0.970 ± 0.517 | 0.145 ± 0.078 | 0.300 ± 0.221 | 1.134 ± 0.550 | |
OASIS | V2C-Long | 0.176 ± 0.023 | 0.403 ± 0.053 | 1.171 ± 0.363 | 0.186 ± 0.025 | 0.430 ± 0.065 | 1.450 ± 0.341 |
V2C-Flow | 0.184 ± 0.024 | 0.419 ± 0.055 | 1.212 ± 0.359 | 0.196 ± 0.024 | 0.452 ± 0.062 | 1.494 ± 0.338 | |
V2C-F./FS-Reg | 0.205 ± 0.036 | 0.440 ± 0.073 | 1.270 ± 0.492 | 0.214 ± 0.024 | 0.482 ± 0.066 | 1.573 ± 0.380 | |
V2CC | 0.222 ± 0.036 | 0.506 ± 0.082 | 1.542 ± 0.460 | 0.282 ± 0.044 | 0.654 ± 0.116 | 2.009 ± 0.443 | |
CF++ | 0.226 ± 0.032 | 0.522 ± 0.078 | 1.463 ± 0.592 | 0.210 ± 0.041 | 0.473 ± 0.078 | 1.401 ± 0.359 | |
TopoFit | 0.197 ± 0.026 | 0.447 ± 0.065 | 1.302 ± 0.370 | 0.228 ± 0.036 | 0.502 ± 0.086 | 1.597 ± 0.368 | |
FS-Long | 0.126 ± 0.024 | 0.264 ± 0.053 | 0.917 ± 0.459 | 0.132 ± 0.025 | 0.274 ± 0.055 | 1.120 ± 0.367 |
. | . | WM surface . | Pial surface . | ||||
---|---|---|---|---|---|---|---|
. | Model . | ASSD . | . | . | ASSD . | . | . |
ADNI | V2C-Long | 0.177 ± 0.161 | 0.410 ± 0.756 | 1.115 ± 1.329 | 0.174 ± 0.161 | 0.409 ± 0.731 | 1.374 ± 1.308 |
V2C-Flow | 0.186 ± 0.116 | 0.427 ± 0.644 | 1.168 ± 1.429 | 0.180 ± 0.109 | 0.419 ± 0.566 | 1.392 ± 1.342 | |
V2C-F./FS-Reg | 0.203 ± 0.137 | 0.458 ± 0.705 | 1.214 ± 1.298 | 0.204 ± 0.123 | 0.471 ± 0.659 | 1.520 ± 1.237 | |
V2CC | 0.220 ± 0.036 | 0.492 ± 0.089 | 1.469 ± 0.465 | 0.243 ± 0.040 | 0.558 ± 0.104 | 1.798 ± 0.431 | |
CF++ | 0.214 ± 0.140 | 0.493 ± 0.812 | 1.262 ± 1.467 | 0.191 ± 0.129 | 0.445 ± 0.802 | 1.343 ± 1.419 | |
TopoFit | 0.194 ± 0.033 | 0.440 ± 0.088 | 1.252 ± 0.389 | 0.211 ± 0.038 | 0.459 ± 0.087 | 1.501 ± 0.408 | |
FS-Long | 0.151 ± 0.089 | 0.317 ± 0.187 | 0.970 ± 0.517 | 0.145 ± 0.078 | 0.300 ± 0.221 | 1.134 ± 0.550 | |
OASIS | V2C-Long | 0.176 ± 0.023 | 0.403 ± 0.053 | 1.171 ± 0.363 | 0.186 ± 0.025 | 0.430 ± 0.065 | 1.450 ± 0.341 |
V2C-Flow | 0.184 ± 0.024 | 0.419 ± 0.055 | 1.212 ± 0.359 | 0.196 ± 0.024 | 0.452 ± 0.062 | 1.494 ± 0.338 | |
V2C-F./FS-Reg | 0.205 ± 0.036 | 0.440 ± 0.073 | 1.270 ± 0.492 | 0.214 ± 0.024 | 0.482 ± 0.066 | 1.573 ± 0.380 | |
V2CC | 0.222 ± 0.036 | 0.506 ± 0.082 | 1.542 ± 0.460 | 0.282 ± 0.044 | 0.654 ± 0.116 | 2.009 ± 0.443 | |
CF++ | 0.226 ± 0.032 | 0.522 ± 0.078 | 1.463 ± 0.592 | 0.210 ± 0.041 | 0.473 ± 0.078 | 1.401 ± 0.359 | |
TopoFit | 0.197 ± 0.026 | 0.447 ± 0.065 | 1.302 ± 0.370 | 0.228 ± 0.036 | 0.502 ± 0.086 | 1.597 ± 0.368 | |
FS-Long | 0.126 ± 0.024 | 0.264 ± 0.053 | 0.917 ± 0.459 | 0.132 ± 0.025 | 0.274 ± 0.055 | 1.120 ± 0.367 |
Values are mean ± SD in mm over all scans in the respective dataset, and the best non-FreeSurfer results are highlighted.
For a visual comparison of V2C-Long and FS-Long, we show cortical reconstructions from two visits of a subject in the ADNI test set in Figure 6 (see Supplementary Fig. S3 for qualitative results from all evaluated methods). The highlighted regions in Figure 6 show anatomically implausible reconstructions by FS-Long in both visits and the within-subject template. In contrast, V2C-Long circumvents the fissures in the within-subject template and yields visually smooth and consistent surfaces at both visits.
Qualitative comparison of reconstructed pial surfaces of the right hemisphere from a subject of our ADNI test set with two visits. The top row shows the output from FS-Long (v7.2), and the bottom row shows the output from V2C-Long.
Qualitative comparison of reconstructed pial surfaces of the right hemisphere from a subject of our ADNI test set with two visits. The top row shows the output from FS-Long (v7.2), and the bottom row shows the output from V2C-Long.
To provide an additional, FreeSurfer-independent assessment of the reconstruction, we evaluate the accuracy of V2C-Long, V2CC, and FS-Long to differentiate between diagnostic groups, that is, Alzheimer’s disease and cognitively normal, in our ADNI test set based on derived cortical thickness measures. This analysis is grounded in the known association of Alzheimer’s disease with cortical atrophy (Schwarz et al., 2016). From the cortical thickness obtained from the respective method, we computed vertex-wise Z-scores (Tahedl, 2020), indicating the deviation from the norm in an age-matched, cognitively normal reference cohort (stratified by 10-year age brackets). For the estimation of the mean and standard deviation in each age cohort, we considered only initial scans to avoid bias toward individuals with many scans. Using the average Z-score across the cortex (excluding the unknown medial region), we calculated the Area Under the Curve (AUC) to classify the two groups. V2C-Long achieves the highest diagnostic accuracy from cortical thickness measures with an AUC of 0.83, followed by V2CC with an AUC of 0.81 and FS-Long with an AUC of 0.77.
4.5 Ablation study
In Table 4, we validate the design choices made in V2C-Long. In particular, we compare the realization of each of V2C-Long’s building blocks, that is, the within-subject template creation and the within-subject template deformation, to alternative approaches. For the within-subject template creation, we consider approaches that are based on the FsAverage template as alternatives to V2C-Flow, that is, V2CC and TopoFit, as well as FreeSurfer’s base images (FS-Base). For the within-subject template deformation, we ablate the V2C-Flow model and replace it with CorticalFlow (CF) deformation blocks. As this requires substantial modification of the CF++ method, we refer to it as “CF blocks” instead of “CF++” in Table 4. We did not employ V2CC and TopoFit as the within-subject template-deformation model as their training relies on cross-sectional ground-truth correspondences, which cannot be easily adapted for the longitudinal template deformation step.
Evaluation of different realizations of the within-subject template creation, aggregation, and deformation, as well as the vertex feature (VF)-enhancement.
Within-sub. template creation . | Within-sub. template aggregation . | Within-sub. template deformation . | VF . | WM surface . | Pial surface . | Both . | |||
---|---|---|---|---|---|---|---|---|---|
MCVar . | %SIF . | MCVar . | %SIF . | CThVar . | . | ||||
FS-Base | Median image | V2C-Flow | .021 ± .008 | 1.062 ± 0.334 | .013 ± .005 | 3.070 ± 1.035 | .017 ± .010 | .398 ± .719 | |
V2C-Flow | Median mesh | V2C-Flow | .035 ± .013 | 1.330 ± 0.867 | .022 ± .008 | 2.810 ± 1.121 | .019 ± .009 | .411 ± .313 | |
TopoFit | Mean mesh | V2C-Flow | .027 ± .011 | 0.260 ± 1.666 | .022 ± .010 | 2.324 ± 1.824 | .020 ± .009 | .447 ± .255 | |
V2CC | Mean mesh | V2C-Flow | .024 ± .009 | 0.344 ± 0.412 | .016 ± .006 | 2.526 ± 1.100 | .019 ± .008 | .392 ± .078 | |
V2C-Flow | Mean mesh | V2C-Flow | .022 ± .027 | 0.649 ± 0.439 | .013 ± .020 | 2.323 ± 0.926 | .018 ± .012 | .401 ± .623 | |
V2C-Flow | Mean mesh | CF blocks | .036 ± .013 | 0.659 ± 0.275 | .024 ± .010 | 2.076 ± 0.823 | .020 ± .010 | .478 ± .258 | |
V2C-Flow | Mean mesh | V2C-Flow | ✓ | .020 ± .020 | 0.643 ± 0.375 | .012 ± .015 | 2.151 ± 0.921 | .016 ± .010 | .410 ± .743 |
Within-sub. template creation . | Within-sub. template aggregation . | Within-sub. template deformation . | VF . | WM surface . | Pial surface . | Both . | |||
---|---|---|---|---|---|---|---|---|---|
MCVar . | %SIF . | MCVar . | %SIF . | CThVar . | . | ||||
FS-Base | Median image | V2C-Flow | .021 ± .008 | 1.062 ± 0.334 | .013 ± .005 | 3.070 ± 1.035 | .017 ± .010 | .398 ± .719 | |
V2C-Flow | Median mesh | V2C-Flow | .035 ± .013 | 1.330 ± 0.867 | .022 ± .008 | 2.810 ± 1.121 | .019 ± .009 | .411 ± .313 | |
TopoFit | Mean mesh | V2C-Flow | .027 ± .011 | 0.260 ± 1.666 | .022 ± .010 | 2.324 ± 1.824 | .020 ± .009 | .447 ± .255 | |
V2CC | Mean mesh | V2C-Flow | .024 ± .009 | 0.344 ± 0.412 | .016 ± .006 | 2.526 ± 1.100 | .019 ± .008 | .392 ± .078 | |
V2C-Flow | Mean mesh | V2C-Flow | .022 ± .027 | 0.649 ± 0.439 | .013 ± .020 | 2.323 ± 0.926 | .018 ± .012 | .401 ± .623 | |
V2C-Flow | Mean mesh | CF blocks | .036 ± .013 | 0.659 ± 0.275 | .024 ± .010 | 2.076 ± 0.823 | .020 ± .010 | .478 ± .258 | |
V2C-Flow | Mean mesh | V2C-Flow | ✓ | .020 ± .020 | 0.643 ± 0.375 | .012 ± .015 | 2.151 ± 0.921 | .016 ± .010 | .410 ± .743 |
We report mean ± SD values over all subjects (MCVar, CThVar), respectively all scans (SIF, ), in the ADNI test set. The best values are highlighted.
4.5.1 Within-subject template creation and aggregation
Compared to FreeSurfer’s base images (FS-Base), we found the mean aggregation of the V2C-Flow meshes to reduce the number of self-intersecting faces in the output surfaces. Similarly, mimicking FreeSurfer’s median aggregation in mesh space was detrimental in our ablation experiments as it considerably increased MCVar and SIF measures. In terms of run time, creating FreeSurfer’s base template (recon-all -base) takes around 4 hours on an Intel i7 CPU running at 3.60GHz for five visits. In V2C-Long, we obtain the within-subject template from the same number of visits within less than 10 seconds using a recent Nvidia A100 GPU, requiring around 8.5GB of VRAM at inference time. If no GPU is available, V2C-Long can still be run on the CPU. This takes around 1 minute per visit on an Intel i7 at 3.60 GHz. Replacing the V2C-Flow template-creation model with V2CC improved the reconstruction accuracy by around 0.01 mm at the cost of slightly less consistent surfaces in our experiments. The TopoFit within-subject template-creation model yielded the lowest number of self-intersecting faces in WM surfaces, but it did not improve over the V2C-Flow model in the remaining metrics. Unfortunately, the current limitations of GPU VRAM impede the use of vertex features from all four/seven graph deformation blocks in V2CC/TopoFit.
4.5.2 Within-subject template deformation
Replacing the V2C-Flow deformation model in the second step of V2C-Long with comparable CF blocks led to a slight reduction of the SIF score on the pial surfaces; however, it resulted in worse outcomes across all other metrics in Table 4. Finally, we found the enhancement of within-subject templates with vertex features (VF) in V2C-Long to have a slightly positive impact on the consistency of WM and pial surfaces and cortical thickness measures.
4.6 Longitudinal cortical thickness in Alzheimer’s disease
In Figure 7 (see Supplementary Fig. S1 for the medial view), we compare the pattern of longitudinal cortical thickness in patients affected by Alzheimer’s disease (AD) with a cognitively normal control group. To this end, we select AD subjects with a stable diagnosis and healthy controls from our ADNI test set and perform a mass-univariate analysis of longitudinal cortical thickness (CTh) measurements with the vertex-wise LME regression model described in Section 3.7. We plot uncorrected two-tailed p-values of the t-statistics of the regression effect , that is, the diagnosis. We compare the atrophy patterns among the three methods with the best longitudinal consistency (cf. Table 1), that is, FS-Long, V2CC, and V2C-Long. All three methods highlight similar regions in the temporal, parietal, and frontal lobes, as well as the precuneus and the posterior cingulate cortex. Exceptions are the isthmus cingulate and parts of the orbitofrontal cortex, where V2C-Long does not indicate group differences based on in contrast to the other two methods. V2C-Long, on the other hand, detects generally larger areas in the frontal, temporal, and parietal lobes, that is, lower p-values, providing stronger evidence for group differences.
Differences in longitudinal cortical thickness between stable AD subjects () and healthy controls (). We show uncorrected negative log10(p-value)-maps on FsAverage—based on cortical surfaces from FS-Long (v7.2), V2CC, and V2C-Long.
Differences in longitudinal cortical thickness between stable AD subjects () and healthy controls (). We show uncorrected negative log10(p-value)-maps on FsAverage—based on cortical surfaces from FS-Long (v7.2), V2CC, and V2C-Long.
5 Discussion
This work introduced V2C-Long for longitudinal cortical surface reconstruction with spatiotemporal correspondence of surface points. The developed model yielded highly consistent results across visits and high accuracy across datasets, indicating reliable and robust reconstruction performance.
5.1 Longitudinal reconstruction of cortical surfaces
Achieving consistent cortex reconstruction from MRI, where individual vertices are placed at precise anatomical locations, is challenging due to the complex folding patterns, the variability across subjects, and the limiting image resolution. We demonstrated that V2C-Long exhibits a compelling longitudinal consistency (cf. Section 4.2) as determined by quantitative scores and qualitative inspection of reconstructed surfaces on internal (ADNI) and external (OASIS) study data. The highest region-wise inconsistencies occurred for all methods in small and thin regions in the cingulate and insular cortex (cf. Fig. 3), which is understandable as inconsistencies at the parcellation boundaries have a larger impact in these cases relative to the region size. On the vertex level, we observed considerable inconsistencies in longitudinal mean curvature in CF++ and FS-Long in the occipital lobe, especially in pial surfaces (cf. Fig. 4). V2C-Long, on the other hand, exhibited very low longitudinal mean curvature variance (MCVar) across the entire cortex, with the largest inconsistencies again occurring around the cingulate cortex and the insula. The evaluation of test-retest reliability (cf. Section 4.3), which can be seen as short-term longitudinal consistency, led to similar results as the evaluation on ADNI and OASIS. In this regard, V2C-Long also improved over the best available reconstruction methods. At the same time, V2C-Long does not sacrifice accuracy; the model is at the forefront of all implemented deep learning-based architectures (cf. Section 4.4). The performance metrics of all neural networks on OASIS, of which no data have been used during training, were principally consistent with ADNI. This indicates good generalization of trained models across studies. For comparison, we implemented recent template-based cortex reconstruction methods with different neural network architectures (CNN: CF++; CNN & GNN: TopoFit, V2CC, V2C-Flow) and loss functions (L1: V2CC; Chamfer: CF++; restricted Chamfer: TopoFit; curvature-weighted Chamfer: V2C-Flow). These methods, however, were designed for cross-sectional data, and our experiments revealed that their longitudinal consistency is not competitive with dedicated longitudinal methods such as FS-Long. The only cross-sectional method approaching the performance of FS-Long is V2CC, which has been trained in a supervised manner to create correspondences based on previously registered and re-sampled FreeSurfer surfaces. Nonetheless, V2C-Long outperformed FS-Long and V2CC without requiring any surface-based pre- or post-processing. In addition, the learning-based approach makes V2C-Long more versatile than FS-Long. The model could be re-trained or fine-tuned for particular scenarios, for example, higher magnetic field strengths above 3T.
5.2 Creation of within-subject templates
From comparing V2C-Long with its direct architectural baseline, V2C-Flow, we deduce that creating within-subject templates is crucial for first-rate longitudinal consistency. Albeit within-subject templates are also used by FS-Long to initialize the reconstruction at the individual visits, V2C-Long computes the within-subject templates directly in mesh space. Eventually, this leads to a speed-up in the within-subject template creation from hours to seconds compared to the group-wise registration in FS-Long (cf. Section 4.5). In Figure 6, we observed that the mesh-based aggregation prevents implausible anatomical errors in the shown case, that is, fissures in the pial cortical surface. These artifacts likely result from converting a defective segmentation into surface meshes, which is circumvented entirely in V2C-Long since the sphere-like topology is already engraved into the input template. Supported by Figure 6, Supplementary Figure S3, and the work by Lebrat et al. (2023), we have good evidence that the geometric template-deformation in V2C-Long is robust and yields plausible results even for challenging input images. V2C-Long benefits from the template deformation not only in the ultimate reconstruction but also during the within-subject template creation.
5.3 Architectural design choices and training setup
Compared to the improvement over existing methods, we found architectural design choices to have a minor impact on the reconstruction quality and consistency (cf. Section 4.5). Yet, the best results were obtained by enhancing the within-subject template with deep vertex features and mean mesh aggregation. The GNN allows us to process information related to individual vertices and augment the input templates beyond bare vertex coordinates, which is impossible in a purely CNN-based approach. Aggregating the meshes via a median operation similar to FreeSurfer’s median image (Reuter et al., 2012) or employing FS-Base as a starting point for the longitudinal reconstruction is technically feasible but not recommended as we observed a dramatic increase in self-intersections in reconstructed WM surfaces. Our implementation allows for joint training and prediction of all four cortical surfaces simultaneously on a single GPU (cf. Supplementary Table S1). In contrast, TopoFit and CF++ train a single model for each cortical surface, complicating their practical usability. In addition, the joint reconstruction permits the incorporation of anatomical constraints between WM and pial surfaces, which proved beneficial in avoiding their intersection (Bongratz et al., 2024).
5.4 Longitudinal group analyses
The association of Alzheimer’s disease and cortical thickness has been well studied in the literature (Du et al., 2006; Risacher et al., 2010; Schwarz et al., 2016; Singh et al., 2006). To conduct these studies, extracting cortical thickness measures with FreeSurfer is a prominent, if not the most prevalent, approach. Our experiments replicated these results for the first time in a longitudinal setting using deep-learning methods (cf. Section 4.6). Overall, the regions highlighted by FS-Long, V2CC, and V2C-Long (cf. Fig. 7 and Supplementary Fig. S1) based on are consistent with existing research on cortical changes in Alzheimer’s disease, which reported a broad pattern of cortical atrophy most significant in the temporal lobe, the temporoparietal junction, the posterior cingulate, and the precuneus (Du et al., 2006). Nevertheless, we found slight differences between the three methods. V2CC and FS-Long highlighted group differences in the isthmus of the cingulate gyrus and the orbitofrontal cortex, which V2C-Long did not feature. Instead, V2C-Long carved out more significant evidence for atrophy in large parts of the frontal, temporal, and parietal lobes than the other two methods. The better diagnostic accuracy in V2C-Long (AUC 0.83) compared to V2CC (AUC 0.81) and FS-Long (AUC 0.77), cf. Section 4.4, further underscores the capability of V2C-Long to distinguish between AD patients and cognitively normal controls based on estimated cortical thickness. In contrast to FS-Long and V2CC, V2C-Long requires no spherical inflation and re-sampling for such vertex-based longitudinal group analyses (required as a post-processing step in FS-Long and as pre-processing in V2CC). Instead, we conducted all analyses with the raw output surfaces of V2C-Long, which reduces potential sources of error and computation time to a minimum.
5.5 Correspondences and registration
With V2C-Long, we obtain spatiotemporal vertex correspondences across subjects (cross-sectional) and visits (longitudinal) that match anatomical locations on the FsAverage template. The cross-sectional correspondences emerge from the shape alignment with the curvature-weighted Chamfer loss, cf. Supplementary Figures S5 and S6, which we found to be sufficient for the experiments conducted in this paper. However, such unsupervised alignment is not equivalent to an explicit curvature-based registration. For specific applications, such as atlas-based parcellation, V2C-Long might still benefit from a subsequent registration step. Based on V2C-Long’s inherent correspondence, the template icosphere can directly be used for this purpose, without the need for spherical inflation, similar to the parcellation in V2C-Flow (Bongratz et al., 2024). Nevertheless, we found that post hoc registration tends to reduce the accuracy of reconstructed surfaces (see Table 3), likely due to the interpolation of vertex coordinates. Additionally, we demonstrated in Table 1 that V2C-Long’s longitudinal correspondences, which were the focus of this work, are significantly superior to those obtained through traditional surface registration.
5.6 Limitations
As with all supervised learning methods, V2C-Long relies on the availability and correctness of ground-truth datasets. Compared to most other applications, however, obtaining manual cortical surface meshes by human experts is infeasible. As it has become a standard in the field, we used cross-sectional FreeSurfer as a reference for model training and evaluation of reconstruction accuracy (Lebrat et al., 2021; Ma, Li, Robinson, et al., 2023; Santa Cruz et al., 2021). Although we tried to remove scans with severe processing artifacts, we cannot preclude that errors made by FreeSurfer affect our models and confound the evaluation presented in this paper. The consistency assessment, however, requires no reference standard and provides an additional perspective usually not reported in prior studies. Our training set comprises more than 3.7 K T1w MRIs from cognitively normal subjects as well as subjects diagnosed with Alzheimer’s disease and mild cognitive impairment, making it the largest training set in the literature for cortex reconstruction models. Still, future work should investigate the performance under other neurodegenerative conditions or brain tumors. Self-intersections could potentially also confound downstream applications. We did not find self-intersections in V2C-Long obstructive, as we achieved excellent results in benchmark experiments and downstream applications with the raw output. To guarantee flawless compatibility with existing neuroimaging tools, however, we will work toward reducing the number of self-intersecting faces in V2C-Long, especially on the pial surface. Nevertheless, we believe that our model can already be used for a broad range of applications as is, and particular failure cases can be addressed via re-training or fine-tuning.
5.7 Conclusion
In conclusion, we introduced V2C-Long, the first dedicated deep-learning method for longitudinal cortex reconstruction and within-subject template creation. V2C-Long uses deep deformation fields to establish a strong inherent spatiotemporal correspondence between cortical surfaces, rendering them directly comparable without post-processing. Our experiments on two large longitudinal brain MRI studies validated the accuracy and consistency, demonstrating a substantial improvement over previous methods. We provided stronger evidence of longitudinal cortical atrophy in Alzheimer’s disease and higher diagnostic accuracy than FreeSurfer. Our results show the potential for V2C-Long to enhance future longitudinal neuroimage analyses, and the developed model offers researchers a valuable tool to find more subtle associations with brain structure.
Data and Code Availability
This study uses public data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI, http://adni.loni.usc.edu) and from the Open Access Series of Imaging Studies (OASIS-3: Longitudinal Multimodal Neuroimaging, https://sites.wustl.edu/oasisbrains). Access is obtained through the online application forms provided at the linked URLs. Our code is available on GitHub with the link mentioned in the manuscript.
Author Contributions
Fabian Bongratz: Conceptualization, methodology, software, validation, formal analysis, data curation, writing—original draft, writing—review & editing, and visualization. Jan Fecht: Methodology, software, validation, formal analysis, investigation, writing—original draft, and visualization. Anne-Marie Rickmann: Conceptualization, software, data curation, and writing—review & editing. Christian Wachinger: Conceptualization, resources, writing—review & editing, supervision, project administration, and funding acquisition.
Ethics Statement
This work uses data from the Alzheimer’s disease neuroimaging initiative and (ADNI, http://adni.loni.usc.edu) and from the Open Access Series of Imaging Studies (OASIS-3: Longitudinal Multimodal Neuroimaging, https://sites.wustl.edu/oasisbrains). Participants in both studies gave written informed consent in accordance with the study’s ethical guidelines. Data acquisition in ADNI adheres to the Declaration of Helsinki; the study protocol is available at https://adni.loni.usc.edu/wp-content/themes/freshnews-dev-v2/documents/clinical/ADNI-1_Protocol.pdf. OASIS-3 was approved by Washington University’s Institutional Review Board (LaMontagne et al., 2019). Participants in the test-retest database (TRT, https://doi.org/10.6084/m9.figshare.929651) gave written informed consent following the guidelines of the Stanford University Institutional Review Board (Maclaren et al., 2014).
Declaration of Competing Interest
The authors declare that they have no conflict of interest.
Acknowledgments
This research was partially supported by the German Research Foundation. We gratefully acknowledge the computational resources provided by the Leibniz Supercomputing Centre (www.lrz.de). The authors would also like to thank Lennart Bastian for initial fruitful discussions. Data collection and sharing for the Alzheimer’s Disease Neuroimaging Initiative (ADNI) is funded by the National Institute on Aging (National Institutes of Health Grant U19 AG024904). The grantee organization is the Northern California Institute for Research and Education.
Supplementary Materials
Supplementary material for this article is available with the online version here: https://doi.org/10.1162/imag_a_00500.
References
Appendix A
We prove Theorem 1 by induction. Assume we have input scans from a certain subject with visits. Then, the initial value problem
describes the deformation of the generic input template under the aggregated mean flow field across all visits. Further, we have
since in V2C-Flow, which proves our statement for the simple case . For each next forward Euler integration step , , with the induction hypothesis , we obtain for :
We adapted TopoFit for pial surfaces (originally only for WM surfaces).
We used the CF++ template with 140,000 vertices for best comparison to the other methods.