Abstract
A comprehensive understanding of the organizational principles in the human brain requires, among other factors, well-quantifiable descriptors of nerve fiber architecture. Three-dimensional polarized light imaging (3D-PLI) is a microscopic imaging technique that enables insights into the fine-grained organization of myelinated nerve fibers with high resolution. Descriptors characterizing the fiber architecture observed in 3D-PLI would enable downstream analysis tasks such as multimodal correlation studies, clustering, and mapping. However, best practices for observer-independent characterization of fiber architecture in 3D-PLI are not yet available. To this end, we propose the application of a fully data-driven approach to characterize nerve fiber architecture in 3D-PLI images using self-supervised representation learning. We introduce a 3D-Context Contrastive Learning (CL-3D) objective that utilizes the spatial neighborhood of texture examples across histological brain sections of a 3D reconstructed volume to sample positive pairs for contrastive learning. We combine this sampling strategy with specifically designed image augmentations to gain robustness to typical variations in 3D-PLI parameter maps. The approach is demonstrated for the 3D reconstructed occipital lobe of a vervet monkey brain. We show that extracted features are highly sensitive to different configurations of nerve fibers, yet robust to variations between consecutive brain sections arising from histological processing. We demonstrate their practical applicability for retrieving clusters of homogeneous fiber architecture, performing classification with minimal annotations and query-based retrieval of characteristic components of fiber architecture such as U-fibers.
1 Introduction
Decoding the human brain requires analyzing its structural and functional organization at different spatial scales, including cytoarchitecture and fiber architecture at microscopic resolutions (Amunts & Zilles, 2015; Axer & Amunts, 2022). Three-dimensional polarized light imaging (3D-PLI) (Axer, Graessel, et al., 2011) is an imaging technique that reveals the fine-grained configuration and 3D orientation of myelinated nerve fibers in both gray and white matter with micrometer resolution. 3D-PLI thus establishes a link between microscopic myeloarchitecture and dMRI-based structural connectivity at the macro- and mesoscopic scale (Caspers & Axer, 2019; Zilles et al., 2016). 3D-PLI images provide detailed visual information for obtaining maps of fiber architecture at different scales. Based on 3D-PLI images, previous work demonstrated the detection of myelinated pathways and delineation of subfields in the human hippocampus (Zeineh et al., 2017) as well as the identification of fiber tracts and visual areas in the vervet monkey visual system (Takemura et al., 2020).
Polarized light imaging allows processing of whole-brain tissue sections and enables scanning of large tissue stacks (Axer, Gräßel, et al., 2020; Axer, Poupon, et al., 2020; Howard et al., 2023). However, interpretation and analysis of the complex information provided by 3D-PLI require substantial expertise that cannot scale to the vastly increasing amount of data produced by recent high-throughput devices. Moreover, automated large-scale analysis of fiber architecture at the resolution provided by 3D-PLI is challenging due to the complexity and high dimensionality of the data. In order to use data analysis algorithms, a suitable lower dimensional feature representation of 3D-PLI textures is needed. Ideally, features in this representation are highly expressive for different fiber configurations while being robust against other sources of variation, such as histological processing effects and the relative 3D orientation of image patches. Such features, however, are difficult to derive, and we hypothesize that an efficient representation cannot be manually engineered.
Over the last years, deep learning methods have become prevalent in analyzing images in related fields such as histopathology (de Matos et al., 2021), as they are able to learn representations from pure data. While annotations of fiber configurations in 3D-PLI are not yet available at the scale required for supervised deep learning, we do have access to large amounts of unlabeled data. Recent advances in self-supervised representation learning suggest using contrastive learning (Hadsell et al., 2006; van den Oord et al., 2018) to learn distinctive representations from unlabeled training data. The training objective here is to represent similar instances (positive pairs) as close points in the embedding space while pushing dissimilar instances (negative pairs) apart to prevent representational collapse. While other methods to prevent representational collapse have been proposed as well, such as clustering (Caron et al., 2020), distillation (X. Chen & He, 2021; Grill et al., 2020), information maximization (Zbontar et al., 2021), or variance preservation (Bardes et al., 2022), contrastive learning of visual representations by application or adaptation of SimCLR (T. Chen et al., 2020) and MoCo (He et al., 2020) is still popular in medical image analysis (X. Chen et al., 2022; Krishnan et al., 2022). Due to its simplicity, we build on the SimCLR framework (T. Chen et al., 2020). An application of SimCLR in cytoarchitectonic brain mapping was recently performed for histological images (Schiffer et al., 2021). A main challenge they discovered was the tendency of models to focus more on anatomical landmarks than on features descriptive of cytoarchitecture when creating positive pairs based on data augmentations of the same image. To overcome this effect, they employ a supervised contrastive loss (Khosla et al., 2020) by defining positive pairs based on same labels and sample pairs within each brain area.
Several self-supervised learning methods were proposed to learn image representations based on spatial context, which can be used to create correlated views for contrastive learning (T. Chen et al., 2020; van den Oord et al., 2018; Van Gansbeke et al., 2021) or to define pretext tasks (Doersch et al., 2015; Noroozi & Favaro, 2016; Pathak et al., 2016). For microscopic imaging, predicting the geodesic distance between image patches along the brain surface (Spitzer et al., 2018) or the sequence of multiresolution histopathology images (Srinidhi et al., 2022) has been proposed as pretext tasks. Other approaches leverage the spatial continuity of images by maximizing mutual information between neighboring patches in histological images (Gildenblat & Klaiman, 2019) or satellite images (Ji et al., 2019). They assume textures in spatial proximity to be similar and, therefore, aim to contrast them with textures in more distant parts of images.
In the present study, we explore self-supervised contrastive learning for inferring descriptive features of local nerve fiber distribution patterns from raw 3D-PLI measurements. To generate positive pairs of 3D-PLI texture examples, we assume that fundamental properties of local fiber architecture are typically consistent between nearby image patches. While this assumption is likely violated at boundaries between distinct structural brain areas, we assume that it holds for the largest share of nearby image patches. In contrast to previous work (Gildenblat & Klaiman, 2019; Ji et al., 2019), instead of utilizing in-plane similarity of images, we use a 3D reconstructed histological volume to access the spatial coordinates of image patches in 3D. More precisely, we extract positive pairs of image patches at nearby coordinates across tissue sections. This sampling strategy is motivated by the idea that positive pairs from different tissue sections show independently measured tissue and thus encourage the learning of features that are robust to random variations in the measurement process not descriptive of fiber architecture. We denote this 3D-informed self-supervised learning strategy as 3D-Context Contrastive Learning (CL-3D).
To verify the validity of the proposed approach, we compare texture representations by different methods on the 3D reconstruction of the occipital lobe of a vervet monkey brain. We evaluate features based on their descriptive power for different fiber configurations, as well as their robustness to other sources of variation. Furthermore, we study the relationship between texture features and morphological measures at the macroscopic scale, using a precise automatic cortex segmentation which we developed specifically for 3D-PLI images based on a U-Net model (Ronneberger et al., 2015). To demonstrate the applicability of the learned CL-3D features, we show that the features form clusters that reflect different types of fiber architecture, enable classification tasks with minimal labels, and are suitable for query-based retrieval of U-fiber structures.
The main contributions of the present study are the following:
- We propose a novel 3D-Context Contrastive Learning (CL-3D) strategy to learn a powerful feature embedding for microscopic resolution image patches from 3D-PLI.
- We present specific image augmentations for maximizing invariance of learned features with respect to typical variations in 3D-PLI images, increasing feature quality and robustness.
- We show a high sensitivity of the resulting 3D-PLI feature embeddings to fundamental configurations of nerve fibers, such as myelinated radial and tangential fibers within the cortex, fiber bundles, crossings, and fannings, as well as cortical morphology.
- Using a dataset from a vervet monkey brain, we demonstrate that the learned features are well suited for exploratory data analysis, specifically for finding clusters of similar fiber architecture and retrieving locations with specific architectural properties based on interactively chosen examples.
2 Materials and Methods
2.1 3D-PLI measurements from the occipital lobe of a vervet monkey brain
2.1.1 Tissue samples
For this study, we use a 3D reconstruction of 234 coronal sections from the right occipital lobe of a 2.4-year-old adult male vervet monkey brain (ID 1818) measured with 3D-PLI (Takemura et al., 2020). The brain sample was obtained postmortem after flushing with phosphate-buffered saline in accordance with the Wake Forest Institutional Animal Care and Use Committee (IACUC #A11-219) and conforming to the AVMA Guidelines for the Euthanasia of Animals. It was perfusion fixed with 4% paraformaldehyde, immersed in 20% glycerin for cryoprotection, and frozen at -70°C. Sectioning of the frozen brain was performed coronally at 60 µm thickness using a large-scale cryostat microtome (Poly-cut CM 3500, Leica, Germany). Before each cutting step, blockface images (Axer, Graessel, et al., 2011) were taken with a CCD camera as an undistorted reference for image realignment (Fig. 1A).
2.1.2 3D-PLI acquisition
For 3D-PLI measurement (Axer & Amunts, 2022; Axer, Amunts, et al., 2011; Axer, Graessel, et al., 2011), brain sections were scanned using a polarizing microscope (LMP-1, Taorad, Germany), which provides a detailed view of nerve fiber architecture at 1.3 µm resolution. In this microscope setup, sections are placed on a stage between a rotating linear polarizer and a stationary circular analyzer consisting of a quarter-wave retarder and a second linear polarizer (Fig. 1B). The setup is illuminated by an incoherent white light LED equipped with a band-pass filter of 550 ± 5 nm half-width. Variations in transmitted light intensity are captured using a CCD camera for 9 equidistant rotation angles $\rho$ of the rotating linear polarizer covering 180° of rotation. The recorded light intensity variations feature sinusoidal profiles at each pixel (Fig. 1C), which are determined by the spatial orientation of myelinated nerve fibers. Using Jones calculus, a physical description for these profiles can be derived as

$$I(\rho) = \frac{I_T}{2}\left(1 + \sin\!\bigl(2(\rho - \varphi)\bigr)\,\sin\delta\right) \tag{1}$$
Harmonic Fourier analysis can be applied to retrieve parameter maps of transmittance $I_T$, retardation $r = |\sin\delta|$, and direction $\varphi$ from the profiles (Axer, Amunts, et al., 2011). The phase shift $\delta$ between the ordinary and the extraordinary ray can be further decomposed as

$$\delta = \frac{2\pi\, t\, \Delta n}{\lambda}\,\cos^2\alpha \tag{2}$$
with the cumulative thickness of birefringent tissue $t$, birefringence $\Delta n$, the wavelength of the light source $\lambda$, and nerve fiber inclination angle $\alpha$. While birefringence and wavelength are kept constant for all pixels, the number of myelinated nerve fibers, reflected by $t$, varies. To resolve Equation (2) for inclination $\alpha$, a transmittance-weighted model (Menzel et al., 2022) can be used to estimate $t$. Fiber inclination and direction information can then be jointly visualized in fiber orientation maps in HSV color space (Axer, Graessel, et al., 2011), where the hue value corresponds to in-plane fiber direction $\varphi$ while saturation and value reflect the out-of-plane fiber inclination $\alpha$ (both zero for vertical fibers at $\alpha$ = 90°).
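For illustration, the following sketch derives parameter maps from a stack of intensity profiles via a discrete Fourier transform along the rotation angle, following Equation (1). The function name and the sign convention for the direction phase are illustrative and may differ from the actual acquisition software.

```python
import numpy as np

def pli_parameter_maps(intensities):
    """Derive transmittance, direction, and retardation maps from a stack of
    polarimetric measurements I(rho) of shape (n_angles, H, W), where the
    rotation angles rho are equidistant and cover 180 degrees."""
    n = intensities.shape[0]
    # The profile I(rho) = I_T/2 * (1 + sin(2(rho - phi)) * sin(delta)) has a
    # period of 180 degrees, so the first FFT bin along the angle axis carries
    # the 2*rho harmonic.
    F = np.fft.fft(intensities, axis=0)
    transmittance = 2.0 * np.real(F[0]) / n            # I_T from the profile mean
    amplitude = 2.0 * np.abs(F[1]) / n                 # (I_T / 2) * |sin(delta)|
    retardation = np.clip(2.0 * amplitude / np.maximum(transmittance, 1e-12), 0.0, 1.0)
    # The phase of the second harmonic encodes the in-plane direction phi in [0, pi).
    direction = np.mod(-(np.angle(F[1]) + np.pi / 2.0) / 2.0, np.pi)
    return transmittance, direction, retardation
```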
2.1.3 3D registration
To access the three-dimensional context of images, registration of 3D-PLI parameter maps is performed on 234 sections of the right occipital lobe from section 841 to 1,083 (Fig. 2). Nine sections heavily deformed by histological processing are excluded and replaced by their nearest neighbors to ensure a high-quality 3D reconstruction. Before registration of 3D-PLI parameter maps, a volume reconstruction of blockface images for the complete brain is performed to serve as an undistorted reference space (Fig. 2A). We use this reference space to correct for distortions from histological processing such as shrinkage or expansion of tissue, and to anchor the occipital lobe in the whole-brain context. For reconstruction of blockface images, ARTag markers are positioned on the cryotome along with the mounted tissue block (Fig. 1A). By identification of the markers using the ARToolKitPlus library (Wagner & Schmalstieg, 2007), blockface images are aligned using affine transformations (Schober et al., 2015). Subsequently, nonlinear transformation fields are estimated for alignment of 3D-PLI transmittance maps using the blockface volume as reference, which yields a reconstruction of the overall anatomical shape and topology of the occipital lobe in the 3D-PLI volume space. The same transformation is used to align all 3D-PLI parameter maps. However, blockface images do not contain sufficient structural detail for precise alignment of fine structures visible in 3D-PLI such as single fiber bundles and small blood vessels. Therefore, the blockface alignment is used as initialization for an additional registration step between adjacent 3D-PLI sections to reconstruct coherent 3D fiber tract transitions. In this step, each section is aligned to its successor and predecessor by symmetric normalization, which combines affine and deformable transformation, maximizing cross-correlation between joint retardation and transmittance images. The registration is performed iteratively forward and backward through the stack of sections, with the first and last sections remaining fixed. All registrations are performed using the ELASTIX (Klein et al., 2010; Shamonin et al., 2014), ANTs (Avants et al., 2010, 2011), and ITK (McCormick et al., 2014) software packages. The computed transformation fields are used to warp all 3D-PLI parameter maps into a common volume space (Fig. 2C) with a resolution of 31,077 × 28,722 × 243 voxels and a voxel size of 1.3 × 1.3 × 60 µm³. In-plane orientation information reflected by direction maps is preserved by adjusting the 2D rotation component at each pixel, estimated from the curl of the transformation field.
2.2 Data augmentations for 3D-PLI
Data augmentations are crucial for increasing the diversity of training data. Self-supervised contrastive learning methods in particular rely on augmentations to learn representations that are more generalizable and robust to variations in the input data (T. Chen et al., 2020). It is crucial that augmentation schemes model the expected variability adequately. In microscopy, for example, similar tissue can exhibit different orientations or intensities between scans. Since mounted brain sections are not perfectly flat, sharpness variation can occur within a scan, resulting in slightly out-of-focus areas.
In this section, we introduce a set of augmentations specifically designed to reflect typical variations in 3D-PLI images. We scale attenuation and thickness parameters in the physical model of the measured 3D-PLI signal, perform geometric affine and flip transformations, as well as Gaussian blur. All augmentations are performed for joint transmittance, direction, and retardation parameter maps. Figure 3 shows example applications of the derived augmentations.
2.2.1 Modulation of signal parameters
Parameters of the tissue in the physical model of 3D-PLI might vary (e.g., transparency, thickness) depending on postmortem time, tissue processing, or storage time of the mounted sections. Here, we provide transformations that can be implemented as 3D-PLI-specific data augmentations approximating typical variations in these parameters.
Attenuation coefficient
The transmitted light intensity of the tissue can vary across image acquisitions due to a change in light attenuation. Assuming uniform attenuation for simplicity, transmittance $I_T$ can be described by Bouguer–Lambert's law as

$$I_T = I_0\, e^{-\mu t} \tag{3}$$
for the intensity of incident light $I_0$, section thickness $t$, and attenuation coefficient $\mu$. Scaling the attenuation coefficient by a linear scaling factor $\xi$ as $\mu' = \xi\mu$ results in a scaled transmittance

$$I_T' = I_0\, e^{-\xi\mu t} = I_0\left(\frac{I_T}{I_0}\right)^{\xi}. \tag{4}$$
Note that this equation is only an approximation of a real change in $\mu$ due to the simplifying assumption of uniform attenuation.
Section thickness
Although the section thickness was held constant throughout all brain sections used in this study, it might vary between data acquisitions of other samples. To reflect a linear change in the thickness parameter $t$ in Equation (2), we scale the phase retardation $\delta$ by a linear scaling factor $\xi$. For retardation $r = |\sin\delta|$ and $\delta' = \xi\delta$ we obtain a scaled retardation

$$r' = \left|\sin\!\left(\xi \arcsin r\right)\right|. \tag{5}$$
To adjust the light transmittance $I_T$, we compute a scaled $I_T'$ analog to Equation (4) for the scaled thickness $t' = \xi t$ as follows:

$$I_T' = I_0\, e^{-\mu \xi t} = I_0\left(\frac{I_T}{I_0}\right)^{\xi}. \tag{6}$$
Note that this augmentation is also only an approximation to a real change in thickness $t$, as it does not add or remove tissue components from the measurement.
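A minimal sketch of these two signal modulations, assuming transmittance values normalized by the incident light intensity ($I_0$ = 1) and NumPy arrays for the parameter maps; function and parameter names are illustrative.

```python
import numpy as np

def scale_attenuation(transmittance, xi, i_0=1.0):
    """Eq. (4): scaling the attenuation coefficient by xi rescales the
    transmittance as I_T' = I_0 * (I_T / I_0)**xi."""
    return i_0 * (transmittance / i_0) ** xi

def scale_thickness(transmittance, retardation, xi, i_0=1.0):
    """Eqs. (5)-(6): scaling the section thickness by xi rescales the phase
    retardation delta and, analogously to Eq. (4), the transmittance."""
    delta = np.arcsin(np.clip(retardation, 0.0, 1.0))
    retardation_scaled = np.abs(np.sin(xi * delta))
    transmittance_scaled = i_0 * (transmittance / i_0) ** xi
    return transmittance_scaled, retardation_scaled

# Random scaling factors can be drawn as 2**u with u ~ U(-1, 1), as in training:
# xi = 2.0 ** np.random.uniform(-1.0, 1.0)
```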
2.2.2 Resampling
Many image transformations require resampling of image intensity values. For 3D-PLI, resampling of the measured intensities $I_i(\rho)$ from Equation (1) can be performed as

$$\tilde{I}(\rho) = \sum_i w_i\, I_i(\rho) \tag{7}$$
by a weighted mean of intensity values $I_i(\rho)$ with corresponding weights $w_i$, where $\sum_i w_i = 1$. With 3D-PLI parameter maps, however, we work with derivations of the originally measured image intensities and cannot directly resample values for retardation $r$ and direction $\varphi$. By representing Equation (7) as a Fourier series and due to the linearity of the Fourier transformation, resampling of the 3D-PLI parameter maps can be performed through

$$\tilde{I}_T = \sum_i w_i\, I_{T,i} \tag{8}$$

$$\tilde{I}_T\,\tilde{r}\, e^{i 2\tilde{\varphi}} = \sum_i w_i\, I_{T,i}\, r_i\, e^{i 2\varphi_i} \tag{9}$$
by computing $\tilde{I}_T$ via Equation (8) before obtaining $\tilde{r}$ and $\tilde{\varphi}$ from Equation (9) via decomposition of the right-hand side into magnitude and phase, which correspond to $\tilde{I}_T\tilde{r}$ and $2\tilde{\varphi}$, respectively. The equations are used for all geometric transformations and filters, such as affine transformations or Gaussian blur, that require resampling of 3D-PLI parameter maps. We use geometric transformations to account for different orientations or distortions of tissue and Gaussian blur to mimic slightly out-of-focus areas.
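As an illustration of Equations (7)–(9), the following sketch applies a Gaussian blur consistently to all three parameter maps by filtering in the complex second-harmonic representation; it is a sketch of the principle rather than the actual augmentation implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def resample_pli(transmittance, direction, retardation, sigma=1.0):
    """Resample (here: Gaussian blur) 3D-PLI parameter maps consistently by
    filtering in the complex second-harmonic representation (Eqs. (8)-(9))."""
    # Weighted mean of transmittance values (Eq. (8)).
    t_res = gaussian_filter(transmittance, sigma)
    # Weighted mean of the complex signal I_T * r * exp(i*2*phi) (Eq. (9)).
    z = transmittance * retardation * np.exp(2j * direction)
    z_res = gaussian_filter(z.real, sigma) + 1j * gaussian_filter(z.imag, sigma)
    # Decompose the result into magnitude and phase to recover r and phi.
    r_res = np.abs(z_res) / np.maximum(t_res, 1e-12)
    phi_res = np.mod(np.angle(z_res) / 2.0, np.pi)
    return t_res, phi_res, r_res
```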
2.2.3 Direction correction
Since 3D-PLI measures the absolute in-plane orientation of nerve fibers, any transformation that changes the geometry of image pixels requires a subsequent correction of direction values $\varphi$. For applications in diffusion MRI, Preservation of Principal Directions (PPD) (Alexander et al., 2001a, 2001b) was introduced to preserve directional information undergoing nonrigid transformations. While proposed for 3D diffusion tensors, a similar correction mechanism can be introduced for 3D-PLI, where we restrict the transformations to in-plane transformations for simplicity. We convert direction angles $\varphi$ to cartesian coordinates normalized to one as

$$\mathbf{v} = \begin{pmatrix} \cos\varphi \\ \sin\varphi \end{pmatrix} \tag{10}$$
in order to generate corrected direction angles $\varphi'$ via

$$\varphi' = \operatorname{atan2}\!\left(\bigl(J_f(\mathbf{x})\,\mathbf{v}\bigr)_2,\ \bigl(J_f(\mathbf{x})\,\mathbf{v}\bigr)_1\right) \tag{11}$$
for a nonlinear image transformation function $f$, which maps pixel coordinates $\mathbf{x}$ in the source domain to coordinates in the target domain, and Jacobi matrix $J_f(\mathbf{x}) = \partial f / \partial \mathbf{x}$ of function $f$. The correction is applied before application of function $f$ to transform the image. For specific transformations, Equation (11) can be simplified to more convenient forms.
Rotation
For the example of a counter-clockwise rotation by an arbitrary angle $\beta$, Equation (11) can be simplified to

$$\varphi' = (\varphi + \beta) \bmod 180°. \tag{12}$$
Affine transform
For pixel coordinates $\mathbf{x}$ and an affine transformation composed of translation vector $\mathbf{b}$ and matrix $A$, the transformation function is given as

$$f(\mathbf{x}) = A\mathbf{x} + \mathbf{b}. \tag{13}$$
If inserted into Equation (11), a simplified correction mechanism for the affine transformation can be derived as

$$\varphi' = \operatorname{atan2}\!\left(\bigl(A\mathbf{v}\bigr)_2,\ \bigl(A\mathbf{v}\bigr)_1\right), \tag{14}$$

since the Jacobi matrix of an affine transformation is $J_f = A$ at every pixel.
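A sketch of the affine case in Equation (14), applying the matrix $A$ to the direction vectors from Equation (10) and mapping the result back to angles in [0°, 180°); names and conventions are illustrative.

```python
import numpy as np

def correct_direction_affine(direction, A):
    """Eq. (14): correct in-plane direction angles for an affine image
    transform with matrix A (the translation leaves directions unchanged)."""
    v = np.stack([np.cos(direction), np.sin(direction)], axis=-1)  # Eq. (10)
    v = v @ np.asarray(A).T                                        # apply the Jacobian A
    return np.mod(np.arctan2(v[..., 1], v[..., 0]), np.pi)         # angles back in [0, pi)

# Example: a counter-clockwise rotation by beta simply shifts directions (Eq. (12)).
# beta = np.deg2rad(30.0)
# A = np.array([[np.cos(beta), -np.sin(beta)], [np.sin(beta), np.cos(beta)]])
```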
2.3 Cortex segmentation
To access brain morphology and distinguish gray and white matter locations, a U-Net model (Ronneberger et al., 2015) is trained for segmentation of pixels into background (BG), gray matter (GM), and white matter (WM) classes and applied to every 3D-PLI section of the lobe. For training the model, we create a dataset representing a large variety of textures in 3D-PLI images with minimal labeling effort by employing an active learning strategy in the annotation process. Rather than annotating complete sections, we manually select 58 square regions of interest (ROIs) of size 2,048 pixels (2.66 mm) from several sections, including sections outside the occipital lobe for a higher variety of examples. We train a U-Net model using these ROIs as inputs and apply the model to all available sections. We subsequently select new ROIs based on the most severe misclassifications in the model outputs. This process is repeated to obtain a growing dataset of 58, 119, 183, 301, and finally 369 ROIs of highly diverse patches capturing different textures across the entire brain.
It should be noted that large parts of the cortex can be segmented at acceptable quality using simple thresholding of transmittance and retardation values (Menzel et al., 2022). Challenging parts, such as an oblique cut border between gray and white matter or an intersecting pial surface within narrow sulci, form only a small fraction of the data, but have a significant impact on matching inner and outer cortical boundaries. We, therefore, apply a multiclass implementation of focal loss (Lin et al., 2017) as the training objective to put more emphasis on these challenging examples during training. We use 3D-PLI-specific augmentations, as described in Section 2.2, for training the cortex segmentation model.
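A minimal multiclass focal loss in PyTorch, as one possible implementation of this training objective; the focusing parameter gamma = 2 is the common default from Lin et al. (2017) and not necessarily the value used here.

```python
import torch
import torch.nn.functional as F

def multiclass_focal_loss(logits, targets, gamma=2.0):
    """Multiclass focal loss: down-weights well-classified pixels so training
    focuses on hard examples such as oblique GM/WM borders.
    logits: (N, C, H, W) raw scores, targets: (N, H, W) class indices."""
    log_p = F.log_softmax(logits, dim=1)
    ce = F.nll_loss(log_p, targets, reduction="none")   # per-pixel cross-entropy
    p_t = torch.exp(-ce)                                 # probability of the true class
    return ((1.0 - p_t) ** gamma * ce).mean()
```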
Segmentations of the final model are corrected manually by removing small tissue fragments, extrapolating broken tissue, and filling holes to obtain a topologically correct cortex segmentation. As a last step, the resulting segmentations for individual 3D-PLI sections are stacked to form a segmented volume of the entire cortex in 3D (Fig. 2B).
2.4 3D context contrastive learning
Contrastive learning aims to learn robust and descriptive representations of data samples by contrasting similar and dissimilar pairs. The goal is to learn an encoding function that groups similar samples closely together in representation space while pushing dissimilar samples apart from each other. Assuming a reasonable measure of similarity that can be efficiently derived from the data, similar samples are generated as positive pairs consisting of an anchor sample and a positive sample with high similarity, while for negative pairs the anchor is combined with a dissimilar negative sample.
In this study, we derive similarity from the spatial neighborhood of image patches in a 3D-Context Contrastive Learning objective. Given a random location $\mathbf{x}_a$ for the anchor sample, we obtain a positive sample from a neighborhood location $\mathbf{x}_p$, where $\mathbf{x}_p$ is chosen based on two variants of spatial context sampling (Fig. 4A):
- (i) CL-2D, where $\mathbf{x}_p$ is sampled on an in-plane circle with radius $d$ from the same tissue section. Setting $d$ = 0 is considered a special case, where the anchor sample is also used as positive sample and no context sampling is performed (Same).
- (ii) CL-3D, where $\mathbf{x}_p$ is sampled on a sphere with radius $d$ and is taken across sections by rounding the sampled coordinate to the nearest available section, but excluding the section from which the anchor sample was taken (i.e., $\mathbf{x}_a$ and $\mathbf{x}_p$ are always located on different sections). Setting $d$ = 0 is considered a special case, where $\mathbf{x}_p$ refers to the nearest neighbor (NN) at the same in-plane coordinates as $\mathbf{x}_a$, but in a random adjacent section.
For $\mathbf{x}_p$, only locations with visible tissue are considered to avoid sampling positive pairs containing background only, using the segmentation masks obtained in Section 2.3. As CL-3D requires access to the spatial relationship of sampling locations between sections, we perform context sampling in the undistorted blockface 3D reference space. We utilize the transformation fields estimated during the 3D registration in Section 2.1 to warp locations in the blockface volume to individual 3D-PLI sections. Using patches from original 3D-PLI parameter maps instead of registered ones directly has the advantage of including additional variation between positive samples for the contrastive learning objective, such as different orientations of texture. In addition to the spatial sampling, we perform random augmentations for all samples as detailed in Section 2.2 (Fig. 4B).
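The following sketch illustrates CL-3D sampling of a positive location on a sphere around the anchor in the blockface reference space. Tissue-mask checks and the warping of sampled coordinates into individual section spaces via the registration fields are omitted, the NN special case ($d$ = 0) is handled separately as described above, and all names and parameter values are illustrative.

```python
import numpy as np

def sample_positive_location(anchor_xyz, radius_um, section_thickness_um=60.0,
                             n_sections=243, rng=None):
    """CL-3D context sampling: draw a point on a sphere of radius `radius_um`
    around the anchor location and round its depth coordinate to the nearest
    available section, excluding the anchor's own section."""
    rng = np.random.default_rng() if rng is None else rng
    x, y, z = anchor_xyz              # z as section index, x/y in micrometer
    while True:
        v = rng.normal(size=3)        # uniform point on the sphere surface
        v *= radius_um / np.linalg.norm(v)
        z_p = int(np.clip(round(z + v[2] / section_thickness_um), 0, n_sections - 1))
        if z_p != int(z):             # positive sample must lie on a different section
            return x + v[0], y + v[1], z_p
```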
For training encoder $f$, we build on the SimCLR contrastive learning framework (T. Chen et al., 2020) (Fig. 4C). In this specific framework, augmented positive pairs are randomly sampled for each training step and stored in minibatches of $2N$ total examples $x_k$. We refer to the set containing the indices $(i, j)$ of all positive pairs in the minibatch as $\mathcal{P}$. For each positive pair, all other random samples are considered negative samples, which originate from random sections and locations. When the training volume is large relative to the sampling radius $d$ of positive pairs, it is unlikely that a negative sample lies in the same spatial neighborhood as the positive one. Encoder $f$ typically refers to a deep learning model, which yields network activation vectors $h_k = f(x_k)$ as representations. An additional projection head $g$ is introduced to map the activations to a lower dimensional space of projections $z_k = g(h_k)$, on which the contrastive loss is applied. The training objective is given in terms of the InfoNCE loss (van den Oord et al., 2018):

$$\ell_{i,j} = -\log \frac{\exp\!\left(\mathrm{sim}(z_i, z_j)/\tau\right)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]}\, \exp\!\left(\mathrm{sim}(z_i, z_k)/\tau\right)} \tag{15}$$
where $\mathbb{1}_{[k \neq i]} \in \{0, 1\}$ is an indicator function evaluating to 1 if and only if $k \neq i$, and the similarity metric $\mathrm{sim}(\mathbf{u}, \mathbf{v}) = \mathbf{u}^{\top}\mathbf{v} / (\lVert\mathbf{u}\rVert \lVert\mathbf{v}\rVert)$ is chosen as the cosine similarity with temperature parameter $\tau$. This per-sample loss is accumulated into a total loss as

$$\mathcal{L} = \frac{1}{2N} \sum_{(i,j) \in \mathcal{P}} \left[\ell_{i,j} + \ell_{j,i}\right]. \tag{16}$$
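A compact PyTorch sketch of Equations (15)–(16) for a batch of projections in which consecutive rows form positive pairs (an assumed batch layout):

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z, tau=0.5):
    """SimCLR NT-Xent loss (Eqs. (15)-(16)) for projections z of shape (2N, d),
    where rows 2k and 2k+1 form a positive pair."""
    z = F.normalize(z, dim=1)                     # cosine similarity via dot products
    sim = z @ z.t() / tau                         # (2N, 2N) similarity matrix
    sim.fill_diagonal_(float("-inf"))             # exclude self-similarity (k != i)
    n2 = z.shape[0]
    # Index of the positive partner for every sample: (0,1), (2,3), ...
    pos = torch.arange(n2, device=z.device) ^ 1
    return F.cross_entropy(sim, pos)              # averages l_{i,j} over all 2N anchors
```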
After training, the projection head is discarded and for inference only representations on unaugmented samples are used (Fig. 4D).
2.5 Model training
2.5.1 Model architecture
For encoder $f$, we use a ResNet-50 (He et al., 2016) model with 3 input channels, removing the last fully connected layer. The original ResNet-50 encoder outputs 2,048 feature channels, a configuration that is well evaluated on natural images (Deng et al., 2009). However, training the model at full capacity on our data systematically resulted in high activations to infrequent but highly pronounced structures such as tangentially cut radial fibers. Such structures made it trivial to solve the contrastive learning objective based on spatial similarity. Therefore, we reduce the number of features for all blocks in the ResNet-50 architecture to 1/8 to limit the encoder capacity, which results in 256-dimensional hidden representations, preventing the model from overfitting to these specific structures. We choose this dimensionality as a trade-off between preventing overfitting and maintaining reasonable model capacity, as bigger models can learn more general features (X. Chen & He, 2021). For projection head $g$, we use a two-layer MLP with ReLU activations, a hidden feature size of 90, and outputs of size 32. To feed 3D-PLI images to the ResNet, the parameter maps transmittance $I_T$, direction $\varphi$, and retardation $r$ are stacked along the channel dimension, with the direction encoded in a form that resolves the cyclic nature of its values. We standardize the input channels by running mean and standard deviation over the first 1,024 batches during training.
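A sketch of the encoder and projection head wiring. It uses a full-width torchvision ResNet-50 (2,048 features) as a stand-in, whereas the model described above reduces all block widths to 1/8 (256 features), which requires a custom ResNet definition not shown here.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class PLIEncoder(nn.Module):
    """Encoder f plus projection head g for SimCLR-style training on 3-channel
    3D-PLI inputs (stacked parameter maps)."""
    def __init__(self, proj_hidden=90, proj_out=32):
        super().__init__()
        backbone = resnet50(weights=None)
        backbone.fc = nn.Identity()             # drop the classification layer
        self.encoder = backbone                 # h = f(x)
        feat_dim = 2048                         # 256 in the width-reduced variant
        self.projection = nn.Sequential(        # g: two-layer MLP with ReLU
            nn.Linear(feat_dim, proj_hidden),
            nn.ReLU(inplace=True),
            nn.Linear(proj_hidden, proj_out),
        )

    def forward(self, x):                       # x: (B, 3, 128, 128) stacked maps
        h = self.encoder(x)                     # representation used at inference
        z = self.projection(h)                  # projection used for the loss
        return h, z
```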
2.5.2 Implementation
We use PyTorch (Paszke et al., 2019), PyTorch Lightning (Borovec et al., 2022), and Hydra (Yadan, 2019) frameworks using the Quicksetup-ai (Mekki et al., 2022) template for building our model. Data augmentations (cf. Section 2.2) are implemented using the Albumentations (Buslaev et al., 2020) framework. Training is conducted using a distributed data-parallel strategy on 4 Nvidia A100 GPUs with synchronized batch normalization statistics (Ioffe & Szegedy, 2015) on the supercomputer JURECA-DC at the Jülich Supercomputing Centre (JSC) (Thörnig, 2021).
2.5.3 Data sampling
We sample square patches of size 192 pixels (253 µm) as anchor samples from random locations within the training volume, excluding background using the previously generated cortex segmentation. For each anchor sample, we take positive samples from a random location in spatial proximity, depending on the chosen definition of spatial similarity. As we sample patch locations on the fly, we do not have a fixed dataset size but define an epoch as the sampling of 512 × 512 = 262,144 positive pairs. Per training step, we take 512 anchor samples together with their positive samples and process them evenly split across the 4 GPUs.
2.5.4 Data augmentation
For all samples, we apply an affine transformation (scaling factors drawn from [0.9, 1.3] for each axis, rotation angles from [-180°, 180°], and shearing angles from [-20°, 20°] for each axis) with linear interpolation and subsequent center cropping to crops of size 128 pixels (169 µm) to eliminate padding effects. Subsequently, we perform random flipping on the center crops and scale the relative thickness (Eqs. (5) and (6)) and the attenuation coefficient (Eq. (4)) each by a random factor $2^u$, with $u$ drawn uniformly from [-1, 1]. As a last augmentation, we perform Gaussian blur with a probability of 50% and a standard deviation $\sigma$ drawn from [0.0, 2.0].
2.5.5 Training
All augmented crops are fed to the encoder model and projection head in order to minimize the loss in Equation (16). We use the Adam optimizer (Kingma & Ba, 2017) with a learning rate of $10^{-3}$, a weight decay of $10^{-6}$, and default parameters $\beta_1$ = 0.9, $\beta_2$ = 0.999, and $\epsilon$ = $10^{-8}$. For the choice of temperature parameter $\tau$ in Equation (15), we follow the optimal choice of $\tau$ = 0.5 reported by T. Chen et al. (2020) when training until convergence. We apply the same loss for training and validation. All models are trained until convergence, stopping when the validation loss does not decrease for more than 50 epochs, which takes between 195 (2D context) and 400 (3D context) epochs.
2.5.6 Inference
After training, model weights are frozen and inference is performed on complete sections using the trained ResNet encoder, discarding the projection head. Each section is converted into feature maps using a sliding window approach by dividing 3D-PLI parameter maps into tiles of size 128 pixels (169 µm) with 50% overlap. The overlap is chosen to better represent pixels at the edges of patches that would otherwise lie on the boundary between adjacent patches. We extract a 256-dimensional feature vector for each tile without applying the data augmentation used in training. Extracted feature vectors are reassembled into feature maps with reduced in-plane resolution of 84.4 µm per pixel compared with original input parameter maps, but 256 feature channels characterizing the local texture content. Compared with the section thickness of 60 µm, this makes the feature voxels approximately isotropic.
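A sketch of the sliding-window feature extraction, assuming an encoder that maps (B, 3, 128, 128) input tiles to 256-dimensional feature vectors; names and the batching scheme are illustrative.

```python
import torch

@torch.no_grad()
def extract_feature_map(encoder, section, patch=128, overlap=0.5, device="cuda"):
    """Convert a whole 3D-PLI section (tensor of stacked parameter maps with
    shape (3, H, W)) into a feature map by encoding overlapping tiles."""
    stride = int(patch * (1.0 - overlap))                 # 64 px for 50% overlap
    c, h, w = section.shape
    ys = range(0, h - patch + 1, stride)
    xs = range(0, w - patch + 1, stride)
    feats = torch.zeros(len(ys), len(xs), 256)            # 256-d features assumed
    encoder.eval()
    for i, y in enumerate(ys):
        tiles = torch.stack([section[:, y:y + patch, x:x + patch] for x in xs])
        feats[i] = encoder(tiles.to(device)).cpu()        # one row of tiles per batch
    return feats                                          # (H', W', 256) feature map
```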
2.6 Classical texture features
The present approach enables learning of texture features specifically for 3D-PLI parameter maps. As baselines for texture analysis, we apply classical first-order histogram features (mean, variance, skewness, kurtosis, entropy), Grey-Level Co-occurrence Matrices (GLCM) (Haralick et al., 1973), Local Binary Patterns (LBP) (Ojala et al., 2002), and a combined set of all of their features. These are well-established approaches to represent textures in medical imaging such as CT, MR, PET (Scalco & Rizzo, 2017), and histopathology (de Matos et al., 2021). We compute texture features for whole sections using the same sliding window approach introduced in Section 2.5 by dividing 3D-PLI parameter maps into tiles of size 128 pixels (169 µm) with 50% overlap.
Direction maps represent the absolute orientation of fibers within the imaging plane. Since we aim to find texture representations that are independent of their absolute orientation, we are more interested in local patterns of $\varphi$ than in their absolute values. We use the Sobel operator as a first derivative filter to highlight image edges and eliminate absolute values of $\varphi$. To filter direction values $\varphi$, we need to resolve their circular nature. Therefore, we represent direction angles in polar form as complex numbers

$$c = e^{i 2\varphi} \tag{17}$$
and apply Sobel filtering as

$$G_x = c * K_x, \qquad G_y = c * K_y \tag{18}$$
with convolution operator $*$ and Sobel filter kernels $K_x$ and $K_y$. We aggregate the filtered images into a direction edge map $\varphi_S$ as

$$\varphi_S = \frac{1}{12}\left(\lvert G_x \rvert + \lvert G_y \rvert\right) \tag{19}$$
where $\lvert\cdot\rvert$ extracts the magnitude of complex numbers, and dividing by 12 normalizes the filtered values to [0, 1].
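A sketch of this direction filtering; the aggregation of the two Sobel responses follows the form given above and may differ in detail from the original implementation.

```python
import numpy as np
from scipy.ndimage import convolve

def direction_edge_map(direction):
    """Orientation-independent edge map of direction angles phi: represent phi
    as complex numbers exp(i*2*phi) to account for the 180-degree periodicity,
    apply Sobel filtering, and aggregate the magnitudes (Eqs. (17)-(19))."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    c = np.exp(2j * direction)
    # Convolve real and imaginary parts separately with each Sobel kernel.
    gx = convolve(c.real, kx) + 1j * convolve(c.imag, kx)
    gy = convolve(c.real, ky) + 1j * convolve(c.imag, ky)
    return (np.abs(gx) + np.abs(gy)) / 12.0
```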
For histogram features, we compute normalized histograms with 128 bins for each of the parameter maps $I_T$, $\varphi_S$, and $r$. From the histograms, we compute mean, variance, skewness, kurtosis, and entropy as features. Features for all parameter maps are concatenated, resulting in a total of 15 histogram features.
To extract LBP features, we compute local binary patterns for each patch by dividing the angular space into eight points with multiple radii [1, 2, 3] to define the local neighborhood of texture. We compute normalized histograms with 10 bins of LBP values for each radius and parameter map and concatenate them into a feature vector with 90 features.
We compute normalized and symmetric GLCMs for 32 equally spaced bins of the parameter maps for distances [1, 2, 4] and angles [0, π/4, π/2, 3π/4]. From each GLCM, we compute contrast, correlation, energy, and homogeneity as features (Haralick et al., 1973). We concatenate the features for all parameter maps and distances while averaging over the angles to make the features robust to rotations, resulting in 36 total features.
In addition, we include a comprehensive combined set of all classical texture features (Histogram, LBP, GLCM) as a baseline.
2.7 Pretrained encoder on ImageNet
In recent years, there has been increasing interest in using pretrained deep learning models as feature extractors in histopathology (de Matos et al., 2021), a domain very close to ours. Here, we observe that many recent studies analyzing histopathological images use encoders pretrained on ImageNet (Deng et al., 2009) for feature extraction and downstream analysis (Breen et al., 2024; Liu et al., 2024; Wu et al., 2024). Therefore, we complement the classical baselines with a pretrained ResNet-50 (He et al., 2016) encoder, which has been trained on images from the ImageNet dataset using the SimCLR contrastive learning objective (T. Chen et al., 2020).
As the ResNet-50 model was trained on natural RGB images, we use two types of images generated from 3D-PLI parameter maps to visualize the fiber architecture: (1) Transmittance maps, stacked in the color channel dimension, to create grayscale RGB images, and (2) fiber orientation maps (FOM), which can be directly fed to the model. In contrast to the width-reduced ResNet-50 architecture we use for CL-3D and CL-2D, this model has full capacity and produces 2,048 features for image patches of 128 pixels (169 µm) size. For the creation of feature maps, we use the same sliding window approach as in the inference of CL-2D and CL-3D (Section 2.5).
3 Experiments and Results
We train CL-2D and CL-3D models using the data from the occipital lobe of a vervet monkey brain described in Section 2.1. We split the volume into sections for training (#962–#1077), sections for validation (#1078–#1083), and sections for testing (#851–#961) as shown in Figure 2A. We evaluate the feature representations produced by the models regarding their descriptive power for different fiber configurations, spatial consistency, and applicability for downstream tasks. In particular, we show that classification of texture into tissue classes requires fewer annotations with CL-3D and CL-2D features than with features from other methods. We compare the extent to which features of different approaches can be related to brain morphology and their robustness to variations between sections. We investigate the main factors of variation specifically for CL-3D features and demonstrate that they lend themselves to interactive data exploration and identification of nerve fiber architecture in large volumes of 3D-PLI data. All experiments are evaluated exclusively on features extracted from sections not included in training.
3.1 Linear evaluation of features with minimal labels
A common approach to assess the quality and robustness of feature representations is to perform linear evaluation (T. Chen et al., 2020; van den Oord et al., 2018) on a given classification task. For the linear evaluation protocol, a simple linear classifier is trained on top of features extracted for each data sample. Being able to perform the classification task with a simple linear model indicates a good discrimination of classes in feature space and thus high-quality features. In addition, we analyze the robustness of features in a weakly supervised setting by providing the classifier with an increasing number of labeled training examples, starting with only a few per class. The number of examples scales with the human annotation effort involved in creating the dataset. Self-supervised CL-3D and CL-2D encoders are still trained on the full amount of unlabeled training data, which does not require human annotations and is, therefore, available in much higher quantity.
For the classification task, we use the training data acquired for the cortex segmentation performed in Section 2.3, which segments 3D-PLI images into three classes: background (BG), gray matter (GM), and white matter (WM). We divide the annotated ROIs into 228 ROIs (from 42 sections located caudal to the central sulcus) for training and 141 ROIs (from 19 sections encompassing the prefrontal cortex) for testing. This train/test split ensures that section IDs for testing the classifier do not contain sections from the occipital pole used for training the self-supervised models and allows testing the generalizability of features across the brain. To perform classification on individual texture patches, we extract square patches of 128 pixels (169 µm) on a regular grid per ROI and assign each patch the label of the most frequent class in the segmentation mask. The resulting dataset comprises many different textures with slightly unbalanced class distributions for both training (10,696 WM, 32,613 GM, and 12,829 BG patches) and testing (7,065 WM, 19,577 GM, and 7,589 BG patches).
For the linear classifier, we use a logistic regression classifier in a one-versus-rest scheme. We compare the classification performance by computing macro F1 scores across classes and estimate the variability of these scores by computing the standard error over 50 independent fits of the classifier on random subsets of training examples.
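A sketch of this linear evaluation protocol with scikit-learn, assuming precomputed feature matrices and integer class labels; function and parameter names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import f1_score

def linear_evaluation(train_feats, train_labels, test_feats, test_labels,
                      n_per_class=30, n_repeats=50, seed=0):
    """Fit a one-vs-rest logistic regression on a small random subset of
    labeled features and report the macro F1 score on the test set."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_repeats):
        idx = np.concatenate([
            rng.choice(np.where(train_labels == c)[0], n_per_class, replace=False)
            for c in np.unique(train_labels)])
        clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
        clf.fit(train_feats[idx], train_labels[idx])
        scores.append(f1_score(test_labels, clf.predict(test_feats), average="macro"))
    return float(np.mean(scores)), float(np.std(scores) / np.sqrt(n_repeats))
```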
3.1.1 Evaluation of features by different methods
We compare CL-3D and CL-2D features with classical texture features and a pretrained ResNet-50 encoder on ImageNet under the linear evaluation protocol. Additionally, we report the performance of a supervised ResNet-50 classifier as a reference, specifically trained on the classification task using the full training dataset. For this model, we used a class-weighted cross-entropy loss with Adam optimizer in the same setting as described in Section 2.5. The model was trained for 416 epochs until convergence of validation accuracy, using a random 80/20 train/validation split.
With minimal training examples, results in Figure 5 show a clear lead in classification performance by our CL-2D and CL-3D models. Both models achieve macro F1 scores of 0.94 with only 30 random samples per class to fit the linear classifier. By using 10,000 samples per class, CL-3D, CL-2D, and a combination of all classical texture features (Combined) match the F1 score of the supervised ResNet-50 model (0.97). All other methods, including each individual classical texture descriptor, show lower F1 scores for all numbers of samples per class.
3.1.2 Effect of data augmentations for 3D-PLI on texture features by CL-3D and CL-2D
In Section 2.2, we introduce data augmentations specifically designed for 3D-PLI parameter maps and use them in the training of CL-3D and CL-2D models. To test the effect of these augmentations on feature quality and robustness, we perform linear evaluation of CL-3D and CL-2D models trained without any augmentation, with every single augmentation, and with all augmentations combined.
As shown in Figure 6A, a CL-2D model trained without data augmentations clearly underperforms compared with models trained with any individual data augmentation, except the blur augmentation, which appears to degrade feature representations for this classification task. Among models trained with individual augmentations, color distortions (modulation of section thickness and the attenuation coefficient) yield the best results. A combination of all augmentations together performs the best overall.
Results in Figure 6B demonstrate that CL-3D benefits most from color distortions. Models trained with geometric transformations only, such as affine and flip augmentations, perform worse than a model trained without augmentations. Excluding them from the full set of augmentations, however, does not improve performance (see Fig. A1 in the Appendix). Using all introduced augmentations during training leads to the best results for CL-3D.
3.2 Main factors of variation in the learned representations
To gain insights into the main factors of variation captured by CL-3D features, we perform principal component analysis (PCA) on a random subset of 1 million voxels from the feature maps. We use the estimated principal axes to project feature channels for the entire dataset onto the nine components with largest explained variance (64.2% cumulative explained variance), with at least 2.8% of variance explained per component. Explained variance here refers to the proportion of the total variance in the dataset that is retained by each PCA component.
Figure 7D (see also Figure A2 in the Appendix) shows images for the first nine principal components, which in general terms reveal anatomically plausible structures. Component (1) shows a clear separation of white matter (WM) and gray matter (GM). Within GM, higher values indicate layers with a lower density of radial fibers. Within GM, values in component (2) distinguish between layers with high density of radial fibers (low values) and high density of tangential fibers (high values in the superficial layers, which contain mainly tangential fibers, and values around 0 in the Gennari stripe, which presents both radial and tangential fibers), while high values within WM highlight edges between fiber bundles. Low values in component (3) highlight layer I, which contains a high density of tangential fibers, as well as WM structures with high fiber density that run in-plane, such as the tapetum. High values indicate tangentially sectioned radial fibers. Component (4) has low values toward out-of-plane fibers and highly oblique cortex and high values toward superficial GM layers with a low density of radial fibers and low obliqueness. Component (5) shows high values for layer IVc in the primary visual area (V1) and layer VI throughout the whole cortex. In component (6), high values show layer I throughout the cortex as well as layer IVb within V1 (Stria of Gennari), that is, they highlight GM layers with a high density of tangential fibers. High and low values in (7) mainly represent WM, with high values indicating U-fibers and other in-plane fibers and low values indicating steep fibers or crossings. The sagittal stratum (SS) has the lowest values, as here fibers emerge vertically from the plane. High values in GM mainly highlight tangentially sectioned radial fibers. High values in (8) also highlight tangentially sectioned radial fibers in GM. Additionally, they indicate abrupt twisting of flat, in-plane fibers that twist out of plane at the GM/WM transition, and show parts of U-fibers that are cut through the plane. In component (9), layers IVa, IVc, and VI within V1 are characterized by high values.
3.3 Encoding of brain morphology in texture features
Fiber architecture has mutual dependencies with cortical morphology (Striedter et al., 2015; Van Essen, 1997). To investigate to what extent different texture representations encode cortical morphology, we extract a range of morphological parameters from our test data. In particular, based on the cortex segmentation described in Section 2.3, we compute a Laplacian field between outer pial and inner white matter surfaces using the highres-cortex (Leprince et al., 2015) module included in BrainVISA (Cointepas et al., 2001). We extract the following measures:
Equivolumetric cortical depth (Bok, 1929) as the depth along cortical traverses following the gradient of the Laplacian field, compensating for the effect of cortical curvature through the divergence of the same field. It has values of 0 at the pial surface and 1 at the gray–white matter surface.
White matter depth defined as the shortest distance from each voxel within white matter to the interface between the cortical ribbon and white matter in millimeters.
Cortical curvature as the divergence of the gradient of the Laplace field between pial and white matter surfaces (Goldman, 2005; equation 3.8 in Leprince et al., 2015).
Obliqueness of the sectioning plane, computed as the absolute angle between the gradient of the Laplacian field and the sectioning plane with values in [0°, 90°].
We follow the approach of Spitzer (2020) and quantify the extent to which these measures can be predicted from texture features by different methods using a linear model. Being able to predict a quantity with a simple linear model indicates a robust encoding of that quantity. We randomly select 10,000 voxels from the training set and compute both their feature representations using the trained models and the above-mentioned morphological measures. The features are standardized using Z-score normalization and used to fit a linear regression model via least squares. For the model, we use relatively high L2 regularization with a weight of $10^4$ to reduce the overfitting observed for the pretrained model (ImageNet) with its high number of 2,048 features. The goodness of the fit is determined by calculating the coefficient of determination $R^2$, which denotes the proportion of variation in the measures that can be explained by the linear model from the feature representations. We compute $R^2$ for predicted values from 10,000 randomly selected voxels from the test set.
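A sketch of this analysis with scikit-learn, using ridge regression as the L2-regularized linear model; names are illustrative.

```python
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score

def predict_morphology(train_feats, train_measure, test_feats, test_measure, alpha=1e4):
    """Fit an L2-regularized linear model predicting a morphological measure
    (e.g. cortical depth) from Z-scored texture features and report R^2."""
    scaler = StandardScaler().fit(train_feats)
    model = Ridge(alpha=alpha).fit(scaler.transform(train_feats), train_measure)
    pred = model.predict(scaler.transform(test_feats))
    return r2_score(test_measure, pred)
```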
Our results in Table 1 show the highest $R^2$ values for the prediction of all morphological measures from features by our CL-3D method with medium sampling radius $d$ = 118 µm. It achieves $R^2$ values of 0.82 for cortical depth, 0.36 for white matter depth, 0.17 for curvature, and 0.52 for obliqueness, which are overall higher compared with CL-2D with in-plane context sampling. Using a larger or smaller radius for context sampling does not significantly increase $R^2$ values for CL-3D or CL-2D. While for $d$ = 0, that is, using nearest neighbor (NN) context sampling for CL-3D, $R^2$ values are marginally smaller, they decrease considerably for CL-2D when using no context sampling in model training. Classical texture descriptors (GLCM, Histogram, LBP) and the pretrained model show overall much smaller $R^2$ values and do not encode curvature at all, that is, showing $R^2$ values around 0.
| Method | Dim. | $d$ [µm] | Input | Cort. depth | WM depth | Curv. | Oblique. |
|---|---|---|---|---|---|---|---|
| GLCM | 36 | - | $I_T$, $\varphi_S$, $r$ | 0.53 | 0.26 | 0.01 | 0.05 |
| Histogram | 15 | - | $I_T$, $\varphi_S$, $r$ | 0.52 | 0.20 | 0.00 | 0.03 |
| LBP | 90 | - | $I_T$, $\varphi_S$, $r$ | 0.36 | 0.13 | 0.01 | 0.06 |
| [Histo., LBP, GLCM] | 141 | - | $I_T$, $\varphi_S$, $r$ | 0.63 | 0.27 | 0.02 | 0.08 |
| Pretrained (ImageNet) | 2048 | - | $I_T$ | 0.60 | 0.22 | 0.00 | 0.18 |
| Pretrained (ImageNet) | 2048 | - | FOM | 0.62 | 0.17 | 0.01 | 0.19 |
| CL-2D | 256 | 0 | $I_T$, $\varphi$, $r$ | 0.58 | 0.14 | 0.07 | 0.15 |
| CL-2D | 256 | 118 | $I_T$, $\varphi$, $r$ | 0.79 | 0.34 | 0.13 | 0.43 |
| CL-2D | 256 | 236 | $I_T$, $\varphi$, $r$ | 0.78 | **0.36** | 0.13 | 0.42 |
| CL-3D | 256 | NN | $I_T$, $\varphi$, $r$ | 0.79 | 0.35 | **0.17** | 0.49 |
| CL-3D | 256 | 118 | $I_T$, $\varphi$, $r$ | **0.82** | **0.36** | **0.17** | **0.52** |
| CL-3D | 256 | 236 | $I_T$, $\varphi$, $r$ | **0.82** | 0.34 | 0.14 | 0.51 |
To quantify the quality by which features encode each measure, we calculate the goodness of each fit using the coefficient of determination $R^2$. For CL-2D and CL-3D, $d$ refers to the distance at which context sampling is performed, with nearest neighbor (NN) sampling being a special case of CL-3D. The methods use different input modalities (Input) and produce features with different dimensionalities (Dim.). Bold values indicate the highest $R^2$ value per column.
Scatter plots shown in Figure 8 visualize predicted and true values at the example of CL-3D and combined classical texture features. As shown in Figure 8A, for obliqueness, especially larger angles can be predicted from CL-3D features, while the prediction of smaller angles in the scatter plot exhibits a deviation. Predictions of white matter depth from features by both methods show a clearer fit for smaller depths than for larger depths. As shown in Figure 8B, neither curvature nor obliqueness can be predicted from the selected classical texture features.
3.4 Clustering of learned features
3.4.1 Hierarchical clustering
We perform hierarchical cluster analysis in the embedding space of CL-3D features to evaluate the extent to which these features map characteristic nerve fiber configurations. Hierarchical clustering creates a tree-like structure of hierarchical relationships among data points as a dendrogram based on their similarity in representational space. We choose the bottom-up approach of agglomerative clustering, which merges the closest clusters based on our choice of Euclidean distance and Ward linkage. To reduce noise and computational effort, we cluster PCA-reduced feature maps from Section 3.2, where features are projected onto the first 20 components with 80.4% total explained variance. Additionally, using PCA-reduced feature maps mitigates potential negative effects of the curse of dimensionality when calculating distances in high-dimensional feature spaces. To increase the receptive field and reduce in-plane noise, each feature map is smoothed by an in-plane 2D Gaussian kernel with a standard deviation chosen as a trade-off between noise reduction and keeping sensitivity to smaller structures.
We select features from foreground voxels across all test sections. Due to the huge number of 16 million data points, hierarchical clustering cannot be applied directly to the data points. Instead, we perform a two-step approach by first performing k-means clustering on a subset of 100,000 samples for 128 clusters. We use the 128 resulting cluster centers to assign all remaining 15.9 million data points to these clusters, allowing us to represent the whole test volume by superpixel-like clusters (Fig. 9A). Cluster assignments in the coronal plane appear visually smoother than the assignments across brain sections in the axial and sagittal planes. As a second step, we perform agglomerative hierarchical clustering to obtain a cluster dendrogram (Fig. 9B). From this result, we calculate silhouette scores for an increasing number of clusters to identify interesting candidate clusterings for visualization (Fig. 9C). Considering the overall decline in silhouette scores, we observe local maxima at 3, 7, and around 13–14 clusters. Notably, increasing from 13 to 14 clusters introduces a particularly interesting cluster, exclusively highlighting cortical layers in primary visual area (V1). Therefore, we select 14 clusters for visualization. We would like to point out that using fewer (32) or more (1,024) k-means centroids for subsequent hierarchical clustering resulted in overall lower silhouette scores and led to unspecific, over-simplifying clusters or more noisy clusters, respectively.
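A sketch of the two-step clustering with scikit-learn; here the silhouette score is computed on the k-means centroids as a simplification, and all names and default values are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score

def two_step_clustering(features, n_kmeans=128, n_final=14, subset=100_000, seed=0):
    """Two-step clustering of PCA-reduced feature maps: k-means on a subset,
    assignment of all voxels to the centroids, then Ward-linkage agglomerative
    clustering of the centroids into the final clusters."""
    rng = np.random.default_rng(seed)
    sub = features[rng.choice(len(features), subset, replace=False)]
    km = KMeans(n_clusters=n_kmeans, random_state=seed, n_init=10).fit(sub)
    assignment = km.predict(features)                 # superpixel-like labels
    agglo = AgglomerativeClustering(n_clusters=n_final, linkage="ward")
    centroid_labels = agglo.fit_predict(km.cluster_centers_)
    final_labels = centroid_labels[assignment]        # map voxels to merged clusters
    score = silhouette_score(km.cluster_centers_, centroid_labels)
    return final_labels, score
```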
Results for selected candidate cluster configurations for visualization are shown in Figure 9E. We observe the same difference in spatial smoothness of cluster assignments between the coronal, axial, and sagittal planes as shown in Figure 9A, with the coronal plane appearing visually smoother. A range of characteristic aspects of fiber architecture are revealed, which are identified and confirmed by two neuroanatomists (N.P.-G. and M.N.). The descriptions are based on a comparison of each cluster with high-resolution 3D-PLI images across multiple sections and its overall distribution within the 3D geometry of the brain to ensure the consistency of descriptions.
3 clusters. Solutions for 3 clusters demonstrate a first global differentiation of the data into GM and WM. Due to its high fiber density, cortical layer VI is sometimes represented inside the WM cluster. We further observe a small cluster of tangentially cut cortex.
7 clusters. The configuration with seven clusters differentiates superficial and deep cortical layers. This segregation is shaped by the packing density of radial fibers in the deeper layers, and the tangentially running fibers close to the pial surface. For WM, voxels are split into two clusters: (1) the red cluster in Figure 9E highlights densely packed fibers with out-of-plane orientation of the sagittal stratum (SS), as well as surrounding densely packed in-plane fibers of the tapetum and (2) the green cluster in Figure 9E encompasses in-plane fibers or fibers with relatively low inclination, together with steep but less densely packed fiber bundles.
14 clusters. The configuration of 14 clusters reveals an increased sensitivity to specific WM fiber bundles and displays a cortical region delineation for the primary visual cortex (V1). Furthermore, a range of fiber architectural properties can be recognized in the maps corresponding to the clusters in Figure 9B:
Cluster (1) shows the tapetum and stratum calcarinum (SC), characterized by approximate in-plane fibers with high packing density.
Cluster (2) displays densely packed, highly inclined fibers and fibers of the sagittal stratum (SS) with out-of-plane orientation.
Cluster (3) includes WM voxels from sections with an artifact-related increased light transmittance (not present in the section shown in Fig. 9E).
Cluster (4) mainly displays WM voxels, but in some cortical segments also encompasses layer VI. WM covered by these voxels is characterized by fibers with high packing density, small inclination angles, and twisting or crossing patterns.
Cluster (5) mainly highlights layer VI of cortical segments complementing cluster 4. WM voxels encompass fibers with low packing density and low inclination at the border between GM/WM.
Cluster (6) reveals only WM voxels with very small fiber inclination in mostly parallel configurations, with only a few crossings.
Cluster (7) highlights voxels located in highly oblique cortex with layers characterized by low myelination. When not tangentially sectioned, this portion of the cortex is encompassed by cluster (9).
Cluster (8) highlights the most superficially located bands of tangential fibers, namely those of the zonal layer and the Kaes–Bechterew stripe, which are located in cytoarchitectonic layers I and IIIa, respectively (Zilles et al., 2015), and found throughout the whole cortex.
Cluster (9) covers cortical layers with low density of myelin. The width of this cluster varies along the cortical ribbon. Some areas, where radial fibers reach almost up to layer II, are characterized by a narrow band of cluster (9), while in other areas it is broad because their radial fibers only reach into layer IIIb. The cluster disappears in the cortex of highly compressed sulci.
Voxels in cluster (10) are also located in obliquely sectioned layers with a low myelination while being less oblique than those of cluster (7).
Cluster (11) encompasses steep radial fibers fanning out at the apex of the gyrus.
Cluster (12) highlights approximate in-plane fibers with low packing density, fanning out at the apex of the gyrus.
Voxels of cluster (13) are restricted to the primary visual area (V1). They are mainly found in layers IVb–V, but, depending on packing density, sometimes also in layer VI. This cluster could reflect local cortical connectivity within V1.
Cluster (14) highlights deeper layers, mainly due to their density of radial fibers. Its width varies along the cortical ribbon and is inversely related to the width of cluster (9). In V1, cluster (14) is restricted to layers IIIb and IVa. In the remaining cortex, depending on the area, it can extend from layer IIIa or IIIb to layer V or VI.
To visualize the organization of data points in the learned representational space, we additionally perform a UMAP (McInnes et al., 2018) projection of the PCA-reduced features used for clustering onto two dimensions for 4,000 example data points. The projections in Figure 9D show the features organized in a continuous band along cortical depth, starting from superficial layers (clusters 8 and 9), through deeper layers (clusters 13, 14, 5, and partially 4), to the WM clusters (3, 6, and partially 4). Branches to the sides highlight different degrees of obliqueness of cortical layers (i.e., clusters 7, 9, and 10). Clusters for structures such as the SS (cluster 2), the tapetum/SC (cluster 1), or layers of V1 (cluster 13) form separate branches in the projected space. Clusters 3, 4, and 13 split into multiple fragments in UMAP space. These splits can be partially explained by the features providing a more fine-grained distinction between patches from different cortical and white matter depths than is captured by the clustering results. While increasing the number of clusters in hierarchical clustering reveals these additional details of fiber architecture, we limit the number of clusters to 14 in our qualitative analysis to maintain interpretability. Beyond this point, the complexity of the clusterings gradually increases, and the content of individual clusters becomes more difficult to describe.
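For illustration, this projection step can be sketched as follows; the snippet is a minimal Python example in which array names, the number of PCA components, and the subsample size are placeholders rather than the exact values of our pipeline.

```python
# Sketch: PCA-reduce patch features, then embed a subsample with UMAP for visualization.
import numpy as np
from sklearn.decomposition import PCA
import umap  # umap-learn

rng = np.random.default_rng(0)
features = rng.normal(size=(20_000, 128))         # stand-in for CL-3D patch features
labels = rng.integers(0, 14, size=len(features))  # stand-in for the 14 cluster labels

pca_feats = PCA(n_components=20).fit_transform(features)     # PCA reduction
idx = rng.choice(len(pca_feats), size=4_000, replace=False)  # subsample of 4,000 points

embedding = umap.UMAP(n_components=2, random_state=0).fit_transform(pca_feats[idx])
# embedding can be scatter-plotted in 2D and colored by labels[idx]
```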
3.4.2 Consistency of cluster assignments across sections
In addition to 2D cluster maps, we assemble 3D renderings of the configuration with 14 clusters (Fig. 10). While CL-3D features were generated based only on in-plane texture information without cross-section constraints, the volume renderings reveal consistent cluster boundaries across sections, as can be observed from the smooth cluster shapes in the cross-sections. Cluster 13 stands out as being noisier than the other clusters.
To better quantify the cross-section consistency of feature clusters, we compute the intersection over union (IoU) of cluster assignments between adjacent sections for different numbers of k-means clusters (2, 8, 32, and 128). Since absolute coordinates are not included in the features, this analysis provides insights into the robustness of representations to inter-section variations arising from histological processing.
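A possible implementation of this consistency measure is sketched below in Python; the exact aggregation over clusters may differ from our evaluation code, and the label maps are synthetic placeholders.

```python
# Sketch: mean intersection-over-union (IoU) of cluster assignments between two
# co-registered label maps from adjacent sections.
import numpy as np

def mean_iou(labels_a: np.ndarray, labels_b: np.ndarray, n_clusters: int) -> float:
    ious = []
    for c in range(n_clusters):
        a, b = labels_a == c, labels_b == c
        union = np.logical_or(a, b).sum()
        if union == 0:  # cluster absent in both sections
            continue
        ious.append(np.logical_and(a, b).sum() / union)
    return float(np.mean(ious))

# Placeholder label maps for two neighboring sections
rng = np.random.default_rng(0)
section_i = rng.integers(0, 8, size=(512, 512))
section_j = rng.integers(0, 8, size=(512, 512))
print(mean_iou(section_i, section_j, n_clusters=8))
```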
IoU scores reported in Table 2 decrease overall as the number of clusters increases. Cluster assignments based on CL-3D features are significantly more consistent, across all numbers of clusters, than those of the other methods. When using nearest neighbor (NN) context sampling for CL-3D, scores decrease slightly. IoU scores of the other methods (CL-2D, GLCM, pretrained encoder on ImageNet) are relatively close to each other. Setting the context sampling distance to 0 for CL-2D, that is, not performing context sampling, stands out due to particularly low IoU values, which fall below those of GLCM and the pretrained ImageNet encoder for more than two clusters.
Table 2. IoU scores of cluster assignments between adjacent sections for different numbers of k-means clusters.

| Method | Sampling distance [µm] | Input | 2 clusters | 8 clusters | 32 clusters | 128 clusters |
| --- | --- | --- | --- | --- | --- | --- |
| GLCM | – | , , | 95.2 | 58.4 | 34.1 | 19.4 |
| Histogram | | | 95.4 | 49.8 | 27.7 | 14.3 |
| LBP | | | 49.4 | 30.9 | 17.3 | 9.2 |
| [Histo., LBP, GLCM] | | | 92.9 | 42.6 | 24.2 | 14.1 |
| Pretrained (ImageNet) | – | | 86.6 | 50.0 | 29.9 | 17.0 |
| | | FOM | 77.8 | 50.9 | 35.5 | 23.6 |
| CL-2D | 0 | , , | 88.9 | 47.1 | 25.0 | 12.8 |
| | 118 | | 95.4 | 61.2 | 35.9 | 20.0 |
| | 236 | | 95.0 | 55.8 | 36.7 | 21.0 |
| CL-3D | NN | , , | 89.2 | 70.8 | 50.2 | 30.3 |
| | 118 | | **96.1** | 70.5 | **51.0** | **32.3** |
| | 236 | | 95.4 | **71.9** | 50.7 | 32.0 |
For CL-2D and CL-3D, the sampling distance [µm] column gives the distance at which context sampling is performed, with nearest neighbor (NN) sampling being a special case of CL-3D. The “Input” column denotes the modalities used as input for each method. Using the cross-section sampling strategy of CL-3D achieves the overall highest consistency. Bold values indicate the highest mean IoU score per column.
3.5 Using CL-3D features for retrieval of common fiber orientation patterns
The proposed CL-3D method embeds image patches with similar fiber architectural properties as nearby points in the learned feature space. We therefore expect that the resulting feature representations can be used to retrieve similar structures, given a known “prototype” image patch. To evaluate the suitability for such an application, we investigate how well U-fiber structures can be found in 3D-PLI image data from a few image examples. Although U-fibers are represented by one of the principal axes (Fig. 7D(7)), they do not appear as individual clusters in Figure 9E. To demonstrate that the representations can still be used to identify specific nerve fiber architectures, we provide a few positive examples of U-fibers to perform a query for similar fiber configurations (Fig. 11).
Unfortunately, access to detailed annotated data in combination with 3D-PLI measurements is limited, as reliable identification of positive and negative examples is not always possible. However, we can still search for U-fibers by selecting only a few positive examples that could be identified with confidence from 3D-PLI images by Takemura et al. (2020). For this purpose, we select up to six query points as positive examples (Fig. 11A) and compute affinity maps showing, as responses, the similarity of all voxels in the feature maps to the averaged query features in representational space. For this experiment, we smooth the PCA-reduced feature maps with 20 components (80.4% total explained variance) by a 2D Gaussian kernel analogous to Section 3.4, but with the smoothing parameter set to 2 to increase the receptive field of the texture features. This makes it possible to represent texture of larger structures such as U-fibers and improves retrieval results. We calculate the affinity between feature points using a Gaussian radial basis function (RBF) kernel with a kernel parameter of 3.5.
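The retrieval procedure can be sketched as follows; the Python example below is illustrative only, and the query coordinates, the RBF parameter, and the array shapes are placeholders rather than the values used in the study.

```python
# Sketch: smooth a PCA-reduced feature map, average the features at a few query
# locations, and compute an RBF affinity of every location to the averaged query.
import numpy as np
from scipy.ndimage import gaussian_filter
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
feature_map = rng.normal(size=(256, 256, 20))  # stand-in: H x W map of 20 PCA components

# In-plane Gaussian smoothing of each feature channel (enlarges the receptive field)
smoothed = gaussian_filter(feature_map, sigma=(2, 2, 0))

# Average features at a handful of positive query locations (hypothetical coordinates)
query_points = [(40, 50), (41, 200), (120, 180)]
query_mean = np.mean([smoothed[r, c] for r, c in query_points], axis=0, keepdims=True)

# Affinity of every location to the averaged query, reshaped to a 2D affinity map
flat = smoothed.reshape(-1, smoothed.shape[-1])
affinity_map = rbf_kernel(flat, query_mean, gamma=0.1).reshape(smoothed.shape[:2])  # gamma is a placeholder
```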
The affinity maps reveal peak responses at locations displaying U-fiber structures (Fig. 11B). They respond to all U-fibers identified by an expert, except for one, highlighted by the asterisk. In addition, the activations highlight some fiber bundles that are not labeled as U-fibers, such as the vertical occipital fascicle (VOF) and the stratum calcarinum (SC). Although these false positive activations for VOF and SC could be reduced by adding more query points, they do not disappear completely.
4 Discussion
4.1 CL-3D features encode fundamental aspects of fiber architecture
Feature representations extracted by the proposed CL-3D method from 3D-PLI image patches encode distinct aspects of fiber architecture. Our experiments in Section 3.4 showed that the features form hierarchical clusters that represent gray and white matter, myeloarchitectonic layer structures, fiber bundles, fiber crossings, and fiber fannings. These clusters are spatially consistent and often highlight specific characteristics of fiber architecture as locally connected structures (Fig. 9). Even without explicit clustering, the main PCA components of the feature embedding space produce maps that highlight fundamental principles of fiber architecture (Fig. 7). This indicates that proximity of features in the embedding space, which is efficient and easy to compute, serves as a suitable proxy measure for similarity of fiber architecture as captured in the corresponding image patches. This is in contrast to directly measuring similarity of 3D-PLI image patches, where strong differences can occur despite very similar fiber configurations. Thus, CL-3D features are well suited to facilitate downstream applications for 3D-PLI image analysis.
4.2 CL-3D features are robust to variations in histological processing
While encoding relevant aspects of fiber architecture, the proposed CL-3D features proved to be robust against many other sources of texture variation. We observed this robustness by stacking consecutive sections in 3D and comparing the cluster segments derived from CL-3D features. The segments showed a high overlap across brain sections (Table 2), in particular compared with the clustering of classical texture features and a pretrained encoder on ImageNet as baselines, but also with respect to CL-2D features. Clustering results of these methods were overall not consistent enough (see Appendix Fig. A3) to perform the same in-depth qualitative evaluation as carried out for CL-3D in Section 3.4.1. Volume renderings in Figure 10 showed that clusters of CL-3D features form spatially coherent 3D segments. This suggests that CL-3D promotes the learning of representations that are more robust to discrepancies between independently processed sections. CL-3D features were also found to be robust to the absolute in-plane orientation of texture, as shown by consistent laminar patterns of cluster assignments regardless of their absolute orientation in 2D (Fig. 9).
Some of the robustness of CL-3D features can be attributed to the introduced context sampling across brain sections, as shown in Table 2. CL-3D features demonstrated significantly higher overlap in cluster assignments between sections compared with models trained with in-plane context sampling (CL-2D) or without context sampling (CL-2D with a sampling distance of 0).
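The cross-section sampling of positive pairs can be illustrated with the following simplified Python sketch; function names, the number of sections, and the pixel spacing are assumptions for illustration and do not reproduce the exact training pipeline.

```python
# Sketch: for an anchor patch location in section i, draw a positive example from a
# nearby location in an adjacent section of the 3D-reconstructed volume.
import numpy as np

rng = np.random.default_rng(0)

def sample_positive(anchor_xy, section_idx, n_sections, max_dist_um, um_per_px):
    """Return (neighbor section index, approximate in-plane position of the positive)."""
    neighbor = int(np.clip(section_idx + rng.choice([-1, 1]), 0, n_sections - 1))
    offset_px = rng.uniform(-max_dist_um, max_dist_um, size=2) / um_per_px
    return neighbor, np.asarray(anchor_xy, dtype=float) + offset_px

# Example: anchor at (x, y) = (1024, 2048) px, sampling distance 118 µm
# (number of sections and pixel size are placeholders)
print(sample_positive((1024, 2048), section_idx=50, n_sections=200,
                      max_dist_um=118, um_per_px=1.3))
```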
Another factor contributing to the robustness may be the data augmentations specifically designed for 3D-PLI parameter maps, introduced in Section 2.2. CL-3D and CL-2D models trained with the introduced augmentations showed increased feature quality and robustness in Section 3.1.2. A comparison of individual augmentations identified the highest benefit from using color distortions (modulation of section thickness and the attenuation coefficient). For CL-2D, using the blur augmentation alone performed worse than training a model without augmentations. This is in line with T. Chen et al. (2020), who found the highest benefit from color distortions and the least benefit from blur. For CL-3D, the geometric affine and flip augmentations each had a negative effect when used as the only augmentation. This is surprising at first glance but could be explained by natural geometric distortions in the training pairs, which were sampled from unregistered, neighboring tissue sections. Including geometric transformations in the full set of augmentations, however, did not negatively impact CL-3D training.
Remaining inconsistencies in cluster assignments between sections, as shown in the sagittal and axial planes in Figure 9A and E, can be attributed to variations between brain sections that are still captured in CL-3D features. For example, the “WM (Outlier sections)” cluster highlights white matter in sections with degraded transmittance, which we regard as a histological artifact. Imprecisions in the 3D reconstruction of adjacent brain sections could also contribute to this effect. The cross-section discontinuity of cluster assignments could be mitigated by postprocessing, applying the same spatial smoothing of features in the axial and sagittal planes as was performed for feature maps in the coronal plane.
CL-3D and CL-2D features showed sensitivity to the relative cutting angle of cortical voxels, as shown by the ability of a linear model to predict measured obliqueness from them (Table 1). Obliqueness is a local property of histological images that is consistent across adjacent sections and could thus be exploited by the contrastive learning objective to identify nearby positive pairs. For the majority of cortical voxels, however, this effect on the features seems to be small. Only for very high obliqueness did CL-3D features form smaller branches in the UMAP projection in Figure 9D or isolated clusters in Figure 9E. This aligns with observations from the scatter plots (Fig. 8A) indicating that CL-3D features primarily encode obliqueness for larger cutting angles, showing limited ability to predict smaller angles. If an encoding of obliqueness is not desired in downstream applications, a supervisory signal that assigns texture patches with different cutting angles to the same label could be helpful. For unsupervised learning, treating obliqueness as a confounding variable (Dinga et al., 2020; Snoek et al., 2019) could also help to reduce its effect.
4.3 Retrieval and mapping as possible downstream applications
Fiber architecture is expressed in highly complex textures when measured at microscopic resolution. This makes it extremely challenging to navigate and explore larger stacks of 3D-PLI images. An obvious, albeit simple, application is the search for similar local configurations of nerve fibers, given a template image patch used as a prototype for such a query. We took a search for U-fiber structures as an example (Fig. 11), where independent expert annotations could be obtained from a previous study, and were able to demonstrate the feasibility of such a retrieval task with the proposed CL-3D features.
We showed that features cluster into groups that follow certain fiber bundles such as the sagittal stratum or the tapetum (Fig. 9). While this might at first suggest the use of the features for automated brain mapping tasks, the clusters did not lend themselves to a sufficiently accurate delineation of anatomical structures. This could be due to partial volume effects of the patches used to represent texture, or to the smoothing performed before clustering to denoise the features. As contrastive learning focuses on the most characteristic properties of texture to identify positive pairs, some aspects of fiber architecture might overshadow others. The clustering shown in Figure 9E, for example, did not fall into accurate GM/WM segments. This could be due to the features reflecting the density of myelinated fibers more than other aspects of fiber architecture, thus including the deepest cortical layers in the WM segment, where the density of myelinated fibers is still very high. Since the degree of sharpness of the boundary between cortex and white matter constitutes a criterion for the identification of architectonically distinct areas (Niu et al., 2020), another explanation would be that brain areas with a blurry boundary are those in which voxels of layer VI merge into the WM segment. It should be noted, though, that the proposed feature extraction method is not specifically designed for performing automatic brain mapping. A supervised approach for brain mapping as a downstream application can nevertheless be promising with the proposed features. Linear evaluation, as shown in Figure 5, demonstrated that a linear classifier on top of CL-3D or CL-2D features required only 30 examples per tissue class (gray matter, white matter, or background) to perform convincing classification, significantly reducing the amount of manual annotation required. A more systematic investigation into fiber architectonic mapping based on CL-3D features will be an important follow-up to this study.
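Such a linear evaluation requires very little code; the following hedged Python sketch uses synthetic placeholder features and labels rather than the data of this study.

```python
# Sketch: train a linear classifier on frozen patch features with 30 examples per tissue class.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_per_class, n_dims = 30, 128
classes = ["gray matter", "white matter", "background"]

# Placeholder frozen features for 30 annotated patches per class
X_train = np.concatenate([rng.normal(loc=i, size=(n_per_class, n_dims)) for i in range(3)])
y_train = np.repeat(np.arange(3), n_per_class)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Predict tissue classes for unseen patch features
X_test = rng.normal(loc=1.0, size=(5, n_dims))
print([classes[i] for i in clf.predict(X_test)])
```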
4.4 Relationship of feature representations with brain morphology
Cortical layers are arranged along the cortical depth and have distinct characteristic architectures. Being able to regress cortical depth from texture features using a simple linear model therefore suggests that the features robustly encode information about the layering structure of the cortex, which is important for downstream applications such as brain mapping. The high amount of variance in cortical depth that could be explained by a linear model from CL-3D and CL-2D features (Table 1) indicates that these models indeed encode layer-related textures. For CL-3D, this claim is also supported by the observation that the main PCA components highlight individual cortical layers (Fig. 7), while clustering of the features shows several clusters that group more superficial and deeper layers (Fig. 9). Furthermore, being able to regress the depth of patches within white matter (Table 1) suggests that CL-3D and CL-2D features robustly separate deep from superficial fiber bundles such as U-fibers (Decramer et al., 2018; Shinohara et al., 2020). Classical texture features, as well as a model pretrained on ImageNet, however, demonstrated significantly lower capability in predicting cortical and white matter depth. A CL-2D model with a sampling distance of 0, that is, without context sampling, performed much worse. This highlights the positive effect of the introduced context sampling in learning expressive representations for fiber architecture in 3D-PLI.
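The corresponding linear readout for a continuous target such as cortical depth can be sketched analogously; the example below uses synthetic placeholder data, whereas the scores in Table 1 were obtained with the actual features.

```python
# Sketch: fit a linear regression from frozen patch features to cortical depth and
# report the explained variance (R^2) on held-out data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
features = rng.normal(size=(5_000, 128))  # stand-in patch features
depth = features @ rng.normal(size=128) * 0.01 + rng.normal(scale=0.1, size=5_000)

X_tr, X_te, y_tr, y_te = train_test_split(features, depth, test_size=0.2, random_state=0)
reg = LinearRegression().fit(X_tr, y_tr)
print(f"explained variance (R^2): {reg.score(X_te, y_te):.2f}")
```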
Methods for analyzing the laminar structure of the cortex typically require representations that are robust to cortical folding (Leprince et al., 2015; Schleicher et al., 1999; Waehnert et al., 2014). While CL-3D and CL-2D learned to represent some curvature-related patterns such as for fanning radial fibers at gyral crowns (Fig. 9), predicting curvature from the features did not work well (Table 1), indicating moderate robustness of the proposed feature representations to cortical folding. For CL-3D and CL-2D, the weak encoding might be attributed to the affine transformation applied in the contrastive learning setup, which performs scaling and shearing operations on the texture examples. It should be noted that in addition to the curvature definition used in Section 3.3, other established definitions (Goldman, 2005) that have not been considered in this study might lead to different results.
5 Conclusion and Outlook
Aiming to improve automatic mapping and analysis of fiber orientation distributions in the brain, we introduced a self-supervised contrastive learning scheme for extracting “deep” feature representations for 3D-PLI image patches at micrometer resolution. We specifically proposed 3D-Context Contrastive Learning (CL-3D), introducing a context sampling strategy that samples positive pairs based on their spatial proximity across nearby brain sections. Without any anatomical prior information given during training, the feature representations extracted by CL-3D were shown to highlight fundamental patterns of fiber architecture in both gray and white matter, such as myelinated radial and tangential fibers within the cortex, fiber bundles, crossings, and fannings. At the same time, the feature representations of CL-3D proved to be more robust to variations between independently measured sections, such as artifacts arising from histological processing, compared with statistical methods, an encoder pretrained on natural images, and representations obtained by in-plane sampling of positive pairs in contrastive learning (CL-2D).
The present study opens new perspectives for automated analysis of fiber architecture in 3D-PLI. Due to the low-dimensional embedding space, CL-3D feature representations can aid the interpretation of 3D-PLI textures and improve the computational efficiency of downstream 3D-PLI analysis. For example, the learned feature representations can be used to develop spatial maps of specific aspects of fiber orientation distributions, such as U-fibers, which allow comparison of fiber architecture with other modalities linked to brain atlases. They can also be used to train discriminative models for downstream tasks such as segmenting tissue classes, cortical layers, fiber bundles, or even brain areas with a minimal number of labeled positive and negative examples.
An important direction for future research will be to extend the trained models to larger training datasets. We intend to extend the approach to whole-brain datasets, possibly including multiple species. Since the main challenge for establishing training data is the precise 3D reconstruction from individual brain sections, it will be helpful to investigate to what extent approximate registrations are sufficient for CL-3D. Furthermore, we plan to integrate deep 3D-PLI features into brain atlases to provide easy accessibility. For the human brain, the BigBrain model (Amunts et al., 2013) would be an ideal reference for this integration, as it is already used for multimodal data integration of other imaging modalities, such as cyto- and receptor architecture.
Data and Code Availability
The software used to implement and train self-supervised CL-3D and CL-2D models is available on GitHub (https://github.com/FZJ-INM1-BDA/cl-3d). An implementation of the introduced data augmentations for 3D-PLI images (https://jugit.fz-juelich.de/inm-1/bda/software/data_processing/pli-transforms) as well as additional dependencies (https://jugit.fz-juelich.de/inm-1/bda/software) are available on GitLab.
Volumetric clustering results, PCA projections of extracted CL-3D features, measures of cortex morphology, and 3D-PLI fiber orientation and transmittance maps for reference are available on the EBRAINS data sharing platform (Oberstrass et al., 2024).
The complete set of high-resolution 3D-PLI images for the vervet monkey occipital lobe is hosted on the Jülich supercomputing facility. A subset of selected 3D-PLI images is available on EBRAINS (Axer, Gräßel, et al., 2020), with a future publication of the whole stack of images planned.
Ethics Statement
The vervet monkeys used in this study were part of the Vervet Research Colony housed at the Wake Forest School of Medicine. Our study did not include experimental procedures with living animals. Brains were obtained when animals were sacrificed to reduce the size of the colony; animals were maintained and sacrificed in accordance with the guidelines of the Wake Forest Institutional Animal Care and Use Committee (IACUC #A11-219) and the AVMA Guidelines for the Euthanasia of Animals.
Author Contributions
Alexander Oberstrass: Conceptualization, Methodology, Software, Formal analysis, Investigation, Data curation, Writing—original draft, Visualization. Sascha E.A. Muenzing: Methodology, Data curation, Writing—review and editing. Meiqi Niu: Validation, Data curation, Investigation, Writing—review and editing. Nicola Palomero-Gallagher: Validation, Investigation, Writing—original draft, Writing—review and editing. Christian Schiffer: Conceptualization, Methodology, Software, Writing—review and editing. Markus Axer: Conceptualization, Validation, Investigation, Supervision, Writing—review and editing, Project administration, Funding acquisition. Katrin Amunts: Conceptualization, Validation, Investigation, Supervision, Writing—review and editing, Resources, Project administration, Funding acquisition. Timo Dickscheid: Conceptualization, Methodology, Supervision, Writing—original draft, Writing—review and editing, Project administration, Funding acquisition.
Funding
This project received funding from the Helmholtz Association’s Initiative and Networking Fund through the Helmholtz International BigBrain Analytics and Learning Laboratory (HIBALL) under the Helmholtz International Lab grant agreement InterLabs-0015, the Helmholtz Association portfolio theme “Supercomputing and Modeling for the Human Brain,” and the European Union’s Horizon 2020 Research and Innovation Programme, grant agreement 945539 (HBP SGA3), which is now continued in the European Union’s Horizon Europe Programme, grant agreement 101147319 (EBRAINS 2.0 Project). Computing time was granted through JARA on the supercomputer JURECA at Jülich Supercomputing Centre (JSC). Vervet monkey research was supported by the National Institutes of Health under grant agreements R01MH092311 and P40OD010965.
Declaration of Competing Interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Acknowledgements
We sincerely thank Karl Zilles and Roger Woods for their valuable collaboration in the vervet brain project, the laboratory team of the Institute of Neuroscience and Medicine (INM-1) for preparing and measuring the brain sections, and the members of the Big Data Analytics and Fiber Architecture groups (both INM-1) for their valuable inputs and discussions during the development of this research.