The three-dimensional segmentation algorithm for cerebellar tissues in this study encompasses cerebellar segmentation for both structural and functional magnetic resonance imaging (MRI), super-resolution reconstruction, and BOLD sequence registration. Accordingly, we analyze current research in three key areas: segmentation, super-resolution, and registration.
Status of research on image segmentation
In recent years, deep learning has achieved remarkable progress in medical image segmentation, leading to the development of several highly effective models. One pioneering model, U-Net, introduced by Ronneberger et al.9 in 2015, employs an encoder-decoder architecture with skip connections, enabling the integration of multi-resolution features. U-Net has been extensively applied to tasks such as cell nucleus segmentation, organ segmentation, and lesion detection. Subsequently, Milletari et al.10 introduced V-Net in 2016, a model specifically designed for 3D segmentation. V-Net is capable of processing volumetric data, making it particularly suitable for applications involving computerized tomography (CT) and MRI scans. Attention U-Net enhances the standard U-Net architecture by incorporating attention mechanisms,11 which allow the model to focus on relevant regions within an image, thereby improving accuracy in complex backgrounds. This model is especially effective for fine-grained segmentation tasks, such as tumor delineation. The DeepLab series,12 including DeepLabv3 and DeepLabv3+,13 represents a significant advancement in high-resolution image segmentation. These models employ dilated convolutions and atrous spatial pyramid pooling to capture multi-scale contextual information, enhancing segmentation performance. nnU-Net, proposed by Isensee et al.,14 is a self-adapting U-Net framework that automatically configures the network architecture and training strategies based on dataset characteristics. This model has demonstrated exceptional adaptability and performance across a wide range of medical image segmentation tasks. Çiçek et al.15 extended the U-Net architecture to 3D U-Net, which is specifically designed for 3D segmentation tasks involving organs such as the brain, lungs, and liver. This model leverages 3D convolutions to effectively capture volumetric information. Figure 1 presents sketches of the deep neural networks mentioned.
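To make the encoder-decoder idea concrete, the following minimal PyTorch sketch shows a single-level U-Net-style network in which an encoder feature map is concatenated with the upsampled decoder feature map through a skip connection. The channel counts, depth, and layer choices are illustrative assumptions rather than the configuration of any cited model; a 3D variant such as 3D U-Net or V-Net would replace the 2D layers with their 3D counterparts.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal one-level U-Net-style network illustrating skip connections."""
    def __init__(self, in_ch=1, base_ch=16, num_classes=2):
        super().__init__()
        self.enc = nn.Sequential(                      # encoder block
            nn.Conv2d(in_ch, base_ch, 3, padding=1), nn.ReLU(inplace=True))
        self.down = nn.MaxPool2d(2)                    # spatial downsampling
        self.bottleneck = nn.Sequential(
            nn.Conv2d(base_ch, base_ch * 2, 3, padding=1), nn.ReLU(inplace=True))
        self.up = nn.ConvTranspose2d(base_ch * 2, base_ch, 2, stride=2)
        self.dec = nn.Sequential(                      # decoder block after skip concatenation
            nn.Conv2d(base_ch * 2, base_ch, 3, padding=1), nn.ReLU(inplace=True))
        self.head = nn.Conv2d(base_ch, num_classes, 1)  # per-pixel class logits

    def forward(self, x):
        e = self.enc(x)                    # high-resolution encoder features
        b = self.bottleneck(self.down(e))  # low-resolution context
        d = self.up(b)                     # upsample back to encoder resolution
        d = torch.cat([d, e], dim=1)       # skip connection: fuse multi-resolution features
        return self.head(self.dec(d))

logits = TinyUNet()(torch.randn(1, 1, 64, 64))  # -> shape (1, 2, 64, 64)
```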
Transformer-based models have also made significant contributions to medical image segmentation. Swin-Unet combines the strengths of the Swin Transformer and U-Net, utilizing hierarchical attention mechanisms to achieve multi-scale feature fusion and significantly improve segmentation performance.16 TransUNet integrates Transformer modules into the U-Net framework, effectively handling complex image structures by capturing long-range dependencies.17 MedT (Medical Transformer) further exemplifies the potential of Transformer-based architectures in medical image segmentation, likewise exploiting long-range dependencies to enhance segmentation accuracy.18 Finally, SegResNet combines residual networks with U-Net, employing residual connections to improve model depth and performance.19 This model is particularly well-suited for tasks requiring high-detail processing, such as tissue and lesion segmentation.
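The mechanism these models share, self-attention over image tokens, can be sketched as follows. This is a deliberately simplified global-attention block rather than the windowed or axial schemes of the cited architectures, and the patch size, embedding dimension, and head count are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

class PatchSelfAttention(nn.Module):
    """Illustrative global self-attention over image patch tokens."""
    def __init__(self, in_ch=1, patch=8, dim=64, heads=4):
        super().__init__()
        self.embed = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)  # patchify + project
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        tokens = self.embed(x)                          # (B, dim, H/p, W/p)
        B, C, H, W = tokens.shape
        seq = tokens.flatten(2).transpose(1, 2)         # (B, N, dim) patch tokens
        attended, _ = self.attn(seq, seq, seq)          # every patch attends to every other patch
        seq = self.norm(seq + attended)                 # residual connection + normalization
        return seq.transpose(1, 2).reshape(B, C, H, W)  # back to a spatial feature map

feats = PatchSelfAttention()(torch.randn(1, 1, 64, 64))  # -> (1, 64, 8, 8)
```

Because every token attends to every other token, dependencies between distant image regions are modeled in a single layer, which is the property the cited segmentation models exploit.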
In summary, these deep learning models have demonstrated exceptional performance in various medical image segmentation tasks through continuous optimization and innovation. Their advancements have significantly propelled the field of medical image processing, highlighting the transformative potential of deep learning in medical applications.
Status of super-resolution research
Image super-resolution reconstruction, which restores high-resolution images from low-resolution inputs, is one of the key technologies used to improve the resolution of real-world images and videos in computer vision tasks. It has been widely applied in the real world, including in hyperspectral imaging,20 medical image processing,21 and facial recognition.22 Apart from improving image resolution, super-resolution reconstruction also assists, to a certain extent, with other computer vision tasks.23 Because image super-resolution is ill-posed, in the sense that multiple high-resolution images can correspond to a single low-resolution image, the reconstruction task is quite challenging.24
In recent years, convolutional neural networks have been applied extensively to image super-resolution research. From the super-resolution convolutional neural network,25 built on a traditional convolutional architecture, to the super-resolution generative adversarial network,26 built on deep residual generative adversarial networks, a variety of super-resolution methods relying on different network architecture designs and training strategies have developed rapidly and improved the performance of image super-resolution reconstruction (Fig. 2).
Super-resolution of medical images refers to the acquisition of low-resolution medical images from various medical imaging devices and the restoration of high-resolution medical images with rich details and clear textures using deep convolutional neural networks. This is conducive to clinical diagnosis, image segmentation,27,28 image registration,29 image fusion,30,31 and three-dimensional visualization of images in medical research. During medical magnetic resonance (MR) image acquisition, various factors, such as imaging equipment, imaging techniques, external interference, and checkerboard artifacts at tissue boundaries, lead to low-resolution images and interfere with the accuracy of clinical diagnosis and subsequent medical research. Therefore, it is of great significance to use deep learning and other approaches to restore MR images with clear tissue boundaries and rich details. With the rapid development of deep learning and computer hardware, computer vision tasks have attracted increasing attention from the academic community, and researchers have begun to explore the application of CNNs to such tasks. Since Tian et al.32 initially applied deep convolutional neural networks to image super-resolution, the quality of image super-resolution reconstruction algorithms based on these networks has significantly improved.
Moreover, Chao et al.33 proposed an enhanced deep super-resolution network for single images, based on deep residual networks, to address the issue of shallow convolutional neural networks being unable to fully extract contextual feature information from images. By removing batch normalization from the residual modules, they were able to stack more convolutional layers with the same computational resources, allowing the network to learn more contextual feature information. Given that the quality of reconstructed images can be improved by increasing the depth of convolutional neural networks in super-resolution reconstruction tasks, Yang et al. successively proposed a deeply-recursive low- and high-frequency fusing network and a precise super-resolution method based on a very deep convolutional neural network, the very-deep super-resolution network.34,35 While further improving algorithm performance by increasing network depth, various studies on image super-resolution based on convolutional neural networks have encountered issues such as vanishing and exploding gradients during training. Cui et al.36 applied residual learning to computer vision tasks and proposed the image processing network ResLT. Building upon research on residual networks for image processing, Ledig et al.26 proposed the super-resolution residual network, which utilizes the concept of residual learning. This approach avoids the loss of contextual information during propagation through the network and addresses the gradient vanishing and exploding problems caused by increased network depth. Moreover, the super-resolution residual network better preserves details in reconstructed images.
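The design change described above, removing batch normalization from the residual modules, can be sketched as a simple residual block; the channel count and residual scaling factor below are illustrative assumptions rather than the cited configuration.

```python
import torch
import torch.nn as nn

class ResBlockNoBN(nn.Module):
    """Residual block without batch normalization, in the style of enhanced deep SR networks."""
    def __init__(self, channels=64, res_scale=0.1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))  # no BatchNorm layers
        self.res_scale = res_scale  # small scaling helps stabilize very deep stacks

    def forward(self, x):
        return x + self.res_scale * self.body(x)  # identity skip preserves contextual information

x = torch.randn(1, 64, 48, 48)
y = ResBlockNoBN()(x)  # same shape; memory saved by dropping BN allows more blocks to be stacked
```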
The application of attention mechanisms in image super-resolution enhances the accuracy of reconstructed images to a certain extent. Zhang et al.37 combined residual learning with dense connections to develop the residual dense network, and further proposed the deep residual channel attention network,38 a super-resolution model built on the channel attention mechanism (Fig. 3). Du et al.39 reduced the network parameter count and improved MR image super-resolution reconstruction quality by using depth-wise separable convolutions instead of traditional convolutional layers. However, the aforementioned super-resolution reconstruction algorithms based on convolutional neural networks prioritize higher objective performance metrics but overlook perceptual image quality, resulting in problems such as artifacts and blurriness in the reconstructed super-resolution images.
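The channel attention mechanism referred to above can be sketched as follows: features are globally pooled into a per-channel descriptor, passed through a small bottleneck, and used to reweight the original channels. The channel count and reduction ratio are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention: reweight feature channels using globally pooled statistics."""
    def __init__(self, channels=64, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # (B, C, 1, 1) global channel descriptor
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(self.pool(x))   # per-channel weights in (0, 1)
        return x * w                # emphasize informative channels, suppress the rest

feats = torch.randn(1, 64, 48, 48)
out = ChannelAttention()(feats)  # same shape, channel-wise reweighted
```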
Compared with conventional super-resolution algorithms based on interpolation theory, image super-resolution algorithms based on convolutional neural networks show remarkable enhancements in network performance and image reconstruction effects.40 However, owing to the intrinsic limitations of image super-resolution, convolutional neural networks often encounter challenges such as checkerboard artifacts and loss of image details when reconstructing super-resolution images with high upscaling factors.41 Such checkerboard artifacts arise largely from the upsampling (transposed) convolutions used in these networks: when the kernel size is not evenly divisible by the stride, the kernel footprints overlap unevenly and introduce periodic interference into the reconstructed image. Li et al.42 introduced a super-resolution algorithm based on generative adversarial networks, further promoting the development of image super-resolution research. To address the image blur caused by detail loss in medical image super-resolution models based on traditional convolutional neural networks, subsequent studies, inspired by Tran et al.,43 who introduced the disentangled representation learning generative adversarial network (DR-GAN) for pose-invariant face recognition, have extended similar adversarial and disentanglement principles to super-resolution tasks. These approaches aim to mitigate the blurred edges and detail loss typically observed in CNN-based SR methods. Building upon the residual concept, Zhao et al.44 further proposed the Laplacian pyramid generative adversarial network based on dense residual blocks, which effectively addresses blurriness and size inconsistency when reconstructing medical images with existing super-resolution algorithms. Wang et al.45 proposed the enhanced super-resolution generative adversarial network (ESRGAN) by introducing the residual-in-residual dense block on top of the residual blocks in the network model, which significantly improves the performance of generative adversarial network-based super-resolution algorithms. By incorporating perceptual loss, adversarial loss, and a relativistic discriminator, ESRGAN's discriminator assesses the relative authenticity of reconstructed images rather than judging each image in isolation, as traditional discriminator networks do.46 This assessment guides the generator network to produce more realistic images through parameter updates during training. Shang et al.47 introduced the receptive field block ESRGAN, based on the enhanced generative adversarial network super-resolution algorithm, whose receptive field modules with different receptive field sizes enable the network to extract richer image detail features, thus enhancing the quality of reconstructed images.
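The relativistic discriminator described for ESRGAN can be illustrated with the following hedged sketch of the relativistic average adversarial losses; the functional form follows the published formulation, but tensor shapes and variable names are assumptions, and in practice this adversarial term is combined with perceptual and pixel-wise losses.

```python
import torch
import torch.nn.functional as F

def relativistic_d_loss(real_logits, fake_logits):
    """Relativistic average discriminator loss: real images should look more
    realistic than the average reconstruction, and vice versa."""
    loss_real = F.binary_cross_entropy_with_logits(
        real_logits - fake_logits.mean(), torch.ones_like(real_logits))
    loss_fake = F.binary_cross_entropy_with_logits(
        fake_logits - real_logits.mean(), torch.zeros_like(fake_logits))
    return (loss_real + loss_fake) / 2

def relativistic_g_loss(real_logits, fake_logits):
    """Generator counterpart: push reconstructions to look more realistic
    than the average real image."""
    loss_real = F.binary_cross_entropy_with_logits(
        real_logits - fake_logits.mean(), torch.zeros_like(real_logits))
    loss_fake = F.binary_cross_entropy_with_logits(
        fake_logits - real_logits.mean(), torch.ones_like(fake_logits))
    return (loss_real + loss_fake) / 2

real_logits = torch.randn(8, 1)   # discriminator outputs for real high-resolution patches
fake_logits = torch.randn(8, 1)   # discriminator outputs for reconstructed patches
d_loss = relativistic_d_loss(real_logits, fake_logits)
g_adv = relativistic_g_loss(real_logits, fake_logits)
# In ESRGAN-style training, g_adv is combined with a perceptual (feature-space)
# loss and a pixel-wise L1 loss to form the full generator objective.
```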
While existing image super-resolution methods based on generative adversarial networks have improved the overall visual quality of images in practical applications, they often introduce unnatural artifacts when reconstructing image details. Geets et al.48 proposed a method based on the statistical dependencies of image gradients and edges at different resolutions. Sun et al.49 presented a method based on gradient contours representing image gradients and gradient field transformations. Yan et al.50 introduced an image super-resolution algorithm based on gradient contour sharpness to improve the clarity of super-resolved images. In these methods, the statistical dependency relationship is modeled by estimating parameters related to high-resolution edges from parameters learned on low-resolution images.50 Ma et al.51 proposed a structure-preserved super-resolution algorithm based on gradient guidance, which employs second-order gradient constraints in deep generative adversarial networks to provide better structural guidance for image reconstruction, effectively addressing issues such as structural distortion in reconstructed images.51 Compared with super-resolution reconstruction algorithms using convolutional neural networks, medical image super-resolution algorithms based on deep generative adversarial networks restore edge details and texture information more accurately, making the visual quality of reconstructed images more suitable for clinical diagnostic needs. Ordinary convolutional neural networks, whose convolutional kernels are translation invariant, tend to lose shallow and local image features as network depth increases, resulting in blurred edges and potential checkerboard artifacts in reconstructed images. In contrast, image super-resolution algorithms based on deep generative adversarial networks update the generator and discriminator parameters through backpropagation, guiding the generated samples toward more realistic values.
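As a simplified, first-order illustration of the gradient-guided strategies discussed above (the structure-preserved method of Ma et al.51 additionally uses a dedicated gradient branch and second-order constraints), the sketch below computes finite-difference gradient maps and penalizes the mismatch between the gradient maps of the super-resolved and reference images; tensor shapes and names are assumptions.

```python
import torch
import torch.nn.functional as F

def gradient_map(img):
    """Finite-difference gradient magnitude of a (B, C, H, W) image tensor."""
    dx = img[..., :, 1:] - img[..., :, :-1]      # horizontal differences
    dy = img[..., 1:, :] - img[..., :-1, :]      # vertical differences
    dx = F.pad(dx, (0, 1, 0, 0))                 # pad back to the original width
    dy = F.pad(dy, (0, 0, 0, 1))                 # pad back to the original height
    return torch.sqrt(dx ** 2 + dy ** 2 + 1e-6)

def gradient_guidance_loss(sr, hr):
    """Penalize mismatch between gradient maps of the super-resolved and
    reference images, encouraging sharp, structurally faithful edges."""
    return F.l1_loss(gradient_map(sr), gradient_map(hr))

sr = torch.rand(1, 1, 64, 64)   # super-resolved output
hr = torch.rand(1, 1, 64, 64)   # high-resolution reference
loss = gradient_guidance_loss(sr, hr)  # added to the usual pixel/adversarial losses
```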
Status of image registration algorithm research
Currently, image registration algorithms can be divided into non-learning-based and learning-based methods. Traditional non-learning-based registration methods are mainly feature-based registration algorithms, whose research status is described below.52 Feature-based registration algorithms first extract features from the reference image and the floating image, generally including feature points, image edges, image structures, and statistical features. Then, through a matching strategy, they establish correspondences between features and calculate the deformation parameters of the image pairs through feature matching. Specifically, feature-based image registration involves the following steps,53 which are illustrated with a brief code sketch after the four steps:
Feature extraction
Feature extraction is a pivotal task in the image registration process. It can be either manual or automatic, depending on the image complexity. Features such as closed boundary regions, textures, edges, points, lines, statistical features, and more advanced structures and semantic descriptions can serve as distinctive characteristics. These features must be easily identifiable and invariant to ensure that both the reference and floating images share sufficient common features. Robust algorithms are required for feature detection to extract as many features as possible from image pairs, irrespective of structural deformations.
Feature matching
The goal of this step is to establish precise correspondences between features, creating a matching method between the features of the reference and floating images. Various feature descriptors and similarity measures are employed to facilitate accurate feature correspondence. Feature descriptor designs should ensure the accurate reflection of global or local image characteristics, even in the presence of noise.
Transformation model estimation
Registration transformation models encompass rigid and non-rigid transformations. The selection of a transformation model depends on the image acquisition process and prior knowledge of expected image deformations. To align the reference and floating images, the deformation parameters of the transformation model must be estimated using feature correspondences.
Image resampling
Resampling of the floating image is performed using the estimated optimal deformation parameters. Following the transformation of image coordinates, the resulting position coordinates are typically non-integer values. Thus, interpolation operations are commonly employed for image resampling to address this issue.
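As an illustration only, the four steps above can be strung together for 2D grayscale slices with OpenCV. A practical medical registration pipeline is usually 3D and often non-rigid; the SIFT detector, ratio-test threshold, and affine model below are assumptions made for the sketch.

```python
import cv2
import numpy as np

def feature_based_register(reference, floating):
    """Illustrative 2D feature-based registration: extract, match, estimate, resample.
    Assumes 8-bit grayscale inputs and that at least 3 good matches are found."""
    sift = cv2.SIFT_create()

    # 1. Feature extraction: keypoints and descriptors from both images
    kp_ref, des_ref = sift.detectAndCompute(reference, None)
    kp_flo, des_flo = sift.detectAndCompute(floating, None)

    # 2. Feature matching: nearest-neighbour matching with Lowe's ratio test
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = matcher.knnMatch(des_flo, des_ref, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]

    # 3. Transformation model estimation: robust affine fit (RANSAC rejects mismatches)
    src = np.float32([kp_flo[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_ref[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    affine, _ = cv2.estimateAffine2D(src, dst, method=cv2.RANSAC)

    # 4. Image resampling: warp the floating image onto the reference grid
    h, w = reference.shape
    return cv2.warpAffine(floating, affine, (w, h), flags=cv2.INTER_LINEAR)
```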
In the context of medical image registration, the scale-invariant feature transform (SIFT) algorithm, applied by Al-Khafaji et al.,54 has been widely used. SIFT, an early keypoint detection algorithm, is invariant to translation, rotation, and other transformations, but it extracts a large number of point features, resulting in high computational complexity. To accelerate feature computation, Bay et al.55 proposed Speeded-Up Robust Features (SURF), which is more stable and computationally efficient than SIFT. Apart from SIFT and SURF, various other feature description operators have been utilized in image registration,56 such as Harris corners.57 Shen et al.58 proposed the hierarchical attribute matching mechanism for elastic registration algorithm, which extracts a set of geometric moment invariants for each image point. Experimental results have demonstrated its effectiveness in registering brain images with significant anatomical differences. However, this algorithm requires pre-segmentation of brain tissues, which poses a challenge and limits its applicability. To overcome this limitation, Papamarkos et al.59 proposed an approach that achieves gray-level reduction through the combined utilization of both the image's gray levels and additional local spatial features. These histogram-based attribute vectors exhibit rotation invariance and have been successfully applied to register various data, including brain MRI and diffusion tensor imaging. Nonetheless, registration results are significantly impacted by the accuracy of feature extraction, and inaccurate feature extraction can lead to substantial registration errors. Therefore, research on these algorithms primarily focuses on feature design.
With the rapid advancement of deep learning in computer vision and other fields, there has been an abundance of registration algorithms based on deep learning, with CNNs playing a significant role in medical image registration. Early deep learning registration methods primarily focused on using deep learning to extract features from reference and floating images,60 or to learn similarity metrics for image pairs.61 These learned features and similarity metrics were then integrated into traditional registration frameworks to significantly improve registration effectiveness. Yoo et al.62,63 utilized a stacked convolutional autoencoder to extract features from pairs of 3D brain MR images and then optimized the normalized cross correlation between the two sets of features using gradient descent. Their experiments indicated that, in single-modality registration, the feature descriptors extracted by deep learning might not surpass manually defined descriptors, but they can provide complementary information. Additionally, drawing on registration experience with CT images, Zhu et al.64 used a CNN to estimate the registration error of chest CT-MRI images and employed the learned registration error as a similarity metric for subsequent registration (Fig. 4).65
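The similarity objective mentioned above, normalized cross correlation between two feature sets, can be written as a differentiable function suitable for gradient-descent optimization. The sketch below is a global (non-windowed) formulation, and the tensor shapes and names are illustrative assumptions.

```python
import torch

def normalized_cross_correlation(a, b, eps=1e-8):
    """Differentiable global NCC between two feature volumes of equal shape.
    Values near 1 indicate good alignment; the negative can serve as a loss."""
    a = a - a.mean()
    b = b - b.mean()
    return (a * b).sum() / (a.norm() * b.norm() + eps)

fixed_feats = torch.rand(1, 8, 32, 32, 32)    # e.g., features of the reference volume
moving_feats = torch.rand(1, 8, 32, 32, 32)   # features of the (warped) floating volume
similarity = normalized_cross_correlation(fixed_feats, moving_feats)
loss = -similarity  # maximize NCC by minimizing its negative with gradient descent
```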
The advantage of deep learning in grayscale-based registration is particularly evident in multi-modal scenarios, where designing effective multi-modal similarity metrics is challenging. Andrade et al.66 designed a stacked denoising autoencoder to learn a similarity metric for CT and MR images to achieve rigid registration; the learned multi-modal similarity metric outperformed traditional normalized mutual information (NMI). These methods break the limitations of manually designed prior knowledge, effectively improving registration performance while retaining the iterative nature of traditional registration. However, they have not fundamentally solved the problem of slow registration caused by iterative optimization. Therefore, more and more researchers are focusing on directly estimating deformation parameters using convolutional neural networks (ConvNets). Sankareswaran et al.67 used ConvNets to learn rigid transformation parameters, showing significant advantages in registration accuracy and real-time performance compared with grayscale-based methods. Huang et al.68 trained ConvNets to directly estimate the displacement vector field of image pairs, achieving the same accuracy as traditional registration methods. Yan et al.69 proposed the adversarial image registration framework, inspired by generative adversarial networks, for rigid registration of 3D MR and transrectal ultrasound (TRUS) image pairs: the generator estimates the deformation parameters of the image pair, the discriminator distinguishes between real and predicted deformation images, and the network is trained with an adversarial supervision strategy. These methods demonstrate good registration performance but require labeled training data; typically, deformation parameters obtained with traditional registration methods serve as labels, or synthetic supervised training data are created using random deformation parameters. The performance of such supervised methods therefore largely depends on the reliability of the labels, which has driven the development of semi-supervised registration models. Haskins et al.70 used label similarity to train a CNN model for MR-TRUS image registration. In their initial registration scheme, two network models were trained to estimate global affine transformation parameters and local dense deformation fields, with the result of global registration serving as input for local registration, achieving coarse-to-fine registration. To further improve the practicality of the model, subsequent work combined the two parts into an end-to-end framework, achieving end-to-end registration with CNNs. In another work, Saldanha et al.71 combined double supervision and weak supervision on breast phantoms, constructing a loss function from label similarity (segmentation overlap distance) and image grayscale similarity (an edge-based normalized gradient field distance), ultimately achieving deformable registration of 2D MR images.
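The label-similarity term used in such weakly and doubly supervised schemes is commonly a segmentation-overlap measure. The following is a minimal sketch of a soft Dice overlap loss between the warped floating segmentation and the reference segmentation; the shapes and names are illustrative assumptions, and in practice this term is combined with an image-similarity term as described above.

```python
import torch

def soft_dice_loss(warped_seg, fixed_seg, eps=1e-6):
    """Label-similarity term for weakly supervised registration: one minus the soft
    Dice overlap. Both inputs are probability maps of shape (B, C, ...)."""
    dims = tuple(range(2, warped_seg.dim()))             # spatial dimensions
    intersection = (warped_seg * fixed_seg).sum(dims)
    union = warped_seg.sum(dims) + fixed_seg.sum(dims)
    dice = (2 * intersection + eps) / (union + eps)
    return 1 - dice.mean()

warped_seg = torch.rand(1, 3, 64, 64)   # floating-image labels after warping (hypothetical)
fixed_seg = torch.rand(1, 3, 64, 64)    # reference-image labels (hypothetical)
loss = soft_dice_loss(warped_seg, fixed_seg)
```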
Blendowski et al.72 introduced an integrated multimodal registration method guided by a shape encoder-decoder network. First, a segmentation network is trained with anatomically labeled data, and then an energy-based iterative optimization method is used to estimate the deformation parameters between image pairs. This method relies on intermediate segmentation results but can simplify the registration of CT and MR images in cases of large deformations.72 Similarly, Fu et al.73 used a Laplacian pyramid network with anatomical label supervision to overcome large structural differences between image pairs and used data augmentation to mitigate overfitting. These semi-supervised methods reduce the model's reliance on labeled data but remain strongly influenced by the quality of the available labels.

Therefore, many researchers are focusing on unsupervised registration models, especially since the emergence of spatial transformer networks, which have given rise to a large number of spatial transformer network-based image registration models.74 de Vos et al.75 proposed the unsupervised deformable registration model DIRNet, in which a ConvNet regressor predicts the displacements of 2D control points, a cubic B-spline spatial transformer converts them into a dense displacement vector field, and a resampler deforms the floating image. Similarly, Ji et al.76 developed ADMIR (affine and deformable medical image registration), an unsupervised end-to-end brain MRI registration framework that combines affine and non-linear registration. When the sizes of the reference and floating images differ, pre-registration is usually required, whereas ADMIR can complete registration end-to-end, effectively saving registration time. However, this method cannot adapt to images of arbitrary sizes: the image size must match that of the model's training set. Balakrishnan et al.77 built the VoxelMorph model, achieving nonlinear registration of brain MRI images with results superior to the SyN algorithm in terms of Dice score. Although VoxelMorph can accurately estimate dense vector fields of image pairs, later work has shown that the model performs poorly on cardiac CT data. Zhao et al.78 used VoxelMorph as the base network and proposed a recursive cascade registration network, which reduces the number of network parameters and improves registration speed through weight sharing during the testing phase; however, this method has difficulty maintaining smooth deformation fields during the recursive process. All of these works design their networks directly in the high-dimensional image space. Such deep unsupervised deformation-estimation algorithms do not require labeled data, reducing data requirements, but unsupervised registration requires selecting appropriate image similarity metrics as optimization targets. These metrics are mostly global grayscale measures that perform well for overall structural alignment but struggle to capture local deformations accurately. Additionally, the grayscale and texture information of multi-modal medical images differ greatly; after deep convolutional feature extraction, selecting appropriate features from these significantly different representations to quantify the similarity between reference and floating images remains a key challenge for multi-modal image registration.
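The differentiable warping step shared by spatial-transformer-based models such as DIRNet and VoxelMorph can be sketched as follows: a predicted displacement field is added to an identity sampling grid and the floating image is resampled with bilinear interpolation, so that a similarity loss between the warped and reference images can be backpropagated to the network that predicts the field. This shows the resampling step only, not the full cited networks; the pixel-unit convention and tensor shapes are assumptions for a 2D illustration.

```python
import torch
import torch.nn.functional as F

def warp_with_displacement(moving, disp):
    """Differentiably warp a (B, C, H, W) image with a (B, 2, H, W) displacement
    field given in pixel units, as in spatial-transformer-based registration."""
    B, _, H, W = moving.shape
    # Identity sampling grid in pixel coordinates
    ys, xs = torch.meshgrid(torch.arange(H, dtype=moving.dtype),
                            torch.arange(W, dtype=moving.dtype), indexing="ij")
    grid_x = xs.unsqueeze(0) + disp[:, 0]   # displaced x coordinates
    grid_y = ys.unsqueeze(0) + disp[:, 1]   # displaced y coordinates
    # Normalize coordinates to [-1, 1] as required by grid_sample
    grid_x = 2.0 * grid_x / (W - 1) - 1.0
    grid_y = 2.0 * grid_y / (H - 1) - 1.0
    grid = torch.stack([grid_x, grid_y], dim=-1)          # (B, H, W, 2)
    return F.grid_sample(moving, grid, mode="bilinear", align_corners=True)

moving = torch.rand(1, 1, 64, 64)
disp = torch.zeros(1, 2, 64, 64)               # a registration network would predict this field
warped = warp_with_displacement(moving, disp)  # zero displacement returns the input unchanged
# Training minimizes a similarity loss between `warped` and the reference image,
# typically plus a smoothness penalty on the displacement field.
```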