Abstract

As the optical lenses of cameras have a limited depth of field, images captured of the same scene are not all in focus. Multifocus image fusion is an efficient technology that can synthesize an all-in-focus image from several partially focused images. Previous methods have accomplished the fusion task in spatial or transform domains. However, fusion rules are always a problem in most methods. In this letter, from the perspective of focus-region detection, we propose a novel multifocus image fusion method based on a fully convolutional network (FCN) learned from synthesized multifocus images. The primary novelty of this method is that the pixel-wise focus regions are detected through a learned FCN, and the entire image, not just image patches, is exploited to train the FCN. First, we synthesize 4500 pairs of multifocus images by repeatedly applying a gaussian filter to each image from PASCAL VOC 2012 to train the FCN. After that, a pair of source images is fed into the trained FCN, and two score maps indicating the focus property are generated. Next, an inverted score map is averaged with the other score map to produce an aggregative score map, which takes full advantage of the focus probabilities in the two score maps. We implement the fully connected conditional random field (CRF) on the aggregative score map to accomplish and refine a binary decision map for the fusion task. Finally, we exploit the weighted strategy based on the refined decision map to produce the fused image. To demonstrate the performance of the proposed method, we compare its fused results with those of several state-of-the-art methods not only on a gray data set but also on a color data set. Experimental results show that the proposed method can achieve superior fusion performance in both human visual quality and objective assessment.

1  Introduction

Computer vision has become a significant field in artificial intelligence. Multifocus image fusion is an efficient technology in many computer vision applications such as microscopic imaging, industrial vision systems, macrophotography, and feature extraction (Aslantas & Toprak, 2017; Li, Kang, & Hu, 2013). Many researchers have designed a series of algorithms with the goal of obtaining a fused image with the all-in-focus characteristic from multiple images that individually capture different focused locations in the same scene (Yin, Li, Chai, Liu, & Zhu, 2016). In general, these methods fall roughly into two categories: spatial domain algorithms and transform domain algorithms (Stathaki, 2008).

In spatial domain algorithms, the source images are directly fused by a linear combination. These algorithms can be classified mainly as pixel based, block based, and region based (Luo, Zhang, Zhang, & Wu, 2017; Zhang, Bai, & Wang, 2017). Some pixel-based focus criteria have been designed, such as spatial frequency (Li & Yang, 2008), Laplacian energy (Socolinsky & Wolff, 2002; Petrovic & Xydeas, 2004; Tian, Chen, Ma, & Yu, 2011), and gradient energy (Huang & Jing, 2007). These criteria can comprehensively evaluate pixel-based spatial domain algorithms. The most direct pixel-based algorithm averages the pixel values of all source images. It is simple and fast. However, this algorithm always leads to some undesirable side effects and loses some original image information in the fused image, so its qualitative and quantitative evaluations for fused images are poor. Recently, more sophisticated algorithms have been proposed. For instance, the guided filtering-based method (GFF; Li, Kang, & Hu, 2013) obtains state-of-the-art performance in many fields, such as multifocus image fusion and multimodal image fusion. Representative block-based algorithms include the fusion method for images with diverse focuses based on spatial frequency (Li, Kwok, & Wang, 2001), the artificial neural network-based method (Li, Kwok, & Wang, 2002), the differential evolution-based method (Aslantas & Kurban, 2010), and the morphology-based focus measure method (De & Chanda, 2013). These algorithms share a common idea: the source images are decomposed into many blocks, the focused blocks from each pair are picked up, and all the focused blocks then form an all-in-focus fused image. The final fused image, however, often presents some blurred blocks along the boundary between the focused and defocused regions of the source images. To overcome this problem, several efficient region-based algorithms have been proposed. Conceptually, algorithms based on region segmentation and spatial frequency (Shi & Malik, 2000; Li & Yang, 2008) mainly use the normalized cut method to accomplish the segmentation task and then produce the fused image from the segmentation results. However, the quality of the fused images is greatly affected by the segmentation accuracy.

Transform domain-based algorithms also play an important role in the field of multifocus image fusion. The basic idea of these algorithms is to convert the source images into another feature domain, where the multifocus image fusion task becomes easier than in the original domain. The coefficients of the transformed source images are important, and focus criteria are usually implemented on those coefficients. When a region in one source image consistently gives larger coefficients than the other source images, we consider the region in that image to be focused. Many algorithms based on this criterion have been presented. Some representative examples are the discrete wavelet transform (DWT)-based algorithm (Li, Manjunath, & Mitra, 1995; Pajares & De La Cruz, 2004; Tian & Chen, 2012), the dual-tree complex wavelet transform (DTCWT)-based algorithm (Selesnick, Baraniuk, & Kingsbury, 2005; Lewis, O'Callaghan, Nikolov, Bull, & Canagarajah, 2007), the nonsubsampled contourlet transform (NSCT)-based algorithm (Zhang & Guo, 2009), the curvelet and contourlet transform (CT)-based algorithms (Do & Vetterli, 2005; Yang, Wang, Jiao, Wu, & Wang, 2010), and the pulse-coupled neural network (PCNN)-based algorithm (Huang & Jing, 2007; Wang, Ma, & Gu, 2010).

Deep learning (DL) has attracted enormous attention in various computer vision tasks, such as classification (Krizhevsky, Sutskever, & Hinton, 2012), semantic segmentation (Long et al., 2015), and object detection (He, Gkioxari, Dollár, & Girshick, 2017). DL has also been applied to image fusion (Liu, Chen, Wang, Wang, Ward, & Wang, 2018). The stacked autoencoder (SAE)-based methods (Huang, Xiao, Wei, Liu, & Tang, 2017; Azarang & Ghassemian, 2017) and convolutional neural network (CNN)-based methods (Wei et al., 2017; Palsson, Sveinsson, & Ulfarsson, 2017; Rao, He, & Zhu, 2017) have been implemented for remote-sensing image fusion. Medical image fusion (Liu, Chen, Cheng, & Peng, 2017) and multiexposure image fusion (Kalantari & Ramamoorthi, 2017) have also been performed using CNNs. However, those methods are not suitable for multifocus image fusion. Thus, Liu, Chen, Peng, and Wang (2017) exploit a Siamese CNN (Chopra, Hadsell, & LeCun, 2005) in multifocus image fusion to perform focus-region detection and a guided filter (He, Sun, & Tang, 2013) to finish the final fusion task. Without considering the different focus of foreground and background in natural multifocus images, that method employs a gaussian filter to blur entire natural images to build a training data set. In addition, pairs of image patches rather than whole images are fed into the network to train the CNN model. Although it inputs pairs of whole source images to the pretrained network and uses an equivalent convolution layer to produce a score map in the test phase, the method does not accomplish pixel-wise focus-region detection because the score map and the source image have different sizes.

Taking these issues into consideration, we leverage the advances of the fully convolutional network (FCN) (Long et al., 2015) in semantic segmentation and propose a novel FCN-based multifocus image fusion method. The key idea is to synthesize pairs of multifocus images in light of the different focus of foreground and background in a natural image and to train a single-channel FCN with selected parameters to detect pixel-wise focus regions in multifocus images. The main contributions can be summarized as follows. First, we present a novel method for synthesizing multifocus training samples for the FCN; this method reflects the different focus of foreground and background in a natural image. Second, we apply the FCN with selected parameters to detect pixel-wise focus regions; the network is trained on whole training samples rather than image patches. Third, we employ a fully connected CRF (Krähenbühl & Koltun, 2011) to accomplish and refine a binary decision map for the fusion task. Finally, we perform extensive experiments to verify the efficiency of the proposed method. Our method outperforms the compared methods in terms of visual quality and objective assessment.

The rest of the letter is organized as follows. In section 2, we introduce the FCN model and its feasibility and advantages for a multifocus image fusion task. In section 3, our FCN-based multifocus image fusion algorithm is presented. Experimental results and comparison with other algorithms are provided in section 4. Section 5 concludes the letter.

2  FCN Architecture for Multifocus Image Fusion

2.1  FCN Model

The fully convolutional network (FCN) (Long et al., 2015) achieved state-of-the-art performance in image semantic segmentation with pixel-level classification. Many FCN-based algorithms then emerged and attained good effects in a number of fields, such as optical flow (Fischer et al., 2015), edge detection (Xie & Tu, 2016), and simplifying sketches (Simo-Serra et al., 2016).

In general, FCN employs three techniques to accomplish semantic segmentation tasks: fully convolutional layers, upsampling, and skip architecture (Long et al., 2015). Compared with standard convolutional neural networks (CNNs) such as LeNet (LeCun et al., 1989) and AlexNet (Krizhevsky et al., 2012), FCN can produce spatial outputs (i.e., the outputs are images) by replacing the fully connected layers with equivalent convolutional layers. Because there are no fully connected layers, the FCN can receive images of arbitrary size as input and shows faster inference than a standard CNN. Because of the pooling layers and some convolution layers in FCN, the dimension of the outputs is typically reduced. However, we usually expect the input and output to have the same size. Hence, upsampling is used to restore the output to the size of the input. The common implementation of upsampling is deconvolution, which reverses the forward and backward passes of convolution. We can usually obtain segmentation results by using fully convolutional layers and upsampling alone, but the results are coarse. To improve the quality of the output, skip architecture, which first upsamples the outputs of various pooling layers and then fuses that information to refine the output, is usually adopted. These three techniques enable FCNs to achieve landmark success in semantic segmentation and other domains. Our work also draws inspiration from FCNs.
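To make the upsampling step concrete, the short sketch below (our illustration, not the authors' code) builds the bilinear interpolation kernel that is commonly used to initialize FCN deconvolution layers; the function name `bilinear_kernel` and the use of NumPy are our own choices.

```python
import numpy as np

def bilinear_kernel(channels: int, kernel_size: int) -> np.ndarray:
    """Build a (channels, channels, k, k) weight tensor that performs bilinear
    upsampling when used as the filter of a deconvolution (transposed
    convolution) layer, as is standard practice for initializing FCN upsampling."""
    factor = (kernel_size + 1) // 2
    center = factor - 1 if kernel_size % 2 == 1 else factor - 0.5
    og = np.ogrid[:kernel_size, :kernel_size]
    filt = (1 - abs(og[0] - center) / factor) * (1 - abs(og[1] - center) / factor)
    weight = np.zeros((channels, channels, kernel_size, kernel_size), dtype=np.float32)
    for c in range(channels):
        weight[c, c] = filt  # each channel is upsampled independently
    return weight

# Example: a 16 x 16 kernel, as used for 8x upsampling of a 2-class score map.
w = bilinear_kernel(channels=2, kernel_size=16)
print(w.shape)  # (2, 2, 16, 16)
```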

2.2  Feasibility and Advantages of the FCN Model for Multifocus Image Fusion

The problem of multifocus image fusion can be described as

F = M \odot A + (\mathbf{1} - M) \odot B,    (2.1)

where A and B are a pair of source multifocus images; F denotes the fused image; M is a binary decision map for the focus regions in A; \mathbf{1} denotes the unit (all-ones) matrix; and \odot stands for dot (element-wise) multiplication between two matrices. Obviously, the key to solving equation 2.1 is to find an accurate mask M for the focus regions in A combined with B. Focus-region detection is thus a semantic segmentation task on the source images. In addition, a common assumption is that a local region is well focused in only one source image (Liu et al., 2018), so any pixel is categorized as either focused or defocused; the semantic segmentation in multifocus images is therefore a two-class classification problem. Fortunately, the FCN (Long et al., 2015) can be applied to image semantic segmentation with good results. Hence, we can leverage a well-trained two-class FCN to model the fusion task. Moreover, because many segmentation data sets are easily accessible, we can synthesize many pairs of multifocus images to train the FCN with the help of the popular deep learning framework Caffe (Jia et al., 2014). Therefore, applying an FCN-based algorithm to multifocus image fusion is feasible both theoretically and practically.
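As a minimal illustration of equation 2.1, the following NumPy sketch performs the pixel-wise weighted combination; the function name `fuse` and the array conventions are ours, not the authors' implementation.

```python
import numpy as np

def fuse(A: np.ndarray, B: np.ndarray, M: np.ndarray) -> np.ndarray:
    """Pixel-wise weighted fusion of equation 2.1: F = M*A + (1 - M)*B.
    M is a binary (0/1) decision map marking the focused regions of A;
    for color images M is broadcast over the channel dimension."""
    M = M.astype(np.float64)
    if A.ndim == 3 and M.ndim == 2:
        M = M[..., None]  # broadcast the mask over the RGB channels
    return M * A + (1.0 - M) * B
```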

The proposed FCN-based multifocus image fusion has two advantages over common methods and the CNN-based method. First, like the CNN-based method (Liu et al., 2017), the proposed FCN-based method can learn feature extraction and fusion rules automatically rather than designing them manually, as in many common methods such as NSCT (Zhang & Guo, 2009) and GFF (Li, Kang, & Hu, 2013). Second, unlike the CNN-based method (Liu, Chen, Cheng et al., 2017), our method employs only a single-channel FCN with more layers to more efficiently detect the pixel-wise focus probability of each image. The synthesized training samples consider the different focus of foreground and background in natural multifocus images, and each full image, instead of image patches, is used to train the FCN. Hence, the FCN-based algorithm has huge potential in multifocus image fusion tasks.

3  The Proposed Algorithm

3.1  Overview

In this section, we present our proposed FCN-based multifocus image fusion algorithm in detail. A schematic diagram of the proposed method is illustrated in Figure 1. We consider only the situation with two source images, although the proposed algorithm can be applied to more than two source images (see Figure 12). The proposed algorithm has four steps (see Figure 1): focus score computation, focus information aggregation, focus region refinement, and multifocus image fusion. First, a pair of source images A and B is fed successively to the pretrained FCN to output two focus score maps, score map 1 and score map 2, which present pixel-wise focus scores for source images A and B, respectively; we then invert score map 2 to produce score map 3, which has a homogeneous focus property with score map 1. Second, we average score maps 1 and 3 pixel-wise to obtain an aggregative score map indicating the focus property of source image A. Third, taking the correlation among pixels into consideration, we implement the fully connected conditional random field (CRF) on the aggregative score map to accomplish and refine the binary decision map for the fusion task. Finally, the fused image is obtained from the final decision map by using the pixel-wise weighted-average strategy.

Figure 1:

Schematic diagram of the proposed FCN-based fusion algorithm. The proposed algorithm has four steps: focus score computation, focus information aggregation, focus region refinement, and multifocus image fusion.


3.2  Synthesize Multifocus Images for Training

Four pairs of synthesized multifocus images with their corresponding ground truth are shown in Figure 2. PASCAL Visual Object Classes 2012 (PASCAL VOC 2012) is a popular classification and segmentation data set (Everingham, Van Gool, Williams, Winn, & Zisserman, 2011) that contains 17,125 images in 20 categories (e.g., people, animals, various objects). It consists of five folders: Annotations, ImageSets, JPEGImages, SegmentationClass, and SegmentationObject. In this work, we use the JPEGImages and SegmentationClass folders to synthesize the multifocus images for training the FCN. JPEGImages contains 17,125 images, and SegmentationClass contains 2913 indexed images, which are the segmentation results of some images in JPEGImages. For each segmentation image in SegmentationClass, we can easily find the corresponding original image in JPEGImages by its image name. The 2913 original images and their corresponding segmentation images are available to accomplish the synthesis task.

The procedure has five steps: gaussian filtering, image conversion, image inversion, pixel-wise multiplication, and pixel-wise addition (see Figure 3). In the first step, for each original image, we use a gaussian filter (the standard deviation is set to 2 and the window size to 7) to generate five versions of blurred images: the first version is generated from the original image by applying the gaussian filter, and each subsequent version is generated by applying the filter again to the previous version. Next, we convert the ground truth into a mask in three-channel space. For this purpose, we set the background pixels (the black region) in the ground truth to (0,0,0) and all other pixels to (1,1,1) to form a three-channel binary map, mask 1. Obviously, mask 1 provides a binary map for each channel and presents the mask of the foreground. In the third step, we invert mask 1, with the pixel value (0,0,0) changed to (1,1,1) and vice versa, to produce mask 2, which indicates the mask of the background. In the fourth step, using pixel-wise multiplication (dot product) between the blurred image and mask 1, we obtain the part image A1, in which the foreground is blurred and the background is black (see Figure 3). Similarly, the part image A2, in which the background is clear and the foreground is black, can be generated from the original image and mask 1. Obviously, we can regard A1 and A2 as the defocused and focused versions of the same scene, respectively. Finally, the synthesized multifocus image A can be obtained by pixel-wise addition (dot addition) of A1 and A2. In accordance with the synthesis procedure of source image A, we can also readily synthesize source image B from the original image, the blurred image, and mask 2 (see Figure 3). Note that images A and B make up a pair of synthesized multifocus images with ground truths mask 2 and mask 1, respectively. Furthermore, we can exploit this procedure to synthesize other pairs of multifocus images from the different versions of blurred images and the original images. With five versions of blurred images and the original images, we obtain 4500 pairs of synthesized multifocus images and their corresponding ground truth. We can then train an FCN model on the synthesized data set to learn the map between the source images and the focus map.
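The following sketch outlines the five-step synthesis procedure under the assumption that OpenCV is used for the gaussian filtering; the helper name `synthesize_pair` and the handling of blur passes are illustrative, not the authors' exact implementation.

```python
import cv2
import numpy as np

def synthesize_pair(original: np.ndarray, seg: np.ndarray, blur_passes: int = 1):
    """Synthesize one pair of multifocus images (A, B) from an original PASCAL
    VOC image and its segmentation ground truth, following the five steps in
    Figure 3: gaussian filtering, image conversion, image inversion, pixel-wise
    multiplication, and pixel-wise addition."""
    # Step 1: repeatedly blur the original (std 2, 7x7 window, as in the paper);
    # calling this with blur_passes = 1..5 yields the five blurred versions.
    blurred = original.copy()
    for _ in range(blur_passes):
        blurred = cv2.GaussianBlur(blurred, (7, 7), 2)
    # Step 2: mask 1 marks the foreground (non-background pixels) in three channels.
    # (VOC void/border pixels are treated as foreground in this simplified sketch.)
    mask1 = (seg > 0).astype(original.dtype)[..., None].repeat(3, axis=2)
    # Step 3: mask 2 is the inversion of mask 1 and marks the background.
    mask2 = 1 - mask1
    # Steps 4-5: blurred foreground + sharp background gives image A;
    # sharp foreground + blurred background gives image B.
    A = blurred * mask1 + original * mask2
    B = original * mask1 + blurred * mask2
    # Ground truths: the focused regions of A are the background (mask 2),
    # and those of B are the foreground (mask 1).
    return A, B, mask2[..., 0], mask1[..., 0]
```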

Figure 2:

Four pairs of synthesized multifocus images and their corresponding ground truth. The first row presents four pairs of synthesized multifocus images. The second row exhibits the corresponding ground truth for the synthesized multifocus images. White regions indicate the focused regions, and the black regions show the defocused regions, respectively.


Figure 3:

Diagram of the synthesis procedure for a pair of multifocus images. The procedure has five steps: gaussian filtering, image conversion, image inversion, pixel-wise multiplication, and pixel-wise addition. First, we use the gaussian filter to produce the various versions of the blurred image. In image conversion and image inversion, we exploit the ground-truth image to produce two three-channel binary masks for focusing foreground and background, respectively, in blurred images. Next, the part images, which contain only the focus region or defocus region for foreground or background, are produced based on the binary masks and original images. Finally, a pair of synthesized multifocus images can be given from a pair of part images: A1 and A2 or B1 and B2.


3.3  FCN Model Design

The detailed configuration of the FCN model for the multifocus image fusion task is shown in Figure 4. Long et al. (2015) discussed three streams, FCN-32s, FCN-16s, and FCN-8s, based on VGG16 (Simonyan & Zisserman, 2014). Experiments demonstrated that FCN-8s presents better performance in semantic segmentation than the other two nets. Nevertheless, FCN-8s is aimed at a 21-class segmentation problem. To accomplish two-class semantic segmentation in our work, we modify the kernel number of the last three convolution and deconvolution layers to 2. The detailed parameter settings are given in Table 1.

Figure 4:

The architecture of the FCN in our work. The FCN includes convolution layers, pooling layers, deconvolution layers, ReLU layers, crop layers, and Eltwise layers. The parameters of the convolution and deconvolution layers are listed in Table 1. We choose max pooling in all the pooling layers. ReLU layers are activation functions. Crop layers crop the feature maps to the same size as the reference feature maps (obtained via the layers that the red dotted arrows point to). The Eltwise layers add feature maps pixel-wise.


Table 1:
Parameter Settings of the Convolution and Deconvolution Layers.
Module 1: conv1_1 (size 3 × 3 × 64, pad 80, stride 1); conv1_2 (size 3 × 3 × 64, pad 1, stride 1)
Module 2: conv2_1 (size 3 × 3 × 128, pad 1, stride 1); conv2_2 (size 3 × 3 × 128, pad 1, stride 1)
Module 3: conv3_1, conv3_2, conv3_3 (each size 3 × 3 × 256, pad 1, stride 1)
Module 4: conv4_1, conv4_2, conv4_3 (each size 3 × 3 × 512, pad 1, stride 1)
Module 5: conv5_1, conv5_2, conv5_3 (each size 3 × 3 × 512, pad 1, stride 1)
Module 6: fc_conv1 (size 7 × 7 × 4096, pad 0, stride 1); fc_conv2 (size 1 × 1 × 4096, pad 0, stride 1)
Other components: conv_7, conv_8, conv_9 (each size 1 × 1, pad 0, stride 1)
Other components: deconv1 (size 4 × 4, pad 0, stride 2); deconv2 (size 4 × 4, pad 0, stride 2); deconv3 (size 16 × 16, pad 0, stride 8)

Notes: For sizes m × m × n, m denotes the kernel size of the convolution or deconvolution layer, and n denotes the number of convolution or bilinear kernels. A pad of p indicates adding p pixels to each side of the feature maps. Stride represents the stride of the convolution or deconvolution operations.

In the training phase, we make use of the synthesized multifocus images to train the model. Specifically, given an input synthesized RGB image of size H × W × 3, the FCN produces a probability map over the two classes of size H × W × 2, which is supervised by the corresponding ground truth. In the fusion stage, a pair of multifocus images is successively input to the FCN, and two score maps, score maps 1 and 2, are produced. Because the FCN is effectively trained on the synthesized multifocus samples, the two score maps can indicate the focus probability of each source image. Although the focus regions could be detected by a simple comparison between the two score maps, a better technique, the fully connected CRF (Krähenbühl & Koltun, 2011), is employed to accomplish and refine the result.

3.4  Training Strategy

In the training phase, the softmax loss function is adopted as the objective. For each convolution layer, we initialize the weights using the Xavier method (Glorot & Bengio, 2010) and initialize the bias to 0. The learning-rate multipliers for the weights and bias are, respectively, 1 and 2, and the decay multipliers for the weights and bias are, respectively, 1 and 0. For each deconvolution layer, we initialize the weights to a fixed bilinear kernel and ignore the bias. The optimization method is stochastic gradient descent (SGD) (Bottou, 2010), and its basic learning rate, momentum, and weight decay are set to 0.000001, 0.9, and 0.0005, respectively. We use the step learning policy to train the model, with the parameter gamma set to 0.5; that is, the basic learning rate is updated every 5000 iterations with the formula lr = base_lr × gamma^(floor(iter / 5000)), where base_lr represents the basic learning rate. We train the model from scratch and obtain the trained model after about 17 epochs over the training data.
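For reference, a small sketch of the Caffe-style step learning-rate policy described above; the helper name is ours.

```python
def step_learning_rate(base_lr: float, iteration: int,
                       gamma: float = 0.5, stepsize: int = 5000) -> float:
    """Caffe-style 'step' policy: the base learning rate is multiplied by
    gamma once every `stepsize` iterations."""
    return base_lr * (gamma ** (iteration // stepsize))

# With base_lr = 1e-6 and gamma = 0.5, the rate halves every 5000 iterations:
for it in (0, 5000, 10000):
    print(it, step_learning_rate(1e-6, it))
```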

3.5  Details of the Fusion Scheme

The fusion scheme has four steps: focus score computation, focus information aggregation, focus region refinement, and multifocus image fusion.

3.5.1  Step 1: Focus Score Computation

Let A and B denote a pair of source multifocus images. For gray source images, we extend each source image to three channels by stacking its single channel three times. Feeding A and B to the pretrained FCN yields two score maps, score map 1 and score map 2, denoted S1 and S2, respectively. The value of each pixel in each score map ranges from 0 to 1 and suggests the focus property of the corresponding source image: the closer the value is to 1, the larger the probability that the pixel belongs to a focused region, and vice versa. Because S1 and S2 are complementary focus scores for a pair of multifocus images, we invert S2 to produce score map 3, S3 = 1 − S2, which presents a homogeneous focus property with S1.

3.5.2  Step 2: Focus Information Aggregation

Because S1 and S3 are nearly homogeneous in indicating the focus regions in A, we should take full advantage of the focus probabilities in the two score maps. More concretely, we average S1 and S3 pixel-wise to produce an aggregative score map S indicating the focus property of A.
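A minimal sketch of steps 1 and 2, assuming a hypothetical `fcn_forward` callable that wraps the pretrained FCN and returns a per-pixel focus probability map:

```python
import numpy as np

def aggregative_score_map(A: np.ndarray, B: np.ndarray, fcn_forward):
    """Steps 1 and 2 of the fusion scheme. `fcn_forward` takes an H x W x 3
    image and returns an H x W map of focus probabilities in [0, 1]
    (the softmax score of the 'focused' class)."""
    # Gray images are extended to three channels by stacking the single channel.
    if A.ndim == 2:
        A = np.stack([A] * 3, axis=-1)
    if B.ndim == 2:
        B = np.stack([B] * 3, axis=-1)
    S1 = fcn_forward(A)          # score map 1: focus probability of A
    S2 = fcn_forward(B)          # score map 2: focus probability of B
    S3 = 1.0 - S2                # score map 3: inverted score map 2
    return 0.5 * (S1 + S3)       # step 2: pixel-wise average (focus of A)
```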

3.5.3  Step 3: Focus Region Refinement

A straightforward comparison between the score maps could be adopted to produce a binary decision map (denoted M) masking the focus regions in A, but the result may contain misclassified pixels and inaccurate boundaries between the focused and defocused regions. Thus, to remove those incorrect regions and enhance the accuracy of the focus boundaries, we implement the fully connected CRF (Krähenbühl & Koltun, 2011) on the aggregative score map S to generate the refined decision map M. Taking the correlation among pixels into consideration, the fully connected CRF can efficiently refine the binary decision map. There are five parameters in the fully connected CRF: the kernel weights w^(1) and w^(2) and the kernel widths θ_α, θ_β, and θ_γ. (For more details on fully connected CRFs, see Krähenbühl & Koltun, 2011.)
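One possible realization of this refinement step uses the open-source pydensecrf package, which implements the fully connected CRF of Krähenbühl and Koltun (2011); the pairwise parameters shown below are illustrative defaults, not the values tuned in this work.

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def refine_with_crf(aggregate: np.ndarray, image: np.ndarray, iters: int = 5) -> np.ndarray:
    """Refine the aggregative score map S into a binary decision map M with a
    fully connected CRF. `aggregate` holds per-pixel focus probabilities in
    [0, 1]; `image` is the corresponding H x W x 3 uint8 source image."""
    h, w = aggregate.shape
    # Two-class softmax: channel 0 = defocused, channel 1 = focused.
    probs = np.clip(np.stack([1.0 - aggregate, aggregate]), 1e-6, 1.0)
    unary = unary_from_softmax(probs)  # negative log-probabilities, shape (2, h*w)
    crf = dcrf.DenseCRF2D(w, h, 2)
    crf.setUnaryEnergy(np.ascontiguousarray(unary))
    # Smoothness and appearance kernels; these parameter values are illustrative.
    crf.addPairwiseGaussian(sxy=3, compat=3)
    crf.addPairwiseBilateral(sxy=60, srgb=10,
                             rgbim=np.ascontiguousarray(image), compat=5)
    q = crf.inference(iters)
    # Label 1 marks the focused regions of A.
    return np.argmax(np.array(q).reshape(2, h, w), axis=0).astype(np.uint8)
```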

3.5.4  Step 4: Multifocus Image Fusion

The final decision map M indicates the mask of the focused regions in A. The fused image F can then be generated using equation 2.1.

4  Experiments

We evaluate the proposed FCN-based multifocus image fusion algorithm on two image data sets. One contains six pairs of gray multifocus images (see Figure 5), which have been used in recent publications. The other, called Lytro, contains 20 pairs of color multifocus images and four series of multifocus images with three sources (Nejati, Samavi, & Shirani, 2015). The gray images and some examples from Lytro are shown in Figure 5. In this letter, we compare our proposed algorithm with nine state-of-the-art algorithms: the nonsubsampled contourlet transform (NSCT) (Zhang & Guo, 2009), the guided filtering-based fusion method (GFF) (Li, Kang, & Hu, 2013), multifocus image fusion based on image matting (IFM) (Li, Kang, Hu, & Yang, 2013), the cross-bilateral filter-based method (CBF) (Kumar, 2015), the discrete cosine harmonic wavelet transform-based method (DCHWT) (Kumar, 2013), the multiscale weighted gradient-based fusion method (MWGF) (Zhou, Li, & Wang, 2014), boundary finding-based multifocus image fusion through a multiscale morphological focus measure (BFMM) (Zhang et al., 2017), the dense SIFT (DSIFT)-based method (Liu, Liu, & Wang, 2015), and the CNN-based one (Liu, Chen, Cheng et al., 2017). To compare the algorithms, we first obtain the fused images by using the original code provided by their authors. Then, from both visual observation and objective assessment metrics, we comprehensively measure the quality of the various fused images.

Figure 5:

Examples of multifocus images used in our experiments. The first two rows exhibit six pairs of gray multifocus images. The third and fourth rows show six pairs of multifocus images from the Lytro data set, and the final row displays two series of multifocus images with three sources from the Lytro data set.


4.1  Objective Quality Metrics

Three quality evaluation metrics are employed to evaluate the fusion results objectively: the mutual information metric (MI) (MacKay, 2003), the gradient-based metric Q^{AB/F} (Xydeas & Petrovic, 2000), and visual information fidelity for fusion (VIF) (Sheikh & Bovik, 2006). For each metric, a larger value indicates a better quality of the fused image.

The MI metric measures the mutual information between the source images and the fused image. The metric is obtained by

MI = MI_{AF} + MI_{BF},    (4.1)

where MI_{XF} quantifies the mutual information between the fused image F and the input image X ∈ {A, B}.
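As a minimal sketch, the mutual information between an input image and the fused image can be estimated from a joint gray-level histogram; the helper below uses 256 bins and is our illustration, not the exact implementation behind the reported scores.

```python
import numpy as np

def mutual_information(x: np.ndarray, f: np.ndarray, bins: int = 256) -> float:
    """Mutual information MI_{XF} between a source image x and the fused
    image f, estimated from their joint gray-level histogram."""
    joint, _, _ = np.histogram2d(x.ravel(), f.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x
    pf = pxy.sum(axis=0, keepdims=True)   # marginal of f
    nz = pxy > 0                          # avoid log(0)
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ pf)[nz])))

# The MI fusion metric is then MI = mutual_information(A, F) + mutual_information(B, F).
```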
The gradient-based metric Q^{AB/F} evaluates fusion quality by measuring how well the spatial details of the source images are conducted to the fused image and is defined as

Q^{AB/F} = \frac{\sum_{i}\sum_{j} \left( Q^{AF}(i,j)\, w^{A}(i,j) + Q^{BF}(i,j)\, w^{B}(i,j) \right)}{\sum_{i}\sum_{j} \left( w^{A}(i,j) + w^{B}(i,j) \right)},    (4.2)

where Q^{AF}(i,j) = Q_g^{AF}(i,j)\, Q_\alpha^{AF}(i,j), and Q_g^{AF}(i,j) and Q_\alpha^{AF}(i,j) are the edge strength and orientation preservation values at location (i,j), respectively. Q^{BF}(i,j) is computed similarly to Q^{AF}(i,j), and w^{A}(i,j) and w^{B}(i,j) are the weight coefficients for each edge.
Visual information fidelity (VIF) is a full-reference image quality assessment index based on natural scene statistics and the notion of image information extracted by the human visual system. The VIF index is defined as

VIF = \frac{I_{F}}{I_{S}},    (4.3)

where I_{F} represents the amount of information extracted from the fused image and I_{S} denotes the amount of information extracted from the source images.

4.2  Fusion Results and Discussion

In the first experiment, we evaluate our proposed algorithm on the gray source images. The book source image pair and the fused images produced by the various methods are shown in Figures 6a1 to 6l1, in which the selected regions are labeled with red rectangles, and the magnified regions are presented in Figures 6a2 to 6l2. The difference images obtained by subtracting source image A (see Figure 6a1) from each fused image are shown in Figure 7; each difference image has been normalized to the range of 0 to 1. The NSCT result (see Figure 6c1) is more blurred than the others (see Figures 6d1 to 6l1), which almost all present a good visual effect. However, GFF displays some slight shadow (see the right region of Figure 7b); IFM suffers from some undesirable disorder (see Figure 6e2); CBF and DCHWT are obviously blurred in the magnified regions (see Figures 6f2 and 6g2); MWGF loses edge contrast to some degree in Figure 6h2; BFMM exhibits undesirable ringing artifacts, especially in the bottom left of Figure 6i1; and DSIFT loses some details around the boundary in the left region (see Figures 6j1 and 7h). Although the difference between the CNN-based method and our proposed method is small (see Figures 6k1 and 6l1), careful observation shows that the left region of the result of the proposed method exhibits more detail than that of the CNN-based method. A better comparison can also be seen in Figures 7i and 7j. Moreover, Figure 6l2 with our method is clearer than Figure 6k2 with the CNN-based method. Therefore, the proposed method exhibits better visual fusion perception than all the other methods.

Figure 6:

Source images, fused images, and magnified images. (a1, b1) Book source images. (c1–l1) Fused images with various fusion algorithms. (a2–l2) Magnified regions of images a1 to l1 in the red rectangles, respectively.


Figure 7:

The difference images with various methods. The difference images can be produced by subtracting the source image A (see Figure 6a1) from each fused image.


We also compare the performance of the various fusion algorithms on the gray source image data set using the quantitative metrics set out in section 4.1. The quantitative results are in Table 2, where the best result for each assessment is highlighted in bold. It can be seen from Table 2 that our method, in terms of the MI, Q^{AB/F}, and VIF metrics, achieves the highest scores for 4, 4, and 5 out of 6 samples, respectively, and acquires the highest average scores. Therefore, it can be concluded from this experiment that, in both visual perception and objective assessment, our proposed algorithm outperforms all the comparison algorithms on the gray multifocus image data set.

Table 2:
Objective Assessments of Various Fusion Algorithms for the Gray Multifocus Image Data Set.

Metric    Method   Clock    Lab      Doll     Disk     Rose     Book     Mean
MI        NSCT     3.2445   3.4105   2.8805   2.8275   2.6236   3.5196   3.0844
          GFF      3.9570   3.9557   3.3892   3.5312   3.6769   4.3003   3.8017
          IFM      4.1137   4.2652   3.5069   3.9650   3.8909   4.4102   4.0253
          CBF      3.7068   3.7387   3.1449   3.3368   2.9348   3.9752   3.4729
          DCHWT    3.3718   3.5174   2.9964   3.0465   2.8140   3.6235   3.2283
          MWGF     4.0370   4.2746   3.9558   3.9558   3.7525   4.5997   4.0959
          BFMM     4.2752   4.3998   4.0725   4.1395   3.9625   4.6543   4.2506
          DSIFT    4.2153   4.2601   4.0566   4.1083   4.0232   4.6283   4.2153
          CNN      4.1526   4.7929   3.8212   4.0219   4.4564   4.6362   4.3135
          Ours     4.3262   4.3832   4.2268   4.1432   4.1723   4.6645   4.3194
Q^{AB/F}  NSCT     0.5515   0.5511   0.4856   0.5252   0.6129   0.5648   0.5485
          GFF      0.7112   0.7380   0.6974   0.7256   0.6950   0.7168   0.7140
          IFM      0.7059   0.7384   0.6680   0.7245   0.6918   0.7039   0.7054
          CBF      0.6742   0.7121   0.6517   0.6991   0.6664   0.7082   0.6853
          DCHWT    0.6495   0.6679   0.6243   0.6581   0.6592   0.6698   0.6548
          MWGF     0.7049   0.7386   0.7385   0.7313   0.6922   0.7221   0.7213
          BFMM     0.7161   0.7464   0.7405   0.7284   0.6935   0.7233   0.7247
          DSIFT    0.7192   0.7479   0.7404   0.7373   0.6955   0.7241   0.7247
          CNN      0.7128   0.5796   0.7292   0.7355   0.6848   0.7235   0.6943
          Ours     0.7167   0.7521   0.7396   0.7440   0.7111   0.7255   0.7315
VIF       NSCT     0.5509   0.4821   0.4566   0.4262   0.4964   0.4824   0.4824
          GFF      0.6989   0.7060   0.6535   0.6587   0.6728   0.7119   0.6836
          IFM      0.7045   0.7112   0.6253   0.6623   0.6639   0.6903   0.6763
          CBF      0.6392   0.6394   0.5876   0.5817   0.5686   0.6344   0.6085
          DCHWT    0.6223   0.6051   0.5624   0.5555   0.5866   0.6021   0.5890
          MWGF     0.7244   0.7219   0.7202   0.6819   0.6708   0.7220   0.7069
          BFMM     0.7284   0.7223   0.7164   0.6727   0.6680   0.7176   0.7042
          DSIFT    0.7220   0.7238   0.7138   0.6803   0.6726   0.7233   0.7060
          CNN      0.7227   0.5974   0.6948   0.6811   0.6475   0.7218   0.6775
          Ours     0.7313   0.7244   0.7183   0.6802   0.6776   0.7243   0.7094

Note: The best results for each metric are highlighted in bold.

We next apply our proposed algorithm to the Lytro color images. The infant source image pair and the fused images for the various methods are shown in Figures 8a1 to 8l1, in which the selected regions are labeled with red rectangles, and the magnified regions are illustrated in Figures 8a2 to 8l2. For a better comparison, we also present the difference images in Figure 9. Similar to the book example, we can see that the NSCT result in Figure 8c1 is more blurred than the other fused results (see Figures 8d1 to 8l1), which almost all provide a good visual effect. However, GFF presents some slight residuals between the fused image and the right-focused source image in the focused regions in Figure 9b. IFM displays evident artifacts on the infant's leg as well as a ringing effect on the boundary of the infant's face (see Figures 8e1, 8e2, and 9c); CBF loses some details on the infant's face and the right steel pipe region in Figure 9d; DCHWT produces evident marks in the region of the infant's leg (see Figure 9e). MWGF does not perform well in this example; the bottom right corner of the fused image is blurry (see Figures 8h1 and 9f). BFMM also exhibits some blur on the infant's leg (see Figures 8i1 and 9g). DSIFT does not merge well along the boundary between the edge of the infant's face and the woman's hair; in particular, some regions of the woman's hair are blurry (see Figures 8j2 and 9h). CNN is not consistent enough at the edge of the infant's face region in Figure 9i, which means some small clear regions of the source images do not merge perfectly into the fused result. Our method performs well in this example: the edge of the infant's face is consistent, and the right region in Figure 9j exhibits more detail, such as the outline of the woman's face. Thus, for the infant image pair, our proposed method achieves the best visual effect.

Figure 8:

Source images, fused images, and magnified images. (a1, b1) Color infant source images. (c1–l1) Fused images with various fusion methods. (a2–l2) Magnified regions of images a1 to l1 in the red rectangles, respectively.


Figure 9:

Difference images with various methods. The difference images can be produced by subtracting the source image A (see Figure 8a1) from each fused image.


To assess our proposed method in a more objective and comprehensive manner, further experiments are performed on the whole Lytro data set; the results are shown in Figure 10. From Figure 10a1, we can observe that the proposed method exhibits the highest score in the MI metric for the majority of examples.

Figure 10:

Scores and average scores of the metrics for each method on the Lytro data set. (a1, b1, c1) The scores of MI, Q^{AB/F}, and VIF for each algorithm, respectively. (a2, b2, c2) The average scores of MI, Q^{AB/F}, and VIF for each algorithm, respectively.


Furthermore, our proposed algorithm also acquires the highest average score in MI (see Figure 10a2). In Figure 10b1, DSIFT, CNN, and our proposed algorithm present almost the same score in terms of Q^{AB/F} for most of the examples. However, our proposed algorithm obtains a higher average score in that metric than the other algorithms do (see Figure 10b2). In terms of VIF, our proposed algorithm also presents the highest scores for most examples (see Figure 10c1) and the highest average score (see Figure 10c2). Thus, on the color data set as well, our proposed algorithm shows better performance than the other algorithms in visual perception and objective assessment.

On the whole, our proposed algorithm achieves higher fusion performance than the other algorithms, not only for gray multifocus images but also for color multifocus images. More fused results using our proposed algorithm can be seen in Figure 11; they present good visual quality owing to the nearly accurate focus regions detected.

Figure 11:

More fusion results using the proposed algorithm. The results for two pairs of samples are presented in each row, where the first and last five columns present the results for one pair of samples each. In the results for each pair of samples, the pair of source images is illustrated in the first and third columns, its corresponding masks of detected focus regions are exhibited in the second and fourth columns, and the fused result is shown in the fifth column.


In both experiments, we mainly implemented our proposed algorithm to fuse two source multifocus images. However, the algorithm can be extended naturally to the case of more than two source images. To deal with this case, we first employ the proposed algorithm to fuse a pair of multifocus source images selected from the source images and obtain a prefused image. Then, for the prefused image and each subsequent source image, we repeat the above process with our proposed algorithm until the final fused image is generated. An example with three source images is implemented, and the results are shown in Figure 12: all the focused regions in the three source images have been integrated well into the result.
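A compact sketch of this sequential extension, where `fuse_pair` is a placeholder name for the full two-image pipeline (steps 1 to 4):

```python
from functools import reduce

def fuse_many(images, fuse_pair):
    """Fuse an arbitrary number of multifocus source images by repeatedly
    fusing the running (prefused) result with the next source image.
    `fuse_pair(a, b)` is the two-image fusion pipeline described above."""
    return reduce(fuse_pair, images)

# Example with three sources: fused = fuse_many([img1, img2, img3], fuse_pair)
```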

Figure 12:

Implementation of our proposed algorithm for the case with three source images. The first three columns exhibit the source images, and the last column indicates the fused image.


4.3  Future Work

Although our proposed FCN-based algorithm gives an impressive performance in multifocus image fusion, there is still room to improve the fusion quality.

  • Using a deep neural network to replace the gaussian filter. In this work, we use a gaussian filter to simulate the unfocused regions in the multifocus images. The synthesized multifocus images may differ somewhat from natural multifocus images. A natural idea is to use the strong fitting power of a deep neural network to learn the nonlinear map between the focused and defocused regions on a natural multifocus image data set. By replacing the gaussian filter with this trained deep neural network, we could make a more realistic multifocus image data set. Such synthesized data could be used to train the FCN model more effectively.

  • Designing a deeper network for fusion tasks. Many deeper neural networks, such as ResNet-101, have achieved remarkable results in the computer vision field, and they often provide better performance than relatively shallow networks. Therefore, designing a deeper neural network for the multifocus image fusion task may further improve the fusion result.

5  Conclusion

In this letter, we presented a novel multifocus image fusion method based on FCN modeling. In light of the different focus of foreground and background in a natural image, we synthesized 4500 pairs of multifocus images as training samples from PASCAL VOC 2012. A single FCN with proper parameters is then trained to detect pixel-wise focus regions in multifocus images, and a fully connected CRF is exploited to accomplish and refine focus-region detection for the fusion task. Experiments were conducted on six pairs of gray images and on the Lytro data set with 20 pairs of color images to verify the efficiency of the proposed method. The results demonstrate that our method achieves superior performance in terms of visual quality and objective assessment compared with other state-of-the-art methods. We believe our approach of modeling multifocus image fusion in an FCN-based manner will encourage future research.

Acknowledgments

We thank the editors and anonymous reviewers for their detailed review, valuable comments, and constructive suggestions. We also sincerely thank Liye Mei and Xiaobei Wang for their meaningful discussions. We used the website http://mansournejati.ece.iut.ac.ir/content/lytro-multi-focus-dataset for the Lytro data set to implement experiments; http://xudongkang.weebly.com for offering the source codes to simulate the GFF and IFM methods; http://cn.mathworks.com/matlabcentral/fileexchange/?term=authorid%3A376865 for the source codes for the DCHWT and CBF methods; https://github.com/lsauto/MWGF-Fusion and https://github.com/uzeful/Boundary-Finding-based-Multi-focus-Image-Fusion for the source codes for the MWGF and BFMM methods, respectively; and http://www.escience.cn/people/liuyu1/Codes.html for offering the source codes for DSIFT and the CNN-based method. This work is supported by the National Natural Science Foundation of China (61463052 and 61365001) and China Postdoctoral Science Foundation (171740).

References

Aslantas, V., & Kurban, R. (2010). Fusion of multi-focus images using differential evolution algorithm. Expert Systems with Applications, 37(12), 8861–8870.

Aslantas, V., & Toprak, A. N. (2017). Multi-focus image fusion based on optimal defocus estimation. Computers and Electrical Engineering, 62, 302–318.

Azarang, A., & Ghassemian, H. (2017). A new pan-sharpening method using multiresolution analysis framework and deep neural networks. In Proceedings of the International Conference on Pattern Recognition and Image Analysis (pp. 1–6). Piscataway, NJ: IEEE.

Bottou, L. (2010). Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010 (pp. 177–186). Heidelberg: Physica-Verlag.

Chopra, S., Hadsell, R., & LeCun, Y. (2005). Learning a similarity metric discriminatively, with application to face verification. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Vol. 1, pp. 539–546). Washington, DC: IEEE Computer Society.

De, I., & Chanda, B. (2013). Multi-focus image fusion using a morphology-based focus measure in a quad-tree structure. Amsterdam: Elsevier.

Do, M. N., & Vetterli, M. (2005). The contourlet transform: An efficient directional multiresolution image representation. IEEE Transactions on Image Processing, 14(12), 2091–2106.

Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2011). The PASCAL visual object classes challenge 2012 (VOC2012) results. http://www.pascal-network.org/challenges/VOC/voc2011/workshop/index.html

Fischer, P., Dosovitskiy, A., Ilg, E., Häusser, P., Hazırbaş, C., & Golkov, V. (2015). FlowNet: Learning optical flow with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758–2766). Washington, DC: IEEE Computer Society.

Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. Journal of Machine Learning Research, 9, 249–256.

He, K., Sun, J., & Tang, X. (2013). Guided image filtering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(6), 1397–1409.

He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2980–2988). Washington, DC: IEEE Computer Society.

Huang, W., & Jing, Z. (2007). Multi-focus image fusion using pulse coupled neural network. Pattern Recognition Letters, 28(9), 1123–1132.

Huang, W., Xiao, L., Wei, Z., Liu, H., & Tang, S. (2017). A new pan-sharpening method with deep neural networks. IEEE Geoscience and Remote Sensing Letters, 12(5), 1037–1041.

Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., … Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia (pp. 675–678). New York: ACM.

Kalantari, N. K., & Ramamoorthi, R. (2017). Deep high dynamic range imaging of dynamic scenes. ACM Transactions on Graphics, 36(4), 1–12.

Krähenbühl, P., & Koltun, V. (2011). Efficient inference in fully connected CRFs with gaussian edge potentials. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 24 (pp. 109–117). Red Hook, NY: Curran.

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 25 (pp. 1097–1105). Red Hook, NY: Curran.

Kumar, B. S. (2013). Multifocus and multispectral image fusion based on pixel significance using discrete cosine harmonic wavelet transform. Signal, Image and Video Processing, 7(6), 1125–1143.

Kumar, B. S. (2015). Image fusion based on pixel significance using cross bilateral filter. Signal, Image and Video Processing, 9(5), 1193–1204.

LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4), 541–551.

Lewis, J. J., O'Callaghan, R. J., Nikolov, S. G., Bull, D. R., & Canagarajah, N. (2007). Pixel- and region-based image fusion with complex wavelets. Information Fusion, 8(2), 119–130.

Li, H., Manjunath, B. S., & Mitra, S. K. (1995). Multisensor image fusion using the wavelet transform. Graphical Models and Image Processing, 57(3), 235–245.

Li, S., Kang, X., & Hu, J. (2013). Image fusion with guided filtering. IEEE Transactions on Image Processing, 22(7), 2864.

Li, S., Kang, X., Hu, J., & Yang, B. (2013). Image matting for fusion of multi-focus images in dynamic scenes. Information Fusion, 14(2), 147–162.

Li, S., Kwok, J. T., & Wang, Y. (2001). Combination of images with diverse focuses using the spatial frequency. Information Fusion, 2(3), 169–176.

Li, S., Kwok, J., & Wang, Y. (2002). Multifocus image fusion using artificial neural networks. Pattern Recognition Letters, 23(8), 985–997.

Li, S., & Yang, B. (2008). Multifocus image fusion using region segmentation and spatial frequency. Woburn, MA: Butterworth-Heinemann.

Liu, Y., Chen, X., Cheng, J., & Peng, H. (2017, July). A medical image fusion method based on convolutional neural networks. In Proceedings of the 20th International Conference on Information Fusion (pp. 1–7). Piscataway, NJ: IEEE.

Liu, Y., Chen, X., Peng, H., & Wang, Z. (2017). Multi-focus image fusion with a deep convolutional neural network. Information Fusion, 36, 191–207.

Liu, Y., Chen, X., Wang, Z., Wang, Z. J., Ward, R. K., & Wang, X. (2018). Deep learning for pixel-level image fusion: Recent advances and future prospects. Information Fusion, 42, 158–173.

Liu, Y., Liu, S., & Wang, Z. (2015). Multi-focus image fusion with dense SIFT. Information Fusion, 23, 139–155.

Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3431–3440). Piscataway, NJ: IEEE.

Luo, X., Zhang, Z., Zhang, C., & Wu, X. (2017). Multi-focus image fusion using HOSVD and edge intensity. Journal of Visual Communication and Image Representation, 45(C), 46–61.

MacKay, D. J. (2003). Information theory, inference and learning algorithms. Cambridge: Cambridge University Press.

Nejati, M., Samavi, S., & Shirani, S. (2015). Multi-focus image fusion using dictionary-based sparse representation. Information Fusion, 25, 72–84.

Pajares, G., & De La Cruz, J. M. (2004). A wavelet-based image fusion tutorial. Pattern Recognition, 37(9), 1855–1872.

Palsson, F., Sveinsson, J. R., & Ulfarsson, M. O. (2017). Multispectral and hyperspectral image fusion using a 3-D-convolutional neural network. IEEE Geoscience and Remote Sensing Letters, 14(5), 639–643.

Petrovic, V. S., & Xydeas, C. S. (2004). Gradient-based multiresolution image fusion. Piscataway, NJ: IEEE Press.

Rao, Y., He, L., & Zhu, J. (2017). A residual convolutional neural network for pan-sharpening. In Proceedings of the International Workshop on Remote Sensing with Intelligent Processing (pp. 1–4). Piscataway, NJ: IEEE.

Selesnick, I. W., Baraniuk, R. G., & Kingsbury, N. C. (2005). The dual-tree complex wavelet transform. IEEE Signal Processing Magazine, 22(6), 123–151.

Sheikh, H. R., & Bovik, A. C. (2006). Image information and visual quality. IEEE Transactions on Image Processing, 15(2), 430.

Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 888–905.

Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556.

Simo-Serra, E., Iizuka, S., Sasaki, K., & Ishikawa, H. (2016). Learning to simplify: Fully convolutional networks for rough sketch cleanup. ACM Transactions on Graphics, 35(4), 121.

Socolinsky, D. A., & Wolff, L. B. (2002). Multispectral image visualization through first-order fusion. IEEE Transactions on Image Processing, 11(8), 923–931.

Stathaki, T. (2008). Image fusion: Algorithms and applications. Amsterdam: Elsevier.

Tian, J., & Chen, L. (2012). Adaptive multi-focus image fusion using a wavelet-based statistical sharpness measure. Signal Processing, 92(9), 2137–2146.

Tian, J., Chen, L., Ma, L., & Yu, W. (2011). Multi-focus image fusion using a bilateral gradient-based sharpness criterion. Optics Communications, 284(1), 80–87.

Wang, Z., Ma, Y., & Gu, J. (2010). Multi-focus image fusion using PCNN. Pattern Recognition, 43(6), 2003–2016.

Wei, Y., Yuan, Q., Shen, H., & Zhang, L. (2017). Boosting the accuracy of multispectral image pansharpening by learning a deep residual network. IEEE Geoscience and Remote Sensing Letters, 14(10), 1795–1799.

Xie, S., & Tu, Z. (2016). Holistically-nested edge detection. In Proceedings of the IEEE International Conference on Computer Vision (pp. 1395–1403). Piscataway, NJ: IEEE.

Xydeas, C. S., & Petrovic, V. (2000). Objective image fusion performance measure. Electronics Letters, 36(4), 308–309.

Yang, S., Wang, M., Jiao, L., Wu, R., & Wang, Z. (2010). Image fusion based on a new contourlet packet. Information Fusion, 11(2), 78–84.

Yin, H., Li, Y., Chai, Y., Liu, Z., & Zhu, Z. (2016). A novel sparse-representation-based multi-focus image fusion approach. Neurocomputing, 216(C), 216–229.

Zhang, Q., & Guo, B. L. (2009). Multifocus image fusion using the nonsubsampled contourlet transform. Signal Processing, 89(7), 1334–1346.

Zhang, Y., Bai, X., & Wang, T. (2017). Boundary finding based multi-focus image fusion through multi-scale morphological focus-measure. Information Fusion, 35, 81–101.

Zhou, Z., Li, S., & Wang, B. (2014). Multi-scale weighted gradient-based fusion for multi-focus images. Information Fusion, 20, 60–72.