## Abstract

With the wide deployment of heterogeneous networks, huge amounts of data characterized by high volume, high variety, high velocity, and high veracity are being generated. These data, referred to as multimodal big data, contain abundant intermodality and cross-modality information and pose vast challenges to traditional data fusion methods. In this review, we present some pioneering deep learning models for fusing multimodal big data. As the exploration of multimodal big data deepens, some challenges remain to be addressed. Thus, this review presents a survey on deep learning for multimodal data fusion to provide readers, regardless of their original community, with the fundamentals of multimodal deep learning fusion methods and to motivate new deep learning techniques for multimodal data fusion. Specifically, the representative architectures that are widely used are summarized as fundamental to the understanding of multimodal deep learning. Then the current pioneering multimodal data fusion deep learning models are summarized. Finally, some challenges and future topics of multimodal data fusion deep learning models are described.

## 1 Introduction

Recently, many heterogeneous networks have been successfully deployed in both low-layer and high-layer applications, including the Internet of Things, vehicular networks, and social networks (Zhang, Patras, & Haddadi, 2019; Meng, Li, Zhang, & Zhu, 2019; Qiu, Chen, Li, Atiquzzaman, & Zhao, 2018). With the wide deployment of heterogeneous networks, increasing amounts of data are being generated and collected at an unprecedented speed. These data, often referred to as big data, are characterized by high volume, high variety, high velocity, and high veracity (Gao, Li, & Chen, 2019; Lv, Song, Val, Steed, & Jo, 2017). These data, which include structured, semistructured, and unstructured data, are also multimodal: each modality, with its own source, type, and distribution, contains modality-specific information (Li, Yang, & Zhang, 2019; Gao, Li, & Li, 2016). For example, a sports news web page uses images to record scenes of the sport and text to describe its content. These images and texts describe the same event in different raw forms. A reasonable fusion of these multimodal data can help us better understand the event of interest, especially when one modality is incomplete (Khaleghi, Khamis, Karray, & Razavi, 2013; Lahat, Adali, & Jutten, 2015). Thus, with the increasing availability and accessibility of multimodal data, the fusion of information in multimodal data is a vital topic in big data research, providing opportunities to better understand cross-modality and shared-modality information.

Multimodal data fusion, a fundamental method of multimodal data mining, aims to integrate data of different distributions, sources, and types into a global space in which both intermodality and cross-modality information can be represented in a uniform manner (Bramon et al., 2012; Bronstein, Bronstein, Michel, & Paragios, 2010; Poria, Cambria, Bajpai, & Hussain, 2017). It can provide richer information than a single modality by leveraging modality-specific information (Biessmann, Plis, Meinecke, Eichele, & Muller, 2011; Wagner, Andre, Lingenfelser, & Kim, 2011). In the past, some multimodal data fusion methods were presented to explore the complementary and cross-modality information between modalities (Sui, Adali, Yu, Chen, & Calhoun, 2012). For example, Kettenring (1971) proposed multimodal canonical correlation analysis to capture the linear intermodality relationship as well as the cross-modality generalization information. Martinez-Montes, Valdes-Sosa, Miwakeichi, Goldman, and Cohen (2004) proposed a partial least squares model that captures linear relationships over multiple variables, discovering the variables from the multi-source data sets. Groves, Beckmann, Smith, and Woolrich (2011) presented a multimodal independent component analysis, a probabilistic model that uses the Bayesian framework to combine the independent variables of each modality. These multimodal data fusion methods are of limited use for big multimodal data of high volume, high velocity, high variety, and high veracity, since they are based on shallow features that cannot capture the intrinsic internal structures and external relationships in multimodal data (Li, Chen, Yang, Zhang, & Deen, 2018; Zhang, Yang, & Chen, 2016). Thus, fully mining the patterns in multimodal data requires new multimodal computing techniques.

Multimodal big data, like traditional big data, are of high volume, variety, velocity, and veracity. However, variety is more prominent in multimodal big data than the other characteristics. In particular, multimodal big data are composed of several modalities, each of which has its own distribution and contains part of the description of the same object of interest. There are also complex correlations between modalities. Fully modeling the fusion representations hidden in the intermodality and cross-modality information can further improve the performance of various multimodal applications.

Deep learning, a hierarchical computation model, learns multilevel abstract representations of the data (LeCun, Bengio, & Hinton, 2015). It uses the backpropagation algorithm to train its parameters, which can transform raw inputs into effective task-specific representations. There are several well-known deep architectures: convolutional neural networks (CNN), recurrent neural networks (RNN), and generative adversarial networks (GAN) (Bengio, Courville, & Vincent, 2013; Chen & Lin, 2014). These deep learning methods have made great progress in both generative and discriminative tasks based on supervised and unsupervised training strategies (Guo et al., 2016). For example, Han, Kim, and Kim (2017) presented a deep pyramidal residual network by introducing a new residual strategy, which is a representative model for discriminative tasks. This pyramidal residual network can learn effective and robust abstract representations in which the task-specific factors are amplified and the irrelevant factors are suppressed, surpassing the previous state-of-the-art pattern recognition accuracy. A representative generative example is the generative adversarial network, a game theory paradigm of deep learning (Goodfellow et al., 2014). The generative adversarial network can capture the intrinsic input structure based on the Nash equilibrium between the generator and the discriminator, reconstructing input objects. Also, there are some pioneering deep learning models in multimodal data fusion domains, such as cross-modality retrieval, image annotation, and assistant diagnosis. Although multimodal data fusion deep learning models have made some progress, they are still at a preliminary stage. Thus, we review the representative multimodal deep learning models to motivate new paradigms of multimodal data fusion deep learning.

In the recent past, enormous amounts of multimodal big data were generated from widely deployed heterogeneous networks. Traditional multimodal data fusion methods cannot properly capture the intermodality representations and the cross-modality complementary correlations of the multimodal big data, since these are shallow models that cannot learn the intrinsic representation of data. Some pioneering work inspired by deep learning methods has proposed exploring the fusion of multimodal data. These deep learning–based multimodal methods have made some progress in various domains, including language translation, image annotation, and medical assistant diagnosis. But the research of deep learning for multimodal data fusion is still in a preliminary stage, and there is no work that reviews multimodal deep learning models. This review of deep learning for multimodal data fusion will provide readers with the fundamentals of the multimodal deep learning fusion method and motivate new multimodal deep learning fusion methods. The representative architectures—DBN, SAE, CNN, and RNN—are summarized because they are fundamental to understanding multimodal deep learning fusion models. Next, the pioneering multimodal deep learning fusion models are summarized from the task, model framework, and data set perspectives. They are grouped by the deep learning architecture used. Finally, some challenges and future topics of deep learning for multimodal data fusion are described.

## 2 The Representative Deep Learning Architectures

In this section, we introduce representative deep learning architectures of the multimodal data fusion deep learning models. Specifically, the definition, feedforward computing, and backpropagation computing of deep architectures, as well as the typical variants, are presented. The representative models are summarized in Table 1.

| Architecture | Representative Models | Model Features |
|---|---|---|
| Deep belief net | RBM (Zhang et al., 2018) | A generative graphical model that uses an energy function to capture the probability distribution between visible units and hidden units. |
| | SRBM (Chen et al., 2017) | A sparse variant in which each hidden unit connects to part of the visible units, preventing model overfitting based on hierarchical latent tree analysis. |
| | FRBM (Ning et al., 2018) | A fast variant trained by the lean CD algorithm, in which bounds-based filtering and the delta product reduce redundant dot product calculations. |
| | TTRBM (Ju et al., 2019) | A compact variant in which the parameters between the visible layer and hidden layer are reduced by transforming them into the tensor-train format. |
| Stacked autoencoder | AE (Michael et al., 2018) | A basic fully connected network that uses the encoder-decoder strategy in an unsupervised manner to learn intrinsic features of data. |
| | DAE (Vincent et al., 2008) | A denoising variant that reconstructs the clean data from the noisy data. |
| | SAE (Makhzani & Frey, 2013) | A sparse variant that captures sparse representations of the input by adding a constraint to the loss function. |
| | GAE (Hou et al., 2019) | An adversarial variant in which the decoder subnetwork is also regarded as the generator, adopting game theory to produce features more consistent with the input data. |
| | FAE (Ashfahani et al., 2019) | An evolving variant that constructs an adaptive network structure in the learning of representations, based on the network significance. |
| | BAE (Angshul, 2019) | A blind denoising variant that adds the path-loss term to the loss function, based on dictionary learning. |
| Convolutional neural network | Alexnet (Krizhevsky, Sutskever, & Hinton, 2012) | Nonsaturating neurons and dropout are adopted in the nonlinear computational layers, with a GPU implementation. |
| | ResNet (He et al., 2016) | A shortcut connection is used to cross several layers to backpropagate the network loss to previous layers. |
| | Inception (Christian et al., 2017) | A deeper and wider network is designed by using the uniform grid size for the blocks with auxiliary information. |
| | SEnet (Cao et al., 2019) | Informational embedding and adaption recalibration are regarded as self-attention operations. |
| | ECNN (Sandler et al., 2018) | The low-rank convolution replaces the full-rank convolution to improve learning efficiency without much accuracy loss. |
| Recurrent neural network | RNN (Zhang et al., 2014) | A fully connected network where the self-connection between hidden layers is used to model the time dependency. |
| | BiRNN (Schuster & Paliwal, 1997) | Two independent computing processes are used to encode the forward and the backward dependency. |
| | LSTM (Hochreiter & Schmidhuber, 1997) | The memory block is introduced to model the long-time dependency well. |
| | SRNN (Lei et al., 2018) | A fast variant in which the light recurrence and highway network are proposed to improve learning efficiency for a parallelized implementation. |
| | VRNN (Jang et al., 2019) | A variational variant that uses the variational encoder-decoder strategy to model the temporal intrinsic features. |


Notes: RBM: restricted Boltzmann machine; SRBM: sparse restricted Boltzmann machine; FRBM: fast restricted Boltzmann machine; TTRBM: tensor-train restricted Boltzmann machine; AE: autoencoder; DAE: denoising autoencoder; SAE: K-sparse autoencoder; GAE: generative autoencoder; FAE: fast autoencoder; BAE: blind autoencoder; Alexnet: Alex convolutional net; ResNet: residual convolutional net; Inception: Inception; SEnet: squeeze excitation network; ECNN: efficient convolutional neural network; RNN: recurrent neural network; BiRNN: bidirectional recurrent neural network; LSTM: long short-term memory; SRNN: slight recurrent neural network; VRNN: variational recurrent neural network.

### 2.1 The Deep Belief Net (DBN)

Unfortunately, in those gradient-computing equations, the term $\sum_{x} P(x)\, P(h_i = 1 \mid x)$ is difficult to compute (Hinton, Osindero, & Teh, 2006). In practice, it is approximated with Markov chain Monte Carlo (MCMC) methods, such as the contrastive divergence algorithm.

Recently, some advanced RBMs have been proposed to improve performance. For instance, to avoid network overfitting, Chen, Zhang, Yeung, and Chen (2017) designed the sparse Boltzmann machine, which learns the network structure based on a hierarchical latent tree. Ning, Pittman, and Shen (2018) introduced fast contrastive-divergence algorithms to RBMs, where bounds-based filtering and the delta product are used to reduce redundant dot product calculations. To protect the internal structure of multidimensional data, Ju et al. (2019) proposed the tensor RBM, which learns the high-level distribution hidden in multidimensional data and uses tensor decomposition to avoid the curse of dimensionality.

To obtain conditional and joint distributions, the DBN is trained by unsupervised learning in a layer-wise manner. In other words, each hidden layer is modeled as an RBM, and the output of the lower RBM is fed as input to the upper one. In detail, the first hidden layer is modeled as an RBM that takes the training data as input, so that the empirical distribution of the first DBN hidden layer is approximated by the distribution captured by this RBM. Then the captured approximate distribution is fed to the next RBM, that is, the second DBN hidden layer, to further capture the distribution in the training data in the same way. This process is repeated until the last hidden layer is trained.
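The contrastive divergence approximation and the layer-wise procedure described above can be sketched together in a few lines. The following is a minimal, dependency-free illustration rather than the implementation used in any of the cited works: the class name `RBM`, the function `pretrain_dbn`, the toy binary data, the layer sizes, the learning rate, and the use of a single Gibbs step (CD-1) are all assumptions made for the example.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class RBM:
    """Binary RBM trained with one-step contrastive divergence (CD-1)."""

    def __init__(self, n_visible, n_hidden, rng):
        self.rng = rng
        # W[i][j] connects visible unit i to hidden unit j.
        self.W = [[rng.gauss(0.0, 0.01) for _ in range(n_hidden)]
                  for _ in range(n_visible)]
        self.a = [0.0] * n_visible  # visible biases
        self.b = [0.0] * n_hidden   # hidden biases

    def hidden_probs(self, v):
        # P(h_j = 1 | v) for every hidden unit j.
        return [sigmoid(self.b[j] + sum(v[i] * self.W[i][j] for i in range(len(v))))
                for j in range(len(self.b))]

    def visible_probs(self, h):
        # P(v_i = 1 | h) for every visible unit i.
        return [sigmoid(self.a[i] + sum(self.W[i][j] * h[j] for j in range(len(h))))
                for i in range(len(self.a))]

    def sample(self, probs):
        return [1.0 if self.rng.random() < p else 0.0 for p in probs]

    def cd1_step(self, v0, lr):
        ph0 = self.hidden_probs(v0)                              # positive (data) phase
        v1 = self.sample(self.visible_probs(self.sample(ph0)))   # one Gibbs step
        ph1 = self.hidden_probs(v1)                              # negative (model) phase
        for i in range(len(self.a)):
            self.a[i] += lr * (v0[i] - v1[i])
            for j in range(len(self.b)):
                self.W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
        for j in range(len(self.b)):
            self.b[j] += lr * (ph0[j] - ph1[j])

def pretrain_dbn(data, hidden_sizes, epochs=50, lr=0.1, seed=0):
    """Greedy layer-wise pretraining: each RBM is trained on the output of the one below."""
    rng = random.Random(seed)
    layers = []
    layer_input = [[float(x) for x in v] for v in data]
    for n_hidden in hidden_sizes:
        rbm = RBM(len(layer_input[0]), n_hidden, rng)
        for _ in range(epochs):
            for v in layer_input:
                rbm.cd1_step(v, lr)
        layers.append(rbm)
        # Hidden activation probabilities become the training data of the next RBM.
        layer_input = [rbm.hidden_probs(v) for v in layer_input]
    return layers, layer_input

data = [[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1], [1, 1, 0, 0, 0, 0], [0, 0, 0, 0, 1, 1]]
dbn, top_features = pretrain_dbn(data, hidden_sizes=[4, 2])
print([len(rbm.b) for rbm in dbn])  # hidden sizes of the stacked RBMs
print(top_features[0])              # top-layer representation of the first sample
```

In this sketch the upper RBM receives the hidden activation probabilities of the lower one rather than sampled binary states, which is one common choice; sampled states could be used instead.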

After unsupervised learning, the learned parameters (the weights $W$ and hidden biases $b$) are used to initialize a deep discriminative neural network of the same architecture, which places the initial weights near a good local minimum of the training objective. The deep discriminative model is then generally trained further with the stochastic gradient descent algorithm to learn the discriminative knowledge in the object labels (Wang, Wang, Santoso, Chiang, & Wu, 2018).

### 2.2 The Stacked Autoencoder (SAE)

To improve the performance of the autoencoder, some adversarial variants have been proposed by adopting game theory, in which the decoder is regarded as the generator that tries to trick the discriminator. These adversarial variants can produce features that are more consistent with the input data (Hou, Sun, Shen, & Qiu, 2019). To analyze stream data, Ashfahani, Pratama, Lughofer, and Ong (2019) proposed a deep evolving denoising autoencoder that constructs an adaptive network structure in the learning of representations, based on the network significance. To model robust features of inputs, Angshul (2019) designed a blind denoising autoencoder by adding a path-loss term to the loss function, based on dictionary learning. More autoencoder variants can be found in Michael et al. (2018).

As shown in Figure 2, the stacked autoencoder, the most typical fully connected neural network, consists of an input layer, an output layer, and several hidden layers (Sun, Zhang, Hamme, & Zheng, 2016). To learn compact features of the input, the SAE is trained with a two-stage strategy. In the first stage, pretraining, each hidden layer is trained as a basic autoencoder to reconstruct its inputs in an unsupervised manner. For example, the $i$th hidden layer is initialized as the $i$th autoencoder. It takes the activations of the $(i-1)$th hidden layer as input. Then it uses the backpropagation algorithm to adjust its parameters by reconstructing the activations of the $(i-1)$th hidden layer. After each hidden layer has been pretrained in this unsupervised way, the second stage, fine-tuning, uses the discriminative knowledge contained in the data labels to adjust the parameters of the stacked autoencoder so that it learns task-specific representations. This two-stage training helps the stacked autoencoder avoid poor local optima and converge to better performance.
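The pretraining stage described above can be illustrated with a small sketch in which each layer is trained as an autoencoder on the activations of the layer below it. This is a simplified illustration under stated assumptions, not the procedure of any cited paper: the function names (`train_autoencoder`, `pretrain_stack`), the sigmoid units, the squared reconstruction error, the one-hot toy data, and the hyperparameters are all chosen for the example, and the supervised fine-tuning stage is omitted.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(x, W, b):
    # Computes W x + b, where W has one row per output unit.
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def encode(x, W1, b1):
    return [sigmoid(z) for z in affine(x, W1, b1)]

def train_autoencoder(inputs, n_hidden, rng, epochs=300, lr=0.5):
    """Trains one sigmoid autoencoder with SGD on the squared reconstruction error."""
    n_in = len(inputs[0])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    W2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]
    b2 = [0.0] * n_in
    for _ in range(epochs):
        for x in inputs:
            h = encode(x, W1, b1)                        # encoder
            y = [sigmoid(z) for z in affine(h, W2, b2)]  # decoder
            # Backpropagate 0.5 * ||y - x||^2 through both sigmoid layers.
            d_out = [(yi - xi) * yi * (1.0 - yi) for yi, xi in zip(y, x)]
            d_hid = [hj * (1.0 - hj) * sum(d_out[i] * W2[i][j] for i in range(n_in))
                     for j, hj in enumerate(h)]
            for i in range(n_in):
                b2[i] -= lr * d_out[i]
                for j in range(n_hidden):
                    W2[i][j] -= lr * d_out[i] * h[j]
            for j in range(n_hidden):
                b1[j] -= lr * d_hid[j]
                for i in range(n_in):
                    W1[j][i] -= lr * d_hid[j] * x[i]
    return W1, b1  # the decoder is discarded after pretraining

def pretrain_stack(data, hidden_sizes, seed=0):
    """Layer i is trained to reconstruct the activations of layer i - 1."""
    rng = random.Random(seed)
    encoders = []
    activations = [[float(v) for v in x] for x in data]
    for n_hidden in hidden_sizes:
        W1, b1 = train_autoencoder(activations, n_hidden, rng)
        encoders.append((W1, b1))
        activations = [encode(a, W1, b1) for a in activations]
    return encoders, activations

data = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
encoders, codes = pretrain_stack(data, hidden_sizes=[3, 2])
print(codes)  # two-dimensional codes produced by the pretrained stack
```

The returned encoder weights would then serve as the initialization of a network that is fine-tuned with labels.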

### 2.3 The Convolutional Neural Network (CNN)

DBN and SAE are fully connected neural networks. In these two networks, each neuron in a hidden layer connects to every neuron of the previous layer, a topology that produces a great number of connections. To train the weights of these connections, a fully connected neural network requires a great number of training objects to avoid overfitting and underfitting, which is computationally intensive. Also, the fully connected topology does not consider the location information of features contained between neurons. Thus, fully connected deep neural networks (DBN, SAE, and their variants) cannot deal well with high-dimensional data, especially large image and large audio data.
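To make the scale of this difference concrete, the following arithmetic sketch counts the trainable parameters of a fully connected layer and of a convolutional layer, which shares one small kernel across all spatial positions. The input size (a 224 x 224 RGB image), the 1000 hidden units, and the 64 filters of size 3 x 3 are illustrative assumptions, not values taken from the models reviewed here.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def conv2d_params(in_channels, out_channels, kernel_size):
    """Weights plus biases of a 2D convolution; independent of the image size."""
    return out_channels * in_channels * kernel_size * kernel_size + out_channels

dense = dense_params(224 * 224 * 3, 1000)  # every pixel connects to every hidden unit
conv = conv2d_params(3, 64, 3)             # 64 shared 3 x 3 kernels over 3 channels
print(dense)          # 150529000
print(conv)           # 1792
print(dense // conv)  # 84000
```

Under these assumptions, the fully connected layer has roughly 84,000 times as many parameters as the convolutional layer.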

Similar to the fully connected architecture, the CNN is also trained to fit the training data, using the same algorithm (LeCun et al., 1989; LeCun, Bottou, Bengio, & Haffner, 1998; Zeiler & Fergus, 2014). There are three propagation stages in the back-propagation process. At the beginning, the loss is computed in the same way as with the fully connected architecture.

There are some representative CNNs. The most representative one is Alexnet (Krizhevsky, Sutskever, & Hinton, 2012). In Alexnet, nonsaturating neurons and the dropout technique are adopted in the nonlinear computational layers to improve performance. Furthermore, a GPU implementation is used to speed up the convolutional layers. He, Zhang, Ren, and Sun (2016) introduced ResNet to solve the accuracy degradation that occurs as depth increases. In ResNet, a residual block is designed by adding a shortcut connection to a network with several layers, which introduces an identity mapping without extra computational cost. By using the residual module, the CNN depth can reach up to 1000 layers, which greatly contributes to image feature learning. Another example is Inception-V4, in which a deeper and wider network is designed by using the uniform grid size for the blocks (Christian, Sergey, Vincent, & Alexander, 2017). To explicitly model channel interdependencies, some Squeeze-and-Excitation networks have been introduced that use global informational embedding and adaption recalibration operations, which are regarded as self-attention networks on local-and-global information (Jie, Li, & Sun, 2018; Cao, Xu, Lin, Wei, & Hu, 2019). To improve learning efficiency, some fast convolutional networks have been designed by replacing the full-rank convolution with several low-rank convolutions. Those fast implementations can improve learning efficiency without much loss of accuracy (Sandler, Howard, Zhu, Zhmoginov, & Chen, 2018; Zhang, Zhou, Lin, & Sun, 2018). More convolutional variants can be found in Gu et al. (2018).
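The shortcut connection of ResNet can be illustrated with a toy block that computes `relu(x + F(x))`. The sketch below is a simplification under stated assumptions: small fully connected layers stand in for the convolutional layers of the real network, and the function names and weight values are made up for the example.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def linear(x, W, b):
    # Computes W x + b, where W has one row per output unit.
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def residual_block(x, W1, b1, W2, b2):
    """Computes relu(x + F(x)), where F is two linear layers with a ReLU in between."""
    f = linear(relu(linear(x, W1, b1)), W2, b2)     # residual branch F(x)
    return relu([xi + fi for xi, fi in zip(x, f)])  # shortcut adds the input back

x = [0.5, 1.0, 2.0]
zero_w = [[0.0] * 3 for _ in range(3)]
zero_b = [0.0] * 3
out = residual_block(x, zero_w, zero_b, zero_w, zero_b)
print(out)  # [0.5, 1.0, 2.0]
```

When the residual branch outputs zero, the block passes a nonnegative input through unchanged, so an added block only has to learn a correction to the identity. This is the intuition behind stacking many such blocks.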

### 2.4 The Recurrent Neural Network (RNN)

Some well-known variants of the RNN have achieved impressive performance (Zhang, Wang, & Liu, 2014). For example, to model the bidirectional dependency of sequential data, Schuster and Paliwal (1997) proposed the bidirectional RNN, in which two independent computing processes encode the forward dependency and the backward dependency. Another representative variant is LSTM (Hochreiter & Schmidhuber, 1997). This variant introduces memory blocks to address the limitation that the standard RNN architecture cannot model long-time dependencies well. To speed up the training of RNNs, Lei, Zhang, Wang, Dai, and Artzi (2018) proposed a light recurrent unit, in which the light recurrence component is used to disentangle the dependency in the state computation and the highway network component is introduced to adaptively combine input and states. Jang, Seo, and Kang (2019) designed the semantic variational recurrent autoencoder to model the global text features in a sentence-to-sentence manner.
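The two independent computing processes of the bidirectional RNN can be sketched as two simple tanh recurrences, one run over the sequence and one run over its reverse, whose states are concatenated per time step. This is a minimal illustration under stated assumptions: the function names, the tanh activation, and the toy weights are chosen for the example and do not reproduce the original formulation in detail.

```python
import math

def rnn_pass(inputs, W_in, W_rec, b):
    """Runs a tanh RNN over a sequence and returns the hidden state at every step."""
    h = [0.0] * len(b)
    states = []
    for x in inputs:
        # The comprehension reads the previous state h before h is reassigned.
        h = [math.tanh(sum(W_in[j][i] * x[i] for i in range(len(x)))
                       + sum(W_rec[j][k] * h[k] for k in range(len(h)))
                       + b[j])
             for j in range(len(b))]
        states.append(h)
    return states

def birnn(inputs, fwd_params, bwd_params):
    """Concatenates the states of a forward pass and an independent backward pass."""
    forward = rnn_pass(inputs, *fwd_params)
    backward = rnn_pass(inputs[::-1], *bwd_params)[::-1]  # re-align to input order
    return [fw + bw for fw, bw in zip(forward, backward)]

seq = [[1.0], [0.0], [-1.0]]
fwd = ([[0.5], [-0.5]], [[0.1, 0.0], [0.0, 0.1]], [0.0, 0.0])  # 2 forward units
bwd = ([[1.0]], [[0.2]], [0.0])                                 # 1 backward unit
states = birnn(seq, fwd, bwd)
print(len(states), len(states[0]))  # 3 time steps, 3 features per step
```

Each output state therefore depends on both the preceding and the following inputs, which a single forward recurrence cannot provide.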

The deep RNN is built by stacking several recurrent hidden layers with cyclic connections. Thus, it can capture deep features along the object direction as well as along the time direction.

## 3 Deep Learning for Multimodal Data Fusion

In this section, we review the most representative multimodal data fusion deep learning models from the perspectives of the model task, model framework, and evaluating data set. They are grouped into four categories based on the deep learning architecture that is used. The representative multimodal deep learning models are summarized in Table 2.

| Architecture | Representative Model | Model Task | Model Features |
|---|---|---|---|
| DBN based | MDBN (Srivastava & Salakhutdinov, 2012) | Learning the joint distribution over various modalities | Uses the intermodality model to learn the modality-specific feature. Then a one-layer RBM captures the cross-modality distribution. |
| | DMDBN (Suk et al., 2014) | Diagnosing Alzheimer's disease | Extracts features from MRI and PET, followed by a multimodal DBN. Then a hierarchical classifier adaptively combines previous results. |
| | HPMDBN (Ouyang et al., 2014) | Estimating the human pose from multisource information | Two-layer features are extracted from three important pose views. Then an RBM models the joint distributions over the modalities. |
| | HMDBN (Amer et al., 2018) | Detecting sequential events with discriminative labels | The conditional restricted Boltzmann machine is adopted to extract the intercross-modality features with additional discriminative label information. |
| | FMDBN (Al-Waisy et al., 2018) | Recognizing faces from local and deep features | Local features of faces are modeled by the Curvelet transform. Then a DBN is built on the local features to learn deep features of faces. |
| SAE based | MSAE (Ngiam et al., 2011) | Exploring fusion strategies for multimodal data | The multimodality, cross-modality, and shared-modality representation learning methods are introduced based on SAE. |
| | GHMSAE (Hong et al., 2015) | Generating human skeletons from a series of images | The 2D image and 3D pose are transferred into the high-level skeleton space. Then the joint distributions are modeled by the MSE loss based on SAE. |
| | MVAE (Khattar et al., 2019) | Detecting fake news | Uses the variational encoder-decoder architecture to learn the intrinsic distribution over modalities with detecting loss from the detector. |
| | AMSAE (Wang et al., 2018) | Learning intrinsic features of words | Uses the multimodal encoder-decoder architecture to model intrinsic features of words with the association and gating mechanisms. |
| CNN based | MCNN (Ma et al., 2015) | Exploring the image-sentence mapping at different levels | Uses the one-dimensional convolution to capture the image-sentence mapping at word, phrase, and sentence levels, taking local topologies into consideration. |
| | AMCNN (Frome et al., 2013) | Recognizing objects based on label and unannotated text | Improves the performance of the visual system with the help of dense features extracted from unannotated text. |
| | AVDCN (Hou et al., 2018) | Enhancing speech signals with auxiliary visual signals | The intermodality CNN maps audio and visual signals into a shared semantic space, followed by a fully connected network that reconstructs the raw inputs. |
| | MFCNN (Nguyen et al., 2019) | Understanding emotion of movie clips | Uses CNN with fuzzy logic to map modality-specific signals into the shared semantic space. |
| RNN based | MRNN (Mao et al., 2014) | Generating novel descriptions for images | Uses the recurrent network to learn the temporal dependence between sentences and images. |
| | MBiRNN (Karpathy & Li, 2017) | Generating rich descriptions for images at a glance | Bridges the intermodal relationship between visual features captured by the region CNN and text features captured by BiRNN. |
| | MTRNN (Abdulnabi et al., 2018) | Labeling indoor scenes from RGB and depth data | Learns the multimodal joint distribution over various modalities by the RNN and transformer layers. |
| | MGRNN (Narayanan et al., 2019) | Predicting driver behaviors with low-quality data | Uses the gated recurrent cell with the multimodal sensor data to model driver behavior. |
| | ASMRNN (Sano et al., 2019) | Detecting ambulatory sleep from the wearable device data | Adopts bidirectional LSTMs to extract temporal features of each modality, followed by a fully connected layer that concatenates temporal features. |

Architecture . | Representative Model . | Model Task . | Model Features . |
---|---|---|---|

DBN based | MDBN (Srivastava & Salakhutdinov, 2012) | Learning the joint distribution over various modalities | Uses the intermodality model to learn the modality-specific feature. Then a one-layer RBM captures the cross-modality distribution. |

DMDBN (Suk et al., 2014) | Diagnosing Alzheimer's disease | Extracts features from MRI and PET, followed a multimodal DBN Then a hierarchical classifier adaptively combines previous results. | |

HPMDBN (Ouyang et al., 2014) | Estimating the human pose from multisource information | Two-layer features are extracted from three important pose views. Then an RBM models the joint distributions over multimodal. | |

HMDBN (Amer et al., 2018) | Detecting sequential events with discriminative labels | The conditional restricted Boltzmann machine is adopted to extract the intercross-modality features with additional discriminative label information. | |

FMDBN (Al-Waisy et al., 2018) | Recognizing faces from local and deep features | Local features of faces are modeled by the Curvelet transform. Then a DBN is built on the local features to learn deep features of faces. | |

SAE based | MSAE (Ngiam et al., 2011) | Exploring fusion strategies about multimodal data | The multimodality, cross-modality, and shared-modality representation learning methods are introduced based on SAE. |

GHMSAE (Hong et al., 2015) | Generating human skeletons from a series of images | The 2D image and 3D pose are transferred in the high-level skeleton space. Then the joint distributions are modeled by the MSE loss based on SAE. | |

MVAE (Khattar et al., 2019) | Detection fake news | Uses the variational encoder-decoder architecture to learn the intrinsic distribution over modalities with detecting loss from the detector. | |

AMSAE (Wang et al., 2018) | Learning intrinsic features of words | Uses the multimodal encoder-decoder architecture to model intrinsic features of words with the association and gating mechanisms. | |

CNN based | MCNN (Ma et al., 2015) | Exploring the image-sentence mapping at different levels | Uses the one-dimensional convolution to capture the image-sentence mapping at word, phase, and sentence levels, taking local topologies into consideration. |

AMCNN (Frome et al., 2013) | Recognizing objects based on label and unannotated text | Improve the performance of the visual system with the help of dense features extracted from unannotated text. | |

AVDCN (Hou et al., 2018) | Enhancing speech signals with auxiliary visual signals | The intermodality CNN maps audio and visual signals into shared semantic space, followed by a fully connected network that reconstructs the raw inputs. | |

MFCNN (Nguyen et al., 2019) | Understanding emotion of movie clips | Uses CNN with fuzzy logic to map modality-specific signals into the shared semantic space. | |

RNN based | MRNN (Mao et al., 2014) | Generating novel descriptions for images | Uses the recurrent network to learn the temporal dependence between sentences and images. |

MBiRNN (Karpathy & Li, 2017) | Generating rich descriptions for images at a glance | Bridges the intermodal relationship between visual features captured by the region CNN and text features captured by BiRNN. | |

MTRNN (Abdulnabi et al., 2018) | Labeling indoor scenes from RGB and depth data | Learns the multimodal joint distribution over various modalities by the RNN and transformer layers. | |

MGRNN (Narayanan et al., 2019) | Predicting driver behaviors with low-quality data | Uses the gate recurrent cell with the multimodal sensor data to model driver behavior. | |

ASMRNN (Sano et al., 2019) | Detecting ambulatory sleep from the wearable device data | Adopts bidirectional LSTMs to extract temporal features of each modality, followed by a fully connected layer that concatenates temporal features. |

Notes: MDBN: multimodal deep Boltzmann machine; DMDBN: diagnosis multimodal deep Boltzmann machine; HPMDBN: human pose deep Boltzmann machine; HMDBN: hybrid multimodal deep Boltzmann machine; FMDBN: face multimodal deep Boltzmann machine; MSAE: multimodal stacked autoencoder; GHMSAE: generating human-skeleton multimodal stacked autoencoder; MVAE: multimodal variational autoencoder; AMSAE: association-gating mechanism multimodal stacked autoencoder; MCNN: multimodal convolutional neural network; AMCNN: auxiliary multimodal convolutional neural network; AVDCN: audiovisual deep convolutional network; MFCNN: multimodal fuzzy convolutional neural network; MRNN: multimodal recurrent neural network; MBiRNN: multimodal bidirectional recurrent neural network; MTRNN: multimodal transformer recurrent neural network; MGRNN: multimodal gating recurrent neural network; ASMRNN: ambulatory sleep multimodal recurrent neural network.

### 3.1 The Deep Belief Net-Based Multimodal Data Fusion

#### 3.1.1 Example 1

Srivastava and Salakhutdinov (2012) proposed a multimodal generative model based on the deep Boltzmann learning model, learning multimodal representations by fitting the joint distributions of multimodal data over the various modalities, such as image, text, and audio. In this example, the good multimodal representation is defined as follows:

- It should be conceptually similar to the raw inputs.
- It should be easy to obtain even when certain modalities are absent, and it should make it easy to fill in the missing modalities.
- It should improve classification accuracy and retrieval performance for both unimodal and multimodal inputs.

Each module of the proposed multimodal DBN is initialized in an unsupervised layer-wise manner, and an MCMC-based approximate method is adopted for model training.

To evaluate the learned multimodal representation, several tasks are carried out: generating missing modalities, inferring joint representations, and discriminative tasks. Experiments verify that the learned multimodal representation meets the required properties.

#### 3.1.2 Example 2

To effectively diagnose Alzheimer's disease at an early phase, Suk, Lee, Shen, and the Alzheimer's Disease Neuroimaging Initiative (2014) proposed a multimodal Boltzmann model that can fuse the complementary knowledge from the multimodal data. Specifically, to address the limitations caused by the shallow feature learning methods, a DBN is used to learn the deep representations of each modality by transferring the domain-specific representation to the hierarchical abstract representation. Then a one-layer RBM is built on the concatenated vector that is the linear combination of the hierarchical abstract representations from each modality. It is used to learn the multimodal representation by constructing the joint distribution over the different multimodal features. Finally, the proposed model is extensively assessed on the ADNI data set in terms of three typical diagnoses, achieving state-of-the-art diagnosis accuracy.
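The joint layer described above can be illustrated with a minimal sketch. It only computes the hidden-unit activation probabilities of an RBM, $p(h_j = 1 \mid v) = \sigma(b_j + \sum_i W_{ij} v_i)$, where the visible vector $v$ is the concatenation of the per-modality high-level features. The function name `joint_hidden_probs`, the feature values, and the layer sizes are illustrative assumptions, not details from Suk et al. (2014), and training (for example, by contrastive divergence) is omitted.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def joint_hidden_probs(modality_features, weights, hidden_bias):
    """Return p(h_j = 1 | v) for an RBM whose visible vector v is the
    concatenation of the per-modality feature vectors."""
    visible = [x for features in modality_features for x in features]
    return [
        sigmoid(b + sum(w * v for w, v in zip(row, visible)))
        for row, b in zip(weights, hidden_bias)
    ]


# Hypothetical high-level features produced by two modality-specific networks.
modality_a = [0.9, 0.1, 0.4]
modality_b = [0.2, 0.7]

random.seed(0)
n_visible, n_hidden = 5, 4
# weights[j] holds the connections from every visible unit to hidden unit j.
weights = [[random.gauss(0.0, 0.1) for _ in range(n_visible)]
           for _ in range(n_hidden)]
hidden_bias = [0.0] * n_hidden

joint = joint_hidden_probs([modality_a, modality_b], weights, hidden_bias)
```

Here `joint` plays the role of the multimodal representation: every hidden unit sees features from both modalities at once, which is what allows the layer to model their joint distribution.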

#### 3.1.3 Example 3

To accurately estimate human poses, Ouyang, Chu, and Wang (2014) designed a multisource deep learning model that learns multimodal representation from mixture type, appearance score, and deformation modalities by extracting the joint distribution of the body pattern in high-order space. In the human-pose multisource deep model, the three widely used modalities are extracted from the pictorial structure models, which combine parts of the body based on conditional random field theory. To get the multimodal data, the pictorial structure model is trained by the linear support vector machine. After that, each of these three features is fed into a two-layer restricted Boltzmann model to capture abstract representations of the high-order pose space from the feature-specific representations. With the unsupervised initialization, each modality-specific restricted Boltzmann model captures the inherent representation of the global space. Then an RBM is used to further learn the human pose representation based on the concatenated vector of the high-level mixture type, appearance score, and deformation representations. To train the proposed multisource deep learning model, a task-specific objective function is designed that considers both body locations and human detection. The presented model is verified on LSP, PARSE and UIUC, and yields up to 8.6% improvement.

Recently some new DBN-based models for multimodal feature learning have been proposed. For instance, Amer, Shields, Siddiquie, and Tamrakar (2018) proposed a hybrid method for sequential event detection, in which the conditional RBM is adopted to extract the intermodality and cross-modality features with additional discriminative label information. Al-Waisy, Qahwaji, Ipson, and Al-Fahdawi (2018) introduced a multimodal method to recognize faces. In this method, a DBN-based model is used to model the multimodal distribution over the local handcrafted features captured by the Curvelet transform, which can merge the advantages of the local and deep features (Al-Waisy et al., 2018).

#### 3.1.4 Summary

Those DBN-based multimodal models use the probabilistic graphical network to transfer the modality-specific representations into the semantic features in the shared space. Then the joint distribution over modalities is modeled based on the features of the shared space. Those DBN-based multimodal models are more flexible and robust in unsupervised, semisupervised, and supervised learning strategies. They are well suited to capture informative features of input data. However, they neglect the spatial and temporal topologies of the multimodal data.

### 3.2 The Stacked Autoencoder-Based Multimodal Data Fusion

#### 3.2.1 Example 4

| | Feature Learning | Supervised Training | Testing |
|---|---|---|---|
| Classic deep learning | Audio | Audio | Audio |
| | Video | Video | Video |
| Multimodal fusion | A $+$ V | A $+$ V | A $+$ V |
| Cross-modality learning | A $+$ V | Video | Video |
| | A $+$ V | Audio | Audio |
| Shared representation learning | A $+$ V | Audio | Video |
| | A $+$ V | Video | Audio |


In the multiple-modality learning scenario, the audio spectrogram and the video frame are linearly concatenated into a single vector. The concatenated vector is fed into a sparse restricted Boltzmann machine (SRBM) to learn the correlation between audio and video. This model can learn only a shallow joint representation of the modalities, since the correlation is implicit in the raw, high-dimensional representations and a one-layer SRBM cannot model it. Motivated by this, the concatenated vector of the midlevel representations is instead fed into the SRBM to model the correlation between modalities, which yields better performance.

In the cross-modality learning scenario, a deep stacked multimodal autoencoder is proposed to explicitly learn the correlation between modalities. Specifically, both audio and video are presented as input in the feature learning, and only one of them is fed into the model in the supervised training and testing. This model is initialized in the way of multimodal learning and can model the cross-modality relationship well.

In the shared-representation learning scenario, a modality-specific deep stacked multimodal autoencoder, motivated by the denoising autoencoder, is introduced to explore the joint representation between modalities, especially when one modality is absent. In feature learning, the model is fed a training data set that is enlarged by replacing one of the modalities with zeros.
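The enlargement of the training set by zeroing out one modality can be sketched as follows. This is a minimal illustration under assumptions: the helper name `zero_mask_augment`, the equal mix of the three example types, and the choice of the full unmasked pair as the reconstruction target are not taken from the source.

```python
def zero_mask_augment(pairs):
    """Expand (audio, video) pairs into (input, target) examples.

    Each pair yields three examples: both modalities present, audio
    replaced by zeros, and video replaced by zeros. The target is always
    the full, unmasked pair, so an autoencoder trained on these examples
    must reconstruct the missing modality from the one that remains.
    """
    examples = []
    for audio, video in pairs:
        target = list(audio) + list(video)
        zero_audio = [0.0] * len(audio)
        zero_video = [0.0] * len(video)
        examples.append((list(audio) + list(video), target))
        examples.append((zero_audio + list(video), target))
        examples.append((list(audio) + zero_video, target))
    return examples


# Hypothetical toy features: two audio values and three video values.
pairs = [([0.5, 0.2], [0.1, 0.9, 0.3])]
augmented = zero_mask_augment(pairs)
```

Because the target never changes, the same network can be used at test time with only audio or only video as input.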

Finally, detailed experiments are conducted on the CUAVE and AVLetters data sets to evaluate the performance of the multimodal deep learning for task-specific feature learning.

#### 3.2.2 Example 5

To generate visually and semantically effective human skeletons from a series of images, especially videos, Hong, Yu, Wan, Tao, and Wang (2015) proposed a multimodal deep autoencoder to capture the fusion relationship between images and poses. In particular, the proposed multimodal deep autoencoder is trained by a three-stage strategy to construct the nonlinear mapping between two-dimensional images and three-dimensional poses. In the feature fusion stage, the multiview hypergraph low-rank representation is used to construct the inner two-dimensional representation from a series of image features, such as histograms of oriented gradients and shape context, based on manifold learning. In the second stage, a one-layer autoencoder is trained to learn the abstract representation that is used to recover the three-dimensional pose by reconstructing the two-dimensional interimage features. At the same time, a one-layer autoencoder is trained in a similar way to learn the abstract representation of three-dimensional poses. After obtaining the abstract representation of each single modality, a neural network is used to learn the multimodal correlation between the two-dimensional image and the three-dimensional pose by minimizing the squared Euclidean distance between the interrepresentation of the two modalities. The learning of the presented multimodal deep autoencoder is composed of the initialization and the fine-tuning phases. In the initialization, the parameters of each subpart of the multimodal deep autoencoder are copied from the corresponding autoencoder and the neural network. Then the parameters of the whole model are further fine-tuned by the stochastic gradient descent algorithm to construct the three-dimensional pose from the corresponding two-dimensional image.
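The correlation-learning step can be written compactly. The notation here is assumed for illustration and is not taken from Hong et al. (2015): let $h^{2D}_i$ and $h^{3D}_i$ denote the hidden representations of the $i$-th image and the $i$-th pose produced by the two single-modality autoencoders, and let $g_\theta$ denote the neural network that maps between them. Minimizing the squared Euclidean distance over $N$ training pairs then reads

$$
\min_{\theta} \; \sum_{i=1}^{N} \left\lVert g_\theta\!\left(h^{2D}_i\right) - h^{3D}_i \right\rVert_2^2 .
$$

After this step, a three-dimensional pose can be recovered by encoding the image features, applying $g_\theta$, and decoding with the pose autoencoder, which is the pipeline that the fine-tuning phase optimizes end to end.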

To evaluate the proposed multimodal deep autoencoder, extensive experiments are conducted on three typical image-pose data sets—Walking, HumanEva-I, and Human 3.6M—outperforming prior models in terms of pose recovery.

Some other representative SAE-based models have been proposed to learn the joint distribution over modalities. For example, Wang, Zhang, and Zong (2018) designed a multimodal stacked autoencoder for feature learning of words, in which association and gating mechanisms are adopted to improve the word features. Khattar, Goud, Gupta, and Varma (2019) designed a multimodal variational framework based on the encoder-decoder architecture. This framework is composed of an encoder that models each single-modality feature, a decoder that reconstructs each modality, and a detector for fake news detection.

#### 3.2.3 Summary

The SAE-based multimodal models use the encoder-decoder architecture to extract the intrinsic intermodality feature and cross-modality feature by the reconstruction method in an unsupervised manner. Since they are based on SAE, which is a fully connected model, a lot of parameters need to be trained. Also, they neglect the spatial and temporal topologies in the multimodal data.

### 3.3 The Convolutional Neural Network–Based Multimodal Data Fusion

#### 3.3.1 Example 6

To model the semantic mapping distribution between images and sentences, Ma, Lu, Shang, and Li (2015) proposed a multimodal convolutional neural network. To fully capture the semantic correlations, a three-level fusion strategy (word level, phrase level, and sentence level) is designed in an end-to-end architecture. The architecture consists of the image subnetwork, the matching subnetwork, and the multimodal subnetwork. The image subnetwork is a representative deep convolutional neural network, such as AlexNet or Inception, which effectively encodes the image input into a concise representation. The matching subnetwork models the joint representation that associates the image content with the word fragments of sentences in the semantic space.

To deeply integrate the image with the sentence, the word-fragment, phrase-fragment, and sentence-fragment matching networks are devised. The word-fragment matching network is a convolutional neural network that takes the word and the concise image representation as inputs to a one-dimensional convolution followed by a one-dimensional max-pooling layer with a two-unit window. This word-fragment matching network exploits local receptive fields and parameter sharing, which reduces the number of free parameters. The phrase-matching network first transfers the words of each sentence into phrase fragments that contain more semantic knowledge than word fragments. Then it models the joint multimodal distributions by using the one-dimensional convolution to combine the phrase fragments with image features. Similarly, the sentence-matching network learns the semantic representation of each sentence. After that, it combines the semantic representation of sentences with the image representation at the sentence level. The last evaluating subnetwork uses a multilayer perceptron that evaluates those multimodal joint representations. Finally, an ensemble framework that combines the word, phrase, and sentence multimodal representations is proposed to mine the cross-modality correlation between images and texts.

To evaluate the learned multimodal representation, the multimodal convolutional neural networks are evaluated on the Flickr8K and Flickr30K data sets for the bidirectional image and sentence retrieval task.
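The two building blocks named above, a one-dimensional convolution and a max-pooling layer with a two-unit window, can be sketched in a few lines. This is a simplified illustration rather than the architecture of Ma et al. (2015): each position carries a single scalar feature, the fusion with the image representation is left out, and the function names and the ReLU nonlinearity are assumptions.

```python
def conv1d(sequence, kernel, bias=0.0):
    """Valid 1-D convolution (cross-correlation) followed by a ReLU.

    The same kernel slides over every window, so parameters are shared
    and each output depends only on a local receptive field.
    """
    width = len(kernel)
    return [
        max(0.0, bias + sum(k * x for k, x in zip(kernel, sequence[i:i + width])))
        for i in range(len(sequence) - width + 1)
    ]


def max_pool1d(sequence, window=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [
        max(sequence[i:i + window])
        for i in range(0, len(sequence) - window + 1, window)
    ]


# Hypothetical scalar word features for a five-word sentence.
word_features = [1.0, 2.0, 0.0, 3.0, 1.0]
local = conv1d(word_features, kernel=[0.5, 0.5])   # [1.5, 1.0, 1.5, 2.0]
pooled = max_pool1d(local, window=2)               # [1.5, 2.0]
```

The kernel has only two weights regardless of sentence length, which is the parameter saving that the paragraph attributes to local receptive fields and weight sharing.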

#### 3.3.2 Example 7

To scale the vision recognition system to an unlimited number of discrete categories, Frome et al. (2013) presented a multimodal convolutional neural network that leverages the semantic information from text data. This network is composed of the language submodel and the visual submodel. The language submodel is based on the skip-gram model, which can transfer text information into a dense representation of the semantic space. The visual submodel is a representative convolutional neural network, such as AlexNet, that is pretrained on a 1000-class ImageNet data set to capture visual features. To model the semantic relationship between images and texts, the language and visual submodels are combined by a linear projection layer. Each submodel is initialized by parameters from each single modality. After that, to train this visual-semantic multimodal model, a novel loss function is proposed that combines the dot-product similarity with a hinge rank loss, giving high similarity scores to the correct image and label pairs. This model can yield state-of-the-art performance on the ImageNet data set, avoiding semantically unreasonable results.
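A loss that combines dot-product similarity with a hinge rank penalty can be sketched as below. This is an illustrative form under assumptions: the image vector is taken to be already projected into the label-embedding space, and the function names, the toy embeddings, and the margin value are hypothetical rather than taken from Frome et al. (2013).

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def hinge_rank_loss(image_vec, label_vecs, true_label, margin=0.5):
    """Sum of max(0, margin - s_true + s_other) over all wrong labels,
    where s is the dot-product similarity between the projected image
    vector and a label embedding."""
    true_score = dot(image_vec, label_vecs[true_label])
    return sum(
        max(0.0, margin - true_score + dot(image_vec, vec))
        for label, vec in label_vecs.items()
        if label != true_label
    )


# Hypothetical two-dimensional label embeddings.
label_vecs = {"cat": [1.0, 0.0], "dog": [0.0, 1.0], "car": [-1.0, 0.0]}
image_vec = [1.0, 0.0]  # assumed output of the linear projection layer

loss_correct = hinge_rank_loss(image_vec, label_vecs, "cat")  # 0.0
loss_wrong = hinge_rank_loss(image_vec, label_vecs, "dog")    # 1.5
```

The loss is zero only when the correct label outscores every other label by at least the margin, so minimizing it ranks the correct image and label pair above the incorrect ones.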

There are also some new CNN-based architectures to learn the multimodal features. For instance, Hou, Wang, Lai, Chang, and Wang (2018) proposed a multimodal speech enhancement framework. In the proposed framework, CNN is used to capture intermodality features in audio and visual signals. Then a fully connected network models the joint distribution by reconstructing the raw inputs. Nguyen, Kavuri, and Lee (2019) introduced a multimodal CNN network to classify the emotion of movie clips. In this multimodal network, the fuzzy logic combined with CNN is used to model intermodality features from audio, visual, and text modalities.

#### 3.3.3 Summary

The CNN-based multimodal models can learn local multimodal features between modalities by using local receptive fields and pooling operations. They explicitly model the spatial topologies of the multimodal data. Because they are not fully connected, the number of parameters is greatly reduced.

### 3.4 The Recurrent Neural Network–Based Multimodal Data Fusion

#### 3.4.1 Example 8

After that, the backpropagation algorithm is used to update parameters of the proposed model. Finally, the image caption, image retrieval, and sentence retrieval tasks are used to evaluate the proposed models on the IAPR TC-12, Flickr 8K, Flickr 30K, and MS COCO data sets. The results show that the proposed model outperforms state-of-the-art models.

#### 3.4.2 Example 9

Aiming to address the limitation that current visual recognition systems cannot generate rich descriptions for images at a glance, a multimodal alignment model is presented by bridging the intermodal relationship between visual and text data (Karpathy & Li, 2017). To achieve that, a twofold scheme is proposed. First, a visual-semantic embedding model is designed to generate the multimodal training data set. Then a multimodal RNN is trained on this data set to generate the rich descriptions of images.

In the visual-semantic embedding model, the region convolutional neural network is used to get rich image representations that contain enough information on the image content corresponding to the sentence. Then a bidirectional RNN is used to encode each sentence into a dense vector of the same dimension as the image representation. Moreover, a multimodal score function is given to measure the semantic similarity between images and sentences. Finally, the Markov random field method is used to generate the multimodal data set.

In the multimodal RNN, a more effective extended model is proposed, which is based on the text content and image input. This multimodal model is composed of a convolutional neural network that encodes the image input and an RNN that encodes the image feature and the sentence. This model is also trained by the stochastic gradient descent algorithm. Both multimodal models are extensively evaluated on the Flickr and MS COCO data sets and achieve state-of-the-art performance.

There are some new RNN-based multimodal deep learning methods. For example, Abdulnabi, Shuai, Zuo, Chau, and Wang (2018) designed a multimodal RNN to label indoor scenes in which the intermodality feature and cross-modality feature are learned by the RNN and transform layers. Narayanan, Siravuru, and Dariush (2019) designed the gate recurrent cell with the multimodal sensor data to model driver behaviors. Sano, Chen, Lopez-Martinez, Taylor, and Picard (2019) proposed a multimodal BiLSTM to detect ambulatory sleep in which the BiLSTM is used to extract features of data collected from wearable devices. Then each intermodality feature is concatenated by a fully connected network.

#### 3.4.3 Summary

The RNN-based multimodal models are able to analyze the temporal dependency hidden in the multimodal data with the help of the explicit state transfer in the computation of hidden units. They use the backpropagation-through-time algorithm to train parameters. Due to the computation in the hidden state transfer, it is difficult to parallelize on the high-performance devices.
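The sequential dependency that hinders parallelization is visible in even the simplest recurrence. The sketch below uses a scalar Elman-style cell, $h_t = \tanh(w_{in} x_t + w_{rec} h_{t-1})$; the function name and the weights are illustrative assumptions and do not correspond to any model surveyed above.

```python
import math


def rnn_forward(inputs, w_in, w_rec, h0=0.0):
    """Scalar Elman-style recurrence: h_t = tanh(w_in * x_t + w_rec * h_{t-1})."""
    states = []
    h = h0
    for x in inputs:
        # Step t cannot start until step t-1 has produced h, so the loop
        # over time steps is inherently sequential.
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states


states = rnn_forward([1.0, 0.0, 0.0], w_in=1.0, w_rec=1.0)
changed = rnn_forward([2.0, 0.0, 0.0], w_in=1.0, w_rec=1.0)
```

Changing only the first input alters every later state, which shows how the hidden state carries temporal context forward and why the time steps cannot be computed independently.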

## 4 Summary and Perspectives

Deep learning is an active branch of data mining. Recently, many representative deep learning architectures have been proposed to deal with problems in various domains, such as feature learning, audio compression, and image generation. Powered by the accessibility of high-volume data, these architectures have made great progress, outperforming other methods in the corresponding domains. High-performance computing devices, such as GPUs, CPU clusters, and cloud computing platforms, are also used to improve training efficiency. The explosion and accessibility of multimodal data in heterogeneous networks provide vast opportunities to mine the intrinsic knowledge of heterogeneous networks from multiple aspects. At the same time, these data pose vast challenges to traditional multimodal data mining methods because of their high volume, velocity, variety, and veracity. In response, some pioneering multimodal deep learning models have been presented for data fusion. In this survey, we summarized these models in four groups according to the representative deep learning architecture on which they are built: DBN, SAE, CNN, and RNN. These pioneering models have made some progress; however, they are still at a preliminary stage, and several challenges remain.

First, multimodal data fusion deep learning models contain a great number of free weights, including redundant parameters that have little effect on the task of interest. To train these parameters to capture the feature structures of the data, large amounts of data are fed into the models with the backpropagation algorithm, which is computationally intensive and time-consuming. To increase weight-learning efficiency, some parallel variants of the backpropagation algorithm have been executed on high-performance architectures: CPU clusters, GPUs, and cloud platforms. In turn, the scale of multimodal data fusion deep learning models depends greatly on the computing capability of the training devices. However, the computing capability of current high-performance devices is growing more slowly than the volume of multimodal data. Models trained on high-performance computing devices of the current architecture may therefore fail to learn the feature structures of ever-larger multimodal data well. One future research direction for deep learning on the fusion feature learning of multimodal data is thus to design new learning frameworks with more powerful computing architectures. In addition, the compression of free parameters, an effective way to enhance training efficiency, has made great progress in deep learning for single-modality feature learning. How to build on current compression strategies to design new compression methods for multimodal deep learning is therefore also a potential research direction.

Second, multimodal data contain not only intermodality information but also abundant cross-modality information. To learn both, most existing deep learning models for multimodal data fusion first use a deep model to capture the private features of each modality, transforming the modality-specific raw representation into a high-abstraction representation in a certain global space. These high-abstraction representations are then concatenated into a vector that serves as the global representation of the multimodal data. Finally, a deep model is used to learn high-abstraction representations from the concatenated vectors. However, with this method, the multimodal deep learning models cannot capture the full semantic knowledge of the multimodal data. There is no clear evidence that the modality-specific features lie in the same semantic space, so concatenation may combine features of different semantic levels and lose cross-modality information. Also, the intermodality representations are concatenated in a linear fashion that cannot fit the complex relationships over multiple modalities. As the exploration of multimodal data continues, three or more modalities are combined to mine the intermodality and cross-modality knowledge, and the current multimodal data fusion deep learning models may not achieve the desired results. Thus, new deep learning models for multimodal data that take semantic relationships into consideration are urgently needed. In addition, some semantic fusion strategies, such as multiview fusion, transfer learning fusion, and probabilistic dependency fusion, have made some progress in the semantic fusion of multimodal data. The combination of deep learning and semantic fusion strategies may therefore be a way to solve the challenges posed by the exploration of multimodal data.

Third, multimodal data are collected from dynamic environments, so the data are uncertain: their distribution changes over time. The traditional way for multimodal deep learning to handle dynamic multimodal data is to train a new model whenever the data distribution changes. However, training a new deep learning model takes too much time to satisfy online multimodal data applications. Online learning and incremental learning are representative real-time strategies that learn new knowledge from new data without much loss of historical knowledge. Thus, with the explosion of dynamic multimodal data, the design of online and incremental multimodal deep learning models for data fusion must be addressed. Also, multimodal data are often of low quality, containing noise, incomplete data, and outliers. Currently, several deep learning models focus only on noisy single-modality data. With the explosion of low-quality multimodal data, deep learning models for low-quality multimodal data are urgently needed.

## Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under grants 61602083 and 61672123, the Doctoral Scientific Research Foundation of Liaoning Province 20170520425, the Dalian University of Technology Fundamental Research Fund under grant DUT15RC(3)100, and the China Scholarship Council.