In this section, we review the most representative multimodal data fusion deep learning models from the perspectives of model task, model framework, and evaluation data set. They are grouped into four categories according to the deep learning architecture they use. The representative multimodal deep learning models are summarized in Table 2.
Table 2: Representative multimodal data fusion deep learning models.

| Architecture | Representative Model | Model Task | Model Features |
| --- | --- | --- | --- |
| DBN based | MDBN (Srivastava & Salakhutdinov, 2012) | Learning the joint distribution over various modalities | Uses intermodality models to learn modality-specific features. Then a one-layer RBM captures the cross-modality distribution. |
| | DMDBN (Suk et al., 2014) | Diagnosing Alzheimer's disease | Extracts features from MRI and PET, followed by a multimodal DBN. Then a hierarchical classifier adaptively combines the previous results. |
| | HPMDBN (Ouyang et al., 2014) | Estimating the human pose from multisource information | Two-layer features are extracted from three important pose views. Then an RBM models the joint distribution over the modalities. |
| | HMDBN (Amer et al., 2018) | Detecting sequential events with discriminative labels | A conditional restricted Boltzmann machine extracts cross-modality features with additional discriminative label information. |
| | FMDBN (Al-Waisy et al., 2018) | Recognizing faces from local and deep features | Local features of faces are modeled by the curvelet transform. Then a DBN is built on the local features to learn deep facial features. |
| SAE based | MSAE (Ngiam et al., 2011) | Exploring fusion strategies for multimodal data | Introduces multimodality, cross-modality, and shared-modality representation learning methods based on SAEs. |
| | GHMSAE (Hong et al., 2015) | Generating human skeletons from a series of images | The 2D images and 3D poses are transferred into a high-level skeleton space. Then the joint distributions are modeled by an MSE loss based on SAEs. |
| | MVAE (Khattar et al., 2019) | Detecting fake news | Uses a variational encoder-decoder architecture to learn the intrinsic distribution over modalities with a detection loss from the detector. |
| | AMSAE (Wang et al., 2018) | Learning intrinsic features of words | Uses a multimodal encoder-decoder architecture to model intrinsic features of words with association and gating mechanisms. |
| CNN based | MCNN (Ma et al., 2015) | Exploring the image-sentence mapping at different levels | Uses one-dimensional convolutions to capture the image-sentence mapping at the word, phrase, and sentence levels, taking local topologies into consideration. |
| | AMCNN (Frome et al., 2013) | Recognizing objects based on labels and unannotated text | Improves the performance of the visual system with the help of dense features extracted from unannotated text. |
| | AVDCN (Hou et al., 2018) | Enhancing speech signals with auxiliary visual signals | An intermodality CNN maps audio and visual signals into a shared semantic space, followed by a fully connected network that reconstructs the raw inputs. |
| | MFCNN (Nguyen et al., 2019) | Understanding the emotion of movie clips | Uses CNNs with fuzzy logic to map modality-specific signals into a shared semantic space. |
| RNN based | MRNN (Mao et al., 2014) | Generating novel descriptions for images | Uses a recurrent network to learn the temporal dependence between sentences and images. |
| | MBiRNN (Karpathy & Li, 2017) | Generating rich descriptions for images at a glance | Bridges the intermodal relationship between visual features captured by a region CNN and text features captured by a BiRNN. |
| | MTRNN (Abdulnabi et al., 2018) | Labeling indoor scenes from RGB and depth data | Learns the multimodal joint distribution over various modalities with RNN and transformer layers. |
| | MGRNN (Narayanan et al., 2019) | Predicting driver behavior from low-quality data | Uses gated recurrent cells with multimodal sensor data to model driver behavior. |
| | ASMRNN (Sano et al., 2019) | Detecting ambulatory sleep from wearable device data | Adopts bidirectional LSTMs to extract temporal features of each modality, followed by a fully connected layer that concatenates the temporal features. |
Notes: MDBN: multimodal deep Boltzmann machine; DMDBN: diagnosis multimodal deep Boltzmann machine; HPMDBN: human pose multimodal deep Boltzmann machine; HMDBN: hybrid multimodal deep Boltzmann machine; FMDBN: face multimodal deep Boltzmann machine; MSAE: multimodal stacked autoencoder; GHMSAE: generating human-skeleton multimodal stacked autoencoder; MVAE: multimodal variational autoencoder; AMSAE: association-gating mechanism multimodal stacked autoencoder; MCNN: multimodal convolutional neural network; AMCNN: auxiliary multimodal convolutional neural network; AVDCN: audiovisual deep convolutional network; MFCNN: multimodal fuzzy convolutional neural network; MRNN: multimodal recurrent neural network; MBiRNN: multimodal bidirectional recurrent neural network; MTRNN: multimodal transformer recurrent neural network; MGRNN: multimodal gating recurrent neural network; ASMRNN: ambulatory sleep multimodal recurrent neural network.
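A pattern common to most of the models in Table 2 is a shared-representation design: modality-specific encoders first learn features for each input stream, and a fusion layer then maps the concatenated codes into a joint latent space, often trained with a reconstruction objective (as in the SAE-based models) or a downstream task loss. The following is a minimal PyTorch sketch of this pattern; the class name, layer sizes, and the joint MSE reconstruction loss are illustrative assumptions and do not reproduce the configuration of any specific model listed above.

```python
import torch
import torch.nn as nn

class SharedRepresentationFusion(nn.Module):
    """Minimal sketch of multimodal fusion through a shared latent space.

    Each modality is encoded separately (modality-specific features), the
    concatenated codes are fused into a shared representation, and each
    modality is reconstructed from that shared code, mirroring the
    encoder-fusion-decoder pattern summarized in Table 2.
    """

    def __init__(self, dim_a=128, dim_b=64, latent_dim=32):
        super().__init__()
        # Modality-specific encoders (e.g., one per sensor or data stream).
        self.encoder_a = nn.Sequential(nn.Linear(dim_a, 64), nn.ReLU())
        self.encoder_b = nn.Sequential(nn.Linear(dim_b, 64), nn.ReLU())
        # Fusion layer mapping concatenated codes into the shared space.
        self.fusion = nn.Sequential(nn.Linear(128, latent_dim), nn.ReLU())
        # Decoders reconstruct each modality from the shared code,
        # an autoencoder-style objective analogous to the SAE-based models.
        self.decoder_a = nn.Linear(latent_dim, dim_a)
        self.decoder_b = nn.Linear(latent_dim, dim_b)

    def forward(self, x_a, x_b):
        z = self.fusion(torch.cat([self.encoder_a(x_a),
                                   self.encoder_b(x_b)], dim=-1))
        return self.decoder_a(z), self.decoder_b(z), z


if __name__ == "__main__":
    model = SharedRepresentationFusion()
    x_a, x_b = torch.randn(8, 128), torch.randn(8, 64)
    rec_a, rec_b, shared = model(x_a, x_b)
    # Joint reconstruction loss over both modalities; task-specific losses
    # (classification, detection) can be attached to the shared code instead.
    loss = nn.functional.mse_loss(rec_a, x_a) + nn.functional.mse_loss(rec_b, x_b)
    print(shared.shape, loss.item())
```

The same skeleton accommodates the other architecture families: the dense encoders can be swapped for RBM stacks (DBN based), convolutional feature extractors (CNN based), or recurrent layers over sequences (RNN based) while keeping the fusion step unchanged.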