Multimodal deep learning, presented by Ngiam et al. (2011), is the most representative deep learning model based on the stacked autoencoder (SAE) for multimodal data fusion. This model aims to address two data-fusion problems: cross-modality and shared-modality representation learning. The former seeks better single-modality representations by leveraging knowledge from other modalities, while the latter learns the complex correlations between modalities at a mid-level of the network. To achieve these goals, three learning scenarios (multiple-modality, cross-modality, and shared-modality learning) are designed, as summarized in Table 3 and Figure 6. Furthermore, in each scenario, sparse coding is used to learn better representations by penalizing the loss function with a sparsity constraint of the following form:
$\min_{\theta} \; -\sum_{l=1}^{m} \log P\big(v^{(l)}, h^{(l)}\big) + \lambda \sum_{j} \Big( p - \frac{1}{m} \sum_{l=1}^{m} \mathbb{E}\big[h_j \mid v^{(l)}\big] \Big)^{2}.$
(3.2)
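As a minimal sketch of the sparsity term in equation 3.2, the following NumPy snippet computes $\lambda \sum_j (p - \frac{1}{m}\sum_l \mathbb{E}[h_j \mid v^{(l)}])^2$ for a batch of visible vectors, assuming sigmoid hidden units so that $\mathbb{E}[h_j \mid v]$ is the sigmoid activation. The function and variable names are illustrative, not taken from the original paper's code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sparsity_penalty(V, W, b, p=0.05, lam=0.1):
    """Sparsity term of eq. 3.2 (illustrative sketch):
    lam * sum_j (p - (1/m) * sum_l E[h_j | v^(l)])^2.

    V: (m, n_visible) batch of visible vectors
    W: (n_visible, n_hidden) weights; b: (n_hidden,) hidden biases
    """
    H = sigmoid(V @ W + b)      # E[h_j | v^(l)] for each example l
    mean_act = H.mean(axis=0)   # (1/m) * sum over the m examples
    return lam * float(np.sum((p - mean_act) ** 2))

# tiny usage example with random data
rng = np.random.default_rng(0)
V = rng.random((8, 4))
W = rng.standard_normal((4, 3))
b = np.zeros(3)
pen = sparsity_penalty(V, W, b)  # a small nonnegative scalar
```

Driving the mean hidden activation toward a small target $p$ keeps most hidden units inactive for any given input, which is what yields the sparse representations used in each learning scenario.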
Figure 6:

The architectures of the multiple-modality, cross-modality, and shared-modality learning.

Table 3:
Setting of Multimodal Learning.
                                 Feature Learning   Supervised Training   Testing
Classic deep learning            Audio              Audio                 Audio
                                 Video              Video                 Video
Multimodal fusion                A + V              A + V                 A + V
Cross-modality learning          A + V              Video                 Video
                                 A + V              Audio                 Audio
Shared representation learning   A + V              Audio                 Video
                                 A + V              Video                 Audio

Note: A = audio; V = video.
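The shared-modality scenario in Table 3 can be sketched as a small forward pass: each modality has its own encoder, the two modality-specific codes feed one shared hidden layer, and the shared code reconstructs both inputs. This is a minimal illustrative sketch in the spirit of Ngiam et al. (2011); all layer sizes, weight initializations, and names are assumptions, not the paper's actual configuration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class BimodalAE:
    """Illustrative shared-representation autoencoder: modality-specific
    encoders feed one shared hidden layer that reconstructs both
    modalities (untrained; weights are random for shape illustration)."""

    def __init__(self, d_audio, d_video, d_hid, d_shared, seed=0):
        rng = np.random.default_rng(seed)
        self.Wa = rng.standard_normal((d_audio, d_hid)) * 0.1   # audio encoder
        self.Wv = rng.standard_normal((d_video, d_hid)) * 0.1   # video encoder
        self.Ws = rng.standard_normal((2 * d_hid, d_shared)) * 0.1  # shared layer
        self.Wa_out = rng.standard_normal((d_shared, d_audio)) * 0.1  # audio decoder
        self.Wv_out = rng.standard_normal((d_shared, d_video)) * 0.1  # video decoder

    def forward(self, a, v):
        ha = sigmoid(a @ self.Wa)   # audio-specific representation
        hv = sigmoid(v @ self.Wv)   # video-specific representation
        shared = sigmoid(np.concatenate([ha, hv], axis=1) @ self.Ws)
        return shared @ self.Wa_out, shared @ self.Wv_out  # reconstruct both

ae = BimodalAE(d_audio=20, d_video=30, d_hid=8, d_shared=6)
a = np.zeros((5, 20))
v = np.zeros((5, 30))
rec_a, rec_v = ae.forward(a, v)
# rec_a has shape (5, 20); rec_v has shape (5, 30)
```

Because both decoders read from the same shared layer, training such a network to reconstruct both modalities forces the shared code to capture cross-modal correlations; at test time, one modality can be zeroed out to probe cross-modality behavior, as in the Table 3 scenarios.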