Multimodal deep learning, presented by Ngiam et al. (2011), is the most representative deep learning model based on the stacked autoencoder (SAE) for multimodal data fusion. This model aims to address two data-fusion problems: cross-modality and shared-modality representation learning. The former aims to capture better single-modality representations by leveraging knowledge from other modalities, while the latter learns the complex correlations between modalities at a mid-level of the network. To achieve these goals, three learning scenarios are designed: multiple-modality, cross-modality, and shared-modality learning, as summarized in Table 3 and depicted in Figure 6. Furthermore, in each scenario, sparse coding is used to learn better representations by penalizing the loss function with a sparsity constraint of the following form:
$$
\min_{\theta} \; -\log P(v,h) + \lambda \left\lVert \, p - \frac{1}{m} E[h \mid v] \, \right\rVert^{2}. \tag{3.2}
$$
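
To make the role of the sparsity penalty in equation 3.2 concrete, the following sketch pairs a simplified bimodal autoencoder with a squared penalty on the deviation between a target activation level p and the minibatch-mean hidden activation; the reconstruction losses stand in for the negative log-likelihood term. This is a minimal illustration, not the authors' implementation: the class name BimodalAutoencoder, the layer sizes, and the weight lambda = 0.1 are assumptions made here for clarity.

```python
# Minimal sketch of a sparsity-penalized bimodal autoencoder (illustrative only).
import torch
import torch.nn as nn

class BimodalAutoencoder(nn.Module):
    def __init__(self, audio_dim=100, video_dim=300, hidden_dim=128):
        super().__init__()
        # Modality-specific encoders feeding a shared hidden representation.
        self.enc_audio = nn.Linear(audio_dim, hidden_dim)
        self.enc_video = nn.Linear(video_dim, hidden_dim)
        self.shared = nn.Linear(2 * hidden_dim, hidden_dim)
        # Decoders reconstruct both modalities from the shared code.
        self.dec_audio = nn.Linear(hidden_dim, audio_dim)
        self.dec_video = nn.Linear(hidden_dim, video_dim)

    def forward(self, audio, video):
        ha = torch.sigmoid(self.enc_audio(audio))
        hv = torch.sigmoid(self.enc_video(video))
        h = torch.sigmoid(self.shared(torch.cat([ha, hv], dim=1)))
        return self.dec_audio(h), self.dec_video(h), h

def sparsity_penalty(h, p=0.05):
    # || p - (1/m) * sum over the minibatch of E[h | v] ||^2
    mean_activation = h.mean(dim=0)        # average activation of each hidden unit
    return ((p - mean_activation) ** 2).sum()

model = BimodalAutoencoder()
audio = torch.randn(32, 100)
video = torch.randn(32, 300)
rec_a, rec_v, h = model(audio, video)
# Reconstruction terms play the role of -log P(v, h); lambda = 0.1 is an assumed weight.
loss = (nn.functional.mse_loss(rec_a, audio)
        + nn.functional.mse_loss(rec_v, video)
        + 0.1 * sparsity_penalty(h))
loss.backward()
```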
Figure 6: The architectures of the multiple-modality, cross-modality, and shared-modality learning.
Table 3: Setting of Multimodal Learning.

                                 Feature Learning   Supervised Training   Testing
Classic deep learning            Audio              Audio                 Audio
                                 Video              Video                 Video
Multimodal fusion                A + V              A + V                 A + V
Cross-modality learning          A + V              Video                 Video
                                 A + V              Audio                 Audio
Shared representation learning   A + V              Audio                 Video
                                 A + V              Video                 Audio
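
As an illustration of the shared-representation setting in Table 3, the sketch below computes a shared code from either modality, trains a linear classifier on audio-derived codes, and evaluates it on video-derived codes. Feeding zeros for the absent modality is a common convention assumed here rather than a detail taken from the paper, and all module names and dimensions are hypothetical stand-ins for the pretrained encoders.

```python
# Illustrative sketch of training on audio and testing on video via a shared code.
import torch
import torch.nn as nn

hidden_dim, audio_dim, video_dim, n_classes = 128, 100, 300, 10

# Stand-ins for the pretrained encoders of the multimodal autoencoder.
enc_audio = nn.Sequential(nn.Linear(audio_dim, hidden_dim), nn.Sigmoid())
enc_video = nn.Sequential(nn.Linear(video_dim, hidden_dim), nn.Sigmoid())
shared = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim), nn.Sigmoid())

def shared_code(audio=None, video=None):
    # Compute the shared representation; zero out any missing modality.
    batch = audio.shape[0] if audio is not None else video.shape[0]
    ha = enc_audio(audio) if audio is not None else torch.zeros(batch, hidden_dim)
    hv = enc_video(video) if video is not None else torch.zeros(batch, hidden_dim)
    return shared(torch.cat([ha, hv], dim=1))

classifier = nn.Linear(hidden_dim, n_classes)

# Supervised training: audio only.
train_audio = torch.randn(64, audio_dim)
train_labels = torch.randint(0, n_classes, (64,))
logits = classifier(shared_code(audio=train_audio))
loss = nn.functional.cross_entropy(logits, train_labels)
loss.backward()

# Testing: video only, mapped through the same shared space.
test_video = torch.randn(16, video_dim)
predictions = classifier(shared_code(video=test_video)).argmax(dim=1)
```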