Abstract
When applied to downstream inference, both joint and coordinated multimodal representations rely on the same complete set of modalities seen in training. However, modal-incomplete data, where certain modalities are missing, greatly reduces performance in Multimodal Sentiment Analysis (MSA) due to varying input forms and deficient semantic information. This limits the applicability of the predominant MSA methods in the real world, where the completeness of multimodal data is uncertain and variable. Generation-based methods attempt to generate the missing modality, yet they require complex hierarchical architectures with huge computational costs and struggle with the representation gaps across different modalities. In contrast, we propose a novel representation learning approach named MissModal, devoted to increasing robustness to missing modalities in a classification manner. Specifically, we adopt constraints with a geometric contrastive loss, a distribution distance loss, and a sentiment semantic loss to align the representations of modal-missing and modal-complete data, without impacting sentiment inference for complete modalities. Furthermore, we require no changes in the multimodal fusion stage, highlighting the generality of our method for other multimodal learning systems. Extensive experiments demonstrate that the proposed method achieves superior performance with minimal computational costs in various missing-modality scenarios (flexibility), including severely missing modality (efficiency), on two public MSA datasets.
1 Introduction
With the proliferation of the Internet and the surge of user-generated videos, Multimodal Sentiment Analysis (MSA) has become an important and challenging research task that focuses on predicting sentiment from multiple modalities, including text, audio, and vision (Morency et al., 2011; Poria et al., 2020). Previous models (Zadeh et al., 2017; Tsai et al., 2019a; Wang et al., 2019; Han et al., 2021) aim at learning a mapping function to fuse the information of different modalities and obtain distinguishable multimodal representations for sentiment inference. As shown in Figure 1, these MSA methods take utterances with multiple modalities as input to train the multimodal representation mapping function under the supervision of ground-truth labels, and apply the learned MSA models in downstream testing to predict the sentiment of other utterances.
However, both the training and testing pipelines of these MSA methods require complete-modal data, indicating the sensitivity of the mapping function to missing modalities. Missing any modality in testing shifts the distribution of the input data away from that seen in training, leading to performance drops of the mapping function. Due to the uncertainty and variety of modality settings in the real world, the demand for the integrity of modalities limits the application of previous multimodal representation learning strategies.
To deal with the issue of missing modalities, generation-based research has emerged that focuses on leveraging the remaining modalities to generate the missing ones (Tsai et al., 2019b; Pham et al., 2019; Tang et al., 2021). These generative models have complex hierarchical architectures, which require redundant training parameters and high computational costs in training. Besides, their generative performance is still challenged by the huge modality gap among different modalities, further limiting their application in the real world.
Different from the generative methods, we propose a novel multimodal representation learning approach named MissModal, devoted to increasing the model's robustness to missing modalities in a classification manner. Specifically, we utilize separate modality-specific networks to learn representations for each modality. Then, according to the complete-modality case (text, audio, vision) and the six missing-modality cases, namely (text), (audio), (vision), (text, audio), (text, vision), and (audio, vision), we adopt multimodal fusion networks with a consistent structure to learn the corresponding complete-modal and missing-modal representations. To transfer the semantic knowledge of the complete modalities, we construct three constraints to align missing-modal and complete-modal representations: a geometric contrastive loss that applies contrastive learning at the sample level, a distribution distance loss that adjusts the distribution of representations, and a sentiment semantic loss that introduces supervision from sentiment labels.
Aiming at improving the downstream performance of MSA models in the real world, we retain the completeness of modalities in training, and then freeze the trained model for validation and testing with different missing rates for diverse modalities to evaluate the flexibility (randomly missing various modalities) and efficiency (severely missing modalities) of the proposed approach.
The contributions are summarized as follows:
- 1) We propose a novel multimodal representation learning approach named MissModal, devoted to increasing the robustness of MSA models to missing modalities in downstream applications.
- 2) Without resorting to generative methods, we construct three constraints to align the representations of missing and complete modalities, consisting of a geometric contrastive loss, a distribution distance loss, and a sentiment semantic loss.
- 3) Extensive experiments on two publicly available MSA datasets with various settings of missing rates and missing modalities demonstrate the superiority of the proposed approach in both flexibility and efficiency.
2 Related Work
2.1 Multimodal Representation Learning
Diverse modalities such as natural language, motion videos, and vocal signals contain specific and complementary information about a common concept (Baltrušaitis et al., 2019). Multimodal representation learning focuses on exploring intra- and inter-modal dynamics and learning distinguishable representations for various downstream tasks (Bugliarello et al., 2021). Recently, contrastive learning-based multimodal pre-trained models, e.g., CLIP (Radford et al., 2021), WenLan (Huo et al., 2021), and UNIMO (Li et al., 2021), leverage contrastive learning to train transferable mappings that bridge large-scale image-text pairs. The successful downstream application of these pre-trained models demonstrates the effectiveness of contrastive learning in aligning representations of different modalities.
As a task branch of multimodal machine learning, Multimodal Sentiment Analysis (MSA) aims at integrating the semantic information contained in different modalities, including the textual, acoustic, and visual modalities, to predict the sentiment intensity of an utterance (Poria et al., 2020). Previous MSA methods mostly concentrate on designing effective multimodal fusion methods to explore the commonalities among different modalities (Zadeh et al., 2017; Rahman et al., 2020; Han et al., 2021) and learn informative multimodal representations. However, the training pipeline of explicit fusion strategies requires the presence of all modalities. Missing any modality downstream introduces a mismatch of input conditions between training and testing, causing incorrect sentiment inference in applications.
2.2 Missing Modality Issues
The aforementioned multimodal pre-trained models heavily depend on the completeness of modalities, making them fail to handle modality-incomplete data. As Ma et al. (2022) indicate, multimodal transformers (Hendricks et al., 2021) are sensitive to missing modalities, and the modality fusion strategies are dataset-dependent, which significantly affects robustness. Therefore, to address missing modality issues, generation-based methods (Ma et al., 2021; Vasco et al., 2022) have been proposed to learn a prior distribution on a modality-shared representation and infer the missing modalities in the modality-shared latent space; these methods are also employed in the MSA task (Tsai et al., 2019b; Pham et al., 2019; Tang et al., 2021). Nevertheless, such generation-based methods require large computational costs, and their generative performance is limited by the huge modality gaps. Meanwhile, they mostly demand complex hierarchical model architectures, which lack generality and efficiency in downstream applications. Different from them, we are devoted to utilizing a classification approach instead of generation to reach the performance upper bound in missing-modality scenarios.
Recently, Hazarika et al. (2022) proposed robust training, which utilizes missing and noisy textual input as data augmentation to train the state-of-the-art MSA models. However, the application of robust training is limited by its settings of a single modality and fixed missing rates. In contrast, according to the missing rates and the diversity of missing modalities, we evaluate performance in terms of flexibility (randomly missing various modalities) and efficiency (severely missing modalities in testing) to show the improved robustness of the proposed approach to missing modalities.
3 Method
3.1 Task Definition
The input of the MSA task is an utterance, which can be denoted as a triplet (T, A, V), including the textual modality T ∈ ℝ^(ℓ_T×d_T), the acoustic modality A ∈ ℝ^(ℓ_A×d_A), and the visual modality V ∈ ℝ^(ℓ_V×d_V), where ℓ_U denotes the sequence length of the corresponding modality and d_U denotes the feature dimension for U ∈ {T, A, V}. The goal is to learn a mapping of the multimodal data (T, A, V) into multimodal representations F = f(T, A, V), where F can be utilized to infer the final sentiment score ŷ. Specifically, for better generalization performance in downstream applications, the multimodal representation mapping f learned on the training data needs to handle the testing scenario as well, regardless of the completeness of modalities.
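To make this interface concrete, the sketch below illustrates the mapping f(T, A, V) → F → ŷ; the hidden size, the concatenation-based fusion, and the feature dimensions (e.g., 74-dimensional COVAREP audio and 35-dimensional Facet vision features) are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MSAModel(nn.Module):
    """Interface sketch: (T, A, V) -> multimodal representation F -> sentiment score."""
    def __init__(self, d_t=768, d_a=74, d_v=35, d_hidden=128):
        super().__init__()
        self.proj = nn.ModuleDict({
            "T": nn.Linear(d_t, d_hidden),
            "A": nn.Linear(d_a, d_hidden),
            "V": nn.Linear(d_v, d_hidden),
        })
        self.fusion = nn.Linear(3 * d_hidden, d_hidden)  # placeholder fusion f
        self.head = nn.Linear(d_hidden, 1)               # sentiment score in [-3, 3]

    def forward(self, t, a, v):
        # t: (batch, l_T, d_T), a: (batch, l_A, d_A), v: (batch, l_V, d_V)
        z = [self.proj[k](x).mean(dim=1) for k, x in (("T", t), ("A", a), ("V", v))]
        f = self.fusion(torch.cat(z, dim=-1))   # multimodal representation F
        return self.head(f).squeeze(-1)         # predicted sentiment score
```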
3.2 Model Architecture
To increase the robustness to missing modalities in testing, we propose a novel multimodal representation learning approach named MissModal, whose architecture is shown in Figure 2.
Specifically, we take the [CLS] token of FT and the embedding from the last time step of FA and FV as the modality-specific representation FU for each modality U ∈ {T, A, V}.
Note that the structure of multimodal fusion networks is optional and can be flexibly substituted by state-of-the-art multimodal fusion methods, illustrating the backward compatibility of the proposed approach.
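A minimal sketch of this extraction step is given below; BERT provides the textual encoder as in Section 4.3, while the LSTM encoders for audio and vision are an assumption standing in for the paper's modality-specific networks.

```python
import torch.nn as nn
from transformers import BertModel

class ModalitySpecificEncoders(nn.Module):
    """F_T from the [CLS] token; F_A and F_V from the last time step of sequence encoders."""
    def __init__(self, d_a, d_v, d_hidden):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.rnn_a = nn.LSTM(d_a, d_hidden, batch_first=True)
        self.rnn_v = nn.LSTM(d_v, d_hidden, batch_first=True)

    def forward(self, text_ids, text_mask, audio, vision):
        # F_T: hidden state of the [CLS] token
        f_t = self.bert(input_ids=text_ids, attention_mask=text_mask).last_hidden_state[:, 0]
        _, (h_a, _) = self.rnn_a(audio)   # audio: (batch, l_A, d_A)
        _, (h_v, _) = self.rnn_v(vision)  # vision: (batch, l_V, d_V)
        return f_t, h_a[-1], h_v[-1]      # F_T, F_A, F_V
```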
3.3 Constraints when Missing Modalities
As shown in Figure 1, to improve the robustness of the model to missing modalities, we propose the following three losses as constraints to align the missing-modal representations F_miss with the complete-modal ones F_M.
3.3.1 Geometric Contrastive Loss
Poklukar et al. (2022) indicate that huge gaps exist between modality-specific representations and complete representations, leading to severe misalignment in the distribution space. Inspired by, but different from, Chen et al. (2020) and Poklukar et al. (2022), we introduce contrastive learning between the multimodal representations with complete modalities and those with different cases of missing modalities, to geometrically align the representations from the same utterance sample under the supervision of sentiment labels.
This contrastive learning encourages the multimodal fusion networks to transfer complete-modal information to the missing-modal representations, making them more distinguishable when handling missing-modality issues in applications.
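As a sketch of how such a loss can be implemented, the snippet below uses an InfoNCE-style objective in which a sample's missing-modal representation and its complete-modal counterpart form a positive pair, and the other samples in the batch act as negatives; the exact formulation in the paper (e.g., how sentiment labels refine the positive and negative sets) may differ, and the default temperature is simply taken from the search space in Section 4.3.

```python
import torch
import torch.nn.functional as F

def geometric_contrastive_loss(f_complete, f_missing, temperature=0.7):
    """InfoNCE-style alignment: for sample i, f_missing[i] should be close to its
    complete-modal counterpart f_complete[i] and far from other samples' f_complete."""
    z_c = F.normalize(f_complete, dim=-1)          # (batch, d)
    z_m = F.normalize(f_missing, dim=-1)           # (batch, d)
    logits = z_m @ z_c.t() / temperature           # cosine similarity matrix
    targets = torch.arange(z_m.size(0), device=z_m.device)  # positives on the diagonal
    return F.cross_entropy(logits, targets)
```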
3.3.2 Distribution Distance Loss
3.3.3 Sentiment Semantic Loss
3.3.4 Optimization Objective
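As a loose sketch only, the overall objective can be read as a weighted combination of the complete-modal task loss and the three alignment constraints; the specific weighting below, reusing α and β from the hyperparameter search in Section 4.3 and a unit weight on the sentiment semantic term, is our assumption rather than the paper's exact formula.

```python
def total_loss(l_task, l_gc, l_dd, l_ss, alpha=0.5, beta=0.5):
    """
    l_task: sentiment prediction loss on complete-modal representations
    l_gc:   geometric contrastive loss  (feature level)
    l_dd:   distribution distance loss  (feature level)
    l_ss:   sentiment semantic loss     (prediction level, on missing-modal representations)
    alpha, beta: trade-off weights searched in {0.3, 0.5, 0.7} (Section 4.3)
    """
    return l_task + alpha * l_gc + beta * l_dd + l_ss
```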
4 Experiment Setting
4.1 Datasets and Metrics
The experiments are conducted on two benchmark datasets in MSA research: CMU-MOSI (Zadeh et al., 2016) contains 2,199 monologue utterances sliced from 93 YouTube movie opinion videos spanning 89 reviewers. We utilize 1,284 utterances for training, 229 utterances for validation, and 686 utterances for testing. CMU-MOSEI (Zadeh et al., 2018b) expands the multimodal data into 20k video clips from 3,228 videos in 250 diverse topics collected by 1,000 distinct YouTube speakers. We utilize 16,326 utterances for training, 1,871 utterances for validation, and 4,659 utterances for testing. Both of the datasets are annotated for the sentiment on a Likert scale ranging from −3 to +3, where the polarity indicates positive/negative and the absolute value denotes the relative strength of expressed sentiment.
For the evaluation metrics, we report seven-class classification accuracy (Acc7) for sentiment classification in [−3, +3], as well as binary classification accuracy (Acc2) and weighted F1-score (F1) under two measurement settings: non-negative/negative (including 0) (Zadeh et al., 2017) and positive/negative (excluding 0) (Tsai et al., 2019a). Moreover, we calculate the mean absolute error (MAE) and Pearson correlation (Corr) to measure the regression difference and the correlation between the predicted and ground-truth labels.
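A sketch of how these metrics are conventionally computed in MSA work follows; the exact thresholding and rounding conventions may differ slightly from the paper's evaluation code.

```python
import numpy as np
from sklearn.metrics import f1_score

def msa_metrics(preds, labels):
    """preds and labels are continuous sentiment scores in [-3, 3]."""
    preds, labels = np.asarray(preds), np.asarray(labels)
    # Acc7: seven-class accuracy after clipping and rounding to integers in [-3, 3]
    acc7 = np.mean(np.round(np.clip(preds, -3, 3)) == np.round(np.clip(labels, -3, 3)))
    # Acc2 / F1, non-negative vs. negative (zero counted as non-negative)
    acc2_incl = np.mean((preds >= 0) == (labels >= 0))
    f1_incl = f1_score(labels >= 0, preds >= 0, average="weighted")
    # Acc2 / F1, positive vs. negative (utterances labelled exactly 0 are excluded)
    nz = labels != 0
    acc2_excl = np.mean((preds[nz] > 0) == (labels[nz] > 0))
    f1_excl = f1_score(labels[nz] > 0, preds[nz] > 0, average="weighted")
    mae = np.mean(np.abs(preds - labels))
    corr = np.corrcoef(preds, labels)[0, 1]
    return {"Acc7": acc7, "Acc2": (acc2_incl, acc2_excl),
            "F1": (f1_incl, f1_excl), "MAE": mae, "Corr": corr}
```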
4.2 Baselines
The MSA baselines are broadly categorized as:
Simple early and late fusion models: EF-LSTM (Williams et al., 2018b), LF-DNN (Williams et al., 2018a);
Tensor-based fusion models: TFN (Zadeh et al., 2017), LMF (Liu et al., 2018);
Graph-based fusion model: Graph-MFN (Zadeh et al., 2018b);
Generative and translation-based models: MFM (Tsai et al., 2019b), MCTN (Pham et al., 2019), CTFN (Tang et al., 2021);
Explicit intra- and inter-modal dynamics manipulation models: MFN (Zadeh et al., 2018a), MISA (Hazarika et al., 2020);
Transformer-based fusion models: MulT (Tsai et al., 2019a), MAG-BERT (Rahman et al., 2020);
Label-guidance: Self-MM (Yu et al., 2021);
Mutual information maximization model: MMIM (Han et al., 2021).
We reproduce the baselines with hyperparameter grid searches for the best results. Additionally, we run the state-of-the-art models under robust training with 15% masking and 15% noisy language data following Hazarika et al. (2022) in the circumstances of complete modalities and missing textual modality for a fair comparison.
4.3 Implementation Details
Following the settings of the baselines, we adopt the pre-trained BERT-base-uncased model to encode the textual input and obtain raw textual features with 768-dimensional hidden states for each token. Besides, we utilize the CMU-Multimodal SDK to pre-process the audio and vision data, which applies COVAREP (Degottex et al., 2014) and Facet (iMotions 2017, https://imotions.com/) to extract raw acoustic and visual features.
We conduct the experiments on a single GTX 1080Ti GPU with CUDA 10.2. For the hyperparameters, following Gkoumas et al. (2021), we perform a fifty-trial random grid search to find the best hyperparameter settings, including α and β in {0.3, 0.5, 0.7} and τ in {0.5, 0.7, 0.9}. The batch size for both MOSI and MOSEI is set to 32. For optimization, we adopt AdamW (Loshchilov and Hutter, 2019) as the optimizer with a learning rate of 5e-5 for the BERT parameters on both datasets, and 5e-4 on MOSI and 1e-3 on MOSEI for the other parameters.
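A small sketch of such a random grid search is shown below; the function name and the restriction to α, β, and τ are illustrative, since the full search space also covers other hyperparameters.

```python
import random

def random_grid_search(space, n_trials=50, seed=0):
    """Randomly sample `n_trials` hyperparameter settings from a discrete grid."""
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in space.items()} for _ in range(n_trials)]

# Grid from Section 4.3 (other hyperparameters omitted for brevity).
space = {"alpha": [0.3, 0.5, 0.7], "beta": [0.3, 0.5, 0.7], "tau": [0.5, 0.7, 0.9]}
trials = random_grid_search(space, n_trials=50)
```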
For both the complete- and missing-modality settings, we run each experiment five times and report the average performance as the final result. In the experiments with missing modalities, we retain the completeness of modalities in the training sets of both datasets to fine-tune the model, and then freeze the model for the validation and testing sets with different missing rates for diverse modalities to evaluate both the flexibility and the efficiency of the proposed approach.
5 Experiment Results
5.1 Experiments with Complete Modalities
As shown in Table 1, we compare the performance of MissModal with the state-of-the-art MSA methods with complete modalities in training and testing. The outstanding results on all metrics demonstrate the effectiveness of the proposed architecture of modality-specific and cross-modal representation learning on both MOSI and MOSEI.
Models | CMU-MOSI: Acc7↑ | Acc2↑ | F1↑ | MAE↓ | Corr↑ | CMU-MOSEI: Acc7↑ | Acc2↑ | F1↑ | MAE↓ | Corr↑
---|---|---|---|---|---|---|---|---|---|---
EF-LSTM | 34.5 | 77.8/79.0 | 77.7/78.9 | 0.952 | 0.651 | 49.3 | 80.1/80.3 | 80.3/81.0 | 0.603 | 0.682 |
LF-DNN | 33.6 | 78.0/79.3 | 77.9/79.3 | 0.978 | 0.658 | 52.1 | 78.6/82.3 | 79.0/82.2 | 0.561 | 0.723 |
TFN | 33.7 | 78.3/80.2 | 78.2/80.1 | 0.925 | 0.662 | 52.2 | 81.0/82.6 | 81.1/82.3 | 0.570 | 0.716 |
LMF | 32.7 | 77.5/80.1 | 77.3/80.0 | 0.931 | 0.670 | 52.0 | 81.3/83.7 | 81.6/83.8 | 0.568 | 0.727 |
MFN | 34.2 | 77.9/80.0 | 77.8/80.0 | 0.951 | 0.665 | 51.1 | 81.8/84.0 | 81.9/83.9 | 0.575 | 0.720 |
Graph-MFN | 34.4 | 77.9/80.2 | 77.8/80.1 | 0.939 | 0.656 | 51.9 | 81.9/84.0 | 82.1/83.8 | 0.569 | 0.725 |
MFM | 33.3 | 77.7/80.0 | 77.7/80.1 | 0.948 | 0.664 | 50.8 | 80.3/83.4 | 80.7/83.4 | 0.580 | 0.722 |
MCTN | 33.7 | 78.7/80.0 | 78.8/80.1 | 0.960 | 0.686 | 52.0 | 80.4/83.7 | 80.9/83.7 | 0.570 | 0.728 |
CTFN | 29.9 | 78.9/80.4 | 78.7/80.3 | 0.964 | 0.683 | 50.0 | 80.6/83.2 | 81.0/83.0 | 0.582 | 0.720 |
MulT | 35.0 | 79.0/80.5 | 79.0/80.5 | 0.918 | 0.685 | 52.1 | 81.3/84.0 | 81.6/83.9 | 0.564 | 0.732 |
MISA | 43.5 | 81.8/83.5 | 81.7/83.5 | 0.752 | 0.784 | 52.2 | 81.6/84.3 | 82.0/84.3 | 0.550 | 0.758 |
MAG-BERT | 45.1 | 82.4/84.6 | 82.2/84.6 | 0.730 | 0.789 | 52.8 | 81.9/85.1 | 82.3/85.1 | 0.558 | 0.761 |
Self-MM | 45.8 | 82.7/84.9 | 82.6/84.8 | 0.731 | 0.785 | 53.0 | 82.6/85.2 | 82.8/85.2 | 0.540 | 0.763 |
MMIM | 45.0 | 83.0/85.1 | 82.9/85.0 | 0.738 | 0.781 | 53.1 | 81.9/85.1 | 82.3/85.0 | 0.547 | 0.752 |
MISAτ | 41.3 | 80.6/82.4 | 80.6/82.4 | 0.795 | 0.764 | 52.6 | 81.3/84.8 | 81.7/84.7 | 0.545 | 0.761 |
MAG-BERTτ | 46.0 | 82.5/84.4 | 82.4/84.4 | 0.734 | 0.790 | 53.5 | 81.8/84.8 | 82.1/84.7 | 0.542 | 0.758 |
Self-MMτ | 46.1 | 82.4/84.2 | 82.4/84.1 | 0.727 | 0.791 | 53.7 | 79.6/84.0 | 80.2/84.0 | 0.535 | 0.763 |
MMIMτ | 44.6 | 83.1/84.3 | 83.1/84.4 | 0.753 | 0.771 | 53.2 | 80.4/84.0 | 80.9/83.9 | 0.547 | 0.755 |
MissModal | 47.2 | 84.1/86.1 | 84.0/86.0 | 0.698 | 0.801 | 53.9 | 83.4/85.9 | 83.6/85.8 | 0.533 | 0.769 |
Moreover, most previous MSA models demand the presence of all modalities and thus cannot be directly employed when modalities are missing from the input data. To address this issue, we adopt the robust training strategy (Hazarika et al., 2022) for the state-of-the-art MSA models. However, even without any missing modality, we observe that robust training decreases the performance on most metrics when testing with complete modalities, due to the introduction of masked or noisy input.
In contrast, the superior results of MissModal are achieved while applying the constraints on missing-modal representations, which indicates that introducing the missing-modality mechanism does not impact the testing performance with complete modalities.
5.2 Experiments when Missing Modalities
To show the benefits of the proposed constraints in addressing missing-modality issues, we remove modalities by replacing the corresponding modality input with a zero vector in both the validation and testing sets. Notably, unlike Hazarika et al. (2022), which trains and tests with language as the specific missing modality, we evaluate MissModal in various scenarios: missing the textual modality, missing the acoustic or visual modality, and missing random modalities.
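A minimal sketch of this zero-vector replacement is shown below; how the affected utterances are sampled for a given missing rate is our assumption.

```python
import torch

def mask_missing_modality(features, missing_rate, seed=0):
    """Replace one modality's features with zero vectors for a fraction of utterances.

    features: (batch, seq_len, dim) tensor of a single modality.
    missing_rate: fraction of utterances whose input for this modality is removed.
    """
    g = torch.Generator().manual_seed(seed)
    missing = torch.rand(features.size(0), generator=g) < missing_rate
    features = features.clone()
    features[missing] = 0.0  # zero-vector replacement for the selected utterances
    return features
```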
5.2.1 Missing Textual Modality
The textual modality is viewed as the dominant modality in the MSA task (Hazarika et al., 2020; Wu et al., 2021; Lin and Hu, 2023) due to large-scale pre-trained language models and its abundant semantic information, which is instrumental in sentiment understanding. We first compare MissModal with the state-of-the-art methods under robust training (Hazarika et al., 2022) when the textual modality is missing at various missing rates. As shown in Table 2, MissModal outperforms the state-of-the-art methods under robust training on most metrics, especially when the modality is severely missing. We assume that the fixed missing and noisy rates (15%) of robust training limit its application to higher missing rates of the textual modality. In contrast, MissModal concentrates on improving the robustness of missing-modal representations, whose performance does not depend on a fixed missing-rate setting.
Models | CMU-MOSI: Acc7↑ | Acc2↑ | F1↑ | MAE↓ | Corr↑ | CMU-MOSEI: Acc7↑ | Acc2↑ | F1↑ | MAE↓ | Corr↑
---|---|---|---|---|---|---|---|---|---|---
Missing 30% textual modality | ||||||||||
MISAτ | 31.8 | 72.7/73.8 | 72.2/73.5 | 1.003 | 0.620 | 48.3 | 77.2/76.6 | 76.3/75.1 | 0.653 | 0.616 |
MAG-BERTτ | 36.3 | 70.6/71.0 | 70.0/70.5 | 0.994 | 0.624 | 47.8 | 75.3/77.2 | 75.5/76.6 | 0.650 | 0.623 |
Self-MMτ | 35.8 | 74.1/76.1 | 73.0/75.3 | 0.929 | 0.669 | 49.1 | 77.2/77.6 | 76.8/76.5 | 0.627 | 0.634 |
MMIMτ | 34.9 | 73.3/74.4 | 73.0/74.2 | 0.968 | 0.627 | 49.0 | 76.9/77.0 | 77.0/76.4 | 0.640 | 0.628 |
MissModal | 38.7 | 74.2/76.3 | 73.2/75.5 | 0.907 | 0.664 | 49.1 | 77.9/78.1 | 77.4/77.0 | 0.634 | 0.635 |
Missing 50% textual modality | ||||||||||
MISAτ | 27.3 | 66.4/67.1 | 64.7/65.6 | 1.134 | 0.525 | 46.5 | 75.0/73.2 | 73.3/70.4 | 0.696 | 0.535 |
MAG-BERTτ | 29.5 | 65.6/66.3 | 63.7/64.5 | 1.128 | 0.526 | 46.1 | 74.6/73.6 | 73.4/71.0 | 0.700 | 0.533 |
Self-MMτ | 31.8 | 67.7/69.3 | 66.8/67.8 | 1.053 | 0.567 | 47.2 | 74.2/72.5 | 72.7/70.2 | 0.691 | 0.539 |
MMIMτ | 28.8 | 66.7/67.2 | 64.4/65.1 | 1.142 | 0.530 | 46.7 | 75.0/72.7 | 73.0/70.9 | 0.697 | 0.529 |
MissModal | 32.3 | 67.8/69.6 | 65.4/67.4 | 1.039 | 0.580 | 47.3 | 75.1/73.6 | 73.5/71.1 | 0.692 | 0.539 |
Missing 70% textual modality | ||||||||||
MISAτ | 22.9 | 61.3/61.7 | 59.1/59.6 | 1.248 | 0.389 | 44.2 | 72.8/69.1 | 69.1/62.2 | 0.763 | 0.410 |
MAG-BERTτ | 22.6 | 62.1/64.0 | 55.8/57.9 | 1.253 | 0.409 | 43.8 | 73.0/68.5 | 68.4/62.6 | 0.768 | 0.392 |
Self-MMτ | 25.5 | 61.4/62.9 | 57.7/59.4 | 1.194 | 0.442 | 43.5 | 72.3/67.1 | 68.5/62.1 | 0.755 | 0.404 |
MMIMτ | 23.4 | 60.4/60.9 | 55.2/55.9 | 1.283 | 0.403 | 43.9 | 72.1/68.6 | 69.0/63.5 | 0.756 | 0.405 |
MissModal | 27.1 | 64.4/66.7 | 59.1/61.8 | 1.164 | 0.450 | 44.6 | 73.9/69.3 | 69.4/63.6 | 0.755 | 0.407 |
Missing 90% textual modality | ||||||||||
MISAτ | 19.8 | 56.9/56.7 | 46.4/47.1 | 1.329 | 0.267 | 42.2 | 64.4/60.6 | 56.3/51.0 | 0.830 | 0.265 |
MAG-BERTτ | 18.5 | 55.3/54.8 | 45.5/46.4 | 1.416 | 0.216 | 42.2 | 71.6/63.4 | 62.8/53.9 | 0.820 | 0.247 |
Self-MMτ | 18.3 | 53.6/53.5 | 44.3/44.4 | 1.374 | 0.254 | 40.7 | 71.0/62.8 | 62.8/52.2 | 0.822 | 0.250 |
MMIMτ | 18.7 | 54.3/54.5 | 44.2/44.6 | 1.392 | 0.246 | 41.8 | 69.3/64.1 | 62.7/53.6 | 0.818 | 0.239 |
MissModal | 21.3 | 57.9/60.1 | 48.3/50.9 | 1.316 | 0.271 | 42.4 | 71.9/64.8 | 62.9/54.0 | 0.813 | 0.232 |
To further show the effectiveness of MissModal in terms of flexibility and efficiency, we run the model with and without MissModal at different missing rates of the textual modality on the testing sets of MOSI and MOSEI, as shown in Figures 3 and 4. We observe that missing the textual modality at rates from 10% to 90% brings more significant drops in average performance to the model without MissModal than to the one with MissModal. Besides, the variance of the performance without MissModal grows rapidly on all metrics as the missing rate increases, which does not happen for the model with MissModal. Moreover, missing the textual modality leads to polarization of the predicted sentiment, because the acoustic and visual modalities pay less attention to fine-grained sentiment. Therefore, MissModal helps the model learn more distinguishable missing-modal representations, greatly improving the accuracy of sentiment inference, especially in the case of severely missing modality.
5.2.2 Missing Acoustic or Visual Modality
As the inferior modalities in MSA, the acoustic and visual modalities play auxiliary and complementary roles in sentiment prediction, leading to less impact on performance when removing these two modalities at 50% and 90% missing rates on MOSI and MOSEI, as shown in Tables 3 and 4. Nevertheless, missing either of them brings a sub-optimal solution for the MSA model. With any modality missing at any missing rate, the model with MissModal surpasses the one without MissModal on all metrics, demonstrating the superiority of the proposed approach.
Setting | Modality | Missing Rate | Acc7↑ | Acc2↑ | F1↑ | MAE↓ | Corr↑
---|---|---|---|---|---|---|---
w/o MissModal | Acoustic | 50% | 45.9 | 82.4/84.3 | 82.3/84.3 | 0.728 | 0.782
w/o MissModal | Acoustic | 90% | 45.6 | 81.8/83.5 | 81.8/83.6 | 0.724 | 0.791
w/o MissModal | Visual | 50% | 45.0 | 82.1/83.4 | 82.1/83.5 | 0.731 | 0.779
w/o MissModal | Visual | 90% | 44.5 | 81.5/83.1 | 81.5/83.2 | 0.737 | 0.781
MissModal | Acoustic | 50% | 46.7 | 83.2/85.1 | 83.2/85.1 | 0.718 | 0.800
MissModal | Acoustic | 90% | 46.1 | 82.8/84.6 | 82.8/84.6 | 0.709 | 0.794
MissModal | Visual | 50% | 47.4 | 83.2/84.6 | 83.2/84.6 | 0.714 | 0.787
MissModal | Visual | 90% | 45.8 | 82.8/84.4 | 82.7/84.4 | 0.723 | 0.785
Setting | Modality | Missing Rate | Acc7↑ | Acc2↑ | F1↑ | MAE↓ | Corr↑
---|---|---|---|---|---|---|---
w/o MissModal | Acoustic | 50% | 53.1 | 80.2/84.8 | 80.8/84.8 | 0.539 | 0.765
w/o MissModal | Acoustic | 90% | 52.9 | 79.1/84.0 | 79.9/84.1 | 0.541 | 0.764
w/o MissModal | Visual | 50% | 53.2 | 80.2/84.8 | 80.9/84.9 | 0.539 | 0.765
w/o MissModal | Visual | 90% | 52.7 | 78.3/83.9 | 79.2/84.1 | 0.550 | 0.761
MissModal | Acoustic | 50% | 53.7 | 81.9/85.2 | 82.4/85.1 | 0.540 | 0.768
MissModal | Acoustic | 90% | 53.2 | 81.6/84.9 | 82.1/84.9 | 0.540 | 0.766
MissModal | Visual | 50% | 53.2 | 82.4/86.1 | 82.9/86.0 | 0.536 | 0.768
MissModal | Visual | 90% | 53.1 | 81.9/85.6 | 82.4/85.6 | 0.539 | 0.767
5.2.3 Randomly Missing Modalities
To demonstrate wider applications of MissModal in addressing missing-modality issues, we remove modalities by random sampling and run MissModal with the remaining modalities drawn from {(T), (A), (V), (T, A), (T, V), (A, V)}. This setting is consistent with real-world deployment of MSA models, where the presence of modalities is unknown.
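The sketch below illustrates this random-removal protocol; the uniform choice among the six remaining-modality subsets is an assumption, as the sampling distribution over subsets is not pinned down here.

```python
import random

# The six possible sets of remaining modalities when at least one modality is missing.
REMAINING_SUBSETS = [("T",), ("A",), ("V",), ("T", "A"), ("T", "V"), ("A", "V")]

def randomly_drop_modalities(batch, missing_rate, rng=random):
    """For each utterance, with probability `missing_rate`, keep a random subset of
    modalities and zero out the rest.

    batch: dict mapping "T", "A", "V" to (batch, seq_len, dim) tensors.
    """
    out = {k: v.clone() for k, v in batch.items()}
    for i in range(out["T"].size(0)):
        if rng.random() < missing_rate:
            keep = rng.choice(REMAINING_SUBSETS)
            for k in ("T", "A", "V"):
                if k not in keep:
                    out[k][i] = 0.0  # remove this modality for utterance i
    return out
```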
As shown in Figures 5 and 6, modalities are randomly removed at missing rates ranging from 10% to 100% on the testing sets of MOSI and MOSEI, where a 100% missing rate means that every testing utterance is incomplete and randomly misses modalities. The model with MissModal shows higher average performance and lower variance than the one without MissModal, indicating that MissModal maintains the upper bound of sentiment prediction performance in missing-modality scenarios.
Furthermore, we observe that MissModal brings greater improvement in performance and stability on MOSEI than on MOSI, in both the missing-textual-modality and random-missing settings. We assume that on MOSI the model tends to overfit due to the small scale of the dataset, while on MOSEI the larger data scale reveals more clearly the improvement in the generalization performance of the proposed approach.
In general, MissModal reaches more stable and superior performance in the experiments on both flexibility, with randomly missing modalities, and efficiency, with severely missing modalities at up to a 100% random missing rate.
6 Further Analysis
6.1 Ablation Study
We conduct an ablation study on the proposed geometric contrastive, distribution distance, and sentiment semantic losses of MissModal with a 100% missing rate of random modalities, as shown in Table 5. Each loss contributes to training and encourages the model to reach optimal performance. Besides, with the supervision of the ground-truth labels in the sentiment semantic loss, MissModal achieves much higher performance than when trained without it. Nevertheless, guiding the learning only at the prediction level is far from enough for the representations when modalities are missing. By fine-tuning at the feature level with the geometric contrastive and distribution distance losses, the model learns more robust missing-modal representations. Intriguingly, both feature-level losses can enhance the performance of missing-modal representations even in the unsupervised circumstance without the assistance of the sentiment semantic loss, which reveals new insights for the field of unsupervised MSA.
Setting | Acc7↑ | Acc2↑ | F1↑ | MAE↓ | Corr↑
---|---|---|---|---|---
Supervised | 47.0 | 76.0/73.9 | 74.0/71.2 | 0.693 | 0.541
Supervised | 46.6 | 75.3/73.1 | 73.7/69.8 | 0.701 | 0.523
Supervised | 46.5 | 74.8/72.0 | 73.5/69.2 | 0.714 | 0.513
Supervised | 44.4 | 73.9/72.1 | 72.3/70.7 | 0.721 | 0.503
Unsupervised | 34.4 | 71.8/65.4 | 67.3/60.0 | 1.100 | 0.137
Unsupervised | 33.0 | 66.7/61.9 | 64.1/58.0 | 1.048 | 0.149
Unsupervised | 33.7 | 71.7/64.6 | 65.9/57.8 | 1.126 | 0.148
Additionally, we evaluate the performance of MissModal with only one specific modality when the information from the other modalities is totally missing. As shown in Table 6, the experiment illustrates that the textual modality is the dominant modality while the acoustic and visual modalities serve as inferior modalities in the MSA task, consistent with the former results and previous research (Gkoumas et al., 2021). However, relying only on the textual modality may trap the model in subjective and biased emotion problems (Zadeh et al., 2017; Wang et al., 2019), degrading the performance compared with the multimodal case. Thus, introducing the acoustic and visual modalities is necessary to further boost the accuracy of sentiment inference for the MSA task. Each modality of the utterance provides unique and complementary properties, which are extracted as modality-specific and modality-shared features for the final sentiment prediction. The demand for various modalities indicates the necessity of improving the robustness of MSA models when modalities are missing.
Modality | Acc7↑ | Acc2↑ | F1↑ | MAE↓ | Corr↑
---|---|---|---|---|---
only T | 52.8 | 80.9/84.7 | 81.4/84.6 | 0.545 | 0.761 |
only A | 41.4 | 71.0/62.9 | 59.0/48.5 | 0.839 | 0.038 |
only V | 41.4 | 70.2/62.3 | 59.3/49.2 | 0.840 | 0.018 |
Complete | 53.9 | 83.4/85.9 | 83.6/85.8 | 0.533 | 0.769 |
6.2 Representation Visualization
As shown in Figure 7(a)–7(b), we utilize the t-SNE algorithm (Van der Maaten and Hinton, 2008) to provide visualization in the embedding space for the learning processes of missing and complete representations. Before training, significant modality gaps exist among the missing-modal and complete-modal representations. Through the guidance of three proposed constraints, MissModal successfully aligns the distributions of the representations with missing acoustic or visual modalities and the ones with complete modalities, leading to superior results in the experiments with missing acoustic and visual modalities in Tables 3 and 4. Nevertheless, we observe that the absence of semantic information makes it challenging to optimize and align the multimodal representations lacking the textual modality, highlighting the dominant role of textual modality as indicated by the results in Table 6. Despite the remaining gaps in the embedding space, the distribution shape of representations without textual modality is similar to others in Figure 7(b), illustrating the effectiveness of MissModal even in the circumstance of missing the dominant modality.
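For reference, such a visualization can be produced roughly as follows; the setting names and plotting details are illustrative.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_representations(rep_dict, path="tsne.png"):
    """rep_dict maps a setting name (e.g., "complete", "w/o A", "w/o V", "w/o T")
    to an (n_samples, d) array of multimodal representations."""
    names = list(rep_dict)
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(
        np.concatenate([rep_dict[n] for n in names], axis=0))
    start = 0
    for name in names:
        n = rep_dict[name].shape[0]
        plt.scatter(emb[start:start + n, 0], emb[start:start + n, 1], s=4, label=name)
        start += n
    plt.legend()
    plt.savefig(path, dpi=200)
```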
Furthermore, we visualize the representations of different sentiment classes with complete modalities in downstream testing in Figure 7(c) to demonstrate the superiority of MissModal in downstream inference. The learned multimodal representations are divided into distinguishable clusters according to positive, neutral, and negative sentiment. Besides, the representations inside the same sentiment class are compact and become increasingly compact as the sentiment intensity grows. This reveals the relation between multimodal representations and sentiment labels, implicitly indicating the productive collaboration between the geometric contrastive and distribution distance losses in the feature space and the sentiment semantic loss at the level of prediction.
6.3 Qualitative Analysis
To further validate the contribution of the proposed approach, Table 7 presents some examples where MissModal achieves superior performance compared with the model without MissModal when modalities are missing from the multimodal input. The examples cover various circumstances of missing modalities to demonstrate the effectiveness of the three proposed constraints.
Examples 1 to 3 contain multimodal input with only one modality missing, where the missing modality provides additional information for the final sentiment prediction. Without this complementary information, the model without MissModal tends to over-amplify or over-attenuate the magnitude of the emotion contained in the utterances. In contrast, MissModal aligns the missing-modal representations with the complete-modal ones during training, which implicitly transfers the knowledge of the missing modality to the remaining ones under the guidance of sentiment labels. Thus, the sentiment prediction of MissModal is closer to the annotated ground-truth label in these cases, leading to higher performance on Acc7, MAE, and Corr, as shown in Figure 6.
Examples 4 and 5 show cases missing both the acoustic and visual modalities, illustrating that these two inferior modalities play auxiliary roles in sentiment inference. Especially in Example 5, the text of the utterance can be portrayed as mostly neutral, which results in a prediction score close to 0 for the model without MissModal. In contrast, owing to the latent information conveyed by the active tone and focused facial expression, MissModal shifts the predicted polarity to slightly positive, similar to the given ground-truth label.
6.4 Model Complexity
As shown in Table 8, we compare the model complexity of various models by reporting the increased number of parameters on CMU-MOSEI.
Model | Increased Parameters
---|---
MFM | 5,996,541 |
MCTN | 3,910,658 |
CTFN | 2,852,801 |
MISAτ | 1,435,105 |
MAG-BERTτ | 1,352,449 |
Self-MMτ | 165,668 |
MMIMτ | 338,889 |
MissModal | 340,039 |
Firstly, generative models such as MFM, MCTN, and CTFN require massive numbers of parameters, as mentioned above, strengthening the motivation for adopting classification-based methods in computationally limited scenarios. In contrast, by simplifying the multimodal fusion networks to significantly reduce the computational complexity, MissModal requires fewer parameters than, or a comparable number to, the state-of-the-art baselines. The extra parameters of MissModal mostly come from the multiple fusion networks for the various circumstances of missing modalities.
Besides, the proposed constraints in MissModal demand no extra training parameters when addressing missing-modality issues. In general, MissModal achieves a better trade-off between model complexity and performance with both complete and missing modalities.
7 Limitations
The limitations of the proposed approach are listed below for future research. First, the number of model parameters of MissModal depends on the complexity of the multimodal fusion networks and the number of modalities, which may increase model complexity in downstream applications of the proposed approach. Second, the improvement of MissModal appears related to the scale of the dataset, where small datasets may limit its robustness. We believe that increasing the scale of the dataset can further reveal the effectiveness of the proposed approach on missing-modality issues. Lastly, although MissModal aims at handling missing modalities at inference time, the demand for complete modalities in training raises the difficulty of collecting multimodal data. Removing the requirement for complete modalities in training is another interesting research direction for us to explore in the future.
8 Conclusion
In this paper, we present a novel classification-based approach named MissModal to enhance robustness to missing modalities in downstream applications by constructing three constraints, including a geometric contrastive loss, a distribution distance loss, and a sentiment semantic loss, to align the representations of missing and complete modalities. Extensive experiments with various settings of missing modalities and missing rates demonstrate the superiority of MissModal in both flexibility and efficiency on two public datasets. The analysis of representation visualization and model complexity further indicates the great potential and generality of MissModal in other multimodal systems.
Acknowledgments
This work was supported by the National Natural Science Foundation of China under grant 62076262.