MissModal: Increasing Robustness to Missing Modality in Multimodal Sentiment Analysis

Abstract When applying multimodal machine learning to downstream inference, both joint and coordinated multimodal representations rely on the complete presence of modalities, as in training. However, modal-incomplete data, where certain modalities are missing, greatly reduces performance in Multimodal Sentiment Analysis (MSA) due to varying input forms and deficiencies in semantic information. This limits the applicability of the predominant MSA methods in the real world, where the completeness of multimodal data is uncertain and variable. Generation-based methods attempt to generate the missing modality, yet they require complex hierarchical architectures with huge computational costs and struggle with the representation gaps across different modalities. Diversely, we propose a novel representation learning approach named MissModal, devoted to increasing robustness to missing modality in a classification manner. Specifically, we adopt constraints with geometric contrastive loss, distribution distance loss, and sentiment semantic loss to align the representations of modal-missing and modal-complete data, without impacting the sentiment inference for complete modalities. Furthermore, we demand no changes in the multimodal fusion stage, highlighting the generality of our method for other multimodal learning systems. Extensive experiments demonstrate that the proposed method achieves superior performance with minimal computational costs in various missing-modality scenarios (flexibility), including severely missing modality (efficiency), on two public MSA datasets.


Introduction
With the proliferation of the Internet and the surge of user-generated videos, Multimodal Sentiment Analysis (MSA) has become an important and challenging research task that focuses on predicting sentiment from multiple modalities, including text, audio, and vision (Morency et al., 2011; Poria et al., 2020). Previous models (Zadeh et al., 2017; Tsai et al., 2019a; Wang et al., 2019; Han et al., 2021) aim at learning a mapping function to fuse the information of different modalities and obtain distinguishable multimodal representations for sentiment inference. As shown in Figure 1, these MSA methods input utterances with multiple modalities to train the mapping function of multimodal representation under the supervision of ground-truth labels, and apply the learned MSA models in downstream testing to predict the sentiment of other utterances.
However, both the training and testing pipelines in these MSA methods require complete-modal data, indicating the sensitivity of the mapping function to missing modalities. Missing any modality in testing shifts the distribution of the input data away from that seen in training, leading to performance drops for the mapping function. Due to the uncertain and varied modality settings in the real world, this demand for the integrity of modalities limits the application of previous multimodal representation learning strategies.
To deal with the issues of missing modalities, generation-based research has emerged that focuses on leveraging the remaining modalities to generate the missing ones (Tsai et al., 2019b; Pham et al., 2019; Tang et al., 2021). These generative models have complex hierarchical architectures, which require redundant training parameters and high computational costs in training. Besides, their generative performance is still challenged by the huge modality gap among different modalities, further limiting their application in the real world.
Different from the generative methods, we propose a novel multimodal representation learning approach named MissModal, devoted to increasing the model's robustness to missing modality in a classification manner. Specifically, we utilize independent modality-specific networks to learn representations for each modality. Then, according to the complete modalities (text, audio, vision) and the missing-modality settings (text), (audio), (vision), (text, audio), (text, vision), (audio, vision), we adopt multimodal fusion networks with a consistent structure to learn the corresponding complete-modal and missing-modal representations. To transfer the semantic knowledge of complete modalities, we construct three constraints to align missing-modal and complete-modal representations: a geometric contrastive loss applying contrastive learning at the level of samples, a distribution distance loss adjusting the distribution of representations, and a sentiment semantic loss introducing supervision from sentiment labels.
Aiming at improving the downstream performance of MSA models in the real world, we retain the completeness of modalities in training, and then freeze the trained model for validation and testing with different missing rates for diverse modalities to evaluate the flexibility (randomly missing various modalities) and efficiency (severely missing modalities) of the proposed approach.
The contributions are summarized as follows: 1) We propose a novel multimodal representation learning approach named MissModal, devoted to increasing the robustness of MSA models to the issues of missing modalities in downstream applications.
2) Without resorting to generative methods, we construct three constraints to align the representations of missing and complete modalities, consisting of a geometric contrastive loss, a distribution distance loss, and a sentiment semantic loss.
3) Extensive experiments on two publicly available MSA datasets with various settings of missing rates and missing modalities demonstrate the superiority of the proposed approach in both flexibility and efficiency.
2 Related Work

Multimodal Representation Learning
Diverse modalities such as natural language, motion videos, and vocal signals contain specific and complementary information on a common concept (Baltrušaitis et al., 2019). Multimodal representation learning focuses on exploring the intra- and inter-modal dynamics and learning distinguishable representations for various downstream tasks (Bugliarello et al., 2021). Recently, contrastive learning-based multimodal pre-trained models, e.g., CLIP (Radford et al., 2021), WenLan (Huo et al., 2021), and UNIMO (Li et al., 2021), leverage contrastive learning to train transferable mappings to bridge large-scale image-text pairs. The successful downstream application of these pre-trained models demonstrates the effectiveness of contrastive learning in aligning representations of different modalities.
As a task branch of multimodal machine learning, Multimodal Sentiment Analysis (MSA) aims at integrating the semantic information contained in different modalities, including the textual, acoustic, and visual modalities, to predict the sentiment intensity of an utterance (Poria et al., 2020). Previous MSA methods mostly concentrate on designing effective multimodal fusion methods to explore the commonalities among different modalities (Zadeh et al., 2017; Rahman et al., 2020; Han et al., 2021) and learn informative multimodal representations. However, the training pipeline of explicit fusion strategies requires the presence of all modalities. Missing any modality downstream creates differences in input conditions between training and testing, causing incorrect sentiment inference in applications.

Missing Modality Issues
The aforementioned multimodal pre-trained models heavily depend on the completeness of modalities, making them fail to handle modality-incomplete data. As Ma et al. (2022) indicate, multimodal transformers (Hendricks et al., 2021) are sensitive to missing modalities, and the modality fusion strategies are dataset-dependent, which significantly affects robustness. Therefore, to address missing-modality issues, generation-based methods (Ma et al., 2021; Vasco et al., 2022) are proposed to learn a prior distribution over a modality-shared representation and infer the missing modalities in the modality-shared latent space; such methods are also employed in the MSA task (Tsai et al., 2019b; Pham et al., 2019; Tang et al., 2021). Nevertheless, these generation-based methods require large computational costs, and their generative performance is limited by the huge modality gaps. Meanwhile, they mostly demand complex hierarchical model architectures that lack generality and efficiency in downstream applications. Different from them, we are devoted to utilizing a classification approach instead of generation to reach the performance upper bound in missing-modality scenarios.
Recently, Hazarika et al. (2022) proposed robust training, which utilizes missing and noisy textual input as data augmentation to train the state-of-the-art MSA models. However, the application of robust training is limited by its settings of a single modality and fixed missing rates. Diversely, according to the missing rates and the diversity of missing modalities, we evaluate performance by flexibility (randomly missing various modalities) and efficiency (severely missing modalities in testing) to show the improvement in robustness to missing modalities for the proposed approach.

Task Definition
The input of the MSA task is utterances, which can be denoted as the triplet (T, A, V), including textual modality T ∈ ℝ^(ℓ_T × d_T), acoustic modality A ∈ ℝ^(ℓ_A × d_A), and visual modality V ∈ ℝ^(ℓ_V × d_V), where ℓ_U denotes the sequence length and d_U the feature dimension of modality U ∈ {T, A, V}. The goal is to learn to map the multimodal data (T, A, V) into multimodal representations F = f(T, A, V), where F ∈ ℝ^(N_M × d_M) can be utilized to infer the final sentiment score ŷ ∈ ℝ. Specifically, for better generalization performance in downstream applications, the multimodal representation mapping f learned on the training data needs to handle the testing scenario as well, regardless of the completeness of modalities.

Model Architecture
To increase the robustness to missing modalities in testing, we propose a novel multimodal representation learning approach named MissModal, whose architecture is shown in Figure 2.
To obtain the modality-specific representations, we first adopt the pre-trained BERT (Devlin et al., 2019) to encode the input text embedding T and learn the textual representation, taking the output embedding of the last Transformer layer as F_T. Meanwhile, for the acoustic and visual modalities, we utilize two bidirectional LSTMs (Hochreiter and Schmidhuber, 1997) to capture the temporal characteristics and two 3-layer unimodal Transformers (Vaswani et al., 2017) to further encode the global self-attention information, yielding F_U for U ∈ {A, V}. Specially, we take the [CLS] token of F_T and the embedding from the last time step of F_A and F_V, obtaining the modality-specific representations for U ∈ {T, A, V}.

To capture the modality-shared dynamics, we utilize multimodal fusion networks to learn the latent interactions among different modalities. Specifically, to better handle the various cases of missing modality, we concatenate the modality-specific representations in seven ways to simulate seven input circumstances: the complete-modality setting, denoted as (T, A, V), and the remaining modalities after missing, denoted as {(T), (A), (V), (T, A), (T, V), (A, V)}. To highlight the effectiveness of MissModal without losing generality, we adopt several simple MLPs with Tanh activation layers as the fusion networks to extract the inter-modal information after concatenation, where [;] denotes the concatenation of modalities, F_M denotes the multimodal representation with complete modalities, and F_miss denotes the representations computed from the remaining modalities. Note that the structure of the multimodal fusion networks is optional and can be flexibly substituted with state-of-the-art multimodal fusion methods, illustrating the backward compatibility of the proposed approach.
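As a rough illustration of the fusion stage, the sketch below (not the authors' code; the feature dimensions, the two-layer Tanh MLP, and all names such as `fuse` and `HID` are assumptions) enumerates the seven non-empty modality subsets and applies one independent fusion network per subset:

```python
# Hedged sketch of MissModal's seven fusion branches (all details assumed).
from itertools import combinations
import numpy as np

MODALITIES = ("T", "A", "V")
D = {"T": 768, "A": 16, "V": 32}   # assumed modality-specific feature sizes
HID = 64                            # assumed fusion hidden size

def subsets(mods):
    """All non-empty subsets of the modalities: the seven input circumstances."""
    return [c for r in range(1, len(mods) + 1) for c in combinations(mods, r)]

rng = np.random.default_rng(0)
# One independent two-layer MLP (weights W1, W2) per modality subset.
nets = {
    s: (rng.standard_normal((sum(D[m] for m in s), HID)) * 0.02,
        rng.standard_normal((HID, HID)) * 0.02)
    for s in subsets(MODALITIES)
}

def fuse(feats, present):
    """Concatenate the present modality features and apply that subset's MLP."""
    s = tuple(m for m in MODALITIES if m in present)
    x = np.concatenate([feats[m] for m in s])
    W1, W2 = nets[s]
    return np.tanh(np.tanh(x @ W1) @ W2)

feats = {m: rng.standard_normal(D[m]) for m in MODALITIES}
F_M = fuse(feats, {"T", "A", "V"})   # complete-modal representation
F_miss = fuse(feats, {"A", "V"})     # e.g. textual modality missing
```

All seven branches share the same structure but have separate parameters, matching the paper's "consistent structure" description while keeping the fusion step easily replaceable.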

Constraints when Missing Modalities
As shown in Figure 2, to improve the robustness of the model to missing modalities, we propose three losses as constraints to align the missing-modal representations F_miss with the complete-modal ones F_M, as follows. Poklukar et al. (2022) indicate that huge gaps between modality-specific representations and complete representations lead to severe misalignment in the distribution space. Inspired by but different from Chen et al. (2020) and Poklukar et al. (2022), we introduce contrastive learning between the multimodal representations with complete modalities and those with different cases of missing modalities, geometrically aligning the representations from the same utterance samples under the supervision of sentiment labels.

Geometric Contrastive Loss
Given a mini-batch B of multimodal representations, we define positive pairs as (F_M^i, F_miss^i), and negative pairs as (F_M^i, F_M^j) and (F_M^i, F_miss^j) for the i-th and j-th samples (i ≠ j) of the mini-batch. We then compute the sum of similarities over the negative pairs, scaled by a temperature hyperparameter γ that regulates the probability distribution over distinct instances (Hinton et al., 2015). Similarly, the similarity of a positive pair is denoted as s_{p,q}(i, i), relating each missing-modal representation with its corresponding complete-modal one.
By traversing all samples in the mini-batch and averaging the per-sample contrastive terms, we obtain the geometric contrastive loss L_geo. This contrastive learning encourages the multimodal fusion networks to transfer complete-modal information to the missing-modal representations, making them more distinguishable when handling missing-modality issues in applications.
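A minimal sketch of how such a loss could be computed, assuming cosine similarity and the positive/negative pairing described above (the paper's exact formula is not reproduced here, so this form is an assumption):

```python
import numpy as np

def geometric_contrastive_loss(F_M, F_miss, gamma=0.5):
    """Contrastive sketch of L_geo: (F_M^i, F_miss^i) are positives; cross-sample
    pairs (F_M^i, F_M^j) and (F_M^i, F_miss^j), i != j, are negatives."""
    def cos(a, b):
        a = a / np.linalg.norm(a, axis=-1, keepdims=True)
        b = b / np.linalg.norm(b, axis=-1, keepdims=True)
        return a @ b.T

    B = len(F_M)
    s_mx = np.exp(cos(F_M, F_miss) / gamma)   # similarities (F_M^i, F_miss^j)
    s_mm = np.exp(cos(F_M, F_M) / gamma)      # similarities (F_M^i, F_M^j)
    pos = np.diag(s_mx)                       # positive pairs (i, i)
    off = 1.0 - np.eye(B)                     # mask out i == j terms
    neg = (s_mx * off).sum(axis=1) + (s_mm * off).sum(axis=1)
    return -np.mean(np.log(pos / (pos + neg)))
```

Minimizing the loss pulls each missing-modal representation toward its complete-modal counterpart while pushing it away from other samples in the batch.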

Distribution Distance Loss
To further enhance the similarity between F_miss^i and the corresponding F_M^i, we add an L2 distance constraint to reduce the distribution distance between the missing-modal and complete-modal representations from the same sample, yielding the distribution distance loss L_dis. Both the geometric contrastive loss L_geo and the distribution distance loss L_dis increase the model's robustness in the feature space when modalities are missing.
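Under the assumption that L_dis is a plain squared-L2 penalty between paired representations averaged over the batch (the precise form is not shown here), a sketch could look like:

```python
import numpy as np

def distribution_distance_loss(F_M, F_miss):
    """L2 sketch of L_dis: mean squared distance between each missing-modal
    representation and its complete-modal counterpart from the same sample."""
    return np.mean(np.sum((F_M - F_miss) ** 2, axis=-1))
```

Unlike the contrastive term, this loss acts only on matched pairs, directly shrinking the per-sample gap in the feature space.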

Sentiment Semantic Loss
Due to the various semantic information contained in diverse modalities, missing modalities may result in different sentiments for the same utterance. For consistency in the inference of sentiment polarity, we introduce the sentiment semantic loss L_sem, which utilizes the ground-truth labels y to supervise the sentiment predictions of the missing-modal representations in the label space.
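Since the task loss in this paper is MAE regression, a plausible (assumed, not confirmed) form of L_sem applies the same criterion to the missing-modal predictions:

```python
import numpy as np

def sentiment_semantic_loss(y_hat_miss, y):
    """MAE sketch of L_sem (the exact criterion is an assumption): supervise
    the missing-modal sentiment predictions with the ground-truth labels."""
    return np.mean(np.abs(y_hat_miss - y))
```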

Optimization Objective
In the MSA task, after obtaining the sentiment prediction ŷ_M with complete modalities, we apply the Mean Absolute Error (MAE) loss against the ground-truth labels y to conduct the regression of sentiment labels, yielding the task loss. Lastly, we calculate the weighted sum of all training losses to obtain the final optimization objective, where α and β denote the hyperparameters controlling the impact of the training losses for the missing-modal representations in the feature space.
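One way the weighted sum could be grouped is sketched below; the exact pairing of α and β with the individual losses is an assumption, since the paper only states that they weight the losses for the missing-modal representations:

```python
def missmodal_objective(l_task, l_geo, l_dis, l_sem, alpha=0.5, beta=0.5):
    """Assumed grouping of the final objective: the complete-modal task loss
    plus weighted feature-space (L_geo, L_dis) and label-space (L_sem) terms."""
    return l_task + alpha * (l_geo + l_dis) + beta * l_sem
```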
4 Experiment Setting

Datasets and Metrics
The experiments are conducted on two benchmark datasets in MSA research: CMU-MOSI (Zadeh et al., 2016) contains 2,199 monologue utterances sliced from 93 YouTube movie opinion videos spanning 89 reviewers. We utilize 1,284 utterances for training, 229 utterances for validation, and 686 utterances for testing. CMU-MOSEI (Zadeh et al., 2018b) expands the multimodal data to 20k video clips from 3,228 videos in 250 diverse topics collected from 1,000 distinct YouTube speakers. We utilize 16,326 utterances for training, 1,871 utterances for validation, and 4,659 utterances for testing. Both datasets are annotated for sentiment on a Likert scale ranging from −3 to +3, where the polarity indicates positive/negative and the absolute value denotes the relative strength of the expressed sentiment.
We reproduce the baselines with hyperparameter grid searches for the best results.Additionally, we run the state-of-the-art models under robust training with 15% masking and 15% noisy language data following Hazarika et al. (2022) in the circumstances of complete modalities and missing textual modality for a fair comparison.

Implementation Details
Following the settings of the baselines, we adopt the pre-trained BERT-base-uncased model to encode textual input and obtain raw textual features with 768-dimensional hidden states for each token. Besides, we utilize the CMU-Multimodal SDK to pre-process the audio and vision data, which applies COVAREP (Degottex et al., 2014) and Facet to extract raw acoustic and visual features.
We conduct the experiments on a single GTX 1080Ti GPU with CUDA 10.2. For the hyperparameters, following Gkoumas et al. (2021), we perform a fifty-trial random grid search to find the best hyperparameter settings, including α and β in {0.3, 0.5, 0.7} and τ in {0.5, 0.7, 0.9}. The batch size for both MOSI and MOSEI is set to 32. For optimization, we adopt AdamW (Loshchilov and Hutter, 2019) as the optimizer with a learning rate of 5e-5 for the parameters of BERT on both datasets, and 5e-4 on MOSI and 1e-3 on MOSEI for the other parameters.
For both the complete- and missing-modality settings, we run each experiment five times and report the average performance as the final result. In the experiments with missing modalities, we retain the completeness of modalities in the training sets of both datasets to fine-tune the model, and then freeze the model for the validation and testing sets with different missing rates for diverse modalities (Hazarika et al., 2022) to evaluate both the flexibility and efficiency of the proposed approach.
5 Experiment Results

Experiments with Complete Modalities
As shown in Table 1, we compare the performance of MissModal with the state-of-the-art MSA methods with complete modalities in training and testing. The outstanding results on all metrics demonstrate the effectiveness of the proposed architecture of modality-specific and cross-modal representation learning on both MOSI and MOSEI. Moreover, most previous MSA models demand the presence of all modalities, so they cannot be directly employed when modalities are missing from the input data. To address this issue, we adopt the robust training strategy (Hazarika et al., 2022) for the state-of-the-art MSA models. However, we observe that robust training decreases the performance on most metrics when testing with complete modalities, due to the introduction of masked or noisy input.
Differently, the superior experimental results of MissModal are achieved under the constraints on the missing-modal representations, which indicates that the introduction of the missing-modality mechanism does not impact the testing performance with complete modalities.

Experiments when Missing Modalities
To show the benefits of the proposed constraints in addressing missing-modality issues, we remove modalities by replacing the modality input with a zero vector in both the validation and testing sets. Notably, unlike Hazarika et al. (2022), which trains and tests with language as the specific missing modality, we evaluate MissModal in various scenarios: missing textual modality, missing acoustic or visual modality, and missing random modalities.
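The zero-replacement protocol could be sketched as follows (function and variable names are assumptions; the paper only specifies that missing modalities are replaced by zero vectors at a given missing rate):

```python
import numpy as np

def apply_missing(batch, modality, rate, rng):
    """Sketch of the evaluation protocol: with probability `rate`, replace a
    sample's features for `modality` with a zero vector. Returns a copy of
    the batch plus the boolean mask of affected samples."""
    out = {m: x.copy() for m, x in batch.items()}
    mask = rng.random(len(out[modality])) < rate
    out[modality][mask] = 0.0
    return out, mask
```

Sweeping `rate` from 0.1 to 0.9 (or 1.0 for the fully missing case) reproduces the missing-rate settings used in the experiments.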

Missing Textual Modality
The textual modality is viewed as the dominant modality in the MSA task (Hazarika et al., 2020; Wu et al., 2021; Lin and Hu, 2023) due to the large-scale pre-trained language model and its abundance of semantic information instrumental in sentiment understanding. We first compare MissModal with the state-of-the-art methods under robust training (Hazarika et al., 2022) with the textual modality missing at various missing rates. As shown in Table 2, MissModal achieves superior performance over the state-of-the-art methods under robust training on most metrics, especially in the circumstances of severely missing modalities. We assume that the fixed settings of missing and noisy rates (15%) in robust training limit its application at higher missing rates of the textual modality. On the contrary, MissModal concentrates on improving the robustness of missing-modal representations, whose performance does not depend on a fixed setting of missing rates.
To further show the effectiveness of MissModal in flexibility and efficiency, we run the model with and without MissModal at different missing rates of the textual modality on the testing sets of the MOSI and MOSEI datasets, as shown in Figures 3 and 4. We observe that missing the textual modality at rates from 10%-90% brings more significant drops in average performance to the model without MissModal than to the one with MissModal. Besides, the variance of the performance without MissModal on all metrics grows rapidly as the missing rate increases, which does not happen in the experimental results of the model with MissModal. Moreover, missing the textual modality leads to polarization of the predicted sentiment, which is due to the lesser attention of the acoustic and visual modalities to fine-grained sentiment. Therefore, MissModal helps the model learn more distinguishable missing-modal representations, greatly improving the accuracy of sentiment inference, especially in the case of severely missing modality.

Missing Acoustic or Visual Modality
As the inferior modalities in MSA, the acoustic and visual modalities play auxiliary and complementary roles in the prediction of sentiment, leading to less impact on the performance when removing these two modalities at 50% and 90% missing rates on MOSI and MOSEI, as shown in Tables 3-4. Nevertheless, missing either of them brings sub-optimal solutions for the MSA model. When missing any modality at any missing rate, the performance of the model with MissModal surpasses the one without MissModal on all metrics, demonstrating the superiority of the proposed approach.

Randomly Missing Modalities
To demonstrate wider applications of MissModal in addressing missing-modality issues, we remove modalities via random distribution sampling and run MissModal with the remaining modalities as inputs, following the settings {(T), (A), (V), (T, A), (T, V), (A, V)}. This experimental setting is consistent with the real-world scenario of adopting an MSA model where the presence of modalities is unknown.
As shown in Figures 5 and 6, the modalities are randomly removed at various missing rates ranging from 10%-100% on the testing sets of the MOSI and MOSEI datasets, where a 100% missing rate means that every testing utterance is incomplete and misses modalities randomly. The model with MissModal has higher average performance and lower variance than the one without MissModal, indicating that MissModal maintains the upper bound of sentiment prediction performance in missing-modality scenarios. Furthermore, we observe that MissModal brings greater improvement in performance and stability on MOSEI than on MOSI, whether in the settings of missing textual modality or random modalities. We assume that on MOSI the model tends to overfit the data due to the small scale of the dataset, while on MOSEI the larger data scale helps reveal a more significant improvement in the generalization performance of the proposed approach.
In general, MissModal reaches more stable and superior performance in the experiments on both flexibility, with randomly missing modalities, and efficiency, with severely missing modalities at even a 100% random missing rate.
6 Further Analysis

Ablation Study
We conduct an ablation study on the proposed losses L_sem, L_geo, and L_dis of MissModal with a 100% missing rate of random modalities, as shown in Table 5. Apparently, each loss contributes to the training and encourages the model to reach optimal performance. Besides, with the supervision of the ground-truth labels in L_sem, MissModal achieves substantially higher performance than the model trained without L_sem. Nevertheless, L_sem alone, guiding the learning only at the prediction level, is far from enough for the representations when modalities are missing. By fine-tuning at the feature level with L_geo and L_dis, the model learns more robust missing-modal representations. Intriguingly, both L_geo and L_dis can enhance the performance of missing-modal representations even in the unsupervised circumstance without the assistance of L_sem, which offers new insight for the field of unsupervised MSA.

Additionally, we evaluate the performance of MissModal with only one specific modality when the information from the other modalities is totally missing. As shown in Table 6, the experiment illustrates that the textual modality is the dominant modality while the acoustic and visual modalities serve as inferior modalities in the MSA task, a conclusion consistent with the former results and previous research (Gkoumas et al., 2021). However, the textual modality alone may trap the model in subjective and biased emotion problems (Zadeh et al., 2017; Wang et al., 2019), degrading the performance compared with the multimodal case. Thus, the introduction of the acoustic and visual modalities is necessary to further boost the accuracy of sentiment inference for the MSA task. Each modality of an utterance provides unique and complementary properties, which are extracted as modality-specific and -shared features for the final sentiment prediction. The demand for various modalities indicates the necessity of improving the robustness of MSA models when modalities are missing.

Table 7: Examples from the testing set of the CMU-MOSEI dataset. The missing-modality inputs are highlighted in red, and the ground-truth sentiment labels range from strongly negative (−3) to strongly positive (+3). For each example, we show the ground truth and the output predictions of the models with and without MissModal.

Representation Visualization
As shown in Figures 7(a)-7(b), we utilize the t-SNE algorithm (Van der Maaten and Hinton, 2008) to visualize the learning processes of the missing and complete representations in the embedding space. Before training, significant modality gaps exist between the missing-modal and complete-modal representations. Through the guidance of the three proposed constraints, MissModal successfully aligns the distributions of the representations with missing acoustic or visual modalities and the ones with complete modalities, leading to the superior results in the experiments with missing acoustic and visual modalities in Tables 3 and 4. Nevertheless, we observe that the absence of semantic information makes it challenging to optimize and align the multimodal representations lacking the textual modality, highlighting the dominant role of the textual modality as indicated by the results in Table 6. Despite the remaining gaps in the embedding space, the distribution shape of the representations without the textual modality is similar to the others in Figure 7(b), illustrating the effectiveness of MissModal even when the dominant modality is missing.
Furthermore, we visualize the representations over different sentiment classes with complete modalities in downstream testing in Figure 7(c) to demonstrate the superiority of MissModal in downstream inference. The learned multimodal representations are divided into distinguishable clusters according to positive, neutral, and negative sentiment. Besides, the representations within the same sentiment class are compact and become increasingly compact as the sentiment intensity increases. This reveals the relation between multimodal representations and sentiment labels, implicitly indicating the productive collaboration of L_geo and L_dis in the feature space with L_sem at the prediction level.

Qualitative Analysis
To further validate the contribution of the proposed approach, we present in Table 7 some examples where MissModal achieves superior performance compared with the model without MissModal when modalities are missing from the multimodal input data. The examples cover various circumstances of missing modalities to demonstrate the effectiveness of the three proposed constraints.
Examples 1 to 3 contain multimodal input with only one modality missing, where the missing modality provides additional information for the final sentiment prediction. Without this complementary information, the model without MissModal tends to over-amplify or over-attenuate the magnitude of emotion contained in the utterances. Diversely, MissModal aligns the missing-modal representations with the complete-modal ones during training, which implicitly transfers the knowledge of the missing modality to the remaining ones under the guidance of sentiment labels. Thus, the sentiment prediction of MissModal is closer to the annotated ground-truth label in these cases, leading to higher performance on Acc7, MAE, and Corr, as shown in Figure 6.
Examples 4 and 5 show cases without both the acoustic and visual modalities, illustrating that these two inferior modalities play auxiliary roles in sentiment inference. Especially in Example 5, the text of the utterance can be portrayed as mostly neutral, which results in a prediction score close to 0 for the model without MissModal. However, due to the latent information conveyed by the active tone and focused facial expression, MissModal deflects the polarity of the sentiment to slightly positive, similar to the given ground-truth label. Note that to highlight parameter demands, the reported numbers of parameters are the extra parameters added, excluding the pre-trained language model BERT.

Model Complexity
As shown in Table 8, we compare the model complexity of various models by reporting the increased number of parameters on CMU-MOSEI. Firstly, generative models such as MFM, MCTN, and CTFN require massive numbers of parameters, as mentioned above, strengthening the motivation for adopting classification-based methods in computationally limited scenarios. Differently, by simplifying the multimodal fusion networks to significantly reduce the computational complexity, MissModal requires fewer parameters than, or a number comparable to, the state-of-the-art baselines. The extra parameters of MissModal come mostly from the multiple fusion networks for the various missing-modality circumstances.
Besides, the proposed constraints in MissModal demand no extra training parameters when addressing the issues of missing modalities. In general, MissModal achieves a better trade-off between model complexity and performance with both complete and missing modalities.

Limitations
The limitations of the proposed approach are listed below for future research. First, the parameter count of MissModal depends on the complexity of the multimodal fusion networks and the number of modalities, which may increase model complexity in downstream applications of the proposed approach. Second, the improvement of MissModal seems related to the scale of the dataset, where small datasets may limit the robustness of MissModal. We believe that increasing the scale of the datasets can show more of the effectiveness of the proposed approach on missing-modality issues. Lastly, although MissModal aims at handling missing modalities at the inference stage, the demand for complete modalities in training raises the difficulty of collecting multimodal data. Getting rid of the need for complete modalities in training is another interesting research area for us to explore in the future.

Conclusion
In this paper, we present a novel classification-based approach named MissModal to enhance the robustness to missing modalities in downstream applications by constructing three constraints, including a geometric contrastive loss, a distribution distance loss, and a sentiment semantic loss, to align the representations of missing and complete modalities. Extensive experiments on various settings of missing modalities and missing rates demonstrate the superiority of MissModal in both flexibility and efficiency on two public datasets. The analysis of representation visualization and model complexity further indicates the huge potential and generality of MissModal in other multimodal systems.

Figure 1 :
Figure 1: Illustration of missing modality in testing when applying the trained multimodal representation model in a downstream application, where T, A, V denote the textual, acoustic, and visual modality, respectively.

Figure 2 :
Figure 2: The overall architecture of the proposed MissModal. The missing-modal representations F_miss and complete-modal representations F_M are aligned with the guidance of the proposed losses L_geo, L_dis, and L_sem in both the feature space and at the prediction level.

Figure 7 :
Figure 7: Visualization of (a)(b) multimodal representation with missing and complete modalities in the embedding space on the training set of MOSEI and (c) multimodal representation with complete modalities over different sentiment classes in the embedding space on the testing set of MOSEI.

Table 2 :
Performance comparison with the state-of-the-art methods under robust training at various missing rates of the textual modality on the testing sets of MOSI and MOSEI.

Table 3 :
Performance improvement of MissModal in 50% and 90% missing rates of acoustic and visual modality on the testing set of MOSI dataset.

Table 4 :
Performance improvement of MissModal in 50% and 90% missing rates of acoustic and visual modality on the testing set of MOSEI dataset.

Table 5 :
Ablation study of the proposed losses in MissModal with 100% missing rate of random modalities on the testing set of MOSEI dataset, which is divided into supervised and unsupervised circumstances for the learning of missing-modal representations according to the existence of L sem .

Table 6 :
Ablation study of various modalities in MissModal on the testing set of MOSEI dataset.T, A, V denote textual, acoustic, and visual modality.

Table 8 :
Comparison of model complexity of MissModal and the MSA baselines.