TSSP-MSA: DEEP LEARNING LEVERAGED TRI-STAGE SELF-SUPERVISED FRAMEWORK FOR ROBUST MULTIMODAL SENTIMENT ANALYSIS
DOI: https://doi.org/10.70917/ijcisim-2026-2185
Keywords: Multimodal Sentiment Analysis, Self-Supervised Learning, Unimodal Representation, Modality Fusion, Contrastive Learning, Robustness, Affective Computing
Abstract
Multimodal sentiment analysis (MSA) aims to understand human emotions by analyzing signals from multiple modalities, including text, audio, and visual data. However, current models often struggle to learn modality-specific features, handle incomplete data, and align semantics across modalities. To address these shortcomings, this paper presents TSSP-MSA, a tri-stage self-supervised framework that aims to improve unimodal feature quality, fusion adaptability, and cross-modal consistency. In the first stage, the unimodal encoders are pretrained with self-supervised objectives to learn robust and semantically rich feature representations. The second stage introduces an uncertainty-aware mixture-of-experts fusion strategy that dynamically balances modality contributions according to their uncertainty, improving tolerance to missing or noisy data. The final stage applies cross-modal contrastive refinement, which aligns modality representations in a common latent space to reduce semantic discordance. Comprehensive experiments on the standard CMU-MOSI and CMU-MOSEI datasets show that TSSP-MSA substantially outperforms state-of-the-art methods in accuracy, F1-score, and robustness to modality dropping. The proposed architecture advances interpretable, robust affective computing systems, with strong prospects for real-world human-computer interaction.
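The abstract does not give formulas for the second and third stages, so the following is only a minimal sketch of the two ideas under stated assumptions, not the authors' implementation. It assumes that each modality comes with a predicted log-variance as its uncertainty estimate, that fusion weights are a softmax over the negative log-variances of the modalities that are present, and that the contrastive refinement can be illustrated with a standard InfoNCE loss. The function names `uncertainty_weighted_fusion` and `info_nce`, the `temperature` value, and the toy inputs are all hypothetical.

```python
import math


def uncertainty_weighted_fusion(features, log_variances):
    """Fuse modality vectors so that less certain modalities get smaller weights.

    features: dict mapping modality name -> list of floats, or None if missing.
    log_variances: dict mapping modality name -> predicted log-variance
                   (an assumed uncertainty estimate, not taken from the paper).
    """
    available = [m for m, f in features.items() if f is not None]
    if not available:
        raise ValueError("at least one modality must be present")
    # Softmax over negative log-variance: lower uncertainty -> larger weight.
    scores = [-log_variances[m] for m in available]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = {m: e / total for m, e in zip(available, exps)}
    dim = len(features[available[0]])
    fused = [sum(weights[m] * features[m][i] for m in available) for i in range(dim)]
    return fused, weights


def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE loss for a batch where anchors[i] is paired with positives[i]."""

    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    loss = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of anchor i with every candidate in the batch.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        loss += log_denominator - logits[i]
    return loss / len(a)


# Toy usage: the visual stream is missing, and audio is flagged as less certain.
fused, weights = uncertainty_weighted_fusion(
    {"text": [1.0, 0.0], "audio": [0.0, 1.0], "visual": None},
    {"text": -2.0, "audio": 1.0, "visual": 0.0},
)
```

In this sketch a missing modality is simply excluded before the softmax, so the remaining weights still sum to one. A learned gating network or a set of expert subnetworks, as the mixture-of-experts description suggests, would replace the fixed softmax in a real system.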