TSSP-MSA: DEEP LEARNING LEVERAGED TRI-STAGE SELF-SUPERVISED FRAMEWORK FOR ROBUST MULTIMODAL SENTIMENT ANALYSIS

Bharathi Niruti; Sujith AVLN; Kanaka Durga Returi

doi:10.70917/ijcisim-2026-2185

Authors

Bharathi Niruti Department of CSE, Malla Reddy University, Hyderabad, India
Sujith AVLN Department of IT, Malla Reddy University, Hyderabad, India
Kanaka Durga Returi Department of CSE, Malla Reddy Technical Campus (A constituent unit of Malla Reddy Vishwavidyapeeth), Deemed to be University, Hyderabad, India

DOI:

https://doi.org/10.70917/ijcisim-2026-2185

Keywords:

Multimodal Sentiment Analysis, Self-Supervised Learning, Unimodal Representation, Modality Fusion, Contrastive Learning, Robustness, Affective Computing

Abstract

Multimodal sentiment analysis (MSA) aims to understand human emotions by analyzing signals from multiple modalities, including text, audio, and visual data. However, current models often struggle to learn modality-specific features, to handle incomplete data, and to align semantics across modalities. To overcome these shortcomings, this paper presents TSSP-MSA, a tri-stage self-supervised framework that aims to improve the quality of unimodal features, the adaptability of fusion, and the consistency across modalities. In the first stage, the unimodal encoders are pretrained with self-supervised objectives to learn robust and semantically rich feature representations. The second stage introduces an uncertainty-aware mixture-of-experts fusion strategy that dynamically balances modality contributions according to their uncertainty, improving tolerance to missing or noisy data. The final stage applies cross-modal contrastive refinement, which aligns the modality representations in a common latent space and reduces semantic discordance. Comprehensive experiments on the standard CMU-MOSI and CMU-MOSEI datasets show that TSSP-MSA achieves substantially higher accuracy and F1-score, and greater robustness to modality dropping, than state-of-the-art methods. The proposed architecture advances interpretable, robust affective computing systems, with strong potential for real-world human-computer interaction.
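
The uncertainty-aware balancing of modality contributions described for the second stage can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `uncertainty_weighted_fusion`, the use of a per-modality log-variance as the uncertainty estimate, and the softmax gating are all assumptions chosen for illustration. The sketch weights each available modality in inverse proportion to its predicted variance and skips modalities that are missing.

```python
import math


def uncertainty_weighted_fusion(features, log_variances, available):
    """Fuse modality feature vectors, down-weighting uncertain modalities.

    Hypothetical helper, not taken from the paper.
    features:      dict of modality name -> list of floats (same length)
    log_variances: dict of modality name -> predicted log-variance (assumed input)
    available:     dict of modality name -> bool (False means missing)
    """
    # Keep only the modalities that are actually present for this sample.
    names = [m for m in features if available.get(m, False)]
    if not names:
        raise ValueError("at least one modality must be available")

    # Softmax over negative log-variance: weights are proportional to 1/variance.
    scores = [-log_variances[m] for m in names]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = {m: e / total for m, e in zip(names, exps)}

    # Weighted sum of the feature vectors, one dimension at a time.
    dim = len(features[names[0]])
    fused = [sum(weights[m] * features[m][i] for m in names) for i in range(dim)]
    return fused, weights


# Example: the visual stream is missing and audio is three times noisier than text.
fused, weights = uncertainty_weighted_fusion(
    features={"text": [1.0, 0.0], "audio": [0.0, 1.0], "visual": [1.0, 1.0]},
    log_variances={"text": 0.0, "audio": math.log(3.0), "visual": 0.0},
    available={"text": True, "audio": True, "visual": False},
)
```

In this example the text modality receives three times the weight of the audio modality, and the missing visual modality contributes nothing. A learned gating network would normally produce the uncertainty estimates; here they are supplied by hand.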
