Controlled Emoji Effects in Fine-Grained Tweet Valence: A Multi-Metric Benchmark of Transformer Checkpoints
DOI:
https://doi.org/10.7091710.70917/ijcisim-2026-1966
Keywords:
Sentiment analysis, Transformers, Multi-class classification, Emojis, Benchmarking, ROC-AUC
Abstract
Transformer sentiment models are routinely benchmarked on social media text; however, performance claims regarding emoji robustness are frequently confounded by cross-dataset shifts in domain, label space, and annotation policy. This paper introduces a controlled evaluation protocol that isolates emoji presence as a variable within a fixed dataset and label space. We benchmark nine transformer checkpoints on SemEval-2018 Valence-oc (English), a fine-grained seven-class ordinal task, using a shared fine-tuning recipe and a multi-metric suite (Accuracy, Macro-F1, micro-ROC-AUC). DEBERTA-V2-XLARGE-MNLI achieves the strongest performance on this task (Acc 0.492, Macro-F1 0.385). By partitioning the fixed test set into emoji-containing (n = 251) and non-emoji (n = 686) subsets at evaluation time, we demonstrate that emoji effects are heterogeneous and model-dependent rather than uniformly beneficial. Furthermore, we show that reporting accuracy alone masks critical minority-class failures in ordinal sentiment, necessitating the use of Macro-F1 for valid model selection. The proposed protocol serves as a reusable methodology for measuring variable-specific effects without confounding factors.
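The evaluation-time partitioning described above can be illustrated with a minimal sketch. It is not the authors' implementation: the emoji detector `has_emoji` is a rough, hypothetical heuristic based on a few Unicode blocks, the function names are illustrative, and micro-ROC-AUC is omitted because it requires per-class scores rather than hard predictions. The sketch assumes the seven ordinal classes are encoded as integers and that Macro-F1 is averaged over all listed classes, counting a class with no true or predicted instances in a subset as F1 = 0.

```python
# Assumption: a rough emoji detector covering a few common Unicode blocks.
# The paper does not specify its detection rule; this is for illustration only.
EMOJI_RANGES = [
    (0x1F300, 0x1FAFF),  # pictographs, emoticons, transport, supplemental symbols
    (0x2600, 0x27BF),    # miscellaneous symbols and dingbats
    (0x1F1E6, 0x1F1FF),  # regional indicator symbols (flags)
]


def has_emoji(text):
    """Return True if any character falls in one of the assumed emoji ranges."""
    return any(lo <= ord(ch) <= hi for ch in text for lo, hi in EMOJI_RANGES)


def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label equals the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 over `labels`.

    A class with no true and no predicted instances contributes 0.0,
    so rare classes missing from a subset pull the score down.
    """
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def evaluate_by_emoji(texts, y_true, y_pred, labels):
    """Split a fixed test set by emoji presence and score each subset."""
    groups = {"emoji": [], "no_emoji": []}
    for i, text in enumerate(texts):
        groups["emoji" if has_emoji(text) else "no_emoji"].append(i)

    report = {}
    for name, idx in groups.items():
        if not idx:
            continue  # skip an empty subset instead of dividing by zero
        t = [y_true[i] for i in idx]
        p = [y_pred[i] for i in idx]
        report[name] = {
            "n": len(idx),
            "accuracy": accuracy(t, p),
            "macro_f1": macro_f1(t, p, labels),
        }
    return report


if __name__ == "__main__":
    # Made-up toy data, not drawn from SemEval-2018.
    texts = ["great day \U0001F600", "meh", "awful \U0001F621", "fine"]
    gold = [3, 0, -3, 0]
    pred = [3, 0, 0, 0]
    print(evaluate_by_emoji(texts, gold, pred, labels=list(range(-3, 4))))
```

Because the split is applied only at evaluation time, every checkpoint is scored on the same two subsets, so differences between the subsets cannot be attributed to a change in dataset or label space. Reporting accuracy and Macro-F1 side by side per subset also makes it visible when a model scores well on frequent classes while missing rare ones.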