Multimodal Emotion Prediction in Interpersonal Videos Integrating Facial and Speech Cues / Guerdelli, Hajer; Ferrari, Claudio; Berretti, Stefano; Del Bimbo, Alberto. - Electronic. - (2025), pp. 5681-5690. (IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, USA, 11 June 2025) [10.1109/cvprw67362.2025.00566].
Multimodal Emotion Prediction in Interpersonal Videos Integrating Facial and Speech Cues
Guerdelli, Hajer; Ferrari, Claudio; Berretti, Stefano; Del Bimbo, Alberto
2025
Abstract
Emotion prediction is essential for affective computing applications, including human-computer interaction and social behavior analysis. In interpersonal settings, accurately predicting emotional states is crucial for modeling social dynamics. We propose a multimodal framework that integrates facial expressions and speech cues to enhance emotion prediction in interpersonal video interactions. Facial features are extracted via a deep attention-based network, while speech is encoded using Wav2Vec 2.0. The resulting multimodal features are modeled temporally using an LSTM network. To adapt the IMEmo dataset for multimodal learning, we introduce a novel speech-feature alignment strategy that ensures synchronization between facial and vocal expressions. Our approach investigates the impact of multimodal fusion on emotion prediction, demonstrating its effectiveness in capturing complex emotional dynamics. Experiments show that our framework improves sentiment classification accuracy by over 17% compared to facial-only baselines. While fine-grained emotion recognition remains challenging, our results highlight the enhanced robustness and generalizability of our method in real-world interpersonal scenarios.
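
The pipeline outlined in the abstract (attention-based facial features, Wav2Vec 2.0 speech encoding, temporal alignment of the two streams, and an LSTM over the fused sequence) can be illustrated with a minimal sketch. The snippet below is an assumption-laden illustration and not the authors' implementation: the 512-d facial embedding, the interpolation-based alignment, the frozen facebook/wav2vec2-base-960h checkpoint, and the three-class sentiment head are placeholder choices standing in for the components the paper describes.

# Hypothetical sketch of a late-fusion pipeline: per-frame facial embeddings are
# concatenated with time-aligned Wav2Vec 2.0 speech features, then modeled with
# an LSTM and classified. Dimensions and names are illustrative, not the paper's.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class MultimodalEmotionClassifier(nn.Module):
    def __init__(self, face_dim=512, num_classes=3, hidden_dim=256):
        super().__init__()
        # Pretrained speech encoder (Wav2Vec 2.0 base); it emits 768-d features per audio frame.
        self.speech_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
        self.speech_encoder.requires_grad_(False)  # kept frozen in this sketch
        # Temporal model over the concatenated face + speech sequence.
        self.lstm = nn.LSTM(face_dim + 768, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, face_feats, waveform):
        # face_feats: (B, T, face_dim) embeddings from a facial expression network.
        # waveform:   (B, num_samples) raw audio sampled at 16 kHz.
        speech_feats = self.speech_encoder(waveform).last_hidden_state   # (B, T_speech, 768)
        # Naive alignment: interpolate speech features to the video frame count T.
        T = face_feats.size(1)
        speech_aligned = nn.functional.interpolate(
            speech_feats.transpose(1, 2), size=T, mode="linear", align_corners=False
        ).transpose(1, 2)                                                # (B, T, 768)
        fused = torch.cat([face_feats, speech_aligned], dim=-1)          # (B, T, face_dim + 768)
        _, (h_n, _) = self.lstm(fused)
        return self.classifier(h_n[-1])                                  # (B, num_classes)

# Usage: 16 video frames of 512-d face features plus 1 s of 16 kHz audio.
model = MultimodalEmotionClassifier()
logits = model(torch.randn(2, 16, 512), torch.randn(2, 16000))
print(logits.shape)  # torch.Size([2, 3])

The interpolation step is only a stand-in for the paper's speech-feature alignment strategy; it shows where synchronization between vocal and facial streams has to happen before fusion, which is the point the abstract emphasizes for adapting IMEmo to multimodal learning.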



