Multimodal Emotion Prediction in Interpersonal Videos Integrating Facial and Speech Cues / Guerdelli, Hajer; Ferrari, Claudio; Berretti, Stefano; Del Bimbo, Alberto. - ELECTRONIC. - (2025), pp. 5681-5690. (IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, USA, 11 June 2025) [10.1109/cvprw67362.2025.00566].

Multimodal Emotion Prediction in Interpersonal Videos Integrating Facial and Speech Cues

Guerdelli, Hajer; Ferrari, Claudio; Berretti, Stefano; Del Bimbo, Alberto
2025

Abstract

Emotion prediction is essential for affective computing applications, including human-computer interaction and social behavior analysis. In interpersonal settings, accurately predicting emotional states is crucial for modeling social dynamics. We propose a multimodal framework that integrates facial expressions and speech cues to enhance emotion prediction in interpersonal video interactions. Facial features are extracted via a deep attention-based network, while speech is encoded using Wav2Vec 2.0. The resulting multimodal features are modeled temporally using an LSTM network. To adapt the IMEmo dataset for multimodal learning, we introduce a novel speech-feature alignment strategy that ensures synchronization between facial and vocal expressions. We investigate the impact of multimodal fusion on emotion prediction, demonstrating its effectiveness in capturing complex emotional dynamics. Experiments show that our framework improves sentiment classification accuracy by over 17% compared to facial-only baselines. While fine-grained emotion recognition remains challenging, our results highlight the enhanced robustness and generalizability of our method in real-world interpersonal scenarios.
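
The sketch below illustrates, in broad strokes, the kind of pipeline the abstract describes: precomputed per-frame facial features and Wav2Vec 2.0 speech features are projected to a common size, aligned in time, concatenated, and modeled with an LSTM. It is a minimal illustration, not the paper's implementation: the checkpoint name, feature dimensions, three-way sentiment output, linear-interpolation alignment, and concatenation-based fusion are all assumptions made for this example.

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import Wav2Vec2Model


class MultimodalEmotionModel(nn.Module):
    # Illustrative module; names and dimensions are assumptions, not the paper's.
    def __init__(self, face_dim=512, speech_dim=768, hidden_dim=256, num_classes=3):
        super().__init__()
        # Pretrained Wav2Vec 2.0 speech encoder; frozen here to keep the sketch small.
        self.speech_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
        for p in self.speech_encoder.parameters():
            p.requires_grad = False
        # Project per-frame facial features (assumed precomputed by an attention-based
        # face network) and speech features to a shared size before fusion.
        self.face_proj = nn.Linear(face_dim, hidden_dim)
        self.speech_proj = nn.Linear(speech_dim, hidden_dim)
        # Temporal modeling of the fused feature sequence with an LSTM.
        self.lstm = nn.LSTM(2 * hidden_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, face_feats, waveform):
        # face_feats: (B, T, face_dim) per-video-frame facial features
        # waveform:   (B, num_samples) raw 16 kHz mono audio
        speech_feats = self.speech_encoder(waveform).last_hidden_state  # (B, T_audio, speech_dim)
        # Crude temporal alignment: resample speech features to the video frame count.
        speech_feats = F.interpolate(
            speech_feats.transpose(1, 2), size=face_feats.size(1),
            mode="linear", align_corners=False,
        ).transpose(1, 2)                                               # (B, T, speech_dim)
        # Concatenation-based fusion of the two modalities, frame by frame.
        fused = torch.cat([self.face_proj(face_feats),
                           self.speech_proj(speech_feats)], dim=-1)     # (B, T, 2*hidden_dim)
        _, (h_n, _) = self.lstm(fused)
        return self.classifier(h_n[-1])                                 # (B, num_classes)

The resulting logits would typically be trained with a cross-entropy loss against per-clip sentiment labels; the paper's actual alignment strategy and fusion scheme may differ from this simplified version.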
2025
IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops
USA
11 June 2025
Goal 9: Industry, Innovation, and Infrastructure
Guerdelli, Hajer; Ferrari, Claudio; Berretti, Stefano; Del Bimbo, Alberto
Files for this item:
There are no files associated with this item.

Documents in FLORE are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this item: https://hdl.handle.net/2158/1436387
Citations
  • PMC: ND
  • Scopus: ND
  • Web of Science: ND