
Spatio-temporal transformers for action unit classification with event cameras / Cultrera, Luca; Becattini, Federico; Berlincioni, Lorenzo; Ferrari, Claudio; Bimbo, Alberto Del. - In: COMPUTER VISION AND IMAGE UNDERSTANDING. - ISSN 1077-3142. - Print. - 263:(2026), pp. 104578.1-104578.12. [10.1016/j.cviu.2025.104578]

Spatio-temporal transformers for action unit classification with event cameras

Cultrera, Luca; Becattini, Federico; Berlincioni, Lorenzo; Ferrari, Claudio; Bimbo, Alberto Del
2026

Abstract

Facial analysis plays a vital role in assistive technologies aimed at improving human–computer interaction, emotional well-being, and non-verbal communication monitoring. For fine-grained tasks, however, the latency of standard sensors may prevent them from capturing the micro-movements that carry a highly informative signal and are necessary for inferring a subject's true emotions. Event cameras have been gaining increasing interest as a possible solution to this and similar high-frame-rate tasks. In this paper we propose a novel spatio-temporal Vision Transformer model that uses Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA) to improve the accuracy of Action Unit classification from event streams. We also address the lack of labeled event data in the literature, a major cause of the gap in maturity between RGB and neuromorphic vision models. Gathering data is harder in the event domain, since it cannot be crawled from the web, and labeling frames must account for event aggregation rates and for the fact that static parts of the scene may not be visible in certain frames. To this end, we present FACEMORPHIC, a temporally synchronized multimodal face dataset composed of both RGB videos and event streams. The dataset is annotated at the video level with facial Action Units and also contains streams collected with a variety of possible applications in mind, ranging from 3D shape estimation to lip reading. We then show how temporal synchronization enables effective neuromorphic face analysis without the need to manually annotate videos: instead, we leverage cross-modal supervision, bridging the domain gap by representing face shapes in a 3D space. This makes our model suitable for real-world assistive scenarios, including privacy-preserving wearable systems and responsive social interaction monitoring. Our model outperforms baseline methods by capturing both spatial and temporal information, which is crucial for recognizing subtle facial micro-expressions.
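As a concrete illustration of the two components named in the abstract, the following is a minimal PyTorch sketch of Shifted Patch Tokenization and Locality Self-Attention in the spirit of their original formulation (Lee et al., 2021). The module names, hyperparameters (patch size 16, embedding dimension 256, 8 heads), the use of torch.roll for the diagonal shifts (the original zero-pads instead of wrapping), and the single-frame usage are illustrative assumptions, not the paper's released implementation; the temporal modeling over token sequences is omitted here.

```python
import torch
import torch.nn as nn

class ShiftedPatchTokenization(nn.Module):
    """Stack the input with four half-patch diagonal shifts before patchifying,
    so each token embeds a wider local neighborhood (illustrative sketch)."""
    def __init__(self, in_ch=1, patch=16, dim=256):
        super().__init__()
        self.patch = patch
        # 5 stacked views (original + 4 shifts) enter the patch projection
        self.proj = nn.Conv2d(in_ch * 5, dim, kernel_size=patch, stride=patch)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                        # x: (B, C, H, W) event frame
        s = self.patch // 2
        shifts = [(-s, -s), (-s, s), (s, -s), (s, s)]
        # torch.roll wraps around the border; the original zero-pads instead
        views = [x] + [torch.roll(x, sh, dims=(2, 3)) for sh in shifts]
        tokens = self.proj(torch.cat(views, dim=1))   # (B, dim, H/p, W/p)
        tokens = tokens.flatten(2).transpose(1, 2)    # (B, N, dim)
        return self.norm(tokens)

class LocalitySelfAttention(nn.Module):
    """Multi-head self-attention with a learnable temperature and diagonal
    masking, so softmax concentrates on inter-token relations."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.heads = heads
        # learnable temperature, initialized at the usual 1/sqrt(head_dim)
        self.temp = nn.Parameter(torch.tensor((dim // heads) ** -0.5))
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                        # x: (B, N, dim)
        B, N, D = x.shape
        qkv = self.qkv(x).view(B, N, 3, self.heads, D // self.heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)     # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.temp
        eye = torch.eye(N, dtype=torch.bool, device=x.device)
        attn = attn.masked_fill(eye, float("-inf"))   # mask self-similarity
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, D)
        return self.out(out)

# Example: tokenize one aggregated 128x128 event frame, run one attention layer
frame = torch.randn(1, 1, 128, 128)
tokens = ShiftedPatchTokenization()(frame)       # (1, 64, 256)
features = LocalitySelfAttention()(tokens)       # (1, 64, 256)
```

Both pieces target the small-dataset regime the abstract describes: SPT widens each token's receptive field without extra layers, and the diagonal mask with a learnable temperature sharpens attention over neighboring tokens rather than letting each token trivially attend to itself.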
Files in this record:

File: 1-s2.0-S1077314225003017-main.pdf
Access: Closed access
Type: Final refereed version (Postprint, Accepted manuscript)
License: All rights reserved
Size: 1.31 MB
Format: Adobe PDF

Documents in FLORE are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this resource: https://hdl.handle.net/2158/1452103
Citations
  • PubMed Central: n/a
  • Scopus: 0
  • Web of Science: 0