Navigating social contexts: A transformer approach to relationship recognition / Berlincioni, Lorenzo; Cultrera, Luca; Bertini, Marco; Del Bimbo, Alberto. - In: COMPUTER VISION AND IMAGE UNDERSTANDING. - ISSN 1077-3142. - ELECTRONIC. - 254:(2025), pp. 0-0. [10.1016/j.cviu.2025.104327]

Navigating social contexts: A transformer approach to relationship recognition

Berlincioni, Lorenzo; Cultrera, Luca; Bertini, Marco; Del Bimbo, Alberto
2025

Abstract

Recognizing interpersonal relationships is essential for enabling human–computer systems to understand and engage effectively with social contexts. Compared to other computer vision tasks, interpersonal relation recognition requires a higher-level semantic understanding of the scene, ranging from large background context to finer clues. We propose a transformer-based model that attends to each person-pair relation in an image, reaching state-of-the-art performance on a classical benchmark dataset, People in Social Context (PISC). Our solution differs from others in that it makes no use of a separate GNN and relies instead on transformers alone. Additionally, we explore the impact of incorporating additional supervision from occupation labels on relationship recognition performance, and we extensively ablate different architectural parameters and loss choices. Furthermore, we compare our model with a recent Large Multimodal Model (LMM) to assess the zero-shot capabilities of such general models on highly specific tasks. Our study contributes to advancing the state of the art in social relationship recognition and highlights the potential of transformer-based models in capturing complex social dynamics from visual data.
Files in this item:
File: 1-s2.0-S1077314225000505-main-2.pdf
Access: open access
Type: Publisher's PDF (Version of record)
License: Read only
Size: 3.06 MB
Format: Adobe PDF

Documents in FLORE are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this resource: https://hdl.handle.net/2158/1415435