Navigating social contexts: A transformer approach to relationship recognition / Berlincioni, Lorenzo; Cultrera, Luca; Bertini, Marco; Del Bimbo, Alberto. - In: COMPUTER VISION AND IMAGE UNDERSTANDING. - ISSN 1077-3142. - ELECTRONIC. - 254:(2025), pp. 0-0. [10.1016/j.cviu.2025.104327]
Navigating social contexts: A transformer approach to relationship recognition
Berlincioni, Lorenzo; Cultrera, Luca; Bertini, Marco; Del Bimbo, Alberto
2025
Abstract
Recognizing interpersonal relationships is essential for enabling human–computer systems to understand and engage effectively with social contexts. Compared to other computer vision tasks, interpersonal relation recognition requires a higher-level semantic understanding of the scene, ranging from large background context to finer cues. We propose a transformer-based model that attends to each person-pair relation in an image, reaching state-of-the-art performance on a classical benchmark dataset, People in Social Context (PISC). Our solution differs from others in that it makes no use of a separate graph neural network (GNN) but relies instead on transformers alone. Additionally, we explore the impact of incorporating additional supervision from occupation labels on relationship recognition performance, and we extensively ablate different architectural parameters and loss choices. Furthermore, we compare our model with a recent Large Multimodal Model (LMM) to precisely assess the zero-shot capabilities of such general models on highly specific tasks. Our study contributes to advancing the state of the art in social relationship recognition and highlights the potential of transformer-based models in capturing complex social dynamics from visual data.

File | Size | Format
---|---|---
1-s2.0-S1077314225000505-main-2.pdf (open access; type: publisher's PDF, Version of record; license: read-only) | 3.06 MB | Adobe PDF
Documents in FLORE are protected by copyright and all rights are reserved, unless otherwise indicated.