ECO: Ensembling Context Optimization for Vision-Language Models / Agnolucci, Lorenzo; Baldrati, Alberto; Todino, Francesco; Becattini, Federico; Bertini, Marco; Del Bimbo, Alberto. - Electronic. - (2023), pp. 2803-2807. (IEEE/CVF International Conference on Computer Vision (ICCV) Workshops) [10.1109/iccvw60793.2023.00299].

ECO: Ensembling Context Optimization for Vision-Language Models

Agnolucci, Lorenzo; Baldrati, Alberto; Becattini, Federico; Bertini, Marco; Del Bimbo, Alberto
2023

Abstract

Image recognition has recently witnessed a paradigm shift, where vision-language models are now used to perform few-shot classification based on textual prompts. Among these, the CLIP model has shown remarkable capabilities for zero-shot transfer by matching an image and a custom textual prompt in its latent space. This has paved the way for several works that focus on engineering or learning textual contexts to maximize CLIP's classification capabilities. In this paper, we follow this trend by learning an ensemble of prompts for image classification. We show that learning diverse and possibly shorter contexts considerably and consistently improves the results compared to relying on a single trainable prompt. In particular, we report better few-shot capabilities with no additional cost at inference time. We demonstrate the capabilities of our approach on 11 different benchmarks.
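The ensemble described in the abstract incurs no extra inference cost because, in CLIP-style prompt ensembling, the text embeddings of all prompts for a class are averaged into a single classifier weight before any image is seen. The following is a minimal, hypothetical sketch of that inference scheme using OpenAI's CLIP package; the handcrafted templates, label set, and image path are illustrative assumptions only, since ECO learns its prompt contexts rather than handcrafting them.

    # Hypothetical sketch: zero-shot classification with an ensemble of prompts.
    # ECO learns its contexts end-to-end; the handcrafted templates below are
    # stand-ins used purely for illustration.
    import torch
    import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    classes = ["cat", "dog", "car"]  # example label set (assumption)
    templates = [                    # stand-ins for a learned prompt ensemble
        "a photo of a {}.",
        "a blurry photo of a {}.",
        "a close-up photo of a {}.",
    ]

    with torch.no_grad():
        # Average the normalized embeddings of each class's prompts into a
        # single classifier weight, so the ensemble adds no inference cost.
        weights = []
        for c in classes:
            tokens = clip.tokenize([t.format(c) for t in templates]).to(device)
            emb = model.encode_text(tokens)
            emb = emb / emb.norm(dim=-1, keepdim=True)  # unit-normalize prompts
            mean = emb.mean(dim=0)
            weights.append(mean / mean.norm())          # re-normalize the mean
        text_weights = torch.stack(weights)             # (num_classes, dim)

        # Classify one image by cosine similarity against the ensembled weights.
        image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
        img_emb = model.encode_image(image)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        probs = (100.0 * img_emb @ text_weights.T).softmax(dim=-1)

    print(dict(zip(classes, probs.squeeze(0).tolist())))

Because the ensemble is collapsed into one embedding per class before classification, inference costs a single dot product per class, exactly as with a single prompt.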
Proc. of IEEE/CVF International Conference on Computer Vision (ICCV) Workshops
Files in this item:
File: Agnolucci_ECO_Ensembling_Context_Optimization_for_Vision-Language_Models_ICCVW_2023_paper.pdf
Access: open access
Type: Publisher's PDF (Version of record)
License: Open Access
Size: 1.38 MB
Format: Adobe PDF
Documents in FLORE are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this resource: https://hdl.handle.net/2158/1452882
Citations
  • PMC: ND
  • Scopus: 10
  • Web of Science (ISI): 7