
Zero-Shot Composed Image Retrieval with Textual Inversion / Baldrati, Alberto; Agnolucci, Lorenzo; Bertini, Marco; Del Bimbo, Alberto. - Electronic. - (2023), pp. 15292-15301. (IEEE International Conference on Computer Vision (ICCV)) [10.1109/iccv51070.2023.01407].

Zero-Shot Composed Image Retrieval with Textual Inversion

Baldrati, Alberto; Agnolucci, Lorenzo; Bertini, Marco; Del Bimbo, Alberto
2023

Abstract

Composed Image Retrieval (CIR) aims to retrieve a target image based on a query composed of a reference image and a relative caption that describes the difference between the two images. The high effort and cost required for labeling datasets for CIR hamper the widespread usage of existing methods, as they rely on supervised learning. In this work, we propose a new task, Zero-Shot CIR (ZS-CIR), that aims to address CIR without requiring a labeled training dataset. Our approach, named zero-Shot composEd imAge Retrieval with textuaL invErsion (SEARLE), maps the visual features of the reference image into a pseudo-word token in CLIP token embedding space and integrates it with the relative caption. To support research on ZS-CIR, we introduce an open-domain benchmarking dataset named Composed Image Retrieval on Common Objects in context (CIRCO), which is the first dataset for CIR containing multiple ground truths for each query. The experiments show that SEARLE exhibits better performance than the baselines on the two main datasets for CIR tasks, FashionIQ and CIRR, and on the proposed CIRCO. The dataset, the code and the model are publicly available at https://github.com/miccunifi/SEARLE.
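The textual-inversion idea in the abstract can be sketched as follows: a learned mapping network takes the reference image's CLIP feature and produces a pseudo-word token embedding, which is then spliced into the relative caption's token sequence to form the composed query. This is a minimal illustrative sketch only; the network `phi`, the embedding dimensions, and the tensor shapes are assumptions for demonstration, not the paper's exact architecture or values.

```python
import torch
import torch.nn as nn

# Illustrative dimensions (assumptions, not the paper's actual values).
IMG_DIM = 640   # hypothetical CLIP image-feature dimension
TOK_DIM = 512   # hypothetical CLIP token-embedding dimension

# phi: maps the reference image's visual feature into a pseudo-word token
# living in the CLIP token-embedding space (the textual-inversion step).
phi = nn.Sequential(
    nn.Linear(IMG_DIM, TOK_DIM),
    nn.GELU(),
    nn.Linear(TOK_DIM, TOK_DIM),
)

def compose_query(image_feature: torch.Tensor,
                  caption_token_embs: torch.Tensor) -> torch.Tensor:
    """Build the composed query sequence: the pseudo-word token S*
    followed by the relative caption's token embeddings, conceptually
    like "a photo of S* that <relative caption>"."""
    pseudo_token = phi(image_feature)               # (B, TOK_DIM)
    return torch.cat([pseudo_token.unsqueeze(1),    # (B, 1, TOK_DIM)
                      caption_token_embs], dim=1)   # (B, 1 + L, TOK_DIM)

# Toy usage with random tensors standing in for real CLIP features.
img_feat = torch.randn(2, IMG_DIM)       # batch of 2 reference images
cap_embs = torch.randn(2, 7, TOK_DIM)    # 7 caption tokens each
query = compose_query(img_feat, cap_embs)
print(query.shape)  # torch.Size([2, 8, 512])
```

In the actual method the composed sequence would be encoded by the CLIP text encoder and matched against candidate image features; the sketch stops at query construction.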
Proc. of IEEE International Conference on Computer Vision (ICCV)
Files in this item:

File: Baldrati_Zero-Shot_Composed_Image_Retrieval_with_Textual_Inversion_ICCV_2023_paper.pdf
Access: open access
Type: Publisher's PDF (Version of record)
License: Open Access
Size: 4.63 MB
Format: Adobe PDF

Documents in FLORE are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this item: https://hdl.handle.net/2158/1452877
Citations
  • PMC: not available
  • Scopus: 103
  • Web of Science: 54