Conditioned and composed image retrieval combining and partially fine-tuning CLIP-based features / Baldrati, Alberto; Bertini, Marco; Uricchio, Tiberio; Del Bimbo, Alberto. - Electronic. - (2022), pp. 4955-4964. (IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops) [10.1109/cvprw56347.2022.00543].

Conditioned and composed image retrieval combining and partially fine-tuning CLIP-based features

Baldrati, Alberto; Bertini, Marco; Uricchio, Tiberio; Del Bimbo, Alberto
2022

Abstract

We present an approach to conditioned and composed image retrieval based on CLIP features. In this extension of content-based image retrieval (CBIR), the query image is paired with a text that conveys the user's intent, a setting relevant to application domains such as e-commerce. The method has two training stages. In the first, a simple combination of visual and textual features is used to fine-tune the CLIP text encoder. In the second, we learn a more complex combiner network that merges visual and textual features. Contrastive learning is used in both stages. The approach obtains state-of-the-art performance for conditioned CBIR on the FashionIQ dataset and for composed CBIR on the more recent CIRR dataset.
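The first training stage can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes that the "simple combination" is an element-wise sum followed by L2 normalization, and that the contrastive objective is a batch-based InfoNCE loss with in-batch negatives. The function names (`combine_sum`, `contrastive_loss`), the temperature value, and the three-dimensional feature vectors are all hypothetical and stand in for real CLIP features.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def combine_sum(image_feat, text_feat):
    """Assumed 'simple combination': element-wise sum, then L2 normalization."""
    return l2_normalize([i + t for i, t in zip(image_feat, text_feat)])


def contrastive_loss(queries, targets, temperature=0.1):
    """Batch-based InfoNCE (an assumption, not confirmed by the abstract).

    Query k should match target k; the other targets in the batch
    act as negatives. Similarity is the cosine similarity.
    """
    targets = [l2_normalize(t) for t in targets]
    total = 0.0
    for k, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, t)) / temperature for t in targets]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[k]
    return total / len(queries)


# Toy stand-ins for CLIP features of reference images, modifying texts, and targets.
ref_images = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
texts = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
targets_matched = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
targets_swapped = [targets_matched[1], targets_matched[0]]

queries = [combine_sum(i, t) for i, t in zip(ref_images, texts)]
loss_matched = contrastive_loss(queries, targets_matched)
loss_swapped = contrastive_loss(queries, targets_swapped)
```

With correctly paired targets the loss is small, and it grows when the targets are swapped, which is the signal that drives fine-tuning. In the second stage described in the abstract, a learned combiner network would take the place of `combine_sum`, while a contrastive objective is still used.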
Proc. of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops
Files in this item:
File: Conditioned_and_composed_image_retrieval_combining_and_partially_fine-tuning_CLIP-based_features.pdf
Access: open access
Type: Publisher's PDF (Version of record)
License: Open Access
Size: 1.99 MB
Format: Adobe PDF

Documents in FLORE are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this resource: https://hdl.handle.net/2158/1452875
Citations
  • PMC: N/A
  • Scopus: 105
  • ISI: 74