Conditioned and composed image retrieval combining and partially fine-tuning CLIP-based features

Baldrati, Alberto; Bertini, Marco; Uricchio, Tiberio; Del Bimbo, Alberto

doi:10.1109/cvprw56347.2022.00543

In this paper, we present an approach for conditioned and composed image retrieval based on CLIP features. In this extension of content-based image retrieval (CBIR), an image is combined with a text that expresses the user's intent, a setting relevant to application domains such as e-commerce. The proposed method uses two training stages. In the first, a simple combination of visual and textual features is used to fine-tune the CLIP text encoder. In the second, we learn a more complex combiner network that merges visual and textual features. Contrastive learning is used in both stages. The proposed approach obtains state-of-the-art performance for conditioned CBIR on the FashionIQ dataset and for composed CBIR on the more recent CIRR dataset.
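The contrastive objective described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the "simple combination", so an element-wise sum of L2-normalized features is assumed here, and the function names, the temperature value, and the random stand-in features are all hypothetical. The loss is a standard batch-wise softmax cross-entropy in which the i-th combined query should match the i-th target image feature.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row vector to unit L2 norm."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def combine_sum(image_feats, text_feats):
    """Assumed 'simple combination': sum of normalized features, renormalized."""
    return l2_normalize(l2_normalize(image_feats) + l2_normalize(text_feats))


def batch_contrastive_loss(query_feats, target_feats, temperature=0.07):
    """Cross-entropy over cosine similarities; query i should match target i.

    The temperature is a hypothetical hyperparameter, not a value from the paper.
    """
    q = l2_normalize(query_feats)
    t = l2_normalize(target_feats)
    logits = q @ t.T / temperature
    # Subtract the row-wise maximum for numerical stability.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


# Random stand-ins for CLIP features: 4 (reference image, text, target image) triplets.
rng = np.random.default_rng(0)
reference_image = rng.normal(size=(4, 8))
relative_text = rng.normal(size=(4, 8))
target_image = rng.normal(size=(4, 8))

query = combine_sum(reference_image, relative_text)
loss = batch_contrastive_loss(query, target_image)
print(f"contrastive loss on random features: {loss:.4f}")
```

In a two-stage setup like the one the abstract describes, a loss of this kind would first drive updates to the text encoder with the fixed combination above, and then be reused to train a learned combiner network in place of `combine_sum`.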

Conditioned and composed image retrieval combining and partially fine-tuning CLIP-based features / Baldrati, A., Bertini, M., Uricchio, T., Del Bimbo, A. - Electronic. - (2022), pp. 4955-4964. (IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops) [10.1109/cvprw56347.2022.00543].