Zero-Shot Composed Image Retrieval (ZS-CIR) is the task of retrieving a target image, in a zero-shot setting, from a query that combines a reference image with a textual description of the desired modifications. Existing ZS-CIR models typically fuse the visual and textual modalities into a single query representation, but often struggle to capture the fine-grained distinctions essential for accurate retrieval. In this paper, we present TEOZCIR, a transformer-based model that introduces a balanced semantic fusion module and an enhancement mechanism to integrate multimodal information more effectively. The model is built around two core components: the Text-Aware Query Combiner (TAQC) and the Query Enhancer Network (QENet). These components operate in tandem: TAQC dynamically adjusts the semantic contributions of the visual context based on the input text, generating a balanced query representation. QENet then refines this representation, enhancing the fused features to better align with the target image. The model remains lightweight throughout, with significantly fewer trainable parameters than conventional training-based methods. Experiments on three benchmark datasets (CIRR, Fashion IQ, and CIRCO) demonstrate that TEOZCIR significantly improves ZS-CIR performance, setting a new benchmark for multimodal retrieval.
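
The two-stage flow described in the abstract (text-conditioned fusion, then refinement of the fused query) can be illustrated with a toy sketch. This is not the TEOZCIR architecture: the sigmoid gate, the residual `tanh` refinement, the function names (`text_gate`, `fuse`, `enhance`, `rank`), the feature dimension, and the random untrained weights are all assumptions made for illustration, and random vectors stand in for real image and text embeddings.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def matvec(m, v):
    return [dot(row, v) for row in m]

def text_gate(txt, gate_w):
    # Hypothetical gate: one sigmoid weight per feature dimension, computed from the text only.
    return [1.0 / (1.0 + math.exp(-z)) for z in matvec(gate_w, txt)]

def fuse(img, txt, gate_w):
    # Text-conditioned convex mix of image and text features (stand-in for a query combiner).
    g = text_gate(txt, gate_w)
    return [gi * i + (1.0 - gi) * t for gi, i, t in zip(g, img, txt)]

def enhance(query, enh_w):
    # Hypothetical residual refinement followed by L2 normalisation (stand-in for an enhancer).
    delta = [math.tanh(z) for z in matvec(enh_w, query)]
    return normalize([q + d for q, d in zip(query, delta)])

def rank(query, gallery):
    # Gallery indices sorted by descending cosine similarity (all inputs are unit vectors).
    return sorted(range(len(gallery)), key=lambda k: -dot(query, gallery[k]))

rng = random.Random(0)
dim = 8  # toy feature size, chosen arbitrarily
rand_vec = lambda: normalize([rng.gauss(0, 1) for _ in range(dim)])
rand_mat = lambda: [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(dim)]

img, txt = rand_vec(), rand_vec()        # placeholders for encoder outputs
gate_w, enh_w = rand_mat(), rand_mat()   # untrained, random weights
query = enhance(fuse(img, txt, gate_w), enh_w)
gallery = [rand_vec() for _ in range(5)]
ranking = rank(query, gallery)
```

Because the gate depends only on the text features, the same reference image contributes differently under different modification texts, which is the general idea behind text-aware fusion; the actual components in the paper may be structured quite differently.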

Text-Oriented Image Query Representation for Zero-Shot Composed Image Retrieval / Pavan Kartheek Rachabathuni, Andrea Ciamarra, Roberto Caldelli, Marco Bertini. - ELECTRONIC. - (2025), pp. 1-7. (2025 International Conference on Content-Based Multimedia Indexing (CBMI)) [10.1109/CBMI66578.2025.11339329].

Text-Oriented Image Query Representation for Zero-Shot Composed Image Retrieval

Pavan Kartheek Rachabathuni; Andrea Ciamarra; Roberto Caldelli; Marco Bertini
2025

2025 International Conference on Content-Based Multimedia Indexing (CBMI)
Files in this record:
Text-Oriented_Image_Query_Representation_for_Zero-Shot_Composed_Image_Retrieval.pdf
Access: open access
Type: Publisher's PDF (Version of record)
License: Open Access
Size: 1.61 MB
Format: Adobe PDF

Use this identifier to cite or link to this resource: https://hdl.handle.net/2158/1470992