Text-Oriented Image Query Representation for Zero-Shot Composed Image Retrieval

Rachabathuni, Pavan Kartheek; Ciamarra, Andrea; Caldelli, Roberto; Bertini, Marco

doi:10.1109/CBMI66578.2025.11339329

Zero-Shot Composed Image Retrieval (ZS-CIR) is the task of retrieving a target image, in a zero-shot setting, from a query that combines a reference image with a textual description of the desired modifications. Existing ZS-CIR models typically fuse the visual and textual modalities into a single query representation, but often struggle to capture the fine-grained distinctions essential for accurate retrieval. In this paper, we present TEOZCIR, a transformer-based model that introduces a balanced semantic fusion module and an enhancement mechanism to integrate multimodal information more effectively. The model is built around two core components: the Text-Aware Query Combiner (TAQC) and the Query Enhancer Network (QENet). These components operate in tandem: TAQC dynamically adjusts the semantic contribution of the visual context based on the input text, generating a balanced query representation. QENet then refines this representation, enhancing the fused features to better align with the target image. The model remains lightweight throughout, with significantly fewer trainable parameters than conventional training-based methods. Experiments carried out on three benchmark datasets (CIRR, Fashion IQ, and CIRCO) demonstrate that TEOZCIR significantly improves ZS-CIR performance, setting a new benchmark for multimodal retrieval.
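
The two-stage flow described in the abstract (text-conditioned fusion, then refinement, then retrieval) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how TAQC or QENet are built, so the sigmoid gate, the residual refinement, and every name below (gated_fusion, refine, rank_gallery, gate_w, gate_b, weights, alpha) are hypothetical stand-ins chosen for illustration, operating on plain Python lists rather than real image and text embeddings.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    # Scale a vector to unit length; leave a zero vector unchanged.
    n = math.sqrt(dot(v, v))
    return [x / n for x in v] if n > 0 else list(v)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_fusion(img, txt, gate_w, gate_b):
    # ASSUMPTION: a per-dimension gate computed from the text features only
    # decides how much of the image feature is kept. This stands in for the
    # text-aware combiner; the real TAQC is not described in the abstract.
    gate = [sigmoid(w * t + gate_b) for w, t in zip(gate_w, txt)]
    return [g * i + (1.0 - g) * t for g, i, t in zip(gate, img, txt)]

def refine(query, weights, alpha=0.5):
    # ASSUMPTION: a single residual layer with tanh stands in for the
    # enhancement stage; the real QENet is not described in the abstract.
    hidden = [math.tanh(dot(row, query)) for row in weights]
    return normalize([q + alpha * h for q, h in zip(query, hidden)])

def rank_gallery(query, gallery):
    # Rank candidate images by cosine similarity to the refined query.
    q = normalize(query)
    scores = {name: dot(q, normalize(vec)) for name, vec in gallery.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy 2-D features (hypothetical values, not real embeddings).
image_feat = [1.0, 0.0]
text_feat = [0.0, 1.0]
fused = gated_fusion(image_feat, text_feat, gate_w=[0.0, 0.0], gate_b=0.0)
query = refine(fused, weights=[[0.0, 0.0], [0.0, 0.0]])
ranking = rank_gallery(query, {"a": [1.0, 1.0], "b": [1.0, 0.0], "c": [-1.0, -1.0]})
```

With the gate parameters set to zero, the gate is 0.5 in every dimension, so the fused query is the plain average of the two features; non-zero gate parameters would let the text shift that balance per dimension.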

Text-Oriented Image Query Representation for Zero-Shot Composed Image Retrieval / Pavan Kartheek Rachabathuni, Andrea Ciamarra, Roberto Caldelli, Marco Bertini. - ELETTRONICO. - (2025), pp. 1-7. (2025 International Conference on Content-Based Multimedia Indexing (CBMI)) [10.1109/CBMI66578.2025.11339329].