
Context-aware chatbot using MLLMs for Cultural Heritage / Rachabatuni, Pavan Kartheek; Principi, Filippo; Mazzanti, Paolo; Bertini, Marco. - ELECTRONIC. - (2024), pp. 459-463. (ACM Multimedia Systems) [10.1145/3625468.3652193].

Context-aware chatbot using MLLMs for Cultural Heritage

Principi, Filippo; Mazzanti, Paolo; Bertini, Marco
2024

Abstract

Multi-modal Large Language Models (MLLMs) are currently a very active research topic in the multimedia and computer vision communities, and show a significant impact on visual analysis and text generation tasks. MLLMs are well-versed in integrated understanding and analysis of complex cross-modal data (e.g., text and images), and in text generation with chat abilities. Almost all MLLMs focus on aligning image features with textual features for downstream text generation tasks, including detailed image description, visual question answering, story and poem generation, phrase grounding, etc. However, in visual question answering, questions that are highly relevant to the context of an image may not be answered correctly by existing MLLMs, in contrast to questions related to its visual aspects. Moreover, generating metadata (context) for an image with present-day MLLMs is a hard task due to the hallucinating behavior of the underlying Large Language Models (LLMs), and adequate contextual information cannot be derived directly from the image alone. In the cultural heritage domain, these issues hamper the introduction of multimedia chatbots as tools to support learning about and understanding artworks, since contextual information is typically needed to better understand the content of the artworks themselves, and museum curators require that scientifically accurate information be provided to the users of such systems. In this paper we present a system that uses contextual descriptions of the artworks to enhance the contextual visual question answering task.
2024
Proc. of ACM Multimedia Systems Conference (MMSys)
ACM Multimedia Systems
Rachabatuni, Pavan Kartheek; Principi, Filippo; Mazzanti, Paolo; Bertini, Marco
Files in this product:

File: 3625468.3652193.pdf
Access: open access
Type: Publisher's PDF (Version of record)
License: Open Access
Size: 3.78 MB
Format: Adobe PDF
Documents in FLORE are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this resource: https://hdl.handle.net/2158/1452341
Citations
  • PMC: ND
  • Scopus: 21
  • Web of Science: 15