Visual Large Language Models for Graphics Understanding: A Case Study on Floorplan Images / Nardoni, Valeria; Ali, Kimiya Noor; Ziran, Zahra; Marinai, Simone. - Electronic. - (2025), pp. 1-4. (Paper presented at the ACM Symposium on Document Engineering) [10.1145/3704268.3748681].

Visual Large Language Models for Graphics Understanding: A Case Study on Floorplan Images

Nardoni, Valeria; Ali, Kimiya Noor; Ziran, Zahra; Marinai, Simone
2025

Abstract

This study explores the use of Vision Large Language Models (VLLMs) for identifying items in complex graphical documents. In particular, we focus on detecting furniture objects (e.g., beds, tables, and chairs) and structural items (doors and windows) in floorplan images. We evaluate one object detection model (YOLO) and state-of-the-art VLLMs on two datasets featuring diverse floorplan layouts and symbols. The experiments with VLLMs are performed in a zero-shot setting, where the models are tested without any training or fine-tuning, and in a few-shot setting, where examples of the items to be found in the image are given to the models in the prompt. The results highlight the strengths and limitations of VLLMs in recognizing architectural elements, providing guidance for future research on the use of multimodal vision-language models for graphics recognition.
DocEng '25: Proceedings of the 2025 ACM Symposium on Document Engineering
Files in this item:
File: 3704268.3748681.pdf
Access: open access
Type: Publisher's PDF (Version of record)
License: Creative Commons
Size: 924.5 kB
Format: Adobe PDF

Use this identifier to cite or link to this resource: https://hdl.handle.net/2158/1433620