Visual Large Language Models for Graphics Understanding: A Case Study on Floorplan Images / Nardoni, Valeria; Ali, Kimiya Noor; Ziran, Zahra; Marinai, Simone. - Electronic. - (2025), pp. 1-4. (Paper presented at the ACM Symposium on Document Engineering) [10.1145/3704268.3748681].
Visual Large Language Models for Graphics Understanding: A Case Study on Floorplan Images
Nardoni, Valeria; Ali, Kimiya Noor; Ziran, Zahra; Marinai, Simone
2025
Abstract
This study explores the use of Vision Large Language Models (VLLMs) for identifying items in complex graphical documents. In particular, we focus on detecting furniture objects (e.g., beds, tables, and chairs) and structural items (doors and windows) in floorplan images. We evaluate one object detection model (YOLO) and state-of-the-art VLLMs on two datasets featuring diverse floorplan layouts and symbols. The experiments with VLLMs are performed in a zero-shot setting, meaning the models are tested without any training or fine-tuning, as well as with a few-shot approach, where examples of the items to be found in the image are given to the models in the prompt. The results highlight the strengths and limitations of VLLMs in recognizing architectural elements, providing guidance for future research on the use of multimodal vision-language models for graphics recognition.

| File | Size | Format |
|---|---|---|
| 3704268.3748681.pdf (open access). Type: publisher's PDF (Version of record). License: Creative Commons | 924.5 kB | Adobe PDF |