EUFCC-340K: A faceted hierarchical dataset for metadata annotation in GLAM collections / Net, Francesc; Folia, Marc; Casals, Pep; Bagdanov, Andrew D.; Gómez, Lluis. - In: MULTIMEDIA TOOLS AND APPLICATIONS. - ISSN 1573-7721. - PRINT. - (2025), pp. 1-24. [10.1007/s11042-024-20561-9]

EUFCC-340K: A faceted hierarchical dataset for metadata annotation in GLAM collections

Abstract

In this paper, we address the challenges of automatic metadata annotation in the domain of Galleries, Libraries, Archives, and Museums (GLAMs) by introducing a novel dataset, EUFCC-340K, collected from the Europeana portal. Comprising over 340,000 images, the EUFCC-340K dataset is organized across multiple facets – Materials, Object Types, Disciplines, and Subjects – following a hierarchical structure based on the Art & Architecture Thesaurus (AAT). We developed several baseline models, incorporating multiple heads on a ConvNeXT backbone for multi-label image tagging on these facets, and fine-tuning a CLIP model with our image-text pairs. Our experiments to evaluate model robustness and generalization capabilities in two different test scenarios demonstrate the dataset’s utility in improving multi-label classification tools that have the potential to alleviate cataloging tasks in the cultural heritage sector. The EUFCC-340K dataset is publicly available at https://github.com/cesc47/EUFCC-340K.

Documents in FLORE are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this resource: https://hdl.handle.net/2158/1424075