For a long time now, datasets containing scientific articles have been crucial to the analysis and recognition of document images. These document collections have frequently served as a testing ground for cutting-edge methods for optical character recognition, layout analysis, and document understanding in general. We thoroughly analyze and compare many datasets proposed for layout analysis of scientific documents, ranging from small collections of scanned papers to modern large-scale datasets containing digital-born papers, which have been proposed to train deep learning-based methods. Furthermore, we outline a detailed taxonomy of the annotation procedures used considering manual, automatic, and generative approaches, and we analyze their benefits and drawbacks. This survey is meant to provide the reader with a review of the most used benchmarks together with detailed information on data, annotations, and complexity, helping scholars to identify the most suitable dataset for their tasks of interest. We also discuss possible open problems to further enhance datasets to support research in the layout analysis of scientific articles

Datasets and annotations for layout analysis of scientific articles / Gemelli, Andrea; Marinai, Simone; Pisaneschi, Lorenzo; Santoni, Francesco. - In: INTERNATIONAL JOURNAL ON DOCUMENT ANALYSIS AND RECOGNITION. - ISSN 1433-2833. - ELETTRONICO. - 27:(2024), pp. -.683--.705. [10.1007/s10032-024-00461-2]

Datasets and annotations for layout analysis of scientific articles

Gemelli, Andrea;Marinai, Simone
;
Pisaneschi, Lorenzo;Santoni, Francesco
2024

Abstract

For a long time now, datasets containing scientific articles have been crucial to the analysis and recognition of document images. These document collections have frequently served as a testing ground for cutting-edge methods for optical character recognition, layout analysis, and document understanding in general. We thoroughly analyze and compare many datasets proposed for layout analysis of scientific documents, ranging from small collections of scanned papers to modern large-scale datasets containing digital-born papers, which have been proposed to train deep learning-based methods. Furthermore, we outline a detailed taxonomy of the annotation procedures used considering manual, automatic, and generative approaches, and we analyze their benefits and drawbacks. This survey is meant to provide the reader with a review of the most used benchmarks together with detailed information on data, annotations, and complexity, helping scholars to identify the most suitable dataset for their tasks of interest. We also discuss possible open problems to further enhance datasets to support research in the layout analysis of scientific articles
2024
27
683
705
Gemelli, Andrea; Marinai, Simone; Pisaneschi, Lorenzo; Santoni, Francesco
File in questo prodotto:
Non ci sono file associati a questo prodotto.

I documenti in FLORE sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificatore per citare o creare un link a questa risorsa: https://hdl.handle.net/2158/1353291
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 2
  • ???jsp.display-item.citation.isi??? 0
social impact