Datasets and annotations for layout analysis of scientific articles / Gemelli, Andrea; Marinai, Simone; Pisaneschi, Lorenzo; Santoni, Francesco. - In: INTERNATIONAL JOURNAL ON DOCUMENT ANALYSIS AND RECOGNITION. - ISSN 1433-2833. - ELECTRONIC. - 27:(2024), pp. 683-705. [10.1007/s10032-024-00461-2]
Datasets and annotations for layout analysis of scientific articles
Gemelli, Andrea; Marinai, Simone; Pisaneschi, Lorenzo; Santoni, Francesco
2024
Abstract
For a long time now, datasets containing scientific articles have been crucial to the analysis and recognition of document images. These document collections have frequently served as a testing ground for cutting-edge methods for optical character recognition, layout analysis, and document understanding in general. We thoroughly analyze and compare many datasets proposed for layout analysis of scientific documents, ranging from small collections of scanned papers to modern large-scale datasets containing digital-born papers, which have been proposed to train deep learning-based methods. Furthermore, we outline a detailed taxonomy of the annotation procedures used, considering manual, automatic, and generative approaches, and we analyze their benefits and drawbacks. This survey is meant to provide the reader with a review of the most used benchmarks together with detailed information on data, annotations, and complexity, helping scholars to identify the most suitable dataset for their tasks of interest. We also discuss possible open problems to further enhance datasets to support research in the layout analysis of scientific articles.
Documents in FLORE are protected by copyright and all rights are reserved, unless otherwise indicated.