Automatic generation of scientific papers for data augmentation in document layout analysis

Pisaneschi, Lorenzo; Gemelli, Andrea; Marinai, Simone

doi:10.1016/j.patrec.2023.01.018

Document layout analysis is an important step in extracting information from scientific literature. Deep-learning solutions for this task require large collections of training data, which are not always available. We generate a large number of synthetic pages and use them to train a neural network for document object detection. The proposed pipeline lets users handle less common layouts, for which large annotated datasets are hard to find. First, high-quality annotations for a small collection of papers are obtained through a semi-automatic approach. Then a generative model based on LayoutTransformer produces plausible layouts, which are populated with random content to perform data augmentation. We evaluate the proposed method on scientific articles with two types of layout: double-column and single-column. For double-column papers, we improve detection by 1% starting from 385 manually annotated scientific articles. For single-column papers, we improve detection by 49% starting from 218 articles.
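
The population step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the category list, the `populate_layout` helper, the placeholder vocabulary, and the word-density heuristic are all assumptions made for illustration, and the hand-written layout stands in for one that a generative model would produce. The sketch only shows how a layout (a list of labelled boxes) might be filled with random content while emitting detection-style annotations for training.

```python
import random

# Hypothetical label set; the paper's actual categories may differ.
CATEGORIES = ["title", "text", "figure", "table", "list"]

# Placeholder vocabulary used to fill textual regions.
WORDS = "lorem ipsum dolor sit amet consectetur adipiscing elit".split()


def random_text(rng, n_words):
    """Return n_words placeholder words joined by spaces."""
    return " ".join(rng.choice(WORDS) for _ in range(n_words))


def populate_layout(layout, page_id, seed=0):
    """Fill each box of a layout with placeholder content.

    layout: list of (label, x, y, width, height) tuples in pixels,
            assumed to come from some layout generator.
    Returns one annotation dict per box, with a COCO-like bbox
    ([x, y, width, height]) and the generated content.
    """
    rng = random.Random(seed)
    annotations = []
    for idx, (label, x, y, w, h) in enumerate(layout):
        if label in ("figure", "table"):
            # Non-textual regions get a marker instead of words.
            content = f"<{label} placeholder>"
        else:
            # Assumed density: one word per 2000 square pixels, at least one.
            content = random_text(rng, max(1, (w * h) // 2000))
        annotations.append({
            "id": idx,
            "image_id": page_id,
            "category_id": CATEGORIES.index(label),
            "bbox": [x, y, w, h],
            "area": w * h,
            "content": content,
        })
    return annotations


# A hand-written two-column layout standing in for a generated one.
example_layout = [
    ("title", 50, 40, 500, 40),
    ("text", 50, 100, 240, 300),
    ("figure", 310, 100, 240, 150),
    ("text", 310, 270, 240, 130),
]
page_annotations = populate_layout(example_layout, page_id=1)
```

Because the annotations are produced together with the content, every synthetic page is labelled by construction, which is what makes this kind of generation usable for augmenting a small manually annotated collection.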

Automatic generation of scientific papers for data augmentation in document layout analysis / Pisaneschi, Lorenzo; Gemelli, Andrea; Marinai, Simone. - In: PATTERN RECOGNITION LETTERS. - ISSN 0167-8655. - Electronic. - 167:(2023), pp. 38-44. [10.1016/j.patrec.2023.01.018]