Document layout analysis is an important task to extract information from scientific literature. Deep-learning solutions for document layout analysis require large collections of training data that are not always available. We generate a large number of synthetic pages to subsequently train a neural network to perform document object detection. The proposed pipeline allows users to deal with less common layouts for which it is not easy to find large annotated datasets. High-quality annotations for a small collection of papers are obtained through a semi-automatic approach. Then, a generative model, based on LayoutTransformer, is used to generate plausible layouts that are subsequently populated with random information to perform data augmentation. We evaluate the proposed method considering scientific articles with two different types of layouts: double and single columns. For double-column papers, we improve detection by 1% starting from 385 manually annotated scientific articles. For single-column papers, we improve detection by 49% starting from 218 articles.

Automatic generation of scientific papers for data augmentation in document layout analysis / Pisaneschi, Lorenzo; Gemelli, Andrea; Marinai, Simone. - In: PATTERN RECOGNITION LETTERS. - ISSN 0167-8655. - ELETTRONICO. - 167:(2023), pp. 38-44. [10.1016/j.patrec.2023.01.018]

Automatic generation of scientific papers for data augmentation in document layout analysis

Pisaneschi, Lorenzo;Gemelli, Andrea;Marinai, Simone
2023

Abstract

Document layout analysis is an important task to extract information from scientific literature. Deep-learning solutions for document layout analysis require large collections of training data that are not always available. We generate a large number of synthetic pages to subsequently train a neural network to perform document object detection. The proposed pipeline allows users to deal with less common layouts for which it is not easy to find large annotated datasets. High-quality annotations for a small collection of papers are obtained through a semi-automatic approach. Then, a generative model, based on LayoutTransformer, is used to generate plausible layouts that are subsequently populated with random information to perform data augmentation. We evaluate the proposed method considering scientific articles with two different types of layouts: double and single columns. For double-column papers, we improve detection by 1% starting from 385 manually annotated scientific articles. For single-column papers, we improve detection by 49% starting from 218 articles.
2023
167
38
44
Pisaneschi, Lorenzo; Gemelli, Andrea; Marinai, Simone
File in questo prodotto:
File Dimensione Formato  
1-s2.0-S0167865523000247-main.pdf

accesso aperto

Tipologia: Pdf editoriale (Version of record)
Licenza: Open Access
Dimensione 2.31 MB
Formato Adobe PDF
2.31 MB Adobe PDF

I documenti in FLORE sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificatore per citare o creare un link a questa risorsa: https://hdl.handle.net/2158/1299020
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 7
  • ???jsp.display-item.citation.isi??? 4
social impact