Document classification and page stream segmentation for digital mailroom applications / Gordo, Albert; Rusinol, Marcal; Karatzas, Dimosthenis; Bagdanov, Andrew D. - Electronic. - (2013), pp. 621-625. (Paper presented at the 12th International Conference on Document Analysis and Recognition, ICDAR 2013, held in Washington, DC, USA, in 2013) [10.1109/ICDAR.2013.128].

Document classification and page stream segmentation for digital mailroom applications

BAGDANOV, ANDREW DAVID
2013

Abstract

In this paper we present a method for the segmentation of continuous page streams into multipage documents and the simultaneous classification of the resulting documents. We first present an approach to combine the multiple pages of a document into a single feature vector that represents the whole document. Despite its simplicity and low computational cost, the proposed representation yields results comparable to more complex methods in multipage document classification tasks. We then exploit this representation in the context of page stream segmentation. The most plausible segmentation of a page stream into a sequence of multipage documents is obtained by optimizing a statistical model that represents the probability of each segmented multipage document belonging to a particular class. Experimental results are reported on a large sample of real administrative multipage documents.
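The abstract does not say how the statistical model is optimized. As a minimal sketch only, assume that a segmentation is scored by the sum of the log-probabilities of each segment's best class, and that the best-scoring segmentation is found by dynamic programming over candidate document boundaries. The names `segment_stream`, `toy_score`, and `max_len` are hypothetical and are not taken from the paper; the toy scorer stands in for a real classifier applied to the aggregated feature vector of a candidate document.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical scorer type: maps a run of consecutive pages to
# (best class label, log-probability of that class for the run).
Scorer = Callable[[Sequence[str]], Tuple[str, float]]


def segment_stream(
    pages: Sequence[str], score: Scorer, max_len: int = 10
) -> List[Tuple[int, int, str]]:
    """Return (start, end, label) segments maximizing the summed log-probability.

    Assumption: segments are scored independently, so the total score of a
    segmentation is the sum of its segment scores.
    """
    n = len(pages)
    best = [-math.inf] * (n + 1)  # best[i]: best score for the first i pages
    best[0] = 0.0
    back: List[Tuple[int, str]] = [(0, "")] * (n + 1)  # backpointers

    for end in range(1, n + 1):
        # Try every candidate document that ends at page index `end`.
        for start in range(max(0, end - max_len), end):
            label, logp = score(pages[start:end])
            candidate = best[start] + logp
            if candidate > best[end]:
                best[end] = candidate
                back[end] = (start, label)

    # Walk the backpointers from the end of the stream to recover segments.
    segments: List[Tuple[int, int, str]] = []
    end = n
    while end > 0:
        start, label = back[end]
        segments.append((start, end, label))
        end = start
    segments.reverse()
    return segments


def toy_score(segment: Sequence[str]) -> Tuple[str, float]:
    """Toy stand-in for a document classifier (an assumption, not the paper's model).

    A run of pages that all carry the same tag is a confident document of that
    class; a run with mixed tags is very unlikely to be a single document.
    """
    if len(set(segment)) == 1:
        return segment[0], math.log(0.9)
    return "mixed", math.log(0.01)


pages = ["invoice", "invoice", "letter", "letter", "letter"]
print(segment_stream(pages, toy_score))
```

Because every segment contributes a negative log-probability, this scoring favors fewer, more confident documents over many single-page ones; a real system would replace `toy_score` with a trained classifier.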
Publication date: 2013
Published in: Proceedings of the International Conference on Document Analysis and Recognition, ICDAR
Conference: 12th International Conference on Document Analysis and Recognition, ICDAR 2013
Conference location: Washington, DC, USA
Conference year: 2013
Authors: Gordo, Albert; Rusinol, Marcal; Karatzas, Dimosthenis; Bagdanov, Andrew D.

Use this identifier to cite or link to this resource: https://hdl.handle.net/2158/1081457
Citations
  • Scopus 19
  • ISI 11