A general system for the retrieval of document images from digital libraries / S. MARINAI; E. MARINO; F. CESARINI; G. SODA. - PRINT. - (2004), pp. 150-173. (Paper presented at the conference DIAL 2004. Document Image Analysis for Libraries, held in Palo Alto (CA) in January 2004) [10.1109/DIAL.2004.1263246].

A general system for the retrieval of document images from digital libraries

MARINAI, SIMONE
2004

Abstract

Large collections of scanned documents (books and journals) are now available in Digital Libraries. The most common method for retrieving relevant information from these collections is image browsing, but this approach is not feasible for books with more than a few dozen pages. OCR systems can recognize the printed text in the images, which makes retrieval by textual content possible; however, the results depend heavily on the quality of the original documents. More sophisticated navigation is possible when an electronic table of contents of the book is available, with links to the corresponding pages. An opposite approach reduces the amount of symbolic information that must be extracted at storage time; this is the approach taken by document image retrieval systems. In this paper we describe a system that we developed to retrieve information from digitized books and journals belonging to Digital Libraries. The main feature of the system is its ability to combine two principal retrieval strategies in several ways. The first strategy allows a user to find pages with a layout similar to that of a query page. The second strategy retrieves words in the collection that match a user-defined query, without performing OCR. Combining these basic strategies allows users to retrieve meaningful pages with little effort during the indexing phase. We describe the basic tools used in the system (layout analysis, layout retrieval, word retrieval) and the integration of these tools for answering complex queries. Experiments carried out on 1287 pages show the effectiveness of the integrated retrieval.
2004
First International Workshop on Document Image Analysis for Libraries, 2004. Proceedings
DIAL 2004. Document Image Analysis for Libraries
Palo Alto (CA)
January 2004
S. MARINAI; E. MARINO; F. CESARINI; G. SODA
Files in this item:
DIAL04.pdf

Access: closed

Type: Final refereed version (Postprint, Accepted manuscript)
License: All rights reserved
Size: 1.17 MB
Format: Adobe PDF

Documents in FLORE are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this resource: https://hdl.handle.net/2158/260788