Text retrieval from early printed books / S. Marinai. - Print. - ACM International Conference Proceeding Series (2009), pp. 33-40. (Paper presented at the Third Workshop on Analytics for Noisy Unstructured Text Data, held in Barcelona, July 23-24, 2009) [http://doi.acm.org/10.1145/1568296.1568304].

Text retrieval from early printed books

MARINAI, SIMONE
2009

Abstract

We describe a text indexing and retrieval technique that does not rely on word segmentation and tolerates errors in character segmentation. The method is designed for early printed documents, and we evaluate it on the well-known Latin Gutenberg Bible. The approach has two main components. First, character objects (in most cases corresponding to individual characters) are extracted from the document and clustered, so that each indexed object is assigned a symbolic class. Second, a query word is compared against the indexed character objects with an approach based on Dynamic Time Warping (DTW). The distinctive feature of the matching technique described in this paper is that it incorporates sub-symbolic information into the string matching process. In particular, we take into account the estimated widths of potential subwords, which are computed by accumulating the lengths of partial matches in the DTW array.
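
The matching step can be illustrated with a minimal sketch. It is not the implementation from the paper: the abstract does not specify the cost function, the step pattern, or how widths enter the decision. The sketch assumes a 0/1 mismatch cost, a standard three-way DTW recurrence in which a match may start at any indexed object, and a simple filter that rejects candidates whose accumulated width is far from an expected word width. The names `CharObject`, `spot_word`, `expected_width` and `tolerance`, and all labels and pixel widths, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CharObject:
    label: str  # symbolic class assigned by clustering (hypothetical labels)
    width: int  # object width in pixels (hypothetical values)


def spot_word(query, objects, expected_width, tolerance=0.3):
    """Locate `query` in `objects` without relying on word boundaries.

    Returns (cost, start, end, width) for the cheapest match whose
    accumulated width is within `tolerance` of `expected_width`,
    or None if no candidate qualifies.
    """
    n, m = len(query), len(objects)
    if n == 0 or m == 0:
        return None
    cost = [[0.0] * m for _ in range(n)]
    width = [[0] * m for _ in range(n)]  # widths accumulated along the path
    start = [[0] * m for _ in range(n)]  # index of the first matched object
    for i in range(n):
        for j in range(m):
            d = 0.0 if query[i] == objects[j].label else 1.0
            w = objects[j].width
            # Each option: (cost so far, width so far, start index).
            options = []
            if i == 0:
                options.append((0.0, w, j))  # a match may begin anywhere
            if i > 0 and j > 0:  # one query symbol, one new object
                options.append((cost[i - 1][j - 1],
                                width[i - 1][j - 1] + w,
                                start[i - 1][j - 1]))
            if j > 0:  # same symbol spans one more object (over-segmentation)
                options.append((cost[i][j - 1],
                                width[i][j - 1] + w,
                                start[i][j - 1]))
            if i > 0:  # next symbol on the same object (merged characters)
                options.append((cost[i - 1][j],
                                width[i - 1][j],
                                start[i - 1][j]))
            best = min(options, key=lambda o: o[0])
            cost[i][j] = best[0] + d
            width[i][j], start[i][j] = best[1], best[2]

    result = None
    for j in range(m):  # candidate end positions in the last DTW row
        if abs(width[n - 1][j] - expected_width) > tolerance * expected_width:
            continue  # accumulated width is implausible for this query
        hit = (cost[n - 1][j], start[n - 1][j], j, width[n - 1][j])
        if result is None or hit[0] < result[0]:
            result = hit
    return result


# Toy index: the word "deus" embedded between two unrelated objects.
index = [CharObject(label, w) for label, w in
         [("a", 10), ("d", 12), ("e", 9), ("u", 11), ("s", 8), ("b", 10)]]
print(spot_word("deus", index, expected_width=40))  # (0.0, 1, 4, 40)
```

Because the width is stored next to the cost in every DTW cell, the same table yields both a matching score and an estimate of how wide the matched region is on the page, which is the kind of sub-symbolic cue the abstract refers to. The filter here is applied only at the end of the match; whether the paper uses the widths earlier in the recurrence is not stated in the abstract.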
Proceedings of the Third Workshop on Analytics for Noisy Unstructured Text Data
Third Workshop on Analytics for Noisy Unstructured Text Data
Barcelona
July 23-24 2009
S. Marinai
Files in this record:
marinai-and.pdf
Access: open access
Description: Article
Type: Final refereed version (postprint, accepted manuscript)
License: DRM not defined
Size: 348.64 kB
Format: Adobe PDF

Documents in FLORE are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2158/373561