We describe a technique for transcript alignment in early printed books by using deep models in combination with dynamic programming algorithms. Two object detection models, based on Faster R-CNN, are trained to locate words. We first train an initial model to recognize generic words and hyphens by using information about the number of words in text lines. Using the model prediction on pages with a line-by-line ground-truth annotation is available, we train a second model able to detect landmark words. The alignment is then based on the identification of landmark words in pages where we only know the text corresponding to zones in the page. The proposed technique is evaluated on a publicly available digitization of the Gutenberg Bible while the transcription is based on the Vulgata, a late 4th century Latin translation of the Bible.

Text alignment in early printed books combining deep learning and dynamic programming / Ziran Z.; Pic X.; Undri Innocenti S.; Mugnai D.; Marinai S.. - In: PATTERN RECOGNITION LETTERS. - ISSN 0167-8655. - STAMPA. - 133:(2020), pp. 109-115. [10.1016/j.patrec.2020.02.016]

Text alignment in early printed books combining deep learning and dynamic programming

Ziran Z.;Marinai S.
2020

Abstract

We describe a technique for transcript alignment in early printed books by using deep models in combination with dynamic programming algorithms. Two object detection models, based on Faster R-CNN, are trained to locate words. We first train an initial model to recognize generic words and hyphens by using information about the number of words in text lines. Using the model prediction on pages with a line-by-line ground-truth annotation is available, we train a second model able to detect landmark words. The alignment is then based on the identification of landmark words in pages where we only know the text corresponding to zones in the page. The proposed technique is evaluated on a publicly available digitization of the Gutenberg Bible while the transcription is based on the Vulgata, a late 4th century Latin translation of the Bible.
2020
133
109
115
Ziran Z.; Pic X.; Undri Innocenti S.; Mugnai D.; Marinai S.
File in questo prodotto:
Non ci sono file associati a questo prodotto.

I documenti in FLORE sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificatore per citare o creare un link a questa risorsa: https://hdl.handle.net/2158/1186369
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 21
  • ???jsp.display-item.citation.isi??? 17
social impact