Text alignment in early printed books combining deep learning and dynamic programming / Ziran Z.; Pic X.; Undri Innocenti S.; Mugnai D.; Marinai S. - In: PATTERN RECOGNITION LETTERS. - ISSN 0167-8655. - PRINT. - 133:(2020), pp. 109-115. [10.1016/j.patrec.2020.02.016]
Text alignment in early printed books combining deep learning and dynamic programming
Ziran Z.; Marinai S.
2020
Abstract
We describe a technique for transcript alignment in early printed books that combines deep models with dynamic programming algorithms. Two object detection models, based on Faster R-CNN, are trained to locate words. We first train an initial model to recognize generic words and hyphens by using information about the number of words in text lines. Using the model's predictions on pages where a line-by-line ground-truth annotation is available, we then train a second model able to detect landmark words. The alignment is based on the identification of landmark words in pages where we only know the text corresponding to zones in the page. The proposed technique is evaluated on a publicly available digitization of the Gutenberg Bible, while the transcription is based on the Vulgata, a late 4th-century Latin translation of the Bible.
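The abstract does not state which dynamic programming formulation is used, so the following is only an illustrative sketch, not the authors' method. It assumes that the landmark detector yields a sequence of word labels in reading order and aligns that sequence to the known zone transcript with a longest-common-subsequence table, so that spurious or missed detections are skipped. The function name `align_landmarks` and the sample data are hypothetical.

```python
def align_landmarks(detected, transcript):
    """Align detected landmark labels to transcript words with an LCS table.

    Hypothetical helper: returns (detected_index, transcript_index) pairs.
    This is an assumed formulation, not the one used in the paper.
    """
    n, m = len(detected), len(transcript)
    # table[i][j] = LCS length of detected[i:] and transcript[j:]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if detected[i] == transcript[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk the table from the start to recover one optimal alignment.
    pairs = []
    i = j = 0
    while i < n and j < m:
        if detected[i] == transcript[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1  # treat detected[i] as a spurious detection
        else:
            j += 1  # transcript[j] has no detected landmark
    return pairs


# Assumed sample data: a zone transcript and detector output with one extra "et".
transcript = "in principio creavit deus caelum et terram".split()
detected = ["in", "deus", "et", "et", "terram"]
print(align_landmarks(detected, transcript))
```

Each returned pair anchors one detected word box to a position in the transcript. The transcript words between two consecutive anchors would then have to be distributed over the generic word boxes lying between them, a step this sketch leaves out.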