Semiautomatic Text Baseline Detection in Large Historical Handwritten Documents

Abstract

A semiautomatic, iterative process for detecting text baselines in historical handwritten document images is presented. It relies on Hidden Markov Models (HMMs) to provide initial text baseline hypotheses, which are then reviewed by a user to produce ground-truth quality results. Using the set of revised baselines as ground truth, the HMMs are re-trained before processing the next batch of pages. This process has been evaluated in the context of a real transcription task which, as a by-product, has produced line-detection ground truth. We show that a formal, HMM-based line-detection approach, although it requires training data, not only yields good detection results but is also of practical use in large handwritten image collections. Through experiments with real users we show that the proposed approach offers accuracy, scalability and ease of use, together with low overall human effort requirements.