Abstract
A semiautomatic iterative process for the detection
of text baselines in historical handwritten document images
is presented. It relies on Hidden Markov Models
(HMMs) to provide initial text baseline hypotheses, followed by
user review to produce ground-truth quality results.
Using the set of revised baselines as ground truth, the HMMs
are re-trained before processing the next batch of pages. This
process has been evaluated in the context of a real transcription
task which, as a by-product, has produced line-detection
ground truth. We show that the use of a formal, HMM-based
line-detection approach, which requires training data, not
only yields good detection results but is also of practical use in
large handwritten image collections. Through experiments with
real users, we show that the proposed approach is accurate,
scalable and easy to use, and that it requires little overall
human effort.