Abstract
Recent deep learning-based recognizers require a large annotated corpus for training. Manually annotating a large corpus is time-consuming, costly, and tedious. In this work, we propose a framework for automatic word-level annotation of handwritten data, given the corresponding text sequences (or corpora). The proposed framework consists of five modules: (i) pre-processing, (ii) word detection, (iii) word recognition, (iv) alignment, and (v) manual correction and verification. The pre-processing module cleans the image and crops the text region from it. The word detection and recognition modules localize and recognize words. Because of errors in writing, the words in the text sequence must be aligned with the word images obtained during detection and recognition. The alignment module performs this alignment between the words in the text sequence and the word images. A human annotator then corrects the errors of the automatic annotation process and verifies the document. The result is an annotated dataset containing word images and their corresponding ground-truth transcriptions. We demonstrate the proposed tool by annotating 14 sets, corresponding to 13 Indic languages and English. Each set contains 15,000 handwritten document images. On this extensive collection of handwritten document images in 14 languages, 80% of the words are correctly annotated by the automatic annotation tool, while the remaining 20% are corrected manually.
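The alignment step can be illustrated with a minimal sketch. It is not the alignment module of the proposed framework, whose details are not given here; it assumes a generic global (Needleman-Wunsch style) alignment between the reference word sequence and the recognizer outputs, scored by character-level similarity. The names `word_score` and `align_words`, the gap penalty of -0.5, and the example words are all hypothetical.

```python
from difflib import SequenceMatcher


def word_score(ref_word, rec_word):
    """Map character similarity in [0, 1] to a match score in [-1, 1]."""
    return 2.0 * SequenceMatcher(None, ref_word, rec_word).ratio() - 1.0


def align_words(reference, recognized, gap=-0.5):
    """Globally align reference words with recognized words.

    Returns a list of (ref_index, rec_index) pairs. An index is None when
    the word on the other side has no counterpart (for example, a word the
    writer skipped, or a spurious detection).
    The gap penalty is an assumed value, not taken from the paper.
    """
    n, m = len(reference), len(recognized)
    # score[i][j]: best score for aligning reference[:i] with recognized[:j]
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0], move[i][0] = i * gap, "skip_ref"
    for j in range(1, m + 1):
        score[0][j], move[0][j] = j * gap, "skip_rec"

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (score[i - 1][j - 1]
                 + word_score(reference[i - 1], recognized[j - 1]), "pair"),
                (score[i - 1][j] + gap, "skip_ref"),
                (score[i][j - 1] + gap, "skip_rec"),
            ]
            # On ties, max() keeps the first candidate, so pairing is preferred.
            score[i][j], move[i][j] = max(candidates, key=lambda c: c[0])

    # Trace the stored moves back from the bottom-right corner.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        step = move[i][j]
        if step == "pair":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif step == "skip_ref":
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    pairs.reverse()
    return pairs


if __name__ == "__main__":
    text_sequence = ["the", "quick", "brown", "fox"]
    recognizer_output = ["the", "quikc", "fox"]  # one word missing, one misread
    for ref_idx, rec_idx in align_words(text_sequence, recognizer_output):
        ref = text_sequence[ref_idx] if ref_idx is not None else "-"
        rec = recognizer_output[rec_idx] if rec_idx is not None else "-"
        print(f"{ref:>8} <-> {rec}")
```

In such a scheme, pairs with a low similarity score or a `None` on either side would be natural candidates to route to the manual correction and verification stage.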