A Cost Efficient Approach to Correct OCR Errors in Large Document Collections

Deepayan Das, Jerin Philip, Minesh Mathew and C. V. Jawahar
Center for Visual Information Technology, IIIT Hyderabad, India.
{deepayan.das, jerin.philip, minesh.mathew}@research.iiit.ac.in, jawahar@iiit.ac.in

Abstract—The word error rate of an OCR is often higher than its character error rate. This is especially true when OCRs are designed to recognize individual characters. High word accuracy is critical for many practical applications such as content creation and text-to-speech systems. To detect and correct misrecognised words, an OCR commonly employs a post-processor module that improves word accuracy. However, conventional approaches to post-processing, such as looking up a dictionary or using a statistical language model (SLM), are still limited. In many such scenarios, the outstanding errors must be removed manually.

We observe that traditional post-processing schemes look at error words sequentially, since OCRs process documents one at a time. We propose a cost efficient model that addresses error words in batches rather than correcting them individually. We exploit the fact that a collection of documents (e.g., a book), unlike a single document, has a structure that leads to repetition of words. Such words, if efficiently grouped and corrected together, can significantly reduce the correction effort. Error correction can be fully automatic or with a human in the loop. We compare the performance of our method with various baseline approaches, including the case where all errors are removed by a human. We demonstrate the efficacy of our solution empirically, reporting more than 70% reduction in human effort with near perfect error correction. We validate our method on books in both English and Hindi.
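The batch-correction idea summarized above can be sketched in a few lines: repeated error words across a collection are grouped, so a single human (or automatic) correction fixes every occurrence of that word at once. The example data, the exact-match grouping, and the `batch_correct` helper below are illustrative assumptions for exposition, not the paper's actual method.

```python
from collections import defaultdict

def group_error_words(error_words):
    """Group identical error words so one correction fixes all occurrences.

    error_words: list of (page, word) pairs flagged as OCR errors.
    Returns a dict mapping each unique error word to the pages it appears on.
    """
    groups = defaultdict(list)
    for page, word in error_words:
        groups[word].append(page)
    return groups

def batch_correct(error_words, corrections):
    """Apply one correction per group of identical error words.

    Returns the applied fixes plus the cost (number of correction actions)
    of per-occurrence correction versus batch correction.
    """
    groups = group_error_words(error_words)
    fixed = {w: corrections[w] for w in groups if w in corrections}
    individual_cost = len(error_words)  # one action per occurrence
    batch_cost = len(groups)            # one action per unique error word
    return fixed, individual_cost, batch_cost

# Hypothetical OCR errors repeated across pages of a book
errors = [(1, "tne"), (2, "tne"), (5, "tne"), (3, "wcrd"), (7, "wcrd")]
corrections = {"tne": "the", "wcrd": "word"}
fixed, n_individual, n_batch = batch_correct(errors, corrections)
# Batch correction needs 2 actions here instead of 5
```

In practice, grouping would also merge near-identical variants (e.g., by edit distance or image similarity of word crops) rather than only exact string matches, which is where the larger effort reductions come from.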