Abstract
Imparting to machines the capability to understand documents as humans do is an AI-complete problem, since it involves multiple sub-tasks such as reading unstructured and structured text,
understanding graphics and natural images, interpreting visual elements such as tables and
plots, and parsing the layout and logical structure of the whole document. Except for a small
percentage of documents in structured electronic formats, a majority of the documents used
today, such as documents in physical media, born-digital documents in image formats, and
electronic documents like PDFs, are not readily machine readable. A paper-based document
can easily be converted into a bitmap image using a flatbed scanner or a digital camera. Consequently, machine understanding of documents in practice requires algorithms and systems that
can process document images, that is, digital images of documents.
The successful application of deep learning-based methods and the use of large-scale datasets have significantly improved the performance of various sub-tasks that constitute the larger problem of machine understanding of document images. Deep learning-based techniques have successfully been applied to the detection and recognition of text, as well as the detection and recognition of
various document sub-structures such as forms and tables. However, owing to the diversity
of documents in terms of language, modality of text present (typewritten, printed, handwritten
or born-digital), images and graphics (photographs, computer graphics, tables, visualizations,
and pictograms), layout, and other visual cues, building generic solutions to the problem of
machine understanding of document images is a challenging task. In this thesis, we address
some of the challenges in this space, such as text recognition in low-resource languages, information extraction from historical/handwritten collections, and multimodal modeling of complex
document images. Additionally, we introduce new tasks that call for a top-down perspective of document image understanding, one that seeks to understand a document image as a whole rather than in parts, in contrast to the mainstream trend where the focus has been on solving various bottom-up tasks. Most existing tasks in Document Image Analysis (DIA) are independent bottom-up tasks that aim to obtain a machine-readable description of certain pre-defined document elements at various levels of abstraction, such as text tokens or tables. This thesis motivates a purpose-driven DIA wherein a document image is analyzed dynamically, subject to a specific requirement set by a human user or an intelligent agent.
We first consider the problem of making images of documents printed in low-resource languages machine-readable using OCR, thereby making these documents AI-ready. To
this end, we propose to use an end-to-end neural network model that can directly transcribe
a word or line image from a document to its corresponding Unicode transcription. We analyze how the proposed setup overcomes many of the challenges in text recognition for Indic languages.
Results of our synthetic to real transfer learning experiments for text recognition demonstrate
that models pre-trained on synthetic data and further fine-tuned on a portion of the real data
perform as well as models trained purely on real data. For more than 10 languages for which no public datasets for printed text recognition have been available, we introduce a new dataset containing more than one million word images in total. We further conduct an empirical study to compare
different end-to-end neural network architectures for word and line recognition of printed text.
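As a rough illustration of such an end-to-end recognizer, the sketch below shows a CRNN-style model (convolutional features followed by a bidirectional LSTM) trained with CTC loss in PyTorch. The architecture, dimensions, and vocabulary size here are assumptions for illustration only, not necessarily the exact models compared in the thesis.

```python
# Hypothetical sketch of an end-to-end word/line image transcriber.
# Assumes a CRNN-style model (conv features + BiLSTM) trained with CTC loss;
# the architectures actually compared in the thesis may differ.
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, num_classes, img_height=32):
        super().__init__()
        # Convolutional feature extractor: (B, 1, H, W) -> (B, C, H', W')
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1), (2, 1)),  # keep horizontal resolution for long words/lines
        )
        feat_height = img_height // 8
        # Bidirectional LSTM over the horizontal (time) axis
        self.rnn = nn.LSTM(256 * feat_height, 256, num_layers=2,
                           bidirectional=True, batch_first=True)
        # num_classes covers the target Unicode symbol set plus the CTC blank (index 0)
        self.fc = nn.Linear(512, num_classes)

    def forward(self, x):
        f = self.cnn(x)                                   # (B, C, H', W')
        b, c, h, w = f.shape
        f = f.permute(0, 3, 1, 2).reshape(b, w, c * h)    # (B, T, C*H')
        out, _ = self.rnn(f)
        return self.fc(out).log_softmax(-1)               # (B, T, num_classes)

# Illustrative training step with CTC loss on dummy data
model = CRNN(num_classes=128)
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
images = torch.randn(4, 1, 32, 128)                       # batch of grayscale word images
logits = model(images).permute(1, 0, 2)                   # CTC expects (T, B, C)
targets = torch.randint(1, 128, (4, 10))                  # dummy label indices
loss = ctc(logits, targets,
           input_lengths=torch.full((4,), logits.size(0), dtype=torch.long),
           target_lengths=torch.full((4,), 10, dtype=torch.long))
loss.backward()
```

In this setup the same network can be pre-trained on rendered synthetic word images and fine-tuned on real data, matching the transfer learning experiments described above.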
Another significant contribution of this thesis is the introduction of new tasks that require a
holistic understanding of document images. Different from existing DIA tasks that address independent bottom-up problems, we motivate a top-down perspective of DIA that requires a holistic understanding of the image and purpose-driven
information extraction. To this end, we propose two tasks, DocVQA and InfographicVQA, fashioned along the lines of Visual Question Answering (VQA) in computer vision. For DocVQA, we report results using multiple strong baselines adapted from existing models for VQA and QA problems. For InfographicVQA, we propose a transformer-based, BERT-like model that jointly models multimodal input spanning vision, language, and layout. We conduct open challenges for both tasks, which have attracted hundreds of submissions so far.
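To make the idea of joint multimodal modeling concrete, the sketch below shows one plausible way a BERT-like encoder could fuse text tokens, 2-D layout (OCR bounding boxes), and visual features. The embedding scheme, dimensions, and fusion by summation are illustrative assumptions, not the exact model proposed for InfographicVQA.

```python
# Illustrative sketch of a BERT-like encoder over multimodal input: text,
# 2-D layout, and visual features. Names and dimensions are assumptions.
import torch
import torch.nn as nn

class MultimodalEncoder(nn.Module):
    def __init__(self, vocab_size=30522, hidden=768, max_pos=512, coord_bins=1000):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden)    # language: OCR token ids
        self.pos_emb = nn.Embedding(max_pos, hidden)        # 1-D sequence position
        # layout: x0, y0, x1, y1 of each OCR token, quantized to coord_bins
        self.x_emb = nn.Embedding(coord_bins, hidden)
        self.y_emb = nn.Embedding(coord_bins, hidden)
        self.vis_proj = nn.Linear(2048, hidden)              # visual features per token/region
        layer = nn.TransformerEncoderLayer(hidden, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=12)

    def forward(self, token_ids, boxes, visual_feats):
        # token_ids: (B, T); boxes: (B, T, 4) quantized coords; visual_feats: (B, T, 2048)
        b, t = token_ids.shape
        pos = torch.arange(t, device=token_ids.device).unsqueeze(0).expand(b, t)
        layout = (self.x_emb(boxes[..., 0]) + self.y_emb(boxes[..., 1]) +
                  self.x_emb(boxes[..., 2]) + self.y_emb(boxes[..., 3]))
        h = self.tok_emb(token_ids) + self.pos_emb(pos) + layout + self.vis_proj(visual_feats)
        return self.encoder(h)   # (B, T, hidden) contextualized multimodal states

# Example: encode 4 sequences of 128 OCR tokens with their boxes and visual features
model = MultimodalEncoder()
ids = torch.randint(0, 30522, (4, 128))
boxes = torch.randint(0, 1000, (4, 128, 4))
vis = torch.randn(4, 128, 2048)
states = model(ids, boxes, vis)   # an answer-prediction head can be placed on top
```

The contextualized states produced by such an encoder can then feed a task-specific head, for example a span-prediction head for extractive question answering.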
Next, we work on the problem of information extraction from a document image collection. Recognizing text from historical and/or handwritten manuscripts is a major challenge for information extraction from such collections. Similar to open-domain QA in NLP, we propose a new task in the context of document images that seeks to answer natural language questions asked over collections of manuscripts. We propose a two-stage retrieval-based
approach for the problem that