Abstract
The internet is a vast repository of information on a diverse array of topics, spanning blogs, articles, and entire websites. Not all of this information, however, is valuable or relevant. Navigating this abundance of content to gain a comprehensive understanding of a particular topic can be daunting and time-consuming, and it is all too common to invest time in reading content that ultimately proves unimportant or irrelevant. Given the inherent limits of human cognitive capacity for processing large quantities of information, concise and relevant summaries are highly sought after as a means of comprehending complex subjects efficiently and effectively.
Summarization is a computational task that condenses a text into a concise version containing only its most essential and relevant information. There are two main approaches: extractive summarization, which selects the most important sentences directly from the source document, and abstractive summarization, which generates the summary freely and may introduce words or phrases that do not appear in the source. Document summarization has been studied by the NLP community for over three decades. However, progress on Indian-language summarization has been limited by the lack of high-quality datasets and benchmark models, which motivated us to develop resources and benchmarks for Indian languages. In this thesis, we develop text summarization resources for Indian languages in three settings: monolingual, cross-lingual, and multilingual.
The initial focus of this thesis is monolingual summarization, specifically the creation of a high-quality dataset for Telugu, a widely spoken South Indian language. We propose a pipeline that crowd-sources summarization data and then aggressively filters it through automatic checks and partial expert evaluation. Using this pipeline, we create TeSum, a high-quality Telugu abstractive summarization dataset of 20,329 document-summary pairs, written by 347 annotators and evaluated by 3 raters. We carefully design annotation guidelines around the parameters of Relevance, Readability, and Creativity, and we compare our dataset with existing Telugu summarization datasets.
By training a summarization system on multiple languages, we enable it to learn to represent concepts in a shared space, regardless of the language in which they are expressed. This shared representation is useful for transfer learning, as it allows the model to apply knowledge gained from one language to another. To this end, we perform multilingual and cross-lingual summarization for Indian languages. For multilingual summarization, we use the Indian Language Summarization (ILSUM) dataset, which covers Hindi, Gujarati, and Indian English, to establish baselines. We apply the proposed filters to the ILSUM data to assess its quality, and we experiment with different pre-trained sequence-to-sequence models to identify the best-performing model for each language. Our work also includes an in-depth analysis of the impact of k-fold cross-validation when working with limited data. Additionally, we run experiments on combinations of the original and filtered versions of the data to assess the effectiveness of the pre-trained models.
We present PMIndiaSum, a new cross-lingual and highly parallel summarization dataset for languages of India. The dataset covers 4 language families, 14 languages, and 196 language pairs. We detail the approaches taken to construct it, including data acquisition, cleaning, quality assurance, and inspection. In addition, we publish benchmarks for several methodologies, such as fine-tuning pre-trained language models and summarization-and-translation pipelines. Experimental results suggest that providing multilingual data enhances cross-lingual summarization between Indian languages.
Finally, this thesis delves into multi-perspective scientific document summarization. Our objective is to develop a model that generates a generic summary encompassing the various aspects covered by the multiple reference summaries of a scientific document. We describe the pre-trained models used for this task, as well as the challenges encountered along the way.