Abstract
The digital age presents an overwhelming deluge of multimodal data, underscoring the need for effective text summarization techniques. Such techniques transform vast amounts of textual and visual data into concise, comprehensible, and insightful summaries, facilitating information retrieval, comprehension, and decision-making. This thesis develops novel strategies to enhance text summarization through various forms of contextual guidance and multimodal data, contributing to the evolution of the field through a cohesive narrative that links these diverse yet interconnected areas of study.
The journey begins with an exploration of "Popularity Forecasting" of sentences within news articles. This approach goes beyond traditional salience-based extractive summarization by predicting the "popularity", or "eye-catching" potential, of sentences. We construct a popularity dataset of news articles from CNN/DM [47], each annotated with a mapping from sentences to popularity scores, obtained by comparing each sentence against the search queries associated with the article. We then adapt trained extractive summarizers to perform regression and predict the popularity of each sentence within a news article, yielding a ranking of sentences by their popularity scores.
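A minimal sketch of this adaptation, assuming a BERT-style sentence encoder with a single regression head fit against the dataset's popularity scores (the checkpoint name, loss choice, and example sentences are illustrative, not the thesis configuration):

```python
# Sketch: adapt a pretrained sentence encoder to regress per-sentence
# popularity scores, then rank sentences by predicted score.
# Assumption: bert-base-uncased as the encoder; head trained with MSE loss.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class SentencePopularityRegressor(nn.Module):
    def __init__(self, encoder_name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]   # [CLS] embedding per sentence
        return self.head(cls).squeeze(-1)   # one scalar popularity score each

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = SentencePopularityRegressor()

sentences = ["The storm made landfall overnight.",
             "Officials released a routine statement."]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    scores = model(batch["input_ids"], batch["attention_mask"])
ranked = sorted(zip(scores.tolist(), sentences), reverse=True)
```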
Next, the research advances into the realm of "Multimodal Summarization," which synergizes textual and visual elements to create a more holistic summary. By pairing concise textual summaries with the most salient images from news articles, this technique delivers a richer and more comprehensive understanding of the content. We also show that images can improve the accuracy of the summarization process itself: we employ visuolinguistic transformers such as CLIP [54] and OSCAR [36] to model the interaction between the two modalities, and we adapt general-purpose summarization models to incorporate both textual and visual information.
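As an illustration of the visual side, a minimal sketch of using CLIP to pick the article image most aligned with a textual summary (the checkpoint name, summary text, and file paths are illustrative):

```python
# Sketch: rank candidate article images by CLIP similarity to the summary text.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

summary = "Flooding displaces thousands after record rainfall."
images = [Image.open(p) for p in ["img0.jpg", "img1.jpg", "img2.jpg"]]  # placeholders

inputs = processor(text=[summary], images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-text similarity for each (image, text) pair;
# the highest-scoring image is the one paired with the summary.
best = outputs.logits_per_image.squeeze(-1).argmax().item()
print(f"most salient image: index {best}")
```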
Building on the foundation of extractive summarization, and reusing the core logic of the multimodal summarization work, the study then introduces "Guided Summarization." This method uses sentence salience scores, obtained from an extractive summarizer, to guide an abstractive summarizer. The symbiotic relationship between the two forms of summarization yields more contextually relevant and focused abstractive summaries.
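One simple way to realize such guidance, sketched below under the assumption that salience scores are already available and that guidance is injected by prepending the top-scoring sentences to a BART summarizer's input (the separator scheme and generation hyperparameters are illustrative, not the thesis architecture):

```python
# Sketch: guide an abstractive summarizer (BART) with the sentences an
# extractive model scored as most salient, by prepending them to the input.
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
abstractive = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

def guided_summarize(article_sents, salience_scores, top_k=3):
    # Keep the k sentences the extractive summarizer deems most salient.
    ranked = sorted(zip(salience_scores, article_sents), reverse=True)
    guidance = " ".join(sent for _, sent in ranked[:top_k])
    # A separator keeps the guidance distinct from the article body.
    text = guidance + " </s> " + " ".join(article_sents)
    ids = tokenizer(text, truncation=True, return_tensors="pt").input_ids
    out = abstractive.generate(ids, num_beams=4, max_length=142)
    return tokenizer.decode(out[0], skip_special_tokens=True)
```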
The research further pushes the boundaries of personalization with "Persona-based Summarization," applied to SEBI legal case files. This technique generates tailored summaries based on the specific information needs of different personas, such as investors, defense lawyers, and judges. It underscores the potential of personalization in text summarization, making the information more accessible and relevant to each user profile.
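A minimal sketch of one way to condition generation on a persona, assuming each persona's information needs are expressed as a natural-language prefix to a generic abstractive summarizer (the persona descriptions and model are illustrative assumptions, not the thesis setup):

```python
# Sketch: persona-conditioned summarization via a persona-specific prefix.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Hypothetical persona descriptions for SEBI case files.
PERSONAS = {
    "investor": "Focus on monetary penalties, market impact, and investor protection.",
    "defense_lawyer": "Focus on the charges, cited regulations, and the defense's arguments.",
    "judge": "Focus on precedents, findings of fact, and the final order.",
}

def persona_summary(case_text, persona):
    prompt = PERSONAS[persona] + " " + case_text
    return summarizer(prompt, max_length=130, min_length=30)[0]["summary_text"]
```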
Finally, building on the insights gleaned from the exploration of multimodal summarization, the study culminates in the creation of an "Indic Multimodal Text-Image Pair Dataset." This unique resource is a rich collection of text-image pairs across multiple Indian languages, serving as a critical foundation for the development and evaluation of visuolinguistic transformers, especially those focused on data from the Indian subcontinent.
In summary, this thesis provides a comprehensive exploration of how contextual guidance and multimodal data can significantly enhance text summarization. The techniques and resources proposed and developed in this research, connected through a cohesive narrative, promise to advance the field of text summarization, paving the way for more engaging, comprehensive, and personalized summary generation.