Abstract
Natural Language Generation (NLG) focuses on the automatic generation of natural language text,
which should ideally be coherent, fluent, and stylistically appropriate for a given communicative goal
and target audience. NLG tasks are varied, ranging from summarization and headline generation to
dialogue generation, and are heavily dependent on the domain under consideration.
Recent research has focused on creating domain-specific datasets and developing domain-specific models to make NLP systems better suited to real-world applications. Training models on data specific to
a domain has been observed to yield significantly better results across domains such as law,
finance, and biomedicine.
However, we observe that little work has been done on problems in the tourism domain.
The tourism industry is important both for the benefits it brings directly and for its role as a commercial activity
that creates demand and growth in many other industries. Currently, no standard
benchmark exists for the evaluation of travel- and tourism-specific data science tasks and models.
To address this gap, we propose a benchmark, TOURISMNLG, of five natural language generation
(NLG) tasks for the tourism domain and release corresponding datasets with standard train, validation
and test splits. Moreover, as NLG systems are diversifying across languages, the datasets we create and
the models we contribute are also multilingual in nature, which is beneficial for the tourism industry
globally.
Further, previously proposed data science solutions for tourism problems do not leverage recent advances in transfer learning. Thus, in this thesis, we also contribute the first rigorously pretrained mT5 and
mBART model checkpoints for the tourism domain. The models have been pretrained on four tourism-specific datasets covering different aspects of tourism.
Using these models, we present initial baseline results on the benchmark tasks, which indicate an improvement in performance over the respective models without domain-specific pretraining.
Additionally, we consider the problem of summarization for Indian languages, as described in the ILSUM (Indian Language SUMmarization) shared task, which focuses on summarizing content from the
news domain in three important Indian languages: Indian English, Hindi, and Gujarati. We evaluate the
performance of existing pretrained models on the task and present our results and findings. We also discuss
the steps that must be taken to create high-quality summarization datasets for Indian languages.
We hope that the contributions of this thesis will promote active research in natural language generation
for travel and tourism, as well as in other domain-specific and language-specific tasks and models.