Abstract
Language understanding has become crucial for various text classification tasks in Natural Language Processing (NLP) applications to obtain the desired output. Over the past decade,
machine learning and deep learning algorithms have evolved alongside efficient feature representations to deliver better results. NLP applications are becoming more powerful as well as domain- and
language-specific. For resource-rich languages such as English, NLP applications achieve the desired
results owing to the availability of large corpora, diverse annotated datasets, efficient
feature representations, and tools.
Due to the lack of large corpora and annotated datasets, many resource-poor Indian languages struggle to reap the benefits of deep feature representations. Moreover, adopting existing
language models trained on large English corpora for Indian languages is often limited by data
availability and by rich morphological, syntactic, and semantic differences. Most of the work
on Indian languages has been done from a machine translation perspective. One solution is to use
translation to re-create datasets for low-resource languages from English. However, for Indian languages such as Telugu, the meaning may change and crucial information may be lost
in translation because of structural differences, morphological complexity,
and semantic divergence.
In this thesis, our main objective is to mitigate the low-resource problem for Telugu. Overall,
to accelerate NLP research in Telugu, we present several contributions: (1) a large Telugu raw
corpus of 8,015,588 sentences (1,637,408 sentences from Telugu Wikipedia and 6,378,180 sentences crawled from different Telugu websites); (2) annotated datasets in Telugu for sentiment
analysis, emotion identification, hate speech detection, sarcasm identification, and clickbait detection; (3) the first pre-trained distributed word
and sentence embeddings for the Telugu corpus, namely Word2Vec-Te, GloVe-Te, FastText-Te, MetaEmbeddings-Te,
and Skip-Thought-Te; (4) pre-trained contextual language models for Telugu, namely
ELMo-Te, BERT-Te, RoBERTa-Te, ALBERT-Te, Electra-Te, and DistilBERT-Te, along with word and
sentence embeddings from graph-based models, namely DeepWalk-Te, Node2Vec-Te, and Graph
AutoEncoders (GAE); and (5) a proposed multi-task learning model (MT-Text GCN) that reconstructs word-sentence graphs on the TEL-NLP data while performing multi-task text classification
with the learned graph embeddings. We show that our pre-trained embeddings are competitive with or
better than the existing multilingual pre-trained models mBERT, XLM-R, and IndicBERT. Furthermore, fine-tuning the pre-trained models yields higher performance than linear probing on five NLP tasks. We also experiment with our pre-trained models on other NLP tasks
available in Telugu (Named Entity Recognition, Article Genre Classification, Sentiment Analysis, and Summarization) and find that our Telugu pre-trained language models (BERT-Te and
RoBERTa-Te) outperform the state-of-the-art systems on all but the sentiment task. We hope
that the availability of the created resources for different NLP tasks will accelerate Telugu NLP
research, which has the potential to impact more than 85 million people.
In this thesis, we aim to bridge this gap by creating resources for different NLP tasks
in Telugu. These resources can be extended to other Indian languages that are culturally and
linguistically close to Telugu by translation, without losing information such as verb
forms, cultural terms, and vibhaktis. This is the first work that applies neural methods to
Telugu, a language that lacks good tools such as NER taggers, parsers, and embeddings.
Ours is the first attempt in this direction to provide good models for Telugu by
exploring different methods with the available resources.
Our work can also help the Telugu NLP community evaluate advances over more diverse tasks and
applications. We open-source our corpus, five annotated datasets (SA, EI, HS, SAR,
and clickbait), lexicons, pre-trained embeddings, and code here1. The pre-trained Transformer
models for Telugu are available here2.