Abstract
Due to the lack of large annotated corpora, many resource-poor Indian languages struggle to reap the benefits of recent deep feature representations in Natural Language Processing (NLP). Moreover, adapting existing language models trained on large English corpora to Indian languages is often limited by data availability, rich morphological variation, syntax, and semantic differences. In this paper, we explore representations ranging from traditional to recent efficient ones to overcome the challenges of a low-resource language, Telugu. In particular, our main objective is to mitigate the low-resource problem for Telugu. Overall, we present several contributions for a resource-poor language, viz. Telugu: (i) a large annotated dataset (35,142 sentences per task) for multiple NLP tasks, namely sentiment analysis, emotion identification, hate-speech detection, and sarcasm detection; (ii) lexicons for sentiment, emotion, and hate-speech to improve model efficiency; (iii) pretrained word and sentence embeddings; and (iv) several pretrained language models for Telugu, namely ELMo-Te, BERT-Te, RoBERTa-Te, ALBERT-Te, and DistilBERT-Te, trained on a large Telugu corpus of 80,15,588 sentences (16,37,408 sentences from Telugu Wikipedia and 63,78,180 sentences crawled from different Telugu websites). Further, we show that these representations significantly improve the performance of the four NLP tasks and present benchmark results for Telugu. We argue that our pretrained embeddings are competitive with or better than the existing multilingual pretrained models: mBERT, XLM-R, and IndicBERT. Lastly, fine-tuning the pretrained models yields higher performance than linear probing on the four NLP tasks, with the following F1-scores: Sentiment (68.72), Emotion (58.04), Hate-Speech (64.27), and Sarcasm (77.93).
We also experiment on publicly available Telugu datasets (Named Entity Recognition, Article Genre Classification, and Sentiment Analysis) and find that our Telugu pretrained language models (BERT-Te and RoBERTa-Te) outperform the state-of-the-art systems on all but the sentiment task. We open-source our corpus, the four datasets, lexicons, embeddings, and code at https://github.com/Cha14ran/DREAM-T. The pretrained Transformer models for Telugu are available at https://huggingface.co/ltrctelugu.