Abstract
With the extensive use of the Internet in recent years, massive amounts of data on diverse topics and developments have been generated and shared online. The advent of social media platforms such as Twitter, Facebook, and Reddit has accelerated communication between people of all colors, nations, and languages. Although these platforms exist, language barriers to communication remain. Many researchers are trying to address this through various methods.
India is a land of many languages; a majority of its people are multilingual and tend to mix words from different languages in both writing and speech. India has twenty-three significant languages with over seven hundred and twenty dialects. Kannada is one of the four main Dravidian languages and one of the top 30 most spoken languages in the world, with its own independent script and over fifty million speakers. Recent surges in machine learning have advanced language understanding and have resulted in a wide variety of applications for resource-rich languages, for which a considerable amount of work has been performed. The same is not true for low-resource languages, or for Code-mixed low-resource languages. Kannada is one such low-resource language. In this thesis, we focus on research areas in the Kannada-English Code-mixed domain. We present corpora of Kannada-English Code-mixed Twitter data and their analysis in the areas of Emotion Prediction and Parts-of-Speech Tagging.
In the first part, we present a corpus for Emotion Prediction in Kannada-English Code-mixed data. This is, to the best of our knowledge, the first corpus in this field. Identifying and analyzing emotions in user-generated data on social media such as Twitter, Facebook, and Reddit is essential for understanding daily trends and human behavior. There has been work on Code-mixed social media corpora, but not on emotion prediction for Kannada-English Code-mixed Twitter data. We analyze the problem of emotion prediction on a corpus of Code-mixed Kannada-English tweets extracted from Twitter, each annotated with its ‘Emotion’. We experimented with machine learning prediction models on our corpus, namely Support Vector Machines (SVM) and Long Short-Term Memory networks (LSTM), using features such as Character N-Grams, Word N-Grams, and Repetitive characters, among others.
In the second part, we present a corpus of Kannada-English Code-mixed data for Parts-of-Speech Tagging. This is, to the best of our knowledge, the first corpus in this field. Part-of-Speech (POS) tagging is essential for many Natural Language Processing (NLP) applications and is an essential phase of text analysis in understanding the semantics and context of language. A significant amount of work has been done on POS tagging for resource-rich languages. These tags are useful for higher-level tasks such as building parse trees, which can be used for Named Entity Recognition, Coreference Resolution, Sentiment Analysis, and Question Answering. There has been work on Code-mixed social media corpora, but not on POS tagging of Kannada-English Code-mixed data. Here, we present a Kannada-English Code-mixed social media corpus annotated with corresponding POS tags. We also experimented with machine learning classification models on our corpus: Conditional Random Fields (CRF), Bidirectional Long Short-Term Memory networks (Bi-LSTM), and Bidirectional Long Short-Term Memory networks with Conditional Random Fields (Bi-LSTM-CRF).