Abstract
Code-switching or code-mixing occurs when “lexical items and/or grammatical features from two
languages appear in one sentence”. With the rising popularity of social media platforms such as Twitter,
Facebook, and Reddit, the volume of texts on these platforms has also grown significantly. Twitter alone
has over 500 million text posts (tweets) per day. India, a country with over 300 million multilingual
speakers, has over 23 million users on Twitter as of January 2022, and code-switching can be observed
heavily on this social media platform.
Code-mixed social media posts present unique challenges due to the mixing of languages, slang,
and informal expressions. Extracting valuable information from code-mixed data enables a better
understanding of user sentiments, preferences, and trends across different language communities. It helps
identify named entities, detect events, and extract relevant information from multilingual conversations.
By effectively extracting and analysing code-mixed content, businesses, researchers, and language
processing systems can gain valuable insights into diverse language patterns, linguistic phenomena, and
cultural nuances, contributing to more accurate language processing, sentiment analysis, and targeted
communication strategies. These benefits, which span many domains, have led to a growing interest in
Information Extraction (IE) as an active research area in artificial intelligence.
However, research efforts are often hindered by the lack of automated NLP tools to analyse massive
amounts of code-mixed data, especially for resource-poor languages. There have been significant efforts
towards understanding some languages, such as English, German, French, and Spanish, by creating
annotated resources for various tasks, while other languages have not received the same focus, leaving
them resource-poor. One such language is Kannada, which has over 58 million native and secondary-language
(L2) speakers worldwide. As in most places around the world, Kannada speakers tend to mix their language
with English in informal settings, including social media and texting platforms, but this code-mixed
usage has not received much focus from researchers. In this thesis, we work towards
information extraction from Kannada-English code-mixed data from social media. We work on the
problems of Named Entity Recognition (NER) and Event Detection.
Named Entity Recognition (NER) and Event Detection on code-mixed social media data are crucial for
understanding and analysing user-generated content in diverse languages. They enable the identification
of named entities, such as names of people, organizations, and locations, as well as the detection of
events, allowing for deeper insights into multilingual conversations, cross-cultural trends, and effective
information retrieval from code-mixed social media posts.
We have collected Kannada-English (Kanglish) code-mixed data from Twitter and curated annotated
datasets for both the NER and Event Detection tasks. We have analysed the challenges that are unique to
Kannada-English code-mixed data and have provided the annotation guidelines used to build these datasets.
We have also proposed a few supervised approaches to these tasks on our datasets, with careful feature
selection, and critically analysed the results in the hope of encouraging more focus from the research
community on such low-resource languages.