Abstract
Code-mixing (CM) is a frequently observed phenomenon on social media platforms in multilingual societies such as India. While the increase in code-mixed content on these platforms provides a substantial amount of data for studying various aspects of code-mixing, the lack of automated text analysis tools makes such studies difficult. To address this gap, tools such as language identifiers and part-of-speech (POS) taggers for analysing code-mixed data have been developed. Another such need is Named Entity Recognition (NER), an important Natural Language Processing (NLP) task that is not only a subtask of Information Extraction but is also needed for downstream NLP tasks such as semantic role labeling. While entity extraction from social media data is generally difficult due to its informal nature, code-mixed data further complicates the problem because it is informal, unstructured, and incomplete. In this work, we present the first corpus of Kannada-English code-mixed social media data annotated with named entity tags for NER. We provide strong baselines with machine learning classification models such as CRF, Bi-LSTM, and Bi-LSTM-CRF on our corpus, using word, character, and lexical features.