Abstract
Code-mixing is a linguistic phenomenon frequently observed in user-generated content on social media, especially among multilingual users. Apart from its inherent linguistic complexity, code-mixed content is difficult to analyze because of spelling variations, transliteration and non-adherence to a formal grammar. Nevertheless, any downstream Natural Language Processing task requires tools that can process and analyze code-mixed data. Publicly available resources for code-mixed Hindi-English data are currently lacking, even though the amount of such text is increasing every day. In this study, we focus on creating a dataset of code-mixed Hindi-English sentences annotated with language and normalisation labels. To the best of our knowledge, our work is the first attempt to create a linguistic resource for this language pair, and we make the resource publicly available. We also present an empirical study detailing the construction of a language identification and normalisation system designed for this language pair.