Abstract
Code-Mixing (CM) is defined as the embedding of linguistic units such as phrases, words, and morphemes of one language into an utterance of another language. CM is a natural phenomenon observed in many multilingual societies. It helps in speeding-up communication and allows wider variety of expression due to which it has become a popular mode of communication in social media forums like Facebook and Twitter. However, current Question Answering (QA) re-search and systems only support expressing a question in a single language which is an unrealistic and hard proposition especially for certain domains like health and technology. In this paper, we take the first step towards the development of a full-fledged QA system in CM language which is building a Question Classification (QC) system. The QC system analyzes the user question and infers the expected Answer Type (AType). The AType helps in locating and verifying the answer as it imposes certain type-specific constraints.We learn a basic Support Vector Machine (SVM) based QC system for English-Hindi CM questions. Due to the inherent complexities involved in processing CM language and also the unavailability of language processing resources such POS taggers, Chunkers, Parsers, we design our current system using only word-level resources such as language iden-tification, transliteration and lexical translation. To reducedata sparsity and leverage resources available in a resource-rich language, in stead of extracting features directly from the original CM words, we translate them commonly into English and then perform featurization. We created an evaluation dataset for this task and our system achieves an ac-curacy of 63% and 45% in coarse-grained and fine-grained categories of the question taxonomy. The idea of translating features into English indeed helps in improving accuracy over the uni-gram baseline