Abstract
In today’s world, Information Retrieval (IR) systems play a significant role in helping users access the information they want on the web. Current IR systems have come a long way in retrieving the most relevant results by using sophisticated ranking algorithms, relevance feedback, efficient indexing techniques, and so on. As a substantial amount of web content is in English, current IR systems do a fair job of fulfilling users’ information needs. On the other hand, non-English web content has also seen tremendous growth in recent years. Content in local or native languages originates mainly from news portals, blogs, social networking websites, etc. The surge in such content can be attributed to factors such as users’ interest in reading and expressing their thoughts in their local languages, the availability of sophisticated input methods for local-language text entry, and robust technologies for handling non-English content. From the user’s perspective, IR systems that retrieve relevant local-language results in addition to English results would be more useful. This information need has given rise to the field of Cross Language Information Retrieval (CLIR).
CLIR deals with the retrieval of documents written in one language using user queries in another language. CLIR systems either translate the queries into the documents’ language during input processing or translate the documents into the language of the queries during indexing. In this thesis, we focus on exploring efficient ways to translate user queries for cross language retrieval involving Indian languages. To build an effective query translation process, we need to answer some questions about queries, such as:
What does a general user query consist of?
What are the ways to translate user queries from one language to another?
A typical web query consists of three to four words, most of which are named entities. Studies have shown that more than 80% of query terms are Out Of Vocabulary (OOV) words. Generally, these OOV words are not present in bilingual thesauri, so they cannot be translated with bilingual dictionaries in the way other content words in the query can. They have to be transliterated phonetically, orthographically, or by a combination of both, depending on the source and target languages. The content words in the query can be translated using several approaches, including bilingual thesauri, ontologies, parallel or comparable corpora, the web, and other related resources. As one can infer, some form of “external resource” is needed for translating queries between any two languages.
First, we address the problem of translating OOV words in the context of English-to-Indian-language and Indian-language-to-English cross language retrieval. In our analysis, we found that most transliteration errors are due to ambiguities in the use of vowels in either the source-side or the target-side word. To address this issue, we introduce a mapping-based approach to transliterating named entities that generates a “consonant skeleton” for the given source word and maps it to the right target-side equivalent based on a phonetically modified Levenshtein distance. Our method requires very little training data and yet achieves a significant improvement in performance over a statistical technique. Initially, we evaluated the algorithm for English-Tamil transliteration.
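The consonant-skeleton idea can be illustrated with a minimal sketch. It is not the thesis implementation: the vowel set, the groups of phonetically close consonants, the 0.5 substitution cost, and the use of romanized candidate words are all assumptions made here for illustration, and the function names are hypothetical.

```python
# Illustrative sketch only: vowel set, phonetic groups and costs are assumed.
VOWELS = set("aeiou")

# Hypothetical groups of consonants treated as phonetically close.
PHONETIC_GROUPS = [set("ckq"), set("sz"), set("vw"), set("td"), set("pb")]


def consonant_skeleton(word):
    """Lower-case the word and drop vowels and non-letter characters."""
    return "".join(
        ch for ch in word.lower() if ch.isalpha() and ch not in VOWELS
    )


def substitution_cost(a, b):
    """0 for identical letters, 0.5 within an assumed phonetic group, else 1."""
    if a == b:
        return 0.0
    if any(a in group and b in group for group in PHONETIC_GROUPS):
        return 0.5
    return 1.0


def phonetic_levenshtein(s, t):
    """Levenshtein distance with a reduced cost for phonetically close letters."""
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [float(i)]
        for j, b in enumerate(t, start=1):
            curr.append(
                min(
                    prev[j] + 1.0,                          # deletion
                    curr[j - 1] + 1.0,                      # insertion
                    prev[j - 1] + substitution_cost(a, b),  # substitution
                )
            )
        prev = curr
    return prev[-1]


def rank_candidates(source, candidates, top_k=3):
    """Return the top_k candidates whose skeletons are closest to the source's."""
    src = consonant_skeleton(source)
    ranked = sorted(
        candidates,
        key=lambda c: (phonetic_levenshtein(src, consonant_skeleton(c)), c),
    )
    return ranked[:top_k]
```

For example, `rank_candidates("Kumar", ["kumaar", "komal", "ramu"], top_k=1)` compares the skeleton `kmr` against `kmr`, `kml` and `rm`, so vowel variation in the candidate spelling does not affect the ranking.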
The transliteration accuracy of our algorithm is 91%, 96% and 99% for the top-1, top-3 and top-8 transliteration candidates, respectively, compared to 42%, 62% and 77% for the statistical system on a standard test set. We have also extended the algorithm to English-Hindi and Hindi-English transliteration of named entities in the cross language retrieval scenario. Our algorithm proved effective in the cross language retrieval context, and our systems achieved state-of-the-art performance.
Second, as mentioned earlier, most current CLIR systems use some form of external translation resource for translating queries. However, if such resources are scarce or unavailable for a given language pair, query translation is hardly feasible. In the second part of this thesis, we address this issue in the context of CLIR across Indian languages. We propose an approximate, fuzzy string matching technique that exploits the presence of common-origin words and the similarity of writing systems across Indian languages for query translation. We have implemented our technique for query translation on top of existing CLIR systems.
Our method performs reasonably well for CLIR across all six language pairs formed from Hindi, Bengali and Marathi. We have also tested the significance of our results and shown that they are statistically significant.
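The fuzzy matching idea for query translation across Indian languages can be sketched as follows. This is an illustration under stated assumptions, not the technique as implemented in the thesis: it relies on the fact that the Unicode Bengali block (U+0980 to U+09FF) is laid out in parallel with the Devanagari block (U+0900 to U+097F), uses Python's `difflib.SequenceMatcher` as a stand-in similarity measure, and the 0.7 threshold and function names are hypothetical.

```python
from difflib import SequenceMatcher

DEVANAGARI_START = 0x0900
BENGALI_START, BENGALI_END = 0x0980, 0x09FF


def to_devanagari(text):
    """Shift Bengali code points onto the parallel Devanagari block.

    Approximate by design: a few positions have no exact counterpart.
    Hindi and Marathi text is already Devanagari and passes through unchanged.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if BENGALI_START <= cp <= BENGALI_END:
            cp = cp - BENGALI_START + DEVANAGARI_START
        out.append(chr(cp))
    return "".join(out)


def similarity(a, b):
    """Similarity in [0, 1] between two words after script normalization."""
    return SequenceMatcher(None, to_devanagari(a), to_devanagari(b)).ratio()


def translate_term(term, target_vocabulary, threshold=0.7):
    """Return the most similar target-language word, or None below threshold."""
    best = max(target_vocabulary, key=lambda w: similarity(term, w), default=None)
    if best is not None and similarity(term, best) >= threshold:
        return best
    return None
```

In this sketch, a query term is looked up against the vocabulary of the target-language index, so a common-origin word spelled alike in both scripts is matched without any bilingual dictionary.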