Abstract
Natural Language Processing (NLP) deals with the techniques and processes that help computers interpret and process human languages. Various NLP applications assist us in daily tasks, such as spell checking, text auto-correction, machine translation, text classification, and text summarization. Vast amounts of information are available on the internet, but most of it is in English; when it comes to Indian languages, the available content is far smaller. Most Indian people are multilingual, and machine translation (MT) systems can be used to translate text from one language into one's own language, thereby helping to overcome the language barrier. Moreover, since many people in India are not proficient in English, machine translation can open up the large body of information available on the internet to them. Machine translation has several approaches, the main ones being rule-based, data-driven, and hybrid, each with its own pros and cons. The hybrid approach combines rule-based and data-driven techniques to exploit the benefits of both. Pipeline-based hybrid machine translation systems exhibit certain software engineering characteristics, such as a heterogeneous architecture, compute- and knowledge-intensive modules, and a blackboard architecture. Due to these characteristics, the development and re-engineering of hybrid MT systems differ from traditional software engineering processes. Hybrid MT systems with a pipeline-based architecture face issues such as lack of efficiency, architecture mismatch, and scalability, robustness, and performance related challenges in their integration, deployment, and engineering.
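For illustration only, the following minimal sketch shows how such a pipeline-based hybrid MT system is typically wired together: each stage is a self-contained module that reads a shared, blackboard-style analysis record and adds its own layer to it. The module names and interfaces here are hypothetical and do not correspond to the actual Anusaaraka or Sampark components.

```python
# Minimal, illustrative sketch of a pipeline-based hybrid MT system.
# Module names and signatures are hypothetical; real systems chain many
# more stages, often implemented on different platforms and backed by
# large knowledge sources.

from typing import Callable, Dict, List

# Each module reads the shared analysis (a blackboard-style record)
# and writes its own additions back to it.
Analysis = Dict[str, object]
Module = Callable[[Analysis], Analysis]

def tokenizer(a: Analysis) -> Analysis:
    a["tokens"] = a["source"].split()
    return a

def morph_analyzer(a: Analysis) -> Analysis:
    # Placeholder: a real analyzer consults large morphological lexica.
    a["morph"] = [{"lemma": t, "features": {}} for t in a["tokens"]]
    return a

def transfer(a: Analysis) -> Analysis:
    # Placeholder: rule-based or data-driven lexical transfer.
    a["target_tokens"] = a["tokens"]
    return a

def run_pipeline(source: str, modules: List[Module]) -> Analysis:
    analysis: Analysis = {"source": source}
    for module in modules:           # strictly sequential: each stage
        analysis = module(analysis)  # depends on the previous one
    return analysis

result = run_pipeline("birds fly high", [tokenizer, morph_analyzer, transfer])
```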
We took the Anusaaraka (English to Hindi) and Sampark (Indian language to Indian language) MT systems as our use cases of pipeline-based hybrid machine translation systems. The modules in both systems are developed by different teams. By studying both systems, we observed that the modules internally rely on heterogeneous utilities and platforms as well as hard-coded knowledge sources, which makes them rigid and difficult to manage and is not in consonance with the larger specifications of the system. These systems comprise complex, knowledge- and compute-intensive NLP modules relying on a bulk of hard-coded knowledge sources; such modules become very unwieldy when we try to scale the performance of the overall NLP system they are part of. Additionally, there are NLP applications, such as translation workbenches and resource management systems, that either utilize MT systems or support their development processes. We also studied these MT-related applications and found that translation workbenches and resource management systems suffer from performance issues and are built from heterogeneous modules, as mentioned above. Therefore, for MT systems as well as MT-related NLP applications to be usable by the mass public, these applications need to be re-engineered. In this thesis, we propose an approach to re-engineer MT and MT-related applications to make them more efficient and improve their performance in terms of processing speed.
This would facilitate MT systems being offered as services that can be used in various NLP products and services. We first studied the Anusaaraka MT system and identified the NLP components that can be re-engineered. We then proposed a symbiotic software engineering approach, followed by a microservices-based architecture, to re-engineer the system. In this way, each NLP module, as well as the complete MT system, is exposed as a set of independent microservices. We demonstrate our approach on the Anusaaraka MT system, where the performance of the system improves by 87% for single-sentence execution. Documentation and test cases are also prepared for each of the modules as well as for the complete system.
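As a minimal, hedged sketch of this idea, the example below wraps a single placeholder NLP module behind an HTTP endpoint using Flask; the route, payload format, and module body are illustrative assumptions, not the actual Anusaaraka service interfaces.

```python
# Sketch: exposing one NLP module as an independent microservice.
# The route, payload schema, and morph_analyze() body are illustrative only.

from flask import Flask, jsonify, request

app = Flask(__name__)

def morph_analyze(sentence: str) -> list:
    # Placeholder for a compute-intensive module backed by knowledge sources.
    return [{"token": t, "lemma": t.lower()} for t in sentence.split()]

@app.route("/morph", methods=["POST"])
def morph_endpoint():
    sentence = request.get_json().get("sentence", "")
    return jsonify({"analysis": morph_analyze(sentence)})

if __name__ == "__main__":
    # Each module runs as its own service; the pipeline composes them over
    # HTTP, so modules can be scaled and redeployed independently.
    app.run(host="0.0.0.0", port=5001)
```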
We also propose the concept of distributed caching, which is utilized to improve the performance of compute-intensive NLP modules in a distributed environment. We performed experiments to analyze the impact of distributed caching on the Sampark MT system; its performance improves by an additional 4-6% when the distributed caching mechanism is used instead of normal caching.
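The sketch below illustrates the distributed caching idea under the assumption of a Redis instance shared by the pipeline nodes: a module's output is keyed by a hash of its input, so repeated inputs, possibly arriving at different machines, skip re-computation. The key scheme and the memoized function are illustrative and do not reflect the actual Sampark implementation.

```python
# Hedged sketch of distributed caching for a compute-intensive NLP module,
# assuming a shared Redis server; key layout and module are illustrative.

import hashlib
import json
import redis

cache = redis.Redis(host="localhost", port=6379)

def expensive_parse(sentence: str) -> dict:
    # Placeholder for a slow, knowledge-intensive module (e.g. a parser).
    return {"tokens": sentence.split()}

def cached_parse(sentence: str) -> dict:
    key = "parse:" + hashlib.sha1(sentence.encode("utf-8")).hexdigest()
    hit = cache.get(key)
    if hit is not None:                # result computed earlier, possibly
        return json.loads(hit)         # on a different node of the cluster
    result = expensive_parse(sentence)
    cache.set(key, json.dumps(result))
    return result
```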
We then propose the translation workbench Transzaar, together with the resource management system Kunji, to aid translators. The performance of translators is evaluated in various scenarios: using Transzaar alone and using Transzaar with Kunji. The productivity of the translators improved significantly, by 1.35-1.45 times for the English-Hindi language pair and by 2 times for the Hindi-Urdu language pair, compared to their initial productivity with Transzaar. Moreover, it is also observed th