Abstract
Sindhi is an Indo-Aryan language spoken by more than 58 million people around the world. It is currently a resource-poor language, a situation made worse by its literature being written in multiple scripts. Though the language is widely spoken, primarily across two countries, its written form is not standardized. In this paper, we seek to develop resources for basic language processing of Sindhi in one of its preferred scripts, Devanagari, because a language that seeks to survive in the modern information society requires language technology products. This paper presents our work on building a stochastic part-of-speech tagger for Sindhi-Devanagari using conditional random fields with linguistically motivated features. The paper also discusses the steps taken to construct a part-of-speech annotated corpus for Sindhi in Devanagari script. We also explain in detail the features used for training the tagger, which achieves an average accuracy of nearly 92%.