Abstract
Only a few resource-rich languages, such as English and French, have been extensively analyzed for
Natural Language Processing (NLP) tasks. Domain expertise is an essential requirement for studying the
properties of a language in order to build linguistic resources, prepare annotated training data, select
appropriate features, configure parameters and set exceptions in the systems that build models for
solving the NLP tasks of that language. Moreover, building data-driven models such as deep learning
models requires a huge amount of annotated or untagged training data. Handling each of the remaining
6500+ “resource-poor” languages would therefore require the same intensive effort, expertise, expense,
time and large training datasets. Hence, we need to build domain-independent and language-independent
data-driven systems that work reasonably well with a small amount of training data and without requiring
domain expertise. For this purpose, in this thesis, we propose generic concepts and data-driven methods
that can be used to build systems for solving the NLP tasks of resource-poor languages.
We propose a generic associative classification approach called associative context classification.
It is built on our proposed context-based list concept, which groups the items that occur in a specific
context, together with other proposed parameters and concepts. In our research, we have demonstrated the
application of this approach in developing solutions to a few representative Natural Language
Processing tasks. We have focused on developing semi-supervised methods that use small annotated
datasets. Our methods perform well even with a small amount of training data and without using domain
knowledge explicitly, and hence are especially suitable for resource-poor languages that lack domain
resources. Our proposed approach is based on associative classification and on the one sense per
collocation hypothesis, which states that the sense of a word in a document is effectively determined by
its context. Hence, our proposed approach can be applied to NLP tasks that depend on the collocation
property. We have validated the utility of our proposed approach for NLP tasks of resource-poor languages
by successfully applying it to develop generic methods for the Part-of-Speech tagging and Word Sense
Disambiguation tasks.
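As a rough illustration of the general idea of associative classification over context items, the following toy sketch mines simple "context item → class" rules from a handful of labelled examples and uses them to label a new context. It is not the associative context classification approach developed in this thesis: the function names, the support and confidence thresholds, and the example data are all assumptions made only for this illustration.

```python
from collections import Counter, defaultdict


def mine_rules(examples, min_support=2, min_confidence=0.6):
    """Mine rules of the form item -> (label, confidence).

    `examples` is a list of (context_items, label) pairs. The thresholds
    are hypothetical values chosen for this toy example.
    """
    item_label = defaultdict(Counter)
    for context, label in examples:
        for item in set(context):
            item_label[item][label] += 1

    rules = {}
    for item, counts in item_label.items():
        label, support = counts.most_common(1)[0]
        confidence = support / sum(counts.values())
        # Keep only rules that are both frequent and reliable enough.
        if support >= min_support and confidence >= min_confidence:
            rules[item] = (label, confidence)
    return rules


def classify(context, rules, default=None):
    """Label a context by summing the confidence of every matching rule."""
    scores = Counter()
    for item in set(context):
        if item in rules:
            label, confidence = rules[item]
            scores[label] += confidence
    return scores.most_common(1)[0][0] if scores else default


# Hypothetical training contexts for the ambiguous word "bank".
examples = [
    (["river", "water", "shore"], "bank/river"),
    (["river", "fishing"], "bank/river"),
    (["money", "loan", "account"], "bank/finance"),
    (["money", "deposit"], "bank/finance"),
    (["water", "money"], "bank/finance"),
]

rules = mine_rules(examples)
print(classify(["river", "boat"], rules))
print(classify(["loan", "money"], rules))
```

In this sketch only "river" and "money" occur often enough to become rules, so a context containing neither of them falls back to the default label.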
Part-of-speech (POS) tagging is an NLP classification task that assigns a POS tag or other lexical
class marker to an item or to each item in a sentence. Here, we use the term “item” to represent
all the words and tokens of a language. All the available POS taggers, including the state-of-the-art
taggers, require large quantities of training data and linguistic resources such as dictionaries. These
taggers do not perform well for resource-poor languages. So, there is a need to develop generic
semi-supervised tagging methods that use an untagged corpus and require few or no lexical resources. Most of the existing
semi-supervised techniques require a large untagged corpus, whereas for many resource-poor languages
even obtaining a small untagged corpus is hard.
Word sense disambiguation (WSD) is a classification task that involves determining the correct
meaning of each word in a sentence or phrase based on the neighboring context items. Automated
WSD methods use knowledge structures such as WordNet and dictionaries, along with hand-crafted
features and rules that domain and linguistic experts derive from the training data. This is a costly and
time-consuming process that requires extensive domain resources and linguistic expertise. These
requirements make it difficult to design a WSD algorithm for resource-poor languages, and hence
domain-independent methods need to be developed for this task as well.
For our experiments, we have cleaned and prepared resource-rich English, resource-moderate Hindi
and resource-poor Bengali, Marathi, Tamil, Telugu and Urdu datasets for the POS tagging experiments, and English, Hindi and Marathi datasets for the WSD experiments.
As part of our research work, we have developed:
• Two semi-supervised POS tagging methods using the proposed associative context classification approach.
• Various ensemble POS tagging methods using the proposed associative context classification approach, Support Vector Machine, Conditional Random Field, Decision Tree and the Semi-supervised
Condensed Nearest Neighbor method.
• One semi-supervised associative WSD model using the proposed associative context classification
approach.
• One ensemble WSD model using the proposed associative context classification approach and Support Vector Machine.
• One negation rule finding algorithm for POS-tagged data, used to find annotation errors in the
tagged data.