Abstract
In recent years, we have witnessed tremendous growth in the volume of text documents available
on the Internet, in digital libraries, news sources, company-wide intranets, and so on. This has led to
increased interest in developing methods that help users effectively navigate, summarize,
and organize this information, with the ultimate goal of helping users find what they are looking
for. There are two main approaches to the task of document organization: the supervised
approach, where pre-defined category labels are assigned to documents based on the likelihood
suggested by a training set of labeled documents; and the unsupervised approach, where no
human intervention or labeled documents are needed at any point in the process. Fast and
high-quality document clustering algorithms play an important role towards the goal of document
organization, as they provide an intuitive navigation and browsing mechanism
by organizing large amounts of information into a small number of meaningful clusters.
It can be noted that the performance of an information retrieval method depends on the feature selection and weight assignment methods employed to extract features from the
documents. Typically, features are selected from a document based on some criterion (e.g., term frequency)
and weighted by the TF-IDF (term frequency-inverse document frequency) scheme.
There have been research efforts to improve the performance of feature extraction methods by
extending concepts from areas such as ontologies and open web directories.
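As a concrete illustration of this standard pipeline, the Python sketch below selects features by corpus frequency and assigns TF-IDF weights. The frequency threshold and the exact TF-IDF variant are assumptions chosen for clarity, not the specific formulation used later in the thesis.

```python
# A minimal sketch of frequency-based feature selection and TF-IDF weighting,
# using only the Python standard library; names and thresholds are illustrative.
import math
from collections import Counter

def tfidf_vectors(docs, min_freq=1):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    # Feature selection: keep terms whose corpus frequency meets a threshold.
    corpus_freq = Counter(t for doc in docs for t in doc)
    vocab = {t for t, f in corpus_freq.items() if f >= min_freq}

    # Document frequency: number of documents containing each selected term.
    df = Counter()
    for doc in docs:
        df.update(set(doc) & vocab)

    n_docs = len(docs)
    vectors = []
    for doc in docs:
        tf = Counter(t for t in doc if t in vocab)
        vectors.append({
            t: (1 + math.log(f)) * math.log(n_docs / df[t])  # TF-IDF weight
            for t, f in tf.items()
        })
    return vectors
```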
The vector space model is one of the most widely used models for representing text documents when
computing similarity, and cosine similarity is a popular method for measuring the similarity between two
vectors. The similarity computed by the cosine method is determined by the features
(and their weights) common to the two document vectors. Consequently, cosine similarity is
unable to capture similarity based on meaning or semantics. For example, consider two documents,
one containing the word “BMW” and the other the word “Jaguar”. The cosine similarity
between these two documents is zero. Even though the two documents
have different vocabularies, they are semantically related, as both refer to car brands.
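The following small example, with assumed toy term weights, shows how cosine similarity evaluates to zero when two document vectors share no terms, regardless of any semantic relation between them.

```python
# Cosine similarity over sparse {term: weight} vectors; with no shared
# vocabulary the similarity is zero, even for semantically related texts.
import math

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

doc_a = {"bmw": 1.2, "sedan": 0.8}
doc_b = {"jaguar": 1.1, "coupe": 0.7}  # no terms in common with doc_a
print(cosine(doc_a, doc_b))            # -> 0.0
```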
In this thesis, we make an effort to capture the context of a document and incorporate it into the
similarity computation. We exploit the generalization ability of hierarchical knowledge repositories
such as the Open Web Directory: a given term relates to a context, and the context, in turn,
relates to a collection of terms, so we can extract related terms for each term in the document. In
a simple generalization hierarchy of a web directory, a term at a higher level is a generalized concept
for all the terms under that node; for example, sport is a generalized concept for football, cricket, baseball, etc.
We can add these generalized terms, along with their weights, to the document vector. By enriching the document with generalized contextual terms, there is scope to increase the number of common
terms, and thus the similarity between two documents. We also propose an improved approach in
which these generalized terms are later removed from the document vector, leaving behind only the
document terms. As output, we get a document vector with boosted weights while the dimensionality
of the document vector remains the same. In addition to feature extraction and feature weighting of a document vector, we propose a method to improve clustering quality by assigning weights to the features of a cluster's vector.
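The sketch below illustrates these two ideas under assumed data structures: `hierarchy` is a hypothetical mapping from a term to its generalized concepts (as would be derived from directory topic paths), `enrich` adds the generalized terms to the vector, and `boost_weights` mimics the improved variant that keeps only the original terms but boosts those sharing a common generalized concept. The weighting factors are illustrative, not the exact formulation developed in the thesis.

```python
# Illustrative enrichment and weight boosting using an assumed term-to-concept map.

def enrich(doc, hierarchy, ancestor_weight=0.5):
    """Add generalized concepts to the vector (increases dimensionality)."""
    enriched = dict(doc)
    for term, weight in doc.items():
        for concept in hierarchy.get(term, []):
            enriched[concept] = enriched.get(concept, 0.0) + ancestor_weight * weight
    return enriched

def boost_weights(doc, hierarchy, boost=0.5):
    """Keep only the original terms, but boost those that share a generalized
    concept with another term in the same document (dimensionality unchanged)."""
    boosted = dict(doc)
    concepts = {t: set(hierarchy.get(t, [])) for t in doc}
    for term in doc:
        shares_concept = any(concepts[term] & concepts[other]
                             for other in doc if other != term)
        if shares_concept:
            boosted[term] *= (1.0 + boost)
    return boosted

hierarchy = {"football": ["sport"], "cricket": ["sport"], "bmw": ["cars"]}
doc = {"football": 1.0, "cricket": 0.8, "bmw": 0.5}
print(enrich(doc, hierarchy))        # adds "sport" and "cars" to the vector
print(boost_weights(doc, hierarchy)) # boosts "football" and "cricket" only
```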
The contribution of this thesis is threefold:
1. We propose a framework that performs feature generation (using an open web directory) and
enriches the feature vector with new, more informative and discriminative features. We use
the topic paths of a term to obtain related contextual and generalized terms.
2. We propose an improved term weighting method called BoostWeight that considers the semantic
association between terms. BoostWeight increases the weights of terms that share a common
generalized term.
3. We propose a methodology to refine a given set of clusters by incrementally moving documents
between clusters, giving more weight to the representative and discriminative features of a
cluster (a sketch of this refinement loop follows the list).
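The following sketch illustrates the refinement idea named in the third contribution: documents are reassigned to the cluster with the most similar centroid, where the centroid emphasizes features supported by many of the cluster's documents as a stand-in for the representative/discriminative weighting. The function names, centroid weighting, and stopping rule are assumptions for illustration, not the thesis's exact procedure.

```python
# A rough sketch of iterative cluster refinement with weighted cluster centroids.
import math

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def weighted_centroid(cluster):
    # Average the cluster's vectors, then emphasize features that occur in many
    # of its documents (a stand-in for "representative and discriminative").
    if not cluster:
        return {}
    centroid = {}
    for doc in cluster:
        for t, w in doc.items():
            centroid[t] = centroid.get(t, 0.0) + w / len(cluster)
    support = {t: sum(1 for d in cluster if t in d) / len(cluster) for t in centroid}
    return {t: w * support[t] for t, w in centroid.items()}

def refine(clusters, max_passes=10):
    # Repeatedly reassign each document to the cluster with the most similar
    # weighted centroid until no document moves (or a pass limit is reached).
    for _ in range(max_passes):
        centroids = [weighted_centroid(c) for c in clusters]
        moved = False
        for i, cluster in enumerate(clusters):
            for doc in list(cluster):
                if len(cluster) == 1:
                    continue  # keep at least one document per cluster
                best = max(range(len(clusters)),
                           key=lambda j: cosine(doc, centroids[j]))
                if best != i:
                    cluster.remove(doc)
                    clusters[best].append(doc)
                    moved = True
        if not moved:
            break
    return clusters
```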
To compare the performance of the proposed approaches with existing weighting approaches, we
conducted clustering experiments on two datasets: WebData and the Reuters21578 news corpus.
Experimental results show that the proposed approach improves clustering performance
over other term extraction and weighting approaches.