Abstract
The process of data mining discovers interesting knowledge from large amounts of data. Given a
large amount of data, we have several data mining software systems for summarizing/characterizing,
extracting frequent patterns and association rules, classification, and clustering tasks. The data mining
based systems are employed to improve the performance of e-commerce systems, marketing, search
systems, banking, healthcare, etc. Investigating data mining approaches to extract new knowledge from
the given data is one of the active research areas. Graph model is widely employed to model the real
world. For example, a chemical compound, a protein-ligand complex, and an XML document can be
modeled as a graph or a graph transaction. In the literature, the topic of extracting the knowledge of
frequent subgraphs from the given Graph Transactional Dataset (GTD) has been extensively studied. It
has been shown that such knowledge could help in the drug discovery process. So far, in the literature,
the issue of extracting coverage-related knowledge from GTD has not yet been explored. In this thesis,
we propose new data mining models and approaches to extract two types of knowledge patterns from
GTDs.
Motivated by the issues in the process of drug discovery, as the first contribution, we propose the
model and approach to extract the coverage-related knowledge of potential subgraph patterns from the
given GTD. A subgraph pattern is a set of subgraphs. We consider that a subgraph covers the graph
transaction if it belongs to that graph transaction. A subgraph pattern is considered as Subgraph Coverage Pattern (SCP) if the ratio of the union of the graph transactions covered by the subgraphs of the
subgraph pattern to the size of GTD is not less than a user-given threshold value. We have defined
three metrics to characterize an SCP. First, the relative frequency metric ensures that each subgraph of
an SCP should appear in the threshold number of graph transactions. Second, the coverage support
metric ensures that the union of graph transactions covered by each subgraph of SCP should satisfy the
threshold percentage of GTD. Third, the overlap ratio metric ensures that the overlap ratio among the
graph transactions covered by individual subgraphs of SCP satisfies overlap ratio constraint. Given a
GTD and the user-given threshold values of relative frequency, coverage support and overlap ratio, the
problem is to extract all SCPs from the given GTD. A straightforward approach requires the extraction
of all possible subgraphs from GTD and generation of all possible subgraph patterns. In addition, for
each subgraph pattern, it requires expensive graph-based computation of relative frequency, coverage
support and overlap ratio to identify all SCPs of GTD. In this model, we propose the notions of the potential subgraphs, flat transactional modeling, and employ an overlap ratio-based candidate pruning
heuristic for efficiently extracting SCPs from GTD.
As the second contribution, with the motivation to improve the process of drug cocktail discovery, we
have proposed the model and approach to extract the coverage-related knowledge of graph transactional
patterns from the given GTD. A graph transactional pattern is a set of graph transactions. A drug
compound can be modeled as a graph transaction, and its fragments can be modeled as subgraphs. A
set of all fragments extracted from GTD is considered as a fragment set. We consider that a graph
transaction covers a fragment if the fragment belongs to that graph transaction. A graph transactional
pattern is considered as a Graph Transactional Coverage Pattern (GTCP) if the ratio of the union of
fragments covered by each of its graph transactions to the size of the fragment set satisfies the coverage
constraint. We have defined three metrics to characterize GTCP. First, the graph transactional coverage
metric ensures that each graph of GTCP should cover at least threshold percentage of fragments in
the fragment set. Second, the graph transactional pattern coverage metric ensures that the union of
fragments covered by each graph transaction of GTCP should satisfy the coverage constraint. Third,
the overlap ratio metric ensures that the overlap ratio among the fragments covered by individual graph
transactions of GTCP should satisfy the overlap ratio constraint. Given a GTD and the threshold values
of graph transactional coverage, graph transactional pattern coverage, and overlap ratio, the problem
is to extract all GTCPs from given GTD. A straightforward approach involves extracting all of the
possible fragments from GTD and generation of all possible graph transactional patterns, which increase
exponentially as the size of GTD increases. In addition, for each candidate graph transactional pattern,
requires expensive graph-based computation of graph transactional pattern coverage and overlap ratio
for extracting all GTCPs from GTD. In this work