Abstract
Data mining is the process of extracting interesting information from large volumes of data. Data
mining algorithms are being widely used in several applications like customer relationship management,
inventory management, recommendation systems, fraud detection, and surveillance. Pattern mining is
an important task of data mining, which involves discovering interesting associations from large trans-
actional databases such as market basket data, e-commerce transactions, social network transactions,
etc. In the literature, efforts are being made to investigate different kinds of pattern mining models such
as frequent patterns, periodic patterns, utility patterns, coverage patterns, and correlated patterns. Also,
research efforts are being made to propose MapReduce-based approaches that improve the scalability of
pattern mining.
In this thesis, we investigate improved approaches to extract partial periodic patterns (3Ps) from a
given spatio-temporal database. In the literature, the model of periodic patterns has been proposed to
extract the interesting associations, which appear periodically in a large temporal database. Moreover, a
model of 3Ps has also been proposed to extract the knowledge of interesting associations, which appear
periodically in a partial manner in a large temporal database. However, the issue of extracting 3Ps from
spatio-temporal databases has not been addressed. In this thesis, we have proposed an improved model and
algorithms to extract 3Ps from spatio-temporal databases. Moreover, we have also proposed a MapReduce
framework to extract 3Ps from large spatio-temporal databases.
A 3P in the given temporal database is defined based on the user-given minimum periodic-support
and maximum inter-arrival time constraints. For extracting 3Ps from a given temporal database, Pattern-
growth and ECLAT (Equivalence Class Clustering and bottom-up Lattice Traversal) based approaches
have been proposed in the literature. It has been observed that the existing model of 3Ps
and its algorithms face performance issues when extracting 3Ps from a spatio-temporal database.
Based on the observation that itemsets (patterns) whose items are separated by a large spatial distance
may not be interesting, we have introduced the notion of maximum distance as an additional pruning
criterion and proposed an improved model of 3Ps. Overall, the proposed model employs minimum
periodic-support, maximum inter-arrival time, and maximum distance as three constraints to determine
the interestingness of a pattern in a spatio-temporal database. The minimum periodic-support is
the minimum number of periodic occurrences that a pattern must have within the data. The maximum
inter-arrival time is the maximum duration within which a pattern must reappear for its occurrence to be
considered periodic within the data. The maximum distance is the maximum allowed distance between the
items in a pattern. All patterns satisfying these three constraints are returned as interesting patterns. Based on the
new model, we have proposed both ECLAT and pattern growth based algorithms to extract 3Ps from the
given spatio-temporal databases. The efficiency of our approaches is shown by conducting experiments
on large synthetic and real-world spatio-temporal databases. We also demonstrate the usefulness of the
extracted patterns through a case study on air pollution and congestion datasets.
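The three constraints described above can be illustrated with a minimal sketch. It is not the ECLAT or pattern-growth algorithm of the thesis; it only checks a single candidate pattern by brute force. It assumes that the database is a list of (timestamp, itemset) pairs, that each item has a known 2-D location, that distance is Euclidean, and that periodic-support counts the inter-arrival times no larger than the maximum inter-arrival time. All function and variable names are hypothetical.

```python
from itertools import combinations
from math import dist


def periodic_support(timestamps, max_iat):
    """Count gaps between consecutive occurrences that are <= max_iat."""
    ordered = sorted(timestamps)
    return sum(1 for a, b in zip(ordered, ordered[1:]) if b - a <= max_iat)


def within_max_distance(pattern, locations, max_dist):
    """True if every pair of items in the pattern is at most max_dist apart.

    Assumption: Euclidean distance between hypothetical (x, y) item locations.
    """
    return all(
        dist(locations[a], locations[b]) <= max_dist
        for a, b in combinations(sorted(pattern), 2)
    )


def is_interesting(pattern, database, locations, min_ps, max_iat, max_dist):
    """Check one candidate pattern against the three constraints."""
    # The spatial check is cheap, so it is applied first as a pruning step.
    if not within_max_distance(pattern, locations, max_dist):
        return False
    # Timestamps of all transactions that contain the whole pattern.
    timestamps = [ts for ts, items in database if pattern <= items]
    return periodic_support(timestamps, max_iat) >= min_ps


# Toy data (invented for illustration only).
locations = {"a": (0, 0), "b": (1, 0), "c": (10, 0)}
database = [
    (1, {"a", "b"}),
    (2, {"a", "b", "c"}),
    (4, {"a", "b"}),
    (9, {"a", "c"}),
    (10, {"a", "b", "c"}),
]

near_pair = is_interesting({"a", "b"}, database, locations,
                           min_ps=2, max_iat=2, max_dist=2.0)
far_pair = is_interesting({"a", "c"}, database, locations,
                          min_ps=2, max_iat=2, max_dist=2.0)
```

In this toy example, the pattern {a, b} occurs at timestamps 1, 2, 4, and 10, giving inter-arrival times 1, 2, and 6; two of them are within the limit, so the pattern passes. The pattern {a, c} is rejected by the distance check before its occurrences are examined.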
To improve scalability, we have also proposed a parallel MapReduce-based pattern-growth approach
to discover 3Ps in a spatio-temporal database. The challenge is to keep the load balanced across the
machines while mining 3Ps in parallel on a large number of machines. To address this issue, we have
proposed a parallel algorithm that distributes transactional identifiers among
the machines and mines the identical itemsets independently on the different machines. We have also
proposed an improved load allocation algorithm so that the machines receive an approximately
balanced load. Experiments conducted on Apache Spark’s distributed environment show that the
proposed parallel approach improves performance significantly as the number of machines increases.
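The idea of approximately balanced load allocation can be sketched with a generic greedy heuristic. This is not the load allocation algorithm proposed in the thesis, which the abstract does not specify; it is the standard "largest task to the least-loaded machine" rule, shown only to make the goal concrete. The per-item cost estimates and all names are hypothetical.

```python
import heapq


def allocate_load(costs, num_machines):
    """Greedily assign items to machines so that total costs stay close.

    costs: dict mapping an item to its estimated mining cost (assumed given).
    Returns a dict mapping a machine index to the list of items it receives.
    """
    # Min-heap of (current load, machine index).
    heap = [(0, m) for m in range(num_machines)]
    assignment = {m: [] for m in range(num_machines)}
    # Process the most expensive items first; ties are broken by item name.
    for item, cost in sorted(costs.items(), key=lambda kv: (-kv[1], kv[0])):
        load, machine = heapq.heappop(heap)  # least-loaded machine so far
        assignment[machine].append(item)
        heapq.heappush(heap, (load + cost, machine))
    return assignment


# Toy cost estimates (invented for illustration only).
costs = {"a": 8, "b": 7, "c": 6, "d": 5, "e": 4}
assignment = allocate_load(costs, num_machines=2)
loads = {m: sum(costs[i] for i in items) for m, items in assignment.items()}
```

With these toy costs and two machines, the resulting loads are 17 and 13. The gap between any two machines under this heuristic is bounded by the cost of the largest single item.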