Abstract
There are several types of spatio-temporal dynamic processes, both naturally occurring and man-made, such as climate change, floods, droughts and the spread of crime, as well as processes influenced by
multiple factors, such as agricultural crop performance and the occurrence of diseases. It is important to understand the behaviour of such phenomena because they significantly affect the places where they occur, with both direct and indirect impacts on the environment and on the population of the region. These phenomena have an intrinsic spatial as well as temporal context. To gain insights into these processes and their relationships, there is a need for a methodology that can
capture the context in both dimensions, spatial and temporal, simultaneously. At a particular location and time, the observed output value of a dynamic phenomenon can be affected
by many causative factors that act as inputs to the process and are at play simultaneously. While many studies examine the correlations between inputs and outputs to understand the phenomena, it is also important to understand the observed outcome in its intrinsic spatio-temporal context; deciphering this context may help us better appreciate the factors at play, both known and unknown, and the dynamics of the factors themselves. Hence there is a need to highlight these patterns as seen in the observed data and provide insights into
the processes that govern these phenomena. Different spatio-temporal statistical modelling methods have been developed that try to approximate the process and produce a mathematically formalized model that would generate the spatio-temporal
data. However, such modelling techniques focus more on accurate estimation of input parameters and on finding a suitable model that fits the data. They often end up ignoring spatially correlated parameters, or parameters with a weaker influence on the phenomena, which affects the overall predictive performance of the model. Alternatively, data mining approaches focus on uncovering and describing unknown patterns and trends in the data. However, in the case of spatio-temporal datasets,
the direct application of classical data mining algorithms is not suitable, since they are intended to work on transaction-like data with independent attributes. The attributes in spatial data are not independent, and there is an implicit interaction between them that classical data mining techniques cannot capture directly. Therefore, conventional data mining methods fail to effectively capture the implicit spatial or temporal characteristics of a geographic phenomenon and are unable to exploit the rich spatio-temporal relationships and patterns embedded in the datasets.
In this thesis, we make use of a generic spatio-temporal data mining algorithm called MiSTIC, as part of a generalized framework called MiSTIC++. Using MiSTIC, the framework aims to
segregate the entire area of analysis into different zones, where each zone comprises contiguous spatial samples. The method tries to create the zones such that, within each zone, the spatial interactions among its constituent elements remain preserved over time, thereby capturing the consistent
spatio-temporal context, which cannot be observed directly from the data. These spatial interactions exist between two neighboring regions, and their directionality remains the same across time despite changes in magnitude. Several input factors might be responsible for a net change in the observed value. However, in spite of the various factors
that are at play, the interactions between these individual spatial units remain very similar across time. The MiSTIC algorithm tries to capture this property of preserved spatial interactions based only on the observed output data and the neighborhood information. Knowledge of such stable zones and their intra-zonal analysis helps us understand the dynamic phenomena better, enabling better responses and better decisions with respect to the process being analyzed. The utility and generality of the proposed method are demonstrated with the help of a case study of crop yields in India. The study uses the district-level yield dataset of Paddy and Maize in the Kharif season across India for the period 2000 to 2010. The method clusters a set of districts into zones, with each zone further divided into nearly concentric sub-zones of similar performers based on the empirical rule. The results show that, excluding the districts of the North East and Jammu and Kashmir, there are 18 production zones across India for Paddy in the Kharif season and 23 for Maize, with zonal averages ranging
from 861.9 kg/ha to 3076.3 kg/ha for Paddy. MiSTIC is further extended, in this research, to indicate regions and sub-regions that are consistently high and low performers across the years, irrespective of the input simila