Abstract
The impact of climate change on water quality variables is an essential topic for sustainable
river water quality management in a warming environment and is a great environmental
concern worldwide. River Water Quality (RWQ) models aim to simulate the behavior of
various water quality variables in response to pollutants, land use changes, and climate
change. However, these models often suffer from sparse data, which leads to data
uncertainty. Over the past decades, various models have been used successfully for RWQ
modeling across different spatial and temporal scales. To simulate RWQ variables, physically
based water quality models can be used, but they require large amounts of site-specific
detailed data, including stream geometry, meteorological variables, and hydraulic properties
of the river, which are unavailable for many river systems globally. Unlike process-based
models, statistical models do not require a large number of input variables, an advantage for
the many ungauged river systems where such data are unavailable. However, their inability to
describe accurately the nonlinear characteristics of a data series is a significant
shortcoming of this approach. To overcome such limitations, artificial intelligence
algorithms, i.e., Machine Learning (ML) techniques, are widely used to address a range of
nonlinear prediction problems. Such models are well suited to extracting information from
sequential data in RWQ modeling, and they allow models to be built from a reduced number of
variables while giving more accurate simulations.
ML has been increasingly adopted due to its ability to model complex, nonlinear
relationships between RWQ variables and their predictors (e.g., air temperature (AT) and
streamflow). Simulating RWQ parameters with data-driven algorithms typically requires
additional input variables, which are unavailable for many ungauged river systems. Readily
available climatic variables, namely the maximum, minimum, and average AT, can instead be
used to build RWQ models with more accurate simulation and higher computational
efficiency. In this context, most of these ML approaches have been applied without any
detailed sensitivity analysis to identify the most influential variables to be considered in the
prediction of RWQ variables. Furthermore, the development of systematic models that combine
ML with a minimal set of input variables has not been studied intensively for predicting
RWQ variables. To address these gaps, the present study first demonstrates how new ML
approaches, such as Ridge Regression (RR), the K-Nearest Neighbors (KNN) regressor, the
Random Forest (RF) regressor, and Support Vector Regression (SVR), can be coupled with Sobol
global sensitivity analysis (GSA) to produce accurate estimates of RWQ variables. Under
anthropogenic climate change, changes in AT can affect River Water Temperature (RWT), the
primary variable that influences water quality. Therefore, the present study selected RWT as
the water quality variable to predict, with the Tunga-Bhadra River, a tropical river system
in India, as a case study. Further, the proposed ML approaches have been
combined with the Ensemble Kalman Filter (EnKF) data assimilation (DA) technique to
improve the predicted values based on the measured data. Overall, the study found that
SVR was the most robust ML model when coupled with the global sensitivity analysis and DA
techniques, particularly for predicting RWT at the monthly time scale rather than at the
daily or seasonal scales. The study also concluded that the SVR model is a strong choice for
smaller datasets and is less sensitive to outliers in the data than some other models. SVR
is also generally less computationally expensive than the other ML models considered.
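The data assimilation step described above can be illustrated with a minimal sketch of a
single stochastic EnKF analysis update. It is not the thesis implementation: it assumes a
scalar state (RWT at one time step) that is observed directly, and the function name
`enkf_update`, the ensemble size, the error standard deviations, and the temperature values
are all hypothetical choices made only for illustration.

```python
import random
import statistics


def enkf_update(forecast_ensemble, observation, obs_error_std, rng):
    """One stochastic EnKF analysis step for a scalar state observed directly (H = 1)."""
    forecast_var = statistics.variance(forecast_ensemble)  # sample forecast variance P
    obs_var = obs_error_std ** 2                           # observation error variance R
    gain = forecast_var / (forecast_var + obs_var)         # Kalman gain K = P / (P + R)
    analysis = []
    for member in forecast_ensemble:
        # Each member assimilates its own perturbed copy of the observation.
        perturbed_obs = observation + rng.gauss(0.0, obs_error_std)
        analysis.append(member + gain * (perturbed_obs - member))
    return analysis, gain


rng = random.Random(42)

# Hypothetical values: an ML-predicted RWT of 27.5 degC with an assumed forecast
# spread of 1.0 degC, and a measured RWT of 26.0 degC with 0.5 degC error.
ml_prediction = 27.5
forecast = [ml_prediction + rng.gauss(0.0, 1.0) for _ in range(200)]
analysis, gain = enkf_update(forecast, observation=26.0, obs_error_std=0.5, rng=rng)

print("Kalman gain:", round(gain, 3))
print("forecast mean:", round(statistics.mean(forecast), 3))
print("analysis mean:", round(statistics.mean(analysis), 3))
```

In this sketch the analysis mean moves from the ML prediction toward the measurement, and
the ensemble spread shrinks, which is the sense in which DA improves the predicted values
based on the measured data.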
Another source of data uncertainty in RWQ modeling is the lack of long time series that
capture interannual variability, together with the lack of consistent water quality
measurement datasets. RWQ data are generally available at a monthly scale, cover limited
durations, and contain a large number of missing values. In this context, the selection of
appropriate model
inputs, the development of models under limited data, the processing of non-stationary data,
the treatment of seasonality, and the choice of potentially influential variable lags have not
been intensively investigated in the literature, especially for the estimation of RWQ
variables. Given the missing, limited, and non-stationary data scenarios, the present thesis
developed hybrid models for RWQ variable prediction using Long Short-Term Memory
(LSTM) networks, integrated with (i) a k-nearest neighbor (k-NN) bootstrap resampling
algorithm (kNN-LSTM) to address data limitations and (ii) a discrete wavelet transform (WT)
approach (WT-LSTM) to capture time-frequency localized features. To demonstrate the
prediction of RWQ variables and to assess the impact of climate change on the river water
quality parameters, t