This study focuses on predicting water quality using machine learning models optimized through grid search methods. It evaluates the Water Quality Index (WQI) and Water Quality Classification (WQC) using various classification and regression models, achieving high accuracy with the Gradient Boosting model (99.50%) for WQC and the Multi-Layer Perceptron model (R2 of 99.8%) for WQI. The research highlights the importance of effective water quality monitoring and the potential of machine learning to enhance predictive capabilities in this field.


Multimedia Tools and Applications (2024) 83:35307–35334

https://doi.org/10.1007/s11042-023-16737-4

Water quality prediction using machine learning models based on grid search method

Mahmoud Y. Shams1 · Ahmed M. Elshewey2 · El‑Sayed M. El‑kenawy3 ·

Abdelhameed Ibrahim4 · Fatma M. Talaat1,5 · Zahraa Tarek6

Received: 17 June 2022 / Revised: 10 August 2023 / Accepted: 31 August 2023 /

Published online: 29 September 2023
© The Author(s) 2023

Abstract
Water quality is vital for humans, animals, plants, industries, and the environment. In recent decades, the quality of water has been degraded by contamination and pollution. In this paper, the challenge is to predict the Water Quality Index (WQI) and the Water Quality Classification (WQC), where the WQI is a vital indicator of water suitability. In this study, parameter optimization and tuning are used to improve the accuracy of several machine learning models that predict WQI and WQC. Grid search is used to optimize and tune the parameters of four classification models and four regression models. The random forest (RF), Extreme Gradient Boosting (XGBoost), Gradient Boosting (GB), and Adaptive Boosting (AdaBoost) models are used as classification models for predicting WQC. The K-nearest neighbor (KNN) regressor, decision tree (DT) regressor, support vector regressor (SVR), and multi-layer perceptron (MLP) regressor models are used as regression models for predicting WQI. In addition, a preprocessing step, including data imputation (mean imputation) and data normalization, was performed to fit the data and make it convenient for further processing. The dataset used in this study includes 7 features and 1991 instances. To examine the efficacy of the classification approaches, five assessment metrics were computed: accuracy, recall, precision, Matthews's Correlation Coefficient (MCC), and F1 score. To assess the effectiveness of the regression models, four assessment metrics were computed: Mean Absolute Error (MAE), Median Absolute Error (MedAE), Mean Square Error (MSE), and coefficient of determination (R2). In terms of classification, the testing findings showed that the GB model produced the best results, with an accuracy of 99.50% when predicting WQC values. According to the experimental results, the MLP regressor model outperformed the other regression models and achieved an R2 value of 99.8% when predicting WQI values.

Keywords Water quality · Machine learning models · Grid search · Water quality index ·
Water quality classification

Extended author information available on the last page of the article


1 Introduction

Water is among the most precious resources, and all life depends on it. Water contamination degrades water quality, harming the health of aquatic creatures and, by extension, the humans who consume them. This makes it critical to monitor water quality and ensure the survival of aquatic life [1]. Understanding water quality concerns and issues is also crucial for water pollution mitigation and control. To understand the condition of the aquatic ecosystem, several governments throughout the world have begun to build ecological water management programs. Roughly one billion people do not have access to clean drinking water, and two million people die every year as a consequence of polluted water and poor sanitation and hygiene. As a result, preserving freshwater quality is critical [2]. Water quality is critical to the long-term viability of a diversion plan. Poor-quality water may also be costly, since resources must be diverted to repair water delivery infrastructure whenever an issue emerges. For these reasons, the demand for better water management and water quality control has been rising, to assure safe drinking water at reasonable cost. To address these issues, systematic assessments of freshwater, disposal systems, and organizational monitoring issues are necessary [3]. Forecasting water quality entails anticipating how the health of a water system will vary at a specific moment. Assessment of water quality is critical for water quality planning and regulation. Water pollution prevention and control can be improved by forecasting future changes in water quality at varying degrees of pollution and designing reasonable prevention and control techniques. The overall quality of water should be assessed in water diversion plans, since a considerable quantity of water is transported to meet everyday drinking needs. Thus, solutions for predicting water quality should be researched [4]. Artificial intelligence (AI) and machine learning (ML) technologies are now widely used, for example against security threats [5], and focus on mapping the connection between system inputs and outputs rather than on complex operational strategies [6].
Water quality forecasting is an essential method for water planning, regulation, and monitoring; it is a necessary component of water contamination research and of water ecological protection. It is therefore crucial to develop a realistic and practical strategy for predicting water quality. Forecasting future water quality is also necessary for preventing sudden changes in water quality and for offering solutions. Precise forecasts of water quality changes can not only assure the safety of people's drinking water but also help guide fishing productivity and safeguard biodiversity [7]. Furthermore, typical water quality forecasting techniques cannot account for the effects of biology, physics, hydraulics, chemistry, and meteorology. At present, researchers are primarily concerned with enhancing the practicability and trustworthiness of groundwater forecasting techniques and have presented a range of new techniques, such as artificial neural networks (ANN), stochastic mathematics, fuzzy mathematics, 3S technology, and others, to enhance water quality forecasting and expand its range of applications [8].
The emergence of remote sensing (RS), cloud computing, the Internet of Things (IoT), big data, and artificial intelligence has created new possibilities for improving and implementing water environment monitoring technologies. Intelligent detection methods for water environmental conservation have been developed in counties and cities throughout China, relying on various types of automatic hydrological and water quality monitoring stations, wireless sensor networks (WSNs), RS monitoring systems, monitoring ships, and sophisticated underwater robots [9]. Artificial intelligence solutions may significantly reduce the cost of water supply and sanitation systems while also helping to ensure compliance with drinking water and wastewater treatment quality standards. As a result, modeling and forecasting water quality to control water contamination has received a lot of attention [10].
A Water Quality Index (WQI) is a metric used to quantify water quality for a variety of purposes. The WQI may be used to determine whether water is acceptable for consumption, industrial usage, aquatic creatures, etc. The larger the WQI, the higher the water quality [11]. The Water Quality Classification (WQC), which categorizes water as either mildly contaminated or clean, was developed using the WQI value range [12]. The WQI covers many water quality characteristics at a given location and time. Because it relies on subindex computations, calculating the WQI takes time and is frequently affected by errors. As a result, providing an efficient WQI forecasting technique is critical [13].
Machine learning can accurately model the highly nonlinear relationships of the studied system, with or without prior knowledge, by learning from a large amount of historical data that captures the system's dynamic behavior [14].
Clean water is a crucial resource on which living organisms rely. As a result, developing a water quality forecasting technique to predict future water quality has enormous social and economic significance [7].
Water quality has been greatly impacted by contamination and pollution in recent dec-
ades, which has had a negative impact on both aquatic ecosystems and human health.
Understanding and analysing water quality is critical to guaranteeing the long-term usage
and management of this valuable resource. The Water Quality Index (WQI) is a well recog-
nised indicator that gives a thorough assessment of water quality based on various param-
eters. It gives a quantitative metric that reduces the complicated nature of water quality into
a single number, allowing for easy interpretation and comparison across multiple sites and
time periods. WQI considers a variety of physical, chemical, and biological characteris-
tics such as pH, dissolved oxygen, turbidity, nutrient levels, and the presence of pollutants.
WQI gives a thorough evaluation of water quality by aggregating these factors, which supports decision-making processes linked to water resource management. Water quality classification (WQC) is an additional feature that categorises water samples into specified quality classes based on predefined thresholds. This categorization gives a realistic framework
for determining the amount of pollution in water, allowing for targeted actions and regula-
tory measures. Stakeholders can identify locations or causes of concern, prioritise remedia-
tion activities, and adopt necessary actions to safeguard water resources by grading water
quality. The study was motivated by the urgent need to address water quality degradation
and its effects. Water pollution and contamination pose serious dangers to ecosystems,
public health, and long-term development. Water quality monitoring and assessment are
essential steps in recognising possible concerns, adopting effective management plans, and
maintaining the supply of clean and safe water for diverse sectors. Traditional techniques
of water quality evaluation, which include laboratory analysis and WQI computation utilis-
ing measurable parameters, can be time consuming, costly, and restricted in their capacity
to offer real-time information. Predictive modelling provides an alternate method by esti-
mating WQI and WQC based on existing data using machine learning techniques. Water
quality may be assessed in a timely way by constructing accurate and effective prediction
models, even when direct measurement of all parameters is not possible or practicable. For
various reasons, predicting WQI and WQC using machine learning models is critical for
assessing water suitability:

• Just-in-time water quality monitoring: Predictive models allow for real-time or near-real-time estimation of WQI and WQC, which is more efficient and cost-effective than standard laboratory analysis. This capacity enables continuous water quality monitoring, early identification of degradation, and prompt reaction to possible threats or pollution occurrences.
• Partial data handling: Some metrics in water quality monitoring may have missing or incomplete data. Predictive models can cope with such scenarios by leveraging the existing data and predicting missing values, so that the WQI can be calculated even when the entire data set is not directly accessible.
• Resource optimization: With more precise WQI and WQC predictions, resources may be allocated more effectively. Decision makers can prioritize sampling efforts, direct monitoring activities to areas of interest, and optimize treatment strategies based on expected water quality classes.
• Early warning systems: Predictive models can serve as the basis for developing early warning systems for water quality issues. Through continuous monitoring and forecasting of the WQI and WQC, potential risks or deterioration in water quality can be identified in advance, enabling proactive measures to be taken to mitigate impacts and protect water resources.
Machine learning algorithms are used in this work to predict water quality index
(WQI) and water quality classification (WQC). Grid search is a vital method used for
optimizing and tuning the parameters for four classification models, namely the ran-
dom forest (RF) model, extreme gradient boosting (XGBoost) model, gradient boost-
ing (GB) model, and adaptive boosting (AdaBoost) for predicting WQC, and four
regression models, namely K-nearest neighbor (KNN) regressor model, decision tree
(DT) regressor model, support vector regressor (SVR) model, and multi-layer perceptron (MLP) regressor model for predicting WQI. In classification, the experimental results showed that the GB algorithm achieved the best results, with an accuracy of 99.5% when predicting WQC values. In regression, the experimental results showed that the MLP regressor achieved the best results, with an R2 of 99.8% when predicting WQI values. This paper's contributions are as follows:

• Data preprocessing, including data imputation (mean imputation) and data normalization, is applied to fit the data and make it convenient for further processing.
• Grid search is used to optimize and tune the parameters of four classification models to predict WQC and four regression models to predict WQI.
• To assess the performance of the classification techniques, MCC, accuracy, recall, precision, and F1 score were computed, and four evaluation metrics, MAE, MedAE, MSE, and the coefficient of determination (R2), were computed to evaluate the performance of the regression models.
• The findings showed that the GB model performed the best in predicting WQC in classification. Furthermore, the experimental findings demonstrated that the MLP regressor model performed the best in predicting WQI in regression.

The remainder of the paper is organized as follows: Section 2 provides some stud-
ies related to water quality prediction. Recommended materials and methods in this
paper are presented in Section 3. The proposed methodology of our work is illustrated
in Section 4. Section 5 shows results and discussion. Finally, the conclusion is sum-
marized in Section 6.


2 Related work

Artificial Neural Networks (ANN), Support Vector Regressions (SVR), Grey Systems
(GS), Regression Analyses (RA), and other approaches are commonly used to estimate
water quality [3]. Liu et al. [9] predicted the Yangtze River Basin’s drinking water quality
utilising a long short-term memory (LSTM) network. Dissolved oxygen (DO), pH, chemical oxygen demand (COD), and NH3-N were used to construct the LSTM model. The LSTM technique showed potential for water quality monitoring.
Sakshi Khullar and Nanhey Singh [15] presented a Bi-LSTM model based on deep
learning (DLBL-WQA) to anticipate the water quality variables of the Yamuna River in
India. A comparison showed that the suggested approach surpassed all other approaches in
terms of error rates and prediction accuracy. Sani Abba et al. [16] examined four machine learning techniques, namely Neuro-Fuzzy Inference (ANFIS), Backpropagation (BPNN), Multilayer Perceptron (MLP), and Support Vector Regressor (SVR), for predicting the water quality index (WQI). The findings demonstrated the viability of the developed techniques for forecasting the WQI at the three stations, with the neural network ensemble (NNE) giving the better modeling outcomes. The predictive comparison indicated that NNE was successful and hence may be used as a trustworthy prediction strategy.
Elbeltagi et al. [17] used four standalone techniques, namely the M5P tree model (M5P), additive regression (AR), support vector machine (SVM), and random subspace (RSS), to forecast WQI based on a variable elimination strategy. AR outperformed the other data-driven approaches. AR is offered as an optimal approach because it improves forecasting reliability with the fewest input variables, and it could thus be used to predict WQI in the Akot basin dependably and accurately. Seyed Asadollah et al. [18] presented Extra Tree Regression (ETR), an ensemble machine learning technique, for forecasting monthly WQI values along the Lam Tsuen River in Hong Kong. The comparison between ETR and conventional standalone approaches (SVR, DTR) revealed that the ETR approach delivers more reliable WQI forecasts in both the training and testing stages. Generally, the ETR approach outperformed earlier techniques for WQI forecasting in terms of predictive accuracy and the number of input variables. Moreover, Nosair et al. [19] presented a predictive regression model based on an original strategy employing SWI indicators and artificial intelligence (AI) approaches to monitor groundwater salinization due to saltwater intrusion (SWI) in the aquifer of the eastern Nile Delta, Egypt. Farid Garabaghi et al. [20] presented four machine learning techniques with ensemble learning approaches, namely Random Forest, LogitBoost, XGBoost, and AdaBoost, for classification of water quality. XGBoost outperformed the other classification methods, with an accuracy of 96.9696 percent when important characteristics were included in the classification stage. The XGBoost model is recommended as the best classification method, with a high accuracy of 95.606 percent under tenfold cross-validation when the classification stage involved seven variables selected by the Backward Feature Elimination feature selector. Mehedi Hassan et al. [21] applied machine learning algorithms such as NN, RF, SVM, BTM, and MLR to classify a water quality dataset from diverse locations throughout India. Biological oxygen demand (BOD), dissolved oxygen (DO), total coliform (TC), pH, nitrate, and electric conductivity (EC) are all factors that influence water quality. These characteristics are handled in five stages: min–max normalization for data pre-processing, missing data handling using RF, feature correlation, applied machine learning classification, and classification significance. The study's maximum accuracy, accuracy upper, kappa, and accuracy lower results are 99.83, 99.99, 99.17, and 99.07, respectively. The results revealed that conductivity, nitrate, DO, pH, BOD, and TC are the main attributes for classifying water quality, with parameter significance results of 81.494, 74.78, 105.770, 36.805, 130.173, and 105.166, respectively. Table 1 lists some of the machine learning models for water quality prediction.
According to the previous works, machine learning techniques improve prediction and classification accuracy, so the next section discusses several machine learning techniques for predicting and classifying water quality with high accuracy.

3 Materials and methods

Following the primary data preprocessing, a particular ML approach is chosen to be trained and verified using the training and validation sets. Before testing, the corresponding hyperparameters are fine-tuned until the predetermined training target is satisfied. The test dataset is then used to evaluate the trained approach and assess its performance. For clarity, the ML modeling flow chart is given in Fig. 1. The general block diagram of ML models begins with data splitting and preprocessing, followed by model selection. The selected model then undergoes training, testing, and validation. Cross-validation is used to evaluate whether the trained model has met its goals. If so, the model can proceed to testing and performance assessment. If not, the model parameters need further fine-tuning during training. To increase the effectiveness of water quality prediction in this work, eight frequently used ML approaches are refined, implemented, and used, as described below.

3.1 Classification model for predicting WQC

This section introduces four classification algorithms: RF, XGBoost, GB, and AdaBoost.

3.1.1 Random Forest (RF)

The RF method is an ensemble technique used for classification. It is a supervised machine learning method composed of numerous decision trees. Because it is an ensemble technique, it uses the majority outcome of the many decision trees, which limits the generalization error as the number of trees in the forest grows [26]. The classification and regression tree (CART) algorithm is used by the decision tree to categorize the tuples depending on the target parameter. This approach is applied in conjunction with bagging for resampling, so that the training data is resampled as each new tree forms [27].
Based on the parameters and equations listed below, a tree structure is built to categorize the features [1]. The Gini index may be used to create the decision tree for any tuple set S and is determined using the formula:

$$\mathrm{Gini}(y, S) = 1 - \sum_{c_j \in \mathrm{dom}(y)} \left( \frac{\left| \sigma_{y = c_j} S \right|}{|S|} \right)^{2} \tag{1}$$

where $\sigma_{y = c_j} S$ denotes the subset of S whose target y takes the class value $c_j$. Entropy and information gain are also important when creating a decision tree and determining its outcome. They may be computed using the following formulas:

Table 1 ML techniques for water quality prediction

| Author | Technique | Best Model | Prediction Index | Results |
| --- | --- | --- | --- | --- |
| Radhakrishnan and Pillai [22] | Support Vector Machine, Decision Tree, Naïve Bayes | Decision Tree Algorithm | Weighted arithmetic water quality index (WAWQI) | Accuracy = 98.50% |
| Danish Jain et al. [1] | Random Forest Algorithm, SVM, K-Nearest Neighbors (KNN) | Random Forest Algorithm | Water Quality Index (WQI) | Accuracy = 92.127% |
| Hmoud Al-Adhaileh and Alsaade [10] | Neuro-Fuzzy Inference (ANFIS), KNN, Feed-forward neural network (FFNN) | ANFIS for (WQI) and FFNN for (WQC) | Water Quality Classification (WQC), Water Quality Index (WQI) | Accuracy (ANFIS) = 96.17%; Accuracy (FFNN) = 100% |
| Malek et al. [12] | DT, Naive Bayes, Gradient Boosting, KNN, ANN, RF, SVM | Gradient Boosting | Water Quality Classification (WQC) | Accuracy = 94.90% |
| Khan et al. [23] | Principal Component Regression (PCR), Gradient Boosting Classifier (GBoost) | Gradient Boosting Classifier | Water Quality Index (WQI), Water Quality Status (WQS) | Accuracy (PCR) = 95%; Accuracy (GBoost) = 100% |
| Theyazn Aldhyani et al. [24] | Neural Autoregressive Network (NARNET), SVM, KNN, Naive Bayes, Long Short-Term Memory | NARNET for (WQI) and SVM for (WQC) | WQC (Water Quality Classification), WQI (Water Quality Index) | Accuracy (SVM) = 97.01%; R2 (NARNET) = 96.17 |
| Dao Khoi et al. [25] | (Adaptive boosting, GBoost, HGBoost, LGBoost, XGBoost), (DT, ET, RF), (MLP, RBF, DFFNN, CNN) | Extreme gradient boosting (XGBoost) | WQI | R2 = 0.989 and RMSE = 0.107 |


Fig. 1 The flow chart of general machine learning modeling

$$\mathrm{Entropy}(S) = -\sum_{i} p(i) \log_2 p(i) \tag{2}$$

where p(i) is the fraction of S that belongs to class i.

$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v} \frac{|S_v|}{|S|} \, \mathrm{Entropy}(S_v) \tag{3}$$

where $S_v$ denotes the subset of S for which parameter A has value v.
RF presents numerous benefits. It avoids the issue of multicollinearity, which is a disadvantage of ordinary regression analysis. It performs well in both regression and classification and handles multi-dimensional data well [28].
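As a small illustration of Eqs. (1) to (3), the following sketch computes the Gini index, entropy, and information gain for a toy label set. The function names and the four-sample dataset are hypothetical and are not taken from the study; only the Python standard library is assumed.

```python
import math
from collections import Counter


def gini(labels):
    """Gini index of a label list, as in Eq. (1)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits, as in Eq. (2)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(labels, attribute_values):
    """Entropy reduction from splitting on an attribute, as in Eq. (3)."""
    n = len(labels)
    subsets = {}
    for label, value in zip(labels, attribute_values):
        subsets.setdefault(value, []).append(label)
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


# Hypothetical toy data: two clean and two polluted samples,
# perfectly separated by a made-up "pH band" attribute.
labels = ["clean", "clean", "polluted", "polluted"]
ph_band = ["neutral", "neutral", "acidic", "acidic"]

print(gini(labels))                       # 0.5
print(entropy(labels))                    # 1.0
print(information_gain(labels, ph_band))  # 1.0
```

A balanced two-class set has the maximum Gini index (0.5) and entropy (1 bit), and an attribute that separates the classes perfectly recovers the full bit as information gain.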

3.1.2 Extreme Gradient Boosting (XGBoost)

XGBoost is a decision tree boosting approach that is distinct from the classic gradient boosting decision tree (GBDT) methodology [29]. When solving the optimization problem, the standard GBDT employs only first-order derivative information. XGBoost instead applies a second-order Taylor expansion to the loss function, which uses both the first- and second-order derivatives. The loss function also includes a regularization term to control the model's complexity and prevent overfitting. The XGBoost technique is derived as follows [28]:

$$\hat{y}_i = \phi(X_i) = \sum_{k=1}^{K} f_k(X_i), \quad f_k \in F \tag{4}$$

where $F = \{ f(x) = w_{q(x)} \}$ $(q: \mathbb{R}^m \to T,\ w \in \mathbb{R}^T)$ indicates the function space that defines a decision tree and T is the number of leaf nodes of a decision tree. The following is the loss function:

$$L(\phi) = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k) \tag{5}$$

$$\Omega(f_k) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2 \tag{6}$$

The first term in Eq. (6) penalizes the number of leaves T, while the second term penalizes the magnitude of the leaf weights w. XGBoost calculates the gain for every candidate split in the tree to assess whether the generated branch is worthwhile:

$$\mathrm{Gain} = \frac{1}{2} \left( \mathrm{Gain}_L + \mathrm{Gain}_R - \mathrm{Gain}_O \right) - \gamma \tag{7}$$

where $\mathrm{Gain}_L$ and $\mathrm{Gain}_R$ are the scores of the left and right child nodes, $\mathrm{Gain}_O$ denotes the original score of the node before splitting, and $\gamma$ is the complexity penalty incurred by the additional leaf.
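The split gain in Eq. (7) can be sketched numerically. The paper does not spell out the individual gain terms, so the sketch below assumes the form commonly used for XGBoost, in which the score of a node is G²/(H + λ), where G and H are the sums of the first- and second-order derivatives of the loss over the samples in that node. The function names and numbers are illustrative only.

```python
def node_score(grad_sum, hess_sum, lam):
    """Score of a node: G^2 / (H + lambda). Assumed form, see lead-in."""
    return grad_sum * grad_sum / (hess_sum + lam)


def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Gain of a candidate split following the structure of Eq. (7)."""
    gain_left = node_score(g_left, h_left, lam)
    gain_right = node_score(g_right, h_right, lam)
    gain_original = node_score(g_left + g_right, h_left + h_right, lam)
    return 0.5 * (gain_left + gain_right - gain_original) - gamma


# Hypothetical example with squared-error loss, where each sample has
# gradient (prediction - target) and hessian 1. Two samples go left with
# gradients -2 and -2, two go right with gradients +2 and +2.
gain = split_gain(g_left=-4.0, h_left=2.0, g_right=4.0, h_right=2.0,
                  lam=1.0, gamma=1.0)
print(gain)  # positive, so this split would be kept
```

A split is only worthwhile when the gain is positive, so a larger γ prunes more aggressively and a larger λ shrinks the scores of small nodes.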

3.1.3 Gradient Boosting (GB) model

GB is a machine learning approach that combines many weak learners, often decision trees, to produce a reliable model for classification and regression tasks. It builds the model in stages, like the other boosting strategies, and generalizes them by optimizing an appropriate differentiable loss function. In the GB method, each new learner is fitted to the errors (the negative gradient of the loss) left by the previous stage, so poorly predicted instances receive more attention in the following step. The benefits of GB include high prediction accuracy and a quick process [30]. This approach is closely related to Adaptive Boosting (AdaBoost), although AdaBoost has the disadvantage of being greatly affected by outliers and readily overwhelmed by noisy data [31].
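The stage-wise idea can be sketched as follows: a minimal gradient boosting loop for squared-error loss on one-dimensional inputs, where each stage fits a one-split regression stump to the current residuals. This is an illustrative toy, not the implementation used in the study; the helper names, the toy data, and the choices of 50 stages and a learning rate of 0.1 are assumptions.

```python
def fit_stump(xs, residuals):
    """Least-squares regression stump (one threshold) on 1-D inputs.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_stages=50, learning_rate=0.1):
    """Fit an additive model: each stage fits a stump to the residuals."""
    base = sum(ys) / len(ys)          # initial constant prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_stages):
        # For squared error, the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


# Hypothetical step-shaped data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
model = gradient_boost(xs, ys)
print(model(2.0), model(5.0))  # close to 1 and 5 respectively
```

Each stage removes a fraction (the learning rate) of the remaining error, so the training error shrinks geometrically on this toy problem.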

3.1.4 Adaptive Boosting (Adaboost) model

The AdaBoost method enhances classifier performance by integrating numerous weak learners into a single strong one. It repeatedly adjusts sample weights depending on classification errors, raising the weights of misclassified samples while reducing the weights of correctly classified samples. As a result, successive classifiers focus on misclassified data rather than specifically on minority class examples. Because AdaBoost concentrates on overall prediction performance, the method is biased toward the majority class, which contributes more to total prediction performance [32].
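The re-weighting step can be made concrete with one round of the classic discrete AdaBoost update for labels in {-1, +1}. This is a sketch of the textbook rule, not code from the study; the function name and the four-sample example are hypothetical.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """One discrete AdaBoost round; requires 0 < weighted error < 1."""
    error = sum(w for w, y, p in zip(weights, y_true, y_pred) if y != p)
    alpha = 0.5 * math.log((1.0 - error) / error)   # weight of this learner
    # exp(-alpha) shrinks correct samples, exp(+alpha) grows wrong ones.
    updated = [w * math.exp(-alpha * y * p)
               for w, y, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


# Hypothetical round: four equally weighted samples, the last one is
# misclassified by the current weak learner.
weights = [0.25, 0.25, 0.25, 0.25]
y_true = [1, 1, -1, -1]
y_pred = [1, 1, -1, 1]
alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(alpha, new_weights)
```

After the update the misclassified sample carries half of the total weight, which is why the next weak learner is pushed to get it right.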

3.2 Regression models for predicting WQI

In this section, four regression algorithms, namely KNN, DT, SVR, and MLP, are presented.

3.2.1 K‑Nearest Neighbors (KNN) model

The KNN technique predicts a sample by locating its k nearest neighbors among the provided points. In classification it assigns the majority class of those neighbors, and if there is a tie, several strategies may be employed to settle it; in regression it averages the neighbors' target values. Nevertheless, KNN is not recommended for big datasets because it defers all computation to testing time and iterates over all training data, calculating the closest neighbors each time [33]. To locate the nearest neighbor in the feature space, the Euclidean distance function ($D_i$) was used as follows:

$$D_i = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2} \tag{8}$$

where $x_1$, $x_2$, $y_1$, and $y_2$ are parameters of the input data.
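A minimal sketch of a KNN regressor that uses the Euclidean distance of Eq. (8) is shown below. The function name and the tiny dataset are hypothetical; `math.dist` requires Python 3.8 or later.

```python
import math


def knn_regress(train_points, train_targets, query, k=3):
    """Predict by averaging the targets of the k nearest training points."""
    by_distance = sorted(
        (math.dist(point, query), target)
        for point, target in zip(train_points, train_targets)
    )
    nearest = [target for _, target in by_distance[:k]]
    return sum(nearest) / len(nearest)


# Hypothetical two-feature samples with made-up WQI-like targets.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
targets = [10.0, 20.0, 30.0, 100.0]
prediction = knn_regress(points, targets, query=(0.0, 0.0), k=3)
print(prediction)  # 20.0, the mean of the three closest targets
```

Every prediction scans the full training set, which is the cost at testing time that the text warns about for large datasets.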

3.2.2 Decision Tree (DT)

The DT is a straightforward, basic approach that makes decisions depending on the values of all relevant input variables. DT chooses the root attribute based on entropy before analyzing the weights of the other variables. DT gathers all variable decisions in a top-down tree and makes its choice based on the values of particular attributes. Previous research has revealed that decision tree models work well on unbalanced data. Nevertheless, ensemble techniques based on decision trees, such as Gradient Boosting (GB) and Random Forest (RF), almost always surpass single decision trees [12]. The benefits of decision-tree-based models are their insensitivity to missing values, ability to handle both regular attributes and data, and high efficiency. Compared with other ML algorithms, decision-tree-based techniques are better for short-term forecasting and may have a faster computation speed [34].

3.2.3 Support Vector Regression (SVR)

SVR is a machine learning technique that originated from the SVM and is regarded as a promising method for solving nonlinear problems such as regression, forecasting, classification, and function estimation. The technique is effective at solving convex quadratic programming problems. Furthermore, SVR has outstanding characteristics such as avoidance of local optima, a strong mathematical formulation, strong predictive ability, and scalability. Nevertheless, the training dataset must be manually annotated, and SVR's three hyperparameters must be tuned using prior knowledge [35–37]. SVR's generic nonlinear function is as follows:

$$y(x) = W^{T} \varphi(x) + b \tag{9}$$

where y represents the relationship between the predictand and the predictors, W denotes the weight vector, φ(x) is the nonlinear mapping function of the input data, and b is the scalar threshold. Figure 2 depicts the SVR structure.

3.2.4 Multi‑Layer Perceptron (MLP) regressor

Fig. 2 Structure of the SVR model

The MLP has an input layer, an output layer, and numerous hidden layers. The input signal is passed forward through the input layer to the hidden layers, where the neurons process it before passing it forward to the output layer. The output of the MLP neural network depends only on the current input and not on preceding or future inputs; as a result, the MLP neural network is also referred to as a multilayer feed-forward neural network. MLP neural networks are among the neural network designs that are basic in structure, simple to implement, and have strong fault tolerance, resilience, scalability, and outstanding nonlinear mapping capabilities [7]. Figure 3 depicts the architecture of the MLP neural network.
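The forward pass described in this section can be sketched in a few lines. The layer sizes, weights, and the ReLU activation below are assumptions chosen for illustration; the paper does not state the architecture or activation at this point.

```python
def relu(values):
    """Element-wise rectified linear activation (assumed activation)."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """Fully connected layer: one weight row and one bias per neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def mlp_forward(inputs, layers):
    """Pass the input through hidden layers, then a linear output layer."""
    hidden = inputs
    for weights, biases in layers[:-1]:
        hidden = relu(dense(hidden, weights, biases))
    out_weights, out_biases = layers[-1]
    return dense(hidden, out_weights, out_biases)


# Hypothetical network: 2 inputs -> 2 hidden neurons -> 1 output.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),   # hidden layer
    ([[2.0, 1.0]], [0.5]),                     # output layer
]
output = mlp_forward([3.0, 1.0], layers)
print(output)  # [6.5]
```

The output is a pure function of the current input, which is the feed-forward property noted above; training (backpropagation) is not shown.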

4 Proposed methodology

Water contamination is one of the most serious environmental issues confronting humanity, and the damage it causes is mostly due to a lack of forecasting, early warning, and emergency management capabilities. As a result, the implementation of an appropriate monitoring and early warning system to enable intelligent decision making and water quality management is a critical scientific and technical issue that must be addressed promptly [38]. Several machine learning approaches have advanced rapidly in recent years. Fig. 4 shows the proposed methodology to predict the quality of water.
The proposed methodology aims to develop a machine learning model for water quality
assessment based on a dataset containing seven features: dissolved oxygen, pH, conductiv-
ity, biological oxygen demand, nitrate, fecal coliform, and total coliform. The dataset has
already undergone preprocessing, which includes mean imputation and data normalization.

Fig. 3 MLP neural network topology


Fig. 4 The proposed methodology

The data has been split into a training set (80%) and a testing set (20%). During the training
phase, a grid search with cross-validation (CV = 5) is used to tune hyperparameters for four
different models for water quality classification (RF, XGBoost, GB, and AdaBoost) and
four different models for water quality index (KNN, DT, SVR, and MLP).
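The preprocessing and splitting steps named above can be sketched as follows. The paper does not state which normalization was applied, so min-max scaling is an assumption here, and the helper names and sample values are hypothetical.

```python
import random


def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


def min_max_scale(column):
    """Rescale a column to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle row indices and hold out test_fraction of them for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test


ph = impute_mean([6.9, None, 7.7, 7.3])   # the missing value becomes the column mean
ph_scaled = min_max_scale(ph)
train, test = train_test_split(list(range(10)))  # 80% / 20% split of ten rows
```

A common refinement, not described in the paper, is to compute the imputation mean and the scaling bounds on the training set only and then apply them to the test set.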
The features of the data, the problem being handled, and the application’s performance
requirements all influence the choice of classification and regression models. The
specific models used in the water quality assessment method were chosen
based on their ability to handle the features of the water quality dataset and their
performance on similar problems. The presented ensemble models combine numerous weak
learners to create a stronger model. These models are frequently employed in classification
problems with a high number of features and complicated interactions between
the features and the target variable in the dataset. Ensemble approaches can capture these
complicated interactions and increase model accuracy. RF is well-known for its capacity
to handle high-dimensional data while avoiding overfitting, whereas XGBoost, GB, and
AdaBoost are well-known for their rapid training and prediction times as well as excellent
accuracy.
Popular regression models include KNN, DT, SVR, and MLP, which can handle diverse
types of data and relationships between features and the target variable. The KNN model is
a non-parametric model that can handle both linear and non-linear relationships between
features and the target variable. DT is a tree-based model that can manage non-linear


connections and has a straightforward interpretation. The SVR is a kernel-based model that
works well on small datasets and can manage non-linear connections. An MLP is a neural
network-based model that can handle complex interactions between features and the target
variable.
During the testing phase, the models’ performance is evaluated using various metrics
such as Mean Absolute Error (MAE), Median Absolute Error (MedAE), Mean Squared
Error (MSE), R-squared (R2) for prediction, and accuracy, recall, precision, F1 score, and
Matthews Correlation Coefficient (MCC) for classification.
Grid search is a hyperparameter tuning approach often used in machine learning to
discover the optimal hyperparameter combination for a given model. Hyperparameters are
parameters that must be specified before training the model and cannot be learnt from
data. The learning rate, the regularization parameter, the number of layers in a neural
network, and the number of trees in a random forest are all examples of hyperparameters.
Grid search exhaustively searches through all hyperparameter combinations
within a particular range or set of values. This is performed by first creating a grid
of all possible hyperparameter combinations, and then training and evaluating the model on a
validation or cross-validation set for each combination. The optimal set of hyperparameters
is the one that gives the best performance on the validation or
cross-validation set.
The grid search algorithm proceeds as follows:

• Define the hyperparameters as well as their potential values or ranges.
• Make a grid with all conceivable hyperparameter combinations.
• For each hyperparameter combination in the grid:

a. Train the model on the training set using the current hyperparameters.
b. Using a performance metric, evaluate the model on the validation or cross-validation
set (CV = 5).
c. Keep track of the performance metric.

• Choose the hyperparameter combination that produced the best performance measure.
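The steps above can be sketched in plain Python. This is an illustrative sketch rather than the code used in this study: the one-dimensional KNN regressor, the toy data, and the helper names are hypothetical, and the folds are formed by simple striding. Libraries such as scikit-learn provide the same procedure through `GridSearchCV`.

```python
from itertools import product


def knn_predict(train, x, n_neighbors, weights):
    """Predict y for x from (x, y) pairs with a 1-D k-nearest-neighbour rule."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:n_neighbors]
    if weights == "uniform":
        return sum(y for _, y in nearest) / len(nearest)
    # Distance weighting: closer neighbours count more; exact matches win outright.
    exact = [y for xi, y in nearest if xi == x]
    if exact:
        return sum(exact) / len(exact)
    inv = [(1.0 / abs(xi - x), y) for xi, y in nearest]
    return sum(w * y for w, y in inv) / sum(w for w, _ in inv)


def cv_mse(data, params, n_folds=5):
    """Mean squared error averaged over n_folds cross-validation folds."""
    fold_errors = []
    for k in range(n_folds):
        valid = data[k::n_folds]  # every n_folds-th sample, starting at k
        train = [p for i, p in enumerate(data) if i % n_folds != k]
        errors = [(knn_predict(train, x, **params) - y) ** 2 for x, y in valid]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / n_folds


def grid_search(data, grid, n_folds=5):
    """Try every combination in the grid and return (best_params, best_score)."""
    names = list(grid)
    best_params, best_score = None, float("inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cv_mse(data, params, n_folds)
        if score < best_score:  # lower MSE is better
            best_params, best_score = params, score
    return best_params, best_score


# Toy data (hypothetical): y = 2x sampled at 20 points.
data = [(float(x), 2.0 * x) for x in range(20)]
grid = {"n_neighbors": [1, 2, 3], "weights": ["uniform", "distance"]}
best_params, best_score = grid_search(data, grid)
```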

Grid search may be computationally costly, particularly when there are a large number
of hyperparameters and possible values or ranges. Using randomized search instead
of grid search can help to lower computing costs: in randomized search, only a random
subset of the hyperparameter combinations is sampled and evaluated.
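To make the contrast with grid search concrete, the sketch below draws a fixed number of combinations at random instead of enumerating all of them. The grid values, the budget of ten candidates, and the helper name `sample_combinations` are illustrative assumptions.

```python
import random
from itertools import product


def sample_combinations(grid, n_iter, seed=0):
    """Draw n_iter distinct hyperparameter combinations at random from the grid."""
    names = list(grid)
    all_combos = [
        dict(zip(names, values))
        for values in product(*(grid[name] for name in names))
    ]
    rng = random.Random(seed)
    return rng.sample(all_combos, min(n_iter, len(all_combos)))


# Illustrative grid with 5 x 10 = 50 combinations; only 10 of them are evaluated.
grid = {
    "n_estimators": [50, 100, 150, 200, 250],
    "learning_rate": [round(0.1 * i, 1) for i in range(1, 11)],
}
candidates = sample_combinations(grid, n_iter=10)
```

Each sampled candidate would then be trained and scored exactly as in grid search, so the cost scales with the sampling budget rather than with the size of the full grid.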

4.1 Dataset

The dataset used for this study is available at
https://www.kaggle.com/datasets/anbarivan/indian-water-quality-data. The dataset was
collected from lakes and rivers at several locations in India between 2005 and 2014.
The government of India collected this data to ensure that the water is safe for drinking.
The dataset consists of 1991 instances and 7 features: dissolved oxygen, pH, conductivity,
biological oxygen demand, nitrate, fecal coliform, and total coliform.
Dissolved oxygen indicates the level of oxygen dissolved in
the water, which is essential for supporting aquatic life. pH represents the acidity
or alkalinity of the water. The conductivity


Table 2 Statistical calculation of the features

Feature            Count  Mean       Std           Min   25%    50%     75%      Max
Dissolved_oxygen   1991   6.392637   1.322515e+00  0.0   5.95   6.70    7.2      11.4
PH                 1991   112.0906   1.875150e+03  0.0   6.9    7.30    7.7      67115
Conductivity       1991   1786.466   5.517290e+03  0.4   79     187.63  620.5    65700
Biological_oxygen  1991   6.940049   2.908065e+01  0.1   1.20   1.90    3.9      534.5
Nitrate            1991   1.623079   3.852301e+00  0.0   0.28   0.62    1.62307  108.7
Fecal_coliform     1991   362,529.3  8.038807e+06  0.0   41     313     4950.5   27252
Total_coliform     1991   533,687.1  1.375409e+07  0.0   118    542     2929     51109
WQI                1991   75.64109   1.359473e+01  19.3  67.38  78.74   83.7     99.8

Fig. 5 Heat map visualization of the feature correlations

of water evaluates its capacity to conduct electrical current and offers information
on the presence of dissolved solids. The Biological Oxygen Demand (BOD) is a
measurement of the quantity of dissolved oxygen consumed by microorganisms in water,
which indicates the extent of organic contamination. Nitrate is the
concentration of nitrate ions in water, which can be a sign of fertilizer or sewage pollution.
Fecal coliform is an indication of fecal pollution since it reflects the presence
of coliform bacteria in the water. Total coliform represents the total amount of
coliform bacteria from both fecal and non-fecal sources. Certain preprocessing steps
were conducted to assure the dataset’s quality and usability in the study. These
steps involve dealing with missing values and outliers, both of which are significant
problems in real-world datasets. Beyond the mean imputation and normalization noted
earlier, the specifics of these preprocessing steps are not stated. In addition, as shown
in Table 2, the study included statistical computations on the dataset attributes. These
computations include the mean, standard deviation, minimum, maximum, and quartiles,
which provide information about the data’s distribution and properties. Furthermore, the correlation


matrix of the dataset features was analyzed, as depicted in Fig. 5. The correlation matrix
explores the relationships between the different features, helping identify any significant
associations or dependencies among the variables.

4.2 Water Quality Index (WQI) computation

Water quality index (WQI) is a key indicator of water quality [39].
WQI is computed from several parameters using Eq. (10):

$$\mathrm{WQI} = \frac{\sum_{i=1}^{N} q_i \times w_i}{\sum_{i=1}^{N} w_i} \tag{10}$$

where $N$ represents the number of parameters, $q_i$ represents the quality rating scale for
parameter $i$, and $w_i$ represents the unit weight for parameter $i$. $q_i$ is computed using
Eq. (11):

$$q_i = 100 \times \left(\frac{v_i - v_{id}}{s_i - v_{id}}\right) \tag{11}$$

where $v_i$ represents the estimated value for parameter $i$, $v_{id}$ represents the ideal value for
parameter $i$ in pure water, and $s_i$ represents the standard value for parameter
$i$. The unit weight $w_i$ is computed using Eq. (12):

$$w_i = \frac{k}{s_i} \tag{12}$$

where $k$ represents the constant of proportionality and is computed using Eq. (13):

$$k = \frac{1}{\sum_{i=1}^{N} s_i} \tag{13}$$
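Equations (10) to (13) can be combined into a short routine. The sketch below follows the equations as printed; the sample measurements, ideal values, and standard values are hypothetical and are not taken from the dataset.

```python
def unit_weights(standards):
    """Eqs. (12)-(13): w_i = k / s_i with k = 1 / sum(s_i)."""
    k = 1.0 / sum(standards)
    return [k / s for s in standards]


def quality_ratings(values, ideals, standards):
    """Eq. (11): q_i = 100 * (v_i - v_id) / (s_i - v_id)."""
    return [
        100.0 * (v - vid) / (s - vid)
        for v, vid, s in zip(values, ideals, standards)
    ]


def wqi(values, ideals, standards):
    """Eq. (10): weighted average of the quality ratings."""
    w = unit_weights(standards)
    q = quality_ratings(values, ideals, standards)
    return sum(qi * wi for qi, wi in zip(q, w)) / sum(w)


# Hypothetical two-parameter sample; the numbers are for illustration only.
values = [7.5, 20.0]     # measured values v_i
ideals = [7.0, 0.0]      # ideal values v_id
standards = [8.5, 45.0]  # standard values s_i
index = wqi(values, ideals, standards)
```

Because Eq. (10) divides by the sum of the weights, the constant $k$ cancels, so the index depends only on the relative sizes of $1/s_i$. A sample whose measurements all equal their standard values receives a WQI of 100.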

Figure 6 shows the distribution of the calculated WQI feature. The summary statistics
for WQI are given in Table 2.
Table 3 lists the unit weights of the features and Table 4 gives the WQC classes.

Fig. 6 Distribution of calculated WQI


Table 3 Features unit weight

Features Name      Unit Weight
Dissolved_oxygen   0.2213
PH                 0.2604
Conductivity       0.0022
Biological_oxygen  0.4426
Nitrate            0.0492
Fecal_coliform     0.0221
Total_coliform     0.0022

Table 4 Water quality classification (WQC)

WQI Rate       Classification
0–50           Good
51–100         Poor
More than 100  Unsuitable
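The mapping in Table 4 can be written as a small function. Table 4 lists integer bounds, so how values strictly between 50 and 51 are treated is not specified; the sketch below assumes they fall in the "Poor" class, and the function name is hypothetical.

```python
def classify_wqi(wqi):
    """Map a WQI value to the class labels of Table 4."""
    if wqi <= 50:
        return "Good"
    if wqi <= 100:  # assumption: values in (50, 51) are treated as Poor
        return "Poor"
    return "Unsuitable"


labels = [classify_wqi(v) for v in (19.3, 75.6, 99.8, 120.0)]
```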

5 Results and discussion

The experiments are carried out using Jupyter Notebook (version 6.4.6). Jupyter Notebook
makes it easier to write and run Python scripts. It is widely used as an open-source tool
for implementing and executing AI and ML models. The proposed models’ performance is
compared to that of numerous existing models. The classification models’ performance was
assessed using accuracy, recall, precision, F1 score, and the Matthews
correlation coefficient (MCC). Equation (14) is used to calculate accuracy:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \tag{14}$$

where TP is true positive, TN is true negative, FP is false positive, and FN is false negative.
Recall is calculated using Eq. (15):

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{15}$$

Precision is calculated using Eq. (16):

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{16}$$

F1 score is computed using Eq. (17):

$$\mathrm{F1\,Score} = \frac{2 \times \mathrm{Recall} \times \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}} \tag{17}$$

MCC is calculated using Eq. (18):

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{18}$$
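Equations (14) to (18) translate directly into code. The sketch below uses hypothetical confusion-matrix counts, not results from this study, and does not guard against zero denominators.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute Eqs. (14)-(18) from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * recall * precision / (recall + precision)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts chosen only to exercise the formulas.
m = classification_metrics(tp=90, tn=80, fp=10, fn=20)
```

Unlike accuracy, the MCC uses all four counts in both numerator and denominator, which is why it can differ noticeably from accuracy when the classes are imbalanced.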


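Mean absolute error (MAE), median absolute error (MedAE), mean square error
(MSE), and coefficient of determination (R2) were used to assess the effectiveness of the
regression models. Equation (19) is used to compute MAE:

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_{real_i} - y_{pred_i}\right| \tag{19}$$

MedAE is calculated using Eq. (20):

$$\mathrm{MedAE} = \mathrm{median}\left(\left|y_{real_1} - y_{pred_1}\right|, \ldots, \left|y_{real_N} - y_{pred_N}\right|\right) \tag{20}$$

MSE is calculated using Eq. (21):

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_{real_i} - y_{pred_i}\right)^2 \tag{21}$$

R2 is calculated using Eq. (22):

$$R^2 = 1 - \frac{\sum_{i=1}^{N}\left(y_{real_i} - y_{pred_i}\right)^2}{\sum_{i=1}^{N}\left(y_{real_i} - \bar{y}\right)^2} \tag{22}$$

where $y_{real_i}$ and $y_{pred_i}$ are the actual and predicted values for sample $i$, $N$ is the
number of samples, and $\bar{y}$ is the mean of the actual values.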
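The four regression metrics in Eqs. (19) to (22) can be computed with the standard library alone. The input values below are hypothetical and serve only to exercise the formulas.

```python
from statistics import mean, median


def regression_metrics(y_real, y_pred):
    """Compute Eqs. (19)-(22) for paired actual and predicted values."""
    abs_err = [abs(r - p) for r, p in zip(y_real, y_pred)]
    sq_err = [(r - p) ** 2 for r, p in zip(y_real, y_pred)]
    y_bar = mean(y_real)  # mean of the actual values
    total = sum((r - y_bar) ** 2 for r in y_real)
    return {
        "mae": mean(abs_err),
        "medae": median(abs_err),
        "mse": mean(sq_err),
        "r2": 1 - sum(sq_err) / total,
    }


# Hypothetical values, not taken from the experiments in this paper.
scores = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.5, 2.5, 4.0])
```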

5.1 Water Quality Classification (WQC) prediction

The best parameters for the classification models found using the grid search approach are
shown in Table 5. The table details the tuning parameters investigated for each model, as
well as the parameter values that gave the best performance in the tuning
procedure. For the random forest model, the tuning parameters are:

• N_Estimators is the number of decision trees in the forest. The tested values
are [50, 100, 150, 200, 250]. The best parameter is 100.
• Criterion is the function that measures the quality of a split. Tested values are ’gini’
and ’entropy’. The best parameter is entropy.

Table 5 The settings of the best parameters for the classification approaches using grid search algorithm

Approaches  Parameters tuning                                          The best parameters
RF          Criterion = [‘gini’, ‘entropy’],                           Criterion = entropy,
            N_Estimators = [50,100,150,200,250]                        N_Estimators = 100
XGBoost     N_Estimators = [50,100,150,200,250],                       N_estimators = 200,
            Max_depth = [1,2,3,4,5,6,7,8,9,10],                        Max_depth = 2,
            Objective = [‘binary’, ‘logistic’]                         Objective = logistic
GB          N_estimators = [50,100,150,200,250],                       N_estimators = 250,
            Max_depth = [1,2,3,4,5,6,7,8,9,10],                        Max_depth = 1,
            Max_features = [‘auto’, ‘sqrt’, ‘log2’]                    Max_features = auto
AdaBoost    N_estimators = [50,100,150,200,250],                       N_estimators = 250,
            Learning_Rate = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]    Learning_Rate = 0.5


For the XGBoost model, the tuning parameters are:

• N_Estimators is the number of boosting rounds. Tested values are [50, 100,
150, 200, 250]. The best parameter is 200.
• Max_Depth is the maximum depth of each decision tree. Tested values are
[1,2,3,4,5,6,7,8,9,10]. The best parameter is 2.
• Objective is the learning task and corresponding objective. Tested values are ’binary’
and ’logistic’. The best parameter is logistic.

For the gradient boosting model, the tuning parameters are:

• N_Estimators is the number of boosting rounds. Tested values are [50, 100, 150,
200, 250]. The best parameter is 250.
• Max_Depth is the maximum depth of each decision tree. Tested values are
[1,2,3,4,5,6,7,8,9,10]. The best parameter is 1.
• Max_Features is the number of features to consider when looking for the best split.
Tested values are ’auto’, ’sqrt’, and ’log2’. The best parameter is auto.

For the AdaBoost model, the tuning parameters are:

• N_Estimators is the maximum number of estimators at which boosting is
terminated. Tested values are [50, 100, 150, 200, 250]. The best parameter is 250.
• Learning_Rate is the rate at which the algorithm adjusts its weights. Tested values
are [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]. The best parameter is 0.5.
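To give a sense of the search cost, the sketch below counts the combinations in each classification grid listed in Table 5 and multiplies by the five cross-validation folds. The fit counts are derived arithmetic under the assumption of one model fit per fold per combination (excluding any final refit); they are not figures reported in the paper.

```python
from math import prod

# Search spaces as listed in Table 5 (parameter names follow the paper's spelling).
classifier_grids = {
    "RF": {
        "N_Estimators": [50, 100, 150, 200, 250],
        "Criterion": ["gini", "entropy"],
    },
    "XGBoost": {
        "N_Estimators": [50, 100, 150, 200, 250],
        "Max_depth": list(range(1, 11)),
        "Objective": ["binary", "logistic"],
    },
    "GB": {
        "N_estimators": [50, 100, 150, 200, 250],
        "Max_depth": list(range(1, 11)),
        "Max_features": ["auto", "sqrt", "log2"],
    },
    "AdaBoost": {
        "N_estimators": [50, 100, 150, 200, 250],
        "Learning_Rate": [round(0.1 * i, 1) for i in range(1, 11)],
    },
}

CV_FOLDS = 5  # the paper reports CV = 5


def grid_size(grid):
    """Number of hyperparameter combinations in a full grid."""
    return prod(len(values) for values in grid.values())


sizes = {name: grid_size(grid) for name, grid in classifier_grids.items()}
fits = {name: n * CV_FOLDS for name, n in sizes.items()}
```

Under these assumptions the grids contain 10, 100, 150, and 50 combinations for RF, XGBoost, GB, and AdaBoost, respectively.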

Table 6 shows the performance of the classification models tuned with the grid search
method, namely the RF model, XGBoost model, AdaBoost model, and the proposed GB
model. The proposed GB model outperforms the alternative classification models.
It achieves an accuracy of
99.5%, F1 score of 99.4%, recall of 99.5%, precision of 99.5%, and Matthews Correlation
Coefficient (MCC) of 94.3%. The performance of the GB model can be attributed
to its ability to combine weak learners, specifically decision trees, in an ensemble
manner.
Table 7 compares the proposed GB classification model, tuned using the
grid search approach, with several studies that used the same dataset. The proposed GB
model achieved an accuracy of 99.50%. These accuracy values showcase the models’ predictive

Table 6 The performance of the classification approaches using the grid search algorithm

Models    Accuracy  F1 score  Recall  Precision  MCC
RF        99.00%    98.90%    98.90%  98.90%     88.50%
XGBoost   99.30%    99.20%    99.20%  99.20%     91.50%
AdaBoost  99.10%    99.00%    99.00%  99.00%     88.90%
GB        99.50%    99.40%    99.50%  99.50%     94.30%


Table 7 Comparison between the proposed GB classification model and several studies that used the same dataset

Studies            Model                                             Accuracy
Ref [1]            RF                                                95.98%
Ref [22]           DT                                                98.50%
Ref [24]           SVM                                               97.01%
Proposed GB model  Parameters tuning for GB model using grid search  99.50%

capabilities, with the Decision Tree model showing higher accuracy than the RF and
SVM models. However, the proposed GB model outperformed all other models, achieving
the highest accuracy of 99.50% after parameter tuning with the grid search method.
Figures 7, 8, 9 and 10 illustrate the feature importance for the RF model, XGBoost
model, GB model, and AdaBoost model, respectively, using the grid search method.
Figure 11 shows a comparison between the RF model, AdaBoost model, XGBoost
model, and GB model in terms of accuracy.

5.2 Water quality index (WQI) prediction

Table 8 shows the best regression model parameters found using the grid search
approach. The table summarizes the tuning parameters investigated for each regression
model, as well as the parameter values that resulted in the best performance
during the tuning process. For the KNN regressor, the tuning parameters
are:

Fig. 7 Feature importance for RF model


Fig. 8 Feature importance for XGBoost model

Fig. 9 Feature importance for GB model

• N_neighbors is the number of neighbors to consider for prediction. Tested values
are integers from 1 to 50. The best parameter is 1.
• Weights is the weight function used in prediction. Tested values are ’uniform’ and
’distance’. The best parameter is distance.

For the DT regressor, the tuning parameters are:

• Max_depth is the maximum depth of the decision tree. Tested values are integers from
1 to 30. The best parameter is 10.
• Random_state is the random seed for reproducibility. Tested values are integers from 1
to 50. The best parameter is 33.


Fig. 10 Feature importances for Adaboost model

Fig. 11 Comparison between RF model, AdaBoost model, XGBoost model, and GB model in terms of accuracy

Table 8 Best parameters for the regression models using grid search method

Models         Tuning parameters                                Best parameters
KNN regressor  n_neighbors = [1 to 50],                         n_neighbors = 1,
               weights = [‘uniform’, ‘distance’]                weights = distance
DT regressor   max_depth = [1 to 30],                           max_depth = 10,
               random_state = [1 to 50]                         random_state = 33
SVR            C = [1,2,3,4,5],                                 C = 2,
               epsilon = [0.1, 0.01, 0.001],                    epsilon = 0.001,
               kernel = [‘sigmoid’, ‘poly’, ‘linear’, ‘rbf’]    kernel = poly
MLP regressor  activation = [‘relu’, ‘tanh’, ‘logistic’],       activation = tanh,
               solver = [‘sgd’, ‘lbfgs’, ‘adam’],               solver = lbfgs,
               alpha = [0.01, 0.001, 0.0001]                    alpha = 0.0001


For the SVR model, the tuning parameters are:

• C is the regularization parameter. Tested values are [1, 2, 3, 4, 5]. The best parameter is
2.
• Epsilon is the margin of tolerance for errors. Tested values are [0.1, 0.01, 0.001]. The
best parameter is 0.001.
• Kernel is the kernel function used in SVR. Tested values are ’sigmoid’, ’poly’, ’linear’,
and ’rbf’. The best parameter is poly.

For the MLP regressor model, the tuning parameters are:

• Activation is the activation function in hidden layers. Tested values are ’relu’, ’tanh’,
and ’logistic’. The best parameter is tanh.
• Solver is the optimization algorithm. Tested values are ’sgd’, ’lbfgs’, and ’adam’. The
best parameter is lbfgs.
• Alpha is the L2 regularization parameter. Tested values are [0.01, 0.001, 0.0001]. The
best parameter is 0.0001.
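As a quick consistency check on Table 8, the sketch below expands each regression grid and confirms that the reported best setting is one of the enumerated combinations. It assumes that ranges such as "1 to 50" include both endpoints; the helper name `expand` is hypothetical.

```python
from itertools import product

# Search spaces and reported optima as listed in Table 8.
regression_grids = {
    "KNN": {"n_neighbors": range(1, 51), "weights": ["uniform", "distance"]},
    "DT": {"max_depth": range(1, 31), "random_state": range(1, 51)},
    "SVR": {
        "C": [1, 2, 3, 4, 5],
        "epsilon": [0.1, 0.01, 0.001],
        "kernel": ["sigmoid", "poly", "linear", "rbf"],
    },
    "MLP": {
        "activation": ["relu", "tanh", "logistic"],
        "solver": ["sgd", "lbfgs", "adam"],
        "alpha": [0.01, 0.001, 0.0001],
    },
}
reported_best = {
    "KNN": {"n_neighbors": 1, "weights": "distance"},
    "DT": {"max_depth": 10, "random_state": 33},
    "SVR": {"C": 2, "epsilon": 0.001, "kernel": "poly"},
    "MLP": {"activation": "tanh", "solver": "lbfgs", "alpha": 0.0001},
}


def expand(grid):
    """Enumerate every combination in a grid as a list of dictionaries."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*grid.values())]


combos = {name: expand(grid) for name, grid in regression_grids.items()}
counts = {name: len(c) for name, c in combos.items()}
in_grid = {name: reported_best[name] in combos[name] for name in combos}
```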

Table 9 presents the performance of the regression models obtained through the
grid search method. These models include the KNN regressor model, DT regressor model,
SVR model, and the proposed MLP regressor model. Of these models, the proposed
MLP regressor model achieves the highest performance. One significant advantage of MLP
is its ability to learn complex non-linear relationships between the input and output
variables. Through backpropagation, the MLP receives feedback on the error in its predictions
and adjusts the weights of the connections between neurons to minimize this error. This
iterative learning process allows the MLP to improve its predictive accuracy during training.
By leveraging its hidden layers and the activation functions
within them, MLP can approximate complex functions and provide accurate predictions for
regression tasks. The proposed MLP regressor model achieves a Mean Absolute
Error (MAE) of 0.003, Mean Squared Error (MSE) of 2.8 × 10⁻⁵, Median Absolute Error
(MedAE) of 0.0009, and an R-squared (R2) value of 99.8%. In contrast, the KNN regressor
model demonstrates the lowest performance with an MAE of 0.009, MSE of 0.0002,
MedAE of 0.005, and an R2 of 98.2%. A comparison between the proposed MLP regressor
model and several studies that used the same dataset is given in Table 10. The table
presents the MSE values obtained by different models, along with their corresponding

Table 9 Performance of the regression models using grid search method

Models         MAE    MSE         MedAE   R2
KNN regressor  0.009  0.0002      0.005   98.2%
DT regressor   0.005  0.0001      0.0013  99%
SVR            0.004  0.0001      0.0012  99.1%
MLP regressor  0.003  2.8 × 10⁻⁵  0.0009  99.8%


Table 10 Comparison between the proposed MLP regressor model and several studies that used the same dataset

Studies                 Model                                                 MSE
Ref [10]                ANFIS                                                 0.0029
Ref [24]                NARNET                                                0.1353
Proposed MLP regressor  Parameters tuning for MLP regressor using grid search  2.8 × 10⁻⁵

references. In [24], the NARNET model achieved an MSE of 0.1353. The ANFIS model,
on the other hand, achieved a substantially lower MSE of 0.0029, according to [10].
The proposed MLP regressor model outperformed both the NARNET and ANFIS models after
parameter tuning using the grid search approach, reaching an
MSE of 2.8 × 10⁻⁵, the lowest of all models compared. This highlights the efficacy of the
proposed MLP regressor model, particularly when parameter tuning is applied using the
grid search method, in predicting the target variable and minimizing the prediction error.
Figures 12, 13, 14 and 15 illustrate the actual values vs. predicted values for the KNN
regressor model, DT regressor model, SVR model, and the proposed MLP regressor model,
respectively, using the grid search method. Visualizing the relationship between actual and
predicted values is an essential step for evaluating a regression model
and understanding its behavior. Through this plot, one can contrast
the predicted values generated by the regression model with the actual values present in
the dataset. This comparison quickly reveals where the model’s predictions align
closely with actual observations and where discrepancies emerge. The plotted

Fig. 12 Actual values vs predicted values for KNN regressor model

Fig. 13 Actual values vs predicted values for DT regressor model

Fig. 14 Actual values vs predicted values for SVR model

Fig. 15 Actual values vs predicted values for MLP regressor model

data points enable the identification of trends or patterns in the model’s
performance across distinct ranges of the target variable. Consequently, these visual cues
shed light on the model’s strengths and weaknesses, offering an opportunity to gauge its
capacity to capture the underlying data relationships.
However, there are several potential limitations and challenges that should be considered.
The dataset used and its representation require more detail on chemical

features and representation. Data from other regions should be included as well [19], considering
the impact of climate change [40, 41]. In addition, water quality may need to be
predicted over a period of time, in which case LSTM and other recurrent neural
networks are required [42, 43].

6 Conclusion and future work

In this paper, the grid search method is used to tune the parameters of four classification
models and four regression models. The four classification models, RF, XGBoost, AdaBoost,
and GB, are used for predicting WQC. The four regression models, the KNN regressor, DT
regressor, SVR, and MLP regressor, are used
for predicting WQI. To assess the performance of the classification models, five assessment
metrics were computed: accuracy, recall, precision, F1 score, and MCC. To assess
the effectiveness of the regression models, four assessment metrics were computed: MAE,
MedAE, MSE, and coefficient of determination (R2). In terms of classification, the testing
results showed that the GB model tuned with the grid search approach produced the
best results, with an accuracy of 99.5% when predicting WQC values. In regression,
the experimental results showed that the MLP regressor model tuned with the grid search method
achieved the best results, with an R2 of 99.8% when predicting WQI values. In the future,
we intend to use recurrent neural networks with LSTM for prediction and time series
analysis of the WQI and WQC in the presence of climate change variables.

Authors’ contributions All authors contributed equally.

Funding Open access funding provided by The Science, Technology & Innovation Funding Authority
(STDF) in cooperation with The Egyptian Knowledge Bank (EKB).

Data availability Data is available at https://www.kaggle.com/datasets/anbarivan/indian-water-quality-data.

Code availability Available on Request.

Declarations
Conflicts of interest The authors declare that they have no conflicts of interest to report regarding the present
study.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License,
which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long
as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Com-
mons licence, and indicate if changes were made. The images or other third party material in this article
are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the
material. If material is not included in the article’s Creative Commons licence and your intended use is not
permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly
from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


References
1. Jain D, Shah S, Mehta H et al (2021) A Machine Learning Approach to Analyze Marine Life Sustain-
ability. In: Proceedings of International Conference on Intelligent Computing, Information and Control
Systems. Springer, pp 619–632
2. Clark RM, Hakim S, Ostfeld A (2011) Handbook of water and wastewater systems protection. In: Pro-
tecting Critical Infrastructure. Springer, pp 1–29. https://doi.org/10.1007/978-1-4614-0189-6
3. Hu Z, Zhang Y, Zhao Y et al (2019) A water quality prediction method based on the deep LSTM net-
work considering correlation in smart mariculture. Sensors 19:1420
4. Zhou J, Wang Y, Xiao F et al (2018) Water quality prediction method based on IGRA and LSTM.
Water 10:1148
5. Waqas M, Tu S, Halim Z et al (2022) The role of artificial intelligence and machine learning in
wireless networks security: principle, practice and challenges. Artif Intell Rev 55:5215–5261.
https://doi.org/10.1007/s10462-022-10143-2
6. Halim Z, Waqar M, Tahir M (2020) A machine learning-based investigation utilizing the in-text
features for the identification of dominant emotion in an email. Knowl Based Syst 208:106443.
https://doi.org/10.1016/j.knosys.2020.106443
7. Wu J, Wang Z (2022) A Hybrid Model for Water Quality Prediction Based on an Artificial Neural
Network, Wavelet Transform, and Long Short-Term Memory. Water 14:610
8. Lee S, Lee D (2018) Improved prediction of harmful algal blooms in four Major South Korea’s Riv-
ers using deep learning models. Int J Environ Res Public Health 15:1322
9. Liu P, Wang J, Sangaiah AK et al (2019) Analysis and prediction of water quality using LSTM
deep neural networks in IoT environment. Sustainability 11:2058
10. Hmoud Al-Adhaileh M, Waselallah Alsaade F (2021) Modelling and prediction of water quality by
using artificial intelligence. Sustainability 13:4259
11. Bhardwaj D, Verma N (2017) Research paper on analysing impact of various parameters on water
quality index. Int J Adv Res Comput Sci 8(5):2496–498
12. Malek NHA, Wan Yaacob WF, Md Nasir SA, Shaadan N (2022) Prediction of Water Quality Classi-
fication of the Kelantan River Basin, Malaysia, Using Machine Learning Techniques. Water 14:1067
13. Slatnia A, Ladjal M, Ouali MA, Imed M (2022) Improving prediction and classification of water
quality indices using hybrid machine learning algorithms with features selection analysis. In:
Online International Symposium on Applied Mathematics and Engineering (ISAME22), vol
1. ISAME22, Istanbul-Turkey, pp 16–17
14. Deng T, Chau K-W, Duan H-F (2021) Machine learning based marine water quality prediction for
coastal hydro-environment management. J Environ Manage 284:112051
15. Khullar S, Singh N (2022) Water quality assessment of a river using deep learning Bi-LSTM meth-
odology: forecasting and validation. Environ Sci Pollut Res 29:12875–12889
16. Abba SI, Pham QB, Saini G et al (2020) Implementation of data intelligence models coupled with ensem-
ble machine learning for prediction of water quality index. Environ Sci Pollut Res 27:41524–41539
17. Elbeltagi A, Pande CB, Kouadri S, Islam ARM (2022) Applications of various data-driven models
for the prediction of groundwater quality index in the Akot basin, Maharashtra, India. Environ Sci
Pollut Res 29:17591–17605
18. Asadollah SBHS, Sharafati A, Motta D, Yaseen ZM (2021) River water quality index prediction and
uncertainty analysis: A comparative study of machine learning models. J Environ Chem Eng 9:104599
19. Nosair AM, Shams MY, AbouElmagd LM et al (2022) Predictive model for progressive saliniza-
tion in a coastal aquifer using artificial intelligence and hydrogeochemical techniques: A case study
of the Nile Delta aquifer, Egypt. Environ Sci Pollut Res 29:9318–9340
20. Garabaghi FH, Benzer S, Benzer R (2021) Performance evaluation of machine learning models
with ensemble learning approach in classification of water quality indices based on different subset
of features. Res Square 1:1–35. https://doi.org/10.21203/rs.3.rs-876980/v2
21. Hassan MM, Hassan MM, Akter L et al (2021) Efficient Prediction of Water Quality Index (WQI)
Using Machine Learning Algorithms. Hum Centric Intell Syst 1:86–97
22. Radhakrishnan N, Pillai AS (2020) Comparison of Water Quality Classification Models using
Machine Learning. In: 2020 5th International Conference on Communication and Electronics
Systems (ICCES). IEEE, pp 1183–1188
23. Khan MSI, Islam N, Uddin J et al (2021) Water quality prediction and classification based on principal
component regression and gradient boosting classifier approach. J King Saud Univ – Comput Inform
Sci 34(8):4773–4781. https://doi.org/10.1016/j.jksuci.2021.06.003

24. Aldhyani THH, Al-Yaari M, Alkahtani H, Maashi M (2020) Water quality prediction using artificial
intelligence algorithms. Appl Bionics Biomech 2020:1–12. https://doi.org/10.1155/2020/6659314
25. Khoi DN, Quan NT, Linh DQ et al (2022) Using Machine Learning Models for Predicting the
Water Quality Index in the La Buong River, Vietnam. Water 14:1552
26. Breiman L (1999) Random forests. Statistics Department, University of California, Berkeley, pp 1–29
27. Biau G (2012) Analysis of a random forests model. J Mach Learn Res 13:1063–1095
28. Wang S, Peng H, Liang S (2022) Prediction of estuarine water quality using interpretable machine
learning approach. J Hydrol 605:127320
29. Chen T, Guestrin C (2016) XGBoost: A scalable tree boosting system. In: Proceedings of the 22nd
ACM SIGKDD international conference on knowledge discovery and data mining. pp 785–794
30. Prakash R, Tharun VP, Devi SR (2018) A comparative study of various classification techniques
to determine water quality. In: 2018 Second International Conference on Inventive Communication
and Computational Technologies (ICICCT). IEEE, pp 1501–1506
31. Friedman JH (2002) Stochastic gradient boosting. Comput Stat Data Anal 38:367–378
32. Zhou Y, Mazzuchi TA, Sarkani S (2020) M-AdaBoost-A based ensemble system for network
intrusion detection. Expert Syst Appl 162:113864
33. Beyer K, Goldstein J, Ramakrishnan R, Shaft U (1999) When is “nearest neighbor” meaningful? In:
International conference on database theory. Springer, pp 217–235
34. Lu H, Ma X (2020) Hybrid decision tree-based machine learning models for short-term water quality
prediction. Chemosphere 249:126169
35. Halim Z, Rehan M (2020) On identification of driving-induced stress using electroencephalogram
signals: A framework based on wearable safety-critical scheme and machine learning. Inf Fusion
53:66–79. https://doi.org/10.1016/j.inffus.2019.06.006
36. Chen H, Huang JJ, McBean E (2020) Partitioning of daily evapotranspiration using a modified
Shuttleworth-Wallace model, random forest and support vector regression, for a cabbage farmland. Agric
Water Manag 228:105923
37. Cheng Y, Peng J, Gu X et al (2020) An intelligent supplier evaluation model based on data-driven
support vector regression in global supply chain. Comput Ind Eng 139:105834
38. Liao Z, Li Y, Xiong W et al (2020) An In-Depth Assessment of Water Resource Responses to Regional
Development Policies Using Hydrological Variation Analysis and System Dynamics Modeling.
Sustainability 12:5814
39. Tyagi S, Sharma B, Singh P, Dobhal R (2013) Water quality assessment in terms of water quality
index. Am J Water Resour 1:34–38
40. Shams MY, Tarek Z, Elshewey AM et al (2023) A Machine Learning-Based Model for Predicting
Temperature Under the Effects of Climate Change. In: Hassanien AE, Darwish A (eds) The Power
of Data: Driving Climate Change with Data Science and Artificial Intelligence Innovations. Springer
Nature Switzerland, Cham, pp 61–81
41. Elshewey AM, Shams MY, Elhady AM et al (2023) A Novel WD-SARIMAX Model for Temperature
Forecasting Using Daily Delhi Climate Dataset. Sustainability 15:757. https://doi.org/10.3390/su15010757
42. Tarek Z, Shams MY, Elshewey AM et al (2023) Wind Power Prediction Based on Machine Learning and
Deep Learning Models. Comput Mater Contin 74:715–732. https://doi.org/10.32604/cmc.2023.032533
43. Elshewey AM, Shams MY, Tarek Z et al (2023) Weight Prediction Using the Hybrid Stacked-LSTM
Food Selection Model. Comput Syst Sci Eng 46:765–781. https://doi.org/10.32604/csse.2023.034324

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.

Authors and Affiliations

Mahmoud Y. Shams1 · Ahmed M. Elshewey2 · El‑Sayed M. El‑kenawy3 ·

Abdelhameed Ibrahim4 · Fatma M. Talaat1,5 · Zahraa Tarek6

* Mahmoud Y. Shams
mahmoud.yasin@ai.kfs.edu.eg
Ahmed M. Elshewey
ahmed.elshewey@fci.suezuni.edu.eg
El‑Sayed M. El‑kenawy
skenawy@ieee.org
Abdelhameed Ibrahim
afai79@mans.edu.eg
Fatma M. Talaat
fatma.nada@ai.kfs.edu.eg
Zahraa Tarek
zahraatarek@mans.edu.eg
1 Faculty of Artificial Intelligence, Kafrelsheikh University, Kafrelsheikh 33516, Egypt
2 Faculty of Computers and Information, Computer Science Department, Suez University, Suez, Egypt
3 Department of Communications and Electronics, Delta Higher Institute of Engineering and Technology, Mansoura 35111, Egypt
4 Computer Engineering and Control Systems Department, Faculty of Engineering, Mansoura University, Mansoura 35516, Egypt
5 Faculty of Computer Science & Engineering, New Mansoura University, Mansoura 35712, Egypt
6 Faculty of Computers and Information, Computer Science Department, Mansoura University, Mansoura 35561, Egypt

