water

Article
Water-Quality Prediction Based on H2O AutoML and
Explainable AI Techniques
Hamza Ahmad Madni 1, * , Muhammad Umer 2 , Abid Ishaq 2 , Nihal Abuzinadah 3 , Oumaima Saidani 4 ,
Shtwai Alsubai 5 , Monia Hamdi 6 and Imran Ashraf 7, *

1 College of Electronic and Information Engineering, Beibu Gulf University, Qinzhou 535011, China
2 Department of Computer Science & Information Technology, The Islamia University of Bahawalpur,
Bahawalpur 63100, Pakistan
3 Faculty of Computer Science and Information Technology, King Abdulaziz University, P.O. Box 80200,
Jeddah 21589, Saudi Arabia
4 Department of Information Systems, College of Computer and Information Sciences, Princess Nourah bint
Abdulrahman University, P.O. Box 84428, Riyadh 11671, Saudi Arabia
5 Department of Computer Science, College of Computer Engineering and Sciences in Al-Kharj,
Prince Sattam bin Abdulaziz University, P.O. Box 151, Al-Kharj 11942, Saudi Arabia
6 Department of Information Technology, College of Computer and Information Sciences, Princess Nourah bint
Abdulrahman University, P.O. Box 84428, Riyadh 11671, Saudi Arabia
7 Department of Information and Communication Engineering, Yeungnam University,
Gyeongsan 38541, Republic of Korea
* Correspondence: hamza@bbgu.edu.cn (H.A.M.); imranashraf@ynu.ac.kr (I.A.)

Abstract: Rapid expansion of the world’s population has negatively impacted the environment,
notably water quality. As a result, water-quality prediction has arisen as a hot issue during the last
decade. Existing techniques fall short in terms of good accuracy. Furthermore, presently, the dataset
available for analysis contains missing values; these missing values have a significant effect on the
performance of the classifiers. An automated system for water-quality prediction that deals with
the missing values efficiently and achieves good accuracy for water-quality prediction is proposed
in this study. To handle the accuracy problem, this study makes use of the stacked ensemble H2O
AutoML model; to handle the missing values, this study makes use of the KNN imputer. Moreover,
the performance of the proposed system is compared to that of seven machine learning algorithms.
Experiments are performed in two scenarios: removing missing values and using the KNN imputer.
The contribution of each feature regarding prediction is explained using SHAP (SHapley Additive
exPlanations). Results reveal that the proposed stacked model outperforms other models with 97%
accuracy, 96% precision, 99% recall, and 98% F1-score for water-quality prediction.

Keywords: water-quality prediction; KNN imputer; missing values; machine learning; deep learning

Citation: Madni, H.A.; Umer, M.; Ishaq, A.; Abuzinadah, N.; Saidani, O.; Alsubai, S.; Hamdi, M.;
Ashraf, I. Water-Quality Prediction Based on H2O AutoML and Explainable AI Techniques.
Water 2023, 15, 475. https://doi.org/10.3390/w15030475

Academic Editor: Kyriaki Kalaitzidou

Received: 9 December 2022
Revised: 29 December 2022
Accepted: 5 January 2023
Published: 25 January 2023

Copyright: © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution (CC BY)
license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction
Water is one of the essential components of life; without water, no life is possible on the
earth. Although 66% of the earth is made up of water, only 1% of this water is usable, and
the rest is not safe to use because it is saltwater or saline. From an economic point of view,
water is an important part of a nation’s economy and wealth. However, during the last few
years, water levels have fallen considerably, and this has become one of the biggest emerging
problems of today’s world [1]. As the world population keeps increasing and its predicted
growth puts water resources under pressure, providing clean water to this growing population
has become a challenging task. The rapid growth of the population is a threatening situation
that directly affects the water quality (WQ) as well, and the cost to provide safe water is
increasing rapidly [2]. According to research, the lack of clean water might increase the
probability of individuals living in poverty. Water distribution is uneven between countries.
The accessible amount of water is 60% [3], which means that water is easily available to use
and abundant on Earth; it is accessible for industry, agriculture, and for drinking use [4].
Rivers and groundwater are the fundamental sources of fresh water; social and economic
development is directly linked with fresh water [5]. Due to human activities, both
surface water and groundwater are under great pressure. Activities such as commercialization,
urbanization, population growth, and industrialization have a direct impact on
water quality and quantity [6]. Additionally, climate change and global warming have a
worsening effect on water quality. Therefore, water quality evaluation and estimation are of
great concern today [7].
The index used for the assessment and classification of surface water and groundwater
is the water-quality index (WQI). WQI is a widely used parameter for water-quality
classification. For water-quality level estimation, Brown et al. [8] proposed an index. The index
is computed based on physicochemical water parameters such as pH, the concentration
of pollutants, dissolved oxygen, temperature, turbidity, and biochemical oxygen demand.
For policymakers, the WQI gives meaningful qualitative data and is helpful
for the planners of water distribution systems. The drawback of WQI is that it consists of
lengthy and complex computations that require a lot of time and effort [9].
To address these problems, an alternative, state-of-the-art system for efficient
water-quality classification (WQC) is needed.
AI-based modeling removes the complex and lengthy calculations and classifies WQI
promptly [9]. Therefore, water-quality classification using artificial intelligence-based
systems is getting the attention of many researchers. Different researchers have proposed
different WQC systems using machine learning and deep learning models. However,
such efforts often achieve low accuracy. Furthermore, the available dataset for the
experiments has some missing values that are much-needed for water-quality prediction and
have a direct impact on the results.
Clean and easily available water is required for drinking, home usage, recreational
activities, and food production. Better water-supply and resource management may signifi-
cantly increase a country’s economic development. Sufficient water should be available for
personal and domestic usage and should always be safe, easily accessible, and available
to everyone. Every year, many individuals die from kidney failure, cancer, and other
diseases caused by polluted water. Laboratory methods for classifying water quality are
resource-intensive and time-consuming. Many water-quality classification methods are
already available; however, many lack accuracy. As a result, it is very important to have
an automated system that can classify water quality with low human effort and with
time efficiency.
The continuous, diligent evaluation and acceptability of drinking water sources by
the public health community is referred to as potable-water-quality surveillance. A perfect
water distribution and monitoring system guarantees people’s health if the potable water is
treated without errors. Further, the perfect water treatment system is in vain if the
architecture of the water supply and water treatment allows contamination into the potable water.
During the last decade, concerns about water contamination have been raised. Prediction
of water quality has emerged as an important topic, as it directly relates to the survival of
life on earth. As a result, there is a vast amount of work on automated water-quality prediction
techniques. Such efforts often yield comparably low accuracy. Moreover, the dataset
available for experimentation has missing values and missing attributes. These missing values
affect the results of water-quality prediction. To address these issues efficiently, this study
makes the following contributions:
• A novel H2O AutoML stacked ensemble model is proposed that provides higher
accuracy for drinking water-quality prediction.
• For resolving the issue of missing values, experiments are performed using two
scenarios, where the first scenario involves deleting the missing values, while a K
nearest neighbor (KNN) imputer is used in the second scenario.
• Experiments are conducted to assess the performance of the KNN imputer and the
proposed H2O AutoML stacked ensemble model involving the use of several learning
models including logistic regression (LR), extra tree classifier (ETC), random forest
(RF), stochastic gradient descent classifier (SGDC), Gaussian naïve Bayes (GNB),
and gradient-boosting machine (GBM).
• The importance of different features is explained using the SHapley Additive
exPlanations (SHAP) model.
This study of WQC consists of four further sections: Section 2 briefly discusses the
previous research related to WQC. Section 3 consists of the description of the dataset,
proposed methodology, and description of the machine learning model used in this study.
Section 4 describes the results, and Section 5 discusses the conclusions of the study.

2. Related Work
Water is one of the most important resources for the existence of life, and human
needs are directly linked with the availability of water from both sources (surface and
groundwater). Thus, it is very important to have a state-of-the-art system that can classify
water quality. Many studies carried out for water-quality classification have provided
promising results. This literature review covers several previous works that used
artificial intelligence systems for water-quality index prediction.
Juna et al. [10] worked on automatic water-quality prediction using a KNN imputer
and MLP. They handled the missing values efficiently and obtained higher performance
regarding accuracy. They proposed a nine-layer multilayer perceptron (MLP) system with
a KNN imputer to deal with the missing values. They also used seven machine learning
algorithms for comparison. Experimental results show that the proposed nine-layer MLP
achieved an accuracy value of 99% for water-quality prediction using the KNN imputer.
A dependable approach was proposed by Nasir et al. [4] for predicting water quality
accurately. The authors used various machine learning and stacked ensemble learning
models for water-quality classification via the water-quality index. They used LR, RF, DT,
SVM, XGBoost, CATBoost, and MLP for this purpose. Results of the study show that
CATBoost achieved an accuracy of 94.51%. For water-quality classification, Radhakrishnan
and Pillai [2] used three machine learning models, namely DT, SVM, and NB, on multiple
datasets. The performance of the machine learning models was compared, and the results
revealed that DT achieved better classification accuracy, i.e., 98.50%.
Aldhgani et al. [11] used a non-linear autoregressive neural network (NARNET) and
long short-term memory (LSTM). In addition to these deep learning models, they also
used three machine learning models, including NB, SVM, and KNN, for experiments.
NARNET and LSTM achieved almost the same accuracy but a slightly different regression
coefficient (RLSTM= 94.21%, NARNET = 96.17%), and from machine learning models, SVM
achieved an accuracy of 97.01%. Shahra et al. [12] proposed a deep learning-based system
for water-quality classification for water distribution networks. The study aims to achieve
high accuracy and keep low time for computation. They used two learning algorithms:
ANN and SVM. ANN outperformed the SVM model in terms of accuracy and achieved an
accuracy of 94%, whereas SVM achieved an accuracy of 89%.
An adaptive neuro-fuzzy system was proposed by Hadi et al. [13] for the classification
of drinking water into two classes: safe and unsafe. They used a real-time time-series
dataset that had four water quality parameters: bacteria count, color, turbidity, and pH.
The proposed adaptive neuro-fuzzy system achieved an accuracy of 92% for detecting
contaminated data. Abuzir and Abuzir [14] used J48, MLP, and NB for water-quality
classification. They used a dataset that had 10 features. Different feature extraction techniques
were used for the dimensionality reduction of the dataset. They experimented with three
scenarios: using all features, using five features, and using two features. With all features
and with selected features, MLP outperformed the other two learning models.
Hassan et al. [15] used machine learning and deep learning models for classification of
Indian water quality data. The authors used SVM, RF, NN, multinomial logistic regression
(MLR), and bagged tree models (BTM). The results revealed that the main features, such as
total coliform, biological oxygen demand, dissolved oxygen, conductivity, pH, and nitrate,
affect the water quality classification. A study by Sillbery et al. [16] used attribute realization
(AR) and SVM for water-quality classification of the Chao Phraya River. When they used
AR-SVM on six features of river-water data, they achieved accuracy from 86% to 95%.
The study by Ahmed et al. [17] used four different features, including turbidity, pH,
TDS, and temperature, for water-quality prediction. Experimental results show that MLP
outperformed the other learning algorithms in terms of accuracy and achieved an accuracy
of 85.05% with a (3,7) configuration.
IoT-based systems play a vital role in water-quality classification. Kakkar et al. [18]
used IoT-based devices for the data collection of residential overhead tanks. After data
collection, they used machine learning and a deep learning system for WQC. Malek et al. [19]
used Kelantan River data from the years 2005 to 2020 for water-quality classification. They
employed different kinds of machine learning models. For water quality, they used 13
physical and chemical parameters. The experimental results show that gradient boosting
with a learning rate of 0.1 achieved an accuracy value of 94.90%. For water quality and
water-demand prediction, Rustam et al. [20] proposed an artificial neural network system.
The authors used an artificial neural network with one hidden layer and several dropout
and activation layers. Experiments were conducted on two datasets to predict water quality
and water consumption. For water-quality prediction, they achieved an accuracy of 96%,
while the R2 score for water consumption prediction was 0.99. A comparative analysis of
existing approaches for water-quality prediction is presented in Table 1.

Table 1. Comparative analysis of the existing approaches.

| Ref. | Methods | Dataset | Findings |
|------|---------|---------|----------|
| [10] | Machine learning and deep learning models | Water quality dataset | KNN imputer and MLP achieved 99% accuracy. |
| [4] | Machine learning models | Drinking-water-quality data of Indian states from 2005 to 2014 | CATBoost model achieved 94.51% accuracy. |
| [2] | Machine learning models | Water collected from Narmada River in India | Decision tree achieved 98.50% accuracy. |
| [11] | Machine learning model, deep learning model, and autoregressive neural network | Dataset collected from different states of India from 2005 to 2014 | SVM achieved highest results with 97.01% accuracy. |
| [12] | ANN and SVM | Dataset provided by US Environment Protection Agency | ANN outperformed with 94% accuracy. |
| [13] | Adaptive neuro-fuzzy inference system (ANFIS) | Ålesund water treatment plant (WTP) | ANFIS model detected safety condition between 92–96% in pipe network. |
| [14] | J48, NB, and MLP | Water_potability dataset | MLP achieved highest accuracy with 66% value. |
| [15] | Various machine learning models | Indian water-quality data | Multinomial logistic regression achieved 99.83% accuracy. |
| [16] | SVM | Chao Phraya River water dataset during 2008–2019 | SVM achieved 94% classification accuracy. |
| [17] | Various machine learning models | Dataset collected from PCRWR | MLP achieved 85.07% accuracy. |
| [18] | Multiple sensors interfaced with NodeMcU | Drinking water | Authors proposed a cost-efficient solution that notifies users before the water gets contaminated. |
| [19] | Series of machine learning models | Kelantan River water using data from 2005 to 2020 | Gradient boosting model achieved 94.90% accuracy. |
| [20] | Several machine learning and deep learning models | Water-quality prediction dataset and the water consumption dataset | The proposed approach (ANN) achieved 96% classification accuracy. |

3. Material and Methods


This section explains the proposed approach for predicting water quality, as well as
the machine learning models and dataset used in the experiments. Figure 1 illustrates the
proposed architecture used for experiments in water-quality prediction.

[Figure 1 diagram. Recoverable labels: Drinking Water Dataset; Water Quality Features;
Preprocessing (1. KNN Imputer, 2. Label Encoder); Feature Engineering; Train Test Split
(70% Training, 30% Testing); Stacked Ensemble H2O AutoML; Trained Model; Evaluation
(Accuracy, Precision, Recall, F-score).]

Figure 1. Workflow diagram of water-quality prediction method.

3.1. Description of Dataset


The dataset used in this study is obtained from Kaggle, a well-known data science platform.
It is known as “Water Quality” and is freely available at [21].
A brief description of the dataset is given in Table 2. The dataset comprises 935 instances
and 10 columns, including the target class ‘potable’. The target class has two values, 1 and
0, where 1 is used if the water is safe for drinking, and 0 is used if the water is not safe
for drinking.

Table 2. Dataset and its attributes.

| Feature | Description |
|---------|-------------|
| pH | Water pH (0 to 14). |
| Hardness | Soap precipitate capacity in water in mg/L. |
| Solids | Total dissolved solids in ppm. |
| Chloramines | Number of chloramines in ppm. |
| Sulfate | Sulfates dissolved in mg/L. |
| Conductivity | Electrical conductivity of water in µS/cm. |
| Organic_carbon | Organic carbon in ppm. |
| Trihalomethanes | Trihalomethanes in µg/L. |
| Turbidity | Light-emitting property of water in NTU. |
| Potability | Target class of whether the water is potable or not potable: potable is 1, and not potable is 0. |

3.2. KNN Imputer


In today’s world, a large amount of data is available to perform research and
decision-making. These data are generated from different and heterogeneous sources, so their
adequacy and relevancy may vary concerning a research objective. Often, such datasets are
limited by missing information for one or more of their attributes. This might happen due
to human error during data extraction or collection or due to erroneous conversions and other
processing routines. As a result, dealing with missing values has become an important
part of data preparation. The method of imputation is very important, as the performance
of the models is directly linked with it. The KNN imputer from scikit-learn is a common
approach for imputing missing data. It is widely used in place of traditional imputation
methods [22].
The KNN imputer facilitates the imputing of missing values in observations by utilizing a
Euclidean distance matrix to determine the nearest neighbors. The Euclidean distance is
calculated by ignoring missing values and increasing the weight of non-missing coordinates.
Euclidean distance can be calculated using the following formula:
$$D_{xy} = \sqrt{\text{weight} \times \text{squared distance from present coordinates}} \tag{1}$$

where

$$\text{weight} = \frac{\text{total number of coordinates}}{\text{number of present coordinates}} \tag{2}$$
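
The distance in Equations (1) and (2) and the imputation step built on it can be sketched in plain Python. This is a minimal illustration and not the scikit-learn implementation: the helper names `nan_euclidean` and `knn_impute`, the use of `None` for a missing value, the unweighted neighbor mean, and the toy rows are all assumptions made for the example.

```python
import math

def nan_euclidean(x, y):
    """Distance of Eqs. (1)-(2): skip missing coordinates, rescale by the weight."""
    present = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if not present:
        return math.inf  # assumption: rows with no shared coordinates are "infinitely far"
    weight = len(x) / len(present)  # total coordinates / present coordinates
    squared = sum((a - b) ** 2 for a, b in present)
    return math.sqrt(weight * squared)

def knn_impute(rows, k=2):
    """Fill each missing value with the mean of that feature over the k nearest
    rows (by nan_euclidean) in which the feature is present."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            donors = [(nan_euclidean(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled

# Hypothetical rows (pH, hardness, turbidity); the numbers are made up.
rows = [
    [7.0, 200.0, 3.0],
    [7.2, None, 3.1],
    [6.8, 190.0, 2.9],
    [9.5, 300.0, 5.0],
]
filled = knn_impute(rows, k=2)
```

In this toy case the missing hardness value is filled from the two rows whose pH and turbidity are closest, while the distant fourth row does not contribute.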

3.3. Deleting Missing Values from the Dataset


The other approach for dealing with missing data is to delete the missing values. This
approach is used in the first set of experiments, where all records with missing data
are deleted.

3.4. H2O AutoML


H2O AutoML [23] is an automated machine learning tool that is included in the H2O
system [24]. It is easy to understand and implement, is designed for enterprise
environments, and produces high-quality models. On tabular datasets, H2O
AutoML supports multiple kinds of problems, such as binary classification, multi-class
classification, and regression. Its major advantage is fast scoring: multiple H2O models
can produce predictions within very little time. Another benefit of H2O AutoML is that
it offers APIs in different languages. Due to these benefits, it is used in many different
fields. H2O AutoML is also tightly integrated with big data analytics. It is a fully
automatic supervised learning framework that is implemented in the H2O library. It is
open-source, distributed, and scalable, and it is widely used in both academia and industry.
To evaluate the performance of the learning models for water-quality detection, several
classifiers are used with the H2O AutoML technique to check the efficacy of the proposed
system. H2O version 3.10.3.1 is used to train the learning models. All learning algorithms
are implemented using the H2O AutoML module. This study uses seven learning models
for water quality classification: logistic regression [25], Gaussian naïve Bayes [26], random
forest [27], extra tree classifier [28], gradient boosting machine [29], stochastic gradient
descent [30], and H2O stacked ensemble [23].

3.5. Logistic Regression


LR is extensively used for classification. LR can deal with a large number
of features because it provides a straightforward equation for classifying instances into
two classes. To achieve the best results, we optimized several of its hyperparameters.
To compute the probability of a certain event occurring, a mathematical function called the
‘logistic regression hypothesis function’ is used. The sigmoid function is used to transform
the logistic regression output value into a probability value. The hypothesis function and
the cost function of LR can be calculated as

$$h_\Theta(X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}} \tag{3}$$

$$\text{Cost}(h_\Theta(x), y) = \begin{cases} -\log(h_\Theta(x)), & y = 1 \\ -\log(1 - h_\Theta(x)), & y = 0 \end{cases} \tag{4}$$
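
As a small worked illustration of the logistic hypothesis and its per-sample cost, the following sketch evaluates both for a single feature. The function names and the coefficient values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math

def hypothesis(x, beta0, beta1):
    """Sigmoid of the linear score beta0 + beta1 * x (logistic hypothesis)."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

def cost(h, y):
    """Log-loss for one sample with label y in {0, 1}."""
    return -math.log(h) if y == 1 else -math.log(1.0 - h)

# A score of zero maps to a probability of 0.5 (coefficients are made up).
h = hypothesis(x=0.0, beta0=0.0, beta1=1.5)
loss_if_positive = cost(h, 1)
loss_if_negative = cost(h, 0)
```

A confident correct prediction is penalized less than a confident wrong one, which is what drives the fitted coefficients toward well-separated probabilities.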

3.6. Gaussian Naïve Bayes


GNB is an advanced variant of naïve Bayes that is also based on Bayes’ theorem.
Naïve Bayes handles categorical variables efficiently, so all the variables in naïve Bayes
must be categorical. However, the water quality classification dataset consists of numeric
data, which is why we use GNB. GNB uses the partial-fit technique to handle large datasets
because, during training, it takes the data into account in chunks.

3.7. Random Forest


RF is a tree-based ensemble model. It is an advanced version of a decision tree and
is used to handle supervised learning problems. RF combines many weak learners, so it
produces highly accurate predictions. By using the different bootstrap samples, RF uses the
bagging technique for the training of many decision trees by sub-sampling of the training
dataset to obtain the bootstrap samples. The size of the bootstrap samples is the same as
the size of the training dataset. In RF, the notable issue in the construction of a tree involves
attribute identification at each level for the root node; this process is known as attribute
selection. In ensemble classification, two or more classifiers are trained, and their results
are combined using a voting process. The most-common ensemble techniques are bagging
and boosting. RF can be defined as

$$p = \text{mode}\{T_1(y), T_2(y), T_3(y), \ldots, T_m(y)\} \tag{5}$$

$$p = \text{mode}\left\{\sum_{m=1}^{m} T_m(y)\right\} \tag{6}$$

For the majority of classification tasks, the Gini index is used as a cost function for the
estimation of a split in the dataset. The Gini index can be computed using

$$\text{Gini} = 1 - \sum_{i=1}^{\text{classes}} p(i \mid t)^2 \tag{7}$$
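
The voting rule and the Gini index can be illustrated with a few lines of Python. This is a sketch under the assumption that each tree outputs a single class label and that the Gini index is computed from the class proportions at one node; the function names and the example votes are hypothetical.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Mode of the individual tree predictions (the forest's output)."""
    return Counter(tree_predictions).most_common(1)[0][0]

def gini(labels):
    """1 minus the sum of squared class proportions among the labels at a node."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

votes = [1, 0, 1, 1, 0]            # hypothetical outputs of five trees
prediction = majority_vote(votes)  # three of five trees vote for class 1
pure = gini([1, 1, 1, 1])          # a node with a single class
mixed = gini([1, 0, 1, 0])         # a node split evenly between two classes
```

A pure node has a Gini index of zero and an evenly mixed binary node has the maximum value of 0.5, so a split is preferred when it lowers this quantity in the child nodes.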

3.8. Stochastic Gradient Descent

SGD is a renowned optimization method that learns the optimized value of the model’s
parameters in each iteration to reduce the cost function ($cf$). SGD is a well-known variant of
gradient descent (GD) that introduces randomness: in each iteration, it selects a single sample
for the training of the model. SGD needs less training time because it evaluates the cost
function of only a single training sample $x_i$ at each iteration to approach a local minimum.
It does so by updating the model parameters at every iteration using the sample $x_i$ and its
target class $y_i$:

$$\Theta_j = \Theta_j - \alpha(\hat{y}_i - y_i)\, x_i^{j} \tag{8}$$

where $\alpha$ is the model learning rate, $\Theta_j$ is the parameter, and $\hat{y}_i$ is the
predicted value for sample $x_i$. For better performance, SGD uses several hyperparameters.
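
A single-sample update of this kind can be sketched as follows. The example assumes a plain linear prediction (the dot product of the parameters and the features), which is an illustrative choice rather than the classifier configuration used in the study; `sgd_step` and the sample values are hypothetical.

```python
def sgd_step(theta, x, y, alpha):
    """One stochastic update using a single training sample (x, y)."""
    y_hat = sum(t * xj for t, xj in zip(theta, x))  # linear prediction (assumption)
    error = y_hat - y
    # Each parameter moves against the error, scaled by its own feature value.
    return [t - alpha * error * xj for t, xj in zip(theta, x)]

theta = [0.0, 0.0]
sample_x, sample_y = [1.0, 2.0], 1.0  # first feature acts as a bias term
first = sgd_step(theta, sample_x, sample_y, alpha=0.1)

# Repeating the update on the same sample drives the prediction toward the target.
for _ in range(50):
    theta = sgd_step(theta, sample_x, sample_y, alpha=0.1)
fitted = sum(t * xj for t, xj in zip(theta, sample_x))
```

In practice a different sample is drawn at each iteration; the same sample is reused here only to keep the convergence easy to check.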

3.9. Gradient Boosting Machine


GBM is a boosting algorithm that is widely used for classification and regression
problems. GBM consists of three main factors: a loss function, a weak learner, and an
additive model. The additive model in gradient boosting minimizes the loss function by
combining many weak learners. It handles imbalanced datasets efficiently. The purpose of
boosting is to enhance the power of the algorithm in such a way that it can detect the model’s
weaknesses and replace them with strong learners to produce near-perfect outcomes. GBM
does this task by gradually, additively, and sequentially training numerous models.

3.10. Extra Tree Classifier


ETC is an ensemble learning model that consists of multiple unpruned decision
trees. For splitting nodes, it uses a subset of features. Unlike RF, it uses the whole
of the data for construction of each decision tree rather than using bootstrapped data.
There are three primary parameters in ETC: the number of randomized input features
selected at each node, the minimum sample size needed for splitting a node ($n_{min}$), and the
number of decision trees in the ensemble ($M$). The decision trees in ETC are less likely to be
correlated because of the randomized selection of split points. ETC aggregates the
DT predictions in the ensemble to produce the final predictions in the case of regression.

3.11. H2O AutoML Stacked


The H2O stacked ensemble is a supervised learning model that is used
to find the optimal combination of a number of prediction algorithms. The process of
finding this optimal combination is called stacking. The H2O stacked ensemble supports
binary and multi-class classification as well as regression problems. This research work
leverages an RF classifier as a base model and a gradient boosting machine as a
meta-estimator to predict drinking-water quality.

3.12. Explainable Machine Learning


To advance the decision-making sequence, traditional machine learning-based
predictions need post hoc interpretations. The function of these post hoc interpretations
is to help the community easily understand the rationale behind predictions.
Machine learning applications emphasize that interpretability is as important
as accuracy [31]. Explainable ML provides a basic add-in to machine
learning models by improving the transparency of predictions that are obtained
automatically. Such models are divided into two groups: data-driven interpretation and
model-driven interpretation. To interpret machine learning-based predictions, we used
the SHAP explainable model because it provides values that serve as a unified
measure of feature importance.

3.13. Shapley Additive Explanations


According to Lundberg and Lee [32], SHAP is used to explain ML predictions based
on game theory. Inputs are treated as players, and predictions are treated
as the payout. The contribution of each player in the game can be calculated with the
help of SHAP. Several versions of SHAP have been introduced by Lundberg and Lee,
including TreeSHAP, KernelSHAP, LinearSHAP, and DeepSHAP. These versions are for
specific machine learning model categories. In this study, TreeSHAP is used
to explain the ML predictions. TreeSHAP uses the linear explanatory model and Shapley
values for the initial prediction model estimation:

$$h(z') = \phi_0 + \sum_{i=1}^{N} \phi_i z'_i \tag{9}$$

where $z'$ represents the basic features, $\phi$ denotes the feature attribution, and $h$ shows the
explanation model. Lundberg and Lee [32] calculate each feature attribution using the
below equation:

$$\phi_i = \sum_{K \subseteq M \setminus \{i\}} \frac{|K|!\,(N - |K| - 1)!}{N!} \left[ g_x(K \cup \{i\}) - g_x(K) \right] \tag{10}$$

$$g_x(K) = E[g(x) \mid x_K] \tag{11}$$

where $M$ represents the set of all inputs, $K$ shows the input subset of a feature, and
$E[g(x) \mid x_K]$ is the expected value of the function on subset $K$. A linear additive feature
attribute method is used by SHAP for the simpler explanation.
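
The feature attribution formula can be made concrete by enumerating every subset for a very small number of features. The sketch below is exact but exponential in the number of features, so it is only an illustration of the Shapley weighting and not the TreeSHAP algorithm; the value function `g`, the `contrib` numbers, and the function name `shapley_values` are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(value, n):
    """Exact Shapley attribution for a set function value(frozenset) over features 0..n-1."""
    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            for subset in combinations(others, size):
                k = frozenset(subset)
                # Weight |K|! (N - |K| - 1)! / N! for a subset K that excludes feature i.
                w = factorial(len(k)) * factorial(n - len(k) - 1) / factorial(n)
                # Marginal contribution of feature i when added to K.
                phi += w * (value(k | {i}) - value(k))
        phis.append(phi)
    return phis

# Hypothetical additive "model": each present feature adds a fixed amount.
contrib = [2.0, -1.0, 0.5]

def g(subset):
    return sum(contrib[j] for j in subset)

phi = shapley_values(g, n=3)
```

For an additive value function, each attribution equals that feature's own contribution, and the attributions always sum to the difference between the value with all features and the value with none.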

3.14. Proposed Framework


This section describes all phases of the proposed framework and the modules
utilized in the experiments. Figure 2 elaborates on the proposed framework architecture.
The proposed framework consists of two phases, each of which is explained separately.
In Phase 1, all the learning algorithms are
implemented using H2O AutoML model selection on the dataset containing the missing
values. In Phase 2, the dataset is balanced and sparsity is removed using the KNN imputer
technique, and then the learning algorithms are implemented. The results obtained clearly
show the superiority of the H2O stacked ensemble model over the rest. After that, using
the SHAP explainable AI technique, the contributions of features toward the prediction
are also explained. SHAP reveals the proportion to which each feature participates in the
prediction of drinking-water quality. The reason for choosing RF and GBM for stacking is
that both models perform well individually as compared to other models and are suitable for
the task at hand. The SHAP technique is used to relate the final prediction to the
features so as to provide an explanation of the model’s performance.

3.15. Evaluation
Model evaluation is an important step that estimates the performance of the model on
unseen data. For water-quality classification, the four possible outcomes are described
below:
True Positive (TP): instances that are actually positive and are predicted as positive.
True Negative (TN): instances that are actually negative and are predicted as negative.
False Positive (FP): instances that are actually negative but are predicted as positive.
False Negative (FN): instances that are actually positive but are predicted as negative.
This study evaluates the proposed system in terms of accuracy, precision, recall,
and F1 score. The values of these metrics range between 0 and 1.
Accuracy is the proportion of correctly predicted instances. It can be computed using
the following formula:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{12}$$

Precision is the exactness of the classifier. Mathematically, precision can be computed
as:

$$\text{Precision} = \frac{TP}{TP + FP} \tag{13}$$

Recall is the completeness of the classifier. Mathematically, recall can be computed as:

$$\text{Recall} = \frac{TP}{TP + FN} \tag{14}$$

The harmonic mean of recall and precision is called the F1 score. It is also referred to
as the F-score. It can be calculated using the following formula:

$$F_1\text{-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{15}$$
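
Equations (12) to (15) can be written as a short helper. The sketch below is a minimal illustration; the function name and the confusion-matrix counts are hypothetical and are not results from this study. Returning 0.0 when a denominator is zero is an assumed convention.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against empty denominators (assumed convention: report 0.0).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for a binary safe/harmful classifier.
metrics = classification_metrics(tp=90, tn=85, fp=15, fn=10)
print(metrics)
```

All four values lie between 0 and 1; multiplying by 100 gives percentages in the form reported in the result tables.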

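Figure 2. Proposed stacked ensemble H2O AutoML architecture. Each instance is passed to RF and
GBM, and each model outputs the class probabilities P(1) and P(2). The probabilities are averaged
per class, P(c) = (P_RF(c) + P_GBM(c))/2, and the final prediction is argmax{P(1), P(2)}.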
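
The combination rule shown in Figure 2 can be sketched in a few lines. This is an illustration of the averaging rule depicted in the figure only, not H2O's stacked ensemble implementation; the function name and the probability values are hypothetical.

```python
def soft_vote(p_rf, p_gbm):
    """Average the per-class probabilities of RF and GBM and pick the argmax.

    Returns the zero-based index of the winning class and the averaged
    probabilities; index 0 corresponds to P(1) and index 1 to P(2) in Figure 2.
    """
    averaged = [(a + b) / 2 for a, b in zip(p_rf, p_gbm)]
    winner = max(range(len(averaged)), key=lambda c: averaged[c])
    return winner, averaged


# Hypothetical class probabilities for a single water sample.
p_rf = [0.30, 0.70]
p_gbm = [0.60, 0.40]
winner, averaged = soft_vote(p_rf, p_gbm)
print(winner, averaged)
```

In this example the two base models disagree, and the averaged probabilities decide the final class.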

4. Results and Discussions


This section discusses the results obtained from the classifiers used in the study for
water-quality classification. Python 3.0 on a Jupyter notebook is used to test the
machine and deep learning models. The performance of the learning models is assessed
using accuracy, precision, recall, and F1 score.

4.1. Experimental Results of All Learning Models by Removing Missing Data


In the first set of experiments, the missing values are removed from the dataset before
the machine learning models are trained. Table 3 shows the results of the machine learning
models obtained on this deleted-values dataset.
According to the results, among all single models, RF and GBM achieve the highest
accuracy scores of 79% and 76%, respectively. RF achieves a precision, recall, and F1
score of 79%, whereas GBM achieves a precision, recall, and F1 score of 76%. LR has the
lowest accuracy, with a value of 48% for accuracy, precision, recall, and F1 score. Overall,
the performance of the individual machine learning models on the deleted-values dataset
is unsatisfactory. The H2O stacked ensemble outperforms all single models, with accuracy
and recall of 87% and precision and F1 score of 85% and 86%, respectively. Figure 3 shows
a graphical depiction of the machine learning model results after the missing-value data
have been removed. The results show that the performances of GBM and RF are
acceptable, whereas the remaining single models perform poorly.

Table 3. Results (%) using deleted-values dataset.

Model         Accuracy   Precision   Recall   F1-Score
LR            48         48          48       48
GNB           52         54          52       47
ETC           72         72          72       72
RF            79         79          79       79
SGDC          50         25          50       33
GBM           76         76          76       76
H2O Stacked   87         85          87       86

Figure 3. Accuracy, precision, and other metrics for models using the deleted-values dataset.

4.2. Experimental Results of All Learning Models by Filling Values with KNN Imputer
In the second set of experiments, the missing values discovered in the dataset after
preprocessing are filled in using the KNN imputer instead of being deleted. The KNN
imputer computes each missing value as the mean of that feature over the nearest
neighbors, which are found using the Euclidean distance. Once the missing values are
imputed, the data are used for the machine learning model experiments. Table 4 displays
the results of the machine learning models produced with the KNN imputer.
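
The imputation step can be sketched in plain Python as follows. This is a simplified illustration rather than the implementation used in the experiments: the distance is computed only over the features observed in both rows and rescaled for the skipped ones (an assumed convention), the value of k is arbitrary, and the sample rows are hypothetical.

```python
import math


def nan_euclidean(a, b):
    """Euclidean distance over features observed in both rows (None = missing),
    rescaled to compensate for the features that had to be skipped."""
    pairs = [(u, v) for u, v in zip(a, b) if u is not None and v is not None]
    if not pairs:
        return math.inf
    squared = sum((u - v) ** 2 for u, v in pairs)
    return math.sqrt(len(a) / len(pairs) * squared)


def knn_impute(rows, k=2):
    """Return a copy of rows in which each missing entry is replaced by the mean
    of that column over the k nearest rows that have the column observed."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            donors = [(nan_euclidean(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


# Hypothetical rows of (pH, sulfate, hardness); the second row lacks sulfate.
data = [
    [7.0, 330.0, 200.0],
    [7.2, None, 205.0],
    [6.9, 340.0, 198.0],
    [9.5, 250.0, 120.0],
]
imputed = knn_impute(data, k=2)
print(imputed[1])
```

Here the two nearest rows to the incomplete sample are the first and third, so the missing sulfate value becomes the mean of their sulfate values. In practice the features would usually be scaled first so that no single feature dominates the distance.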

Table 4. Experimental results (%) using machine learning models with KNN imputer data.

Model         Accuracy   Precision   Recall   F1-Score
LR            61         38          61       47
GNB           61         38          61       47
ETC           72         73          72       72
RF            80         80          80       80
SGDC          59         56          59       55
GBM           80         80          80       79
H2O Stacked   97         96          99       98

The results show that RF and GBM both reach 80% accuracy. RF obtains 80% precision,
recall, and F1 score, while GBM has a precision and a recall of 80% but an F1 score of
79%. SGDC attains an accuracy score of 59%. In terms of accuracy, precision, recall, and
F1 score, the H2O stacked model once again outperforms all individual models. The
machine learning model results obtained with the KNN imputer are depicted in Figure 4,
which illustrates that using the KNN imputer enhances the performance of the machine
learning models.

Figure 4. Results of models using KNN-imputer-filled dataset.

4.3. Accuracy Comparison of All Learning Models with KNN Imputer and Removing Missing Data
For a clearer performance analysis, the results obtained from the learning models with
and without the KNN imputer are compared in this section. The experimental results show
that the learning models perform better when the KNN imputer is used to fill in the
missing values than when the missing values are deleted. The accuracy comparison of all
learning models with the KNN imputer and with the removal of missing data is shown in
Table 5.

Table 5. Performance comparison of the machine learning models with and without KNN imputer.

Model         Accuracy (%), KNN Imputer   Accuracy (%), Deletion of Missing Values
LR            61                          48
GNB           61                          52
ETC           72                          72
RF            80                          79
SGDC          59                          50
GBM           80                          76
H2O Stacked   97                          87

Figure 5 shows the performance of the machine learning models on the deleted-values
dataset versus the dataset imputed with the KNN imputer. The KNN imputer improves the
accuracy of every model except ETC, which is unchanged at 72%. The largest gains are
obtained by LR (48% to 61%) and the H2O stacked ensemble (87% to 97%).

Figure 5. Graphical illustration of model performance using KNN imputer.

4.4. Results of Cross-Validation


A 10-fold cross-validation is also used to validate the performance of the proposed
approach, and the results are presented in Table 6. It can be observed that the proposed
model provides an average accuracy of 97%, while the average values for precision, recall,
and F1 score are 96%, 99%, and 97%, respectively.

Table 6. Significance of proposed model with k-fold validation (all values in %).

K-Folds     Accuracy   Precision   Recall   F1-Score
1st-Fold    96         94          96       95
2nd-Fold    98         95          98       97
3rd-Fold    98         96          97       96
4th-Fold    97         97          98       97
5th-Fold    97         98          99       98
6th-Fold    97         98          100      99
7th-Fold    96         98          99       98
8th-Fold    97         98          99       98
9th-Fold    98         98          100      99
10th-Fold   96         96          99       97
Average     97         96          99       97
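
The validation procedure can be sketched as follows. The `k_fold_indices` helper is a hypothetical, unshuffled splitter written for illustration (in practice the data would usually be shuffled or stratified first), and the per-fold percentages are copied from Table 6. The computed means are unrounded, whereas the table reports whole percentages.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs so that each sample is tested exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    sizes = [n_samples // k + (1 if f < n_samples % k else 0) for f in range(k)]
    start = 0
    for size in sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# Per-fold percentages copied from Table 6.
fold_scores = {
    "accuracy":  [96, 98, 98, 97, 97, 97, 96, 97, 98, 96],
    "precision": [94, 95, 96, 97, 98, 98, 98, 98, 98, 96],
    "recall":    [96, 98, 97, 98, 99, 100, 99, 99, 100, 99],
    "f1":        [95, 97, 96, 97, 98, 99, 98, 98, 99, 97],
}
means = {name: sum(values) / len(values) for name, values in fold_scores.items()}
print(means)

# Example split of 25 hypothetical samples into 10 folds.
folds = list(k_fold_indices(25, k=10))
print([len(test) for _, test in folds])
```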

4.5. Comparison with State-of-the-Art Approaches


To demonstrate the significance of the proposed approach, its performance is compared
with several recent studies on water-quality prediction. The models from these earlier
studies are implemented on the current dataset. As shown by the comparison in Table 7,
the proposed technique outperforms the other approaches.

Table 7. Comparison of proposed approach with state-of-the-art approaches for water-quality prediction.

Reference   Year   Model                         Accuracy
[11]        2020   SVM                           91%
[33]        2021   DT                            95%
[34]        2021   LDA+LSTM-RNN                  88%
[12]        2021   ANN                           94%
[20]        2022   ANN                           96%
Proposed    2022   Stacked Ensemble H2O AutoML   97%

4.6. Explainable Artificial Intelligence


SHAP highlights the importance of each feature for the task of water-quality prediction.
SHAP feature importance is a superior approach to traditional alternatives, but
in isolation, it provides little additional value. Beeswarm plots are a more complex and
information-rich display of SHAP values that reveal not only the relative importance of
features but also their actual relationships with the predicted outcome. The SHAP summary
plot depicts the contribution of each feature for each instance (row of data). The sum of
the feature contributions and the bias term equals the model's raw prediction, i.e., the
prediction before the inverse link function is applied. The SHAP feature contributions
are shown in Figure 6. It can be observed that pH and sulfate are the features that play
a major role in predicting water quality. They have a large number of instances in favor
of predicting the water as safe. The rest of the features largely support predicting the
drinking water as harmful.

Figure 6. Graphical representation of SHAP feature importance.

The SHAP explanation demonstrates the contribution of the features to a specific instance.
The sum of the feature contributions and the bias term equals the model's raw prediction,
i.e., the prediction before the inverse link function is applied. The H2O stacked ensemble
uses TreeSHAP, which can increase the contribution of a feature that has no impact on the
prediction when the features are correlated. Figure 7 again shows that the values of
pH and sulfate are in favor of the prediction of water quality as safe.
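
The additive relationship between the contributions, the bias term, and the raw prediction can be illustrated with a small sketch. It assumes a logistic (sigmoid) inverse link for the binary classifier, and the bias and contribution values are hypothetical rather than taken from Figure 7.

```python
import math


def explain_prediction(bias, contributions):
    """Sum the bias and per-feature contributions into the raw prediction,
    then apply the inverse link (assumed here to be the logistic function)."""
    raw = bias + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-raw))
    return raw, probability


# Hypothetical contributions for one water sample (positive values favor "safe").
contributions = {"pH": 0.8, "sulfate": 0.5, "hardness": -0.3}
raw, probability = explain_prediction(bias=-0.2, contributions=contributions)
print(raw, probability)
```

Positive contributions push the raw prediction toward one class and negative contributions push it toward the other; the inverse link then maps the raw value to a probability.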

Figure 7. SHAP explanation.

5. Conclusions
The survival of mankind is not possible without safe drinking water. Polluted water has
many adverse effects on human health that ultimately result in severe and life-threatening
diseases. Due to widespread urbanization, drinking-water channels are mixed with polluted
water, which makes it difficult for people to find safe drinking water. This research work
provides a stacked ensemble framework that accurately classifies safe and harmful drinking
water. The proposed stacked H2O AutoML framework performs best with the KNN imputer
technique, which deals with the missing values of the dataset. Experiments are carried out
in two phases: by filling in missing values with the KNN imputer and by deleting missing
values. The results reveal that using the KNN imputer to fill in the missing values is the
better choice, as deleting the missing-value data causes information loss that affects the
performance of the models. The contribution of each feature to the prediction is explained
using an explainable AI technique, SHAP. The proposed approach obtains 97% accuracy when
used with the KNN imputer.
Author Contributions: Conceptualization, H.A.M. and M.U.; Data curation, S.A.; Formal analysis,
M.H. and O.S.; Funding acquisition, H.A.M. and M.H.; Investigation, M.U.; Methodology, A.I.; Project
administration, A.I.; Resources, O.S.; Software, A.I. and S.A.; Supervision, I.A. and N.A.; Validation,
I.A.; Visualization, S.A. and M.H.; Writing—original draft, H.A.M. and M.U.; Writing—review and
editing, N.A. and I.A. All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by College of Electronic and Information Engineering, Beibu
Gulf University, Qinzhou 535011, China and by Princess Nourah bint Abdulrahman University
Researchers Supporting Project number (PNURSP2023R125), Princess Nourah bint Abdulrahman
University, Riyadh, Saudi Arabia.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The datasets are available from the authors upon request.
Acknowledgments: This study is supported via funding from Prince Sattam bin Abdulaziz
University project number (PSAU/2023/R/1444).
Conflicts of Interest: The authors declare no conflicts of interest.

References
1. Muhammad, S.Y.; Makhtar, M.; Rozaimee, A.; Aziz, A.A.; Jamal, A.A. Classification model for water quality using machine
learning techniques. Int. J. Softw. Eng. Its Appl. 2015, 9, 45–52. [CrossRef]
2. Radhakrishnan, N.; Pillai, A.S. Comparison of water quality classification models using machine learning. In Proceedings of
the 2020 5th International Conference on Communication and Electronics Systems (ICCES), Coimbatore, India, 10–12 June 2020;
pp. 1183–1188.
3. Walley, W.; Džeroski, S. Biological monitoring: A comparison between Bayesian, neural and machine learning methods of water
quality classification. In Environmental Software Systems; Springer: Berlin/Heidelberg, Germany, 1996; pp. 229–240.
4. Nasir, N.; Kansal, A.; Alshaltone, O.; Barneih, F.; Sameer, M.; Shanableh, A.; Al-Shamma’a, A. Water quality classification using
machine learning algorithms. J. Water Process Eng. 2022, 48, 102920. [CrossRef]
5. Nouraki, A.; Alavi, M.; Golabi, M.; Albaji, M. Prediction of water quality parameters using machine learning models: A case
study of the Karun River, Iran. Environ. Sci. Pollut. Res. 2021, 28, 57060–57072. [CrossRef] [PubMed]
6. Ambade, B.; Sethi, S.S.; Giri, B.; Biswas, J.K.; Bauddh, K. Characterization, behavior, and risk assessment of polycyclic aromatic
hydrocarbons (PAHs) in the estuary sediments. Bull. Environ. Contam. Toxicol. 2022, 108, 243–252. [CrossRef]
7. Singha, S.; Pasupuleti, S.; Singha, S.S.; Singh, R.; Kumar, S. Prediction of groundwater quality using efficient machine learning
technique. Chemosphere 2021, 276, 130265. [CrossRef]
8. Brown, R.M.; McClelland, N.I.; Deininger, R.A.; Tozer, R.G. A water quality index-do we dare. Water Sew. Work. 1970,
117, 339–343.
9. Bui, D.T.; Khosravi, K.; Tiefenbacher, J.; Nguyen, H.; Kazakis, N. Improving prediction of water quality indices using novel
hybrid machine-learning algorithms. Sci. Total Environ. 2020, 721, 137612. [CrossRef]
10. Juna, A.; Umer, M.; Sadiq, S.; Karamti, H.; Eshmawi, A.; Mohamed, A.; Ashraf, I. Water Quality Prediction Using KNN Imputer
and Multilayer Perceptron. Water 2022, 14, 2592. [CrossRef]
11. Aldhyani, T.H.; Al-Yaari, M.; Alkahtani, H.; Maashi, M. Water quality prediction using artificial intelligence algorithms. Appl.
Bionics Biomech. 2020, 2020. [CrossRef]
12. Shahra, E.Q.; Wu, W.; Basurra, S.; Rizou, S. Deep Learning for Water Quality Classification in Water Distribution Networks. In
Proceedings of the International Conference on Engineering Applications of Neural Networks, Crete, Greece, 17–20 June 2021;
Springer: Berlin/Heidelberg, Germany, 2021; pp. 153–164.
13. Mohammed, H.; Hameed, I.A.; Seidu, R. Machine learning: Based detection of water contamination in water distribution
systems. In Proceedings of the Genetic and Evolutionary Computation Conference Companion, Kyoto, Japan, 15–19 July 2018;
pp. 1664–1671.

14. Abuzir, S.Y.; Abuzir, Y.S. Machine learning for water quality classification. Water Qual. Res. J. 2022, 57, 152–164. [CrossRef]
15. Hassan, M.M.; Hassan, M.M.; Akter, L.; Rahman, M.M.; Zaman, S.; Hasib, K.M.; Jahan, N.; Smrity, R.N.; Farhana, J.; Raihan,
M.; et al. Efficient prediction of water quality index (WQI) using machine learning algorithms. Hum.-Centric Intell. Syst. 2021,
1, 86–97. [CrossRef]
16. Sillberg, C.V.; Kullavanijaya, P.; Chavalparit, O. Water quality classification by integration of attribute-realization and support
vector machine for the Chao Phraya River. J. Ecol. Eng. 2021, 22, 70–86. [CrossRef]
17. Ahmed, U.; Mumtaz, R.; Anwar, H.; Shah, A.A.; Irfan, R.; García-Nieto, J. Efficient water quality prediction using supervised
machine learning. Water 2019, 11, 2210. [CrossRef]
18. Kakkar, M.; Gupta, V.; Garg, J.; Dhiman, S. Detection of water quality using machine learning and IoT. Int. J. Eng. Res. Technol.
(IJERT) 2021, 10, 73–75.
19. Malek, N.H.A.; Wan Yaacob, W.F.; Md Nasir, S.A.; Shaadan, N. Prediction of Water Quality Classification of the Kelantan River
Basin, Malaysia, Using Machine Learning Techniques. Water 2022, 14, 1067. [CrossRef]
20. Rustam, F.; Ishaq, A.; Kokab, S.T.; de la Torre Diez, I.; Mazón, J.L.V.; Rodríguez, C.L.; Ashraf, I. An Artificial Neural Network
Model for Water Quality and Water Consumption Prediction. Water 2022, 14, 3359. [CrossRef]
21. Kaggle. Water Quality. 2021. Available online: https://www.kaggle.com/datasets/adityakadiwal/water-potability (accessed on
1 November 2022).
22. Zhang, S. Nearest neighbor selection for iteratively kNN imputation. J. Syst. Softw. 2012, 85, 2541–2552. [CrossRef]
23. AUTOML: Automatic machine learning. Available online: https://www.automl.org/automl/ (accessed on 1 November 2022).
24. H2O.ai. H2O: Scalable Machine Learning Platform. Available online: https://h2o.ai/platform/h2o-automl/ (accessed on 1
November 2022).
25. Ishaq, A.; Sadiq, S.; Umer, M.; Ullah, S.; Mirjalili, S.; Rupapara, V.; Nappi, M. Improving the prediction of heart failure patients’
survival using SMOTE and effective data mining techniques. IEEE Access 2021, 9, 39707–39716. [CrossRef]
26. Rustam, F.; Ashraf, I.; Mehmood, A.; Ullah, S.; Choi, G.S. Tweets classification on the base of sentiments for US airline companies.
Entropy 2019, 21, 1078. [CrossRef]
27. Manzoor, M.; Umer, M.; Sadiq, S.; Ishaq, A.; Ullah, S.; Madni, H.A.; Bisogni, C. RFCNN: Traffic accident severity prediction based
on decision level fusion of machine and deep learning model. IEEE Access 2021, 9, 128359–128371. [CrossRef]
28. Sharaff, A.; Gupta, H. Extra-tree classifier with metaheuristics approach for email classification. In Advances in Computer
Communication and Computational Sciences; Springer: Berlin/Heidelberg, Germany, 2019; pp. 189–197.
29. Fabian, D.; Guillermo Prieto Eibl, M.d.P.; Alnahhas, I.; Sebastian, N.; Giglio, P.; Puduvalli, V.; Gonzalez, J.; Palmer, J.D. Treatment
of glioblastoma (GBM) with the addition of tumor-treating fields (TTF): A review. Cancers 2019, 11, 174. [CrossRef] [PubMed]
30. Sowmya, B.; Nikhil Jain, C.; Seema, S.; KG, S. Fake News Detection using LSTM Neural Network Augmented with SGD Classifier.
Solid State Technol. 2020, 63, 6985–9665.
31. Ahmad, M.A.; Eckert, C.; Teredesai, A. Interpretable machine learning in healthcare. In Proceedings of the 2018 ACM International
Conference on Bioinformatics, Computational Biology, and Health Informatics, Washington, DC, USA, 29 August–1 September
2018; pp. 559–560.
32. Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 2017, 30, 4768–4777.
33. Hasan, A.N.; Alhammadi, K.M. Quality Monitoring of Abu Dhabi Drinking Water Using Machine Learning Classifiers. In
Proceedings of the 2021 14th International Conference on Developments in eSystems Engineering (DeSE), Sharjah, United Arab
Emirates, 7–10 December 2021; pp. 1–6.
34. Dilmi, S.; Ladjal, M. A novel approach for water quality classification based on the integration of deep learning and feature
extraction techniques. Chemom. Intell. Lab. Syst. 2021, 214, 104329. [CrossRef]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
