Water Quality Analysis using Machine Learning
K Rajesh
Department of AI & ML
Chaitanya Bharathi Institute of Technology, Gandipet, Hyderabad, India.
kurvarajesh72@gmail.com

K Tharun Kumar Reddy
Department of AI & ML
Chaitanya Bharathi Institute of Technology, Gandipet, Hyderabad, India.
tharunkumar5297@gmail.com

Karthik Kemidi
Department of AI & ML
Chaitanya Bharathi Institute of Technology, Gandipet, Hyderabad, India.
kemidikarthik2004@gmail.com

M Vishnu Chaitanya
Department of AI & ML
Chaitanya Bharathi Institute of Technology, Gandipet, Hyderabad, India.
mallipeddivishnuchaitanyacseaiml@cbit.ac.in
I. ABSTRACT

Water quality prediction is used to determine whether water is safe to consume. This study compares several machine learning models, namely Decision Tree, Random Forest, XGBoost, KNN, SVM, and Gaussian Naive Bayes, to identify the technique that produces the best results. Water is an indispensable resource for life and a vital component of survival for all organisms, including humans. Both business and agriculture can thrive only if freshwater is available to them. One crucial aspect of freshwater management is the evaluation of water quality. Before water is used for any activity, whether drinking, chemical application such as pesticides, or animal feeding, its quality has to be evaluated. Water quality has a direct impact on the environment as well as public health. Thus, analysis and prediction of water quality are significant for protecting both environmental and human health. Machine learning techniques can be used for the analysis and prediction of water quality by considering parameters such as pH value, turbidity, hardness, conductivity, and dissolved solids. In this work, water quality is predicted by feeding the values of these parameters into machine learning algorithms that classify the water as safe or unsafe for home use.

Keywords: Decision Tree, Random Forest, KNN, SVM, XGBoost, Gaussian Naive Bayes, Performance Metrics.

II. INTRODUCTION

The field of machine learning studies the ways in which computers learn through experience. Since the ability to learn is a fundamental characteristic of what is deemed intelligent, the terms "Machine Learning" and "Artificial Intelligence" are often used interchangeably by researchers. The main objective of machine learning is to create systems that can learn from their experiences. New developments in machine learning techniques have made this goal easier to achieve. We have worked out an approach using data mining to discover the potability of water. The large amount of data about water quality can be analyzed to find previously unknown knowledge. Hence, this area of research has grown in prominence. This paper aims to design a system that can predict water quality with better precision.

III. LITERATURE SURVEY

One integrated model [1] studied water quality parameters using machine learning algorithms, with the aim of forecasting conditions accurately, with high precision and reliability. This study highlighted the importance of data preprocessing as well as feature selection in improving prediction results. An article [2] also discussed several machine learning techniques, the best of which were applied to water quality prediction. It was found that methodologies such as random forests and support vector machines dramatically improved prediction capability due to their ability to capture nonlinear relations between the different water quality parameters. Real-time monitoring systems [3] integrated with machine learning algorithms have been designed to provide real-time water quality information. Such systems can respond immediately to possible health risks through real-time detection of contaminants and other changes in water composition. Hybrid models [4], which combine more than one machine learning method, have shown exceptional performance in salinity forecasting and other single indicators of water quality. Hybrid methods are very sensitive to interactions between and within variables, which classical models sometimes miss. Certain literature [5] focuses on field sensor data and public records for the analysis and prediction of water quality. Through exploratory data analysis, scientists were able to find patterns in datasets, from which long-term trends and seasonal variations in the potability of water can be predicted. There has been a trend [6] in recent research of using machine learning for the assessment and prediction of water potability, by training models on the suitability of water for consumption based on diverse indicators of quality. Other work [7] assessed machine learning algorithms for sustainable drinking water quality monitoring. The focus there was on models that could help enhance the efficiency of water management. Such work sometimes also considers sustainability and public health. Comparing different machine learning algorithms [8] on datasets whose missing values were imputed through statistical methods provided better insight into the relative strengths and weaknesses of the algorithms and the impact of data preprocessing on the quality of the resulting model. Machine learning was integrated [9] with holistic weighting techniques to increase prediction accuracy by combining the output of various models. It
decreased individual modeling errors, thus providing a better prediction. Various papers have been written on identifying potable water by classifying water samples with varied quality parameters. Such work is significant for public health surveillance. Recent studies tend to optimize machine learning models for water quality prediction through refinement of data preprocessing and parameter tuning. These customized models result in significantly better prediction accuracy, particularly when localized to the environmental conditions and distinct characteristics of water quality.

A. Other Related Works

The literature [12-22] depicts wide-ranging applications of computational intelligence and machine learning in different domains. Techniques have been developed to extract text from lecture videos, making the processing and use of educational resources more efficient. Advanced clustering and classification techniques have been used for fraud detection in IoT environments and for the analysis of financial datasets. Algorithms such as decision trees and clonal selection have been used for medical diagnosis in healthcare, including diabetes and AIDS severity prediction. Machine learning frameworks have also been applied to the detection of COVID-19 through cough sounds and the analysis of blood samples, which improved detection accuracy. High-dimensional cancer datasets have been handled with novel clustering methods and dissimilarity measures. The efficiency of diabetes diagnosis has been enhanced by dimension reduction techniques such as PCA coupled with kNN. Neurotrophic factor analysis has also benefited from clustering innovations, and improved similarity measures are key to enhancing the clustering of financial text documents. These studies demonstrate the breadth and impact of computational intelligence solutions for significant real-world problems, presenting scalable and efficient solutions for challenges in education, health, and finance.

IV. PROBLEM STATEMENT

Design a machine learning model to monitor the quality of a water sample and to classify water samples as potable or non-potable based on physicochemical properties such as pH, hardness, and conductivity, to ensure the safe availability of drinking water. The Water Quality Analysis project is designed to develop a classification model that labels water samples either as potable (safe to drink) or non-potable (unsafe) based on several physicochemical attributes. The features in this dataset include pH, hardness, total dissolved solids, chloramines, sulphate, conductivity, and nitrates.

Figure 1. Dataset contents

V. METHODOLOGIES

The machine learning model determines whether a water sample is safe to consume. A critical first step is importing the libraries required for training and testing on the dataset, along with installing the specific packages associated with nature-inspired algorithms. The dataset is then divided into training and testing subsets in an 80:20 ratio; finally, an appropriate model is selected. The classifiers considered are Support Vector Classifier (SVC), Decision Tree, GaussianNB, Random Forest, and XGBoost. The dataset is based on observations of water quality across 3276 different sources of water:

pH - This parameter measures the acidity or alkalinity of the water on a scale from 0 to 14. As per the EPA guidelines, the pH of tap water should ideally fall between 6.5 and 8.5. The pH level indicates the acid-base balance of the water. The present study recorded pH in the range of 6.52 to 6.83, which conforms to WHO standards.

Hardness - This is the capacity of water to precipitate soap. The main causes of water hardness are salts of calcium and magnesium. The longer water is in contact with hardness-causing substances, the harder it naturally becomes. Classically, hardness was defined as the capacity of water to precipitate soap, caused by calcium and magnesium.

Total Dissolved Solids (TDS) - Water can dissolve a wide range of inorganic and some organic minerals or salts, for instance sodium, calcium, iron, zinc, bicarbonate ions, chloride ions, magnesium, and sulfates. The dissolved minerals may also change the color of the water and contribute to objectionable odors. High TDS means a high amount of minerals in the water. For water to be safe for drinking, the
maximum allowed TDS level is 500 mg/l, though the desirable limit is 100 mg/l.

Sulfates - Sulfates are naturally occurring substances found in minerals, soil, and rocks; they are also present in air, groundwater, aquatic plants, and foods. Sulfates are used mainly in the chemical industry. Seawater normally contains about 2,700 mg/L, while freshwater sources usually range from 3 to 30 mg/L.

Conductivity - Pure water is a good insulator rather than a good conductor of electric current. The concentration of dissolved ions in the liquid largely determines its conductivity. Electrical conductivity (EC) measures the ability of a solution to conduct electricity through its ionic content. According to the guidelines of the World Health Organization (WHO), EC should not exceed 400 µS/cm.

Chloramines - The two major disinfectants used in public water systems are chlorine and chloramine. Chloramines form when ammonia is added to chlorine for the purification of drinking water. It is safe for drinking water to contain up to 4 mg/L of chlorine.

Potability - This indicates whether the water is potable or not. Non-potable is coded as zero (0), while potable is coded as one (1).

Figure 2. Data description

Figure 3. Data pre-processing

Figure 4. Water Potability Distribution
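As a rough illustration of the guideline limits quoted in the parameter descriptions, the sketch below flags measurements that fall outside them. It is only a rule-based sanity check, not the machine learning model used in this work; the dictionary keys and the function name `out_of_range` are hypothetical names chosen for the example.

```python
# Guideline limits quoted in the text; key names are illustrative assumptions.
GUIDELINES = {
    "ph": (6.5, 8.5),                        # EPA range for tap water
    "tds_mg_per_l": (0.0, 500.0),            # maximum allowed TDS
    "conductivity_us_per_cm": (0.0, 400.0),  # WHO limit for EC
    "chloramines_mg_per_l": (0.0, 4.0),      # safe chlorine level
}


def out_of_range(sample):
    """Return the names of measurements outside the quoted limits."""
    flagged = []
    for name, (low, high) in GUIDELINES.items():
        value = sample.get(name)
        # Missing measurements are skipped rather than flagged.
        if value is not None and not (low <= value <= high):
            flagged.append(name)
    return flagged


sample = {
    "ph": 7.1,
    "tds_mg_per_l": 620.0,
    "conductivity_us_per_cm": 390.0,
    "chloramines_mg_per_l": 3.2,
}
print(out_of_range(sample))  # ['tds_mg_per_l']
```

A check like this can help spot implausible rows before training, but the classification itself is left to the learned models.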
A. Data Pre-processing

Enhancing data quality is essential during the processing phase of data analysis. In this stage, the Water Quality Index is established based on key parameters from the dataset. Data preparation refers to the transformation of collected data into a format suitable for machine learning algorithms. This step is crucial and serves as the foundational phase in the development of a machine learning algorithm. All instances where a value is zero must be eliminated, since zero is not a valid value; such instances are discarded. Feature subset selection, which reduces data dimensionality and speeds up processing, involves the removal of irrelevant features and instances.

B. Correlation Matrix

A heat map function is used to visualize the correlation between all features. The heat map depicted below (Figure 5) indicates that the features are not meaningfully correlated with one another, so correlation offers no basis for dimensionality reduction.

C. Training and Testing of Data

In machine learning, a model is trained to perform a task by using a training dataset. During training, the model learns from the features found in the training data. In sentiment analysis, for example, the features are words or phrases gathered from tweets. The model learns to form associations, understand ideas, draw inferences, and assess its confidence levels based on the training dataset. The success of a data project is determined by the quality and quantity of the training data as much as by the algorithms employed. Consequently, given an appropriately labeled training set, the model has the opportunity to learn the correct features.
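The correlation check and the 80:20 split can be sketched as follows. This is a minimal example on synthetic data, assuming pandas and scikit-learn are available; the column names and the random values are placeholders, not the actual dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the water-quality table (arbitrary values,
# assumed column names).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "ph": rng.normal(7.0, 1.5, 200),
    "Hardness": rng.normal(200.0, 30.0, 200),
    "Solids": rng.normal(20000.0, 8000.0, 200),
    "Conductivity": rng.normal(400.0, 80.0, 200),
    "Potability": rng.integers(0, 2, 200),
})

features = df.drop(columns="Potability")
corr = features.corr()  # Pearson correlation between every pair of features

# Largest absolute off-diagonal correlation; a small value suggests that
# no feature is redundant with another.
off_diagonal = corr.abs().where(~np.eye(len(corr), dtype=bool))
max_corr = off_diagonal.max().max()

# 80:20 split, stratified so both subsets keep the same class balance.
X_train, X_test, y_train, y_test = train_test_split(
    features,
    df["Potability"],
    test_size=0.2,
    stratify=df["Potability"],
    random_state=42,
)
print(round(max_corr, 3), X_train.shape, X_test.shape)
```

Passing `corr` to `seaborn.heatmap` would render the heat map. Stratifying the split is a choice made for this sketch; the text only specifies the 80:20 ratio.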
Figure 5. Correlation matrix

D. Decision Tree

The decision tree is a machine learning algorithm used primarily for classification. It is a tree-structured classifier in which the nodes represent features of the given dataset. A decision tree has two main types of nodes: decision nodes, which test a feature and branch on the result, and leaf nodes, which hold the final class.

E. Support Vector Machine

Support Vector Machine (SVM) is a machine learning algorithm widely used for classification tasks. SVM separates two classes by mapping the data points into a high-dimensional space and then finding the best separating hyperplane.

F. Random Forest Classifier

Random Forest is a supervised learning algorithm that can be used for both classification and regression tasks. It relies on ensemble learning, in which several classifiers are combined to solve complicated problems and improve the generalization capability of the model. As its name suggests, the classifier is built from many decision trees, each trained on a different subset of the input data, and their outputs are aggregated to enhance prediction accuracy. Instead of depending on a single decision tree, the random forest collects the predictions from all trees and determines the final outcome by majority vote.

G. KNN

K-Nearest Neighbors (KNN) is a simple, nonparametric machine learning algorithm used for classification and regression tasks. It works by finding the "k" closest data points (neighbors) to a given input and making predictions based on the majority class or average value of these neighbors. KNN is easy to implement but can be computationally intensive, especially with large datasets, since it requires calculating distances between points.

H. Performance Metrics

In the definitions below, TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives.

Accuracy - The ratio of correct predictions to the total number of predictions, multiplied by 100: Accuracy = (TP + TN) / (TP + TN + FP + FN) x 100.

Precision - The ratio of true positives to all predicted positives: Precision = TP / (TP + FP).

Recall - The ratio of true positives to all actual positives. It mainly reflects type-2 errors (false negatives): Recall = TP / (TP + FN).

F1-score - The harmonic mean of precision and recall: F1 = 2 x Precision x Recall / (Precision + Recall).

VI. ALGORITHM STEPS FOR WATER QUALITY ANALYSIS USING ML

1. Import Libraries: Import the necessary libraries such as pandas, numpy, matplotlib, seaborn, and the machine learning modules from sklearn.
2. Load Dataset: Load the Kaggle dataset waterpotability.csv into a DataFrame.
3. Handle Missing Values: Impute missing values using the mean strategy with SimpleImputer.
4. Data Exploration and Visualization: Calculate the correlation matrix and plot it as a heatmap; analyze the distribution of the target variable (Potability) with a bar chart.
5. Data Preprocessing: Split the features (X) from the target variable (y). Split the dataset into training and test sets with an 80-20 split; scale the features with StandardScaler so that they share a uniform scale.
6. Define and Train Models: Define multiple classifiers: Random Forest, Gradient Boosting, AdaBoost, and SVM. Train each on the scaled training data.
7. Model Evaluation: Make predictions on the test data for all models. Compute the evaluation metrics: Accuracy, Classification Report, and Confusion Matrix. Identify the best model based on accuracy.
8. Cross-Validation: Perform 5-fold cross-validation on the best model and calculate the mean cross-validation score.
9. Feature Importance (in case of Random Forest): If the best model is Random Forest, extract and plot the feature importances.
10. Save the Best Model: Save the fitted best model to a .pkl file using joblib.
11. Load and Use the Model: Load the saved model for future predictions. Scale input samples before prediction.
12. Prediction Function: Define a function that predicts whether water is potable from a given sample, using the best model to classify the water as potable or not.
13. Test Prediction: Pass sample inputs to the prediction function for testing. Interpret the model's output as "Potable" or "Not Potable".
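The algorithm steps can be sketched end to end as below. This is an illustrative outline rather than the exact code used in this work: it runs on synthetic data from `make_classification` instead of the Kaggle file, the file name `best_model.pkl` and the helper `predict_potability` are hypothetical, and imputation and scaling are bundled into a scikit-learn pipeline (a design choice of this sketch) so that saved models scale new samples automatically.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the potability data (assumption: 9 numeric features).
X, y = make_classification(n_samples=300, n_features=9, n_informative=5,
                           random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan  # inject gaps to exercise imputation

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)


def build(classifier):
    """Mean imputation, standard scaling, then the classifier."""
    return make_pipeline(SimpleImputer(strategy="mean"), StandardScaler(),
                         classifier)


models = {
    "Random Forest": build(RandomForestClassifier(n_estimators=50,
                                                  random_state=0)),
    "Gradient Boosting": build(GradientBoostingClassifier(random_state=0)),
    "AdaBoost": build(AdaBoostClassifier(random_state=0)),
    "SVM": build(SVC()),
}

# Train every model and score it on the held-out 20 percent.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

best_name = max(scores, key=scores.get)
best_model = models[best_name]
cv_scores = cross_val_score(best_model, X_train, y_train, cv=5)  # 5-fold CV

# Persist the fitted pipeline and load it back for later predictions.
model_path = os.path.join(tempfile.mkdtemp(), "best_model.pkl")
joblib.dump(best_model, model_path)
loaded_model = joblib.load(model_path)


def predict_potability(sample):
    """Classify one sample (a sequence of 9 feature values)."""
    row = np.asarray(sample, dtype=float).reshape(1, -1)
    return "Potable" if loaded_model.predict(row)[0] == 1 else "Not Potable"


print(best_name, round(cv_scores.mean(), 3), predict_potability(X_test[0]))
```

Because the scaler is stored inside the saved pipeline, the prediction function does not need a separate scaling step. With separately saved models, the same fitted scaler would have to be applied to every input sample first.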
VII. EXPERIMENTAL SETUP

Hardware      Description
System        13th Gen Intel(R) Core(TM), 1.90GHz
Hard Disk     512 GB
Monitor       HP P204v
Processor     Intel i5-1340P
Table I
HARDWARE REQUIREMENTS

Software                Description
Operating System        Windows 11
Programming Language    Python 3.2
Database                Firebase
Tools                   Jupyter Notebook, Google Colab, Python IDE
Table II
SOFTWARE REQUIREMENTS

VIII. RESULTS AND DISCUSSIONS

In a nutshell, the best classifier here was the Random Forest Classifier, which showed a good balance between accuracy, precision, recall, and F1-score. This balance makes it well suited to applications that need a model that performs reliably across diverse evaluation metrics. It achieved approximately 70 percent accuracy with a good balance between precision and recall. Its robustness lies in its ability to handle imbalanced datasets with consistent performance across different scenarios. Also, its ensemble nature reduces overfitting, which enhances generalizability to unseen data. This makes it a highly reliable choice for predictive and analytical tasks in complex data environments.

Figure 6. Accuracy

Figure 6 shows the accuracy of six different machine learning models: SVM, KNN, Decision Tree, Gaussian Naive Bayes, Random Forest, and XGBoost. The accuracy is fairly consistent, with SVM, KNN, Random Forest, and XGBoost achieving higher values than Decision Tree.

Figure 7. Precision

Figure 7 displays the precision for each model. Precision is highest for the Random Forest model, followed closely by KNN and XGBoost, while Gaussian Naive Bayes has slightly lower precision. Higher precision means fewer false positives.

Figure 8. Recall

Figure 8 illustrates recall values, where SVM and KNN show the highest recall, meaning these models are better at identifying true positives. Gaussian Naive Bayes has the lowest recall, indicating it misses more true positives.

Figure 9. F1-score

Figure 9 illustrates the F1-score, which is the harmonic mean of precision and recall. KNN has the highest F1-score, indicating a good balance between precision and recall. Gaussian Naive Bayes has the lowest F1-score, suggesting it is less effective in balancing precision and recall.
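The four scores compared in Figures 6 to 9 follow from the confusion-matrix counts. The sketch below computes them in plain Python for a small made-up set of labels. The function name `classification_metrics` and the example labels are illustrative assumptions, and accuracy is returned as a fraction rather than a percentage.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 from two equal-length label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators when a class is never predicted.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels: 1 = potable, 0 = not potable.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
# {'accuracy': 0.75, 'precision': 0.75, 'recall': 0.75, 'f1': 0.75}
```

Here three potable samples are correctly identified, one is missed (a false negative), and one non-potable sample is wrongly accepted (a false positive), which gives 0.75 for every score.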
Figure 10. Best Model

Figure 11. Sample prediction

IX. CONCLUSIONS AND FUTURE SCOPE

The project Water Quality Analysis using Machine Learning demonstrates the use of machine learning algorithms for the assessment of water quality. The study deals with the classification of water samples as potable or non-potable using physicochemical parameters such as pH, hardness, total dissolved solids, and conductivity. After implementing and testing multiple machine learning models, the Random Forest classifier was found to be the most practical, as it showed a good balance among the performance metrics: accuracy, precision, recall, and F1-score. Its reliability and suitability for varied forms of data make it a good fit for real-world applications. The results underscore the capability of machine learning to change the face of water quality monitoring, thereby contributing to public health and environmental sustainability.

Future Scope:

1. Integration with IoT Systems: Combine machine learning models with IoT-based real-time water quality monitoring systems for sustained assessment and early detection of contamination.
2. Extended Dataset Inclusion: Enhance the dataset with additional parameters, such as biological and microbial indicators, for a more encompassing assessment of water quality.
3. Algorithm Optimization: Experiment with more advanced algorithms, such as deep learning models, to improve prediction accuracy and predictive capability under different environmental conditions.
4. Hyperparameter Tuning: Automate hyperparameter tuning using methods such as Grid Search or Bayesian Optimization to obtain the best performance from the model.
5. Scalability and Deployment: Develop a cloud-based environment for deploying the model for large-scale, real-time water quality monitoring across different regions.
6. Geographical Adaptation: Adapt the model to specific regions by incorporating localized water quality standards and climatic conditions.
7. Public Health Integration: Implement machine learning models in public health programs to predict and prevent waterborne diseases by monitoring and predicting trends in water quality.

This research lays the foundation for many innovative solutions in water quality management in pursuit of a global goal: access to clean and safe drinking water.

REFERENCES

[1] Khan, Y., & See, C. S. (2016, April). Predicting and analysing water quality using machine learning: a comprehensive model. In 2016 IEEE Long Island Systems, Applications and Technology Conference (LISAT) (pp. 1-6). IEEE.
[2] Ahmed, A. N., Othman, F. B., Afan, H. A., Ibrahim, R. K., Fai, C. M., Hossain, M. S., ... & Elshafie, A. (2019). Machine learning methods for better water quality prediction. Journal of Hydrology, 578, 124084.
[3] Vergina, S. A., Kayalvizhi, S., Bhavadharini, R., & Kalpana Devi, S. (2020). A real time water quality monitoring using machine learning algorithm. Eur. J. Mol. Clin. Med, 7(8), 2035-2041.
[4] Melesse, A. M., Khosravi, K., Tiefenbacher, J. P., Heddam, S., Kim, S., Mosavi, A., & Pham, B. T. (2020). River water salinity prediction using hybrid machine learning models. Water, 12(10), 2951.
[5] Kuthe, A., Bhake, C., Bhoyar, V., Yenurkar, A., Khandekar, V., & Gawale, K. (2022). Water quality analysis using machine learning. International Journal for Research in Applied Science and Engineering Technology, 10(12), 581-585.
[6] Akshay, R., Tarun, G., Kiran, P. U., Devi, K. D., & Vidhyalakshmi, M. (2022, December). Water-Quality-Analysis using Machine Learning. In 2022 11th International Conference on System Modeling & Advancement in Research Trends (SMART) (pp. 13-18). IEEE.
[7] Kaddoura, S. (2022). Evaluation of machine learning algorithm on drinking water quality for better sustainability. Sustainability, 14(18), 11478.
[8] Poudel, D., Shrestha, D., Bhattarai, S., & Ghimire, A. (2022). Comparison of machine learning algorithms in statistically imputed water potability dataset. Journal of Innovations in Engineering Education, 5(1), 38-46.
[9] Wang, X., Li, Y., Qiao, Q., Tavares, A., & Liang, Y. (2023). Water quality prediction based on machine learning and comprehensive weighting methods. Entropy, 25(8), 1186.
[10] Patel, S., Shah, K., Vaghela, S., Aglodiya, M., & Bhattad, R. (2023). Water Potability Prediction Using Machine Learning.
[11] Brindha, D., Puli, V., NVSS, B. K. S., Mittakandala, V. S., & Nanneboina, G. D. (2023, February). Water quality analysis and prediction using machine learning. In 2023 7th International Conference on Computing Methodologies and Communication (ICCMC) (pp. 175-180). IEEE.
[12] Velaga, S. M., Srikanth, P., & Basha, D. K. (2024). KBSS: an efficient approach of extracting text contents from lecture videos-computational intelligence techniques. International Journal of Cloud Computing, 13(1), 1-24.
[13] Srikanth, P. (2021). An efficient approach for clustering and classification for fraud detection using bankruptcy data in IoT environment. International Journal of Information Technology, 13(6), 2497-2503.
[14] Devarapalli, D., Srikanth, P., Rao, M. N., & Rao, J. V. (2016). Identification of AIDS disease severity based on computational intelligence techniques using clonal selection algorithms. International Journal of Convergence Computing, 2(3-4), 193-207.
[15] Srikanth, P., Anusha, C., & Devarapalli, D. (2015). A computational intelligence technique for effective medical diagnosis using decision tree algorithm. i-Manager's Journal on Computer Science, 3(1), 21.
[16] Srikanth, P., & Behera, C. K. (2022, July). A machine learning framework for COVID detection using cough sounds. In 2022 International Conference on Engineering & MIS (ICEMIS) (pp. 1-5). IEEE.
[17] Srikanth, P., & Behera, C. K. (2022, July). An empirical study and assessment of minority oversampling with dynamic ensemble selection on COVID-19 utilizing blood sample. In 2022 International Conference on Engineering & MIS (ICEMIS) (pp. 1-7). IEEE.
[18] Panigrahi, S. (2020, April). Design and analysis of efficient cluster using novel dissimilarity measure and classification for high dimensional cancer datasets. In Proceedings of the International Conference on Innovative Computing & Communications (ICICC).
[19] Panigrahi, S., Saitejaswi, K., & Devarapalli, D. (2019, February). Teju: fraud detection and improving classification performance for bankruptcy datasets using machine learning techniques. In Proceedings of International Conference on Sustainable Computing in Science, Technology and Management (SUSCOM), Amity University Rajasthan, Jaipur-India.
[20] Mangathayaru, N., Mathura Bai, B., & Srikanth, P. (2018). Clustering and classification of effective diabetes diagnosis: Computational intelligence techniques using PCA with kNN. In Information and Communication Technology for Intelligent Systems (ICTIS 2017)-Volume 1 2 (pp. 426-440). Springer.
[21] Devarapalli, D. D., & Srikanth, P. (2018). A novel cluster algorithms of analysis and predict for brain derived neurotrophic factor (BDNF) using diabetes patients. In Data Engineering and Intelligent Computing: Proceedings of IC3T 2016 (pp. 109-125). Springer Singapore.
[22] Srikanth, P., & Deverapalli, D. (2017, December). CFTDISM: Clustering financial text documents using improved similarity measure. In 2017 IEEE International Conference on Computational Intelligence and Computing Research (ICCIC) (pp. 1-4). IEEE.