
School of Computer Science Engineering and Information Systems

Winter Semester 2024-2025

BITE410L - MACHINE LEARNING

Assignment – I

Topic: Water Quality Prediction

Team Members:

1. Introduction

Access to safe drinking water is crucial for public health, yet traditional water quality assessment
methods are time-consuming and resource-intensive. To address this challenge, we propose an
advanced machine learning (ML) approach for predicting water quality based on chemical
parameters such as pH, turbidity, conductivity, and hardness.

While conventional models such as Naive Bayes, Decision Tree, and Multilayer Perceptron (MLP)
offer reasonable accuracy, they often fall short when features are interdependent, and they lack
transparency in their decision-making. To overcome these limitations, we introduce a novel
Ensemble Learning approach using a Stacking Classifier, which combines Naive Bayes,
Decision Tree, and MLP models, with Logistic Regression as the meta-classifier. Furthermore,
we use SHAP (SHapley Additive exPlanations) to identify the most influential water quality
parameters.

2. Literature Review

Peretz et al. (2024), in their paper published in Engineering Applications of Artificial Intelligence,
state that the Naïve Bayes (NB) classifier is a widely used probabilistic model due to its simplicity
and efficiency. However, its key limitation lies in its assumption of feature independence, which
often reduces classification accuracy. To address this, they proposed the Naïve Bayes Enrichment
Method (NBEM), which enhances classification performance by optimizing feature selection
through threshold learning and employing multiple NB classifiers with different distributions. Their
study found that NBEM significantly improves recall and precision by integrating results through a
weighted classification function. Additionally, they highlight the broad applications of NB
classifiers in domains such as text classification, fraud detection, and medical diagnostics. The
study emphasizes that while NB is efficient for high-dimensional data, future research should
focus on refining its independence assumption, integrating deep learning methodologies, and
improving interpretability to enhance its trustworthiness in real-world decision-making scenarios.
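
For reference, the independence assumption they critique has the standard textbook form (it is not a
formula from Peretz et al.): the NB posterior factorizes over the features,

P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y),

and the class maximizing this product is predicted. The assumption is violated whenever features are
correlated, for example chemically related water quality parameters, which is the situation the
ensemble in Section 3 is intended to handle.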

Mienye and Jere (2024), in their survey paper "A Survey of Decision Trees: Concepts, Algorithms,
and Applications", state that decision tree-based algorithms are widely used due to their
simplicity, interpretability, and efficiency in classification and regression tasks. They discuss key
algorithms such as CART, ID3, C4.5, and CHAID, along with ensemble methods like random
forest and gradient-boosted decision trees, highlighting their applications in medical diagnosis,
fraud detection, and finance. The authors note that while decision trees offer transparency, they
are prone to overfitting and sensitivity to noise; techniques such as pruning, ensemble learning,
and hybrid models help mitigate these issues. Their study also emphasizes that decision trees,
particularly ensemble models, achieve high accuracy in diagnosing diseases and detecting
fraud. In conclusion, they suggest that decision trees remain valuable in machine learning
despite their limitations, and that future research should focus on enhancing scalability, handling
high-dimensional data, and integrating deep learning techniques to improve their predictive
performance.
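
For context, the split criteria behind CART and ID3/C4.5 rest on standard impurity measures
(textbook definitions, not results from Mienye and Jere):

\mathrm{Gini}(t) = 1 - \sum_{k} p_k^{2}, \qquad \mathrm{Entropy}(t) = -\sum_{k} p_k \log_2 p_k,

where p_k is the proportion of class k at node t; each split is chosen to maximize the reduction in
impurity, and pruning limits how deep such splits are allowed to grow.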

Ramchoun et al. (2016), in their paper "Multilayer Perceptron: Architecture Optimization and
Training", discuss optimizing the Multilayer Perceptron (MLP) architecture to improve
classification and regression performance. They highlight that selecting the optimal number of
hidden layers and neurons is crucial for preventing underfitting and overfitting. The study proposes
a genetic algorithm-based approach to optimize the network architecture, ensuring efficiency in
training and generalization. The authors further explain that traditional MLP models fix their
architecture before training, which can lead to suboptimal performance. Their research introduces
an optimization model that dynamically adjusts connections and hidden layers using binary
variables, enhancing network adaptability. The study demonstrates the effectiveness of this
approach through experiments on the Iris dataset, showing improved classification accuracy
and reduced computational complexity compared to existing methods. In conclusion, Ramchoun
et al. (2016) emphasize that optimizing MLP architecture is essential for improving neural
network performance, and they suggest further research applying the model to real-world
datasets, such as medical diagnosis and financial forecasting, to validate its effectiveness
across different domains.
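
Their genetic-algorithm formulation is outside the scope of this assignment; the sketch below
illustrates the same underlying idea, selecting a hidden-layer configuration by validation
performance, using a plain grid search in scikit-learn. The candidate layer sizes are illustrative,
and X_train/y_train refer to the preprocessed training data prepared in Section 3.2 below.

from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Simple architecture search over a few candidate hidden-layer configurations
# (a stand-in for the genetic-algorithm search described by Ramchoun et al.)
param_grid = {"hidden_layer_sizes": [(16,), (32,), (32, 16), (64, 32)]}
search = GridSearchCV(
    MLPClassifier(max_iter=500, random_state=42),
    param_grid,
    scoring="f1",
    cv=5,
)
search.fit(X_train, y_train)
print("Best architecture:", search.best_params_)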

3. Proposed Design/Solution of the Identified Problem

Our proposed solution involves developing an ensemble-based ML model for water quality
prediction, focusing on both performance and interpretability.

3.1 Data Collection:

• We use the Water Quality Dataset, which includes chemical parameters such as pH,
turbidity, conductivity, hardness, sulphate, and dissolved solids.

3.2 Data Preprocessing:

• Handle missing values using median imputation.

• Normalize features using Min-Max scaling.

• Split the dataset into 80% training and 20% testing sets (a combined preprocessing sketch follows this list).
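
A minimal preprocessing sketch, assuming the dataset is a CSV file; the file name
water_quality.csv and the target column Potability are illustrative assumptions:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("water_quality.csv")

# Median imputation for missing values
df = df.fillna(df.median(numeric_only=True))

X = df.drop(columns=["Potability"])
y = df["Potability"]

# 80/20 train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Min-Max scaling, fitted on the training set only to avoid leakage
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)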

3.3 Model Training:

1. Base Models: Train the individual classifiers: Naive Bayes, Decision Tree, and MLP.
2. Ensemble Learning: Combine their predictions using a Stacking Classifier, with Logistic
Regression as the meta-classifier (see the sketch after this list).
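
A minimal sketch of this setup with scikit-learn, using the preprocessed X_train and y_train from
Section 3.2; the hyperparameter values shown are illustrative defaults, not tuned choices:

from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import StackingClassifier

# Base models: Naive Bayes, Decision Tree, and MLP
base_models = [
    ("nb", GaussianNB()),
    ("dt", DecisionTreeClassifier(max_depth=5, random_state=42)),
    ("mlp", MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=42)),
]

# Stacking ensemble with Logistic Regression as the meta-classifier;
# cv=5 generates out-of-fold base predictions to train the meta-classifier
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)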

3.4 Performance Evaluation:

• Evaluate models using metrics such as accuracy, precision, recall, and F1-score.

• Compare the performance of the individual models and the ensemble approach (an evaluation sketch follows this list).
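
A minimal evaluation sketch on the held-out test split, assuming the fitted stack from Section 3.3;
the same calls apply to each base model for comparison:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_pred = stack.predict(X_test)
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))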

3.5 Model Explainability:

• Use SHAP values to identify the most influential parameters, such as pH and turbidity,
enhancing the model's transparency (a minimal SHAP sketch follows).
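
A minimal explainability sketch using the shap library's model-agnostic KernelExplainer on the
fitted ensemble; the background and test subset sizes are illustrative choices made to keep the
computation tractable:

import shap

# KernelExplainer only needs a prediction function and a background sample
background = shap.sample(X_train, 100, random_state=42)
explainer = shap.KernelExplainer(stack.predict_proba, background)

# Explain a subset of the test set (KernelExplainer is slow on large inputs)
shap_values = explainer.shap_values(X_test[:50])

# Summary plot of feature influence (e.g. pH, turbidity); depending on the
# shap version, shap_values is a list (one array per class) or a single array
shap.summary_plot(shap_values, X_test[:50], feature_names=list(X.columns))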

Expected Outcomes:

1. Improved Accuracy: Higher classification accuracy compared to individual models.

2. Enhanced Explainability: Clear insights into the key water quality parameters.

3. Deployable Solution: A user-friendly system for real-time water quality assessment.
