Machine Learning
Text Book
Amit Kumar Das, Saikat Dutt, Subramanian Chandramouli, Machine Learning, Pearson India Education Services, 2019.
Reference Books
1. Tom M. Mitchell, Machine Learning, McGraw Hill Education, 2013.
2. Rudolph Russell, Machine Learning: Step-by-Step Guide to Implement Machine Learning Algorithms with Python, CreateSpace Independent Publishing Platform, 2018.
Machine Learning: Course Outcomes
Identify the various concepts and challenges in machine learning.
Select modelling and evaluation techniques for handling real-time data.
Use supervised and unsupervised learning algorithms for a given problem.
Examine supervised and unsupervised learning algorithms for analysing data.
Identify the various concepts of neural networks to develop AI-based applications.
Machine Learning: Syllabus
Introduction to Machine Learning: Types of Machine Learning, Problems not to be solved using Machine Learning, Applications of Machine Learning, Tools in Machine Learning, Issues in Machine Learning, Machine Learning Activities, Basic Types of Data in Machine Learning, Exploring Structure of Data, Data Quality & Remediation, Data Pre-Processing.
Modelling and Evaluation: Introduction, Selecting a Model, Training a Model, Model Representation and Interpretability, Evaluating Performance of a Model, Improving Performance of a Model, Feature Subset Selection, Dimensionality Reduction - PCA, SVD, FA, LDA.
Bayesian Concept Learning: Introduction, Bayes' Theorem, Naïve Bayes Classifier, Applications of Naïve Bayes Classifier.
Supervised Learning: Classification, Example of Supervised Learning, Classification Model Learning Steps, Common Classification Algorithms: KNN, Decision Tree, Random Forest model.
Unsupervised Learning: Introduction, Unsupervised vs Supervised Learning, Application of Unsupervised Learning, Clustering: K-Means, K-Medoid, DBSCAN.
Basics of Neural Network: Introduction, Understanding the Biological Neuron, Exploring the Artificial Neuron, Types of Activation Functions, Early Implementations of ANN, Architectures of Neural Network.
Introduction to Machine Learning
• Introduction
• In the real world, we are surrounded by humans, who can learn from their experiences.
• We also have computers or machines, which work on our instructions. But can a machine also learn from experience or past data like a human does?
• Machine learning plays an important role in enabling machines to learn.
• What is Machine Learning ?
Arthur Samuel, a research pioneer in the field of artificial intelligence
and computer gaming, coined the term “Machine Learning”.
Computers learning from data is known as machine learning.
• Definition of ML
• Machine learning (ML) is a sub-domain of artificial intelligence (AI) that focuses on developing systems that learn, improve their performance, make predictions on new data, or make decisions based on the data they receive.
• It involves training models on data to identify patterns, relationships, and
insights, which can then be used to perform various tasks and make
predictions on new, unseen data.
• “Machine learning is a method of data analysis that automates analytical
model building. It is a branch of artificial intelligence based on the idea that
systems can learn from data, identify patterns and make decisions
with minimal human intervention”.
• Definition of Machine Learning (Mitchell 1997): “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
Why Machine Learning?
Programs have been developed that successfully learn to:
• recognize spoken words (Waibel 1989; Lee 1989),
• predict recovery rates of pneumonia patients (Cooper et al. 1997),
• detect fraudulent use of credit cards,
• drive autonomous vehicles on public highways (Pomerleau 1989),
• play games such as backgammon at levels approaching the performance of human world champions (Tesauro 1992, 1995).
Programs vs. learning algorithms
Traditional programming refers to any manually created program that takes input data and runs on a computer to produce the output.
In machine learning, the input data and the corresponding outputs are fed to an algorithm, which creates the program (the trained model). The model can then be used to predict future outcomes.
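The contrast can be illustrated with a small sketch. In the first function a person writes the rule; in the second, the rule (a slope and an intercept) is estimated from example input/output pairs by ordinary least squares. The temperature data and the function names are illustrative assumptions, not part of the course material.

```python
# Traditional programming: a human writes the rule explicitly.
def fahrenheit_rule(celsius):
    return celsius * 9 / 5 + 32

# Machine learning: the rule is estimated from (input, output) examples.
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data: Celsius readings with known Fahrenheit values.
celsius = [0.0, 10.0, 20.0, 30.0, 40.0]
fahrenheit = [32.0, 50.0, 68.0, 86.0, 104.0]

slope, intercept = fit_line(celsius, fahrenheit)

def learned_rule(c):
    # The "program" produced by the learning algorithm.
    return slope * c + intercept

print(fahrenheit_rule(25.0), learned_rule(25.0))
```

Both functions give the same answer here because the examples follow an exact linear rule; with noisy real-world data the learned rule would only approximate it.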
Components of Learning System
Introduction to Machine Learning
• Advantages of ML
1. Automation: enables automation of tasks that are repetitive or time-consuming.
2. Prediction and Decision Making: can analyze large datasets to make predictions and decisions with high accuracy.
3. Scalability: can handle large volumes of data and scale efficiently to accommodate growing datasets.
4. Adaptability: can adapt and learn from new data, allowing models to continuously improve.
5. Personalization: can personalize user experiences by analyzing user behavior and preferences.
6. Pattern Recognition: identifying patterns, trends, and anomalies in data that may not
• Disadvantages of ML
1. Data Dependency: models rely heavily on the quality and quantity of the training data.
2. Overfitting: models may become too specialized to the training data and fail to generalize well to unseen data.
3. Interpretability: many models lack interpretability, making it challenging to understand how they arrive at their predictions.
4. Computational Resources: training complex ML models requires significant computational resources.
5. Ethical and Privacy Concerns: models may reproduce biases in their training data or infringe on privacy rights, raising ethical and social concerns.
6. Lack of Domain Knowledge: models may perform poorly in domains where domain-specific knowledge is essential.
• Limitations:
1. Limited by Data Quality
2. Complexity
3. Interpretability
4. Generalization
5. Scalability
6. Human Expertise
History of Machine learning
• The history of machine learning traces back to the mid-20th
century, with roots in the fields of mathematics, computer science,
and artificial intelligence. Here's a brief overview:
• 1950s - 1960s: Early Foundations
1. Alan Turing (1950): Turing proposed the Turing Test as a
measure of machine intelligence, laying the groundwork for
the concept of artificial intelligence (AI).
2. Arthur Samuel (1959): Samuel developed the first self-learning program, a checkers-playing program that improved its performance through reinforcement learning.
• 1970s - 1980s: Symbolic AI Dominance
1. Expert Systems: Symbolic AI, based on rule-based expert
systems, dominated the field. These systems encoded human
expertise in the form of rules to solve specific problems.
2. Neural Networks Research: Neural network research continued, but interest declined due to limited computational power and the dominance of symbolic AI.
• 1990s: Renaissance of Neural Networks
1. Backpropagation: The rediscovery of the backpropagation
algorithm for training neural networks led to renewed interest
in neural networks and machine learning.
2. Support Vector Machines (SVMs): Vladimir Vapnik and others developed
support vector machines, a powerful machine learning algorithm for
classification and regression tasks.
• 2000s - Present: The Big Data Era
1. Big Data: The explosion of data availability due to the internet, social
media, and digital technologies fueled the development of new machine
learning algorithms and techniques.
2. Deep Learning: Breakthroughs in deep learning, fueled by
advances in computational power and data availability, led to
significant improvements in areas like computer vision, natural language
processing, and speech recognition.
3. Reinforcement Learning: Reinforcement learning gained
prominence, particularly in areas like robotics, gaming
(e.g., AlphaGo), and autonomous vehicles.
4. Machine Learning Applications: Machine learning became
ubiquitous in various applications, including recommendation
systems, fraud detection, healthcare diagnostics, autonomous
vehicles, and more.
5. Ethical and Social Implications: Increased attention to the ethical
and social implications of machine learning, including concerns
about bias, fairness, privacy, and job displacement.
Evolution of machine learning
What is human learning?
• Human learning is the process through which individuals acquire new knowledge, skills, attitudes, or behaviors.
• This process can occur in various ways, including through:
1. Experiences
2. Observation
3. Instruction
• Human learning involves cognitive processes such as attention, memory,
• Cognitive science is an interdisciplinary field that studies the mind and its processes, including how people think, learn, remember, and perceive.
Types of human learning
• Human learning happens in one of three ways:
• (1) somebody who is an expert in the subject directly teaches us (learning under expert guidance),
• (2) we build our own notion indirectly, based on what we have learnt from the expert in the past (learning guided by knowledge gained from experts), or
• (3) we do it ourselves, perhaps after multiple attempts (learning by self, or self-learning).
How do machines learn?
• The basic machine learning process can be divided into three parts.
• 1. Data Input: Past data or information is utilized as a basis for
future decision-making.
• 2. Abstraction: The input data is represented in a broader way through the underlying algorithm.
• 3. Generalization: The abstracted representation is generalized to form
a framework for making decisions.
• Figure 1 shows a schematic representation of the machine learning
process.
Fig: Process of machine learning
• In human learning, memorization and perfect recall alone, i.e. relying only on knowledge input, allow students to do well in examinations only up to a certain stage.
• A better learning strategy needs to be adopted:
• 1. to be able to deal with the vastness of the subject matter and the related issues in memorizing it.
• 2. to be able to answer questions where a direct answer has not been learnt.
• A good option is to figure out the key points or ideas
amongst a vast pool of knowledge.
• From the machine learning paradigm, the vast pool of
knowledge is available from the data input.
• However, rather than using the data in its entirety, a concept of mapping is applied.
• Mapping the data to known characteristics is called knowledge abstraction, and it is performed by the machine.
• Finally, this abstracted mapping from the input data can be applied to draw critical conclusions.
• This is generalization in the context of machine learning.
Types of Machine Learning
• Supervised Learning:
• Definition: In supervised learning, the algorithm is trained on a labeled dataset, in which each input is paired with the desired output.
• Usefulness: Supervised learning is highly useful in real-world
scenarios where there is a known outcome or target variable.
• Supervised learning is often used for tasks such as classification,
regression, and object detection.
• Real-world Applications: Supervised learning is applied in various domains such as:
1. Predictive analytics: Forecasting customer churn, predicting sales
trends, etc.
2. Healthcare: Diagnosing diseases based on patient data.
3. Finance: Credit scoring, fraud detection, risk assessment.
• Advantages:
Well-understood and widely studied.
Can achieve high accuracy when trained on sufficient and
representative data.
• Disadvantages:
Requires labeled data, which can be expensive and time-consuming to
obtain.
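As a minimal sketch of supervised classification, the block below implements a k-nearest-neighbours (KNN) classifier, one of the algorithms named in the syllabus. The toy dataset, its labels, and the function name `knn_predict` are assumptions made for illustration.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest
    labeled neighbours (Euclidean distance)."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical labeled dataset: each input is paired with its desired output.
train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
train_y = ["low", "low", "low", "high", "high", "high"]

print(knn_predict(train_X, train_y, (1.2, 1.4)))
print(knn_predict(train_X, train_y, (6.8, 6.2)))
```

The labels are what make this supervised: without `train_y` the algorithm would have nothing to vote on.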
• Unsupervised Learning:
• Definition: In unsupervised learning, the algorithm is trained on an unlabeled dataset, where the goal is to discover hidden patterns or structures within the data.
• Usefulness: Unsupervised learning is valuable in real-time
scenarios where the data is unstructured or lacks labels. It can
uncover hidden insights and group similar data points together.
• Real-world Applications: Unsupervised learning is applied in various
domains such as:
1. Market segmentation: Grouping customers based on similar traits or
behaviors.
2. Anomaly detection: Identifying unusual patterns or outliers in data.
3. Recommender systems: Generating personalized recommendations based on user preferences.
• Advantages:
Can reveal hidden patterns or structures in the data.
Does not require labeled data, making it applicable to a wide
range of datasets.
• Disadvantages:
Evaluation of results can be subjective and challenging.
Interpretability of the model's output may be limited.
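Grouping similar data points without labels can be sketched with K-Means, the clustering algorithm listed in the syllabus. This is a simplified version under stated assumptions: the initial centroids are passed in by hand rather than chosen randomly, the number of iterations is fixed, and the points are made up.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)), key=lambda i: math.dist(p, centroids[i])
            )
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]  # keep an empty cluster's centroid
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical unlabeled data with two visible groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, centroids=[(1.0, 1.0), (9.0, 8.0)])
print(centroids)
print(clusters)
```

No labels are supplied anywhere; the two groups emerge only from distances between points, which is also why judging the quality of the result is harder than in the supervised case.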
• Semi Supervised Learning
• Semi-supervised learning is a machine learning paradigm that falls between supervised and unsupervised learning.
• In semi-supervised learning, the dataset contains both labeled and unlabeled data.
• The algorithm leverages the small amount of labeled data along with the larger pool of unlabeled data to make predictions or learn patterns.
• Here are a few semi-supervised learning algorithms:
• Self-Training:
1. Self-training is a simple semi-supervised learning algorithm where the model starts with a small amount of labeled data.
2. It trains initially on the labeled data and then uses the trained model to make predictions on the unlabeled data.
3. The predictions with high confidence are added to the labeled dataset, and the process iterates until convergence.
4. This approach assumes that the model's predictions on the unlabeled
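The self-training loop can be sketched as follows. The base model here is a deliberately simple one-dimensional nearest-centroid classifier, and "confidence" is taken to be the margin between the distances to the two closest class centroids; both choices, along with the threshold `min_margin` and the data, are assumptions for illustration rather than the method the slides prescribe.

```python
def fit_centroids(labeled):
    """Train the base model: one mean value (centroid) per class."""
    sums = {}
    for x, y in labeled:
        total, count = sums.get(y, (0.0, 0))
        sums[y] = (total + x, count + 1)
    return {y: total / count for y, (total, count) in sums.items()}

def predict_with_confidence(centroids, x):
    """Return (label, margin); a larger margin means a more confident prediction."""
    ranked = sorted(centroids, key=lambda y: abs(x - centroids[y]))
    best, runner_up = ranked[0], ranked[1]
    margin = abs(x - centroids[runner_up]) - abs(x - centroids[best])
    return best, margin

def self_train(labeled, unlabeled, min_margin=2.0, max_rounds=10):
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        centroids = fit_centroids(labeled)
        confident, remaining = [], []
        for x in unlabeled:
            label, margin = predict_with_confidence(centroids, x)
            (confident if margin >= min_margin else remaining).append((x, label))
        if not confident:
            break  # nothing new to add: stop iterating
        labeled.extend(confident)  # pseudo-labels join the training set
        unlabeled = [x for x, _ in remaining]
    return fit_centroids(labeled), unlabeled

# Hypothetical data: two labeled points and five unlabeled ones.
model, still_unlabeled = self_train(
    labeled=[(1.0, "a"), (9.0, "b")],
    unlabeled=[2.0, 8.0, 3.0, 6.5, 5.1],
)
print(model, still_unlabeled)
```

The point 5.1 sits almost midway between the two classes, so it never reaches the confidence threshold and stays unlabeled.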
• Semi-Supervised Support Vector Machines (S3VM):
1. S3VM is an extension of traditional Support Vector Machines (SVM) to semi-supervised settings.
2. It incorporates both labeled and unlabeled data into the SVM framework, aiming to find a decision boundary that separates the data while minimizing classification errors.
3. S3VM optimizes a combination of the margin and the empirical error on the labeled data, along with a term penalizing the model's complexity.
• Label Propagation:
1. Label propagation is a graph-based semi-supervised learning algorithm.
2. It constructs a graph representation of the data, where nodes represent data points and edges represent similarities between points.
3. Initially, the labeled nodes are assigned their true labels, and then the labels propagate through the graph based on similarities between nodes.
4. The process iterates until convergence, and the final labels are determined from the propagated labels.
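A minimal sketch of the idea, under simplifying assumptions: there are only two classes, encoded as +1 and -1; all edges have equal weight instead of a similarity score; and a fixed number of iterations stands in for a convergence check. The graph and the seed labels are made up.

```python
def propagate_labels(neighbors, seeds, iterations=50):
    """neighbors: node -> list of adjacent nodes.
    seeds: node -> known label (+1 or -1); these stay clamped.
    Every other node repeatedly takes the average score of its neighbours."""
    scores = {node: float(seeds.get(node, 0.0)) for node in neighbors}
    for _ in range(iterations):
        updated = {}
        for node, adjacent in neighbors.items():
            if node in seeds:
                updated[node] = float(seeds[node])  # true labels never change
            else:
                updated[node] = sum(scores[n] for n in adjacent) / len(adjacent)
        scores = updated
    # The final label is the sign of the propagated score.
    return {node: (1 if score >= 0 else -1) for node, score in scores.items()}

# Hypothetical chain graph 0 - 1 - 2 - 3 - 4 - 5 with one labeled node at each end.
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
labels = propagate_labels(graph, seeds={0: 1, 5: -1})
print(labels)
```

Nodes closer to the +1 seed end up labeled +1 and nodes closer to the -1 seed end up labeled -1, which is the "labels flow along edges" behaviour described above.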
• Generative Adversarial Networks (GANs):
1. GANs consist of two neural networks, a generator and a
discriminator, which are trained simultaneously through a min-max
game.
2. In semi-supervised learning, GANs can be used to generate realistic
samples from the unlabeled data distribution.
3. The generated samples are combined with the labeled data to
train a classifier, effectively leveraging the unlabeled data to
improve classification performance.
• Reinforcement Learning:
• Definition: Reinforcement learning involves an agent learning to
make decisions by interacting with an environment and receiving
feedback in the form of rewards or penalties.
• Usefulness: Reinforcement learning is beneficial in real-time
environments where decisions must be made sequentially and actions
have consequences. It is used in areas such as robotics, gaming, and
autonomous systems.
• Real-world Applications: Reinforcement learning is applied in various domains such as:
1. Autonomous vehicles: Teaching vehicles to navigate safely and efficiently.
2. Resource management: Optimizing energy usage, inventory management, etc.
• Advantages:
1. Can learn complex behaviors and strategies through trial and error.
2. Suitable for environments with sparse or delayed feedback.
• Disadvantages:
1. Requires a well-defined reward structure, which may be challenging to specify.
2. Can be computationally expensive and time-consuming to train.
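The agent/environment/reward loop can be sketched with tabular Q-learning on a tiny corridor world. Everything here is an illustrative assumption: the five-state corridor, the reward of 1 for reaching the last state, and the values of `ALPHA` and `GAMMA`. To keep the run reproducible, every state-action pair is tried in turn on each sweep instead of exploring at random, which is a simplification of how an agent would normally gather experience.

```python
ACTIONS = (-1, +1)        # move left or move right
GOAL = 4                  # reaching state 4 ends the episode
ALPHA, GAMMA = 0.5, 0.9   # learning rate and discount factor (assumed values)

def step(state, action):
    """Environment: return (next_state, reward, done) for one move."""
    next_state = max(0, min(GOAL, state + action))
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done

def train(sweeps=200):
    # Q-table: estimated long-term reward of each action in each state.
    q = {s: {a: 0.0 for a in ACTIONS} for s in range(GOAL)}
    for _ in range(sweeps):
        for state in range(GOAL):
            for action in ACTIONS:
                next_state, reward, done = step(state, action)
                future = 0.0 if done else max(q[next_state].values())
                target = reward + GAMMA * future
                # Q-learning update: move the estimate toward the target.
                q[state][action] += ALPHA * (target - q[state][action])
    return q

q_table = train()
# Greedy policy: in each state pick the action with the highest Q-value.
policy = [max(q_table[s], key=q_table[s].get) for s in range(GOAL)]
print(policy)
```

The reward is only given at the goal, yet the discounted update spreads its value back along the corridor, so the learned policy moves right from every state. This is an instance of learning from delayed feedback.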
Problems Not to Be Solved Using Machine Learning
• Although machine learning (ML) is a powerful tool for solving a wide range of problems, there are certain types of problems that it may not be well-suited to address effectively.
• Some examples of problems that cannot be easily solved by machine
learning alone:
• Lack of Data, Undefined Objectives, Causal Inference, Ethical and Moral Judgments, Unstructured Problem Solving, Domain Expertise, Conceptual Understanding, Small Sample Sizes, Incorporating Context, Real-time Critical Decision-making, Extreme Context Shifts, New and Novel Situations, Interpersonal and Emotional Understanding.
1. Lack of Data: Machine learning models require sufficient and high-
quality data for training. If the data is scarce, incomplete, or
biased, the performance of ML models can suffer.
2. Undefined Objectives: Machine learning relies on well-defined
objectives and metrics for optimization. If the problem itself is not
well-defined, ML might not be effective in finding solutions.
3. Causal Inference: Causal inference is the process of drawing
conclusions about causal relationships based on data and statistical
methods. Determining cause and effect requires more rigorous
experimental design and statistical methods.
4. Ethical and Moral Judgments: Decisions involving ethical considerations, moral judgments, and values often require human reasoning, empathy, and contextual understanding, which machine learning lacks.
5. Unstructured Problem Solving: Machine learning is often used for
structured tasks with clearly defined inputs and outputs. Problems
requiring creative thinking, intuition, and subjective judgment may not
be suitable for ML.
6. Domain Expertise: ML models require domain-specific knowledge for
7. Conceptual Understanding: ML models can predict outcomes based
on patterns in data, but they may not provide a deep conceptual
understanding of underlying phenomena or processes.
8. Small Sample Sizes: Some machine learning algorithms,
particularly deep learning models, require large amounts of data to
generalize well. Small sample sizes can lead to overfitting and poor
performance.
9. Incorporating Context: Contextual understanding and reasoning
based on broader context, cultural nuances, and real-world
experiences are areas where machines may struggle.
10. Real-time Critical Decision-making: Situations that require real-time decision-making, especially in high-stakes environments like
healthcare or aviation, may not allow sufficient time for the
learning and adaptation process of ML models.
11. Extreme Context Shifts: Machine learning models might not perform well when deployed in situations drastically different from their training environment. They lack adaptability to extreme shifts in context.
12. New and Novel Situations: ML models typically operate based on
patterns learned from past data. When faced with entirely new
and novel situations, they might not have sufficient information to
13. Interpersonal and Emotional Understanding: Recognizing
and responding to human emotions, nuances, and
interpersonal interactions are challenging tasks that require
human emotional intelligence and social understanding.
Applications of Machine Learning
• Machine learning (ML) has become a powerful tool across various industries, transforming how we live and work.
• Here's an overview of its applications and limitations:
Healthcare:
1. Disease diagnosis and prognosis
2. Personalized treatment recommendation
3. Drug discovery and development
4. Medical imaging analysis (e.g., MRI, CT scans)
5. Electronic health record (EHR)
Finance:
1. Fraud detection and prevention
2. Credit scoring and risk assessment
3. Algorithmic trading and financial forecasting
4. Customer segmentation and targeted marketing
5. Portfolio optimization and wealth management
E-commerce:
1. Product recommendation and personalized shopping experiences
2. Customer segmentation and churn prediction
3. Price optimization and dynamic pricing strategies
4. Fraud detection and prevention
5. Supply chain optimization
Marketing:
1. Customer segmentation and targeting
2. Sentiment analysis and brand sentiment monitoring
3. Social media analytics and influencer identification
4. Customer lifetime value prediction
5. Campaign optimization and marketing attribution modeling
Manufacturing:
1. Predictive maintenance for machinery and equipment
2. Quality control and defect detection
3. Supply chain optimization and inventory management
4. Demand forecasting and production planning
Transportation:
1. Autonomous vehicles and self-driving cars
2. Route optimization and traffic prediction
3. Demand forecasting for ride-sharing and delivery services
4. Fleet management and vehicle routing
5. Predictive maintenance for transportation infrastructure
Natural Language Processing (NLP):
1. Sentiment analysis and opinion mining
2. Text classification and document categorization
3. Language translation and multilingual communication
4. Chatbots and virtual assistants
5. Text summarization and
Computer Vision:
1. Object detection and recognition
2. Image classification and segmentation
3. Facial recognition and biometric authentication
4. Autonomous drones and aerial surveillance
5. Medical image analysis and diagnosis
Limitations of ML
1. Lack of Common Sense Reasoning: ML algorithms struggle with
tasks requiring common sense or understanding the context of a
situation beyond the data they are trained on.
2. Creativity and Innovation: While applications like generating creative
text formats are emerging, current ML techniques lack the
ability to truly innovate or come up with entirely new ideas.
3. Ethical Decision-Making: ML algorithms cannot make ethical
judgments or navigate complex moral dilemmas requiring
human values and understanding.
Tools in Machine Learning
• Machine learning, a branch of artificial intelligence, is rapidly
evolving and requires a robust set of tools to build, train, and
deploy models effectively.
• Here are some of the most popular tools, along with their
advantages, disadvantages, and limitations:
1. SCIKIT-LEARN:
• Advantages:
Simple and easy-to-use API, making it great for beginners.
Comprehensive documentation and community support.
Implements a wide range of classical machine learning algorithms.
• Disadvantages:
Limited support for deep learning models.
May not be suitable for very large datasets or complex model
architectures.
• Limitations:
Lack of flexibility in customization compared to other
2. TENSORFLOW:
• Advantages:
Highly flexible and scalable, suitable for both research and production.
Supports deep learning models with customizable architectures.
TensorFlow Serving allows easy deployment of models.
• Disadvantages:
Steeper learning curve compared to simpler libraries like scikit-learn.
Requires more lines of code for simple tasks.
• Limitations:
May require significant computational resources for training complex
models.
3. PYTORCH:
• Advantages:
Dynamic computational graph makes it easier to debug and
experiment.
Pythonic API is intuitive and easy to learn.
Growing popularity in both research and industry.
• Disadvantages:
Less mature ecosystem compared to TensorFlow.
Limited production deployment tools compared to TensorFlow Serving.
• Limitations:
Training large models can be slower compared to TensorFlow due to
lack of
4. KERAS:
• Advantages:
High-level API, allowing for rapid prototyping and experimentation.
Can run on top of TensorFlow, Theano, or Microsoft Cognitive Toolkit.
Simplified syntax makes it easy to build neural networks.
• Disadvantages:
Less flexibility compared to TensorFlow or PyTorch.
May not be suitable for implementing custom architectures.
• Limitations:
Limited support for complex research experiments compared to TensorFlow or
PyTorch.
5. APACHE SPARK MLLIB:
• Advantages:
Distributed computing capabilities suitable for big data processing.
Integration with the Apache Spark ecosystem for data preprocessing and analysis.
• Disadvantages:
Limited algorithms compared to standalone libraries like scikit-learn.
Slower compared to native implementations for smaller datasets.
Issues in Machine Learning
• Some common issues that professionals face when building ML skills and creating an application from scratch:
1. Data Quality and Quantity:
Insufficient data: Inadequate amount of data can lead to poor
model performance, especially for complex models like deep
learning.
Data imbalance: When the classes in a classification problem are not represented equally, the model may become biased towards the majority class.
Noisy data: Data may contain errors, outliers, or
2. Feature Engineering:
Identifying relevant features: Selecting the right features that
contribute to predictive performance is crucial. Missing important
features or including irrelevant ones can degrade model accuracy.
Handling categorical data: Encoding categorical variables
effectively without introducing bias or increasing dimensionality can
be challenging.
Feature scaling: Ensuring that features are on similar scales can
improve the performance of certain algorithms, such as distance-
based methods.
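Two common ways of putting features on similar scales are min-max scaling (to the range 0 to 1) and standardization (zero mean, unit standard deviation). The sketch below uses only the standard library; the sample values and function names are illustrative assumptions.

```python
import statistics

def min_max_scale(values):
    """Rescale values linearly so the smallest becomes 0 and the largest 1.
    Assumes the values are not all identical (otherwise division by zero)."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Hypothetical feature columns on very different scales.
ages = [20.0, 30.0, 40.0, 60.0]
salaries = [20000.0, 30000.0, 40000.0, 60000.0]

scaled_ages = min_max_scale(ages)
scaled_salaries = min_max_scale(salaries)
standardized_ages = standardize(ages)
print(scaled_ages, scaled_salaries)
```

After scaling, the two columns are identical, so neither dominates a distance computation merely because of its units.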
3. Model Selection and Evaluation:
Overfitting and underfitting: Overfitting occurs when a model memorizes the training data instead of generalizing to unseen data, while underfitting happens when the model is too simple to capture the underlying patterns.
Hyperparameter tuning: Selecting the optimal hyperparameters for a model can be time-consuming and require extensive experimentation.
Model evaluation metrics: Choosing appropriate metrics to evaluate model performance based on the problem domain is critical. Using inaccurate or misleading metrics can lead to erroneous conclusions.
4. Interpretability and Explainability:
Black-box models: Complex models such as deep neural networks
may lack interpretability, making it difficult to understand the
reasoning behind their predictions.
Model transparency: Understanding how a model makes decisions is important for gaining trust and addressing concerns about fairness, bias, and ethics.
5. Deployment and Maintenance:
Deployment challenges: Integrating machine learning models into
production systems while ensuring scalability, reliability, and efficiency can
be complex.
Monitoring and updating: Models may degrade over time due to changes
in data distribution or drift. Regular monitoring and updating are
necessary to maintain performance.
6. Ethical and Legal Considerations:
Bias and fairness: Machine learning models can inherit biases present
in the training data, leading to unfair or discriminatory outcomes.
Privacy concerns: Handling sensitive data requires careful attention to
privacy regulations and ethical considerations, such as data anonymization
Machine learning Activities
• The following are the machine learning activities:
1. Data Collection and Preprocessing
2. Exploratory Data Analysis (EDA)
3. Feature Engineering
4. Model Selection and Training
5. Model Evaluation
6. Hyperparameter Tuning
7. Cross-Validation
8. Model Interpretation and Explainability
9. Deployment and Monitoring
10. Transfer Learning
1. Data Collection and Preprocessing
• Example: Suppose we are building a spam email classifier. We collect a dataset containing emails labeled as spam or not spam.
• Preprocessing involves tasks like removing HTML tags, converting text to lowercase, removing null values and stop words, and tokenization.
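A few of these preprocessing steps can be sketched with the standard library. The stop-word list is a tiny illustrative sample, the tag-stripping regular expression is a rough approximation that is adequate only for simple markup, and the function name `preprocess` is an assumption.

```python
import re

# A small illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "is", "to", "and", "of", "you", "your"}

def preprocess(text):
    """Strip HTML tags, lowercase, tokenize, and drop stop words."""
    text = re.sub(r"<[^>]+>", " ", text)       # remove HTML tags
    text = text.lower()                        # convert to lowercase
    tokens = re.findall(r"[a-z0-9']+", text)   # tokenization
    return [t for t in tokens if t not in STOP_WORDS]

emails = ["<p>Claim your <b>FREE</b> prize now!</p>", None, "Meeting is at 10"]
# Remove null values before processing the remaining emails.
cleaned = [preprocess(e) for e in emails if e is not None]
print(cleaned)
```

The output is a list of token lists, one per non-null email, ready to be turned into features for the classifier.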
2. Exploratory Data Analysis (EDA)
• Example: Before training a model, we analyze the distribution of
features, correlations, and outliers in our dataset. For instance, we
might visualize the frequency of spam words in spam emails
3. Feature Engineering:
• Example: In a fraud detection system, we create new features like
transaction frequency, average transaction amount, and account
holder age based on existing data. These features provide more
information to the model for better fraud detection.
4. Model Selection and Training:
• Example: We experiment with different algorithms (e.g., logistic regression, random forest, neural networks) and hyperparameters to find the best-performing model for our task. For instance, we train multiple classifiers on our spam email dataset and compare their
5. Model Evaluation:
• Example: After training our spam email classifier, we evaluate its
performance using metrics like accuracy, precision, recall, and F1-score.
We split our dataset into training and testing sets to assess how
well the model generalizes to unseen or new data.
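The four metrics named above can be computed directly from the true and predicted labels of the test set. The label vectors below are made up (1 = spam, 0 = not spam), and the function name is an assumption.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    tn = sum(t != positive and p != positive for t, p in pairs)  # true negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical test-set labels and the classifier's predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Precision answers "of the emails flagged as spam, how many were spam?", while recall answers "of the actual spam emails, how many were caught?"; F1 is their harmonic mean.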
6. Hyperparameter Tuning:
• Example: We use techniques like grid search or random search to tune the hyperparameters of our machine learning model. For instance, we adjust the learning rate, regularization strength, and batch size of a neural network to optimize its performance on a validation set.
7. Cross-Validation:
• Example: Instead of relying on a single train-test split, we perform
k-fold cross-validation to evaluate our model's performance more
robustly. For example, we divide our data into 5 folds, train the
model on 4 folds, and validate it on the remaining 1 fold, repeating
this process five times.
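The splitting step of k-fold cross-validation can be sketched as an index generator. This version does not shuffle the data first, which is a simplification; in practice the samples are usually shuffled (or stratified) before being divided into folds. The function name is an assumption.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (training_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        yield training, validation
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
for training, validation in folds:
    # A model would be trained on `training` and scored on `validation` here.
    print(validation, training)
```

Each sample appears in exactly one validation fold, so averaging the five validation scores uses all of the data for evaluation without ever testing on a sample the model was trained on.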
8. Model Interpretation and Explainability:
• Example: In a medical diagnosis system, we use techniques like SHAP (SHapley Additive exPlanations) or LIME (Local Interpretable Model-agnostic Explanations) to interpret the predictions of our model. This helps to understand which features are most
9. Deployment and Monitoring:
• Example: After building and evaluating our model, we deploy it
into a production environment where it can make real-time
predictions. We set up monitoring systems to track the model's
performance over time and retrain or update it as necessary to
maintain accuracy.
10. Transfer Learning:
• Example: In image classification, we leverage a pre-trained
convolutional neural network (CNN), such as ResNet or VGG, which was
trained on a large dataset like ImageNet. We fine-tune the CNN on
our specific task with a smaller dataset, achieving better performance
Basic Types of Data in Machine Learning
1. Numerical Data:
• Examples: Age, temperature, height, salary, stock prices.
• Advantages:
o Easy to work with in many machine learning algorithms.
o Can represent a wide range of values and magnitudes.
• Disadvantages:
o Outliers can significantly affect analysis and model performance.
o May require scaling or normalization to ensure features are on a similar scale.
2. Categorical Data:
• Examples: Gender (Male, Female), color (Red, Green, Blue), product categories (Electronics, Clothing, Books).
• Advantages:
o Useful for representing non-numeric attributes and classes.
o Can provide valuable information for classification tasks.
• Disadvantages:
o Need to be encoded into numerical values for most machine learning algorithms.
o High cardinality (many unique categories) can lead to issues like the
3. Ordinal Data:
• Examples: Education level (High School < Bachelor's < Master's < Ph.D.),
Likert scale ratings (Strongly Disagree < Disagree < Neutral < Agree <
Strongly Agree).
• Advantages:
o Preserves order or ranking among categories, providing additional
information.
o Can be useful for certain types of regression or ranking tasks.
• Disadvantages:
o Not all machine learning algorithms can handle ordinal data directly.
o May require careful encoding to maintain the ordinal relationship.
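One simple way to keep the ordinal relationship is to map each category to its rank. The sketch below uses the education levels from the example above; the names `EDUCATION_ORDER` and `encode_education` are hypothetical, and equal spacing between ranks is an assumption that may not suit every model.

```python
# Categories listed from lowest to highest, as in the example above.
EDUCATION_ORDER = ["High School", "Bachelor's", "Master's", "Ph.D."]

# Map each level to its position in the ordering (0 = lowest).
EDUCATION_RANK = {level: rank for rank, level in enumerate(EDUCATION_ORDER)}


def encode_education(levels):
    """Replace each education level with its integer rank."""
    return [EDUCATION_RANK[level] for level in levels]


encoded = encode_education(["Master's", "High School", "Ph.D."])
print(encoded)  # [2, 0, 3]
```

Because the integers follow the category order, comparisons such as "Master's > High School" still hold after encoding.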
4. Text Data:
• Examples: Tweets, emails, articles, customer reviews.
• Advantages:
o Rich source of information for sentiment analysis, text classification, and
natural language processing tasks.
o Can capture nuanced information and context.
• Disadvantages:
o High-dimensional and sparse representation can be computationally
expensive.
o Preprocessing steps like tokenization and stemming are necessary, which
can
5. Image Data:
• Examples: Photographs, medical images, satellite images.
• Advantages:
o Rich visual information suitable for tasks like object detection,
image classification, and image segmentation.
o Deep learning models like CNNs can automatically extract
hierarchical features.
• Disadvantages:
o Large memory and computational requirements for processing
high-resolution images.
o Requires extensive preprocessing and data augmentation to
6. Time Series Data:
• Examples: Stock prices over time, temperature readings, and
sensor data.
• Advantages:
o Captures temporal dependencies and trends over time.
o Suitable for forecasting, anomaly detection, and trend analysis.
• Disadvantages:
o Need to handle missing values and irregular sampling intervals.
o Sensitive to seasonality, trends, and noise.
Exploring the Structure of Data
• Exploring the structure of data involves examining its organization,
relationships, patterns, and attributes to gain insights and understanding.
• Here are examples, advantages, and disadvantages of exploring the
structure of data:
1. Descriptive Statistics:
• Examples: Mean, median, mode, standard deviation, and variance.
• Advantages:
o Provides summary statistics that describe the central tendency,
dispersion, and shape of the data distribution.
o Helps identify outliers and anomalies.
• Disadvantages:
o May not capture complex relationships between variables.
o Limited to numerical data.
• Example: Calculating the mean and standard deviation of exam
scores to understand the average performance and variability among
students.
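The exam-score example can be reproduced with Python's built-in `statistics` module. The scores below are made-up values for illustration only.

```python
import statistics

# Hypothetical exam scores for six students.
scores = [72, 85, 90, 66, 85, 78]

mean_score = statistics.mean(scores)      # central tendency
median_score = statistics.median(scores)  # middle value, robust to outliers
mode_score = statistics.mode(scores)      # most frequent value
std_score = statistics.stdev(scores)      # sample standard deviation (spread)

print(f"mean={mean_score:.2f} median={median_score} mode={mode_score} std={std_score:.2f}")
```

A mean close to the median suggests a roughly symmetric distribution, while a large standard deviation indicates high variability among students.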
2. Data Visualization:
• Examples: Histograms, scatter plots, box plots, bar charts.
• Advantages:
o Offers visual representation of data distribution, trends, and
relationships.
• Disadvantages:
o Interpretation may vary based on visualization techniques.
o Limited to visualizing a few variables at a time.
• Example: Plotting a histogram of customer ages to understand the
age distribution in a market dataset.
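A histogram is just a count of values per bin. The sketch below builds a text histogram of customer ages without a plotting library; the ages and the 10-year bin width are assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical customer ages.
ages = [23, 27, 31, 35, 36, 41, 44, 45, 52, 58, 63]
BIN_WIDTH = 10  # assumed bin width in years


def bin_counts(values, width):
    """Count how many values fall into each bin, keyed by the bin's lower edge."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))


histogram = bin_counts(ages, BIN_WIDTH)
for start, count in histogram.items():
    # One '#' per customer in the bin.
    print(f"{start:>3}-{start + BIN_WIDTH - 1:<3} {'#' * count}")
```

The same counts could be passed to a plotting library to draw a bar chart instead of text.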
3. Correlation Analysis:
• Examples: Pearson correlation coefficient, Spearman rank correlation.
• Advantages:
o Quantifies the strength and direction of relationships between
pairs of variables.
o Helps identify potential predictors in regression analysis.
• Disadvantages:
o Pearson correlation captures only linear relationships (Spearman only
monotonic ones), so other non-linear associations may be missed.
o Correlation does not imply causation.
• Example: Computing the correlation between advertising spending
and sales revenue to understand their relationship in a
marketing dataset.
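The Pearson coefficient is the covariance of the two variables divided by the product of their standard deviations. A minimal sketch for the advertising example follows; the spending and sales figures are invented, and `pearson_r` is a hypothetical helper name.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of products of deviations (the common 1/n factors cancel out).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical advertising spend and sales revenue (same units per period).
ad_spend = [10, 20, 30, 40, 50]
sales = [25, 41, 62, 79, 103]

r = pearson_r(ad_spend, sales)
print(f"Pearson r = {r:.4f}")
```

A value near +1 indicates a strong positive linear relationship; it does not show that advertising causes the sales.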
4. Dimensionality Reduction:
• Examples: Principal Component Analysis (PCA), t-Distributed
Stochastic Neighbor Embedding (t-SNE).
• Advantages:
o Reduces the dimensionality of data while preserving important
information.
o Facilitates visualization and interpretation of high-dimensional data.
• Disadvantages:
o Loss of interpretability in reduced dimensions.
o May discard some information.
• Example: Applying PCA to gene expression data to identify
principal components representing gene expression patterns.
5. Clustering Analysis:
• Examples: K-means clustering, hierarchical clustering.
• Advantages:
o Identifies natural groupings or clusters within the data.
o Useful for segmentation and pattern recognition.
• Disadvantages:
o Requires choosing the number of clusters, which can be subjective.
o Results may vary based on the choice of distance metric and
clustering algorithm.
• Example: Using K-means clustering to segment customers based on
their purchasing behavior.
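The customer segmentation example can be sketched with a one-dimensional K-means on annual spend. This is a simplified illustration: the spend values are invented, `k_means_1d` is a hypothetical helper, and the starting centroids are given by hand (real implementations usually choose them randomly).

```python
def k_means_1d(values, centroids, n_iter=10):
    """Cluster 1-D values by alternating assignment and centroid update steps."""
    clusters = []
    for _ in range(n_iter):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical annual spend per customer.
spend = [120, 150, 130, 900, 950, 1000, 400, 420, 450]
centroids, clusters = k_means_1d(spend, centroids=[100.0, 500.0, 1000.0])

for center, members in zip(centroids, clusters):
    print(f"segment around {center:.1f}: {sorted(members)}")
```

With k = 3 the customers fall into low, medium, and high spenders. A different k or different starting centroids can give different segments, which is the subjectivity noted above.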
Data Quality & Remediation
• Data quality refers to the reliability, accuracy, consistency,
completeness, and relevance of data.
• Data remediation is the process of identifying and correcting data
quality issues to ensure that data is accurate, reliable, and
suitable for analysis or decision-making.
• Here are examples, advantages, and disadvantages of data quality and
remediation in real-world scenarios:
• Examples of Data Quality Issues:
1. Inconsistent formats: In a customer database, phone numbers are
stored in various formats (e.g., +1 (555) 123-4567, 555-123-4567,
5551234567).
2. Missing values: In a sales dataset, some records may have missing
values for the "sales amount" field.
3. Duplicate records: A product inventory system contains duplicate entries
for the same item.
4. Incorrect data: In a healthcare database, a patient's birthdate is
recorded as 01/13/1900, which is not plausible.
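The first and third issues often go together: once formats are made consistent, duplicates become visible. The sketch below normalizes the three phone formats from the example. `normalize_phone` is a hypothetical helper, and treating 10-digit numbers as belonging to country code 1 is an assumption.

```python
import re


def normalize_phone(raw, default_country="1"):
    """Reduce a phone number to '+' followed by digits only."""
    digits = re.sub(r"\D", "", raw)  # strip spaces, brackets, dashes, '+'
    if len(digits) == 10:
        # Assumption: numbers without a country code use the default one.
        digits = default_country + digits
    return "+" + digits


phones = ["+1 (555) 123-4567", "555-123-4567", "5551234567"]
normalized = [normalize_phone(p) for p in phones]

# After normalization, duplicates can be removed with a set.
unique_phones = sorted(set(normalized))
print(unique_phones)  # ['+15551234567']
```

The three differently formatted entries turn out to be one phone number, so two of the records are likely duplicates.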
• Advantages of Data Quality & Remediation:
1. Improved decision-making: High-quality data leads to more accurate
insights and better-informed decisions.
2. Enhanced efficiency: Clean and reliable data reduces the time spent
on data cleaning and troubleshooting.
3. Increased trust: Stakeholders have greater confidence in data-driven
analyses and reports when data quality is high.
4. Regulatory compliance: Ensuring data quality helps
organizations comply with data protection and privacy regulations.
• Disadvantages of Data Quality & Remediation:
1. Time-consuming: Identifying and rectifying data quality issues can be
a time-intensive process, especially for large datasets.
2. Costly: Data remediation efforts may require investments in tools,
resources, and personnel.
3. Complexity: Some data quality issues may be challenging to detect
and correct, especially in heterogeneous datasets.
4. Potential for errors: Human error during data cleaning and
remediation can introduce new inaccuracies or biases.
Data Pre-Processing
• Data preprocessing is a critical step in machine learning pipelines,
involving transforming raw data into a clean, structured format
suitable for training machine learning models.
• It includes techniques for handling missing values and outliers,
scaling and normalizing features, encoding categorical variables,
removing irrelevant features, and more.
• Here's an overview along with suitable examples, advantages, and
disadvantages.
1. Handling Missing Values:
• Example: Suppose we have a dataset of customer information,
and some entries have missing values for the "income" attribute.
• We can handle this by imputing missing values using techniques like
mean, median, or mode imputation, or by using advanced
imputation methods like K-nearest neighbors (KNN) or predictive
models.
• Advantages:
o Prevents loss of valuable data.
o Improves the robustness and reliability of the dataset.
• Disadvantages:
o Imputation methods may introduce bias if not handled carefully.
o Imputed values may not accurately represent the true
underlying data distribution.
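Mean and median imputation for the "income" attribute can be sketched as follows. The incomes are invented, missing entries are represented as `None` (an assumption about how the data is stored), and `impute` is a hypothetical helper name.

```python
import statistics

# Hypothetical customer incomes; None marks a missing value.
incomes = [42000, None, 58000, 61000, None, 250000]


def impute(values, strategy="median"):
    """Fill missing entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "median":
        fill = statistics.median(observed)
    else:
        fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]


print(impute(incomes, "median"))  # missing values become 59500.0
print(impute(incomes, "mean"))    # missing values become 102750
```

The single very high income pulls the mean far above the median, which shows how the choice of imputation method can bias the filled-in values.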
2. Outlier Detection and Removal:
• Example: In a dataset of housing prices, we may find some entries
with unrealistically high or low prices.
• Outliers can be detected using statistical methods like Z-score or
IQR (Interquartile Range) and removed or adjusted accordingly.
• Advantages:
o Improves model performance by reducing the impact of outliers.
o Prevents the model from being skewed by extreme values.
• Disadvantages:
o Removal of outliers may lead to loss of valuable information.
o Subjective choice of outlier detection method and threshold.
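Both detection methods can be sketched with the `statistics` module. The housing prices are invented, the helper names are hypothetical, and the thresholds (1.5 × IQR and a Z-score cutoff) are common conventions rather than fixed rules.

```python
import statistics

# Hypothetical housing prices (in thousands); the last one looks unrealistic.
prices = [210, 225, 240, 250, 255, 260, 275, 290, 310, 2500]


def iqr_outliers(values, k=1.5):
    """Flag values more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def z_score_outliers(values, threshold=3.0):
    """Flag values whose distance from the mean exceeds threshold std devs."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / std > threshold]


print(iqr_outliers(prices))
# With only 10 points, a Z-score based on the population standard deviation
# cannot exceed sqrt(10 - 1) = 3, so a lower cutoff is used here.
print(z_score_outliers(prices, threshold=2.5))
```

The need to lower the Z-score cutoff for a small sample illustrates why the choice of method and threshold is subjective.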
3. Feature Scaling and Normalization:
• Example: In a dataset containing features with different scales
(e.g., age and income), scaling techniques like Min-Max scaling or
Standardization (Z-score normalization) can be applied to bring
the features onto a similar scale.
• Advantages:
o Ensures that features contribute equally to the model.
o Helps algorithms converge faster during training.
• Disadvantages:
o Scaling may amplify the noise in the data.
o Loss of interpretability for some features after scaling.
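Min-Max scaling maps each feature to the range [0, 1], while standardization gives it zero mean and unit standard deviation. A minimal sketch with invented age and income values follows. The helper names are hypothetical, and both functions assume the feature is not constant (otherwise they would divide by zero).

```python
import statistics


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Z-score normalization: subtract the mean, divide by the std deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical features on very different scales.
ages = [20, 30, 40, 50, 60]
incomes = [20000, 35000, 50000, 80000, 120000]

print(min_max_scale(ages))     # [0.0, 0.25, 0.5, 0.75, 1.0]
print(min_max_scale(incomes))  # now on the same 0-1 range as age
print([round(z, 2) for z in standardize(incomes)])
```

After scaling, a distance-based algorithm such as KNN is no longer dominated by the income column simply because its raw numbers are larger.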
4. Encoding Categorical Variables:
• Example: Suppose we have a categorical feature like "gender" with
values "Male" and "Female." We can encode it into
numerical values using techniques like one-hot encoding or
• Advantages:
o Allows algorithms to work with categorical data.
o Preserves the ordinal relationship between categories if needed.
• Disadvantages:
o Increases dimensionality, especially with one-hot encoding.
o May introduce sparsity and multicollinearity in the dataset.
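One-hot encoding creates one binary column per category. The sketch below is a plain-Python illustration with a hypothetical helper `one_hot_encode`; sorting the categories alphabetically to fix the column order is an arbitrary choice.

```python
def one_hot_encode(values):
    """Return the category list and one binary row per input value."""
    categories = sorted(set(values))  # fixed column order
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded


colors = ["Red", "Green", "Blue", "Green"]
categories, encoded = one_hot_encode(colors)

print(categories)  # ['Blue', 'Green', 'Red']
for color, row in zip(colors, encoded):
    print(f"{color:<6} {row}")
```

Three categories produce three columns that are mostly zeros. With hundreds of categories the same approach yields hundreds of sparse columns, which is the dimensionality cost listed above.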
5. Dimensionality Reduction:
• Example: Applying techniques like Principal Component Analysis (PCA)
or t-distributed Stochastic Neighbor Embedding (t-SNE) to
reduce the dimensionality of high-dimensional datasets while
preserving most of the relevant information.
• Advantages:
o Reduces overfitting and computational complexity.
o Visualizes high-dimensional data in lower dimensions.
• Disadvantages:
o Loss of interpretability in reduced dimensions.
o May discard some information, leading to loss of predictive
power.
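The idea behind PCA can be sketched on two-dimensional data: center the data, compute the covariance matrix, and find the direction of largest variance. The sketch below uses power iteration to find that direction, a simplification chosen to avoid external libraries (practical code would normally use a linear algebra routine). The data points and the helper name `first_principal_component` are hypothetical.

```python
import math


def first_principal_component(points, n_iter=100):
    """Return the top PCA direction, its explained variance ratio, and 1-D scores."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the 2x2 sample covariance matrix.
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)

    # Power iteration: repeatedly multiply by the covariance matrix and
    # normalize. Assumes the start vector is not orthogonal to the answer.
    vx, vy = 1.0, 1.0
    for _ in range(n_iter):
        wx = sxx * vx + sxy * vy
        wy = sxy * vx + syy * vy
        norm = math.hypot(wx, wy)
        vx, vy = wx / norm, wy / norm

    # Variance along the direction, as a share of the total variance.
    eigenvalue = vx * (sxx * vx + sxy * vy) + vy * (sxy * vx + syy * vy)
    explained = eigenvalue / (sxx + syy)

    # Project each centered point onto the direction: 2-D becomes 1-D.
    projected = [x * vx + y * vy for x, y in centered]
    return (vx, vy), explained, projected


# Hypothetical, strongly correlated two-feature data.
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
direction, explained, projected = first_principal_component(points)
print(f"direction={direction}, explained variance ratio={explained:.4f}")
```

Because the two features are almost perfectly correlated, one component keeps nearly all of the variance. The new axis is a mix of both features, which is the loss of interpretability noted above.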