WELL POSED LEARNING PROBLEMS
Well-Posed Learning Problem – A computer program is said to learn from experience E
with respect to some task T and some performance measure P, if its performance on T,
as measured by P, improves with experience E.
A problem can be characterized as a well-posed learning problem if it specifies three traits –
Task
Performance Measure
Experience
An example that illustrates a well-posed learning problem is –
1. To better filter emails as spam or not
Task – Classifying emails as spam or not
Performance Measure – The fraction of emails accurately classified as spam or not
spam
Experience – Observing you label emails as spam or not spam
APPLICATIONS OF MACHINE LEARNING
Image Recognition
Image recognition is one of the reasons behind the boom in the field of deep
learning. The task, which started with classifying cat and dog images, has now
evolved to the level of face recognition and real-world use cases based on it,
such as employee attendance tracking.
Image recognition has also helped revolutionize the healthcare industry by
employing smart systems in disease recognition and diagnosis.
Speech Recognition
Most of us have come across speech-recognition-based smart systems like Alexa and
Siri and used them to communicate. In the backend, these systems are built on
speech recognition: they are designed to convert voice instructions into text.
One more application of speech recognition that we encounter in our day-to-day life
is performing Google searches just by speaking.
Fraud Detection
In today’s world, most things have been digitalized: from buying a toothbrush to
making transactions of millions of dollars, everything is accessible and easy to
use. But with this process of digitization, cases of fraudulent transactions and
fraudulent activities have increased. Identifying them is not easy, but machine
learning systems are very efficient at these tasks.
Thanks to these applications, whenever the system detects red flags in a user’s
activity, a suitable notification is sent to the administrator so that these cases
can be monitored properly for any spam or fraudulent activity.
Self Driving Cars
If we ever saw a car being driven without a driver, we might once have assumed
that a ghost was driving it; thanks to machine learning and deep learning, in
today's world this is possible and no longer a story from a fiction book. Even
though the algorithms and tech stack behind these technologies are highly
advanced, at the core it is machine learning that has made these applications
possible.
The most common example of this use case is Tesla's cars, which are well tested
and proven for autonomous driving.
MODEL SELECTION AND GENERALIZATION
In machine learning, model selection is the process of choosing the best
predictive model for a given problem, while generalization is the ability of a
model to perform well on new data:
Model selection
Involves comparing the performance of multiple models, and choosing the one
that best fits the data. This process can be iterative, involving testing multiple
models and hyperparameters. It's important to consider other factors besides
performance, such as complexity, maintainability, and available resources.
Generalization
A model's ability to perform well on new data that's drawn from the same distribution as
the data used to create the model. A good model should generalize well to new
data. Overfitting happens when a model performs well on training data but
generalizes poorly.
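As a concrete illustration of both ideas (a minimal sketch using scikit-learn and a
synthetic dataset; the candidate models are arbitrary choices, not from the notes),
cross-validation compares candidates, and a large gap between training and test
accuracy signals overfitting:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Model selection: compare candidates by cross-validated accuracy
for model in (LogisticRegression(max_iter=1000), DecisionTreeClassifier()):
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(type(model).__name__, "CV accuracy:", round(scores.mean(), 3))

# Generalization check: a large train/test gap indicates overfitting
tree = DecisionTreeClassifier().fit(X_train, y_train)
print("train:", tree.score(X_train, y_train), "test:", round(tree.score(X_test, y_test), 3))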
CONCEPT LEARNING
Concept learning in machine learning is a process that teaches a computer program
to recognize a concept or function by analyzing a set of labeled examples:
Explanation
A concept is an idea that's formed by combining all of its attributes or
features. In concept learning, a model is trained to identify a concept or pattern
in a set of examples, and then use that concept to make predictions about new
data.
How it works
The model learns by searching for a hypothesis that best fits the training
examples. This search can be viewed as a process of learning a pattern in the
data and creating a function based on that pattern.
Approaches
Concept learning can be approached in a variety of ways, including rule-based
learning, neural networks, and decision trees. Case-based learning is a
prominent approach that involves building a repository of cases, each with a set
of features and their corresponding outcomes.
Importance
Concept learning is a fundamental part of many automated decision-making and
learning processes.
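The notes don't fix a particular search procedure; as an illustrative sketch, the
classic Find-S algorithm (assuming attribute-value examples with yes/no labels,
made up here) searches from the most specific hypothesis and generalizes it only
as much as the positive examples require:

def find_s(examples):
    # Find-S: start from the most specific hypothesis and generalize it
    # just enough to cover each positive example; "?" matches any value.
    hypothesis = None
    for features, label in examples:
        if label != "yes":              # negative examples are ignored
            continue
        if hypothesis is None:          # first positive example: copy it
            hypothesis = list(features)
        else:                           # generalize attributes that disagree
            hypothesis = [h if h == f else "?"
                          for h, f in zip(hypothesis, features)]
    return hypothesis

data = [(("sunny", "warm", "normal"), "yes"),
        (("sunny", "warm", "high"), "yes"),
        (("rainy", "cold", "high"), "no")]
print(find_s(data))  # ['sunny', 'warm', '?']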
INDUCTIVE LEARNING
Inductive learning is a machine learning technique that uses specific examples to
make generalizations or predictions. It's also known as inductive reasoning or
inductive inference.
Here's some more information about inductive learning:
How it works: Inductive learning algorithms (ILAs) use a labeled dataset to
train a model that can make predictions on new data. The model is trained to
map inputs to outputs based on the labeled examples.
How it's used: Inductive learning is often used in supervised learning, where
the data is labeled with the correct answer for each example.
Why it's used: Inductive learning is widely used because it's flexible and
generalizable.
How it's related to inductive bias: Inductive learning is closely related to the
concept of inductive bias.
HYPOTHESIS
A hypothesis in machine learning is the model’s presumption regarding the
connection between the input features and the result.
It is a representation of the mapping function that the algorithm is attempting to
discover using the training set.
To minimize the discrepancy between the expected and actual outputs, the
learning process involves modifying the weights that parameterize the
hypothesis.
The objective is to optimize the model’s parameters to achieve the best
predictive performance on new, unseen data, and a cost function is used to
assess the hypothesis’ accuracy.
Hypothesis Space (H)
Hypothesis space is the set of all possible legal hypotheses. This is the set from
which the machine learning algorithm determines the single best hypothesis that
describes the target function or the outputs.
Hypothesis (h)
A hypothesis is a function that best describes the target in supervised machine
learning. The hypothesis that an algorithm comes up with depends upon the data
and also upon the restrictions and bias that we have imposed on the data.
For a simple linear model, the hypothesis can be written as:
y = mx + b
Where,
y = predicted output (range)
m = slope of the line
x = input (domain)
b = intercept
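As a small sketch of this hypothesis in code (NumPy with made-up data points, not
part of the original notes), fitting h(x) = mx + b by least squares:

import numpy as np

# Made-up points that roughly follow y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.05, 7.0, 8.95])

# Least-squares fit of the hypothesis h(x) = m*x + b
m, b = np.polyfit(x, y, deg=1)
h = lambda x_new: m * x_new + b
print(round(m, 2), round(b, 2), round(h(5.0), 2))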
INDUCTIVE BIAS
Inductive bias can be defined as the set of assumptions or biases that a
learning algorithm employs to make predictions on unseen data based on its
training data. These assumptions are inherent in the algorithm’s design and
serve as a foundation for learning and generalization.
The inductive bias of an algorithm influences how it selects a hypothesis (a
possible explanation or model) from the hypothesis space (the set of all
possible hypotheses) that best fits the training data.
It helps the algorithm navigate the trade-off between fitting the training data
too closely (overfitting) and failing to capture its structure (underfitting), so
that it generalizes well to unseen data.
DIRECTIONAL DERIVATIVE
Directional Derivative measures how a function changes along a specified
direction at a given point, providing insights into its rate of change in that
direction. Directional Derivative can be defined as:
Dv(f) = ∇f · v
where:
∇f represents the gradient of the function
v is the direction vector along which we want to find the derivative
How to Calculate Directional Derivative
To calculate the directional derivative of a function at a given point in a specific
direction, follow these steps:
Step 1: Find the Gradient
Compute the gradient (∇f) of the function. The gradient is a vector that points in
the direction of the steepest increase of the function.
Step 2: Normalize Direction Vector
Normalize the direction vector (v) to ensure it has a length of 1. This is done by
dividing each component of the vector by its magnitude.
Step 3: Dot Product
Take the dot product of the normalized direction vector and the gradient. The dot
product is obtained by multiplying corresponding components of the two vectors
and then summing them up.
Dv(f) = ∇f⋅v
Step 4: Evaluate at a Point: Plug in the coordinates of the point where you want to
find the directional derivative into the gradient and the normalized direction vector.
Dv(f)(a, b) = ∇f(a, b)⋅v
Example 1: Compute the directional derivative of the function f(x, y) = x² + 3y
at the point P(1, 2) in the direction of the vector v = ⟨1, −1⟩.
Solution:
To compute the directional derivative of the function f(x, y) = x² + 3y at the point
P(1, 2) in the direction of the vector v = ⟨1, −1⟩, we use the following formula:
Dvf = ∇f⋅v
First, let’s find the gradient of f:
∇f = (∂f/∂x, ∂f/∂y)
∂f/∂x = 2x
∂f/∂y = 3
So, ∇f = (2x, 3).
Now, evaluate ∇f at the point P(1, 2):
∇f(1, 2) = (2(1), 3) = (2, 3)
Next, we compute the dot product of ∇f(1, 2) and v:
∇f(1, 2)⋅v = (2, 3)⋅⟨1, −1⟩ = 2⋅1 + 3⋅(−1) = 2 − 3 = −1
Therefore, the directional derivative of f(x, y) = x² + 3y at the point P(1, 2) in the
direction of the vector v = ⟨1, −1⟩ is Dvf = −1.
(Note that v = ⟨1, −1⟩ was not normalized here; if Step 2 is applied first, the unit
vector v/|v| gives Dvf = −1/√2.)
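A quick numerical check of this example (a NumPy sketch; the gradient (2x, 3) is
hard-coded from the derivation above):

import numpy as np

# f(x, y) = x^2 + 3y, so the gradient is (2x, 3)
def grad_f(x, y):
    return np.array([2.0 * x, 3.0])

v = np.array([1.0, -1.0])
g = grad_f(1, 2)                       # gradient at P(1, 2)

print(g @ v)                           # -1.0, as in the example
print(g @ (v / np.linalg.norm(v)))     # -1/sqrt(2) with a unit direction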
MINIMA/ MAXIMA
In machine learning, local minima and global minima are two important concepts
related to the optimization of loss functions.
A loss function is a function that measures the error between a model's predictions
and the ground truth. The goal of machine learning is to find a model that minimizes
the loss function.
A minimum is a point where the loss function is minimized, indicating the point where
the model has the least error.
Local minima
A local minimum is a point where the function's value is the lowest in its
immediate neighborhood.
Global minima
A global minimum is a point where the function's value is the lowest across
the entire domain.
OR
Local Minima
A point x = a is said to be a point of local minimum if the value of the function at
this point is the lowest in its neighborhood.
In simple terms, if we consider a small interval around x = a, the function will obtain
its minimum value in this interval.
Mathematically,
A function f(x) has a local minimum at x = a if there exists an open interval I (such
that I is contained in the domain of f(x)) containing a, and
f(a) <= f(x), for all x in I.
Local Maxima
Similar to local minima, a point x = b is said to be a point of local maximum if the
value of the function at this point is the highest in its neighborhood.
In simple terms, if we consider a small interval around x = b, the function will obtain
its maximum value in this interval.
Mathematically
A function f(x) has a local maximum at x = b if there exists an open
interval I (such that I is contained in the domain of f(x)) containing b, and
f(x) <= f(b), for all x in I.
Global Minima
The point within the entire domain at which the function obtains its lowest value is
known as the global minimum of the function.
There can be only one global minimum value of a function (though it may be
attained at more than one point).
It is the lowest of all the local minimum values.
i.e., x = a, is a point of global minima, if and only if f(a) <= f(x) for
all x in the domain of f(x).
Global Maxima
Similar to global minima, the point within the entire domain at which the function
obtains its highest value is known as the global maxima of the function.
There can be only one global maximum value of a function (though it may be
attained at more than one point).
It is the highest of all the local maximum values.
i.e., x = b is a point of global maxima if and only if f(x) <= f(b) for
all x in the domain of f(x).
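As an illustrative sketch (the function f(x) = x⁴ − 3x² + x and the grid are made
up for demonstration), local minima can be found by comparing each grid point with
its neighbors, while the global minimum is the overall lowest value:

import numpy as np

# f has two local minima; only one of them is the global minimum
f = lambda x: x**4 - 3 * x**2 + x
xs = np.linspace(-2.5, 2.5, 10001)
ys = f(xs)

# A grid point lower than both neighbors approximates a local minimum
idx = np.where((ys[1:-1] < ys[:-2]) & (ys[1:-1] < ys[2:]))[0] + 1
print("local minima near x =", np.round(xs[idx], 2))         # [-1.3, 1.13]
print("global minimum near x =", round(xs[ys.argmin()], 2))  # -1.3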
K-NEAREST NEIGHBORS ALGORITHM
The K-NN algorithm works by finding the K nearest neighbors to a given data
point based on a distance metric, such as Euclidean distance.
The class or value of the data point is then determined by the majority vote or
average of the K neighbors.
This approach allows the algorithm to adapt to different patterns and make
predictions based on the local structure of the data.
DISTANCE METRICS USED IN KNN
ALGORITHM
Euclidean distance
Manhattan distance
Minkowski distance
Supremum distance (Chebyshev distance)
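A minimal sketch of K-NN classification (scikit-learn's KNeighborsClassifier on the
built-in iris dataset; the choice of k = 5 and the Euclidean metric are illustrative
assumptions):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# k=5 neighbors; Euclidean distance (metric="minkowski", p=2) is the default
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))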
LINEAR CLASSIFIER
LOGISTIC REGRESSION
Logistic regression is a supervised machine learning algorithm used for classification
tasks where the goal is to predict the probability that an instance belongs to a given
class.
Logistic regression is used for binary classification, where we use the sigmoid
function, which takes the independent variables as input and produces a probability
value between 0 and 1.
For example, suppose we have two classes, Class 0 and Class 1. If the value of the
logistic function for an input is greater than 0.5 (the threshold value), it belongs
to Class 1; otherwise it belongs to Class 0.
Key Points:
Logistic regression predicts the output of a categorical dependent variable.
Therefore, the outcome must be a categorical or discrete value.
It can be Yes or No, 0 or 1, True or False, etc., but instead of giving the
exact values 0 and 1, it gives probabilistic values which lie between 0
and 1.
In Logistic regression, instead of fitting a regression line, we fit an “S” shaped
logistic function, which predicts two maximum values (0 or 1).
Logistic Function – Sigmoid Function
The sigmoid function is a mathematical function used to map the predicted
values to probabilities.
It maps any real value into another value within the range of 0 and 1. Since
the output of logistic regression must be between 0 and 1 and cannot go
beyond this limit, it forms a curve like the “S” shape.
The S-form curve is called the Sigmoid function or the logistic function.
In logistic regression, we use the concept of a threshold value, which
separates the two classes: values above the threshold tend towards 1, and
values below the threshold tend towards 0.
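A minimal sketch (scikit-learn on a synthetic binary dataset; the dataset is made
up, and the 0.5 threshold follows the description above):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression().fit(X, y)

# predict_proba returns sigmoid outputs in (0, 1); thresholding at 0.5
# reproduces the Class 0 / Class 1 decision rule described above
probs = clf.predict_proba(X[:5])[:, 1]
print(probs, (probs > 0.5).astype(int))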
DERIVATION OF SIGMOID FUNCTION
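The notes do not reproduce the derivation itself; a standard one, assuming the
model's linear score z = w·x + b equals the log-odds of the positive class, runs
as follows:
z = log(p / (1 − p))
e^z = p / (1 − p)
e^z − p·e^z = p
e^z = p(1 + e^z)
p = e^z / (1 + e^z) = 1 / (1 + e^(−z)) = σ(z)
Solving for p gives the sigmoid function σ(z), which maps any real value of z into
the range 0 to 1, producing the “S”-shaped curve described above.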
SVM
It is used in supervised learning.
SVM can classify linearly separable as well as non-linearly separable data points.
SVM is used for
o Regression analysis
o Outlier analysis
o Pattern analysis
o Classification
SVM is used to classify the data points of an n-dimensional data set,
where n = number of attributes or features of the data set.
It classifies the data points with a linear plane (hyperplane).
If the data set is 2-dimensional, the features are
F1 = X1
F2 = X2
Data points that lie on the margin hyperplanes H1 and H2 are called support vectors.
If we have two separating hyperplanes, the hyperplane with the larger margin (width)
is considered better (more accurate) and we should use that hyperplane.
The support vectors define the hyperplane with the maximum width,
the maximum marginal hyperplane (MMH).
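A minimal sketch of a linear SVM classifier (scikit-learn on a synthetic 2-feature
dataset; the data and parameters are illustrative):

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated clusters in 2 dimensions (features F1 = X1, F2 = X2)
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=0)

# A linear kernel finds the maximum marginal hyperplane (MMH)
svm = SVC(kernel="linear").fit(X, y)
print("support vectors per class:", svm.n_support_)
print("training accuracy:", svm.score(X, y))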
QUESTION 1 ON SVM
QUESTION 2 ON SVM
PRINCIPAL COMPONENT ANALYSIS
It is also known as the Karhunen-Loève (K-L) method.
It searches for k d-dimensional orthonormal vectors (principal components)
that can best be used to represent the data,
where k (number of principal components) <= d.
It uses a feature extraction technique.
It is also a dimensionality reduction technique. While reducing dimensions it does
not lose the important features of the data.
Two components
PC1 – the component having maximum variance (maximum spread)
PC2 – the second principal component. This is always orthogonal to PC1
(perpendicular to PC1)
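A minimal sketch (scikit-learn's PCA on the iris data; reducing d = 4 features to
k = 2 components is an illustrative choice):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)   # 4-dimensional data (d = 4)

# Keep k = 2 orthonormal principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print("reduced shape:", X_reduced.shape)
print("variance explained by PC1, PC2:", pca.explained_variance_ratio_)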
MULTI-CLASS CLASSIFIER
DECISION TREE CLASSIFIER
Decision tree induction is the learning of decision trees from class-labelled
training tuples.
A decision tree is a flowchart-like tree structure,
where each internal node denotes a test on an attribute/feature/column,
each branch represents an outcome of the test,
and each leaf node holds a class label.
PSEUDOCODE
Begin with your training dataset, which should have some feature variables
and a classification or regression output.
Determine the “best feature” in the dataset to split the data on.
Split the data into subsets that contain the correct values for this best feature.
This splitting basically defines a node on the tree, i.e., each node is a splitting
point based on a certain feature from our data.
Recursively generate new tree nodes by using the subsets of data created in
the step above.
Mathematically speaking, decision trees use hyperplanes that run parallel to any
one of the axes to cut your coordinate system into hyper-cuboids.
CART – classification and regression trees
The logic of decision trees can also be applied to regression problems, hence
the name CART.
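A minimal sketch (scikit-learn's DecisionTreeClassifier on the iris dataset; the
criterion and depth limit are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# criterion="entropy" uses the information gain described below for splits
tree = DecisionTreeClassifier(criterion="entropy", max_depth=2).fit(X, y)
print(export_text(tree))   # internal nodes are tests, leaves are class labels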
DECISION TREE ENTROPY
Entropy is nothing but a measure of disorder,
OR
a measure of purity/impurity.
More knowledge means less entropy.
How to calculate entropy?
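For a dataset D whose tuples belong to one of m classes, with pi the proportion of
tuples in class i (the standard formula, which the notes leave implicit):
Entropy(D) = −Σ pi log2(pi), summing over i = 1, …, m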
Observations
1. If all class labels are the same, then the entropy is zero.
2. The more the uncertainty, the more the entropy.
3. Either log2 or loge can be used to calculate entropy.
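A quick sketch of this calculation in Python (the label lists are made-up examples):

from collections import Counter
from math import log2

def entropy(labels):
    # Entropy(D) = -sum(p_i * log2(p_i)) over the class proportions p_i
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

print(entropy(["yes"] * 6))               # all labels the same -> entropy 0
print(entropy(["yes"] * 5 + ["no"] * 5))  # maximum uncertainty -> entropy 1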
NOW, WHICH ATTRIBUTE DO WE TAKE AS THE ROOT NODE?
Information gain is a metric used to train decision trees.
This metric measures the quality of a split.
The information gain is based on the decrease in entropy after the dataset
is split on an attribute.
It is the expected reduction in the information requirement caused by knowing the
value of an attribute A.
We want to partition on the attribute A that would do the “best classification”, so
that the amount of information still required to finish classifying the tuples is
minimal (minimum InfoA(D)).
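In symbols (the standard definitions matching this description, where attribute A
splits D into partitions Dj):
InfoA(D) = Σ (|Dj| / |D|) × Entropy(Dj)
Gain(A) = Entropy(D) − InfoA(D)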
STEPS TO CALCULATE INFORMATION GAIN
STEP 1: Calculate the entropy of the parent node.
STEP 2: Calculate the entropy for each partition of the column (attribute).
STEP 3: Calculate the weighted entropy of the children.
STEP 4: Calculate the information gain.
STEP 5: Calculate the information gain for all the columns.
STEP 6: Repeat, computing information gain recursively for each child node.
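A minimal worked sketch of these steps (toy data made up for illustration; the
entropy helper repeats the one from the sketch above):

from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr_index, labels):
    # STEP 1: entropy of the parent node
    parent = entropy(labels)
    # STEP 2: partition the labels by the attribute's values
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[row[attr_index]].append(label)
    # STEP 3: weighted entropy of the children, Info_A(D)
    info_a = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    # STEP 4: Gain(A) = Entropy(D) - Info_A(D)
    return parent - info_a

rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "hot")]
labels = ["no", "no", "yes", "yes"]
# STEP 5: compute the gain for every column; the best becomes the root node
print([information_gain(rows, i, labels) for i in range(2)])  # [1.0, 0.0]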