ML Pyq Ans
Q1
a. 5 business applications of ML
Ans:
Market Basket Analysis: Supermarkets use association rule learning to discover frequently bought-together items. By
identifying patterns in customer purchases (like bread and butter), stores can optimize product placement and offer
bundled promotions. This boosts cross-selling and encourages customers to buy complementary items, increasing overall
revenue.
Classification:
Customer Churn Prediction: Businesses use classification to predict customer churn, categorizing customers as likely to
leave or stay. By analyzing past customer data (e.g., interactions, purchases, complaints), companies can proactively
engage at-risk customers with targeted retention offers, reducing churn rates and enhancing customer loyalty.
Regression:
Sales Forecasting: Regression models help businesses predict future sales based on historical data and factors like
seasonality, market trends, and economic indicators. With accurate sales forecasts, companies can manage inventory
effectively, plan staffing, and make informed budgeting decisions.
Unsupervised Learning:
Customer Segmentation: Businesses use clustering (an unsupervised learning technique) to group customers with similar
characteristics or behaviors. This allows targeted marketing strategies for each segment, personalized promotions, and
better customer experiences, ultimately increasing customer satisfaction and engagement.
Reinforcement Learning:
Dynamic Pricing: E-commerce platforms apply reinforcement learning to adjust prices in real-time based on demand,
competition, and customer behavior. The model continuously learns which prices maximize profits or conversions,
adapting pricing strategies dynamically to stay competitive while optimizing revenue.
b.
Ans :
c.
Ans: In binary classification, a performance evaluation matrix is used to assess how well a model predicts the two classes, typically labeled as positive and negative. This matrix, known as the confusion matrix, captures four key outcomes for a set of predictions:
1. True Positive (TP): The model correctly predicts the positive class.
2. True Negative (TN): The model correctly predicts the negative class.
3. False Positive (FP): The model incorrectly predicts the positive class when it's actually negative (also known as
a "Type I error").
4. False Negative (FN): The model incorrectly predicts the negative class when it's actually positive (also known
as a "Type II error").
From these values, we derive several key metrics to evaluate the model’s performance:
1. Accuracy: Measures the proportion of correct predictions (both true positives and true negatives) over all predictions.
2. Precision: Measures the accuracy of positive predictions, representing how many predicted positives were actually
positive.
3. Recall (Sensitivity): Measures how well the model identifies actual positives, representing how many actual positives were
predicted as positive.
4. F1 Score: The harmonic mean of precision and recall, useful when the classes are imbalanced.
Example
Consider a model used for spam detection in emails. We’ll label “spam” as the positive class and “not spam” as the negative class.
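A rough sketch of the spam example in code, using hypothetical counts (the question does not give actual numbers); it simply applies the metric definitions above to the confusion-matrix entries:

```python
# Minimal sketch (hypothetical counts): metrics derived from a confusion matrix
# for a spam detector, where "spam" is the positive class.
TP, TN, FP, FN = 40, 45, 5, 10  # assumed example values, not from the question

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)            # sensitivity / true positive rate
f1        = 2 * precision * recall / (precision + recall)

print(f"Accuracy={accuracy:.2f}, Precision={precision:.2f}, "
      f"Recall={recall:.2f}, F1={f1:.2f}")
```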
Ans:
Gini index –
A measure of purity or impurity used while creating a decision tree.
It is calculated by subtracting the sum of the squared probabilities of each class from one.
It serves the same purpose as entropy but is known to be quicker to compute.
CART (Classification and Regression Tree) uses the Gini index as an attribute selection measure: the attribute/feature with the lower Gini index is chosen as the best attribute to split on.
Suppose we have a dataset of 10 customers and want to predict whether a customer will buy a product (class "Buy") or
not (class "No-Buy"). The initial dataset has:
6 instances of “Buy”
4 instances of “No-Buy”
To calculate the Gini Index for this dataset:
1. Class probabilities: P(Buy) = 6/10 = 0.6 and P(No-Buy) = 4/10 = 0.4.
2. Formula: Gini = 1 - (P(Buy)^2 + P(No-Buy)^2) = 1 - (0.36 + 0.16) = 0.48.
3. So, the Gini Index for this dataset is 0.48. This implies that if we randomly pick an item from this dataset and assign it a random class based on this distribution, there is a 48% chance that it would be misclassified.
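A minimal sketch of the same calculation in code; the gini helper below is just an illustration of the formula above applied to the 6 "Buy" / 4 "No-Buy" counts:

```python
# Minimal sketch: Gini index for the 6 "Buy" / 4 "No-Buy" split described above.
def gini(class_counts):
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

print(gini([6, 4]))  # 1 - (0.6**2 + 0.4**2) = 0.48
```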
Ans:
K-fold cross validation is one way to improve over the holdout method. The data set is divided into k subsets, and the
holdout method is repeated k times. Each time, one of the k subsets is used as the test set and the other k-1 subsets are
put together to form a training set. Then the average error across all k trials is computed. The advantage of this method
is that it matters less how the data gets divided. Every data point gets to be in a test set exactly once, and gets to be in a
training set k-1 times. The variance of the resulting estimate is reduced as k is increased. The disadvantage of this
method is that the training algorithm has to be rerun from scratch k times, which means it takes k times as much
computation to make an evaluation. A variant of this method is to randomly divide the data into a test and training set k
different times. The advantage of doing this is that you can independently choose how large each test set is and how
many trials you average over.
K-Fold Cross Validation is a resampling technique used to evaluate machine learning models on a limited dataset. It splits the data into k subsets (or "folds") and iteratively trains and validates the model on each fold, ensuring that every data point has a chance to be in both the training and validation sets. This helps improve the reliability of performance estimates by reducing variance caused by the specific choice of training and test sets.
1. Data Splitting: The dataset is randomly divided into k equally sized folds.
2. Training and Validation:
o The model is trained k times, each time using k-1 folds for training and the remaining one fold for validation.
o For each iteration, a different fold is set aside as the validation set, and the remaining k-1 folds are used for training.
3. Performance Averaging: After the model is trained and evaluated k times, the performance metrics (e.g., accuracy, precision, recall) from each fold are averaged to get an overall model performance estimate.
If we calculate the error E_i for each fold i using some metric (e.g., mean squared error), the final cross-validation error is given by:
CV Error = (1/k) * Σ (i = 1 to k) E_i
This average error provides a more robust estimate of the model’s performance compared to using a single training/test
split.
Suppose we have a dataset of 100 samples and use 5-fold cross-validation. Each fold then contains 20 samples; in each of the 5 iterations the model is trained on the other 80 samples and validated on the held-out 20, and the 5 validation scores are averaged (a minimal code sketch is given after the following list). Common uses of k-fold cross-validation:
1. Model Selection: K-fold cross-validation helps in selecting the best model by comparing performance across
various models. For instance, it can compare the accuracy of different classifiers (e.g., logistic regression, SVM,
decision trees) to choose the most suitable one.
2. Hyperparameter Tuning: When tuning hyperparameters (e.g., learning rate in neural networks, depth of trees
in random forests), k-fold cross-validation allows you to find the best settings that consistently yield good
performance without overfitting.
3. Estimating Model Performance: When the dataset is small or has high variance, k-fold cross-validation
provides a more reliable estimate of the model’s performance, as it tests the model on all parts of the data.
4. Preventing Overfitting: By evaluating the model multiple times on different parts of the data, k-fold cross-
validation reduces the risk of overfitting on a single train-test split.
5. Imbalanced Datasets: K-fold cross-validation can help in cases of imbalanced datasets by ensuring that all
classes are represented in both training and validation sets across folds, providing a fairer evaluation.
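A minimal sketch of the 100-sample, 5-fold example above, assuming scikit-learn is available; the synthetic dataset and the choice of a decision tree are illustrative, not specified by the question:

```python
# Minimal sketch (assumes scikit-learn): 5-fold cross-validation on a synthetic
# dataset of 100 samples, mirroring the example above.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, n_features=10, random_state=0)
model = DecisionTreeClassifier(random_state=0)

scores = cross_val_score(model, X, y, cv=5)   # one accuracy score per fold
print(scores, scores.mean())                  # averaged performance estimate
```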
Q2 A Issues in ML application:
Massive data requirement: ML algorithms require massive datasets. These systems don't just require more information than humans to understand concepts or recognize features; they require hundreds of thousands of times more.
Storing data is not a concern since purchasing space is cheap, but buying a ready-made data set is very expensive.
Creating a data set involves collecting it from different sources, organizing and formatting it as per requirements, feature sampling, record sampling, etc.
Talent deficit:
There is a shortage of skilled employees available to manage and develop analytical content for Machine Learning, i.e., to develop the technology.
Inadequate Infrastructure
Machine Learning requires vast data-churning capability. Legacy systems often can't handle the workload and buckle under pressure.
The ML developer should check the infrastructure; if it is not up to specification, the system must be upgraded with hardware acceleration and flexible storage.
The biggest tech corporations are building environments for ML applications: Alphabet Inc. (formerly Google) offers TensorFlow, while Microsoft cooperates with Facebook on the Open Neural Network Exchange (ONNX). Since the technology is still new, it may not be production-ready, or may be only borderline production-ready.
Other issues include the time consumed in planning, training and testing, the huge processing power required, the difficulty of integrating with existing legacy software, understanding which processes need automation, and many more.
Bagging:
The idea behind bagging is combining the results of multiple models (for instance, all decision trees) to get a
generalized result.
Here's a question: if you create all the models on the same set of data and combine them, will it be useful? There is a high chance that these models will give the same result since they are getting the same input. So how can we solve this problem?
One of the techniques is bootstrapping.
Bootstrapping is a sampling technique in which we create subsets of observations from the original dataset, with
replacement. The size of the subsets is the same as the size of the original set.
Bagging (or Bootstrap Aggregating) technique uses these subsets (bags) to get a fair idea of the distribution
(complete set). The size of subsets created for bagging may be less than the original set.
It is used to deal with bias-variance trade-offs and reduces the variance of a prediction model. Bagging avoids
overfitting of data and is used for both regression and classification models, specifically for decision tree
algorithms.
Bootstrapping is the method of randomly creating samples of data out of a population with replacement to
estimate a population parameter.
Key to the method is the manner in which each sample of the dataset is prepared to train ensemble members.
Examples (rows) are drawn from the dataset at random, although with replacement. Replacement means that if a
row is selected, it is returned to the training dataset for potential re-selection in the same training dataset.
This is called a bootstrap sample, giving the technique its name.
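A minimal sketch of drawing one bootstrap sample with NumPy; the toy data array is purely illustrative:

```python
# Minimal sketch: drawing one bootstrap sample (same size as the data,
# sampled with replacement) using NumPy.
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)                              # stand-in for dataset rows
sample_idx = rng.choice(len(data), size=len(data), replace=True)
bootstrap_sample = data[sample_idx]
print(bootstrap_sample)                           # some rows repeat, some are left out
```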
Boosting: It combines weak learners into strong learners by creating sequential models such that the final model has the
highest accuracy. Firstly, a model is built from the training data. Then the second model is built which tries to correct
the errors present in the first model. This procedure is continued and models are added until either the complete training
data set is predicted correctly or the maximum number of models is added.
Bagging:
Concept: Bagging is an ensemble technique that creates multiple versions of a model by training them on random subsets
of the original dataset, generated through sampling with replacement.
Model Independence: Each model in bagging is built independently of others, allowing multiple models to learn different
patterns from different data samples.
Example: Random Forest is a popular bagging algorithm where each tree is trained on a random subset of data.
Goal: Reduce variance by averaging predictions of multiple models, helping stabilize predictions and avoid overfitting.
Performance: Effective when dealing with high-variance models (e.g., decision trees), as it smooths out extreme
predictions.
Boosting:
Concept: Boosting is an iterative ensemble technique where each new model is built to correct the errors of the previous
one. Misclassified instances are given higher importance in each subsequent model.
Model Dependence: Each model depends on the previous one, focusing on the mistakes made to gradually improve
accuracy.
Example: AdaBoost and Gradient Boosting are popular boosting algorithms. AdaBoost adjusts weights for misclassified
instances, while Gradient Boosting minimizes the overall prediction error using gradient descent.
Goal: Reduce both bias and variance, improving model accuracy and handling complex patterns.
Performance: Boosting tends to produce strong models but can be prone to overfitting if not carefully managed.
Bagging helps reduce variance by averaging multiple models trained on different subsets, producing stable predictions
and minimizing overfitting.
Boosting improves both bias and variance by correcting mistakes of previous models, making it powerful for highly
accurate predictions on complex patterns.
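A minimal sketch contrasting the two approaches, assuming scikit-learn; the synthetic dataset, the number of estimators and the default tree/stump base learners are illustrative choices:

```python
# Minimal sketch (assumes scikit-learn): comparing a bagging ensemble and a
# boosting ensemble on the same synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

bagging  = BaggingClassifier(n_estimators=50, random_state=0)   # parallel trees, reduces variance
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)  # sequential stumps, reduces bias

for name, model in [("Bagging", bagging), ("AdaBoost", boosting)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```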
Q3 B
Clusters are dense regions in the data space, separated by regions of lower object density.
A cluster is defined as a maximal set of density-connected points.
DBSCAN discovers clusters of arbitrary shape.
A core point has at least MinPts points within a radius Eps; a border point falls within Eps of a core point but has fewer than MinPts neighbours; all remaining points are noise.
Any two core points that are within a distance Eps of one another are put in the same cluster.
Any border point that is close enough to a core point is put in the same cluster as the core point.
Noise points are discarded.
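A minimal sketch, assuming scikit-learn; the two-moons toy data and the eps/min_samples values are illustrative:

```python
# Minimal sketch (assumes scikit-learn): DBSCAN finding arbitrarily shaped
# clusters; eps and min_samples are illustrative values.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(set(labels))   # cluster ids; -1 marks noise points that were discarded
```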
Q4 A
A spanning tree in an undirected graph is a set of edges with no cycles that connects all nodes.
A minimum spanning tree (or MST) is a spanning tree with the least total cost.
Kruskal’s Algorithm:
Remove all edges from the graph.
Repeatedly find the cheapest edge that doesn’t create a cycle and add it back.
The result is an MST of the overall graph.
Implementing Kruskal’s Algorithm
Place every node into its own cluster.
Place all edges into a priority queue.
While there are two or more clusters remaining:
Dequeue an edge from the priority queue.
If its endpoints are not in the same cluster:
Merge the clusters containing the endpoints.
Add the edge to the resulting spanning tree.
Return the resulting spanning tree.
You're given n items and the distance d(u, v) between each pair.
d(u, v) may be an actual distance, or some abstract representation of how dissimilar two things are.
Our Goal: Divide the n items up into k groups so that the minimum distance between items in different groups is
maximized.
Main Idea:
Maintain clusters as a set of connected components of a graph.
Iteratively combine the clusters containing the two closest items by adding an edge between them.
Stop when there are k clusters.
This is exactly Kruskal's algorithm.
The “clusters" are the connected components that Kruskal’s algorithm has created after a certain point.
Suppose you want k clusters.
Given the data set, add an edge from each node to each other node whose length depends on their
similarity.
Run Kruskal's algorithm until only k clusters remain.
The pieces of the graph that have been linked together are k maximally-separated clusters.
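A minimal sketch of the procedure above, using a simple union-find; the kruskal helper and the edge weights are illustrative. Running it until one component remains returns the MST, and stopping when k components remain returns k maximally-separated clusters:

```python
# Minimal sketch: Kruskal's algorithm with union-find. k=1 gives an MST;
# k>1 stops early and yields k maximally-separated clusters.
def kruskal(n, edges, k=1):
    """edges: list of (weight, u, v); returns (chosen edges, cluster label per node)."""
    parent = list(range(n))

    def find(x):                          # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    chosen, components = [], n
    for w, u, v in sorted(edges):         # cheapest edge first
        ru, rv = find(u), find(v)
        if ru != rv:                      # endpoints in different clusters
            parent[ru] = rv               # merge the two clusters
            chosen.append((u, v, w))
            components -= 1
            if components == k:           # stop early for k-clustering
                break
    return chosen, [find(i) for i in range(n)]

# Example: 6 items with hypothetical pairwise distances as weighted edges
edges = [(1, 0, 1), (2, 1, 2), (9, 2, 3), (1, 3, 4), (2, 4, 5), (8, 1, 4), (7, 0, 5)]
mst, _ = kruskal(6, edges, k=1)          # full minimum spanning tree
_, labels = kruskal(6, edges, k=2)       # two maximally-separated clusters
print(mst, labels)
```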
Q4 B ii)
Ans:
Overfitting occurs when a machine learning model, such as a decision tree, becomes too complex and fits the training
data too closely. This results in the model capturing the noise and anomalies in the data, rather than the underlying
patterns. Consequently, the model performs poorly on new, unseen data.
Tree Pruning to Avoid Overfitting
Tree pruning is a technique used to reduce the complexity of a decision tree and prevent overfitting. There are two main
approaches:
Pre-pruning (stop early): During the tree construction process, we set a threshold for the goodness measure (e.g., information gain, Gini impurity). If splitting a node doesn't improve the goodness measure above the threshold, we stop further branching at that node.
Example:
Consider a decision tree for predicting whether an email is spam or not. If the goodness measure for splitting on the
"sender's email address" is very low, we might stop splitting further at that node, even if it could potentially improve
accuracy on the training data.
Post-pruning:
Build a full tree: First, we build a complete decision tree without any restrictions.
Prune iteratively: We start from the bottom of the tree and remove branches that don't significantly improve the
model's performance on a validation set.
Choose the best pruned tree: We select the pruned tree that gives the best performance on the validation set.
Example:
For the email spam classification tree, we might initially have a very deep tree. Post-pruning would involve removing
branches that only classify a few specific email types correctly, as these might not generalize well to new emails.
C4.5: It is considered to be better than the ID3 algorithm as it can handle both discrete and continuous data. In C4.5, splitting is done based on gain ratio (normalized information gain) as the attribute selection measure, and the feature with the highest gain ratio is made the decision node and is split further. C4.5 handles overfitting by pruning, i.e., it removes the branches/subparts of the tree that do not hold much importance or are redundant. To be specific, C4.5 follows post-pruning, i.e., removing branches after the tree is created.
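A minimal sketch of both pruning styles, assuming scikit-learn (which implements pre-pruning via thresholds such as min_impurity_decrease and post-pruning via cost-complexity pruning, not C4.5's exact method); the data split and threshold values are illustrative:

```python
# Minimal sketch (assumes scikit-learn): pre-pruning via an impurity-decrease
# threshold, and post-pruning via cost-complexity pruning on a validation set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Pre-pruning: stop splitting when the gain falls below a threshold
pre = DecisionTreeClassifier(min_impurity_decrease=0.01, random_state=0).fit(X_tr, y_tr)

# Post-pruning: grow a full tree, then keep the pruned tree that does best on validation data
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
best = max(
    (DecisionTreeClassifier(ccp_alpha=a, random_state=0).fit(X_tr, y_tr) for a in path.ccp_alphas),
    key=lambda tree: tree.score(X_val, y_val),
)
print(pre.get_depth(), best.get_depth(), best.score(X_val, y_val))
```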
Q5 B
Ans: **
Q6 A
1. Hyperplane
In the context of SVMs, a hyperplane is a decision boundary that separates different classes in a dataset. For a 2D
space, this is simply a line, while for a 3D space, it’s a plane. In higher dimensions, it becomes a "hyperplane." The goal
of an SVM is to find the optimal hyperplane that best separates the data points of different classes with the maximum
margin (distance between the nearest points of each class to the hyperplane).
Example: In a binary classification with two features (2D space), a line (hyperplane) is used to separate two classes so that
one class is on one side of the line and the other class is on the opposite side.
2. Support Vectors
Support Vectors are the data points that are closest to the hyperplane and have the greatest influence on the position
and orientation of the hyperplane. These points define the margin, and the SVM algorithm aims to maximize the
distance between the support vectors of different classes. If these points were moved, the hyperplane would shift,
making them crucial to the model.
Example: In a dataset where two classes are separated by a line, the points closest to the line (one from each class) are
the support vectors, determining the margin around the hyperplane.
3. Hard Margin
A Hard Margin SVM is a strict model that assumes that the data is linearly separable and aims to find a hyperplane that
perfectly separates the classes without any misclassification. This approach, however, is not practical when dealing with
noisy data or overlapping classes, as it’s highly sensitive to outliers and can lead to overfitting.
Example: In a dataset with perfectly separable classes, a hard margin SVM can find a line that classifies all points
correctly. However, if there’s an outlier near the other class, the model would fail to accommodate it without violating
the strict separation.
4. Soft Margin
A Soft Margin SVM allows for some misclassifications or overlap in the data to achieve better generalization. It
introduces a tolerance for certain data points that lie within the margin or even on the wrong side of the hyperplane. This
flexibility makes the model robust to noise and outliers. A regularization parameter C controls the trade-off between
maximizing the margin and minimizing misclassification errors.
Example: In a dataset with slight overlap between classes, a soft margin SVM would allow some points to be on the
wrong side of the margin or hyperplane, resulting in a more flexible boundary.
5. Kernel
A Kernel is a function used in SVMs to transform the input data into a higher-dimensional space, enabling the model to
create a linear hyperplane in this transformed space even when the data is not linearly separable in the original space. By
applying a kernel, SVM can find complex boundaries in the input space. Common kernel functions include linear,
polynomial, and radial basis function (RBF) kernels.
Example: Suppose we have data that is circularly distributed and can’t be separated with a straight line in 2D space. An
RBF kernel can map this data to a higher dimension where a linear hyperplane can be used to separate the classes
effectively.
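A minimal sketch tying the terms together, assuming scikit-learn; the circular toy data and the C/gamma values are illustrative:

```python
# Minimal sketch (assumes scikit-learn): an RBF-kernel, soft-margin SVM.
# support_vectors_ holds the points that define the margin.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)   # soft margin controlled by C
print(len(clf.support_vectors_))   # points closest to the decision boundary
print(clf.score(X, y))             # circular data separated via the kernel trick
```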
Q6 B
Ans:
Steps:
1. Create a bootstrapped dataset by sampling rows from the original data at random, with replacement.
2. Create a decision tree using the bootstrapped dataset, but only use a random subset of variables (or columns) at each step.
Using a bootstrapped sample and considering only a subset of the variables at each step results in a wide variety of trees.
The variety is what makes random forests more effective than individual decision trees.
Features:
Diversity- Not all attributes/variables/features are considered while making an individual tree, each tree is different.
Immune to the curse of dimensionality- Since each tree does not consider all the features, the feature space is reduced.
Parallelization-Each tree is created independently out of different data and attributes. This means that we can make full
use of the CPU to build random forests.
Train-Test split- In a random forest we don't have to segregate the data for train and test, as there will always be roughly one-third of the data (the out-of-bag samples) that is not seen by a given decision tree.
Stability- Stability arises because the result is based on majority voting/ averaging.
A random forest is an ensemble learning method where multiple decision trees are constructed and then they are merged to get
a more accurate prediction.
Algorithm:
1. The random forests algorithm generates many classification trees. Each tree is generated as follows:
(a) If the number of examples in the training set is N, take a sample of N examples at random - but with replacement, from the
original data. This sample will be the training set for generating the tree.
(b) If there are M input variables, a number m is specified such that at each node, m variables are selected at random out of the
M and the best split on these m is used to split the node. The value of m is held constant during the generation of the various
trees in the forest.
(c) Each tree is grown to the largest extent possible.
2. To classify a new object from an input vector, put the input vector down each of the trees in the forest. Each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the classification having the most votes over all the trees in the forest.
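A minimal sketch of the algorithm above, assuming scikit-learn; the synthetic dataset and settings are illustrative:

```python
# Minimal sketch (assumes scikit-learn): a random forest where each tree sees a
# bootstrap sample and a random subset of features at each split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=100,      # number of trees
    max_features="sqrt",   # m features considered at each split
    bootstrap=True,        # sample N rows with replacement per tree
    oob_score=True,        # evaluate on the rows each tree never saw
    random_state=0,
).fit(X, y)
print(rf.oob_score_)       # out-of-bag estimate, no separate test split needed
```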
DEC 23
Ans:
Q1 B Explain any 5 performance metrics
Q1 D Logistic regression vs support vector machines
Ans:
Logistic Regression:
It is a classification model which is used to predict the odds in favour of a particular event. The odds ratio represents the positive event which we want to predict, for example, how likely a sample has breast cancer or how likely it is for an individual to become diabetic in future. It uses the sigmoid function to map an input value to a number between 0 and 1. The basic idea of logistic regression is to adapt linear regression so that it estimates the probability that a new entry falls in a class. The linear decision boundary is simply a consequence of the structure of the regression function and the use of a threshold in the function to classify. Because Logistic Regression tries to maximize the conditional likelihood of the training data, it is highly prone to outliers. Standardization (as well as co-linearity checks) is also fundamental to make sure one feature's weights do not dominate over the others.
SVM tries to maximize the margin between the closest support vectors, whereas logistic regression maximizes the posterior class probability.
SVM is deterministic (but we can use Platt scaling for a probability score) while LR is probabilistic.
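A minimal sketch of this contrast, assuming scikit-learn; the synthetic data is illustrative:

```python
# Minimal sketch (assumes scikit-learn): logistic regression outputs class
# probabilities directly; an SVM outputs a margin-based decision value and
# needs probability=True (Platt scaling) for probability estimates.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

lr  = LogisticRegression(max_iter=1000).fit(X, y)
svm = SVC(kernel="linear", probability=True).fit(X, y)   # Platt scaling enabled

print(lr.predict_proba(X[:1]))       # probabilistic output
print(svm.decision_function(X[:1]))  # signed distance to the hyperplane
print(svm.predict_proba(X[:1]))      # calibrated probabilities via Platt scaling
```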
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all
classification thresholds. This curve plots two parameters:
True Positive Rate (TPR) is a synonym for recall and is therefore defined as TPR = TP / (TP + FN).
False Positive Rate (FPR) is defined as FPR = FP / (FP + TN).
An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering the classification threshold classifies more items as positive, thus increasing both False Positives and True Positives.
The ROC curve and AUC are commonly used to evaluate the performance of binary classification models, particularly
when dealing with imbalanced classes. These metrics provide insights into how well a model distinguishes between the
positive and negative classes.
1. ROC Curve
The ROC (Receiver Operating Characteristic) curve is a graphical plot that illustrates the diagnostic ability of a
binary classifier as its discrimination threshold varies.
True Positive Rate (TPR), also known as Sensitivity or Recall, is plotted on the y-axis. TPR measures the
proportion of actual positives that are correctly identified.
False Positive Rate (FPR) is plotted on the x-axis. FPR measures the proportion of actual negatives that are
incorrectly identified as positive.
The ROC curve shows the trade-off between the True Positive Rate and the False Positive Rate as the decision
threshold is varied.
The curve starts from the bottom left (0,0) and ends at the top right (1,1).
A perfect classifier would reach the top left (0,1) of the graph, which indicates a high TPR with no FPR.
The Area Under the ROC Curve (AUC) is a single scalar value that quantifies the overall ability of the model to
distinguish between positive and negative classes.
Example
Suppose we have a model predicting whether patients have a disease or not based on some test results. By adjusting the
threshold for classification, we can observe how the model’s TPR and FPR change:
Low Threshold: More patients are predicted to have the disease, leading to high TPR and FPR.
High Threshold: Fewer patients are predicted to have the disease, leading to lower FPR but potentially lower TPR as well.
By plotting these values, we get the ROC curve, and calculating the AUC helps quantify how well the model
distinguishes between diseased and non-diseased patients.
1. Model Comparison: ROC and AUC are useful for comparing multiple models. The model with a higher AUC is generally
preferred as it has better discrimination ability.
2. Threshold Selection: The ROC curve helps in selecting an optimal threshold based on the desired balance between
sensitivity and specificity.
3. Imbalanced Datasets: AUC is particularly useful when dealing with imbalanced datasets, as it considers both true
positives and false positives.
In summary, the ROC curve provides a visual method for assessing a model’s performance across different thresholds,
and the AUC gives a single metric that reflects the overall capability of the model to separate classes.
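A minimal sketch, assuming scikit-learn; the imbalanced synthetic data stands in for the disease-prediction example above:

```python
# Minimal sketch (assumes scikit-learn): computing the ROC curve and AUC from
# predicted probabilities.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, probs)   # TPR/FPR at each threshold
print(roc_auc_score(y_te, probs))               # area under the ROC curve
```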
Q2 B
Ans:
Assume you have a classification model, training data and testing data
The error rate is the average difference between the values predicted by the model and the correct values.
Bias
Let’s assume we have trained the model and are trying to predict values with input ‘x_train’.
The predicted values are y_predicted.
Bias is the error rate of y_predicted and y_train.
In simple terms, think of bias as the error rate of the training data.
When the error rate is high, we call it High Bias and when the error rate is low, we call it Low Bias
Variance
Let’s assume we have trained the model and this time we are trying to predict values with input ‘x_test’.
Again, the predicted values are y_predicted.
Variance is the error rate of the y_predicted and y_test
In simple terms, think of variance as the error rate of the testing data.
When the error rate is high, we call it High Variance and when the error rate is low, we call it Low Variance
Underfitting
When the model has a high error rate in the training data, we can say the model is underfitting. This usually occurs when
the number of training samples is too low.
Since our model performs badly on the training data, it consequently performs badly on the testing data as well.
A high error rate in training data implies a High Bias, therefore In simple terms, High Bias implies underfitting
Overfitting
When the model has a low error rate in training data but a high error rate in testing data, we can say the model is
overfitting.
This usually occurs when the model is too complex for the amount of training data, or the hyperparameters have been tuned to produce a very low error rate on the training data.
A low error rate in training data implies Low Bias whereas a high error rate in testing data implies a High Variance,
therefore In simple terms, Low Bias and High Variance implies overfitting
In the first image, we try to fit the data using a linear equation. Due to the low flexibility of a linear equation, it is not able
to predict the samples (training data), therefore the error rate is high, and it has a High Bias which in turn means
it’s underfitting. This model won’t perform well on unseen data.
In the second image, the model is flexible enough to predict most of the samples correctly but rigid enough to avoid
overfitting. In this case, our model will be able to do well on the testing data therefore this is an ideal model.
In the third image, although it’s able to predict almost all the samples, it has too much flexibility and will not be able to
perform well on unseen data. As a result, it will have a high error rate in the testing data. Since it has a low error rate in the training data (Low Bias) and a high error rate in the testing data (High Variance), it is overfitting.
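A minimal sketch of the three cases using polynomial fits of increasing degree (an illustrative stand-in for the images referred to above); the data and degrees are made up:

```python
# Minimal sketch: polynomials of increasing degree fitted with NumPy to
# illustrate underfitting (high bias) vs overfitting (high variance).
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 20)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, x_train.size)
x_test = rng.uniform(0, 1, 20)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, x_test.size)

for degree in (1, 3, 15):                 # typically underfit, good fit, overfit
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, round(train_err, 3), round(test_err, 3))
```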
1) Linear Regression:
a. Linear Regression model study the relationship between a single dependent variable Y and one or more
independent variable X using a best fit straight line (also known as regression line or population line).
b. If there is only one independent variable, it is called simple linear regression, if there is more than one
independent variable then it is called multiple linear regression.
The Regression Line : The least square regression line is the unique line such that the sum of the squared vertical (y)
distances between the data points and the line is smallest.
Steps to Establish a Linear Regression :
1. Carry out an experiment to gather a sample of observed values.
2. Create a relationship model.
3. Find the coefficients from the model created and establish the mathematical equation using these.
4. Compute the residual errors (residuals).
5. Use the model for prediction.
2) Multiple Linear Regression: It is a statistical technique that can use several variables to predict the outcome of a different variable.
The goal of multiple regression is to model the linear relationship between multiple independent variables and your
dependent variable.
Multiple variables = multiple features
In simple linear regression we had one independent variable and one dependent variable eg. X = house size, use this to
predict, y = house price
Whereas in multiple linear regression, we have more variables (such as number of bedrooms, number floors, age of the
home)
x1, x2, x3, x4 are the four features, x1 - size (feet squared), x2 - Number of bedrooms, x3 - Number of floors, x4 -
Age of home (years)
y is the output variable (price)
Notations:
n - number of features (n = 4)
m - number of examples (i.e. number of rows in a table)
x^(i) - the vector of the inputs for an example (so a vector of the four feature values for the ith input example); i is an index into the training set
x is an n-dimensional feature vector
x^(3) is, for example, the 3rd house, and contains the four features associated with that house
x_j^(i) - the value of feature j in the ith training example
x_2^(3) is, for example, the number of bedrooms in the third house
Model/Target Function:
For simple linear regression, hθ(x) = θ0 + θ1x
Here we have two parameters (θ0 and θ1) determined by our cost function
One variable x
Now we have multiple features
hθ(x) = θ0 + θ1x1 + θ2x2 + θ3x3 + θ4x4
In general, hθ(x) = Σ θixi where i = 0 to n
x0 = 1
Now, feature vector is n + 1 dimensional feature vector indexed from 0
This is a column vector called X
Each example has a column vector associated with it
So let's say we have a new example called "X"
Parameters are also a 0 indexed n+1 dimensional vector
This is also a column vector called θ
This vector is the same for each example
Considering this, the function/model can be written hθ(x) = θ0x0 + θ1x1 + θ2x2 + θ3x3 + θ4x4 (with x0 = 1)
If we do this, hθ(x) = θT X
θT is an [1 x n+1] matrix
In other words, because θ is a column vector, the transposition operation transforms it into a row vector
So before, θ was a matrix [n + 1 x 1] and now, θT is a matrix [1 x n+1]
Which means the inner dimensions of θT and X match, so they can be multiplied together as [1 x n+1] * [n+1 x 1] = hθ(x)
So, in other words, the transpose of parameter vector * an input example X gives a predicted output which is [1 x 1]
dimensions (i.e. a single value)
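A minimal sketch of the vectorized hypothesis with NumPy; the theta values and the house's feature values are made up for illustration:

```python
# Minimal sketch: the hypothesis h_theta(x) = theta^T x with x0 = 1, using the
# four housing features described above (all numbers are assumed).
import numpy as np

theta = np.array([80.0, 0.1, 5.0, 3.0, -2.0])   # [theta0..theta4], assumed parameters
x = np.array([1.0, 2104.0, 5.0, 1.0, 45.0])     # [x0=1, size, bedrooms, floors, age]

h = theta @ x                                    # theta^T x -> a single predicted value
print(h)                                         # predicted price (illustrative units)
```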
3. Logistic regression:
This type of statistical model (also known as logit model) is often used for classification
Logistic regression estimates the probability of an event occurring, such as voted or didn’t vote, based on a given dataset
of independent variables.
Since the outcome is a probability, the dependent variable is bounded between 0 and 1.
Consider y variable (binary classification)
0: negative class
1: positive class
Examples
Email: spam / not spam
Online transactions: fraudulent / not fraudulent
Tumor: malignant / not malignant
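A minimal sketch of a sigmoid-based prediction; the theta and x values are made up for illustration:

```python
# Minimal sketch: the sigmoid maps theta^T x to a probability in (0, 1), which
# is then thresholded at 0.5 to assign the positive or negative class.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-1.0, 0.8, 1.5])        # assumed learned parameters
x = np.array([1.0, 2.0, 0.5])             # [x0=1, feature1, feature2]

p = sigmoid(theta @ x)                    # P(y = 1 | x)
print(p, int(p >= 0.5))                   # probability and predicted class
```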
Cross validation is a model evaluation method that is better than residuals. The problem with residual evaluations is that
they do not give an indication of how well the learner will do when it is asked to make new predictions for data it has not
already seen.
One way to overcome this problem is to not use the entire data set when training a learner. Some of the data is removed
before training begins. Then when training is done, the data that was removed can be used to test the performance of
the learned model on “new” data. This is the basic idea for a whole class of model evaluation methods called cross
validation.
The holdout method is the simplest kind of cross validation. The data set is separated into two sets, called the training set
and the testing set. The model fits a function using the training set only. Then the model is asked to predict the output
values for the data in the testing set (it has never seen these output values before).
The errors it makes are accumulated as before to give the mean absolute test set error, which is used to evaluate the
model. The advantage of this method is that it is usually preferable to the residual method and takes no longer to
compute. However, its evaluation can have a high variance. The evaluation may depend heavily on which data points end
up in the training set and which end up in the test set, and thus the evaluation may be significantly different depending
on how the division is made.
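A minimal sketch of the holdout method, assuming scikit-learn; the 70/30 split and synthetic data are illustrative choices:

```python
# Minimal sketch (assumes scikit-learn): the holdout method, keeping 30% of
# the data aside and evaluating only on that unseen portion.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)  # fit on training set only
print(model.score(X_test, y_test))                               # accuracy on held-out data
```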
Ans:
4. Stacking:
Stacking is an ensemble learning technique that uses predictions from multiple models (for example decision tree, knn or
svm) to build a new model.
This model is used for making predictions on the test set.
It combines different weak learners using a meta-model.
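A minimal sketch, assuming scikit-learn's StackingClassifier; the base learners and meta-model mirror the description above, and the data is synthetic:

```python
# Minimal sketch (assumes scikit-learn): stacking a decision tree, kNN and SVM,
# with logistic regression as the meta-model.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=0)

stack = StackingClassifier(
    estimators=[("dt", DecisionTreeClassifier()),
                ("knn", KNeighborsClassifier()),
                ("svm", SVC())],
    final_estimator=LogisticRegression(),   # meta-model trained on base predictions
)
print(stack.fit(X, y).score(X, y))
```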
Q6 A multiclass classification techniques
Q6 C DBSCAN algorithm: covered above (see the DBSCAN answer under Q3 B)
MAY 23
Ans: