ML Pyq Ans
Q1
a. 5 business applications of ML
Ans:
Market Basket Analysis: Supermarkets use association rule learning to discover frequently bought-together items. By
identifying patterns in customer purchases (like bread and butter), stores can optimize product placement and offer
bundled promotions. This boosts cross-selling and encourages customers to buy complementary items, increasing overall
revenue.
Classification:
Customer Churn Prediction: Businesses use classification to predict customer churn, categorizing customers as likely to
leave or stay. By analyzing past customer data (e.g., interactions, purchases, complaints), companies can proactively
engage at-risk customers with targeted retention offers, reducing churn rates and enhancing customer loyalty.
Regression:
Sales Forecasting: Regression models help businesses predict future sales based on historical data and factors like
seasonality, market trends, and economic indicators. With accurate sales forecasts, companies can manage inventory
effectively, plan staffing, and make informed budgeting decisions.
Unsupervised Learning:
Customer Segmentation: Businesses use clustering (an unsupervised learning technique) to group customers with similar
characteristics or behaviors. This allows targeted marketing strategies for each segment, personalized promotions, and
better customer experiences, ultimately increasing customer satisfaction and engagement.
Reinforcement Learning:
Dynamic Pricing: E-commerce platforms apply reinforcement learning to adjust prices in real-time based on demand,
competition, and customer behavior. The model continuously learns which prices maximize profits or conversions,
adapting pricing strategies dynamically to stay competitive while optimizing revenue.
b.
Ans :
c.
Ans: In binary classification, a performance evaluation matrix is used to assess how well a model predicts the two classes, typically labeled as positive and negative. This matrix, known as the confusion matrix, captures four key outcomes for a set of predictions:
1. True Positive (TP): The model correctly predicts the positive class.
2. True Negative (TN): The model correctly predicts the negative class.
3. False Positive (FP): The model incorrectly predicts the positive class when it's actually negative (also known as
a "Type I error").
4. False Negative (FN): The model incorrectly predicts the negative class when it's actually positive (also known
as a "Type II error").
From these values, we derive several key metrics to evaluate the model’s performance:
1. Accuracy: Measures the proportion of correct predictions (both true positives and true negatives) over all predictions.
2. Precision: Measures the accuracy of positive predictions, representing how many predicted positives were actually
positive.
3. Recall (Sensitivity): Measures how well the model identifies actual positives, representing how many actual positives were
predicted as positive.
4. F1 Score: The harmonic mean of precision and recall, useful when the classes are imbalanced.
Example
Consider a model used for spam detection in emails. We’ll label “spam” as the positive class and “not spam” as the negative class.
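A rough sketch of the spam example in code, using hypothetical counts (the question does not give actual numbers); it simply applies the metric definitions above to the confusion-matrix entries:

```python
# Minimal sketch (hypothetical counts): metrics derived from a confusion matrix
# for a spam detector, where "spam" is the positive class.
TP, TN, FP, FN = 40, 45, 5, 10  # assumed example values, not from the question

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)            # sensitivity / true positive rate
f1        = 2 * precision * recall / (precision + recall)

print(f"Accuracy={accuracy:.2f}, Precision={precision:.2f}, "
      f"Recall={recall:.2f}, F1={f1:.2f}")
```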
Ans:
Gini index –
A measure of purity or impurity used while creating a decision tree.
It is calculated by subtracting the sum of the squared probabilities of each class from one.
It serves the same purpose as entropy but is known to be quicker to compute.
CART (Classification and Regression Tree) uses the Gini index as an attribute selection measure: the attribute/feature with the lower Gini index is chosen as the best attribute to split on.
Suppose we have a dataset of 10 customers and want to predict whether a customer will buy a product (class "Buy") or
not (class "No-Buy"). The initial dataset has:
6 instances of “Buy”
4 instances of “No-Buy”
To calculate the Gini Index for this dataset:
1. Class probabilities: P(Buy) = 6/10 = 0.6 and P(No-Buy) = 4/10 = 0.4.
2. Formula: Gini = 1 - (P(Buy)^2 + P(No-Buy)^2) = 1 - (0.36 + 0.16) = 0.48.
3. So, the Gini Index for this dataset is 0.48. This implies that if we randomly pick an item from this dataset and assign it a random class based on this distribution, there is a 48% chance that it would be misclassified.
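A minimal sketch of the same calculation in code; the gini helper below is just an illustration of the formula above applied to the 6 "Buy" / 4 "No-Buy" counts:

```python
# Minimal sketch: Gini index for the 6 "Buy" / 4 "No-Buy" split described above.
def gini(class_counts):
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

print(gini([6, 4]))  # 1 - (0.6**2 + 0.4**2) = 0.48
```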
Ans:
K-fold cross validation is one way to improve over the holdout method. The data set is divided into k subsets, and the
holdout method is repeated k times. Each time, one of the k subsets is used as the test set and the other k-1 subsets are
put together to form a training set. Then the average error across all k trials is computed. The advantage of this method
is that it matters less how the data gets divided. Every data point gets to be in a test set exactly once, and gets to be in a
training set k-1 times. The variance of the resulting estimate is reduced as k is increased. The disadvantage of this
method is that the training algorithm has to be rerun from scratch k times, which means it takes k times as much
computation to make an evaluation. A variant of this method is to randomly divide the data into a test and training set k
different times. The advantage of doing this is that you can independently choose how large each test set is and how
many trials you average over.
K-Fold Cross Validation is a resampling technique used to evaluate machine learning models on a limited dataset. It splits the data into k subsets (or "folds") and iteratively trains and validates the model on each fold, ensuring that every data point has a chance to be in both the training and validation sets. This helps improve the reliability of performance estimates by reducing variance caused by the specific choice of training and test sets.
1. Data Splitting: The dataset is randomly divided into k equally sized folds.
2. Training and Validation:
o The model is trained k times, each time using k-1 folds for training and the remaining one fold for validation.
o For each iteration, a different fold is set aside as the validation set, and the remaining k-1 folds are used for training.
3. Performance Averaging: After the model is trained and evaluated k times, the performance metrics (e.g., accuracy, precision, recall) from each fold are averaged to get an overall model performance estimate.
If we calculate the error E_i for each fold i using some metric (e.g., mean squared error), the final cross-validation error is given by:
CV Error = (1/k) * Σ (i = 1 to k) E_i
This average error provides a more robust estimate of the model’s performance compared to using a single training/test
split.
Suppose we have a dataset of 100 samples and use 5-fold cross-validation. Each fold then contains 20 samples; in each of the 5 iterations the model is trained on the other 80 samples and validated on the held-out 20, and the 5 validation scores are averaged (a minimal code sketch is given after the following list). Common uses of k-fold cross-validation:
1. Model Selection: K-fold cross-validation helps in selecting the best model by comparing performance across
various models. For instance, it can compare the accuracy of different classifiers (e.g., logistic regression, SVM,
decision trees) to choose the most suitable one.
2. Hyperparameter Tuning: When tuning hyperparameters (e.g., learning rate in neural networks, depth of trees
in random forests), k-fold cross-validation allows you to find the best settings that consistently yield good
performance without overfitting.
3. Estimating Model Performance: When the dataset is small or has high variance, k-fold cross-validation
provides a more reliable estimate of the model’s performance, as it tests the model on all parts of the data.
4. Preventing Overfitting: By evaluating the model multiple times on different parts of the data, k-fold cross-
validation reduces the risk of overfitting on a single train-test split.
5. Imbalanced Datasets: K-fold cross-validation can help in cases of imbalanced datasets by ensuring that all
classes are represented in both training and validation sets across folds, providing a fairer evaluation.
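A minimal sketch of the 100-sample, 5-fold example above, assuming scikit-learn is available; the synthetic dataset and the choice of a decision tree are illustrative, not specified by the question:

```python
# Minimal sketch (assumes scikit-learn): 5-fold cross-validation on a synthetic
# dataset of 100 samples, mirroring the example above.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, n_features=10, random_state=0)
model = DecisionTreeClassifier(random_state=0)

scores = cross_val_score(model, X, y, cv=5)   # one accuracy score per fold
print(scores, scores.mean())                  # averaged performance estimate
```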
Q2 A Issues in ML application:
Massive data requirement: ML algorithms require massive datasets. These systems don't just require more information than humans to understand concepts or recognize features; they require hundreds of thousands of times more.
Storing data is not a concern since purchasing space is cheap, but buying a ready-made data set is very expensive.
Creating a data set involves collecting it from different sources, organizing and formatting it as per requirements, feature sampling, record sampling, etc.
Talent deficit:
There is a shortage of skilled employees available to manage and develop analytical content for Machine Learning, i.e., to develop the technology.
Inadequate Infrastructure
Machine Learning requires vast data-churning capability. Legacy systems often can't handle the workload and buckle under pressure.
The ML developer should check the infrastructure; if it is not up to specification, the system must be upgraded with hardware acceleration and flexible storage.
The biggest tech corporations are building environments for ML applications: Alphabet Inc. (formerly Google) offers TensorFlow, while Microsoft cooperates with Facebook on the Open Neural Network Exchange (ONNX). Since the technology is still new, it may not be production-ready, or may be only borderline production-ready.
Other issues include the time consumed in planning, training and testing, the huge processing power required, the difficulty of integrating with existing legacy software, understanding which processes need automation, and many more.
Bagging:
The idea behind bagging is combining the results of multiple models (for instance, all decision trees) to get a
generalized result.
Here's a question: if you create all the models on the same set of data and combine them, will it be useful? There is a high chance that these models will give the same result since they are getting the same input. So how can we solve this problem?
One of the techniques is bootstrapping.
Bootstrapping is a sampling technique in which we create subsets of observations from the original dataset, with
replacement. The size of the subsets is the same as the size of the original set.
Bagging (or Bootstrap Aggregating) technique uses these subsets (bags) to get a fair idea of the distribution
(complete set). The size of subsets created for bagging may be less than the original set.
It is used to deal with bias-variance trade-offs and reduces the variance of a prediction model. Bagging avoids
overfitting of data and is used for both regression and classification models, specifically for decision tree
algorithms.
Bootstrapping is the method of randomly creating samples of data out of a population with replacement to
estimate a population parameter.
Key to the method is the manner in which each sample of the dataset is prepared to train ensemble members.
Examples (rows) are drawn from the dataset at random, although with replacement. Replacement means that if a
row is selected, it is returned to the training dataset for potential re-selection in the same training dataset.
This is called a bootstrap sample, giving the technique its name.
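A minimal sketch of drawing one bootstrap sample with NumPy; the toy data array is purely illustrative:

```python
# Minimal sketch: drawing one bootstrap sample (same size as the data,
# sampled with replacement) using NumPy.
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)                              # stand-in for dataset rows
sample_idx = rng.choice(len(data), size=len(data), replace=True)
bootstrap_sample = data[sample_idx]
print(bootstrap_sample)                           # some rows repeat, some are left out
```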
Boosting: It combines weak learners into strong learners by creating sequential models such that the final model has the
highest accuracy. Firstly, a model is built from the training data. Then the second model is built which tries to correct
the errors present in the first model. This procedure is continued and models are added until either the complete training
data set is predicted correctly or the maximum number of models is added.
Bagging:
Concept: Bagging is an ensemble technique that creates multiple versions of a model by training them on random subsets
of the original dataset, generated through sampling with replacement.
Model Independence: Each model in bagging is built independently of others, allowing multiple models to learn different
patterns from different data samples.
Example: Random Forest is a popular bagging algorithm where each tree is trained on a random subset of data.
Goal: Reduce variance by averaging predictions of multiple models, helping stabilize predictions and avoid overfitting.
Performance: Effective when dealing with high-variance models (e.g., decision trees), as it smooths out extreme
predictions.
Boosting:
Concept: Boosting is an iterative ensemble technique where each new model is built to correct the errors of the previous
one. Misclassified instances are given higher importance in each subsequent model.
Model Dependence: Each model depends on the previous one, focusing on the mistakes made to gradually improve
accuracy.
Example: AdaBoost and Gradient Boosting are popular boosting algorithms. AdaBoost adjusts weights for misclassified
instances, while Gradient Boosting minimizes the overall prediction error using gradient descent.
Goal: Reduce both bias and variance, improving model accuracy and handling complex patterns.
Performance: Boosting tends to produce strong models but can be prone to overfitting if not carefully managed.
Bagging helps reduce variance by averaging multiple models trained on different subsets, producing stable predictions
and minimizing overfitting.
Boosting improves both bias and variance by correcting mistakes of previous models, making it powerful for highly
accurate predictions on complex patterns.
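A minimal sketch contrasting the two approaches, assuming scikit-learn; the synthetic dataset, the number of estimators and the default tree/stump base learners are illustrative choices:

```python
# Minimal sketch (assumes scikit-learn): comparing a bagging ensemble and a
# boosting ensemble on the same synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

bagging  = BaggingClassifier(n_estimators=50, random_state=0)   # parallel trees, reduces variance
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)  # sequential stumps, reduces bias

for name, model in [("Bagging", bagging), ("AdaBoost", boosting)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```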
Q3 B
Clusters are dense regions in the data space, separated by regions of lower object density.
A cluster is defined as a maximal set of density-connected points.
DBSCAN discovers clusters of arbitrary shape.
A core point has at least MinPts points within a radius Eps; a border point falls within Eps of a core point but has fewer than MinPts neighbours; all remaining points are noise.
Any two core points that are within a distance Eps of one another are put in the same cluster.
Any border point that is close enough to a core point is put in the same cluster as the core point.
Noise points are discarded.
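A minimal sketch, assuming scikit-learn; the two-moons toy data and the eps/min_samples values are illustrative:

```python
# Minimal sketch (assumes scikit-learn): DBSCAN finding arbitrarily shaped
# clusters; eps and min_samples are illustrative values.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(set(labels))   # cluster ids; -1 marks noise points that were discarded
```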
Q4 A
A spanning tree in an undirected graph is a set of edges with no cycles that connects all nodes.
A minimum spanning tree (or MST) is a spanning tree with the least total cost.
Kruskal’s Algorithm:
Remove all edges from the graph.
Repeatedly find the cheapest edge that doesn’t create a cycle and add it back.
The result is an MST of the overall graph.
Implementing Kruskal’s Algorithm
Place every node into its own cluster.
Place all edges into a priority queue.
While there are two or more clusters remaining:
Dequeue an edge from the priority queue.
If its endpoints are not in the same cluster:
Merge the clusters containing the endpoints.
Add the edge to the resulting spanning tree.
Return the resulting spanning tree.
You're given n items and the distance d(u, v) between each pair.
d(u, v) may be an actual distance, or some abstract representation of how dissimilar two things are.
Our Goal: Divide the n items up into k groups so that the minimum distance between items in different groups is
maximized.
Main Idea:
Maintain clusters as a set of connected components of a graph.
Iteratively combine the clusters containing the two closest items by adding an edge between them.
Stop when there are k clusters.
This is exactly Kruskal's algorithm.
The “clusters" are the connected components that Kruskal’s algorithm has created after a certain point.
Suppose you want k clusters.
Given the data set, add an edge from each node to each other node whose length depends on their
similarity.
Run Kruskal's algorithm until only k clusters remain.
The pieces of the graph that have been linked together are k maximally-separated clusters.
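A minimal sketch of the procedure above, using a simple union-find; the kruskal helper and the edge weights are illustrative. Running it until one component remains returns the MST, and stopping when k components remain returns k maximally-separated clusters:

```python
# Minimal sketch: Kruskal's algorithm with union-find. k=1 gives an MST;
# k>1 stops early and yields k maximally-separated clusters.
def kruskal(n, edges, k=1):
    """edges: list of (weight, u, v); returns (chosen edges, cluster label per node)."""
    parent = list(range(n))

    def find(x):                          # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    chosen, components = [], n
    for w, u, v in sorted(edges):         # cheapest edge first
        ru, rv = find(u), find(v)
        if ru != rv:                      # endpoints in different clusters
            parent[ru] = rv               # merge the two clusters
            chosen.append((u, v, w))
            components -= 1
            if components == k:           # stop early for k-clustering
                break
    return chosen, [find(i) for i in range(n)]

# Example: 6 items with hypothetical pairwise distances as weighted edges
edges = [(1, 0, 1), (2, 1, 2), (9, 2, 3), (1, 3, 4), (2, 4, 5), (8, 1, 4), (7, 0, 5)]
mst, _ = kruskal(6, edges, k=1)          # full minimum spanning tree
_, labels = kruskal(6, edges, k=2)       # two maximally-separated clusters
print(mst, labels)
```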
Q4 B ii)
Ans:
Overfitting occurs when a machine learning model, such as a decision tree, becomes too complex and fits the training
data too closely. This results in the model capturing the noise and anomalies in the data, rather than the underlying
patterns. Consequently, the model performs poorly on new, unseen data.
Tree Pruning to Avoid Overfitting
Tree pruning is a technique used to reduce the complexity of a decision tree and prevent overfitting. There are two main
approaches:
Pre-pruning (stop early): During the tree construction process, we set a threshold for the goodness measure (e.g., information gain, Gini impurity). If splitting a node doesn't improve the goodness measure above the threshold, we stop further branching at that node.
Example:
Consider a decision tree for predicting whether an email is spam or not. If the goodness measure for splitting on the
"sender's email address" is very low, we might stop splitting further at that node, even if it could potentially improve
accuracy on the training data.
Post-pruning:
Build a full tree: First, we build a complete decision tree without any restrictions.
Prune iteratively: We start from the bottom of the tree and remove branches that don't significantly improve the
model's performance on a validation set.
Choose the best pruned tree: We select the pruned tree that gives the best performance on the validation set.
Example:
For the email spam classification tree, we might initially have a very deep tree. Post-pruning would involve removing
branches that only classify a few specific email types correctly, as these might not generalize well to new emails.
C4.5: It is considered to be better than the ID3 algorithm as it can handle both discrete and continuous data. In C4.5, splitting is done based on gain ratio (normalized information gain) as the attribute selection measure, and the feature with the highest gain ratio is made the decision node and is split further. C4.5 handles overfitting by pruning, i.e., it removes the branches/subparts of the tree that do not hold much importance or are redundant. To be specific, C4.5 follows post-pruning, i.e., removing branches after the tree is created.
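A minimal sketch of both pruning styles, assuming scikit-learn (which implements pre-pruning via thresholds such as min_impurity_decrease and post-pruning via cost-complexity pruning, not C4.5's exact method); the data split and threshold values are illustrative:

```python
# Minimal sketch (assumes scikit-learn): pre-pruning via an impurity-decrease
# threshold, and post-pruning via cost-complexity pruning on a validation set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Pre-pruning: stop splitting when the gain falls below a threshold
pre = DecisionTreeClassifier(min_impurity_decrease=0.01, random_state=0).fit(X_tr, y_tr)

# Post-pruning: grow a full tree, then keep the pruned tree that does best on validation data
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
best = max(
    (DecisionTreeClassifier(ccp_alpha=a, random_state=0).fit(X_tr, y_tr) for a in path.ccp_alphas),
    key=lambda tree: tree.score(X_val, y_val),
)
print(pre.get_depth(), best.get_depth(), best.score(X_val, y_val))
```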
Q5 B
Ans: **
Q6 A
1. Hyperplane
In the context of SVMs, a hyperplane is a decision boundary that separates different classes in a dataset. For a 2D
space, this is simply a line, while for a 3D space, it’s a plane. In higher dimensions, it becomes a "hyperplane." The goal
of an SVM is to find the optimal hyperplane that best separates the data points of different classes with the maximum
margin (distance between the nearest points of each class to the hyperplane).
Example: In a binary classification with two features (2D space), a line (hyperplane) is used to separate two classes so that
one class is on one side of the line and the other class is on the opposite side.
2. Support Vectors
Support Vectors are the data points that are closest to the hyperplane and have the greatest influence on the position
and orientation of the hyperplane. These points define the margin, and the SVM algorithm aims to maximize the
distance between the support vectors of different classes. If these points were moved, the hyperplane would shift,
making them crucial to the model.
Example: In a dataset where two classes are separated by a line, the points closest to the line (one from each class) are
the support vectors, determining the margin around the hyperplane.
3. Hard Margin
A Hard Margin SVM is a strict model that assumes that the data is linearly separable and aims to find a hyperplane that
perfectly separates the classes without any misclassification. This approach, however, is not practical when dealing with
noisy data or overlapping classes, as it’s highly sensitive to outliers and can lead to overfitting.
Example: In a dataset with perfectly separable classes, a hard margin SVM can find a line that classifies all points
correctly. However, if there’s an outlier near the other class, the model would fail to accommodate it without violating
the strict separation.
4. Soft Margin
A Soft Margin SVM allows for some misclassifications or overlap in the data to achieve better generalization. It
introduces a tolerance for certain data points that lie within the margin or even on the wrong side of the hyperplane. This
flexibility makes the model robust to noise and outliers. A regularization parameter C controls the trade-off between
maximizing the margin and minimizing misclassification errors.
Example: In a dataset with slight overlap between classes, a soft margin SVM would allow some points to be on the
wrong side of the margin or hyperplane, resulting in a more flexible boundary.
5. Kernel
A Kernel is a function used in SVMs to transform the input data into a higher-dimensional space, enabling the model to
create a linear hyperplane in this transformed space even when the data is not linearly separable in the original space. By
applying a kernel, SVM can find complex boundaries in the input space. Common kernel functions include linear,
polynomial, and radial basis function (RBF) kernels.
Example: Suppose we have data that is circularly distributed and can’t be separated with a straight line in 2D space. An
RBF kernel can map this data to a higher dimension where a linear hyperplane can be used to separate the classes
effectively.
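A minimal sketch tying the terms together, assuming scikit-learn; the circular toy data and the C/gamma values are illustrative:

```python
# Minimal sketch (assumes scikit-learn): an RBF-kernel, soft-margin SVM.
# support_vectors_ holds the points that define the margin.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)   # soft margin controlled by C
print(len(clf.support_vectors_))   # points closest to the decision boundary
print(clf.score(X, y))             # circular data separated via the kernel trick
```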
Q6 B
Ans:
Steps:
1. Create a bootstrapped dataset by sampling rows from the original data at random, with replacement.
2. Create a decision tree using the bootstrapped dataset, but only use a random subset of variables (or columns) at each step.
Using a bootstrapped sample and considering only a subset of the variables at each step results in a wide variety of trees.
The variety is what makes random forests more effective than individual decision trees.
Features:
Diversity- Not all attributes/variables/features are considered while making an individual tree, each tree is different.
Immune to the curse of dimensionality- Since each tree does not consider all the features, the feature space is reduced.
Parallelization-Each tree is created independently out of different data and attributes. This means that we can make full
use of the CPU to build random forests.
Train-Test split- In a random forest we don't have to segregate the data for train and test, as there will always be roughly one-third of the data (the out-of-bag samples) that is not seen by a given decision tree.
Stability- Stability arises because the result is based on majority voting/ averaging.
A random forest is an ensemble learning method where multiple decision trees are constructed and then they are merged to get
a more accurate prediction.
Algorithm:
1. The random forests algorithm generates many classification trees. Each tree is generated as follows:
(a) If the number of examples in the training set is N, take a sample of N examples at random - but with replacement, from the
original data. This sample will be the training set for generating the tree.
(b) If there are M input variables, a number m is specified such that at each node, m variables are selected at random out of the
M and the best split on these m is used to split the node. The value of m is held constant during the generation of the various
trees in the forest.
(c) Each tree is grown to the largest extent possible.
2. To classify a new object from an input vector, put the input vector down each of the trees in the forest. Each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the classification having the most votes over all the trees in the forest.
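A minimal sketch of the algorithm above, assuming scikit-learn; the synthetic dataset and settings are illustrative:

```python
# Minimal sketch (assumes scikit-learn): a random forest where each tree sees a
# bootstrap sample and a random subset of features at each split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=100,      # number of trees
    max_features="sqrt",   # m features considered at each split
    bootstrap=True,        # sample N rows with replacement per tree
    oob_score=True,        # evaluate on the rows each tree never saw
    random_state=0,
).fit(X, y)
print(rf.oob_score_)       # out-of-bag estimate, no separate test split needed
```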
DEC 23
Ans:
Q1 B Explain any 5 performance metrics
Q1 D Logistic regression vs support vector machines
Ans:
Logistic Regression:
It is a classification model which is used to predict the odds in favour of a particular event. The odds ratio represents the positive event which we want to predict, for example, how likely a sample has breast cancer or how likely it is for an individual to become diabetic in future. It uses the sigmoid function to map an input value to a number between 0 and 1. The basic idea of logistic regression is to adapt linear regression so that it estimates the probability that a new entry falls in a class. The linear decision boundary is simply a consequence of the structure of the regression function and the use of a threshold in the function to classify. Because Logistic Regression tries to maximize the conditional likelihood of the training data, it is highly prone to outliers. Standardization (as well as co-linearity checks) is also fundamental to make sure one feature's weights do not dominate over the others.
SVM tries to maximize the margin between the closest support vectors, whereas logistic regression maximizes the posterior class probability.
SVM is deterministic (but we can use Platt scaling for a probability score) while LR is probabilistic.
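A minimal sketch of this contrast, assuming scikit-learn; the synthetic data is illustrative:

```python
# Minimal sketch (assumes scikit-learn): logistic regression outputs class
# probabilities directly; an SVM outputs a margin-based decision value and
# needs probability=True (Platt scaling) for probability estimates.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

lr  = LogisticRegression(max_iter=1000).fit(X, y)
svm = SVC(kernel="linear", probability=True).fit(X, y)   # Platt scaling enabled

print(lr.predict_proba(X[:1]))       # probabilistic output
print(svm.decision_function(X[:1]))  # signed distance to the hyperplane
print(svm.predict_proba(X[:1]))      # calibrated probabilities via Platt scaling
```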
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all
classification thresholds. This curve plots two parameters:
True Positive Rate (TPR) is a synonym for recall and is therefore defined as TPR = TP / (TP + FN).
False Positive Rate (FPR) is defined as FPR = FP / (FP + TN).
An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering the classification threshold classifies more items as positive, thus increasing both False Positives and True Positives.
The ROC curve and AUC are commonly used to evaluate the performance of binary classification models, particularly
when dealing with imbalanced classes. These metrics provide insights into how well a model distinguishes between the
positive and negative classes.
1. ROC Curve
The ROC (Receiver Operating Characteristic) curve is a graphical plot that illustrates the diagnostic ability of a
binary classifier as its discrimination threshold varies.
True Positive Rate (TPR), also known as Sensitivity or Recall, is plotted on the y-axis. TPR measures the
proportion of actual positives that are correctly identified.
False Positive Rate (FPR) is plotted on the x-axis. FPR measures the proportion of actual negatives that are
incorrectly identified as positive.
The ROC curve shows the trade-off between the True Positive Rate and the False Positive Rate as the decision
threshold is varied.
The curve starts from the bottom left (0,0) and ends at the top right (1,1).
A perfect classifier would reach the top left (0,1) of the graph, which indicates a high TPR with no FPR.
The Area Under the ROC Curve (AUC) is a single scalar value that quantifies the overall ability of the model to
distinguish between positive and negative classes.
Example
Suppose we have a model predicting whether patients have a disease or not based on some test results. By adjusting the
threshold for classification, we can observe how the model’s TPR and FPR change:
Low Threshold: More patients are predicted to have the disease, leading to high TPR and FPR.
High Threshold: Fewer patients are predicted to have the disease, leading to lower FPR but potentially lower TPR as well.
By plotting these values, we get the ROC curve, and calculating the AUC helps quantify how well the model
distinguishes between diseased and non-diseased patients.
1. Model Comparison: ROC and AUC are useful for comparing multiple models. The model with a higher AUC is generally
preferred as it has better discrimination ability.
2. Threshold Selection: The ROC curve helps in selecting an optimal threshold based on the desired balance between
sensitivity and specificity.
3. Imbalanced Datasets: AUC is particularly useful when dealing with imbalanced datasets, as it considers both true
positives and false positives.
In summary, the ROC curve provides a visual method for assessing a model’s performance across different thresholds,
and the AUC gives a single metric that reflects the overall capability of the model to separate classes.
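A minimal sketch, assuming scikit-learn; the imbalanced synthetic data stands in for the disease-prediction example above:

```python
# Minimal sketch (assumes scikit-learn): computing the ROC curve and AUC from
# predicted probabilities.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, probs)   # TPR/FPR at each threshold
print(roc_auc_score(y_te, probs))               # area under the ROC curve
```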
Q2 B
Ans:
Assume you have a classification model, training data and testing data
The error rate is the average difference between the values predicted by the model and the correct values.
Bias
Let’s assume we have trained the model and are trying to predict values with input ‘x_train’.
The predicted values are y_predicted.
Bias is the error rate of y_predicted and y_train.
In simple terms, think of bias as the error rate of the training data.
When the error rate is high, we call it High Bias and when the error rate is low, we call it Low Bias
Variance
Let’s assume we have trained the model and this time we are trying to predict values with input ‘x_test’.
Again, the predicted values are y_predicted.
Variance is the error rate of the y_predicted and y_test
In simple terms, think of variance as the error rate of the testing data.
When the error rate is high, we call it High Variance and when the error rate is low, we call it Low Variance
Underfitting
When the model has a high error rate in the training data, we can say the model is underfitting. This usually occurs when
the number of training samples is too low.
Since our model performs badly on the training data, it consequently performs badly on the testing data as well.
A high error rate in training data implies a High Bias, therefore In simple terms, High Bias implies underfitting
Overfitting
When the model has a low error rate in training data but a high error rate in testing data, we can say the model is
overfitting.
This usually occurs when the model is too complex for the amount of training data, or the hyperparameters have been tuned to produce a very low error rate on the training data.
A low error rate in training data implies Low Bias whereas a high error rate in testing data implies a High Variance,
therefore In simple terms, Low Bias and High Variance implies overfitting
In the first image, we try to fit the data using a linear equation. Due to the low flexibility of a linear equation, it is not able
to predict the samples (training data), therefore the error rate is high, and it has a High Bias which in turn means
it’s underfitting. This model won’t perform well on unseen data.
In the second image, the model is flexible enough to predict most of the samples correctly but rigid enough to avoid
overfitting. In this case, our model will be able to do well on the testing data therefore this is an ideal model.
In the third image, although it’s able to predict almost all the samples, it has too much flexibility and will not be able to
perform well on unseen data. As a result, it will have a high error rate in the testing data. Since it has a low error rate in the training data (Low Bias) and a high error rate in the testing data (High Variance), it is overfitting.
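A minimal sketch of the three cases using polynomial fits of increasing degree (an illustrative stand-in for the images referred to above); the data and degrees are made up:

```python
# Minimal sketch: polynomials of increasing degree fitted with NumPy to
# illustrate underfitting (high bias) vs overfitting (high variance).
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 20)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, x_train.size)
x_test = rng.uniform(0, 1, 20)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, x_test.size)

for degree in (1, 3, 15):                 # typically underfit, good fit, overfit
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, round(train_err, 3), round(test_err, 3))
```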
1) Linear Regression:
a. Linear Regression model study the relationship between a single dependent variable Y and one or more
independent variable X using a best fit straight line (also known as regression line or population line).
b. If there is only one independent variable, it is called simple linear regression, if there is more than one
independent variable then it is called multiple linear regression.
The Regression Line : The least square regression line is the unique line such that the sum of the squared vertical (y)
distances between the data points and the line is smallest.
Steps to Establish a Linear Regression :
1. Carry out an experiment to gather a sample of observed values.
2. Create a relationship model.
3. Find the coefficients from the model created and establish the mathematical equation using these.
4. Compute the residual errors (residuals).
5. Use the model for prediction.
2) Multiple Linear Regression: It is a statistical technique that can use several variables to predict the outcome of a different variable.
The goal of multiple regression is to model the linear relationship between multiple independent variables and your
dependent variable.
Multiple variables = multiple features
In simple linear regression we had one independent variable and one dependent variable eg. X = house size, use this to
predict, y = house price
Whereas in multiple linear regression, we have more variables (such as number of bedrooms, number floors, age of the
home)
x1, x2, x3, x4 are the four features, x1 - size (feet squared), x2 - Number of bedrooms, x3 - Number of floors, x4 -
Age of home (years)
y is the output variable (price)
Notations:
n - number of features (n = 4)
m - number of examples (i.e. number of rows in a table)
x^(i) - the vector of the inputs for an example (so a vector of the four feature values for the ith input example); i is an index into the training set
x is an n-dimensional feature vector
x^(3) is, for example, the 3rd house, and contains the four features associated with that house
x_j^(i) - the value of feature j in the ith training example
x_2^(3) is, for example, the number of bedrooms in the third house
Model/Target Function:
For simple linear regression, hθ(x) = θ0 + θ1x
Here we have two parameters (θ0 and θ1) determined by our cost function
One variable x
Now we have multiple features
hθ(x) = θ0 + θ1x1 + θ2x2 + θ3x3 + θ4x4
In general, hθ(x) = Σ θixi where i = 0 to n
x0 = 1
Now, feature vector is n + 1 dimensional feature vector indexed from 0
This is a column vector called X
Each example has a column vector associated with it
So let's say we have a new example called "X"
Parameters are also a 0 indexed n+1 dimensional vector
This is also a column vector called θ
This vector is the same for each example
Considering this, the function/model can be written hθ(x) = θ0x0 + θ1x1 + θ2x2 + θ3x3 + θ4x4 (with x0 = 1)
If we do this, hθ(x) = θT X
θT is an [1 x n+1] matrix
In other words, because θ is a column vector, the transposition operation transforms it into a row vector
So before, θ was a matrix [n + 1 x 1] and now, θT is a matrix [1 x n+1]
Which means the inner dimensions of θT and X match, so they can be multiplied together as [1 x n+1] * [n+1 x 1] = hθ(x)
So, in other words, the transpose of parameter vector * an input example X gives a predicted output which is [1 x 1]
dimensions (i.e. a single value)
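A minimal sketch of the vectorized hypothesis with NumPy; the theta values and the house's feature values are made up for illustration:

```python
# Minimal sketch: the hypothesis h_theta(x) = theta^T x with x0 = 1, using the
# four housing features described above (all numbers are assumed).
import numpy as np

theta = np.array([80.0, 0.1, 5.0, 3.0, -2.0])   # [theta0..theta4], assumed parameters
x = np.array([1.0, 2104.0, 5.0, 1.0, 45.0])     # [x0=1, size, bedrooms, floors, age]

h = theta @ x                                    # theta^T x -> a single predicted value
print(h)                                         # predicted price (illustrative units)
```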
3. Logistic regression:
This type of statistical model (also known as logit model) is often used for classification
Logistic regression estimates the probability of an event occurring, such as voted or didn’t vote, based on a given dataset
of independent variables.
Since the outcome is a probability, the dependent variable is bounded between 0 and 1.
Consider y variable (binary classification)
0: negative class
1: positive class
Examples
Email: spam / not spam
Online transactions: fraudulent / not fraudulent
Tumor: malignant / not malignant
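A minimal sketch of a sigmoid-based prediction; the theta and x values are made up for illustration:

```python
# Minimal sketch: the sigmoid maps theta^T x to a probability in (0, 1), which
# is then thresholded at 0.5 to assign the positive or negative class.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-1.0, 0.8, 1.5])        # assumed learned parameters
x = np.array([1.0, 2.0, 0.5])             # [x0=1, feature1, feature2]

p = sigmoid(theta @ x)                    # P(y = 1 | x)
print(p, int(p >= 0.5))                   # probability and predicted class
```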
Cross validation is a model evaluation method that is better than residuals. The problem with residual evaluations is that
they do not give an indication of how well the learner will do when it is asked to make new predictions for data it has not
already seen.
One way to overcome this problem is to not use the entire data set when training a learner. Some of the data is removed
before training begins. Then when training is done, the data that was removed can be used to test the performance of
the learned model on “new” data. This is the basic idea for a whole class of model evaluation methods called cross
validation.
The holdout method is the simplest kind of cross validation. The data set is separated into two sets, called the training set
and the testing set. The model fits a function using the training set only. Then the model is asked to predict the output
values for the data in the testing set (it has never seen these output values before).
The errors it makes are accumulated as before to give the mean absolute test set error, which is used to evaluate the
model. The advantage of this method is that it is usually preferable to the residual method and takes no longer to
compute. However, its evaluation can have a high variance. The evaluation may depend heavily on which data points end
up in the training set and which end up in the test set, and thus the evaluation may be significantly different depending
on how the division is made.
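A minimal sketch of the holdout method, assuming scikit-learn; the 70/30 split and synthetic data are illustrative choices:

```python
# Minimal sketch (assumes scikit-learn): the holdout method, keeping 30% of
# the data aside and evaluating only on that unseen portion.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)  # fit on training set only
print(model.score(X_test, y_test))                               # accuracy on held-out data
```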
Ans:
4. Stacking:
Stacking is an ensemble learning technique that uses predictions from multiple models (for example decision tree, knn or
svm) to build a new model.
This model is used for making predictions on the test set.
It combines different weak learners using a meta-model.
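A minimal sketch, assuming scikit-learn's StackingClassifier; the base learners and meta-model mirror the description above, and the data is synthetic:

```python
# Minimal sketch (assumes scikit-learn): stacking a decision tree, kNN and SVM,
# with logistic regression as the meta-model.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=0)

stack = StackingClassifier(
    estimators=[("dt", DecisionTreeClassifier()),
                ("knn", KNeighborsClassifier()),
                ("svm", SVC())],
    final_estimator=LogisticRegression(),   # meta-model trained on base predictions
)
print(stack.fit(X, y).score(X, y))
```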
Q6 A multiclass classification techniques
Q6 C DBSCAN algorithm: covered above (see the DBSCAN answer under Q3 B)
MAY 23
Ans: