In machine learning, a classification problem involves predicting the categorical class label
of a data point based on its features or attributes. The goal is to learn a model from labeled
training data that can accurately classify new, unseen data points into predefined classes or
categories. The classes could represent different categories, groups, or outcomes.
Mathematically, a classification problem can be represented as follows:
Given a dataset of n samples, each with p features, the dataset can be denoted as:
D = {(x1, y1), (x2, y2), ..., (xn, yn)}
Where:
xi represents the feature vector of the i-th sample in the p-dimensional predictor space.
yi is the corresponding class label for the i-th sample, belonging to one of K classes (K ≥ 2).
The goal is to learn a classifier f that can predict the class labels for new, unseen samples. The classifier maps input feature vectors to one of the K classes:
f : R^p → {C1, C2, ..., CK}
In this context:
p represents the dimension of the predictor space, which is the number of features in
the dataset. Each feature contributes to the decision-making process of the classifier.
K is the number of classes in the classification problem. For a binary classification
problem, K=2, while for multi-class problems, K>2.
The classifier's task is to learn the underlying relationships between the feature vectors and
the class labels by identifying decision boundaries or decision functions that separate the
classes as accurately as possible. Various algorithms and techniques, such as decision trees,
support vector machines, neural networks, and k-nearest neighbors, can be used to build
classification models and solve these types of problems.
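As a concrete illustration of a classifier f : R^p → {C1, ..., CK}, here is a minimal sketch using k-nearest neighbors. The library (scikit-learn), data set (Iris), and parameter choices are illustrative assumptions, not taken from the source:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Toy data set: n = 150 samples, p = 4 features, K = 3 classes
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# f maps feature vectors in R^p to one of the K class labels
f = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
pred = f.predict(X_test)
print("predicted class labels:", sorted(set(pred)))
```

Any of the other algorithms mentioned above (decision trees, SVMs, neural networks) could be dropped in place of the k-NN estimator with the same fit/predict interface.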
Three methods are compared: two runs of one algorithm, LDA (called LDA and LDA2), and one QDA. The only difference in the second LDA is some feature engineering: instead of taking the original pixel values alone as predictors, we take the original values together with their squares.
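The notebook's actual code is not shown in the source; the following is a minimal sketch of the three models, assuming scikit-learn and its bundled digits data set (the train/test split and the QDA regularization parameter are my choices, not the source's):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# LDA2's feature engineering: original predictors plus their squares
X_tr2 = np.hstack([X_tr, X_tr ** 2])
X_te2 = np.hstack([X_te, X_te ** 2])

models = {
    "LDA":  (LinearDiscriminantAnalysis().fit(X_tr, y_tr), X_tr, X_te),
    "LDA2": (LinearDiscriminantAnalysis().fit(X_tr2, y_tr), X_tr2, X_te2),
    # reg_param stabilizes the per-class covariance estimates
    "QDA":  (QuadraticDiscriminantAnalysis(reg_param=0.1).fit(X_tr, y_tr),
             X_tr, X_te),
}

errs = {}
for name, (m, Xtr, Xte) in models.items():
    errs[name] = (1 - m.score(Xtr, y_tr), 1 - m.score(Xte, y_te))
    print(name, "train/test misclassification:", errs[name])
```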
Assumptions:
LDA assumes that the means differ conditional on the class label, but that the covariance matrix is the same for every class; formally, x | y = k ~ N(mu_k, Sigma) with a shared Sigma.
QDA also assumes different means for different labels, but in addition the covariance matrix is allowed to differ label by label, x | y = k ~ N(mu_k, Sigma_k), which gives QDA more flexibility.
The errors computed here are simply the misclassification rates. As shown before, since this is a classification problem, the accuracy is obtained from the confusion matrix: take the entries on the diagonal and divide by the number of observations. The misclassification rate is just the opposite: take everything off the diagonal and divide by the number of observations.
Looking at the misclassification rates, the first model (LDA) has about 6% misclassification on the training set. The second model (LDA2), where we do a bit of feature engineering by also adding the squared terms, already does better on the training set and also on the test set, with a 10% test error. The third method is a simple setting where you can start seeing overfitting: QDA has a very low training error, around 1-2%, but a 13% test misclassification rate.
So, answering very briefly the question of which method is best for this data set: I would say the second one, LDA2, which does not overfit as much as QDA but is still flexible enough, because the feature-engineering transformation helped reduce the test misclassification from about 11% to 10%.
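The accuracy and misclassification computations from the confusion matrix can be sketched as follows (the matrix below is a made-up 3-class example, not the one from the lecture):

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted)
cm = np.array([[50,  2,  1],
               [ 3, 45,  4],
               [ 0,  5, 40]])

n = cm.sum()                                        # number of observations
accuracy = np.trace(cm) / n                         # diagonal entries over n
misclassification = (cm.sum() - np.trace(cm)) / n   # off-diagonal over n

print("accuracy:", accuracy)                    # 0.9
print("misclassification:", misclassification)  # 0.1
```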
finally it asks what
could we do with multinomial regression with the last two so here I mean it's again a bit more
theoretical question last regularization the idea is also it also applies to to regression is that as we
said many times with the last two you want to four yeah induce some sort of sparsity in the weights
that you're going to estimate sparsity means that you will force some of the weights of course when
you tune up your Lambda you will force some of the estimated weights to be 0 uh and what is the
two end of the tuning parameter in this method of course it well of course it's Lambda so this is
something it might sound easy question but that's also it's very important so everyone has to know
that that's why we will ask us or something like this and what do we expect about the regularization
OK for this data set in principle it could could help by thinking that OK the the in a sense OK what do
we have we have each row OK so on the on the in the X axis is an image and most of the predictors
right of the columns that are the the pixels are zero so in fact adding some regularization here
doesn't seem to be a bad idea in the sense that the ground truth here is that while we have many
many entries exactly equal to zero so and some entries not equal to to zero so we would like our
coefficient to kind of be activated only for the entries where you have some pixels I don't know if
this again this is a qualitative answer I don't know you know what he would say about that for this
specific image data set with regularization on logistic but I think that's something
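A sketch of what this could look like with scikit-learn's L1-penalized multinomial logistic regression. The data set, scaling, and penalty strength are my assumptions; note that scikit-learn parameterizes the penalty as C = 1/lambda, so a small C means strong regularization:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)
X = X / 16.0  # scale pixels into [0, 1]; many entries are exactly zero

# L1 (lasso) penalty; C = 1/lambda, so C = 0.05 is a fairly strong penalty
clf = LogisticRegression(penalty="l1", solver="saga", C=0.05,
                         max_iter=2000).fit(X, y)

# Sparsity: fraction of estimated weights forced exactly to zero
sparsity = float(np.mean(clf.coef_ == 0))
print(f"{sparsity:.0%} of the coefficients are exactly zero")
```

Increasing lambda (decreasing C) drives more coefficients to exactly zero, which matches the intuition above: coefficients should stay active only where pixels actually carry signal.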
I agree with what you are saying. It is hard to say whether it would be better than the LDA, of course, but it is certainly something worth trying. OK, this is again a theory question, but a very important one: model parameter versus tuning parameter.
A model parameter is a parameter that you fit to the data: it is learned on the data, if you want, by the algorithm. It is like the beta in linear regression or in logistic regression; it is something the specific model learns from the data. A tuning parameter, on the other hand, is something that you, the user, can change by hand; it is also called a hyperparameter. And how do you choose it? Hyperparameters, or tuning parameters (the two are synonyms), can always be set by hand, but the best way to choose them is to use cross-validation, as we said. Given only the training data, the model parameters are learned when you fit the model, in a sense automatically by the model itself, whereas the hyperparameters must be chosen, and this is usually done with cross-validation.
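The standard pattern can be sketched with scikit-learn (the estimator and the grid of C values are illustrative assumptions): calling .fit() learns the model parameters (the betas), while cross-validation chooses the tuning parameter:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_digits(return_X_y=True)
X = X / 16.0

# The betas are model parameters, learned automatically by .fit();
# C = 1/lambda is a tuning parameter, chosen here by 5-fold cross-validation
grid = GridSearchCV(
    LogisticRegression(penalty="l1", solver="liblinear"),
    param_grid={"C": [0.01, 0.1, 1.0]},
    cv=5,
)
grid.fit(X, y)
print("C chosen by cross-validation:", grid.best_params_["C"])
```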
Now, regarding the other questions, the multiple choice and so on: we will upload the answers on Moodle. Very briefly, they were about the bias-variance decomposition that we saw in the Jupyter notebook, and some of the questions were about the notation. For those, maybe I can discuss with the person who sent them to me at some point, but I would not say these notation-related questions are important for the midterm. What is important is a typo that the person discovered, so I want to share it with you. It is a minor typo, but still, it is good to fix. I will just open the Jupyter notebook; let us see if I can share my screen now. OK.