Question 2.
The files credit_card_data.txt (without headers) and credit_card_data-headers.txt
(with headers) contain a dataset with 654 data points, 6 continuous and 4 binary predictor variables. It
has anonymized credit card applications with a binary response variable (last column) indicating if the
application was positive or negative. The dataset is the “Credit Approval Data Set” from the UCI Machine
Learning Repository (https://archive.ics.uci.edu/ml/datasets/Credit+Approval) without the categorical
variables and without data points that have missing values.
1. Using the support vector machine function ksvm contained in the R package kernlab, find a
good classifier for this data. Show the equation of your classifier, and how well it classifies the
data points in the full data set. (Don’t worry about test/validation data yet; we’ll cover that
topic soon.)
Notes on ksvm
You can use scaled=TRUE to get ksvm to scale the data as part of calculating a classifier.
The term λ we used in the SVM lesson to trade off the two components of correctness
and margin is called C in ksvm. One of the challenges of this homework is to find a
value of C that works well; for many values of C, almost all predictions will be “yes” or
almost all predictions will be “no”.
ksvm does not directly return the coefficients a0 and a1…am. Instead, you need to do the
last step of the calculation yourself. Here’s an example of the steps to take (assuming
your data is stored in a matrix called data):¹
# call ksvm. Vanilladot is a simple linear kernel.
model <- ksvm(data[,1:10], data[,11], type="C-svc",
              kernel="vanilladot", C=100, scaled=TRUE)
# calculate a1…am
a <- colSums(model@xmatrix[[1]] * model@coef[[1]])
a
# calculate a0
a0 <- -model@b
a0
# see what the model predicts
pred <- predict(model, data[,1:10])
pred
# see what fraction of the model's predictions match the
# actual classification
sum(pred == data[,11]) / nrow(data)
Hint: You might want to view the predictions your model makes; if C is too large or too small,
they’ll almost all be the same (all zero or all one) and the predictive value of the model will be
poor. Even finding the right order of magnitude for C might take a little trial-and-error.
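For example, here is one way to structure that trial-and-error (a sketch, not a required approach; it assumes your data is loaded as in the example above, with predictors in columns 1-10 and the response in column 11):

# Sketch: sweep C over several orders of magnitude and report
# training accuracy for each value.
library(kernlab)
for (C in 10^(-4:4)) {
    model <- ksvm(as.matrix(data[,1:10]), as.factor(data[,11]),
                  type="C-svc", kernel="vanilladot", C=C, scaled=TRUE)
    pred <- predict(model, data[,1:10])
    cat("C =", C, "accuracy =", sum(pred == data[,11]) / nrow(data), "\n")
}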
¹ I know I said I wouldn’t give you exact R code to copy, because I want you to learn for yourself. In general, that’s
definitely true – but in this case, because it’s your first R assignment and because the ksvm function leaves you in
the middle of a mathematical calculation that we haven’t gotten into in this course, I’m giving you the code.
Note: If you get the error “Error in vanilladot(length = 4, lambda = 0.5) :
unused arguments (length = 4, lambda = 0.5)”, it means you need to convert
data into matrix format:
model <- ksvm(as.matrix(data[,1:10]), as.factor(data[,11]),
              type="C-svc", kernel="vanilladot", C=100, scaled=TRUE)
SOLUTION:
There are multiple possible answers. See file solution 2.2-1.R for the R code for one answer.
Please note that a good solution doesn’t have to try both of the possibilities in the code; they’re both
shown to help you learn, but they’re not necessary.
One possible linear classifier you can use, for scaled data z, is
-0.0010065348z1 - 0.0011729048z2 - 0.0016261967z3 + 0.0030064203z4 + 1.0049405641z5 -
0.0028259432z6 + 0.0002600295z7 - 0.0005349551z8 - 0.0012283758z9 + 0.1063633995z10 + 0.08158492
= 0.
It predicts 565 points (about 86.4%) correctly. (Note that this is its performance on the training data; as
you saw in Module 3, that’s not a reliable estimate of its true predictive ability.) A linear classifier of this
quality can be found for a wide range of values of C (from 0.01 to 1000, and beyond). Using unscaled
data, it’s much harder to find a value of C that does this well.
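If you want to sanity-check the equation by hand, here is a minimal sketch (an illustration, not part of ksvm’s API; it assumes a and a0 were computed as in the code above, and that standardizing each column with R’s scale() matches the scaling ksvm applied internally, which may not hold exactly):

# Sketch: apply the linear classifier a'z + a0 by hand.
z <- scale(as.matrix(data[,1:10]))      # standardize each predictor
score <- z %*% a + a0                   # a, a0 from the earlier code
pred_manual <- ifelse(score > 0, 1, 0)  # classify by sign of the score;
                                        # you may need to flip it depending
                                        # on how ksvm encoded the two classes
sum(pred_manual == data[,11]) / nrow(data)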
2. You are welcome, but not required, to try other (nonlinear) kernels as well; we’re not covering
them in this course, but they can sometimes be useful and might provide better predictions than
vanilladot.
It’s also possible to find a better nonlinear classifier using a different kernel; kudos to those of you who
went even deeper and tried this!
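For instance, a minimal sketch with kernlab’s Gaussian (RBF) kernel (the value of C here is just illustrative; kpar="automatic" lets ksvm estimate the kernel width from the data):

# Sketch: same call as before, but with a nonlinear (RBF) kernel.
model_rbf <- ksvm(as.matrix(data[,1:10]), as.factor(data[,11]),
                  type="C-svc", kernel="rbfdot", kpar="automatic",
                  C=100, scaled=TRUE)
pred_rbf <- predict(model_rbf, data[,1:10])
sum(pred_rbf == data[,11]) / nrow(data)   # training accuracy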
3. Using the k-nearest-neighbors classification function kknn contained in the R kknn package,
suggest a good value of k, and show how well it classifies the data points in the full data set.
Don’t forget to scale the data (scale=TRUE in kknn).
Notes on kknn
You need to be a little careful. If you give it the whole data set to find the closest points
to i, it’ll use i itself (which is in the data set) as one of the nearest neighbors. A helpful
feature of R is the index -i, which means “all indices except i”. For example,
data[-i,] is all the data except for the ith data point. For our data file where the first 10
columns are predictors and the 11th column is the response, data[-i,11] is the
response for all but the ith data point, and data[-i,1:10] are the predictors for all
but the ith data point.
(There are other, easier ways to get around this problem, but I want you to get
practice doing some basic data manipulation and extraction, and maybe some looping
too.)
Note that kknn will read the responses as continuous, and return the fraction of the k
closest responses that are 1 (rather than the most common response, 1 or 0).
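Putting these notes together, here is a sketch of the leave-one-out loop for a single value of k (it assumes the data frame uses the default column names V1…V11, with V11 the response; V11~. is shorthand for the explicit V11~V1+…+V10 used in the solution code):

# Sketch: leave-one-out accuracy of kknn for a given k.
library(kknn)
check_k <- function(k) {
    pred <- rep(0, nrow(data))
    for (i in 1:nrow(data)) {
        # fit on every point except i, then predict point i
        model <- kknn(V11~., train=data[-i,], test=data[i,],
                      k=k, scale=TRUE)
        # kknn returns the fraction of the k nearest responses
        # that are 1, so round to get a 0/1 prediction
        pred[i] <- as.integer(fitted(model) + 0.5)
    }
    sum(pred == data[,11]) / nrow(data)   # fraction correct
}
check_k(12)   # e.g., training-style accuracy for k = 12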
SOLUTION:
Here’s one possible solution. See file solution 2.2-3.R for code. Please note that a good solution
doesn’t have to try all of the possibilities in the code.
As detailed in the code, we observe maximum accuracy for k=12 and k=15. (Again, as above, we’re
reporting performance on the training data, which is generally not good practice, as you saw in Module
3.) A summary of the number of correct predictions for different values of k is shown below (using
scaled data).
Value of k (scaled data)     Correct predictions    Percent correct predictions
1-4                          533                    81.50%
6                            553                    84.56%
7, 9                         554                    84.71%
8                            555                    84.86%
10, 19-20                    556                    85.02%
5, 11, 13-14, 16-18          557                    85.17%
12, 15                       558                    85.32%
As the table shows, the key (in the training data) is to use k ≥ 5; smaller values of k are significantly
inferior. Although k=12 and k=15 look slightly better than the rest, it’s not a statistically significant
difference.
Note that if we’re just looking at fraction of correct predictions, it might be easy to get caught
up in finding the very highest amount we can find. Don’t lose sight of the fact that these
differences might just be 1 data point out of 654 – which is not statistically significant.
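As a rough back-of-the-envelope check (a sketch, not part of the assignment), the standard error of an accuracy estimate on n = 654 points is about sqrt(p(1-p)/n):

p <- 0.85; n <- 654
sqrt(p * (1 - p) / n)   # about 0.014, i.e. roughly 1.4 percentage points

So a difference of one or two correct predictions (0.15-0.3 percentage points) is well within sampling noise.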
We could do the same using unscaled data, by changing one word in the R code; replace
model=kknn(V11~V1+V2+V3+V4+V5+V6+V7+V8+V9+V10, data[-i,], data[i,],
           k=X, scale=TRUE)
with
model=kknn(V11~V1+V2+V3+V4+V5+V6+V7+V8+V9+V10, data[-i,], data[i,],
           k=X, scale=FALSE)
Using unscaled data, the results are significantly worse.
Value of k (unscaled data)   Correct predictions    Percent correct predictions
1-4                          434                    66.36%
10                           443                    67.74%
11                           445                    68.04%
12                           447                    68.35%
9, 13                        449                    68.65%
14-15                        450                    68.81%
5, 20                        452                    69.11%
7-8, 16-19                   453                    69.27%
6                            455                    69.57%