COMP5318 MACHINE LEARNING AND DATA MINING
ASSIGNMENT 2 REPORT
OPTIMIZED CLASSIFICATION ON FOREST COVERTYPE
Based on kNN, Logistic Regression and Random Forest Algorithms
June 5, 2017
Lin Han 460461265
Wanyi Fang 460165019
Dasheng Ge 460440743
Powered by LaTeX
Abstract
This project re-classifies the Covtype dataset using three classification algorithms: kNN, Logistic Regression and Random Forest. We evaluate the performance of these three supervised algorithms and choose the most appropriate one for solving classification problems on similar datasets. kNN was found to perform best.
Contents
  Introduction
  Specification
  Previous Work
  Methods and Design
    Preprocessing
    Algorithm Selection
    Algorithm Introduction
      kNN
      Logistic Regression
      Random Forest
  Experiments and Discussion
    Performance
    Runtime
    Comparison and Discussion
    Reflection
  Conclusions and Future Work
  References
  Appendix: How to run
INTRODUCTION
Classification aims to assign input data with common characteristics to the same categories as efficiently as possible. In recent years, an increasing number of classification methods have been proposed; however, their performance differs considerably.
In this assignment, we have chosen the Covtype dataset as our object: a classical dataset with 581,012 forest instances and 54 dimensions used for judging the cover type, divided into seven labels. Since the dataset is already labelled, we chose three supervised learning methods to re-classify it, so that we can compare the performance of the three methods and, through deeper analysis, choose one of them as the recommended method. The three algorithms are kNN, Logistic Regression and Random Forest.
As the performance of an algorithm mainly comprises time consumption and accuracy, the implementation has strengthened our understanding of each algorithm, including its efficiency and the situations it fits. Moreover, we gained experience in comparing and choosing appropriate methods when confronted with real problems, which is a prerequisite for rewriting and optimizing classical machine learning methods.
SPECIFICATION
The whole project was run and tested on a ThinkPad T540p with an Intel Core i5-4200M @ 2.5 GHz and 8 GB RAM.
The OS is Ubuntu 16.04 LTS.
PREVIOUS WORK
Before working on the problem, we reviewed several papers to find successful approaches to similar datasets. This process helped us define the algorithms we would use to keep the process time-saving and cost-effective.
The characteristics of the Covtype dataset are summarized as follows:
1. Massive number of instances but relatively few dimensions.
There are over 580,000 instances in the dataset, but only just over 50 dimensions are used for classification.
2. Data in the dataset are discrete objects, not continuous.
3. With 7 labels given, the re-classification process should be a supervised learning process.
4. There are 7 cover types, so this is a multi-class classification problem.
Based on these dataset characteristics, the following relevant methods were found.
Method 1: VFDT (Very Fast Decision Tree learner), a decision-tree learning system based on Hoeffding trees.
This algorithm is representative of data stream classification technology. Data today often arrive in a stream format with four properties: real-time, continuous (not literally continuous, just extremely numerous), infinite, and non-reproducible. Since static classification cannot satisfy these needs, classification for data streams is becoming more and more prevalent.
VFDT can incorporate and classify new information online with a short training time by partitioning the incoming data stream. It is powerful when dealing with large datasets. Moreover, as a ready-to-use model, VFDT can be used after the first few examples have been trained, and its quality increases smoothly over time.
Method 2: Bagging and Boosting.
These two methods also arose from the data stream classification wave and are usually used as ensemble methods to build stronger classifiers.
Bagging can be used to enhance the effect of a classifier: it produces several replicate training sets by random sampling, trains a weak classifier on each, and finally integrates them.
Similar to bagging, boosting combines all weak classifiers into a strong classifier. However, each instance in boosting carries a weight corresponding to its importance in each iteration, so adjusting the weights can create more accurate classifiers.
Method 3: Round Robin classification, a method based on separate-and-conquer rule algorithms.
It has attracted much attention in the neural network and SVM (support vector machine) communities. The basic idea is to transform a multi-class classification problem into binary classification problems. During the process, one classifier is trained for each pair of classes using only the training examples for those two classes and ignoring all others, so the complexity is lower. Round Robin classification has been shown to improve further when integrated with the bagging algorithm mentioned above.
METHODS AND DESIGN
Preprocessing
The preprocessing in this project can be divided into two types: preprocessing of the dataset itself, in order to generate the prediction and training sets; and preprocessing specific to a given method, to improve its accuracy. This part only introduces the former; the special preprocessing for each algorithm is described with that algorithm's implementation.
In the preprocessing, we used the original dataset to form four subsets, called train sample, train target, predict sample and predict target. This is done by generate_samples_and_targets.py.
Firstly, the whole dataset is separated randomly into ten equal parts. One subset is chosen from the ten, and the last column is extracted from it, as the last column is the label of each instance.
Secondly, the remaining 53 columns of the chosen subset are defined as the prediction set, called predict sample in the program, and its last column is kept separately as predict target. It is retained for testing the accuracy.
Thirdly, the remaining nine subsets are merged into one, and the same extraction is repeated to obtain the training set, called train sample and train target in the program.
Finally, the whole process is repeated to form all ten folds.
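A minimal sketch of this fold-generation step is shown below. It is an illustration rather than the exact contents of generate_samples_and_targets.py: the input file name covtype.data, the comma-separated layout with the label in the last column, and the output file names are all assumptions.

```python
import numpy as np

# Illustrative 10-fold split as described above; file names and layout are assumed.
data = np.loadtxt("covtype.data", delimiter=",")          # last column = cover type label

rng = np.random.default_rng(0)
folds = np.array_split(rng.permutation(len(data)), 10)    # ten random, roughly equal parts

for k, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for i, f in enumerate(folds) if i != k])
    train_sample, train_target = data[train_idx, :-1], data[train_idx, -1]
    predict_sample, predict_target = data[test_idx, :-1], data[test_idx, -1]
    # The four arrays could then be written out, e.g.
    # np.save(f"data/train_sample_{k}.npy", train_sample), and similarly for the others.
```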
Algorithm Selection
For supervised learning, we have learnt four methods: kNN (k-Nearest Neighbours), Naive Bayes, Logistic Regression, and SVM (Support Vector Machine). There are also several algorithms we have not studied, such as Random Forest, AdaBoost and SVR. In this assignment, we would like to choose three algorithms that are clearly diverse in theory, so that the comparison results are more pronounced. At the same time, time consumption is an important criterion to consider.
In theory, SVM and Logistic Regression share several similarities: both require high-dimensional vector calculations, with high accuracy but also high time complexity. Naive Bayes and kNN both have relatively low complexity; however, Naive Bayes is better suited to datasets whose dimensions carry equal weight.
As a result, we chose kNN, Logistic Regression and Random Forest as the comparison objects, because these three algorithms are not only diverse in behaviour but can all be implemented with the same external open-source library, sklearn. As is known, using different libraries can itself influence algorithm performance.
Algorithm Introduction
This part introduces the algorithms separately and describes the significance of each parameter in detail.
kNN
K-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification.
Representing an instance $x$ by its attribute vector $\langle a_1(x), a_2(x), \ldots, a_n(x) \rangle$, the distance between two instances $x_i$ and $x_j$ is
$$d(x_i, x_j) = \sqrt{\sum_{r=1}^{n} \bigl(a_r(x_i) - a_r(x_j)\bigr)^2}$$
where $a_r(x)$ is the $r$-th attribute of $x$ and $d(x_i, x_j)$ is the distance between instances $x_i$ and $x_j$.
For discrete-valued classification, kNN returns
$$\hat{f}(x_q) \leftarrow \arg\max_{v \in V} \sum_{i=1}^{k} \delta\bigl(v, f(x_i)\bigr)$$
where $V = \{v_1, v_2, \ldots, v_n\}$ is the finite set of class labels, $\delta(a, b) = 1$ if $a = b$ and $0$ otherwise, and $x_1, x_2, \ldots, x_k$ are the $k$ training instances nearest to the query point $x_q$.
In the Covtype dataset, the number of attributes is fixed at 53.
kNN needs no special preprocessing; the dataset was classified directly through knn.py.
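As a reference point, a minimal scikit-learn sketch of classifying one fold with kNN follows. It reuses the train/predict arrays produced by the preprocessing sketch above; the value of n_neighbors is an illustrative choice, not necessarily the one used in knn.py.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# train_sample, train_target, predict_sample, predict_target: one fold from the
# preprocessing step above (the exact k used by knn.py is not stated; 5 is assumed).
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(train_sample, train_target)            # "training" simply stores the instances
knn_predictions = knn.predict(predict_sample)  # majority vote among the k nearest neighbours
print(classification_report(predict_target, knn_predictions))
```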
Logistic Regression
Logistic Regression is used for binary classification when the dependent variable is dichotomous, and it extends naturally to multi-class problems. Following the idea of logistic regression, in our assignment we can build a linear model:
$$z_i = \vec{x}_i \cdot \vec{w} = w_0 + x_{i1} w_1 + x_{i2} w_2 + \ldots + x_{iD} w_D$$
for $i = 1, 2, \ldots, n$, where:
$\vec{w}$ stands for the learnt parameter vector;
$\vec{x}_i$ stands for the test data, namely a forest instance here.
Equating the linear model to a probability $p(\vec{x}_i)$ via the logistic transformation:
$$\log\left(\frac{p(\vec{x}_i)}{1 - p(\vec{x}_i)}\right) = \vec{x}_i \cdot \vec{w} = z_i$$
Therefore, we can derive:
$$p(\vec{x}_i) = \frac{1}{1 + e^{-z_i}}$$
We also have the loss function:
$$loss(\vec{w}) = -l(\vec{w}) = \sum_{i=1}^{N} \bigl[\log(1 + e^{z_i}) - y_i z_i\bigr]$$
where $y_i$ is 0 or 1 in logistic regression.
Based on the above process and applying the gradient descent algorithm for each label, we obtain an estimated weight vector $\vec{w}$ for every label. Using these vectors, we can determine the most probable label for a given instance.
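To make the one-vs-rest gradient-descent idea concrete, here is a small NumPy sketch; the learning rate, iteration count and bias handling are assumptions for illustration (the project itself relies on sklearn's implementation), and features are expected to be scaled to keep the exponentials stable.

```python
import numpy as np

def fit_one_vs_rest(X, labels, lr=0.1, n_iter=200):
    """Gradient descent on the logistic loss above, one weight vector per label (sketch)."""
    Xb = np.hstack([np.ones((len(X), 1)), X])      # prepend a column of ones for w0
    weights = {}
    for label in np.unique(labels):
        y = (labels == label).astype(float)        # 1 for this label, 0 for all others
        w = np.zeros(Xb.shape[1])
        for _ in range(n_iter):
            z = Xb @ w
            p = 1.0 / (1.0 + np.exp(-z))           # p(x_i) = 1 / (1 + e^{-z_i})
            w -= lr * Xb.T @ (p - y) / len(y)      # gradient of the loss w.r.t. w
        weights[label] = w
    return weights

def predict_labels(X, weights):
    """Return, for each row of X, the label whose linear score z_i is largest."""
    Xb = np.hstack([np.ones((len(X), 1)), X])
    labels = sorted(weights)
    scores = np.column_stack([Xb @ weights[l] for l in labels])
    return np.array(labels)[np.argmax(scores, axis=1)]
```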
In this assignment, we designed a specific preprocessing step to improve the performance of the LR method: we applied K-Means in advance to achieve statistical outlier removal. The original dataset was thus re-clustered with K-Means into 7 new clusters, and LR was then run on the clustered dataset to observe whether this optimized LR performance.
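A hedged sketch of how this experiment might be wired up in scikit-learn is given below. Treating points far from their K-Means centroid as outliers is our reading of the "statistical outlier removal" step, and the 95th-percentile distance threshold, n_clusters and max_iter values are assumed parameters, not those of the project script.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Assumed reading of the K-Means preprocessing: cluster into 7 groups, then drop
# training points that lie unusually far from their assigned centroid.
kmeans = KMeans(n_clusters=7, n_init=10, random_state=0).fit(train_sample)
dist_to_centroid = np.linalg.norm(
    train_sample - kmeans.cluster_centers_[kmeans.labels_], axis=1)
keep = dist_to_centroid < np.percentile(dist_to_centroid, 95)   # threshold is illustrative

logreg = LogisticRegression(max_iter=1000)      # multi-class logistic regression, 7 labels
logreg.fit(train_sample[keep], train_target[keep])
lr_predictions = logreg.predict(predict_sample)
```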
Random Forest
A Random Forest consists of a collection of simple tree predictors, each of which produces a response when presented with a set of predictor values; the responses are combined to classify the final result. A commonly suggested number of predictor variables considered per split is $\log_2 M + 1$, where $M$ is the number of inputs.
Given a set of simple trees and a set of random predictor variables, the Random Forest method defines a margin function that measures the extent to which the average number of votes for the correct class exceeds the average vote for any other class present in the dependent variable. Given an ensemble of classifiers $h_1(x), h_2(x), \ldots, h_K(x)$, with the training set drawn at random from the distribution of the random vector $(\vec{X}, Y)$, the margin function is defined as
$$mg(\vec{X}, Y) = \mathrm{av}_k\, I\bigl(h_k(\vec{X}) = Y\bigr) - \max_{j \neq Y} \mathrm{av}_k\, I\bigl(h_k(\vec{X}) = j\bigr)$$
where $I(\cdot)$ is the indicator function and $\mathrm{av}_k$ denotes the average over the ensemble. The generalization error is then defined as $PE^{*} = P_{X,Y}\bigl(mg(\vec{X}, Y) < 0\bigr)$.
When implementing Random Forest, we applied similar preprocessing to that used for the LR method, clustering the dataset with K-Means in advance and then evaluating the performance.
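For completeness, a minimal scikit-learn sketch of the Random Forest step is shown below; n_estimators, max_features and the other settings are illustrative assumptions rather than the exact configuration used in the project (max_features="log2" loosely echoes the log2(M)+1 rule of thumb mentioned above).

```python
from sklearn.ensemble import RandomForestClassifier

# Illustrative Random Forest on one fold; hyperparameters are assumed values.
rf = RandomForestClassifier(
    n_estimators=100,       # number of simple tree predictors in the ensemble
    max_features="log2",    # features considered at each split (cf. the log2(M)+1 rule of thumb)
    n_jobs=-1,              # grow trees in parallel
    random_state=0,
)
rf.fit(train_sample, train_target)
rf_predictions = rf.predict(predict_sample)
```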
EXPERIMENTS AND DISCUSSION
Performance
The indices used to evaluate the performance of each algorithm are precision, recall, F1-score and support. They are calculated as follows:
$Precision = TP/(TP + FP)$
$Recall = TP/(TP + FN)$
$F1\text{-}score = 2PR/(P + R)$, where $P = Precision$, $R = Recall$
where TP means actually positive and predicted positive; FP means actually negative but predicted positive; FN means actually positive but predicted negative.
Tables 1-3 show the performance of the three algorithms using 10-fold cross-validation.
In addition to precision, recall and F1-score, we derive a more detailed confusion matrix showing the differences between the true data and our predictions. From this matrix, we can easily see which classes are confused. Rows are predictions and columns are the real data. The matrix for each algorithm is shown in Tables 4-6, derived from randomly selected training and prediction samples.
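The per-fold numbers in Tables 1-6 can be reproduced from any of the prediction arrays above with scikit-learn's metric helpers; the following sketch assumes the kNN predictions from earlier and uses pandas only to add the row and column totals shown in the tables (Tables 1-3 appear to correspond to the averaged summary of this report).

```python
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix

# Per-class precision, recall, F1-score and support, as in Tables 1-3.
print(classification_report(predict_target, knn_predictions, digits=2))

# Confusion matrix with rows = predictions and columns = true labels, as in Tables 4-6.
classes = sorted(set(predict_target))
cm = confusion_matrix(predict_target, knn_predictions, labels=classes).T  # transpose: rows = predicted
table = pd.DataFrame(cm, index=classes, columns=classes)
table["All"] = table.sum(axis=1)       # row totals: number of predictions per class
table.loc["All"] = table.sum(axis=0)   # column totals: number of true instances per class
print(table)
```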
                           Table 1: kNN performance table
                  Folds   Precision   Recall   F1-score     Support
                  1       0.97        0.97     0.97         58102
                  2       0.97        0.97     0.97         58102
                  3       0.97        0.97     0.97         58102
                  4       0.97        0.97     0.97         58102
                  5       0.97        0.97     0.97         58102
                  6       0.97        0.97     0.97         58102
                  7       0.97        0.97     0.97         58102
                  8       0.97        0.97     0.97         58102
                  9       0.97        0.97     0.97         58102
                  10      0.97        0.97     0.97         58102
                    Table 2: Logistic Regression performance table
                  Folds   Precision   Recall   F1-score     Support
                  1       0.76        0.71     0.73         58102
                  2       0.73        0.71     0.73         58102
                  3       0.75        0.71     0.72         58102
                  4       0.75        0.71     0.73         58102
                  5       0.75        0.71     0.73         58102
                  6       0.75        0.71     0.72         58102
                  7       0.75        0.71     0.73         58102
                  8       0.75        0.71     0.73         58102
                  9       0.75        0.71     0.72         58102
                  10      0.75        0.71     0.73         58102
                      Table 3: Random Forest performance table
                  Folds   Precision   Recall   F1-score     Support
                  1       0.95        0.94     0.94         58102
                  2       0.95        0.95     0.95         58102
                  3       0.95        0.95     0.95         58102
                  4       0.95        0.95     0.95         58102
                  5       0.94        0.94     0.94         58102
                  6       0.95        0.95     0.95         58102
                  7       0.95        0.95     0.95         58102
                  8       0.95        0.95     0.95         58102
                  9       0.95        0.95     0.95         58102
                  10      0.95        0.95     0.95         58102
                        Table 4: kNN Detailed Confusion Matrix
       Predicted/True   1.0     2.0      3.0     4.0     5.0   6.0    7.0    All
       1.0              20670   604      1       0       10    1      41     21327
       2.0              532     27436    38      0       63    16     5      28090
       3.0              0       43       3516    9       3     48     0      3619
       4.0              0       0        47      214     0     16     0      277
       5.0              11      81       11      0       867   6      0      976
       6.0              1       44       62      9       3     1605   0      1724
       7.0              40      7        0       0       1     0      2041   2089
       All              21254   28215    3675    232     947   1692   2087   58102
                 Table 5: Logistic Regression Detailed Confusion Matrix
          Predicted/True   1.0      2.0      3.0       4.0   6.0   7.0    All
          1.0              14290    6238     5         0     0     355    20888
          2.0              4973     22876    546       0     57    63     28515
          3.0              5        417      3109      11    132   0      3674
          4.0              0        0        194       39    11    0      244
          5.0              22       876      57        0     10    0      965
          6.0              1        625      1012      2     91    0      1731
          7.0              1084     23       7         0     0     971    2085
          All              20375    31055    4930      52    301   1389   58102
                   Table 6: Random Forest Detailed Confusion Matrix
       Predicted/True   1.0     2.0      3.0     4.0     5.0   6.0    7.0    All
       1.0              20198   1082     0       0       4     3      40     21327
       2.0              1074    26890    44      3       42    28     9      28090
       3.0              2       91       3432    9       3     82     0      3619
       4.0              0       1        44      224     0     8      0      277
       5.0              23      212      14      0       719   8      0      976
       6.0              3       62       152     6       1     1500   0      1724
       7.0              112     8        0       0       0     0      1969   2089
       All              21412   28346    3686    242     769   1629   2018   58102
                                       Table 7: Running Time
              Name                    Running Time (seconds per fold, on average)
             kNN                     8.22
             Logistic Regression     183.11
             Random Forest           13.1
Runtime
Table 7 shows the average running time per fold for each algorithm.
Comparison and Discussion
Overall, from the results shown above, we found that for a single algorithm, each of the ten folds has similar performance and running time, which indicates that the functions we used from the external library are quite stable and reliable.
The kNN algorithm performed excellently, with the lowest time cost as well as the highest precision, producing its results within 10 seconds (almost always around 8 seconds) on average per fold, and the precision for all labels is the same, 0.97, which is a considerable value. The result reflects the kNN algorithm's benefits:
1. Simple and effective.
2. Training time complexity is O(n) (linear in the size of the training set).
3. Not sensitive to outliers, since we did not apply outlier removal to it.
However, such a uniformly high value seems almost too good for a single classifier.
LR performed worst among the three algorithms, taking around 3 minutes per fold, with accuracy around 0.7 for each label and each fold. Given that the dataset has only 53 dimensions, the running speed was not ideal. However, to some extent, this is still an acceptable result for a classifier.
Random Forest also performed very well, just behind the kNN algorithm, with high accuracy and short running time. Only in the fifth fold was there a slight fluctuation. The average running time of the Random Forest algorithm across the ten folds was 13.1 seconds, as Table 7 shows, about 5 seconds slower than kNN.
It seems that kNN performed best and should be chosen. However, evaluating an algorithm for a dataset should not focus only on one experimental result but also on its stability.
As is widely known, kNN shows its weakness when the training data set is unbalanced, for example if the number of instances of label 2 were much smaller than those of the other 6 labels. In that circumstance, when we try to classify label 2, we actually find the labels nearest to label 2 rather than label 2 itself. However, our dataset is relatively balanced, and there is a certain amount of intersection and overlap among the classes, so kNN performs well.
Moreover, kNN is weak at classifying data on a very large scale, such as the data streams mentioned above. However, as summarized earlier, our dataset has relatively few dimensions, and its scale and complexity are not as large as those of other types of dataset. Therefore, kNN is appropriate.
Random Forest inherits many of the advantages of CART (Classification and Regression Trees):
1. Computationally simple and quick to fit, even for large data sets.
2. No formal distributional assumptions.
3. Can balance the errors for unbalanced data sets.
4. Little over-fitting when processing relatively big data sets.
Thus, RF may not look as efficient as kNN when run on a smaller dataset. However, as more and more training data are used, the performance of RF gradually improves. That is why RF is better suited to datasets of a larger scale than Covtype.
In summary, we consider kNN the best algorithm for dealing with this type of dataset.
Reflection
After experiencing the whole process of machine learning, we find that we could do better in the following respects:
As the dataset description document explains, the dataset contains both binary and decimal (continuous) attributes. By common sense, we should transform these different types of data into a unified numeric scale before training.
It seems that the different types of columns did not affect the result too much. From related research, a possible explanation is that when the data lie in a high-dimensional space, the difference in attribute types has little effect on the other attributes or on the training model itself.
However, what if we preprocessed the different types of columns separately? If we divided the original dataset into different parts, how should we integrate them again before clustering? We have not yet researched this.
CONCLUSIONS AND FUTURE WORK
In conclusion, we choose kNN as the recommended algorithm for dealing with datasets of a similar scale.
What can we do in the future? Here are two objectives:
1. Optimize Random Forest:
Take boosting algorithms into consideration to optimize the Random Forest method. Try to use boosting to iterate over the data set, modifying the new training model based on the previous iteration.
2. Test other algorithms:
Although we have used three methods to analyse this task, we would still like to train the data set with other methods such as SVM and Naive Bayes. Future work will analyse the advantages of each method to determine the most suitable one for this task.
REFERENCES
1. Gama J., Rocha R., Medas P. Accurate decision trees for mining high-speed data streams. In: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2003: 523-528.
2. Oza N. C., Russell S. Experimental comparisons of online and batch versions of bagging and boosting. In: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2001: 359-364.
3. Obradovic Z., Vucetic S. Challenges in Scientific Data Mining: Heterogeneous, Biased, and Large Samples. Technical Report, Center for Information Science and Technology, Temple University, 2004.
4. Breiman L. Random forests. Machine Learning, 2001, 45(1): 5-32.
5. Liaw A., Wiener M. Classification and regression by randomForest. R News, 2002, 2(3): 18-22.
APPENDIX: HOW TO RUN
1. Generate the 10 folds: run generate_samples_and_targets.py. The generated files should appear in the data folder.
2. Run the Python file for each algorithm, e.g. from the command line run knn_with_orc.py. The results should appear in the result folder.