ML Lab Guide for CSE Students

MACHINE LEARNING LAB

ETCS-454

Faculty Name: Ms. Prerna Sharma

Name: Keshav Chahal
Roll No: 35196402720
Semester: 8th
Group: 8C13

Maharaja Agrasen Institute of Technology, PSP Area,
Sector – 22, Rohini, New Delhi – 110085
MACHINE LEARNING LAB

PRACTICAL RECORD

PAPER CODE : ETCS-454

Name of the student : Keshav Chahal

University Roll No. : 35196402720

Branch : CSE

Section/ Group : 8C13

PRACTICAL DETAILS

a) Experiments according to the list provided by GGSIPU

Exp. no / Experiment Name / Date of performance / Date of checking / Remarks / Marks (10)

1. Introduction to machine learning lab with tools (hands on Weka).
2. Understanding of Machine learning algorithms. List of Databases.
3. Implement K means Algorithm using Python.
4. Study of Databases and understanding attributes evaluation in regard to problem description.
5. Working of Major Classifiers: a) Naïve Bayes b) Decision Tree c) CART d) ARIMA e) Linear and logistic regression f) Support vector machine g) KNN (datasets can be: Breast Cancer data file or Reuters data set).
6. Design a prediction model for Analysis of round trip Time of Flight measurement from a supermarket using random forest, naïve bayes etc.
7. Implement supervised learning (KNN classification). Estimate the accuracy of using 5-fold cross-validation.
8. Introduction to R.
9. Build and Develop a model in R for a particular classifier (random Forest).
10. Develop a machine learning method using Neural Networks in Python to Predict stock prices based on past price variation.
11. Case Study of RMS Titanic Database to predict survival on basis of decision tree, Logistic Regression, KNN or k-Nearest Neighbors, Support Vector Machines.
12. Understanding of Indian education in Rural villages to predict whether girl child will be sent to school or not?
13. Understanding of dataset of contact patterns among students collected in National University of Singapore.
MAHARAJA AGRASEN INSTITUTE OF TECHNOLOGY

VISION

To nurture young minds in a learning environment of high academic value and imbibe spiritual and
ethical values with technological and management competence.

MISSION

The Institute shall endeavour to incorporate the following basic missions in the teaching methodology:
❖ Engineering Hardware – Software Symbiosis: Practical exercises in all Engineering and
Management disciplines shall be carried out by Hardware equipment as well as the related
software enabling a deeper understanding of basic concepts and encouraging inquisitive
nature.
❖ Life-Long Learning: The Institute strives to match technological advancements and
encourage students to keep updating their knowledge for enhancing their skills and
inculcating their habit of continuous learning.
❖ Liberalization and Globalization: The Institute endeavors to enhance technical and
management skills of students so that they are intellectually capable and competent
professionals with Industrial Aptitude to face the challenges of globalization.
❖ Diversification: The Engineering, Technology and Management disciplines have diverse
fields of studies with different attributes. The aim is to create a synergy of the above
attributes by encouraging analytical thinking.
❖ Digitization of Learning Processes: The Institute provides seamless opportunities for
innovative learning in all Engineering and Management disciplines through digitization of
learning processes using analysis, synthesis, simulation, graphics, tutorials and related tools to
create a platform for multi-disciplinary approach.
❖ Entrepreneurship: The Institute strives to develop potential Engineers and Managers by
enhancing their skills and research capabilities so that they emerge as successful
entrepreneurs and responsible citizens.
MAHARAJA AGRASEN INSTITUTE OF TECHNOLOGY

COMPUTER SCIENCE & ENGINEERING DEPARTMENT

VISION

To Produce “Critical thinkers of Innovative Technology”

MISSION

To provide an excellent learning environment across the computer science
discipline to inculcate professional behaviour, strong ethical values, innovative
research capabilities and leadership abilities which enable them to become
successful entrepreneurs in this globalized world.
❖ To nurture an excellent learning environment that helps students to
enhance their problem solving skills and to prepare students to be lifelong
learners by offering a solid theoretical foundation with applied computing
experiences and educating them about their professional, and ethical
responsibilities.
❖ To establish Industry-Institute Interaction, making students ready for
the industrial environment and be successful in their professional lives.
❖ To promote research activities in the emerging areas of technology
convergence.
❖ To build engineers who can look into technical aspects of an engineering
solution thereby setting a ground for producing successful entrepreneurs.
EXPERIMENT-1

Aim: Introduction to machine learning lab with tools (hands on Weka).

Weka is a data mining and machine learning application developed at the
University of Waikato in New Zealand. The purpose of this article is to teach you
how to use the Weka Explorer, classify a dataset with Weka, and visualize the
results.

Weka offers four interfaces:

1. KnowledgeFlow is a Java-Beans-based interface for setting up and running machine learning experiments.

2. Simple CLI is a simple command line interface provided to run Weka functions directly.

3. Explorer is an environment for exploring the data.

4. Experimenter is an environment for running experiments and statistical tests between learning schemes.

Pre-processing :
Most of the time the data is not perfect, and we need to pre-process it before applying machine
learning algorithms. Pre-processing is easy in Weka: click the “Open file” button and load your file
in one of the supported file types (ARFF, CSV, C4.5, binary, LIBSVM, XRFF). You can also load data
from a URL or from an SQL database, and then apply filters to it. However, we won’t need to do
pre-processing for this post since we’ll use a dataset that Weka provides for us.
If your data is in .xls format, you have to convert the file. I’ll use the Iris dataset
to illustrate the conversion:

1. Convert your .xls file to .csv format.

2. Open your CSV file in any text editor and add @RELATION database_name as the first row of
the CSV file.

3. Add the attributes using the following definition: @ATTRIBUTE attr_name attr_type. If attr_type is
numeric, define it as REAL; otherwise list the allowed values between curly braces.

4. Finally, add a @DATA tag just above your data rows, then save the file with the .arff extension.
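The manual conversion steps above can also be scripted. The following is a minimal sketch, not Weka's own converter: the helper name `csv_to_arff` and the three sample rows are assumptions made for illustration, and it only handles numeric (REAL) and nominal columns whose names and values contain no spaces.

```python
import csv
import io


def csv_to_arff(csv_text, relation):
    """Convert CSV text (with a header row) into ARFF text."""
    rows = list(csv.reader(io.StringIO(csv_text.strip())))
    header, data = rows[0], rows[1:]

    def is_number(value):
        try:
            float(value)
            return True
        except ValueError:
            return False

    lines = ["@RELATION " + relation, ""]
    for col, name in enumerate(header):
        values = [row[col] for row in data]
        if all(is_number(v) for v in values):
            # Numeric columns are declared as REAL.
            attr_type = "REAL"
        else:
            # Nominal columns list their distinct values in curly braces.
            attr_type = "{" + ",".join(sorted(set(values))) + "}"
        lines.append("@ATTRIBUTE " + name + " " + attr_type)
    lines += ["", "@DATA"]
    # The CSV header row is not repeated in the @DATA section.
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"


# Illustrative rows in the style of the Iris dataset.
sample_csv = """sepallength,sepalwidth,petallength,petalwidth,class
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""

arff_text = csv_to_arff(sample_csv, "iris")
print(arff_text)
```

Writing `arff_text` to a file with the .arff extension gives a file that can be opened from the “Open file” button.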

Load Your Data

Click the “Open file” button in the Pre-process section and load your .arff file from your local file system. If
you couldn’t convert your .csv to .arff, don’t worry, because Weka can load the .csv file and do the conversion for you.
If you followed all the steps so far, your data set loads successfully and
you’ll see the attribute names. The pre-process stage is named Filter in Weka;
click the ‘Choose’ button under Filter and apply any filter you want. For example,
if you would like to use Association Rule Mining as a training model, you have to
discretize numeric and continuous attributes. To do that, follow the path:
Choose -> Filter -> Supervised -> Attribute -> Discretize.

Visualizing the Result : If you’d like to visualize these results, you can use
the graphical presentations that Weka provides.
VIVA VOCE QUESTIONS – 1

Q. What is a logistic function? What is the range of values of a logistic function?

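Ans. The logistic (sigmoid) function is f(z) = 1/(1 + e^(-z)). It maps any real number to a value in the
open interval (0, 1), which is why its output can be read as a probability. Logistic Regression, a Machine
Learning algorithm used for classification problems, is built on this function: it is a predictive analysis
algorithm based on the concept of probability.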

Q. Why is logistic regression important ?

Ans. Logistic regression is a classification algorithm used to assign observations to a discrete set of
classes. Some of the examples of classification problems are Email spam or not spam, Online
transactions Fraud or not Fraud, Tumor Malignant or Benign. Logistic regression transforms its output
using the logistic sigmoid function to return a probability value.

Q. What is the formula for the logistic regression function?

Ans. $f(z) = \dfrac{1}{1 + e^{-(\alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k)}}$
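To make the formula concrete, here is a small sketch that evaluates it with the standard library only. The intercept and coefficient values are hypothetical, picked just for the illustration, and the function names are not from any library.

```python
import math


def logistic(z):
    """Logistic (sigmoid) function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(x, alpha, betas):
    """Probability of the positive class for one feature vector x."""
    z = alpha + sum(b * xi for b, xi in zip(betas, x))
    return logistic(z)


# Hypothetical intercept and coefficients, chosen only for illustration.
alpha, betas = -1.0, [0.5, 2.0]
for x in ([0.0, 0.0], [2.0, 0.0], [2.0, 1.5]):
    print(x, round(predict_probability(x, alpha, betas), 3))
```

A probability above 0.5 (that is, z > 0) would normally be assigned to the positive class.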

Q. Why can’t linear regression be used in place of logistic regression for binary classification?

Ans. The value predicted by linear regression is continuous and unbounded, so it cannot be read as a
probability, and the fitted line is sensitive to imbalanced data.
EXPERIMENT-2

Aim: Understanding of Machine learning algorithms

Machine learning algorithms are programs that can learn from data and improve from experience, without
human intervention. Learning tasks may include learning the function that maps the input to the output,
learning the hidden structure in unlabeled data; or ‘instance-based learning’, where a class label is produced
for a new instance by comparing the new instance (row) to instances from the training data, which were
stored in memory.

Types of Machine Learning Algorithms


There are 3 types of machine learning (ML) algorithms:

1. Supervised Learning Algorithms:

Supervised learning uses labeled training data to learn the mapping function that turns input variables
(X) into the output variable (Y). In other words, it solves for f in the following equation:
Y = f (X)
This allows us to accurately generate outputs when given new inputs.
We’ll talk about two types of supervised learning: classification and regression.
Classification is used to predict the outcome of a given sample when the output variable is in the form of
categories. A classification model might look at the input data and try to predict labels like “sick” or
“healthy.”
Regression is used to predict the outcome of a given sample when the output variable is in the form of
real values. For example, a regression model might process input data to predict the amount of rainfall,
the height of a person, etc.
Linear Regression, Logistic Regression, CART, Naïve Bayes, and K-Nearest Neighbors (KNN) are examples
of supervised learning.
Ensembling is another type of supervised learning. It means combining the predictions of multiple
machine learning models that are individually weak to produce a more accurate prediction on a new
sample. Bagging with Random Forests and Boosting with XGBoost are examples of ensemble techniques.

2. Unsupervised Learning Algorithms:


Unsupervised learning models are used when we only have the input variables (X) and no
corresponding output variables. They use unlabeled training data to model the underlying structure of
the data.

We’ll talk about three types of unsupervised learning:

Association is used to discover the probability of the co-occurrence of items in a collection. It is
extensively used in market-basket analysis. For example, an association model might be used to discover
that if a customer purchases bread, s/he is 80% likely to also purchase eggs.

Clustering is used to group samples such that objects within the same cluster are more similar to each
other than to the objects from another cluster.

Dimensionality Reduction is used to reduce the number of variables of a data set while ensuring that
important information is still conveyed. Dimensionality Reduction can be done using Feature Extraction
methods and Feature Selection methods. Feature Selection selects a subset of the original variables. Feature
Extraction performs data transformation from a high-dimensional space to a low-dimensional space.
Example: the PCA algorithm is a Feature Extraction approach.

Apriori, K-means, and PCA are examples of unsupervised learning.
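The bread-and-eggs figure quoted for association is what rule mining calls the confidence of the rule bread -> eggs. A minimal sketch of how such a number is computed is given below; the six baskets are hypothetical and were invented so that the result comes out at 80%.

```python
# Hypothetical shopping baskets, invented for this illustration.
baskets = [
    {"bread", "eggs", "milk"},
    {"bread", "eggs"},
    {"bread", "eggs", "butter"},
    {"bread", "eggs", "jam"},
    {"bread", "milk"},
    {"milk", "butter"},
]


def support(itemset):
    """Fraction of baskets that contain every item in itemset."""
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)


def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    return support(antecedent | consequent) / support(antecedent)


print("support(bread, eggs) =", round(support({"bread", "eggs"}), 3))
print("confidence(bread -> eggs) =", round(confidence({"bread"}, {"eggs"}), 3))
```

Five baskets contain bread and four of those also contain eggs, so the confidence is 4/5.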

3. Reinforcement learning:

Reinforcement learning is a type of machine learning algorithm that allows an agent to decide the best
next action based on its current state by learning behaviors that will maximize a reward.
Reinforcement algorithms usually learn optimal actions through trial and error. Imagine, for example, a video
game in which the player needs to move to certain places at certain times to earn points. A reinforcement
algorithm playing that game would start by moving randomly but, over time and through trial and error, it would
learn where and when it needed to move the in-game character to maximize its point total.
VIVA QUESTIONS – 2

Q. Why is accuracy not a good measure for classification problems?

Ans. Accuracy is a useful measure when every class has about the same number of samples, but with
an imbalanced set of samples it can be misleading: a classifier that always predicts the majority
class scores a high accuracy while never detecting the minority class. A test can therefore have a
high accuracy but actually perform worse than a test with a lower accuracy.

Q. What are false positives and false negatives?

Ans. A false positive is an error in data reporting in which a test result improperly indicates
presence of a condition, such as a disease (the result is positive), when in reality it is not present,
while a false negative is an error in which a test result improperly indicates no presence of a
condition (the result is negative), when in reality it is present.

Q. What are the true positive rate (TPR), true negative rate (TNR), false positive rate (FPR), and
false negative rate (FNR)?

Ans. TPR is the fraction of actual positives that are correctly predicted as positive:
TPR = TP / (TP + FN).
TNR is the fraction of actual negatives that are correctly predicted as negative:
TNR = TN / (TN + FP).
FPR is the fraction of actual negatives that are incorrectly predicted as positive:
FPR = FP / (FP + TN).
FNR is the fraction of actual positives that are incorrectly predicted as negative:
FNR = FN / (FN + TP).
It follows that TPR + FNR = 1 and TNR + FPR = 1.
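As a worked example of these four rates, the sketch below counts TP, FP, TN and FN from two label lists and applies the standard definitions. The labels are made up for the illustration and the helper names are not from any library.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def rates(tp, fp, tn, fn):
    """Return the four rates computed from the confusion counts."""
    return {
        "TPR": tp / (tp + fn),
        "TNR": tn / (tn + fp),
        "FPR": fp / (fp + tn),
        "FNR": fn / (fn + tp),
    }


# Made-up labels: 4 actual positives and 6 actual negatives.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(actual, predicted)
result = rates(*counts)
print(counts)
print(result)
```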

Q. What are precision and recall?

Ans. Precision is the percentage of the results returned as positive that are actually relevant: TP / (TP + FP).
Recall is the percentage of all relevant results that your algorithm correctly identifies: TP / (TP + FN).
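The sketch below computes precision and recall next to accuracy on a hypothetical imbalanced test set (95 negatives, 5 positives), to show how accuracy alone can hide a useless classifier. Both prediction lists are invented for the illustration, and precision is reported as 0 when nothing is predicted positive, which is a convention chosen here.

```python
def precision_recall_accuracy(actual, predicted, positive=1):
    """Return (precision, recall, accuracy) for a binary problem."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    correct = sum(1 for a, p in pairs if a == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, correct / len(pairs)


# Hypothetical imbalanced test set: 95 negatives followed by 5 positives.
actual = [0] * 95 + [1] * 5

# Classifier A never predicts the positive class.
always_negative = [0] * 100
# Classifier B raises 5 false alarms but finds 4 of the 5 positives.
cautious = [0] * 90 + [1] * 5 + [1] * 4 + [0] * 1

print("A:", precision_recall_accuracy(actual, always_negative))
print("B:", precision_recall_accuracy(actual, cautious))
```

Classifier A has the higher accuracy, yet its recall is zero; classifier B is slightly less accurate but actually detects the positive class.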
EXPERIMENT – 3

Aim: Implement K means Algorithm using Python. Evaluate performance by measuring the
sum of Euclidean distance of each example from its class centre. Test the performance of the
algorithm as a function of the parameter k.

Python Code :
First we plot the data:

import matplotlib.pyplot as plt
import numpy as np
from matplotlib import style

style.use('ggplot')

X = np.array([[1, 2],
              [1.5, 1.8],
              [5, 8],
              [8, 8],
              [1, 0.6],
              [9, 11]])

plt.scatter(X[:, 0], X[:, 1], s=150)
plt.show()

KMeans.py :

class K_Means:
    def __init__(self, k=2, tol=0.001, max_iter=300):
        self.k = k
        self.tol = tol
        self.max_iter = max_iter

    def fit(self, data):
        # Use the first k points as the initial centroids
        self.centroids = {}

        for i in range(self.k):
            self.centroids[i] = data[i]

        for _ in range(self.max_iter):
            # Assign every point to its nearest centroid
            self.classifications = {}

            for i in range(self.k):
                self.classifications[i] = []

            for featureset in data:
                distances = [np.linalg.norm(featureset - self.centroids[centroid]) for centroid in self.centroids]
                classification = distances.index(min(distances))
                self.classifications[classification].append(featureset)

            prev_centroids = dict(self.centroids)

            # Move each centroid to the mean of the points assigned to it
            for classification in self.classifications:
                self.centroids[classification] = np.average(self.classifications[classification], axis=0)

            # Stop once no centroid has moved by more than tol percent
            optimized = True

            for c in self.centroids:
                original_centroid = prev_centroids[c]
                current_centroid = self.centroids[c]
                change = np.sum(np.abs((current_centroid - original_centroid) / original_centroid * 100.0))
                if change > self.tol:
                    print(change)
                    optimized = False

            if optimized:
                break

    def predict(self, data):
        distances = [np.linalg.norm(data - self.centroids[centroid]) for centroid in self.centroids]
        classification = distances.index(min(distances))
        return classification
model = K_Means()
model.fit(X)

# One plotting colour per cluster index
colors = 10 * ["g", "r", "c", "b", "k"]

for centroid in model.centroids:
    plt.scatter(model.centroids[centroid][0], model.centroids[centroid][1],
                marker="o", color="k", s=150, linewidths=5)

for classification in model.classifications:
    color = colors[classification]
    for featureset in model.classifications[classification]:
        plt.scatter(featureset[0], featureset[1], marker="x", color=color, s=150, linewidths=5)

plt.show()
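The aim also asks for the sum of Euclidean distances of each example from its class centre, measured as a function of k. The following is a self-contained sketch of that evaluation on the same six points. It is a separate pure-Python k-means (the function names `k_means` and `total_distance` are illustrative, not part of the class above), it uses the first k points as initial centroids, and it needs Python 3.8+ for `math.dist`.

```python
import math

points = [(1, 2), (1.5, 1.8), (5, 8), (8, 8), (1, 0.6), (9, 11)]


def k_means(points, k, max_iter=100):
    """Plain k-means; the first k points are the initial centroids."""
    centroids = [tuple(p) for p in points[:k]]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: group points by nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                new_centroids.append(centroids[i])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids, clusters


def total_distance(centroids, clusters):
    """Sum of Euclidean distances of every point from its cluster centre."""
    return sum(math.dist(p, centroids[i])
               for i, cluster in enumerate(clusters) for p in cluster)


costs = {}
for k in range(1, len(points) + 1):
    centroids, clusters = k_means(points, k)
    costs[k] = total_distance(centroids, clusters)
    print(k, round(costs[k], 3))
```

On this data the total distance drops sharply from k = 1 to k = 2 and reaches zero when every point is its own cluster, which is why the value of k is usually picked where the curve stops falling steeply rather than where the cost is lowest.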
VIVA QUESTIONS – 3

Q. Why is “Naïve” Bayes naïve?

Ans. Naïve Bayes is called “naïve” because it assumes that all attributes are conditionally independent of
each other given the class label, an assumption that rarely holds in real data. A subtle issue that follows is
that if you have no occurrences of a class label and a certain attribute value together (e.g. class="nice",
shape="sphere") then the frequency-based probability estimate will be zero. Given Naive-Bayes' conditional
independence assumption, when all the probabilities are multiplied you will get zero, so the posterior
probability estimate for that class becomes zero as well.
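A common remedy for the zero-frequency issue is additive (Laplace) smoothing, which is a standard technique rather than something taken from this record. The sketch below shows the effect on a hypothetical set of (shape, class) examples in which "sphere" never occurs with class "nice".

```python
from collections import Counter

# Hypothetical training examples as (shape, class) pairs, invented for illustration.
training = [
    ("cube", "nice"), ("cube", "nice"), ("cylinder", "nice"),
    ("sphere", "nasty"), ("sphere", "nasty"), ("cube", "nasty"),
]
shapes = sorted({shape for shape, _ in training})


def likelihood(shape, label, smoothing=0.0):
    """Estimate P(shape | label), optionally with additive smoothing."""
    in_class = [s for s, c in training if c == label]
    count = Counter(in_class)[shape]
    return (count + smoothing) / (len(in_class) + smoothing * len(shapes))


print(likelihood("sphere", "nice"))                 # raw frequency estimate
print(likelihood("sphere", "nice", smoothing=1.0))  # add-one (Laplace) estimate
```

With smoothing the estimate is small but non-zero, so a single unseen combination no longer wipes out the whole product of probabilities.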

Q. What is SVM ?

Ans. “Support Vector Machine” (SVM) is a supervised machine learning algorithm which can be used for both
classification and regression challenges. However, it is mostly used in classification problems. In the SVM
algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have)
with the value of each feature being the value of a particular coordinate. Classification is then performed by
finding the hyperplane that best separates the classes.

Q. What is perceptron ?

Ans. The perceptron is an algorithm for supervised learning of binary classifiers. A binary classifier is a
function which can decide whether or not an input, represented by a vector of numbers, belongs to some
specific class. It is a type of linear classifier, i.e. a classification algorithm that makes its predictions based
on a linear predictor function combining a set of weights with the feature vector.
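As an illustration of a linear predictor combined with a set of weights, the following sketch trains a perceptron on the logical AND function using the classic perceptron learning rule. The function names, learning rate and epoch limit are choices made for this example only.

```python
def predict(weights, bias, x):
    """Linear predictor followed by a step function."""
    activation = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if activation > 0 else 0


def train_perceptron(samples, epochs=20, lr=1.0):
    """Perceptron learning rule on (features, label) pairs with labels 0/1."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        errors = 0
        for x, target in samples:
            update = lr * (target - predict(weights, bias, x))
            if update != 0:
                # Nudge the weights towards the misclassified example.
                weights = [w + update * xi for w, xi in zip(weights, x)]
                bias += update
                errors += 1
        if errors == 0:
            break  # every sample is classified correctly
    return weights, bias


# Truth table of logical AND, which is linearly separable.
and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_gate)
print(weights, bias)
print([predict(weights, bias, x) for x, _ in and_gate])
```

The rule only converges like this for linearly separable data; on a problem such as XOR it would keep making errors until the epoch limit is reached.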

Q. What are Exponential families ?

Ans. An exponential family is a parametric set of probability distributions whose density can be written in
the form $f(x \mid \theta) = h(x)\,\exp\big(\eta(\theta) \cdot T(x) - A(\theta)\big)$. This special form is chosen for
mathematical convenience, based on some useful algebraic properties, as well as for generality, as exponential
families are in a sense very natural sets of distributions to consider.
EXPERIMENT – 4

Aim : Study of databases and understanding attributes evaluation in regard to problem
description.

Many data files store the data, the problem description and the attributes in different formats, such as:

a) .arff data
b) .xlmns data
c) .csv data

a) Abalone data set :

I am using the Abalone data set for understanding the concepts of databases and its attributes.

Abstract: Predict the age of abalone from physical measurements

Data Set Characteristics: Multivariate
Number of Instances: 4177
Area: Life
Attribute Characteristics: Categorical, Integer, Real
Number of Attributes: 8
Date Donated: 1995-12-01
Associated Tasks: Classification
Missing Values? No
Number of Web Hits: 934219

Data Set Information:


Predicting the age of abalone from physical measurements. The age of abalone is determined by cutting the
shell through the cone, staining it, and counting the number of rings through a microscope -- a boring and
time-consuming task. Other measurements, which are easier to obtain, are used to predict the age. Further
information, such as weather patterns and location (hence food availability) may be required to solve the
problem.

From the original data examples with missing values were removed (the majority having the predicted
value missing), and the ranges of the continuous values have been scaled for use with an ANN (by dividing
by 200).

Attribute Information:

Given is the attribute name, attribute type, the measurement unit and a brief description. The number of rings
is the value to predict: either as a continuous value or as a classification problem.

Name / Data Type / Measurement Unit / Description

Sex / nominal / -- / M, F, and I (infant)
Length / continuous / mm / Longest shell measurement
Diameter / continuous / mm / perpendicular to length
Height / continuous / mm / with meat in shell
Whole weight / continuous / grams / whole abalone
Shucked weight / continuous / grams / weight of meat
Viscera weight / continuous / grams / gut weight (after bleeding)
Shell weight / continuous / grams / after being dried
Rings / integer / -- / +1.5 gives the age in years
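To show how the attribute list maps onto actual records, here is a short sketch that parses comma-separated rows in the attribute order given above and derives the age from the ring count. The three rows are illustrative values in the data set's format, and the names `COLUMNS` and `parse_abalone` are made up for this example.

```python
import csv
import io

COLUMNS = ["Sex", "Length", "Diameter", "Height", "Whole weight",
           "Shucked weight", "Viscera weight", "Shell weight", "Rings"]

# Illustrative rows, one value per attribute in the order listed above.
sample = """M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
"""


def parse_abalone(text):
    """Turn raw rows into dictionaries with typed values and a derived age."""
    records = []
    for row in csv.reader(io.StringIO(text)):
        record = {"Sex": row[0]}                      # nominal attribute
        for name, value in zip(COLUMNS[1:-1], row[1:-1]):
            record[name] = float(value)               # continuous attributes
        record["Rings"] = int(row[-1])                # integer target
        record["Age"] = record["Rings"] + 1.5         # rings + 1.5 gives the age in years
        records.append(record)
    return records


records = parse_abalone(sample)
for record in records:
    print(record["Sex"], record["Rings"], record["Age"])
```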

b) Annealing data set:

Abstract: Steel annealing data

Data Set Characteristics: Multivariate
Number of Instances: 798
Area: Physical
Attribute Characteristics: Categorical, Integer, Real
Number of Attributes: 38
Date Donated: N/A
Associated Tasks: Classification
Missing Values? Yes
Number of Web Hits: 162527

Source:

Donors: David Sterling and Wray Buntine


Attribute Listing:
1. family: --,GB,GK,GS,TN,ZA,ZF,ZH,ZM,ZS
2. product-type: C, H, G
3. steel: -,R,A,U,K,M,S,W,V
4. carbon: continuous
5. hardness: continuous
6. temper_rolling: -,T
7. condition: -,S,A,X
8. formability: -,1,2,3,4,5
9. strength: continuous
10. non-ageing: -,N
11. surface-finish: P,M,-
12. surface-quality: -,D,E,F,G
13. enamelability: -,1,2,3,4,5
14. bc: Y,-
15. bf: Y,-
16. bt: Y,-
17. bw/me: B,M,-
18. bl: Y,-
19. m: Y,-
20. chrom: C,-
21. phos: P,-
22. cbond: Y,-
23. marvi: Y,-
24. exptl: Y,-
25. ferro: Y,-
26. corr: Y,-
27. blue/bright/varn/clean: B,R,V,C,-
28. lustre: Y,-
29. jurofm: Y,-
30. s: Y,-
31. p: Y,-
32. shape: COIL, SHEET
33. thick: continuous
34. width: continuous
35. len: continuous
36. oil: -,Y,N
37. bore: 0000,0500,0600,0760
38. packing: -,1,2,3
classes: 1,2,3,4,5,U

-- The '-' values are actually 'not_applicable' values rather than 'missing_values' (and so can be treated as legal
discrete values rather than as showing the absence of a discrete value).
VIVA QUESTION – 4

Q. Difference between Supervised and Unsupervised Learning ?

Ans. Supervised learning is the technique of accomplishing a task by providing training input and output
patterns to the system, whereas unsupervised learning is a self-learning technique in which the system has to
discover the features of the input population on its own and no prior set of categories is used.

Q. How is KNN different from k-means clustering ?

Ans. K-Nearest Neighbors is a supervised classification algorithm, while k-means clustering is an
unsupervised clustering algorithm. The critical difference here is that KNN needs labeled points
and is thus supervised learning, while k-means does not and is thus unsupervised learning.

Q. Explain how a ROC curve works.

Ans. In a Receiver Operating Characteristic (ROC) curve the true positive rate (Sensitivity) is plotted as a
function of the false positive rate (100-Specificity) for different cut-off points. Each point on the ROC curve
represents a sensitivity/specificity pair corresponding to a particular decision threshold. A test with perfect
discrimination (no overlap in the two distributions) has a ROC curve that passes through the upper left corner
(100% sensitivity, 100% specificity). Therefore the closer the ROC curve is to the upper left corner, the
higher the overall accuracy of the test.
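The construction of the curve can be sketched in a few lines: sweep the decision threshold over the classifier scores and record one (FPR, TPR) point per threshold. The labels and scores below are made up for the illustration, and the area under the curve is computed with the trapezoidal rule.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs, one per decision threshold, for labels in {0, 1}."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        # Everything scored at or above the threshold is predicted positive.
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Made-up classifier scores; a higher score means "more likely positive".
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(labels, scores)
print(curve)
print(round(auc(curve), 4))
```

A classifier that ranks every positive above every negative passes through the upper left corner and has an area of 1; here one negative outranks one positive, so the area is slightly lower.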
EXPERIMENT – 5

Aim : Working of Major Classifiers :


a) Naïve Bayes
b) Decision Tree
c) CART
d) ARIMA
e) Linear and Logistic regression

a) Naïve Bayes Algorithm :

I am using the Python library scikit-learn to build the Naive Bayes algorithm.

>>> from sklearn.naive_bayes import GaussianNB
>>> from sklearn.naive_bayes import MultinomialNB
>>> from sklearn import datasets
>>> from sklearn.metrics import confusion_matrix

>>> iris = datasets.load_iris()

>>> gnb = GaussianNB()
>>> mnb = MultinomialNB()

>>> y_pred_gnb = gnb.fit(iris.data, iris.target).predict(iris.data)
>>> cnf_matrix_gnb = confusion_matrix(iris.target, y_pred_gnb)

>>> print(cnf_matrix_gnb)
[[50  0  0]
 [ 0 47  3]
 [ 0  3 47]]

>>> y_pred_mnb = mnb.fit(iris.data, iris.target).predict(iris.data)
>>> cnf_matrix_mnb = confusion_matrix(iris.target, y_pred_mnb)

>>> print(cnf_matrix_mnb)
[[50  0  0]
 [ 0 46  4]
 [ 0  3 47]]

b) Decision Tree

import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
# train_test_split now lives in sklearn.model_selection (formerly sklearn.cross_validation)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

# Function importing Dataset
def importdata():
    balance_data = pd.read_csv(
        'https://archive.ics.uci.edu/ml/machine-learning-' +
        'databases/balance-scale/balance-scale.data',
        sep=',', header=None)

    # Printing the dataset shape
    print("Dataset Length: ", len(balance_data))
    print("Dataset Shape: ", balance_data.shape)

    # Printing the dataset observations
    print("Dataset: ", balance_data.head())
    return balance_data

# Function to split the dataset
def splitdataset(balance_data):
    # Separating the target variable (column 0) from the features (columns 1-4)
    X = balance_data.values[:, 1:5]
    Y = balance_data.values[:, 0]

    # Splitting the dataset into train and test
    X_train, X_test, y_train, y_test = train_test_split(
        X, Y, test_size=0.3, random_state=100)

    return X, Y, X_train, X_test, y_train, y_test

# Function to perform training with giniIndex.
def train_using_gini(X_train, X_test, y_train):
    # Creating the classifier object
    clf_gini = DecisionTreeClassifier(criterion="gini",
        random_state=100, max_depth=3, min_samples_leaf=5)

    # Performing training
    clf_gini.fit(X_train, y_train)
    return clf_gini

# Function to perform training with entropy.
def train_using_entropy(X_train, X_test, y_train):
    # Decision tree with entropy
    clf_entropy = DecisionTreeClassifier(
        criterion="entropy", random_state=100,
        max_depth=3, min_samples_leaf=5)

    # Performing training
    clf_entropy.fit(X_train, y_train)
    return clf_entropy

# Function to make predictions
def prediction(X_test, clf_object):
    # Prediction on the test set with the given classifier
    y_pred = clf_object.predict(X_test)
    print("Predicted values:")
    print(y_pred)
    return y_pred

# Function to calculate accuracy
def cal_accuracy(y_test, y_pred):
    print("Confusion Matrix: ",
          confusion_matrix(y_test, y_pred))

    print("Accuracy : ",
          accuracy_score(y_test, y_pred) * 100)

    print("Report : ",
          classification_report(y_test, y_pred))

# Driver code
def main():
    # Building Phase
    data = importdata()
    X, Y, X_train, X_test, y_train, y_test = splitdataset(data)
    clf_gini = train_using_gini(X_train, X_test, y_train)
    clf_entropy = train_using_entropy(X_train, X_test, y_train)

    # Operational Phase
    print("Results Using Gini Index:")

    # Prediction using gini
    y_pred_gini = prediction(X_test, clf_gini)
    cal_accuracy(y_test, y_pred_gini)

    print("Results Using Entropy:")

    # Prediction using entropy
    y_pred_entropy = prediction(X_test, clf_entropy)
    cal_accuracy(y_test, y_pred_entropy)

# Calling main function
if __name__ == "__main__":
    main()

Output :
Predicted values:
['R' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'R' 'R' 'R' 'L' 'L' 'R' 'L' 'R' 'L'
'L' 'R' 'L' 'R' 'L' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'L' 'L'
'L' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'R' 'L' 'L' 'R' 'L' 'L' 'R' 'L' 'L'
'R' 'L' 'R' 'R' 'L' 'R' 'R' 'R' 'L' 'L' 'R' 'L' 'L' 'R' 'L' 'L' 'L' 'R'
'R' 'L' 'R' 'L' 'R' 'R' 'R' 'L' 'R' 'L' 'L' 'L' 'L' 'R' 'R' 'L' 'R' 'L'
'R' 'R' 'L' 'L' 'L' 'R' 'R' 'L' 'L' 'L' 'R' 'L' 'L' 'R' 'R' 'R' 'R' 'R'
'R' 'L' 'R' 'L' 'R' 'R' 'L' 'R' 'R' 'L' 'R' 'R' 'L' 'R' 'R' 'R' 'L' 'L'
'L' 'L' 'L' 'R' 'R' 'R' 'R' 'L' 'R' 'R' 'R' 'L' 'L' 'R' 'L' 'R' 'L' 'R'
'L' 'R' 'R' 'L' 'L' 'R' 'L' 'R' 'R' 'R' 'R' 'R' 'L' 'R' 'R' 'R' 'R' 'R'
'R' 'L' 'R' 'L' 'R' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'L' 'L' 'L' 'L' 'R'
'R' 'R' 'L' 'L' 'L' 'R' 'R' 'R']

Confusion Matrix: [[ 0  6  7]
 [ 0 63 22]
 [ 0 20 70]]
Accuracy : 70.7446808511
Report :
             precision    recall  f1-score   support
          B       0.00      0.00      0.00        13
          L       0.71      0.74      0.72        85
          R       0.71      0.78      0.74        90
avg / total       0.66      0.71      0.68       188

c) CART

# CART on the Bank Note dataset
from random import seed
from random import randrange
from csv import reader

# Load a CSV file
def load_csv(filename):
    with open(filename, "rt") as file:
        dataset = list(reader(file))
    return dataset

# Convert string column to float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())

# Split a dataset into k folds
def cross_validation_split(dataset, n_folds):
    dataset_split = list()
    dataset_copy = list(dataset)
    fold_size = int(len(dataset) / n_folds)
    for i in range(n_folds):
        fold = list()
        while len(fold) < fold_size:
            index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(index))
        dataset_split.append(fold)
    return dataset_split

# Calculate accuracy percentage
def accuracy_metric(actual, predicted):
    correct = 0
    for i in range(len(actual)):
        if actual[i] == predicted[i]:
            correct += 1
    return correct / float(len(actual)) * 100.0

# Evaluate an algorithm using a cross validation split
def evaluate_algorithm(dataset, algorithm, n_folds, *args):
    folds = cross_validation_split(dataset, n_folds)
    scores = list()
    for fold in folds:
        # Train on every fold except the current one
        train_set = list(folds)
        train_set.remove(fold)
        train_set = sum(train_set, [])
        # Hide the class label in the test rows
        test_set = list()
        for row in fold:
            row_copy = list(row)
            test_set.append(row_copy)
            row_copy[-1] = None
        predicted = algorithm(train_set, test_set, *args)
        actual = [row[-1] for row in fold]
        accuracy = accuracy_metric(actual, predicted)
        scores.append(accuracy)
    return scores

# Split a dataset based on an attribute and an attribute value
def test_split(index, value, dataset):
    left, right = list(), list()
    for row in dataset:
        if row[index] < value:
            left.append(row)
        else:
            right.append(row)
    return left, right

# Calculate the Gini index for a split dataset
def gini_index(groups, classes):
    # count all samples at split point
    n_instances = float(sum([len(group) for group in groups]))
    # sum the weighted Gini index of each group
    gini = 0.0
    for group in groups:
        size = float(len(group))
        # avoid divide by zero
        if size == 0:
            continue
        score = 0.0
        # score the group based on the score for each class
        for class_val in classes:
            p = [row[-1] for row in group].count(class_val) / size
            score += p * p
        # weight the group score by its relative size
        gini += (1.0 - score) * (size / n_instances)
    return gini
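As a quick check of the Gini calculation, the sketch below repeats the function so that it runs on its own and evaluates it on two toy splits: one where both groups are an even mix of the two classes, and one where each group is pure. The toy rows are invented for the illustration.

```python
def gini_index(groups, classes):
    """Weighted Gini index of a candidate split; each row ends with its class label."""
    n_instances = float(sum(len(group) for group in groups))
    gini = 0.0
    for group in groups:
        size = float(len(group))
        if size == 0:
            continue  # avoid dividing by zero for an empty group
        score = 0.0
        for class_val in classes:
            p = [row[-1] for row in group].count(class_val) / size
            score += p * p
        # Weight the impurity of the group by its share of the samples.
        gini += (1.0 - score) * (size / n_instances)
    return gini


# Toy rows of the form [feature, class].
worst_split = [[[1, 1], [1, 0]], [[1, 1], [1, 0]]]
best_split = [[[1, 0], [1, 0]], [[1, 1], [1, 1]]]

print(gini_index(worst_split, [0, 1]))
print(gini_index(best_split, [0, 1]))
```

For two classes the index ranges from 0.0 (both groups pure) to 0.5 (both groups evenly mixed), and CART picks the split with the lowest value.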

d) ARIMA

import pandas as pd

data = pd.read_csv("Electric_Production.csv", index_col=0)
data.head()
data.index = pd.to_datetime(data.index)
data.columns = ['Energy Production']

# Interactive line plot; cufflinks adds .iplot() to DataFrames and is meant for notebooks
import cufflinks as cf
data.iplot(title="Energy Production Jan 1985--Jan 2018")

# Decompose the series into trend, seasonal and residual components.
# The figure is shown with matplotlib here instead of plotly's plot_mpl.
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
result = seasonal_decompose(data, model='multiplicative')
fig = result.plot()
plt.show()

# auto_arima was imported from pyramid.arima in older releases; the package is now named pmdarima
from pmdarima import auto_arima
stepwise_model = auto_arima(data, start_p=1, start_q=1,
                            max_p=3, max_q=3, m=12,
                            start_P=0, seasonal=True,
                            d=1, D=1, trace=True,
                            error_action='ignore',
                            suppress_warnings=True,
                            stepwise=True)
print(stepwise_model.aic())

Running the example prints a summary of the fit model. This summarizes the coefficient values
used as well as the skill of the fit on the in-sample observations.

                             ARIMA Model Results
==============================================================================
Dep. Variable:                D.Sales   No. Observations:                   35
Model:                 ARIMA(5, 1, 0)   Log Likelihood                -196.170
Method:                       css-mle   S.D. of innovations             64.241
Date:                Mon, 12 Dec 2016   AIC                            406.340
Time:                        11:09:13   BIC                            417.227
Sample:                    02-01-1901   HQIC                           410.098
                         - 12-01-1903
=================================================================================
                    coef    std err          z      P>|z|      [95.0% Conf. Int.]
const            12.0649      3.652      3.304      0.003         4.908    19.222
ar.L1.D.Sales    -1.1082      0.183     -6.063      0.000        -1.466    -0.750
ar.L2.D.Sales    -0.6203      0.282     -2.203      0.036        -1.172    -0.068

e) Linear Regression :

import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)

    # mean of x and y vector
    m_x, m_y = np.mean(x), np.mean(y)

    # calculating cross-deviation and deviation about x
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x

    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x
    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plotting the actual points as scatter plot
    plt.scatter(x, y, color="m",
                marker="o", s=30)

    # predicted response vector
    y_pred = b[0] + b[1]*x

    # plotting the regression line
    plt.plot(x, y_pred, color="g")

    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')

    # function to show plot
    plt.show()

def main():
    # observations
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {} \
\nb_1 = {}".format(b[0], b[1]))

    # plotting regression line
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()
VIVA QUESTION – 5

Q. When should you use classification over regression?

Ans. Classification identifies group membership, while regression predicts a numeric response. Both are
prediction techniques: use classification when the target is a discrete class label, and regression
when the target is a value from a continuous range.

Q. How do you ensure you’re not overfitting with a model?

Ans. 1- Keep the model simple: reduce variance by using fewer variables and parameters, which
removes some of the noise in the training data.
2- Use cross-validation techniques such as k-fold cross-validation.

Q. How would you evaluate a logistic regression model?

Ans. To perform a logistic regression, the dependent variable has to be dichotomous (yes/no),
usually coded 0/1. Since a logistic regression model is a classifier rather than a regressor, you
can evaluate it with metrics such as accuracy and F1 score, which are computed from the counts
of true positives, true negatives, false positives and false negatives.
Experiment 6

AIM - Design a prediction model for Analysis of round-trip Time of Flight measurement
from a supermarket using random forest, naïve bayes etc.

CMU-SuperMarket Data Collection and Preprocessing for round-trip time of Flight(RToF) .

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as mse
data = pd.read_csv("cmu_supermarket.csv")
data = data.iloc[: , 0:5]
data.columns = ['X', 'Y', 'Z', 'Mag', 'RToF']
data = data.dropna()
data.head()

X = data.iloc[: , 0:4]
Y = data.iloc[: , 4]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.10, random_state=33)

Algorithms:

3. Naïve Bayes

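# Note: BayesianRidge is Bayesian ridge (linear) regression, not a naive Bayes model.
# scikit-learn's naive Bayes estimators are classifiers, while RToF is a continuous target.
from sklearn.linear_model import BayesianRidge

nb_model = BayesianRidge()
nb_model = nb_model.fit(X_train , Y_train)
prediction = nb_model.predict(X_test)
# ** binds tighter than /, so the exponent is written as 0.5 rather than 1/2
rmse = (mse(y_true = Y_test , y_pred = prediction))**0.5
print("RMSE : " , rmse)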

4. Decision Tree
from sklearn.tree import DecisionTreeRegressor
dt_model = DecisionTreeRegressor()
dt_model = dt_model.fit(X_train , Y_train)
prediction = dt_model.predict(X_test)
rmse = (mse(y_true = Y_test , y_pred = prediction))**0.5
print("RMSE : " , rmse)

5. Random Forest
from sklearn.ensemble import RandomForestRegressor
rf_model = RandomForestRegressor()
rf_model = rf_model.fit(X_train , Y_train)
prediction = rf_model.predict(X_test)
rmse = (mse(y_true = Y_test , y_pred = prediction))**0.5
print("RMSE : " , rmse)

6. SVM
from sklearn.svm import SVR
svm_model = SVR()
svm_model = svm_model.fit(X_train , Y_train)
prediction = svm_model.predict(X_test)
rmse = (mse(y_true = Y_test , y_pred = prediction))**0.5
print("RMSE : " , rmse)

7. KNN
from sklearn.neighbors import KNeighborsRegressor
knn_model = KNeighborsRegressor(n_neighbors=5)
knn_model = knn_model.fit(X_train , Y_train)
prediction = knn_model.predict(X_test)
rmse = (mse(y_true = Y_test , y_pred = prediction))**0.5
print("RMSE : " , rmse)

Result:
By performing the regression using the above five algorithms, we observe that ‘Random
Forest’ is the most accurate, with the lowest Root Mean Squared Error.
Therefore, we conclude that, of the above algorithms, Random Forest performs best.
VIVA QUESTION – 6

Q. What is ensemble learning ?

Ans. Ensemble learning is the process by which multiple models, such as classifiers or experts, are
strategically generated and combined to solve a particular computational intelligence problem.

Q. Why is ensemble learning used ?

Ans. Ensemble learning helps improve machine learning results by combining several models. Ensemble
methods are meta-algorithms that combine several machine learning techniques into one predictive model in
order to decrease variance (bagging), decrease bias (boosting), or improve predictions (stacking).

Q. When to use ensemble learning ?

Ans. Ensemble learning is usually used to average the predictions of different models to get a better
prediction. Bagging consists of training multiple models, each on a different random sample of the input data,
and then taking the average of those predictions.
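
The variance-reduction effect of averaging can be seen in a small simulation. This is an idealized sketch, not a real bagging implementation: each "model" is assumed to be the true value plus independent Gaussian noise, and the names noisy_model, true_value and the ensemble size of 25 are hypothetical choices made for illustration.

```python
import random
import statistics


def noisy_model(true_value, rng, noise=1.0):
    """Stand-in for one trained model: the right answer plus independent random error."""
    return true_value + rng.gauss(0.0, noise)


rng = random.Random(0)   # fixed seed so the run is repeatable
true_value = 10.0
n_trials = 2000
ensemble_size = 25

# Predictions of a single model across many trials
single = [noisy_model(true_value, rng) for _ in range(n_trials)]

# Predictions of an ensemble that averages 25 such models per trial
ensemble = [
    statistics.mean(noisy_model(true_value, rng) for _ in range(ensemble_size))
    for _ in range(n_trials)
]

var_single = statistics.pvariance(single)
var_ensemble = statistics.pvariance(ensemble)
print("variance of a single model :", round(var_single, 3))
print("variance of the ensemble   :", round(var_ensemble, 3))
```

With independent errors the variance of the average falls roughly in proportion to the ensemble size. Real bagged models are trained on overlapping bootstrap samples, so their errors are correlated and the gain is smaller.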

Q. What are the two paradigms of ensemble methods?

Ans. The two paradigms of ensemble methods are


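a) Sequential ensemble methods, where the base models are built one after another (e.g. boosting)
b) Parallel ensemble methods, where the base models are built independently of each other (e.g. bagging)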
EXPERIMENT - 7

Aim: Implement Supervised Learning ( KNN Classification Algorithm ).

Supervised Learning :
In supervised learning, the value we want to predict is present in the training data (labeled
data). The column holding that value is known as the Target, Dependent Variable or
Response Variable.
All the other columns in the dataset are known as Features, Predictor Variables or
Independent Variables.
Supervised Learning is classified into two categories:
1. Classification: Here the target variable consists of categories.
2. Regression: Here the target variable is continuous, and we usually try to fit a line or
curve to the data.

K-Nearest Neighbor algorithm:

This algorithm is used to solve classification problems. The K-nearest neighbor (K-NN)
algorithm classifies a new data point by looking at the k training points closest to it and
assigning the class that is most common among them, which implicitly draws a boundary
between the classes.
A larger k value gives smoother boundaries of separation and therefore less complex
models, whereas a smaller k value tends to overfit the data, resulting in more complex
models.
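
The majority-vote rule and a 5-fold cross-validated accuracy estimate can be sketched without any library, as below. This is a minimal illustration under stated assumptions: the two Gaussian clusters are synthetic data invented for the example, and knn_predict and k_fold_accuracy are hypothetical helper names, not part of scikit-learn.

```python
import math
import random
from collections import Counter


def knn_predict(train_X, train_y, query, k):
    """Return the majority class among the k training points nearest to the query."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def k_fold_accuracy(X, y, k_neighbors=3, n_folds=5, seed=0):
    """Accuracy on each held-out fold; every point is tested exactly once."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        train_idx = [i for i in idx if i not in held_out]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        correct = sum(
            knn_predict(train_X, train_y, X[i], k_neighbors) == y[i] for i in fold
        )
        scores.append(correct / len(fold))
    return scores


# Synthetic data: two well separated 2-D clusters, 30 points each
rng = random.Random(1)
X = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(30)]
X += [(rng.gauss(5, 0.5), rng.gauss(5, 0.5)) for _ in range(30)]
y = [0] * 30 + [1] * 30

scores = k_fold_accuracy(X, y)
mean_accuracy = sum(scores) / len(scores)
print("fold accuracies:", scores)
print("mean accuracy  :", mean_accuracy)
```

Averaging over the five folds gives a steadier estimate than a single train/test split, because every observation is used for testing once and for training four times.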

# Import necessary modules
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Loading data
irisData = load_iris()

# Create feature and target arrays
X = irisData.data
y = irisData.target

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)

# Predict on dataset which model has not seen before
print(knn.predict(X_test))

import numpy as np
import matplotlib.pyplot as plt

irisData = load_iris()

# Create feature and target arrays
X = irisData.data
y = irisData.target

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.2, random_state=42)

neighbors = np.arange(1, 9)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))

# Loop over K values
for i, k in enumerate(neighbors):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)

    # Compute training and test data accuracy
    train_accuracy[i] = knn.score(X_train, y_train)
    test_accuracy[i] = knn.score(X_test, y_test)

# Generate plot
plt.plot(neighbors, test_accuracy, label = 'Testing dataset Accuracy')
plt.plot(neighbors, train_accuracy, label = 'Training dataset Accuracy')

plt.legend()
plt.xlabel('n_neighbors')
plt.ylabel('Accuracy')
plt.show()
Output :
VIVA QUESTION - 7

Q. What is the general principle of an ensemble method and what is bagging and boosting in ensemble method?

Ans. The general principle of an ensemble method is to combine the predictions of several models
built with a given learning algorithm in order to improve robustness over a single model.

Q. What are SVM Kernel Functions?

Ans. The SVM is a machine learning algorithm which
• solves classification problems
• uses a flexible representation of the class boundaries
• implements automatic complexity control to reduce overfitting
• has a single global minimum which can be found in polynomial time
It is popular because
• it can be easy to use
• it often has good generalization performance
EXPERIMENT – 8

Aim : Understanding of R and its basics.

R is a programming language developed by Ross Ihaka and Robert Gentleman in
1993. R possesses an extensive catalog of statistical and graphical methods. It
includes machine learning algorithms, linear regression, time series and statistical
inference, to name a few. Most of the R libraries are written in R, but for heavy
computational tasks, C, C++ and Fortran codes are preferred.

Data analysis with R is done in a series of steps: programming, transforming, discovering,
modeling and communicating the results.

• Program: R is a clear and accessible programming tool
• Transform: R is made up of a collection of libraries designed specifically for data
science
• Discover: Investigate the data, refine your hypotheses and analyze them
• Model: R provides a wide array of tools to capture the right model for your data
• Communicate: Integrate codes, graphs, and outputs to a report with R Markdown or
build Shiny apps to share with the world
VIVA QUESTION – 8

Q. What is Regularization ?

Ans. Regularization is a form of regression that constrains (regularizes) or shrinks the coefficient estimates
towards zero. In other words, this technique discourages learning a more complex or flexible model, so as to
avoid the risk of overfitting. A simple relation for linear regression is Y ≈ β0 + β1X1 + β2X2 + … + βpXp.
Here Y represents the learned relation and β represents the coefficient estimates for the different variables
or predictors (X).
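
For concreteness, the usual textbook form of the fitting criterion is sketched below. The symbols n (number of observations), p (number of predictors) and the tuning parameter λ are standard notation assumed here, not taken from the lab record. Ordinary least squares chooses the coefficients that minimize the residual sum of squares (RSS), while ridge regression, one common regularized variant, adds a penalty on the squared coefficients:

```latex
\mathrm{RSS} = \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2

\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \Bigl\{ \mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2 \Bigr\},
\qquad \lambda \ge 0
```

With λ = 0 the ridge estimate coincides with ordinary least squares; as λ grows, the coefficients are shrunk further towards zero.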

Q. What is Ensemble Learning ?

Ans. Ensemble learning is the use of algorithms and tools in machine learning and other
disciplines to form a collaborative whole in which multiple methods are more effective than a single
learning method. Ensemble learning can be used in many different types of research, for
flexibility and enhanced results.

Q. What is Bagging ?

Ans. A Bagging classifier is an ensemble meta-estimator that fits base classifiers, each on a
random subset of the original dataset, and then aggregates their individual predictions (either by
voting or by averaging) to form a final prediction.

Q. What is Boosting ?

Ans. Boosting is a general ensemble method that creates a strong classifier from a number of weak classifiers.
This is done by building a model from the training data, then creating a second model that attempts to correct
the errors from the first model. Models are added until the training set is predicted perfectly or a maximum
number of models are added.
EXPERIMENT-9
Aim: Build and develop a model in R for a particular classifier (random Forest).

Random Forests is a powerful tool used extensively across a multitude of fields.


As a matter of fact, it is hard to come upon a data scientist that never had to resort
to this technique at some point. Motivated by the fact that I have been using
Random Forests quite a lot recently, I decided to give a quick intro to Random
Forests using R.

a) Split the data :

# Set random seed to make results reproducible:


set.seed(17)
# Calculate the size of each of the data sets:
data_set_size <- floor(nrow(iris)/2)
# Generate a random sample of "data_set_size" indexes
indexes <- sample(1:nrow(iris), size = data_set_size)

# Assign the data to the correct sets


training <- iris[indexes,]

validation1 <- iris[-indexes,]

b) Import the Package

#import the package


library(randomForest)
# Perform training:

rf_classifier = randomForest(Species ~ ., data=training, ntree=100, mtry=2, importance=TRUE)

> rf_classifier
randomForest(formula = Species ~ ., data = training,ntree=100,mtry=2, importance = TRUE)
Type of random forest: classification
Number of trees: 100
No. of variables tried at each split: 2

OOB estimate of error rate: 5.33%


Confusion matrix:
setosa versicolor virginica class.error
setosa 21 0 0 0.00000000
versicolor 0 25 2 0.07407407

virginica 0 2 25 0.07407407
# Validation set assessment #1: looking at confusion matrix
prediction_for_table <- predict(rf_classifier,validation1[,-5])
table(observed=validation1[,5],predicted=prediction_for_table)

predicted
observed setosa versicolor virginica
setosa 29 0 0
versicolor 0 20 3

virginica 0 1 22

# Validation set assessment #2: ROC curves and AUC

# Needs to import ROCR package for ROC curve plotting:


library(ROCR)

# Calculate the probability of new observations belonging to each class


# prediction_for_roc_curve will be a matrix with dimensions data_set_size x number_of_classes
prediction_for_roc_curve <- predict(rf_classifier,validation1[,-5],type="prob")

# Use pretty colours:


pretty_colours <- c("#F8766D","#00BA38","#619CFF")
# Specify the different classes
classes <- levels(validation1$Species)
# For each class
for (i in 1:3)
{
  # Define which observations belong to class[i]
  true_values <- ifelse(validation1[,5]==classes[i],1,0)
  # Assess the performance of classifier for class[i]
  pred <- prediction(prediction_for_roc_curve[,i],true_values)
  perf <- performance(pred, "tpr", "fpr")
  if (i==1)
  {
    plot(perf,main="ROC Curve",col=pretty_colours[i])
  }
  else
  {
    plot(perf,main="ROC Curve",col=pretty_colours[i],add=TRUE)
  }
  # Calculate the AUC and print it to screen
  auc.perf <- performance(pred, measure = "auc")
  print(auc.perf@y.values)
}
Output:
VIVA QUESTION

Q. How to Evaluate Machine Learning Algorithms?

Ans. Performance Measure. The performance measure is the way you want to evaluate a solution to the problem.
Test and Train Datasets. From the transformed data, you will need to select a test set and a training set, Then do
Cross Validation.

Q. What are the five popular algorithms of Machine Learning?

Ans. KNN , Decision Tree , SVM Algorithm, Neural Networks Algorithm and Probabilistic Networks.
EXPERIMENT-10
Aim: Develop a machine learning method using Neural Networks in python to Predict
stock prices based on past price variation.
Program :

#Import the libraries
import math
import pandas_datareader as web
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import Dense, LSTM
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

#Get the stock quote
df = web.DataReader('AAPL', data_source='yahoo', start='2012-01-01', end='2019-12-17')
#Show the data
df

#Visualize the closing price history
plt.figure(figsize=(16,8))
plt.title('Close Price History')
plt.plot(df['Close'])
plt.xlabel('Date',fontsize=18)
plt.ylabel('Close Price USD ($)',fontsize=18)
plt.show()

#Create a new dataframe with only the 'Close' column
data = df.filter(['Close'])
#Converting the dataframe to a numpy array
dataset = data.values
#Get /Compute the number of rows to train the model on
training_data_len = math.ceil( len(dataset) *.8)

#Scale all of the data to be values between 0 and 1
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_data = scaler.fit_transform(dataset)

#Create the scaled training data set
train_data = scaled_data[0:training_data_len , : ]
#Split the data into x_train and y_train data sets
x_train = []
y_train = []
for i in range(60,len(train_data)):
    x_train.append(train_data[i-60:i,0])
    y_train.append(train_data[i,0])
#Convert x_train and y_train to numpy arrays
x_train, y_train = np.array(x_train), np.array(y_train)
#Reshape to (samples, time steps, features), the 3-D input the LSTM layer expects
x_train = np.reshape(x_train, (x_train.shape[0], x_train.shape[1], 1))

#Build the LSTM network model
model = Sequential()
model.add(LSTM(units=50, return_sequences=True, input_shape=(x_train.shape[1],1)))
model.add(LSTM(units=50, return_sequences=False))
model.add(Dense(units=25))
model.add(Dense(units=1))

#Compile the model
model.compile(optimizer='adam', loss='mean_squared_error')

#Train the model
model.fit(x_train, y_train, batch_size=1, epochs=1)

#Test data set
test_data = scaled_data[training_data_len - 60: , : ]
#Create the x_test and y_test data sets
x_test = []
#Get all of the rows from index 1603 to the rest and all of the columns
#(in this case it's only column 'Close'), so 2003 - 1603 = 400 rows of data
y_test = dataset[training_data_len : , : ]
for i in range(60,len(test_data)):
    x_test.append(test_data[i-60:i,0])
VIVA VOCE
EXPERIMENT-11
Aim: Understanding of RMS Titanic Dataset to predict survival by training a
model and predict the required solution.
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April
15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing
1502 out of 2224 passengers and crew. This sensational tragedy shocked the international
community and led to better safety regulations for ships. One of the reasons that the shipwreck
led to such loss of life was that there were not enough lifeboats for the passengers and crew.
Although there was some element of luck involved in surviving the sinking, some groups of
people were more likely to survive than others, such as women, children, and the upper-class.

● Survived (Target Variable) - Binary categorical variable where 0 represents not survived
and 1 represents survived.
● Pclass - Categorical variable. It is passenger class.
● Sex - Binary Variable representing the gender of the passenger
● Age - Feature engineered variable. It is divided into 4 classes.
● Fare - Feature engineered variable. It is divided into 4 classes.
● Embarked - Categorical Variable. It tells the Port of embarkation.
● Title - New feature created from names. The title of names is classified into 4 different
classes.
● isAlone - Binary Variable. It tells whether the passenger is travelling alone or not.
● Age*Class - Feature engineered variable.

Model, predict and solve


Now we are ready to train a model and predict the required solution. There are 60+ predictive
modelling algorithms to choose from. We must understand the type of problem and solution
requirement to narrow down to a select few models which we can evaluate. Our problem is a
classification and regression problem. We want to identify relationship between output
(Survived or not) with other variables or features (Gender, Age, Port...). We are also performing
a category of machine learning which is called supervised learning as we are training our model
with a given dataset. With these two criteria - Supervised Learning plus Classification and
Regression, we can narrow down our choice of models to a few. These include:
• Logistic Regression
• KNN or k-Nearest Neighbours
• Support Vector Machines
• Naive Bayes classifier
• Decision Tree

Size of the training and testing dataset

1. Logistic Regression is a useful model to run early in the workflow. Logistic regression
measures the relationship between the categorical dependent variable (target) and one or more
independent variables (features) by estimating probabilities using a logistic function, which
is the cumulative logistic distribution.
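
How a logistic function turns a weighted sum of features into a probability can be sketched as follows. The weights, bias and feature values are hypothetical numbers chosen for illustration; they are not coefficients fitted on the Titanic data.

```python
import math


def logistic(z):
    """Map any real number to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Probability of the positive class for one observation."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return logistic(z)


# Hypothetical model with two features
weights = [1.2, -0.8]
bias = -0.5

p = predict_proba([1.0, 2.0], weights, bias)
label = 1 if p >= 0.5 else 0   # threshold the probability to get a class
print("probability of class 1:", round(p, 3))
print("predicted label       :", label)
```

Training a logistic regression model amounts to choosing the weights and bias so that these probabilities match the observed labels as closely as possible.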

Note the confidence score generated by the model based on our training dataset.
2. In pattern recognition, the k-Nearest Neighbours algorithm (or k-NN for short) is a non-
parametric method used for classification and regression. A sample is classified by a majority
vote of its neighbours, with the sample being assigned to the class most common among its k
nearest neighbours (k is a positive integer, typically small). If k = 1, then the object is simply
assigned to the class of that single nearest neighbour.

KNN confidence score is better than Logistic Regression but worse than SVM.
3. Next we model using Support Vector Machines, which are supervised learning models with
associated learning algorithms that analyze data used for classification and regression analysis.
Given a set of training samples, each marked as belonging to one or the other of two categories,
an SVM training algorithm builds a model that assigns new test samples to one category or the
other, making it a non-probabilistic binary linear classifier.
Note that the model generates a confidence score which is higher than that of the Logistic
Regression model.

4. In machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers
based on applying Bayes' theorem with strong (naive) independence assumptions between the
features. Naive Bayes classifiers are highly scalable, requiring a number of parameters linear
in the number of variables (features) in a learning problem.

The model generated confidence score is the lowest among the models evaluated so far.

5. This model uses a decision tree as a predictive model which maps features (tree branches)
to conclusions about the target value (tree leaves). Tree models where the target variable can
take a finite set of values are called classification trees; in these tree structures, leaves
represent class labels and branches represent conjunctions of features that lead to those class
labels. Decision trees where the target variable can take continuous values (typically real
numbers) are called regression trees. The model confidence score is the highest among models
evaluated so far.
EXPERIMENT-12
Aim: Understanding of Indian education in Rural villages to predict whether
girl child will be sent to school or not?

The data is focused on rural India. It primarily looks at whether the villagers
are willing to send their girl children to school and, if they are not sending their daughters
to school, the reasons they gave. The district is Gwalior. Various details of the
villagers such as village, gender, age, education, occupation, category, caste, religion, land etc.
have also been collected.

1 Naïve Bayes Classifier

The algorithm was run with 10-fold cross-validation: this means it was given an opportunity
to make a prediction for each instance of the dataset (with different training folds) and the
presented result is a summary of those predictions. Firstly, I noted the Classification Accuracy.
The model achieved a result of 109/200 correct or 54.5%.

=== Confusion Matrix ===

a b c d e f g h i j k l m <-- classified as
0 0 1 1 0 1 0 2 0 0 0 0 0 | a = Govt.
2 1 1 1 8 0 0 0 0 1 0 0 0 | b = Driver
2 0 17 2 9 0 0 2 0 0 0 0 0 | c = Farmer
0 0 4 3 2 0 1 0 0 1 0 0 0 | d = Shopkeeper
1 8 2 3 73 1 0 1 1 3 2 2 0 | e = labour
3 0 0 0 0 0 0 1 0 0 0 0 0 | f = Security Guard
0 1 0 1 0 0 0 2 0 0 0 0 0 | g = Raj Mistri
1 0 0 0 1 1 0 8 0 0 0 0 0 | h = Fishing
0 0 2 0 0 0 0 0 2 0 0 0 0 | i = Labour & Driver
0 0 2 0 1 0 0 0 0 2 0 0 0 | j = Homemaker
0 0 0 0 1 0 0 0 0 2 0 0 0 | k = Govt School Teacher
0 0 0 1 4 0 0 0 0 0 0 0 0 | l = Dhobi
0 0 0 0 3 0 0 0 0 0 0 0 1 | m = goats

The confusion matrix shows where the algorithm's errors fall: 1, 1, 1 and 2 Government
officials were misclassified as Farmer, Shopkeeper, Security Guard and Fisherman respectively;
2, 1, 1, 8 and 1 Drivers were misclassified as Government official, Farmer, Shopkeeper, Labour and
Homemaker respectively, and so on. This table can help to explain the accuracy achieved by the algorithm.
Now that we have a model,
we need to load the test data we created before. For this, select Supplied test set and click
the Set button. Click More Options and, in the new window, choose PlainText from Output
predictions. Then click on the recently created model in the result list and select Re-evaluate
model on current test set.
After re-evaluation

Now the Classification Accuracy is 151/200 correct or 75.5%.


TP = true positives: number of examples predicted positive that are actually positive
FP = false positives: number of examples predicted positive that are actually negative
TN = true negatives: number of examples predicted negative that are actually negative
FN = false negatives: number of examples predicted negative that are actually positive
Recall (the TP rate, also referred to as sensitivity): what fraction of those that are actually
positive were predicted positive? Recall = TP / actual positives
Precision (also referred to as Positive Predictive Value, PPV): what fraction of those predicted
positive are actually positive? Precision = TP / predicted positives
Other related measures used in classification include the True Negative Rate and Accuracy.
True Negative Rate (also called Specificity) = TN / actual negatives
1 - specificity is the x-axis of the ROC curve; this is the same as the FP rate (FP / actual
negatives).
F-measure: a measure that combines precision and recall is their harmonic mean, the
traditional F-measure or balanced F-score:
F = 2 × Precision × Recall / (Precision + Recall)
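
These definitions translate directly into code. The sketch below computes them from the four confusion-matrix counts; the counts passed in are made-up numbers for illustration, and classification_metrics is a hypothetical helper name rather than a Weka or scikit-learn function.

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive the standard binary classification measures from the four counts."""
    recall = tp / (tp + fn)             # TP rate, sensitivity
    precision = tp / (tp + fp)          # positive predictive value
    specificity = tn / (tn + fp)        # true negative rate
    fp_rate = fp / (fp + tn)            # equals 1 - specificity
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return {
        "recall": recall,
        "precision": precision,
        "specificity": specificity,
        "fp_rate": fp_rate,
        "accuracy": accuracy,
        "f_measure": f_measure,
    }


# Hypothetical counts for a binary classifier evaluated on 100 examples
m = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in m.items():
    print(f"{name:12s} {value:.3f}")
```

For a multi-class problem such as the ones in this experiment, the same measures are computed per class (treating that class as positive and all others as negative) and then averaged, which is what a weighted average in a classifier report refers to.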

Mean absolute error (MAE)


The MAE measures the average magnitude of the errors in a set of forecasts, without
considering their direction. It measures accuracy for continuous variables. The equation is given
in the library references. Expressed in words, the MAE is the average over the verification
sample of the absolute values of the differences between forecast and the corresponding
observation. The MAE is a linear score which means that all the individual differences are
weighted equally in the average;
Root mean squared error (RMSE)
The RMSE is a quadratic scoring rule which measures the average magnitude of the error. The
equation for the RMSE is given in both of the references. Expressing the formula in words, the
difference between forecast and corresponding observed values are each squared and then
averaged over the sample. Finally, the square root of the average is taken. Since the errors are
squared before they are averaged, the RMSE gives a relatively high weight to large errors. This
means the RMSE is most useful when large errors are particularly undesirable.
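
Both scores are easy to compute by hand, as the following sketch shows. The forecast and observation values are invented for illustration; one deliberately large error is included to show how the RMSE weights it more heavily than the MAE does.

```python
import math


def mae(forecasts, observations):
    """Mean absolute error: average size of the errors, ignoring their sign."""
    return sum(abs(f - o) for f, o in zip(forecasts, observations)) / len(forecasts)


def rmse(forecasts, observations):
    """Root mean squared error: square the errors, average them, take the square root."""
    mean_sq = sum((f - o) ** 2 for f, o in zip(forecasts, observations)) / len(forecasts)
    return math.sqrt(mean_sq)


# Hypothetical values; the last forecast is off by 4, the others by at most 1
forecasts = [2.0, 4.0, 6.0, 8.0]
observations = [1.0, 4.0, 7.0, 12.0]

print("MAE :", mae(forecasts, observations))             # 1.5
print("RMSE:", round(rmse(forecasts, observations), 4))  # 2.1213
```

The RMSE is never smaller than the MAE on the same data, and the gap between the two widens as the errors become more uneven in size.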
2 Support Vector Machine
The model achieved a result of 181/196 correct or 92.3469%.
We have classified the dataset on the basis of the reasons why the villagers are unwilling to send
girl children to school in the Gwalior district. The different classes are NA, Poverty, Marriage,
Distance, X, Unsafe Public Space, Transport Facilities, and Household Responsibilities.
The weighted average true positive rate is 0.923, that is, nearly all instances of each class
were assigned to that class. The weighted average false positive rate is 0.205, that is, about a
fifth of the instances not belonging to a class were nevertheless assigned to it. The precision is
0.902, that is, most of the predictions made for each class are correct.

=== Confusion Matrix ===

a b c d e f g h <-- classified as
147 0 1 0 0 0 0 0 | a = NA
4 12 0 0 0 0 0 0 | b = Poverty
5 0 3 0 0 0 0 0 | c = Marriage
0 0 1 3 0 0 0 0 | d = Distance
0 0 0 0 8 0 0 0 | e = X
4 0 0 0 0 0 0 0 | f = Unsafe Public Space
0 0 0 0 0 0 4 0 | g = Transport Facilities
0 0 0 0 0 0 0 4 | h = Household Responsibilities

The confusion matrix shows that majority of the reasons were not available and out of the
reasons which were available people did not send their daughters to school because of poverty
and very few of them considered Distance as a major factor for not sending their girl children
to school.

3 Random Forest

The accuracy of this algorithm is 100% that is 200/200 have been correctly classified.
=== Confusion Matrix ===

a b c d e f g h i j k l m <-- classified as
5 0 0 0 0 0 0 0 0 0 0 0 0 | a = Govt.
0 14 0 0 0 0 0 0 0 0 0 0 0 | b = Driver
0 0 32 0 0 0 0 0 0 0 0 0 0 | c = Farmer
0 0 0 11 0 0 0 0 0 0 0 0 0 | d = Shopkeeper
0 0 0 0 97 0 0 0 0 0 0 0 0 | e = labour
0 0 0 0 0 4 0 0 0 0 0 0 0 | f = Security Guard
0 0 0 0 0 0 4 0 0 0 0 0 0 | g = Raj Mistri
0 0 0 0 0 0 0 11 0 0 0 0 0 | h = Fishing
0 0 0 0 0 0 0 0 4 0 0 0 0 | i = Labour & Driver
0 0 0 0 0 0 0 0 0 5 0 0 0 | j = Homemaker
0 0 0 0 0 0 0 0 0 0 4 0 0 | k = Govt School Teacher
0 0 0 0 0 0 0 0 0 0 0 5 0 | l = Dhobi
0 0 0 0 0 0 0 0 0 0 0 0 4 | m = goats

There is no observation which has been misclassified. Maximum number of villagers are
laborers.

4 Random Tree

The classification accuracy is 76.0204%, that is, 149/196 have been classified correctly.
The false positive rate is 0.352, which is the highest of the four algorithms applied above. Here
35.2% of the values which should have been classified negatively have been assigned a positive
value.

=== Confusion Matrix ===


a b c d e f g h <-- classified as
126 7 3 1 0 8 3 0 | a = NA
7 8 1 0 0 0 0 0 | b = Poverty
4 1 3 0 0 0 0 0 | c = Marriage
1 0 0 3 0 0 0 0 | d = Distance
2 0 0 0 6 0 0 0 | e = X
4 0 0 0 0 0 0 0 | f = Unsafe Public Space
3 1 0 0 0 0 0 0 | g = Transport Facilities
1 0 0 0 0 0 0 3 | h = Household Responsibilities

22 NA, 8 Poverty, 5 Marriage, 1 Distance, 2 X, 4 Unsafe Public Space, 4 Transport
Facilities and 1 Household Responsibilities class values have been misclassified.
The best algorithm out of the above algorithms is Random Forest with 100% accuracy rate
and the worst is Naïve Bayes algorithm with 75.5% accuracy rate.
EXPERIMENT-13
Aim: Understanding of Dataset of contact patterns among students collected in
National University of Singapore.

This dataset records contact patterns among students, collected during
the spring semester of 2006 at the National University of Singapore.

Using RemovePercentage filter, instances have been reduced to: 500


This data has been taken and saved as training data set and then used for further
classification.

ALGORITHM-1 : GaussianProcesses
=== Run information ===

Scheme: weka.classifiers.functions.SimpleLinearRegression
Relation: MOCK_DATA (1)-weka.filters.unsupervised.instance.RemovePercentage-P50.0
Instances: 500
Attributes: 4
Start Time
Session Id
Student Id
Duration
Test mode: evaluate on training data

=== Classifier model (full training set) ===

Linear regression on Session Id

0.03 * Session Id + 10.38

Time taken to build model: 0 seconds

=== Evaluation on training set ===

Time taken to test model on training data: 0 seconds

=== Summary ===


Correlation coefficient 0.0677
Mean absolute error 4.9869
Root mean squared error 5.7893
Relative absolute error 99.7326 %
Root relative squared error 99.7708 %
Total Number of Instances 500

ALGORITHM 2: Linear Regression

Linear Regression Model

Start Time =

0.0274 * Session Id +
10.3846

Time taken to build model: 0.01 seconds

=== Evaluation on training set ===

Time taken to test model on training data: 0 seconds

=== Summary ===

Correlation coefficient 0.0677


Mean absolute error 4.9869
Root mean squared error 5.7893
Relative absolute error 99.7326 %
Root relative squared error 99.7708 %
Total Number of Instances 500
Algorithm 3: Decision Table

Merit of best subset found: 5.814


Evaluation (for feature selection): CV (leave one out)
Feature set: 1

Time taken to build model: 0.02 seconds

=== Evaluation on training set ===

Time taken to test model on training data: 0 seconds

=== Summary ===

Correlation coefficient 0
Mean absolute error 5.0003
Root mean squared error 5.8026
Relative absolute error 100 %
Root relative squared error 100 %
Total Number of Instances 500

CONCLUSION:
Six algorithms have been used to measure the best classifier. Depending on various attributes,
performance of various algorithms can be measured via mean absolute error and correlation
coefficient.

Based on the results above, the worst correlation was obtained by DecisionTable and the best
correlation by Decision Stump.

=== Run information ===


Scheme: weka.classifiers.rules.DecisionTable -X 1 -S "weka.attributeSelection.BestFirst
-D 1 -N 5"
Relation: MOCK_DATA (1)-weka.filters.unsupervised.instance.RemovePercentage-P50.0
Instances: 500
Attributes: 4
Start Time
Session Id
Student Id
Duration
Test mode: evaluate on training data

=== Classifier model (full training set) ===

Decision Table:

Number of training instances: 500


Number of Rules : 1
Non matches covered by Majority class.

Best first.

Start set: no attributes

Search direction: forward

Stale search after 5 node expansions

Total number of subsets evaluated: 9
