ML Lab Guide for CSE Students
ETCS-454
PRACTICAL RECORD
Branch : CSE
PRACTICAL DETAILS
VISION
To nurture young minds in a learning environment of high academic value and imbibe spiritual and
ethical values with technological and management competence.
MISSION
The Institute shall endeavour to incorporate the following basic missions in the teaching methodology:
❖ Engineering Hardware – Software Symbiosis: Practical exercises in all Engineering and
Management disciplines shall be carried out by Hardware equipment as well as the related
software enabling a deeper understanding of basic concepts and encouraging inquisitive
nature.
❖ Life-Long Learning: The Institute strives to match technological advancements and encourage students to keep updating their knowledge for enhancing their skills and inculcating the habit of continuous learning.
❖ Liberalization and Globalization: The Institute endeavors to enhance the technical and management skills of students so that they are intellectually capable and competent professionals with Industrial Aptitude to face the challenges of globalization.
❖ Diversification: The Engineering, Technology and Management disciplines have diverse
fields of studies with different attributes. The aim is to create a synergy of the above
attributes by encouraging analytical thinking.
❖ Digitization of Learning Processes: The Institute provides seamless opportunities for
innovative learning in all Engineering and Management disciplines through digitization of
learning processes using analysis, synthesis, simulation, graphics, tutorials and related tools to
create a platform for multi-disciplinary approach.
❖ Entrepreneurship: The Institute strives to develop potential Engineers and Managers by
enhancing their skills and research capabilities so that they emerge as successful
entrepreneurs and responsible citizens.
1. KnowledgeFlow is a Java-Beans-based interface for setting up and running machine learning experiments.
2. Simple CLI is a simple command-line interface provided to run Weka functions directly.
3. Explorer is an environment for exploring data with Weka; it supports pre-processing, classification, clustering, association rules, attribute selection and visualization.
4. Experimenter is an environment for performing experiments and statistical tests between learning schemes.
Pre-processing :
Most of the time the data won't be perfect, and we will need to do pre-processing before applying machine learning algorithms to it. Pre-processing is easy in Weka. You can simply click the "Open file" button and load your file in one of the supported formats: ARFF, CSV, C4.5, binary, LIBSVM, XRFF; you can also load an SQL database via a URL, and then you can apply filters to it. However, we won't need to do pre-processing for this exercise, since we'll use a dataset that Weka provides for us.
If your data is in XLS format, as in the previous image, you have to convert the file. I'll use the Iris dataset as an example.
1. Save your spreadsheet as a CSV file.
2. Open your CSV file in any text editor and add @RELATION database_name as the first row of the CSV file.
3. Add attributes using the following definition: @ATTRIBUTE attr_name attr_type. If attr_type is numeric you should define it as REAL; otherwise you have to list the possible values between curly braces.
4. Finally, add a @DATA tag just above your data rows, then save your file with the .arff extension (a minimal example is shown below).
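For illustration, a minimal hand-written .arff file for the Iris data (the relation and attribute names here are an assumption, not copied from the original record) would look like this:

@RELATION iris
@ATTRIBUTE sepal_length REAL
@ATTRIBUTE sepal_width REAL
@ATTRIBUTE petal_length REAL
@ATTRIBUTE petal_width REAL
@ATTRIBUTE class {Iris-setosa, Iris-versicolor, Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa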
Click the "Open file" button from the Pre-process section and load your .arff file from your local file system. If you couldn't convert your .csv to .arff, don't worry, because Weka will do that for you.
If you could follow all the steps so far, you have loaded your data set successfully and you'll see the attribute names (illustrated in the red area in the images above). The pre-processing stage is handled by the Filter panel in Weka: you can click the 'Choose' button under Filter and apply any filter you want. For example, if you would like to use Association Rule Mining as a training model, you have to discretize numeric and continuous attributes. To do that you can follow the path: Choose -> Filter -> Supervised -> Attribute -> Discretize.
Visualizing the Result : If you'd like to visualize these results you can use the graphical presentations shown in the figure below.
VIVA QUESTIONS - 1
Ans. Logistic Regression is a Machine Learning algorithm which is used for classification problems; it is a predictive analysis algorithm based on the concept of probability.
Ans. Logistic regression is a classification algorithm used to assign observations to a discrete set of
classes. Some of the examples of classification problems are Email spam or not spam, Online
transactions Fraud or not Fraud, Tumor Malignant or Benign. Logistic regression transforms its output
using the logistic sigmoid function to return a probability value.
Q. Why can't linear regression be used in place of logistic regression for binary classification?
Ans. Linear regression produces unbounded continuous outputs, so its predictions cannot be interpreted as class probabilities, and a least-squares fit is easily distorted by outliers and imbalanced classes. Logistic regression passes the linear combination of features through the sigmoid function, constraining the output to [0, 1] so it can be thresholded to classify the data.
EXPERIMENT-2
Machine learning algorithms are programs that can learn from data and improve from experience, without human intervention. Learning tasks may include learning the function that maps the input to the output, learning the hidden structure in unlabeled data, or 'instance-based learning', where a class label is produced for a new instance by comparing the new instance (row) to instances from the training data, which were stored in memory.
1. Supervised learning:
Supervised learning uses labeled training data to learn the mapping function that turns input variables (X) into the output variable (Y). In other words, it solves for f in the following equation:
Y = f(X)
This allows us to accurately generate outputs when given new inputs.
We’ll talk about two types of supervised learning: classification and regression.
Classification is used to predict the outcome of a given sample when the output variable is in the form of categories. A classification model might look at the input data and try to predict labels like "sick" or "healthy."
Regression is used to predict the outcome of a given sample when the output variable is in the form of real values. For example, a regression model might process input data to predict the amount of rainfall, the height of a person, etc.
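As a small illustration of both settings (a sketch only; the toy feature values and targets below are assumptions, not data from this manual):

import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])  # one input feature
y_class = np.array([0, 0, 0, 1, 1, 1])                    # categorical target (e.g. "healthy"/"sick")
y_real = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 6.0])         # continuous target (e.g. rainfall)

clf = LogisticRegression().fit(X, y_class)   # classification
reg = LinearRegression().fit(X, y_real)      # regression
print(clf.predict([[3.5]]))                  # predicted class label
print(reg.predict([[3.5]]))                  # predicted real value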
The first 5 algorithms that we cover here — Linear Regression, Logistic Regression, CART, Naïve Bayes, and K-Nearest Neighbors (KNN) — are examples of supervised learning.
Ensembling is another type of supervised learning. It means combining the predictions of multiple machine learning models that are individually weak to produce a more accurate prediction on a new sample. Algorithms 9 and 10 of this article — Bagging with Random Forests, Boosting with XGBoost — are examples of ensemble techniques.
2. Unsupervised learning:
Clustering is used to group samples such that objects within the same cluster are more similar to each other than to the objects from another cluster.
Dimensionality Reduction is used to reduce the number of variables of a data set while ensuring that important information is still conveyed. Dimensionality Reduction can be done using Feature Extraction methods and Feature Selection methods. Feature Selection selects a subset of the original variables. Feature Extraction performs data transformation from a high-dimensional space to a low-dimensional space. Example: the PCA algorithm is a Feature Extraction approach.
Algorithms 6-8 that we cover here — Apriori, K-means, PCA — are examples of unsupervised learning.
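A small sketch of Feature Extraction with PCA in scikit-learn (using the Iris data and two components purely as an assumed example):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                      # 4 original features
pca = PCA(n_components=2)                 # keep 2 extracted components
X_reduced = pca.fit_transform(X)          # project to the low-dimensional space
print(X_reduced.shape)                    # (150, 2)
print(pca.explained_variance_ratio_)      # variance retained by each component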
3. Reinforcement learning:
Reinforcement learning is a type of machine learning algorithm that allows an agent to decide the best next action based on its current state by learning behaviors that will maximize a reward.
Reinforcement algorithms usually learn optimal actions through trial and error. Imagine, for example, a video game in which the player needs to move to certain places at certain times to earn points. A reinforcement algorithm playing that game would start by moving randomly but, over time and through trial and error, it would learn where and when it needed to move the in-game character to maximize its point total.
VIVA QUESTIONS – 2
Ans. Accuracy can be a useful measure if we have the same number of samples per class, but if we have an imbalanced set of samples accuracy isn't useful at all. Even more so, a test can have a high accuracy but actually perform worse than a test with a lower accuracy.
Ans. A false positive is an error in data reporting in which a test result improperly indicates
presence of a condition, such as a disease (the result is positive), when in reality it is not present,
while a false negative is an error in which a test result improperly indicates no presence of a
condition (the result is negative), when in reality it is present.
Q. What are the true positive rate (TPR), true negative rate (TNR), false positive rate (FPR), and
false negative rate (FNR)?
Ans. TPR refers to the ratio of positives correctly predicted out of all actual positive labels; in simple words, it is the frequency of correctly predicted positive labels. TPR = TP/(TP+FN). TNR refers to the ratio of negatives correctly predicted out of all actual negative labels; it is the frequency of correctly predicted negative labels. TNR = TN/(TN+FP). FPR refers to the ratio of actual negatives incorrectly predicted as positive. FPR = FP/(TN+FP). FNR refers to the ratio of actual positives incorrectly predicted as negative. FNR = FN/(TP+FN).
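A short Python sketch that computes these rates from an assumed 2x2 confusion matrix (the counts below are made up for illustration):

# Assumed counts from a binary confusion matrix
TP, FN = 40, 10   # actual positives: predicted positive / predicted negative
FP, TN = 5, 45    # actual negatives: predicted positive / predicted negative

TPR = TP / (TP + FN)   # sensitivity / recall
TNR = TN / (TN + FP)   # specificity
FPR = FP / (TN + FP)
FNR = FN / (TP + FN)
precision = TP / (TP + FP)
print(TPR, TNR, FPR, FNR, precision)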
Ans. Precision means the percentage of your results which are relevant. On the other hand, recall refers to the
percentage of total relevant results correctly classified by your algorithm.
EXPERIMENT – 3
Aim: Implement the K-means algorithm using Python. Evaluate performance by measuring the sum of Euclidean distances of each example from its class centre. Test the performance of the algorithm as a function of the parameter k.
Python Code :
First we plot the data:
import matplotlib.pyplot as plt
import numpy as np
# sample data (values assumed for illustration)
X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])
plt.scatter(X[:, 0], X[:, 1], s=150)
plt.show()
KMeans.py :

import numpy as np

class K_Means:
    def __init__(self, k=2, tol=0.001, max_iter=300):
        self.k = k
        self.tol = tol
        self.max_iter = max_iter

    def fit(self, data):
        # start with the first k points as the initial centroids
        self.centroids = {}
        for i in range(self.k):
            self.centroids[i] = data[i]

        for _ in range(self.max_iter):
            self.classifications = {}
            for i in range(self.k):
                self.classifications[i] = []

            # assign every example to its nearest centroid
            for featureset in data:
                distances = [np.linalg.norm(featureset - self.centroids[centroid]) for centroid in self.centroids]
                classification = distances.index(min(distances))
                self.classifications[classification].append(featureset)

            prev_centroids = dict(self.centroids)

            # move each centroid to the mean of its assigned examples
            for classification in self.classifications:
                self.centroids[classification] = np.average(self.classifications[classification], axis=0)

            optimized = True
            for c in self.centroids:
                original_centroid = prev_centroids[c]
                current_centroid = self.centroids[c]
                if np.sum((current_centroid - original_centroid) / original_centroid * 100.0) > self.tol:
                    print(np.sum((current_centroid - original_centroid) / original_centroid * 100.0))
                    optimized = False

            if optimized:
                break

    def predict(self, data):
        distances = [np.linalg.norm(data - self.centroids[centroid]) for centroid in self.centroids]
        classification = distances.index(min(distances))
        return classification
model = K_Means()
model.fit(X)
plt.show()
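To evaluate performance as stated in the Aim, the following sketch (assuming the K_Means class and the sample data X defined above) sums the Euclidean distance of each example from its class centre and reports it as a function of k:

import numpy as np

def total_within_cluster_distance(model):
    # sum of Euclidean distances of each example from its class centre
    total = 0.0
    for c, members in model.classifications.items():
        for featureset in members:
            total += np.linalg.norm(featureset - model.centroids[c])
    return total

for k in range(1, 5):
    model_k = K_Means(k=k)
    model_k.fit(X)
    print("k =", k, "sum of distances =", total_within_cluster_distance(model_k))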
VIVA QUESTIONS – 3
Ans. A subtle issue with Naive-Bayes is that if you have no occurrences of a class label and a certain attribute
value together (e.g. class="nice", shape="sphere") then the frequency-based probability estimate will be zero.
Given Naive-Bayes' conditional independence assumption, when all the probabilities are multiplied you will
get zero and this will affect the posterior probability estimate.
Q. What is SVM ?
Ans. A Support Vector Machine (SVM) is a supervised machine learning algorithm which can be used for both classification and regression challenges. However, it is mostly used in classification problems. In the SVM algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate.
Q. What is perceptron ?
Ans. The perceptron is an algorithm for supervised learning of binary classifiers. A binary classifier is a function which can decide whether or not an input, represented by a vector of numbers, belongs to some specific class. It is a type of linear classifier, i.e. a classification algorithm that makes its predictions based on a linear predictor function combining a set of weights with the feature vector.
Ans. An exponential family is a parametric set of probability distributions of a certain form. This special form is chosen for mathematical convenience, based on some useful algebraic properties, as well as for generality, as exponential families are in a sense very natural sets of distributions to consider.
EXPERIMENT – 4
Many data files store the data or problem description and also their attributes in different formats, such as:
a) .arff data
b) .xls data
c) .csv data
I am using the Abalone data set for understanding the concepts of a dataset and its attributes.
From the original data, examples with missing values were removed (the majority having the predicted value missing), and the ranges of the continuous values have been scaled for use with an ANN (by dividing by 200).
Attribute Information:
Given are the attribute name, attribute type, measurement unit and a brief description. The number of rings is the value to predict: either as a continuous value or as a classification problem.
Source:
-- The '-' values are actually 'not_applicable' values rather than 'missing_values' (and so can be treated as legal
discrete values rather than as showing the absence of a discrete value).
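As a small, hedged sketch of loading and inspecting the dataset with pandas (the file name and the standard UCI column names are assumptions, not part of the original record):

import pandas as pd

# the UCI abalone file has no header row; names follow the attribute information above
cols = ['Sex', 'Length', 'Diameter', 'Height', 'Whole weight',
        'Shucked weight', 'Viscera weight', 'Shell weight', 'Rings']
abalone = pd.read_csv("abalone.data", header=None, names=cols)
print(abalone.dtypes)                  # attribute types
print(abalone.head())                  # first few examples
print(abalone['Rings'].describe())     # the value to predict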
VIVA QUESTION – 4
Ans. Supervised learning is the technique of accomplishing a task by providing training input and output patterns to the system, whereas unsupervised learning is a self-learning technique in which the system has to discover the features of the input population on its own and no prior set of categories is used.
Ans. In a Receiver Operating Characteristic (ROC) curve the true positive rate (Sensitivity) is plotted as a function of the false positive rate (100 - Specificity) for different cut-off points. Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold. A test with perfect discrimination (no overlap in the two distributions) has a ROC curve that passes through the upper left corner (100% sensitivity, 100% specificity). Therefore, the closer the ROC curve is to the upper left corner, the higher the overall accuracy of the test.
EXPERIMENT – 5
I am using the Python library scikit-learn to build the Naive Bayes algorithm.
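A hedged sketch of how the two confusion matrices below could be produced (the use of the Iris data and evaluation on the same data used for fitting are assumptions; only the printed matrices come from the original record):

from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.metrics import confusion_matrix

iris = load_iris()
X, y = iris.data, iris.target

gnb = GaussianNB().fit(X, y)                           # Gaussian Naive Bayes
cnf_matrix_gnb = confusion_matrix(y, gnb.predict(X))

mnb = MultinomialNB().fit(X, y)                        # Multinomial Naive Bayes
cnf_matrix_mnb = confusion_matrix(y, mnb.predict(X))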
>>> print(cnf_matrix_gnb)
[[50 0 0]
[ 0 47 3]
[ 0 3 47]]
>>> print(cnf_matrix_mnb)
[[50 0 0]
[ 0 46 4]
[ 0 3 47]]
Output :
b) Decision Tree
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
# Function to perform training with the Gini index criterion
def train_using_gini(X_train, X_test, y_train):
    clf_gini = DecisionTreeClassifier(criterion="gini", random_state=100)
    # Performing training
    clf_gini.fit(X_train, y_train)
    return clf_gini

# Function to perform training with the entropy criterion
def train_using_entropy(X_train, X_test, y_train):
    clf_entropy = DecisionTreeClassifier(criterion="entropy", random_state=100)
    # Performing training
    clf_entropy.fit(X_train, y_train)
    return clf_entropy

# Function to print the confusion matrix, accuracy and classification report
def cal_accuracy(y_test, y_pred):
    print("Confusion Matrix: ", confusion_matrix(y_test, y_pred))
    print("Accuracy : ", accuracy_score(y_test, y_pred) * 100)
    print("Report : ", classification_report(y_test, y_pred))

# Driver code
def main():
    # Building Phase
    data = importdata()
    X, Y, X_train, X_test, y_train, y_test = splitdataset(data)
    clf_gini = train_using_gini(X_train, X_test, y_train)
    clf_entropy = train_using_entropy(X_train, X_test, y_train)
    # Operational Phase
    print("Results Using Gini Index:")
    y_pred = clf_gini.predict(X_test)
    cal_accuracy(y_test, y_pred)
Output :
Predicted values:
['R' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'R' 'R' 'R' 'L' 'L' 'R' 'L' 'R' 'L'
'L' 'R' 'L' 'R' 'L' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'L' 'L'
'L' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'R' 'L' 'L' 'R' 'L' 'L' 'R' 'L' 'L'
'R' 'L' 'R' 'R' 'L' 'R' 'R' 'R' 'L' 'L' 'R' 'L' 'L' 'R' 'L' 'L' 'L' 'R'
'R' 'L' 'R' 'L' 'R' 'R' 'R' 'L' 'R' 'L' 'L' 'L' 'L' 'R' 'R' 'L' 'R' 'L'
'R' 'R' 'L' 'L' 'L' 'R' 'R' 'L' 'L' 'L' 'R' 'L' 'L' 'R' 'R' 'R' 'R' 'R'
'R' 'L' 'R' 'L' 'R' 'R' 'L' 'R' 'R' 'L' 'R' 'R' 'L' 'R' 'R' 'R' 'L' 'L'
'L' 'L' 'L' 'R' 'R' 'R' 'R' 'L' 'R' 'R' 'R' 'L' 'L' 'R' 'L' 'R' 'L' 'R'
'L' 'R' 'R' 'L' 'L' 'R' 'L' 'R' 'R' 'R' 'R' 'R' 'L' 'R' 'R' 'R' 'R' 'R'
'R' 'L' 'R' 'L' 'R' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'L' 'L' 'L' 'L' 'R'
'R' 'R' 'L' 'L' 'L' 'R' 'R' 'R']
Confusion Matrix: [[ 0 6 7]
[ 0 63 22]
[ 0 20 70]]
Accuracy : 70.7446808511
Report :
precision recall f1-score support
B 0.00 0.00 0.00 13
L 0.71 0.74 0.72 85
R 0.71 0.78 0.74 90
avg / total 0.66 0.71 0.68 188
c) CART
# Calculate the Gini index for a split dataset
def gini_index(groups, classes):
    # count all samples at the split point
    n_instances = float(sum([len(group) for group in groups]))
    # sum the weighted Gini index for each group
    gini = 0.0
    for group in groups:
        size = float(len(group))
        # avoid divide by zero
        if size == 0:
            continue
        score = 0.0
        # score the group based on the score for each class
        for class_val in classes:
            p = [row[-1] for row in group].count(class_val) / size
            score += p * p
        # weight the group score by its relative size
        gini += (1.0 - score) * (size / n_instances)
    return gini
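A quick usage check on two assumed toy splits (a worst-case 50/50 mix and a perfectly separated split) illustrates what the function returns:

# each row is [feature, class]; groups are lists of rows
print(gini_index([[[1, 1], [1, 0]], [[1, 1], [1, 0]]], [0, 1]))   # 0.5 (worst case)
print(gini_index([[[1, 0], [1, 0]], [[1, 1], [1, 1]]], [0, 1]))   # 0.0 (perfect split)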
d) ARIMA
import pandas as pd
data = pd.read_csv("Electric_Production.csv", index_col=0)
data.head()
data.index = pd.to_datetime(data.index)
data.columns = ['Energy Production']
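The fitting step itself is not reproduced in the record; a minimal sketch using statsmodels (the ARIMA order (1, 1, 1) is an assumption, not the order used originally) would be:

from statsmodels.tsa.arima.model import ARIMA

# fit an ARIMA model to the monthly energy production series loaded above
model = ARIMA(data['Energy Production'], order=(1, 1, 1))
model_fit = model.fit()
print(model_fit.summary())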
Running the example prints a summary of the fitted model. This summarizes the coefficient values used as well as the skill of the fit on the in-sample observations.
ARIMA Model Results (model summary table output)
e) Linear Regression :
import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # estimate the regression coefficients b_0 and b_1 by least squares
    n = np.size(x)
    m_x, m_y = np.mean(x), np.mean(y)
    SS_xy = np.sum(y * x) - n * m_y * m_x
    SS_xx = np.sum(x * x) - n * m_x * m_x
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1 * m_x
    return (b_0, b_1)

def main():
    # observations
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))
    # plotting the observations and putting labels on the axes
    plt.scatter(x, y)
    plt.xlabel('x')
    plt.ylabel('y')
    plt.show()

main()
Ans. Classification is about identifying group membership while regression involves predicting a response. Both techniques are related to prediction, where classification predicts the belonging to a class whereas regression predicts a value from a continuous set.
Ans. 1- Keep the model simpler: reduce variance by taking into account fewer variables and parameters, thereby
removing some of the noise in the training data.
2- Use cross-validation techniques such as k-folds cross-validation.
Ans. To perform a logistic regression, the dependent variable has to be dichotomous (yes/no), usually coded 0/1. Since the logistic regression model is a classifier rather than a regressor, you could evaluate the model by using metrics like accuracy, false positives, false negatives, true positives, true negatives, and the F1 score.
Experiment 6
AIM - Design a prediction model for analysis of round-trip Time of Flight (RToF) measurements from a supermarket using Random Forest, Naïve Bayes, etc.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as mse
data = pd.read_csv("cmu_supermarket.csv")
data = data.iloc[: , 0:5]
data.columns = ['X', 'Y', 'Z', 'Mag', 'RToF']
data = data.dropna()
data.head()
X = data.iloc[: , 0:4]
Y = data.iloc[: , 4]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.10, random_state=33)
Algorithms:
3. Naïve Bayes
4. Decision Tree
from sklearn.tree import DecisionTreeRegressor
dt_model = DecisionTreeRegressor()
dt_model = dt_model.fit(X_train , Y_train)
prediction = dt_model.predict(X_test)
rmse = (mse(y_true=Y_test, y_pred=prediction)) ** 0.5
print("RMSE : " , rmse)
5. Random Forest
from sklearn.ensemble import RandomForestRegressor
rf_model = RandomForestRegressor()
rf_model = rf_model.fit(X_train , Y_train)
prediction = rf_model.predict(X_test)
rmse = (mse(y_true=Y_test, y_pred=prediction)) ** 0.5
print("RMSE : " , rmse)
6. SVM
from sklearn.svm import SVR
svm_model = SVR()
svm_model = svm_model.fit(X_train , Y_train)
prediction = svm_model.predict(X_test)
rmse = (mse(y_true=Y_test, y_pred=prediction)) ** 0.5
print("RMSE : " , rmse)
7. KNN
from sklearn.neighbors import KNeighborsRegressor
knn_model = KNeighborsRegressor(n_neighbors=5)
knn_model = knn_model.fit(X_train , Y_train)
prediction = knn_model.predict(X_test)
rmse = (mse(y_true=Y_test, y_pred=prediction)) ** 0.5
print("RMSE : " , rmse)
Result:
By performing regression using the above five algorithms, we observe that Random Forest is the most accurate, with the lowest Root Mean Squared Error (RMSE).
Therefore, we conclude that out of the above algorithms, Random Forest performs best.
VIVA QUESTION – 6
Ans. Ensemble learning is the process by which multiple models, such as classifiers or experts, are strategically generated and combined to solve a particular computational intelligence problem.
Ans. Ensemble learning helps improve machine learning results by combining several models. Ensemble methods are meta-algorithms that combine several machine learning techniques into one predictive model in order to decrease variance (bagging), decrease bias (boosting), or improve predictions (stacking).
Ans. Ensemble learning is usually used to average the predictions of different models to get a better prediction. Bagging consists of running multiple different models, each on a different set of input samples, and then taking the average of those predictions.
EXPERIMENT – 7
Supervised Learning :
It is the learning where the value or result that we want to predict is within the training data (labeled data), and the value in the data that we want to predict is known as the Target, Dependent Variable or Response Variable.
All the other columns in the dataset are known as Features, Predictor Variables or Independent Variables.
Supervised Learning is classified into two categories:
1. Classification: Here our target variable consists of categories.
2. Regression: Here our target variable is continuous, and we usually try to find the line or curve of best fit.
The K-nearest neighbour (K-NN) algorithm is used to solve classification problems. It basically creates an imaginary boundary to classify the data: when new data points come in, the algorithm assigns them to the nearest side of the boundary line.
Therefore, a larger k value means smoother curves of separation, resulting in less complex models, whereas a smaller k value tends to overfit the data, resulting in more complex models.
# Loading data
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

irisData = load_iris()
X = irisData.data
y = irisData.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)

# Predict on dataset which model has not seen before
print(knn.predict(X_test))

# Compute training and testing accuracy for different values of k
neighbors = np.arange(1, 9)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))

for i, k in enumerate(neighbors):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    train_accuracy[i] = knn.score(X_train, y_train)
    test_accuracy[i] = knn.score(X_test, y_test)

# Generate plot
plt.plot(neighbors, test_accuracy, label='Testing dataset Accuracy')
plt.plot(neighbors, train_accuracy, label='Training dataset Accuracy')
plt.legend()
plt.xlabel('n_neighbors')
plt.ylabel('Accuracy')
plt.show()
Output :
VIVA QUESTION - 7
Q. What is the general principle of an ensemble method and what is bagging and boosting in ensemble method?
Ans. The general principle of an ensemble method is to combine the predictions of several models
built with a given learning algorithm in order to improve robustness over a single model.
Ans. The SVM is a machine learning algorithm which:
• solves classification problems
• uses a flexible representation of the class boundaries
• implements automatic complexity control to reduce overfitting
• has a single global minimum which can be found in polynomial time
It is popular because it can be easy to use and it often has good generalization performance.
EXPERIMENT – 8
Q. What is Regularization ?
Ans. This is a form of regression that constrains/regularizes or shrinks the coefficient estimates towards zero. In other words, this technique discourages learning a more complex or flexible model, so as to avoid the risk of overfitting. A simple relation for linear regression looks like this: Y ≈ β0 + β1X1 + β2X2 + … + βpXp. Here Y represents the learned relation and β represents the coefficient estimates for the different variables or predictors (X).
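A brief sketch of L2 (ridge) regularization with scikit-learn, shrinking coefficient estimates towards zero (the toy data and the penalty strength alpha are assumptions for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
X = rng.rand(30, 5)                                         # 30 samples, 5 predictors
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + 0.1 * rng.randn(30)

ols = LinearRegression().fit(X, y)        # unregularized fit
ridge = Ridge(alpha=1.0).fit(X, y)        # larger alpha shrinks coefficients more
print(ols.coef_)
print(ridge.coef_)                        # estimates pulled towards zero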
Ans. Ensemble learning is the use of algorithms and tools in machine learning and other disciplines to form a collaborative whole where multiple methods are more effective than a single learning method. Ensemble learning can be used in many different types of research, for flexibility and enhanced results.
Q. What is Bagging ?
Ans. A Bagging classifier is an ensemble meta-estimator that fits base classifiers, each on random subsets of the original dataset, and then aggregates their individual predictions (either by voting or by averaging) to form a final prediction.
Q. What is Boosting ?
Ans. Boosting is a general ensemble method that creates a strong classifier from a number of weak classifiers. This is done by building a model from the training data, then creating a second model that attempts to correct the errors of the first model. Models are added until the training set is predicted perfectly or a maximum number of models are added.
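A compact sketch contrasting bagging and boosting in scikit-learn (the Iris data and the default base estimators are assumptions for illustration):

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Bagging: base classifiers fit on random subsets, predictions combined by voting
bagging = BaggingClassifier(n_estimators=50, random_state=0)
# Boosting: models added sequentially, each one correcting the errors of the previous
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

print(cross_val_score(bagging, X, y, cv=5).mean())
print(cross_val_score(boosting, X, y, cv=5).mean())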
EXPERIMENT-9
Aim: Build and develop a model in R for a particular classifier (random Forest).
> rf_classifier
randomForest(formula = Species ~ ., data = training,ntree=100,mtry=2, importance = TRUE)
Type of random forest: classification
Number of trees: 100
No. of variables tried at each split: 2
virginica 0 2 25 0.07407407
# Validation set assessment #1: looking at confusion matrix
prediction_for_table <- predict(rf_classifier,validation1[,-5])
table(observed=validation1[,5],predicted=prediction_for_table)
predicted
observed setosa versicolor virginica
setosa 29 0 0
versicolor 0 20 3
virginica 0 1 22
Output:
VIVA QUESTION
Ans. Performance measure: the performance measure is the way you want to evaluate a solution to the problem. Test and train datasets: from the transformed data, you will need to select a test set and a training set, and then perform cross-validation.
Ans. KNN , Decision Tree , SVM Algorithm, Neural Networks Algorithm and Probabilistic Networks.
EXPERIMENT-10
Aim: Develop a machine learning method using Neural Networks in Python to predict stock prices based on past price variation.
Program :
● Survived (Target Variable) - Binary categorical variable where 0 represents not survived
and 1 represents survived.
● Pclass - Categorical variable. It is passenger class.
● Sex - Binary variable representing the gender of the passenger.
● Age - Feature engineered variable. It is divided into 4 classes.
● Fare - Feature engineered variable. It is divided into 4 classes.
● Embarked - Categorical Variable. It tells the Port of embarkation.
● Title - New feature created from names. The title of names is classified into 4 different
classes.
● isAlone - Binary Variable. It tells whether the passenger is travelling alone or not.
● Age*Class - Feature engineered variable.
1. Logistic Regression is a useful model to run early in the workflow. Logistic regression
measures the relationship between the categorical dependent variable (feature) and one or more
independent variables (features) by estimating probabilities using a logistic function, which
is the cumulative logistic distribution.
Note the confidence score generated by the model based on our training dataset.
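A minimal sketch of this step (assuming X_train, Y_train and X_test hold the engineered features and the Survived target described above; the "confidence score" is taken here to be the mean training accuracy from .score()):

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
Y_pred = logreg.predict(X_test)
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)   # confidence score
print(acc_log)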
2. In pattern recognition, the k-Nearest Neighbours algorithm (or k-NN for short) is a non-
parametric method used for classification and regression. A sample is classified by a majority
vote of its neighbours, with the sample being assigned to the class most common among its k
nearest neighbours (k is a positive integer, typically small). If k = 1, then the object is simply
assigned to the class of that single nearest neighbour.
The KNN confidence score is better than Logistic Regression but worse than SVM.
3. Next we model using Support Vector Machines which are supervised learning models with
associated learning algorithms that analyze data used for classification and regression analysis.
Given a set of training samples, each marked as belonging to one or the other of two categories,
an SVM training algorithm builds a model that assigns new test samples to one category or the
other, making it a non-probabilistic binary linear classifier.
Note that the model generates a confidence score which is higher than the Logistic Regression model.
4. In machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers
based on applying Bayes' theorem with strong (naive) independence assumptions between the
features. Naive Bayes classifiers are highly scalable, requiring a number of parameters linear
in the number of variables (features) in a learning problem.
The model generated confidence score is the lowest among the models evaluated so far.
5. This model uses a decision tree as a predictive model which maps features (tree branches)
to conclusions about the target value (tree leaves). Tree models where the target variable can
take a finite set of values are called classification trees; in these tree structures, leaves
represent class labels and branches represent conjunctions of features that lead to those class
labels. Decision trees where the target variable can take continuous values (typically real
numbers) are called regression trees. The model confidence score is the highest among the models evaluated so far.
EXPERIMENT-12
Aim: Understanding of Indian education in rural villages to predict whether a girl child will be sent to school or not.
The data is focused on rural India. It primarily looks into whether the villagers are willing to send their girl children to school or not, and if they are not sending their daughters to school the reasons have also been mentioned. The district is Gwalior. Various details of the villagers such as village, gender, age, education, occupation, category, caste, religion, land etc. have also been collected.
The algorithm was run with 10-fold cross-validation: this means it was given an opportunity
to make a prediction for each instance of the dataset (with different training folds) and the
presented result is a summary of those predictions. Firstly, I noted the Classification Accuracy.
The model achieved a result of 109/200 correct or 54.5%.
a b c d e f g h i j k l m <-- classified as
0 0 1 1 0 1 0 2 0 0 0 0 0 | a = Govt.
2 1 1 1 8 0 0 0 0 1 0 0 0 | b = Driver
2 0 17 2 9 0 0 2 0 0 0 0 0 | c = Farmer
0 0 4 3 2 0 1 0 0 1 0 0 0 | d = Shopkeeper
1 8 2 3 73 1 0 1 1 3 2 2 0 | e = labour
3 0 0 0 0 0 0 1 0 0 0 0 0 | f = Security Guard
0 1 0 1 0 0 0 2 0 0 0 0 0 | g = Raj Mistri
1 0 0 0 1 1 0 8 0 0 0 0 0 | h = Fishing
0 0 2 0 0 0 0 0 2 0 0 0 0 | i = Labour & Driver
0 0 2 0 1 0 0 0 0 2 0 0 0 | j = Homemaker
0 0 0 0 1 0 0 0 0 2 0 0 0 | k = Govt School Teacher
0 0 0 1 4 0 0 0 0 0 0 0 0 | l = Dhobi
1 0 0 0 3 0 0 0 0 0 0 0 1 | m = goats
The confusion matrix shows the precision of the algorithm: 1, 1, 1 and 2 Government officials were misclassified as Farmer, Shopkeeper, Security Guard and Fishing respectively; 2, 1, 1, 8 and 1 Drivers were misclassified as Government officials, Farmer, Shopkeeper, Labour and Homemaker; and so on. This table can help to explain the accuracy achieved by the algorithm.
Now that we have a model, we need to load the test data we created before. For this, select Supplied test set and click the Set button. Click More Options and, in the new window, choose PlainText for Output predictions. Then click the recently created model in the result list and select Re-evaluate model on current test set.
After re-evaluation
a b c d e f g h <-- classified as
147 0 1 0 0 0 0 0 | a = NA
4 12 0 0 0 0 0 0 | b = Poverty
5 0 3 0 0 0 0 0 | c = Marriage
0 0 1 3 0 0 0 0 | d = Distance
0 0 0 0 8 0 0 0 | e = X
4 0 0 0 0 0 0 0 | f = Unsafe Public Space
0 0 0 0 0 0 4 0 | g = Transport Facilities
1 0 0 0 0 0 0 4 | h = Household Responsibilities
The confusion matrix shows that the majority of the reasons were not available and, out of the reasons which were available, people did not send their daughters to school mainly because of poverty; very few of them considered distance a major factor for not sending their girl children to school.
3 Random Forest
The accuracy of this algorithm is 100%, i.e., 200/200 instances have been correctly classified.
=== Confusion Matrix ===
a b c d e f g h i j k l m <-- classified as
5 0 0 0 0 0 0 0 0 0 0 0 0 | a = Govt.
0 14 0 0 0 0 0 0 0 0 0 0 0 | b = Driver
0 0 32 0 0 0 0 0 0 0 0 0 0 | c = Farmer
0 0 0 11 0 0 0 0 0 0 0 0 0 | d = Shopkeeper
0 0 0 0 97 0 0 0 0 0 0 0 0 | e = labour
0 0 0 0 0 4 0 0 0 0 0 0 0 | f = Security Guard
0 0 0 0 0 0 4 0 0 0 0 0 0 | g = Raj Mistri
0 0 0 0 0 0 0 11 0 0 0 0 0 | h = Fishing
0 0 0 0 0 0 0 0 4 0 0 0 0 | i = Labour & Driver
0 0 0 0 0 0 0 0 0 5 0 0 0 | j = Homemaker
0 0 0 0 0 0 0 0 0 0 4 0 0 | k = Govt School Teacher
0 0 0 0 0 0 0 0 0 0 0 5 0 | l = Dhobi
1 0 0 0 0 0 0 0 0 0 0 0 4 | m = goats
There is no observation which has been misclassified. The maximum number of villagers are labourers.
4 Random Tree
The classification accuracy is 76.0204%, that is, 149/200 instances have been classified correctly.
The false positive rate is 0.352, which is the highest of all four algorithms applied above: 35.2% of the values which should have been classified negatively have been assigned a positive value.
This dataset was collected from contact patterns among students during the spring semester of 2006 at the National University of Singapore.
ALGORITHM-1 : GaussianProcesses
=== Run information ===
Scheme: weka.classifiers.functions.SimpleLinearRegression
Relation: MOCK_DATA (1)-weka.filters.unsupervised.instance.RemovePercentage-P50.0
Instances: 500
Attributes: 4
Start Time
Session Id
Student Id
Duration
Test mode: evaluate on training data
Start Time =
0.0274 * Session Id +
10.3846
Correlation coefficient 0
Mean absolute error 5.0003
Root mean squared error 5.8026
Relative absolute error 100 %
Root relative squared error 100 %
Total Number of Instances 500
CONCLUSION:
Six algorithms have been used to find the best classifier. Depending on various attributes, the performance of the algorithms can be measured via the mean absolute error and the correlation coefficient.
Based on the results above, the worst correlation has been found with DecisionTable and the best correlation with Decision Stump.
Decision Table search method: Best first.