Data Warehousing and Data Mining Lab
Experiment 1 - Matrix Operations
a) Create a multi-dimensional array and find its shape and dimensions
b) Create a matrix full of zeros and ones
c) Reshape and flatten data in the array
d) Append data vertically and horizontally
e) Apply indexing and slicing on array
f) Use statistical functions on array - Min, Max, Mean, Median and Standard Deviation
PROGRAMS:
a) Create a multi-dimensional array and find its shape and dimensions
import numpy as np
#creation of multi-dimensional array
a=np.array([[1,2,3],[2,3,4],[3,4,5]])
#shape
b=a.shape
print("shape:",b)
#dimension
c=a.ndim
print("dimensions:",c)
b) Create a matrix full of zeros and ones
#matrix full of zeros
z=np.zeros((2,2))
print("zeros:",z)
#matrix full of ones
o=np.ones((2,2))
print("ones:",o)
c) Reshape and flatten data in the array
#matrix reshape
a=np.array([[1,2,3,4],[2,3,4,5],[3,4,5,6],[4,5,6,7]])
b=a.reshape(4,2,2)
print("reshape:",b)
#matrix flatten
c=a.flatten()
print("flatten:",c)
d) Append data vertically and horizontally
#Appending data vertically
x=np.array([[10,20],[80,90]])
y=np.array([[30,40],[60,70]])
v=np.vstack((x,y))
print("vertically:",v)
#Appending data horizontally
h=np.hstack((x,y))
print("horizontally:",h)
e) Apply indexing and slicing on array
#indexing
a=np.array([[1,2,3,4],[2,3,4,5],[3,4,5,6],[4,5,6,7]])
temp = a[[0, 1, 2, 3], [1, 1, 1, 1]]
print("indexing",temp)
#slicing
i=a[:4,::2]
print("slicing",i)
f) Use statistical functions on array - Min, Max, Mean, Median and Standard Deviation
#min for finding minimum of an array
a=np.array([[1,3,-1,4],[3,-2,1,4]])
b=a.min()
print("minimum:",b)
#max for finding maximum of an array
c=a.max()
print("maximum:",c)
#mean
a=np.array([1,2,3,4,5])
d=a.mean()
print("mean:",d)
#median
e=np.median(a)
print("median:",e)
#standard deviation
f=a.std()
print("standard deviation:",f)
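The same statistics can also be computed per row or per column by passing an axis argument; a minimal sketch on an illustrative array g:
#axis-wise statistics on an illustrative array
g=np.array([[1,3,-1,4],[3,-2,1,4]])
print("column minimums:",g.min(axis=0))
print("row maximums:",g.max(axis=1))
print("column means:",g.mean(axis=0))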
OUTPUT:
a) shape: (3, 3) dimensions: 2
b) zeros:
[[0. 0.]
[0. 0.]]
ones:
[[1. 1.]
[1. 1.]]
c) reshape: [[[1 2] [3 4]] [[2 3] [4 5]] [[3 4] [5 6]] [[4 5] [6 7]]]
flatten: [1 2 3 4 2 3 4 5 3 4 5 6 4 5 6 7]
d) vertically:
[[10 20]
 [80 90]
 [30 40]
 [60 70]]
horizontally: [[10 20 30 40]
[80 90 60 70]]
e) indexing [2 3 4 5]
slicing [[1 3]
[2 4]
[3 5]
[4 6]]
f) minimum: -2 maximum: 4
mean: 3.0 median: 3.0
standard deviation: 1.4142135623730951
Experiment 2: Understanding Data
Write a Python program to do the following operations:
Dataset: brain_size.csv
Library: Pandas, matplotlib
a) Loading data from CSV file
b) Compute the basic statistics of given data - shape, no. of columns, mean
c) Splitting a data frame on values of categorical variables
d) Visualize data using Scatter plot
Program:
a) Loading data from CSV file
#loading data from a CSV file
import pandas as pd
a=pd.read_csv("P:/python/newfile.csv")
print(a)
b) Compute the basic statistics of given data - shape, no. of columns, mean
#shape
a=pd.read_csv("P:/python/newfile.csv")
print('shape:',a.shape)
#no of columns
cols=len(a.axes[1])
print('no of columns:',cols)
#mean of data
m=a["marks"].mean()
print('mean of marks:',m)
c) Splitting a data frame on values of categorical variables
#adding data
a['address']=["hyderabad,ts","Warangal,ts","Adilabad,ts","medak,ts"]
#splitting dataframe
a_split=a['address'].str.split(',',n=1)
a['district']=a_split.str.get(0)
a['state']=a_split.str.get(1)
del(a['address'])
d) Visualize data using Scatter plot
#visualize data using scatter plot
import matplotlib.pyplot as plt
a.plot.scatter(x='marks',y='rollno',c='Blue')
plt.show()
Output:
a)
student rollno marks
0 a1 121 98
1 a2 122 82
2 a3 123 92
3 a4 124 78
b)
shape: (4, 3)
no of columns: 3
mean of marks: 87.5
c) before:
student rollno marks address
0 a1 121 98 hyderabad,ts
1 a2 122 82 Warangal,ts
2 a3 123 92 Adilabad,ts
3 a4 124 78 medak,ts
After:
student rollno marks district state
0 a1 121 98 hyderabad ts
1 a2 122 82 Warangal ts
2 a3 123 92 Adilabad ts
3 a4 124 78 medak ts
d) (scatter plot of rollno vs marks)
Experiment 3: Correlation Matrix
Write a python program to load the dataset and understand the input data
Dataset: Pima Indians Diabetes Dataset
https://www.kaggle.com/uciml/pima-indians-diabetes-database#diabetes.csv
Library: Scipy
a) Load data, describe the given data and identify missing, outlier data items
b) Find correlation among all attributes
c) Visualize correlation matrix
Program:
a) Load data, describe the given data and identify missing, outlier data items
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
#Reading the dataset in a dataframe using Pandas
df = pd.read_csv("C:/Users/admin/Documents/diabetes.csv")
#describe the given data
print(df.describe())
#Display first 10 rows of data
print(df.head(10))
#outlier data items
import numpy as np
def outliers_z_score(ys):
    threshold = 3
    mean_y = np.mean(ys)
    stdev_y = np.std(ys)
    z_scores = [(y - mean_y) / stdev_y for y in ys]
    return np.where(np.abs(z_scores) > threshold)
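A minimal usage sketch of the function above, assuming the diabetes dataframe df loaded earlier; the 'Glucose' column is used only for illustration:
#count missing values in each column
print(df.isnull().sum())
#row indices of outliers in one column ('Glucose' is illustrative)
print(outliers_z_score(df['Glucose']))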
b) Find correlation among all attributes
# importing pandas as pd
import pandas as pd
# Making data frame from the csv file
df = pd.read_csv("nba.csv")
# Printing the first 10 rows of the data frame for visualization
df[:10]
# To find the correlation among columns
# using pearson method
df.corr(method ='pearson')
# using kendall method
df.corr(method ='kendall')
c) Visualize correlation matrix
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("C:/Users/admin/Documents/diabetes.csv")
#compute and plot the correlation matrix
corr = df.corr()
plt.matshow(corr)
plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
plt.yticks(range(len(corr.columns)), corr.columns)
plt.colorbar()
plt.show()
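If seaborn is installed (an extra dependency, not listed above), the correlation matrix can also be drawn as an annotated heatmap; a minimal sketch assuming the df loaded above:
import seaborn as sns
import matplotlib.pyplot as plt
#heatmap of the correlation matrix with annotated coefficients
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.show()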
Output:
Experiment 4 - Data Preprocessing – Handling Missing Values
Write a python program to impute missing values with various techniques on given
dataset.
a) Remove rows/ attributes
b) Replace with mean or mode
c) Write a python program to perform transformation of data using Discretization (Binning)
on given dataset.
https://www.kaggle.com/uciml/pima-indians-diabetes-database#diabetes.csv
Library: Scipy
Program:
# load the dataset (the local path is illustrative; the file is assumed to contain missing values)
import pandas as pd
import numpy as np
df = pd.read_csv("C:/Users/admin/Documents/diabetes.csv")
# filling missing values with 0 using fillna()
df.fillna(0)
# filling a missing value with previous value
df.fillna(method ='pad')
#Filling null value with the next ones
df.fillna(method ='bfill')
# filling null values in a categorical column using fillna()
# (the dataframe 'data' and its 'Gender' column are a separate illustrative example)
data["Gender"].fillna("No Gender", inplace = True)
# replace NaN values in the dataframe with the value -99
data.replace(to_replace = np.nan, value = -99)
# Remove rows/ attributes
# using dropna() function to remove rows having at least one NaN
df.dropna()
# using dropna() function to remove rows with all Nan
df.dropna(how = 'all')
# using dropna() function to remove columns having at least one NaN
df.dropna(axis = 1)
# Replace with mean or mode
df['Glucose'] = df['Glucose'].fillna(df['Glucose'].mean())        # numeric column: mean ('Glucose' is illustrative)
data['Gender'] = data['Gender'].fillna(data['Gender'].mode()[0])  # categorical column: mode
# Perform transformation of data using Discretization (Binning)
import numpy as np
import math
from sklearn.datasets import load_iris
from sklearn import datasets, linear_model, metrics
# load iris data set
dataset = load_iris()
a = dataset.data
b = np.zeros(150)
# take the 2nd of the 4 columns of the data set (sepal width)
for i in range(150):
    b[i] = a[i,1]
b = np.sort(b)   #sort the array
# create bins
bin1 = np.zeros((30,5))
bin2 = np.zeros((30,5))
bin3 = np.zeros((30,5))
# Bin mean
for i in range(0,150,5):
    k = int(i/5)
    mean = (b[i] + b[i+1] + b[i+2] + b[i+3] + b[i+4])/5
    for j in range(5):
        bin1[k,j] = mean
print("Bin Mean: \n",bin1)
# Bin boundaries
for i in range(0,150,5):
    k = int(i/5)
    for j in range(5):
        if (b[i+j]-b[i]) < (b[i+4]-b[i+j]):
            bin2[k,j] = b[i]
        else:
            bin2[k,j] = b[i+4]
print("Bin Boundaries: \n",bin2)
# Bin median
for i in range(0,150,5):
    k = int(i/5)
    for j in range(5):
        bin3[k,j] = b[i+2]
print("Bin Median: \n",bin3)
Output:
Bin Mean:
[[2.18 2.18 2.18 2.18 2.18]
 [2.34 2.34 2.34 2.34 2.34]
 [2.48 2.48 2.48 2.48 2.48]
 [2.52 2.52 2.52 2.52 2.52]
 [2.62 2.62 2.62 2.62 2.62]
 ...
Bin Boundaries:
[[2.  2.3 2.3 2.3 2.3]
 [2.3 2.3 2.3 2.4 2.4]
 [2.4 2.5 2.5 2.5 2.5]
 [2.5 2.5 2.5 2.5 2.6]
 [2.6 2.6 2.6 2.6 2.7]
 ...
Bin Median:
[[2.2 2.2 2.2 2.2 2.2]
 [2.3 2.3 2.3 2.3 2.3]
 [2.5 2.5 2.5 2.5 2.5]
 [2.5 2.5 2.5 2.5 2.5]
 [2.6 2.6 2.6 2.6 2.6]
 ...
Experiment 5 - Association Rule Mining – Apriori
Write a python program to generate frequent itemsets using Apriori Algorithm and also
generate association rules for any market basket data.
Program:
pip install mlxtend
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules
data =[['Bread', 'Milk', 'Eggs'],
['Bread','Diapers','Beer','Eggs'],
['Milk','Diapers','Beer','Cola'],
['Bread','Milk','Diapers','Beer','Cola','Eggs']]
te=TransactionEncoder()
te_ary=te.fit(data).transform(data)
df=pd.DataFrame(te_ary,columns=te.columns_)
df
frequent_itemsets=apriori(df,min_support=0.75,use_colnames=True)
print(frequent_itemsets)
rules=association_rules(frequent_itemsets,metric='confidence',min_threshold=0.7)
print(rules)
selected_columns=['antecedents','consequents','antecedent support','consequent support',
'support', 'confidence']
print(rules[selected_columns])
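The rules dataframe can be filtered on any of its metric columns; a minimal sketch that keeps only rules whose lift is greater than 1:
#keep only rules with lift greater than 1 (positively associated itemsets)
strong_rules = rules[rules['lift'] > 1.0]
print(strong_rules[['antecedents','consequents','support','confidence','lift']])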
Apriori using the apyori library:
pip install apyori
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from apyori import apriori
store_data=pd.read_csv("D:/DM/store_data.csv",header=None)
display(store_data.head())
store_data.shape
records= []
for i in range(1,7501):
    records.append([str(store_data.values[i,j]) for j in range(0,20)])
print(type(records))
association_rules=apriori(records,min_support=0.0045,min_confidence=0.2,min_lift=3,
min_length=2)
association_results=list(association_rules)
print("Thereare{}Relation derived.".format(len(association_results)))
for i in range(0, len(association_results)):
    print(association_results[i][0])
for item in association_results:
    pair = item[0]
    items = [x for x in pair]
    print("Rule: " + items[0] + " -> " + items[1])
    print("Support: " + str(item[1]))
    print("Confidence: " + str(item[2][0][2]))
    print("Lift: " + str(item[2][0][3]))
    print("==========================================")
Output:
support itemsets
0 0.75 (Beer)
1 0.75 (Diapers)
2 0.75 (Eggs)
3 0.75 (Beer, Diapers)
  antecedents consequents  antecedent support  consequent support  support  \
0      (Beer)   (Diapers)                0.75                0.75     0.75
1   (Diapers)      (Beer)                0.75                0.75     0.75

   confidence  lift  leverage  conviction  zhangs_metric

  antecedents consequents  antecedent support  consequent support  support  \
0      (Beer)   (Diapers)                0.75                0.75     0.75
1   (Diapers)      (Beer)                0.75                0.75     0.75

   confidence
0         1.0
1         1.0

(7501, 20)
<class 'list'>
There are 48 Relations derived.
frozenset({'chicken', 'light cream'})
frozenset({'escalope', 'mushroom cream sauce'})
frozenset({'escalope', 'pasta'})
frozenset({'ground beef', 'herb & pepper'})
frozenset({'tomato sauce', 'ground beef'})
frozenset({'whole wheat pasta', 'olive oil'})
frozenset({'shrimp', 'pasta'})
frozenset({'nan', 'chicken', 'light cream'})
frozenset({'shrimp', 'frozen vegetables', 'chocolate'})
frozenset({'cooking oil', 'ground beef', 'spaghetti'})
Rule: chicken -> light cream
Support: 0.004533333333333334
Confidence: 0.2905982905982906
Lift: 4.843304843304844
==========================================
Rule: escalope -> mushroom cream sauce
Support: 0.005733333333333333
Confidence: 0.30069930069930073
Lift: 3.7903273197390845
Experiment 6 – Logistic Regression
Write a python program using Logistic Regression on any dataset.
Program:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score
# Sample dataset
# We'll create a dataset about students: Hours studied vs Passed exam (Yes=1, No=0)
# Features (Hours Studied)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
# Labels (0 = Fail, 1 = Pass)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
# Split dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# Create Logistic Regression model
model = LogisticRegression()
# Train the model
model.fit(X_train, y_train)
# Predict on test data
y_pred = model.predict(X_test)
# Evaluation
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nAccuracy Score:", accuracy_score(y_test, y_pred))
# Predicting a custom value (example: will a student studying 5 hours pass?)
custom_prediction = model.predict([[5]])
print("\nPrediction for student studying 5 hours:", "Pass" if custom_prediction[0] == 1 else
"Fail")
Output:
Confusion Matrix:
[[1 0]
[0 2]]
Accuracy Score: 1.0
Prediction for student studying 5 hours: Pass
Experiment 7: Classification – KNN
Write a python program using K-Nearest Neighbors (KNN) algorithm on any dataset.
Program:
# Import libraries
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
# Load the Iris dataset
iris = load_iris()
# Features and target
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create KNN model
knn = KNeighborsClassifier(n_neighbors=3) # Using k=3 neighbors
# Train the model
knn.fit(X_train, y_train)
# Predict on test set
y_pred = knn.predict(X_test)
# Evaluate
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nAccuracy Score:", accuracy_score(y_test, y_pred))
Output:
Confusion Matrix:
[[16 0 0]
[ 0 12 2]
[ 0 0 15]]
Accuracy Score: 0.9555555555555556
Experiment 8 - Classification - Decision Trees
Write a python program using Decision Tree algorithm on any dataset.
Program:
#import libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
#Load iris dataset
iris=load_iris()
X=iris.data
y=iris.target
#Split the data into training and testing sets
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=None)
#Create an instance of the DecisionTreeClassifier class
tree_clf=DecisionTreeClassifier(max_depth=3)
#Fit the model on the training data
tree_clf.fit(X_train,y_train)
#Predict on the testing data
y_pred=tree_clf.predict(X_test)
#Calculate accuracy of the model
accuracy=accuracy_score(y_test,y_pred)
print('Accuracy : ',accuracy)
#Visualize the decision tree using the plot_tree function
plt.figure(figsize=(15,10))
plot_tree(tree_clf,filled=True,feature_names=iris.feature_names,class_names=iris.target_names)
plt.show()
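The learned rules can also be printed as text with export_text; a minimal sketch, assuming the tree fitted above:
#print the learned decision rules as plain text
from sklearn.tree import export_text
print(export_text(tree_clf, feature_names=list(iris.feature_names)))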
Output:
Accuracy: 0.9555555555555556
Experiment 9 - Classification – Bayesian Network
Write a python program using Naïve Bayes Classification algorithm on any dataset.
Program:
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
dataset=pd.read_csv('D:/DM/iris.csv')
print(dataset)
X=dataset.iloc[:,:4].values
Y=dataset['Species'].values
print(Y)
print(X)
#split the dataset into training and test datasets
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.3)
#create an object for the GaussianNB Bayes classifier
classifier=GaussianNB()
classifier.fit(X_train,Y_train)
#predict the values
print(X_test[0])
y_pred=classifier.predict(X_test)
print(y_pred)
accuracy=accuracy_score(Y_test,y_pred)
print("Accuracy: ",accuracy)
#build confusion matrix
from sklearn.metrics import confusion_matrix
cm=confusion_matrix(Y_test,y_pred)
from sklearn.metrics import accuracy_score
print("Accuracy: ",accuracy_score(Y_test,y_pred))
cm
df=pd.DataFrame({'RealValues':Y_test,'PredictedValues':y_pred})
print(df)
Naive Bayes:
#predict the class label for a new observation using Naive Bayes
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
import numpy as np
#import iris dataset
iris=load_iris()
X=iris.data
Y=iris.target
le=LabelEncoder()
Y=le.fit_transform(Y)
#Split the dataset into training and testing sets
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.3,random_state=42)
#Train a Naive bayes model on the training data
nb_model = GaussianNB()
nb_model.fit(X_train, Y_train)
#Make predictions on the test data
y_pred=nb_model.predict(X_test)
y_pred = le.inverse_transform(y_pred)
accuracy=accuracy_score(Y_test,y_pred)
print("Accuracy: ",accuracy)
new_observation = np.array([[5.8, 3.0, 4.5, 1.5]])
predicted_class = nb_model.predict(new_observation)
predicted_class=le.inverse_transform(predicted_class)
print("Predicted class: ", predicted_class)
Output:
      Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm  \
0      1            5.1           3.5            1.4           0.2
1      2            4.9           3.0            1.4           0.2
2      3            4.7           3.2            1.3           0.2
3      4            4.6           3.1            1.5           0.2
4      5            5.0           3.6            1.4           0.2
..   ...            ...           ...            ...           ...
145  146            6.7           3.0            5.2           2.3
146  147            6.3           2.5            5.0           1.9
147  148            6.5           3.0            5.2           2.0
148  149            6.2           3.4            5.4           2.3
149  150            5.9           3.0            5.1           1.8
Species
0 Iris-setosa
1 Iris-setosa
2 Iris-setosa
3 Iris-setosa
4 Iris-setosa
.. ...
145 Iris-virginica
146 Iris-virginica
147 Iris-virginica
148 Iris-virginica
149 Iris-virginica
[150 rows x 6 columns]
['Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
[[ 1. 5.1 3.5 1.4]
[ 2. 4.9 3. 1.4]
[ 3. 4.7 3.2 1.3]
[ 4. 4.6 3.1 1.5]
[ 5. 5. 3.6 1.4]
[ 6. 5.4 3.9 1.7]
[ 7. 4.6 3.4 1.4]
[ 8. 5. 3.4 1.5]
[ 9. 4.4 2.9 1.4]
[ 10. 4.9 3.1 1.5]
[11. 5.4 3.7 1.5]
['Iris-setosa' 'Iris-virginica' 'Iris-versicolor' 'Iris-setosa' 'Iris-versicolor'
 'Iris-virginica' 'Iris-virginica' 'Iris-setosa' 'Iris-virginica' 'Iris-setosa'
 'Iris-virginica' 'Iris-virginica' 'Iris-setosa' 'Iris-versicolor' 'Iris-virginica'
 'Iris-virginica'
Experiment 10: Classification – Support Vector Machines (SVM)
Write a python program using Support Vector Machines (SVM) on any dataset.
Program:
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC # SVC = Support Vector Classifier
from sklearn.metrics import confusion_matrix, accuracy_score
# Load the Iris dataset
iris = datasets.load_iris()
# Features and target
X = iris.data
y = iris.target
# Split into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create SVM model
svm_model = SVC(kernel='linear') # 'linear' kernel
# Train the model
svm_model.fit(X_train, y_train)
# Predict on test set
y_pred = svm_model.predict(X_test)
# Evaluation
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nAccuracy Score:", accuracy_score(y_test, y_pred))
# Predicting a custom input (optional)
sample = [[5.1, 3.5, 1.4, 0.2]] # Example input
predicted_class = svm_model.predict(sample)
print("\nPrediction for sample input:", iris.target_names[predicted_class[0]])
Output:
Confusion Matrix:
[[16 0 0]
[ 0 14 1]
[ 0 0 14]]
Accuracy Score: 0.9777777777777777
Prediction for sample input: setosa