
AD3411 - DATA SCIENCE AND ANALYTICS LABORATORY

Tools: Python, NumPy, SciPy, Matplotlib, Pandas, Statsmodels, Seaborn, Plotly, Bokeh; working with NumPy arrays

LIST OF EXPERIMENTS

1. Working with Pandas data frames
2. Basic plots using Matplotlib
3. Frequency distributions, averages, variability
4. Normal curves, correlation and scatter plots, correlation coefficient
5. Regression
6. Z-test
7. T-test
8. ANOVA
9. Building and validating linear models
10. Building and validating logistic models
11. Time series analysis


EXP 1: WORKING WITH PANDAS DATA FRAMES

Aim:
To work with Pandas data frames.

Algorithm
Step 1: Start

Step 2: Import the Pandas library

Step 3: Create a dictionary of column names and values

Step 4: Load the dictionary into a DataFrame and print the first row

Step 5: Stop

Program:
import pandas as pd

data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}

# load data into a DataFrame object
df = pd.DataFrame(data)
print(df.loc[0])

Output:
calories 420
duration 50
Name: 0, dtype: int64
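Beyond printing a single row, a few other everyday DataFrame operations fit the same aim; a minimal sketch continuing from the df above (the operations chosen are illustrative, not part of the original program):

# first rows of the DataFrame
print(df.head())

# select a single column as a Series
print(df["calories"])

# simple column statistic
print(df["duration"].mean())

# add a derived column (illustrative)
df["calories_per_minute"] = df["calories"] / df["duration"]
print(df)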
EXP 2: BASIC PLOTS USING MATPLOTLIB
Aim:
To create basic plots using Matplotlib.
Algorithm
Step1: import the library

Step2: Plot the points using matplotlib

Step3: Display the output


Step4: Stop

Program:
import matplotlib.pyplot as plt

a = [1, 2, 3, 4, 5]
b = [0, 0.6, 0.2, 15, 10, 8, 16, 21]

plt.plot(a)

# 'o' is for circles and 'r' is for red
plt.plot(b, "or")

plt.plot(list(range(0, 22, 3)))

# naming the x-axis
plt.xlabel('Day ->')
# naming the y-axis
plt.ylabel('Temp ->')

c = [4, 2, 6, 8, 3, 20, 13, 15]
plt.plot(c, label='4th Rep')

# get the current axes
ax = plt.gca()

# get command over the individual boundary lines of the graph body
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)

# set the bounds of the left boundary line to a fixed range
ax.spines['left'].set_bounds(-3, 40)

# set the interval at which the x-axis places its marks
plt.xticks(list(range(-3, 10)))

# set the interval at which the y-axis places its marks
plt.yticks(list(range(-3, 20, 3)))

# the legend denotes what each colour signifies
ax.legend(['1st Rep', '2nd Rep', '3rd Rep', '4th Rep'])

# annotate writes text ON THE GRAPH; xy denotes the position on the graph
plt.annotate('Temperature V / s Days', xy=(1.01, -2.15))

# gives a title to the graph
plt.title('All Features Discussed')
plt.show()

Output:


EXP 3: FREQUENCY DISTRIBUTIONS, AVERAGES, VARIABILITY
Aim:
To compute frequency distributions, averages, and variability.
Algorithm
Step 1: Start

Step 2: Import pandas, NumPy and NLTK

Step 3: List the items as 'f' for fruits and 'v' for vegetables

Step 4: Display the frequency of each item in the list

Step 5: Stop
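The source gives no program for this frequency-distribution algorithm; a minimal sketch along the lines of the steps above, using nltk.FreqDist (the item list itself is an assumption):

import nltk

# 'f' for fruits, 'v' for vegetables (example list; the original list is not shown)
items = ['f', 'v', 'f', 'f', 'v', 'v', 'v', 'f']

# build a frequency distribution and display the count of each item
freq = nltk.FreqDist(items)
for item, count in freq.items():
    print(item, count)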

Program:
# Python program to get average of a list
Output:
105.57142857142857
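The body of the averaging program is not reproduced in the source; a minimal sketch of a list-averaging function (the input list is an assumption and will not reproduce the exact output value above):

# Python program to get average of a list

def cal_average(num):
    # running total of the list elements
    sum_num = 0
    for t in num:
        sum_num = sum_num + t
    # average = sum of elements / number of elements
    avg = sum_num / len(num)
    return avg

# example list (assumed; the original list is not shown in the source)
print("The average is", cal_average([18, 25, 3, 41, 5]))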
Algorithm
Step 1: Start

Step 2: Import NumPy

Step 3: Define a list

Step 4: Print the variance using np.var()

Step 5: Stop

# Python program to get variance of a list

# Importing the NumPy module
import numpy as np

# Taking a list of elements
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Calculating variance using var()
print(np.var(data))
Output:
4.0
Algorithm
Step 1: Start

Step 2: Import NumPy

Step 3: Define a list

Step 4: Print the standard deviation using np.std()

Step 5: Stop
# Python program to get standard deviation of a list

# Importing the NumPy module
import numpy as np

# Taking a list of elements
data = [290, 124, 127, 899]

# Calculating standard deviation using std()
print(np.std(data))
Output:
318.35750344541907
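For comparison, the same quantities can also be computed with Python's built-in statistics module; a small sketch (statistics.pvariance and statistics.pstdev match NumPy's population defaults, while statistics.variance and statistics.stdev give the sample versions):

import statistics

data1 = [2, 4, 4, 4, 5, 5, 7, 9]
data2 = [290, 124, 127, 899]

# population variance and standard deviation (same convention as np.var / np.std)
print(statistics.pvariance(data1))
print(statistics.pstdev(data2))

# sample variance and standard deviation (divide by n - 1 instead of n)
print(statistics.variance(data1))
print(statistics.stdev(data2))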

EXP 4: NORMAL CURVES, CORRELATION AND SCATTER PLOTS,
CORRELATION COEFFICIENT
Aim:
To plot normal curves, correlation and scatter plots, and to compute the correlation coefficient.

Algorithm

Step 1:Import Necessary Libraries

Step 2: Initialize Parameters

Step 3: Generate Random Data

Step 4:Create a Histogram

Step 5: Compute the Probability Density Function

Step 6: Plot the Normal Curve

Step 7:Add Labels and Title

Step 8:Display the Plot

Program:
# Normal curves
import matplotlib.pyplot as plt
import numpy as np

mu, sigma = 0.5, 0.1
s = np.random.normal(mu, sigma, 1000)

# Create the bins and histogram (normed= was removed from Matplotlib; density= is the current argument)
count, bins, ignored = plt.hist(s, 20, density=True)
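Steps 5 to 8 of the algorithm also call for overlaying the probability density function and labelling the plot; a minimal continuation of the program above (the label and title text are assumptions):

# Compute the normal probability density function over the bin positions
pdf = 1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(-((bins - mu) ** 2) / (2 * sigma ** 2))

# Plot the normal curve on top of the histogram
plt.plot(bins, pdf, linewidth=2, color='r')

# Add labels and a title, then display the plot
plt.xlabel('Value')
plt.ylabel('Probability density')
plt.title('Normal curve')
plt.show()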

Output:
Algorithm

Step 1:Import Necessary Libraries

Step 2:Define two pandas Series

Step 3: Calculate the correlation coefficient

Step 4:Print the correlation coefficient

Step 5:Create a scatter plot

Step 6:Label the axes and add a title

Step 7:Show the plot

Program:

# Correlation and scatter plots
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

y = pd.Series([1, 2, 3, 4, 3, 5, 4])
x = pd.Series([1, 2, 3, 4, 5, 6, 7])

# Pearson correlation between the two Series
correlation = y.corr(x)
print(correlation)

Output:

0.8603090020146067
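Steps 5 to 7 of the algorithm also call for a scatter plot; a minimal continuation using the same x and y (the axis labels and title are assumptions):

# Scatter plot of the two Series
plt.scatter(x, y)
plt.xlabel('x')
plt.ylabel('y')
plt.title('Scatter plot of y against x')
plt.show()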
Algorithm

Step 1:Initialize Variables


Step 2:Iterate Through the Data
Step 3:Compute the Pearson Correlation Coefficient
Step 4:Return the Result

Program:

# Correlation coefficient
import math

# function that returns the correlation coefficient
def correlationCoefficient(X, Y, n):
    sum_X = 0
    sum_Y = 0
    sum_XY = 0
    squareSum_X = 0
    squareSum_Y = 0
    i = 0
    while i < n:
        # sum of elements of array X
        sum_X = sum_X + X[i]
        # sum of elements of array Y
        sum_Y = sum_Y + Y[i]
        # sum of X[i] * Y[i]
        sum_XY = sum_XY + X[i] * Y[i]
        # sum of squares of array elements
        squareSum_X = squareSum_X + X[i] * X[i]
        squareSum_Y = squareSum_Y + Y[i] * Y[i]
        i = i + 1
    # use the formula for the correlation coefficient
    corr = (n * sum_XY - sum_X * sum_Y) / math.sqrt(
        (n * squareSum_X - sum_X * sum_X) * (n * squareSum_Y - sum_Y * sum_Y))
    return corr

# Driver code
X = [15, 18, 21, 24, 27]
Y = [25, 25, 27, 31, 32]

# Find the size of the arrays
n = len(X)

# Function call to correlationCoefficient
print('{0:.6f}'.format(correlationCoefficient(X, Y, n)))

Output:
0.953463
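As a quick cross-check (not part of the original program), NumPy's corrcoef gives the same Pearson coefficient:

import numpy as np

# Pearson correlation matrix; the off-diagonal entry is r(X, Y)
print(np.corrcoef(X, Y)[0, 1])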


EXP 5: REGRESSION

Aim:

To write a program to perform simple linear regression.

Algorithm

Step 1: Import Required Libraries

Step 2: Define Function to Estimate Regression Coefficients

Step 3: Define Function to Plot Regression Line

Step 4: Main Execution Function

Step 5: Execute the Program in the Main Block

Program:
import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)
    # mean of x and y vectors
    m_x = np.mean(x)
    m_y = np.mean(y)
    # calculating cross-deviation and deviation about x
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x
    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x
    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plotting the actual points as a scatter plot
    plt.scatter(x, y, color="m", marker="o", s=30)
    # predicted response vector
    y_pred = b[0] + b[1]*x
    # plotting the regression line
    plt.plot(x, y_pred, color="g")
    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')
    # function to show the plot
    plt.show()

def main():
    # observations / data
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {} \
    \nb_1 = {}".format(b[0], b[1]))
    # plotting regression line
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()

Output:

Estimated coefficients:
b_0 = -0.0586206896552
b_1 = 1.45747126437
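As a cross-check (not part of the original program), np.polyfit fits the same least-squares line and should return matching coefficients:

import numpy as np

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

# a degree-1 polynomial fit returns (slope, intercept)
slope, intercept = np.polyfit(x, y, 1)
print("b_0 =", intercept, "b_1 =", slope)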



EXP 6: Z-TEST

Aim:
To perform a Z-test.

Algorithm

Step 1:Import Necessary Libraries

Step 2:Set Parameters

Step 3:Generate or Collect Sample Data:

Step 4:Calculate Sample Statistics:

Step 5:Perform the One-Sample Z-Test

Step 6:Make a Decision

Program:
# imports
import math
import numpy as np
from numpy.random import randn
from statsmodels.stats.weightstats import ztest

# Generate a random array of 50 numbers having mean 110 and sd 15,
# similar to the IQ scores data we assume above
mean_iq = 110
sd_iq = 15 / math.sqrt(50)
alpha = 0.05
null_mean = 100
data = sd_iq * randn(50) + mean_iq

# print mean and sd
print('mean=%.2f stdv=%.2f' % (np.mean(data), np.std(data)))

# now we perform the test: we pass the data, the null-hypothesis mean in the
# value parameter, and an alternative hypothesis that the mean is larger
ztest_Score, p_value = ztest(data, value=null_mean, alternative='larger')

# the function returns a z-score and the corresponding p-value; if the p-value
# is greater than alpha we fail to reject the null hypothesis, else we reject it
if p_value < alpha:
    print("Reject Null Hypothesis")
else:
    print("Fail to Reject Null Hypothesis")

Output:
Reject Null Hypothesis
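For reference, the statistic behind the test follows the standard one-sample z formula, z = (sample mean - null mean) / (s / sqrt(n)); a small sketch reusing the same data, which should closely match the value returned by ztest:

# one-sample z statistic computed manually
n = len(data)
z_manual = (np.mean(data) - null_mean) / (np.std(data, ddof=1) / math.sqrt(n))
print("z =", z_manual)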



EXP 7: T-TEST

Aim:
To perform a T-test.

Algorithm

Step 1:Import Necessary Libraries

Step 2:Define Sample Size

Step 3:Generate Data for Two Independent Groups

Step 4:Calculate Variance for Each Group

Step 5:Compute Pooled Standard Deviation (SD)

Step 6:Calculate t-Statistic (tval)

Step 7:Determine Degrees of Freedom (dof)

Step 8:Compute p-Value (pval)

Step 9:Output Results

Program:
# Importing the required libraries and packages
import numpy as np
from scipy import stats

# Defining two random distributions
# Sample size
N = 10

# Gaussian distributed data with mean = 2 and var = 1
x = np.random.randn(N) + 2
# Gaussian distributed data with mean = 0 and var = 1
y = np.random.randn(N)

# Calculating the variance to get the standard deviation
var_x = x.var(ddof=1)
var_y = y.var(ddof=1)

# Pooled standard deviation
SD = np.sqrt((var_x + var_y) / 2)
print("Standard Deviation =", SD)

# Calculating the t-statistic
tval = (x.mean() - y.mean()) / (SD * np.sqrt(2 / N))

# Degrees of freedom
dof = 2 * N - 2

# p-value from the t-distribution
pval = 1 - stats.t.cdf(tval, df=dof)
print("t = " + str(tval))
print("p = " + str(2 * pval))

# Cross-checking using the built-in function from the SciPy package
tval2, pval2 = stats.ttest_ind(x, y)
print("t = " + str(tval2))
print("p = " + str(pval2))

Output:
Standard Deviation = 0.7642398582227466 t =
4.87688162540348
p = 0.0001212767169695983
t = 4.876881625403479
p = 0.00012127671696957205



Exp 8: ANOVA

Aim:

To write a program to perform ANOVA.

Algorithm

Step 1:Install and Load Necessary Packages:

Step 2:Visualize Data with Boxplot:

Step 3:Set Up Hypotheses:

Step 4:Perform One-Way ANOVA:

Step 5:Interpret Results:

Step 6:Post-Hoc Analysis (if applicable):

Step 7:Check ANOVA Assumptions:

Program:
# Installing the package
install.packages("dplyr")
# Loading the package
library(dplyr)

# Variance in mean within groups and between groups
boxplot(mtcars$disp ~ factor(mtcars$gear),
        xlab = "gear", ylab = "disp")

# Step 1: Set up the Null Hypothesis and Alternate Hypothesis
# H0: mu = mu01 = mu02 (there is no difference between the
#     average displacement for different gears)
# H1: not all means are equal

# Step 2: Calculate the test statistic using the aov function
mtcars_aov <- aov(mtcars$disp ~ factor(mtcars$gear))
summary(mtcars_aov)

# Step 3: Choose the significance level, alpha = 0.05

# Step 4: Compare the p-value of the test statistic with alpha
# and conclude: if p < alpha, reject the Null Hypothesis
Output:
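Since the rest of the manual uses Python, the same one-way ANOVA can also be run with SciPy; a minimal sketch, assuming three illustrative sample groups generated at random (not the mtcars data used above):

import numpy as np
from scipy import stats

# three illustrative groups drawn from normal distributions with different means (assumed data)
rng = np.random.default_rng(0)
group1 = rng.normal(loc=100, scale=15, size=20)
group2 = rng.normal(loc=110, scale=15, size=20)
group3 = rng.normal(loc=120, scale=15, size=20)

# one-way ANOVA: tests whether all group means are equal
f_stat, p_value = stats.f_oneway(group1, group2, group3)
print("F =", f_stat, "p =", p_value)

alpha = 0.05
if p_value < alpha:
    print("Reject Null Hypothesis")
else:
    print("Fail to Reject Null Hypothesis")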

EXP 9: BUILDING AND VALIDATING LINEAR MODELS


Aim:
To build and validate linear models.
Algorithm

Step1: Start

Step2: Import numpy, pandas, seaborn, matplotlib & sklearn

Step3: Calculate linear regression using the appropriate functions

Step4: Display the result

Step 5: Stop

Program
# Importing the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# load_boston ships with older scikit-learn releases (it was removed in scikit-learn 1.2)
from sklearn.datasets import load_boston

sns.set(style="ticks", color_codes=True)
plt.rcParams['figure.figsize'] = (8, 5)
plt.rcParams['figure.dpi'] = 150

# loading the data
boston = load_boston()

You can check those keys with the following code.
print(boston.keys())

The output will be as follows:
dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])

print(boston.DESCR)

You will find these details in the output:


Attribute Information (in order):
- CRIM: per capita crime rate by town
- ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS: proportion of non-retail business acres per town
- CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX: nitric oxides concentration (parts per 10 million)
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built prior to 1940
- DIS: weighted distances to five Boston employment centres
- RAD: index of accessibility to radial highways
- TAX: full-value property-tax rate per $10,000
- PTRATIO: pupil-teacher ratio by town
- B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT: % lower status of the population
- MEDV: median value of owner-occupied homes in $1000's
Missing Attribute Values: None
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df.head()

# print the columns present in the dataset
print(df.columns)
# print the top 5 rows in the dataset
print(df.head())

First five records from the data set

# plotting a heatmap for the overall data set
sns.heatmap(df.corr(), square=True, cmap='RdYlGn')

Heat map of overall data set


So let's plot a regression plot to see the correlation between RM and MEDV.

# add the target (MEDV) to the DataFrame so it can be plotted against RM
df['MEDV'] = boston.target
sns.lmplot(x='RM', y='MEDV', data=df)

Regression plot with RM and MEDV
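The aim also calls for validating the model; a minimal sketch continuing from the df above, fitting and scoring a linear model with scikit-learn (the feature choice, split ratio, and metrics are assumptions):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

# features and target (MEDV was added to df above)
X = df[['RM', 'LSTAT']]
y = df['MEDV']

# hold out part of the data for validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# fit the linear model on the training split
model = LinearRegression()
model.fit(X_train, y_train)

# validate on the held-out split
y_pred = model.predict(X_test)
print("R^2 on test data:", r2_score(y_test, y_pred))
print("RMSE on test data:", mean_squared_error(y_test, y_pred) ** 0.5)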


EXP 10: BUILDING AND VALIDATING LOGISTIC MODELS

Aim:
To build and validate logistic regression models.

Algorithm:

Step 1:Import Necessary Libraries

Step 2:Load the Dataset

Step 3:Define Independent and Dependent Variables

Step 4:Add a Constant Term

Step 5:Fit the Logistic Regression Model

Step 6:Review Model Summary

Program

Building the Logistic Regression model:


# importing libraries
import statsmodels.api as sm
import pandas as pd

# loading the training dataset
df = pd.read_csv('logit_train1.csv', index_col=0)

# defining the dependent and independent variables
Xtrain = df[['gmat', 'gpa', 'work_experience']]
ytrain = df[['admitted']]

# building the model and fitting the data
log_reg = sm.Logit(ytrain, Xtrain).fit()

Output :
Optimization terminated successfully.
Current function value: 0.352707
Iterations 8
Program
# printing the summary table
print(log_reg.summary())

Output :
                         Logit Regression Results
==============================================================================
Dep. Variable:               admitted   No. Observations:                  30
Model:                          Logit   Df Residuals:                      27
Method:                           MLE   Df Model:                           2
Date:                Wed, 15 Jul 2020   Pseudo R-squ.:                 0.4912
Time:                        16:09:17   Log-Likelihood:               -10.581
converged:                       True   LL-Null:                      -20.794
Covariance Type:            nonrobust   LLR p-value:                3.668e-05
===================================================================================
                      coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------
gmat               -0.0262      0.011     -2.383      0.017      -0.048      -0.005
gpa                 3.9422      1.964      2.007      0.045       0.092       7.792
work_experience     1.1983      0.482      2.487      0.013       0.254       2.143
===================================================================================

Algorithm:
Step 1:Import Necessary Libraries:
Step 2:Load the Testing Dataset:
Step 3:Define Independent and Dependent Variables:
Step 4:Add a Constant to the Independent Variables:
Step 5:Load the Pre-trained Logistic Regression Model:
Step 6:Make Predictions on the Test Dataset:
Step 7:Convert Probabilities to Binary Predictions:
Step 8:Compare Actual and Predicted Values:

Program

Predicting on New Data :

# loading the testing dataset
df = pd.read_csv('logit_test1.csv', index_col=0)

# defining the dependent and independent variables
Xtest = df[['gmat', 'gpa', 'work_experience']]
ytest = df['admitted']

# performing predictions on the test dataset
yhat = log_reg.predict(Xtest)
prediction = list(map(round, yhat))

# comparing original and predicted values of y
print('Actual values', list(ytest.values))
print('Predictions :', prediction)

Output :
Optimization terminated successfully.
Current function value: 0.352707
Iterations 8
Actual values [0, 0, 0, 0, 0, 1, 1, 0, 1, 1]
Predictions : [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
Algorithm:

Step 1:Import Necessary Libraries

Step 2:Compute the Confusion Matrix

Step 3:Display the Confusion Matrix

Step 4:Calculate the Accuracy Score

Step 5:Display the Accuracy Score

Program:

Testing the accuracy of the model :

from sklearn.metrics import (confusion_matrix, accuracy_score)


# confusion matrix
cm = confusion_matrix(ytest, prediction)
print("Confusion Matrix : \n", cm)

# accuracy score of the model
print('Test accuracy = ', accuracy_score(ytest, prediction))

Output :
Confusion Matrix :
[[6 0]
[2 2]]
Test accuracy = 0.8



EXP 11: TIME SERIES ANALYSIS

Aim:
To perform time series analysis.

Algorithm:
Step1: Start
Step2: Import numpy,pandas, matplotlib&seaborn
Step3: draw the plot
Step4: display the plot
Step 5: Stop
Program

We are using Superstore sales data.


import warnings
import itertools
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import pandas as pd
import statsmodels.api as sm

warnings.filterwarnings("ignore")
plt.style.use('fivethirtyeight')
matplotlib.rcParams['axes.labelsize'] = 14
matplotlib.rcParams['xtick.labelsize'] = 12
matplotlib.rcParams['ytick.labelsize'] = 12
matplotlib.rcParams['text.color'] = 'k'

We start from time series analysis and forecasting for furniture sales.

df = pd.read_excel("Superstore.xls")
furniture = df.loc[df['Category'] == 'Furniture']

A good four years of furniture sales data.

furniture['Order Date'].min(), furniture['Order Date'].max()
(Timestamp('2014-01-06 00:00:00'), Timestamp('2017-12-30 00:00:00'))

Data Preprocessing
This step includes removing columns we do not need, checking missing values, aggregating sales by date, and so on.

cols = ['Row ID', 'Order ID', 'Ship Date', 'Ship Mode', 'Customer ID',
        'Customer Name', 'Segment', 'Country', 'City', 'State', 'Postal Code',
        'Region', 'Product ID', 'Category', 'Sub-Category', 'Product Name',
        'Quantity', 'Discount', 'Profit']
furniture.drop(cols, axis=1, inplace=True)
furniture = furniture.sort_values('Order Date')
furniture.isnull().sum()
furniture = furniture.groupby('Order Date')['Sales'].sum().reset_index()

Figure 1

Order Date    0
Sales         0
dtype: int64

Indexing with Time Series Data

furniture = furniture.set_index('Order Date')
furniture.index

Figure 2

We will use the average daily sales value for each month instead, and we are using the start of each month as the timestamp.

y = furniture['Sales'].resample('MS').mean()

Have a quick peek at the 2017 furniture sales data.

y['2017':]

Figure 3

Visualizing Furniture Sales Time Series Data


y.plot(figsize=(15, 6))
plt.show()
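A common next step in time series analysis is decomposing the series into trend, seasonal, and residual components; a minimal sketch continuing from the monthly series y above (the additive model choice is an assumption):

# decompose the monthly sales series into trend, seasonal, and residual parts
decomposition = sm.tsa.seasonal_decompose(y, model='additive')
fig = decomposition.plot()
plt.show()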


