0% found this document useful (0 votes)
20 views27 pages

Sowmi DS

The document outlines a series of experiments focused on data analysis techniques using Excel and Python libraries such as NumPy and Pandas. It covers descriptive statistics, histogram analysis, basic operations on NumPy arrays, DataFrame manipulations, handling missing data, and preprocessing techniques for machine learning datasets. Each experiment includes step-by-step instructions and sample code for performing various data analysis tasks.

Uploaded by

23b01a1253
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
20 views27 pages

Sowmi DS

The document outlines a series of experiments focused on data analysis techniques using Excel and Python libraries such as NumPy and Pandas. It covers descriptive statistics, histogram analysis, basic operations on NumPy arrays, DataFrame manipulations, handling missing data, and preprocessing techniques for machine learning datasets. Each experiment includes step-by-step instructions and sample code for performing various data analysis tasks.

Uploaded by

23b01a1253
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 27

Experiment 1:

Aim: -
Perform Descriptive statistics of given dataset using Data Analysis Toolbox of Excel
Performing Descriptive Statistics in Excel: Short Note
1.Open Dataset: Launch Excel and open your data file.
2.Enable Toolpak: Go to File > Options > Add-ins, select Excel Add-ins, click Go, check Analysis ToolPak, and click
OK.
3.Access Toolpak: Go to the Data tab and click Data Analysis.
4.Select Tool: Choose Descriptive Statistics and click OK.
5.Input Range: Enter your data range, specify if data is in Columns or Rows, and check Labels in First Row if
applicable.
6.Output Options: Select Output Range or New Worksheet Ply, and check Summary statistics.
7.Run Analysis: Click OK to generate the descriptive statistics.

Sample Input:

Output:

1|Page 22B01A1251SVECW IT_A-III YEAR-I SEM Data Science Lab


Experiment 2:
AIM: -
Perform the Histogram Analysis of given dataset using Data Analysis Toolbox of Excel.
Steps to Perform Histogram Analysis in Excel:
1. Enter the Data:
 Manually input the above data into an Excel worksheet.
2. Enable Data Analysis ToolPak:
 Go to File > Options > Add-ins.
 In the Manage box, select Excel Add-ins and click Go.
 Check the Analysis ToolPak box, and click OK.
3. Perform Histogram Analysis:
 Go to the Data tab in Excel.
 Click on Data Analysis in the Analysis group.
 Choose Histogram from the list and click OK.
 For the Input Range
 Select the Output Range where you want the results to appear.
 Check the box for Chart Output to create a histogram chart.
Click OK to generate the histogram.

OUTPUT: -

2|Page 22B01A1251SVECW IT_A-III YEAR-I SEM Data Science Lab


Experiment 3: -
AIM: - Perform the following operations
a) Basic Operations on NumPy

b) Computations on numpy’s Arrays

c) DataFrames in Pandas

d) Hierarchical Indexing

e) Vectorized String Operations


Program:
Basic Operations on NumPy

Step1:
Import necessary libraries
import numpy as np

Step2:
Create an array
a=np.array([[1207,2,3],[4,5,6]])
a

OUTPUT:

Step3:
Performing operations on numpy array

Finding the data type of elements


dtype_a = a.dtype
print(dtype_a)

OUTPUT:

Finding the shape of the numpy array


np.shape(a)
OUTPUT:

Finding the size of the numpy array


np.size(a)

3|Page 22B01A1251SVECW IT_A-III YEAR-I SEM Data Science Lab


OUTPUT:

b=np.asarray(a)
b

OUTPUT: -

c=np.copy(b)
c

OUTPUT: -

a.ravel()

OUTPUT: -

Arithmetic Operations on nd array


Addition of two nd arrays
a+a

OUTPUT:

Multiplication of two arrays:


a*a

OUTPUT:

Subtraction of Two arrays:

a-a
OUTPUT:

Division of Two arrays:


a/a
OUTPUT:

4|Page 22B01A1251SVECW IT_A-III YEAR-I SEM Data Science Lab


Modulus of Two arrays:
a%a

OUTPUT:

Aggregate Functions
import numpy as np
a=np.array([[1207,2,3], [4,5,6]])
print(a)

OUTPUT:

Minimum value from the given array


a.min()

OUTPUT: -

Maximum value from the given array


a.max()
OUTPUT: -

Sum of the given array elements

a.sum()
OUTPUT: -

Mean of the given array

a.mean()
OUTPUT: -

b) Computations on numpy’s Arrays


Universal Functions
Creating array
a=np. array([[1207,2,3], [4,5,6]])
a

5|Page 22B01A1251SVECW IT_A-III YEAR-I SEM Data Science Lab


OUTPUT: -

np.sqrt(a)

OUTPUT: -

b=np.round(2.6)
b

OUTPUT: -

b=np.abs(2.6)
b

OUTPUT: -

b=np.square(2)
b

OUTPUT: -

c=np.isfinite(-1)
c

OUTPUT: -

c=np.isnan(0)
c

OUTPUT: -

c=np.ceil(5.8)
c

OUTPUT: -

c=np.log(4)
c

6|Page 22B01A1251SVECW IT_A-III YEAR-I SEM Data Science Lab


OUTPUT: -

c=np.sum([0.5, 1.5])
c

OUTPUT: -

d=np.cos(1)
d

OUTPUT: -

e=np.floor(6.8)
e

OUTPUT: -

f=np.sin(8)
f

OUTPUT: -

c) Basic operations on Dataframes using pandas


Creating a Series
import pandas as pd
data_dict = {
'Name': ['Sowji', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 40],
'Score': [85, 88, 90, 92]
}
df = pd.DataFrame(data_dict)
print(df)

OUTPUT: -

print(df.head())

7|Page 22B01A1251 SVECW IT_A-III YEAR-I SEM Data Science Lab


OUTPUT: -

print(df.info())
OUTPUT: -

print(df['Name'])

OUTPUT: -

filtered_df = df[df['Age'] > 30]


print(filtered_df)

OUTPUT: -

df['Grade'] = ['A', 'B', 'A', 'A']


print(df)

OUTPUT: -

df = df.drop('Grade', axis=1)
print(df)
OUTPUT: -

sorted_df = df.sort_values(by='Age', ascending=False)


print(sorted_df)

8|Page 22B01A1251SVECW IT_A-III YEAR-I SEM Data Science Lab


OUTPUT: -

Hierarchical Indexing
import pandas as pd
arrays = [
['Sowji', 'Sowji', 'B', 'B'],
['one', 'two', 'one', 'two']
]
index = pd.MultiIndex.from_arrays(arrays, names=('Upper', 'Lower'))
df = pd.DataFrame({'Values': [10, 20, 30, 40]}, index=index)
print(df)

OUTPUT: -

print(df.loc['Sowji', 'one'])

OUTPUT: -

df_sorted = df.sort_index()
print(df_sorted)

OUTPUT: -

df_swapped = df.swaplevel()
print(df_swapped)

OUTPUT: -

df_reset = df.reset_index()
print(df_reset)
OUTPUT: -

9|Page 22B01A1251SVECW IT_A-III YEAR-I SEM Data Science Lab


e) Vectorized String Operations
import pandas as pd
data = pd.DataFrame({'Names': ['Sowji', 'Bob', 'Charlie', 'David']})
data['Names_Lower'] = data['Names'].str.lower()
data['Names_Upper'] = data['Names'].str.upper()
print(data)
OUTPUT: -

data['Contains_al'] = data['Names'].str.contains('al', case=False)


print(data)

OUTPUT: -

data['Names_Replaced'] = data['Names'].str.replace('a', 'X', case=False)


print(data)

OUTPUT: -

data['Name_Length'] = data['Names'].str.len()
print(data)
OUTPUT: -

10 | P a g e 22B01A1251SVECW IT_A-III YEAR-I SEM Data Science Lab


Experiment 4: -
AIM: - Write an application to find and Removing Duplicate Records in the given dataset
PROGRAM: -
Step1: - Create a dataset
import pandas as pd
import numpy as np
data = {
'ID': [1207, 2, 3, 1, 1207],
'Name': ['SOWJI', 'Alice', 'Bob', 'John', 'SOWJI'],
'Age': [25, 30, 35,25, 25]
}
df = pd.DataFrame(data)
print(df)

OUTPUT: -

Step2: - Identifying duplicates


duplicate_mask = df. duplicated(data)
duplicates = df[duplicate_mask]
print(duplicates)

OUTPUT: -

Step3: - Removing duplicates from dataset Using drop_duplicates METHOD


data = df. drop_duplicates ()

print(data)

OUTPUT: -

Removing duplicates Specific record


df_no_duplicates_last = df. drop_duplicates(keep='last')

print(df_no_duplicates_last)

OUTPUT: -

11 | P a g e 22B01A1251 SVECW IT_A-III YEAR-I SEM Data Science Lab


12 | P a g e 22B01A1251SVECW IT_A-III YEAR-I SEM Data Science Lab
Experiment 5: -
AIM: - Write an application to handle the Missing Data in the given dataset
Step1: -
Create a dataset and place missing value with np.nan
import pandas as pd
import numpy as np
data = {
'ID': [1207, 2, 3, 4, np.nan],
'Name': ['Sowji', np.nan, 'Sowji', 'Charlie', 'Alice'],
'Age': [25, 30, np.nan, 35, 28],
'Salary': [50000, 54000, 58000, np.nan, 52000]
}
df = pd.DataFrame(data)
print(df)

OUTPUT: -

Step2: -
Identifying missing data
data = df.isnull()
print(data)

OUTPUT: -

Step3: -
Drop rows with any missing data
df_dropna = df.dropna()
print(df_dropna)

OUTPUT: -

13 | P a g e 22B01A1251SVECW IT_A-III YEAR-I SEM Data Science Lab


Step4: -
Fill missing data with specific values
df_filled = df.fillna({
'ID': 0,
'Name': 'Unknown',
'Age': df['Age'].mean(),
'Salary': df['Salary'].median()
})
print(df_filled)

OUTPUT: -

Fill missing data with forward fill method


df_ffill = df.fillna(method='ffill')
print(df_ffill)

OUTPUT: -

Fill missing data with backward fill method


df_bffill = df.fillna(method='bfill')
print(df_bffill)

OUTPUT: -

14 | P a g e 22B01A1251SVECW IT_A-III YEAR-I SEM Data Science Lab


Experiment 6: -
AIM: -
Write an application to work with pivot table and cross tabulation in Data Frame .

Step1: -
Create a dataset
import pandas as pd
import numpy as np
data = {
'Employee': ['SOWJI', 'Alice', 'Bob', 'Charlie',],
'Department': ['HR', 'Finance', 'IT', 'HR'],
'Age': [25, 30, 35, 40],
'Salary': [50000, 60000, 70000, 80000],
'Bonus': [5000, 7000, 6000, 8000,]
}
df = pd.DataFrame(data)
print(df)

OUTPUT: -

Step2: - Create a pivot table


pivot_table = pd.pivot_table(df,
values= ['Salary', 'Bonus'],
index=['Department'],
aggfunc= {'Salary': np.mean, 'Bonus': np.sum})
print(pivot_table)

OUTPUT: -

Cross tabulation: -
cross_tab = pd.crosstab(df['Department'], df['Employee'])

15 | P a g e 22B01A1251 SVECW IT_A-III YEAR-I SEM Data Science Lab


print("\nCross Tabulation:")
print(cross_tab)

OUTPUT: -

16 | P a g e 22B01A1251SVECW IT_A-III YEAR-I SEM Data Science Lab


Experiment-7
AIM: - Perform following preprocessing techniques on loan prediction dataset a. Feature Scaling b.
Feature Standardization c. Label Encoding d. One Hot Encoding
# Import necessary libraries
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder, OneHotEncoder
# Load the Iris dataset from seaborn
iris = sns.load_dataset('iris')
# Display the first few rows of the dataset
print(iris.head())
# Feature Scaling (Min-Max Scaling)
# Scale all the numerical columns (sepal_length, sepal_width, petal_length, petal_width)
scaler = MinMaxScaler()
iris[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']] = scaler.fit_transform(
iris[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
)
# Display the scaled dataset
print("\nAfter Feature Scaling (Min-Max Scaling):")
print(iris.head())

# A. Feature Standardization
# Standardize all the numerical columns
std_scaler = StandardScaler()
iris[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']] = std_scaler.fit_transform(
iris[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
)
# Display the standardized dataset
print("\nAfter Feature Standardization:")
print(iris.head())

# B. Label Encoding (for 'species' column)


label_encoder = LabelEncoder()
iris['species'] = label_encoder.fit_transform(iris['species'])
# Display the label encoded dataset
print("\nAfter Label Encoding (Species):")
print(iris.head())
# One-Hot Encoding (for 'species' column in the original dataset before Label Encoding)
17 | P a g e 22B01A1251SVECW IT_A-III YEAR-I SEM Data Science Lab
# Reload the original iris dataset for One-Hot Encoding
iris_original = sns.load_dataset('iris')
# C. Apply One-Hot Encoding to the 'species' column
iris_onehot = pd.get_dummies(iris_original, columns=['species'], drop_first=True)
# Display the one-hot encoded dataset
print("\nAfter One-Hot Encoding (Species):")
print(iris_onehot.head())
OUTPUT:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

After Feature Scaling (Min-Max Scaling):


sepal_length sepal_width petal_length petal_width species
0 0.222222 0.625000 0.067797 0.041667 setosa
1 0.166667 0.416667 0.067797 0.041667 setosa
2 0.111111 0.500000 0.050847 0.041667 setosa
3 0.083333 0.458333 0.084746 0.041667 setosa
4 0.194444 0.666667 0.067797 0.041667 setosa

After Feature Standardization:


sepal_length sepal_width petal_length petal_width species
0 -0.900681 1.019004 -1.340227 -1.315444 setosa
1 -1.143017 -0.131979 -1.340227 -1.315444 setosa
2 -1.385353 0.328414 -1.397064 -1.315444 setosa
3 -1.506521 0.098217 -1.283389 -1.315444 setosa
4 -1.021849 1.249201 -1.340227 -1.315444 setosa

After Label Encoding (Species):


sepal_length sepal_width petal_length petal_width species
0 -0.900681 1.019004 -1.340227 -1.315444 0
1 -1.143017 -0.131979 -1.340227 -1.315444 0
2 -1.385353 0.328414 -1.397064 -1.315444 0
3 -1.506521 0.098217 -1.283389 -1.315444 0

18 | P a g e 22B01A1251SVECW IT_A-III YEAR-I SEM Data Science Lab


4 -1.021849 1.249201 -1.340227 -1.315444 0

After One-Hot Encoding (Species):


sepal_length sepal_width petal_length petal_width species_versicolor \
0 5.1 3.5 1.4 0.2 0
1 4.9 3.0 1.4 0.2 0
2 4.7 3.2 1.3 0.2 0
3 4.6 3.1 1.5 0.2 0
4 5.0 3.6 1.4 0.2 0

species_virginica
0 0
1 0
2 0
3 0
4 0

19 | P a g e 22B01A1251 SVECW IT_A-III YEAR-I SEM Data Science Lab


Experiment-8
AIM: - Write an application to perform Data Visualization with Python using matplotlib
PROGRAM: -
import matplotlib.pyplot as plt
import pandas as pd
data = pd.read_csv('/iris.data.csv',header=None)
print(data)

OUTPUT: -

Scatter Plot
data.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
plt.scatter(data['sepal_length'], data['sepal_width'], c='blue')
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.title('Sepal Length vs Sepal Width')
plt.show()

OUTPUT: -

LINE CHART
20 | P a g e 22B01A125 1SVECW IT_A-III YEAR-I SEM Data Science Lab
import matplotlib.pyplot as plt
import pandas as pd
data = pd.read_csv('/iris.data.csv', header=None)
data.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
plt.plot(data.index, data['sepal_length'], label='Sepal Length', color='blue')
plt.plot(data.index, data['sepal_width'], label='Sepal Width', color='green')
plt.xlabel('Index')
plt.ylabel('Value')
plt.title('Sepal Length and Width over Index')
plt.legend()
plt.show()

OUTPUT: -

Histogram: -
import matplotlib.pyplot as plt
import pandas as pd
data = pd.read_csv('/iris.data.csv', header=None)
data.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
plt.hist(data['sepal_length'], bins=10, color='skyblue', edgecolor='black')
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Frequency')
plt.title('Histogram of Sepal Length')
plt.show()
plt.hist(data['sepal_width'], bins=10, color='salmon', edgecolor='black')
plt.xlabel('Sepal Width (cm)')
plt.ylabel('Frequency')
plt.title('Histogram of Sepal Width')
plt.show()
21 | P a g e 22B01A1251SVECW IT_A-III YEAR-I SEM Data Science Lab
OUTPUT: -

22 | P a g e 22B01A1251 SVECW IT_A-III YEAR-I SEM Data Science Lab


Experiment 9: -
AIM: -Apply and explore various plotting functions from seaborn library
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
data = pd.read_csv('/Housing.csv')
print(data)
OUTPUT: -

Scatterplot
sns.scatterplot(x='area', y='price', data=data)
plt.title('Price vs. Area')
plt.show()

OUTPUT: -

Histplot for Price


sns.histplot(data['price'], kde=True)
plt.title('Distribution of Price')
plt.show()

OUTPUT: -

23 | P a g e 22B01A125 1SVECW IT_A-III YEAR-I SEM Data Science Lab


Histplot for Area
sns.histplot(data['area'], kde=True)
plt.title('Distribution of area')
plt.show()

OUTPUT: -

24 | P a g e 22B01A1251SVECW IT_A-III YEAR-I SEM Data Science Lab


Experiment-10
AIM: - Write an application to apply regression model on the given data set
# Importing necessary libraries

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns


from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
# Load the dataset
# Assume 'house_data.csv' contains columns like 'area', 'bedrooms', 'bathrooms', 'location', 'price'
df = pd.read_csv('Housing.csv')
# Display the first few rows
print(df.head())
# Preprocessing
# Check for missing values
print(df.isnull().sum())
# Handling missing values (filling missing values with the median for simplicity)
df.fillna(df.median(), inplace=True)
# Converting categorical data into numeric using One-Hot Encoding
df = pd.get_dummies(df, drop_first=True)
# Split data into features (X) and target variable (y)
X = df.drop('price', axis=1) # Features
y = df['price'] # Target
# Feature scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
# Train the regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print(f"Root Mean Squared Error: {rmse}")
print(f"R-squared: {r2}")
# Visualizing the relationship between actual and predicted prices
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, alpha=0.7)
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.title('Actual vs Predicted House Prices')
plt.show()
OUTPUT: -

25 | P a g e 22B01A1251SVECW IT_A-III YEAR-I SEM Data Science Lab


10 b. Write an application to apply regression model for iris data set
# Importing necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
# Load the Iris dataset from seaborn
iris = sns.load_dataset('iris')
# Display the first few rows of the dataset
print(iris.head())
# We'll predict 'petal_width' based on the other features (excluding 'species')
X = iris.drop(['petal_width', 'species'], axis=1) # Features: sepal_length, sepal_width, petal_length
y = iris['petal_width'] # Target: petal_width
# Feature scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
# Train the regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = model.predict(X_test)
# Evaluate the model

26 | P a g e 22B01A1251 SVECW IT_A-III YEAR-I SEM Data Science Lab


mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print(f"Root Mean Squared Error: {rmse}")
print(f"R-squared: {r2}")
# Visualizing the relationship between actual and predicted petal width
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, alpha=0.7, color='b')
plt.xlabel('Actual Petal Width')
plt.ylabel('Predicted Petal Width')
plt.title('Actual vs Predicted Petal Width')
plt.show()
# Plotting feature importance (coefficients) for linear regression
plt.figure(figsize=(8, 6))
coefficients = model.coef_
features = ['sepal_length', 'sepal_width', 'petal_length']
sns.barplot(x=features, y=coefficients)
plt.title('Feature Importance (Linear Regression Coefficients)')
plt.ylabel('Coefficient Value')
plt.show()

OUTPUT: -

27 | P a g e 22B01A1251 SVECW IT_A-III YEAR-I SEM Data Science Lab

You might also like