Experiment 1:
AIM: -
Perform descriptive statistics on the given dataset using the Data Analysis ToolPak in Excel.
Performing Descriptive Statistics in Excel: Short Note
1. Open Dataset: Launch Excel and open your data file.
2. Enable ToolPak: Go to File > Options > Add-ins, select Excel Add-ins, click Go, check Analysis ToolPak, and click OK.
3. Access ToolPak: Go to the Data tab and click Data Analysis.
4. Select Tool: Choose Descriptive Statistics and click OK.
5. Input Range: Enter your data range, specify whether the data is in Columns or Rows, and check Labels in First Row if applicable.
6. Output Options: Select Output Range or New Worksheet Ply, and check Summary statistics.
7. Run Analysis: Click OK to generate the descriptive statistics.
Sample Input:
Output:
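For cross-checking the ToolPak output, the same summary statistics can be computed in Python with pandas. This is a minimal sketch, assuming the data has been exported to a CSV file; the filename data.csv and the column name Value are placeholders:
import pandas as pd
df = pd.read_csv('data.csv')  # hypothetical exported dataset
print(df['Value'].describe())  # count, mean, std, min, quartiles, max
print('Variance:', df['Value'].var())  # additional statistics reported by the ToolPak
print('Skewness:', df['Value'].skew())
print('Kurtosis:', df['Value'].kurt())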
Experiment 2:
AIM: -
Perform Histogram Analysis of the given dataset using the Data Analysis ToolPak in Excel.
Steps to Perform Histogram Analysis in Excel:
1. Enter the Data:
Manually input your data into an Excel worksheet.
2. Enable Data Analysis ToolPak:
Go to File > Options > Add-ins.
In the Manage box, select Excel Add-ins and click Go.
Check the Analysis ToolPak box, and click OK.
3. Perform Histogram Analysis:
Go to the Data tab in Excel.
Click on Data Analysis in the Analysis group.
Choose Histogram from the list and click OK.
For the Input Range, select the cells containing your data (and optionally set the Bin Range to cells containing your bin boundaries).
Select the Output Range where you want the results to appear.
Check the box for Chart Output to create a histogram chart.
Click OK to generate the histogram.
OUTPUT: -
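The same histogram can be reproduced programmatically. A minimal matplotlib sketch, assuming the values were exported to a CSV file; data.csv and the column name Value are placeholders:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')  # hypothetical exported dataset
plt.hist(df['Value'], bins=10, edgecolor='black')  # 10 equal-width bins
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Histogram of Value')
plt.show()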
Experiment 3: -
AIM: - Perform the following operations
a) Basic Operations on NumPy
b) Computations on NumPy Arrays
c) DataFrames in Pandas
d) Hierarchical Indexing
e) Vectorized String Operations
Program:
Basic Operations on NumPy
Step1:
Import necessary libraries
import numpy as np
Step2:
Create an array
a=np.array([[1207,2,3],[4,5,6]])
a
OUTPUT:
Step3:
Performing operations on the NumPy array
Finding the data type of the elements
dtype_a = a.dtype
print(dtype_a)
OUTPUT:
Finding the shape of the numpy array
np.shape(a)
OUTPUT:
Finding the size of the numpy array
np.size(a)
OUTPUT:
Converting to an array with np.asarray (returns the same data without copying)
b=np.asarray(a)
b
OUTPUT: -
Copying the array
c=np.copy(b)
c
OUTPUT: -
Flattening the array
a.ravel()
OUTPUT: -
Arithmetic Operations on ndarrays
Addition of two ndarrays
a+a
OUTPUT:
Multiplication of two arrays:
a*a
OUTPUT:
Subtraction of Two arrays:
a-a
OUTPUT:
Division of Two arrays:
a/a
OUTPUT:
Modulus of Two arrays:
a%a
OUTPUT:
Aggregate Functions
import numpy as np
a=np.array([[1207,2,3], [4,5,6]])
print(a)
OUTPUT:
Minimum value from the given array
a.min()
OUTPUT: -
Maximum value from the given array
a.max()
OUTPUT: -
Sum of the given array elements
a.sum()
OUTPUT: -
Mean of the given array
a.mean()
OUTPUT: -
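Aggregates can also be computed along a single axis instead of over the whole array; a short sketch using the same array a:
print(a.min(axis=0))  # column-wise minimum -> [4 2 3]
print(a.max(axis=1))  # row-wise maximum -> [1207 6]
print(a.sum(axis=0))  # column-wise sum -> [1211 7 9]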
b) Computations on NumPy Arrays
Universal Functions
Creating array
a=np.array([[1207,2,3], [4,5,6]])
a
OUTPUT: -
np.sqrt(a)
OUTPUT: -
b=np.round(2.6)
b
OUTPUT: -
b=np.abs(2.6)
b
OUTPUT: -
b=np.square(2)
b
OUTPUT: -
c=np.isfinite(-1)
c
OUTPUT: -
c=np.isnan(0)
c
OUTPUT: -
c=np.ceil(5.8)
c
OUTPUT: -
c=np.log(4)
c
OUTPUT: -
c=np.sum([0.5, 1.5])
c
OUTPUT: -
d=np.cos(1)
d
OUTPUT: -
e=np.floor(6.8)
e
OUTPUT: -
f=np.sin(8)
f
OUTPUT: -
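The universal functions above were applied to scalars, but they operate elementwise when given a whole array; a short sketch using a small array of floats:
b = np.array([-1.5, 2.6, -3.2])
print(np.abs(b))  # elementwise absolute value -> [1.5 2.6 3.2]
print(np.ceil(b))  # elementwise ceiling -> [-1.  3. -3.]
print(np.floor(b))  # elementwise floor -> [-2.  2. -4.]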
c) Basic Operations on DataFrames using Pandas
Creating a DataFrame
import pandas as pd
data_dict = {
'Name': ['Sowji', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 40],
'Score': [85, 88, 90, 92]
}
df = pd.DataFrame(data_dict)
print(df)
OUTPUT: -
print(df.head())
OUTPUT: -
df.info()
OUTPUT: -
print(df['Name'])
OUTPUT: -
filtered_df = df[df['Age'] > 30]
print(filtered_df)
OUTPUT: -
df['Grade'] = ['A', 'B', 'A', 'A']
print(df)
OUTPUT: -
df = df.drop('Grade', axis=1)
print(df)
OUTPUT: -
sorted_df = df.sort_values(by='Age', ascending=False)
print(sorted_df)
OUTPUT: -
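Column-wise summaries are also available directly on the DataFrame; a short sketch on the same df:
print(df[['Age', 'Score']].mean())  # mean of each numeric column
print(df.describe())  # count, mean, std, and quartiles for the numeric columns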
d) Hierarchical Indexing
import pandas as pd
arrays = [
['Sowji', 'Sowji', 'B', 'B'],
['one', 'two', 'one', 'two']
]
index = pd.MultiIndex.from_arrays(arrays, names=('Upper', 'Lower'))
df = pd.DataFrame({'Values': [10, 20, 30, 40]}, index=index)
print(df)
OUTPUT: -
print(df.loc[('Sowji', 'one')])
OUTPUT: -
df_sorted = df.sort_index()
print(df_sorted)
OUTPUT: -
df_swapped = df.swaplevel()
print(df_swapped)
OUTPUT: -
df_reset = df.reset_index()
print(df_reset)
OUTPUT: -
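A MultiIndex can also be pivoted between rows and columns; a short sketch on the same df:
df_unstacked = df.unstack()  # move the inner 'Lower' level into the columns
print(df_unstacked)
print(df_unstacked.stack())  # stack() reverses the operation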
e) Vectorized String Operations
import pandas as pd
data = pd.DataFrame({'Names': ['Sowji', 'Bob', 'Charlie', 'David']})
data['Names_Lower'] = data['Names'].str.lower()
data['Names_Upper'] = data['Names'].str.upper()
print(data)
OUTPUT: -
data['Contains_al'] = data['Names'].str.contains('al', case=False)
print(data)
OUTPUT: -
data['Names_Replaced'] = data['Names'].str.replace('a', 'X', case=False, regex=True)
print(data)
OUTPUT: -
data['Name_Length'] = data['Names'].str.len()
print(data)
OUTPUT: -
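Other vectorized string methods follow the same pattern; a short sketch on the same data:
data['Starts_With_S'] = data['Names'].str.startswith('S')  # elementwise prefix test
data['First_Three'] = data['Names'].str[:3]  # vectorized slicing
print(data)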
Experiment 4: -
AIM: - Write an application to find and remove duplicate records in the given dataset
PROGRAM: -
Step1: - Create a dataset
import pandas as pd
import numpy as np
data = {
'ID': [1207, 2, 3, 1, 1207],
'Name': ['SOWJI', 'Alice', 'Bob', 'John', 'SOWJI'],
'Age': [25, 30, 35,25, 25]
}
df = pd.DataFrame(data)
print(df)
OUTPUT: -
Step2: - Identifying duplicates
duplicate_mask = df.duplicated()
duplicates = df[duplicate_mask]
print(duplicates)
OUTPUT: -
Step3: - Removing duplicates from the dataset using the drop_duplicates method
data = df.drop_duplicates()
print(data)
OUTPUT: -
Removing duplicates while keeping the last occurrence
df_no_duplicates_last = df.drop_duplicates(keep='last')
print(df_no_duplicates_last)
OUTPUT: -
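Duplicates can also be judged on a subset of columns rather than on whole rows; a short sketch on the same df:
df_unique_names = df.drop_duplicates(subset=['Name'])  # rows sharing a Name count as duplicates
print(df_unique_names)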
Experiment 5: -
AIM: - Write an application to handle missing data in the given dataset
Step1: -
Create a dataset with missing values marked as np.nan
import pandas as pd
import numpy as np
data = {
'ID': [1207, 2, 3, 4, np.nan],
'Name': ['Sowji', np.nan, 'Sowji', 'Charlie', 'Alice'],
'Age': [25, 30, np.nan, 35, 28],
'Salary': [50000, 54000, 58000, np.nan, 52000]
}
df = pd.DataFrame(data)
print(df)
OUTPUT: -
Step2: -
Identifying missing data
data = df.isnull()
print(data)
OUTPUT: -
Step3: -
Drop rows with any missing data
df_dropna = df.dropna()
print(df_dropna)
OUTPUT: -
Step4: -
Fill missing data with specific values
df_filled = df.fillna({
'ID': 0,
'Name': 'Unknown',
'Age': df['Age'].mean(),
'Salary': df['Salary'].median()
})
print(df_filled)
OUTPUT: -
Fill missing data with the forward fill method
df_ffill = df.ffill()
print(df_ffill)
OUTPUT: -
Fill missing data with the backward fill method
df_bfill = df.bfill()
print(df_bfill)
OUTPUT: -
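Numeric gaps can also be estimated from the neighbouring values; a short sketch on the same df:
df_interp = df.copy()
df_interp[['Age', 'Salary']] = df_interp[['Age', 'Salary']].interpolate()  # linear interpolation of NaNs
print(df_interp)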
Experiment 6: -
AIM: -
Write an application to work with pivot tables and cross-tabulation in a DataFrame.
Step1: -
Create a dataset
import pandas as pd
import numpy as np
data = {
'Employee': ['SOWJI', 'Alice', 'Bob', 'Charlie',],
'Department': ['HR', 'Finance', 'IT', 'HR'],
'Age': [25, 30, 35, 40],
'Salary': [50000, 60000, 70000, 80000],
'Bonus': [5000, 7000, 6000, 8000,]
}
df = pd.DataFrame(data)
print(df)
OUTPUT: -
Step2: - Create a pivot table
pivot_table = pd.pivot_table(df,
values=['Salary', 'Bonus'],
index=['Department'],
aggfunc={'Salary': 'mean', 'Bonus': 'sum'})
print(pivot_table)
OUTPUT: -
Cross tabulation: -
cross_tab = pd.crosstab(df['Department'], df['Employee'])
print("\nCross Tabulation:")
print(cross_tab)
OUTPUT: -
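crosstab can also report row and column totals; a short sketch on the same df:
cross_tab_totals = pd.crosstab(df['Department'], df['Employee'], margins=True)  # margins=True adds an 'All' row/column
print(cross_tab_totals)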
Experiment-7
AIM: - Perform the following preprocessing techniques on the loan prediction dataset (the Iris dataset is used below as a stand-in):
a. Feature Scaling
b. Feature Standardization
c. Label Encoding
d. One-Hot Encoding
# Import necessary libraries
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder, OneHotEncoder
# Load the Iris dataset from seaborn
iris = sns.load_dataset('iris')
# Display the first few rows of the dataset
print(iris.head())
# a. Feature Scaling (Min-Max Scaling)
# Scale all the numerical columns (sepal_length, sepal_width, petal_length, petal_width)
scaler = MinMaxScaler()
iris[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']] = scaler.fit_transform(
iris[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
)
# Display the scaled dataset
print("\nAfter Feature Scaling (Min-Max Scaling):")
print(iris.head())
# b. Feature Standardization
# Standardize all the numerical columns
std_scaler = StandardScaler()
iris[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']] = std_scaler.fit_transform(
iris[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
)
# Display the standardized dataset
print("\nAfter Feature Standardization:")
print(iris.head())
# c. Label Encoding (for 'species' column)
label_encoder = LabelEncoder()
iris['species'] = label_encoder.fit_transform(iris['species'])
# Display the label encoded dataset
print("\nAfter Label Encoding (Species):")
print(iris.head())
# One-Hot Encoding (for 'species' column in the original dataset before Label Encoding)
# Reload the original iris dataset for One-Hot Encoding
iris_original = sns.load_dataset('iris')
# d. Apply One-Hot Encoding to the 'species' column
iris_onehot = pd.get_dummies(iris_original, columns=['species'], drop_first=True)
# Display the one-hot encoded dataset
print("\nAfter One-Hot Encoding (Species):")
print(iris_onehot.head())
OUTPUT:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
After Feature Scaling (Min-Max Scaling):
sepal_length sepal_width petal_length petal_width species
0 0.222222 0.625000 0.067797 0.041667 setosa
1 0.166667 0.416667 0.067797 0.041667 setosa
2 0.111111 0.500000 0.050847 0.041667 setosa
3 0.083333 0.458333 0.084746 0.041667 setosa
4 0.194444 0.666667 0.067797 0.041667 setosa
After Feature Standardization:
sepal_length sepal_width petal_length petal_width species
0 -0.900681 1.019004 -1.340227 -1.315444 setosa
1 -1.143017 -0.131979 -1.340227 -1.315444 setosa
2 -1.385353 0.328414 -1.397064 -1.315444 setosa
3 -1.506521 0.098217 -1.283389 -1.315444 setosa
4 -1.021849 1.249201 -1.340227 -1.315444 setosa
After Label Encoding (Species):
sepal_length sepal_width petal_length petal_width species
0 -0.900681 1.019004 -1.340227 -1.315444 0
1 -1.143017 -0.131979 -1.340227 -1.315444 0
2 -1.385353 0.328414 -1.397064 -1.315444 0
3 -1.506521 0.098217 -1.283389 -1.315444 0
4 -1.021849 1.249201 -1.340227 -1.315444 0
After One-Hot Encoding (Species):
sepal_length sepal_width petal_length petal_width species_versicolor \
0 5.1 3.5 1.4 0.2 0
1 4.9 3.0 1.4 0.2 0
2 4.7 3.2 1.3 0.2 0
3 4.6 3.1 1.5 0.2 0
4 5.0 3.6 1.4 0.2 0
species_virginica
0 0
1 0
2 0
3 0
4 0
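The same four steps carry over directly to an actual loan prediction CSV. A minimal sketch, assuming a hypothetical file loan_data.csv with numeric columns ApplicantIncome and LoanAmount, a binary column Gender, and a multi-valued column Property_Area (all of these names are placeholders):
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder
loan = pd.read_csv('loan_data.csv')  # hypothetical loan prediction file
num_cols = ['ApplicantIncome', 'LoanAmount']  # hypothetical numeric columns
loan[num_cols] = MinMaxScaler().fit_transform(loan[num_cols])  # a. Feature Scaling
loan[num_cols] = StandardScaler().fit_transform(loan[num_cols])  # b. Feature Standardization
loan['Gender'] = LabelEncoder().fit_transform(loan['Gender'])  # c. Label Encoding
loan = pd.get_dummies(loan, columns=['Property_Area'], drop_first=True)  # d. One-Hot Encoding
print(loan.head())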
Experiment-8
AIM: - Write an application to perform Data Visualization with Python using matplotlib
PROGRAM: -
import matplotlib.pyplot as plt
import pandas as pd
data = pd.read_csv('/iris.data.csv',header=None)
print(data)
OUTPUT: -
Scatter Plot
data.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
plt.scatter(data['sepal_length'], data['sepal_width'], c='blue')
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.title('Sepal Length vs Sepal Width')
plt.show()
OUTPUT: -
Line Chart
import matplotlib.pyplot as plt
import pandas as pd
data = pd.read_csv('/iris.data.csv', header=None)
data.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
plt.plot(data.index, data['sepal_length'], label='Sepal Length', color='blue')
plt.plot(data.index, data['sepal_width'], label='Sepal Width', color='green')
plt.xlabel('Index')
plt.ylabel('Value')
plt.title('Sepal Length and Width over Index')
plt.legend()
plt.show()
OUTPUT: -
Histogram: -
import matplotlib.pyplot as plt
import pandas as pd
data = pd.read_csv('/iris.data.csv', header=None)
data.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
plt.hist(data['sepal_length'], bins=10, color='skyblue', edgecolor='black')
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Frequency')
plt.title('Histogram of Sepal Length')
plt.show()
plt.hist(data['sepal_width'], bins=10, color='salmon', edgecolor='black')
plt.xlabel('Sepal Width (cm)')
plt.ylabel('Frequency')
plt.title('Histogram of Sepal Width')
plt.show()
OUTPUT: -
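A bar chart of class counts rounds out the basic plot types; a short sketch on the same data:
species_counts = data['species'].value_counts()
plt.bar(species_counts.index, species_counts.values, color='lightgreen', edgecolor='black')
plt.xlabel('Species')
plt.ylabel('Count')
plt.title('Count of Each Species')
plt.show()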
Experiment 9: -
AIM: - Apply and explore various plotting functions from the seaborn library
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
data = pd.read_csv('/Housing.csv')
print(data)
OUTPUT: -
Scatterplot
sns.scatterplot(x='area', y='price', data=data)
plt.title('Price vs. Area')
plt.show()
OUTPUT: -
Histplot for Price
sns.histplot(data['price'], kde=True)
plt.title('Distribution of Price')
plt.show()
OUTPUT: -
Histplot for Area
sns.histplot(data['area'], kde=True)
plt.title('Distribution of Area')
plt.show()
OUTPUT: -
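Seaborn also offers categorical and matrix plots; a short sketch on the same data, assuming Housing.csv contains a 'bedrooms' column (verify against your file):
sns.boxplot(x='bedrooms', y='price', data=data)  # price distribution per bedroom count
plt.title('Price by Number of Bedrooms')
plt.show()
sns.heatmap(data[['price', 'area']].corr(), annot=True)  # correlation matrix of the numeric columns
plt.title('Correlation between Price and Area')
plt.show()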
Experiment-10
AIM: - Write an application to apply a regression model on the given dataset
# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
# Load the dataset
# Assume 'Housing.csv' contains columns like 'area', 'bedrooms', 'bathrooms', and 'price'
df = pd.read_csv('Housing.csv')
# Display the first few rows
print(df.head())
# Preprocessing
# Check for missing values
print(df.isnull().sum())
# Handling missing values (filling numeric gaps with the column median for simplicity)
df.fillna(df.median(numeric_only=True), inplace=True)
# Converting categorical data into numeric using One-Hot Encoding
df = pd.get_dummies(df, drop_first=True)
# Split data into features (X) and target variable (y)
X = df.drop('price', axis=1) # Features
y = df['price'] # Target
# Feature scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
# Train the regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print(f"Root Mean Squared Error: {rmse}")
print(f"R-squared: {r2}")
# Visualizing the relationship between actual and predicted prices
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, alpha=0.7)
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.title('Actual vs Predicted House Prices')
plt.show()
OUTPUT: -
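Once trained, the model can score a new listing; a minimal sketch that reuses the structure of an existing row (the values are placeholders, not a real prediction target):
new_house = pd.DataFrame([X.iloc[0]])  # one-row frame with the same columns as X
new_scaled = scaler.transform(new_house)  # apply the already-fitted scaler
print('Predicted price:', model.predict(new_scaled)[0])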
10 b. Write an application to apply a regression model to the Iris dataset
# Importing necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
# Load the Iris dataset from seaborn
iris = sns.load_dataset('iris')
# Display the first few rows of the dataset
print(iris.head())
# We'll predict 'petal_width' based on the other features (excluding 'species')
X = iris.drop(['petal_width', 'species'], axis=1) # Features: sepal_length, sepal_width, petal_length
y = iris['petal_width'] # Target: petal_width
# Feature scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
# Train the regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print(f"Root Mean Squared Error: {rmse}")
print(f"R-squared: {r2}")
# Visualizing the relationship between actual and predicted petal width
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, alpha=0.7, color='b')
plt.xlabel('Actual Petal Width')
plt.ylabel('Predicted Petal Width')
plt.title('Actual vs Predicted Petal Width')
plt.show()
# Plotting feature importance (coefficients) for linear regression
plt.figure(figsize=(8, 6))
coefficients = model.coef_
features = ['sepal_length', 'sepal_width', 'petal_length']
sns.barplot(x=features, y=coefficients)
plt.title('Feature Importance (Linear Regression Coefficients)')
plt.ylabel('Coefficient Value')
plt.show()
OUTPUT: -