✅ Practical 1: Load a Dataset and Display Basic Information
Objective:
To load a dataset using pandas and understand its structure using basic data inspection
methods.
Theory:
DataFrames are table-like structures in pandas. Understanding the dataset's size,
structure, column data types, and basic statistics is the first step in any analysis.
Tools: Python, pandas
Steps:
1. Import pandas.
2. Load a CSV file using read_csv().
3. Use .head() to preview the first few records.
4. Use .info() to inspect column names, data types, and non-null counts.
5. Use .describe() for summary statistics.
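If sample.csv is not already on disk, a small file can be created first. The snippet below is a minimal sketch; the column names and values are assumptions chosen so that the later practicals can run on the same file.
python
import pandas as pd
# Build a tiny example dataset; the columns (Age, Salary, Gender,
# Education_Level, Experience) match those used in later practicals
sample = pd.DataFrame({
    'Age': [25, 32, None, 51, 38, 29],
    'Salary': [30000, 45000, 72000, 80000, None, 39000],
    'Gender': ['Male', 'Female', 'Female', 'Male', 'Male', 'Female'],
    'Education_Level': ['Bachelors', 'Masters', 'PhD', 'Masters',
                        'Bachelors', 'Masters'],
    'Experience': [2, 7, 20, 25, 12, 5],
})
sample.to_csv('sample.csv', index=False)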
Code:
python
import pandas as pd
# Load dataset
df = pd.read_csv('sample.csv')
# Display first 5 rows
print(df.head())
# Column names, data types, and non-null counts
# (info() prints directly; wrapping it in print() would also print 'None')
df.info()
# Summary statistics
print(df.describe())
Expected Output:
List of columns
Data types (int64, float64, object)
Count of null values
Mean, min, max, std deviation
Conclusion:
Basic inspection methods give a quick overview of the dataset's size, types, and quality, which guides the cleaning and analysis steps that follow.
✅ Practical 2: Histogram – Distribution of a Numerical Variable
Objective:
To visualize the distribution of a numeric variable and gain insights into its central tendency,
spread, and shape using a histogram.
Theory:
Histograms group numeric data into bins. The height of each bar shows how many values fall
into that range.
Tools:
matplotlib: For plotting (pyplot).
seaborn: High-level plotting for beautiful statistical graphics.
Code:
python
import seaborn as sns
import matplotlib.pyplot as plt
# Plot a histogram of 'Age' with a KDE (kernel density estimate) curve;
# df is the DataFrame loaded in Practical 1
sns.histplot(df['Age'], kde=True)
plt.title("Distribution of Age")
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.show()
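The bin count controls how much detail the histogram shows; too few bins hide structure, too many make it noisy. A quick sketch using seaborn's standard bins parameter:
python
# Same data, different levels of detail
sns.histplot(df['Age'], bins=5)
plt.show()
sns.histplot(df['Age'], bins=20)
plt.show()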
Expected Output:
A histogram with an overlaid KDE curve showing the shape of the distribution.
Conclusion:
Histograms are essential for exploring numerical data distributions. They help detect
skewness, outliers, and give a visual understanding of where the majority of values lie —
which is crucial for selecting appropriate transformation or modeling techniques.
✅ Practical 3: Summary Statistics – Mean, Median, Standard Deviation
Objective:
To compute and interpret basic summary statistics such as mean, median, and standard
deviation for a numerical column in a dataset.
Theory:
Mean: Arithmetic average
Median: Middle value
Std: Standard deviation, the typical spread of values around the mean
Code:
python
# Summary statistics for the 'Salary' column
print("Mean:", df['Salary'].mean())
print("Median:", df['Salary'].median())
print("Standard Deviation:", df['Salary'].std())
Conclusion:
Calculating summary statistics is a fundamental step in data analysis. Mean, median, and
standard deviation provide critical insights into the central value and variability of a dataset
— helping guide data pre-processing, outlier detection, and modeling decisions.
✅ Practical 4: Scatter Plot – Relationship Between Two Variables
Objective:
To visualize the relationship or correlation between two numerical variables using a scatter
plot.
Theory:
A scatter plot displays paired values as points, revealing whether the relationship between two variables is linear, non-linear, or absent.
Code:
python
import seaborn as sns
import matplotlib.pyplot as plt
# Scatter plot of Experience vs Salary
sns.scatterplot(x='Experience', y='Salary', data=df)
plt.title("Experience vs Salary")
plt.xlabel("Years of Experience")
plt.ylabel("Salary")
plt.show()
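The visual trend can also be quantified with the Pearson correlation coefficient, computed here via pandas' Series.corr (a small supplementary sketch):
python
# +1: strong positive, -1: strong negative, near 0: weak linear relation
print(df['Experience'].corr(df['Salary']))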
Expected Output:
A plot showing the trend (positive, negative, or no correlation).
Conclusion:
Scatter plots help in detecting relationships between numeric variables and provide visual
evidence of correlation, making them highly useful before applying statistical models or
machine learning algorithms.
✅ Practical 5: Basic Data Cleaning (Missing Values, Outliers)
Objective:
To clean and prepare raw data by handling missing values and outliers, ensuring it is suitable
for further analysis or modeling.
Theory:
Missing data can bias the model.
Outliers can distort statistical analysis.
Steps:
1. Identify missing values with isnull().sum().
2. Handle missing values: drop the affected rows or fill with the mean.
3. Detect outliers using boxplot visualization.
Code:
python
import seaborn as sns
import matplotlib.pyplot as plt
# Step 1: Check missing values
print(df.isnull().sum())
# Step 2: Impute missing values in 'Age' with the mean
df['Age'] = df['Age'].fillna(df['Age'].mean())
# Step 3: Detect outliers in 'Salary' column using a boxplot
sns.boxplot(x=df['Salary'])
plt.title("Outliers in Salary")
plt.show()
Conclusion:
Basic data cleaning is essential to ensure data quality and improve the performance of
analytical models.
✅ Practical 6: Encode Categorical Variables
Objective:
To convert categorical (non-numeric) variables into numerical format so that machine learning
algorithms can interpret and use the data effectively.
Theory:
One-hot encoding creates a binary column per category and suits nominal (unordered) data.
Label encoding maps each category to an integer and suits ordinal (ordered) data.
Code:
python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# One-hot encoding for the nominal 'Gender' column
df = pd.get_dummies(df, columns=['Gender'])
# Label encoding for the ordinal 'Education_Level' column
le = LabelEncoder()
df['Education_Level'] = le.fit_transform(df['Education_Level'])
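Note that LabelEncoder assigns integers in alphabetical order, which may not match the real ranking of an ordinal variable. An explicit mapping keeps the order under control, as an alternative to the LabelEncoder step above (a sketch; the category labels are assumed examples):
python
# Map categories to integers in their true order
order = {'Bachelors': 0, 'Masters': 1, 'PhD': 2}
df['Education_Level'] = df['Education_Level'].map(order)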
Conclusion:
Encoding categorical variables is a critical pre-processing step in machine learning
workflows.
✅ Practical 7: Web Scraping to Collect Data
Objective:
To extract real-world data from a live website using Python libraries and convert it into a
format suitable for analysis.
Theory:
Use requests to fetch HTML and BeautifulSoup to extract data.
Code:
python
import requests
from bs4 import BeautifulSoup
url = "https://example.com/products"
page = requests.get(url)
# Parse the HTML content
soup = BeautifulSoup(page.text, 'html.parser')
# Example: Extract product titles (inside <h2> tags)
titles = soup.find_all('h2')
for t in titles:
    print(t.text.strip())
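Real sites sometimes block default clients or respond slowly, so a slightly more defensive request is common practice (headers, timeout, and raise_for_status are all standard requests features):
python
headers = {'User-Agent': 'Mozilla/5.0 (educational demo)'}
page = requests.get(url, headers=headers, timeout=10)
page.raise_for_status()  # raise an error on a 4xx/5xx response
soup = BeautifulSoup(page.text, 'html.parser')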
Conclusion:
Web scraping is a practical and powerful method to collect live, real-world data directly from
websites.
✅ Practical 8: Pre-process Collected Data
Objective:
To clean and structure the raw data collected from sources like web scraping or APIs,
transforming it into a usable and analysable format.
Theory:
Remove unwanted characters and HTML tags, then store the cleaned values in a DataFrame.
Code:
python
import pandas as pd
# Assuming 'titles' is the list of HTML elements collected in Practical 7;
# .text drops the HTML tags and .strip() removes leading/trailing whitespace
data = {'Title': [t.text.strip() for t in titles]}
df = pd.DataFrame(data)
print(df.head())
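Scraped text often also contains stray punctuation or symbols. A sketch using pandas string methods (the cleaning rule, keeping only letters, digits, and spaces, is an assumption; adjust it to the data):
python
# Keep only letters, digits, and spaces
df['Title'] = df['Title'].str.replace(r'[^A-Za-z0-9 ]', '', regex=True)
print(df.head())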
Conclusion:
Preprocessing is a vital step to transform raw, unstructured data (like web-scraped HTML)
into a clean and structured form.
✅ Practical 9: Handle Missing Data (Imputation/Deletion)
Objective:
To effectively manage missing values in a dataset through deletion or imputation, thereby
maintaining data quality and ensuring reliable analysis.
Code:
python
# Option 1: Drop rows with any missing value
df_dropped = df.dropna()
# Option 2: Impute missing 'Salary' values with the median
df['Salary'] = df['Salary'].fillna(df['Salary'].median())
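dropna also offers finer control than removing every incomplete row; subset and thresh are standard pandas parameters:
python
# Drop rows only when 'Salary' itself is missing
df = df.dropna(subset=['Salary'])
# Keep rows that have at least 3 non-missing values
df = df.dropna(thresh=3)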
Conclusion:
Handling missing data is a critical step in the data cleaning process. While dropping rows can
be a quick fix, imputation methods like filling with the median retain the overall dataset
structure and avoid unnecessary information loss.
✅ Practical 10: Visualize Missing Data
Objective:
To identify the location and proportion of missing values in the dataset visually, helping in
the selection of suitable data cleaning or imputation methods.
Code:
python
import seaborn as sns
import matplotlib.pyplot as plt
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title("Missing Data Heatmap")
plt.show()
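The heatmap shows where values are missing; a short numeric summary shows how much is missing per column:
python
# Percentage of missing values in each column
print((df.isnull().mean() * 100).round(2))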
Conclusion:
Visualizing missing data using a heatmap makes it easier to spot patterns and clusters of
missing values.
✅ Practical 11: Exploratory Data Analysis (EDA)
Objective:
To explore the dataset in-depth by summarizing key statistical properties and visualizing
relationships between multiple features.
Steps:
1. Describe the data with summary statistics.
2. Visualize boxplots for key numeric columns.
3. Explore pair plots across numerical features.
Code:
python
import seaborn as sns
import matplotlib.pyplot as plt
# Step 1: Summary statistics
print(df.describe())
# Step 2: Boxplot for 'Age'
sns.boxplot(x=df['Age'])
plt.title("Boxplot of Age")
plt.show()
# Step 3: Pair plot for all numerical features
sns.pairplot(df)
plt.show()
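Group-level summaries are another common EDA step; a sketch comparing salary statistics across genders (column names assumed from the earlier practicals):
python
# Salary statistics per gender group
print(df.groupby('Gender')['Salary'].describe())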
Conclusion:
Exploratory Data Analysis (EDA) is an essential initial step in any data science or machine
learning project. It provides deep insight into the data’s distribution, relationships, and
anomalies, helping guide feature selection, transformation, and modeling strategies.
✅ Practical 12: Boxplots for Outlier Detection
Objective:
To identify outliers in numerical data using a boxplot and understand how these extreme
values can impact data analysis and modeling.
Theory:
Values below Q1 − 1.5×IQR or above Q3 + 1.5×IQR are treated as outliers, where IQR = Q3 − Q1 is the interquartile range.
Code:
python
import seaborn as sns
import matplotlib.pyplot as plt
# Create a boxplot to detect outliers in 'Salary'
sns.boxplot(y=df['Salary'])
plt.title("Boxplot of Salary")
plt.ylabel("Salary")
plt.show()
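The 1.5×IQR fences from the theory can also be computed directly, which makes it possible to list the outlying rows rather than only see them on the plot:
python
# Compute the 1.5*IQR fences used by the boxplot
q1 = df['Salary'].quantile(0.25)
q3 = df['Salary'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(df[(df['Salary'] < lower) | (df['Salary'] > upper)])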
Conclusion:
Boxplots provide a quick visual way to detect outliers in numerical data. Outliers may be
excluded, transformed (e.g., using log or z-score), or analyzed separately depending on
their cause and relevance to the problem.
✅ Practical 13: Explore Correlation Between Variables
Objective:
To measure and visualize the degree of relationship between numerical variables using a
correlation matrix and heatmap.
Code:
python
import seaborn as sns
import matplotlib.pyplot as plt
# Compute the correlation matrix (numeric columns only)
correlation = df.corr(numeric_only=True)
# Visualize the correlation matrix as a heatmap
sns.heatmap(correlation, annot=True, cmap='coolwarm')
plt.title("Correlation Matrix")
plt.show()
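To read the matrix for a single variable at a time, its column can be sorted; a small usage sketch:
python
# Correlations with 'Salary', strongest positive first
print(correlation['Salary'].sort_values(ascending=False))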
Conclusion:
Correlation analysis helps in feature selection, detecting highly correlated (redundant)
variables, and understanding the linear relationships within the data. It is a key part of EDA
and modeling preparation.
✅ Practical 14: Dimensionality Reduction using PCA
Objective:
To reduce high-dimensional data into fewer dimensions using Principal Component
Analysis (PCA) while preserving as much variance (information) as possible.
Code:
python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import pandas as pd
# Standardize numeric columns to zero mean and unit variance
scaled = StandardScaler().fit_transform(
    df.select_dtypes(include=['float64', 'int64'])
)
# Apply PCA
pca = PCA(n_components=2)
pca_result = pca.fit_transform(scaled)
print(pca.explained_variance_ratio_)
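The two components can then be plotted to view the data in the reduced space (a sketch using matplotlib directly):
python
import matplotlib.pyplot as plt
# Scatter of the first two principal components
plt.scatter(pca_result[:, 0], pca_result[:, 1])
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA Projection (2 Components)")
plt.show()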
Conclusion:
PCA is a powerful tool for dimensionality reduction. It simplifies complex datasets by
projecting them onto fewer dimensions while preserving essential variance.
✅ Practical 15: Hypothesis Testing (t-test)
Objective:
To determine whether the means of two independent groups (e.g., Male vs Female) are
significantly different.
Code:
python
from scipy.stats import ttest_ind
# Extract salary data by gender (dropping missing values)
group1 = df[df['Gender'] == 'Male']['Salary'].dropna()
group2 = df[df['Gender'] == 'Female']['Salary'].dropna()
# Perform an independent two-sample t-test
t_stat, p_value = ttest_ind(group1, group2)
# Output the results
print(f"T-stat: {t_stat}, P-value: {p_value}")
Conclusion:
The t-test statistically compares the means of two groups. If the p-value is less than 0.05, the
difference in means (e.g., salaries of males vs females) is considered statistically significant.