FOUND. DATA SCIENCE Practical

The document outlines a series of practical exercises aimed at data analysis using Python and pandas. It covers various topics including loading datasets, visualizing data distributions, computing summary statistics, handling missing data, and performing hypothesis testing. Each practical includes objectives, theory, code examples, expected outputs, and conclusions to facilitate understanding and application of data analysis techniques.

✅ Practical 1: Load a Dataset and Display Basic Information

Objective:
To load a dataset with pandas and understand its structure using basic data inspection
methods.

Theory:
DataFrames are table-like structures in pandas. Understanding a dataset's size,
structure, column types, and basic statistics is the first step in any analysis.

Tools: Python, pandas

Steps:

1. Import pandas.
2. Load a CSV file using read_csv().
3. Use .head() to preview the first few records.
4. Use .info() to inspect column names, data types, and missing values.
5. Use .describe() for summary statistics.

Code:

python

import pandas as pd

# Load dataset
df = pd.read_csv('sample.csv')

# Display first 5 rows
print(df.head())

# Data types and non-null counts (info() prints directly, so no print() needed)
df.info()

# Summary statistics
print(df.describe())

Expected Output:

• List of columns
• Data types (int64, float64, object)
• Count of non-null values per column
• Mean, min, max, std deviation
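
The theory above also calls out the dataset's size; a minimal add-on, assuming the same df, prints the row and column counts along with the column names:

python

# Number of rows and columns, and the column names
print(df.shape)            # (rows, columns)
print(df.columns.tolist())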

Conclusion:
This inspection gives an overview of the dataset and prepares it for further cleaning and analysis.
✅ Practical 2: Histogram – Distribution of a Numerical Variable

Objective:

To visualize the distribution of a numeric variable and gain insights into its central tendency,
spread, and shape using a histogram.

Theory:
Histograms group numeric data into bins. The height of each bar shows how many values fall
into that range.

Tools:

• matplotlib: for plotting (pyplot).
• seaborn: high-level plotting for beautiful statistical graphics.

Code:

python

import seaborn as sns
import matplotlib.pyplot as plt

# df is assumed to be loaded as in Practical 1
# Plot a histogram with a KDE (smoothed distribution curve)
sns.histplot(df['Age'], kde=True)
plt.title("Distribution of Age")
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.show()

Expected Output:
Histogram with a curve to show distribution.
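
Because the conclusion below mentions skewness, a numeric check can complement the visual; a small sketch, assuming the same 'Age' column:

python

# Quantify skewness: ~0 symmetric, >0 right-skewed, <0 left-skewed
print("Skewness of Age:", df['Age'].skew())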

Conclusion:
Histograms are essential for exploring numerical data distributions. They help detect
skewness, outliers, and give a visual understanding of where the majority of values lie —
which is crucial for selecting appropriate transformation or modeling techniques.
✅ Practical 3: Summary Statistics – Mean, Median, Standard Deviation

Objective:
To compute and interpret basic summary statistics such as mean, median, and standard
deviation for a numerical column in a dataset.

Theory:

 Mean: Arithmetic average


 Median: Middle value
 Std: Spread of values from the mean

Code:

python

# Summary statistics for the 'Salary' column
print("Mean:", df['Salary'].mean())
print("Median:", df['Salary'].median())
print("Standard Deviation:", df['Salary'].std())

Conclusion:
Calculating summary statistics is a fundamental step in data analysis. Mean, median, and
standard deviation provide critical insights into the central value and variability of a dataset
— helping guide data pre-processing, outlier detection, and modeling decisions.
✅ Practical 4: Scatter Plot – Relationship Between Two Variables

Objective:
To visualize the relationship or correlation between two numerical variables using a scatter
plot.

Theory:
A scatter plot shows whether there is a linear/non-linear relationship.

Code:

python

import seaborn as sns
import matplotlib.pyplot as plt

# Scatter plot of Experience vs Salary
sns.scatterplot(x='Experience', y='Salary', data=df)
plt.title("Experience vs Salary")
plt.xlabel("Years of Experience")
plt.ylabel("Salary")
plt.show()

Expected Output:
A plot showing the trend (positive, negative, or no correlation).
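
To put a number on the trend the plot shows, the Pearson correlation coefficient can be computed; a short sketch assuming the same columns:

python

# Pearson correlation: +1 strong positive, -1 strong negative, ~0 none
print(df['Experience'].corr(df['Salary']))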

Conclusion:
Scatter plots help in detecting relationships between numeric variables and provide visual
evidence of correlation, making them highly useful before applying statistical models or
machine learning algorithms.
✅ Practical 5: Basic Data Cleaning (Missing Values, Outliers)

Objective:
To clean and prepare raw data by handling missing values and outliers, ensuring it is suitable
for further analysis or modeling.

Theory:

 Missing data can bias the model.


 Outliers can distort statistical analysis.

Steps:

1. Identify missing values with isnull().sum().
2. Handle missing values: drop them or fill with the mean.
3. Detect outliers using a boxplot visualization.

Code:

python

import seaborn as sns
import matplotlib.pyplot as plt

# Step 1: Check missing values
print(df.isnull().sum())

# Step 2: Impute missing values in 'Age' with the mean
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Step 3: Detect outliers in the 'Salary' column using a boxplot
sns.boxplot(x=df['Salary'])
plt.title("Outliers in Salary")
plt.show()
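
If the boxplot confirms outliers, a common follow-up (a sketch, not prescribed by the steps above) is to drop rows outside the 1.5×IQR fences:

python

# Remove rows whose Salary lies outside the 1.5*IQR fences
q1, q3 = df['Salary'].quantile([0.25, 0.75])
iqr = q3 - q1
df_clean = df[(df['Salary'] >= q1 - 1.5 * iqr) & (df['Salary'] <= q3 + 1.5 * iqr)]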

Conclusion:
Basic data cleaning is essential to ensure data quality and improve the performance of
analytical models.
✅ Practical 6: Encode Categorical Variables

Objective:
To convert categorical (non-numeric) variables into numerical format so that machine learning
algorithms can interpret and use the data effectively.

Theory:

 One-hot encoding for nominal data.


 Label encoding for ordinal data.

Code:

python

# One-hot encoding
df = pd.get_dummies(df, columns=['Gender'])

# Label encoding
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['Education_Level'] = le.fit_transform(df['Education_Level'])
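
One caveat: LabelEncoder assigns integer codes alphabetically, which may not match the actual ranking of an ordinal variable. A minimal alternative, assuming the original string values and a hypothetical category order:

python

# Explicit ordinal mapping preserves the intended order (assumed categories,
# applied to the raw text column before any label encoding)
order = ['High School', 'Bachelors', 'Masters', 'PhD']
df['Education_Level'] = pd.Categorical(df['Education_Level'], categories=order, ordered=True).codes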

Conclusion:
Encoding categorical variables is a critical pre-processing step in machine learning
workflows.
✅ Practical 7: Web Scraping to Collect Data

Objective:
To extract real-world data from a live website using Python libraries and convert it into a
format suitable for analysis.

Theory:
Use requests to fetch HTML and BeautifulSoup to extract data.

Code:

python

import requests
from bs4 import BeautifulSoup

url = "https://example.com/products"
page = requests.get(url)
page.raise_for_status()  # stop early if the request failed

# Parse the HTML content
soup = BeautifulSoup(page.text, 'html.parser')

# Example: extract product titles (inside <h2> tags)
titles = soup.find_all('h2')
for t in titles:
    print(t.text.strip())

Conclusion:
Web scraping is a practical and powerful method to collect live, real-world data directly from
websites.
✅ Practical 8: Pre-process Collected Data

Objective:
To clean and structure raw data collected from sources like web scraping or APIs,
transforming it into a usable and analyzable format.

Theory:
Remove unwanted characters and HTML tags, then store the results in a DataFrame.

Code:

python

import pandas as pd

# Assuming 'titles' is the list of HTML elements collected in Practical 7
# .text drops the HTML tags; .strip() removes leading/trailing whitespace
data = {'Title': [t.text.strip() for t in titles]}

df = pd.DataFrame(data)
print(df.head())
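
The theory also mentions removing unwanted characters; a short regex pass, a sketch assuming the same 'Title' column, extends the cleaning:

python

# Strip any residual non-alphanumeric characters (keeping spaces)
df['Title'] = df['Title'].str.replace(r'[^A-Za-z0-9 ]+', '', regex=True)
print(df.head())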

Conclusion:

Preprocessing is a vital step to transform raw, unstructured data (like web-scraped HTML)
into a clean and structured form.
✅ Practical 9: Handle Missing Data (Imputation/Deletion)

Objective:

To effectively manage missing values in a dataset through deletion or imputation, thereby
maintaining data quality and ensuring reliable analysis.

Code:

python

# Option A: drop rows with any missing value
df_dropped = df.dropna()

# Option B: impute 'Salary' with the median (keeps all rows)
# Use one option or the other; dropping first would leave nothing to impute
df['Salary'] = df['Salary'].fillna(df['Salary'].median())
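
For categorical columns the median is undefined; a common variant, sketched here for the 'Gender' column used elsewhere in these practicals, imputes with the most frequent value:

python

# Impute a categorical column with its mode (most frequent value)
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])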

Conclusion:

Handling missing data is a critical step in the data cleaning process. While dropping rows can
be a quick fix, imputation methods like filling with the median retain the overall dataset
structure and avoid unnecessary information loss.
✅ Practical 10: Visualize Missing Data

Objective:
To identify the location and proportion of missing values in the dataset visually, helping in
the selection of suitable data cleaning or imputation methods.

Code:

python

import seaborn as sns
import matplotlib.pyplot as plt

# Heatmap of missing values: each highlighted cell marks a null entry
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title("Missing Data Heatmap")
plt.show()
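
Since the objective mentions the proportion of missing values, a numeric companion to the heatmap (assuming the same df):

python

# Fraction of missing values per column, largest first
print(df.isnull().mean().sort_values(ascending=False))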

Conclusion:

Visualizing missing data using a heatmap makes it easier to spot patterns and clusters of
missing values.
✅ Practical 11: Exploratory Data Analysis (EDA)

Objective:
To explore the dataset in-depth by summarizing key statistical properties and visualizing
relationships between multiple features.

Steps:

• Describe the data
• Visualize boxplots
• Explore pair plots

Code:

python

import seaborn as sns
import matplotlib.pyplot as plt

# Step 1: Summary statistics
print(df.describe())

# Step 2: Boxplot for 'Age'
sns.boxplot(x=df['Age'])
plt.title("Boxplot of Age")
plt.show()

# Step 3: Pair plot for all numerical features
sns.pairplot(df)
plt.show()

Conclusion:
Exploratory Data Analysis (EDA) is an essential initial step in any data science or machine
learning project. It provides deep insight into the data’s distribution, relationships, and
anomalies, helping guide feature selection, transformation, and modeling strategies.
✅ Practical 12: Boxplots for Outlier Detection

Objective:
To identify outliers in numerical data using a boxplot and understand how these extreme
values can impact data analysis and modeling.

Theory:
Values are typically flagged as outliers when they lie more than 1.5×IQR below the first
quartile (Q1) or above the third quartile (Q3).

Code:

python

import seaborn as sns
import matplotlib.pyplot as plt

# Create a boxplot to detect outliers in 'Salary'
sns.boxplot(y=df['Salary'])
plt.title("Boxplot of Salary")
plt.ylabel("Salary")
plt.show()
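
The 1.5×IQR rule from the theory can also be applied numerically to flag the points the plot shows; a sketch assuming the same 'Salary' column:

python

# Flag values outside the 1.5*IQR fences
q1, q3 = df['Salary'].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df['Salary'] < q1 - 1.5 * iqr) | (df['Salary'] > q3 + 1.5 * iqr)]
print(len(outliers), "outlier(s) found")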

Conclusion:
Boxplots provide a quick visual way to detect outliers in numerical data. Outliers may be
excluded, transformed (e.g., using log or z-score), or analyzed separately depending on
their cause and relevance to the problem.
✅ Practical 13: Explore Correlation Between Variables

Objective:
To measure and visualize the degree of relationship between numerical variables using a
correlation matrix and heatmap.

Code:

python

import seaborn as sns
import matplotlib.pyplot as plt

# Compute the correlation matrix (numeric columns only)
correlation = df.corr(numeric_only=True)

# Visualize the correlation matrix as a heatmap
sns.heatmap(correlation, annot=True, cmap='coolwarm')
plt.title("Correlation Matrix")
plt.show()
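
As the conclusion notes, correlation matrices help spot redundant features; a small sketch, assuming the correlation matrix above and a hypothetical 0.8 cutoff, lists highly correlated pairs:

python

# Print feature pairs whose absolute correlation exceeds the threshold
threshold = 0.8  # assumed cutoff for "highly correlated"
cols = correlation.columns
for i in range(len(cols)):
    for j in range(i + 1, len(cols)):
        if abs(correlation.iloc[i, j]) > threshold:
            print(cols[i], cols[j], round(correlation.iloc[i, j], 2))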

Conclusion:
Correlation analysis helps in feature selection, detecting highly correlated (redundant)
variables, and understanding the linear relationships within the data. It is a key part of EDA
and modeling preparation.
✅ Practical 14: Dimensionality Reduction using PCA

Objective:
To reduce high-dimensional data into fewer dimensions using Principal Component
Analysis (PCA) while preserving as much variance (information) as possible.

Code:

python

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Standardize the numeric columns
numeric = df.select_dtypes(include=['float64', 'int64'])
scaled = StandardScaler().fit_transform(numeric)

# Apply PCA, keeping the first two principal components
pca = PCA(n_components=2)
pca_result = pca.fit_transform(scaled)
print(pca.explained_variance_ratio_)
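
To choose how many components to keep, a common follow-up is to inspect the cumulative explained variance; a sketch assuming the same scaled array:

python

import numpy as np

# Fit PCA with all components and accumulate the variance ratios
pca_full = PCA().fit(scaled)
print(np.cumsum(pca_full.explained_variance_ratio_))  # e.g., keep enough for ~0.95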

Conclusion:
PCA is a powerful tool for dimensionality reduction. It simplifies complex datasets by
projecting them onto fewer dimensions while preserving essential variance.
✅ Practical 15: Hypothesis Testing (t-test)

Objective:
To determine whether the means of two independent groups (e.g., Male vs Female) are
significantly different.

Code:

python

from scipy.stats import ttest_ind

# Extract salary data by gender
group1 = df[df['Gender'] == 'Male']['Salary']
group2 = df[df['Gender'] == 'Female']['Salary']

# Perform the independent two-sample t-test
t_stat, p_value = ttest_ind(group1, group2)

# Output the results
print(f"T-stat: {t_stat}, P-value: {p_value}")

Conclusion:
A t-test statistically compares the means of two groups. If the p-value is less than 0.05, it
suggests a significant difference between the group means (e.g., salaries of males vs. females).
