✅ Practical 1: Load a Dataset and Display Basic Information
Objective:
To load a dataset using pandas and understand its structure using basic data inspection
methods.
Theory:
DataFrames are table-like structures in pandas. Understanding the dataset's size,
structure, column data types, and basic statistics is the first step in any analysis.
Tools: Python, pandas
Steps:
1. Import pandas.
2. Load a CSV file using read_csv().
3. Use .head() to preview the first few records.
4. Use .info() to inspect column names, data types, and non-null counts.
5. Use .describe() for summary statistics.
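If sample.csv is not already on disk, a small file can be created first. The snippet below is a minimal sketch; the column names and values are assumptions chosen so that the later practicals can run on the same file.
python
import pandas as pd
# Build a tiny example dataset; the columns (Age, Salary, Gender,
# Education_Level, Experience) match those used in later practicals
sample = pd.DataFrame({
    'Age': [25, 32, None, 51, 38, 29],
    'Salary': [30000, 45000, 72000, 80000, None, 39000],
    'Gender': ['Male', 'Female', 'Female', 'Male', 'Male', 'Female'],
    'Education_Level': ['Bachelors', 'Masters', 'PhD', 'Masters',
                        'Bachelors', 'Masters'],
    'Experience': [2, 7, 20, 25, 12, 5],
})
sample.to_csv('sample.csv', index=False)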
Code:
python
import pandas as pd
# Load dataset
df = pd.read_csv('sample.csv')
# Display first 5 rows
print(df.head())
# Column names, data types, and non-null counts
# (info() prints directly; wrapping it in print() would also print 'None')
df.info()
# Summary statistics
print(df.describe())
Expected Output:
List of columns
Data types (int64, float64, object)
Count of null values
Mean, min, max, std deviation
Conclusion:
Basic inspection methods give a quick overview of the dataset's size, types, and quality, which guides the cleaning and analysis steps that follow.
✅ Practical 2: Histogram – Distribution of a Numerical Variable
Objective:
To visualize the distribution of a numeric variable and gain insights into its central tendency,
spread, and shape using a histogram.
Theory:
Histograms group numeric data into bins. The height of each bar shows how many values fall
into that range.
Tools:
matplotlib: For plotting (pyplot).
seaborn: High-level plotting for beautiful statistical graphics.
Code:
python
import seaborn as sns
import matplotlib.pyplot as plt
# Plot a histogram of 'Age' with a KDE (kernel density estimate) curve;
# df is the DataFrame loaded in Practical 1
sns.histplot(df['Age'], kde=True)
plt.title("Distribution of Age")
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.show()
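The bin count controls how much detail the histogram shows; too few bins hide structure, too many make it noisy. A quick sketch using seaborn's standard bins parameter:
python
# Same data, different levels of detail
sns.histplot(df['Age'], bins=5)
plt.show()
sns.histplot(df['Age'], bins=20)
plt.show()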
Expected Output:
A histogram with an overlaid KDE curve showing the shape of the distribution.
Conclusion:
Histograms are essential for exploring numerical data distributions. They help detect
skewness, outliers, and give a visual understanding of where the majority of values lie —
which is crucial for selecting appropriate transformation or modeling techniques.
✅ Practical 3: Summary Statistics – Mean, Median, Standard Deviation
Objective:
To compute and interpret basic summary statistics such as mean, median, and standard
deviation for a numerical column in a dataset.
Theory:
Mean: Arithmetic average
Median: Middle value
Std: Standard deviation, the typical spread of values around the mean
Code:
python
# Summary statistics for the 'Salary' column
print("Mean:", df['Salary'].mean())
print("Median:", df['Salary'].median())
print("Standard Deviation:", df['Salary'].std())
Conclusion:
Calculating summary statistics is a fundamental step in data analysis. Mean, median, and
standard deviation provide critical insights into the central value and variability of a dataset
— helping guide data pre-processing, outlier detection, and modeling decisions.
✅ Practical 4: Scatter Plot – Relationship Between Two Variables
Objective:
To visualize the relationship or correlation between two numerical variables using a scatter
plot.
Theory:
A scatter plot displays paired values as points, revealing whether the relationship between two variables is linear, non-linear, or absent.
Code:
python
import seaborn as sns
import matplotlib.pyplot as plt
# Scatter plot of Experience vs Salary
sns.scatterplot(x='Experience', y='Salary', data=df)
plt.title("Experience vs Salary")
plt.xlabel("Years of Experience")
plt.ylabel("Salary")
plt.show()
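The visual trend can also be quantified with the Pearson correlation coefficient, computed here via pandas' Series.corr (a small supplementary sketch):
python
# +1: strong positive, -1: strong negative, near 0: weak linear relation
print(df['Experience'].corr(df['Salary']))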
Expected Output:
A plot showing the trend (positive, negative, or no correlation).
Conclusion:
Scatter plots help in detecting relationships between numeric variables and provide visual
evidence of correlation, making them highly useful before applying statistical models or
machine learning algorithms.
✅ Practical 5: Basic Data Cleaning (Missing Values, Outliers)
Objective:
To clean and prepare raw data by handling missing values and outliers, ensuring it is suitable
for further analysis or modeling.
Theory:
Missing data can bias the model.
Outliers can distort statistical analysis.
Steps:
1. Identify missing values with isnull().sum().
2. Handle missing values: drop the affected rows or fill with the mean.
3. Detect outliers using boxplot visualization.
Code:
python
import seaborn as sns
import matplotlib.pyplot as plt
# Step 1: Check missing values
print(df.isnull().sum())
# Step 2: Impute missing values in 'Age' with the mean
df['Age'] = df['Age'].fillna(df['Age'].mean())
# Step 3: Detect outliers in 'Salary' column using a boxplot
sns.boxplot(x=df['Salary'])
plt.title("Outliers in Salary")
plt.show()
Conclusion:
Basic data cleaning is essential to ensure data quality and improve the performance of
analytical models.
✅ Practical 6: Encode Categorical Variables
Objective:
To convert categorical (non-numeric) variables into numerical format so that machine learning
algorithms can interpret and use the data effectively.
Theory:
One-hot encoding creates a binary column per category and suits nominal (unordered) data.
Label encoding maps each category to an integer and suits ordinal (ordered) data.
Code:
python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# One-hot encoding for the nominal 'Gender' column
df = pd.get_dummies(df, columns=['Gender'])
# Label encoding for the ordinal 'Education_Level' column
le = LabelEncoder()
df['Education_Level'] = le.fit_transform(df['Education_Level'])
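Note that LabelEncoder assigns integers in alphabetical order, which may not match the real ranking of an ordinal variable. An explicit mapping keeps the order under control, as an alternative to the LabelEncoder step above (a sketch; the category labels are assumed examples):
python
# Map categories to integers in their true order
order = {'Bachelors': 0, 'Masters': 1, 'PhD': 2}
df['Education_Level'] = df['Education_Level'].map(order)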
Conclusion:
Encoding categorical variables is a critical pre-processing step in machine learning
workflows.
✅ Practical 7: Web Scraping to Collect Data
Objective:
To extract real-world data from a live website using Python libraries and convert it into a
format suitable for analysis.
Theory:
Use requests to fetch HTML and BeautifulSoup to extract data.
Code:
python
import requests
from bs4 import BeautifulSoup
url = "https://example.com/products"
page = requests.get(url)
# Parse the HTML content
soup = BeautifulSoup(page.text, 'html.parser')
# Example: Extract product titles (inside <h2> tags)
titles = soup.find_all('h2')
for t in titles:
    print(t.text.strip())
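Real sites sometimes block default clients or respond slowly, so a slightly more defensive request is common practice (headers, timeout, and raise_for_status are all standard requests features):
python
headers = {'User-Agent': 'Mozilla/5.0 (educational demo)'}
page = requests.get(url, headers=headers, timeout=10)
page.raise_for_status()  # raise an error on a 4xx/5xx response
soup = BeautifulSoup(page.text, 'html.parser')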
Conclusion:
Web scraping is a practical and powerful method to collect live, real-world data directly from
websites.
✅ Practical 8: Pre-process Collected Data
Objective:
To clean and structure the raw data collected from sources like web scraping or APIs,
transforming it into a usable and analysable format.
Theory:
Remove unwanted characters and HTML tags, then store the cleaned values in a DataFrame.
Code:
python
import pandas as pd
# Assuming 'titles' is the list of HTML elements collected in Practical 7;
# .text drops the HTML tags and .strip() removes leading/trailing whitespace
data = {'Title': [t.text.strip() for t in titles]}
df = pd.DataFrame(data)
print(df.head())
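Scraped text often also contains stray punctuation or symbols. A sketch using pandas string methods (the cleaning rule, keeping only letters, digits, and spaces, is an assumption; adjust it to the data):
python
# Keep only letters, digits, and spaces
df['Title'] = df['Title'].str.replace(r'[^A-Za-z0-9 ]', '', regex=True)
print(df.head())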
Conclusion:
Preprocessing is a vital step to transform raw, unstructured data (like web-scraped HTML)
into a clean and structured form.
✅ Practical 9: Handle Missing Data (Imputation/Deletion)
Objective:
To effectively manage missing values in a dataset through deletion or imputation, thereby
maintaining data quality and ensuring reliable analysis.
Code:
python
# Option 1: Drop rows with any missing value
df_dropped = df.dropna()
# Option 2: Impute missing 'Salary' values with the median
df['Salary'] = df['Salary'].fillna(df['Salary'].median())
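dropna also offers finer control than removing every incomplete row; subset and thresh are standard pandas parameters:
python
# Drop rows only when 'Salary' itself is missing
df = df.dropna(subset=['Salary'])
# Keep rows that have at least 3 non-missing values
df = df.dropna(thresh=3)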
Conclusion:
Handling missing data is a critical step in the data cleaning process. While dropping rows can
be a quick fix, imputation methods like filling with the median retain the overall dataset
structure and avoid unnecessary information loss.
✅ Practical 10: Visualize Missing Data
Objective:
To identify the location and proportion of missing values in the dataset visually, helping in
the selection of suitable data cleaning or imputation methods.
Code:
python
import seaborn as sns
import matplotlib.pyplot as plt
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title("Missing Data Heatmap")
plt.show()
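The heatmap shows where values are missing; a short numeric summary shows how much is missing per column:
python
# Percentage of missing values in each column
print((df.isnull().mean() * 100).round(2))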
Conclusion:
Visualizing missing data using a heatmap makes it easier to spot patterns and clusters of
missing values.
✅ Practical 11: Exploratory Data Analysis (EDA)
Objective:
To explore the dataset in-depth by summarizing key statistical properties and visualizing
relationships between multiple features.
Steps:
1. Describe the data with summary statistics.
2. Visualize boxplots for key numeric columns.
3. Explore pair plots across numerical features.
Code:
python
import seaborn as sns
import matplotlib.pyplot as plt
# Step 1: Summary statistics
print(df.describe())
# Step 2: Boxplot for 'Age'
sns.boxplot(x=df['Age'])
plt.title("Boxplot of Age")
plt.show()
# Step 3: Pair plot for all numerical features
sns.pairplot(df)
plt.show()
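Group-level summaries are another common EDA step; a sketch comparing salary statistics across genders (column names assumed from the earlier practicals):
python
# Salary statistics per gender group
print(df.groupby('Gender')['Salary'].describe())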
Conclusion:
Exploratory Data Analysis (EDA) is an essential initial step in any data science or machine
learning project. It provides deep insight into the data’s distribution, relationships, and
anomalies, helping guide feature selection, transformation, and modeling strategies.
✅ Practical 12: Boxplots for Outlier Detection
Objective:
To identify outliers in numerical data using a boxplot and understand how these extreme
values can impact data analysis and modeling.
Theory:
Values below Q1 − 1.5×IQR or above Q3 + 1.5×IQR are treated as outliers, where IQR = Q3 − Q1 is the interquartile range.
Code:
python
import seaborn as sns
import matplotlib.pyplot as plt
# Create a boxplot to detect outliers in 'Salary'
sns.boxplot(y=df['Salary'])
plt.title("Boxplot of Salary")
plt.ylabel("Salary")
plt.show()
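The 1.5×IQR fences from the theory can also be computed directly, which makes it possible to list the outlying rows rather than only see them on the plot:
python
# Compute the 1.5*IQR fences used by the boxplot
q1 = df['Salary'].quantile(0.25)
q3 = df['Salary'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(df[(df['Salary'] < lower) | (df['Salary'] > upper)])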
Conclusion:
Boxplots provide a quick visual way to detect outliers in numerical data. Outliers may be
excluded, transformed (e.g., using log or z-score), or analyzed separately depending on
their cause and relevance to the problem.
✅ Practical 13: Explore Correlation Between Variables
Objective:
To measure and visualize the degree of relationship between numerical variables using a
correlation matrix and heatmap.
Code:
python
import seaborn as sns
import matplotlib.pyplot as plt
# Compute the correlation matrix (numeric columns only)
correlation = df.corr(numeric_only=True)
# Visualize the correlation matrix as a heatmap
sns.heatmap(correlation, annot=True, cmap='coolwarm')
plt.title("Correlation Matrix")
plt.show()
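To read the matrix for a single variable at a time, its column can be sorted; a small usage sketch:
python
# Correlations with 'Salary', strongest positive first
print(correlation['Salary'].sort_values(ascending=False))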
Conclusion:
Correlation analysis helps in feature selection, detecting highly correlated (redundant)
variables, and understanding the linear relationships within the data. It is a key part of EDA
and modeling preparation.
✅ Practical 14: Dimensionality Reduction using PCA
Objective:
To reduce high-dimensional data into fewer dimensions using Principal Component
Analysis (PCA) while preserving as much variance (information) as possible.
Code:
python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import pandas as pd
# Standardize numeric columns to zero mean and unit variance
scaled = StandardScaler().fit_transform(
    df.select_dtypes(include=['float64', 'int64'])
)
# Apply PCA
pca = PCA(n_components=2)
pca_result = pca.fit_transform(scaled)
print(pca.explained_variance_ratio_)
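The two components can then be plotted to view the data in the reduced space (a sketch using matplotlib directly):
python
import matplotlib.pyplot as plt
# Scatter of the first two principal components
plt.scatter(pca_result[:, 0], pca_result[:, 1])
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA Projection (2 Components)")
plt.show()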
Conclusion:
PCA is a powerful tool for dimensionality reduction. It simplifies complex datasets by
projecting them onto fewer dimensions while preserving essential variance.
✅ Practical 15: Hypothesis Testing (t-test)
Objective:
To determine whether the means of two independent groups (e.g., Male vs Female) are
significantly different.
Code:
python
from scipy.stats import ttest_ind
# Extract salary data by gender (dropping missing values)
group1 = df[df['Gender'] == 'Male']['Salary'].dropna()
group2 = df[df['Gender'] == 'Female']['Salary'].dropna()
# Perform an independent two-sample t-test
t_stat, p_value = ttest_ind(group1, group2)
# Output the results
print(f"T-stat: {t_stat}, P-value: {p_value}")
Conclusion:
The t-test statistically compares the means of two groups. If the p-value is less than 0.05, the
difference in means (e.g., salaries of males vs females) is considered statistically significant.