Q1. Introduction to Data Visualization
• Understand the importance of data visualization in analytics.
• Overview of common chart types: bar, line, scatter, histogram, box plot, pie chart.
Answer:
What is Data Visualization?
Data visualization is the graphical representation of information and data. It helps in:
• Understanding patterns and trends in the data
• Communicating insights clearly and effectively
• Making data-driven decisions
Why is it Important?
• Simplifies complex data
• Reveals patterns that aren't obvious in raw data
• Helps detect outliers and anomalies
• Facilitates storytelling with data
Common Chart Types
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load dataset
df = pd.read_csv('titanic.csv')
# Preview data
df.head()
1. Bar Chart
Use: To compare quantities across categories.
# Count of passengers by class
sns.countplot(data=df, x='Pclass')
plt.title('Passenger Count by Class')
plt.xlabel('Class')
plt.ylabel('Count')
plt.show()
2. Line Chart
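Use: To track changes over time.
# The Titanic data has no date column, so simulate one:
# treat each PassengerId as a day offset from 1900-01-01.
# A new column is used so the original PassengerId values are kept.
df['SimDate'] = pd.to_datetime(df['PassengerId'], unit='D', origin='1900-01-01')
df.groupby(df['SimDate'].dt.year)['Fare'].mean().plot()
plt.title('Average Fare Over Time')
plt.xlabel('Year')
plt.ylabel('Average Fare')
plt.show()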
3. Scatter Plot
Use: To show the relationship between two numeric variables.
sns.scatterplot(data=df, x='Age', y='Fare')
plt.title('Age vs Fare')
plt.show()
4. Histogram
Use: To view the distribution of a single numeric variable.
sns.histplot(data=df, x='Age', bins=30, kde=True)
plt.title('Age Distribution')
plt.show()
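To see what a histogram actually counts, the sketch below bins a small list of ages with NumPy. The ages and the bin edges are made-up illustrative values, not figures from the Titanic data.

```python
import numpy as np

# Hypothetical ages, used only to illustrate binning
ages = np.array([2, 15, 22, 24, 28, 31, 35, 38, 47, 63])

# Four equal-width bins covering 0 to 80 years
counts, edges = np.histogram(ages, bins=4, range=(0, 80))

# Each bar of a histogram is one of these (interval, count) pairs
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:>4.0f} to {right:<4.0f}: {count}")
```

Changing the number of bins changes the shape of the histogram, so it is worth trying a few values before settling on one.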
5. Box Plot
Use: To show distribution and detect outliers.
sns.boxplot(data=df, x='Pclass', y='Age')
plt.title('Age Distribution by Class')
plt.show()
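By default, Matplotlib and Seaborn extend box-plot whiskers to the most extreme points within 1.5 times the interquartile range (IQR) of the box, and draw anything beyond that as an outlier. The sketch below applies the same rule by hand to a short list of hypothetical fares.

```python
import pandas as pd

# Hypothetical fares, with one unusually large value
fares = pd.Series([7.25, 8.05, 13.0, 26.0, 30.0, 35.5, 512.33])

q1 = fares.quantile(0.25)
q3 = fares.quantile(0.75)
iqr = q3 - q1

# Fences used by the default box-plot rule
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = fares[(fares < lower) | (fares > upper)]
print(outliers.tolist())
```

Listing the flagged values this way makes it easier to decide whether they are data errors or genuine extremes.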
6. Pie Chart
Use: To show proportions of a whole.
# Pie chart of survival
# sort_index() orders the counts as 0, 1 so they line up with the labels below
survived_counts = df['Survived'].value_counts().sort_index()
labels = ['Not Survived', 'Survived']
plt.pie(survived_counts, labels=labels, autopct='%1.1f%%', startangle=140)
plt.title('Survival Rate')
plt.axis('equal')
plt.show()
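The percentages that autopct prints on a pie chart can also be computed directly. This is a minimal sketch on a hypothetical list of survival flags; it derives the labels from the index instead of hard-coding their order.

```python
import pandas as pd

# Hypothetical survival flags (0 = did not survive, 1 = survived)
survived = pd.Series([0, 0, 1, 0, 1, 0, 1, 0])

# normalize=True returns proportions; sort_index() fixes the order as 0, 1
shares = survived.value_counts(normalize=True).sort_index()

# Map each code to its label so labels always match the data
labels = shares.index.map({0: 'Not Survived', 1: 'Survived'})

for label, share in zip(labels, shares):
    print(f"{label}: {share:.1%}")
```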
Q2. Tools and Libraries for Visualization
• Introduction to Python libraries: Matplotlib, Seaborn, and Plotly.
• Install necessary libraries and understand their use cases.
Answer:
Library | Use Case | Strengths
Matplotlib | Base library for all plots | Highly customizable, good for static charts
Seaborn | Statistical visualization | Clean, attractive default themes, simplifies complex plots
Plotly | Interactive plots | Great for dashboards and web apps
Installing the Libraries
Run the following in your terminal (in a Jupyter Notebook cell, use %pip install instead):
pip install matplotlib seaborn plotly
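After installing, it can help to confirm which versions are available. The sketch below uses only the Python standard library; the helper name installed_versions is an assumption made for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(["matplotlib", "seaborn", "plotly"]).items():
    print(f"{name}: {ver or 'not installed'}")
```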
1. Matplotlib – The Foundation
Overview: It’s the base library used to create static, animated, and interactive plots in Python.
import matplotlib.pyplot as plt
# Simple line chart
x = [1, 2, 3, 4]
y = [10, 20, 25, 30]
plt.plot(x, y)
plt.title("Simple Line Plot")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.grid(True)
plt.show()
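Matplotlib also offers an object-oriented interface, where the figure and axes are explicit objects. The sketch below redraws the same line chart that way; the marker and legend label are additions made for illustration.

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [10, 20, 25, 30]

# plt.subplots() returns the figure and one axes object to draw on
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x, y, marker="o", label="y values")
ax.set_title("Simple Line Plot")
ax.set_xlabel("X-axis")
ax.set_ylabel("Y-axis")
ax.grid(True)
ax.legend()
```

Call plt.show() to display the figure, or fig.savefig("plot.png") to write it to a file. This style tends to scale better once a figure holds several subplots.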
2. Seaborn – Built on Matplotlib
Overview: Makes it easier to create beautiful and informative statistical plots.
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt  # needed for plt.title() and plt.show()
# Load example dataset
df = sns.load_dataset('tips')
# Seaborn scatter plot
sns.scatterplot(data=df, x='total_bill', y='tip', hue='sex')
plt.title("Total Bill vs Tip by Gender")
plt.show()
3. Plotly – For Interactive Plots
Overview: Best for interactive, zoomable, and hoverable plots. Excellent for web apps and dashboards.
import plotly.express as px
# Load built-in dataset
df = px.data.iris()
# Interactive scatter plot
fig = px.scatter(df, x='sepal_width', y='sepal_length', color='species', title="Iris Sepal Dimensions")
fig.show()
Note: Plotly figures are displayed with fig.show(), which renders them in a Jupyter Notebook or opens them in a browser. There is no need for plt.show().
Q3. Dataset Loading and Exploration
• Load real-world datasets using Pandas.
• Use .head(), .tail(), .info(), .describe() to explore data.
Answer:
Loading a Dataset
import pandas as pd
# Load Titanic dataset
df = pd.read_csv("titanic.csv")
# Show the first 5 rows
df.head()
Exploring the Dataset
.head() – View the first few rows
df.head(3) # First 3 rows
.tail() – View the last few rows
df.tail(3) # Last 3 rows
.info() – Overview of columns, data types, non-null counts
df.info()
.describe() – Summary statistics for numeric columns
df.describe()
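A few other pandas calls are often used alongside these four. The sketch below runs them on a small hypothetical sample with Titanic-style column names, so it works without the CSV file.

```python
import pandas as pd

# Small hypothetical sample with Titanic-style column names
df = pd.DataFrame({
    "Pclass": [1, 3, 3, 2, 3],
    "Sex": ["female", "male", "male", "female", "male"],
    "Age": [38.0, 22.0, None, 27.0, 35.0],
})

print(df.shape)                     # (rows, columns)
print(df.nunique())                 # distinct non-missing values per column
print(df["Sex"].value_counts())     # frequency of each category
print(df.describe(include="all"))   # summary that also covers text columns
```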
Q4. Understanding Variable Types
• Differentiate between categorical, numerical, discrete, and continuous variables.
• Identify types of variables in a dataset.
Answer:
Types of Variables
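Type | Description | Examples
Categorical | Represent categories or groups | Gender, Class, Embarked
Numerical | Represent measurable quantities | Age, Fare
➤ Discrete | Countable values (integers) | Number of siblings, Pclass
➤ Continuous | Measurable values (fractions allowed) | Age, Fare
Note: Pclass is stored as integers (1, 2, 3) but represents an ordered category, so it is usually treated as categorical (ordinal) when plotting.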
Let's Work with the Titanic Dataset
import pandas as pd
# Load dataset
df = pd.read_csv('titanic.csv')
df.head()
Identify Variable Types
# Check data types
df.dtypes
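The dtypes can be turned into a rough first guess at each column's variable type. This is a sketch only: the helper classify_columns and its rules are assumptions made for this example, and the sample values are made up.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype


def classify_columns(df):
    """Rough first guess at each column's variable type from its dtype."""
    kinds = {}
    for name, column in df.items():
        if is_float_dtype(column):
            kinds[name] = "numerical (continuous)"
        elif is_integer_dtype(column):
            kinds[name] = "numerical (discrete)"
        else:
            kinds[name] = "categorical"
    return kinds


# Hypothetical sample with Titanic-style column names
sample = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "SibSp": [1, 0, 2],
    "Fare": [7.25, 71.28, 8.05],
})
print(classify_columns(sample))
```

The guess still needs a manual review. Integer-coded categories such as Survived or Pclass will be reported as discrete, and an integer column containing missing values is read by pandas as float.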
Q5. Data Cleaning and Preparation for Visualization
• Handle missing values, remove duplicates, and convert data types.
• Prepare clean data for analysis and plotting.
Answer:
Step 1: Handling Missing Values
Identify Missing Values
df.isnull().sum()
Drop or Fill Missing Values
Drop missing rows (when a row has too many nulls or the rows aren't crucial):
df_cleaned = df.dropna(subset=['Embarked'])
Fill missing values (with mean, median, or mode):
# Assign the result back; calling fillna(..., inplace=True) on a single column
# raises a warning in recent pandas versions and may not modify df
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
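The sketch below walks through the same fill steps on a tiny hypothetical frame, so the effect is easy to check by eye. It assigns the filled column back rather than calling fillna(..., inplace=True) on a column, which recent pandas versions warn about.

```python
import pandas as pd

# Hypothetical rows with gaps in Age and Embarked
df = pd.DataFrame({
    "Age": [22.0, None, 30.0, 40.0],
    "Embarked": ["S", "C", None, "S"],
})

# Share of missing values per column, as a percentage
missing_pct = df.isnull().mean() * 100
print(missing_pct)

# Numeric column: fill with the median; text column: fill with the mode
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])
print(df)
```

The percentage of missing values is a useful guide: a column that is mostly empty is often better dropped than filled.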
Step 2: Removing Duplicates
# Check and remove duplicates
print("Duplicates:", df.duplicated().sum())
df.drop_duplicates(inplace=True)
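Before dropping duplicates it is worth looking at them. A minimal sketch on hypothetical rows:

```python
import pandas as pd

# Hypothetical passenger rows; the first and third are identical
df = pd.DataFrame({
    "Name": ["Allen", "Braund", "Allen", "Cumings"],
    "Ticket": ["373450", "A/5 21171", "373450", "PC 17599"],
})

# keep=False marks every copy, which helps when inspecting duplicates
print(df[df.duplicated(keep=False)])

# Keep the first occurrence and renumber the rows
deduped = df.drop_duplicates().reset_index(drop=True)
print(len(df), "->", len(deduped))
```

drop_duplicates() also accepts a subset argument, for cases where rows should count as duplicates when only certain columns match.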
Step 3: Convert Data Types
Ensure columns are in the correct format:
# Convert Survived to category
df['Survived'] = df['Survived'].astype('category')
# Convert Embarked to category
df['Embarked'] = df['Embarked'].astype('category')
# Confirm changes
df.dtypes
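Two other conversions come up often: numbers stored as text, and categories that have a natural order. The sketch below shows both on hypothetical values; treating 1st class as the highest rank is an assumption made for this example.

```python
import pandas as pd

# Hypothetical data: Fare arrived as text, with one unparseable entry
df = pd.DataFrame({
    "Fare": ["7.25", "71.28", "unknown"],
    "Pclass": [3, 1, 2],
})

# Text to numbers; values that cannot be parsed become NaN
df["Fare"] = pd.to_numeric(df["Fare"], errors="coerce")

# Ordered category, listed from lowest rank (3rd class) to highest (1st class)
class_type = pd.CategoricalDtype(categories=[3, 2, 1], ordered=True)
df["Pclass"] = df["Pclass"].astype(class_type)

print(df.dtypes)
print(df["Pclass"].min(), df["Pclass"].max())
```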
Clean Data Ready!
# Final check
df.info()  # prints its report directly and returns None, so no print() is needed
print(df.isnull().sum())
Q6. Creating Basic Plots Using Matplotlib
• Plot line charts, bar charts, histograms using Matplotlib.
• Customize plots with titles, labels, legends, and colors.
Answer:
import pandas as pd
import matplotlib.pyplot as plt
# Load dataset
df = pd.read_csv("titanic.csv")
1. Line Chart
# Average fare by class
fare_by_class = df.groupby('Pclass')['Fare'].mean()
# Plot line chart
plt.plot(fare_by_class.index, fare_by_class.values, color='green', marker='o', linestyle='--')
plt.title('Average Fare by Passenger Class')
plt.xlabel('Passenger Class')
plt.ylabel('Average Fare')
plt.grid(True)
plt.xticks([1, 2, 3])
plt.show()
2. Bar Chart
# Count of passengers per class
class_counts = df['Pclass'].value_counts().sort_index()
# Bar chart
plt.bar(class_counts.index, class_counts.values, color=['skyblue', 'salmon', 'lightgreen'])
plt.title('Passenger Count by Class')
plt.xlabel('Passenger Class')
plt.ylabel('Count')
plt.xticks([1, 2, 3])
plt.show()
3. Histogram
# Drop missing values in 'Age'
ages = df['Age'].dropna()
# Histogram
plt.hist(ages, bins=20, color='purple', edgecolor='black')
plt.title('Age Distribution of Passengers')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.grid(axis='y', alpha=0.5)
plt.show()
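Legends matter once a plot holds more than one series. The sketch below draws two lines with labels and a titled legend; the fare values are illustrative numbers, not results computed from the dataset.

```python
import matplotlib.pyplot as plt

# Hypothetical average fares per class, split by sex
classes = [1, 2, 3]
fare_female = [106.0, 22.0, 16.0]
fare_male = [67.0, 20.0, 13.0]

fig, ax = plt.subplots(figsize=(6, 4))

# The label of each line is what appears in the legend
ax.plot(classes, fare_female, color="crimson", marker="o", label="Female")
ax.plot(classes, fare_male, color="navy", marker="s", linestyle="--", label="Male")

ax.set_title("Average Fare by Class and Sex (illustrative values)")
ax.set_xlabel("Passenger Class")
ax.set_ylabel("Average Fare")
ax.set_xticks(classes)
ax.legend(title="Sex", loc="upper right")
```

Call plt.show() to display the figure. The same label and legend pattern works with plt.bar() and plt.hist().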
Q7. Advanced Visualization Using Seaborn
• Create scatter plots, box plots, violin plots, and pair plots.
• Use hue, style, and palette for deeper analysis.
Answer:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
# Load Titanic dataset
# Seaborn example dataset (downloaded on first use); its column names are lowercase
df = sns.load_dataset('titanic')
1. Scatter Plot
sns.scatterplot(data=df, x='age', y='fare', hue='sex', style='class', palette='Set2')
plt.title("Age vs Fare by Gender and Class")
plt.show()
2. Box Plot
sns.boxplot(data=df, x='class', y='age', hue='sex', palette='coolwarm')
plt.title("Age Distribution by Class and Gender")
plt.show()
3. Violin Plot
sns.violinplot(data=df, x='class', y='age', hue='sex', split=True, palette='muted')
plt.title("Age Distribution by Class and Gender (Violin Plot)")
plt.show()
4. Pair Plot
sns.pairplot(df[['age', 'fare', 'survived', 'sex']], hue='sex', palette='husl')
plt.suptitle("Pairwise Relationships", y=1.02)
plt.show()
Q8. Multivariate Analysis with Seaborn
• Heatmaps and correlation matrices to analyze relationships between multiple variables.
• Apply sns.heatmap() and sns.pairplot().
Answer:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
# Load dataset
df = sns.load_dataset('titanic')
1. Correlation Matrix
# Select numeric columns only
num_df = df.select_dtypes(include='number')
# Compute correlation matrix
corr_matrix = num_df.corr()
# Display correlation matrix
print(corr_matrix)
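A correlation matrix lists every pair twice and includes each column's correlation with itself. The sketch below pulls out each distinct pair once and ranks them by strength. The data frame is a small hypothetical sample, so the numbers are not the Titanic results.

```python
from itertools import combinations

import pandas as pd

# Hypothetical numeric columns; 'fare' is built to fall as 'pclass' rises
df = pd.DataFrame({
    "pclass": [1, 1, 2, 2, 3, 3],
    "fare": [80.0, 70.0, 30.0, 25.0, 8.0, 7.0],
    "age": [40.0, 22.0, 35.0, 28.0, 30.0, 19.0],
})

corr = df.corr()

# One entry per distinct pair, strongest absolute correlation first
pairs = sorted(
    ((a, b, corr.loc[a, b]) for a, b in combinations(corr.columns, 2)),
    key=lambda item: abs(item[2]),
    reverse=True,
)

for a, b, r in pairs:
    print(f"{a} vs {b}: {r:+.2f}")
```

Sorting by absolute value keeps strong negative correlations near the top, where a plain sort would bury them.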
2. Heatmap Using sns.heatmap()
plt.figure(figsize=(10, 6))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="coolwarm", linewidths=0.5)
plt.title("Correlation Heatmap - Titanic Numeric Features")
plt.show()
3. Pair Plot (Revisited for Multivariate Analysis)
sns.pairplot(df[['age', 'fare', 'pclass', 'survived']], hue='survived', palette='Set1')
plt.suptitle("Pairwise Plot of Age, Fare, Pclass, and Survival", y=1.02)
plt.show()
Q9. Time Series and Trend Analysis
• Plot time-based data using Pandas and Matplotlib.
• Perform trend analysis and plot rolling averages.
• Select a real dataset (e.g., COVID-19, IPL stats, sales data).
Answer:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Load dataset
df = pd.read_csv("titanic.csv")
A. Plotting Time-Based Data
# The Titanic data has no date column, so create a simulated 'Date':
# each passenger gets a random day from the 100 days ending April 15, 1912.
# (Assigning one distinct date per row would make every daily count equal 1.)
rng = np.random.default_rng(42)  # fixed seed so the plot is reproducible
days = pd.date_range(end="1912-04-15", periods=100)
df['Date'] = rng.choice(days.to_numpy(), size=len(df))
# Sort by date
df.sort_values('Date', inplace=True)
# Group by date and count passengers
daily_passengers = df.groupby('Date').size()
# Plotting daily passenger entries
plt.figure(figsize=(12, 5))
daily_passengers.plot(kind='line', title='Simulated Passenger Entries Over Time')
plt.xlabel("Date")
plt.ylabel("Number of Passengers")
plt.grid(True)
plt.show()
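Grouping by date only returns the dates that occur in the data, so days with no records are silently skipped. The sketch below shows one way to restore them as zero counts before plotting. The dates are hypothetical.

```python
import pandas as pd

# Hypothetical event dates; nothing was recorded on 3 or 4 January
dates = pd.to_datetime(["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-05"])
events = pd.DataFrame({"Date": dates})

# Only three distinct dates appear in the grouped result
daily = events.groupby("Date").size()

# asfreq('D') inserts the missing calendar days; fill_value=0 records them as zero
daily_full = daily.asfreq("D", fill_value=0)
print(daily_full)
```

Without this step a line plot joins the neighbouring points straight across the gap, and a rolling window covers a different number of calendar days at each position.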
B. Rolling Averages (Trend Smoothing)
# 7-day rolling average
rolling_avg = daily_passengers.rolling(window=7).mean()
plt.figure(figsize=(12, 5))
plt.plot(daily_passengers, label='Daily Count')
plt.plot(rolling_avg, label='7-Day Rolling Average', color='red')
plt.title("Trend of Simulated Passenger Entries (with Smoothing)")
plt.xlabel("Date")
plt.ylabel("Passenger Count")
plt.legend()
plt.grid(True)
plt.show()
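The rolling() call has a few options that change how the smoothed line looks at its edges. The sketch below compares three variants on a short hypothetical series, using a 3-day window so the arithmetic is easy to follow.

```python
import pandas as pd

# Hypothetical daily counts for one week
counts = pd.Series(
    [4, 6, 5, 9, 7, 8, 10],
    index=pd.date_range("2024-01-01", periods=7, freq="D"),
)

# Default: the first window-1 values are NaN
trailing = counts.rolling(window=3).mean()

# min_periods=1 averages whatever is available, so the line starts on day one
partial = counts.rolling(window=3, min_periods=1).mean()

# center=True aligns each average with the middle of its window
centered = counts.rolling(window=3, center=True).mean()

print(pd.DataFrame({
    "count": counts,
    "trailing": trailing,
    "partial": partial,
    "centered": centered,
}))
```

A trailing window lags behind turning points in the data, while a centered window tracks them more closely but cannot be computed for the most recent days.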