Ds&ba 22
Q1)
a) What is driving data deluge? Explain with one example. [9]
Ans >
The term data deluge refers to the overwhelming flood of data being generated, collected, and stored at an
unprecedented rate in today's digital world. The driving factors behind this explosion of data come from
multiple sources in both structured and unstructured forms. Key drivers include the spread of smartphones and social media, the growth of IoT sensors and connected devices, the rise of online transactions and e-commerce, and cheap cloud storage and computing that make it easy to keep everything.
Example: Smart City Traffic Monitoring
Thousands of traffic sensors and cameras are installed at intersections to monitor vehicle flow.
Real-time data is collected on:
o Vehicle count and types
o Traffic speed and congestion levels
o Number plate recognition for law enforcement
This data is transmitted to central servers every few seconds.
Additional data is fetched from mobile apps like Google Maps or public transport systems.
Impact:
Together, these sources produce a continuous, high-volume stream of sensor, camera, and app data that must be stored, processed, and analyzed in near real time — a typical instance of the data deluge.
Conclusion:
The data deluge is driven by technological advancements, hyper-connectivity, and automation across
industries. While this surge in data presents opportunities for deeper insights and smarter decision-making, it
also poses challenges in terms of storage, processing, and analysis. Understanding and managing the
sources of big data is the first step toward building intelligent systems in fields like healthcare, smart cities,
and digital commerce.
Q1)
b) What is data science? Differentiate between Business Intelligence and
Data Science. [9]
Ans >
Data Science is an interdisciplinary field that uses scientific methods, algorithms, processes, and systems
to extract knowledge and insights from structured and unstructured data. It combines elements from
statistics, computer science, mathematics, and domain knowledge to solve complex problems and enable
data-driven decision-making.
Key steps in Data Science:
1. Data Collection – Gathering data from various sources (databases, APIs, sensors).
2. Data Cleaning and Preparation – Handling missing values, formatting, and normalization.
3. Exploratory Data Analysis (EDA) – Using statistics and visualization to understand patterns.
4. Machine Learning and Predictive Modeling – Building models to forecast trends or automate decisions.
5. Data Visualization and Communication – Presenting findings using charts and dashboards.
Real-world example:
A data scientist at a retail company might analyze customer behavior to predict which products will sell
more during the festive season using past purchase data, weather, and promotions.
Difference between Business Intelligence and Data Science:

Feature     | Business Intelligence                                        | Data Science
Purpose     | Analyzes historical data to provide insights and reporting.  | Uses data (historical + real-time) to predict future outcomes.
Tools Used  | SQL, Power BI, Tableau, Excel.                               | Python, R, Jupyter, TensorFlow, Scikit-learn.
Summary:
Business Intelligence helps organizations understand what happened and supports strategic
decisions through dashboards and visual reports.
Data Science, on the other hand, dives deeper into why it happened and what is likely to happen,
often using complex machine learning models to uncover patterns and automate actions.
Together, both fields are vital in making businesses smarter, but Data Science extends beyond BI by
offering future-focused insights and deeper data exploration.
Q2)
a) What are the sources of Big Data. Explain model building phase with
example. [9]
Ans >
Big Data originates from a wide range of sources, often categorized based on their structure and the nature
of data generation. These sources can be grouped into the following categories:
a) Social Media Data
Generated from platforms like Facebook, Twitter, Instagram, LinkedIn, and YouTube.
Includes text posts, likes, comments, shares, videos, and images.
Example: Tweets about a product launch, user reviews, or trending hashtags.
b) Machine-Generated (IoT) Data
Produced by sensors, embedded systems, and IoT devices in smart homes, industries, vehicles, and
healthcare devices.
Example: Temperature sensors in a smart refrigerator or GPS signals from delivery trucks.
c) Transactional Data
Comes from online and offline transactions (e.g., e-commerce, bank payments, retail purchases).
Structured and often stored in databases or data warehouses.
Example: Credit card transactions, POS billing systems, or flight bookings.
d) Scientific and Research Data
Produced by research experiments, satellite imagery, genome sequencing, and particle physics labs.
Example: Data from NASA’s Earth observation satellites or CERN’s Large Hadron Collider.
The Model Building phase is a core part of the Data Science Lifecycle. After collecting, cleaning, and
analyzing data, the next step is to build a predictive model using suitable algorithms.
Example: House Price Prediction
Goal: Predict the price of a house based on features like size (in sq. ft), number of bedrooms, and
location.
Data: Collected from a real estate portal.
Model: Linear Regression is selected.
Process:
o Split data into training and testing sets.
o Train the model using features like:
X = [Size, Bedrooms, Distance to city]
Y = [Price]
o Model learns weights for each feature.
o Evaluate performance using Mean Absolute Error (MAE).
Output: A regression model that can estimate house prices for new inputs.
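A minimal sketch of how this model building step could look with scikit-learn (the feature values below are hypothetical, not taken from a real estate portal):
# Python code (illustrative sketch of the model building phase; data values are hypothetical)
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Hypothetical data: [Size (sq. ft), Bedrooms, Distance to city (km)]
X = np.array([[1000, 2, 5], [1500, 3, 3], [2000, 3, 8], [1200, 2, 6],
              [1800, 4, 2], [2500, 4, 10], [900, 1, 4], [2100, 3, 7]])
y = np.array([150, 200, 250, 170, 240, 320, 130, 265])   # Price (e.g., in thousands)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Train the model: it learns a weight for each feature plus an intercept
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate performance using Mean Absolute Error (MAE)
y_pred = model.predict(X_test)
print("Learned weights:", model.coef_, "Intercept:", model.intercept_)
print("MAE on test data:", mean_absolute_error(y_test, y_pred))

# Estimate the price of a new house
print("Predicted price:", model.predict([[1600, 3, 4]]))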
Q2)
b) Explain big data analytics architecture with diagram. What is data
discovery phase. Explain with example. [9]
Ans >
Big Data Analytics Architecture is a framework that defines how data from various sources is collected,
processed (in real-time and batch modes), stored, analyzed, and visualized. It ensures that large-scale,
diverse, and fast-moving data can be handled efficiently to support decision-making.
The architecture can be divided into six key components with a central orchestration layer to coordinate
tasks.
1. Data Sources
Structured, semi-structured, and unstructured data from applications, logs, IoT sensors, social media, and transactional systems.
2. Data Ingestion
Collects and moves raw data from these sources into the platform.
3. Data Processing
a. Stream Processing – handles fast-moving, real-time data as it arrives.
b. Batch Processing – processes large volumes of stored data on a schedule.
4. Analytical Data Store
This is the curated data repository used for querying and analysis.
Stores cleaned and transformed data from both stream and batch pipelines.
Tools: Hive, HBase, Amazon Redshift, Google BigQuery
Optimized for fast querying, business intelligence, and machine learning.
5. Analytics and Reporting
Dashboards, reports, and machine learning models built on top of the curated data.
6. Orchestration Layer
Coordinates and schedules data movement and processing tasks across the entire pipeline.
Summary:
The Big Data Analytics Architecture supports the end-to-end lifecycle of big data — from raw input to
actionable output. By integrating both batch and real-time processing, and ensuring data is stored,
processed, and visualized efficiently, this architecture allows organizations to harness the full value of data
for strategic decision-making.
Data Discovery is the phase in the data analytics lifecycle where data is explored, understood, and
prepared before formal analysis or modeling. It involves identifying data patterns, relationships,
outliers, and gaining initial insights through visual and statistical exploration.
Key activities in Data Discovery:
Data Profiling: Understanding the structure, types, ranges, and quality of data.
Exploratory Data Analysis (EDA): Visual tools (histograms, box plots) used to spot trends and
anomalies.
Correlation Analysis: Finding relationships between variables.
Missing Value Analysis: Identifying and deciding how to handle incomplete data.
Outlier Detection: Removing or treating abnormal values that skew results.
Example: While exploring a telecom customer dataset, data discovery might reveal that churned customers tend to have shorter tenure and more service complaints. These discoveries guide the data cleaning process and influence the modeling phase, such as creating a churn prediction model.
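As an illustration, a few common data discovery steps in pandas might look like this (the file name and column names are hypothetical):
# Python code (illustrative data discovery sketch; file and column names are hypothetical)
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("customers.csv")        # load the raw data

# Data profiling: structure, types, ranges, and quality
print(df.info())                         # column types and non-null counts
print(df.describe())                     # ranges and summary statistics

# Missing value analysis
print(df.isnull().sum())                 # missing values per column

# Correlation analysis between numeric variables
print(df.corr(numeric_only=True))

# Outlier detection with a simple visual check (hypothetical numeric column)
df.boxplot(column="monthly_charges")
plt.show()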
Q3)
a) Explain various data pre-processing steps. Discuss essential python
libraries for preprocessing. [8]
Ans >
Data preprocessing is a crucial step in the data science pipeline, often consuming a significant portion of a
data scientist's time (reportedly 70-80%). It involves transforming raw data into a clean, structured, and
understandable format suitable for analysis and machine learning algorithms. High-quality data leads to
more accurate and reliable models.
1. Data Cleaning:
o Purpose: To handle missing values, noisy data, and inconsistent data. Raw data is often incomplete,
contains errors, or has outliers.
o Techniques:
Handling Missing Values:
Deletion: Remove rows or columns with missing values. (Suitable when missing data
is minimal and won't lead to significant data loss).
Imputation: Fill missing values using statistical measures (mean, median, mode),
predictive models (e.g., K-Nearest Neighbors, regression), or domain-specific
knowledge.
Constant Value: Replace with a fixed value (e.g., 'unknown', 0).
Handling Noisy Data: Noise refers to random error or variance in a measured variable.
Binning: Smooth sorted data values by placing them into "bins" and then performing
operations like averaging or median smoothing.
Regression: Use regression functions to smooth the data.
Clustering: Detect and remove outliers by grouping similar data points.
Handling Inconsistent Data: Correcting discrepancies in data, often due to data entry errors
or integration from multiple sources.
Manual Correction: For small datasets.
Data Deduplication: Identifying and merging duplicate records.
Standardization: Ensuring consistent formats (e.g., "USA", "United States", "US" all
mapped to "USA").
2. Data Integration:
o Purpose: To combine data from multiple diverse sources into a coherent data store (e.g., data
warehouse or data lake).
o Challenges:
Schema Integration: Ensuring consistency in naming conventions and data types across
different sources (e.g., cust_id in one table and customer_id in another).
Redundancy and Consistency: Identifying and resolving duplicate entries or conflicting
values when integrating.
Value Conflicts: Different representations for the same entity (e.g., '1' for male in one
source, 'M' in another).
3. Data Transformation:
o Purpose: To convert data into a format suitable for mining. This often involves aggregating data,
normalizing it, or creating new attributes.
o Techniques:
Normalization/Scaling: Scaling numerical features to a standard range (e.g., 0-1 or mean 0,
std dev 1). This is crucial for algorithms sensitive to feature scales (e.g., SVM, k-NN, Neural
Networks).
Min-Max Scaling: Scales features to a range [0, 1].
Z-score Normalization (Standardization): Scales features to have zero mean and
unit variance.
Discretization/Binning: Converting continuous numerical attributes into categorical or
ordinal ones (e.g., converting 'age' into 'youth', 'middle-aged', 'senior'). This can help reduce
noise and improve model interpretability.
Attribute/Feature Construction (Feature Engineering): Creating new attributes from
existing ones to help the mining process. This often requires domain knowledge.
Example: Calculating Age_of_Account from Account_Opening_Date and
Current_Date.
Example: Creating Total_Spend from Quantity * Price.
Aggregation: Summarizing data (e.g., calculating the sum of sales per month, average daily
temperature).
4. Data Reduction:
o Purpose: To obtain a reduced representation of the data set that is much smaller in volume but still
produces almost the same analytical results. This is especially important for Big Data to reduce
storage space and computation time.
o Techniques:
Dimensionality Reduction: Reducing the number of random variables under consideration.
Feature Selection: Identifying and selecting a subset of relevant features for use in
model construction (e.g., removing redundant or irrelevant features). Techniques
include correlation analysis, chi-squared test, Recursive Feature Elimination (RFE).
Feature Extraction: Transforming data from a high-dimensional space to a space of
fewer dimensions (e.g., Principal Component Analysis (PCA), Linear Discriminant
Analysis (LDA)).
Numerosity Reduction: Reducing the data volume by choosing alternative, smaller forms of
data representation.
Sampling: Selecting a representative subset of the data (e.g., random sampling,
stratified sampling).
Data Cube Aggregation: Pre-computing and storing aggregate data views.
Data Compression: Encoding data to reduce storage space.
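To make the steps above concrete, here is a small illustrative sketch in pandas covering missing values, duplicates, standardization, scaling, and binning (the column names and values are hypothetical):
# Python code (illustrative preprocessing sketch; column names and values are hypothetical)
import pandas as pd

df = pd.DataFrame({
    "age":     [25, 32, None, 45, 32, 61],
    "income":  [40000, 52000, 48000, None, 52000, 75000],
    "country": ["US", "USA", "United States", "US", "USA", "US"],
})

# 1. Data cleaning: impute missing values and remove duplicate records
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].mean())
df = df.drop_duplicates()

# Standardization of inconsistent values (e.g., country names)
df["country"] = df["country"].replace({"USA": "US", "United States": "US"})

# 3. Data transformation: Min-Max scaling and discretization (binning)
df["income_scaled"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 50, 100], labels=["youth", "middle-aged", "senior"])

print(df)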
Essential Python Libraries for Preprocessing:
Python has a rich ecosystem of libraries that make data preprocessing efficient and straightforward.
1. NumPy:
o Purpose: The fundamental package for numerical computing in Python. It provides support for large,
multi-dimensional arrays and matrices, along with a collection of high-level mathematical functions
to operate on these arrays.
o Use in Preprocessing:
Handling missing values (e.g., np.nan for Not a Number).
Basic array operations, reshaping data, mathematical transformations on numerical data.
Efficient vectorized operations, which are crucial for large datasets.
2. Pandas:
o Purpose: Built on top of NumPy, Pandas provides highly optimized data structures like DataFrames
(tabular data) and Series (1D labeled arrays) for easy data manipulation and analysis. It's
indispensable for working with structured data.
o Use in Preprocessing:
Data Loading: Reading various file formats (CSV, Excel, SQL databases, JSON, etc.).
Missing Value Handling: Functions like isnull(), dropna(), fillna() for identifying
and managing missing data.
Data Cleaning: Renaming columns, dropping irrelevant columns, handling duplicates
(drop_duplicates()), changing data types (astype()).
Data Integration: Merging and joining DataFrames (merge(), concat()).
Data Transformation: Grouping and aggregation (groupby(), agg()), applying custom
functions (apply()), creating new features.
Filtering and Selection: Selecting specific rows/columns based on conditions.
3. Scikit-learn (sklearn):
o Purpose: A comprehensive machine learning library that provides a wide range of algorithms for
classification, regression, clustering, and also extensive tools for model selection, preprocessing, and
evaluation.
o Use in Preprocessing (under sklearn.preprocessing module):
Scaling/Normalization: StandardScaler (Z-score normalization), MinMaxScaler (Min-
Max scaling), Normalizer (scaling by vector norm).
Encoding Categorical Data:
OneHotEncoder: Converts categorical integer features into one-hot numeric arrays.
LabelEncoder: Encodes target labels with values between 0 and n_classes-1.
OrdinalEncoder: Encodes categorical features as an integer array.
Imputation: SimpleImputer (for mean, median, mode imputation), KNNImputer
(imputation using K-Nearest Neighbors).
Dimensionality Reduction: PCA (Principal Component Analysis).
Feature Selection: Various methods within sklearn.feature_selection.
4. Matplotlib & Seaborn:
o Purpose: While primarily visualization libraries, they are essential during the data discovery and
initial preprocessing phases for understanding data distributions, identifying outliers, and visualizing
relationships.
o Use in Preprocessing (Indirectly but Critically):
Identifying Outliers: Box plots, scatter plots.
Understanding Distributions: Histograms, density plots.
Spotting Relationships: Scatter plots, pair plots, heatmaps for correlation matrices.
These visualizations guide decisions on how to clean and transform the data.
These libraries, especially in combination, form the backbone of most data preprocessing workflows in
Python, enabling data scientists to efficiently prepare data for subsequent analysis and model building.
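As a combined illustration of these libraries, the sketch below imputes, scales, and encodes a tiny hypothetical dataset using scikit-learn's preprocessing tools:
# Python code (illustrative sketch combining the libraries above; the data is hypothetical)
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

df = pd.DataFrame({
    "age":    [25, None, 47, 35],
    "salary": [40000, 52000, None, 61000],
    "city":   ["Pune", "Mumbai", "Pune", "Delhi"],
})

numeric_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # fill missing numbers
    ("scale", StandardScaler()),                    # Z-score normalization
])

preprocessor = ColumnTransformer([
    ("num", numeric_pipeline, ["age", "salary"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),   # one-hot encode categories
])

X = preprocessor.fit_transform(df)
print(X)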
Q3)
b) What are association rules? Explain Apriori Algorithm in brief. [9]
Ans >
Association Rules are a fundamental technique in data mining used to discover interesting relationships
(associations) between variables in large datasets. They are most commonly used in market basket
analysis, where the goal is to find patterns in customer purchases.
Example:
If a customer buys bread and butter, they are also likely to buy milk.
{Bread, Butter} ⇒ {Milk}
Key Metrics:
Support – the fraction of transactions that contain the itemset.
Confidence – how often the rule holds: Confidence(A ⇒ B) = Support(A ∪ B) / Support(A).
Lift – how much more often A and B occur together than expected if they were independent.
The Apriori Algorithm is a classic algorithm used for frequent itemset mining and association rule
learning over transactional databases. It works on the principle that:
"All non-empty subsets of a frequent itemset must also be frequent" (the Apriori property).
This means if an itemset is infrequent, any larger itemset containing it can also be discarded — a key
optimization strategy.
Steps:
1. Scan the database and count support for each candidate itemset.
2. Retain those that meet the minimum support (these are the frequent itemsets).
3. Use the frequent itemsets to generate larger candidate itemsets and repeat until no new frequent itemsets are found.
4. For each frequent itemset, generate rules that meet minimum confidence.
Transactions (sample):
T2: Bread, Butter
T3: Milk, Bread
T5: Bread, Eggs
Time Complexity: Exponential in the worst case, as it generates many candidate sets.
Apriori uses pruning to reduce unnecessary combinations and scans.
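The following short sketch computes support and confidence directly, in the spirit of Apriori's first pass (the full transaction list here is hypothetical, since only part of the original table is shown above):
# Python code (illustrative support/confidence calculation; the transaction list is hypothetical)
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Butter"},
    {"Milk", "Bread"},
    {"Bread", "Butter", "Milk"},
    {"Bread", "Eggs"},
]

def support(itemset):
    # Fraction of transactions that contain every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

# Frequent 2-itemsets with a minimum support of 40%
items = set().union(*transactions)
frequent_pairs = [set(p) for p in combinations(sorted(items), 2) if support(set(p)) >= 0.4]
print("Frequent pairs:", frequent_pairs)

# Confidence of the rule {Bread, Butter} => {Milk}
rule_conf = support({"Bread", "Butter", "Milk"}) / support({"Bread", "Butter"})
print("Confidence({Bread, Butter} => {Milk}):", rule_conf)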
Summary:
The Apriori algorithm is a foundational method for association rule mining. It helps uncover hidden patterns
in large datasets by filtering and expanding frequent itemsets and generating interpretable rules for decision-
making.
Q4)
Ans >
Definition:
Linear Regression is a supervised learning algorithm used to predict a continuous numeric value based
on the linear relationship between the input (independent variable) and the output (dependent variable).
Y = β0 + β1X + ε
Where:
Y = dependent (output) variable, X = independent (input) variable,
β0 = intercept, β1 = slope (coefficient of X), ε = error term.
Example:
Size (sq. ft) | Price
1000          | 150
1500          | 200
2000          | 250
A linear model will try to fit a line through the data to minimize error.
Objective Function:
Minimize the sum of squared errors between actual and predicted values: min Σ (Yi − Ŷi)²
Applications:
House price prediction, sales forecasting, and estimating trends such as salary vs. experience.
Definition:
Logistic Regression is a classification algorithm used to predict binary outcomes (yes/no, 0/1, spam/not
spam). Unlike linear regression, it models the probability that an instance belongs to a class.
The model applies the sigmoid function to produce a probability:
P = 1 / (1 + e^-(β0 + β1X))
Decision Rule:
If P≥0.5, predict 1
If P<0.5, predict 0
Example:
Predicting whether a student will pass (1) or fail (0) an exam based on hours studied.
Hours Studied | Passed (1 = Yes, 0 = No)
2             | 0
5             | 1
3             | 0
7             | 1
Logistic Regression learns the relationship between Hours and Probability of Passing.
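A small sketch of this example with scikit-learn (the extra data points are hypothetical, added only so the model has enough data to fit):
# Python code (illustrative sketch of this example; extra data points are hypothetical)
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hours studied vs. pass (1) / fail (0), extended with a few hypothetical points
X = np.array([[2], [5], [3], [7], [1], [6], [4], [8]])
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])

model = LogisticRegression()
model.fit(X, y)

# Probability of passing for a student who studied 4.5 hours
prob = model.predict_proba([[4.5]])[0][1]
print("P(pass | 4.5 hours) =", round(prob, 3))
print("Prediction:", model.predict([[4.5]]))   # applies the 0.5 decision rule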
Applications:
Spam detection
Disease diagnosis (positive/negative)
Credit card fraud detection
Comparison Table:

Feature   | Linear Regression              | Logistic Regression
Equation  | Y = β0 + β1X + ε               | P = 1 / (1 + e^-(β0 + β1X))
Output    | Continuous numeric value       | Probability of a class (0 or 1)
Used for  | Regression (predicting values) | Classification (predicting categories)
Q4)
c) Explain scikit-learn library for matplotlib with example. [9]
Ans >
1. Introduction to scikit-learn
scikit-learn is one of the most widely used libraries in Python for machine learning. It provides simple and
efficient tools for data mining and data analysis, built on top of other scientific libraries like NumPy, SciPy,
and matplotlib. It includes algorithms for classification, regression, clustering, and dimensionality reduction,
as well as tools for model selection, data preprocessing, and evaluation.
2. Purpose of scikit-learn
Machine Learning Algorithms: Includes algorithms such as decision trees, k-nearest neighbors,
support vector machines, and more.
Data Preprocessing: Offers utilities to preprocess datasets, including feature scaling, encoding
categorical variables, and handling missing data.
Model Selection and Evaluation: Includes functions for cross-validation, hyperparameter tuning,
and performance metrics.
3. scikit-learn and matplotlib
Matplotlib is a plotting library for creating static, animated, and interactive visualizations in Python. It is
often used alongside scikit-learn to visualize data, results of model training, and the performance of machine
learning models. scikit-learn does not have its own visualization tools but integrates seamlessly with
matplotlib for plotting graphs, such as decision boundaries, model performance, and learning curves.
Plotting Decision Boundaries: This is useful in classification tasks to visually represent how a
model differentiates between classes.
Learning Curves: These show how the performance of a model changes as the training size
increases.
Confusion Matrix: A performance metric for classification models, easily visualized using
matplotlib.
4. Example
Let’s walk through an example where we use scikit-learn to build a simple machine learning model (Logistic
Regression) and visualize its decision boundary using matplotlib.
#Python code
# Importing necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
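The listing above stops at the imports; a plausible continuation of the described example is sketched below (illustrative only, not the original exam code; it reuses the imports above):
# (continues from the imports above — illustrative sketch)

# Create a simple 2-feature, 2-class dataset
X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           n_informative=2, n_clusters_per_class=1, random_state=42)

# Split and standardize the features (mean 0, variance 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))

# Plot the decision boundary by predicting over a grid of points
xx, yy = np.meshgrid(np.linspace(X_train[:, 0].min() - 1, X_train[:, 0].max() + 1, 200),
                     np.linspace(X_train[:, 1].min() - 1, X_train[:, 1].max() + 1, 200))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.3)

# Scatter plot of the training points, colored by their actual class
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, edgecolors="k")
plt.title("Logistic Regression Decision Boundary")
plt.show()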
Integration of Matplotlib: In the example, matplotlib is used to visualize the decision boundary
created by the logistic regression model.
Why Standardization?: Logistic regression and many other machine learning models perform
better when the features are standardized, i.e., scaled to have mean 0 and variance 1.
Decision Boundary: The contour plot shows the boundary between the classes predicted by the
logistic regression model. The scatter plot shows the data points, colored according to their actual
class.
Q5)
a) Write short note on
i) Time series Analysis
ii) TF - IDF. [9]
Ans >
i) Time Series Analysis
1. Definition
Time series analysis refers to the process of analyzing data points collected or recorded at specific time
intervals. These data points are typically ordered chronologically and are used to identify patterns, trends,
and other characteristics within the data over time.
2. Purpose
The goal of time series analysis is to forecast future values based on historical data, identify trends,
seasonality, and cyclical patterns, and understand underlying phenomena in various fields like economics,
finance, weather forecasting, and sales prediction.
3. Components of a Time Series
Trend: The long-term movement or direction in the data. It could be upward (growth), downward
(decline), or flat (no change).
Seasonality: Repeated fluctuations or patterns that occur at regular intervals within a year (e.g.,
monthly, quarterly, or weekly).
Cyclic Patterns: Longer-term fluctuations that are not fixed to a specific time period, often linked to
economic or business cycles.
Irregularity (Noise): Random variations or unpredictable events that do not follow a clear pattern.
4. Common Techniques
Moving averages and exponential smoothing to remove noise, decomposition of a series into trend, seasonal, and residual components, and forecasting models such as ARIMA.
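A tiny sketch of one such technique — a moving average in pandas — is shown below (the monthly sales figures are hypothetical):
# Python code (illustrative moving-average sketch; the sales figures are hypothetical)
import pandas as pd

# Monthly sales recorded at regular time intervals
dates = pd.date_range("2023-01-01", periods=12, freq="MS")
sales = pd.Series([100, 120, 130, 125, 160, 170, 165, 180, 210, 205, 230, 250], index=dates)

# A 3-month moving average smooths out short-term noise and exposes the trend
trend = sales.rolling(window=3).mean()

print(pd.DataFrame({"sales": sales, "3-month moving average": trend}))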
ii) TF-IDF
1. Definition
TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic used in text mining and
information retrieval to evaluate the importance of a word in a document relative to a collection or corpus of
documents. It helps in extracting features from text data for tasks like text classification, clustering, and
search engine ranking.
2. Components
Term Frequency (TF): Measures how frequently a term (word) appears in a document. It is
computed as:
TF(t, d) = (number of times term t appears in document d) / (total number of terms in document d)
This gives the relative frequency of a word in a document, highlighting common words.
Inverse Document Frequency (IDF): Measures the importance of a term in the entire corpus. It is
computed as:
IDF(t) = log(N / number of documents containing term t), where N is the total number of documents.
Words that appear in many documents (e.g., "the", "and", "is") will have a low IDF score, while
words that are rare across documents will have a higher IDF score.
3. TF-IDF Score
TF-IDF(t, d) = TF(t, d) × IDF(t)
This combines both the local importance of a term within a document (TF) and its global importance across
the entire corpus (IDF).
4. Purpose
TF-IDF helps identify the most relevant terms in a document relative to a corpus, filtering out common but
less informative words. It is widely used in:
Text classification and clustering
Search engine ranking and information retrieval
Keyword extraction and measuring document similarity
5. Example
In a small mixed corpus, very common words such as "is" or "the" receive low scores, while
words with higher IDF values (like "machine" or "deep") will have higher TF-IDF scores and are
considered more important.
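A quick illustrative computation with scikit-learn's TfidfVectorizer on a hypothetical three-document corpus:
# Python code (illustrative TF-IDF sketch; the documents are hypothetical)
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

docs = [
    "machine learning is fun",
    "deep learning is a part of machine learning",
    "the weather is sunny",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)      # rows = documents, columns = terms

# Show the TF-IDF weight of each term in each document
print(pd.DataFrame(tfidf.toarray(), columns=vectorizer.get_feature_names_out()).round(2))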
Q5)
b) What is clustering? With suitable example explain the steps involved in
k - means algorithm. [9]
Ans >
Clustering is a type of unsupervised learning technique used in machine learning and data analysis. It
involves grouping a set of objects (data points) into clusters or groups, where objects within the same cluster
are more similar to each other than to those in other clusters. Clustering is widely used in various
applications such as customer segmentation, image compression, anomaly detection, and document
classification.
In clustering, the main objective is to identify the inherent structure in data without predefined labels. The
data points in each cluster are assigned based on some similarity metric, often Euclidean distance.
K-Means Algorithm
K-Means is one of the most popular and simple clustering algorithms. It partitions the data into a predefined
number of clusters (denoted as "K") by minimizing the variance within each cluster. The algorithm iterates
to find the optimal placement of cluster centers (centroids).
Steps in the K-Means Algorithm:
1. Initialization
Choose the number of clusters K and randomly select K data points as the initial centroids.
2. Assignment Step
For each data point in the dataset, compute the distance to each of the K centroids.
Assign each data point to the cluster whose centroid is the closest. The most commonly used distance
metric is Euclidean distance:
d(x, μk) = sqrt( Σ (xi − μki)² )
Where x is a data point and μk is the centroid of cluster k.
3. Update Step
After all data points are assigned to clusters, update the centroids by calculating the mean of all data
points in each cluster. This new mean will serve as the updated centroid for the cluster.
The centroid for each cluster is recalculated as:
μk = (1 / |Ck|) Σ x, for all x in Ck
Where Ck is the set of data points in cluster k, and μk is the new centroid for cluster k.
4. Repeat
Repeat the assignment and update steps until the centroids no longer change (convergence) or a maximum number of iterations is reached.
5. Final Clusters
Once the algorithm converges, the final clusters and their centroids are obtained, with each data point
assigned to the cluster whose centroid is the closest.
Let’s consider a small dataset with two features (2D) to illustrate the K-Means algorithm.
Dataset:
Points: (1, 2), (1, 4), (1, 0), (10, 2), (10, 4), (10, 0)
Step-by-Step Execution:
1. Initialization:
o Choose K = 2.
o Randomly select two points as initial centroids. Let’s say the initial centroids are (1, 2) and
(10, 2).
2. Assignment Step:
o Compute the distance from each data point to the centroids:
Distance from (1, 2) to (1, 2) = 0
Distance from (1, 2) to (10, 2) = 9
Distance from (1, 4) to (1, 2) = 2
Distance from (1, 4) to (10, 2) ≈ 9.22
And so on for all points.
o Assign each point to the nearest centroid.
3. Update Step:
o Recalculate each centroid as the mean of its assigned points: the mean of (1, 2), (1, 4), (1, 0) is (1, 2), and the mean of (10, 2), (10, 4), (10, 0) is (10, 2).
o In this case, the centroids remain the same after the first iteration.
4. Repeat:
o Since the centroids did not change, the algorithm has converged.
5. Final Clusters:
o Cluster 1 (Centroid: (1, 2)): (1, 2), (1, 4), (1, 0)
o Cluster 2 (Centroid: (10, 2)): (10, 2), (10, 4), (10, 0)
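The same small example can be reproduced with scikit-learn's KMeans (an illustrative sketch):
# Python code (reproducing the small example above with scikit-learn; illustrative sketch)
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)

print("Cluster labels:", kmeans.labels_)              # which cluster each point belongs to
print("Final centroids:", kmeans.cluster_centers_)    # should be (1, 2) and (10, 2)
print("New point (0, 1) goes to cluster:", kmeans.predict([[0, 1]])[0])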
Advantages:
Simple to understand and implement, computationally efficient, and scales well to large datasets.
Disadvantages:
The number of clusters K must be chosen in advance, results depend on the initial centroids, it is sensitive to outliers, and it struggles with non-spherical clusters.
Conclusion:
Clustering, particularly K-Means, is a foundational unsupervised learning technique used to group data
based on similarity. The K-Means algorithm iterates between assigning points to the nearest centroid and
updating centroids until convergence. It's widely used due to its simplicity and efficiency, although it has
limitations in certain scenarios like non-spherical clusters.
Q6)
a) Write short note on
i) Confusion matrix
ii) AUC - ROC curve [9]
Ans >
i) Confusion Matrix
✅ Definition:
A confusion matrix is a table used to evaluate the performance of a classification algorithm. It shows how
many instances were correctly or incorrectly classified by the model, breaking down the outcomes by
predicted vs. actual class labels.
✅ Structure:
                    Predicted Positive    Predicted Negative
Actual Positive     True Positive (TP)    False Negative (FN)
Actual Negative     False Positive (FP)   True Negative (TN)
✅ Terms Explained:
TP (True Positive): actual positive cases correctly predicted as positive.
TN (True Negative): actual negative cases correctly predicted as negative.
FP (False Positive): actual negative cases wrongly predicted as positive (Type I error).
FN (False Negative): actual positive cases wrongly predicted as negative (Type II error).
✅ Example:
              Predicted P   Predicted N
Actual P          50            10
Actual N           5            35

TP = 50, FN = 10, FP = 5, TN = 35
Accuracy = (TP + TN) / Total = (50 + 35) / 100 = 85%
ii) AUC - ROC Curve
✅ Definition:
A graphical plot that shows the performance of a classification model at various threshold levels.
It plots:
o True Positive Rate (TPR) (Recall) on the Y-axis
o False Positive Rate (FPR) on the X-axis
AUC (Area Under the Curve) summarizes the ROC curve in a single number between 0 and 1 — the closer to 1, the better the model separates the two classes.
This curve shows that as the threshold is varied, the model maintains a good trade-off between TPR and
FPR.
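Both metrics can be computed with scikit-learn; the labels and scores below are hypothetical:
# Python code (illustrative sketch; labels and scores are hypothetical)
from sklearn.metrics import confusion_matrix, accuracy_score, roc_curve, roc_auc_score

# Hypothetical actual labels, predicted labels, and predicted probabilities
y_true  = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.3, 0.95, 0.05]

print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("Accuracy:", accuracy_score(y_true, y_pred))

# ROC curve points (FPR, TPR at different thresholds) and the AUC value
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("FPR:", fpr, "TPR:", tpr)
print("AUC:", roc_auc_score(y_true, y_score))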
✅ Summary:
The confusion matrix gives a detailed count of correct and incorrect predictions at a single threshold, while the AUC-ROC curve summarizes how well the model separates the classes across all thresholds.
Q6)
Ans >
1. Holdout Method
✅ Definition:
The Holdout Method is a simple technique where the dataset is divided into two non-overlapping sets:
Training Set – used to train the model
Testing Set – used to evaluate the model’s performance
✅ How It Works:
The dataset is split once into two parts, commonly 80% for training and 20% for testing. The model is trained on the training set and evaluated once on the unseen test set.
✅ Example:
With an 80/20 split, 80% of the records are used to train the model, and its accuracy on the remaining 20% is reported as the performance estimate.
✅ Advantages:
Simple, fast, and easy to implement.
✅ Disadvantages:
The evaluation depends heavily on how the single split falls, so results can vary a lot, and part of the data is never used for training.
2. Random Subsampling
✅ Definition:
Random Subsampling (also known as Repeated Holdout) improves upon the basic holdout method by
repeating the train/test split multiple times using random partitions and then averaging the results.
✅ How It Works:
The holdout split is repeated several times with different random partitions; the model is trained and tested on each split, and the evaluation scores are averaged.
✅ Example:
Each time, split the data (e.g., 80% train, 20% test)
Get 5 accuracy scores: 88%, 85%, 90%, 87%, 89%
Average accuracy = (88+85+90+87+89)/5 = 87.8%
✅ Advantages:
Gives a more stable and reliable estimate of performance because it does not depend on a single split.
✅ Disadvantages:
Computationally more expensive, and some records may appear in several test sets while others never do.
✅ Comparison Table:
Feature           | Holdout Method               | Random Subsampling
Number of splits  | One                          | Multiple (repeated)
Reliability       | Depends on the single split  | More robust (averaged results)
Computation       | Low                          | Higher
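An illustrative sketch of both methods using scikit-learn's built-in breast cancer dataset (chosen here only as a convenient example):
# Python code (illustrative sketch of both methods; uses a built-in sample dataset)
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# Holdout method: a single 80/20 split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model.fit(X_train, y_train)
print("Holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Random subsampling: repeat the split 5 times and average the accuracies
scores = []
for seed in range(5):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)
    model.fit(X_train, y_train)
    scores.append(accuracy_score(y_test, model.predict(X_test)))
print("Random subsampling accuracies:", np.round(scores, 3))
print("Average accuracy:", np.mean(scores))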
✅ Conclusion:
Both the Holdout and Random Subsampling methods are useful for model validation. While the Holdout
method is quick and simple, it can lead to unreliable evaluations due to its dependency on a single split.
Random Subsampling overcomes this by using multiple splits and averaging results, providing a more
robust measure of model performance.
Q7)
a) With a suitable example explain Histogram and explain its usages. [8]
Ans >
✅ What is a Histogram?
A histogram is a graphical representation of the distribution of numerical data. It is a type of bar chart
that groups values into ranges (called bins) and shows how many values fall into each bin.
Unlike a regular bar chart which is used for categorical data, a histogram is used for continuous
(numerical) data.
✅ Structure of a Histogram:
X-axis: Represents the intervals (bins) into which the data range is divided.
Y-axis: Represents the frequency (count) of data points in each bin.
✅ Example:
Ages = [18, 19, 20, 20, 21, 22, 22, 23, 24, 25]
Frequency table:
Age Range (Bin) | Frequency
18–19           | 2
20–21           | 3
22–23           | 3
24–25           | 2
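The same histogram can be drawn with matplotlib (a minimal sketch using the ages listed above):
# Python code (minimal matplotlib histogram sketch for the example above)
import matplotlib.pyplot as plt

ages = [18, 19, 20, 20, 21, 22, 22, 23, 24, 25]

# Bin edges chosen to match the table above (18–19, 20–21, 22–23, 24–25)
plt.hist(ages, bins=[18, 20, 22, 24, 26], edgecolor="black")
plt.xlabel("Age (bins)")
plt.ylabel("Frequency")
plt.title("Histogram of Student Ages")
plt.show()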
✅ Usages of Histogram:
1. Understanding Data Distribution
Shows the overall shape of the data (symmetric, skewed, uniform, or multi-peaked).
2. Detecting Patterns
3. Identifying Outliers
Unusually isolated bars highlight extreme or unexpected values.
4. Statistical Analysis
Often used in exploratory data analysis (EDA) to summarize large data sets visually.
5. Performance Monitoring
✅ Summary:
Feature   | Description
Data type | Continuous (numerical) data
X-axis    | Intervals (bins) of the data range
Y-axis    | Frequency (count) of values in each bin
Main use  | Visualizing how data is distributed
Q7)
b) Describe the Data visualization tool “Tableau”. Explain its applications
in brief. [9]
Ans >
✅ What is Tableau?
Tableau is a powerful data visualization and business intelligence (BI) tool used to analyze, visualize,
and share data in the form of interactive dashboards and reports.
It allows users to create visually appealing charts, graphs, and maps with simple drag-and-drop
operations — without needing any programming skills.
✅ Key Features of Tableau:
1. Drag-and-Drop Interface:
o Makes it easy for users to build visualizations without coding.
2. Real-Time Data Analysis:
o Connects directly to live databases or data streams for up-to-date reporting.
3. Multiple Data Sources Support:
o Can connect to Excel, SQL databases, cloud data (like Google BigQuery), and more.
4. Interactive Dashboards:
o Users can filter, drill down, and interact with visual elements.
5. Data Blending and Joining:
o Allows combining data from different sources for a unified view.
6. Sharing & Collaboration:
o Dashboards can be published to Tableau Server, Tableau Public, or Tableau Online for
collaboration.
✅ Applications of Tableau:
1. Business Intelligence and Reporting
Used by companies to create interactive dashboards for monitoring KPIs, sales performance, and
trends.
2. Marketing Analytics
Helps marketers track campaigns, website traffic, and customer engagement visually.
3. Healthcare
4. Finance and Banking
Analyze cash flows, transactions, fraud detection, and stock market patterns.
5. Education and Research
Researchers use Tableau to visualize academic performance, survey results, and statistical patterns.
6. Government and Public Sector
Used in policy analysis, crime statistics, public health dashboards (e.g., COVID-19 dashboards).
✅ Advantages:
Fast & Easy to Use – no coding needed, just drag and drop.
✅ Conclusion:
Tableau is a leading data visualization tool widely used in various industries to make data more accessible,
interactive, and actionable. Its ability to handle large datasets and create meaningful visual stories makes it
an essential tool in the modern data-driven world.
Q8)
a) With a suitable example explain and draw a Box plot and explain its
usages. [8]
Ans >
A box plot (also known as a box-and-whisker plot) is a graphical representation used to summarize the
distribution of a dataset. It visually displays the minimum, first quartile (Q1), median (Q2), third
quartile (Q3), and maximum — commonly called the five-number summary.
Box plots are especially useful for detecting outliers, spread, and skewness in the data.
Scores = [45, 50, 52, 58, 60, 65, 70, 75, 78, 85, 90]
➤ Step-by-Step Summary:
Minimum = 45
Q1 = 52 (25th percentile)
Median (Q2) = 65
Q3 = 78 (75th percentile)
Maximum = 90
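A minimal matplotlib sketch that draws the box plot for these scores:
# Python code (minimal matplotlib box plot sketch for the scores above)
import matplotlib.pyplot as plt

scores = [45, 50, 52, 58, 60, 65, 70, 75, 78, 85, 90]

plt.boxplot(scores, vert=True)   # box shows Q1–Q3, line shows the median, whiskers show min/max
plt.ylabel("Score")
plt.title("Box Plot of Student Scores")
plt.show()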
✅ Usages of Box Plot:
1. Understanding Data Spread
The box shows the interquartile range (IQR) and the whiskers show the overall range of the data.
2. Detecting Outliers
Points plotted beyond the whiskers are flagged as potential outliers.
3. Exploratory Data Analysis
Useful in exploratory data analysis (EDA) to quickly inspect data quality and variation.
4. Comparing Groups
Suppose we want to compare math scores across three different schools. We can plot three box plots side-
by-side, which helps easily identify differences in median, spread, and the presence of outliers.
✅ Summary Table:
Term        | Description
Minimum     | Smallest value
Q1          | 25th percentile
Median (Q2) | Middle value (50th percentile)
Q3          | 75th percentile
Maximum     | Largest value
Q8)
b) Describe the challenges of data visualization. Draw box plot and explain
its usages. [9]
Ans >
Data visualization plays a critical role in exploring, understanding, and communicating data insights.
However, it comes with several challenges:
🔹 1. Choosing the Wrong Chart Type
Using an incorrect chart type (e.g., pie chart instead of bar graph) can mislead interpretation.
For example, time series data is best shown with line charts, not scatter plots.
🔹 2. Cluttered or Overloaded Visuals
Adding too many colors, labels, or chart elements may lead to clutter and confusion.
This decreases the clarity and usability of the visual.
🔹 3. Misleading Scales and Axes
Not starting the Y-axis from zero or using inconsistent intervals can distort comparisons.
Example: Manipulated bar charts may exaggerate small differences.
🔹 4. Poor Color Choices and Accessibility
Using colors that are not colorblind-friendly or have poor contrast can hinder understanding.
Overuse of colors also reduces effectiveness.
🔹 5. Handling Large Data Volumes
Visualizing large datasets can result in slow rendering, overlapping data points, or unreadable
visuals.
Solutions involve aggregation, sampling, or interactive dashboards.
🔹 7. Tool Limitations
Some tools restrict chart types, dataset sizes, or interactivity, which limits what can be visualized.
A box plot (or box-and-whisker plot) is a graphical summary that shows the distribution of a numerical
dataset through its five-number summary:
Minimum, Q1, Median, Q3, Maximum
➤ Example Dataset:
Data = [45, 50, 52, 58, 60, 65, 70, 75, 78, 85, 90]
Minimum = 45
Q1 = 52
Median (Q2) = 65
Q3 = 78
Maximum = 90
✅ Usages of Box Plot:
Usage                  | Description
Understand Data Spread | Width of the box shows interquartile range (IQR)
Identify Skewness      | Uneven whiskers or median offset inside the box indicate skew
Quick Data Summary     | Efficient way to summarize without seeing all raw values
✅ Summary:
Data visualization is powerful but comes with technical and human-centered challenges.
A box plot offers a compact view of distribution, spread, and potential outliers in a dataset.