Model Question Paper With Effect From 2021 (CBCS Scheme) : Data Science and Visualization
21CS644
Model Question Paper-1/2 with effect from 2021(CBCS Scheme)
USN
Note: 01. Answer any FIVE full questions, choosing at least ONE question from each MODULE.
(Each question is tagged with its Bloom's Taxonomy Level, Course Outcome (CO), and Marks.)
Module-1
Q.01 a What is data science? List and explain the skill set required in a data science profile. [L2, CO1, 6 Marks]
     b Explain probability distribution with an example. [L2, CO1, 6 Marks]
     c Describe the process of fitting a model to a dataset in detail. [L2, CO1, 8 Marks]
OR
Q.02 a Explain, with a neat diagram, the current landscape of the data science process. [L2, CO1, 6 Marks]
     b Explain population and sample with an example. [L2, CO1, 6 Marks]
     c What is big data? Explain the 5 elements of big data in detail. [L2, CO1, 8 Marks]
Module-2
Q.03 a What is machine learning? Explain the linear regression algorithm. [L2, CO2, 6 Marks]
     b Explain the K-means algorithm with an example. [L2, CO2, 6 Marks]
     c Describe the philosophy of EDA in detail. [L2, CO2, 8 Marks]
OR
Q.04 a Explain the data science process with a neat diagram. [L2, CO2, 6 Marks]
     b Explain the KNN algorithm with an example. [L2, CO2, 6 Marks]
     c Develop an R script for EDA. [L3, CO2, 8 Marks]
Module-3
Q.05 a Explain the fundamental differences between linear regression and logistic regression. [L2, CO3, 6 Marks]
     b Explain selecting an algorithm in the wrapper method. [L2, CO3, 6 Marks]
     c Explain the decision tree for the Chasing Dragons problem. [L3, CO3, 8 Marks]
OR
Q.06 a Briefly explain the alternating least squares method. [L2, CO3, 6 Marks]
     b Explain the different selection criteria in feature selection. [L2, CO3, 6 Marks]
     c Explain the dimensionality problem with SVD in detail. [L3, CO3, 8 Marks]
Module-4
Q.07 a Define data visualization and explain its importance in data analysis. [L2, CO4, 6 Marks]
     b Describe the different types of plots among comparison plots. [L2, CO4, 6 Marks]
     c Plot the following: i) density plot ii) box plot iii) violin plot iv) bubble plot [L3, CO4, 8 Marks]
OR
Q.08 a Describe the process of data wrangling and its significance in data visualization. [L2, CO4, 6 Marks]
     b Explain the variants of the bar chart with examples. [L2, CO4, 6 Marks]
     c Explain the different types of plots among relation plots. [L2, CO4, 8 Marks]
Module-5
Q.09 a Develop code for labels and titles in Matplotlib. [L3, CO5, 6 Marks]
     b Apply code for a basic pie chart. [L3, CO5, 6 Marks]
Q.1 (b) Explain Datafication.
Ans- Datafication refers to the process of turning various aspects of human life and
the physical world into data that can be collected, analyzed, and used for decision-
making. It involves transforming everyday interactions, behaviors, processes, and
objects into a quantifiable format. With the rise of digital technology, almost every
action or event can be recorded and transformed into data, including social
interactions, business transactions, physical movements, and even biological
processes.
Examples of Datafication:
Social Media Interactions: Platforms like Facebook, Instagram, and Twitter collect
vast amounts of data from user interactions (likes, shares, comments), which can be
analyzed for trends, preferences, and behavior.
Wearables and IoT: Devices like fitness trackers and smart home appliances collect
health, movement, and usage data that can be used to provide insights or
recommendations.
Business Operations: Retail transactions, supply chains, and customer service
interactions are all datafied, allowing businesses to optimize processes, improve
customer experiences, and predict future trends.
In short, datafication is the shift where more and more of our world is being
transformed into data, allowing for its use in analysis, machine learning, and
decision-making processes.
Q.2 (a) Explain statistical Inference.
Ans- Statistical inference is a branch of statistics that focuses on drawing
conclusions about a population based on a sample of data from that population. It
involves two key concepts: estimation and hypothesis testing.
Key Concepts
1. Population and Sample:
o Population: The entire group of individuals or instances about whom we
want to learn.
o Sample: A subset of the population selected for analysis.
2. Estimation:
o Point Estimation: Provides a single value (point estimate) as an estimate
of a population parameter (e.g., sample mean as an estimate of population
mean).
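As a small illustration of point estimation (a sketch only; the sample values below are made up), the sample mean serves as a point estimate of the population mean:
import numpy as np

# Hypothetical sample drawn from a larger population (values are illustrative)
sample = np.array([52, 48, 55, 60, 47, 51, 58, 50])

point_estimate = sample.mean()                          # sample mean as point estimate
std_error = sample.std(ddof=1) / np.sqrt(len(sample))   # standard error of that estimate

print(f"Point estimate of the population mean: {point_estimate:.2f}")
print(f"Standard error: {std_error:.2f}")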
+------------------+
|    Deployment    |
+--------+---------+
         |
         v
+------------------+
|    Monitoring    |
| and Maintenance  |
+------------------+
Q.4 (b) Which machine learning algorithm should be used when you have a bunch of objects
that are already classified and, based on them, other similar objects that have not yet been
classified are to be labelled automatically? Explain.
Ans- When you have a set of classified objects (labeled data) and want to classify
new, unlabeled objects based on the existing classifications, the most suitable
machine learning algorithm to use is supervised learning. Within this category, you
can choose from several algorithms, depending on the nature of your data and the
specific requirements of your task. Here are some commonly used algorithms:
1. k-Nearest Neighbors (k-NN):
o Description: This is a simple and intuitive algorithm that classifies a new
object based on the majority class of its k nearest neighbors in the feature
space.
o Use Case: It works well when you have a well-defined metric for distance
(e.g., Euclidean distance) and when your dataset is not too large, as it can
become computationally expensive. (A short sketch of this approach appears after this list.)
2. Support Vector Machines (SVM):
o Description: SVM finds the hyperplane that best separates the classes in
the feature space. It works well for both linear and non-linear classification
using kernel functions.
o Use Case: SVM is effective in high-dimensional spaces and can be used
for both binary and multi-class classification problems.
3. Decision Trees:
o Description: A decision tree learns a hierarchy of if-then splits on feature values
from the labelled data; a new object is classified by following the branches that
match its features down to a leaf.
o Use Case: Decision trees are easy to interpret and handle both numerical and
categorical features, which makes them useful when the classification rules need
to be explained.
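A minimal k-NN sketch of the labelling workflow referred to above (assuming scikit-learn is available; the feature values and class labels are invented purely for illustration):
from sklearn.neighbors import KNeighborsClassifier

# Already-classified objects: each row is [feature_1, feature_2]
X_labeled = [[1.0, 2.1], [1.2, 1.9], [8.0, 8.5], [7.8, 8.1]]
y_labels = ['cat', 'cat', 'dog', 'dog']

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_labeled, y_labels)

# New, unlabelled objects receive the majority label of their nearest neighbours
X_new = [[1.1, 2.0], [8.1, 8.3]]
print(knn.predict(X_new))   # expected: ['cat' 'dog']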
o Features that provide similar information can lead to redundancy. The goal
is to reduce redundancy by selecting a diverse set of features that provide
unique information.
3. Performance Improvement:
o Features should be selected based on their contribution to model
performance. Cross-validation can be used to assess how different feature
subsets affect performance metrics (e.g., accuracy, precision, recall).
4. Simplicity:
o A simpler model with fewer features is often preferred as it is easier to
interpret and less likely to overfit. Models with too many features can
become overly complex and difficult to maintain.
5. Computational Efficiency:
o The time and resources required to train the model should be considered.
Selecting fewer features can lead to faster training times and lower
computational costs.
6. Domain Knowledge:
o Understanding the domain can provide insights into which features are
likely to be important based on theoretical or practical considerations.
Conclusion
Feature selection is a vital part of the machine learning pipeline that can significantly
impact model performance. By using a combination of filter, wrapper, and embedded
methods, along with carefully defined selection criteria, you can improve the quality
and efficiency of your models.
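As a hedged sketch of a wrapper-style selection (scikit-learn is assumed; the synthetic dataset is illustrative only), recursive feature elimination with cross-validation can be run as follows:
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, only a few of which carry signal
X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)

# Wrapper method: repeatedly fit the model, drop the weakest features,
# and use cross-validated accuracy to decide how many features to keep
selector = RFECV(estimator=LogisticRegression(max_iter=1000), step=1, cv=5)
selector.fit(X, y)

print("Selected feature mask:", selector.support_)
print("Optimal number of features:", selector.n_features_)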
Q.5 (b) Define Feature Extraction. Explain different categories of information.
Ans- Feature Extraction is the process of transforming raw data into a set of
meaningful features that can be effectively used in machine learning models. It
involves reducing the dimensionality of the data while preserving its essential
characteristics, thus enhancing the model's ability to learn patterns and
relationships.
Importance of Feature Extraction
• Dimensionality Reduction: Reduces the number of input variables, making
models simpler and faster to train.
• Noise Reduction: Helps eliminate irrelevant or redundant data, leading to better
model performance.
• Improved Performance: By focusing on the most relevant aspects of the data, it
can lead to increased accuracy and efficiency of the model.
• Visualization: Facilitates the understanding of high-dimensional data through
lower-dimensional representations.
Categories of Information in Feature Extraction
Feature extraction can be categorized based on the type of data and the techniques
used to extract features. Here are the primary categories:
1. Statistical Features:
o Description: These features are derived from statistical measures of the
data.
o Examples: mean, median, variance, standard deviation, skewness, and kurtosis of a variable or signal.
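A short sketch of extracting such statistical features (NumPy and SciPy are assumed; the signal is synthetic):
import numpy as np
from scipy import stats

signal = np.random.default_rng(0).normal(loc=5.0, scale=2.0, size=500)   # synthetic raw data

features = {
    "mean": np.mean(signal),
    "std": np.std(signal),
    "variance": np.var(signal),
    "skewness": stats.skew(signal),
    "kurtosis": stats.kurtosis(signal),
}
print(features)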
Conclusion
Feature extraction is a fundamental aspect of the data preprocessing phase in
machine learning and is crucial for building effective models. By transforming raw
data into meaningful features across various categories, you can enhance the
model's ability to learn and generalize from the underlying data patterns.
Q.6 Explain Random Forest Classifier
Ans- The Random Forest Classifier is an ensemble learning method used for
classification tasks. It builds multiple decision trees during training and merges their
predictions to improve accuracy and control overfitting. Here's an in-depth look at
how it works, its advantages, and its limitations.
How Random Forest Works
1. Ensemble Learning:
o Random Forest is based on the concept of ensemble learning, where
multiple models (in this case, decision trees) are combined to produce a
more accurate and robust model.
2. Bootstrapping:
o Random Forest employs a technique called bootstrapping, which involves
creating multiple subsets of the original training dataset by randomly
sampling with replacement. Each subset is used to train a separate
decision tree.
3. Feature Randomness:
o When building each decision tree, Random Forest adds another layer of
randomness by selecting a random subset of features for each split in the
tree. This ensures that the trees are diverse, reducing the correlation
between them.
4. Tree Construction:
o Each decision tree is built to its full depth without pruning, which allows for
capturing complex patterns in the data. The trees will each have different
structures due to the randomness in sampling and feature selection.
5. Voting Mechanism:
o After all trees are trained, the Random Forest makes predictions for new
instances by aggregating the predictions of all the individual trees. For
classification tasks, this is done through a majority voting mechanism,
where the class predicted by the most trees is chosen as the final output.
Key Features
• Robustness to Overfitting: Because it averages the predictions of multiple trees,
Random Forest can mitigate overfitting, especially compared to individual
decision trees.
• Handles Missing Values: Random Forest can handle missing data and maintain
accuracy without requiring imputation.
• Feature Importance: Random Forest provides insights into the importance of
different features in the prediction process. This is particularly useful for feature
selection and understanding model behavior.
Advantages
1. High Accuracy: Random Forest typically provides high accuracy for both
classification and regression tasks due to the ensemble nature of the model.
2. Versatility: It can be used for both classification and regression problems and is
effective for various types of data, including numerical and categorical features.
3. Scalability: Random Forest can handle large datasets and a high number of
features, making it suitable for many real-world applications.
4. Less Tuning Required: It requires less hyperparameter tuning compared to other
complex models, as it performs well out of the box.
Limitations
1. Model Interpretability: While individual decision trees are easy to interpret, the
ensemble nature of Random Forest makes it harder to understand the model’s
decision-making process.
2. Computational Complexity: Training many trees can be computationally
intensive, leading to longer training times and higher resource consumption,
especially with large datasets.
3. Memory Usage: Random Forest can require significant memory for storing all the
trees, which can be a limitation in resource-constrained environments.
4. Not Suitable for Real-Time Predictions: Due to the time it takes to aggregate
predictions from multiple trees, Random Forest may not be ideal for real-time
applications where speed is critical.
Conclusion
The Random Forest Classifier is a powerful and widely used machine learning
algorithm that excels in accuracy and versatility. It is particularly valuable when
dealing with complex datasets and when interpretability is not the primary concern.
By leveraging the strengths of multiple decision trees, it mitigates overfitting and
enhances predictive performance, making it a popular choice across various
domains.
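A minimal sketch of the classifier described above, assuming scikit-learn (the Iris dataset is used only as a convenient illustration):
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 100 bootstrapped trees, each split considering a random subset of features
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
print("Feature importances:", model.feature_importances_)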
Q.6 (b) Explain Principal Component Analysis.
Ans- Principal Component Analysis (PCA) is a statistical technique used for
dimensionality reduction while preserving as much variance in the data as possible.
It transforms a dataset into a new coordinate system where the greatest variances by
any projection of the data lie on the first coordinates (called principal components).
Here's a detailed explanation of how PCA works, its applications, and its advantages
and limitations.
How PCA Works
1. Standardization:
o PCA begins with standardizing the data, particularly when the features are
on different scales. This involves centering the data (subtracting the mean)
and scaling it (dividing by the standard deviation) to ensure that each
feature contributes equally to the analysis.
2. Covariance Matrix:
o Once the data is standardized, PCA calculates the covariance matrix to
understand how different features vary with respect to each other. The
covariance matrix captures the relationships between features.
3. Eigenvalues and Eigenvectors:
o PCA computes the eigenvalues and eigenvectors of the covariance matrix.
Eigenvalues represent the amount of variance captured by each principal
component, while eigenvectors indicate the direction of these
components.
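The three steps above can be sketched directly with NumPy (a hedged illustration on synthetic data, not a complete PCA implementation):
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # synthetic data: 100 samples, 3 features

# 1. Standardize: zero mean, unit variance per feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized features
cov = np.cov(X_std, rowvar=False)

# 3. Eigenvalues (variance captured) and eigenvectors (component directions)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]        # sort components by explained variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project onto the first two principal components
X_pca = X_std @ eigvecs[:, :2]
print("Explained variance ratio:", eigvals[:2] / eigvals.sum())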
5. Data Validation:
o Ensuring the data is accurate and meets the required quality standards.
This step verifies that the transformations performed were correct.
6. Data Exploration:
o Analyzing the cleaned data to identify patterns, trends, and insights. This
step often involves generating summary statistics and visualizations.
7. Data Storage:
o Saving the cleaned and transformed dataset in a suitable format (e.g., CSV,
database) for further analysis or sharing with stakeholders.
Data Wrangling Diagram
Here's a simple diagram illustrating the data wrangling process:
+---------------------+
|   Data Collection   |
+----------+----------+
           |
           v
+---------------------+
|    Data Cleaning    |
+----------+----------+
           |
           v
+---------------------+
| Data Transformation |
+----------+----------+
           |
           v
+---------------------+
|   Data Enrichment   |
+----------+----------+
           |
           v
+---------------------+
|   Data Validation   |
+----------+----------+
           |
           v
+---------------------+
|  Data Exploration   |
+----------+----------+
           |
           v
+---------------------+
|    Data Storage     |
+---------------------+
Importance of Data Wrangling
• Improves Data Quality: Ensures that the data is accurate, consistent, and usable
for analysis.
• Enhances Efficiency: Streamlines the process of preparing data, saving time
during the analysis phase.
• Facilitates Better Decision-Making: Provides high-quality data, which is crucial
for making informed decisions based on analysis.
• Supports Data Analysis: Prepares the data in a structured format, making it
easier to apply analytical techniques and algorithms.
Conclusion
Data wrangling is a critical step in the data analysis pipeline. It transforms raw data
into a clean and structured format, enabling analysts and data scientists to derive
meaningful insights and make data-driven decisions. By following the data wrangling
process, organizations can enhance the quality of their data and maximize the value
derived from it.
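A hedged sketch of a few of these wrangling steps with pandas (the file name and column names are hypothetical placeholders):
import pandas as pd

# Data collection (hypothetical CSV file)
df = pd.read_csv("sales_raw.csv")

# Data cleaning: drop duplicates and rows missing key fields
df = df.drop_duplicates().dropna(subset=["order_id", "amount"])

# Data transformation: fix types and standardize a date column
df["order_date"] = pd.to_datetime(df["order_date"])
df["amount"] = df["amount"].astype(float)

# Data validation: simple sanity check on the cleaned data
assert (df["amount"] >= 0).all(), "Negative amounts found"

# Data exploration and storage
print(df.describe())
df.to_csv("sales_clean.csv", index=False)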
Q.8 (a) Explain composition plots with a diagram.
Ans- Composition plots are visual representations used to analyze and display the
distribution and proportion of different components within a dataset. They are
particularly useful for understanding how various parts contribute to a whole. This
type of visualization is commonly used in various fields, such as marketing, finance,
and environmental studies, to convey insights about the composition of data.
Types of Composition Plots
1. Stacked Bar Charts:
o Stacked bar charts display the total size of a group while illustrating the
composition of sub-groups within that total. Each bar represents a total,
and different colors represent the proportion of each sub-group.
2. Area Charts:
o Area charts show the cumulative totals over time, with different colors
representing different components. This visualization emphasizes the
magnitude of the total and how different components contribute to it over
time.
3. Pie Charts:
o Although often criticized for being less effective, pie charts can visually
represent the composition of a whole, showing the percentage of each
category as a slice of a circle.
4. Sankey Diagrams:
o Sankey diagrams illustrate the flow of resources or information between
stages. They are particularly useful for showing how quantities are
distributed among categories.
5. Mosaic Plots:
o Mosaic plots represent the relationship between two or more categorical
variables by displaying rectangles whose areas are proportional to the
frequency of combinations of these variables.
Example: Stacked Bar Chart
Here’s a diagram representing a stacked bar chart, which is one of the common
composition plots:
Value
  20 |          [==]
  15 |   [==]   [##]          [==]
  10 |   [##]   [##]   [==]   [##]
   5 |   [##]   [##]   [##]   [##]
   0 +----------------------------------
          A      B       C      D
([##] and [==] represent two sub-groups stacked within each category's bar)
Description of the Diagram
• X-Axis: Represents diVerent categories (e.g., A, B, C, D).
• Y-Axis: Represents the total value for each category.
• Colors: Different colors in each bar indicate the composition of sub-groups (e.g.,
different segments of the data).
• Height of Bars: The height of each segment shows the proportion of each sub-
group in the total.
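A minimal Matplotlib sketch of such a stacked bar chart (the sub-group values are invented purely for illustration):
import matplotlib.pyplot as plt

categories = ['A', 'B', 'C', 'D']
group1 = [5, 7, 6, 4]
group2 = [8, 6, 9, 7]

plt.bar(categories, group1, label='Sub-group 1')
plt.bar(categories, group2, bottom=group1, label='Sub-group 2')   # stacked on top of group1
plt.ylabel('Total value')
plt.legend()
plt.show()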
Importance of Composition Plots
1. Visual Clarity: Composition plots provide a clear visual representation of how
components contribute to a whole, making complex data more understandable.
2. Comparative Analysis: They enable easy comparison of diVerent groups or
categories, highlighting the differences in composition.
3. Insight Generation: These plots can reveal trends and patterns that might not be
evident from raw data, supporting data-driven decision-making.
4. Communication Tool: Effective for communicating findings to stakeholders, as
they simplify the representation of complex data relationships.
Conclusion
Composition plots are valuable tools for visualizing the parts that make up a whole.
They provide insights into the distribution and contribution of various components,
enhancing data analysis and facilitating better understanding and communication of
findings. Whether using stacked bar charts, area charts, or other types, composition
plots play a crucial role in making complex data accessible and actionable.
Data visualization is a critical aspect of data analysis, and a variety of tools and libraries
are available to create effective visualizations. Below are some popular tools and
libraries categorized by programming languages:
1. Python Libraries:
• Matplotlib:
o A fundamental plotting library for Python, Matplotlib is used for creating
static, interactive, and animated visualizations in Python. It provides fine
control over plot appearance.
• Seaborn:
• Usage: Widely used for transmitting data between a server and a web application,
especially in APIs.
• Example: A JSON representation of a user profile:
{
"id": 1,
"name": "Alice",
"age": 30,
"gender": "Female"
}
Conclusion
The choice of tools and libraries for data visualization and the method of data
representation significantly impact the effectiveness of data analysis. By selecting
appropriate tools and formats, analysts can create compelling visual narratives that
reveal insights, communicate findings, and drive informed decision-making.
ii) Data Representation
Data representation refers to the methods and formats used to organize, store, and
present data so that it can be easily interpreted and analyzed. The way data is
represented can greatly influence how effectively it can be understood and utilized.
Below, we explore various forms of data representation, their characteristics, and their
applications.
1. Tabular Representation
• Description: Data is organized into rows and columns, similar to a spreadsheet
or database table.
• Characteristics:
o Easy to read and understand.
o Each row represents a record, and each column represents a variable.
• Applications: Used in databases, spreadsheets, and data analysis tools for
structured data.
Q.9
Ans- Plotting Using Pandas DataFrames
Pandas is a powerful data manipulation library in Python that includes built-in
visualization capabilities. It allows users to create plots directly from DataFrames,
making it easy to visualize data without extensive coding.
1. Basic Plotting with Pandas
To plot data using Pandas, you can call the .plot() method on a DataFrame or Series. This
method utilizes Matplotlib under the hood and offers a straightforward interface for
creating various types of plots (a short sketch follows the list of plot types below).
Types of Plots in Pandas
Pandas supports various plot types, including:
• Line Plot: kind='line' (default)
• Bar Plot: kind='bar' (for vertical bars)
• Horizontal Bar Plot: kind='barh'
• Histogram: kind='hist'
• Box Plot: kind='box'
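A hedged sketch of plotting directly from a DataFrame (the values below are made up):
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "year": [2019, 2020, 2021, 2022],
    "sales": [120, 150, 170, 160],
    "profit": [30, 45, 50, 48],
})

df.plot(x="year", y=["sales", "profit"], kind="line")   # line plot (the default kind)
df.plot(x="year", y="sales", kind="bar")                # vertical bar plot
plt.show()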
Saving a Figure
• You can save a figure by specifying the filename and format. Common formats
include PNG, JPG, PDF, and SVG.
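For example (the file name is a placeholder):
import matplotlib.pyplot as plt

plt.plot([1, 2, 3, 4], [10, 20, 25, 30])
plt.savefig('my_figure.png', dpi=300, bbox_inches='tight')   # .pdf, .svg, and .jpg also work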
Labels
Definition: Labels are used to provide information about the axes of a plot. They help
viewers understand what data is represented on each axis.
import matplotlib.pyplot as plt

# Sample data
x = [1, 2, 3, 4]
y = [10, 20, 25, 30]
plt.plot(x, y)
# Adding labels
plt.xlabel('X values')
plt.ylabel('Y values')
plt.show()
Titles
• Definition: The title of a plot provides a brief description of what the plot represents.
It is usually displayed at the top of the figure.
plt.plot(x, y)
# Adding a title
plt.title('Simple Line Plot')
plt.show()
Text
Definition: The text() function in Matplotlib is used to place text at arbitrary locations
within the plot. This can be useful for adding descriptive information or highlighting
specific data points.
plt.plot(x, y)
# Placing text at an arbitrary (x, y) location within the plot
plt.text(2, 22, 'Important region')
plt.show()
Annotations
Definition: Annotations allow you to add text to specific points in your plot along with
arrows to point to the relevant data points. This is useful for providing additional context
or explaining significant points.
plt.plot(x, y)
# Annotate a specific data point with text and an arrow
plt.annotate('Peak value', xy=(4, 30), xytext=(2.5, 28),
             arrowprops=dict(facecolor='black', shrink=0.05))
plt.show()
Legends
Definition: Legends are used to describe the various elements of a plot, especially
when multiple datasets or categories are displayed. A legend helps distinguish
between different lines, bars, or points in the same plot.
# Sample data
x = [1, 2, 3, 4]
y1 = [10, 20, 25, 30]
y2 = [5, 15, 20, 25]
plt.plot(x, y1, label='Dataset 1')
plt.plot(x, y2, label='Dataset 2')
# Adding a legend
plt.legend()
plt.show()
Q. 10 (b) Explain basic image operations of Matplotlib
Ans- Matplotlib is a versatile library for creating static, animated, and interactive
visualizations in Python. In addition to plotting data, it can also handle basic image
operations, making it suitable for image processing tasks. Below are some of the
fundamental image operations you can perform using Matplotlib.
1. Displaying Images
The primary function to display an image in Matplotlib is imshow(). This function can
display images in various formats, including grayscale and color images.
2. Image Manipulation
You can perform basic manipulations on images such as resizing, cropping, and flipping
using NumPy in combination with Matplotlib.
a) Resizing Images - To resize images, you can use the resize() function from the
skimage.transform module (part of the scikit-image library).
b) Cropping Images - Cropping involves selecting a specific region of an image. This can
be done using NumPy slicing.
c) Flipping Images - You can flip images horizontally or vertically using NumPy.
3. Image Color Conversion - You can convert images between different color spaces
(e.g., RGB to Grayscale) using NumPy operations.
4. Adding Text to Images - You can overlay text on images using the text() function to
annotate images or highlight specific features.
5. Saving Images - After manipulating images, you can save them using imsave().
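A combined sketch of several of the operations above, using a synthetic image so the snippet is self-contained (no external image file or question-specific data is assumed):
import numpy as np
import matplotlib.pyplot as plt

# Synthetic RGB image: a 100x100 horizontal red gradient
img = np.zeros((100, 100, 3))
img[:, :, 0] = np.linspace(0, 1, 100)

# Displaying the image and overlaying text
plt.imshow(img)
plt.text(5, 10, 'Sample image', color='white')
plt.show()

# Cropping (NumPy slicing), flipping, and RGB-to-grayscale conversion
cropped = img[25:75, 25:75]
flipped = np.fliplr(img)
gray = img @ [0.2989, 0.5870, 0.1140]    # luminance-weighted channel sum
print(cropped.shape, flipped.shape, gray.shape)

# Saving the manipulated image
plt.imsave('gradient_gray.png', gray, cmap='gray')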