Model Question Paper With Effect From 2021 (CBCS Scheme) : Data Science and Visualization
21CS644
Model Question Paper-1/2 with effect from 2021(CBCS Scheme)
USN
Note: 01. Answer any FIVE full questions, choosing at least ONE question from each MODULE.
(Each question is tagged with its Bloom's Taxonomy Level, Course Outcome (CO), and Marks.)
Module-1
Q.01 a What is data science? List and explain the skill set required in a data science profile. [L2, CO1, 6 Marks]
     b Explain probability distribution with an example. [L2, CO1, 6 Marks]
     c Describe the process of fitting a model to a dataset in detail. [L2, CO1, 8 Marks]
OR
Q.02 a Explain, with a neat diagram, the current landscape of the data science process. [L2, CO1, 6 Marks]
     b Explain population and sample with an example. [L2, CO1, 6 Marks]
     c What is big data? Explain the 5 elements of big data in detail. [L2, CO1, 8 Marks]
Module-2
Q.03 a What is machine learning? Explain the linear regression algorithm. [L2, CO2, 6 Marks]
     b Explain the K-means algorithm with an example. [L2, CO2, 6 Marks]
     c Describe the philosophy of EDA in detail. [L2, CO2, 8 Marks]
OR
Q.04 a Explain the data science process with a neat diagram. [L2, CO2, 6 Marks]
     b Explain the KNN algorithm with an example. [L2, CO2, 6 Marks]
     c Develop an R script for EDA. [L3, CO2, 8 Marks]
Module-3
Q.05 a Explain the fundamental differences between linear regression and logistic regression. [L2, CO3, 6 Marks]
     b Explain selecting an algorithm in the wrapper method. [L2, CO3, 6 Marks]
     c Explain the decision tree for the Chasing Dragons problem. [L3, CO3, 8 Marks]
OR
Q.06 a Briefly explain the alternating least squares method. [L2, CO3, 6 Marks]
     b Explain the different selection criteria in feature selection. [L2, CO3, 6 Marks]
     c Explain the dimensionality problem with SVD in detail. [L3, CO3, 8 Marks]
Module-4
Q.07 a Define data visualization and explain its importance in data analysis. [L2, CO4, 6 Marks]
     b Describe the different types of plots among comparison plots. [L2, CO4, 6 Marks]
     c Plot the following: i) density plot ii) box plot iii) violin plot iv) bubble plot [L3, CO4, 8 Marks]
OR
Q.08 a Describe the process of data wrangling and its significance in data visualization. [L2, CO4, 6 Marks]
     b Explain the variants of the bar chart with examples. [L2, CO4, 6 Marks]
     c Explain the different types of plots among relation plots. [L2, CO4, 8 Marks]
Module-5
Q.09 a Develop code for labels and titles in Matplotlib. [L3, CO5, 6 Marks]
     b Apply code for a basic pie chart. [L3, CO5, 6 Marks]
Q.1 (b) Explain Datafication.
Ans- Datafication refers to the process of turning various aspects of human life and
the physical world into data that can be collected, analyzed, and used for decision-
making. It involves transforming everyday interactions, behaviors, processes, and
objects into a quantifiable format. With the rise of digital technology, almost every
action or event can be recorded and transformed into data, including social
interactions, business transactions, physical movements, and even biological
processes.
Examples of Datafication:
Social Media Interactions: Platforms like Facebook, Instagram, and Twitter collect
vast amounts of data from user interactions (likes, shares, comments), which can be
analyzed for trends, preferences, and behavior.
Wearables and IoT: Devices like fitness trackers and smart home appliances collect
health, movement, and usage data that can be used to provide insights or
recommendations.
Business Operations: Retail transactions, supply chains, and customer service
interactions are all datafied, allowing businesses to optimize processes, improve
customer experiences, and predict future trends.
In short, datafication is the shift where more and more of our world is being
transformed into data, allowing for its use in analysis, machine learning, and
decision-making processes.
Q.2 (a) Explain statistical Inference.
Ans- Statistical inference is a branch of statistics that focuses on drawing
conclusions about a population based on a sample of data from that population. It
involves two key concepts: estimation and hypothesis testing.
Key Concepts
1. Population and Sample:
o Population: The entire group of individuals or instances about whom we
want to learn.
o Sample: A subset of the population selected for analysis.
2. Estimation:
o Point Estimation: Provides a single value (point estimate) as an estimate
of a population parameter (e.g., sample mean as an estimate of population
mean).
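As a small illustration of point estimation (a sketch only; the sample values below are made up), the sample mean serves as a point estimate of the population mean:
import numpy as np

# Hypothetical sample drawn from a larger population (values are illustrative)
sample = np.array([52, 48, 55, 60, 47, 51, 58, 50])

point_estimate = sample.mean()                          # sample mean as point estimate
std_error = sample.std(ddof=1) / np.sqrt(len(sample))   # standard error of that estimate

print(f"Point estimate of the population mean: {point_estimate:.2f}")
print(f"Standard error: {std_error:.2f}")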
+------------------+
|    Deployment    |
+--------+---------+
         |
         v
+------------------+
|    Monitoring    |
| and Maintenance  |
+------------------+
Q.4 (b) Which machine learning algorithm should be used when you have a bunch of objects
that are already classified and, based on them, other similar objects that have not yet been
classified are to be labelled automatically? Explain.
Ans- When you have a set of classified objects (labeled data) and want to classify
new, unlabeled objects based on the existing classifications, the most suitable
machine learning algorithm to use is supervised learning. Within this category, you
can choose from several algorithms, depending on the nature of your data and the
specific requirements of your task. Here are some commonly used algorithms:
1. k-Nearest Neighbors (k-NN):
o Description: This is a simple and intuitive algorithm that classifies a new
object based on the majority class of its k nearest neighbors in the feature
space.
o Use Case: It works well when you have a well-defined metric for distance
(e.g., Euclidean distance) and when your dataset is not too large, as it can
become computationally expensive. (A short sketch of this approach appears after this list.)
2. Support Vector Machines (SVM):
o Description: SVM finds the hyperplane that best separates the classes in
the feature space. It works well for both linear and non-linear classification
using kernel functions.
o Use Case: SVM is effective in high-dimensional spaces and can be used
for both binary and multi-class classification problems.
3. Decision Trees:
o Description: A decision tree learns a hierarchy of if-then splits on feature values
from the labelled data; a new object is classified by following the branches that
match its features down to a leaf.
o Use Case: Decision trees are easy to interpret and handle both numerical and
categorical features, which makes them useful when the classification rules need
to be explained.
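A minimal k-NN sketch of the labelling workflow referred to above (assuming scikit-learn is available; the feature values and class labels are invented purely for illustration):
from sklearn.neighbors import KNeighborsClassifier

# Already-classified objects: each row is [feature_1, feature_2]
X_labeled = [[1.0, 2.1], [1.2, 1.9], [8.0, 8.5], [7.8, 8.1]]
y_labels = ['cat', 'cat', 'dog', 'dog']

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_labeled, y_labels)

# New, unlabelled objects receive the majority label of their nearest neighbours
X_new = [[1.1, 2.0], [8.1, 8.3]]
print(knn.predict(X_new))   # expected: ['cat' 'dog']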
o Features that provide similar information can lead to redundancy. The goal
is to reduce redundancy by selecting a diverse set of features that provide
unique information.
3. Performance Improvement:
o Features should be selected based on their contribution to model
performance. Cross-validation can be used to assess how different feature
subsets affect performance metrics (e.g., accuracy, precision, recall).
4. Simplicity:
o A simpler model with fewer features is often preferred as it is easier to
interpret and less likely to overfit. Models with too many features can
become overly complex and difficult to maintain.
5. Computational Efficiency:
o The time and resources required to train the model should be considered.
Selecting fewer features can lead to faster training times and lower
computational costs.
6. Domain Knowledge:
o Understanding the domain can provide insights into which features are
likely to be important based on theoretical or practical considerations.
Conclusion
Feature selection is a vital part of the machine learning pipeline that can significantly
impact model performance. By using a combination of filter, wrapper, and embedded
methods, along with carefully defined selection criteria, you can improve the quality
and efficiency of your models.
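As a hedged sketch of a wrapper-style selection (scikit-learn is assumed; the synthetic dataset is illustrative only), recursive feature elimination with cross-validation can be run as follows:
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, only a few of which carry signal
X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)

# Wrapper method: repeatedly fit the model, drop the weakest features,
# and use cross-validated accuracy to decide how many features to keep
selector = RFECV(estimator=LogisticRegression(max_iter=1000), step=1, cv=5)
selector.fit(X, y)

print("Selected feature mask:", selector.support_)
print("Optimal number of features:", selector.n_features_)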
Q.5 (b) Define Feature Extraction. Explain different categories of information.
Ans- Feature Extraction is the process of transforming raw data into a set of
meaningful features that can be effectively used in machine learning models. It
involves reducing the dimensionality of the data while preserving its essential
characteristics, thus enhancing the model's ability to learn patterns and
relationships.
Importance of Feature Extraction
• Dimensionality Reduction: Reduces the number of input variables, making
models simpler and faster to train.
• Noise Reduction: Helps eliminate irrelevant or redundant data, leading to better
model performance.
• Improved Performance: By focusing on the most relevant aspects of the data, it
can lead to increased accuracy and efficiency of the model.
• Visualization: Facilitates the understanding of high-dimensional data through
lower-dimensional representations.
Categories of Information in Feature Extraction
Feature extraction can be categorized based on the type of data and the techniques
used to extract features. Here are the primary categories:
1. Statistical Features:
o Description: These features are derived from statistical measures of the
data.
o Examples: mean, median, variance, standard deviation, skewness, and kurtosis of a variable or signal.
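A short sketch of extracting such statistical features (NumPy and SciPy are assumed; the signal is synthetic):
import numpy as np
from scipy import stats

signal = np.random.default_rng(0).normal(loc=5.0, scale=2.0, size=500)   # synthetic raw data

features = {
    "mean": np.mean(signal),
    "std": np.std(signal),
    "variance": np.var(signal),
    "skewness": stats.skew(signal),
    "kurtosis": stats.kurtosis(signal),
}
print(features)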
Conclusion
Feature extraction is a fundamental aspect of the data preprocessing phase in
machine learning and is crucial for building effective models. By transforming raw
data into meaningful features across various categories, you can enhance the
model's ability to learn and generalize from the underlying data patterns.
Q.6 Explain Random Forest Classifier
Ans- The Random Forest Classifier is an ensemble learning method used for
classification tasks. It builds multiple decision trees during training and merges their
predictions to improve accuracy and control overfitting. Here's an in-depth look at
how it works, its advantages, and its limitations.
How Random Forest Works
1. Ensemble Learning:
o Random Forest is based on the concept of ensemble learning, where
multiple models (in this case, decision trees) are combined to produce a
more accurate and robust model.
2. Bootstrapping:
o Random Forest employs a technique called bootstrapping, which involves
creating multiple subsets of the original training dataset by randomly
sampling with replacement. Each subset is used to train a separate
decision tree.
3. Feature Randomness:
o When building each decision tree, Random Forest adds another layer of
randomness by selecting a random subset of features for each split in the
tree. This ensures that the trees are diverse, reducing the correlation
between them.
4. Tree Construction:
o Each decision tree is built to its full depth without pruning, which allows for
capturing complex patterns in the data. The trees will each have different
structures due to the randomness in sampling and feature selection.
5. Voting Mechanism:
o After all trees are trained, the Random Forest makes predictions for new
instances by aggregating the predictions of all the individual trees. For
classification tasks, this is done through a majority voting mechanism,
where the class predicted by the most trees is chosen as the final output.
Key Features
• Robustness to Overfitting: Because it averages the predictions of multiple trees,
Random Forest can mitigate overfitting, especially compared to individual
decision trees.
• Handles Missing Values: Random Forest can handle missing data and maintain
accuracy without requiring imputation.
• Feature Importance: Random Forest provides insights into the importance of
different features in the prediction process. This is particularly useful for feature
selection and understanding model behavior.
Advantages
1. High Accuracy: Random Forest typically provides high accuracy for both
classification and regression tasks due to the ensemble nature of the model.
2. Versatility: It can be used for both classification and regression problems and is
effective for various types of data, including numerical and categorical features.
3. Scalability: Random Forest can handle large datasets and a high number of
features, making it suitable for many real-world applications.
4. Less Tuning Required: It requires less hyperparameter tuning compared to other
complex models, as it performs well out of the box.
Limitations
1. Model Interpretability: While individual decision trees are easy to interpret, the
ensemble nature of Random Forest makes it harder to understand the model’s
decision-making process.
2. Computational Complexity: Training many trees can be computationally
intensive, leading to longer training times and higher resource consumption,
especially with large datasets.
3. Memory Usage: Random Forest can require significant memory for storing all the
trees, which can be a limitation in resource-constrained environments.
4. Not Suitable for Real-Time Predictions: Due to the time it takes to aggregate
predictions from multiple trees, Random Forest may not be ideal for real-time
applications where speed is critical.
Conclusion
The Random Forest Classifier is a powerful and widely used machine learning
algorithm that excels in accuracy and versatility. It is particularly valuable when
dealing with complex datasets and when interpretability is not the primary concern.
By leveraging the strengths of multiple decision trees, it mitigates overfitting and
enhances predictive performance, making it a popular choice across various
domains.
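A minimal sketch of the classifier described above, assuming scikit-learn (the Iris dataset is used only as a convenient illustration):
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 100 bootstrapped trees, each split considering a random subset of features
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
print("Feature importances:", model.feature_importances_)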
Q.6 (b) Explain Principal Component Analysis.
Ans- Principal Component Analysis (PCA) is a statistical technique used for
dimensionality reduction while preserving as much variance in the data as possible.
It transforms a dataset into a new coordinate system where the greatest variances by
any projection of the data lie on the first coordinates (called principal components).
Here's a detailed explanation of how PCA works, its applications, and its advantages
and limitations.
How PCA Works
1. Standardization:
o PCA begins with standardizing the data, particularly when the features are
on different scales. This involves centering the data (subtracting the mean)
and scaling it (dividing by the standard deviation) to ensure that each
feature contributes equally to the analysis.
2. Covariance Matrix:
o Once the data is standardized, PCA calculates the covariance matrix to
understand how different features vary with respect to each other. The
covariance matrix captures the relationships between features.
3. Eigenvalues and Eigenvectors:
o PCA computes the eigenvalues and eigenvectors of the covariance matrix.
Eigenvalues represent the amount of variance captured by each principal
component, while eigenvectors indicate the direction of these
components.
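The three steps above can be sketched directly with NumPy (a hedged illustration on synthetic data, not a complete PCA implementation):
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # synthetic data: 100 samples, 3 features

# 1. Standardize: zero mean, unit variance per feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized features
cov = np.cov(X_std, rowvar=False)

# 3. Eigenvalues (variance captured) and eigenvectors (component directions)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]        # sort components by explained variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project onto the first two principal components
X_pca = X_std @ eigvecs[:, :2]
print("Explained variance ratio:", eigvals[:2] / eigvals.sum())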
5. Data Validation:
o Ensuring the data is accurate and meets the required quality standards.
This step verifies that the transformations performed were correct.
6. Data Exploration:
o Analyzing the cleaned data to identify patterns, trends, and insights. This
step often involves generating summary statistics and visualizations.
7. Data Storage:
o Saving the cleaned and transformed dataset in a suitable format (e.g., CSV,
database) for further analysis or sharing with stakeholders.
Data Wrangling Diagram
Here's a simple diagram illustrating the data wrangling process:
+---------------------+
|   Data Collection   |
+----------+----------+
           |
           v
+---------------------+
|    Data Cleaning    |
+----------+----------+
           |
           v
+---------------------+
| Data Transformation |
+----------+----------+
           |
           v
+---------------------+
|   Data Enrichment   |
+----------+----------+
           |
           v
+---------------------+
|   Data Validation   |
+----------+----------+
           |
           v
+---------------------+
|  Data Exploration   |
+----------+----------+
           |
           v
+---------------------+
|    Data Storage     |
+---------------------+
Importance of Data Wrangling
• Improves Data Quality: Ensures that the data is accurate, consistent, and usable
for analysis.
• Enhances Efficiency: Streamlines the process of preparing data, saving time
during the analysis phase.
• Facilitates Better Decision-Making: Provides high-quality data, which is crucial
for making informed decisions based on analysis.
• Supports Data Analysis: Prepares the data in a structured format, making it
easier to apply analytical techniques and algorithms.
Conclusion
Data wrangling is a critical step in the data analysis pipeline. It transforms raw data
into a clean and structured format, enabling analysts and data scientists to derive
meaningful insights and make data-driven decisions. By following the data wrangling
process, organizations can enhance the quality of their data and maximize the value
derived from it.
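A hedged sketch of a few of these wrangling steps with pandas (the file name and column names are hypothetical placeholders):
import pandas as pd

# Data collection (hypothetical CSV file)
df = pd.read_csv("sales_raw.csv")

# Data cleaning: drop duplicates and rows missing key fields
df = df.drop_duplicates().dropna(subset=["order_id", "amount"])

# Data transformation: fix types and standardize a date column
df["order_date"] = pd.to_datetime(df["order_date"])
df["amount"] = df["amount"].astype(float)

# Data validation: simple sanity check on the cleaned data
assert (df["amount"] >= 0).all(), "Negative amounts found"

# Data exploration and storage
print(df.describe())
df.to_csv("sales_clean.csv", index=False)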
Q.8 (a) Explain composition plots with a diagram.
Ans- Composition plots are visual representations used to analyze and display the
distribution and proportion of different components within a dataset. They are
particularly useful for understanding how various parts contribute to a whole. This
type of visualization is commonly used in various fields, such as marketing, finance,
and environmental studies, to convey insights about the composition of data.
Types of Composition Plots
1. Stacked Bar Charts:
o Stacked bar charts display the total size of a group while illustrating the
composition of sub-groups within that total. Each bar represents a total,
and different colors represent the proportion of each sub-group.
2. Area Charts:
o Area charts show the cumulative totals over time, with different colors
representing different components. This visualization emphasizes the
magnitude of the total and how different components contribute to it over
time.
3. Pie Charts:
o Although often criticized for being less effective, pie charts can visually
represent the composition of a whole, showing the percentage of each
category as a slice of a circle.
4. Sankey Diagrams:
o Sankey diagrams illustrate the flow of resources or information between
stages. They are particularly useful for showing how quantities are
distributed among categories.
5. Mosaic Plots:
o Mosaic plots represent the relationship between two or more categorical
variables by displaying rectangles whose areas are proportional to the
frequency of combinations of these variables.
Example: Stacked Bar Chart
Here’s a diagram representing a stacked bar chart, which is one of the common
composition plots:
Value
  20 |          [==]
  15 |   [==]   [##]          [==]
  10 |   [##]   [##]   [==]   [##]
   5 |   [##]   [##]   [##]   [##]
   0 +----------------------------------
          A      B       C      D
([##] and [==] represent two sub-groups stacked within each category's bar)
Description of the Diagram
• X-Axis: Represents diVerent categories (e.g., A, B, C, D).
• Y-Axis: Represents the total value for each category.
• Colors: Different colors in each bar indicate the composition of sub-groups (e.g.,
different segments of the data).
• Height of Bars: The height of each segment shows the proportion of each sub-
group in the total.
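A minimal Matplotlib sketch of such a stacked bar chart (the sub-group values are invented purely for illustration):
import matplotlib.pyplot as plt

categories = ['A', 'B', 'C', 'D']
group1 = [5, 7, 6, 4]
group2 = [8, 6, 9, 7]

plt.bar(categories, group1, label='Sub-group 1')
plt.bar(categories, group2, bottom=group1, label='Sub-group 2')   # stacked on top of group1
plt.ylabel('Total value')
plt.legend()
plt.show()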
Importance of Composition Plots
1. Visual Clarity: Composition plots provide a clear visual representation of how
components contribute to a whole, making complex data more understandable.
2. Comparative Analysis: They enable easy comparison of diVerent groups or
categories, highlighting the differences in composition.
3. Insight Generation: These plots can reveal trends and patterns that might not be
evident from raw data, supporting data-driven decision-making.
4. Communication Tool: Effective for communicating findings to stakeholders, as
they simplify the representation of complex data relationships.
Conclusion
Composition plots are valuable tools for visualizing the parts that make up a whole.
They provide insights into the distribution and contribution of various components,
enhancing data analysis and facilitating better understanding and communication of
findings. Whether using stacked bar charts, area charts, or other types, composition
plots play a crucial role in making complex data accessible and actionable.
Data visualization is a critical aspect of data analysis, and a variety of tools and libraries
are available to create effective visualizations. Below are some popular tools and
libraries categorized by programming languages:
1. Python Libraries:
• Matplotlib:
o A fundamental plotting library for Python, Matplotlib is used for creating
static, interactive, and animated visualizations in Python. It provides fine
control over plot appearance.
• Seaborn:
• Usage: Widely used for transmitting data between a server and a web application,
especially in APIs.
• Example: A JSON representation of a user profile:
{
"id": 1,
"name": "Alice",
"age": 30,
"gender": "Female"
}
Conclusion
The choice of tools and libraries for data visualization and the method of data
representation significantly impact the effectiveness of data analysis. By selecting
appropriate tools and formats, analysts can create compelling visual narratives that
reveal insights, communicate findings, and drive informed decision-making.
ii) Data Representation
Data representation refers to the methods and formats used to organize, store, and
present data so that it can be easily interpreted and analyzed. The way data is
represented can greatly influence how effectively it can be understood and utilized.
Below, we explore various forms of data representation, their characteristics, and their
applications.
1. Tabular Representation
• Description: Data is organized into rows and columns, similar to a spreadsheet
or database table.
• Characteristics:
o Easy to read and understand.
o Each row represents a record, and each column represents a variable.
• Applications: Used in databases, spreadsheets, and data analysis tools for
structured data.
Q.9
Ans- Plotting Using Pandas DataFrames
Pandas is a powerful data manipulation library in Python that includes built-in
visualization capabilities. It allows users to create plots directly from DataFrames,
making it easy to visualize data without extensive coding.
1. Basic Plotting with Pandas
To plot data using Pandas, you can call the .plot() method on a DataFrame or Series. This
method utilizes Matplotlib under the hood and offers a straightforward interface for
creating various types of plots (a short sketch follows the list of plot types below).
Types of Plots in Pandas
Pandas supports various plot types, including:
• Line Plot: kind='line' (default)
• Bar Plot: kind='bar' (for vertical bars)
• Horizontal Bar Plot: kind='barh'
• Histogram: kind='hist'
• Box Plot: kind='box'
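A hedged sketch of plotting directly from a DataFrame (the values below are made up):
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "year": [2019, 2020, 2021, 2022],
    "sales": [120, 150, 170, 160],
    "profit": [30, 45, 50, 48],
})

df.plot(x="year", y=["sales", "profit"], kind="line")   # line plot (the default kind)
df.plot(x="year", y="sales", kind="bar")                # vertical bar plot
plt.show()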
Saving a Figure
• You can save a figure by specifying the filename and format. Common formats
include PNG, JPG, PDF, and SVG.
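For example (the file name is a placeholder):
import matplotlib.pyplot as plt

plt.plot([1, 2, 3, 4], [10, 20, 25, 30])
plt.savefig('my_figure.png', dpi=300, bbox_inches='tight')   # .pdf, .svg, and .jpg also work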
Labels
Definition: Labels are used to provide information about the axes of a plot. They help
viewers understand what data is represented on each axis.
import matplotlib.pyplot as plt

# Sample data
x = [1, 2, 3, 4]
y = [10, 20, 25, 30]
plt.plot(x, y)
# Adding labels
plt.xlabel('X values')
plt.ylabel('Y values')
plt.show()
Titles
• Definition: The title of a plot provides a brief description of what the plot represents.
It is usually displayed at the top of the figure.
plt.plot(x, y)
# Adding a title
plt.title('Simple Line Plot')
plt.show()
Text
Definition: The text() function in Matplotlib is used to place text at arbitrary locations
within the plot. This can be useful for adding descriptive information or highlighting
specific data points.
plt.plot(x, y)
# Placing text at an arbitrary (x, y) location within the plot
plt.text(2, 22, 'Important region')
plt.show()
Annotations
Definition: Annotations allow you to add text to specific points in your plot along with
arrows to point to the relevant data points. This is useful for providing additional context
or explaining significant points.
plt.plot(x, y)
# Annotate a specific data point with text and an arrow
plt.annotate('Peak value', xy=(4, 30), xytext=(2.5, 28),
             arrowprops=dict(facecolor='black', shrink=0.05))
plt.show()
Legends
Definition: Legends are used to describe the various elements of a plot, especially
when multiple datasets or categories are displayed. A legend helps distinguish
between different lines, bars, or points in the same plot.
# Sample data
x = [1, 2, 3, 4]
y1 = [10, 20, 25, 30]
y2 = [5, 15, 20, 25]
plt.plot(x, y1, label='Dataset 1')
plt.plot(x, y2, label='Dataset 2')
# Adding a legend
plt.legend()
plt.show()
Q. 10 (b) Explain basic image operations of Matplotlib
Ans- Matplotlib is a versatile library for creating static, animated, and interactive
visualizations in Python. In addition to plotting data, it can also handle basic image
operations, making it suitable for image processing tasks. Below are some of the
fundamental image operations you can perform using Matplotlib.
1. Displaying Images
The primary function to display an image in Matplotlib is imshow(). This function can
display images in various formats, including grayscale and color images.
2. Image Manipulation
You can perform basic manipulations on images such as resizing, cropping, and flipping
using NumPy in combination with Matplotlib.
a) Resizing Images - To resize images, you can use the resize() function from the
skimage.transform module (part of the scikit-image library).
b) Cropping Images - Cropping involves selecting a specific region of an image. This can
be done using NumPy slicing.
c) Flipping Images - You can flip images horizontally or vertically using NumPy.
3. Image Color Conversion - You can convert images between different color spaces
(e.g., RGB to Grayscale) using NumPy operations.
4. Adding Text to Images - You can overlay text on images using the text() function to
annotate images or highlight specific features.
5. Saving Images - After manipulating images, you can save them using imsave().
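A combined sketch of several of the operations above, using a synthetic image so the snippet is self-contained (no external image file or question-specific data is assumed):
import numpy as np
import matplotlib.pyplot as plt

# Synthetic RGB image: a 100x100 horizontal red gradient
img = np.zeros((100, 100, 3))
img[:, :, 0] = np.linspace(0, 1, 100)

# Displaying the image and overlaying text
plt.imshow(img)
plt.text(5, 10, 'Sample image', color='white')
plt.show()

# Cropping (NumPy slicing), flipping, and RGB-to-grayscale conversion
cropped = img[25:75, 25:75]
flipped = np.fliplr(img)
gray = img @ [0.2989, 0.5870, 0.1140]    # luminance-weighted channel sum
print(cropped.shape, flipped.shape, gray.shape)

# Saving the manipulated image
plt.imsave('gradient_gray.png', gray, cmap='gray')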