FDS - 5 Solved

TYBCS Foundation of Data Science solved question papers

Q1) Attempt any eight of the following:

a) What is a word cloud?

Sol:

A word cloud is a visual representation of text data where the size of each word indicates its
frequency or importance in the dataset. It helps in quickly identifying the most prominent
terms. For example, in a word cloud generated from a collection of articles, common words
like "data," "analysis," and "science" would appear larger than less common words.

b) Define Statistical data analysis

Sol:

Statistical data analysis is the process of collecting, organizing, analyzing, interpreting, and
presenting data using statistical methods. It involves techniques such as descriptive
statistics (mean, median, mode), inferential statistics (hypothesis testing, confidence
intervals), and predictive statistics (regression analysis) to draw conclusions and make
informed decisions based on data.

c) What do you mean by data transformation?

Sol:

Data transformation is the process of converting data from one format or structure into
another to prepare it for analysis. This includes scaling, normalization, aggregation, and
encoding to improve data quality and ensure consistency. For example, transforming
categorical data into numerical format using one-hot encoding.
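
A minimal sketch of one-hot encoding with pandas (the column values are illustrative):

import pandas as pd

# Illustrative categorical column
df = pd.DataFrame({'color': ['red', 'blue', 'green', 'red']})

# One-hot encoding: each category becomes its own 0/1 column
encoded = pd.get_dummies(df, columns=['color'])
print(encoded)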

d) Why is data cleaning an important operation in data preprocessing?

Sol:

Data cleaning is essential because it ensures the accuracy, completeness, and consistency of the data. It involves handling missing values, correcting errors, and removing duplicates, which helps prevent biases and errors in analysis and improves the reliability of results.
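
A minimal pandas sketch of these operations (the DataFrame is illustrative):

import pandas as pd
import numpy as np

# Illustrative DataFrame with a missing value and a duplicate row
df = pd.DataFrame({'age': [25, np.nan, 30, 30],
                   'city': ['Pune', 'Mumbai', 'Delhi', 'Delhi']})

df = df.drop_duplicates()                        # remove duplicate records
df['age'] = df['age'].fillna(df['age'].mean())   # impute missing values with the mean
print(df)
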
e) What is the purpose of data visualization?

Sol:

The purpose of data visualization is to represent data graphically to make it easier to understand and interpret. It helps identify patterns, trends, and outliers, facilitating data-driven decision-making. Visualizations such as charts, graphs, and maps provide insights that might be missed in raw data.

f) List any two applications of Data Science.

Sol:

• Healthcare: Predictive analytics, disease diagnosis, and personalized treatment plans.
• Finance: Fraud detection, credit risk assessment, and algorithmic trading.

g) What is Visual encoding?

Sol:

Visual encoding is the use of visual elements such as color, shape, size, and position to
represent data in graphical form. It helps in conveying information effectively through visual
means, making complex data more understandable.

h) Define Bubble plot.

Sol:

A bubble plot is a type of scatter plot where each point is represented by a bubble. The
position of the bubble indicates the values of two variables, while the size of the bubble
represents the value of a third variable. It's useful for visualizing the relationships between
three variables.
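
A minimal matplotlib sketch of a bubble plot (all values are illustrative):

import matplotlib.pyplot as plt

x = [1, 2, 3, 4]              # first variable: position on x-axis
y = [10, 20, 25, 30]          # second variable: position on y-axis
sizes = [100, 300, 500, 800]  # third variable: mapped to bubble area

plt.scatter(x, y, s=sizes, alpha=0.5)
plt.xlabel('Variable 1')
plt.ylabel('Variable 2')
plt.title('Bubble Plot')
plt.show()
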
i) What is CSV format?

Sol:

CSV (Comma-Separated Values) format is a plain text format where each line represents a
record, and fields are separated by commas. It is widely used for data exchange between
applications due to its simplicity and compatibility.
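
A minimal sketch reading a CSV file with Python's built-in csv module ('data.csv' is a hypothetical file):

import csv

# Each line of the file is one record; fields are split on commas
with open('data.csv', newline='') as f:
    for row in csv.reader(f):
        print(row)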

j) Define Standard deviation.

Sol:

Standard deviation is a measure of the dispersion or spread of a set of values. It quantifies the amount of variation or deviation from the mean. A low standard deviation indicates that data points are close to the mean, while a high standard deviation indicates that data points are spread out.
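
A minimal sketch using Python's statistics module (the sample values are illustrative):

import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(statistics.pstdev(data))  # population standard deviation -> 2.0
print(statistics.stdev(data))   # sample standard deviation (divides by n - 1)
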
Q2) Attempt any four of the following:

a) What are the 3 V's of data science?

Sol:

1. Volume:

o The volume characteristic of data refers to the sheer amount of data generated
and stored in data systems. It signifies the massive quantities of data that
organizations collect, process, and analyze.

o Example: Social media platforms generating terabytes of user data every day.

2. Velocity:

o Velocity refers to the speed at which data is generated, processed, and analyzed. It emphasizes the need for real-time or near real-time data processing to gain timely insights.

o Example: Stock market data where prices and trades are updated every
second.

3. Variety:

o Variety refers to the different types and sources of data. It includes structured
data (databases), semi-structured data (XML, JSON), and unstructured data
(text, images, videos).

o Example: Data from social media posts, emails, transaction records, and
multimedia content.

b) What are the uses of XML files?

Sol:

• Data Interchange: XML is used for data exchange between systems and
applications.
• Web Services: XML is widely used in web services for data representation and
communication.
• Configuration Files: XML is used for storing configuration settings in applications.
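
A minimal sketch parsing a configuration-style snippet with Python's built-in xml.etree.ElementTree (the XML content is illustrative):

import xml.etree.ElementTree as ET

xml_text = "<config><db host='localhost' port='5432'/></config>"
root = ET.fromstring(xml_text)

# Read attributes from the <db> element
db = root.find('db')
print(db.get('host'), db.get('port'))  # localhost 5432
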
c) Calculate mean, mode, median and range: 22, 24, 35, 47, 23, 23, 31, 25

Sol:

• Mean = (22 + 24 + 35 + 47 + 23 + 23 + 31 + 25) / 8 = 230 / 8 = 28.75
• Mode: 23 (most frequently occurring value)
• Median: Sort the values: 22, 23, 23, 24, 25, 31, 35, 47
o Median = (24 + 25) / 2 = 24.5
• Range = 47 – 22 = 25
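
These results can be checked with Python's built-in statistics module:

import statistics

data = [22, 24, 35, 47, 23, 23, 31, 25]
print(statistics.mean(data))    # 28.75
print(statistics.mode(data))    # 23
print(statistics.median(data))  # 24.5
print(max(data) - min(data))    # range: 25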

d) Why is data preprocessing important before data analysis?

Sol:

Data preprocessing is crucial because it ensures the quality and consistency of the data. It
involves cleaning, transforming, and organizing data to make it suitable for analysis.
Preprocessing improves the accuracy and reliability of analysis results and helps in
building better models.

e) List different types of data attributes with examples.

Sol:

• Nominal Attributes: Categorical data with no intrinsic order (e.g., Gender: Male,
Female).
• Ordinal Attributes: Categorical data with an intrinsic order (e.g., Education levels:
High School, Bachelor's, Master's).
• Interval Attributes: Numeric data with meaningful differences but no true zero point
(e.g., Temperature in Celsius).
• Ratio Attributes: Numeric data with meaningful differences and a true zero point
(e.g., Height, Weight).
Q3) Attempt any two of the following:

a) What do you mean by Data Reduction? Explain any two Data Reduction strategies.
Sol:

Data reduction involves reducing the volume of data while maintaining its integrity. It helps
in improving storage efficiency and speeding up data processing.

Strategies

• Dimensionality Reduction: Reducing the number of attributes or features in the dataset (e.g., Principal Component Analysis - PCA; see the sketch after this list).
• Numerosity Reduction: Reducing the number of data points using techniques like
clustering, aggregation, and sampling.
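
A minimal PCA sketch using scikit-learn (the dataset here is randomly generated for illustration):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 5)       # illustrative dataset: 100 samples, 5 features
pca = PCA(n_components=2)        # keep the 2 directions of highest variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # variance captured by each component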

b) What do you mean by hypothesis testing? Explain null and alternate hypothesis.
Sol:

• Hypothesis Testing: A statistical method used to make inferences or draw conclusions about a population based on sample data.
• Null Hypothesis (H0): The default assumption or statement that there is no effect or
difference.
• Alternate Hypothesis (H1): The statement that there is an effect or difference.
• Example: Testing whether a new drug is more effective than an existing one.
o Null Hypothesis (H0): The new drug is not more effective than the existing one.
o Alternate Hypothesis (H1): The new drug is more effective than the existing one.
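
A minimal sketch of this test as a two-sample t-test with scipy (the recovery times are illustrative):

from scipy import stats

# Hypothetical recovery times in days for each drug
existing = [12, 14, 13, 15, 16, 14, 13]
new = [10, 11, 12, 10, 13, 11, 12]

t_stat, p_value = stats.ttest_ind(existing, new)
# Reject H0 at the 5% significance level when p_value < 0.05
print(t_stat, p_value)
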
c) How to visualize geospatial data? Explain in detail.

Sol:

Geospatial data is data that includes geographic information such as coordinates, locations, and maps.

Visualization Techniques:

• Choropleth Maps: Represent data using different colors or shades for geographic regions.
o Example: Population density map.

import geopandas as gpd
import matplotlib.pyplot as plt

# Load the bundled world map (note: this built-in dataset was removed in
# geopandas 1.0; with newer versions, download the Natural Earth shapefile
# and pass its local path to read_file instead)
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))

# Shade each country by its estimated population
world.plot(column='pop_est', cmap='OrRd', legend=True)
plt.title('World Population')
plt.show()

• Heatmaps: Represent data intensity using color gradients.
o Example: Traffic congestion heatmap.

import folium
from folium.plugins import HeatMap

# Sample data: list of [latitude, longitude, intensity]
data = [[37.7749, -122.4194, 1], [34.0522, -118.2437, 1], [40.7128, -74.0060, 1]]

# Center the map on San Francisco; 'm' avoids shadowing the built-in map()
m = folium.Map(location=[37.7749, -122.4194], zoom_start=5)
HeatMap(data).add_to(m)
m.save("heatmap.html")
Q4) Attempt any one of the following.

a) What are the components of a data scientist's toolbox? Explain two of them in detail.
Sol:

A data scientist's toolbox consists of various tools and software used for data collection,
analysis, visualization, and modeling.

Components:

• Jupyter Notebooks: An open-source web application that allows data scientists to create and share documents containing live code, equations, visualizations, and narrative text. It is widely used for data cleaning, transformation, and exploratory data analysis.

• RStudio: An integrated development environment (IDE) for R, providing tools for data
analysis, visualization, and statistical modeling. It supports various packages and
libraries for advanced data science tasks.
b) What is an outlier? Explain types of outliers.

Sol:

An outlier is a data point that significantly differs from other observations in a dataset.
Outliers can indicate variability in data, errors, or novelties, and identifying them is
essential for accurate data analysis.

Types of Outliers:

1. Global Outliers (Point Outliers):

o Data points that deviate significantly from the rest of the dataset.

o Example: In a dataset of student grades, a score of 100 when most scores range
between 60 and 80.

o Impact: Global outliers can distort statistical analyses and models if not handled
properly.

2. Contextual Outliers (Conditional Outliers):

o Data points that are considered outliers in a specific context but not necessarily in
others.

o Example: A high temperature reading in a normally cold region during winter.

o Impact: Contextual outliers depend on the context of the data and may require
specialized techniques to identify.

3. Collective Outliers:

o A group of data points that deviate significantly from the overall pattern of the dataset.

o Example: A sudden spike in network traffic at a specific time, indicating a potential cyber-attack.

o Impact: Collective outliers often indicate patterns or events that might not be
apparent when looking at individual points.
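
A hedged sketch of one common detection method, the 1.5 × IQR rule, applied to an illustrative grades dataset:

import numpy as np

data = np.array([62, 65, 66, 68, 70, 72, 75, 100])  # 100 is a global outlier

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside [lower, upper] are flagged as outliers
print(data[(data < lower) | (data > upper)])  # [100]
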
c) What are the various types of data? Explain in detail.

Sol:

1. Nominal Data:

o Categorical data without any intrinsic order. It represents distinct categories or labels.

o Examples: Gender (Male, Female), Colors (Red, Blue, Green).

o Usage: Suitable for classification but not for arithmetic operations.

2. Ordinal Data:

o Categorical data with a clear ordering or ranking between the categories.

o Examples: Education levels (High School, Bachelor's, Master's, Ph.D.), Satisfaction levels (Poor, Fair, Good, Excellent).

o Usage: Suitable for classification and ranking but not for arithmetic operations.

3. Interval Data:

o Numeric data with meaningful differences between values, but no true zero point.

o Examples: Temperature in Celsius (0°C does not mean 'no temperature'), Calendar
dates.

o Usage: Suitable for measuring differences but not for ratios or true zero-based
comparisons.

4. Ratio Data:

o Numeric data with meaningful differences between values and a true zero point.

o Examples: Height (in centimeters), Weight (in kilograms), Age (in years), Income.

o Usage: Suitable for all arithmetic operations, including comparisons and ratios.
Q5) Attempt any one of the following:

a) What do you mean by data discretization? Explain discretization by Histogram analysis.

Sol:

Data discretization is the process of converting continuous data into discrete intervals or
categories. This technique simplifies data analysis by reducing the number of possible
values. It helps in making data more interpretable and suitable for certain machine learning
algorithms.

Discretization by Histogram Analysis:

• Method: Divides the range of data into a series of non-overlapping bins or intervals,
and counts the number of data points in each bin.

• Steps:

1. Select the Number of Bins: Decide the number of intervals to divide the data into.

2. Determine Bin Edges: Calculate the range of each bin.

3. Assign Data Points to Bins: Count the number of data points falling into each bin.

• Example:

import matplotlib.pyplot as plt

data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5]

# Divide the value range into 5 equal-width bins and plot the counts
plt.hist(data, bins=5, edgecolor='black')
plt.title('Histogram for Discretization')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()

o Explanation: In this example, the data is divided into 5 bins, and the histogram
shows the frequency distribution of the data points in each bin.
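
To assign each value to its bin (rather than only plotting the counts), numpy's histogram and digitize can be combined; a minimal sketch with the same data:

import numpy as np

data = np.array([1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5])

counts, edges = np.histogram(data, bins=5)  # same 5 equal-width bins
labels = np.digitize(data, edges[1:-1])     # discrete bin index per value

print(counts)  # data points per bin: [1 2 3 4 5]
print(labels)  # bin label (0-4) assigned to each original value
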
b) Write a short note on different methods for measuring data similarity and
dissimilarity.

Sol:

Measuring Data Similarity and Dissimilarity:

1. Euclidean Distance:

o It measures the straight-line distance between two points in a multi-dimensional space.

o Formula: d(p, q) = √( Σᵢ₌₁ⁿ (pᵢ − qᵢ)² )
o Usage: Commonly used in clustering and nearest neighbor algorithms.

2. Manhattan Distance:

o It measures the distance between two points along axes at right angles (sum of
absolute differences).

o Formula: d(p, q) = Σᵢ₌₁ⁿ |pᵢ − qᵢ|
o Usage: Useful in grid-based pathfinding algorithms.

3. Cosine Similarity:

o Description: Measures the cosine of the angle between two vectors, indicating their
orientation.

o Formula: Cosine Similarity = ( Σᵢ₌₁ⁿ pᵢ qᵢ ) / ( √(Σᵢ₌₁ⁿ pᵢ²) · √(Σᵢ₌₁ⁿ qᵢ²) )
o Usage: Commonly used in text analysis and information retrieval to measure similarity between documents.
4. Jaccard Similarity:

o It measures the similarity between two sets by comparing their intersection and
union.

o Formula: J(A, B) = |A ∩ B| / |A ∪ B|
o Usage: Useful for comparing binary attributes, such as presence/absence data.
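
A minimal numpy sketch computing all four measures (the vectors and sets are illustrative):

import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([2.0, 4.0, 6.0])

euclidean = np.sqrt(np.sum((p - q) ** 2))                        # straight-line distance
manhattan = np.sum(np.abs(p - q))                                # sum of absolute differences
cosine = np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))  # 1.0: parallel vectors

A, B = {1, 2, 3}, {2, 3, 4}
jaccard = len(A & B) / len(A | B)  # 2 shared / 4 total = 0.5

print(euclidean, manhattan, cosine, jaccard)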
