FDS - 5 Solved

TYBCS Foundation of Data Science solved question papers

Q1) Attempt any eight of the following:

a) What is a word cloud?

Sol:

A word cloud is a visual representation of text data where the size of each word indicates its
frequency or importance in the dataset. It helps in quickly identifying the most prominent
terms. For example, in a word cloud generated from a collection of articles, common words
like "data," "analysis," and "science" would appear larger than less common words.

b) Define Statistical data analysis

Sol:

Statistical data analysis is the process of collecting, organizing, analyzing, interpreting, and
presenting data using statistical methods. It involves techniques such as descriptive
statistics (mean, median, mode), inferential statistics (hypothesis testing, confidence
intervals), and predictive statistics (regression analysis) to draw conclusions and make
informed decisions based on data.

c) What do you mean by data transformation?

Sol:

Data transformation is the process of converting data from one format or structure into
another to prepare it for analysis. This includes scaling, normalization, aggregation, and
encoding to improve data quality and ensure consistency. For example, transforming
categorical data into numerical format using one-hot encoding.
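
A minimal sketch of one-hot encoding with pandas (the column values are illustrative):

import pandas as pd

# Illustrative categorical column
df = pd.DataFrame({'color': ['red', 'blue', 'green', 'red']})

# One-hot encoding: each category becomes its own 0/1 column
encoded = pd.get_dummies(df, columns=['color'])
print(encoded)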

d) Why is data cleaning an important operation in data preprocessing?

Sol:

Data cleaning is essential because it ensures the accuracy, completeness, and consistency of the data. It involves handling missing values, correcting errors, and removing duplicates, which helps prevent biases and errors in analysis and improves the reliability of results.
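
A minimal pandas sketch of these operations (the DataFrame is illustrative):

import pandas as pd
import numpy as np

# Illustrative DataFrame with a missing value and a duplicate row
df = pd.DataFrame({'age': [25, np.nan, 30, 30],
                   'city': ['Pune', 'Mumbai', 'Delhi', 'Delhi']})

df = df.drop_duplicates()                        # remove duplicate records
df['age'] = df['age'].fillna(df['age'].mean())   # impute missing values with the mean
print(df)
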
e) What is the purpose of data visualization?

Sol:

The purpose of data visualization is to represent data graphically to make it easier to understand and interpret. It helps identify patterns, trends, and outliers, facilitating data-driven decision-making. Visualizations such as charts, graphs, and maps provide insights that might be missed in raw data.

f) List any two applications of Data Science.

Sol:

• Healthcare: Predictive analytics, disease diagnosis, and personalized treatment plans.
• Finance: Fraud detection, credit risk assessment, and algorithmic trading.

g) What is Visual encoding?

Sol:

Visual encoding is the use of visual elements such as color, shape, size, and position to
represent data in graphical form. It helps in conveying information effectively through visual
means, making complex data more understandable.

h) Define Bubble plot.

Sol:

A bubble plot is a type of scatter plot where each point is represented by a bubble. The
position of the bubble indicates the values of two variables, while the size of the bubble
represents the value of a third variable. It's useful for visualizing the relationships between
three variables.
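
A minimal matplotlib sketch of a bubble plot (all values are illustrative):

import matplotlib.pyplot as plt

x = [1, 2, 3, 4]              # first variable: position on x-axis
y = [10, 20, 25, 30]          # second variable: position on y-axis
sizes = [100, 300, 500, 800]  # third variable: mapped to bubble area

plt.scatter(x, y, s=sizes, alpha=0.5)
plt.xlabel('Variable 1')
plt.ylabel('Variable 2')
plt.title('Bubble Plot')
plt.show()
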
i) What is CSV format?

Sol:

CSV (Comma-Separated Values) format is a plain text format where each line represents a
record, and fields are separated by commas. It is widely used for data exchange between
applications due to its simplicity and compatibility.
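
A minimal sketch reading a CSV file with Python's built-in csv module ('data.csv' is a hypothetical file):

import csv

# Each line of the file is one record; fields are split on commas
with open('data.csv', newline='') as f:
    for row in csv.reader(f):
        print(row)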

j) Define Standard deviation.

Sol:

Standard deviation is a measure of the dispersion or spread of a set of values. It quantifies the amount of variation or deviation from the mean. A low standard deviation indicates that data points are close to the mean, while a high standard deviation indicates that data points are spread out.
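
A minimal sketch using Python's statistics module (the sample values are illustrative):

import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(statistics.pstdev(data))  # population standard deviation -> 2.0
print(statistics.stdev(data))   # sample standard deviation (divides by n - 1)
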
Q2) Attempt any four of the following:

a) What are the 3 V's of data science?

Sol:

1. Volume:

o The volume characteristic of data refers to the sheer amount of data generated
and stored in data systems. It signifies the massive quantities of data that
organizations collect, process, and analyze.

o Example: Social media platforms generating terabytes of user data every day.

2. Velocity:

o Velocity refers to the speed at which data is generated, processed, and analyzed. It emphasizes the need for real-time or near real-time data processing to gain timely insights.

o Example: Stock market data where prices and trades are updated every
second.

3. Variety:

o Variety refers to the different types and sources of data. It includes structured
data (databases), semi-structured data (XML, JSON), and unstructured data
(text, images, videos).

o Example: Data from social media posts, emails, transaction records, and
multimedia content.

b) What are the uses of XML files?

Sol:

• Data Interchange: XML is used for data exchange between systems and
applications.
• Web Services: XML is widely used in web services for data representation and
communication.
• Configuration Files: XML is used for storing configuration settings in applications.
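
A minimal sketch parsing a configuration-style snippet with Python's built-in xml.etree.ElementTree (the XML content is illustrative):

import xml.etree.ElementTree as ET

xml_text = "<config><db host='localhost' port='5432'/></config>"
root = ET.fromstring(xml_text)

# Read attributes from the <db> element
db = root.find('db')
print(db.get('host'), db.get('port'))  # localhost 5432
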
c) Calculate mean, mode, median and range: 22, 24, 35, 47, 23, 23, 31, 25

Sol:

• Mean = (22 + 24 + 35 + 47 + 23 + 23 + 31 + 25) / 8 = 230 / 8 = 28.75
• Mode: 23 (most frequently occurring value)
• Median: Sort the values: 22, 23, 23, 24, 25, 31, 35, 47
o Median = (24 + 25) / 2 = 24.5
• Range = 47 – 22 = 25
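
These results can be checked with Python's built-in statistics module:

import statistics

data = [22, 24, 35, 47, 23, 23, 31, 25]
print(statistics.mean(data))    # 28.75
print(statistics.mode(data))    # 23
print(statistics.median(data))  # 24.5
print(max(data) - min(data))    # range: 25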

d) Why is data preprocessing important before data analysis?

Sol:

Data preprocessing is crucial because it ensures the quality and consistency of the data. It
involves cleaning, transforming, and organizing data to make it suitable for analysis.
Preprocessing improves the accuracy and reliability of analysis results and helps in
building better models.

e) List different types of data attributes with examples.

Sol:

• Nominal Attributes: Categorical data with no intrinsic order (e.g., Gender: Male,
Female).
• Ordinal Attributes: Categorical data with an intrinsic order (e.g., Education levels:
High School, Bachelor's, Master's).
• Interval Attributes: Numeric data with meaningful differences but no true zero point
(e.g., Temperature in Celsius).
• Ratio Attributes: Numeric data with meaningful differences and a true zero point
(e.g., Height, Weight).
Q3) Attempt any two of the following:

a) What do you mean by Data Reduction? Explain any two Data Reduction strategies.
Sol:

Data reduction involves reducing the volume of data while maintaining its integrity. It helps
in improving storage efficiency and speeding up data processing.

Strategies

• Dimensionality Reduction: Reducing the number of attributes or features in the dataset (e.g., Principal Component Analysis - PCA; see the sketch after this list).
• Numerosity Reduction: Reducing the number of data points using techniques like
clustering, aggregation, and sampling.
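
A minimal PCA sketch using scikit-learn (the dataset here is randomly generated for illustration):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 5)       # illustrative dataset: 100 samples, 5 features
pca = PCA(n_components=2)        # keep the 2 directions of highest variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # variance captured by each component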

b) What do you mean by hypothesis testing? Explain null and alternate hypothesis.
Sol:

• Hypothesis Testing: A statistical method used to make inferences or draw conclusions about a population based on sample data.
• Null Hypothesis (H0): The default assumption or statement that there is no effect or
difference.
• Alternate Hypothesis (H1): The statement that there is an effect or difference.
• Example: Testing whether a new drug is more effective than an existing one.
o Null Hypothesis (H0): The new drug is not more effective than the existing one.
o Alternate Hypothesis (H1): The new drug is more effective than the existing one.
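
A minimal sketch of this test as a two-sample t-test with scipy (the recovery times are illustrative):

from scipy import stats

# Hypothetical recovery times in days for each drug
existing = [12, 14, 13, 15, 16, 14, 13]
new = [10, 11, 12, 10, 13, 11, 12]

t_stat, p_value = stats.ttest_ind(existing, new)
# Reject H0 at the 5% significance level when p_value < 0.05
print(t_stat, p_value)
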
c) How to visualize geospatial data? Explain in detail.

Sol:

Geospatial data is data that includes geographic information such as coordinates, locations, and maps.

Visualization Techniques:

• Choropleth Maps: Represent data using different colors or shades for geographic regions.
o Example: Population density map.

import geopandas as gpd
import matplotlib.pyplot as plt

# Load the bundled world map (note: this built-in dataset was removed in
# geopandas 1.0; with newer versions, download the Natural Earth shapefile
# and pass its local path to read_file instead)
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))

# Shade each country by its estimated population
world.plot(column='pop_est', cmap='OrRd', legend=True)
plt.title('World Population')
plt.show()

• Heatmaps: Represent data intensity using color gradients.
o Example: Traffic congestion heatmap.

import folium
from folium.plugins import HeatMap

# Sample data: list of [latitude, longitude, intensity]
data = [[37.7749, -122.4194, 1], [34.0522, -118.2437, 1], [40.7128, -74.0060, 1]]

# Center the map on San Francisco; 'm' avoids shadowing the built-in map()
m = folium.Map(location=[37.7749, -122.4194], zoom_start=5)
HeatMap(data).add_to(m)
m.save("heatmap.html")
Q4) Attempt any one of the following.

a) What are the components of a data scientist's toolbox? Explain two of them in detail.
Sol:

A data scientist's toolbox consists of various tools and software used for data collection,
analysis, visualization, and modeling.

Components:

• Jupyter Notebooks: An open-source web application that allows data scientists to create and share documents containing live code, equations, visualizations, and narrative text. It is widely used for data cleaning, transformation, and exploratory data analysis.

• RStudio: An integrated development environment (IDE) for R, providing tools for data
analysis, visualization, and statistical modeling. It supports various packages and
libraries for advanced data science tasks.
b) What is an outlier? Explain types of outliers.

Sol:

An outlier is a data point that significantly differs from other observations in a dataset.
Outliers can indicate variability in data, errors, or novelties, and identifying them is
essential for accurate data analysis.

Types of Outliers:

1. Global Outliers (Point Outliers):

o Data points that deviate significantly from the rest of the dataset.

o Example: In a dataset of student grades, a score of 100 when most scores range
between 60 and 80.

o Impact: Global outliers can distort statistical analyses and models if not handled
properly.

2. Contextual Outliers (Conditional Outliers):

o Data points that are considered outliers in a specific context but not necessarily in
others.

o Example: A high temperature reading in a normally cold region during winter.

o Impact: Contextual outliers depend on the context of the data and may require
specialized techniques to identify.

3. Collective Outliers:

o A group of data points that deviate significantly from the overall pattern of the dataset.

o Example: A sudden spike in network traffic at a specific time, indicating a potential cyber-attack.

o Impact: Collective outliers often indicate patterns or events that might not be
apparent when looking at individual points.
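
A hedged sketch of one common detection method, the 1.5 × IQR rule, applied to an illustrative grades dataset:

import numpy as np

data = np.array([62, 65, 66, 68, 70, 72, 75, 100])  # 100 is a global outlier

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside [lower, upper] are flagged as outliers
print(data[(data < lower) | (data > upper)])  # [100]
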
c) What are the various types of data? Explain in detail.

Sol:

1. Nominal Data:

o Categorical data without any intrinsic order. It represents distinct categories or labels.

o Examples: Gender (Male, Female), Colors (Red, Blue, Green).

o Usage: Suitable for classification but not for arithmetic operations.

2. Ordinal Data:

o Categorical data with a clear ordering or ranking between the categories.

o Examples: Education levels (High School, Bachelor's, Master's, Ph.D.), Satisfaction levels (Poor, Fair, Good, Excellent).

o Usage: Suitable for classification and ranking but not for arithmetic operations.

3. Interval Data:

o Numeric data with meaningful differences between values, but no true zero point.

o Examples: Temperature in Celsius (0°C does not mean 'no temperature'), Calendar
dates.

o Usage: Suitable for measuring differences but not for ratios or true zero-based
comparisons.

4. Ratio Data:

o Numeric data with meaningful differences between values and a true zero point.

o Examples: Height (in centimeters), Weight (in kilograms), Age (in years), Income.

o Usage: Suitable for all arithmetic operations, including comparisons and ratios.
Q5) Attempt any one of the following:

a) What do you mean by data discretization? Explain discretization by Histogram analysis.

Sol:

Data discretization is the process of converting continuous data into discrete intervals or
categories. This technique simplifies data analysis by reducing the number of possible
values. It helps in making data more interpretable and suitable for certain machine learning
algorithms.

Discretization by Histogram Analysis:

• Method: Divides the range of data into a series of non-overlapping bins or intervals,
and counts the number of data points in each bin.

• Steps:

1. Select the Number of Bins: Decide the number of intervals to divide the data into.

2. Determine Bin Edges: Calculate the range of each bin.

3. Assign Data Points to Bins: Count the number of data points falling into each bin.

• Example:

import matplotlib.pyplot as plt

data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5]

# Divide the value range into 5 equal-width bins and plot the counts
plt.hist(data, bins=5, edgecolor='black')
plt.title('Histogram for Discretization')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()

o Explanation: In this example, the data is divided into 5 bins, and the histogram
shows the frequency distribution of the data points in each bin.
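
To assign each value to its bin (rather than only plotting the counts), numpy's histogram and digitize can be combined; a minimal sketch with the same data:

import numpy as np

data = np.array([1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5])

counts, edges = np.histogram(data, bins=5)  # same 5 equal-width bins
labels = np.digitize(data, edges[1:-1])     # discrete bin index per value

print(counts)  # data points per bin: [1 2 3 4 5]
print(labels)  # bin label (0-4) assigned to each original value
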
b) Write a short note on different methods for measuring data similarity and
dissimilarity.

Sol:

Measuring Data Similarity and Dissimilarity:

1. Euclidean Distance:

o It measures the straight-line distance between two points in a multi-dimensional space.

o Formula: d(p, q) = √( Σᵢ₌₁ⁿ (pᵢ − qᵢ)² )
o Usage: Commonly used in clustering and nearest neighbor algorithms.

2. Manhattan Distance:

o It measures the distance between two points along axes at right angles (sum of
absolute differences).

o Formula: d(p, q) = Σᵢ₌₁ⁿ |pᵢ − qᵢ|
o Usage: Useful in grid-based pathfinding algorithms.

3. Cosine Similarity:

o Description: Measures the cosine of the angle between two vectors, indicating their
orientation.

o Formula: Cosine Similarity = ( Σᵢ₌₁ⁿ pᵢ qᵢ ) / ( √(Σᵢ₌₁ⁿ pᵢ²) · √(Σᵢ₌₁ⁿ qᵢ²) )
o Usage: Commonly used in text analysis and information retrieval to measure similarity between documents.
4. Jaccard Similarity:

o It measures the similarity between two sets by comparing their intersection and
union.

o Formula: J(A, B) = |A ∩ B| / |A ∪ B|
o Usage: Useful for comparing binary attributes, such as presence/absence data.
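
A minimal numpy sketch computing all four measures (the vectors and sets are illustrative):

import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([2.0, 4.0, 6.0])

euclidean = np.sqrt(np.sum((p - q) ** 2))                        # straight-line distance
manhattan = np.sum(np.abs(p - q))                                # sum of absolute differences
cosine = np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))  # 1.0: parallel vectors

A, B = {1, 2, 3}, {2, 3, 4}
jaccard = len(A & B) / len(A | B)  # 2 shared / 4 total = 0.5

print(euclidean, manhattan, cosine, jaccard)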
