Research File 3

To find the variance and standard deviation of the dataset {4, 34, 18, 12, 2, 26}, first compute the mean:

(4 + 34 + 18 + 12 + 2 + 26) ÷ 6 = 16

So the mean is 16. Now subtract the mean from each number, then square the result:

(4 − 16)² = 144

(34 − 16)² = 324

(18 − 16)² = 4

(12 − 16)² = 16

(2 − 16)² = 196

(26 − 16)² = 100

Now we have to figure out the average or mean of these squared values to get the variance.

This is done by adding up the squared results from above, then dividing it by the total count in the group:

(144 + 324 + 4 + 16 + 196 + 100) ÷ 6 = 130.67

This means we end up with a variance of 130.67.

To figure out the standard deviation, we take the square root of the variance: √130.67 ≈ 11.43.
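The same calculation can be checked in R. Note that this worked example uses the population variance (dividing by n), while R's built-in var() and sd() use the sample versions (dividing by n − 1), so they return different numbers:

x <- c(4, 34, 18, 12, 2, 26)
m <- mean(x)                # 16
pop_var <- mean((x - m)^2)  # 130.67 (population variance, divides by n)
sqrt(pop_var)               # 11.43 (population standard deviation)
var(x)                      # 156.8 (sample variance, divides by n - 1)
sd(x)                       # 12.52 (sample standard deviation)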

Z score = (X − μ) / σ, i.e., (value − mean) / SD
Marks (X) = 70

Mean (μ) = 60

SD (σ) = 15

Z = (70 - 60) / 15

= 10/15

≈ 0.67
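A quick check of the same arithmetic in R:

marks <- 70
mu <- 60
sigma <- 15
(marks - mu) / sigma  # 0.6666667, about 0.67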

-------------------------------------------------------------------------------

Exploratory Data Analysis (EDA) is a crucial initial step in data science projects. It involves analyzing and visualizing
data to understand its key characteristics, uncover patterns, and identify relationships between variables.

Graphical Representations: Histograms, box plots, and scatter plots reveal distributions and relationships between variables.

Outlier Detection: Visual and statistical checks flag data points that deviate markedly from the rest of the data.

Handling Missing Values: Missing entries are identified and then dropped or imputed before analysis. A quick sketch of these steps in R follows.
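A minimal EDA sketch using the built-in mtcars data:

summary(mtcars$mpg)     # summary statistics of a variable
hist(mtcars$mpg)        # graphical representation of the distribution
boxplot(mtcars$mpg)     # box plot for visual outlier detection
colSums(is.na(mtcars))  # count missing values per column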


-----------------------------------------------------------------------------

An if-else statement can evaluate almost any type of expression: integer, floating-point, character, pointer, or Boolean. A switch statement can evaluate only an integer or a character. With an if-else statement, either the 'if' block or the 'else' block is executed, based on the condition.
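The same contrast exists in R, where switch() dispatches on a character (or integer) value while an if-else chain can test any logical condition. A minimal sketch:

grade <- "B"

# if-else chain: the condition can be any logical expression
if (grade == "A") {
  print("Excellent")
} else if (grade == "B") {
  print("Good")
} else {
  print("Needs improvement")
}

# switch(): matches on a character (or integer) value; the unnamed
# final argument acts as the default case
result <- switch(grade, A = "Excellent", B = "Good", "Needs improvement")
print(result)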

-----------------------------------------------------------------------------

How do you summarize quantitative data in R using a measure of central tendency?

Using mean() and median(). Note that base R's mode() returns a variable's storage type, not the statistical mode, so the statistical mode requires a short helper function.
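A minimal sketch of all three measures, including a common workaround for the mode:

x <- c(2, 4, 4, 7, 9)
mean(x)    # 5.2
median(x)  # 4

# mode(x) would return "numeric", not the most frequent value.
# A common helper for the statistical mode:
stat_mode <- function(v) {
  ux <- unique(v)
  ux[which.max(tabulate(match(v, ux)))]
}
stat_mode(x)  # 4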

-----------------------------------------------------------------------------

Data integration is the process of combining data from multiple heterogeneous sources into a unified format, making it accessible and usable for analysis, reporting, and decision-making. The goal of data integration is to create a single, consistent view of the data, regardless of its original source or format.
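In practice, a common integration step is joining tables from different sources on a shared key. A tiny sketch in base R, using hypothetical customer and order tables:

customers <- data.frame(CustID = c(1, 2, 3), Name = c("Asha", "Ravi", "Meena"))
orders    <- data.frame(CustID = c(1, 1, 3), Amount = c(250, 120, 90))

# Left join: keep every customer, attach matching orders (NA where none exist)
merged <- merge(customers, orders, by = "CustID", all.x = TRUE)
print(merged)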

-----------------------------------------------------------------------------

Outline the characteristics of social network data and elucidate its distinctions from other types of data within the
realm of data science.

Network Structure: The data is organized as nodes (users) and edges (relationships) rather than as independent rows in a table.

Complex Relationships: Edges can carry direction, weight, and type (friendship, follower, interaction), so observations are interdependent rather than independent.

Heterogeneity: Nodes and edges mix many data types, including profiles, text, timestamps, and multimedia.

------------------------------------------------------------------------------

Dimensionality reduction techniques are used to reduce the number of variables or features in a dataset while
preserving as much relevant information as possible. Here are some common techniques utilized for
dimensionality reduction in data analysis:

1. Principal Component Analysis (PCA):

• PCA is a linear dimensionality reduction technique that identifies the orthogonal axes (principal components) that capture the maximum variance in the data (see the sketch after this list).

2. Linear Discriminant Analysis (LDA):

• LDA is a supervised dimensionality reduction technique that maximizes the separation between classes while
minimizing the within-class variance.
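A minimal PCA sketch using base R's prcomp() on the built-in iris measurements:

# Standardize the variables, then project onto the principal components
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
summary(pca)             # proportion of variance explained per component
reduced <- pca$x[, 1:2]  # keep only the first two components
head(reduced)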
------------------------------------------------------------------------------

Data cleaning, integration, reduction, and transformation are essential components of data preprocessing, and they interact with each other in a holistic approach to prepare data for analysis. Here is how these processes interact and the benefits of adopting a holistic approach:

1. Data Cleaning: Data cleaning involves identifying and correcting errors, inconsistencies, and missing values in the dataset.

2. Data Integration: Data integration involves combining data from different sources or databases into a unified
format.

3. Data Reduction: Data reduction techniques reduce the dimensionality or size of the dataset while preserving its essential features and minimizing information loss. Dimensionality reduction techniques such as principal component analysis (PCA) or feature selection help eliminate redundant features.

4. Data Transformation: Data transformation involves converting the dataset into a suitable format for analysis or
modeling.
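As a small illustration of the transformation step, a min-max scaling sketch in R (the figures are hypothetical):

sales <- c(5000, 5300, 5200, 6000, 6500)
scaled <- (sales - min(sales)) / (max(sales) - min(sales))  # rescale to [0, 1]
scaled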

------------------------------------------------------------------------------------

Global outliers are data points that are significantly different from the rest of the dataset as a whole. They are
identified based on the distribution of the entire dataset.

Local outliers are data points that differ significantly from their local neighborhood but may not stand out when the dataset is viewed as a whole.
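One way to make the distinction concrete is the Local Outlier Factor (LOF), which scores each point against the density of its neighborhood. A sketch with synthetic data, assuming the dbscan package is installed (install.packages("dbscan")):

library(dbscan)
set.seed(7)
tight  <- rnorm(100, mean = 0,  sd = 0.1)    # very dense cluster
spread <- rnorm(100, mean = 50, sd = 10)     # diffuse cluster
x <- matrix(c(tight, spread, 1.5), ncol = 1) # 1.5 lies just outside the dense cluster
scores <- lof(x, minPts = 15)
tail(scores, 1)  # an LOF well above 1 suggests the last point is a local outlier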

-------------------------------------------------------------------------------------

Possible Types of Missing Data:

1. Missing Completely at Random (MCAR): Data is missing randomly and independently of other variables.

2. Missing at Random (MAR): Data is missing systematically but can be predicted from other observed data.

3. Missing Not at Random (MNAR): Data is missing due to a specific reason related to the unobserved data itself,
which may introduce bias.
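A minimal sketch of inspecting and handling missing values in R, with made-up data:

x <- c(5, NA, 8, 12, NA, 7)
sum(is.na(x))             # count missing entries
mean(x, na.rm = TRUE)     # summary statistic that ignores NAs
na.omit(x)                # listwise deletion
ifelse(is.na(x), mean(x, na.rm = TRUE), x)  # simple mean imputation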

--------------------------------------------------------------------------------------

To identify and handle outliers in the given dataset for accurate forecasting and decision-making, the following
steps can be taken:

1. Visual Inspection: Start by visually inspecting the data using scatter plots, box plots, or histograms to identify any
data points that seem unusually distant from the majority.

2. Summary Statistics: Calculate summary statistics for the sales amount such as mean, median, standard deviation,
quartiles, and range. These statistics can provide insights into the distribution of the data and help identify potential
outliers.

3. Z-Score Method: Calculate the Z-score for each data point, which measures how many standard deviations a data point is from the mean. Data points with an absolute Z-score greater than a certain threshold (commonly 3) can be flagged as outliers.

4. Interquartile Range (IQR) Method: Calculate the interquartile range (IQR) for the sales data. Any data point lying more than 1.5 times the IQR above the third quartile or below the first quartile can be flagged as an outlier. The code below applies this method:

# Load necessary libraries
library(dplyr)  # loaded for general data manipulation (not strictly required below)

# Create the dataset
data <- data.frame(
  Month = c("JAN", "JAN", "FEB", "FEB", "MAR", "APR", "MAY", "JUN", "JUL", "AUG"),
  Product_ID = c("A", "B", "A", "A", "A", "B", "A", "A", "A", "B"),
  Sales_Amount = c(5000, 5300, 5200, 6000, 6500, 40000, 5000, 5600, 2000, 60000)
)

# Function to detect outliers using the IQR method
detect_outliers <- function(data, col_name) {
  q1 <- quantile(data[[col_name]], 0.25)
  q3 <- quantile(data[[col_name]], 0.75)
  iqr <- q3 - q1
  lower_bound <- q1 - 1.5 * iqr
  upper_bound <- q3 + 1.5 * iqr
  outliers <- data[[col_name]] < lower_bound | data[[col_name]] > upper_bound
  return(outliers)
}

# Detect outliers in Sales_Amount
outliers <- detect_outliers(data, "Sales_Amount")

# Print outliers
print(data[outliers, ])

# Handle outliers (e.g., remove them)
cleaned_data <- data[!outliers, ]
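For comparison, step 3's z-score method on the same data. With a small sample, extreme values inflate the standard deviation, so this method can be more conservative than the IQR method here:

z <- (data$Sales_Amount - mean(data$Sales_Amount)) / sd(data$Sales_Amount)
data[abs(z) > 3, ]  # threshold of 3 assumed; a lower cutoff (e.g., 2) may suit small samples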


Decision Making: Write an R script to determine if a student has passed or failed each exam. Use a passing score of
80. Create new columns in the student_data data frame named PassExam1, PassExam2, and PassExam3 to indicate
pass or fail for each exam.

Matrix Operations: Create a matrix named average_matrix that contains the StudentID and the average score for
each student.
Display the final results, including the pass/fail status and average scores for each student using a data
representation technique.

Find the interquartile range (IQR) of the dataset.

1. Create Sample Data: Define the student_data data frame.

2. Add Pass/Fail Status: Determine if the student passed or failed each exam.

3. Calculate Average Score: Calculate the average score for each student.

4. Calculate IQR for Each Exam: Calculate the IQR for each exam.

Detailed Breakdown

1. First Quartile (Q1): The 25th percentile of the data.

2. Third Quartile (Q3): The 75th percentile of the data.

3. IQR Calculation: IQR = Q3 − Q1
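A sketch of the full script; the exam scores below are hypothetical sample data:

# 1. Create sample data (hypothetical scores)
student_data <- data.frame(
  StudentID = 1:5,
  Exam1 = c(85, 72, 90, 78, 95),
  Exam2 = c(88, 65, 92, 81, 70),
  Exam3 = c(79, 83, 85, 90, 88)
)

# 2. Add pass/fail status (passing score: 80)
student_data$PassExam1 <- ifelse(student_data$Exam1 >= 80, "Pass", "Fail")
student_data$PassExam2 <- ifelse(student_data$Exam2 >= 80, "Pass", "Fail")
student_data$PassExam3 <- ifelse(student_data$Exam3 >= 80, "Pass", "Fail")

# 3. Matrix with StudentID and each student's average score
avg <- rowMeans(student_data[, c("Exam1", "Exam2", "Exam3")])
average_matrix <- cbind(StudentID = student_data$StudentID, Average = avg)

# Display the final results
print(student_data)
print(average_matrix)

# 4. IQR for each exam
sapply(student_data[, c("Exam1", "Exam2", "Exam3")], IQR)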


------------------------------------------------------------------------------------

In the k-means clustering algorithm, the assignment and update steps are performed iteratively until convergence. The k-means clustering algorithm is a popular unsupervised machine learning algorithm used to partition a dataset into k distinct, non-overlapping subgroups or clusters. Each data point belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.

Key Points:

1. Clusters: Groups of similar data points.

2. Centroids: The center points of the clusters, representing the average position of all the points in the cluster.

3. Objective: Minimize the within-cluster sum of squares (WCSS), which is the sum of squared distances
between each data point and the centroid of its cluster.

Algorithm Steps:

1. Initialization: Choose k initial centroids, for example by selecting k random data points.

2. Assignment Step: Assign each data point to the cluster whose centroid is nearest.

3. Update Step: Recompute each centroid as the mean of the points currently assigned to it.

4. Iteration: Repeat the assignment and update steps.

5. Convergence: Stop when cluster assignments no longer change (or centroids move less than a tolerance); see the sketch after this list.
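A minimal k-means sketch using base R's kmeans() on the built-in iris measurements, assuming k = 3:

set.seed(42)  # reproducible centroid initialization
km <- kmeans(iris[, 1:4], centers = 3, nstart = 25)
km$centers                       # final centroids
km$tot.withinss                  # within-cluster sum of squares (WCSS)
table(km$cluster, iris$Species)  # compare clusters with the actual species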
------------------------------------------------------------------------------------

Illustrate a method for detecting outliers in this dataset. Apply the method and identify any outliers, if present.
