Research File 3

To find the variance and standard deviation of the dataset {4, 34, 18, 12, 2, 26}, first compute the mean:

(4 + 34 + 18 + 12 + 2 + 26) ÷ 6 = 16

So the mean is 16. Now subtract the mean from each number, then square the result:

(4 − 16)² = 144

(34 − 16)² = 324

(18 − 16)² = 4

(12 − 16)² = 16

(2 − 16)² = 196

(26 − 16)² = 100

Now we have to figure out the average or mean of these squared values to get the variance.

This is done by adding up the squared results from above, then dividing it by the total count in the group:

(144 + 324 + 4 + 16 + 196 + 100) ÷ 6 = 130.67

This means we end up with a variance of 130.67.

To figure out the standard deviation, we take the square root of the variance: √130.67 ≈ 11.43.
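The same calculation can be checked in R. Note that this worked example uses the population variance (dividing by n), while R's built-in var() and sd() use the sample versions (dividing by n − 1), so they return different numbers:

x <- c(4, 34, 18, 12, 2, 26)
m <- mean(x)                # 16
pop_var <- mean((x - m)^2)  # 130.67 (population variance, divides by n)
sqrt(pop_var)               # 11.43 (population standard deviation)
var(x)                      # 156.8 (sample variance, divides by n - 1)
sd(x)                       # 12.52 (sample standard deviation)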

Z score = (X − μ) / σ, i.e., (value − mean) / SD
Marks (X) = 70

Mean (μ) = 60

SD (σ) = 15

Z = (70 - 60) / 15

= 10/15

≈ 0.67
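A quick check of the same arithmetic in R:

marks <- 70
mu <- 60
sigma <- 15
(marks - mu) / sigma  # 0.6666667, about 0.67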

-------------------------------------------------------------------------------

Exploratory Data Analysis (EDA) is a crucial initial step in data science projects. It involves analyzing and visualizing
data to understand its key characteristics, uncover patterns, and identify relationships between variables.

Graphical Representations: Histograms, box plots, and scatter plots reveal distributions and relationships between variables.

Outlier Detection: Visual and statistical checks flag data points that deviate markedly from the rest of the data.

Handling Missing Values: Missing entries are identified and then dropped or imputed before analysis. A quick sketch of these steps in R follows.
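A minimal EDA sketch using the built-in mtcars data:

summary(mtcars$mpg)     # summary statistics of a variable
hist(mtcars$mpg)        # graphical representation of the distribution
boxplot(mtcars$mpg)     # box plot for visual outlier detection
colSums(is.na(mtcars))  # count missing values per column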


-----------------------------------------------------------------------------

An if-else statement can evaluate almost any type of expression: integer, floating-point, character, pointer, or Boolean. A switch statement can evaluate only an integer or a character. With an if-else statement, either the 'if' block or the 'else' block is executed, based on the condition.
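The same contrast exists in R, where switch() dispatches on a character (or integer) value while an if-else chain can test any logical condition. A minimal sketch:

grade <- "B"

# if-else chain: the condition can be any logical expression
if (grade == "A") {
  print("Excellent")
} else if (grade == "B") {
  print("Good")
} else {
  print("Needs improvement")
}

# switch(): matches on a character (or integer) value; the unnamed
# final argument acts as the default case
result <- switch(grade, A = "Excellent", B = "Good", "Needs improvement")
print(result)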

-----------------------------------------------------------------------------

How do you summarize quantitative data in R using a measure of central tendency?

Using mean() and median(). Note that base R's mode() returns a variable's storage type, not the statistical mode, so the statistical mode requires a short helper function.
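A minimal sketch of all three measures, including a common workaround for the mode:

x <- c(2, 4, 4, 7, 9)
mean(x)    # 5.2
median(x)  # 4

# mode(x) would return "numeric", not the most frequent value.
# A common helper for the statistical mode:
stat_mode <- function(v) {
  ux <- unique(v)
  ux[which.max(tabulate(match(v, ux)))]
}
stat_mode(x)  # 4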

-----------------------------------------------------------------------------

Data integration is the process of combining data from multiple heterogeneous sources into a unified format, making it accessible and usable for analysis, reporting, and decision-making. The goal of data integration is to create a single, consistent view of the data, regardless of its original source or format.
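In practice, a common integration step is joining tables from different sources on a shared key. A tiny sketch in base R, using hypothetical customer and order tables:

customers <- data.frame(CustID = c(1, 2, 3), Name = c("Asha", "Ravi", "Meena"))
orders    <- data.frame(CustID = c(1, 1, 3), Amount = c(250, 120, 90))

# Left join: keep every customer, attach matching orders (NA where none exist)
merged <- merge(customers, orders, by = "CustID", all.x = TRUE)
print(merged)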

-----------------------------------------------------------------------------

Outline the characteristics of social network data and elucidate its distinctions from other types of data within the
realm of data science.

Network Structure: The data is organized as nodes (users) and edges (relationships) rather than as independent rows in a table.

Complex Relationships: Edges can carry direction, weight, and type (friendship, follower, interaction), so observations are interdependent rather than independent.

Heterogeneity: Nodes and edges mix many data types, including profiles, text, timestamps, and multimedia.

------------------------------------------------------------------------------

Dimensionality reduction techniques are used to reduce the number of variables or features in a dataset while
preserving as much relevant information as possible. Here are some common techniques utilized for
dimensionality reduction in data analysis:

1. Principal Component Analysis (PCA):

• PCA is a linear dimensionality reduction technique that identifies the orthogonal axes (principal components) that capture the maximum variance in the data (see the sketch after this list).

2. Linear Discriminant Analysis (LDA):

• LDA is a supervised dimensionality reduction technique that maximizes the separation between classes while
minimizing the within-class variance.
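A minimal PCA sketch using base R's prcomp() on the built-in iris measurements:

# Standardize the variables, then project onto the principal components
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
summary(pca)             # proportion of variance explained per component
reduced <- pca$x[, 1:2]  # keep only the first two components
head(reduced)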
------------------------------------------------------------------------------

Data cleaning, integration, reduction, and transformation are essential components of data preprocessing, and they interact with each other in a holistic approach to prepare data for analysis. Here is how these processes interact and the benefits of adopting a holistic approach:

1. Data Cleaning: Data cleaning involves identifying and correcting errors, inconsistencies, and missing values in the dataset.

2. Data Integration: Data integration involves combining data from different sources or databases into a unified
format.

3. Data Reduction: Data reduction techniques reduce the dimensionality or size of the dataset while preserving its essential features and minimizing information loss. Dimensionality reduction techniques such as principal component analysis (PCA) or feature selection help eliminate redundant features.

4. Data Transformation: Data transformation involves converting the dataset into a suitable format for analysis or
modeling.
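As a small illustration of the transformation step, a min-max scaling sketch in R (the figures are hypothetical):

sales <- c(5000, 5300, 5200, 6000, 6500)
scaled <- (sales - min(sales)) / (max(sales) - min(sales))  # rescale to [0, 1]
scaled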

------------------------------------------------------------------------------------

Global outliers are data points that are significantly different from the rest of the dataset as a whole. They are
identified based on the distribution of the entire dataset.

Local outliers are data points that differ significantly from their local neighborhood but may not stand out when the dataset is viewed as a whole.
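One way to make the distinction concrete is the Local Outlier Factor (LOF), which scores each point against the density of its neighborhood. A sketch with synthetic data, assuming the dbscan package is installed (install.packages("dbscan")):

library(dbscan)
set.seed(7)
tight  <- rnorm(100, mean = 0,  sd = 0.1)    # very dense cluster
spread <- rnorm(100, mean = 50, sd = 10)     # diffuse cluster
x <- matrix(c(tight, spread, 1.5), ncol = 1) # 1.5 lies just outside the dense cluster
scores <- lof(x, minPts = 15)
tail(scores, 1)  # an LOF well above 1 suggests the last point is a local outlier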

-------------------------------------------------------------------------------------

Possible Types of Missing Data:

1. Missing Completely at Random (MCAR): Data is missing randomly and independently of other variables.

2. Missing at Random (MAR): Data is missing systematically but can be predicted from other observed data.

3. Missing Not at Random (MNAR): Data is missing due to a specific reason related to the unobserved data itself,
which may introduce bias.
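A minimal sketch of inspecting and handling missing values in R, with made-up data:

x <- c(5, NA, 8, 12, NA, 7)
sum(is.na(x))             # count missing entries
mean(x, na.rm = TRUE)     # summary statistic that ignores NAs
na.omit(x)                # listwise deletion
ifelse(is.na(x), mean(x, na.rm = TRUE), x)  # simple mean imputation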

--------------------------------------------------------------------------------------

To identify and handle outliers in the given dataset for accurate forecasting and decision-making, the following
steps can be taken:

1. Visual Inspection: Start by visually inspecting the data using scatter plots, box plots, or histograms to identify any
data points that seem unusually distant from the majority.

2. Summary Statistics: Calculate summary statistics for the sales amount such as mean, median, standard deviation,
quartiles, and range. These statistics can provide insights into the distribution of the data and help identify potential
outliers.

3. Z-Score Method: Calculate the Z-score for each data point, which measures how many standard deviations a data point is from the mean. Data points with an absolute Z-score greater than a certain threshold (commonly 3) can be flagged as outliers.

4. Interquartile Range (IQR) Method: Calculate the interquartile range (IQR) for the sales data. Any data point lying more than 1.5 times the IQR above the third quartile or below the first quartile can be flagged as an outlier. The code below applies this method:

# Load necessary libraries
library(dplyr)  # loaded for general data manipulation (not strictly required below)

# Create the dataset
data <- data.frame(
  Month = c("JAN", "JAN", "FEB", "FEB", "MAR", "APR", "MAY", "JUN", "JUL", "AUG"),
  Product_ID = c("A", "B", "A", "A", "A", "B", "A", "A", "A", "B"),
  Sales_Amount = c(5000, 5300, 5200, 6000, 6500, 40000, 5000, 5600, 2000, 60000)
)

# Function to detect outliers using the IQR method
detect_outliers <- function(data, col_name) {
  q1 <- quantile(data[[col_name]], 0.25)
  q3 <- quantile(data[[col_name]], 0.75)
  iqr <- q3 - q1
  lower_bound <- q1 - 1.5 * iqr
  upper_bound <- q3 + 1.5 * iqr
  outliers <- data[[col_name]] < lower_bound | data[[col_name]] > upper_bound
  return(outliers)
}

# Detect outliers in Sales_Amount
outliers <- detect_outliers(data, "Sales_Amount")

# Print outliers
print(data[outliers, ])

# Handle outliers (e.g., remove them)
cleaned_data <- data[!outliers, ]
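For comparison, step 3's z-score method on the same data. With a small sample, extreme values inflate the standard deviation, so this method can be more conservative than the IQR method here:

z <- (data$Sales_Amount - mean(data$Sales_Amount)) / sd(data$Sales_Amount)
data[abs(z) > 3, ]  # threshold of 3 assumed; a lower cutoff (e.g., 2) may suit small samples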


Decision Making: Write an R script to determine if a student has passed or failed each exam. Use a passing score of
80. Create new columns in the student_data data frame named PassExam1, PassExam2, and PassExam3 to indicate
pass or fail for each exam.

Matrix Operations: Create a matrix named average_matrix that contains the StudentID and the average score for
each student.
Display the final results, including the pass/fail status and average scores for each student using a data
representation technique.

Find the interquartile range (IQR) of the dataset.

1. Create Sample Data: Define the student_data data frame.

2. Add Pass/Fail Status: Determine if the student passed or failed each exam.

3. Calculate Average Score: Calculate the average score for each student.

4. Calculate IQR for Each Exam: Calculate the IQR for each exam.

Detailed Breakdown

1. First Quartile (Q1): The 25th percentile of the data.

2. Third Quartile (Q3): The 75th percentile of the data.

3. IQR Calculation: IQR = Q3 − Q1
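A sketch of the full script; the exam scores below are hypothetical sample data:

# 1. Create sample data (hypothetical scores)
student_data <- data.frame(
  StudentID = 1:5,
  Exam1 = c(85, 72, 90, 78, 95),
  Exam2 = c(88, 65, 92, 81, 70),
  Exam3 = c(79, 83, 85, 90, 88)
)

# 2. Add pass/fail status (passing score: 80)
student_data$PassExam1 <- ifelse(student_data$Exam1 >= 80, "Pass", "Fail")
student_data$PassExam2 <- ifelse(student_data$Exam2 >= 80, "Pass", "Fail")
student_data$PassExam3 <- ifelse(student_data$Exam3 >= 80, "Pass", "Fail")

# 3. Matrix with StudentID and each student's average score
avg <- rowMeans(student_data[, c("Exam1", "Exam2", "Exam3")])
average_matrix <- cbind(StudentID = student_data$StudentID, Average = avg)

# Display the final results
print(student_data)
print(average_matrix)

# 4. IQR for each exam
sapply(student_data[, c("Exam1", "Exam2", "Exam3")], IQR)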


------------------------------------------------------------------------------------

In the k-means clustering algorithm, the assignment and update steps are performed iteratively until convergence. The k-means clustering algorithm is a popular unsupervised machine learning algorithm used to partition a dataset into k distinct, non-overlapping subgroups or clusters. Each data point belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.

Key Points:

1. Clusters: Groups of similar data points.

2. Centroids: The center points of the clusters, representing the average position of all the points in the cluster.

3. Objective: Minimize the within-cluster sum of squares (WCSS), which is the sum of squared distances
between each data point and the centroid of its cluster.

Algorithm Steps:

1. Initialization: Choose k initial centroids, for example by selecting k random data points.

2. Assignment Step: Assign each data point to the cluster whose centroid is nearest.

3. Update Step: Recompute each centroid as the mean of the points currently assigned to it.

4. Iteration: Repeat the assignment and update steps.

5. Convergence: Stop when cluster assignments no longer change (or centroids move less than a tolerance); see the sketch after this list.
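A minimal k-means sketch using base R's kmeans() on the built-in iris measurements, assuming k = 3:

set.seed(42)  # reproducible centroid initialization
km <- kmeans(iris[, 1:4], centers = 3, nstart = 25)
km$centers                       # final centroids
km$tot.withinss                  # within-cluster sum of squares (WCSS)
table(km$cluster, iris$Species)  # compare clusters with the actual species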
------------------------------------------------------------------------------------

Illustrate a method for detecting outliers in this dataset. Apply the method and identify any outliers, if present.
