Shagun Term Work (R)

The document outlines various R programming tasks including computing descriptive statistics, visualizing data from CSV files, cleaning datasets, performing linear and logistic regression, implementing KNN and ID3 algorithms, and applying K-means clustering. Each task includes a problem statement, objective, algorithm, and code snippets for implementation. The document serves as a comprehensive guide for statistical analysis and machine learning using R.


Shagun Kumar MCA(71) 1102867

Problem Statement 1: Write an R script to compute the mean, mode, variance, and standard deviation of a numeric dataset.

Objective: To compute the mean, mode, variance, and standard deviation of a numeric
dataset in R.

Algorithm:
1. Input a numeric vector.

2. Compute:

• mean() – built-in for arithmetic mean.

• var() – built-in for the sample variance.

• sd() – built-in for standard deviation.

• mode() – custom function since R’s mode() refers to data type, not statistical mode.

Code:
# Custom function to calculate the statistical mode
# (returns every value tied for the highest frequency if the data are multimodal)
get_mode <- function(v) {
  freq_table <- table(v)
  max_freq <- max(freq_table)
  mode_vals <- as.numeric(names(freq_table)[freq_table == max_freq])
  return(mode_vals)
}

# Descriptive statistics function
descriptive_statistics <- function(data) {
  mean_val <- mean(data)
  mode_val <- get_mode(data)
  variance_val <- var(data)
  std_dev <- sd(data)

  return(list(
    Mean = mean_val,
    Mode = mode_val,
    Variance = variance_val,
    Standard_Deviation = std_dev
  ))
}

# Example usage
data <- c(4, 7, 2, 7, 3, 9, 10, 2, 7)
result <- descriptive_statistics(data)

# Print results
print(result)

Output:
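For the example vector above, the expected values (worked by hand as a quick check) are mean = 51/9 ≈ 5.67, mode = 7, sample variance = 72/8 = 9, and standard deviation = 3; note that R's var() and sd() use the sample (n − 1) formulas.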

Problem Statement 2: Read a dataset from a CSV file and generate basic visualizations: bar chart, pie chart, and 3D pie chart.

Objective: To import a dataset from a CSV file and visualize categorical data using bar
charts, pie charts, and 3D pie charts in R.

Algorithm:
1. Use read.csv() to load the dataset.

2. Choose a categorical column for visualization.

3. Count frequencies using table().

4. Plot:

• Bar chart using barplot()

• Pie chart using pie()

• 3D pie chart using plotrix::pie3D()

Code:
# Load necessary library
if (!require("plotrix")) install.packages("plotrix", dependencies = TRUE)
library(plotrix)

# Read CSV file
data <- read.csv("your_dataset.csv")

# Replace 'CategoryColumn' with your actual categorical column name
category_col <- data$CategoryColumn

# Create frequency table
freq_table <- table(category_col)

# Bar Chart
barplot(freq_table, col = "skyblue", main = "Bar Chart", ylab = "Frequency")

# Pie Chart
pie(freq_table, main = "Pie Chart", col = rainbow(length(freq_table)))

# 3D Pie Chart
pie3D(freq_table, labels = names(freq_table), main = "3D Pie Chart", explode = 0.1)
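To make the listing above runnable end to end, a small sample CSV matching the placeholders can be created first (a sketch, not part of the original listing; the file name your_dataset.csv and the column name CategoryColumn simply mirror the placeholders used above):

# Create a small sample CSV matching the placeholders in the code above
sample_data <- data.frame(
  ID = 1:8,
  CategoryColumn = c("Apple", "Banana", "Apple", "Orange",
                     "Banana", "Apple", "Orange", "Banana")
)
write.csv(sample_data, "your_dataset.csv", row.names = FALSE)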

Output:

Problem Statement 3: Clean a dataset read from a CSV file by handling missing or inconsistent data using appropriate R functions.

Objective: To clean a dataset by identifying and handling missing or inconsistent data in R.

Algorithm:
1. Import the dataset using read.csv().

2. Check for missing values using is.na() and summary().

3. Handle missing data:

• Remove rows with na.omit() or complete.cases()

• Impute values using mean(), median(), or custom logic.

4. Check for inconsistencies using unique(), table(), or string normalization.

5. Standardize/Correct inconsistent values with tolower(), trimws(), gsub(), etc.

Code:
# Dirty sample data
dirty_data <- data.frame(
  ID = 1:10,
  CategoryColumn = c(" Apple ", "Banana", "apple", "Orange", "bananna", "Apple",
                     "Orange", " Banana ", "APPLE", "orange"),
  Value = c(10, 15, NA, 5, 25, NA, 10, 20, 15, NA)
)

# Write to CSV
write.csv(dirty_data, "your_dirty_dataset.csv", row.names = FALSE)
print("your_dirty_dataset.csv has been created.")

# Load data
data <- read.csv("your_dirty_dataset.csv", stringsAsFactors = FALSE)

# View summary to detect NAs or anomalies
summary(data)

# Find missing values
colSums(is.na(data))

# Option 1: Remove rows with missing values
cleaned_data <- na.omit(data)

# Option 2: Impute missing numeric values with the column mean
data$Value[is.na(data$Value)] <- mean(data$Value, na.rm = TRUE)

# Fix inconsistent categorical data (e.g., different casing or extra spaces)
data$CategoryColumn <- trimws(tolower(data$CategoryColumn))

# Replace common typos (e.g., "bananna" → "banana")
data$CategoryColumn <- gsub("bananna", "banana", data$CategoryColumn)

# Final cleaned dataset
print(head(data))

Output:

Problem Statement 4: Read a dataset from a CSV file and plot its data points in both 2D and 3D space using appropriate R libraries.

Objective: To visualize data points from a CSV file in 2D and 3D space using R plotting
functions.

Algorithm:
1. Import the dataset using read.csv().

2. Plot in 2D using plot() or ggplot2 (e.g., X vs Y).

3. Plot in 3D:

• Use the scatterplot3d or rgl library for 3D scatter plots.

4. Label axes and title for clarity.

Code:
# Sample 3D data
three_d_data <- data.frame(
  ID = 1:10,
  X = seq(2, 20, by = 2),
  Y = seq(3, 12, by = 1),
  Z = c(5, 6, 7, 10, 12, 14, 15, 18, 20, 22)
)

# Write to CSV
write.csv(three_d_data, "your_3d_dataset.csv", row.names = FALSE)

print("your_3d_dataset.csv has been created.")

# Load necessary libraries
if (!require("scatterplot3d")) install.packages("scatterplot3d", dependencies = TRUE)
library(scatterplot3d)

# Read data from CSV
data <- read.csv("your_3d_dataset.csv")

# 2D Plot (X vs Y)
plot(data$X, data$Y, main = "2D Plot: X vs Y",
     xlab = "X Axis", ylab = "Y Axis", col = "blue", pch = 19)

# 3D Plot (X, Y, Z)
scatterplot3d(data$X, data$Y, data$Z,
              main = "3D Scatter Plot", xlab = "X Axis", ylab = "Y Axis",
              zlab = "Z Axis", color = "darkgreen", pch = 16)

Output:

Problem Statement 5: Perform a linear regression analysis on the given Height and Weight dataset of mice. Derive the relationship coefficients and summarize the regression results.

Objective: To perform a linear regression analysis on the Height and Weight data, derive
the coefficients, and summarize the results in R.

Algorithm:
1. Input the data: Create vectors for Height and Weight.

2. Perform the linear regression: Use lm() to fit the regression model.

3. Summarize the regression: Use summary() to view coefficients, R-squared, and p-values.

Code:
# Given data for Height and Weight
Height <- c(140, 142, 150, 147, 139, 152, 154, 135, 148, 147)
Weight <- c(59, 61, 66, 62, 57, 68, 69, 58, 63, 62)

# Perform linear regression: Weight = a + b * Height
model <- lm(Weight ~ Height)

# Summarize the regression results
summary(model)
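As a short follow-up sketch (assuming the model object fitted above), the coefficients can be extracted directly and used to predict the weight at a new height:

# Extract the intercept and slope, then predict Weight for Height = 150
coef(model)
predict(model, newdata = data.frame(Height = 150))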



Output:

Problem Statement 6: Apply logistic regression to the Iris dataset, creating a subset with only two species, and interpret the results.

Objective: To apply logistic regression to the Iris dataset for predicting the species based on
its attributes and interpret the results, using only two species.

Algorithm:
1. Load and inspect the dataset: Use iris dataset, which comes preloaded in R.

2. Create a subset: Filter the dataset to include only two species (e.g., "setosa" and
"versicolor").

3. Fit the logistic regression model: Use glm() to model the binary classification
problem.

4. Summarize the model: Use summary() to inspect coefficients, significance, and model fit.

Code:
# Load the Iris dataset
data(iris)

# Create a subset with only two species: 'setosa' and 'versicolor'
iris_subset <- subset(iris, Species %in% c("setosa", "versicolor"))

# Convert the Species variable into a binary response (0 for setosa, 1 for versicolor)
iris_subset$Species_binary <- ifelse(iris_subset$Species == "setosa", 0, 1)

# Fit the logistic regression model
# (Note: setosa and versicolor are perfectly separable, so glm() may warn that
# fitted probabilities of 0 or 1 occurred; the fit still illustrates the method.)
logistic_model <- glm(Species_binary ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                      data = iris_subset,
                      family = binomial)

# Summarize the model results
summary(logistic_model)
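A brief optional addition (assuming the logistic_model and iris_subset objects above): converting the fitted probabilities into class labels and comparing them with the true labels helps interpret the model:

# Predicted probabilities of being 'versicolor' (class 1) on the training data
pred_prob <- predict(logistic_model, type = "response")

# Classify with a 0.5 threshold and tabulate against the actual labels
pred_class <- ifelse(pred_prob > 0.5, 1, 0)
table(Predicted = pred_class, Actual = iris_subset$Species_binary)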



Output:

Problem Statement 7: Implement and apply the K-Nearest Neighbors (KNN) classifier on a large dataset using R.

Objective: To implement the KNN classifier on a large dataset and use it for classification
tasks in R.

Algorithm:
1. Load the dataset: Use a large dataset like iris or load a custom one.

2. Split the dataset: Divide the data into training and testing subsets.

3. Standardize the data: KNN is sensitive to the scale of the data, so we standardize
features.

4. Apply the KNN algorithm: Use the knn() function from the class library.

5. Evaluate the model: Assess the model's accuracy using a confusion matrix or
accuracy score.

Code:
# Load necessary libraries
if (!require("class")) install.packages("class", dependencies = TRUE)
library(class)
if (!require("caret")) install.packages("caret", dependencies = TRUE)
library(caret)

# Use the iris dataset for demonstration
data(iris)

# Split the data into training and test sets (70% for training, 30% for testing)
set.seed(123)  # For reproducibility
train_index <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
train_data <- iris[train_index, ]
test_data <- iris[-train_index, ]

# Standardize the features (important for KNN); exclude the 'Species' column
train_features <- scale(train_data[, -5])
# Scale the test set with the training means and standard deviations
test_features <- scale(test_data[, -5],
                       center = attr(train_features, "scaled:center"),
                       scale = attr(train_features, "scaled:scale"))

# Apply the KNN algorithm (choose k = 3 for this example)
k <- 3
predictions <- knn(train = train_features, test = test_features, cl = train_data$Species, k = k)

# Evaluate the model with a confusion matrix
confusion_matrix <- confusionMatrix(predictions, test_data$Species)
print(confusion_matrix)

# Evaluate the accuracy
accuracy <- sum(predictions == test_data$Species) / length(predictions)
cat("Accuracy of KNN model: ", accuracy, "\n")
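Because the accuracy depends on the chosen k, a small comparison loop (a sketch reusing the train/test objects above) can help pick a reasonable value:

# Compare a few candidate k values on the same split
for (k_try in c(1, 3, 5, 7, 9)) {
  pred_k <- knn(train = train_features, test = test_features,
                cl = train_data$Species, k = k_try)
  cat("k =", k_try, "accuracy =", mean(pred_k == test_data$Species), "\n")
}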

Output:

Problem Statement 8: Implement the ID3 algorithm (a decision tree algorithm) to build a decision tree model using R and apply it to a sample dataset.

Objective: To implement the ID3 algorithm for decision tree classification in R and use it to
create a decision tree model.

Algorithm:
1. Load the dataset: Use a dataset like iris or any custom dataset.

2. Preprocess the data: Ensure categorical features are coded as factors.

3. Apply the ID3 algorithm: Use rpart() from the rpart library or a custom implementation (ID3 builds decision trees based on information gain).

4. Summarize the model: Use summary() and plot the decision tree.

5. Evaluate the model: Use a confusion matrix or accuracy score to evaluate the model.

Code:
# Install necessary libraries
if (!require("rpart")) install.packages("rpart", dependencies = TRUE)
if (!require("rpart.plot")) install.packages("rpart.plot", dependencies = TRUE)
library(rpart)
library(rpart.plot)

# Load the iris dataset (for demonstration purposes)
data(iris)

# Apply the ID3 algorithm using rpart (CART algorithm, a simplified decision tree)
# ID3 focuses on information gain, but rpart works similarly for classification
tree_model <- rpart(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                    data = iris, method = "class",
                    control = rpart.control(minsplit = 2, cp = 0.01))

# Summarize the decision tree model
summary(tree_model)

# Plot the decision tree
rpart.plot(tree_model, main = "Decision Tree using ID3 Algorithm (CART)", extra = 106)

# Predict on the dataset
predictions <- predict(tree_model, iris, type = "class")

# Evaluate the model using a confusion matrix
confusion_matrix <- table(Predicted = predictions, Actual = iris$Species)
print(confusion_matrix)

# Calculate accuracy
accuracy <- sum(predictions == iris$Species) / length(predictions)
cat("Accuracy of Decision Tree model: ", accuracy, "\n")
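rpart splits on the Gini index by default; to mirror ID3's information-gain criterion more closely, the split measure can be switched (a small variation on the call above, not part of the original listing):

# Same model, but splitting on information gain (entropy), as ID3 does
tree_model_info <- rpart(Species ~ ., data = iris, method = "class",
                         parms = list(split = "information"),
                         control = rpart.control(minsplit = 2, cp = 0.01))
rpart.plot(tree_model_info, main = "Decision Tree (information-gain splits)")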

Output:

Problem Statement 9: Perform K-means clustering on a given dataset in R and interpret the results.

Objective: To apply K-means clustering to a dataset, analyze the resulting clusters, and
interpret the results in R.

Algorithm:
1. Load the dataset: Use a dataset such as iris or any custom dataset.

2. Preprocess the data: Remove categorical variables and scale/normalize if necessary.

3. Apply K-means clustering: Use the kmeans() function to perform clustering.


4. Analyze the results: View cluster assignments, cluster centers, and evaluate the
clustering performance.

5. Visualize the clusters: Use a 2D or 3D plot to visualize the clusters.

Code:
if (!require("ggplot2")) install.packages("ggplot2", dependencies = TRUE) library(ggplot2)

data(iris)

iris_data <- iris[, -5]

# Standardize the data (important for K-means clustering) iris_scaled


<- scale(iris_data)

# Perform K-means clustering with k = 3 (since there are 3 species in iris dataset)
set.seed(123) # For reproducibility kmeans_result <- kmeans(iris_scaled, centers
= 3)

# View the results of the clustering


print(kmeans_result)

# Add cluster assignments to the original dataset iris$Cluster


<- as.factor(kmeans_result$cluster)

# Visualize the clusters in 2D (using first two principal components for visualization)
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Cluster)) +
geom_point() +
labs(title = "K-means Clustering of Iris Dataset", x = "Sepal Length", y = "Sepal Width") +
theme_minimal()
Shagun Kumar MCA(71) 1102867

# Check the clustering accuracy (comparing the clusters with the actual species labels)
table(iris$Species, iris$Cluster)

# Calculate the total within-cluster sum of squares


cat("Total Within-Cluster Sum of Squares: ", kmeans_result$tot.withinss, "\n")
Output:

Problem Statement 10: Implement the CURE (Clustering Using Representatives) algorithm manually to perform clustering on a given dataset in R without using built-in clustering functions.

Objective: To manually implement the CURE algorithm for clustering and partition the
dataset into groups based on representative points.

Algorithm:
1. Initialize Clusters: Treat each point as an individual cluster.

2. Distance Measure: Define distance between clusters as the minimum distance


between any two representative points (initially each point is its own
representative).

3. Merge Clusters:

o Iteratively merge the two closest clusters.

o After merging, select c representative points from the merged cluster.

o Shrink those representative points towards the cluster centroid using a shrink
factor α.

4. Repeat until k clusters remain.

Code:
# Load libraries
library(ggplot2)

# Sample 2D dataset: three groups of 10 points each
set.seed(42)
df <- rbind(
  matrix(rnorm(20, mean = 0), ncol = 2),
  matrix(rnorm(20, mean = 5), ncol = 2),
  matrix(rnorm(20, mean = 10), ncol = 2)
)
colnames(df) <- c("x", "y")
df <- as.data.frame(df)

# Parameters
k <- 3        # Desired number of clusters
c <- 5        # Number of representative points per cluster
alpha <- 0.2  # Shrink factor

# Initialize each point as its own cluster
clusters <- lapply(1:nrow(df), function(i) list(points = df[i, , drop = FALSE]))

# Distance between two clusters (minimum distance between any two representative points)
cluster_distance <- function(cluster1, cluster2) {
  reps1 <- cluster1$reps
  reps2 <- cluster2$reps
  if (nrow(reps1) == 0 || nrow(reps2) == 0) return(Inf)
  n1 <- nrow(reps1)
  n2 <- nrow(reps2)
  if (n1 == 1 && n2 == 1) {
    return(sqrt(sum((reps1 - reps2)^2)))
  }
  dists <- as.matrix(dist(rbind(reps1, reps2)))
  dists_sub <- dists[1:n1, (n1 + 1):(n1 + n2)]
  if (length(dists_sub) == 0 || all(is.na(dists_sub))) return(Inf)
  return(min(dists_sub, na.rm = TRUE))
}

# Select up to c representative points and shrink them towards the centroid
shrink_representatives <- function(points, c, alpha) {
  center <- colMeans(points)
  dists <- apply(points, 1, function(row) sqrt(sum((row - center)^2)))
  n_rep <- min(c, nrow(points))  # Small clusters may have fewer than c points
  reps <- points[order(-dists)[1:n_rep], , drop = FALSE]  # Select the farthest points
  reps <- reps - alpha * (reps - matrix(center, nrow = n_rep, ncol = 2, byrow = TRUE))
  return(reps)
}

# Assign initial representatives (each point is its own representative)
for (i in seq_along(clusters)) {
  clusters[[i]]$reps <- clusters[[i]]$points
}

# Merge until k clusters remain
while (length(clusters) > k) {
  # Find the two closest clusters
  min_dist <- Inf
  pair <- c(0, 0)
  for (i in 1:(length(clusters) - 1)) {
    for (j in (i + 1):length(clusters)) {
      d <- cluster_distance(clusters[[i]], clusters[[j]])
      if (is.na(d)) next
      if (d < min_dist) {
        min_dist <- d
        pair <- c(i, j)
      }
    }
  }
  # Merge the two closest clusters
  merged_points <- rbind(clusters[[pair[1]]]$points, clusters[[pair[2]]]$points)
  merged_cluster <- list(points = merged_points)
  merged_cluster$reps <- shrink_representatives(merged_points, c, alpha)
  # Remove the old clusters and add the merged cluster
  clusters <- clusters[-sort(pair, decreasing = TRUE)]
  clusters[[length(clusters) + 1]] <- merged_cluster
}

# Visualize the final clusters (representatives marked with stars)
plot(df, col = "gray", main = "CURE Clustering Result", pch = 19)
colors <- c("red", "blue", "green", "orange", "purple")
for (i in seq_along(clusters)) {
  points(clusters[[i]]$points, col = colors[i], pch = 19)
  points(clusters[[i]]$reps, col = colors[i], pch = 8, cex = 1.5)
}
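As a quick sanity check (not in the original listing), the size of each final cluster can be printed from the clusters list built above:

# Number of points in each of the k final clusters
sapply(clusters, function(cl) nrow(cl$points))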

Output:

Problem Statement 11: Anomaly Detection: Implement anomaly detection techniques in R for a specified dataset.

Objective: To implement anomaly detection techniques in R to identify outliers or unusual observations in a given dataset.

Algorithm:
1. Z-Score Method:

Algorithm:

1. For each feature, compute the mean and standard deviation.

2. Calculate the Z-score for each data point:

Z = (x − μ) / σ

3. If |Z| > threshold (e.g., 3), mark it as an anomaly.

2. IQR Method:

Algorithm:

1. For each feature, compute the first (Q1) and third (Q3) quartiles.

2. Compute IQR = Q3 - Q1.

3. An observation is an outlier if:

x < Q1 − 1.5 × IQR or x > Q3 + 1.5 × IQR

3. Isolation Forest (Advanced):

Algorithm:

1. Randomly partition data by selecting features and split values.

2. Isolated points require fewer splits to be separated.

3. The anomaly score is based on the average path length from trees.

4. Higher scores indicate more isolated (anomalous) observations.

Code:
# Load a sample dataset
data("mtcars")
df <- mtcars

# Method 1: Z-score
# Calculate z-scores for every numeric column
z_scores <- scale(df)
# Identify anomalies (absolute z-score > 3)
anomalies_z <- which(abs(z_scores) > 3, arr.ind = TRUE)
df[unique(anomalies_z[, 1]), ]

# Method 2: IQR rule
# Function to detect outliers using the IQR rule
iqr_outliers <- function(x) {
  Q1 <- quantile(x, 0.25)
  Q3 <- quantile(x, 0.75)
  IQR <- Q3 - Q1
  x < (Q1 - 1.5 * IQR) | x > (Q3 + 1.5 * IQR)
}
# Apply to all numeric columns
outlier_matrix <- sapply(df, iqr_outliers)
df[apply(outlier_matrix, 1, any), ]

# Method 3: Isolation Forest
# Install and load the required package
if (!require("isotree")) install.packages("isotree")
library(isotree)
# Train the isolation forest model
iso_model <- isolation.forest(df)
# Predict anomaly scores
scores <- predict(iso_model, df, type = "score")
# Identify anomalies (e.g., score > 0.65)
df[scores > 0.65, ]
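A short optional comparison (assuming the anomalies_z, outlier_matrix, and scores objects above) shows how many rows each method flags:

# Count the rows flagged by each technique
cat("Z-score rows:", length(unique(anomalies_z[, 1])),
    "| IQR rows:", sum(apply(outlier_matrix, 1, any)),
    "| Isolation forest rows:", sum(scores > 0.65), "\n")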
Output:

Problem Statement 12: Implement the Apriori algorithm in R to find frequent itemsets
and association rules in a given dataset.

Objective: To implement the Apriori algorithm in R for mining frequent itemsets and
generating association rules from a given dataset.

Algorithm:
1. Load and prepare the transactional dataset.

2. Define minimum support and confidence thresholds.

3. Use the Apriori algorithm to identify frequent itemsets.

4. Generate association rules from the frequent itemsets.

5. Analyze the output rules.

Code:
# Step 1: Install and load the required package
if (!require("arules")) install.packages("arules", dependencies = TRUE)
library(arules)

# Step 2: Load dataset (you can replace this with your own CSV/transaction data)
data("Groceries")  # Built-in dataset from the arules package

# Step 3: Run the Apriori algorithm
rules <- apriori(Groceries, parameter = list(support = 0.01, confidence = 0.2))

# Step 4: View the first 10 generated association rules
inspect(rules[1:10])

# Step 5: Optional - sort rules by lift
inspect(sort(rules, by = "lift")[1:10])
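The listing above mines rules directly; to list the frequent itemsets themselves (as the problem statement also asks), the target parameter of apriori() can be changed, a short sketch on the same Groceries data:

# Mine frequent itemsets only (no rules) at 1% minimum support
itemsets <- apriori(Groceries,
                    parameter = list(support = 0.01, target = "frequent itemsets"))
inspect(sort(itemsets, by = "support")[1:10])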

Output: