Write an R Script to perform the data Import and Export Operations
Ex.No:01
Date:
AIM:
To write an R Script to perform the data Import and Export operations.
ALGORITHM:
Step 1: Define a data frame df with columns Name, Language, and Age.
Step 2: Use write.table() to export df to a text file named "myDataFrame.txt".
Step 3: Define the column names and add values to the data frame.
Step 4: Call the write.table() function to save the data frame to a file, specifying the file name with the file parameter.
Step 5: Check the file in the working directory to confirm a successful export, and open it to verify the structure and formatting.
Step 6: Use the file.choose() function inside read.csv() to let the user manually select a CSV file from their system.
Step 7: Set header = TRUE so the first row of the file is treated as column names.
Step 8: Print the data frame to display it in the console.
Step 9: Confirm that all rows and columns of the data frame are displayed.
PROGRAM:
# EXPORTING DATA
df <- data.frame(
"Name" = c("Amiya", "Raj", "Asish"),
"Language" = c("R", "python", "java"),
"Age" = c(22, 25, 45))
write.table(df,
file = "myDataFrame.txt",
sep = "\t",
row.names = TRUE,
col.names = NA)
# IMPORTING DATA
# Import and store the dataset in data1
data1 <- read.csv(file.choose(), header = TRUE)
# Display the data
data1
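As an optional sketch of a non-interactive round trip (the file name myDataFrame.csv is an illustrative assumption, written to the working directory), the same data frame can be exported to CSV and read back:
# Optional sketch: export df to a CSV file and import it back
write.csv(df, file = "myDataFrame.csv", row.names = FALSE)
data2 <- read.csv("myDataFrame.csv", header = TRUE)
data2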
OUTPUT:
RESULT:
The data Import and Export operations were executed successfully in RStudio.
Ex.No:02
Date:
Write an R Script to perform the Data Pre-processing techniques
AIM:
To write an R Script to perform the Data Pre-processing techniques in RStudio.
ALGORITHM:
Step 1: Load the dplyr and tidyr packages for data manipulation.
Step 2: Load the mtcars dataset and display the first few rows using head(mtcars).
Step 3: Count and print the total number of missing values in the dataset.
Step 4: Impute missing values using the mean of each column.
Step 5: Introduce a duplicate row by appending the first row to the dataset, then remove duplicate rows using distinct().
Step 6: Normalize the mpg column using the scale() function.
Step 7: Create an additional dataset with car names and randomly assigned country values (USA, Japan, Europe).
Step 8: Add a new column car to store the row names, then merge the datasets on the car column.
Step 9: Display the first few rows of the final dataset using head(mtcars).
PROGRAM:
# Load necessary libraries
library(dplyr)
library(tidyr)
# Load the dataset
data(mtcars)
# View the first few rows of the dataset
head(mtcars)
# 1. Handling Missing Data
# For demonstration, let's introduce some missing values
mtcars[1, "mpg"] <- NA
mtcars[5, "hp"] <- NA
# Identify missing values
missing_values <- sum(is.na(mtcars))
print(paste("Number of missing values:", missing_values))
# Impute missing values with the mean of the column
mtcars <- mtcars %>%
mutate(across(everything(), ~ ifelse(is.na(.), mean(., na.rm = TRUE), .)))
# 2. Removing Duplicates
# For demonstration, let's introduce a duplicate row
mtcars <- rbind(mtcars, mtcars[1, ])
# Remove duplicate rows
mtcars <- mtcars %>%
distinct()
# 3. Data Transformation
# Normalize the 'mpg' column
mtcars <- mtcars %>%
mutate(mpg_normalized = scale(mpg))
# 4. Feature Engineering
# Create a new feature 'power_to_weight_ratio'
mtcars <- mtcars %>%
mutate(power_to_weight_ratio = hp / wt)
# 5. Data Integration
# For demonstration, let's create a second dataset
additional_data <- data.frame(
car = rownames(mtcars),
country = sample(c("USA", "Japan", "Europe"), nrow(mtcars), replace = TRUE))
# Merge the datasets
mtcars <- mtcars %>%
mutate(car = rownames(mtcars)) %>%
left_join(additional_data, by = "car")
# View the final pre-processed dataset
head(mtcars)
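The scale() call above produces z-scores. Should min-max scaling to [0, 1] be wanted instead, a minimal sketch (the helper min_max is illustrative, not part of the recorded program):
# Optional sketch: min-max normalization of mpg to the [0, 1] range
min_max <- function(x) (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
mtcars <- mtcars %>%
  mutate(mpg_minmax = min_max(mpg))
head(mtcars[, c("mpg", "mpg_normalized", "mpg_minmax")])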
OUTPUT:
RESULT:
The Data Pre-processing techniques were executed successfully.
Ex.No:03
Date:
Write an R Script to perform the descriptive statistics concepts.
AIM:
To write an R Script to perform the descriptive statistics concepts in RStudio.
ALGORITHM:
Step 1: Check whether the dplyr package is installed; if not, install it, then load the dplyr library.
Step 2: Set a random seed (set.seed(123)) for reproducibility and create the sample dataset.
Step 3: Display the first six rows of the data using head(data).
Step 4: Use summary(data) to get basic statistics (min, max, median, mean, etc.) for the numeric columns.
Step 5: Compute the frequency distribution of the Gender column using table(data$Gender).
Step 6: Compute the percentage distribution using prop.table() and multiply by 100.
Step 7: Calculate the Pearson correlation between Height and Weight using cor(data$Height, data$Weight).
Step 8: Compute grouped statistics by Gender: the mean and standard deviation of Height and of Weight.
Step 9: Plot a histogram of Age using hist(data$Age) with blue bars and a black border.
PROGRAM:
# Load necessary library
if (!require("dplyr")) install.packages("dplyr")
library(dplyr)
# Create a sample dataset
set.seed(123)
data <- data.frame(
ID = 1:100,
Age = sample(18:60, 100, replace = TRUE),
Height = round(rnorm(100, mean = 165, sd = 10), 1),
Weight = round(rnorm(100, mean = 65, sd = 15), 1),
Gender = sample(c("Male", "Female"), 100, replace = TRUE))
# View the first few rows of the dataset
head(data)
# Descriptive statistics
# Summary statistics for numeric variables
summary(data)
# Mean, Median, Variance, and Standard Deviation for a specific column
mean_age <- mean(data$Age)
median_age <- median(data$Age)
var_age <- var(data$Age)
sd_age <- sd(data$Age)
cat("Mean Age:", mean_age, "\n")
cat("Median Age:", median_age, "\n")
cat("Variance of Age:", var_age, "\n")
cat("Standard Deviation of Age:", sd_age, "\n")
# Frequency distribution for categorical variable
gender_distribution <- table(data$Gender)
cat("\nGender Distribution:\n")
print(gender_distribution)
# Percentage distribution
gender_percentage <- prop.table(gender_distribution) * 100
cat("\nGender Percentage Distribution:\n")
print(round(gender_percentage, 2))
# Correlation between two numeric variables
correlation <- cor(data$Height, data$Weight)
cat("\nCorrelation between Height and Weight:", correlation, "\n")
# Grouped statistics: Mean Height and Weight by Gender
grouped_stats <- data %>%
group_by(Gender) %>%
summarise(
Mean_Height = mean(Height),
Mean_Weight = mean(Weight),
SD_Height = sd(Height),
SD_Weight = sd(Weight))
cat("\nGrouped Statistics by Gender:\n")
print(grouped_stats)
# Histogram for Age
hist(data$Age, main = "Histogram of Age", xlab = "Age", col = "blue", border = "black")
# Boxplot for Height by Gender
boxplot(Height ~ Gender, data = data,
main = "Boxplot of Height by Gender",
xlab = "Gender", ylab = "Height (cm)",
col = c("lightblue", "pink"))
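Base R has no built-in function for the statistical mode; a minimal sketch of one (get_mode is an illustrative helper, not part of the recorded program):
# Optional sketch: most frequent value (statistical mode) of Age
get_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
cat("Mode of Age:", get_mode(data$Age), "\n")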
OUTPUT:
RESULT:
The descriptive statistics concepts were executed successfully in RStudio.
Ex.No:04
Date:
Visualizing the data in different graphics using R Script.
AIM:
To write an R Script for visualizing the data in different graphics.
ALGORITHM:
Step 1: Check whether the ggplot2 and dplyr libraries are installed.
Step 2: If not installed, install them using install.packages(), then load them using library().
Step 3: Set a random seed for reproducibility and create a data frame (data) with the variables ID, Age, Height, Weight, and Gender.
Step 4: Scatter plot: plot Height vs. Weight with geom_point(), setting point size and transparency and colouring by Gender; set the title, labels, and theme.
Step 5: Histogram: plot the Age distribution with geom_histogram(), using binwidth = 5 to group ages into intervals of 5 years.
Step 6: Bar plot: group the dataset by Gender, count occurrences using summarise(), and plot the counts.
Step 7: Line chart: sort the dataset by ID using arrange() and draw the trend with geom_line().
Step 8: Density plot: draw density curves with geom_density(), filling by Gender.
Step 9: Pie chart: group the dataset by Gender, count occurrences, and plot with a polar-coordinate bar chart.
Step 10: Faceted plot: repeat the Height vs. Weight scatter plot, faceted by Gender.
PROGRAM:
# Load necessary library
if (!require("ggplot2")) install.packages("ggplot2")
if (!require("dplyr")) install.packages("dplyr")
library(ggplot2)
library(dplyr)
# Create a sample dataset
set.seed(123)
data <- data.frame(
ID = 1:100,
Age = sample(18:60, 100, replace = TRUE),
Height = round(rnorm(100, mean = 165, sd = 10), 1),
Weight = round(rnorm(100, mean = 65, sd = 15), 1),
Gender = sample(c("Male", "Female"), 100, replace = TRUE))
# 1. Scatter Plot: Height vs. Weight
ggplot(data, aes(x = Height, y = Weight, color = Gender)) +
geom_point(size = 3, alpha = 0.7) +
labs(title = "Scatter Plot of Height vs. Weight",
x = "Height (cm)",
y = "Weight (kg)") +
theme_minimal()
# 2. Histogram: Distribution of Age
ggplot(data, aes(x = Age)) +
geom_histogram(binwidth = 5, fill = "blue", color = "black", alpha = 0.7) +
labs(title = "Histogram of Age",
x = "Age (years)",
y = "Frequency") +
theme_minimal()
# 3. Boxplot: Height by Gender
ggplot(data, aes(x = Gender, y = Height, fill = Gender)) +
geom_boxplot() +
labs(title = "Boxplot of Height by Gender",
x = "Gender",
y = "Height (cm)") +
theme_minimal()
# 4. Bar Plot: Gender Distribution
gender_distribution <- data %>%
group_by(Gender) %>%
summarise(Count = n())
ggplot(gender_distribution, aes(x = Gender, y = Count, fill = Gender)) +
geom_bar(stat = "identity", width = 0.6) +
labs(title = "Bar Plot of Gender Distribution",
x = "Gender",
y = "Count") +
theme_minimal()
# 5. Line Chart: Age Trend (sorted by ID)
data <- data %>% arrange(ID)
ggplot(data, aes(x = ID, y = Age, group = 1)) +
geom_line(color = "darkgreen", size = 1) +
geom_point(color = "red", size = 2) +
labs(title = "Line Chart of Age Trend by ID",
x = "ID",
y = "Age (years)") +
theme_minimal()
# 6. Density Plot: Weight Distribution
ggplot(data, aes(x = Weight, fill = Gender)) +
geom_density(alpha = 0.5) +
labs(title = "Density Plot of Weight Distribution",
x = "Weight (kg)",
y = "Density") +
theme_minimal()
# 7. Pie Chart: Gender Proportion
gender_proportion <- data %>%
group_by(Gender) %>%
summarise(Count = n())
ggplot(gender_proportion, aes(x = "", y = Count, fill = Gender)) +
geom_bar(stat = "identity", width = 1) +
coord_polar("y") +
labs(title = "Pie Chart of Gender Proportion") +
theme_void()
# 8. Faceted Plot: Scatter Plot of Height vs. Weight by Gender
ggplot(data, aes(x = Height, y = Weight)) +
geom_point(aes(color = Gender), size = 2, alpha = 0.7) +
facet_wrap(~ Gender) +
labs(title = "Scatter Plot of Height vs. Weight (Faceted by Gender)",
x = "Height (cm)",
y = "Weight (kg)") +
theme_minimal()
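Each plot above renders to the active graphics device. To keep a copy on disk, ggsave() saves the most recently displayed ggplot; a minimal sketch (facet_plot.png is an illustrative file name):
# Optional sketch: save the last rendered plot as a PNG in the working directory
ggsave("facet_plot.png", width = 6, height = 4, dpi = 300)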
OUTPUT:
RESULT:
Visualizing the data in different graphics using an R Script was executed successfully.
Ex.No:05
Date:
Write an R Script to implement the Normal and binomial distribution.
AIM:
To write an R Script to implement the Normal and Binomial distributions.
ALGORITHM:
Step 1: Load the ggplot2 package, which is used for visualization.
Step 2: Set a random seed to ensure reproducibility of the results.
Step 3: Generate 1000 random observations from a standard normal distribution and create a histogram, using geom_histogram() to plot the density of the normal data with the theoretical curve overlaid.
Step 4: Generate 1000 random observations from a binomial distribution with number of trials (size) = 10 and probability of success (prob) = 0.5.
Step 5: Create a bar plot to visualize the binomial distribution.
Step 6: Use geom_bar() to plot the proportion of each outcome.
Step 7: Together, these plots visualize both distributions in R.
PROGRAM:
# Load necessary library
library(ggplot2)
# Generate a sample of 1000 observations from a normal distribution
set.seed(123)
normal_data <- rnorm(1000, mean = 0, sd = 1)
# Plot the histogram of the normal distribution
ggplot(data.frame(x = normal_data), aes(x = x)) +
geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "blue", alpha = 0.5) +
stat_function(fun = dnorm, args = list(mean = 0, sd = 1), color = "red", linewidth = 1) +
labs(title = "Normal Distribution", x = "Value", y = "Density")
# Generate a sample of 1000 observations from a binomial distribution
set.seed(123)
binomial_data <- rbinom(1000, size = 10, prob = 0.5)
# Plot the histogram of the binomial distribution
ggplot(data.frame(x = binomial_data), aes(x = factor(x))) +
geom_bar(aes(y = after_stat(prop), group = 1), fill = "green", alpha = 0.5) +
labs(title = "Binomial Distribution", x = "Number of Successes", y = "Proportion")
OUTPUT:
RESULT:
The Normal and Binomial distributions were implemented successfully in R.
Write an R Script to convert numerical data to categorical and binomial
distribution.
Ex.No:06
Date:
AIM:
To write an R Script to convert numerical data to categorical and binomial distribution.
ALGORITHM:
Step 1: Set a random seed so the randomly generated numbers are the same every time the script is run.
Step 2: Create a data frame data with 20 observations.
Step 3: Use the cut() function to group ages into predefined categories.
Step 4: Print the dataset after the categorical variables have been added.
Step 5: Count the occurrences of each age group and of each income bracket.
Step 6: Check whether the ggplot2 package is installed; if not, install it.
Step 7: Define the dataset and the variable for the x-axis of the bar plot.
Step 8: Repeat for the income brackets, using different colors.
Step 9: Produce two bar plots visualizing the age groups and income brackets.
Step 10: The output is a dataset with numerical and categorical variables, plus frequency counts of the categories.
PROGRAM:
# Create a sample dataset with numerical data
set.seed(123)
data <- data.frame(
ID = 1:20,
Age = sample(18:60, 20, replace = TRUE),
Income = round(rnorm(20, mean = 50000, sd = 10000), 0))
# View the dataset
print("Original Dataset:")
print(data)
# Convert numerical data to categorical variables
# 1. Age categorization into age groups
data$Age_Group <- cut(
data$Age,
breaks = c(0, 18, 30, 45, 60, Inf),
labels = c("Child", "Young Adult", "Adult", "Middle Age", "Senior"),
right = FALSE)
# 2. Income categorization into income brackets
data$Income_Bracket <- cut(
data$Income,
breaks = c(-Inf, 30000, 50000, 70000, Inf),
labels = c("Low", "Medium", "High", "Very High"),
right = TRUE)
# View the updated dataset
print("Updated Dataset with Categorical Variables:")
print(data)
# Frequency tables for the new categorical variables
cat("\nFrequency of Age Groups:\n")
print(table(data$Age_Group))
cat("\nFrequency of Income Brackets:\n")
print(table(data$Income_Bracket))
# Visualize the categorical variables
if (!require("ggplot2")) install.packages("ggplot2")
library(ggplot2)
# Bar plot for Age Groups
ggplot(data, aes(x = Age_Group)) +
geom_bar(fill = "skyblue", color = "black") +
labs(title = "Bar Plot of Age Groups",
x = "Age Group",
y = "Count") +
theme_minimal()
# Bar plot for Income Brackets
ggplot(data, aes(x = Income_Bracket)) +
geom_bar(fill = "orange", color = "black") +
labs(title = "Bar Plot of Income Brackets",
x = "Income Bracket",
y = "Count") +
theme_minimal()
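The breaks above are fixed by hand. Breaks can also be derived from the data itself via quantiles; a minimal sketch (Income_Quartile is an illustrative column name):
# Optional sketch: quartile-based income categories using data-driven breaks
data$Income_Quartile <- cut(
  data$Income,
  breaks = quantile(data$Income, probs = seq(0, 1, 0.25)),
  labels = c("Q1", "Q2", "Q3", "Q4"),
  include.lowest = TRUE)
table(data$Income_Quartile)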
OUTPUT:
RESULT:
The R Script to convert numerical data to categorical and binomial distribution was executed successfully.
Ex.No:07
Date:
Write an R Script to implement Bayes’ Theorem.
AIM:
To write an R Script for Bayes’ Theorem.
ALGORITHM:
Step 1: Define P(A), the probability that a randomly chosen person has the disease.
Step 2: Define P(B|A), the probability that the test is positive given the person has the disease.
Step 3: Define P(B|¬A), the probability that the test is positive given the person does not have the disease.
Step 4: Compute P(¬A) = 1 − P(A), the probability of not having the disease.
Step 5: Compute the marginal probability of a positive test, P(B) = P(B|A)·P(A) + P(B|¬A)·P(¬A).
Step 6: Apply Bayes’ Theorem to get the posterior probability of disease given a positive test, P(A|B) = P(B|A)·P(A) / P(B).
Step 7: Print the prior, likelihood, marginal, and posterior probabilities.
Step 8: Draw a bar plot showing the prior, likelihood, and posterior probabilities.
PROGRAM:
# Define probabilities
# Example Scenario: Testing for a disease
# A: Person has the disease
# B: Test result is positive
# Prior probability (P(A)): Probability of having the disease
P_A <- 0.01 # 1% of the population has the disease
# Likelihood (P(B|A)): Probability of a positive test result given the person has the disease
P_B_given_A <- 0.95 # Test is 95% accurate for those with the disease
# Marginal probability of B (P(B)):
# P(B) = P(B|A) * P(A) + P(B|¬A) * P(¬A)
P_B_given_not_A <- 0.05 # 5% false positive rate
P_not_A <- 1 - P_A # Probability of not having the disease
P_B <- (P_B_given_A * P_A) + (P_B_given_not_A * P_not_A)
# Posterior probability (P(A|B)): Using Bayes' Theorem
P_A_given_B <- (P_B_given_A * P_A) / P_B
# Print the results
cat("P(A): Prior Probability of Having the Disease =", P_A, "\n")
cat("P(B|A): Likelihood of Positive Test Given Disease =", P_B_given_A, "\n")
cat("P(B): Marginal Probability of a Positive Test =", P_B, "\n")
cat("P(A|B): Posterior Probability of Having the Disease Given a Positive Test =",
P_A_given_B, "\n")
# Additional Example: Visualization
# Visualize the probabilities using a bar plot
if (!require("ggplot2")) install.packages("ggplot2")
library(ggplot2)
data <- data.frame(
Category = c("Prior: P(A)", "Likelihood: P(B|A)", "Posterior: P(A|B)"),
Probability = c(P_A, P_B_given_A, P_A_given_B))
ggplot(data, aes(x = Category, y = Probability, fill = Category)) +
geom_bar(stat = "identity", color = "black") +
scale_fill_brewer(palette = "Set3") +
labs(title = "Bayes' Theorem Probabilities",
x = "Probability Component",
y = "Probability") +
theme_minimal()
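The same computation can be wrapped in a reusable function for other test characteristics; a minimal sketch (bayes_posterior is an illustrative helper, not part of the recorded program):
# Optional sketch: posterior P(A|B) for any prior, sensitivity, and false positive rate
bayes_posterior <- function(prior, sensitivity, false_positive_rate) {
  evidence <- sensitivity * prior + false_positive_rate * (1 - prior)
  (sensitivity * prior) / evidence
}
bayes_posterior(0.01, 0.95, 0.05)   # reproduces P(A|B) above
bayes_posterior(0.001, 0.95, 0.05)  # same test, rarer disease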
OUTPUT:
RESULT:
In R Script the Bayes’ Theorem was executed successfully.
Write an R Script to implement the Time series data analysis and forecasting.
Ex.No:08
Date:
AIM:
To write an R script that implements time series data analysis and forecasting.
ALGORITHM:
Step 1: Before running the analysis, ensure the required packages (forecast, ggplot2, and tseries) are installed and loaded.
Step 2: Generate synthetic monthly time series data and visualize the original series.
Step 3: Decompose the time series into trend, seasonal, and residual components using decompose().
Step 4: Perform the Augmented Dickey-Fuller (ADF) test to check whether the series is stationary.
Step 5: Fit an ARIMA model with auto.arima() and print its summary.
Step 6: Forecast future values with the fitted model and plot the forecast.
Step 7: Analyze the residuals and validate the model using the Ljung-Box test.
Step 8: Use STL (Seasonal and Trend decomposition using LOESS) for an alternative decomposition.
PROGRAM:
# Install and load necessary packages
if (!require("forecast")) install.packages("forecast")
if (!require("ggplot2")) install.packages("ggplot2")
library(forecast)
library(ggplot2)
# 1. Generate Sample Time Series Data
set.seed(123)
time_series_data <- ts(rnorm(120, mean = 50, sd = 10),
start = c(2010, 1), frequency = 12) # Monthly data from 2010
# Plot the time series data
autoplot(time_series_data) +
labs(title = "Original Time Series Data",
x = "Time",
y = "Value") +
theme_minimal()
# 2. Decompose the Time Series (Trend, Seasonal, Residuals)
decomposed <- decompose(time_series_data)
autoplot(decomposed) +
labs(title = "Decomposed Time Series")
# 3. Check Stationarity using Augmented Dickey-Fuller Test
if (!require("tseries")) install.packages("tseries")
library(tseries)
adf_test <- adf.test(time_series_data)
cat("Augmented Dickey-Fuller Test:\n")
print(adf_test)
# 4. Fit an ARIMA Model for Forecasting
arima_model <- auto.arima(time_series_data)
# Print the ARIMA model summary
cat("\nFitted ARIMA Model:\n")
print(summary(arima_model))
# 5. Forecast the Future Values
forecast_horizon <- 12 # Forecast for the next 12 months
forecast_values <- forecast(arima_model, h = forecast_horizon)
# Plot the forecast
autoplot(forecast_values) +
labs(title = "ARIMA Forecast",
x = "Time",
y = "Value") +
theme_minimal()
# 6. Validate the Model with Residual Analysis
autoplot(arima_model$residuals) +
labs(title = "Residual Analysis",
x = "Time",
y = "Residuals") +
theme_minimal()
# Perform Ljung-Box test for residuals
lb_test <- Box.test(arima_model$residuals, lag = 20, type = "Ljung-Box")
cat("\nLjung-Box Test for Residuals:\n")
print(lb_test)
# 7. Seasonal Decomposition of Time Series using LOESS (STL)
stl_decomposed <- stl(time_series_data, s.window = "periodic")
autoplot(stl_decomposed) +
labs(title = "STL Decomposition")
# 8. Export Forecasted Values (Optional)
# write.csv(data.frame(forecast_values), "forecasted_values.csv")
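If the ADF test fails to reject non-stationarity, first-order differencing is the usual remedy before refitting; a minimal sketch using the tseries package already loaded above:
# Optional sketch: difference the series once and re-test stationarity
diff_series <- diff(time_series_data)
adf.test(diff_series)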
OUTPUT:
RESULT:
Time series data analysis and forecasting were implemented in the R script.
Hypothesis Testing in R Programming.
Ex.No:09
Date:
AIM:
To write an R Script for Hypothesis Testing.
ALGORITHM:
Step 1: Check whether the ggplot2 package is installed; if not, install it, then load the ggplot2 library.
Step 2: Set a random seed for reproducibility.
Step 3: Generate group1 with 30 values from a normal distribution (mean = 50, sd = 5) and group2 with 30 values from a normal distribution (mean = 52, sd = 5).
Step 4: One-sample t-test: H0 is that the mean of group1 equals 50.
Step 5: Two-sample t-test: H0 is that the means of group1 and group2 are equal, H1 that they are not; perform an independent two-sample t-test assuming equal variance.
Step 6: Paired t-test: H0 is that the means of the paired samples are equal.
Step 7: Chi-square test: H0 is that the observed frequencies match the expected frequencies.
Step 8: ANOVA: H0 is that the means of all groups are equal; visualize the groups with a boxplot.
Step 9: Shapiro-Wilk test: H0 is that the data are normally distributed.
Step 10: Wilcoxon rank-sum test (non-parametric alternative to the t-test): H0 is that the distributions of group1 and group2 are the same.
Step 11: Correlation test: H0 is that there is no correlation between x and y; check the strength of the linear relationship between the two variables.
PROGRAM:
# Load necessary library
if (!require("ggplot2")) install.packages("ggplot2")
library(ggplot2)
# Create sample data
set.seed(123)
group1 <- rnorm(30, mean = 50, sd = 5) # Sample from Group 1
group2 <- rnorm(30, mean = 52, sd = 5) # Sample from Group 2
# 1. One-Sample t-test
# H0: The mean of group1 is equal to 50
# H1: The mean of group1 is not equal to 50
t_test_one_sample <- t.test(group1, mu = 50)
cat("\nOne-Sample t-test:\n")
print(t_test_one_sample)
# 2. Two-Sample t-test (Independent)
# H0: The means of group1 and group2 are equal
# H1: The means of group1 and group2 are not equal
t_test_two_sample <- t.test(group1, group2, var.equal = TRUE)
cat("\nTwo-Sample t-test:\n")
print(t_test_two_sample)
# 3. Paired t-test
# H0: The means of paired samples are equal
# Simulate paired data
paired_group1 <- rnorm(20, mean = 10, sd = 2)
paired_group2 <- paired_group1 + rnorm(20, mean = 0.5, sd = 1) # Add small difference
t_test_paired <- t.test(paired_group1, paired_group2, paired = TRUE)
cat("\nPaired t-test:\n")
print(t_test_paired)
# 4. Chi-Square Test
# H0: The observed frequencies match the expected frequencies
observed <- c(50, 30, 20) # Observed frequencies
expected <- c(40, 40, 20) # Expected frequencies
chi_sq_test <- chisq.test(observed, p = expected / sum(expected))
cat("\nChi-Square Test:\n")
print(chi_sq_test)
# 5. ANOVA (Analysis of Variance)
# H0: The means of all groups are equal
group_a <- rnorm(20, mean = 5, sd = 1)
group_b <- rnorm(20, mean = 6, sd = 1)
group_c <- rnorm(20, mean = 7, sd = 1)
anova_data <- data.frame(
values = c(group_a, group_b, group_c),
groups = rep(c("A", "B", "C"), each = 20))
anova_result <- aov(values ~ groups, data = anova_data)
cat("\nANOVA Test:\n")
summary(anova_result)
# 6. Visualizing ANOVA Results
ggplot(anova_data, aes(x = groups, y = values, fill = groups)) +
geom_boxplot() +
labs(title = "Boxplot of Groups for ANOVA",
x = "Group",
y = "Values") +
theme_minimal()
# 7. Shapiro-Wilk Test (Normality Test)
# H0: Data is normally distributed
shapiro_test <- shapiro.test(group1)
cat("\nShapiro-Wilk Test:\n")
print(shapiro_test)
# 8. Wilcoxon Rank-Sum Test (Non-parametric test)
# H0: The distributions of group1 and group2 are the same
wilcox_test <- wilcox.test(group1, group2)
cat("\nWilcoxon Rank-Sum Test:\n")
print(wilcox_test)
# 9. Correlation Test
# H0: There is no correlation between x and y
x <- rnorm(50, mean = 10, sd = 2)
y <- 2 * x + rnorm(50, mean = 0, sd = 1)
cor_test <- cor.test(x, y)
cat("\nCorrelation Test:\n")
print(cor_test)
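Each test object stores its p-value, so the accept/reject decision can be made programmatically; a minimal sketch (interpret_p is an illustrative helper; alpha = 0.05 is an assumed significance level):
# Optional sketch: decide on H0 from a test's p-value at significance level alpha
interpret_p <- function(p_value, alpha = 0.05) {
  ifelse(p_value < alpha, "Reject H0", "Fail to reject H0")
}
cat("Two-sample t-test decision:", interpret_p(t_test_two_sample$p.value), "\n")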
OUTPUT:
RESULT:
Hypothesis Testing was implemented successfully in the R script.
Predictive Analysis in R Programming.
Ex.No:10
Date:
AIM:
To write an R script for Predictive Analysis.
ALGORITHM:
Step 1: Before running the analysis, install and load the necessary libraries.
Step 2: Use caret for machine learning operations such as data partitioning and evaluation.
Step 3: Generate a synthetic dataset using random values.
Step 4: Split the data into training (70%) and testing (30%) sets.
Step 5: Fit a linear regression model for the continuous variable Income.
Step 6: Fit a decision tree and a random forest for classifying Default.
Step 7: Make predictions using the decision tree and the random forest.
Step 8: Evaluate model performance using confusion matrices.
Step 9: Visualize the decision tree and the feature importance from the random forest.
PROGRAM:
# Install and load required libraries
if (!require("caret")) install.packages("caret")
if (!require("rpart")) install.packages("rpart")
if (!require("randomForest")) install.packages("randomForest")
if (!require("ggplot2")) install.packages("ggplot2")
library(caret)
library(rpart)
library(randomForest)
library(ggplot2)
# 1. Load or Create Sample Dataset
set.seed(123)
data <- data.frame(
Age = sample(18:70, 100, replace = TRUE),
Income = round(rnorm(100, mean = 50000, sd = 10000), 0),
Education = sample(c("High School", "Bachelor's", "Master's", "PhD"), 100, replace = TRUE),
Credit_Score = round(rnorm(100, mean = 700, sd = 50), 0))
# Target Variable: Whether the person defaults on a loan
data$Default <- ifelse(data$Credit_Score < 650, 1, 0) # 1 = Default, 0 = No Default
# View the dataset
cat("Sample Dataset:\n")
print(head(data))
# Convert categorical variables to factors
data$Education <- as.factor(data$Education)
data$Default <- as.factor(data$Default)
# 2. Split Data into Training and Testing Sets
set.seed(123)
trainIndex <- createDataPartition(data$Default, p = 0.7, list = FALSE)
trainData <- data[trainIndex, ]
testData <- data[-trainIndex, ]
cat("\nTraining Set Size:", nrow(trainData))
cat("\nTesting Set Size:", nrow(testData))
# 3. Build Predictive Models
# (a) Linear Regression (For continuous target variables)
lm_model <- lm(Income ~ Age + Credit_Score, data = trainData)
cat("\nLinear Regression Model Summary:\n")
print(summary(lm_model))
# (b) Decision Tree (For classification tasks)
tree_model <- rpart(Default ~ Age + Income + Education + Credit_Score, data = trainData,
method = "class")
cat("\nDecision Tree Summary:\n")
print(tree_model)
# (c) Random Forest (For classification tasks)
rf_model <- randomForest(Default ~ Age + Income + Education + Credit_Score, data =
trainData, ntree = 100)
cat("\nRandom Forest Model:\n")
print(rf_model)
# 4. Make Predictions
# Predict using Decision Tree
tree_predictions <- predict(tree_model, testData, type = "class")
# Predict using Random Forest
rf_predictions <- predict(rf_model, testData)
# 5. Evaluate Models
# Confusion Matrix for Decision Tree
cat("\nDecision Tree Confusion Matrix:\n")
print(confusionMatrix(tree_predictions, testData$Default))
# Confusion Matrix for Random Forest
cat("\nRandom Forest Confusion Matrix:\n")
print(confusionMatrix(rf_predictions, testData$Default))
# 6. Visualize the Results
# (a) Decision Tree Plot
if (!require("rpart.plot")) install.packages("rpart.plot")
library(rpart.plot)
rpart.plot(tree_model, main = "Decision Tree")
# (b) Feature Importance from Random Forest
importance <- importance(rf_model)
cat("\nFeature Importance:\n")
print(importance)
# Plot Feature Importance
importance_df <- data.frame(Feature = rownames(importance), Importance = importance[, 1])
ggplot(importance_df, aes(x = reorder(Feature, Importance), y = Importance, fill = Feature)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(title = "Feature Importance (Random Forest)",
x = "Features",
y = "Importance") +
theme_minimal()
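caret's confusionMatrix() object also exposes summary metrics directly; a minimal sketch pulling out overall accuracy for both classifiers:
# Optional sketch: extract overall accuracy from the caret confusion matrices
tree_cm <- confusionMatrix(tree_predictions, testData$Default)
rf_cm <- confusionMatrix(rf_predictions, testData$Default)
cat("Decision Tree Accuracy:", tree_cm$overall["Accuracy"], "\n")
cat("Random Forest Accuracy:", rf_cm$overall["Accuracy"], "\n")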
OUTPUT:
RESULT:
Predictive Analysis in the R script was executed successfully.
Ex.No:11
Date:
Write an R script to implement the Cross-Validation.
AIM:
To implement an R Script for Cross-Validation.
ALGORITHM:
Step 1: Load the required libraries using library().
Step 2: Convert the Default column into a factor, since this is a classification problem.
Step 3: Display the first few rows of the dataset.
Step 4: Define 10-fold cross-validation with trainControl() and use the train() function to fit a Random Forest model.
Step 5: Use the plot() function to visualize the model performance and parameter tuning.
Step 6: Create a new dataset with sample values for prediction.
Step 7: Print the predictions along with the input data.
Step 8: This step-by-step approach builds a credit default prediction model using a Random Forest with cross-validation.
PROGRAM:
# Install and load required libraries
if (!require("caret")) install.packages("caret")
if (!require("randomForest")) install.packages("randomForest")
if (!require("e1071")) install.packages("e1071") # Needed for SVM in caret
library(caret)
library(randomForest)
# 1. Create Sample Dataset
set.seed(123)
data <- data.frame(
Age = sample(18:70, 200, replace = TRUE),
Income = round(rnorm(200, mean = 50000, sd = 10000), 0),
Credit_Score = round(rnorm(200, mean = 700, sd = 50), 0),
Default = sample(c("Yes", "No"), 200, replace = TRUE))
# Convert target variable to factor
data$Default <- as.factor(data$Default)
# View dataset
cat("Sample Dataset:\n")
print(head(data))
# 2. Define Train Control for Cross-Validation
train_control <- trainControl(
method = "cv", # Cross-validation method
number = 10, # Number of folds
verboseIter = TRUE # Display training progress
)
# 3. Train a Model with Cross-Validation
# Example: Random Forest
set.seed(123)
rf_model <- train(
Default ~ Age + Income + Credit_Score, # Formula
data = data, # Dataset
method = "rf", # Random forest method
trControl = train_control, # Cross-validation settings
tuneLength = 3 # Number of tuning parameters to try
)
# 4. Model Performance
cat("\nRandom Forest Model Summary:\n")
print(rf_model)
# Print the best model parameters
cat("\nBest Tuning Parameters:\n")
print(rf_model$bestTune)
# 5. Evaluate Model Performance
cat("\nCross-Validation Results:\n")
print(rf_model$resample)
# 6. Visualize Model Performance
plot(rf_model)
# 7. Predictions on a New Dataset (Optional)
new_data <- data.frame(
Age = c(25, 40, 65),
Income = c(40000, 55000, 70000),
Credit_Score = c(650, 720, 680))
# Predict using the trained model
predictions <- predict(rf_model, new_data)
cat("\nPredictions on New Data:\n")
print(data.frame(new_data, Predicted_Default = predictions))
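For more stable performance estimates, caret also supports repeated k-fold cross-validation; a minimal sketch (3 repeats is an illustrative choice):
# Optional sketch: 10-fold cross-validation repeated 3 times
repeated_control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
set.seed(123)
rf_model_repeated <- train(
  Default ~ Age + Income + Credit_Score,
  data = data,
  method = "rf",
  trControl = repeated_control)
print(rf_model_repeated)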
OUTPUT:
RESULT:
The R Script for Cross-Validation was implemented and its output verified.
Write an R script to implement the Ordinary Least Squares (OLS).
Ex.No:12
Date:
AIM:
To implement Ordinary Least Squares (OLS) regression in an R script.
ALGORITHM:
Step 1: Load the ggplot2 package to enable visualization of the regression results.
Step 2: Load the built-in mtcars dataset into the R environment.
Step 3: Use head(mtcars) to display the first six rows of the dataset for a quick preview.
Step 4: Use the lm() function to fit the linear regression model.
Step 5: Use the summary() function to obtain the key details of the regression results.
Step 6: The coefficients give the estimated effect of each independent variable.
Step 7: R-squared shows how well the model explains the variance in mpg.
Step 8: The residuals are the differences between the actual and predicted values.
PROGRAM:
# Load necessary libraries
library(ggplot2)
# Load the dataset
data(mtcars)
# View the first few rows of the dataset
head(mtcars)
# Perform OLS regression
model <- lm(mpg ~ wt + hp + disp, data = mtcars)
# View the summary of the model
summary(model)
# Plot the regression results
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point() +
geom_smooth(method = "lm", formula = y ~ x, col = "red") +
labs(title = "OLS Regression: MPG vs Weight",
x = "Weight (1000 lbs)",
y = "Miles per Gallon (MPG)")
OUTPUT:
RESULT:
The Ordinary Least Squares (OLS) R script executed successfully.
Write an R script to implement the Linear regression algorithm.
Ex.No:13
Date:
AIM:
To implement the linear regression algorithm in the R programming language.
ALGORITHM:
Step 1: Check whether the ggplot2 package is installed.
Step 2: If not, install the package, then load ggplot2.
Step 3: Set a seed for reproducibility using set.seed(123).
Step 4: Create a data frame data with three variables (Age, Income, and a derived Spending).
Step 5: Use the lm() function to fit a linear regression model.
Step 6: Use summary(lm_model) to print details such as the coefficients, R-squared value, p-values, residuals, and standard errors.
Step 7: Use ggplot2 to plot the relationship between Age and Spending with a fitted regression line.
PROGRAM:
# Load necessary library
if (!require("ggplot2")) install.packages("ggplot2")
library(ggplot2)
# Sample dataset
set.seed(123)
data <- data.frame(
Age = sample(20:70, 100, replace = TRUE),
Income = round(rnorm(100, mean = 50000, sd = 10000), 0))
data$Spending <- 2000 + 50 * data$Age + 0.3 * data$Income +
  rnorm(100, mean = 0, sd = 5000)
# Fit Linear Regression Model
lm_model <- lm(Spending ~ Age + Income, data = data)
# Model Summary
summary(lm_model)
# Scatter plot with regression line
ggplot(data, aes(x = Age, y = Spending)) +
geom_point() +
geom_smooth(method = "lm", formula = y ~ x, color = "red") +
labs(title = "Linear Regression: Spending vs Age", x = "Age", y = "Spending")
OUTPUT:
RESULT:
The Linear regression algorithm was implemented in the R script.
Write an R script to implement the K-Means clustering algorithm.
Ex.No:14
Date:
AIM:
To write an R script to implement the K-Means clustering algorithm in RStudio.
ALGORITHM:
Step 1: Load the ggplot2 library, which is used to create visualizations in R.
Step 2: Load the iris dataset, a built-in dataset in R that contains measurements of iris flowers for three species (setosa, versicolor, virginica).
Step 3: Apply K-Means clustering so the algorithm groups the data into 3 clusters.
Step 4: Add the computed cluster labels (1, 2, or 3) to the iris dataset.
Step 5: Use as.factor() to ensure the cluster labels are treated as categorical values.
Step 6: Create a scatter plot using ggplot2 with Sepal.Length on the x-axis, Sepal.Width on the y-axis, and color showing the cluster assignment.
PROGRAM:
# Load necessary library
library(ggplot2)
# Load the dataset
data(iris)
# Perform K-Means clustering with 3 clusters
set.seed(123) # For reproducibility
kmeans_result <- kmeans(iris[, 1:4], centers = 3)
# Add the cluster results to the dataset
iris$Cluster <- as.factor(kmeans_result$cluster)
# Plot the clusters
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Cluster)) +
geom_point() +
labs(title = "K-Means Clustering of Iris Dataset",
x = "Sepal Length",
y = "Sepal Width")
OUTPUT:
RESULT:
The K-Means clustering algorithm was implemented successfully.
Ex.No:15
Date:
Write an R script to implement the Naive Bayes.
AIM:
To implement the Naive Bayes classifier in an R Script.
ALGORITHM:
Step 1: Load the e1071 package, which provides an implementation of the Naïve Bayes classifier in R.
Step 2: Load the built-in iris dataset, which contains 150 samples of iris flowers with four features (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) and the Species label.
Step 3: Split the data into training (70%) and test (30%) sets.
Step 4: Train the model with the formula Species ~ ., which means predicting Species from all other features.
Step 5: Use the trained model to predict the species of the flowers in the test set.
Step 6: Create a confusion matrix comparing the actual vs. predicted species; this helps measure the accuracy of the model.
Step 7: Compute the accuracy by dividing the correct predictions by the total number of test samples.
Step 8: Internally, the classifier estimates the probability of each class, P(Species), from the training set.
Step 9: It models the probability of each feature value given a class using the Gaussian (normal) distribution.
Step 10: The class with the highest posterior probability is selected as the prediction.
PROGRAM:
# Load necessary library
library(e1071)
# Load the dataset
data(iris)
# Split the data into training and test sets
set.seed(123) # For reproducibility
train_index <- sample(1:nrow(iris), 0.7 * nrow(iris))
train_data <- iris[train_index, ]
test_data <- iris[-train_index, ]
# Train the Naïve Bayes model
model <- naiveBayes(Species ~ ., data = train_data)
# Predict on the test set
predictions <- predict(model, test_data)
# Evaluate the model
confusion_matrix <- table(predictions, test_data$Species)
print(confusion_matrix)
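Step 7 of the algorithm calls for an accuracy figure, which the recorded program stops short of; a minimal sketch computing it from the confusion matrix:
# Optional sketch: accuracy = correct predictions / total test samples
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
cat("Accuracy:", round(accuracy, 4), "\n")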
OUTPUT:
RESULT:
The Naive Bayes classifier was implemented successfully.