Mini Project – Factor Hair Analysis
Sravanthi.M
Table of Contents
1. Project Objective
2. Assumptions
3. Exploratory Data Analysis – Step by Step Approach
3.1. Environment Set up and Data Import
3.1.1. Install Necessary Packages and Invoke Libraries
3.1.2. Set Up Working Directory
3.1.3. Import and Read the Dataset
3.2. Variable Identification
4. Conclusion
5. Detailed Explanation of Findings
1. Perform exploratory data analysis on the dataset. Showcase some charts and graphs. Check for outliers and missing values
1.1 EDA – basic data summary; univariate and bivariate analysis; graphs
1.2 EDA – check for outliers and missing values and review the summary of the dataset
2. Is there evidence of multicollinearity? Showcase your analysis
3. Perform simple linear regression for the dependent variable with every independent variable
4. Perform PCA/factor analysis by extracting 4 factors. Interpret the output and name the factors
4.1 Perform PCA/FA and interpret the eigenvalues (apply the Kaiser normalization rule)
4.2 Output interpretation – explain why only 4 factors are extracted, whether choosing 4 factors is correct, and name the factors with supporting explanations
5. Perform multiple linear regression with customer satisfaction as the dependent variable and the four factors as independent variables. Comment on the model output and validity. Your remarks should make it meaningful for everybody
5.1 Create a data frame with a minimum of 5 columns, 4 of which are the different factors and the 5th is Customer Satisfaction
5.2 Perform multiple linear regression with Customer Satisfaction as the dependent variable and the four factors as independent variables
5.3 MLR summary interpretation and significance (R, R², adjusted R², degrees of freedom, F-statistic, coefficients along with p-values)
5.4 Output interpretation (making it meaningful for everybody)
6. Source Code
1 Project Objective
The objective of this report is to explore the Factor Hair dataset in R and generate insights about it. The exploration covers the following:
Importing the dataset into R
Understanding the structure of the dataset
Graphical exploration
Descriptive statistics
Insights from the dataset
2 Assumptions
Is there evidence of multicollinearity?
Perform factor analysis by extracting four factors.
Name the four factors.
Perform multiple linear regression with customer satisfaction as the dependent variable and the four factors as independent variables.
3 Exploratory Data Analysis – Step by step approach
A typical data exploration activity consists of the following steps:
1. Environment set-up and data import
2. Check for multicollinearity
3. Factor analysis
4. Identification of the four factors
5. Feature exploration
The dataset has 12 variables used for market segmentation in the context of product service management; the variables and their expansions are listed below.
We shall follow these steps in exploring the provided dataset.
3.1 Environment Set up and Data Import
3.1.1 Install necessary Packages and Invoke Libraries
Use this section to install the necessary packages and invoke the associated libraries. Keeping all the packages in one place improves code readability. For installation we use
install.packages("package name")
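For example, the packages used later in this report (see Section 6) are installed and loaded as follows:
## Install once, then load the libraries in every session
install.packages(c("corrplot", "tidyverse", "psych", "car", "caTools"))
library(corrplot)  # correlation plots
library(tidyverse) # data manipulation helpers
library(psych)     # KMO, Bartlett's test, factor analysis
library(car)       # variance inflation factors (vif)
library(caTools)   # sample.split for the train/test split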
3.1.2 Set up working Directory
Setting a working directory at the start of the R session makes importing and exporting data files and code files easier. The working directory is simply the location/folder on the PC that holds the data, code and other files related to the project. For setting and checking it we use the syntax below:
Syntax → setwd() & getwd()
Please refer to Section 6 for the source code.
3.1.3 Import and Read the Dataset
The given dataset is in .csv format, so the command read.csv is used to import the file.
Please refer to Section 6 for the source code.
3.2 Variable Identification
We use the following functions:
setwd(): sets the working directory
getwd(): returns an absolute file path representing the current working directory
dim(): returns the dimensions of the data (the number of rows and columns)
str(): displays the internal structure of the data, variable by variable
names(): returns the names of the columns
summary(): a generic function that produces result summaries; it invokes particular methods depending on the class of its first argument
attach(): attaches the data so columns can be referenced by name
hist(): plots a histogram
boxplot(): plots a boxplot
4 Conclusion
From the given problem, we have seen how factor analysis can be used to reduce the dimensionality of a dataset, and how multiple linear regression can then be run on the reduced set of factors for further analysis/prediction. The following points were covered:
1. Checked for multicollinearity
2. Performed factor analysis
3. Named the four factors: Sales.Distri, Marketing, After.Sales.Service and Value.For.Money (with Cust.Satisf as the dependent variable)
4. Performed multiple linear regression with customer satisfaction (Cust.Satisf) as the dependent variable and Sales.Distri, Marketing, After.Sales.Service and Value.For.Money as the independent variables
5 Detailed Explanation of Findings
1. Perform exploratory data analysis on the dataset. Showcase some charts and graphs. Check for outliers and missing values
1.1 EDA – basic data summary; univariate and bivariate analysis; graphs
1.2 EDA – check for outliers and missing values and review the summary of the dataset
Ans: For the basic data summary we import the data as described in Section 3.2, then apply the functions listed there to analyse the data.
## Setting and getting the working directory
setwd("D:/College Data/Advance stats/Project")
getwd()
## Reading the file
Factorhair <- read.csv("Hair.csv", header = TRUE)
## Variable names in a vector
variables <- c("Product Quality", "E-Commerce", "Technical Support",
               "Complaint Resolution", "Advertising", "Product Line",
               "Salesforce Image", "Competitive Pricing",
               "Warranty & Claims", "Order & Billing", "Delivery Speed",
               "Customer Satisfaction")
## Checking the dimensions of the data
dim(Factorhair)
## Names of the columns
names(Factorhair)
## Structure of the data
str(Factorhair)
## summary of the data
summary(Factorhair)
From the summary output we notice that the first column, ID, is just a row number and is not needed for the analysis, so we drop it and rename the reduced dataset hair, as shown below.
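The corresponding commands (repeated from the source code in Section 6):
## Create the working dataset hair by removing the ID column
hair <- Factorhair[,-1]
## Rename the columns using the variables vector
colnames(hair) <- variables
summary(hair)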
Next we check for missing values.
Syntax: sum(is.na(hair))
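If the total were non-zero, a per-column count would locate the gaps; a minimal sketch (not part of the original script):
## Number of missing values in each column
colSums(is.na(hair))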
Graphical representation of the Factor Hair dataset
Histogram of the dependent variable (Customer Satisfaction)
Syntax: hist(`Customer Satisfaction`, breaks = c(0:11), labels = TRUE,
        include.lowest = TRUE, right = TRUE,
        col = "blue", border = "green",
        main = paste("Histogram of Customer Satisfaction"),
        xlab = "Customer Satisfaction", ylab = "Count",
        xlim = c(0,11), ylim = c(0,35))
Box plot of dependent variable (Customer satisfaction)
Syntax: boxplot(`Customer Satisfaction`, horizontal = TRUE, xlab = variables[12],
col = "pink", border="blue",ylim = c(0,11))
Histograms of the independent variables
Syntax: par(mfrow = c(3,4)) # split the plotting area into a 3 x 4 grid of 12 panels
for (i in (1:11))
{h = round(max(hair[,i]),0)+1
l = round(min(hair[,i]),0)-1
n = variables[i]
hist(hair[,i], breaks = seq(l,h,((h-l)/6)), labels = TRUE,
     include.lowest = TRUE, right = TRUE,
     col = "pink", border = "blue",
     main = NULL, xlab = n, ylab = NULL,
     cex.lab = 1, cex.axis = 1, cex.main = 1, cex.sub = 1,
     xlim = c(0,11), ylim = c(0,70))
}
Boxplots of the independent variables
Syntax: par(mfrow = c(2,1))
boxplot(hair[,-12], las = 2, names = variables[-12], col = "blue", border = "pink", cex.axis = 1)
Bivariate analysis – scatter plots of the independent variables against the dependent variable
Syntax: par(mfrow = c(3,3))
for (i in c(1:11))
{plot(hair[,i], `Customer Satisfaction`, xlab = variables[i], ylab = NULL,
      col = "red", cex.lab = 1, cex.axis = 1,
      cex.main = 1, cex.sub = 1, xlim = c(0,10), ylim = c(0,10))
abline(lm(formula = `Customer Satisfaction` ~ hair[,i]), col = "blue")
}
Finding outliers in the variables
Syntax:
## Build a table whose columns hold the boxplot outliers of each variable
OutLiers <- hair[(1:12),]
for (i in c(1:12)) {
Box_Plot <- boxplot(hair[,i], plot = F)$out
OutLiers[,i] <- NA
if (length(Box_Plot) > 0) {
OutLiers[(1:length(Box_Plot)),i] <- Box_Plot
}
}
## Keep only the first 6 rows, enough to hold the outliers found
OutLiers <- OutLiers[(1:6),]
# Write the outliers list to a csv file
write.csv(OutLiers, "OutLiers.csv")
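A more compact way to get the same per-variable picture is to count the boxplot outliers directly (a sketch using base R's boxplot.stats, not part of the original script):
## Number of boxplot outliers (points beyond 1.5 x IQR) in each variable
sapply(hair, function(x) length(boxplot.stats(x)$out))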
2. Is there evidence of multicollinearity? Showcase your analysis
Ans: First we create the correlation matrix and plot the correlations for the Factor Hair dataset. Then we check the multicollinearity of the independent variables using variance inflation factors (VIF).
Syntax:
## Create correlation matrix
corlnMtrx <- cor(hair[,-12])
corlnMtrx
## Correlation Plot for Data hair.
corrplot.mixed(corlnMtrx, lower = "number", upper = "pie", tl.col = "black",tl.pos = "lt")
## Check multicollinearity in independent variables using VIF
vifmatrix <- vif(lm(`Customer Satisfaction` ~., data = hair))
vifmatrix
write.csv(vifmatrix, "vifmatrix.csv")
Variable              VIF
Product Quality       1.635796913
E-Commerce            2.756694028
Technical Support     2.976795746
Complaint Resolution  4.730448292
Advertising           1.508933339
Product Line          3.488185222
Salesforce Image      3.439420023
Competitive Pricing   1.635000159
Warranty & Claims     3.198337123
Order & Billing       2.902999400
Delivery Speed        6.516013572
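A common rule of thumb treats a VIF above 5 as a sign of problematic multicollinearity: Delivery Speed (6.52) crosses that line and several other variables exceed 3, so there is clear evidence of multicollinearity among the independent variables. The offending variables can be flagged directly (a one-line sketch on the vifmatrix vector above):
## Variables whose VIF exceeds the conventional threshold of 5
vifmatrix[vifmatrix > 5]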
3. Perform simple linear regression for the dependent variable with every independent variable
Ans: Building on the correlation matrix above, we first run Bartlett's test of sphericity; if the p-value is less than 0.05, the variables are correlated enough for dimension reduction to be worthwhile.
Syntax: cortest.bartlett(corlnMtrx, 100)
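The simple linear regressions themselves can be fitted one independent variable at a time; a minimal sketch (assuming the hair data frame and variables vector defined earlier):
## Regress Customer Satisfaction on each independent variable in turn
for (i in 1:11) {
slr <- lm(`Customer Satisfaction` ~ hair[,i], data = hair)
cat(variables[i], ": slope =", round(coef(slr)[2], 3),
    ", R-squared =", round(summary(slr)$r.squared, 3), "\n")
}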
4. Perform PCA/factor analysis by extracting 4 factors. Interpret the output and name the factors
4.1 Perform PCA/FA and interpret the eigenvalues (apply the Kaiser normalization rule)
4.2 Output interpretation – explain why only 4 factors are extracted, whether choosing 4 factors is correct, and name the factors with supporting explanations
Ans: The Kaiser-Meyer-Olkin (KMO) test is a measure of how suited the data are for factor analysis.
Syntax: KMO(corlnMtrx)
The overall KMO statistic of 0.65 is comfortably above the conventional 0.50 threshold, so factor analysis is an appropriate technique for further analysis of the data.
Next, calculate the eigenvalues of the correlation matrix.
Syntax:
A <- eigen(corlnMtrx)
EV <- A$values
EV
plot(EV, main = "Scree Plot", xlab = "Factors", ylab = "Eigen Values", pch = 20, col = "blue")
lines(EV, col = "red")
abline(h = 1, col = "green", lty = 2)
By the Kaiser rule, only factors with eigenvalues greater than 1 are retained, and four of the eigenvalues exceed 1 here.
Hence, from the scree plot above we retain only 4 factors out of the 11 variables.
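The eigenvalues also show how much of the total variance the retained factors capture; a quick check (a sketch using the EV vector computed above):
## Cumulative proportion of total variance explained by successive factors
cumsum(EV) / sum(EV)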
The factor names are as follows: Sales.Distri, Marketing, After.Sales.Service, Value.For.Money.
Sales.Distri – Delivery Speed, Complaint Resolution and Order & Billing form one factor because all three relate to purchasing the product, from placing the order through billing and delivery.
Marketing – Salesforce Image, E-Commerce and Advertising form one factor because these variables relate to sales effort and advertising spend.
After.Sales.Service – Technical Support and Warranty & Claims form one factor because both concern post-purchase support.
Value.For.Money – Competitive Pricing, Product Line and Product Quality form one factor because together they capture what the customer gets for the price.
5. Perform multiple linear regression with customer satisfaction as the dependent variable and the four factors as independent variables. Comment on the model output and validity. Your remarks should make it meaningful for everybody
5.1 Create a data frame with a minimum of 5 columns, 4 of which are the different factors and the 5th is Customer Satisfaction
5.2 Perform multiple linear regression with Customer Satisfaction as the dependent variable and the four factors as independent variables
5.3 MLR summary interpretation and significance (R, R², adjusted R², degrees of freedom, F-statistic, coefficients along with p-values)
5.4 Output Interpretation
Ans: As per the scree plot above, we extract 4 factors from the 11 variables.
Without rotation
Syntax:
FourFactor = fa(r = hair[,-12], nfactors = 4, rotate = "none", fm = "pa")
print(FourFactor)
Loading <- print(FourFactor$loadings, cutoff = 0.3)
write.csv(Loading, "loading.csv")
                     PA1       PA2       PA3       PA4
Product Quality      0.201261  -0.40795  -0.05811  0.462588
E-Commerce           0.29013   0.659153  0.269989  0.215921
Technical Support    0.27765   -0.38082  0.73814   -0.16628
Complaint Resolution 0.862348  0.011699  -0.25533  -0.18395
Advertising          0.286088  0.457153  0.082418  0.12877
Product Line         0.689465  -0.45337  -0.14239  0.314815
Salesforce Image     0.394536  0.800679  0.345809  0.250827
Competitive Pricing  -0.23159  0.553007  -0.04444  -0.28608
Warranty & Claims    0.379328  -0.32446  0.735494  -0.15303
Order & Billing      0.746973  0.02081   -0.17524  -0.18086
Delivery Speed       0.895111  0.098331  -0.30345  -0.19764
fa.diagram(FourFactor)
With varimax rotation
Syntax:
FourFactor1 = fa(r = hair[,-12], nfactors = 4, rotate = "varimax", fm = "pa")
print(FourFactor1)
Loading1 <- print(FourFactor1$loadings,cutoff = 0.3)
write.csv(Loading1, "Loading1.csv")
                     PA1       PA2       PA3       PA4
Product Quality      0.024004  -0.07003  0.01569   0.646969
E-Commerce           0.067574  0.787412  0.0279    -0.11319
Technical Support    0.019767  -0.02524  0.883193  0.116433
Complaint Resolution 0.897671  0.129545  0.053539  0.13171
Advertising          0.166184  0.529966  -0.04289  -0.06235
Product Line         0.525463  -0.03526  0.127348  0.711841
Salesforce Image     0.115439  0.971489  0.063495  -0.13452
Competitive Pricing  -0.07565  0.212939  -0.20892  -0.59039
Warranty & Claims    0.102595  0.056612  0.885113  0.127977
Order & Billing      0.768195  0.126678  0.088175  0.088743
Delivery Speed       0.94873   0.185192  -0.00486  0.087353
fa.diagram(FourFactor1)
Create a new data frame from the scores of the four factors and the dependent variable:
hair1 <- cbind(hair[,12], FourFactor1$scores)
Check the head of the data:
head(hair1)
Name the columns of hair1:
colnames(hair1) <- c("Cust.Satisf","Sales.Distri","Marketing","After.Sales.Service","Value.For.Money")
Check the head of the data again:
head(hair1)
Check the class of hair1:
class(hair1)
Convert the matrix to a data frame:
hair1 <- as.data.frame(hair1)
Correlation plot for the data hair1:
corrplot.mixed(cor(hair1), lower = "number", upper = "pie", tl.col = "black", tl.pos = "lt")
Set the seed so the random split is reproducible:
set.seed(1)
Create two datasets, one to train the model and another to test it:
spl = sample.split(hair1$Cust.Satisf, SplitRatio = 0.8)
Train = subset(hair1, spl==TRUE)
Test = subset(hair1, spl==FALSE)
Check the dimensions of the Train and Test data:
cat(" Train Dimension: ", dim(Train), "\n", "Test Dimension : ", dim(Test))
Fit the multiple linear regression on the training data and inspect its summary and VIFs:
linearModel = lm(Cust.Satisf ~., data = Train)
summary(linearModel)
vif(linearModel)
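The quantities asked for in 5.3 can be read off the summary object individually (a sketch; the exact values depend on the seed-1 train/test split):
## Extract the headline statistics from the fitted model
s <- summary(linearModel)
s$r.squared      # R-squared; its square root gives multiple R
s$adj.r.squared  # adjusted R-squared
s$fstatistic     # F-statistic with its numerator and denominator degrees of freedom
s$coefficients   # estimates, standard errors, t-values and p-values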
Predict on the test data:
pred = predict(linearModel, newdata = Test)
Compute R-squared for the test data.
SST – total sum of squares:
SST = sum((Test$Cust.Satisf - mean(Train$Cust.Satisf))^2)
SSE – sum of squared deviations of the actual values from the predicted values:
SSE = sum((pred - Test$Cust.Satisf)^2)
SSR – sum of squared deviations of the predicted values from the mean (predicted using the regression):
SSR = sum((pred - mean(Train$Cust.Satisf))^2)
R.square.test <- SSR/SST
cat(" SST :", SST, "\n", "SSE :", SSE, "\n", "SSR :", SSR, "\n", "R squared Test :", R.square.test)
6 Source Code
## Setting and getting the working directory
setwd("D:/College Data/Advance stats/Project")
getwd()
## Importing packages (install each once, then load its library)
install.packages("corrplot")
install.packages("tidyverse")
install.packages("psych")
install.packages("car")
install.packages("caTools")
library(corrplot)
library(tidyverse)
library(ggplot2)
library(psych)
library(car)
library(caTools)
## Reading the file
Factorhair <- read.csv("Hair.csv", header = TRUE)
## Variable names in a vector
variables <- c("Product Quality", "E-Commerce", "Technical Support",
               "Complaint Resolution", "Advertising", "Product Line",
               "Salesforce Image", "Competitive Pricing",
               "Warranty & Claims", "Order & Billing", "Delivery Speed",
               "Customer Satisfaction")
## Checking the dimensions of the data
dim(Factorhair)
## Names of the columns
names(Factorhair)
## Structure of the data
str(Factorhair)
## summary of the data
summary(Factorhair)
## Creating a new dataset named hair with the ID column removed
hair <- Factorhair[,-1]
dim(hair)
## Changing the names of the columns
colnames(hair) <- variables
summary(hair)
## Attaching the data
attach(hair)
hair
## Check whether there are any missing values
sum(is.na(hair))
## Histogram of the dependent variable (Customer Satisfaction)
hist(`Customer Satisfaction`, breaks = c(0:11), labels = TRUE,
     include.lowest = TRUE, right = TRUE,
     col = "blue", border = "green",
     main = paste("Histogram of Customer Satisfaction"),
     xlab = "Customer Satisfaction", ylab = "Count",
     xlim = c(0,11), ylim = c(0,35))
## Box plot of the dependent variable (Customer Satisfaction)
boxplot(`Customer Satisfaction`, horizontal = TRUE, xlab = variables[12],
        col = "pink", border = "blue", ylim = c(0,11))
## Histograms of the independent variables
par(mfrow = c(3,4)) # split the plotting area into a 3 x 4 grid of 12 panels
for (i in (1:11))
{h = round(max(hair[,i]),0)+1
l = round(min(hair[,i]),0)-1
n = variables[i]
hist(hair[,i], breaks = seq(l,h,((h-l)/6)), labels = TRUE,
     include.lowest = TRUE, right = TRUE,
     col = "pink", border = "blue",
     main = NULL, xlab = n, ylab = NULL,
     cex.lab = 1, cex.axis = 1, cex.main = 1, cex.sub = 1,
     xlim = c(0,11), ylim = c(0,70))
}
## Boxplots of the independent variables
par(mfrow = c(2,1))
boxplot(hair[,-12], las = 2, names = variables[-12], col = "blue",
        border = "pink", cex.axis = 1)
## Bivariate analysis
## Scatter plots of the independent variables against the dependent variable
par(mfrow = c(3,3))
for (i in c(1:11))
{plot(hair[,i], `Customer Satisfaction`, xlab = variables[i], ylab = NULL,
      col = "red", cex.lab = 1, cex.axis = 1,
      cex.main = 1, cex.sub = 1, xlim = c(0,10), ylim = c(0,10))
abline(lm(formula = `Customer Satisfaction` ~ hair[,i]), col = "blue")
}
## Finding outliers in the variables
OutLiers <- hair[(1:12),]
for (i in c(1:12)) {
Box_Plot <- boxplot(hair[,i], plot = F)$out
OutLiers[,i] <- NA
if (length(Box_Plot) > 0) {
OutLiers[(1:length(Box_Plot)),i] <- Box_Plot
}
}
## Keep only the first 6 rows, enough to hold the outliers found
OutLiers <- OutLiers[(1:6),]
# Write the outliers list to a csv file
write.csv(OutLiers, "OutLiers.csv")
## Create correlation matrix
corlnMtrx <- cor(hair[,-12])
corlnMtrx
## Correlation plot for the data hair
corrplot.mixed(corlnMtrx, lower = "number", upper = "pie",
               tl.col = "black", tl.pos = "lt")
## Check multicollinearity in independent variables using VIF
vifmatrix <- vif(lm(`Customer Satisfaction` ~., data = hair))
vifmatrix
write.csv(vifmatrix, "vifmatrix.csv")
## Check corlnMtrx with Bartlett's test of sphericity
cortest.bartlett(corlnMtrx, 100)
# If the p-value is less than 0.05, the data are a good candidate for
# dimension reduction.
## The Kaiser-Meyer-Olkin (KMO) test measures how suited the data are
## for factor analysis
KMO(corlnMtrx)
## Calculate the eigenvalues of the correlation matrix
A <- eigen(corlnMtrx)
EV <- A$values
EV
plot(EV, main = "Scree Plot", xlab = "Factors", ylab = "Eigen Values",
pch = 20, col = "blue")
lines(EV, col = "red")
abline(h = 1, col = "green", lty = 2)
## As per the scree plot above, extract 4 factors from the 11 variables
## Without rotation
FourFactor = fa(r= hair[,-12], nfactors =4, rotate ="none", fm ="pa")
print(FourFactor)
Loading <- print(FourFactor$loadings,cutoff = 0.3)
write.csv(Loading, "loading.csv")
fa.diagram(FourFactor)
## With varimax rotation
FourFactor1 = fa(r = hair[,-12], nfactors = 4, rotate = "varimax", fm = "pa")
print(FourFactor1)
Loading1 <- print(FourFactor1$loadings,cutoff = 0.3)
write.csv(Loading1, "Loading1.csv")
fa.diagram(FourFactor1)
## Create a new data frame from the scores of the four factors and the
## dependent variable
hair1 <- cbind(hair[,12], FourFactor1$scores)
## Check the head of the data
head(hair1)
## Name the columns of hair1
colnames(hair1) <- c("Cust.Satisf", "Sales.Distri", "Marketing",
                     "After.Sales.Service", "Value.For.Money")
## Check the head of the data again
head(hair1)
## Check the class of hair1
class(hair1)
# Convert the matrix to a data frame
hair1 <- as.data.frame(hair1)
## Correlation plot for the data hair1
corrplot.mixed(cor(hair1), lower = "number", upper = "pie",
               tl.col = "black", tl.pos = "lt")
## Set the seed so the random split is reproducible
set.seed(1)
## Create two datasets, one to train the model and another to test it
spl = sample.split(hair1$Cust.Satisf, SplitRatio = 0.8)
Train = subset(hair1, spl==TRUE)
Test = subset(hair1, spl==FALSE)
## Check the dimensions of the Train and Test data
cat(" Train Dimension: ", dim(Train), "\n", "Test Dimension : ", dim(Test))
linearModel = lm(Cust.Satisf ~., data = Train)
summary(linearModel)
vif(linearModel)
pred = predict(linearModel, newdata = Test)
## Compute R-squared for the test data
## SST - total sum of squares
SST = sum((Test$Cust.Satisf - mean(Train$Cust.Satisf))^2)
## SSE - sum of squared deviations of the actual values from the
## predicted values
SSE = sum((pred - Test$Cust.Satisf)^2)
## SSR - sum of squared deviations of the predicted values from the mean
## (predicted using the regression)
SSR = sum((pred - mean(Train$Cust.Satisf))^2)
R.square.test <- SSR/SST
cat(" SST :", SST, "\n", "SSE :", SSE, "\n", "SSR :", SSR, "\n",
    "R squared Test :", R.square.test)