Project Report
-by Vipul Malpani
Predictive Modelling- Telecom Customer Churn Dataset
1. Project Objective:-
a. To perform exploratory data analysis (EDA) on the given dataset.
b. To check the dataset for missing values, outliers and multicollinearity.
c. To build a model that predicts whether a customer will cancel their service in
the future, using several predictive modelling techniques.
2. Assumptions:-
a. Logistic regression requires little or no multicollinearity among the
independent variables, i.e. the independent variables should not be
too highly correlated with each other.
b. The dependent variable (the outcome, Churn) is categorical; here it is binary.
3. Environment setup and library installation
#Loading the libraries and dataset
library(readxl)
library(corrplot)
library(psych)
library(ggplot2)
library(RColorBrewer)
library(caTools)
library(car)
library(data.table)
library(ROCR)
library(class)
library(funModeling)
library(tidyverse)
library(Hmisc)
library(ineq)
library(caret)
library(e1071)
celldata <- read_excel("~/Downloads/cellphoneData.xlsx")
##View(Cellphone)
#analysing and treating data
str(celldata)
## Classes 'tbl_df', 'tbl' and 'data.frame': 3333 obs. of 11 variables:
## $ Churn : num 0 0 0 0 0 0 0 0 0 0 ...
## $ AccountWeeks : num 128 107 137 84 75 118 121 147 117 141 ...
## $ ContractRenewal: num 1 1 1 0 0 0 1 0 1 0 ...
## $ DataPlan : num 1 1 0 0 0 0 1 0 0 1 ...
## $ DataUsage : num 2.7 3.7 0 0 0 0 2.03 0 0.19 3.02 ...
## $ CustServCalls : num 1 1 0 2 3 0 3 0 1 0 ...
## $ DayMins : num 265 162 243 299 167 ...
## $ DayCalls : num 110 123 114 71 113 98 88 79 97 84 ...
## $ MonthlyCharge : num 89 82 52 57 41 57 87.3 36 63.9 93.2 ...
## $ OverageFee : num 9.87 9.78 6.06 3.1 7.42 ...
## $ RoamMins : num 10 13.7 12.2 6.6 10.1 6.3 7.5 7.1 8.7 11.2 ...
dim(celldata)
## [1] 3333 11
attach(celldata)
boxplot(celldata)
4. Basic Treatment of Data
#saving the data into another dataset for backup
celldata_n = celldata
#factorising the categorical variables
celldata$Churn=as.factor(celldata$Churn)
celldata$ContractRenewal=as.factor(celldata$ContractRenewal)
celldata$DataPlan=as.factor(celldata$DataPlan)
summary(celldata)
## Churn AccountWeeks ContractRenewal DataPlan DataUsage
## 0:2850 Min. : 1.0 0: 323 0:2411 Min. :0.0000
## 1: 483 1st Qu.: 74.0 1:3010 1: 922 1st Qu.:0.0000
## Median :101.0 Median :0.0000
## Mean :101.1 Mean :0.8165
## 3rd Qu.:127.0 3rd Qu.:1.7800
## Max. :243.0 Max. :5.4000
## CustServCalls DayMins DayCalls MonthlyCharge
## Min. :0.000 Min. : 0.0 Min. : 0.0 Min. : 14.00
## 1st Qu.:1.000 1st Qu.:143.7 1st Qu.: 87.0 1st Qu.: 45.00
## Median :1.000 Median :179.4 Median :101.0 Median : 53.50
## Mean :1.563 Mean :179.8 Mean :100.4 Mean : 56.31
## 3rd Qu.:2.000 3rd Qu.:216.4 3rd Qu.:114.0 3rd Qu.: 66.20
## Max. :9.000 Max. :350.8 Max. :165.0 Max. :111.30
## OverageFee RoamMins
## Min. : 0.00 Min. : 0.00
## 1st Qu.: 8.33 1st Qu.: 8.50
## Median :10.07 Median :10.30
## Mean :10.05 Mean :10.24
## 3rd Qu.:11.77 3rd Qu.:12.10
## Max. :18.19 Max. :20.00
str(celldata)
## Classes 'tbl_df', 'tbl' and 'data.frame': 3333 obs. of 11 variables:
## $ Churn : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ AccountWeeks : num 128 107 137 84 75 118 121 147 117 141 ...
## $ ContractRenewal: Factor w/ 2 levels "0","1": 2 2 2 1 1 1 2 1 2 1 ...
## $ DataPlan : Factor w/ 2 levels "0","1": 2 2 1 1 1 1 2 1 1 2 ...
## $ DataUsage : num 2.7 3.7 0 0 0 0 2.03 0 0.19 3.02 ...
## $ CustServCalls : num 1 1 0 2 3 0 3 0 1 0 ...
## $ DayMins : num 265 162 243 299 167 ...
## $ DayCalls : num 110 123 114 71 113 98 88 79 97 84 ...
## $ MonthlyCharge : num 89 82 52 57 41 57 87.3 36 63.9 93.2 ...
## $ OverageFee : num 9.87 9.78 6.06 3.1 7.42 ...
## $ RoamMins : num 10 13.7 12.2 6.6 10.1 6.3 7.5 7.1 8.7 11.2 ...
#basic EDA
basic_eda <- function(data)
{
  #use the function argument rather than the global celldata
  print(summary(data))
  df_status(data)
  freq(data)
  plot_num(data)
  #profiling_num(data)
  hist(data)
  describe(data)
}
attach(celldata)
basic_eda(celldata)
## variable q_zeros p_zeros q_na p_na q_inf p_inf type unique
## 1 Churn 2850 85.51 0 0 0 0 factor 2
## 2 AccountWeeks 0 0.00 0 0 0 0 numeric 212
## 3 ContractRenewal 323 9.69 0 0 0 0 factor 2
## 4 DataPlan 2411 72.34 0 0 0 0 factor 2
## 5 DataUsage 1813 54.40 0 0 0 0 numeric 174
## 6 CustServCalls 697 20.91 0 0 0 0 numeric 10
## 7 DayMins 2 0.06 0 0 0 0 numeric 1667
## 8 DayCalls 2 0.06 0 0 0 0 numeric 119
## 9 MonthlyCharge 0 0.00 0 0 0 0 numeric 656
## 10 OverageFee 1 0.03 0 0 0 0 numeric 1024
## 11 RoamMins 18 0.54 0 0 0 0 numeric 162
From the result above, 85.51% of the Churn values are zero, so the churn rate is
about 14.5%: 483 of the 3,333 customers have churned.
ContractRenewal equals 1 for 90.31% of accounts, far more often than it equals 0
(9.69%).
DataPlan shows the opposite imbalance: 72.34% of accounts have no data plan (value 0).
Most of the continuous variables are approximately normally distributed, judging by
the histograms and by how close the mean and median are in the summary. DataUsage
(54.40% zeros) and CustServCalls (a count with only 10 distinct values) are the
clear exceptions.
Histograms of all the variables
5. Plots for each variable:-
ggplot(celldata, aes(Churn)) + geom_bar(fill="blue")
#Account Weeks
One would expect the churn rate to fall as the age of an account
(AccountWeeks) increases, but that does not seem to be the case: no clear
trend is visible.
#Contract Renewal
table(Churn, ContractRenewal)
## ContractRenewal
## Churn 0 1
## 0 186 2664
## 1 137 346
(137/(186+137))
## [1] 0.4241486
Clearly, there is a good probability (approx 42%) of an account churning if
the contract has not been renewed.
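The same conditional rate can be read off for both contract groups at once with `prop.table`. The following is a minimal sketch that rebuilds the table from the counts printed above; the object names `tab` and `churn_by_renewal` are illustrative and do not appear in the original analysis.

```r
# Counts copied from the table(Churn, ContractRenewal) output above
tab <- matrix(c(186, 137, 2664, 346), nrow = 2,
              dimnames = list(Churn = c("0", "1"),
                              ContractRenewal = c("0", "1")))

# Column-wise proportions: share of churners within each renewal group
churn_by_renewal <- prop.table(tab, margin = 2)
round(churn_by_renewal["1", ], 3)
```

On the full data the same idea would be `prop.table(table(celldata$Churn, celldata$ContractRenewal), margin = 2)`. The churn share among renewed contracts (346 of 3010) is far lower than the 42% seen for non-renewed contracts.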
#Data Plan
The probability of an account churning is higher if the account has not
subscribed to a data plan.
#Data Usage
Churn is highest in the 0-0.5 data usage category.
#CustServCalls
The churn rate increases markedly when a customer makes more than 4 customer
service calls.
#DayMins
The churn rate increases when the monthly average daytime minutes exceed 245.
#DayCalls
No clear pattern can be observed.
#MonthlyCharge
The churn rate is highest when the monthly bill is between 64 and 74.
#OverageFee
No clear pattern can be observed.
#RoamMins
No clear pattern can be observed.
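The per-variable observations above come from plots that are not reproduced here. One way such churn rates per value range could be tabulated is sketched below. It runs on a small synthetic data frame (`toy`), not on `celldata`, and the helper `churn_rate_by_bin` and the chosen breaks are assumptions made for illustration only.

```r
# Synthetic stand-in data; the real analysis would use celldata
set.seed(1)
toy <- data.frame(
  CustServCalls = sample(0:9, 200, replace = TRUE),
  Churn         = rbinom(200, 1, 0.15)
)

# Hypothetical helper: churn rate within bins of a numeric variable
churn_rate_by_bin <- function(x, churn, breaks) {
  bins <- cut(x, breaks = breaks, include.lowest = TRUE)
  tapply(churn, bins, mean)   # mean of a 0/1 vector is the churn rate
}

rates <- churn_rate_by_bin(toy$CustServCalls, toy$Churn,
                           breaks = c(0, 2, 4, 9))
rates
```

Because `Churn` is stored as a factor in `celldata`, it would first need converting back to 0/1, for example with `as.numeric(as.character(celldata$Churn))`.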
#check missing values
anyNA(celldata)
## [1] FALSE
No values are missing, so no missing-value treatment is required.
#Outliers check
boxplot(celldata,horizontal = TRUE)
Note: The dataset contains outliers, but treating them was not part of the assignment;
otherwise they would have been treated with the KNN imputation method.
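To put a number on what the boxplot shows, the points beyond the whiskers can be counted per variable. This is a minimal sketch using base R's `boxplot.stats`, which applies the same 1.5 x IQR rule as the boxplot; the helper name `count_outliers` and the example vector are made up for illustration.

```r
# Hypothetical helper: count points outside the boxplot whiskers
# (1.5 * IQR rule, the default of boxplot.stats)
count_outliers <- function(x) {
  length(boxplot.stats(x)$out)
}

# Small made-up vector with one obvious extreme value
x <- c(10, 12, 11, 13, 12, 11, 50)
n_out <- count_outliers(x)
n_out

# On the report's data this could be applied column-wise, e.g.
# sapply(celldata[sapply(celldata, is.numeric)], count_outliers)
```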
5.2 Collinearity:-
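The models below are fitted on a data frame named `train`, but the train/test split itself is not shown in this report. The outputs imply a stratified 70/30 split (1995 non-churners and 338 churners in `train`, 1000 rows in `test`). Since `caTools` is loaded, `sample.split` may have been used, but that is an assumption. The base-R sketch below reproduces the same class counts on synthetic labels; `stratified_split`, `train_toy` and `test_toy` are hypothetical names.

```r
# Synthetic labels with the class counts reported in summary(celldata)
toy <- data.frame(id = 1:3333,
                  Churn = factor(rep(c(0, 1), times = c(2850, 483))))

# Hypothetical helper: stratified split that keeps the class ratio in both parts
stratified_split <- function(y, ratio = 0.7) {
  in_train <- logical(length(y))
  for (lev in levels(y)) {
    idx <- which(y == lev)
    in_train[sample(idx, round(ratio * length(idx)))] <- TRUE
  }
  in_train
}

set.seed(12345)
in_train  <- stratified_split(toy$Churn, ratio = 0.7)
train_toy <- toy[in_train, ]
test_toy  <- toy[!in_train, ]
table(train_toy$Churn)
```

Which rows end up in each part depends on the seed and on the splitting function, so this sketch would not reproduce the exact rows behind the results that follow.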
#LOGISTIC REGRESSION
set.seed(12345)
model1 <- glm(Churn ~ ., data= train, family=binomial)
summary(model1)
##
## Call:
## glm(formula = Churn ~ ., family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9954 -0.5164 -0.3475 -0.2119 2.9881
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.3368705 0.6574349 -8.118 4.75e-16 ***
## AccountWeeks 0.0011308 0.0016600 0.681 0.495737
## ContractRenewal1 -1.9892428 0.1699581 -11.704 < 2e-16 ***
## DataPlan1 -1.0422186 0.6527079 -1.597 0.110319
## DataUsage -0.7635666 2.3185753 -0.329 0.741909
## CustServCalls 0.5319410 0.0473461 11.235 < 2e-16 ***
## DayMins -0.0016103 0.0391549 -0.041 0.967195
## DayCalls -0.0008953 0.0033234 -0.269 0.787635
## MonthlyCharge 0.0806978 0.2301505 0.351 0.725865
## OverageFee -0.0283545 0.3926032 -0.072 0.942425
## RoamMins 0.0928157 0.0266561 3.482 0.000498 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1930.4 on 2332 degrees of freedom
## Residual deviance: 1531.2 on 2322 degrees of freedom
## AIC: 1553.2
##
## Number of Fisher Scoring iterations: 6
vif(model1)
## AccountWeeks ContractRenewal DataPlan DataUsage
## 1.004962 1.049342 14.890864 1668.816467
## CustServCalls DayMins DayCalls MonthlyCharge
## 1.071380 987.216745 1.005758 3013.929613
## OverageFee RoamMins
## 211.337513 1.199179
Note: Multicollinearity has inflated the VIF values of the correlated
variables (DataPlan, DataUsage, DayMins, MonthlyCharge and OverageFee), which
makes the coefficient estimates of model1 unreliable.
#As a rule of thumb, a VIF above 5 indicates problematic multicollinearity,
#so we remove the variables with a VIF above 5 one at a time, refitting the
#model after each removal.
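For intuition, the textbook variance inflation factor of predictor j is 1 / (1 - R²_j), where R²_j comes from regressing that predictor on all the others. The sketch below computes it by hand on synthetic data in which `x3` is almost a linear combination of `x1` and `x2`. The helper `vif_manual` is hypothetical, and `car::vif` applied to a `glm` may differ in detail, so the numbers are not expected to match the output above.

```r
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + x2 + rnorm(n, sd = 0.1)   # almost a linear combination of x1 and x2
x4 <- rnorm(n)                       # unrelated to the others
X  <- data.frame(x1, x2, x3, x4)

# Hypothetical helper: VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing
# predictor j on all the other predictors
vif_manual <- function(X) {
  sapply(names(X), function(v) {
    f  <- reformulate(setdiff(names(X), v), response = v)
    r2 <- summary(lm(f, data = X))$r.squared
    1 / (1 - r2)
  })
}

v <- vif_manual(X)
round(v, 2)
```

The near-dependent predictors get very large values while `x4` stays close to 1. This is the same pattern seen above for MonthlyCharge, DataUsage and DayMins against variables such as AccountWeeks.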
#remove MonthlyCharge
model2 <- glm(Churn ~.-MonthlyCharge, data= train, family=binomial)
summary(model2)
##
## Call:
## glm(formula = Churn ~ . - MonthlyCharge, family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9942 -0.5171 -0.3465 -0.2109 2.9827
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.3006340 0.6491593 -8.165 3.20e-16 ***
## AccountWeeks 0.0011226 0.0016594 0.677 0.498711
## ContractRenewal1 -1.9883365 0.1698888 -11.704 < 2e-16 ***
## DataPlan1 -1.0477052 0.6522055 -1.606 0.108185
## DataUsage 0.0456863 0.2209013 0.207 0.836152
## CustServCalls 0.5316024 0.0473523 11.227 < 2e-16 ***
## DayMins 0.0121123 0.0012723 9.520 < 2e-16 ***
## DayCalls -0.0008644 0.0033235 -0.260 0.794787
## OverageFee 0.1089775 0.0272701 3.996 6.44e-05 ***
## RoamMins 0.0926841 0.0266490 3.478 0.000505 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1930.4 on 2332 degrees of freedom
## Residual deviance: 1531.3 on 2323 degrees of freedom
## AIC: 1551.3
##
## Number of Fisher Scoring iterations: 6
vif(model2)
## AccountWeeks ContractRenewal DataPlan DataUsage
## 1.004781 1.049038 14.875167 15.159742
## CustServCalls DayMins DayCalls OverageFee
## 1.070753 1.042789 1.004988 1.019429
## RoamMins
## 1.199067
#now removing dataUsage
model3 <- glm(Churn ~.-MonthlyCharge-DataUsage, data= train, family=binomial)
summary(model3)
##
## Call:
## glm(formula = Churn ~ . - MonthlyCharge - DataUsage, family = binomial,
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9968 -0.5181 -0.3465 -0.2112 2.9759
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.3190778 0.6430461 -8.272 < 2e-16 ***
## AccountWeeks 0.0011291 0.0016593 0.680 0.496202
## ContractRenewal1 -1.9892584 0.1698531 -11.712 < 2e-16 ***
## DataPlan1 -0.9177473 0.1707087 -5.376 7.61e-08 ***
## CustServCalls 0.5311885 0.0472969 11.231 < 2e-16 ***
## DayMins 0.0121141 0.0012722 9.522 < 2e-16 ***
## DayCalls -0.0008638 0.0033232 -0.260 0.794906
## OverageFee 0.1088689 0.0272578 3.994 6.50e-05 ***
## RoamMins 0.0948722 0.0244771 3.876 0.000106 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1930.4 on 2332 degrees of freedom
## Residual deviance: 1531.3 on 2324 degrees of freedom
## AIC: 1549.3
##
## Number of Fisher Scoring iterations: 5
vif(model3)
## AccountWeeks ContractRenewal DataPlan CustServCalls
## 1.004348 1.048383 1.020221 1.068665
## DayMins DayCalls OverageFee RoamMins
## 1.042708 1.005033 1.019067 1.010847
#The VIF of every remaining variable is now below 5. AccountWeeks and DayCalls
#are not significant in model3, so we remove them one at a time from the
#final model and check the effect on the AIC and the residual deviance.
model4 <- glm(Churn ~.-MonthlyCharge-DataUsage-AccountWeeks, data= train,
family=binomial)
summary(model4)
##
## Call:
## glm(formula = Churn ~ . - MonthlyCharge - DataUsage - AccountWeeks,
## family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9990 -0.5155 -0.3463 -0.2110 2.9737
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.1962737 0.6162506 -8.432 < 2e-16 ***
## ContractRenewal1 -1.9933451 0.1697498 -11.743 < 2e-16 ***
## DataPlan1 -0.9145091 0.1705327 -5.363 8.20e-08 ***
## CustServCalls 0.5312599 0.0472621 11.241 < 2e-16 ***
## DayMins 0.0121098 0.0012720 9.521 < 2e-16 ***
## DayCalls -0.0008059 0.0033222 -0.243 0.808321
## OverageFee 0.1081960 0.0272360 3.973 7.11e-05 ***
## RoamMins 0.0946130 0.0244739 3.866 0.000111 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1930.4 on 2332 degrees of freedom
## Residual deviance: 1531.8 on 2325 degrees of freedom
## AIC: 1547.8
##
## Number of Fisher Scoring iterations: 5
model5 <- glm(Churn ~.-MonthlyCharge-DataUsage-AccountWeeks-DayCalls, data=
train, family=binomial)
summary(model5)
##
## Call:
## glm(formula = Churn ~ . - MonthlyCharge - DataUsage - AccountWeeks -
## DayCalls, family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0026 -0.5143 -0.3463 -0.2115 2.9666
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.278539 0.515106 -10.247 < 2e-16 ***
## ContractRenewal1 -1.992118 0.169637 -11.743 < 2e-16 ***
## DataPlan1 -0.913216 0.170453 -5.358 8.43e-08 ***
## CustServCalls 0.531395 0.047257 11.245 < 2e-16 ***
## DayMins 0.012101 0.001271 9.519 < 2e-16 ***
## OverageFee 0.108411 0.027220 3.983 6.81e-05 ***
## RoamMins 0.094548 0.024470 3.864 0.000112 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1930.4 on 2332 degrees of freedom
## Residual deviance: 1531.9 on 2326 degrees of freedom
## AIC: 1545.9
##
## Number of Fisher Scoring iterations: 5
vif(model5)
## ContractRenewal DataPlan CustServCalls DayMins
## 1.046460 1.018063 1.068819 1.041792
## OverageFee RoamMins
## 1.016316 1.010383
#step(model1)
Note: Model 5 retains six significant variables, all with a VIF close to 1,
i.e. no meaningful multicollinearity remains.
Its AIC of 1545.9 is the lowest of the five models fitted.
Explanatory power (odds ratios)
round(exp(coef(model5)),2)
## (Intercept) ContractRenewal1 DataPlan1 CustServCalls
## 0.01 0.14 0.40 1.70
## DayMins OverageFee RoamMins
## 1.01 1.11 1.10
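Note: The values are odds ratios rounded to two decimals. For example, each
additional customer service call multiplies the odds of churn by about 1.70,
while a renewed contract multiplies them by about 0.14.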
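To see how the coefficients turn into a predicted probability, the linear predictor can be evaluated by hand and passed through the inverse logit. The coefficients below are copied from `summary(model5)`; the customer profile is entirely hypothetical and serves only as a worked example.

```r
# Coefficients copied from summary(model5) above
b <- c("(Intercept)" = -5.278539, ContractRenewal1 = -1.992118,
       DataPlan1 = -0.913216, CustServCalls = 0.531395,
       DayMins = 0.012101, OverageFee = 0.108411, RoamMins = 0.094548)

# Hypothetical customer: renewed contract, no data plan, 1 service call
x <- c("(Intercept)" = 1, ContractRenewal1 = 1, DataPlan1 = 0,
       CustServCalls = 1, DayMins = 180, OverageFee = 10, RoamMins = 10)

log_odds <- sum(b * x[names(b)])
p_churn  <- plogis(log_odds)          # inverse logit: 1 / (1 + exp(-log_odds))

# Same customer with 5 service calls: the odds are multiplied by exp(4 * 0.531395)
x5 <- x
x5["CustServCalls"] <- 5
p_churn_5 <- plogis(sum(b * x5[names(b)]))

c(one_call = p_churn, five_calls = p_churn_5)
```

Odds ratios act multiplicatively on the odds, not on the probability, which is why the jump in predicted probability depends on where the customer starts.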
Classification/Prediction based on threshold value (train)
Pred.model5=predict(model5,type = "response",newdata=train)
summary(Pred.model5)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.001599 0.041391 0.084945 0.144878 0.186781 0.984946
summary(train$Churn)
## 0 1
## 1995 338
plot(train$Churn,Pred.model5)
Note: From the plot we initially chose a threshold of 0.5, meaning that customers
with a predicted probability above 0.5 are classified as churned (the customer
will end the service) and the rest as not churned (the customer will continue
the service). The threshold was later lowered to 0.20, the value used in the
code below, for the reasons given after the confusion matrices.
Pred.model5.factor=ifelse(Pred.model5<0.20,0,1)
6. Confusion Matrix
Train data
confusionMatrix(table(Actual=train$Churn,Pred.model5.factor))
## Confusion Matrix and Statistics
##
## Pred.model5.factor
## Actual 0 1
## 0 1674 321
## 1 109 229
## Accuracy : 0.8157
## 95% CI : (0.7993, 0.8312)
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.9389
## Specificity : 0.4164
## Pos Pred Value : 0.8391
## Neg Pred Value : 0.6775
## Prevalence : 0.7643
## Detection Rate : 0.7175
## Detection Prevalence : 0.8551
## Balanced Accuracy : 0.6776
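One point to keep in mind when reading these statistics: to my understanding of its documentation, `confusionMatrix` treats the rows of a table as predictions and the columns as the reference, whereas the table here has the actual class in the rows. The sketch below recomputes the row-wise and column-wise rates directly from the printed counts, so it is clear which quantity each reported number corresponds to. The object names are illustrative.

```r
# Counts copied from the training confusion matrix above
# rows = actual class, columns = predicted class
cm <- matrix(c(1674, 109, 321, 229), nrow = 2,
             dimnames = list(Actual = c("0", "1"), Predicted = c("0", "1")))

# Row-wise rates: of the customers who actually belong to a class,
# what share was predicted correctly?
recall_non_churn <- cm["0", "0"] / sum(cm["0", ])   # 1674 / 1995
recall_churn     <- cm["1", "1"] / sum(cm["1", ])   # 229 / 338

# Column-wise rate: of the customers predicted to churn,
# what share actually churned?
precision_churn  <- cm["1", "1"] / sum(cm[, "1"])   # 229 / 550

round(c(recall_non_churn = recall_non_churn,
        recall_churn     = recall_churn,
        precision_churn  = precision_churn), 4)
```

On these counts the share of actual churners that the model catches (229 of 338) matches the figure printed as "Neg Pred Value", and the figure printed as "Specificity" matches 229 of 550. Passing the table as `table(Predicted = ..., Actual = ...)`, possibly with `positive = "1"`, would be one way to make the labels line up; that call is a suggestion, not what the report ran.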
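#Test data: predict churn probabilities, then apply the same 0.20 threshold
Pred.model.test=predict(model5,type = "response",newdata=test)
Pred.model.test.factor=ifelse(Pred.model.test<0.20,0,1)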
confusionMatrix(table(Actual=test$Churn,Pred.model.test.factor))
## Confusion Matrix and Statistics
##
## Pred.model.test.factor
## Actual 0 1
## 0 720 135
## 1 54 91
##
## Accuracy : 0.811
## 95% CI : (0.7853, 0.8348)
## No Information Rate : 0.774
##
## Sensitivity : 0.9302
## Specificity : 0.4027
## Pos Pred Value : 0.8421
## Neg Pred Value : 0.6276
## Prevalence : 0.7740
## Detection Rate : 0.7200
## Detection Prevalence : 0.8550
## Balanced Accuracy : 0.6664
Note: Lowering the threshold to 0.20 reduced the number of customers who were
falsely predicted not to churn (but actually churned) from 115 to 49. Overall
accuracy drops from 85% to about 82%, but specificity improves by 7-8
percentage points, which is an acceptable trade-off.
We initially used a threshold of 0.50. After plotting the ROC curve and taking
the business context into account, we found that a threshold of 0.20 lets us
flag the largest number of customers who are about to churn. The cost is about
4% of users being falsely flagged as churners, who would then receive retention
measures they do not need. This is less harmful than missing customers who are
wrongly treated as non-churners when they are in fact about to leave.
From a business perspective, retaining existing customers is an important
factor, and it was taken into account when choosing the threshold.
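The trade-off described above can be illustrated by counting both kinds of error at the two cut-offs. The sketch below uses synthetic predicted probabilities, not the output of model5, and the helper `errors_at` is hypothetical; only the direction of the effect carries over.

```r
# Synthetic scores: churners tend to get higher predicted probabilities
set.seed(42)
actual <- rbinom(1000, 1, 0.15)
prob   <- ifelse(actual == 1, rbeta(1000, 2, 4), rbeta(1000, 1, 8))

# Hypothetical helper: error counts at a given cut-off
errors_at <- function(prob, actual, cutoff) {
  pred <- as.integer(prob >= cutoff)   # same rule as ifelse(prob < cutoff, 0, 1)
  c(cutoff          = cutoff,
    false_negatives = sum(pred == 0 & actual == 1),  # missed churners
    false_positives = sum(pred == 1 & actual == 0),  # needless retention offers
    accuracy        = mean(pred == actual))
}

res <- rbind(errors_at(prob, actual, 0.50),
             errors_at(prob, actual, 0.20))
res
```

Lowering the cut-off can only move predictions from 0 to 1, so false negatives never increase and false positives never decrease. Whether accuracy rises or falls depends on the data.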
6 ROC Curve on train and test datasets
Note: The ROC curve was used to choose the classification threshold, which we
had initially assumed to be 0.50.
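The ROC plotting code is not included in this report (`ROCR` is loaded, so it was probably used, but that is an assumption). As a base-R sketch of what the curve contains, the functions below compute the true and false positive rates at every cut-off and the area under the curve by the trapezoid rule. The helper names and the six-point example are made up.

```r
# Hypothetical helper: ROC points for a vector of scores and 0/1 labels
roc_points <- function(score, label) {
  cutoffs <- sort(unique(c(-Inf, score, Inf)), decreasing = TRUE)
  tpr <- sapply(cutoffs, function(t) mean(score[label == 1] >= t))
  fpr <- sapply(cutoffs, function(t) mean(score[label == 0] >= t))
  data.frame(cutoff = cutoffs, fpr = fpr, tpr = tpr)
}

# Area under the curve by the trapezoid rule
auc_trapezoid <- function(roc) {
  sum(diff(roc$fpr) * (head(roc$tpr, -1) + tail(roc$tpr, -1)) / 2)
}

# Tiny made-up example: three churners (1) and three non-churners (0)
score <- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.1)
label <- c(1,   1,   0,   1,   0,   0)

roc     <- roc_points(score, label)
auc_toy <- auc_trapezoid(roc)
auc_toy

# plot(roc$fpr, roc$tpr, type = "l") would draw the curve
```

Each row of `roc` is one candidate threshold, so the table can be scanned for the cut-off that gives an acceptable balance between caught churners (tpr) and false alarms (fpr).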
library(blorr) # to build and validate binary logistic models
blr_step_aic_both(model5, details = FALSE)
## Stepwise Selection Method
## -------------------------
##
## Candidate Terms:
##
## 1 . ContractRenewal
## 2 . DataPlan
## 3 . CustServCalls
## 4 . DayMins
## 5 . OverageFee
## 6 . RoamMins
##
##
## Variables Entered/Removed:
##
## - ContractRenewal added
## - CustServCalls added
## - DayMins added
## - DataPlan added
## - OverageFee added
## - RoamMins added
##
##
## Stepwise Summary
## ---------------------------------------------------------------
## Variable Method AIC BIC Deviance
## ---------------------------------------------------------------
## ContractRenewal addition 1804.930 1816.439 1800.930
## CustServCalls addition 1691.744 1709.009 1685.744
## DayMins addition 1599.200 1622.220 1591.200
## DataPlan addition 1571.798 1600.573 1561.798
## OverageFee addition 1559.186 1593.716 1547.186
## RoamMins addition 1545.863 1586.147 1531.863
## ---------------------------------------------------------------
7 Model Performance Parameters
Train
KS_train
## [1] 0.5285685
auc_train
## [1] 0.8190506
gini
## [1] 0.5241413
Test
KS_test
## [1] 0.7579812
auc_test
## [1] 0.8178907
gini
## [1] 0.514548
Note: KS, AUC and GINI values are calculated for the logistic model for both
train and test data.
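The code that produced these values is not shown. Under the commonly used definitions, KS is the largest gap between the true and false positive rates over all cut-offs, and the Gini coefficient of a classifier is 2 x AUC - 1. The base-R sketch below implements those definitions on a made-up example, with hypothetical helper names. With that Gini definition the reported train AUC of 0.8190506 would give about 0.638 and not 0.5241413, so the report presumably used a different Gini calculation; which one is not stated.

```r
# Hypothetical helpers, base R only
ks_stat <- function(score, label) {
  cutoffs <- sort(unique(score))
  tpr <- sapply(cutoffs, function(t) mean(score[label == 1] >= t))
  fpr <- sapply(cutoffs, function(t) mean(score[label == 0] >= t))
  max(tpr - fpr)                      # largest gap between the two curves
}

auc_rank <- function(score, label) {
  r  <- rank(score)                   # average ranks handle ties
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Tiny made-up example: three churners (1) and three non-churners (0)
score <- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.1)
label <- c(1,   1,   0,   1,   0,   0)

ks_toy   <- ks_stat(score, label)
auc_toy  <- auc_rank(score, label)
gini_toy <- 2 * auc_toy - 1
c(KS = ks_toy, AUC = auc_toy, Gini = gini_toy)
```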
8 KNN Classifier
#normalize the test & train data
#split the normalized dataset into 7:3 ratio of train and test respectively
## knn.pred
## 0 1
## 0 850 5
## 1 90 55
sum(diag(table.knn)/sum(table.knn))
## [1] 0.905
confusionMatrix(table.knn)
## Confusion Matrix and Statistics
##
## knn.pred
## 0 1
## 0 850 5
## 1 90 55
##
## Accuracy : 0.905
## 95% CI : (0.8851, 0.9225)
## No Information Rate : 0.94
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4936
##
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.9043
## Specificity : 0.9167
## Pos Pred Value : 0.9942
## Neg Pred Value : 0.3793
## Prevalence : 0.9400
## Detection Rate : 0.8500
## Detection Prevalence : 0.8550
## Balanced Accuracy : 0.9105
##
## 'Positive' Class : 0
##
The accuracy obtained is 90.5% at k = 11, where the majority class among the
11 nearest neighbours is used to predict the churn value.
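The normalisation and KNN code is only described in comments above. A minimal sketch of those steps, assuming min-max normalisation and `class::knn` (the `class` package is loaded at the top of the report), could look as follows. It runs on synthetic data with two predictors; the data, the churn rule and all object names are invented for illustration.

```r
library(class)   # provides knn(); a recommended package normally installed with R

# Synthetic stand-in for the numeric predictors and the churn label
set.seed(12345)
n   <- 300
toy <- data.frame(DayMins       = rnorm(n, 180, 50),
                  CustServCalls = rpois(n, 1.5))
toy$Churn <- factor(as.integer(toy$CustServCalls > 3 | toy$DayMins > 245))

# Min-max normalisation so that both predictors lie in [0, 1]
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
toy_n <- as.data.frame(lapply(toy[c("DayMins", "CustServCalls")], normalize))

# 70/30 split of the normalised data
idx     <- sample(n, size = round(0.7 * n))
train_x <- toy_n[idx, ]
test_x  <- toy_n[-idx, ]
train_y <- toy$Churn[idx]
test_y  <- toy$Churn[-idx]

# Each test row gets the majority class of its 11 nearest training rows
knn_pred  <- knn(train = train_x, test = test_x, cl = train_y, k = 11)
table_knn <- table(Actual = test_y, Predicted = knn_pred)
sum(diag(table_knn)) / sum(table_knn)   # overall accuracy
```

Normalising matters for KNN because distances would otherwise be dominated by the variable with the largest scale, here DayMins. An odd k avoids ties between the two classes.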
9 Naïve Bayes
confusionMatrix(table(train$Churn,Nb.prediction.train))
## Confusion Matrix and Statistics
##
## Nb.prediction.train
## 0 1
## 0 1771 224
## 1 121 217
##
## Accuracy : 0.8521
## 95% CI : (0.8371, 0.8663)
## No Information Rate : 0.811
## P-Value [Acc > NIR] : 0.00000009962
##
## Kappa : 0.4702
##
## Mcnemar's Test P-Value : 0.00000003985
##
## Sensitivity : 0.9360
## Specificity : 0.4921
## Pos Pred Value : 0.8877
## Neg Pred Value : 0.6420
## Prevalence : 0.8110
## Detection Rate : 0.7591
## Detection Prevalence : 0.8551
## Balanced Accuracy : 0.7141
##
## 'Positive' Class : 0
##
confusionMatrix(table(test$Churn,Nb.prediction.test))
## Confusion Matrix and Statistics
##
## Nb.prediction.test
## 0 1
## 0 749 106
## 1 60 85
##
## Accuracy : 0.834
## 95% CI : (0.8095, 0.8566)
## No Information Rate : 0.809
## P-Value [Acc > NIR] : 0.0229328
##
## Kappa : 0.4084
##
## Mcnemar's Test P-Value : 0.0004782
##
## Sensitivity : 0.9258
## Specificity : 0.4450
## Pos Pred Value : 0.8760
## Neg Pred Value : 0.5862
## Prevalence : 0.8090
## Detection Rate : 0.7490
## Detection Prevalence : 0.8550
## Balanced Accuracy : 0.6854
##
## 'Positive' Class : 0
10 Model Comparison Table
Values are taken from the test-data confusion matrices above.
              Logistic Regression   KNN       Naïve Bayes
Accuracy      81.10%                90.50%    83.40%
Sensitivity   93.02%                90.43%    92.58%
Specificity   40.27%                91.67%    44.50%
Note: From the comparison above, KNN performs best in our case at predicting the
customers who will discontinue the service: its predictions are 90.5% accurate
on the test data.
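Because the recommendations centre on retaining customers, it may also be worth placing the share of actual churners that each model flags next to accuracy. The sketch below derives both figures from the test-data counts printed in the three confusion matrices. It is an additional view under that assumption, not part of the original comparison, and the helper `make_cm` is hypothetical.

```r
# Test-data counts copied from the three confusion matrices above
# (rows = actual class, columns = predicted class)
make_cm <- function(v) {
  matrix(v, nrow = 2, byrow = TRUE,
         dimnames = list(Actual = c("0", "1"), Predicted = c("0", "1")))
}
cms <- list(Logistic   = make_cm(c(720, 135, 54, 91)),
            KNN        = make_cm(c(850,   5, 90, 55)),
            NaiveBayes = make_cm(c(749, 106, 60, 85)))

# Accuracy and the share of actual churners that each model flags
compare <- t(sapply(cms, function(cm) c(
  accuracy     = sum(diag(cm)) / sum(cm),
  churn_recall = cm["1", "1"] / sum(cm["1", ])
)))
round(compare, 3)
```

On these counts KNN has the highest accuracy, while the logistic model at the 0.20 threshold flags the largest share of the 145 actual churners. Which figure matters more depends on the relative cost of a missed churner and a needless retention offer.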
11 Interpretation and conclusion
Our model uses the features ContractRenewal, DataPlan, CustServCalls, DayMins,
OverageFee and RoamMins from past data to decide whether a customer will churn.
Since these features are the most important, the company should focus on them.
Recommendations for the telecom company based on my model:-
a. The company should encourage existing customers to renew their
contracts by giving them better offers than competing telecom
companies.
b. The company should encourage customers to opt for a data plan
by providing attractive data plan offers.
c. The company should provide the best possible customer support through
customer service calls.
d. The company should also watch customers whose DayMins are falling over
time and contact them for feedback on call quality.