Babson College
Final Report
Google Play Store Applications
Aman Parikh (Section 1)
Raj Mehta (Section 3)
Rishi Mehta (Section 3)
“I pledge my honor that I have neither received nor provided unauthorized assistance during the completion of this work.”
QTM2000
Professor Dessislava Pachamanova
December 13, 2019
EXECUTIVE SUMMARY
The problem we address in this project is optimizing app performance and recognition on the Google Play Store. Our clients/target audience are companies that either plan to launch a new mobile application or wish to improve an existing application on the Google Play Store. Based on our insights, an app’s performance and recognition can be measured by the number of installs it receives through the Play Store. This metric is correlated with several factors: whether the app is paid or free, price (if paid), rating, size, reviews, genre, and category. Clients that already have apps on the Play Store will have existing information for each of these variables. Companies planning to launch a new app are encouraged to beta/market test their app for at least six months to gather information about these variables before they seek our consultation.
Our research and analysis yielded several key lessons about how a company must study the overall market and its competitors before developing and publishing an application. Before development, a company must also consider factors such as the category the app falls within, its genre, target user needs, and size. Using the models we created, trained on an extensive dataset of Play Store apps, we provide recommendations to the company Saucey to help it optimize its app’s performance on the platform.
Saucey is a premium alcohol delivery business that receives orders through an app, which customers can currently download for free from the Google Play Store. To put our models into practice in a real-world context, we collected data on all the variables outlined above from Saucey’s application page. We combined our models with qualitative research to create a holistic set of recommendations. We started by predicting whether the app should remain free or be monetized, and, if monetized, what an optimal price would be. Using that price information, we then predicted the app’s potential number of future installs. Finally, building on these insights, we classified the app with our clustering technique to see the general qualities of the cluster it best aligns with. Our final recommendation is that Saucey should remain a free app given its nature and genre, and that it should strive to maximize its ratings and positive reviews by encouraging satisfied users to rate and review the app. This can be achieved by incentivizing users with redeemable digital tokens, notification reminders, or small discounts (also a revenue driver). Our models and recommendations can be generalized for use by other companies as well, given the general trends in the data (see the visualizations) and the strong correlations between some variables.
ANALYSIS
Data Preprocessing
● Cleaned the data in Excel to make it consistent. For app installs, some values were in thousands (suffixed with ‘k’) and some in millions (suffixed with ‘M’). We converted every ‘k’ observation to millions by removing the suffix and dividing the value by 1,000 using Excel functions.
● Installs was a categorical variable with a large number of categories, and its values contained a ‘+’. We therefore created buckets for app installs to make the variable easier to understand and work with. The buckets we created can be seen in Table 1.
● We removed outliers for each variable by using filters to identify the range of values and then deleting the extreme observations manually.
● We upsampled the data for the logistic model to obtain a balanced training set, since there were many more free apps than paid apps.
● We eliminated missing values in R, using the anyNA function to identify variables with missing values and then removing those rows manually. Imputation by mean, median, mode, or other methods would not work here, because most of the missing values were in Rating, which cannot sensibly be interpolated from the other variables.
● We converted variables: Size from categorical to numerical (eliminating the ‘M’), Install.Buckets from integer to categorical, and Reviews from integer to numerical.
Logistic Regression
To predict whether an application should be free or paid, we created a logistic regression model with ‘Type.Cat’ as the target variable and Rating, Reviews, Size, and Install.Buckets as predictors. This kind of model can help companies that create many applications, such as gaming companies deciding whether a new app should be paid on the Play Store.
Table 2 and Table 3 show the confusion matrix and performance measures. The model’s accuracy is 82.29%, which seems adequately high given that we used only four predictor variables. However, a closer look at the test set composition reveals that only 6.84% of observations were Class 1 (paid apps) while 93.16% were Class 0 (free apps); our dataset is drawn from real-life data in which the majority of apps are free. The sensitivity of the model is 28.11%, where our positive class ‘1’ refers to paid apps because a positive classification signals a chance to monetize the app if its characteristics meet the model criteria. In light of the sensitivity and the test set composition, despite the fairly high accuracy, the model is in fact low performing.
The low sensitivity is concerning because of the costs associated with misclassification. If an app that could suitably be monetized based on its ratings, reviews, and size is classified as “0” instead of “1”, the company loses the opportunity to earn significant revenue from installs. On the flip side, if an app that should be free based on its predictors is classified as “1” (paid), it could lose out on the higher recognition (number of installs) and ratings that free apps generally enjoy. The specificity of the model is 86.27%: of all observations in class 0, the model predicted 86.27% accurately as class 0, which is fairly high. To address the class imbalance, we upsampled the training data.
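The accuracy, sensitivity, and specificity quoted above follow directly from the counts in the confusion matrix (Table 2); the arithmetic, sketched in Python for illustration:

```python
# Counts from Table 2 (positive class = 1, paid apps)
tn, fn = 1797, 110  # predicted 0: actual 0 / actual 1
fp, tp = 286, 43    # predicted 1: actual 0 / actual 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # share of paid apps caught
specificity = tn / (tn + fp)   # share of free apps caught

print(round(accuracy, 4), round(sensitivity, 4), round(specificity, 4))
# 0.8229 0.281 0.8627
```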
Exhibit 4 shows the ROC curve, which helped us determine the optimal cutoff value. Based on the curve, we used a cutoff of 0.7; a cutoff of 0.5 led to lower accuracy.
Interpretation of the % lift chart (Exhibit 1): approximately 42% of all ‘1’s were found after examining the 20% of observations ranked highest by the algorithm. On this basis, the model is relatively low performing.
Interpretation of the simple lift chart (Exhibit 2): approximately 70 of the 153 ‘1’s in the data set were found after examining the 20% of observations with the highest predicted probability. Fewer than half the 1s were found, again indicating a relatively low-performing model.
Interpretation of the decile chart (Exhibit 3): in the most probable decile, the model is 1.8 times as likely to identify the positive class as a random selection of 10% of the observations. Our positive class “1” is ‘paid’ because it indicates that an app is suited to be paid, based on its predictor values, and can be monetized on the Play Store.
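The quantity read off the lift charts, namely the share of all 1s captured in the top-scored slice of observations, can be computed as follows (an illustrative Python sketch with made-up scores; `gain_at` is a hypothetical helper, not part of our R workflow):

```python
def gain_at(scores, labels, frac):
    """Fraction of all 1s found in the top `frac` of observations,
    ranked by predicted score (descending)."""
    ranked = [label for _, label in
              sorted(zip(scores, labels), key=lambda p: -p[0])]
    top = ranked[: max(1, int(len(ranked) * frac))]
    return sum(top) / sum(labels)

scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
print(gain_at(scores, labels, 0.2))  # top 20% holds 2 of the 4 ones -> 0.5
```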
AUC (Exhibit 5): 0.7424, which indicates fairly good predictive performance given that we had only four usable predictor variables and that the ideal AUC is 1.
Using Saucey’s publicly available app information (number of reviews, rating, size, and installs, as seen in Image 1), we scored the app with our logistic model; it was classified as Class 0 and therefore should remain free.
Linear Regression
If the logistic regression outcome (or the company’s own preference) suggests that an app launching on the Google Play Store should be paid, our linear regression model can predict a launch price based on our extensive dataset, using Rating, Reviews, and Size as predictors. The model gives a reasonable price estimate for a new app based on apps in the dataset with similar values for these predictors.
The RMSE of this linear model is $1.948374, the root mean squared error of its price predictions on new observations. This error is relatively high, but lower than that of the regression tree (RMSE = $2.916424), which also predicts app price.
If Saucey chooses to make its application paid, our linear regression model, using its rating, number of reviews, and size (as seen in Image 1), predicts a price of about $0.42; allowing for the RMSE of $1.95, a reasonable price range is $0.42 to $2.37.
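The RMSE figures quoted for the linear model and the regression tree are computed in the standard way; a small self-contained sketch (Python, toy numbers, not our actual predictions):

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between actual and predicted prices."""
    return math.sqrt(sum((a - p) ** 2
                         for a, p in zip(actual, predicted)) / len(actual))

# Toy check: every prediction off by exactly $2 gives an RMSE of $2
print(rmse([0.0, 1.0, 3.0], [2.0, 3.0, 5.0]))  # 2.0
```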
Classification Tree
The classification tree we created predicts our target variable, the install bucket an application will fall into, from the predictors Price, Rating, Reviews, and Size. Product managers can use this tree to propose changes to an application in order to yield more installs and eventually improve the business. Table 4 and Table 5 show the confusion matrix and performance measures for this model. The primary performance measure is total accuracy = 0.8542: across all observations, the model predicted 85.42% of the values correctly. The tree itself appears in Exhibit 6; given an application’s data, one can classify its install bucket by walking down the tree.
Using Saucey’s data (as seen in Image 1), the classification tree in Exhibit 6 places Saucey in install bucket 0, which matches the real-life data and is consistent with the tree being a fairly accurate model. Accuracy would likely have been higher if we could have included the dataset’s categorical variables, but they could not be converted to dummy variables because they have too many categories. As future work, the price information for paid apps could be used to classify the install bucket an app would fall into, to see the effect of its pricing decision.
Regression Tree
The regression tree we created predicts the price of an app from its rating, number of reviews, and size. The RMSE we calculated for the regression tree is $2.916424, a relatively high error given that the majority of apps are free or low priced. This error stems from the limited number of usable numerical predictors in the model; with a dataset containing more usable predictors, its predictive performance in terms of RMSE would improve. The tree can be seen in Exhibit 7; one can input an app’s values for the selected variables and walk down the tree to predict its price.
Using Saucey’s data (as seen in Image 1), the regression tree in Exhibit 7 correctly predicts the price of Saucey’s app as $0 (Saucey is a free app and plans to remain one for the foreseeable future).
Clustering
For unsupervised learning, we used hierarchical clustering and, given the large volume of data, created five clusters. Table 6 summarizes each cluster through the labels in its first column. Cluster 1 contains apps with relatively low ratings compared to the other clusters, an average number of reviews, average size, and a paid but relatively cheap price; this suggests that paid apps tend to have few reviews, perhaps because of relatively fewer installs on average. Cluster 2 contains paid apps with average ratings, a low number of reviews, the smallest app sizes, and fairly high prices. Cluster 3 contains free apps with relatively high ratings, a large number of reviews, and large size. Cluster 4 contains free apps with extremely high ratings, an exceedingly high number of reviews, and very large size. Lastly, cluster 5 contains expensive paid apps with the lowest ratings and the fewest reviews of any cluster, and small size. These conclusions may be skewed because we have relatively few observations for paid apps compared to free apps.
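The per-cluster centroids reported in Table 6 are simply the means of each variable within each cluster (R’s `aggregate(..., mean)`); the aggregation step can be sketched as follows (Python, toy data, hypothetical helper name):

```python
from collections import defaultdict

def cluster_means(rows, assignments):
    """Mean of each feature per cluster: the centroids shown in Table 6."""
    groups = defaultdict(list)
    for row, cluster in zip(rows, assignments):
        groups[cluster].append(row)
    return {cluster: [sum(col) / len(col) for col in zip(*members)]
            for cluster, members in sorted(groups.items())}

rows = [[4.2, 100.0], [4.4, 200.0], [4.0, 50.0]]  # [rating, reviews]
print(cluster_means(rows, [1, 1, 2]))
```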
kNN
Total accuracy for k = 3: 71.24%
Total accuracy for k = 5: 69.59%
Total accuracy for k = 7: 68.74%
Of these kNN models, we chose k = 3 because it has the highest accuracy. The sensitivity and specificity for the individual classes can be seen in Table 8, and the confusion matrix in Table 7.
We created the kNN model to classify an app into an install bucket based on its rating, number of reviews, and size, using the buckets of its three nearest neighbor apps. We tested the model with Saucey’s information, and it correctly classified the app into install bucket 0, consistent with Saucey’s real-world figure of fewer than 500,000 installs (as seen in Image 1). From a managerial standpoint, we can infer that for Saucey to be classified into a higher install bucket, i.e., to achieve more installs (and therefore higher sales through the app), it should pursue higher ratings and more positive reviews (considered more trustworthy). Note the model’s accuracy of 71.24% when relying on its predictions; this is moderately high given that we had over 1,500 observations in the (scaled) test set.
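Our choice of k simply maximizes test-set accuracy across the candidate values; the selection step, sketched in Python with the accuracy values reported above:

```python
def best_k(accuracy_by_k):
    """Return the k whose kNN model scored the highest test accuracy."""
    return max(accuracy_by_k, key=accuracy_by_k.get)

knn_accuracy = {3: 0.7124, 5: 0.6959, 7: 0.6874}
print(best_k(knn_accuracy))  # 3
```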
RECOMMENDATIONS
Using the aforementioned models, app developers and managers of companies planning to launch a new app can arrive at an optimal price or price range (if paid), predict the install bucket the app would fall into, and anticipate the general characteristics it is likely to share with the cluster it aligns best with.
Similarly, companies such as Saucey that have applications on the Google Play Store can
use such models to improve the performance of their apps on the platform.
● Visualization 1 shows that Entertainment, Game, Photography, and Education are the most popular categories by average number of installs, with predominantly free apps. We therefore recommend that developers who are flexible about genre build apps in these categories to maximize recognition (installs), while noting that these segments are highly competitive and mostly free.
● As seen in Visualization 2, apps with higher average ratings generally receive a greater number of installs, so companies should strive to maximize their app’s rating on the Google Play Store.
● As seen in Visualization 3, apps with a higher number of reviews on average received a higher number of installs. We therefore recommend that companies incentivize their users with redeemable digital tokens/points, discounts, and/or push notifications in return for providing positive reviews on the Google Play Store.
● Visualization 4 shows average install bucket versus rating for the Food and Beverage category, to which Saucey belongs. Based on this, we recommend that Saucey encourage users who are satisfied with the service to rate its app highly.
Appendix
Table 1: The various categories of number of installs
Bucket Number of Installs
0 0 - 500,000
1 500,000 - 5,000,000
2 5,000,000 - 100,000,000
3 100,000,000 - 500,000,000
4 500,000,000 - 1,000,000,000
5 1,000,000,000 - Infinity
Tableau:
Visualization 1 - Average number of Installs by category
Visualization 2 - Average number of Installs by Rating
Visualization 3 - Relation between number of app installs and number of reviews
Visualization 4 - Food and Beverage specific Average Install Buckets by Rating
Exhibit 1: Logistic Regression Lift Chart
Exhibit 2: Logistic Regression Lift Chart with 45 Degree Line
Exhibit 3: Logistic Regression Decile Lift chart
Exhibit 4: Logistic Regression ROC Curve
Exhibit 5: Logistic Regression ROC Curve with Area Under the Curve (AUC)
Table 2: Confusion Matrix for Logistic Regression
                       Actual Test Class (cmaxlog)
Predicted Test Class        0       1
                   0     1797     110
                   1      286      43
Table 3: Performance measures from Confusion Matrix for Logistic Regression
Accuracy 0.8229
Sensitivity 0.28105
Specificity 0.86270
Exhibit 6: Pruned Classification Tree (CART) (Arrows show flow specific to Saucey)
Table 4: Confusion Matrix for Classification Tree
Reference
Prediction 0 1 2 3 4 5
0 1137 52 1 0 0 0
1 77 416 116 0 0 0
2 0 45 315 18 0 1
3 0 0 8 42 7 1
4 0 0 0 0 0 0
5 0 0 0 0 0 0
Table 5: Performance measures from Confusion Matrix for Classification Tree
Class Sensitivity Specificity
0 0.9366 0.9481
1 0.8109 0.8880
2 0.7159 0.9644
3 0.70000 0.99265
4 0.000000 1.000000
5 0.000000 1.000000
Exhibit 7: Pruned Regression Tree (CART) (Arrows show flow specific to Saucey)
Table 6: Clusters with characteristics
Cluster  Description                                                Rating  Reviews       Size    Price
1        Average in all measures                                    4.18    228,512       66.39   0.17
2        Average-rated paid apps; few reviews, smallest size,
         high prices                                                4.26    7,547         47.03   12.60
3        Relatively high-rated free apps; many reviews, large size  4.48    24,462,082    147.82  0.00
4        Highest-rated free apps; most reviews, large size          4.60    44,889,695    181.00  0.00
5        Low-rated paid apps; highest prices, fewest reviews        4.09    886           57.60   30.06
Table 7: Confusion Matrix for kNN
Reference
Prediction 0 1 2 3 4 5
0 1063 278 67 1 0 0
1 138 203 89 0 0 0
2 13 32 276 14 0 0
3 0 0 8 42 0 0
4 0 0 0 3 7 0
5 0 0 0 0 0 2
Table 8: Performance Measures from Confusion Matrix for kNN
Class 0 Class 1 Class 2 Class 3 Class 4 Class 5
Sensitivity 0.8756 0.39571 0.6273 0.7000 1.0000 1.0000
Specificity 0.6614 0.86825 0.9671 0.99632 0.998654 1.0000
Image 1: Saucey app on Google Play Store
R - Code:
setwd("C:/Users/rmehta10/Desktop/QTM 3/Final Project/Models/Data")
playstore <- read.csv("googleplaystore.csv")
if (!require("caret")) {
install.packages("caret")
library("caret")
}
if (!require("FNN")) {
install.packages("FNN")
library("FNN")
}
if (!require("DMwR")) {
install.packages("DMwR")
library("DMwR")
}
if (!require("fastDummies")) {
install.packages("fastDummies")
library("fastDummies")
}
if (!require("rpart")) {
install.packages("rpart")
library("rpart")
}
if (!require("rpart.plot")) {
install.packages("rpart.plot")
library("rpart.plot")
}
if (!require("Metrics")) {
install.packages("Metrics")
library("Metrics")
}
if (!require("pROC")) {
install.packages("pROC")
library("pROC")
}
if (!require("e1071")) {
install.packages("e1071")
library("e1071")
}
if (!require("gains")) {
install.packages("gains")
library("gains")
}
playstore$Size <- as.numeric(playstore$Size)
playstore$Install.Buckets <- as.factor(playstore$Install.Buckets)
playstore$Reviews <- as.numeric(playstore$Reviews)
str(playstore)
apply(playstore,2,anyNA)
summary(playstore)
which(is.na(playstore$App))
which(is.na(playstore$Category))
which(is.na(playstore$Rating))
which(is.na(playstore$Reviews))
which(is.na(playstore$Size))
which(is.na(playstore$Installs))
which(is.na(playstore$Type))
which(is.na(playstore$Price))
which(is.na(playstore$Content))
which(is.na(playstore$Genres))
which(is.na(playstore$Last.Updated))
which(is.na(playstore$Current.Ver))
which(is.na(playstore$Android.Ver))
which(is.na(playstore$Install.Buckets))
playstorewom <- playstore[-which(is.na(playstore$Rating)),]
apply(playstorewom,2,anyNA)
trainSetSize <- floor(0.7 * nrow(playstorewom))
RNGkind(sample.kind = "Rejection")
set.seed(12345)
trainInd <- sample(seq_len(nrow(playstorewom)), size = trainSetSize)
playstoreTrainSet <- playstorewom[trainInd, ]
playstoreTestSet <- playstorewom[-trainInd, ]
#logistic regression
upTrain <- upSample(x = playstoreTrainSet[, -ncol(playstoreTrainSet)],
y = playstoreTrainSet$Type)
table(upTrain$Class)
logRegrModelType <- glm(Type.Cat ~ Rating + Reviews + Size + Install.Buckets,
data =upTrain,
family ="binomial")
summary(logRegrModelType)
newObs = data.frame(Rating = 4.3, Reviews = 758, Size = 9.4, Install.Buckets = "0")
predNewObsScore <- predict(logRegrModelType,newObs,type="response")
predNewObsScore
ifelse(predNewObsScore >= 0.7,"1","0")
#Score the logistic regression model on the test data set
predTestScores <- predict(logRegrModelType, type="response",
newdata=playstoreTestSet)
##Set cutoff value
cutoff <- 0.7
##Initially, set all predicted class assignments to 0
predTestClass <- rep(0, length(predTestScores))
##Then, replace with only those entries that are greater than the cutoff value
predTestClass[predTestScores >= cutoff] <- 1
##Output to file
dfToExport <- data.frame(playstoreTestSet,predTestScores,predTestClass)
write.csv(dfToExport, file = "../ROutput/predictedTypeCAT.csv")
#Create a confusion matrix
cmaxlog <- playstoreTestSet$Type.Cat
#Simply using tables
confMx <- table(predTestClass, cmaxlog)
confMx
#Use confusionMatrix from the caret package
confusionMatrix(as.factor(predTestClass), as.factor(cmaxlog), positive = "1")
#Create a lift chart
####################
#Simple lift chart; from scratch
dfForLiftChart <- data.frame(predTestScores, cmaxlog)
sortedData <- dfForLiftChart[order(-dfForLiftChart$predTestScores),]
cumulCases <- cumsum(sortedData[,2])
##Plot the lift chart
plot(cumulCases, xlab = "Number of Cases", ylab = "Number of 1s Identified by Algorithm So Far", type="l", col = "blue")
##Plot the 45 degree line
X <- c(0, length(predTestScores))
Y <- c(0, cumulCases[length(predTestScores)])
lines(X, Y, col = "red", type = "l", lty = 2)
#Lift chart using the "caret" library
li <-lift(relevel(as.factor(cmaxlog), ref="1") ~ predTestScores)
xyplot(li, plot = "gain")
#Decile Lift Chart using the "gains" library
gain <- gains(cmaxlog, predTestScores)
barplot(gain$mean.resp/mean(cmaxlog), names.arg = gain$depth, xlab = "Percentile",
ylab = "Mean Response", main = "Decile Lift Chart")
####################
#Create an ROC curve
simpleROC <- function(labels, scores){
labels <- labels[order(scores, decreasing=TRUE)]
data.frame(TPR=cumsum(labels)/sum(labels), FPR=cumsum(!labels)/sum(!labels),
labels)
}
rocData <- simpleROC(cmaxlog, predTestScores)
rocData
plot(rocData$FPR,rocData$TPR, xlab = "1 - Specificity", ylab = "Sensitivity")
#Use library pROC to plot ROC and
#calculate area under the curve (AUC)
pROCData <- pROC::roc(playstoreTestSet$Type.Cat,predTestScores)
plot(pROCData) # Gets a smoother version of the curve for calculating AUC; axis labels are a bit confusing
pROCData[9] # Prints the AUC (Area Under Curve)
#linear
regrModel <-lm(Price ~ Rating + Reviews + Size,
data = playstoreTrainSet)
summary(regrModel)
newObslinear <- data.frame(Rating = 4.3, Reviews = 758, Size = 9.4)
predict(regrModel, newObslinear)
predictedValues <- predict(regrModel, playstoreTestSet)
predictedValues
#Calculate RMSE error
##First, extract actual realized values for Price and store in actualValue
actualValues <- playstoreTestSet$Price
##Then, calculate the RMSE error between predicted and actual value for test data
rmse(actualValues, predictedValues)
#Classification Tree
Tree <- rpart(Install.Buckets ~ Price + Rating + Reviews + Size, upTrain) # grow a tree on the training set
prp(Tree,type=1,extra=1) # display the tree
#Prune the tree
prunedTree <- prune(Tree, cp=
Tree$cptable[which.min(Tree$cptable[,"xerror"]), "CP"])
#prunedTree <- prune(fullTree, cp=0.084)
printcp(prunedTree)
prp(prunedTree, extra=1)
prunedTree
#Fit the rules to new data (test set)
#Predictions must be made on the same test set they are compared against
predTestClass <- predict(prunedTree, newdata = playstoreTestSet, type="class")
confMx <- confusionMatrix(predTestClass, playstoreTestSet$Install.Buckets, positive = "5")
confMx
#Regression tree
RegTree <- rpart(Price ~ Rating + Reviews + Size, upTrain) # grow a tree on training.
prp(RegTree,type=1,extra=1) #display the tree
#Prune the tree
prunedTreeReg <- prune(RegTree, cp=
RegTree$cptable[which.min(RegTree$cptable[,"xerror"]), "CP"])
#prunedTree <- prune(fullTree, cp=0.084)
printcp(prunedTreeReg)
prp(prunedTreeReg, extra=1)
prunedTreeReg
#Fit the rules to new data (test set)
#Predict the prices of apps from test set
predTestPrice <- predict(RegTree, newdata = playstoreTestSet, type="vector")
#Calculate RMSE for test data
actualTestPrice <- playstoreTestSet$Price
RMSE(actualTestPrice, predTestPrice)
#Fit the rules to new data (test set)
#Predict the prices of apps in the test set using the pruned regression tree
predTestPricex <- predict(prunedTreeReg, newdata = playstoreTestSet, type="vector")
#Clustering
clusterdata <- playstorewom[c(-1,-2,-6,-7,-8,-9,-11,-12,-13,-14,-15)]
distances = dist(scale(clusterdata), method = "euclidean")
hcluster_resultA = hclust(distances, method = "average")
plot(hcluster_resultA, hang=-1, ann=FALSE)
cluster = cutree(hcluster_resultA,k=5)
clusterdata = cbind(clusterdata,cluster)
aggregate(clusterdata, by = list (cluster), mean)
#compute the cluster membership by "cutting the dendrogram" in to k=5 clusters
memb <- cutree(hcluster_resultA,k=5)
memb
#output data and cluster assignments to a file
myDataClusteredH <- data.frame(clusterdata,memb)
write.csv(myDataClusteredH, file = "../ROutput/clusteredInstallsH.csv")
#return the centroid of each cluster
aggregate(myDataClusteredH, by=list(memb), mean)
#display the full summary statistics for each cluster
#(describeBy() comes from the "psych" package)
library("psych")
describeBy(myDataClusteredH, myDataClusteredH$memb)
#Predict kNN classifications for each observation in the test set
#original arrays with data
playstoretrainSca <- playstoreTrainSet
playstoretestSca <- playstoreTestSet
playstoresca <- playstorewom
#use preProcess from the caret package to normalize the numerical variables
normValues <- preProcess(playstoreTrainSet[,3:5], method = c("center", "scale"))
playstoretrainSca[,3:5] <- predict(normValues, playstoreTrainSet[,3:5])
playstoretestSca[,3:5] <- predict(normValues, playstoreTestSet[,3:5])
playstoresca[,3:5] <- predict(normValues, playstorewom[,3:5])
knn3 <- knn(train = playstoretrainSca[,3:5], test = playstoretestSca[,3:5], cl =
playstoretrainSca[,7], k = 3)
newObs1 <- data.frame(Rating = 4.3, Reviews = 758, Size = 9.4)
newObs1Scaled <- newObs1
newObs1Scaled <- predict(normValues, newObs1)
predNewObsknn3 <- knn(train = playstoretrainSca[,3:5], test = newObs1Scaled, cl =
playstoretrainSca[,7], k = 3)
predNewObsknn3
#Compute confusion matrix for prediction model
################################################
actualknnTestClass <- playstoretestSca$Install.Buckets
#Calculate all relevant statistics for confusion matrix
#############################################
confMx3 <- confusionMatrix(knn3, actualknnTestClass,positive = "5")
confMx3
#calculate total accuracy
totAcc <- confMx3$overall[1]
totAcc
References
Saucey, Inc. "Saucey: Alcohol Delivery - Apps on Google Play." Google Play,
play.google.com/store/apps/details?id=com.saucey&hl=en_US.
"Google Play Store Apps." Kaggle: Your Home for Data Science,
www.kaggle.com/lava18/google-play-store-apps.