Babson College
Final Report
Google Play Store Applications
Aman Parikh (Section 1)
Raj Mehta (Section 3)
Rishi Mehta (Section 3)
“I pledge my honor that I have neither received nor provided unauthorized assistance during the completion of this work.”
QTM2000
Professor Dessislava Pachamanova
December 13, 2019
EXECUTIVE SUMMARY
The problem we address in this project is optimizing app performance and recognition on the Google Play Store. Our clients/target audience are companies that either plan to launch a new mobile application or wish to improve an existing application on the Google Play Store. Based on our insights, an app’s performance and recognition can be measured by the number of installs it receives through the Play Store. This metric is correlated with several factors: whether the app is paid or free, price (if paid), rating, size, reviews, genre, and category. Clients that already have apps on the Play Store will have existing information for each of these variables. Companies planning to launch a new app are encouraged to beta/market test their app for at least six months to gather information about these variables before they seek our consultation.
Our research and analysis yielded several key lessons about how a company must study the overall market and its competitors before developing and publishing an application. Before development, a company must also consider factors such as the category the app falls within, its genre, target user needs, and size. Using the models we created, trained on an extensive dataset of Play Store apps, we provide recommendations to the company Saucey to help it optimize its app’s performance on the platform.
Saucey is a premium alcohol delivery business that receives orders through an app, which customers can currently download for free from the Google Play Store. To put our models into practice in a real-world context, we collected data on all the variables outlined above from Saucey’s application page. We combined our models with qualitative research to create a holistic set of recommendations. We started by predicting whether the app should remain free or be monetized, and, if monetized, what an optimal price would be. Using that price information, we then predicted the app’s potential number of future installs. Finally, building on these insights, we classified the app with our clustering technique to see the general qualities of the cluster it best aligns with. Our final recommendation is that Saucey should remain a free app given its nature and genre, and that it should strive to maximize its ratings and positive reviews by encouraging satisfied users to rate and review the app. This can be achieved by incentivizing users with redeemable digital tokens, notification reminders, or small discounts (also a revenue driver). Our models and recommendations can be generalized for use by other companies as well, given the general trends in the data (see the visualizations) and the strong correlations between some variables.
ANALYSIS
Data Preprocessing
● Cleaned the data in Excel to make it consistent. For app installs, some values were in thousands (suffixed with ‘k’) and some in millions (suffixed with ‘M’). We converted every ‘k’ observation to millions by removing the suffix and dividing the value by 1,000 using Excel functions.
● Installs was a categorical variable with a large number of categories, and its values contained a ‘+’. We therefore created buckets for app installs to make the variable easier to understand and work with. The buckets we created can be seen in Table 1.
● We removed outliers for each variable by using filters to identify the range of values and then deleting the extreme observations manually.
● We upsampled the data for the logistic model to obtain a balanced training set, since there were many more free apps than paid apps.
● We eliminated missing values in R, using the anyNA function to identify variables with missing values and then removing those rows manually. Imputation by mean, median, mode, or other methods would not work here, because most of the missing values were in Rating, which cannot sensibly be interpolated from the other variables.
● We converted variables: Size from categorical to numerical (eliminating the ‘M’), Install.Buckets from integer to categorical, and Reviews from integer to numerical.
Logistic Regression
To predict whether an application should be free or paid, we created a logistic regression model with ‘Type.Cat’ as the target variable and Rating, Reviews, Size, and Install.Buckets as predictors. This kind of model can help companies that create many applications, such as gaming companies deciding whether a new app should be paid on the Play Store.
Table 2 and Table 3 show the confusion matrix and performance measures. The model’s accuracy is 82.29%, which seems adequately high given that we used only four predictor variables. However, a closer look at the test set composition reveals that only 6.84% of observations were Class 1 (paid apps) while 93.16% were Class 0 (free apps); our dataset is drawn from real-life data in which the majority of apps are free. The sensitivity of the model is 28.11%, where our positive class ‘1’ refers to paid apps because a positive classification signals a chance to monetize the app if its characteristics meet the model criteria. In light of the sensitivity and the test set composition, despite the fairly high accuracy, the model is in fact low performing.
The low sensitivity is concerning because of the costs associated with misclassification. If an app that could suitably be monetized based on its ratings, reviews, and size is classified as “0” instead of “1”, the company loses the opportunity to earn significant revenue from installs. On the flip side, if an app that should be free based on its predictors is classified as “1” (paid), it could lose out on the higher recognition (number of installs) and ratings that free apps generally enjoy. The specificity of the model is 86.27%: of all observations in class 0, the model predicted 86.27% accurately as class 0, which is fairly high. To address the class imbalance, we upsampled the training data.
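The accuracy, sensitivity, and specificity quoted above follow directly from the counts in the confusion matrix (Table 2); the arithmetic, sketched in Python for illustration:

```python
# Counts from Table 2 (positive class = 1, paid apps)
tn, fn = 1797, 110  # predicted 0: actual 0 / actual 1
fp, tp = 286, 43    # predicted 1: actual 0 / actual 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # share of paid apps caught
specificity = tn / (tn + fp)   # share of free apps caught

print(round(accuracy, 4), round(sensitivity, 4), round(specificity, 4))
# 0.8229 0.281 0.8627
```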
Exhibit 4 shows the ROC curve, which helped us determine the optimal cutoff value. Based on the curve, we used a cutoff of 0.7; a cutoff of 0.5 led to lower accuracy.
Interpretation of the % lift chart (Exhibit 1): approximately 42% of all ‘1’s were found after examining the 20% of observations ranked highest by the algorithm. On this basis, the model is relatively low performing.
Interpretation of the simple lift chart (Exhibit 2): approximately 70 of the 153 ‘1’s in the data set were found after examining the 20% of observations with the highest predicted probability. Fewer than half the 1s were found, again indicating a relatively low-performing model.
Interpretation of the decile chart (Exhibit 3): in the most probable decile, the model is 1.8 times as likely to identify the positive class as a random selection of 10% of the observations. Our positive class “1” is ‘paid’ because it indicates that an app is suited to be paid, based on its predictor values, and can be monetized on the Play Store.
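The quantity read off the lift charts, namely the share of all 1s captured in the top-scored slice of observations, can be computed as follows (an illustrative Python sketch with made-up scores; `gain_at` is a hypothetical helper, not part of our R workflow):

```python
def gain_at(scores, labels, frac):
    """Fraction of all 1s found in the top `frac` of observations,
    ranked by predicted score (descending)."""
    ranked = [label for _, label in
              sorted(zip(scores, labels), key=lambda p: -p[0])]
    top = ranked[: max(1, int(len(ranked) * frac))]
    return sum(top) / sum(labels)

scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
print(gain_at(scores, labels, 0.2))  # top 20% holds 2 of the 4 ones -> 0.5
```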
AUC (Exhibit 5): 0.7424, which indicates fairly good predictive performance given that we had only four usable predictor variables and that the ideal AUC is 1.
Using Saucey’s publicly available app information (number of reviews, rating, size, and installs, as seen in Image 1), we scored the app with our logistic model; it was classified as Class 0 and therefore should remain free.
Linear Regression
If the logistic regression outcome (or the company’s own preference) suggests that an app launching on the Google Play Store should be paid, our linear regression model can predict a launch price based on our extensive dataset, using Rating, Reviews, and Size as predictors. The model gives a reasonable price estimate for a new app based on apps in the dataset with similar values for these predictors.
The RMSE of this linear model is $1.948374, the root mean squared error of its price predictions on new observations. This error is relatively high, but lower than that of the regression tree (RMSE = $2.916424), which also predicts app price.
If Saucey chooses to make its application paid, our linear regression model, using its rating, number of reviews, and size (as seen in Image 1), predicts a price of about $0.42; allowing for the RMSE of $1.95, a reasonable price range is $0.42 to $2.37.
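The RMSE figures quoted for the linear model and the regression tree are computed in the standard way; a small self-contained sketch (Python, toy numbers, not our actual predictions):

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between actual and predicted prices."""
    return math.sqrt(sum((a - p) ** 2
                         for a, p in zip(actual, predicted)) / len(actual))

# Toy check: every prediction off by exactly $2 gives an RMSE of $2
print(rmse([0.0, 1.0, 3.0], [2.0, 3.0, 5.0]))  # 2.0
```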
Classification Tree
The classification tree we created predicts our target variable, the install bucket an application will fall into, from the predictors Price, Rating, Reviews, and Size. Product managers can use this tree to propose changes to an application in order to yield more installs and eventually improve the business. Table 4 and Table 5 show the confusion matrix and performance measures for this model. The primary performance measure is total accuracy = 0.8542: across all observations, the model predicted 85.42% of the values correctly. The tree itself appears in Exhibit 6; given an application’s data, one can classify its install bucket by walking down the tree.
Using Saucey’s data (as seen in Image 1), the classification tree in Exhibit 6 places Saucey in install bucket 0, which matches the real-life data and is consistent with the tree being a fairly accurate model. Accuracy would likely have been higher if we could have included the dataset’s categorical variables, but they could not be converted to dummy variables because they have too many categories. As future work, the price information for paid apps could be used to classify the install bucket an app would fall into, to see the effect of its pricing decision.
Regression Tree
The regression tree we created predicts the price of an app from its rating, number of reviews, and size. The RMSE we calculated for the regression tree is $2.916424, a relatively high error given that the majority of apps are free or low priced. This error stems from the limited number of usable numerical predictors in the model; with a dataset containing more usable predictors, its predictive performance in terms of RMSE would improve. The tree can be seen in Exhibit 7; one can input an app’s values for the selected variables and walk down the tree to predict its price.
Using Saucey’s data (as seen in Image 1), the regression tree in Exhibit 7 correctly predicts the price of Saucey’s app as $0 (Saucey is a free app and plans to remain one for the foreseeable future).
Clustering
For unsupervised learning, we used hierarchical clustering and, given the large volume of data, created five clusters. Table 6 summarizes each cluster through the labels in its first column. Cluster 1 contains apps with relatively low ratings compared to the other clusters, an average number of reviews, average size, and a paid but relatively cheap price; this suggests that paid apps tend to have few reviews, perhaps because of relatively fewer installs on average. Cluster 2 contains paid apps with average ratings, a low number of reviews, the smallest app sizes, and fairly high prices. Cluster 3 contains free apps with relatively high ratings, a large number of reviews, and large size. Cluster 4 contains free apps with extremely high ratings, an exceedingly high number of reviews, and very large size. Lastly, cluster 5 contains expensive paid apps with the lowest ratings and the fewest reviews of any cluster, and small size. These conclusions may be skewed because we have relatively few observations for paid apps compared to free apps.
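The per-cluster centroids reported in Table 6 are simply the means of each variable within each cluster (R’s `aggregate(..., mean)`); the aggregation step can be sketched as follows (Python, toy data, hypothetical helper name):

```python
from collections import defaultdict

def cluster_means(rows, assignments):
    """Mean of each feature per cluster: the centroids shown in Table 6."""
    groups = defaultdict(list)
    for row, cluster in zip(rows, assignments):
        groups[cluster].append(row)
    return {cluster: [sum(col) / len(col) for col in zip(*members)]
            for cluster, members in sorted(groups.items())}

rows = [[4.2, 100.0], [4.4, 200.0], [4.0, 50.0]]  # [rating, reviews]
print(cluster_means(rows, [1, 1, 2]))
```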
kNN
Total accuracy for k = 3: 71.24%
Total accuracy for k = 5: 69.59%
Total accuracy for k = 7: 68.74%
Of these kNN models, we chose k = 3 because it has the highest accuracy. The sensitivity and specificity for the individual classes can be seen in Table 8, and the confusion matrix in Table 7.
We created the kNN model to classify an app into an install bucket based on its rating, number of reviews, and size, using the buckets of its three nearest neighbor apps. We tested the model with Saucey’s information, and it correctly classified the app into install bucket 0, consistent with Saucey’s real-world figure of fewer than 500,000 installs (as seen in Image 1). From a managerial standpoint, we can infer that for Saucey to be classified into a higher install bucket, i.e., to achieve more installs (and therefore higher sales through the app), it should pursue higher ratings and more positive reviews (considered more trustworthy). Note the model’s accuracy of 71.24% when relying on its predictions; this is moderately high given that we had over 1,500 observations in the (scaled) test set.
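Our choice of k simply maximizes test-set accuracy across the candidate values; the selection step, sketched in Python with the accuracy values reported above:

```python
def best_k(accuracy_by_k):
    """Return the k whose kNN model scored the highest test accuracy."""
    return max(accuracy_by_k, key=accuracy_by_k.get)

knn_accuracy = {3: 0.7124, 5: 0.6959, 7: 0.6874}
print(best_k(knn_accuracy))  # 3
```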
RECOMMENDATIONS
Using the aforementioned models, app developers and managers of companies planning to launch a new app can arrive at an optimal price or price range (if paid), predict the install bucket the app would fall into, and anticipate the general characteristics it is likely to share with the cluster it aligns best with.
Similarly, companies such as Saucey that have applications on the Google Play Store can
use such models to improve the performance of their apps on the platform.
● Visualization 1 shows that Entertainment, Game, Photography, and Education are the most popular categories by average number of installs, with predominantly free apps. We therefore recommend that developers who are flexible about genre build apps in these categories to maximize recognition (installs), while noting that these segments are highly competitive and mostly free.
● As seen in Visualization 2, apps with higher average ratings generally receive a greater number of installs, so companies should strive to maximize their app’s rating on the Google Play Store.
● As seen in Visualization 3, apps with a higher number of reviews on average received a higher number of installs. We therefore recommend that companies incentivize their users with redeemable digital tokens/points, discounts, and/or push notifications in return for providing positive reviews on the Google Play Store.
● Visualization 4 shows average install bucket versus rating for the Food and Beverage category, to which Saucey belongs. Based on this, we recommend that Saucey encourage users who are satisfied with the service to rate its app highly.
Appendix
Table 1: The various categories of number of installs
Bucket Number of Installs
0 0 - 500,000
1 500,000 - 5,000,000
2 5,000,000 - 100,000,000
3 100,000,000 - 500,000,000
4 500,000,000 - 1,000,000,000
5 1,000,000,000 - Infinity
Tableau:
Visualization 1 - Average number of Installs by category
Visualization 2 - Average number of Installs by Rating
Visualization 3 - Relation between number of app installs and number of reviews
Visualization 4 - Food and Beverage specific Average Install Buckets by Rating
Exhibit 1: Logistic Regression Lift Chart
Exhibit 2: Logistic Regression Lift Chart with 45 Degree Line
Exhibit 3: Logistic Regression Decile Lift chart
Exhibit 4: Logistic Regression ROC Curve
Exhibit 5: Logistic Regression ROC Curve with Area Under the Curve (AUC)
Table 2: Confusion Matrix for Logistic Regression
                       Actual Test Class (cmaxlog)
Predicted Test Class        0       1
                   0     1797     110
                   1      286      43
Table 3: Performance measures from Confusion Matrix for Logistic Regression
Accuracy 0.8229
Sensitivity 0.28105
Specificity 0.86270
Exhibit 6: Pruned Classification Tree (CART) (Arrows show flow specific to Saucey)
Table 4: Confusion Matrix for Classification Tree
Reference
Prediction 0 1 2 3 4 5
0 1137 52 1 0 0 0
1 77 416 116 0 0 0
2 0 45 315 18 0 1
3 0 0 8 42 7 1
4 0 0 0 0 0 0
5 0 0 0 0 0 0
Table 5: Performance measures from Confusion Matrix for Classification Tree
Class Sensitivity Specificity
0 0.9366 0.9481
1 0.8109 0.8880
2 0.7159 0.9644
3 0.70000 0.99265
4 0.000000 1.000000
5 0.000000 1.000000
Exhibit 7: Pruned Regression Tree (CART) (Arrows show flow specific to Saucey)
Table 6: Clusters with characteristics
Cluster  Description                                                Rating  Reviews       Size    Price
1        Average in all measures                                    4.18    228,512       66.39   0.17
2        Average-rated paid apps; few reviews, smallest size,
         high prices                                                4.26    7,547         47.03   12.60
3        Relatively high-rated free apps; many reviews, large size  4.48    24,462,082    147.82  0.00
4        Highest-rated free apps; most reviews, large size          4.60    44,889,695    181.00  0.00
5        Low-rated paid apps; highest prices, fewest reviews        4.09    886           57.60   30.06
Table 7: Confusion Matrix for kNN
Reference
Prediction 0 1 2 3 4 5
0 1063 278 67 1 0 0
1 138 203 89 0 0 0
2 13 32 276 14 0 0
3 0 0 8 42 0 0
4 0 0 0 3 7 0
5 0 0 0 0 0 2
Table 8: Performance Measures from Confusion Matrix for kNN
Class 0 Class 1 Class 2 Class 3 Class 4 Class 5
Sensitivity 0.8756 0.39571 0.6273 0.7000 1.0000 1.0000
Specificity 0.6614 0.86825 0.9671 0.99632 0.998654 1.0000
Image 1: Saucey app on Google Play Store
R - Code:
setwd("C:/Users/rmehta10/Desktop/QTM 3/Final Project/Models/Data")
playstore <- read.csv("googleplaystore.csv")
if (!require("caret")) {
install.packages("caret")
library("caret")
}
if (!require("FNN")) {
install.packages("FNN")
library("FNN")
}
if (!require("DMwR")) {
install.packages("DMwR")
library("DMwR")
}
if (!require("fastDummies")) {
install.packages("fastDummies")
library("fastDummies")
}
if (!require("rpart")) {
install.packages("rpart")
library("rpart")
}
if (!require("rpart.plot")) {
install.packages("rpart.plot")
library("rpart.plot")
}
if (!require("Metrics")) {
install.packages("Metrics")
library("Metrics")
}
if (!require("pROC")) {
install.packages("pROC")
library("pROC")
}
if (!require("e1071")) {
install.packages("e1071")
library("e1071")
}
if (!require("gains")) {
install.packages("gains")
library("gains")
}
playstore$Size <- as.numeric(playstore$Size)
playstore$Install.Buckets <- as.factor(playstore$Install.Buckets)
playstore$Reviews <- as.numeric(playstore$Reviews)
str(playstore)
apply(playstore,2,anyNA)
summary(playstore)
which(is.na(playstore$App))
which(is.na(playstore$Category))
which(is.na(playstore$Rating))
which(is.na(playstore$Reviews))
which(is.na(playstore$Size))
which(is.na(playstore$Installs))
which(is.na(playstore$Type))
which(is.na(playstore$Price))
which(is.na(playstore$Content))
which(is.na(playstore$Genres))
which(is.na(playstore$Last.Updated))
which(is.na(playstore$Current.Ver))
which(is.na(playstore$Android.Ver))
which(is.na(playstore$Install.Buckets))
playstorewom <- playstore[-which(is.na(playstore$Rating)),]
apply(playstorewom,2,anyNA)
trainSetSize <- floor(0.7 * nrow(playstorewom))
RNGkind(sample.kind = "Rejection")
set.seed(12345)
trainInd <- sample(seq_len(nrow(playstorewom)), size = trainSetSize)
playstoreTrainSet <- playstorewom[trainInd, ]
playstoreTestSet <- playstorewom[-trainInd, ]
#logistic regression
upTrain <- upSample(x = playstoreTrainSet[, -ncol(playstoreTrainSet)],
y = playstoreTrainSet$Type)
table(upTrain$Class)
logRegrModelType <- glm(Type.Cat ~ Rating + Reviews + Size + Install.Buckets,
data =upTrain,
family ="binomial")
summary(logRegrModelType)
newObs = data.frame(Rating = 4.3, Reviews = 758, Size = 9.4, Install.Buckets = "0")
predNewObsScore <- predict(logRegrModelType,newObs,type="response")
predNewObsScore
ifelse(predNewObsScore >= 0.7,"1","0")
#Score the logistic regression model on the test data set
predTestScores <- predict(logRegrModelType, type="response",
newdata=playstoreTestSet)
##Set cutoff value
cutoff <- 0.7
##Initially, set all predicted class assignments to 0
predTestClass <- rep(0, length(predTestScores))
##Then, replace with only those entries that are greater than the cutoff value
predTestClass[predTestScores >= cutoff] <- 1
##Output to file
dfToExport <- data.frame(playstoreTestSet,predTestScores,predTestClass)
write.csv(dfToExport, file = "../ROutput/predictedTypeCAT.csv")
#Create a confusion matrix
cmaxlog <- playstoreTestSet$Type.Cat
#Simply using tables
confMx <- table(predTestClass, cmaxlog)
confMx
#Use confusionMatrix from the caret package
confusionMatrix(as.factor(predTestClass), as.factor(cmaxlog), positive = "1")
#Create a lift chart
####################
#Simple lift chart; from scratch
dfForLiftChart <- data.frame(predTestScores, cmaxlog)
sortedData <- dfForLiftChart[order(-dfForLiftChart$predTestScores),]
cumulCases <- cumsum(sortedData[,2])
##Plot the lift chart
plot(cumulCases, xlab = "Number of Cases", ylab = "Number of 1s Identified by Algorithm So Far", type="l", col = "blue")
##Plot the 45 degree line
X <- c(0, length(predTestScores))
Y <- c(0, cumulCases[length(predTestScores)])
lines(X, Y, col = "red", type = "l", lty = 2)
#Lift chart using the "caret" library
li <-lift(relevel(as.factor(cmaxlog), ref="1") ~ predTestScores)
xyplot(li, plot = "gain")
#Decile Lift Chart using the "gains" library
gain <- gains(cmaxlog, predTestScores)
barplot(gain$mean.resp/mean(cmaxlog), names.arg = gain$depth, xlab = "Percentile",
ylab = "Mean Response", main = "Decile Lift Chart")
####################
#Create an ROC curve
simpleROC <- function(labels, scores){
labels <- labels[order(scores, decreasing=TRUE)]
data.frame(TPR=cumsum(labels)/sum(labels), FPR=cumsum(!labels)/sum(!labels),
labels)
}
rocData <- simpleROC(cmaxlog, predTestScores)
rocData
plot(rocData$FPR,rocData$TPR, xlab = "1 - Specificity", ylab = "Sensitivity")
#Use library pROC to plot ROC and
#calculate area under the curve (AUC)
pROCData <- pROC::roc(playstoreTestSet$Type.Cat,predTestScores)
plot(pROCData) # Gets a smoother version of the curve for calculating AUC; axis labels are a bit confusing
pROCData[9] # Prints the AUC (Area Under Curve)
#linear
regrModel <-lm(Price ~ Rating + Reviews + Size,
data = playstoreTrainSet)
summary(regrModel)
newObslinear <- data.frame(Rating = 4.3, Reviews = 758, Size = 9.4)
predict(regrModel, newObslinear)
predictedValues <- predict(regrModel, playstoreTestSet)
predictedValues
#Calculate RMSE error
##First, extract actual realized values for Price and store in actualValue
actualValues <- playstoreTestSet$Price
##Then, calculate the RMSE error between predicted and actual value for test data
rmse(actualValues, predictedValues)
#Classification Tree
Tree <- rpart(Install.Buckets ~ Price + Rating + Reviews + Size, upTrain) # grow a tree on the training set
prp(Tree,type=1,extra=1) # display the tree
#Prune the tree
prunedTree <- prune(Tree, cp=
Tree$cptable[which.min(Tree$cptable[,"xerror"]), "CP"])
#prunedTree <- prune(fullTree, cp=0.084)
printcp(prunedTree)
prp(prunedTree, extra=1)
prunedTree
#Fit the rules to new data (test set)
#Predictions must be made on the same test set they are compared against
predTestClass <- predict(prunedTree, newdata = playstoreTestSet, type="class")
confMx <- confusionMatrix(predTestClass, playstoreTestSet$Install.Buckets, positive = "5")
confMx
#Regression tree
RegTree <- rpart(Price ~ Rating + Reviews + Size, upTrain) # grow a tree on training.
prp(RegTree,type=1,extra=1) #display the tree
#Prune the tree
prunedTreeReg <- prune(RegTree, cp=
RegTree$cptable[which.min(RegTree$cptable[,"xerror"]), "CP"])
#prunedTree <- prune(fullTree, cp=0.084)
printcp(prunedTreeReg)
prp(prunedTreeReg, extra=1)
prunedTreeReg
#Fit the rules to new data (test set)
#Predict the prices of apps from test set
predTestPrice <- predict(RegTree, newdata = playstoreTestSet, type="vector")
#Calculate RMSE for test data
actualTestPrice <- playstoreTestSet$Price
RMSE(actualTestPrice, predTestPrice)
#Fit the rules to new data (test set)
#Predict the prices of apps in the test set using the pruned regression tree
predTestPricex <- predict(prunedTreeReg, newdata = playstoreTestSet, type="vector")
#Clustering
clusterdata <- playstorewom[c(-1,-2,-6,-7,-8,-9,-11,-12,-13,-14,-15)]
distances = dist(scale(clusterdata), method = "euclidean")
hcluster_resultA = hclust(distances, method = "average")
plot(hcluster_resultA, hang=-1, ann=FALSE)
cluster = cutree(hcluster_resultA,k=5)
clusterdata = cbind(clusterdata,cluster)
aggregate(clusterdata, by = list (cluster), mean)
#compute the cluster membership by "cutting the dendrogram" in to k=5 clusters
memb <- cutree(hcluster_resultA,k=5)
memb
#output data and cluster assignments to a file
myDataClusteredH <- data.frame(clusterdata,memb)
write.csv(myDataClusteredH, file = "../ROutput/clusteredInstallsH.csv")
#return the centroid of each cluster
aggregate(myDataClusteredH, by=list(memb), mean)
#display the full summary statistics for each cluster
#(describeBy() comes from the "psych" package)
library("psych")
describeBy(myDataClusteredH, myDataClusteredH$memb)
#Predict kNN classifications for each observation in the test set
#original arrays with data
playstoretrainSca <- playstoreTrainSet
playstoretestSca <- playstoreTestSet
playstoresca <- playstorewom
#use preProcess from the caret package to normalize the numerical variables
normValues <- preProcess(playstoreTrainSet[,3:5], method = c("center", "scale"))
playstoretrainSca[,3:5] <- predict(normValues, playstoreTrainSet[,3:5])
playstoretestSca[,3:5] <- predict(normValues, playstoreTestSet[,3:5])
playstoresca[,3:5] <- predict(normValues, playstorewom[,3:5])
knn3 <- knn(train = playstoretrainSca[,3:5], test = playstoretestSca[,3:5], cl =
playstoretrainSca[,7], k = 3)
newObs1 <- data.frame(Rating = 4.3, Reviews = 758, Size = 9.4)
newObs1Scaled <- newObs1
newObs1Scaled <- predict(normValues, newObs1)
predNewObsknn3 <- knn(train = playstoretrainSca[,3:5], test = newObs1Scaled, cl =
playstoretrainSca[,7], k = 3)
predNewObsknn3
#Compute confusion matrix for prediction model
################################################
actualknnTestClass <- playstoretestSca$Install.Buckets
#Calculate all relevant statistics for confusion matrix
#############################################
confMx3 <- confusionMatrix(knn3, actualknnTestClass,positive = "5")
confMx3
#calculate total accuracy
totAcc <- confMx3$overall[1]
totAcc
References
Saucey, Inc. "Saucey: Alcohol Delivery - Apps on Google Play." Google Play,
play.google.com/store/apps/details?id=com.saucey&hl=en_US.
"Google Play Store Apps." Kaggle: Your Home for Data Science,
www.kaggle.com/lava18/google-play-store-apps.