Google Play Store Apps-Data Analysis and Ratings Prediction
© 2020, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal | Page 265
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 07 Issue: 12 | Dec 2020 www.irjet.net p-ISSN: 2395-0072
The dataset contains web-scraped information on about 10,000 Play Store applications and is used to analyze the Android market. It is a downloaded dataset that a user can use to examine the Android market across different categories such as music and camera. With it, a user can predict whether a given application will receive a lower or higher rating. The dataset can also serve as a future reference when proposing a new application. An offline dataset was chosen so that the estimates are exact, since online data is refreshed most of the time. Using this dataset, I will examine attributes such as rating and free or paid type using Hive, and then predict attributes such as user reviews and rating.

1.3 Data Mining
Data mining is the process of searching through a data set to find correlations, anomalies, and patterns that may be useful. In other words, it means taking a large dataset filled with scattered information and trying to make sense of it by finding meaningful structure.

1.4 Python
Most data scientists use Python because of its good built-in library functions and its active community. Python now has 70,000 libraries. Python is among the easiest programming languages to pick up compared to other languages. That is the main reason data scientists use Python more often: for machine learning and data processing, data analysts want a language that is straightforward to use. Specifically, for data scientists the most popular open-source data library is pandas. As we saw in our previous assignment, when we need to plot scatter plots, heat maps, graphs, or 3-dimensional data, Python's libraries are very helpful.

1.5 Machine Learning
Machine learning is an application of artificial intelligence (AI) that gives systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.

1.5.1 Supervised Learning
Supervised learning is learning in which we train a machine on our dataset or input. From that point on, the machine is given a new set of examples (data), and the supervised learning algorithm analyzes the provided data (the set of training examples) and produces a correct result for the given input.

1.5.2 Unsupervised Learning
In unsupervised learning we do not train the machine on the available data or input. There is no supervisor acting as a teacher in this kind of learning. Instead, we let the algorithm work on its own, without any training or guidance. The machine works on definite patterns and similarities in the given dataset, so its task is limited to finding the structure hidden in that dataset.

1.5.3 Semi-supervised Learning
This type of learning lies between the two learning methods above.

1.6 Neural Networks
Neural networks can be divided into different forms, such as artificial neural networks, deep neural networks, recurrent neural networks, and convolutional deep neural networks. Each form has its own importance and its own features. A neural network has an input layer, a number of hidden layers, and an output layer.
Fig -1: Neural Network

1.6.1 Deep Neural Network
A deep neural network is defined as a neural network with a certain level of complexity, namely a neural network that contains more than two layers. A deep neural network uses mathematical modeling to process data in complex ways.
A neural network, in general, is a technology built to simulate the activity of the human brain, specifically pattern recognition and the passage of input through various layers of simulated neural connections.
In a DNN, data flows forward, from the input layer to the output layer, without looping back. At first, the DNN creates a map of virtual neurons and assigns random numerical values, or "weights", to the connections between them. The weights and inputs are multiplied and return an output between 0 and 1. If the
system does not correctly recognize a particular pattern, an algorithm adjusts the weights. That way the algorithm can make certain parameters progressively

- Kumari and other researchers [8,9,10,5] used the Naïve Bayes (NB) classifier to classify opinions as positive, negative, or neutral.
- Wang and others [11] argued that a rating is not entirely determined by the review content. For example, a user may well intend to give a positive review by employing positive words, and yet issue a comparatively lower rating.
- Dave and others [12] proposed a method for extracting the polarity in user reviews of products, expressed as poor, mixed, or good. The classifier used was Naïve Bayes (NB).
- According to Pang et al. [13], although machine learning approaches perform far better for traditional topic-based categorization, they are less successful for sentiment analysis.
- Information-extraction technologies have also been explored to identify and organize opinions contained in text. For example, some authors [14] proposed a scheme for annotating a low-level representation of opinions within a text. Additionally, they described an opinion-oriented "scenario template" that summarizes the opinions expressed in a document. This approach is helpful for tasks that involve posing questions from multiple perspectives.
- App ratings have been predicted based on the features provided for the app [18,19]. Experiments were performed on the BlackBerry World and Samsung Android stores to collect the raw features provided for the apps, including their price, rank of downloads, ratings, and textual descriptions. The features were then encoded into a numerical vector to be used in case-based reasoning and to predict the app rating.
- In contrast to the above-cited studies, other authors [20] investigated the nature of sentiments expressed in Google app reviews. Their study measured opinions and sentiments represented in user reviews through a variety of emojis expressing, for example, negativity, positivity, anger, or excitement. It evaluated whether those sentiments are informative for the purpose of app development and refinement.
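To make the Naïve Bayes approach mentioned in the studies above more concrete, the following is a minimal from-scratch sketch of a multinomial Naïve Bayes classifier with Laplace smoothing. It is not the implementation used in any of the cited works; the function names and the toy reviews are hypothetical and serve only as an illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(samples):
    """Count documents per class and words per class from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def predict_nb(model, text):
    """Return the label with the highest log-posterior for the given text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen during training
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy reviews, invented for illustration only.
reviews = [
    ("great app love it", "positive"),
    ("works great very useful", "positive"),
    ("terrible app crashes a lot", "negative"),
    ("useless and slow crashes", "negative"),
    ("it is an app", "neutral"),
    ("it opens and closes", "neutral"),
]
model = train_nb(reviews)
print(predict_nb(model, "love this great app"))
print(predict_nb(model, "slow and crashes"))
```

With real review data, the same idea would normally be applied after tokenization and stop-word removal, and on far more than six examples.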
Here we can see that 92.6% of the apps on the Google Play Store are free and 7.38% are paid, so most of the apps on the Google Play Store are free.
3.2 Updated Apps
In the plot below, we show the apps updated or added over the years, comparing free and paid apps. From this plot we can conclude that before 2011 there were no paid apps, and that over the years more free apps than paid apps have been added. Comparing the apps updated or added in 2011 and in 2018, the share of free apps increases from 80% to 96% and the share of paid apps falls from 20% to 4%. So we can conclude that most people prefer free apps.
Fig -5: Updated Free Apps
3.4 Updated Paid Apps
As with free apps, most of the paid apps are updated in the month of July.
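The free/paid shares and the update counts per year or month discussed in this section can be computed along the following lines. This is only a sketch: the column names `Type` and `Last Updated`, the date format, and the sample rows are assumptions made for illustration, not values taken from the dataset.

```python
from collections import Counter
from datetime import datetime

# Hypothetical rows; the real data would be read from the downloaded dataset.
rows = [
    {"Type": "Free", "Last Updated": "January 7, 2018"},
    {"Type": "Free", "Last Updated": "July 20, 2018"},
    {"Type": "Paid", "Last Updated": "July 3, 2018"},
    {"Type": "Free", "Last Updated": "March 15, 2011"},
]


def type_share(rows):
    """Percentage of apps per Type (for example Free vs. Paid)."""
    counts = Counter(row["Type"] for row in rows)
    total = sum(counts.values())
    return {app_type: 100.0 * n / total for app_type, n in counts.items()}


def updates_by(rows, part):
    """Count updates per (year, Type) or per (month name, Type)."""
    counts = Counter()
    for row in rows:
        # Assumes dates look like "January 7, 2018" and an English locale.
        date = datetime.strptime(row["Last Updated"], "%B %d, %Y")
        bucket = date.year if part == "year" else date.strftime("%B")
        counts[(bucket, row["Type"])] += 1
    return counts


print(type_share(rows))
print(updates_by(rows, "year"))
print(updates_by(rows, "month"))
```

Dividing the per-year counts of one type by the total for that year gives yearly shares of the kind compared above for 2011 and 2018.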
Most of the ratings given on the Google Play Store are for free apps.
Most of the paid apps on the app store are rated 4.2 to 4.8.
Fig -8: Free App Rating
Fig -9: Paid App Ratings
Free apps are the most rated apps on the Google Play Store compared to paid apps.
From the chart below, we can see that most of the apps on the Google Play Store belong to the Family, Gaming, and Tools categories.
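Observations like the ones above (the most common categories, and the share of paid apps rated between 4.2 and 4.8) can be checked with a few lines of Python. The sketch below assumes columns named `Category`, `Type`, and `Rating`; the sample rows and function names are hypothetical.

```python
from collections import Counter

# Hypothetical sample rows, for illustration only.
apps = [
    {"Category": "FAMILY", "Type": "Free", "Rating": 4.3},
    {"Category": "GAME", "Type": "Paid", "Rating": 4.5},
    {"Category": "FAMILY", "Type": "Paid", "Rating": 4.7},
    {"Category": "TOOLS", "Type": "Free", "Rating": 3.9},
    {"Category": "FAMILY", "Type": "Free", "Rating": 4.1},
    {"Category": "GAME", "Type": "Free", "Rating": 4.4},
]


def top_categories(apps, n=3):
    """Return the n most frequent categories with their app counts."""
    return Counter(app["Category"] for app in apps).most_common(n)


def share_rated_between(apps, app_type, low, high):
    """Percentage of apps of one Type whose rating lies in [low, high]."""
    ratings = [app["Rating"] for app in apps if app["Type"] == app_type]
    if not ratings:
        return 0.0
    inside = sum(1 for rating in ratings if low <= rating <= high)
    return 100.0 * inside / len(ratings)


print(top_categories(apps))
print(share_rated_between(apps, "Paid", 4.2, 4.8))
```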
Apps that are available for everyone have ratings of 4 and above out of 5.
Most of the apps on the Google Play Store target Android version 4.1 and up.
Fig -16: Paid App Ratings
3.15 Ratings over the Android Version
Apps for Android version 4.1 and above have ratings of 4 and above.
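Statements such as "apps for Android 4.1 and above have ratings of 4 and above" amount to a grouped average. A minimal sketch follows, assuming columns named `Android Ver`, `Content Rating`, and `Rating`; the sample rows are hypothetical, and rows without a rating are skipped.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows, for illustration only.
apps = [
    {"Android Ver": "4.1 and up", "Content Rating": "Everyone", "Rating": 4.2},
    {"Android Ver": "4.1 and up", "Content Rating": "Teen", "Rating": 4.4},
    {"Android Ver": "2.3 and up", "Content Rating": "Everyone", "Rating": 3.8},
    {"Android Ver": "4.1 and up", "Content Rating": "Everyone", "Rating": None},
]


def mean_rating_by(apps, key):
    """Average rating per value of `key`, ignoring apps without a rating."""
    groups = defaultdict(list)
    for app in apps:
        if app.get("Rating") is not None:
            groups[app[key]].append(app["Rating"])
    return {group: round(mean(values), 2) for group, values in groups.items()}


print(mean_rating_by(apps, "Android Ver"))
print(mean_rating_by(apps, "Content Rating"))
```

The same helper can be pointed at an installs column to compare ratings across install counts.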
Apps with 1M and 10M installations have ratings of 4 and above.
4.3.1 PRE PROCESSING
Preprocessing is important for transforming raw data into a more usable format. It can help with completeness and comparability. For instance, you can see whether certain values were recorded or not. You can also see how trustworthy the data is and how consistent the values are. We need preprocessing because most real-world data is dirty. Data can be noisy, i.e., it can contain outliers or simply errors. Data can also be incomplete, i.e., there can be missing values.
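As an illustration of the kind of cleaning described above, the sketch below converts raw text fields into numbers and treats unparseable or out-of-range values as missing. The raw formats shown (`10,000+`, `$4.99`, `19M`, `Varies with device`) and the function names are assumptions about how such a dataset might look, not a description of the exact steps used in this work.

```python
import math


def parse_installs(text):
    """'10,000+' -> 10000; returns None if the value is not a count."""
    cleaned = text.replace(",", "").replace("+", "").strip()
    return int(cleaned) if cleaned.isdigit() else None


def parse_price(text):
    """'$4.99' -> 4.99 and '0' -> 0.0; returns None on errors."""
    try:
        return float(text.replace("$", "").strip())
    except ValueError:
        return None


def parse_size_mb(text):
    """'19M' -> 19.0 and '512k' -> 0.5 (in MB); None for other text."""
    text = text.strip()
    factor = {"M": 1.0, "k": 1.0 / 1024}.get(text[-1:])
    if factor is None:
        return None  # for example "Varies with device"
    try:
        return float(text[:-1]) * factor
    except ValueError:
        return None


def parse_rating(text):
    """Keep ratings in [1, 5]; missing values and outliers become None."""
    try:
        value = float(text)
    except ValueError:
        return None
    if math.isnan(value) or not 1.0 <= value <= 5.0:
        return None
    return value


print(parse_installs("10,000+"), parse_price("$4.99"),
      parse_size_mb("19M"), parse_rating("4.1"))
```

Rows in which a required field comes back as `None` can then be dropped or imputed, depending on how much data is affected.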
4.2 SOFTWARE
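A correlation matrix such as the one captioned Fig -21 can be produced by computing the Pearson correlation for every pair of numeric columns. The following is a minimal plain-Python sketch; the column names and values are hypothetical, and in practice a library routine would normally be used instead.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Assumes both sequences have non-zero variance.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Nested dict with the correlation of every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names}
            for a in names}


# Hypothetical numeric columns, for illustration only.
columns = {
    "Rating": [4.1, 3.9, 4.7, 4.5, 4.3],
    "Reviews": [159, 967, 87510, 215644, 967],
    "Installs": [10000, 500000, 5000000, 50000000, 100000],
}
matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```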
Fig -21: Correlation Matrix

5. ALGORITHMS
5.1 Random Forest
Random forest regression is applied to all the variables; its results determine the importance of each variable and its influence on the rating. The results of random forest regression are evaluated using mean squared error. Random forest is the first model applied to the dataset, and its results are computed for a number of variables to find the importance of these variables.
5.2 Support Vector Regression
As Support Vector Regression (SVR) is a promising regression model for continuous variables, it is used to find the importance of all the numeric variables. In this model,

Algorithm              Accuracy
Random Forest          73.55%
SVR                    76.49%
Linear Regression      72.45%
K-Nearest Neighbor     92.22%
K-Means Clustering     69.56%
Table -2: Accuracy of Algorithms

6. CONCLUSIONS
After applying these algorithms and processes, we concluded that our hypothesis is true: app ratings can be predicted. However, significant preprocessing must be done before starting the classification and regression processes.
The Play Store apps data has enormous potential to drive app-making businesses to success. Actionable insights can be drawn for developers to work on and capture the Android market. This shows that, given the
Size, Type, Price, Content Rating, and Genre of an app, we can predict with about 92% accuracy whether an app will have more than 100,000 installs and be a hit on the Google Play Store.
User reviews are limited to identifying polarity and subjectivity. However, the massive increase in review-based data implies a need to focus also on performing predictions. This process is challenging yet fruitful, as user reviews are qualitative while ratings are essentially quantitative. The numeric scoring of apps within the Google app store may also be biased and overrated, because higher ratings given by users potentially attract new users disproportionately. This study therefore investigated the use of ensemble classifiers to predict numeric ratings for Google Play store apps based on the user reviews for those apps. Several ensemble classifiers were investigated to gauge their performance on the reviews scraped from the Google app store. Future work includes the implementation of deep learning techniques to predict numeric ratings.

REFERENCES
[1] Statista, Number of available application in the Google Play store from December 2009 to March 2019, https://www.statista.com/statistics/266210/number-of-available-applications-in-the-google-play-store/, Online: accessed 22 May 2019.
[2] Statista, Number of mobile app downloads worldwide in 2017, 2018 and 2020 (in billions), https://www.statista.com/statistics/271644/worldwide-free-and-paid-mobile-app-store-downloads/, Online: accessed 22 May 2019.
[3] J. Horrigan, Online shopping, Pew Internet and American Life Project, Washington, DC, 2018, http://www.pewinternet.org/Reports/2008/Online-Shopping/01-Summary-of-Findings.aspx, Online: accessed 8 Aug. 2014.
[4] D. Pagano and W. Maalej, User feedback in the appstore: an empirical study, in Proc. IEEE Int. Requirements Eng. Conf. (Rio de Janeiro, Brazil), July 2013, pp. 125–134.
[5] T. Chumwatana, Using sentiment analysis technique for analyzing Thai customer satisfaction from social media, 2015.
[6] T. Thiviya et al., Mobile apps' feature extraction based on user reviews using machine learning, 2019.
[7] H. Hanyang et al., Studying the consistency of star ratings and reviews of popular free hybrid android and ios apps, Empirical Softw. Eng. 24 (2019), no. 7, 7–32.
[8] N. Kumari and S. Narayan Singh, Sentiment analysis on e-commerce application by using opinion mining, in Proc. Int. Conf.- Cloud Syst. Big Data Eng. (Noida, India), Jan. 2016, pp. 320–325.
[9] R. M. Duwairi and I. Qarqaz, Arabic sentiment analysis using supervised classification, in Proc. Int. Conf. Future Internet Things Cloud (Barcelona, Spain), Aug. 2014, pp. 579–583.
[10] H. S. Le, T. V. Le, and T. V. Pham, Aspect analysis for opinion mining of vietnamese text, in Proc. Int. Conf. Adv. Comput. Applicat. (Ho Chi Minh, Vietnam), Nov. 2015, pp. 118–123.
[11] H. Wang, L. Yue, and C. Zhai, Latent aspect rating analysis on review text data: a rating regression approach, in Proc. ACM SIGKDD Int. Conf. Knowledge Discovery Data Mining (Washington, D.C., USA), July 2010, pp. 783–792.
[12] K. Dave, S. Lawrence, and D. M. Pennock, Mining the peanut gallery: Opinion extraction and semantic classification of product reviews, in Proc. Int. Conf. World Wide Web (New York, USA), 2003, pp. 519–528.
[13] B. Pang, L. Lee, and S. Vaithyanathan, Thumbs up?: sentiment classification using machine learning techniques, in Proc. ACL-02 Conf. Empirical Methods Natural Language Process. (Stroudsburg, PA, USA), 2002, pp. 79–86.
[14] C. Cardie et al., Combining low-level and summary representations of opinions for multi-perspective question answering, New directions in question answering, 2003, pp. 20–27.
[15] H. Takamura, T. Inui, and M. Okumura, Extracting semantic orientations of words using spin model, in Proc. Annu. Meeting Association Comput. Linguistics (Ann Arbor, MI, USA), 2005, pp. 133–140.
[16] A. Buche, D. Chandak, and A. Zadgaonkar, Opinion mining and analysis: a survey, arXiv preprint arXiv:1307.3336, 2013.
[17] M. Suleman, A. Malik, and S. S. Hussain, Google play store app ranking prediction using machine learning algorithm, Urdu News Headline, Text Classification by Using Different Machine Learning Algorithms, 2019.
[18] F. Sarro et al., Customer rating reactions can be predicted purely using app features, in Proc. IEEE Int. Requirements Eng. Conf. (Banff, Canada), Aug. 2018, pp. 76–87.
[19] S. Aslam and I. Ashraf, Data mining algorithms and their applications in education data mining, Int. J. Adv. Res. Computer Sci. Manag. Studies 2 (2014), no. 7, 50–56.
[20] D. Martens and T. Johann, On the emotion of users in app reviews, in Proc. IEEE/ACM Int. Workshop Emotion Awareness Softw. Eng. (Buenos Aires, Argentina), May
BIOGRAPHIES
S SHASHANK
Student
ICFAI Tech Hyderabad
BRAHMA NAIDU
Professor
ICFAI Tech Hyderabad