Vol 13, Issue 06, June/2022
ISSN NO:0377-9254
www.jespublication.com

Malicious Application Detection Using Machine Learning
1G. Chandana, 2B. Anusha, 3S. Sri Devi, 4Mrs. V. Sri Suma
B.Tech Student, Assistant Professor
DEPARTMENT OF INFORMATION TECHNOLOGY
CMR TECHNICAL CAMPUS, Hyderabad

ABSTRACT

Android plays a vital role in today's market. According to a recent survey, nearly 84.4% of people use Android, which has rapidly become popular for personal and business purposes. Android applications are well known in the market for their features, and the benefits they offer draw users to them. Android places significant responsibility on application developers to design applications with an understanding of security risks. Malware protection is a major security concern, because Android has been a major target of malicious applications. In Android applications, permission control is one of the major security mechanisms. This project explores permission-induced risk in applications and the fundamentals of the Android security architecture, and it also focuses on security ranking algorithms that are unique to specific applications. We therefore propose a system that detects malware based on permissions and takes steps to keep applications from obtaining unwanted permissions (it limits the permissions). It is also designed to reduce the probability of attacks that exploit vulnerabilities.

I. INTRODUCTION

1.1 OBJECTIVE OF THE PROJECT

In recent years, the use of smartphones has increased steadily, and so has the number of Android application users. Because of this growth, some intruders create malicious Android applications as tools to steal sensitive data and to commit identity theft and fraud against mobile banking and mobile wallets. Many tools and software packages for detecting malicious applications are available, but effective and efficient detection tools are still needed to handle the new, complex malicious apps created by intruders or hackers. In this paper we propose using machine learning approaches to detect malicious Android applications. First, we gather a dataset of past malicious apps as a training set. Then, with the help of the support vector machine algorithm and the decision tree algorithm, we compare the training dataset with the trained dataset to predict which Android apps are malware.

With an estimated market share of 70% to 80%, Android has become the most popular operating system for smartphones and tablets. Unsurprisingly, cyber-criminals have followed, expanding their malicious activities to mobile platforms. Mobile threat researchers have recognized an alarming increase of Android malware from 2012 to 2013 and estimate that the number of detected malicious applications is in the range of 120,000 to 718,000. To efficiently detect malware among applications available from official and third-party sources, many efforts over the past decade have studied the nature of smartphone platforms and their applications.

The Android platform employs a permission system to restrict application privileges and secure the sensitive resources of users. The developer is responsible for determining which permissions an application requires, but an application needs the user's approval of the requested permissions before it can access private or otherwise-restricted resources. Although the permission system can protect users from applications with invasive behaviors, its effectiveness depends heavily on the user's comprehension of the consequences of granting a permission. According to recent studies, many users do not understand what each permission means and grant them blindly, potentially allowing an application to access sensitive or private information. Another flaw is that the user cannot grant some permissions while denying others. Even when an app requests a suspicious permission among many seemingly legitimate ones, many users will still confirm the installation. The Android security model is based mainly on permissions.

1.2 PURPOSE OF THE PROJECT

The ultimate aim of the project is to improve permission-based detection of malicious Android mobile applications using machine learning algorithms. As a result, the implementation of these permissions is of interest to us. An Android permission is a restriction limiting access to a part of the code or to data on the device. The limitation is imposed to protect critical data and code that could be misused to distort or damage a user's experience. Permissions are also used to allow or
restrict application access to restricted APIs and resources. For example, the Android 'INTERNET' permission is required by apps to perform network communications, so opening a network connection is restricted by the 'INTERNET' permission. Likewise, an application must have the 'READ_CONTACTS' permission in order to read entries in a user's phonebook. To request a permission, the developer declares it in the manifest file with a uses-permission element. The "android:name" attribute specifies the name of the permission.

1.3 PROJECT FEATURES

We present a new method to detect malicious Android applications through machine learning techniques by analyzing the permissions extracted from the application itself. The features used for classification are the presence of the uses-permission and uses-feature tags in the manifest, as well as the number of permissions of each application. These features, namely the individually requested permissions and the uses-feature tags, make it possible to detect malicious Android applications based on permissions and 20 features from Android application packages.

II. LITERATURE SURVEY

We studied the techniques that have been proposed to identify Android malware. In one work, Anshul et al. presented an idea to detect Android malware by network traffic analysis. Their approach identifies Android malware that is operated by a remote server. Such malware either accepts orders from the server or leaks sensitive data to it. In the first phase, they analyzed the network traffic of Android malware and then the traffic of normal applications, and they discovered the characteristics that distinguish malware traffic from non-malware traffic. In the second phase, they built a classifier that uses these network traffic features to detect malware.

In another work, Anshul et al. proposed a technique called the PermPair method. They considered every pair of permissions as a possible input feature and decided, for each pair, whether that combination is vulnerable. Their method uses datasets from 3 different sources: Genome, Drebin and Koodous. Their approach had 3 phases. In the first phase, they constructed 4 different graphs by extracting permission pairs from each application. Of the 4 graphs, 3 are for malware and 1 is for benign applications. In the second phase, they merged the 3 malicious graphs into a single malicious graph, so that at the end of this phase they had two graphs, one malicious and one benign. In the third and final phase, they developed a model that calculates two scores, called the normal score and the malicious score, for every application and decides whether a particular application is malware or not.

The most commonly used properties in static and dynamic Android malware detection are permissions and network traffic features, respectively. Static permissions cannot identify sophisticated malware that is capable of update attacks, and dynamic network traffic cannot detect malware samples that run without a network connection. Therefore, a hybrid model integrating both of these properties has been proposed. Its authors extracted both permissions and network traffic features and combined them into a single vector. Using the K-medoids method, they partitioned the vectors into K clusters, and they used the K-Nearest Neighbours method to classify whether a particular application is malicious or not. They made sure that K is odd so that, among the K nearest neighbours, the counts of malicious and benign neighbours can never be equal.

In another work, Zhenlong Yuan et al. proposed a technique to associate static features with dynamic features and then classify the given Android applications as malicious or safe. They obtained the features used as input to their model in three stages:
• Static Phase
• Sensitive APIs
• Dynamic Phase
The static phase covers the permissions, which are obtained by unzipping the apk file and parsing the xml files it contains. Another file, classes.dex, accounts for the sensitive API calls.

III. SYSTEM ANALYSIS

3.1 PROBLEM STATEMENT

Smartphones have become the most used devices in day-to-day life. They provide users with a variety of applications that are enriched with powerful features. It is almost impossible for anyone these days to spend a day without a smartphone. Of all smartphones, Android smartphones are the most widely used. This increasing popularity of Android smartphones has also attracted malicious attackers. Malicious activity can be carried out either by a single application or by a group of applications working together. The objective of this project is to create a model that can detect such malicious applications.

3.2 EXISTING SYSTEM

Numerous malware detection tools have been developed, but some tools may not be able to detect newly created malware applications and unknown malware applications infected by various Trojans, worms and spyware. Detecting a large number of
malicious applications among millions of Android applications is still a challenging task with traditional methods. Existing systems use non-machine-learning methods that detect malicious applications based on their characteristics, properties and behavior.

DISADVANTAGES OF THE EXISTING SYSTEM

• Newly updated or newly created malicious applications are hard to identify.
• Non-machine-learning approaches are not reliable or efficient.
• Existing approaches cover only 30 permissions out of 300 app permissions; because of this limited coverage, different types of attacks can occur.

3.3 PROPOSED SYSTEM

In this paper, we implement SigPID (Significant Permission Identification). The goal of SigPID is to handle app permissions effectively and efficiently. The SigPID system improves the accuracy and efficiency of malware application detection. With the help of machine learning algorithms such as SVM and Decision Tree, we make a comparison between the training and trained datasets. The support vector machine acts as a classifier that separates malicious applications from benign apps.

ADVANTAGES OF THE PROPOSED SYSTEM

• Improves the detection rate for malicious applications.
• Machine learning is more efficient than non-machine-learning algorithms.
• Able to detect new Android malware applications.
• We only need to consider 22 out of 135 permissions to improve the runtime performance by 85.6%.

IV. ARCHITECTURE

4.1 PROJECT ARCHITECTURE

Web applications are by nature distributed applications, meaning that they are programs that run on more than one computer and communicate through a network or server. Specifically, web applications are accessed with a web browser and are popular because of the ease of using the browser as a user client. For the enterprise, not having to install and maintain software on potentially thousands of client computers is a key reason for their popularity. Web applications are used for web mail, online retail sales, discussion boards, weblogs, online banking, and more. One web application can be accessed and used by millions of people.

Figure no. 3.1 Project Architecture

4.2 MODULE DESCRIPTION

1. Permission
This module characterizes existing Android malware from various aspects, including the permissions requested. It identifies individually the permissions that are widely requested in both malicious and benign apps.

2. Combination of Permission
This method helps to define irregular permission combinations requested by abnormal applications. It also considers the nature, sources and implications of sensitive data on Android devices in enterprise settings.

3. Feature Extraction
This module detects malicious Android applications through machine learning techniques by analyzing the permissions extracted from the application itself.

4. Classification
Combining results from various classifiers can serve as a quick filter to identify more suspicious applications. We propose a framework that intends to develop a machine learning-based malware detection system on Android to
detect malware applications.

V. IMPLEMENTATION METHODOLOGY

To distinguish malicious applications from benign applications, a decent dataset is required. Such a dataset can be obtained from the Drebin dataset. We conducted experiments that include 516 benign applications and 528 malicious applications. This section discusses the methodology we followed in detail.

Dataset

Any machine learning model needs a dataset on which it can be trained, so data collection is one of the most important steps. We worked on 3 different datasets and compared their results against each other:
• The first dataset was collected from Google and comprises 70 applications, each with a set of 17 permissions.
• The second dataset was downloaded from Kaggle and lists 184 permissions (features) for each of 29999 apps.
• The third dataset was also downloaded from Kaggle. It has 138047 records, each consisting of 57 columns (permissions).
For training and testing purposes, we split the dataset into two parts. We used 80% of the dataset for training the machine learning models and the remaining 20% for testing every model and calculating its performance with metrics such as accuracy, F1-score, precision and recall.

Feature Engineering

The feature set used for training has a big impact on machine learning. Several studies have found that certain features are helpful in training machine learning-based malware classifiers, which is why we used feature engineering in our implementation. In supervised learning, feature engineering is the process of selecting, manipulating, and transforming raw data into features. We use feature engineering for the following reasons:
• Imputation of missing values
• Handling outliers
• Normalization
• Null value handling
• Removing invalid data
• Scaling

Data Preprocessing

The process of converting raw data into a comprehensible format is known as data preprocessing. We cannot work with raw data, so this is a key stage in machine learning. Its purpose is to ensure that the data is of good quality before machine learning or data mining methods are applied. Quality can be assessed by the following criteria: accuracy, completeness, consistency, trustworthiness and understandability. Data preprocessing involves the following steps:
• Data cleaning: correcting or deleting incorrect, corrupted, improperly formatted, duplicate, or incomplete data. We removed the columns that had missing values, and we removed undesirable observations from our datasets, such as duplicates or irrelevant observations.
• Data transformation: changing data from one format to another. String columns and decimal columns, such as price, are converted to binary.
• Data integration: combining data from a variety of sources, including databases (both relational and non-relational), data cubes, files, and so on.
• Data reduction: reducing the number of records, attributes, or dimensions. It is carried out during feature selection using a correlation matrix.

We worked on the last dataset in detail; for the remaining datasets we tested accuracy and compared the results. We made sure that the dataset contained enough examples of both malware and benign applications. There are 41323 examples of benign applications and 96724 examples of malicious applications (roughly 30% and 70% of the 138047 records), so the dataset is not heavily skewed toward either class. Each permission column is a binary column indicating whether a permission is requested or not.

A linear SVM classifier is fitted to the data you supply and produces a hyperplane that separates the data into different classes. Once the hyperplane has been obtained, you can give the classifier the attributes of a new example to check which class it predicts. Support vectors are the points that can be considered edge cases: they lie nearest to the hyperplane. The two support vectors
corresponding to the two classes, benign and malicious, are equidistant from the hyperplane, with the maximum possible margin. In classification using support vector machines, a model with a polynomial kernel performs poorly when the positive and negative examples in the data overlap. One way to deal with overlapping data is to use a support vector machine with a radial kernel, generally known as the Radial Basis Function (RBF). When the RBF kernel is used in an SVM, it behaves like a weighted nearest neighbour model. In other words, the nearest observations have the most influence on how we classify a new example. The value of the radial kernel function decreases as the distance between the two observations grows, so closer observations receive larger weights. The radial kernel function of two data observations a, b is:

$$\mathrm{RBF}(a, b) = e^{-\gamma \lVert a - b \rVert^{2}}$$

Decision trees are a non-parametric supervised learning approach that can be used for classification as well as regression problems. The objective is to construct a model that predicts the class of a given instance by learning basic decision rules from feature values. First, we calculate the entropy, which is also known as a measure of uncertainty. Then, for each attribute A, we calculate the information gain. The attribute with the maximum information gain is selected as the root node, and this process continues on the resulting subsets. The formulas for entropy and information gain are:

$$E(S) = -\left[\, p \log p + (1 - p) \log (1 - p) \,\right]$$

$$\mathrm{Gain}(S, A) = E(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} \, E(S_v)$$

where p is the proportion of examples in S that belong to one of the two classes and S_v is the subset of S for which attribute A takes the value v.

A Naive Bayes classifier is a collection of classification algorithms based on Bayes' theorem. It is not a single algorithm but a group of algorithms sharing one common principle: every pair of features used in classification is assumed to be independent of each other, given the class. For two events Ea and Eb, Bayes' theorem states that:

$$P(E_a \mid E_b) = \frac{P(E_b \mid E_a) \, P(E_a)}{P(E_b)}$$

Applying this principle to the classification task, we can say that:

$$\Pr(\text{class} \mid \text{attributes}) = \frac{\Pr(\text{attributes} \mid \text{class}) \, \Pr(\text{class})}{\Pr(\text{attributes})}$$

As we have two classes, malicious and safe, we calculate the probability that the application belongs to the malicious class and the probability that it belongs to the safe class. The application is assigned to the class with the higher probability.

VI. SCREENSHOTS

After building the SVM model, we trained and tested it on the data and obtained an accuracy of 87.5%.

Screenshot no 6.1 Building SVM Model.

After building the Bayesian model, we trained and tested it on the data and obtained an accuracy of 54.1%.

Screenshot no 6.2 Building Naïve Bayes Classification Model.

Screenshot no 6.3 SVM vs Decision Tree vs Naïve Bayes
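The Naive Bayes decision rule described in the methodology can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the project: the toy permission vectors, the label encoding (1 = malicious, 0 = safe), the function names `train_bernoulli_nb`, `log_posterior` and `predict`, and the Laplace smoothing constant `alpha` are all hypothetical. Each app is represented as a binary vector (1 if a permission is requested), in line with the binary permission columns of the dataset.

```python
import math

def train_bernoulli_nb(X, y, alpha=1.0):
    """Estimate class priors and per-feature P(x_j = 1 | class).

    X: list of binary permission vectors; y: list of labels
    (hypothetical encoding: 1 = malicious, 0 = safe).
    alpha is Laplace smoothing, an assumption added here so that
    no estimated probability is exactly 0 or 1.
    """
    n_features = len(X[0])
    model = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        prior = len(rows) / len(X)
        likelihood = [
            (sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(n_features)
        ]
        model[c] = (prior, likelihood)
    return model

def log_posterior(model, x, c):
    """log Pr(class) + sum_j log Pr(x_j | class).

    Pr(attributes) is dropped because it is the same for both classes
    and does not change which class scores higher.
    """
    prior, likelihood = model[c]
    score = math.log(prior)
    for xj, p in zip(x, likelihood):
        score += math.log(p if xj == 1 else 1.0 - p)
    return score

def predict(model, x):
    """Return the class with the higher posterior score."""
    return max(model, key=lambda c: log_posterior(model, x, c))

# Toy data; columns = [INTERNET, READ_CONTACTS, SEND_SMS] (illustrative only)
X = [[1, 1, 1], [1, 1, 1], [1, 0, 1], [1, 0, 0], [0, 0, 0], [1, 0, 0]]
y = [1, 1, 1, 0, 0, 0]
model = train_bernoulli_nb(X, y)
print(predict(model, [1, 1, 1]))  # app requesting all three permissions
print(predict(model, [0, 0, 0]))  # app requesting none of them
```

Working in log space avoids numerical underflow when the number of permission columns is large, which is a design choice of this sketch rather than something stated in the paper.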
Screenshot no. 6.4 Learning curve

Screenshot no. 6.5 Performance of different models on different datasets

Screenshot no. 6.6 Dataset 1

Screenshot no. 6.7 Dataset 2

VII. CONCLUSION

In conclusion, our project can identify, with moderate success, applications that pose a potential threat based on the permissions that they request. Our application can scan the applications on a phone at any time, and it alerts the user to do so when an installation or app update occurs. We believe that this is an important step in preventing Android malware, because this application brings to the user's attention all the possibly dangerous applications, allowing them to scrutinize the applications that they trust more carefully. This in turn will help users become more security-conscious overall.

Even so, this is only a first step. Future work for this project will include increasing the accuracy of the classifier, migrating the Python portions of this project to Java, and integrating more advanced methods of detecting malicious behavior, such as looking at API calls (this follows a "defense in depth" strategy). One benefit of the decision tree classifier is its speed. It can serve as a preliminary screen for more advanced but slower methods, narrowing down the applications they will inspect. Lastly, taking into account application categories, such as being a game or an email client, would also help detect suspicious permissions and behaviors.

However, a set of Android applications operating together can also carry out a malicious activity. We call them colluding apps. Here the malicious activity is carried out by more than one application: each application participating in the collusion does a small part of the malicious action, and the applications communicate with each other through covert channels. When a malicious activity cannot be performed by a single application, a group of applications coordinating with each other may still be able to perform it. This phenomenon is called application collusion. We call it an emerging threat because most Android malware detectors scan applications individually when determining whether they are malware. Since the malicious activity here is carried out by a group of applications, those traditional detectors cannot detect it. So, we need a model that detects these colluding applications. So far, very little research has been done on this, and datasets are scarce. We are trying to create or obtain a few applications that can perform collusion, so that we can do research on them that may eventually help in creating a model that can detect colluding applications. First, we have to obtain a template for collusion. Then, we have to split a malicious task into various steps and make each application perform one of the steps. By doing this, we can achieve collusion.
      VIII.FUTURE SCOPE
          We are trying to create or obtain a few
      applications that can perform collusion, so that
      we can study them, which may
      eventually help in creating a model that can
      detect colluding applications. First, we have
      to obtain a template for collusion. Then, we
      have to split a malicious task into several
      steps and make each application perform one of
      the steps. By doing this, we can achieve
      collusion.
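
      The gap that collusion exploits can be illustrated with a minimal Python sketch. It is not the model proposed in this work. It assumes a hypothetical rule table (RISKY_COMBINATIONS) and hypothetical app names, and it treats the pooled permissions of two apps as a first, coarse signal only. The sketch shows that a per-app permission check can pass every application while a pair of them together covers a risky combination.

```python
from itertools import combinations

# Hypothetical rule table (an assumption for illustration): a risky
# capability is available only when every listed permission is held.
RISKY_COMBINATIONS = {
    "contact_exfiltration": {
        "android.permission.READ_CONTACTS",
        "android.permission.INTERNET",
    },
    "location_via_sms": {
        "android.permission.ACCESS_FINE_LOCATION",
        "android.permission.SEND_SMS",
    },
}


def risky_capabilities(permissions):
    """Return the risky combinations fully covered by a permission set."""
    return {
        name
        for name, needed in RISKY_COMBINATIONS.items()
        if needed <= permissions
    }


def find_candidate_collusions(apps):
    """Flag app pairs whose pooled permissions cover a risky combination
    that neither app covers on its own."""
    findings = []
    for (name_a, perms_a), (name_b, perms_b) in combinations(sorted(apps.items()), 2):
        alone = risky_capabilities(perms_a) | risky_capabilities(perms_b)
        together = risky_capabilities(perms_a | perms_b)
        newly_possible = together - alone
        if newly_possible:
            findings.append((name_a, name_b, sorted(newly_possible)))
    return findings


# Hypothetical installed apps and the permissions they request.
apps = {
    "notes_app": {"android.permission.READ_CONTACTS"},
    "weather_app": {"android.permission.INTERNET"},
    "clock_app": set(),
}

# Scanned individually, no app covers a risky combination.
individually_flagged = [name for name, perms in apps.items() if risky_capabilities(perms)]

# Scanned in pairs, notes_app and weather_app together do.
pair_findings = find_candidate_collusions(apps)
```

      Pooled permissions only show that a pair could collude. They do not show that the two applications actually exchange data through a covert channel, so a real detector would need additional evidence about inter-application communication.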