Software Vulnerability Prediction using Text Analysis Techniques

Aram Hovsepyan, Riccardo Scandariato, Wouter Joosen
IBBT-DistriNet, Katholieke Universiteit Leuven, Belgium
{first.last}@cs.kuleuven.be

James Walden
Department of Computer Science, Northern Kentucky University
waldenj@nku.edu

ABSTRACT
Early identification of software vulnerabilities is essential in software engineering and can help not only reduce costs, but also prevent loss of reputation and damaging litigation for a software firm. Techniques and tools for software vulnerability prediction are thus invaluable. Most existing techniques rely on component characteristics (such as code complexity and code churn) for vulnerability prediction. In this position paper, we present a novel approach for vulnerability prediction that leverages the analysis of raw source code as text, instead of using “cooked” features. Our initial results are very promising: the prediction model achieves an average accuracy of 0.87, precision of 0.85 and recall of 0.88 on 18 versions of a large mobile application.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
MetriSec’12, September 21, 2012, Lund, Sweden.
Copyright 2012 ACM 978-1-4503-1508-1/12/09 ...$15.00.

1. INTRODUCTION
Software security is a crucial concern within the software development process, as software vulnerabilities can not only incur additional costs, but also cause severe damage to an organization. It is essential to have the right tools and techniques in order to assess and predict the vulnerability of software components produced by the development team(s).

In this position paper, we propose a novel approach for vulnerability prediction of a software component. Our approach is based on creating and using a prediction model by means of machine learning techniques. Although this idea is not new, most existing work focuses on security vulnerability prediction based on various derived features of the source code (e.g., total lines of code, total complexity, code churn). In contrast, we propose an approach that relies on the textual analysis of the source code and treats every monogram in that source as a feature. In the context of this paper we have used one version of an email client application for Android to build the prediction model. We have then predicted the vulnerabilities in 18 subsequent versions of the same application with very good precision and recall.

The remainder of the paper is organized as follows. In Section 2, we provide an overview of the related work. In Section 3, we present our approach for classifying a component based on the source code. In Section 4, we summarize our preliminary findings. Finally, Section 5 concludes the paper and provides an outline of the future research path.

2. RELATED WORK
While there is a large body of work on defect prediction, the body of work on vulnerability prediction is smaller. Vulnerabilities differ from defects in that there are many fewer security flaws in code compared to defects, and defect prediction techniques do not directly transfer to the task of vulnerability prediction.

Neuhaus et al. [4] have focused on investigating the correlation between vulnerabilities and import statements. They have successfully leveraged machine learning techniques to predict vulnerabilities based on imports in the context of the Mozilla project. Neuhaus et al. have reported an average precision of 0.70 and recall of 0.45.

Zimmermann et al. [8] have investigated the correlation between vulnerabilities and various metrics measuring code churn, code complexity, dependencies, code coverage, organizational measures and actual dependencies. They have found a weak but statistically significant correlation for each of the investigated metrics. The authors have also used logistic regression methods to predict vulnerabilities based on these metrics. The study was performed in the context of a proprietary commercial product, i.e., Windows Vista. The results of the study indicate that most metrics can actually predict vulnerabilities with an average to good precision (median precision was 0.60), but with a relatively low recall (median recall was 0.40).

A study by Shin et al. [5] explored the relationship between complexity, code churn and developer activity metrics on the one hand and vulnerabilities on the other. The authors have utilized two classification techniques, i.e., linear discriminant analysis and Bayesian networks. They have determined that these metrics are indeed predictive of vulnerabilities.

All these approaches rely on extracting certain features from the source code (e.g., complexity, number of imports, code churn) and using them to build a prediction model. In contrast, we propose to use the source code itself to build a prediction model. The advantage of this method is that it does not make any assumptions regarding the impact of a certain feature on software vulnerabilities. The machine learning technique still has to create these features itself, based on the complete code
base. The disadvantage of this approach is that the learning may fail to create any meaningful features. In the following section, we present the proposed approach in detail.

3. OUR APPROACH
As Java is a language, we treat Java files as text. The starting point for our approach is the source code of a software system that consists of a number of Java files. Each file is transformed into a feature vector where every word (also called a “monogram” in text processing) within that file is treated as a feature.

Before splitting the source code of a file into a set of words representing the features, we run a preprocessing step. Certain blocks in the source code files are likely to pollute the prediction model. Such blocks are, for instance, the comments. Indeed, we believe that it is rather unlikely that comments could have an impact on the vulnerability of a file. Hence, in a preprocessing step we filter out all the comments from the source. For the same reason, we also filter out all strings and numerical values.

In order to transform the preprocessed source code into a feature vector, we need to tokenize the textual representation of the source into a set of monograms. As delimiters we have chosen to use not only white space, but also the Java “punctuation” characters (such as “. , ; ) ( } { ] [”) as well as mathematical and logical operators (such as “+ - / * ^ | || & && !”). In a feature vector each monogram (i.e., feature) must also have an assigned value. We use the count of a given monogram in the source code of a given file as its value.

Consider Figure 1, which depicts the HelloWorld.java file.

Figure 1: Hello World Java File

In order to transform this file into a corresponding feature vector we filter out all the comments from this file as well as the “Hello World!” string. What remains of the source of this file is tokenized into a feature vector that treats each monogram as a feature. Hence, the feature vector of the HelloWorld.java file becomes:

   class:1, HelloWorldApp:1, public:1, static:1, void:1, main:1, String:1, args:1, System:1, out:1, println:1

where each of the monograms is followed by its count (in this case 1). Note that in this example we do not follow any particular (e.g., SVM) notation.

During the learning phase each file represented as a feature vector also has a vulnerability label assigned to it. We use this training set to build a prediction model. Throughout this paper we consider a binary classification scheme where a file is classified as either vulnerable or clean. Once the prediction model is created from the training set, we can use it to predict the vulnerability of arbitrary files, each represented as a feature vector.

We have leveraged the concept of the support vector machine (SVM) for both the training phase, where a prediction model is built from a set of training examples, and the prediction phase, where a feature vector is classified based on the previously built prediction model. In our initial exploration, we have used a radial basis function kernel with a set of parameters (cost and gamma) that are selected by running a grid search algorithm. The precise details of the training algorithm are out of the scope of this paper.

4. PRELIMINARY RESULTS
We have performed an initial exploration of the presented approach using a concrete application. In this section, we briefly present the preliminary results of our investigation.

4.1 Application
Market analysis has shown that consumers have been purchasing more smart phones than PCs since the last quarter of 2010 [1]. Hence, a potential vulnerability in any mobile application may affect a huge number of users. Most of these mobile applications run on the Android platform [7]. This is why we have chosen to investigate the vulnerabilities of mobile applications developed for the Android platform. Repositories containing a large version history of open source mobile applications for the Android platform are readily available and represent an ideal testbed for our approach.

For the purposes of our initial exploration, we have selected 19 versions of the K9 mail client application spread over a period of 22 months. The time span between consecutive versions is approximately one month. We have used the first version to build the prediction model and we have predicted the vulnerabilities of the files of each subsequent version using this prediction model.

In order to assign the vulnerability labels we have leveraged the state-of-the-practice Fortify tool [3], which analyzes the source code for various known types of software security vulnerabilities. Fortify not only spots a vulnerability, but also assigns a severity to each vulnerability found. In our exploratory work, we have treated a file as vulnerable if Fortify has assigned any type of vulnerability to it and as clean otherwise. By using Fortify we rely on vulnerabilities that are extracted during a static analysis of the source (based on common vulnerabilities and exposures) rather than on reported vulnerabilities. Systematic studies have shown that there are strong correlations between such static analysis metrics and the quantity of subsequently reported vulnerabilities [6]. Nevertheless, this issue is rather controversial as commercial tools are said to produce a high number of false positives [2].

4.2 Results
We have used version k9-2.504 to build the prediction model. We assessed the model performance (in terms of prediction power) by means of three indicators:

   • Accuracy is the percentage of correct results.
   • Precision is the probability that a file classified as vulnerable is indeed vulnerable.
   • Recall is the probability that a vulnerable file is classified as such.

Figures 2 and 3 illustrate the initial results that we have obtained. The main observation is that the prediction model
scores very high (above 80%) for all three indicators. Figure 2 also shows the positive rate of the application, i.e., the percentage of vulnerable files, which is between 40% and 60%. Therefore, a “naive” classifier that classifies all files as vulnerable (or alternatively as clean) would achieve an accuracy in the range of 40% to 60% as well. This range is a baseline for the accuracy indicator and our approach performs substantially better than the baseline.

Figure 2: Accuracy vs % of vulnerable files identified by Fortify

Figure 3: Precision vs recall

Finally, note that the number of files grows substantially from the first training set used to build the prediction model (97 Java files in k9-2.504) to the last version (177 Java files in k9-3.991). Hence, in the testing phase, the model is also classifying many new files that were not present in the training set.

5. CONCLUSIONS AND FUTURE WORK
In this paper we have presented a novel approach that can predict the vulnerability of a file based on its source code. As opposed to a number of current state-of-the-art approaches that build a vulnerability prediction model based on a certain characteristic (e.g., software metrics) of the source code, our approach treats each “word” in the source as a feature. We have explored this approach on an open source mobile application, i.e., the K9 email client for the Android platform. Our initial results indicate that the proposed approach has very high values for accuracy (average of 0.87), precision (average of 0.85) and recall (average of 0.88). These results are very promising and encourage further research in this area.

In the future, we plan to further investigate the presented approach by looking at various alternatives for building the feature vector. We also plan to investigate the possibility of building a vulnerability prediction model that uses the six-class classification supported by Fortify (i.e., non-vulnerable, vulnerable with severity 1 to 5). Finally, we believe that our approach is complementary to existing techniques that use, e.g., internal metrics for building a prediction model. Hence, an even more interesting research track would be to expand our approach to use a feature vector that consists both of the complete source code treated as text and a list of code metrics.

6. REFERENCES
[1] Android rises, Symbian and Windows Phone 7 launch as worldwide smartphone shipments increase 87.2% year over year, according to IDC (2011), http://www.idc.com/
[2] Austin, A., Williams, L.: One technique is not enough: A comparison of vulnerability discovery techniques. In: ESEM. pp. 97–106 (2011)
[3] Fortify: Fortify. https://www.fortify.com/ (2011)
[4] Neuhaus, S., Zimmermann, T., Holler, C., Zeller, A.: Predicting vulnerable software components. In: Proceedings of the 14th ACM Conference on Computer and Communications Security (October 2007)
[5] Shin, Y., Meneely, A., Williams, L., Osborne, J.A.: Evaluating complexity, code churn, and developer activity metrics as indicators of software vulnerabilities. IEEE Trans. Software Eng. 37(6), 772–787 (2011)
[6] Walden, J., Doyle, M.: Savi: Static analysis vulnerability indicator. IEEE Security and Privacy (to appear) (2012)
[7] Zeman, E.: Android, iOS crush BlackBerry market share (2011), http://www.informationweek.com
[8] Zimmermann, T., Nagappan, N., Williams, L.: Searching for a needle in a haystack: Predicting security vulnerabilities for Windows Vista. In: Proceedings of the 3rd International Conference on Software Testing, Verification and Validation (April 2010)
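
As an illustration of the feature extraction described in Section 3, the following is a minimal Python sketch of the preprocessing and tokenization steps. It is not the implementation used in this paper, which is not published here. The names extract_features, STRIP_RE, NUMBER_RE, DELIMITER_RE and HELLO_WORLD are hypothetical, the regular expressions are a simplification of Java's lexical rules, the delimiter set contains only the characters listed as examples in Section 3, and the HelloWorld source is an assumed reconstruction, since the content of Figure 1 is not reproduced in the text.

```python
import re
from collections import Counter

# Strings, character literals and comments are matched in a single pass so
# that "//" inside a string (or a quote inside a comment) is handled correctly.
STRIP_RE = re.compile(
    r'"(?:\\.|[^"\\\n])*"'      # string literals
    r"|'(?:\\.|[^'\\\n])*'"     # character literals
    r"|//[^\n]*"                # line comments
    r"|/\*.*?\*/",              # block comments
    re.DOTALL,
)

# Numeric literals: a digit at a word boundary, followed by word characters
# or dots (covers 42, 3.14, 0x1F, 10L). Identifiers such as k9 are untouched.
NUMBER_RE = re.compile(r"\b\d[\w.]*\b")

# Assumption: only the delimiters named in Section 3 (white space, Java
# punctuation, and the listed mathematical and logical operators).
DELIMITER_RE = re.compile(r"[\s.,;(){}\[\]+\-/*^|&!]+")


def extract_features(java_source: str) -> Counter:
    """Map each monogram in a Java source text to its number of occurrences."""
    text = STRIP_RE.sub(" ", java_source)   # drop comments and strings
    text = NUMBER_RE.sub(" ", text)         # drop numerical values
    tokens = [tok for tok in DELIMITER_RE.split(text) if tok]
    return Counter(tokens)


# Assumed content of HelloWorld.java (Figure 1 is not available as text).
HELLO_WORLD = """
/* The HelloWorldApp class prints a greeting. */
class HelloWorldApp {
    public static void main(String[] args) {
        System.out.println("Hello World!"); // display the string
    }
}
"""

if __name__ == "__main__":
    features = extract_features(HELLO_WORLD)
    print(", ".join(f"{token}:{count}" for token, count in features.items()))
```

Under these assumptions, the sketch yields the same eleven monograms as the HelloWorld example in Section 3, each with a count of 1. Operators that are not in the delimiter set (for instance "=") would remain as tokens or token fragments, so a fuller implementation would likely extend DELIMITER_RE.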