Machine Learning For Source Code Vulnerability Detection: What Works and What Isn't There Yet
We review machine learning approaches for detecting (and correcting) vulnerabilities in source code, finding that the biggest challenges ahead involve agreeing on a benchmark, increasing language and error type coverage, and using pipelines that do not flatten the code's structure.
Defect: Also known as errors, bugs, and faults, defects are deviations between a program's expected behavior and what actually happens.

Syntactic defects: These are mistakes in the syntax of a program, i.e., the grammar and rules of the language. They are usually detected at compile time and runtime and prevent a program from running at all. Such problems, depending on the language, include missing brackets and semicolons, typos, indentation problems, and so on.

Semantic defects: These are mistakes in the semantics of a program, i.e., its meaning and intended behavior. They result in programs that do not behave as intended but are not primarily a security concern. Such problems include inconsistent method names, variable misuse bugs, typing errors, application programming interface misuse, swapped arguments in functions, and so on.

Vulnerabilities: Vulnerabilities form a particular set of semantic defects that can compromise the security of a system. Such problems include buffer overflows, integer overflows, cross-site scripting, use-after-free, and so on.
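To make these categories concrete, the following small Python snippet (a constructed illustration, not taken from any of the surveyed data sets) contains a semantic defect of the "swapped arguments" kind listed above; removing the colon after the function signature would instead produce a syntactic defect, while memory-safety problems such as buffer overflows fall into the vulnerability class.

```python
# Constructed example of a semantic defect: the code parses and runs without
# any error, but its behavior deviates from the intended one because the
# arguments are swapped at the call site ("swapped arguments in functions").
def apply_discount(price: float, discount_rate: float) -> float:
    """Return the price after applying a fractional discount rate."""
    return price * (1.0 - discount_rate)

price, rate = 100.0, 0.2
total = apply_discount(rate, price)  # semantic defect: arguments swapped
print(total)  # -19.8 instead of the intended 80.0; no error is reported
```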
several important stages: data collection and preparation, model training, and, finally, evaluation and deployment. (We assume that the reader is familiar with these steps, but to make the article self-contained, we report them in Table 3.) Typically, the tool is monitored, maintained, and improved after deployment, but this is outside the scope of this article.

ML models are generally not capable of ingesting the source code in its original format, so the code is processed and transformed into some low-level representation appropriate for ML model input [e.g., vectors for neural networks (NNs)]. To preserve the semantic and syntactic properties of the program, it is useful to consider some intermediate interpretation that is capable of encoding such properties before feeding the program into the model. The three predominant approaches treat the source code as follows (roughly inspired by Chakraborty et al.7 and Shen and Chen8):

■ Sequence of tokens: The raw source code is split into small units (e.g., "int," "func," "{," and "}") and presented to the model as such.
■ Abstract syntax tree (AST): The syntactic structure of the program is captured by a tree representation, with each node of the tree denoting a construct occurring in the source code.
■ Graphs: These can capture various syntactic and semantic relationships and properties of the source code through the edges and nodes of the graphs (e.g., control flow graphs9 and code property graphs10).

The three classes are a simplification for the purpose of synthesis. In practice, many of the tools use versions that blend the lines between representations. Additionally, the classes do not reflect the full picture but rather the most widely used approaches.

ML comes with a distinct set of challenges that need to be considered to produce reliable and useful results. First, it is crucial to train the model on a high-quality data set. In general, this means a large enough, realistic data set with a representative distribution of classes. For example, a model trained on a data set that contains an equal number of buggy and nonbuggy programs might not perform well when used in a real setting where the occurrence of bugs is significantly lower or different types of bugs occur. Problems with a data set can be mitigated to some extent in the later stages of the pipeline, but a strong data set is preferred.

Additionally, a common problem that surfaces when evaluating and replicating the results is overfitting: the model fits the training data too closely and loses predictive power on previously unseen data, often due to noisy data and overcomplicated models. Finally, the selection of relevant features is one of the most important tasks of ML. It is important to consider the number of features—having more features is not necessarily better—and what information about the code they carry. The most recent deep learning-based approaches do not require manual feature selection but rather take advantage of the ability of the model to learn important features directly from the training data themselves.

The prediction of an ML model has four possible classification states, i.e., the confusion matrix: true positives (TPs), true negatives (TNs), false positives (FPs), and false negatives (FNs). In our case, a TP could mean a buggy line of code that is correctly classified as a bug, and an FP could be a nonbuggy line of code that is wrongly classified as a bug.

Methodology
Our goal is to examine and present a representative snapshot of the state-of-the-art research and identify
Table 3. The stages of building an ML-based tool.

Data collection: A sufficiently large and representative data set for the task is constructed.

Data preparation: Data preparation consists of cleaning and sometimes labeling, feature engineering, and, finally, splitting into (nonoverlapping) subsets for training and testing. Ideally, the goal is to eliminate as much noise as possible to allow for better training. Additionally, it is important to select the most relevant features, which is often a nontrivial task.

Model training: The training portion of the data set is used to create a model that will be able to distinguish erroneous code from correct code and optionally propose candidate corrections. Depending on the technique and type of model used, it is often necessary to adapt the parameters and retrain several models before achieving satisfactory results. Training is frequently the longest and computationally most expensive part.

Evaluation: The model is evaluated on the test subset of data to determine if it exhibits the desired behaviors when presented with unseen data. At this stage, the model should be able to detect and optionally correct programming defects and can be deployed to be used.
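To make the evaluation stage and the four classification states concrete, the following minimal Python sketch (an illustration with made-up counts, not taken from any surveyed tool) derives the metrics most commonly reported by the studies discussed later, namely precision, recall, the F1 score, and accuracy, from the confusion-matrix counts. Reporting the four raw counts themselves, as advocated later in the article, allows any of these derived metrics to be recomputed.

```python
# Minimal sketch: deriving commonly reported evaluation metrics from the
# confusion matrix of a defect detector. All counts below are placeholders.
def evaluation_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # flagged code that is truly buggy
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # buggy code that is actually flagged
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Example: 80 buggy samples correctly flagged (TPs), 900 clean samples correctly
# passed (TNs), 50 clean samples wrongly flagged (FPs), 20 buggy samples missed (FNs).
print(evaluation_metrics(tp=80, tn=900, fp=50, fn=20))
```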
Table 6. The studies and their codes.

Tool name | Defect | Representation | Method | Language | Size | Type | Realism | Label | Data | Tool

Works That Correct Defects
Harer 2018 [P5] | Vulnerability | Token | GAN | C | 117,000 functions | B+F | Synthetic | ✓ | — | —
Ratchet [P6] | Semantic | Token | RNN (LSTM) | Java | 35,137 pairs | B+F | Real | ✓ | ✓ | ✓
Sensibility [P7] | Syntactic | Token | n-Gram and RNN (LSTM) | Java | 2,300,000 files | NB | Semireal | — | ✓ | ✓
SequenceR [P8] | Semantic | Token | RNN (LSTM) | Java | 35,000 samples | B+F | Real | ✓ | ✓ | ✓
RLAssist [P9] | Syntactic | Token | DRL and RNN (LSTM) | C | 7,000 programs | B+F | Semireal | ✓ | ✓ | ✓
Liu 2019 [P10] | Semantic | AST, token | CNN and paragraph vector | Java | 2,000,000 methods | B+F | Real | — | ✓ | ✓
DeepDelta [P11] | Semantic | AST | RNN (LSTM) | Java | 4,800,000 builds | B+F | Real | ✓ | — | —
Tufano 2019 [P12] | Semantic | AST | RNN | Java | 2,300,000 fixes | B+F | Real | ✓ | ✓ | ✓
VarMisuseRepair [P13] | Semantic | Token | RNN (LSTM) and pointer network | Python | 650,000 functions | B+F | Real | ✓ | ✓ | —
DeepRepair [P14] | Semantic | AST | RNN | Java | 374 programs | B+F | Semireal | ✓ | ✓ | ✓
Hoppity [P15] | Semantic | AST | GNN and RNN (LSTM) | JavaScript | 500,000 program pairs | B+F | Real | ✓ | ✓ | ✓
SampleFix [P16] | Syntactic | Token | GAN, CVAE, and RNN (LSTM) | C | 7,000 programs | B+F | Semireal | ✓ | ✓ | —
DLFix [P17] | Semantic | AST | RNN (tree RNN) | Java | 4,900,000 methods | B+F | Real | ✓ | ✓ | ✓
Graph2Diff [P18] | Semantic | AST | GNN (GGNN) | Java | 500,000 fixes | B+F | Real | ✓ | — | —
DrRepair [P19] | Syntactic | Token, graph | GNN and RNN (LSTM) | C | 64,000 programs | B+F | Semireal | — | ✓ | ✓

Works That Detect Defects
Wang 2016 [P20] | Semantic | AST, graph | DBN | Java | 10 × 150–1,046 files | B+F | Real | ✓ | — | —
DP-CNN [P21] | Semantic | AST | CNN and logistic regression | Java | 7 × 330 files | B+F | Real | ✓ | ✓ | —
POSTER [P22] | Vulnerability | AST | RNN (BLSTM) | C | 6,000 functions | B+F | Real | — | ✓ | ✓
DeepBugs [P23] | Semantic | AST, graph | NN | JavaScript | 150,000 files | B+F | Semireal | ✓ | ✓ | ✓
VarMisuse [P24] | Semantic | AST, graph | GGNN and GRU | C# | 2.9 million LoC | B+NB | Real | — | ✓ | ✓
VulDeePecker [P25] | Vulnerability | Token | RNN (BLSTM) | C | 61,000 code gadgets | B+F | Synthetic | ✓ | ✓ | —
Russell 2018 [P26] | Vulnerability | Token | CNN, BoW, RNN, and random forest | C | 1.27 million functions | B+F | Real + synthetic | ✓ | ✓ | —
μVulDeePecker [P27] | Vulnerability | AST, graph | RNN (BLSTM) | C | 181,000 code gadgets | B+NF | Synthetic | ✓ | ✓ | —
Gupta 2019 [P28] | Semantic | AST | Tree CNN | C | 29 × 1,300 programs | B+F | Semireal | ✓ | ✓ | ✓
Habib 2019 [P29] | Syntactic | Token | RNN (BLSTM) | Java | 112 projects | B+F | Synthetic | ✓ | — | —
Li 2019 [P30] | Semantic | AST, graph | RNN (GRU) and CNN | Java | 4.9 million methods | B+F | Real | ✓ | ✓ | ✓
Project Achilles [P31] | Vulnerability | Token | RNN (LSTM) | Java | 44,495 programs | B+F | Synthetic | ✓ | ✓ | ✓
Li 2020 [P32] | Vulnerability | Graph, token | BoW and CNN | C | 60,000 samples | B+NB | Synthetic | — | — | —
VulDeeLocator [P33] | Vulnerability | AST, token | RNN (BRNN) | C | 120,000 program slices | B+NB | Synthetic | ✓ | ✓ | ✓
SinkFinder [P34] | Vulnerability | Graph, token | SVM | C | 15 million LoC | NB | Real | — | ✓ | —
OffSide [P35] | Vulnerability | AST | Attention NN | Java | 1.5 million code snippets | B+F | Semireal | ✓ | ✓ | ✓
AI4VA [P36] | Vulnerability | AST, graph | Graph NN | C | 1.95 million functions | B+F | Real + synthetic | ✓ | ✓ | ✓
Tanwar 2020 [P37] | Vulnerability | AST | NN | C | 1.27 million functions | B+F | Real | ✓ | — | —
Devign [P38] | Vulnerability | All | Graph NN | C | 48,000 commits | B+F | Real | ✓ | ✓ | ✓
Dam 2021 [P39] | Vulnerability | AST, token | RNN (LSTM) | Java | 18 × 46–3,450 files | B+NB | Real | ✓ | — | —
SySeVR [P40] | Vulnerability | AST, graph | RNN (BLSTM and BGRU) | C | 15,000 programs | B+F | Synthetic | ✓ | ✓ | ✓

RNN: recurrent NN; LSTM: long short-term memory; GAN: generative adversarial network; DRL: deep reinforcement learning; CNN: convolutional NN; GNN: graph NN; CVAE: conditional variational autoencoder; GGNN: gated GNN; DBN: deep belief network; BLSTM: bidirectional LSTM; GRU: gated recurrent unit; BoW: bag of words; BRNN: bidirectional RNN; SVM: support vector machine; BGRU: bidirectional GRU; NB: no bug; B: buggy; F: fixed.
The studies are first ordered chronologically and then alphabetically (by author name) within the top and bottom halves of the table.
In the "Method" column, we refer to the primary ML approach used in the tool. When a tool experiments with several approaches, we include all of them if they are presented and discussed equally and skip the ones mentioned only in passing.
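As a concrete illustration of the "Token" and "AST" entries in the representation column, the following sketch uses only the Python standard library on an invented snippet; graph representations such as control flow or code property graphs generally require dedicated program-analysis tooling and are not shown here.

```python
# Minimal sketch of the two lightest-weight source code representations used
# by the surveyed tools: a flat token sequence and an abstract syntax tree.
import ast
import io
import tokenize

source = "def add(a, b):\n    return a - b\n"  # invented snippet; defect: should be a + b

# Token representation: the raw code as a sequence of lexical units.
tokens = [tok.string for tok in tokenize.generate_tokens(io.StringIO(source).readline)
          if tok.string.strip()]
print(tokens)  # ['def', 'add', '(', 'a', ',', 'b', ')', ':', 'return', 'a', '-', 'b']

# AST representation: the syntactic structure as a tree of typed nodes.
tree = ast.parse(source)
print([type(node).__name__ for node in ast.walk(tree)])
# ['Module', 'FunctionDef', 'arguments', 'Return', 'arg', 'arg', 'BinOp', 'Name', 'Sub', ...]
```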
minimal changes to the models and by retraining on a suitable data set.

Observation 7. We see a nonuniform distribution of the goals across the examined languages both in terms of correction ability as well as targeted defect types. Looking at correction ability, we notice that the majority of C studies (12) only detect defects, while five can correct them. Java is more balanced, with seven detection and nine correction studies. JavaScript has one paper for correction [P15] and one for detection [P23]. All four Python studies are capable of correction. Finally, the one examined C# paper [P24] can detect defects. Overall, the two most common are defect-detecting C studies (12) and defect-correcting Java studies (nine). In terms of defect types, most of the C language studies target vulnerabilities (12), while the majority of Java papers target semantic defects (11). Python studies focus primarily on semantic defects (three), with one paper targeting syntactic defects. The two examined JavaScript studies as well as the only C# study target semantic defects. There are no Python, JavaScript, or C# papers that focus on security vulnerabilities. Similarly, no JavaScript or C# paper detects or corrects syntax defects.

[Figure: publications per year (2016–2019+), split into error correction and error detection studies.]

ML Approaches/Models
Observation 8. Both defect detection and correction studies increasingly rely on NNs. The most commonly used model is the recurrent NN (RNN). Defect correction studies heavily borrow from natural language translation, often referred to as neural machine translation or sequence-to-sequence translation. This means that the majority of the models comes from the same domain, more specifically, RNNs, which appear 16 times. Some of the classification methods include logistic regression, bags of words, random forests, and support vector machines. Despite some outliers, the task of detection also seems to be heading in the NN direction. The analyzed papers commonly attribute this to the NN's ability to understand contexts and keep some form of memory over time, and its suitability for handling texts and (a form of) language.
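To make Observation 8 concrete, here is a minimal sketch of the kind of token-level recurrent classifier that many of the detection studies build on. It assumes PyTorch; the vocabulary, layer sizes, pooling, and class names are placeholders chosen for illustration, not the architecture of any specific surveyed tool.

```python
# Minimal sketch of an RNN (bidirectional LSTM) defect detector over token
# sequences: embed the integer-encoded tokens, run an LSTM over the sequence,
# and emit a single buggy/nonbuggy logit. All sizes are illustrative.
import torch
import torch.nn as nn

class TokenLSTMDetector(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 64, hidden_dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classify = nn.Linear(2 * hidden_dim, 1)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, sequence_length) integer-encoded source tokens
        states, _ = self.lstm(self.embed(token_ids))
        return self.classify(states.mean(dim=1)).squeeze(-1)  # one logit per sample

model = TokenLSTMDetector(vocab_size=5000)
logits = model(torch.randint(0, 5000, (8, 120)))  # a batch of 8 token sequences
print(logits.shape)  # torch.Size([8])
```

Correction tools typically extend such an encoder with a decoder that emits the fixed token sequence, mirroring the neural machine translation setup mentioned above.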
Data Sets
Observation 9. There is large disparity among data sets in terms of data set size and data unit size. The sizes of the data sets range from hundreds to millions of samples, and the data units (the source code snippets fed into the model) also range from full program files to methods, functions, code gadgets, and similar paper-specific granularities. We notice that the granularity of data points commonly coincides with the output granularity at which the tool is capable of spotting defects.

Observation 10. There are significant differences in source code complexity, realism, and origin. On the one hand, we have real source code (18 studies), often collected from Github. On the other hand, we find eight papers that use primarily synthetic examples of errors. The remaining data sets fall somewhere in the middle, consisting of, for example, real source code with artificially injected errors.

[Figure: flow of the surveyed studies across correction ability, defect type, language, representation, and data set realism categories.]
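To illustrate how the semireal data sets described above can be constructed, the following sketch injects an artificial defect into a correct snippet by rewriting its AST, producing the kind of buggy/fixed pair that Observation 12 below identifies as the dominant data set structure. The mutation choice and record format are invented for illustration, and ast.unparse requires Python 3.9 or later.

```python
# Minimal sketch: create a "semireal" bug-fix pair by taking correct (real)
# code and injecting an artificial defect, here by swapping the operands of a
# non-commutative binary operation via an AST rewrite.
import ast

class SwapBinOpOperands(ast.NodeTransformer):
    """Swap the operands of the first subtraction or division encountered."""
    def __init__(self):
        self.done = False

    def visit_BinOp(self, node):
        self.generic_visit(node)
        if not self.done and isinstance(node.op, (ast.Sub, ast.Div)):
            node.left, node.right = node.right, node.left  # the injected defect
            self.done = True
        return node

fixed = "def remaining(budget, spent):\n    return budget - spent\n"
tree = SwapBinOpOperands().visit(ast.parse(fixed))
buggy = ast.unparse(ast.fix_missing_locations(tree))

pair = {"buggy": buggy, "fixed": fixed}  # one buggy/fixed training example
print(pair["buggy"])  # the return statement now reads: return spent - budget
```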
data sets; semireal data are common with syntactic errors and real data with semantic errors. Among the correction tools, real and semireal data sets are used equally often—nine times—with only one study using synthetic data sets. Detection-only tools primarily use realistic data (nine), but synthetic data sets are also popular (seven). Semirealistic data are the least popular among the error detection tools, with only three occurrences.

We also notice distinct patterns of data sets used for different types of errors. Synthetic data are almost exclusively used in tools targeting vulnerabilities (seven). Semireal data are mostly harnessed in studies related to syntactic errors (six) and semantic errors (five) and less in studies related to vulnerabilities (one). Finally, real data are employed in semantic error (13) and vulnerability (five) studies but not in syntactic error studies.

Observation 12. The majority of data sets consists of bug-fix pairs. We notice three distinct patterns in data set structure: data sets with bug-fix pairs (31), data sets of unrelated buggy and nonbuggy examples (five), and data sets with no bugs (four). The latter are mostly used to teach a model the correct use of the language so that it is capable of discrimination and potentially translation when it encounters an unfamiliar code pattern. The remaining two patterns help teach the model examples of good and bad behavior. The difference is that for defect correction, it is valuable to have examples of concrete fixes for a buggy example. This is commonly achieved by either collecting version histories (commits with fixes) from public repositories or artificially injecting bugs into correct code. In the case of defect detection, it is not crucial to have such pairs, so several data sets include examples of bugs and correct code but not necessarily on the same piece of code.

Output and Performance
Observation 13. There is little uniformity among studies' outputs in terms of granularity and error types a tool can target. We notice a significant variety of detection granularity, ranging from simple binary classification (buggy versus nonbuggy) to method, function, and specific lines of code. For example, Dam et al. [P39] focus on file-level detection, VulDeePecker [P25] works on code gadget granularity, and Project Achilles [P31] concentrates on methods. An interesting goal was set by Zou et al. [P27]: the authors attempted not only to recognize whether there was a vulnerability with fine granularity but also to determine the vulnerability type. There are similar differences among the correction studies, which range from single-token correction all the way to full code sections, sometimes as a single-step fix or as a collection of smaller steps with some form of correction checking in between.

There are also differences in how many different error types a tool can handle. Some tools are trained and tested on a smaller set of vulnerability types, which makes them narrow but comparatively high performing. Examples of such tools include SinkFinder [P34], which looks for vulnerabilities in function pairs, such as lock/unlock; OffSide [P35], which focuses on boundary conditions; and VulDeePecker [P25], which targets buffer and resource management error vulnerabilities. On the other hand, some tools target a wide range of errors, potentially at some performance cost. SySeVR [P40], for example, targets 126 vulnerability types, and Project Achilles [P31] focuses on 29. It is worth mentioning that some tools train separate models for each type of error and evaluate a piece of code by passing it through each of the trained models separately to determine the probability of each of the vulnerabilities.

Observation 14. There are significant inconsistencies in the reporting of performance metrics. Studies using real data sets seem to perform worse than those using synthetic data sets. We find that the studies differ greatly in their reporting of performance. The most commonly reported metrics include recall (reported in some form by 22 studies), the F1 score (16), accuracy (15), and precision (11). While detection-only tools tend to be more diligent in their reporting, the correction tools more commonly frame their results simply as "we could fix x out of y errors" without providing more detail. We find additional inconsistencies even among the studies that report the same metrics: some relay only the best performance, others provide average values, and others convey the full range.

Taking all this into account, it is uninformative, if not misleading, to directly compare performance across the papers. However, setting aside all nuances, we can cautiously draw some rough patterns from the metrics reported. Specifically, we find that studies using synthetic data sets generally report higher metrics regardless of the other study properties (around 80–90% for all mentioned metrics), while studies using real and semireal data perform significantly worse (their accuracy and recall rarely exceed 60–70%), have wider ranges, and sometimes dip all the way down to 0–20%. Given the previously identified relations among correction ability, error type, and representation, the performance across those categories is also affected by the realism of the data set.

An interesting insight into the effects of data set realism is provided by Russell et al. [P26] and Suneja et al. [P36], who train and test their pipelines separately on real and synthetic data sets. The two studies enable us to get a glimpse at the behavior of the
granularity and range of tokens. Depending on the type of bug that is targeted, a model can benefit from simple, stand-alone tokens and from grouped and more structured representations (code gadgets, functions, or some other syntactic or semantic unit).

Syntactic representation considers the ASTs of the source code, enabling a less flat view of the code. Such representations are larger in size and more difficult to construct but can capture lexical and syntactic code properties. They are often combined with recursive NNs and LSTM models. Their popularity lies mainly with defect detection tools, especially semantic defect and vulnerability detection. While ASTs are good at capturing the structure of the code, they do not capture the semantics and large and complex pieces of code very well.12 This is why ASTs are commonly supported by semantic representation capturing data and control flow information. The ability of graph models to capture more advanced semantic properties of code reflects itself in the use cases: they appear almost exclusively in tools targeting semantic defects and vulnerabilities.

Somewhat surprisingly, we observe a very unbalanced picture when it comes to the languages beyond C/C++ and Java. For example, we found that C#, JavaScript, and Python lack tools aimed at detecting and correcting vulnerabilities. A possible reason for the prevalence of C/C++ and Java is that these languages are popular, well studied, and have large, open databases of known defects (both bugs and security vulnerabilities). However, considering the ever-growing popularity of C#, JavaScript, and Python, it becomes very important to develop tools supporting them. This also extends to other popular languages that did not appear in the study.

A look at the ML methods highlights the fact that traditional ML approaches are more of a stepping stone toward a deep learning solution than solutions of their own. The reason likely lies in the fact that it is difficult to define the features that will sufficiently capture the semantics of the program. The main benefit of deep learning is its ability to ingest the source code itself (in an appropriate format) and create its own "features" to learn from.

Challenges and Future Directions
This article is motivated by the need to discover patterns in the rapidly evolving field of ML for source code. Some of the challenges toward effective solutions (Table 8) include access to and use of high-quality training data sets with realistic, representative, and correctly labeled data; effective source code representation capable of semantic understanding; standardization in terms of goals and reporting; detection and correction across domains; and catching application-specific bugs (in regard to semantic defects) and high FP rates. We briefly elaborate on some of these challenges.

There is significant variety in terms of data sets, goals, testing, and performance reporting. We believe the field would benefit from some degree of standardization, potentially in the form of a curated collection of open source data sets, together with some uniform goals for each defect type along with a test suite and benchmarks. Since a tool's performance can heavily rely on the training data, stabilizing the data set would enable more precise evaluation of the tool itself rather than the training data. Such data sets would ideally consist of realistic source code with representative errors and high-quality labeling to increase the usability of the tools in the real world. The formalization, or at the very least, clear reporting, of goals (e.g., in terms of granularity and defect types) would also enable researchers in the field to get a clearer and more complete picture of the available tools.

Finally, there is a need for clearer and more complete reporting of performance. One step in the right direction could be the reporting of the four basic metrics (TP, TN, FP, and FN), which facilitate the calculation of the remaining metrics. However, at the end
Ivan Pashchenko is a security manager at TomTom, Amsterdam, 1011 AC, The Netherlands. His research interests include threat intelligence, open source software security, and machine learning for security. Pashchenko received a Ph.D. from the University of Trento. In 2017, he was awarded a Second Place Silver Medal at the Association for Computing Machinery/Microsoft Student Research competition in the graduate category. He was the UniTrento main contact in the Continuous Analysis and Correction of Secure Code work package for the Horizon 2020 Assurance and Certification in Secure Multi-Party Open Software and Services project. Contact him at ivan.pashchenko@tomtom.com.

Fabio Massacci is a professor at the University of Trento, Trento, 38123, Italy, and Vrije Universiteit Amsterdam, Amsterdam, 1081 HV, The Netherlands. His current research interest is in empirical methods for cybersecurity of sociotechnical systems. Massacci received a Ph.D. in computing from the Sapienza University of Rome. He participates in the Cyber Security for Europe pilot and leads the Horizon 2020 Assurance and Certification in Secure Multi-Party Open Software and Services project. For his work on security and trust in sociotechnical systems, he received the Ten Year Most Influential Paper Award at the 2015 IEEE International Requirements Engineering Conference. He is a Member of IEEE. Contact him at fabio.massacci@ieee.org.