
MACHINE LEARNING SECURITY AND PRIVACY

Machine Learning for Source Code Vulnerability Detection: What Works and What Isn’t There Yet

Tina Marjanov | Vrije Universiteit Amsterdam and University of Cambridge
Ivan Pashchenko | TomTom
Fabio Massacci | University of Trento and Vrije Universiteit Amsterdam

We review machine learning approaches for detecting (and correcting) vulnerabilities in source code,
finding that the biggest challenges ahead involve agreeing to a benchmark, increasing language and error
type coverage, and using pipelines that do not flatten the code’s structure.

Traditionally, defects in source code have been discovered by means of static and dynamic analysis techniques. However, static analysis techniques are known to generate a high number of false positive (FP) findings,1 while dynamic analysis tools are designed to underestimate the number of defects in a program and therefore are prone to false negatives (FNs). Moreover, static analysis techniques might require significant computation resources, while dynamic analysis tools increase the size and execution time of a program. Hence, such techniques cannot always be seamlessly integrated into the continuous delivery pipelines of today’s software projects.2,3

In this respect, machine learning (ML) techniques seem to have become a very attractive alternative to traditional software defect detection and correction techniques. In this article, we discuss a representative snapshot of the state-of-the-art research on the detection and correction of security defects through ML techniques. Given the rapidly increasing interest in ML applications in source code, several studies have started to apply ML for bug prediction. Some earlier reviews are presented in Table 1.

From Detection to Correction
To frame our work, we adopt the recent terminology by Monperrus et al.4 For simplicity, we formulate detection as classification, while repair is formulated as generation. Realistically, not all approaches fit this precisely; for example, Hoppity [P15] uses the classification of graph edit types to do repair.

Automated defect detection is “a process of building classifiers to predict code areas that potentially contain defects, using information such as code complexity and change history.”5 There have been several defect detection tools available in recent years, two of the bigger ones being Google’s Error Prone and SpotBugs (formerly known as FindBugs). Such earlier works were usually frameworks on which checkers, consisting of manually defined heuristics, formal logical rules, and test oracles containing ground truth, could be built. They required a considerable amount of expert work and generated many FPs.

A logical next step from detection is automated correction, which tries “to automatically identify patches for a given bug that can then be applied with little, or possibly even without human intervention.”6 Compared to detection, correction is a more ambitious goal, which has only recently emerged as a realistic research topic through the use of techniques previously applied to natural language. It often operates either by learning to translate from pairs of incorrect and correct programs (as one would translate between two languages, e.g., English and Dutch) or learning from correct examples and translating programs that deviate from that.

If we consider autonomous vulnerability detection and/or correction the end goal, we can see research in syntactic (i.e., the grammar of code) and semantic (i.e., the meaning of code) error detection and correction as stepping stones toward the end goal. Additionally, even if a tool is not primarily aimed at traditional vulnerabilities, syntactic and semantic errors can introduce vulnerabilities to code, which is why we discuss the three error types equally. Table 2 provides a concise dictionary of the relevant terms.

Table 1. The surveys on ML, defect detection, and correction.

Authors | Venue | Survey type
Malhotra | Applied Soft Computing (2014) | ML techniques for software fault prediction, comparing the performance of ML to statistical techniques
Ghaffarian and Shahriari | ACM Computing Surveys (2018) | Traditional ML and data mining techniques for vulnerability detection
Allamanis et al. | ACM Computing Surveys (2018) | ML used for source code and natural language translations
Ji et al. | IEEE Conference on Dependable and Secure Computing (2018) | Autonomous cyberreasoning systems for detection, patching, and exploiting software vulnerabilities
Monperrus | ACM Computing Surveys (2019) | Automatic program repair techniques
Singh and Chaturvedi | International Conference on Soft Computing: Theories and Applications (2020) | Deep learning techniques for vulnerability detection
Lin et al. | Proceedings of the IEEE (2020) | Vulnerability detection tools using deep neural networks
Shen et al. | Security and Communication Networks (2020) | Vulnerability detection, program repair, and defect prediction methods that include binary code
Zeng et al. | IEEE Access (2020) | Deep learning software vulnerability discovery approaches

Table 2. The defect types.

Term | Definition
Defect | Also known as errors, bugs, and faults, defects are deviations between a program’s expected behavior and what actually happens.
Syntactic defects | These are mistakes in the syntax of a program, i.e., the grammar and rules of the language. They are usually detected at compile time and runtime and prevent a program from running at all. Such problems, depending on the language, include missing brackets and semicolons, typos, indentation problems, and so on.
Semantic defects | These are mistakes in the semantics of a program, i.e., its meaning and intended behavior. They result in programs that do not behave as intended but are not primarily a security concern. Such problems include inconsistent method names, variable misuse bugs, typing errors, application programming interface misuse, swapped arguments in functions, and so on.
Vulnerabilities | Vulnerabilities form a particular set of semantic defects that can compromise the security of a system. Such problems include buffer overflows, integer overflows, cross-site scripting, use-after-free, and so on.


Enter ML for Source Code
Recent years have seen the emergence of ML for finding vulnerabilities. A typical ML pipeline consists of several important stages: data collection and preparation, model training, and, finally, evaluation and deployment. (We assume that the reader is familiar with these steps, but to make the article self-contained, we report them in Table 3.) Typically, the tool is monitored, maintained, and improved after deployment, but this is outside the scope of this article.

ML models are generally not capable of ingesting the source code in its original format, so the code is processed and transformed into some low-level representation appropriate for ML model input [e.g., vectors for neural networks (NNs)]. To preserve the semantic and syntactic properties of the program, it is useful to consider some intermediate interpretation that is capable of encoding such properties before feeding the program into the model. The three predominant approaches treat the source code as follows (roughly inspired by Chakraborty et al.7 and Shen and Chen8):

■ Sequence of tokens: The raw source code is split into small units (e.g., “int,” “func,” “{,” and “}”) and presented to the model as such.
■ Abstract syntax tree (AST): The syntactic structure of the program is captured by a tree representation, with each node of the tree denoting a construct occurring in the source code.
■ Graphs: These can capture various syntactic and semantic relationships and properties of the source code through the edges and nodes of the graphs (e.g., control flow graphs9 and code property graphs10).

The three classes are a simplification for the purpose of synthesis. In practice, many of the tools use versions that blend the lines between representations. Additionally, the classes do not reflect the full picture but rather the most widely used approaches.

ML comes with a distinct set of challenges that need to be considered to produce reliable and useful results. First, it is crucial to train the model on a high-quality data set. In general, this means a large enough, realistic data set with a representative distribution of classes. For example, a model trained on a data set that contains an equal number of buggy and nonbuggy programs might not perform well when used in a real setting where the occurrence of bugs is significantly lower or different types of bugs occur. Problems with a data set can be mitigated to some extent in the later stages of the pipeline, but a strong data set is preferred.

Additionally, a common problem that surfaces when evaluating and replicating the results is overfitting, meaning that the model fits the training data too closely and no longer shows the same predictive power on unseen data, often due to noisy data and overcomplicated models. Finally, the selection of relevant features is one of the most important tasks of ML. It is important to consider the number of features—having more features is not necessarily better—and what information about the code they carry. The most recent deep learning-based approaches do not require manual feature selection but rather take advantage of the ability of the model to learn important features directly from the training data themselves.

The prediction of an ML model has four possible classification states, i.e., the confusion matrix: true positives (TPs), true negatives (TNs), FPs, and FNs. In our case, a TP could mean a buggy line of code that is correctly classified as a bug, and an FP could be a nonbuggy line of code that is wrongly classified as a bug.
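Returning to the token and AST representations listed above, the following minimal Python sketch tokenizes a small snippet and prints its syntax tree. It uses only the standard tokenize and ast modules and an invented example; it is an illustration of the two representations, not the preprocessing pipeline of any reviewed tool.

import ast
import io
import tokenize

# A small snippet with a semantic defect: the accumulator is overwritten
# instead of updated, so the function does not compute the intended sum.
snippet = """
def total(values):
    result = 0
    for v in values:
        result = max(v, 0)   # intended: result = result + v
    return result
"""

# Token representation: a flat sequence of (type, text) pairs.
tokens = [(tok.type, tok.string)
          for tok in tokenize.generate_tokens(io.StringIO(snippet).readline)]
print(tokens[:12])

# AST representation: a tree whose nodes are language constructs.
tree = ast.parse(snippet)
print(ast.dump(tree, indent=2))

# Graph representations (e.g., control or data flow graphs) require further
# static analysis on top of the AST and are not shown here.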

Table 3. The ML pipeline.

Stage | Description
Data collection | A sufficiently large and representative data set for the task is constructed.
Data preparation | Data preparation consists of cleaning and sometimes labeling, feature engineering, and, finally, splitting into (nonoverlapping) subsets for training and testing. Ideally, the goal is to eliminate as much noise as possible to allow for better training. Additionally, it is important to select the most relevant features, which is often a nontrivial task.
Model training | The training portion of the data set is used to create a model that will be able to distinguish erroneous code from correct code and optionally propose candidate corrections. Depending on the technique and type of model used, it is often necessary to adapt the parameters and retrain several models before achieving satisfactory results. Training is frequently the longest and computationally most expensive part.
Evaluation | The model is evaluated on the test subset of data to determine if it exhibits the desired behaviors when presented with unseen data. At this stage, the model should be able to detect and optionally correct programming defects and can be deployed to be used.
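To make the stages in Table 3 concrete, the following is a deliberately minimal sketch of a detection pipeline formulated as binary classification, using bag-of-words token features and scikit-learn. The snippets, labels, and feature choice are illustrative assumptions for the example only and are not taken from any of the reviewed tools.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Data collection: a toy corpus of code snippets labeled 1 (buggy) or 0 (nonbuggy).
snippets = [
    "strcpy ( buf , user_input ) ;",                        # unchecked copy
    "strncpy ( buf , user_input , sizeof ( buf ) - 1 ) ;",
    "free ( p ) ; free ( p ) ;",                            # double free
    "free ( p ) ; p = NULL ;",
] * 25                                                       # repeated to have enough samples
labels = [1, 0, 1, 0] * 25

# Data preparation: token-level features and a held-out test split.
vectorizer = CountVectorizer(token_pattern=r"\S+")
features = vectorizer.fit_transform(snippets)
x_train, x_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.3, random_state=0)

# Model training: a simple linear classifier stands in for the NN models
# used by most of the reviewed studies.
model = LogisticRegression(max_iter=1000)
model.fit(x_train, y_train)

# Evaluation: precision, recall, and F1 on unseen data.
print(classification_report(y_test, model.predict(x_test)))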



Methodology
Our goal is to examine and present a representative snapshot of the state-of-the-art research and identify the trends and gaps. Adopting an agnostic starting point, we want to discover patterns without being biased by our own dispositions and conjectures. For this, we leverage the grounded theory approach widely used in empirical studies; this method allows hypotheses to emerge from the data.

An initial set of 343 works was drawn from an online repository containing ML research on source code; the collection was created in the scope of Allamanis et al.11 and is actively maintained (https://ml4code.github.io). To reflect the state-of-the-art techniques and considering that ML is rapidly evolving, we focus our review on papers from 2015 onward (322 papers out of 343). Additionally, we keep only the papers on defect detection and correction that use static analysis of the source code. We therefore exclude papers on topics such as synthesis, prediction, recommendation, summarization, and so on as well as those discussing supporting techniques for defect detection, including testing, fuzzing, taint analysis, symbolic execution, defects in binary code, and so forth.

Finally, we exclude papers without a proof of concept or full ML pipeline. Papers that share large parts of the pipeline and adapt or discuss only one part of it are treated as one with the most representative paper included and discussed. After the removals, a set of 31 relevant papers emerges. To avoid biases and present a complete picture, we consider any additional relevant works referenced in the original set of papers and cross reference the works with top hits from Google Scholar. The final list consists of 40 papers containing an end-to-end ML pipeline capable of either detecting or correcting defects in source code (see Table 4).

To facilitate the discovery of emerging patterns from data (i.e., the set of selected papers), we need to identify the defining characteristics of a defect correction/detection tool. Our initial codes were heavily inspired by the characteristics discussed by related literature (see Table 1). The codes were first used to annotate a small portion of the selected papers to test their suitability and completeness. Then we synchronized the resulting codebook among all the researchers involved in the study to identify a set of codes that captured the most important differences among the studies while ensuring that no part of the ML pipeline was left out. We performed this process iteratively until the codebook became stable. Finally, after finalizing the full set of codes, we expanded the coding to the remainder of the papers. The coding and subsequent analysis were performed using Atlas.ti.

Code groups related to the abilities of tools include the following:

■ Correction refers to the correction and detection ability of the tool.
■ Defect type refers to the primary type of defect the tool targets. If a more advanced tool can simultaneously correct simpler mistakes (e.g., a semantic defect tool fixing misplaced brackets, which is a syntax mistake), we classify it according to the most advanced type of defect it can target.
■ Representation refers to the main representation of the source code that is fed to the model as defined. This does not include further transformations inside the models but rather the initial information presented to the model.
■ Language refers to the language the tool targets. If a tool can act in a language-agnostic way, we refer to the language of the data set that is tested.

Code groups that capture information about the data sets include the following:

■ The type captures whether the data sets include buggy examples and, if bugs are present, whether buggy and nonbuggy examples are paired.
■ The label captures whether the data set is labeled or unlabeled.
■ Realism captures whether the programs and errors in the data set are taken from real applications or synthetically produced.
■ Availability captures whether the data set and/or tool are publicly available.

It is important to note that the type, label, and availability of the data set refer to the training data. When training is performed on data that do not have the same structure as the test subset, we describe the training data (e.g., correction tools that train only on nonbuggy examples). Additionally, when training data are collected from a public data set but then modified in some way, we describe the modified version (e.g., a publicly available data set is injected with bugs). Table 5 presents the final codebook. It shows the identified code groups, the possible values for each, and illustrative examples taken from the source papers (an overview of the studies is also available on https://github.com/tmv200/ml4code/blob/main/sota.yaml).

Analysis of Recent Works
Table 6 provides an overview of the studies included in this review. Generally speaking, we can see (Figure 1) an increase in publications since 2015, signaling growing interest in the field. This holds for both detection and correction studies. Overall, the examined papers exhibit wide variety in goals and priorities, leading to a wealth of different approaches (Figure 2 and Table 7).


Table 4. The analyzed papers (correction).


[P1] P. Yewen, K. Narasimhan, A. Solar-Lezama, and R. Barzilay, “sk_p: A neural program corrector for MOOCs,” in Proc. Companion
2016 SIGPLAN Int. Conf. Syst., Program., Languages Appl., Softw. Humanity, ACM, 2016, pp. 39–40, doi: 10.1145/2984043.
2989222.
[P2] G. Rahul, S. Pal, A. Kanade, and S. Shevade, “DeepFix: Fixing common C language errors by deep learning,” in Proc. 31st Conf. Artif.
Intell., 2017, pp. 1345–1351, doi: 10.5555/3298239.3298436.
[P3] S. Bhatia, P. Kohli, and R. Singh, “Neuro-symbolic program corrector for introductory programming assignments,” in Proc. 40th Int.
Conf. Softw. Eng., 2018, pp. 60–70, doi: 10.1145/3180155.3180219.
[P4] J. Devlin, J. Uesato, R. Singh, and P. Kohli, “Semantic code repair using neuro-symbolic transformation networks,” in Proc. 6th Int.
Conf. Learn. Representations, 2018, pp. 1–11.
[P5] J. Harer et al., “Learning to repair software vulnerabilities with generative adversarial networks,” in Proc. 32nd Conf. Neural Inf.
Process. Syst., 2018, pp. 7944–7954, doi: 10.5555/3327757.3327890.
[P6] H. Hata, E. Shihab, and G. Neubig, “Learning to generate corrective patches using neural machine translation,” 2018,
arXiv:1812.07170.
[P7] E. A. Santos, J. C. Campbell, D. Patel, A. Hindle, and J. N. Amaral, “Syntax and sensibility: Using language models to detect
and correct syntax errors,” in Proc. 25th Int. Conf. Softw. Anal., Evol. Reeng., 2018, pp. 311–322, doi: 10.1109/SANER.
2018.8330219.
[P8] Z. Chen, S. J. Kommrusch, M. Tufano, L.-N. Pouchet, D. Poshyvanyk, and M. Monperrus, “Sequencer: Sequence-to-
sequence learning for end-to-end program repair,” IEEE Trans. Softw. Eng., vol. 47, no. 9, pp. 1943–1959, 2019, doi: 10.1109/
TSE.2019.2940179.
[P9] R. Gupta, A. Kanade, and S. Shevade, “Deep reinforcement learning for syntactic error repair in student programs,” in Proc. 33rd
Conf. Artif. Intell., 2019, pp. 930–937, doi: 10.1609/aaai.v33i01.3301930.
[P10] K. Liu et al., “Learning to spot and refactor inconsistent method names,” in Proc. 41st Int. Conf. Softw. Eng., 2019, pp. 1–12, doi:
10.1109/ICSE.2019.00019.
[P11] A. Mesbah, A. Rice, E. Johnston, N. Glorioso, and E. Aftandilian, “DeepDelta: Learning to repair compilation errors,”
presented at the 27th Joint Meeting Eur. Softw. Eng. Conf. Symp. Found. Softw. Eng., 2019, pp. 925–936, doi: 10.1145/
3338906.3340455.
[P12] M. Tufano, C. Watson, G. Bavota, M. D. Penta, M. White, and D. Poshyvanyk, “An empirical study on learning bug-fixing
patches in the wild via neural machine translation,” ACM Trans. Softw. Eng. Methodol., vol. 28, no. 4, pp. 1–29, 2019, doi:
10.1145/3340544.
[P13] M. Vasic, A. Kanade, P. Maniatis, D. Bieber, and R. Singh, “Neural program repair by jointly learning to localize and repair,” in Proc.
7th Int. Conf. Learn. Representations, 2019, pp. 1–12.
[P14] M. White, M. Tufano, M. Martinez, M. Monperrus, and D. Poshyvanyk, “Sorting and transforming program repair ingredients
via deep learning code similarities,” in Proc. 26th Int. Conf. Softw. Anal., Evol. Reeng., 2019, pp. 479–490, doi: 10.1109/
SANER.2019.8668043.
[P15] E. Dinella, H. Dai, Z. Li, M. Naik, L. Song, and K. Wang, “Hoppity: Learning graph transformations to detect and fix bugs in
programs,” in Proc. 8th Int. Conf. Learn. Representations, 2020, pp. 1–17.
[P16] H. Hajipour, A. Bhattacharyya, and M. Fritz, “SampleFix: Learning to correct programs by efficient sampling of diverse fixes,” in
Proc. Workshop Comput.-Assisted Program., ACM, 2020, pp. 1–10.
[P17] Y. Li, S. Wang, and T. N. Nguyen, “DLFix: Context-based code transformation learning for automated program repair,” in Proc. 42nd
Int. Conf. Softw. Eng., ACM/IEEE, 2020, pp. 602–614, doi: 10.1145/3377811.3380345.
[P18] D. Tarlow et al., “Learning to fix build errors with graph2diff neural networks,” in Proc. 42nd Int. Conf. Softw. Eng. Workshops, IEEE/
ACM, 2020, pp. 19–20, doi: 10.1145/3387940.3392181.
[P19] M. Yasunaga and P. Liang, “Graph-based, self-supervised program repair from diagnostic feedback,” in Proc. 37th Int. Conf. Mach. Learn., 2020, pp. 10,799–10,808.




Table 4. The analyzed papers (detection).
[P20] S. Wang, T. Liu, and L. Tan, “Automatically learning semantic features for defect prediction,” in Proc. 38th Int. Conf. Softw. Eng.,
IEEE/ACM, 2016, pp. 297–308, doi: 10.1145/2884781.2884804.
[P21] J. Li, P. He, J. Zhu, and M. Lyu, “Software defect prediction via convolutional neural network,” in Proc. Int. Conf. Softw. Qual., Rel.
Secur., 2017, pp. 318–328, doi: 10.1109/QRS.2017.42.
[P22] G. Lin, J. Zhang, W. Luo, L. Pan, and Y. Xiang, “POSTER: Vulnerability discovery with function representation learning from
unlabeled projects,” in Proc. Conf. Comput. Commun. Secur., ACM, 2017, pp. 2539–2541, doi: 10.1145/3133956.3138840.
[P23] M. Pradel and K. Sen, “DeepBugs: A learning approach to name-based bug detection,” Proc. ACM Program. Languages, vol. 2, no.
OOPSLA, pp. 1–25, 2018, doi: 10.1145/3276517.
[P24] M. Allamanis, M. Brockschmidt, and M. Khademi, “Learning to represent programs with graphs,” in Proc. 6th Int. Conf. Learn.
Representations, 2018, pp. 1–17.
[P25] Z. Li et al., “VulDeePecker: A deep learning-based system for vulnerability detection,” in Proc. Netw. Distrib. Syst. Secur. Symp., 2018,
pp. 1–15.
[P26] R. Russell et al., “Automated vulnerability detection in source code using deep representation learning,” in Proc. 17th Int. Conf.
Mach. Learn. Appl., 2018, pp. 757–762, doi: 10.1109/ICMLA.2018.00120.
[P27] D. Zou, S. Wang, S. Xu, Z. Li, and H. Jin, “μ-VulDeePecker: A deep learning-based system for multi-class vulnerability detection,”
IEEE Trans. Dependable Secure Comput., vol. 18, no. 5, pp. 2224–2236, 2019, doi: 10.1109/TDSC.2019.2942930.
[P28] R. Gupta, A. Kanade, and S. Shevade, “Neural attribution for semantic bug-localization in student programs,” in Proc. 33rd Conf.
Neural Inf. Process. Syst., ACM, 2019, 11,884-11,894.
[P29] A. Habib and M. Pradel, “Neural bug finding: A study of opportunities and challenges,” 2019, arXiv:1906.00307.
[P30] Y. Li, S. Wang, T. N. Nguyen, and S. V. Nguyen, “Improving bug detection via context-based code representation learning and
attention-based neural networks,” Proc. ACM Program. Languages, vol. 3, no. OOPSLA, pp. 1–30, 2019, doi: 10.1145/3360588.
[P31] N. Saccente, J. Dehlinger, L. Deng, S. Chakraborty, and Y. Xiong, “Project achilles: A prototype tool for static method-level
vulnerability detection of java source code using a recurrent neural network,” in Proc. 34th Int. Conf. Autom. Softw. Eng. Workshop,
IEEE/ACM, 2019, pp. 114–121, doi: 10.1109/ASEW.2019.00040.
[P32] X. Li, L. Wang, Y. Xin, Y. Yang, and Y. Chen, “Automated vulnerability detection in source code using minimum intermediate
representation learning,” Appl. Sci., vol. 10, no. 5, p. 1692, 2020, doi: 10.3390/app10051692.
[P33] Z. Li, D. Zou, S. Xu, Z. Chen, Y. Zhu, and H. Jin, “VulDeeLocator: A deep learning-based fine-grained vulnerability detector,” IEEE
Trans. Dependable Secure Comput., early access, 2020, doi: 10.1109/TDSC.2021.3076142.
[P34] P. Bian, B. Liang, J. Huang, W. Shi, X. Wang, and J. Zhang, “SinkFinder: Harvesting hundreds of unknown interesting function pairs
with just one seed,” presented at the 28th Joint Meeting Eur. Softw. Eng. Conf. Symp. Found. Softw. Eng., ACM, 2020, pp. 1101–
1113, doi: 10.1145/3368089.3409678.
[P35] J. A. Briem, J. Smit, H. Sellik, P. Rapoport, G. Gousios, and M. Aniche, “OffSide: Learning to identify mistakes in boundary
conditions,” in Proc. 42nd Int. Conf. Softw. Eng. Workshops, IEEE/ACM, 2020, pp. 203–208, doi: 10.1145/3387940.3391464.
[P36] S. Suneja, Y. Zheng, Y. Zhuang, J. Laredo, and A. Morari, “Learning to map source code to software vulnerability using
code-as-a-graph,” 2020, arXiv:2006.08614.
[P37] A. Tanwar, K. Sundaresan, P. Ashwath, P. Ganesan, S. K. Chandrasekaran, and S. Ravi, “Predicting vulnerability in large codebases
with deep code representation,” 2020, arXiv:2004.12783.
[P38] Y. Zhou, S. Liu, J. Siow, X. Du, and Y. Liu, “Devign: Effective vulnerability identification by learning comprehensive program
semantics via graph neural networks,” in Proc. 33rd Conf. Neural Inf. Process. Syst., ACM, 2020, pp. 10,197–10,207, doi:
10.5555/3454287.3455202.
[P39] H. K. Dam, T. Tran, T. Pham, S. W. Ng, J. Grundy, and A. Ghose, “Automatic feature learning for predicting vulnerable software
components,” IEEE Trans. Softw. Eng., vol. 47, no. 1, pp. 67–85, 2021, doi: 10.1109/TSE.2018.2881961.
[P40] Z. Li, D. Zou, S. Xu, H. Jin, Y. Zhu, and Z. Chen, “SySeVR: A framework for using deep learning to detect software vulnerabilities,”
IEEE Trans. Dependable Secure Comput., early access, 2021, doi: 10.1109/TDSC.2021.3051525.


Table 5. The codebook.

Code group | Code | Description | Example
Correction | No | Tool capable only of detecting defects | “Design and implementation of a deep learning-based vulnerability detection system” [P25]
Correction | Yes | Tool capable of correcting defects | “End-to-end solution ... that can fix multiple such errors in a program” [P2]
Defect type | Syntactic | Tool targets syntax defects | “Algorithm ... for finding repairs to syntax errors” [P3]
Defect type | Semantic | Tool targets semantic defects | “Addressing the issue of semantic program repair” [P4]
Defect type | Vulnerability | Tool targets vulnerabilities | “System for vulnerability detection” [P25]
Representation | Tokens | Source code represented as a sequence of tokens | “Model treats a program statement as a list of tokens” [P1]
Representation | AST | Source code represented as an abstract syntax tree | “Representations of ASTs” [P22]
Representation | Graph | Source code represented as a graph capturing additional semantic information (control flow graphs, data flow graphs, and so on) | “Generates a system dependency graph for each training program” [P27]
Language | Python | Tool evaluated on source code written in Python | “From an introduction to programming in python course” [P3]
Language | C | Tool evaluated on source code written in C/C++ | “Fixing common C language errors” [P2]
Language | Java | Tool evaluated on source code written in Java | “Targeting Java source code” [P6]
Language | JavaScript | Tool evaluated on source code written in JavaScript | “Broad range of bugs in JavaScript programs” [P15]
Language | C# | Tool evaluated on source code written in C# | “Open source C# projects on GitHub” [P24]
Type | No bug | Tool trained on only nonbuggy source code | “Using language models trained on correct source code to find tokens that seem out of place” [P7]
Type | Bug + fixed | Tool trained on paired examples of buggy and fixed code | “A pair (p, p′), where p is an incorrect program and p′ is its correct version” [P9]
Type | Bug + no bug | Tool trained on unpaired examples of buggy and nonbuggy code | “Data set that contains 181,641 pieces of code; 138,522 are nonvulnerable (i.e., not known to contain vulnerabilities) and 43,119 are vulnerable” [P27]
Label | Yes | Tool trained on labeled data | “A program is labeled as ‘good,’ ... ‘bad,’ ... or ‘mixed’” [P27]
Label | No | Tool trained on unlabeled data | “Self-supervised learning with unlabeled programs” [P19]
Realism | Real | Data set consists of mostly real programs | “Javascript code change commits collected from Github” [P15]
Realism | Semireal | Data set consists of semirealistic code: real code injected with synthetic bugs, or simpler/beginner code with real mistakes | “Corpus of open source Python projects with synthetically injected bugs” [P4] and “C programs written by students for 93 different programming tasks” [P2]
Realism | Synthetic | Data set consists of mainly synthetic/academic code | “Juliet Test Suite, with 81,000 synthetic C/C++ and Java programs with known security vulnerabilities” [P31]
Availability | Yes | Data set and/or tool are publicly available | —
Availability | No | Data set and/or tool are not publicly available | —



For conciseness, we leave out detailed descriptions and low-level comparisons among the approaches and focus on more general directions.

Detection Versus Correction Ability and Defect Types

Observation 1. We find an almost equal split among the papers that focus only on detection and those also correcting defects. Twenty-one papers focus on detecting defects, while 19 can also correct them. In terms of the defects’ evolution over time, research into both types seems to be growing fast, as evident in Figure 1. The slight drop in publications of defect correction studies could be the consequence of a small sample size or an actual shift toward defect detection.

Observation 2. The papers mostly address semantic defects and vulnerabilities; syntactic defects are less popular. Among them, vulnerabilities are only detected, whereas semantic and syntactic defects are often also corrected. Seven papers target syntactic defects, 15 focus on vulnerabilities, and 18 concentrate on semantic defects. Correction studies target mostly semantic (12) and syntactic (six) defects, while detection studies target mostly vulnerabilities (14) and semantic defects (six). Only a single correction study [P5] targets vulnerabilities, and one detection study [P29] focuses on syntactic defects.

Since defect detection often targets more complex problems, such as semantic bugs and vulnerabilities, many detection papers focus on a narrower array of problems or try to narrow the granularity. As such, DeepBugs [P23] targets only name-based semantic bugs, SinkFinder [P34] examines security-sensitive function pairs, and OffSide [P35] looks for boundary condition mistakes.

Among the correction papers, [P5] presented one of the first studies requiring no paired labeled examples for mapping from the buggy domain to the nonbuggy one. Sensibility [P7] was one of the first studies focusing on the correction of single token syntax defects across domains. DeepRepair [P14] builds on the idea of redundancy, exploiting the fact that many programs contain seeds to their own repair. More advanced studies, such as Hoppity [P15], use NNs for source code embedding and graph transformations to correct semantic mistakes. Graph2Diff [P18] and VarMisuseRepair [P13] both use pointers to locate the defect and a potential fix.

Source Code Representation

Observation 3. The majority of the studies use either ASTs or token representation, with graph representation being the least used. Despite the different representations, the input is commonly flattened when serving as input for an NN. AST representation is used by 23 papers, token representation appears in 21 studies, and graph representation is employed by 11 studies. The approaches can coexist, which is evident from studies that combine several representations: 12 defect detection and two defect correction studies use some combination. The most common combination is AST–graph (seven), followed by AST–token (three), and graph–token (three). Zhou et al. [P38] use a combination of all the three representations. With deep learning rising compared to other ML techniques, the need for manually defined “traditional” features is falling. Instead, NNs require input in the form of a vector. To achieve that, the previously described source code representations are commonly flattened into a vector (vectorized) [P10], [P25].

Observation 4. Correction papers mostly use ASTs and tokens, whereas detection studies use all three representations. We can see a significant division in representation approaches between detection and correction studies. Among the defect correction papers, the most common representation is tokens (12), followed by ASTs (eight). Only one correction study uses graph representation [P19]. The split in representations is a bit more balanced among the detection-only papers: ASTs appear 15 times, graphs 10 times, and tokens nine times.

Observation 5. Different representations seem preferred by researchers for addressing varying types of defects, depending on the defect type targeted by a study. Tools targeting syntactic defects almost exclusively use token representation (seven), with a single paper adding graph representation [P19]. Papers aimed at semantic defects primarily use ASTs (14), followed by tokens (five) and graphs (four). The most variety in representation comes from the vulnerability finding papers. Those use ASTs and tokens equally often (nine), with graphs employed slightly less frequently (six). Vulnerability finding studies also most commonly use a combination of more than one representation.
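Observation 3 notes that whichever representation is chosen, it is ultimately flattened into a vector before reaching the model. The sketch below shows one common way of doing this for a token sequence (index-based encoding with padding); the vocabulary, padding scheme, and length limit are illustrative assumptions rather than the encoding of any specific paper.

# Minimal token-to-vector encoding: map each token to a vocabulary index
# and pad/truncate to a fixed length so an NN can consume the result.
PAD, UNK = 0, 1

def build_vocab(token_sequences):
    vocab = {"<pad>": PAD, "<unk>": UNK}
    for seq in token_sequences:
        for tok in seq:
            vocab.setdefault(tok, len(vocab))
    return vocab

def vectorize(seq, vocab, max_len=16):
    ids = [vocab.get(tok, UNK) for tok in seq[:max_len]]
    return ids + [PAD] * (max_len - len(ids))

corpus = [
    ["if", "(", "x", ">", "0", ")", "{", "return", "x", ";", "}"],
    ["free", "(", "p", ")", ";", "free", "(", "p", ")", ";"],
]
vocab = build_vocab(corpus)
vectors = [vectorize(seq, vocab) for seq in corpus]
print(vectors)   # fixed-length integer vectors, one per snippet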

Table 6. The studies and their codes.

Tool name | Defect | Representation | Method | Language | Data set size | Type | Realism | Label | Data available | Tool available

Works That Detect and Correct Defects
sk_p [P1] | Semantic | Token | RNN (LSTM) and skip-gram | Python | 7 × 315–9,000 programs | NB | Semireal | — | — | —
DeepFix [P2] | Syntactic | Token | RNN | C | 7,000 programs | B + F | Semireal | ✓ | ✓ | ✓
SynFix [P3] | Syntactic | Token | RNN (LSTM) | Python | 75,000 programs | NB | Semireal | — | — | —
SSC [P4] | Semantic | AST | RNN and rule based | Python | 2,900,000 code snippets | B + F | Semireal | ✓ | ✓ | —
Harer 2018 [P5] | Vulnerability | Token | GAN | C | 117,000 functions | B + F | Synthetic | ✓ | — | —
Ratchet [P6] | Semantic | Token | RNN (LSTM) | Java | 35,137 pairs | B + F | Real | ✓ | ✓ | ✓
Sensibility [P7] | Syntactic | Token | n-Gram and RNN (LSTM) | Java | 2,300,000 files | NB | Semireal | — | ✓ | ✓
SequenceR [P8] | Semantic | Token | RNN (LSTM) | Java | 35,000 samples | B + F | Real | ✓ | ✓ | ✓
RLAssist [P9] | Syntactic | Token | DRL and RNN (LSTM) | C | 7,000 programs | B + F | Semireal | ✓ | ✓ | ✓
Liu 2019 [P10] | Semantic | AST, token | CNN and paragraph vector | Java | 2,000,000 methods | B + F | Real | — | ✓ | ✓
DeepDelta [P11] | Semantic | AST | RNN (LSTM) | Java | 4,800,000 builds | B + F | Real | ✓ | — | —
Tufano 2019 [P12] | Semantic | AST | RNN | Java | 2,300,000 fixes | B + F | Real | ✓ | ✓ | ✓
VarMisuseRepair [P13] | Semantic | Token | RNN (LSTM) and pointer network | Python | 650,000 functions | B + F | Real | ✓ | ✓ | —
DeepRepair [P14] | Semantic | AST | RNN | Java | 374 programs | B + F | Semireal | ✓ | ✓ | ✓
Hoppity [P15] | Semantic | AST | GNN and RNN (LSTM) | JavaScript | 500,000 program pairs | B + F | Real | ✓ | ✓ | ✓
SampleFix [P16] | Syntactic | Token | GAN, CVAE, and RNN (LSTM) | C | 7,000 programs | B + F | Semireal | ✓ | ✓ | —
DLFix [P17] | Semantic | AST | RNN (tree RNN) | Java | 4,900,000 methods | B + F | Real | ✓ | ✓ | ✓
Graph2Diff [P18] | Semantic | AST | GNN (GGNN) | Java | 500,000 fixes | B + F | Real | ✓ | — | —
DrRepair [P19] | Syntactic | Token, graph | GNN and RNN (LSTM) | C | 64,000 programs | B + F | Semireal | — | ✓ | ✓

Works That Detect Defects
Wang 2016 [P20] | Semantic | AST, graph | DBN | Java | 10 × 150–1,046 files | B + F | Real | ✓ | — | —
DP-CNN [P21] | Semantic | AST | CNN and logistic regression | Java | 7 × 330 files | B + F | Real | ✓ | ✓ | —
POSTER [P22] | Vulnerability | AST | RNN (BLSTM) | C | 6,000 functions | B + F | Real | — | ✓ | ✓
DeepBugs [P23] | Semantic | AST, graph | NN | JavaScript | 150,000 files | B + F | Semireal | ✓ | ✓ | ✓
VarMisuse [P24] | Semantic | AST, graph | GGNN and GRU | C# | 2.9 million LoC | B + NB | Real | — | ✓ | ✓
VulDeePecker [P25] | Vulnerability | Token | RNN (BLSTM) | C | 61,000 code gadgets | B + F | Synthetic | ✓ | ✓ | —
Russell 2018 [P26] | Vulnerability | Token | CNN, BoW, RNN, and random forest | C | 1.27 million functions | B + F | Real + synthetic | ✓ | ✓ | —
μVulDeePecker [P27] | Vulnerability | AST, graph | RNN (BLSTM) | C | 181,000 code gadgets | B + NB | Synthetic | ✓ | ✓ | —
Gupta 2019 [P28] | Semantic | AST | Tree CNN | C | 29 × 1,300 programs | B + F | Semireal | ✓ | ✓ | ✓
Habib 2019 [P29] | Syntactic | Token | RNN (BLSTM) | Java | 112 projects | B + F | Synthetic | ✓ | — | —
Li 2019 [P30] | Semantic | AST, graph | RNN (GRU) and CNN | Java | 4.9 million methods | B + F | Real | ✓ | ✓ | ✓
Project Achilles [P31] | Vulnerability | Token | RNN (LSTM) | Java | 44,495 programs | B + F | Synthetic | ✓ | ✓ | ✓
Li 2020 [P32] | Vulnerability | Graph, token | BoW and CNN | C | 60,000 samples | B + NB | Synthetic | — | — | —
VulDeeLocator [P33] | Vulnerability | AST, token | RNN (BRNN) | C | 120,000 program slices | B + NB | Synthetic | ✓ | ✓ | ✓
SinkFinder [P34] | Vulnerability | Graph, token | SVM | C | 15 million LoC | NB | Real | — | ✓ | —
OffSide [P35] | Vulnerability | AST | Attention NN | Java | 1.5 million code snippets | B + F | Semireal | ✓ | ✓ | ✓
AI4VA [P36] | Vulnerability | AST, graph | Graph NN | C | 1.95 million functions | B + F | Real + synthetic | ✓ | ✓ | ✓
Tanwar 2020 [P37] | Vulnerability | AST | NN | C | 1.27 million functions | B + F | Real | ✓ | — | —
Devign [P38] | Vulnerability | All | Graph NN | C | 48,000 commits | B + F | Real | ✓ | ✓ | ✓
Dam 2021 [P39] | Vulnerability | AST, token | RNN (LSTM) | Java | 18 × 46–3,450 files | B + NB | Real | ✓ | — | —
SySeVR [P40] | Vulnerability | AST, graph | RNN (BLSTM and BGRU) | C | 15,000 programs | B + F | Synthetic | ✓ | ✓ | ✓

RNN: recurrent NN; LSTM: long short-term memory; GAN: generative adversarial network; DRL: deep reinforcement learning; CNN: convolutional NN; GNN: graph NN; CVAE: conditional variational autoencoder; GGNN: gated GNN; DBN: deep belief network; BLSTM: bidirectional LSTM; GRU: gated recurrent unit; BoW: bag of words; BRNN: bidirectional RNN; SVM: support vector machine; BGRU: bidirectional GRU; NB: no bug; B: buggy; F: fixed.
The studies are first ordered chronologically and then alphabetically (by author name) within the top and bottom halves of the table.
In the “Method” column, we refer to the primary ML approach used in the tool. When a tool experiments with several approaches, we include all of them if they are presented and discussed equally and skip the ones mentioned only in passing.


Languages

Observation 6. The majority of the examined studies target C and Java, with only a few papers aimed at other languages. Within the examined works, five programming languages are supported: C/C++ (17), Java (16), Python (four), JavaScript (two) and C# (one). Several of the featured studies aimed to be language and syntax agnostic but were trained and tested only on a specific language. It is, however, commonly noted that such studies could be used on different languages through minimal changes to the models and by retraining on a suitable data set.

Observation 7. We see a nonuniform distribution of the goals across the examined languages both in terms of correction ability as well as targeted defect types. Looking at correction ability, we notice that the majority of C studies (12) only detect defects, while five can correct them. Java is more balanced, with seven detection and nine correction studies. JavaScript has one paper for correction [P15] and one for detection [P23]. All four Python studies are capable of correction. Finally, the one examined C# paper [P24] can detect defects. Overall, the two most common are defect detecting C studies (12) and defect correcting Java studies (nine). In terms of defect types, most of the C language studies target vulnerabilities (12), while the majority of Java papers target semantic defects (11). Python studies focus primarily on semantic defects (three), with one paper targeting syntactic defects. The two examined JavaScript studies as well as the only C# study target semantic defects. There are no Python, JavaScript, or C# papers that focus on security vulnerabilities. Similarly, no JavaScript or C# paper detects or corrects syntax defects.

Figure 1. A histogram of publications per year (error correction versus error detection studies). Note that 2020 and 2021 are merged as 2020+.

ML Approaches/Models

Observation 8. Both defect detection and correction studies increasingly rely on NNs. The most commonly used model is the recurrent NN (RNN). Defect correction studies heavily borrow from natural language translation, often referred to as neural machine translation or sequence-to-sequence translation. This means that the majority of the models comes from the same domain, more specifically, RNNs that appear 16 times out of 19 among defect correction papers. The most common method within the RNN family is long short-term memory (LSTM)—11 studies—which specifically targets the problem of long-term dependencies by enabling learning from context.

The most recent papers highlight the usefulness of NNs that are capable of understanding contexts since the presence of a defect can highly depend on that [P30]. Additionally, attention (focusing on the relevant parts of the code, depending on the context) helps such NNs learn long-distance relations to keep track of issues outside a narrow code segment. It is worth mentioning that despite perceived uniformity, most studies add their own spin to the method, leading to diverse final implementations.

Among defect detection papers, nine use RNNs, and four use convolutional NNs. Most of the remaining papers still rely on some member of the NN family [e.g., attention NNs, (gated) graph NNs, deep belief networks, and so on]. Similar to defect correction studies, methods that can learn from context, such as bidirectional LSTM—five papers—and the gated recurrent unit—three papers—are popular due to their ability to take into account both future and past contexts [P25].

There is only slightly more variety in the defect detection world, where the task can (but does not need to) be logically split in two: embedding/feature extraction and classification. While the former is mostly handled by a form of NN, the latter invites more experimentation. Some of the classification methods include logistic regression, bags of words, random forests, and support vector machines. Despite some outliers, the task of detection also seems to be heading in the NN direction. The analyzed papers commonly attribute this to the NN’s ability to operate without explicit feature formation, capacity to understand contexts and keep some form of memory over time, and suitability for handling texts and (a form of) language.

Figure 2. A co-occurrence graph of tool characteristics (correction ability, defect type, representation, and language).

Data Sets

Observation 9. There is large disparity among data sets in terms of data set size and data unit size. The sizes of the data sets range from hundreds to millions of data units. Data units themselves (i.e., the source code snippets fed into the model) also range from full program files to methods, functions, code gadgets, and similar paper-specific granularities. We notice that the granularity of data points commonly coincides with the output granularity at which the tool is capable of spotting defects.

Observation 10. There are significant differences in source code complexity, realism, and origin. On the one hand, we have real source code (18 studies), often collected from Github and open source projects. On the other hand, we find eight papers that use primarily synthetic data sets, which consist of shorter and cleaner code samples with “textbook” examples of errors. The remaining data sets fall somewhere in the middle, consisting of either real source code with artificially injected errors (four) or simple code segments and student assignments with genuine mistakes (eight). Two studies, Russell et al. [P26] and Suneja et al. [P36], separately train and evaluate on both real and synthetic data. Regardless of the realism, the studies often source their data from publicly available data sets and previous studies. Such data sets include the Software Assurance Reference Dataset, National Vulnerability Database, Juliet Test Suite, and Draper.

Table 7. The co-occurrence table (co-occurrence counts among correction ability, defect type, source code representation, language, and data set realism across the studied papers).

Observation 11. Correction tools mostly use real and semireal data, while detection tools use both real and synthetic data. Additionally, tools targeting vulnerabilities mostly employ synthetic data sets; semireal data are common with syntactic errors and real data with semantic errors. Among the correction tools, real and semireal data sets are used equally often—nine times—with only one study using synthetic data sets. Detection-only tools primarily use realistic data (nine), but synthetic data sets are also popular (seven). Semirealistic data are the least popular among the error detection tools, with only three occurrences.

We also notice distinct patterns of data sets used for different types of errors. Synthetic data are almost exclusively used in tools targeting vulnerabilities (seven). Semireal data are mostly harnessed in studies related to syntactic errors (six) and semantic errors (five) and less in studies related to vulnerabilities (one). Finally, real data are employed in semantic error (13) and vulnerability (five) studies but not in syntactic error studies.

Observation 12. The majority of data sets consists of bug fix pairs. We notice three distinct patterns in data set structure: data sets with bug fix pairs (31), data sets of unrelated buggy and nonbuggy examples (five), and data sets with no bugs (four). The latter are mostly used to teach a model the correct use of the language so that it is capable of discrimination and potentially translation when it encounters an unfamiliar code pattern. The remaining two patterns help teach the model examples of good and bad behavior. The difference is that for defect correction, it is valuable to have examples of concrete fixes for a buggy example. This is commonly achieved by either collecting version histories (commits with fixes) from public repositories or artificially injecting bugs into correct code. In the case of defect detection, it is not crucial to have such pairs, so several data sets include examples of bugs and correct code but not necessarily on the same piece of code.

Output and Performance

Observation 13. There is little uniformity among studies’ outputs in terms of granularity and error types a tool can target. We notice a significant variety of detection granularity, ranging from simple binary classification (buggy versus nonbuggy) to method, function, and specific lines of code. For example, Dam et al. [P39] focus on file-level detection, VulDeePecker [P25] works on code gadget granularity, and Project Achilles [P31] concentrates on methods. An interesting goal was set by Zou et al. [P27]: the authors attempted not only to recognize whether there was a vulnerability with fine granularity but also to determine the vulnerability type. There are similar differences among the correction studies, which range from single token correction all the way to full code sections, sometimes as a single-step fix or as a collection of smaller steps with some form of correction checking in between.

There are also differences in how many different error types a tool can handle. Some tools are trained and tested on a smaller set of vulnerability types, which makes them narrow but comparatively high performing. Examples of such tools include SinkFinder [P34], which looks for vulnerabilities in function pairs, such as lock/unlock; OffSide [P35], which focuses on boundary conditions; and VulDeePecker [P25], which targets buffer and resource management error vulnerabilities. On the other hand, some tools target a wide range of errors, potentially at some performance cost. SySeVR [P40], for example, targets 126 vulnerability types, and Project Achilles [P31] focuses on 29. It is worth mentioning that some tools train separate models for each type of error and evaluate a piece of code by passing it through each of the trained models separately to determine the probability of each of the vulnerabilities.

Observation 14. There are significant inconsistencies in the reporting of performance metrics. Studies using real data sets seem to perform worse than those using synthetic data sets. We find that the studies differ greatly in their reporting of performance. The most commonly reported metrics include recall (reported in some form by 22 studies), the F1 score (16), accuracy (15), and precision (11). While detection-only tools tend to be more diligent in their reporting, the correction tools more commonly frame their results simply as “we could fix x out of y errors” without providing more detail. We find additional inconsistencies even among the studies that report the same metrics: some relay only the best performance, others provide average values, and others convey the full range.

Taking all this into account, it is uninformative, if not misleading, to directly compare performance across the papers. However, setting aside all nuances, we can cautiously draw some rough patterns from the metrics reported. Specifically, we find that studies using synthetic data sets generally report higher metrics regardless of the other study properties (around 80–90% for all mentioned metrics), while studies using real and semireal data perform significantly worse (their accuracy and recall rarely exceed 60–70%), have wider ranges, and sometimes dip all the way down to 0–20%. Given the previously identified relations among correction ability, error type, and representation, the performance across those categories is also affected by the realism of the data set.
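Because the reviewed studies report different subsets of these metrics, it helps to recall how they all derive from the four confusion-matrix counts mentioned earlier. The short sketch below computes precision, recall, F1, and accuracy from invented, purely illustrative counts.

# Illustrative confusion-matrix counts for a detector evaluated on 1,000 snippets
# (the numbers are made up for the example, not taken from any reviewed study).
tp, fp, fn, tn = 60, 40, 30, 870

precision = tp / (tp + fp)              # fraction of flagged snippets that are truly buggy
recall = tp / (tp + fn)                 # fraction of buggy snippets that are flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} accuracy={accuracy:.2f}")
# Note how accuracy (0.93 here) can look strong even when precision is only 0.60,
# which is one reason reporting the raw TP/TN/FP/FN counts is more informative.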



An interesting insight into the effects of data set realism is provided by Russell et al. [P26] and Suneja et al. [P36], who train and test their pipelines separately on real and synthetic data sets. The two studies enable us to get a glimpse at the behavior of the same tool when faced with different types of source code. Both papers exhibit the same pattern we observed: the F1 score is significantly lower when realistic data are used. More specifically, the studies report F1 scores of 50–60% on real data and 70–90% for synthetic data.

Discussion
There are significant differences among the studies when it comes to the error types that are targeted, leading to different defect patterns and, consequently, representation choices. All these seem to determine whether a tool will be able to automatically correct found bugs or only detect them. Arguably, the simplest defect type to catch is a syntactic one, with vulnerabilities being the most challenging. Seeing that most of the correction tools address the former, while detection tools largely address the latter, we can assume that effective correction is more difficult to achieve. With several detection and correction tools targeting semantic defects, we speculate that such defects lie in the middle in terms of difficulty.

We can find additional support for such observations when looking at data set realism. Fully synthetic data sets are used primarily by vulnerability detection tools, suggesting that it is not yet possible to detect realistic vulnerabilities “in the wild.” It is worth noting that some of the vulnerability detection tools use real-world projects and successfully catch vulnerabilities, but this cannot be effectively done on a large scale and without a high number of false classifications.

Tools targeting syntactic errors use semirealistic data, in particular, simple code snippets written by students for beginner programming courses, containing genuine but simple mistakes. The use of such data sets seems only natural, as syntax problems are common with beginner programmers, who cannot yet catch and correct their mistakes. Finally, we find the use of real and semireal data with the tools aimed at semantic errors. The semireal data sets that were used mostly consist of realistic source code injected with artificial errors.

We see that the complexity of the used data set reflects common use cases as well as the complexity of the targeted error type, which is to be expected. For example, one does not expect to find many syntax bugs in the Linux kernel, nor does it make sense to look for complex vulnerabilities in a student program that does not even compile. It seems that the performance goes hand-in-hand with the realism of the code. Generally, we find better performance with tools using synthetic data, even when the goal is more challenging (e.g., dozens of different vulnerability types). A similar pattern has been documented by Chakraborty et al.7 More research is required to confirm such patterns, but present evidence highlights how crucial the use of appropriate, realistic, and well-labeled data is. The field should be wary of high-performance reports, especially when synthetic data sets are used, and work instead toward more realistic goals that will make tools practical in the real world.

Similar to the data sets, it is useful to consider the full picture when discussing tool output. It is not crucial to be given very specific output if the program consists of a dozen lines of code, whereas classifying a big project as vulnerable is next to useless if there is no way to determine where the problem lies. This is especially important for practical applications where the tools are applied on a large number of real-world projects. Overall, the importance of lower granularity and higher precision is recognized and often highlighted, with the trends moving toward more precise tools.

Patterns in source code representation seem to follow defect type patterns and, in turn, the detection and correction goals. We see that the defect correcting tools can achieve the intended goal through the use of simpler representations, while defect detecting tools use more advanced and combined representations. This further shows that tackling vulnerabilities and semantic defects is likely more challenging, so automatic correction on a large scale is not yet possible.

Sequence-of-tokens-based models are attractive because of their simplicity. They are especially useful for representing programs with syntactic defects in which constructing ASTs and control flow graphs is limited or not possible due to severe syntax problems. The similarity to natural language makes it an attractive choice in sequence-to-sequence models, where the goal is defect correction by translating a problematic sequence into a syntactically correct one.
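As a small illustration of this sequence-to-sequence framing, the sketch below turns a buggy/fixed pair into the source and target token sequences on which such a translation-style repair model would be trained. The example pair and the crude whitespace tokenizer are illustrative assumptions, not the preprocessing of any specific tool.

# One training example for translation-style repair: the buggy version is the
# source sequence and the fixed version is the target sequence.
buggy = "for (int i = 0; i <= n; i++) sum += a[i];"    # off-by-one boundary bug
fixed = "for (int i = 0; i < n; i++) sum += a[i];"

def to_tokens(code):
    # Crude whitespace tokenization; real tools use language-aware lexers.
    return code.replace("(", " ( ").replace(")", " ) ").replace(";", " ; ").split()

source_seq = to_tokens(buggy)
target_seq = ["<sos>"] + to_tokens(fixed) + ["<eos>"]

print(source_seq)
print(target_seq)
# A sequence-to-sequence model (e.g., an LSTM encoder-decoder) would be trained
# to map source_seq to target_seq over many such pairs.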


Overall, token-level representation is the most popular choice for defect correction tools. The challenge of this approach is the selection of the appropriate granularity and range of tokens. Depending on the type of bug that is targeted, a model can benefit either from simple, stand-alone tokens or from a grouped and more structured representation (code gadgets, functions, or some other syntactic or semantic unit).

Syntactic representation considers the ASTs of the source code, enabling a less flat view of the code. Such representations are larger in size and more difficult to construct but can capture lexical and syntactic code properties. They are often combined with recursive NNs and LSTM models. Their popularity lies mainly with defect detection tools, especially semantic defect and vulnerability detection. While ASTs are good at capturing the structure of the code, they do not capture semantics well and struggle with large and complex pieces of code.12 This is why ASTs are commonly complemented by semantic representations capturing data and control flow information. The ability of graph models to capture more advanced semantic properties of code reflects itself in the use cases: they appear almost exclusively in tools targeting semantic defects and vulnerabilities.
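As a small illustration of the syntactic view (a sketch only; the function and variable names are made up), Python's built-in ast module turns code into a tree rather than a flat token sequence, which downstream models can traverse or serialize:

    import ast

    src = "def read(buf, i):\n    return buf[i + 1]\n"
    tree = ast.parse(src)

    # The nested structure replaces the flat token view: FunctionDef -> Return
    # -> Subscript -> BinOp, and so on.
    print([type(node).__name__ for node in ast.walk(tree)])

    # A common preprocessing step is to serialize the tree (or paths through it)
    # before feeding it to a recursive NN or LSTM.
    print(ast.dump(tree.body[0]))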
Somewhat surprisingly, we observe a very unbalanced picture when it comes to the languages beyond C/C++ and Java. For example, we found that C#, JavaScript, and Python lack tools aimed at detecting and correcting vulnerabilities. A possible reason for the prevalence of C/C++ and Java is that these languages are popular, well studied, and have large, open databases of known defects (both bugs and security vulnerabilities). However, considering the ever-growing popularity of C#, JavaScript, and Python, it becomes very important to develop tools supporting them. This also extends to other popular languages that did not appear in the study.

A look at the ML methods highlights the fact that traditional ML approaches are more of a stepping stone toward a deep learning solution than solutions of their own. The reason likely lies in the fact that it is difficult to define the features that will sufficiently capture the semantics of the program. The main benefit of deep learning is its ability to ingest the source code itself (in an appropriate format) and create its own "features" to learn from.

Challenges and Future Directions

This article is motivated by the need to discover patterns in the rapidly evolving field of ML for source code. Some of the challenges toward effective solutions (Table 8) include access to and use of high-quality training data sets with realistic, representative, and correctly labeled data; effective source code representation capable of semantic understanding; standardization in terms of goals and reporting; detection and correction across domains; catching application-specific bugs (in regard to semantic defects); and high FP rates. We briefly elaborate on some of these challenges.

There is significant variety in terms of data sets, goals, testing, and performance reporting. We believe the field would benefit from some degree of standardization, potentially in the form of a curated collection of open source data sets, together with some uniform goals for each defect type along with a test suite and benchmarks. Since a tool's performance can heavily rely on the training data, stabilizing the data set would enable more precise evaluation of the tool itself rather than the training data. Such data sets would ideally consist of realistic source code with representative errors and high-quality labeling to increase the usability of the tools in the real world. The formalization, or at the very least, clear reporting, of goals (e.g., in terms of granularity and defect types) would also enable researchers in the field to get a clearer and more complete picture of the available tools.

Finally, there is a need for clearer and more complete reporting of performance. One step in the right direction could be the reporting of the four basic metrics (TP, TN, FP, and FN), which facilitate the calculation of the remaining metrics. However, at the end of the day, such metrics tell us little about the usability of a tool to its intended users, the developers, who should be more closely involved in the testing and evaluation. Future research in the domain should also consider expansions to other commonly used programming languages, improve defect localization precision, and provide a wider coverage of different defect types.
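As a reminder of how the remaining metrics follow from those four counts, here is a minimal sketch; the confusion-matrix numbers are invented purely for illustration:

    def derived_metrics(tp, tn, fp, fn):
        """Derive common evaluation metrics from the four basic counts."""
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        accuracy = (tp + tn) / (tp + tn + fp + fn)
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        return {"precision": precision, "recall": recall,
                "accuracy": accuracy, "f1": f1}

    # Hypothetical counts for a detector evaluated on 1,060 samples.
    print(derived_metrics(tp=80, tn=900, fp=60, fn=20))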

Table 8. The key takeaways.

Finding | Observation | Challenge
Missing detection or correction tools for some language–defect combinations | 2 and 7 | Expand correction and detection tools for all defect types
Variety of representation techniques but struggling to capture deeper properties of code; oversimplistic embeddings | 4, 3, and 5 | Advanced (semantic) representations and embeddings
Java and C/C++ most studied languages | 6 | Expand to more languages
Tool outputs not comparable | 13 and 14 | Formalize goals and metrics for tools and simplify output for developers
Vast differences in data sets and performance | 9, 10, and 11 | Collect, standardize high-quality, realistic, and representative data sets across all defect types and languages

As mentioned, effective representation seems to be an active area of research, with more comprehensive approaches emerging, especially in the form of graph representations. A common go-to method for tools that do not invest in novel approaches seems to be the word-to-vector technique,13 which is primarily a simple token embedding technique. One then wonders: Why bother with all the complex representations only to flatten everything at the end of the pipe? We are already seeing (and expect to see) a further rise in similar but more specialized x-to-vector-like vectorization techniques capable of capturing deeper properties of code and, as is the current trend, of overfitting to the particular data set that is used.
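For reference, the word-to-vector baseline13 amounts to something like the following sketch; it assumes the gensim library is available, and the toy corpus and hyperparameters are purely illustrative:

    from gensim.models import Word2Vec

    # Each "sentence" is a tokenized code fragment.
    corpus = [
        ["if", "x", ">", "0", ":", "return", "x"],
        ["for", "i", "in", "range", "(", "n", ")", ":", "total", "+=", "i"],
    ]
    model = Word2Vec(sentences=corpus, vector_size=64, window=5, min_count=1)

    token_vec = model.wv["return"]                              # one vector per token
    snippet_vec = sum(model.wv[t] for t in corpus[0]) / len(corpus[0])
    # Averaging token vectors is exactly the "flattening" step that discards
    # whatever structure an AST or graph front end may have recovered.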
Closely related to source code representation is the challenge of semantic understanding. A tool's ability to detect more complex semantic defects and vulnerabilities depends on its understanding of the source code. While syntax is finite, well defined, and therefore easier to understand and capture, the semantics of programs are harder to capture. As more tools attempt to tackle complex types of defects, the need for advanced representation will further increase. In this respect, graph-based representations capable of capturing complex characteristics of the analyzed programs seem particularly promising.
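A sketch of what such a graph-based view can look like is given below; it is only an illustration in the spirit of code property graphs,10 assumes the networkx library, and uses made-up node labels for the data-flow edge:

    import ast
    import networkx as nx

    src = "x = source()\nsink(x)\n"
    tree = ast.parse(src)

    graph = nx.DiGraph()
    # Syntactic backbone: one edge per parent-child pair in the AST.
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            graph.add_edge(id(parent), id(child), kind="ast")

    # Semantic overlay: a hand-added data-flow edge recording that the value
    # assigned to x reaches its use inside sink(x).
    graph.add_edge("def:x", "use:x", kind="dataflow")

    print(graph.number_of_nodes(), graph.number_of_edges())

In a real pipeline, the control and data-flow edges would be computed by static analysis rather than added by hand.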
Finally, the relatively small number of tools working with unlabeled data points shows that this is still a largely unexplored direction. It comes with the challenge of unsupervised learning but, at the same time, unlocks access to large data sets of unlabeled corpora, eliminating the need for synthetic bug introduction and manual self-labeling.

Acknowledgments
This work was partly supported by the European Union, under the Horizon 2020 research and innovation program (Assurance and Certification in Secure Multi-Party Open Software and Services project grant 952647), and The Netherlands' Sectorplan program.

References
1. H. Shen, J. Fang, and J. Zhao, "EFindBugs: Effective error ranking for FindBugs," in Proc. 2011 4th IEEE Int. Conf. Softw. Testing, Verification Validation, pp. 299–308, doi: 10.1109/ICST.2011.51.
2. I. Pashchenko, R. Scandariato, A. Sabetta, and F. Massacci, "Secure software development in the era of fluid multi-party open software and services," in Proc. 2021 IEEE/ACM 43rd Int. Conf. Softw. Eng., New Ideas Emerg. Results (ICSE-NIER), pp. 91–95, doi: 10.1109/ICSE-NIER52604.2021.00027.
3. M. Christakis and C. Bird, "What developers want and need from program analysis: An empirical study," in Proc. 31st IEEE/ACM Int. Conf. Autom. Softw. Eng., 2016, pp. 332–343, doi: 10.1145/2970276.2970347.
4. M. Monperrus, "Automatic software repair: A bibliography," ACM Comput. Surv., vol. 51, no. 1, pp. 1–24, 2018, doi: 10.1145/3105906.
5. J. Li, P. He, J. Zhu, and M. Lyu, "Software defect prediction via convolutional neural network," in Proc. Int. Conf. Softw. Quality, Rel. Secur., 2017, pp. 318–328, doi: 10.1109/QRS.2017.42.
6. C. L. Goues, M. Pradel, and A. Roychoudhury, "Automated program repair," Commun. ACM, vol. 62, no. 12, pp. 56–65, 2019, doi: 10.1145/3318162.
7. S. Chakraborty, R. Krishna, Y. Ding, and B. Ray, "Deep learning based vulnerability detection: Are we there yet?" IEEE Trans. Softw. Eng., vol. 1, no. 1, pp. 1–17, 2021, doi: 10.1109/TSE.2021.3087402.
8. Z. Shen and S. Chen, "A survey of automatic software vulnerability detection, program repair, and defect prediction techniques," Secur. Commun. Netw., vol. 2020, no. 1, pp. 1–16, 2020, doi: 10.1155/2020/8858010.
9. F. E. Allen, "Control flow analysis," ACM SIGPLAN Notices, vol. 5, no. 7, pp. 1–19, 1970, doi: 10.1145/390013.808479.
10. F. Yamaguchi, N. Golde, D. Arp, and K. Rieck, "Modeling and discovering vulnerabilities with code property graphs," in Proc. 2014 IEEE Symp. Secur. Privacy, pp. 590–604, doi: 10.1109/SP.2014.44.
11. M. Allamanis, E. Barr, P. Devanbu, and C. Sutton, "A survey of machine learning for big code and naturalness," ACM Comput. Surv., vol. 51, no. 4, pp. 1–37, 2018, doi: 10.1145/3212695.
12. J. Zhang, X. Wang, H. Zhang, H. Sun, K. Wang, and X. Liu, "A novel neural source code representation based on abstract syntax tree," in Proc. 2019 IEEE/ACM 41st Int. Conf. Softw. Eng. (ICSE), pp. 783–794, doi: 10.1109/ICSE.2019.00086.
13. T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Proc. 26th Int. Conf. Neural Inf. Process. Syst., 2013, pp. 3111–3119, doi: 10.5555/2999792.2999959.

Tina Marjanov is a research assistant at the University of Cambridge, Cambridge, CB3 0FD, The United Kingdom. Her research interests include cybersecurity, privacy, and machine learning. Marjanov received an M.S. in computer science from Vrije Universiteit Amsterdam and the University of Amsterdam. Contact her at marjanov.tina@gmail.com.


Ivan Pashchenko is a security manager at TomTom, Amsterdam, 1011 AC, The Netherlands. His research interests include threat intelligence, open source software security, and machine learning for security. Pashchenko received a Ph.D. from the University of Trento. In 2017, he was awarded a Second Place Silver Medal at the Association for Computing Machinery/Microsoft Student Research competition in the graduate category. He was the UniTrento main contact in the Continuous Analysis and Correction of Secure Code work package for the Horizon 2020 Assurance and Certification in Secure Multi-Party Open Software and Services project. Contact him at ivan.pashchenko@tomtom.com.

Fabio Massacci is a professor at the University of Trento, Trento, 38123, Italy, and Vrije Universiteit Amsterdam, Amsterdam, 1081 HV, The Netherlands. His current research interest is in empirical methods for cybersecurity of sociotechnical systems. Massacci received a Ph.D. in computing from the Sapienza University of Rome. He participates in the Cyber Security for Europe pilot and leads the Horizon 2020 Assurance and Certification in Secure Multi-Party Open Software and Services project. For his work on security and trust in sociotechnical systems, he received the Ten Year Most Influential Paper Award at the 2015 IEEE International Requirements Engineering Conference. He is a Member of IEEE. Contact him at fabio.massacci@ieee.org.
