Code for the paper - Source Code Vulnerability Detection: Combining Code Language Models and Code Property Graph
In this work, we propose Vul-LMGNN, a unified model that combines pre-trained code language models with code property graphs for code vulnerability detection. Vul-LMGNN constructs a code property graph (CPG), then leverages a pre-trained code language model to extract local semantic features as node embeddings in the graph. Furthermore, we introduce a gated code Graph Neural Network (GNN). By jointly training the code language model and the gated code GNN modules, Vul-LMGNN efficiently leverages the strengths of both mechanisms. Finally, we use a pre-trained CodeBERT as an auxiliary classifier. The proposed method demonstrates superior performance compared to six state-of-the-art approaches.
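As a rough illustration of the gated message-passing idea (not the paper's actual implementation), a single gated GNN step can be sketched with NumPy, using random vectors to stand in for the CodeBERT node embeddings; all names and weight shapes here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
num_nodes, dim = 4, 8

# Stand-ins for the node embeddings Vul-LMGNN would obtain from the code LM.
h = rng.standard_normal((num_nodes, dim))

# Adjacency of a toy code property graph (directed edges 0->1->2->3).
A = np.zeros((num_nodes, num_nodes))
A[0, 1] = A[1, 2] = A[2, 3] = 1.0

W_msg = rng.standard_normal((dim, dim)) * 0.1       # message transform
W_z = rng.standard_normal((2 * dim, dim)) * 0.1     # update-gate weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_step(h):
    m = A @ h @ W_msg                                 # aggregate neighbor messages
    z = sigmoid(np.concatenate([h, m], axis=1) @ W_z) # update gate in (0, 1)
    return (1 - z) * h + z * np.tanh(m)               # gated state update

h_new = gated_step(h)
print(h_new.shape)  # (4, 8)
```

The gate `z` interpolates between keeping each node's previous state and absorbing the aggregated neighbor message, which is the core mechanism that distinguishes gated GNN variants from plain graph convolutions.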
Create an environment and install the required packages for LMGGNN:
- `transformers==3.3.1`
The experiments were executed on a single NVIDIA A100 80GB GPU, with NVIDIA driver version 525.85.12 and CUDA version 11.8.
We evaluated the performance of our model using four publicly available datasets. The composition of the datasets is as follows, and you can click on the dataset names to download them. Please note that you need to modify the code in the CPG_generator function in run.py to adapt to different dataset formats.
| Dataset | #Vulnerable | #Non-Vulnerable | Source |
|---|---|---|---|
| DiverseVul | 18,945 | 330,492 | Snyk, Bugzilla |
| Devign | 11,888 | 14,149 | GitHub |
| VDSIC | 82,411 | 1,191,955 | GitHub, Debian |
| ReVeal | 1,664 | 16,505 | Chromium, Debian |
- Modifications to the `configs.json` structure should also be updated in the `configs.py` script.
- Joern processing may be slow or potentially freeze your OS, depending on your system's specs. To prevent this, reduce the chunk size processed during CPG generation by adjusting the `"slice_size"` value in the `"create"` section of the `configs.json` file.
- Nodes exceeding the configured `"slice_size"` limit will be filtered out and discarded.
- Follow the instructions on Joern's documentation page and install Joern's command-line tools under `'project'\joern\joern-cli\`.
- You can find the implementation code of the baselines mentioned in the paper in `baselines.zip`, which consists of four Jupyter notebooks.
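The `"create"` section referenced above might look like the following sketch; only the `"slice_size"` key is named by this README, and the value shown here is an illustrative placeholder, not a recommendation:

```json
{
  "create": {
    "slice_size": 100
  }
}
```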
python run.py -cpg -embed -mode train -path /your/model/path
`-cpg` and `-embed` respectively trigger Joern extraction of the code's CPG and generation of the corresponding embeddings. `-path` specifies the path for saving the model.
python run.py -mode test -path /your/model/saved/path
`-mode` specifies whether only the training process is executed or both the training and testing processes are performed. `-path` specifies the path of the saved model.
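The flags above suggest an `argparse` setup along these lines (a hypothetical sketch of the CLI, not the repo's actual `run.py`):

```python
import argparse

def build_parser():
    # Hypothetical reconstruction of run.py's CLI from the flags documented above.
    p = argparse.ArgumentParser(description="Vul-LMGNN pipeline")
    p.add_argument("-cpg", action="store_true",
                   help="extract code property graphs with Joern")
    p.add_argument("-embed", action="store_true",
                   help="generate node embeddings for the extracted CPGs")
    p.add_argument("-mode", choices=["train", "test"], default="train",
                   help="run training only, or training plus testing")
    p.add_argument("-path", required=True,
                   help="where to save (or load) the model")
    return p

args = build_parser().parse_args(["-cpg", "-embed", "-mode", "train", "-path", "out/model"])
print(args.cpg, args.mode)  # True train
```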
This command is used to fine-tune CodeBERT on a specific dataset and then generate embeddings for the graph nodes. Pre-trained CodeBERT weights need to be downloaded from here.
python fine-tune.py
Here only the accuracy results are displayed; for other metrics, please refer to the paper.
| Model | DiverseVul | VDSIC | Devign | ReVeal |
|---|---|---|---|---|
| BERT | 91.99 | 79.41 | 60.58 | 86.88 |
| CodeBERT | 92.40 | 83.13 | 64.80 | 88.64 |
| GraphCodeBERT | 92.96 | 83.98 | 64.80 | 89.25 |
| TextCNN | 92.16 | 66.54 | 60.38 | 85.43 |
| TextGCN | 91.50 | 67.55 | 60.47 | 87.25 |
| Devign | 70.21 | 59.30 | 57.66 | 65.47 |
| Ours | 93.06 | 84.38 | 65.70 | 90.80 |
Parts of the code for data preprocessing and graph construction using Joern are adapted from Devign. We appreciate their excellent work!