📜 Overview | 📚 Dataset | 💻 Experiments | 📝 Citation
- (09/05/24) 🌟 We released PrimeVul-v0.1 to include more metadata for vulnerabilities. Check what's new!
- (07/02/24) 🎉 Our paper has been accepted to ICSE 2025 during the first submission cycle!
- (06/14/24) 🚀 PrimeVul has been used to evaluate Gemini-1.5 for vulnerability detection!
- (03/27/24) We released our paper, data, and code for experiments.
PrimeVul is a new dataset for vulnerability detection, designed for training and evaluating code language models in realistic vulnerability detection settings.
- ✨ Diverse and Sufficient Data: ~7k vulnerable functions and ~229k benign functions from real-world C/C++ projects, covering 140+ CWEs.
- ✨ Accurate Labels: Novel labeling techniques achieve human-level labeling accuracy, up to 3× more accurate than existing automatic labeling approaches.
- ✨ Minimal Contamination: Thorough data de-duplication and chronological data splits minimize data contamination.
- ✨ Realistic Tradeoff w/ VD-Score: Measures the risk of missing security flaws while not overwhelming developers with false alarms (a sketch of this metric follows the list).
- ✨ Exposing Models' Weaknesses w/ Paired Samples: Analyzes models' ability to capture subtle vulnerable patterns and to distinguish vulnerable code from its patch.
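For intuition, below is a rough, unofficial sketch of an FNR-at-capped-FPR computation in the spirit of VD-Score: the detection threshold is loosened only while the false positive rate stays within a tolerated budget, and the false negative rate at that point is reported. The function name, score format, and the 0.5% cap are illustrative assumptions; the repository's `calc_vd_score.py` (see Experiments below) is the actual implementation.

```python
import numpy as np

def vd_score(labels, scores, fpr_cap=0.005):
    """Sketch: false negative rate at the loosest threshold whose
    false positive rate stays within fpr_cap (0.5% is illustrative)."""
    labels, scores = np.asarray(labels), np.asarray(scores)
    n_neg, n_pos = (labels == 0).sum(), (labels == 1).sum()
    fnr = 1.0
    # Sweep thresholds from strict to loose; FPR only grows along the way.
    for t in np.sort(np.unique(scores))[::-1]:
        preds = scores >= t
        if (preds & (labels == 0)).sum() / max(n_neg, 1) > fpr_cap:
            break  # false-alarm budget exhausted
        fnr = ((~preds) & (labels == 1)).sum() / max(n_pos, 1)
    return fnr

# Example: three vulnerable (1) and three benign (0) samples with scores.
print(vd_score([1, 1, 1, 0, 0, 0], [0.9, 0.8, 0.2, 0.7, 0.1, 0.05]))
```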
To facilitate future research and evaluation with PrimeVul, we retrieved more metadata so that users can customize their training and evaluation (e.g., by including more context) according to their needs. Specifically, we retrieved the following key attributes (a small inspection example follows the list):
- ⭐ Commit Metadata: We add (1) the project URL, (2) the commit URL, and (3) the commit message corresponding to each PrimeVul vulnerability. Users can thus easily locate the original commit for a sample and retrieve project-level context.
- ⭐ Vulnerability Metadata: We add metadata for the vulnerabilities: (1) the CVE description and (2) the NVD link to the vulnerability. With this information, users can perform in-depth analysis (manually or programmatically) of the included vulnerabilities.
- ⭐ File-level Metadata: We retrieve file-level information for the functions in PrimeVul. Specifically, we include (1) the name of the file the function belongs to, (2) the file's relative path in the project, (3) the location of the function in the original file, and (4) a copy of the whole file. With this information, users can work with the file-level context of PrimeVul samples offline.
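To check which of these attributes a particular sample carries, it is enough to dump a record's keys. A minimal sketch, assuming the data is stored locally as JSONL (the path below is an example):

```python
import json

# Print the metadata fields attached to the first record of a split.
with open("primevul_train.jsonl") as f:   # example local path
    sample = json.loads(next(f))
print(sorted(sample.keys()))  # commit, vulnerability, and label fields
```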
How to Use: While commit and vulnerability metadata are saved directly as part of each JSON object, file-level metadata is saved separately in file_info.json and file_contents/. Using a sample's func_hash, its file information can be found in file_info.json, which also provides the path to the local copy of the whole file.
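A minimal lookup might therefore proceed as in the sketch below. Note that `func_hash`, `file_info.json`, and `file_contents/` come from the description above, while the field names inside each `file_info.json` entry (e.g., the path field) are assumptions to be checked against the actual schema.

```python
import json

# Look up file-level metadata for one sample via its func_hash.
with open("file_info.json") as f:
    file_info = json.load(f)

with open("primevul_train.jsonl") as f:   # example local path
    sample = json.loads(next(f))

entry = file_info[sample["func_hash"]]
print(entry)  # file name, relative path in the project, function location, ...

# The entry also points at the local copy of the whole file under
# file_contents/; "file_path" here is a hypothetical field name.
with open(entry["file_path"]) as f:
    whole_file = f.read()
```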
> **Note**
> PrimeVul is a dataset that combines and reconstructs existing vulnerability detection datasets with more accurate labels and a more thorough evaluation. However, not all source datasets provide sufficient resources to retrieve the metadata (e.g., some samples do not originally have a CWE type or CVE number), so we could not retrieve the same metadata for every sample in PrimeVul. PrimeVul-v0.1 only includes vulnerabilities whose metadata we successfully retrieved. For the full set of samples originally used in the paper, please refer to the original release below.
```bash
conda env create -f environment.yml
```
```bash
# Fine-tune a code LM for binary vulnerability classification
cd os_expr;
PROJECT="primevul_cls"
TYPE=<MODEL_TYPE>
MODEL=<HUGGINGFACE_MODEL>
TOKENIZER=<HUGGINGFACE_TOKENIZER>
OUTPUT_DIR=../output/
python run_ft.py \
--project ${PROJECT} \
--model_dir ${MODEL} \
--output_dir=${OUTPUT_DIR} \
--model_type=${TYPE} \
--tokenizer_name=${TOKENIZER} \
--model_name_or_path=${MODEL} \
--do_train \
--do_test \
--train_data_file=<PATH_TO_primevul_train.jsonl> \
--eval_data_file=<PATH_TO_primevul_valid.jsonl> \
--test_data_file=<PATH_TO_primevul_test.jsonl> \
--epoch 10 \
--block_size 512 \
--train_batch_size 64 \
--eval_batch_size 128 \
--learning_rate 2e-5 \
--warmup_steps 1000 \
--max_grad_norm 1.0 \
--evaluate_during_training \
--seed 123456
cd ..;
```

```bash
# Fine-tune with a class-weighted loss (the vulnerable class is up-weighted via --vul_weight)
cd os_expr;
WEIGHT=30
PROJECT="primevul_cls_weights_${WEIGHT}"
TYPE=<MODEL_TYPE>
MODEL=<HUGGINGFACE_MODEL>
TOKENIZER=<HUGGINGFACE_TOKENIZER>
OUTPUT_DIR=../output/
python run_ft.py \
--project ${PROJECT} \
--model_dir ${MODEL} \
--output_dir=${OUTPUT_DIR} \
--model_type=${TYPE} \
--tokenizer_name=${TOKENIZER} \
--model_name_or_path=${MODEL} \
--do_train \
--do_test \
--train_data_file=<PATH_TO_primevul_train.jsonl> \
--eval_data_file=<PATH_TO_primevul_valid.jsonl> \
--test_data_file=<PATH_TO_primevul_test.jsonl> \
--epoch 10 \
--block_size 512 \
--train_batch_size 64 \
--eval_batch_size 128 \
--learning_rate 2e-5 \
--class_weight \
--vul_weight $WEIGHT \
--warmup_steps 1000 \
--max_grad_norm 1.0 \
--evaluate_during_training \
--seed 123456
cd ..;
```

```bash
# Fine-tune with contrastive learning; add --clr_mask (below) to enable CA-CLR
cd os_expr;
PROJECT="primevul_cls_clr"
TYPE=<MODEL_TYPE>
MODEL=<HUGGINGFACE_MODEL>
TOKENIZER=<HUGGINGFACE_TOKENIZER>
OUTPUT_DIR=../output/
python run_ft.py \
--project ${PROJECT} \
--model_dir ${MODEL} \
--clr_mask \
--clr_temp 1.0 \
--output_dir=${OUTPUT_DIR} \
--model_type=${TYPE} \
--tokenizer_name=${TOKENIZER} \
--model_name_or_path=${MODEL} \
--do_train \
--do_test \
--train_data_file=<PATH_TO_primevul_train.jsonl> \
--eval_data_file=<PATH_TO_primevul_valid.jsonl> \
--test_data_file=<PATH_TO_primevul_test.jsonl> \
--epoch 10 \
--block_size 512 \
--group_size 32 \
--train_batch_size 1 \
--eval_batch_size 1 \
--learning_rate 2e-5 \
--warmup_steps 1000 \
--max_grad_norm 1.0 \
--evaluate_during_training \
--seed 123456
cd ..;
```

For 7B models, which require more expensive computation, we implement the training differently from the <7B models:
- Install Hugging Face Accelerate.
- To train CodeGen2.5, use transformers==4.33.0; otherwise, there will be an error while loading the tokenizer. See this issue for further details.
- To train StarCoder2, please use the newest transformers.
- Configure Accelerate: run `accelerate config` and choose the configuration based on your needs. The configuration we used is shown in `os_expr/default_config.yaml`.

```bash
# Multi-GPU fine-tuning with Hugging Face Accelerate (used for the 7B models)
PROJECT="parallel_primevul_cls"
TYPE=<MODEL_TYPE>
MODEL=<HUGGINGFACE_MODEL>
TOKENIZER=<HUGGINGFACE_TOKENIZER>
OUTPUT_DIR=../output/
accelerate launch run_ft_accelerator.py \
--project ${PROJECT} \
--model_dir ${MODEL} \
--output_dir=${OUTPUT_DIR} \
--model_type=${TYPE} \
--tokenizer_name=${TOKENIZER} \
--model_name_or_path=${MODEL} \
--do_train \
--do_test \
--train_data_file=<PATH_TO_primevul_train.jsonl> \
--eval_data_file=<PATH_TO_primevul_valid.jsonl> \
--test_data_file=<PATH_TO_primevul_test.jsonl> \
--gradient_accumulation_steps 4 \
--epoch 10 \
--block_size 512 \
--train_batch_size 8 \
--eval_batch_size 16 \
--learning_rate 2e-5 \
--warmup_steps 1000 \
--max_grad_norm 1.0 \
--evaluate_during_training \
--seed 123456
```

```bash
# Prompt an OpenAI model for binary classification (std_cls) with few-shot examples
cd openai_expr;
MODEL=gpt-3.5-turbo-0125 # or gpt-4-0125-preview
PROMPT_STRATEGY="std_cls";
python run_prompting.py \
--model $MODEL \
--prompt_strategy $PROMPT_STRATEGY \
--data_path <PATH_TO_primevul_test_paired.jsonl> \
--output_folder ../output_dir \
--temperature 0.0 \
--max_gen_length 1024 \
--seed 1337 \
--logprobs \
--fewshot_eg
cd ..;
```

```bash
# Chain-of-thought prompting
cd openai_expr;
MODEL=gpt-3.5-turbo-0125 # or gpt-4-0125-preview
PROMPT_STRATEGY=cot;
python run_prompting.py \
--model $MODEL \
--prompt_strategy $PROMPT_STRATEGY \
--data_path <PATH_TO_primevul_test_paired.jsonl> \
--output_folder ../output_dir \
--temperature 0.0 \
--max_gen_length 1024 \
--seed 1337
cd ..;
```

```bash
# Compute VD-Score from a model's predictions
python calc_vd_score.py \
--pred_file <OUTPUT_DIR>/predictions.txt \
--test_file <PATH_TO_primevul_test.jsonl>
```

```bibtex
@article{ding2024primevul,
title={Vulnerability Detection with Code Language Models: How Far Are We?},
author={Yangruibo Ding and Yanjun Fu and Omniyyah Ibrahim and Chawin Sitawarin and Xinyun Chen and Basel Alomair and David Wagner and Baishakhi Ray and Yizheng Chen},
journal={arXiv preprint arXiv:2403.18624},
year={2024}
}
```