
VulHierGGNN: A Hierarchical Deep Learning Framework for Multi-Type Software Vulnerability Classification Using Description and Source Code

Harsh Vardhan (workwithvardhan@gmail.com), Jatin (jatin17092003@gmail.com), Kalash Kumar (kumar.kalash2022@gmail.com), Kanishk Teotia (kanishkteotia5077@gmail.com), Birendra Kumar Verma (birendraverma@jssaten.ac.in)
Information Technology, JSS Academy of Technical Education, Noida, India

ABSTRACT: Ensuring software security necessitates the precise identification of vulnerabilities within code. Traditional detection methods often emphasize semantic representations while overlooking crucial syntactic structures and contextual insights from vulnerability descriptions. This research introduces VulHierGGNN, an innovative framework that combines Code Property Graphs (CPGs) with a Gated Graph Neural Network (GGNN) encoder, trained using supervised contrastive learning to effectively capture both semantic and structural information from source code. In parallel, textual descriptions of vulnerabilities are encoded using BERT to incorporate contextual understanding. These embeddings are then fused to enhance classification capabilities. The proposed two-stage approach, initial contrastive pretraining followed by hierarchical fine-tuning, demonstrates notable improvements in accuracy and robustness when evaluated on the CVEfixes dataset. Empirical results show that the BERT-based description model achieves an F1-score and accuracy of 98%, the CodeBERT-based code model attains 91%, and the integrated model delivers an overall accuracy of 93%, outperforming models that rely solely on semantic contrastive learning.

Keywords: Software Vulnerability, BERT, CodeBERT, Deep Learning, Multiclass Classification, Source Code Analysis, Hierarchical Framework

1. INTRODUCTION
In today’s interconnected digital environment, software vulnerabilities pose critical risks to system
integrity, data confidentiality, and overall security. These weaknesses, if left unidentified, can be exploited
by malicious actors, leading to severe consequences including data breaches, financial losses, and system
disruptions. Traditional vulnerability detection tools often rely on static rules or manual analysis, which are
time-consuming and fail to scale with the increasing complexity of modern software.

Recent advancements in deep learning, particularly transformer-based models, have opened new avenues
for understanding both natural language and programming languages. BERT, a pre-trained language model,
has demonstrated significant success in a variety of NLP tasks due to its ability to learn deep contextual
relationships in text. Similarly, CodeBERT extends this paradigm to source code, enabling the extraction
of meaningful representations from code syntax and semantics.

This research introduces VulHierGGNN, a hierarchical framework that combines the power of BERT for
analyzing vulnerability descriptions with CodeBERT for modeling the corresponding source code. The
objective is to enhance the classification of software vulnerabilities across multiple categories, thereby
supporting more intelligent and automated security assessment. By integrating both natural language and
code-based features, this approach aims to achieve high precision and recall in vulnerability classification,
offering an effective step toward robust software security solutions.

2. LITERATURE REVIEW

2.1 Traditional Systems for Vulnerability Detection


Conventional techniques for identifying software vulnerabilities have largely depended on rule-based
static analysis and heuristic methods. These tools inspect source code without executing it, attempting to
detect flaws such as buffer overflows, SQL injections, and memory leaks using predefined patterns. Static
analyzers like Fortify and Flawfinder are commonly used but often generate a high number of false
positives due to their limited contextual understanding. Moreover, they struggle to adapt to evolving
coding styles and emerging vulnerability types. While useful for known vulnerability signatures, these
systems lack the flexibility to generalize beyond their hardcoded rules [3].

2.2 Deep Learning and Modern Transformer-Based Systems

The introduction of deep learning brought a paradigm shift in vulnerability detection by enabling models
to automatically learn features from large datasets. Recurrent neural networks (RNNs) and convolutional
neural networks (CNNs) were initially applied to source code analysis but were limited by their inability
to capture long-range dependencies. The advent of transformer-based models, particularly BERT, marked
a significant breakthrough. Pre-trained on vast text corpora, BERT has shown remarkable performance in
natural language processing tasks, including the analysis of vulnerability descriptions [1].

In the realm of code, CodeBERT extended the transformer architecture to programming languages.
Trained on paired code and natural language, it can model the semantics and syntax of code more
effectively than earlier models, enhancing the accuracy of code classification and retrieval tasks [2].

2.3 Hybrid and Hierarchical Frameworks

Recent studies have explored the benefits of combining natural language understanding with code
semantics for more robust vulnerability classification. Hybrid models aim to utilize both textual
descriptions and code representations to capture the full context of a vulnerability. Hierarchical approaches
build on this concept by organizing the input data into layers, often using one model (e.g., BERT) for text
and another (e.g., CodeBERT) for code, then integrating their outputs to make a unified prediction. Such
frameworks have demonstrated improved precision and recall by leveraging complementary data
modalities [4].

The proposed VulHierGGNN aligns with this direction, introducing a structured method that incorporates
both BERT and CodeBERT to classify software vulnerabilities across multiple categories.

2.4 Advancement over Existing Work

The landscape of software vulnerability detection has evolved through the adoption of various
machine learning and deep learning approaches. Early systems emphasized code-level analysis,
while modern frameworks incorporate natural language understanding. Despite these advancements, a
gap remains in combining both vulnerability descriptions and source code representations for precise
multi-type classification. The proposed VulHierGGNN addresses this by unifying BERT and CodeBERT
to extract complementary semantic and structural features, resulting in enhanced performance.

Table 1. Existing work on different methods of vulnerability classification

1. Devlin et al. [1] (2019), "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". Objective: learn deep bidirectional representations from unlabeled text for NLP tasks. Findings: achieved state-of-the-art results on multiple NLP benchmarks. Limitation: not applicable to code or structured software inputs.

2. Feng et al. [2] (2020), "CodeBERT: A Pre-Trained Model for Programming and Natural Languages". Objective: pre-train a model on code and natural language for tasks like code search and classification. Findings: outperforms prior models on code-language tasks such as NL-code retrieval. Limitation: needs task-specific fine-tuning and lacks graph-level structure awareness.

3. Li et al. [4] (2021), "SySeVR: A Framework for Using Deep Learning to Detect Software Vulnerabilities". Objective: extract code gadgets and use CNNs to detect vulnerabilities. Findings: improved binary classification accuracy on standard datasets. Limitation: ignores textual vulnerability descriptions; focused only on code.

4. Z. Li et al. [5] (2018), "VulDeePecker: A Deep Learning-Based System for Vulnerability Detection". Objective: use BLSTM networks on manually extracted code gadgets. Findings: first to apply deep learning to vulnerability detection using code patterns. Limitation: lacks support for multi-type classification; requires manual preprocessing.

5. Alshahwan et al. [6] (2022), "DeepVD: Deep Learning-Based Vulnerability Detection with Word2Vec". Objective: apply CNNs on tokenized code using Word2Vec embeddings. Findings: achieved reasonable binary detection accuracy. Limitation: ignores semantic context from vulnerability reports.

6. Kumar and Tripathi [7] (2023), "DeKeDVer: A Deep Learning-Based Multi-Type Software Vulnerability Classification". Objective: use Text-RCNN and RGAT on vulnerability descriptions and code. Findings: achieved up to 91.4% accuracy in multi-class classification. Limitation: uses outdated architectures; lacks transformer-based contextual learning.

7. Xu et al. [8] (2025), "Vul-LMGNN: Multi-view Graph Neural Network for Software Vulnerability Detection". Objective: use multiple code views (AST, CFG, DFG) within a GNN framework. Findings: demonstrated improved detection performance using multi-view graphs. Limitation: does not incorporate natural language descriptions of vulnerabilities.

8. Liu et al. [9] (2024), "SCL-CVD". Objective: leverage supervised contrastive learning with structural features. Findings: applies supervised contrastive learning with GraphCodeBERT. Limitation: captures some structure but underutilizes CPGs.

9. Zhou et al. [10] (2019), "Devign". Objective: detect vulnerabilities using graph-based learning with a GGNN. Findings: applies GGNNs on code graphs for vulnerability detection. Limitation: no pretraining; lacks multi-modal fusion.

10. Zhang et al. [11] (2022), "HSVC". Objective: use the CWE hierarchy for structured multi-class vulnerability classification. Findings: leverages the hierarchical structure of CWE-IDs for multi-class classification. Limitation: does not use a two-step (binary then type) classification approach.
3. PROPOSED WORK
3.1 System overview and architecture
This research introduces VulHierGGNN, a dual-path architecture for hierarchical vulnerability detection
in source code. The system leverages two independent branches: one processes source code into Code
Property Graphs (CPGs) and generates structural embeddings using a Gated Graph Neural Network
(GGNN), while the other encodes textual vulnerability descriptions using BERT. Unlike previous multi-
modal approaches that merge embeddings at the feature level, our architecture maintains these two
representations separately and instead fuses their individual classification outputs through an additional
decision-level classifier. This final classifier integrates insights from both structural code analysis and
vulnerability semantics to improve prediction reliability. Furthermore, we employ a hierarchical
classification framework, beginning with a binary classifier to identify whether a function is vulnerable,
followed by a CWE-type classifier applied only to positively identified samples. To strengthen the model's
representation learning, we pretrain the GGNN encoder using a supervised contrastive loss, encouraging
the separation of vulnerability classes based on structural patterns. We also implement a robust
preprocessing pipeline that standardizes code identifiers and removes noisy tokens, enhancing feature
consistency. Overall, VulHierGGNN combines decision-level multi-modal fusion with graph-based
learning and hierarchical classification to address challenges such as data imbalance and lack of semantic-
structural alignment in vulnerability detection. The system architecture of VulHierGGNN is shown in Fig 3.1.1.

Fig 3.1.1 System Architecture of VulHierGGNN
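As noted above, the GGNN encoder is pretrained with a supervised contrastive loss that pulls graph embeddings of the same vulnerability class together and pushes embeddings of different classes apart. The following is a minimal PyTorch sketch of such a loss, following the standard supervised contrastive (SupCon) formulation; the temperature value and batching details are assumptions, not the paper's reported settings.

```python
import torch
import torch.nn.functional as F

def supcon_loss(z: torch.Tensor, labels: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """z: (N, d) graph-level embeddings from the GGNN; labels: (N,) class ids."""
    z = F.normalize(z, dim=1)                        # work in cosine-similarity space
    sim = (z @ z.t()) / tau                          # (N, N) pairwise similarities
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    sim = sim.masked_fill(self_mask, -1e9)           # exclude self-comparisons
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_per_anchor = pos_mask.sum(dim=1)
    # mean log-probability of same-class (positive) pairs, per anchor
    loss = -(log_prob * pos_mask).sum(dim=1) / pos_per_anchor.clamp(min=1)
    return loss[pos_per_anchor > 0].mean()           # skip anchors with no positives
```

Minimizing this loss encourages the structural separation of vulnerability classes before the hierarchical fine-tuning stage begins.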

3.2 Methodology

Our system, VulHierGGNN, is a two-branch pipeline with a fusion module (see Fig 3.1.1). The first branch processes source code into CPGs and uses a GGNN encoder to generate graph-level embeddings. The second branch encodes vulnerability descriptions using BERT. These embeddings are fused and fed into a hierarchical classification module consisting of:

• A binary classifier to predict whether a function is vulnerable.

• A CWE classifier to predict the specific CWE type, applied during inference only to functions classified as vulnerable (a sketch of this fused, hierarchical head follows this list).
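For illustration, the following is a minimal sketch of the fusion and hierarchical classification head, assuming the two 768-dimensional branch embeddings are fused by simple concatenation and classified by linear layers; the layer sizes and fusion operator are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class HierarchicalHead(nn.Module):
    """Fuses the 768-dim code and description embeddings, then classifies."""
    def __init__(self, dim: int = 768, num_cwe_types: int = 30):
        super().__init__()
        self.binary = nn.Linear(2 * dim, 2)           # stage 1: vulnerable or not
        self.cwe = nn.Linear(2 * dim, num_cwe_types)  # stage 2: specific CWE type

    def forward(self, code_emb: torch.Tensor, desc_emb: torch.Tensor):
        fused = torch.cat([code_emb, desc_emb], dim=-1)
        vuln_logits = self.binary(fused)
        # At inference, the CWE logits are consulted only for samples the
        # binary stage predicts as vulnerable.
        cwe_logits = self.cwe(fused)
        return vuln_logits, cwe_logits
```

During training both heads can be optimized jointly, while at inference the CWE prediction is only consulted when the binary head flags a function as vulnerable.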
3.2.1 Source Code Pathway

Code is parsed into a CPG, combining the AST, CFG, and PDG. Each node is embedded using CodeBERT [2] (768-dimensional). The GGNN encoder processes the CPG as follows (a code sketch appears after this list):

• GatedGraphConv: Performs message passing over CPG edges for 3 layers.

• Global Mean Pooling: Aggregates node embeddings into a 768-dimensional graph-level embedding.

• MLP: Refines the embedding (768 → 256 → 768).
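The following is a minimal PyTorch Geometric sketch of this encoder under the settings above (3 message-passing layers, global mean pooling, a 768 → 256 → 768 MLP); the class and variable names are illustrative, not the paper's implementation.

```python
import torch.nn as nn
from torch_geometric.nn import GatedGraphConv, global_mean_pool

class GGNNEncoder(nn.Module):
    """Encodes a CPG whose nodes carry 768-dim CodeBERT embeddings."""
    def __init__(self, dim: int = 768, hidden: int = 256, steps: int = 3):
        super().__init__()
        # Message passing over CPG edges for `steps` iterations.
        self.ggnn = GatedGraphConv(out_channels=dim, num_layers=steps)
        # MLP that refines the pooled embedding (768 -> 256 -> 768).
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim)
        )

    def forward(self, x, edge_index, batch):
        h = self.ggnn(x, edge_index)      # node-level message passing
        g = global_mean_pool(h, batch)    # 768-dim graph-level embedding
        return self.mlp(g)                # refined graph embedding
```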

3.2.2 Vulnerability Description Pathway

Vulnerability descriptions from CWE records are encoded using BERT [1]. The [CLS] token
embedding (768-dimensional) represents the description.
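A minimal sketch of this step with the Hugging Face Transformers library is shown below; the checkpoint name (bert-base-uncased) and the truncation length are illustrative assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def encode_description(text: str) -> torch.Tensor:
    """Returns the 768-dim [CLS] embedding of a vulnerability description."""
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        outputs = bert(**inputs)
    # Position 0 of the last hidden state is the [CLS] token embedding.
    return outputs.last_hidden_state[:, 0, :]  # shape: (1, 768)
```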

3.3 Data Preprocessing

To prepare source code for CPG generation, we implement a preprocessing pipeline that enhances the
quality of node features:

• Removing comments: Eliminates irrelevant text that does not contribute to code execution or
structure, aligning with standard practices [16].

• Handling non-ASCII characters: Removes or normalizes non-ASCII characters to ensure compatibility with CodeBERT, reducing noise from Unicode symbols.

• Standardizing variable and function names: Renames user-defined variables and functions to
generic identifiers (e.g., Var1, FUN1). This reduces variability in the input to CodeBERT, ensuring
consistent node features and mitigating issues with out-of-vocabulary tokens. Unlike regex-based
methods, which may struggle with diverse naming conventions [17], our approach leverages
CodeBERT’s semantic understanding for robust feature extraction.

This preprocessing strategy is particularly effective for graph-based models, as it minimizes noise in node features, allowing the GGNN to focus on structural patterns captured by the CPG. A simplified sketch of these steps follows.
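The following is a simplified, regex-based illustration of the three steps above for C-like source. Note that the paper's actual renaming step relies on CodeBERT-aware processing rather than pure regex, so this sketch only conveys the idea; the name lists passed to the renamer are assumed to come from the CPG's symbol information.

```python
import re

def preprocess(code: str) -> str:
    """Removes comments and non-ASCII characters from C-like source."""
    code = re.sub(r"/\*.*?\*/", "", code, flags=re.DOTALL)  # block comments
    code = re.sub(r"//[^\n]*", "", code)                    # line comments
    # Strip non-ASCII characters to reduce Unicode noise.
    return code.encode("ascii", errors="ignore").decode("ascii")

def rename_identifiers(code: str, variables: list, functions: list) -> str:
    """Maps user-defined names to generic identifiers (Var1, FUN1, ...)."""
    mapping = {name: f"Var{i + 1}" for i, name in enumerate(variables)}
    mapping.update({name: f"FUN{i + 1}" for i, name in enumerate(functions)})
    for name, generic in mapping.items():
        code = re.sub(rf"\b{re.escape(name)}\b", generic, code)
    return code
```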

4. EXPERIMENTAL RESULTS

To assess the effectiveness of the proposed VulHierGGNN framework, we conducted extensive experiments on the CVEfixes dataset. The evaluation focused on three primary models: the standalone BERT model for vulnerability descriptions, the CodeBERT model for source code, and the fused architecture combining both modalities.

The training results of the standalone BERT model for vulnerability descriptions and the CodeBERT model for source code are shown in Fig 4.1 and Fig 4.2, respectively.

Fig 4.1 Training result of standalone BERT model for Description


Fig 4.2 Training result for CodeBERT model for Source Code

The training versus validation loss curves for the BERT description model and the CodeBERT source code model are shown in Fig 4.3 and Fig 4.4, respectively.

Fig 4.3 Training vs Validation loss for BERT model for Description

Fig 4.4 Training vs Validation loss for CodeBERT for Source code
The confusion matrix for VulHierGGNN over 30 labels (CWE types) is shown in Fig 4.5.

Fig 4.5 Confusion matrix

5. DISCUSSION
The complete implementation of the proposed model, which combines BERT for processing vulnerability
descriptions and CodeBERT for analyzing source code, has demonstrated significant improvements in multi-
type vulnerability classification. By leveraging both natural language understanding and code semantics, the
system effectively captures contextual and structural cues essential for precise classification.

BERT proved highly effective in extracting semantic features from textual vulnerability reports, achieving an
F1-score and accuracy of 98%, indicating that most categories of vulnerabilities are well-represented in their
descriptions. Meanwhile, the integration of CodeBERT added an additional layer of granularity by enabling
the model to understand source-level syntax and logical flow. This dual representation enhanced the system’s
ability to handle complex patterns where textual cues alone were insufficient.

The hierarchical framework adopted in the model allowed each component—description and code—to
contribute distinct yet complementary insights. This fusion improved the model’s ability to differentiate
between closely related vulnerability types, which is often a limitation in single-modality systems.

The model achieved strong generalization and accuracy across multiple vulnerability categories, offering a
robust framework for automated security analysis in real-world scenarios.
6. CONCLUSION

This study presents a hybrid deep learning framework for multi-type software vulnerability classification,
integrating BERT for textual descriptions and CodeBERT for source code analysis. The results affirm that
combining linguistic and structural features leads to significant improvements in classification accuracy
and robustness. The model effectively captures the semantic depth of vulnerability descriptions while
simultaneously extracting meaningful code patterns, enabling a more comprehensive and precise
identification of diverse vulnerability types.

By bridging the gap between natural language understanding and code semantics, the proposed system
advances the state of automated vulnerability detection. The high performance achieved across evaluation
metrics demonstrates its potential for real-world applications, particularly in environments where manual
vulnerability triage is time-consuming or error-prone.

Our model uniquely leverages both natural language vulnerability descriptions and source code features,
utilizing the powerful capabilities of BERT for textual analysis and CodeBERT integrated with Code
Property Graphs (CPGs) for code structure representation. This multimodal and hierarchical design allows
the model to understand and relate abstract semantic cues from descriptions with the precise syntactic and
structural patterns in source code, ultimately improving the classification performance.
Unlike traditional approaches that treat source code and textual descriptions separately or rely on flat
representations, VulHierGGNN provides a unified and hierarchical architecture that integrates contrastive
learning, graph-based reasoning, and cross-modal feature fusion. By modeling the inherent hierarchical
relationships between CWE (Common Weakness Enumeration) classes and their subtypes, our approach
enables finer-grained classification that better mirrors real-world vulnerability taxonomies.

7. FUTURE SCOPE
While the current model achieves strong performance, there remains scope for further enhancement.
Future research can explore incorporating graph-based code representations such as Abstract Syntax Trees
(ASTs) or Control Flow Graphs (CFGs) alongside CodeBERT, providing deeper structural insights.
Moreover, adapting the framework to support multilingual codebases and evolving vulnerability databases
can make it more versatile.

Another potential direction is the integration of real-time detection in CI/CD pipelines, enabling
proactive security measures during software development. Additionally, attention mechanisms or
explainability modules could be embedded to offer interpretable insights for security analysts.
Expanding this system into a fully deployable tool with visualization and feedback components may also
bridge the gap between research and industry use.

However, several avenues remain for future improvement and exploration.

1. Extension to Zero-Day and Unseen Vulnerabilities

While VulHierGGNN performs well on known vulnerabilities, its application can be extended to
detecting zero-day or unseen vulnerabilities. This would require adapting the model using techniques
such as semi-supervised learning or anomaly detection, enabling it to flag previously unrecorded or
ambiguous threats with minimal labeled data.

2. Real-Time Integration in Development Pipelines

A potential enhancement involves embedding VulHierGGNN into integrated development
environments (IDEs) or DevOps pipelines. This would allow for real-time vulnerability analysis as
developers write code, promoting secure programming practices and immediate threat feedback before
deployment.

3. Expansion to Multi-Language and Cross-Platform Support

The current implementation primarily focuses on a single programming language dataset. Future
iterations could incorporate multilingual code analysis by training on diverse datasets, supporting
languages like C++, Java, Python, and Go. This would broaden the system’s applicability to varied
software ecosystems.
REFERENCES

[1] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.

[2] Feng, Z., Guo, D., Tang, D., Duan, N., Feng, X., Gong, M., ... & Zhou, M. (2020). CodeBERT: A Pre-Trained Model for Programming and Natural Languages.

[3] Chess, B., & McGraw, G. (2004). Static Analysis for Security.

[4] Li, Z., et al. (2021). SySeVR: A Framework for Using Deep Learning to Detect Software Vulnerabilities.

[5] Li, Z., et al. (2018). VulDeePecker: A Deep Learning-Based System for Vulnerability Detection.

[6] Alshahwan et al. (2022). DeepVD: Deep Learning-Based Vulnerability Detection with Word2Vec.

[7] Kumar & Tripathi (2023). DeKeDVer: A Deep Learning-Based Multi-Type Software Vulnerability Classification.

[8] Xu et al. (2025). Vul-LMGNN: Multi-view Graph Neural Network for Software Vulnerability Detection.

[9] Liu et al. (2024). SCL-CVD: Supervised Contrastive Learning for Code Vulnerability Detection via GraphCodeBERT.

[10] Zhou et al. (2019). Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks.

[11] Zhang et al. (2022). HSVC: Transformer-Based Hierarchical Distillation for Software Vulnerability Classification.