This repository contains the LaTeX source, figures, and tables for the research paper on enhancing CodeBERT representations using domain-adaptive masked language modeling (MLM) for source code comment classification. The study uses the IRSE/FIRE dataset and explores the impact of combining original C code data with Python-derived silver-standard examples generated using x-ai/grok-4-fast.
Title: Improving Source Code Comment Relevance with Domain-Adaptive CodeBERT
Authors:
- L. Simiyon Vinscent Samuel
- Sundareswaran Rajaguru
- Vijayaraaghavan K. S.
- Dr. A. Vijayalakshmi (Supervisor)
Affiliation:
Department of Computer Science and Engineering, Sri Sivasubramaniya Nadar College of Engineering, Chennai, India
Datasets:
- Original IRSE/FIRE C code dataset (11,452 pairs, 8,326 cleaned training samples)
- Rule-based synthetic comments (500 samples)
- Silver-standard Python examples (1,640 samples) generated from humaneval.jsonl via x-ai/grok-4-fast (corpus assembly sketched below)
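A minimal sketch of how the three sources could be merged into the combined corpus. All file names below are placeholders for the repository's actual data layout; only the sample counts come from the list above.

```python
import pandas as pd

# Placeholder file names; sample counts follow the dataset list above.
original = pd.read_json("irse_fire_c_cleaned.jsonl", lines=True)    # 8,326 cleaned samples
synthetic = pd.read_json("rule_based_synthetic.jsonl", lines=True)  # 500 rule-based samples
silver = pd.read_json("humaneval_silver_python.jsonl", lines=True)  # 1,640 silver samples

# The "combined" corpus used for domain-adaptive MLM and fine-tuning.
combined = pd.concat([original, synthetic, silver], ignore_index=True)
combined.to_json("combined_corpus.jsonl", orient="records", lines=True)
```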
Training Approach:
- Continued masked language modeling (MLM) of CodeBERT-base on the combined corpus (training step sketched after this list)
- 15% mask probability, batch size 16, learning rate 5e-5, up to 7 epochs
- Fine-tuning with a classification head for relevance prediction
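A minimal sketch of the MLM continuation step using Hugging Face transformers. The checkpoint microsoft/codebert-base is the standard CodeBERT release; the data file and the "text" column are assumptions, while the masking probability, batch size, learning rate, and epoch count come from the list above.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForMaskedLM.from_pretrained("microsoft/codebert-base")

# Hypothetical combined corpus (original C + synthetic + silver Python).
dataset = load_dataset("json", data_files={"train": "combined_corpus.jsonl"})["train"]

def tokenize(batch):
    # "text" is an assumed column holding each code-comment pair as one string.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# 15% masking probability, as reported above.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="codebert-mlm-adapted",
    per_device_train_batch_size=16,  # batch size from the paper
    learning_rate=5e-5,              # learning rate from the paper
    num_train_epochs=7,              # up to 7 epochs
)

Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()
```

The adapted checkpoint is then fine-tuned with a classification head (e.g., via AutoModelForSequenceClassification) for relevance prediction.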
Baselines:
- TF–IDF features with Logistic Regression, Linear SVM, Multinomial Naive Bayes, and Random Forest classifiers
- 10-fold cross-validation for hyperparameter tuning (one pipeline sketched below)
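One baseline pipeline, sketched with scikit-learn. The variable names and the hyperparameter grid are illustrative, not the paper's actual search space; only the classifier list and the 10-fold cross-validation come from the items above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# TF-IDF + Logistic Regression; swap the classifier for LinearSVC,
# MultinomialNB, or RandomForestClassifier to get the other baselines.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Illustrative grid, tuned with 10-fold cross-validation as described above.
grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, grid, cv=10, scoring="f1_weighted")
# search.fit(train_comments, train_labels)  # placeholder training data
```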
Evaluation Metrics: Accuracy, Precision, Recall, Weighted F1
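A small helper showing how these metrics can be computed with scikit-learn; the function name and signature are illustrative. Weighted averaging matches the Weighted F1 reported in the paper.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def report(y_true, y_pred):
    # Weighted averaging accounts for class imbalance, matching Weighted F1.
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted"
    )
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision,
        "recall": recall,
        "weighted_f1": f1,
    }
```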
Configurations (a hypothetical evaluation harness is sketched below):
- Train Original → Test Original
- Train Combined → Test Combined
- Train Original → Test Combined
- Train Combined → Test Original
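A hypothetical harness for iterating over the four configurations; `splits` and `evaluate` are placeholders for the data-loading and fine-tuning-plus-scoring steps described above.

```python
from typing import Callable, Dict, Tuple

def run_configurations(
    splits: Dict[str, object],
    evaluate: Callable[[object, object], Dict[str, float]],
) -> Dict[Tuple[str, str], Dict[str, float]]:
    # Each pair is (train split, test split), matching the list above.
    results = {}
    for train_name, test_name in [
        ("Original", "Original"),
        ("Combined", "Combined"),
        ("Original", "Combined"),
        ("Combined", "Original"),
    ]:
        results[(train_name, test_name)] = evaluate(
            splits[train_name], splits[test_name]
        )
    return results
```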
Findings:
- Domain-adaptive MLM significantly improves CodeBERT representations.
- Cross-evaluation shows robust transfer even with Python-derived silver data.
- Silver-standard augmentation increases vocabulary and diversity but introduces minor noise.