Official implementation for ASE2025 Submission #1925:
Enhancing LLM to Decompile Optimized PTX to Readable CUDA for Tensor Programs
This repository contains code for our LLM-based PTX-to-CUDA decompilation framework featuring:
- Compiler-based data augmentation for generating aligned PTX-CUDA pairs
- Rolled-PTX representation to handle optimized loop structures
- LLM fine-tuning pipeline for decompilation
- Evaluation scripts for decompilation accuracy and quality
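
For orientation, here is a minimal sketch (not the repository's pipeline, which lives under data_generator/) of the basic step behind compiler-based pair generation: compiling a CUDA kernel to optimized PTX with nvcc so that the source and its PTX together form one aligned pair. The kernel file name, architecture, and flags below are illustrative assumptions.

```python
# Minimal sketch (assumptions: nvcc on PATH, a hypothetical matmul_kernel.cu,
# arch sm_70). The actual augmentation pipelines live under data_generator/.
import subprocess
from pathlib import Path

def cuda_to_ptx(cuda_file: str, arch: str = "sm_70") -> str:
    """Compile a .cu file to optimized PTX with nvcc and return the PTX text."""
    ptx_file = Path(cuda_file).with_suffix(".ptx")
    subprocess.run(
        ["nvcc", "-O3", f"-arch={arch}", "-ptx", cuda_file, "-o", str(ptx_file)],
        check=True,
    )
    return ptx_file.read_text()

if __name__ == "__main__":
    cuda_src = Path("matmul_kernel.cu").read_text()   # hypothetical kernel
    ptx = cuda_to_ptx("matmul_kernel.cu")
    # (CUDA source, PTX) now form one aligned sample.
    print(f"CUDA: {len(cuda_src)} chars, PTX: {len(ptx)} chars")
```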
.
├── data_generator/ # Sec 3.2: Data augmentation pipelines
│ ├── Scheduling-Diverse/ # Scheduling diversity pipeline
│ └── Subgraph-Diverse/ # Subgraph diversity pipeline
│
├── dataset_workspace/ # Dataset processing & evaluation
│ ├── simplify_cuda.py # Sec 3.3: CUDA Kernel Refactoring
│ ├── loop_reroll_ptx.py # Sec 3.4: PTX loop rerolling
│ └── ...
│
└── model_train_infer/ # LLM training & inference
├── train.ipynb # Model fine-tuning
├── infer.py # Decompilation inference
└── ...
- Scheduling-Diverse Pipeline:
  Entry: data_generator/Scheduling-Diverse/tenset/scripts/measure_programs_cuda.py
- Subgraph-Diverse Pipeline:
  Entry: data_generator/Subgraph-Diverse/welder/nnfusion/artifacts/my_welder_cudaptx.py
- Quality improvement (Sec 3.3):
  dataset_workspace/simplify_cuda.py
- Rolled-PTX generation (Sec 3.4):
  dataset_workspace/loop_reroll_ptx.py (see the rerolling sketch after this list)
- LLM fine-tuning:
  model_train_infer/train.ipynb
- Inference:
  model_train_infer/infer.py (see the inference sketch after this list)
- Decompilation accuracy and quality evaluation:
  Entry: dataset_workspace/my_eval_decompile_multi_*.py (a compile sanity-check sketch follows the list)
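
Rolled-PTX (Sec 3.4): to give a flavor of what loop rerolling does, the sketch below collapses exactly repeated blocks of PTX lines back into a single annotated block. It is a conceptual illustration only, not loop_reroll_ptx.py; the actual pass must also handle register renaming and changing address offsets across unrolled iterations.

```python
# Conceptual sketch of PTX loop rerolling (NOT loop_reroll_ptx.py): collapse
# consecutive, textually identical instruction blocks into one rolled block.
# Real unrolled PTX differs per iteration (registers, offsets), which the
# repository's pass has to normalize; this toy version ignores that.
from typing import List

def reroll(lines: List[str], max_body: int = 8) -> List[str]:
    out, i, n = [], 0, len(lines)
    while i < n:
        collapsed = False
        for body in range(max_body, 0, -1):          # prefer the longest body
            block = lines[i:i + body]
            if len(block) < body:
                continue
            reps = 1
            while lines[i + reps * body:i + (reps + 1) * body] == block:
                reps += 1
            if reps >= 2:                            # found an unrolled run
                out.append(f"// rolled: {reps} iterations of the block below")
                out.extend(block)
                out.append("// end rolled block")
                i += reps * body
                collapsed = True
                break
        if not collapsed:
            out.append(lines[i])
            i += 1
    return out

if __name__ == "__main__":
    ptx_lines = open("unrolled.ptx").read().splitlines()   # hypothetical input
    print("\n".join(reroll(ptx_lines)))
```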
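
Inference: the snippet below is a minimal sketch of querying a fine-tuned causal LM with Hugging Face transformers. The checkpoint path, prompt format, and input file are assumptions rather than the repository's interface; refer to model_train_infer/infer.py for the real one.

```python
# Minimal inference sketch (NOT infer.py). Checkpoint path, prompt format and
# the rolled-PTX input file below are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "path/to/finetuned-ptx2cuda"                  # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

rolled_ptx = open("example.rolled.ptx").read()             # hypothetical input
prompt = f"Decompile the following PTX to CUDA:\n{rolled_ptx}\nCUDA:\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
generated = outputs[0][inputs["input_ids"].shape[1]:]      # strip the prompt tokens
print(tokenizer.decode(generated, skip_special_tokens=True))
```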
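
Evaluation: as one rough sanity check on decompiled kernels (not necessarily the metric computed by my_eval_decompile_multi_*.py), you can verify that the generated CUDA at least recompiles under nvcc:

```python
# Rough compile-only sanity check for generated CUDA (an assumption-laden
# stand-in, not the repository's evaluation; see my_eval_decompile_multi_*.py).
import os
import subprocess
import tempfile

def recompiles(cuda_source: str, arch: str = "sm_70") -> bool:
    """Return True if nvcc can compile the generated kernel back to PTX."""
    with tempfile.TemporaryDirectory() as tmp:
        cu_path = os.path.join(tmp, "pred.cu")
        with open(cu_path, "w") as f:
            f.write(cuda_source)
        result = subprocess.run(
            ["nvcc", f"-arch={arch}", "-ptx", cu_path,
             "-o", os.path.join(tmp, "pred.ptx")],
            capture_output=True,
        )
        return result.returncode == 0
```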
The full dataset (400K PTX-CUDA pairs) and pretrained model weights are being prepared for public release. Due to their size:
- Full dataset (~69 GB) will be available on Hugging Face Datasets
- Model weights (~14.5 GB) will be available on Hugging Face Hub
A sample dataset subset is included in dataset_workspace/dataset_example/ for initial exploration.