Surogate Trainer is built for developers and enterprises that need fast experimentation, whether running on-premise or in the cloud.
The Surogate trainer surpasses all existing training frameworks in performance for single-GPU, multi-GPU and GPU+CPU training by a large margin.
The native CPU offloading feature achieves superior performance and VRAM usage compared to QLoRA. You can fine-tune models at native bf16 precision, rendering QLoRA obsolete.
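To make the VRAM comparison concrete, the sketch below estimates the memory taken by model weights alone at different precisions. The bytes-per-parameter figures are the standard storage sizes. The `4bit` entry ignores quantization scales, and the estimate leaves out gradients, optimizer state and activations, so treat it as a rough lower bound rather than Surogate's own accounting. The function name is hypothetical.

```python
# Bytes needed to store one parameter, weights only.
# Assumption: the 4-bit figure ignores per-block quantization scales.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "4bit": 0.5}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Return the memory needed to hold the weights, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


# Compare the weight footprint of an 8B-parameter model across precisions.
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gib(8e9, dtype):.1f} GiB")
```

The bf16 footprint is four times the 4-bit one, which is the gap that offloading weights to CPU memory has to cover when training at native precision.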
- Pre-training + Fine-tuning: full fine-tuning, LoRA
- BF16, FP8 and NVFP4 Reinforcement Learning: advanced GRPO training and evaluation with custom, deterministic environments
- RL Environments: predictable environments for RL training
- Native multi-GPU training with multi-threaded backend
- Native multi-node DDP training with Ray
- Native C++/CUDA engine for near Speed-Of-Light (SOL) throughput
- Python DSL with AOT auto-differentiation for adding new model architectures
- Smart CPU Offloading for weights, gradients, activations, quants
- Pre-built training recipes:
  - BF16: Baseline recipe using `bfloat16` for all GEMMs, designed for maximum numerical accuracy. No quantization is applied.
  - FP8: Native `FP8` training delivering extreme performance with `E4M3` used for activations and weights and `E5M2` for gradients. Uses per-tensor delayed scaling to provide stable training.
  - NVFP4: Native CUTLASS `FP4 E2M1` training with two-level block scaling for extreme performance and memory efficiency on Blackwell GPUs (SM100+: B200, B300, RTX 50xx series). Uses stochastic rounding and random Hadamard Transforms for numerical stability. Supports NVIDIA B200, B300, RTX 5070, 5080, 5090.
- BnB/FP8/NVFP4 QLoRA: supports a variety of QLoRA configurations, including online quantization (FP8, NVFP4, BnB) or loading pre-quantized weights (FP8, NVFP4)
- Optimizers: AdamW 8bit, NorMuon
- Runs on the following NVIDIA GPU architectures: sm80, sm86, sm89, sm90, sm100, sm103, sm120, sm121
- Mixed-precision training: mix different dtypes for GEMMs, model, gradients and LoRA recipes to create your own flavor.
- Designed for reliability: deterministic configs, explicit recipes, and a clear C++ core
- Adaptive Training: built-in automated training monitoring with automatic phase detection, multi-criteria early stopping (convergence, compute-efficiency, divergence, plateau), auto LR management, MoE imbalance detection, Chinchilla token budgeting and dynamic epoch adjustment
- Dedicated MoE Features: Expert Parallelism, Least-Loaded EP load-balancing, MoE training metrics, imbalance detection
- Stacked LoRA training: train a LoRA adapter on top of another LoRA adapter to skip offline merging into the base model.
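The FP8 recipe in the list above relies on per-tensor delayed scaling. As a rough illustration of the idea, and not Surogate's implementation, the sketch below picks the scale for the current step from the largest absolute value seen in earlier steps. The class name `DelayedScaler` and the `history_len` and `margin` parameters are hypothetical. The values are only clamped to the FP8 range; rounding to the FP8 grid is not emulated.

```python
from collections import deque

# Largest finite magnitudes of the two FP8 formats.
FP8_MAX = {"e4m3": 448.0, "e5m2": 57344.0}


class DelayedScaler:
    """Per-tensor delayed scaling: the scale used now comes from past amax values."""

    def __init__(self, fmt: str, history_len: int = 16, margin: float = 1.0):
        self.fp8_max = FP8_MAX[fmt]
        self.history = deque(maxlen=history_len)  # recent per-step amax values
        self.margin = margin  # hypothetical safety factor, 1.0 means none

    def scale(self) -> float:
        """Scale derived from the amax history; 1.0 until something is recorded."""
        if not self.history:
            return 1.0
        amax = max(self.history)
        if amax == 0.0:
            return 1.0
        return self.fp8_max / (amax * self.margin)

    def quantize(self, values):
        """Scale and clamp with the delayed scale, then record this step's amax."""
        s = self.scale()
        scaled = [max(-self.fp8_max, min(self.fp8_max, v * s)) for v in values]
        self.history.append(max((abs(v) for v in values), default=0.0))
        return scaled, s
```

Because the scale lags by one step, a sudden spike can saturate (as the clamp shows), which is why a history window and a margin are commonly used.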
Surogate supports the following models. Please open a PR if you need a model that is not listed.
| Model | Architecture | Model Sizes |
|---|---|---|
| Qwen3 | Qwen3ForCausalLM | 0.6B, 1.7B, 4B, 8B, 14B, 32B |
| Qwen3VL | Qwen3VLForConditionalGeneration | 2B, 4B, 8B, 32B |
| Qwen3 MoE | Qwen3MoeForCausalLM | 30B-A3B, 235B-A22B |
| Qwen3.5 | Qwen3_5ForCausalLM, Qwen3_5ForConditionalGeneration | 0.8B, 2B, 4B, 9B, 27B |
| Qwen3.5 MoE | Qwen3MoeForCausalLM, Qwen3_5MoeForConditionalGeneration | 35B-A3B, 122B-A10B, 397B-A17B |
| Nemotron Nano v3 | NemotronHForCausalLM | 30B-A3B |
| Nemotron Super v3 | NemotronHForCausalLM | 120B-A12B |
| Nemotron Cascade 2 | NemotronHForCausalLM | 30B-A3B |
| GPT-OSS | GptOssForCausalLM | 20B, 120B |
| Llama 3.1 | LlamaForCausalLM | 8B, 70B, 405B |
| Llama 3.2 | LlamaForCausalLM | 1B, 3B |
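A quick way to tell whether a checkpoint falls under the table above is to compare the `architectures` field of its Hugging Face `config.json` with the listed architecture names. The sketch below does that with the standard library only. The helper names are hypothetical, and a match says nothing about whether a particular model size has been tested.

```python
import json

# Architecture names copied from the table above.
SUPPORTED_ARCHITECTURES = {
    "Qwen3ForCausalLM",
    "Qwen3VLForConditionalGeneration",
    "Qwen3MoeForCausalLM",
    "Qwen3_5ForCausalLM",
    "Qwen3_5ForConditionalGeneration",
    "Qwen3_5MoeForConditionalGeneration",
    "NemotronHForCausalLM",
    "GptOssForCausalLM",
    "LlamaForCausalLM",
}


def supported_architectures(config: dict) -> list[str]:
    """Return the entries of config["architectures"] that appear in the table."""
    return [a for a in config.get("architectures", []) if a in SUPPORTED_ARCHITECTURES]


def load_config(path: str) -> dict:
    """Read a Hugging Face config.json from disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

For example, `supported_architectures(load_config("config.json"))` returns an empty list when none of the model's architectures is listed.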
You can interact with the Surogate High-Performance Training Engine at the framework level via the CLI.
Surogate provides three Docker images, one per supported CUDA version. Currently only the x86-64 architecture is supported.
| CUDA | Image | Recommended NVIDIA Driver | Minimum NVIDIA Driver |
|---|---|---|---|
| 12.8.1 | ghcr.io/invergent-ai/surogate:latest-cu128 | >= 570.124.06 | >= 525 |
| 12.9.1 | ghcr.io/invergent-ai/surogate:latest-cu129 | >= 575.57.08 | >= 525 |
| 13.1 | ghcr.io/invergent-ai/surogate:latest-cu130 | >= 590.48.01 | >= 580 |
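The minimum-driver column above can be turned into a small lookup. The sketch below compares only the major driver version, which is all the minimum column specifies. The helper name `usable_images` is hypothetical, and the recommended driver versions are not taken into account.

```python
# (CUDA version, image, minimum driver major version), copied from the table above.
IMAGES = [
    ("12.8.1", "ghcr.io/invergent-ai/surogate:latest-cu128", 525),
    ("12.9.1", "ghcr.io/invergent-ai/surogate:latest-cu129", 525),
    ("13.1", "ghcr.io/invergent-ai/surogate:latest-cu130", 580),
]


def usable_images(driver_version: str) -> list[str]:
    """Return the images whose minimum driver is met by driver_version."""
    major = int(driver_version.split(".")[0])
    return [image for _, image, min_major in IMAGES if major >= min_major]
```

The driver version string is the one reported by `nvidia-smi`, for example `570.124.06`.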

    docker run --gpus=all -v /my/local/config.yaml:/home/surogate/config.yaml -v /my/local/output_dir:<OUTPUT_DIR_FROM_CONFIG_YAML> <IMAGE> sft config.yaml

The `install.sh` script auto-detects your CUDA version (12.8, 12.9, or 13.x) and installs the matching pre-built wheel into a local `.venv/`.
Install the latest release:

    curl -LsSf https://github.com/invergent-ai/surogate/releases/latest/download/install.sh | bash

Or pin to a specific release (e.g. v0.0.1):

    curl -LsSf https://github.com/invergent-ai/surogate/releases/download/v0.0.1/install.sh | bash

After installation:

    source .venv/bin/activate
    surogate sft examples/sft/qwen3/qwen3-lora-bf16.yaml

You need CUDA 12.8/12.9/13.x installed on your machine and the NCCL development libraries (`libnccl-dev`) for your CUDA version.

    # ...clone repo...
    uv pip install -e .

- Create a config (example):

      model: Qwen/Qwen3-0.6B
      output_dir: ./output
      # training
      per_device_train_batch_size: 2
      gradient_accumulation_steps: 4
      sequence_len: 2048
      learning_rate: 2e-4
      # LoRA / QLoRA
      lora: true
      lora_rank: 16
      # qlora_fp8: true # optional, hardware-dependent
      # qlora_fp4: true # Blackwell+
      # qlora_bnb: true # Any GPU, lowest
      datasets:
        - path: "mlabonne/FineTome-100k"
          type: auto

- Run:

      surogate sft config.yaml

- Outputs:
  - checkpoints, logs and artifacts are written under `output_dir`
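The batch settings in the example config combine into an effective batch per optimizer step. The sketch below works through that arithmetic, assuming the usual data-parallel convention (per-device batch × accumulation steps × number of GPUs). Whether Surogate counts steps exactly this way is an assumption, and the token figure is an upper bound that holds only if every sequence fills `sequence_len`.

```python
def effective_batch(
    per_device_train_batch_size: int,
    gradient_accumulation_steps: int,
    sequence_len: int,
    num_gpus: int = 1,
) -> tuple[int, int]:
    """Return (sequences, max tokens) consumed per optimizer step."""
    sequences = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
    return sequences, sequences * sequence_len


# Values from the example config above, on a single GPU.
sequences, tokens = effective_batch(2, 4, 2048)
print(f"{sequences} sequences, up to {tokens} tokens per optimizer step")
```

With the example values this gives 8 sequences and at most 16384 tokens per step on one GPU. Raising `gradient_accumulation_steps` increases the effective batch without raising per-step activation memory.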
- NVIDIA GPU + recent driver
- CUDA 12.8, 12.9 or 13.x, plus NCCL and cuDNN
- Linux x86_64
| SM | GPUs |
|---|---|
| SM80 | A100, A30 |
| SM86 | A2, A16, A10, A40, RTX 3050, RTX 3060, RTX 3070, RTX 3080, RTX 3090, A2000, A3000, A4000, A5000, A6000 |
| SM89 | L4, L40, L40S, RTX 4050, RTX 4060, RTX 4070, RTX 4080, RTX 4090, RTX 2000 Ada, RTX 4000 SFF Ada, RTX 4000 Ada, RTX 4500 Ada, RTX 5000 Ada, RTX 6000 Ada |
| SM90 | H100, H200, GH200 |
| SM100 | B200, GB200 |
| SM103 | B300, GB300 |
| SM120 | RTX PRO 6000/5000/4000/2500/2000 Blackwell, RTX 5050, RTX 5060, RTX 5070, RTX 5080, RTX 5090 |
| SM121 | DGX Spark |
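To check a machine against the SM list above, the compute capability reported by `nvidia-smi` can be mapped to an `smXY` name. The sketch below is one way to do that. The helper names are hypothetical, and `query_compute_caps` assumes a driver recent enough to expose the `compute_cap` query field.

```python
import subprocess

# SM targets listed in this README.
SUPPORTED_SM = {"sm80", "sm86", "sm89", "sm90", "sm100", "sm103", "sm120", "sm121"}


def to_sm(compute_cap: str) -> str:
    """Convert a compute capability such as "8.9" into an SM name such as "sm89"."""
    major, minor = compute_cap.strip().split(".")
    return f"sm{major}{minor}"


def is_supported(compute_cap: str) -> bool:
    """True if the compute capability maps to one of the listed SM targets."""
    return to_sm(compute_cap) in SUPPORTED_SM


def query_compute_caps() -> list[str]:
    """Ask nvidia-smi for the compute capability of each visible GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]
```

On a machine with NVIDIA drivers installed, `[is_supported(c) for c in query_compute_caps()]` gives one flag per GPU.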
- Docs: https://docs.surogate.ai
- Examples: https://github.com/invergent-ai/surogate/tree/master/examples
We welcome contributions across the entire ecosystem! If you are submitting a PR to the core framework, please ensure you include a clear description, steps to test locally, and relevant examples.
If you're adding kernels/recipes or touching build/tooling, please keep changes minimal and include:
- a short description of the change,
- how to reproduce/validate locally (`make test` where applicable),
- and any GPU/arch assumptions.
Apache 2.0. See LICENSE.