Zen Gym is the unified training infrastructure for all Zen AI models. Built on LLaMA Factory, it provides comprehensive support for fine-tuning, reinforcement learning, and deployment across the entire Zen model family.
Zen Gym is the centralized training platform for:
- zen-nano (0.6B) - Ultra-lightweight models
- zen-eco (4B) - Efficient models (instruct, thinking, agent)
- zen-agent (4B) - Tool-calling and function execution
- zen-director (5B) - Text-to-video generation
- zen-musician (7B) - Music generation with lyrics
All Zen models are trained, fine-tuned, and optimized through Zen Gym's unified infrastructure.
✅ All training methods verified and supported:
- Full Fine-tuning: 16-bit full parameter training
- LoRA: Low-Rank Adaptation (memory efficient)
- QLoRA: Quantized LoRA (2/3/4/5/6/8-bit via AQLM/AWQ/GPTQ/LLM.int8/HQQ/EETQ)
- DoRA: Weight-Decomposed Low-Rank Adaptation
- GRPO: Group Relative Policy Optimization (40-60% memory reduction)
- GSPO: Group Sequence Policy Optimization (MoE stability for Qwen3)
- DPO: Direct Preference Optimization
- PPO: Proximal Policy Optimization
- KTO: Kahneman-Tversky Optimization
- ORPO: Odds Ratio Preference Optimization
- SimPO: Simple Preference Optimization
- GaLore: Gradient Low-Rank Projection
- BAdam: Block-wise Adam
- APOLLO: Adaptive Learning Rate Optimizer
- Adam-mini: Memory-efficient Adam
- Muon: Momentum orthogonalized by Newton-Schulz optimizer
- OFT/OFTv2: Orthogonal Fine-Tuning
- LongLoRA: Extended context LoRA
- LoRA+: Enhanced LoRA training
- LoftQ: Quantization-aware LoRA
- PiSSA: Principal Singular values and Singular vectors Adaptation
- BitDelta: Efficient delta compression
- DeltaSoup: Model merging and averaging
- GGUF Export: llama.cpp compatible quantization
- AWQ: Activation-aware Weight Quantization
- GPTQ: Post-training quantization
- FlashAttention-2: 2x faster attention
- Unsloth: 2-5x faster training
- Liger Kernel: LinkedIn's optimized kernels
- RoPE Scaling: Extended context windows
- NEFTune: Noise-enhanced fine-tuning
- Gradient Checkpointing: Reduced memory usage
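The preference-optimization stages follow the same CLI pattern as SFT. As a minimal sketch (flag names follow upstream LLaMA Factory conventions, e.g. --pref_beta; the dataset name, output path, and hyperparameters are placeholders, not a tuned recipe), a DPO run might look like:

# Sketch: DPO training on a pairwise preference dataset (values illustrative)
llamafactory-cli train \
--stage dpo \
--do_train \
--model_name_or_path Qwen/Qwen3-4B \
--dataset your_preference_dataset \
--template qwen3 \
--finetuning_type lora \
--pref_beta 0.1 \
--output_dir ./zen-eco-dpo \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 8 \
--learning_rate 5e-6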
Zen Gym natively supports:
- Qwen3 (0.6B, 4B, 8B, 14B, 30B) - Zen model foundation
- LLaMA, Mistral, Mixtral-MoE, Gemma, DeepSeek, Yi, ChatGLM, Phi
- Multimodal: Qwen2-VL, LLaVA, MiniCPM-o, InternVL
- Audio: Qwen2-Audio, MiniCPM-o-2.6
- Video: Wan2.2, Llama4
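Multimodal models train through the same interface; only the checkpoint, template, and dataset change. A minimal sketch (assumes the qwen2_vl template name and an image-annotated dataset registered in dataset_info.json; all names are placeholders):

# Sketch: LoRA fine-tuning of a vision-language model
llamafactory-cli train \
--stage sft \
--do_train \
--model_name_or_path Qwen/Qwen2-VL-7B-Instruct \
--dataset your_multimodal_dataset \
--template qwen2_vl \
--finetuning_type lora \
--output_dir ./qwen2-vl-lora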
To install Zen Gym:

cd /Users/z/work/zen/gym
conda create -n zen-gym python=3.10
conda activate zen-gym
pip install -r requirements.txt
# For FlashAttention-2 (recommended)
pip install flash-attn --no-build-isolation
# For Unsloth acceleration
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
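To confirm the environment is usable before launching a long run, a quick sanity check (the version subcommand is part of the LLaMA Factory CLI):

# Verify the CLI and GPU visibility
llamafactory-cli version
python -c "import torch; print(torch.cuda.is_available())"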
# Fine-tune zen-nano with LoRA
llamafactory-cli train \
--stage sft \
--do_train \
--model_name_or_path Qwen/Qwen3-0.6B \
--dataset your_dataset \
--template qwen3 \
--finetuning_type lora \
--lora_target all \
--output_dir ./zen-nano-lora \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 4 \
--num_train_epochs 3 \
--lr_scheduler_type cosine \
--learning_rate 5e-5 \
--save_steps 100 \
--logging_steps 10 \
--flash_attn fa2 \
--use_unsloth true
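The --dataset flag refers to an entry in LLaMA Factory's data/dataset_info.json. A minimal sketch of registering a local alpaca-style file (file name and column mapping are placeholders for your data):

{
  "your_dataset": {
    "file_name": "your_dataset.json",
    "columns": {
      "prompt": "instruction",
      "query": "input",
      "response": "output"
    }
  }
}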
# GRPO training for zen-eco
llamafactory-cli train \
--stage grpo \
--do_train \
--model_name_or_path Qwen/Qwen3-4B \
--dataset your_preference_dataset \
--template qwen3 \
--finetuning_type lora \
--output_dir ./zen-eco-grpo \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 8 \
--num_train_epochs 1 \
--learning_rate 1e-5 \
--save_steps 50
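The preference dataset referenced above is pairwise data. In LLaMA Factory's alpaca-style format, each record carries a chosen and a rejected response and the dataset is registered with "ranking": true; a rough sketch of one record (field contents are illustrative):

{
  "instruction": "Summarize the following paragraph.",
  "input": "The Zen model family spans 0.6B to 7B parameters and targets edge devices.",
  "chosen": "A concise, faithful summary of the paragraph.",
  "rejected": "An off-topic or fabricated answer."
}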
# QLoRA 4-bit training
llamafactory-cli train \
--stage sft \
--do_train \
--model_name_or_path Qwen/Qwen3-4B \
--dataset your_dataset \
--template qwen3 \
--finetuning_type lora \
--quantization_bit 4 \
--output_dir ./zen-eco-qlora \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 16 \
--num_train_epochs 3 \
--learning_rate 2e-4
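The commands above run on a single GPU. For multi-GPU training, LLaMA Factory launches through torchrun; a sketch (FORCE_TORCHRUN follows upstream LLaMA Factory usage, and the config path reuses a Zen Gym config from this repo):

# Sketch: 2-GPU training via torchrun (adjust CUDA_VISIBLE_DEVICES for your machine)
CUDA_VISIBLE_DEVICES=0,1 FORCE_TORCHRUN=1 llamafactory-cli train \
--config /Users/z/work/zen/gym/configs/zen_eco_lora.yaml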
# Launch Zen Gym web interface
llamafactory-cli webui

Access the web UI at http://localhost:7860.
# Export trained model to GGUF for llama.cpp
llamafactory-cli export \
--model_name_or_path ./zen-eco-lora \
--adapter_name_or_path ./zen-eco-lora \
--template qwen3 \
--export_dir ./zen-eco-gguf \
--export_size 4 \
--export_quantization_bit 4 \
--export_legacy_format false

zen-nano (0.6B) LoRA configuration:

model_name_or_path: Qwen/Qwen3-0.6B
template: qwen3
finetuning_type: lora
lora_rank: 64
lora_alpha: 32
per_device_train_batch_size: 8
gradient_accumulation_steps: 2
learning_rate: 5e-5
flash_attn: fa2
use_unsloth: true

zen-eco (4B) LoRA configuration:

model_name_or_path: Qwen/Qwen3-4B
template: qwen3
finetuning_type: lora
lora_rank: 128
lora_alpha: 64
per_device_train_batch_size: 4
gradient_accumulation_steps: 4
learning_rate: 2e-5
flash_attn: fa2
use_unsloth: true

zen-agent (4B) tool-calling LoRA configuration:

model_name_or_path: Qwen/Qwen3-4B
dataset: Salesforce/xlam-function-calling-60k
template: qwen3
finetuning_type: lora
lora_rank: 128
lora_alpha: 64
per_device_train_batch_size: 2
gradient_accumulation_steps: 8
learning_rate: 1e-5

zen-musician (7B) LoRA configuration (YuE base):

model_name_or_path: m-a-p/YuE-s1-7B-anneal-en-cot
template: yue
finetuning_type: lora
lora_rank: 64
lora_alpha: 32
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
learning_rate: 2e-4
flash_attn: fa2

Training method trade-offs:

LoRA:
- Memory: ~30% of full fine-tuning
- Speed: 1.5-2x faster
- Quality: 95-98% of full fine-tuning
- Best for: Most training scenarios
--finetuning_type lora \
--lora_rank 64 \
--lora_alpha 32 \
--lora_dropout 0.1 \
--lora_target all

QLoRA (4-bit):
- Memory: ~10% of full fine-tuning
- Speed: 1.2-1.5x faster than full
- Quality: 90-95% of full fine-tuning
- Best for: Limited GPU memory
--finetuning_type lora \
--quantization_bit 4 \
--lora_rank 64 \
--lora_alpha 32

GRPO:
- Memory: 40-60% less than PPO (no value network)
- Speed: 2x faster than PPO
- Quality: Superior to DPO for instruction following
- Best for: Reinforcement learning
--stage grpo \
--finetuning_type lora \
--learning_rate 1e-5

GSPO:
- Memory: Similar to GRPO
- Speed: Optimized for MoE models
- Quality: Better stability for Qwen3-MoE
- Best for: Mixture-of-Experts models
--stage gspo \
--finetuning_type lora \
--learning_rate 1e-5

cd /Users/z/work/zen/zen-musician
# Train with zen-gym
llamafactory-cli train \
--config /Users/z/work/zen/gym/configs/zen_musician_lora.yaml

cd /Users/z/work/zen/zen-nano
# Train with zen-gym
llamafactory-cli train \
--config /Users/z/work/zen/gym/configs/zen_nano_lora.yaml

cd /Users/z/work/zen/zen-eco
# Train with zen-gym
llamafactory-cli train \
--config /Users/z/work/zen/gym/configs/zen_eco_lora.yaml

Zen Gym supports multiple experiment tracking tools:
- TensorBoard: Built-in, zero config
- Weights & Biases: --report_to wandb
- MLflow: --report_to mlflow
- SwanLab: --report_to swanlab
# Enable WandB logging
export WANDB_PROJECT=zen-models
llamafactory-cli train --report_to wandb --config your_config.yaml
# View TensorBoard
tensorboard --logdir ./output

# Deploy with vLLM
llamafactory-cli api \
--model_name_or_path ./zen-eco-lora \
--template qwen3 \
--infer_backend vllm
# Deploy with SGLang
llamafactory-cli api \
--model_name_or_path ./zen-eco-lora \
--template qwen3 \
--infer_backend sglang
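Both backends serve an OpenAI-compatible API. Assuming the upstream default port 8000 (override with the API_PORT environment variable), a request looks roughly like:

# Sketch: query the OpenAI-compatible endpoint (port and model name are assumptions)
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "zen-eco", "messages": [{"role": "user", "content": "Hello"}]}'

For quick local testing without an API server, the chat subcommand below loads the same adapter interactively.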
llamafactory-cli chat \
--model_name_or_path ./zen-eco-lora \
--template qwen3

# Q4_K_M (recommended for most use cases)
llamafactory-cli export \
--model_name_or_path ./model \
--export_dir ./gguf \
--export_quantization_bit 4
# Q8_0 (higher quality)
llamafactory-cli export \
--model_name_or_path ./model \
--export_dir ./gguf \
--export_quantization_bit 8
# Q2_K (maximum compression)
llamafactory-cli export \
--model_name_or_path ./model \
--export_dir ./gguf \
--export_quantization_bit 2
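The exported GGUF files load directly in llama.cpp. A sketch (the llama-cli binary name and the output file name depend on your llama.cpp build and export settings):

# Sketch: run the quantized export with llama.cpp (file name is illustrative)
llama-cli -m ./gguf/model-Q4_K_M.gguf -p "Hello from Zen" -n 128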
# Convert to MLX format
python -m mlx_lm.convert \
--hf-path ./model \
--mlx-path ./mlx \
--quantize
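To sanity-check the MLX conversion (mlx_lm ships a generate entry point; flags shown are the common ones):

# Sketch: generate with the converted MLX model
python -m mlx_lm.generate --model ./mlx --prompt "Hello from Zen" --max-tokens 64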
GPU memory requirements for training:

| Model Size | Training Method | GPU Memory | Recommended GPU |
|---|---|---|---|
| 0.6B | Full | 8GB | RTX 3060 |
| 0.6B | LoRA | 4GB | GTX 1660 Ti |
| 4B | Full | 32GB | RTX 3090 |
| 4B | LoRA | 16GB | RTX 4060 Ti |
| 4B | QLoRA 4-bit | 8GB | RTX 3060 |
| 7B | Full | 48GB | A6000 |
| 7B | LoRA | 24GB | RTX 3090 |
| 7B | QLoRA 4-bit | 12GB | RTX 3060 Ti |
Training speed with various optimizations (zen-eco-4b):
| Configuration | Speed | Memory |
|---|---|---|
| Baseline | 1.0x | 24GB |
| + FlashAttention-2 | 1.5x | 22GB |
| + Unsloth | 2.5x | 20GB |
| + Liger Kernel | 3.0x | 18GB |
| + Gradient Checkpointing | 2.8x | 14GB |
If training runs out of GPU memory:

# Reduce batch size
--per_device_train_batch_size 1
# Increase gradient accumulation
--gradient_accumulation_steps 32
# Enable gradient checkpointing
--gradient_checkpointing true
# Use QLoRA
--quantization_bit 4

If training is slower than expected:

# Enable FlashAttention-2
--flash_attn fa2
# Enable Unsloth
--use_unsloth true
# Enable Liger Kernel
--enable_liger_kernel true

If training is unstable (loss spikes or divergence):

# Lower learning rate
--learning_rate 1e-5
# Add warmup
--warmup_ratio 0.1
# Use cosine scheduler
--lr_scheduler_type cosine
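These fixes combine. A hedged starting point that trades throughput for memory headroom and stability (values are illustrative, not tuned for any specific model):

# Sketch: conservative low-memory, stable-training overrides
per_device_train_batch_size: 1
gradient_accumulation_steps: 32
gradient_checkpointing: true
quantization_bit: 4
learning_rate: 1e-5
warmup_ratio: 0.1
lr_scheduler_type: cosine
flash_attn: fa2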
- Official Docs: https://gym.readthedocs.io/
- Examples: examples/
- Training Configs: configs/
- Qwen3 Guide: configs/qwen3_training_guide.md
Zen Gym is built on LLaMA Factory by hiyouga. We thank the LLaMA Factory team and all contributors for their excellent work.
Apache 2.0 License - see LICENSE for details.
If you use Zen Gym in your research, please cite:
@misc{zengym2025,
title={Zen Gym: Unified Training Infrastructure for Zen AI Models},
author={Zen AI Team},
year={2025},
howpublished={\url{https://github.com/zenlm/zen-gym}}
}
@article{zheng2024llamafactory,
title={LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models},
author={Yaowei Zheng and Richong Zhang and Junhao Zhang and Yanhan Ye and Zheyan Luo and Zhangchi Feng and Yongqiang Ma},
journal={arXiv preprint arXiv:2403.13372},
year={2024}
}

- GitHub: https://github.com/zenlm/zen-gym
- Organization: https://github.com/zenlm
- HuggingFace Models: https://huggingface.co/zenlm
- Zen Engine (Inference): https://github.com/zenlm/zen-engine
- Zen Musician: https://github.com/zenlm/zen-musician
- Zen 3D: https://github.com/zenlm/zen-3d
Zen Gym - Unified training platform for all Zen AI models
Part of the Zen AI ecosystem.