V0.2.5
HelloAgents V0.2.5 Release Notes
📅 Release Date: 2025-10-18
📦 Package: pip install hello-agents[rl]
🔗 GitHub: https://github.com/jjyaoao/HelloAgents
📚 Documentation: https://github.com/jjyaoao/HelloAgents/tree/main/docs/chapter11
🎯 Overview
HelloAgents V0.2.5 introduces a comprehensive Agentic Reinforcement Learning (RL) System, enabling developers to train and fine-tune language models with state-of-the-art RL algorithms. The release implements the complete RL training pipeline described in the Chapter 11 architecture and provides a unified toolkit for SFT, GRPO, and distributed training.
✨ Core Features
- 🎓 SFT (Supervised Fine-Tuning): Train models on instruction-following tasks with LoRA support
- 🚀 GRPO (Group Relative Policy Optimization): Simplified PPO without a Value Model for efficient RL training
- 🎯 Custom Reward Functions: Accuracy, length penalty, and step-based rewards
- 🛠️ Unified Tool Interface: RLTrainingTool fully integrated with HelloAgents framework
- 📊 Distributed Training: Multi-GPU and multi-node support via Accelerate and DeepSpeed
- 🔄 Monitoring Integration: Wandb and TensorBoard support with detailed logging
- 📦 Simplified Imports: Direct access from the hello_agents.rl layer
- 🔄 Backward Compatible: All existing code continues to work
🔧 Installation & Dependencies
Installation
# With RL support
pip install hello-agents[rl]
# Or install manually
pip install hello-agents
pip install trl transformers datasets peft accelerate
Optional Dependencies
# For distributed training
pip install deepspeed
# For monitoring
pip install wandb tensorboard
Dependencies
| Component | Packages | Description |
|---|---|---|
| TRL | trl>=0.12.0 | Transformer Reinforcement Learning |
| Transformers | transformers>=4.40.0 | HuggingFace Transformers |
| PEFT | peft>=0.10.0 | Parameter-Efficient Fine-Tuning |
| Datasets | datasets>=2.18.0 | HuggingFace Datasets |
| Accelerate | accelerate>=0.28.0 | Distributed training support |
| DeepSpeed (optional) | deepspeed>=0.14.0 | Advanced distributed training |
| Wandb (optional) | wandb>=0.16.0 | Experiment tracking |
| TensorBoard | tensorboard>=2.15.0 | Training visualization |
| Core Framework | hello-agents>=0.2.5 | HelloAgents framework |
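After installation, a quick check can list each installed version next to the minimum required above. This is a small sketch using the standard library's importlib.metadata, not a command shipped with the package:

```python
# Print each RL dependency's installed version alongside the minimum listed above.
from importlib.metadata import PackageNotFoundError, version

required = {
    "trl": "0.12.0",
    "transformers": "4.40.0",
    "peft": "0.10.0",
    "datasets": "2.18.0",
    "accelerate": "0.28.0",
}

for package, minimum in required.items():
    try:
        print(f"{package}: installed {version(package)}, requires >= {minimum}")
    except PackageNotFoundError:
        print(f"{package}: NOT INSTALLED")
```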
Environment Configuration
API Keys (Optional for model download):
# HuggingFace Token (for gated models)
HUGGINGFACE_TOKEN="hf_xxx"
# Wandb API Key (for experiment tracking)
WANDB_API_KEY="xxx"
🏗️ RL Training Architecture
Three-Layer RL System (Chapter 11 Design)
Agentic RL Training Architecture
├── Application Layer
│ └── RLTrainingTool - Unified RL training tool wrapper
│
├── Training Layer
│ ├── SFT (Supervised Fine-Tuning)
│ │ ├── SFTTrainerWrapper - TRL SFTTrainer wrapper
│ │ ├── SFT Dataset - Instruction-following dataset
│ │ ├── LoRA Configuration - Parameter-efficient fine-tuning
│ │ └── Training Callbacks - Detailed logging and monitoring
│ │
│ ├── GRPO (Group Relative Policy Optimization)
│ │ ├── GRPOTrainerWrapper - TRL GRPOTrainer wrapper
│ │ ├── GRPO Dataset - Prompt-based dataset
│ │ ├── Reward Functions - Accuracy, length, step-based
│ │ └── KL Divergence Control - Policy regularization
│ │
│ └── Distributed Training
│ ├── DDP - Data Parallel (2-8 GPUs)
│ ├── DeepSpeed ZeRO-2 - Optimizer state sharding
│ ├── DeepSpeed ZeRO-3 - Full model sharding
│ └── Multi-Node - Cluster training support
│
└── Data Layer
├── GSM8K - Grade School Math 8K dataset
├── UltraFeedback - Preference learning dataset
└── Custom Datasets - User-defined datasets
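For reference, the GSM8K entry in the Data Layer is the public HuggingFace dataset; a minimal sketch of loading it with the standard datasets API (this is not the framework's internal loader):

```python
from datasets import load_dataset

# GSM8K "main" config: 7,473 training problems with "question"/"answer" fields
gsm8k = load_dataset("gsm8k", "main", split="train")
print(len(gsm8k))
print(gsm8k[0]["question"])
print(gsm8k[0]["answer"])
```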
🎓 SFT (Supervised Fine-Tuning)
Overview
SFT trains models to follow instructions and learn task-specific formats. HelloAgents provides a complete SFT implementation with LoRA support for efficient training.
Key Features:
- LoRA Support: Train only 0.1% of parameters with minimal quality loss
- GSM8K Dataset: 7,473 math problems for training
- Automatic Formatting: Chat template handling (Qwen, Llama, etc.; see the sketch after this list)
- Progress Tracking: Detailed logging with epoch/step/loss/LR
- Model Saving: Automatic checkpoint saving and merging
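The automatic formatting step is conceptually just the model's chat template applied to each question/answer pair. The sketch below uses the standard transformers API and a made-up GSM8K-style record; it is not the framework's internal code:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

# Illustrative GSM8K-style record (fields follow the dataset's question/answer schema)
example = {
    "question": "A farm has 12 cows and buys 8 more. How many cows are there now?",
    "answer": "12 + 8 = 20 cows. #### 20",
}

messages = [
    {"role": "user", "content": example["question"]},
    {"role": "assistant", "content": example["answer"]},
]

# apply_chat_template inserts the model-specific special tokens (Qwen, Llama, ...)
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)
```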
Quick Start
from hello_agents.tools import RLTrainingTool
# Create RL training tool
rl_tool = RLTrainingTool()
# Run SFT training
result = rl_tool.run({
    "action": "train",
    "algorithm": "sft",
    "model_name": "Qwen/Qwen3-0.6B",
    "output_dir": "./models/sft_model",
    "max_samples": 100,      # Use 100 samples for quick test
    "num_epochs": 3,
    "batch_size": 4,
    "use_lora": True,        # Enable LoRA
    "use_tensorboard": True,
})
print(f"Training completed! Model saved to: {result['output_dir']}")
Training Output
Epoch 1/3 | Step 10/75 | Loss: 2.3456 | LR: 4.5e-05
Epoch 1/3 | Step 20/75 | Loss: 1.8234 | LR: 4.0e-05
Epoch 1/3 | Step 25/75 | Loss: 1.6543 | LR: 3.8e-05
✓ Epoch 1/3 completed | Average Loss: 1.7234
Epoch 2/3 | Step 35/75 | Loss: 1.2345 | LR: 3.5e-05
...
LoRA Configuration
# Default LoRA config (optimized for Qwen models)
{
    "lora_r": 16,                                 # Rank
    "lora_alpha": 32,                             # Alpha (scaling factor)
    "lora_dropout": 0.05,                         # Dropout
    "lora_target_modules": ["q_proj", "v_proj"]   # Target modules
}
# Trainable parameters: ~0.1% of total
# Memory usage: ~50% of full fine-tuning
# Training speed: ~2x faster
🚀 GRPO (Group Relative Policy Optimization)
Overview
GRPO is a simplified PPO algorithm that doesn't require a separate Value Model. It uses group-relative rewards for efficient policy optimization, making it ideal for Agentic RL scenarios.
Key Features:
- No Value Model: Simpler architecture, faster training
- Group Relative Rewards: Completions sampled for the same prompt are scored relative to each other for stable learning (see the sketch after this list)
- Custom Reward Functions: Accuracy, length penalty, step-based rewards
- KL Divergence Control: Prevent policy from deviating too far from reference
- Math Reasoning: Optimized for GSM8K-style math problems
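To make the group-relative idea concrete, the sketch below shows how rewards for several completions of the same prompt can be normalized into advantages. It is simplified and illustrative, not the trainer's internal implementation:

```python
import statistics

# Rewards for 4 completions sampled for the same prompt (e.g., accuracy reward)
group_rewards = [1.0, 0.0, 1.0, 0.0]

mean_r = statistics.mean(group_rewards)
std_r = statistics.pstdev(group_rewards) or 1.0  # avoid division by zero

# Each completion is scored relative to its own group, so no value model is needed
advantages = [(r - mean_r) / std_r for r in group_rewards]
print(advantages)  # above-average completions get positive advantages
```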
Quick Start
from hello_agents.tools import RLTrainingTool
# Create RL training tool
rl_tool = RLTrainingTool()
# Run GRPO training
result = rl_tool.run({
    "action": "train",
    "algorithm": "grpo",
    "model_name": "Qwen/Qwen3-0.6B",
    "output_dir": "./models/grpo_model",
    "max_samples": 100,
    "num_epochs": 2,
    "batch_size": 2,
    "reward_type": "accuracy",  # Use accuracy reward
    "use_lora": True,
    "use_tensorboard": True,
})
print(f"GRPO training completed! Model saved to: {result['output_dir']}")
Training Output
Epoch 1/2 | Step 10/50 | Loss: 0.8234 | Reward: 0.45 | KL: 0.023
Epoch 1/2 | Step 20/50 | Loss: 0.6543 | Reward: 0.62 | KL: 0.018
Epoch 1/2 | Step 25/50 | Loss: 0.5432 | Reward: 0.71 | KL: 0.015
✓ Epoch 1/2 completed | Average Reward: 0.68 | Average KL: 0.019
Epoch 2/2 | Step 35/50 | Loss: 0.4321 | Reward: 0.78 | KL: 0.012
...
Reward Functions
| Reward Type | Description | Use Case |
|---|---|---|
| accuracy | Exact match with ground truth | Math problems, QA |
| length_penalty | Penalize overly long responses | Concise generation |
| step | Reward based on reasoning steps | Multi-step reasoning |
# Accuracy reward: exact match with the ground truth
def accuracy_reward(prediction: str, ground_truth: str) -> float:
    return 1.0 if prediction == ground_truth else 0.0

# Length penalty reward: start from accuracy, penalize overly long responses (target_length is illustrative)
def length_penalty_reward(prediction: str, ground_truth: str, target_length: int = 256) -> float:
    base_reward = accuracy_reward(prediction, ground_truth)
    length_penalty = max(0.0, (len(prediction) - target_length) / target_length)
    return base_reward - 0.1 * length_penalty

# Step reward: accuracy plus a small bonus per reasoning step
def step_reward(prediction: str, ground_truth: str, num_reasoning_steps: int) -> float:
    return accuracy_reward(prediction, ground_truth) + 0.1 * num_reasoning_steps
📊 Distributed Training
Overview
As data and model sizes grow, single-GPU training becomes too slow. HelloAgents supports multi-GPU and multi-node distributed training via Accelerate and DeepSpeed, with zero code changes required.
Key Features:
- DDP (Data Parallel): Simple multi-GPU training (2-8 GPUs)
- DeepSpeed ZeRO-2: Optimizer state sharding (~30% memory savings)
- DeepSpeed ZeRO-3: Full model sharding (~50% memory savings)
- Multi-Node: Cluster training support
- Zero Code Changes: Same training code works for all configurations
Distributed Training Methods
Table: Distributed Training Methods Comparison
| Method | Use Case | Memory Savings | Speed | Complexity |
|---|---|---|---|---|
| DDP | Single machine, 2-8 GPUs | None | Fastest | Low |
| ZeRO-2 | Medium models (1B-7B) | ~30% | Fast | Medium |
| ZeRO-3 | Large models (>7B) | ~50% | Moderate | High |
| Multi-Node | Very large models | ~50%+ | Scalable | High |
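One practical note when comparing these methods: the batch_size passed to RLTrainingTool is a per-GPU value, so the global batch per optimizer step grows with the number of processes. A quick illustrative calculation, assuming no gradient accumulation:

```python
per_gpu_batch_size = 4   # "batch_size" in the training call below
num_gpus = 4             # "num_processes" in the Accelerate config

# Each process trains on its own data shard, so the global batch per step is:
global_batch_size = per_gpu_batch_size * num_gpus
print(global_batch_size)  # 16
```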
Quick Start - DDP Training
# Step 1: Create Accelerate config (one-time setup)
# File: accelerate_configs/multi_gpu_ddp.yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_processes: 4 # Number of GPUs
mixed_precision: fp16
# Step 2: Run training (no code changes!)
accelerate launch --config_file accelerate_configs/multi_gpu_ddp.yaml train_script.py
Training Script (same as single-GPU):
from hello_agents.tools import RLTrainingTool
rl_tool = RLTrainingTool()
# Same code works for single-GPU and multi-GPU!
result = rl_tool.run({
    "action": "train",
    "algorithm": "grpo",
    "model_name": "Qwen/Qwen3-0.6B",
    "output_dir": "./models/grpo_ddp",
    "num_epochs": 3,
    "batch_size": 4,  # Per-GPU batch size
    "use_lora": True,
})
Performance Benchmarks
Test Environment: 4×A100 (40GB), Qwen3-0.6B model
| Method | Training Time | Memory Usage | Throughput | Speedup |
|---|---|---|---|---|
| Single GPU | 120 min | 8GB | 100 samples/s | 1.0x |
| DDP (4 GPUs) | 35 min | 8GB | 350 samples/s | 3.4x |
| ZeRO-2 (4 GPUs) | 32 min | 6GB | 380 samples/s | 3.8x |
| ZeRO-3 (4 GPUs) | 38 min | 4GB | 320 samples/s | 3.2x |
Conclusion:
- DDP: Best for medium models, fastest speed
- ZeRO-2: Balanced performance and memory
- ZeRO-3: Best for large models, lowest memory
DeepSpeed ZeRO-3 Training
# Step 1: Create DeepSpeed config
# File: accelerate_configs/deepspeed_zero3.yaml
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
num_processes: 4
deepspeed_config:
  zero_stage: 3                   # Full model sharding
  offload_optimizer_device: cpu   # Offload to CPU
  offload_param_device: cpu
# Step 2: Run training
accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml train_script.py
🔄 Monitoring & Logging
Detailed Training Logs
HelloAgents provides detailed training logs with epoch/step/loss/LR/reward/KL metrics:
# Enable detailed logging (default: enabled)
result = rl_tool.run({
    "action": "train",
    "algorithm": "grpo",
    # ... other params
})
# Output:
# Epoch 1/3 | Step 10/75 | Loss: 0.8234 | Reward: 0.45 | KL: 0.023 | LR: 4.5e-05
# Epoch 1/3 | Step 20/75 | Loss: 0.6543 | Reward: 0.62 | KL: 0.018 | LR: 4.0e-05
TensorBoard Integration
# Enable TensorBoard (default: enabled)
result = rl_tool.run({
    "action": "train",
    "use_tensorboard": True,
    "output_dir": "./models/grpo_model",
    # ... other params
})
# View logs:
# tensorboard --logdir ./models/grpo_model/runs
Wandb Integration
# Enable Wandb
result = rl_tool.run({
    "action": "train",
    "use_wandb": True,
    "wandb_project": "helloagents-rl",
    # ... other params
})
# Logs automatically uploaded to: https://wandb.ai/your-username/helloagents-rl
📦 Simplified Import System
Before (Complex)
# Deep module paths
from hello_agents.rl.datasets import create_sft_dataset, create_grpo_dataset
from hello_agents.rl.rewards import create_accuracy_reward
from hello_agents.rl.trainers import SFTTrainerWrapper, GRPOTrainerWrapper
After (Simple)
# Direct imports from rl layer
from hello_agents.rl import create_sft_dataset, create_grpo_dataset
from hello_agents.rl import create_accuracy_reward
from hello_agents.rl import SFTTrainerWrapper, GRPOTrainerWrapper
Tool Layer (Simplest)
# Direct tool import
from hello_agents.tools import RLTrainingTool
# All RL functionality in one tool
rl_tool = RLTrainingTool()
💡 Best Practices
Training Strategy
| Scenario | Algorithm | Samples | Epochs | Batch Size | LoRA | Time (Single GPU) |
|---|---|---|---|---|---|---|
| Quick Test | SFT | 100 | 1 | 4 | ✅ | ~5 min |
| Development | SFT | 500 | 3 | 4 | ✅ | ~20 min |
| Production SFT | SFT | 7473 | 3 | 4 | ✅ | ~2 hours |
| RL Fine-tuning | GRPO | 1000 | 2 | 2 | ✅ | ~1 hour |
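As a rough sanity check on the time estimates above, the number of optimizer steps scales with samples, batch size, and epochs. A small sketch for the Production SFT row (ignoring gradient accumulation and any dataset filtering):

```python
import math

# Production SFT row: 7,473 GSM8K samples, batch size 4, 3 epochs
samples, batch_size, epochs = 7473, 4, 3
steps_per_epoch = math.ceil(samples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 1869 steps per epoch, 5607 in total
```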
Hyperparameter Tuning
SFT:
{
    "learning_rate": 5e-5,   # Lower for stability
    "num_epochs": 3,         # 3-5 epochs usually sufficient
    "batch_size": 4,         # Adjust based on GPU memory
    "lora_r": 16,            # 8-32 for most cases
    "lora_alpha": 32,        # Usually 2x lora_r
}
GRPO:
{
    "learning_rate": 1e-5,      # Lower than SFT
    "num_epochs": 2,            # 2-3 epochs to avoid overfitting
    "batch_size": 2,            # Smaller due to generation overhead
    "reward_type": "accuracy",  # Choose based on task
}
Memory Optimization
- Use LoRA: Reduces memory by ~50% (see the sketch after this list)
- Use fp16/bf16: Reduces memory by ~50%
- Gradient Checkpointing: Reduces memory by ~30% (slower)
- Smaller Batch Size: Linear memory reduction
- DeepSpeed ZeRO: Up to 50% memory savings
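As a concrete illustration of the first three items, here is a minimal sketch that combines bf16 weights, gradient checkpointing, and LoRA using the standard transformers and peft APIs. The values are illustrative defaults, not the framework's internal setup:

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load weights in bf16 (roughly half the memory of fp32)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-0.6B",
    torch_dtype=torch.bfloat16,
)

# Trade compute for memory by recomputing activations in the backward pass
model.gradient_checkpointing_enable()

# LoRA: train only small adapter matrices on the attention projections
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of parameters are trainable
```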
📚 Documentation & Examples
Complete Examples
- chapter11/01_dataset_loading.py - Dataset loading examples
- chapter11/02_reward_functions.py - Reward function examples
- chapter11/03_lora_configuration.py - LoRA configuration
- chapter11/04_sft_training.py - SFT training example
- chapter11/05_grpo_training.py - GRPO training example
- chapter11/06_complete_pipeline.py - Complete training pipeline
- chapter11/07_distributed_training.py - Distributed training example
- docs/chapter11/第十一章 Agentic-RL.md - Comprehensive documentation
Quick Reference
# Load dataset
result = rl_tool.run({"action": "load_dataset", "format": "sft", "max_samples": 100})
# Create reward function
result = rl_tool.run({"action": "create_reward", "reward_type": "accuracy"})
# SFT training
result = rl_tool.run({
    "action": "train",
    "algorithm": "sft",
    "model_name": "Qwen/Qwen3-0.6B",
    "max_samples": 100,
    "num_epochs": 3,
})
# GRPO training
result = rl_tool.run({
    "action": "train",
    "algorithm": "grpo",
    "model_name": "Qwen/Qwen3-0.6B",
    "max_samples": 100,
    "num_epochs": 2,
    "reward_type": "accuracy",
})
# Distributed training (no code changes!)
# accelerate launch --config_file xxx.yaml train_script.py
🎯 What's Next
Upcoming Features (V0.3.0)
- PPO Training: Full PPO implementation with Value Model
- RLHF: Reinforcement Learning from Human Feedback
- Custom Datasets: Easy integration of custom datasets
- Model Evaluation: Automatic evaluation on test sets
- Hyperparameter Search: Automated hyperparameter optimization
Community & Contributions
- GitHub Issues: Report bugs and request features
- Documentation: Help improve docs and examples
- Code Contributions: Submit PRs for new features
📞 Support
- GitHub: https://github.com/jjyaoao/HelloAgents
- Documentation: https://github.com/jjyaoao/HelloAgents/tree/main/docs/chapter11
- Email: jjyaoao@126.com
- Datawhale: https://github.com/datawhalechina/Hello-Agents
Happy Training with HelloAgents RL System! 🚀✨