HelloAgents V0.2.5 Release Notes

📅 Release Date: 2025-10-18
📦 Package: pip install hello-agents[rl]
🔗 GitHub: https://github.com/jjyaoao/HelloAgents
📚 Documentation: https://github.com/jjyaoao/HelloAgents/tree/main/docs/chapter11


🎯 Overview

HelloAgents V0.2.5 introduces a comprehensive Agentic Reinforcement Learning (RL) system that lets developers train and fine-tune language models with state-of-the-art RL algorithms. This release implements a complete RL training pipeline following the Chapter 11 architecture, providing a unified toolkit for SFT, GRPO, and distributed training.

✨ Core Features

  • 🎓 SFT (Supervised Fine-Tuning): Train models on instruction-following tasks with LoRA support
  • 🚀 GRPO (Group Relative Policy Optimization): Simplified PPO without Value Model for efficient RL training
  • 🎯 Custom Reward Functions: Accuracy, length penalty, and step-based rewards
  • 🛠️ Unified Tool Interface: RLTrainingTool fully integrated with HelloAgents framework
  • 📊 Distributed Training: Multi-GPU and multi-node support via Accelerate and DeepSpeed
  • 🔄 Monitoring Integration: Wandb and TensorBoard support with detailed logging
  • 📦 Simplified Imports: Direct access from hello_agents.rl layer
  • 🔄 Backward Compatible: All existing code continues to work

🔧 Installation & Dependencies

Installation

# With RL support
pip install hello-agents[rl]

# Or install manually
pip install hello-agents
pip install trl transformers datasets peft accelerate

Optional Dependencies

# For distributed training
pip install deepspeed

# For monitoring
pip install wandb tensorboard

Dependencies

| Component | Package | Description |
| --- | --- | --- |
| TRL | trl>=0.12.0 | Transformer Reinforcement Learning |
| Transformers | transformers>=4.40.0 | HuggingFace Transformers |
| PEFT | peft>=0.10.0 | Parameter-Efficient Fine-Tuning |
| Datasets | datasets>=2.18.0 | HuggingFace Datasets |
| Accelerate | accelerate>=0.28.0 | Distributed training support |
| DeepSpeed (optional) | deepspeed>=0.14.0 | Advanced distributed training |
| Wandb (optional) | wandb>=0.16.0 | Experiment tracking |
| TensorBoard | tensorboard>=2.15.0 | Training visualization |
| Core Framework | hello-agents>=0.2.5 | HelloAgents framework |

Environment Configuration

API keys (optional; the HuggingFace token is used for gated model downloads, the Wandb key for experiment tracking):

# HuggingFace Token (for gated models)
HUGGINGFACE_TOKEN="hf_xxx"

# Wandb API Key (for experiment tracking)
WANDB_API_KEY="xxx"
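
If you keep these keys in a local .env file, the following is a minimal sketch for loading them at runtime. It assumes python-dotenv is installed (pip install python-dotenv); the variable names are the ones shown above.

import os
from dotenv import load_dotenv

# Read HUGGINGFACE_TOKEN / WANDB_API_KEY from a .env file in the working directory
load_dotenv()

hf_token = os.getenv("HUGGINGFACE_TOKEN")  # used when downloading gated models
wandb_key = os.getenv("WANDB_API_KEY")     # used when experiment tracking is enabled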

🏗️ RL Training Architecture

Three-Layer RL System (Chapter 11 Design)

Agentic RL Training Architecture
├── Application Layer
│   └── RLTrainingTool - Unified RL training tool wrapper
│
├── Training Layer
│   ├── SFT (Supervised Fine-Tuning)
│   │   ├── SFTTrainerWrapper - TRL SFTTrainer wrapper
│   │   ├── SFT Dataset - Instruction-following dataset
│   │   ├── LoRA Configuration - Parameter-efficient fine-tuning
│   │   └── Training Callbacks - Detailed logging and monitoring
│   │
│   ├── GRPO (Group Relative Policy Optimization)
│   │   ├── GRPOTrainerWrapper - TRL GRPOTrainer wrapper
│   │   ├── GRPO Dataset - Prompt-based dataset
│   │   ├── Reward Functions - Accuracy, length, step-based
│   │   └── KL Divergence Control - Policy regularization
│   │
│   └── Distributed Training
│       ├── DDP - Data Parallel (2-8 GPUs)
│       ├── DeepSpeed ZeRO-2 - Optimizer state sharding
│       ├── DeepSpeed ZeRO-3 - Full model sharding
│       └── Multi-Node - Cluster training support
│
└── Data Layer
    ├── GSM8K - Grade School Math 8K dataset
    ├── UltraFeedback - Preference learning dataset
    └── Custom Datasets - User-defined datasets

🎓 SFT (Supervised Fine-Tuning)

Overview

SFT trains models to follow instructions and learn task-specific formats. HelloAgents provides a complete SFT implementation with LoRA support for efficient training.

Key Features:

  • LoRA Support: Train only 0.1% of parameters with minimal quality loss
  • GSM8K Dataset: 7,473 math problems for training
  • Automatic Formatting: Chat template handling (Qwen, Llama, etc.; see the sketch after this list)
  • Progress Tracking: Detailed logging with epoch/step/loss/LR
  • Model Saving: Automatic checkpoint saving and merging
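
For reference, the sketch below shows what chat-template formatting looks like when done by hand with the HuggingFace tokenizer API. The SFT pipeline applies the model's template automatically, so this is illustrative rather than part of the RLTrainingTool interface.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

# A single instruction-following example (illustrative content)
messages = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "4"},
]

# Render the conversation with the model's built-in chat template
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)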

Quick Start

from hello_agents.tools import RLTrainingTool

# Create RL training tool
rl_tool = RLTrainingTool()

# Run SFT training
result = rl_tool.run({
    "action": "train",
    "algorithm": "sft",
    "model_name": "Qwen/Qwen3-0.6B",
    "output_dir": "./models/sft_model",
    "max_samples": 100,  # Use 100 samples for quick test
    "num_epochs": 3,
    "batch_size": 4,
    "use_lora": True,  # Enable LoRA
    "use_tensorboard": True,
})

print(f"Training completed! Model saved to: {result['output_dir']}")

Training Output

Epoch 1/3 | Step 10/75 | Loss: 2.3456 | LR: 4.5e-05
Epoch 1/3 | Step 20/75 | Loss: 1.8234 | LR: 4.0e-05
Epoch 1/3 | Step 25/75 | Loss: 1.6543 | LR: 3.8e-05
✓ Epoch 1/3 completed | Average Loss: 1.7234

Epoch 2/3 | Step 35/75 | Loss: 1.2345 | LR: 3.5e-05
...

LoRA Configuration

# Default LoRA config (optimized for Qwen models)
{
    "lora_r": 16,              # Rank
    "lora_alpha": 32,          # Alpha (scaling factor)
    "lora_dropout": 0.05,      # Dropout
    "lora_target_modules": ["q_proj", "v_proj"]  # Target modules
}

# Trainable parameters: ~0.1% of total
# Memory usage: ~50% of full fine-tuning
# Training speed: ~2x faster
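
For comparison, the same settings expressed directly with the PEFT library. This is a sketch; how SFTTrainerWrapper builds its LoRA config internally may differ, but the LoraConfig fields themselves are standard PEFT parameters.

from peft import LoraConfig

# Equivalent LoRA configuration built with PEFT directly
lora_config = LoraConfig(
    r=16,                                 # rank
    lora_alpha=32,                        # scaling factor (usually 2x r)
    lora_dropout=0.05,                    # dropout on LoRA layers
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",                # causal language modeling
)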

🚀 GRPO (Group Relative Policy Optimization)

Overview

GRPO is a simplified PPO algorithm that doesn't require a separate Value Model. It uses group-relative rewards for efficient policy optimization, making it ideal for Agentic RL scenarios.
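
The core idea fits in a few lines: for each prompt the policy samples a group of completions, each completion receives a scalar reward, and the advantage of each completion is its reward normalized against the group's mean and standard deviation. A minimal standalone sketch of that computation (not the trainer's internal code):

import statistics

# Rewards for a group of completions sampled from the same prompt (illustrative values)
group_rewards = [1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0]

mean_r = statistics.mean(group_rewards)
std_r = statistics.pstdev(group_rewards) + 1e-8  # small epsilon avoids division by zero

# Group-relative advantage: how much better each completion did than its group
advantages = [(r - mean_r) / std_r for r in group_rewards]
print(advantages)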

Key Features:

  • No Value Model: Simpler architecture, faster training
  • Group Relative Rewards: Compare within mini-batches for stable learning
  • Custom Reward Functions: Accuracy, length penalty, step-based rewards
  • KL Divergence Control: Prevent policy from deviating too far from reference
  • Math Reasoning: Optimized for GSM8K-style math problems

Quick Start

from hello_agents.tools import RLTrainingTool

# Create RL training tool
rl_tool = RLTrainingTool()

# Run GRPO training
result = rl_tool.run({
    "action": "train",
    "algorithm": "grpo",
    "model_name": "Qwen/Qwen3-0.6B",
    "output_dir": "./models/grpo_model",
    "max_samples": 100,
    "num_epochs": 2,
    "batch_size": 2,
    "reward_type": "accuracy",  # Use accuracy reward
    "use_lora": True,
    "use_tensorboard": True,
})

print(f"GRPO training completed! Model saved to: {result['output_dir']}")

Training Output

Epoch 1/2 | Step 10/50 | Loss: 0.8234 | Reward: 0.45 | KL: 0.023
Epoch 1/2 | Step 20/50 | Loss: 0.6543 | Reward: 0.62 | KL: 0.018
Epoch 1/2 | Step 25/50 | Loss: 0.5432 | Reward: 0.71 | KL: 0.015
✓ Epoch 1/2 completed | Average Reward: 0.68 | Average KL: 0.019

Epoch 2/2 | Step 35/50 | Loss: 0.4321 | Reward: 0.78 | KL: 0.012
...

Reward Functions

| Reward Type | Description | Use Case |
| --- | --- | --- |
| accuracy | Exact match with ground truth | Math problems, QA |
| length_penalty | Penalize overly long responses | Concise generation |
| step | Reward based on reasoning steps | Multi-step reasoning |

# Illustrative implementations of the built-in reward types (simplified)

# Accuracy reward: exact match with the ground truth
def accuracy_reward(prediction: str, ground_truth: str) -> float:
    return 1.0 if prediction == ground_truth else 0.0

# Length-penalty reward: penalize responses longer than target_length
def length_penalty_reward(prediction: str, ground_truth: str, target_length: int) -> float:
    base_reward = accuracy_reward(prediction, ground_truth)
    length_penalty = max(0.0, (len(prediction) - target_length) / target_length)
    return base_reward - 0.1 * length_penalty

# Step reward: bonus proportional to the number of reasoning steps
def step_reward(prediction: str, ground_truth: str, num_reasoning_steps: int) -> float:
    return accuracy_reward(prediction, ground_truth) + 0.1 * num_reasoning_steps

📊 Distributed Training

Overview

As dataset and model sizes grow, single-GPU training quickly becomes a bottleneck. HelloAgents supports multi-GPU and multi-node distributed training via Accelerate and DeepSpeed, with zero code changes required.

Key Features:

  • DDP (Data Parallel): Simple multi-GPU training (2-8 GPUs)
  • DeepSpeed ZeRO-2: Optimizer state sharding (~30% memory savings)
  • DeepSpeed ZeRO-3: Full model sharding (~50% memory savings)
  • Multi-Node: Cluster training support
  • Zero Code Changes: Same training code works for all configurations

Distributed Training Methods

Table: Distributed Training Methods Comparison

| Method | Use Case | Memory Savings | Speed | Complexity |
| --- | --- | --- | --- | --- |
| DDP | Single machine, 2-8 GPUs | None | Fastest | Low |
| ZeRO-2 | Medium models (1B-7B) | ~30% | Fast | Medium |
| ZeRO-3 | Large models (>7B) | ~50% | Moderate | High |
| Multi-Node | Very large models | ~50%+ | Scalable | High |

Quick Start - DDP Training

# Step 1: Create Accelerate config (one-time setup)
# File: accelerate_configs/multi_gpu_ddp.yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_processes: 4  # Number of GPUs
mixed_precision: fp16

# Step 2: Run training (no code changes!)
accelerate launch --config_file accelerate_configs/multi_gpu_ddp.yaml train_script.py

Training Script (same as single-GPU):

from hello_agents.tools import RLTrainingTool

rl_tool = RLTrainingTool()

# Same code works for single-GPU and multi-GPU!
result = rl_tool.run({
    "action": "train",
    "algorithm": "grpo",
    "model_name": "Qwen/Qwen3-0.6B",
    "output_dir": "./models/grpo_ddp",
    "num_epochs": 3,
    "batch_size": 4,  # Per-GPU batch size
    "use_lora": True,
})

Performance Benchmarks

Test Environment: 4×A100 (40GB), Qwen3-0.6B model

| Method | Training Time | Memory (per GPU) | Throughput | Speedup |
| --- | --- | --- | --- | --- |
| Single GPU | 120 min | 8GB | 100 samples/s | 1.0x |
| DDP (4 GPUs) | 35 min | 8GB | 350 samples/s | 3.4x |
| ZeRO-2 (4 GPUs) | 32 min | 6GB | 380 samples/s | 3.8x |
| ZeRO-3 (4 GPUs) | 38 min | 4GB | 320 samples/s | 3.2x |

Conclusion:

  • DDP: Best for medium models, fastest speed
  • ZeRO-2: Balanced performance and memory
  • ZeRO-3: Best for large models, lowest memory

DeepSpeed ZeRO-3 Training

# Step 1: Create DeepSpeed config
# File: accelerate_configs/deepspeed_zero3.yaml
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
num_processes: 4
deepspeed_config:
  zero_stage: 3  # Full model sharding
  offload_optimizer_device: cpu  # Offload to CPU
  offload_param_device: cpu

# Step 2: Run training
accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml train_script.py
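
For the multi-node case mentioned above, the same config and script are launched once per node; only the machine rank and the main node's address change. A sketch using standard accelerate launch flags (the IP address and port are placeholders, and 8 total processes assumes 2 nodes with 4 GPUs each):

# On the main node (rank 0); 10.0.0.1:29500 is a placeholder address
accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml \
  --num_machines 2 --machine_rank 0 --num_processes 8 \
  --main_process_ip 10.0.0.1 --main_process_port 29500 \
  train_script.py

# On the second node (rank 1), pointing at the same main node
accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml \
  --num_machines 2 --machine_rank 1 --num_processes 8 \
  --main_process_ip 10.0.0.1 --main_process_port 29500 \
  train_script.py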

🔄 Monitoring & Logging

Detailed Training Logs

HelloAgents provides detailed training logs with epoch/step/loss/LR/reward/KL metrics:

# Enable detailed logging (default: enabled)
result = rl_tool.run({
    "action": "train",
    "algorithm": "grpo",
    # ... other params
})

# Output:
# Epoch 1/3 | Step 10/75 | Loss: 0.8234 | Reward: 0.45 | KL: 0.023 | LR: 4.5e-05
# Epoch 1/3 | Step 20/75 | Loss: 0.6543 | Reward: 0.62 | KL: 0.018 | LR: 4.0e-05

TensorBoard Integration

# Enable TensorBoard (default: enabled)
result = rl_tool.run({
    "action": "train",
    "use_tensorboard": True,
    "output_dir": "./models/grpo_model",
    # ... other params
})

# View logs:
# tensorboard --logdir ./models/grpo_model/runs

Wandb Integration

# Enable Wandb
result = rl_tool.run({
    "action": "train",
    "use_wandb": True,
    "wandb_project": "helloagents-rl",
    # ... other params
})

# Logs automatically uploaded to: https://wandb.ai/your-username/helloagents-rl

📦 Simplified Import System

Before (Complex)

# Deep module paths
from hello_agents.rl.datasets import create_sft_dataset, create_grpo_dataset
from hello_agents.rl.rewards import create_accuracy_reward
from hello_agents.rl.trainers import SFTTrainerWrapper, GRPOTrainerWrapper

After (Simple)

# Direct imports from rl layer
from hello_agents.rl import create_sft_dataset, create_grpo_dataset
from hello_agents.rl import create_accuracy_reward
from hello_agents.rl import SFTTrainerWrapper, GRPOTrainerWrapper

Tool Layer (Simplest)

# Direct tool import
from hello_agents.tools import RLTrainingTool

# All RL functionality in one tool
rl_tool = RLTrainingTool()

💡 Best Practices

Training Strategy

| Scenario | Algorithm | Samples | Epochs | Batch Size | LoRA | Time (Single GPU) |
| --- | --- | --- | --- | --- | --- | --- |
| Quick Test | SFT | 100 | 1 | 4 | Yes | ~5 min |
| Development | SFT | 500 | 3 | 4 | Yes | ~20 min |
| Production SFT | SFT | 7473 | 3 | 4 | Yes | ~2 hours |
| RL Fine-tuning | GRPO | 1000 | 2 | 2 | Yes | ~1 hour |

Hyperparameter Tuning

SFT:

{
    "learning_rate": 5e-5,  # Lower for stability
    "num_epochs": 3,        # 3-5 epochs usually sufficient
    "batch_size": 4,        # Adjust based on GPU memory
    "lora_r": 16,           # 8-32 for most cases
    "lora_alpha": 32,       # Usually 2x lora_r
}

GRPO:

{
    "learning_rate": 1e-5,  # Lower than SFT
    "num_epochs": 2,        # 2-3 epochs to avoid overfitting
    "batch_size": 2,        # Smaller due to generation overhead
    "reward_type": "accuracy",  # Choose based on task
}
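
Assuming these keys can be passed straight to RLTrainingTool.run alongside the parameters shown in the Quick Start sections (the examples above only show a subset, so treat the extra key names as an assumption and check the Chapter 11 docs for the exact accepted keys), a combined call might look like this:

from hello_agents.tools import RLTrainingTool

rl_tool = RLTrainingTool()

# Hypothetical pass-through of the tuning values above; the learning_rate,
# lora_r, and lora_alpha key names are assumed, not confirmed by the examples
result = rl_tool.run({
    "action": "train",
    "algorithm": "sft",
    "model_name": "Qwen/Qwen3-0.6B",
    "output_dir": "./models/sft_tuned",
    "learning_rate": 5e-5,   # assumed key
    "num_epochs": 3,
    "batch_size": 4,
    "use_lora": True,
    "lora_r": 16,            # assumed key
    "lora_alpha": 32,        # assumed key
})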

Memory Optimization

  • Use LoRA: Reduces memory by ~50%
  • Use fp16/bf16: Reduces memory by ~50%
  • Gradient Checkpointing: Reduces memory by ~30% (slower; see the sketch after this list)
  • Smaller Batch Size: Linear memory reduction
  • DeepSpeed ZeRO: Up to 50% memory savings
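
Several of these knobs map onto standard HuggingFace TrainingArguments fields if you work below the tool layer. A sketch, independent of RLTrainingTool and shown only to name the relevant flags:

from transformers import TrainingArguments

# Memory-oriented settings expressed as standard TrainingArguments flags
training_args = TrainingArguments(
    output_dir="./models/memory_optimized",
    per_device_train_batch_size=2,    # smaller batch size: linear memory reduction
    gradient_accumulation_steps=8,    # keep the effective batch size without the memory cost
    fp16=True,                        # half precision (use bf16=True on Ampere or newer GPUs)
    gradient_checkpointing=True,      # trade compute for ~30% less activation memory
)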

📚 Documentation & Examples

Complete Examples

  • chapter11/01_dataset_loading.py - Dataset loading examples
  • chapter11/02_reward_functions.py - Reward function examples
  • chapter11/03_lora_configuration.py - LoRA configuration
  • chapter11/04_sft_training.py - SFT training example
  • chapter11/05_grpo_training.py - GRPO training example
  • chapter11/06_complete_pipeline.py - Complete training pipeline
  • chapter11/07_distributed_training.py - Distributed training example
  • docs/chapter11/第十一章 Agentic-RL.md - Comprehensive documentation

Quick Reference

# Load dataset
result = rl_tool.run({"action": "load_dataset", "format": "sft", "max_samples": 100})

# Create reward function
result = rl_tool.run({"action": "create_reward", "reward_type": "accuracy"})

# SFT training
result = rl_tool.run({
    "action": "train",
    "algorithm": "sft",
    "model_name": "Qwen/Qwen3-0.6B",
    "max_samples": 100,
    "num_epochs": 3,
})

# GRPO training
result = rl_tool.run({
    "action": "train",
    "algorithm": "grpo",
    "model_name": "Qwen/Qwen3-0.6B",
    "max_samples": 100,
    "num_epochs": 2,
    "reward_type": "accuracy",
})

# Distributed training (no code changes!)
# accelerate launch --config_file xxx.yaml train_script.py

🎯 What's Next

Upcoming Features (V0.3.0)

  • PPO Training: Full PPO implementation with Value Model
  • RLHF: Reinforcement Learning from Human Feedback
  • Custom Datasets: Easy integration of custom datasets
  • Model Evaluation: Automatic evaluation on test sets
  • Hyperparameter Search: Automated hyperparameter optimization

Community & Contributions

  • GitHub Issues: Report bugs and request features
  • Documentation: Help improve docs and examples
  • Code Contributions: Submit PRs for new features

📞 Support

  • GitHub Issues: https://github.com/jjyaoao/HelloAgents/issues
  • Documentation: https://github.com/jjyaoao/HelloAgents/tree/main/docs/chapter11

Happy Training with HelloAgents RL System! 🚀✨