V0.2.5
HelloAgents V0.2.5 Release Notes
📅 Release Date: 2025-10-18
📦 Package: pip install hello-agents[rl]
🔗 GitHub: https://github.com/jjyaoao/HelloAgents
📚 Documentation: https://github.com/jjyaoao/HelloAgents/tree/main/docs/chapter11
🎯 Overview
HelloAgents V0.2.5 introduces a comprehensive Agentic Reinforcement Learning (RL) System, enabling developers to train and fine-tune language models with state-of-the-art RL algorithms. The release implements the complete RL training pipeline described in the Chapter 11 architecture and provides a unified toolkit for SFT, GRPO, and distributed training.
✨ Core Features
- 🎓 SFT (Supervised Fine-Tuning): Train models on instruction-following tasks with LoRA support
- 🚀 GRPO (Group Relative Policy Optimization): Simplified PPO without a Value Model for efficient RL training
- 🎯 Custom Reward Functions: Accuracy, length penalty, and step-based rewards
- 🛠️ Unified Tool Interface: RLTrainingTool fully integrated with HelloAgents framework
- 📊 Distributed Training: Multi-GPU and multi-node support via Accelerate and DeepSpeed
- 🔄 Monitoring Integration: Wandb and TensorBoard support with detailed logging
- 📦 Simplified Imports: Direct access from the hello_agents.rl layer
- 🔄 Backward Compatible: All existing code continues to work
🔧 Installation & Dependencies
Installation
# With RL support
pip install hello-agents[rl]
# Or install manually
pip install hello-agents
pip install trl transformers datasets peft accelerate
Optional Dependencies
# For distributed training
pip install deepspeed
# For monitoring
pip install wandb tensorboard
Dependencies
| Component | Packages | Description |
|---|---|---|
| TRL | trl>=0.12.0 | Transformer Reinforcement Learning |
| Transformers | transformers>=4.40.0 | HuggingFace Transformers |
| PEFT | peft>=0.10.0 | Parameter-Efficient Fine-Tuning |
| Datasets | datasets>=2.18.0 | HuggingFace Datasets |
| Accelerate | accelerate>=0.28.0 | Distributed training support |
| DeepSpeed (optional) | deepspeed>=0.14.0 | Advanced distributed training |
| Wandb (optional) | wandb>=0.16.0 | Experiment tracking |
| TensorBoard | tensorboard>=2.15.0 | Training visualization |
| Core Framework | hello-agents>=0.2.5 | HelloAgents framework |
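After installation, a quick check can list each installed version next to the minimum required above. This is a small sketch using the standard library's importlib.metadata, not a command shipped with the package:

```python
# Print each RL dependency's installed version alongside the minimum listed above.
from importlib.metadata import PackageNotFoundError, version

required = {
    "trl": "0.12.0",
    "transformers": "4.40.0",
    "peft": "0.10.0",
    "datasets": "2.18.0",
    "accelerate": "0.28.0",
}

for package, minimum in required.items():
    try:
        print(f"{package}: installed {version(package)}, requires >= {minimum}")
    except PackageNotFoundError:
        print(f"{package}: NOT INSTALLED")
```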
Environment Configuration
API Keys (Optional for model download):
# HuggingFace Token (for gated models)
HUGGINGFACE_TOKEN="hf_xxx"
# Wandb API Key (for experiment tracking)
WANDB_API_KEY="xxx"
🏗️ RL Training Architecture
Three-Layer RL System (Chapter 11 Design)
Agentic RL Training Architecture
├── Application Layer
│ └── RLTrainingTool - Unified RL training tool wrapper
│
├── Training Layer
│ ├── SFT (Supervised Fine-Tuning)
│ │ ├── SFTTrainerWrapper - TRL SFTTrainer wrapper
│ │ ├── SFT Dataset - Instruction-following dataset
│ │ ├── LoRA Configuration - Parameter-efficient fine-tuning
│ │ └── Training Callbacks - Detailed logging and monitoring
│ │
│ ├── GRPO (Group Relative Policy Optimization)
│ │ ├── GRPOTrainerWrapper - TRL GRPOTrainer wrapper
│ │ ├── GRPO Dataset - Prompt-based dataset
│ │ ├── Reward Functions - Accuracy, length, step-based
│ │ └── KL Divergence Control - Policy regularization
│ │
│ └── Distributed Training
│ ├── DDP - Data Parallel (2-8 GPUs)
│ ├── DeepSpeed ZeRO-2 - Optimizer state sharding
│ ├── DeepSpeed ZeRO-3 - Full model sharding
│ └── Multi-Node - Cluster training support
│
└── Data Layer
├── GSM8K - Grade School Math 8K dataset
├── UltraFeedback - Preference learning dataset
└── Custom Datasets - User-defined datasets
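For reference, the GSM8K entry in the Data Layer is the public HuggingFace dataset; a minimal sketch of loading it with the standard datasets API (this is not the framework's internal loader):

```python
from datasets import load_dataset

# GSM8K "main" config: 7,473 training problems with "question"/"answer" fields
gsm8k = load_dataset("gsm8k", "main", split="train")
print(len(gsm8k))
print(gsm8k[0]["question"])
print(gsm8k[0]["answer"])
```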
🎓 SFT (Supervised Fine-Tuning)
Overview
SFT trains models to follow instructions and learn task-specific formats. HelloAgents provides a complete SFT implementation with LoRA support for efficient training.
Key Features:
- LoRA Support: Train only 0.1% of parameters with minimal quality loss
- GSM8K Dataset: 7,473 math problems for training
- Automatic Formatting: Chat template handling (Qwen, Llama, etc.; see the sketch after this list)
- Progress Tracking: Detailed logging with epoch/step/loss/LR
- Model Saving: Automatic checkpoint saving and merging
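The automatic formatting step is conceptually just the model's chat template applied to each question/answer pair. The sketch below uses the standard transformers API and a made-up GSM8K-style record; it is not the framework's internal code:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

# Illustrative GSM8K-style record (fields follow the dataset's question/answer schema)
example = {
    "question": "A farm has 12 cows and buys 8 more. How many cows are there now?",
    "answer": "12 + 8 = 20 cows. #### 20",
}

messages = [
    {"role": "user", "content": example["question"]},
    {"role": "assistant", "content": example["answer"]},
]

# apply_chat_template inserts the model-specific special tokens (Qwen, Llama, ...)
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)
```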
Quick Start
from hello_agents.tools import RLTrainingTool
# Create RL training tool
rl_tool = RLTrainingTool()
# Run SFT training
result = rl_tool.run({
    "action": "train",
    "algorithm": "sft",
    "model_name": "Qwen/Qwen3-0.6B",
    "output_dir": "./models/sft_model",
    "max_samples": 100,      # Use 100 samples for quick test
    "num_epochs": 3,
    "batch_size": 4,
    "use_lora": True,        # Enable LoRA
    "use_tensorboard": True,
})
print(f"Training completed! Model saved to: {result['output_dir']}")
Training Output
Epoch 1/3 | Step 10/75 | Loss: 2.3456 | LR: 4.5e-05
Epoch 1/3 | Step 20/75 | Loss: 1.8234 | LR: 4.0e-05
Epoch 1/3 | Step 25/75 | Loss: 1.6543 | LR: 3.8e-05
✓ Epoch 1/3 completed | Average Loss: 1.7234
Epoch 2/3 | Step 35/75 | Loss: 1.2345 | LR: 3.5e-05
...
LoRA Configuration
# Default LoRA config (optimized for Qwen models)
{
    "lora_r": 16,                                 # Rank
    "lora_alpha": 32,                             # Alpha (scaling factor)
    "lora_dropout": 0.05,                         # Dropout
    "lora_target_modules": ["q_proj", "v_proj"]   # Target modules
}
# Trainable parameters: ~0.1% of total
# Memory usage: ~50% of full fine-tuning
# Training speed: ~2x faster
🚀 GRPO (Group Relative Policy Optimization)
Overview
GRPO is a simplified PPO algorithm that doesn't require a separate Value Model. It uses group-relative rewards for efficient policy optimization, making it ideal for Agentic RL scenarios.
Key Features:
- No Value Model: Simpler architecture, faster training
- Group Relative Rewards: Completions sampled for the same prompt are scored relative to each other for stable learning (see the sketch after this list)
- Custom Reward Functions: Accuracy, length penalty, step-based rewards
- KL Divergence Control: Prevent policy from deviating too far from reference
- Math Reasoning: Optimized for GSM8K-style math problems
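To make the group-relative idea concrete, the sketch below shows how rewards for several completions of the same prompt can be normalized into advantages. It is simplified and illustrative, not the trainer's internal implementation:

```python
import statistics

# Rewards for 4 completions sampled for the same prompt (e.g., accuracy reward)
group_rewards = [1.0, 0.0, 1.0, 0.0]

mean_r = statistics.mean(group_rewards)
std_r = statistics.pstdev(group_rewards) or 1.0  # avoid division by zero

# Each completion is scored relative to its own group, so no value model is needed
advantages = [(r - mean_r) / std_r for r in group_rewards]
print(advantages)  # above-average completions get positive advantages
```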
Quick Start
from hello_agents.tools import RLTrainingTool
# Create RL training tool
rl_tool = RLTrainingTool()
# Run GRPO training
result = rl_tool.run({
    "action": "train",
    "algorithm": "grpo",
    "model_name": "Qwen/Qwen3-0.6B",
    "output_dir": "./models/grpo_model",
    "max_samples": 100,
    "num_epochs": 2,
    "batch_size": 2,
    "reward_type": "accuracy",  # Use accuracy reward
    "use_lora": True,
    "use_tensorboard": True,
})
print(f"GRPO training completed! Model saved to: {result['output_dir']}")
Training Output
Epoch 1/2 | Step 10/50 | Loss: 0.8234 | Reward: 0.45 | KL: 0.023
Epoch 1/2 | Step 20/50 | Loss: 0.6543 | Reward: 0.62 | KL: 0.018
Epoch 1/2 | Step 25/50 | Loss: 0.5432 | Reward: 0.71 | KL: 0.015
✓ Epoch 1/2 completed | Average Reward: 0.68 | Average KL: 0.019
Epoch 2/2 | Step 35/50 | Loss: 0.4321 | Reward: 0.78 | KL: 0.012
...
Reward Functions
| Reward Type | Description | Use Case |
|---|---|---|
| accuracy | Exact match with ground truth | Math problems, QA |
| length_penalty | Penalize overly long responses | Concise generation |
| step | Reward based on reasoning steps | Multi-step reasoning |
# Accuracy reward: exact match with the ground truth
def accuracy_reward(prediction: str, ground_truth: str) -> float:
    return 1.0 if prediction == ground_truth else 0.0

# Length penalty reward: start from accuracy, penalize overly long responses (target_length is illustrative)
def length_penalty_reward(prediction: str, ground_truth: str, target_length: int = 256) -> float:
    base_reward = accuracy_reward(prediction, ground_truth)
    length_penalty = max(0.0, (len(prediction) - target_length) / target_length)
    return base_reward - 0.1 * length_penalty

# Step reward: accuracy plus a small bonus per reasoning step
def step_reward(prediction: str, ground_truth: str, num_reasoning_steps: int) -> float:
    return accuracy_reward(prediction, ground_truth) + 0.1 * num_reasoning_steps
📊 Distributed Training
Overview
As data and model sizes grow, single-GPU training becomes too slow. HelloAgents supports multi-GPU and multi-node distributed training via Accelerate and DeepSpeed, with zero code changes required.
Key Features:
- DDP (Data Parallel): Simple multi-GPU training (2-8 GPUs)
- DeepSpeed ZeRO-2: Optimizer state sharding (~30% memory savings)
- DeepSpeed ZeRO-3: Full model sharding (~50% memory savings)
- Multi-Node: Cluster training support
- Zero Code Changes: Same training code works for all configurations
Distributed Training Methods
Table: Distributed Training Methods Comparison
| Method | Use Case | Memory Savings | Speed | Complexity |
|---|---|---|---|---|
| DDP | Single machine, 2-8 GPUs | None | Fastest | Low |
| ZeRO-2 | Medium models (1B-7B) | ~30% | Fast | Medium |
| ZeRO-3 | Large models (>7B) | ~50% | Moderate | High |
| Multi-Node | Very large models | ~50%+ | Scalable | High |
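One practical note when comparing these methods: the batch_size passed to RLTrainingTool is a per-GPU value, so the global batch per optimizer step grows with the number of processes. A quick illustrative calculation, assuming no gradient accumulation:

```python
per_gpu_batch_size = 4   # "batch_size" in the training call below
num_gpus = 4             # "num_processes" in the Accelerate config

# Each process trains on its own data shard, so the global batch per step is:
global_batch_size = per_gpu_batch_size * num_gpus
print(global_batch_size)  # 16
```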
Quick Start - DDP Training
# Step 1: Create Accelerate config (one-time setup)
# File: accelerate_configs/multi_gpu_ddp.yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_processes: 4 # Number of GPUs
mixed_precision: fp16
# Step 2: Run training (no code changes!)
accelerate launch --config_file accelerate_configs/multi_gpu_ddp.yaml train_script.py
Training Script (same as single-GPU):
from hello_agents.tools import RLTrainingTool
rl_tool = RLTrainingTool()
# Same code works for single-GPU and multi-GPU!
result = rl_tool.run({
    "action": "train",
    "algorithm": "grpo",
    "model_name": "Qwen/Qwen3-0.6B",
    "output_dir": "./models/grpo_ddp",
    "num_epochs": 3,
    "batch_size": 4,  # Per-GPU batch size
    "use_lora": True,
})
Performance Benchmarks
Test Environment: 4×A100 (40GB), Qwen3-0.6B model
| Method | Training Time | Memory Usage | Throughput | Speedup |
|---|---|---|---|---|
| Single GPU | 120 min | 8GB | 100 samples/s | 1.0x |
| DDP (4 GPUs) | 35 min | 8GB | 350 samples/s | 3.4x |
| ZeRO-2 (4 GPUs) | 32 min | 6GB | 380 samples/s | 3.8x |
| ZeRO-3 (4 GPUs) | 38 min | 4GB | 320 samples/s | 3.2x |
Conclusion:
- DDP: Best for medium models, fastest speed
- ZeRO-2: Balanced performance and memory
- ZeRO-3: Best for large models, lowest memory
DeepSpeed ZeRO-3 Training
# Step 1: Create DeepSpeed config
# File: accelerate_configs/deepspeed_zero3.yaml
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
num_processes: 4
deepspeed_config:
  zero_stage: 3                   # Full model sharding
  offload_optimizer_device: cpu   # Offload to CPU
  offload_param_device: cpu
# Step 2: Run training
accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml train_script.py
🔄 Monitoring & Logging
Detailed Training Logs
HelloAgents provides detailed training logs with epoch/step/loss/LR/reward/KL metrics:
# Enable detailed logging (default: enabled)
result = rl_tool.run({
    "action": "train",
    "algorithm": "grpo",
    # ... other params
})
# Output:
# Epoch 1/3 | Step 10/75 | Loss: 0.8234 | Reward: 0.45 | KL: 0.023 | LR: 4.5e-05
# Epoch 1/3 | Step 20/75 | Loss: 0.6543 | Reward: 0.62 | KL: 0.018 | LR: 4.0e-05
TensorBoard Integration
# Enable TensorBoard (default: enabled)
result = rl_tool.run({
    "action": "train",
    "use_tensorboard": True,
    "output_dir": "./models/grpo_model",
    # ... other params
})
# View logs:
# tensorboard --logdir ./models/grpo_model/runs
Wandb Integration
# Enable Wandb
result = rl_tool.run({
    "action": "train",
    "use_wandb": True,
    "wandb_project": "helloagents-rl",
    # ... other params
})
# Logs automatically uploaded to: https://wandb.ai/your-username/helloagents-rl
📦 Simplified Import System
Before (Complex)
# Deep module paths
from hello_agents.rl.datasets import create_sft_dataset, create_grpo_dataset
from hello_agents.rl.rewards import create_accuracy_reward
from hello_agents.rl.trainers import SFTTrainerWrapper, GRPOTrainerWrapper
After (Simple)
# Direct imports from rl layer
from hello_agents.rl import create_sft_dataset, create_grpo_dataset
from hello_agents.rl import create_accuracy_reward
from hello_agents.rl import SFTTrainerWrapper, GRPOTrainerWrapper
Tool Layer (Simplest)
# Direct tool import
from hello_agents.tools import RLTrainingTool
# All RL functionality in one tool
rl_tool = RLTrainingTool()
💡 Best Practices
Training Strategy
| Scenario | Algorithm | Samples | Epochs | Batch Size | LoRA | Time (Single GPU) |
|---|---|---|---|---|---|---|
| Quick Test | SFT | 100 | 1 | 4 | ✅ | ~5 min |
| Development | SFT | 500 | 3 | 4 | ✅ | ~20 min |
| Production SFT | SFT | 7473 | 3 | 4 | ✅ | ~2 hours |
| RL Fine-tuning | GRPO | 1000 | 2 | 2 | ✅ | ~1 hour |
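As a rough sanity check on the time estimates above, the number of optimizer steps scales with samples, batch size, and epochs. A small sketch for the Production SFT row (ignoring gradient accumulation and any dataset filtering):

```python
import math

# Production SFT row: 7,473 GSM8K samples, batch size 4, 3 epochs
samples, batch_size, epochs = 7473, 4, 3
steps_per_epoch = math.ceil(samples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 1869 steps per epoch, 5607 in total
```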
Hyperparameter Tuning
SFT:
{
    "learning_rate": 5e-5,   # Lower for stability
    "num_epochs": 3,         # 3-5 epochs usually sufficient
    "batch_size": 4,         # Adjust based on GPU memory
    "lora_r": 16,            # 8-32 for most cases
    "lora_alpha": 32,        # Usually 2x lora_r
}
GRPO:
{
    "learning_rate": 1e-5,      # Lower than SFT
    "num_epochs": 2,            # 2-3 epochs to avoid overfitting
    "batch_size": 2,            # Smaller due to generation overhead
    "reward_type": "accuracy",  # Choose based on task
}
Memory Optimization
- Use LoRA: Reduces memory by ~50% (see the sketch after this list)
- Use fp16/bf16: Reduces memory by ~50%
- Gradient Checkpointing: Reduces memory by ~30% (slower)
- Smaller Batch Size: Linear memory reduction
- DeepSpeed ZeRO: Up to 50% memory savings
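As a concrete illustration of the first three items, here is a minimal sketch that combines bf16 weights, gradient checkpointing, and LoRA using the standard transformers and peft APIs. The values are illustrative defaults, not the framework's internal setup:

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load weights in bf16 (roughly half the memory of fp32)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-0.6B",
    torch_dtype=torch.bfloat16,
)

# Trade compute for memory by recomputing activations in the backward pass
model.gradient_checkpointing_enable()

# LoRA: train only small adapter matrices on the attention projections
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of parameters are trainable
```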
📚 Documentation & Examples
Complete Examples
- chapter11/01_dataset_loading.py - Dataset loading examples
- chapter11/02_reward_functions.py - Reward function examples
- chapter11/03_lora_configuration.py - LoRA configuration
- chapter11/04_sft_training.py - SFT training example
- chapter11/05_grpo_training.py - GRPO training example
- chapter11/06_complete_pipeline.py - Complete training pipeline
- chapter11/07_distributed_training.py - Distributed training example
- docs/chapter11/第十一章 Agentic-RL.md - Comprehensive documentation
Quick Reference
# Load dataset
result = rl_tool.run({"action": "load_dataset", "format": "sft", "max_samples": 100})
# Create reward function
result = rl_tool.run({"action": "create_reward", "reward_type": "accuracy"})
# SFT training
result = rl_tool.run({
    "action": "train",
    "algorithm": "sft",
    "model_name": "Qwen/Qwen3-0.6B",
    "max_samples": 100,
    "num_epochs": 3,
})
# GRPO training
result = rl_tool.run({
    "action": "train",
    "algorithm": "grpo",
    "model_name": "Qwen/Qwen3-0.6B",
    "max_samples": 100,
    "num_epochs": 2,
    "reward_type": "accuracy",
})
# Distributed training (no code changes!)
# accelerate launch --config_file xxx.yaml train_script.py
🎯 What's Next
Upcoming Features (V0.3.0)
- PPO Training: Full PPO implementation with Value Model
- RLHF: Reinforcement Learning from Human Feedback
- Custom Datasets: Easy integration of custom datasets
- Model Evaluation: Automatic evaluation on test sets
- Hyperparameter Search: Automated hyperparameter optimization
Community & Contributions
- GitHub Issues: Report bugs and request features
- Documentation: Help improve docs and examples
- Code Contributions: Submit PRs for new features
📞 Support
- GitHub: https://github.com/jjyaoao/HelloAgents
- Documentation: https://github.com/jjyaoao/HelloAgents/tree/main/docs/chapter11
- Email: jjyaoao@126.com
- Datawhale: https://github.com/datawhalechina/Hello-Agents
Happy Training with HelloAgents RL System! 🚀✨