Training social AI agents through self-play with verifiable rewards. Instead of relying on subjective human ratings or LLM judges, we use scenarios with explicit rules and binary win/loss conditions that can be formally verified.
Most social AI training uses feedback that's either subjective (human preferences) or gameable (LLM judges). This leads to reward hacking and inconsistent evaluation. We need a better way.
We train agents on social scenarios where outcomes are objectively verifiable. Think negotiation games with clear rules, resource allocation with defined success criteria, or cooperative tasks with measurable goals. The agent learns through self-play against strategic opponents (currently GPT-4o), getting clean binary rewards based on whether they achieved the scenario's win condition.
```bash
conda activate sotopia-rl
cd /home/keyuh/sotopia-verifiable
python self_play_training.py \
  --num_iterations 3 \
  --games_per_scenario 10 \
  --output_dir training_results
```
This will:
- Generate self-play games between your trainee (Qwen2.5-7B) and partner (GPT-4o)
- Convert games to training data with binary rewards
- Train using GRPO (Group Relative Policy Optimization), as sketched below
- Iterate to improve performance
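GRPO needs no learned value model: within each group of games played on the same scenario, every trajectory's reward is normalized against the group's mean and standard deviation. A minimal sketch of that advantage computation (illustrative only; the actual update is implemented in the sotopia-rl training code):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages for one group of rollouts on the same scenario.

    `rewards` holds the verified outcomes of those games (+1 win, -1 loss, 0 draw);
    each reward is normalized against the group mean and standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 10 games on one scenario, 4 wins and 6 losses.
print(group_relative_advantages([1, 1, 1, 1, -1, -1, -1, -1, -1, -1]))
# Winning trajectories get positive advantages, losing ones negative, so the
# policy gradient increases the likelihood of the winning behavior.
```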
Collect training data:
```bash
python training_data_collector.py \
  --trainee_model_path None \
  --num_games 50 \
  --output_dir training_data
```
Train the model:
```bash
cd fresh_training_results/iteration_1
bash train_grpo.sh
```
Evaluate performance:
```bash
python self_play_evaluator.py \
  --trainee_model_path checkpoints/policy_adapter \
  --num_games 20 \
  --output_path results/evaluation.json
```
We generate social interaction scenarios based on established social science theories. Each scenario has:
- A clear context (negotiation, resource allocation, cooperation task, etc.)
- Explicit win conditions that can be verified through pattern matching
- Strategic depth that requires actual reasoning, not just following a script
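As an illustration only (the field names below are hypothetical and do not reflect the actual schema of `scenarios.db`), a negotiation scenario might be represented like this:

```python
# Hypothetical scenario record; see scenarios/db_helper.py for the real schema.
example_scenario = {
    "id": "negotiation_042",
    "theory": "distributive bargaining",  # social-science grounding
    "context": (
        "You are selling a used laptop and your reservation price is $500. "
        "Close the deal by writing 'FINAL_BID: $<amount>'."
    ),
    "trainee_role": "seller",
    "partner_role": "buyer",
    # Win condition as a machine-checkable pattern plus predicate.
    "outcome_pattern": r"FINAL_BID:\s*\$?(\d+)",
    "win_condition": lambda amount: int(amount) >= 500,
}
```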
- Load scenario from database
- Run conversation between trainee and partner
- Verify outcome using formal patterns (FINAL_BID, ALLOCATION, etc.)
- Assign a reward of +1 (win), -1 (loss), or 0 (draw)
- Update model using GRPO with LoRA adapters
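Putting these steps together, one game looks roughly like the sketch below. The `trainee_generate` helper (local inference with the LoRA-adapted Qwen2.5-7B) is a hypothetical placeholder, the scenario dict reuses the illustrative fields from the example above, and the partner turn goes through the standard OpenAI chat completions API; in the real framework each side also gets its own role-specific prompt.

```python
import re

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def partner_turn(history: list[dict]) -> str:
    """One move from the fixed GPT-4o opponent.

    The history is stored from the trainee's point of view, so assistant/user
    roles are flipped before the partner model sees it.
    """
    flipped = [
        {"role": {"assistant": "user", "user": "assistant"}.get(m["role"], m["role"]),
         "content": m["content"]}
        for m in history
    ]
    resp = client.chat.completions.create(model="gpt-4o", messages=flipped)
    return resp.choices[0].message.content

def play_game(scenario: dict, trainee_generate, max_turns: int = 10) -> int:
    """Run one self-play game and return its verified reward: +1 win, -1 loss, 0 draw."""
    history = [{"role": "system", "content": scenario["context"]}]
    for _ in range(max_turns):
        history.append({"role": "assistant", "content": trainee_generate(history)})
        history.append({"role": "user", "content": partner_turn(history)})
    transcript = "\n".join(m["content"] for m in history[1:])
    match = re.search(scenario["outcome_pattern"], transcript)
    if match is None:
        return 0  # no verifiable outcome declared -> draw
    return 1 if scenario["win_condition"](match.group(1)) else -1
```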
- Base Model: Qwen2.5-7B with LoRA (392M trainable params)
- Partner Model: GPT-4o (fixed, provides strategic opposition)
- Training: GRPO with binary rewards
- Infrastructure: Multi-GPU support, WandB tracking
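For reference, attaching LoRA adapters to Qwen2.5-7B with Hugging Face `peft` looks roughly like this; the rank and target modules are illustrative, and the exact hyperparameters (including those behind the 392M trainable parameters above) live in `train_grpo.sh`:

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "Qwen/Qwen2.5-7B"  # base model named above
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)

lora = LoraConfig(
    r=64,                 # illustrative rank, not the repo's setting
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # reports the trainable-parameter count
```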
```
sotopia-verifiable/
├── scenarios/                     # Scenario generation and database
│   ├── scenario_generator.py
│   ├── scenarios.db
│   └── db_helper.py
├── self_play_evaluator.py         # Core self-play framework
├── training_data_collector.py     # Convert games to training data
├── structured_social_verifier.py  # Outcome verification
├── fresh_training_results/        # Training experiments
│   └── iteration_1/
│       ├── train_grpo.sh
│       ├── training_data/
│       └── checkpoints/
└── sotopia-rl/                    # Training infrastructure
```
Training metrics are automatically tracked on WandB: https://wandb.ai/keyuhe/grpo-model-training
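A minimal sketch of what that tracking involves (the project name matches the linked workspace; the run name, metric names, and values are illustrative):

```python
import wandb

# Hypothetical per-iteration stats; in the real pipeline these come from training.
iteration_stats = [
    {"win_rate": 0.45, "policy_loss": 0.82},
    {"win_rate": 0.60, "policy_loss": 0.57},
]

run = wandb.init(project="grpo-model-training", name="self-play-demo")
for step, stats in enumerate(iteration_stats, start=1):
    wandb.log(stats, step=step)
run.finish()
```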
Expected progression:
- Iteration 1: ~45% win rate vs GPT-4o
- Iteration 2: ~60% win rate with better strategic understanding
- Iteration 3: ~70% win rate with improved social awareness
The training pipeline is operational and we're running initial experiments. Early results suggest the approach works: agents learn to win scenarios through strategic interaction rather than by mimicking surface patterns.
Implemented so far:
- Scenario generation from social science theories
- Self-play game execution with GPT-4o
- Formal verification of outcomes
- GRPO training with LoRA adapters
Currently exploring:
- Scenario diversity and complexity
- Partner model selection (considering curriculum learning)
- Evaluation metrics beyond win rate
- Transfer to open-ended social interaction
Requirements:
- CUDA-capable GPU (tested on RTX A6000)
- Python 3.10+ with PyTorch
- OpenAI API access
- WandB account
To get started:
- Test basic functionality: `python test_self_play.py`
- Generate new scenarios: `cd scenarios && python scenario_generator.py`
- Run training experiments
- Monitor on WandB
- Evaluate and iterate
The codebase is actively being developed. Feel free to explore and experiment with different approaches to scenario design, reward structures, and training algorithms.