Training social AI agents through self-play with verifiable rewards. Instead of relying on subjective human ratings or LLM judges, we use scenarios with explicit rules and binary win/loss conditions that can be formally verified.
Most social AI training uses feedback that's either subjective (human preferences) or gameable (LLM judges). This leads to reward hacking and inconsistent evaluation. We need a better way.
We train agents on social scenarios where outcomes are objectively verifiable. Think negotiation games with clear rules, resource allocation with defined success criteria, or cooperative tasks with measurable goals. The agent learns through self-play against strategic opponents (currently GPT-4o), getting clean binary rewards based on whether they achieved the scenario's win condition.
```bash
conda activate sotopia-rl
cd /home/keyuh/sotopia-verifiable
python self_play_training.py \
  --num_iterations 3 \
  --games_per_scenario 10 \
  --output_dir training_results
```
This will:
- Generate self-play games between your trainee (Qwen2.5-7B) and partner (GPT-4o)
- Convert games to training data with binary rewards
- Train using GRPO (Group Relative Policy Optimization), as sketched below
- Iterate to improve performance
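GRPO needs no learned value model: within each group of games played on the same scenario, every trajectory's reward is normalized against the group's mean and standard deviation. A minimal sketch of that advantage computation (illustrative only; the actual update is implemented in the sotopia-rl training code):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages for one group of rollouts on the same scenario.

    `rewards` holds the verified outcomes of those games (+1 win, -1 loss, 0 draw);
    each reward is normalized against the group mean and standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 10 games on one scenario, 4 wins and 6 losses.
print(group_relative_advantages([1, 1, 1, 1, -1, -1, -1, -1, -1, -1]))
# Winning trajectories get positive advantages, losing ones negative, so the
# policy gradient increases the likelihood of the winning behavior.
```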
Collect training data:
```bash
python training_data_collector.py \
  --trainee_model_path None \
  --num_games 50 \
  --output_dir training_data
```
Train the model:
```bash
cd fresh_training_results/iteration_1
bash train_grpo.sh
```
Evaluate performance:
```bash
python self_play_evaluator.py \
  --trainee_model_path checkpoints/policy_adapter \
  --num_games 20 \
  --output_path results/evaluation.json
```
We generate social interaction scenarios based on established social science theories. Each scenario has:
- A clear context (negotiation, resource allocation, cooperation task, etc.)
- Explicit win conditions that can be verified through pattern matching
- Strategic depth that requires actual reasoning, not just following a script
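As an illustration only (the field names below are hypothetical and do not reflect the actual schema of `scenarios.db`), a negotiation scenario might be represented like this:

```python
# Hypothetical scenario record; see scenarios/db_helper.py for the real schema.
example_scenario = {
    "id": "negotiation_042",
    "theory": "distributive bargaining",  # social-science grounding
    "context": (
        "You are selling a used laptop and your reservation price is $500. "
        "Close the deal by writing 'FINAL_BID: $<amount>'."
    ),
    "trainee_role": "seller",
    "partner_role": "buyer",
    # Win condition as a machine-checkable pattern plus predicate.
    "outcome_pattern": r"FINAL_BID:\s*\$?(\d+)",
    "win_condition": lambda amount: int(amount) >= 500,
}
```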
- Load scenario from database
- Run conversation between trainee and partner
- Verify outcome using formal patterns (FINAL_BID, ALLOCATION, etc.)
- Assign a reward of +1 (win), -1 (loss), or 0 (draw)
- Update model using GRPO with LoRA adapters
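Putting these steps together, one game looks roughly like the sketch below. The `trainee_generate` helper (local inference with the LoRA-adapted Qwen2.5-7B) is a hypothetical placeholder, the scenario dict reuses the illustrative fields from the example above, and the partner turn goes through the standard OpenAI chat completions API; in the real framework each side also gets its own role-specific prompt.

```python
import re

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def partner_turn(history: list[dict]) -> str:
    """One move from the fixed GPT-4o opponent.

    The history is stored from the trainee's point of view, so assistant/user
    roles are flipped before the partner model sees it.
    """
    flipped = [
        {"role": {"assistant": "user", "user": "assistant"}.get(m["role"], m["role"]),
         "content": m["content"]}
        for m in history
    ]
    resp = client.chat.completions.create(model="gpt-4o", messages=flipped)
    return resp.choices[0].message.content

def play_game(scenario: dict, trainee_generate, max_turns: int = 10) -> int:
    """Run one self-play game and return its verified reward: +1 win, -1 loss, 0 draw."""
    history = [{"role": "system", "content": scenario["context"]}]
    for _ in range(max_turns):
        history.append({"role": "assistant", "content": trainee_generate(history)})
        history.append({"role": "user", "content": partner_turn(history)})
    transcript = "\n".join(m["content"] for m in history[1:])
    match = re.search(scenario["outcome_pattern"], transcript)
    if match is None:
        return 0  # no verifiable outcome declared -> draw
    return 1 if scenario["win_condition"](match.group(1)) else -1
```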
- Base Model: Qwen2.5-7B with LoRA (392M trainable params)
- Partner Model: GPT-4o (fixed, provides strategic opposition)
- Training: GRPO with binary rewards
- Infrastructure: Multi-GPU support, WandB tracking
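For reference, attaching LoRA adapters to Qwen2.5-7B with Hugging Face `peft` looks roughly like this; the rank and target modules are illustrative, and the exact hyperparameters (including those behind the 392M trainable parameters above) live in `train_grpo.sh`:

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "Qwen/Qwen2.5-7B"  # base model named above
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)

lora = LoraConfig(
    r=64,                 # illustrative rank, not the repo's setting
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # reports the trainable-parameter count
```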
```
sotopia-verifiable/
├── scenarios/                     # Scenario generation and database
│   ├── scenario_generator.py
│   ├── scenarios.db
│   └── db_helper.py
├── self_play_evaluator.py         # Core self-play framework
├── training_data_collector.py     # Convert games to training data
├── structured_social_verifier.py  # Outcome verification
├── fresh_training_results/        # Training experiments
│   └── iteration_1/
│       ├── train_grpo.sh
│       ├── training_data/
│       └── checkpoints/
└── sotopia-rl/                    # Training infrastructure
```
Training metrics are automatically tracked on WandB: https://wandb.ai/keyuhe/grpo-model-training
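A minimal sketch of what that tracking involves (the project name matches the linked workspace; the run name, metric names, and values are illustrative):

```python
import wandb

# Hypothetical per-iteration stats; in the real pipeline these come from training.
iteration_stats = [
    {"win_rate": 0.45, "policy_loss": 0.82},
    {"win_rate": 0.60, "policy_loss": 0.57},
]

run = wandb.init(project="grpo-model-training", name="self-play-demo")
for step, stats in enumerate(iteration_stats, start=1):
    wandb.log(stats, step=step)
run.finish()
```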
Expected progression:
- Iteration 1: ~45% win rate vs GPT-4o
- Iteration 2: ~60% win rate with better strategic understanding
- Iteration 3: ~70% win rate with improved social awareness
The training pipeline is operational and we're running initial experiments. Early results suggest the approach works: agents learn to win scenarios through strategic interaction rather than by mimicking surface patterns.
Implemented so far:
- Scenario generation from social science theories
- Self-play game execution with GPT-4o
- Formal verification of outcomes
- GRPO training with LoRA adapters
Currently exploring:
- Scenario diversity and complexity
- Partner model selection (considering curriculum learning)
- Evaluation metrics beyond win rate
- Transfer to open-ended social interaction
Requirements:
- CUDA-capable GPU (tested on RTX A6000)
- Python 3.10+ with PyTorch
- OpenAI API access
- WandB account
To get started:
- Test basic functionality: `python test_self_play.py`
- Generate new scenarios: `cd scenarios && python scenario_generator.py`
- Run training experiments
- Monitor on WandB
- Evaluate and iterate
The codebase is actively being developed. Feel free to explore and experiment with different approaches to scenario design, reward structures, and training algorithms.