A novel reinforcement learning framework for training AI agents in a Godot-based robot FPS combat environment. This project implements per-step online learning with self-regulated learning rates and death-driven negative replay.
This project demonstrates a biologically-inspired training paradigm where agents learn to survive in a hostile environment through continuous adaptation. Unlike traditional RL approaches that separate training and inference, SDAEA performs parameter updates at every step, enabling true online learning even on resource-constrained devices.
- Per-Step Online Learning: Parameters are updated after every interaction step, not in batches
- Self-Regulated Learning: The model dynamically determines its own learning rate from internal signals
- Death-Driven Negative Replay: When HP reaches zero, cached parameters are restored and negative gradients are applied
- Binocular Vision Processing: Processes left and right eye observations (320x300 each) from Godot
- MobileNetV3 Backbone: Efficient neural architecture suitable for real-time processing
- Multi-Agent Environment: Supports up to 8 concurrent agents in the Godot simulation
The Godot environment is a robot FPS combat simulation where agents must:
- Hit other robots (+1 reward)
- Avoid getting hit (each robot has 2 HP)
- Survive as long as possible
Godot Environment Repository: https://github.com/ymrdf/EnvolutionRobot
Each agent has 4 discrete action dimensions (see the gymnasium sketch after the list):
- `accelerate_forward`: 3 choices (backward, stay, forward)
- `accelerate_sideways`: 3 choices (left, stay, right)
- `turn`: 3 choices (left, stay, right)
- `shoot`: 2 choices (don't shoot, shoot)
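For reference, the same action space can be expressed with `gymnasium`'s `MultiDiscrete`; this sketch is illustrative, and only the choice counts come from the list above:

```python
from gymnasium.spaces import MultiDiscrete

# One entry per action dimension: [accelerate_forward, accelerate_sideways, turn, shoot]
action_space = MultiDiscrete([3, 3, 3, 2])

print(action_space.sample())  # e.g. [2 0 1 1] -> forward, strafe left, no turn, shoot
```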
- Left Eye: (3, 300, 320) - RGB image from left camera
- Right Eye: (3, 300, 320) - RGB image from right camera
- HP: Scalar value representing current health
- Binocular images are stacked vertically: (3, 600, 320)
- HP is embedded as a bar appended below: (3, 40, 320), giving (3, 640, 320)
- This frame is concatenated side-by-side with the previous model output (out_next), also (3, 640, 320)
- Final input shape: (3, 640, 640), assembled as in the sketch below
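A minimal sketch of that assembly in PyTorch; only the shapes come from the list above, while the HP-bar rendering (`make_hp_bar`) and the function names are illustrative assumptions:

```python
import torch

def make_hp_bar(hp, max_hp=2.0, height=40, width=320):
    """Render HP as a simple filled bar (a stand-in for the project's actual encoding)."""
    bar = torch.zeros(3, height, width)
    filled = int(width * max(0.0, min(1.0, hp / max_hp)))
    bar[:, :, :filled] = 1.0
    return bar

def build_input(left_eye, right_eye, hp, out_next):
    """Assemble the (3, 640, 640) model input.

    left_eye, right_eye: (3, 300, 320) RGB tensors
    hp:                  scalar health value
    out_next:            (3, 640, 320) previous model output
    """
    eyes = torch.cat([left_eye, right_eye], dim=1)      # stack vertically -> (3, 600, 320)
    frame = torch.cat([eyes, make_hp_bar(hp)], dim=1)   # append HP bar    -> (3, 640, 320)
    return torch.cat([frame, out_next], dim=2)          # add out_next     -> (3, 640, 640)

# Example with dummy tensors:
# x = build_input(torch.rand(3, 300, 320), torch.rand(3, 300, 320), 2.0, torch.zeros(3, 640, 320))
# x.shape  # torch.Size([3, 640, 640])
```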
The model produces a (3, 640, 320) tensor containing:
- Action logits: Extracted from eleven 3×8×8 crops in the first row (decoded as in the sketch after this list)
- Loss region: Center 3×8×8 region used for computing loss
- Learning rate region: Adjacent 3×8×8 region determining the learning rate
- Internal state: The entire output feeds back as input for the next step
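A sketch of how the eleven crops might be decoded into the four discrete actions; the crop positions (first row of 8×8 tiles, side by side) and the mean-reduction to a scalar logit are assumptions, while the crop size and the [3, 3, 3, 2] grouping come from the text above:

```python
import torch

def decode_actions(out):
    """Pick one choice per action dimension from the (3, 640, 320) model output."""
    # One scalar logit per choice, taken from eleven 8x8 crops in the first tile row (assumed layout).
    logits = torch.stack([out[:, 0:8, i * 8:(i + 1) * 8].mean() for i in range(11)])
    groups = {"accelerate_forward": 3, "accelerate_sideways": 3, "turn": 3, "shoot": 2}
    actions, start = {}, 0
    for name, n in groups.items():
        actions[name] = int(torch.argmax(logits[start:start + n]))  # greedy choice per dimension
        start += n
    return actions
```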
- Normal Step:
  - Forward pass through MobileNetV3
  - Extract the loss from the designated output region
  - Compute the dynamic learning rate from the LR region
  - Apply the gradient update
- Death Event (HP ≤ 0), handled as in the sketch below:
  - Restore parameters from the cache (snapshot taken up to 20 steps earlier)
  - Apply a negative loss gradient with a large learning rate (1e-2)
  - Reset the environment and continue training
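A minimal sketch of both cases; the region coordinates, the sigmoid squashing of the learning rate, and the helper names are assumptions, while the LR bounds, the 1e-2 death-penalty LR, and the 20-step cache interval come from this README:

```python
import copy
import torch

LR_MIN, LR_MAX = 1e-5, 5e-4   # self-regulated learning-rate bounds
DEATH_LR = 1e-2               # large LR for the negative replay on death
CACHE_INTERVAL = 20           # parameter snapshot interval (steps)

def region_mean(out, row_off, col_off):
    """Mean over one 8x8 region near the output centre (coordinates are illustrative)."""
    c_h, c_w = out.shape[-2] // 2, out.shape[-1] // 2
    return out[..., c_h + row_off:c_h + row_off + 8, c_w + col_off:c_w + col_off + 8].mean()

def online_step(model, model_input):
    """Normal step: forward pass, loss and LR from their regions, manual SGD update."""
    out = model(model_input)                               # assumed to return the (3, 640, 320) output
    loss = region_mean(out, -4, -4)                        # centre loss region
    lr_raw = region_mean(out, -4, 4)                       # adjacent LR region
    lr = LR_MIN + (LR_MAX - LR_MIN) * torch.sigmoid(lr_raw).item()  # self-regulated LR
    model.zero_grad()
    loss.backward()
    with torch.no_grad():                                  # manual SGD step with the dynamic LR
        for p in model.parameters():
            if p.grad is not None:
                p -= lr * p.grad
    return out.detach()                                    # fed back as out_next at the next step

def on_death(model, cached_state, last_input):
    """Death event: roll back to the cached parameters, push the loss region the other way."""
    model.load_state_dict(cached_state)
    out = model(last_input)
    neg_loss = -region_mean(out, -4, -4)                   # negated loss region
    model.zero_grad()
    neg_loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p -= DEATH_LR * p.grad                     # large-LR negative update

def snapshot(model):
    """Take a parameter snapshot for the rolling cache."""
    return copy.deepcopy(model.state_dict())

# In the training loop (illustrative):
#   if step % CACHE_INTERVAL == 0:
#       cached_state = snapshot(model)
#   if hp <= 0:
#       on_death(model, cached_state, model_input)
```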
- Python 3.10+
- Conda (recommended)
- Godot 4.x with RL Agents plugin
```bash
conda env create -f environment.yml
conda activate envolution
```

### Key Dependencies
- `torch==2.2.2` - Deep learning framework
- `torchvision==0.17.2` - Pre-trained models (MobileNetV3)
- `godot-rl==0.8.2` - Godot RL integration
- `gymnasium==1.0.0` - RL environment interface
- `stable-baselines3==2.4.0` - RL algorithms (reference)
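A quick sanity check after activating the environment; the `mobilenet_v3_small` variant is used here only as an example, and the project may use a different MobileNetV3 variant:

```python
import torch, torchvision
from torchvision.models import mobilenet_v3_small

print(torch.__version__, torchvision.__version__)       # expect 2.2.2 and 0.17.2
backbone = mobilenet_v3_small(weights="IMAGENET1K_V1")   # downloads pretrained weights
print(sum(p.numel() for p in backbone.parameters()))     # roughly 2.5M parameters
```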
1. Clone and open the Godot environment: `git clone https://github.com/ymrdf/EnvolutionRobot`
2. Launch the training notebook: `jupyter notebook my_battle_zone_godot_train.ipynb`
3. Press Play in the Godot Editor when prompted
The training runs with the following default hyperparameters (see the config sketch after the list):
- Learning rate range: [1e-5, 5e-4]
- Death penalty LR: 1e-2
- Parameter cache interval: 20 steps
- Max steps per episode: 100,000
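For convenience, the same defaults collected as a single dictionary; the key names are illustrative rather than the notebook's actual variable names:

```python
config = {
    "lr_min": 1e-5,            # lower bound of the self-regulated learning rate
    "lr_max": 5e-4,            # upper bound of the self-regulated learning rate
    "death_penalty_lr": 1e-2,  # learning rate for the negative replay on death
    "cache_interval": 20,      # steps between parameter snapshots
    "max_steps": 100_000,      # maximum steps per episode
}
```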
The final model is saved to: