This project introduces ALMA (Automated meta-Learning of Memory designs for Agentic systems), a framework that meta-learns memory designs in place of hand-engineered ones, reducing human effort and enabling agentic systems to act as continual learners across diverse domains. ALMA employs a Meta Agent that searches over memory designs expressed as executable code in an open-ended manner, in principle allowing the discovery of arbitrary memory designs, including database schemas together with their retrieval and update mechanisms.
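To make the search space concrete: a memory design here is, roughly, a storage schema plus executable update and retrieval hooks. The sketch below is purely illustrative (the class and method names are our assumptions, not ALMA's API); generated designs are free-form code and need not take this shape.

```python
# Hypothetical sketch of what a searchable "memory design" could look like:
# a storage schema plus executable update/retrieve hooks. ALMA's generated
# designs are free-form code; this shape is illustrative only.
from dataclasses import dataclass, field

@dataclass
class EpisodicMemory:
    # Schema: a flat list of (task, trajectory, outcome) records.
    records: list = field(default_factory=list)

    def update(self, task: str, trajectory: list[str], success: bool) -> None:
        """Write a finished episode into memory."""
        self.records.append({"task": task, "trajectory": trajectory, "success": success})

    def retrieve(self, task: str, k: int = 3) -> list[dict]:
        """Return up to k successful episodes whose task shares words with the query."""
        query = set(task.lower().split())
        scored = [
            (len(query & set(r["task"].lower().split())), r)
            for r in self.records
            if r["success"]
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [r for score, r in scored[:k] if score > 0]
```

The Meta Agent's job is to rewrite designs like this, including the schema itself and both hooks, rather than only tuning their parameters.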
*Figure: Open-ended exploration process of ALMA.*
- 🧠 Automatic Memory Design Discovery - ALMA learns memory designs automatically instead of relying on hand-engineered ones
- 🎯 Domain Adaptation - Automatically specializes memory designs for diverse sequential decision-making tasks
- 🔬 Comprehensive Evaluation - Tested across four domains: AlfWorld, TextWorld, BabaisAI, and MiniHack
- 📈 Superior Performance - Outperforms state-of-the-art human-designed baselines across all benchmarks
- ⚡ Cost Efficiency - Learned designs are more efficient than most human-designed baselines
```bash
# Clone the project
git clone https://github.com/zksha/alma.git
cd ./alma

# Create environment
conda create -n alma python=3.11
conda activate alma

# Install dependencies
pip install -r requirements.txt
```
Then add your API key to the `.env` file:

```bash
# .env
OPENAI_API_KEY=your_openai_api_key_here
```
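If you need the key from your own scripts, a common pattern is to load it with `python-dotenv`; this is an assumption about the usual convention, not a documented part of this repository:

```python
# Sketch: load the API key from .env. Assumes python-dotenv is installed;
# this is a common convention, not a documented ALMA API.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
api_key = os.environ["OPENAI_API_KEY"]
```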
> [!WARNING]
> This repository executes model-generated code as part of the memory design search process. While the code goes through a verification and debugging stage, dynamically generated code may behave unpredictably. Use at your own risk, ideally inside a sandboxed or isolated environment.
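For an extra layer of isolation beyond Docker, one generic option is to execute generated snippets in a separate process with a hard timeout. This is a minimal sketch of that idea, not part of ALMA; the helper name and timeout value are illustrative assumptions:

```python
# Minimal sketch: run an untrusted, model-generated snippet in a child
# process with a hard timeout. NOT part of ALMA; it only illustrates the
# kind of isolation the warning above recommends.
import subprocess
import sys
import tempfile

def run_untrusted(code: str, timeout_s: int = 30) -> subprocess.CompletedProcess:
    """Write generated code to a temp file and execute it in a child process."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    # A child process limits blast radius; for real isolation, prefer a
    # container or VM as suggested above.
    return subprocess.run(
        [sys.executable, path],
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )

result = run_untrusted("print('hello from generated code')")
print(result.stdout)
```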
```bash
# Build ALFWorld Docker image
cd envs_docker/alfworld
bash image_build.sh
```

```bash
# Build BALROG Docker image (used for TextWorld, BabaisAI, and MiniHack)
cd envs_docker/BALROG
bash image_build.sh
```

> [!TIP]
> The BALROG image is shared across TextWorld, BabaisAI, and MiniHack domains, so you only need to build it once.
To run the learning process that discovers new memory designs:
```bash
python run_main.py \
    --rollout_type batched \
    --meta_model gpt-5 \
    --execution_model gpt-5-nano \
    --batch_max_update_concurrent 10 \
    --batch_max_retrieve_concurrent 10 \
    --task_type alfworld \
    --status train \
    --train_size 30
```

Parameters:
| Parameter | Description | Options |
|---|---|---|
| `--rollout_type` | Execution strategy for evaluations; `sequential` allows both update and retrieval in the deployment phase | `batched`, `sequential` |
| `--meta_model` | Model used by the meta agent to propose memory designs | `gpt-5`, `gpt-4.1`, etc. |
| `--execution_model` | Model used by agents during task execution | `gpt-5-mini/medium`, `gpt-5-nano/low`, `gpt-4o-mini`, etc. |
| `--batch_max_update_concurrent` | Max concurrent memory update operations (see the sketch after this table) | Integer (e.g., `10`) |
| `--batch_max_retrieve_concurrent` | Max concurrent memory retrieval operations | Integer (e.g., `10`) |
| `--task_type` | Domain to run experiments on | `alfworld`, `textworld`, `babaisai`, `minihack` |
| `--status` | Execution mode | `train`, `eval_in_distribution`, `eval_out_of_distribution` |
| `--train_size` | Number of training tasks | Integer (e.g., `30`, `50`, `100`) |
| `--memo_SHA` | SHA of a learned memory design, provided for testing | String (e.g., `g-memory`, `53cee295`) |
> [!TIP]
> Example configurations for different domains are in `training.sh` and `testing.sh`.

Learned memory designs are stored in `memo_archive`. Learning logs are stored in `logs`.
To extend the benchmark to a new domain:
- Build the Docker image for the new domain.
- Add prompts, configs, and `{env_name}_envs.py` to the `envs` archive (a hypothetical wrapper sketch follows this list).
- Add task descriptions for the meta agent in `meta_agent_prompt.py`.
- Register the container and name for the new benchmark in `eval_in_container.py`.
- Run the meta agent to discover specialized memory designs.
- Evaluate the results against baseline memory designs.
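The exact interface expected of `{env_name}_envs.py` is defined by the existing files in the `envs` archive; the sketch below only shows the general shape of a text-environment wrapper, and every name in it is an assumption for illustration.

```python
# Hypothetical shape for a new-domain wrapper ({env_name}_envs.py). Copy the
# real interface from an existing file in the envs archive; every name below
# is an assumption for illustration.
class MyDomainEnv:
    def __init__(self, task_id: str):
        self.task_id = task_id

    def reset(self) -> str:
        """Return the initial textual observation for the task."""
        return f"Task {self.task_id}: you are in an empty room."

    def step(self, action: str) -> tuple[str, float, bool]:
        """Apply an agent action; return (observation, reward, done)."""
        done = action.strip().lower() == "finish"
        obs = "Task complete." if done else "Nothing happens."
        return obs, float(done), done
```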
Our learned memory designs consistently outperform state-of-the-art human-designed memory baselines across all benchmarks.
Numbers indicate overall success rate in percentage (higher is better). Improvements are relative to the no-memory baseline.
| Foundation Model (FM) in Agentic System | GPT-5-nano / low | GPT-5-mini / medium |
|---|---|---|
| No Memory | 6.1 | 41.1 |
| **Manual Memory Designs** | | |
| Trajectory Retrieval | 8.6 (+2.5) | 48.6 (+7.5) |
| Reasoning Bank | 7.5 (+1.4) | 40.1 (−1.0) |
| Dynamic Cheatsheet | 7.2 (+1.1) | 46.5 (+5.4) |
| G-Memory | 7.7 (+1.6) | 46.0 (+4.9) |
| **Learned Memory Design** | | |
| Our Method | 12.3 (+6.2) | 53.9 (+12.8) |
Key findings:
- Learned designs adapt to domain-specific requirements automatically
- Better performance scaling with memory size
- Faster learning under task distribution shifts
- Lower computational costs compared to human-designed baselines
This research was supported by the Vector Institute, the Canada CIFAR AI Chairs program, a grant from Schmidt Futures, an NSERC Discovery Grant, and a generous donation from Rafael Cosman. Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute (https://vectorinstitute.ai/partnerships/current-partners/). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the sponsors.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.