A minimal implementation of Recursive Language Models (RLMs) using Deno and Pyodide.
GitHub | Documentation | PyPI
Watch the full video on YouTube RLM Tutorial
RLMs are an inference technique where an LLM interacts with arbitrarily long prompts through an external REPL. The LLM can write code to explore, decompose, and transform the prompt. It can recursively invoke sub-agents to complete smaller subtasks. Crucially, sub-agent responses are not automatically loaded into the parent agent's context — they are returned as symbols or variables inside the parent's REPL.
If you find this helpful, consider supporting on Patreon — it hosts all code, projects, slides, and write-ups from the YouTube channel.
pip install fast-rlm

- Python 3.10+
- Deno 2+
  - macOS/Linux: curl -fsSL https://deno.land/install.sh | sh
  - Windows (npm): npm install -g deno
- (Optional) Bun, only needed for the TUI log viewer
Set your LLM API key before running:
export RLM_MODEL_API_KEY=sk-or-...

| Variable | Description | Default |
|---|---|---|
| RLM_MODEL_API_KEY | API key for the OpenAI-compatible backend (falls back to OPENAI_API_KEY, then OPENROUTER_API_KEY) | (none) |
| RLM_MODEL_BASE_URL | OpenAI-compatible base URL | https://openrouter.ai/api/v1 |
That's all you need to get started. By default, fast-rlm uses OpenRouter; you can point it at any OpenAI-compatible API by setting RLM_MODEL_BASE_URL. fast-rlm also runs on Vertex AI, the native Anthropic API, and local ACP coding agents — see Backend setup at the end of this README.
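The fallback order in the table can be pictured with a small sketch. `resolve_api_key` and `resolve_base_url` are hypothetical helpers written for illustration, not fast-rlm's own code; only the variable names and the default URL come from the table, and treating an empty value as unset is an assumption.

```python
from typing import Mapping, Optional

# Lookup order and default URL taken from the table; helper names are hypothetical.
KEY_FALLBACK_ORDER = ("RLM_MODEL_API_KEY", "OPENAI_API_KEY", "OPENROUTER_API_KEY")
DEFAULT_BASE_URL = "https://openrouter.ai/api/v1"


def resolve_api_key(env: Mapping[str, str]) -> Optional[str]:
    """Return the first non-empty key in fallback order, or None if none is set."""
    for name in KEY_FALLBACK_ORDER:
        value = env.get(name)
        if value:  # assumption: an empty string counts as "not set"
            return value
    return None


def resolve_base_url(env: Mapping[str, str]) -> str:
    """Return RLM_MODEL_BASE_URL if set, otherwise the OpenRouter default."""
    return env.get("RLM_MODEL_BASE_URL") or DEFAULT_BASE_URL
```

Passing `os.environ` as `env` would check the current shell; a plain dict works equally well for testing.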
import fast_rlm
from fast_rlm import RLMConfig
# primary_agent is REQUIRED — there is no default model.
config = RLMConfig(primary_agent="z-ai/glm-5")
result = fast_rlm.run("Generate 50 fruits and count number of r", config=config)
print(result["results"])
print(result["usage"])
primary_agent is required. Every run() needs a config that sets it (e.g. RLMConfig(primary_agent="...")); sub_agent is optional and defaults to primary_agent. The shorter examples below omit config= for brevity; pass the config above to run them.
The same engine is available as a fast-rlm CLI — handy for one-off runs and shell pipelines:
# A plain prompt
fast-rlm "Generate 50 fruits and count number of r" --primary-agent z-ai/glm-5
# Feed a file as the context. Parsed by extension:
# .json/.yaml/.yml -> dict/list .jsonl/.ndjson -> list[dict]
# anything else (.csv, .tsv, .xml, .toml, .txt, ...) -> raw text the model parses
# itself (its extension is noted so it knows the format).
# The prompt becomes the instruction; for a dict input with no "instruction" key,
# it's also injected into the dict.
fast-rlm "Aggregate the reviews into a verdict" --input-file reviews.json --primary-agent z-ai/glm-5
# -q prints only the result (clean for piping); other knobs mirror RLMConfig:
fast-rlm "..." --primary-agent acp:opencode --max-depth 2 --max-global-calls 50 -q

Run fast-rlm --help for all flags (--sub-agent, --max-calls, --acp-agents, --vertex, …).
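The extension-based parsing described in the comments above can be sketched with the standard library. `load_input_file` is a hypothetical name, not fast-rlm's loader, and YAML is left out because it needs a third-party parser.

```python
import json
from pathlib import Path
from typing import Any


def load_input_file(path: str) -> Any:
    """Parse a context file by extension, following the mapping described above."""
    p = Path(path)
    suffix = p.suffix.lower()
    text = p.read_text(encoding="utf-8")
    if suffix == ".json":
        return json.loads(text)  # dict or list
    if suffix in (".jsonl", ".ndjson"):
        # One JSON value per non-empty line -> list of records
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    if suffix in (".yaml", ".yml"):
        # Would also yield dict/list, but is omitted from this stdlib-only sketch.
        raise NotImplementedError("YAML needs a third-party parser")
    # Anything else (.csv, .tsv, .xml, .toml, .txt, ...) stays raw text.
    return text
```

Returning raw text for unknown extensions mirrors the described behavior of letting the model parse the format itself.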
The same file loading is available from Python — run() accepts an input_file (in place of query):
fast_rlm.run(input_file="reviews.json", instruction="Aggregate into a verdict", config=config)

The primary_agent / sub_agent string selects one of four backends:

| Mode | Example primary_agent | What it is |
|---|---|---|
| Any OpenAI-compatible API (default) | "gpt-5-mini", "deepseek-chat", "minimax/minimax-m3" | OpenAI, DeepSeek, OpenRouter (default), or any compatible endpoint |
| Vertex AI | "vertex/claude-sonnet-4-6" | Google Cloud (ADC auth) |
| Anthropic API | "claude-haiku-4-5", "anthropic/claude-sonnet-4-6" | Native Anthropic; falls back to the OpenAI-compatible endpoint if no key |
| ACP coding agent | "acp:codex", "acp:claude-code", "acp:opencode" | Drives a local coding agent, read-only |
Set the credential only for the backend(s) you use — see Backend setup at the end of this README. An ACP-only run needs no API key at all.
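As a rough mental model of the selector table, the prefix checks might look like the sketch below. `pick_backend` is a hypothetical function; the order of the checks and the omission of the Vertex override flag and the Anthropic key fallback are simplifying assumptions.

```python
def pick_backend(agent: str) -> str:
    """Map a primary_agent / sub_agent string to a backend label by prefix."""
    if agent.startswith("acp:"):
        return "acp"
    if agent.startswith("vertex/"):
        return "vertex"
    if agent.startswith("claude-") or agent.startswith("anthropic/"):
        return "anthropic"
    # Everything else goes to the OpenAI-compatible endpoint (the default).
    return "openai-compatible"


for example in ("gpt-5-mini", "vertex/claude-sonnet-4-6", "claude-haiku-4-5", "acp:codex"):
    print(example, "->", pick_backend(example))
```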
The key idea behind RLMs is that the prompt can be arbitrarily long — far beyond any model's context window. The agent explores it programmatically through the REPL rather than trying to fit it all into a single call.
import fast_rlm
transcripts = open("lex_fridman_all_transcripts.txt").read() # millions of tokens
result = fast_rlm.run(
"Here are the transcripts of all Lex Fridman podcasts. "
"Summarize what the first 5 Machine Learning guests had to say about AGI.\n\n"
+ transcripts
)
print(result["results"])

The agent will write code to search, filter, and chunk the transcripts on its own; no manual splitting is required.
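To make the idea concrete, here is the kind of code an agent might write in its REPL to narrow a huge prompt before reading any of it. This is illustrative only: the helper names, chunk size, and keyword filter are assumptions, and the real agent chooses its own strategy.

```python
from typing import Iterator


def chunk_text(text: str, chunk_chars: int = 2000) -> Iterator[tuple[int, str]]:
    """Yield (offset, chunk) pairs covering the text in fixed-size windows."""
    for start in range(0, len(text), chunk_chars):
        yield start, text[start:start + chunk_chars]


def find_relevant_chunks(text: str, keywords: list[str], chunk_chars: int = 2000) -> list[tuple[int, str]]:
    """Keep only the chunks that mention at least one keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    return [
        (offset, chunk)
        for offset, chunk in chunk_text(text, chunk_chars)
        if any(k in chunk.lower() for k in lowered)
    ]


# Toy stand-in for a prompt far larger than any context window.
context = ("filler " * 500) + "Guest: I think AGI is decades away. " + ("filler " * 500)
hits = find_relevant_chunks(context, ["AGI"], chunk_chars=1000)
print(len(context), "chars ->", len(hits), "relevant chunk(s)")
```

Fixed-size windows can split a keyword across a boundary, so a more careful version would overlap adjacent chunks.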
Instead of squeezing your data into a string, you can pass a dict as the query and ask for a typed result back via output_schema. The agent receives the dict as a real Python dict (no parsing on its first turn), and its FINAL value is validated against the schema before being returned.
import fast_rlm
from pydantic import BaseModel
class Verdict(BaseModel):
movie: str
average_score: float
consensus: str
result = fast_rlm.run(
{
"task": "Aggregate the reviews into a single verdict.",
"movie": "The Trail of Pixels",
"reviews": [
{"name": "Asha", "score": 8, "text": "Tight pacing..."},
{"name": "Bo", "score": 6, "text": "Beautiful but thin..."},
{"name": "Cy", "score": 9, "text": "Instant favorite..."},
],
},
output_schema=Verdict,
)
verdict = Verdict.model_validate(result["results"])

Structured input. When query is a dict, the agent's initial probe prints a flat top-level schema (keys + type + length + truncated preview) so it can index context["reviews"] directly instead of stringifying.
Structured output. output_schema accepts:
| Form | Example |
|---|---|
| Pydantic model class | output_schema=MyModel |
| Pydantic generic | output_schema=list[MyModel] |
| Python primitive | output_schema=int (also str, float, bool, list, dict) |
| Raw JSON Schema dict | output_schema={"type": "array", "items": {"type": "string"}} |
The schema is shown to the agent at step 0 (Required output schema for FINAL (JSON Schema):). After every FINAL(...) call the value is validated; on failure the agent receives the schema and the specific validation errors (path + message) and may retry within its remaining call budget. Pydantic is an optional dependency — only required if you pass a Pydantic class or generic.
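The validate-then-retry loop can be sketched with a tiny validator that reports errors as path + message. It covers only a small subset of JSON Schema (type, items, required, properties); `validate` and `first_valid_final` are hypothetical names, not fast-rlm's implementation.

```python
from typing import Any

# JSON Schema type name -> Python type(s) accepted for it.
_TYPES = {
    "string": str, "integer": int, "number": (int, float),
    "boolean": bool, "array": list, "object": dict,
}


def validate(value: Any, schema: dict, path: str = "$") -> list[str]:
    """Return a list of 'path: message' errors; an empty list means valid."""
    errors: list[str] = []
    expected = schema.get("type")
    if expected in _TYPES:
        ok = isinstance(value, _TYPES[expected])
        # bool is a subclass of int in Python, so reject it for numeric types.
        if expected in ("integer", "number") and isinstance(value, bool):
            ok = False
        if not ok:
            return [f"{path}: expected {expected}, got {type(value).__name__}"]
    if isinstance(value, list) and "items" in schema:
        for i, item in enumerate(value):
            errors += validate(item, schema["items"], f"{path}[{i}]")
    if isinstance(value, dict):
        for key in schema.get("required", []):
            if key not in value:
                errors.append(f"{path}.{key}: required property is missing")
        for key, sub in schema.get("properties", {}).items():
            if key in value:
                errors += validate(value[key], sub, f"{path}.{key}")
    return errors


def first_valid_final(candidates: list[Any], schema: dict) -> Any:
    """Accept the first candidate FINAL value that validates.

    Each candidate stands for one attempt; in a real loop the errors from a
    failed attempt would be fed back to the agent before it tries again.
    """
    for candidate in candidates:
        if not validate(candidate, schema):
            return candidate
    raise ValueError("call budget exhausted without a valid FINAL value")


schema = {"type": "array", "items": {"type": "string"}}
print(validate(["apple", 3], schema))  # ['$[1]: expected string, got int']
```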
Schemas for subagents. Inside the REPL the agent can require a subagent's output shape by passing a JSON Schema dict as the second argument to llm_query:
schema = {"type": "array", "items": {"type": "string"}}
fruits = await llm_query("Generate 25 fruit names.", schema)
The child subagent enforces the schema the same way. See examples/structured_io.py and examples/parallel_r_count.py for end-to-end demos.
Inside the REPL the agent has two built-in tools and may also receive user-defined tools as ordinary Python functions. There is no separate tool-calling API — tools are just callables in the REPL namespace.
Pass Python functions to fast_rlm.run(..., tools=[my_fn]) and they will be pre-loaded into the root agent's REPL. The agent is shown each function's name, parameter names, and docstring (used as its description). It is not shown the function's source code, although it can inspect it if the task requires. The agent calls these tools like any normal function inside the REPL.
def filter_short(items: list[str], max_len: int = 20) -> list[str]:
"""Return only items shorter than max_len."""
return [x for x in items if len(x) < max_len]
result = fast_rlm.run("Pick the short titles from the list." + str(list_of_titles), tools=[filter_short])

Two rules apply to any tool that may be handed to a sub-agent:
- Sub-agents do NOT inherit tools automatically. To give a child a tool, the main agent must pass it explicitly in the REPL: await llm_query("...", tools=[filter_short]).
- Tools must be self-contained. Do imports inside the function body and don't close over REPL-level variables; the child runs in a fresh REPL where outer state does not exist.
The agent can also def new functions inside the REPL at any time and pass them down the same way.
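The self-containment rule is easy to demonstrate without fast-rlm at all. The sketch below mimics a child REPL by defining a function from source in an empty namespace; `call_in_fresh_namespace` is a hypothetical helper, and how fast-rlm actually ships tools to a child is not shown here.

```python
import re

PATTERN = re.compile(r"\w+")  # REPL-level state that exists only in the parent

SELF_CONTAINED = '''
def count_words(text):
    import re  # imported inside the body, so it travels with the function
    return len(re.findall(r"\\w+", text))
'''

CLOSES_OVER_STATE = '''
def count_words(text):
    return len(PATTERN.findall(text))  # PATTERN is not defined in the child
'''


def call_in_fresh_namespace(source: str, name: str, *args):
    """Define a function from source in an empty namespace and call it,
    mimicking a child REPL that shares no state with its parent."""
    namespace: dict = {}
    exec(source, namespace)
    return namespace[name](*args)


print(call_in_fresh_namespace(SELF_CONTAINED, "count_words", "a b c"))  # 3
try:
    call_in_fresh_namespace(CLOSES_OVER_STATE, "count_words", "a b c")
except NameError as err:
    print("fails in the child:", err)
```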
Currently all tools are expected to be Python functions. They are callable only from code that runs inside the REPL; the LLM cannot invoke them directly while it is generating code or reasoning steps.
Tools often need credentials or configuration (API keys, base URLs, account IDs). Pass them through the env_variables kwarg on fast_rlm.run(...):
import os
import fast_rlm
def search_web(query: str, top_k: int = 5) -> list[dict]:
"""Search the web via Tavily and return the top results."""
import os, urllib.request, json
req = urllib.request.Request(
"https://api.tavily.com/search",
data=json.dumps({"query": query, "max_results": top_k}).encode(),
headers={
"Authorization": f"Bearer {os.environ['TAVILY_API_KEY']}",
"Content-Type": "application/json",
},
)
return json.loads(urllib.request.urlopen(req).read())["results"]
result = fast_rlm.run(
"Find three recent papers on recursive language models.",
tools=[search_web],
env_variables={"TAVILY_API_KEY": os.environ["TAVILY_API_KEY"]},
)

Behavior:
- env_variables must be a dict[str, str].
- Each entry is injected into os.environ inside every Pyodide REPL spawned by the run: the root agent and all sub-agents.
- They are not set on the host Deno process and never appear in prompts, logs, or model context. The model only ever sees a tool's signature + docstring, so the key stays hidden as long as your tool doesn't print or return it.
- Tools read them with the normal os.environ["..."] (do the import os inside the tool body; see the self-containment rule above).
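Because env_variables must map strings to strings, a quick pre-flight check can catch a stray int or None before a run starts. `check_env_variables` is a hypothetical helper, not part of fast-rlm, and the error messages are invented for illustration.

```python
def check_env_variables(env_variables: object) -> dict:
    """Return env_variables unchanged if it is a dict[str, str], else raise TypeError."""
    if not isinstance(env_variables, dict):
        raise TypeError("env_variables must be a dict")
    for key, value in env_variables.items():
        if not isinstance(key, str) or not isinstance(value, str):
            raise TypeError(
                f"env_variables[{key!r}] must map str to str, got {type(value).__name__}"
            )
    return env_variables


# Numbers have to be stringified, just like real environment variables.
settings = check_env_variables({"TAVILY_TIMEOUT_SECONDS": str(30)})
```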
Pass a directive through the instruction kwarg on fast_rlm.run(...). When provided, it is appended to the end of the agent's system prompt:
Here is the user's instructions - you must follow it closely:
{instruction}
result = fast_rlm.run(
"Summarize the attached incident report.",
instruction="Write all output in formal British English and never use bullet points.",
)

Instructions apply to one agent only; they are never inherited. run(instruction=...) configures the root agent and nothing else. Sub-agents start with no instruction; to give a sub-agent one, the parent must pass it explicitly when it delegates:
# inside an agent's REPL — instruct the child you spawn
result = await llm_query(
chunk,
instruction="Extract only dollar amounts; return them as a JSON list.",
)

This is a recursive, no-carry-on design: each agent sees only the instruction its spawner handed it. A child does not inherit its parent's instruction, and the child's own llm_query(...) calls start fresh unless it passes instruction= again. There is intentionally no global, run-wide instruction.
Behavior:
- instruction must be a str. When omitted (None), nothing is appended and the prompt is unchanged.
- Because it is appended after the built-in prompt, a forceful instruction can override default behavior (e.g. output format or even the task itself). Keep it focused on how to answer rather than restating the task.
- run(instruction=...) is a per-call argument, not part of RLMConfig; pass it directly to run(...). The in-REPL form is llm_query(..., instruction=...).
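The no-carry-on rule can be modeled with a toy agent tree. `ToyAgent` is purely illustrative and unrelated to fast-rlm's internals; it only records which instruction each agent was handed.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyAgent:
    """Stand-in for an agent; stores only the instruction it was given."""
    name: str
    instruction: Optional[str] = None
    children: list["ToyAgent"] = field(default_factory=list)

    def spawn(self, name: str, instruction: Optional[str] = None) -> "ToyAgent":
        # Nothing is copied from self.instruction: the child sees only
        # what its spawner passes explicitly.
        child = ToyAgent(name=name, instruction=instruction)
        self.children.append(child)
        return child


root = ToyAgent("root", instruction="Write in formal British English.")
plain_child = root.spawn("child-a")  # no instruction passed
told_child = root.spawn("child-b", instruction="Extract only dollar amounts.")
grandchild = told_child.spawn("grandchild")  # starts fresh again

for agent in (root, plain_child, told_child, grandchild):
    print(agent.name, "->", agent.instruction)
```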
fast-rlm can connect to Model Context Protocol servers and expose their tools and resources inside the REPL. The agent calls them with await mcp_call(server, tool, **kwargs) and reads resources with await mcp_read_resource(uri) — just like any other REPL function.
Nothing extra to install for fast-rlm. MCP support is optional and lazy: the MCP client lives in the Deno engine, and Deno auto-downloads it on first use. There is no pip install fast-rlm[mcp] — runs that don't use MCP never load it. You only install the MCP servers you actually want to connect to (each per its own docs).
Pass servers to run(..., mcp_servers={...}), keyed by name. Transport is chosen by the config shape:
import fast_rlm
result = fast_rlm.run(
"Read /data/report.md and summarize it in three bullets.",
mcp_servers={
# stdio: fast-rlm SPAWNS the server (and kills it on exit) — you don't run it.
"fs": {"command": "npx", "args": ["-y", "@modelcontextprotocol/server-filesystem", "/data"]},
# http: the server must already be running; you point at its URL.
"web": {"url": "http://localhost:3333/mcp", "headers": {"Authorization": "Bearer ..."}},
},
)

Install a server the usual way before pointing fast-rlm at it, e.g.:
# stdio servers are launched on demand via their command (npx/uvx/node/...)
npx -y @modelcontextprotocol/server-filesystem /data # Node-based
uvx mcp-server-fetch                                   # Python-based

| Config key | Transport | Who runs the server? | Notes |
|---|---|---|---|
| command (+ args, cwd, env) | stdio | fast-rlm spawns it | grants Deno --allow-run; a shell/filesystem server is full host access, not sandboxed |
| url (+ headers) | HTTP | you (must be listening) | |
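"Transport is chosen by the config shape" amounts to a check on which key is present. A minimal sketch, assuming `transport_for` as a hypothetical name and assuming that a config with both keys (or neither) should be rejected:

```python
def transport_for(server_config: dict) -> str:
    """Infer the MCP transport from the keys of one mcp_servers entry."""
    has_command = "command" in server_config
    has_url = "url" in server_config
    if has_command and has_url:
        # Assumption: an ambiguous entry is an error rather than silently resolved.
        raise ValueError("ambiguous server config: both 'command' and 'url' are set")
    if has_command:
        return "stdio"  # the server is spawned from its command
    if has_url:
        return "http"  # the server must already be listening
    raise ValueError("server config needs either 'command' or 'url'")


servers = {
    "fs": {"command": "npx", "args": ["-y", "@modelcontextprotocol/server-filesystem", "/data"]},
    "web": {"url": "http://localhost:3333/mcp"},
}
print({name: transport_for(cfg) for name, cfg in servers.items()})
```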
Inside the REPL the agent gets a small, lazy discovery API (the step-0 probe only shows counts, never full schemas):
- mcp_list_tools(server=None) / mcp_tool_schema("server.tool") / await mcp_call(server, tool, **kwargs)
- mcp_list_resources() / mcp_list_resource_templates() / await mcp_read_resource(uri, server=None)
from fast_rlm import run, RLMConfig
config = RLMConfig.default()
config.primary_agent = "minimax/minimax-m2.5"
config.sub_agent = "minimax/minimax-m2.5"
config.max_depth = 5
config.max_money_spent = 2.0
result = run(
"Count the r's in 50 fruit names",
prefix="r_count",
config=config,
)

All config fields:

| Field | Type | Default | Description |
|---|---|---|---|
| primary_agent | str | (required) | Model for the root agent. No default; must be set or run() raises. |
| sub_agent | str | primary_agent | Model for child subagents. Defaults to primary_agent when unset. |
| max_depth | int | 3 | Max recursive subagent depth |
| max_calls_per_subagent | int | 20 | Max LLM calls per subagent |
| truncate_len | int | 2000 | Output chars shown to the LLM per step |
| max_money_spent | float | 1.0 | Hard budget cap in USD |
| max_completion_tokens | int | 50000 | Max total completion tokens across all subagents |
| max_prompt_tokens | int | 200000 | Max total prompt tokens across all subagents |
| max_global_calls | int | ∞ (50 for ACP) | Max total LLM calls across the whole run (root + all subagents) |
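The run-wide caps in the table behave like shared counters that every agent draws from. The sketch below is one way to picture that; `RunBudget` and `BudgetExceeded` are hypothetical, the defaults are copied from the table, and checking after each call with a strict `>` is an assumption about timing.

```python
import math
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when any run-wide cap is exceeded."""


@dataclass
class RunBudget:
    # Caps (defaults mirror the table; the class itself is hypothetical).
    max_money_spent: float = 1.0
    max_completion_tokens: int = 50_000
    max_prompt_tokens: int = 200_000
    max_global_calls: float = math.inf
    # Running totals shared by the root agent and all subagents.
    money_spent: float = 0.0
    completion_tokens: int = 0
    prompt_tokens: int = 0
    global_calls: int = 0

    def charge(self, prompt_tokens: int, completion_tokens: int, cost_usd: float = 0.0) -> None:
        """Record one LLM call, then raise if any cap is now exceeded."""
        self.global_calls += 1
        self.prompt_tokens += prompt_tokens
        self.completion_tokens += completion_tokens
        self.money_spent += cost_usd
        checks = [
            ("max_global_calls", self.global_calls, self.max_global_calls),
            ("max_prompt_tokens", self.prompt_tokens, self.max_prompt_tokens),
            ("max_completion_tokens", self.completion_tokens, self.max_completion_tokens),
            ("max_money_spent", self.money_spent, self.max_money_spent),
        ]
        for name, used, limit in checks:
            if used > limit:
                raise BudgetExceeded(f"{name} exceeded: {used} > {limit}")


budget = RunBudget(max_global_calls=50)
budget.charge(prompt_tokens=1_200, completion_tokens=300, cost_usd=0.002)
```

A backend that reports no token usage or cost would always charge zeros, leaving `max_global_calls` as the only cap that can trip.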
- Place your task at the top or bottom of the prompt: the REPL restricts how much context the LLM sees, so don't bury the task in the middle.
- Mark structured data with backtick blocks: wrap JSON, CSV, etc. in fenced code blocks and name the format in the prompt.
- Use strong coding models: agents write and execute Python, so coding benchmarks matter. See recommended models.
- Inject domain docs when needed: for obscure domains, add reference material and tell the agent how it's organized (e.g. with ## headers).
- Check logs and start with strict limits: review what the agent is doing before scaling up. Prompt changes usually help more than bigger budgets.
For the full guide, see the Best Practices & Troubleshooting docs page.
Skip this section unless you specifically want to run Gemini models on Google Cloud. It is not required for the default OpenRouter (or any OpenAI-compatible) setup above.
Use Gemini models via Vertex AI with IAM-based auth (no API key needed):
import fast_rlm
config = fast_rlm.RLMConfig()
config.primary_agent = "vertex/google/gemini-2.5-flash"
config.sub_agent = "vertex/google/gemini-2.5-flash"
result = fast_rlm.run("Count the r's in 50 fruits", config=config, vertex=True)

This path uses these extra environment variables instead of RLM_MODEL_API_KEY:

| Variable | Description | Default |
|---|---|---|
| GOOGLE_CLOUD_PROJECT | GCP project ID | (none) |
| GOOGLE_CLOUD_LOCATION | GCP region | us-central1 |
Auth uses Application Default Credentials. Either run gcloud auth application-default login or set GOOGLE_APPLICATION_CREDENTIALS to a service account key path.
Every run saves a .jsonl log file to logs/.
# Print stats (no extra dependencies)
fast-rlm-log logs/run_xxx.jsonl
# Interactive TUI viewer (requires bun)
fast-rlm-log logs/run_xxx.jsonl --tui

Windows (npm):
npm install -g deno

macOS / Linux:
curl -fsSL https://deno.land/install.sh | sh

Then add Deno to your PATH:
export DENO_INSTALL="$HOME/.deno"
export PATH="$DENO_INSTALL/bin:$PATH"

curl -fsSL https://bun.sh/install | bash
cd tui_log_viewer && bun install

Set your key in .env or .envrc:
export RLM_MODEL_API_KEY=sk-or-...

Edit rlm_config.yaml at the project root:
max_calls_per_subagent: 20
max_depth: 3
truncate_len: 2000
primary_agent: "z-ai/glm-5" # REQUIRED — no default
# sub_agent is optional; omit it to reuse primary_agent for subagents
sub_agent: "minimax/minimax-m2.5"
max_money_spent: 1.0
max_completion_tokens: 50000
max_prompt_tokens: 200000

# Run the example
deno task test_counting_r
# Run the subagent directly
echo "What is 2+2?" | deno task subagent
# View logs
./viewlog logs/<logfile>.jsonl

uv sync --extra benchmarks
uv run benchmarks/oolong_synth_benchmark.py
uv run benchmarks/longbench_benchmark.py

fast-rlm picks a backend from the primary_agent/sub_agent string (see Model backends). Set the credential only for the backend(s) you use; each is validated at point of use, so an ACP-only run needs no API key at all.
Any OpenAI-compatible endpoint — OpenAI, DeepSeek, OpenRouter (default), or anything else.
export RLM_MODEL_API_KEY=sk-... # or OPENAI_API_KEY, or OPENROUTER_API_KEY
export RLM_MODEL_BASE_URL=https://api.deepseek.com   # optional; defaults to OpenRouter

primary_agent: "deepseek-chat"   # or "gpt-5-mini", "minimax/minimax-m3", ...

Google Cloud, via Application Default Credentials (no static key). Prefix the model with vertex/, or set RLM_VERTEX_AI=1 (Python: run(..., vertex=True)) to route every model through Vertex.
gcloud auth application-default login
export GOOGLE_CLOUD_PROJECT=your-project

primary_agent: "vertex/claude-sonnet-4-6"

Claude models (claude-* or anthropic/claude-*) use the native Anthropic API when ANTHROPIC_API_KEY is set. If the native call is unavailable, fast-rlm transparently falls back to the OpenAI-compatible endpoint, so anthropic/... strings keep working through OpenRouter even without an Anthropic key.
export ANTHROPIC_API_KEY=sk-ant-...
export ANTHROPIC_BASE_URL=https://my-proxy.example.com   # optional; defaults to https://api.anthropic.com

primary_agent: "claude-haiku-4-5"   # or "anthropic/claude-sonnet-4-6"

Token usage is reported (so budgets apply); cost shows Unknown (the SDK returns no cost).
Drives a local coding agent (Claude Code, Codex, opencode) read-only — no API key needed (the agent uses its own CLI login). Because token/cost budgets don't apply to ACP, max_global_calls defaults to 50 for ACP runs. See the ACP agents section below for presets and the backdoor.
primary_agent: "acp:opencode"   # or "acp:claude-code", "acp:codex"

| Backend | Selector | Credential |
|---|---|---|
| OpenAI-compatible | unprefixed (e.g. gpt-5-mini) | RLM_MODEL_API_KEY → OPENAI_API_KEY → OPENROUTER_API_KEY (+ optional RLM_MODEL_BASE_URL) |
| Vertex AI | vertex/… or RLM_VERTEX_AI=1 | ADC + GOOGLE_CLOUD_PROJECT |
| Anthropic | claude-… / anthropic/… | ANTHROPIC_API_KEY (or RLM_ANTHROPIC_API_KEY) (+ optional ANTHROPIC_BASE_URL) |
| ACP | acp:… | none (agent's own CLI login) |
Besides OpenAI-compatible and Vertex models, fast-rlm can use a coding agent that
speaks the Agent Client Protocol (ACP) as the
"brain". The agent is prompted with fast-rlm's system prompt + history and replies
with a ```repl block, which fast-rlm executes in its own Pyodide sandbox —
exactly like any other model. The agent itself runs read-only and never writes
files or runs the code; fast-rlm does.
Select one with an acp: prefix on primary_agent/sub_agent (mirrors the
vertex/ convention):
primary_agent: "acp:claude-code"
sub_agent: "acp:codex?model=gpt-5.5-codex"   # ?model= is optional

run(query, config=RLMConfig(primary_agent="acp:opencode"))

Built-in presets (verified): acp:claude-code, acp:codex, acp:opencode.
Claude Code and Codex are launched via their npx adapters, so Node/npx must be
on PATH and the agent itself must already be logged in (e.g. claude /login,
codex login, opencode auth login).
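The acp:<name>?model=<model> selector shape shown above can be taken apart with the standard library. `parse_acp_selector` is a hypothetical helper for illustration; fast-rlm's own parsing may differ, and any query keys other than model are simply ignored here.

```python
from typing import Optional
from urllib.parse import parse_qs


def parse_acp_selector(agent: str) -> tuple[str, Optional[str]]:
    """Split 'acp:<name>[?model=<model>]' into (name, model)."""
    if not agent.startswith("acp:"):
        raise ValueError(f"not an ACP selector: {agent!r}")
    rest = agent[len("acp:"):]
    name, _, query = rest.partition("?")
    if not name:
        raise ValueError("ACP selector is missing an agent name")
    # parse_qs maps each key to a list of values; take the first model, if any.
    model = parse_qs(query).get("model", [None])[0]
    return name, model


print(parse_acp_selector("acp:codex?model=gpt-5.5-codex"))  # ('codex', 'gpt-5.5-codex')
print(parse_acp_selector("acp:claude-code"))  # ('claude-code', None)
```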
Backdoor — any other ACP agent. Register it by command under acp_agents, then
select it by name. Built-in presets need no entry; a registered name overrides a
preset of the same name.
run(query, config=RLMConfig(
primary_agent="acp:hermes",
acp_agents={
"hermes": {"command": "hermes", "args": ["acp"]},
"cursor": {"command": "npx", "args": ["-y", "cursor-agent-acp"]},
},
))

Each entry accepts command, args?, readonly_mode? (the agent's read-only mode
id, if it has one), model?, auth_method? (ACP auth method id; pinning it
silences the provider's "authMethodId is not configured" warning), and env?.
Safety & caveats:
- Every ACP agent runs in a throwaway temp cwd, so a stray write is contained.
- When the agent has a read-only session mode (readonly_mode), fast-rlm switches into it. The presets do this automatically: opencode/claude-code use plan (a hard block); codex uses read-only (approval-gated: it may still write if it asks, so the temp cwd is its real guardrail).
- Agents with no session modes (e.g. cursor, hermes) are contained by the temp cwd alone.
- Budgets: ACP agents report no token usage, so max_money_spent, max_completion_tokens, and max_prompt_tokens are inert for them (always zero, never trip). The only budget that works is max_global_calls, which defaults to 50 for ACP runs (override it on the config/CLI as needed).
- Small PRs only — keep changes focused and minimal. Large PRs will not be accepted.
- No LLM-generated slop — AI-assisted code is fine, but bulk-generated boilerplate with no thought behind it will be rejected.
- Minor features welcome — small, well-scoped PRs that add useful functionality will be considered.
- Large feature requests — open an issue first to discuss the design before writing any code.
MIT License. See LICENSE.