Broad global goal: automate a workflow to investigate the scope of narrow, file-scoped, low-risk tasks that the orchestrator gpt-5.4 can delegate to a given local LLM while getting essentially the same solution quality and conserving precious gpt-5.4 tokens.
Rationale: Local LLM tokens are essentially free, so it is a win to use lots of them to save a few gpt-5.4 tokens without sacrificing solution quality.
What counts as full success: parity in problem solving (with A>2 and B>3) between these separate Codex setups:
- current: powered by A*X smart tokens from the best model available via OpenAI Plus plan
- hybrid-local: powered by X smart orchestrator tokens plus B*X "free but dumber" local LLM tokens for delegated subtasks
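The success criterion can be read as simple token arithmetic. The sketch below assumes one interpretation: the hybrid setup spends X smart tokens where the current setup spends A*X, and pays for that with B*X local tokens. The function names are hypothetical and not part of the existing tools.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for reasoning about the A and B factors.

# Percent of smart (gpt-5.4) tokens saved when X tokens replace A*X tokens.
smart_token_saving_pct() {
  LC_ALL=C awk -v a="$1" 'BEGIN { printf "%.1f\n", (1 - 1 / a) * 100 }'
}

# Local tokens spent per smart token saved: B*X / ((A-1)*X) = B / (A-1).
local_tokens_per_smart_saved() {
  LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", b / (a - 1) }'
}

smart_token_saving_pct 3          # A=3 prints 66.7
local_tokens_per_smart_saved 3 4  # A=3, B=4 prints 2.00
```

Under this reading, any A>2 means the hybrid setup saves more than half of the smart tokens.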
- disler/aider-mcp-server is the best match to "Master agent delegates grunt work to a cheaper/alternate coding agent." Its README says it “allows Claude Code to offload AI coding tasks to Aider,” with the explicit goal of reducing costs and using the stronger model mainly as an orchestrator. It exposes the delegate as an MCP server, passes an --editor-model, and includes tests for the coding tool path.
- Aider-AI/aider plus bjodah/local-aider is the best proven split-model editing pattern. Aider’s official docs describe architect/editor mode: one model proposes the solution, then a second model turns that proposal into concrete file edits. Their docs also say architect mode often improves reliability for weaker editors, and their troubleshooting guide says local quantized models often have editing problems, recommending architect mode and sometimes “whole” edit format as mitigations. The local-aider proof-of-concept shows this pattern with local models on one consumer GPU, using a local reasoning architect and a separate local editor model. It is the clearest existing evidence that reasoning and editing should often be split when local models are flaky at edits.
- can1357/oh-my-pi is the best open-source example of delegated subagents with per-agent model overrides. Its README advertises per-agent model overrides, and the development docs describe a task tool that delegates work to child agent sessions, resolves model overrides per agent, and aggregates results. That is highly relevant to our longer-term goal of turning this into a reusable evaluation skill, because it shows the architecture we want: planner/orchestrator on one model, executor(s) on another, with explicit delegation boundaries and observability.
- ToDo: check if the above repos give enough inspiration for our own task.
- OS: Ubuntu 24.04 LTS
- GPU: NVIDIA Quadro RTX 4000, 8 GB VRAM
- I do not want experiments to overwrite or destabilize my normal Codex setup powered solely by gpt-5.4
- User dansa302 initializes the local LLM and live logging outside of the Codex sandbox using the helper script ./tools/start_local_lmstudio_model.sh.
  - Codex is expected to improve this script to increase local LLM fitness for delegated tasks
  - before editing this script Codex should stop, explain the proposed changes, and wait for user approval
  - after editing this script Codex should stop, explain how to re-run the modified script, wait for the user to reload the local LLM, and verify
- We start by investigating . Later we will also investigate ibm/granite-4-h-tiny, qwen/qwen3.5-9b, possibly other LLMs.
- Codex CLI operates as unprivileged user ai2 using a free local LLM like this:
MODEL="${MODEL:-google/gemma-4-e4b}"
CTX="${CTX:-16384}"
curl -fsS http://127.0.0.1:1234/v1/responses \
-H 'Content-Type: application/json' \
-d "{
\"model\": \"$MODEL\",
\"input\": \"Reply with exactly: MODEL_OK\",
\"max_output_tokens\": $CTX
}"
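The smoke test above only prints the raw response. A minimal sketch of turning it into a pass/fail check follows, assuming the local server returns the reply in a JSON "text" field as the OpenAI Responses API does. That shape is an assumption about LM Studio's compatibility layer, and the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads a /v1/responses JSON body on stdin and
# succeeds only if some "text" field is exactly MODEL_OK.
response_says_model_ok() {
  grep -Eq '"text"[[:space:]]*:[[:space:]]*"MODEL_OK"'
}

# Usage sketch (needs the local server to be running):
#   curl -fsS http://127.0.0.1:1234/v1/responses \
#     -H 'Content-Type: application/json' \
#     -d "{\"model\": \"$MODEL\", \"input\": \"Reply with exactly: MODEL_OK\"}" \
#     | response_says_model_ok && echo "local model reachable"
```

Matching the whole field value rejects replies that wrap MODEL_OK in extra words, which is a cheap first signal of how well the model follows exact-output instructions.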
- We have lean logs to triage failures:
truncate -s 0 /home/ai2/lmstudio-codex-trace/lms-*.jsonl
curl -fsS http://127.0.0.1:1234/v1/responses \
-H 'Content-Type: application/json' \
-d "{
\"model\": \"$MODEL\",
\"input\": \"List all tools you can call\",
\"max_output_tokens\": $CTX
}"
cat /home/ai2/lmstudio-codex-trace/lms-model.jsonl
cat /home/ai2/lmstudio-codex-trace/lms-server.jsonl
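Before reading full logs, it can help to see which trace files received any entries since the truncate. The sketch below counts lines per file. The function name is hypothetical, and it assumes only that each log entry occupies one line, as the .jsonl extension suggests.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<line count><TAB><file name>" for each trace file.
trace_summary() {
  local dir="${1:-/home/ai2/lmstudio-codex-trace}" f
  for f in "$dir"/lms-*.jsonl; do
    [ -e "$f" ] || continue   # skip the unexpanded glob when no file matches
    printf '%s\t%s\n' "$(wc -l < "$f" | tr -d ' ')" "$(basename "$f")"
  done
}

# Usage sketch:
#   trace_summary                      # default trace directory
#   trace_summary /some/other/tracedir # hypothetical alternate location
```

A count of 0 for a file right after a request suggests that the request never reached that logging layer.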
There is a recurring problem where local/custom providers may not expose the edit contract the way Codex expects, especially around apply_patch / tool-calling behavior.
Similar reports already exist upstream, so we do not need to file a new bug report right now.
When a task is narrow, file-scoped, and low-risk, first try the local worker:
./tools/local_free_helper_codex.sh "<task>"
Start conservatively, using the local worker only for:
- explicit file paths
- small edits
- repetitive grunt work
After the worker returns, always verify:
- which files actually changed
- whether the task was done close enough to the orchestrator's intent
- whether the worker claimed success without changing files
If verification fails, stop delegating and report failure. Do not let the local worker choose broad plans or architecture.
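The mechanical parts of this verification (which files changed, and whether anything changed at all) can be sketched with checksum snapshots taken before and after the worker runs. All function names below are hypothetical. Judging whether the edit matches the orchestrator's intent still needs a separate review.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; only ./tools/local_free_helper_codex.sh comes from this document.

# snapshot DIR: print one "checksum  ./path" line per regular file under DIR.
snapshot() {
  ( cd "$1" && find . -type f ! -path './.git/*' -exec sha256sum {} + )
}

# changed_files BEFORE AFTER: print paths that were added, removed, or modified.
changed_files() {
  # A line that appears in only one snapshot marks a changed file.
  sort "$1" "$2" | uniq -u | sed 's/^[^ ]*  //' | sort -u
}

# verify_worker_edit BEFORE AFTER ALLOWED_PATH...
# Fails if nothing changed or if any path outside the allowed list changed.
verify_worker_edit() {
  local before="$1" after="$2" changed unexpected
  shift 2
  changed="$(changed_files "$before" "$after")"
  if [ -z "$changed" ]; then
    echo "FAIL: worker changed no files" >&2
    return 1
  fi
  unexpected="$(printf '%s\n' "$changed" | while IFS= read -r path; do
    ok=0
    for allowed in "$@"; do
      if [ "$path" = "./${allowed#./}" ]; then ok=1; fi
    done
    if [ "$ok" -eq 0 ]; then printf '%s\n' "$path"; fi
  done)"
  if [ -n "$unexpected" ]; then
    echo "FAIL: unexpected changes: $unexpected" >&2
    return 1
  fi
  echo "OK: changed only expected files"
}
```

A possible flow is to run snapshot on the task directory, writing the output outside that directory, then run ./tools/local_free_helper_codex.sh "<task>", take a second snapshot, and call verify_worker_edit with both snapshot files and the paths named in the task.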
Each folder ./tests/<task_id> holds one simple task as
- setup.sh: a script to create a fresh task instance in ./results/<time_stamp>_<task_id>
- prompt.txt: a prompt to delegate the task with ./tools/local_free_helper_codex.sh to the local LLM, along with