Argus Agent

Local memory. Hybrid retrieval. Self-correcting reasoning. Benchmark-grade recall.

Argus Agent is a memory-first AI agent for long-context personal intelligence, grounded recall, temporal reasoning, quantitative reasoning, and tool use.

It is designed as an open, inspectable alternative to memory agents such as Hermes or OpenClaw, with a stronger emphasis on durable local memory, strict attribution, self-correcting retrieval, and benchmark-grade long-term recall.

The current local implementation keeps normal semantic, episodic, and procedural learning in agent.py. Deterministic structured-artifact extractor code exists in the repo, but it is disabled in the active learning path while benchmark memory quality is being tuned.

Local LongMemEval-S reports are checkpoints while learning and generation behavior is being validated. Treat checked-in report files as local progress snapshots, not final published benchmark numbers.

Benchmark cost warning: a full 500-question LongMemEval-S run with the current model mix has cost about $2,500 in practice, or about $5 per question on average. Run a 1-question or small-sample benchmark before starting the full dataset.

Why Argus Exists

Most agents can chat. Fewer can remember. Almost none can remember carefully.

Argus Agent is built around a simple idea: memory is not just vector search. A serious memory agent needs to know what a memory means, when it happened, what its numbers belong to, which entity a fact is attached to, when evidence is incomplete, and when it must search again instead of guessing.

Argus combines:

local FAISS vector memory
semantic, episodic, and procedural memory separation
hybrid vector plus keyword retrieval
HyDE-style query expansion
dynamic recall depth
strict temporal grounding
numeric attribution and exact aggregation rules
structured artifact extractor code for table rows, lists, blocks, quotes, budgets, timelines, metrics, ratios, and other evidence-shaped outputs, currently disabled in the active learning path
self-correcting fallback retrieval
background memory consolidation
LangGraph orchestration
progressive tool routing

The result is an agent that behaves less like a stateless chatbot and more like a disciplined cognitive system.

What's New In v0.5.0

This release turns Argus into a much more complete local agent runtime:

Triage-first normal runtime: normal chat now starts with a fast LLM triage layer that chooses direct response, tool routing, memory retrieval, or retrieval-before-tools. Information-only updates and questions already answerable from recent chat can skip foreground retrieval entirely.
Benchmark path preserved: benchmark traffic still uses the original retrieval, tool-routing, generation, and learning prompts/flow so memory benchmark behavior remains comparable.
Learning-safe optional retrieval: when foreground retrieval is skipped, background learning still performs a learning-related retrieval pass before memory editing so UPDATE and DELETE actions have prior memory IDs/context.
Clearer metrics: triage, direct response, foreground retrieval, tool routing, generation, and learning-related retrieval are timed separately, and terminal background metrics avoid overwriting the active input prompt.
Argus control console: a Codex-style CLI with a fixed bottom input row, scrollable transcript, Markdown rendering, command palette, multiline compose, live status header, global argus launcher, and one-command setup scripts for macOS/Linux and Windows.
API job queue: chat requests now create jobs, emit status events while work is happening, and return final responses when the job completes. The CLI polls events instead of blocking silently.
On-demand channels: Telegram connects only when requested with /connect telegram, supports startup defaults, shows typing indicators, retries registration cleanly, and can receive coding-task progress after connecting mid-run.
Durable channel context: CLI and Telegram chat history are stored locally, command responses are saved into history, and only the latest four user/assistant pairs are passed into each agent request.
Multimodal input storage: incoming Telegram/API files are stored under local channel state, indexed, and passed into the agent with best-effort text extraction or AI-assisted image/audio/PDF understanding.
Local identity management: agent name, personality, use cases, and custom prompt can be updated through a local identity config file instead of Supabase-backed identity tools.
Cloud-tool expansion: external SaaS actions are routed through a single cloud-tool skill with configurable toolkits, user-facing /tools, /which-tool, /cloud-tools, /add-tool, and /remove-tool commands.
Coding-agent delegation: Argus can delegate software work to Codex, persist tasks and logs, continue or start fresh sessions, delete task history, configure workspace/provider/network mode, and stream progress into CLI and subscribed Telegram chats.
Coding safety and UX: coding tasks use shallow retrieval, portable workspace defaults, task-id suggestions, network-on defaults for package registries, safe restart when Codex sandbox settings change, and progress-file polling for long training/build tasks.
Cloudflared setup fixes: setup can install cloudflared with Homebrew on macOS/Linux, winget on Windows, or a direct PowerShell download fallback to C:\cloudflared\cloudflared.exe.

What Makes It Different

Argus is not a wrapper around a vector database. It is a full memory reasoning loop.

Standard RAG systems usually fail long-memory tasks for one of four reasons:

They retrieve the wrong memory.
They retrieve the right memory but attach it to the wrong entity.
They confuse storage time with event time.
They calculate with nearby numbers that do not belong to the question.

Argus directly attacks those failure modes with retrieval decomposition, evidence attribution, temporal guardrails, numeric scope checks, and a second-pass recovery path when the first context is incomplete.

Highlights

Local-first memory: no Supabase pgvector dependency. Memories and rules are saved under local_memory/<AGENT_ID>/.
Three memory types: semantic facts, episodic events, and procedural behavioral rules.
FAISS-backed retrieval: normalized OpenAI embeddings with IndexFlatIP cosine-style similarity.
Hybrid search: every retrieval pass combines vector search and direct keyword matching.
HyDE query optimizer: rewrites the user prompt into multiple retrieval probes before search.
Dynamic thresholds: wide-net deep mode for aggregation, timelines, broad categories, and recommendations; stricter standard mode for point facts.
Required-data fallback: the model can request a targeted second retrieval pass when evidence is missing.
Temporal truth protocol: separates database storage time from narrative event time.
Quantitative fidelity: numbers are stored and used with owner, property, item, and exactness.
Benchmark ingestion learning: history chunks are split into individual user/assistant pairs, learned sequentially, staged in RAM between pairs, and committed after the chunk is complete.
Duplicate protection: batch writes skip exact duplicate content before embedding, then use the normal vector duplicate check to avoid repeated memories.
Triage-first normal chat: information-only updates, casual chat, and questions answerable from current chat can skip foreground retrieval.
Retrieval-before-tools: when an action genuinely needs stored memory before a tool can run, triage routes retrieval first and then tool selection.
Background learning: normal user responses return immediately while memory extraction runs asynchronously.
Learning-related retrieval: optional foreground retrieval does not weaken memory editing, because background learning can retrieve prior context before issuing ADD, UPDATE, or DELETE.
Benchmark ingestion synchronization: benchmark memory-ingestion turns learn synchronously before returning, guarded by an ingestion lock.
Progressive tool loading: tool docs are only injected when a skill is selected.
Benchmark mode: disables tool routing, synchronously learns memory-ingestion chunks, and waits for any pending learning before final evaluation.
Local control console: starts the FastAPI worker, shows structured request/job/channel events, supports multiline input, command completion, and a scrollable transcript.
On-demand channel connections: channels are connected only when requested, starting with Telegram through a temporary Cloudflare tunnel and automatic webhook registration.
Local identity config: agent name, personality, use cases, and custom directives can be updated by tool call into a local JSON file instead of Supabase.
Coding-agent delegation: Argus can start durable Codex tasks, stream coding progress into the CLI, let the user reply, and persist task logs locally.

Architecture

User / API
    |
    v
LangGraph StateGraph
    |
    +-- triage_request  (normal channels)
    |     |
    |     +-- direct_response
    |     +-- route_tools
    |     +-- retrieve_memories
    |     +-- retrieve_memories -> route_tools
    |
    +-- retrieve_memories  (benchmark channel always starts here)
    |     |
    |     +-- HyDE query generation
    |     +-- semantic FAISS search
    |     +-- episodic FAISS search
    |     +-- keyword search
    |     +-- procedural rule routing
    |
    +-- route_tools
    |     |
    |     +-- skill catalog router
    |     +-- progressive skill markdown loading
    |
    +-- generate_response
          |
          +-- grounded answer synthesis
          +-- optional ReAct tool loop
          +-- REQUIRED_DATA fallback retrieval
          +-- memory learning
                +-- normal chat: background async
                +-- benchmark ingestion: inline sync

The normal interactive graph is triage-first:

START -> triage_request -> direct_response -> END
START -> triage_request -> route_tools -> generate_response -> END
START -> triage_request -> retrieve_memories -> generate_response -> END
START -> triage_request -> retrieve_memories -> route_tools -> generate_response -> END

Benchmark traffic intentionally keeps the original high-recall route:

START -> retrieve_memories -> route_tools -> generate_response -> END

Learning is launched from generate_response or direct_response. Normal chat keeps the interactive path fast by learning in the background. If normal chat skipped foreground retrieval, background learning performs a learning-related retrieval pass before memory editing so updates and deletes still see prior memory context. Benchmark memory-ingestion prompts learn inline before the response returns so the next history chunk retrieves against the latest committed memories.
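
The routes above can be summarized as a small lookup. This is only an illustrative sketch: `plan_route`, `NORMAL_ROUTES`, and the decision labels are hypothetical names chosen here, and the real graph is built with LangGraph rather than a dictionary.

```python
from typing import Dict, List

# Node sequences copied from the route listings above.
# The dictionary keys are illustrative labels, not confirmed identifiers.
NORMAL_ROUTES: Dict[str, List[str]] = {
    "direct": ["triage_request", "direct_response"],
    "tool": ["triage_request", "route_tools", "generate_response"],
    "retrieval": ["triage_request", "retrieve_memories", "generate_response"],
    "retrieval_then_tool": [
        "triage_request",
        "retrieve_memories",
        "route_tools",
        "generate_response",
    ],
}
BENCHMARK_ROUTE: List[str] = ["retrieve_memories", "route_tools", "generate_response"]


def plan_route(channel: str, triage_decision: str = "") -> List[str]:
    """Return the ordered node names one request would pass through."""
    if channel == "benchmark":
        # Benchmark traffic bypasses triage entirely.
        return list(BENCHMARK_ROUTE)
    return list(NORMAL_ROUTES[triage_decision])
```

For example, `plan_route("cli", "direct")` never touches `retrieve_memories`, while `plan_route("benchmark")` always starts there.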

Memory System

Argus uses three memory layers.

Semantic Memory

Semantic memory stores durable user facts:

identity
preferences
relationships
routines
long-term projects
possessions
stable traits
active statuses and inventories

Example:

User owns a crystal chandelier that originally belonged to their great-grandmother and was given to them by their aunt.

Episodic Memory

Episodic memory stores events and interaction history:

what happened
when it happened
who was involved
what was decided
what the user asked for
what changed

Example:

On March 4, 2023, user received a crystal chandelier from their aunt that originally belonged to their great-grandmother.

Procedural Memory

Procedural memory stores behavioral rules:

tone preferences
formatting preferences
project-specific instructions
forbidden wording
content generation constraints

Procedural rules are tagged and routed, so the model sees only the relevant rules for the current prompt instead of carrying every rule forever.

Structured Artifact Extractors

The local agent keeps the original normal-text learning path intact. Full user and assistant turns are still passed to the learning model, so ordinary narrative details, decisions, preferences, and summaries can become semantic, episodic, or procedural memories.

Structured extractor code is present for artifact-shaped content that summarization can otherwise compress too aggressively. In the current active runtime, these extractors are disabled and their outputs are not injected into the learning prompt or appended to episodic memory. This keeps benchmark learning focused on the model-generated memory actions from the actual user/assistant pair.

When enabled experimentally, the extractor layer can create deterministic episodic units for high-signal data such as:

markdown table rows
explicit artifact blocks such as ::title:: == description
numbered sections for objectives, parameters, methods, options, steps, recommendations, and similar headings
recommendation, remedy, dish, shop, restaurant, and product list items
ingredient and material items
budget, cost, allocation, and campaign plan rows
timeline and dated event clauses
implementation or "uses algorithm/tool" relationships
attributed quotations and exact source claims
metric, percentage, improvement, and score relationships
ratios, dilutions, and mixture instructions
music sections, chord/note style rows, and chess move notation
counted entity headings such as encounter counts, party sizes, item totals, or named grouped entities

The extractor layer is intentionally capped and high-signal. It is not meant to memorize every sentence. Its job is to preserve compact data-bearing rows and items that future recall questions often target verbatim.

In the current runtime, STRUCTURED ARTIFACT UNITS are not passed to the learning prompt. Vector writes still pass through the normal local action execution path, including exact duplicate blocking, batch embedding, and vector duplicate checks.

Local Storage Layout

The current runtime stores memory locally:

local_memory/
  <AGENT_ID>/
    semantic_memory/
      index.faiss
      memories.json
    episodic_memory/
      index.faiss
      memories.json
    procedural_memory/
      rules.json
    channel_state/
      chat_history.json
      attachments_index.json
      attachments/
    coding_agents/
      tasks.json
      logs/
        <task_id>.jsonl

AGENT_ID determines which memory folder is used. Reusing the same AGENT_ID reuses the same memory. Changing AGENT_ID gives you a clean isolated agent profile.

Channel state is also durable. CLI, Telegram, and future channels store full conversation history in channel_state/chat_history.json, while each agent request receives only the last eight messages (four user/assistant pairs) as short-term chat context. Command responses such as /cloud-tools are saved there too, so follow-ups can refer to them.
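
A minimal sketch of that short-term window, assuming the history is a list of message dictionaries. `recent_context` and `MAX_CONTEXT_MESSAGES` are hypothetical names used for illustration only.

```python
from typing import Dict, List

MAX_CONTEXT_MESSAGES = 8  # four user/assistant pairs, as described above


def recent_context(
    history: List[Dict[str, str]], limit: int = MAX_CONTEXT_MESSAGES
) -> List[Dict[str, str]]:
    """Return the last `limit` messages without mutating the stored history."""
    return history[-limit:] if limit > 0 else []
```

The full history stays on disk; only the returned slice would be attached to an agent request.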

Incoming files are saved under channel_state/attachments/ and indexed in attachments_index.json. Text, Markdown, JSON, CSV, PDFs, DOCX files, images, and audio get best-effort local extraction or AI-assisted description/transcript when supported. The stored file remains available even when readable text cannot yet be extracted.

PDF extraction uses pypdf for embedded text. If a PDF has no embedded text or behaves like a scanned/image document, Argus can render the first few pages with PyMuPDF and send them through the configured multimodal image model for vision OCR. Make sure pip install -r requirements.txt has been run in the same virtual environment that starts main.py or agent_cli.py.

Coding-agent task state is durable too. Each delegated coding task is stored in coding_agents/tasks.json, and every progress/update/approval/completion event is appended to that task's JSONL log under coding_agents/logs/.

Each semantic and episodic memory record includes:

UUID
agent ID
memory type
content
embedding
created timestamp
updated timestamp

The formatted retrieval output intentionally preserves this shape:

[STORED_AT: 2026-05-25 14:00:00] [ID: <uuid>] <memory content>

Downstream deduplication, recency sorting, contradiction handling, and memory update logic all depend on that stable format.
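
Because so much depends on that line shape, it helps to see it formatted and parsed in one place. The helper names below (`MemoryLine`, `format_memory_line`, `parse_memory_line`) are assumptions made for this sketch and are not taken from agent.py.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional


class MemoryLine(NamedTuple):
    stored_at: datetime
    memory_id: str
    content: str


_LINE_RE = re.compile(
    r"^\[STORED_AT: (?P<ts>[^\]]+)\] \[ID: (?P<id>[^\]]+)\] (?P<content>.*)$",
    re.DOTALL,
)
_TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def format_memory_line(stored_at: datetime, memory_id: str, content: str) -> str:
    """Render one memory in the retrieval output shape shown above."""
    return f"[STORED_AT: {stored_at.strftime(_TS_FORMAT)}] [ID: {memory_id}] {content}"


def parse_memory_line(line: str) -> Optional[MemoryLine]:
    """Parse a formatted line back into its parts, or return None if it does not match."""
    match = _LINE_RE.match(line)
    if match is None:
        return None
    stored_at = datetime.strptime(match.group("ts"), _TS_FORMAT)
    return MemoryLine(stored_at, match.group("id"), match.group("content"))
```

A round trip through both helpers should return the same timestamp, ID, and content, which is the property the downstream logic relies on.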

Retrieval Pipeline

Argus does not simply embed the latest user prompt and hope for the best.

For normal chat, retrieval is now optional on the foreground path. A fast triage LLM first decides whether the latest turn is:

direct: answer or acknowledge from the current message and recent chat history
tool: route to the tool system without memory retrieval
retrieval: retrieve memory before generation
retrieval + tool: retrieve memory first, then route tools with that memory context

The triage layer is semantic rather than regex-based. It treats information-only updates as direct responses, because newly supplied facts do not need old memory to be acknowledged. It chooses retrieval when the user is asking for an answer that depends on stored personal memory not already present in recent chat. It chooses tools when the user asks Argus to perform an external action or delegate work. For action requests, tool-only is the default unless a concrete required action input is missing and stored memory is the likely source.

Benchmark traffic bypasses triage and still runs the original retrieval-first path.

The retrieval node first asks a lightweight model to produce a structured search plan:

{
  "vector_queries": [
    "User total driving duration and travel history",
    "User vehicle, road trip, transit records",
    "User travel milestones, driving time calculation",
    "hours, road trip, destinations"
  ],
  "keywords": "driving, hours, total",
  "search_mode": "deep"
}

Then it performs:

Semantic vector search
Episodic vector search
Semantic keyword search
Episodic keyword search
ID-based deduplication
recency sorting
procedural rule routing

Search modes:

standard: strict retrieval for point facts, threshold 0.38
deep: wide recall for totals, timelines, histories, recommendations, and broad categories, threshold 0.28

This is why Argus can answer questions that require multiple memories rather than only nearest-neighbor recall.
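
The merge step can be sketched as follows. The thresholds come from the search modes above; everything else is an assumption for illustration, including the hit record shape, the choice to let keyword hits bypass the score threshold, and newest-first ordering for the recency sort.

```python
from typing import Dict, List

# Thresholds quoted in the search modes above.
THRESHOLDS = {"standard": 0.38, "deep": 0.28}


def merge_hits(
    vector_hits: List[Dict], keyword_hits: List[Dict], search_mode: str
) -> List[Dict]:
    """Filter vector hits by score, add keyword hits, dedupe by ID, sort by recency.

    Each hit is assumed to be a dict with "id", "content", and "created_at"
    (an ISO timestamp string); vector hits also carry "score".
    """
    threshold = THRESHOLDS[search_mode]
    kept = [hit for hit in vector_hits if hit["score"] >= threshold]
    by_id: Dict[str, Dict] = {}
    for hit in kept + keyword_hits:
        by_id.setdefault(hit["id"], hit)  # first occurrence of an ID wins
    # ISO timestamps sort correctly as strings; newest first is an assumption.
    return sorted(by_id.values(), key=lambda hit: hit["created_at"], reverse=True)
```

With the same inputs, `deep` mode keeps hits scoring between 0.28 and 0.38 that `standard` mode drops, which is what gives aggregation questions their wider net.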

Temporal Reasoning

Argus treats time as evidence, not decoration.

The agent distinguishes:

storage timestamp: when the memory was saved
narrative date: when the event actually happened
benchmark current date: the simulated date for an evaluation question
relative dates: phrases like "yesterday", "today", "last month"

The Temporal Truth Protocol prevents common long-memory errors:

using database timestamps as event dates
borrowing dates from nearby but unrelated memories
assuming a discussion date is the same as an event date
calculating date gaps from guessed anchors

For "how long ago" questions, Argus searches for the named event first and only uses the current date as the calculation anchor after retrieval.
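
A minimal sketch of that rule, assuming the narrative event date has already been extracted from a retrieved memory. `days_since_event` is a hypothetical helper; the point is that a missing event date yields no answer rather than a fallback to the storage timestamp.

```python
from datetime import date
from typing import Optional


def days_since_event(event_date: Optional[date], current_date: date) -> Optional[int]:
    """Return the gap in days, or None when no narrative date was retrieved."""
    if event_date is None:
        # Never substitute the storage timestamp for a missing event date.
        return None
    return (current_date - event_date).days


# Example: the chandelier was received on March 4, 2023, and the
# (simulated) current date is May 25, 2023.
gap = days_since_event(date(2023, 3, 4), date(2023, 5, 25))
```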

Quantitative Reasoning

Long-memory benchmarks often punish sloppy number handling. Argus's numeric protocol is built to avoid that.

For totals, counts, durations, prices, quantities, or money questions, the model must identify:

actor or entity
measured action or property
event or item
exactness

It excludes numbers that are merely nearby or topically related.

Example:

User helped organize a concert, which raised over $5,000.

The amount belongs to the concert, not automatically to the user. It is also a lower-bound value, not an exact addend.

For exact totals, Argus sums only exact unqualified values unless the user explicitly asks for a minimum, estimate, or range.
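
The scope check can be illustrated with a small filter. The `Amount` record and `exact_total` function are hypothetical; in the real agent this rule is enforced through the prompt protocol, not necessarily through code like this.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Amount:
    owner: str    # actor or entity the number belongs to
    prop: str     # measured action or property
    value: float
    exact: bool   # False for "over", "about", "at least", and similar qualifiers


def exact_total(amounts: List[Amount], owner: str, prop: str) -> float:
    """Sum only exact values that belong to the requested owner and property."""
    return sum(
        a.value
        for a in amounts
        if a.owner == owner and a.prop == prop and a.exact
    )


memories = [
    Amount("concert", "raised", 5000.0, exact=False),  # "over $5,000", not the user's
    Amount("user", "donated", 120.0, exact=True),
    Amount("user", "donated", 80.0, exact=True),
]
```

Here `exact_total(memories, "user", "donated")` ignores the concert figure twice over: it belongs to a different entity, and it is a lower bound.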

Self-Correcting Retrieval

If the first retrieval pass does not contain enough evidence, Argus can emit:

{
  "agent_response": "",
  "flags": ["REQUIRED_DATA"],
  "hyde_queries": ["aunt meetup", "received chandelier", "chandelier handoff"]
}

The runtime then performs a targeted fallback search and regenerates the answer with expanded context.

The fallback pass has a strict final verification rule: if the exact target is still missing, the agent must say the information is not available instead of guessing.

This makes the agent aggressive about recall but conservative about truth.
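
The control flow can be sketched as one generation pass plus at most one targeted fallback. The `generate` and `search` callables, the function name, and the `NOT_AVAILABLE` wording are all placeholders for this example; only the JSON field names come from the payload shown above.

```python
import json
from typing import Callable, List

NOT_AVAILABLE = "That information is not available in memory."  # placeholder wording


def answer_with_fallback(
    generate: Callable[[List[str]], str],
    search: Callable[[List[str]], List[str]],
    context: List[str],
) -> str:
    """Run one generation pass and at most one targeted fallback retrieval."""
    first = json.loads(generate(context))
    if "REQUIRED_DATA" not in first.get("flags", []):
        return first["agent_response"]

    # Second pass: search with the model's own probes, then regenerate.
    extra = search(first.get("hyde_queries", []))
    second = json.loads(generate(context + extra))

    # Final verification: if the target is still missing, do not guess.
    if "REQUIRED_DATA" in second.get("flags", []) or not second.get("agent_response"):
        return NOT_AVAILABLE
    return second["agent_response"]
```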

Learning Pipeline

After every normal response, Argus starts background learning.

Because foreground retrieval can now be skipped, background learning has its own safety step. If the user-facing path did not retrieve memory, the learning worker runs a retrieval pass for the actual user prompt before calling the memory editor. That retrieved context is used only for learning, so memory updates and deletes still have the prior memory lines and IDs they need without forcing every user-facing response through retrieval.

Benchmark memory-ingestion prompts are the exception. They are learned synchronously before the ingestion response returns, so run_dataset_evals.py does not feed the next history chunk until the previous chunk has been learned and committed.

The learning model extracts:

semantic memories
episodic memories
procedural rules

It can issue:

{
  "actions": [
    {"action": "ADD", "content": "New memory"},
    {"action": "UPDATE", "id": "uuid", "content": "Updated memory"},
    {"action": "DELETE", "id": "uuid"}
  ]
}

Important learning behaviors:

preserves specific names and proper nouns
resolves relative dates using the current date
anchors transfer and acquisition events
preserves every number and qualifier
avoids duplicate memories across semantic and episodic layers
prefers exact values over approximate or bounded restatements
updates existing records instead of creating conflicting duplicates
keeps normal prose learning active
splits benchmark ingestion chunks into individual user/assistant pairs
stages semantic and episodic ADD, UPDATE, and DELETE actions in RAM between pairs
commits staged semantic and episodic vector actions after all pairs in the chunk have been processed
reloads procedural context between ingestion pairs when procedural rules change
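
The stage-then-commit behavior for benchmark ingestion can be sketched with a plain dictionary standing in for a vector store. `StagedMemory` is a hypothetical class written for this example; the real implementation also has to handle embeddings and FAISS indexes.

```python
import uuid
from typing import Dict, List


class StagedMemory:
    """Working copy of a memory store; changes reach `committed` only on commit()."""

    def __init__(self, committed: Dict[str, str]):
        self.committed = committed
        self.working = dict(committed)

    def apply(self, actions: List[dict]) -> None:
        """Apply ADD, UPDATE, and DELETE actions to the in-RAM working copy."""
        for action in actions:
            kind = action["action"]
            if kind == "ADD":
                self.working[str(uuid.uuid4())] = action["content"]
            elif kind == "UPDATE" and action["id"] in self.working:
                self.working[action["id"]] = action["content"]
            elif kind == "DELETE":
                self.working.pop(action["id"], None)

    def commit(self) -> None:
        """Publish the working copy after the whole chunk has been processed."""
        self.committed.clear()
        self.committed.update(self.working)
```

Each user/assistant pair would call `apply` so the next pair sees the updated working memory, and `commit` would run once per chunk.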

Background learning is protected by:

learning-related retrieval before memory editing when foreground retrieval was skipped
persistent retry loop
exponential backoff
concurrency limit of 4 learning tasks for normal background learning
benchmark ingestion lock for synchronous memory-ingestion learning
benchmark synchronization before final questions
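
A minimal asyncio sketch of the first three protections, assuming a semaphore enforces the concurrency limit. The function name, delay values, and the choice to hold the slot while backing off are assumptions, not details confirmed by the source.

```python
import asyncio
from typing import Awaitable, Callable

MAX_CONCURRENT_LEARNING = 4  # limit stated in the list above


async def learn_with_retry(
    task: Callable[[], Awaitable[None]],
    semaphore: asyncio.Semaphore,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
) -> int:
    """Run `task` until it succeeds and return the number of attempts used."""
    attempt = 0
    async with semaphore:  # at most N learning tasks hold a slot at once
        while True:
            attempt += 1
            try:
                await task()
                return attempt
            except Exception:
                # Exponential backoff: 1x, 2x, 4x, ... capped at max_delay.
                delay = min(max_delay, base_delay * 2 ** (attempt - 1))
                await asyncio.sleep(delay)
```

The semaphore would be created once inside the running event loop, for example `asyncio.Semaphore(MAX_CONCURRENT_LEARNING)`, and shared by every background learning task.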

Tool System

Argus includes a progressive skill router.

Instead of injecting all tool instructions into every prompt, the router sees a compact catalog and selects only the relevant skills. The generation model then receives the full markdown and bound tools for those selected skills.

Current included skills:

agent identity management
cloud app actions
coding-agent delegation

Tool execution uses a ReAct loop with a maximum of 5 iterations. If the loop reaches the limit, the model is forced to stop calling tools and produce a final text response.
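
The iteration cap can be sketched without any framework. Here `model_step` is a stand-in for the generation model: it is assumed to return either a tool request or final text, and to accept a flag that disables tool calling. Those shapes are illustrative, not the real interface.

```python
from typing import Callable, Dict, List

MAX_TOOL_ITERATIONS = 5  # limit stated above


def run_tool_loop(
    model_step: Callable[[List[dict], bool], dict],
    tools: Dict[str, Callable[[dict], str]],
    messages: List[dict],
) -> str:
    """Alternate model steps and tool calls, forcing a text answer at the limit.

    model_step(messages, allow_tools) is assumed to return either
    {"tool": name, "args": {...}} or {"text": "..."}.
    """
    for _ in range(MAX_TOOL_ITERATIONS):
        step = model_step(messages, True)
        if "text" in step:
            return step["text"]
        result = tools[step["tool"]](step["args"])
        messages.append({"role": "tool", "name": step["tool"], "content": result})
    # Limit reached: ask once more with tool calling disabled.
    return model_step(messages, False)["text"]
```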

Cloud tools are the external-action layer for app and SaaS integrations such as GitHub, Gmail, Google Calendar, Slack, Notion, and Linear. Argus keeps its own local identity tool native, while external app auth, tool search, and execution flow through the cloud-tool session.

Users can inspect and expand the enabled cloud toolkit at runtime:

/tools
/which-tool check my unread emails
/cloud-tools
/add-tool gmail
/remove-tool slack

Enabled cloud tools are stored in local_memory/<AGENT_ID>/agent_tools.json, with .env values used as startup defaults. /cloud-tools fetches the available cloud-tool catalog when credentials are configured, then falls back to the small local catalog if the remote catalog is unavailable.

Native coding-agent commands:

/coding implement the failing auth test
/coding-new start a fresh implementation pass for the dashboard
/coding-continue now add tests for the same change
/coding-continue <task_id> use this exact older session
/coding or /coding-tasks
/coding-agents
/coding-use codex
/coding-workspace /absolute/path/to/repo
/coding-network on
/coding-network off
/coding-network <task_id> on
/coding-allow-network <task_id>
/coding-status <task_id>
/coding-log <task_id>
/coding-reply <task_id> approved, continue
/coding-cancel <task_id>
/coding-delete <task_id>
/coding-clear

The coding-agent skill is provider-neutral. V1 uses Codex; future providers can plug into the same task store, event feed, and API command surface. The current default provider, workspace, and network mode are visible in the CLI header.

Coding Agent Delegation

Argus delegates coding work instead of editing files directly in the chat flow. The first provider is Codex, launched through Codex CLI's MCP server with the default command:

codex mcp-server

Delegated tasks are durable:

tasks.json stores task status, provider, workspace, network mode, prompt, changed files, errors, and summary.
logs/<task_id>.jsonl stores every progress event.
.argus/coding_progress/<task_id>.jsonl inside the coding workspace is a provider-side progress file Argus asks Codex to update during long tasks. Argus polls it and streams those progress rows to the CLI and subscribed channels.
config.json stores local overrides for default provider, workspace, and coding-network default when changed from CLI/API/tool calls.
The CLI subscribes to /api/events and renders coding events in the transcript.
The user can reply to a waiting task with /coding-reply <task_id> <message>.
The user can continue the most recent completed Codex session with /coding-continue <message>.
The user can start a deliberately fresh Codex session with /coding-new <task>.
The user can delete a completed/failed/cancelled task with /coding-delete <task_id>, or clear all non-active task history with /coding-clear.

Provider/workspace control:

.env provides startup defaults such as CODING_AGENT_DEFAULT_PROVIDER and CODEX_WORKSPACE_ROOT.
CODEX_WORKSPACE_ROOT=. is portable and resolves to the directory where the user launched the Argus CLI/API.
/coding-agents lists supported providers and shows the current default.
/coding-use <provider> selects the default provider. V1 supports codex; claude_code and cursor are reserved provider slots for later.
/coding-workspace <path> sets the default workspace without editing .env.
/coding-network on|off sets whether new coding tasks can use network access. It defaults to on so package registries such as PyPI/npm work during coding tasks.
/coding-network <task_id> on|off changes the stored network mode for a specific task. /coding-allow-network <task_id> is a shortcut for turning it on.
If an older Codex thread was created before network was enabled, Argus restarts Codex from the saved task context on the next continuation so the new network mode actually reaches the provider shell.
If a Telegram chat lists or inspects a running coding task, that chat is subscribed to future progress/completion updates for the task.
When /connect telegram succeeds, Argus also subscribes known Telegram chats from TELEGRAM_ALLOWED_USERS or previous Telegram history to currently active coding tasks.
Asking the agent to change the coding agent, workspace, or network mode can use the same native coding-agent config tool.

Common Codex session flow:

/coding build an ml project scaffold
/coding-continue add a README and run the tests
/coding-new investigate a separate bug in the API
/coding-continue <task_id> continue an older task explicitly

Each Codex task stores the provider session id when Codex returns one, so /coding-continue can resume the same Codex thread instead of starting from a blank session.

Codex MCP returns the final tool result at the end of a provider call rather than streaming every shell stdout line to Argus. For long work, Argus injects a progress-reporting instruction and polls the workspace progress file above; if Codex cannot write it, Argus still sends periodic heartbeat/status events.

For commands that need a task id, the CLI provides task-id selection. Type the command followed by a space, choose a recent task with Ctrl+N / Ctrl+P, then press Tab to insert the selected id:

/coding-status <space>
/coding-log <space>
/coding-reply <space>
/coding-cancel <space>
/coding-delete <space>
/coding-continue <space>
/coding-network <space>
/coding-allow-network <space>

Task statuses:

queued
running
waiting_user
completed
failed
cancelled

Default safety policy:

Auto-approve low-risk read-only operations, normal edits inside the configured workspace, and routine test/build commands.
Coding-task network access defaults to on inside the workspace-write sandbox. Turn it off globally with /coding-network off, or for a task with /coding-network <task_id> off.
Ask for user confirmation before destructive commands, writes outside the workspace, secret/env edits, git commits, git pushes, credential access, or anything the provider explicitly marks as confirmation-required.

Coding requests use a shallow retrieval profile in normal channels. Argus still retrieves recent local context, but it skips the HyDE LLM call and forces standard mode with small top_k instead of broad/deep memory search. Benchmark mode is unchanged.

Adding A New Tool

Tool expansion is deliberately simple. To add a new capability, drop a new folder inside tools/ and follow the existing skill convention.

tools/
  your_tool/
    skill.md
    __init__.py
    tools.py

skill.md defines the router-facing metadata:

---
name: your_tool
description: One-line description of what this skill can do.
triggers: keyword one, keyword two, natural language trigger
---

# Your Tool Skill

Describe when to use it, when not to use it, available tools, and operating rules.

tools.py defines LangChain tool callables, and __init__.py exports them using the folder-name convention:

from .tools import your_first_tool, your_second_tool

YOUR_TOOL_TOOLS = [your_first_tool, your_second_tool]

That is it. tools/__init__.py automatically scans every subdirectory with a skill.md, imports the package, reads the frontmatter, and registers the exported <FOLDER_NAME>_TOOLS list. No central registry edit is required.
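
The discovery step can be approximated with the standard library. This sketch only reads metadata and derives the expected export name; it does not import the packages, and `read_frontmatter` and `discover_skills` are hypothetical names rather than the functions in tools/__init__.py.

```python
from pathlib import Path
from typing import Dict


def read_frontmatter(skill_md: Path) -> Dict[str, str]:
    """Parse the simple `key: value` block between the two `---` lines."""
    lines = skill_md.read_text(encoding="utf-8").splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta: Dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta


def discover_skills(tools_dir: Path) -> Dict[str, Dict[str, str]]:
    """Map each folder containing skill.md to its metadata and export name."""
    skills: Dict[str, Dict[str, str]] = {}
    for folder in sorted(p for p in tools_dir.iterdir() if p.is_dir()):
        skill_md = folder / "skill.md"
        if skill_md.is_file():
            meta = read_frontmatter(skill_md)
            # Folder-name convention: your_tool -> YOUR_TOOL_TOOLS
            meta["export_name"] = f"{folder.name.upper()}_TOOLS"
            skills[folder.name] = meta
    return skills
```

Folders without a skill.md are skipped, which matches the rule that only described skills are registered.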

For user-scoped dynamic tools, a skill package can also export <FOLDER_NAME>_TOOLS_FACTORY(runtime_config). This is how the cloud-tools skill creates session tools for the active user_id without hard-coding every external app schema into the prompt.

This gives Argus a clean capability expansion path: add a folder, describe the skill, export the tools, restart the process, and the agent can route to the new capability.

Benchmarks

Argus includes a LongMemEval runner:

python run_dataset_evals.py

The benchmark pipeline:

Loads eval_datasets/longmemeval_s_cleaned.json
Splits haystack sessions into chunks
Feeds each chunk through the agent with learning enabled
Learns benchmark memory-ingestion chunks synchronously before returning the ingestion ACK
Asks the benchmark question with learning disabled
Judges with a binary evaluator
Stores question_type with each result row for reporting, while benchmark agent calls use the same runtime interface as normal agent calls
Writes results to reports/longmemeval_results.json

For each benchmark history chunk, the agent splits the chunk into individual user/assistant pairs. Semantic and episodic actions are staged in RAM after each pair, so the next pair sees the updated working memory. After all pairs in the chunk are processed, staged semantic and episodic actions are committed to the vector stores. Procedural learning also runs per pair and refreshes procedural context when rules change.

Parallel Evaluation

For faster local benchmark runs, Argus also includes a process-based parallel evaluator:

EVAL_WORKERS=5 python run_dataset_evals_parallel.py

The parallel runner is designed for long benchmark runs where you do not want to lose completed progress. It first loads all completed question IDs from:

reports/longmemeval_results.json
reports/longmemeval_results.worker*.json

Then it calculates the remaining questions, splits only those questions across the requested number of workers, and launches one isolated process per worker.

Each worker receives its own AGENT_ID:

<AGENT_ID>_eval_worker_0
<AGENT_ID>_eval_worker_1
<AGENT_ID>_eval_worker_2
...

That means each worker gets an isolated FAISS memory folder under local_memory/, so multiple questions can be learned and evaluated at the same time without memory collision.
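
The resume-and-split logic can be sketched as a pure function. Round-robin assignment is an assumption made for this example; the source only says the remaining questions are split across workers. `plan_workers` is a hypothetical name.

```python
from typing import Dict, List, Set


def plan_workers(
    all_ids: List[str], completed: Set[str], workers: int, agent_id: str
) -> Dict[str, List[str]]:
    """Assign remaining question IDs round-robin to per-worker agent IDs."""
    remaining = [qid for qid in all_ids if qid not in completed]
    # Never create more workers than there are remaining questions.
    plan: Dict[str, List[str]] = {
        f"{agent_id}_eval_worker_{i}": []
        for i in range(min(workers, len(remaining)))
    }
    names = list(plan)
    for index, qid in enumerate(remaining):
        plan[names[index % len(names)]].append(qid)
    return plan
```

Completed question IDs are filtered out before the split, so a restarted run only pays for unfinished questions.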

Worker outputs are written independently:

reports/longmemeval_results.worker0.json
reports/longmemeval_results.worker1.json
reports/longmemeval_results.worker2.json

When all workers finish, the runner merges worker outputs back into:

reports/longmemeval_results.json
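A merge of this kind might look like the sketch below. It assumes result rows are dictionaries keyed by `question_id` and that a worker row should replace an older row for the same question; `merge_results` is a hypothetical name, not the runner's actual function.

```python
def merge_results(main_rows: list[dict], worker_rows: list[list[dict]]) -> list[dict]:
    """Combine the main report with worker outputs, one row per question."""
    merged: dict[str, dict] = {row["question_id"]: row for row in main_rows}
    for rows in worker_rows:
        for row in rows:
            # Assumed policy: a worker row replaces an older row for the same ID.
            merged[row["question_id"]] = row
    return list(merged.values())
```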

The default is 5 workers. Start with a value that suits your machine and API limits, then raise EVAL_WORKERS until OpenAI rate limits or local CPU pressure becomes the bottleneck.

Prompt Regression Samples

For faster iteration on prompt and retrieval changes, use the sample runner:

python run_dataset_evals_sample.py

It reuses the parallel evaluator, selects a deterministic sample, writes a manifest of sampled questions, keeps per-worker checkpoints, and merges results into a sample-specific report. Useful environment variables:

EVAL_SAMPLE_NAME=prompt_regression
EVAL_SAMPLE_SIZE=60
EVAL_SAMPLE_SOURCE_LIMIT=500
EVAL_SAMPLE_SEED=test6
EVAL_SAMPLE_QUESTION_IDS=<comma-separated ids>
EVAL_SAMPLE_FRESH=1
EVAL_WORKERS=20

To monitor live results across both the main and worker result files:

python monitor_results.py

Current local report files:

reports/longmemeval_results.json
reports/longmemeval_results.worker*.json

Benchmark Cost Planning

LongMemEval-S is an expensive benchmark for this agent because each question feeds many conversation-history chunks, and each chunk can trigger retrieval planning, memory learning, embeddings, and final answer generation.

Observed cost with the current model mix:

| Run size | Approximate cost |
| --- | --- |
| 1 average question | about `$5` |
| 10 questions | about `$50` |
| 100 questions | about `$500` |
| Full 500-question LongMemEval-S run | about `$2,500` |

The current benchmark model mix is:

| Component | Model | Input / 1M | Output / 1M | Cached input / 1M |
| --- | --- | --- | --- | --- |
| Retrieval planning | `gpt-4o-mini` | `$0.15` | `$0.60` | `$0.075` |
| Generation | `gpt-4.1` | `$2.00` | `$8.00` | `$0.50` |
| Memory learning | `gpt-4.1` | `$2.00` | `$8.00` | `$0.50` |
| Embeddings | `text-embedding-3-large` | `$0.13` | n/a | n/a |
| Benchmark judge | `gpt-5` | `$1.25` | `$10.00` | `$0.125` |

For the current LongMemEval-S dataset, the benchmark runner sees about 41,813 chunks total, or about 83.6 chunks per question. The direct chunk-ingestion traffic alone is about 62.3M estimated input tokens, but that is only a lower bound. The full cost is higher because the agent re-reads chunk content during learning and creates embeddings for retrieval and memory writes.

Run a 1-question or small-sample benchmark first and inspect recorded usage metrics before running all 500 questions.
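The figures in this section can be turned into a quick planning calculation. The constants below are the numbers quoted above; `planned_cost` and `single_pass_cost` are hypothetical helpers, and the single-pass figure is only an illustration of the lower bound, not a measured cost.

```python
TOTAL_QUESTIONS = 500
TOTAL_CHUNKS = 41_813
INGESTION_INPUT_TOKENS = 62_300_000  # estimated direct chunk-ingestion input
COST_PER_QUESTION_USD = 5.0          # observed average with the current model mix


def planned_cost(questions: int) -> float:
    """Scale the observed per-question average to a run size."""
    return questions * COST_PER_QUESTION_USD


def single_pass_cost(tokens: int, usd_per_million: float) -> float:
    """Price of reading the given tokens once, uncached, at one input rate."""
    return tokens / 1_000_000 * usd_per_million


chunks_per_question = TOTAL_CHUNKS / TOTAL_QUESTIONS
tokens_per_chunk = INGESTION_INPUT_TOKENS / TOTAL_CHUNKS

print(f"chunks per question: {chunks_per_question:.1f}")
print(f"estimated input tokens per chunk: {tokens_per_chunk:,.0f}")
print(f"60-question sample: ${planned_cost(60):,.0f}")
print(f"one pass over ingestion tokens at $2.00/1M: ${single_pass_cost(INGESTION_INPUT_TOKENS, 2.00):,.2f}")
```

The gap between the single-pass figure and the observed full-run cost reflects the repeated reads during learning, retrieval planning, embeddings, and generation.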

Current Local Metrics

Current local LongMemEval-S metrics, computed from reports/longmemeval_results.json and joined with eval_datasets/longmemeval_s_cleaned.json by question_id:

| Question type | Correct | Incorrect | Total | Accuracy |
| --- | --- | --- | --- | --- |
| Overall | 491 | 9 | 500 | 98.20% |
| knowledge-update | 77 | 1 | 78 | 98.72% |
| multi-session | 129 | 4 | 133 | 96.99% |
| single-session-assistant | 56 | 0 | 56 | 100.00% |
| single-session-preference | 30 | 0 | 30 | 100.00% |
| single-session-user | 70 | 0 | 70 | 100.00% |
| temporal-reasoning | 129 | 4 | 133 | 96.99% |

These metrics represent the current LongMemEval-S progress while Argus Agent is actively being improved. Some answers may change as failing or uncertain questions are rerun and fixes are added.
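A table like the one above can be recomputed with a short join on question_id. This is a sketch under assumptions: the correctness field name `is_correct` is hypothetical (pass the real key as `correct_key`), and both inputs are assumed to be lists of dictionaries already loaded from the two JSON files.

```python
from collections import defaultdict


def accuracy_by_type(
    results: list[dict],
    dataset: list[dict],
    correct_key: str = "is_correct",  # hypothetical field name
) -> dict[str, tuple[int, int, float]]:
    """Return {question_type: (correct, total, accuracy_percent)} plus "Overall"."""
    type_by_id = {row["question_id"]: row["question_type"] for row in dataset}
    counts: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # [correct, total]
    for row in results:
        question_type = type_by_id[row["question_id"]]
        for key in (question_type, "Overall"):
            counts[key][0] += 1 if row[correct_key] else 0
            counts[key][1] += 1
    return {
        key: (correct, total, round(100.0 * correct / total, 2))
        for key, (correct, total) in counts.items()
    }
```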

Requirements

Python 3.11 or higher
An OpenAI API key
Optional for Telegram: a Telegram bot token from @BotFather
Optional for Telegram without a domain: cloudflared. Setup can install it with Homebrew, Windows winget, or a direct Windows PowerShell download fallback.
Optional for coding delegation: OpenAI Codex CLI on your PATH

Quick Start

Clone the repo, then run the setup script from the repo root.

git clone https://github.com/quarqlabs/argus.git
cd argus

macOS/Linux:

python3 scripts/setup_argus.py

Windows PowerShell:

py scripts\setup_argus.py

The setup script:

creates .venv
installs requirements.txt
creates .env from .env.example if missing
installs cloudflared when possible
installs the global argus launcher for your user
updates the user PATH for future terminals when possible

cloudflared setup behavior:

macOS/Linux with Homebrew: brew install cloudflare/cloudflare/cloudflared
Windows with winget: winget install --id Cloudflare.cloudflared --source winget --accept-package-agreements --accept-source-agreements
Windows without winget: download the latest Cloudflare binary to C:\cloudflared\cloudflared.exe, temporarily add C:\cloudflared to the setup process PATH, and verify with cloudflared.exe --version

The CLI also checks C:\cloudflared\cloudflared.exe directly on Windows, so /connect telegram can work even before a new terminal picks up a permanent PATH update.

After setup, edit .env and fill at least:

OPENAI_API_KEY=your_api_key
USER_ID=local_user
AGENT_ID=local_agent
LOCAL_MEMORY_ROOT=local_memory

Open a new terminal, or run the PATH refresh command printed by the setup script. Then start Argus from any directory:

argus

If you only want to install or repair the global launcher without reinstalling dependencies, run the lighter launcher installer.

macOS/Linux:

python3 scripts/install_argus.py --force

Windows PowerShell:

py scripts\install_argus.py --force

The global launcher points back to the cloned repo. If you move or delete that repo folder, rerun the launcher installer from the new location.

On Windows, the launcher sets Python UTF-8 mode so CLI/API status symbols do not crash under legacy cmd.exe code pages. If you see a charmap error such as can't encode character '\u274c', pull the latest code and rerun:

py scripts\install_argus.py --force

The control console starts main:app for you, connects the CLI to the API job queue, and shows structured events as requests move through triage, retrieval, tool routing, generation, tool use, and final response.

For coding-agent delegation, the default Codex provider launches:

codex mcp-server

The Python side uses the OpenAI Agents SDK dependency from requirements.txt. If the Codex CLI or the Agents SDK is missing, Argus records a failed coding task with a setup message instead of crashing. On macOS, if codex is not on the API worker's PATH, Argus also checks the standard Codex.app binary at /Applications/Codex.app/Contents/Resources/codex.

The API worker keeps process-lifetime chat history per channel and passes the last four user/assistant pairs into each agent request. This preserves short references such as "done" after an auth link or "now check calendar" without stuffing the full conversation into every prompt.
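A bounded per-channel history of this kind can be sketched with a `deque`. The names below are hypothetical, and keying the history by a channel string is an assumption about how channels are identified.

```python
from collections import defaultdict, deque

MAX_PAIRS = 4  # last four user/assistant pairs, as described above

# Process-lifetime store: it is empty again after a restart.
_history: dict[str, deque] = defaultdict(lambda: deque(maxlen=MAX_PAIRS))


def remember(channel: str, user_msg: str, assistant_msg: str) -> None:
    """Record one exchange; the oldest pair drops off automatically."""
    _history[channel].append((user_msg, assistant_msg))


def recent_pairs(channel: str) -> list[tuple[str, str]]:
    """Return the pairs that would accompany the next agent request."""
    return list(_history[channel])
```

With `maxlen` set, the prompt overhead stays constant no matter how long the conversation runs.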

You can still run the raw terminal agent directly:

python agent.py

Or run only the API server:

uvicorn main:app --reload

Call the API:

curl -X POST http://127.0.0.1:8000/api/chat \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What do you remember about me?", "channel_type": "web"}'
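The same request can be made from Python with only the standard library. This is a sketch: `build_chat_request` and `send_chat` are hypothetical helpers, the 120-second timeout is arbitrary, and the response is assumed to be a JSON object whose fields are not documented here.

```python
import json
import urllib.request

API_URL = "http://127.0.0.1:8000/api/chat"


def build_chat_request(prompt: str, channel_type: str = "web") -> urllib.request.Request:
    """Build the POST request without sending it."""
    body = json.dumps({"prompt": prompt, "channel_type": channel_type}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(prompt: str, timeout: float = 120.0) -> dict:
    """Send the request to a running local API server and decode the reply."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))  # assumed: JSON body
```

`send_chat("What do you remember about me?")` requires the API server to be running on 127.0.0.1:8000.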

Control Console

agent_cli.py is the recommended local entrypoint. It provides a Codex-style terminal surface around the local FastAPI worker:

starts main:app on 127.0.0.1:8000
hides noisy HTTP client logs
shows a scrollable transcript of structured events
shows the current model label, working directory, API URL, connected channels, startup channels, default coding agent, coding workspace, and coding network mode
reads the agent name from the live local identity config, so a rename through agent_identity_manager updates the header without restarting
supports formatted Markdown in agent responses
supports multiline input: Enter sends, Shift+Enter inserts a newline
supports command suggestions when you type /, with Tab completing the first suggestion

Console commands:

/help
/status
/tools
/which-tool <task>
/cloud-tools
/add-tool <tool>
/remove-tool <tool>
/coding <task>
/coding-new <task>
/coding-continue <message>
/coding-continue <task_id> <message>
/coding-tasks
/coding-agents
/coding-use <provider>
/coding-workspace <path>
/coding-network [on|off]
/coding-network <task_id> on|off
/coding-allow-network <task_id>
/coding-status <task_id>
/coding-log <task_id>
/coding-reply <task_id> <message>
/coding-cancel <task_id>
/coding-delete <task_id>
/coding-clear
/connect telegram
set-default start-channel telegram
set-default start-channel none
/wipe
/quit

/connect telegram starts the Telegram connection pipeline only when you ask for it. set-default start-channel telegram stores a local startup preference in local_memory/<AGENT_ID>/agent_cli.json, so future CLI launches can connect that channel automatically. set-default start-channel none clears that preference.

Agent Identity Config

Argus no longer needs Supabase for local identity updates. Runtime identity uses a local JSON config file, with .env values as defaults.

Default location:

local_memory/<AGENT_ID>/agent_identity.json

Override location:

AGENT_IDENTITY_CONFIG_PATH=local_memory/local_agent/agent_identity.json

Supported identity fields:

{
  "agent_name": "Argus",
  "agent_personality": "professional and helpful",
  "agent_use_cases": ["general assistance"],
  "agent_custom_prompt": ""
}

Initial values can come from .env:

AGENT_NAME=Argus
AGENT_PERSONALITY="friendly, precise, high-energy"
AGENT_USE_CASES=["coding","research","life-long memory"]
AGENT_CUSTOM_PROMPT="Be concise, grounded, and useful."

When the user asks the agent to rename itself, change its personality, update its main use cases, or change global instructions, the agent_identity_manager tool writes the update to the JSON config file. The env values remain fallback defaults for a new agent profile or a missing config file.
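The precedence described above (saved JSON config first, env values as fallback) can be sketched as follows. The function names are hypothetical, and using the sample JSON values as built-in defaults is an assumption; agent_config.py is the real loader.

```python
import json
import os

# Assumed built-in defaults, taken from the sample identity file above.
DEFAULTS = {
    "agent_name": "Argus",
    "agent_personality": "professional and helpful",
    "agent_use_cases": ["general assistance"],
    "agent_custom_prompt": "",
}

ENV_TO_FIELD = {
    "AGENT_NAME": "agent_name",
    "AGENT_PERSONALITY": "agent_personality",
    "AGENT_CUSTOM_PROMPT": "agent_custom_prompt",
}


def parse_use_cases(raw: str) -> list[str]:
    """Accept either a JSON array or a comma-separated string."""
    raw = raw.strip()
    if raw.startswith("["):
        return [str(item) for item in json.loads(raw)]
    return [part.strip() for part in raw.split(",") if part.strip()]


def load_identity(config_path: str, env: dict[str, str]) -> dict:
    """Layer defaults, then env values, then the saved JSON config."""
    identity = dict(DEFAULTS)
    for variable, field_name in ENV_TO_FIELD.items():
        if env.get(variable):
            identity[field_name] = env[variable]
    if env.get("AGENT_USE_CASES"):
        identity["agent_use_cases"] = parse_use_cases(env["AGENT_USE_CASES"])
    if os.path.isfile(config_path):
        with open(config_path, encoding="utf-8") as handle:
            saved = json.load(handle)
        identity.update({k: v for k, v in saved.items() if k in DEFAULTS})
    return identity
```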

Channel Integrations

Channels are API-facing integrations. The first supported channel is Telegram; the design leaves room for WhatsApp and other channels later.

Telegram supports text, captions, photos, documents, PDFs, DOCX files, audio, voice notes, videos, stickers, and other Telegram file objects. Every received file is downloaded into local channel storage, indexed with source metadata, and passed to the current agent job as attachment context. Future channel adapters can use the same generic POST /api/files endpoint, then include the returned attachment IDs in /api/jobs or /api/chat.

Telegram may allow a user to send a larger file into the chat, but the Bot API only lets bots download files up to 20 MB. Argus therefore defaults CHANNEL_FILE_MAX_BYTES to 20000000 bytes (about 20 MB) for channel attachments. If a file is too large or cannot be downloaded or read, the Telegram bot replies with a clear attachment failure message instead of silently answering from the caption alone.

Telegram

Create a Telegram bot with @BotFather.
Put the bot token in .env.
Put your Telegram numeric user ID in TELEGRAM_ALLOWED_USERS.
Set a random TELEGRAM_WEBHOOK_SECRET.
Install cloudflared if you do not have a public domain. scripts/setup_argus.py can install it automatically, including the direct Windows fallback to C:\cloudflared\cloudflared.exe.
Run python agent_cli.py.
In the console, run /connect telegram.

Example .env values:

TELEGRAM_BOT_TOKEN=123456789:replace_with_botfather_token
TELEGRAM_ALLOWED_USERS=123456789
TELEGRAM_WEBHOOK_SECRET=replace_with_random_secret

What /connect telegram does:

Starts a temporary Cloudflare tunnel to the local API.
Builds the public webhook URL as <tunnel-url>/api/telegram/webhook.
Calls Telegram setWebhook with the webhook secret.
Shows channel registration progress in the CLI.
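The registration and verification steps can be sketched as below. The `setWebhook` method, its `secret_token` parameter, and the `X-Telegram-Bot-Api-Secret-Token` header are part of the Telegram Bot API; the helper names, the plain-dict header lookup, and combining the secret check with the allow-list in one function are assumptions for illustration.

```python
import hmac
from urllib.parse import urlencode

SECRET_HEADER = "X-Telegram-Bot-Api-Secret-Token"


def build_set_webhook_url(bot_token: str, tunnel_url: str, secret: str) -> str:
    """Build the Bot API call that registers the tunnel as the webhook."""
    webhook_url = tunnel_url.rstrip("/") + "/api/telegram/webhook"
    query = urlencode({"url": webhook_url, "secret_token": secret})
    return f"https://api.telegram.org/bot{bot_token}/setWebhook?{query}"


def parse_allowed_users(raw: str) -> set[int]:
    """Parse TELEGRAM_ALLOWED_USERS into numeric user IDs."""
    return {int(part) for part in raw.split(",") if part.strip()}


def is_authorized(headers: dict[str, str], sender_id: int, secret: str, allowed: set[int]) -> bool:
    """Accept an update only with the right secret and an allowed sender."""
    # Real web frameworks expose headers case-insensitively; a plain dict does not.
    supplied = headers.get(SECRET_HEADER, "")
    secret_ok = hmac.compare_digest(supplied.encode(), secret.encode())
    return secret_ok and sender_id in allowed
```

`hmac.compare_digest` is used so the secret comparison takes the same time whether or not the values match.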

Telegram messages are processed through the same API job queue as CLI messages. While a response is generating, the API sends Telegram typing chat actions so the chat feels alive instead of silent.

Channel commands also work from Telegram:

/help
/status
/tools
/which-tool <task>
/cloud-tools
/add-tool <tool>
/remove-tool <tool>
/coding <task>
/coding-new <task>
/coding-continue <message>
/coding-continue <task_id> <message>
/coding-tasks
/coding-agents
/coding-use <provider>
/coding-workspace <path>
/coding-network [on|off]
/coding-network <task_id> on|off
/coding-allow-network <task_id>
/coding-status <task_id>
/coding-log <task_id>
/coding-reply <task_id> <message>
/coding-cancel <task_id>
/coding-delete <task_id>
/coding-clear
/wipe
/quit

/quit only stops the local CLI when typed in the console. From Telegram it returns a safety message because remote channels should not stop the local process.

API Job Queue

The FastAPI worker exposes both synchronous and job-based paths.

Synchronous compatibility route:

POST /api/chat

Job queue routes:

POST /api/jobs
GET  /api/jobs/{job_id}
GET  /api/events?after=<event_id>

Coding-task routes:

GET  /api/coding-agents
POST /api/coding-agents/default
POST /api/coding-agents/workspace
POST /api/coding-agents/network
GET  /api/coding-tasks
DELETE /api/coding-tasks
POST /api/coding-tasks/subscribe-channel
POST /api/coding-tasks
POST /api/coding-tasks/latest/reply
GET  /api/coding-tasks/{task_id}
DELETE /api/coding-tasks/{task_id}
GET  /api/coding-tasks/{task_id}/logs
POST /api/coding-tasks/{task_id}/reply
POST /api/coding-tasks/{task_id}/cancel
POST /api/coding-tasks/{task_id}/network

The CLI uses the job routes. A request is enqueued, the single worker processes jobs one by one, and status events are emitted for:

triage
retrieval
tool routing
generation
tool running/completed/failed
coding-agent progress
final response

This is what lets the console show useful loader text such as memory retrieval, response generation, and active tool usage instead of blocking silently until the final answer arrives.
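The enqueue, process, and poll cycle can be modeled in memory. The class below is an illustrative sketch, not the code in main.py: integer event IDs, the event field names, and the fixed stage list are assumptions.

```python
import itertools
from collections import deque

STAGES = ("triage", "retrieval", "tool routing", "generation", "final response")


class JobQueue:
    """In-memory model: one worker, FIFO jobs, append-only event log."""

    def __init__(self) -> None:
        self._jobs: deque[tuple[int, str]] = deque()
        self._events: list[dict] = []
        self._job_ids = itertools.count(1)
        self._event_ids = itertools.count(1)

    def enqueue(self, prompt: str) -> int:
        """Plays the role of POST /api/jobs."""
        job_id = next(self._job_ids)
        self._jobs.append((job_id, prompt))
        return job_id

    def run_next(self) -> None:
        """The single worker handles one job at a time, emitting status events."""
        job_id, _prompt = self._jobs.popleft()
        for stage in STAGES:
            self._events.append(
                {"event_id": next(self._event_ids), "job_id": job_id, "stage": stage}
            )

    def events_after(self, after: int = 0) -> list[dict]:
        """Plays the role of GET /api/events?after=<event_id>."""
        return [event for event in self._events if event["event_id"] > after]
```

A client keeps the last event ID it has seen and passes it as `after`, so each poll returns only new events.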

Environment Variables

| Variable | Required | Description |
| --- | --- | --- |
| `OPENAI_API_KEY` | yes | Used for generation, retrieval planning, learning, and embeddings. |
| `AGENT_ID` | no | Selects the local memory namespace. Defaults to `local_agent`. |
| `USER_ID` | API only | Required by `main.py` for the FastAPI worker. |
| `LOCAL_MEMORY_ROOT` | no | Root folder for local memory. Defaults to `local_memory`. |
| `LOCAL_CHANNEL_STORAGE_ROOT` | no | Optional override for durable channel chat history and attachment storage. Defaults to `local_memory/<AGENT_ID>/channel_state`. |
| `CHANNEL_FILE_MAX_BYTES` | no | Max accepted channel attachment size in bytes. Defaults to `20000000` (about 20 MB, matching the practical Telegram bot download ceiling). |
| `ATTACHMENT_EXTRACT_MAX_CHARS` | no | Max extracted text saved from an attachment. Defaults to `24000`. |
| `MULTIMODAL_IMAGE_MODEL` | no | Optional OpenAI model for image descriptions. Defaults to `gpt-4o-mini`. |
| `MULTIMODAL_AUDIO_MODEL` | no | Optional OpenAI model for audio transcription. Defaults to `gpt-4o-mini-transcribe`. |
| `PDF_VISION_MAX_PAGES` | no | Max PDF pages rendered for vision OCR fallback when embedded text extraction fails. Defaults to `3`. |
| `AGENT_IDENTITY_CONFIG_PATH` | no | Optional override for the local identity config file. Defaults to `local_memory/<AGENT_ID>/agent_identity.json`. |
| `AGENT_NAME` | no | Default persona name when no local identity config exists. |
| `AGENT_PERSONALITY` | no | Default tone/personality when no local identity config exists. |
| `AGENT_USE_CASES` | no | Default use-case description. Accepts a JSON array or comma-separated string. |
| `AGENT_CUSTOM_PROMPT` | no | Default custom behavior instructions when no local identity config exists. |
| `ARGUS_AGENT_VERSION` | no | Display-only version label for the control console. Defaults to `v0.5.0`. |
| `ARGUS_MODEL_LABEL` | no | Display-only model label for the control console. Falls back to generation model labels. |
| `ARGUS_REASONING_EFFORT` | no | Optional display suffix for the console model label. |
| `AGENT_DEBUG` | no | Set to `true`/`1` to show verbose debug logs from `agent.py`; metrics still print without debug. |
| `CLOUD_TOOLS_API_KEY` | cloud tools only | Required to use external app tools through the cloud-tool session. |
| `CLOUD_TOOLKITS` | no | Comma-separated cloud-tool slugs. Defaults to `github,gmail,googlecalendar,slack,notion,linear`. |
| `CLOUD_TOOLS_CONFIG_PATH` | no | Optional override for enabled cloud-tool config. Defaults to `local_memory/<AGENT_ID>/agent_tools.json`. |
| `CLOUD_TOOLS_CACHE_DIR` | no | Writable cloud-tool SDK cache directory. Defaults to `local_memory/cloud_tools_cache`. |
| `CODING_AGENTS_ENABLED` | no | Enables coding-agent delegation. Defaults to `true`. |
| `CODING_AGENT_DEFAULT_PROVIDER` | no | Default coding provider. V1 supports `codex`. |
| `CODEX_MCP_COMMAND` | no | Command used to launch Codex MCP. Defaults to `codex`. |
| `CODEX_MCP_ARGS` | no | Comma-separated args for Codex MCP. Defaults to `mcp-server`. |
| `CODEX_WORKSPACE_ROOT` | no | Workspace path for delegated coding work. `.env.example` uses `.`, resolved from the user's launch directory. |
| `CODEX_APPROVAL_POLICY` | no | Approval policy label. Defaults to `argus-safe-auto`. |
| `CODEX_NETWORK_ACCESS` | no | Allows network access for new Codex coding tasks inside the workspace-write sandbox. Defaults to `true`; use `/coding-network off` to disable locally. |
| `CODEX_TASK_TIMEOUT_SECONDS` | no | Max runtime for a delegated coding task. Defaults to `1800`. |
| `TELEGRAM_BOT_TOKEN` | Telegram only | Bot token from `@BotFather`. Required for `/connect telegram`. |
| `TELEGRAM_ALLOWED_USERS` | Telegram recommended | Comma-separated numeric Telegram user IDs allowed to use the local agent. |
| `TELEGRAM_WEBHOOK_SECRET` | Telegram recommended | Secret token sent to Telegram `setWebhook` and verified by `/api/telegram/webhook`. |

Agent identity updates are local-first. The agent_identity_manager tool writes to the JSON config file above, while env values remain startup defaults/fallbacks.

Repository Map

agent.py                  Core LangGraph agent, memory, retrieval, generation, learning
agent_cli.py              Local control console for API, jobs, events, and channels
agent_config.py           Local agent identity config loader/saver
agent_connector.py        Public async integration gateway
main.py                   FastAPI single-tenant worker
local_channel_store.py    Durable channel history and attachment storage
coding_agents/            Provider-neutral coding task store, policy, manager, Codex runner
run_dataset_evals.py      LongMemEval evaluation runner
run_dataset_evals_parallel.py
                          Parallel LongMemEval evaluation runner
monitor_results.py        Benchmark monitoring helper
tools/                    Skill registry and tool implementations
tools/coding_agent/       Native skill for coding-agent delegation
eval_datasets/            Cleaned LongMemEval dataset
reports/                  Evaluation outputs and checkpoints
local_memory/             Local FAISS and JSON memory stores

Design Principles

Argus is built around a few hard rules:

Retrieve broadly, reason narrowly.
Store memories with ownership, dates, and qualifiers intact.
Prefer saying "missing data" over inventing an answer.
Treat temporal and numeric claims as evidence-bound operations.
Keep normal user-facing latency low by learning in the background, while keeping benchmark ingestion deterministic by learning synchronously.
Keep the context window clean with routing and progressive disclosure.
Make the memory system portable, local, and easy to inspect.

Status

Argus Agent v0.5.0 is an active OSS release candidate.

The current version is optimized for long-memory evaluation and single-user local memory. The next natural steps are:

package cleanup
dependency trimming
unit tests for memory storage and retrieval
reproducible benchmark scripts
Docker packaging
memory compaction and archival policies
multi-user serving with isolated local stores

License

Apache License 2.0. See LICENSE for details.
