Skip to content

ctrl-gaurav/effGen

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

216 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
effGen

CI arXiv PyPI Python License

Total Downloads Monthly Downloads Stars Forks Prompt Library Multimodal Cookbook Prometheus Metrics OTel Traces SLOs Docker Helm Lambda Cloudflare VSCode

Paper Website Docs PyPI

Typing SVG

πŸ“° News & Updates

Date Update
πŸ”’ 27 May 2026 v0.2.10 Released: Security, Edge & DX β€” secret scanning (gitleaks), SBOM (CycloneDX), pip-audit CI, sandboxed CodeExecutor (SubprocessSandbox + DockerSandbox), OAuth2/OIDC + RBAC + audit log, Docker + Helm, AWS Lambda (Mangum), Cloudflare Worker edge proxy, VSCode extension, Jupyter magics, live dashboard. See changelog
πŸ“Š 23 May 2026 v0.2.9 Released: Observability & Reliability β€” structured JSON logs + secret redaction, OTel samplers + canonical span spec, Prometheus histograms, SLO tracking, circuit breakers, bulkheads, jittered retries, chaos harness, fuzz suite, effgen loadtest CLI, Alertmanager rules. See changelog
πŸ–ΌοΈ 21 May 2026 v0.2.8 Released: First-class multimodal input β€” image, audio, and video across 6 providers (Gemini, OpenAI, Groq, Anthropic, Together, HF). New multimodal preset, MultimodalDescribeTool, unified Message content schema, 5 cookbook walkthroughs. See changelog
πŸ“š 20 May 2026 v0.2.7 Released: 31 prompt templates across 7 domains β€” research, coding, data/SQL, legal, medical, creative, business β€” with golden eval harness, interactive playground, and auto-generated gallery. See changelog
πŸš€ 19 May 2026 v0.2.6 Released: 14 new tools β€” OCR, AudioTranscribe, ImageInfo, ImageCaption, PDF, DOCX, Excel, Weather, Geocode, Maps, EmailSMTP, EmailIMAP, SlackWebhook, DiscordWebhook. New presets: media, notify. 58+ built-in tools total. See changelog
πŸš€ 18 May 2026 v0.2.5 Released: 13 new free tools β€” PubMed, ArXiv, SemanticScholar, RSS, News, YouTubeTranscript, YouTubeMetadata, Reddit, HackerNews, Translate, LanguageDetect, QRGenerate, QRRead. 44+ built-in tools total. See changelog
πŸš€ 14 May 2026 v0.2.4 Released: ModelRouter with CostBased/LatencyBased/FirstAvailable policies, transparent provider failover, cross-process SQLite rate-limit coordination, persistent cost tracker + effgen cost dashboard CLI. See changelog
πŸš€ 4 May 2026 v0.2.3 Released: 5 new cloud backends (Groq, Together AI, Fireworks, Replicate, HuggingFace Inference) β€” 9 providers total. Unified ProviderRegistry, effgen doctor auth check, backend parity matrix. See changelog
πŸš€ 28 Apr 2026 v0.2.2 Released: Gemini 3.x/2.5/2.0 registry, thinking_budget, Google Search grounding, Files API, Gemini native tools (GoogleSearch, UrlContext, CodeExecution). Anthropic Claude 4.7 registry, extended thinking, prompt caching (cache_control), streaming polish, experimental native tools. See changelog
πŸš€ 25 Apr 2026 v0.2.1 Released: Cerebras backend (4 free-tier models, streaming, native tool-calling, rate-limit coordinator, cost tracking) + OpenAI gpt-5/gpt-5.4-nano/o-series with reasoning_effort, prompt caching, structured outputs v2, and OpenAI native tools (web_search, code_interpreter, file_search). See changelog
πŸš€ 9 Apr 2026 v0.2.0 Released: Major release β€” native tool calling, guardrails, multi-agent orchestration, RAG pipeline, 31 tools, eval framework, production API server, MLX Apple Silicon support, Python & TypeScript SDKs. See changelog
🍎 8 Apr 2026 MLX & Apple Silicon support merged (PR #4): Native Metal GPU acceleration via MLX & MLX-VLM backends, hardware detection, 5 Gradio GUI examples. pip install effgen[mlx]
πŸ”§ 25 Mar 2026 v0.1.3 Released: Verification hardening β€” smarter loop detection, "skip the tool" prompting, model-aware token counting, sub-agent depth limits, circuit breaker persistence. See changelog
πŸ”§ 12 Mar 2026 v0.1.2 Released: Test-driven hardening β€” 10 example agents, 19 bug fixes, cross-model compatibility matrix (11 models, 73% pass rate). See changelog
πŸ”’ 6 Mar 2026 v0.1.1 Released: Stabilization β€” fixed license/metadata consistency, improved error handling, added 6 examples, expanded test suite. See changelog
πŸŽ‰ 1 Mar 2026 v0.1.0 Released: Major feature release β€” 14 built-in tools, agent presets, plugin system, real streaming, memory integration, ACP/MCP protocols, CI/CD, and comprehensive test suite. See changelog
πŸ”§ 3 Feb 2026 v0.0.2 Released: vLLM backend fixes with automatic chat template support, GPU memory control, improved OOM error handling, and multi-model family compatibility
πŸ“„ 2 Feb 2026 Preprint available: EffGen: Enabling Small Language Models as Capable Autonomous Agents
πŸš€ 31 Jan 2026 Initial release of effGen framework (v0.0.1)

πŸ€” What is effGen?

effGen transforms Small Language Models into powerful AI agents. While most frameworks require massive LLMs, effGen is optimized from the ground up for efficient, smaller models β€” delivering fast, capable agents without the compute overhead.

from effgen import Agent, load_model
from effgen.core.agent import AgentConfig
from effgen.tools.builtin import Calculator, PythonREPL

# Load a small but mighty model
model = load_model("Qwen/Qwen2.5-1.5B-Instruct", quantization="4bit")

# Create agent with tools
config = AgentConfig(
    name="math_agent",
    model=model,
    tools=[Calculator(), PythonREPL()]
)
agent = Agent(config=config)

# Run computation
result = agent.run("What is 24344 * 334?")
print(f"Answer: {result.output}")

⚑ Installation

Requires Python 3.10 or newer. Tested on Python 3.10, 3.11, 3.12, 3.13, 3.14.

πŸ“¦ From PyPI (Recommended)

pip install effgen

🍎 Apple Silicon (MLX β€” Recommended for Mac)

pip install effgen[mlx]          # Text models on Apple Silicon
pip install effgen[mlx-vlm]      # Vision-Language models on Apple Silicon

πŸš€ With vLLM for Faster Inference

pip install effgen[vllm]

🎁 Everything in one shot

pip install effgen[all]    # installs vLLM + RAG + vector-DB + search + cloud-secrets + monitoring + …

⚑ Optional: flash-attn (NVIDIA GPUs only β€” 2 steps)

flash-attn is not in [all] on purpose: its own setup.py imports torch before pip's isolated build environment has torch installed (a well-known upstream bug), so bundling it would break pip install effgen[all] for everyone. Install it in two steps instead:

pip install effgen[all]                       # step 1: gets torch + the rest
pip install flash-attn --no-build-isolation   # step 2: reuses the torch from step 1

See docs/installation.md for the full guide.

πŸ”§ From Source

git clone https://github.com/ctrl-gaurav/effGen.git
cd effGen

# Quick install
./install.sh

# Full install (includes vLLM + dev tools)
./install.sh --full

# Manual install
pip install -e .

πŸš€ Quick Start

πŸ’» CLI Usage

# Run a task
effgen run "What is the capital of France?"

# Interactive chat
effgen chat

# Start API server
effgen serve --port 8000

# List available presets
effgen presets

# Check infrastructure health
effgen health

# Interactive wizard
effgen

🐍 Python API

from effgen import Agent, load_model
from effgen.core.agent import AgentConfig
from effgen.tools.builtin import Calculator

# Load model
model = load_model("Qwen/Qwen2.5-1.5B-Instruct", quantization="4bit")

# Configure agent
config = AgentConfig(
    name="calculator_agent",
    model=model,
    tools=[Calculator()],
    system_prompt="You are a helpful math assistant."
)

# Create and run
agent = Agent(config=config)
result = agent.run("Calculate 15% tip on $85.50")
print(result.output)

🍎 Apple Silicon (MLX)

from effgen import Agent, load_model
from effgen.core.agent import AgentConfig
from effgen.tools.builtin import Calculator

# Load MLX model β€” native Metal GPU, unified memory, no CPU-GPU transfer
model = load_model("LiquidAI/LFM2.5-1.2B-Instruct-MLX-8bit", engine="mlx")

config = AgentConfig(
    name="mlx_agent",
    model=model,
    tools=[Calculator()],
)
agent = Agent(config=config)
result = agent.run("What is sqrt(144) + 2^10?")
print(result.output)

✨ Features

🧠
SLM Optimized
Small models

🍎
Apple Silicon
MLX + Metal GPU

πŸ›‘οΈ
Guardrails
PII, injection, safety

πŸ“š
RAG Pipeline
Ingest, search, cite

πŸ‘₯
Multi-Agent
DAG workflows

πŸ–ΌοΈ
Multimodal
image/audio/video

🏭
Production API
OpenAI-compat

πŸ“Š
Observability
metrics/traces/SLOs


πŸ†• What's New in v0.2.9

Observability & Reliability β€” production-ready telemetry in v0.2.9

effGen v0.2.9 ships the full observability and reliability stack. All telemetry is async/non-blocking β€” a failed export never fails inference.

Structured JSON logging with secret redaction. Every log line is a JSON object: {ts, level, module, event, attributes, trace_id, span_id}. The built-in Redactor strips OpenAI, Anthropic, Cerebras, Google, HF, Groq, Bearer, Slack, and Discord webhook patterns at the encoder β€” no secret ever appears in a log file.

from effgen.observability import get_logger
log = get_logger(__name__)
log.event("model.call.started", provider="cerebras", model="llama3.1-8b", cached_tokens=0)
# β†’ {"ts": "2026-05-23T...", "level": "INFO", "event": "model.call.started", ...}

Prometheus histograms + SLO tracking. effgen_model_call_latency_seconds, effgen_tool_call_latency_seconds, effgen_agent_iteration_latency_seconds, and effgen_tokens_total now expose histogram buckets at /metrics. SLOTracker maintains a rolling-window error budget and burn_rate() at /slo.

Configurable OTel samplers + canonical span spec. Choose AlwaysOn, AlwaysOff, TraceIdRatio(p), or RateLimited(per_second) in config. effgen/observability/spans.py is the single source of truth for every span attribute name β€” no more scattered string literals across adapters.

Reliability primitives. Four layers now protect every adapter call:

Primitive Class What it does
Timeouts ReliabilityConfig model_call=60s, tool_call=30s, http=20s β€” explicit on every httpx client
Retries @retryable(Retry(...)) Jittered exponential backoff for 5xx / 429 / network errors; emits OTel events
Circuit breaker CircuitBreaker CLOSED β†’ OPEN β†’ HALF_OPEN per provider; isolates misbehaving backends
Bulkhead Bulkhead Per-provider concurrency + queue limit; prevents provider starvation

Deterministic chaos harness. Inject NetworkTimeout, Http5xx, Http429, SlowResponse, PartialResponse, or MalformedJSON faults with Chaos(seed). Four canonical scenarios β€” fallback on 5xx, Retry-After honoured, timeout fires cleanly, AllProvidersFailed β€” all pass deterministically across 10 seeds.

Fuzz suite. Hypothesis runs 500 examples against all 66 BaseTool subclasses, random ContentPart message sequences, and the router's provider-availability logic. No unhandled exceptions, no secret leaks.

Load-testing CLI + Alertmanager rules.

# Run a 30-second load test (JSON report prints to stdout by default)
effgen loadtest --concurrency 10 --duration 30 --scenario fixed

# Or write the report to a file with --output
effgen loadtest --concurrency 10 --duration 30 --output report.json

# Integrate with Alertmanager
cp docs/observability/alert_rules.yaml /etc/prometheus/rules/effgen.yaml

See docs/observability/overview.md for full setup, docs/observability/metrics.md for all metric definitions, and docs/observability/alerting.md for Alertmanager integration.

πŸ†• What's New in v0.2.8

First-class multimodal in v0.2.8 β€” image, audio & video across 6 providers

effGen v0.2.8 makes multimodal input a first-class citizen. Send images, audio clips, and short video to any vision-capable provider through a unified Message schema β€” the adapter handles the translation, not your code.

Image input β€” Gemini, OpenAI gpt-4o, Groq, Anthropic (code-only), Together, HF. Automatic resize/MIME validation via image_pre.py. Raises CapabilityNotSupportedError cleanly when the provider doesn't support vision.

Audio input β€” Gemini native inline audio, OpenAI Whisper transcription + gpt-4o audio, HF Inference ASR. Auto-downsamples to 16 kHz mono; chunks files over provider max duration. Anthropic raises CapabilityNotSupportedError.

Video input β€” Gemini native video for providers that accept raw video; frame-sampling fallback (ffmpeg) for all others. MissingSystemDependency with install hints when ffmpeg is absent.

Unified message schema β€” TextPart, ImagePart, AudioPart, VideoPart form a typed ContentPart union. Message.content is always a List[ContentPart]; backwards-compatible string constructor still works.

multimodal preset β€” create_agent("multimodal", model) wires Gemini Flash-Lite (primary) + OpenAI gpt-4o-mini (fallback) with ImageInfo, ImageCaption, OCR, AudioTranscribe, MultimodalDescribeTool, and the full tool suite.

5 cookbook walkthroughs β€” image Q&A, audio transcribe + reason, video summarize, OCR + LLM structured extraction, chart reading from an image. All in docs/cookbook/.

from effgen import image_from, audio_from, video_from
from effgen.core.messages import Message, Role
from effgen.presets import create_agent
from effgen import load_model

model = load_model("gemini-2.0-flash", provider="gemini")
agent = create_agent("multimodal", model)

# Image question
img = image_from("https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/240px-PNG_transparency_demonstration_1.png")
msg = Message(role=Role.USER, content=[img, "What is in this image?"])
result = agent.run_message(msg)
print(result.output)

# Audio transcription
aud = audio_from("/tmp/clip.mp3")
msg = Message(role=Role.USER, content=[aud, "Transcribe and summarize."])
result = agent.run_message(msg)
# Multimodal preset
effgen run --preset multimodal "Describe this image" --image /tmp/photo.jpg

# Check capability
python -c "from effgen.models.capabilities import Capability; print(Capability.vision)"

See docs/multimodal/overview.md for the full architecture and docs/cookbook/README.md for the cookbook index.

31 prompt templates in v0.2.7 β€” Prompt Library, Eval Harness & Interactive Playground

effGen v0.2.7 adds a curated, domain-organized Prompt Library with 31 reusable templates across 7 domains, paired with a golden evaluation harness and an interactive playground CLI. See the full gallery.

Research β€” literature review (zero-shot + CoT), paper summary, citation extraction, methodology critique.

Coding β€” code review, bug diagnosis, refactoring plan, test generation, docstring fill.

Data / SQL β€” NL-to-SQL with warnings, SQL explain, SQL optimize, data profile, ETL plan.

Legal β€” contract summary, clause classify, research brief. All templates include mandatory legal disclaimer.

Medical β€” symptom triage, drug interaction, medical literature synthesis. All templates include mandatory medical disclaimer.

Creative β€” story continuation (zero-shot + few-shot), poetry forms, character bio, world building.

Business β€” meeting summary, email draft (formal/casual), OKR generation, SWOT analysis, elevator pitch.

# Discover and browse
effgen prompts list
effgen prompts list --domain research
effgen prompts list --format markdown

# Inspect and evaluate
effgen prompts show research.literature_review.v1.cot
effgen prompts eval
effgen prompts eval --domain coding --live --model llama3.1-8b

# Interactive playground
effgen prompts playground
from effgen.prompts.library import registry

p = registry.get("data.sql_from_nl.v1")
sql_prompt = p.template(
    schema_ddl="CREATE TABLE orders (id INT, customer TEXT, total FLOAT, created_at DATE)",
    question="Total revenue per customer this month",
    dialect="postgresql",
)

See docs/prompts/gallery.md for the full template catalog and docs/prompts/library.md for the framework overview.

14 new tools in v0.2.6 β€” OCR, Audio, Images, Documents, Geo/Weather & Communications

effGen v0.2.6 adds 14 new built-in tools across document, media, and communication categories, bringing the total to 58+. Two new presets (media, notify) are also introduced.

  1. OCR β€” OCRTool (Tesseract local + OCR.space fallback; OCRBackendUnavailable raised with install instructions).

    from effgen.tools.builtin.ocr import OCRTool
    result = OCRTool().execute({"operation": "extract", "image_path": "/tmp/scan.png"})
    print(result["data"]["text"])
  2. Audio Transcription β€” AudioTranscribeTool (faster-whisper local; HF Inference fallback; GPU auto-detected).

    from effgen.tools.builtin.audio_transcribe import AudioTranscribeTool
    result = AudioTranscribeTool().execute({"operation": "transcribe", "audio_path": "/tmp/clip.mp3"})
  3. Image Analysis β€” ImageInfoTool (Pillow metadata, zero network) + ImageCaptionTool (vision-capable model router).

  4. Document Parsing β€” PDFTool (pypdf + pdfplumber), DOCXTool (python-docx), ExcelTool (openpyxl + pandas). All added to research and general presets.

    from effgen.tools.builtin.pdf import PDFTool
    result = PDFTool().execute({"operation": "text", "path": "/tmp/paper.pdf"})
  5. Geo / Weather β€” WeatherTool (Open-Meteo, free, no auth), GeocodeTool (Nominatim/OSM, 1 req/s), MapsTool (staticmap PNG renderer).

    from effgen.tools.builtin.geocode import GeocodeTool
    result = GeocodeTool().execute({"operation": "geocode", "address": "San Francisco, CA"})
  6. Email & Webhooks β€” EmailSMTPTool, EmailIMAPTool, SlackWebhookTool, DiscordWebhookTool. All in new notify preset. Webhook URLs are redacted in logs.

    from effgen.tools.builtin.slack_webhook import SlackWebhookTool
    result = SlackWebhookTool().execute({"operation": "post", "text": "Deploy complete!"})

See the full tool gallery for quickstart snippets for all 58+ tools.

13 new free tools in v0.2.5 β€” Research, News, YouTube, Social, Translation & QR

effGen v0.2.5 adds 13 free, no-auth-required tools, bringing the built-in tool count above 44. All tools integrate with the research and general presets.

  1. Academic Research β€” PubMedTool (NCBI, 3 ops, built-in rate limiting), ArXivTool (Atom feed + PDF download), SemanticScholarTool (search + citations + references).

    from effgen.tools.builtin.arxiv import ArXivTool
    tool = ArXivTool()
    result = tool.execute({"operation": "search", "query": "transformer attention", "max_results": 5})
  2. News & RSS β€” RSSFeedTool (any RSS/Atom feed), NewsTool (BBC, Reuters, HN, NPR, etc. + optional NewsAPI.org key).

    from effgen.tools.builtin.news import NewsTool
    result = NewsTool().execute({"operation": "top_headlines", "category": "technology"})
  3. YouTube β€” YouTubeTranscriptTool (captions without Google API key), YouTubeMetadataTool (via yt-dlp, public content only).

  4. Social Media β€” RedditTool (public JSON, no OAuth), HackerNewsTool (Firebase API, no auth).

  5. Translation & Language Detection β€” TranslateTool (LibreTranslate + offline argostranslate fallback), LanguageDetectTool (55+ languages, fully offline).

  6. QR Codes β€” QRGenerateTool (generate locally), QRReadTool (decode from image, with OpenCV fallback if zbar is unavailable).

See the full tool gallery for quickstart snippets for all 58+ tools.

Top 5 features from v0.2.4 β€” ModelRouter & Cost Optimizer
  1. PolicyBasedRouter β€” composable routing engine with three built-in policies. Pick the cheapest provider within your budget, the fastest under your SLA, or simply the first available β€” and combine them freely.

    from effgen import PolicyBasedRouter, RoutingContext, CostBasedPolicy, LatencyBasedPolicy
    from effgen.models.capabilities import Capability
    
    router = PolicyBasedRouter(policies=[LatencyBasedPolicy(), CostBasedPolicy()])
    ctx = RoutingContext(
        prompt_tokens_estimate=500,
        user_budget_usd=0.01,
        latency_budget_ms=3000,
        required_capabilities={Capability.chat},
    )
    decision = router.route(ctx)
    print(decision.chosen)      # e.g., ProviderModelPair("cerebras", "llama3.1-8b")
    print(decision.eliminated)  # [(pair, reason), ...] β€” fully explainable
  2. Transparent failover β€” route_and_execute(ctx, fn) retries on rate-limits / 5xx / timeouts and seamlessly moves to the next-best provider. Each hop fires a RouterEvent to registered subscribers.

    from effgen import load_model
    
    def call_provider(pair):
        model = load_model(pair.model_id, provider=pair.provider)
        return model.generate("Hello!").text
    
    router.subscribe(
        lambda event: print(
            f"Failover: {event.from_provider}/{event.from_model} "
            f"β†’ {event.to_provider}/{event.to_model}"
        )
    )
    result = router.route_and_execute(ctx, call_provider)
  3. Cross-process SQLite rate-limit coordination β€” share a single rate-limit budget across multiple workers:

    from effgen import RateLimitCoordinator, SQLiteRateLimitStore
    
    store = SQLiteRateLimitStore("~/.effgen/rate_limits.sqlite")
    coordinator = RateLimitCoordinator(storage=store)  # WAL-mode, BEGIN IMMEDIATE
  4. Persistent cost tracking + effgen cost CLI β€” every API call persists to SQLite; query spend instantly:

    effgen cost today          # per-provider per-model table
    effgen cost week           # rolling 7-day view
    effgen cost by-provider    # lifetime totals
    effgen cost set-budget 1.0 # set $1/day cap (BudgetExceededError at 100%)
  5. Fully explainable decisions + budget guard β€” RouterDecision records every eliminated provider and why ("rate_limited", "no_key", "cost_exceeds_budget", "latency_exceeds_sla"). Configure a daily spend cap; the router automatically fails over to a free-tier provider when the budget is hit.

Top 5 features from v0.2.3
  1. 5 new cloud backends β€” GroqAdapter, TogetherAdapter, FireworksAdapter, ReplicateAdapter, HFInferenceAdapter β€” each with streaming, native tools, rate-limit coordination, and cost tracking. 9 providers total.

    model = load_model("llama-3.1-8b-instant", provider="groq")
    model = load_model("Qwen/Qwen2.5-72B-Instruct", provider="hf")
  2. Unified ProviderRegistry β€” list_providers(), list_models(provider), lookup(model_id) consolidated across all 9 adapters. AmbiguousModelError on bare IDs shared across providers.

  3. effgen doctor β€” new CLI command showing which providers have API keys configured.

  4. Backend parity matrix β€” canonical agentic task ("(17 Γ— 23) + sqrt(144) = 403") runs identically across all providers; streaming and error surfaces verified uniform. See docs/providers/parity.md.

  5. HuggingFace Router support β€” HFInferenceAdapter with 124-model dynamic catalog, refresh_models() + check_drift(), ModelUnavailableError with suggest_alternatives(), and custom Inference Endpoint URL.

Top 5 features from v0.2.2 (and earlier)
  1. Gemini 3.x/2.5/2.0 + Gemma families β€” full model registry with correct context windows, output limits, and feature flags; SDK migrated to google-genai>=1.0.0.

  2. Gemini thinking_budget β€” activate Gemini's internal reasoning with GenerationConfig(thinking_budget=8192, include_thoughts=True); thinking trace surfaces in ModelResponse.metadata["thinking"].

  3. Gemini grounding + Files API β€” GenerationConfig(grounding=True) injects Google Search; upload_file(path) passes PDFs/images to the model with a 2 GiB guard.

  4. Gemini native tools β€” GoogleSearchTool, GeminiUrlContextTool, GeminiCodeExecutionTool activate server-side Gemini capabilities in any Agent. Parallel function calls handled automatically.

  5. Anthropic Claude 4.7, extended thinking, prompt caching β€” full Claude 4.x registry; GenerationConfig.thinking for extended reasoning; mark_cached() + AgentConfig.cache_system_prompt/cache_tools for cache_control; cache tokens surfaced in usage.

Top 5 features from v0.2.1
  1. Cerebras backend β€” 4 free-tier models (llama3.1-8b, qwen-3-235b-a22b-instruct-2507, gpt-oss-120b, zai-glm-4.7) with streaming, native function-calling, automatic RPM/TPM/RPD/TPD rate-limit coordination, and per-call cost tracking. pip install effgen[cerebras] and set CEREBRAS_API_KEY.

    from effgen import load_model
    model = load_model("llama3.1-8b", provider="cerebras")
  2. OpenAI gpt-5 / gpt-5.4-nano / o-series reasoning models β€” full registry coverage with reasoning_effort (minimal/low/medium/high) and max_reasoning_tokens on GenerationConfig. Reasoning payloads are routed only to reasoning-capable models.

  3. OpenAI prompt caching surfacing β€” cached_input_tokens exposed on ModelResponse.usage; AgentConfig.stable_system_prompt=True keeps the system prompt anchored at position 0 to maximize OpenAI's automatic β‰₯1024-token prefix cache hit rate.

  4. Structured outputs v2 β€” OpenAIAdapter.generate_structured() with strict JSON Schema; to_openai_schema(pydantic_model) inlines $refs and forces additionalProperties: false; refusals raise ModelRefusalError.

  5. OpenAI native tools β€” OpenAIWebSearchTool, OpenAICodeInterpreterTool, OpenAIFileSearchTool route through OpenAI's Responses API and compose with effGen's local tools in the same agent. ToolIncompatibleError fires at Agent init when paired with a non-OpenAI model.

Top 5 features from v0.2.0
  1. Native Tool Calling β€” Qwen, Llama, Mistral models use built-in function calling instead of text parsing. Set tool_calling_mode="native" or "hybrid". Structured JSON/Pydantic output validation included.

  2. Guardrails & Safety β€” PII detection, prompt injection blocking, toxicity filtering, tool permissions. One-liner: get_guardrail_preset("strict").

  3. Production RAG Pipeline β€” Ingest PDF/DOCX/HTML/Markdown, semantic+BM25 hybrid search, reranking, inline citations. create_agent("rag", model, knowledge_base="./docs/").

  4. Production API Server β€” OpenAI-compatible /v1/chat/completions, request queuing, agent pooling, multi-tenancy, API keys. Drop-in OpenAI replacement with local SLMs.

  5. Apple Silicon Native β€” MLX & MLX-VLM backends for M1/M2/M3/M4. Metal GPU acceleration, unified memory. pip install effgen[mlx].


🎯 Agent Presets

Get started instantly with ready-to-use agent configurations:

from effgen import load_model
from effgen.presets import create_agent

model = load_model("Qwen/Qwen2.5-3B-Instruct", quantization="4bit")

# One-line agent creation
math_agent = create_agent("math", model)       # Calculator + PythonREPL
research_agent = create_agent("research", model) # WebSearch + URLFetch + Wikipedia
coding_agent = create_agent("coding", model)     # CodeExecutor + PythonREPL + FileOps + Bash
general_agent = create_agent("general", model)   # All tools
rag_agent = create_agent("rag", model, knowledge_base="./docs/")  # RAG pipeline
minimal_agent = create_agent("minimal", model)   # Direct inference, no tools
# CLI preset support
effgen run --preset math "What is sqrt(144)?"
effgen run --preset research "Tell me about quantum computing"

πŸ› οΈ Built-in Tools (58+)

πŸ”’
Calculator
Math & Units

🌐
WebSearch
DuckDuckGo

πŸ’»
CodeExecutor
Sandboxed

🐍
PythonREPL
Interactive

πŸ“
FileOps
Read/Write

πŸ”
Retrieval
RAG + BM25

🎯
AgenticSearch
ripgrep

πŸ–₯️
BashTool
Shell Cmds

🌀️
WeatherTool
Open-Meteo

πŸ“‹
JSONTool
Query/Validate

πŸ•
DateTimeTool
Timezones

πŸ“
TextProcessing
Regex/Count

πŸ”—
URLFetch
Web Scrape

πŸ“–
Wikipedia
Free API

πŸ”¬
PubMed
NCBI / Free

πŸ“„
ArXiv
Papers + PDF

πŸŽ“
SemanticScholar
Citations

πŸ“‘
RSSFeed
Any Feed

πŸ“°
News
BBC/Reuters/HN

▢️
YouTubeTranscript
No API key

🎬
YouTubeMetadata
yt-dlp

πŸ€–
Reddit
Public JSON

πŸ”₯
HackerNews
Firebase API

🌍
Translate
LibreTranslate

πŸ”Ž
LanguageDetect
Offline / 55+

πŸ“±
QRGenerate
Local / No net

πŸ“·
QRRead
Local Decode

…
+more
Finance, DevOps


πŸ“ Prompt Library (New in v0.2.7)

effGen ships a curated catalog of 31 reusable prompt templates across 7 domains, each with a golden evaluation test and CLI access. Browse the full gallery.

Domain Templates Variants
Research 5 zero-shot, CoT, structured, tool-augmented
Coding 5 zero-shot, CoT, structured, few-shot, tool-augmented
Data / SQL 5 zero-shot, CoT, structured, few-shot, tool-augmented
Legal 3 zero-shot, structured, tool-augmented
Medical 3 structured, tool-augmented
Creative 5 zero-shot, CoT, structured, few-shot
Business 5 zero-shot, CoT, structured, few-shot
effgen prompts list                          # browse all 31 templates
effgen prompts show research.paper_summary.v1  # inspect a template
effgen prompts eval                          # run golden eval (no model needed)
effgen prompts playground                    # interactive REPL
from effgen.prompts.library import registry

# Get and render a template
p = registry.get("coding.code_review.v1")
prompt = p.template(code="def add(a, b): return a + b", language="python")

# Search templates
cot_prompts = registry.search(variant="cot")
sql_prompts = registry.search(domain="data")

Legal and medical templates enforce a mandatory non-advice disclaimer in every rendered output, verified by unit tests.


πŸ“š Examples

πŸ–₯️ GUI Applications (Gradio)

# Visual agent & tool development
python examples/basic/chat_gui_mlx.py              # MLX Chat β€” streaming chat with Apple Silicon models (port 7860)
python examples/basic/agent_viz_mlx.py             # Agent Visualizer β€” step-by-step reasoning + code editor (port 7860)
python examples/basic/tool_builder_gui.py          # Tool Builder β€” visually create custom tools (port 7863)
python examples/basic/tool_tester_gui.py           # Tool Tester β€” browse, test, inspect all 58+ tools (port 7864)

🍎 Apple Silicon (MLX)

python examples/basic/basic_agent_mlx.py           # Basic MLX agent with calculator
python examples/basic/chat_gui_mlx.py --autoload   # Chat GUI with auto model loading
python examples/basic/agent_viz_mlx.py --autoload   # Agent visualizer with auto model loading

πŸ€– Core Agent Examples

python examples/basic/qa_agent.py                  # Q&A agent (no tools)
python examples/basic/calculator_agent.py          # Math with Calculator + PythonREPL
python examples/tools/advanced_multi_tool_agent.py # 5 tools + fallback chains
python examples/tools/file_operations_agent.py     # File read/write/search
python examples/tools/coding_agent.py              # Code execution + iteration
python examples/advanced/conversational_agent.py   # Multi-turn memory
python examples/advanced/advanced_streaming_agent.py # Token streaming with callbacks
python examples/advanced/data_processing_agent.py  # JSON & data pipelines
python examples/advanced/multi_agent_pipeline.py   # Multi-agent orchestration
python examples/advanced/error_recovery_agent.py   # Error handling patterns

⚑ Quick-Start Examples

python examples/basic/basic_agent.py               # Basic agent (Transformers)
python examples/basic/basic_agent_vllm.py          # Basic agent (vLLM - 5-10x faster)
python examples/plugins_presets/preset_agents.py   # Ready-to-use agent presets
python examples/web_retrieval/streaming_agent.py   # Simple streaming
python examples/web_retrieval/memory_agent.py      # Simple multi-turn memory
python examples/tools/multi_tool_agent.py          # Simple multi-tool
python examples/web_retrieval/weather_agent.py     # Weather via Open-Meteo (free)
python examples/plugins_presets/plugin_example.py  # Custom tool plugins
python examples/web_retrieval/web_agent.py         # Web search agent
python examples/web_retrieval/retrieval_agent.py   # RAG-based retrieval

πŸ“Š See examples/compatibility_matrix.md for model compatibility across all agents.

πŸ“– More Examples

Multi-Tool Agent

from effgen import Agent, load_model
from effgen.core.agent import AgentConfig
from effgen.tools.builtin import Calculator, WebSearch, PythonREPL

model = load_model("Qwen/Qwen2.5-3B-Instruct")

config = AgentConfig(
    name="research_agent",
    model=model,
    tools=[Calculator(), WebSearch(), PythonREPL()],
    system_prompt="You are a research assistant."
)

agent = Agent(config=config)
result = agent.run("Search for the population of Tokyo and calculate what percentage it is of Japan's total population")

Streaming

from effgen import Agent, load_model
from effgen.core.agent import AgentConfig
from effgen.tools.builtin import Calculator

model = load_model("Qwen/Qwen2.5-3B-Instruct", quantization="4bit")
agent = Agent(config=AgentConfig(
    name="stream_demo", model=model,
    tools=[Calculator()], enable_streaming=True
))

for token in agent.stream("What is 2 + 2?"):
    print(token, end="", flush=True)

Memory (Multi-Turn)

agent = Agent(config=AgentConfig(
    name="memory_demo", model=model,
    tools=[], enable_memory=True
))

agent.run("My name is Alice and I'm working on quantum computing.")
result = agent.run("What's my name and what am I working on?")
# β†’ "Your name is Alice and you're working on quantum computing."

Retrieval Agent (RAG)

from effgen.tools.builtin import Retrieval

retrieval_tool = Retrieval(knowledge_base_path="./docs")
config = AgentConfig(name="qa_agent", model=model, tools=[retrieval_tool])
agent = Agent(config=config)
result = agent.run("What does the documentation say about configuration?")

πŸ€– Multi-Model Support

effGen supports 9 cloud inference providers + 4 local backends, tested across 11+ model families:

Backend Platform Install Best For
MLX Apple Silicon (M1/M2/M3/M4) effgen[mlx] Native Metal GPU, unified memory, 4/8-bit quantization
MLX-VLM Apple Silicon effgen[mlx-vlm] Vision-Language models (Qwen2-VL, LLaVA, Phi-3 Vision, 30+ architectures)
vLLM NVIDIA GPU effgen[vllm] High-throughput batch inference
Transformers Any (CPU/GPU) (bundled) Universal compatibility, local models
OpenAI Cloud API (bundled) gpt-5/gpt-5.4/o-series, reasoning_effort, structured outputs, native tools
Anthropic Cloud API (bundled) Claude 4.7/4.x, extended thinking, prompt caching, native tools
Google Gemini Cloud API (bundled) Gemini 3.x/2.5/2.0, thinking_budget, grounding, Files API, native tools
Cerebras Cloud API effgen[cerebras] 4 free-tier models (llama3.1-8b, qwen-3-235b), ultra-low latency
Groq Cloud API effgen[groq] 16 models (llama-3.3-70b, mixtral, qwen3-32b), ultra-fast free-tier inference
Together AI Cloud API effgen[together] 163-model catalog (llama, deepseek, qwen, mistral), per-model pricing
Fireworks Cloud API effgen[fireworks] 80 chat models (54 tool-capable), serverless + dedicated
Replicate Cloud API effgen[replicate] 38 models, async run-poll, SSE streaming, compute-second billing
HuggingFace Cloud API effgen[hf] 124-model HF Router catalog, custom Inference Endpoints, free serverless tier

Provider Auth Check

# See which API keys are configured
effgen doctor

Quick Cloud Start

from effgen import load_model, Agent
from effgen.core.agent import AgentConfig
from effgen.tools.builtin import Calculator

# Any of the 9 cloud providers
model = load_model("llama-3.1-8b-instant", provider="groq")          # Groq
# model = load_model("meta-llama/Llama-3.3-70B-Instruct-Turbo", provider="together")
# model = load_model("Qwen/Qwen2.5-72B-Instruct", provider="hf")

agent = Agent(config=AgentConfig(name="agent", model=model, tools=[Calculator()]))
result = agent.run("What is (17 * 23) + sqrt(144)?")
print(result.output)  # β†’ 403

Top Recommended Models

Model Size Compatibility
LFM2.5-1.2B-Instruct-MLX-8bit 1.2B Apple Silicon optimized, fast agentic
Qwen2.5-1.5B-Instruct 1.5B 10/10 agents pass
Qwen2.5-3B-Instruct 3B 10/10 agents pass (recommended default)
Phi-4-mini-instruct 3.8B 10/10 agents pass
Qwen3-1.7B 1.7B 9.5/10
Qwen2.5-7B-Instruct 7B 9/10
Llama-3.2-3B-Instruct 3B 8.5/10

Full matrix with 11 models x 10 agents: compatibility_matrix.md


πŸ”’ Security

🐳
Docker Sandbox
Isolated execution

πŸ›‘οΈ
Input Validation
Auto sanitization

⚑
Rate Limiting
Configurable limits

πŸ“‹ For security policies and vulnerability reporting, see SECURITY.md


πŸš€ Deployment

effGen v0.2.10 ships production-ready deployment recipes for every major target:

🐳 Docker

Multi-stage build with a non-root user, read-only filesystem, and /health healthcheck. See docs/deploy/docker.md.

docker build -f deploy/docker/Dockerfile -t effgen:0.2.10 .
docker run -p 8000:8000 --env-file .env effgen:0.2.10
curl http://localhost:8000/health

⎈ Kubernetes / Helm

Full Helm chart with Deployment, Service, Ingress, NetworkPolicy, PDB, and HPA (scales on CPU + effgen_model_call_latency_seconds). See docs/deploy/kubernetes.md.

helm lint deploy/k8s/helm/effgen/
helm install effgen deploy/k8s/helm/effgen/ --set image.tag=0.2.10

Ξ» AWS Lambda

Mangum adapter wrapping the FastAPI app. Cold start < 3 s; warm call < 100 ms. SAM template included. See docs/deploy/lambda.md.

cd deploy/aws_lambda
sam build && sam deploy --guided

☁ Cloudflare Worker

Thin edge proxy handling CORS, Bearer JWT auth, and KV-backed rate limiting before forwarding to your backend. See docs/deploy/cloudflare.md.

cd deploy/cloudflare
wrangler deploy  # staging: wrangler deploy --env staging

πŸ”· Developer Experience

VSCode Extension

Prompt-template completion, inline "Run" code lens on LibraryPrompt definitions, and hover docs β€” all from the effGen registry. See docs/dx/vscode.md.

cd tools/vscode-effgen
npm ci && npm run compile
# Install: Extensions β†’ Β·Β·Β· β†’ Install from VSIX β†’ vscode-effgen-*.vsix

Jupyter Magics

%load_ext effgen.jupyter
%effgen_chat "What is 17 * 23?"
%%effgen_agent general
Summarise the top HackerNews stories today and rank them by interest.
%effgen_metrics

See docs/dx/jupyter.md.

Live Dashboard

The API server serves a real-time SPA at /dashboard (no auth required). Panels: span stream (SSE), Prometheus metrics, recent agent runs with token counts and cost, SLO burn rates. See docs/dx/dashboard.md.

EFFGEN_DEV_MODE=1 effgen serve --port 8000
open http://localhost:8000/dashboard

πŸ”’ Security

Secret Scanning

Gitleaks pre-commit hook + CI workflow (secret-scan.yml) catch secrets before they reach the repo. Install the hook once:

pip install pre-commit && pre-commit install

Sandboxed Code Execution

CodeExecutor defaults to SubprocessSandbox (rootless user-namespace, network blocked, isolated /tmp) or DockerSandbox when Docker is available. To opt out (not recommended):

EFFGEN_SANDBOX_BACKEND=off effgen run ...   # loud warning emitted

API Server Auth

Protect your API server with OAuth2/OIDC (any OIDC provider β€” Auth0, Keycloak, Cognito):

export EFFGEN_OIDC_ISSUER=https://your-tenant.auth0.com/
export EFFGEN_OIDC_CLIENT_ID=your-client-id
export EFFGEN_OIDC_JWKS_URI=https://your-tenant.auth0.com/.well-known/jwks.json
effgen serve --port 8000

See docs/server/auth.md, docs/server/rbac.md, and docs/server/audit.md.


πŸ“– Citation

If you use effGen in your research, please cite our paper:

@software{srivastava2026effgen,
      title={effGen: Enabling Small Language Models as Capable Autonomous Agents},
      author={Gaurav Srivastava and Aafiya Hussain and Chi Wang and Yingyan Celine Lin and Xuan Wang},
      year={2026},
      eprint={2602.00887},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2602.00887},
}

πŸ”— Links

Paper Website Docs PyPI Issues


πŸ“„ License

Apache License 2.0 β€” see LICENSE for details.


Get Started Examples Paper GitHub

Made with ❀️ for the AI community

effGen footer

About

[ICML 2026] effGen: Enabling Small Language Models as Capable Autonomous Agents

Topics

Resources

License

Contributing

Security policy

Stars

Watchers

Forks

Packages

 
 
 

Contributors