LLMSim

LLM Traffic Simulator - A lightweight, high-performance LLM API simulator for load testing, CI/CD, and local development.

Overview

LLMSim replicates realistic LLM API behavior without running actual models. It solves common challenges when testing LLM-integrated applications:

  • Cost: Real API calls during load tests are expensive
  • Rate Limits: Production APIs prevent realistic load testing
  • Reproducibility: Real models produce variable responses
  • Traffic Realism: LLM responses have unique characteristics (streaming, variable latency, token-based billing)

Features

  • Multi-Provider API Support - OpenAI Chat Completions, OpenResponses, and Anthropic Messages APIs
  • Realistic Latency Simulation - Time-to-first-token (TTFT) and time-between-tokens (TBT) delays sampled from a normal distribution
  • Streaming Support - Server-Sent Events (SSE) for OpenAI, OpenResponses, and Anthropic streaming formats
  • Image Generation - Simulated gpt-image ("ChatGPT Images") endpoint returning watermarked PNGs, with streaming partial images
  • Accurate Token Counting - Uses tiktoken-rs (OpenAI's tokenizer implementation)
  • Error Injection - Rate limits (429), server errors (500/503), timeouts
  • Multiple Response Generators - Lorem ipsum, echo, fixed, random, sequence
  • Model-Specific Profiles - GPT-5, GPT-4, Claude, Gemini latency profiles
  • Real-time Stats Dashboard - TUI dashboard with live metrics (requests, tokens, latency, errors)
  • Stats API - JSON endpoint for programmatic access to server metrics

Installation

cargo install llmsim

# Include the optional terminal dashboard
cargo install llmsim --features tui

Demo

[Image: Console UI demo]

Usage

CLI Server

# Start with defaults (port 8080, lorem generator)
llmsim serve

# Start with real-time stats dashboard (TUI)
# Requires installing/building with `--features tui`
llmsim serve --tui

# All options
llmsim serve \
  --port 8080 \
  --host 0.0.0.0 \
  --generator lorem \
  --target-tokens 150 \
  --tui

# Using config file
llmsim serve --config config.toml

Stats Dashboard

The --tui flag launches an interactive terminal dashboard showing real-time metrics:

  • Requests: Total, active, streaming vs non-streaming, requests/sec
  • Tokens: Prompt, completion, total, tokens/sec
  • Latency: Average, min, max response times
  • Errors: Total errors, rate limits (429), server errors (5xx), timeouts
  • Charts: RPS and token rate sparklines, model distribution

Controls: q to quit, r to force refresh.

As a Library

use llmsim::{
    openai::{ChatCompletionRequest, Message},
    generator::LoremGenerator,
    latency::LatencyProfile,
};

// Create a latency profile
let latency = LatencyProfile::gpt5();

// Count tokens
let tokens = llmsim::count_tokens("Hello, world!", "gpt-5").unwrap();

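// Generate responses
// NOTE: `request` is assumed to be a `ChatCompletionRequest` built elsewhere;
// its construction is not shown in this snippet.
let generator = LoremGenerator::new(100);
let response = generator.generate(&request);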

Cargo features

The crate is split into optional features so library consumers only pull in what they use. The defaults (["cli"]) give the full binary, so cargo build, cargo run -- serve, and cargo test work out of the box.

| Feature | Adds | Extra dependencies |
|---------|------|--------------------|
| tokens | tokens module (token counting) | tiktoken-rs |
| server | cli module (axum router, handlers, websockets); implies tokens | axum, tower-http |
| cli | the llmsim binary; implies server | clap, tracing-subscriber |
| tui | serve --tui dashboard; implies cli | ratatui, crossterm |

To embed only the core library modules (types, generators, latency, streaming, stats, scripts) and shed axum, tower-http, tiktoken-rs, clap, websockets, and tracing-subscriber:

[dependencies]
llmsim = { version = "0.4", default-features = false }

Note that count_tokens and the tokens module are only available with the tokens feature enabled.

API Endpoints

OpenAI API (/openai/v1/...)

| Endpoint | Method | Description |
|----------|--------|-------------|
| /openai/v1/chat/completions | POST | Chat completions (streaming & non-streaming) |
| /openai/v1/models | GET | List available models |
| /openai/v1/models/{model_id} | GET | Get specific model details |
| /openai/v1/responses | POST | Responses API (streaming & non-streaming) |
| /openai/v1/images/generations | POST | Image generation (gpt-image, streaming & non-streaming) |

When using OpenAI SDKs, set the base URL to http://localhost:8080/openai/v1.
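For clients that do not use an SDK, the sketch below shows one way to call the chat completions endpoint with only the Python standard library and to read a streamed reply. It assumes the simulator emits the usual OpenAI streaming shape (`data: {...}` lines ending with `data: [DONE]`), which this README implies but does not spell out. The helper names (`build_chat_request`, `iter_content_deltas`, `stream_chat`) are illustrative and not part of LLMSim.

```python
import json
import urllib.request

# Assumes `llmsim serve` is running on the default port.
BASE_URL = "http://localhost:8080/openai/v1"


def build_chat_request(prompt, model="gpt-5", stream=True, base_url=BASE_URL):
    """Build a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer not-needed",  # placeholder key
        },
        method="POST",
    )


def iter_content_deltas(lines):
    """Yield text fragments from OpenAI-style SSE lines (bytes or str)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not data
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


def stream_chat(prompt):
    """Send a streaming request and print fragments as they arrive."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=30) as resp:
        for piece in iter_content_deltas(resp):  # the response iterates line by line
            print(piece, end="", flush=True)
    print()
```

With a server running, `stream_chat("Hello")` should print the generated text incrementally. The parser takes any iterable of lines, so it can also be exercised against recorded output without a live server.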

OpenResponses API (/openresponses/v1/...)

OpenResponses is an open-source specification for building multi-provider, interoperable LLM interfaces.

| Endpoint | Method | Description |
|----------|--------|-------------|
| /openresponses/v1/responses | POST | Create response (streaming & non-streaming) |

Anthropic API (/anthropic/v1/...)

Simulates the Anthropic Messages API with realistic Claude model profiles.

| Endpoint | Method | Description |
|----------|--------|-------------|
| /anthropic/v1/messages | POST | Messages API (streaming & non-streaming) |
| /anthropic/v1/models | GET | List available Claude models |
| /anthropic/v1/models/{model_id} | GET | Get specific model details |

When using Anthropic SDKs, set the base URL to http://localhost:8080/anthropic:

from anthropic import Anthropic

client = Anthropic(base_url="http://localhost:8080/anthropic", api_key="not-needed")
msg = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=64,
    messages=[{"role": "user", "content": "Hello, Claude"}],
)
print(msg.content[0].text)

Runnable examples for Python, TypeScript, Go, curl, and LangChain live in examples/ (see examples/README.md).

LLMSim endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| /health | GET | Health check |
| /llmsim/stats | GET | Real-time server statistics (JSON) |
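A test harness may want to confirm the server is up and then read its metrics. The sketch below does both with the Python standard library. It assumes the default address `http://localhost:8080`. This README does not document the fields of the stats JSON, so the code returns the parsed object as-is rather than naming keys. The helper names are illustrative only.

```python
import json
import urllib.request

DEFAULT_BASE = "http://localhost:8080"  # assumes the default `llmsim serve` port


def endpoint(path, base_url=DEFAULT_BASE):
    """Join the server base URL and an endpoint path with exactly one slash."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def is_healthy(base_url=DEFAULT_BASE, timeout=5):
    """Return True if GET /health answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(endpoint("/health", base_url), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers connection errors and urllib's URLError/HTTPError
        return False


def fetch_stats(base_url=DEFAULT_BASE, timeout=5):
    """Return the parsed JSON body of GET /llmsim/stats."""
    with urllib.request.urlopen(endpoint("/llmsim/stats", base_url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

A CI job could poll `is_healthy()` until it returns `True` before starting its tests, then dump `fetch_stats()` at the end of the run.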

Configuration

TOML Config File

[server]
port = 8080
host = "0.0.0.0"

[latency]
profile = "gpt5"
# Custom values (optional):
# ttft_mean_ms = 600
# ttft_stddev_ms = 150
# tbt_mean_ms = 40
# tbt_stddev_ms = 12

[response]
generator = "lorem"
target_tokens = 100

[errors]
rate_limit_rate = 0.01
server_error_rate = 0.001
timeout_rate = 0.0
timeout_after_ms = 30000

[models]
available = [
  "gpt-5",
  "gpt-5-mini",
  "gpt-4o",
  "claude-opus",
]

Note: The config file format moved from YAML to TOML in this release. To migrate an existing config.yaml, replace section headers like server: with [server], change key: value to key = value, quote strings, and convert lists. See benchmarks/config/*.toml for working examples.
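With the `[errors]` rates shown above, roughly 1% of requests would receive a 429 and about 0.1% a server error, so clients under test need a retry path. The following is a minimal client-side sketch of retrying with exponential backoff and jitter. It is not part of LLMSim, and `send` stands in for whatever function issues the request and returns an HTTP status code.

```python
import random
import time

# Status codes that the error injection feature is documented to produce.
RETRYABLE = {429, 500, 503}


def call_with_retries(send, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out.

    `send` is any zero-argument callable returning an HTTP status code.
    Returns a tuple (final_status, attempts_used).
    """
    status = None
    for attempt in range(1, max_attempts + 1):
        status = send()
        if status not in RETRYABLE:
            return status, attempt
        if attempt < max_attempts:
            # Full jitter: wait a random time in [0, base_delay * 2^(attempt - 1)].
            sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
    return status, max_attempts
```

Passing `sleep` as a parameter keeps the helper testable: a test can supply a no-op to avoid real waiting. Raising `rate_limit_rate` in the config is one way to make the retry path fire often enough to observe.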

Supported Models

| Family | Models |
|--------|--------|
| GPT-5 | gpt-5, gpt-5-pro, gpt-5-mini, gpt-5-nano, gpt-5-codex, gpt-5.1, gpt-5.2, gpt-5.3-codex, gpt-5.4, gpt-5.5 |
| O-Series | o1, o1-mini, o3, o3-mini, o4-mini |
| GPT-4 | gpt-4, gpt-4-turbo, gpt-4o, gpt-4o-mini, gpt-4.1 |
| Claude | claude-opus, claude-sonnet, claude-haiku (with 4.x versions through Opus 4.8 and Sonnet 4.6) |
| Gemini | gemini-2.0-flash, gemini-2.5-pro, gemini-3 and gemini-3.1 previews |

The Anthropic endpoints (/anthropic/v1/...) use the real Anthropic API model IDs (dash-separated, e.g. claude-opus-4-8, claude-sonnet-4-6, claude-haiku-4-5, claude-fable-5), including dated-snapshot and -latest aliases. List them via GET /anthropic/v1/models.

Latency Profiles

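| Profile | TTFT Mean | TBT Mean |
|---------|-----------|----------|
| gpt-5 | 600ms | 40ms |
| gpt-5-mini | 300ms | 20ms |
| gpt-4 | 800ms | 50ms |
| gpt-4o | 400ms | 25ms |
| o-series | 2000ms | 30ms |
| claude-opus | 1000ms | 60ms |
| claude-sonnet | 500ms | 30ms |
| claude-haiku | 200ms | 15ms |
| instant | 0ms | 0ms |
| fast | 10ms | 1ms |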
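To size test timeouts, it can help to estimate how long a streamed response takes under a given profile. The sketch below is a back-of-the-envelope model and not LLMSim's implementation. It assumes the first token arrives after TTFT and each later token adds one TBT delay, with both drawn from a normal distribution and clamped at zero. The means come from the table above. The standard deviations are caller-supplied, since the table does not list them.

```python
import random

# (TTFT mean, TBT mean) in milliseconds, copied from the latency profile table.
PROFILES = {
    "gpt-5": (600, 40),
    "gpt-4o": (400, 25),
    "claude-haiku": (200, 15),
    "instant": (0, 0),
}


def expected_duration_ms(profile, tokens):
    """Mean stream duration: TTFT plus one TBT per token after the first."""
    ttft_mean, tbt_mean = PROFILES[profile]
    return ttft_mean + tbt_mean * max(tokens - 1, 0)


def sample_duration_ms(profile, tokens, ttft_stddev, tbt_stddev, rng=random):
    """Draw one simulated stream duration, clamping negative delays to zero."""
    ttft_mean, tbt_mean = PROFILES[profile]
    total = max(0.0, rng.gauss(ttft_mean, ttft_stddev))
    for _ in range(max(tokens - 1, 0)):
        total += max(0.0, rng.gauss(tbt_mean, tbt_stddev))
    return total
```

Under these assumptions, a 100-token reply on the gpt-5 profile averages 600 + 99 × 40 = 4560 ms, so a client timeout of a few seconds would cut many responses short.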

Use Cases

  • Load Testing - Simulate thousands of concurrent LLM requests
  • CI/CD Pipelines - Fast, deterministic tests for LLM integrations
  • Local Development - Develop without API keys or costs
  • Chaos Engineering - Test behavior under failure scenarios
  • Cost Estimation - Estimate token usage before production
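As a rough illustration of the load-testing use case, the sketch below fires requests from a thread pool and tallies the status codes that come back. It is a minimal standard-library example and not a replacement for a dedicated load-testing tool. `run_load` and `http_status` are hypothetical helper names, and `send` is any callable that takes a request index and returns a status code.

```python
import urllib.error
import urllib.request
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def http_status(request, timeout=60):
    """Send a prepared urllib request and return its HTTP status code."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # urllib raises on 4xx/5xx; the code is still the result


def run_load(send, total_requests, workers=32):
    """Call send(i) for each i in range(total_requests) and tally the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return Counter(pool.map(send, range(total_requests)))
```

For example, `run_load(lambda i: http_status(req), 1000)` with a prepared `urllib.request.Request` named `req` would return a `Counter` mapping each status code to its count, which can be compared against the configured error rates.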

Requirements

  • Rust 1.83+ (for building from source)
  • OR Docker

License

MIT License - see LICENSE for details.

Contributing

See CONTRIBUTING.md for contribution guidelines.
