PromptProof

Deterministic LLM testing for production reliability.
Record→Replay + policy‑as‑code to catch PII leaks, schema drift, and behavioral regressions before merge.

Try it in 60s 🚀

# Clone and try the example (expect a failure)
git clone https://github.com/geminimir/promptproof.git
cd promptproof
corepack enable && corepack prepare pnpm@9 --activate
pnpm i
pnpm run try:example

This runs PromptProof against a deliberately failing JSON output. No API calls, no setup - just pure validation.

Then fix it:

pnpm run fix:example  # Now it passes!

Quickstart

Run locally

npx promptproof-cli@latest eval -c promptproof.yaml --out report

GitHub Action

# .github/workflows/promptproof.yml
name: PromptProof
on: [pull_request]
permissions: { contents: read, pull-requests: write, security-events: write }
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: geminimir/promptproof-action@v0
        with:
          config: promptproof.yaml
          format: sarif  # optional: upload to Code Scanning

One‑line recording (Node)

import OpenAI from 'openai'
import { withPromptProofOpenAI } from 'promptproof-sdk-node/openai'
const ai = withPromptProofOpenAI(new OpenAI({ apiKey: process.env.OPENAI_API_KEY }), { suite: 'support-replies' })

This writes sanitized JSONL lines to fixtures/<suite>/outputs*.jsonl for deterministic CI replay. No network calls during CI.

Guarantees

Deterministic CI: replay fixtures offline; zero network calls in CI.
Safety by default: PII redaction on (emails/phones). The SDK never blocks your app if recording fails.
Provider‑agnostic: we evaluate outputs, not vendors.

Why not just JSON schema at runtime?

We enforce pre‑merge gates (block risky PRs), not best‑effort runtime checks.
Replay of real outputs removes flakiness (no live model calls in CI).
Budgets catch cost/latency creep alongside quality rules.
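
To make the budget idea concrete, here is a minimal sketch of how a latency and cost gate could be computed over recorded fixtures. It assumes records carry metrics.latency_ms and metrics.cost_usd and that budgets use the key names from promptproof.yaml; the helper names and the nearest-rank percentile method are illustrative assumptions, not PromptProof's actual implementation.

```javascript
// Hypothetical helper: nearest-rank percentile over an array of numbers.
function percentile(values, p) {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Hypothetical budget gate: returns a list of human-readable violations.
function checkBudgets(records, budgets) {
  const violations = [];
  const latencies = records.map((r) => r.metrics.latency_ms);
  const totalCost = records.reduce((sum, r) => sum + r.metrics.cost_usd, 0);
  const p95 = percentile(latencies, 95);

  if (budgets.latency_ms_p95_max !== undefined && p95 > budgets.latency_ms_p95_max) {
    violations.push(`P95 latency: ${p95}ms exceeds limit of ${budgets.latency_ms_p95_max}ms`);
  }
  if (budgets.cost_usd_total_max !== undefined && totalCost > budgets.cost_usd_total_max) {
    violations.push(
      `Total cost $${totalCost.toFixed(2)} exceeds budget $${budgets.cost_usd_total_max.toFixed(2)}`,
    );
  }
  return violations;
}

const sample = [
  { metrics: { latency_ms: 812, cost_usd: 0.0012 } },
  { metrics: { latency_ms: 2341, cost_usd: 0.003 } },
];
const budgetViolations = checkBudgets(sample, { latency_ms_p95_max: 2000, cost_usd_total_max: 10 });
console.log(budgetViolations); // one violation: the P95 latency gate
```

Because the inputs are recorded metrics rather than live calls, the same fixtures always produce the same verdict.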

Examples

Demo app: examples/node-support-bot/
Fixtures: fixtures/ (support replies, RAG, tool calls)
Failure Zoo: zoo/ — real cases with copy‑pasteable rules

🎯 Key Features

Deterministic Replay: Test against recorded LLM outputs with zero network calls in CI
Comprehensive Assertions: JSON schemas, regex patterns, numeric bounds, string operations, list/set equality, file diffs, and custom checks
Regression Testing: Snapshot baselines and automatic comparison to catch new failures and performance degradation
Cost Controls: Budget gates for total cost, per-test cost, and latency with regression tracking
Flake Management: Seed control and multiple runs with stability scoring for non-deterministic checks
CI/CD Integration: GitHub Action that fails PRs on violations with detailed reporting
Provider Agnostic: Works with OpenAI, Anthropic, and any HTTP-based LLM API
Privacy First: Built-in PII redaction and offline evaluation

📋 Requirements

Node.js >= 18.0.0
npm >= 8.0.0

🚀 Quick Start

Install Packages

# Install CLI for evaluation
npm install -g promptproof-cli@beta

# Install SDK for recording (in your project)
npm install promptproof-sdk-node@beta

Initialize in Your Project

# Initialize project structure
promptproof init --suite support-replies

This creates:

promptproof.yaml - Policy configuration
fixtures/ - Directory for recorded outputs
.github/workflows/promptproof.yml - GitHub Action workflow

📝 Record → Replay Workflow

Step 1: Record LLM Outputs (One Line Change!)

OpenAI Integration

import OpenAI from 'openai'
import { withPromptProofOpenAI } from 'promptproof-sdk-node/openai'

const base = new OpenAI({ apiKey: process.env.OPENAI_API_KEY })
export const ai = withPromptProofOpenAI(base, { suite: 'support-replies' })

// Use normally - fixtures are recorded automatically
const response = await ai.chat.completions.create({
  model: 'gpt-4',
  messages: [{ role: 'user', content: 'Hello!' }]
})

Anthropic Integration

import Anthropic from '@anthropic-ai/sdk'
import { withPromptProofAnthropic } from 'promptproof-sdk-node/anthropic'

const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY })
export const claude = withPromptProofAnthropic(anthropic, { suite: 'rag-answers' })

// Use normally - fixtures are recorded automatically
const response = await claude.messages.create({
  model: 'claude-3-sonnet-20240229',
  max_tokens: 1000,
  messages: [{ role: 'user', content: 'Hello!' }]
})

Generic HTTP Integration

import { wrapFetch } from 'promptproof-sdk-node/http'

// Wrap global fetch to record any LLM API calls
globalThis.fetch = wrapFetch(globalThis.fetch, { suite: 'generic-llm' })

// All fetch calls to LLM APIs are automatically recorded
const apiKey = process.env.OPENAI_API_KEY
const response = await fetch('https://api.openai.com/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${apiKey}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({ model: 'gpt-4', messages: [{ role: 'user', content: 'Hello!' }] })
})
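
For intuition about what a recording wrapper does, the sketch below shows the general pattern: call the real fetch, copy the response body in the background, and never let a recording error reach the caller. This is not the SDK's wrapFetch; recordingFetch, the record callback, and the stubbed fetch are hypothetical names used only so the example runs offline.

```javascript
// Hypothetical sketch of a non-blocking recording wrapper (not the SDK's code).
function recordingFetch(baseFetch, record) {
  return async function wrapped(url, init) {
    const started = Date.now();
    const response = await baseFetch(url, init);
    try {
      // Clone so the caller can still consume the original body.
      response
        .clone()
        .text()
        .then((text) =>
          // Only the URL, timing, and body are kept; request headers are not recorded.
          record({ url: String(url), latency_ms: Date.now() - started, output: { text } }),
        )
        .catch((err) => console.warn('recording failed:', err.message));
    } catch (err) {
      // Recording problems are logged and swallowed, never rethrown.
      console.warn('recording failed:', err.message);
    }
    return response;
  };
}

// Stubbed fetch so the example makes no network calls.
const recorded = [];
const stubFetch = async () => ({
  status: 200,
  clone() {
    return { text: async () => '{"reply":"Hi"}' };
  },
});

const wrappedFetch = recordingFetch(stubFetch, (entry) => recorded.push(entry));
wrappedFetch('https://example.invalid/v1/chat').then((res) => {
  console.log(res.status); // the caller sees the untouched response
});
```

Recording happens off the caller's path, so a slow or failing recorder does not delay or break the application's own request handling.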

Step 2: Fixtures Are Created Automatically

Each LLM call creates a sanitized record in fixtures/<suite>/outputs.jsonl:

{
  "schema_version": "pp.v1",
  "id": "auto-generated",
  "timestamp": "2024-08-10T12:34:56Z",
  "source": "dev",
  "input": {
    "prompt": "user: Hello!\nassistant: Hi there!",
    "params": { "model": "gpt-4", "temperature": 0.7 }
  },
  "output": { "text": "Hello! How can I help you today?" },
  "metrics": { "latency_ms": 812, "cost_usd": 0.0012 },
  "redaction": { "status": "sanitized" }
}
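
The record above is pretty-printed for readability; in a JSONL file each record occupies a single line. As a rough sketch of how such a file could be loaded for replay, the following parser splits the text into lines and checks a few fields. The function name and the set of required keys are assumptions drawn from the example record, not the CLI's actual validation rules.

```javascript
// Hypothetical loader: parses JSONL text into records, skipping blank lines.
function parseFixtures(jsonlText) {
  return jsonlText
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line, index) => {
      const record = JSON.parse(line);
      // Assumed required keys, taken from the example record above.
      for (const key of ['schema_version', 'id', 'input', 'output']) {
        if (!(key in record)) {
          throw new Error(`fixture record ${index + 1}: missing "${key}"`);
        }
      }
      return record;
    });
}

const jsonl = [
  JSON.stringify({
    schema_version: 'pp.v1',
    id: 'a1',
    input: { prompt: 'user: Hello!' },
    output: { text: 'Hello! How can I help you today?' },
  }),
  '',
].join('\n');

const fixtures = parseFixtures(jsonl);
console.log(fixtures.length, fixtures[0].output.text);
```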

Step 3: Define Your Contracts

# promptproof.yaml
schema_version: pp.v1
fixtures: fixtures/support-replies/outputs.jsonl
checks:
  - id: no_pii
    type: regex_forbidden
    target: output.text
    patterns:
      - "[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,}"  # No emails
      - "\\b\\+?\\d[\\d\\s().-]{7,}\\b"           # No phone numbers
  
  - id: response_schema
    type: json_schema
    target: output.json
    schema:
      type: object
      required: [status, message]
      properties:
        status: { type: string, enum: [success, error] }
        message: { type: string }
  
  - id: contains_disclaimer
    type: string_contains
    target: output.text
    expected: "We cannot guarantee"
    ignore_case: true
  
  - id: response_list_exact
    type: list_equality
    target: output.json.items
    expected: ["step1", "step2", "step3"]
    order_sensitive: true

budgets:
  cost_usd_per_run_max: 0.50
  cost_usd_total_max: 10.00  # Total cost gate
  latency_ms_p95_max: 2000
  cost_usd_total_pct_increase_max: 10  # Max 10% cost increase vs baseline

mode: warn  # Start with 'warn', switch to 'fail' after validation
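
To show what two of these check types mean in practice, here is a small sketch of regex_forbidden and string_contains evaluated against one record. The runner functions are hypothetical, and so are the matching semantics: in particular, the email pattern above lists only uppercase character classes, so this sketch assumes patterns are compiled case-insensitively.

```javascript
// Hypothetical helper: resolve a dotted target such as "output.text".
function getPath(record, path) {
  return path.split('.').reduce((value, key) => (value == null ? undefined : value[key]), record);
}

// Hypothetical regex_forbidden runner: one violation per matching pattern.
function regexForbidden(record, check) {
  const text = String(getPath(record, check.target) ?? '');
  // Assumption: patterns are matched case-insensitively.
  return check.patterns
    .filter((pattern) => new RegExp(pattern, 'i').test(text))
    .map((pattern) => `[${check.id}] forbidden pattern matched: ${pattern}`);
}

// Hypothetical string_contains runner honoring ignore_case.
function stringContains(record, check) {
  let text = String(getPath(record, check.target) ?? '');
  let expected = check.expected;
  if (check.ignore_case) {
    text = text.toLowerCase();
    expected = expected.toLowerCase();
  }
  return text.includes(expected) ? [] : [`[${check.id}] expected string "${check.expected}" not found`];
}

const reply = { output: { text: 'Contact me at jane@example.com. We cannot guarantee delivery.' } };
const noPii = {
  id: 'no_pii',
  target: 'output.text',
  patterns: ['[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,}'],
};
const disclaimer = {
  id: 'contains_disclaimer',
  target: 'output.text',
  expected: 'We cannot guarantee',
  ignore_case: true,
};

console.log(regexForbidden(reply, noPii)); // one violation: an email address is present
console.log(stringContains(reply, disclaimer)); // no violations: the disclaimer is present
```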

Step 4: Evaluate Against Contracts

# Local evaluation
promptproof eval -c promptproof.yaml

# With regression comparison against baseline
promptproof eval -c promptproof.yaml --regress

# With flake controls for non-deterministic checks
promptproof eval -c promptproof.yaml --seed 42 --runs 3

# Create a snapshot after successful run
promptproof snapshot promptproof.yaml --promote

# In CI (automatic via GitHub Action)
# Runs on every PR and blocks merge on violations
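
The --seed and --runs flags suggest repeating a non-deterministic check with a fixed seed. One plausible way to turn that into a stability score is sketched below: run the check several times with a seeded generator and report the fraction of passing runs. The generator (a mulberry32-style PRNG), the function names, and the scoring formula are all assumptions for illustration; the CLI may compute stability differently.

```javascript
// Small seeded generator (mulberry32-style) so repeated runs are reproducible.
function seededRandom(seed) {
  let state = seed >>> 0;
  return function next() {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical stability score: the fraction of runs in which the check passes.
function stabilityScore(checkFn, runs, seed) {
  const random = seededRandom(seed);
  let passes = 0;
  for (let i = 0; i < runs; i += 1) {
    if (checkFn(random)) passes += 1;
  }
  return passes / runs;
}

// A deliberately flaky stand-in check that passes most of the time.
const flakyCheck = (random) => random() < 0.8;

const first = stabilityScore(flakyCheck, 3, 42);
const second = stabilityScore(flakyCheck, 3, 42);
console.log(first === second); // true: the same seed replays the same outcomes
```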

⚙️ Environment Variables

Control recording behavior with environment variables:

Variable	Default	Description
`PP_RECORD`	`1` (dev), `0` (prod)	Master on/off switch
`PP_SAMPLE_RATE`	`1.0`	Fraction of calls to record, from `0.0` to `1.0`
`PP_SUITE`	from options	Override suite name
`PP_OUT`	`fixtures`	Custom output directory
`PP_SOURCE`	`NODE_ENV`	Environment label
`PP_SHARD_BY_PID`	`0`	Write to `outputs.<pid>.jsonl`
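
As a rough picture of how these variables could combine, the sketch below resolves them into a single recording config. It is an assumption-laden illustration: the function names are hypothetical, "prod" is taken to mean NODE_ENV=production, and PP_SAMPLE_RATE is treated as a fraction between 0 and 1. The SDK's real precedence rules may differ.

```javascript
// Hypothetical resolver for the variables in the table above.
function resolveRecordingConfig(env, options, pid) {
  const isProd = env.NODE_ENV === 'production';
  // PP_RECORD wins when set; otherwise record everywhere except production.
  const record = env.PP_RECORD !== undefined ? env.PP_RECORD === '1' : !isProd;
  const rawRate = env.PP_SAMPLE_RATE !== undefined ? Number(env.PP_SAMPLE_RATE) : 1.0;
  const sampleRate = Number.isFinite(rawRate) ? Math.min(Math.max(rawRate, 0), 1) : 1.0;
  const suite = env.PP_SUITE || options.suite;
  const outDir = env.PP_OUT || 'fixtures';
  const file = env.PP_SHARD_BY_PID === '1' ? `outputs.${pid}.jsonl` : 'outputs.jsonl';
  return {
    record,
    sampleRate,
    source: env.PP_SOURCE || env.NODE_ENV || 'dev',
    path: `${outDir}/${suite}/${file}`,
  };
}

// Decide per call whether to record, given the resolved config.
function shouldRecord(config, random = Math.random) {
  return config.record && random() < config.sampleRate;
}

const config = resolveRecordingConfig(
  { NODE_ENV: 'production', PP_RECORD: '1', PP_SAMPLE_RATE: '0.25', PP_SHARD_BY_PID: '1' },
  { suite: 'support-replies' },
  4242,
);
console.log(config.path); // fixtures/support-replies/outputs.4242.jsonl
console.log(shouldRecord(config, () => 0.1), shouldRecord(config, () => 0.9)); // true false
```

Sharding by process ID gives each worker its own file, which avoids interleaved writes when several processes record at once.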

🔒 Safety & Privacy

✅ Redaction ON by default - emails, phones, SSNs masked
✅ Never blocks your app - recording failures are logged, not thrown
✅ No secrets recorded - API keys and auth headers excluded
✅ Deterministic - same input = same output (ignoring timestamp/id)
✅ Production ready - sampling controls, PID sharding for concurrency
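
For a sense of what masking emails, phones, and SSNs can look like, here is a minimal redaction sketch. The patterns and the [REDACTED_*] placeholder tokens are assumptions chosen for the example; they are simplified and are not the SDK's actual redaction rules.

```javascript
// Hypothetical redaction rules; order matters because the phone pattern
// would otherwise also match an SSN-shaped number.
const REDACTION_RULES = [
  { label: 'EMAIL', pattern: /[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}/gi },
  { label: 'SSN', pattern: /\b\d{3}-\d{2}-\d{4}\b/g },
  { label: 'PHONE', pattern: /\+?\d[\d\s().-]{7,}\d/g },
];

// Apply each rule in turn, replacing matches with a labeled placeholder.
function redact(text) {
  return REDACTION_RULES.reduce(
    (result, rule) => result.replace(rule.pattern, `[REDACTED_${rule.label}]`),
    text,
  );
}

const redacted = redact('Reach jane@example.com or +1 (555) 010-9999, SSN 123-45-6789.');
console.log(redacted);
// Reach [REDACTED_EMAIL] or [REDACTED_PHONE], SSN [REDACTED_SSN].
```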

📊 Example Output

✓ Evaluated 142 fixtures
✗ 3 violations found:

  [no_pii] Record #47: Found forbidden pattern (email) at output.text
  [response_schema] Record #89: Missing required field 'status'
  [latency_budget] P95 latency: 2341ms exceeds limit of 2000ms

📊 Regression Comparison
Baseline: 2024-01-15-stable
⚠ 2 new failures:
  • [string_contains] test-102: Expected string "disclaimer" not found
  • [cost_budget] Total cost $12.50 exceeds budget $10.00
✓ 1 fixed failures
↔ 0 unchanged failures

Cost & Performance:
Cost: ↑ $2.50 (25.0%)
P95 Latency: ↑ 341ms

Exit code: 1
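
The regression section of this output can be understood as a set comparison between a baseline run and the current run. The sketch below reproduces the counts and cost delta shown above under stated assumptions: failures are keyed as "<check id>:<record id>", and the compareRuns name and result shape are hypothetical rather than the CLI's actual data model.

```javascript
// Hypothetical comparison of a baseline snapshot with the current evaluation.
function compareRuns(baseline, current) {
  const before = new Set(baseline.failures);
  const after = new Set(current.failures);
  const newFailures = [...after].filter((key) => !before.has(key));
  const fixed = [...before].filter((key) => !after.has(key));
  const unchanged = [...after].filter((key) => before.has(key));
  const costDelta = current.totalCostUsd - baseline.totalCostUsd;
  const costPct = baseline.totalCostUsd > 0 ? (costDelta / baseline.totalCostUsd) * 100 : 0;
  return { newFailures, fixed, unchanged, costDelta, costPct };
}

const baselineRun = { failures: ['response_schema:test-089'], totalCostUsd: 10.0 };
const currentRun = {
  failures: ['string_contains:test-102', 'cost_budget:total'],
  totalCostUsd: 12.5,
};

const diff = compareRuns(baselineRun, currentRun);
console.log(diff.newFailures.length, diff.fixed.length, diff.unchanged.length); // 2 1 0
console.log(`Cost: +$${diff.costDelta.toFixed(2)} (${diff.costPct.toFixed(1)}%)`); // Cost: +$2.50 (25.0%)
```

A gate such as cost_usd_total_pct_increase_max: 10 would then compare costPct against its limit and fail this run, since 25% exceeds 10%.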

🏗️ Architecture

App/Service → SDK Wrapper → fixtures/*.jsonl
                    ↓
Developer → PR → GitHub Action → CLI eval → Report → Pass/Fail Gate

🔧 CLI Commands

promptproof eval        # Run contract checks on fixtures
  --regress             # Compare against baseline snapshot
  --seed <n>            # Set seed for non-deterministic checks
  --runs <k>            # Run non-deterministic checks k times

promptproof snapshot    # Create evaluation snapshot
  --promote             # Promote to baseline
  --tag <name>          # Custom snapshot tag

promptproof init        # Initialize project with templates
promptproof promote     # Convert logs to fixture format
promptproof redact      # Remove PII from fixtures
promptproof validate    # Validate fixture schema

Note: Use npx promptproof-cli@beta or install globally with npm install -g promptproof-cli@beta

SDK: Install promptproof-sdk-node@beta in your project for automatic fixture recording

📦 Packages

Core Packages

promptproof-cli@beta: Command-line interface for evaluation
promptproof-sdk-node@beta: SDK wrappers for OpenAI, Anthropic, HTTP

Architecture

CLI: Evaluates pre-recorded fixtures against contracts
SDK: Automatically records LLM interactions to fixtures
Workflow: SDK records → CLI evaluates → CI gates

Coming Soon

@promptproof/action: GitHub Action for CI integration
@promptproof/evaluator: Core evaluation engine (bundled in CLI)

🎪 Failure Zoo

Browse real-world LLM failure cases in the Failure Zoo (zoo/): anonymized production incidents with patterns and mitigations.

🎭 Demo Project

See the demo project (examples/node-support-bot/) for a complete working example:

Realistic LLM application with support & RAG endpoints
SDK integration with automatic fixture recording
CLI validation with intentional failure modes
CI/CD integration via GitHub Actions
Red → Green demonstrations showing PromptProof in action

Support & Community

Issues: https://github.com/geminimir/promptproof/issues
Discussions: GitHub Discussions

🤝 Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

📄 License

MIT License - see LICENSE for details.
