Skip to content

itsHabib/sense

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

51 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Sense

Make sense of non-deterministic output. Extract structured data from text and evaluate output quality using Claude.

// Judge: output → pass/fail with evidence
sense.Assert(t, output).
    Expect("covers all sections from the brief").
    Expect("includes actionable recommendations").
    Run()

// Extract: unstructured text → typed struct
s := sense.New()
var m MountError
s.Extract("device /dev/sdf already mounted with vol-0abc123", &m).Run()
fmt.Println(m.Device)   // "/dev/sdf"
fmt.Println(m.VolumeID) // "vol-0abc123"

Sense uses the Anthropic API (Claude) with forced tool_use for structured responses — no prompt engineering, no JSON parsing on your end. Requires an Anthropic API key.

Two surfaces, one package:

  • Extract — parse unstructured text into typed Go structs. Logs, error messages, support tickets, API responses — define a struct, get structured data back.
  • Judge — evaluate non-deterministic output against natural-language expectations. Assert in tests (Assert/Require), eval programmatically (Eval), or A/B compare two outputs (Compare).

Why it exists

Go programs that touch LLM output have two recurring problems: turning messy text into something typed, and asserting that non-deterministic output is good without a brittle string match. Sense does exactly those two things behind one seam — a caller that forces Claude to call a single tool whose schema is the output contract. Everything else is a thin builder in front of that seam.

Scope is deliberately narrow:

  • It judges and extracts. It is not an agent framework. No tool-calling loops, no chains, no orchestration — Sense makes a single forced-tool call and unmarshals the result.
  • It speaks Claude only, today. The caller interface is abstracted so a second provider is ~100 lines, but no OpenAI/other caller is shipped. An OpenAI caller and WithCaller injection live in docs/NEXT.md, not in the code.
  • Eval-framework features are roadmap, not core. Deterministic Check()s, snapshots/regression detection, dataset runners, multi-judge consensus, JUnit/GitHub-Actions reporters, file cache, and a cost budget are designed in docs/NEXT.md and not built. What's listed under "Surfaces" below is what exists.

Install

go get github.com/itsHabib/sense
export ANTHROPIC_API_KEY=...

Requires Go 1.25+.

Surfaces

Function What it does Returns
Extract[T](text) / s.Extract(text, &dst) One typed struct from text (generic, or json.Unmarshal-style into a pointer) *ExtractResult[T] / (*ExtractResult, error)
ExtractSlice[T](text) A []T from one text — invoices, log batches, entity lists *ExtractSliceResult[T]
s.ExtractParallel(ctx, jobs) Run N extractions concurrently, results written into each job's Dest *ExtractParallelResult
Assert(t, output) Test assertion, t.Error on failure (test continues)
Require(t, output) Test assertion, t.Fatal on failure (test stops)
Eval(output) Programmatic evaluation you inspect yourself *EvalResult
Compare(a, b) A/B comparison of two outputs against the same criteria *CompareResult

Each is a chainable builder ending in a terminal (Run() / Judge()). Package-level forms (sense.Eval, sense.Extract[T], …) use a lazy default session; the s.* forms run on a session you configured with New/ForTest.

Extract — structure from chaos

Define a struct. Get structured data back. Works with any text.

type MountError struct {
    Device   string `json:"device" sense:"The device path"`
    VolumeID string `json:"volume_id" sense:"The EBS volume ID"`
    Message  string `json:"message"`
}

s := sense.New()

var m MountError
_, err := s.Extract("device /dev/sdf already mounted with vol-0abc123", &m).
    Context("AWS EC2 EBS error messages").
    Run()

fmt.Println(m.Device)   // "/dev/sdf"
fmt.Println(m.VolumeID) // "vol-0abc123"

Pass a pointer to a struct — data is written directly into it, like json.Unmarshal. Schema is generated from your struct via reflection — json tags for field names, sense tags for descriptions. Pointer fields are optional; value fields are required.

Works with nested structs, slices, and all Go primitive types.

A generic function is also available for callers who prefer compile-time type safety:

result, err := sense.Extract[MountError]("device /dev/sdf already mounted with vol-0abc123").Run()
fmt.Println(result.Data.Device)   // "/dev/sdf"

ExtractResult[T] carries Data, Duration, TokensUsed, Model, Usage, and Fallback.

ExtractSlice — multiple items from one text

Extract a list of typed structs from a single input. Same API as Extract, returns []T:

type LineItem struct {
    Description string  `json:"description" sense:"Item description"`
    Quantity    int     `json:"quantity" sense:"Number of units"`
    UnitPrice   float64 `json:"unit_price" sense:"Price per unit in dollars"`
}

result, err := sense.ExtractSlice[LineItem](invoiceText).
    Context("Invoice line items").
    Run()

for _, item := range result.Data {
    fmt.Printf("%s x%d @ $%.2f\n", item.Description, item.Quantity, item.UnitPrice)
}

Works with log batches, entity lists, table rows — anything where one text contains multiple structured items.

Per-item validation is built in:

result, err := sense.ExtractSlice[LineItem](text).
    Validate(func(item LineItem) error {
        if item.Quantity <= 0 {
            return fmt.Errorf("invalid quantity: %d", item.Quantity)
        }
        return nil
    }).
    Run()

ExtractParallel — many extractions at once

Run a batch of independent extractions concurrently. Each job writes into its own destination pointer; the result reports per-job errors and total wall-clock time:

var mount MountError
var ticket TicketInfo

res := s.ExtractParallel(ctx, []sense.ExtractJob{
    {Text: logLine, Dest: &mount, Context: "AWS EBS error"},
    {Text: emailBody, Dest: &ticket, Context: "support email"},
})

if res.Failed() {
    for i, err := range res.Errors {
        if err != nil {
            log.Printf("job %d failed: %v", i, err)
        }
    }
}

Validation

All extract paths support two kinds of validation:

Closure — pass a function via .Validate(fn):

result, err := sense.Extract[Order](text).
    Validate(func(o Order) error {
        if o.Total < 0 {
            return fmt.Errorf("invalid total: %f", o.Total)
        }
        return nil
    }).
    Run()

Interface — implement Validate() error on your struct:

type Order struct {
    Total float64 `json:"total"`
    Items []Item  `json:"items"`
}

func (o *Order) Validate() error {
    if o.Total < 0 {
        return fmt.Errorf("invalid total: %f", o.Total)
    }
    return nil
}

// Validate() is called automatically after unmarshalling.
result, err := sense.Extract[Order](text).Run()

Both work with Extract[T], Extract(text, &dest), and ExtractSlice[T]. When both are set, the closure runs first.

Fallback

All extract builders support a fallback function for when the API call fails:

result, err := sense.Extract[MountError](logLine).
    Fallback(func() (*MountError, error) {
        return regexParseMountError(logLine)
    }).
    Run()

if result.Fallback {
    log.Warn("used fallback parser")
}

The Fallback field on the result is true when the fallback path fired.

Use cases

Extract isn't just for tests. Use it anywhere you need structure from messy text:

// Parse log lines into typed events
var event DeployEvent
s.Extract(logLine, &event).Context("Kubernetes deployment logs").Run()

// Classify support tickets
var ticket TicketInfo
s.Extract(emailBody, &ticket).Context("Customer support emails for a SaaS product").Run()

// Normalize inconsistent API responses
var order Order
s.Extract(thirdPartyJSON, &order).Context("Legacy vendor API, format varies by region").Run()

Judge — evaluate non-deterministic output

Assert — test assertion, continues on failure

func TestMyAgent(t *testing.T) {
    output := runMyAgent()

    sense.Assert(t, output).
        Expect("produces valid Go code").
        Expect("handles errors idiomatically").
        Context("task was to write a REST API server").
        Run()
}

When a check fails, you get structured feedback — what passed, what failed, why, and evidence:

--- FAIL: TestMyAgent (4.82s)
    agent_test.go:15: evaluation: 1/2 passed, score: 0.50

        ✓ produces valid Go code
          reason: The snippet is syntactically valid Go code for a simple addition function.
          evidence: func Add(a, b int) int { return a + b }
          confidence: 0.95

        ✗ handles errors idiomatically
          reason: The output is a trivial math function with no error handling whatsoever.
            It does not demonstrate idiomatic Go error handling (e.g., returning an error
            as a second value, using fmt.Errorf, etc.), nor does it relate to a REST API
            server where error handling would be expected.
          evidence: func Add(a, b int) int { return a + b } — no error return value,
            no error handling logic, no REST API context
          confidence: 0.99

Require — test assertion, stops on failure

sense.Require(t, output).
    Expect("produces valid Go code").
    Run()

Assert uses t.Error() (test continues). Require uses t.Fatal() (test stops). Same pattern as testify.

Eval — inspect results programmatically

result, err := sense.Eval(output).
    Expect("is a complete sentence").
    Expect("mentions an animal").
    Expect("contains a number").
    Judge()

fmt.Println(result.Pass)   // false
fmt.Println(result.Score)  // 0.67

for _, c := range result.FailedChecks() {
    fmt.Println(c.Expect, "—", c.Reason)
}

EvalResult exposes Pass, Score, Checks, and helpers PassedChecks() / FailedChecks(). Each Check carries Expect, Pass, Confidence, Reason, Evidence, and BelowThreshold.

Confidence threshold

A check can pass Claude's judgment but with low confidence. Set a minimum and low-confidence passes are demoted to failures (flagged BelowThreshold):

// Per call
sense.Eval(output).Expect("is factually accurate").MinConfidence(0.8).Judge()

// Or session-wide
s := sense.New(sense.WithMinConfidence(0.8))

Compare — A/B test two outputs

cmp, err := sense.Compare(outputV1, outputV2).
    Criteria("completeness").
    Criteria("clarity").
    Criteria("professionalism").
    Judge()

fmt.Println(cmp.Winner)     // "A"
fmt.Println(cmp.ScoreA)     // 0.85
fmt.Println(cmp.ScoreB)     // 0.10
fmt.Println(cmp.Reasoning)  // "Output A is significantly better..."

Session

Three tiers — use only what you need:

// Zero config — just works
sense.Assert(t, output).Expect("covers all sections").Run()

// Test suite — auto-cleanup, usage tracking
s := sense.ForTest(t)
s.Assert(t, output).Expect("covers all sections").Run()

// Custom config
s := sense.New(sense.WithModel("claude-haiku-4-5-20251001"))
s.Assert(t, output).Expect("covers all sections").Run()

Extract requires an explicit session:

s := sense.New()
var m MountError
s.Extract("device /dev/sdf already mounted", &m).Run()

// Generic version (uses default session)
result, err := sense.Extract[MountError](logLine).Run()

Functional options

s := sense.New(
    sense.WithModel("claude-haiku-4-5-20251001"),
    sense.WithTimeout(10 * time.Second),  // -1 or 0 disables the timeout
    sense.WithRetries(5),                 // -1 disables retries
    sense.WithAPIKey("sk-..."),
    sense.WithMemoryCache(),              // in-memory response cache, lives with the session
    sense.WithMinConfidence(0.8),         // demote low-confidence passes to failures
    sense.WithContext("you are reviewing API docs"), // prepended to every call
    sense.WithLogger(slog.Default()),     // log calls, latencies, tokens, errors
    sense.WithHook(func(e sense.Event) { /* per-call callback */ }),
)

ForTest — auto-cleanup for test suites

s := sense.ForTest(t)                                    // defaults
s := sense.ForTest(t, sense.WithModel("claude-haiku-4-5-20251001"))  // custom

// t.Cleanup handles Close and prints usage summary

Usage tracking

s := sense.New()
// ... run evaluations ...
fmt.Println(s.Usage())
// sense: 15 calls, 18420 input tokens, 4210 output tokens (~$0.0612)

Token usage is tracked across all operations using atomic counters — safe for concurrent use. Usage() returns a SessionUsage snapshot with an estimated cost from the built-in per-model pricing table (Sonnet / Haiku / Opus).

Batching

Enable batching for 50% cost reduction. Requests are collected and submitted as a single Anthropic Batch API call:

s := sense.New(sense.WithBatch(50, 2*time.Second))
defer s.Close() // required — flushes pending batch requests

Note: Batching trades latency for cost. The Batch API processes requests asynchronously — it can take minutes to hours depending on load. Use it for large test suites where 50% cost savings matter more than speed.

Interfaces

Sense provides two interfaces for decoupling your code from the concrete Session:

// For code that judges output
func AnalyzeReport(s sense.Evaluator, doc string) (bool, error) {
    result, err := s.Eval(doc).
        Expect("has executive summary").
        Judge()
    if err != nil {
        return false, err
    }
    return result.Pass, nil
}

// For code that extracts structure
func ParseTicket(s sense.Extractor, raw string) (*Ticket, error) {
    var t Ticket
    _, err := s.Extract(raw, &t).Run()
    return &t, err
}

*Session satisfies both interfaces. Accept Evaluator or Extractor in your function signatures to make your code testable without the Claude API.

Architecture

Single flat package — no cmd/, no sub-packages. Everything pivots on one seam: the caller interface (call(ctx, *callRequest)). Builders sit in front of it; transports/decorators sit behind it.

  Extract[T] / ExtractSlice[T] / ExtractParallel   Assert / Require / Eval / Compare
        (extract builders)                              (judge builders)
                    \                                       /
                     \                                     /
                      ▼                                   ▼
                   ┌─────────────────────────────────────────┐
                   │  Session  — model, timeout, retries,     │
                   │  usage counters, optional cache/batcher  │
                   └─────────────────────────────────────────┘
                                      │
                              caller.call(...)         ← the one seam
                                      │
        ┌──────────────┬──────────────┴───────┬───────────────┐
        ▼              ▼                       ▼               ▼
   claudeClient   batchCaller            cachedCaller     nopCaller
   (real API,     (Anthropic Batch       (decorator,      (Nop(): {})
    retries,       API, 50% cost)         memory cache)
    forced tool,
    prompt cache)
Component File Role
caller client.go One-method seam every operation calls
claudeClient client.go Real API call — forces one tool via tool_choice, retries 429/5xx with backoff, ephemeral prompt cache on the system block
batchCaller + batcher batch.go Routes calls through the Batch API (50% cost); needs Close() to flush
cachedCaller cache.go Decorator over any caller; content-addressed in-memory cache
nopCaller nop.go Returns {}; backs sense.Nop() for offline runs
Extract schema gen extract_schema.go Reflects a struct into a tool input schema, cached per reflect.Type
Session config.go, option.go Holds config + atomic usage counters; built via functional options

How it works

  1. Your struct schema (Extract) or expectations (Judge) become a prompt
  2. Claude is forced to call a structured tool via tool_choice
  3. The tool's input schema enforces the output format server-side
  4. Sense unmarshals the tool call result into typed Go structs

The schema is enforced server-side, so there's no output parsing on your end. The system prompt carries an ephemeral cache_control block, so repeated calls within a session pay reduced input cost on the cached prefix.

Develop

go test ./...                       # unit tests — mock caller, no API, no key needed
go test -tags=e2e -v ./...          # e2e — hits the real API, COSTS money (~$0.10–0.15/run)
SENSE_SKIP=1 go test ./...          # offline — every sense call becomes a passing no-op
go tool golangci-lint run           # lint (golangci-lint v2 pinned as a go.mod tool dependency)
go build ./...

Unit tests inject a mock caller and never touch the network; e2e tests live behind the //go:build e2e tag in e2e_test.go. House style is Dave Cheney's and is enforced, not aspirational — see .golangci.yml (revive indent-error-flow / superfluous-else, nestif, gocyclo 15, funlen 80 lines).

Offline development

Skip all sense calls when you don't have an API key:

SENSE_SKIP=1 go test ./...

All Assert, Require, Eval, Extract, ExtractSlice, and Compare calls become no-ops that pass immediately.

sense.Nop() also accepts options for cases where you want a no-op session with specific configuration (e.g., model name in result metadata, logging):

s := sense.Nop(sense.WithModel("claude-haiku-4-5-20251001"), sense.WithLogger(logger))

Environment variables

Variable Description Default
ANTHROPIC_API_KEY Claude API key Required (real usage + e2e only)
SENSE_MODEL Override default judge model claude-sonnet-4-6
SENSE_SKIP Set to 1 to skip all sense calls unset

Model precedence: per-call .Model() > $SENSE_MODEL > session model.

Docs

Doc What's in it
docs/NEXT.md Feature backlog — what's shipped vs. designed-but-unbuilt
docs/CONFIDENCE-THRESHOLD.md Confidence-threshold design
docs/FOOTGUNS.md API footguns and the fixes applied
docs/API-SIMPLIFICATION.md API-simplification history
docs/MULTI-JUDGE-CONSENSUS.md Multi-judge consensus design (unbuilt)
docs/INTEGRATION-FEEDBACK.md Notes from production usage

What's next

Roadmap — designed in docs/NEXT.md, not yet built. Ideas, not commitments.

  • Deterministic checks — mix Check(sense.ValidJSON()) with LLM-judged Expect() in the same assertion. Deterministic checks run first; if any fail, skip the LLM call. Free, fast, saves money.
  • File cache — cache responses to disk. Identical prompts during iterative development hit the cache instead of the API. (Today only WithMemoryCache() exists.)
  • Snapshots — save eval results to disk, detect regressions when prompts change. SENSE_UPDATE_SNAPSHOTS=1 to update.
  • CI reporter — JUnit XML output and GitHub Actions annotations so eval results show up in your pipeline.
  • Multi-judge consensus — fan out to N models, require agreement for a pass. Reduces false positives from single-model bias.
  • Cost budgetWithMaxCost(sense.Dollars(0.50)) to cap session spend. Prevents runaway costs in CI.
  • Multi-model judges — an OpenAI caller + WithCaller injection, to judge with non-Claude models.

Already shipped: extract validation (Validate + Validator), ExtractSlice[T], ExtractParallel, confidence thresholds, in-memory cache, batching, and Anthropic prompt caching on system prompts.

About

Make sense of non-deterministic output. Test assertions and structured text extraction for Go, powered by Claude

Topics

Resources

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages