Sense

Make sense of non-deterministic output. Extract structured data from text and evaluate output quality using Claude.

// Judge: output → pass/fail with evidence
sense.Assert(t, output).
    Expect("covers all sections from the brief").
    Expect("includes actionable recommendations").
    Run()

// Extract: unstructured text → typed struct
s := sense.New()
var m MountError
s.Extract("device /dev/sdf already mounted with vol-0abc123", &m).Run()
fmt.Println(m.Device)   // "/dev/sdf"
fmt.Println(m.VolumeID) // "vol-0abc123"

Sense uses the Anthropic API (Claude) with forced tool_use for structured responses — no prompt engineering, no JSON parsing on your end. Requires an Anthropic API key.

Two surfaces, one package:

Extract — parse unstructured text into typed Go structs. Logs, error messages, support tickets, API responses — define a struct, get structured data back.
Judge — evaluate non-deterministic output against natural-language expectations. Assert in tests (Assert/Require), eval programmatically (Eval), or A/B compare two outputs (Compare).

Why it exists

Go programs that touch LLM output have two recurring problems: turning messy text into something typed, and asserting that non-deterministic output is good without a brittle string match. Sense does exactly those two things behind one seam — a caller that forces Claude to call a single tool whose schema is the output contract. Everything else is a thin builder in front of that seam.

Scope is deliberately narrow:

It judges and extracts. It is not an agent framework. No tool-calling loops, no chains, no orchestration — Sense makes a single forced-tool call and unmarshals the result.
It speaks Claude only, today. The caller interface is abstracted so a second provider is ~100 lines, but no OpenAI/other caller is shipped. An OpenAI caller and WithCaller injection live in docs/NEXT.md, not in the code.
Eval-framework features are roadmap, not core. Deterministic Check()s, snapshots/regression detection, dataset runners, multi-judge consensus, JUnit/GitHub-Actions reporters, file cache, and a cost budget are designed in docs/NEXT.md and not built. What's listed under "Surfaces" below is what exists.

Install

go get github.com/itsHabib/sense

export ANTHROPIC_API_KEY=...

Requires Go 1.25+.

Surfaces

| Function | What it does | Returns |
| --- | --- | --- |
| `Extract[T](text)` / `s.Extract(text, &dst)` | One typed struct from text (generic, or `json.Unmarshal`-style into a pointer) | `ExtractResult[T]` / `(ExtractResult, error)` |
| `ExtractSlice[T](text)` | A `[]T` from one text: invoices, log batches, entity lists | `*ExtractSliceResult[T]` |
| `s.ExtractParallel(ctx, jobs)` | Run N extractions concurrently, results written into each job's `Dest` | `*ExtractParallelResult` |
| `Assert(t, output)` | Test assertion, `t.Error` on failure (test continues) | none |
| `Require(t, output)` | Test assertion, `t.Fatal` on failure (test stops) | none |
| `Eval(output)` | Programmatic evaluation you inspect yourself | `*EvalResult` |
| `Compare(a, b)` | A/B comparison of two outputs against the same criteria | `*CompareResult` |

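Apart from ExtractParallel, which returns its result directly, each is a chainable builder ending in a terminal: Run() for Extract, ExtractSlice, Assert, and Require; Judge() for Eval and Compare. Package-level forms (sense.Eval, sense.Extract[T], and so on) use a lazy default session; the s.* forms run on a session you configured with New/ForTest.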

Extract — structure from chaos

Define a struct. Get structured data back. Works with any text.

type MountError struct {
    Device   string `json:"device" sense:"The device path"`
    VolumeID string `json:"volume_id" sense:"The EBS volume ID"`
    Message  string `json:"message"`
}

s := sense.New()

var m MountError
_, err := s.Extract("device /dev/sdf already mounted with vol-0abc123", &m).
    Context("AWS EC2 EBS error messages").
    Run()

fmt.Println(m.Device)   // "/dev/sdf"
fmt.Println(m.VolumeID) // "vol-0abc123"

Pass a pointer to a struct — data is written directly into it, like json.Unmarshal. Schema is generated from your struct via reflection — json tags for field names, sense tags for descriptions. Pointer fields are optional; value fields are required.

Works with nested structs, slices, and all Go primitive types.
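
The reflection step described above can be sketched with the standard library alone. This is not the code in extract_schema.go: `schemaFor` and `jsonType` are hypothetical names, the sketch handles only flat structs with primitive fields, and it ignores cases such as `json:"-"` or unexported fields.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
	"strings"
)

// MountError follows the example above; Message is made a pointer here
// only to show how an optional field would be treated.
type MountError struct {
	Device   string  `json:"device" sense:"The device path"`
	VolumeID string  `json:"volume_id" sense:"The EBS volume ID"`
	Message  *string `json:"message"`
}

// jsonType maps a Go kind to a JSON Schema type name (primitives only).
func jsonType(k reflect.Kind) string {
	switch k {
	case reflect.String:
		return "string"
	case reflect.Bool:
		return "boolean"
	case reflect.Int, reflect.Int8, reflect.Int16, reflect.Int32, reflect.Int64,
		reflect.Uint, reflect.Uint8, reflect.Uint16, reflect.Uint32, reflect.Uint64:
		return "integer"
	case reflect.Float32, reflect.Float64:
		return "number"
	default:
		return "object" // nested structs, slices, and maps are out of scope here
	}
}

// schemaFor (hypothetical name) reflects a struct value into an object schema.
// It panics if v is not a struct, which is acceptable for a sketch.
func schemaFor(v any) map[string]any {
	t := reflect.TypeOf(v)
	props := map[string]any{}
	required := []string{}
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		// The json tag supplies the property name; fall back to the field name.
		name := strings.Split(f.Tag.Get("json"), ",")[0]
		if name == "" {
			name = f.Name
		}
		// Pointer fields are optional; value fields are required.
		ft := f.Type
		optional := ft.Kind() == reflect.Pointer
		if optional {
			ft = ft.Elem()
		}
		prop := map[string]any{"type": jsonType(ft.Kind())}
		// The sense tag supplies the description shown to the model.
		if desc := f.Tag.Get("sense"); desc != "" {
			prop["description"] = desc
		}
		props[name] = prop
		if !optional {
			required = append(required, name)
		}
	}
	return map[string]any{"type": "object", "properties": props, "required": required}
}

func main() {
	out, err := json.MarshalIndent(schemaFor(MountError{}), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

In this sketch `device` and `volume_id` end up in `required`, while `message` is described but left optional.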

A generic function is also available for callers who prefer compile-time type safety:

result, err := sense.Extract[MountError]("device /dev/sdf already mounted with vol-0abc123").Run()
fmt.Println(result.Data.Device)   // "/dev/sdf"

ExtractResult[T] carries Data, Duration, TokensUsed, Model, Usage, and Fallback.

ExtractSlice — multiple items from one text

Extract a list of typed structs from a single input. Same API as Extract, returns []T:

type LineItem struct {
    Description string  `json:"description" sense:"Item description"`
    Quantity    int     `json:"quantity" sense:"Number of units"`
    UnitPrice   float64 `json:"unit_price" sense:"Price per unit in dollars"`
}

result, err := sense.ExtractSlice[LineItem](invoiceText).
    Context("Invoice line items").
    Run()

for _, item := range result.Data {
    fmt.Printf("%s x%d @ $%.2f\n", item.Description, item.Quantity, item.UnitPrice)
}

Works with log batches, entity lists, table rows — anything where one text contains multiple structured items.

Per-item validation is built in:

result, err := sense.ExtractSlice[LineItem](text).
    Validate(func(item LineItem) error {
        if item.Quantity <= 0 {
            return fmt.Errorf("invalid quantity: %d", item.Quantity)
        }
        return nil
    }).
    Run()

ExtractParallel — many extractions at once

Run a batch of independent extractions concurrently. Each job writes into its own destination pointer; the result reports per-job errors and total wall-clock time:

var mount MountError
var ticket TicketInfo

res := s.ExtractParallel(ctx, []sense.ExtractJob{
    {Text: logLine, Dest: &mount, Context: "AWS EBS error"},
    {Text: emailBody, Dest: &ticket, Context: "support email"},
})

if res.Failed() {
    for i, err := range res.Errors {
        if err != nil {
            log.Printf("job %d failed: %v", i, err)
        }
    }
}
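
A fan-out with one error slot per job, as described above, might look roughly like this. The types `job` and `result`, and the functions `runAll` and `fakeExtract`, are hypothetical stand-ins; the real ExtractParallel takes a context and typed destinations, which this sketch leaves out.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// job is a stand-in for one extraction: input text plus a destination.
type job struct {
	Text string
	Dest *string
}

// result holds one error slot per job (nil on success) and the wall-clock time.
type result struct {
	Errors   []error
	Duration time.Duration
}

// Failed reports whether any job returned an error.
func (r *result) Failed() bool {
	for _, err := range r.Errors {
		if err != nil {
			return true
		}
	}
	return false
}

// runAll runs extract for every job concurrently. Each goroutine writes only
// to its own index and its own Dest, so no mutex is needed.
func runAll(jobs []job, extract func(text string) (string, error)) *result {
	start := time.Now()
	res := &result{Errors: make([]error, len(jobs))}
	var wg sync.WaitGroup
	for i, j := range jobs {
		wg.Add(1)
		go func(i int, j job) {
			defer wg.Done()
			out, err := extract(j.Text)
			if err != nil {
				res.Errors[i] = err
				return
			}
			*j.Dest = out
		}(i, j)
	}
	wg.Wait()
	res.Duration = time.Since(start)
	return res
}

// fakeExtract stands in for the API call: it fails on empty input.
func fakeExtract(text string) (string, error) {
	if text == "" {
		return "", errors.New("empty input")
	}
	return "parsed:" + text, nil
}

func main() {
	var a, b string
	res := runAll([]job{{Text: "log line", Dest: &a}, {Text: "", Dest: &b}}, fakeExtract)
	fmt.Println(res.Failed(), a, res.Errors[1]) // true parsed:log line empty input
}
```

Keeping errors in a slice indexed by job lets one failure leave the other destinations intact.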

Validation

All extract paths support two kinds of validation:

Closure — pass a function via .Validate(fn):

result, err := sense.Extract[Order](text).
    Validate(func(o Order) error {
        if o.Total < 0 {
            return fmt.Errorf("invalid total: %f", o.Total)
        }
        return nil
    }).
    Run()

Interface — implement Validate() error on your struct:

type Order struct {
    Total float64 `json:"total"`
    Items []Item  `json:"items"`
}

func (o *Order) Validate() error {
    if o.Total < 0 {
        return fmt.Errorf("invalid total: %f", o.Total)
    }
    return nil
}

// Validate() is called automatically after unmarshalling.
result, err := sense.Extract[Order](text).Run()

Both work with Extract[T], Extract(text, &dest), and ExtractSlice[T]. When both are set, the closure runs first.
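
The ordering rule can be pictured with a small dispatcher. This is a sketch under assumptions, not Sense's implementation: the helper `validate` is hypothetical, and it assumes the first error stops the chain.

```go
package main

import (
	"errors"
	"fmt"
)

// Validator is the interface form: a Validate() error method on the struct.
type Validator interface {
	Validate() error
}

type Order struct {
	Total float64
}

// Validate uses a pointer receiver, matching the example above.
func (o *Order) Validate() error {
	if o.Total < 0 {
		return errors.New("interface: negative total")
	}
	return nil
}

// validate (hypothetical) runs the optional closure first, then the
// interface method if *T implements Validator. The first error wins.
func validate[T any](v *T, closure func(T) error) error {
	if closure != nil {
		if err := closure(*v); err != nil {
			return err
		}
	}
	if val, ok := any(v).(Validator); ok {
		return val.Validate()
	}
	return nil
}

func main() {
	o := Order{Total: -1}
	err := validate(&o, func(o Order) error {
		if o.Total < 0 {
			return errors.New("closure: negative total")
		}
		return nil
	})
	fmt.Println(err) // closure: negative total
}
```

Both validators reject this order, and the printed error comes from the closure because it runs first.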

Fallback

All extract builders support a fallback function for when the API call fails:

result, err := sense.Extract[MountError](logLine).
    Fallback(func() (*MountError, error) {
        return regexParseMountError(logLine)
    }).
    Run()

if err != nil {
    log.Fatal(err) // check the error before reading result
}
if result.Fallback {
    log.Println("used fallback parser")
}

The Fallback field on the result is true when the fallback path fired.
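
The example above calls regexParseMountError without showing it. One possible shape for such a deterministic fallback is sketched below; the regular expression and the local MountError type are assumptions made for illustration, and the pattern only covers lines shaped like the sample.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// MountError is redeclared locally so the sketch is self-contained.
type MountError struct {
	Device   string
	VolumeID string
	Message  string
}

// mountRe matches lines like "device /dev/sdf already mounted with vol-0abc123".
var mountRe = regexp.MustCompile(`device (/dev/\S+) already mounted with (vol-[0-9a-f]+)`)

// regexParseMountError is a hypothetical fallback parser with the same
// return shape as the Fallback closure above.
func regexParseMountError(line string) (*MountError, error) {
	m := mountRe.FindStringSubmatch(line)
	if m == nil {
		return nil, errors.New("line does not match the mount error pattern")
	}
	return &MountError{Device: m[1], VolumeID: m[2], Message: line}, nil
}

func main() {
	m, err := regexParseMountError("device /dev/sdf already mounted with vol-0abc123")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(m.Device, m.VolumeID) // /dev/sdf vol-0abc123
}
```

A fallback like this covers the known log format when the API is unavailable, and returns an error for anything else so the caller still sees the failure.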

Use cases

Extract isn't just for tests. Use it anywhere you need structure from messy text:

// Parse log lines into typed events
var event DeployEvent
s.Extract(logLine, &event).Context("Kubernetes deployment logs").Run()

// Classify support tickets
var ticket TicketInfo
s.Extract(emailBody, &ticket).Context("Customer support emails for a SaaS product").Run()

// Normalize inconsistent API responses
var order Order
s.Extract(thirdPartyJSON, &order).Context("Legacy vendor API, format varies by region").Run()

Judge — evaluate non-deterministic output

Assert — test assertion, continues on failure

func TestMyAgent(t *testing.T) {
    output := runMyAgent()

    sense.Assert(t, output).
        Expect("produces valid Go code").
        Expect("handles errors idiomatically").
        Context("task was to write a REST API server").
        Run()
}

When a check fails, you get structured feedback — what passed, what failed, why, and evidence:

--- FAIL: TestMyAgent (4.82s)
    agent_test.go:15: evaluation: 1/2 passed, score: 0.50

        ✓ produces valid Go code
          reason: The snippet is syntactically valid Go code for a simple addition function.
          evidence: func Add(a, b int) int { return a + b }
          confidence: 0.95

        ✗ handles errors idiomatically
          reason: The output is a trivial math function with no error handling whatsoever.
            It does not demonstrate idiomatic Go error handling (e.g., returning an error
            as a second value, using fmt.Errorf, etc.), nor does it relate to a REST API
            server where error handling would be expected.
          evidence: func Add(a, b int) int { return a + b } — no error return value,
            no error handling logic, no REST API context
          confidence: 0.99

Require — test assertion, stops on failure

sense.Require(t, output).
    Expect("produces valid Go code").
    Run()

Assert uses t.Error() (test continues). Require uses t.Fatal() (test stops). Same pattern as testify.

Eval — inspect results programmatically

result, err := sense.Eval(output).
    Expect("is a complete sentence").
    Expect("mentions an animal").
    Expect("contains a number").
    Judge()

fmt.Println(result.Pass)   // false
fmt.Println(result.Score)  // 0.67

for _, c := range result.FailedChecks() {
    fmt.Println(c.Expect, "—", c.Reason)
}

EvalResult exposes Pass, Score, Checks, and helpers PassedChecks() / FailedChecks(). Each Check carries Expect, Pass, Confidence, Reason, Evidence, and BelowThreshold.

Confidence threshold

A check can pass Claude's judgment but with low confidence. Set a minimum and low-confidence passes are demoted to failures (flagged BelowThreshold):

// Per call
sense.Eval(output).Expect("is factually accurate").MinConfidence(0.8).Judge()

// Or session-wide
s := sense.New(sense.WithMinConfidence(0.8))
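
The demotion rule can be sketched as follows. The Check fields come from the list above; `applyMinConfidence` is a hypothetical helper, and treating the score as the fraction of passing checks is an assumption based on the sample output (1/2 passed, score 0.50).

```go
package main

import "fmt"

// Check carries only the fields this sketch needs.
type Check struct {
	Expect         string
	Pass           bool
	Confidence     float64
	BelowThreshold bool
}

// applyMinConfidence demotes passing checks whose confidence is under the
// threshold and flags them. Checks that already failed are left untouched.
// It returns the overall pass flag and the fraction of checks that passed.
func applyMinConfidence(checks []Check, threshold float64) (bool, float64) {
	if len(checks) == 0 {
		return false, 0 // assumption: nothing to judge counts as not passing
	}
	passed := 0
	for i := range checks {
		c := &checks[i]
		if c.Pass && c.Confidence < threshold {
			c.Pass = false
			c.BelowThreshold = true
		}
		if c.Pass {
			passed++
		}
	}
	return passed == len(checks), float64(passed) / float64(len(checks))
}

func main() {
	checks := []Check{
		{Expect: "is a complete sentence", Pass: true, Confidence: 0.95},
		{Expect: "is factually accurate", Pass: true, Confidence: 0.60},
		{Expect: "contains a number", Pass: false, Confidence: 0.99},
	}
	pass, score := applyMinConfidence(checks, 0.8)
	fmt.Printf("pass=%v score=%.2f\n", pass, score) // pass=false score=0.33
	fmt.Println(checks[1].BelowThreshold)           // true
}
```

The BelowThreshold flag lets a report separate low-confidence passes from checks the judge actually failed.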

Compare — A/B test two outputs

cmp, err := sense.Compare(outputV1, outputV2).
    Criteria("completeness").
    Criteria("clarity").
    Criteria("professionalism").
    Judge()

fmt.Println(cmp.Winner)     // "A"
fmt.Println(cmp.ScoreA)     // 0.85
fmt.Println(cmp.ScoreB)     // 0.10
fmt.Println(cmp.Reasoning)  // "Output A is significantly better..."

Session

Three tiers — use only what you need:

// Zero config — just works
sense.Assert(t, output).Expect("covers all sections").Run()

// Test suite — auto-cleanup, usage tracking
s := sense.ForTest(t)
s.Assert(t, output).Expect("covers all sections").Run()

// Custom config
s := sense.New(sense.WithModel("claude-haiku-4-5-20251001"))
s.Assert(t, output).Expect("covers all sections").Run()

The pointer-style Extract is a session method, so it needs an explicit session; the generic Extract[T] runs on the default session:

s := sense.New()
var m MountError
s.Extract("device /dev/sdf already mounted", &m).Run()

// Generic version (uses default session)
result, err := sense.Extract[MountError](logLine).Run()

Functional options

s := sense.New(
    sense.WithModel("claude-haiku-4-5-20251001"),
    sense.WithTimeout(10 * time.Second),  // -1 or 0 disables the timeout
    sense.WithRetries(5),                 // -1 disables retries
    sense.WithAPIKey("sk-..."),
    sense.WithMemoryCache(),              // in-memory response cache, lives with the session
    sense.WithMinConfidence(0.8),         // demote low-confidence passes to failures
    sense.WithContext("you are reviewing API docs"), // prepended to every call
    sense.WithLogger(slog.Default()),     // log calls, latencies, tokens, errors
    sense.WithHook(func(e sense.Event) { /* per-call callback */ }),
)

ForTest — auto-cleanup for test suites

s := sense.ForTest(t)                                    // defaults
s := sense.ForTest(t, sense.WithModel("claude-haiku-4-5-20251001"))  // custom

// t.Cleanup handles Close and prints usage summary

Usage tracking

s := sense.New()
// ... run evaluations ...
fmt.Println(s.Usage())
// sense: 15 calls, 18420 input tokens, 4210 output tokens (~$0.0612)

Token usage is tracked across all operations using atomic counters — safe for concurrent use. Usage() returns a SessionUsage snapshot with an estimated cost from the built-in per-model pricing table (Sonnet / Haiku / Opus).
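
Atomic counters of this kind can be sketched with sync/atomic. The `usage` type and its methods are hypothetical, and the per-million-token prices are placeholders chosen for easy arithmetic, not Anthropic's pricing or Sense's table.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// usage accumulates counters without a mutex. It must not be copied after
// first use, so every method takes a pointer receiver.
type usage struct {
	calls        atomic.Int64
	inputTokens  atomic.Int64
	outputTokens atomic.Int64
}

// Placeholder prices in dollars per million tokens (illustrative only).
const (
	inputPerMTok  = 1.0
	outputPerMTok = 5.0
)

// record adds one call and its token counts.
func (u *usage) record(in, out int64) {
	u.calls.Add(1)
	u.inputTokens.Add(in)
	u.outputTokens.Add(out)
}

// estimatedCost converts the token totals into dollars.
func (u *usage) estimatedCost() float64 {
	in := float64(u.inputTokens.Load()) / 1e6 * inputPerMTok
	out := float64(u.outputTokens.Load()) / 1e6 * outputPerMTok
	return in + out
}

func main() {
	var u usage
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			u.record(1000, 200)
		}()
	}
	wg.Wait()
	fmt.Printf("%d calls, %d input tokens, %d output tokens (~$%.4f)\n",
		u.calls.Load(), u.inputTokens.Load(), u.outputTokens.Load(), u.estimatedCost())
	// 100 calls, 100000 input tokens, 20000 output tokens (~$0.2000)
}
```

Each counter is updated independently, so a snapshot taken while calls are in flight can be slightly out of step across fields; for a usage summary that is usually acceptable.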

Batching

Enable batching for a 50% cost reduction. Requests are collected and submitted as a single Anthropic Batch API call:

s := sense.New(sense.WithBatch(50, 2*time.Second))
defer s.Close() // required — flushes pending batch requests

Note: Batching trades latency for cost. The Batch API processes requests asynchronously — it can take minutes to hours depending on load. Use it for large test suites where 50% cost savings matter more than speed.
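
Why Close() is required becomes clearer with a small collector. The sketch below assumes the two arguments to WithBatch are a maximum batch size and a maximum wait, which the text above does not spell out; `batcher`, `newBatcher`, and the `submit` callback are hypothetical names, and no real Batch API call is made.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// batcher collects requests and hands them to submit in groups.
type batcher struct {
	mu      sync.Mutex
	pending []string
	maxSize int
	maxWait time.Duration
	timer   *time.Timer
	submit  func(batch []string)
}

func newBatcher(maxSize int, maxWait time.Duration, submit func([]string)) *batcher {
	return &batcher{maxSize: maxSize, maxWait: maxWait, submit: submit}
}

// Add queues one request. A full queue flushes immediately; otherwise the
// first request of a new batch starts the maxWait timer.
func (b *batcher) Add(req string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.pending = append(b.pending, req)
	if len(b.pending) >= b.maxSize {
		b.flushLocked()
		return
	}
	if b.timer == nil {
		b.timer = time.AfterFunc(b.maxWait, b.Close)
	}
}

// Close flushes whatever is still pending. Without it, a partial batch is
// lost if the program exits before the timer fires.
func (b *batcher) Close() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.flushLocked()
}

// flushLocked submits the pending batch; the caller must hold b.mu.
// submit runs under the lock, so it must not call Add or Close.
func (b *batcher) flushLocked() {
	if b.timer != nil {
		b.timer.Stop()
		b.timer = nil
	}
	if len(b.pending) == 0 {
		return
	}
	batch := b.pending
	b.pending = nil
	b.submit(batch)
}

func main() {
	b := newBatcher(2, time.Hour, func(batch []string) {
		fmt.Println("submitting", len(batch), "request(s)")
	})
	b.Add("a")
	b.Add("b") // reaches maxSize, flushes a batch of 2
	b.Add("c") // starts a new batch
	b.Close()  // flushes the remaining batch of 1
}
```

The last request only goes out because of the explicit Close(), which is the situation the `defer s.Close()` line guards against.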

Interfaces

Sense provides two interfaces for decoupling your code from the concrete Session:

// For code that judges output
func AnalyzeReport(s sense.Evaluator, doc string) (bool, error) {
    result, err := s.Eval(doc).
        Expect("has executive summary").
        Judge()
    if err != nil {
        return false, err
    }
    return result.Pass, nil
}

// For code that extracts structure
func ParseTicket(s sense.Extractor, raw string) (*Ticket, error) {
    var t Ticket
    _, err := s.Extract(raw, &t).Run()
    return &t, err
}

*Session satisfies both interfaces. Accept Evaluator or Extractor in your function signatures to make your code testable without the Claude API.

Architecture

Single flat package — no cmd/, no sub-packages. Everything pivots on one seam: the caller interface (call(ctx, *callRequest)). Builders sit in front of it; transports/decorators sit behind it.

  Extract[T] / ExtractSlice[T] / ExtractParallel   Assert / Require / Eval / Compare
        (extract builders)                              (judge builders)
                    \                                       /
                     \                                     /
                      ▼                                   ▼
                   ┌─────────────────────────────────────────┐
                   │  Session  — model, timeout, retries,     │
                   │  usage counters, optional cache/batcher  │
                   └─────────────────────────────────────────┘
                                      │
                              caller.call(...)         ← the one seam
                                      │
        ┌──────────────┬──────────────┴───────┬───────────────┐
        ▼              ▼                       ▼               ▼
   claudeClient   batchCaller            cachedCaller     nopCaller
   (real API,     (Anthropic Batch       (decorator,      (Nop(): {})
    retries,       API, 50% cost)         memory cache)
    forced tool,
    prompt cache)

| Component | File | Role |
| --- | --- | --- |
| `caller` | `client.go` | One-method seam every operation calls |
| `claudeClient` | `client.go` | Real API call: forces one tool via `tool_choice`, retries 429/5xx with backoff, ephemeral prompt cache on the system block |
| `batchCaller` + `batcher` | `batch.go` | Routes calls through the Batch API (50% cost); needs `Close()` to flush |
| `cachedCaller` | `cache.go` | Decorator over any caller; content-addressed in-memory cache |
| `nopCaller` | `nop.go` | Returns `{}`; backs `sense.Nop()` for offline runs |
| Extract schema gen | `extract_schema.go` | Reflects a struct into a tool input schema, cached per `reflect.Type` |
| `Session` | `config.go`, `option.go` | Holds config + atomic usage counters; built via functional options |
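
The decorator idea behind cachedCaller can be sketched around a one-method interface. Only the name `caller` and the `call(ctx, *callRequest)` shape come from the text above; the request fields, the `[]byte` return type, the SHA-256 key, and the helpers `cacheKey`, `countingCaller`, and `run` are assumptions for illustration.

```go
package main

import (
	"context"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
)

// callRequest is a reduced stand-in for whatever the real request carries.
type callRequest struct {
	Model, System, Prompt string
}

// caller is the one-method seam; the return type is assumed.
type caller interface {
	call(ctx context.Context, req *callRequest) ([]byte, error)
}

// cachedCaller wraps any caller with a content-addressed in-memory cache.
type cachedCaller struct {
	next    caller
	mu      sync.Mutex
	entries map[string][]byte
}

func newCachedCaller(next caller) *cachedCaller {
	return &cachedCaller{next: next, entries: map[string][]byte{}}
}

// cacheKey hashes the request content. Length prefixes keep
// ("ab","c") and ("a","bc") from producing the same key.
func cacheKey(req *callRequest) string {
	h := sha256.New()
	for _, part := range []string{req.Model, req.System, req.Prompt} {
		fmt.Fprintf(h, "%d:%s", len(part), part)
	}
	return hex.EncodeToString(h.Sum(nil))
}

func (c *cachedCaller) call(ctx context.Context, req *callRequest) ([]byte, error) {
	key := cacheKey(req)
	c.mu.Lock()
	cached, ok := c.entries[key]
	c.mu.Unlock()
	if ok {
		return cached, nil
	}
	out, err := c.next.call(ctx, req) // errors are not cached
	if err != nil {
		return nil, err
	}
	c.mu.Lock()
	c.entries[key] = out
	c.mu.Unlock()
	return out, nil
}

// countingCaller stands in for the real client and counts invocations.
type countingCaller struct{ n int }

func (c *countingCaller) call(context.Context, *callRequest) ([]byte, error) {
	c.n++
	return []byte(`{}`), nil
}

// run is a demo convenience that supplies a background context.
func run(c caller, req *callRequest) ([]byte, error) {
	return c.call(context.Background(), req)
}

func main() {
	inner := &countingCaller{}
	cached := newCachedCaller(inner)
	req := &callRequest{Model: "m", System: "s", Prompt: "p"}
	run(cached, req)
	run(cached, req)
	fmt.Println(inner.n) // 1
}
```

Because the decorator only depends on the interface, the same wrapper could sit in front of a batch or no-op caller. Two concurrent misses for the same key would both reach the inner caller in this sketch.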

How it works

1. Your struct schema (Extract) or expectations (Judge) become a prompt.
2. Claude is forced to call a structured tool via tool_choice.
3. The tool's input schema enforces the output format server-side.
4. Sense unmarshals the tool call result into typed Go structs.

The schema is enforced server-side, so there's no output parsing on your end. The system prompt carries an ephemeral cache_control block, so repeated calls within a session pay reduced input cost on the cached prefix.
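
For orientation, a forced-tool request body for the Anthropic Messages API might be assembled as below. The JSON field names follow the public Messages API; the Go types, the tool name `submit_extraction`, the system text, and the `max_tokens` value are assumptions for illustration and are not taken from Sense's source.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type cacheControl struct {
	Type string `json:"type"`
}

type systemBlock struct {
	Type         string        `json:"type"`
	Text         string        `json:"text"`
	CacheControl *cacheControl `json:"cache_control,omitempty"`
}

type tool struct {
	Name        string         `json:"name"`
	Description string         `json:"description"`
	InputSchema map[string]any `json:"input_schema"`
}

type toolChoice struct {
	Type string `json:"type"`
	Name string `json:"name"`
}

type message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type request struct {
	Model      string        `json:"model"`
	MaxTokens  int           `json:"max_tokens"`
	System     []systemBlock `json:"system"`
	Tools      []tool        `json:"tools"`
	ToolChoice toolChoice    `json:"tool_choice"`
	Messages   []message     `json:"messages"`
}

// buildExtractRequest (hypothetical) wires one tool whose input schema is
// the output contract, and forces the model to call exactly that tool.
func buildExtractRequest(text string, schema map[string]any) request {
	const toolName = "submit_extraction" // hypothetical tool name
	return request{
		Model:     "claude-sonnet-4-6",
		MaxTokens: 1024,
		System: []systemBlock{{
			Type: "text",
			Text: "Extract the requested fields from the user's text.",
			// Marks the system prefix as cacheable across calls.
			CacheControl: &cacheControl{Type: "ephemeral"},
		}},
		Tools: []tool{{
			Name:        toolName,
			Description: "Record the extracted fields.",
			InputSchema: schema,
		}},
		// type "tool" plus a name forces a call to that specific tool.
		ToolChoice: toolChoice{Type: "tool", Name: toolName},
		Messages:   []message{{Role: "user", Content: text}},
	}
}

func main() {
	schema := map[string]any{
		"type":       "object",
		"properties": map[string]any{"device": map[string]any{"type": "string"}},
		"required":   []string{"device"},
	}
	body, err := json.MarshalIndent(buildExtractRequest("device /dev/sdf already mounted", schema), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

The response to such a request contains a tool_use block whose input matches the schema, which is what gets unmarshalled into the Go struct.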

Develop

go test ./...                       # unit tests — mock caller, no API, no key needed
go test -tags=e2e -v ./...          # e2e — hits the real API, COSTS money (~$0.10–0.15/run)
SENSE_SKIP=1 go test ./...          # offline — every sense call becomes a passing no-op
go tool golangci-lint run           # lint (golangci-lint v2 pinned as a go.mod tool dependency)
go build ./...

Unit tests inject a mock caller and never touch the network; e2e tests live behind the //go:build e2e tag in e2e_test.go. House style is Dave Cheney's and is enforced, not aspirational — see .golangci.yml (revive indent-error-flow / superfluous-else, nestif, gocyclo 15, funlen 80 lines).

Offline development

Skip all sense calls when you don't have an API key:

SENSE_SKIP=1 go test ./...

All Assert, Require, Eval, Extract, ExtractSlice, and Compare calls become no-ops that pass immediately.

For a no-op session created in code rather than through the environment, use sense.Nop(). It accepts options for cases where you want a no-op session with specific configuration (e.g., model name in result metadata, logging):

s := sense.Nop(sense.WithModel("claude-haiku-4-5-20251001"), sense.WithLogger(logger))

Environment variables

| Variable | Description | Default |
| --- | --- | --- |
| `ANTHROPIC_API_KEY` | Claude API key | Required (real usage + e2e only) |
| `SENSE_MODEL` | Override default judge model | `claude-sonnet-4-6` |
| `SENSE_SKIP` | Set to `1` to skip all sense calls | unset |

Model precedence: per-call .Model() > $SENSE_MODEL > session model.
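
The precedence rule reads naturally as a three-way fallback. `resolveModel` below is a hypothetical helper, not a Sense function; it treats an empty string as "not set" at each level.

```go
package main

import (
	"fmt"
	"os"
)

// resolveModel picks the first non-empty value in precedence order:
// per-call override, then the SENSE_MODEL environment value, then the
// session's configured model.
func resolveModel(perCall, env, session string) string {
	switch {
	case perCall != "":
		return perCall
	case env != "":
		return env
	default:
		return session
	}
}

func main() {
	env := os.Getenv("SENSE_MODEL")
	// With no per-call override, the environment wins over the session model.
	fmt.Println(resolveModel("", env, "claude-sonnet-4-6"))
	// A per-call override wins over both.
	fmt.Println(resolveModel("claude-haiku-4-5-20251001", env, "claude-sonnet-4-6"))
}
```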

Docs

| Doc | What's in it |
| --- | --- |
| docs/NEXT.md | Feature backlog: what's shipped vs. designed-but-unbuilt |
| docs/CONFIDENCE-THRESHOLD.md | Confidence-threshold design |
| docs/FOOTGUNS.md | API footguns and the fixes applied |
| docs/API-SIMPLIFICATION.md | API-simplification history |
| docs/MULTI-JUDGE-CONSENSUS.md | Multi-judge consensus design (unbuilt) |
| docs/INTEGRATION-FEEDBACK.md | Notes from production usage |

What's next

Roadmap — designed in docs/NEXT.md, not yet built. Ideas, not commitments.

Deterministic checks — mix Check(sense.ValidJSON()) with LLM-judged Expect() in the same assertion. Deterministic checks run first; if any fail, skip the LLM call. Free, fast, saves money.
File cache — cache responses to disk. Identical prompts during iterative development hit the cache instead of the API. (Today only WithMemoryCache() exists.)
Snapshots — save eval results to disk, detect regressions when prompts change. SENSE_UPDATE_SNAPSHOTS=1 to update.
CI reporter — JUnit XML output and GitHub Actions annotations so eval results show up in your pipeline.
Multi-judge consensus — fan out to N models, require agreement for a pass. Reduces false positives from single-model bias.
Cost budget — WithMaxCost(sense.Dollars(0.50)) to cap session spend. Prevents runaway costs in CI.
Multi-model judges — an OpenAI caller + WithCaller injection, to judge with non-Claude models.

Already shipped: extract validation (Validate + Validator), ExtractSlice[T], ExtractParallel, confidence thresholds, in-memory cache, batching, and Anthropic prompt caching on system prompts.
