Make sense of non-deterministic output. Extract structured data from text and evaluate output quality using Claude.
// Judge: output → pass/fail with evidence
sense.Assert(t, output).
Expect("covers all sections from the brief").
Expect("includes actionable recommendations").
Run()
// Extract: unstructured text → typed struct
s := sense.New()
var m MountError
s.Extract("device /dev/sdf already mounted with vol-0abc123", &m).Run()
fmt.Println(m.Device) // "/dev/sdf"
fmt.Println(m.VolumeID) // "vol-0abc123"Sense uses the Anthropic API (Claude) with forced tool_use for structured responses — no prompt engineering, no JSON parsing on your end. Requires an Anthropic API key.
Two surfaces, one package:
- Extract — parse unstructured text into typed Go structs. Logs, error messages, support tickets, API responses — define a struct, get structured data back.
- Judge — evaluate non-deterministic output against natural-language expectations. Assert in tests (
Assert/Require), eval programmatically (Eval), or A/B compare two outputs (Compare).
Go programs that touch LLM output have two recurring problems: turning messy text into something typed, and asserting that non-deterministic output is good without a brittle string match. Sense does exactly those two things behind one seam — a caller that forces Claude to call a single tool whose schema is the output contract. Everything else is a thin builder in front of that seam.
Scope is deliberately narrow:
- It judges and extracts. It is not an agent framework. No tool-calling loops, no chains, no orchestration — Sense makes a single forced-tool call and unmarshals the result.
- It speaks Claude only, today. The
callerinterface is abstracted so a second provider is ~100 lines, but no OpenAI/other caller is shipped. An OpenAI caller andWithCallerinjection live in docs/NEXT.md, not in the code. - Eval-framework features are roadmap, not core. Deterministic
Check()s, snapshots/regression detection, dataset runners, multi-judge consensus, JUnit/GitHub-Actions reporters, file cache, and a cost budget are designed in docs/NEXT.md and not built. What's listed under "Surfaces" below is what exists.
go get github.com/itsHabib/senseexport ANTHROPIC_API_KEY=...Requires Go 1.25+.
| Function | What it does | Returns |
|---|---|---|
Extract[T](text) / s.Extract(text, &dst) |
One typed struct from text (generic, or json.Unmarshal-style into a pointer) |
*ExtractResult[T] / (*ExtractResult, error) |
ExtractSlice[T](text) |
A []T from one text — invoices, log batches, entity lists |
*ExtractSliceResult[T] |
s.ExtractParallel(ctx, jobs) |
Run N extractions concurrently, results written into each job's Dest |
*ExtractParallelResult |
Assert(t, output) |
Test assertion, t.Error on failure (test continues) |
— |
Require(t, output) |
Test assertion, t.Fatal on failure (test stops) |
— |
Eval(output) |
Programmatic evaluation you inspect yourself | *EvalResult |
Compare(a, b) |
A/B comparison of two outputs against the same criteria | *CompareResult |
Each is a chainable builder ending in a terminal (Run() / Judge()). Package-level forms (sense.Eval, sense.Extract[T], …) use a lazy default session; the s.* forms run on a session you configured with New/ForTest.
Define a struct. Get structured data back. Works with any text.
type MountError struct {
Device string `json:"device" sense:"The device path"`
VolumeID string `json:"volume_id" sense:"The EBS volume ID"`
Message string `json:"message"`
}
s := sense.New()
var m MountError
_, err := s.Extract("device /dev/sdf already mounted with vol-0abc123", &m).
Context("AWS EC2 EBS error messages").
Run()
fmt.Println(m.Device) // "/dev/sdf"
fmt.Println(m.VolumeID) // "vol-0abc123"Pass a pointer to a struct — data is written directly into it, like json.Unmarshal. Schema is generated from your struct via reflection — json tags for field names, sense tags for descriptions. Pointer fields are optional; value fields are required.
Works with nested structs, slices, and all Go primitive types.
A generic function is also available for callers who prefer compile-time type safety:
result, err := sense.Extract[MountError]("device /dev/sdf already mounted with vol-0abc123").Run()
fmt.Println(result.Data.Device) // "/dev/sdf"ExtractResult[T] carries Data, Duration, TokensUsed, Model, Usage, and Fallback.
Extract a list of typed structs from a single input. Same API as Extract, returns []T:
type LineItem struct {
Description string `json:"description" sense:"Item description"`
Quantity int `json:"quantity" sense:"Number of units"`
UnitPrice float64 `json:"unit_price" sense:"Price per unit in dollars"`
}
result, err := sense.ExtractSlice[LineItem](invoiceText).
Context("Invoice line items").
Run()
for _, item := range result.Data {
fmt.Printf("%s x%d @ $%.2f\n", item.Description, item.Quantity, item.UnitPrice)
}Works with log batches, entity lists, table rows — anything where one text contains multiple structured items.
Per-item validation is built in:
result, err := sense.ExtractSlice[LineItem](text).
Validate(func(item LineItem) error {
if item.Quantity <= 0 {
return fmt.Errorf("invalid quantity: %d", item.Quantity)
}
return nil
}).
Run()Run a batch of independent extractions concurrently. Each job writes into its own destination pointer; the result reports per-job errors and total wall-clock time:
var mount MountError
var ticket TicketInfo
res := s.ExtractParallel(ctx, []sense.ExtractJob{
{Text: logLine, Dest: &mount, Context: "AWS EBS error"},
{Text: emailBody, Dest: &ticket, Context: "support email"},
})
if res.Failed() {
for i, err := range res.Errors {
if err != nil {
log.Printf("job %d failed: %v", i, err)
}
}
}All extract paths support two kinds of validation:
Closure — pass a function via .Validate(fn):
result, err := sense.Extract[Order](text).
Validate(func(o Order) error {
if o.Total < 0 {
return fmt.Errorf("invalid total: %f", o.Total)
}
return nil
}).
Run()Interface — implement Validate() error on your struct:
type Order struct {
Total float64 `json:"total"`
Items []Item `json:"items"`
}
func (o *Order) Validate() error {
if o.Total < 0 {
return fmt.Errorf("invalid total: %f", o.Total)
}
return nil
}
// Validate() is called automatically after unmarshalling.
result, err := sense.Extract[Order](text).Run()Both work with Extract[T], Extract(text, &dest), and ExtractSlice[T]. When both are set, the closure runs first.
All extract builders support a fallback function for when the API call fails:
result, err := sense.Extract[MountError](logLine).
Fallback(func() (*MountError, error) {
return regexParseMountError(logLine)
}).
Run()
if result.Fallback {
log.Warn("used fallback parser")
}The Fallback field on the result is true when the fallback path fired.
Extract isn't just for tests. Use it anywhere you need structure from messy text:
// Parse log lines into typed events
var event DeployEvent
s.Extract(logLine, &event).Context("Kubernetes deployment logs").Run()
// Classify support tickets
var ticket TicketInfo
s.Extract(emailBody, &ticket).Context("Customer support emails for a SaaS product").Run()
// Normalize inconsistent API responses
var order Order
s.Extract(thirdPartyJSON, &order).Context("Legacy vendor API, format varies by region").Run()func TestMyAgent(t *testing.T) {
output := runMyAgent()
sense.Assert(t, output).
Expect("produces valid Go code").
Expect("handles errors idiomatically").
Context("task was to write a REST API server").
Run()
}When a check fails, you get structured feedback — what passed, what failed, why, and evidence:
--- FAIL: TestMyAgent (4.82s)
agent_test.go:15: evaluation: 1/2 passed, score: 0.50
✓ produces valid Go code
reason: The snippet is syntactically valid Go code for a simple addition function.
evidence: func Add(a, b int) int { return a + b }
confidence: 0.95
✗ handles errors idiomatically
reason: The output is a trivial math function with no error handling whatsoever.
It does not demonstrate idiomatic Go error handling (e.g., returning an error
as a second value, using fmt.Errorf, etc.), nor does it relate to a REST API
server where error handling would be expected.
evidence: func Add(a, b int) int { return a + b } — no error return value,
no error handling logic, no REST API context
confidence: 0.99
sense.Require(t, output).
Expect("produces valid Go code").
Run()Assert uses t.Error() (test continues). Require uses t.Fatal() (test stops). Same pattern as testify.
result, err := sense.Eval(output).
Expect("is a complete sentence").
Expect("mentions an animal").
Expect("contains a number").
Judge()
fmt.Println(result.Pass) // false
fmt.Println(result.Score) // 0.67
for _, c := range result.FailedChecks() {
fmt.Println(c.Expect, "—", c.Reason)
}EvalResult exposes Pass, Score, Checks, and helpers PassedChecks() / FailedChecks(). Each Check carries Expect, Pass, Confidence, Reason, Evidence, and BelowThreshold.
A check can pass Claude's judgment but with low confidence. Set a minimum and low-confidence passes are demoted to failures (flagged BelowThreshold):
// Per call
sense.Eval(output).Expect("is factually accurate").MinConfidence(0.8).Judge()
// Or session-wide
s := sense.New(sense.WithMinConfidence(0.8))cmp, err := sense.Compare(outputV1, outputV2).
Criteria("completeness").
Criteria("clarity").
Criteria("professionalism").
Judge()
fmt.Println(cmp.Winner) // "A"
fmt.Println(cmp.ScoreA) // 0.85
fmt.Println(cmp.ScoreB) // 0.10
fmt.Println(cmp.Reasoning) // "Output A is significantly better..."Three tiers — use only what you need:
// Zero config — just works
sense.Assert(t, output).Expect("covers all sections").Run()
// Test suite — auto-cleanup, usage tracking
s := sense.ForTest(t)
s.Assert(t, output).Expect("covers all sections").Run()
// Custom config
s := sense.New(sense.WithModel("claude-haiku-4-5-20251001"))
s.Assert(t, output).Expect("covers all sections").Run()Extract requires an explicit session:
s := sense.New()
var m MountError
s.Extract("device /dev/sdf already mounted", &m).Run()
// Generic version (uses default session)
result, err := sense.Extract[MountError](logLine).Run()s := sense.New(
sense.WithModel("claude-haiku-4-5-20251001"),
sense.WithTimeout(10 * time.Second), // -1 or 0 disables the timeout
sense.WithRetries(5), // -1 disables retries
sense.WithAPIKey("sk-..."),
sense.WithMemoryCache(), // in-memory response cache, lives with the session
sense.WithMinConfidence(0.8), // demote low-confidence passes to failures
sense.WithContext("you are reviewing API docs"), // prepended to every call
sense.WithLogger(slog.Default()), // log calls, latencies, tokens, errors
sense.WithHook(func(e sense.Event) { /* per-call callback */ }),
)s := sense.ForTest(t) // defaults
s := sense.ForTest(t, sense.WithModel("claude-haiku-4-5-20251001")) // custom
// t.Cleanup handles Close and prints usage summarys := sense.New()
// ... run evaluations ...
fmt.Println(s.Usage())
// sense: 15 calls, 18420 input tokens, 4210 output tokens (~$0.0612)Token usage is tracked across all operations using atomic counters — safe for concurrent use. Usage() returns a SessionUsage snapshot with an estimated cost from the built-in per-model pricing table (Sonnet / Haiku / Opus).
Enable batching for 50% cost reduction. Requests are collected and submitted as a single Anthropic Batch API call:
s := sense.New(sense.WithBatch(50, 2*time.Second))
defer s.Close() // required — flushes pending batch requestsNote: Batching trades latency for cost. The Batch API processes requests asynchronously — it can take minutes to hours depending on load. Use it for large test suites where 50% cost savings matter more than speed.
Sense provides two interfaces for decoupling your code from the concrete Session:
// For code that judges output
func AnalyzeReport(s sense.Evaluator, doc string) (bool, error) {
result, err := s.Eval(doc).
Expect("has executive summary").
Judge()
if err != nil {
return false, err
}
return result.Pass, nil
}
// For code that extracts structure
func ParseTicket(s sense.Extractor, raw string) (*Ticket, error) {
var t Ticket
_, err := s.Extract(raw, &t).Run()
return &t, err
}*Session satisfies both interfaces. Accept Evaluator or Extractor in your function signatures to make your code testable without the Claude API.
Single flat package — no cmd/, no sub-packages. Everything pivots on one seam: the caller interface (call(ctx, *callRequest)). Builders sit in front of it; transports/decorators sit behind it.
Extract[T] / ExtractSlice[T] / ExtractParallel Assert / Require / Eval / Compare
(extract builders) (judge builders)
\ /
\ /
▼ ▼
┌─────────────────────────────────────────┐
│ Session — model, timeout, retries, │
│ usage counters, optional cache/batcher │
└─────────────────────────────────────────┘
│
caller.call(...) ← the one seam
│
┌──────────────┬──────────────┴───────┬───────────────┐
▼ ▼ ▼ ▼
claudeClient batchCaller cachedCaller nopCaller
(real API, (Anthropic Batch (decorator, (Nop(): {})
retries, API, 50% cost) memory cache)
forced tool,
prompt cache)
| Component | File | Role |
|---|---|---|
caller |
client.go |
One-method seam every operation calls |
claudeClient |
client.go |
Real API call — forces one tool via tool_choice, retries 429/5xx with backoff, ephemeral prompt cache on the system block |
batchCaller + batcher |
batch.go |
Routes calls through the Batch API (50% cost); needs Close() to flush |
cachedCaller |
cache.go |
Decorator over any caller; content-addressed in-memory cache |
nopCaller |
nop.go |
Returns {}; backs sense.Nop() for offline runs |
| Extract schema gen | extract_schema.go |
Reflects a struct into a tool input schema, cached per reflect.Type |
Session |
config.go, option.go |
Holds config + atomic usage counters; built via functional options |
- Your struct schema (Extract) or expectations (Judge) become a prompt
- Claude is forced to call a structured tool via
tool_choice - The tool's input schema enforces the output format server-side
- Sense unmarshals the tool call result into typed Go structs
The schema is enforced server-side, so there's no output parsing on your end. The system prompt carries an ephemeral cache_control block, so repeated calls within a session pay reduced input cost on the cached prefix.
go test ./... # unit tests — mock caller, no API, no key needed
go test -tags=e2e -v ./... # e2e — hits the real API, COSTS money (~$0.10–0.15/run)
SENSE_SKIP=1 go test ./... # offline — every sense call becomes a passing no-op
go tool golangci-lint run # lint (golangci-lint v2 pinned as a go.mod tool dependency)
go build ./...Unit tests inject a mock caller and never touch the network; e2e tests live behind the //go:build e2e tag in e2e_test.go. House style is Dave Cheney's and is enforced, not aspirational — see .golangci.yml (revive indent-error-flow / superfluous-else, nestif, gocyclo 15, funlen 80 lines).
Skip all sense calls when you don't have an API key:
SENSE_SKIP=1 go test ./...All Assert, Require, Eval, Extract, ExtractSlice, and Compare calls become no-ops that pass immediately.
sense.Nop() also accepts options for cases where you want a no-op session with specific configuration (e.g., model name in result metadata, logging):
s := sense.Nop(sense.WithModel("claude-haiku-4-5-20251001"), sense.WithLogger(logger))| Variable | Description | Default |
|---|---|---|
ANTHROPIC_API_KEY |
Claude API key | Required (real usage + e2e only) |
SENSE_MODEL |
Override default judge model | claude-sonnet-4-6 |
SENSE_SKIP |
Set to 1 to skip all sense calls |
unset |
Model precedence: per-call .Model() > $SENSE_MODEL > session model.
| Doc | What's in it |
|---|---|
| docs/NEXT.md | Feature backlog — what's shipped vs. designed-but-unbuilt |
| docs/CONFIDENCE-THRESHOLD.md | Confidence-threshold design |
| docs/FOOTGUNS.md | API footguns and the fixes applied |
| docs/API-SIMPLIFICATION.md | API-simplification history |
| docs/MULTI-JUDGE-CONSENSUS.md | Multi-judge consensus design (unbuilt) |
| docs/INTEGRATION-FEEDBACK.md | Notes from production usage |
Roadmap — designed in docs/NEXT.md, not yet built. Ideas, not commitments.
- Deterministic checks — mix
Check(sense.ValidJSON())with LLM-judgedExpect()in the same assertion. Deterministic checks run first; if any fail, skip the LLM call. Free, fast, saves money. - File cache — cache responses to disk. Identical prompts during iterative development hit the cache instead of the API. (Today only
WithMemoryCache()exists.) - Snapshots — save eval results to disk, detect regressions when prompts change.
SENSE_UPDATE_SNAPSHOTS=1to update. - CI reporter — JUnit XML output and GitHub Actions annotations so eval results show up in your pipeline.
- Multi-judge consensus — fan out to N models, require agreement for a pass. Reduces false positives from single-model bias.
- Cost budget —
WithMaxCost(sense.Dollars(0.50))to cap session spend. Prevents runaway costs in CI. - Multi-model judges — an OpenAI
caller+WithCallerinjection, to judge with non-Claude models.
Already shipped: extract validation (Validate + Validator), ExtractSlice[T], ExtractParallel, confidence thresholds, in-memory cache, batching, and Anthropic prompt caching on system prompts.