Honey-Comb

v0.2.0 - Production Ready

Keep the honey, drop the wax.

CPU-only inline context depollution for agent harnesses


Visual Gallery

Architecture Overview

Honey-Comb operates in two loops:

  • Hot Loop (per message, ~0.035ms rules / ~0.8ms ML): Classifies and depollutes every message on ingestion
  • Cool Loop (every N turns, ~10-50ms): Performs staleness detection and budget enforcement

Performance Benchmarks

Measured throughput by mode:

  • Production mode (thread-safe, metrics enabled): 17,069 msg/s
  • High-performance mode (no locks, no metrics): 24,667 msg/s
  • Rule-based classification: 28,948 msg/s

Depollution Ratios

Real-world depollution examples from agent sessions:

  • Test output: 500 lines → "94 passed, 2 failed" + failure details (83x reduction)
  • File contents: 69-line source file → "src/auth.py (69 lines)" (103x reduction)
  • Reasoning traces: Verbose reasoning → key conclusions (3-5x reduction)

Latency Breakdown

The hot loop completes in under 1.5ms per message, making inline depollution practical for real-time agent loops.

Statistical Validation

All key claims are validated with bootstrap confidence intervals (10,000 resamples), one-sample t-tests, and Cohen's d effect sizes across n >= 100 samples. Full results are in the Statistical Significance section.

Real-World Demo

The demo shows a 10-turn coding agent session depolluted from 4,062 tokens to 640 tokens (6.3x reduction):

Turn 1 (SYSTEM)   - 137 → 137 tokens (CORE - kept verbatim)
Turn 2 (USER)     - 93 → 93 tokens (CORE - kept verbatim)
Turn 3 (FILE)     - 514 → 5 tokens (COMPACT - 103x reduction)
Turn 4 (TESTS)    - 759 → 93 tokens (DISTILL - 8x reduction)
Turn 5 (REASON)   - 351 → 110 tokens (DISTILL - 3x reduction)
Turn 6 (DIFF)     - 451 → 9 tokens (COMPACT - 50x reduction)
Turn 7 (TESTS)    - 334 → 20 tokens (DISTILL - 17x reduction)
Turn 8 (TESTS)    - 585 → 20 tokens (DISTILL - 29x reduction)
Turn 9 (SUMMARY)  - 247 → 146 tokens (DISTILL - 2x reduction)
─────────────────────────────────────────────────────
Total: 4,062 → 640 tokens (6.3x reduction, 84% noise removed)
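The totals line can be checked with a few lines of arithmetic. This is a minimal sketch that only uses the token counts printed above; the helper name `reduction_stats` is made up for illustration and is not part of the Honey-Comb API.

```python
def reduction_stats(raw_tokens, clean_tokens):
    """Return (reduction ratio, fraction of tokens removed)."""
    ratio = raw_tokens / clean_tokens
    removed = 1 - clean_tokens / raw_tokens
    return ratio, removed


# Session totals from the demo output above.
ratio, removed = reduction_stats(4062, 640)
print(f"{ratio:.1f}x reduction, {removed:.0%} removed")

# Per-turn check for the file read in turn 3 (514 -> 5 tokens).
file_ratio, _ = reduction_stats(514, 5)
print(f"file read: {file_ratio:.0f}x")
```

The ratio and the "noise removed" percentage are two views of the same pair of numbers: one divides raw by clean, the other is one minus the inverse.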

Run the demo yourself:

python scripts/demo_pollution.py

Statistical Significance

All performance claims are validated with proper statistical methods (bootstrap confidence intervals, hypothesis testing, effect sizes).

Key Results

| Metric | Mean | 95% CI | Baseline | p-value | Effect Size |
|---|---|---|---|---|---|
| Classification Accuracy | 84.2% | [79.9%, 88.3%] | 25.0% (random) | < 0.001 | d = 1.42 |
| Reduction Ratio | 13.7x | [12.2x, 15.3x] | 1.0x (no depollution) | < 0.001 | d = 2.33 |
| Token Savings | 3,103 tokens | [2,815, 3,398] | 0 tokens | < 0.001 | |
| Throughput (rule-based) | 13,635 msg/s | [13,374, 13,867] | | | |
| Throughput (ML-based) | 1,028 msg/s | [995, 1,057] | | | |

Every metric tested against a baseline is statistically significant (p < 0.001), and the two with reported effect sizes (accuracy and reduction ratio) show large effects (d > 0.8).

  • n=273 evaluation examples for accuracy (held-out test set)
  • n=100 synthetic sessions for reduction ratio and token savings
  • n=100 trials for throughput (1000 messages per trial)
  • Bootstrap confidence intervals with 10,000 resamples
  • One-sample t-tests vs appropriate baselines
  • Cohen's d for effect size (d > 0.8 = large effect)

See docs/statistical_validation.json for full results and scripts/validate_significance.py to reproduce.
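For readers unfamiliar with the methods listed above, here is a rough standard-library sketch of a percentile bootstrap confidence interval, a one-sample t statistic, and a one-sample Cohen's d. It is not the code in scripts/validate_significance.py; the function names and the sample data are assumptions made for illustration only.

```python
import math
import random
import statistics


def bootstrap_ci(samples, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    n = len(samples)
    means = sorted(
        statistics.fmean(rng.choices(samples, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def cohens_d_one_sample(samples, baseline):
    """One-sample Cohen's d: (mean - baseline) / sample standard deviation."""
    return (statistics.fmean(samples) - baseline) / statistics.stdev(samples)


def t_statistic_one_sample(samples, baseline):
    """One-sample t statistic: d scaled by sqrt(n)."""
    return cohens_d_one_sample(samples, baseline) * math.sqrt(len(samples))


# Hypothetical per-session reduction ratios, NOT the project's data.
ratios = [8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
lo, hi = bootstrap_ci(ratios, n_resamples=2_000)
d = cohens_d_one_sample(ratios, baseline=1.0)
t = t_statistic_one_sample(ratios, baseline=1.0)
print(f"95% CI: [{lo:.1f}, {hi:.1f}], d = {d:.2f}, t = {t:.2f}")
```

Turning the t statistic into a p-value needs the t distribution, which the standard library does not provide; a statistics package would normally handle that step.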

What It Does

Agent context windows fill up with noise: 500-line test outputs where everything passed, file contents from 10 turns ago, reasoning chains about bugs that are already fixed. Today's approach is reactive — call an LLM to summarize when the window gets too long.

Honey-Comb takes a different approach: depollute every message on the way in, before it ever enters the context window. A CPU classifier (~1ms per message) labels each message with a depollution strategy, and deterministic regex extractors execute it. No model reads or understands the text. The LLM only sees clean context — the honey, not the wax.

Every message enters the agent loop:
  raw → classify(1ms) → depollute → context window
  raw → classify(1ms) → depollute → context window
  raw → classify(1ms) → depollute → context window

LLM sees: clean, depolluted context

No batch summarization. No "when do I depollute?" threshold. Every message, every time.

What This Is (and Isn't)

This is not compression. Real compression (gzip, delta encoding) preserves information in a smaller form. Honey-Comb selectively deletes noise and extracts structural summaries from tool outputs. A file read replaced by src/auth.py (69 lines) has lost the file contents — they are not recoverable from the context window.

This is a classifier + rule engine. The ML model (TF-IDF + VotingClassifier) picks a bucket. Hand-written regex extractors do the actual work. No model reads or understands the text being "depolluted."

This works on structured tool outputs where "what matters" is mechanically extractable: test results, file reads, diffs, error traces, command output. It does not summarize free-form conversation.

Misclassification = data loss. A CORE message mislabeled as DROP is gone. The 84% classification accuracy means ~16% of messages get a suboptimal strategy. For most tool outputs this is harmless (slightly more or less pruning), but it is a real risk for ambiguous content.

The honest pitch: CPU-only inline context pruning for agent harnesses.

The Two Loops

HOT LOOP (per message, <0.1ms rules / ~1ms ML):
  raw message → classifier → label → depolluter → clean context entry

COOL LOOP (every N turns, ~10-50ms):
  walk context → drop stale/superseded entries
  budget check → force-downgrade if over budget

Both loops are CPU-only. The LLM only ever sees clean, depolluted context.

Label Taxonomy

| Label | Strategy | Example |
|---|---|---|
| CORE | Keep verbatim | Active goal, current error, system prompt |
| DISTILL | Extract key info | Test output → "94 passed, 2 failed" + failure details |
| COMPACT | Structural summary | File → "src/foo.py (200 lines): class Foo, def bar()" |
| DROP | Remove entirely | Completed tool calls (the result is what matters) |
| STALE | Mark for deletion | File read before a later edit of the same file |
| ESCALATE | Defer to LLM | Ambiguous content (rare) |
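One way to picture the taxonomy in code is an enum plus a strategy lookup. This is only a sketch: the real honeycomb/labels.py may be structured differently, and `STRATEGY` and `is_lossy` are hypothetical names introduced here.

```python
from enum import Enum


class Label(Enum):
    CORE = "core"
    DISTILL = "distill"
    COMPACT = "compact"
    DROP = "drop"
    STALE = "stale"
    ESCALATE = "escalate"


# Strategy text copied from the taxonomy above.
STRATEGY = {
    Label.CORE: "Keep verbatim",
    Label.DISTILL: "Extract key info",
    Label.COMPACT: "Structural summary",
    Label.DROP: "Remove entirely",
    Label.STALE: "Mark for deletion",
    Label.ESCALATE: "Defer to LLM",
}


def is_lossy(label):
    """Assumption: every label except CORE and ESCALATE discards content."""
    return label not in (Label.CORE, Label.ESCALATE)


for label in Label:
    print(f"{label.name:9s} {STRATEGY[label]:20s} lossy={is_lossy(label)}")
```

Separating "which label" from "what the label does" mirrors the classifier and rule engine split described earlier.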

Quick Start

git clone https://github.com/DJLougen/honey-comb.git
cd honey-comb
pip install -e ".[dev]"

# Run tests
pytest tests/

# Generate training data
python scripts/generate_synthetic.py

# Train classifier
honeycomb-train examples/train.jsonl --eval examples/eval.jsonl

Usage

Rule-based (no training needed)

from honeycomb import HoneyComb, Message

hc = HoneyComb()  # Uses rule-based classification

# Process every message inline.
# `agent_messages` and `send_to_llm` stand in for your own harness code.
for raw_message in agent_messages:
    compressed = hc.process(Message(
        role=raw_message["role"],
        content=raw_message["content"],
    ))
    # Send depolluted content to your LLM
    send_to_llm({"role": compressed.role, "content": compressed.content})

# Get the full depolluted context window
window = hc.get_context_window()

ML classifier (trained)

from honeycomb import HoneyComb, Message

hc = HoneyComb(model_path="models/honeycomb.joblib")

# Same API — just with ML classification instead of rules
compressed = hc.process(Message(role="tool", content="94 passed, 2 failed..."))

With budget management

from honeycomb import HoneyComb, Message
from honeycomb.budget import BudgetConfig

hc = HoneyComb(
    budget_config=BudgetConfig(target_tokens=10_000),
    cool_interval=5,  # Run cool loop every 5 turns
)

Production Features

Thread Safety

Honey-Comb is thread-safe by default, allowing concurrent processing from multiple threads:

hc = HoneyComb(thread_safe=True)  # Default

# Safe to call from multiple threads
# All internal state is protected by locks
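To illustrate what "protected by locks" means in practice, here is a small stand-alone sketch. `ContextStore` is a hypothetical stand-in, not a Honey-Comb class; it only shows the general pattern of guarding shared state with a lock and swapping in a no-op context manager when thread safety is disabled.

```python
import contextlib
import threading
from concurrent.futures import ThreadPoolExecutor


class ContextStore:
    """Toy shared state: a list of entries plus a running token total."""

    def __init__(self, thread_safe=True):
        # A real lock when thread_safe, otherwise a no-op context manager.
        self._lock = threading.Lock() if thread_safe else contextlib.nullcontext()
        self.entries = []
        self.total_tokens = 0

    def add(self, content, tokens):
        with self._lock:
            self.entries.append(content)
            self.total_tokens += tokens  # read-modify-write, needs the lock


store = ContextStore(thread_safe=True)
with ThreadPoolExecutor(max_workers=8) as pool:
    for i in range(1000):
        pool.submit(store.add, f"message {i}", 3)

print(len(store.entries), store.total_tokens)
```

The no-lock variant skips the acquire and release on every call, which is where a single-threaded speedup would come from.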

For single-threaded workloads, disable locks for maximum performance:

hc = HoneyComb(thread_safe=False)  # 1.45x faster, single-threaded only

Observability

Structured logging, Prometheus metrics, and health checks are built-in:

from honeycomb import setup_logging, metrics, health_checker

# Setup structured JSON logging
setup_logging(level="INFO", json_format=True)

# Metrics are automatically recorded
print(f"Messages: {metrics.messages_processed.value}")
print(f"Compression p95: {metrics.compression_ratio.get_percentile(95):.2f}x")
print(f"Avg latency: {metrics.processing_latency_seconds.get_mean() * 1000:.3f}ms")

# Health check endpoint
health = health_checker.check()
print(f"Status: {health.status}")
print(f"Uptime: {health.uptime_seconds:.1f}s")

Export metrics in Prometheus format:

prometheus_text = metrics.export_prometheus()
# Serve at /metrics endpoint

Configuration

Configure via environment variables or config files:

# Environment variables
export HONEYCOMB_THREAD_SAFE=true
export HONEYCOMB_METRICS_ENABLED=true
export HONEYCOMB_COOL_LOOP_INTERVAL=10
export HONEYCOMB_LOG_LEVEL=INFO
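As a sketch of how variables like these are typically parsed, the snippet below reads them into a dataclass. The variable names come from the list above; the `Config` class, `config_from_env`, and the default values for the interval and log level are assumptions and may not match honeycomb/config.py.

```python
import os
from dataclasses import dataclass


@dataclass
class Config:
    thread_safe: bool = True
    metrics_enabled: bool = True
    cool_loop_interval: int = 10  # assumed default
    log_level: str = "INFO"       # assumed default


def _as_bool(value):
    return value.strip().lower() in ("1", "true", "yes", "on")


def config_from_env(env=os.environ):
    """Build a Config, overriding defaults with HONEYCOMB_* variables."""
    cfg = Config()
    if "HONEYCOMB_THREAD_SAFE" in env:
        cfg.thread_safe = _as_bool(env["HONEYCOMB_THREAD_SAFE"])
    if "HONEYCOMB_METRICS_ENABLED" in env:
        cfg.metrics_enabled = _as_bool(env["HONEYCOMB_METRICS_ENABLED"])
    if "HONEYCOMB_COOL_LOOP_INTERVAL" in env:
        cfg.cool_loop_interval = int(env["HONEYCOMB_COOL_LOOP_INTERVAL"])
    if "HONEYCOMB_LOG_LEVEL" in env:
        cfg.log_level = env["HONEYCOMB_LOG_LEVEL"].upper()
    return cfg


# Passing a plain dict keeps the example independent of the real environment.
cfg = config_from_env({
    "HONEYCOMB_THREAD_SAFE": "false",
    "HONEYCOMB_COOL_LOOP_INTERVAL": "5",
})
print(cfg)
```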

Or load from a config file:

from honeycomb import load_config

config = load_config("config.json")
print(f"Thread safe: {config.thread_safe}")
print(f"Cool loop interval: {config.cool_loop_interval}")

Performance Tuning

Choose the right mode for your workload:

| Mode | thread_safe | metrics_enabled | Use Case |
|---|---|---|---|
| Production | True | True | Concurrent server workloads |
| High-perf | False | False | Single-threaded batch processing |

# Production mode (default)
hc = HoneyComb(thread_safe=True, metrics_enabled=True)
# ~17,000 msg/s

# High-performance mode
hc = HoneyComb(thread_safe=False, metrics_enabled=False)
# ~24,700 msg/s (1.45x faster)

Run the production demo:

python scripts/demo_production.py

Architecture

honeycomb/
  labels.py          Label taxonomy (CORE/DISTILL/COMPACT/DROP/STALE/ESCALATE)
  features.py        Message-level feature extraction
  compressor.py      Deterministic per-label depollution rules (regex extractors)
  session.py         Turn tracking, staleness detection, supersession (thread-safe)
  budget.py          Token budget management
  classifier.py      TF-IDF + VotingClassifier (SGD + NB + LR)
  firewall.py        Main orchestrator (hot loop + cool loop)
  observability.py   Structured logging, metrics, health checks
  config.py          Configuration management (env vars, files)
  io.py              JSONL I/O for training data
  cli_train.py       Training CLI entry point
scripts/
  generate_synthetic.py   Synthetic training data generator
  demo_pollution.py       Side-by-side raw vs clean demo
  demo_production.py      Production features demo (threading, metrics, config)
  benchmark.py            Performance benchmarks
  benchmark_statistical.py Statistical benchmarks with confidence intervals
examples/
  train.jsonl        Training examples (1335 rows)
  eval.jsonl         Evaluation examples (273 rows)
tests/
  test_labels.py       6 tests
  test_features.py    12 tests
  test_compressor.py  25 tests
  test_session.py     17 tests
  test_budget.py       8 tests
  test_firewall.py    18 tests
  test_classifier.py   5 tests
  test_production.py  19 tests
  test_performance.py 15 tests
  test_threading.py    4 tests

Performance

All values below are statistically validated (bootstrap 95% CI, n >= 100 trials). See the Statistical Significance section for methodology.

| Metric | Value | 95% CI |
|---|---|---|
| Classification accuracy (end-to-end pipeline) | 84.2% | [79.9%, 88.3%] |
| Classification accuracy (isolated ML classifier) | 94.5% | |
| Training examples | 1,335 | |
| Per-message latency (rules) | 0.061ms | |
| Per-message latency (ML) | 0.899ms | |
| Throughput (rules) | 13,635 msg/s | [13,374, 13,867] |
| Throughput (ML) | 1,028 msg/s | [995, 1,057] |
| Reduction ratio (100 sessions) | 13.7x | [12.2x, 15.3x] |
| Token savings per session | 3,103 tokens | [2,815, 3,398] |
| Tests | 129 passed | |

The end-to-end accuracy (84.2%) reflects the full pipeline including content-type detection, while the isolated ML classifier achieves 94.5% on the same held-out evaluation set. Both are significantly better than the 25% random baseline (p < 0.001, Cohen's d = 1.42).

Demo: Raw vs Clean

Run python scripts/demo_pollution.py to see a 10-turn coding agent session depolluted in real time:

| Turn | Raw | Clean | What happened |
|---|---|---|---|
| System prompt | 137 tokens | 137 tokens | Kept verbatim (CORE) |
| User goal | 93 tokens | 93 tokens | Kept verbatim (CORE) |
| File read (69 lines) | 514 tokens | 5 tokens | src/auth.py (69 lines) |
| Test failures (60 lines) | 759 tokens | 93 tokens | Summary + 4 failed test names + error details |
| Agent reasoning | 351 tokens | 110 tokens | Extracted decision plan (numbered lists) |
| Git diff | 451 tokens | 9 tokens | Edited 1 file(s): +2/-9 src/auth.py |
| Test pass | 334 tokens | 20 tokens | 12 passed in 1.15s |
| Full suite pass | 585 tokens | 20 tokens | 213 passed in 4.72s |
| Final summary | 247 tokens | 146 tokens | Headers + numbered change list |
| Total | 4,062 tokens | 640 tokens | 6.3x reduction, 84% noise removed |

How It Works

Hot Loop (per message)

  1. Extract features from the message: role, content type signals (paths, errors, code blocks), turn age, duplicate detection.
  2. Classify the message into a depollution label using either rules or the ML classifier.
  3. Depollute using deterministic regex extractors specific to the content type and label. No model reads the text.
  4. Record in the session state for staleness tracking.
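The four steps above can be sketched as a single function. Everything here is illustrative: the regexes, the `Features` fields, and the rules in `classify` are invented for this example and are far simpler than whatever honeycomb/features.py and the real classifier do. Step 3 is left as a pass-through placeholder.

```python
import re
from dataclasses import dataclass

PATH_RE = re.compile(r"\b[\w./-]+\.(?:py|js|ts|json|md)\b")
TEST_RE = re.compile(r"\b\d+ (?:passed|failed)\b")
DIFF_RE = re.compile(r"^(?:diff --git |@@ )", re.MULTILINE)


@dataclass
class Features:
    role: str
    has_path: bool
    has_traceback: bool
    has_test_summary: bool
    has_diff: bool


def extract_features(role, content):
    return Features(
        role=role,
        has_path=bool(PATH_RE.search(content)),
        has_traceback="Traceback (most recent call last)" in content,
        has_test_summary=bool(TEST_RE.search(content)),
        has_diff=bool(DIFF_RE.search(content)),
    )


def classify(f):
    """Hypothetical rule set; falls back to CORE so nothing is lost by default."""
    if f.role in ("system", "user"):
        return "CORE"
    if f.has_traceback or f.has_test_summary:
        return "DISTILL"
    if f.has_diff or f.has_path:
        return "COMPACT"
    return "CORE"


def hot_loop(role, content, session):
    features = extract_features(role, content)  # step 1: features
    label = classify(features)                  # step 2: label
    clean = content                             # step 3: extractor would run here
    session.append((label, clean))              # step 4: record for staleness tracking
    return label


session = []
print(hot_loop("system", "You are a coding agent.", session))
print(hot_loop("tool", "94 passed, 2 failed in 3.5s", session))
print(hot_loop("tool", "diff --git a/src/auth.py b/src/auth.py\n@@ -1,9 +1,2 @@", session))
```

Falling back to CORE when no rule fires is a deliberate choice in this sketch: keeping too much costs tokens, while dropping too much loses data.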

Cool Loop (every N turns)

  1. Staleness check: Walk the context and drop entries that are stale (file read before a later edit) or superseded (file read again later).
  2. Budget enforcement: If over the token budget, force-downgrade the lowest-priority entries to more aggressive depollution.
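A simplified sketch of both cool-loop steps follows. The `Entry` fields, the priority scheme, and the idea of modeling a "force-downgrade" as shrinking an entry to a fixed token floor are all assumptions made for illustration; the real session.py and budget.py logic is not shown in this README.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    turn: int
    kind: str              # "read", "edit", or "other"
    path: Optional[str]
    tokens: int
    priority: int          # lower value = downgraded first


def drop_stale(entries):
    """Drop file reads followed by a later edit or read of the same path."""
    last_touch = {}
    for e in entries:
        if e.kind in ("read", "edit") and e.path:
            last_touch[e.path] = max(last_touch.get(e.path, -1), e.turn)
    return [
        e for e in entries
        if not (e.kind == "read" and e.path and last_touch[e.path] > e.turn)
    ]


def enforce_budget(entries, target_tokens, floor=5):
    """Shrink lowest-priority entries to `floor` tokens until under budget."""
    total = sum(e.tokens for e in entries)
    for e in sorted(entries, key=lambda e: e.priority):
        if total <= target_tokens:
            break
        saved = max(e.tokens - floor, 0)
        e.tokens -= saved
        total -= saved
    return total


entries = [
    Entry(1, "other", None, 137, priority=9),
    Entry(3, "read", "src/auth.py", 514, priority=1),
    Entry(5, "other", None, 351, priority=2),
    Entry(6, "edit", "src/auth.py", 451, priority=3),
]
kept = drop_stale(entries)                      # the turn-3 read is stale
total = enforce_budget(kept, target_tokens=600)
print([e.turn for e in kept], total)
```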

Depollution Examples

Test output (DISTILL):

Before: 500 lines of pytest output
After:  "94 passed, 2 failed in 3.5s\nFailures:\ntest_foo.py::test_bar\nAssertionError: expected 5 got 3"
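A regex extractor of this kind might look roughly like the following. It assumes pytest's default terminal format (a final `=== ... in N.NNs ===` line and `FAILED <test id> - <reason>` lines in the short summary); the patterns and the function name `distill_pytest` are illustrative, not the ones in compressor.py.

```python
import re

# Final summary line, e.g. "==== 1 failed, 2 passed in 0.12s ===="
SUMMARY_RE = re.compile(r"^=+ (.+ in [\d.]+s)(?: \(.*\))? =+$", re.MULTILINE)
# Short test summary lines, e.g. "FAILED test_foo.py::test_bar - AssertionError: ..."
FAILED_RE = re.compile(r"^FAILED (\S+) - (.+)$", re.MULTILINE)


def distill_pytest(output):
    lines = []
    summary = SUMMARY_RE.search(output)
    if summary:
        lines.append(summary.group(1))
    failures = FAILED_RE.findall(output)
    if failures:
        lines.append("Failures:")
        for test_id, reason in failures:
            lines.extend([test_id, reason])
    # If nothing matched, keep the original rather than lose it.
    return "\n".join(lines) if lines else output


sample = "\n".join([
    "============ test session starts ============",
    "collected 3 items",
    "test_foo.py .F.",
    "================= FAILURES =================",
    "    assert add(2, 1) == 5",
    "========= short test summary info =========",
    "FAILED test_foo.py::test_bar - AssertionError: expected 5 got 3",
    "======= 1 failed, 2 passed in 0.12s =======",
])
distilled = distill_pytest(sample)
print(distilled)
```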

File content (COMPACT):

Before: 200 lines of source code
After:  "src/foo.py (200 lines):\nclass Foo\ndef bar()\ndef baz()"
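For Python sources, a structural summary like the one above can be approximated with one regex over `class` and `def` lines. This is a sketch under that assumption; `compact_source` is a hypothetical name, and the real extractor may handle other languages or emit only the header line.

```python
import re

# Matches "class Name" or "def name" at the start of a (possibly indented) line.
SYMBOL_RE = re.compile(r"^[ \t]*(class|def)\s+(\w+)", re.MULTILINE)


def compact_source(path, source):
    n_lines = len(source.splitlines())
    header = f"{path} ({n_lines} lines)"
    symbols = [
        f"class {name}" if kind == "class" else f"def {name}()"
        for kind, name in SYMBOL_RE.findall(source)
    ]
    if not symbols:
        return header
    return header + ":\n" + "\n".join(symbols)


source = (
    "class Foo:\n"
    "    def bar(self):\n"
    "        return 1\n"
    "\n"
    "    def baz(self):\n"
    "        return 2\n"
)
compacted = compact_source("src/foo.py", source)
print(compacted)
```

The file body is gone after this step, which is why the text above stresses that this is pruning rather than compression.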

Command output (DISTILL):

Before: 100 lines of build output
After:  "exit=0\nBuild complete\nSuccessfully built abc123"

Error trace (DISTILL):

Before: 50-line traceback
After:  "ValueError: invalid literal for int()\nat foo.py:42"
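A traceback distiller can be sketched by keeping the final exception line and the innermost frame. The snippet assumes CPython's standard traceback layout; `distill_traceback` and its regex are illustrative and not taken from the project.

```python
import re

# Frame lines look like:   File "foo.py", line 42, in run
FRAME_RE = re.compile(r'^[ \t]*File "([^"]+)", line (\d+)', re.MULTILINE)


def distill_traceback(trace):
    frames = FRAME_RE.findall(trace)
    lines = [ln for ln in trace.strip().splitlines() if ln.strip()]
    if not frames or not lines:
        return trace  # not a recognizable traceback, keep as-is
    path, lineno = frames[-1]  # innermost frame is listed last
    return f"{lines[-1].strip()}\nat {path}:{lineno}"


trace = "\n".join([
    "Traceback (most recent call last):",
    '  File "main.py", line 10, in <module>',
    "    run()",
    '  File "foo.py", line 42, in run',
    '    int("abc")',
    "ValueError: invalid literal for int() with base 10: 'abc'",
])
distilled_trace = distill_traceback(trace)
print(distilled_trace)
```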

Relationship to busyBee-cpu

Honey-Comb applies the same principle as busyBee-cpu to a different problem:

| | busyBee-cpu | Honey-Comb |
|---|---|---|
| Problem | Tool selection in agent loops | Context depollution in agent loops |
| Principle | Most decisions are mechanical | Most depollution is mechanical |
| Classifier | Which of 4 actions to take | Which depollution strategy to apply |
| Resolver | Fill arguments from state | Execute extraction per content type |
| Escalation | Defer to LLM for reasoning | Defer to LLM for ambiguous content |

Both use the same architecture: TF-IDF + VotingClassifier on CPU, with deterministic resolvers/extractors, escalating to the LLM only when uncertain.

Dependencies

  • scikit-learn >= 1.4 — classifiers and pipelines
  • joblib >= 1.4 — model serialization
  • numpy >= 1.24 — numeric operations

License

MIT
